Intel Parallel Studio XE Evaluation Guide

June 8, 2017, 5:58 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® MPI Library 2018 Beta Release Notes for Linux* OS

≪ Previous: Mathematical Concepts and Principles of Naive Bayes

System Requirements and Prerequisites

To ensure successful installation, please review the release notes and verify that your system has the capability, capacity and prerequisites to install the product.

Find the desired version

Go to: Intel® Parallel Studio XE Try and Buy
Select the OS you need and click Download FREE Trial>
What is in each package?
- Windows*: Intel Parallel Studio XE Cluster Edition for Windows* (C++ and Fortran)
- Linux*: Intel Parallel Studio XE Cluster Edition for Linux* (C++ and Fortran)
- OS X* C++: Intel Parallel Studio XE Composer Edition for C++ OS X*
- OS X* Fortran: Intel Parallel Studio XE Composer Edition for Fortran OS X*
Note: although you are offered the Cluster edition you will be able to download and install a smaller, customized package

Complete the evaluation request form

You will be asked to supply your email address and some additional information.
After submitting the form you will receive a Registration Email from Intel.
Important: If you don’t find the email in your Inbox look for it in other folders, such as Promotions, Spam, etc.

Register for Priority Customer Support [Optional]

Registering for Evaluation does not create a full Intel account as no login id nor password are required to evaluate the product.
In the Registration Email you will find a link to create a full account.
Why should you create a full account?
- A full account will give you 30 days of Priority Customer Support
- You will be able to log into Intel Registration Center and manage your licenses
- This is a single sign-in account. It will enable you to access the Intel® Developer Zone, Online Support Center and other areas within Intel
Note: If you already have an Intel account you don’t need to create another one.

Download

Click the Download > link in the email to download the product.
You can choose from two download options:
- The Online Download option will launch the online installer. You will be able to install the product or create a customized package for later installation
- The Offline Download option will download the complete package
  Note: This package is typically big. The main advantage of this option is to download a package for an installation on a different OS

Downloading an older version

If you wish to install and evaluate an older version of the product see: How do I get an older version of an Intel Software Development Product for instructions.

Using the online installer to download a customized package

In the installer choose the option to download for later installation.
Proceed to select the components to download or use the default configuration.
Note: The full evaluation package is quite big as it encompasses the compilers, libraries, analyzers and cluster tools.
You can use the package to install the product on your system or on another computer, as desired.

Install

In the Online Installer choose the option Install to this computer.
OR use a previously downloaded package to install the product.
Proceed to select the components to install or choose the recommended settings.
Important: You do not require a serial number in order to install and evaluate the product.
- If you have not installed Intel Parallel System XE before, in the license activation screen, choose the “Evaluation” activation option.
- If the Installer finds a compatible license file on your system it will recommend a “License” activation.
Given the suite size the installation is expected to be quite lengthy. Please refrain from aborting mid-installation. If you cancel the installation, please let the rollback run its course.

Installing the product without Internet connection

If you install your product with no Internet connection you will need to use a “License” activation. If you don’t have a compatible license file already on your system you will need to create one. See: How do I get a license file for offline installation.

Installing the product on multiple systems

The product can be installed on multiple systems as defined by our EULA. If you work in a VM environment and need additional activation see: How do I release activation.

Start your evaluation

Once the installation is complete the “Getting Started” guide will open. If you are new to our products this is a good way to explore and get familiar with the compilers, libraries and tools. You can also find the guides on our site. See: Getting Started with Intel Parallel Studio XE.

Need support?

If you run into issues during installation or during the evaluation of the product, let us know. We want to hear from you and help you get the most out of your evaluation.

Check out our FAQs.
For peer questions and discussions, see our Developer Forums.
To report issues and seek help, please file a ticket at the Online Service Center.
Note: In order to get Priority Support make sure to register in the Intel Registration Center. Use the link in your email.

What’s next?

We, at the Intel, are continually working to improve your experience with our developer program. After your evaluation you will receive a Feedback Survey. We would greatly appreciate a few minutes of your time to provide us feedback on what we are doing well and how we can improve.

We hope you enjoyed your evaluation and would like to purchase one of Intel Software Development Products. Please see our Purchasing FAQ for additional information.

↧

Intel® MPI Library 2018 Beta Release Notes for Linux* OS

June 12, 2017, 12:00 am

Latest and popular articles on Intel Technologies

≫ Next: Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memory

≪ Previous: Intel Parallel Studio XE Evaluation Guide

Overview

Intel® MPI Library is a multi-fabric message passing library based on ANL* MPICH3* and OSU* MVAPICH2*.

Intel® MPI Library implements the Message Passing Interface, version 3.1 (MPI-3) specification. The library is thread-safe and provides the MPI standard compliant multi-threading support.

To receive technical support and updates, you need to register your product copy. See Technical Support below.

Product Contents

The Intel® MPI Library Runtime Environment (RTO) contains the tools you need to run programs including scalable process management system (Hydra), supporting utilities, and shared (.so) libraries.
The Intel® MPI Library Development Kit (SDK) includes all of the Runtime Environment components and compilation tools: compiler wrapper scripts (mpicc, mpiicc, etc.), include files and modules, static (.a) libraries, debug libraries, and test codes.

What's New

Intel® MPI Library 2018 Beta Update 1

Deprecated support for the IPM statistics format.

Intel® MPI Library 2018 Beta

Improved startup times for Hydra when using shm:ofi or shm:tmi.
Hard finalization is now the default.
The default fabric list is changed when Intel® Omni-Path Architecture is detected.
Removed support for the Intel® Xeon Phi™ coprocessor (code named Knights Corner).
Documentation is now online.

Intel® MPI Library 2017 Update 2

Added environment variables I_MPI_HARD_FINALIZE and I_MPI_MEMORY_SWAP_LOCK.

Intel® MPI Library 2017 Update 1

PMI-2 support for SLURM*, improved SLURM support by default.
Improved mini help and diagnostic messages, man1 pages for mpiexec.hydra, hydra_persist, and hydra_nameserver.
Deprecations:
- Intel® Xeon Phi™ coprocessor (code named Knights Corner) support.
- Cross-OS launches support.
- DAPL, TMI, and OFA fabrics support.

Intel® MPI Library 2017

Support for the MPI-3.1 standard.
New topology-aware collective communication algorithms (I_MPI_ADJUST family).
Effective MCDRAM (NUMA memory) support. See the Developer Reference, section Tuning Reference > Memory Placement Policy Control for more information.
Controls for asynchronous progress thread pinning (I_MPI_ASYNC_PROGRESS).
Direct receive functionality for the OFI* fabric (I_MPI_OFI_DRECV).
PMI2 protocol support (I_MPI_PMI2).
New process startup method (I_MPI_HYDRA_PREFORK).
Startup improvements for the SLURM* job manager (I_MPI_SLURM_EXT).
New algorithm for MPI-IO collective read operation on the Lustre* file system (I_MPI_LUSTRE_STRIPE_AWARE).
Debian Almquist (dash) shell support in compiler wrapper scripts and mpitune.
Performance tuning for processors based on Intel® microarchitecture codenamed Broadwell and for Intel® Omni-Path Architecture (Intel® OPA).
Performance tuning for Intel® Xeon Phi™ Processor and Coprocessor (code named Knights Landing) and Intel® OPA.
OFI latency and message rate improvements.
OFI is now the default fabric for Intel® OPA and Intel® True Scale Fabric.
MPD process manager is removed.
Dedicated pvfs2 ADIO driver is disabled.
SSHM support is removed.
Support for the Intel® microarchitectures older than the generation codenamed Sandy Bridge is deprecated.
Documentation improvements.

Key Features

MPI-1, MPI-2.2 and MPI-3.1 specification conformance.
Support for Intel® Xeon Phi™ processors (formerly code named Knights Landing).
MPICH ABI compatibility.
Support for any combination of the following network fabrics:
- Network fabrics supporting Intel® Omni-Path Architecture (Intel® OPA) devices, through either Tag Matching Interface (TMI) or OpenFabrics Interface* (OFI*).
- Network fabrics with tag matching capabilities through Tag Matching Interface (TMI), such as Intel® True Scale Fabric, Infiniband*, Myrinet* and other interconnects.
- Native InfiniBand* interface through OFED* verbs provided by Open Fabrics Alliance* (OFA*).
- Open Fabrics Interface* (OFI*).
- RDMA-capable network fabrics through DAPL*, such as InfiniBand* and Myrinet*.
- Sockets, for example, TCP/IP over Ethernet*, Gigabit Ethernet*, and other interconnects.
Support for the following MPI communication modes related to Intel® Xeon Phi™ coprocessor:
- Communication inside the Intel Xeon Phi coprocessor.
- Communication between the Intel Xeon Phi coprocessor and the host CPU inside one node.
- Communication between the Intel Xeon Phi coprocessors inside one node.
- Communication between the Intel Xeon Phi coprocessors and host CPU between several nodes.
(SDK only) Support for Intel® 64 architecture and Intel® MIC Architecture clusters using:
- Intel® C++/Fortran Compiler 14.0 and newer.
- GNU* C, C++ and Fortran 95 compilers.
(SDK only) C, C++, Fortran 77, Fortran 90, and Fortran 2008 language bindings.
(SDK only) Dynamic or static linking.

System Requirements

Hardware Requirements

Systems based on the Intel® 64 architecture, in particular:
- Intel® Core™ processor family
- Intel® Xeon® E5 v4 processor family recommended
- Intel® Xeon® E7 v3 processor family recommended
- 2nd Generation Intel® Xeon Phi™ Processor (formerly code named Knights Landing)
1 GB of RAM per core (2 GB recommended)
1 GB of free hard disk space

Software Requirements

Operating systems:
- Red Hat* Enterprise Linux* 6, 7
- Fedora* 23, 24
- CentOS* 6, 7
- SUSE* Linux Enterprise Server* 11, 12
- Ubuntu* LTS 14.04, 16.04
- Debian* 7, 8
(SDK only) Compilers:
- GNU*: C, C++, Fortran 77 3.3 or newer, Fortran 95 4.4.0 or newer
- Intel® C++/Fortran Compiler 15.0 or newer
Debuggers:
- Rogue Wave* Software TotalView* 6.8 or newer
- Allinea* DDT* 1.9.2 or newer
- GNU* Debuggers 7.4 or newer
Batch systems:
- Platform* LSF* 6.1 or newer
- Altair* PBS Pro* 7.1 or newer
- Torque* 1.2.0 or newer
- Parallelnavi* NQS* V2.0L10 or newer
- NetBatch* v6.x or newer
- SLURM* 1.2.21 or newer
- Univa* Grid Engine* 6.1 or newer
- IBM* LoadLeveler* 4.1.1.5 or newer
- Platform* Lava* 1.0
Recommended InfiniBand* software:
- OpenFabrics* Enterprise Distribution (OFED*) 1.5.4.1 or newer
- Intel® True Scale Fabric Host Channel Adapter Host Drivers & Software (OFED) v7.2.0 or newer
- Mellanox* OFED* 1.5.3 or newer
Virtual environments:
- Docker* 1.13.0
Additional software:
- The memory placement functionality for NUMA nodes requires the libnuma.so library and numactl utility installed. numactl should include numactl, numactl-devel and numactl-libs.

Known Issues and Limitations

The I_MPI_JOB_FAST_STARTUP variable takes effect only when shm is selected as the intra-node fabric.
ILP64 is not supported by MPI modules for Fortran* 2008.
In case of program termination (like signal), remove trash in the /dev/shm/ directory manually with:
```
rm -r /dev/shm/shm-col-space-*
```
In case of large number of simultaneously used communicators (more than 10,000) per node, it is recommended to increase the maximum numbers of memory mappings with one of the following methods:
- ```
echo 1048576 > /proc/sys/vm/max_map_count
```
- ```
sysctl -w vm.max_map_count=1048576
```
- disable shared memory collectives by setting the variable: I_MPI_COLL_INTRANODE=pt2pt
On some Linux* distributions Intel® MPI Library may fail for non-root users due to security limitations. This was observed on Ubuntu* 12.04, and could impact other distributions and versions as well. Two workarounds exist:
- Enable ptrace for non-root users with:
```
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
```
- Revert the Intel® MPI Library to an earlier shared memory mechanism, which is not impacted, by setting: I_MPI_SHM_LMT=shm
Ubuntu* does not allow attaching a debugger to a non-child process. In order to use -gdb, this behavior must be disabled by setting the sysctl value in /proc/sys/kernel/yama/ptrace_scope to 0.
Cross-OS runs using ssh from a Windows* host fail. Two workarounds exist:
- Create a symlink on the Linux* host that looks identical to the Windows* path to pmi_proxy.
- Start hydra_persist on the Linux* host in the background (hydra_persist &) and use -bootstrap service from the Windows* host. This requires that the Hydra service also be installed and started on the Windows* host.
The OFA fabric and certain DAPL providers may not work or provide worthwhile performance with the Intel® Omni-Path Fabric. For better performance, try choosing the OFI or TMI fabric.
Enabling statistics gathering may result in increased time in MPI_Finalize.
In systems where some nodes have only Intel® True Scale Fabric or Intel® Omni-Path Fabric available, while others have both Intel® True Scale and e.g. Mellanox* HCAs, automatic fabric detection will lead to a hang or failure, as the first type of nodes will select ofi/tmi, and the second type will select dapl as the internode fabric. To avoid this, explicitly specify a fabric that is available on all the nodes.
In order to run a mixed OS job (Linux* and Windows*), all binaries must link to the same single or multithreaded MPI library. The single- and multithreaded libraries are incompatible with each other and should not be mixed. Note that the pre-compiled binaries for the Intel® MPI Benchmarks are inconsistent (Linux* version links to multithreaded, Windows* version links to single threaded) and as such, at least one must be rebuilt to match the other.
Intel® MPI Library does not support using the OFA fabric over an Intel® Symmetric Communications Interface (Intel® SCI) adapter. If you are using an Intel SCI adapter, such as with Intel® Many Integrated Core Architecture, you will need to select a different fabric.
The TMI and OFI fabrics over PSM do not support messages larger than 2³² - 1 bytes. If you have messages larger than this limit, select a different fabric.
If a communication between two existing MPI applications is established using the process attachment mechanism, the library does not control whether the same fabric has been selected for each application. This situation may cause unexpected applications behavior. Set the I_MPI_FABRICS variable to the same values for each application to avoid this issue.
Do not load thread-safe libraries through dlopen(3).
Certain DAPL providers may not function properly if your application uses system(3), fork(2), vfork(2), or clone(2) system calls. Do not use these system calls or functions based upon them. For example, system(3), with OFED* DAPL provider with Linux* kernel version earlier than official version 2.6.16. Set the RDMAV_FORK_SAFE environment variable to enable the OFED workaround with compatible kernel version.
MPI_Mprobe, MPI_Improbe, and MPI_Cancel are not supported by the TMI and OFI fabrics.
You may get an error message at the end of a checkpoint-restart enabled application, if some of the application processes exit in the middle of taking a checkpoint image. Such an error does not impact the application and can be ignored. To avoid this error, set a larger number than before for the -checkpoint-interval option. The error message may look as follows:
```
[proxy:0:0@hostname] HYDT_ckpoint_blcr_checkpoint (./tools/ckpoint/blcr/
ckpoint_blcr.c:313): cr_poll_checkpoint failed: No such process
[proxy:0:0@hostname] ckpoint_thread (./tools/ckpoint/ckpoint.c:559):
blcr checkpoint returned error
[proxy:0:0@hostname] HYDT_ckpoint_finalize (./tools/ckpoint/ckpoint.c:878)
 : Error in checkpoint thread 0x7
```
Intel® MPI Library requires the presence of the /dev/shm device in the system. To avoid failures related to the inability to create a shared memory segment, make sure the /dev/shm device is set up correctly.
Intel® MPI Library uses TCP sockets to pass stdin stream to the application. If you redirect a large file, the transfer can take long and cause the communication to hang on the remote side. To avoid this issue, pass large files to the application as command line options.
DAPL auto provider selection mechanism and improved NUMA support require dapl-2.0.37 or newer.
If you set I_MPI_SHM_LMT=direct, the setting has no effect if the Linux* kernel version is lower than 3.2.
When using the Linux boot parameter isolcpus with an Intel® Xeon Phi™ processor using default MPI settings, an application launch may fail. If possible, change or remove the isolcpus Linux boot parameter. If it is not possible, you can try setting I_MPI_PIN to off.
In some cases, collective calls over the OFA fabric may provide incorrect results. Try setting I_MPI_ADJUST_ALLGATHER to a value between 1 and 4 to resolve the issue.

Technical Support

Every purchase of an Intel® Software Development Product includes a year of support services, which provides priority customer support at our Online Support Service Center web site, http://www.intel.com/supporttickets.

In order to get support you need to register your product in the Intel® Registration Center. If your product is not registered, you will not receive priority support.

↧

Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memory

June 7, 2017, 2:54 am

Latest and popular articles on Intel Technologies

≫ Next: Call for submissions: Intel HPC Developer Conference

≪ Previous: Intel® MPI Library 2018 Beta Release Notes for Linux* OS

This article will talk about Intel® VTune™ Amplifier I/O Analysis with Intel® Optane Memory. Several benchmark tools like crystaldisk, IOmeter, System Mark or PC Mark etc. are used to evaluate system I/O efficiency with usually a score number. For some power users, PC-gaming geeks might be satisfied with those numbers served for performance validation purpose. How about the further technical-depth information like slow I/O activities identification, detailed I/O queue depth visualization in timeline, I/O function APIs callstacks and even the correlation with other system metrics to give further debug or profiling information for a software developer? Software Developers need the clues to understand how I/O efficient his program performs. VTune tries to provide such insights with its new feature, Disk I/O Analysis Type.

A bit about I/O Performance metrics

First of all, there are some basics you might need to know; I/O Queue Depth, Read/Write Latency, I/O Bandwidth, they are the I/O metrics used to track I/O efficiency. I/O Queue Depth means how many I/O commands wait in a queue to be served. This queue depth (size) depends on application, driver, OS implementation or the definition of host controller interface’s spec., like AHCI or NVMe and etc.. Comparing to ACHI with a single queue design, NVMe has multiple queues design supports parallel operations.

Imagine that a software program issues multiple I/O requests pass through the framework, software libraries, VM, container, runtimes, OS’s I/O scheduler, driver to the host controller of I/O device. These requests can be temporarily delayed in any of these components due to different queue implementation and other reasons. Observing the change of system’s queue depth can help understand how busy system I/O utilization is and overall I/O access patterns. From OS perspective, high queue depth represents a state that system is working to consume pending I/O requests. Zero queue depth means I/O scheduler is idle. From Storage device perspective, high queue depth design shows the storage media or controller has the confidence to serve a bulk of I/O requests in a higher speed comparing to lower queue depth design. Read/Write Latency shows how quick storage device finishes or response I/O request. Its inverse also represents IOPS (I/O per second). As for I/O Bandwidth, it will be tightened to the capability offered by different host controller interfaces. For example, SATA 3.0 can achieve 600MB/s of the theoretical bandwidth and NVMe PCIe 3.0 x2 lanes can do ~1.87GB/s.

Optane+NAND SSD

We will expect the system I/O performance increase after adopting Intel® Optane Memory + Intel Rapid Storage technique.

Insight from VTune for a workload running on Optane enabled setup

IOAPI_time_ssdvsoptane [figure1]

The figure 1 shows two VTune results are based on a benchmark program, PCMark, running on “single SATA NAND SSD” vs “SATA NAND SSD + additional 16GB NVMe Optane module within IRST RAID 0 mode”. Besides the basics of VTune’s online help for Disk I/O analysis, you can also observe I/O APIs effective time by applying “Task Domain” grouping view. As VTune indicates, I/O API’s CPU time also gets improved with Optane’s acceleration. It make senses since most of I/O API calls are synchronous in this case and I/O media with Optane acceleration responses quickly.

Latency SSD vs Optane

[figure 2]

In figure 2, it shows how VTune measure the latency for single I/O operation. We compare 3^rd FileRead operation of the test#3(importing pictures to Windows Photo Gallery) of benchmark workload on both cases. It shows Optane+SSD can help nearly 5 times gain for this read operation speed in 300us vs 60us.

On linux target, VTune also provides the Page fault metric. Page fault event usually invokes disk I/O to handle page swapping. To avoid frequent Disk I/O caused by page fault events, the typical direction is to keep more pages in the memory instead swap pages back to the disk. Intel® Memory Drive Technology provides a solution to expand memory capacity and Optane provides the best proximity to memory’s speed. And that’s transparent to application and OS, it also mitigates the Disk I/O penalty to further increase the performance. One common mistake is that using asynchronous I/O can always help application’s I/O performance. Asynchronous I/O is to actually add more responsiveness back to the application because asynchronous I/O does not need to put CPU to wait. Putting CPU to wait is the case when synchronous I/O API is used but I/O operation is not finished.

With all that software design suggestions above, the extra performance solution is to upgrade your hardware to faster media. Intel® Optane is Intel’s edge non-volatile memory technology enabling memory-like performance at storage-like capacity and cost. VTune can even help to juice out more software performance by providing insight analysis.

Call for submissions: Intel HPC Developer Conference

June 13, 2017, 4:16 pm

Latest and popular articles on Intel Technologies

≫ Next: Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions

≪ Previous: Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memory

Please consider giving a talk, tutorial or presenting a poster at this year's Intel HPC Developer Conference (November 11-12, 2017 - just before SC17 in Denver).

Submissions will be reviewed and responded to in a rolling fashion - so submit soon! (Best to submit by July 20, but okay until August 18.)

Submit online: https://intelhpcdc2017cfa.hubb.me (full information on dates, topics, etc. is on that web site).

The prior Intel HPC Developer Conferences have been very well rated by attendees - and that is due to the high quality of speakers (talks tutorials, panels, etc.) that we have enjoyed. We are adding poster sessions this year to open up more discussions with attendees.

Technical talks of 30 minutes, Tutorials of 90, 120 or 180 minutes and Poster sessions submissions are encouraged. Topics range include Parallel Programming, AI (ML/HPDA), High Productivity Languages, Visualization (esp. Software Defined Visualization and In Situ Visualization), Enterprise and Systems.

We expect to have another great conference this year - and we know that rests on the high quality presenters. We look forward to your submissions. Feel free to drop me a note if you have any questions - or simply put in your proposal online, and put any questions in with your submission (we can talk!).

↧

Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions

June 13, 2017, 9:55 pm

Latest and popular articles on Intel Technologies

≫ Next: Before Salmi Games Can Make Bread, It Needs Some Jam

≪ Previous: Call for submissions: Intel HPC Developer Conference

Download Ceph configuration file [1.9KB]

Introduction

As solid-state drives (SSDs) become more affordable, cloud providers are working to provide high-performance, highly reliable SSD-based storage for their customers. As one of the most open source scale-out storage solutions, Ceph faces increasing demand from customers who wish to use SSDs with Ceph to build high-performance storage solutions for their clouds.

The disruptive Intel® Optane™ Solid State Drive based on 3D XPoint™ technology fills the performance gap between DRAM and NAND-based SSDs. At the same time, Intel® 3D NAND TLC is reducing the cost gap between SSDs and traditional spindle hard drives, making all-flash storage an affordable option.

This article presents three Ceph all-flash storage system reference designs, and provides Ceph performance test results on the first Intel Optane and P4500 TLC NAND based all-flash cluster. This cluster delivers multi-million IOPS with extremely low latency as well as increased storage density with competitive dollar-per-gigabyte costs. Click on the link above for a Ceph configuration file with Ceph BlueStore tuning and optimization guidelines, including tuning for rocksdb to mitigate the impact of compaction.

What Motivates Red Hat Ceph* Storage All-Flash Array Development

Several motivations are driving the development of Ceph-based all-flash storage systems. Cloud storage providers (CSPs) are struggling to deliver performance at increasingly massive scale. A common scenario is to build an Amazon EBS-like service for an OpenStack*-based public/private cloud, leading many CSPs to adopt Ceph-based all-flash storage systems. Meanwhile, there is strong demand to run enterprise applications in the cloud. For example, customers are adapting OLTP workloads to run on Ceph when they migrate from traditional enterprise storage solutions. In addition to the major goal of leveraging the multi-purpose Ceph all-flash storage cluster to reduce TCO, performance is an important factor for these OLTP workloads. Moreover, with the steadily declining price of SSDs and efficiency-boosting technologies like deduplication and compression, an all-flash array is becoming increasingly acceptable.

Intel® Optane™ and 3D NAND Technology

Intel Optane technology provides an unparalleled combination of high throughput, low latency, high quality of service, and high endurance. It is a unique combination of 3D XPoint™ Memory Media, Intel Memory and Storage Controllers, Intel Interconnect IP and Intel® software¹. Together these building blocks deliver a revolutionary leap forward in decreasing latency and accelerating systems for workloads demanding large capacity and fast storage.

Intel 3D NAND technology improves regular two-dimensional storage by stacking storage cells to increase capacity through higher density and lower cost per gigabyte, and offers the reliability, speed, and performance expected of solid-state memory³. It offers a cost-effective replacement for traditional hard-disk drives (HDDs) to help customers accelerate user experiences, improve the performance of apps and services across segments, and reduce IT costs.

Intel Ceph Storage Reference Architectures

Based on different usage cases and application characteristics, Intel has proposed three reference architectures (RAs) for Ceph-based all-flash arrays.

Standard configuration

Standard configuration is ideally suited for throughput optimized workloads that need high-capacity storage with good performance. We recommend using NVMe*/PCIe* SSD for journal and caching to achieve the best performance while balancing the cost. Table 1 describes the RA using 1x Intel® SSD DC P4600 Series as a journal or BlueStore* rocksdb write-ahead log (WAL) device, 12x up to 4 TB HDD for data, an Intel® Xeon® processor, and an Intel® Network Interface Card.

Example: 1x 1.6 TB Intel SSD DC P4600 as a journal, Intel® Cache Acceleration Software, 12 HDDs, Intel® Xeon® processor E5-2650 v4 .

Table 1. Standard configuration.

Ceph Storage Node configuration – Standard
CPU	Intel® Xeon® processor E5-2650 v4
Memory	64 GB
NIC	Single 10Gb E, Intel® 82599 10 Gigabit Ethernet Controller or Intel® Ethernet Controller X550
Storage	Data: 12 x 4 TB HDD Journal or WAL: 1x Intel® SSD DC P4600 1.6 TB Caching: P4600
Caching Software	Intel® Cache Acceleration Software 3.0, option: Intel® Rapid Storage Technology enterprise/MD4.3; open source cache-like bcache/flashcache

TCO-Optimized Configuration

This configuration provides the best possible performance for workloads that need higher performance, especially for throughput, IOPS, and SLAs with medium storage capacity requirements, leveraging a mixed of NVMe and SATA SSDs.

Table 2. TCO-optimized configuration

Ceph Storage node –TCO Optimized
CPU	Intel® Xeon® processor E5-2690 v4
Memory	128 GB
NIC	Dual 10GbE (20 GB), Intel® 82599 10 Gigabit Ethernet Controller
Storage	Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB or Intel DC SATA SSDs Journal or WAL: 1x Intel® SSD DC P4600 Series 1.6 TB

IOPS-Optimized Configuration

The IOPS-optimized configuration provided best performance (throughput and latency) with Intel Optane Solid State Drives as Journal (FileStore) and WAL device (BlueStore) for a standalone Ceph cluster.

All NVMe/PCIe SSD Ceph system
Intel Optane Solid State Drive for FileStore Journal or BlueStore WAL
NVMe/PCIe SSD data, Intel Xeon processor, Intel® NICs
Example: 4x Intel SSD P4500 4, 8, or 16 TB for data, 1x Intel® Optane™ SSD DC P4800X 375 GB as journal (or WAL and database), Intel Xeon processor, Intel® NICs.

Table 3. IOPS optimized configuration

*Ceph Storage node –IOPS optimized**
CPU	Intel® Xeon® processor E5-2699 v4
Memory	>= 128 GB
NIC	2x 40GbE (80 Gb), 4x Dual 10GbE (800 Gb), Intel® Ethernet Converged Network Adapter X710 family
Storage	Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB Journal or WAL : 1x Intel Optane SSD DC P4800X 375 GB

Notes

Journal: Ceph supports multiple storage back-end. The most popular one is FileStore, based on a filesystem (for example, XFS*) to store its data. In FileStore, Ceph OSDs use a journal for speed and consistency. Using SSD as a journal device will significantly improve Ceph cluster performance.
WAL: BlueStore is a new storage back-end designed to replace FileStore in the near future. It overcomes several limitations of XFS and POSIX* that exist in FileStore. BlueStore consumes raw partitions directly to store the data, but the metadata comes with an OSD, which will be stored in Rocksdb. Rocksdb uses a write-ahead log to ensure data consistency.
The RA is not a fixed configuration. We will continue to refresh it with latest Intel® products.

Ceph All-Flash Array performance

This section presents a performance evaluation of the IOPS-optimized configuration based on Ceph BlueStore.

System configuration

The test system described in Table 4 consisted of five Ceph storage servers, each fitted with two Intel® Xeon® processors E5-2699 v4 CPUs and 128 GB memory, plus 1x Intel® SSD DC P3700 2TB as a BlueStore WAL device, and 4x TB Intel® SSD DC P3520 2TB as a data drive. 1x Intel® Ethernet Converged Network Adapters X710 NIC 40 Gb NIC, two ports bonding together through bonding mode 6, used as separate cluster and public networks for Ceph, make up the system topology described in Figure 1. The test system also consisted of 5 client nodes, each fitted with two Intel Xeon processors E5-2699 v4, 64 GB memory, and 1x Intel Ethernet Converged Network Adapters X710 NIC 40 Gb NIC, two ports bonding together through bonding mode 6.

Ceph 12.0.0 (Luminous dev) was used, and each Intel SSD DC P3520 Series runs 4 OSD daemons. The rbd pool used for the testing was configured with 2 replica.

Table 4. System configuration.

Ceph Storage node – IOPS optimized
CPU	Intel® Xeon® processor E5-2699 v4 2.20 GHz
Memory	128 GB
NIC	1x 40 G Intel® Ethernet Converged Network Adapters X710, two ports bonding mode 6
Disks	1x Intel® SSD DC P3700 (2T) + 4x Intel® SSD DC P3520 2 TB
Software configuration	Ubuntu* 14.04, Ceph 12.0.0

Diagram of cluster topology
Figure 1. Cluster topology.

Testing methodology

To simulate a typical usage scenario, four test patterns were selected using fio with librbd. It consisted of 4K random read and write, and 64K sequential read and write. For each pattern, the throughput (IOPS or bandwidth) was measured as performance metrics with the number of volumes scaling; the volume size was 30 GB. To get stable performance, the volumes were pre-allocated to bypass the performance impact of thin-provisioning. OSD page cache was dropped before each run to eliminate page cache impact. For each test case, fio was configured with a 100 seconds warm up and 300 seconds data collection. Detailed fio testing parameters are included as part of the software configuration.

Performance overview

Table 5 shows a promising performance after tuning on this five-node cluster. 64K sequential read and write throughput is 5630 MB/s and 4200 MB/s respectively (maximums with the Intel Ethernet Converged Network Adapters X710 NIC in bonding mode 6). 4K random read throughput is 1312K IOPS with 1ms average latency, while 4 KB random write throughput is 331K IOPS with 4.8 ms average latency. The performance measured in the testing was roughly within expectations, except for a regression of 64K sequential write tests compared with previous Ceph releases, which requires further investigation and optimization.

Table 5. Performance overview.

Pattern	Throughput	Average Latency
64KB Sequential Write	4200 MB/s	18.9ms
64KB Sequential Read	5630 MB/s	17.7ms
4KB Random Write	331K IOPS	4.8ms
4KB Random Read	1312K IOPS	1.2ms

Scalability tests

Figures 2 to 5 show the graph of throughput for 4K random and 64K sequential workloads with different number of volumes, where each fio was running in the volume with a queue depth of 16.

Ceph demonstrated excellent 4K random read performance on the all-flash array reference architecture, as the total number of volumes increased from 1 to 100, the total 4K random read IOPS peaked around 1310 K IOPS, with an average latency around 1.2 ms. The total 4K random write IOPS peaked around 330K IOPS, with an average latency around 4.8 ms.

graphic of results for 4K Random read performance
Figure 2. 4K Random read performance.

graphic of results for 4K random write performance load line
Figure 3. 4K random write performance load line.

For 64K sequential read and write, as the total number of volumes increased from 1 to 100, the sequential read throughput peaked around 5630 MB/s, while sequential write peaked around 4200 MB/s. The sequential write throughput was lower than the previous Ceph release (11.0.2). It requires further investigation and optimization; stay tuned for further updates.

graphic of results for 64K sequential read throughput
Figure 4. 64K sequential read throughput

graphic of results for 64K sequential write throughput
Figure 5. 64K sequential write throughput

Latency Improvement with Intel® Optane™SSD

Fig 6 shows the latency comparison for 4K random write workloads with 1x Intel® SSD DC P3700 series 2.0 TB and 1x Intel Optane SSD DC P4800X series 375 GB drive as rocksdb & WAL device. The results proved with the Intel Optane SSD DC P4800X series 375 GB SSD as rocksdb and WAL drive in Ceph BlueData, the latency was significantly reduced: a 226% reduction in 99.99% latency.

graphic of results for 4K random read and 4K random write latency comparison
Figure 6. 4K random read and 4K random write latency comparison

Summary

Ceph is one of most open source scale-out storage solutions, and there is growing interest among Cloud providers in building Ceph-based high-performance all-flash array storage solutions. We proposed three different reference architecture configurations targeting for different usage scenarios. The results for testing that simulated different workload pattern demonstrated that a Ceph all-flash system could deliver very high performance with excellent latency.

Software configuration

Fio configuration used for the testing

Take 4K random read for example.

[global]
    direct=1
    time_based
[fiorbd-randread-4k-qd16-30g-100-300-rbd]
    rw=randread
    bs=4k
    iodepth=16
    ramp_time=100
    runtime=300
    ioengine=rbd
    clientname=${RBDNAME}
    pool=${POOLNAME}
    rbdname=${RBDNAME}
    iodepth_batch_submit=1
    iodepth_batch_complete=1
    norandommap

This sample source code is released under the Intel Sample Source Code License Agreement.

↧

Before Salmi Games Can Make Bread, It Needs Some Jam

June 10, 2017, 8:27 am

Latest and popular articles on Intel Technologies

≫ Next: Intel’s Virtual Reality Director Knows the Future (Hint: It’s Not About Headsets)

≪ Previous: Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions

The original article is published by Intel Game Dev on VentureBeat*: Before Salmi Games can make bread, it needs some jam. Get more game dev news and related topics from https://venturebeat.com/category/intel-game-dev/Intel on VentureBeat.

Glowing colorful geometric shapes moving against a black screen

Presented by Intel

Code jams have become incredibly popular, often gathering dozens, hundreds, or even thousands of programmers to innovate, collaborate, and compete in relatively quick coding endeavors. In a similar form, game jams have sprung up as a way for game makers to conceive and create a viable game, sometimes in as little as 24 hours.

The game jam concept was well known to Yacine Salmi and his collaborator Stefan Hell. In fact, it’s how they got to know each other. When they decided to come up with their own game ideas, they studied titles they liked and then held two-man brainstorming sessions that they treated like internal game jams.

Image of Stefan Hell and Yacine Salmi standing together in an outdoor area — Above: Stefan Hell (left) and Yacine Salmi of Salmi Games

What they generated from those exercises became the seed for the creation of a viable game-development studio called Salmi Games. Last year, the small studio released Ellipsis, an “avoid-’em-up” title that’s reminiscent of Geometry Wars. Initially released on mobile devices, this past January PC version was launched via Steam*.

At their fingertips

Salmi — who was born in America, but is currently living in Munich, Germany — started an umbrella company in 2013 that enabled him to make a living doing freelance coding and that funded his desire to make games on the side.

“I had previously worked in the game industry for 10 years, but this was my attempt at doing the indie life while paying my bills,” Salmi says. “I had previously done another indie company, but it didn’t go as well. I put all of my eggs in one basket and it just fell apart in the end. [This time], I wanted to build something sustainable.”

The first Salmi game jams were intended to explore how touch could be used to control a game. With the touch concept, it made sense to target mobile devices. But touch-controlled games had issues, and Salmi wanted to come up with a way around that.

Glowing colorful geometric shapes moving against a black screen

“The main reason people don’t build touch-controlled games is because your finger or your hand tend to hide the action,” Salmi explains. “But we really liked the concept we came up with, so we decided to develop it further and work around the limitation presented by the player’s hand.

“We decided to build out levels that were very large and sparse, so you’d have time to see your objectives, see your enemies, and move around. It sort of became a dance with your hand and your fingers.”

The game did well, but Salmi says they hoped to bring it to PC, which was an uncommon path for game software. Not wanting to just port the game over to PC, they devoted time to “do a proper PC version.”

“We realized it would work with a mouse…but we didn’t just want to do a port,” Salmi explains. “We added content, we added a level editor, we redid all of our assets. We really tuned it for the PC…we made sure the game ran on every type of device, and that’s where the Intel optimization tools came in handy.”

The game not only did well on PC, it won Game of the Year, as well as Best Action Game, in the 2016 Intel® Level Up Game Developer Contest.

Glowing colorful geometric shapes moving against a black screen

The next game could be a smash

Not content to sit back on Ellipsis’ success, Salmi and Hell are exploring new and ambitious product ideas. They’re now pursuing a game with the working title Late For Work, a virtual-reality (VR) game that plays like the old arcade game Rampage, where a King Kong-like gorilla scales skyscrapers and tries to smash them to smithereens, all the while avoiding the humans seeking to take him out.

An early concept includes a multiplayer mode that Salmi hopes will make it a “social VR game.” One person will wear the headset and play the gorilla, while the other two players use gamepads on a PC or jump in with their phones, taking over planes, tanks, and cars. Then they’ll all switch places, so everyone gets a chance to be the gorilla in VR gear.

With such a high-reaching concept, Salmi says that they’re looking at outside funding and considering bringing in more coders to help. Until recently, he and Hell used to work out of their respective Munich homes, but the pair recently moved into an office… and optimistically has gotten one with room for five.

“I’m hoping in the next three or four months we can ramp up… add an artist, add a programmer, and probably we’ll do some more outsourcing on the audio and the animation sides.”

Such are the woes of becoming successful and growing your company.

And all of that from a few two-man game jams.

↧

Intel’s Virtual Reality Director Knows the Future (Hint: It’s Not About Headsets)

June 12, 2017, 10:19 am

Latest and popular articles on Intel Technologies

≫ Next: Managing Amazon Greengrass Core Devices Remotely with Wind River Helix* Device Cloud

≪ Previous: Before Salmi Games Can Make Bread, It Needs Some Jam

The original article is published by Intel Game Dev on VentureBeat*: Intel’s VR director knows the future (Hint: It’s not about headsets) Get more game dev news and related topics from Intel on VentureBeat.

Shutterstock

Presented by Intel

Virtual reality (VR) is a big-buzz topic in gaming today, but a lot of questions remain about what direction the genre will take, and how it will evolve. Attendees of the upcoming GamesBeat Summit 2017 will get a highly educated perspective on VR’s future thanks to a presentation from Kim Pallister, director of the Intel® Virtual Reality Center of Excellence in Oregon.

The Virtual Reality Center is part of the Intel Client Computing Group, which, according to Pallister, “drives the business of selling Intel silicon and solutions into PCs — desktop and notebook PCs — the bread-and-butter business for us.” Within that overarching mission, Pallister and his team focus on how PCs will handle VR applications, and still provide users with the best performance possible.

“It’s up to us to understand what we need to be doing to these PCs over time; what we need to do to our roadmap, and to the PCs that come out, as VR becomes another usage for this very versatile platform,” Pallister explains.

“Part of our role is looking at how requirements are affected,” he continues. “Part of it is working with partners like Valve*, HTC* and Oculus*, along with others, on where their roadmaps are going — and making sure that we’re aligned. Similarly, we’re working with Microsoft* on preparing for the Windows* mixed-reality effort they have coming, and the PC-connected headsets that they’re helping their partners bring to market.”

Image of Kim Pallister — Above: Intel’s Kim Pallister Image Credit: Intel

Pallister adds that the Virtual Reality Center is “doing research and development on various technologies to help move the industry forward.” That effort comes in different forms: through best-practice software techniques; sample apps and methods for getting the most out of the CPU; and improving the user experience — such as how a VR headset could be used wirelessly, so the user doesn’t need to be tethered to the PC.

From fiction to real life

The theme for GamesBeat Summit 2017 is “How games, sci-fi, and tech create real-world magic.” Pallister’s talk will be geared toward how VR — as well as augmented reality (AR) and mixed-reality experiences — will change in the near future, based on how hardware and software will change. He’ll also address what game developers will need to do to stay on the bleeding edge of this swiftly evolving technology, which is still in its embryonic stages.

“A lot of the talk has been about this intersection between science fiction and where the VR industry is heading,” Pallister says, “So I’m looking at what the potential technologies are on the near-to-medium time horizon — not just from Intel, but from the industry at large — and what they might mean to the content and experiences that get developed there. I’m also looking at what some of the challenges will be in designing those experiences — in terms of game design, and how to steer the user experience. There’s a pretty rich vein of conversation that can be had there.

“Everybody in both the hardware and software spaces is learning as they iterate — there’s a lot of rapid evolution — and some of these technologies will take time for people to figure out how to wield those tools in effective ways.”

Who’s driving?

Pallister notes that the PC does particularly well when it’s the center of a fresh, rapidly evolving category, such as VR is now. This fast pace can create a chaotic situation at times, with numerous companies and individuals driving innovation, and trying to forge their way through this somewhat uncharted territory. Out of such chaos, however, can come a sense of order — and it’s order driven not by one self-designated, perhaps restrictive authority figure, with everyone else being forced to play “follow the leader”, but by discovery, progress, and a sense of community (even if there’s ultimately competition among those community members).

“Especially in an early space like VR, one of the advantages that the PC platform brings is that it’s an open ecosystem,” Pallister says. “In a space where nobody knows what the future holds, you’re far better off where lots of people can make different choices and different bets, and try different things, as opposed to having a single vendor that says: ‘We will decide what the future is, and you will all follow us.’ ”

But who will be the “lots of people” that Pallister says will push the VR Revolution?

“Not just Intel, but the players in the industry — including the vast majority of hardware and platforms players — all recognize that the developers are going to be the ones figuring out a lot of this stuff. And so the more we can give them flexibility, and give them tools to work with, the more they’re going to help guide us on this path.”

↧

Managing Amazon Greengrass Core Devices Remotely with Wind River Helix* Device Cloud

June 12, 2017, 3:34 pm

Latest and popular articles on Intel Technologies

≫ Next: Development Strategy Turns Players into Robot Builders

≪ Previous: Intel’s Virtual Reality Director Knows the Future (Hint: It’s Not About Headsets)

IoT devices come in many flavors these days from generic gateways to specialized devices. Using Intel® IoT Gateway Technology, Ubuntu* 16, and Wind River Helix* Device Cloud(HDC), remote management of your IoT system just became simple. There are many cloud service providers to choose from these days. Amazon has recently released a new IoT solution that supports Intel® IoT Gateway Technology called Amazon Greengrass Core. This tutorial will show you a method to restart your Amazon Greengrass Core Device remotely using HDC.

Prerequisites

Install Ubuntu 16 (https://help.ubuntu.com/16.04/installation-guide/)
Sign up for an Amazon Web Services (AWS)* Account (https://aws.amazon.com/)
Install Amazon Greengrass Core (http://docs.aws.amazon.com/greengrass/latest/developerguide/gg-gs.html)
Sign up for a HDC Trial Account (https://www.windriver.com/evaluations/)
Download HDC Agent (https://windshare.windriver.com/)
Install HDC Agent (https://knowledge.windriver.com/en-us/000_Products/040/050/020/000_Wind_River_Helix_Device_Cloud_Getting_Started/060/000)

Tutorial

1. Login into the HDC Portal at https://www.helixdevicecloud.com and Select the Device.

2. Remote login into Ubuntu 16 Device.

3. Stop and Start Amazon Greengrass Core.

Summary

This tutorial demonstrated how to restart an Amazon Greengrass Core Device remotely using Intel® IoT Gateway Technology and Helix Device Cloud. Now it is easy to manage your IoT solutions after deployment.

About the Author

Mike Rylee is a Software Engineer at Intel Corporation with a background in developing embedded systems and apps for Android*, Windows*, iOS*, and Mac*. He currently works on Internet of Things projects.

↧

Development Strategy Turns Players into Robot Builders

June 14, 2017, 11:09 am

Latest and popular articles on Intel Technologies

≫ Next: CPUs are set to dominate high end visualization

≪ Previous: Managing Amazon Greengrass Core Devices Remotely with Wind River Helix* Device Cloud

The original article is published by Intel Game Dev on VentureBeat*: Freejam Games’ development strategy turns players into robot builders. Get more game dev news and related topics from Intel on VentureBeat.

Freejam Games

Presented by Intel

A few years ago, Mark Simmons was toiling away at game development jobs at a work-for-hire contract studio. He enjoyed that he was working with friends, but he wasn’t excited by the restrictions imposed by such a situation: tight budgets, tough time schedules, and, most of all, that he was working on other people’s games.

On the side, he started working on a prototype that centered on the ability for players to contribute to the project via user-generated content (UGC). Inspired by Eric Ries’ book The Lean Startup, Simmons felt that a studio could be founded with just a few good people doing the main work, aided by UGC. He believed that UGC would provide “the means to allow a small developer to make games that were much larger than the small group could make on their own — to harness the power of the community.”

Simmons’ physics degree led him to put together a prototype of blocks — with inspiration from what Minecraft* accomplished — that could be placed together and would interact with the world properly. That prototype became a demo he played with his development friends, and an investor friend eventually got involved with a helping hand.

“[He] gave us this opportunity to build our own company,” Simmons says. “We all quit our jobs and formed this new company on the premise that if we weren’t successful in 18 months, it’d be dead.”

So, in April 2013, Freejam was born in Portsmouth, UK, with Simmons as the CEO/Game Director, and four developer friends making up the rest of the team. The basic prototype Simmons had put together provided the foundation for what they’d be working on going forward.

Share and share alike

From there, the group kept building onto the product and adding more functionality. Initially, there was the ability to connect blocks together, put wheels on the whole thing, and then drive it around a small area. Then the team added the ability to pick up green crystals that served as a form of currency, which could be used to buy more items.

Most studios work toward constructing a finished product before they endeavor to sell it to consumers, but Freejam’s intent was for players to create content that would make for a bigger game. That led them to release the prototype to the world to get feedback, and build up a community.

“We wanted to learn as much as we could learn,” Simmons says, “and we felt like we would learn more if we were bold and just put it out there in a raw form to develop it with the community.”

Screenshot of a battle bot being built in a 3d environment — Above: Building a battle bot Image Credit: Freejam

The concept, Simmons says, was to “build, measure, learn” by getting the game into people’s hands, analyzing the subsequent data, implementing new features, releasing the update to the community, analyzing the data… Lather, rinse, repeat. It wasn’t making money for them, as it was a free-to-play project, but, at the same time, Simmons says they didn’t have the strategy to build a complete product and expect it to be a blockbuster hit.

“We think it’s crazy to spend three years working on a title, and then launch it — and then hope that it’s good,” Simmons explains. “Obviously, for some developers that works really well, and there are some huge success stories with that approach. But as an indie, you haven’t got the brute-force money backing you to be able to make, say, an Overwatch, where it’s just so beautiful and so polished and so amazing that it’s just better than everything else.

“So, what you’ve got to do is innovate, and if you’re innovating, you’re trying to do something fundamentally different from everyone else. And there’s an inherent risk in trying to do something different from everyone else, because your idea may just suck and the audience may not go for it.”

Luckily, that wasn’t an issue. Simmons’s prototype became Robocraft, which itself became a much bigger, feature-filled shoot-’em-up. It’s still essentially a free-to-play product, but with in-app purchases — such as the ability to buy salvage crates to get more items, or a “membership” that brings some benefits. The benefits, however, won’t make you ultra-powerful, causing an imbalance among players in the community.

“We tried really hard to make sure the game is not pay-to-win in any way, and it’s fair on the monetization side, so the prices are honest, and, ultimately, anyone who’s playing the game for free can get everything within the game in a reasonable amount of time,” Simmons says. “We try to make sure it’s pretty fair.”

Make or break

Freejam is like any other developer in that it has faced — and continues to face — issues around producing its game. Simmons notes, for instance, that the varied, regularly changing PC specs are a constant challenge. It’s tough to make a game that’ll satisfactorily play on everyone’s computer.

Variety of robots team up to shot at adversary in a 3d environment — Above: Teaming up in the third-person-shooter Robocraft Image Credit: Freejam

Also, while Robocraft’s ongoing iteration and revision means the game continues to grow (a good thing), sometimes a change that’s made doesn’t sit well with everyone in the game’s player base (a potentially bad thing).

“We’ve always been very open to changing the game if we feel a part of it is not working,” Simmons says. “Inevitably, you get some players that love the game the way it was, and where you’re constantly changing the game in fairly significant ways — and we’ve probably changed our game much more than most would after its launch — that comes with a certain amount of friction within the existing community. They get tired of the change, or resist the change.

“You get this constant tension. [On one hand, you have] new players who are coming at it for the first time — and [you’re seeing] it’s a better game, because they’re hanging around for longer and they’re telling more of their friends and leaving more positive reviews. On the other hand, you have this older group of users who’ve been playing it since Day One, and they remember a certain point in time, which was their favorite point-in-time with the development, and the change that’s been made isn’t a good one.”

It’s a battle that developers regularly need to fight: Do you add a new feature or alter the game for what you think is the better, at the risk of upsetting your existing base of long-time players? Or do you always cater to the veterans, running the risk that you might make it harder for newbies to engage with your game? Fortunately, Freejam — which has grown from its original five developers to a staff of 40 now — has 12-million registered players, amassed over the last three-plus years, to give the studio the vital feedback needed to make the right choices.

Image of the July 2016 Freejam team standing on a dock by the water — Above: The Freejam team has grown from its original five Image Credit: Freejam

↧

CPUs are set to dominate high end visualization

June 19, 2017, 11:32 am

Latest and popular articles on Intel Technologies

≫ Next: Configure Open vSwitch* with Data Plane Development Kit on Ubuntu Server* 17.04

≪ Previous: Development Strategy Turns Players into Robot Builders

Carson Brownlee, Intel. It is certainly provocative to say that CPUs will dominate any part of visualization - but I say it with confidence that the data supports why this is happening. The primary drivers are (1) data sizes, (2) minimizing data movement, and (3) ability to change to O(n log n) algorithms. Couple that with the ultra-hot topic of "Software Defined Visualization" that makes these three things possible - and you have a lot to consider about how the world is changing.

Of course, what is "high end" today often becomes common place over time... so this trend may affect us all eventually. It's at least worth understanding the elements at play.

At ISC17, in Germany, this week (June 19-21) Intel is demoing (and selling) their vision of a “dream machine” for doing software defined visualization with a special eye towards in situ visualization development. Jim Jeffers, Intel, and friends are demonstrating it at ISC'17 in Germany, and they will be at SIGGRAPH'17 too. The "dream machine" can support visualization of data sets up to 1.5TB in size. They designed it to address the needs of the scientific visualization and professional rendering markets.

Photo credit (above): Asteroid Deep Water Impact Analysis; Data Courtesy: John Patchett, Galen Glisner per Los Alamos National Laboratory tech report LA-UR-17-21595. Visualization: Carson Brownlee, Intel.

With Jim's help, I wrote an article about how more information about how CPUs now offer higher performance and a lower cost than competing GPU-based solutions for the largest visualization tasks. The full article is posted with coverage at TechEnablement site.

In the full article, aside from my writing about the trend - I do provide links to technical papers the show this trend towards CPUs as the preferred solution for visualization of large data (really really big), as well as links to conferences, and links about the "visualization dream machine" (how I describe it, not what Intel calls it officially).

Dream Machine for Software Defined Visualization

Photo: Intel/Colfex Visualization "Dream" Machine

↧

Configure Open vSwitch* with Data Plane Development Kit on Ubuntu Server* 17.04

June 13, 2017, 2:28 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® Software Guard Extensions (Intel® SGX) Part 9: Power Events and Data Sealing

≪ Previous: CPUs are set to dominate high end visualization

Overview

In this article, we will be configuring Open vSwitch* with Data Plane Development Kit (OVS-DPDK) on Ubuntu Server* 17.04. With the new release of this package, OVS-DPDK has been updated to use the latest release of both the DPDK (v16.11.1) and Open vSwitch (v2.6.1) projects. We took it for a test drive and were impressed with how seamless and easy it is to use OVS-DPDK on Ubuntu*.

We configured OVS-DPDK with two vhost-user ports and allocated them to two virtual machines (VMs). We then ran a simple iperf3* test case. The following diagram captures the setup.

Test-Case Configuration

Installing OVS-DPDK using Advanced Packaging Tool* (APT*)

To install OVS-DPDK on our system, run the following commands. Also, we will update ovs-vswitchd to use the ovs-vswitchd-dpdk package.

sudo apt-get install openvswitch-switch-dpdk
sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch
-dpdk/ovs-vswitchd-dpdk

Then restart the ovs-vswitchd service with the following command to use the DPDK:

sudo systemctl restart openvswitch-switch.service

Configuring Ubuntu Server* 17.04 for OVS-DPDK

The system we are using in this demo is a 2-socket, 22 cores per socket, Intel® Hyper-Threading Technology (Intel® HT Technology) enabled server, giving us 88 logical cores total. The CPU model used is an Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz. To configure Ubuntu for optimal use of OVS-DPDK, we will change the GRUB* command-line options that are passed to Ubuntu at boot time for our system. To do this we will edit the following config file:

/etc/default/grub

Change the setting GRUB_CMDLINE_LINUX_DEFAULT to the following:

 GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048 iommu=pt intel_iommu=on isolcpus=1-21,23-43,45-65,67-87"

This makes GRUB aware of the new options to pass to Ubuntu during boot time. We set isolcpus so that the Linux* scheduler would only run on two physical cores. Later, we will allocate the remaining cores to the DPDK. Also, we set the number of pages and page size for hugepages. For details on why hugepages are required, and how they can help to improve performance, please see the explanation in the Getting Started Guide for Linux on dpdk.org.

Note: The isolcpus setting varies depending on how many cores are available per CPU.

Also, we will edit /etc/dpdk/dpdk.conf to specify the number of hugepages to reserve on system boot. Uncomment and change the setting NR_1G_PAGES to the following:

NR_1G_PAGES=8

Depending on your system memory size, you may increase or decrease the number of 1G pages.

After both files have been updated run the following commands:

sudo update-grub
sudo reboot

A reboot will apply the new settings. Also during the boot enter the BIOS and enable:

- Intel® Virtualization Technology (Intel® VT-x)

- Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d)

Once logged back into your Ubuntu session we will create a mount path for our hugepages:

sudo mkdir -p /mnt/huge
sudo mkdir -p /mnt/huge_2mb
sudo mount -t hugetlbfs none /mnt/huge
sudo mount -t hugetlbfs none /mnt/huge_2mb -o pagesize=2MB
sudo mount -t hugetlbfs none /dev/hugepages

To ensure that the changes are in effect, run the commands below:

grep HugePages_ /proc/meminfo
cat /proc/cmdline

If the changes took place, your output from the above commands should look similar to the image below:

Configuring OVS-DPDK Settings

To initialize the ovs-vsctl database, a one-time step, we will run the command ‘sudo ovs-vsctl --no-wait init’. The OVS database will contain user set options for OVS and the DPDK. To pass in arguments to the DPDK we will use the command-line utility as follows:

‘sudo ovs-vsctl ovs-vsctl set Open_vSwitch . <argument>’.

Additionally, the OVS-DPDK package relies on the following config files:

/etc/dpdk/dpdk.conf – Configures hugepages

/etc/dpdk/interfaces – Configures/assigns network interface cards (NICs) for DPDK use

For more information on OVS-DPDK, unzip the following files:

/usr/share/doc/openvswitch-common/INSTALL.DPDK.md.gz
OVS DPDK install guide
/usr/share/doc/openvswitch-common/INSTALL.DPDK-ADVANCED.md.gz
Advanced OVS DPDK install guide

Next, we will configure OVS to use DPDK with the following command:

sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

Once the OVS is set up to use DPDK, we will change one OVS setting, two important DPDK configuration settings, and bind our NIC devices to the DPDK.

DPDK Settings

dpdk-lcore-mask: Specifies the CPU cores on which dpdk lcore threads should be spawned. A hex string is expected.
dpdk-socket-mem: Comma-separated list of memory to preallocate from hugepages on specific sockets.

OVS Settings

pmd-cpu (poll mode drive-mask: PMD (poll-mode driver) threads can be created and pinned to CPU cores by explicitly specifying pmd-cpu-mask. These threads poll the DPDK devices for new packets instead of having the NIC driver send an interrupt when a new packet arrives.

The following commands are used to configure these settings:

sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0xfffffbffffefffffbffffe
sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
sudo ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=1E0000000001E

For dpdk-lcore-mask we used a mask of 0xfffffbffffefffffbffffe to specify the CPU cores on which dpdk-lcore should spawn. In our system, we have the dpdk-lcore threads spawn on all cores except cores 0, 22, 44, and 66. Those cores are reserved for the Linux scheduler. Similarly, for the pmd-cpu-mask, we used the mask 1E0000000001E to spawn four pmd threads for non-uniform memory access (NUMA) Node 0, and another four pmd threads for NUMA Node 1. Lastly, since we have a two-socket system, we allocate 1 GB of memory per NUMA Node; that is, “1024, 1024”. For a single-socket system, the string would just be “1024”.

Creating OVS-DPDK Bridge and Ports

For our sample test case, we will create a bridge and add two DPDK vhost-user ports. To create an OVS bridge and two DPDK ports, run the following commands:

sudo ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
sudo ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
sudo ovs-vsctl add-port br0 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser

To ensure that the bridge and vhost-user ports have been properly set up and configured, run the command:

sudo ovs-vsctl show

If all is successful you should see output like the image below:

Binding Devices to DPDK

To bind your NIC device to the DPDK you must run the dpdk-devbind command. For example, to bind eth1 from the current driver and move to use vfio-pci driver, run:dpdk-devbind --bind=vfio-pci eth1.To use the vfio-pci driver, run modsprobe to load it and its dependencies.

This is what it looked like on my system, with 4 x 10 Gb interfaces available:

sudo modprobe vfio-pci
sudo dpdk-devbind --bind=vfio-pci ens785f0
sudo dpdk-devbind --bind=vfio-pci ens785f1
sudo dpdk-devbind --bind=vfio-pci ens785f2
sudo dpdk-devbind --bind=vfio-pci ens785f3

To check whether the NIC cards you specified are bound to the DPDK, run the command:

sudo dpdk-devbind --status

If all is correct, you should have an output similar to the image below:

Using DPDK vhost-user Ports with VMs

Creating VMs is out of the scope of this document. Once we have two VMs created (in this example, virtual disks us17_04vm1.qcow2 and us17_04vm2.qcow2), the following commands show how to use the DPDK vhost-user ports we created earlier.

Ensure that the QEMU* version on the system is v2.2.0 or above, as discussed under “DPDK vhost-user Prerequisites” in the OVS DPDK INSTALL GUIDE on https://github.com/openvswitch.

sudo qemu-system-x86_64 -m 1024 -smp 4 -cpu host -hda /home/user/us17_04vm1.qcow2 -boot c -enable-kvm -no-reboot -net none -nographic \
-chardev socket,id=char1,path=/run/openvswitch/vhost-user1 \
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 \
-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc \
-virtfs local,path=/home/user/iperf_debs,mount_tag=host0,security_model=none,id=vm1_dev

sudo qemu-system-x86_64 -m 1024 -smp 4 -cpu host -hda /home/user/us17_04vm2.qcow2 -boot c -enable-kvm -no-reboot -net none -nographic \
-chardev socket,id=char2,path=/run/openvswitch/vhost-user2 \
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce \
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 \
-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc \
-virtfs local,path=/home/user/iperf_debs,mount_tag=host0,security_model=none,id=vm2_dev \

DPDK vhost-user inter-VM Test Case with iperf3*

In the previous step, we configured two VMs, each with a Virtio* NIC that is connected to the OVS-DPDK bridge.

Configure the NIC IP address on both VMs to be on the same subnet. Install iperf3 from http://software.es.net/iperf, and then run a simple network test case. On one VM, start iperf3 in server mode iperf3 -s and run the iperf3 client on another VM, iperf3 –c server_ip. The network throughput and performance varies, depending on your system hardware capabilities and configuration.

OVS Using DPDK

OVS Without DPDK

From the above images, we observe that the OVS-DPDK transfer rate is roughly ~2.5x greater than OVS without DPDK.

Summary

Ubuntu has standard packages available for using OVS-DPDK. In this article, we discussed how to install, configure, and use this package for enhanced network throughput and performance. We also covered how to configure a simple OVS-DPDK bridge with DPDK vhost-user ports for an inter-VM application use case. Lastly, we observed that the OVS with DPDK gave us ~2.5x greater transfer rate than OVS without DPDK on a simple inter-vm test case on our system.

About the Author

Yaser Ahmed is a software engineer at Intel Corporation who has an MS degree in Applied Statistics from DePaul University and a BS degree in Electrical Engineering from the University of Minnesota.

↧

Intel® Software Guard Extensions (Intel® SGX) Part 9: Power Events and Data Sealing

June 22, 2017, 3:12 pm

Latest and popular articles on Intel Technologies

≫ Next: Build and Install TensorFlow* on Intel® Architecture

≪ Previous: Configure Open vSwitch* with Data Plane Development Kit on Ubuntu Server* 17.04

Download [ZIP 598KB]

In part 9 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series we’ll address some of the complexities surrounding the suspend and resume power cycle. Our application needs to do more than just survive power transitions: it must also provide a smooth user experience without compromising overall security. First, we’ll discuss what happens to enclaves when the system resumes from the sleep state and provide general advice on how to manage power transitions in an Intel SGX application. We’ll examine the data sealing capabilities of Intel SGX and show how they can help smooth the transitions between power states, while also pointing out some of the serious pitfalls that can occur when they are used improperly. Finally, we’ll apply these techniques to the Tutorial Password Manager in order to create a smooth user experience.

You can find a list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Source code is provided with this installment of the series.

Suspend, Hibernate, and Resume

Applications must be able to survive a sleep and resume cycle. When the system resumes from suspend or hibernation, applications should return to their previous state, or, if necessary, create a new state specifically to handle the wake event. What applications shouldn’t do is become unstable or crash as a direct result of that change in the power state. Call this the “rule zero” of managing power events.

Most applications don’t actually need special handling for these events. When the system suspends, the application state is preserved because RAM is still powered on. When the system hibernates, the RAM is saved to a special hibernation file on disk, which is used to restore the system state when it’s powered back on. You don’t need to add code to enable or take advantage of this core feature of the OS. There are two notable exceptions, however:

Applications that rely on physical hardware that isn’t guaranteed to be preserved across power events, such as CPU caches.
Scenarios where possible changes to the system context can affect program logic. For example, a location-based application can be moved hundreds of miles while it’s sleeping and would need to re-acquire its location. An application that works with sensitive data may choose to guard against theft by reprompting the user for his or her password.

Our Tutorial Password Manager actually falls into both categories. Certainly, if a laptop running our password manager is stolen, the thief would potentially have access to the victim’s passwords until they explicitly closed the application or locked the vault. The first category, though, may be less obvious: Intel SGX is a hardware feature that is not preserved across power events.

We can demonstrate this by running the Tutorial Password Manager, unlocking the vault, suspending the system, waking it back up, and then trying to read a password or edit one of the accounts. Follow those sequences, and you’ll get one of the error dialogs shown in Figure 1 or Figure 2.

Figure 1. Error received when attempting to edit an account after resuming from sleep.

Figure 2. Error received when attempting to view an account password after resuming from sleep.

As currently written, the Tutorial Password Manager violates rule zero: it becomes unstable after resuming from a sleep operation. The application needs special handling for power events.

Enclaves and Power Events

When a processor leaves S0 or S1 for a lower-power state, the enclave page cache (EPC) is destroyed: all EPC pages are erased along with their encryption keys. Since enclaves store their code and data in the EPC, when the EPC goes away the enclaves go with it. This means that enclaves do not survive power events that take the system to state S2 or lower.

Table 1 provides a summary of the power states.

Table 1. CPU power states

State	Description
S0	Active run state. The CPU is executing instructions, and background tasks are running even if the system appears idle and the display is powered off.
S1	Processor caches are flushed, CPU stops executing instructions. Power to CPU and RAM is maintained. Devices may or may not power off. This is a high-power standby state, sometimes called “power on suspend.”
S2	CPU is powered off. CPU context and contents of the system cache are lost.
S3	RAM is powered on to preserve its contents. A standby or sleep state.
S4	RAM is saved to nonvolatile storage in a hibernation file before powering off. When powered on, the hibernation file is read in to restore the system state. A hibernation state.
S5	“Soft off.” The system is off but some components are powered to allow a full system power-on via some external event, such as Wake-on-LAN, a system management component, or a connected device.

Power state S1 is not typically seen on modern systems, and state S2 is uncommon in general. Most CPUs go to power state S3 when put in “sleep” mode and drop to S4 when hibernating to disk.

The Windows* OS provides a mechanism for applications to subscribe to wakeup events, but that won’t help any ECALLs that are in progress when the power transition occurs (and, by extension, any OCALLs either since they are launched from inside of ECALLs). When the enclave is destroyed, the execution context for the ECALL is destroyed with it, any nested OCALLs and ECALLs are destroyed, and the outer-most ECALL immediately returns with a status of SGX_ERROR_ENCLAVE_LOST.

It is important to note that any OCALLs that are in progress are destroyed without warning, which means any changes they are making in unprotected memory will potentially be incomplete. Since unprotected memory is maintained or restored when resuming from the S3 and S4 power states, it is important that developers use reliable and robust procedures to prevent partial write corruptions. Applications must not end up in an indeterminate or invalid state when power resumes.

General Advice for Managing Power Transitions

Planning for power transitions begins before a sleep or hibernation event occurs. Decide how extensive the enclave recovery needs to be. Should the application be able to pick up exactly where it left off without user intervention? Will it resume interrupted tasks, restart them, or just abort? Will the user interface, if any, reflect the change in state? The answers to these questions will drive the rest of the application design. As a general rule, the more autonomous and seamless the recovery is, the more complex the program logic will need to be.

An application may also have different levels of recovery at different points. Some stages of an application may be easier to seamlessly recover from than others, and in some execution contexts it may not make sense or even be good security practice to attempt a seamless recovery at all.

Once the overall enclave recovery strategy has been identified, the process of preparing an enclave for a power event is as follows:

Determine the minimal state information and data that needs to be saved in order to reconstruct the enclave.
Periodically seal the state information and save it to unprotected memory (data sealing is discussed below). The sealed state data can be sent back to the main application as an [out] pointer parameter to an ECALL, or the ECALL can make an OCALL specifically to save state data.
When an SGX_ERROR_ENCLAVE_LOST code is returned by an ECALL, explicitly destroy the enclave and then recreate it. It is strongly recommended that applications explicitly destroy the enclave with a call to sgx_enclave_destroy().
Restore the enclave state using an ECALL that is designed to do so.

It is important to save the enclave state to untrusted memory before a power transition occurs. Even if the OS is able to send an event to an application when it is about to enter a standby mode, there are no guarantees that the application will have sufficient time to act before the system physically goes to sleep.

Data Sealing

When an enclave needs to preserve data across instantiations, either in preparation for a power event or between executions of the parent application, it needs to send that data out to untrusted memory. The problem with untrusted memory, however, is exactly that: it is untrusted. It is neither encrypted nor integrity checked, so any data sent outside the enclave in the clear is potentially leaking secrets. Furthermore, if that data were to be modified in untrusted memory, future instantiations of the enclave would not be able to detect that the modification occurred.

To address this problem, Intel SGX provides a capability called data sealing. When data is sealed, it is encrypted with advanced encryption standard (AES) in Galois/Counter Mode (GCM) using a 128-bit key that is derived from CPU-specific key material and some additional inputs, guided by one of two key policies. The use of AES-GCM provides both confidentiality of the data being sealed and integrity checking when the data is read back in and unsealed (decrypted).

As mentioned above, the key used in data sealing is derived from several inputs. The two key policies defined by data sealing determine what those inputs are:

MRSIGNER. The encryption key is derived from the CPU’s key material, the security version number (SVN), and the enclave signing key used by the developer. Data sealed using MRSIGNER can be unsealed by other enclaves on that same system that originate from the same software vendor (enclaves that share the same signing key). The use of an SVN allows enclaves to unseal data that was sealed by previous versions of an enclave, but prevents older enclaves from unsealing data from newer versions. It allows enclave developers to enforce software version upgrades.
MRENCLAVE. The encryption key is derived from the CPU’s key material and the enclave’s cryptographic signature. Data signed with the MRENCLAVE policy can only be unsealed by that exact enclave on that system.

Note that the CPU is a common component in the two key policies. Each processor has some random, hardware-based key material—physical circuitry on the processor—which is built into it as part of the manufacturing process. This ensures that data sealed by an enclave on one CPU cannot be unsealed by enclaves on another CPU. Each CPU will result in a different signing key, even if all other aspects of the signing policy (enclave measurement, enclave signing key, SVN) are the same.

The data sealing and unsealing API is really a set of convenience functions. They provide a high-level interface to the underlying AES-GCM encryption and 128-bit key derivation functions.

Once data has been sealed in the enclave, it can be sent out to untrusted memory and optionally written to disk.

Caveats

There is a caveat with data sealing, though, and it has significant security implications. Your enclave API needs to include an ECALL that will take sealed data as an input and then unseal it. However, Intel SGX does not authenticate the calling application, so you cannot assume that only your application is loading your enclave. This means that your enclave can be loaded and executed by anyone, even applications you didn’t write. As you might recall from Part 1, enclave applications are divided into two parts: the trusted part, which is made up of the enclaves, and the untrusted part, which is the rest of the application. These terms, “trusted” and “untrusted,” are chosen deliberately.

Intel SGX cannot authenticate the calling application because this would require a trusted execution chain that runs from system power-on all the way through boot, the OS load, and launching the application. This is far outside the scope of Intel SGX, which limits the trusted execution environment to just the enclaves themselves. Because there’s no way for the enclave to validate the caller, each enclave must be written defensibly. Your enclave cannot make any assumptions about the application that has called into it. An enclave must be written under the assumption that any application can load it and execute its API, and that its ECALLs can be executed in any order.

Normally this is not a significant constraint, but sealing and unsealing data complicates matters significantly because both the sealed data and the means to unseal it are exposed to arbitrary applications. The enclave API must not allow applications to use sealed data to bypass security mechanisms.

Take the following scenario as an example: A file encryption program wants to save end users the hassle of re-entering their password every time the application runs, so it seals their password using the data sealing functions and the MRENCLAVE policy, and then writes the sealed data to disk. When the application starts, it looks for the sealed data file, and if it’s present, reads it in and makes an ECALL to unseal the data and restore the user’s password into the enclave.

The problems with this hypothetical application are two-fold:

It assumes that it is the only application that will ever load the enclave.
It doesn’t authenticate the end user when the data is unsealed.

A malicious software developer can write their own application that loads the same enclave and follows the same procedure (looks for the sealed data file, and invokes the ECALL to unseal it inside the enclave). While the malicious application can’t expose the user’s password, it can use the enclave’s ECALLs to encrypt and decrypt the user’s files using their stored password, which is nearly as bad. The malicious user has gained the ability to decrypt files without having to know the user’s password at all!

A non-Intel SGX version of this same application that offered this same convenience feature would also be vulnerable, but that’s not the point. If the goal is to use Intel SGX features to harden the application’s security, those same features should not be undermined by poor programming practices!

Managing Power Transitions in the Tutorial Password Manager

Now that we understand how power events affect enclaves and know what tools are available to assist with the recovery process, we can turn our attention to the Tutorial Password Manager. As currently written, it has two problems:

It becomes unstable after a power event.
It assumes the password vault should remain unlocked after the system resumes.

Before we can solve the first problem we need to address the second one, and that means making some design decisions.

Sleep and Resume Behavior

The big decision that needs to be made for the Tutorial Password Manager is whether or not to lock the password vault when the system resumes from a sleep state.

The primary argument for locking the password vault after a sleep/resume cycle is to protect the password database in case the physical system is stolen while it’s suspended. This would prevent the thief from being able to access the password database after waking up the device. However, having the system lock the password vault immediately can also be a user interface friction: sometimes, aggressive power management settings cause a running system to sleep while the user is still in front of the device. If the user wakes the system back up immediately, they might be irritated to find that their password vault has been locked.

This issue really comes down to balancing user convenience against security, so the right approach is to give the user control over the application’s behavior. The default will be for the password vault to lock immediately upon suspend/resume, but the user can configure the application to wait up to 10 minutes after the sleep event before the vault is forcibly locked.

Intel® Software Guard Extensions and Non-Intel Software Guard Extensions Code Paths

Interestingly, the default behavior of the Intel SGX code path differs from that of the non-Intel SGX code path. Enclaves are destroyed during the sleep/resume cycle, which means that we effectively lock the password vault as a result. To give the user the illusion that the password vault never locked at all, we have to not only reload the vault file from disk, but also explicitly unlock it again without forcing the user to re-enter their password (this has some security implications, which we discuss below).

For the non-Intel SGX code path, the vault is just stored in regular memory. When the system resumes, system memory is unchanged and the application continues as normal. Thus, the default behavior is that an unlocked password vault remains unlocked when the system resumes.

Application Design

With the behavior of the application decided, we turn to the application design. Both code paths need to handle the sleep/resume cycle and place the vault in the correct state: locked or unlocked.

The Non-Intel Software Guard Extensions Code Path

This is the simpler of the two code paths. As mentioned above, the non-Intel SGX code path will, by default, leave the password vault unlocked if it was unlocked when the system went to sleep. When the system resumes it only needs to see how long it slept: if the sleep time exceeds the maximum configured by the user, the password vault should be explicitly locked.

To keep track of the sleep duration, we’ll need a periodic heartbeat that records the current time. This time will serve as the “sleep start” time when the system resumes. For security, the heartbeat time will be encrypted using the database key.

The Intel Software Guard Extensions Code Path

No matter how the application is configured, the system will need code to recreate the enclave and reopen the password vault. This will put the vault in the locked state.

The application will then need to see how long it has been sleeping. If the sleep time was less than the maximum configured by the user, the password vault needs to be explicitly unlocked without prompting the user for his or her master passphrase. In order to do that the application needs the passphrase, and that means the passphrase must be saved to untrusted memory so that it can be read back in when the system is restored.

The only safe way to save a secret to untrusted memory is to use data sealing, but this presents a significant security issue: As mentioned previously, our enclave can be loaded by any application, and the same ECALL that is used to unseal the master password will be available for anyone to use. Our password manager application exposes secrets to the end user (their passwords), and the master password is the only means of authenticating the user. The point of keeping the password vault unlocked after the sleep/resume cycle is to prevent the user from having to authenticate. That means we are creating a logic flow where a malicious user could potentially use our enclave’s API to unseal the user’s master password and then extract their account and password data.

In order to mitigate this risk, we’ll do the following:

Data will be sealed using the MRENCLAVE policy.
Sealed data will be kept in memory only. Writing it to disk would increase the attack surface.
In addition to sealing the password, we’ll also include the process ID. The enclave will require that the process ID of the calling process match the one that was saved when unsealing the data. If they don’t match, the vault will be left in the locked state.
The current system time will be sealed periodically using a heartbeat function. This will serve as the “sleep start” time.
The sleep duration will be checked in the enclave.

Note that verification logic must be in the enclave where it cannot be modified or manipulated.

This is not a perfect solution, but it helps. A malicious application would need to scrape the sealed data from memory, crash the user’s existing process, and then create new processes over and over until it gets one with the same process ID. It will have to do all of this before the lock timeout is reached (or take control of the system clock).

Common Needs

Both code paths will need some common infrastructure:

A timer to provide the heartbeat. We’ll use a timer interval of 15 seconds.
An event handler that is called when the system resumes from a sleep state.
Safe handling for any potential race conditions, since wakeup events are asynchronous.
Code that updates the UI to reflect the “locked” state of the password vault

Implementation

We won’t go over every change in the code base, but we’ll look at the major components and how they work.

User Options

The lock timeout value is set in the new Tools -> Options configuration dialog, shown in Figure 3.

Figure 3. Configuration options.

This parameter is saved immediately to the Windows registry under HKEY_LOCAL_USER and is loaded by the application on startup. If the registry value is not present, the lock timeout defaults to zero (lock the vault immediately after going to sleep).

The Intel SGX code path also saves this value in the enclave.

The Heartbeat

Figure 4 shows the declaration for the Heartbeat class which is ultimately responsible for recording the vault’s state information. The heartbeat is only run if state information is needed, however. If the user has set the lock timeout to zero, we don’t need to maintain state because we know to lock the vault immediately when the system resumes.

class PASSWORDMANAGERCORE_API Heartbeat {
	class PasswordManagerCoreNative *nmgr;
	HANDLE timer;
	void start_timer();
public:
	Heartbeat();
	~Heartbeat();
	void set_manager(PasswordManagerCoreNative *nmgr_in);
	void heartbeat();

	void start();
	void stop();
};

Figure 4. The Heartbeat class.

The PasswordManagerCoreNative class gains a Heartbeat object as a class member, and the Heartbeat object is initialized with a reference back to the containing PasswordManagerCoreNative object.

The Heartbeat class obtains a timer from CreateTimerQueueTimer and executes the callback function heartbeat_proc when the timer expires, as shown in Figure 5. The timer is sent a reference to the Heartbeat object, which in turn calls the heartbeat method in the Heartbeat class, which in turn calls the heartbeat method in PasswordManagerCoreNative and restarts the timer.

static void CALLBACK heartbeat_proc(PVOID param, BOOLEAN fired)
{
   // Call the heartbeat method in the Heartbeat object
	Heartbeat *hb = (Heartbeat *)param;
	hb->heartbeat();
}

Heartbeat::Heartbeat()
{
	timer = NULL;
}

Heartbeat::~Heartbeat()
{
	if (timer == NULL) DeleteTimerQueueTimer(NULL, &timer, NULL);
}

void Heartbeat::set_manager(PasswordManagerCoreNative *nmgr_in)
{
	nmgr = nmgr_in;

}

void Heartbeat::heartbeat ()
{
	// Call the heartbeat method in the native password manager
	// object. Restart the timer unless there was an error.

	if (nmgr->heartbeat()) start_timer();
}

void Heartbeat::start()
{
	stop();

	// Perform our first heartbeat right away.

	if (nmgr->heartbeat()) start_timer();
}

void Heartbeat::start_timer()
{
	// Set our heartbeat timer. Use the default Timer Queue

	CreateTimerQueueTimer(&timer, NULL, (WAITORTIMERCALLBACK)heartbeat_proc,
		(void *)this, HEARTBEAT_INTERVAL_SECS * 1000, 0, 0);
}

void Heartbeat::stop()
{
	// Stop the timer (if it exists)

	if (timer != NULL) {
		DeleteTimerQueueTimer(NULL, timer, NULL);
		timer = NULL;
	}
}

Figure 5. The Heartbeat class methods and timer callback function.

The heartbeat method in the PasswordManagerCoreNative object maintains the state information. To prevent partial write corruption, it has a two-element array of state data and an index pointer to the current index (0 or 1). The new state information is obtained from:

The new ECALL ve_heartbeat in the Intel SGX code path (by way of ew_heartbeat in EnclaveBridge.cpp).
The Vault method heartbeat in the non-Intel SGX code path.

After the new state has been received, it updates the next element (alternating between elements 0 and 1) of the array, and then updates the index pointer. The last operation is our atomic update, ensuring that the state information is complete before we officially mark it as the “current” state.

Intel Software Guard Extensions code path

The ve_heartbeat ECALL simply calls the heartbeat method in the E_Vault object, as shown in Figure 6.

int E_Vault::heartbeat(char *state_data, uint32_t sz)
{
	sgx_status_t status;
	vault_state_t vault_state;
	uint64_t ts;

	// Copy the db key

	memcpy(vault_state.db_key, db_key, 16);

	// To get the system time and PID we need to make an OCALL

	status = ve_o_process_info(&ts, &vault_state.pid);
	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	vault_state.lastheartbeat = (sgx_time_t)ts;

	// Storing both the start and end times provides some
	// protection against clock manipulation. It's not perfect,
	// but it's better than nothing.

	vault_state.lockafter = vault_state.lastheartbeat + lock_delay;

	// Saves us an ECALL to have to reset this when the vault is restored.

	vault_state.lock_delay = lock_delay;

	// Seal our data with the MRENCLAVE policy. We defined our
	// struct as packed to support working on the address
	// directly like this.

	status = sgx_seal_data(0, NULL, sizeof(vault_state_t), (uint8_t *)&vault_state, sz, (sgx_sealed_data_t *) state_data);
	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	return NL_STATUS_OK;
}

Figure 6. The heartbeat in the enclave.

It has to obtain the current system time and the process ID, and to do this we have added our first OCALL to the enclave, ve_o_process_info. When the OCALL returns, we update our state information and then call sgx_seal_data to seal it into the state_data buffer.

One restriction of the Intel SGX seal and unseal functions is that they can only operate on enclave memory. That means the state_data parameter must be a marshaled data buffer when used in this manner. If you need to write sealed data to a raw pointer that references untrusted memory (one that is passed with the user_check parameter), you must first seal the data to an enclave-local data buffer and then copy it over.

The OCALL is defined in EnclaveBridge.cpp:

// OCALL to retrieve the current process ID and
// local system time.

void SGX_CDECL ve_o_process_info(uint64_t *ts, uint64_t *pid)
{
	DWORD dwpid= GetCurrentProcessId();
	time_t ltime;

	time(&ltime);

	*ts = (uint64_t)ltime;
	*pid = (uint64_t)dwpid;
}

Because the heartbeat runs asynchronously, two threads can enter the enclave at the same time. This means the number of Thread Control Structures (TCSs) allocated to the enclave must be increased from the default of 1 to 2. This can be done one of two ways:

Right-click the Enclave project, select Intel SGX Configuration -> Enclave Settings to bring up the configuration window, and then set Thread Number to 2 (see Figure 7).
Edit the Enclave.config.xml file in the Enclave project directly, and then change the <TCSNum> parameter to 2.

Figure 7. Enclave settings dialog.

Detecting Suspend and Resume Events

A suspend and resume cycle will destroy the enclave, and that will be detected by the next ECALL. However, we shouldn’t rely on this mechanism to perform enclave recovery, because we need to act as soon as the system wakes up from the sleep state. That means we need an event listener to receive the power state change messages that are generated by Windows.

The best place to capture these is in the user interface layer. In addition to performing the enclave recovery, we must be able to lock the password vault if the system was in the sleep state longer than maximum sleep time set in the user options. When the vault is locked, the user interface also needs to be updated to reflect the new vault state.

One limitation of the Windows Presentation Foundation* is that it does not provide event hooks for power-related messages. The workaround is to hook in to the message handler for the underlying window handle. Our main application window and all of our dialog windows need a listener so that we can gracefully close each one.

The hook procedure for the main window is shown in Figure 8.

private IntPtr Main_Power_Hook(IntPtr hwnd, int msg, IntPtr wParam, IntPtr lParam, ref bool handled)
{
    UInt16 pmsg;

    // C# doesn't have definitions for power messages, so we'll get them via C++/CLI. It returns a
    // simple UInt16 that defines only the things we care about.
    pmsg= PowerManagement.message(msg, wParam, lParam);

    if ( pmsg == PowerManagementMessage.Suspend )
    {
        mgr.suspend();
        handled = true;
    } else if (pmsg == PowerManagementMessage.Resume)
    {
        int vstate = mgr.resume();

        if (vstate == ResumeVaultState.Locked) lockVault();
        handled = true;
    }

    return IntPtr.Zero;
}

Figure 8. Message hook for the main window.

To get at the messages, the handler must dip down to native code. This is done using the new PowerManagement class, which defines a static function called message, shown in Figure 9. It returns one of four values:

PWR_MSG_NONE	The message was not a power event.
PWR_MSG_OTHER	The message was power-related, but not a suspend or resume message.
PWR_MSG_RESUME	The system has woken up from a low-power or sleep state.
PWR_MSG_SUSPEND	The system is suspending to a low-power state.

UINT16 PowerManagement::message(int msg, IntPtr wParam, IntPtr lParam)
{
	INT32 subcode;

	// We only care about power-related messages

	if (msg != WM_POWERBROADCAST) return PWR_MSG_NONE;

	subcode = wParam.ToInt32();

	if ( subcode == PBT_APMRESUMEAUTOMATIC ) return PWR_MSG_RESUME;
	else if (subcode == PBT_APMSUSPEND ) return PWR_MSG_SUSPEND;

	// Don't care about other power events.

	return PWR_MSG_OTHER;
}

Figure 9. The message listener.

We actually listen for both suspend and resume messages here, but the suspend handler does very little work. When a system is transitioning to a sleep state, an application has less than 2 seconds to act on the power message. All we do with the sleep message is stop the heartbeat. This isn’t strictly necessary, and is just a precaution against having a heartbeat execute while the system is suspending.

The resume message is handled by calling the resume method in PasswordManagerCore. It’s job is to figure out whether the vault should be locked or unlocked. It does this by checking the current system time against the saved vault state (if any). If there’s no state, or if the system has slept longer than the maximum allowed, it returns ResumeVaultState.Locked.

Restoring the Enclave

In the Intel SGX code path, the enclave has to be recreated before the enclave state information can be checked. The code for this is shown in Figure 10.

bool PasswordManagerCore::restore_vault(bool flag_async)
{
	bool got_lock= false;
	int rv;

	// Only let one thread do the restore if both come in at the
	// same time. A spinlock approach is inefficient but simple.
	// This is OK for our application, but a high-performance
	// application (or one with a long-running work loop)
	// would want something else.

	try {
		slock.Enter(got_lock);

		if (_nlink->supports_sgx()) {
			bool do_restore = true;

			// This part is only needed for enclave-based vaults.

			if (flag_async) {
				// If we are entering as a result of a power event,
				// make sure the vault has not already been restored
				// by the synchronous/UI thread (ie, a failed ECALL).

				rv = _nlink->ping_vault();
				if (rv != NL_STATUS_LOST_ENCLAVE) do_restore = false;
				// If do_store is false, then we'll also use the
				// last value of rv_restore as our return value.
				// This will tell us whether or not we should lock the
				// vault.
			}

			if (do_restore) {
				// If the vaultfile isn't open then we are locked or hadn't
				// been opened to be begin with.

				if (!vaultfile->is_open()) {
					// Have we opened a vault yet?
					if (vaultfile->get_vault_path()->Length == 0) goto restore_error;

					// We were explicitly locked, so reopen.
					rv = vaultfile->open_read(vaultfile->get_vault_path());
					if (rv != NL_STATUS_OK) goto restore_error;
				}

				// Reinitialize the vault from the header.

				rv = _vault_reinitialize();
				if (rv != NL_STATUS_OK) goto restore_error;

				// Now, call to the native object to restore the vault state.
				rv = _nlink->restore_vault_state();
				if (rv != NL_STATUS_OK) goto restore_error;

				// The database password was restored to the vault. Now restore
				// the vault, itself.

				rv = send_vault_data();
			restore_error:
				restore_rv = (rv == NL_STATUS_OK);
			}
		}
		else {
			rv = _nlink->check_vault_state();
			restore_rv = (rv == NL_STATUS_OK);
		}

		slock.Exit(false);
	}
	catch (...) {
		// We don't need to do anything here.
	}

	return restore_rv;
}

Figure 10. The restore_vault() method.

The enclave and vault are reinitialized from the vault data file, and the vault state is restored using the method restore_vault_state in PasswordManagerCoreNative.

Which Thread Restores the Vault State?

The Tutorial Password Manager can have up to three threads executing at any given time. They are:

The main UI
The heartbeat
The power event handler

Only one of these threads should be responsible for actually restoring the enclave, but it is possible that both the heartbeat and the main UI thread are in the middle of an ECALL when a power event occurs. In that case, both ECALLs will fail with the error code SGX_ERR_ENCLAVE_LOST while the power event handler is executing. Given this potential race condition, it’s necessary to decide which thread is given the job of enclave recovery.

If the lock timeout is set to zero, there won’t be a heartbeat thread at all, so it doesn’t make sense to put enclave recovery logic there. If the heartbeat ECALL returns SGX_ERR_ENCLAVE_LOST, it simply stops the heartbeat and assumes other threads will be dealing with it.

That leaves the UI thread and the power event handler, and a good argument can be made that both threads need the ability to recover an enclave. The event handler will catch all suspend/resume cycles immediately, so it make sense to have enclave recovery happen there. However, as we pointed out earlier it is entirely possible for a power event to occur during an active ECALL on the UI thread, and there’s no reason to prevent that thread from starting the recovery, especially since it might occur before the power event message is received. This not only provides a safety net in case the event handler fails to execute for some reason, but it also provides a quick and easy retry loop for the operation.

Since we can’t have both of these threads run the recovery at the same time, we need to use locking to ensure that only the first thread to arrive is given the job. The second one simply waits for the first to finish.

It’s also possible that a failed ECALL will complete the recovery process before the event handler enters the recovery loop. To prevent the event handler from blindly repeating the enclave recovery procedure, we have added a quick test to make sure the enclave hasn’t already been recreated.

Detection in the UI Thread

The UI thread detects power events by looking for ECALLs that fail with SGX_ERR_LOST_ENCLAVE. The wrapper functions in EnclaveBridge.cpp automatically relaunch the enclave and pass the error NL_STATUS_ENCLAVE_RECREATED back up to the PasswordManagerCore object.

Each method in PasswordManagerCore handles this return code uniquely. Some methods, such as initialize, initialize_from_header, and lock_vault don’t actually have to restore state at all, but most of the others do and they call in to restore_vault as show in Figure 11.

int PasswordManagerCore::accounts_password_to_clipboard(UInt32 idx)
{
	UINT32 index = idx;
	int rv;
	int tries = 3;

	while (tries--) {
		rv = _nlink->accounts_password_to_clipboard(index);
		if (rv == NL_STATUS_RECREATED_ENCLAVE) {
			if (!restore_vault()) {
				rv = NL_STATUS_LOST_ENCLAVE;
				tries = 0;
			}
		}
		else break;
	}

	return rv;
}

Figure 11. Detecting a power event on the main UI thread.

Here, the method gets three attempts to restore the vault before giving up. This retry count of three is an arbitrary limit: it’s not likely that we’ll have multiple power events in rapid succession but it’s possible. Though we don’t want to just give up after one attempt, we also don’t want to loop forever in case there’s a system issue that prevents the enclave from ever being recreated.

Restoring and Checking State

The last step is to examine the state data for the vault and determine whether the vault should be locked or unlocked. In the Intel SGX code path, the sealed state data is sent into the enclave where it is unsealed, and then compared to current system data obtained from the OCALL ve_o_process_info. This method, restore_state, is shown in Figure 12.

int E_Vault::restore_state(char *state_data, uint32_t sz)
{
	sgx_status_t status;
	vault_state_t vault_state;
	uint64_t now, thispid;
	uint32_t szout = sz;

	// First, make an OCALL to get the current process ID and system time.
	// Make these OCALLs so that the parameters aren't be supplied by the
	// ECALL (which would make it trivial for the calling process to fake
	// this information)

	status = ve_o_process_info(&now, &thispid);
	if (status != SGX_SUCCESS) {
		// Zap the state data.
		memset_s(state_data, sz, 0, sz);
		return NL_STATUS_SGXERROR;
	}

	status = sgx_unseal_data((sgx_sealed_data_t *)state_data, NULL, 0, (uint8_t *)&vault_state, &szout);
	// Zap the state data.
	memset_s(state_data, sz, 0, sz);

	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	if (thispid != vault_state.pid) return NL_STATUS_PERM;
	if (now < vault_state.lastheartbeat) return NL_STATUS_PERM;
	if (now > vault_state.lockafter) return NL_STATUS_PERM;

	// Everything checks out. Restore the key and mark the vault as unlocked.

	lock_delay = vault_state.lock_delay;

	memcpy(db_key, vault_state.db_key, 16);
	_VST_CLEAR(_VST_LOCKED);

	return NL_STATUS_OK;
}

Figure 12. Restoring state in the enclave.

Note that unsealing data is programmatically simpler than sealing it: the key derivation and policy information is embedded in the sealed data blob. Unlike data sealing there is only one unseal function, sgx_unseal_data, and it takes fewer parameters than its counterpart.

This method returns NL_STATUS_OK if the vault is restored to the unlocked state, and NL_STATUS_PERM if it is restored to the locked state.

Lingering Issues

The Tutorial Password Manager as currently implemented still has issues that need to be addressed.

There is still a race condition in the enclave recovery logic. Because the ECALL wrappers in EnclaveBridge.cpp immediately recreate the enclave before returning an error code to the PasswordManagerCore layer, it is possible for the power event handler thread to enter the restore_vault method after the enclave has been recreated but before the enclave recovery has completed. This can cause the power event handler to return the wrong status to the UI layer, placing the UI in the “locked” or “unlocked” state incorrectly.
We depend on the system clock when validating our state data, but the system clock is actually untrusted. A malicious user can manipulate the time in order to force the password vault into an unlocked state when the system wakes up (this can be addressed by using trusted time, instead).

Summary

In order to prevent cold boot attacks and other attacks against memory images in RAM, Intel SGX destroys the Enclave Page Cache whenever the system enters a low-power state. However, this added security comes at a price: software complexity that can’t be avoided. All real-world Intel SGX applications need to plan for power events and incorporate enclave recovery logic because failing to do so will lead to runtime errors during the application’s execution.

Power event planning can rapidly escalate the application’s level of sophistication. The user experience needs of the Tutorial Password Manager took us from a single-threaded application with relatively simple constructs to one with multiple, asynchronous threads, locking, and atomic memory updates via simple journaling. As a general rule, seamless enclave recovery requires careful design and a significant amount of added program logic.

Sample Code

The code sample for this part of the series builds against the Intel SGX SDK version 1.7 using Microsoft Visual Studio* 2015.

Release Notes

Running a mixed-mode Intel SGX application under the debugger in Visual Studio will cause an exception to be thrown if a power event is triggered. The exception occurs when an ECALL detects the lost enclave and returns SGX_ERROR_LOST_ENCLAVE.
The non-Intel SGX code path was updated to use Microsoft’s DPAPI to store the database encryption key. This is a better solution than the in-memory XOR’ing.

Coming Up Next

In Part 10 of the series, we’ll discuss debugging mixed-mode Intel SGX applications with Visual Studio. Stay tuned!

↧

Build and Install TensorFlow* on Intel® Architecture

June 21, 2017, 11:58 am

Latest and popular articles on Intel Technologies

≫ Next: Build and Install TensorFlow* Serving on Intel® Architecture

≪ Previous: Intel® Software Guard Extensions (Intel® SGX) Part 9: Power Events and Data Sealing

Introduction

TensorFlow* is a leading deep learning and machine learning framework, and as of May 2017, it now integrates optimizations for Intel® Xeon® processors and Intel® Xeon Phi™ processors. This is the first in a series of tutorials providing information for developers who want to build, install, and explore TensorFlow optimized on Intel architecture from sources available in the GitHub* repository.

Resources

The TensorFlow website is a key resource for learning about the framework, providing informative overviews, tutorials, and technical information on its various components. This is the first stop for developers interested in understanding the full extent of what TensorFlow has to offer in the area of deep learning.

The article TensorFlow Optimizations on Modern Intel® Architecture introduces the specific graph optimizations, performance experiments, and details for building and installing TensorFlow with CPU optimizations. This article is highly recommended for developers who want to understand the details of how to fully optimize TensorFlow for different topologies, and the performance improvements they can achieve in doing so.

Installation Overview

The installation steps presented in this document are distilled from information provided in the Installing TensorFlow from Sources guide on the TensorFlow website. The steps outlined below are provided to give a quick overview of the installation process; however, since third-party information is subject to change over time, it is recommended that you also review the information provided on the TensorFlow website.

The installation guidelines presented in this document focus on installing TensorFlow with CPU support only. The target operating system and Python* distribution are Ubuntu* 16.04 and Python 2.7, respectively.

Installing the Bazel* Build Tool

Bazel* is the publicly available build tool from Google*. If Bazel is already installed on your system you can skip this section. Otherwise, enter the following commands to add the Bazel distribution URI, perform the installation, and update Bazel on your system:

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list sudo apt install curl curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add - sudo apt-get update && sudo apt-get install bazel sudo apt-get upgrade bazel

Installing Python* Dependencies

If the Python dependencies are already installed on your system you can skip this section. To install the required packages for Python 2.7, enter the following command:

sudo apt-get install python-numpy python-dev python-pip python-wheel

Building a TensorFlow* Pip Package for Installation

If the program Git* is not currently installed on your system, issue the following command:

sudo apt install git

Clone the GitHub repository by issuing the following command:

git clone https://github.com/tensorflow/tensorflow

The tensorflow directory created during cloning contains a script named configure that must be executed prior to creating the pip package and installing TensorFlow. This script allows you to identify the pathname, dependencies, and other build configuration options. For TensorFlow optimized on Intel architecture, this script also allows you to set up Intel® Math Kernel Library (Intel® MKL) related environment settings. Execute the following commands:

cd tensorflow ./configure

Important: Select ‘Y’ to build TensorFlow with Intel MKL support, and ‘Y’ to download MKL LIB from the web. Select the default settings for the other configuration parameters. When the script has completed running, issue the following command to build the pip package:

bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Installing TensorFlow—Native Pip Option

At this point in the process the newly created pip package will be located in tmp/tensorflow_pkg. The next step is to install TensorFlow, which can be done either as a native pip installation, or in an Anaconda* virtual environment as described in the next section. For a native pip installation simply enter the following command:

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.2.0rc1-cp27-cp27mu-linux_x86_64.whl

(Note: The name of the wheel, as shown above in italics, may be different for your particular build.)

Once these steps have been completed be sure to validate the installation before proceeding to the next section. Note: When running the Python validation script provided in the link, be sure to change to a different directory, for example:

cd ..

Installing TensorFlow—Conda* Environment Option

Note: If you already have Anaconda installed on your system you can skip this step.

Download Anaconda from the download page and follow the directions to run the installer script. (For this tutorial, we used the 64-bit, x86, Python 2.7 version of Anaconda.) During the installation you need to agree to the license, choose the defaults, and choose 'yes' to add Anaconda to your path. Once the installation is complete, close the terminal and open a new one.

Next, we will create a conda environment and install TensorFlow from the newly created pip package located in tmp/tensorflow_pkg. Run the following commands to create a TensorFlow environment called "inteltf" and issue the following commands:

conda create -n inteltf source activate inteltf pip install /tmp/tensorflow_pkg/tensorflow-1.2.0rc1-cp27-cp27mu-linux_x86_64.whl

(Note: The name of the wheel, as shown above in italics, may be different for your particular build.)

source deactivate inteltf

Close the terminal and open a new one before proceeding.

Restart the inteltf environment and validate the TensorFlow installation by running the following Python code from the website:

source activate inteltf
	python>>> import tensorflow as tf>>> hello = tf.constant('Hello, TensorFlow!')>>> sess = tf.Session()>>> print(sess.run(hello))

The Python program should output “Hello, TensorFlow!” if the installation was successful.

Coming Up

The next article in the series describes how to install TensorFlow Serving*, a high-performance serving system for machine learning models designed for production environments.

↧

Build and Install TensorFlow* Serving on Intel® Architecture

June 21, 2017, 2:12 pm

Latest and popular articles on Intel Technologies

≫ Next: Train and Use a TensorFlow* Model on Intel® Architecture

≪ Previous: Build and Install TensorFlow* on Intel® Architecture

Introduction

The first tutorial in this series, Build and Install TensorFlow* on Intel® Architecture, demonstrated how to build and install TensorFlow optimized on Intel architecture from sources available in the GitHub* repository. The information provided in this paper describes how to build and install TensorFlow* Serving, a high-performance serving system for machine learning models designed for production environments.

Installation Overview

The installation guidelines presented in this document are distilled from information available on the TensorFlow Serving GitHub website. The steps outlined below are provided to give a quick overview of the installation process; however, since third-party information is subject to change over time it is recommended that you also review the information provided on the TensorFlow Serving website.

Important: The step-by-step guidelines provided below assume the reader has already completed the tutorial Build and Install TensorFlow on Intel® Architecture, which includes the steps to install the Bazel* build tool and some of the other required dependencies not covered here.

Installing gRPC*

Begin by installing the Google Protocol RPC* library (gRPC*), a framework for implementing remote procedure call (RPC) services.

sudo pip install grpcio

Installing Dependencies

Next, ensure the other TensorFlow Serving dependencies are installed by issuing the following command:

sudo apt-get update && sudo apt-get install -y \ build-essential \ curl \ libcurl3-dev \ git \ libfreetype6-dev \ libpng12-dev \ libzmq3-dev \ pkg-config \ python-dev \ python-numpy \ python-pip \ software-properties-common \ swig \ zip \ zlib1g-dev

Installing TensorFlow* Serving

Clone TensorFlow Serving from the GitHub repository by issuing the following command:

git clone --recurse-submodules https://github.com/tensorflow/serving

The serving/tensorflow directory created during the cloning process contains a script named “configure” that must be executed to identify the pathname, dependencies, and other build configuration options. For TensorFlow optimized on Intel architecture, this script also allows you to set up Intel® Math Kernel Library (Intel® MKL) related environment settings. Issue the following commands:

cd serving/tensorflow ./configure

Important: Select ‘Y’ to build TensorFlow with MKL support, and ‘Y’ to download MKL LIB from the web. Select the default settings for the other configuration parameters.

cd .. bazel build --config=mkl --copt="-DEIGEN_USE_VML" tensorflow_serving/...

Testing the Installation

Test the TensorFlow Serving installation by issuing the following command:

bazel test tensorflow_serving/...

If everything worked OK you should see results similar to Figure 1.

Screenshot of a command prompt window with results of correct installation

Figure 1. TensorFlow Serving installation test results.

Coming Up

The next article in this series describes how to train and save a TensorFlow model, host the model in TensorFlow Serving, and use the model for inference in a client-side application.

↧

Train and Use a TensorFlow* Model on Intel® Architecture

June 21, 2017, 2:24 pm

Latest and popular articles on Intel Technologies

≫ Next: Performance Optimization of memcpy in DPDK

≪ Previous: Build and Install TensorFlow* Serving on Intel® Architecture

Introduction

TensorFlow* is a leading deep learning and machine learning framework, and as of May 2017, it now integrates optimizations for Intel® Xeon® processors and Intel® Xeon Phi™ processors. This is the third in a series of tutorials providing information for developers who want to build, install, and explore TensorFlow optimized on Intel® architecture from sources available in the GitHub* repository.

The first tutorial in this series Build and Install TensorFlow for Intel Architecture demonstrates how to build and install TensorFlow optimized on Intel architecture from sources in the GitHub* repository.

The second tutorial in the series Build and Install TensorFlow Serving on Intel Architecture describes how build and install TensorFlow Serving, a high-performance serving system for machine learning models designed for production environments.

In this tutorial we will train and save a TensorFlow model, build a TensorFlow model server, and test the server using a client application. This tutorial is based on the MNIST for ML Beginners and Serving a TensorFlow Model tutorials on the TensorFlow website. You are encouraged to review these tutorials before proceeding to fully understand the details of how models are trained and saved.

Train and Save a MNIST Model

According to Wikipedia, the MNIST (Modified National Institute of Standards and Technology) database contains 60,000 training images and 10,000 testing images used for training and testing in the field of machine learning. Because of its relative simplicity, the MNIST database is often used as an introductory dataset for demonstrating machine learning frameworks.

To get started, open a terminal and issue the following commands:

cd ~/serving bazel build //tensorflow_serving/example:mnist_saved_model rm -rf /tmp/mnist_model bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model

Troubleshooting: At the time of this writing, the TensorFlow Serving repository identified an error logged as “NotFoundError in mnist_export example #421.” If you encounter an error after issuing the last command try this workaround:

Open serving bazel-bin/tensorflow_serving/example/mnist_saved_model.runfiles/ org_tensorflow/tensorflow/contrib/image/__init__.py
Comment-out (#) the following line as shown:
#from tensorflow.contrib.image.python.ops.single_image_random_dot_stereograms import single_image_random_dot_stereograms
Save and close __init__.py.
Try issuing the command again:
bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model

Since we omitted the training_iterations and model_version command-line parameters when we ran mnist_saved_model, they defaulted to 1000 and 1, respectively. Because we passed /tmp/mnist_model for the export directory, the trained model was saved in /tmp/mnist_model/1.

As explained in the TensorFlow tutorial documentation, the “1” version sub-directory contains the following files:

saved_model.pb is the serialized tensorflow::SavedModel. It includes one or more graph definitions of the model, as well as metadata of the model such as signatures.
variables are files that hold the serialized variables of the graphs.

Troubleshooting: In some instances you might encounter an issue with the downloaded training files getting corrupted when the script runs. This error is identified as "Not a gzipped file #170" on GitHub. If necessary, these files can be downloaded manually by issuing the following commands from the /tmp directory:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Build and Start the TensorFlow Model Server

Build the TensorFlow model server by issuing the following command:

bazel build //tensorflow_serving/model_servers:tensorflow_model_server

Start the TensorFlow model server by issuing the following command:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/tmp/mnist_model/ &

Test the TensorFlow Model Server

The last command started the ModelServer running in the terminal. To test the server using the mnist_client utility provided in the TensorFlow Serving installation, enter the following commands from the /serving directory:

bazel build //tensorflow_serving/example:mnist_client bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=localhost:9000

If everything worked, you should see results similar to Figure 1.

Screenshot of a command prompt window with client test results

Figure 1. TensorFlow client test results

Troubleshooting: There is an error identified on GitHub as “gRPC doesn't respect the no_proxy environment variable” that may result in an “Endpoint read failed” error when you run the client application. Issue the env command to see if the http_proxy environment variable is set. If so, it can be temporarily unset by issuing the following command:

unset http_proxy

Summary

In this series of tutorials we explored the process of building the TensorFlow machine learning framework and TensorFlow Serving, a high-performance serving system for machine learning models, optimized for Intel architecture. A simple model based on the MNIST dataset was trained and saved, and it was then deployed using a TensorFlow model server. Lastly, the mnist_client example included in the GitHub repository was used to demonstrate how a client-side application can leverage a TensorFlow model server to do simple machine learning inference.

For additional information on this subject please visit the TensorFlow website, a key resource for learning more about the framework. The article entitled “TensorFlow Optimizations on Modern Intel Architecture” introduces the specific graph optimizations, performance experiments, and details for building and installing TensorFlow with CPU optimizations.

↧

Performance Optimization of memcpy in DPDK

June 26, 2017, 7:10 am

Latest and popular articles on Intel Technologies

≫ Next: Configure SR-IOV Network Virtual Functions in Linux* KVM*

≪ Previous: Train and Use a TensorFlow* Model on Intel® Architecture

Introduction

Memory copy, memcpy, is a simple yet diverse operation, as there are possibly hundreds of code implementations that copy data from one part of memory to another. However, the discussion on how to evaluate and optimize for a better memcpy never stops.

This article discusses how optimizations are positioned, conducted, and evaluated for use with memcpy in the Data Plane Development Kit (DPDK).

First, let’s look at the following simple memcpy function:

void * simple_memcpy(void *dst, const void *src, size_t n)
{
        const uint8_t *_src = src;
        uint8_t *_dst = dst;
        size_t i;

        for (i = 0; i < n; ++i)
                _dst[i] = _src[i];

        return dst;
}

Is there anything wrong with this function? Not really. But it surely missed some optimization methods. The function:

Does not employ single instruction, multiple data (SIMD)
Has no instruction-level parallelism
Lacks load/store address alignment

The performance of the above implementation depends entirely on the compiler’s optimization. Surprisingly, in some scenarios, this function outperforms the glibc memcpy. Of course, the compiler takes most of the credit by optimizing the implementation. But it also gets us thinking: Is there an ultimate memcpy implementation that outperforms all others?

This article holds the view that the ultimate memcpy implementation, providing the best performance in any given scenario (hardware + software + data) simply does not exist. Ironically, the best memcpy implementation is to completely avoid memcpy operations; the second-best implementation might be to handcraft dedicated code for each and every memcpy call, and there are others. Memcpy should not be considered and measured as one standalone part of the program; instead, the program should be seen as a whole—the data that one memcpy accesses has been and will be accessed by other parts of the program, also the instructions from memcpy and other parts of the program interact inside the CPU pipeline in an out-of-order manner. This is why DPDK introduced rte_memcpy, to accelerate the critical memcpy paths in core DPDK scenarios.

Common Optimization Methods for memcpy

There are abundant materials online for memcpy optimization; we provide only a brief summary of optimization methods here.

Generally speaking, memcpy spends CPU cycles on:

Data load/store
Additional calculation tasks (such as address alignment processing)
Branch prediction

Common optimization directions for memcpy:

Maximize memory/cache bandwidth (vector instruction, instruction-level parallel)
Load/store address alignment
Batched sequential access
Use non-temporal access instruction as appropriate
Use the String instruction as appropriate to speed up larger copies

Most importantly, all instructions are executed through the CPU pipeline; therefore, pipeline efficiency is everything, and the instruction flow needs to be optimized to avoid pipeline stall.

Optimizing Memcpy for DPDK

Since early 2015, the exclusive memcpy implementation for DPDK, rte_memcpy, has been optimized several times to accelerate different DPDK use-case scenarios, such as vhost Rx/Tx. All the analysis and code changes can be viewed in the DPDK git log.

There are many ways to optimize an implementation. The simplest and most straightforward way is trial and error; to make a variety of improvements with some baseline knowledge, verify them in the target scenario, and then choose a better one by using a set of evaluation criteria. All you need is experience, patience, and a little imagination. Although this approach can sometimes bring surprises, it is neither efficient nor reassuring.

Another common approach sounds more promising: At first, initial effort is invested to fully understand the source code (assembly code, if necessary) behaviors and to establish the theoretical optimal performance. With this optimal baseline, the performance gap can be confirmed. Runtime sampling is then conducted to analyze defects of the existing code to seek improvement. This may require a lot of experience and analysis effort. For example, the vhost enqueue optimization in DPDK 16.11 is the result of several weeks work spent sampling and analyzing. Finally, by moving three lines of code, tests performed with DPDK testpmd showed that enqueue efficiency was improved by 1.7 times as the enqueue cost is reduced from about 250 cycles per packet to 150 cycles per packet. Later, in DPDK 17.02, the rte_memcpy optimization patch was derived from the same idea. This is hard to achieve by the first method.

See Appendix A for a description of the test hardware configuration we used for testing. To learn more about DPDK performance testing with testpmd, read Testing DPDK Performance and Features with TestPMD on Intel® Developer Zone.

There are many useful tools for profiling and sampling such as perf and VTune™. They are very effective as long as you know what data you’re looking for.

Show Me the Data!

Ultimately, the goal of optimization is to speed up the performance of the target application scenario, which is a combination of hardware, software, and data. The evaluation methods vary.

For memcpy, the use of a micro-benchmark can easily get a few key performance numbers such as the copy rate (MB/s); however, that approach lacks reference values. That’s because memcpy is normally optimized at the programming language level as well as the instruction level for the specific hardware platform and specific software code, even specific data length; and the memcpy algorithm itself doesn’t have much space for improvement. In this case, different scenarios require different optimization techniques. Therefore, micro-benchmarks speak only for themselves.

Also, it is not advisable to evaluate performance by timestamping the memcpy code. The modern CPU has a very complex pipeline that supports prefetching and out-of-order execution, which results in significant deviations when the performance is measured at the cycle level. Although forced synchronization can be done by adding serialization instructions, it may change the execution sequence of instruction flow, degrade program performance, and breach the original intention of performance measurement. Meanwhile, instructions which are highly optimized by the compiler also appear to be out-of-order with respect to the programming language. Forced sequential compiling also significantly impacts performance and makes the result meaningless. Besides, the execution time of an instruction stream includes not only the ideal execution cycles, but also the data access latency caused by pipeline stall cycles. Since the data accessed by a piece of code probably has been and will be accessed by other parts of the program, it may appear to have a shorter execution time by advancing or delaying the data access. These complex factors make the seemingly easy task of memcpy performance evaluation troublesome.

Therefore, field testing should be used for optimization evaluation. For example, in the application of Open vSwitch* (OvS) in the cloud, memcpy is heavily used in vhost Rx/Tx, and in this case we should take the final packet forwarding rate as the performance evaluation criteria for the memcpy optimization.

Test Results

Figure 1 below shows the test results of an example with Physical-VM-Physical (PVP) traffic close to the actual application scenario. And by replacing rte_memcpy in DPDK vhost with memcpy provided by glibc, comparative data is gained. The results show that an increase of 22 percent of the total bandwidth can be obtained simply by accelerating the vhost Rx/Tx part by the use of rte_memcpy. Our test configuration is described below in Appendix A.

colored column show comparisons of performances in DPDK rte_memcpy and glibc memcpy in OvS-DPDK — Figure 1. Performance comparison between DPDK rte_memcpy and glibc memcpy in OvS-DPDK

Continue the Conversation

Join the DPDK mailing list, dev@dpdk.org, where your feedback and questions about rte_memcpy are welcomed.

About the Author

Zhihong Wang is a software engineer at Intel. He has worked on various areas, including CPU performance benchmarking and analysis, packet processing performance optimization, network virtualization.

Appendix A

Test Environment

PVP flow: Ixia* sends the packet to the physical network card, OvS-DPDK forwards the packet received by the physical network card to the virtual machine, the virtual machine processes the packet and sends it back to the physical network card by the OvS-DPDK, and finally back to Ixia
Virtual machine will use MAC-forwarding using DPDK testpmd
OvS-DPDK Version: Commit f56f0b73b67226a18f97be2198c0952dad534f1c
DPDK Version: 17.02
GCC/GLIBC Version: 6.2.1/2.23
Linux*: 4.7.5-200.fc24.x86_64
CPU: Intel® Xeon® processor E5-2699 v3 at 2.30GHz

OvS-DPDK Compile and Boot Commands

./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true ./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024" ./ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev ./ovs-vsctl add-port ovsbr0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser ./ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:06:00.0 ./ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10000 ./ovs-ofctl del-flows ovsbr0 ./ovs-ofctl add-flow ovsbr0 in_port=1,action=output:2 ./ovs-ofctl add-flow ovsbr0 in_port=2,action=output:1

Use DPDK testpmd for Virtual Machine Forwarding

set fwd mac start

↧

Configure SR-IOV Network Virtual Functions in Linux* KVM*

June 19, 2017, 10:18 am

Latest and popular articles on Intel Technologies

≫ Next: Trion Worlds: Moving With the Times

≪ Previous: Performance Optimization of memcpy in DPDK

Introduction

This tutorial demonstrates several different ways of using single root input/output virtualization (SR-IOV) network virtual functions (VFs) in Linux* KVM* virtual machines (VMs) and discusses the pros and cons of each method.

Here’s the short story: use the KVM virtual network pool of SR-IOV adapters method. It has the same performance as the VF PCI* passthrough method, but it’s much easier to set up. If you must use the macvtap method, use virtio as your device model because every other option will give you horrible performance. And finally, if you are using a 40 Gbps Intel® Ethernet Server Adapter XL710, consider using the Data Plane Development Kit (DPDK) in the guest; otherwise you won’t be able to take full advantage of the 40 Gbps connection.

There are a few downloads associated with this tutorial that you can get from github.com/intel:

A script that groups virtual functions by their physical functions,
A script that lists virtual functions and their PCI and parent physical functions, and
The KVM XML virtual machine definition file for the test VM

SR-IOV Basics

SR-IOV provides the ability to partition a single physical PCI resource into virtual PCI functions which can then be injected into a VM. In the case of network VFs, SR-IOV improves north-south network performance (that is, traffic with endpoints outside the host machine) by allowing traffic to bypass the host machine’s network stack.

Supported Intel Network Interface Cards

A complete list of Intel Ethernet Server Adapters and Intel® Ethernet Controllers that support SR-IOV is available online, but in this tutorial, I evaluated just four:

The Intel® Ethernet Server Adapter X710, which supports up to 32 VFs per port
The Intel Ethernet Server Adapter XL710, which supports up to 64 VFs per port
The Intel® Ethernet Controller X540-AT2, which supports 32 VFs per port
The Intel® Ethernet Controller 10 Gigabit 82599EB, which supports 32 VFs per port

Assumptions

There are several different ways to inject an SR-IOV network VF into a Linux KVM VM. This tutorial evaluates three of those ways:

As an SR-IOV VF PCI passthrough device
As an SR-IOV VF network adapter using macvtap
As an SR-IOV VF network adapter using a KVM virtual network pool of adapters

Most of the steps in this tutorial can be done using either the command line virsh tool or using the virt-manager GUI. If you prefer to use the GUI, you’ll find screenshots to guide you; if you are partial to the command line, you’ll find code and XML snippets to help. Note that there are several steps in this tutorial that cannot be done via the GUI.

Network Configuration

The test setup included two physical servers—net2s22c05 and net2s18c03—and one VM—sr-iov-vf-testvm—that was hosted on net2s22c05. Net2s22C05 had one each of the four Intel Ethernet Server Adapters listed above with one port in each adapter directly linked to a NIC port with equivalent link speed in net2s18c03. The NIC ports on each system were in the same subnet: those on net2s18c03 all had static IP addresses with .1 as the final dotted quad, the net2s22c05 ports had .2 as the final dotted quad, and the virtual ports in sr-iov-vf-testvm all had .3 as the final dotted quad:

Network Configuration

System Configuration

Host Configuration

CPU	2-Socket, 22-core Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
Memory	128 GB
NIC	Intel® Ethernet Controller X540-AT2
	Intel® 82599 10 Gigabit TN Network Connection
	Intel® Ethernet Controller X710 for 10GbE SFP+
	Intel® Ethernet Controller XL710 for 40GbE QSFP+
Operating System	Ubuntu* 16.04 LTS
Kernel parameters	GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt”

Guest Configuration

The following XML snippets are taken via # virsh dumpxml sr-iov-vf-testvm

CPU	<vcpu placement='static'>8</vcpu><cpu mode='host-passthrough'><topology sockets='1' cores='8' threads='1'/></cpu>
Memory	<memory unit='KiB'>12582912</memory><currentMemory unit='KiB'>12582912</currentMemory>
NIC	<interface type='network'><mac address='52:54:00:4d:2a:82'/><source network='default'/><model type='rtl8139'/><address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/></interface>
	The SR-IOV NIC XML tag varied based the configurations discussed in this tutorial.
Operating System	Ubuntu 14.04 LTS. Note: This OS and Linux* kernel version were chosen based on a specific usage. Otherwise, newer versions would have been used.
Linux* Kernel Version	3.13.0-24-lowlatency
Software	ufw purged lshw installed Note: Ubuntu 14.04 LTS did not come with the i40evf driver preinstalled. I built the driver from source and then loaded it into the kernel. I used version 2.0.22. Instructions for building and loading the driver are located in the README file.

The complete KVM definition file is available online.

Scope

This tutorial does not focus on performance. And even though the performance of the Intel Ethernet Server Adapter XL710 SR-IOV connection listed below clearly demonstrates the value of the DPDK, this tutorial does not focus on configuring SR-IOV VF network adapters to use DPDK in the guest VM environment. For more information on this topic, see the Single Root IO Virtualization and Open vSwitch Hands-On Lab Tutorials. You can find detailed instructions on how to set up SR-IOV VFs on the host in this SR-IOV Configuration Guide and the video Creating Virtual Functions using SR-IOV. But to get you started, once you have enabled iommu=pt and intel_iommu=on as kernel boot parameters, if you are running a Linux kernel that is at least 3.8.x, to initialize SR-IOV VFs issue the following command:

     #echo 4 > /sys/class/net/<device name>/device/sriov_numvfs

Once an SR-IOV NIC VF is created on the host, the driver/OS assigns a MAC address and creates a network interface for the VF adapter.

Parameters

When evaluating the advantages and disadvantages of each insertion method, I looked at the following:

Host device model
PCI device information as reported in the VM
Link speed as reported by the VM
Host driver
Guest driver
Simple performance characteristics using iperf
Ease of setup

Host Device Model

This is the device type that is specified when the SR-IOV network adapter is inserted into the VM. In the virt-manager GUI, the following typical options are available:

Hypervisor default (which in our configuration defaulted to rtl8139
rtl8139
e1000
virtio

Additional options were available on our test host machine, but they had to be entered into the VM XML definition using # virsh edit. I additionally evaluated the following:

ixgbe
i82559er

VM Link Speed

I evaluated link speed of the SR-IOV VF network adapter in the VM using the following command:

     # ethtool eth1 | grep Speed

VM Link Speed

Host Network Driver

This is the driver that the KVM Virtual Machine Manager (VMM) uses for the NIC as displayed in the <driver> XML tag when I ran the following command on the host after starting the VM:

     # virsh dumpxml sr-iov-vf-testvm | grep -w hostdev -A9

Host Network Driver

Guest Network Driver

This is the driver that the VM uses for the NIC. I found the information by first determining the SR-IOV NIC PCI interface information in the VM:

     # lshw -c network –businfo

Guest Network Driver

Using this PCI bus information, I then ran the following command to find what driver the VM had loaded into the kernel for the SR-IOV NIC:

     # lspci -vmmks 00:03.0

Guest Network Driver

Performance

Because this is not a performance-oriented paper, this data is provided only to give a rough idea of the performance of different configurations. The command I ran on the server system was

     # iperf -s -f m

Performance

And the client command was:

     # iperf -c <server ip address> -f m -P 2

Performance

I only did one run with the test VM as the server and one run with the test VM as a client.

Ease of Setup

This is an admittedly subjective evaluation parameter. But I think you’ll agree that there was a clear loser: the option of inserting the SR-IOV VF as a PCI passthrough device.

SR-IOV Virtual Function PCI Passthrough Device

The most basic way to connect an SR-IOV VF to a KVM VM is by directly importing the VF as a PCI device using the PCI bus information that the host OS assigned to it when it was created.

Using the Command Line

Once the VF has been created, the network adapter driver automatically creates the infrastructure necessary to use it.

Step 1: Find the VF PCI bus information.

In order to find the PCI bus information for the VF, you need to know how to identify it, and sometimes the interface name that is assigned to the VF seems arbitrary. For example, in the following figure there are two VFs and the PCI bus information is outlined in red, but it is impossible to determine from this information which physical port the VFs are associated with.

     # lshw -c network -businfo

Find the VF PCI bus information.

The following bash script lists all the VFs associated with a physical function.

#!/bin/bash

NIC_DIR="/sys/class/net"
for i in $( ls $NIC_DIR) ;
do
	if [ -d "${NIC_DIR}/$i/device" -a ! -L "${NIC_DIR}/$i/device/physfn" ]; then
		declare -a VF_PCI_BDF
		declare -a VF_INTERFACE
		k=0
		for j in $( ls "${NIC_DIR}/$i/device" ) ;
		do
			if [[ "$j" == "virtfn"* ]]; then
				VF_PCI=$( readlink "${NIC_DIR}/$i/device/$j" | cut -d '/' -f2 )
				VF_PCI_BDF[$k]=$VF_PCI
				#get the interface name for the VF at this PCI Address
				for iface in $( ls $NIC_DIR );
				do
					link_dir=$( readlink ${NIC_DIR}/$iface )
					if [[ "$link_dir" == *"$VF_PCI"* ]]; then
						VF_INTERFACE[$k]=$iface
					fi
				done
				((k++))
			fi
		done
		NUM_VFs=${#VF_PCI_BDF[@]}
		if [[ $NUM_VFs -gt 0 ]]; then
			#get the PF Device Description
			PF_PCI=$( readlink "${NIC_DIR}/$i/device" | cut -d '/' -f4 )
			PF_VENDOR=$( lspci -vmmks $PF_PCI | grep ^Vendor | cut -f2)
			PF_NAME=$( lspci -vmmks $PF_PCI | grep ^Device | cut -f2).
			echo "Virtual Functions on $PF_VENDOR $PF_NAME ($i):"
			echo -e "PCI BDF\t\tInterface"
			echo -e "=======\t\t========="
			for (( l = 0; l < $NUM_VFs; l++ )) ;
			do
				echo -e "${VF_PCI_BDF[$l]}\t${VF_INTERFACE[$l]}"
			done
			unset VF_PCI_BDF
			unset VF_INTERFACE
			echo ""
		fi
	fi
done

With the PCI bus information from this script, I imported a VF from the first port on my Intel Ethernet Controller X540-AT2 as a PCI passthrough device.

PCI passthrough device

Step 2: Add a hostdev tag to the VM.

Using the command line, use # virsh edit <VM name> to add a hostdev XML tag to the machine. Use the host machine PCI Bus, Domain, and Function information from the bash script above for the source tag’s address domain, bus, slot, and function attributes.

# virsh edit <name of virtual machine>
# virsh dump <name of virtual machine><domain>
…
<devices>
…
<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source></hostdev>
…
</devices>
…
</domain>

Once you exit the virsh edit command, KVM automatically adds an additional <address> tag to the hostdev tag to allocate the PCI bus address in the VM.

Step 3: Start the VM.

     # virsh start <name of virtual machine>

Start the VM.

Using the GUI

Note: I have not found an elegant way to discover the SR-IOV PCI bus information using graphical tools.

Step 1: Find the VF PCI bus information.

See the commands from Step 1 above.

Step 2: Add a PCI host device to the VM.

Once you have the host PCI bus information for the VF, using the virt-manager GUI, click Add Hardware.

Add a PCI host device to the VM.

After selecting PCI Host Device, you’ll see an array of PCI devices shown that can be imported into our VM.

Add a PCI host device to the VM.

Give keyboard focus to the Host Device drop-down list, and then start typing the PCI Bus Device Function information from the bash script above, substituting a colon for the period (‘03:10:0’ in this case). After the desired VF comes into focus, click Finish.

Add a PCI host device to the VM.

The PCI device just imported now shows up in the VM list of devices.

Add a PCI host device to the VM.

Step 3: Start the VM.

Add a PCI host device to the VM.

Summary

When using this method of directly inserting the PCI host device into the VM, there is no ability to change the host device model: for all NIC models, the host used the vfio driver. The Intel Ethernet Servers Adapters XL710 and X710 adapters used the i40evf driver in the guest, and for both, the VM PCI Device information reported the adapter name as “XL710/X710 Virtual Function.” The Intel Ethernet Controller X540-AT2 and Intel 82599 10 Gigabit Ethernet Controller adapters used the ixgbevf driver in the guest, and the VM PCI device information reported “X540 Ethernet Controller Virtual Function” and “82599 Ethernet Controller Virtual Function” respectively. With the exception of the XL710, which showed a link speed of 40 Gbps, all 10 GB adapters showed a link speed of 10 Gbps. For the X540, 82599, and X710 adapters, the iperf test ran at nearly line rate (~9.4 Gbps), and performance was roughly ~8 percent worse when the VM was the iperf server versus when the VM was the iperf client. While the XL710 performed better than the 10 Gb NICs, it performed at roughly 70 percent line rate when the iperf server ran on the VM, and at roughly 40 percent line rate when the iperf client was on the VM. This disparity is most likely due to the kernel being overwhelmed by the high rate of I/O interrupts, a problem that would be solved by using DPDK.

The one advantage to this method is that it allows control over which VF is inserted into the VM, whereas the virtual network pool of adapters method does not. This method of injecting an SR-IOV VF network adapter into a KVM VM is the most complex to set up and provides the fewest host device model options. Performance is not significantly different than the method that involves a KVM virtual network pool of adapters. However, that method is much simpler to use. Unless you need control over which VF is inserted into your VM, I don’t recommend using this method.

SR-IOV Network Adapter Macvtap

The next way to add an SR-IOV network adapter to a KVM VM is as a VF network adapter connected to a macvtap on the host. Unlike the previous method, this method does not require you to know the PCI bus information for the VF, but you do need to know the name of the interface that the OS created for the VF when it was created.

Using the command-line

Much of this method of connecting an SR-IOV VF to a VM can be done via the virt-manager GUI, but step 1 must be done using the command line.

Step 1: Determine the VF interface name

As shown in the following figure, after creating the VF, use the bash script listed above to display the network interface names and PCI bus information assigned to the VFs.

Determine the VF interface name

With this information, insert the VFs into your KVM VM using either the virt-manager GUI or the virsh command line.

Step 2: Add an interface tag to the VM.

To use the command-line with the macvtap adapter solution, with the VM shut off, edit the VM configuration file and add an ‘interface’ tag with sub-elements and attributes shown below. The interface ‘type’ is ‘direct’, and the ‘dev’ attribute of the ‘source’ sub-element must point to the interface name that the host OS assigned to the target VF. Be sure to specify the ‘mode’ attribute of the ‘source’ element as ‘passthrough’:

# virsh edit <name of virtual machine>
# virsh dump <name of virtual machine><domain>
…
<devices>
…
   <interface type='direct'><source dev='enp3s16f1' mode='passthrough'/></interface>
…
</devices>
…
</domain>

Once the editor is closed, KVM automatically assigns a MAC address to the SR-IOV interface, uses the default model type value of rtl8139, and assigns the NIC a slot on the VM’s PCI bus.

Step 3: Start the VM.

Start the VM.

As the VM starts, KVM creates a macvtap adapter ‘macvtap0’ on the VF specified. On the host, you can see that the macvtap adapter that KVM created for your VF NIC uses a MAC address that is different than the MAC address on the other end of the macvtap in the VM:

     # ip l | grep enp3s16f1 -A1

Start the VM.

The fact that there are two MAC addresses assigned to the same VF—one by the host OS and one by the VM—suggests that the network stack using this configuration is more complex and likely slower.

Using the GUI

With the exception of determining the interface name of the desired VF, all the steps of this method can be done using the virt-manager GUI.

Step 1: Determine the VF interface name.

See the command line Step 1 above.

Step 2: Add the SR-IOV macvtap adapter to the VM.

Using virt-manager, add hardware to the VM.