Channel: Intel Developer Zone Articles

An Overview of Advanced Server-Based Networking Technologies




Introduction

As data center server technology has become more sophisticated over the last decade or so, there has been an accelerating trend toward loading more networking capability onto servers. Alongside this is a wide proliferation of tooling, both in server and network interface hardware, and in the software stack that utilizes them, to speed the delivery of network data to and from where it most needs to be. Several of these tools are undergoing rapid development right now; many of them in open source projects.

Many people are curious about the advanced technologies in networking, even if they aren’t explicitly working in traditional high-volume fields such as telecommunications. This guide aims to differentiate and explain the usage of some of the most prominent of these (mostly open source) tools, at an overview level, as they stand today. We’ll also attempt to explain the scenarios under which various tools or combinations of tools are desirable.

A Starting Point: Why Server Networking Became So Important

In the beginning (say, the 90s, more or less), things were pretty simple. A single server had a network interface card (NIC), and an operating system (OS) driver for that NIC, upon which a desired software network stack (like TCP/IP, usually, although there are many others) would run. The NIC driver used interrupts to demand attention from the OS kernel when new network packets would arrive or depart.

With the arrival of virtualization as a major shift in server usage, this same model was replicated into virtual machines (VMs), wherein virtual NICs serve as the way to get network packets into and out of the virtualized OS. To facilitate this, hypervisors integrated a software-based bridge.

The bridge had the physical NIC connected on a software port, and all of the virtual NICs attached to the same bridge via their own software ports. A common variant of this (today) is Linux* bridging, which is built directly into the Linux kernel. The hypervisor side of the VM NICs could be represented by a number of underlying types of virtual NIC, most commonly a tap device (or network tap) that emulates a link-layer device.
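For a concrete feel of this classic setup, here is a minimal sketch using the Linux iproute2 tools (the interface names eth0, tap0, and br0 are placeholders, and a hypervisor such as KVM would normally create and attach the tap device itself):

# Create a kernel bridge and attach the physical NIC to it
ip link add name br0 type bridge
ip link set eth0 master br0
# Create a tap device to back a VM's virtual NIC and attach it to the same bridge
ip tuntap add dev tap0 mode tap
ip link set tap0 master br0
# Bring everything up
ip link set br0 up
ip link set tap0 up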

Simple virtualization setups still use this method without much of a problem. However, there are two main reasons why something more complex is called for: security and performance.

On the security side, as virtualized infrastructure became more prevalent and dense, the need to separate packet flows from neighboring VMs on the same host became more pressing. If a single VM on a host were to be compromised, it is certainly for the best if it remained on its own isolated switch port, rather than sharing a bridge with many other VMs and the physical NIC. This certainly isn’t the only security issue, but it’s the one we’ll address first.

On the performance side, all of the interrupts from the virtual NICs add up quickly. Simple switching reduces some of this load (but there will be much more on this part of the equation as we move along).

Beyond Bridging: Open vSwitch*

Enter the virtual switch. This is a sophisticated software version of what used to be done only in (usually expensive) hardware components. While there are several software switches in the marketplace today, we will keep our focus on the most prominent open source variant: Open vSwitch* (OvS*).


Figure 1: Graphic: By Goran tek-en (own work derived from: virtual networking)
[CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

OvS offers a robust, extensible switching architecture that enables sophisticated flow control and port isolation. It is used as the underlying basis for many of the more advanced technologies that will be discussed later in this article. Due to its extensibility, it can also be modified by vendors who want to add proprietary technologies of their own. As such, OvS is frequently a critical piece of the modern data center.

It is also programmable, both with OpenFlow* and with its own management system. This makes it a common choice as an underlying technology for VM and container management and orchestration systems (notably OpenStack*, which we will cover later).
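As a hedged illustration of that programmability (the bridge, ports, and flow used here are hypothetical, and real deployments usually drive OvS through an orchestrator rather than by hand), OvS can be managed with its ovs-vsctl tool and programmed with OpenFlow rules via ovs-ofctl:

# Create an OvS bridge and attach a physical port and a VM's tap port
ovs-vsctl add-br br0
ovs-vsctl add-port br0 eth0
ovs-vsctl add-port br0 tap0
# Install a simple OpenFlow rule: forward anything arriving on port 1 out port 2
ovs-ofctl add-flow br0 "in_port=1,actions=output:2"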

Getting Help from Hardware: SR-IOV

Returning to performance as an issue in dense, virtualized environments, let’s take a look at some in-hardware solutions that have arisen to help matters.

Earlier, we mentioned interrupts as being problematic as the number of virtual devices on a server host proliferates. These interrupts are a constant drain on the OS kernel. Furthermore, if you think about how the networking we’ve described so far works, there are multiple interrupts being generated for each packet that arrives or departs on the physical wire.

For example, if a packet arrives at the physical NIC, a processor core running the hypervisor will receive an interrupt to deal with it. At this point the packet is dispositioned using the kernel’s native network stack and forwarded to the virtual NIC it is bound for. Forwarding in this case means copying the packet’s contents into the memory space of that virtual NIC, after which a new interrupt is sent to the core that is running the target VM. Already we’ve doubled the number of interrupts, memory addresses, and processor cores that have to deal with the packet. Finally, inside the VM, yet another interrupt is delivered to the virtual kernel.

Once we add in potential additional forwards from more complicated scenarios, we will end up tripling, or more, the amount of work that the server host has to perform for each packet. This comes along with the overhead of interrupts and context switches, so the drain on system resources becomes very significant, very rapidly.

Relieving Major Bottlenecks in Virtualized Networking

One of the first ways to alleviate this problem was a technology introduced in Intel® network interface cards, called Virtual Machine Device Queues* (VMDQ*). As the name implies, VMDQ allocates individual queues within the network hardware for virtual machine NICs. This way, Layer-2 sorting of packets on the wire can be directly copied to the memory of the VM that it is bound for, skipping that first interrupt on the hypervisor core. Likewise, outbound packets from VMs are copied directly to the allotted queue on the NIC. [Note: VMDQ support is not included in the Linux/kernel-based virtual machine (KVM*) kernel, but it is implemented in several other hypervisors.]

This removal of Layer-2 sorting from the hypervisor’s work provides a definite improvement over basic bridging to VMs. However, even with VMDQ, the kernel is still operating with interrupt-based network I/O.

SR-IOV: It’s Like Do Not Disturb Mode for Hypervisors

Single root input/output virtualization (SR-IOV) is a mouthful to say, but it is a boon to operators that require high-speed processing of a high volume of network packets. A lot of resources exist to explain SR-IOV, but we will try to distill it to its essence. To use SR-IOV, hardware that implements the specification must be used, and the technology enabled. Usually this capability can be turned on via the server BIOS.

Once available, a hypervisor can equip a given virtual machine with a virtual function (VF), which is a lightweight virtual Peripheral Component Interconnect express (PCIe) device pointer unique to that VM. By accessing this unique handle, the VM will communicate directly with the PCIe NIC on its own channel. The NIC will then directly sort incoming packets to the memory address of the virtual NIC of the VM, without having to communicate with the hypervisor at all. The sole interrupt is delivered directly to the VM’s kernel. This results in substantial performance improvements over bridging, and frees up the host kernel to manage actual workloads, rather than spending overhead on packet manipulation.
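As a rough sketch of how this looks in practice (the interface name and VF count below are examples only, and the exact steps differ by NIC, driver, and distribution), virtual functions are typically created through the physical function's sysfs entry and then passed to VMs as PCIe devices:

# Create 4 virtual functions on the physical function ens1f0
echo 4 > /sys/class/net/ens1f0/device/sriov_numvfs
# Each VF now shows up as its own PCIe device that can be assigned to a VM
lspci | grep "Virtual Function"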


Figure 2: Diagram of SR-IOV. 1=Host, 2=VM, 3=Virtual NIC, 4 = Hypervisor (for example, KVM*), 5=Physical NIC, 6 = Physical Function, 7 = Virtual Function, 8 = Physical Network.
Graphic: By George Shuklin (own work)
[CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

This performance boost does come with a caveat that should be well understood. In completely bypassing the host OS kernel, there is no easy way to filter or internally route traffic that is delivered over SR-IOV VFs. So, for example, virtual NICs won’t be plumbed into an OvS, where they can be directed out onto specific VLANs or otherwise manipulated. For another example, they can’t be firewalled by the host.

What this means is that using SR-IOV technology should be a careful and deliberate choice: to favor speed over flexibility, in certain ways. While it is the fastest way to deliver traffic over a shared physical NIC to VMs, there are other, more flexible methods of doing so that also avoid interrupt proliferation.

Power to the User: Data Plane Development Kit (DPDK)

We’ve discussed performance quite a bit at this point. Let’s return for a moment to the question of security. When packets are all processed within the host server’s kernel, this results in a somewhat weaker security profile. A single VM that becomes compromised could conceivably elevate privilege and have access to all packet flows on the host. Just as we run virtual machines in user mode to prevent that privileged access, we can remove network packet processing from the kernel and into user mode as well. As an additional benefit, we can improve performance at the same time by polling physical network devices instead of requiring them to interrupt the host kernel.

The way we accomplish these results is with the Data Plane Development Kit (DPDK). The DPDK is a set of libraries that can be used to speed applications that perform packet processing. Therefore, to take advantage of the DPDK, we need to run applications that are programmed and linked with these libraries. Fortunately, one application that can be relatively easily enabled for DPDK usage is OvS!

Major Linux distributions such as CentOS* and Ubuntu* provide DPDK-enabled versions of OvS in their distribution repositories for easy installation. Once installed and configured correctly, the DPDK-enabled OvS supports user space data paths. Physical NICs that are configured for use with the DPDK driver, instead of the normal kernel driver, do not route traffic through the kernel at all. Instead, traffic is handled directly by OvS outside of the kernel.


Figure 3: Output from an OvS* command that shows DPDK*-enabled ports. The top ‘dpdk0’ port is a physical interface, while ‘vudf7d6f78-36’ is a dpdkvhostuser port attached to a running VM.

Also, the DPDK-enabled OvS can support poll mode drivers (PMD). When a NIC device, controlled by the DPDK driver, is added to an OvS bridge, a designated processor core or cores will assume the responsibility of polling the device for traffic. There is no interrupt generated. VMs added to the same bridge in OvS will use the ‘dpdkvhostuser’ port (or the newer ‘dpdkvhostuserclient’) to enable the virtual NIC. On Linux/KVM, the VM will use the well-known virtio driver to represent the port. The dedicated core(s) will handle traffic direction without disrupting other cores with floods of interrupts, especially in a high-volume environment.
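A minimal sketch of wiring this together (assuming an OvS build with DPDK support and a NIC already bound to a DPDK-compatible driver; the bridge name, port names, and PCI address are placeholders, and the exact syntax varies between OvS releases):

# Create a user space (netdev datapath) bridge
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
# Add a physical NIC controlled by the DPDK driver
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:01:00.0
# Add a vhost-user port that a VM's virtio NIC will connect to
ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 type=dpdkvhostuser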

The DPDK also utilizes HugePages* on Linux to directly access memory regions in an efficient manner. When using the DPDK, virtual machine memory is mapped to HugePage regions, which OvS/DPDK is then able to use for direct reads and writes of packet contents. This memory utilization further speeds packet processing.
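For example (a sketch only; the page count, page size, and mount point are illustrative and depend on the deployment), 2 MB HugePages can be reserved and mounted on the host like this:

# Reserve 1024 x 2 MB HugePages and mount the hugetlbfs filesystem
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages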

OvS flexibility enables a high-performing virtual networking environment while still being able to route and filter packets. Note that this routing and filtering must be accomplished via other user-space processes, since the OS kernel is not receiving or viewing the packet flows.

The DPDK offers far more than just the PMD and functionality with OvS. These are known use cases that work in high-volume networking installations. Any network application can be coded to use the DPDK and benefit from the direct delivery and retrieval of packet contents to its own memory addresses. In the next segment, we’ll look at a technology that also uses the DPDK (but not OvS), to power an innovative packet processing engine.
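To give a feel for what "coded to use the DPDK" means, the heart of a poll mode application is just a tight receive/transmit loop. The sketch below is illustrative rather than a complete program: mempool creation, port and queue setup, and error handling are omitted, and the port number, queue number, and function name are arbitrary choices for this article rather than anything mandated by the DPDK.

#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Illustrative polling loop: repeatedly poll port 0, queue 0 for a burst of
 * packets and immediately transmit them back out (simple I/O forwarding). */
static void poll_loop(void)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* No interrupts: the dedicated core actively polls the NIC */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        uint16_t nb_tx = rte_eth_tx_burst(0, 0, bufs, nb_rx);

        /* Free any packets the NIC could not accept for transmission */
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(bufs[nb_tx++]);
    }
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)   /* initialize the DPDK environment */
        return -1;
    /* ... mempool creation and port/queue configuration would go here ... */
    poll_loop();
    return 0;
}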


Figure 4: The many technologies that DPDK interrelates. Shown here supporting OvS* and VPP switches, along with others, are different varieties of NICs, accelerators, and field programmable gate array (FPGA) devices.
[Graphic: Intel]

The Software Data Plane: FD.io* and VPP

Our next technology is actually software that has been around since 2002, running on specialized networking hardware. With the most up-to-date and powerful data center server offerings, however (such as the Intel® Xeon® Scalable processor family), it is capable of performing very well on general-purpose servers. It is also now open source, enabling a rapidly-growing development community to make it even better. The technology is called vector packet processing (VPP), an innovative take on dealing with high-volume network traffic.

For clarity around names: FD.io* (pronounced Fido by its community) stands for Fast Data–Input/Output, and is the name of the Linux Foundation* project that sponsors VPP development, along with related technologies and development libraries. Sometimes we see FD.io, VPP, and FD.io VPP used interchangeably, even though they are not all quite the same thing. Throughout this article, we will primarily refer to VPP specifically.

VPP functions by working with bursts of packets, rather than individual packets at a time. It processes this burst, or frame, or vector of packets on a single processor core, which allows for efficient use of the core’s instruction cache. This works well because, essentially, the same set of operations needs to be performed on each packet regardless of its content (read the header, determine next-hop, and so on).

VPP also relies on the statistical likelihood that many of the packets in a given frame are part of the same communication, or at least bound for the same location. This results in efficient use of the core’s data cache in a similar fashion as the instructions. By exploiting the processor cache well, repetitive instructions and data are utilized to maximize packet throughput.

VPP is extensible, using a graph node architecture that can accommodate externally developed plugins as full partners. For example, the initial node that an incoming frame encounters is solely tasked with sorting traffic by protocol: IPv4, IPv6, address resolution protocol (ARP), multiprotocol label switching (MPLS), and so on. The frame isn’t broken up, but the sorter ascertains the most efficient node for that grouping of packets—if the majority are IPv4, then the frame will be forwarded to the IPv4 node. Frames move to the next node that is assigned to that particular protocol’s input processing role.


Figure 5: Some of the native VPP nodes displayed in the node graph layout. DPDK underpins and controls the physical NICs.

This process continues with the packets moving like a train through the nodes of the graph until they encounter the output node, where they are forwarded to their destinations. Plugins can be written that insert into this graph of nodes at any point a developer desires, allowing for rapid, specialized processing.
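As a small, hedged example (command names can vary slightly between VPP releases), the node graph and its per-node behavior can be inspected from the VPP command line:

vppctl show vlib graph    # list the graph nodes and how they connect
vppctl show runtime       # per-node vector and call statistics while traffic flows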

VPP already integrates many components, and can fully replace software switches and routers with its own variants. It is written to take full advantage of DPDK and uses the dpdkvhostuser semantics natively. Like DPDK, it is also a set of libraries that developers can use to improve their network application performance. The goals of the FD.io project teams are ambitious, but achievable—they look forward to achieving up to 1Tb/s throughput on an industry standard, two-socket server with Intel Xeon Scalable processors.

Moving Up the Stack: Network Function Virtualization (NFV)

We will very briefly touch on some higher-level technologies that utilize those we already discussed. Most details here are outside the scope of this article, but links are included, should you wish to dive further into this realm.

Network Function Virtualization

The compelling market advantage of dense virtualization extends to more than just general-purpose computing. Just like OvS and VPP implement a hardware network switch in software, many other purpose-built network components that have traditionally been sold as discrete pieces of hardware can be implemented and run in software as virtual network functions, or VNFs. These run in specialized virtual machines, and each VNF specializes in a particular network function. (We’ll note here that VNFs themselves can greatly benefit in performance from being developed with FD.io libraries!)

Examples of typical VNFs include firewalls, quality-of-service implementations, deep packet inspectors, customer premises equipment (CPE) and similar high-capacity packet processing functions. Putting together an infrastructure that supports the use of VNFs results in NFVI: network function virtualization infrastructure.

There are many components to a proper NFVI. We will review some of them here, but a full examination of NFVI is outside of our scope. It is enough, here, to realize that the technologies we have introduced have combined to make NFV a reality in the industry. These technologies underpin much of what constitutes NFVI.

On the Control Plane: Software-Defined Networking

At the heart of many NFVI deployments is a software-defined networking (SDN) infrastructure. We mentioned earlier that OvS can be programmed remotely via OpenFlow or native commands. An SDN control surface can exploit this capability to allow for the automatic creation, update, and removal of distributed networks across a data center infrastructure. The controller can then itself expose an application programming interface (API), and allow automated manipulation of the entire data center’s virtualized network infrastructure.

SDNs are core components of many cloud computing installations for this reason.

OpenDaylight* is an open source example of an SDN. Another project of the Linux Foundation, it can work with OvS (among many other technologies) on the control plane, and can be driven by the OpenStack Neutron* cloud networking service.

Virtualized Infrastructure Management: Orchestrating NFVI

Though OpenStack is mostly known as a cloud computing and cloud storage system, in the NFVI world it can function as a virtualized infrastructure management (VIM) solution. When combined with other NFVI components and layered over an SDN, which is in turn using the advanced networking capabilities outlined earlier in this article, a VIM can work as a full-featured NFV factory, generating and scheduling VNFs in response to demand, and tying them together in a process known as service chaining.

To put all that together, if OpenStack were laid over an OpenDaylight SDN, which in turn was in control of servers enabled with OvS and DPDK, using KVM for virtualization, we would have an NFV-capable infrastructure.

VIMs are not limited to OpenStack. For example, there is much interest in Kubernetes* as a VIM. Kubernetes is newer and natively works with container technology, which can be a good fit for VNFs in some cases. It is less feature-rich than OpenStack, and requires additional tooling choices for managing virtual networks and other functions, but some appreciate this leaner approach.

The aim of the Open Platform for Network Function Virtualization* (OPNFV*) project is to specify and create reference platforms for exactly this kind of infrastructure.

Summary

Clearly, the world of advanced networking is burgeoning with advancements and rapid development. Dense virtualization has offered both challenge and opportunity for network throughput. It is an exciting time for networking on commodity data center server components.

OvS implements a sophisticated software switch that is fully programmable. When it is built against the DPDK, it achieves high throughput by enabling direct insertion of packets off the wire into the memory space of the VMs attached to it, and removes the performance and security problems of kernel-mode, interrupt-driven network stack usage.

SR-IOV offers phenomenal performance by giving each VM on a system its own lightweight PCIe extension. Hardware enabled for SR-IOV can bypass the host OS entirely and place packets directly into the memory space of the VM they are bound for. This offers very high throughput, but does come at the cost of some reduced flexibility for on-host routing and filtering.

FD.io/VPP, like OvS, works with DPDK, and is an innovative software technique for maximizing the use of a server’s CPU caching. It also offers ready-made switches and routers (and many other components), along with libraries for developers to use its high-throughput capabilities for their own networking applications.

Up the stack, NFV infrastructure works with software-defined networking control plane deployments, such as OpenDaylight, to automate the creation, update, and removal of networks across the data center. SDNs can work with OvS-DPDK and VPP to maximize throughput across the data center on commodity servers. NFVI is managed by a virtualized infrastructure manager (such as OpenStack), to deploy and maintain service chains of VNFs.

We hope this overview is of use to the curious; it is, of course, very high level and general. We offer the following selection of linked resources to further your understanding of these technologies.

Resources

Open vSwitch

SR-IOV

DPDK

FD.io/VPP

  • FD.io - The Fast Data Project: The Universal Dataplane, https://fd.io/.

NFV/SDN/Further Up the Stack

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation. He only recently took on NFV and advanced networking as a specialty, pairing it with an extensive background in cloud computing. He works for the Intel Developer Relations Division, in the Data Center Scale Engineering team.  You can reach him for questions, comments, or just to say howdy, at jim.chamings@intel.com.


How to analyze MKL code using Intel® Advisor 2018


Introduction

Vectorization Advisor is a vectorization optimization tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.

Intel provides an optimized, ready-to-use math library, the Intel® Math Kernel Library (Intel® MKL, https://software.intel.com/en-us/mkl), containing a rich set of widely used mathematical routines that help achieve higher application performance and reduce development time.

Intel® Advisor 2018 is now equipped with a special feature that enables developers to gain insight into how effectively they are using Intel® MKL.

A simple matrix multiplication code is used here to demonstrate the feature. To explain the different ways of implementing matrix multiplication and to identify the best one, we name the code snippets solutions 1, 2, 3, and 4, respectively:

  • Solution 1 is the traditional way of multiplying two matrices using three nested for loops. However, it turns out to be inefficient code, since the innermost loop does not get vectorized. The Intel® Advisor analysis below shows why, and what to do about it.
  • Solution 2 is a comparatively efficient way of multiplying the matrices by transposing one of them (for better memory access and reuse of data loads).
  • Solution 3 is yet another way of achieving the task efficiently; it uses a cache-blocking technique for better cache usage.
  • Solution 4 simply replaces these implementations with Intel® MKL routines, which are optimized for Intel architecture and bring a dramatic increase in performance.

Step 1: 

Create a new project and choose the application you want to analyze.

Scroll down and be sure that the “Analyze MKL loops and function” checkbox is checked. It is checked by default; if it is not checked, Intel® Advisor will not dive into the MKL code.

Step 2:

Now you can analyze your application by collecting and analyzing data using the Intel® Advisor “Survey Target” tab on the left-hand side of the pane. Note: this might take a little longer!

Observation 1: Look at the Summary tab, which gives you an overview of the application’s statistics: the top time-consuming loops, loop metrics that break down the time spent in user code versus MKL code, and recommendations that reveal hints for achieving better efficiency.

Observation 2: You can see that the loop at matmul.cpp:18 is the top time-consuming loop and that it is scalar. This loop is part of solution 1.

for (int k = 0; k < size; k++)
{
    C[i*size + j] += A[i*size + k] * B[k*size + j];
}

Why was the loop not vectorized?

Due to an assumed cross-iteration dependency (the index of C[] does not change with k), the loop remains scalar. You can follow the Intel® Advisor Recommendations tab, which suggests running a Dependencies analysis.

What should be done to get this vectorized?

Note: Follow the Intel® Advisor recommendation to vectorize the loop by running a Dependencies analysis. In the Performance Issues column, Intel® Advisor points out that there is an assumed dependency and that the compiler did not auto-vectorize the loop.

Solution: the Dependencies analysis determines whether the assumed dependency is real. If it is not, you can then make the source code changes needed to vectorize the loop.

Go back and take a look at the next hotspot: the loop at matmul.cpp:30.

You will notice that the loop at line 30 takes a significant amount of time. Note that this loop is found in solution 2. Unlike the previous loop, there is no cross-iteration dependency for C, and hence the loop is vectorized.

What can be done to improve the efficiency?

Generally, outer loops are not vectorized. Interchanging the loops so that the vectorizable loop becomes the innermost one lets the compiler vectorize it, and the efficiency increases.
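As a hedged sketch of what such an interchange can look like (reusing the variable names from the snippet above; this is illustrative and not necessarily the exact source of solution 2):

// i-k-j ordering: the innermost loop now walks j, so B[k*size + j] and
// C[i*size + j] are accessed contiguously and A[i*size + k] is loop invariant,
// letting the compiler vectorize the inner loop without an assumed dependency.
for (int i = 0; i < size; i++)
    for (int k = 0; k < size; k++)
        for (int j = 0; j < size; j++)
            C[i*size + j] += A[i*size + k] * B[k*size + j];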

Go back and take a look at the next hotspot: the loop at matmul.cpp:48.

Note that this loop is found in solution 3. It is the innermost loop of the blocked implementation. The original code is not cache friendly, and blocking is the classic way of improving cache usage.
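A hedged sketch of the blocking idea (the block size BS is a tuning parameter, std::min requires <algorithm>, and the real solution 3 source may differ):

// Blocked (tiled) matrix multiplication: work on BS x BS sub-blocks so that the
// active portions of A, B, and C stay resident in cache while they are reused.
const int BS = 64;  // block size: tune for the target cache
for (int ii = 0; ii < size; ii += BS)
    for (int kk = 0; kk < size; kk += BS)
        for (int jj = 0; jj < size; jj += BS)
            for (int i = ii; i < std::min(ii + BS, size); i++)
                for (int k = kk; k < std::min(kk + BS, size); k++)
                    for (int j = jj; j < std::min(jj + BS, size); j++)
                        C[i*size + j] += A[i*size + k] * B[k*size + j];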

 

The above three solutions require developer effort to improve performance. However, Intel® MKL provides a ready-made set of routines that already apply these optimization techniques (such as parallelization and vectorization). Notice the dramatic change in performance when these implementations are replaced by Intel® MKL routines. Intel® Advisor can now dive deep into MKL code and show whether you are using these routines efficiently.
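As a minimal, hedged sketch of such a replacement (assuming the same square, row-major, double-precision matrices used above), the whole triple loop collapses into a single call to the standard Intel® MKL cblas_dgemm routine:

#include <mkl.h>

// C = 1.0 * A * B + 0.0 * C for square row-major matrices of dimension 'size'
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            size, size, size,   // m, n, k
            1.0, A, size,       // alpha, A, lda
            B, size,            // B, ldb
            0.0, C, size);      // beta, C, ldc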

Return to the Survey and Roofline tab, and press the MKL button on the top pane. This displays all the MKL loops along with their statistics.

The self-time of a loop is an ideal metric for measuring its efficiency; you can see that the loops in [MKL BLAS] and [MKL SERVICE] take minimal time compared to the previous implementations.

Roofline Analysis of loops: 

A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks.

Collect this data by pressing the “Run Roofline” button on the left-hand side.

matmul.cpp:18    0.11 GFLOPS     (matmul_naive)         red dot
matmul.cpp:30    0.76 GFLOPS     (matmul_transposed)    yellow dot
matmul.cpp:48    0.66 GFLOPS     (matmul_blocked)       yellow dot
MKL BLAS         8.2 GFLOPS      (cblas_dgemm)          green dot
MKL BLAS         31.94 GFLOPS    (cblas_dgemm)          green dot

The size and color of the dots in Intel® Advisor’s Roofline chart indicate how much of the total program time a loop or function takes.

Small green dots take up relatively little time and are likely not worth optimizing. Large red dots consume most of the execution time, so the best candidates for optimization are the large red dots with a large amount of space between them and the topmost roofs.

Summary: Discover how the core math functions from the Intel® Math Kernel Library (Intel® MKL) can improve the performance of your application; the analysis is now made easy with Intel® Advisor’s new feature for analyzing MKL code.

 

 

Configuration and Performance of Vhost/Virtio in Data Plane Development Kit (DPDK)


Introduction to Vhost/Virtio

Vhost/virtio is a semi-virtualized device abstraction interface specification that has been widely applied in QEMU* and kernel-based virtual machines (KVM). It is usually called virtio when used as a front-end driver in a guest operating system, or vhost when used as a back-end driver in a host. Compared with pure software input/output (I/O) simulation on a host, virtio can achieve better performance, and it is widely used in data centers. The Linux* kernel provides the corresponding device drivers, virtio-net and vhost-net. To help improve data throughput performance, the Data Plane Development Kit (DPDK) provides a user-space poll mode driver (PMD), virtio-pmd, and a user-space implementation, vhost-user. This article describes how to configure and use vhost/virtio using a DPDK code sample, testpmd, and lists performance numbers for each vhost/virtio Rx/Tx path.


A typical application scenario with virtio

Receive and Transmit Paths

In DPDK’s vhost/virtio, three Rx (receive) and Tx (transmit) paths are provided for different user scenarios. The mergeable path is designed for large packet Rx/Tx, the vector path for pure I/O forwarding, and the non-mergeable path is the default path if no parameter is given.

Mergeable Path

The advantage of this receive path is that the vhost can organize independent mbufs from the available ring into a linked list in order to receive packets of larger size. This is the most widely adopted path, as well as the focus of the DPDK development team’s performance optimization work in recent months. The receive and transmit functions used by this path are configured as follows:

eth_dev->tx_pkt_burst = &virtio_xmit_pkts;
eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts;

The path can be started by setting the flag VIRTIO_NET_F_MRG_RXBUF during the connection negotiation between the vhost and QEMU. Vhost-user supports this function by default and the commands for enabling this feature in QEMU are as follows:

qemu-system-x86_64 -name vhost-vm1
…..-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=on \
……

DPDK will choose the corresponding Rx functions according to the VIRTIO_NET_F_MRG_RXBUF flag:

if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
                        eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts;
else
                        eth_dev->rx_pkt_burst = &virtio_recv_pkts;

The difference between the mergeable path and the two other paths is that the value of rte_eth_txconf->txq_flags does not matter as long as mergeable is on.

Vector Path

This path utilizes single instruction, multiple data (SIMD) instructions in the processor to vectorize the data Rx/Tx actions. It has better performance in pure I/O packet forwarding scenarios. The receive and transmit functions of this path are as follows:

eth_dev->tx_pkt_burst = virtio_xmit_pkts_simple;
eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;

Requirements for using this path include:

  1. The platform processor should support the corresponding instruction set. An x86 platform should support Streaming SIMD Extensions 3 (SSE3), which can be checked with rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE3) in DPDK. An ARM* platform should support NEON*, which can be checked with rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON) in DPDK.
  2. The mergeable path on the Rx side should be disabled. The command to turn off this feature in QEMU is as follows:
    	qemu-system-x86_64 -name vhost-vm1…..
    	-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=off \
    	……
    	
  3. Offload functions, including VLAN offload, SCTP checksum offload, UDP checksum offload, and TCP checksum offload, are not enabled.
  4. rte_eth_txconf->txq_flags needs to be set to 1. For example, in the testpmd sample provided by DPDK, we can configure virtio devices in the virtual machine with the following command:

testpmd -c 0x3 -n 4 -- -i --txqflags=0xf01

It can be seen that the function of the vector path is relatively limited, which is why it didn’t become a key focus of DPDK performance optimization.

Non-mergeable Path

The non-mergeable path is rarely used. Here are its receive and transmit paths:

eth_dev->tx_pkt_burst = &virtio_xmit_pkts;
eth_dev->rx_pkt_burst = &virtio_recv_pkts;

The following configuration is required for the application of this path:

  1. Mergeable disabled in the Rx direction.
  2. rte_eth_txconf->txq_flags needs to be set to 0. For example, in testpmd, we can configure virtio devices in the virtual machine with the following command:
    #testpmd -c 0x3 -n 4 -- -i --txqflags=0xf00

PVP Path Performance Comparisons

Performance of the different DPDK vhost/virtio receive and transmit paths are compared using a Physical-VM-Physical (PVP) test. In this test, testpmd is used to generate the vhost-user port as follows:

testpmd -c 0xc00 -n 4 --socket-mem 2048,2048 --vdev 'eth_vhost0,iface=vhost-net,queues=1' -- -i --nb-cores=1

In the VM, testpmd is used to control the virtio device. The test scenarios are shown in the diagram below. The main purpose is to test the north-south data forwarding capability in a virtualized environment. An Ixia* traffic generator sends the 64 byte packet to the Ethernet card at a wire-speed of 10 gigabits per second, the testpmd in the physical machine calls vhost-user to forward packets to the virtual machine, and the testpmd in the virtual machine calls virtio-user to send the packet back to the physical machine and finally back to Ixia. Its send path is as follows:

IXIA→NIC port1→Vhost-user0→Virtio-user0→NIC port1→IXIA

The send path

I/O Forwarding Throughput

Using DPDK 17.05 with an I/O forwarding configuration, the performance of different paths is as follows:


I/O Forwarding Test Results

In the case of pure I/O forwarding, the vector path has the best throughput, almost 15 percent higher than mergeable. 

Mac Forwarding Throughput

Under the MAC forwarding configuration, the forwarding performance of different paths is as follows:


MAC Forwarding Test Results

In this case, the performance of the three paths is almost the same. We recommend using the mergeable path because it offers more functionality.

PVP MAC Forwarding Throughput 

The chart below shows the forwarding performance trend of PVP MAC on the x86 platform starting with DPDK 16.07. Because of the mergeable path’s broader application scenarios, DPDK engineers have been optimizing it since then, and the performance of PVP has improved by nearly 20 percent on this path.


PVP MAC Forwarding Test Results

Note: The performance drop on DPDK 16.11 is mainly due to the overhead brought by the new features added, such as vhost xstats and the indirect descriptor table.

Test Bench Information

CPU: Intel® Xeon® CPU E5-2680 v2 @ 2.80GHz

OS: Ubuntu 16.04

About the Author

Yao Lei is an Intel software engineer. He is mainly responsible for the DPDK virtualization testing-related work.

Notices

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information go to http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

Intel’s 3D XPoint™ Technology Products – What’s Available and What’s Coming Soon


Overview

3D XPoint™ technology, jointly developed by Intel and Micron, was first announced in 2015 in a news release titled Intel and Micron Produce Breakthrough Memory Technology. Are you interested in 3D XPoint technology and want to know what products Intel is shipping today or planning for the near future? Read on to learn about memory and storage products that are available as of October 2017, and persistent memory DIMMS, which are expected to ship in 2018.

Intel® Optane™ Technology

The Intel® Optane™ technology brand combines 3D XPoint™ Memory Media, Intel Memory and Storage Controllers, Intel Interconnect IP, and Intel® software. Together these building blocks provide an unmatched combination of high throughput, low latency, high Quality of Service, and high endurance for consumer and data center workloads demanding large capacity and fast storage.

Today you can buy two different products built on Intel Optane technology:

  • Intel® Optane™ SSD DC P4800X Series – for the data center
  • Intel® Optane™ memory – for a supercharged PC experience  

Intel® Optane™ SSD DC P4800X Series

The Intel Optane SSD DC P4800X Series is the first product to combine the attributes of memory and storage. This product provides a new storage tier that breaks through the bottlenecks of traditional NAND storage to accelerate applications and enable better per server performance.

The Intel® Optane™ SSD DC P4800X Series

The Intel Optane SSD DC P4800X Series accelerates applications for fast caching and storage to increase scale per server and reduce transaction costs for latency-sensitive workloads. In addition, the Intel Optane SSD DC P4800X Series enables data centers to deploy bigger and more affordable data sets that will enable them to gain new insights from large memory pools through the use of Intel® Memory Drive Technology. Check out the Intel Optane SSD DC P4800X Series website for more information about this product, or read the Intel Optane SSD DC P4800X Series product brief for details about performance, latency, Quality of Service, and endurance. 

Intel® Memory Drive Technology

Intel Memory Drive Technology increases memory capacity beyond DRAM limitations and delivers DRAM-like performance in a completely transparent manner to the operating system and application. This technology is a software middle-layer product, much like a hypervisor for the memory subsystem within a server. When you pair this software with one or more Intel Optane SSD DC P4800X Series and system DRAM on a supported Intel® Xeon® processor-based platform, the combination appears as a single memory pool.

Extend Memory with an Intel® Optane™ SSD for a Bigger Memory Footprint

You don’t need to change your applications to take advantage of this expanded memory pool, because the software actively manages data transfers between the DRAM and SSD to provide the optimal combination of capacity and performance. Artificial intelligence (AI) and database applications are obvious beneficiaries of this technology. For product information, read the Intel Memory Drive Technology product brief.

Intel® Optane™ Memory

For the PC consumer market, Intel Optane memory contains 3D XPoint Memory Media in a standard M.2 form factor module. It plugs into a standard M.2 connector on PCs built using the 7th gen Intel® Core™ processor platform. Intel Optane memory creates a bridge between DRAM and a SATA hard drive and uses Intel® Rapid Storage Technology (Intel® RST) software to enable accelerated access to important files.

Intel® Optane™ memory module

Intel Optane memory bridges the functionality of DRAM and storage in that it is high performance, byte-addressable, and offers persistent storage. Intel RST learns which files your PC uses frequently and over a short period of time stores these key files in Intel Optane memory, so that the next time you boot, the files are available immediately in memory. Intel Optane memory enables PCs to deliver significantly more performance and faster load times across a broad range of personal computing experiences. It will enable new levels of PC responsiveness for everything from compute-intensive engineering applications to high-end gaming, digital content creation, web browsing, and even everyday office productivity applications. Read What Happens When Your PC Meets Intel Optane Memory? by Intel Executive Vice President Navin Shenoy for an overview of what this technology will mean to PC users.

Many of Intel’s partners have Intel Optane memory-ready motherboards, systems, and products shipping now or planned for the near future.

The Future: What about Persistent Memory DIMMS?

At the May 2017 SAP Sapphire NOW conference, Intel demonstrated live for the first time its future persistent memory solution in a DIMM form factor. This DIMM highlights a commitment to rearchitecting the data center to support the future needs of a data-intensive world driven by the growth of AI, 5G, autonomous driving, and virtual reality. Based on 3D XPoint Memory Media, persistent memory from Intel is a transformational technology that will deliver memory that is higher capacity, affordable, and persistent. Check out the article A New Breakthrough in Persistent Memory Gets Its First Public Demo for more information about the SAP Sapphire demo.

Persistent memory DIMM from Intel

With persistent memory DIMMs, Intel is revolutionizing the storage hierarchy to bring large amounts of data closer to the Intel Xeon processor, which will enable new usage models and make in-memory applications like SAP HANA even more powerful. Persistent memory DIMMs will be available in 2018 as part of an Intel® Xeon® processor Scalable family refresh (codename: Cascade Lake). To get ready for next year’s planned product release, software developers can get started preparing their application for persistent memory today. Visit the Intel® Developer Zone’s Persistent Memory Programming website for the information you need to update your software, and stay tuned for further product details.

Summary

Now you know that you can get Intel Optane technology for the data center or your home PC today. In the data center, you can use the Intel Optane SSD DC P4800X Series to boost caching and storage performance or extend memory with Intel Memory Drive technology. At home, Intel Optane memory enables you to enjoy faster performance for engineering applications, gaming, and digital content creation, as well as everyday activities such as web browsing.

For the future, look forward to persistent memory DIMMS from Intel, planned to be available in 2018 as part of an Intel Xeon processor Scalable family refresh (codename: Cascade Lake).

About the Author

Debra Graham worked for many years as a developer of enterprise storage applications. Now she works to help developers optimize their use of Intel® technology through articles and videos on Intel Developer Zone.

Innovate FPGA Design Contest


Show your creativity and get a free development kit.
Submit your design ideas by December 1st 2017.

FPGAs are changing the world of embedded compute. Every day, people are inventing new ways to use an FPGA to accelerate a critical function, help systems adapt to a changing world, or even add new types of interfaces. 

The competition is open to students, professors, makers, and members of the industry. Teams can showcase their creativity and innovation while interacting with leading industry experts and companies. Successful teams will each receive a powerful Terasic DE10-Nano development kit to use in their design. Competitors will be part of an amazing group that will show the world what is possible with FPGAs!

Register today to compete for prizes and industry recognition! Here’s what you can win:

  • Hundreds of teams, selected by the Innovate FPGA community, will receive a FREE Terasic DE10-Nano kit to begin developing their project.
  • Winners of the Regional Finals will have their travel, meals, and lodging expenses paid to attend the Grand Final event in San Jose, California.
  • Cash prizes totaling over $30,000 will be awarded to the top teams in the Regional and Grand Final events. 
  • Grand Final winners will receive worldwide industry recognition and the opportunity to have their design promoted by Terasic, Intel, and the Innovate FPGA sponsors. 

Be part of the future of embedded compute by visiting the Innovate FPGA website

Feeling creative? Register as a developer and show us what you’ve got!

Register as a Developer

Interested to see what others are working on? Join the community to check out the projects, comment, and vote for your favorite design.

Join the Community

 

Additional Information

Learn about the Terasic DE10-Nano Development Kit

Intel® Cluster Checker Release Notes and New Features


This page provides the current Release Notes for Intel® Cluster Checker. The notes are categorized by year, from newest to oldest, with individual releases listed within each year.

Click a version to expand it into a summary of new features and changes in that version since the last release, and access the download buttons for the detailed release notes, which include important information, such as pre-requisites, software compatibility, installation instructions, and known issues.

You can copy a link to a specific version's section by clicking the chain icon next to its name.

To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2018

Initial Release

Release Notes

  • Added support for Intel(R) Xeon(R) Scalable Processors.
  • Added Framework Definition feature to allow for customization of analysis.
  • Added support for Intel(R) Turbo Boost Technology validation.
  • Added support for analysis from multiple database sources.
  • Updated samples and SDK.
  • Converted documentation to online format.
  • Enhanced Intel(R) Omni-Path Architecture validation.
  • Added OpenFabrics Interfaces support.
  • Enhanced user viewable message output.
  • Bug fixes

Rendering Researchers: Won-Jong Lee


Won-Jong is a research scientist on the Advanced Rendering Technology Team at Intel, where he works on graphics research. Before joining Intel, he worked on ray tracing, advanced computer graphics, real-time rendering, reconfigurable processors, and hardware algorithms for mobile GPUs at the Samsung Advanced Institute of Technology (SAIT) as a principal research scientist and team lead. He received a Ph.D. and an MS in computer science at Yonsei University, where he researched graphics architecture, simulation frameworks, and parallel rendering on GPU clusters. While studying computer science, he also worked on scientific volume visualization during his internship at the National Institute of Advanced Industrial Science and Technology (AIST). He received a BS in computer science and engineering at Inha University, where his interests were real-time graphics and computer architecture. His primary research interests are ray tracing, real-time rendering, global illumination, AR/VR, and machine learning.

Personal Homepage

Rendering Researchers: Michael J. Doyle


Michael joined the Advanced Rendering Technology group at Intel in August 2017. He earned a Bachelor's degree in Computer Science at Trinity College Dublin, Ireland in 2008. In 2014, he earned his Ph.D. degree in Computer Science at Trinity, focusing on fixed-function hardware for ray tracing and BVH construction. Following his Ph.D., he completed a postdoctoral position at Trinity, where he developed FPGA prototypes for visualization and ray-tracing applications. His research interests are in real-time ray tracing, graphics hardware architecture, and reconfigurable hardware design.


Intel® Parallel Computing Center at University of Stuttgart


Principal Investigators:

Guido Reina is a postdoctoral researcher at the Visualization Research Center of the University of Stuttgart (VISUS). He defended his PhD thesis in computer science in 2008 at the University of Stuttgart, Germany. His research interests include rendering and analysis of particle-based datasets, large displays, and the optimization of parallel methods.

Description:

MegaMol was designed as a modular, GPU-centric visualization framework aimed chiefly at interactive post-mortem analysis of particle-based data sets on the personal workstations of researchers. Its flexible design allowed the core framework to reach a mature and stable state while still supporting a rapid prototyping workflow via plugins, which provide cutting-edge algorithms and novel analysis capabilities for application scenarios.

While the design rationale and narrow focus allows MegaMol to have an advantage over more general frameworks, the current trend towards ever-increasing simulation sizes has made post-mortem analysis costly and, in some scenarios, even impossible. With this Intel® Parallel Computing Center for Visualization, we aim to modernize and restructure the MegaMol architecture to scale to current data set sizes and make MegaMol capable of running headless and in situ, either on the simulation nodes or a separate rendering cluster. This means that the data flow and modular composition of a MegaMol instance need to be adapted to make use of OSPRay without additional overhead, especially with regard to data management.

We also plan to port application-specific abstractions for molecular data from MegaMol into OSPRay, to allow for more expressive visualizations. One example is the direct ray tracing of solvent excluded surfaces on the basis of the original particle data, with next to no overhead. Finally, we will investigate how parallelization libraries from Intel, for example the spatial data structures in Embree, can help to better utilize CPU cores for analysis and pre-processing in general.

Related websites:

http://megamol.org
http://www.visus.uni-stuttgart.de/en/institute.html

"Illegal instructions" errors for some Intel® IPP functions


Symptom:
Some Intel® IPP functions may report "illegal instruction" errors on Intel® AVX-512 processor-based systems running the Windows* 7 Service Pack 1 (SP1) OS. The problem only happens on such specific systems.


Reason:
The problem is caused by incorrect code dispatching on these specific systems: Intel® IPP dispatches its Intel® AVX-512 optimized code, which is not supported by that OS.


Solutions:
The problem will be fixed in upcoming Intel® IPP releases; users need to update to a new version of Intel® IPP to get the fix. The workaround for Intel® IPP 2018 and earlier releases is to use the Intel® IPP ippSetCpuFeatures() API to manually dispatch the Intel® AVX2 optimized code on these systems.
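A hedged sketch of that workaround follows; the exact feature bits to clear may vary by Intel® IPP version (ippCPUID_AVX512F is assumed here to be the AVX-512 foundation flag defined in the IPP headers), so consult the ippSetCpuFeatures() documentation for the supported combinations.

#include <ipp.h>

int main(void)
{
    Ipp64u features = 0;

    /* Query the CPU features that Intel IPP detects on this system */
    ippGetCpuFeatures(&features, NULL);

    /* Clear the AVX-512 foundation bit so that only AVX2 (and below) code paths
       are eligible for dispatch (assumed flag name; see the IPP headers) */
    features &= ~(Ipp64u)ippCPUID_AVX512F;

    /* Re-dispatch the optimized Intel IPP code with the reduced feature set */
    ippSetCpuFeatures(features);

    /* ... call Intel IPP functions as usual ... */
    return 0;
}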

Deep Learning for Cryptocurrency Trading


A new potential use case for deep learning is using it to develop a cryptocurrency trader sentiment detector. I am currently developing a sentiment analyzer for news headlines, Reddit posts, and Twitter posts that utilizes Recursive Neural Tensor Networks (RNTN) to provide insight into overall trader sentiment. Trader sentiment is a key factor in determining cryptocurrency price movements. This article discusses the benefits of trader sentiment analysis for cryptocurrencies and the advantages RNTNs offer for sentiment analysis. The long-term vision of this project is to develop an artificial intelligence (AI) cryptocurrency trading bot that can not only consider trader sentiment to make trading decisions but also take advantage of other opportunities such as arbitrage, which is the purchase and sale of an asset to profit from a difference in its price.

In order to understand the technical components of the application, it is important to understand the current state of cryptocurrencies and the blockchain applications that they are based on. This article includes information on where cryptocurrencies derive value and the key characteristics of cryptocurrencies. 

Cryptocurrencies are an emerging currency and digital asset class. Cryptocurrencies act as a medium of exchange. Digital currencies and digital assets are designed on top of blockchains. The cryptographic nature of cryptocurrencies prevents the creation and duplication of cryptocurrency tokens. Cryptocurrencies have a finite supply. Cryptocurrencies derive part of their value because investors believe that the finite supply along with rising demand over time will only lead the price of them to increase. 

Figure 1: Cryptocurrency

Blockchains enable us to record transactions permanently within a distributed ledger. Blockchains allow us to record and conduct transactions of all kinds (exchange of currency/data/service) without the need for a centralized authority. 

Blockchain platforms are highly secure because transactions are automatically recorded and tracked by nodes (machines) on the network. Transactions are processed and recorded only on the ledger once verified by most of the nodes on the network. Transactions are recorded on the ledger with a hash of the date and time stamp on the ledger when they occur. If an entity wants to manipulate or change the information, it becomes impossible because the records would have to be modified on most of the nodes in the decentralized network. Transactions which occur on the blockchain offer many benefits including speed, lower cost, security, fewer errors, elimination of central points of attack and failure. 

Figure 2: Blockchain

The immutable nature of the information and transactions recorded on the blockchain allow Smart Contracts to be programmed. Smart Contracts are computer protocols that facilitate, verify, or carry out a transaction when certain criteria are met. Smart Contracts can interact with other contracts, make decisions, store data, and transfer currency between individuals. Smart Contracts can run exactly as programmed without downtime, censorship, fraud, or any interference from a separate entity. Smart Contracts can be used to improve business processes in every industry, business, and system where a transaction of some sort is occurring between two individuals. 

Blockchains are often referred to as the trust protocol. The unique design of blockchain, through the distributed ledger and Smart Contracts, allows us to conduct transactions that require trust. Investors in blockchain believe that traditional transaction platforms in society can be replaced by decentralized platforms. 

Figure 3: Smart Contract

There are many innovative cryptocurrencies. Some notable and long-standing ones are Bitcoin, Litecoin, Ethereum, Golem, and Siacoin. Bitcoin and Litecoin are both peer to peer payment protocols. Ethereum is a developer platform for Decentralized Applications (DAPPS) which can run Smart Contracts. Golem is a decentralized supercomputer network, which can be used for AI application testing and other tasks requiring high computational power. Siacoin is a decentralized cloud storage network. 

For these decentralized applications, their native cryptocurrencies act as an entry point for utilization of each network. On Ethereum programmed applications cannot be run without offering some Ethereum as a payment to the network. Golem’s supercomputer network cannot be utilized unless the Golem cryptocurrency is offered in exchange. Cloud Storage cannot be implemented on the network unless Siacoin is offered in exchange. 

Cryptocurrency tokens are also offered as a reward or bounty to nodes which are running on the network. For example, owners of bitcoin nodes receive bitcoin as a reward for offering computational power to maintaining the network. Litecoin node owners receive Litecoin as rewards for helping power the Litecoin network. The Ethereum Network also relies on nodes to maintain the network. The nodes create, verify, publish, and propagate information for the Ethereum Blockchain. In exchange for offering their computers as nodes to the Ethereum network, the node owners receive Ethereum in exchange. Golem and Siacoin also work the same way in that node owners receive the Golem and Siacoin, respectively, for maintaining the network. The owners of Golem contribute their spare computational power to keep the blockchain powered supercomputer running. Siacoin node owners offer their spare computational storage to maintain the decentralized storage blockchain. The reward system for node owners incentivizes them to continue to maintain the network. The nodes maintaining the network ensure the continued existence of the cryptocurrency and its value. The more node owners there are the more valuable the cryptocurrency.

Figure 4: Blockchain Nodes

Teams wanting to start their own decentralized applications often launch Initial Coin Offerings (ICOs). During an ICO, the team behind a decentralized application offers a portion of the available tokens for sale to early investors at a low rate. The team also publishes a whitepaper, a technical document that thoroughly discusses all aspects of the platform and its cryptocurrency, to give investors an understanding of the blockchain technology behind the project. The whitepaper lists the benefits the blockchain offers as well as the economic incentives it provides to the community of investors and future node owners. Investors who believe in the team and the project may buy the tokens. The participants in the ICO generate the initial value of the cryptocurrency by investing, which allows the team to use the funds raised to cover development and all other costs associated with the project.  
As a project continues to develop, more investors are attracted and purchase tokens through exchanges. People are incentivized to run nodes on the network because they can be rewarded in its cryptocurrency, and the greater the number of node owners, the stronger the network. If all goes well with the project, the result is a peer-to-peer transaction platform that is significantly faster, more cost-efficient, and better than any other transaction platform out there. The cryptocurrency that serves as the entry point will also, by this stage, be priced far above its value at inception, because it is now utilized on a blockchain platform worth millions, billions, or possibly even trillions of dollars. Whether blockchain platforms will grow to that scale will only be revealed over time. The following are growth rates and prices for some of the cryptocurrencies discussed in this article.

Figure 5: Cryptocurrency Growth Rates

These growth rates demonstrate that significant profit can be made. The key question is how current forecasting technologies can be used to predict price movements. Many blockchain platforms aim to provide enhanced technological solutions to existing inefficiencies in transaction platforms, and a primary way to gauge a project’s prospects is a thorough evaluation of its whitepaper, which gives insight into the technical aspects of the proposed blockchain as well as the overall scope of the project. Unfortunately, there is no standard methodology for tracking or forecasting prices: there are no fundamentals to observe as there are in the stock market, and no quarterly reports on which to base a valuation. Technical analysis can be useful for achieving the best spreads in trades and for taking advantage of arbitrage situations, and it can also be used to predict price movements, but this article focuses on market sentiment, because that is where Deep Learning can be applied most effectively. 

Market sentiment is important for detecting cryptocurrency price movements. Once a cryptocurrency and its blockchain protocol are past the initial development stages, macro events, political events, network upgrades, conferences, partnerships, SegWit upgrades, and trader sentiment tend to create cryptocurrency price movements. One way to gauge these movements is Sentiment Analysis, a subset of Natural Language Processing (NLP). Many cryptocurrency price movements can be attributed to herd instinct, which behavioral finance describes as a mentality characterized by a lack of individual decision-making, causing people to think and act in the same way as the majority of those around them. Price movements therefore tend to be based on market sentiment and the opinions of the communities surrounding a cryptocurrency. For these reasons, I believe that sentiment analysis of news headlines, Reddit posts, and Twitter posts should be the best indicator of the direction of cryptocurrency price movements. 

A prominent technique for Sentiment Analysis today is the Recurrent Neural Network (RNN). An RNN-based analyzer parses a given text and tokenizes the words; the frequency of each word is identified and a bag-of-words representation is created. The subjectivity of each word is then looked up in an existing lexicon, a database of emotional values prerecorded by researchers, and the overall sentiment is computed to classify the text against that lexicon. This approach works well for longer pieces of text but is ineffective at analyzing sentiment in shorter texts such as news headlines, Reddit posts, and Twitter posts. It fails to capture the full semantics of language because it ignores compositionality (word order), which makes it ineffective at identifying changes in sentiment and at understanding the scope of negation.

Figure 6: Recurrent Neural Networks
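
To make the lexicon-based scoring step concrete, the short sketch below scores a text against a tiny hand-made lexicon. It is a minimal illustration only; the three-word lexicon, the naive tokenizer, and the thresholds are assumptions chosen for demonstration, not part of any published sentiment toolkit.

# Minimal bag-of-words + lexicon scoring sketch (illustrative lexicon and thresholds)
from collections import Counter

lexicon = {"surge": 2.0, "gain": 1.5, "crash": -2.0}   # hypothetical emotional values

def lexicon_sentiment(text):
    tokens = text.lower().split()          # naive tokenization
    bag = Counter(tokens)                  # bag-of-words: word -> frequency
    score = sum(lexicon.get(word, 0.0) * count for word, count in bag.items())
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

print(lexicon_sentiment("bitcoin price surge continues as markets gain"))
# Word order is ignored, so "not a gain" scores the same as "a gain" --
# exactly the compositionality and negation problem described above.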

A Recursive Neural Tensor Network (RNTN) is best suited for this type of project because it can take the semantic compositionality of text into account. When dealing with shorter pieces of text, such as a tweet, detecting compositionality becomes very important because there is less information from which to determine sentiment. 

RNTNs, as mentioned, are well suited to considering syntactic order. An RNTN is made up of multiple parts, including a parent group known as the root, child groups known as the leaves, and the scores. The leaf groups receive the input, and the root group uses a classifier to determine the class and score. 

When data is given to the sentiment analyzer, it is parsed into a binary tree. Vector representations are formed for all the words and represented as leaves. From the bottom up, the vectors are used both as parameters to optimize and as feature inputs to a softmax classifier, which assigns each node to one of five classes along with a score. 

The next step is where recursion occurs. When similarities are encoded between two words, the two vectors move up to the next root, and a class and score are output. The score represents the positivity or negativity of a parse, while the class encodes the structure of the current parse. The first leaf group receives the parse, the second leaf group receives the next word, the score of the parse containing all three words is output, and the process moves on to the next root group. The recursion continues until every input is used and every word is included. In practical applications, RNTNs end up being more complex than this: rather than using only the immediate next word in a sentence for the next leaf group, an RNTN tries all of the next words and eventually checks vectors that represent entire sub-parses. By performing this at every step of the recursive process, the RNTN can analyze every possible score of the syntactic parse.  
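
As a concrete illustration of the composition step described above, the NumPy sketch below combines two child vectors into a parent vector using a tensor-based composition and scores the parent with a softmax over the five sentiment classes. The dimensionality, random initialization, and toy leaf vectors are assumptions chosen for demonstration; this shows only the shape of the computation, not the Stanford implementation.

import numpy as np

d = 4                                   # word-vector dimensionality (toy size)
rng = np.random.RandomState(0)
W = rng.randn(d, 2 * d) * 0.01          # standard composition matrix
V = rng.randn(d, 2 * d, 2 * d) * 0.01   # tensor: one 2d x 2d slice per output dimension
Ws = rng.randn(5, d) * 0.01             # softmax weights for the five sentiment classes

def compose(a, b):
    # Combine two child vectors into a parent vector (tensor composition)
    ab = np.concatenate([a, b])                             # stacked children, shape (2d,)
    bilinear = np.array([ab @ V[k] @ ab for k in range(d)])
    return np.tanh(bilinear + W @ ab)

def classify(p):
    # Softmax over the five sentiment classes for a node vector
    scores = Ws @ p
    e = np.exp(scores - scores.max())
    return e / e.sum()

a, b = rng.randn(d), rng.randn(d)        # stand-ins for two leaf word vectors
parent = compose(a, b)
print(classify(parent))                  # class probabilities for the parent node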

Figure 7: Labeled Sentiment Statement

The RNTN uses the score produced by the root group to pick the best substructure at each recursive step. After the final structure is determined, the network backtracks and labels the data to work out the grammatical structure of the sentence. 
RNTNs are trained using backpropagation: the predicted sentence structure is compared with the correct sentence structure obtained from a set of labeled training data. After training, there is a higher probability that the RNTN will correctly parse inputs similar to what it saw in training. One key difference between plain Recursive Neural Networks and RNTNs is that an RNTN uses the same composition function, in the form of a tensor, at every node, so there are fewer parameters to learn. Because of this, similar words can have similar compositional behavior, just as similar words can have similar vectors. 

A resource that will be used is Stanford CoreNLP, which contains a large set of NLP tools. The sentiment analyzer also utilizes the Stanford Sentiment Treebank, a large corpus of data with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. During the recursion process, the RNTN refers to this data set to determine the class and score for a given parse. The Stanford Sentiment Treebank includes a total of 215,154 unique phrases from the parse trees of 10,662 sentences, annotated by 3 human researchers. The Treebank data set is based on movie reviews from rottentomatoes.com.
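
As a sketch of how the Stanford CoreNLP sentiment annotator might be called from Python, the snippet below assumes a CoreNLP server is already running locally on port 9000 and uses the third-party pycorenlp wrapper; the server URL, port, and example sentence are assumptions, and other wrappers (or the Java pipeline directly) would work equally well.

# Assumes a local CoreNLP server, started for example with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
text = "Ethereum is going to the moon this weekend"
output = nlp.annotate(text, properties={
    'annotators': 'sentiment',
    'outputFormat': 'json'
})
for sentence in output['sentences']:
    # 'sentiment' is the predicted class label and 'sentimentValue' its 0-4 index
    print(sentence['sentiment'], sentence['sentimentValue'])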

One possible challenge that may be encountered during the project is that the RNTN model, together with the Stanford Sentiment Treebank, may not contain enough data to determine the sentiment of cryptocurrency text. Because the Stanford Sentiment Treebank is based on movie review data, it may not recognize sentiment in the newer terminology associated with cryptocurrencies. This type of problem may occur, for example, if someone posts “Ethereum is going to the moon this weekend”, a common phrase used when people think a cryptocurrency is experiencing, or is about to experience, a large price surge. This is an extremely positive phrase, but the sentiment analyzer classified it below as neutral because it was unable to understand the positive connotation of “going to the moon” in the context and compositionality of cryptocurrency. 

Figure 8: Undetected Sentiment in Sentence

A possible workaround to this problem is to label and annotate a data set of cryptocurrency tweets and Reddit posts myself to teach the RNTN to detect sentiment in cryptocurrency discussion. The following is the same sentence labeled as an example. The sentiment scores span five levels: extremely positive, positive, neutral, negative, and extremely negative. I assigned “moon” a positive connotation, as this is how it is normally used, while “to”, “the”, “going”, and most of the other words are given a neutral connotation because they carry no positive or negative meaning on their own when associated with cryptocurrencies. However, the parse “going to the moon” yields a higher result because the Sentiment Treebank associates “going” and “moon” within the same parse. This parse receives a more positive score, indicated by the dark blue, because “going”, a direction of cryptocurrency price movement, is associated with “moon”, a positive price movement direction. 

Figure 9: Labeled Sentiment for Blockchain Statement
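
One way to express such hand labels is the bracketed tree format used by the Sentiment Treebank training files, in which every node carries a label from 0 (very negative) to 4 (very positive). The snippet below writes one hypothetical labeled tree for the example sentence; the exact bracketing, label choices, and file name are illustrative assumptions, not an excerpt from the actual Treebank.

# Hypothetical labeled tree in the Treebank's bracketed format (node labels 0-4).
# The bracketing is a plausible parse chosen for illustration, not real parser output.
tree = "(4 (2 Ethereum) (4 (2 is) (4 (2 going) (4 (2 to) (4 (2 the) (4 moon))))))"

with open("crypto_sentiment_train.txt", "a") as f:   # file name is an assumption
    f.write(tree + "\n")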

Cryptocurrency price movements are driven by trader sentiment, so a sentiment analyzer that detects sentiment in social media posts and news headlines can be key to anticipating those movements. The long-term vision of the project is to develop an AI cryptocurrency trading bot. As blockchains and smart contracts continue to develop, the world will see the automation of many processes as well as an increase in blockchain-based transaction platforms, whether those involve the exchange of digital currency, digital assets, data, or services. As this transition occurs over the next few years, cryptocurrencies will only continue to appreciate, which is why sentiment analysis of social media posts and news headlines can yield valuable insight into cryptocurrencies and their price movements. 

Image Sources: 
Figure 1: Cryptocurrency 
Figure 2: Blockchains 
Figure 3: Smart Contracts 
Figure 4: Blockchain Nodes 
Figure 6: Recurrent Neural Networks 
Figure 7: Stanford Sentiment Treebank

Intel® Speech Enabling Developer Kit for Far-field Voice Experiences in Smart Home Applications

$
0
0

Intel and Amazon have collaborated to bring intelligent voice control to consumer products worldwide with tools and technology that enable computing at the edge and in the cloud.

The Intel® Speech Enabling Developer Kit is a complete audio front-end solution for far-field voice control that helps device manufacturers accelerate product designs featuring Amazon Alexa* Voice Service.

Based on a new architecture that delivers high quality far-field voice experiences even in the most acoustically challenging environments, the Intel® Speech Enabling Developer Kit features:

  • 8-microphone radial array
  • Dual DSP with inference engine 
  • Custom Amazon Alexa* Voice Service wake word engine
  • High-performance algorithms for noise reduction, beam forming, and acoustic echo cancellation

Developers can now pre-order the Intel® Speech Enabling Developer Kit. Kits will ship the week of November 27, 2017.

Developer Workshop

Intel and Amazon will also host the first hands-on developer workshop for the Intel® Speech Enabling Developer Kit at Amazon Re:Invent in Las Vegas on November 30, 2017.  
The invite-only event will provide a complete walk-through of the kit for members of the media and a two-hour hands-on developer workshop conducted by experts from both Intel and Amazon.

Register for Amazon Re:Invent

For future developer workshops hosted by Intel and Amazon see: https://developer.amazon.com/alexa-voice-service/dev-kits/intel-speech-enabling.

Gentle Introduction to PyDAAL: Vol 3 of 3 Analytics Model Building and Deployment

$
0
0

Previous: Vol 2: Basic Operations on Numeric Tables

Earlier in the Gentle Introduction series (Volume 1 and Volume 2), we covered the fundamentals of the Intel® Data Analytics Acceleration Library (Intel® DAAL) custom data structures and the basic operations that can be performed on them. Volume 3 focuses on the algorithm component of Intel® DAAL, where the data management element is leveraged to drive analysis and build machine learning models.

Intel® DAAL has classes available to construct a wide range of popular machine learning algorithms for analytics model building, including classification, regression, recommender systems, and neural networks. Training and prediction are separated into two stages in Intel® DAAL model building; this separation allows the user to store and transfer only what is needed for prediction when it comes time for model deployment. A typical machine learning workflow involves:

  • Training stage that includes identifying patterns in input data that maps behavior of data features in accordance with a target variable.
  • Prediction stage that requires employing the trained model on a new data set.

Additionally, Intel® DAAL contains on-board model scoring in the form of separate classes that evaluate trained model performance and compute standard quality metrics. Different sets of quality metrics can be reported based on the type of analytics model built.

To accelerate the model building process, Intel® DAAL is reinforced with a distributed parallel processing mode for large data sets, including a programming model that makes it easy for users to implement a master-slave approach. mpi4py can be easily interfaced with PyDAAL, as Intel® DAAL’s serialization and deserialization classes enable data exchange between nodes during parallel computation.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Covers introduction to the Data Management component of Intel® DAAL and the available custom Data Structures (Numeric Table and Data Dictionary) with code examples
  • Vol 2: Basic Operations on Numeric Tables - Covers introduction to possible operations that can be performed on Intel® DAAL's custom Data Structure (Numeric Table and Data Dictionary) with code examples
  • Vol 3: Analytics Model Building and Deployment – Covers introduction to analytics model building and evaluation in Intel® DAAL with serialized deployment and distributed model fitting on large datasets.

Analytics Modelling:

1. Batch Learning with PyDAAL

Intel DAAL includes classes that support the following stages of the analytics model building and deployment process:

  1. Training
  2. Prediction
  3. Model Evaluation and Quality Metric
  4. Trained Model Storage and Portability

1.1 Analytics Modelling Training and Prediction Workflow:

1.2 Build and Predict with PyDAAL Analytics Models:

As described earlier, Intel DAAL model building is separated into two different stages with two associated classes (“training” and “prediction”).

The training stage usually involves complex computations with possibly very large datasets, calling for an extensive memory footprint. DAAL’s two separate classes allow users to perform the training stage on a powerful machine and, optionally, the subsequent prediction stage on a simpler machine. Furthermore, this allows the user to store and transmit only the training stage results that are required for the prediction stage.

Four numeric tables are created at the beginning of the model building process, two in each stage (training and prediction), as listed below:

Stage        Numeric Table              Description
Training     trainData                  The feature values (predictors) of the training data
Training     trainDependentVariables    The target values (labels/responses) of the training data
Prediction   testData                   The feature values (predictors) of the test data
Prediction   testGroundTruth            The target values (labels/responses) of the test data

Note: See Volume 2 for details on creating and working with numeric tables

Below is a high-level overview of the training and prediction stages of the analytics model building process:

Helper Functions: Linear Regression

The next section can be copy/pasted into a user’s script or adapted to a specific use case. The helper function block provided below can be used directly to automate the training and prediction stages of DAAL’s linear regression algorithm. The helper functions are followed by a full usage code example.

'''
training() function
-----------------
Arguments:
        train data of type numeric table, train dependent values of type numeric table
Returns:
        training results object
'''
def training(trainData,trainDependentVariables):
    from daal.algorithms.linear_regression import training
    algorithm = training.Batch ()
    # Pass a training data set and dependent values to the algorithm
    algorithm.input.set (training.data, trainData)
    algorithm.input.set (training.dependentVariables, trainDependentVariables)
    trainingResult = algorithm.compute ()
    return trainingResult

'''
prediction() function
-----------------
Arguments:
        training result object, test data of type numeric table
Returns:
        predicted responses of type numeric table
'''
def prediction(trainingResult,testData):
    from daal.algorithms.linear_regression import  prediction, training
    algorithm = prediction.Batch()
    # Pass a testing data set and the trained model to the algorithm
    algorithm.input.setTable(prediction.data, testData)
    algorithm.input.setModel(prediction.model, trainingResult.get(training.model))
    predictionResult = algorithm.compute ()
    predictedResponses = predictionResult.get(prediction.prediction) 
    return predictedResponses

To use: copy the complete block of helper functions and call the training() and prediction() methods.

Usage Example: Linear Regression

Below is a code example implementing the provided training and prediction helper functions:

#import required modules
from daal.data_management import HomogenNumericTable
import numpy as np
from utils import printNumericTable
seeded = np.random.RandomState (42)

#set up train and test numeric tables
trainData =HomogenNumericTable(seeded.rand(200,10))
trainDependentVariables = HomogenNumericTable(seeded.rand (200, 2))
testData =HomogenNumericTable(seeded.rand(50,10))
testGroundTruth = HomogenNumericTable(seeded.rand (50, 2))

#--------------
#Training Stage
#--------------
trainingResult = training(trainData,trainDependentVariables)
#--------------
#Prediction Stage
#--------------
predictionResult = prediction(trainingResult, testData)

#Print and compare results
printNumericTable (predictionResult, "Linear Regression prediction results: (first 10 rows):", 10)
printNumericTable (testGroundTruth, "Ground truth (first 10 rows):", 10)

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain the training and prediction stages as methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.

1.3 Trained Model Evaluation and Quality Metrics:

Intel DAAL offers quality metrics classes for binary classifiers, multi-class classifiers, and regression algorithms to measure the quality of a trained model. Various standard metrics are computed by the Intel DAAL quality metrics library for different types of analytics modeling.

Binary Classification:

Accuracy, Precision, Recall, F1-score, Specificity, AUC

Click here for more details on notations and definitions.

Multi-class Classification:

Average accuracy, Error rate, Micro precision (Precisionμ), Micro recall (Recallμ), Micro F-score (F-scoreμ), Macro precision (PrecisionM), Macro recall (RecallM), Macro F-score (F-scoreM)

Click here for more details on notations and definitions.

Regression:

For regression models, Intel DAAL computes metrics with two sets of classes:

  • Single Beta: Computes and produces metrics results based on Individual beta coefficients of trained model.

RMSE, Vector of variances, variance-covariance matrices, Z-score statistics

  • Group Beta: Computes and produces metrics results based on group of beta coefficients of trained model.

Mean and Variance of expected responses, Regression Sum of Squares, Sum of Squares of Residuals, Total Sum of Squares, Coefficient of Determination, F-Statistic

Click here for more details on notations and definitions.

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain the quality metrics computations as methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.
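
Intel DAAL’s own quality metric classes (referenced above) are the on-board route for these computations. As a quick illustration of what two of the listed regression metrics represent, the sketch below computes RMSE and the coefficient of determination manually with NumPy, assuming the predictionResult and testGroundTruth numeric tables from the section 1.2 usage example; the getArrayFromNT helper is an assumption that follows the BlockDescriptor/getBlockOfRows pattern covered in Volume 2.

import numpy as np
from daal.data_management import BlockDescriptor, readOnly

def getArrayFromNT(nt):
    # Copy a DAAL numeric table into a 2D NumPy array (pattern from Volume 2)
    block = BlockDescriptor()
    nt.getBlockOfRows(0, nt.getNumberOfRows(), readOnly, block)
    array = np.copy(block.getArray())
    nt.releaseBlockOfRows(block)
    return array

# predictionResult and testGroundTruth come from the section 1.2 usage example
predicted = getArrayFromNT(predictionResult)
actual = getArrayFromNT(testGroundTruth)

residuals = actual - predicted
rmse = np.sqrt(np.mean(residuals ** 2))              # root-mean-square error
ssRes = np.sum(residuals ** 2)                       # sum of squares of residuals
ssTot = np.sum((actual - actual.mean(axis=0)) ** 2)  # total sum of squares
rSquared = 1.0 - ssRes / ssTot                       # coefficient of determination

print("RMSE: %.4f  R^2: %.4f" % (rmse, rSquared))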

1.4 Trained Model Storage and Portability:

Trained models can be serialized into byte-type numpy arrays and deserialized using Intel DAAL’s data archive classes to:

  • Support data transmission between devices.
  • Save and restore from disk at a later date to predict response for an incoming observation or re-train the model with a set of new observations.

Optionally, to reduce network traffic and memory footprint, serialized models can be further compressed, and then decompressed again before deserialization.

Steps to attain model portability in Intel DAAL:

  1. Serialization:
    a. Serialize the training stage results (trainingResult) into Intel DAAL’s Input Data Archive object
    b. Create an empty byte-type numpy array object (bufferArray) the size of the Input Data Archive object
    c. Populate bufferArray with the Input Data Archive contents
    d. Compress bufferArray into a numpy array object (optional)
    e. Save bufferArray as a .npy file to disk (optional)
  2. Deserialization:
    a. Load the .npy file from disk into a numpy array object (if serialization step 1e was performed)
    b. Decompress the numpy array object into bufferArray (if serialization step 1d was performed)
    c. Create Intel DAAL’s Output Data Archive object from the bufferArray contents
    d. Create an empty original training stage results object (trainingResult)
    e. Deserialize the Output Data Archive contents into trainingResult

Note: As mentioned in deserialization step 2d, an empty original training results object is required for Intel DAAL’s data archive methods to deserialize the serialized training results object.

Helper Functions: Linear Regression

The next section can be copy/pasted into a user’s script or adapted to a specific use case. The helper function block provided below can be used directly to automate model storage and portability for DAAL’s linear regression algorithm. The helper functions are followed by a full usage code example.

import numpy as np
import warnings
from daal.data_management import  (HomogenNumericTable, InputDataArchive, OutputDataArchive, \
                                   Compressor_Zlib, Decompressor_Zlib, level9, DecompressionStream, CompressionStream)
'''
Arguments: serialized numpy array
Returns Compressed numpy array'''

def compress(arrayData):
    compressor = Compressor_Zlib ()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream (compressor)
    comprStream.push_back (arrayData)
    compressedData = np.empty (comprStream.getCompressedDataSize (), dtype=np.uint8)
    comprStream.copyCompressedArray (compressedData)
    return compressedData

'''
Arguments: compressed numpy array
Returns decompressed numpy array'''
def decompress(arrayData):
    decompressor = Decompressor_Zlib ()
    decompressor.parameter.gzHeader = True
    # Create a stream for decompression
    deComprStream = DecompressionStream (decompressor)
    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back (arrayData)
    # Allocate memory to store the decompressed data
    bufferArray = np.empty (deComprStream.getDecompressedDataSize (), dtype=np.uint8)
    # Store the decompressed data
    deComprStream.copyDecompressedArray (bufferArray)
    return bufferArray

#-------------------
#***Serialization***
#-------------------
'''
Method 1:
    Arguments: data(type nT/model)
    Returns dictionary with serialized array (type object) and object information (type string)
Method 2:
    Arguments: data(type nT/model), fileName(.npy file to save serialized array to disk)
    Saves serialized numpy array as "fileName" argument
    Saves object information as "filename.txt"
 Method 3:
    Arguments: data(type nT/model), useCompression = True
    Returns  dictionary with compressed array (type object) and object information (type string)
Method 4:
    Arguments: data(type nT/model), fileName(.npy file to save serialized array to disk), useCompression = True
    Saves compressed numpy array as "fileName" argument
    Saves object information as "filename.txt"'''

def serialize(data, fileName=None, useCompression= False):
    buffArrObjName = (str(type(data)).split()[1].split('>')[0]+"()").replace("'",'')
    dataArch = InputDataArchive()
    data.serialize (dataArch)
    length = dataArch.getSizeOfArchive()
    bufferArray = np.zeros(length, dtype=np.ubyte)
    dataArch.copyArchiveToArray(bufferArray)
    if useCompression == True:
        if fileName != None:
            if len (fileName.rsplit (".", 1)) == 2:
                fileName = fileName.rsplit (".", 1)[0]
            compressedData = compress(bufferArray)
            np.save (fileName, compressedData)
        else:
            comBufferArray = compress (bufferArray)
            serialObjectDict = {"Array Object":comBufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    else:
        if fileName != None:
            if len (fileName.rsplit (".", 1)) == 2:
                fileName = fileName.rsplit (".", 1)[0]
            np.save(fileName, bufferArray)
        else:
            serialObjectDict = {"Array Object": bufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    infoFile = open (fileName + ".txt", "w")
    infoFile.write (buffArrObjName)
    infoFile.close ()
#---------------------
#***Deserialization***
#---------------------
'''
Returns deserialized/ decompressed numeric table/model
Input can be serialized/ compressed numpy array or serialized/ compressed .npy file saved to disk'''
def deserialize(serialObjectDict = None, fileName=None,useCompression = False):
    import daal
    if fileName!=None and serialObjectDict == None:
        bufferArray = np.load(fileName)
        buffArrObjName = open(fileName.rsplit (".", 1)[0]+".txt","r").read()
    elif  fileName == None and any(serialObjectDict):
        bufferArray = serialObjectDict["Array Object"]
        buffArrObjName = serialObjectDict["Object Information"]
    else:
         warnings.warn ('Expecting "bufferArray" or "fileName" argument, NOT both')
         raise SystemExit
    if useCompression == True:
        bufferArray = decompress(bufferArray)
    dataArch = OutputDataArchive (bufferArray)
    try:
        deSerialObj = eval(buffArrObjName)
    except AttributeError :
        deSerialObj = HomogenNumericTable()
    deSerialObj.deserialize(dataArch)
    return deSerialObj

To use: copy the complete block of helper functions and call the serialize() and deserialize() methods.

Usage Example: Linear Regression

The example below implements the serialize() and deserialize() functions on the linear regression trainingResult. (Refer to the linear regression usage example in the section Build and Predict with PyDAAL Analytics Models to compute trainingResult.)

#Serialize
serialTrainingResultArray = serialize(trainingResult) # Run Usage Example: Linear Regression from section 1.2
#Deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray)

#predict
predictionResult = prediction(deserialTrainingResult, testData)

#Print and compare results
printNumericTable (predictionResult, "Linear Regression deserialized prediction results: (first 10 rows):", 10)
printNumericTable (testGroundTruth, "Ground truth (first 10 rows):", 10)

The examples below implement other combinations of the serialize() and deserialize() methods with different input arguments:

#---compress and serialize
serialTrainingResultArray = serialize(trainingResult, useCompression=True)
#---decompress and deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray, useCompression=True)

#---serialize and save to disk as numpy array
serialize(trainingResult,fileName="trainingResult")

#---deserialize file from disk
deserialTrainingResult = deserialize(fileName="trainingResult.npy")

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain the model storage and portability stages as methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.

2. Distributed Learning with PyDAAL and MPI:

PyDAAL and mpi4py can be used to easily distribute model training for many of DAAL’s algorithm implementations using the Single Program Multiple Data (SPMD) technique. Other Python machine learning libraries allow for the trivial implementation of a parameter-tuning grid search, mainly because it is an “embarrassingly parallel” process. What sets Intel DAAL apart is that it includes IA-optimized distributed versions of many of its model training classes, meaning that training of a single model can be accelerated with syntax similar to batch learning. For these implementations, the DAAL engineering team has provided a slave method to compute partial training results on row-grouped chunks of data, and a master method to reduce the partial results into a final model result.

Serialization and Message Passing:

Messages passed with mpi4py are sent as serialized objects. Under the hood, mpi4py uses the popular Python object serialization library pickle for this process. PyDAAL uses SWIG (Simplified Wrapper and Interface Generator) as its wrapper interface, and unfortunately SWIG objects are not compatible with pickle. Fortunately, DAAL has built-in serialization and deserialization functionality; see the Trained Model Storage and Portability section for details. The code below demonstrates the master and slave methods for the distributed version of PyDAAL’s covariance algorithm.

Note: The serialize and deserialize helper functions are provided in the Trained Model Storage and Portability section of this volume.

The next section can be copy/pasted into a user’s script or adapted to a specific use case. The helper function block provided below can be used to carry out distributed computation of the covariance matrix, but it can be adapted for fitting other types of models. See the Computation Modes section in the developer documentation for more details on distributed model fitting. The helper functions are followed by a full usage code example.

Helper Functions: Covariance Matrix

# Define slave compute routine
'''
Defined Slave and Master Routines as Python Functions
Returns serialized partial model result. Input is serialized partial numeric table'''
from CustomUtils import getBlockOfNumericTable, serialize, deserialize
from daal.data_management import HomogenNumericTable
from daal.algorithms.covariance import (
    Distributed_Step1LocalFloat64DefaultDense, data, partialResults,
    Distributed_Step2MasterFloat64DefaultDense
)

  
def computestep1Local(serialnT):
   # Deserialize using Helper Function
   partialnT = deserialize(serialnT)
   # Create partial model object
   model = Distributed_Step1LocalFloat64DefaultDense()
   # Set input data for the model
   model.input.set(data, partialnT)
   # Get the computed partial estimate result
   partialResult = model.compute()
   # Serialize using Helper Function
   serialpartialResult = serialize(partialResult)
   
   return serialpartialResult

# Define master compute routine
'''
Imports global variable finalResult. Computes master version of model and sets full model result into finalResult. Inputs are array of serialized partial results and MPI world size'''
def computeOnMasterNode(serialPartialResult, size):
    global finalResult
    # Create master model object
    model = Distributed_Step2MasterFloat64DefaultDense()
    # Add partial results to the distributed master model
    for i in range(size):       
        # Deserialize using Helper Function
        partialResult = deserialize(serialPartialResult[i]) 
        # Set input objects for the model
        model.input.add(partialResults, partialResult)
    # Recompute a partial estimate after combining partial results
    model.compute()
    # Finalize the result in the distributed processing mode
    finalResult = model.finalizeCompute()

Usage Example: Covariance Matrix

The below example uses the complete block of helper functions given above and implements computestep1Local(), computeOnMasterNode() functions with mpi4py to construct a Covariance Matrix.

from mpi4py import MPI
import numpy as np
from CustomUtils import getBlockOfNumericTable, serialize, deserialize
from daal.data_management import HomogenNumericTable

'''
Begin MPI Initialization and Run Options'''
# Get MPI vars
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()   
  
# Initialize result vars to fill
serialPartialResults = [0] * size
finalResult = None

'''
Begin Data Set Creation

The below example variable values can be used:
numRows, numCols = 1000, 100

'''
# Create random array for demonstration
# numRows, numCols defined by user
seeded = np.random.RandomState(42)
fullArray = seeded.rand(numRows, numCols)

# Build seeded random data matrix, and slice into chunks
# rowStart and rowEnd determined by size of the chunks
sizeOfChunk = int(numRows/size)
rowStart = rank*sizeOfChunk
rowEnd = (rank+1)*sizeOfChunk   # slice end is exclusive, so no -1 here
array = fullArray[rowStart:rowEnd, :]
partialnT = HomogenNumericTable(array)
serialnT = serialize(partialnT)


'''
Begin Distributed Execution'''

if rank == 0:

   serialPartialResults[rank] = computestep1Local(serialnT)
  
   if size > 1:
      # Receive slave partial results on the master rank
      for i in range(1, size):
         srcRank, srcSize, srcName, serialPartialResult = \
             MPI.COMM_WORLD.recv(source=MPI.ANY_SOURCE, tag=1)
         serialPartialResults[srcRank] = serialPartialResult

   computeOnMasterNode(serialPartialResults, size)

else:
   serialPartialResult =  computestep1Local(serialnT)
   MPI.COMM_WORLD.send((rank, size, name, serialPartialResult), dest=0, tag=1)


Linux shell commands to run the covariance matrix usage example
------------------------------------------------------------------------------------------

# Source and activate the Intel Distribution for Python (idp) env
source ../anaconda3/bin/activate
source activate idp
# Optionally set the MPI environment variable to shared memory mode
export I_MPI_SHM_LMT=shm

# cd to the script directory, and call the Python interpreter inside the mpirun command
cd ../script_directory
mpirun -n <number_of_processes> python script.py

Conclusion

Previous volumes (Volume 1 and Volume 2) demonstrated the Intel® Data Analytics Acceleration Library (Intel® DAAL) numeric table data structure and basic operations on numeric tables. Volume 3 discussed Intel® DAAL’s algorithm component and performing analytics modelling through its different stages with both batch and distributed processing. Volume 3 also demonstrated how to achieve model portability (serialization) and perform model evaluation (quality metrics). Furthermore, this volume used Intel® DAAL classes to provide helper functions that deliver a standalone solution for the model fitting and deployment process.


About Intel® Software Development Products Community Forums

$
0
0

Intel Software Development Products Community Forums are free and welcome for all users of our software tools. It’s a place where community members and experts connect to help and learn from one another. Intel engineers monitor these forums to share product news and updates, and to refer customers to online documentation, hot topics, and how-to articles.

  • Customers with paid Licenses or active Priority Support may submit confidential technical questions or requests via the Online Service Center.
  • Customers with questions on registration, download, licensing, and licensing-related installation may also submit via the Online Service Center. 

This article applies to the following Community Forums: 

Intel® C++ Compiler

Intel® Fortran Compiler: Windows* | Linux* and macOS*

Intel® Distribution for Python*

Intel® Math Kernel Library

Intel® Data Analytics Acceleration Library

Intel® Integrated Performance Primitives

Intel® Threading Building Blocks

Intel® VTune™ Amplifier

Intel® Advisor

Intel® Inspector

Intel® MPI Library and Intel® Trace Analyzer and Collector

Intel® System Studio

Intel® System Debugger

Arizona Sunshine* Follows Intel’s Guidelines for Immersive VR Experiences

$
0
0

Figure 1. Concept art for Arizona Sunshine*.

With a dazzling launch in early 2017 that saw Arizona Sunshine* become the fastest-selling non-bundled virtual reality title to date, and instant recognition as the 2016 “Best Vive Game” according to UploadVR, the zombie-killer game is not just another VR shooter. Combining immersive game play with intriguing multi-player options, this game takes full advantage of VR capabilities to promote playability in both outdoor and underground environments.

Through its association with Netherlands-based Vertigo Games and nearby indie developer Jaywalkers Interactive, Intel helped add sizzle to Arizona Sunshine by fine-tuning the CPU capabilities to provide end-to-end VR realism. The power of a strong CPU performance becomes apparent with every jaw-dropping zombie horde attack. From the resources available when a player chooses and loads a weapon, to the responsiveness of the surrounding eerie world, the immersive qualities of the VR interface make it easy to forget that it’s just a game.

In recent years an influx of exploratory games and experiences designed to keep up with the new wave of VR titles have hit the marketplace. With these new experiences comes the complexity of ensuring a perfect mix of CPU and GPU power to provide the level of realism that is expected within VR. Several features in Arizona Sunshine provide this level of immersion. Following Intel’s Guidelines for Immersive Virtual Reality Experiences, the features break down into three key categories:

  • Physical Foundation
  • Basic Realism
  • Beyond Novelty

This paper highlights how Arizona Sunshine—built exclusively for VR systems—has benefitted from adherence to the Intel guidelines.

Mission Overview: Kill the Zombies

When Arizona Sunshine starts, the unnamed lead character awakes in a sheltered cave, safe from the scorching post-apocalyptic desert heat. He soon hears a radio transmission and realizes that other humans are alive, and they’re also battling for survival against the zombie horde. As he sets out to find help, the story arc is set.

While game action takes place in the desert heat and underground in an old mine, the goal is the same: survive and find a friend. The lead character offers quips and asides as he fights his way through the room-scale VR, using a teleport system and motion-controlled weapons to pace him through bite-sized chunks of action.

Figure 2. From the bright Arizona landscape to dark, underground mines, Arizona Sunshine offers plenty of targets.

Game play is enhanced whether playing on an Oculus Rift* or Vive VR System on the PC, or on a Sony PlayStation* using the PlayStation VR Aim Controller. Players learn to squint through one eye to aim as they progress through 25 weapons, from pistols to assault rifles. Campaign mode is a standard shoot-fest where players can play alone or with others, and the ultra-engaging Dynamic Horde mode is where players slaughter numerous zombies, either with friends or alone.

When powered by at least an Intel® Core™ i7-6700K processor, Arizona Sunshine really packs a punch. Continual enemies in a complex, interactive environment bring unprecedented realism to VR gaming with zombie destruction, rich environmental elements, and a cinema-quality playing field. The AI is simple, but ruthless, and if a zombie gets close enough, it can flail away, masking a good, clean head-shot, while it tears players to pieces. The custom-built animation system makes getting in a good kill shot as satisfying as anything else in the VR universe.

Intel’s New VR Guidelines Provide Key Guardrails

As with any new technology, a new VR game might seem fine at first glance—and great to own simply because so few choices exist. But deeper into game play, players may find the title contains fatal flaws that cause them to drop the game. Because VR has already been studied in detail, and with more research continuously available, those flaws are avoidable.

Intel has expended a significant usability testing effort to extend VR research and to extrapolate some basic rules. So far, researchers have observed players’ initial and continuing experiences with a variety of VR activities, followed by detailed debriefing sessions and questionnaires to discover the specific factors that made the experiences enjoyable. One key finding was a high statistical correlation between enjoyment and the level of immersion. The research also revealed several aspects of the games and environments that closely correlated with immersion and are, therefore, keys to extending that feeling.

Figure 3. Head underground and Arizona Sunshine immediately welcomes players with echoes, faint noises, and swarming zombies.

The following guidelines are condensed from Intel’s Guidelines for Immersive Virtual Reality Experiences and are described in relation to Arizona Sunshine.

Physical Foundation—Safety First

The first rule for any VR game is safety. Because VR players are wrapped in a headset and may forget the limitations of their actual environment, laying a safe physical foundation is crucial. VR games make immersion possible by using technology that keeps the player safe from injury while wearing a headset, comfortably free from soreness due to hardware ergonomics, and free from motion sickness. Players must not be distracted by unrelated sights and sounds from the outside world; otherwise, the immersive aspects are dashed. Worrying about safety while fighting off zombie hordes is not an option—players must be free from concern in their physical space.

As players are initiated into Arizona Sunshine, they quickly notice several features that keep them safe and free from dizziness:

  • Swift teleportation. Movement in the game is a pain-free and seamless interaction. When players choose to teleport, the game provides four different directions that allow for precise movements around each scene. Players feel like they are traveling through the virtual world without running into objects in the physical world.
  • Zero latency. Players should detect a welcome lack of latency when teleporting, interacting with objects, and even unloading their weapons. Every movement that players make is swift and natural-feeling. Using the most powerful gaming system is crucial to maximizing the VR experience; optimizing the processing load between the CPU and the GPU is crucial, and made possible by powerful Intel tools such as Intel® Graphics Performance Analyzers. At times, the scrupulous attention to detail in Arizona Sunshine means that reloading in the middle of a fight is just as time-consuming in the game as it is in real life, even without latency issues.
  • Appropriate use of space. As players move in the game, they should notice that distances match what they would be in the real world. For instance, if they drop ammo on the ground, they must physically reach down to the ground to grab it.

Figure 4. Battling underground provides a realistic sense of confinement and entrapment.

Basic Realism Depends on Potent Sound Effects and Visuals

To be successful and fully immersive, Intel’s guidelines stress that the virtual world should seem real by providing smooth 3D video, realistic sound, intuitive controls for manipulating the environment, and natural responses to the players’ actions in the virtual world.

These concepts are an integral part of game play in Arizona Sunshine; the VR environment and sounds are realistic and responses to players’ actions are genuine.

Although the level of realism throughout Arizona Sunshine is a pleasant surprise, it’s the avalanche of shambling zombies that exponentially boosts the “fear factor.” The underlying interactions and feedback of each object feel captivating and build the realistic sensation to the game. These key features add to the realism of Arizona Sunshine:

  • Responsive destruction. The game has an overwhelming amount of realistic destruction. As players battle with the zombies, they soon realize that accurately killing them is heavily reliant on skill, rather than just spraying bullets and hoping to hit a vital spot. Based on where the player shoots the zombie, as well as the distance of the shot, the zombie body parts react and destruct appropriately. For example, shooting a zombie in the head at close range causes it to perish, while shooting one in the leg keeps it alive with only the leg itself detached. Similarly, instead of opening cars or going through doors, players can simply destroy the windows and door knobs with a well-aimed shot from their weapon.  
  • Responsive world. From the billboards to the deserted huts on the ground, almost everything in the virtual world is responsive. The zombies take that responsive element to another level. During combat, merely hiding from the zombies by locking oneself in a shed may be the best tactic. The zombies naturally react to this by attempting to break down the door but then go away after they are unsuccessful. In this way, the game responds to the players’ tactics. The player feels engaged as the main character and not controlled by the game’s assumptions.
  • Haptic feedback. “Haptics” refers to the systemized sense of touch in a game to provide important information to the player. Arizona Sunshine does a great job at relaying its own haptic language to communicate certain patterns. For example, when the player is getting shredded by a zombie, the controllers shake in response, providing important feedback that heightens the sense of danger. Similarly, when the player dies, the virtual world turns green and the controllers shake to indicate a fatality.
  • Visual feedback. Initially learning which items in the rich environment can—or even should be—interacted with, takes practice. As players approach any object, a glowing circle appears as an invitation for interaction. These visual clues are an important part of the learning curve, providing a boost to productivity before a player starts to feel like giving up hope of conquering the undead army.
  • Basin of attraction. In physics, this is where a collection of all possible initial conditions in a dynamic system converge on a particular attractor. The game uses that concept to help the player complete simple tasks without failure. For example, holstering the gun or placing ammo on the player’s belt can be done by mimicking the action of bringing the objects close to the desired destination.
  • Realistic sound. Using the Doppler Effect, the game provides excellent realistic sounds based on the distance and mass of objects. For example, as zombies approach, the player can hear the progression in their screech volume. Other natural sounds, such as water waves, crickets chirping, and wind blowing add to the realism as players navigate the world. Adding such realistic effects greatly enhances the immersive quality to Arizona Sunshine. When sounds are faint, it’s because they should be faint—enemies cannot trick players by suddenly materializing in a break of the time-space continuum.
  • Responsiveness to player movements. As players enter each level and reach different checkpoints, several unique visuals and audio elements respond to player movements. For example, when players look through tinted windows, the visuals appear darker.
  • Responsive sound. During several moments, the virtual world responds to various sounds that the players initiate. For example, when firing a gun or making loud movements, the zombies react by swarming to the player’s exact location. Enforcing stealth in the game can result in players actually holding their breath, a good sign that the game’s interface is working as intended.

Figure 5. The ubiquitous zombies are programmed to swarm and are attracted to players’ noises.

Beyond Novelty—Do Something New!

Being first at something isn’t enough for a game to succeed. Merely developing VR aspects in a dull game means it’s still, at core, a dull game. Arizona Sunshine solves this dilemma by providing a main character that players cheer on and want to make successful. This provides a story arc that players can follow to propel the main character to a satisfying conclusion while he interacts with players through sarcastic asides that often mimic exactly what the player might be thinking.

To keep the immersion alive and engaging—rather than merely impressive—developers should mimic reality by enabling interaction with nearly everything in the virtual world. This offers good game play that is independent of technology, making VR interactions core to the experience and easing the player quickly and smoothly into the virtual world.

Because Arizona Sunshine offers tough challenges and an interesting storyline, players engage in their mission and are willing to do anything to achieve it. Two key features enable this level of immersion:

  • Ubiquitous interactions. Throughout the game, players find that the virtual world respects their ability to interact with the playing environment and allows for interaction with almost every object. For example, unless the object is locked, nearly everything with a handle can be opened and closed. The downside of this feature is that a creaking hinge or a squeaking door knob could be the tiny noise that brings the zombie horde immediately to the player.
  • Captivating interactions. Aside from ubiquitous and constant interactions, the game also provides realistic challenges that capture the player’s interest and evoke unexpected emotion in almost every scene. For example, a player might innocently pick up a grenade, which could set off a fiery explosion. This not only heightens the sense of danger but also injects a dose of adrenaline and subsequent pragmatism to heighten playability.

Follow the Rules, Reap the Rewards

Arizona Sunshine embraces Intel’s Guidelines for Immersive VR Experiences and genuinely delights players in its post-apocalyptic world. By following Intel’s basic guidelines, the game is more than just an early VR success; it is a blueprint for future designers. Simply porting yet another zombie shooter might guarantee some level of profitability, just by being early to market. To truly turn heads, Arizona Sunshine stays simple in approach but complex in implementation; and it takes full advantage of the technology available. When played on powerful Intel gaming systems that are optimized to balance the load between processing and graphics, the result is an immersive, addicting first-person shooter game that shines as a great example of harnessing the power of new technologies.

Additional Resources

Official Game Site

Intel’s Guidelines for Immersive Virtual Reality Experiences

About the Author

Hope Idaewor has been an intern with Intel’s immersive user experience team and is currently a student at the Georgia Institute of Technology.


Paperbark Creates a Love Letter to the Australian Bushland

$
0
0

The original article is published by Intel Game Dev on VentureBeat*: Paperbark creates a love letter to the Australian bushland. Get more game dev news and related topics from Intel on VentureBeat.

Paperbark in game screenshot

Wombats are nocturnal animals. So if you pull one out of its slumber and force it to hunt for food during the day, chances are good that it’ll have a hard time moving around.

That's exactly the predicament Paperbark's wombat finds itself in. The sleepy, clumsy, and most likely grumpy animal is your tour guide through the game's evocative landscape. Indie developer Paper House wants to capture what life is like in "the bush"— an Australian colloquialism that refers to specific patches of wilderness scattered throughout the country. They're drier than a forest, and contain plants and animals that thrive under harsh weather conditions.

Unlike the outback desert so often seen in TV and movies, the bush is, according to lead designer Terry Burdak, much more familiar and accessible to Australians. They don't have to travel far to venture into the bush; some major cities even have bushland right on their doorstep.

"I've never been to the desert! I've never gone that far inland before," remarked the 30-year-old Burdak. "The outback almost feels foreign to me. So it's kind of weird that it's the go-to thing for representing Australia."

To the developers, the bush is an underappreciated gem that more people — including those around the world — should know about. By playing as the wombat, you'll meet many of the bush's natural denizens and slowly begin to see what makes this ecosystem so unique.

Paperbark's relaxing journey thoroughly impressed the judges behind this year's Intel® Level Up Game Developer Contest, and they gave it the Best Game – Open Genre award.

Though Paperbark defies traditional gaming labels, it's still trying to tell a story. Burdak and his team want players of all ages to enjoy it (so you won't find any game-over screens here). And the reason for that stems from a shared childhood experience.

Above: Paperbark's playful wombat.

A new kind of storybook

Paper House has three full-time employees: Burdak, programmer Ryan Boulton, and artist Nina Bennett. They were all game design students at RMIT University in Melbourne when they decided to work together on their senior project. At the time, they wanted the game to look like a storybook (Boulton had already been working on the tech responsible for Paperbark's watercolor-like visuals), and Burdak suggested they should use old children's books as inspiration.

It was a perfect fit. All of them grew up in rural parts of the country, and the picture books they read as kids were about exploring the Australian environment.

"There are some really great ones like Blinky Bill, Snugglepot and Cuddlepie, and Possum Magic. They really depicted these landscapes wonderfully," said Burdak.

After graduating from college, they showed an early prototype of Paperbark at different gaming conferences. It was during these events that they found out about Film Victoria, one of the few government organizations in Australia that, as Burdak noted, provide "substantial funding" for game development. Once they successfully secured a FilmVic grant in 2015, the former students quit their day jobs and formed a company. They could finally finish making Paperbark.

This also allowed them to bring on much-needed freelancers, including an ecologist who checks to see if all the plants and animals in the game actually live in the bush. But one hire in particular was especially important. The developers were looking for someone who could elevate the storybook aspect of the game, and they found that person in a published children's author.

"That was a really big thing that I wanted to do …Apart from Paperbark having the same artistic sensibilities as the books, we thought it'd be a good idea to actually get a children's author to write the story for the game," Burdak said, adding that they'll be sharing her name shortly. "She hasn't worked in games before, which was another thing we were interested in — someone who can give us a unique perspective."

Paperbark in game screenshot
Above: You'll learn about some of the native plants and animals.

Capturing a feeling, not a place

Despite the great strides the studio is making to depict the nuances of the bush, Paperbark isn't actually set in a real part of Australia. Instead, it's a composite of several national parks in the state of Victoria. Burdak brought up Campo Santo's Firewatch as an example: It takes place in a fictional part of the otherwise very real Shoshone National Forest in Wyoming. Burdak said it was validating to see a fellow developer using nature as an integral part of the game.

"We've created this — not necessarily made-up — but a really stylized version of these locations. … We describe it as a love letter to the bush," he later added. "Conveying our feelings to the bush and to these landscapes is more important than recreating any actual location."

Paper House is taking a similar, less literal approach to the story. Throughout the game, you'll hit so-called "storybook moments" where the camera shifts and bits of text appear on the screen to describe what's happening. But as cute as the wombat can be— at one point during an early demo, it's stuck in a large tree trunk, and tries to wiggle its way out — the animal is mostly there to push you from one place to the next. It's a vessel on which players can project their own experiences.

"At the end of the game, what the player is going to take away is not necessarily this journey of the wombat, but the journey they've gone on themselves," said Burdak.

It's an ambitious goal, and brings to mind another indie game, the appropriately named Journey. The 2012 hit from Thatgamecompany also didn't have any named characters or an obvious story arc, but the experience of playing through it (sometimes with another player online) is what makes it special.

It's unfair to directly compare an in-progress game to what Journey ultimately achieved, but the parallels between them show that Paper House is following a rich history of developers who aren't afraid of expanding gaming's boundaries.

Paperbark is currently set for a 2018 release on iOS devices.

A Boy and His Giraffe: A Random Drawing Inspires an Award-Winning Game


The original article is published by Intel Game Dev on VentureBeat*: A boy and his giraffe: A random drawing inspires an award-winning game. Get more game dev news and related topics from Intel on VentureBeat.

Adventure Pals in game screenshot

Platforming games have a strong track record of producing memorable characters. Millions around the world can instantly recognize Mario's cherubic nose and mustache, or Sonic the Hedgehog's triangular spines. In The Adventure Pals, the hero's most distinguishing feature might just be Sparkles, a pet giraffe who lives inside his backpack.

The quirky character is a good indicator of the off-the-wall humor you'll find in Massive Monster's upcoming game. The kid and his giraffe are on a quest to rescue his father from the evil Mr. B, who is inexplicably transforming all the old people into giant hotdogs. The silly premise feels like a throwback to ‘90s-era platformers, but one that incorporates the style and flair found in today's popular cartoons.

It's one of the reasons why the judges at this year's Intel® Level Up Game Developer Contest gave The Adventure Pals their Best Character Design award. Fittingly, the idea for the game started with a drawing.

A few years ago, artist Julian Wilton posted an illustration of a boy and his giraffe in an online forum. Jay Armstrong, a programmer at Massive Monster, liked it so much that he wanted to team up with Wilton to make a game based on his artwork.

"It was a really bad way to design a game," Wilton joked. "I was like, ‘I want to make this into a game.' And Jay had made some platformers in the past. I don't think we had any strong ideas of what the game would be."

Inspired by indie hits like Castle Crashers and Super Chibi Knight, the two of them eventually figured it out. In 2012, they released Super Adventure Pals. It was a free browser game hosted on the Armor Games website, which serves as an online library of user-submitted Flash games. Wilton was more than familiar with the popular software: He'd been making his own Flash games since he was 15 years old.

Adventure Pals in game screenshot
Above: The giraffe can help you cross large gaps with its magical helicopter tongue.

Simple but elegant

Those early game development experiences had a huge influence on Wilton's art style.

"Back in the day, there was definitely a style associated with Flash games because of the way Flash works — and the programs used to make the games — which I still use today for a lot of things because it's a very easy tool for creating art," he said.

Wilton keeps his designs simple and relatable, pointing out how animated shows like Adventure Time, Over the Garden Wall, and Gravity Falls contrast their relatively basic-looking characters with gorgeously detailed backgrounds. He likes to use a similar aesthetic in his games. Whenever he creates a new character, he uses a thick digital brush to sketch out its silhouette.

"I think that's the most important part of designing a character because that's what the audience is going to see," Wilton said. "I think a lot of people make mistakes when they're doing character stuff, where they might draw a human body and they'll zoom in and add all these details. … But when you zoom out, all that detail is gone. You won't notice it."

You can see that philosophy in action through the screenshots posted on this page. Even though they're bursting with activity, it's easy to identify the boy through his black hat and the dangling giraffe. And the huge, glossy eyes that almost all the characters have immediately tell you that The Adventure Pals doesn't take itself too seriously.

"I guess my design work — the way I do graphic design — probably feeds into my character design as well, where it's about communicating the idea of the character to the audience," said Wilton.

Adventure Pals in game screenshot
Above: Forget the whale. What the heck is going on with that pirate?

A proper reboot

By the end of Super Adventure Pals's development, Wilton and Armstrong learned so much from that experience that they immediately wanted to make a sequel. But at that point, as Wilton said, Flash games "were dying off." So for the next project (at first called Super Adventure Pals 2), Massive Monster decided to make it for PC and consoles.

The developers once again partnered with Armor Games as their publisher, and received additional funding through a Kickstarter campaign they launched in July 2016. In addition to making the game bigger and better than the Flash incarnation, they put more of a focus on the giraffe's movement and combat abilities, such as how it can help the boy hover in the air by spinning its tongue.

The duo also wanted to make sure that the new game could be enjoyed by all kinds of players. They looked at the type of multifaceted humor that Disney movies tend to have: They're fun for kids on a surface level, and adults can appreciate the more mature jokes and themes.

"There aren't too many games that do that. That's where we're trying to position the game," said Wilton.

They also hired a writer to help them shape that humor and expand on the story so that it's more of a coming-of-age tale. But as the game changed, it no longer made sense to keep the original moniker. Wilton and Armstrong thought it'd be better to start with a clean slate.

"At the end of the day, we realized we didn't want to be associated with Super Adventure Pals that much anymore," said Wilton. "Because when we look at it, we're like, ‘That's a piece of shit!' People still love it. But I don't think players are going to come to the game because they played Super Adventure Pals. … The new game is miles ahead of the Flash game."

The Adventure Pals is coming to PlayStation 4, Xbox One, and PC in 2018.

Parallel Processing with DirectX 3D* 12


John Stone

Integrated Computer Solutions, Inc.

Download sample code

Introduction

We will examine parallel rendering topics using Direct3D* 12. We will use the results from the paper, A Comparison of the Intel® Core™ i5 Processor and Intel® Core™ i7 Processor with Visualizations in OpenGL* and Oculus* VR, and extend the code there to contain a Direct3D 12 renderer, after which we will re-implement its particle system as a Direct3D 12 compute shader. You can find the sample code described in this article at the download link above.

CPU Compute

Our first task is to add a Direct3D 12 renderer to the particle system in the Intel Core i5 Processor-versus-Intel Core i7 Processor article mentioned above. The software design there makes this easy to do since it nicely encapsulates the concept of rendering.

Renderer

Interface

We create a new file named RendererD3D12.cpp and add the corresponding class to the Renderers.h file.

#pragma once

#include "Particles.h"

namespace Particle { namespace Renderers {

// A D3D12 renderer drawing into a window
struct D3D12 : Renderer
{
    D3D12();
    virtual ~D3D12();

    void* allocate(size_t) override;
    void* operator()(time, iterator, iterator, bool&) override;

    struct Data; Data& _;
};

}}

Our job is to fill in the implementations for the constructor, destructor, allocate(), and operator() methods. By examining the code in RendererOpenGL.cpp we can see that the constructor should create all of the GPU resources, allocate() should create a pipelined persistently mapped vertex upload buffer, and operator() should render a frame. We can see that the OpenGL implementation gets the job done in just 279 lines of code, but we will find out that Direct3D 12 takes significantly more work to do the same job.

Implementation

Event Loop

Let us start by deciding that we are going to do raw Direct3D 12; that is, we are going to program directly to the Microsoft* published API without relying on helper libraries. The first thing we need to do is create a window and implement a minimum event loop.

void D3D12::Data::MakeWindow()
{
    // Initialize the window class.
    WNDCLASSEX k = {};
    k.cbSize = sizeof(WNDCLASSEX);
    k.style = CS_HREDRAW | CS_VREDRAW;
    k.lpfnWndProc = DefWindowProc;
    k.hInstance = GetModuleHandle(NULL);
    k.hCursor = LoadCursor(NULL, IDC_ARROW);
    k.lpszClassName = L"RendererD3D12";
    RegisterClassEx(&k);

    // Create the window and store a handle to it.
    win.hwnd = CreateWindow(k.lpszClassName, k.lpszClassName, WS_OVERLAPPEDWINDOW, 5, 34, win.Width, win.Height, NULL, NULL, GetModuleHandle(NULL), NULL);
}
void D3D12::Data::ShowWindow()
{
    ::ShowWindow(win.hwnd, SW_SHOWDEFAULT);
}
void D3D12::Data::PollEvents()
{
    for (MSG msg = {}; PeekMessage(&msg, NULL, 0, 0, PM_REMOVE); ) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
}

Initialization

Not too bad, just 25 lines of code. Let us talk about what we need to do to turn that window into a rendering surface for Direct3D 12. Here’s a brief list:

    // initialize D3D12
    _.MakeWindow();
    _.CreateViewport();
    _.CreateScissorRect();
    _.CreateFactory();
    _.CreateDevice();
    _.CreateCommandQueue();
    _.CreateSwapChain();
    _.CreateRenderTargets();
    _.CreateDepthBuffer();
    _.CreateConstantBuffer();
    _.CreateFence();
    _.ShowWindow();
    _.CreateRootSignature();
    _.CreatePSO();
    _.CreateCommandList();
    _.Finish();

Wow, that is 16 steps to initialize Direct3D 12! Each step expands to a function containing between 6 and 85 lines of code, for a total of nearly 500 lines of C++ code. Examining the OpenGL renderer, we see that it gets the same job done in about 100 lines of C++ code, which gives us a 5x expansion factor when working in Direct3D 12. Now remember, this is about as basic a renderer as you can get: it issues a single draw call per frame and renders its pixels to a single window. You will find that things get even more complicated in Direct3D 12 as you try to do more. Putting all these facts together suggests that Direct3D 12 programming is not for the faint of heart, a sentiment with which this author wholeheartedly agrees.

There are many tutorials on this initialization process on the web, so I am not going to cover it here. You can examine the source code for all the gory details. I will mention that, unlike OpenGL, in Direct3D 12 you, the programmer, are responsible for creating and managing your own backbuffer/present pipeline. You also need to pipeline updates to all changing data, including constant data (uniform data in OpenGL) as well as the persistently mapped vertex upload buffer. We put this all together in the Direct3D 12 code encapsulated in the frames variable in the d3d12 structure.
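
To make the pipelining idea more concrete, here is a simplified sketch of how per-frame resources are commonly tracked in Direct3D 12. The structure and field names below are assumptions for illustration only; they are not the actual frames structure from the sample.

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical per-frame bookkeeping; names and layout are illustrative only.
struct FrameResources
{
    ComPtr<ID3D12CommandAllocator> commandAllocator;  // reset only after the GPU has finished this frame
    ComPtr<ID3D12Resource>         constantBuffer;    // per-frame copy of the changing constant data
    void*                          vertexUploadSlice; // this frame's slice of the persistently mapped upload buffer
    UINT64                         fenceValue;        // value signaled on the queue when this frame completes
};

// Before reusing a frame slot, wait until the GPU has passed its fence value.
void WaitForFrame(ID3D12Fence* fence, HANDLE fenceEvent, const FrameResources& frame)
{
    if (fence->GetCompletedValue() < frame.fenceValue)
    {
        fence->SetEventOnCompletion(frame.fenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }
}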

Custom Upload Heap

One thing I did notice while doing this was that using the default D3D12_HEAP_TYPE_UPLOAD heap type for the vertex upload buffer led to extremely slow frame times. This puzzled me for quite a while until I ran across some information indicating a read penalty that occurs when accessing write-combined memory. Some further digging into the Direct3D 12 documentation shows this:

Applications should avoid CPU reads from pointers to resources on UPLOAD heaps, even accidentally. CPU reads will work, but are prohibitively slow on many common GPU architectures, so consider the following …

We did not have this problem in the OpenGL code since it mapped its vertex upload memory like this:

    GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

To get the same effect in Direct3D 12 you need to use a custom heap type (see D3D12::Data::CreateGeometryBuffer() method):

// Describe the heap properties
    D3D12_HEAP_PROPERTIES hp = {};
    hp.Type = D3D12_HEAP_TYPE_CUSTOM;
    hp.CPUPageProperty = D3D12_CPU_PAGE_PROPERTY_WRITE_BACK;
    hp.MemoryPoolPreference = D3D12_MEMORY_POOL_L0;

where the key piece of the puzzle is setting CPUPageProperty to D3D12_CPU_PAGE_PROPERTY_WRITE_BACK.
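
For context, a minimal sketch of creating a buffer on such a custom heap and mapping it persistently might look like the following. The variable names are illustrative, and a valid device pointer and buffer size are assumed to exist elsewhere.

// Assumes a valid ID3D12Device* device and a bufferSize value computed elsewhere.
D3D12_HEAP_PROPERTIES hp = {};
hp.Type                 = D3D12_HEAP_TYPE_CUSTOM;
hp.CPUPageProperty      = D3D12_CPU_PAGE_PROPERTY_WRITE_BACK;
hp.MemoryPoolPreference = D3D12_MEMORY_POOL_L0;

D3D12_RESOURCE_DESC rd = {};
rd.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
rd.Width            = bufferSize;
rd.Height           = 1;
rd.DepthOrArraySize = 1;
rd.MipLevels        = 1;
rd.Format           = DXGI_FORMAT_UNKNOWN;
rd.SampleDesc.Count = 1;
rd.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

ComPtr<ID3D12Resource> uploadBuffer;
HRESULT hr = device->CreateCommittedResource(
    &hp, D3D12_HEAP_FLAG_NONE, &rd,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
    IID_PPV_ARGS(&uploadBuffer));

// Map once and keep the pointer for the lifetime of the buffer (persistent mapping);
// with WRITE_BACK pages, occasional CPU reads through this pointer are no longer punishing.
void* mapped = nullptr;
if (SUCCEEDED(hr))
    uploadBuffer->Map(0, nullptr, &mapped);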

Shaders

One nice thing I found when working with Direct3D 12 is that it allows you to create one piece of code containing both the vertex and pixel shaders, unlike OpenGL, where each shader stage needs its own source with its own main function. Direct3D 12 can do that because it allows you to specify the entry-point function when compiling the shaders:

// Create the pipeline state, which includes compiling and loading shaders.
#if defined(_DEBUG)
UINT f = D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION;
#else
UINT f = 0;
#endif
f |= D3DCOMPILE_WARNINGS_ARE_ERRORS;
ComPtr<ID3DBlob> errs;
ComPtr<ID3DBlob> vShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "vsMain", "vs_5_0", f, 0, &vShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}
ComPtr<ID3DBlob> pShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "psMain", "ps_5_0", f, 0, &pShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}

Notice the strings "vsMain" and "psMain" in the above code. These are the main functions for the vertex shader and pixel shader, respectively.
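
For illustration only (this is not the sample's actual shader source), a single string holding both entry points might look something like this:

#include <string>

// Illustrative combined shader source; "vsMain" and "psMain" are the entry
// points passed to D3DCompile for the vs_5_0 and ps_5_0 targets, respectively.
static const std::string shader = R"hlsl(
cbuffer Constants : register(b0)
{
    float4x4 MVP;
};

struct VSOut
{
    float4 pos : SV_POSITION;
    float4 clr : COLOR;
};

VSOut vsMain(float3 pos : POSITION, float4 clr : COLOR)
{
    VSOut o;
    o.pos = mul(MVP, float4(pos, 1.0f));
    o.clr = clr;
    return o;
}

float4 psMain(VSOut i) : SV_TARGET
{
    return i.clr;
}
)hlsl";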

Rendering

Now that the GPU resources are allocated and initialized (including the vertex upload buffer), we can finally turn our attention to actually rendering the particles, and again we see the verbosity of Direct3D 12. OpenGL gets the job done in just 19 lines of code, while Direct3D 12 bloats that into 105 lines. You can examine this for yourself by checking out the D3D12::operator() method.

Vertex Buffer View

There is one thing that was a little surprising to me. When developing the software using the Direct3D 12 Debug Layers and the Warp software rasterizer (see the next section on GPU compute for details) I was getting an error message complaining that I was overrunning the buffer bounds. This puzzled me for a while until I realized that when mapping the attributes in the vertex buffer, I needed to account for their offsets in the structure when determining their buffer length. You can see this in the following code fragment:

// Configure the vertex array pointers
auto start  = _.d3d12.vertexBuffer->GetGPUVirtualAddress();
UINT size   = UINT(_.memory.stride);
UINT stride = UINT(sizeof(*begin));
D3D12_VERTEX_BUFFER_VIEW vertexBufferViews[] = {
    { start + ((char*)&begin->pos - _.memory.ptr), size - offsetof(item,pos), stride },
    { start + ((char*)&begin->clr - _.memory.ptr), size - offsetof(item,clr), stride },
    { start + ((char*)&begin->vel - _.memory.ptr), size - offsetof(item,vel), stride },
    { start + ((char*)&begin->age - _.memory.ptr), size - offsetof(item,age), stride },
};

Notice how we subtract each attribute's starting offset from the buffer size via the offsetof() macro to avoid this buffer overrun warning.
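
To make the arithmetic concrete, here is a small, self-contained illustration of why each view's size must shrink by the attribute's offset. The item layout below is hypothetical and does not match the sample's exact structure.

#include <cstddef>
#include <cstdint>

struct item { float pos[3]; uint32_t clr; float vel[3]; float age; };

// Suppose the upload buffer holds N tightly packed items.
constexpr size_t N          = 1024;
constexpr size_t bufferSize = N * sizeof(item);

// The "clr" view starts offsetof(item, clr) bytes into the buffer, so only
// bufferSize - offsetof(item, clr) bytes remain readable from that start
// address. Declaring the full bufferSize as the view size would overrun the
// resource by offsetof(item, clr) bytes, which is what the debug layer reports.
constexpr size_t clrViewSize = bufferSize - offsetof(item, clr);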

Results

Other than these few gotchas, things went pretty much as described in the Direct3D 12 tutorials.

Render Direct3D 12

GPU Compute

Introduction

Now that we have finished rendering the particle system calculated on the CPU, let us consider what it would take to move this to the GPU for the ultimate in parallel programming.

After struggling to get every single field in every single structure used in the Direct3D 12 CPU portion correct and consistent, the first thing I considered was the prospect of two to three times more work for the GPU compute problem. I quickly decided the right thing to do was to look for some kind of helper framework.

MiniEngine*

I thought that there would be a multitude to choose from on the Internet, but my Google searches were only turning up entire game engines (Unity*, Unreal*, and Oryol*), which really are too abstract for what I was doing. I wanted something that was Direct3D 12 specific, and I was beginning to think that I would actually have to write the thousands of lines of code myself, before finally discovering a small team inside Microsoft that seems to be solely focused on DirectX* 12 and graphics education. I found their YouTube* channel, and from there I found their GitHub* repository.

They have a set of standard how-to samples for Direct3D 12 like most other tutorial sites, but they also have the MiniEngine: A DirectX 12 Engine Starter Kit. Looking through this, it seemed to be exactly what I was looking for: a framework taking the drudgery out of using the Direct3D 12 API, but also small and simple enough to keep things straightforward to understand and use.

The code that accompanies this article does not include the MiniEngine*. Instead, it is a project created by the MiniEngine’s createNewSolution.bat file.

Installation

In order to use the accompanying code you first need to download MiniEngine:

download MiniEngine

Install MiniEngine into the gpu/MiniEngine folder like so:

MiniEngine folder

Direct3D 12 Debug Layers

Install the Direct3D 12 Debug Layers development tool, which is not included in the SDK installed by Microsoft Visual Studio*; instead, it is an optional Windows* feature. Install it by going to the Windows® 10 Settings application and choosing Apps.

Direct3D 12 Debug Layers

Click on Manage optional features and choose Add a feature.

Manage optional features

Choose Graphics Tools, click Install, and then wait for the install to complete:

Add a feature

Manage optional features

Go ahead and open the code sample by double-clicking on the gpu\MiniEngine\app\app_VS15.sln file. Once Visual Studio 2017 is finished launching, we need to make two tiny tweaks to the MiniEngine code to align things better to what we want to do.

Customize MiniEngine

The first thing is to convince MiniEngine to use the software Warp rasterizer in debug mode, since that driver contains a lot of logging code to keep us informed if we are not doing things quite right. To do this, open the Core/Source Files/Graphics/GraphicsCore.cpp file and navigate to line 322, and change it from this:

static const bool bUseWarpDriver = false;

to this:

#ifdef _DEBUG
    static const bool bUseWarpDriver = true;
#else
    static const bool bUseWarpDriver = false;
#endif

Secondly, the MiniEngine is hard-coded to a few fixed resolutions. We want to run with our window at 800 x 600 to match the size of the window created in the CPU section. To do this we need to navigate to line 137 and change the code there from this:

switch (eResolution((int)TargetResolution))
        {
        default:
        case k720p:
            NativeWidth = 1280;
            NativeHeight = 720;
            break;
        case k900p:
            NativeWidth = 1600;
            NativeHeight = 900;
            break;
        case k1080p:
            NativeWidth = 1920;
            NativeHeight = 1080;
            break;
        case k1440p:
            NativeWidth = 2560;
            NativeHeight = 1440;
            break;
        case k1800p:
            NativeWidth = 3200;
            NativeHeight = 1800;
            break;
        case k2160p:
            NativeWidth = 3840;
            NativeHeight = 2160;
            break;
        }

to this:

#if 0
        switch (eResolution((int)TargetResolution))
        {
        default:
        case k720p:
            NativeWidth = 1280;
            NativeHeight = 720;
            break;
        case k900p:
            NativeWidth = 1600;
            NativeHeight = 900;
            break;
        case k1080p:
            NativeWidth = 1920;
            NativeHeight = 1080;
            break;
        case k1440p:
            NativeWidth = 2560;
            NativeHeight = 1440;
            break;
        case k1800p:
            NativeWidth = 3200;
            NativeHeight = 1800;
            break;
        case k2160p:
            NativeWidth = 3840;
            NativeHeight = 2160;
            break;
        }
#else
        NativeWidth = g_DisplayWidth;
        NativeHeight = g_DisplayHeight;
#endif

After this change the size of the rendering surface will track the size of the window, and we can control the size of the window with the g_DisplayWidth and g_DisplayHeight global variables.

Sample Code

Now that we have the MiniEngine configured the way we like it, let us turn our attention to what we need to do to actually render something using it. Notice that it encapsulates all of the verbose Direct3D 12 config structures into nice C++ classes, and implements a malloc-type heap for Direct3D 12 Resources. This, combined with its automatic pipelining of resources, makes it very easy to use. The 500+ line rendering code in the CPU section is reduced to about 38 lines of setup code and 31 lines of rendering code (69 lines total). This is a huge improvement!

Setup Code

// configure root signature
graphic.rootSig.Reset(1, 0);
graphic.rootSig[0].InitAsConstantBuffer(0, D3D12_SHADER_VISIBILITY_ALL);
graphic.rootSig.Finalize(L"Graphic", D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

// configure the vertex inputs
D3D12_INPUT_ELEMENT_DESC vertElem[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, pos), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "COLOR",    0, DXGI_FORMAT_R8G8B8A8_UNORM,  0, offsetof(particles::item, clr), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "VELOCITY", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, vel), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "AGE",      0, DXGI_FORMAT_R32_FLOAT,       0, offsetof(particles::item, age), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
};

// query the MiniEngine formats
DXGI_FORMAT ColorFormat = g_SceneColorBuffer.GetFormat();
DXGI_FORMAT DepthFormat = g_SceneDepthBuffer.GetFormat();

// configure the PSO
graphic.pso.SetRootSignature(graphic.rootSig);
graphic.pso.SetRasterizerState(RasterizerDefault);
graphic.pso.SetBlendState(BlendDisable);
graphic.pso.SetDepthStencilState(DepthStateReadWrite);
graphic.pso.SetInputLayout(_countof(vertElem), vertElem);
graphic.pso.SetPrimitiveTopologyType(D3D12_PRIMITIVE_TOPOLOGY_TYPE_POINT);
graphic.pso.SetRenderTargetFormats(1, &ColorFormat, DepthFormat);
graphic.pso.SetVertexShader(g_pGraphicVS, sizeof(g_pGraphicVS));
graphic.pso.SetPixelShader(g_pGraphicPS, sizeof(g_pGraphicPS));
graphic.pso.Finalize();

// set view and projection matrices
DirectX::XMStoreFloat4x4(
    &graphic.view, DirectX::XMMatrixLookAtLH({ 0.f,-4.5f,2.f }, { 0.f,0.f,-0.3f }, { 0.f,0.f,1.f }));
DirectX::XMStoreFloat4x4(&graphic.proj, DirectX::XMMatrixPerspectiveFovLH(DirectX::XMConvertToRadians(45.f), float(g_DisplayWidth)/g_DisplayHeight, 0.01f, 1000.f));
}

Rendering Code

// render graphics
GraphicsContext& context = GraphicsContext::Begin(L"Scene Render");

// transition
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
context.TransitionResource(g_SceneColorBuffer, D3D12_RESOURCE_STATE_RENDER_TARGET);
context.TransitionResource(g_SceneDepthBuffer, D3D12_RESOURCE_STATE_DEPTH_WRITE, true);

// configure
context.SetRootSignature(graphic.rootSig);
context.SetViewportAndScissor(0, 0, g_SceneColorBuffer.GetWidth(), g_SceneColorBuffer.GetHeight());
context.SetRenderTarget(g_SceneColorBuffer.GetRTV(), g_SceneDepthBuffer.GetDSV());

// clear
context.ClearColor(g_SceneColorBuffer);
context.ClearDepth(g_SceneDepthBuffer);

// update
struct { DirectX::XMFLOAT4X4 MVP; } vsConstants;
DirectX::XMMATRIX view = DirectX::XMLoadFloat4x4(&graphic.view);
DirectX::XMMATRIX proj = DirectX::XMLoadFloat4x4(&graphic.proj);
DirectX::XMMATRIX mvp = view * proj;
DirectX::XMStoreFloat4x4(&vsConstants.MVP, DirectX::XMMatrixTranspose(mvp));
context.SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);

// draw
context.SetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_POINTLIST);
context.SetVertexBuffer(0, readBuffer.VertexBufferView());
context.SetPipelineState(graphic.pso);
context.Draw(particles::nParticles, 0);

// finish
context.Finish();

CPU

Pipelining and Rendering

Now that we have the 69 lines of rendering code, let us think about what we need to do for GPU compute. My thinking is that we should have two buffers of particle data and render one while the other is being updated for the next frame by GPU compute. You can see this in the code:

    static constexpr int    Frames = 2;
    StructuredBuffer        vertexBuffer[Frames];
    int                     current = 0;

and in the rendering code like this:

    // advance pipeline
    auto& readBuffer = vertexBuffer[current];
    current = (current + 1) % Frames;
    auto& writeBuffer = vertexBuffer[current];

GPU

Introduction

Now we have to think about how to implement a particle rendering system on the GPU. An important part of the CPU algorithm is a sorting/partitioning step after the update() that collects all dead particles together at the end of the pool, making it fast to emit new ones. At first you might think we need to replicate that step on the GPU, which is technically possible via a bitonic sort algorithm (MiniEngine actually contains an implementation of this algorithm). On further thought, however, the sort is only required if you want fast looping over the particle pool when emitting.

On the GPU this loop is not required; it is replaced by a GPU thread being launched to process each particle in the pool in parallel (remember, the title of this article is Parallel Processing with DirectX 3D* 12). Knowing this, all that is actually needed is for each thread to have access to a global count of particles to emit each frame. Each thread examines its data to see if its particle is available for emitting, and if it is, it atomically gets and decrements the global counter. If it gets a value that is positive, the thread goes ahead and emits the particle for further processing; otherwise the thread does nothing.
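
To illustrate the get-and-decrement pattern outside of HLSL, here is a minimal CPU-side C++ analogue; it is purely illustrative, and the actual shader uses the structured buffer's hidden counter shown below.

#include <atomic>

// Hypothetical per-frame emission budget shared by all worker threads.
std::atomic<int> particlesToEmit{ 128 };

// One "thread" per dead particle runs something like this each frame.
bool TryClaimEmissionSlot()
{
    // fetch_sub returns the previous value; a positive previous value means
    // this thread won one of the remaining emission slots for the frame.
    return particlesToEmit.fetch_sub(1, std::memory_order_relaxed) > 0;
}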

Atomic Counter

If only Direct3D 12 had an atomic counter available and easily accessed by the compute shader …. Hmmm.

The compute shader RWStructuredBuffer data type has an optional hidden counter variable. Examining the MiniEngine source code reveals that it implements this optional feature and wraps it in a convenient member function:

void CommandContext::WriteBuffer( GpuResource& Dest, size_t DestOffset, const void* BufferData, size_t NumBytes )

This makes the compute C++ rendering code straightforward, as shown:

// render compute
ComputeContext& context = ComputeContext::Begin(L"Scene Compute");

// update counter
context.WriteBuffer(writeBuffer.GetCounterBuffer(), 0, &flow.num2Create, 4);

// compute
context.SetRootSignature(compute.rootSig);
context.SetPipelineState(compute.pso);
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE);
context.TransitionResource(writeBuffer, D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
context.SetConstant(0, particles::nParticles, 0);
context.SetConstant(0, flow.dt,               1);
context.SetDynamicDescriptor(1, 0, readBuffer.GetSRV());
context.SetDynamicDescriptor(2, 0, writeBuffer.GetUAV());
context.Dispatch((particles::nParticles+255)/256, 1, 1);
context.InsertUAVBarrier(writeBuffer);

// finish
context.Finish();

The corresponding compute shader code:

[numthreads(256,1,1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // initialize random number generator
    rand_seed(DTid.x);

    // get the particle
    Particle p = I[DTid.x];

    // compute the particle if it's alive
    if (p.age >= 0)
        compute(p);

    // otherwise initialize the particle if we got one from the pool
    else if (!(O.DecrementCounter()>>31))
        initialize(p);

    // write the particle
    O[DTid.x] = p;
}

Results

Just like that, we have a compute version of our particle system in only an additional 20 lines of C++ code.

App

Social Media Can Be an Indie’s Best Friend


The original article is published by Intel Game Dev on VentureBeat*: Social media can be an indie’s best friend. Get more game dev news and related topics from Intel on VentureBeat.

Pepper Grinder in game screenshot

When Riv Hester started sharing snapshots of an early prototype on Twitter, he didn’t think much of it. But as the number of retweets reached the hundreds (and sometimes thousands), he realized that this was the most publicity any of his games had ever received. He had to make a decision.

"So I was like, 'Crap, guess I’m working on this now,'" said the solo developer.

Hester put aside his other projects in favor of finishing the "drill game," which he later christened Pepper Grinder. In the 2D platformer, Pepper is a blue-haired treasure hunter who loses her precious pile of loot after her boat crashes on a mysterious island. Armed with her trusty drill, she sets out to find the barbaric thieves (mutated narwhals who live on the island) and take back her treasure chests.

Pepper Grinder will be the first retail release for Hester, who, outside of working on his own games and participating at game jams, doesn't have formal development experience. He studied fine art and animation in college, and didn't see game-making as a viable career until he was about to graduate. In recent years, companies like Epic Games and Unity have made their game engines free, so it's easier than ever for anyone to dip their toes in development.

Even at this early stage (Hester doesn't plan on releasing the game till 2018), Pepper Grinder is shaping up to be something special. Aside from its popularity on social media, it recently won the Best Use of Physics and Best Platformer awards from the annual Intel® Level Up Game Developer Contest.

Pepper Grinder in game screenshot
Above: Don't you wish you had a drill that was half your size?

Building the perfect drill

The source of Pepper Grinder's nascent success is the compelling digging mechanic. Whenever you turn the drill on, it launches you into the air, and you can carry that momentum with you when you drill into and out of the dirt. Exiting gives you a speed boost, so if you time your movements right, you barely have to touch the ground as you jump from one dirt tunnel to the next.

It's almost hypnotic.

"I think when I first hit on the idea of Pepper Grinder, I was watching a speedrun of Ecco the Dolphin. I was just thinking about that kind of movement — diving in and out of the water," said Hester. "And I was wondering how that could be expanded on or changed. I'm not really sure how I drew the connection in my head, but it wound up being Ecco plus Dig Dug in a platformer."

Hester managed to nail down the basic feel and behavior of the drill within the first week of prototyping the game. But even though he's mostly satisfied with it, he might continue to make small tweaks between now and Pepper Grinder's release date.

The grappling hook — introduced early on in the game — was a bit more complicated. Hester experimented with it a lot before he found something that could complement the drilling. The result is just as thrilling: If you manage to hook onto a swingable surface shortly after leaving a tunnel, the speed boost will carry over into the swing and launch you up even higher than usual.

This leads to some intriguing platforming challenges, especially when you're surrounded by deadly thorns.

"The gameplay is all defined by numbers, and you're constantly going back and forth between tweaking those numbers and then adjusting things like how high platforms can be in your levels, playing through with them, seeing how it feels, and then going back through that process again," said Hester.

"The math and the numbers and all that stuff — that's what makes it tick. But what's important is how they're expressed in what the player is doing."

Pepper Grinder in game screenshot
Above: No pressure.

The Twitter effect

Originally, Hester began sharing his work on Twitter (@Ahr_Ech) looking for feedback from other developers, which is difficult to get when you mostly work by yourself. But he didn't expect to see such a massive growth in the number of 'likes' and responses from both devs and fans. The experience made him realize that when you're a one-man developer, you also have to be your loudest cheerleader.

"It's kind of a weird crash course in marketing," Hester said, laughing.

Over the past year, he noticed that GIFs tend to get more traction than YouTube links or screenshots (indie developers regularly promote their work using the #screenshotsaturday tag).

But he'll only post glimpses of the game if it makes sense. Sometimes, he'll spend days working on background processes that a player will never see, which doesn't translate well in social media pictures.

"As long as I have something interesting to show, I try to keep the content consistent and just keep people engaged," Hester said.

Deep Learning Training and Testing on a Single Node Intel® Xeon® Scalable Processor System Using Intel® Optimized Caffe*


I. Introduction

This document provides step-by-step instructions on how to train and test a deep learning model on a single-node Intel® Xeon® Scalable processor platform, using the Intel® Distribution of Caffe* framework with image recognition datasets (CIFAR10, MNIST). The instructions are beginner level, and both training and inference happen on the same system. The steps have been verified on Intel Xeon Scalable processors as well as Intel® Xeon Phi™ processor systems, but should work on any recent Intel Xeon processor-based system. None of the software pieces used in this document were performance optimized.

This document is targeted at a beginner-level audience who wants to learn how to proceed with training and testing a deep learning dataset using the Intel Distribution of Caffe framework once they have Intel Xeon processor-based hardware. The document assumes that the reader has basic Linux* knowledge and is familiar with the concepts of deep learning training. The instructions can be used as they are, or can be the foundation for enhancements and/or modifications.

This document is divided into seven major sections, including the introduction. Section II details the hardware and software bill of materials used to implement and verify the training. Section III covers installing CentOS Linux* as the base operating system. Sections IV and V cover the software suites that need to be installed to have all the tools, libraries, and compilers needed for the training. Sections VI and VII list the steps needed to build the framework and to train and test the model with two simple datasets.

The hardware and software bill of materials used for the verified implementation is given in Section II. Users can try a different configuration, but the configuration in Section II is recommended. Intel® Parallel Studio XE Cluster Edition provides most of the basic tools and libraries used to implement the steps in this document in a single package installation. Furthermore, starting with Intel Parallel Studio XE Cluster Edition from the beginning will shorten the learning curve for a multinode implementation of the same training and testing, as this software is instrumental in multinode implementations.

Similar follow-up documents with step-by-step instructions covering single-node benchmarking, multinode implementation, and other frameworks are expected to be published in the future.

II. Hardware and Software Bill of Materials

Item                                                    Manufacturer    Model/Version
Hardware
Intel® Server Chassis                                   Intel           R1208WT
Intel® Server Board                                     Intel           S2600WT
(2x) Intel® Xeon® Scalable processor                    Intel           Intel® Xeon® Gold 6148 processor
(6x) 32GB LRDIMM DDR4                                   Crucial*        CT32G4LFD4266
(1x) Intel® SSD 1.2TB                                   Intel           S3520
Software
CentOS Linux* Installation DVD                                          7.3.1611
Intel® Parallel Studio XE Cluster Edition                               2017.4
Intel® Distribution of Caffe*                                           MKL2017
Intel® Machine Learning Scaling Library for Linux* OS                   2017.1.016

III. Install the Linux* Operating System

This section requires the following software component: CentOS-7-x86_64-*1611.iso. The software can be downloaded from the CentOS website.

The DVD ISO was used for implementing and verifying the steps in this document, but the reader can use the Everything ISO or Minimal ISO if preferred.

  • Insert the CentOS* 7.3.1611 install disc/USB. Boot from the drive and select Install CentOS 7.
  • Select Date and Time.
  • If necessary, select Installation Destination.
    • Select the automatic partitioning option.
    • Click Done to return home. Accept all defaults for the partitioning wizard if prompted.
  • Select Network and host name.
    • Enter “<hostname>” as the hostname.
      • Click the Apply button for the hostname to take effect.
    • Select Ethernet enp3s0f3 and click Configure to setup the external interface.
      • From the General section, check Automatically connect to this network when it’s available.
      • Configure the external interface as necessary. Save and exit.
    • Select the toggle to ON for the interface.
    • Click Done to return home
  • Select “Software Selection”
    • In the box labeled “Base Environment” on the left side, select “Infrastructure server”.
    • Click Done to return home.
  • Wait until the Begin Installation button is available, which may take several minutes. Then click it to continue.
  • While waiting for the installation to finish, set the root password.
  • Click Reboot when the installation is complete.
  • Boot from the primary device.
  • Log in as root.

Note: The next steps can all be done from the command line. If you need a GUI version of CentOS, follow the steps in the Appendix.

Configure YUM*

If the public network implements a proxy server for internet access, Yellowdog Updater Modified* (YUM*) must be configured in order to use it.

  • Open the /etc/yum.conf file for editing.
  • Under the [main] section, append the following line:
    proxy=http://<address>:<port>
    where <address> is the address of the proxy server and <port> is the HTTP port.
  • Save the file and exit.

Disable updates and extras. Certain procedures in this document require packages to be built against the kernel. A future kernel update may break the compatibility of these built packages with the new kernel, so we disable repository updates and extras to provide further longevity to this document.

This document may not be used as is when CentOS updates to the next version. To use this document after such an update, it is necessary to redefine repository paths to point to CentOS 7.3 in the CentOS vault. To disable repository updates and extras:

yum-config-manager --disable updates --disable extras

Install EPEL*

Extra Packages for Enterprise Linux* (EPEL*) provides 100 percent high-quality add-on software packages for Linux distributions [7]. Installing the EPEL repository (typically with yum -y install epel-release) helps avoid error messages you might otherwise see during the Caffe install and build.

Install GNU* C Compiler

Check whether the GNU Compiler Collection* (GCC*) is installed. It should be part of the Development Tools install. You can check by typing:

gcc --version or whereis gcc

IV. Install Intel® Distribution of Caffe*

  • Install Intel distribution of Caffe prerequisites:
yum -y install git python-devel boost boost-devel cmake numpy \
gflags gflags-devel glog glog-devel protobuf \
protobuf-devel hdf5 hdf5-devel lmdb lmdb-devel leveldb leveldb-devel \
snappy-devel opencv opencv-devel
  • Install Intel® Machine Learning Scaling Library:

The Intel Machine Learning Scaling Library provides an efficient implementation of communication patterns used in deep learning (make sure to download the latest version from GitHub*; update the path below as needed):

yum -y install https://github.com/01org/MLSL/releases/download/v2017.1-Preview/intel-mlsl-devel-64-2017.1-016.x86_64.rpm

  • Append commands to source environments to the end of the system skeleton .bashrc.

The environment for the Intel Machine Learning Scaling Library may be loaded by sourcing environment scripts (EOF is End Of File):

cat >>/etc/skel/.bashrc <<EOF
#===== Intel Machine Learning Scaling Library ====
source /opt/intel/mlsl_2017.1.016/intel64/bin/mlslvars.sh
EOF

Configure HTTP and HTTPS Proxies:

If your network implements a proxy server for Internet access, configure the HTTP and HTTPS proxies to use it.

a. Run the following command to enable the proxy for HTTP and HTTPS:

cat >>/etc/skel/.bashrc <<EOF
#====== HTTP and HTTPs proxies ========
export http_proxy=http://<address>:<port>
export https_proxy=https://<address>:<port>
EOF

V. Install Intel® Parallel Studio XE 2017 Cluster Edition

Note: This section requires the following software component:

parallel_studio_xe_2017_update4.tgz

Get the Parallel Studio XE 2017 Cluster Edition product and license file.

Get the Parallel Studio XE 2017 Cluster Edition installation guide:

If you are just going through the document for educational purposes, you can use the 30-day trial version of Intel Parallel Studio XE. However, if you plan to use the basic software installation for long-term professional use and will be building upon this basic guide, then a licensed version is recommended.

If you are saving to a USB, you might have to save two separate Zip* files, and then do cat parallel_*zip*>psxe_update4.zip; then unzip psxe_update4.zip:

  • Install prerequisite packages:

yum -y install gtk2 redhat-lsb gcc gcc-c++ kernel-devel

  • Extract the installer:

tar -xzf parallel_studio_xe_2017_update4.tgz -C /usr/src

  • Install Intel Parallel Studio XE 2017 Cluster Edition:
    1. Start the installer

/usr/src/parallel_studio_xe_2017_update4/install.sh

2. Press Enter to continue.
3. Read the end-user license agreement. Press Space to scroll through each page and continue to the next prompt.
4. Type the word accept and press Enter.
5. Wait for the prerequisite check to finish. This check may take several minutes.
6. Follow the prompts to activate the license. Activation may take several minutes. Press Enter to continue.
7. Accept or decline involvement in the Intel® Software Improvement Program. Press Enter to continue.
8. Press Enter to begin configuring the installation.
9. Press Space to scroll, type 2, then press Enter.
10. Press 1 to deselect IA-32 architecture, then press Enter.
11. Press Enter once to proceed, then press Enter again to begin the installation.
12. If the prompt shown below appears, select ‘y’ and press Enter.

13. Wait for the installation to finish. Installation may take several minutes. When prompted, press Enter to complete the installation.

Set Up Environment Scripts

Append commands to source environments to the end of the system skeleton .bashrc.

Components of Intel Parallel Studio 2017 XE may be loaded by sourcing environment scripts:

cat >>/etc/skel/.bashrc <<EOF
#=== Intel Parallel Studio XE 2017 Update 4 ====
source /opt/intel/parallel_studio_xe_2017.*
EOF

Post Installation

VI. Build Intel Distribution of Caffe

You need to build the software so the source code is converted into an executable code that can be run on the platform.

  • Execute the following Git* commands to obtain the latest snapshot of Intel distribution of Caffe:

git clone https://github.com/intel/caffe.git intelcaffe

source /opt/intel/mlsl_2017.1.016/intel64/bin/mlslvars.sh

  • Building from make file (if this fails, try building from the cmake file as mentioned in the next bullet):

cd intelcaffe/

Make a copy of the Makefile.config.example:

cp Makefile.config.example Makefile.config

Open Makefile.config in your favorite editor and uncomment the USE_MLSL variable:

vi Makefile.config

USE_MLSL :=1

Execute the make command to build Intel distribution of Caffe:

make -j <number of cores> -k

  • Building from cmake (optional; use only if step above unsuccessful):

cd intelcaffe

mkdir build

cd build

Execute the following CMake command in order to prepare the build:

cmake .. -DBLAS=mkl -DUSE_MLSL=1 -DCPU_ONLY=1

Build Intel distribution of Caffe with multinode support and <#> as the number of cores. This step may take several minutes:

make -j <number of cores> -k

Note: The build of Intel distribution of Caffe will trigger the Intel® Math Kernel Library for machine learning to be downloaded to the intelcaffe/external/mkl/ directory and automatically configured.

VII. Train on Intel Distribution of Caffe and Test the Training [3, 6]

CIFAR10 dataset

  • Train the system:

cd ~/intelcaffe

Get CIFAR10 data:

./data/cifar10/get_cifar10.sh

Convert CIFAR10 data into leveldb format and compute image mean:

./examples/cifar10/create_cifar10.sh

  • Test the training:

./examples/cifar10/train_quick.sh

Expect approximately 75 percent accuracy.

MNIST dataset

  • Train the system:

cd ~/intelcaffe

Get MNIST data:

./data/mnist/get_mnist.sh

Convert MNIST data into leveldb format and compute image mean:

./examples/mnist/create_mnist.sh

  • Test the training:

./examples/mnist/train_lenet.sh

Expect approximately 99 percent accuracy.

Acknowledgement

Special thanks to my colleague, Anuya Welling, for documenting the steps in her reference design, which was used extensively in writing this document. All her help in resolving the various issues I faced during the process is also much appreciated. I would also like to acknowledge the helpful resources available on GitHub, which were highly instrumental in validating the steps mentioned in this document.

References

  1. Intel® Scalable System Framework (Intel® SSF) Reference Design. 2017.03.31
  2. Guide to multi-node training with Intel® Distribution of Caffe*
  3. Alex’s CIFAR-10 tutorial, Caffe style
  4. Intel Product Registration Center
  5. Multi-node CIFAR10
  6. Training LeNet on MNIST with Caffe
  7. How to Enable EPEL Repository for RHEL/CentOS 7.x/6.x/5.x
  8. https://software.intel.com/en-us/ai-academy/basics

Appendix

For the CentOS GUI installation, make sure you have the required Zip files on a thumb drive.

fdisk -l

#choose the sdb drive (or whatever name is shown for the USB drive)
mount /dev/sdb1 /media

ls /media
yum -y localinstall /media/unzip-6.0-15.el7.x86_64.rpm
cat /media/CentOS-7-x86_64-Everything-1611.zip.00* >cent7.zip
umount /media
unzip cent7.zip

ls /media
mkdir /media/cdrom
mount -o loop CentOS-7-x86_64-Everything-1611.iso /media/cdrom

vi /etc/yum.repos.d/CentOS-Media.repo
#change enabled=0 to 1
enabled=1

mv /etc/yum.repos.d/* /root/
mv CentOS-Media.repo /etc/yum.repos.d/
#Done because the network wasn't connected; if the network is connected, just set the proxies and it should work without moving the CentOS repo file
yum repolist
yum -y groupinstall "Development and Creative Workstation" ; yum -y groupinstall "Development Tools"
#This takes time

Once installation is done, type startx to start GUI version of CentOS.
