
OpenStack* Enhanced Platform Awareness: Feature Breakdown and Analysis


1 Introduction

OpenStack* Enhanced Platform Awareness (EPA) contributions from Intel and others enable fine-grained matching of workload requirements to platform capabilities. EPA features provide OpenStack with an improved understanding of the underlying platform hardware (HW), which allows it to accurately assign the workload to the best HW resource.

This paper explains the OpenStack EPA features listed in the table in section 1.3. Each feature is covered in isolation, with a brief description, configuration steps to enable the feature, and a short discussion of its benefits.

1.1 Audience and purpose

This document is intended to help readers understand the performance gains for each EPA feature in isolation. Each section has detailed information on how to configure a system to utilize the EPA feature in question.

This document focuses on EPA features available in the Newton* release. The precursor to this document can be found here.

1.2 EPA Features Covered

Feature Name | First OpenStack* Release | Description | Benefit | Performance Data
Host CPU feature request | Icehouse* | Expose host CPU features to OpenStack-managed guests | Guest can directly use CPU features instead of emulated CPU features | ~20% to ~40% improvement in guest computation
PCI passthrough | Havana* | Provide direct access to a physical or virtual PCI device | Avoids the latencies introduced by hypervisor and virtual switching layers | ~8% improvement in network throughput
Hugepage support | Kilo* | Use memory pages larger than the standard size | Fewer memory translations, requiring fewer cycles | ~10% to ~20% improvement in memory access speed
NUMA awareness | Juno* | Ensures virtual CPUs (vCPUs) executing processes and the memory used by these processes are on the same NUMA node | Ensures all memory accesses are local to the node, avoiding the limited cross-node memory bandwidth that adds latency to memory accesses | ~10% improvement in guest processing
I/O based NUMA scheduling | Kilo* | Creates an affinity that associates a VM with the same NUMA node as the PCI device passed into the VM | Delivers optimal performance when assigning a PCI device to a guest | ~25% improvement in network throughput for smaller packets
CPU pinning | Kilo* | Supports the pinning of VMs to physical processors | Prevents the scheduler from moving guest virtual CPUs to other host physical CPU cores, improving performance and determinism | ~10% to ~20% improvement in guest processing
CPU threading policies | Mitaka* | Provides control over how guests can use the host hyper-thread siblings | More fine-grained deployment of guests on HT-enabled systems | Up to ~50% improvement in guest processing
OVS-DPDK, neutron | Liberty* | An industry-standard virtual switch accelerated by DPDK | Accelerated virtual switching | ~900% throughput improvement

2 Test Configuration

Following is an overview of the environment that was used for testing the EPA features covered in this document.

2.1 Deployment

Several OpenStack deployment tools are available. Devstack, which is essentially a set of scripts used to configure and deploy each OpenStack service, was used to demonstrate the EPA features in this document. Devstack uses a single configuration file to determine the functionality of each node in your OpenStack cluster, and modifies each OpenStack service configuration file to reflect the user's requirements defined in that file.

To avoid dependency on a particular OpenStack deployment tool, this document notes the OpenStack configuration file that must be modified for each service.

2.2 Topology

Figure 1: Network topology

2.3 Hardware

Item | Description | Notes
Platform | Intel® Server System R1000WT Family |
Form factor | 1U rack |
Processor(s) | Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz | 55MB cache, with Hyper-Threading enabled
Cores | 22 physical cores/CPU | 44 hyper-threaded cores per CPU, 88 logical cores in total
Memory | 132G RAM | DDR4 2133
NICs | 2 x Intel® Ethernet Controller 10 Gigabit 82599 |
BIOS | SE5C610.86B.01.01.0019.101220160604 | Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) and Hyper-Threading enabled

2.4 Software

Item | Description | Notes
Host OS | Ubuntu* 16.04.1 LTS | 4.2.0 kernel
Hypervisor | Libvirt 3.1 / QEMU 2.5.0 |
Orchestration | OpenStack (Newton release) |
Virtual switch | Open vSwitch 2.5.0 |
Data plane development kit | DPDK 16.07 |
Guest OS | Ubuntu 16.04.1 LTS | 4.2.0 kernel

2.5 Traffic generator

An Ixia XG-12 traffic generator was used to generate the networking workload for some of the tests described in this document. To simulate a worst-case scenario from a networking perspective, 64-byte packets are used.

3 Host CPU Feature Request

This feature allows the user to expose a specific host CPU instruction set to a guest. Instead of the hypervisor emulating the CPU instruction set, the guest can directly access the host's CPU feature. While there are many host CPU features available, the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) instruction set is used in this example.

One sample use case would be a security application requiring a high level of cryptographic performance. This could be instrumented to leverage specific instructions such as Intel® AES-NI.

The following steps detail how to configure the host CPU feature request for this use case.

3.1 Configure the compute node

3.1.1 System configuration

Before a specific CPU feature is requested, the availability of the CPU instruction set should be checked using the cpuid instruction.
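For example, a quick way to confirm that the host CPU exposes the Intel AES-NI instruction set on a Linux* host (a minimal sketch; the relevant CPU flag is named aes):

grep -o aes /proc/cpuinfo | sort -u

If the instruction set is supported, aes is printed; no output means the host CPU does not expose the feature.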

3.1.2 Configure libvirt driver

The Nova* libvirt driver takes its configuration information from a section in the main Nova file /etc/nova/nova.conf. This allows for customization of certain Nova libvirt driver functionality.

For example:

[libvirt]
...
cpu_mode = host-model
virt_type = kvm

The cpu_mode option in /etc/nova/nova.conf can take one of the following values: none, host-passthrough, host-model, or custom.

host-model

Libvirt identifies the CPU model in the /usr/share/libvirt/cpu_map.xml file that most closely matches the host, and requests additional CPU flags to complete the match. This configuration provides the maximum functionality and performance, and maintains good reliability and compatibility if the guest is migrated to another host with slightly different host CPUs.

host-passthrough

Libvirt tells KVM to pass through the host CPU with no modifications. The difference between host-passthrough and host-model is that, instead of just matching feature flags, every last detail of the host CPU is matched. This gives the best performance, and can be important to some apps that check low-level CPU details, but it comes at a cost with respect to migration. The guest can only be migrated to a matching host CPU.

custom

You can explicitly specify one of the supported named models using the cpu_model configuration option.

3.2 Configure the Controller node

3.2.1 Enable the compute capabilities filter in Nova*

The Nova scheduler is responsible for deciding which compute node can satisfy the requirements of your guest. It does this using a set of filters; to enable this feature, simply add the ComputeCapabilitiesFilter.

During the scheduling phase, the ComputeCapabilitiesFilter in Nova compares the CPU features requested by the guest with the compute node CPU capabilities. This ensures that the guest is scheduled on a compute node that satisfies the guest's CPU feature request.

Nova filters are configured in /etc/nova/nova.conf

scheduler_default_filters = ...,ComputeCapabilitiesFilter,...

3.2.2 Create a Nova flavor that requests the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) for a VM

openstack flavor set <FLAVOR> --property hw:capabilities:cpu_info:features=aes <GUEST>

3.2.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST>

3.2.4 Performance benefit

This feature gives the guest direct access to a host CPU feature instead of the guest using an emulated CPU feature. It can deliver a double-digit performance improvement, depending on the size of the data buffer being used.

To demonstrate the benefit of this feature, a crypto workload (openssl speed -evp aes-256-cbc) is executed on guest A, which has not requested a host CPU feature, and on guest B, which has requested the host Intel AES-NI CPU feature. Guest A will use an emulated CPU feature, while guest B will use the host's CPU feature.
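A minimal version of this comparison (assuming the openssl tool is installed in both guests) is to run the same command in each guest and compare the reported throughput:

openssl speed -evp aes-256-cbc

Guest B, with the host Intel AES-NI feature exposed, should report markedly higher throughput, particularly at the larger block sizes.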

Figure 2: CPU feature request comparison

4 Sharing Host PCI Device with a Guest

In most cases the guest will require some form of network connectivity. To do this, OpenStack needs to create and configure a network interface card (NIC) for guest use. There are several methods of doing this. The one you choose depends on your cloud requirements. The table below highlights each option and their respective pros and cons.

 NIC EmulationPCI Passthrough (PF)SRIOV (VF)

Overview

Hypervisor fully emulates the PCI device

The full PCI device is allocated to the guest.

A PCI device VF is allocated to the guest

Guest sharing

Yes

No

Yes

Guest IO performance

Slow

Fast

Fast

Device emulation is performed by the hypervisor, which carries an obvious overhead. This overhead is worthwhile as long as the device needs to be shared by multiple guest operating systems. If sharing is not necessary, there are more efficient ways to give a guest access to a device.

Figure 3: Host to guest communication methods

The PCI passthrough feature in OpenStack gives the guest full access to and control of a physical PCI device. This mechanism can be used with any kind of PCI device: a NIC, a graphics processing unit (GPU), a hardware crypto accelerator such as Intel® QuickAssist Technology (QAT), or any other device that can be attached to a PCI bus.

An example use case for this feature would be to pass a PCI network interface to a guest, avoiding the latencies introduced by hypervisor and virtual switching layers. Instead, the guest will use the PCI device directly.

When a full PCI device is assigned to a guest, the hypervisor detaches the PCI device from the host OS and assigns it to the guest, which means the PCI device is no longer available to the host OS. A downside of PCI passthrough is that the full physical device is assigned to only one guest and cannot be shared, and guest migration is not currently supported.

4.1 Configure the compute node

4.1.1 System configuration

Enable VT-d in BIOS.

Add "intel_iommu=on" to the kernel boot line to enable IOMMU support in the kernel.

Edit this file: /etc/default/grub

GRUB_CMDLINE_LINUX="intel_iommu=on"
sudo update-grub
sudo reboot

To verify VT-d/IOMMU is enabled on your system:

sudo dmesg | grep IOMMU
[    0.000000] DMAR: IOMMU enabled
[    0.133339] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.133340] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.133341] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1

4.1.2 Configure your PCI whitelist

OpenStack uses a PCI whitelist to define which PCI devices are available to guests. There are several ways to define your PCI whitelist; here is one method.

The Nova PCI whitelist is configured in: /etc/nova/nova.conf

[DEFAULT]
pci_passthrough_whitelist={"address":"0000:02:00.1","vendor_id":"8086","physical_network":"default"}

4.1.3 Configure the PCI alias

As of the Newton release, you also need to configure the PCI alias on the compute node. This enables resizing a guest that has been allocated a PCI device.

Get the vendor and product ID of the PCI device:

sudo ethtool -i ens513f1 | grep bus-info
bus-info: 0000:02:00.1

sudo lspci -n | grep 02:00.1
02:00.1 0200: 8086:10fb (rev 01)

Nova PCI alias tags are configured in: /etc/nova/nova.conf

[DEFAULT]
pci_alias = {"vendor_id":"8086","product_id":"10fb","device_type":"type-PF","name":"nic"}

NOTE: To pass through a complete PCI device, you need to explicitly request a physical function in the pci_alias by setting the device_type = type-PF.

4.2 Configure the Controller Node

Nova scheduler is responsible for deciding which compute node can satisfy the requirements of your guest. It does this using a set of filters; to enable this feature add the PCI passthrough filter.

4.2.1 Enable the PCI passthrough filter in Nova

During the scheduling phase, the Nova PciPassthroughFilter filters compute nodes based on PCI devices they expose to the guest. This ensures that the guest is scheduled on a compute node that satisfies the guest’s PCI device request.

Nova filters are configured in: /etc/nova/nova.conf

scheduler_default_filters = ...,ComputeFilter,PciPassthroughFilter,...

NOTE: If you make changes to the nova.conf file on a running system, you will need to restart the Nova scheduler and Nova compute services.

4.2.2 Configure your PCI device alias

To make the requesting of a PCI device easier you can assign an alias to the PCI device. Define the PCI device information with an alias tag and then reference the alias tag in the Nova flavor.

Nova PCI alias tags are configured in: /etc/nova/nova.conf

Use the PCI device vendor and product ID obtained from step 4.1.3:

[DEFAULT]
pci_alias = {"vendor_id":"8086","product_id":"10fb","device_type":"type-PF","name":"nic"}

NOTE: To pass through a complete PCI device you must explicitly request a physical function in the pci_alias by setting the device_type = type-PF.

Modify Nova flavor

If you request a PCI passthrough for the guest, you also need to define a non-uniform memory access (NUMA) topology for the guest.

openstack flavor set <FLAVOR> --property  "pci_passthrough:alias"="nic:1"
openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0
openstack flavor set <FLAVOR> --property  hw:numa_mem.0=2048

Here, an existing flavor is modified to define a guest with a single NUMA node, one vCPU and 2G of RAM, and a single PCI physical device. You can create a new flavor if you need one.

4.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST>

4.4 Performance benefit

This feature allows a PCI device to be directly attached to the guest, removing the overhead of the hypervisor and virtual switching layers, delivering a single digit gain in throughput.

To demonstrate the benefit of this feature, the conventional path a packet takes via the hypervisor and virtual switch is compared with the optimal path, bypassing the hypervisor and virtual switch layers.

In these test scenarios, iperf3 is used to measure throughput, and ping (ICMP) is used to measure latency.

Figure 4: Guest PCI device throughput

Figure 5: Guest PCI device latency

4.5 PCI virtual function passthrough

The preceding section covered the passing of a physical PCI device to the guest. This section covers passing a virtual function to the guest.

Single root input output virtualization (SR-IOV) is a specification that allows a single PCI device to appear as multiple PCI devices. SR-IOV can virtualize a single PCIe Ethernet controller (NIC) to appear as multiple Ethernet controllers. You can directly assign each virtual NIC to a virtual machine (VM), bypassing the hypervisor and virtual switch layer. As a result, users are able to achieve low latency and near-line rate speeds. Of course, the total bandwidth of the physical PCI device will be shared between all allocated virtual functions.

The physical PCI device is referred to as the physical function (PF) and a virtual PCI device is referred to as a virtual function (VF). Virtual functions are lightweight functions that lack configuration resources.

The major benefit of this feature is that it makes it possible to run a large number of virtual machines per PCI device, which reduces the need for hardware and the resultant costs of space and power required by hardware devices.

4.6 Configure the Compute node

4.6.1 System configuration

Enable VT-d in BIOS.

Add "intel_iommu=on" to the kernel boot line to enable IOMMU support in the kernel. Edit this file: /etc/default/grub

GRUB_CMDLINE_LINUX="intel_iommu=on"
sudo update-grub
sudo reboot

To verify that VT-d/IOMMU is enabled on your system, execute the following command:

sudo dmesg | grep IOMMU
[    0.000000] DMAR: IOMMU enabled
[    0.133339] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.133340] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.133341] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1

4.6.2 Enable SR-IOV on a PCI device

There are several ways to enable SR-IOV on a PCI device. Here is a method to enable a single virtual function on a PCI Ethernet controller (ens803f1):

sudo su -c "echo 1 > /sys/class/net/ens803f1/device/sriov_numvfs"
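To verify that the virtual function was created (a sketch; the interface name and output will vary by system):

cat /sys/class/net/ens803f1/device/sriov_totalvfs
sudo lspci | grep "Virtual Function"

The first command reports how many VFs the device supports in total; the lspci output should list one entry per VF currently enabled.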

4.6.3 Configure your PCI whitelist

OpenStack uses a PCI whitelist to define which PCI devices are available to guests. There are several ways to define your PCI whitelist; here is one method.

The Nova PCI whitelist is configured in: /etc/nova/nova.conf

[default]
pci_passthrough_whitelist= {"address":"0000:02:10.1","vendor_id":"8086","physical_network":"default"}

4.6.4 Configure your PCI device alias

See section 4.2.2 for PCI device alias configuration.

NOTE: To pass through a virtual PCI device you just need to add the vendor and product ID for the device. If you use the PF PCI address, all associated VFs will be exposed to Nova.

4.7 Configure the controller node

4.7.1 Enable the PCI passthrough filter in Nova

Follow the steps described in section 4.2.1.

4.7.2 Configure your PCI device alias

To make the requesting of a PCI device easier you can assign an alias to the PCI device, define the PCI device information with an alias tag, and then reference the alias tag in the Nova flavor.

Nova PCI alias tags are configured in: /etc/nova/nova.conf

Use the vendor and product ID of the virtual function (obtained in the same way as in step 4.1.3):

[DEFAULT]
pci_alias = {"vendor_id":"8086","product_id":"10ed","name":"nic"}

NOTE: To pass through a virtual PCI device (VF), you just need to add the vendor and product ID of the VF.

Modify Nova flavor

If you request PCI passthrough for the guest, you also need to define a NUMA topology for the guest.

openstack flavor set <FLAVOR> --property  "pci_passthrough:alias"="nic:1"
openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0
openstack flavor set <FLAVOR> --property  hw:numa_mem.0=2048

Here, an existing flavor is modified to define a guest with a single NUMA node, one vCPU and 2G of RAM, and a single PCI physical device. You can create a new flavor if you need to.

4.7.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST-NAME>

5 Hugepage Support

5.1 Description

When a process uses memory, the CPU marks that RAM as used by the process. This memory is divided into chunks of 4KB, called pages. The CPU and operating system must track where in memory these pages are and to which process they belong. When processes use large amounts of memory, these lookups can take a long time; this is where hugepages come in. Depending on the processor, two different hugepage sizes can be used on the x86_64 architecture: 2MB or 1GB. Using these larger page sizes makes lookups much quicker.
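As a quick check, the current hugepage size and allocation on a Linux host can be inspected before and after configuration:

grep Huge /proc/meminfo

This reports, among other fields, HugePages_Total, HugePages_Free, and Hugepagesize.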

To show the value of hugepages in OpenStack, the sysbench benchmark suite is used along with two VMs: one with 2MB hugepages and one with regular 4KB pages.

5.2 Configuration

5.2.1 Compute host

First, enable hugepages on the compute host:

sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs nodev /mnt/huge
echo 8192 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

If 1GB hugepages are needed it is necessary to configure this at boot time through the GRUB command line. It is also possible to set 2MB hugepages at this stage.

GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=8"

Enable hugepages to work with KVM/QEMU and libvirt. First, edit the line in the qemu-kvm file:

vi /etc/default/qemu-kvm

#Edit the line to match the line below
KVM_HUGEPAGES=1

Now, tell libvirt where the hugepage table is mounted, and edit the security driver for libvirt. Add the hugetlbfs mount point to the cgroup_device_acl list:

vi /etc/libvirt/qemu.conf

security_driver = "none"
hugetlbfs_mount = "/mnt/huge"

cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero","/dev/random", "/dev/urandom","/dev/ptmx", "/dev/kvm", "/dev/kqemu","/dev/rtc", "/dev/hpet","/dev/net/tun","/dev/vfio/vfio","/mnt/huge"
]

Now restart libvirt-bin and the Nova compute service:

sudo service libvirt-bin restart
sudo service nova-compute restart

5.2.2 Test

Create a flavor that utilizes hugepages. For this benchmarking work, 2MB pages are utilized.

On the Controller node, run:

openstack flavor create hugepage_flavor --ram 4096 --disk 100 --vcpus 4
openstack flavor set hugepage_flavor --property hw:mem_page_size=2MB

To test hugepages, a VM running with regular 4KB pages and one using 2MB pages are needed. First, create the 2MB hugepage VM:

On the Controller node, run:

openstack server create --image ubu160410G --flavor hugepage_flavor \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 hugepage_vm

Now simply alter the above statement to create a 4KB page VM:

openstack server create --image ubu160410G --flavor smallpages \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 default_vm

Sysbench was used to benchmark the benefit of using hugepages within a VM. Sysbench is a benchmarking tool with multiple modes of operation, including CPU, memory, filesystem, and more. The memory mode is utilized to benchmark these VMs.

Install sysbench on both VMs:

sudo apt install sysbench

Run the command to benchmark memory:

sysbench --test=memory --memory-block-size=<BLOCK_SIZE> \
  --memory-total-size=<TOTAL_TRANSFER_SIZE> run

An example using a 100MB block size and a 50GB total transfer size:

sysbench --test=memory --memory-block-size=100M \
  --memory-total-size=50G run

5.3 Performance benefit

The graphs below show that there is a significant increase in performance when using 2MB hugepages instead of the default 4K memory pages, for this specific benchmark.

Figure 6: Hugepage time comparison

Figure 7: Hugepage operations per second comparison

6 NUMA Awareness

6.1 Description

NUMA, or non-uniform memory access, describes a system with more than one system bus. CPU and memory resources are grouped together into NUMA nodes. Communication between a CPU and memory within the same NUMA node is much faster than communication with memory on a remote node.
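The NUMA topology of the host can be inspected with numactl (a sketch, assuming the numactl package is installed):

sudo apt install numactl
numactl --hardware

The output lists each NUMA node along with its CPUs and local memory, which is useful when choosing hw:numa_cpus and hw:numa_mem values.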

To show the benefits of using NUMA awareness within VMs, sysbench is used.

6.2 Configuration

First, create a flavor that has the NUMA awareness property.

On the Controller node, run:

openstack flavor create numa_aware_flavor --vcpus 4 --disk 20 --ram 4096

openstack flavor set numa_aware_flavor --property hw:numa_mempolicy=strict \
  --property hw:numa_cpus.0=0,1,2,3 --property hw:numa_mem.0=4096

Create two VMs, one which is NUMA-aware and one which is not NUMA-aware.

On the Controller node, run:

openstack server create --image ubu160410G --flavor numa_aware_flavor \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 numa_aware_vm

openstack server create --image ubu160410G --flavor default \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 default_vm

The threads mode and the memory mode of sysbench are utilized in order to benchmark these VMs.

To install sysbench, log into the VMs created above, and run the following command:

sudo apt install sysbench

From the VMs run the following commands.

The command to benchmark threads is:

sysbench --test=threads --num-threads=256 --thread-yields=10000 \
  --thread-locks=128 run

The command to benchmark memory is:

sysbench --test=memory --memory-block-size=1K --memory-total-size=50G run

6.3 Performance benefit

The graph below shows an increase in both thread and memory performance when the NUMA awareness property is set, using the benchmarks described above.

Figure 8: NUMA awareness benchmarks

7 I/O Based NUMA Scheduling

The NUMA awareness feature described in section 6 details how to request a guest NUMA topology that matches the host NUMA topology. This ensures that all memory accesses are local to the NUMA node, and thus do not consume the very limited cross-node memory bandwidth, which adds latency to memory accesses.

However, this configuration does not take into consideration the locality of the I/O device providing data to the guest processing cores. For example, if guest vCPU cores are assigned to a particular NUMA node but the NIC transferring the data is local to another NUMA node, the result is reduced application performance.

Figure 9: Guest NUMA placement considerations

The above diagram highlights two guest placement configurations. In the good placement configuration, the guest physical CPU (pCPU) cores, memory allocation, and PCI device are all associated with the same NUMA node.

An optimal configuration is one where the guest's assigned PCI device, RAM allocation, and assigned pCPUs are all associated with the same NUMA node. This ensures that there is no cross-NUMA node memory traffic.

The configuration for this feature is similar to the configuration for PCI passthrough, described in sections 4.1 and 4.2.

NOTE: In section 4 a single NUMA node is requested for the guest, and its vCPU is bound to host NUMA node 0:

openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0

If the platform has only one PCI device and it is associated with NUMA node 1, the guest will fail to boot.
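The NUMA node a PCI device is associated with can be checked through sysfs (a sketch; the PCI address 0000:02:00.1 is the example device from section 4):

cat /sys/bus/pci/devices/0000:02:00.1/numa_node

A value of 0 or 1 identifies the local NUMA node; -1 means the platform did not report locality for the device.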

7.1 Benefit

The advantage of this feature is that the guest PCI device and pCPU cores are all associated with the same NUMA node, avoiding cross-node memory traffic. This can deliver a significant improvement in network throughput, especially for smaller packets.

To demonstrate the benefit of this feature, the network throughput of guest A, which uses a PCI NIC associated with its own NUMA node, is compared with that of guest B, which uses a PCI NIC associated with a remote NUMA node.

Figure 10: NUMA awareness throughput comparison

8 Configure Open vSwitch

8.1 Description

Vanilla Open vSwitch (OVS) is the default virtual switch in an OpenStack deployment.

OVS comes as standard in most, if not all, OpenStack deployment tools, such as Mirantis Fuel* and OpenStack Devstack.

8.1.1 Configure the controller node

Devstack deploys OpenStack based on a local.conf file. The details required by the local.conf file will change, based on your system, but an example for the Controller node is shown below:

OVS_LOG_DIR=/opt/stack/logs
OVS_BRIDGE_MAPPINGS="default:<bridge-name>"
PUBLIC_BRIDGE=br-ex

8.1.2 Configure the compute node

The parameters required for the Compute node are almost identical. Simply omit the PUBLIC_BRIDGE parameter:

OVS_LOG_DIR=/opt/stack/logs
OVS_BRIDGE_MAPPINGS="default:<bridge-name>"

To test vanilla OVS create a VM on the Compute Host, and use a traffic generator to send traffic to the VM. Have it sent back out through the host to the generator, and note the throughput.

The VM requires two networks to be connected to it in order for traffic to be sent up and then come back down. By default, OpenStack creates a single network which is usable by VMs on the host; that is, private.

Create a second network and subnet for the second NIC, and attach the subnet to the preexisting router:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range \ 11.0.0.0/24
openstack router add subnet router1 subnet2

When that is done create the VM:

openstack server create --image ubu160410G --flavor m1.small \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  vm_name

Now, configure the system to forward packets from the packet generator through the VMs and back to Ixia*.

The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks.

8.2 OVS-DPDK

8.2.1 Description

OVS-DPDK is used to measure the performance increase over vanilla OVS. To utilize OVS-DPDK you will need to set it up; in this case, OpenStack Devstack is used, and changing from vanilla OVS to OVS-DPDK requires some parameters to be changed within the local.conf file, followed by a restack of the node.

For this test, send traffic from the Ixia traffic generator through the VM hosted on the Compute node and back to Ixia. In this test case, OVS-DPDK only needs to be set up on the Compute node.

8.2.2 Configure the compute node

Within the local.conf file add the specific parameters as shown below:

OVS_DPDK_MODE=compute
OVS_NUM_HUGEPAGES=<num-hugepages>
OVS_DATAPATH_TYPE=netdev
#Create the OVS Openstack management bridge and give it a name
OVS_BRIDGE_MAPPINGS="default:<bridge>"
#Select the interfaces you wish to be handled by DPDK
OVS_DPDK_PORT_MAPPINGS="<interface>:<bridge>","<interface2>:<bridge>"

Now restack the Compute node. Once that is complete, the setup can be benchmarked.
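Before benchmarking, you can confirm that the integration bridge is using the DPDK (netdev) datapath (a sketch using standard OVS commands):

sudo ovs-vsctl get bridge br-int datapath_type

This should return netdev for an OVS-DPDK deployment; for the default kernel datapath the field is empty or reports system.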

8.2.3 Configure the controller node

You need two networks connected to the VM in order for traffic to be sent up and then come back down. By default, OpenStack creates a single network that is usable by VMs on the host; that is, private. A second network and subnet must be created for the second NIC, and the subnet attached to the preexisting router.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range \ 11.0.0.0/24
openstack router add subnet router1 subnet2

Once that is done, create the VM. To utilize DPDK the VM must use hugepages. Details on how to set up your Compute node to use hugepages are given in the “Hugepage Support” section.

On the Controller node run:

openstack server create --image ubu160410G --flavor hugepage_flavor \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  vm_name

Now configure the system to forward packets from Ixia through the VMs and back to Ixia. The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks.

Once that is complete run traffic through the Host.

8.3 Performance Benefits

The graph below highlights the performance gain when using a DPDK accelerated OVS.

Figure 11: Virtual switch throughput comparison

9 CPU Pinning

9.1 Description

CPU pinning allows a VM's vCPUs to be pinned to specific physical CPUs, so they are not moved around by the kernel scheduler. This increases the performance of the VM while the host is under heavy load: its processes will not be moved from CPU to CPU, and instead will run on the pinned CPUs.

9.2 Configuration

There are two ways to use this feature in Newton, either by editing the properties of a flavor, or editing the properties of an image file. Both are shown below.

openstack flavor set <FLAVOR_NAME> --property hw:cpu_policy=dedicated
openstack image set <IMAGE_ID> --property hw_cpu_policy=dedicated

For the following test the Ixia traffic generator is connected to the Compute Host. Two VMs with two vNICs are needed; one VM with core pinning enabled and one with it disabled. Two separate flavors were created with the only difference being the cpu_policy.

On the Controller node run:

openstack flavor create un_pinned --ram 4096 --disk 20 --vcpus 4
openstack flavor create pinned --ram 4096 --disk 20 --vcpus 4

There is no need to change the policy for the unpinned flavor as the default cpu_policy is ‘shared’. For the pinned flavor set the cpu_policy to ‘dedicated’.

On the Controller node run:

openstack flavor set pinned --property hw:cpu_policy=dedicated

Create a network and subnet for the second NIC and attach the subnet to the preexisting router.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range \ 11.0.0.0/24
openstack router add subnet router1 subnet2

Once this is complete create two VMs; one with core pinning enabled and one without.

On the Controller node, run:

openstack server create --image ubu160410G --flavor pinned \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  pinnedvm

openstack server create --image ubu160410G --flavor un_pinned \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  defaultvm
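Once the guests are booted, the pinning can be verified on the Compute node through libvirt (a sketch; find the libvirt instance name with virsh list):

virsh list
virsh vcpupin <instance_name>

For the pinned guest, each vCPU should map to a single host CPU, while the unpinned guest's vCPUs map to a range of host CPUs.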

Now, configure the system to forward packets from Ixia through the VMs and back to Ixia. The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks. Send traffic through both VMs while the host is idle and also while it is under stress, and graph the results. Use the Linux* ‘stress’ command. To do this, install stress on the Compute Host.

On Ubuntu simply run:

sudo apt-get install stress

The test run command in this benchmark is shown here:

stress --cpu 56 --io 4 --vm 2 --vm-bytes 128M --timeout 60s&

9.3 Performance benefit

The graph below highlights the performance gain when using the CPU pinning feature.

Figure 12: CPU pinning throughput comparison

10 CPU Thread Policies

10.1 Description

CPU thread policies work with CPU pinning to ensure that the performance of your VM is maximized. The isolate CPU thread policy allocates entire physical cores for use by a VM. While CPU pinning alone may allow Intel® Hyper-Threading Technology (Intel® HT Technology) siblings to be used by different processes, the isolate thread policy ensures that this cannot happen, and that a physical core never has more than one process trying to access it at a time. Similar to how CPU pinning was benchmarked, start by creating a new OpenStack flavor.

10.2 Configuration

On the Controller node, run:

openstack flavor create pinned_with_thread --ram 4096 --disk 20 --vcpus 4

Thread policies were created to work with CPU pinning, so add both CPU pinning and thread policies to this flavor.

On the Controller node, run:

openstack flavor set pinned_with_thread --property hw:cpu_policy=dedicated \
  --property hw:cpu_thread_policy=isolate

As is the case in the Pinning benchmark above, a second private network is needed to test this feature.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range \ 11.0.0.0/24
openstack router add subnet router1 subnet2

Now create the VM. This VM is benchmarked versus the two VMs created in the previous section.

On the Controller node, run:

openstack server create --image ubu160410G --flavor pinned_with_thread \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  pinned_thread_vm

Ensure that the system is configured to forward traffic from Ixia through the VM and back to Ixia; read the section Configure packet forwarding test with two virtual networks. Send traffic through the VM while the host is idle and while it is under stress. Use the Linux ‘stress’ command. To do this, install stress on the Compute Host:

sudo apt-get install stress

On the Compute Host run the following command:

stress --cpu 56 --io 4 --vm 2 --vm-bytes 128M --timeout 60s&

10.3 Performance benefit

The graph below shows that a pinned VM actually performs slightly better than the other VMs while the system is unstressed. However, when the system is stressed there is a large increase in performance for the thread isolated VM over the pinned VM.

Figure 13: CPU thread policy throughput comparison

11 Appendix

This section details some lessons learned while working on this paper.

11.1 Configure packet forwarding test with two virtual networks

This section details the setup for testing throughput in a VM. Here we use standard OVS and L2 forwarding in the VM.

Figure 14: Packet forwarding test topology

11.1.2 Host configuration

Standard OVS deployed by OpenStack uses two bridges: a physical bridge (br-ens787f1) to plug the physical NICs into, and an integration bridge (br-int) that the VM VNICs get plugged into.
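The bridge layout created by OpenStack can be inspected with the following command (bridge names will vary by host):

sudo ovs-vsctl show

This lists each bridge with its ports, which helps identify the physical and integration bridges referenced below.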

Plug in physical NICs:

sudo ovs-vsctl add-port br-ens787f1 ens803f1
sudo ovs-vsctl add-port br-ens787f1 ens803f0

Modify the rules on the integration bridge to allow traffic to and from the VM.

First, find out the port numbering in OVS ports on the integration bridge:

sudo ovs-ofctl show br-int
1(int-br-ens787f1): addr:12:36:84:3b:d3:7e
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 4(qvobf529352-2e): addr:b6:34:b5:bf:73:40
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 5(qvo70aa7875-b0): addr:92:96:06:8b:fe:b9
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:5a:c9:6e:f8:3a:40
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max

There are, however, issues with the default setup. If you attempt to have heavy traffic passed up to the VM and back down to the host through the same connection, OVS may raise an error, which may cause your system to crash. To overcome this you will need to add a second connection from the integration bridge to the physical bridge (see section 11.8, Patch ports).

Traffic going to the VM:

sudo ovs-ofctl add-flow br-int priority=10,in_port=1,actions=output:4

Traffic coming from the VM:

sudo ovs-ofctl add-flow br-int priority=10,in_port=5,actions=output:1

11.1.3 VM configuration

First, make sure there are two NICs up and running. This can be done manually or persistently by editing this file: /etc/network/interfaces

auto ens3
iface ens3 inet dhcp

auto ens4
iface ens4 inet dhcp

Then restart the network:

/etc/init.d/networking restart

Following this step there should be two running NICs in the VM.

By default, a system's routing table has just one default gateway: whichever NIC came up first. To access both VM networks, remove the default gateway. It is possible to add a second routing table instead, but removing the gateway is the easiest and quickest way. A downside of this approach is that you will not be able to communicate with the VM from the host, so you can use Horizon* for the remaining steps.
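A minimal way to drop the default gateway inside the VM (assuming the iproute2 tools present in Ubuntu guests):

sudo ip route del default

Run ip route afterwards to confirm that only the two directly connected subnets remain.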

Now, forward the traffic coming in on one NIC to the other NIC. L2 bridging is used for this:

ifconfig ens3 0.0.0.0
ifconfig ens4 0.0.0.0

brctl addbr br0
brctl stp br0 on
brctl addif br0 ens3
brctl addif br0 ens4
brctl show

ifconfig ens3 up
ifconfig ens4 up
ifconfig br0 up

The two VM NICs should now be added to br0.

11.1.4 Ixia configuration

There are two ports on the traffic generator. Let’s call them C10P3 and C10P4.

On C10P3, configure the source and destination MAC and IP:

SRC: 00:00:00:00:00:11, DST: 00:00:00:00:00:10
SRC: 11.0.0.100, DST: 10.0.0.100

On C10P4, configure the source and destination MAC and IP:

SRC: 00:00:00:00:00:10, DST: 00:00:00:00:00:11
SRC: 10.0.0.100, DST: 11.0.0.100

As a VLAN network is being used here, VLAN tags must be configured on the traffic generator ports.

Set them to the tags OpenStack has assigned; in this case, 1208.
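The tag assigned by OpenStack can be looked up on the Controller node (a sketch; provider:segmentation_id is the standard provider attribute for VLAN networks):

openstack network show private -c provider:segmentation_id

Use the value returned for each network as the VLAN ID on the corresponding Ixia port.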

Once these steps are complete you can start sending packets to the host, and you can verify that VM traffic is hitting the rules on the integration bridge by running the command:

watch -d sudo ovs-ofctl dump-flows br-int

You should see the packets received and packets sent increase on their respective flows.

11.2 AppArmor* issue

AppArmor* has many security features that may require additional configuration. One such issue: if you attempt to allocate hugepages to a VM, AppArmor will cause libvirtd to report a permission denied error. To get around this, edit the qemu.conf file and change the security driver field, as follows:

vi /etc/libvirt/qemu.conf

security_driver = "none"

11.3 Share host ssh public keys with the guest for direct access

openstack keypair create --public-key ~/.ssh/id_rsa.pub mykey
openstack keypair list

11.4 Add rules to default security group for icmp and ssh access to guests

openstack security group rule create --protocol icmp --ingress default
openstack security group rule create --protocol tcp --dst-port 22 \ --ingress default
openstack security group list
openstack security group show default

11.5 Boot a VM

openstack image list
openstack flavor list
openstack keypair list
openstack server create --image ubuntu1604 --flavor R4D6C4 \
  --security-group default --key-name mykey vm1
openstack server list

11.6 Resize an image filesystem

sudo apt install libguestfs-tools

View your image partitions:

sudo virt-filesystems --long -h --all -a ubuntu1604-5G.qcow2
Name       Type        VFS      Label  MBR  Size  Parent
/dev/sda1  filesystem  ext4     -      -    3.0G  -
/dev/sda2  filesystem  unknown  -      -    1.0K  -
/dev/sda5  filesystem  swap     -      -    2.0G  -
/dev/sda1  partition   -        -      83   3.0G  /dev/sda
/dev/sda2  partition   -        -      05   1.0K  /dev/sda
/dev/sda5  partition   -        -      82   2.0G  /dev/sda
/dev/sda   device      -        -      -    5.0G

Here’s how to expand /dev/sda1:

Create a 10G image template:

sudo truncate -r ubuntu1604-5G.qcow2 ubuntu1604-10G.qcow2

Extend the 5G image by 5G:

sudo truncate -s +5G ubuntu1604-10G.qcow2

Resize 5G image to 10G image template:

sudo virt-resize --expand /dev/sda1 /home/tester/ubuntu1604-5G.qcow2 \
  /home/tester/ubuntu1604-10G.qcow2

11.7 Expand the filesystem of a running Ubuntu* image

11.7.1 Delete existing partitions

sudo fdisk /dev/sda

Command (m for help): p

Disk /dev/sda: 268.4 GB, 268435456000 bytes
255 heads, 63 sectors/track, 32635 cylinders, total 524288000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e49fa

   Device Boot  	Start     	End  	Blocks   Id  System
/dev/sda1   *    	2048   192940031	96468992   83  Linux
/dev/sda2   	192942078   209713151 	8385537	5  Extended
/dev/sda5   	192942080   209713151 	8385536   82  Linux swap / Solaris

Command (m for help): d
Partition number (1-5): 1

Command (m for help): d
Partition number (1-5): 2

11.7.2 Create new partitions

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1):
Using default value 1
First sector (2048-524287999, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-524287999, default 524287999): 507516925

Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): e
Partition number (1-4, default 2): 2
First sector (507516926-524287999, default 507516926):
Using default value 507516926
Last sector, +sectors or +size{K,M,G} (507516926-524287999, default 524287999):
Using default value 524287999

Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): l
Adding logical partition 5
First sector (507518974-524287999, default 507518974):
Using default value 507518974
Last sector, +sectors or +size{K,M,G} (507518974-524287999, default 524287999):
Using default value 524287999

11.7.3 Change logical partition to SWAP

Command (m for help): t
Partition number (1-5): 5

Hex code (type L to list codes): 82
Changed system type of partition 5 to 82 (Linux swap / Solaris)

11.7.4 View new partitions

Command (m for help): p

Disk /dev/sda: 268.4 GB, 268435456000 bytes
255 heads, 63 sectors/track, 32635 cylinders, total 524288000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e49fa

   Device Boot  	Start     	End  	Blocks   Id  System
/dev/sda1        	2048   507516925   253757439   83  Linux
/dev/sda2   	507516926   524287999 	8385537	5  Extended
/dev/sda5   	507518974   524287999 	8384513   82  Linux swap / Solaris

11.7.5 Write changes

Command (m for help): w
The partition table has been altered!

FYI: Ignore any errors or warnings at this point and reboot the system:

sudo reboot

11.7.6 Increase the filesystem size

sudo resize2fs /dev/sda1

11.7.7 Activate SWAP

sudo mkswap /dev/sda5
sudo swapon --all --verbose
swapon on /dev/sda5

11.8 Patch ports

Patch ports can be used to create links between OVS bridges. They are useful when you are running the benchmarks that require traffic to be sent up to a VM and back out to a traffic generator. OpenStack only creates one link between the bridges, and having traffic going up and down the same link can cause issues.

To create a patch port, ‘patch1’, on the bridge ‘br-eno2’, which has a peer called ‘patch2’, do the following:

sudo ovs-vsctl add-port br-eno2 patch1 -- set Interface patch1 \
  type=patch options:peer=patch2

To create a patch port, ‘patch2’, on the bridge ‘br-int’, which has a peer called ‘patch1’, do the following:

sudo ovs-vsctl add-port br-int patch2 -- set Interface patch2 \
  type=patch options:peer=patch1

12 References

http://docs.openstack.org/juno/config-reference/content/kvm.html

http://docs.openstack.org/mitaka/networking-guide/config-sriov.html

https://networkbuilders.intel.com/docs/OpenStack_EPA.pdf

http://www.slideshare.net/oraccha/toward-a-practical-hpc-cloud-performance-tuning-of-a-virtualized-hpc-cluster


Overview of Intel® Computer Vision SDK and How it Applies to IoT


What is Intel® Computer Vision SDK?

The Intel® Computer Vision SDK is an Intel-optimized and accelerated computer vision software development kit based on the OpenVX* standard. The SDK integrates pre-built OpenCV with deep learning support using an included Deep Learning (DL) Deployment toolkit.

About OpenVX* and the Khronos Group*

OpenVX* is an open, royalty-free standard for cross-platform acceleration of computer vision applications. OpenVX was defined by the Khronos Group*, an industry consortium.

The Khronos Group is a not-for-profit, member-funded consortium dedicated to the creation of royalty-free open standards for graphics, parallel computing, and vision processing. Intel joined the Khronos Group as a Promoter Member in March 2006.

OpenVX* Benefits

The OpenVX* API standardizes the application interface for computer vision applications. This enables performance- and power-optimized computer vision processing and allows the application layer to transparently use vendor-specific hardware optimization and acceleration, when available.

OpenVX* also specifies an API independent standard file format for exchanging deep learning data between training systems and inference engines, called the Neural Network Exchange Format (NNEF*). 

Using an extension of OpenVX*, developers can represent Convolutional Neural Network topologies as OpenVX* graphs. This allows developers to mix CNN with traditional vision functions.

Intel® CV SDK Contents

  • Intel-optimized implementation of the OpenVX* 1.1 API with custom extensions and kernels
  • Pre-built binaries of OpenCV with Intel® VTune™ Amplifier hooks for profiling
  • Vision Algorithm Designer (VAD) IDE tool
  • Deep Learning Model Optimizer tool
  • Deep Learning Inference Engine
  • Sample applications


Hardware and Software Requirements

Developers can program with the CV SDK using C/C++ on an Ubuntu* 64-bit development platform, using CMake to manage their builds and the GCC compiler.

The recommended development platform hardware is a 6th generation Intel® Core™ processor or better with integrated Iris® Pro Graphics or HD Graphics.

Target platforms include next-generation Intel Atom® processors (formerly known as Apollo Lake), Intel® Core™ processors, and Intel® Xeon® processors. The target processors have integrated Iris Pro Graphics or HD Graphics to use OpenCL™ GPU kernels.

Intel® CV SDK Development Benefits

Intel® Hardware Optimization and Acceleration

Intel® CV SDK, which is Intel's OpenVX* implementation, offers CPU kernels that are multi-threaded (with Intel® Threading Building Blocks) and vectorized (with Intel® Integrated Performance Primitives).

This optimized Intel® implementation of OpenCL™ supports Intel® GPUs on integrated Iris Pro or HD Graphics platforms.

Using Intel® CV SDK, developers can access early support for the new dedicated Image Processing Unit (IPU) on next-generation Intel Atom® processors (formerly Apollo Lake).

These new processors feature an integrated four-vector image-processing unit capable of supporting advanced vision functions and up to 4 concurrent HD IP cameras.

Custom Intel® Extensions

Intel® CV SDK extends the original OpenVX standard with specific APIs and many kernel extensions that allow developers to add performance-efficient (for example, tiled) versions of their own algorithms to the processing pipelines.

Heterogeneous Computing Support

Intel® CV SDK supports both task and data parallelism to maximize the use of all available compute resources, including the CPU, GPU, and the new dedicated IPU (Image Processing Unit).

Profiling Support

Intel® CV SDK includes a pre-built OpenCV implementation that integrates hooks for Instrumentation and Tracing Technology (ITT), which allows profiling of vision applications using Intel® VTune™ Amplifier.

Intel® CV SDK and IoT

Sight is one of the most important human senses; as much as 80% of our interaction with our environment is based on vision.

Until now, IoT relied on multiple sensors to perform basic telemetry and automation tasks because computer vision was expensive, complex, and inaccessible to most developers.

However, with the advent of cheap, HD cameras, processors with built-in CV accelerators and robust computer vision software stacks, there is a rising trend in the use of camera-based computer vision as an IoT sensor for multiple verticals.

Integration with machine learning and deep learning systems opens new application use cases for the use of computer vision in IoT and brings the power of embedded CNN and DNN to the edge.

Related Software:

Intel® VTune™ Amplifier – Advanced Intel toolkit for profiling, visualizing, and tuning multi-processor, multi-threaded or vectorized Intel® platforms which support Instrumentation and Tracing technology (ITT).

Intel® Vision Algorithm Designer – An IDE on top of OpenVX for the development of OpenVX algorithms, workloads, and capabilities in an intuitive and visual manner.

Intel® Deep Learning (DL) Deployment toolkit – A cross-platform DL model optimizer which helps integrate DL inference with application logic.

Intel® Deep Learning Inference Engine - supports inference operations on several popular image classification networks and the deployment of deep learning solutions by delivering a unified API to integrate the inference with application logic.

Intel® SDK for OpenCL™ applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Getting Started:

Quick Start Guide for Intel® Computer Vision SDK Beta

Intel's Deep Learning Deployment Toolkit Installation Guide

Intel® Enhanced Privacy ID (EPID) Security Technology


Introduction

With the increasing number of connected devices, the importance of security and user privacy has never been more relevant.  Protecting information content is critical to prevent exposure of trade secrets for businesses, identity theft for individuals, and countless other harmful scenarios that cost both money and time to repair.  Part of protecting data and privacy includes ensuring that the devices touching the data are authentic, have not been hijacked, or even replicated into a non-genuine piece of hardware. 

In this article we discuss the Intel® Enhanced Privacy Identification (Intel® EPID) security scheme, which helps address two device-level security issues: anonymity and membership revocation. Billions of existing devices, including most Intel® platforms manufactured since 2008, create signatures that need Intel® EPID verification. Intel is providing the Intel® EPID SDK as open source and encouraging device manufacturers to adopt it as an industry standard for device identity in IoT.

Security Review – Public Key Encryption and Infrastructure

When exchanging data between two people or systems, it is important to ensure that it arrives securely and is not forged. The recipient should have high confidence that the sender is who they say they are. One of the most widely used methods of ensuring this trusted data transport is employing a digital signature. One method of creating a digital signature is the Public Key Encryption (PKE) security scheme. Using a mathematical algorithm, two binary keys are generated that work together to encrypt and decrypt the data. Data that is encrypted (or, in this use case, signed) using the private key can only be decrypted (verified) using the matching public key. The private key is never shared with anyone, and the public key is available to everyone. This method guarantees that any data decrypted using a public key was indeed encrypted using the matching private key. For the most part, using Public Key Encryption for device authenticity works well; however, it does have a few limitations.
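As an illustration of the scheme, here is a minimal sign-and-verify flow using the openssl command line (file names are examples):

# Generate a private key and derive the matching public key
openssl genpkey -algorithm RSA -out private.pem
openssl pkey -in private.pem -pubout -out public.pem

# Sign data with the private key
openssl dgst -sha256 -sign private.pem -out data.sig data.txt

# Verify the signature with the public key
openssl dgst -sha256 -verify public.pem -signature data.sig data.txt

The verification step succeeds only if data.txt is unchanged and the signature was produced by the matching private key.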

Problem 1: Certifying Keys

The first limitation involves the validity of the sender’s key.  In order to verify a signature, the public key of the sender is required, however there is no way to guarantee it belongs to the sender, or has not been stolen or tampered with.  An additional step can be taken to ensure the validity of the public key which involves certification from a third party called an issuer.  Using Public Key Infrastructure (PKI), the level of security can be raised by introducing a new element called a digital certificate, which is signed by the issuer’s private key.  The certificate contains the public key of the member, the member’s name, and other optional details.  Using this method guarantees that the public key being used is the actual key issued, and hence is the actual sender of the data.  Think of an issuer as a notary who guarantees that this signature is correct because they witnessed the person writing it.   A digital certificate issued from a certified authority solves the problem of certifying that a public key is authentic.

Problem 2: Shielding Unique Identity

A second limitation with PKI is the inability to remain anonymous while still being granted access.  Because one public certificate contains the key owner’s name and information, the ownership of the secured information is inherently known, and if the same device is verified multiple times, its activity could be tracked.  Usage of PKI for signed and encrypted emails is useful in this scenario where it is desired for the users to be identified.  The recipient installs the public certificate of the sender, and when opening the message has a level of trust that the sender signed these emails using a protected matching private key. 

As devices increasingly play roles in requesting authentication to systems, there is a greater need for both devices and users to be anonymous.  While a valid attestation is required, it is not a requirement that the device be individually identified or that the device provide any additional details other than the minimum amount of information necessary to prove that they are a genuine member of a trusted device family.  Taking this approach allows devices to access a system based on their approved level of access and not any personal information like a MACID or who they are. In other words, in the Intel® EPID scheme, if a hundred authentic signatures are verified, the verifier would not be able to determine whether one hundred devices were authenticated, or if the same device was authenticated one hundred times.

Problem 3: Revoking Access

Yet another limitation with PKE is that there is no easy mechanism for revoking a private key that has been compromised. If anyone other than the user gains access to the private key, they can masquerade as that user, resulting in a loss of trust for the device. These private keys are often stored in a separate encrypted chip called a Trusted Platform Module (TPM). While this hardware trusted-module approach is much more secure, the existence of the private key on the device still creates the possibility that it can be stolen. Fixing the problem of a stolen key would involve issuing a new certificate and manually flashing a new private key onto the device. Adding the ability to easily revoke a compromised private key would allow a device to be flagged and disabled automatically, preventing any further identity theft.

Roles in a Public Key Infrastructure

CA – A Certified Authority is the entity that issues security certificates.
RA – The Registration Authority accepts requests for new certificates, ensures the authenticity of the requestor, and completes the registration process with the CA on the requestor’s behalf.
VA – A Validation Authority is a trusted third party that can validate a certificate on behalf of a Certificate Authority.
Member – The role of member can be assumed by an end user or a device. The member is the role that requests attestation of itself during secure exchanges.


Figure 1 - PKI Roles and Process Flow

Authentication vs Identification

Gaining access to a system should not always require user identification. The intent behind requesting access to a system is to obtain access; the requester should only need to provide the minimal, certifiable proof required for access to be granted. Each user might require a certain level of anonymity based on the specific use case. Take, as an example, a medical wristband device that monitors the sleep habits of someone experiencing insomnia. For the individual, it is important to ensure that the data is provided to the doctors for analysis without allowing anyone else to potentially identify them or their private medical data.

For users accessing services, the authentication process is owned by the access provider, which unfortunately often ties access rights directly to an account identifier, which in turn is usually linked to additional account details that the user may want to keep private. Most systems today require a user or device to identify themselves in a way that can be traced back to the original user with every transaction. An example in software would be a username. For a device, it might be a MAC ID or even a public key held in secure storage. To prevent this, a user must be able to effectively and rightfully use a system without being required to provide any information that can be immediately linked back to them. One example is a toll booth. A user should be able to pass through the booth because they were previously issued a valid RFID tag, but no personal information should be required, and the user is not directly identified or tracked in any transaction on that device. If the requirement is to trace who is traveling through the toll booth, that right is reserved by the access provider; however, there are instances when authentication should be separated from identification.

Direct Anonymous Attestation (DAA) is a security scheme proposed in 2004 that permits a device in a system to attest membership of a group while preserving the identity of the individual. Drafted by Ernie Brickell (Intel Corp), Jan Camenisch (IBM Research®), and Liqun Chen (HP Laboratories®), DAA is now approved by the Trusted Computing Group (TCG) as the recommended method for attestation of a remote device, and is outlined in ISO/IEC 20008.

What is EPID?

Enhanced Privacy Identification (EPID) is Intel’s implementation of ISO/IEC 20008 that addresses two of the problems with the PKI security scheme: anonymity and membership revocation. Intel has included EPID keys in many of its processors, starting with the series 5 chipsets in 2008, which cover all Intel® Core™ processor family products. In 2016, Intel, as a certified EPID Key Generation Facility, announced that it had distributed over 4.5 billion EPID keys since 2008.

The first improvement over PKI is “Direct Anonymous Attestation”: the ability to authenticate a device for a given level of access while allowing the device to remain anonymous. This is accomplished by introducing a group-level membership authentication scheme. Instead of a 1:1 public-to-private key assignment for an individual user, EPID allows a group of member private keys to be associated together and linked to one group public key. This EPID group public key can be used to verify the signature produced by any EPID member private key in the group. Most importantly, no one, including the issuer, has any way to know the identity of the user. Only the member device has access to its private key, and verification succeeds only with a properly provisioned EPID signature.

The second security improvement that EPID provides is the ability to revoke an individual device by detecting a compromised signature or key. If the private key used by a device has been compromised or stolen, recognizing this within the EPID ecosystem allows the device to be revoked and prevents any future forgery. During a security exchange, the EPID protocol requires that members perform mathematical proofs to show that they could not have created any of the signatures flagged on a signature revocation list. This built-in revocation feature allows devices, or even entire groups of devices, to be flagged for revocation and instantly denied service. It allows anonymous devices to be revoked on the basis of a signature alone, which means an issuer can ban a device from a group without ever knowing which device was banned.

Intel® EPID Roles

There are three roles in the EPID security ecosystem. First, the Issuer is the authority that assigns or issues EPID Group IDs and keys to individual platforms; a group gathers similar devices that share the same access level. The issuer manages group membership and maintains current versions of all revocation lists. Using a newly generated private key for the group, the issuer generates one group public key and as many EPID member private keys as requested, all of which are paired with that one group public key. The Member role is an end device, and represents one individual member in a group of many members, all sharing the same level of access. Finally, the Verifier role serves as the gatekeeper: checking and verifying EPID signatures generated by platforms, ultimately ensuring they belong to the correct group. Using the EPID group public key, the verifier is able to validate the EPID signature of any member in the group with no knowledge of membership identity.


Figure 2 – EPID Roles

Issuer – Creates, stores, and distributes issuer-signed public certificates for groups. Creates, distributes, and then destroys private keys; private keys are not retained by an issuer, and are held private by member devices in trusted storage such as a TPM 1.2-compliant device. Creates and maintains revocation lists.
Verifier – Challenges member verification requests using the EPID group public key and revocation lists. Identifies any member or group revocations.
Member – An end device for a particular platform. Protects its private EPID key in protected, TPM 1.2-compliant storage. Signs messages when challenged.

Now that the roles in EPID have been discussed, let’s discuss the different security keys used in the EPID security scheme. First, for a given group of devices or platforms, the issuer generates the group public key and group private key simultaneously. The group private key has one purpose: the issuer uses it as the basis to create new member private keys, and for that reason the issuer keeps the group private key secret from all other parties. The EPID group public key is maintained by the issuer and verifier, and the private member keys are distributed to the device platforms before the issuer destroys its local versions.


Figure 3 - Using a unique key allocated for a group, an issuer will create one EPID Group Public key and many member EPID private keys as requested.

Security Keys used in EPID

Issuer (private) – CA root issuing authority private ECC key. Used to sign the EPID group public key and parameters; ensures trust all the way to the member.
Issuing CA (public) – Issuing authority public ECC key. Provided to platform members to enable trust with a verifier and issuer.
Issuer (private) – Group private key, one per group. Created by the issuer for a group and used to generate private member keys.
Group (public) – EPID group public key, generated by the issuer. Provided to platform devices during provisioning upon request; used by verifiers to validate EPID member signatures.
Member (private) – EPID member private key, a unique private key for each device; can be fused and must be secured. Generated by the issuer using the group private key. Stored securely or embedded/fused into silicon as a golden private key ready for provisioning into the final EPID key. Used to create valid EPID signatures that can be verified using the paired EPID group public key.

The Intel® EPID scheme works with three types of keys: the group public key, the issuing private key, and the member private key. A group public key corresponds to the unique member private keys that are part of the group. Member private keys are generated from the issuing private key, which always remains secret and known only to the issuer.

To ensure that material generated by the issuer is authentic, another level of security is added: the issuing CA certificate. The CA (Certificate Authority) public key contains the ECDSA public key of the issuing CA. The verifier uses this key to authenticate that information provided by the issuer is genuine.
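
The short Python sketch below (again using the cryptography package, purely for illustration) mirrors this chain of trust: the issuing CA signs the issuer material with its ECDSA private key, and a verifier authenticates that material with the CA public key before trusting it. The serialized key blob here is a hypothetical placeholder:

# Toy chain-of-trust sketch; real Intel® EPID issuer material uses a defined binary format.
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives import hashes

ca_private_key = ec.generate_private_key(ec.SECP256R1())  # held by the issuing CA
ca_public_key = ca_private_key.public_key()               # distributed to verifiers and members

issuer_material = b"<serialized EPID group public key and parameters>"
certificate_signature = ca_private_key.sign(issuer_material, ec.ECDSA(hashes.SHA256()))

# Verifier side: raises InvalidSignature if the issuer material is not genuine.
ca_public_key.verify(certificate_signature, issuer_material, ec.ECDSA(hashes.SHA256()))
print("issuer material authenticated via the issuing CA key")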

Intel® EPID Signature

An Intel® EPID signature is created using the following parameters:

  • Member private key
  • Group public key
  • Message to be signed
  • Signature revocation proof list (to prove that it did not create any signatures that were flagged for revocation in the past)

An Intel® EPID signature is verified using the following parameters:

  • Member’s signature
  • CA certificate (to certify authenticity of issuer material before it is used)
  • Group public key
  • Group revocation list
  • Member private key revocation list
  • Signature revocation list
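
To make the role of these inputs concrete, the runnable toy model below sketches the order of the verifier-side checks. It is a conceptual illustration only: the names are ours, the revocation checks are reduced to plain set lookups, and none of the zero-knowledge proof machinery of real Intel® EPID (or the Intel® EPID SDK API) is represented:

# Conceptual toy model of verifier-side checks; NOT the Intel® EPID SDK API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToySignature:
    group_id: int
    key_id: int        # stands in for facts only provable via cryptographic proofs
    sig_bytes: bytes

def verify(sig, group_rl, priv_rl, sig_rl):
    if sig.group_id in group_rl:   # GROUP-RL: the entire group is revoked
        return False
    if sig.key_id in priv_rl:      # PRIV-RL: a known-compromised member key was used
        return False
    if sig.sig_bytes in sig_rl:    # SIG-RL: member must prove it did not create these
        return False
    return True                    # plus the actual group-signature validity check

sig = ToySignature(group_id=7, key_id=42, sig_bytes=b"\x01\x02")
print(verify(sig, group_rl={9}, priv_rl=set(), sig_rl=set()))  # True: nothing revoked
print(verify(sig, group_rl={7}, priv_rl=set(), sig_rl=set()))  # False: group 7 revoked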

Intel® EPID Process Flows

Embedding

By including the Intel® EPID key in the manufacturing process for a device, a part can be identified as genuine after deployment into the field without any human intervention. This process saves time and improves security by not distributing any private keys or requiring any interaction with the end user. Sequence 1 shows a vendor of a hardware device initiating a request with an Intel® EPID issuer and ultimately deploying the generated Intel® EPID member keys with the device. The process starts with the vendor requesting to join the ecosystem managed by the issuer; in other words, the vendor chooses this issuer as its Key Generation Facility. When a new member requests to join, the issuer first generates a set of platform keys, which are held private and used to generate one group public key and one or more member private keys. The member private keys are deployed securely with each device and are not known by anyone else, including the issuer, who does not retain any of the private keys. The Intel® EPID group public key is stored with the issuer and distributed to verifiers upon request.


Sequence 1 – Intel® EPID key request and distribution process

For products supporting Intel® EPID, Intel fuses a 512-bit number directly into a submodule of the processor called the Management Engine. This Intel® EPID private member key is encoded with an Intel® EPID Group ID that identifies the device as part of the group. As an issuer and verifier, Intel maintains public certificates for each device encoded with Intel® EPID keys. The private member keys require the same level of protection as a standard PKI private key. Access to the private key can only be achieved using an application that is signed by an RSA security certificate whose root of trust is Intel. Other silicon manufacturers can follow a similar process, allowing only their own trusted applications to access the private key on their products.

Provisioning

After deployment into the field, a device is not ready to use Intel® EPID out of the box. Before it can be brought to life, it must follow a process called provisioning, which allows it to attest its authenticity using a valid Intel® EPID signature for all future transactions. Sequence 2 shows a possible provisioning process for the first boot of an IoT device that uses Intel® EPID. Once granted access to the Internet, a device can call home to state that it is online and check for software updates.

Before granting access, however, the provider answering the call must ensure that the device is authentic. In a typical onboarding scenario, a verifier sends the member device a request for its provisioning status. If the device is not already provisioned, meaning it has not previously been authenticated, it can complete provisioning by requesting the public Intel® EPID group key from the verifier. The member device then stores both the private and public Intel® EPID keys in secure storage, and is subsequently able to sign Intel® EPID signatures as well as reply to provisioning status challenges.


SEQUENCE 2 – Intel® EPID Provisioning Flow

Revocation

Because the Intel® EPID security scheme allows for anonymous group-membership attestation, it must also provide the ability to reject or decommission members or groups at any time. Intel® EPID supports revocation at the membership level through identification of an Intel® EPID member signature or, if known, the private member key.

In addition, Intel® EPID supports revocation of an entire group, which revokes access for all devices in that group. In a typical use case, shown in Sequence 3, member revocation is requested by a verifier; however, only the issuer can actually revoke an Intel® EPID member or group. Once a group is revoked, verifiers no longer reference any signature- or key-based revocation entries for that group; signatures from the group are simply rejected.

The Intel® EPID protocol exchanged between member, verifier, and issuer carries revocation lists, which can grow over time for a platform group that has many compromised members. Each additional entry adds work, so verification cost grows linearly; for example, a signature revocation list with 1,000 entries requires 1,000 non-revoked proofs per signature, meaning it takes longer to validate everyone in the chain over time. One solution an issuer can pursue when this occurs is to create a new group and move the uncompromised members into it. The old group can then be revoked.


SEQUENCE 3 – Verifier submits request to Issuer to revoke a member or group

Summary of Revocation Lists
PRIV-RL – Private Member Key is known
SIG-RL – Platform Member Key is not recovered, however signature is known
GROUP-RL – Entire Group should be revoked

While members normally exchange signatures with verifiers, communication also occurs directly with the issuer. The join protocol between a member device and an issuer supports transporting a valid Intel® EPID private key to the device. This can be used to replace a compromised key, or for remote provisioning when the key is not available to the member. A secure, trusted transport mechanism for the key is assumed and is outside the scope of the protocol.

Intel® EPID Use Cases

A perfect example use of Intel® EPID is proving that a hardware device is genuine. After deployment from a manufacturer, it is important for a device to be able to truthfully identify itself during software updates or when requesting access to a system. Once authorized, the device is known to be genuine and a valid member of a group while still remaining anonymous.

Another example is related to digital streaming content. Digital Rights Management (DRM) currently uses Intel® EPID to ensure that a remote hardware device is secure prior to streaming data to it. This process ensures that the hardware player streaming the content is authentic. Intel® Insider™ technology, which protects digital movie content delivered by service providers, works only on clients that also support Intel® Insider™. This gives content providers a level of trust that their content cannot be copied simply by viewing it on the device. There is no disruption to current services; the only impact is to those trying to pirate digital content that has been protected using Intel® Insider™.

Intel® Insider™
http://blogs.intel.com/technology/2011/01/intel_insider_-_what_is_it_no/

Intel® Identity Protection Technology with One Time Password (OTP) also uses Intel® EPID keys to implement a two-factor authentication method that enhances security beyond a simple username/password.

One time password
https://www.intel.com/content/www/us/en/architecture-and-technology/identity-protection/one-time-password.html

SGX – Software Guard Extensions on Intel® products allow applications to run in a trusted, protected area of memory allocated as an ‘enclave,’ preventing any outside access to the application memory or execution space.

SGX
https://software.intel.com/en-us/sgx

Silicon providers such as Microchip* and Cypress Semiconductor* are now implementing Intel® EPID into their products as well.

Microchip announces plans for implementing Intel® EPID
http://download.intel.com/newsroom/kits/idf/2015_fall/pdfs/Intel_EPID_Fact_Sheet.pdf

Intel Products offering Intel® EPID

Beginning with the release of the series 5 chipsets, Intel® EPID keys have been fused and deployed in all products built on series 5 and newer chipsets. For more information on which products are supported, visit the ARK at http://ark.intel.com/#@ConsumerChipsets

Intel® EPID SDK – Member and Verifier APIs

The Intel® EPID SDK is an open source library that provides support for both member and verifier Intel® EPID tasks. It does not include any issuer APIs, which means it is not meant to create EPID keys. The SDK comes with documentation and examples for signing and verifying messages using included sample issuer material (the public group Intel® EPID key, the private member Intel® EPID key, and additional information such as the revocation lists), which in a real system would be generated by the issuer. Verifier APIs do exist for populating a special kind of signature revocation list known as the verifier blacklist; however, that list can only be populated if members opt in to being tracked, and only the issuer can create revocation lists that apply to the entire group.

First steps with Intel® EPID

To get started, download the latest Intel® EPID SDK, and begin by reading the documentation included in the doc subfolder with each distribution.  https://01.org/epid-sdk/downloads

After building the SDK, navigate to the _install\epid-sdk\example folder and try out the included examples for signing and verifying signatures. The folder contents, shown below, include the sample private key, issuer certificates, and revocation lists required to complete verifications. The files are well named, making it easy to know their contents.


Figure 4 – Directory listing of the Intel® EPID 4.0.0 SDK

Intel® EPID Member Example

Create a digital signature using the sample Intel® EPID member private key, groupID, and a sample text string of any content.

signmsg.exe --msg="TEST TEXT BLOB"

The signmsg command outputs a signature file (./sig.dat) whose contents can only be verified using the matching Intel® EPID public key and the message that was signed. Regardless of what initiates or triggers the verification process, the verifier and member have to use the same message parameter for verification to succeed.

Intel® EPID Verifier Example

Creating and validating signatures requires that both ends (member and verifier) use the same message, hashing algorithm, basename, and signature revocation lists. A change to any of these will result in a signature verification failure. During a validation flow, the verifier may send a text message for the member to sign.

Verify a digital signature using the SDK with the same message.

verifysig --msg="TEST TEXT BLOB"


Figure 5 – Console sign and verify success

If not specified, the SDK will use default values for the hashing algorithm.

If a different message or hashing algorithm is used, the verification will fail.


Figure 6 – Console sign and verify failure

The executables included with the Intel® EPID SDK examples are intended only for quick validation or integration tests of signatures, and to demonstrate basic member and verifier capability.  A developer wanting to implement member or verifier functions would start by taking a look at the included documentation, which includes both an API reference and sample walkthroughs for signing and verifying in Intel® EPID.


Figure 7 – Intel® EPID SDK Documentation
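
For such integration tests, the two example binaries can also be driven from a small script. The following sketch is hypothetical: it assumes the binaries are in the current directory (on Windows* they carry an .exe suffix) and that they report success through their process exit code:

# Hypothetical test harness around the SDK's example binaries.
import subprocess

MSG = "TEST TEXT BLOB"
# Sign with the sample member private key; writes ./sig.dat on success.
subprocess.check_call(["./signmsg", "--msg=" + MSG])
# Verify with the matching group public key and revocation lists.
result = subprocess.call(["./verifysig", "--msg=" + MSG])
print("verified" if result == 0 else "verification failed")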

The Intel® EPID SDK is constantly improving with each release, aligning to the newest Intel® EPID standards and providing optimizations for hashing algorithms using Intel® Integrated Performance Primitives.

How to implement Intel® EPID

OEMs and ODMs can take advantage of the fact that Intel® EPID keys are available on all Intel® products that include series 5 or newer firmware. The Intel® EPID SDK can be used to create the platform code that will run on the device; however, for provisioning purposes the Intel® EPID key can be accessed only by a signed application running in the ME secure firmware, in a secured, trusted environment whose signature chains back to Intel. An OEM/ODM can work with an Intel representative for guidance on how to enable Intel® EPID on an existing Intel® product that supports it.

Other silicon manufacturers are following suit and adopting Intel® EPID technology. Both Cypress Semiconductor and Microchip are starting to ship products with embedded Intel® EPID member keys as well. This means that an Intel® EPID ecosystem can be deployed regardless of whether the silicon comes from Intel; adhering to the rules of the Intel® EPID security scheme is what permits a device to take advantage of the Intel® EPID features.

Visit the Intel® EPID SDK deployment site for more documentation and API walkthroughs for signing and verifying messages: https://01.org/epid-sdk/

If you are interested in implementing Intel® EPID in your products, or would like to join our Zero Touch Onboarding POC, start by emailing iotonboarding@intel.com.

If you would like to use Intel’s Key Generation Facility to act as an Intel® EPID issuer for creation of Intel® EPID keys, please start by contacting iotonboarding@intel.com.

Quick Facts

  • Intel has issued over 4 billion Intel® EPID keys since the release of the Series 5 chipset in 2008
  • Devices in an Intel® EPID Ecosystem are allowed to authenticate anonymously using only a Group ID
  • Intel® EPID is Intel’s implementation of Direct Anonymous Attestation
  • Intel® EPID supports revoking devices based on Private Key, Intel® EPID Signature, or an entire Group
  • Silicon providers can create their own Intel® EPID ecosystem
  • OEM/ODMs can use Intel® EPID compliant silicon devices to provide quick and secure provisioning
  • Intel® products include an embedded true random number generator – providing quicker, more secure seed values for hashing algorithms. (The SDK requires a secure random number generator to be used in any implementation of Intel® EPID.)

Summary

In this article, we discussed an Intel® security scheme called Intel® EPID that allows devices to attest membership of a group without being individually identified. Intel® Enhanced Privacy Identification technology 2.0 enhances Direct Anonymous Attestation by providing the ability to revoke members based on member or group signatures. Choosing Intel products allows OEMs/ODMs and ISVs to take advantage of built-in security keys already available in numerous Intel product families. Silicon providers can also take advantage of Intel® EPID technology by embedding private keys directly into their hardware and running their own Intel® EPID ecosystem. With a predicted 50 to 100 billion connected IoT devices by 2020, security and device authenticity are imperative for both manufacturers and end users.

A very special thanks to the members of the Intel® EPID SDK team for taking time to answer questions on Intel® EPID and the Intel® EPID SDK.

Terminology

AES-NI – AES New Instructions, a hardware-embedded feature available in most newer Intel® products.
AIK – Attestation Identity Key.
AMT – Active Management Technology; supports out-of-band remote access.
Anonymity – A property that allows a device to avoid being uniquely identified or tracked.
Attestation – A process by which a user or device guarantees they are who they say they are.
Certificate – An electronic document issued by a third-party trusted authority (issuer) that verifies the validity of a public key. The contents include a subject and a verifiable signature from the issuer, which adds an additional layer of trust around the contents.
DAA – Direct Anonymous Attestation.
DER – Certificate file format; Distinguished Encoding Rules.
ECC – Elliptic Curve Cryptography.
EPID – Enhanced Privacy Identification.
EPID key – A private key held by an individual and not shared with anyone. Used to create a valid Intel® EPID signature that can be verified using the matching Intel® EPID group public key.
iKGF – Intel® Key Generation Facility.
Intel SCS – Setup and Configuration Software; used to access AMT capabilities.
ISM – Intel® Standard Manageability.
ISO/IEC 20008-2:2013 – ISO standard for anonymous digital signature mechanisms: https://www.iso.org/obp/ui/#iso:std:iso-iec:20008:-2:ed-1:v1:en
ME – Intel® Management Engine, sometimes also called the Security and Management Engine.
ODM – Original Device Manufacturer.
OEM – Original Equipment Manufacturer.
PEM – Certificate file format; Privacy Enhanced Mail.
PKE – Public Key Encryption.
PKI – Public Key Infrastructure.
Platform – A platform is considered a piece of hardware or a device.
Private key – A key that is owned by an individual or device, held private, and never shared with anyone. Most commonly used to encrypt a message into ciphertext that can only be opened using the matching public key.
Public key – A key provided to the public that will only decrypt a document encrypted using the matching private key.
SBT – Small Business Technology.
Secure key – A text string that matches the output of a defined algorithm and allows plain text to be transformed into ciphertext or vice versa.
SIGMA – SIGn and Message Authentication; a protocol from Intel for platform-to-verifier two-way authentication.
X.509 – ITU-T standard for certificate format and content.

About the Author

Matt Chandler is a senior software and applications engineer with Intel since 2004.  He is currently working on scale enabling projects for Internet of Things including ISV support for smart buildings, device security, and retail digital signage vertical segments.

References

Intel® EPID White Paper

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-epid-white-paper.pdf

NISC-PEC, December 2011

http://csrc.nist.gov/groups/ST/PEC2011/presentations2011/brickell.pdf

Wikipedia References on Security

https://en.wikipedia.org/wiki/Direct_Anonymous_Attestation
https://en.wikipedia.org/wiki/Public-key_cryptography
https://en.wikipedia.org/wiki/Public_key_infrastructure
https://en.wikipedia.org/wiki/Public_key_certificate

ACM conference 2004, “Direct Anonymous Attestation”

https://eprint.iacr.org/2004/205.pdf

Platform Embedded Security Technology Revealed

http://www.apress.com/us/book/9781430265719

Wikipedia Image license for PKI process

https://en.wikipedia.org/wiki/Public_key_infrastructure#/media/File:Public-Key-Infrastructure.svg
https://creativecommons.org/licenses/by-sa/3.0/

Face Beautification API for Intel® Graphics Technology


Download sample code [16MB]

Abstract

This paper highlights the C++ API for enabling applications to support Face Beautification, one of the features supported by Intel® Graphics Technology. It outlines the list of available effects in the Face Beautification Library version 1.0, and describes the C++ API definitions and methods included in the library.

Face Beautification in Intel® Graphics Technology

The Face Beautification feature supported by Intel Graphics Technology provides the capability to automatically enhance faces based on detected facial landmarks. Using the current implementation of the APIs, you can create an automatic framework and develop face enhancement tools for a better user experience. Because a lot of information can be extracted from an image, the API helps to implement automatic Face Beautification.

There are two methods for enabling Face Beautification in an application. The first version of Face Beautification support was available through Device Driver Interface (DDI) implementations; applications could be enabled by a call into the private DDI. Now there is a second and simpler option: a C++ API that assists application development. Developers can access the C++ API via the Face Beautification static library. The Face Beautification feature set is detailed in the table below.

FB Feature – Category
Face brightening – Global
Face whitening – Global
Skin foundation – Skin map based
Skin smoothing – Skin map based
Skin blush – Skin map + landmark
Eye circles removal – Skin map + landmark
Eye bags removal – Skin map + landmark
Eye wrinkles removal – Skin map + landmark
Red lips – Landmark based
Big eyes – Landmark based
Cute nose – Landmark based
Slim face – Landmark based
Happy face – Landmark based

API Definitions

There are five APIs for Face Beautification. The current infrastructure supports one Face Beautification feature, i.e., FBRedLip. Three structures store the input and output properties, the FDFB mode feature, the feature strength, and other parameters. The constructor initializes the class data members based on the information provided by these structures. The first API, fDeviceInitialization(), initializes the device; it is followed by fConfiguration(), which sets the device properties based on the structures passed to the constructor. A separate API, FDFBMode_Initialization(), is provided for Face Detection and Face Beautification (FDFB) mode. After device initialization and configuration, the pipeline is executed for each frame using the ExecutionPipeline() API. The destructor is called automatically to release the memory objects.

FB_API(FDFB_IN_OUT_PARAM init_file_var, FDFB_REDLIP_PARAM FBRedLip, FDFB_MODE_PARAMS FDFB_Mode_Val);  // constructor: initializes class data members
void fDeviceInitialization();        // initializes the device
void fConfiguration();               // configures device properties from the structures
void FDFBMode_Initialization();      // initializes FDFB mode
void ExecutionPipeline(char* tempBuffer);  // executes the pipeline for one frame
int convertFileToFaceList(std::fstream& file, std::vector<VPE_FACE_RECT>& list);  // converts a face file to list format
~FB_API();                           // destructor: releases memory objects

The details of the structures used by the APIs are provided below:

typedef DXGI_FORMAT FDFB_FORMAT;
typedef struct FDFB_IN_OUT_PARAM
{
 FDFB_FORMAT inputFormat;
 FDFB_FORMAT outputFormat;
 UINT inputWidth;
 UINT inputHeight;
 UINT outputWidth;
 UINT outputHeight;
} FDFB_IN_OUT_PARAM;

typedef struct FDFB_REDLIP_PARAM
{
 UINT FBRedLipStrengthEnable;
 UINT FBRedLipStrength;
} FDFB_REDLIP_PARAM;

typedef struct FDFB_MODE_PARAMS
{
 GUID * pVprepOperationGUID;
 VPE_FDFB_MODE_ENUM FDFBMode;
 VPE_FD_FACE_SELECTION_CRITERIA_ENUM faceSelectionCriteria;
 std::vector<VPE_FACE_RECT> list;
} FDFB_MODE_PARAMS;

Usage and Program Flow

  1. The Face Beautification header file and static library (.lib) are provided. Create the project, include the header file in the additional include directories, add the library’s folder to the additional library directories, and add the name of the static library to the additional dependencies on the Input tab.
  2. In the application, provide the input file, the output file, and the face file.
  3. Provide or read properties into variables such as input width, input height, output width, output height, input format, and output format. If the application wants to enable FDFB mode, specify which feature is enabled and its strength. Set the face selection criteria. If the face selection criteria and FB feature strength are not provided, the driver uses default values. Use convertFileToFaceList() to convert the face file to list format.
  4. Create a class object and pack the information into structures. Call the class constructor to initialize the value of class data members.
  5. Call the device initialization function fDeviceInitialization().
  6. Call the device configuration function fConfiguration().
  7. Call the FDFB mode initialization function FDFBMode_Initialization().
  8. Execute the pipeline by calling ExecutionPipeline(char* tempBuffer). This function is called in a loop over all the frames; a new buffer is passed for every new frame, so update the buffer accordingly.
  9. Write the output to the output file. A class destructor is called automatically to release the memory objects.

Future Work

The current API implementation supports one FDFB feature, FBRedLip. The upcoming versions will include a larger set of Face Beautification features. If requested, face detection support will be included as well. The DDI implementation support takes virtual camera input; this feature can be extended to the C++ API as well.

About the Author

Sonal Sharma is a software application engineer working at Intel in California. Her work responsibility includes performance profiling and analysis, and CPU/GPU code optimization for media applications.

What's New? - Intel® VTune™ Amplifier XE 2017 Update 4


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview | Training | Support.

New for the 2017 Update 4! (Optional update unless you need...)

As compared to 2017 Update 3:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2) 

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 4

File: VTune_Amplifier_XE_2017_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 4 

File: vtune_amplifier_xe_2017_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 4 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Path of Exile’s storied road to success


The original article was published by Intel Game Dev on VentureBeat*: Path of Exile’s storied road to success. Get more game dev news and related topics from Intel on VentureBeat.


It’s probably fair to say that if you’re a fledgling indie development studio casting around for signs of inspiration, good choices, and role models you’d be hard-pressed to find a better example than Grinding Gear Games, creators of action-RPG, Path of Exile. The studio, founded in late 2006 in Auckland, New Zealand, took its sweet time to release the game, but the patience and constant work paid off as it’s a certifiable hit commercially and among its legion of dedicated fans.

From a starting team of three, the studio has ballooned to 100 developers, all focused on the single project in front of them. They are still generating new content, and with it, encouraging new players to the fold while enticing lapsed players back with smartly considered methods of keeping the experience fresh.

It all started with a love of Diablo* II

“We played a lot of online action RPGs like Titan Quest and Dungeon Siege, but especially Diablo* II,” says Producer and Lead Designer, Chris Wilson, “and as 2006 approached we felt it was strange that no studios were making games like this, specifically online, with good item economies.” Wilson and his early team knew that other studios were making games following some of the action-RPG tropes of Diablo II but whatever their intentions, those games ended up being single-player games.

Above: Path of Exile springs epic visual moments into the action-RPG gameplay.

“We felt that Diablo II players were looking for something newer and we thought, somewhat naively, why don’t we make that game?” says Wilson. There was method in the apparent madness, since none of these friends had ever put together a game studio before. “There was a hole in the market, tens of millions of players were looking for something to play, and we felt like we can do that,” adds Wilson. Identifying a hole in the market that evidently had a fan base but wasn’t being satisfactorily served would prove vital in allowing Path of Exile to be a success. That, and a little talent, of course.

Getting started

The early days wouldn’t be easy though. “We pooled our life savings, set up in a garage, and three of us started to make Path of Exile. It’s a survivor story, really, as we had to learn how to make games, scale a studio up to 100 people, but it was successful and it all came from a desire to fill a hole in the market,” says Wilson. “We knew it was the right product to make, we just didn’t know if we were capable of making it,” he adds.

From a design perspective, the team established pillars that would have to be adhered to for this game to be a success. “We knew to be successful it had to be an online game and that items had to be stored on the servers. These games thrive on the fact that they have items that are incredibly hard to obtain online, and players are willing to spend a long time to attain them,” says Wilson.

Above: Building a game with an absorbing item economy was key to Path of Exile’s design and success.

The next pillar dealt with requiring random levels and items to help retain players for the long haul. “It’s important for replayability that the levels are procedurally generated so that when you play through, it feels different,” says Wilson. In addition, staples of the genre, such as visceral combat that was responsive and, as Wilson described it, “punchy,” were the kind of standards that the Diablo crowd would both recognize and feel were core to their enjoyment.

These pieces would lead to an important goal. “We want people to play the game for ten years. And we already have players who are entering their sixth year, so that’s working well,” says Wilson.

As developers everywhere know, building technology while you’re building a game is far from easy. For the Grinding Gear team, there were no shortcuts. “At the time, there were no off-the-shelf online game back ends that you could just purchase. We needed the game to support tens of thousands of players online simultaneously. So, we looked at how other games architected and came up with a hybrid that we would build,” says Wilson.

As a result, every system was custom built for the requirements of this game. Wilson also revealed that it’s only this year that the team has investigated middleware options to help with new features like adding video into the game.

Above: Maps built using the procedural generation system ensures a different experience every time.

Slow growth

Given all this work, it shouldn’t be too much of a surprise that it took until 2013 before the game was officially in full release. Though the alpha period had generated an engaged community, building beyond that was a struggle with a traditional press tour resulting in positive plaudits and feedback, but only 2,000 additional people hitting the forums. “Now 2,000 people is nice, but nowadays you get that by just tweeting something,” says Wilson.

This slow-burn was frustrating, but the belief in the core product remained resolute. “Eventually the inevitability worked and it passed a quality threshold where people were willing to tell their friends,” says Wilson. It is also a commitment to let the game speak for itself that has stopped Grinding Gear from embarking on refer-a-friend programs and similar marketing techniques to attract players. Rather, they would prefer a more organic process whereby “a friend disappears and you wonder where they are. You discover they’ve disappeared because they’ve been playing this awesome game for six hours a day!” adds Wilson.

The randomized system — such a core part of the replayability — has also played well with the community and with YouTubers. “We call them the reddit moments,” says Wilson, “when the game does something interesting enough, the player will say ‘hey, that was cool, I need to post it on reddit.’”

Those moments might be action situations, but the randomized naming of items and monsters can generate its own comedy and, even, naughtiness. “There was a monster where the game generated the name Stink Stink, and so that of course has become a community meme.

“We quickly learned that while it sounded cool to have the word ‘black’ as a prefix so you could have cool names like Black Bane, it was way too quickly able to generate offensive stuff, so that had to be removed early in the beta!” adds Wilson.

Above: The inspiration from Diablo* II is quite apparent in the game style and layout.

The emergence of the streamer and YouTube* community during Path of Exile’s development and release has certainly aided gamer awareness, but hasn’t affected any core design or feature set ideas. That said, Wilson suggested that the team have considered a few ideas to address the people who are helping generate awareness and extending the game’s reach.

“We have considered a game mode for streamers (or any user) where two streamers enter and they’re competing with each other in some way — probably not directly — like who gets through the maze first. Then their viewers have some mechanism for donating or voting that makes it harder for the other streamer. So, two rivals can have fun in friendly competition,” explains Wilson.

The business of free-to-play

Now approaching a full four years of full release, Path of Exile continues its upwards growth curve, buoyed by new content, daily news posts, and unveiling a new server every three months that keeps drawing players back. Wilson accepts that the game is now profitable, powered by its 100-strong development team. That scenario wasn’t always quite so apparent when this new studio started out with its game design dream. “It pretty much is a passion project. It started with ‘players want this game’ and only turned into the business of ‘okay, how is this game going to pay for itself’ a bit later,” says Wilson.

“We had seen games like Maple Quest be very successful in Korea, but nobody had really done a free-to-play model in the West,” says Wilson, “and our big revolution was to see if we could be the first free-to-play in the West. We weren’t because the game took forever to make.”

Wilson and his team did what he describes as “rough back-of-envelope math” to figure out the business model in those early days. “From surveying other games, we figured for every average concurrent user, you make about 50 cents a day. So, with 1000 people on average logged in, you’re making $500 a day. That obviously only pays for a couple of staff members. We looked into the logistics of what it’s going to take to run this online game with a skeleton crew, not making much content, and decided we needed 10,000 players logged in on average in order to pay for the game, so that was our goal — the 10,000 concurrent players mark,” reveals Wilson.

It turned out that Wilson and his team had the math wrong. Quite wrong! “You actually make a lot more money than that and you also require a lot more people. We have a 100-person team and we thought you could run an online game with six!” says Wilson. Fortunately for Grinding Gear, fans enjoying their game experience are willing to pay for the entertainment and, coupled with blowing the 10,000-concurrent number out of the water, the game is able to support that 100-person crew.

Above: The anticipated economics turned out to be quite different than expected once the game began to scale its concurrent players.

Coming next

Wilson is clear that Path of Exile remains the total focus of Grinding Gear Games for the foreseeable future, with no plans to diversify into other games. “We have a lot of stories still to tell with Path of Exile,” he says, adding “We hear game ideas all the time and someone will say ‘we can make a game to beat Dota 2’ and I shake my head and go back into my office!”

Now if the studio were looking for opportunities — and they’re not — it would follow that same philosophy of finding an underrepresented genre with an established fan base. “It would be something like the old-school point-and-click adventures or the Command & Conquer RTS. Those are some of the areas that we would look at,” says Wilson, who also made it abundantly clear that no, the studio is not announcing work on any such projects.

“We’re not making a VR game, we’re not making a survival game like DayZ, we avoided making a Minecraft* game. We’ve avoided jumping on any bandwagon, but would look where areas are being underserved,” Wilson added.

Remaining focused on Path of Exile continues to pay dividends, as does believing that despite reports to the contrary, the PC continues to maintain its viability as a major gaming platform. There are lessons here for every development studio.

Unattended Baggage Detection Using Deep Neural Networks in Intel® Architecture


In a world becoming ever more attuned to potential security threats, the need to deploy sophisticated surveillance systems is increasing. An intellectual system that functions as an intuitive “robotic eye” for accurate, real-time detection of unattended baggage has become a critical need for security personnel at airports, stations, malls, and in other public areas. This article discusses inferencing a Microsoft Common Objects in Context (MS-COCO) detection model for detecting unattended baggage in a train station.

1. Evolution of Object Detection Algorithms

Image classification involves predicting the label of an image among predefined labels. It assumes that there is a single object of interest in the image and that it covers a significant portion of the image. Detection is about not only finding the class of an object but also localizing its extent in the image. An object can lie anywhere in the image and can be of any size (scale). So object classification alone is not helpful when there are multiple objects in an image, when the objects are small, or when their exact locations are desired.

Traditional methods of detection used block-wise orientation histogram features (SIFT or HOG), which could not achieve high accuracy on standard data sets such as PASCAL VOC. These methods encode low-level characteristics of the objects and therefore cannot effectively distinguish among the different labels. Methods based on deep learning (convolutional networks) have become the state of the art in object detection in images. Various network topologies have evolved over time, as shown in Figure 1.


Figure 1: Evolution of detection algorithms [1].

2. Installation

2.1 Building and Installing Caffe* Optimized for Intel® Architecture

Caffe can be installed and used with several combinations of development tools and libraries on a variety of platforms. Here we describe the steps to build and install Caffe* optimized for Intel® architecture with the Intel® Math Kernel Library 2017 on Ubuntu*-based systems. Clone the source with git* from https://github.com/intel/caffe.

1. Clone the Caffe optimized for Intel architecture and pull down all the dependencies.

Navigate to the local caffe directory, copy makefile.config.example, and rename the copy to makefile.config.

2. Make sure the following lines are uncommented in the makefile.config.

makefile.config

# CPU-only switch (uncomment to build without GPU support)

CPU_ONLY := 1

3. Install OpenCV*.

For computer vision and image augmentation, install the OpenCV 3.2 version.

sudo apt-get install python-opencv

Remember to enable OPENCV_VERSION := 3 in Makefile.config before running make when using OpenCV 3 or higher.

4. Build the local Caffe. 

Navigate to the local caffe directory.

NUM_THREADS=41   # set to the number of hardware threads available on your build machine
make -j $NUM_THREADS

5. Install and load the Python* modules.

make pycaffe
pip install pandas
pip install scipy

Then, in Python:

    import sys
    CAFFE_ROOT = 'path/to/caffe'
    sys.path.append(CAFFE_ROOT)
    import caffe
    caffe.set_mode_cpu()

3. Solution Architecture and Design

Our solution aims at identifying unattended baggage in public areas such as railway stations and airports, and then triggering an alarm. Detections are done in surveillance videos using the business rules defined in section 3.2.

Network Topology

Of the different detection techniques mentioned in Figure 1, we chose the Single Shot MultiBox Detector (SSD) optimized for Intel architecture [2]. Researchers report that it performs promisingly even on embedded systems and high-end devices, and hence it is likely to be used for real-time detections.


Figure 2. Input Image and Feature Maps

SSD only needs an input image and ground truth (GT) boxes for each object during training. In a convolutional fashion, a small set (four, in our example) of default boxes of different aspect ratios is evaluated at each location in several feature maps with different scales [8 × 8 and 4 × 4 in (b) and (c)] (see Figure 2). The SSD leverages the Faster RCNN [3] Region Proposal Network (RPN) [4], using it to directly classify the object inside each prior box instead of just scoring the object confidence.

For each default box, the network predicts both the shape offsets and the confidences for all object categories [(c1, c2, ..., cp)]. At training time, the default boxes are first matched to the ground truth boxes; for example, a default box matched to a cat and one matched to a dog are treated as positives, and all other default boxes are treated as negatives. The model loss is a weighted sum of localization loss (for example, Smooth L1) and confidence loss (for example, Softmax).
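
For reference, the original SSD paper (Liu et al., 2016) writes this objective as

L(x, c, l, g) = (1/N) (Lconf(x, c) + α · Lloc(x, l, g))

where N is the number of matched default boxes (the loss is set to 0 when N = 0), x indicates which default boxes were matched to which ground truth boxes, and the weight term α balances the confidence and localization components.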

Since our use case involves baggage detection, either the SSD network needs to be trained with different kinds of baggage, or we can use a pretrained model such as SSD300 trained on the MS-COCO data set. We decided to use the pretrained model, which is available for download at https://github.com/weiliu89/caffe/tree/ssd#models

3.1 Design and Scope 

The scope of this use case is limited to the detection of baggage that stays unattended for a period of time. Identifying the exact owner and tracking the baggage are beyond the scope of this use case.

Because of the large number of boxes generated during model inference, it is essential to perform non-maximum suppression (NMS) efficiently at inference time. By using a confidence threshold of 0.01, most boxes can be filtered out. NMS can then be applied with a Jaccard overlap threshold of 0.45 per class, keeping the top 400 detections per image. Figure 3 shows the flow diagram for running detection on a surveillance video.


Figure 3. Detection flow diagram.     

The surveillance video is broken down into frames using OpenCV at a configurable frame rate. As the frames are generated, they are passed to the detection model, which localizes each object in the form of four coordinates (xmin, xmax, ymin, and ymax) and provides a classification score for the possible object classes. By applying the NMS threshold and setting confidence thresholds, the number of predictions can be reduced, keeping only the most confident predictions. OpenCV is used to draw rectangular boxes of various colors around the detected baggage and person.
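
For reference, a minimal NumPy sketch of the per-class NMS step described above might look like the following. The threshold values mirror those given in the text; the implementation itself is our own illustration, not the code used inside the SSD model:

import numpy as np

def nms(boxes, scores, iou_threshold=0.45, score_threshold=0.01, top_k=400):
    """Greedy per-class non-maximum suppression.
    boxes: (N, 4) array of [xmin, ymin, xmax, ymax]; scores: (N,) confidences.
    Returns indices (into the filtered arrays) of the boxes to keep."""
    keep_mask = scores >= score_threshold           # drop low-confidence boxes first
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]                # highest confidence first
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # Jaccard (IoU) overlap of the best box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]     # suppress heavily overlapping boxes
    return keep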

3.2 Defining the Business Rules 

Abandoned luggage in our context is defined as an item of luggage that has been abandoned by its owner. Each item of luggage has one owner, and each person owns at most one item of luggage. Luggage is defined as including all types of baggage that can be carried by hand, for example trunks, bags, rucksacks, backpacks, parcels, and suitcases.

The following rules apply to attended and unattended luggage:

  • An item of luggage is owned and attended to by the person who enters the scene with it, up to the point at which the luggage is no longer in physical contact with that person.
  • From that point on, the luggage is attended to by the owner ONLY when the owner is within a distance a of the luggage, 20 inches in our implementation (the spatial rule). All distances are measured as Euclidean distances.
  • A luggage item is unattended when the owner is farther than a distance b (where b ≥ a) from the luggage. In this case the system applies the spatio-temporal rule to detect whether the item of luggage has been abandoned (triggering an alarm event).
  • The spatio-temporal rule determines abandonment: an item of luggage is abandoned when it has been left unattended by its owner for a period of t consecutive seconds, during which time the owner has not re-attended to the luggage, nor has the luggage been attended to by a second party (instigated by physical contact, in which case a theft/tampering event may be raised). The image below (Figure 7) shows an item of luggage left unattended for t (=10) seconds, at which point the alarm event is triggered. Here we relate the time t to the number of frames f per second: if the input video yields f frames per second, t seconds corresponds to (t*f) frames. In short, a bag that has been unattended for (t*f) consecutive frames triggers the alarm (a minimal sketch of this rule in Python follows this list).
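
The following minimal Python sketch captures the spatial and spatio-temporal rules above. All names and the frame rate are our own illustrative choices; the distance is assumed to have already been converted from pixels to inches using the reference-object ratio described in section 3.3:

ATTENDED_IN = 20      # spatial rule: attended when within 20 inches of the owner
ALARM_SECONDS = 10    # t: consecutive seconds unattended before the alarm fires
FPS = 5               # f: frames sampled per second from the input video

def update_state(distance_inches, unattended_frames):
    """Returns (status, updated consecutive-unattended frame count) for one frame."""
    if distance_inches <= ATTENDED_IN:
        return "SAFE", 0                            # owner re-attended: reset counter
    unattended_frames += 1
    if unattended_frames >= ALARM_SECONDS * FPS:    # t*f consecutive frames exceeded
        return "ALARM", unattended_frames           # trigger the video alarm (red box)
    return "WARNING", unattended_frames             # unattended but not yet abandoned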

3.3 Inferencing the MS-COCO Model

Implementation or inferencing is done using Python 2.7.6 and OpenCV 3.2. The following steps are performed (code snippets are included for reference):

  1. Read the input video as follows:

    CAFFE_ROOT = '/home/979648/SSD/caffe/'
    # -> Reading the video file and storing in a directory
    TEST_VIDEO = cv2.VideoCapture(os.getcwd() + '/InputVideo/SurveillanceVideo.avi')
    MODEL_DEF = 'deploy.prototxt'
    # MODEL_WEIGHTS should point to the downloaded SSD300 .caffemodel file
  2. Load the network architecture.

    net = caffe.Net(MODEL_DEF, MODEL_WEIGHTS, caffe.TEST)
  3. Read the video by frame and inference each frame against the model to obtain a detection and classification score.

    success, image = TEST_VIDEO.read()
    if (success):
        refObj = None
        imageToNet = cv2.resize(image, (300, 300))
        image_convert = np.swapaxes(np.swapaxes(imageToNet, 1, 2), 0, 1)
        net.blobs['data'].data[...] = image_convert
        # Forward pass.
        detections = net.forward()['detection_out']
    
        # Parse the outputs.
        det_label = detections[0, 0, :, 1]
        det_conf = detections[0, 0, :, 2]
        det_xmin = detections[0, 0, :, 3]
        det_ymin = detections[0, 0, :, 4]
        det_xmax = detections[0, 0, :, 5]
        det_ymax = detections[0, 0, :, 6]
    
        # Get detections with confidence higher than the configured threshold (CONFIDENCE)
        top_indices = [i for i, conf in enumerate(det_conf) if conf >= CONFIDENCE]
    
        top_conf = det_conf[top_indices]
    
        top_label_indices = det_label[top_indices].tolist()
        top_labels = get_labelname(labelmap, top_label_indices)
        top_xmin = det_xmin[top_indices]
        top_ymin = det_ymin[top_indices]
        top_xmax = det_xmax[top_indices]
        top_ymax = det_ymax[top_indices]
    
        colors = plt.cm.hsv(np.linspace(0, 1, 21)).tolist()
    
        currentAxis = plt.gca()
        # print('Detected Size : ', top_conf.shape[0])
    
        detectionDF = pd.DataFrame()
        if (top_conf.shape[0] != 0):
            for i in xrange(top_conf.shape[0]):
                xmin = int(round(top_xmin[i] * image.shape[1]))
                ymin = int(round(top_ymin[i] * image.shape[0]))
                xmax = int(round(top_xmax[i] * image.shape[1]))
                ymax = int(round(top_ymax[i] * image.shape[0]))
                score = top_conf[i]
                label = int(top_label_indices[i])
                label_name = top_labels[i]
                display_txt = '%s: %.2f' % (label_name, score)
                detectionDF = detectionDF.append(
                        {'label_name': label_name, 'score': score, 'xmin': xmin, 'ymin': ymin, 'xmax': xmax, 'ymax': ymax},
                        ignore_index=True)
    
        detectionDF = detectionDF.sort('score', ascending=False)
  4. To calculate the distance between objects in an image, a reference object has to be used. A reference object has two main properties:

       a) The dimensions of the object are known in some measurable unit, such as inches or millimeters. In this case we consider the dimensions to be in inches.

       b) The reference object can easily be found and identified in the image.

    Also, an approximate width of the reference object has to be assumed. In this case we assume the width of the suitcase (args['width']) to be 27 inches.

    The imutils package provides helper utilities used below; install it first with $ pip install imutils.

    if refObj is None:
        # unpack the ordered bounding box, then compute the
        # midpoint between the top-left and top-right points,
        # followed by the midpoint between the top-right and
        # bottom-right
        (tl, tr, br, bl) = box
        (tlblX, tlblY) = midpoint(tl, bl)
        (trbrX, trbrY) = midpoint(tr, br)

        # compute the Euclidean distance between the midpoints,
        # then construct the reference object
        D = dist.euclidean((tlblX, tlblY), (trbrX, trbrY))
        refObj = (box, (cX, cY), D / args["width"])
        continue
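
    The snippet relies on two helpers that are not shown: dist, which in this style of code is SciPy's distance module, and midpoint, a small function that averages two (x, y) points. A sketch of both, under those assumptions:

    from scipy.spatial import distance as dist

    def midpoint(ptA, ptB):
        # midpoint between two (x, y) coordinates
        return ((ptA[0] + ptB[0]) * 0.5, (ptA[1] + ptB[1]) * 0.5)

    Note that refObj[2] stores the pixels-per-inch ratio (the reference object's width in pixels divided by its assumed width in inches), which is used later to convert pixel distances into real-world units.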
  5. Once the reference object is obtained, the distance between the reference object and the other objects in the image is calculated. The business rules are applied, and then the appropriate alarm will be triggered. In this case, a red box will be highlighted on the object.

    if refObj is not None:
        D = dist.euclidean((objBotX, objBotY), (int(tlblX), int(tlblY))) / refObj[2]
        (mX, mY) = midpoint((objBotX, objBotY), (tlblX, tlblY))

        # apply the spatio-temporal rule
        # highlight the object in green/yellow/red accordingly

  6. Save the processed images, and then append them to the output video.
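
    A minimal sketch of this step using OpenCV's VideoWriter is shown below; the codec, frame rate, output file name, and the processed_frames variable are assumptions to adapt to your video:

    fourcc = cv2.VideoWriter_fourcc(*'XVID')                # assumed codec
    width = int(TEST_VIDEO.get(cv2.CAP_PROP_FRAME_WIDTH))   # input frame size
    height = int(TEST_VIDEO.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter('OutputVideo.avi', fourcc, 25.0, (width, height))
    for processed_frame in processed_frames:                # frames from step 5
        writer.write(processed_frame)
    writer.release()                                        # finalize the video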

4. Experimental Results

The following detections (see Figures 4 through 7) were obtained when the inference use case was run on a sample YouTube* video available at https://www.youtube.com/watch?v=fpTG4ELZ3bE

Figure 4: Person enters the scene with the baggage, which is currently safe (highlighted in green).
Figure 5: The owner is moving away from the baggage.
Figure 6: The system raises a warning signal.
Figure 7: The owner is almost out of the frame and the system raises a video alarm (blinking in red).

5. Conclusion and Future Work

We observed that the system can detect baggage accurately in medium- to high-quality images. The system is also capable of detecting more than one item of baggage in the case of multiple owners. However, the system failed to detect the baggage in a low-quality video. The distance calculation does not take into account the focal length, the angle of the camera, or the image plane, and hence the current calculation logic has its own limitations. The current system is also not capable of tracking the baggage.

The model was inferenced on an Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz with 22 cores and 64 GB of free memory. Future work will include enhancements to the current use case, such as identifying the owner of the baggage and tracking the baggage. Videos with different angles and focal lengths will also be inferenced to judge the effectiveness of the system. The next phase of our work will also consider efforts to parallelize the inference model.

6. References and Links

Power System Infrastructure Monitoring Using Deep Learning on Intel® Architecture


List of Abbreviations

Abbreviation     Expanded Form
DL               deep learning
LSD              line segment detector
UAV              unmanned aerial vehicle
GPU              graphics processing unit

Abstract

The work in this paper evaluates the performance of machines powered by Intel® Xeon® processors for running deep learning on the GoogleNet* topology (Inception* v3). The functional problem tackled is the identification of power system components such as pylons, conductors, and insulators from real-world video footage captured by unmanned aerial vehicles (UAVs) or commercially available drones. Through multiple experiments, we attempt to derive the optimal batch size, iteration count, and learning rate for the model to converge.

Introduction

Recent advances in computer-aided visual object recognition, namely the application of deep learning, have made it possible to solve a wide array of real-world problems that were previously intractable. In this work, we present a novel method for detecting the components of power system infrastructure such as pylons, conductor cables, and insulators.

The original implementation of this algorithm took advantage of the power of the NVIDIA* graphics processing unit (GPU) during training and detection. The current work primarily focuses on implementing the algorithm on TensorFlow* CPU mode and executing it over Intel® Xeon® processors.

During execution, we will record performance metrics across the different CPU configurations.

Environment Setup

Hardware Setup

Table 1. Intel® Xeon® processor configuration.

Model name: Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
Cores per socket: 22
RAM (free): 123 GB
OS: Ubuntu* 16.1

Software Setup

  1. Python* Setup

    The experiment is tested on Python* version 2.7.x. Verify the installed version as follows (for example, with $ python --version):


    Figure 1. Verify Python* version.

  2. TensorFlow* Setup
    1. Install TensorFlow using pip: $ pip install tensorflow. By default, this installs the latest wheel for your CPU architecture. Our experiment is built and tested on TensorFlow3 version 1.0.x.
    2. Verify the installation as shown in Figure 2:


    Figure 2. Verify TensorFlow setup.

  3. Inception* Model

    The experiments detailed in the subsequent sections employ the transfer learning technique to speed up the entire process. For this purpose, we used a pretrained GoogleNet* model, namely Inception* v3. The details of the transfer learning process are explained in the subsequent sections.

    Download the Inception v3 model from the following link: http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz

  4. TensorBoard*

    We use TensorBoard* in our experiments to visualize the progress and the results of individual experiment runs.

    TensorBoard is installed along with TensorFlow. After installing TensorFlow, enter the following command from the bash script to ensure that TensorBoard is available:

    $ tensorboard --help

Solution Design

The entire solution is divided into three stages. They are:

  • Data Preprocessing
  • Model Training
  • Inference


Figure 3. High-level solution design.

Data Preprocessing

The images used for training the model are collected through aerial drone missions carried out in the field. The images collected vary in resolution, aspect, and orientation, with respect to the object of interest.

The entire preprocessing pipeline is built using OpenCV* 2 (Python implementation). The high-level objective of preprocessing is to convert the raw, high-resolution drone images into a labeled set of image patches of size 32 x 32, which is used for training the deep learning model.

The various processes involved in the preprocessing pipeline are as follows:

  • Image annotation
  • Generating binary masks
  • Creating labeled image patches

The individual processes involved in the pipeline are detailed in the following steps:

Step 1: Image annotation.

Those experienced in the art of building and training convolutional neural network (CNN) architectures will quickly relate to the image annotation task. It involves manually labeling the objects within your training image set. In our experiment, we relied on the Python tool, LabelImg*4, for annotation. The tool outputs the object coordinates in XML format for further processing.


Figure 4. Image without annotation.


Figure 5. Image with annotation overlay.

The preceding images depict a typical annotation activity carried out on the raw images.

Step 2: Generating binary masks.

Binary masks refer to the mode of image representation where we depict either the presence or absence of an object. Hence, for every raw image, we generate individual binary masks corresponding to each of the labels available. The binary masks so created are used in the steps that follow for actually labeling the image patches. This idea is depicted in the following images. In the current implementation, the mask generation process is developed using Python OpenCV.
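
A minimal sketch of this step, assuming the bounding-box coordinates (xmin, ymin, xmax, ymax) for one label have already been parsed from the annotation XML (the output file name is illustrative):

    import numpy as np
    import cv2

    mask = np.zeros(image.shape[:2], dtype=np.uint8)          # black background
    cv2.rectangle(mask, (xmin, ymin), (xmax, ymax), 255, -1)  # filled white box
    cv2.imwrite('mask_pylon.png', mask)                       # one mask per label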


Figure 6. Generating binary masks from the raw image.

Step 3: Creating labeled image patches.

Once the binary mask is generated, we run a 32 x 32 filter over the raw image and compare the activations (white pixel count) obtained in the various masks for the corresponding filter position.


Figure 7. Creating labeled image patches.

If the activation in a particular mask is found to be above the defined threshold of 5 percent of patch area (0.05*32*32), we label the patch to match the mask’s label. The output of this activity is a set of 32 x 32 image patches partitioned into multiple directories based on their labels. The forthcoming model training phase of the experiment directly accesses this partitioned directory structure for label-specific training images.
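
A sketch of this scan, assuming masks maps each label name to its binary mask and save_patch is a hypothetical helper that writes a patch into the output directory for its label:

    PATCH = 32
    THRESHOLD = 0.05 * PATCH * PATCH              # 5 percent of the patch area
    # non-overlapping stride assumed
    for y in range(0, image.shape[0] - PATCH + 1, PATCH):
        for x in range(0, image.shape[1] - PATCH + 1, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH]
            for label, mask in masks.items():
                if cv2.countNonZero(mask[y:y + PATCH, x:x + PATCH]) > THRESHOLD:
                    save_patch(patch, label)      # hypothetical helper
                    break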


Figure 8. Preprocessing output directory structure.

Please note that in the above-described patch generation process, the total number of patches generated varies depending on other variables, such as the size of the filter (32 x 32 in this case), the resolution of the input images, and the activation threshold used when comparing against the binary masks.

Network Topology and Model Training

Inception v3 Model


Figure 9. Inception V3 topology.

Inception v3 is a revolutionary deep learning architecture that achieved state-of-the-art performance in ILSVRC14 (ImageNet* Large Scale Visual Recognition Challenge 2014).

The most striking advantage of Inception over other topologies is the depth of feature learning achieved while keeping the memory and CPU cost nearly on a par with other topologies. The architecture improves performance by reducing the effective sparsity of the data structures, converting them into dense matrices through clustering. This sparse-to-dense conversion is achieved architecturally by designing telescopic convolutions (1 x 1 to 3 x 3 to 5 x 5), commonly referred to as network-in-network.
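
The sketch below illustrates the parallel 1 x 1 / 3 x 3 / 5 x 5 convolution idea with the TensorFlow 1.x layers API; the filter counts are arbitrary, and this is not the exact Inception v3 graph:

    import tensorflow as tf

    def inception_style_block(x):
        b1 = tf.layers.conv2d(x, 64, 1, padding='same', activation=tf.nn.relu)
        b3 = tf.layers.conv2d(x, 96, 3, padding='same', activation=tf.nn.relu)
        b5 = tf.layers.conv2d(x, 32, 5, padding='same', activation=tf.nn.relu)
        return tf.concat([b1, b3, b5], axis=3)  # stack branches along channels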

Transfer Learning on Inception

In our experiments we applied transfer learning on a pretrained Inception model (trained on ImageNet data). The transfer learning approach initializes the last fully connected layer with random weights (or zeroes), and when the system is trained for the new data (in our case, the power system infrastructure images), these weights are readjusted. The base concept of transfer learning is that the initial many layers in the topology will have learned some of the base features such as edges and curves, and this learning can be reused for the new problem with the new data. However, the final, fully connected layers would be fine-tuned for the very specific labels that it is trained for. Hence, this needs to be retrained on the new data.

This is achieved through the Python API, as follows:

  1. Add a new hidden layer with Rectified Linear Unit (ReLU) activation:

    hidden_units_layer_1 = 1024
    layer_weights_fc1 = tf.Variable(
        tf.truncated_normal([BOTTLENECK_TENSOR_SIZE, hidden_units_layer_1], stddev=0.001),
        name='fc1_weights')
    layer_biases_fc1 = tf.Variable(tf.zeros([hidden_units_layer_1]), name='fc1_biases')
    hidden_layer_1 = tf.nn.relu(
        tf.matmul(bottleneck_input, layer_weights_fc1, name='fc1_matmul') + layer_biases_fc1)

  2. Add a new softmax layer:

    layer_weights_fc2 = tf.Variable(
        tf.truncated_normal([hidden_units_layer_1, class_count], stddev=0.001),
        name='final_weights')
    layer_biases_fc2 = tf.Variable(tf.zeros([class_count]), name='final_biases')
    logits = tf.matmul(hidden_layer_1, layer_weights_fc2, name='final_matmul') + layer_biases_fc2
    final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
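
A training step for the new head would then be defined along the following lines (a sketch in the TensorFlow 1.x API; ground_truth_input and learning_rate are assumed to be defined elsewhere in the script):

    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_input,
                                                logits=logits))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        cross_entropy)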

Testing and Inference

Testing is done on a 90:10 train/test split of the entire image set. The test images go through the same patch generation process that was invoked during the training phase. The resulting patches are sent for detection on the trained model.


Figure 10. Result of model inference overlaid on raw image.

After detection, the patches are passed through a line segment detector (LSD) for the final localization.


Figure 11. Result of running LSD.

Results

The different iterations of the experiments involve varying batch sizes and iteration counts.

During the experiments, in order to reduce the time consumed during preprocessing, we modified the preprocessing logic. Therefore, the metrics for different variants of the preprocessing logic were also captured.

We also observed that in the inception model, bottleneck tensors are cached during the initial run, so the training time during the subsequent runs would be much less. The final training result for the Intel Xeon processor is as follows:

Table 2. Experiment results.

Note: Inference time is inclusive of the preprocessing (patch) operation along with the time for the actual detection on the trained model.

Conclusion and Future Work

The functional use case tackled in this paper involved the detection and localization of power system components. The use case could be further expanded to identifying power system components that are damaged.

The training and inference times observed could be further improved by using an Intel-optimized version of TensorFlow5.

References and Links

The references and links used to create this paper are as follows:

1. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, Rethinking the Inception Architecture for Computer Vision (2015).

2. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016).

3. TensorFlow Repository, https://github.com/tensorflow/tensorflow.

4. LabelImg – Python tool for image annotation, https://github.com/tzutalin/labelImg.

5. Optimized Tensorflow – TensorFlow optimizations on Modern Intel Hardware, https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture


Twisted Pixel brings Hollywood A-list voices to VR


The original article was published by Intel Game Dev on VentureBeat*: Twisted Pixel brings Hollywood A-list voices to VR. Get more game dev news and related topics from Intel on VentureBeat.

Promo image for the game Wilson's Heart

Launching a game into the fledgling VR world might for some sound like a risk: become a pioneering early adopter or languish as an also-ran in a field of too few units? For Twisted Pixel Games and Chief Creative Officer Josh Bear it wasn’t an early-technology play as might be suggested by a resume that includes Gunslinger, an early support showcase for the fateful Kinect on Xbox 360.

“It wasn’t the intention to focus the studio around gaming technology,” says Bear, “but we got to see Project Natal — as Kinect was initially known — and Microsoft was incredibly excited about it, so we got to make Gunslinger.”

The studio, with several games under its belt over its ten-year tenure, had always wanted to make Wilson’s Heart, which is out now in VR. “We had an early prototype that we had called ‘The Hands Game’ working with a gamepad with a first-person narrative. It used your hands to pick up objects like guns, like a first-person shooter…but then when the Oculus guys showed us that technology, with its Touch controller, we were all like ‘oh man, this just makes sense for this game,’” explains Bear.

Changing the core technology was not a straightforward process, however. What had begun life as a PC game with first-person sensibilities now had to adapt its functions to fit a whole new ballgame. “We basically started over when we had VR,” says Bear, “we even had to reassess the black and white graphics aesthetic: Would it be weird in VR? Will it work or feel unreal to gamers?”

“We had to do a lot of work with the controllers to make sure the hands felt good. They are so much a part of the game that we had to make sure they felt, acted, and looked cool,” he adds.

Above: The gameplay blends psychological horror with puzzles, and not many “gotcha” horror jump moments.

Questions and challenges will continue to be answered with work, ideas, and understanding the new paradigms as VR evolves. Fundamentals of the game experience are affected: how long is too long, how short is too short? Nobody will currently build a 100-hour VR game (yet) but length of the total experience is one of the challenges that Bear’s team had to grapple with.

“Some people took six hours, some eight, and we saw ten or 12. But we have to be cognizant of that because you’re going to want to get out of the helmet at some point no matter how much you like this stuff,” says Bear.

Do you feel dizzy?

Of course, one of VR’s major challenges impacting experiences at this stage of the technology’s development is motion sickness. One design change that was required in switching from PC to VR was handling locomotion. “We didn’t want players getting sick, but we wanted to keep people in it as much as possible. So, when you warp in the game and the screen goes black you can still hear Wilson breathing or his footsteps or sound effects to push you along,” says Bear, “so we had to think about everything very differently in VR.”

An early version of the game allowing players to walk around anywhere and go where they want — and, in fact, one of the most requested features emerging from players experiencing the game now — caused problems for people.

“Hopefully we — or someone — will figure out the way to make that happen, but it’s the reason we went with the teleportation system.” The result? “We’ve had no issues with motion sickness at all,” says Bear.

Bear adds, “Oculus has been great about taking this as a top concern. But it’s hard because some people love it so much they just want to walk around, but they could pull the wires out and bust their system. So yeah, it’s a big challenge.”

The game experience itself is set in the 40s post-WWII and is one of psychological horror borne of Bear’s deep passion for the old Universal monster movies. “My favorite was The Wolfman, and Abbott and Costello Meet Frankenstein, Boris Karloff, all that…and throw in some Twilight Zone. We wanted to see if we could do our own homage to those movies.”

Those movies all had defining actors making their roles their own, and in themselves becoming an iconic part of the lore. Wilson’s Heart stars a top-tier, all-star cast of voice actors that are a really standout addition to the game.

Above: The cast from Peter Weller to Michael B. Jordan to Rosario Dawson and Alfred Molina to Kurtwood Smith is outstanding.

Hollywood heavyweights

“I really wanted Peter Weller for Wilson…really love Robocop— who doesn’t — but he was great in Naked Lunch, Buckaroo Banzai, and others. So we flew to New Mexico, showed him a brief demo of the game and he was super-gracious, loved the concept, and the art, and just said he would do it,” said Bear.

For the other roles, he looked at the requirements of the character and identified ideal actors. “We got all our first picks,” he says. That includes Michael B. Jordan, Alfred Molina, Rosario Dawson, and Paul Reubens.

A kick for any Robocop fan is that in addition to headliner Peter Weller, Kurtwood Smith also provides voice talent. “Although we never had them in the same room, to have those two was amazing,” says big Robocop fan, Bear, “and Kurtwood even threw in a couple of Robocop lines all on his own, and I tried to keep calm about it in the VO room! And we did use one in the game.”

How is it working with top-tier talent for a videogame? “They were right on point,” says Bear, “they would improve the lines we’d written on-the-fly…when you have that kind of talent, it makes things a lot easier.”

But come on…what were they actually like?

“Some of the nicest people I’ve ever met!” says Bear.

Garnering positive reception from press and community alike, Bear understands the nature of the current VR beast. “To be fair, you need the rig, you need a powerful PC to run it, and you need the controllers. So, I do think that the game will have a long tail as there is more adoption,” he says.

For the 30-plus developers at Twisted Pixel, it’s not all about VR, however. “As much as we love VR, we love PC and console stuff. It’s more about the concept and what platform fits it best, rather than just trying to cram something onto VR.”

Whatever the next step, Bear gets the nature of the business: “We have the 3- to 5-year outlook, of course,” he offers, “but as you know, that often goes to shit!”

Introduction to the DPDK Sample Applications


This article describes the Data Plane Development Kit* (DPDK*) sample applications.

The DPDK sample applications are small standalone applications which demonstrate various features of DPDK. They can be considered a cookbook of DPDK features. A user interested in getting started with DPDK can take the applications, try out the features, and then extend them to fit their needs.

The DPDK Sample Applications

Table 1 shows a list of some of the sample applications that are available in the examples directory of DPDK:

Bonding                                   Netmap Compatibility
Command Line                              Packet Ordering
Distributor                               Performance Thread
Ethtool                                   Precision Time Protocol (PTP) Client
Exception Path                            Quality of Service (QoS) Metering
Hello World                               QoS Scheduler
Internet Protocol (IP) Fragmentation      Quota and Watermark
IP Pipeline                               RX/TX Callbacks
IP Reassembly                             Server Node EFD
IPsec Security Gateway                    Basic Forwarding/Skeleton App
IPv4 Multicast                            Tunnel End Point (TEP) Termination
Kernel NIC Interface                      Timer
Network Layer 2 Forwarding + variants     Vhost
Network Layer 3 Forwarding + variants     Vhost Xen
Link Status Interrupt                     VMDQ Forwarding
Load Balancer                             VMDQ and DCB Forwarding
Multi-process                             VM Power Management

Table 1. Some of the DPDK sample applications.

These examples range from simple to reasonably complex but most are designed to demonstrate one particular feature of DPDK. Some of the more interesting examples are highlighted below.

  • Hello World: As with most introductions to a programming framework, a good place to start is with the Hello World application. The Hello World example sets up the DPDK Environment Abstraction Layer (EAL) and prints a simple "Hello World" message to each of the DPDK-enabled cores. This application doesn't do any packet forwarding, but it is a good way to test whether the DPDK environment is compiled and set up properly.
  • Basic Forwarding/Skeleton application: The basic forwarding/skeleton contains the minimum amount of code required to enable basic packet forwarding with DPDK. This will allow the user to test and see if their network interfaces are working with DPDK.
  • Network Layer 2 Forwarding: The Network Layer 2 forwarding, or L2fwd application, does forwarding based on Ethernet MAC addresses like a simple switch.
  • Network Layer 3 Forwarding: The Network Layer 3 forwarding, or L3fwd application, does forwarding based on Internet protocols, IPv4, or IPv6 like a simple router.
  • Packet Distributor: The packet distributor demonstrates how to distribute packets arriving on an Rx port to different cores for processing and transmission.
  • Multi process application: The multi process application shows how two DPDK processes can work together using queues and memory pools to share information.
  • RX/TX Callbacks application: The RX/TX Callbacks sample application is a packet forwarding application that demonstrates the use of user-defined callbacks on received and transmitted packets. The application calculates the latency of the packet between RX (packet arrival) and TX (packet transmission) by adding callbacks to the RX and TX packet processing functions.
  • IPSec Security Gateway: The IPSec security gateway application is a minimal example of something closer to a real-world usage example. This is also a good example of an application using the DPDK Cryptodev* framework.
  • Precision Time Protocol (PTP) client: The PTP client is another minimal implementation of a real-world application. In this case the application is a PTP client that communicates with a PTP master clock to synchronize time on a network interface card (NIC) using the IEEE1588 protocol.
  • Quality of Service (QoS) Scheduler: The QoS Scheduler application demonstrates the use of DPDK to provide QoS scheduling.

There are many more examples, which are documented online at dpdk.org. Each of the documented sample applications shows how to compile, configure, and run the application, as well as explaining the main code behind the functionality.

In the next section, we will look at the Network Layer 3 forwarding (L3fwd) sample application in more detail.

The Network Layer 3 Forwarding Sample Application

The Network Layer 3 forwarding, or L3fwd application, demonstrates packet forwarding based on Internet protocol, IPv4, or IPv6 like a simple router. The L3fwd application has two modes of operation, longest prefix match (LPM) and exact match (EM), which demonstrate the use of the DPDK LPM and Hash libraries.

Figure 1 shows a block diagram of the L3fwd application set up to forward packets from a traffic generator using two ports.


Figure 1. The L3fwd application set up to forward packets from a traffic generator.

Longest prefix match (LPM) is a table search method, typically used to find the best route match in IP forwarding applications. The L3fwd application statically configures a set of rules and loads them into an LPM object at initialization time. By default, L3fwd has a statically defined destination LPM table with eight routes, as shown in Table 2.


Table 2. Default LPM routes in L3fwd.

L3fwd uses the IPv4 destination address of the packet to identify its next hop; i.e., the output port ID from the LPM table. It can also route based on IPv6 addresses (from DPDK 17.05).
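
To illustrate the concept only (DPDK's LPM library uses optimized lookup structures, not a linear scan like this), a longest-prefix match can be sketched in Python as follows; the routes are made up:

import ipaddress

routes = {ipaddress.ip_network('192.168.0.0/16'): 0,
          ipaddress.ip_network('192.168.1.0/24'): 1}   # prefix -> output port

def lpm_lookup(dst):
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    # the most specific (longest) matching prefix wins
    return routes[max(matches, key=lambda net: net.prefixlen)] if matches else None

print(lpm_lookup('192.168.1.7'))   # prints 1: the /24 beats the /16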

Exact match (EM) is a hash-based table search method used to find the best route match in IP forwarding applications. In EM lookup, the search key is represented by a five-tuple value of source IP address, destination IP address, source port, destination port, and protocol. The set of flows used by the application is statically configured and loaded into the hash object at initialization time. By default, L3fwd has a statically defined destination EM table with four routes, as shown in Table 3.


Table 3. Default EM routes in L3fwd.

The next hop, i.e., the output interface for the packet, is identified from the EM table entry. EM-based forwarding supports both IPv4 and IPv6.
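
Conceptually, the EM lookup is a plain hash lookup keyed on that five-tuple. A Python analogy (DPDK uses its Hash library, and the flow entry here is made up):

flows = {('198.18.0.1', '198.18.0.2', 1024, 53, 17): 0}   # five-tuple -> output port

def em_lookup(src_ip, dst_ip, src_port, dst_port, proto):
    return flows.get((src_ip, dst_ip, src_port, dst_port, proto))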

Building the Application

The L3fwd application can be built as shown below. The environment variables used are described in the DPDK Getting Started guides.

$ export RTE_SDK=/path/to/rte_sdk
$ export RTE_TARGET=x86_64-native-linuxapp-gcc
$ cd $RTE_SDK/examples/l3fwd
$ make clean
$ make

Running the Application

The command line for the L3fwd application has the following options:

$./build/l3fwd [EAL options] --
                -p PORTMASK [-P] [-E] [-L]
                --config (port,queue,lcore)[,(port,queue,lcore)]
                [--eth-dest=X,MM:MM:MM:MM:MM:MM]
                [--enable-jumbo [--max-pkt-len PKTLEN]]
                [--no-numa]
                [--hash-entry-num 0x0n]
                [--ipv6]
                [--parse-ptype]

This comprises the EAL parameters, which are common to all DPDK applications, and application-specific parameters.

The L3fwd app uses the LPM as the default lookup method. The lookup method can be changed with a command-line option at runtime:

-E: selects the Exact Match lookup method.

-L: selects the LPM lookup method.

Here are some examples of running the L3fwd application.

#LPM
$./build/l3fwd -l 1,2 -n 4 -- -p 0x3 -L --config="(0,0,1),(1,0,2)"
#EM
$./build/l3fwd -l 1,2 -n 4 -- -p 0x3 -E --config="(0,0,1),(1,0,2)" \
               --parse-ptype

Conclusion

This article gave an overview of a subset of the DPDK sample applications and then looked in more detail at the L3fwd sample application.

About the Author

Bernard Iremonger is a network software engineer with Intel Corporation. His work is primarily focused on the development of the data plane libraries for DPDK. His contributions include conversion of the DPDK documentation to use the Sphinx* documentation tool, enhancements to the poll mode drivers (PMDs) to support port hot plug and live migration of virtual machines (VMs) with single root IO virtualization (SR-IOV) virtual functions (VFs), and API extensions to the ixgbe and i40e PMDs for control of the VFs from the physical function (PF).

Python mpi4py on Intel® True Scale and Omni-Path Clusters


Python users of the mpi4py package who leverage its capabilities for distributed computing on supercomputers with Intel® True Scale or Intel® Omni-Path interconnects might run into issues with the default configuration of mpi4py.

By default, the mpi4py package uses matching probes (MPI_Mprobe) instead of regular MPI_Recv operations for the receiving function recv(). These matching probes, introduced in the MPI 3.0 standard, are however not supported for all fabrics, which may lead to a hang in the receiving function.

Therefore, users are recommended to leverage the OFI fabric instead of TMI for Omni-Path systems. For Intel® MPI, the configuration could look like the following environment variable setting:

I_MPI_FABRICS=ofi

Users utilizing True Scale or Omni-Path systems via the TMI fabric might alternatively switch off the usage of matching probe operations within the mpi4py recv() function.

This can be established via

mpi4py.rc.recv_mprobe = False

right after importing the mpi4py package.
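
For example, a minimal two-rank sketch (the payload is arbitrary):

import mpi4py
mpi4py.rc.recv_mprobe = False   # must be set before "from mpi4py import MPI"

from mpi4py import MPI
comm = MPI.COMM_WORLD
if comm.rank == 0:
    comm.send({'x': 1}, dest=1)
elif comm.rank == 1:
    data = comm.recv(source=0)  # no longer uses matching probes internally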


Getting Started in Linux with Intel® SDK for OpenCL™ Applications


This article is a step-by-step guide to quickly get started developing using the Intel® SDK for OpenCL™ Applications with the Linux SRB5 driver package.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

For SRB4.1 instructions, please see https://software.intel.com/en-us/articles/sdk-for-opencl-gsg-srb41.

Step 1: Install the driver

The install_OCL_driver.sh script covers the steps needed to install the SRB5 driver package on Ubuntu 14.04, Ubuntu 16.04, CentOS 7.2, and CentOS 7.3.

To use it:

$ mv install_OCL_driver.sh_.txt install_OCL_driver.sh
$ sudo su
$ ./install_OCL_driver.sh

This script automates downloading the driver package, installing prerequisites and user-mode components, patching the 4.7 kernel, and building it. 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results looking like this at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

Experimental installation without kernel patch or rebuild:

If you are using Ubuntu 16.04 with the default 4.8 kernel, you may be able to skip the kernel patch and rebuild steps. This configuration works fairly well, but several features (for example, OpenCL 2.x device-side enqueue, shared virtual memory, and VTune™ GPU support) require the patches. Installation without the patches has been "smoke test" validated to check that it is viable for experimental use only; it is not fully supported or certified.

Step 2: Install the SDK

This script sets up all prerequisites for a successful SDK install on Ubuntu.

$ mv install_SDK_prereq_ubuntu.sh_.txt install_SDK_prereq_ubuntu.sh
$ sudo su
$ ./install_SDK_prereq_ubuntu.sh

After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command-line compiler ioc64 is installed with:

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

New Issue of The Parallel Universe is Here: Tuning Autonomous Driving Using Intel® System Studio


Everything old is new again, and that’s just fine with us.

We hope you’ll agree after you read the latest issue of The Parallel Universe, Intel’s quarterly magazine for developers. In this issue, we take a fresh look at what’s come before (OpenMP, MySQL*, Intel® C++ Compiler, vectorization) while looking ahead (autonomous driving applications, edge-to-cloud data compression, the latest programming languages).

Download it and see what’s possible with leading-edge HPC products and practices.

  • Conducting hotspot analysis for autonomous driving applications
  • What’s so unique about the Julia* programming language, and why its use doubles every year
  • How vectorization saves the day for an open source application used for large-scale, 3-D simulations
  • Plus lots more

Read the new issue

Subscribe

Introduction to VR: Creating a First-Person Player Game for the Oculus Rift*



Introduction

This article introduces virtual reality (VR) concepts and discusses how to integrate a Unity* application with the Oculus Rift*, add an Oculus first-person player character to the game, and teleport the player to the scene. This article is aimed at an existing Unity developer who wants to integrate Oculus Rift into the Unity scene. The assumption is that the reader already has the setup to create a VR game for Oculus: an Oculus-ready PC and the Oculus Rift and touch controllers.

Development tools

  • Unity 5.5 or greater
  • Oculus Rift and touch controllers

Creating a Terrain in Unity

Multiple online resources explain how to create a basic terrain in Unity; I followed the Unity manual. Adding lots of trees and grass detail to the scene can hurt performance, causing the frames per second (FPS) to drop significantly. Use only as many trees as you need, and set the min/max height and min/max width of the grass as low as possible to lessen the impact on the FPS. For a comfortable VR experience, a minimum of 90 FPS is recommended.

Setting up the Oculus Rift

This section explains how to set up the Oculus Rift, place the Oculus first-person character in the scene, and teleport the player from one scene to another.

To set up Oculus, follow the downloadable instructions from the Oculus website.

Once you complete the setup, make sure that Oculus is integrated with your machine, and then do the following:

  1. Download the Oculus utilities for Unity 5.
  2. Import the Unity package into your Unity project.
  3. Remove the Main Camera object from your scene. It’s unnecessary because the Oculus OVRPlayerController prefab already comes with a custom VR camera.
  4. Navigate to the Assets/OVR/Prefabs folder.
  5. Drag and drop the OVRPlayerController prefab into your scene. Alternatively, you can work with the OVRCameraRig prefab; for a description of these prefabs and an explanation of their differences, go to this link. The sample shown below was implemented using OVRPlayerController.

Adjust the headset for the best fit and so you have clear visibility of the scene. Adjust the settings as necessary and according to the instructions provided while setting up the Oculus Rift. You can click the Stats button to observe the scene’s FPS. If the FPS is less than the recommended 90 FPS, decrease the details in your Unity scene or troubleshoot to find out what parts of the scene are consuming more CPU/GPU and why that is impacting the FPS.

Now let’s look at how we can interact with the objects in your scene with the Oculus touch controllers. Let’s add a shotgun model to the scene so that the player can attack the enemies. You can either create your own model or download it from the Unity Asset Store. I downloaded the model from the Unity store.

  1. Make this model the child of the RightHandAnchor of the OVRPlayerController as shown below.
  2. Adjust the size and orientation of the model so that it fits the scene and your requirements.

Now once you move the right touch controller, you are directly interacting with the shotgun in the scene.

Adding the Code to Work with the Oculus Touch Controller

In the code snippet shown below, we check the OVRInput and based on the button pressed in the Oculus touch controller, we are doing one of three things:

  • If the Primary Index Trigger button (that is, the right controller’s trigger button) is pressed, we call the RayCastShoot function with the Teleport option set to false. This condition lets the player object fire at the enemies and any other targets that we set up in the scene. We are also making sure that we can only fire once within the specified time interval by checking the condition Time.time > nextfire.
  • If the A button on the controller is pressed, we call the RayCastShoot function, setting the Teleport option to true. This option allows the player to teleport to different points in the terrain. The teleported points can be either predefined points set in the scene, or the player can be teleported directly to the hit point. It is up to the developer to decide, based on the requirements of the game, where in the scene to teleport the player.
  • If the B button of the controller is pressed at any time in the game, the position of the player is reset to its original position.
void Update () {

        if (OVRInput.Get(OVRInput.Button.PrimaryIndexTrigger) && (Time.time > nextfire))
        {
            //If the Primary Index trigger is pressed on the touch controller we fire at the targets
            nextfire = Time.time + fireRate;
            audioSource.Play();

            // Teleporting is set to false here
            RayCastShoot(false);
        }
        else if (OVRInput.Get(OVRInput.RawButton.A) && (Time.time > nextfire))
        {

            // Teleporting is set to true when Button A is pressed on the controller
            nextfire = Time.time + fireRate;

            RayCastShoot(true);

        }

        else if (OVRInput.Get(OVRInput.RawButton.B) && (Time.time > nextfire))
        {
            // If Button B is pressed on the controller player is reset to his original position

            nextfire = Time.time + fireRate;
            player.transform.position = resetPosition;

        }
    }

In the sample below, I added zombies, downloadable from the Unity Asset Store, as the enemies and also added some targets, like rocks and grenades, to add more particle effects, like explosions, rock impacts, and so on, to the scene. I also created a simple animation for the zombie following this tutorial.

Now let’s look at the RayCastShoot function. Physics.Raycast casts a ray from the gunTransform position in the forward direction against the colliders in the scene; the range is specified by the weaponRange variable. Information about whatever the ray hits is stored in the hit variable.

RaycastHit hit;

        if (Physics.Raycast(gunTransform.position, gunTransform.forward, out hit, weaponRange))

The RayCastShoot function takes a Boolean value. If the value is true, the function teleports the player; if it is false, the function checks which objects in the scene (zombies, rocks, grenades, and so on) the ray collides with and destroys the enemies and the targets.

The first thing we do concerns the zombie object. We add a Physics Rigidbody component and set its kinematic value to true. We also add a small script, named Enemy.cs, and attach it to the enemy object. The script, shown below, tracks the enemy's life. Each call to the enemyhit function (that is, each time we fire at the enemy) reduces the enemy's life by one; after the enemy is shot five times, it is destroyed.

In the RayCastShoot function we make the following call to get a handle to the zombie object and determine whether we are actually firing at the zombie.

Enemy enemy = hit.collider.GetComponentInParent<Enemy>();

If the enemy object is not null, we call the enemyhit function to reduce its life by one. We also instantiate the blood effect prefab, as shown below, each time the zombie is hit. We check the enemy's fullLife value, and if it is less than or equal to zero, the zombie object is destroyed.

//Enemy.cs
public class Enemy : MonoBehaviour {

    //public GameObject explosionPrefab;
    public int fullLife = 5;


    public void enemyhit(int life)
    {
        //subtract life  when Damage function is called
        fullLife -= life;

        //Check if full life has fallen below zero
        if (fullLife <= 0)
        {
            //Destroy the enemy if the full life is less than or equal to zero
            Destroy(gameObject);

        }
    }

}

// if the hit object is the enemy
//Raycastexample.cs from where we are calling the enemyhit function
if (enemy != null)

            {
                enemy.enemyhit(1);

                //Checks the health of the enemy and resets to  max again
                //Instantiates the blood effect prefab for each hit

                var bloodEffect = Instantiate(bloodPrefab);
                bloodEffect.transform.position = hit.point;

                if (enemy.fullLife <= 0)
                {

                    enemy.fullLife = 5;
                }
            }

If the object we hit is anything other than the zombie, we can access the object by adding a tag to the different objects in the scene. For example, we added a tag called “Mud” for the ground, a tag called “Rock” for rocks, and so on. As shown in the code sample below, we can compare the tags to objects that we hit, then instantiate the respective prefab effects for those objects.

//If the hit targets are the targets other than the enemy like the mud, Rocks , Grenades on the terrain
else
            {
                var impactEffect = Instantiate(impactEffectPrefab);
                impactEffect.transform.position = hit.point;
                Destroy(impactEffect, 4);

                // If the Target is the ground
                if ((hit.collider.gameObject.CompareTag("Mud")))
                {

                    var mudeffect = Instantiate(mudPrefab);
                    mudeffect.transform.position = hit.point;


                }

                // If the Target is  Rocks
                else if ((hit.collider.gameObject.CompareTag("Rock")))
                {

                    var rockeffect = Instantiate(rockPrefab);
                    rockeffect.transform.position = hit.point;
                }

                // If the Target is the Grenades

                else if ((hit.collider.gameObject.CompareTag("Grenade")))
                {

                    var grenadeEffect = Instantiate(explosionPrefab);
                    grenadeEffect.transform.position = hit.point;
                    Destroy(grenadeEffect, 4);

                }
            }

        }

Teleporting

Teleporting is an important aspect in VR games that is recommended so that the user can avoid nausea when moving around the scene. The example shown below implements a simple teleporting mechanism in Unity. In the code we can either teleport the player to the “hit” point or we can create multiple points in the terrain where the player can be teleported.

  1. Create an empty game object and name it “Teleport.”

  2. Create a tag called “Teleport.”

  3. Assign the Teleport tag to the teleport object as shown below.

  4. Press CTRL+D and duplicate these points to create more teleport points in the scene. Adjust the positions of the points so that they span the terrain. I set the y position to be the same as my OVR player prefab so that the y value of the points matches my camera position.

As per the code below, if the teleport is set to true, we get the array of all the points in the teleportPoints variable, and we randomly pick one of these points for the player to teleport.

var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];

Finally, we set the player’s transform position to the new position.

player.transform.position = newPosition.transform.position;
if (teleport)
            {
                //If the player needs to be teleported to the hit point
                // Vector3 newposition = hit.point;
                //player.transform.position = new Vector3(newposition.x, player.transform.position.y, newposition.z);


                //If the player needs to be teleported to the teleport points that are created in the Unity scene. Below code teleports the player
                // to one of the points randomly

                var teleportPoints = GameObject.FindGameObjectsWithTag("Teleport");
                var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];


                player.transform.position = newPosition.transform.position;

                return;
            }

Building Settings and Deploying the VR Application

After you are finished with the game, deploy your VR application for PCs.

  1. Go to File > Build Settings, and then for Target Platform, select Windows.

  2. Go to Edit > Project Settings > Player, and then click the Inspector tab.
  3. Click Other Settings, and then select the Virtual Reality Supported check box.

  4. Compile and then build to get the final VR application.

Conclusion

Creating a VR game is a lot of fun, but it also requires precision. If you are an existing Unity developer and have a game that is not specific to VR, you can also integrate it with the Oculus Rift and port it as a VR game. At the end of this article, we list a number of references that focus on best practices in VR.

Below is the complete script for the sample scene discussed in this article.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class Raycastexample : MonoBehaviour {


    //Audio clip to play
    public AudioClip clip;
    public AudioSource audioSource;

    //rate of firing at the targets
    public float fireRate = .25f;
    // Range to which Raycast will detect the collision
    public float weaponRange = 300f;

    //Prefab for Impacts at the target
    public GameObject impactEffectPrefab;
    //Prefab for Impacts for grenade explosions
    public GameObject explosionPrefab;

    //Prefab at gun transform position
    public GameObject GunfirePrefab;

    //Prefab if the target is the terrain
    public GameObject mudPrefab;
    // Prefab when hits the Zombie
    public GameObject bloodPrefab;

    // prefabs when hits the rocks
    public GameObject rockPrefab;

    // Player transform that is used in teleporting
    public Transform player;
    private float nextfire;

    //transform at the Gun end to show some muzzle effects when firing
    public Transform gunTransform;
    // Position to reset the player to its original position when "B" is pressed on the touch controller
    private Vector3 resetPosition;

    // Use this for initialization
    void Start () {

        // Play the Audio clip while firing
        audioSource = GetComponent<AudioSource>();
        audioSource.clip = clip;
        // Reset position after teleporting to set the position to his original position
        resetPosition = transform.position;

    }

	// Update is called once per frame
	void Update () {
        //If the Primary Index trigger is pressed on the touch controller we fire at the targets
        if (OVRInput.Get(OVRInput.Button.PrimaryIndexTrigger) && (Time.time > nextfire))
        {
            nextfire = Time.time + fireRate;
            audioSource.Play();
            // Teleporting is set to false here
            RayCastShoot(false);
        }
        else if (OVRInput.Get(OVRInput.RawButton.A) && (Time.time > nextfire))
        {

            // Teleporting is set to true when Button A is pressed on the controller
            nextfire = Time.time + fireRate;

            RayCastShoot(true);

        }

        else if (OVRInput.Get(OVRInput.RawButton.B) && (Time.time > nextfire))
        {
            // If Button B is pressed on the controller player is reset to his original position

            nextfire = Time.time + fireRate;
            player.transform.position = resetPosition;

        }

    }

    private void RayCastShoot(bool teleport)
    {
        RaycastHit hit;
        //Casts a ray against the targets in the scene and returns the "hit" object.
        if (Physics.Raycast(gunTransform.position, gunTransform.forward, out hit, weaponRange))
        {

            if (teleport)
            {
                //If the player needs to be teleported to the hit point
                // Vector3 newposition = hit.point;
                //player.transform.position = new Vector3(newposition.x, player.transform.position.y, newposition.z);


                //If the player needs to be teleported to the teleport points that are created in the Unity scene. Below code teleports the player
                // to one of the points randomly

                var teleportPoints = GameObject.FindGameObjectsWithTag("Teleport");
                var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];


                player.transform.position = newPosition.transform.position;



                return;
            }

            //Attach the Enemy script as component to the enemy

        Enemy enemy = hit.collider.GetComponentInParent<Enemy>();

            // Muzzle effect of the gun; position the instantiated effect at the gun transform

            var GunEffect = Instantiate(GunfirePrefab);
            GunEffect.transform.position = gunTransform.position;



            // if the hit object is the enemy

            if (enemy != null)

            {
                enemy.enemyhit(1);

                //Checks the health of the enemy and resets to  max again
                //Instantiates the blood effect prefab for each hit

                var bloodEffect = Instantiate(bloodPrefab);
                bloodEffect.transform.position = hit.point;

                if (enemy.fullLife <= 0)
                {

                    enemy.fullLife = 5;
                }
            }

            //If the hit targets are the targets other than the enemy like the mud, Rocks , Grenades on the terrain

            else
            {
                var impactEffect = Instantiate(impactEffectPrefab);
                impactEffect.transform.position = hit.point;
                Destroy(impactEffect, 4);

                // If the Target is the ground
                if ((hit.collider.gameObject.CompareTag("Mud")))
                {
                    Debug.Log(hit.collider.name + ", " + hit.collider.tag);
                    var mudeffect = Instantiate(mudPrefab);
                    mudeffect.transform.position = hit.point;


                }

                // If the Target is  Rocks
                else if ((hit.collider.gameObject.CompareTag("Rock")))
                {

                    var rockeffect = Instantiate(rockPrefab);
                    rockeffect.transform.position = hit.point;
                }

                // If the Target is the Grenades

                else if ((hit.collider.gameObject.CompareTag("Grenade")))
                {

                    var grenadeEffect = Instantiate(explosionPrefab);
                    grenadeEffect.transform.position = hit.point;
                    Destroy(grenadeEffect, 4);

                }
            }

        }
    }
}
//Enemy.cs
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class Enemy : MonoBehaviour {

    //public GameObject explosionPrefab;
    public int fullLife = 5;

    // Use this for initialization
    void Start () {

	}


    public void enemyhit(int life)
    {
        //subtract life  when Damage function is called
        fullLife -= life;

        //Check if full life has fallen below zero
        if (fullLife <= 0)
        {
            //Destroy the enemy if the full life is less than or equal to zero
            Destroy(gameObject);

        }
    }


}

References for VR on the Intel® Developer Zone

Virtual Reality User Experience Tips from VRMonkey: https://software.intel.com/en-us/articles/virtual-reality-user-experience-tips-from-vrmonkey

Presence, Reality, and the Art of Astonishment in Arizona Sunshine: https://software.intel.com/en-us/blogs/2016/12/01/presence-reality-and-the-art-of-astonishment-in-arizona-sunshine

Combating VR Sickness with User Experience Design: https://software.intel.com/en-us/articles/combating-vr-sickness-with-user-experience-design

Interview with Well Told Entertainment about their Virtual Reality Escape Room Game: https://software.intel.com/en-us/blogs/2016/11/30/interview-with-well-told-entertainment-about-their-virtual-reality-escape-room-game

What is the Next Leap in VR Experiences?: https://software.intel.com/en-us/videos/what-is-the-next-leap-in-vr-experiences

VR Optimization Tips from Underminer Studios: https://software.intel.com/en-us/articles/vr-optimization-tips-from-underminer-studios

VR Optimizations with Intel® Graphics Performance Analyzers: https://software.intel.com/en-us/videos/vr-optimizations-with-intel-graphics-performance-analyzers

Creating Immersive Virtual Worlds Within Reach of Current-Generation CPUs: https://software.intel.com/en-us/articles/creating-immersive-virtual-worlds-within-reach-of-current-generation-cpus

About the Author

Praveen Kundurthy works in the Intel® Software and Services Group. His main focus is on mobile technologies, Microsoft Windows*, virtual reality, and game development.

How to use the Intel® Advisor Python API


Introduction

You can now access the Intel® Advisor database using our new Python API. We have provided several reference examples of how to use this new functionality. The API provides a flexible way to report on useful program metrics (over 500 metric elements can be displayed). This article describes how to use this new functionality.

Getting started

To get started, you first need to set up the Intel Advisor environment:

> source advixe-vars.sh

Next, to set up the Intel Advisor database, you need to run some collections. Some program metrics require additional analyses, such as trip counts, memory access patterns, and dependencies:

> advixe-cl --collect survey --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect tripcounts -flops-and-masks -callstack-flops --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect map --mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect dependencies --mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>

Finally, copy the Intel Advisor reference examples to a test area:

cp -r /opt/intel/advisor_2018/python_api/examples .

Using the Intel Advisor Python API

The reference examples we provide are just a small set of the reporting possible using this flexible way to access your program data. The file columns.txt provides a list of the metrics currently supported. Here are some examples showing the Python API in action:

  1. Generate a combined report showing all data collected:
         python joined.py ./your_project >& report.txt
  2. Generate an HTML report:
         python to_html.py ./your_project
  3. Generate a Roofline HTML chart:
         python to_html.py ./your_project
  4. Generate cache simulation statistics. Before doing your data collection, set the following environment variable: export ADVIXE_EXPERIMENTAL=cachesim
     1. Run a memory access pattern collection to gather cache statistics:
         advixe-cl --collect map --mark-up-list=4 --project-dir ./your_project -- <your-executable-with-parameters>
     2. Set up cache collection in the project properties.
     3. python cache.py ./your_project
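
Beyond the bundled scripts, you can also query the database directly from your own script. Below is a minimal sketch modeled on the bundled examples; the project must already contain survey data, and the exact column names (such as 'self_time') come from columns.txt and can vary between Advisor versions:

import sys
import advisor

# Open the Advisor project directory given on the command line.
project = advisor.open_project(sys.argv[1])

# Load the survey data and walk the bottom-up view, printing one metric.
data = project.load(advisor.SURVEY)
for entry in data.bottomup:
    print(entry['self_time'])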

Conclusion/Summary

The new Intel Advisor Python API provides a powerful way to generate meaningful program statistics and reports. The provided examples give a framework of scripts showing the power of this new interface.


Machine Learning and Knowledge Reasoning Probing with Intel® Architecture


Introduction

Intelligence varies in kind and degree, and it occurs in humans, many animals, and some machines. For machines, Artificial Intelligence (AI) is the set of methods and procedures that give machines the ability to achieve goals in the world. It is present in many areas of study, such as deep learning, computer vision, reinforcement learning, natural language processing, semantics, learning theory, case-based reasoning, and robotics. During the 1990s, the attention was on logic-based AI, mainly concerned with knowledge reasoning (KR), whereas the focus nowadays lies on machine learning (ML). This shift contributed to the field in a way knowledge reasoning never did. However, a new shift is coming. Knowledge reasoning is resurging in response to a demand for inference methods, while machine learning keeps building on its statistical achievements. This new change occurs as knowledge reasoning and machine learning begin to cooperate with each other, a scenario for which computing is not yet well defined.

Intelligent computing is pervasive, demands are monotonically increasing, and time for results is shortening. But while consumer products operate under those conditions, the process of building the complex mathematical models that support such applications relies on a computational infrastructure that demands large amounts of energy, time, and processing power. There is a race to develop specialized hardware that makes modern AI methods significantly faster and cheaper.

The strategy of packing such specialized hardware with elaborate software components into a successful architecture is a wise plan of action. Intel added top machine learning technology to its expertise when it acquired the hardware and software startup Nervana Systems (now Intel® Nervana™). Moreover, the well-known Altera®, which makes FPGA chips that can be reconfigured to accelerate specific algorithms, was also integrated into the company. Therefore, the power and energy efficiency of Intel® processors and architecture can help companies, software houses, cloud providers, and end-user devices upgrade their capability to use AI. The relevance of such chips for developing and training new AI algorithms should not be underestimated.

AI systems are usually perceived only as software, since this is the layer nearest to ordinary developers and final users. However, AI also requires substantial hardware functionality to support its calculations. This is why choosing the Deep Neural Network (DNN) performance primitives within the Intel® Math Kernel Library (Intel® MKL) and the Intel® Data Analytics Acceleration Library (Intel® DAAL) is a clever decision, since these libraries make better use of Intel processors and support AI development through hardware. Intelligent applications need CPUs that perform specific types of mathematical calculations, such as vector algebra, linear equations, eigenvalues and eigenvectors, statistics, matrix decomposition, and linear regression, and that handle large quantities of basic computations in parallel. Concerning machine learning, many neural network solutions are implemented within hardware artifacts, and deep learning requires a huge amount of matrix multiplication. Considering knowledge representation, forward and backward chaining1 demand many vector algebra computations, while the resolution principle2 requires singular value decomposition. Therefore, AI benefits from specialized processors with fast connections between parallel onboard computing cores, fast access to ample memory for storing complex models and data, and mathematical operations optimized for speed.
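
To make the connection concrete, the two kernels just mentioned map directly onto optimized library routines; a minimal Python sketch using NumPy (which can itself be built against Intel MKL):

import numpy as np

# Dense matrix multiplication dominates deep learning workloads (BLAS gemm).
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = a @ b

# Singular value decomposition underlies resolution-based inference (LAPACK).
u, s, vt = np.linalg.svd(c)
print("largest singular value:", s[0])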

There are many research and development reports describing the usage of Intel® architecture to support machine learning applications. However, the same platform can also be used for the symbolist AI approach, a market segment that has been overlooked by programmers and software architects. This paper aims to promote the usage of Intel® architecture to speed up not only machine learning, but also knowledge reasoning applications.

Performance Test

In order to illustrate that knowledge reasoning applications can also benefit from Intel architecture, this test considers two tasks from real artificial intelligence problems: one as a baseline for comparison and the other as a knowledge reasoning sample. The first task (ML) represents the machine learning approach, using a Complement Naive Bayes3 classifier (github.com/oiwah/classifier) to identify the encryption algorithm used to encode plain text messages4. The classification model is constructed by training on over 600 text samples, each with more than 140,000 characters, cyphered with DES, Blowfish, ARC4, RSA, Rijndael, Serpent, and Twofish. The second task (KR) represents the knowledge reasoning approach, using the Resolution Principle2 from an inference machine called Mentor*5 (not publicly available) to detect fraud in car insurance claims. The sample is composed of 1,000 claims, and the inference machine is loaded with 78 first-order logic rules.
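
The article's classifier is a C++ implementation; for readers who want a quick feel for the ML task, here is a minimal sketch of the same Complement Naive Bayes idea using scikit-learn (the sample texts, labels, and character n-gram features below are placeholder assumptions, not the article's data or feature pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

ciphertexts = ["kq9f3x...", "zzl0pa...", "am4xvq..."]   # placeholder cipher texts
labels = ["DES", "Blowfish", "RSA"]                     # placeholder labels

# Character n-grams are one plausible feature choice for ciphertext.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(ciphertexts)

model = ComplementNB().fit(X, labels)
print(model.predict(vectorizer.transform(["kq9f3x..."])))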

Performance is measured by how many RDTSC (Read Time Stamp Counter)6 clock cycles it takes to run the tests. RDTSC was used to track performance rather than wall-clock time because it counts clock ticks and is thus invariant even if the processor core changes frequency. This is not true of wall-clock time, which makes RDTSC the more precise measuring method. Note, however, that traditional performance measurement is usually done with wall-clock time, since it provides acceptable precision.

Tests were performed on a system equipped with an Intel® Core™ i7-4500U @ 1.8 GHz processor, 64 bits, Ubuntu 16.04 LTS operating system, 8 GB RAM, hyper-threading turned on with 2 threads per core (you can check this by typing sudo dmidecode -t processor | grep -E '(Core Count|Thread Count)'), and with system power management disabled.

First, the C++ source code was compiled with the gcc 5.4.0 compiler and the test was performed. Then, the same source code was recompiled with Intel® C++ Compiler XE 17.0.4 and Intel® Math Kernel Library (Intel® MKL) 2017 (-mkl=parallel), and the test was repeated. Note that many things happen within the operating system that are invisible to the application programmer and affect the cycle count, so measurement variations are expected. Hence, each test ran 300 times in a loop, and any result far higher than the others was discarded.
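
The measurement protocol can be summarized in a short sketch. This is an illustration only: the article reads the RDTSC counter, for which time.perf_counter() below is merely a portable stand-in, and dropping samples above twice the median is just one way to interpret the outlier-discard rule:

import statistics
import time

def measure(workload, runs=300):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()   # stand-in for reading RDTSC
        workload()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    kept = [s for s in samples if s <= 2 * median]   # discard outliers
    return statistics.mean(kept)

print(measure(lambda: sum(i * i for i in range(100000))))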

Figure 1 shows the average clock cycles spent building the Complement Naive Bayes classification model for the proposed theme. Training the model uses statistical and math routines. The Intel® C++ Compiler XE and Intel® MKL combination demands fewer clock cycles than the configuration commonly used to compile C++ programs, so the tuning platform did a much better job. Notice that this evaluation compares source code that was not changed at all. Therefore, although a 1.66x speedup was obtained, higher values are expected once developers exploit parallelism and specialized methods.


Figure 1: Test of machine learning approach using Complement Naive Bayes classifier.

Figure 2 shows the average clock cycles spent producing deductions using the Resolution Principle as the core engine of an inference machine. It uses several math routines and a great deal of singular value decomposition to compute the first-order predicates. Here, the Intel® C++ Compiler XE and Intel® MKL (-mkl=parallel) combination outperformed the traditional compiling configuration, again beating the ordinary development environment. The speedup obtained was 2.95x, even though no parallelism was exploited and no specialized methods were called.


Figure 2: Test of knowledge reasoning approach using resolution principle to perform inference.

The former test shows a machine learning method being enhanced by a tuning environment. That result by itself is not surprising, since it was expected; its relevance lies in serving as a reference for the latter test, in which the same environment was used. The inference machine, under the same conditions, also obtained a good speedup. This is evidence that applications based on this approach, such as expert systems, deduction machines, and theorem provers, can also be enhanced by Intel® architecture.

Conclusion

This article presented a performance test of an Intel® tuning platform composed of an Intel® processor, Intel® C++ Compiler XE, and Intel® MKL, applied to common AI problems. The two main approaches to artificial intelligence were probed: machine learning was represented by an automatic classification method, and knowledge reasoning was characterized by a computational inference method. The results suggest that such a tuning platform can accelerate these AI computations compared with a traditional software development environment. Both approaches are necessary to supply intelligent behavior to machines. The libraries and the processor helped improve the performance of those functions by taking advantage of special features in Intel® products, speeding up execution. Note that it was not necessary to modify the source code to take advantage of these features.

AI applications can run faster and consume less power when paired with processors designed to handle the set of mathematical operations these systems require. Intel® architecture provides specialized instruction sets in processors, with fast bus connections to parallel onboard computing cores and computationally cheaper access to memory. The environment composed of an Intel® processor, Intel® C++ Compiler XE, and Intel® MKL empowers developers to construct tomorrow’s intelligent machines.

References

1. Merritt, Dennis. Building Expert Systems in Prolog, Springer-Verlag, 1989.

2. Russell, Stuart; Norvig, Peter. Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, Pearson Education Inc., 2nd edition, 2003.

3. Rennie, Jason D.; Shih, Lawrence; Teevan, Jaime; Karger, David R. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: International Conference on Machine Learning, 616-623, 2003.

4. Mello, Flávio L.; Xexéo, José A. M. Cryptographic Algorithm Identification Using Machine Learning and Massive Processing. IEEE Latin America Transactions, v. 14, p. 4585-4590, 2016. doi: 10.1109/TLA.2016.7795833

5. Metadox Group, Mentor, 2017. http://www.metadox.com.br/mentor.html  Accessed on June 12th, 2017.

6. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2B: Instruction Set Reference, M-U, Order Number: 253667-060US, September, 2016. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-2b-manual.pdf  Accessed on May 30th, 2017.

Intel® Enhanced Privacy ID (EPID) Security Technology


Introduction

With the increasing number of connected devices, the importance of security and user privacy has never been more relevant. Protecting information content is critical to prevent exposure of trade secrets for businesses, identity theft for individuals, and countless other harmful scenarios that cost both money and time to repair. Part of protecting data and privacy is ensuring that the devices touching the data are authentic and have not been hijacked or replicated into a non-genuine piece of hardware.

In this article we discuss the Intel® Enhanced Privacy Identification (Intel® EPID) security scheme, which specifically addresses two device-level security issues: anonymity and membership revocation. Billions of existing devices, including most Intel® platforms manufactured since 2008, create signatures that need Intel® EPID verification. Intel provides the Intel® EPID SDK as open source and encourages device manufacturers to adopt it as an industry standard for device identity in IoT.

Security Review – Public Key Encryption and Infrastructure

When exchanging data between two people or systems, it is important to ensure that it arrives securely and is not forged. The recipient should have high confidence that the sender is who they say they are. One of the most widely used methods of ensuring this trusted data transport is to employ a digital signature. One method of creating a digital signature is the Public Key Encryption (PKE) security scheme. Using a mathematical hashing algorithm, two binary keys are generated that work together to encrypt and decrypt the data. Data that is encrypted (or signed, in this use case) using the private key can only be decrypted (verified) using the matching public key. The private key is never shared with anyone, while the public key is available to everyone. This method guarantees that any data decrypted using a public key was indeed encrypted using the matching private key. For the most part, using Public Key Encryption for device authenticity works well; however, it does have a few limitations.
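
As a minimal sketch of this sign-and-verify flow, using the third-party Python cryptography package and RSA-PSS as one concrete choice of signature scheme (the EPID scheme discussed later uses different, group-based cryptography):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Generate a key pair: the private key stays secret, the public key is shared.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"device telemetry payload"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Sign with the private key...
signature = private_key.sign(message, pss, hashes.SHA256())

# ...and verify with the matching public key (raises InvalidSignature on failure).
public_key.verify(signature, message, pss, hashes.SHA256())
print("signature verified")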

Problem 1: Certifying Keys

The first limitation involves the validity of the sender’s key. To verify a signature, the public key of the sender is required; however, there is no way to guarantee that it belongs to the sender or has not been stolen or tampered with. An additional step can be taken to ensure the validity of the public key, which involves certification from a third party called an issuer. Using Public Key Infrastructure (PKI), the level of security can be raised by introducing a new element called a digital certificate, which is signed by the issuer’s private key. The certificate contains the public key of the member, the member’s name, and other optional details. Using this method guarantees that the public key being used is the actual key issued, and hence belongs to the actual sender of the data. Think of an issuer as a notary who guarantees that a signature is correct because they witnessed the person writing it. A digital certificate issued by a certified authority solves the problem of certifying that a public key is authentic.

Problem 2: Shielding Unique Identity

A second limitation of PKI is the inability to remain anonymous while still being granted access. Because a public certificate contains the key owner’s name and information, the ownership of the secured information is inherently known, and if the same device is verified multiple times, its activity can be tracked. PKI works well for signed and encrypted email, a scenario where users are meant to be identified: the recipient installs the public certificate of the sender and, when opening a message, has a level of trust that the sender signed it using a protected matching private key.

As devices increasingly request authentication to systems, there is a greater need for both devices and users to be anonymous. While a valid attestation is required, it is not a requirement that the device be individually identified, or that it provide any details beyond the minimum amount of information necessary to prove that it is a genuine member of a trusted device family. Taking this approach allows devices to access a system based on their approved level of access rather than on personal information such as a MACID or their identity. In other words, in the Intel® EPID scheme, if a hundred authentic signatures are verified, the verifier cannot determine whether one hundred devices were authenticated or the same device was authenticated one hundred times.

Problem 3: Revoking Access

Yet another limitation of PKE is that there is no easy mechanism for revoking a private key that has been compromised. If anyone other than the user gets access to the private key, they can masquerade as that user, resulting in a loss of trust for the device. Private keys are often stored in a separate, encrypted chip called a Trusted Platform Module (TPM). While this hardware trusted-module approach is much more secure, the existence of the private key on the device still creates the possibility that it can be stolen. Fixing the problem of a stolen key would involve issuing a new certificate and manually flashing a new private key onto the device. The ability to easily revoke a compromised private key would allow a device to be flagged and disabled automatically, preventing any further identity theft.

Roles in a Public Key Infrastructure

CA: A Certificate Authority is the entity that issues security certificates.
RA: The Registration Authority accepts requests for new certificates, ensures the authenticity of the requestor, and completes the registration process with the CA on behalf of the requestor.
VA: A Validation Authority is a trusted third party that can validate a certificate on behalf of a Certificate Authority.
Member: The role of member can be assumed by an end user or a device. The member is the role that requests attestation of itself during secure exchanges.


Figure 1 - PKI Roles and Process Flow

Authentication vs Identification

Gaining access to a system should not always require user identification. The intent behind requesting access to a system is to obtain access while providing only minimal, certifiable proof that access should be granted. Each user might require a certain level of anonymity based on specific use cases. Take, as an example, a medical wristband device that monitors the sleep habits of someone experiencing insomnia. For the individual, it is important that the data is provided to doctors for analysis without allowing anyone else to potentially identify them or their private medical data.

For users accessing services, the authentication process is owned by the access provider, which often ties access rights directly to an account identifier, which in turn is usually linked to additional account details that the user may want to keep private. Unfortunately, most systems today require a user or device to identify themselves in a way that can be traced back to the original user with every transaction. An example in software would be a username; for a device, it might be a MACID or a public key held in secure storage. To prevent this, a user must be able to effectively and rightfully use a system without being required to provide any information that can be immediately linked to them. One example is a toll booth: a user should be able to pass through the booth because they were previously issued a valid RFID tag, yet no personal information should be required, and the user is not directly known in any transaction or tracked through that device. If the requirement is to trace who is travelling through the toll booth, that right is reserved by the access provider; nevertheless, there are instances when authentication should be separated from identification.

Direct Anonymous Attestation (DAA) is a security scheme proposed in 2004 that permits a device in a system to attest membership of a group while preserving the identity of the individual. Drafted by Ernie Brickell (Intel Corp), Jan Camenisch (IBM Research®), and Liqun Chen (HP Laboratories®), DAA is now approved by the Trusted Computing Group (TCG) as the recommended method for attestation of a remote device and is outlined in ISO/IEC 20008.

What is EPID?

Enhanced Privacy Identification (EPID) is Intel's implementation of ISO/IEC 20008, and it addresses two of the problems with the PKI security scheme: anonymity and membership revocation. Intel has included EPID keys in many of its processors, starting with the series 5 chipsets in 2008, which covers all Intel® Core™ processor family products. In 2016, Intel, as a certified EPID Key Generation Facility, announced that it had distributed over 4.5 billion EPID keys since 2008.

The first improvement over PKI is “Direct Anonymous Attestation,” the ability to authenticate a device for a given level of access while allowing the device to remain anonymous. This is accomplished by introducing a group-level membership authentication scheme. Instead of a 1:1 public-to-private key assignment for an individual user, EPID allows a group of member private keys to be associated together and linked to one group public key. This EPID group public key can be used to verify the signature produced by any EPID member private key in the group. Most importantly, no one, including the issuer, has any way to know the identity of the user. Only the member device has access to its private key, and it will validate only with a properly provisioned EPID signature.

The second security improvement that EPID provides is the ability to revoke an individual device by detecting a compromised signature or key. If the private key used by a device has been compromised or stolen, recognizing this within the EPID ecosystem allows the device to be revoked and prevents any future forgery. During a security exchange, the EPID protocol requires members to perform mathematical proofs showing that they could not have created any of the signatures flagged on a signature revocation list. This built-in revocation feature allows devices, or even entire groups of devices, to be instantly flagged for revocation and denied service. It allows anonymous devices to be revoked on the basis of a signature alone, which allows an issuer to ban a device from a group without ever knowing which device was banned.

Intel® EPID Roles

There are three roles in the EPID security ecosystem. First, the Issuer is the authority that assigns or issues EPID group IDs and keys to individual platforms (similar devices that should be grouped together from an access-level perspective). The issuer manages group membership and maintains current versions of all revocation lists. Using a newly generated private key for the group, the issuer generates one group public key and as many EPID member private keys as requested, all of which are paired with that one group public key. The Member role is an end device, representing one individual member in a group of many, all sharing the same level of access. Finally, the Verifier role serves as the gatekeeper, checking and verifying the EPID signatures generated by platforms to ensure they belong to the correct group. Using the EPID group public key, the verifier is able to validate the EPID signature of any member in the group with no knowledge of the member’s identity.


Figure 2 – EPID Roles

Issuer: Creates, stores, and distributes issuer-signed public certificates for groups. Creates, distributes, and then destroys private keys. Private keys are not retained by an issuer; they are held private by member devices in trusted storage such as a TPM 1.2 compliant device. Creates and maintains revocation lists.
Verifier: Challenges member verification requests using the EPID group public key and revocation lists. Identifies any member or group revocations.
Member: An end device for a particular platform. Protects private EPID keys in TPM 1.2 compliant storage. Signs messages when challenged.

Now that the roles in EPID have been discussed, let’s discuss the different security keys used in the EPID security scheme. First, for a given group of devices or platforms, the issuer generates the group public key and group private key simultaneously. The group private key has one purpose: the issuer uses it as the basis to create new member private keys, and for that reason the issuer keeps the group private key secret from all other parties. The EPID group public key is maintained by the issuer and verifier, and the private member keys are distributed to the device platforms before the issuer destroys its local versions.


Figure 3 - Using a unique key allocated for a group, an issuer creates one EPID group public key and as many EPID member private keys as requested.

Security Keys used in EPID

  • Issuer (private): CA root issuing authority private ECC key. Used to sign the EPID group public key and parameters; ensures trust all the way to the member.
  • Issuing CA (public): Issuing authority public ECC key. Provided to platform members to enable trust with a verifier and issuer.
  • Issuer (private): Group private key, one per group. Created by the issuer for a group; used to generate member private keys.
  • Group (public): EPID group public key, generated by the issuer. Provided to platform devices during provisioning upon request; used by verifiers to validate EPID member signatures.
  • Member (private): EPID member private key, a unique private key for each device; can be fused into silicon and must be secured. Generated by the issuer using the group private key; stored securely or embedded/fused into silicon as a golden key ready for provisioning into the final EPID key. Used to create valid EPID signatures that can be verified using the paired EPID group public key.

The Intel® EPID scheme works with three types of keys: the group public key, the issuing private key, and the member private key. A group public key corresponds to the unique member private keys that are part of the group. Member private keys are generated from the issuing private key, which always remains secret and known only to the issuer.

To ensure that material generated by the issuer is authentic, another level of security is added: the issuing CA certificate. The CA (Certificate Authority) public key contains the ECDSA public key of the issuing CA. The verifier uses this key to authenticate that information provided by the issuer is genuine.

Intel® EPID Signature

An Intel® EPID signature is created using the following parameters:

  • Member private key
  • Group public key
  • Message to be signed
  • Signature revocation proof list (to prove that it did not create any signatures that were flagged for revocation in the past)

An Intel® EPID signature is verified using the following parameters:

  • Member’s signature
  • CA certificate (to certify authenticity of issuer material before it is used)
  • Group public key
  • Group revocation list
  • Member private key revocation list
  • Signature revocation list

Intel® EPID Process Flows

Embedding

By including the Intel® EPID key in the manufacturing process for a device, a part can be identified as genuine after deployment into the field without any human intervention. This saves time and improves security by not distributing any private keys or requiring any interaction with the end user. Sequence 1 shows a vendor of a hardware device initiating a request with an Intel® EPID issuer and ultimately deploying the generated Intel® EPID member keys with the device. The process starts with the vendor requesting to join the ecosystem managed by the issuer; put another way, this member is choosing the issuer as its Key Generation Facility. When a new member requests to join, the issuer first generates a set of platform keys, which are held private and used to generate one group public key and one or more member private keys. The member private keys are deployed securely with each device and are not known by anyone else, including the issuer, who does not retain any of the private keys. The Intel® EPID group public key is stored with the issuer and distributed to verifiers upon request.


Sequence 1 – Intel® EPID key request and distribution process

For products supporting Intel® EPID, Intel fuses a 512-bit number directly into a submodule of the processor called the Management Engine. This Intel® EPID private member key is encoded with an Intel® EPID group ID that uniquely identifies the device as part of the group. As an issuer and verifier, Intel maintains public certificates for each of the devices encoded with Intel® EPID keys. The private member keys require the same level of protection as a standard PKI private key: access to the private key can only be achieved using an application signed by an RSA security certificate whose root of trust is Intel. Other silicon manufacturers can follow a similar process, allowing only their corporations’ trusted applications to access the private key on their products.

Provisioning

After deployment into the field, a device is not ready to use Intel® EPID out of the box. Before it can be brought to life, it must follow a process called provisioning, which allows it to attest its authenticity using a valid Intel® EPID signature in all future transactions. Sequence 2 shows a possible provisioning process for the first boot of an IoT device that uses Intel® EPID. Once granted access to the Internet, a device can call home to state that it is online and also check for software updates.

Before granting access, however, the provider answering the call must ensure that the device is authentic. In a typical onboarding scenario, a verifier sends the member device a request for its provisioning status. If the device is not already provisioned (that is, it has not previously been authenticated), it can complete provisioning by requesting the public Intel® EPID group key from the verifier. The member device then stores both the private and public Intel® EPID keys in secure storage, and is able to sign Intel® EPID signatures as well as reply to provisioning status challenges.


SEQUENCE 2 – Intel® EPID Provisioning Flow

Revocation

Because the Intel® EPID security scheme allows for anonymous, group membership attestation, it must also provide the ability to reject or decommission members or groups at any time.  Intel® EPID supports revocation at the membership level through identification of an Intel® EPID member signature, or if known, the private member key. 

In addition, Intel® EPID supports revocation of an entire group, which revokes access for all devices in that group. In one typical use case, shown in Sequence 3, member revocation is initiated by a verifier or an issuer; however, only the issuer can actually revoke an Intel® EPID member or group. Once a group is revoked, verifiers no longer reference any signature- or key-based revocations for that group, meaning it is simply ignored.

The Intel® EPID protocol exchanged between member, verifier, and issuer contains revocation lists, which can grow over time for a platform group with many compromised members. An increase in revocations comes with a linear performance decrease, meaning it takes progressively longer to validate every member in the chain. One solution an issuer can pursue when this occurs is to create a new group, move the uncompromised members into it, and then revoke the old group.


SEQUENCE 3 – Verifier submits request to Issuer to revoke a member or group

Summary of Revocation Lists
PRIV-RL – the private member key is known
SIG-RL – the member key has not been recovered, but a signature made with it is known
GROUP-RL – the entire group should be revoked
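
The following toy Python sketch shows only how a verifier consults the three lists; in the real scheme these checks are cryptographic (for SIG-RL, the member must produce non-revocation proofs), not simple membership tests:

def is_revoked(group_id, member_key, signature, group_rl, priv_rl, sig_rl):
    if group_id in group_rl:    # GROUP-RL: the entire group is revoked
        return True
    if member_key in priv_rl:   # PRIV-RL: this private key is known compromised
        return True
    return signature in sig_rl  # SIG-RL: only a bad signature is known

# Example: the signature "sig-1" has been flagged, so this reports True.
print(is_revoked("grp-7", "key-A", "sig-1",
                 group_rl={"grp-9"}, priv_rl={"key-B"}, sig_rl={"sig-1"}))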

While members normally exchange signatures with verifiers, communication also occurs directly with the issuer. The join protocol between a member device and an issuer supports transporting a valid Intel® EPID private key to the device. This can be used to replace a compromised key or for remote provisioning when the member does not hold a key. A secure, trusted transport mechanism for the key is assumed and is outside the scope of the protocol.

Intel® EPID Use Cases

A perfect example usage of Intel® EPID is to prove that a hardware device is genuine.  After deployment from a manufacturer, it is important for a device to have the ability to truthfully identify itself during software updates or requesting access to a system.  Once authorized, the device is then said to be genuine and a valid member of a group while still remaining anonymous. 

Another example relates to digital streaming content. Digital Rights Management (DRM) currently uses Intel® EPID to ensure that a remote hardware device is secure before data is streamed to it. This process ensures that the hardware player streaming the content is authentic. Intel® Insider™ technology, which protects digital movie content delivered by service providers, only works on clients that also support Intel® Insider™. This gives content providers a level of trust that their content cannot be copied simply by viewing it on the device. There is no disruption to current services; the only impact is to those trying to pirate digital content protected by Intel® Insider™.

Intel® Insider™
http://blogs.intel.com/technology/2011/01/intel_insider_-_what_is_it_no/

Intel® Identity Protection Technology  with One Time Password (OTP) also uses Intel® EPID keys to implement a two factor authentication method that enhances security beyond a simple username/password.

One time password
https://www.intel.com/content/www/us/en/architecture-and-technology/identity-protection/one-time-password.html

SGX – Software Guard Extensions on Intel® products allow applications to run in a trusted, protected area of memory allocated as an ‘enclave,’ preventing any outside access to the application memory or execution space.

SGX
https://software.intel.com/en-us/sgx

Silicon providers such as Microchip* and Cypress Semiconductor* are now implementing Intel® EPID into their products as well.

Microchip announces plans for implementing Intel® EPID
http://download.intel.com/newsroom/kits/idf/2015_fall/pdfs/Intel_EPID_Fact_Sheet.pdf

Intel Products offering Intel® EPID

Beginning with the release of the series 5 chipsets, EPID keys have been fused into and deployed with all products based on series 5 and newer chipsets. For more information on which products are supported, visit the ARK at http://ark.intel.com/#@ConsumerChipsets

Intel® EPID SDK – Member and Verifier APIs

The Intel® EPID SDK is an open source library that supports both member and verifier Intel® EPID tasks. It does not include any issuer APIs, which means it is not meant to create EPID keys. The SDK comes with documentation and examples for signing and verifying messages using included sample issuer material, which in a real system would be generated by the issuer (the public group Intel® EPID key, the private member Intel® EPID key, and additional information such as the revocation lists). Verifier APIs do exist for populating a special kind of signature revocation list known as the verifier blacklist; however, that list can only be populated if members opt in to being tracked, and only the issuer can create revocation lists that apply to the entire group.

First steps with Intel® EPID

To get started, download the latest Intel® EPID SDK, and begin by reading the documentation included in the doc subfolder with each distribution.  https://01.org/epid-sdk/downloads

After building the SDK, navigate to the _install\epid-sdk\example folder and try the included examples for signing and verifying signatures. The folder contents, shown below, include the sample private key, issuer certificates, and revocation lists required to complete verifications. The files are well named, making it easy to know their contents.


Figure 4 – Directory listing of the Intel® EPID 4.0.0 SDK

Intel® EPID Member Example

Create a digital signature using the sample Intel® EPID member private key, groupID, and a sample text string of any content.

signmsg.exe --msg="TEST TEXT BLOB"

The signmsg command outputs a signature file (./sig.dat) whose contents can only be verified using the matching Intel® EPID public key and the message that was signed. Regardless of what initiates or triggers the verification process, the verifier and member have to use the same message parameter for verification to succeed.

Intel® EPID Verifier Example

Creating and validating signatures requires that both ends (member and verifier) use the same message, hashing algorithm, basename, and signature revocation lists. A change to any of these will result in a signature verification failure. During a validation flow, the verifier may send a text message for the member to sign.

Verify a digital signature using the SDK with the same message.

verifysig --msg="TEST TEXT BLOB"


Figure 5 – Console sign and verify success

If not specified, the SDK will use default values for the hashing algorithm.

If a different message or hashing algorithm is used, the verification will fail.


Figure 6 – Console sign and verify failure

The executables included with the Intel® EPID SDK examples are intended only for quick validation or integration tests of signatures, and to demonstrate basic member and verifier capability.  A developer wanting to implement member or verifier functions would start by taking a look at the included documentation, which includes both an API reference and sample walkthroughs for signing and verifying in Intel® EPID.


Figure 7 – Intel® EPID SDK Documentation

The Intel® EPID SDK is constantly improving with each release, aligning to the newest Intel® EPID standards and providing optimizations for hashing algorithms using Intel® Integrated Performance Primitives (Intel® IPP).

How to implement Intel® EPID

OEMs and ODMs can take advantage of the fact that Intel® EPID keys are available on all Intel® products that include series 5+ firmware. The Intel® EPID SDK can be used to create the platform code that will run on the device; however, that code can only execute on a platform device in a secured, trusted environment signed by Intel. Only a signed application running in the ME secure firmware can access the Intel® EPID key for the purpose of provisioning. OEMs/ODMs can work with an Intel representative for guidance on how to enable Intel® EPID on an existing Intel® product that supports it.

Other silicon manufacturers are following suit and adopting Intel® EPID technology. Both Cypress Semiconductor and Microchip are starting to ship products with embedded Intel® EPID member keys as well. This means an Intel® EPID ecosystem can be deployed regardless of whether the silicon is Intel®; adhering to the rules of the Intel® EPID security scheme is what permits a device to take advantage of the Intel® EPID features.

Visit the Intel® EPID SDK deployment site for more documentation and API walkthroughs for signing and verifying messages: https://01.org/epid-sdk/ .

If you are interested in implementing Intel® EPID into your products, or to join our Zero Touch Onboarding POC, start by emailing iotonboarding@intel.com

If you would like to use Intel’s Key Generation Facility to act as an Intel® EPID issuer for creation of Intel® EPID keys, please start by contacting iotonboarding@intel.com.

Quick Facts

  • Intel has issued over 4 billion Intel® EPID keys since the release of the Series 5 chipset in 2008
  • Devices in an Intel® EPID Ecosystem are allowed to authenticate anonymously using only a Group ID
  • Intel® EPID is Intel’s implementation of Direct Anonymous Attestation
  • Intel® EPID supports revoking devices based on Private Key, Intel® EPID Signature, or an entire Group
  • Silicon providers can create their own Intel® EPID ecosystem
  • OEM/ODMs can use Intel® EPID compliant silicon devices to provide quick and secure provisioning
  • Intel® products include an embedded true random number generator – providing quicker, more secure seed values for hashing algorithms. (The SDK requires a secure random number generator to be used in any implementation of Intel® EPID.)

Summary

In this article, we discussed an Intel® security scheme called Intel® EPID that allows devices to attest membership of a group without being individually identified. Intel® Enhanced Privacy Identification technology 2.0 enhances direct anonymous attestation by providing member revocation based on member or group signatures. Choosing Intel products allows OEMs/ODMs and ISVs to take advantage of built-in security keys already available in numerous Intel product families. Silicon providers can also take advantage of Intel® EPID technology by embedding private keys directly into their hardware and running their own Intel® EPID ecosystem. With a predicted 50 to 100 billion connected IoT devices by 2020, security and device authenticity should be imperative for both manufacturers and end users.

A very special thanks to the members of the Intel® EPID SDK team for taking time to answer questions on Intel® EPID and the Intel® EPID SDK.

Terminology

AES-NI: AES New Instructions, a hardware-embedded feature available in most newer Intel® products.
AIK: Attestation Identity Key
AMT: Active Management Technology, supports out-of-band remote access.
Anonymity: A property that allows a device to avoid being uniquely identified or tracked.
Attestation: A process by which a user or device guarantees they are who they say they are.
Certificate: An electronic document issued by a third-party trusted authority (issuer) that verifies the validity of a public key. The contents include a subject and a verifiable signature from the issuer, which adds an additional layer of trust around the contents.
DAA: Direct Anonymous Attestation
DER: Certificate file format, Distinguished Encoding Rules
ECC: Elliptic Curve Cryptography
EPID: Enhanced Privacy Identification
EPID key: A private key held by an individual member and not shared with anyone. It is used to create a valid Intel® EPID signature that can be verified using the matching Intel® EPID group public key.
iKGF: Intel® Key Generation Facility
Intel SCS: Setup and Configuration Software, used to access AMT capabilities.
ISM: Intel® Standard Manageability
ISO/IEC 20008-2:2013: ISO standard for anonymous digital signature security mechanisms https://www.iso.org/obp/ui/#iso:std:iso-iec:20008:-2:ed-1:v1:en
ME: Intel® Management Engine, sometimes also called the Security and Management Engine
ODM: Original Device Manufacturer
OEM: Original Equipment Manufacturer
PEM: Certificate file format, Privacy Enhanced Mail
PKE: Public Key Encryption
PKI: Public Key Infrastructure
Platform: A piece of hardware or a device.
Private Key: A key owned by an individual or device, held private and never shared with anyone. It is most commonly used to encrypt a message into cipher text that can only be opened using the matching public key.
Public Key: A key provided to the public that will only decrypt a document encrypted using the matching private key.
SBT: Small Business Technology
Secure Key: A text string that matches the output of a defined algorithm and allows plain text to be transformed into cipher text or vice versa.
SIGMA: SIGn and Message Authentication, a protocol from Intel for two-way platform-to-verifier authentication.
X.509: ITU-T standard for certificate format and content

About the Author

Matt Chandler is a senior software and applications engineer who has been with Intel since 2004. He is currently working on scale-enabling projects for the Internet of Things, including ISV support for the smart buildings, device security, and retail digital signage vertical segments.

References

Intel® EPID White Paper

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-epid-white-paper.pdf

NIST PEC Workshop, December 2011

http://csrc.nist.gov/groups/ST/PEC2011/presentations2011/brickell.pdf

Wikipedia References on Security

https://en.wikipedia.org/wiki/Direct_Anonymous_Attestation
https://en.wikipedia.org/wiki/Public-key_cryptography
https://en.wikipedia.org/wiki/Public_key_infrastructure
https://en.wikipedia.org/wiki/Public_key_certificate

ACM conference 2004, “Direct Anonymous Attestation”

https://eprint.iacr.org/2004/205.pdf

Platform Embedded Security Technology Revealed

http://www.apress.com/us/book/9781430265719

Wikipedia Image license for PKI process

https://en.wikipedia.org/wiki/Public_key_infrastructure#/media/File:Public-Key-Infrastructure.svg
https://creativecommons.org/licenses/by-sa/3.0/

How to find the host ID for floating licenses


The floating and named-user licenses for the Intel® Parallel Studio XE and Intel® System Studio products require that you provide host name and host ID information for the host computer on which you install the associated license file. To obtain the required license file, these unique values must be available when you register your product. Refer to the information below for help identifying the host name and host ID (i.e., physical address) on supported platforms.

Before registering your product and generating the license file, you should be familiar with the different license types and how they are used with the Intel® Software Development Products. License types supported are:

  • Floating (counted)
  • Named-user (uncounted)

Only counted licenses require use of the Intel® Software License Manager software on the host computer.

In this context the host computer is known as the “license server”. The “host ID” in this context, depending on terminology used for your host operating system, is a 12-character Physical (Ethernet adapter) Address or hardware (Ethernet) address.

The host name and host ID (i.e. Physical Address) are system-level identifiers on supported platforms used to generate the license file with specific host information for use only on the specified floating license server.

When entering the physical or hardware address as prompted by the Intel® Registration Center (IRC), enter the 12-character value only and exclude all hyphens ("-") and colons (":"). For example, a host ID value of 30-8D-99-12-E4-87 should be entered as: 308D9912E487
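
On Linux* or OS X*, a shell one-liner can strip the separators for you (the value shown is the sample above):

$ echo "30-8D-99-12-E4-87" | tr -d ':-'
308D9912E487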

This article pertains specifically to floating licenses. Please refer to How to find the Host ID for the Named-user license for information about the named-user license.

Refer to the Intel® Software License Manager FAQ for additional details including information about downloading the software license server software.

Identifying the hostname and host ID

Microsoft Windows*
---------------------------

1. Launch a Command Prompt.
   (Tip: Multiple methods exist for starting a Command Prompt; a few include:
         Windows 7*: Open the Start Menu and go to All Programs -> Accessories. Locate and use the Command Prompt shortcut.
         Windows 8*: Open the Start screen. Click or tap on All apps and scroll right to locate the Windows System folder. Locate and use the Command Prompt shortcut.
         Windows 10*: Open the Start Menu and go to All apps -> Windows System. Locate and use the Command Prompt shortcut.)

On all systems you may use the hostname/getmac commands as demonstrated below:

2. Use hostname at the command prompt to display the host name.
3. Use getmac /v at the command prompt to display the host ID.

In the resulting output below, the hostname is my-computer and the host ID is 30-8D-99-12-E4-87
(i.e. the value corresponding to the Physical Address for the Ethernet Network Adapter)

C:\> hostname
my-computer

C:\> getmac /v

Connection Name Network Adapter Physical Address    Transport Name
=============== =============== =================== ========================
Ethernet        Intel(R) Ethern 30-8D-99-12-E4-87   \Device\Tcpip_{1B304A28-
Wi-Fi           Intel(R) Dual B 34-02-86-7E-16-61   Media disconnected


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use getmac /v to identify the host ID (i.e. Physical Address) associated with the Ethernet Network Adapter.

In the resulting output below, the hostname is my-computer and the host ID is 308d9912e487
(i.e. the value corresponding to the Physical Address for the Ethernet Network Adapter)

C:\> cd "C:\Program Files (x86)\Common Files\Intel\LicenseServer\"
C:\> .\lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-computer"

C:\> .\lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""3402867e1661 308d9912e487""
Only use ONE from the list of hostids.


Linux*
--------

On all systems you may use the hostname/ifconfig commands as demonstrated below:

1. Use the command hostname to display the host name.
2. Use the command /sbin/ifconfig eth0 to display the HWaddr (i.e. hardware address) for the Ethernet adapter. On some systems it may be necessary to use: /sbin/ifconfig | grep eth

In the (partial) resulting output below, the hostname is my-othercomputer and the host ID is 00:1E:67:34:EF:18
(i.e. the value corresponding to the hardware (Ethernet) address)

$ hostname
my-othercomputer

$ /sbin/ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1E:67:34:EF:18
          inet addr:10.25.234.110  Bcast:10.25.234.255  Mask:255.255.255.0
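
On newer distributions that no longer ship ifconfig by default, the ip utility reports the same hardware address; eth0 below is an assumed interface name and may differ on your system:

$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP mode DEFAULT qlen 1000
    link/ether 00:1e:67:34:ef:18 brd ff:ff:ff:ff:ff:ff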


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use the ifconfig command to identify the HWaddr (i.e. hardware address) for the Ethernet adapter.

In the resulting output below, the hostname is my-othercomputer and the host ID is 001e6734ef18
(i.e. the value corresponding to the hardware (Ethernet) address)

$  lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-othercomputer"

$  lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""001e6734ef18 001e6734ef19""
Only use ONE from the list of hostids.


OS X*
-------

On all systems you may use the hostname/ifconfig commands as demonstrated below:

1. Use the command hostname  to display the host name.
2. Run the command /sbin/ifconfig en0 to display the ether (i.e. hardware address) for the Ethernet adapter.

In the (partial) resulting output below, the hostname is my-macmini and the host ID is 40:6c:8f:1f:b8:57
(i.e. the value corresponding to the hardware (Ethernet) address)

$ hostname
my-macmini

$ /sbin/ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
        ether 40:6c:8f:1f:b8:57
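
If you are unsure which interface corresponds to the built-in Ethernet port, the networksetup utility (an alternative to the steps above; output abbreviated) lists each hardware port with its address:

$ networksetup -listallhardwareports
Hardware Port: Ethernet
Device: en0
Ethernet Address: 40:6c:8f:1f:b8:57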


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use the ifconfig command to identify the ether (i.e. hardware address) for the Ethernet adapter.

In the resulting output below, the hostname is my-macmini and the host ID is 406c8f1fb857
(i.e. the value corresponding to the hardware (Ethernet) address)

$  lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-macmini"

$  lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""7073cbc3edd9 406c8f1fb857""
Only use ONE from the list of hostids.


Refer to the Software EULA for additional details on the Floating license.

For CoFluent products: Please refer to product documentation for instructions on how to find your composite host ID for node-locked and floating licenses.

Object Classification Using CNN Across Intel® Architecture


Abstract

In this work, we present the computational performance and classification accuracy of object classification using the VGG16 network on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The results can be used as criteria for iteration-selection optimization in different experimental setups using these processors, as well as in multinode architectures. With the objective of evaluating accuracy for real-time logo detection from video, the experiments use a logo image dataset suitable for measuring the classification accuracy of the logos.

1. Introduction

Deep learning (DL), which refers to a class of neural network models with deep architectures, forms an important and expressive family of machine learning (ML) models. Modern deep learning models, such as convolutional neural networks (CNNs), have achieved notable successes in a wide spectrum of machine learning tasks including speech recognition [1], visual recognition [2], and language understanding [3]. The explosive success and rapid adoption of CNNs by the research community is largely attributable to high-performance computing hardware such as the Intel® Xeon® processor, Intel® Xeon Phi™ processor, and graphics processing units (GPUs), as well as a wide range of easy-to-use open source frameworks including Caffe*, TensorFlow*, the cognitive toolkit (CNTK*), Torch*, and so on.

2. Setting up a Multinode Cluster

The Intel® Distribution for Caffe* is designed for both single node and multinode operation. There are two general approaches to parallelization (data parallelism and model parallelism), and Intel uses data parallelism.

Data parallelism is when you use the same model for every thread, but feed it with different data. It means that the total batch size in a single iteration is equal to the sum of individual batch sizes of all nodes. For example, a network is trained on three nodes. All of them have a batch size of 64. The (total) batch size in a single iteration of the stochastic gradient descent algorithm is 3*64=192. Model parallelism means using the same data across all nodes, but each node is responsible for estimating different parameters. The nodes then exchange their estimates with each other to come up with the right estimate for all parameters.

To set up a multinode cluster, download and install the Intel® Machine Learning Scaling Library (Intel® MLSL) 2017 package from https://github.com/01org/MLSL/releases/tag/v2017-Preview, source mlslvars.sh, and then recompile Caffe with USE_MLSL := 1 set in Makefile.config. When the build completes successfully, start the Caffe training using the message passing interface (MPI) command as follows:

mpirun -n 3 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \
  --solver=models/bvlc_googlenet/solver_client.prototxt --engine=MKL2017

where n defines the total number of MPI processes and ppn the number of processes per node; with -n 3 and -ppn 1, the job runs on the first three hosts listed in the machine file. The nodes are configured in ~/mpd.hosts with their respective IP addresses as follows:

192.161.32.1
192.161.32.2
192.161.32.3
192.161.32.4
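
Putting these steps together, a minimal end-to-end sequence might look like the following sketch (the mlslvars.sh install path and the solver file are assumptions for illustration; the mpirun line mirrors the command above):

$ source /opt/intel/mlsl_2017/intel64/bin/mlslvars.sh   # install path is an assumption
$ # edit Makefile.config so that it sets: USE_MLSL := 1
$ make clean && make -j all
$ mpirun -n 3 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \
    --solver=models/bvlc_googlenet/solver_client.prototxt --engine=MKL2017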

Ansible* scripts are used to copy the binaries or files across the nodes.

Cluster communication employs Intel® Omni-Path Architecture (Intel® OPA) [4].

The cluster setup is validated by running the command 'opainfo' on all machines; the port state must always be 'Active'.

Figure 1: Intel® Omni-Path Architecture (Intel® OPA) cluster information.
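
A representative spot check on one node might look like the following sketch (output abbreviated; the exact field layout varies with the Intel® OPA software version):

$ opainfo | grep -i portstate
   PortState:     Active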

3. Experiments

The current experiment focuses on measuring the performance of the VGG16 network on the Flickr* logo dataset, which has 32 different logo classes. Intel® Optimized Technical Preview for Multinode Caffe* is used for the single-node experiments, with Intel® MLSL enabled for the multinode experiments. The input images were all converted to lightning memory-mapped database (LMDB) format for better efficiency; a conversion sketch is shown below. All of the experiments are set to run for 10K iterations, and the observations are noted below. We conducted our experiments on the following machine configurations. Due to lack of time we had to limit our experiments to a single execution per architecture.
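
For reference, the LMDB conversion mentioned above is typically performed with Caffe's convert_imageset tool. A minimal sketch, assuming 224 x 224 inputs and a train.txt file listing image/label pairs (the paths and file names are placeholders):

$ GLOG_logtostderr=1 ./build/tools/convert_imageset --backend=lmdb \
    --resize_height=224 --resize_width=224 --shuffle \
    /path/to/flickr_logos/ train.txt flickr_train_lmdb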

Intel Xeon Phi processor

  • Model Name: Intel® Xeon Phi™ processor 7250 @ 1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS* 7.3

Intel Xeon processor

  • Model Name: Intel® Xeon® processor E5-2699 v4 @ 2.20GHz
  • Core(s) Per Socket: 22
  • RAM (free): 123 GB
  • OS: Ubuntu* 16.1

The multinode cluster setup is configured as follows:

KNL 01 (Master)

  • Model Name: Intel® Xeon Phi™ processor 7250 @ 1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

KNL 03 (Slave node)

  • Model Name: Intel Xeon Phi processor 7250 @ 1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

KNL 04 (Slave node)

  • Model Name: Intel Xeon Phi processor 7250 @ 1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

3.1. Training Data

The training and test image datasets were obtained from Datasets: FlickrLogos32 / FlickrLogos47, which is maintained by the Multimedia Computing and Computer Vision Lab, Augsburg University. There are 32 logo classes or brands in the dataset, which are downloaded from Flickr, as illustrated in the following figure:

Figure 2: Flickr logo image dataset with 32 classes.

The 32 classes are as follows: Adidas*, Aldi*, Apple*, Becks*, BMW*, Carlsberg*, Chimay*, Coca-Cola*, Corona*, DHL*, Erdinger*, Esso*, Fedex*, Ferrari*, Ford*, Foster's*, Google*, Guinness*, Heineken*, HP*, Milka*, Nvidia*, Paulaner*, Pepsi*, Ritter Sport*, Shell*, Singha*, Starbucks*, Stella Artois*, Texaco*, Tsingtao*, and UPS*.

The dataset consists of 8240 images: 6000 are no_logo images, and 70 images per class for 32 classes make up the remaining 2240, making the dataset highly skewed. The training and test sets are split in a ratio of 90:10 from the full 8240 samples.

3.2. Model Building and Network Topology

The VGG16 network topology was used for our experiments. VGG16 has 16 weight layers (13 convolutional and 3 fully connected (FC) layers) and uses very small (3 x 3) convolution filters. It showed significant improvement in network performance and detection accuracy over prior art (winning first and second prizes in the ImageNet* challenge in 2014) and has since been widely used as a reference topology.
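
For a quick per-layer cost estimate of this topology on a given machine, Caffe's built-in timing mode can be used; a sketch (the prototxt path is an assumption, and the engine flag follows the convention used for training above):

$ ./build/tools/caffe time --model=models/vgg16/train_val.prototxt \
    --iterations=50 --engine=MKL2017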

4. Results

4.1 Observations on Intel® Xeon® Processor

The Intel Xeon processors are running under the following software configurations:

Caffe Version: 1.0.0-rc3

MKL Version: 2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 5.4.0

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.

Figure 3: Training loss variation with iterations (batch size 32, LR policy as POLY).

Figure 4: Accuracy variation with iterations (batch size 32, LR policy as POLY).

The following observations were noted while training for 10K iterations with a batch size of 64 and learning rate policy as POLY.

Figure 5: Training loss variation with iterations (batch size 64, LR policy as POLY).

Figure 6: Accuracy variation with iterations (batch size 64, LR policy as POLY).

The real-time training and test observations using different batch sizes for the Intel Xeon processor are depicted in Table 1. Table 2 shows how accuracy varies with batch size.

Table 1: Real-time training results for Intel® Xeon® processor.

Batch Size | LR Policy | Start Time | End Time | Duration | Loss    | Accuracy at Top 1 | Accuracy at Top 5
32         | POLY      | 18:20      | 23:46    | 5:26     | 0.00016 | 0.62              | 0.84
64         | POLY      | 16:20      | 9:57     | 17:37    | 0.00003 | 0.64              | 0.86
64         | STEP      | 16:41      | 6:37     | 13:56    | 0.0005  | 0.65              | 0.85

Table 2: Batch size versus accuracy details on the Intel® Xeon® processor.

           32 Batch Size                            64 Batch Size
Iterations | Accuracy@Top1 | Accuracy@Top5 | Iterations | Accuracy@Top1 | Accuracy@Top5
0          | 0             | 0             | 0          | 0             | 0
1000       | 0.165937      | 0.49125       | 1000       | 0.30375       | 0.6375
2000       | 0.374375      | 0.754062      | 2000       | 0.419844      | 0.785156
3000       | 0.446875      | 0.74125       | 3000       | 0.513906      | 0.803437
4000       | 0.50375       | 0.78625       | 4000       | 0.522812      | 0.838437
5000       | 0.484062      | 0.783437      | 5000       | 0.580781      | 0.848594
6000       | 0.549062      | 0.819062      | 6000       | 0.584531      | 0.843594
7000       | 0.553125      | 0.826563      | 7000       | 0.632969      | 0.847969
8000       | 0.615625      | 0.807187      | 8000       | 0.64375       | 0.84875
9000       | 0.607813      | 0.83          | 9000       | 0.624844      | 0.856406
10000      | 0.614567      | 0.83616       | 10000      | 0.641234      | 0.859877

4.2 Observations on Intel® Xeon Phi™ Processor

The Intel Xeon Phi processors are running under the following software configurations:

Caffe Version: 1.0.0-rc3

MKL Version: 2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 6.2

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.

Figure 7: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).

Figure 8: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).

Figure 9: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).

Figure 10: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).

Figure 11: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).

Figure 12: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).

Table 3: Batch size versus accuracy details for the Intel® Xeon Phi™ processor.

           32 Batch Size                            64 Batch Size
Iterations | Accuracy@Top1 | Accuracy@Top5 | Iterations | Accuracy@Top1 | Accuracy@Top5
0          | 0             | 0             | 0          | 0             | 0
1000       | 0.138125      | 0.427812      | 1000       | 0.200469      | 0.54875
2000       | 0.24          | 0.589688      | 2000       | 0.330781      | 0.678594
3000       | 0.295625      | 0.621875      | 3000       | 0.362188      | 0.68375
4000       | 0.295312      | 0.660312      | 4000       | 0.40625       | 0.708906
5000       | 0.337813      | 0.67          | 5000       | 0.437813      | 0.74625
6000       | 0.374687      | 0.71          | 6000       | 0.40625       | 0.723594
7000       | 0.335         | 0.6875        | 7000       | 0.432187      | 0.749219
8000       | 0.38375       | 0.692187      | 8000       | 0.455312      | 0.745781
9000       | 0.39625       | 0.70875       | 9000       | 0.455469      | 0.722969
10000      | 0.40131       | 0.713456      | 10000      | 0.469871      | 0.748901

           128 Batch Size
Iterations | Accuracy@Top1 | Accuracy@Top5
0          | 0             | 0
1000       | 0.272266      | 0.665156
2000       | 0.397422      | 0.696328
3000       | 0.432813      | 0.750234
4000       | 0.46          | 0.723437
5000       | 0.446328      | 0.776641
6000       | 0.432969      | 0.74125
7000       | 0.473203      | 0.75
8000       | 0.419688      | 0.700938
9000       | 0.455312      | 0.763281
10000      | 0.478901      | 0.798771

Table 4: Real-time training results for the Intel® Xeon Phi™ processor.

Batch Size | LR Policy | Start Time | End Time | Duration | Loss    | Accuracy at Top 1 | Accuracy at Top 5
32         | POLY      | 17:53      | 20:36    | 2:43     | 0.005   | 0.4               | 0.71
64         | POLY      | 10:59      | 16:07    | 6:08     | 0.00007 | 0.47              | 0.75
128        | POLY      | 18:00      | 4:19     | 10:19    | 0.00075 | 0.48              | 0.8

5. Conclusion and Future Work

We observed from Table 1 that a batch size of 32 was the optimal configuration in terms of speed and accuracy. Though there is a slight increase in accuracy with batch size 64, the gain seems quite low compared to the increase in training time. We also observed that the learning rate policies have a significant impact on training time and less impact on accuracy; perhaps the recalculation of the learning rate on every iteration slowed down the training. There is a minor gain in Top 5 accuracy with the POLY LR policy, which might be due to the optimal calculation of the learning rate. The gain might vary significantly on a larger dataset.

We observed from Table 3 that Intel Xeon Phi processor efficiency increases as the batch size is increased, and that the loss also decreases faster at larger batch sizes. Table 4 indicates that the higher batch sizes also run faster on Intel Xeon Phi processors.

The observations in the above tables indicate that training on Intel Xeon Phi machines is faster than the same training conducted on Intel Xeon machines, thanks to the bootable host processor that delivers massive parallelism and vectorization. However, the accuracy produced by Intel Xeon Phi processors is much lower than that produced by Intel Xeon processors for the same number of iterations, so more iterations must be run on Intel Xeon Phi processors to reach the same accuracy levels.

List of Abbreviations

Abbreviation | Expanded Form
MLSL         | Machine Learning Scaling Library
CNN          | convolutional neural network
GPU          | graphics processing unit
ML           | machine learning
CNTK         | cognitive toolkit
DL           | deep learning
LMDB         | lightning memory-mapped database

References and Links

1. Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M. L., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. Recent Advances in Deep Learning for Speech Research at Microsoft. In ICASSP (2013).

2. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS (2012).

3. Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In ICLRW (2013).

4. Cherlopalle, Deepthi; Weage, Joshua. Dell HPC Omni-Path Fabric: Supported Architecture and Application Study, June 2016.

More details on Intel Xeon Phi processor: Intel Xeon Phi Processor

Intel® Distribution for Caffe*: Manage Deep Learning Networks with Intel Distribution for Caffe

Multinode Guide: Guide to multi-node training with Intel® Distribution of Caffe*

Intel Omni Path Architecture Cluster Setup: Dell HPC Omni-Path Fabric: Supported Architecture and Application Study

Intel MLSL Package: Intel® MLSL 2017 Beta https://github.com/01org/MLSL/releases/tag/v2017-Beta
