
Intel® Quark™ SE Microcontroller C1000 Developer Kit - Accelerometer Tutorial


Intel® System Studio for Microcontrollers includes multiple sample applications to help you get up to speed with its basic functionality and become familiar with the Intel® Quark™ Microcontroller Software Interface (Intel® QMSI). This sample application reads and outputs accelerometer data to the serial port.

Requirements

Instructions

  1. Connect the USB cable to the developer board and the host PC:

    Connect USB Port

    Note: This is the USB port that is connected to the FTDI chip

  2. Launch the Intel® System Studio for Microcontrollers IDE.
  3. Create a project with the "Accelerometer" sample project file:

    • From the File menu, select New, and then select Intel Project for Microcontrollers.
    • Follow the “Create a new Project” screens

      • Developer board: Intel® Quark™ SE C1000 Developer Board
      • Project type: Intel® QMSI (1.1)
      • Core: Sensor Subsystem
      • Project name: Accelerometer (found in the Developer Board Sensors folder)
  4. Click Finish.

    Click Finish

    Note: The Accelerometer sample application runs on the Sensor Subsystem core, which doesn’t support the IPP library. Therefore it is not possible to build this sample application using the IPP library.

  5. Configure the Serial Terminal window to see the sensor output. This window will display your accelerometer sensor’s data via the UART interface over the serial cable.

    • At the bottom right-hand side of the screen, select the Serial Terminal pane and click the plus sign (+) icon to open a new serial terminal connection:

      Configure Serial Terminal

    • Ensure the correct serial port is selected; you can also click Custom configuration to modify the default serial connection settings:

      Open Serial Terminal

      Tip: The port will vary depending on the serial hardware used, and there may be more than one listed. There are a few ways to check your port:

      • Linux*: Use the ‘dmesg’ command to view your port status.
      • Windows*: Open Device Manager to view the Ports (COM & LPT) status.

        With either of these options, you can unplug the USB cable from your PC and reconnect it to see which COM port appears for the board.

    • Click OK and the connection to the Serial Terminal will be made. You should see status “Connected” in the serial console.

      Note: If you close the Serial Terminal window, you can open it again from:
      Window › Show View › Other › Intel ISSM › Serial Terminal

  6. Build and deploy your project.

    1. Select the "Accelerometer" project in the Project Explorer.
    2. Click the Build button to compile the project.
    3. From the Run drop-down list, select "Accelerometer (flashing)".
      Note: You can also deploy and debug. From the Debug drop-down list, select "Accelerometer (flashing)".

    4. You can now view X, Y, and Z values from the accelerometer in the Serial Terminal window.

      Serial Window Values

How it Works

The accelerometer sample uses the on-board Bosch* BMC150/BMI160 accelerometer, connected to the microcontroller over the I2C interface, and the RTC (real-time clock) integrated in the Intel® Quark™ microcontroller. It also uses the integrated UART module for data output over the serial port.

The sample begins in the main function by setting up the RTC parameters in an rtc configuration structure:

/* Configure the RTC and request the IRQ. */
rtc.init_val = 0;
rtc.alarm_en = true;
rtc.alarm_val = INTERVAL;
rtc.callback = accel_callback;
rtc.callback_data = NULL;

This configuration enables the RTC alarm, and sets accel_callback as the callback function for the RTC alarm. It is used to periodically print accelerometer data.

Next the code requests an interrupt for the RTC by using a QMSI API call, and also enables the RTC clock:

qm_irq_request(QM_IRQ_RTC_0, qm_rtc_isr_0);

/* Enable the RTC. */
clk_periph_enable(CLK_PERIPH_RTC_REGISTER | CLK_PERIPH_CLK);

After that it configures the accelerometer parameters depending on the accelerometer type (BMC150 or BMI160):

	/* Initialise the sensor config and set the mode. */
	bmx1xx_init(cfg);
	bmx1xx_accel_set_mode(BMX1XX_MODE_2G);

#if (BMC150_SENSOR)
	bmx1xx_set_bandwidth(BMC150_BANDWIDTH_64MS); /* Set the bandwidth. */
#elif(BMI160_SENSOR)
	bmx1xx_set_bandwidth(BMI160_BANDWIDTH_10MS); /* Set the bandwidth. */
#endif /* BMC150_SENSOR */

Next, the RTC configuration is applied, which enables the RTC alarm:

       /* Start the RTC. */
       qm_rtc_set_config(QM_RTC_0, &rtc);

A while loop is used to wait for the defined number of samples from the accelerometer to be read and printed to the serial console output:

       /* Wait for the correct number of samples to be read. */
       while (!complete)

Each time the 125-millisecond interval elapses and the RTC alarm triggers, the following accel_callback function is invoked. The accel data structure defined at the start of the function is passed into the bmx1xx_read_accel function, which populates it with the current accelerometer reading. If the read is successful, the accelerometer data is printed to the serial console output; otherwise, an error message is printed.

/* Accel callback will run every time the RTC alarm triggers. */
static void accel_callback(void *data)
{
       bmx1xx_accel_t accel = {0};

       if (0 == bmx1xx_read_accel(&accel)) {
              QM_PRINTF("x %d y %d z %d\n", accel.x, accel.y, accel.z);
       } else {
              QM_PUTS("Error: unable to read from sensor");
       }

The callback function then checks whether the defined number of samples has been read. If not, the RTC alarm is reset and the count is incremented; otherwise, the complete variable is set to true.

       /* Reset the RTC alarm to fire again if necessary. */
       if (cb_count < NUM_SAMPLES) {
              qm_rtc_set_alarm(QM_RTC_0,
                            (QM_RTC[QM_RTC_0].rtc_ccvr + INTERVAL));
              cb_count++;
       } else {
              complete = true;
       }

Note that the application by default reads 500 samples (NUM_SAMPLES) before exiting.

Finally, when the complete variable is set to true, the while loop exits, and the application prints a final statement to the serial console output and exits.

       QM_PUTS("Finished: Accelerometer example app");

       return 0;
}

Accelerometer Sample Application Code

/*
* {% copyright %}
*/

/*
* QMSI Accelerometer app example.
*
* This app will read the accelerometer data from the onboard BMC150/160 sensor
* and print it to the console every 125 milliseconds. The app will complete
* once it has read 500 samples.
*
* If the app is compiled with the Intel(R) Integrated Performance Primitives
* (IPP) library enabled, it will also print the Root Mean Square (RMS),
* variance and mean of the last 15 samples each time.
*/

#include <unistd.h>
#if (__IPP_ENABLED__)
#include <dsp.h>
#endif
#include "clk.h"
#include "qm_interrupt.h"
#include "qm_isr.h"
#include "qm_rtc.h"
#include "qm_uart.h"
#include "bmx1xx/bmx1xx.h"

#define INTERVAL (QM_RTC_ALARM_SECOND >> 3) /* 125 milliseconds. */
#define NUM_SAMPLES (500)
#if (__IPP_ENABLED__)
/* Number of samples to use to generate the statistics from. */
#define SAMPLES_SIZE (15)
#endif /* __IPP_ENABLED__ */

static volatile uint32_t cb_count = 0;
static volatile bool complete = false;

#if (__IPP_ENABLED__)
static float32_t samples[SAMPLES_SIZE];

static void print_axis_stats(int16_t value)
{
      static uint32_t index = 0;
      static uint32_t count = 0;
      float32_t mean, var, rms;

      /* Overwrite the oldest sample in the array. */
      samples[index] = value;
      /* Move the index on the next position, wrap around if necessary. */
      index = (index + 1) % SAMPLES_SIZE;

      /* Store number of samples until it reaches SAMPLES_SIZE. */
      count = count == SAMPLES_SIZE ? SAMPLES_SIZE : count + 1;

      /* Get the root mean square (RMS), variance and mean. */
      ippsq_rms_f32(samples, count, &rms);
      ippsq_var_f32(samples, count, &var);
      ippsq_mean_f32(samples, count, &mean);

      QM_PRINTF("rms %d var %d mean %d\n", (int)rms, (int)var, (int)mean);
}
#endif /* __IPP_ENABLE__ */

/* Accel callback will run every time the RTC alarm triggers. */
static void accel_callback(void *data)
{
      bmx1xx_accel_t accel = {0};

      if (0 == bmx1xx_read_accel(&accel)) {
            QM_PRINTF("x %d y %d z %d\n", accel.x, accel.y, accel.z);
      } else {
            QM_PUTS("Error: unable to read from sensor");
      }

#if (__IPP_ENABLED__)
      print_axis_stats(accel.z);
#endif /* __IPP_ENABLE__ */

      /* Reset the RTC alarm to fire again if necessary. */
      if (cb_count < NUM_SAMPLES) {
            qm_rtc_set_alarm(QM_RTC_0,
                        (QM_RTC[QM_RTC_0].rtc_ccvr + INTERVAL));
            cb_count++;
      } else {
            complete = true;
      }
}

int main(void)
{
      qm_rtc_config_t rtc;
      bmx1xx_setup_config_t cfg;

      QM_PUTS("Starting: Accelerometer example app");

      /* Configure the RTC and request the IRQ. */
      rtc.init_val = 0;
      rtc.alarm_en = true;
      rtc.alarm_val = INTERVAL;
      rtc.callback = accel_callback;
      rtc.callback_data = NULL;

      qm_irq_request(QM_IRQ_RTC_0, qm_rtc_isr_0);

      /* Enable the RTC. */
      clk_periph_enable(CLK_PERIPH_RTC_REGISTER | CLK_PERIPH_CLK);

#if (QUARK_D2000)
      cfg.pos = BMC150_J14_POS_0;
#endif /* QUARK_D2000 */

      /* Initialise the sensor config and set the mode. */
      bmx1xx_init(cfg);
      bmx1xx_accel_set_mode(BMX1XX_MODE_2G);

#if (BMC150_SENSOR)
      bmx1xx_set_bandwidth(BMC150_BANDWIDTH_64MS); /* Set the bandwidth. */
#elif(BMI160_SENSOR)
      bmx1xx_set_bandwidth(BMI160_BANDWIDTH_10MS); /* Set the bandwidth. */
#endif /* BMC150_SENSOR */

      /* Start the RTC. */
      qm_rtc_set_config(QM_RTC_0, &rtc);

      /* Wait for the correct number of samples to be read. */
      while (!complete)
            ;

      QM_PUTS("Finished: Accelerometer example app");

      return 0;
}

IoT Reference Implementation: How to Build a Transportation in a Box Solution


The Transportation in a Box solution is based on a previous Intel® IoT path-to-product connected transportation solution, which involved a comprehensive development process that ran from ideation and prototyping through productization. Taking advantage of that prior work, which was demonstrated at Intel® Developer Forum 2016, the Transportation in a Box solution demonstrates how path-to-product solutions can provide a point of departure that streamlines the development of IoT solutions. To read the full Intel® IoT Path-to-Product Transportation Case Study, check out The Making of a Connected Transportation Solution.

An Intel team based in Europe began work on the Transportation in a Box solution after identifying value in creating a variation on the existing path-to-product connected-transportation solution that was better suited to use at workshops, conferences, and other industry events. The team’s goal was to build a variation that was functionally similar to the scale truck model used in the previous solution but packaged to fit into a compact carrying case. The two solutions are shown side by side in Figure 1.

Transportation in a Box
Figure 1. The connected-transportation model compared to the Transportation in a Box solution.

The exercise in this document describes how to recreate the Transportation in a Box solution. It does not require special equipment or deep expertise, and it is meant to demonstrate how Intel® IoT path-to-product solutions can be adapted to the needs of specific project teams.

Visit GitHub* for this project’s latest code samples and documentation.

Introduction

The Transportation in a Box solution was developed using an Intel® IoT Gateway, the Grove* IoT Commercial Developer Kit, and the Intel® System Studio IoT Edition. It monitors the temperature within a truck’s refrigerated cargo area, as well as the open or closed status of the cargo doors. The gateway generates events based on changes to those statuses, to support end-user functionality on a tablet PC application.

From this exercise, developers will learn to do the following:

  • Set up the Dell iSeries* Wyse 3290 IoT Gateway, including installation of the OS, update of the MRAA* and UPM* libraries, and connection of the Arduino 101* (branded Genuino 101* outside the U.S.) board for connectivity to sensors.
  • Install the rest of the solution, including connection of sensors and other components, as well as cloning of the project software repository.
  • Add the program to Intel® System Studio IoT Edition, including creating a project and populating it with the files needed to build the solution.
  • Run the solution using the Eclipse* IDE or directly on the target platform.

What the Solution Does

The Transportation in a Box solution simulates the following parts of a transportation monitoring solution:

 

  • Door: The driver is notified of a change in door position (open or closed).
  • Thermometer: The inside temperature of the truck’s cargo area is monitored and that temperature data recorded.
  • Alarm: For temperatures above a specified threshold, the user interface plays an audible alarm. The alarm is cancelled by pressing the push button or when the temperature returns below the threshold.
  • Cooling fan: The fan cools the truck’s cargo. The fan’s behavior is tied to the door: when the door is open, the fan shuts off (allowing the temperature to rise); when the door is closed, the fan runs to keep the cargo area below the specified temperature threshold.
  • Display: Shows the system status, the temperature, and the door status.

How it Works

The solution operates based on sensor data that includes the open/closed status of the truck door, the temperature of the truck interior, and a number of events, including open and close actions of the door, changes in temperature, changes to settings in the temperature threshold, and trigger/stop events for the alarm. All data is forwarded to a web interface that can be used to monitor the status of the truck.

Set up the Dell iSeries* Wyse 3290 IoT Gateway

This section gives instructions for installing the Intel® IoT Gateway Software Suite on the Dell iSeries Wyse 3290 gateway.

Note: If you are on an Intel network, you need to set up a proxy server.

  1. Create an account on the Intel® IoT Platform Marketplace if you do not already have one.
  2. Order the Intel® IoT Gateway Software Suite, and then follow the instructions you will receive by email to download the image file.
  3. Unzip the archive, and then write the .img file to a 4 GB USB drive (for example, with the dd utility on Linux* or a disk-imaging tool on Windows*).
  4. Unplug the USB drive from your system, and then plug it into the Dell iSeries* Wyse 3290 gateway along with a monitor, keyboard, and power cable.
  5. Turn on the Dell iSeries Wyse 3290 Gateway, and then enter the BIOS by pressing F2 at boot time.
  6. Boot from the USB drive:
    1. On the Advanced tab, make sure Boot from USB is enabled.
    2. On the Boot tab, put the USB drive first in the order of the boot devices.
    3. Save the changes, and then reboot the system.
  7. Log in to the system with root:root.
  8. Install Wind River Linux* on local storage:

    ~# deploytool -d /dev/mmcblk0 --lvm 0 --reset-media -F

  9. Use the poweroff command to shut down your gateway, unplug the USB drive, and then turn your gateway back on to boot from the local storage device.
  10. Plug in an Ethernet cable, and then use the ifconfig eth0 command to find the IP address assigned to your gateway (assuming you have a proper network setup).
  11. Use the Intel® IoT Gateway Developer Hub to update the MRAA* and UPM* repositories to the latest versions from the official repository (https://01.org). You can achieve the same result by entering the following commands:

    ~# smart update

    ~# smart upgrade

    ~# smart install upm

  12. Connect the Arduino 101* board using the USB cable.
  13. Connect the Omega* RH-USB Temperature sensor to a USB port.

Install the Rest of the Solution 

This section gives instructions for the rest of the installation required for the solution, including connection of sensors and other components, as well as cloning of the project software repository.

  1. Connect the sensors and other components to the Dell iSeries Wyse 3290 IoT Gateway according to the connectivity schema that is provided in the bill of materials in Table 1.

    Table 1. Bill of materials and connectivity schema for the Transportation in a Box solution.

    Base System

    Component                       | Details                                | Connection
    Dell iSeries* Wyse 3290 gateway |                                        |
    Arduino* 101 board              | Sensor hub                             | USB
    USB Type A to Type B Cable      | Connects Arduino 101 board to gateway  |

    Sensors

    Component                                                           | Details            | Connection
    Omega* RH-USB                                                       | Temperature sensor | USB
    Grove* - Relay                                                      | Fan control        | D8
    Grove - LCD RGB Backlight                                           | Display stats      | I2C
    Magnetic Contact Switch                                             | Door sensor        | D3
    Peltier Thermo-Electric Cooler Module + Heatsink Assembly - 12V 5A  | Cooling fan        |
    Rugged Metal On/Off Switch with White LED Ring - 16mm White On/Off  | Acknowledge alarm  | D4


  2. Clone the Path to Product repository with Git* on your computer as follows:

    $ git clone https://github.com/intel-iot-devkit/reference-implementations.git

  3. Alternatively, to download a .zip file, go to the repository page in your web browser and click the Download ZIP button at the lower right. Once the .zip file is downloaded, uncompress it, and then use the files in the extracted directory for this example.

Add the Program to Intel® System Studio IoT Edition

Note: The screenshots in this section are from the alarm clock sample; however, the technique for adding the program is the same, just with different source files and jars.

  1. Open Intel® System Studio IoT Edition. It will start by asking for a workspace directory; choose one and then click OK.
  2. In Intel® System Studio IoT Edition, select File | New | Intel® IoT Java Project, as shown in Figure 2.

    New Java Project
    Figure 2. New Intel® IoT Java* Project.

  3. Give the project the name “Transportation Demo” as shown in Figure 3 and then click Next.

    Naming Project
    Figure 3. Naming the Intel® IoT project.

  4. Connect to the gateway from your computer to send code to it by choosing a name for the connection, entering the IP address of the gateway in the Target Name field, and clicking Finish, as shown in Figure 4.

    Note: You can also search for the gateway using the "Search Target" button. Click Finish when you are done.

    Target Connection
    Figure 4. Creating a target connection.

  5. The preceding steps will have created an empty project. Copy the source files and the config file to the project:

    • Drag all of the files from the git repository's src folder into the new project's src folder in Intel® System Studio IoT Edition.
    • Make sure the previously auto-generated main class is overwritten.
  6. The project uses the following external jars: commons-cli-1.3.1.jar, tomcat-embed-core.jar, and tomcat-embed-logging-juli. These can be found in the Maven Central Repository. Create a "jars" folder in the project's root directory, and copy all the needed jars into this folder. In Intel® System Studio IoT Edition, select all the jar files in the jars folder, right-click them, and select Build path | Add to build path, as shown in Figure 5.

    Adding Jars
    Figure 5. Adding project jars to the build path.

  7. Add the UPM* jar files relevant to this specific sample, as illustrated in Figure 6:

    1. Right click on the project's root and select Build path | Configure build path.
    2. Select Java Build Path.
    3. Select the Libraries tab.
    4. Click the Add external JARs... button.
    5. Add the following jars, which can be found at the IOT Devkit installation root path\iss-iot-win\devkit-x86\sysroots\i586-poky-linux\usr\lib\java:

      • upm_grove.jar

      • upm_i2clcd.jar

      • upm_rhusb.jar

      • mraa.jar

      External Jars
      Figure 6. Adding external jars to the build path.

  8. Copy the www folder to the home directory on the target platform using scp or WinSCP*. Create a new Run configuration in Eclipse* for the project's Java* application. Set the Main Class to com.intel.pathtoproduct.JavaONEDemoMulti on the Main tab. Then, on the Arguments tab, enter the following:

    -webapp <path/to/www/folder>

    Note: To run without an IDE, download the repo directly to the target platform and run the start.sh script.

Conclusion

As this how-to document demonstrates, IoT developers can build solutions at relatively low cost and without specialized skill sets. In particular, using an Intel® IoT Gateway and an Arduino 101* board, project teams can rapidly adapt existing Intel® IoT path-to-product solutions to address novel business needs.

More Information

Installing and Deploying SaffronArchitect 1.0


Installation and Deployment

Current installation and deployment information for the initial release of SaffronArchitect is located in the distribution folder. 

To extract the documentation:

1. Untar the distribution folder

    tar -xvf saffron-architect_1_0.tar.gz

2. Open the README file

    ./docs/README

Go to this page to learn more about future installation and deployment updates and document locations.

Related Links:

SaffronStreamline Issues and Defects Resolution (IDR) User Guide

SaffronArchitect User Guide

 

 

 

Using LibRealSense and PCL to Create Point Cloud Data


Download Source Code 

Table of Contents

Introduction
Assumptions
Software Requirements
Supported Cameras
Setting Up the Qt Creator Project
   CMake contents
The main.cpp Source Code File Contents
   Source code explained
      Overview
      main(...)
      createCloudPointViewer(...)
      printRSContextInfo(...)
      configureRSStreams(...)
      generatePointCloud(...)
Wrap Up

Introduction

In this article I will show you how to use LibRealSense and PCL to generate point cloud data and display that data in the PCL Viewer. This article assumes you have already downloaded and installed both LibRealSense and PCL, and have them set up properly in Ubuntu*. I am working on Ubuntu 16.04 with the Qt Creator* IDE; because the project uses CMake*, you can use whichever IDE you prefer.

Assumptions

In this article I assume that the reader:

  1. Is somewhat familiar with using Qt Creator. The reader should know how to open Qt Creator and create a brand new empty C++ project.
  2. Is familiar with C++.
  3. Knows how to get around Linux*.
  4. Knows what GitHub is and knows how to at least download a project from a GitHub repository.

In the end, you will have a nice starting point where you use this code base to create your own LibRealSense/PCL applications.

Software Requirements

Supported Cameras

  • Intel® RealSense™ camera R200

Setting Up the Qt Creator* Project

As mentioned, I’m going to assume that the reader is already familiar with opening Qt Creator and creating a brand new empty C++ project. The only thing you need to do, if you wish to follow along with exactly how I created my project, is to ensure you choose the CMake toolset.

CMake contents

I chose CMake because I’m creating a generic non-Qt project. An explanation of CMake is beyond the scope of this document. Just note that it includes the libraries and headers for PCL and LibRealSense. After you have created a Plain C++ Application in Qt Creator, you can open the CMakeLists.txt file that gets generated for you and replace its contents with this:

project(LRS_PCL)
cmake_minimum_required(VERSION 2.8)
aux_source_directory(. SRC_LIST)
add_executable(${PROJECT_NAME} ${SRC_LIST})
set(CMAKE_PREFIX_PATH $ENV{HOME}/Qt/5.5/gcc_64)

find_package( PCL 1.7 REQUIRED )
find_package( Qt5 REQUIRED COMPONENTS Widgets Core )

include( CheckCXXCompilerFlag )
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11" )

include_directories( ${PCL_INCLUDE_DIRS} /usr/local/include )
link_directories( ${PCL_LIBRARY_DIRS} /usr/local/lib /usr/lib/x86_64-linux-gnu )
add_definitions( ${PCL_DEFINITIONS} )

target_link_libraries( LRS_PCL ${PCL_LIBRARIES} /usr/local/lib/librealsense.so)

The main.cpp Source Code File Contents

The full source code for the example application is available from the Download Source Code link at the top of this article; the sections below walk through each function.

Source code explained

Overview

The structure is pretty simple. It’s a single source code file containing everything we need for the sample. We have our header includes at the top. Because this is a sample application, we are not going to worry too much about “best practices” in defensive software engineering. Yes, we could have better error checking; however, the goal here is to make this sample application as easy to read and comprehend as possible.

main(…)

Obviously, as the name implies, this is the main function. Even though main() accepts command-line parameters, we are not using them.

We start by creating a few variables that get used in main(). The frame and fps variables are there to help calculate the frames per second and how long the app has been running. Next, we create a pointer to a PCL PointXYZRGB object along with our own safe pointer to a PCLVisualizer.

At this point, we can generate the actual PCLVisualizer object. We do this in the function createCloudPointViewer by passing in the PointXYZRGB pointer, rsCloudPtr. This will return to us a newly created PCLVisualizer object that will display the point cloud generated from the Intel RealSense camera R200 data.

Now that we have the required PCL pieces created, we will create the RealSense functionality by starting with the RS context. We don’t NEED to print this context out, but I’m showing you how you can get the device count from the context, and if there is no device, how to take appropriate actions.

Next, we need to set up the Intel RealSense camera streams, which is done in the configureRSStreams function. Notice that we are sending in an rs::device, this is the Intel RealSense camera R200 device. Once PCL and the RealSense functionality have been set up and the camera is running, we can now start processing the data.

As mentioned above, the frame variables are for displaying frame rates and how many frames have been generated during the application’s run. Those are used next as we call printTimeLoop to display the fps feedback to the user.

The next mentionable process is calling generatePointCloud, passing in the Intel RealSense camera object and the pointer to the PointXYZRGB structure that holds the point cloud data. Once the point cloud has been generated, we can then display the cloud. This is done by calling the visualizer’s updatePointCloud method followed up by calling spinOnce. I’m telling spinOnce to run for 1 millisecond.

The loop ends and it starts over. Upon exit, we tell the PCLVisualizer to close itself and return out of main.

At this point, the app will then quit.
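To tie the walkthrough together, here is a condensed sketch of what main() can look like, assuming the legacy LibRealSense 1.x C++ API and the helper functions described in the sections below (their exact signatures are my assumptions, not copied from the downloadable source); the frame/fps bookkeeping and the printTimeLoop call are omitted for brevity.

#include <cstdlib>
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/visualization/pcl_visualizer.h>
#include <librealsense/rs.hpp>

// Helpers sketched in the sections that follow (signatures are assumed).
boost::shared_ptr<pcl::visualization::PCLVisualizer>
createCloudPointViewer(pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr cloud);
bool printRSContextInfo(rs::context *ctx);
void configureRSStreams(rs::device *dev);
void generatePointCloud(rs::device *dev, pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloud);

int main(int argc, char *argv[])
{
    // Cloud that is refilled from the camera on every frame.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr rsCloudPtr(new pcl::PointCloud<pcl::PointXYZRGB>);

    // Viewer that displays the cloud.
    boost::shared_ptr<pcl::visualization::PCLVisualizer> viewer = createCloudPointViewer(rsCloudPtr);

    // RealSense context and the first attached device.
    rs::context ctx;
    if (!printRSContextInfo(&ctx))
        return EXIT_FAILURE;

    rs::device *dev = ctx.get_device(0);
    configureRSStreams(dev);

    while (!viewer->wasStopped())
    {
        dev->wait_for_frames();                              // Block until new depth/color frames arrive.
        generatePointCloud(dev, rsCloudPtr);                 // Convert them into the PCL cloud.
        viewer->updatePointCloud<pcl::PointXYZRGB>(rsCloudPtr, "cloud");
        viewer->spinOnce(1);                                 // Let the visualizer render for ~1 ms.
    }

    viewer->close();
    return EXIT_SUCCESS;
}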

createCloudPointViewer(…)

This function sets up the PCLVisualizer object. We pass it a pointer to the PointXYZRGB structure, which will hold the cloud data.

The first thing we do is create a new PCLVisualizer object named viewer. From there we set its background color, add the PointXYZRGB cloud structure, and set its rendering properties to a point size of 1. For fun, try changing this value to see how it affects the rendering of the cloud. For example, change 1 to 5. (I changed it back to 1.)

Next, we need to specify a coordinate system via addCoordinateSystem. This causes the XYZ axes to be drawn in the visualizer. If you don’t want to see them, you can comment this function call out. I wish I could tell you what initCameraParameters is doing, but, to be honest, I can’t. Even the PCL documentation doesn’t have a solid explanation. If you can give me a detailed explanation, I’d appreciate it. The best I’ve found is from a point cloud tutorial on the visualizer, which states:

This final call sets up some handy camera parameters to make things look nice.

So, there you have it, right from PCL. ;) I won’t give them grief. I know it’s difficult to fully understand the ins and outs of every line of code. I know the pain.
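To make that concrete, here is a minimal sketch of such a viewer-setup function, assuming the signature used in the main() sketch above; it follows the standard PCL visualization tutorial rather than reproducing the downloadable source line for line.

#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/visualization/pcl_visualizer.h>

boost::shared_ptr<pcl::visualization::PCLVisualizer>
createCloudPointViewer(pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr cloud)
{
    // Create the visualizer window.
    boost::shared_ptr<pcl::visualization::PCLVisualizer> viewer(
        new pcl::visualization::PCLVisualizer("RealSense Point Cloud"));

    viewer->setBackgroundColor(0, 0, 0);    // Black background.

    // Add the (still empty) cloud, coloring each point from its RGB field.
    pcl::visualization::PointCloudColorHandlerRGBField<pcl::PointXYZRGB> rgb(cloud);
    viewer->addPointCloud<pcl::PointXYZRGB>(cloud, rgb, "cloud");

    // Point size of 1; try 5 here to see how it changes the rendering.
    viewer->setPointCloudRenderingProperties(
        pcl::visualization::PCL_VISUALIZER_POINT_SIZE, 1, "cloud");

    viewer->addCoordinateSystem(1.0);   // Draw the XYZ axes; comment out to hide them.
    viewer->initCameraParameters();     // Default camera parameters.

    return viewer;
}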

printRSContextInfo(…)

This is a very straightforward function. We simply display the number of Intel RealSense devices and ensure there is more than zero. If there is not, we know that LibRealSense was not able to detect the Intel RealSense camera R200, and we report an error; otherwise, the function returns success as true.
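A minimal sketch of that check, under the same assumptions as the earlier sketches, could look like this:

#include <iostream>
#include <librealsense/rs.hpp>

bool printRSContextInfo(rs::context *ctx)
{
    // Report how many RealSense devices LibRealSense can see.
    std::cout << "There are " << ctx->get_device_count()
              << " connected RealSense device(s)." << std::endl;

    if (ctx->get_device_count() == 0)
    {
        std::cerr << "Error: no Intel RealSense camera R200 detected." << std::endl;
        return false;
    }

    return true;
}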

configureRSStreams(…)

This is where we set up the Intel RealSense camera R200 for streaming. I show you how to get the camera’s name and display it in a std::cout statement. This output does not appear in the visualizer, only in the application output window of the IDE.

Next, two RealSense streams get enabled, a color stream and a depth stream, specifying that we want the best quality the camera can produce.

Once the streams are set up, we get the camera running by calling the camera’s start() function.
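A sketch of that stream setup, again assuming the LibRealSense 1.x C++ API and the signature from the main() sketch, could look like this:

#include <iostream>
#include <librealsense/rs.hpp>

void configureRSStreams(rs::device *dev)
{
    // The camera name appears in the IDE's application output window, not the visualizer.
    std::cout << "Configuring RealSense device: " << dev->get_name() << std::endl;

    // Enable the depth and color streams at the best quality the camera can produce.
    dev->enable_stream(rs::stream::depth, rs::preset::best_quality);
    dev->enable_stream(rs::stream::color, rs::preset::best_quality);

    dev->start();   // Start streaming.
}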

generatePointCloud(…)

Please note that this portion of the source code is derived from the LibRealSense sample cpp-tutorial-pointcloud.cpp; it demonstrates getting the above-mentioned color and depth data and creating unstructured OpenGL vertices as a point cloud. Knowledge of 3D math and geometry is assumed for the following explanation.

The function generatePointCloud() is at the heart of the sample as it demonstrates getting depth and color data, correlating them, and getting them into a PointXYZRGB point cloud structure in PCL.

The first thing that happens is that the pointers to the depth and color images are set, followed by getting the various metadata about the images from the intrinsics and extrinsics camera parameters. The point cloud object is then cleared and resized. This sets us up so we can loop through all the columns and rows of the depth and color images.

The algorithm iterates row by row (the outer loop over dy) and, within each row, column by column (the inner loop over dx).

Inside the inner loop, an index i is computed, linearizing the 2D coordinates into the address space of depth_image[]. An individual depth_value is then extracted by looking up depth_image[i].

A real-valued (float) representation of the depth_pixel is constructed by considering both dx and dy as real values. This is because we are idealizing the 2D coordinate system of the image space as a continuous 2D function for purposes of computation. A depth_in_meters is computed from the above-mentioned depth_value and a previously extracted scale value.

A depth_point is computed by applying the deprojection function call. The deprojection function takes as input a 2D pixel location on a stream's images, as well as a depth specified in meters, and maps it to a 3D point location within the stream's associated 3D coordinate space; in other words, it converts an image-space point into a corresponding point in 3D world space. Once a 3D depth point is calculated, a transform to the corresponding color values for the 3D point is applied. Given this color point in 3D space, the corresponding pixel coordinates in an image produced by the same camera are computed. Finally, a projection of the 3D color point back into 2D image coordinates is performed for eventual rejection tests against the image width and height bounds. When done, we set up logic to remove failure points that fall outside the intrinsic width and height bounds, as well as those that fail the NOISY flag.

Input of data into the PCL point cloud structure begins with a setup of the pointers to the xyz and RGB members of the point cloud element rs_cloud_ptr->points[i]. An adjustment to the point cloud coordinates is introduced to flip the coordinate system for the viewer into a Y-up, right-handed coordinate system.

Additional variables are introduced to enable arbitrary adjustments of the color data; for example, if tone or color needs adjusting due to camera color offsets.

At this stage of the function, the evaluation of the point cloud continues through an assessment of whether the point is valid or not. If it is not valid, the point is skipped and the iteration continues. However, if it is valid then the point is added to the point cloud as a PointXYZRGB point.

This process continues for each of the depth and color points in the image, and both the inner and outer loops of the function are repeated for every frame.
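The following condensed sketch shows the shape of that per-frame conversion, modeled on the LibRealSense cpp-tutorial-pointcloud.cpp logic described above rather than copied from the downloadable source. The real sample resizes the cloud and writes rs_cloud_ptr->points[i] in place and also checks the NOISY flag; this sketch simply skips invalid points and pushes valid ones back, and it assumes the color stream delivers 8-bit RGB data.

#include <cmath>
#include <cstdint>
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <librealsense/rs.hpp>

void generatePointCloud(rs::device *dev, pcl::PointCloud<pcl::PointXYZRGB>::Ptr rs_cloud_ptr)
{
    // Raw image pointers for the current frame.
    const uint16_t *depth_image = (const uint16_t *)dev->get_frame_data(rs::stream::depth);
    const uint8_t  *color_image = (const uint8_t  *)dev->get_frame_data(rs::stream::color);

    // Intrinsics/extrinsics used to correlate the two streams, plus the depth scale.
    rs::intrinsics depth_intrin   = dev->get_stream_intrinsics(rs::stream::depth);
    rs::extrinsics depth_to_color = dev->get_extrinsics(rs::stream::depth, rs::stream::color);
    rs::intrinsics color_intrin   = dev->get_stream_intrinsics(rs::stream::color);
    float scale = dev->get_depth_scale();

    rs_cloud_ptr->clear();

    for (int dy = 0; dy < depth_intrin.height; ++dy)        // Outer loop: dy (rows).
    {
        for (int dx = 0; dx < depth_intrin.width; ++dx)     // Inner loop: dx (columns).
        {
            // Linearize the 2D index and read the raw depth value.
            int i = dy * depth_intrin.width + dx;
            uint16_t depth_value = depth_image[i];
            if (depth_value == 0)
                continue;                                   // No depth reading at this pixel.

            // Treat the pixel coordinates as real values, convert depth to meters,
            // then deproject into 3D, map into the color camera, and project back to 2D.
            rs::float2 depth_pixel = {(float)dx, (float)dy};
            float depth_in_meters  = depth_value * scale;
            rs::float3 depth_point = depth_intrin.deproject(depth_pixel, depth_in_meters);
            rs::float3 color_point = depth_to_color.transform(depth_point);
            rs::float2 color_pixel = color_intrin.project(color_point);

            // Reject points whose color coordinates fall outside the color image bounds.
            int cx = (int)std::round(color_pixel.x);
            int cy = (int)std::round(color_pixel.y);
            if (cx < 0 || cy < 0 || cx >= color_intrin.width || cy >= color_intrin.height)
                continue;

            // Fill one PCL point, flipping into a Y-up, right-handed coordinate system.
            pcl::PointXYZRGB pt;
            pt.x =  depth_point.x;
            pt.y = -depth_point.y;
            pt.z = -depth_point.z;

            const uint8_t *rgb = color_image + (cy * color_intrin.width + cx) * 3;
            pt.r = rgb[0];
            pt.g = rgb[1];
            pt.b = rgb[2];

            rs_cloud_ptr->points.push_back(pt);
        }
    }

    // Mark the cloud as unorganized.
    rs_cloud_ptr->width    = (uint32_t)rs_cloud_ptr->points.size();
    rs_cloud_ptr->height   = 1;
    rs_cloud_ptr->is_dense = false;
}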

Wrap up

In this article, I’ve attempted to show how you can get data from an Intel RealSense camera using the LibRealSense open source library, use that data, send it to PCL to generate point cloud data, and display it in the PCLViewer.

Intel® Aero Compute Board and Intel® RealSense™ Technology for Wi-Fi* Streaming of RGB Data


Download Source Code 

Contents

Introduction
Target Audience
General Information
What’s Needed for the Sample Application
What is the Intel® Aero Platform for UAVs
   Two examples:
The Intel® Aero Compute Board
   Operating system
   Connector information
Intel® Aero Vision Accessory Kit
Intel® RealSense™ Technology
GStreamer
Setting up Eclipse* Neon
   Header Files
   Libraries
The Source Code
   My Workflow
   Some Initial Thoughts
Intel Aero Compute Board Setup
Connecting Wirelessly to the Intel Aero Compute Board
   Troubleshooting
Useful Bash Shell Scripts
   migrateAero
   makeAero
How to Configure QGroundControl
   Step 1
   Step 2
   Step 3
   Step 4
Launch the Application
Intel Aero Compute Board and GitHub*
Other Resources
Summary

Introduction

This article shows you how to send a video stream from the Intel® Aero Compute Board that has an Intel® RealSense™ camera (R200) attached to it. This video stream will be broadcast over the compute board’s Wi-Fi* network to a machine that is connected to the Wi-Fi network. The video stream will be displayed in the QGroundControl* internal video window.

Target Audience

This article and the code sample within it are geared toward software engineers and drone developers who want to start learning about the Intel Aero Compute Board. Some information taken from other documents is also included.

General Information

The example given in this article assumes that you are working on an Ubuntu* 16.04 machine. Though you can work with GStreamer and LibRealSense on a Windows* platform, this article’s source code was written on top of Ubuntu 16.04; therefore, details for Windows are out of the scope of this document.

Although I will be referencing the Intel RealSense R200 camera in this article, this example does NOT use the LibRealSense library or take advantage of the camera’s depth capabilities. Future articles will address that type of functionality. This is a simple app to get someone up and running with streaming to QGroundControl.

Note that the source code is running on the Intel Aero Compute Board, not a client computer. It sends a video stream out to a specified IP address. The client computer must be attached to the Intel Aero Compute Board network.

What’s Needed for the Sample Application

We assume that you do not have an Intel® Aero Ready to Fly Drone and will be working with the board itself.

What is the Intel® Aero Platform for UAVs

The Intel® Aero Platform for UAVs is a set of Intel® technologies that allow you to create applications that enable various drone functionalities. At its core is the Intel Aero Compute Board and the Intel® Aero Flight Controller. The combination of these two hardware devices allows for powerful drone applications. The flight controller handles all aspects of drone flight, while the Intel Aero Compute Board handles real-time computation. The two can work in isolation from one another or communicate via the MAVlink* protocol.

Two examples:

Video streaming: When connected to a camera, the Intel Aero Compute Board can handle all the computations of connecting to the camera, pulling that stream of data, and doing something with it, perhaps streaming that data back to a ground control station via the built-in Wi-Fi capabilities. All this computation is handled independently of the flight controller.

Collision avoidance: The Intel Aero Compute Board is connected to a camera, in this case the Intel RealSense camera (R200). The application can pull depth data from the camera, crunch that data, and make tactical maneuvers based on the environment around the drone. These maneuvers can be calculated on the compute board, and then, using MAVLink, an altered course can be sent to the flight controller.

This article discusses video streaming; collision avoidance is out of the scope of this article.

The Intel® Aero Compute Board

Operating system

The Intel Aero Compute Board uses a customized version of Yocto* Linux*. Plans are being considered to provide Ubuntu in the future. Keeping the Intel Aero Compute Board up to date with the latest image of Yocto is out of the scope of this document. For more information on this process, please see the Intel-Aero / meta-intel-aero wiki.

Connector information

1: Power and console UART
2: USB 3.0 OTG
3: Interface for Intel RealSense camera (R200)
4: 4-lane MIPI* interface for high-resolution camera
5: 1-lane MIPI interface for VGA camera
6: 80-pin flexible I/O supporting third-party flight controllers and accessories (I2C, UART, GPIOs)
7: MicroSD memory card slot
8: Intel® Dual Band Wireless-AC
9: M.2 interface for PCIe* solid state drive
10: Micro HDMI* port
R: RESERVED for future use

Intel® Aero Vision Accessory Kit

The Intel® Aero Vision Accessory Kit contains three cameras: an Intel RealSense camera (R200), an 8-megapixel (MP) RGB camera, and a VGA camera that uses global shutter technology. With these three cameras, you can do depth detection using the Intel RealSense camera (R200) for use cases such as collision avoidance and creating point cloud data. With the 8-MP camera, you can collect and stream much higher-quality RGB data than the Intel RealSense camera (R200) is capable of streaming. With the VGA camera and its global shutter, one use case a developer could implement is optical flow.

More detailed information about each camera can be found here.

Intel RealSense camera (R200)

8-MP RGB camera

VGA camera

Intel® RealSense™ Technology

With Intel RealSense technology using the Intel RealSense camera (R200), a user can stream depth, RGB, and IR data. The Intel Aero Platform for UAVs uses the open source library LibRealSense. This open source library is analogous to being a driver for the Intel RealSense camera (R200), allowing you to easily get streaming data from the camera. The library comes with several easy-to-understand tutorials for getting streaming up and running. For more information on using LibRealSense, visit the LibRealSense GitHub* site.

GStreamer

In order to develop against GStreamer on your Ubuntu computer, you must install the proper libraries. An in-depth look into the workings of GStreamer is beyond the scope of this article. For more information, see the GStreamer documentation. We recommend starting with the “Application Development Manual." To get all the necessary GStreamer libraries, install the following on your Ubuntu machine.

  • sudo apt-get update
  • sudo apt-get install ubuntu-restricted-extras
  • sudo apt-get install gstreamer1.0-libav
  • sudo apt-get install libgstreamer-plugins-base1.0-dev

As a bit of tribal knowledge, I have two different machines I’ve been developing on, and these two Ubuntu instances have installed GStreamer in different locations: on one machine, the GStreamer headers and libraries are installed in /usr/include and /usr/lib, and on the other, they are installed in /usr/lib/x86_64-linux-gnu. You will see evidence of this in how I have included libraries and header files in my Eclipse project, which will appear to have duplicates. In hindsight, I could have just transferred the source code between two different project solutions.

Setting up Eclipse* Neon

As mentioned, you can use whatever IDE you like. I gravitated toward the C++ version of Eclipse Neon.

I assume that you know how to create an Eclipse C++ application and will just show you how I set up my include files and what libraries I chose.

Header Files

Libraries

At this point, you should be ready to compile the following source code.

The Source Code

The full listing for the sample application follows. Copy it into your own IDE as-is.

//=============================================================================
// AeroStreamRGBSimple
// Demonstrates how to capture RGB data from the RealSense camera and send
// it through a GStreamer pipeline. The end of the pipeline uses a UDP
// element to stream to wifi
//
// Built on Ubuntu 16.04 and Eclipse Neon.
//
//     SOFTWARE DEPENDENCIES
//     * GStreamer
//
// Example
//   ./AeroStream 192.168.1.2
//=============================================================================

#include <gst/gst.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main( int argc, char *argv[ ] )
{
    // App requires a valid IP address to where QGroundControl is running.
    if( argc < 2 )
    {
        printf( "Inform address as first parameter.\n" );
        exit( EXIT_FAILURE );
    }

    char        str_pipeline[ 250 ];    // Holds the pipeline string; must be large enough for the full string
    GstElement *pipeline = NULL;        // The pipe for all the elements
    GError     *error    = NULL;        // Holds the error message, if one is generated
    GMainLoop  *loop     = NULL;        // Main loop keeps the app running

    // Init GStreamer
    gst_init( &argc, &argv );

    // Construct the pipeline element string
    snprintf( str_pipeline, sizeof( str_pipeline ),
              "gst-launch-1.0 v4l2src device=/dev/video13 do-timestamp=true ! "
              "video/x-raw, format=YUY2, width=640, height=480, framerate=30/1 ! "
              "autovideoconvert ! vaapih264enc ! rtph264pay ! udpsink host=%s port=5600",
              argv[ 1 ] );

    // Parse the string to dynamically create the necessary elements behind the scenes
    pipeline = gst_parse_launch( str_pipeline, &error );
    if( !pipeline )
    {
        g_print( "Parse error: %s\n", error->message );
        return 1;
    }

    // Set the GStreamer pipeline's state to playing
    gst_element_set_state( pipeline, GST_STATE_PLAYING );

    // Create the app loop thread. This prevents the app from falling through to the end and exiting.
    loop = g_main_loop_new( NULL, FALSE );
    g_main_loop_run( loop );

    // Clean up once the app stops executing
    gst_element_set_state( pipeline, GST_STATE_NULL );
    gst_object_unref( pipeline );
    g_main_loop_unref( loop );    // GMainLoop is not a GstObject, so unref it with g_main_loop_unref

    return 0;
}

My Workflow

A little about the workflow I used: I developed on my Ubuntu machine using Eclipse Neon. I then compiled the application to ensure there were no compilation errors. Next I transferred the files over to the Intel Aero Compute Board using shell scripts. Finally, I compiled the application on the Intel Aero Compute Board and ran it for testing.

Some Initial Thoughts

Again I want to mention that the information in this article does not teach GStreamer; rather it highlights a real working sample application. This article only touches the surface in how you can construct streams in GStreamer.

We start off by ensuring that an IP address has been supplied as an input parameter. In a real-world application it may be desirable to parse the input string to ensure it’s in the form of a real IP address. For the sake of simplicity, here we use the IP address of a client computer running QGroundControl. This client computer MUST be attached to the Intel Aero Compute Board Wi-Fi network for this to work.

Next we declare some variables. The pipeline will be populated by all the GstElements needed to run our sample code. The GMainLoop is not a GStreamer construct; rather, it’s part of the GNOME project. It runs in its own thread and is used to keep the application alive and from falling through to the end of the code.

The gst_parse_launch command will parse out the GStreamer string. Behind the scenes it analyzes all the elements in the string and constructs the GstElements along with other aspects of GStreamer. After checking to see that there are no errors, we set the pipeline’s state to playing.

Remember: This code runs on the Intel Aero Compute Board. You are sending this data FROM the Intel Aero Compute Board to a client machine somewhere on the same network.

Now I want to point out a couple of critical aspects of the GStreamer string.

  • v4l2src device=/dev/video13
    This tells the GStreamer pipeline which device to connect to. On the Intel Aero Compute Board, the Intel RealSense camera (R200) RGB is video13.
  • udpsink host=%s port=5600
    This tells GStreamer to use UDP and send the video stream via Wi-Fi to a particular IP address and port. Remember, the IP address is coming in via a command-line parameter. You can include the port number on the command line as well if you want, as sketched below.
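    For example (a sketch, not part of the original listing), the snprintf call above could be changed to accept an optional second argument for the port; the argv[2] handling and the 5600 default are illustrative assumptions:

    // Optional: take the UDP port as a second command-line argument, defaulting to 5600.
    int port = 5600;
    if( argc > 2 )
    {
        port = atoi( argv[ 2 ] );   // e.g. ./AeroStream 192.168.1.2 5601
    }

    snprintf( str_pipeline, sizeof( str_pipeline ),
              "gst-launch-1.0 v4l2src device=/dev/video13 do-timestamp=true ! "
              "video/x-raw, format=YUY2, width=640, height=480, framerate=30/1 ! "
              "autovideoconvert ! vaapih264enc ! rtph264pay ! udpsink host=%s port=%d",
              argv[ 1 ], port );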

We create a new GMainLoop and get the loop running. Here the application continues to run; while this loop is running, GStreamer is pulling data from the camera, processing it, and sending it out over Wi-Fi.

At the end, we do some simple cleanup.

NOTE: While I work on my Ubuntu machine, I must still compile on the Intel Aero Compute Board.

NOTE: The way the GStreamer

Intel Aero Compute Board Setup

At this point, you have a project set up in your IDE. You’ve compiled the source code. The next step is to get the board connected.

The following images show you how to connect the board.

Now you can power up the board. Once it’s fully powered up, it will automatically start a Wi-Fi access point. The next step walks you through setting up connectivity on Ubuntu.

Connecting Wirelessly to the Intel Aero Compute Board

Once you have powered up the Intel Aero Compute Board, you can connect to it via Wi-Fi. In Ubuntu 16.04, you will see a derivative of CR_AP-xxxxx. This is the network connection you will be connecting to.


The SSID is 1234567890

Troubleshooting

If you do not see this network connection, and provided you have hooked up a keyboard and monitor to your Intel Aero Compute Board, run the following command on the board:

         sh-4.3# lspci

This shows you a list of PCI devices. Check for the following device:

         01:00.0 Network controller: Intel Corporation Wireless 8260 (rev3a)

If you do not see this device, do a “warm” boot.

         sh-4.3# reboot

Wait for the Intel Aero Compute Board to reboot. You should now see the network controller if you run lspci a second time. Attempt once again to connect via the wireless network settings in Ubuntu.

At times, I have seen an error message in Ubuntu saying:

         Out of range

If you get this error, try the following:

  • Make sure there are no other active network connections; if there are, disconnect from them.
  • Reboot Ubuntu

More on the Intel Aero Compute Board Wi-Fi can be found at the Intel Aero Meta Wiki.

Useful Bash Shell Scripts

Now that you’ve got the code compiled on Ubuntu, it’s time to move it over to the Intel Aero Compute Board. Remember that even though you might compile on your Ubuntu machine, you will still need to compile on the Intel Aero Compute Board as well. What I found was that if I skip this step, Yocto gives me an error saying that AeroStream is not a program.

To help expedite productivity, I’ve created a couple of small shell scripts. They aren’t necessary or required; I just got tired of typing the same things over and over.

migrateAero: First, it should be obvious that you must have a Wi-Fi connection to the Intel Aero Compute Board for this script to run.

This script runs from your Ubuntu machine. I keep it at the root of my Eclipse working folder. After I’ve made changes to the AeroStream project, I run this to migrate files over to the Intel Aero Compute Board. Technically, I don’t need to push the ‘makeAero’ script every time. But because I never know when I might change it, I always copy it over.

#!/bin/bash
# clean up these files or files wont get compiled on the Aero board. At least this is what I've found to be the case
rm AeroStream/Debug/src/AeroStream*
rm AeroStream/Debug/AeroStream

# Now push the entire AeroStream Eclipse Neon project to the Aero board. This will create the folder /home/AeroStream on the Aero board.
scp -r AeroStream root@192.168.1.1:/home

# makeAero script essentially runs a make and executes AeroStream
scp makeAero root@192.168.1.1:/home

makeAero: Runs on the Intel Aero Compute Board itself. It gets migrated with the project and ends up at the root of /home. All it’s doing is navigating into the debug directory and running the make file, and then launching AeroStream.

#!/bin/bash
#Created a shortcut script because I'm tired of typing this in every time I need to migrate

cd /home/AeroStream/Debug
make
./AeroStream

Instead of pushing the entire project over, you could create your own make file(s) and push only the source code; however, this approach worked for me.

Also, you don’t even need to create a project on Ubuntu using Eclipse. Instead, if you feel confident enough you can just develop right there on the board itself.

How to Configure QGroundControl

There is one last step to complete: configuring QGroundControl. Downloading and installing QGroundControl is out of the scope of this document. However, I need to show you how to set up QGroundControl to receive the GStreamer stream from the Intel Aero Compute Board Wi-Fi.

Note that QGroundControl also uses GStreamer for its video streaming capabilities, which is how the connection is actually made. GStreamer can send a stream over Wi-Fi from one location and listen for that stream at another location, and this is how QGroundControl receives the video.

NOTE: Make sure you are using the SAME port that you have configured in your GStreamer pipeline.

Step 1

When you launch QGroundControl, it opens into flight path mode. You need to click the QGroundControl icon to get to the configuration area.

Step 2

Click the Comm Links button. This displays the Comm Links configuration page.

Click Add.

Step 3

This displays the Create New Link Configuration page.

  1. Give the configuration a name. Any name is OK.
  2. For Type, select UDP.
  3. Select the Listening Port number. This port number must match the port that is being used from the GStreamer pipeline.
  4. Click OK.

Step 4

You will now see the new comm link in QGroundControl.

Launch the Application

NOTE: QGroundControl MUST be running first. It has to be put into listening mode. If you launch your streaming server application first, the connection will not be made. This is just an artifact of GStreamer.

  1. Launch QGroundControl.
  2. Launch AeroStream from the Intel Aero Compute Board. If everything has gone according to plan, you will see your video stream show up in QGroundControl.

Intel Aero Compute Board and GitHub*

Visit the Intel Aero Compute Board GitHub for various software code bases to keep your Intel Aero Compute Board up to date.

https://GitHub.com/intel-aero

https://GitHub.com/intel-aero/meta-intel-aero/wiki

Other Resources

http://www.intel.com/content/www/us/en/technology-innovation/aerial-technology-overview.html

https://software.intel.com/en-us/realsense/aero

http://qgroundcontrol.com/

http://click.intel.com/

Summary

This article helped you get up and running with video streaming on the Intel Aero Compute Board. I gave you an overview of the board itself and showed you how to connect to it. I also showed you which libraries are needed, how I set up Eclipse for my own project, and how to get Wi-Fi up, transfer files, and set up QGroundControl. At this point you are ready to explore other capabilities of the board and streaming.

Intel® Clear Containers 1: The Container Landscape


Download PDF

Introduction

This article introduces the concept of Intel® Clear Containers technology and how it fits into the overall landscape of current container-based technologies. Many introductory articles have already been written about containers and virtual machines (VMs) over the last several years. Here's a good overview: A Beginner-Friendly Introduction To Containers, VMs, and Docker.

This article briefly summarizes the most salient features of existing VMs and containers.

Intel Clear Containers offer advantages to data center managers and developers because of their security and shared memory efficiencies, while their performance remains sufficient for running containerized applications.

What's In A Word: Container

The word “container” gets thrown around a lot in this article and many others, and it's important to understand its meaning.

If you’re an application developer, you probably think of a container as some kind of useful application running inside a private chunk of a computer's resources. For example, you would discuss downloading a container image of a web server from Docker* Hub*, and then running that container on a host.

If you’re a data center manager, your perception of a container might be slightly different. You know that application developers are running all kinds of things in your data center, but you are more concerned with the composition and management of the bounded resources the applications run in than the applications themselves. You might discuss running “an LXC container using Docker” for the developer's web server. You would refer to the application image and application itself as the containerized workload.

This is a subtle, but important, difference. Since we'll be looking directly at container technology, independent of the workloads run using it, we'll use the data center manager's definition of container: the technology that creates an instance of bounded resources for a containerized workload to use.

Virtual Machines versus Containers

It is technically a misstatement to suggest that containers came after VMs. In fact, containers are more-or-less a form of virtualized resources themselves, and both technologies are descendants in a long line of hardware abstractions that stretch back to early computing.

However, in the marketplace of modern IT, it's relatively clear that the era from roughly the mid-2000s to around 2014 (or so) was dominated by the rise of paravirtual machine usage, both in the traditional data center and in cloud computing environments. The increasing power of servers, combined with advancements in hardware platforms friendly to VMs like Intel® Virtualization Technology for IA-32, Intel® 64 and Intel® Architecture and Intel® Virtualization Technology for Directed I/O allowed data center managers to more flexibly assign workloads.

Early concerns about VMs included performance and security. As the platforms grew more and more robust, these concerns became less and less relevant. Eventually commodity hypervisors were capable of delivering performance with less than 2–3 percent falloff from direct physical access. From a security standpoint, VMs became more isolated, allowing them to run in user space, meaning a single ill-behaved VM that became compromised did not automatically allow an attacker access outside the VM itself.

Within the last few years, containers began exploding upon the scene. A container, whether provided by Linux* Containers (LXC), libcontainer*, or other types, offers direct access to hardware, so there's no performance penalty. They can be instantiated far more quickly than a regular VM since they don't have to go through a bootup process. They don't require the heavyweight overhead of an entire OS installation to run. Most importantly, the powerful trio of a container, a container management system (Docker), and a robust library of containerized applications (DockerHub) gives application developers access to rapid deployment and scaling of their applications that could not be equaled by traditional VMs.

While offering these huge rewards, containers reintroduced a security problem: they represent direct access to the server hardware that underpins them. A compromised container allows an attacker the capability to escape to the rest of the OS beneath it.

[Note: We don't mean to imply that this access is automatic or easy. There are many steps to secure containers in the current market. However, a compromise of the container itself—NOT the containerized application—results in likely elevation to the kernel level.]

That leaves us with this admittedly highly generalized statement of pros and cons for VMs versus containers:

Virtual Machine      Container
- Slow Boot          + Rapid Start
- Heavy Mgmt.        + Easy Mgmt.
+ Security           - Security
= Performance*       = Performance

*VMs take a negligible performance deficit due to hardware abstraction.

Best of Both Worlds: Intel Clear Containers

Intel has developed a new, open source method of launching containerized workloads called Intel Clear Containers. An Intel Clear Container, running on Intel architecture with Intel® Virtualization Technology enabled, is:

  • A highly-customized version of the QEMU-KVM* hypervisor, called qemu-lite.
    • Most of the boot-time probes and early system setup associated with a full-fledged hypervisor are unnecessary and stripped away.
    • This reduces startup time to be on a par with a normal container process.
  • A mini-OS that consists of:
    • A highly-optimized Linux kernel.
    • An optimized version of systemd.
    • Just enough drivers and additional binaries to bring up an overlay filesystem, set up networking, and attach volumes.
  • The correct tooling to bring up containerized workload images exactly as a normal container process would.

Intel Clear Containers can also be integrated with Docker 1.12, allowing Docker to be used exactly as though it were operating normal OS containers via its native execution engine. This drop-in is possible because the runtime is compatible with the Open Container Initiative* (OCI*). The important point is that from the application developer perspective, where “container” means the containerized workload, an Intel Clear Container looks and behaves just like a “normal” OS container.

There are some additional but less obvious benefits. Since the mini-OS uses a 4.0+ Linux kernel, it can take advantage of the “direct access” (DAX) feature of the kernel to replace what would be overhead associated with VM memory page cache management. The result is faster performance by the mini-OS kernel and a significant reduction in the memory footprint of the base OS and filesystem; only one copy needs to be resident in memory on a host that could be running thousands of containers.

In addition, Kernel Shared Memory (KSM) allows the containerized VMs to share memory securely for static information that is not already shared by DAX via a process of de-duplication. This results in an even more efficient memory profile. The upshot of these two combined technologies is that the system's memory gets used for the actual workloads, rather than redundant copies of the same OS and library data.

Given the entry of Intel Clear Containers onto the scene, we can expand the table from above:

Virtual Machine      Container          Intel® Clear Container
- Slow Boot          + Rapid Start      + Rapid Start
- Heavy Mgmt.        + Easy Mgmt.       + Easy Mgmt.
+ Security           - Security         + Security
= Performance*       = Performance      = Performance*

*VMs take a negligible performance deficit due to hardware abstraction.

Conclusion

Intel Clear Containers offer a means of combining the best features of VMs with the power and flexibility that containers bring to application developers.

You can find more information about Intel Clear Containers at the official website.

This is the first in a series of articles about Intel Clear Containers. In the second, we'll be demonstrating how to get started using Intel Clear Containers yourself.

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for Intel’s Developer Relations Division. Before that he worked in Intel’s Open Source Technology Center (OTC), on both Intel Clear Containers and the Clear Linux for Intel Architecture Project. He’d be happy to hear from you about this article at: jim.chamings@intel.com.

Using Open vSwitch and DPDK with Neutron in DevStack


Download PDF [PDF 660 KB]

Introduction

This tutorial describes how to set up a demonstration or test cluster for Open vSwitch (OVS) and Data Plane Development Kit (DPDK) to run together on OpenStack, using DevStack as the deployment tool and the Neutron ML2/GRE Tunnel plugin.

While the learnings presented here could be used to inform a production deployment with all of these pieces in play, actual production deployments are beyond the scope of this document.

The primary source for most of the details presented here are the documents provided in the following git repository:

https://github.com/openstack/networking-ovs-dpdk

The doc/source/getstarted/devstack/ directory at the root of this repository contains instructions for installing DevStack + OVS + DPDK on multiple operating systems. This tutorial uses the Ubuntu* instructions provided there and expands upon them to present an end-to-end deployment guide for that operating system. Many of the lessons learned in the creation of this document can be applied to the CentOS* and Fedora* instructions also provided in the repository.

Anyone using this tutorial should, at least, understand how to install and configure Linux*, especially for multi-homed networking across multiple network interfaces.

Knowledge of Open vSwitch and OpenStack is not necessarily required, but would be exceedingly helpful.

Requirements

Hardware

  • Two physical machines, one to act as a controller and one as a compute node.
  • Systems must support VT-d and VT-x, and both capabilities should be enabled in the system BIOS.
  • Both machines should be equipped with DPDK-supported network interfaces. A list of the supported equipment can be found at https://github.com/openstack/networking-ovs-dpdk.
    • In the examples shown here, an Intel® X710 (i40e) quad-port NIC was used.
  • Both machines must have at least two active physical network connections: one to act as the 'public' (or 'API') network and one to act as the 'private' (or 'data') network.
    • This latter network should share the same broadcast domain/subnet between the two hosts.
    • See below for an example setup.
  • IMPORTANT: If you have a two-socket (or more) system, you should ensure that your involved network interfaces either span all NUMA nodes within the system, or are installed in a PCIe slot that services NUMA Node 0.
    • You can check this with the lstopo command from the hwloc package; see the example after this list. On Ubuntu, you can install it with the following command:
        sudo apt-get install -y hwloc
    • lstopo will output a hierarchical view. Your network interface devices will be displayed underneath the NUMA Node that’s attached to the slot they are using.
    • If you have the NICs installed entirely on a NUMA Node other than 0, you will encounter a bug that will prevent correct OVS setup. You may wish to move your NIC device to a different PCIe slot.
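
For example (a minimal sketch; interface names will differ on your system, and you can also simply read the full lstopo tree directly):

        lstopo --of console | grep -iE "NUMANode|Net"

In the filtered output, Net entries listed after NUMANode P#0 (and before NUMANode P#1) are attached to node 0.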

Operating System

  • For this tutorial, we will use Ubuntu 16.04LTS Server. The installation .ISO file for this OS can be downloaded from https://www.ubuntu.com/download/server.
  • Steps should be taken to ensure that both systems are synchronized to an external time service. Various utilities are available for this function in Ubuntu 16.04, such as chrony.
  • At least one non-root user (with administrative sudo privileges) should be created at installation time. This user will be used to download and run DevStack.

Networking

  • At least two active network connections are required.
  • Both hosts must be able to reach the Internet either via one of the two connections on-board, or via some other device. If one of the two interfaces is your default interface for connecting to the Internet, then that interface should be your 'API network.'
  • The addressing for the two NICs on each host must be static; DevStack will not cope with DHCP-assigned addresses on either interface. A sample static configuration is shown after Figure 1.
    • In the examples, the initial configuration of the machines before running DevStack is shown in figure 1 below.
    • Note that in this particular configuration a third, unshown NIC on each machine functions as the default interface that DevStack uses to reach the Internet.

      Figure 1. Initial machine configuration before DevStack is installed.
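
As a reference, a static ifupdown configuration matching the example addresses used in the local.conf files later in this guide might look like the following on the controller (a sketch only; substitute your actual interface names, and use the corresponding .2 addresses on the compute node):

      # /etc/network/interfaces (controller)
      auto eth0
      iface eth0 inet static
          address 192.168.10.1
          netmask 255.255.255.0

      auto eth1
      iface eth1 inet static
          address 192.168.20.1
          netmask 255.255.255.0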

Preparation

Once your systems are set up with appropriate BIOS configuration, operating systems, and network configurations, you can begin preparing each node for the DevStack installation. Perform these actions on BOTH nodes.

User Setup

  • Log in as your non-root user; we'll assume here that it is called 'stack.'
  • The command sudo visudo will allow you to edit the /etc/sudoers file. You will want to set up the non-root user to be able to sudo without entering a password. The easiest way to do that is to add a line like this to the bottom of the file, then save it:
      stack ALL=(ALL) NOPASSWD: ALL
  • Test your configuration by logging completely out of the machine and back in, then try sudo echo Hello. You should not be prompted for a password.

A Word About Network Proxies

  • You may skip this section if your Internet access does not require going through a proxy service.
  • If you do work behind a proxy, please note that your OS and the DevStack installation will need to be configured appropriately for this. For the most part, this is a typical setup that you are likely accustomed to; for example, you likely configured the apt subsystem when you installed the Ubuntu OS in the first place.
  • Of special note here is that git needs to be configured with a wrapper command.
    • Install the socat package:
        sudo apt-get install -y socat
    • Create a text file in your non-root user's homedir called git-proxy-wrapper.
    • Here is an example of what should go into this file. Replace PROXY and PROXY PORT NUMBER with your appropriate values for your SOCKS4 proxy (not HTTP or HTTPS). 
      #!/bin/sh
      _proxy=<PROXY> 
      _proxyport=<PROXY PORT NUMBER>
      exec socat STDIO SOCKS4:$_proxy:$1:$2,socksport=$_proxyport
  • You may need more detail if your proxy service requires authentication or another protocol.
    See the socat documentation for more information.
  • Set this file executable (chmod +x git-proxy-wrapper) and set the environment variable GIT_PROXY_COMMAND=/home/stack/git-proxy-wrapper (if the non-root user is 'stack'). You should add this export to your ~/.bashrc to ensure it is available at all times (like other proxy variables).
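  • For example (assuming the non-root user 'stack'):
      chmod +x ~/git-proxy-wrapper
      echo 'export GIT_PROXY_COMMAND=/home/stack/git-proxy-wrapper' >> ~/.bashrc
      source ~/.bashrc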

Install DevStack

  • You will need the git package to clone the DevStack repository:
       sudo apt-get install -y git
  • To download DevStack, as your non-root user and in its home directory:
       git clone https://github.com/openstack-dev/devstack.git
  • You should now have the contents of the DevStack distribution in ~/devstack.

Configure and Run DevStack

The systems are now prepared for DevStack installation.

Controller Installation

In its default state, DevStack will make assumptions about the OpenStack services to install, and their configuration. We will create a local.conf file to change those assumptions, where pertinent, to ensure use of DPDK and OVS.

If you wish to clone the networking-ovs-dpdk repository (which is the first link in this article, in the Introduction) and use the sample files included in the repository, you will find them at doc/source/_downloads/local.conf.*. However, this guide will present pared-down versions of these that are ready-made for this installation.

  • Log in to your controller host as your non-root user.
      cd devstack
  • Edit a file called 'local.conf' and copy in the text in the following block:
    [[local|localrc]]
    #HOST_IP_IFACE=<device name of NIC for public/API network, e.g. 'eth0'>
    #Example:
    #HOST_IP_IFACE=eth0
    HOST_IP_IFACE=
    #HOST_IP=<static IPv4 address of public/API network NIC, e.g. '192.168.10.1'>
    #Example:
    #HOST_IP=192.168.10.1
    HOST_IP=
    HOST_NAME=$(hostname)
    MYSQL_PASSWORD=password
    DATABASE_PASSWORD=password
    RABBIT_PASSWORD=password
    ADMIN_PASSWORD=password
    SERVICE_PASSWORD=password
    HORIZON_PASSWORD=password
    SERVICE_TOKEN=tokentoken
    enable_plugin networking-ovs-dpdk https://github.com/openstack/networking-ovs-dpdk master
    OVS_DPDK_MODE=controller_ovs_dpdk
    disable_service n-net
    disable_service n-cpu
    enable_service neutron
    enable_service q-svc
    enable_service q-agt
    enable_service q-dhcp
    enable_service q-l3
    enable_service q-meta
    DEST=/opt/stack
    SCREEN_LOGDIR=$DEST/logs/screen
    LOGFILE=${SCREEN_LOGDIR}/xstack.sh.log
    LOGDAYS=1
    Q_ML2_TENANT_NETWORK_TYPE=gre
    ENABLE_TENANT_VLANS=False
    ENABLE_TENANT_TUNNELS=True
    #OVS_TUNNEL_CIDR_MAPPING=br-<device name of NIC for private network, e.g. 'eth1'>:<CIDR of private NIC, e.g. 192.168.20.1/24>
    #Example:
    #OVS_TUNNEL_CIDR_MAPPING=br-eth1:192.168.20.1/24
    OVS_TUNNEL_CIDR_MAPPING=
    Q_ML2_PLUGIN_GRE_TYPE_OPTIONS=(tunnel_id_ranges=400:500)
    OVS_NUM_HUGEPAGES=3072
    OVS_DATAPATH_TYPE=netdev
    OVS_LOG_DIR=/opt/stack/logs
    #OVS_BRIDGE_MAPPINGS="default:br-<device name of NIC for private network, e.g. 'eth1'>"
    #Example:
    #OVS_BRIDGE_MAPPINGS="default:br-eth1"
    OVS_BRIDGE_MAPPINGS=
    MULTI_HOST=1
    [[post-config|$NOVA_CONF]]
    [DEFAULT]
    firewall_driver=nova.virt.firewall.NoopFirewallDriver
    novncproxy_host=0.0.0.0
    novncproxy_port=6080
    scheduler_default_filters=RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,PciPassthroughFilter,NUMATopologyFilter
  • Wherever you see an Example given in the file contents, you will need to fill in a value appropriate to your setup for that particular setting. The contents are dependent on your particular hardware and network setup.
  • Save the file.
  • You are now ready to run the DevStack installation. Issue the command ./stack.sh.
  • Wait for successful completion and the return of your command prompt. It will take quite a while (20 minutes to an hour or more depending on your connection), as DevStack needs to download and install many software repositories. Once it completes, you can sanity-check the services as shown at the end of this list.
  • If you encounter errors, the log at /opt/stack/logs/screen/xstack.sh.log will likely be the best source of useful debug information.
  • If you need to restart the installation, unless you are very familiar with DevStack's setup, it is best to wipe and reload the entire OS before proceeding. DevStack sometimes does not recover well from partial installations, unfortunately.
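  • Once stack.sh finishes, a quick way to confirm that the core services registered correctly is to query Nova and Neutron from the controller (a minimal check, assuming the legacy nova and neutron clients installed by DevStack):
      source openrc admin admin
      nova service-list
      neutron agent-list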

Compute Node Installation

The setup for the compute node is very similar to that of the controller, but the local.conf looks a little different. The same instructions apply here that you used above. Do not attempt compute node installation until your controller has installed successfully.

  • Log in to your compute node as the non-root user.
      cd devstack
  • Edit a local.conf file as above, but use the following text block and values:
    [[local|localrc]]
    #HOST_IP_IFACE=<device name of NIC for public/API network, e.g. 'eth0'>
    #Example:
    #HOST_IP_IFACE=eth0
    HOST_IP_IFACE=
    #HOST_IP=<static IPv4 address of public/API network NIC, e.g. '192.168.10.2'>
    #Example:
    #HOST_IP=192.168.10.1
    HOST_IP=
    HOST_NAME=$(hostname)
    #SERVICE_HOST=<IP address of public NIC on controller>
    #Example:
    #SERVICE_HOST=192.168.10.1
    SERVICE_HOST=
    MYSQL_HOST=$SERVICE_HOST
    RABBIT_HOST=$SERVICE_HOST
    GLANCE_HOST=$SERVICE_HOST
    KEYSTONE_AUTH_HOST=$SERVICE_HOST
    KEYSTONE_SERVICE_HOST=$SERVICE_HOST
    MYSQL_PASSWORD=password
    DATABASE_PASSWORD=password
    RABBIT_PASSWORD=password
    ADMIN_PASSWORD=password
    SERVICE_PASSWORD=password
    HORIZON_PASSWORD=password
    SERVICE_TOKEN=tokentoken
    enable_plugin networking-ovs-dpdk https://github.com/openstack/networking-ovs-dpdk master
    OVS_DPDK_MODE=compute
    disable_all_services
    enable_service n-cpu
    enable_service q-agt
    DEST=/opt/stack
    SCREEN_LOGDIR=$DEST/logs/screen
    LOGFILE=${SCREEN_LOGDIR}/xstack.sh.log
    LOGDAYS=1
    Q_ML2_TENANT_NETWORK_TYPE=gre
    ENABLE_TENANT_VLANS=False
    ENABLE_TENANT_TUNNELS=True
    #OVS_TUNNEL_CIDR_MAPPING=br-<device name of NIC for private network, e.g. 'eth1'>:<CIDR of private NIC, e.g. 192.168.20.2/24>
    #Example:
    #OVS_TUNNEL_CIDR_MAPPING=br-eth1:192.168.20.2/24
    OVS_TUNNEL_CIDR_MAPPING=
    Q_ML2_PLUGIN_GRE_TYPE_OPTIONS=(tunnel_id_ranges=400:500)
    OVS_NUM_HUGEPAGES=3072
    MULTI_HOST=1
    [[post-config|$NOVA_CONF]]
    [DEFAULT]
    firewall_driver=nova.virt.firewall.NoopFirewallDriver
    vnc_enabled=True
    vncserver_listen=0.0.0.0
    vncserver_proxyclient_address=$HOST_IP
    scheduler_default_filters=RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,PciPassthroughFilter,NUMATopologyFilter
  • Save the file and follow the rest of the instructions as performed above for the controller.

Launch a Test Instance

At this point, you should have a working OpenStack installation. You can reach the Horizon dashboard at the public IP address of the controller node, on port 80. But before launching an instance, we need to make at least one 'flavor' in the Nova service aware of DPDK, and specifically the use of hugepages, so that it can appropriately use the DPDK-enabled private interface.

Enable DPDK in Nova

  • Log in to your controller host as the non-root user.
      cd devstack
  • source openrc admin demo - this command will set up your environment to act as the OpenStack 'admin' user in the 'demo' project.
      nova flavor-key m1.small set hw:mem_page_size=large
  • The preceding command ensures that an instance launched with the m1.small flavor will use hugepages, thus enabling use of the DPDK device.
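  • You can verify that the key was applied with a standard flavor query (output formatting varies by client version):
      nova flavor-show m1.small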

Open Ports

We need to make sure that ICMP and TCP traffic can reach any spawned VM.

  • Connect to the controller's public/API IP address over HTTP on port 80 in a browser.
  • Log in with the username 'demo' and the password 'password.'
  • Select 'Access & Security' from the left-hand menu.
  • Click 'Manage Rules' in the 'default' security group.
  • Click 'Add Rule' button.
  • Select 'All ICMP' from 'Rule' drop-down menu and then click 'Add' in lower right.
  • Repeat the above 2 steps, selecting 'All TCP' this time.

Launch an instance

Now we are ready to launch a test instance.

  • Select 'instances' from the left-hand menu.
  • Select 'Launch Instance' from the right-side button.
  • Give an instance name such as 'test.' Click 'Next' at the lower right.
  • Ensure 'Select Boot Source' is set to 'Image.'
  • Select 'No' for 'Create New Volume.' It will not work in this tutorial setup because we have not enabled the Cinder service.
  • Click the '+' sign next to the Cirros image under 'Available.'
  • Click 'Next' in the lower-right corner.
  • Click the '+' sign next to the 'm1.small' flavor.
  • Select 'Launch Instance.'

Eventually, you should see your instance launch become available. Make note of the private IP address given.

Test Connectivity to Instance

The VM is up and running with a private IP address assigned to it. You can connect to this private IP, but only if you are in the same network namespace as the virtual router that the VM is connected to.

These instructions will show how to enter the virtual namespace and access the VM.

  • Log in to the controller host as the non-root user.
      sudo ip netns exec `ip netns | grep qrouter` /bin/bash
  • You should get a root-shell prompt. You can use this shell to ping the IP address of the VM.
  • You can also ssh to the VM. In CirrOS, you can log in with username 'cirros' and password 'cubswin:)'.
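  • For example, from the root shell inside the namespace (the address is illustrative; use the private IP you noted when the instance launched):
      ping -c 3 10.0.0.5
      ssh cirros@10.0.0.5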

Summary

This completes a demonstration installation of DPDK with Open vSwitch and Neutron in DevStack.

From here, you can examine the configurations that were generated by DevStack to learn how to apply those configurations in production instances of OpenStack. You will find the service configurations under the /opt/stack/ directory, and in their respective locations in /etc (e.g. /etc/nova/nova.conf). Of particular note for our purposes are /etc/neutron/neutron.conf, which defines the use of the ML2 plugin by Neutron, and /etc/neutron/plugins/ml2_conf.ini, which specifies how Open vSwitch is to be configured and used by the Neutron agents.

End Notes

OVS Bridge Information

For reference, here is the sample bridge structure that shows up on a lab system that was used to test this tutorial. This is from the compute node. On this system, ens786f3 is the private/data network interface designation. There are two running VMs; their interfaces can be seen on br-int.

$ sudo ovs-vsctl show
3c8cd45e-9285-45b2-b57f-5c4febd53e3f
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "int-br-ens786f3"
            Interface "int-br-ens786f3"
                type: patch
                options: {peer="phy-br-ens786f3"}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port br-int
            Interface br-int
                type: internal
        Port "vhufb2e2855-70"
            tag: 1
            Interface "vhufb2e2855-70"
                type: dpdkvhostuser
        Port "vhu53d18db8-b5"
            tag: 1
            Interface "vhu53d18db8-b5"
                type: dpdkvhostuser
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "gre-c0a81401"
            Interface "gre-c0a81401"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.168.20.2", out_key=flow, remote_ip="192.168.20.1"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port br-tun
            Interface br-tun
                type: internal
    Bridge "br-ens786f3"
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "br-ens786f3"
            Interface "br-ens786f3"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port "phy-br-ens786f3"
            Interface "phy-br-ens786f3"
                type: patch
                options: {peer="int-br-ens786f3"}

Neutron Plugin Information

This tutorial set up the ML2/GRE tunnel plugin to Neutron, since it is the most likely plugin to work without additional setup for a specific network buildout. It is also possible to use the ML2/VXLAN plugin or the ML2/VLAN plugin. Examples for each of these plugins are given in the local.conf files in the networking-ovs-dpdk repository mentioned above.

NUMA Debugging

While it is beyond the scope of this tutorial to dive into multi-socket NUMA arrangements, it is important to understand that CPU pinning and PCIe locality interact with DPDK and OVS, sometimes causing silent failures. Ensure that all of your CPUs, memory allocations, and PCIe devices are within the same NUMA node if you are having connectivity issues.

Additional OVS/DPDK Options of Note

OVS_PMD_CORE_MASK is an option that can be added to local.conf to pin DPDK's PMD threads to specific CPU cores. With the default value of '0x4', the PMD thread is pinned to CPU #3 (and its hyperthread sibling, if Hyper-Threading is enabled). If you are using multiple NUMA nodes in your system, you should work out the bitwise mask to assign one PMD thread/CPU per node. You will see these CPUs spike to 100% utilization once DPDK is enabled, as they begin polling.
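
For example, setting bits 2 and 10 of the mask (0x4 + 0x400) assigns one PMD core on each socket of a hypothetical two-socket host; adjust the mask to your own topology:

OVS_PMD_CORE_MASK=0x404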

You can find other interesting and useful OVS-DPDK settings and their default values in devstack/settings in the networking-ovs-dpdk repository.

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for Intel’s Developer Relations Division. He’d be happy to hear from you about this article at: jim.chamings@intel.com.


Using Intel® VTune™ Amplifier on Cray* XC systems


Introduction

The goal of this article is to provide a detailed description of the process of installing VTune Amplifier and using it for application performance analysis, which is somewhat specific to Cray’s programming environment (PE). We will be referencing CLE 6.0, the Cray installation and configuration model for software on Cray XC systems [1]. The installation part of the article targets site administrators and system supporters responsible for the Cray XC programming environment, while the data collection and analysis part is applicable to Cray XC system users.

 

Installation 

Cray CLE 6.0 provides a set of different compilers, performance analysis tools, and run-time libraries, including the Intel compiler and the Intel MPI library. However, VTune Amplifier is not part of it, and installing it into the programming environment requires additional effort.

According to the Cray CLE 6.0 documentation [2], installation of additional software into a PE image root is performed on the system's System Management Workstation (SMW). The PE image root is then pushed to the boot node so that it can be mounted by a group of Data Virtualization Service (DVS) servers and then mounted on the system's login and compute nodes.

Cray presents the advantages of the PE image root model as follows: the installation is designed to be system and hardware agnostic, so the same PE image root can also be used for other systems, such as eLogin systems or another Cray XC. A feature of Image Management and Provisioning System (IMPS) images is that they are easily "cloned" by leveraging rpm and zypper. This ability allows a site to test new PE releases and also makes reverting to previous PE releases easier. However, the VTune sampling driver installation is not system agnostic and requires careful matching of the Linux kernel used for data collection, as will be shown later in the example.

Installing VTune Amplifier is performed on the SMW by using chroot to access the PE image root. You need to copy the VTune installation package to the PE image root, execute the VTune installation procedure, and create a VTune modulefile.

The craypkg-gen tool is used to generate a modulefile so that third-party software like VTune can be used in the same manner as the components of the Cray Programming Environment. Before running it, you need to define the USER_INSTALL_DIR environment variable, which for VTune would be /opt/intel.
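
For example:

$ export USER_INSTALL_DIR=/opt/intel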

The Craypkg-gen ‘-m’ option will create the modulefile:

$ craypkg-gen -m $USER_INSTALL_DIR/vtune_amplifier_xe_2017.0.2.478468

The ‘-m’ option also creates a set_default script that will make the associated modulefile the default version that is used by the module command. For this example, the following set_default script was created:

$USER_INSTALL_DIR/admin-pe/set_default_craypkg/set_default_vtune_amplifier_xe_2017.0.2.478468

Executing the generated set_default script will result in a “module load vtune” loading the vtune_amplifier_xe/2017.0.2.478468 modulefile.

 

Example of installing VTune Amplifier 2017

With the CLE 6.0 Programming Environment software installed onto a PE image root, download the Intel VTune Amplifier 2017 package and copy it to the PE image root:

smw # export PECOMPUTE=/var/opt/cray/imps/image_roots/<pe_compute_cle_6.0_imagename>
smw # cp vtune_amplifier_xe_2017_update1.tar.gz $PECOMPUTE/var/tmp

Note that this might not be a standalone VTune installation package but the whole Intel Parallel Studio XE package, for example parallel_studio_xe_2016_update1.tgz. In that case the installation differs only in selecting the VTune component.

If you are not using a FlexLM license server, which requires a certain configuration, copy a registered license file to the PE image for interactive installation:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/var/tmp

Or copy the license file to the default Intel licenses directory:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/opt/intel/licenses

Perform a chroot to PE image:

smw # chroot $PECOMPUTE

Untar the VTune Amplifier package:

smw # cd /var/tmp
smw # tar xzvf vtune_amplifier_xe_2017_update1.tar.gz

By default, the VTune installer is interactive and requires the administrator to respond to prompts. You might want to consult with the Intel® VTune™ Amplifier XE Installation Guide before proceeding.

smw # cd vtune_amplifier_xe_2017_update1/
smw # ./install.sh

Follow the command prompts to install the product.

If you need a non-interactive VTune installation, refer to the Automated Installation of Intel® VTune™ Amplifier XE help article.

 

Once the installer flow reaches the sampling (SEP) driver installation, you can either postpone that step or provide a path to the source directory of the Linux kernel that runs on the Cray compute nodes.

Note: the Cray SMW 8.0 is based on SLES 12, which might not be the same OS version as on the compute nodes. In that case you need to provide a path to the target OS kernel headers when requested by the VTune installer.

In case of a postponed driver installation, go through the following steps (assuming that the compute node Linux kernel sources are unpacked to usr/src/target_linux under the PE image root).

Use the GCC environment for building:

smw #  module swap PrgEnv-cray PrgEnv-gnu

Set environment variable CC, so that 'cc' is used as the compiler:

smw # export CC=cc

Build the drivers (two kernel drivers will be built):

smw # cd vtune_amplifier_xe_2017/sepdk/src
smw # ./build-driver -ni --kernel-src-dir=$PECOMPUTE/usr/src/target_linux

Install the drivers, permitting access to the appropriate user group (by default, the driver access group name is ‘vtune’ and the driver permissions are 660):

smw # ./insmod-sep3 -r -g <group>

By default, the driver will be installed in the current /sepdk/src directory. If you need to change it, use the --install-dir option with the insmod-sep3 script.

Refer to the <vtune-install-dir>/sepdk/src/README.txt document for more details on building the driver.

Create the VTune modulefile following the steps:

smw # module load craypkg-gen
smw # craypkg-gen -m $PECOMPUTE/opt/intel/vtune_amplifier_xe_2017.0.2.478468
smw # /opt/intel/vtune_amplifier_xe_2017.0.2.478468/amplxe-vars.sh

The above procedure will create the modulefile $PECOMPUTE/modulefiles/vtune_amplifier_xe/2017.0.2.478468

You might want to edit the newly created modulefile to specify path variables.

 

Collecting profile data with VTune Amplifier

In order to collect profiling data for further analysis, you need to run the VTune collector along with your application on the system. There are several ways to launch an application for analysis; in general, they are described in the VTune Amplifier Help pages.

Cray systems have their own specifics of running applications by submitting batch jobs, and so does VTune. Generally it is recommended to use the VTune command-line tool, "amplxe-cl", to collect profiling data on compute nodes via batch jobs, and then use the VTune GUI, "amplxe-gui", to display the results on a login node of the system.

However, the job scheduler utilities accepted as part of the task submission procedure, as well as the compilers and MPI libraries used for creating parallel applications, may vary depending on specific requirements. This creates additional complexity for performance data collection using VTune or any other performance profiling tool. Below, we give some common recipes for running performance data collection with the two most frequently used job schedulers.

 

Slurm* workload manager and srun command

Here is an example of a job script for analysis of a pure MPI application:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
srun -n 64 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

This script will run the advanced-hotspots analysis over the a.out program running on two nodes with 64 MPI ranks in total. The other VTune options mean the following:

-collect advanced-hotspots: the type of analysis used by the VTune collector (a hardware event-based collector, like general-exploration and memory-access)

--trace-mpi: allows the collector to trace MPI code and determine MPI rank IDs when the code is linked against a non-Intel MPI library. When using the Intel MPI library this option can be omitted.

-r my_res_dir: the name of the results directory, which will be created in the current directory

It is highly recommended to create the results directory on the fast Lustre file system. VTune needs to frequently flush trace data from memory to disk, so putting results on a global file system is not recommended: such a file system is projected to compute nodes via the Cray DVS layer and may not fully support the mmap functionality required by the VTune collector.

In the script you need to unload the darshan module before profiling your code, as the VTune collector might interfere with this I/O characterization tool. (The darshan tool might not be installed on your system at all.)

The --vtune flag is needed to dynamically insmod the sampling driver for hardware event collection during the job.

Note the length of your job. Even if '-t' is set to 1 hour, it doesn't mean that VTune will collect data for the whole application run time. By default, the size of the results directory is limited, and when the trace file reaches this limit, VTune stops the collection while the application continues to run. The practical implication is that performance data will cover only an initial portion of the application run. To overcome this limitation, consider either increasing the results directory size limit or decreasing the sampling frequency.
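
For example, the result size limit can be lifted directly on the command line (a sketch; confirm the exact option name with 'amplxe-cl -help collect' for your VTune version):

srun -n 64 amplxe-cl -collect advanced-hotspots -data-limit=0 -r my_res_dir --trace-mpi -- ./a.out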

 

If your application uses a hybrid parallelization approach combining MPI and OpenMP, your job script for VTune analysis might look like the following:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
export OMP_NUM_THREADS=32
srun -n 2 -c 32 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

As you can see, the task and thread assignment syntax remains the same for srun, and as with a pure MPI application, you specify amplxe-cl as the task to execute; it will take care of distributing the a.out tasks between compute nodes. In this case VTune creates only two per-node results directories, named my_res_dir.<nodename>. The per-OpenMP-thread results will be aggregated in each per-node trace file.

One downside of this approach is that VTune will analyze every task and create results for each MPI rank in the job. That is not a problem when a job is distributed among a few ranks, but with hundreds or thousands of tasks you might end up with an enormous amount of performance data and an extremely long analysis finalization. In that case you might want to collect a profile against a single MPI rank or a subset of ranks, leveraging the multiple program configuration of srun. This approach is described in the article [3].

To use this approach, you need to create a separate configuration file that defines which MPI ranks will be analyzed; for example, to profile only the last of 1024 ranks:

$ cat srun_config.conf
0-1022 ./a.out
1023 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

And in the job script the srun line will look like the following:

srun -n 1024 -c 32 --multi-prog ./srun_config.conf

 

Application Level Placement Scheduler* (ALPS) and aprun command

With ALPS, running VTune via the aprun command is similar to the Slurm/srun experience. Just make sure you use the --trace-mpi option so that VTune keeps one collector instance on each node with multiple MPI ranks.

For a pure MPI application your job script for VTune analysis might look like the following [4]:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 32 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-n: number of processes

-N: number of processes per node

In the case of a hybrid parallelization approach combining MPI and OpenMP:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 32 -N 2 -d 8 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-d: depth, or the number of CPUs assigned per process

If you’d like to analyze just one node, you need to modify the script to run multiple executables:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 16 ./a.out : -n 16 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

 

Known limitations on VTune Amplifier collection in Cray XC systems

1. By default the Cray compiler produces static binaries. The general recommendation is to use dynamic linking for profiling under VTune Amplifier where possible, to avoid a set of limitations that the tool has when profiling static binaries. If dynamic linking cannot be applied, the following VTune Amplifier limitations should be taken into account:

a) PIN-based analysis types don’t work with static binaries out of the box, reporting the following message:

Error: Binary file of the analysis target does not contain symbols required for profiling. See the 'Analyzing Statically Linked Binaries' help topic for more details.

This impacts the hotspots, concurrency, and locks-and-waits collections, and also the memory-access collection with memory object instrumentation. See https://software.intel.com/en-us/node/609433 for how to work around the issue.

 

b) PMU-based analysis crashes on static binaries with the OpenMP RTL from the 2017 Gold and earlier Intel compiler versions.

To work around the issue, use a wrapper script that unsets the following variables:

unset INTEL_LIBITTNOTIFY64
unset INTEL_ITTNOTIFY_GROUPS

The issue was fixed in Intel OpenMP RTL that is a part of Intel Compiler 2017 Update 1 and later.

 

c) Collection of information based on the User API will not be available, including user pauses, resumes, frames, and tasks defined by the user in their source code; OpenMP instrumentation-based statistics such as serial time vs. parallel time and imbalance on barriers; and MPI rank number capturing to enrich process names with rank numbers.

 

2. If the VTune result directory is placed on a file system projected by Cray DVS, VTune emits an error that the result cannot be finalized.

To work around the issue, place the VTune result directory on a file system without Cray DVS projection (scratch, etc.) using the '-r' VTune command-line option.

 

3. It is required to add PMI_NO_FORK=1 to the application environment to make MPI profiling work and to avoid MPI application hangs under profiling.
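
For example, add an export to the job script before the srun or aprun line:

export PMI_NO_FORK=1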

 

Analyzing data with VTune Amplifier

VTune Amplifier provides powerful visual tools for multi-process, multithreaded, and single-threaded performance analysis. In most cases it is better to use the VTune GUI for opening collected results, although the command-line tool has very similar result-reporting functionality. To do so, log in to a login node, load the vtune module, and launch the VTune GUI:

$ module load vtune
$ amplxe-gui

In the GUI you need to open the .amplxe project file in the appropriate results directory created during data collection. The VTune Amplifier GUI exposes a lot of graphical controls and objects, so performance-wise it is better for remote users to run an in-place X server and open a client X-Window session using VNC* or NX* [5] software.

 

*Other names and brands may be claimed as the property of others.

 

References

[1] http://docs.cray.com/PDF/XC_Series_Software_Installation_and_Configuration_Guide_CLE60UP02_S-2559.pdf

[2] https://cug.org/proceedings/cug2016_proceedings/includes/files/pap127.pdf

[3] Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun

[4] http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=;f=man/alpsm/10/cat1/aprun.1.html

[5] https://en.wikipedia.org/wiki/NX_technology

Intel® Clear Containers 2: Using Clear Containers with Docker


Download PDF

Introduction

This article describes multiple ways to get started using Intel® Clear Containers on a variety of operating systems. It is written for an audience that is familiar with Linux* operating systems, basic command-line shell usage, and has some familiarity with Docker*. We'll do an installation walk-through that explains the steps as we take them.

This article is the second in a series of three. The first article introduces the concept of Intel Clear Containers technology and describes how it fits into the overall landscape of current container-based technologies.

Requirements

You will need a host upon which to run Docker and Intel Clear Containers. As described below, the choice of OS is up to you, but the host you choose has some prerequisites:

  • Ideally, for realistic performance, you would want to use a physical host. If you do, then the following should be true as well:
    • It must be capable of using Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64, and Intel® Architecture (Intel® VT-x).
    • Intel VT-x must be enabled in the system BIOS.
  • You can use a kernel-based virtual machine (KVM), with nested virtualization, to try out Intel Clear Containers. Note that the physical host you are running the KVM instance on should satisfy the above conditions as well. It might work on a less-functional system, but ... no guarantees.

Two Paths: Clear Linux* for Intel® Architecture or Your Own Distribution

You have your choice of operating system to install on your host. You can either use Intel Clear Containers in Clear Linux* for Intel® Architecture, or use another common Linux distribution of your choice. Intel Clear Containers do not behave or function differently on different operating systems, although the installation instructions differ. Detailed instructions exist for installing to CentOS* 7.2, Fedora* 24, and Ubuntu* 16.04.

Using the Clear Linux Project for Intel Architecture

Intel Clear Containers were developed by the team that develops the Clear Linux Project for Intel Architecture operating system distribution.

Intel Clear Containers are installed by default with Docker in the current version of Clear Linux. So, one way to get started with using Intel Clear Containers is to download and install Clear Linux. Instructions for installation on a physical host are available at https://clearlinux.org/documentation/gs_installing_clr_as_host.html.

For installation to a virtual machine, use the instructions at https://clearlinux.org/documentation/vm-kvm.html.

Software in Clear Linux is delivered in bundles; the Intel Clear Containers and Docker installation is contained in a bundle called containers-basic. Once you've installed your OS, it's very easy to add the bundle:

swupd bundle-add containers-basic

From there you could simply begin using Docker:

systemctl start docker-cor

This will start Docker with the correct runtime executable ('cor') as the execution engine, thus enabling Intel Clear Containers:

docker run -ti ubuntu (for example).

A complete production installation document, that includes directions for setting up user-level Docker control, is here for a physical host: https://github.com/01org/cc-oci-runtime/wiki/Installing-Clear-Containers-on-Clear-Linux.

Using a Common Linux* Distribution

If you are not using the Clear Linux for Intel Architecture distribution as your base operating system, it is possible to install to many common distributions. You will find guides for installation to CentOS 7.2, Fedora 24, and Ubuntu 16.04 (either Server or Desktop versions of these distributions will work just fine) at https://github.com/01org/cc-oci-runtime/wiki/Installation.

The essence of all of these installations follows the same basic flow:

  • Obtain the Intel Clear Containers runtime executable, called cc-oci-runtime. This is the Intel Clear Containers OCI-compatible runtime binary that is responsible for launching a qemu-lite process and integrating filesystem and network pieces.
  • Handle additional dependencies and configuration details that the base OS may be lacking.
  • Upgrade (or fresh-install) the local Docker installation to 1.12, which is the version that supports OCI and replaceable execution engines.
  • Configure the Docker daemon to use the Intel Clear Containers execution engine by default.

The repository instructions given are tailored to the specific distribution setups, but given this general workflow and some knowledge of your own particular distribution, just about any common distribution could be adapted to do the same without too much additional effort.

Installation Walk-Through: Ubuntu* 16.04 and Docker

This section will walk through the installation of Intel Clear Containers for Ubuntu 16.04, in detail. It will follow the instructions linked to above, so it might help to have the Ubuntu installation guide open as you read. It’s located at https://github.com/01org/cc-oci-runtime/wiki/Installing-Clear-Containers-on-Ubuntu-16.04

I'll be giving some context and explanations to the instructions as we go along, though, which may be helpful if you are adapting to your own distribution.

Note for Proxy Users

If you require the use of a proxy server for connections to the Internet, you'll want to pay attention to specific items called out in the following discussion. For most of this, it is sufficient to set the following proxy variables in your running shell, replacing the all-caps values as needed. This should be a familiar format for most that have to use these services.

# export http_proxy=http://[USER:PASSWORD@]PROXYHOST:PORT/
# export https_proxy=http://[USER:PASSWORD@]PROXYHOST:PORT/
# export no_proxy=localhost,127.0.0.0/24,*.MY.DOMAIN

Install the Intel Clear Containers Runtime

The first step is to obtain and install the Intel Clear Containers runtime, as described above. For Ubuntu, there is a package available, which can be downloaded and installed, but we have to resolve a simple dependency first.

sudo apt-get install libpixman-1-0

[Note: This seems like an odd dependency, and it is. However, there's certain pieces of the qemu-lite executable that can't easily be removed; this is a holdover dependency from the larger QEMU-KVM parent. It's very low-overhead and should be resolved in a later release of Intel Clear Containers.]

Now we'll add a repository service that has the runtime that we're after, as well as downloading the public key for that repository so that the Ubuntu packaging system can verify the integrity of the packages we download from it:

sudo sh -c "echo 'deb http://download.opensuse.org/repositories/home:/clearlinux:/preview:/clear-containers-2.0/xUbuntu_16.04/ /'>> /etc/apt/sources.list.d/cc-oci-runtime.list"
wget http://download.opensuse.org/repositories/home:clearlinux:preview:clear-containers-2.0/xUbuntu_16.04/Release.key
sudo apt-key add Release.key
sudo apt-get update
sudo apt-get install -y cc-oci-runtime

Configure OS for Intel Clear Containers

As of this writing, Section 3 of the installation instructions suggests the installation of additional kernel drivers and a reboot of your host at this point in the procedure. This is to acquire the default storage driver, ‘aufs’, for Docker.

However, there is a more up-to-date alternative called ‘overlay2’, and therefore this step is unnecessary with the addition of one small configuration change, which is detailed below. For now, it is okay to simply skip Section 3 and the installation of the “Linux kernel extras” packages.

One more thing remains in this step. Clear Linux for Intel Architecture updates very frequently (as often as twice a day) to stay ahead of security exploits and to be as up-to-date as possible in the open source world. Due to the Ubuntu packaging system, it's pretty certain that the mini-OS that's included as part of this package is ahead of where Ubuntu thinks it is. We need to update the OS to use the current mini-OS:

cd /usr/share/clear-containers/
sudo rm clear-containers.img
sudo ln -s clear-*-containers.img clear-containers.img
sudo sed -ie 's!"image":.*$!"image": "/usr/share/clear-containers/clear-containers.img",!g' /usr/share/defaults/cc-oci-runtime/vm.json

Install Docker* 1.12

As of this writing, even though Ubuntu 16.04 makes Docker 1.12 available as default for the OS, the packaging of it assumes the use of the native runtime. Therefore, it is necessary to install separate pieces directly from dockerproject.org rather than taking the operating system packaging.

Similarly to the above installation of the cc-oci-runtime, we're going to add a repository, add the key for the repo, and then perform installation from the repository.

sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
sudo sh -c "echo 'deb https://apt.dockerproject.org/repo ubuntu-xenial main'>> /etc/apt/sources.list.d/docker.list"
sudo apt-get update
sudo apt-get purge lxc-docker

[Note for proxy users: the second command above will not work with just the usual proxy environment variables that we discussed above. You'll need to modify the command as shown here:

sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --keyserver-options http-proxy=http://[USER:PASSWORD@]HOST:PORT/ --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
]

Before we install, we should check the current versions available. As of this writing the most current version available is docker-engine=1.12.3-0~xenial.

Install the version specified in the instructions even if it is older! More current versions may or may not have been well-integrated with the OS distribution in question. Intel Clear Containers is under active development and improvement, so not all versions will work everywhere. (As stated earlier...the best place to ensure you've got an up-to-date and working Intel Clear Containers installation is in the Clear Linux for Intel Architecture Project distribution.)

You can look at the list of versions available with:

apt-cache policy docker-engine

There are newer versions that we shouldn't use. We'll have to specify the version the instructions tell us to:

sudo apt-get install docker-engine=1.12.1-0~xenial

Configure Docker Startup for use of Intel Clear Containers

Ubuntu 16.04 uses systemd for system initialization, therefore most of what's remaining is to make some alterations in systemd with regard to Docker startup. The following instructions will override the default startup and ensure use of the cc-oci-runtime.

sudo mkdir -p /etc/systemd/system/docker.service.d/

Edit a file in that directory (as root) called clr-containers.conf. Make it look like this:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -D -s overlay2 --add-runtime cor=/usr/bin/cc-oci-runtime --default-runtime=cor

This is a systemd directive file for the docker service; it specifies the command-line options to the dockerd processes that will force it to use Intel Clear Containers instead of its native service.

Note also the addition of the ‘-s overlay2’ flag, which is not in the instructions (at time of this writing). This tells the Docker daemon to use the ‘overlay2’ storage driver in preference to ‘aufs’. This is the recommended storage driver to use for kernels of version 4.0 or greater.

Now we need to make sure systemd recognizes the change, and then restart the service:

sudo systemctl daemon-reload
sudo systemctl restart docker
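
To confirm that Docker picked up the new default runtime, you can inspect the daemon configuration (a quick check; the exact output wording varies by Docker version):

sudo docker info | grep -i runtime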

Note that I've skipped some additional, optional configuration that's called out in the installation document. This additional configuration is to allow for large numbers of Intel Clear Containers to run on the same machine. Without performing this optional action, you will be limited on how many containers can run simultaneously. See section 6.1 of the instruction document if you want to remove this limitation.

Ready to Run

At this point you should be able to run Docker container startup normally:

sudo docker run -ti ubuntu

This will give you a command prompt on a simple Ubuntu container. You can log in separately and see the qemu-lite process running, like this (the container is running in the background window, the process display is in the foreground).
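
From a second shell on the host, you can confirm that a qemu-lite process is backing the container, for example:

ps -ef | grep -i qemu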

Summary

Now you have everything you need to take Intel Clear Containers for a test drive. For the most part, it will behave just like any other Docker installation. As shown, integration with Docker Hub and the huge library of container images present there is open to use by Intel Clear Containers.

This has been the second of a three-part series on Intel Clear Containers. In the final article, I'll dive into the technology a bit more, exploring some of the major engineering tradeoffs that have been made, and where development is likely headed in upcoming releases. I'll also discuss the use of Intel Clear Containers in various orchestration tools besides Docker.

Read the first article in the series: Intel® Clear Containers 1: The Container Landscape

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for the Intel Developer Relations Division. Before that, he worked in the Intel Open Source Technology Center (OTC), on both Intel Clear Containers and the Clear Linux Project for Intel Architecture. He’d be happy to hear from you about this article at: jim.chamings@intel.com.

Intel® ISA-L: Cryptographic Hashes for Cloud Storage


Download Code Sample

Download PDF

Introduction

Today’s new devices generate data that requires centralized storage and access everywhere, thus increasing the demand for more and faster cloud storage. At the point where data is collected and packaged for the cloud, improvements in data processing performance are important. Intel® Intelligent Storage Acceleration Library (Intel® ISA-L), with the ability to generate cryptographic hashes extremely fast, can improve data encryption performance. In this article, a sample application that includes downloadable source code will be shared to demonstrate the utilization of the Intel® ISA-L cryptographic hash feature. The sample application has been tested on the hardware and software configuration presented in the table below. Depending on the platform capability, Intel ISA-L can run on various Intel® processor families; improvements are obtained by speeding up computations through the use of vectorized (SIMD) instruction sets.

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • # of cores per chip: 22 (only used single core)
  • # of sockets: 2
  • Chipset: Intel® C610 chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel® Speed Step Technology enabled
  • Intel® Turbo Boost Technology disabled

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Plus Intel® SSD P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux kernel 4.4.0-21-generic

Why Use Intel® ISA-L?

Intel ISA-L can generate cryptographic hashes quickly by utilizing Single Instruction Multiple Data (SIMD) instructions. The cryptographic functions are part of a separate collection within Intel ISA-L and can be found in the GitHub repository 01org/isa-l_crypto. To demonstrate this multithreaded hashing feature, this article simulates a sample “producer-consumer” application. A variable number (from 1 to 16) of “producer” threads will fill a single buffer with data chunks, while a single “consumer” thread will take data chunks from the buffer and calculate cryptographic hashes using Intel ISA-L’s implementations. For this demo, a developer can choose the number of threads (producers) submitting data (2, 4, 8, or 16) and the type of hash (MD5, SHA1, SHA256, or SHA512). The example will produce output that shows the utilization of the “consumer” thread and the overall wall-clock time.

Prerequisites

Intel ISA-L has known support for Linux* and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Program Options" library and headers

    sudo apt-get update
    sudo apt-get install gcc g++ make cmake git autogen autoconf automake yasm nasm libtool libboost-all-dev

  2. You also need the latest version of isa-l_crypto. The get_libs.bash script can be used to get it; the script downloads the library from its official GitHub repository, builds it, and installs it in ./libs/usr.

    bash ./libs/get_libs.bash

  3. Build from the `ex3` directory:

    mkdir <build-dir>
    cd <build-dir>
    cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD
    make

Getting Started with the Sample Application

The download button for the source code is provided at the beginning of the article. The sample application contains the following:

This example goes through the work flow at a high level and focuses in detail on the consumer code found in the “consumer.cpp” and “hash.cpp” files:

Setup

1. In the “main.cpp” file, we first parse the arguments coming from the command line and display the options that are going to be performed.

int main(int argc, char* argv[])
{
     options options = options::parse(argc, argv);
     display_info(options);

2. From the “main.cpp” file, we construct a `shared_data` object from the command-line options.

shared_data data(options);

In “shared_data.cpp”, the `shared_data` class implements the shared buffer that is written to by the producers and read by the consumer, as well as the means to synchronize those reads and writes.
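The member functions used throughout this walkthrough (mutex(), cv(), ready_read(), first_chunck_ready_read(), get_chunk(), mark_ready_write()) suggest an interface along the lines of the sketch below. It is an assumption-laden outline built only from those calls, not the sample’s actual implementation; in particular, the bookkeeping of chunk states is simplified.

#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class shared_data
{
public:
    shared_data(std::size_t chunks, std::size_t chunk_size)
        : m_buffer(chunks * chunk_size), m_ready(chunks, false), m_chunk_size(chunk_size)
    {
    }

    std::mutex&              mutex() { return m_mutex; }
    std::condition_variable& cv() { return m_cv; }

    // True if at least one chunk has been filled by a producer.
    bool ready_read() const
    {
        for (bool r : m_ready)
            if (r)
                return true;
        return false;
    }

    // Index of the first readable chunk, or -1 if none.
    int first_chunck_ready_read() const
    {
        for (std::size_t i = 0; i < m_ready.size(); ++i)
            if (m_ready[i])
                return static_cast<int>(i);
        return -1;
    }

    const uint8_t* get_chunk(int idx) const { return m_buffer.data() + idx * m_chunk_size; }

    void mark_ready_read(int idx) { m_ready[idx] = true; }   // assumed to be called by producers
    void mark_ready_write(int idx) { m_ready[idx] = false; } // called by the consumer

private:
    std::vector<uint8_t>    m_buffer;
    std::vector<bool>       m_ready;
    std::size_t             m_chunk_size;
    std::mutex              m_mutex;
    std::condition_variable m_cv;
};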

Parsing the command-line options

3. In the options.cpp file, the program parses the command line arguments using `options::parse()`.
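The exact code lives in options.cpp; purely as an illustration, a parse() routine of this kind can be built on Boost.Program_Options (one of the listed prerequisites) roughly as follows. The field names mirror those referenced later in consumer.cpp (producers, function, chunk_size, iterations), while the command-line option names are assumptions for this sketch; the real program, for instance, accepts sizes such as “50MB” as strings.

#include <boost/program_options.hpp>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>

struct options
{
    unsigned      producers  = 2;         // number of producer threads
    std::string   function   = "sha256";  // md5 | sha1 | sha256 | sha512
    std::uint64_t chunk_size = 0;         // bytes per chunk
    std::uint64_t iterations = 0;         // number of chunks to hash in total

    static options parse(int argc, char* argv[])
    {
        namespace po = boost::program_options;
        options opt;

        po::options_description desc("Options");
        desc.add_options()
            ("help", "display this message")
            ("producers", po::value<unsigned>(&opt.producers), "number of producer threads")
            ("hash", po::value<std::string>(&opt.function), "hash function to use")
            ("chunk-size", po::value<std::uint64_t>(&opt.chunk_size), "size of each chunk in bytes")
            ("iterations", po::value<std::uint64_t>(&opt.iterations), "number of chunks to hash");

        po::variables_map vm;
        po::store(po::parse_command_line(argc, argv, desc), vm);
        po::notify(vm);

        if (vm.count("help"))
        {
            std::cout << desc << "\n";
            std::exit(0);
        }
        return opt;
    }
};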

Create the Producer

4. In the “main.cpp” file, we then create the producers and call their `producer::run()` method in a new thread (`std::async` with the `std::launch::async` launch policy is used for that).

for (uint8_t i = 0; i < options.producers; ++i)
       producers_future_results.push_back(
            std::async(std::launch::async, &producer::run, &producers[i]));

In the “producer.cpp” file, each producer is assigned one chunk id (stored in m_id) into which it will write data.

On each iteration, we:

  • wait until our chunk is ready_write, then fill it with data.
  • sleep for the appropriate amount of time to simulate the time it could take to generate data.

The program generates only very simple data: each chunk is filled repeatedly with a single random character (returned by random_data_generator::get()). See the “random_data_generator.cpp” file for more details.
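As an illustration of that description (the real random_data_generator.cpp may differ), a generator with this behavior can be as small as the following sketch; only the class and member names are taken from the article, the rest is assumed.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

class random_data_generator
{
public:
    // Returns one random lowercase character per call.
    static uint8_t get()
    {
        static std::mt19937 engine{std::random_device{}()};
        static std::uniform_int_distribution<int> dist{'a', 'z'};
        return static_cast<uint8_t>(dist(engine));
    }
};

// A producer fills its whole chunk with the single character returned by get().
inline void fill_chunk(std::vector<uint8_t>& chunk)
{
    std::fill(chunk.begin(), chunk.end(), random_data_generator::get());
}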

5. In the “main.cpp” file, the program stores a `std::future` object for each producer’s thread. Each std::future object provides a way to access the results of the thread once it is done and to wait synchronously for the thread to finish. The threads do not return any data.

std::vector<std::future<void>> producers_future_results;

Create the Consumer and start the hashing for the data

6. In the “main.cpp” file, the program then creates only one consumer and calls its `consumer::run()` method.

consumer consumer(data, options);
consumer.run();

In the “consumer.cpp” file, the consumer will repeatedly:

  • wait for some chunks of data to be ready_read (m_data.cv().wait_for).
  • submit each of them to be hashed (m_hash.hash_entire).
  • mark those chunks as ready_write (m_data.mark_ready_write).
  • wait for the jobs to be done (m_hash.hash_flush).
  • unlock the mutex and notify all waiting threads, so the producers can start filling the chunks again

When all the data has been hashed, we display the results, including the consumer thread usage. The usage is computed by comparing the time spent waiting for chunks to become readable with the total time the consumer thread spent running.

consumer::consumer(shared_data& data, options& options)
    : m_data(data), m_options(options), m_hash(m_options.function)
{
}

void consumer::run()
{
    uint64_t hashes_submitted = 0;

    auto start_work    = std::chrono::steady_clock::now();
    auto wait_duration = std::chrono::nanoseconds{0};

    while (true)
    {
        auto start_wait = std::chrono::steady_clock::now();

        std::unique_lock<std::mutex> lk(m_data.mutex());

        // We wait for at least 1 chunk to be readable
        auto ready_in_time =
            m_data.cv().wait_for(lk, std::chrono::seconds{1}, [&] { return m_data.ready_read(); });

        auto end_wait = std::chrono::steady_clock::now();
        wait_duration += (end_wait - start_wait);

        if (!ready_in_time)
        {
            continue;
        }

        while (hashes_submitted < m_options.iterations)
        {
            int idx = m_data.first_chunck_ready_read();

            if (idx < 0)
                break;

            // We submit each readable chunk to the hash function, then mark that chunk as writable
            m_hash.hash_entire(m_data.get_chunk(idx), m_options.chunk_size);
            m_data.mark_ready_write(idx);
            ++hashes_submitted;
        }

        // We unlock the mutex and notify all waiting thread, so the producers can start filling the
        // chunks again
        lk.unlock();
        m_data.cv().notify_all();

        // We wait until all hash jobs are done
        for (int i = 0; i < m_options.producers; ++i)
            m_hash.hash_flush();

        display_progress(m_hash.generated_hashes(), m_options.iterations);

        if (hashes_submitted == m_options.iterations)
        {
            auto end_work      = std::chrono::steady_clock::now();
            auto work_duration = (end_work - start_work);

            std::cout << "[Info   ] Elasped time:          ";
            display_time(work_duration.count());
            std::cout << "\n";
            std::cout << "[Info   ] Consumer thread usage: "<< std::fixed << std::setprecision(1)<< (double)(work_duration - wait_duration).count() / work_duration.count() *
                             100<< " %\n";

            uint64_t total_size = m_options.chunk_size * m_options.iterations;
            uint64_t throughput = total_size /
                                  std::chrono::duration_cast<std::chrono::duration<double>>(
                                      work_duration - wait_duration)
                                      .count();

            std::cout << "[Info   ] Hash speed:            "<< size::to_string(throughput)<< "/s ("<< size::to_string(throughput, false) << "/s)\n";

            break;
        }
    }
}

The “hash.cpp” file provides a simple common interface to the md5/sha1/sha256/sha512 hash routines.

hash::hash(hash_function function) : m_function(function), m_generated_hashes(0)
{
    switch (m_function)
    {
        case hash_function::md5:
            m_hash_impl = md5(&md5_ctx_mgr_init, &md5_ctx_mgr_submit, &md5_ctx_mgr_flush);
            break;
        case hash_function::sha1:
            m_hash_impl = sha1(&sha1_ctx_mgr_init, &sha1_ctx_mgr_submit, &sha1_ctx_mgr_flush);
            break;
        case hash_function::sha256:
            m_hash_impl =
                sha256(&sha256_ctx_mgr_init, &sha256_ctx_mgr_submit, &sha256_ctx_mgr_flush);
            break;
        case hash_function::sha512:
            m_hash_impl =
                sha512(&sha512_ctx_mgr_init, &sha512_ctx_mgr_submit, &sha512_ctx_mgr_flush);
            break;
    }
}


void hash::hash_entire(const uint8_t* chunk, uint len)
{
    submit_visitor visitor(chunk, len);
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


void hash::hash_flush()
{
    flush_visitor visitor;
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


uint64_t hash::generated_hashes() const
{
    return m_generated_hashes;
}

7. Once `consumer::run` is done and control returns to the main program, the program waits for each producer to finish by calling `std::future::wait()` on each `std::future` object.

for (const auto& producer_future_result : producers_future_results)
        producer_future_result.wait();

Execute the Sample Application

In this example, the program generates data in N producer threads and hashes the data using a single consumer thread. The output shows whether the consumer thread can keep up with the N producer threads.

Configuring the tests

Speed of data generation

Since this is not a real-world application, the data generation can be made almost as fast or as slow as we want. The “--speed” argument is used to choose how fast each producer generates data.

If “--speed 50MB” is used, each producer thread takes 1 second to generate a 50 MB chunk.

The faster the speed, the less time the consumer thread will have to hash the data before new chunks are available. This means the consumer thread usage will be higher.

Number of producers

The “--producers” argument is used to choose the number of producer threads that concurrently generate and submit data chunks.

Important note: on each iteration, the consumer thread will submit at most that number of chunks of data to be hashed. So, the higher the number, the more opportunity there is for isa-l_crypto to run more hash jobs at the same time, which also influences the measured consumer thread usage.

Chunk size

Each producer fills a chunk of this size on each iteration, and the consumer then submits chunks of this size to the hash function.

The “--chunk-size” argument is used to choose that value.

This is a very important value, as it directly affects how long each hash job will take.

Total size

This is the total amount of data to be generated and hashed. From this and the other parameters, the program knows how many chunks will be generated in total and how many hash jobs will be submitted (the total size divided by the chunk size).

With the “--total-size” argument, it is important to pick a value large enough (compared to the chunk size) that a large number of jobs is submitted, in order to cancel out some of the noise in measuring the time taken by those jobs.

The results

[Info ] Elasped time: 2.603 s
[Info ] Consumer thread usage: 42.0 %
[Info ] Hash speed: 981.7 MB/s (936.2 MiB/s)

Elapsed time

This is the total time taken by the whole process

Consumer thread usage

We compare the time spent waiting for chunks of data to become available with the total time the consumer thread has been running.

Any value lower than 100% shows that the consumer thread was able to keep up with the producers and had to wait for new chunks of data. For example, the results above (42.0% usage over 2.603 s of elapsed time) mean the consumer spent roughly 1.1 seconds hashing and the rest waiting for data.

A value very close to 100% shows that the consumer thread was consistently busy and was not able to outrun the producers.

Hash speed

This is the effective speed at which the isa-l_crypto functions hashed the data. The clock for this starts running as soon as at least one data chunk is available, and stops when all these chunks have been hashed.

Running the example

Running this example (“ex3”) with the taskset command pinned to cores 3 and 4 should give output like the following:

The program runs as a single thread on core number 3; ~55% of its time is spent waiting for the producer to submit data.

Running the program with the taskset command on cores 3 to 20, with 16 producer threads, should give the following output:

The program runs as sixteen threads on core numbers 3 to 19, and only ~2% of its time is spent waiting for the producers to submit data.

Notes: 2x Intel® Xeon® processor E5-2699v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4), 22x per CPU socket. Performance measured by the written sample application in this article.

Conclusion

As demonstrated in this quick tutorial, the hash function feature can be applied to any storage application. The source code for the sample application is also provided for your reference. Intel ISA-L gives storage developers a library they can quickly adopt for their specific applications running on Intel® Architecture.

Other Useful Links

Authors

Thai Le is a Software Engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an Application Engineer focusing on Cloud Computing within the Software Services Group at Intel Corporation (UK).

Notices

System configurations, SSD configurations and performance tests conducted are discussed in detail within the body of this paper. For more information go to http://www.intel.com/content/www/us/en/benchmarks/intel-product-performance.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel® ISA-L: Semi-Dynamic Compression Algorithms


Download Code Sample

Download PDF

Introduction

Compression algorithms traditionally use either a dynamic or static compression table. Those who want the best compression results use a dynamic table at the cost of more processing time, while the algorithms focused on throughput will use static tables. The Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) semi-dynamic compression comes close to getting the best of both worlds. Testing shows the usage of semi-dynamic compression and decompression is only slightly slower than using a static table and almost as efficient as algorithms that use dynamic tables. This article's goal is to help you incorporate Intel ISA-L’s semi-dynamic compression and decompression algorithms into your storage application. It describes prerequisites for using Intel ISA-L, and includes a downloadable code sample, with full build instructions. The code sample is a compression tool that can be used to compare the compression ratio and performance of Intel ISA-L’s semi-dynamic compression algorithm on a public data set with the output of its open source equivalent, zlib*.

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • Number of cores per chip: 22 (only used single core)
  • Number of sockets: 2
  • Chipset: Intel® C610 series chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel SpeedStep® technology enabled
  • Intel® Turbo Boost Technology disabled
Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W
Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital (WD1002FAEX)

Plus Intel® SSD Data Center P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux* kernel 4.4.0-21-generic

Note: Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up the computations through the SIMD instruction set extensions available on the platform.

Why Use Intel® Intelligent Storage Library (Intel® ISA-L)?

Intel ISA-L can compress and decompress faster than zlib* with only a small sacrifice in compression ratio. This capability is well suited for high-throughput storage applications. This article includes a sample application that simulates a compression and decompression scenario; its output shows the resulting efficiency. Click the button at the top of this article to download it.
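Before walking through the sample, here is a minimal, single-buffer sketch of the core Intel ISA-L call involved, isal_deflate. It mirrors the streaming loop shown later in “bm_isal.cpp” but assumes the output buffer is large enough to hold the whole compressed result in one call; build it against isa-l (for example, with -lisal).

#include <isa-l.h>   // aggregate header; igzip_lib.h can be included directly instead
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const char text[] = "hello hello hello hello hello hello hello hello";
    std::vector<uint8_t> out(4096);

    isal_zstream stream;
    isal_deflate_init(&stream);
    stream.end_of_stream = 1;   // this is the only (and therefore last) block of input
    stream.flush         = NO_FLUSH;
    stream.next_in       = reinterpret_cast<uint8_t*>(const_cast<char*>(text));
    stream.avail_in      = static_cast<uint32_t>(std::strlen(text));
    stream.next_out      = out.data();
    stream.avail_out     = static_cast<uint32_t>(out.size());

    isal_deflate(&stream);      // one call suffices when avail_out is large enough

    std::printf("compressed %zu bytes into %u bytes\n",
                std::strlen(text),
                static_cast<unsigned>(out.size()) - stream.avail_out);
    return 0;
}

Decompression is symmetrical: initialize an inflate_state with isal_inflate_init and call isal_inflate, as the sample’s iter_inflate function shows later in this article.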

Prerequisites

Intel ISA-L supports Linux and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a C++14-compliant C++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Filesystem" library and headers
    • boost's "Program Options" library and headers
    • boost's "String Algo" headers

      >sudo apt-get update
      >sudo apt-get install gcc g++ make cmake git zlib1g-dev autogen autoconf automake yasm nasm libtool libboost-all-dev

  2. You also need the latest versions of isa-l and zlib. The get_libs.bash script can be used to get them; it downloads the two libraries from their official GitHub* repositories, builds them, and installs them in the `./libs/usr` directory.

    >`bash ./libs/get_libs.bash`

  3. Build from the `ex1` directory:
    • `mkdir <build-dir>`
    • `cd <build-dir>`
    • `cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD`
    • `make`

Getting Started with the Sample Application 

The sample application contains the following files:

Sample App

This example goes through the work flow at a high level and focuses on the “main.cpp” and “bm_isal.cpp” files:

Setup

1. In the “main.cpp” file, the program parses the command line and displays the options that are going to be performed.

int main(int argc, char* argv[])
{
     options options = options::parse(argc, argv);

Parsing the command-line options

2. In the options.cpp file, the program parses the command line arguments using `options::parse()`.

Create the benchmarks object

3. In the “main.cpp” file, the program adds a benchmark for each raw file and library/compression-level combination with the benchmarks::add_benchmark() function. Since the benchmarks do not run concurrently, only one file “pointer” is created.

benchmarks benchmarks;

// adding the benchmark for each file and library/level combination
for (const auto& path : options.files)
{
	auto compression   = benchmark_info::Method::Compression;
	auto decompression = benchmark_info::Method::Decompression;
	auto isal          = benchmark_info::Library::ISAL;
	auto zlib          = benchmark_info::Library::ZLIB;

	benchmarks.add_benchmark({compression, isal, 0, path});
	benchmarks.add_benchmark({decompression, isal, 0, path});

	for (auto level : options.zlib_levels)
	{
		if (level >= 1 && level <= 9)
		{
			benchmarks.add_benchmark({compression, zlib, level, path});
			benchmarks.add_benchmark({decompression, zlib, level, path});
		}
		else
		{
			std::cout << "[Warning] zlib compression level "<< level << "will be ignored\n";
		}
	}
}

Intel® ISA-L compression and decompression

4. In the “bm_isal.cpp” file, the program performs the compression and decompression on the raw file using a single thread. The key functions to note are isal_deflate and isal_inflate. Both functions accept a stream as an argument; this data structure holds the input buffer, the length in bytes of the input buffer, the output buffer, and the size of the output buffer. end_of_stream indicates whether this is the last iteration.

std::string bm_isal::version()
{
    return std::to_string(ISAL_MAJOR_VERSION) + "." + std::to_string(ISAL_MINOR_VERSION) + "." +
           std::to_string(ISAL_PATCH_VERSION);
}

bm::raw_duration bm_isal::iter_deflate(file_wrapper* in_file, file_wrapper* out_file, int /*level*/)
{
    raw_duration duration{};

    struct isal_zstream stream;

    uint8_t input_buffer[BUF_SIZE];
    uint8_t output_buffer[BUF_SIZE];

    isal_deflate_init(&stream);
    stream.end_of_stream = 0;
    stream.flush         = NO_FLUSH;

    do
    {
        stream.avail_in      = static_cast<uint32_t>(in_file->read(input_buffer, BUF_SIZE));
        stream.end_of_stream = static_cast<uint32_t>(in_file->eof());
        stream.next_in       = input_buffer;
        do
        {
            stream.avail_out = BUF_SIZE;
            stream.next_out  = output_buffer;

            auto begin = std::chrono::steady_clock::now();
            isal_deflate(&stream);
            auto end = std::chrono::steady_clock::now();
            duration += (end - begin);

            out_file->write(output_buffer, BUF_SIZE - stream.avail_out);
        } while (stream.avail_out == 0);
    } while (stream.internal_state.state != ZSTATE_END);

    return duration;
}

bm::raw_duration bm_isal::iter_inflate(file_wrapper* in_file, file_wrapper* out_file)
{
    raw_duration duration{};

    int                  ret;
    int                  eof;
    struct inflate_state stream;

    uint8_t input_buffer[BUF_SIZE];
    uint8_t output_buffer[BUF_SIZE];

    isal_inflate_init(&stream);

    stream.avail_in = 0;
    stream.next_in  = nullptr;

    do
    {
        stream.avail_in = static_cast<uint32_t>(in_file->read(input_buffer, BUF_SIZE));
        eof             = in_file->eof();
        stream.next_in  = input_buffer;
        do
        {
            stream.avail_out = BUF_SIZE;
            stream.next_out  = output_buffer;

            auto begin = std::chrono::steady_clock::now();
            ret        = isal_inflate(&stream);
            auto end   = std::chrono::steady_clock::now();
            duration += (end - begin);

            out_file->write(output_buffer, BUF_SIZE - stream.avail_out);
        } while (stream.avail_out == 0);
    } while (ret != ISAL_END_INPUT && eof == 0);

    return duration;
}

5. The program then runs all of the registered benchmarks by calling benchmarks.run(). When all compression and decompression tasks are complete, it displays the results on the screen and deletes all temporary files.

Execute the sample application

In this example, the program will run as a single thread through the compression and decompression functions of the Intel ISA-L and zlib.

Run

From the ex1 directory:

cd <build-dir>/ex1

./ex1 --help

Usage

Usage: ./ex1 [--help] [--folder <path>]... [--file <path>]... :
  --help                display this message
  --file path           use the file at 'path'
  --folder path         use all the files in 'path'
  --zlib-levels n,...   comma-separated list of compression levels [1-9]

  • --file and --folder can be used multiple times to add more files to the benchmark
  • --folder will look for files recursively
  • the default --zlib-levels value is 6

Test corpora are public data files designed to exercise compression and decompression algorithms and are available online (for example, the Calgary and Silesia corpora). The --folder option can be used to benchmark them easily: ./ex1 --folder /path/to/corpus/folder.

Running the example

As Intel CPUs integrate PCI-e* onto the package, it is possible to optimize access to solid-state drives (SSDs) and avoid a potential performance penalty when accesses cross an Intel® QuickPath Interconnect (Intel® QPI)/Intel® Ultra Path Interconnect (Intel® UPI) link. For example, in a two-socket (two CPU) system with a PCI-e SSD, the SSD is attached to one of the sockets. If the SSD is attached to socket 1 and the program accessing the SSD runs on socket 2, the requests and the data have to cross the Intel QPI/Intel UPI link that connects the sockets. To avoid this potential problem, find out which socket the PCI-e SSD is attached to and then set thread affinity so that the program runs on the same socket as the SSD. The following commands list the PCI-e devices attached to the system, filtering for ‘ssd’ in the output. For example:

lspci -vvv | grep -i ssd
cd /sys/class/pci_bus

PCI Identifier

05:00.0 is the PCI* identifier and can be used to get more details from within Linux.

cd /sys/class/pci_bus/0000:05/device

This directory includes a number of files that give additional information about the PCIe device, such as make, model, power settings, and so on. To determine which socket this PCIe device is connected to, use:

cat local_cpulist

The output returned looks like the following:

Output Return

Now we can use this information to set thread affinity, using taskset:

taskset -c 10 ./ex1..

For the `-c 10` option, this number can be anything from 0 to 21, as those are the core IDs for the socket this PCI-e SSD is attached to.

Running the application with the taskset command pinned to core number 10 should give the output below. If the system does not have a PCI-e SSD, the application can be run without the taskset command.

Compression Library

Program output displays a column for the compression library, either ‘isa-l’ or ‘zlib’. The table shows the compression ratio (compressed file size divided by raw file size) and the system and processor time taken to perform the operation. For decompression, only the elapsed time of the decompression operation is measured. All the data was produced on the same system.

Notes: 2x Intel® Xeon® processor E5-2699v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4), 22x per CPU socket. Performance measured by the written sample application in this article.

Conclusion

This tutorial and its sample application demonstrate one method through which you can incorporate the Intel ISA-L compression and decompression features into your storage application. The sample application’s output data shows there is a balancing act between processing time (CPU time) and disk space. It can help you determine which compression and decompression algorithm best suits your requirements, and then quickly adapt your application to take advantage of Intel® Architecture with Intel ISA-L.

Other Useful Links

Authors

Thai Le is a software engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an application engineer focusing on cloud computing within the Software Services Group at Intel Corporation (UK).

Notices

System configurations, SSD configurations and performance tests conducted are discussed in detail within the body of this paper. For more information go to http://www.intel.com/content/www/us/en/benchmarks/intel-product-performance.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

SR-IOV and OVS-DPDK Hands-on Labs


Background

This document provides context for the SR-IOV_DPDK_Hands-on_Lab project on the SDN-NFV-Hands-on-Samples GitHub repository. These files were used in two hands-on labs at the IEEE NFV SDN conference in Palo Alto, California on November 7, 2016. The first hands-on lab focused on configuring Single Root IO Virtualization (SR-IOV); the second described how to configure an NFV use case for Open vSwitch with Data Plane Development Kit (OVS-DPDK). Instructions on how to run the labs can be found in the presentations that we used — SR-IOV-HandsOn-IEEE.pdf and OVS-DPDK-Tutorial-IEEE.pdf respectively. Presentations were delivered on site and course attendees asked that we publish the scripts used during the presentation. We have modified the presentations to make the labs useful to those who weren't present at the event.

The virtual machines that attendees used during the lab were hosted on a remote cluster; the host machine hardware is no longer available, but a copy of the virtual machine can be downloaded here. (Note, the file is very large: 14GB. Before downloading, read this additional information about the image. If there is sufficient interest, I’ll create a recipe to re-create the VM from scratch.)

Introduction

This document and the accompanying script repository are for those who want to learn how to:

  • Automate setting up SR-IOV interfaces with DPDK in a Linux* libvirt/KVM environment
  • Use Open vSwitch* (OVS) and DPDK in a nested VM environment
  • Provision that configuration into a cluster or datacenter

These scripts are not intended for use in a production environment.

Cluster Configuration

The training cluster, which as noted above can be downloaded in VM form, had 14 compute nodes, each with the following configuration:

  • 64 GB RAM
  • Intel® Xeon® CPU E5-2697 v2 at 2.70 GHz (48 CPUs: 2 socket, 12 core, hyper-threading on)
  • 500 GB hard drive
  • One Intel® Ethernet Controller XL710 (4x10 GB) for lab communication
  • Additional 1 GB NIC for management and learner access

Additionally, each compute node had the following software configuration. In our training cluster, the compute nodes and Host VM systems were running Ubuntu* 16.04, and the nested Tenant and VNF VMs were running Fedora* 23. You may have to alter the instructions below to suit your Linux distribution. Distribution-agnostic steps to enable these features are (hopefully) listed in the accompanying SR-IOV and OVS + DPDK presentations, but here is the high-level overview:

  1. Virtualization switches enabled in the BIOS (for example, VT-d). Varies by machine and OEM.
  2. IOMMU enabled at kernel initialization. Typically, this involves two steps:
    1. Add intel_iommu=on to your kernel boot parameters. On Ubuntu 16.04, add that phrase to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.
    2. Update grub (for example, run sudo update-grub).
  3. Nested virtualization enabled:
    1. #rmmod kvm-intel
    2. #echo 'options kvm-intel nested=y'>> /etc/modprobe.d/dist.conf
    3. #modprobe kvm-intel
  4. A macvtap bridge on eno1. See Scott Lowe's blog Using KVM with Libvirt and macvtap Interfaces for more information. (This step isn’t necessary to run either the SR-IOV or the OVS-DPDK labs; it was necessary to allow learners to SSH from the cluster jump server directly into their assigned Host, Tenant, and VNF VMs.) Disregard all references to the macvtap bridge if you're using the downloadable VM, but the configuration might be useful if you wish to set up the hardware configuration used at the hands-on labs.
  5. The following Ubuntu software packages are needed to run the various scripts in this package. The names of these packages will obviously vary based on distribution:
    1. libvirt-bin: Provides virsh
    2. virtinst: Provides virt-clone
    3. libguestfs-tools: Provides virt-sysprep and guestfish
    4. lshw: Provides lshw
    5. python-apt: Provides Python* wrapper for apt so Ansible* can use apt
    6. python-selinux: Allows Ansible to work with SELinux
    7. ansible: Provides all Ansible commands and infrastructure
    8. python-libvirt: Provides the Python wrapper to libvirt
    9. libpcap-dev: Required to run pktgen
  6. The following software was installed from tarball on either the compute node, or the VMs:
    1. qemu-2.6.0: Used on Host VM to launch nested Tenant VM and VNF VM
    2. pktgen-3.0.14: Used in both SR-IOV and OvS + DPDK lab
    3. openvswitch-2.6.0: Used in OvS + DPDK lab
    4. dpdk-16.07: Used in both SR-IOV and OvS + DPDK lab
    5. libhivex0_1.3.13-1build3_amd64.deb: Required for libguestfs-tools
    6. libhivex-bin_1.3.13-1build3_amd64.deb: Required for libguestfs-tools

Each compute node had four KVM virtual machines on it, and each virtual machine had two nested VMs in it. Because of this, the first level of VMs were named HostVM-## (where ## is a two-digit number from 00 to 55), and the nested VMs were called TenantVM-## and VNF-VM-## (where ## is a number from 00 to 55 corresponding to the ## of the Host VM). For more detailed information about the Host VM virtual hardware configuration, see the HostVM-Master.xml KVM domain definition file in the repository. For more information about the Tenant VM and VNF VM configuration, see the QEMU launch command line for them in start_Tenant-VM.sh and start_VNF-VM.sh.

Because there was only one Intel Ethernet Controller XL710 (total of 4x10GB ports) in each compute node, and the lab involved connecting port 0 to port 1 and port 2 to port 3 in loopback mode, only half of the virtual machines were used during the SR-IOV hands-on lab. Note that the Intel Ethernet Controller XL710 supports up to 64 virtual functions per port (256 total on the four-port card). However, because the original intent of the lab was to compare the performance of virtual and physical functions, I chose to enable only one virtual function per physical port. For arbitrary reasons, only the even-numbered Host VMs (for example, HostVM-00, HostVM-02, … HostVM-54) were used during the SR-IOV hands-on lab. All Host VM systems were used during the Open vSwitch + DPDK hands-on lab. See Figure 1.


Figure 1: SR-IOV Hands-on: Compute Node Configuration

Quick Guide

The scripts and files in this repository can be classified into three groups. The first group is the files that were used at the IEEE NFV SDN conference for the SR-IOV Hands-on Lab. The second group was used at the same conference for the Open vSwitch + DPDK Hands-on Lab. The third classification of scripts and files is for everything that was necessary to provision these two labs across 14 compute nodes, 56 virtual machines, and 112 nested virtual machines. If you are reading this article, I suspect you are trying to learn how to do a specific task associated with these labs. Here is a quick guide that will help you find what you are looking for.

SR-IOV Hands-on Lab—Managing the Interfaces

Enabling an SR-IOV NIC in a KVM/libvirt virtual machine involves the following high-level tasks:

  1. Add a NIC that supports SR-IOV to a compute node that supports SR-IOV.
  2. Enable virtualization in the BIOS.
  3. Enable input output memory management unit (IOMMU) in the OS/VMM kernel.
  4. Initialize the NIC SR-IOV virtual functions.
  5. Create a libvirt network using the SR-IOV virtual functions.
  6. Insert an SR-IOV NIC into the VM and boot the VM.
  7. Shut down the VM and remove the SR-IOV NIC from the VM definition.

Tasks 1–3 were done prior to the lab. More details about these steps can be found in the SR-IOV-HandsOn-IEEE.pdf presentation. Tasks 4–7 were done live in the lab, and scripts to accomplish those steps are listed below together with an additional script to clean up previous runs through the lab.

Clean up Previous Runs Through the Lab

Initialize the NIC SR-IOV Virtual Functions

Create a libvirt Network Using the SR-IOV Virtual Functions

Insert an SR-IOV NIC into the VM and Boot the VM

Automating the insertion of the SR-IOV NICs and scaling this across the compute nodes and VMs was by far the biggest obstacle in preparing for the SR-IOV hands-on lab. Libvirt commands that I expected to work, and that would have easily been scripted, just didn’t work with SR-IOV resources. Hopefully, something here will make your life easier.

Shut Down the VM and Remove an SR-IOV NIC from the VM

SR-IOV Hands-on Lab—Learner Scripts

The SR-IOV lab had the following steps with numbered scripts that indicated the order in which they should be run. The learners ran these scripts inside the Host VM systems after the SR-IOV NICs had been inserted.

Free up Huge Page Resources for DPDK

Build DPDK and Start the Test Applications

OVS-DPDK NFV Use Case Hands-on Lab

The DPDK hands-on lab had these high-level steps:

  1. Initialize Open vSwitch and DPDK on the Host VM and add flows.
  2. Start the nested VNF VM and initialize DPDK.
  3. Start the nested Tenant VM and pktgen in the Tenant VM.
  4. Edit iofwd.c and run testpmd in the VNF VM.
  5. Clean up the DPDK lab resources.

Initialize Open vSwitch and DPDK on the Host VM and add Flows

Start the Nested VNF VM and Initialize DPDK

Start the Nested Tenant VM and pktgen in the Tenant VM

Edit iofwd.c and run testpmd in the VNF VM

Clean up the DPDK Lab Resources

Description of Files in the Repository

What follows is a description of the files in the repository. Because the text here is taken from comments in the code, you can also consult the scripts directly. To give you a visual of how the files in the repository are organized, here is a screenshot of the root directory with all subfolders expanded:

Provisioning Root Directory

  • HostVM-Master.xml: This file is the KVM domain definition of the first-level virtual machine that was used in the lab. A few things to note about this machine:
    • The guest expects to be allocated 12 GB of RAM from the compute node. If the compute node does not have sufficient memory, the VM will not initialize.
    • The VM expects eight virtual CPUs, running in pass-through mode. Note the topology in the <cpu> tag. The default libvirt configuration is to map the eight CPUs from the compute node to eight sockets in the guest; however, this is not ideal for DPDK performance. With <topology sockets='1' cores='8' threads='1'/>, we have a virtual machine with eight cores in one socket, and no hyper-threading. Ideally, these resources would be directly mapped to underlying CPU hardware (that is, the VM would get eight cores from the same socket), but this was not true in our setup.
    • The VM expects the image to be located at /var/lib/libvirt/images/HostVM-Master.qcow2.
    • There are two non-SR-IOV NICs in each image.
      • Source network: default. This NIC is attached to the default libvirt network/bridge. In the VM, this NIC is assigned the ens3 interface and obtains its IP address dynamically from the KVM VMM.
      • Source network: macvtap-net. This NIC expects a network called macvtap-net to be active on the compute node. In the VM, it is assigned the ens8 interface, and its IP address is assigned statically. (See /etc/network/interfaces in the HostVM image and http://blog.scottlowe.org/2016/02/09/using-kvm-libvirt-macvtap-interfaces/. This NIC is not necessary for the lab to work, but it was necessary for communication within the lab cluster, and to allow users to directly communicate with the Host VM from the cluster jump server.) Disregard this network if you are working with the downloadable VM.
    • No pass-through hostdev NICs are in the image by default. They are added by scripts.
  • ansible_hosts: This file contains information about and the Ansible inventory of compute nodes and VMs in the cluster. For more information see http://docs.ansible.com/ansible/intro_inventory.html.
  • user: This little file is very helpful (albeit insecure) when copied into the /etc/sudoers.d directory. It allows the user with the username ‘user’ to have sudo access without a password. In the SDN-NFV cluster, the primary user on all Host VM and nested Tenant VM and VNF VM instances was ‘user’.
  • ssh_config: This file is a concatenation of information generated at the end of the cloneImages.sh script. Its contents were copied into the /etc/ssh/ssh_config file on the cluster jump server, and it allowed students to SSH directly into their assigned Host VM, Tenant VM and VNF VM directly from the jump server. The primary thing to note in this file is the static IP addresses associated with the Host VM images. These static IP addresses were assigned during the VM cloning process (see cloneImages.sh), and they were attached to the macvtap NIC in each Host VM instance. The Tenant VM and VNF VM IP addresses are the same for every instance. You will not use this script if you are working with the downloadable VM.

Compute-node

The compute-node directory contains scripts and files that were copied to and run on each compute node in the cluster.

  • compute-node/00_cleanup_sr-iov.sh: This script shuts down the networks that are using SR-IOV and unloads the SR-IOV virtual functions (VF) from the compute node kernel. It assumes that the four ports on the Intel Ethernet Controller XL710 NIC have been assigned interfaces enp6s0f0-3 (PCI Bus 6, Slot 0, Function 0-3). To run it across all compute nodes, use the utility script found at utility/cn_00_cleanup_sr-iov.sh.
  • compute-node/01_create_vfs.sh: This script automates the host device (compute node) OS parts of creating the SR-IOV NICs. It creates only one virtual function for each of the four ports on the Intel Ethernet Controller XL710 NIC. (Note that the Intel Ethernet Controller XL710 supports up to 64 virtual functions per port (256 total on the four-port card). However, because the original intent of the lab was to compare the performance of virtual and physical functions, I chose to enable only one virtual function per physical port.) It assumes that the four ports have been assigned interfaces enp6s0f0-3 (PCI Bus 6, Slot 0, Function 0-3). To run it across all compute nodes, use the utility script found at utility/cn_01_create_vf.sh.
  • compute-node/02_load_vfs_into_libvirt.sh: This script creates pools of libvirt networks on the SR-IOV functions that have been defined in the previous step. It assumes that the XML files sr-iov-port0-3.xml are already present at $SR_IOV_NET_FILE_PATH. To run this script across all compute nodes, use the utility script found at utility/cn_02_load_vfs_into_libvirt.sh.

Host-vm

The host-vm directory contains the scripts that were intended to be run on the 56 first-level guest machines in the cluster. The moniker Host VM is an oxymoron, but came from the fact that originally these VMs were hosts to the nested VNF VMs and Tenant VMs used in the DPDK hands-on lab. Most of the scripts in this directory and in its subdirectories were designed to be run by the learners during the lab.

  • host-vm/07_unload_all_x710_drivers.sh: This script verifies that the caller is root, and then unloads from the kernel all drivers for the Intel Ethernet Controller XL710 virtual function (both kernel and DPDK) in the Host VM. It also unmounts the huge pages that DPDK was using. This script can be called during the lab by each learner, or the instructor can call it on each Host VM by using the provisioning-master/utility/hvm_07_unload_all_x710_drivers.sh on the Ansible controller.

host-vm/training

The training directory contains two subfolders: dpdk-lab and sr-iov-lab.

host-vm/training/dpdk-lab

The dpdk-lab folder contains all of the scripts and files used during the DPDK hands-on lab with the exception of the Tenant and VNF VM images, which were located at /home/usr/vm-images, and are not included with this package. However, the scripts that were used in the nested Tenant and VNF VMs are included in the Tenant VM and VNF VM sub-folders, respectively.

  • host-vm/training/dpdk-lab/00_setup_dpdk-lab.sh: This script is optional and runs the following scripts, simplifying the DPDK lab setup:
    1. 01_start_ovs.sh
    2. 02_createports_ovs.sh
    3. 03_addroutes_vm-vm.sh
    4. 04_start_VNF-VM.sh
    5. 05_start_TenantVM.sh
  • host-vm/training/dpdk-lab/01_start_ovs.sh: This script cleans up any existing Open vSwitch processes and resources, and then launches Open vSwitch with DPDK using one 1-GB huge page.
  • host-vm/training/dpdk-lab/02_createports_ovs.sh: This script creates the Open vSwitch bridge br0 and adds four DPDK enabled vhost-user ports to the bridge.
  • host-vm/training/dpdk-lab/03_addroutes_vm-vm.sh: This script clears any existing flows, sets traffic flows in the following pattern, and then dumps the flows for debugging purposes:
    1. In port 2, out port 3
    2. In port 3, out port 2
    3. In port 1, out port 4
    4. In port 4, out port 1
  • host-vm/training/dpdk-lab/04_start_VNF-VM.sh: This script launches the VNF VM using the start_VNF-VM.sh script in a background screen and then prints the IP address of the nested VM to stdout.
  • host-vm/training/dpdk-lab/05_start_TenantVM.sh: This script launches the Tenant VM using the start_Tenant-VM.sh script in a background screen and then prints the IP address of the nested VM to stdout.
  • host-vm/training/dpdk-lab/10_cleanup_dpdk-lab.sh: This script is optional and runs the following scripts to clean up the dpdk-lab and free its resources:
    1. 11_stop_VMs.sh
    2. 12_stop_ovs.sh
  • host-vm/training/dpdk-lab/11_stop_VMs.sh: This script starts an SSH session into the nested VNF and Tenant VMs and runs the 'shutdown -h now' command to shut them down gracefully. Note: This script may not be run as root.
  • host-vm/training/dpdk-lab/12_stop_ovs.sh: This script kills all Open vSwitch processes and cleans up all databases and shared memory. It then removes all dpdk drivers from the kernel and inserts i40e and ixgbe drivers into the kernel. Finally, it unmounts the 1 GB huge pages at /mnt/huge.
  • host-vm/training/dpdk-lab/dump-flows.sh: Displays the Open vSwitch flows on br0 every 1 second.
  • host-vm/training/dpdk-lab/dump_ports.sh: Displays the Open vSwitch ports on br0 every 1 second.
  • host-vm/training/dpdk-lab/start_Tenant-VM.sh: This script launches the Tenant VM using the Fedora 23 image found at /home/usr/vm-images/Fed23_TenantVM.img. The image is launched with 4 GB of RAM on 4-1 GB huge pages from the Host VM. It has three NICs, two of which use the DPDK-enabled vhostuser ports from Open vSwitch, and the remaining NIC is attached to a Tun/Tap bridge interface and assigned the static IP address 192.168.120.11 by the OS upon boot.
  • host-vm/training/dpdk-lab/start_VNF-VM.sh: This script launches the VNF VM using the Fedora 23 image found at /home/usr/vm-images/Fed23_VNFVM.img. The image is launched with 4 GB of RAM on 4-1 GB huge pages from the Host VM. It has three NICs, two of which use the DPDK-enabled vhostuser ports from Open vSwitch, and the remaining NIC is attached to a Tun/Tap bridge interface and assigned the static IP address 192.168.120.10 by the OS upon boot.

While the nested Tenant and VNF VM images are not included with this distribution, the key files from the DPDK hands-on lab are. The files of interest from the Tenant VM are located in the Tenant VM folder, and the files of interest from the VNF VM are located in the VNF VM folder.

  • host-vm/training/dpdk-lab/VNF-VM/home/user/.bash_profile: This is the .bash_profile from the VNF VM /home/user directory. It exports the following variables and then calls /root/setup_env.sh:
    • RTE_SDK: This is the location of DPDK. RTE stands for real time environment
    • RTE_TARGET: This is the type of DPDK build (64-bit native Linux app using gcc)
  • host-vm/training/dpdk-lab/VNF-VM/root/setup_env.sh: This script mounts the 1 GB huge pages that were allocated at system boot. Then it loads the igb_uio driver into the kernel and then binds PCI devices 00:04.0 and 00:05.0 (the DPDK-enabled vhostuser ports attached to Open vSwitch in the Host VM) to the igb_uio DPDK driver.
  • host-vm/training/dpdk-lab/VNF-VM/root/run_testpmd.sh: This script launches the working version of testpmd in case the learner wants to see how the lab is supposed to work. It locks DPDK to cores 2 and 3, and gives DPDK 512 MB of memory per socket. Then it starts testpmd with bursts of 64 packets and 2048 tx and rx descriptors.
  • host-vm/training/dpdk-lab/VNF-VM/root/dpdk-16.07/app/testpmd/iofwd.c: This is the source code file that contains the DPDK hands-on lab exercises. They are clearly marked with the phrase DPDK_HANDS_ON_LAB. Each step in the lab has hints, and the final source code solution is contained in the comments, thus allowing each learner to choose how much research to do in order to succeed. (A rough sketch of the kind of forwarding loop this file implements appears after this list.)
  • host-vm/training/dpdk-lab/TenantVM/home/user/.bash_profile: This is the .bash_profile from the Tenant VM /home/user directory. It exports the following variables, calls /root/setup_env.sh, and then launches pktgen using /root/start_pktgen.sh:
    • RTE_SDK: This is the location of DPDK. RTE stands for real time environment
    • RTE_TARGET: This is the type of DPDK build (64-bit native Linux app using gcc)
  • host-vm/training/dpdk-lab/TenantVM/root/setup_env.sh: This script mounts the 1 GB huge pages that were allocated at system boot. Then it loads the igb_uio driver into the kernel and then binds PCI devices 00:04.0 and 00:05.0 (the DPDK-enabled vhostuser ports attached to Open vSwitch in the Host VM) to the igb_uio DPDK driver.
  • host-vm/training/dpdk-lab/TenantVM/root/start_pktgen.sh: This script launches pktgen. The parameters allow DPDK to run on CPUs 2–6, allocate 512 MB of memory per socket, and specify that the huge page memory for this process should have a prefix of 'pg' to distinguish it from other applications that might be using the huge pages. The pktgen parameters specify that it should run in promiscuous mode and that:
    • Core 2 should handle port 0 rx
    • Core 4 should handle port 0 tx
    • Core 3 should handle port 1 rx
    • Core 5 should handle port 1 tx
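For orientation, the forwarding engine that iofwd.c implements boils down to a receive-and-retransmit burst loop built on DPDK’s rte_eth_rx_burst and rte_eth_tx_burst. The fragment below is only a rough sketch of that pattern against the DPDK ethdev API, not the testpmd source: the port and queue IDs, burst size, and stop flag are assumptions, EAL and port initialization are omitted, and in a real application the loop would be launched on a dedicated lcore.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static volatile bool force_quit = false;   // illustrative stop flag

static void io_forward(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf* pkts[32];

    while (!force_quit)
    {
        // Pull a burst of packets from queue 0 of the RX port...
        const uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, 32);
        if (nb_rx == 0)
            continue;

        // ...and push the same mbufs, untouched, out of queue 0 of the TX port.
        const uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);

        // Packets the TX queue could not accept must be freed by the caller.
        for (uint16_t i = nb_tx; i < nb_rx; ++i)
            rte_pktmbuf_free(pkts[i]);
    }
}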

host-vm/training/sr-iov-lab

The sr-iov-lab folder contains all of the scripts that were used during the SR-IOV hands-on lab.

  • host-vm/training/sr-iov-lab/00_show_net_info.sh: Lists pci info filtering for the phrase 'net'.
  • host-vm/training/sr-iov-lab/01_stop_VMs.sh: This script starts an SSH session into the nested VNF and Tenant VMs and runs the 'shutdown -h now' command to shut them down gracefully. Note: This script may not be run as root.
  • host-vm/training/sr-iov-lab/02_stop_ovs.sh: This script kills all Open vSwitch processes and cleans up all databases and shared memory. It then removes all dpdk drivers from the kernel and inserts i40e and ixgbe drivers into the kernel. Finally, it unmounts the 1 GB huge pages at /mnt/huge.
  • host-vm/training/sr-iov-lab/03_unload_dpdk.sh: This script explicitly binds the NICs at PCI addresses 00:09.0 and 00:0a.0 to the Intel® kernel drivers. This is necessary so that we can find out what MAC address has been allocated to the NIC ports.
  • host-vm/training/sr-iov-lab/04_build_load_dpdk_on_vf.sh: This script builds and configures DPDK on the target PCI/SR-IOV NIC ports. It:
    1. Determines the MAC address for the testpmd and pktgen interfaces
    2. Builds DPDK
    3. Loads the DPDK igb_uio driver into the kernel
    4. Mounts the huge page memory
    5. Binds the SR-IOV NICS at PCI address 00:09.0 and 00:0a.0 to the DPDK igb_uio driver
    6. Displays the MAC addresses for the SR-IOV NICS that were just bound to DPDK so that information can be used in the following steps of the lab.
  • host-vm/training/sr-iov-lab/05_build_start_testpmd_on_vf.sh: This script re-builds DPDK and then launches testpmd. It expects to have the MAC address of the pktgen Ethernet port passed as a command-line parameter. The pktgen MAC address should have been displayed at the end of the previous step. See comments in the script for more information about the testpmd and DPDK parameters.
  • host-vm/training/sr-iov-lab/06_build_start_pktgen_on_vf.sh: This script re-builds DPDK and then launches pktgen. See comments below for more details about parameters.
  • host-vm/training/sr-iov-lab/07_unload_all_x710_drivers.sh: This script verifies that the caller is root, and then unloads from the kernel all drivers for the Intel Ethernet Controller XL710 virtual function (both kernel and DPDK) in the Host VM. It also unmounts the huge pages that DPDK was using. This script can be called during the lab by each learner, or the instructor can call it on each Host VM by using the provisioning-master/utility/hvm_07_unload_all_x710_drivers.sh script.

Files

The files directory contains several files that were used during the SDN NFV hands-on lab.

  • files/Macvtap.xml: Contains the XML definition of the libvirt macvtap network. You will not use this file if you are working with the downloadable VM.
  • files/sr-iov-networks: This folder contains the XML files that defined the SR-IOV libvirt networks. See https://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs_in_a_libvirt_.3Cnetwork.3E_definition
    • files/sr-iov-networks/sr-iov-port0.xml: This file defines a pool of SR-IOV virtual function network interfaces that are attached to the enp6s0f0 physical function interface on the Intel Ethernet Controller XL710. The Intel Ethernet Controller XL710 supports up to 64 virtual function network interfaces in this pool. Libvirt exposes this pool as a virtual network from which virtual machines can allocate an SR-IOV virtual network interface by adding an interface whose source network is ‘sr-iov-port0’.
    • files/sr-iov-networks/sr-iov-port1.xml: This file defines a pool of SR-IOV virtual function network interfaces that are attached to the enp6s0f1 physical function interface on the Intel Ethernet Controller XL710. The Intel Ethernet Controller XL710 supports up to 64 virtual function network interfaces in this pool. Libvirt exposes this pool as a virtual network from which virtual machines can allocate an SR-IOV virtual network interface by adding an interface whose source network is ‘sr-iov-port1’.
    • files/sr-iov-networks/sr-iov-port2.xml: This file defines a pool of SR-IOV virtual function network interfaces that are attached to the enp6s0f2 physical function interface on the Intel Ethernet Controller XL710. The Intel Ethernet Controller XL710 supports up to 64 virtual function network interfaces in this pool. Libvirt exposes this pool as a virtual network from which virtual machines can allocate an SR-IOV virtual network interface by adding an interface whose source network is ‘sr-iov-port2’.
    • files/sr-iov-networks/sr-iov-port3.xml: This file defines a pool of SR-IOV virtual function network interfaces that are attached to the enp6s0f3 physical function interface on the Intel Ethernet Controller XL710. The Intel Ethernet Controller XL710 supports up to 64 virtual function network interfaces in this pool. Libvirt exposes this pool as a virtual network from which virtual machines can allocate an SR-IOV virtual network interface by adding an interface whose source network is ‘sr-iov-port3’.

Utility

The utility directory contains utility scripts that simplify key tasks of the SDN NFV hands-on lab. In general, they tie together YAML files from the roles directory with scripts from the scripts directory, making it unnecessary to have to enter full Ansible command lines for repeated tasks.

  • utility/aptInstall.sh: This helper script used the Ansible apt module to install packages on compute nodes and on the Host VMs, both of which ran Ubuntu 16.04. It used the aptInstall.yaml role found in ../roles. See the comments in aptInstall.yaml for more information.
  • utility/cn_00_cleanup_sr-iov.sh: This script uses Ansible to copy the 00_cleanup_sr-iov.sh script from the ../compute_node directory to the compute nodes and then executes it. See the documentation in 00_cleanup_sr-iov.sh and runScript.yaml for more information.
  • utility/cn_01_create_vf.sh: This script uses Ansible to copy the 01_create_vfs.sh script from the ../compute_node directory to the compute nodes and then executes it. See the documentation in 01_create_vfs.sh and runScript.yaml for more information.
  • utility/cn_02_load_vfs_into_libvirt.sh: This script runs the Ansible load_vfs_into_libvirt.yaml role from ../roles on the compute_nodes. See the load_vfs_into_libvirt.yaml documentation for more information.
  • utility/cn_03_insert_vfs_into_vm.sh: This script uses Ansible to list the libvirt networks on each of the compute nodes, and then executes the insert_sr-iov_nics.yaml Ansible role found in ../roles. See the documentation in insert_sr-iov_nics.yaml for more information.
  • utility/cn_04_remove_vf_from_vm.sh: This script launches the remove_sr-iov_nics.yaml found in the ../roles directory, and then sleeps for 20 seconds to give each of the newly redefined domains adequate time to reboot before printing out the network bus information of each of the domains that previously had SR-IOV interfaces. See remove_sr-iov_nics.yaml documentation for more information about what that role does.
  • utility/cn_change_hostvm_cpu_topology.sh: This script runs the change-hostvm-cpu-topology.yaml Ansible role found in ../roles, and then uses Ansible to run lscpu on each Host VM to verify that the CPU topology changed. See change-hostvm-cpu-topology.yaml for more information about what the role does.
  • utility/hvm_07_unload_all_x710_drivers.sh: This script used the generic runScript.yaml Ansible role found in ../roles to run the 07_unload_all_x710_drivers.sh script from ../host-vm on each one of the Host VM systems. See the documentation for 07_unload_all_x710_drivers.sh for more information about what that script does. This script passes three parameters into the runScript.yaml role:
    1. dest_path: This variable is defined in runScript.yaml, but the command-line extra-vars definition overrides that definition. It is the directory on the Host VM into which the script should be copied before it is run.
    2. script_name: This is the name of the script, in this case 07_unload_all_x710_drivers.sh.
    3. relative_source_path: This is the location of the script relative to provisioning_master_dir, which is defined in /etc/ansible/hosts. Thus, the location of the 07_unload_all_x710_drivers.sh script is {{ provisioning_master_dir }}/host-vm.
  • utility/listHostVMs.sh: A simple script that uses Ansible to call $sudo virsh list all on all of the compute_nodes to show the status of each of the Host VM systems.
  • utility/pingComputeNodes.sh: A simple script that uses Ansible to ping all of the compute_nodes.
  • utility/pingHostVMs.sh: A simple script that uses Ansible to ping all of the Host VM systems.
  • utility/pingSR-IOVHostVMs.sh: A simple script that uses Ansible to ping all of the SR-IOV VMs. See /etc/ansible/hosts.
  • utility/startHostVMs.sh: This script uses the startGuestVMs.yaml Ansible role found in ../roles to start all of the Host VMs. Then it sleeps for 20 seconds to give the VMs time to start, and then uses Ansible to list all of the Host VM systems to ascertain their status. See the ../roles/startGuestVMs.yaml for more information.
  • utility/stopHostVMs.sh: This script uses the shutdownGuestVMs.yaml Ansible role found in ../roles to shut down all of the Host VMs. The role will not return from each compute node until all VMs on that system are shut down. See ../roles/shutdownGuestVMs.yaml for more information.
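
As mentioned in the hvm_07_unload_all_x710_drivers.sh entry above, the wrappers that reuse the generic runScript.yaml role pass their parameters with --extra-vars. The following is a hedged sketch of such an invocation; the inventory group name and the dest_path value are illustrative and not taken from the repository.

    #!/bin/bash
    # Hypothetical invocation of the generic runScript.yaml role to run
    # 07_unload_all_x710_drivers.sh on the Host VM inventory group.
    ansible-playbook ../roles/runScript.yaml --limit host_vms \
      --extra-vars "script_name=07_unload_all_x710_drivers.sh \
                    relative_source_path=host-vm \
                    dest_path=/home/user/provisioning"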

Roles

The roles directory contains the Ansible role and play YAML files that the provisioning scripts use to provision the SR-IOV lab. Most of them reference the provisioning_master_dir and provisioning_slave_dir inventory variables; an illustrative sketch of how those might be declared in /etc/ansible/hosts follows this list.

  • roles/runScript.yaml: This generic role copies a specified script to a remote inventory host and then executes it as root. It has several input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
    3. relative_source_path: This is a path relative to provisioning_master_dir in which the script file is located. This variable MUST be defined on the command line when instantiating this Ansible role.
    4. script_name: This is the name of the script that is found at provisioning_master_dir/relative_source_path and which will be copied to provisioning_slave_dir on the remote node and then executed. This variable MUST be defined on the command line when instantiating this Ansible role.
  • roles/load_vfs_into_libvirt.yaml: This role loads the SR-IOV virtual functions into libvirt networks. It has four tasks:
    1. Copy the 02_load_vfs_into_libvirt.sh script to the compute node.
    2. Change the owner of the script to root and make it executable.
    3. Copy the SR-IOV libvirt network definition XML files to the compute node.
    4. Execute the 02_load_vfs_into_libvirt.sh script on the compute node.
      This play has two input parameters:
      1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
      2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/insert_sr-iov_nics.yaml: This Ansible role prepares the compute nodes to run the insert_sr-iov_nics.py script found in ../scripts and then runs that script on the compute nodes. It copies the VMManager.py module to the remote node as well. It also prepares the compute nodes to run the startGuestVMs.py script found in ../scripts and then executes that script. See the documentation in insert_sr-iov_nics.py and startGuestVMs.py for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/remove_sr-iov_nics.yaml: This Ansible role prepares the compute nodes to run the remove_sr-iov_nics.py script found in ../scripts and then runs that script on the compute nodes. It copies the VMManager.py module to the remote node as well. It also prepares the compute nodes to run the startGuestVMs.py script found in ../scripts and then executes that script. See the documentation in remove_sr-iov_nics.py and startGuestVMs.py for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/aptInstall.yaml: This role simply uses apt to install a package on the target inventory, assuming the target inventory is running Ubuntu or Debian*. Its only input parameter is package_name, which must be passed with --extra-vars on the command line.
  • roles/change-hostvm-cpu-topology.yaml: This Ansible role prepares the compute nodes to run the change-hostvm-cpu-topology.py script found in ../scripts and then runs that script on the compute nodes. It also prepares the compute nodes to run the startGuestVMs.py script found in ../scripts and then executes that script. It copies the VMManager.py module to the remote node as well. See the documentation in change-hostvm-cpu-topology.py and startGuestVMs.py for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/startGuestVMs.yaml: This Ansible role prepares the compute nodes to run the startGuestVMs.py script found in ../scripts and then executes that script. See the documentation in startGuestVMs.py for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/shutdownGuestVMs.yaml: This Ansible role prepares the compute nodes to run the shutdownGuestVMs.py script found in ../scripts and then executes that script. See the documentation in shutdownGuestVMs.py for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/cleanImages.yaml: This Ansible play cleans out the default libvirt image store location (/var/lib/libvirt/images) by deleting the directory and re-creating it. The play spares only the master VM image by first moving it to a temporary directory and then putting it back. Pre-conditions: This play assumes that all VMs are shut down and have been undefined.
  • roles/cloneImages.yaml: This Ansible play prepares the compute nodes to run the cloneImages.sh script found in ../scripts. It has the following tasks:
    1. Copy the cloning script to the compute node.
    2. Verify that the HostVM-Master.xml domain definition file is present on the compute node in the correct location.
    3. Verify that the HostVM-Master.qcow2 image is present on the compute node and in the correct location.
    4. Run the cloneImages.sh script.
    This play assumes that the HostVM-Master domain definition file and qcow2 images are present on the compute node, and it has the following input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
    The majority of the work is done in cloneImages.sh, which you can review for more information.
  • roles/copyFile.yaml: This Ansible play copies a file to the inventory by 1) removing the existing file from the remote system, 2) copying the file to a temp location on the remote system, 3) moving the file to the right location on the remote system, and 4) changing file ownership. The play has one input parameter:
    1. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/copyImages.yaml: This Ansible play was used to prepare for the cloneImages.yaml play. It copies the HostVM-Master domain definition XML file and image files to the target inventory. This play has two preconditions:
    1. The HostVM-Master.qcow2 image must be located at /var/lib/libvirt/images on the Ansible controller node.
    2. The HostVM-Master.xml domain definition file must be located at {{ provisioning_master_dir }}.
    The copyImages.yaml play has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/createmacvtap.yaml: You will not use this file if you are working with the downloadable VM. The createmacvtap.yaml Ansible play creates macvtap NICs on the compute nodes that the Host VMs use to communicate directly with the cluster 'public' network, or the network that the learners use. These macvtap logical NICs are simpler than a Linux bridge, and allow the learners to SSH to their assigned Host VM directly from the cluster jump server. The libvirt macvtap network is defined in ../files/macvtap.xml; the script to create the interfaces is in ../scripts/create_macvtapif.sh. See the create_macvtapif.sh script for more information. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/edit_host_vm_kernel_params.yaml: This Ansible play modified the deployed Host VM guest machines’ grub command lines to add support for nine huge pages of 1 GB, remove CPUs 1–4 from the SMP scheduler algorithms, and set the other parameters for tickless kernel on CPUs 1–4. It then rebooted the Host VM systems and waited for their return. Obviously, if the base image already has this in the boot/grub command line, this play is not necessary.
  • roles/generateSSHInfo.yaml: The generateSSHInfo.yaml Ansible play collected SSH information about the Host VMs and the nested Tenant and VNF VMs—information that was generated during the clone process. Ideally, this would have been run as a playbook together with the cloneImages.yaml play. The play:
    1. Removed any existing ssh_config file on the compute node.
    2. Copied the generateSSHInfo.sh script to the compute node and made it executable.
    3. Ran the generateSSHInfo.sh script on the compute node.
    4. Gathered the ssh_config file from each compute node back to the Ansible controller.
    Once the SSH information was gathered, it was collated into a single ssh_config file that was then placed on the training cluster jump server so that all learners could directly SSH from the jump server into the Host, Tenant, and VNF VMs. This role has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
  • roles/snapshotGuestVMs.yaml: The snapshotGuestVMs.yaml Ansible play copied the snapshotGuestVMs.sh script from the ../scripts directory to the remote compute nodes, and then ran that script. See the snapshotGuestVMs.sh script for more information. This play has two input parameters:
    1. provisioning_slave_dir: This is the directory on the remote node into which the script file is copied. This variable is defined in /etc/ansible/hosts.
    2. provisioning_master_dir: This is the Ansible root directory on the Ansible controller. This variable is defined in /etc/ansible/hosts.
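
As noted at the top of this list, nearly every role reads provisioning_master_dir and provisioning_slave_dir from the inventory. The relevant entries in /etc/ansible/hosts might look something like the hypothetical sketch below; the group names, host patterns, and paths are illustrative and not the lab's actual inventory.

    [compute_nodes]
    compute-node-[01:04]

    [host_vms]
    host-vm-[01:04]

    [all:vars]
    provisioning_master_dir=/home/user/sdn-nfv-provisioning
    provisioning_slave_dir=/tmp/provisioning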

Scripts

The scripts directory contains many of the scripts that were run on the compute nodes, the guest VMs, and the nested VMs.

  • scripts/insert_sr-iov_nics.py: This script is intended to be run on the compute nodes that are exposing the SR-IOV virtual functions. The module edits the libvirt XML definition file and inserts sr-iov nics on the compute node into the specified VMs. It expects there to be four SR-IOV NIC ports per compute node (for example, one XL710 ) with two guest VMs each getting two SR-IOV NIC ports. It also expects those SR-IOV NICS to already be defined as separate libvirt networks. (In reality, a single libvirt network can expose a pool of SR-IOV virtual functions, that number limited by the NIC hardware/driver. However, in this lab, each libvirt network exposed only a single virtual function. See https://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs_in_a_libvirt_.3Cnetwork.3E_definition for more information on how to define a libvirt network that exposes a pool of SR-IOV virtual functions.) If the SR-IOV NICs have not been defined as libvirt networks, and if those libvirt networks are not active, the edited VMs will fail to start. This module also expects there to already be two existing network interfaces in the VM XML domain definition file. See inline comments for more information.
  • scripts/VMManager.py: This Python module/library class contains all of the functions that we use to programmatically manage the VMs on each compute node. Because there is no state saved, all of the methods and attributes on this class are class scope and there is no __init__() method. See the inline documentation for more information about what functions are available and what they do. This module is used by the following Python scripts and must be copied to the remote node along with those scripts:
    • insert_sr-iov_nics.py
    • startGuestVMs.py
    • shutdownGuestVMs.py
    • remove_sr-iov_nics.py
  • scripts/remove_sr-iov_nics.py: This module removes the sr-iov nics from specified VMs on the compute node. It expects there to be two guests per compute node that use the SR-IOV NICs, that those guests' names end with even numbers, and that each guest has been using two SR-IOV NICs. Note that when the XML snippets to define the SR-IOV NICs are first inserted into the domain XML definition, the XML tag looks like this, where 'sr-iov-port0' refers to an active libvirt network that has previously been defined to point at the NIC physical function that is hosting the virtual function:

    <interface type='network'>
        <source network='sr-iov-port0'/>
    </interface>

    However, once the domain XML has been validated by libvirt during the create or define processes, libvirt allocates an SR-IOV NIC from the libvirt SR-IOV network and modifies the SR-IOV NIC XML snippet to point to the host machine's PCI address of the virtual function. The new XML snippet begins like this, and thereafter refers to the source network PCI address rather than the name of the libvirt network:

    <interface type='hostdev'>

    This module does not restart the domains after editing them.
  • scripts/change-hostvm-cpu-topology.py: This module changes the CPU topology of the Host VM guests to one socket with eight cores and no hyper-threading. On each compute node, it shuts down the Host VM guests, saves the XML definition of each guest, undefines it, modifies the CPU XML tag, and then redefines the domain using the modified XML. The modified CPU tag looks like this:
    <cpu mode='host-passthrough'>
        <topology sockets='1' cores='8' threads='1'/>
    </cpu>
  • scripts/create_macvtapif.sh: This script creates a simple macvtap libvirt network on the compute node, activates the network, and sets it to autostart. The libvirt network is defined in macvtap.xml, and the name of the network is macvtap-net. You will not use this file if you are working with the downloadable VM.
  • scripts/generateSSHInfo.sh: This script generates the SSH information that gets rolled into the ssh_config file on the jump server. This script was derived from the cloneImages.sh script, and technically is not necessary; however, I’ve included it because it is used by generateSSHInfo.yaml, which demonstrates how to use the Ansible fetch module.
  • scripts/snapshotGuestVMs.sh: The snapshotGuestVMs.sh script starts all of the domains on the system, and then creates a snapshot labelling it with the current date.
  • scripts/startGuestVMs.py: The startGuestVMs.py script simply calls the VMManager.start_domains function with a mask of VMManager.VM_MASK_ALL. Its only precondition is that the VMManager module is present in the current working directory.
  • scripts/shutdownGuestVMs.py: The shutdownGuestVMs.py script simply calls the VMManager.shutdown_domains function with a mask of VMManager.VM_MASK_ALL. See the VMManager.shutdown_domains documentation for more information. Its only precondition is that the VMManager module is present in the current working directory.
  • scripts/undefineGuestVMs.sh: The undefineGuestVMs.sh script undefines all libvirt guests that are currently in a shutoff state.
  • scripts/cloneImages.sh: The cloneImages.sh script does the heavy lifting of creating the Host VM clones from HostVM-Master. It is intended to be run on the compute nodes (an illustrative sketch of the core commands appears after this list) and does the following:
    1. Calculates the last quad of the static IP Address of the new Host VM from the last two digits of the compute node hostname.
    2. Calculates the last two digits of the Host VM and nested Tenant and VNF VMs from the last two digits of the compute node's hostname.
    3. Uses virt-clone to create a complete clone of HostVM-Master.
    4. Uses virt-sysprep to:
      • Delete all user accounts except user and root.
      • Reset dhcp-client-state
      • Create a new machine-id
      • Reset ssh-hostkeys
      • Delete the ~/.ssh directory
      • Set the new hostname
      • dpkg-reconfigure openssh-server to get the SSH server keys set up correctly.
    5. Uses guestfish to edit the newly created image file and:
      • Add static IP address info into /etc/network/interfaces.d/ens3
      • Add the new hostname info to /etc/hosts
    6. Creates an entry in the ssh_config file for the new Host, VNF, and Tenant VMs.
    The script has the following preconditions:
    1. The HostVM-Master.xml domain definition file is in $PROVISIONING_DIR.
    2. The HostVM-Master.qcow2 image file is in $VM_IMAGE_PATH.
    3. The last two digits of the compute node's hostname are numeric.
    4. virt-clone, virt-sysprep, and guestfish are installed on the compute node.
  • scripts/copy-id.sh: This helper script copied my SSH public key to all of the systems in the entire cluster: the compute nodes and the Host, Tenant, and VNF VMs. Ideally, those SSH keys would be copied during the cloning and creation process; I just didn't figure that out until it was too late. I copied the user SSH keys rather than the root SSH keys so that Ansible could be run as a non-root user.
  • scripts/deleteGuestVMSnapshot.sh: The deleteGuestVMSnapshot.sh script deleted the snapshots attached to all guest VMs on a compute node that were currently in a shutoff state. If the VMs are not shut off, it does nothing to them.
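
As referenced in the cloneImages.sh entry above, the core of that script is a virt-clone / virt-sysprep / guestfish sequence. The sketch below is hypothetical: the clone name, image path, and the exact virt-sysprep operation list are illustrative, not the script's actual contents.

    # Hypothetical clone of HostVM-Master into HostVM-02 on a compute node.
    virt-clone --original HostVM-Master --name HostVM-02 \
               --file /var/lib/libvirt/images/HostVM-02.qcow2

    # Reset machine-specific state inside the new image and set its hostname.
    virt-sysprep -d HostVM-02 \
                 --enable dhcp-client-state,machine-id,ssh-hostkeys \
                 --hostname host-vm-02

    # cloneImages.sh then uses guestfish to write the static IP configuration
    # into /etc/network/interfaces.d/ens3 and to update /etc/hosts, and finally
    # appends an entry for the new VM to the ssh_config file.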

Files Not Included in the Repository

HostVM-Master.qcow2

The Host VM qcow2 image is not included in the repository. However, it can be downloaded here. (Note that the file is very large: 14 GB. If there is sufficient interest, I’ll create a recipe to re-create the VM from scratch.) You are welcome to use that image, but be aware of the hardware requirements to support it, most particularly the nine 1 GB huge pages that are allocated when the virtual machine boots. If the system you are booting the VM on doesn’t have that much RAM, edit the kernel parameters at boot time to reduce the huge page allocation; otherwise the results may be unexpected.

The images for the nested Tenant and VNF virtual machines used in the DPDK hands-on lab (Fed23_TenantVM.img and Fed23_VNF-VM.img ) are also in this image at /home/user/vm-images.

At a high level, here is the information to recreate it.

  • OS: Ubuntu 16.04.
  • Kernel Parameters: Nine 1 GB huge pages are allocated at boot. This is because the DPDK hands-on lab expects four of the 1 GB pages for DPDK running in each of the nested VMs, and the remaining 1 GB huge page is consumed when Open vSwitch + DPDK is launched. (An illustrative sketch of the corresponding kernel command line appears after this list.)
  • Username: user; Password: password. The user account has sudo without password privileges in the /etc/sudoers.d/user file.
  • Installed software sources:
    • DPDK version 16.07 is found at /usr/src/dpdk-16.07.
    • Pktgen version 3.0.14 is at /usr/src/pktgen-3.0.14. It also uses the RTE_SDK and RTE_TARGET environment variables.
    • Sources for qemu-2.6.0 are also installed at /home/user/qemu-2.6.0 and are used to launch the internal Tenant and VNF virtual machines.
    • libpcap-dev is installed because pktgen requires it.
    • Open vSwitch branch 2.6 sources are installed at /usr/src/ovs-branch-2.6.
    • The RTE_SDK environment variable is set in both the user and root accounts in .bashrc and points to the DPDK source location above. The RTE_TARGET environment variable is also defined so that DPDK will create x86_64-native-linuxapp-gcc builds.
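
As referenced in the Kernel Parameters item above, a hedged sketch of what the boot configuration on the Ubuntu guest might look like follows; the exact command line baked into the downloadable image may differ.

    # Hypothetical /etc/default/grub entry: nine 1 GB huge pages, with CPUs 1-4
    # isolated from the SMP scheduler and running tickless.
    GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=9 isolcpus=1-4 nohz_full=1-4 rcu_nocbs=1-4"

    # After editing, regenerate grub.cfg and reboot.
    sudo update-grub && sudo reboot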

NIC Physical Function PCI Pass-Through files

The original hands-on lab had scripts that would insert, in parallel, physical functions from the Intel Ethernet Controller XL710 NICs into the same guest machines that hosted SR-IOV virtual functions, so that the attendees could compare the performance of virtual and physical functions. Those scripts are not included in this package.

Acknowledgments and References

These provisioning scripts were created using things I learned from many places. The libvirt.org Networking wiki entry was indispensable. In particular, I need to acknowledge Tim Stoop at kumina.nl for the bash scripts that used virsh to manipulate domain lifecycle. (See https://github.com/kumina/shutdown-kvm-guests/blob/master/shutdown-kvm-guests.sh). I also learned how to use macvtap interfaces with libvirt from Scott Lowe’s blog on that topic. Mohammad Mohtashim’s Python class objects tutorial at https://www.tutorialspoint.com/python/python_classes_objects.htm helped me get back into Python after a few years’ absence. And, of course, the Ansible, Python 2.7, Libvirt Python and Libvirt networking documentation sites were indispensable. I also found a few helpful (albeit random) libvirt nuggets in the Rust libvirt-sys documentation, and worked through some roadblocks with the crowd-sourced wisdom of stackoverflow.com.

Getting Started with Intel® Machine Learning Scaling Library


Introduction

Intel® Machine Learning Scaling Library (Intel® MLSL) is a library providing an efficient implementation of communication patterns used in deep learning. It is intended for deep learning framework developers, who would like to benefit from scalability in their projects.

Some of the Intel MLSL features include:

  • Built on top of MPI, allows for use of other communication libraries
  • Optimized to drive scalability of communication patterns
  • Works across various interconnects: Intel® Omni-Path Architecture, InfiniBand*, and Ethernet
  • Common API to support deep learning frameworks (Caffe*, Theano*, Torch*, etc.)

Installation

Downloading Intel® MLSL Package

  1. Go to https://github.com/01org/MLSL/releases and download:

    • intel-mlsl-devel-64-<version>.<update>-<package>.x86_64.rpm for root installation, or
    • l_mlsl_p_<version>.<update>-<package>.tgz for user installation.
  2. From the same page, download the source code archive (.zip or .tar.gz). The archive contains the LICENSE.txt and PUBLIC_KEY.PUB files. PUBLIC_KEY.PUB is required for root installation.

System Requirements

Operating Systems

  • Red Hat* Enterprise Linux* 6 or 7
  • SuSE* Linux* Enterprise Server 12

Compilers

  • GNU*: C, C++ 4.4.0 or newer
  • Intel® C/C++ Compiler 16.0 or newer

Installing Intel® MLSL

The Intel® MLSL package comprises the Intel MLSL Software Development Kit (SDK) and the Intel® MPI Library runtime components. Follow the steps below to install the package.

Root installation

  1. Log in as root.

  2. Install the package:

    rpm --import PUBLIC_KEY.PUB
    rpm -i intel-mlsl-devel-64-<version>.<update>-<package>.x86_64.rpm

    In the package name, <version>.<update>-<package> is a string, such as 2017.1-009.

Intel MLSL will be installed at /opt/intel/mlsl_<version>.<update>-<package>.

User installation

  1. Extract the package to the desired folder:

    tar -xvzf l_mlsl_p_<version>.<update>-<package>.tgz -C /tmp/
  2. Run the install.sh script, and follow the instructions:

    ./install.sh

Intel MLSL will be installed at $HOME/intel/mlsl_<version>.<update>-<package>.

Getting Started

After you have successfully installed the product, you are ready to use all of its functionality.

To get an idea of how Intel MLSL works and how to use the library API, we recommend building and launching the sample application supplied with Intel MLSL. The sample application emulates a deep learning framework operation while heavily utilizing the Intel MLSL API for parallelization.

Follow these steps to build and launch the application:

  1. Set up the Intel MLSL environment:

    source <install_dir>/intel64/bin/mlslvars.sh
  2. Build mlsl_test.cpp:

    cd <install_dir>/test
    make
  3. Launch the mlsl_test binary with mpirun on the desired number of nodes (N). mlsl_test takes two arguments:

    • num_groups - defines the type of parallelism, based on the following logic:

      • num_groups == 1 - data parallelism
      • num_groups == N - model parallelism
      • num_groups > 1 and num_groups < N - hybrid parallelism
    • dist_update - enables distributed weight update

Launch command examples:

# use data parallelism
mpirun -n 8 -ppn 1 ./mlsl_test 1
# use model parallelism
mpirun -n 8 -ppn 1 ./mlsl_test 8
# use hybrid parallelism, enable distributed weight update
mpirun -n 8 -ppn 1 ./mlsl_test 2 1

The application implements the standard usage workflow of Intel MLSL. The sample detailed description, as well as the generic step-by-step workflow and API reference, are available in the Developer Guide and Reference supplied with Intel MLSL.

Reporting issues for Intel(R) SDK for OpenCL(tm) Applications and related runtimes/drivers


If you would like to report an issue with Intel OpenCL tools or drivers, please follow this checklist when reporting it on the forum to ensure faster service:

  1. Please let us know what Processor, Operating System, Graphics Driver Version, and Tool Version you are using
  2. Please state steps to reproduce the issue as precisely as possible
  3. If you are using command line tools, please provide the full command line
  4. If code is involved, create a small "Reproducer" sample and attach it to the message
  5. If you don't want your code to be seen by other forum users, please send a private message
  6. Before posting, search the forum to see if someone already answered a similar question

Thank you!


Getting Started with Graphics Performance Analyzer - Platform Analyzer


Prerequisite

We recommend becoming familiar with Graphics Performance Analyzer - System Analyzer first. System Analyzer is the central interface for recording the trace logs used by Platform Analyzer. Refer to this link to get started with System Analyzer. Graphics Performance Analyzer is part of Intel System Studio.

Introduction

Platform Analyzer provides offline performance analysis capability† by presenting task timelines across the hardware engines, the GPU scheduler's software queues, and background threads. With this utility, developers can quickly see which GPU engine resources (Render, Video Enhance/Video Processing, or Video Codec) are involved in the target test application. For example, the trace result can reveal whether Gen graphics hardware codec acceleration is being used. Platform Analyzer can also give a rough indication of a performance issue when you review the task timelines.

Take a quick look at Platform Analyzer below. The top window shows when graphics-related tasks occur on the hardware engine, GPU software queue, and thread timelines. The bottom window contains the overall time cost for each task type and the utilization of the hardware engines.

†Platform Analyzer is part of GPA (Graphics Performance Analyzer), which focuses more on graphics application optimization, mainly for Windows DirectX and Android/Linux OpenGL apps. Check the GPA product page for a quick overview of GPA.

A quick start to capture the trace logs

Follow the normal steps to start analyzing the target application. On Windows, you can use a hotkey combination to capture the Platform Analyzer trace logs directly.

  1. Right-click the “GPA monitors” system tray icon and choose “Analyze Application” to launch the target application.
  2. Press Shift + Ctrl + T to capture the trace logs.
  3. Once you see an overlay indicating that the capture completed, as in the figure below, the trace logs have been captured successfully.

 

Find the pattern: what symptoms do performance issues show in Platform Analyzer?

First of all, you might want to understand a little about VSYNC and the Present function call in Platform Analyzer. VSYNC is a hardware signal; right after it, the system outputs one frame to the display device. The interval between two VSYNC signals corresponds to the refresh rate of the display device. The Present API indicates the operation that copies or moves a frame to another memory destination.

With the background knowledge above, you can watch for and investigate the following symptoms in Platform Analyzer.

  • Irregular VSYNC pattern. (VSYNC intervals should be consistent; otherwise a screen blinking or flashing symptom may appear during the test.)
  • Long delay (big packet) in a GPU engine. (This packet could be the Present call; if the packet size/length in the timeline crosses multiple VSYNCs, it can cause displayed frames to get stuck.)
  • Overloading - multiple heavy stacked packets in the software queue, like the figure below. (Stacked packets mean several tasks are scheduled at the same time; packets are dropped or their processing is delayed if the GPU cannot handle several tasks in a timely manner. This also causes displayed frames to get stuck.)

Further information

Intel System Studio includes three components of GPA: System Analyzer, Platform Analyzer, and Frame Analyzer. System Analyzer provides an interface to record the logs that Platform Analyzer and Frame Analyzer need for offline analysis.

  • System Analyzer: Provides real-time system and application analysis. Central interface to record logs for the other GPA analyzers. More information.
  • Platform Analyzer: Provides analysis of the interactions between GPU engine and thread activities. Presents the captured log by showing all graphics hardware engine workloads (including decoding, video processing, rendering, and computing) and thread activities in a timeline view. More information.
  • Frame Analyzer: Provides offline single-frame rendering analysis. Reconstructs the frame by replaying the DirectX/OpenGL APIs logged by System Analyzer. More information.

 

See also

Intel® Graphics Performance Analyzers (Intel® GPA) Documentation

Register and Download Intel System Studio Windows Professional Edition.

Getting Started with Intel VTune Amplifier for System 2017 on Mac OS X


In Intel System Studio 2017, we will continue to support Intel VTune Amplifier for System on Mac OS X. You can use VTune to analyze performance results captured from Windows, Linux, Android, and FreeBSD. All test steps below were performed on Mac OS X El Capitan 10.11.5. First, you will need to know where to download the package. Visit http://intel.ly/system-studio, choose Linux as the target OS, select Professional Edition, then click the “Download FREE Trial” button to get the software package. You should see the figure below; vtune_amplifier_2017_update1_for_systems.dmg is the main package.

Inside this compressed .dmg file there are an installation guide (release_notes_amplifier_for_systems_osx.pdf), the VTune installation file, and the release notes, which contain the 2017 updates (What’s New section). Double-click the VTune installation file and choose “Install as root”. You will then be able to see the VTune application in Launchpad, as shown in the following figure.

Since OS X 10.5 and later support ssh-agent, it is easy to configure a password-less SSH tunnel for VTune remote profiling.

host> cat ~/.ssh/id_rsa.pub | ssh user@target 'cat >> ~/.ssh/authorized_keys'

After running the command above, you can easily establish a password-less SSH connection between the host and target machines. For versions earlier than OS X 10.5, you may need to copy the VTune result files/folder from the target machine back to the host and open the result files in the VTune GUI to perform the analysis on the host machine. WinSCP is a useful GUI tool for performing scp transfers.
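
If the host does not already have an RSA key pair, a minimal, hedged sketch of generating one and then verifying the password-less login (user@target is a placeholder) is:

host> ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
host> ssh user@target 'echo password-less login works'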

As the following figure shows, our example profiling target is running Ubuntu Linux. Choose “remote Linux (SSH)”, fill in the target’s IP address and user account for the Analysis Target, and enter the name of the application you plan to profile. After that, select Basic Hotspots, Advanced Hotspots, or another analysis type for the Analysis Type, then click Start to begin collecting the performance logs remotely.

Once the log collection process is done (you can set the collection time range or stop the collection whenever you like), VTune displays the profiling data in its own way. The following screenshot shows VTune’s main UI for performance analysis. For further reading and details, visit its online training page.

Troubleshooting

As the figure shows, the VTune UI on the host machine detects an incompatible version of the data collector on the remote target. Each release of the VTune package contains the performance log viewer (GUI) and the corresponding target files (data collectors, kernel drivers). The solution to this issue is to deploy the matching target files on the target machine.

3D Isotropic Acoustic Finite-Difference Wave Equation Code: A Many-Core Processor Implementation and Analysis


Finite difference is a simple and efficient mathematical tool that helps solve differential equations. In this paper, we solve an isotropic acoustic 3D wave equation using explicit, time domain finite differences.

Propagating seismic waves remains a compute-intensive task even when considering the simplest expression of the wave equation. In this paper, we explain how to implement and optimize a three-dimensional isotropic kernel with finite differences to run on the Intel® Xeon® processor v4 family and the Intel® Xeon Phi™ processor.

We also give a brief overview of the new memory hierarchy introduced with the Intel® Xeon Phi™ processor and the different settings and modifications of the source code needed to incorporate the use of the C/C++ High Bandwidth Memory (HBM) application programming interfaces (APIs) for doing dynamic storage allocation from Multi-Channel DRAM (MCDRAM).

Build Your Own Traffic Generator – DPDK-in-a-Box


Download PDF [ 245 MB]

Table of Contents

Introduction
About the Author 
The DPDK Traffic Generator 
   Block Diagram 
   Software
   Hardware
   Note NIC Information
Install the TRex* Traffic Generator 
   Configure the Traffic Generator 
   Note Platform lcore Count 
Run the Traffic Generator
Summary
Next Steps
Exercises
Appendix: Unbinding from DPDK & Binding to Kernel
   Root Cause
   Solution
References

Introduction

The purpose of this cookbook is to guide users through the steps required to build a Data Plane Development Kit (DPDK) based traffic generator.

We built a DPDK-in-a-Box using the MinnowBoard Turbot, which is a low cost, portable platform based on the Intel® Atom™ processor E3826. For the OS, we installed Ubuntu* 16.04 client with DPDK. The instructions in this document are tested on our DPDK-in-a-Box, as well as on an Intel® Core i7-5960X Haswell-E desktop. You can use any Intel® Architecture (IA) platform to build your own device.

For the traffic generator, we will use the TRex* realistic traffic generator. The TRex package is self-contained and can be easily installed.

About the Author

M Jay has worked with the Intel DPDK team since 2009. He joined Intel in 1991 and has worked in various divisions and roles within Intel as a 64-bit CPU front side bus architect and 64-bit HAL developer before joining the Intel DPDK team. M Jay holds 21 US patents, both individually and jointly, all issued while working at Intel. M Jay was awarded the Intel Achievement Award in 2016, Intel's highest honor based on innovation and results.

Please send your feedback to M Jay Muthurajan.Jayakumar@intel.com


Any Intel processor-based platform will work—desktop, server, laptop or embedded system.

The DPDK Traffic Generator

Block Diagram

Software

  • Ubuntu 16.04 Client OS with DPDK installed
  • TRex* realistic traffic generator

Hardware

Our DPDK-in-a-Box uses a MinnowBoard Turbot single board computer:

  • Out of the three Ethernet ports, the two at the bottom are for the traffic generator (dual gigabit Intel® Ethernet Controller I350). Connect a loopback cable between them.
  • Connect the third Ethernet port to the Internet (to download the TRex package).
  • Connect the keyboard and mouse to the USB ports.
  • Connect a display to the HDMI Interface.


The MinnowBoard Turbot

The MinnowBoard includes a microSD card and an SD adapter.

  • Insert the microSD card into the microSD Slot. The SD adapter should be ignored and not used.
  • Power on the DPDK-in-a-Box system. Ubuntu will be up and running right away.

Choose the username test and assign the password tester (or use the username and password specified by the Quick Start Guide that comes with the platform).

  • Log on as root and verify that you are in the /home/test directory with the following two commands:
# sudo su
# ls

Note NIC Information

The configuration file for the traffic generator needs the PCI bus-related information and the MAC addresses. Note this information first using Linux commands, because once DPDK or the packet generator is run, these ports become unavailable to Linux.

  1. For PCI bus-related NIC information, type the following command:

    # lspci

    You will see the following output. Note down that for port 0 the information is 03:00.0 and for port 1 the information is 03:00.1.

  2. Find the MAC address with this command:

    # ifconfig

    You will see the following output. Note down that for port 0 the MAC address is 00:30:18:CB:F2:70 and for port 1 the MAC address is 00:30:18:CB:F2:71.

    Note that the first port in the screenshot below, enp2s0, is the port connected to the Internet. No need to make a note of this.

Item                                     Port 0               Port 1
PCI Bus-related NIC info (from lspci)    03:00.0              03:00.1
MAC address                              00:30:18:CB:F2:70    00:30:18:CB:F2:71

Fill the following table with the information you gathered from your specific platform:

Item                                     Port 0               Port 1
PCI Bus-related NIC info (from lspci)
MAC address

What if you don’t see both of the ports in response to the ifconfig command?

One possible reason is that you might have run the DPDK based application previously, in which case the application might have claimed those ports, making them unavailable to the kernel. In that case, refer to the appendix on how to unbind the ports from DPDK so that the kernel can claim them and you can find the MAC address with the ifconfig command.

In the following discussion, we will assume that you successfully found the ports and have noted down the MAC addresses.

Install the TRex* Traffic Generator

Input the following commands:

# pwd
# mkdir trex
# cd trex
# wget --no-cache http://trex-tgn.cisco.com/trex/release/latest

You should see that the download is complete and saved in /home/test/trex/latest:

The next step is to untar the package:

# tar -xzvf latest

Below you see that version 2.08 is the latest version at the time of this screen capture:

# ls -al

You will see the directory with the version installed. In this exercise, the directory is v2.08, as shown below in response to the ls -al command. Change directory to the version installed on your system; for example, cd <dir name with version installed>:

# cd v2.08

# ls -al

You will see the file t-rex-64, which is the traffic generator executable:

Configure the Traffic Generator

The good news is that the TRex package comes with a sample config file cfg/simple_cfg.yaml. Copy that to /etc/trex_cfg.yaml and edit the file by issuing the following commands, making sure that you’re in your /home/test/trex/<your version> directory:

# pwd
# cp cfg/simple_cfg.yaml /etc/trex_cfg.yaml
# gedit /etc/trex_cfg.yaml

Edit the file as shown below with the applicable NIC information you gathered in previous steps:

Below is the line-by-line description of the configuration information:

  • Port_limit should be 2 (since DPDK-in-a-Box has two ports)
  • Version should be 2
  • Interfaces should be the PCI bus ports you gathered using lspci. In this exercise they are [“03:00.0”, “03:00.1”]
  • Port_information contains a dest_mac, src_mac pair, which will be in the packet header of the traffic generated. The first pair is for port0. Since port0 is connected to port1, the first dest_mac is the MAC address of port 1. The second pair is for port1. Since port1 is connected to port0, the second dest_mac is the MAC address of port 0.

Please note that when you connect an appliance to which traffic must be injected, the dest_mac addresses will be that of the appliance.
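For reference, here is a hedged sketch of how the edited /etc/trex_cfg.yaml might look with the example values from the table above. The exact syntax (in particular the MAC address format) varies between TRex versions, so keep the structure of the bundled simple_cfg.yaml and treat this as illustrative only:

- port_limit : 2
  version    : 2
  interfaces : ["03:00.0", "03:00.1"]
  port_info  :
    - dest_mac : [0x00, 0x30, 0x18, 0xCB, 0xF2, 0x71]  # MAC of port 1
      src_mac  : [0x00, 0x30, 0x18, 0xCB, 0xF2, 0x70]  # MAC of port 0
    - dest_mac : [0x00, 0x30, 0x18, 0xCB, 0xF2, 0x70]  # MAC of port 0
      src_mac  : [0x00, 0x30, 0x18, 0xCB, 0xF2, 0x71]  # MAC of port 1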

Note Platform lcore Count

This section is for informational purposes only.

# cat /proc/cpuinfo will give you the lcore information as shown in the Exercises section.

Why is this information useful?

The command line below that runs the traffic generator uses the -c option to specify the number of lcores to be used for the traffic generator. You want to know how many lcores exist in the platform, so issuing cat /proc/cpuinfo and eyeballing the number of lcores available in the system is helpful.
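
If you prefer a direct count over eyeballing the full cpuinfo output, either of these commonly used commands prints the number of lcores:

# grep -c ^processor /proc/cpuinfo
# nproc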

Run the Traffic Generator

# sudo ./t-rex-64 -f cap2/dns.yaml -c 1 -d 100

What are the parameters -f, -c, and -d?

-f   for the YAML traffic configuration file
-c   for the number of cores. Monitor the CPU% of TRex – it should be ~50%. Use cores accordingly
-d   for the duration of the test (sec). Default: 0

 

Summary

Below are three output screens: 1) During the traffic run, 2) Linux top command output, and 3) Final output after the completion of the run.


Screen output showing traffic during run (15 packets so far Tx & Rx)


Output of the top -H command during the run


Screen output after completing the run (100 packets Tx & Rx)

Next Steps

Congratulations! With the above hands-on exercise, you have successfully built your own Intel DPDK based traffic generator.

As a next step, you can connect back-to-back two DPDK-in-a-Box platforms, and use one as a traffic generator and the other as a DPDK application development and test vehicle.

Please send your feedback to M Jay Muthurajan.Jayakumar@intel.com.

Exercises

  1. How would you configure the traffic generator for different packet lengths?
  2. To run the traffic generator forever, what should be the value of –d?
  3. How would you measure latency (assuming you have more cores)?
  4. Reason out the root cause and find the solution by looking up the error, “Note that the uio or vfio kernel modules to be used should be loaded into the kernel before running the dpdk-devbind.py script” in Chapter 3 of the DPDK.org document Getting Started Guide for Linux.
  5. In the following screenshot, determine the hyperthreading state—enabled vs. disabled? (Hint: this is the Intel Atom processor platform.)

Appendix: Unbinding from DPDK & Binding to Kernel

This section is not needed if ifconfig is able to find the ports you want to use for traffic generation; in that case, you can skip it.

What is the reason ifconfig cannot find the two ports? If you are only interested in the solution, skip this troubleshooting section and go to the Solution section.

Root Cause

ifconfig is not showing the two ports below. Why?

The reason that ifconfig is unable to find the two ports is possibly because the DPDK application was previously run and was aborted without releasing the ports, or it might be that a DPDK script runs automatically after boot and claims the ports. Regardless of the reason, the solution below will enable ifconfig to show both ports.

Solution

  1. Run ./setup.sh in the directory /home/test/dpdk/tools
  2. Display current Ethernet device settings
  3. Unbind the first port from IGB UIO (assuming it is bound to IGB UIO)
  4. Bind the port to IGB (the kernel driver)
  5. Repeat steps 3 and 4 to unbind the second port from IGB UIO and bind it to IGB.

Select “Display current Ethernet device settings” (option 23 in this case).

Status showing two ports claimed by the DPDK driver.

Unbind the first NIC from DPDK (specifically IGB UIO).

  1. Select option 30 and then enter the PCI address of device to unbind:

  2. Bind the kernel driver igb to the device:

    If the inputs entered are correct, the script acknowledges with OK.

  3. Verify by displaying current Ethernet device settings.

Success!

Above you will see the first port 0000:30:00.0 bound to the kernel.

Now on to the second port 0000:30:00.1

Success!

Below you will see both ports bound back to kernel.

Now that both ports are bound back to kernel, ifconfig will give the needed info for those ports. 
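
If you prefer the command line over the interactive setup.sh menu, the same unbind/bind sequence can be done with DPDK's device binding script. This is a hedged sketch: the script name and location vary between DPDK releases (dpdk-devbind.py in newer trees, dpdk_nic_bind.py in older ones), and the PCI addresses are the example values noted earlier.

# cd /home/test/dpdk/tools
# ./dpdk-devbind.py --status
# ./dpdk-devbind.py -u 03:00.0 03:00.1
# ./dpdk-devbind.py -b igb 03:00.0 03:00.1

The first command shows which driver currently claims each NIC, the second unbinds both ports from the DPDK igb_uio driver, and the third binds them back to the kernel igb driver, after which ifconfig shows the interfaces again.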

References

  • MinnowBoard Wiki Home at minnowboard.org – learn more about the MinnowBoard Turbot single board computer.
  • Profiling DPDK Code with Intel® VTune™ Amplifier - Use Intel® VTune™ Amplifier to profile DPDK micro benchmarks with your application. This comprehensive reference provides guidelines and instructions.
  • Intel® VTune™ and Performance Optimizations - This session recording from the July 11, 2016 DPDK/NFV DevLab covers performance optimization best practices, including analysis of NUMA affinity, microarchitecture optimizations with VTune, and tips to help you identify hotspots in your own application.
  • DPDK Performance Optimization Guidelines White Paper - Learn best-known methods to optimize your DPDK application's performance. Includes profiling methodology to help identify bottlenecks, then shows how to optimize BIOS settings, partition NUMA resources, optimize your Linux* configuration for DPDK, and more.

Intro to Device Side AVC Motion Estimation


Download the Sample

Download the device side VME sample code

Instructions on how to run it are in the VME Sample user guide.

Introduction

This article introduces the new device-side h.264/Advanced Video Coding (AVC) motion estimation extensions for OpenCL* available in the implementation for Intel Processor Graphics GPUs. The video motion estimation (VME) hardware has been accessible for many years as part of the Intel® Media SDK video codecs, and via built-in functions invoked from the host. The new cl_intel_device_side_avc_motion_estimation extension (available in Linux driver SRB4) provides a fine-grained interface to access the hardware-supported AVC VME functionality from kernels.
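
Before porting code to the extension, it is worth checking that your driver exposes it. One quick, hedged way to do that (assuming the clinfo utility is installed) is to search the reported device extensions:

    clinfo | grep -i device_side_avc_motion_estimation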

VME access with Intel extensions to the OpenCL* standard was previously described in:

Motion Estimation Overview

Motion estimation is the process of searching for a set of many local (block-level) translations that best match the temporal differences between one frame and another. Without hardware acceleration this search can be expensive. Accessing hardware acceleration of this operation can open up new ways of thinking about many algorithms.

For 6th Generation Core/Skylake processors, the Compute Architecture of Intel® Processor Graphics Gen9 guide provides more hardware details. As a quick overview:

  • The VME hardware is part of the sampler blocks included in all subslices (bottom left of the image below). Intel subgroups work at the subslice level, and the device-side VME extension is also based on subgroups.
  • Gen9 hardware usually contains 3 subslices per slice (24 EUs/slice).
  • Processors contain varying numbers of slices.  You can determine the number of slices with a combination of info from ark.intel.com (which will give you the name of the processor graphics GPU) and other sites like notebookcheck.com which will give more info on the processor graphics model.

Motion search is one of the core operations of many modern codecs.  It also has many other applications.  For video processing it can be used in filters such as video stabilization and frame interpolation.  It can also be a helpful pre-processor for encode, with useful information about motion complexity, where motion is occurring in the frame, etc.   

From a computer vision perspective, the motion vectors could be useful as a base for algorithms based on optical flow.

Currently available Intel® Processor Graphics GPU hardware only supports AVC/H.264 motion estimation.  HEVC/H.265 motion estimation is significantly more complex.  However, this does not necessarily mean that AVC motion estimation is only relevant to AVC.  The choices returned from VME for the simpler AVC motion direction search can be used to narrow the search space for a second pass/more detailed search where needed.

 

The device-side interface

Previous implementations have focused on builtin functions called from the host.  The device-side implementation requires kernel code for several phases of setup:

  1. General initialization (intel_sub_group_avc_ime_initialize)
  2. Operation configuration (including inter and intra costs and other properties where needed)

(from vme_basic.cl)

      intel_sub_group_avc_ime_payload_t payload =
        intel_sub_group_avc_ime_initialize( srcCoord, partition_mask, sad_adjustment);

      payload = intel_sub_group_avc_ime_set_single_reference(
        refCoord, CLK_AVC_ME_SEARCH_WINDOW_EXHAUSTIVE_INTEL, payload);

      ulong cost_center = 0;
      uint2 packed_cost_table = intel_sub_group_avc_mce_get_default_medium_penalty_cost_table();
      uchar search_cost_precision = CLK_AVC_ME_COST_PRECISION_QPEL_INTEL;
      payload = intel_sub_group_avc_ime_set_motion_vector_cost_function( cost_center, packed_cost_table, search_cost_precision, payload );

There is then an evaluation phase (intel_sub_group_avc_ime_evaluate*).  After this results can be extracted with the intel_sub_group_avc_ime_get* functions.

      intel_sub_group_avc_ime_result_t result =
         intel_sub_group_avc_ime_evaluate_with_single_reference(
             srcImg, refImg, vme_samp, payload );

      // Process Results
      long mvs           = intel_sub_group_avc_ime_get_motion_vectors( result );
      ushort sads        = intel_sub_group_avc_ime_get_inter_distortions( result );
      uchar major_shape  = intel_sub_group_avc_ime_get_inter_major_shape( result );
      uchar minor_shapes = intel_sub_group_avc_ime_get_inter_minor_shapes( result );
      uchar2 shapes = { major_shape, minor_shapes };
      uchar directions   = intel_sub_group_avc_ime_get_inter_directions( result );

 

In the example, sub-pixel refinement of motion estimation is also implemented with similar steps:

     intel_sub_group_avc_ref_payload_t payload =
       intel_sub_group_avc_fme_initialize(
          srcCoord, mvs, major_shape, minor_shapes,
          directions, pixel_mode, sad_adjustment);

     payload =
      intel_sub_group_avc_ref_set_motion_vector_cost_function(
         cost_center,packed_cost_table,search_cost_precision,payload );

     intel_sub_group_avc_ref_result_t result =
      intel_sub_group_avc_ref_evaluate_with_single_reference(
         srcImg, refImg, vme_samp, payload );

     mvs = intel_sub_group_avc_ref_get_motion_vectors( result );
     sads = intel_sub_group_avc_ref_get_inter_distortions( result );

 

Expected Results

Note: only a small region of the input/output frames is used below.

From the inputs (a source frame and reference/previous frame)

There is a "zoom out" motion in the test clip included with the sample.  The image below shows where the major differences are located (from imagemagick comparison of source and ref image).

Output of the sample is an overlay of motion vectors as shown here.  Note that most fit the radial pattern expected from zoom.

Additional outputs include AVC macroblock shape choices (shown below) and residuals.

Conclusion

Video motion estimation is a powerful feature which can enable new ways of thinking about many algorithms for video codecs and computer vision. The search for the most representative motion vectors, which is computationally expensive if done on the CPU, can be offloaded to specialized hardware in the Intel Processor Graphics Architecture image samplers. The new device-side (called from within a kernel) interface enables more flexibility and customization while potentially avoiding some of the performance costs of host-side launch.

Reference

 

 

For more complete information about compiler optimizations, see our Optimization Notice.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

 
