
Code Sample: Rendering Objects in Parallel Using Vulkan* API

File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
OS: 64-bit Windows* 7, 8.1, or Windows® 10
Hardware: GPU required
Software (programming language, tool, IDE, framework): Microsoft Visual Studio* 2017, Qt Creator 4.5.0, C++17, Qt 5.10, Vulkan* 1.0.65.1 SDK, ASSIMP 4.1.0 library
Prerequisites: Familiarity with Visual Studio, the Vulkan* API, 3D graphics, and parallel processing.


Introduction

One of the industry’s hottest new technologies, the Vulkan API supports multithreaded programming, simplifies cross-platform development, and has the backing of major chip, GPU, and device makers. The API is a collaborative industry effort to meet the current demands of computer graphics. It is a new approach that emphasizes hiding the CPU bottleneck through parallelism and allowing much more flexibility in application structure. Aside from components related only to graphics, the Vulkan API also defines a compute pipeline for numerical computation. In all, Vulkan is positioned to become one of the next dominant graphics rendering platforms.

This code and accompanying article (see References below) discuss the process of rendering multiple FBX (Filmbox) and OBJ (Wavefront) objects using Vulkan APIs. The application employs a non-touch graphical user interface (GUI) that reads and displays multiple 3D object files in a common scene. Files are loaded and rendered using linear or parallel processing, selectable for the purpose of comparing performance. In addition, the application allows objects to be moved, rotated, and zoomed through a simple UI. We recommend that you read the article while looking at the code. Make sure you have the examples downloaded and use your favorite code browsing tool.

The code demonstrates the following concepts:

  1. Loaded models displayed in a list
  2. Selected objects identified on-screen with a bounding box
  3. An object information and statistics display showing the number of vertices
  4. The ability to specify either delta or absolute coordinates and rotations
  5. An option to view objects in wireframe mode
  6. Statistics for comparing single- versus multi-threading when reading object files


Get Started

At a high level, when programming with Vulkan, the goal is to construct a virtual device to which drawing commands are submitted. The draw commands are submitted to constructs called “queues”. The number of queues available and their capabilities depend upon how they were selected during construction of the virtual device, and on the actual capabilities of the hardware. The power of Vulkan lies in the fact that workloads can be assembled and submitted to queues in parallel with already-executing tasks. Vulkan offers functionality to coherently maintain resources and perform synchronization.
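
To make this concrete, here is a minimal, hypothetical sketch of creating a virtual device with a single graphics queue (the sample itself relies on Qt's QVulkanWindow for this setup; physicalDevice and queueFamilyIndex are assumed to come from earlier enumeration):

// minimal sketch: create a logical ("virtual") device exposing one graphics queue
const float priority = 1.0f;

VkDeviceQueueCreateInfo queueInfo = {};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = queueFamilyIndex; // a family advertising VK_QUEUE_GRAPHICS_BIT
queueInfo.queueCount = 1;
queueInfo.pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo = {};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos = &queueInfo;

VkDevice device = VK_NULL_HANDLE;
if (vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device) != VK_SUCCESS)
    qFatal("Failed to create Vulkan device");

// retrieve the queue to which draw commands will be submitted
VkQueue graphicsQueue = VK_NULL_HANDLE;
vkGetDeviceQueue(device, queueFamilyIndex, 0, &graphicsQueue);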

Tutorial: Rendering Objects in Parallel Using Vulkan* APIs

Reading FBX and OBJ files

The first step is to set up and create the user interface. As we said, this UI is keyboard- and mouse-driven, but it could be enhanced to support touch.

Once the UI is in place, the work begins with reading either an FBX or OBJ file and loading it into memory. The application supports doing this using a single or multiple threads so you can see the difference in performance. We are going to cheat here and use the Open Asset Import Library (assimp) to read and parse the files. Once loaded, the object will be placed in a data structure (Object3D) that we can hand to Vulkan. This is described in detail in the article.

Displaying and manipulating the 3D objects

The main area of the user interface is a canvas where the loaded objects are displayed. These are placed in a default location but can be moved anywhere on the canvas so they do not overlap. When you select an object from the list of loaded items, it is highlighted with a bounding box. Once selected, you can move, rotate, or resize the object by entering new coordinates or a new size into the form. Again, you can read the details in the accompanying article.

Using Vulkan to render the 3D objects

Loading the objects from memory and displaying them on the screen is handled gracefully by Vulkan. The source code shows how to load an object file using Vulkan. About a dozen lines in, the loaded file is sent to the renderer with support for a secondary command buffer to allow object loading in parallel. The system processor, GPU, and other factors of the host system, as well as the size of the object file, will determine single- and multithreaded object rendering times. Your results will vary.

Because of the complexities of the Vulkan APIs, the biggest challenge was building Renderer, which implements application-specific rendering logic for VulkanWindow. Especially challenging was synchronizing the worker and UI threads without using mutually exclusive locks in the rendering and resource-releasing phases. In the rendering phase, this is achieved by separating command pools and secondary command buffers for each Object3D instance. In the resource-releasing phase, it is necessary to make sure the host and GPU rendering phases are finished.

The key functions of interest in Renderer are:

  • void Renderer::startNextFrame()
  • void Renderer::endFrame()
  • void Renderer::drawObject()
  • void Renderer::initPipeline()

This last method was required to handle the different types of graphical objects—those loaded from files and the dynamically generated bounding boxes that surround the selected object. This posed a problem because they use differing shaders, primitive topologies, and polygon modes. The goal was to unify the code for the different objects as much as possible to avoid replicating similar code. Both types of objects are expressed by a single class, Object3D.


Conclusion

Coding flexibility is a hallmark of the low-level Vulkan APIs, but it is critical to remain focused on what is going on in each Vulkan step. These lower-level programming capabilities also allow for fine-tuning certain aspects of hardware access not available with OpenGL*. If you take it slow and build your project in small, incremental steps, the payoffs will include far greater rendering performance, a much lower runtime footprint, and greater portability to a multitude of devices and platforms.


References

Alexey Korenevsky, Integrated Computing Solutions, Inc., Vulkan* Code Sample: Rendering Objects in Parallel Using Vulkan* APIs, 2018

Open Asset Import Library


Update Log

Created May 23, 2018


Rendering Objects in Parallel Using Vulkan* APIs


If you're a game developer and not yet up to speed on Vulkan*, you should be. Vulkan APIs are one of the industry's hottest new technologies. They support multithreaded programming, simplify cross-platform development and have the backing of makers of major chips, GPUs, and devices. Vulkan APIs are positioned to become one of the next dominant graphics rendering platforms. Characteristics of the platform help apps gain longevity and run in more places. You might say that Vulkan lets apps live long and prosper—and this code sample will help get you started.

The APIs were introduced by the Khronos Group* in 2015 and quickly gained the support of Intel and Google*. Unity Technologies* came on board in 2016, and Khronos* confirmed plans to add automatic support for multiple discrete GPUs to Vulkan. By 2017, as the Vulkan APIs matured, an increasing number of game makers announced that they would begin adopting them. Vulkan became available for Apple's macOS* and iOS* platforms in 2018.

Vulkan carries low overhead while providing greater control over threading and memory management, as well as more direct access to the GPU than OpenGL* and other predecessor APIs. These features combine to give the developer versatility for targeting an array of platforms with essentially the same code base. With early backing from major industry players, the Vulkan platform has tremendous potential, and developers are advised to get on board soon. Vulkan is built for now.

To help experienced pro and indie developers prepare for Vulkan, this article walks through the code of a sample app that renders multiple .fbx and .obj objects using Vulkan APIs. The app employs a non-touch graphical user interface (GUI) that reads and displays multiple object files in a common scene. Files are loaded and rendered using linear or parallel processing, selectable for the purpose of comparing performance. In addition, the app allows objects to be moved, rotated, and zoomed through a simple UI.

Figure 1. Multiple rendered objects displayed simultaneously; the selected object is indicated with a bounding box.

The app also features:

  • Loaded models displayed in a list
  • Selected objects identified on-screen with a bounding box
  • An object info and stats display showing the number of vertices
  • The ability to specify either delta or absolute coordinates and rotations
  • The ability to open object files in a file explorer window
  • An option to view objects in wireframe mode
  • Statistics comparing single- versus multithreaded reading and rendering

Keeping developers informed and educated on the latest technologies and development techniques is an important part of ensuring their success and prosperity. To that end, all source code and libraries from this project are available for download, so you can build and learn from the app on your own and adapt the functions for use in your own apps.

For people new to Vulkan, the learning curve could be steep. Because it gives developers rich features and a broad level of control, Vulkan contains far more structures and requires a greater number of initializations than OpenGL and other graphics libraries. For the sample app, the renderer alone (renderer.cpp) required more than 500 lines of code.

In an effort to minimize the amount of code required, this sample app focuses heavily on architecting a unified means of rendering different object types. Commonalities are identified in the initialization steps and separated from the general pipeline, while the parts specific to a particular instance of a 3D object are loaded and rendered from a file. The boundary box is another type of object and requires its own shaders, settings, and pipeline; there is only one instance of it, however. Minimizing coding differences between object types also helped to improve flexibility and simplify the code.

One of the most significant challenges of developing this sample involved multithreaded rendering. Though Vulkan APIs are considered "thread-safe," some objects require explicit synchronization on the host side; in this app that applies to the command pool and command buffer. When an object requests a command buffer, the buffer is allocated from the command pool. If the command pool were accessed in parallel from several threads at once, the app would crash or report a warning in the Vulkan console. One answer would be to use mutual exclusions, or mutexes, to serialize access to the shared command pool. But this would eliminate the advantage of parallel processing, because threads would compete and block each other. Instead, the sample app implements separate command buffers and command pools for each 3D object instance, which then requires extra code for the release of resources. A sketch of this approach follows.
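
As a rough sketch of that approach (simplified and hypothetical, not verbatim sample code; graphicsQueueFamilyIndex and the object->cmdPool member are assumptions), each Object3D gets its own command pool and per-frame secondary command buffers, so worker threads never contend for a shared pool:

// sketch: one command pool plus per-frame secondary command buffers per Object3D
// graphicsQueueFamilyIndex and object->cmdPool are assumptions for illustration
VkCommandPoolCreateInfo poolInfo = {};
poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
poolInfo.queueFamilyIndex = graphicsQueueFamilyIndex;
m_deviceFunctions->vkCreateCommandPool(device, &poolInfo, nullptr, &object->cmdPool);

VkCommandBufferAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.commandPool = object->cmdPool;
allocInfo.level = VK_COMMAND_BUFFER_LEVEL_SECONDARY; // recorded in a worker thread
allocInfo.commandBufferCount = QVulkanWindow::MAX_CONCURRENT_FRAME_COUNT;
m_deviceFunctions->vkAllocateCommandBuffers(device, &allocInfo, object->cmdBuffer);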


What You'll Need

The minimum requirement for developing with Vulkan APIs on graphics processor units (GPUs) from Intel is a processor from the 6th Generation Intel® processor family (introduced in August 2015), running 64-bit Windows* 7, 8.1, or 10. Intel also offers a 64-bit Windows® 10-only driver for 6th, 7th, or 8th Generation processors. Vulkan drivers are now included with Intel® HD Graphics Driver, which helps simplify the setup process. Instructions are available for installing Vulkan drivers on Intel®-based systems running Unity* or Unreal* Engine 4.


Code Walk-Through

This app was built as an aid to developers learning to use Vulkan. This walk-through explains the techniques used to make the sample app, simplifying the work of getting started on your own. To reduce time spent on planning the architecture, the app was developed using an incremental, iterative process, which helps minimize changes during the coding phase. The project was divided into three parts: UI (MainWindow and VulkanWindow), model loader (Model.cpp/h), and rendering (Renderer.cpp/h). The feature list was prioritized and sorted by difficulty of implementation. Coding then started with the easiest features—refactoring and changing design only when needed.


MainWindow.cpp

In the sample app's main window, object files are loaded either in a single process or in parallel. Either way, a timer counts the total loading time to allow for comparison. When files are processed in parallel, the QtConcurrent framework is used to run the work in worker threads.

The "loadModels()" function starts the parallel or linear processing of files. In the first few lines, a counter is started. Then the loading times for file(s) are counted and an aiScene is created using the Assimp* external library. Next, the aiScene is converted to a class model created for this app that's more convenient to Vulkan. A progress dialog is created and presented while parallel file processing takes place.

void MainWindow::loadModels()
{
    clearModels();
    m_elapsedTimer.start(); // counts total loading time

    std::function<QSharedPointer<Model>(const QString &)> load = [](const QString &path) {
        QSharedPointer<Model> model;
        QFileInfo info(path);
        if (!info.exists())
            return model;
        QElapsedTimer timer;
        timer.start(); // loading time for this file
        Assimp::Importer importer;
        // read the file from disk and create an aiScene instance (external library Assimp)
        const aiScene* scene = importer.ReadFile(path.toStdString(),
                                                 aiProcess_Triangulate |
                                                 aiProcess_RemoveComponent |
                                                 aiProcess_GenNormals |
                                                 aiProcess_JoinIdenticalVertices);

        qDebug() << path << (scene ? "OK" : importer.GetErrorString());
        if (scene) {
            // aiScene format is not very convenient for the renderer, so class Model
            // (Model.cpp) keeps the data in a form ready for the Vulkan renderer
            model = QSharedPointer<Model>::create(info.fileName(), scene); // convert aiScene to class Model

            if (model->isValid()) {
                model->loadingTime = timer.elapsed();
            } else {
                model.clear();
            }
        }
        return model;
    };
    // create a progress dialog for the app user
    if (m_progressDialog == nullptr) {
        m_progressDialog = new QProgressDialog(this);
        QObject::connect(m_progressDialog, &QProgressDialog::canceled, &m_loadWatcher, &QFutureWatcher<void>::cancel);
        QObject::connect(&m_loadWatcher,  &QFutureWatcher<void>::progressRangeChanged, m_progressDialog, &QProgressDialog::setRange);
        QObject::connect(&m_loadWatcher, &QFutureWatcher<void>::progressValueChanged,  m_progressDialog, &QProgressDialog::setValue);
    }
    // using QtConcurrent for parallel file processing in worker threads
    QFuture<QSharedPointer<Model>> future = QtConcurrent::mapped(m_files, load);
    m_loadWatcher.setFuture(future);
    // present the progress dialog to the app user
    m_progressDialog->exec();
}

The "loadFinished()" function processes results of the parallel or linear processing, adds object file names to "listView," and passes models to the renderer.

void MainWindow::loadFinished() {
    qDebug("loadFinished");
    Q_ASSERT(m_vulkanWindow->renderer());
    m_progressDialog->close(); // close the progress dialog
    // iterate over the results of the file load
    const auto & end = m_loadWatcher.future().constEnd();

    // loop populating the list of file names
    for (auto it = m_loadWatcher.future().constBegin(); it != end; ++it) {
        QSharedPointer<Model> model = *it;
        if (model) {
            ui->modelsList->addItem(model->fileName); // populates list view
            // pass the object to the renderer (created in VulkanWindow, which is part of MainWindow)
            m_vulkanWindow->renderer()->addObject(model);
        }
    }
}

Identify the selected object on the screen by surrounding it with a bounding box.

mainwindow.cpp: MainWindow::currentRowChanged(int row)
{
...
   if (m_vulkanWindow->renderer())
           m_vulkanWindow->renderer()->selectObject(row);

renderer.cpp: Renderer::selectObject(int index) - inflates the BoundaryBox object's model
...

Display object info and statistics (e.g., number of vertices) of the selected object on the screen. Here, object-specific statistics are created and the loading time for the scene is displayed.

MainWindow::currentRowChanged(int row) - shows statistics for the selected object:
{
…
// prepare object-specific statistics (vertices, etc.)
QString stat = tr("Loading time: %1ms. Vertices: %2, Triangles: %3")
               .arg(item->model->loadingTime)
               .arg(item->model->totalVerticesCount())
               .arg(item->model->totalTrianglesCount());
ui->objectStatLabel->setText(stat);

// display total scene loading time
void MainWindow::loadFinished() 
ui->totalStatLabel->setText(tr("Total loading time: %1ms").arg(m_elapsedTimer.elapsed()));

// show rendering performance in frames per second
void MainWindow::timerEvent(QTimerEvent *) 
ui->fpsLabel->setText(tr("Performance: %1 fps").arg(renderer->fps(), 0, 'f', 2, '0'));
...

Enable users of the app to specify absolute coordinates and rotations.

void MainWindow::positionSliderChanged(int)
{
    const int row = ui->modelsList->currentRow();
    if (row == -1 || m_ignoreSlidersSignal || !m_vulkanWindow->renderer())
        return;
    m_vulkanWindow->renderer()->setPosition(row, ui->posXSlider->value() / 100.0f, ui->posYSlider->value() / 100.0f,
                                ui->posZSlider->value() / 100.0f );
}

void MainWindow::rotationSliderChanged(int)
{
    const int row = ui->modelsList->currentRow();
    if (row == -1 || m_ignoreSlidersSignal || !m_vulkanWindow->renderer())
        return;
     m_vulkanWindow->renderer()->setRotation(row, ui->rotationXSlider->value(), ui->rotationYSlider->value(),
                                ui->rotationZSlider->value());
}

Figure 2. The sample app implements a file explorer window for finding and opening objects to render.

Allow the app to open object files using a file explorer window.

MainWindow::MainWindow(QWidget *parent)
    : QWidget(parent),
      ui(new Ui::MainWindow)
{
…

connect(ui->loadButton, &QPushButton::clicked, this, [this] {
       const QStringList & files = QFileDialog::getOpenFileNames(this, tr("Select one or more files"), QString::null, "3D Models (*.obj *.fbx)");
       if (!files.isEmpty()) {
           m_files = files;
           loadModels();
           ui->reloadButton->setEnabled(true);
       }
   });
...

Figure 3. Objects rendered in wireframe mode; the selected object is indicated by a bounding box.

Allow the user to display objects in wireframe mode.

MainWindow::MainWindow(QWidget *parent)
    : QWidget(parent),
      ui(new Ui::MainWindow)
{
... 
 connect(ui->wireframeSwitch, &QCheckBox::stateChanged, this, [this]{
       if (m_vulkanWindow->renderer()) {
           m_vulkanWindow->renderer()->setWirefameMode(ui->wireframeSwitch->checkState() == Qt::Checked);
       }
   });
Renderer.cpp (lines 386-402):
void Renderer::setWirefameMode(bool enabled)
...

Renderer.cpp

Because of the complexities of the Vulkan APIs, the biggest challenge to this app's developer was building Renderer, which implements application-specific rendering logic for VulkanWindow.

Figure 4. Thread selection is simplified with a drop-down window; the ideal number is based on cores in the host system.

Especially challenging was synchronizing the worker and UI threads without using mutually exclusive locks in the rendering and resource-releasing phases. In the rendering phase, this is achieved by separating command pools and secondary command buffers for each Object3D instance. In the resource-releasing phase, it is necessary to make sure the host and GPU rendering phases are finished.

Figure 5. Total loading time and vertices count of an object file allow comparison of single- and multithreaded loading times.


Rendering Results May Vary

The system processor, GPU, and other factors of the host system, as well as the size of the object file, will determine single- and multithreaded object rendering times. Your results will vary. Normally, the host rendering phase is finished when "Renderer::m_renderWatcher" emits a "finished" signal and "Renderer::endFrame()" is called. The resource-releasing phase might be initiated in cases such as:

  1. The Vulkan window is resized or closed.
    "Renderer::releaseSwapChainResources" and "Renderer::releaseResources" will be called.
  2. The wireframe mode is changed—"Renderer::setWirefameMode"
  3. Objects are deleted—"Renderer::deleteObjects"
  4. Objects are added—"Renderer::addObject"

In those situations, the first things we need to do are:

  1. Wait until all worker threads are finished.
  2. Explicitly finish the rendering phase, calling "Renderer::endFrame()", which also sets the flag "m_framePreparing = false" to ignore all results from worker threads that come asynchronously in the near future.
  3. Wait until the GPU finishes all graphical queues using the "m_deviceFunctions->vkDeviceWaitIdle(m_window->device())" call.

This is implemented in "Renderer::rejectFrame":

void Renderer::rejectFrame()
{
   m_renderWatcher.waitForFinished(); // all workers must be finished
   endFrame(); // flushes current frame
   m_deviceFunctions->vkDeviceWaitIdle(m_window->device()); // all graphics queues must be finished
}

Parallel preparation of command buffers to render 3D objects is implemented in the following three functions; the code for each follows:

  1. Renderer::startNextFrame—This is called when the draw commands for the current frame need to be added.
  2. Renderer::drawObject—This records commands to the secondary command buffer. It runs in a worker thread. When it's done, the buffer is reported to the UI thread to be recorded to the primary command buffer.
  3. Renderer::endFrame—This finishes the render pass for current command buffer, reports to VulkanWindow that a frame is ready, and requests an immediate update to keep rendering.


Function 1: void Renderer::startNextFrame()

This section contains mainly Vulkan-specific code that is not likely to need modification. The snippet is intended to show how to load an object file using Vulkan. About a dozen lines in, the loaded file is sent to the renderer with support for a secondary command buffer to allow object-loading in parallel.

void Renderer::startNextFrame()
{
    m_framePreparing = true;

    const QSize imageSize = m_window->swapChainImageSize();

    VkClearColorValue clearColor = { 0, 0, 0, 1 };

    VkClearValue clearValues[3] = {};
    clearValues[0].color = clearValues[2].color = clearColor;
    clearValues[1].depthStencil = { 1, 0 };

    VkRenderPassBeginInfo rpBeginInfo = {};
    memset(&rpBeginInfo, 0, sizeof(rpBeginInfo));
    rpBeginInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
    rpBeginInfo.renderPass = m_window->defaultRenderPass();
    rpBeginInfo.framebuffer = m_window->currentFramebuffer();
    rpBeginInfo.renderArea.extent.width = imageSize.width();
    rpBeginInfo.renderArea.extent.height = imageSize.height();
    rpBeginInfo.clearValueCount = m_window->sampleCountFlagBits() > VK_SAMPLE_COUNT_1_BIT ? 3 : 2;
    rpBeginInfo.pClearValues = clearValues;

    // starting render pass with secondary command buffer support
    m_deviceFunctions->vkCmdBeginRenderPass(m_window->currentCommandBuffer(), &rpBeginInfo,  VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);

    if (m_objects.size()) {
        // starting parallel command buffer generation in worker threads using QtConcurrent
        auto drawObjectFn = std::bind(&Renderer::drawObject, this, std::placeholders::_1);
        QFuture<VkCommandBuffer> future = QtConcurrent::mapped(m_objects, drawObjectFn);
        m_renderWatcher.setFuture(future);
    } else {
        // if no objects exist, end the frame immediately
        endFrame();
    }
}


Function 2: Renderer::endFrame()

This function instructs Vulkan that all command buffers are ready for rendering with the GPU.

void Renderer::endFrame()
{
    if (m_framePreparing) {
        m_framePreparing = false;
        m_deviceFunctions->vkCmdEndRenderPass(m_window->currentCommandBuffer());
        m_window->frameReady();
        m_window->requestUpdate();
        ++m_framesCount;
    }
}

Function 3: Renderer::drawObject()

The function prepares the command buffers to be sent to the GPU. As above, the Vulkan-specific code in this snippet also runs in a worker thread and is not likely to need modification for use in other apps.

// running in a worker thread
VkCommandBuffer Renderer::drawObject(Object3D * object)
{
    if (!object->model)
        return VK_NULL_HANDLE;

    const PipelineHandlers & pipelineHandlers = object->role == Object3D::Object ? m_objectPipeline : m_boundaryBoxPipeline;
    VkDevice device = m_window->device();

    if (object->vertexBuffer == VK_NULL_HANDLE) {
        initObject(object);
    }

    VkCommandBuffer & cmdBuffer = object->cmdBuffer[m_window->currentFrame()];

    VkCommandBufferInheritanceInfo inherit_info = {};
    inherit_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
    inherit_info.renderPass = m_window->defaultRenderPass();
    inherit_info.framebuffer = m_window->currentFramebuffer();

    VkCommandBufferBeginInfo cmdBufBeginInfo = {
        VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        nullptr,
        VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,
        &inherit_info
    };
    VkResult res = m_deviceFunctions->vkBeginCommandBuffer(cmdBuffer, &cmdBufBeginInfo);
    if (res != VK_SUCCESS) {
        qWarning("Failed to begin frame command buffer: %d", res);
        return VK_NULL_HANDLE;
    }

    const QSize & imageSize = m_window->swapChainImageSize();

    VkViewport viewport;
    viewport.x = viewport.y = 0;
    viewport.width = imageSize.width();
    viewport.height = imageSize.height();
    viewport.minDepth = 0;
    viewport.maxDepth = 1;
    m_deviceFunctions->vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);

    VkRect2D scissor;
    scissor.offset.x = scissor.offset.y = 0;
    scissor.extent.width = imageSize.width();
    scissor.extent.height = imageSize.height();
    m_deviceFunctions->vkCmdSetScissor(cmdBuffer, 0, 1, &scissor);

    QMatrix4x4 objectMatrix;
    objectMatrix.translate(object->translation.x(), object->translation.y(), object->translation.z());
    objectMatrix.rotate(object->rotation.x(), 1, 0, 0);
    objectMatrix.rotate(object->rotation.y(), 0, 1, 0);
    objectMatrix.rotate(object->rotation.z(), 0, 0, 1);
    objectMatrix *= object->model->transformation;


    m_deviceFunctions->vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineHandlers.pipeline);

    // pushing view-projection matrix to constants
    m_deviceFunctions->vkCmdPushConstants(cmdBuffer, pipelineHandlers.pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT, 0, 64, m_world.constData());

    const int nodesCount = object->model->nodes.size();
    for (int n = 0; n < nodesCount; ++n) {
        const Node &node = object->model->nodes.at(n);
        const uint32_t frameUniSize = nodesCount * object->uniformAllocSize;
        const uint32_t frameUniOffset = m_window->currentFrame() * frameUniSize + n * object->uniformAllocSize;
        m_deviceFunctions->vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineHandlers.pipelineLayout, 0, 1,
                                                   &object->descSet, 1, &frameUniOffset);

        // mapping uniform buffer to update matrix
        quint8 *p;
        res = m_deviceFunctions->vkMapMemory(device, object->bufferMemory, object->uniformBufferOffset + frameUniOffset,
                                                      MATRIX_4x4_SIZE, 0, reinterpret_cast<void **>(&p));
        if (res != VK_SUCCESS)
            qFatal("Failed to map memory: %d", res);

        QMatrix4x4 nodeMatrix = objectMatrix * node.transformation;
        memcpy(p, nodeMatrix.constData(), 16 * sizeof(float)); //updating matrix

        m_deviceFunctions->vkUnmapMemory(device, object->bufferMemory);

        // drawing meshes
        for (const int i: qAsConst(node.meshes)) {
            const Mesh &mesh = object->model->meshes.at(i);
            VkDeviceSize vbOffset = mesh.vertexOffsetBytes();
            m_deviceFunctions->vkCmdBindVertexBuffers(cmdBuffer, 0, 1, &object->vertexBuffer, &vbOffset);
            m_deviceFunctions->vkCmdBindIndexBuffer(cmdBuffer, object->vertexBuffer, object->indexBufferOffset + mesh.indexOffsetBytes(), VK_INDEX_TYPE_UINT32);

            m_deviceFunctions->vkCmdDrawIndexed(cmdBuffer, mesh.indexCount, 1, 0, 0, 0);
        }
    }

    m_deviceFunctions->vkEndCommandBuffer(cmdBuffer);

    return cmdBuffer;
}

The complete secondary buffer is reported back to a GUI thread, and commands can be executed on the primary buffer (unless frame rendering is canceled):

Renderer.cpp (lines 31-38):
QObject::connect(&m_renderWatcher, &QFutureWatcher<VkCommandBuffer>::resultReadyAt, [this](int index) {
       // secondary command buffer of some object is ready
       if (m_framePreparing) {
           const VkCommandBuffer & cmdBuf = m_renderWatcher.resultAt(index);
           if (cmdBuf)
               this->m_deviceFunctions->vkCmdExecuteCommands(this->m_window->currentCommandBuffer(), 1, &cmdBuf);
       }
   });
...

Another major challenge in developing the renderer was handling different types of graphical objects—those loaded from files and the dynamically generated boundary boxes that surround selected objects. This posed a problem because they use differing shaders, primitive topologies, and polygon modes. The goal was to unify the code for the different objects as much as possible to avoid replicating similar code. Both types of objects are expressed by a single class, Object3D.

In the "Renderer::initPipelines()" function, differences were isolated as function parameters and called in this way:

initPipeline(m_objectPipeline,
             QStringLiteral(":/shaders/item.vert.spv"),
             QStringLiteral(":/shaders/item.frag.spv"),
             VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST,
             m_wireframeMode ? VK_POLYGON_MODE_LINE : VK_POLYGON_MODE_FILL);

initPipeline(m_boundaryBoxPipeline,
             QStringLiteral(":/shaders/selection.vert.spv"),
             QStringLiteral(":/shaders/selection.frag.spv"),
             VK_PRIMITIVE_TOPOLOGY_LINE_LIST,
             VK_POLYGON_MODE_LINE);

It also proved helpful to unify initialization of particular objects according to their role. This is handled by the "Renderer::initObject()" function:

const PipelineHandlers & pipelineHandlers = object->role == Object3D::Object ? m_objectPipeline : m_boundaryBoxPipeline;

"Function: Renderer::initPipeline()" shows the full function. Note that in addition to object files, the boundary box is another type of object and requires its own shaders, settings, and pipeline. Minimizing coding differences between object types also helped to improve flexibility and simplify the code.

void Renderer::initPipeline(PipelineHandlers & pipeline, const QString & vertShaderPath, const QString & fragShaderPath,
                            VkPrimitiveTopology topology, VkPolygonMode polygonMode)
{
    VkDevice device = m_window->device();
    VkResult res;
    VkVertexInputBindingDescription vertexBindingDesc = {
        0, // binding
        6 * sizeof(float), //x,y,z,nx,ny,nz
        VK_VERTEX_INPUT_RATE_VERTEX
    };

    VkVertexInputAttributeDescription vertexAttrDesc[] = {
        { // vertex
            0,
            0,
            VK_FORMAT_R32G32B32_SFLOAT,
            0
        },
        { // normal
            1,
            0,
            VK_FORMAT_R32G32B32_SFLOAT,
            3 * sizeof(float) // nx,ny,nz follow x,y,z within the 6-float stride
        }
    };


    VkPipelineVertexInputStateCreateInfo vertexInputInfo = {};
    vertexInputInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
    vertexInputInfo.vertexBindingDescriptionCount = 1;
    vertexInputInfo.pVertexBindingDescriptions = &vertexBindingDesc;
    vertexInputInfo.vertexAttributeDescriptionCount = 2;
    vertexInputInfo.pVertexAttributeDescriptions = vertexAttrDesc;


    VkDescriptorSetLayoutBinding layoutBinding = {};
    layoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
    layoutBinding.descriptorCount = 1;
    layoutBinding.stageFlags =  VK_SHADER_STAGE_VERTEX_BIT;

    VkDescriptorSetLayoutCreateInfo descLayoutInfo = {
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        nullptr,
        0,
        1,
        &layoutBinding
    };

    // the view-projection matrix will be pushed to the vertex shader's push constants
    VkPushConstantRange push_constant = {
            VK_SHADER_STAGE_VERTEX_BIT,
            0,
            64
        };

    res = m_deviceFunctions->vkCreateDescriptorSetLayout(device, &descLayoutInfo, nullptr, &pipeline.descSetLayout);
    if (res != VK_SUCCESS)
        qFatal("Failed to create descriptor set layout: %d", res);


    // Pipeline layout
    VkPipelineLayoutCreateInfo pipelineLayoutInfo = {};
    pipelineLayoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
    pipelineLayoutInfo.setLayoutCount = 1;
    pipelineLayoutInfo.pSetLayouts = &pipeline.descSetLayout;
    pipelineLayoutInfo.pushConstantRangeCount = 1;
    pipelineLayoutInfo.pPushConstantRanges = &push_constant;

    res = m_deviceFunctions->vkCreatePipelineLayout(device, &pipelineLayoutInfo, nullptr, &pipeline.pipelineLayout);
    if (res != VK_SUCCESS)
        qFatal("Failed to create pipeline layout: %d", res);

    // Shaders
    VkShaderModule vertShaderModule = loadShader(vertShaderPath);
    VkShaderModule fragShaderModule = loadShader(fragShaderPath);

    // Graphics pipeline
    VkGraphicsPipelineCreateInfo pipelineInfo;
    memset(&pipelineInfo, 0, sizeof(pipelineInfo));
    pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;

    VkPipelineShaderStageCreateInfo shaderStageCreationInfo[2] = {
        {
            VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
            nullptr,
            0,
            VK_SHADER_STAGE_VERTEX_BIT,
            vertShaderModule,
            "main",
            nullptr
        },
        {
            VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
            nullptr,
            0,
            VK_SHADER_STAGE_FRAGMENT_BIT,
            fragShaderModule,
            "main",
            nullptr
        }
    };
    pipelineInfo.stageCount = 2;
    pipelineInfo.pStages = shaderStageCreationInfo;

    pipelineInfo.pVertexInputState = &vertexInputInfo;

    VkPipelineInputAssemblyStateCreateInfo inputAssemblyInfo = {};
    inputAssemblyInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO;
    inputAssemblyInfo.topology = topology;
    pipelineInfo.pInputAssemblyState = &inputAssemblyInfo;

    VkPipelineViewportStateCreateInfo viewportInfo = {};
    viewportInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO;
    viewportInfo.viewportCount = 1;
    viewportInfo.scissorCount = 1;
    pipelineInfo.pViewportState = &viewportInfo;

    VkPipelineRasterizationStateCreateInfo rasterizationInfo = {};
    rasterizationInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
    rasterizationInfo.polygonMode = polygonMode;
    rasterizationInfo.cullMode = VK_CULL_MODE_NONE;
    rasterizationInfo.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
    rasterizationInfo.lineWidth = 1.0f;
    pipelineInfo.pRasterizationState = &rasterizationInfo;

    VkPipelineMultisampleStateCreateInfo multisampleInfo = {};
    multisampleInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO;
    multisampleInfo.rasterizationSamples = m_window->sampleCountFlagBits();
    pipelineInfo.pMultisampleState = &multisampleInfo;

    VkPipelineDepthStencilStateCreateInfo depthStencilInfo = {};
    depthStencilInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
    depthStencilInfo.depthTestEnable = VK_TRUE;
    depthStencilInfo.depthWriteEnable = VK_TRUE;
    depthStencilInfo.depthCompareOp = VK_COMPARE_OP_LESS_OR_EQUAL;
    pipelineInfo.pDepthStencilState = &depthStencilInfo;

    VkPipelineColorBlendStateCreateInfo colorBlendInfo  = {};
    colorBlendInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_COLOR_BLEND_STATE_CREATE_INFO;
    VkPipelineColorBlendAttachmentState att = {};
    att.colorWriteMask = 0xF;
    colorBlendInfo.attachmentCount = 1;
    colorBlendInfo.pAttachments = &att;
    pipelineInfo.pColorBlendState = &colorBlendInfo;

    VkDynamicState dynamicEnable[] = { VK_DYNAMIC_STATE_VIEWPORT, VK_DYNAMIC_STATE_SCISSOR };
    VkPipelineDynamicStateCreateInfo dynamicInfo = {};
    dynamicInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO;
    dynamicInfo.dynamicStateCount = 2;
    dynamicInfo.pDynamicStates = dynamicEnable;
    pipelineInfo.pDynamicState = &dynamicInfo;

    pipelineInfo.layout = pipeline.pipelineLayout;
    pipelineInfo.renderPass = m_window->defaultRenderPass();

    res = m_deviceFunctions->vkCreateGraphicsPipelines(device, m_pipelineCache, 1, &pipelineInfo, nullptr, &pipeline.pipeline);
    if (res != VK_SUCCESS)
        qFatal("Failed to create graphics pipeline: %d", res);

    if (vertShaderModule)
        m_deviceFunctions->vkDestroyShaderModule(device, vertShaderModule, nullptr);
    if (fragShaderModule)
        m_deviceFunctions->vkDestroyShaderModule(device, fragShaderModule, nullptr);
}


Conclusion

Coding flexibility is a hallmark of low-level Vulkan APIs, but it's critical to remain focused on what's going on in each Vulkan step. Lower-level programming also allows for precise fine-tuning of certain aspects of hardware access not available with OpenGL. If you take it slow and build your project in small, incremental steps, the payoffs will include far greater rendering performance, much lower runtime footprint, and greater portability to a multitude of devices and platforms.

Pros and indies alike should prepare for Vulkan. This article provided a walk-through of an app that shows how to use Vulkan APIs to render multiple .fbx and .obj objects, and read and display multiple object files in a common scene. You've also seen how to integrate a file explorer window to load and render files using linear or parallel processing and compare performance of each in the UI. The code also demonstrates a simple UI to move, rotate, and zoom the objects; to enclose objects in a bounding box; render objects in wireframe mode; display object info and stats; and allow absolute coordinates and rotations to be specified.


APPENDIX: How to Build the Project

As described earlier, the minimum requirement for developing with Vulkan APIs on GPUs from Intel is a 6th Gen Intel® processor running 64-bit Windows 7, 8.1, or 10. Vulkan drivers are now included with the latest Intel HD Graphics drivers. Follow the step-by-step instructions for installing Vulkan drivers on Intel-based systems running Unity or Unreal Engine 4, and then return here.

The following steps are for building this project using Microsoft Visual Studio* 2017 from a Windows command prompt.

Preparing the build environment

1. Download the Vulkan 3D Object Viewer sample code project to a convenient folder on your hard drive.

2. Make sure your Microsoft Visual Studio 2017 setup has Visual C++. If it doesn't, download and install it from the Visual Studio site.

3. The sample app relies on the Open Asset Import Library (assimp), but the pre-built version of this library doesn't work with Visual Studio 2017; it has to be re-built from scratch. Download it from SourceForge*.

4. CMake is the preferred build system for assimp. You can download the latest version from the CMake* site or use the one bundled with Visual Studio (YOUR_PATH_TO_MSVS\2017\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin). Follow these steps to build assimp:

a. Open a command prompt (cmd.exe).

b. Set "PATH=PATH_TO_CMAKE\bin;%PATH%" (skip this step If you already set this variable permanently in your system environment variables. To do that, go to: Control Panel->System->Advanced System Settings->Environment Variables and add the line above to the list).

c. Enter "cmake -f CMakeLists.txt -G "Visual Studio 15 2017 Win64".

d. Open the generated "assimp.sln" solution file in Visual Studio, go to Build->Configuration Manager and select "Release" under Configuration (unless you need to debug assimp for some reason, building the release version is recommended for the best performance).

e. Close the configuration manager and build assimp.

5. Download and install the Vulkan SDK from the Vulkan SDK site.

6. Download and install Qt. The sample app uses Qt 5.10 UI libraries, which is the minimum version required for Vulkan support. Open-source and commercial versions will do the job here, but you'll need to register either way. To get Qt:

a. Go to qt.io and select a version.

b. Log in or register and follow prompts to set up the Qt Online Installer.

c. Next, you'll be prompted to select a version. Pick Qt 5.10 or higher and follow prompts to install.

7. Clone or download the sample app repository to your hard drive.

Building the app

8. The file "env_setup.bat" is provided to help you set environment variables locally for the command processor (a hypothetical version is sketched after these steps). Before executing it:

a. Open "env_setup.bat" and check whether listed variables point to the correct locations of your installed dependencies:

I. "_VC_VARS"—path to Visual Studio environment setup vcvarsall.bat

II. "_QTDIR"—path to Qt root

III. "_VULKAN_SDK"—Vulkan SDK root

IV. "_ASSIMP"—assimp root

V. "_ASSIMP_BIN"—path to Release or Debug configuration of binaries

VI. "_ASSIMP_INC"—path to assimp's header files

VII. "_ASSIMP_LIB"—points to Release or Debug configuration of assimp lib

b. Output from the batch file will report any paths you might have missed.

c. Alternatively, add the following to the system's (permanent) environment variables:

I. Create new variables:

1. "_QTDIR"—path to Qt root

2. "_VULKAN_SDK"—Vulkan SDK root

3. "_ASSIMP"—assimp root

II. Add to variable "PATH" values:

1. %_QTDIR%\bin

2. %_VULKAN_SDK%\bin

3. %_ASSIMP%\bin

III. Create the system variable "LIB" if it doesn't exist and add the value: %_ASSIMP%\lib

IV. Create the system variable "INCLUDE" if it doesn't exist and add the values:

1. %_VULKAN_SDK%\Include

2. %_ASSIMP%\Include

d. At the command prompt, set the current directory to the project root folder (which contains the downloaded project).

e. Run qmake.exe.

f. Start build:

I. For release: nmake -f Makefile.Release

II. For debug: nmake -f Makefile.Debug

9. Run app:

a. For release: WORK_DIR\release\model-viewer-using-Vulkan.exe

b. For debug: WORK_DIR\debug\model-viewer-using-Vulkan.exe

10. Execute the newly built Vulkan object viewer app.

11. Select the number of threads to use or check "single thread." By default, the app selects the optimal number of threads based on logical cores in the host system.

12. Click "Open models..." to load some models with a selected number of threads. Then change the number of threads and click "Reload" to load the same models with new thread settings for comparison.
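
For reference, a hypothetical env_setup.bat along the lines described in step 8 might look like the following; every path is a placeholder to adjust to your installation:

rem hypothetical env_setup.bat; adjust all paths to your installation
set _VC_VARS=C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat
set _QTDIR=C:\Qt\5.10.0\msvc2017_64
set _VULKAN_SDK=C:\VulkanSDK\1.0.65.1
set _ASSIMP=C:\libs\assimp
set _ASSIMP_BIN=%_ASSIMP%\bin\Release
set _ASSIMP_INC=%_ASSIMP%\include
set _ASSIMP_LIB=%_ASSIMP%\lib\Release
set PATH=%_QTDIR%\bin;%_VULKAN_SDK%\bin;%_ASSIMP_BIN%;%PATH%
set INCLUDE=%_VULKAN_SDK%\Include;%_ASSIMP_INC%;%INCLUDE%
set LIB=%_ASSIMP_LIB%;%LIB%
call "%_VC_VARS%" x64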

Energy Analysis with Intel® System Studio 2019 Beta

Introduction to Energy Analysis

  • Energy Analysis: The energy analysis components allow you to collect sleep-state, frequency, and temperature data to find the software that is causing unwanted power use. Intel® System Studio 2019 beta includes an energy analysis solution to best fit your environment. Refer to the Intel® Energy Analysis Help for the general energy analysis workflow.

 

Elements of Energy Analysis

  1. Intel® SoC Watch—a low-overhead command-line tool for collecting power-related metrics on Windows*, Linux*, or Android* systems.

  2. Eclipse* GUI—an Eclipse*-based GUI for collection and viewing of Intel SoC Watch data. This is currently a preview feature. You can use it in the Intel® System Studio GUI to create a project and run energy analysis on a Linux or Android target system. The results are generated in *.swjson format and can be opened and visualized with Intel System Studio. Please refer here for details.

  3. VTune™ Amplifier for Systems—in addition to performance analysis, it provides visualization for data imported from SoC Watch. Future plans for this tool include adding support to launch SoC Watch collections on remote systems. Please refer here for details.

 

What is new with Intel® SoC Watch v2.6.1 for Windows*

  • Added option to log console output to a file. Use option --log <filename> (short name -l) to log all console messages to a file.
  • Added explanatory note to reports containing the Unknown value. For -f hw-cpu-cstate and -f hw-cpu-pstate, CPU C-State and CPU P-State reports now include a note explaining the meaning of the Unknown value that may appear in these reports. The residency table note is: "Unknown" state occurs when the system enters a platform idle state (ACPI S3 or S4). The wakeup table note is: "Unknown" wakeups mean the CPU wakeup reason could not be attributed to a known cause.
  • Reduced device ID checking for PCIe metrics. The -f pcie-lpm and -f pcie-ltr features will no longer result in unknown device ID warning. It is now assumed the devices will behave in expected fashion so that users are not burdened with having to add new device IDs manually.
  • Added driver version to the summary report header. Summary reports now include the version of the Intel SoC Watch driver used to collect the data in addition to the application version number.
  • Added Alt-S as a hot key alternative to Ctrl-C for Windows 10 Desktop OS.
  • Removed several OS-based features from the -f sys group. The OS-based features which had a comparable hardware data (os-cpu-cstate, os-cpu-pstate, os-gfx-cstate) were removed from the group sys. Processing the OS-based ETL trace data is very time consuming for longer collections and often users do not need these OS-based metrics since the hardware metrics are more accurate. Removing these from the commonly used sys group reduces post-processing time and file size. Both OS and hardware based metrics are still included in the cpu and gfx group names, or can be explicitly added to the command line using their individual feature names.
  • HWP reporting is no longer included in the hw-cpu-pstate feature. The -f hw-cpu-pstate feature now collects only Core P-state frequencies to allow finer-grained selection of which data to collect. Use the new feature -f hw-cpu-hwp to collect the HWP Guaranteed, Highest, Lowest, and Most-Efficient Performance summary reports. The HWP feature is still included in the sys, cpu, and cpu-pstate groups.

What is new with Intel® SoC Watch v2.6.1 for Linux* and Android*

  • New features supported: hw-cpu-hwp to collect the HWP Guaranteed, Highest, Lowest, and Most-Efficient Performance summary reports.
  • Command line changes: New option for logging console output to a file: --log <filename> (short name -l). The -f cpu-cstate feature name has become a group name; the individual metric name is -f hw-cpu-cstate and includes C-state at the Package, Module, and Core levels as appropriate for the platform. This aligns feature names across all operating systems. The -f gfx-cstate feature name has become a group name; the individual metric name is -f hw-gfx-cstate and reports C-state residencies for the graphics subsystem. This also aligns feature names across all operating systems.
  • The driver version number has been added to the summary report header.
  • HWP reporting is no longer included in the hw-cpu-pstate feature. Use the new feature -f hw-cpu-hwp to collect the HWP Guaranteed, Highest, Lowest, and Most-Efficient Performance summary reports. The HWP feature is still included in the sys, cpu, and cpu-pstate groups.
  • The -f cpu-pstate feature is now collected by sampling rather than event trace. CPU P-state data was not being collected on platforms with the P-state driver installed, because the P-state trace events are not triggered when it is used. To avoid losing this data, CPU P-state residency is now based on P-state (processor frequency) status registers sampled during the collection. Since this is now sampled data, there is some loss of precision in the CPU P-state residency, and the report format changes.
  • The --version option output has changed.
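
As an illustration of the options named above, a hypothetical Linux collection that gathers hardware C-state, P-state, and HWP data while logging console output might look like the following (the -t duration and -o output options are assumptions here; only -f and --log/-l are documented above, so check your version's help output):

socwatch -t 60 -f hw-cpu-cstate -f hw-cpu-pstate -f hw-cpu-hwp -o results/run1 --log run1.log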

 

Inline IPsec with DPDK and Intel® 82599 Network Controller


Introduction

This article looks at inline IPsec acceleration support enablement in the Data Plane Development Kit (DPDK) framework with a particular focus on the Intel® 82599 10 Gigabit Ethernet Controller series features and support.

Inline IPsec can be used to implement IPsec-aware systems with better latency than lookaside-assisted and accelerated hardware, provided that the supported algorithm is suitable.

This article includes background information that will highlight the differences between “lookaside” hardware acceleration for IPsec and inline IPsec (i.e., the packet flow differences and the expected differences on the application side, in performance and handling).

An example of setting up a test system, installing, and running an IPsec Gateway application will be presented.

The article assumes that the DPDK release used is 17.11 or later.

Background

The DPDK Security Library provides a framework for the management and provisioning of security protocol operations offloaded to hardware-based devices. The library defines generic APIs to create and free security sessions that can support full protocol offload as well as inline crypto operations with network interface controller (NIC) or crypto devices.

The security library, which has been included in the DPDK since release 17.11, introduces APIs and features that provide a way of adding support for the inline crypto (IPsec) acceleration already available on the Intel 82599 10 Gigabit Ethernet Controller series. The DPDK IXGBE driver was also updated with support for inline IPsec.

The Intel 82599 10 Gigabit Ethernet Controller series only supports AES-GCM 128, hence the supported protocols are:

  • ESP authentication only: AES-128-GMAC (128-bit key)
  • ESP encryption and authentication: AES-128-GCM (128-bit key)

The IPsec Security Gateway sample application also supports this feature, which will be explained in more detail below.

Packet Flow

First, let’s look at the packet flow and handling through the system.

Figure 1. Packet flow diagram.

The packet flow depicted above shows the packet-processing stages for an incoming encrypted IPsec packet. In the first case, it is processed (decrypted and/or authenticated) using the lookaside Intel® QuickAssist Technology (Intel® QAT) hardware accelerator. In the second case, it is processed using the inline IPsec hardware processing available in the Intel 82599 10 Gigabit Ethernet Controller.

As presented above, the second stage of the processing (decryption and/or authentication) is combined with the packet receive stage by the network controller itself. The DPDK application still needs to process the encapsulation and de-capsulation of the Encapsulating Security Payload (ESP) packet.

The outgoing flow is identical except for the packet direction.

Software APIs and the Sample Application

Now, let’s look at the software APIs, requirements, and usage model. We will use code excerpts from the IPsec Security Gateway sample application, simplified for clarity, and with error handling removed.

First, we need to create a security session, beginning with the following configuration:

struct rte_security_session_conf sess_conf = {
    .action_type = RTE_SECURITY_ACTION_TYPE_INLINE_CRYPTO,
    .protocol = RTE_SECURITY_PROTOCOL_IPSEC,
    {.ipsec = {
        .spi = <SPI>,
        .salt = <Salt>,
        .options = { 0 },
        .direction = RTE_SECURITY_IPSEC_SA_DIR_INGRESS,
        .proto = RTE_SECURITY_IPSEC_SA_PROTO_ESP,
        .mode = RTE_SECURITY_IPSEC_SA_MODE_TUNNEL,
    } },
    .crypto_xform = <xforms>,
    .userdata = NULL,
};
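
The crypto_xform member is shown above only as the <xforms> placeholder. As a hypothetical illustration (not taken from the sample application; key_data and IV_OFFSET are stand-ins), an AES-128-GCM AEAD transform for the ingress (decrypt) direction could be set up like this:

/* hypothetical AEAD transform for the <xforms> placeholder;
 * key_data and IV_OFFSET are illustrative stand-ins */
struct rte_crypto_sym_xform aead_xform = {
    .type = RTE_CRYPTO_SYM_XFORM_AEAD,
    .next = NULL,
    .aead = {
        .op = RTE_CRYPTO_AEAD_OP_DECRYPT,
        .algo = RTE_CRYPTO_AEAD_AES_GCM,
        .key = { .data = key_data, .length = 16 }, /* AES-128: 16-byte key */
        .iv = { .offset = IV_OFFSET, .length = 12 },
        .digest_length = 16,
        .aad_length = 8,
    },
};
/* then: sess_conf.crypto_xform = &aead_xform; */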

Next, create the security session:

struct rte_security_ctx *ctx = (struct rte_security_ctx *)
    rte_eth_dev_get_sec_ctx(<portid>);
struct rte_security_session *sec_session =
    rte_security_session_create(ctx, &sess_conf, <session_pool>);

Then, create an RTE_FLOW_ACTION_TYPE_SECURITY flow:

pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
pattern[1].mask = &rte_flow_item_ipv4_mask;
pattern[2].type = RTE_FLOW_ITEM_TYPE_ESP;
pattern[2].spec = &esp_spec; /*contains the SPI*/
pattern[2].mask = &rte_flow_item_esp_mask;

action[0].type = RTE_FLOW_ACTION_TYPE_SECURITY;
action[0].conf = sa->sec_session;
action[1].type = RTE_FLOW_ACTION_TYPE_END;
action[1].conf = NULL;

flow = rte_flow_create(portid, <attr>, <pattern>, <action>, &<err>);

With this, the configuration is complete and the Security Association (SA) is active for incoming ESP packets that match the flow created. The offload flags in the “rte_mbuf” struct will indicate if the packet was processed inline by setting the PKT_RX_SEC_OFFLOAD flag and, if any error occurred, PKT_RX_SEC_OFFLOAD_FAILED will also be set.

For outgoing packets, the programming is similar except that the offload flag PKT_TX_SEC_OFFLOAD needs to be set by the application.
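
A minimal sketch of that flag handling (not from the sample application; handle_decrypted() is a hypothetical callback):

/* sketch of inline IPsec offload flag handling;
 * handle_decrypted() is a hypothetical callback */
static void process_rx(struct rte_mbuf *m)
{
    if (m->ol_flags & PKT_RX_SEC_OFFLOAD) {
        if (m->ol_flags & PKT_RX_SEC_OFFLOAD_FAILED)
            rte_pktmbuf_free(m);  /* inline processing failed: drop */
        else
            handle_decrypted(m);  /* payload was decrypted by the NIC on receive */
    }
}

static void mark_for_inline_tx(struct rte_mbuf *m)
{
    /* request inline encryption before rte_eth_tx_burst() */
    m->ol_flags |= PKT_TX_SEC_OFFLOAD;
}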

Test System Setup

In addition to DPDK and the sample application, we will use Python* scapy v2.4.0 (packet generator) and pycryptodome v3.6.0 (which provides AES-GCM support) to generate clear text and encrypted packets.

Save the following as inline_ipsec.py:

import sys
import getopt

from scapy.all import *

def main(argv):
    
    payload = 'test-' * 2000
    
    sa = SecurityAssociation(ESP, spi=5, crypt_algo='AES-GCM', 
            crypt_key='\x2b\x7e\x15\x16\x28\xae\xd2\xa6\xab\xf7\x15''\x88\x09\xcf\x4f\x3d\xde\xad\xbe\xef', 
            auth_algo='NULL', auth_key=None,
            tunnel_header=IP(src='172.16.1.5', dst='172.16.2.5'))
    sa.crypt_algo.icv_size = 16
    
    try:
        opts, args = getopt.getopt(argv, 'c:i:s:e:',
                                   ['count=', 'iface=', 'size=', 'encrypt='])
    except getopt.GetoptError:
        sys.exit(2)


    maxcount = 1
    intface = "enp2s0f0"
    paysize = 64
    do_encrypt = True

    for opt, arg in opts:
        if opt in ("-c", "--count"):
            maxcount = arg
        if opt in ("-i", "--iface"):
            intface = arg
        if opt in ("-s", "--size"):
            paysize = arg
        if opt in ("-e", "--encrypt"):
            if arg == '0' or arg == 'False':
                do_encrypt = False

    p = IP(src='192.168.105.10', dst='192.168.105.10')
    p /= "|->"
    p /= payload[0:(int(paysize) - 6)]
    p /= "<-|"
    p = IP(str(p))
    
    if do_encrypt:
        e = sa.encrypt(p)
    else:
        e = p
    
    eth_e = Ether()/e
    
    sendp(eth_e, iface=intface, count=int(maxcount))
        
if __name__ == "__main__":
    exit(main(sys.argv[1:]))

The test system configuration is shown in the figure below.

Figure 2. Test system configuration.

The configuration uses two Intel® 82599ES 10 Gigabit Ethernet Controller dual port cards connected as shown above.

In this particular configuration, card 1 ports will be assigned to the DPDK driver, port 0 BDF 06:00.0 and port 1 BDF 06:00.1. Card 0 ports will be assigned to the kernel driver, identified as enp2s0f0 and enp2s0f1. Note, in other systems the addresses and port names may be different.
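
Assigning the card 1 ports to the DPDK driver can be done with the dpdk-devbind.py script shipped with DPDK. A hypothetical invocation for the BDFs above, assuming the igb_uio module (vfio-pci is another option):

modprobe uio
insmod build/kmod/igb_uio.ko
./usertools/dpdk-devbind.py --bind=igb_uio 06:00.0 06:00.1
./usertools/dpdk-devbind.py --status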

Create the IPsec sample app configuration based on this data and save it as a file called inline.cfg. For example:

#SP IPv4 rules
sp ipv4 out esp protect 1005 pri 1 dst 192.168.105.0/24 sport 0:65535 dport 0:65535

#SA rules
sa out 1005 aead_algo aes-128-gcm aead_key 2b:7e:15:16:28:ae:d2:a6:ab:f7:15:88:09:cf:4f:3d:de:ad:be:ef \
mode ipv4-tunnel src 172.16.1.5 dst 172.16.2.5 \
port_id 1 \
type inline-crypto-offload

sa in 5 aead_algo aes-128-gcm aead_key 2b:7e:15:16:28:ae:d2:a6:ab:f7:15:88:09:cf:4f:3d:de:ad:be:ef \
mode ipv4-tunnel src 172.16.1.5 dst 172.16.2.5 \
port_id 1 \
type inline-crypto-offload

#Routing rules
rt ipv4 dst 172.16.2.5/32 port 1
rt ipv4 dst 192.168.105.10/32 port 0

Then start the application as follows:

ipsec-secgw \
	-l 6,7 \
	-w 06:00.0 -w 06:00.1 \
	--log-level 8 --socket-mem 1024,0 --vdev crypto_null \
	-- -p 0xf -P -u 0x2 \
	--config="(0,0,6),(1,0,7)" -f ./inline.cfg

Note that the NULL crypto device is only present because the application needs a “true” crypto PMD (for legacy reasons).

Testing ESP decryption and forwarding of the decrypted packets

In order to test the application, send encrypted packets on port 1:

     python inline_ipsec.py -i enp2s0f1 -s 64 -c 32 -e 1

While listening on port 0:

     tcpdump -i enp2s0f0 -vvX

Decrypted packets should be observed on port 0:

10:31:00.636564 IP (tos 0x0, ttl 63, id 1, offset 0, flags [none], 
                    proto Options (0), length 84)
    192.168.105.10 > 192.168.105.10:  ip 64
	0x0000:  4500 0054 0001 0000 3f00 0100 c0a8 690a  E..T....?.....i.
	0x0010:  c0a8 690a 7c2d 3e74 6573 742d 7465 7374  ..i.|->test-test
	0x0020:  2d74 6573 742d 7465 7374 2d74 6573 742d  -test-test-test-
	0x0030:  7465 7374 2d74 6573 742d 7465 7374 2d74  test-test-test-t
	0x0040:  6573 742d 7465 7374 2d74 6573 742d 7465  est-test-test-te
	0x0050:  733c 2d7c                                s<-|

The packet flow is as follows:

  1. ESP packet is received on port 1.
  2. Port 1 RX queue contains ESP packet with the payload decrypted.
  3. ESP decapsulation.
  4. Clear text packet containing decrypted payload is transmitted on port 0.

Testing ESP encryption

In order to test the application, send clear text packets on port 0:

     python inline_ipsec.py -i enp2s0f0 -s 64 -c 32 -e 0

While listening on port 1:

     tcpdump -i enp2s0f1 -vvX

Encrypted packets should be observed on port 1:

10:37:45.622669 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], 
                    proto ESP (50), length 140)
    172.16.1.5 > 172.16.2.5: ESP(spi=0x000003ed,seq=0x20), length 120
	0x0000:  4500 008c 0000 0000 4032 1f16 ac10 0105  E.......@2......
	0x0010:  ac10 0205 0000 03ed 0000 0020 0000 0000  ................
	0x0020:  0000 0020 fd78 c8f5 7fab 4fb3 5b98 7e79  .....x....O.[.~y
	0x0030:  81b0 b4f2 d796 ccd4 f0a7 b031 bb9b 9bde  ...........1....
	0x0040:  af18 767e 5d0f 73e3 bc82 4ea3 4afb 00eb  ..v~].s...N.J...
	0x0050:  6d02 a367 7a3a c2dd 6b64 74c1 5d41 bb45  m..gz:..kdt.]A.E
	0x0060:  7ac2 c1e0 0fb8 5f73 7fcd 4304 e396 32ea  z....._s..C...2.
	0x0070:  228e 22e5 4a3e ea72 88fb 13a7 e940 9346  ".".J>.r.....@.F
	0x0080:  4451 98cf 97fd 878c 96f0 f754            DQ.........T

The packet flow is as follows:

  1. Clear text packet is received on port 0.
  2. ESP encapsulation.
  3. Packet is placed in the TX queue on port 1.
  4. Packet is encrypted by port 1 during transmission.
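
To sanity-check the encrypted output, the captured ESP packet can be decrypted offline with the same scapy SecurityAssociation parameters used by the outbound SA. This is a minimal sketch, assuming the tcpdump capture was saved to a file (the name capture.pcap is hypothetical):

from scapy.all import rdpcap, IP
from scapy.layers.ipsec import SecurityAssociation, ESP

# Same parameters as the "sa out 1005" rule in inline.cfg
sa = SecurityAssociation(ESP, spi=1005, crypt_algo='AES-GCM',
        crypt_key='\x2b\x7e\x15\x16\x28\xae\xd2\xa6\xab\xf7\x15'
                  '\x88\x09\xcf\x4f\x3d\xde\xad\xbe\xef',
        auth_algo='NULL', auth_key=None,
        tunnel_header=IP(src='172.16.1.5', dst='172.16.2.5'))
sa.crypt_algo.icv_size = 16

pkts = rdpcap('capture.pcap')    # hypothetical capture file
clear = sa.decrypt(pkts[0][IP])  # raises an error on ICV mismatch
clear.show()                     # inner clear text IP packet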

Conclusion

This article presented the details of the inline IPsec support available with the Intel 82599 10 Gigabit Ethernet Controller, how it works internally and how it is supported by the DPDK framework, and how to build an application that can utilize the feature.

About the Author

Radu Nicolau is a network software engineer with Intel. His work is currently focused on IPsec-related development of data plane functions and libraries. His contributions include enablement of the AES-GCM crypto algorithm in the VPP IPsec stack, IKEv2 initiator support for VPP, and inline IPsec enablement in DPDK.

Deep Learning: Build a Black Box Model for Medical Professionals

$
0
0

Building a Black Box Model Using Transfer Learning

Introduction

In the 21st century, an era of big data and big innovation in medicine, we frequently hear about artificial intelligence (AI) solutions based on statistical and machine learning models that could improve disease prevention, diagnosis, and treatment.

In this paper we describe and present a method for creating models that predict illness occurrence from cheap and popular medical imaging methods such as X-rays, using a state-of-the-art deep learning model whose trained weights we partially reuse.

Black boxes

Sometimes machine learning models are used to decide whether disease has occurred or which drug will be best for a specific situation. They are called black boxes because their implementation and principle of operation aren't released to the public or aren't well understood (even by their creators); they take input data and output a diagnosis without justifying their conclusion1. This secrecy is often driven by competition between companies that don't want to release their methods, and, combined with the private patient datasets these algorithms need to work properly (they have to come from real patients), it can slow down innovation.

On the other hand, deep learning models are a set of millions of parameters whose prediction is very hard to interpret due to the high abstraction of layers and their output.

Fortunately, they are mostly based on published algorithms that are often implemented in popular open source projects (such as scikit-learn*, TensorFlow*, PyTorch*, or Keras), and we can try to replicate them using public datasets found on websites like Kaggle*. Most libraries have a Python* API, which allows users to write their programs in an easy-to-understand, higher-level language.

For this purpose, the Intel® Distribution for Python* supplies most machine learning libraries built with optimizations for Intel® Advanced Vector Extensions (Intel® AVX) instructions and more, which allows for a 10–200 times speedup2.

Transfer learning and its use in various applications

Transfer learning3 is a training technique often used with deep convolutional neural networks. It decreases the number of training samples needed for the neural network to converge, together with the computation cost. We can assume that low-level and sometimes mid-level features are generic, and therefore similar between the base image dataset and a purpose-specific dataset. To reuse a pretrained layer's weights, the first few layers are frozen (not adjusted) during the training, and the others are fine-tuned (adjusted for the specific task, sometimes with a smaller learning rate).

This training method has been successfully applied by researchers from Yonsei University College of Engineering to resolve classification problems on histopathology images of breast cancer with an area under the receiver-operating characteristics (ROC) curve (AUC) of 0.93, by reusing Google Inception* v3 pretrained model4.

Models such as VGG-16*, Inception, and others are mostly trained on the ImageNet* Large Scale Visual Recognition Competition dataset, which contains multiple representations of images in 1,000 categories. Using VGG-16 as an example, we can observe the progression of features during forward propagation through the neural network5. In the first convolutional layers the network focuses on basic shapes such as lines and arcs; further on, the features that the filters look for become more abstract.

flow of features filtering

We can assume that filters for basic shape detection will be the same for each dataset. Deeper layers correspond to higher-level features that could differ for a new dataset; still, the weights of layers trained on the ImageNet dataset can make training easier when fine-tuned.

Lastly, the final dense layers (including the prediction layer) are often initialized from scratch, especially if the number of classes in the new dataset differs from the ImageNet one. Another reason to train dense layers from scratch is that there is often nothing useful in their weights to reuse, because dense connections correspond only to the old output classes; dense layers differ in this respect from convolutional layers.

Using libraries such as Keras, we can import ready-to-go pretrained models like VGG-16 with an option to reject the top dense layers and add our own (a minimal sketch follows the loop below). Freezing the layers that we choose not to train is a simple for loop:

for layer in model_final.layers[:10]:
    layer.trainable = False
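
For context, here is a minimal sketch of loading the pretrained convolutional base without its dense top and attaching a new classifier head; it mirrors the full training script shown later in this article:

from keras.applications.vgg16 import VGG16
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.models import Model

# Convolutional base pretrained on ImageNet, dense top rejected
base = VGG16(weights='imagenet', include_top=False, input_shape=(384, 384, 3))

# New dense head, initialized from scratch
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(2, activation='softmax')(x)

model_final = Model(inputs=base.input, outputs=out)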

We can conduct transfer learning to various degrees depending on the size and type of the dataset used for training, the target accuracy of the model, and the hardware resources. Getting the best results (accuracy) sometimes requires several experiments.

This simple scheme presents a few possible options for applying transfer learning. Option A is less invasive and is based on training only the dense layers (from scratch), leaving convolutional layers frozen during the training. Option B (the one I have chosen for this use case) is slightly more invasive because, as in A, we train the dense layers from scratch, but we also fine-tune convolutional layers to adapt them to the new images.

flow of features filtering

The following table details transfer learning possibilities, both advantages and disadvantages:

Range of Trained Layers                        | A                                                       | B
Layers trained from scratch                    | Hidden and final (prediction) dense layers              | Hidden and final (prediction) dense layers
Fine-tuned layers                              | -                                                       | Part of the higher-level convolutional layers
Frozen layers                                  | All non-dense layers                                    | Lower-level convolutional layers
Amount of training data                        | Small                                                   | Big
Similarity of training dataset to the base one | Big                                                     | Small
Flexibility                                    | Small: higher-level convolutional filters stay the same | Big: room to improve convolutional filters of some layers
Utilization                                    | Quick training for the easy use case                    | Epoch length may be similar to training from scratch, but thanks to the pretrained layers network performance may be better

You might ask what the difference is between training only dense layers and also fine-tuning convolutional layers. The answer is simple: even if the VGG-16 architecture works very well for ImageNet challenge image categories, your medical use case data might be completely different, and filters in convolutional layers might need to be adjusted. In our case, VGG-16 with just the option-A type of learning did not seem to converge, and since we have a large dataset to utilize, I chose to train more layers. This gained a few percent more accuracy. Your data situation might be different, however, and you will need to try different numbers of trained layers.

Use case

Transfer learning applied to the National Institutes of Health (NIH) Chest X-ray dataset from Kaggle.

Problem statement

To present deep learning methods for medical imaging diagnostics we use the transfer learning method to fine-tune VGG-16 pretrained on an ImageNet dataset for classification of chest X-ray images to determine whether the patient is healthy or whether he or she has pulmonary infiltration. Due to the significant difference of this data compared to the ImageNet dataset, most of the convolutional layers have been fine-tuned, and dense layers have been trained from scratch.

Software and hardware configurations

All data manipulation and deep neural network training were conducted on the Intel® AI DevCloud using Intel Distribution for Python 2018 version 2, which allows for using multiple nodes. Each node consists of Intel® Xeon® Gold 6128 processors and 192 GB of RAM. There is also 200 GB of storage per user and a preconfigured Intel Distribution for Python with Jupyter Notebook* enabled, plus optimized distributions of the following libraries:

  • neon™ framework
  • Intel® Optimization for Theano*
  • Intel® Optimization for TensorFlow*
  • Intel® Optimization for Caffe*
  • Intel Distribution for Python (including NumPy*, SciPy*, and scikit-learn)
  • Keras (2.2.0)
  • Keras-Applications (1.0.2)
  • Keras-Preprocessing (1.0.1)
  • keras-vis (0.4.1)

TensorFlow was additionally updated to 1.6 to gain maximum performance6.

These builds are optimized for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions and utilize the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for highly optimized, fast computations.

Dataset

This dataset was recently released by NIH7 and consists of 112,120 X-ray images with disease labels from 30,805 unique patients. This medical imaging technique is much cheaper than other methods including computed tomography (CT) imaging, although medical diagnosis based on this data may be more difficult in some cases.

The authors state that labeling isn't 100 percent accurate; rather, estimated at 90 percent, because it comes from natural language processing (NLP) from data mining of text diagnoses, so this should be our baseline.

Code snippets with explanations

First, we need to preprocess image data.

Index | Image Index      | Finding Labels | Follow-up # | Patient ID | Patient Age | Patient Gender | View Position | Original Image Width | Original Image Height | Original Image Pixel Spacing x
34708 | 00009142_001.png | Mass           | 1           | 9142       | 40          | M              | AP            | 2500                 | 2048                  | 0.168000
82366 | 00020264_001.png | Atelectasis    | 0           | 20264      | 7           | M              | PA            | 2458                 | 1953                  | 0.143000
2070  | 00000538_001.png | No Finding     | 0           | 538        | 72          | F              | PA            | 2992                 | 2991                  | 0.143000
58496 | 00014465_001.png | No Finding     | 11          | 14465      | 64          | M              | AP            | 2500                 | 2048                  | 0.168000
98154 | 00025919_001.png | No Finding     | 0           | 25919      | 53          | F              | PA            | 2021                 | 2021                  | 0.194311

As we can see, the comma-separated values give basic information about each image's location and the patient's gender and age; most importantly, the image size isn't constant, while a fixed size is required for training and final evaluation.

But before that, we need to extract only images of healthy patients and patients with pulmonary infiltration.

For this use case we will train a neural network for a binary problem: healthy versus pulmonary infiltration.

In [13]:

import os
from glob import glob

# all_xray_df is the dataset's metadata table, loaded from its CSV in an earlier cell
all_image_paths = {os.path.basename(x): x for x in 
                   glob(os.path.join('data',  'images', '*.png'))}
print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])
all_xray_df['path'] = all_xray_df['Image Index'].map(all_image_paths.get)
all_xray_df['infiltration'] = all_xray_df['Finding Labels'].map(lambda x: 'Infiltration' in x)
all_xray_df.sample(3)

Scans found: 112120, Total Headers 112120

Index | Image Index      | Finding Labels | Follow-up # | Patient ID | Patient Age | Patient Gender | View Position | Original Image [Width, Height] | Original Image Pixel Spacing [X, Y] | Unnamed: 11 | path                         | infiltration
12540 | 00003275_004.png | No Finding     | 4           | 3275       | 41          | F              | PA            | 2048, 2500                     | 0.168, 0.168                        | NaN         | data/images/00003275_004.png | FALSE
45791 | 00011723_018.png | No Finding     | 18          | 11723      | 66          | M              | AP            | 2500, 2048                     | 0.168, 0.168                        | NaN         | data/images/00011723_018.png | FALSE
89096 | 00022116_000.png | No Finding     | 0           | 22116      | 46          | M              | PA            | 3056, 2544                     | 0.139, 0.139                        | NaN         | data/images/00022116_000.png | FALSE

By resampling the data, we can balance the number of images in both categories, in random order.

Let's examine the distribution of binary labels.

In [15]:

all_xray_df['infiltration'].hist(figsize = (10, 5))

graph

Now, we balance the distribution in sets.

In [16]:

all_xray_df = all_xray_df.groupby(['infiltration']).apply(lambda x: x.sample(6000, replace = True)).reset_index(drop = True)
all_xray_df[['infiltration']].hist(figsize = (10, 5))

graph

Another important aspect of training a neural model is to split data into training and test datasets, and not allow any information to leak, for proper evaluation.

Split data into training and validation.

In [17]:

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(all_xray_df, 
                                   test_size = 2000, 
                                   random_state = 2018,
                                   stratify = all_xray_df[['infiltration', 'Patient Gender']])
train_df.to_csv('train.csv')
test_df.to_csv('test.csv')

Train samples: 10,000; test samples: 2,000.

Create Train and Test Datasets

In [2]:

import numpy as np
import pandas as pd
import h5py
import matplotlib.pyplot as plt
from skimage import transform, color
from tqdm import tqdm
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
IMG_SIZE = (384, 384)
def load_data(in_df,IMG_SIZE=IMG_SIZE,y_col='infiltration'):
    images = []
    for file in tqdm(in_df['path'].values):
        image = plt.imread(file)
        image_resized = transform.resize(image, IMG_SIZE, mode = 'constant')
        if not len(image_resized.shape) == 2:
            image_resized = color.rgb2gray(image_resized)
        image_resized = (image_resized - image_resized.min()) / (image_resized.max() - image_resized.min())
        images.append((image_resized*255).astype(np.uint8))
    out_img = np.expand_dims(np.array(images),axis=-1)
    out_img = np.concatenate((out_img, out_img, out_img), axis = -1)
    return (out_img,in_df[y_col].values)
train_X, train_Y = load_data(train_df)
test_X, test_Y = load_data(test_df)
with h5py.File("xray_dataset4.h5", "w") as h5f:
    h5f.create_dataset('train_X', data=train_X)
    h5f.create_dataset('train_Y', data=train_Y)
    h5f.create_dataset('test_X', data=test_X)
    h5f.create_dataset('test_Y', data=test_Y)
    h5f.create_dataset('z_param', data=np.array([train_X.mean(), train_X.std()]))
print('Images have been saved')

100%|██████████| 10000/10000 [04:26<00:00, 37.50it/s]
100%|██████████| 2000/2000 [02:05<00:00, 15.93it/s]
Images have been saved

As you probably know, deep learning likes a lot of data for training. We are providing 10,000 images for training, which should be more than enough for the binary classification problem.

This simple script does its job by loading all the images, resizing them to 384 x 384 resolution, and saving them to an HDF5 file for later use. We also save the mean and standard deviation of the images for standardization later.

Another often-used trick is to augment the data with small rotations, zooms, and shifts, so in each epoch the neural network sees slightly distorted variants of the images rather than identical copies. We utilize this method with random horizontal flips, shifts of up to 15 percent in both width and height, random rotations of up to five degrees, shearing by 1 percent maximum, and zooming in the 0–10 percent range.

At each epoch, the neural network won't be able to overfit too much to training data, because each time it will be differently distorted.

In [14]:

%%writefile generator.py
from keras.preprocessing.image import ImageDataGenerator
def get_img_gen():
    core_idg = ImageDataGenerator(samplewise_center=False, 
                                  samplewise_std_normalization=False, 
                                  horizontal_flip = True, 
                                  vertical_flip = False, 
                                  height_shift_range = 0.15, 
                                  width_shift_range = 0.15, 
                                  rotation_range = 5, 
                                  shear_range = 0.01,
                                  fill_mode = 'nearest',
                                  zoom_range=0.10)
    return core_idg

Overwriting generator.py

Here are example results of augmented training samples. As you can see they are distorted, which hopefully helps to train the neural network with more diverse data so it will generalize better on the test dataset.

multiple lungs x-rays

The next part is to create a model and freeze specific layers to preserve already trained, low-level features trained on the ImageNet dataset. On the first epoch, we will pre-train newly added layers so they can keep up with other layers that had a warm start. Then, we will train all layers but the first 10, as in option B.

In [45]:

.
.
.
vgg16 = VGG16(input_shape =  train_X.shape[1:], 
              include_top = False, 
              weights = 'imagenet')
x = vgg16.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(2, activation='softmax')(x)

model_final = Model(inputs=vgg16.input, outputs=x)
# Let's freeze vgg16 weights, so they won't be initialized from scratch
for layer in vgg16.layers:
    layer.trainable = False

# Update only the newly added dense layers for one epoch,
# so their weights won't be random when fine-tuning starts
model_final.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model_final.fit_generator(get_img_gen().flow(train_X, train_Y, batch_size = batch_size), 
                    steps_per_epoch=len(train_X) // batch_size, 
                    epochs = 1, max_queue_size=10, workers=12, verbose = 1)

# Now let's freeze only the first 10 layers (2 conv blocks) and train the rest
for layer in model_final.layers[:10]:
    layer.trainable = False
for layer in model_final.layers[10:]:
    layer.trainable = True
sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
model_final.compile(optimizer=sgd, 
                    loss='binary_crossentropy', 
                    metrics=['accuracy'])
from keras.callbacks import ModelCheckpoint, EarlyStopping
weight_path = "model_best4.h5"
checkpoint = ModelCheckpoint(weight_path, monitor='val_loss', verbose=1, 
                             save_best_only=True, mode='min', save_weights_only = False)
early = EarlyStopping(monitor="val_loss", 
                      mode="min", 
                      patience=3)
callbacks_list = [checkpoint, early]
history = model_final.fit_generator(get_img_gen().flow(train_X, train_Y, batch_size = batch_size), 
                    steps_per_epoch=len(train_X) // batch_size,
                    validation_data = (test_X, test_Y), 
                    epochs = 30, max_queue_size=10, workers=12, 
                    callbacks = callbacks_list, verbose = 1)
(pd.DataFrame(history.history)).to_csv('history.csv')

Overwriting train.py

By training one epoch on only the new dense layers and then training all layers but the first 10 (the first 10 can stay useful even with different data), we give the dense layers an initial starting point so they can keep up with the training of the other layers. In my tests this gained an additional 0.01–0.02 accuracy on test data. To improve generalization, I used dropout and data augmentation, which allowed for steady training.

graph

This figure presents training after the initial warm-start epoch, which is why the curves aren't so steep. On the x-axis are epochs (starting from index 0, which is the first epoch), and on the y-axis are the measured parameters: the model's accuracy and loss for both the training and validation datasets. Validation statistics are measured at the end of each epoch; for training data they are measured after each batch.
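
A minimal sketch for reproducing such curves from the history.csv file saved by the training script (assuming the Keras 2.x metric names acc, val_acc, loss, and val_loss):

import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv('history.csv')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Keras 2.x logs accuracy as 'acc'/'val_acc' and loss as 'loss'/'val_loss'
history[['acc', 'val_acc']].plot(ax=ax1, title='Accuracy')
history[['loss', 'val_loss']].plot(ax=ax2, title='Loss')
ax1.set_xlabel('epoch')
ax2.set_xlabel('epoch')
plt.show()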

Results

Now let's evaluate the network's training performance on the test data (separate dataset) without data augmentation.

                | Precision | Recall | F1-score | Support
Healthy         | 0.68      | 0.76   | 0.72     | 1000
Infiltration    | 0.73      | 0.64   | 0.68     | 1000
Average / total | 0.70      | 0.70   | 0.70     | 2000

This is a solid result for such data. The ROC curve looks even better (0.76 AUC), which means our model separates the two classes reasonably well.

graph
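
For reference, a report and AUC like these could be computed with scikit-learn roughly as follows; this is a sketch assuming test_Y holds the binary labels from the HDF5 file and model_final is the trained network:

import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

# Predicted class probabilities, shape (n_samples, 2)
pred_Y = model_final.predict(test_X, batch_size=32)

print(classification_report(test_Y, np.argmax(pred_Y, axis=1),
                            target_names=['Healthy', 'Infiltration']))
print('AUC: %.2f' % roc_auc_score(test_Y, pred_Y[:, 1]))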

We can compare it to the accuracy measured on test data with the same augmentation as that used during the training.

                | Precision | Recall | F1-score | Support
Healthy         | 0.49      | 0.60   | 0.54     | 1000
Infiltration    | 0.48      | 0.37   | 0.42     | 1000
Average / total | 0.48      | 0.48   | 0.48     | 2000

graph

This ROC curve for test data with augmentation shows how hard it is to classify unknown data with additional distortions; performance is close to random guessing, which is why we evaluate on test data without distortions.

To visualize which regions the neural network focused on when diagnosing patients with pulmonary infiltration, we use the keras-vis* library, which provides an easy-to-use API for gradient-weighted class activation mapping (Grad-CAM). These heat maps highlight the regions with high importance for classification into a specific class.
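
A minimal sketch of extracting one such heat map with keras-vis 0.4.1; the choice of sample and the linear-activation swap (recommended by the keras-vis documentation) are assumptions:

from keras import activations
from vis.utils import utils
from vis.visualization import visualize_cam

# Swap the final softmax for a linear activation before computing gradients
layer_idx = -1  # final (prediction) dense layer
model_final.layers[layer_idx].activation = activations.linear
model_vis = utils.apply_modifications(model_final)

# Grad-CAM for class 1 (infiltration) on a single test image
heatmap = visualize_cam(model_vis, layer_idx,
                        filter_indices=1, seed_input=test_X[0])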

heat maps of multiple lungs

Wow! Our black box model seems to understand that lungs are what matters in this task. It appears able to detect infiltration, which could help doctors save time in the detection of pulmonary abnormalities, and perhaps other diseases.

Summary

We have successfully trained a VGG-16 neural network using transfer learning for new X-ray chest data by reusing some of the layers and fine-tuning others. This method can be further extended for new labels and data. Class activation maps have proven that a neural network uses visual data of lungs to classify them.

This table presents a comparison of training time (with data augmentation) on the standard TensorFlow wheel and the Intel optimized one.

                                       | Warm-Start Training Epoch (Dense Layers Only) | Epoch (All but First 10 Layers Trained)
Intel® optimized TensorFlow* wheel 1.6 | 650 s/epoch                                   | 1301 s/epoch
Pip* TensorFlow wheel 1.6              | 1713 s/epoch                                  | 3277 s/epoch

By using the Intel® Optimization for TensorFlow* 1.6 with the Intel MKL-DNN wheel, I achieved about a 2x improvement in training time compared to the standard wheel6.

This can lead to time and money savings, allowing professionals to train deep learning models quicker by utilizing resources better.

GitHub* gist link for the project: Pulmonary infiltration prediction from Chest x-rays with pretrained VGG16 and fine tuning of dense layers.

References

  1. W. Nicholson Price II, Regulating Black-Box Medicine, Michigan Law Review, vol. 116, no. 3, 2017.
  2. Built for Speed.
  3. M. Oquab, L. Bottou, and I. Laptev, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  4. J. Chang, J. Yu, and T. Han, A method for classifying medical images using transfer learning: A pilot study on histopathology of breast cancer, IEEE 19th International Conference on e-Health Networking, Applications and Services, 2017.
  5. K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, Computer Vision and Pattern Recognition, 2014.
  6. Intel / Packages / Tensorflow.
  7. NIH Chest X-rays.

Intel® Parallel Computing Center at University of Florida

$
0
0

University of Florida Logo

Principal Investigators

David Ojika is a fourth-year doctoral student of computer engineering, working with Dr. Darin Acosta. Having received his Master’s degree from California State University, David has completed several internships with Intel, working on near-data processors and heterogeneous chip architectures.

Kyuseo Park is a second-year doctoral student in the Department of Computer & Information Science Engineering (CISE). He received his Master’s degree from New York State University in 2014, with a major focus on high-performance databases and spatial databases.

Jingyin Tang is a doctoral candidate in the Department of Geography, with a concurrent Master’s degree in computer science from CISE. His research focuses on radar meteorology, tropical meteorology, and high-performance computing in spatial modeling and mesoscale weather modeling.

Description

Machine learning (David Ojika)

Fully titled “The Potential of the Intel architecture for Machine Learning in High-energy Physics,” this project is specifically interested in the large-scale deployment of hardware accelerators, and the acceleration of machine learning for domain-specific workloads in high-energy physics and image understanding.

Climate and weather (Kyuseo Park and Jingyin Tang)

Real-time 3D multi-radar grids of weather radar data are crucial for warning decisions about damaging wind and flooding during hurricane events, yet creating these 3D grids is extremely challenging due to huge data volumes and the complexity caused by heterogeneous scanning patterns among neighboring radar stations. To meet the high demand for open-source radar applications and the critical performance pressure of warning decision support, the project aims to optimize the Radx code and add functionality so the system can produce the output needed for warning decision support in hurricane events.

Publications

David Ojika, 2017, Accelerating High-energy Physics Exploration with Deep Learning, University of Florida.

Related Websites

https://www.rc.ufl.edu/research/impact/ipcc/


Using Natural Language Processing for Smart Question Generation

$
0
0

Introduction

Automatic question generation is part of Natural Language Processing (NLP). It is an active area of research where many researchers have presented their work, and it is still being studied to achieve higher accuracy. Numerous techniques and models have been developed to generate different types of questions automatically, and work has been done in many languages.

Nowadays, teachers, professors, and tutors spend a lot of time generating test papers and quizzes manually. Similarly, students spend a lot of time on self-analysis (self-calibration), and they depend on their mentors for it. Hence, we are working in this area of NLP, which has huge scope for development at this moment. We want to build a computer application that helps students calibrate themselves and removes any dependency on mentors: students provide the text of whatever material they studied, and on this basis they get a set of questions with answers for self-analysis. Mentors can use the same approach for creating test papers and quizzes.

Moreover, online examinations have become very popular, including major examinations such as GATE, CAT, and NET. Multiple-choice questions (MCQs) are very easy to evaluate, and their evaluation is implemented through computerized applications, so results can be declared within a few hours and the evaluation process is fully objective.

By building this computerized application, we can reduce the educator's workload. Much time can be saved if we know what appropriate questions can be asked about a given input text.

Hence, we want to develop a system which can generate various logical questions from the given text input. Right now, only humans are capable of accomplishing this.

Implementation

Diagram process of question

Our system works with the following strategy:

Step 1. Select the best potential set of sentences from the given text input from which we could generate the questions. (Sentence Selection)

Step 2. Find the subject and context of the sentence to find its core agenda. (Gap Selection)

Step 3. Analyze which is the best form of question that can be generated from that sentence. (Question Formation)
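
To make the three steps concrete, here is a minimal sketch of the pipeline using NLTK; the helper names are ours, and a real implementation would use richer features (parse trees, coreference) for sentence and gap selection:

import nltk  # requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data packages

def select_sentences(text):
    """Step 1: keep sentences that contain at least one named entity."""
    selected = []
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        if any(isinstance(node, nltk.Tree) for node in tree):
            selected.append(sent)
    return selected

def select_gaps(sentence):
    """Step 2: candidate gaps are the named entities in the sentence."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    return [' '.join(word for word, tag in node.leaves())
            for node in tree if isinstance(node, nltk.Tree)]

def make_blank(sentence, gap):
    """Step 3 (simplest form): blank out the gap to form a question."""
    return sentence.replace(gap, '______'), gap

text = "Hinton is a British cognitive psychologist and computer scientist."
for sent in select_sentences(text):
    for gap in select_gaps(sent):
        question, answer = make_blank(sent, gap)
        print(question, '->', answer)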

Step-by-step implementation

Input text

Hinton is a British cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. Hinton was one of the first researchers who demonstrated the use of generalized backpropagation algorithm for training multilayer neural nets. He is a leading figure in the deep learning community. Hinton is called by some as the "Godfather of Deep Learning."

Preprocessed text

  • Hinton is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks.
  • Hinton was one of the first researchers who demonstrated the use of the generalized backpropagation algorithm for training multilayer neural nets.
  • He is a leading figure in the deep learning community.
  • Hinton is known by some as the "Godfather of Deep Learning."

Step 1 output: Potential set of sentences

  • Hinton is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks.
  • Hinton was one of the first researchers who demonstrated the use of the generalized backpropagation algorithm for training multilayer neural nets.
  • Hinton is known by some as the "Godfather of Deep Learning."

Step 2 output: Subject and context of each sentence

Example sentence: Hinton is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks.

  • Subject 1: Hinton
  • Subject 2: a British cognitive psychologist and computer scientist
  • Subject 3: work on artificial neural networks

The same is done for all other sentences that were selected in step 1.

Step 3 output: Question formation

We support two types of questions: fill-in-the-blank statements and answer in brief type of questions.

Example sentence: Hinton is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks.

Output of fill-in-the-blank statements:

  • ______ is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks.
  • Hinton is a ______.
  • Hinton is a British cognitive psychologist and computer scientist, most noted for his work on ______.

Output of fully stated questions (generated from the fill-in-the-blank statements):

  • Who is a British cognitive psychologist and computer scientist most noted for his work on artificial neural networks?
  • Who is Hinton?
  • Hinton is most noted for his work on what?

Ongoing Work

Until now, we have succeeded in forming two types of questions: fill-in-the-blank statements and fully stated questions (which are generated from the fill-in-the-blank statements). The second part (question generation from blanks) is mostly hardcoded right now.

Next, we want to implement it using encoder-decoder networks, which should increase the quality drastically. Encoder-decoder networks, built from recurrent neural networks, are what Google uses for its neural machine translation. With the encoder-decoder at the core, we also take help from the Stanford Parser and NLTK for grammar analysis and more basic natural language analysis.

Diagram internal view

This image shows how the encoder-decoder network works internally.

Encoder

The encoder takes a preprocessed sentence from the input text and converts it according to the weights of the hidden layer. This hidden layer creates an intermediate representation of the input text and passes it to the decoder.

Decoder

The decoder converts the hidden-layer information into question form. Machine translation uses the same concept. Here, we treat questions, essentially, as another language.
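
A minimal sketch of such an encoder-decoder in Keras, treating questions as the target language; the vocabulary size, hidden dimension, and one-hot token encoding are illustrative assumptions:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_tokens = 5000  # assumed vocabulary size
latent_dim = 256   # assumed size of the intermediate representation

# Encoder: read the source sentence, keep only the final LSTM states
encoder_inputs = Input(shape=(None, num_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the question, conditioned on the encoder states
decoder_inputs = Input(shape=(None, num_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=[state_h, state_c])
decoder_outputs = Dense(num_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')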

The programming is done with the help of the Intel® Distribution for Python*, which makes it very fast and efficient. The speed boost comes from the Intel® Math Kernel Library (Intel® MKL), a collection of routines that use the capabilities of recent Intel processors to provide better performance for common data-science tasks, such as linear algebra or fast Fourier transforms. We use the Intel® AI DevCloud for testing our data models. The Intel AI DevCloud is free cloud compute, available to Intel® AI Academy members and powered by Intel® Xeon® Scalable processors, for machine learning and deep learning training and inference needs.

Conclusion

Our system can be used in multiple self-analysis scenarios. For example, students can use it to make learning easier as well as more interactive and interesting. Teachers and professors can use this system to quickly create a quiz. A central examination board can use this system to generate a unique test that is not known to any professor, eliminating the possibility of cheating and thereby securing the privacy and integrity of the examination.

There are very few competitors in this field of NLP. The major competitor is IBM Watson*, which can answer any question but cannot (so far) generate questions itself.

For more details, visit the GitHub* repository.


Entrib* and Intel Bring Manufacturers the Industrial Internet of Things Advantage

$
0
0

Access holistic real-time insight across shop floors and factories

"To facilitate our clients and boost efficiency and ROI by improving productivity on the shop floor."

— Entrib* mission statement

Executive Summary

Manufacturing equipment is typically designed to gather data, but this information is difficult to consolidate, access, and analyze in a timely manner, causing issues from poor efficiency to unplanned downtime. Entrib Technologies Pvt. Ltd.'s data monitoring solution ShopWorx*, running on Intel® architecture-based gateways, delivers holistic data analysis, shop floor monitoring, and real-time reporting to manufacturers.

Challenges

In today's competitive climate, manufacturers grapple with high rejection rates, frequent machine downtime, and poor productivity. Digital and smart manufacturing blended with the Industrial Internet of Things (IIoT) provides them competitive advantages ranging from operational efficiency to reduced wastage. However, proprietary equipment and multiple incompatible systems often prevent a holistic understanding of the factory floor, and this problem is magnified for managers of multiple plants across different geos.

Solution

Entrib provides a real-time data monitoring, analytics, and reporting solution, ShopWorx*, to ensure smooth operations on the manufacturing shop floor. In conjunction with the Intel architecture-based gateway, the solution brings digital manufacturing to today’s industrial plants. Using real-time data from devices, sensors, machines, and big data analytics—combined with the flexibility of mobile apps—ShopWorx brings people, machines, and processes together to solve the unique challenges of the manufacturing industry.

Manufacturing facilities require access to accurate, real-time data. Real-time monitoring (RTM) conducted with the Entrib and Intel solution gives manufacturers the capability to monitor processes and access visual insights into data. Users get instant alerts and notifications for business-critical events. This can significantly reduce the response time for issues on the shop floor. Data is available in the form of dashboards, reports, and graphs on any device, including tablets and smartphones.

Entrib deploys specialized sensors, real-time data communication with remote systems, and intelligent software applications in manufacturing plants to act as virtual assistants. These virtual assistants help ensure lower rejection rates and improve quality, while offering greater control. The machines automatically and instantaneously send alerts on downtime for immediate corrective action and to maintain productivity targets.

Custom solutions are available to meet the specific requirements of plastics manufacturers and the automotive assembly line.

Ubiquitous, 24/7 on-the-go access to shop floor operations

  • Reduce downtime
  • Reach new productivity benchmarks
  • Manage operations from anywhere
  • Monitor shop floor performance
  • Increase visibility, analytics, and alarm generation
  • Access a reliable view of plants and the status of delivery targets
  • View advanced reports targeting critical variables, including production, rejection, downtime, and OEE
  • Access accurate Pareto charts for more effective analysis

ShopWorx and Intel deliver data on key factors
ShopWorx* combines with the Intel® architecture-based gateway to deliver data on key factors impacting manufacturing operations

Tools for Accurate, Timely Decision-Making

CxO dashboard

  • One central access point to production information from all manufacturing facilities, including visualization of performance KPIs and cost analysis
  • Monitor performance of all plants in multiple geos and locations from tablet, laptop, or smartphone
  • View key metrics in real time

Plant monitoring via TV dashboards

  • Allow control room setup for monitoring plant performance
  • Get real-time production status of machines, alarms, and notifications for production-critical events

Advanced analytics

  • Machine learning based algorithms provide recommendations for improvement in overall equipment effectiveness (OEE) and help in the production planning process

Digital production log

  • Real-time performance monitoring of shop floor
  • Multiplant production tracking
  • Alarm and escalation mechanism

Real-time machine and production status

  • Take immediate actions on real-time production feedback

Management console for reports and analytics

  • More than 50 reports featuring specific analytics to address problems faster

Advancing Edge and Cloud Intelligence

Intel and its ecosystem help businesses use the IoT to solve long-standing industry-specific challenges. Quickly develop IoT solutions that connect things, collect data, and derive insights with Intel’s portfolio of open and scalable solutions so you can reduce costs, improve productivity, and increase revenue.
Intel® technologies support the rigorous requirements for programmable logic controllers (PLCs), industrial PCs (IPCs), human machine interfaces (HMIs), robotics, machine vision, and many other industrial applications.

Key Benefits

Get holistic visibility into operations—ShopWorx powered by Intel architecture brings advantages across the manufacturing facility.

Business owner

Personalized dashboards for one-point access to real-time multiplant production data.

Plant head

Analytical data insight and automated digital production log to improve production efficiency.

Senior managers

Reports providing increased visibility and control over respective departments.

Supervisor

SMS alerts and notifications to address machine issues, and mobile access to cross-factory insight to eliminate operator dependency for production data.

Operations

Use the Production Log App or the Planning App Lite to stay updated on the shop floor, change plans dynamically, and maintain a digital log. View the current status of machines and plans, update plans to respond to real-time customer change requests, publish the day’s plans, and track plan deviations.

Production

ShopWorx enables monitoring of current production status for every machine and the shop floor as a whole with alerts for downtime and deviations from planned production.

Monitor planned vs. actual production, track wastage and materials consumption, and view automatic production data reports. Quality inspectors can enter data in real time based on production status using tablets, PCs, or smartphones. Mobile apps provide rejection analysis and machine performance measurement.

Maintenance

Maintenance and tooling departments receive alerts when machines go down and need repair. View analytics and downtime causes on tablets and mobile devices.

How It Works in Brief

The Entrib and Intel® solution gathers, filters, transmits, and analyzes data in real time throughout the manufacturing facility, including sensors, machines, devices, and data stores. Mobile apps parse the data for reportage and edge intelligence, and send alerts and notifications for issues from downtime to equipment malfunctions. Apps provide easy-to-read visualization of key data parameters and metrics, and can be accessed from any location on any device.

 ShopWorx aggregates data across the shop floor

Mobile apps bring instant access to the shop floor

Plant monitoring

Track machine production status.

Real-time graphs

Track production cycle accuracy across plant machines, and gauge the pulse of the shop floor.

Planning app

Create and modify machine production plans.

Wastage app

Manage machine downtimes and material wastages using a tablet-based app.

The Foundation for IoT

The Entrib solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and industry.

Conclusion

Entrib's ShopWorx, combined with the Intel architecture-based gateway, brings the competitive advantages of IIoT to manufacturers. Designed to meet the specific requirements of industrial operations, ShopWorx allows real-time decision-making based on relevant data to maximise efficiency, productivity, and proactive maintenance.

About Entrib

Entrib is taking digital manufacturing to the next level, bringing critical shop floor information across a range of essential parameters to mobile devices. Providing more comprehensive insight and control is helping manufacturers to excel and to increase shop floor efficiency and productivity.

Learn More

For more information about Entrib, please visit entrib.com or contact us at info@entrib.com.

For more information about Intel® IoT Technology and the Intel IoT Solutions Alliance, please visit intel.com/iot

Reduce IT Costs and Increase Security of Building Automation at Scale with IoTium* and Intel

$
0
0

Building automation systems help facility managers to meet sustainability goals, lower energy use, and save costs. Traditionally, connecting building automation systems to the cloud has been expensive, time-consuming, and labor-intensive, often requiring complicated IT systems and putting buildings at risk for security breaches.

IoTium* offers a solution that is simple to deploy, manage, and maintain—significantly reducing IT maintenance costs—while exponentially enhancing security. Whether managing hundreds of buildings or thousands, facility managers can migrate building automation management to the cloud, without disrupting operations. Facility managers can also deploy edge services and analytics across their entire portfolio with a single click.

"IoTium's zero-touch provisioning resonates well with customers that need simple deployment and manageability of edge gateways within buildings and other commercial and industrial IoT applications."

— Jason Shepherd IoT Strategy and Partnerships, Dell*

 

Provision. Secure. Deploy.

IoTium's network infrastructure solution was built with security at its bedrock to protect buildings' operational assets, data, control systems, and IT systems. With IoTium, facility managers can securely connect any asset through any gateway to any cloud, on any infrastructure and using any operator—all without an IT technician.

The solution has three components:

IoTium iNode on remote gateway

IoTium provides each site with an Intel® IoT Gateway pre-provisioned with the IoTium iNode software, including all necessary certificates and keys for secure communication with the data center or cloud. These plug-and-play gateways are shipped directly to the buildings, where they simply need to be connected to the building automation system.

Secure connection to vendor applications

Virtual iNodes are set up wherever vendor applications reside—whether in the data center or the cloud. A secure peer-to-peer network is established between these iNodes and the iNodes on-site. Each communication channel is completely isolated through software, so a breach in one sub-system (such as lighting or CO2 monitoring) does not compromise another.

IoTium Orchestrator*

The Orchestrator* is a web-based application that lets facility managers monitor and manage the entire network of iNodes from a single interface. Orchestrator gives facility managers a 360-degree view of building automation systems, and lets them deploy new software, upgrades, and security fixes across thousands of remote sites in one click from the cloud—no truck roll required.

IoTium Orchestrator

Why IoTium?

Zero-touch provisioning

No need for onsite IT support. Simply ship the gateway to the remote site and power it on. The preinstalled, cloud-managed software automatically authenticates, provisions and configures itself, downloading all necessary policies and automatically deploying over-the-air updates—in as little as 30 minutes.

Secure by design

Protect sensitive business data from backdoor breaches. IoTium iNodes establish secure encrypted tunnels between the building automation system and any application in the cloud, preventing data re-routing and reducing exposure to DDoS attacks and data theft.

Instant scale

Deploy, update, and upgrade across thousands of buildings with a single click. IoTium allows facility managers to migrate management of building automation systems to the cloud, making it possible to monitor, manage, and maintain smart building systems at scale.

Edge services

Place analytics, machine learning, encryption, compression, and other services closer to the data source for faster processing. IoTium's edge platform allows multiple applications to use building data simultaneously and securely in their respective isolated compute and memory environments.

Connect everything

IoTium's solution supports any building management system, any protocol, any gateway, any operator, any infrastructure and any cloud service—without costly and complex changes to existing IT infrastructure.

Reduce costs

IoTium significantly reduces capital costs by eliminating the need for onsite fixed function devices, such as firewalls and VPN tunnels, and enabling one-click deployment of updates and upgrades across thousands of sites without costly truck rolls and field tech visits.

Modern Building Automation at Minimum Cost: IoTium's Solution at Work

IoTium's solution is enabling commercial and industrial facility managers to affordably migrate management of building automation systems to the cloud for streamlined management and increased security.

Commercial building management company saves big with IoTium

A large North American commercial building management company's building automation systems were disparate, time-consuming and costly to maintain.

The company deployed IoTium's network infrastructure as a service, enabling it to securely connect multiple locations with zero-touch and migrate management of building automation applications to the cloud, without any changes to existing IT security policies.

Building management vendor lowers costs and increases security with IoTium

A North American building management vendor's provisioning process, which involved sending a technician to each site to configure the hardware and provision the firewall and VPN tunnel, was too costly for retail customers with a large number of small locations.

IoTium's zero-touch provisioning eliminated costly and time-consuming technician site visits. In addition, IoTium's solution allowed the company to isolate its customers' building automation data from their enterprise IT data while using the same physical infrastructure—dramatically increasing security and improving customer satisfaction.

ROI

The company realized significant savings from elimination of truck rolls and field support, and consolidation of multiple devices.

The company was able to save customers approximately $5,000 and 200 miles of driving per site visit.

See for Yourself

Find out how you can accelerate your IoT deployment with IoTium. Schedule a demo at IoTium

Trusted by the World's Leading Companies

Partners logos

The Internet of Things (IoT) is advancing a new breed of smart buildings that are better aligned with the priorities of property owners and managers. Accelerating this transformation, Intel offers IoT building blocks that simplify how building systems talk to the cloud and exhaustively analyze building data to uncover new business insights to drive greater performance and real value.

ARDIC IoT-Ignite* Platform Turns Fast Food into Smart Operations with Intel® Architecture

$
0
0

Bringing data-driven management to complex, fast-paced environments

Executive Summary

Restaurants, like many industries, are data-rich environments that often do not have the tools or technologies to gather and mine the value from this data. Fast food chain automation is changing how restaurants calculate and manage costs on a daily basis. ARDIC's IoT-Ignite* analytics solution combines with Intel® technologies to gather, filter, and analyze key variables to automate operations in fast-moving environments. For restaurants, benefits include reduced waste, insight into customer behaviour, and operational efficiency—with positive impact on the bottom line.

Challenges

Over 40 percent of food produced globally is wasted every year.1 2 Fast food restaurants face particular challenges with costly waste, operational inefficiencies, and food safety. In order to deliver food quickly, it must be prepared in advance of customer orders, even though daily foot traffic is unpredictable. Consistency of product is essential, and this requires accurate measurement and temperature control, but managers often have little insight into the variables of equipment and staff activity throughout the day. Fast food operations need solutions that can scale across multiple locations and geographies, and holistic insight into growth challenges and opportunities.

Solution

ARDIC's IoT-Ignite platform running on Intel® architecture-based gateways is helping fast food companies better monitor food production and quality to improve customer experiences, increase operational efficiency, and minimize waste.

The innovative solution adapts a range of technologies to the specific requirements of the restaurant industry. Nonintrusive sensors and RFID tags collect and generate information relevant to different services. Smart scales and level counters add to the data pool. All data is gathered, filtered, and processed on a single Intel architecture-based gateway. Data parameters are determined in conjunction with the restaurant to ensure relevance.

Management is localized to each venue. For example, the solution automatically sends alerts when waste exceeds a certain amount, occupancy and consumption trends change, kitchen conditions alter, the restaurant is out of stock, or there are items that are close to expiration in the cold room.

Smart solutions for the restaurant industry
ARDIC and Intel bring smart, near-real-time decision-making customized for the restaurant industry

ARDIC is piloting the solution at an expanding number of a leading fast food franchise's restaurants in Turkey. RFID tags and readers, sensors, and smart cameras transmit data to the gateway where it is mined for actionable edge intelligence and/or transmitted to the cloud for deeper analytics. Capabilities include counting customers, tracking the customer journey through the experience (e.g., wait times, peak cycles), tracking goods and inventory, and managing weight and environmental variables such as temperature and humidity. The solution includes smart scales for calculating food weight, a smart trash can that computes waste and unsold burgers, and a level counter that informs kitchen staff about changing occupancy levels, so they can adjust food production accordingly.

To maintain quality, the fast food restaurant's staff must dispose of burgers not sold within 15 minutes. With the smart solution, food production is based on data, decreasing waste, and potentially saving millions of dollars globally each day.

The open IoT-Ignite platform from ARDIC allows applications and features to be easily added or modified. The entire solution is highly automated, works with the existing franchise infrastructure, and does not disrupt ongoing operations.

With ARDIC, the fast food franchise is gathering more data each day and mining it to realize its full value—with resulting increases in system efficiency. As the solution is deployed in more venues, cross-store data informs refinements and the opportunity for global deployment. As operations are standardized, they can be optimized and managed more efficiently.

Key Benefits

ARDIC's IoT-Ignite and Intel architecture-based gateways support a wide range of benefits, including stock optimization, loss prevention, and lower OpEx. ARDIC works closely with its customers and their field operations to ensure analytics are pertinent, actionable, and meet business objectives.

  • Optimize operations with great accuracy to maximize efficiency and help increase profits
  • Compare metrics between restaurants to fine-tune optimisations and create best practices for use in existing and future franchises
  • Easily identify which restaurants are underperforming and adjust the number of employees per location
  • Minimise time-consuming tasks like daily counts and bookkeeping, increasing employee efficiency
  • Detect abnormal activities (high customer count with low sales, high waste levels, etc.) and respond immediately to address issues
  • Keep current with fast food automation technology and gain a competitive advantage

ARDIC's solution to improve quality
With ARDIC’s IoT-Ignite* platform combined with the Intel® architecture-based gateway, restaurants get the data to improve quality, customer service, and efficiency

Sample Use Cases for the Restaurant Industry

Inventory management

  • Dynamically track inventory or assets on the cloud with no manual intervention using RFID technology
  • Track the amount of inventory that the restaurant must maintain
  • Raise flags or place orders at critical levels
  • Eliminate daily inventory counts
  • Track and reduce loss
  • Minimize human intervention and errors
  • Access reports on daily and hourly business fluctuations

Queue management

  • Identify customer behavior and rush hours to take action (e.g., open a new cashier to avoid long queues, optimize back-end operations)
  • Control crowds with customer queuing system
  • Count people entering the line during a specific time period
  • Count approximate number of people standing in the line
  • Track the customer's journey in the store, such as waiting in line or leaving
  • Manage the number of cashiers with a predefined logic
  • Generate data for labor/staff allocation

Smart waste

  • Track dumped precooked, cooked, and unsold food with near-real-time inventory tracking service
  • Monitor abnormal dumps
  • Generate average dump levels for future evaluations
  • Compare daily sold food with outgoing items from the stock room and the smart waste bin

More use cases

IoT-Ignite is an open IoT platform. Companies of any size can utilize IoT-Ignite to design, develop, and deploy their own IoT services and rapidly start monetization.

Today, IoT-Ignite enables services for a wide range of vertical market segments including retail, mobile services, agriculture, education, and energy. The opportunities to extend services to healthcare, manufacturing, and mobility are wide open.

Relevant data for optimization of connected industries
The flexible, easy-to-deploy smart solution provides relevant data for optimization of connected industries

How It Works in Brief

The IoT-Ignite solution executes processing and filtering of data within the Intel architecture-based smart gateway and forwards pertinent data to an IoT-Ignite cloud. An RFID reader developed by Intel® Labs, smart scales, and smart cameras contribute meaningful data from throughout the restaurant venue.

RFID tags

are attached to boxes of materials before they leave the distribution center. The boxes can be tracked in the distribution center, freezer room, and kitchen of the restaurant. Headquarters can see how many boxes are left in each branch and make sure that each type of goods is stored at its correct temperature.

Cameras

are used for queue management. Displays in the kitchen show a number from one to seven. If queues are long, this number is increased and staff in the kitchen can quickly respond.

The ARDIC cloud platform includes a service and Wi-Fi layer, and runs algorithms built by ARDIC to meet the specified goals of the deployment.

The ARDIC IoT-Ignite service enables fast IoT solution deployment for IoT service providers. The Intel architecture-based gateways help enable end-to-end security, networking, and interoperability. An Android* operating system supports easy customization. A vertical application store provides customers with access to apps and data on any device from any location, enabling flexible, user-friendly, and, most importantly, rapidly deployable services for any market segment.

The Foundation for IoT

The ARDIC solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities— performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and the restaurant industry.

Conclusion

Fast food restaurants offer a great example of the challenges of managing a fast-paced environment with numerous dynamic variables. With ARDIC and Intel, restaurants, along with many other industries, can improve decision-making and operational efficiency based on accurate, near-real-time data generated by on-site, nondisruptive technologies—from sensors and cameras to RFID tags and readers and smart gateways. ARDIC closely collaborates with its customers to ensure its solutions powered by Intel® technology meet evolving requirements and generate useful, actionable data.

Learn More

For more information about ARDIC solutions for the food industry, please visit iot-ignite.com/fast-food-automation or contact us at info@iot-ignite.com.

To learn more about ARDIC, visit ardictech.com.

For more information about Intel® IoT Technology and the Intel® IoT Solutions Alliance, please visit intel.com/iot.


Estimated results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com/iot. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Making IoT Connectivity Secure and Simple for Retailers


Intel IoT and Asavie* enable rapid business growth with a streamlined IoT platform and multilayer security.

"By addressing the secure network connectivity and integration challenges from edge devices to the cloud, we are providing companies with a flexible and affordable means of exploiting transformational IoT opportunities across many market sectors."

—Lars Jerkland, VP OEM, Asavie

Executive Summary

The Internet of Things (IoT) unleashes valuable business insights through data that’s gathered at every level of a retail organization. With IoT and data analytics, retailers now have the capability to gather insight into customer behavior, offer more personalized experiences, achieve better inventory accuracy, create greater supply chain efficiencies, and so much more. But with data comes great risk. A recent report by security firm Thales and 451 Research found that 43 percent of retailers have experienced a data breach in the past year, with a third reporting more than one breach.1

Intel® technology-based gateways and Asavie, a provider of next-gen enterprise mobility management and IoT connectivity solutions, offer a secure connectivity solution that minimizes the effort and cost required to protect businesses from cybersecurity attacks. In addition, the Intel/Asavie IoT solution provides retailers with a solid basis on which to build their smart, connected projects:

Analytics

Real-time diagnostics and monitoring of all connected devices so retailers can track and resolve any issues, ensure reliable service connectivity, and optimize service delivery.

Connectivity

Simplify rollout and network management and ensure high availability.

Administration

Centrally define automated rules that comply with existing policies, and dynamically add new policies to control which devices connect to the network and which services are permitted on it.

Scaling

Go from a single device to hundreds of thousands—and beyond. The award-winning Asavie PassBridge* platform is designed from the ground up to scale cost-effectively and rapidly.

Cost control

Automated monitoring of service usage, paired with usage and data plan control, helps businesses manage how much they spend, budget accurately, and ensure devices operate at the highest possible efficiency.

Challenges

Today’s retail stores are highly connected digital hubs, with large amounts of data transfer originating from and traveling back and forth among numerous locations: the store floor, stockroom, point of sale, corporate headquarters office, security systems, digital signage, refrigeration sensors, cloud-based e-commerce platforms, and more. The constant threat of a security breach and new digital data regulations make it imperative that sensitive customer and corporate information is never exposed.

In addition, connectivity is the lifeblood of modern retail success; any loss in connectivity will greatly affect operations, revenue, and costs. The challenge is to deliver a highly available and reliable connection for numerous remote site locations.

Another formidable challenge is the speed at which today’s marketplace is evolving. Retailers must respond sooner and be ready to scale with flexible "anytime, anywhere" connectivity. As more in-store procedures become digitized, the overarching network architecture must be able to scale without needing continuous rearchitecting or disruption to existing services.

Solution

Gateways future-proof retail business connectivity
Figure 1. Intel® technology-based Advantech gateways future-proof retail business connectivity

Asavie PassBridge is a cloud-hosted, software-defined IoT connectivity management platform built on x86 hardware. Unlike IoT connectivity management platforms that focus solely on the provisioning and decommissioning of SIMs, Asavie PassBridge offers more than basic internet access, delivering on-demand connectivity services designed to secure and manage connectivity across heterogeneous networks at scale. Asavie PassBridge offers configurable, secure private IP networks as a service via private APNs to the cloud. It enables users to orchestrate and configure their own customized secure network environment, off the public internet, for 1 to 100,000s of devices in seconds.

A key advantage of the Asavie PassBridge technology is its multipronged approach to security, in which it acts as both a trust authority and a traffic manager. Only authenticated Intel Advantech Gateways can attach to the private network, enabling a highly available and connected store with the right level of compute power and memory for multiple connectivity options. Asavie’s integrated network-layer-based security helps streamline hardware so that only the Intel Advantech Gateway needs to be deployed in each store.

The Intel technology-based Advantech Gateway series includes a built-in cellular modem to support cellular WAN-like connections, making it ideal for remote locations where physical wired connectivity is not possible. The cellular connection can also be used as an alternative or secondary network path to ensure continuity of key data services in the store. In addition, administrators can easily apply dynamic traffic management policies to the cellular WAN connection, ensuring key business flows are maintained while eliminating unnecessary data flows and avoiding excessive data costs in times of network failover.

In addition, with the Intel Advantech Gateway, it’s possible to gather valuable maintenance and status information at the store edge. By seamlessly connecting key systems such as refrigeration, point of sale, and inventory management, retailers can make automated and intelligent decisions in store.

The Intel/Asavie solution allows store owners to benefit from the ease of install and flexible interface configurations for a streamlined process to interconnect store appliances securely. Retailers can connect remote locations to many application services running on-premise or in the cloud using Asavie’s private connectivity to services such as AWS or Microsoft Azure*.

Key Benefits

  • Delivers real-time diagnostics and monitoring of all connected devices to ensure reliable services.
  • Supports dynamic traffic management policies.
  • Ability to subtenant the local area network in store, minimizing the exposure to security threats inside the store.
  • Small gateway footprint and single gateway per store simplifies install in a compact retail environment.
  • Ideal for remote store locations where physical wired connectivity may not be possible.
  • Cellular WAN connectivity means centralized provisioning and management are possible for hundreds of connected stores, all from an intuitive web-based GUI.
  • Bidirectional connectivity facilitates remote debug of store problems.

How It Works in Brief

Asavie end users are assigned an account on Asavie PassBridge, which is accessible via an intuitive web-based user interface (UI), Asavie IoT Connect. The setup requires minimal expertise to get started with a secure and private network. The UI presents a list of templated setups to the end user, from which the user can choose an appropriate network topology that suits their needs.

To get started, users simply select a network type from the available list, which includes:

Private network

All application traffic is terminated on the on-premise server.

Secure internet

Controlled access is given to the internet, with extra security around domain names in use.

Cloud service connector

Seamless connectivity from edge devices to the end user's serverless cloud services.

Hybrid

A mix of secure internet and private networks.

Asavie PassBridge automatically orchestrates the network of choice, which will include the private static IP address pool and unique assignments to the SIM cards in the associated account. Network initialization will also include any security policies selected as part of the configuration setup.

Within minutes, the end user’s gateways will be automatically enrolled into the private network. Further security can then be applied, which includes data controls with event-driven notifications, alerting the user to data anomalies at each connected gateway.

Furthermore, Asavie IoT Connect provides an easy-to-understand observation dashboard with insights into the overall deployment status, connectivity health, and any anomalies.

About Asavie

Asavie makes secure connectivity simple for any size of mobility or IoT deployment in a hyperconnected world. The Asavie PassBridge* platform powers on-demand services for the secure and intelligent distribution of data to connected devices anywhere. This enables enterprises to harness the power of the Internet of Things and mobile devices to transform and scale their businesses.

Conclusion

The Intel/Asavie solution helps retailers significantly reduce security risks and costs for existing and new store configurations by reducing the requirement to just one physical access gateway. With security and connectivity features rolled into one solution, the Intel/Asavie IoT solution greatly streamlines install to a simple plug-and-play connection out of the box for better reliability and service, no matter where the store is geographically located.

Your Best IoT Retail Solutions Are Built with Intel® Technology

Accelerate the time to solution deployment and simplify the path to cost savings, new efficiencies, inventory accuracy, smarter marketing, and better customer experiences with an end-to-end solution based on Intel technology.

Learn More

For more information about Asavie IoT solutions for retail, please visit asavie.com.

For more information about Intel® IoT Technology and the Intel® IoT Solutions Alliance, please visit intel.com/iot.

References

1. 2017 Thales Data Threat Report, Global Edition

AllGoVision* and Intel Bring Advanced Video Analytics to the IoT Edge


Flexible, feature-rich solution turns smart video into actionable insight

Executive Summary

Video data is a key part of surveillance as well as many other types of IoT implementations. But video data can be complex and costly to gather, transmit, and analyze. AllGoVision* provides an innovative analytics solution, purpose-built for video data. Powered by Intel® architecture, the solution has been deployed in more than 100 installations in 35 countries. It rapidly analyzes video from surveillance cameras for designated parameters, including behavior patterns, motion tracking of people and objects, and environmental monitoring.

Challenges

Streaming video offers rich data for connected cities, transportation, building management, and a wide range of industries. But mining value from this data requires an analytics solution that can quickly parse and extract relevant insight from ever-increasing data sets. Many analytics solutions offer deep analysis at the back end, but the results frequently do not support nearly instantaneous action and automated decision-making at the edge. These solutions can also be difficult to install, costly to purchase and maintain, and run only on specific cameras and infrastructure components.

Solution

AllGoVision offers an advanced deep learning–based video analytics solution powered by Intel® architecture that integrates with existing infrastructure to provide automatic near-real-time alerts and insight for actionable business intelligence.

The enterprise-grade solution seamlessly integrates with existing open platform surveillance systems. It is designed to combine situational awareness with business intelligence for competitive decision-making. The near-real-time results allow for proactive, preventive operations management. Security is automated for enhanced safety and efficient operations. Based on artificial intelligence, machine learning, and scalable Intel® processors, the solution is future-proofed to protect investments.

AllGoVision smart video solutions
Figure 1. AllGoVision* smart video solutions powered by Intel® architecture deliver advanced analytics across a broad spectrum of industries


A Comprehensive Analytics Solution That Is Easy to Deploy

Open platform

  • Choice of video management software (VMS) and camera integration

Plug and play

  • Works with existing surveillance setup and infrastructure
  • Compatible with existing security systems
  • Easy to install and use

Enterprise ready

  • Powered by high-performance Intel® architecture

Flexible

  • Run analytics at the edge, server, and/or cloud

Robust

  • Superior robustness for environmental conditions (e.g., wind, rain, snow)
  • Optimized through deep learning for field challenges

Accurate

  • Exceptional object occlusion handling
  • Object tracking, even in crowds
  • Accurate results amidst gradual or sudden illumination changes

Feature rich

  • More than 40 basic and advanced features to support a range of use cases

Customizable

  • Easy to enhance to meet specific requirements

Cost-effective

  • Reduce hardware costs with more functionality per server

Ease of use

  • Easy to install and configure
  • Intuitive graphical user interface (GUI)
  • Windows*, tab-based, point-and-select interface
  • Extensive graphical icons and options (pull-down menus, buttons, check boxes, radio buttons)

Manageability

  • Integrated with open platform VMS, such as Milestone*, Genetec*, IndigoVision*, etc.

Support

  • Expert analytics support team from AllGoVision

One solution, multiple applications

  • Works across industries and use cases

Sample Use Cases

Next-generation video analytics from AllGoVision and Intel support a wide range of business and industry use cases and applications, including intelligent traffic surveillance, crowd management, perimeter protection, suspicious incident detection, retail intelligence, facial recognition, and multi-camera tracking (i.e., smart subject search).

Use cases for edge analytics
Figure 2. Use cases for edge analytics include surveillance, operations management, and business intelligence

Extensive functionality based on artificial intelligence
Figure 3. Analytics from AllGoVision incorporate extensive functionality based on artificial intelligence and machine learning

Here are just a few of the ways that smart cities, buildings, and enterprises are using AllGoVision analytics to optimize operations, increase safety, maximize efficiency, and reduce costs.

City surveillance

  • Unauthorized crossing of railway lines
  • Boundary surveillance of critical infrastructure
  • Suspicious object and behavior detection
  • Unauthorized crowding/overcrowding
  • Blacklisted person/suspect identification

See more city surveillance features

Traffic and parking management

  • Effective traffic control at intersections
  • Detection of illegal traffic behavior
  • Toll evasion detection at toll booths
  • Detection of unauthorized parking
  • Available parking slot detection
  • License plate recognition

See more traffic surveillance features

Building surveillance

  • Monitoring entry at restricted spaces
  • Access control/lobby management
  • Abnormal behavior and unidentified objects
  • Early warning through video smoke detection
  • Monitoring monuments for vandalism

See more building surveillance features

Business intelligence

  • Footfall statistics and queue monitoring
  • Reduce shrinkage with pilferage detection
  • Counting beverages served over the counter
  • Viewership and engagement analysis
  • Profiling by demographic data, such as age and gender

See more business intelligence features

A Close Look at Facial Recognition

One of the leading areas of IoT analytics based on video data is facial recognition. AllGoVision captures faces and stores them in customer databases. People are detected through closed-circuit television (CCTV) cameras and matched against the database. Alarms and alerts are automatically generated for potential miscreants. The function is based on advanced technology using 3D structures to extract the feature points from the face, create a feature vector containing the most dominant features on the face, and store them as models.

The facial analytics feature works with generic IP cameras and PCs in both indoor and outdoor environments. It can identify individuals even when moving and despite camouflage efforts. Time to results is as little as one second, and up to tens of thousands of registered faces can be analyzed.

Face detection

Near-real-time video from surveillance cameras, CCTV, or archived video footage is analyzed in individual frames to locate and capture a human face as a sample.

Feature extraction

The unique facial characteristics of the captured sample are then assessed and a unique data set is extracted from the 3D structure of the detected face.

Face search

The extracted data set for the sample is matched against the models available in the database of registered faces.

Facial recognition

The system alerts whether or not a face is recognized. For a recognized face, the system flashes additional registered information and an image of that individual.
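
Taken together, the four steps form a simple pipeline. The sketch below expresses that flow in Java-style pseudocode; every type and method name is a hypothetical stand-in for illustration, not the AllGoVision API:

// Illustrative pseudocode of the facial recognition flow described above.
for (Frame frame : videoStream)
{
    for (Face face : detector.detectFaces(frame))              // 1. face detection
    {
        FeatureVector features = extractor.extract(face);      // 2. feature extraction from the 3D structure
        Match match = registeredFaces.nearestMatch(features);  // 3. face search against the database
        if (match != null && match.score() >= threshold)       // 4. facial recognition
            alarmCenter.alert(match.person(), face.snapshot());
    }
}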


How It Works in Brief

AllGoVision’s advanced algorithms are integrated with VMS or IP cameras. The analytics server powered by a high-performance Intel® processor receives the video stream from VMS or directly from the IP camera. Analytics operations are then performed—with alarms and alerts generated automatically as needed. The alarms and alerts can be viewed either via the integrated VMS clients or in the AllGoVision viewer client. With the AllGoVision and Intel solution, analytics can be conducted at the edge, server, or on a distributed architecture.

Server analytics

For server analytics, the solution can reside in the VMS or on a separate machine. Open Network Video Interface Forum (ONVIF) streaming is supported, along with leading VMS. Alarms and alerts are sent to the VMS viewer (smart client) or to AllGoVision’s Alarm Center. The analytics run as a Windows* service, supporting up to 100 channels per server, as well as failover.

Edge and cloud analytics

Analytics at the edge and cloud are available on IP cameras. Alarms and alerts are sent to the VMS viewer (smart client). The analytics run as a Windows service, supporting up to 200 channels per server to save on hardware costs. Features such as detection of intrusion or suspicious incidents and counting are supported simultaneously.

Stand-alone analytics

For stand-alone analytics, AllGoVision takes the video feed directly from the camera. Alarms and alerts are sent to AllGoVision’s Alarm Center. These analytics are especially useful for non-security or business intelligence applications with no VMS.

Distributed architecture

AllGoVision can be run locally (i.e., at the edge or machine) or on the cloud, with the ability to send alarms and alerts via WAN. Alarms and alerts can be hosted on the cloud.

Software interface

The easy-to-use, intuitive GUI comes with an alarm management and reporting client—the AllGoVision Alarm Center. It offers extensive options for alarm preview, search, reporting, and analysis.

Easy-to-use graphical interface provides analytics and reporting

The Foundation for IoT

The AllGoVision solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and industry.

Conclusion

AllGoVision brings IoT and analytics expertise to unlock valuable insight from video data. The solution powered by robust Intel architecture can be used by a wide spectrum of industries and vertical segments to improve decision-making, increase automation, and tap the benefits of connected IoT solutions. Video data is critical for surveillance and monitoring—with AllGoVision and Intel this vital information is available in near-real time.

About AllGoVision

AllGoVision Technologies Pvt. Ltd. is a provider of video analytics and manufactures AllGoVision, an advanced video analytics solution. AllGoVision offers more than 50 video analytics features used in traffic, building, and city surveillance, and business intelligence. Its global technology partners include Milestone, Genetec, Honeywell EBI*, HUS*, DVM*, Wavestore*, Axis*, Tyco*, Samsung*, Siemens*, etc. AllGoVision received a 2015 Most Innovative and High Potential Product award from NASSCOM and was recognized as one of the 20 most promising video surveillance solution providers of 2017 by CIOReview.

Learn More

For more information about AllGoVision, please visit allgovision.com or contact us at allgovision.com/contact-us.php. For more information about Intel® IoT Technology and the Intel IoT Solutions Alliance, please visit intel.com/iot.

Xtel* and Intel Enable Manufacturers to Improve Operational Insight and Efficiency


Robust wireless sensors collect critical data for industrial environments

Executive Summary

Manufacturers often struggle to garner usable data from complex and incompatible equipment and systems. Xtel* has developed an innovative wireless sensor solution optimized and customized for industrial environments. The specially engineered sensors simplify monitoring of temperature, pressure, humidity, vibration, and other critical variables. Combined with an Intel® architecture-based gateway and analytics applications from Xtel's ecosystem partners, the solution enables industry to increase utilization and optimization while gaining insight into operations and reducing costs.

Challenges

Industrial manufacturing equipment is often designed to collect data but tends to be proprietary—meaning each system has its own unique hardware protocols and applications. The result is that industry often cannot access or make timely use of the data inherent in existing operations. Adding to the complexity is the exponential growth in the number of machine-to-machine connections, from 4.9 billion in 2015 to a predicted 12.2 billion by 2020.1 Challenges such as costly downtime, a lack of predictability on equipment malfunctions, and operational inefficiency can make it difficult to stay competitive and meet compliance requirements. Getting holistic insight across a single factory or those in multiple locations remains elusive.

Solution

Xtel offers a sensor-based smart manufacturing solution that combines with Intel architecture-based gateways to deliver a new level of analytics insight. The open, standardized solution enables manufacturers to achieve the benefits of Industry 4.0 without disrupting current operations or requiring investment in new equipment. With Xtel and Intel, industry can gain holistic insight into the complex variables that impact machines, workflow, and manufacturing output.

A wide range of robust, wireless sensors have been designed and engineered by Xtel to work in rugged manufacturing environments, digitizing equipment such as turbines, engines, and pumps. Data is gathered, filtered, and analyzed at the edge or transmitted to the cloud by the Intel architecture-based gateway, providing information to automate processes and improve decision-making. The sensors are easy to deploy and can be placed around and, in some cases, within equipment for 24/7 monitoring and identification of issues and trends.

Sensor types

Xtel sensor modules can be deployed individually or combined into a single solution.

Gateways to collect data across industrial functions
Xtel* wireless sensors combine with Intel® architecture-based gateways to collect data across the spectrum of critical industrial functions

The solution from Xtel and Intel brings both flexibility and scalability to manufacturers. It's designed for easy integration into existing facilities. Equipment can be provisioned and operational very quickly, typically in 10–30 minutes, with data analysis generated in user-friendly graphs, alerts, and notifications. Because they are wireless, it's simple to move sensors around plants to ensure pertinent information is gathered; applications can be added as requirements evolve. This feature also saves the considerable costs associated with altering wired sensor placement or functionality.

Xtel deployments are customized to meet the needs of particular industries and are powerful tools for identifying issues impacting product quality and the bottom line. For a large freight company, specialized sensors were placed inside massive cast-iron cargo ship engines, sending signals from sensors to gateway even when engines are running. For a toy manufacturer, Xtel set sensors around normal production systems to measure and identify variations in oil temperature and humidity and discover the cause—in this case oil temperature—of production issues. Data enables preventive maintenance, for instance, ensuring a system is not running too hot and sending an alert to service technicians weeks before equipment breaks down.

Collecting data on equipment failure and preventive maintenance is just one of the benefits of the Xtel and Intel solution. Insurance rates for systems are often lower and disputes easier to manage because manufacturers have predictive maintenance in place, as well as the data when equipment and systems do not meet their guarantees. Likewise, fulfilling compliance requirements is simplified, because manufacturers have the data to prove they are meeting standards.

Typically, Xtel starts customer implementations with a pilot, allowing manufacturers to use a simple wireless sensor package (WSP) to test the value of the sensors, gateway, and data analytics for their organization. This provides a chance to experiment with the sensors and see the types of data that can be acquired and analyzed.

Energy-efficient IoT

Xtel wireless sensors include a highly optimized thermal energy harvesting module with the Bluetooth® low energy chipset. Even with a temperature gradient below 5 degrees Celsius, the Xtel wireless energy harvesting sensor platform can transmit temperature data.

  • No batteries
  • No maintenance
  • Low environmental impact
  • Can be incorporated into areas with limited accessibility
  • More than 15-year lifetime

Sample Use Cases

Handling and transport of goods

By embedding Xtel's intelligent, reusable sensors, companies that manufacture thermal storage boxes can access relevant data on temperature, position, and other parameters of goods in transit. Through a built-in online temperature and tilt sensor, manufacturers can check for compliance with required conditions.

Wireless monitoring of production machinery

Wireless sensors enable manufacturers of production machinery to continually monitor wear and tear of their products. Costly production stoppages are avoided through continuous electronic surveillance via integrated temperature, vibration, humidity, and pressure sensors which ensure that machines are serviced before breakdowns occur.

How It Works in Brief

Wireless sensors are customized for industrial companies—tailored to precision requirements—and use the Intel architecture-based gateway to filter and transmit data for edge and cloud intelligence. Sensor software, hardware, and engineering are developed by Xtel. Analytics applications are provided by Xtel's ecosystem partners.

Energy consumption in wireless sensor technology is ultralow, with sensors running for 5–10 years on a single coin cell battery. In some cases the battery can be removed altogether by collecting energy from the environment—from sources such as light, vibrations, and heat—also known as energy harvesting.

Xtel's cloud server solution allows manufacturers to manage data as it is collected.

Industrial insight from the edge to the cloud
Xtel's wireless sensors and Intel® architecture-based gateways enable industrial insight from the edge to the cloud

Advancing Edge and Cloud Intelligence

Intel and its ecosystem help businesses use the IoT to solve long-standing industry-specific challenges. Quickly develop IoT solutions that connect things, collect data, and derive insights with Intel's portfolio of open and scalable solutions so you can reduce costs, improve productivity, and increase revenue.
Intel® technologies support the rigorous requirements for programmable logic controllers (PLCs), industrial PCs (IPCs), human machine interfaces (HMIs), robotics, machine vision, and many other industrial applications.

Conclusion

With Xtel and Intel, manufacturers can achieve many of the benefits of Industry 4.0 within their existing infrastructure. The wireless sensor and gateway solution gives industry the data needed to identify issues with equipment and systems, increase operational efficiency, support predictive maintenance, and more.

The Foundation for IoT

The Xtel solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-toend capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and industry.

About Xtel

Xtel is an independent wireless product development company. Its dedicated, highly specialized engineers bring deep expertise in developing crucial mobile components from idea to commercial product. The company provides innovative technologies and functionality for the industrial sector and IoT.

Learn More

For more information about Xtel, please visit xtel.dk or contact us at info@xtel.dk.

For more information about Intel® IoT Technology and the Intel® IoT Solutions Alliance, please visit intel.com/iot.

References

1. IoT will account for nearly half of connected devices by 2020, Cisco* says

Real-Time Systems and Intel Take Industrial Embedded Systems to the Next Level


Innovative hypervisor and partitioning software increases flexibility and functionality for industry

Executive Summary

The Industrial IoT (IIoT) has the potential to bring increased optimization, automation, and insight to industrial facilities. But frequently, proprietary, incompatible equipment and systems—combined with the need to meet complex, time-based, deterministic requirements—make realizing this potential challenging. Real-Time Systems (RTS) software running on robust, high-performance Intel® architecture enables the creation of intelligent embedded applications for IIoT, with benefits ranging from holistic visibility into operations to centralized equipment management and maintenance.

Challenges

Industrial operations are inherently time based—they require processes and procedures to occur in a predefined linear sequence with little margin for error. Furthermore, equipment and systems built for industry are often proprietary, using incompatible protocols and networks not designed to work together or to allow for centralized management or maintenance.

Typically, coordinating timed workflows and gathering data on performance and usage requires at least two computers: one servicing an application’s real-time needs, the other running a general-purpose operating system (GPOS) such as Linux* or Windows*. In such a configuration, the GPOS is responsible for data processing, visualization, and integration of applications into the facility’s networks. The result is often costly and makes it difficult to obtain a holistic, accurate view of ongoing operations or to increase automation and preventive maintenance.

Solution

Real-Time Systems (RTS) brings in-depth expertise in hypervisor and embedded virtualization technology to simplify and speed development and deployment of market-ready, standardized software products targeting advanced embedded applications.

RTS software running on robust, high-performance Intel architecture enables industrial embedded applications that support an array of critical capabilities, including deterministic, real-time performance, data processing, visualization, and seamless connectivity. The solution is helping industry to advance automation while improving data acquisition (for instance, via motion control and programmable logic controllers [PLCs]).

Deployment of multiple operating systems on multicore processor platforms is a logical step in embedded system design, reducing total hardware costs while increasing reliability and system performance. The innovative Real-Time Systems Hypervisor* permits multiple real-time operating systems (RTOS) and general-purpose operating systems (GPOS), such as Windows or Linux, to run concurrently on multicore Intel® processors.

Through the RTS Real-Time Hypervisor, modern multicore processor platforms, such as the Intel Atom®, Intel® Core™, and Intel® Xeon® Scalable processors, can execute multiple operating systems independently of one another on a single platform. The RTS Real-Time Hypervisor can also assign individual processor cores, memory, and devices to each operating system. Through a configuration file, the boot sequence can be specified, and when desired, one operating system can be rebooted independently of any others. In order to facilitate communication between operating systems, the RTS embedded virtualization solution also provides a configurable user-shared memory, an event system, and a TCP/IP-based virtual network.
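
Because the virtual network appears to each guest as an ordinary TCP/IP interface, operating systems can exchange data with standard sockets and no proprietary API. The following is a minimal sketch, assuming the virtual network is already configured and a service on the RTOS guest listens at the illustrative (hypothetical) address 192.168.100.2, port 5000:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// GPOS-side Java client talking to a real-time guest over the hypervisor's virtual network.
public class GuestClient
{
    public static void main(String[] args) throws IOException
    {
        try (Socket socket = new Socket("192.168.100.2", 5000);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true))
        {
            out.println("STATUS?");             // request a status line from the RTOS guest
            System.out.println(in.readLine());  // e.g., cycle-time or sensor data
        }
    }
}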

Use cases

Use cases of RTS solutions
RTS solutions powered by Intel® architecture are deployed in a wide range of industrial implementations and use cases.

Get the Benefits of Industry 4.0

Industrial decision-makers and operations managers

The RTS product works out of the box without customization.

Industrial application developers

RTS products give real-time system developers up-to-date solutions for substantial portions of embedded projects, accelerating development and time to market.

Industrial equipment manufacturers

The powerful, cost-effective RTS software solution powered by Intel architecture offers increased flexibility in system design and enhanced functionality and performance, while reducing overall system cost.

Advancing Edge and Cloud Intelligence


Intel and its ecosystem help businesses use the IoT to solve long-standing industry-specific challenges. Quickly develop IoT solutions that connect things, collect data, and derive insights with Intel’s portfolio of open and scalable solutions so you can reduce costs, improve productivity, and increase revenue.
Intel® technologies support the rigorous requirements for programmable logic controllers (PLCs), industrial PCs (IPCs), human-machine interfaces (HMIs), robotics, machine vision, and many other industrial applications.

Real-Time Systems Hypervisor* Key Features and Benefits

Simplified out-of-the-box deployment

  • Ready to use out of the box
  • Users can install and configure the RTS Hypervisor* independently, without detailed hardware knowledge
  • Considerable savings on non-recurring engineering (NRE) costs
  • Easy evaluation of the RTS Hypervisor on any x86 platform

Flexible

  • When functions are separated among virtual machines, the collaboration between these virtual machines remains highly flexible
  • Create new combinations on the basis of open standard interfaces at any time
  • No hardware-specific modifications required
  • Support for all CPUs from Intel
  • Easy communication via high-performance internal virtual network (TCP/IP)
  • Mix and match operating systems

Scalable

  • Scales from Intel Atom® processors to multiple-node non-uniform memory access (NUMA) servers

Efficient use of resources

  • Make better use of the overall hardware capacity than when each operating system is allocated a dedicated system
  • Lower costs and energy consumption
  • Run a real-time operating system without adding extra latencies (0.00 μs)
  • Retain performance and determinism of real-time applications
  • Designed for rugged real-time use
  • Time synchronization between operating systems
  • High-performance event system

More secure

  • Measured and secure boot is possible
  • No backdoors—operating systems securely separated

Reliable

  • When the number of hardware components is reduced, the probability of failure of the overall system decreases (e.g., by consolidating an industrial PC–based HMI and an ARM or microcontroller-based real-time control into a single hardware platform, the mean time between failures [MTBF] can be significantly increased)
  • Lower maintenance costs
  • Higher productivity with less downtime
  • Increase user satisfaction

Software-defined

  • Shared memory with easy-to-use API which can be configured for direct data exchange
  • APIs to monitor, start, and stop guest operating systems
  • Rights management for all APIs and shared memories

Tested

  • Highly secure design with no backdoors or interfaces into the hypervisor
  • Tested and deployed with customers worldwide
  • Proven in thousands of applications globally

Simplified certification

  • Use the RTS Hypervisor to separate security-critical from noncritical areas to make the development process and functional validation more efficient, and to simplify quality assurance and certification
  • Modularize applications and make changes in one module without having to recompile and completely retest the other software components
  • Reduce time to market and costs

Industry 4.0 ready

  • Stand-alone partitions for a dedicated security gateway with network function virtualization (NFV) for firewall, virus protection, and routing help to significantly increase security and eliminate the need for additional gateways
  • Real-time applications can be connected quickly and efficiently, while adapting the IoT and Industry 4.0 gateway to meet specific requirements

Hypervisor consolidation targeted to industrial implementations

Figure 1. RTS and Intel enable hypervisor consolidation targeted to industrial implementations utilizing standardized building blocks

How It Works in Brief

RTS Real-Time Hypervisor

The RTS Real-Time Hypervisor enables modern, multicore Intel processors with Intel® Virtualization Technology (Intel® VT) to simultaneously run either multiple instances of a real-time operating system or a heterogeneous mixture of 32-bit and 64-bit operating systems on a single execution platform. All systems are safely separated, run in real time, and can even reboot without disturbing the execution of other operating systems.

Out of the box, the hypervisor supports Windows® 10 and older, Windows Embedded Compact*, VxWorks*, RTOS32*, QNX*, OS-9*, Linux and real-time Linux, RedHawk*, and T-Kernel. Support for other operating systems or proprietary real-time code can be added at any time upon request.

RTS Real-Time Hypervisor powered by Intel

Figure 2. RTS Real-Time Hypervisor powered by Intel® architecture

Technical Features

  • Run multiple instances of an RTOS or a mix (e.g., Windows* and RTOS)
  • Completely independent execution of operating systems
  • 100 percent separation of operating systems in memory
  • No latencies (0 μs) added for RTOS
  • Direct hardware access
  • Exclusive resource allocation (supports xHCI and AHCI sharing)
  • Use standard drivers
  • Includes virtual network driver for seamless TCP/IP communication
  • Definable boot sequence; reboot any system, anytime
  • Simple installation and configuration
  • Multiple OS runtimes per guest OS
  • Microcode updates possible by hypervisor
  • xHCI controller sharing (port assignment)
  • AHCI controller sharing (disk or partition assignment)
  • Cache Allocation Technology (CAT) for L3 caches (Intel® Core™ processors and Intel® Xeon® processors)
  • CAT support for shared L2 caches found on Intel Atom® processors
  • Support for measured boot and secure boot
  • Access rights management for shared memory
  • Access rights for all APIs to monitor, start, and stop guest operating systems
  • Virtual MMU and IOMMU configurable for secure hardware separation of operating systems

The Foundation for IoT


The RTS solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and industry.

Conclusion

With RTS and Intel, developing and deploying intelligent applications for embedded and real-time systems is simplified, allowing industry to accelerate the benefits of IIoT while meeting the demands of time-based, deterministic compute.

About Real-Time Systems GmbH

RTS, a congatec company, is a global manufacturer of hypervisor technology specializing in real-time virtualization. RTS hypervisor solutions support all popular operating systems for x86 architecture. The company was founded in 2006 as a spin-off of KUKA, active in various industries worldwide, and headquartered in Ravensburg, Germany. real-time-systems.com

Learn More

For more information about the RTS Hypervisor, please visit real-time-systems.com or contact us at info@real-time-systems.com. For more information about Intel® IoT Technology and the Intel IoT Solutions Alliance, please visit intel.com/iot.

Estimated results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com/iot. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

 


Making NoSQL Databases Persistent-Memory-Aware: The Apache Cassandra* Example


Introduction

Persistent memory is byte-addressable. This property—which allows us to directly persist our data structures without serialization—is key to reducing the code complexity of handling data persistence. For the application, this simplification means fewer instructions to execute in both the read and write paths, improving overall performance.

In this article, we take a look at how the software architecture of the popular Apache Cassandra* distributed database was transformed to use persistent memory. With persistent memory, Cassandra no longer needs to split its data model into a performance-optimized part stored in DRAM and a persistent part stored on disk. Persistent memory allows Cassandra to have a single, unified data model.

Transforming the Apache Cassandra* Architecture

First, let’s take a high-level look at the main components in the original (unmodified) version of the Cassandra architecture. For the sake of simplicity, only one instance of Cassandra—as well as a single client—is considered. Nevertheless, the case for multiple instances or clients does not change the fundamentals presented here since the internal logic of each instance remains the same.

Original Apache Cassandra architecture.
Figure 1. Main components of the original Apache Cassandra* architecture.

For performance in write-heavy workloads, writes in Cassandra go only to the Memtable data structure in memory, which is eventually flushed to synchronize with the sorted string tables (SSTables) stored on disk. To avoid data loss from node crashes, Cassandra also appends every operation to a log on disk (the Commit Log), which can be replayed for recovery if needed. Note that even though appending to an unstructured log is faster than updating the SSTables on disk directly, every write to Cassandra still involves a write operation to disk.

The read path in the original Cassandra is even more complex. First, Cassandra uses the Bloom Filter to determine whether the key we want to read is likely to be in this instance (false positives are possible, but not false negatives). Next, it looks in the key cache. If the key is there, Cassandra accesses the Compression Offsets to learn where to look for the requested data in the SSTable files. If the key is not in the key cache, however, Cassandra needs to perform an extra read to disk to find the key's related information in the Partition Index.

Once the data is read from the corresponding SSTables, the Memtable is explored in case there are recent writes not yet flushed to disk. If that is the case, the data read from the Memtable is merged with the data read from the SSTable before returning it to the client. Both read and write paths are presented in Figure 2. For more information regarding the Cassandra architecture, as well as other details (like how replication works in Cassandra), you can read this introductory article.

Paths in the Apache Cassandra
Figure 2. Read and write paths in the Apache Cassandra* architecture.
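
To make these flows concrete, here is the behavior described above condensed into Java-style pseudocode. The names are simplified stand-ins for exposition, not actual Cassandra classes:

// Sketch of the original write path: log first, then the in-memory Memtable.
void write(Key key, Row row)
{
    commitLog.append(key, row);            // durability: every operation hits the disk log
    memtable.put(key, row);                // fast in-memory write
    // Memtables are eventually flushed to new SSTables on disk.
}

// Sketch of the original read path.
Row read(Key key)
{
    if (!bloomFilter.mightContain(key))    // false positives possible, no false negatives
        return null;                       // key definitely not on this instance
    Position pos = keyCache.get(key);
    if (pos == null)
        pos = partitionIndex.lookup(key);  // extra read from disk
    Row fromDisk = ssTables.read(compressionOffsets.locate(pos));
    Row fromMemtable = memtable.get(key);  // recent writes not yet flushed
    return merge(fromDisk, fromMemtable);  // merged result returned to the client
}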

Figure 3 presents the persistent-memory version of Cassandra. In this case, only one data structure is used.

persistent memory version of the Apache Cassandra
Figure 3. Main components of the persistent-memory version of the Apache Cassandra* architecture.

In the case of a write operation to the new Cassandra, only a single write to the PMTable data structure is needed. Since the structure resides on persistent memory already, no logging or any other write to disk is necessary. Likewise, a read operation involves only a single read to PMTable. If the requested keys are found in the table, the data is returned to the client.
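
In the same pseudocode style, and with the same caveat that these are simplified stand-ins, both paths collapse to a single access to the PMTable:

// Sketch of the persistent-memory paths: one structure, no Commit Log, no SSTables.
void write(Key key, Row row)
{
    pmTable.put(key, row);    // a single store to persistent memory
}

Row read(Key key)
{
    return pmTable.get(key);  // a single lookup; no Bloom Filter, caches, index, or merge
}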

This significant reduction in the number of components is what gives us shorter read and write code paths. In addition, there are other advantages gained by using persistent memory, especially regarding data serialization (as we will see next).

The Devil is in the Details

No need for serialization

One of the main challenges in designing applications, including Apache Cassandra, to handle data persistence is designing the data model. On the one hand, we want to have rich and powerful data structures adapted to the needs of the application by taking full advantage of what DRAM has to offer (byte addressability and speed). On the other hand, we want to provide persistence (so no data is lost). Due to the limitations of block storage (presented in the figures as “disk”) regarding data access granularity (block access) and speed (orders of magnitude slower than DRAM), serialization of data objects is required. This introduces a big design constraint since now we need to carefully design our data structures to avoid high overheads during serialization and deserialization phases.

The revolutionary aspect of persistent memory is that it is byte-addressable and fast (close to DRAM) without sacrificing persistence, all made possible by Intel® 3D XPoint™ Memory Technology. By putting our data structures on persistent memory, we can achieve persistence without serialization. This, however, does not come for free. Due to the nature of persistent memory (more on that below), traditional volatile code does not work out of the box. Some coding effort is still necessary.

Need for code transformation

At the core of the NVM Programming Model (NPM) standard, developed by key players in the industry through the Storage Networking Industry Association (SNIA), we have memory-mapped files. Using a special file system, processes running in user space can (after opening and mapping a file) access this mapped memory directly (through loads and stores) without involving the operating system. Programming directly against memory-mapped files, however, is not trivial. Data corruption can happen if CPU caches are not flushed before a sudden loss of power. To avoid that, programmers need to design their data structures so that transient torn writes can be tolerated, and they need to make sure that the proper flushing instructions are issued at exactly the right time. Too much flushing is not good either, because it impacts performance.

Fortunately, Intel has developed the Persistent Memory Developer Kit (PMDK), an open-source collection of libraries—implemented in C/C++—and tools that provide low-level primitives as well as useful high-level abstractions to help persistent memory programmers overcome these obstacles. Intel has also implemented Persistent Collections for Java* (PCJ), an API provided for persistent-memory programming in Java emphasizing persistent collections. For more information, you can read the following introductory article to PCJ.
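
Before diving into the Cassandra code, a minimal sketch gives a feel for the PCJ programming model (assuming a standard PCJ setup with a configured persistent heap; the name "counter_array" is arbitrary). The array lives in persistent memory, and the named entry in ObjectDirectory makes it reachable again after a restart:

import lib.util.persistent.ObjectDirectory;
import lib.util.persistent.PersistentIntArray;

public class PcjCounter
{
    public static void main(String[] args)
    {
        // Look up the array from a previous run; create and register it on the first run.
        PersistentIntArray data = ObjectDirectory.get("counter_array", PersistentIntArray.class);
        if (data == null)
        {
            data = new PersistentIntArray(8);
            ObjectDirectory.put("counter_array", data);
        }
        data.set(0, data.get(0) + 1);  // persisted directly, no serialization involved
        System.out.println("This program has run " + data.get(0) + " time(s).");
    }
}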

The transformation of Cassandra, which is written in Java, is done using PCJ. Next, an example is presented showcasing the cell class, which is the basic object storing data for a particular row. To learn more, please refer to the source code.

A transformed cell

First, let’s look at the class BufferCell, used to buffer cell objects in volatile memory:

...
public class BufferCell extends AbstractCell
{
    private static final long EMPTY_SIZE = ObjectSizes.measure(new BufferCell(ColumnMetadata.regularColumn("", "", "", ByteType.instance), 0L, 0, 0, ByteBufferUtil.EMPTY_BYTE_BUFFER, null));

    private final long timestamp;
    private final int ttl;
    private final int localDeletionTime;

    private final ByteBuffer value;
    private final CellPath path;
...

    public ByteBuffer value()
    {
        return value;
    }
...

As you can see, this is just a regular Java class. BufferCell extends (inherits from) another class, AbstractCell, and defines the fields timestamp, ttl, localDeletionTime, value, and path. The data proper is stored in the field value, which is a ByteBuffer. To return the value, the method value() simply returns this field.

Let’s compare the above class to the class PersistentCell, which, in the new Cassandra, is used to store cell objects in persistent memory:

public final class PersistentCell extends PersistentCellBase implements MSimpleCell
{
    private static final FinalObjectField<PersistentCellType> CELL_TYPE = new FinalObjectField<>();
    private static final ByteField FLAGS = new ByteField();
    private static final LongField TIMESTAMP = new LongField();
    private static final IntField LOCAL_DELETION_TIME = new IntField();
    private static final IntField TTL = new IntField();
    private static final ObjectField<PersistentImmutableByteArray> VALUE = new ObjectField<>();
    private static final ObjectField<PersistentImmutableByteArray> CELL_PATH = new ObjectField<>();
    private static final ObjectField<PMDataTypes> DATA_TYPE = new ObjectField<>();

    public static final ObjectType<PersistentCell> TYPE = 
                 ObjectType.fromFields(PersistentCell.class,
                                       CELL_TYPE,
                                       FLAGS,
                                       TIMESTAMP,
                                       LOCAL_DELETION_TIME,
                                       TTL,
                                       VALUE,
                                       CELL_PATH,
                                       DATA_TYPE);
...
    public byte[] getValue()
    {
        PersistentImmutableByteArray bArray = getObjectField(VALUE);
        return (bArray != null) ? bArray.toArray() : null;
    }
...

The first noticeable difference is in the declaration of fields. We go, for example, from int ttl in the volatile class to IntField TTL in the persistent one. The reason for this change is that TTL is not a field but a meta field. Meta fields in PCJ serve only as guidance to PersistentObject (all custom persistent classes need this class as an ancestor in their inheritance path), which accesses the real fields as offsets in persistent memory. PersistentCell passes this information up to PersistentObject by constructing the special meta field TYPE, which is passed up the constructor chain by calling super(TYPE,…). This need for meta fields is simply an artifact of persistent objects not being supported natively in Java.

You can see how meta fields work by looking at the getValue() method. Here, we do not have direct access to the field. Instead, we call the method getObjectField(VALUE), which will return a reference to a location in persistent memory where the field is stored. VALUE is used by PersistentObject to know where, in the layout of the object in persistent memory, the desired field is located.
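The constructor code elided above follows a fixed pattern. The hypothetical class below (not part of Cassandra) shows the complete skeleton: meta fields, the TYPE object, a public constructor that calls super(TYPE) and sets the real fields, and the private "reconstructor" that PCJ invokes when it re-materializes an existing object from persistent memory:

import lib.util.persistent.*;
import lib.util.persistent.types.*;

public final class Point extends PersistentObject
{
    private static final IntField X = new IntField();
    private static final IntField Y = new IntField();
    private static final ObjectType<Point> TYPE =
        ObjectType.fromFields(Point.class, X, Y); // mirrors PersistentCell's TYPE

    public Point(int x, int y)
    {
        super(TYPE);         // passes the layout up the constructor chain
        setIntField(X, x);   // real fields are written at offsets in pmem
        setIntField(Y, y);
    }

    // Reconstructor: called by PCJ to rebind an object that already
    // exists in persistent memory (for example, after a restart).
    private Point(ObjectPointer<Point> pointer)
    {
        super(pointer);
    }

    public int getX() { return getIntField(X); }
    public int getY() { return getIntField(Y); }
}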

How to Run

The patch to enable Cassandra for persistent memory is open source and available. To build Cassandra using this patch, you need PCJ, PMDK, and Java 8 or above. You will also need the Ant build tool.

Currently there are two versions of the patch, and both require the following version of Cassandra from GitHub*: 106691b2ff479582fa5b44a01f077d04ff39bf50 (June 5, 2017). You can get this version by running the following inside your locally cloned Cassandra repository:

$ git checkout 106691b2ff479582fa5b44a01f077d04ff39bf50

To apply the patch, move the file to the root directory of the repository and run this:

$ git apply --index in-mem-cassandra-1.0.patch

In order to build Cassandra, you need to include the PCJ libraries in the project (JAR and .so files). First, create a JAR file for the PCJ classes:

$ cd <PCJ_HOME>/target/classes
$ jar cvf persistent.jar lib/

Now copy persistent.jar and libPersistent.so to Cassandra’s library path:

$ cp <PCJ_HOME>/target/classes/persistent.jar <CASSANDRA_HOME>/lib/
$ mkdir <CASSANDRA_HOME>/lib/persistent-bin
$ cp <PCJ_HOME>/target/cppbuild/libPersistent.so <CASSANDRA_HOME>/lib/persistent-bin/

Finally, add the following line to the configuration file <CASSANDRA_HOME>/conf/cassandra-env.sh so Java knows where to find the native library file libPersistent.so (which is used as a bridge between PCJ and PMDK):

JVM_OPTS="$JVM_OPTS -Djava.library.path=$CASSANDRA_HOME/lib/persistent-bin"

After that, you can build Cassandra by simply running ant:

$ ant
. . .
BUILD SUCCESSFUL
Total time: ...

For the last step, we create a configuration file called config.properties for PCJ. This file needs to reside in the current working directory when launching Cassandra. The following example sets the heap path to /mnt/mem/persistent_heap and its size to 100 GB (assuming that a persistent memory device—real or emulated using RAM—is mounted at /mnt/mem):

path=/mnt/mem/persistent_heap
size=107374182400

Be aware that if the file config.properties does not exist, Cassandra will default to the path /mnt/mem/persistent_heap and a size of 2 GB. You can now start Cassandra normally. For more information, please refer to the readmes provided with the patches.
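If you do not have real persistent memory hardware, one way to emulate it with DRAM—assuming a memory region was reserved at boot with a kernel parameter such as memmap=100G!4G and shows up as /dev/pmem0—is to create a DAX-enabled file system on it:

$ sudo mkfs.ext4 /dev/pmem0
$ sudo mkdir -p /mnt/mem
$ sudo mount -o dax /dev/pmem0 /mnt/mem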

Summary

In this article, I described how the software architecture of the well-known Apache Cassandra distributed database was transformed to use persistent memory. First, the high-level architectural changes were shown. Next, some details of this transformation were discussed, specifically those related to serialization and code transformation. I then showed an example of code transformation by comparing the classes BufferCell (from the original Cassandra) and PersistentCell (from the new Cassandra). I finished by showing how you can download and run this new persistent-memory-aware Cassandra.

About the Author

Eduardo Berrocal joined Intel as a Cloud Software Engineer in July 2017 after receiving his PhD in Computer Science from the Illinois Institute of Technology (IIT) in Chicago, Illinois. His doctoral research interests were focused on (but not limited to) data analytics and fault tolerance for high-performance computing. In the past he worked as a summer intern at Bell Labs (Nokia), as a research aide at Argonne National Laboratory, as a scientific programmer and web developer at the University of Chicago, and as an intern in the CESVIMA laboratory in Spain.

Resources

Leveraging RDMA Technologies to Accelerate Ceph* Storage Solutions


In this article, we first review the performance challenges encountered in Ceph* 4K I/O workloads and give a brief analysis of the CPU distribution for a single Ceph object storage daemon (OSD) process. We then discuss inefficiencies in the existing TCP/IP stack, introduce the iWARP RDMA protocol supported by Intel® Ethernet Connection X722, and describe the design and implementation of iWARP RDMA integration into Ceph. Finally, we provide a performance evaluation of Ceph with iWARP RDMA, which demonstrates up to 17 percent performance improvement compared with the TCP/IP stack 6.

Background

Red Hat Ceph*, one of today’s most popular distributed storage systems, provides scalable and reliable object, block, and file storage services in a single platform 1. It is widely adopted in both cloud and big data environments, and over the last several years, the Ceph RADOS block device (RBD) has become the dominant OpenStack* Cinder driver. Meanwhile, with the emergence of new hardware technologies like Intel® 3D XPoint memory2 and remote direct memory access (RDMA) network interface cards (NICs), enterprise application developers have developed new expectations for high-performance and ultra-low latency storage solutions for online transaction processing (OLTP) workloads on the cloud.

Ceph has made steady progress on improving its networking messenger since the Ceph Jewel* release. The default simple messenger has been changed to async messenger to be more CPU efficient and compatible with different network protocols, such as TCP/IP and RDMA. The Ceph community designed and implemented a new solid-state drive (SSD)-friendly object storage backend called BlueStore*, and leveraged additional state-of-the-art software technology such as the Data Plane Development Kit (DPDK) and the Storage Performance Development Kit (SPDK). These software stack changes made it possible to further improve the performance of Ceph-based storage solutions.

Software evolution in the Ceph system
Figure 1. Software evolution in the Ceph* system

Intel® Optane™ SSDs, used for Red Hat Ceph BlueStore metadata and write-ahead log (WAL) drives, fill the gap between DRAM and NAND-based SSDs, providing unrivaled performance even at low queue depth workloads. Intel® Xeon® Scalable processors, offering a range of performance, scalability, and feature options to meet a wide variety of workloads, are ideal for Red Hat Ceph data-intensive solutions. RDMA enables direct, zero-copy data transfer between RDMA-capable server adapters and application memory, removing the need in Ethernet networks for data to be copied multiple times to operating system data buffers. This is highly efficient and eliminates the associated processor-intensive context switching between kernel space and user space. The Intel Xeon Scalable processor platform includes integrated Intel® Ethernet Connection X722 with Internet Wide-area RDMA Protocol (iWARP), providing up to four 10-gigabit Ethernet (GbE) ports for high-throughput, low-latency workloads, which makes the platform an ideal choice for scale-out storage solutions.

Motivation

The performance challenges

At the 2017 Boston OpenStack Summit3, Intel presented an Intel Optane SSD and Intel® SSD Data Center P4500 Series based Ceph all-flash array cluster that delivered multimillion input/output operations per second (IOPS) with extremely low latency and competitive dollar per gigabyte costs. We also showed the significant networking overhead imposed by network messenger. As shown in Figure 2, the CPU tended to be the bottleneck in a 4K random read workload. Analysis showed that 22–24 percent of the CPU was used to handle network traffic, highlighting the need to optimize the Ceph networking component for ultra-low latency and low CPU overhead. Traditional TCP/IP cannot satisfy this requirement, but RDMA can 4.

Networking component bottleneck in the Ceph system
Figure 2. Networking component bottleneck in the Ceph* system.

RDMA versus traditional TCP/IP protocol

Today there are three RDMA options: InfiniBand*, which requires deploying a separate infrastructure in addition to the requisite Ethernet network; and iWARP 5 and RoCE (RDMA over Converged Ethernet), which are networking protocols that implement RDMA for efficient data transfer over Internet Protocol networks.

Previous studies have shown that traditional TCP/IP has two outstanding issues: high CPU overhead for handling packets in the operating system kernel and high message transfer round-trip latency, even when the average traffic load is moderate. RDMA performs direct memory access from one computer to another without involving the destination computer’s operating system, and has the following advantages over TCP/IP:

  • It avoids memory copies on both sender and receiver, giving the application the smallest round-trip latency and lowest CPU overhead.
  • Data moves from the network directly into an area of application memory in the destination computer, without involving its operating system or the network input/output (I/O) stack.
  • RDMA transfers data as messages, while TCP sockets transfer data as a stream of bytes; RDMA thus avoids the stream headers that consume additional network bandwidth and processing.
  • RDMA is naturally asynchronous; no blocking is required during a message transfer.

Therefore, we expect lower CPU overhead and lower network message latency when integrating RDMA into the Ceph network component.

Integrating iWARP into the Ceph* System

This section describes the evolution of RDMA design and implementation in Ceph. We will discuss the general architecture of Ceph RDMA messenger, and then share how we enabled iWARP in the current Ceph async messenger.

Ceph RDMA messenger

The Ceph system relies on messengers for communications. Currently, the Ceph system supports simple, async, and XIO messengers. From the messenger's point of view, all of the Ceph services, such as OSD, monitor, and metadata server (MDS), can be treated as message dispatchers or consumers. The messenger layer plays the role of a bridge between Ceph services and the bottom-layer network hardware.

There are several other projects that focus on integrating RDMA into the Ceph system; XIO* messenger is one of them. XIO messenger is built on top of the Accelio* project, a high-performance asynchronous reliable messaging and remote procedure call (RPC) library optimized for hardware acceleration. It was merged into the Ceph master in 2015 and supports different network protocols, such as RDMA and TCP/IP. XIO messenger seamlessly supports RDMA, including InfiniBand*, RoCE*, and iWARP*. In this implementation, RDMA is treated as a network component, like simple messenger or async messenger in the Ceph system. According to feedback from the Ceph community 7, there are some scalability and stability issues; currently this project is not actively maintained.

Another project is aimed at integrating InfiniBand RDMA into async messenger. Async messenger is the default networking component starting with the Ceph Jewel release. Compared to simple messenger, the default networking component before Ceph Jewel, async messenger is more CPU-efficient and spares more CPU resources. It is an asynchronous networking library designed for the Ceph system and is compatible with different network protocols such as POSIX sockets, InfiniBand RDMA, and DPDK. Figure 3 shows the architecture of Ceph async messenger with the InfiniBand protocol; RoCE support is similar.

InfiniBand integration with Ceph async messenger
Figure 3. InfiniBand integration with Ceph* async messenger

iWARP integration with async messenger

With the rapid growth of message transfer between Internet applications, high-speed, high-throughput, low-latency networking is needed to connect servers in data centers, where Intel Ethernet is still the dominant physical layer and the TCP/IP stack is widely used for network services. We concluded earlier that the TCP/IP stack cannot meet the demands of the new generation of data center workloads. Ceph with iWARP RDMA is a practical way for data centers running Ceph over TCP/IP to move to RDMA, leveraging Intel® Ethernet with iWARP RDMA to accelerate the Ceph system.

iWARP integrated in async messenger
Figure 4. iWARP integrated in async messenger

Thanks to the extensible framework of async messenger, we can change RDMA connection management to use the RDMA connection management (RDMA-CM) library to support iWARP, instead of the current InfiniBand RDMA implementation, which uses a self-implemented, TCP/IP-based RDMA connection management. We implement the RDMA connection interface with the librdmacm library so it is compatible with other implementations, including InfiniBand and RoCE. Choosing iWARP or InfiniBand as the RDMA protocol is configurable. In addition, we support creating queue pairs that are not associated with a shared receive queue. The memory requested by the receive queue is allocated from a centralized memory pool, which is reserved when the async messenger service starts and released when it ends.

Performance Tests

In this section, we present the performance evaluation of Ceph with iWARP RDMA.

Testing methodology

The performance evaluation was conducted on a cluster with two OSD nodes and two client nodes. The detailed configurations were as follows:

  • Hardware configuration: Each of the four nodes was configured with an Intel Xeon Platinum 8180 processor and 128 GB memory, with integrated 10-gigabit Intel Ethernet Connection X722 with iWARP RDMA. Each of the OSD nodes had 4x Intel SSD Data Center P3520 Series 2 TB SSDs as storage devices.
  • Ceph system and FIO* configuration: The OSD servers ran Ubuntu* 17.10 with the Ceph Luminous* release. Each OSD drive on each server node hosted one OSD process, serving as BlueStore* data and DB drive, for a total of 8x OSD processes running in the test. The RBD pool used for testing was configured with two replicas. The FIO version was 2.12.
  • Network configuration: The network module between OSD nodes and client nodes was user defined. In this test, we changed the network module from TCP/IP to RDMA. The networking topology is described in Figure 5. For Ceph with RDMA testing, the public network and cluster network shared one NIC.

Ceph benchmarking topology
Figure 5. Ceph* benchmarking topology—two nodes

We simulated typical workloads on an all-flash Ceph cluster in the cloud with FIO 4K random write running on Ceph RBD volumes. For each test case, IOPS was measured at different levels of queue depth scaling (1 to 32). Each volume was configured to be 30 GB. The volumes were pre-allocated to eliminate the Ceph thin-provision mechanism’s impact on stable and reproducible results. The OSD page cache was dropped before each run to eliminate page cache impact. For each test case, FIO was configured with a 300-second warm up and 300-second data collection.
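For reference, an FIO job file consistent with this methodology might look like the sketch below. The pool, client, and volume names are placeholders rather than our exact test settings:

[global]
; drive I/O through librbd, as in the tests above
ioengine=rbd
clientname=admin
pool=rbd
; hypothetical pre-allocated 30 GB volume
rbdname=test_volume
rw=randwrite
bs=4k
direct=1
; 300-second warm up, 300-second measurement
time_based
ramp_time=300
runtime=300
group_reporting

[4k-randwrite]
iodepth=2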

Ceph system performance comparison with TCP and with iWARP RDMA

FIO performance comparison

(a) FIO performance comparison

CPU comparison on OSD node

(b) CPU comparison on OSD node

Figure 6. Ceph* system performance comparison with RDMA or TCP/IP

Figure 6 (a) illustrates aggregated FIO IOPS on the client nodes using different network protocols. It shows that Ceph with RDMA delivered higher performance in a 4K random write workload than TCP/IP—up to a 17 percent performance improvement with queue depth = 2. Increasing the FIO queue depth also affected the RDMA results: the RDMA benefit was more pronounced at low queue depths, depending on Ceph tunings such as the completion queue depth in Ceph RDMA messenger.

Figure 6 (b) shows CPU utilization on an OSD node when running an FIO process on an RBD volume. The CPU utilization of Ceph with RDMA was higher than with TCP/IP, which was not what we expected (the detailed root cause is explained later). Theoretically, RDMA should reduce CPU utilization, since RDMA bypasses the kernel and limits context switching.

CPU profiling on OSD node
Figure 7. CPU profiling on OSD node

As shown in Figure 7, Ceph with TCP/IP consumed more system-level CPU, while Ceph with iWARP RDMA consumed more user-level CPU. The first part makes sense at first glance: RDMA achieves kernel bypass, so it consumes less system-level CPU. It is less obvious why RDMA consumed more user-level CPU; the root cause is explained later. Even though Ceph with iWARP consumed more CPU overall, the FIO IOPS per CPU cycle on the OSD node was higher than with TCP/IP. Overall, Ceph with iWARP provided higher 4K random-write performance and was more CPU efficient than Ceph with TCP/IP.

Scalability tests

To verify the scalability of Ceph with iWARP RDMA, we scaled up the number of OSD nodes and client nodes to three, keeping the other Ceph configuration and benchmarking methodologies the same as previous tests.

Scale up to three nodes
Figure 8. Ceph* benchmarking topology—scale up to three nodes

With one more OSD node, the performance of Ceph with iWARP increased by 48.7 percent and the performance of Ceph with TCP/IP increased by 50 percent; both showed good node scalability. Not surprisingly, Ceph with iWARP RDMA delivered higher 4K random-write performance on the three-OSD-node cluster.

Scale up to three nodes
Figure 9. Ceph* benchmarking topology—scale up to three nodes

Performance Analysis

To better understand the overhead inside Ceph async messenger with iWARP RDMA, we looked at the message receiving flow.

Data receiving flow in async messenger
Figure 10. Data receiving flow in async messenger with RDMA

To describe the flow more clearly, we simplify the message transfer process under the following preconditions: there is a single server and client; the client has already established an RDMA connection with the server; and the server sends a 4K message to the client.

  • Once the network driver on the client side gets the remote send request, it triggers the CQ polling event. The event takes over a back-end worker thread and handles the CQ polling. The polling event fetches the 4K remote DMA message, puts it into the async messenger recv buffer, and issues another request to trigger an async messenger read event. After that, the polling event releases the back-end thread.
  • The read event reads the 4K message from the specified recv buffer and then hands it to the corresponding dispatcher to handle. Finally, the read event releases the worker thread and completes the read process.

The RDMA message transfer process is based on Ceph async messenger. For one message receiving flow, two events are triggered and one message copy is made. We go deeper and use a perf flame graph to get the details of CPU usage for one async messenger worker.

CPU usage of async messenger worker
Figure 11. CPU usage of Ceph* async messenger worker

Figure 11 shows that most of the CPU used by a worker is consumed by the RDMA polling thread and the async messenger polling process. As described in the message transfer flow, adding RDMA polling on top of async messenger increases CPU overhead and context switches because it doubles the polling work, and two events are triggered for each message transfer. Meanwhile, an additional message copy from the RDMA receive buffer to the async messenger recv buffer adds round-trip latency to message transfer. The two polling threads and the additional memory copy lead to higher user-level CPU consumption for Ceph with iWARP RDMA.

Next Steps

Performance optimization

Adapting RDMA polling to an I/O multiplexing framework such as async messenger is not an optimal solution. RDMA concentrates on avoiding CPU overhead at the kernel level; signaling a descriptor in async messenger introduces an extra context switch, which increases CPU overhead. Meanwhile, we have proposed an RDMA messenger library and integrated it with a distributed cache storage project, Hyper-converged Distributed Cache Storage (HDCS) 8. The initial benchmark shows a large performance benefit (I/O and CPU consumption) with RDMA networking compared to TCP/IP.

Based on past experience, we will continue to optimize the Ceph RDMA code, including separating RDMA polling from the async messenger event driver and avoiding the memory copy to the async messenger recv buffer. Because the RDMA protocol provides message-based rather than stream-based transactions, we do not need to split the stream into different messages/transactions on the sender side and piece them together on the receiver side. Message-based transactions make it possible to avoid extra memory copy operations for buffering the data.

Disaggregate Ceph storage node and OSD node with NVMe-oF

Two issues lead us to consider leveraging non-volatile memory express over Fabrics (NVMe-oF) to disaggregate the Ceph storage node and the OSD node. First, the current Ceph system configuration cannot fully benefit from NVMe drive performance; the journal drive tends to be the bottleneck. Second, with a one-OSD-process-per-NVMe-drive configuration, 40 percent of the Intel Xeon processor is left unutilized on the OSD node. By disaggregating the Ceph storage and OSD nodes, we can use all NVMe devices on the target node as an NVMe pool and dynamically allocate an appropriate NVMe drive to the specified OSD node.

We have initial performance data showing that NVMe-oF does not degrade Ceph 4K random-write performance. With NVMe-oF CPU offload on the target node, the CPU overhead on the target node is less than 1 percent, and we did not find evidence of CPU overhead on the OSD node. However, we found that with NVMe-oF, the FIO tail latency is much higher than with a local NVMe drive under a high FIO queue depth workload. We still need to identify the root cause and leverage the high-density storage node as a pool for lower TCO.

Summary

We find that the CPU still tends to be the bottleneck in 4K random read/write workloads, which severely limits peak performance and OSD scale-up ability, even with the latest network layer and object storage backend optimizations in the Ceph community. RDMA provides remote memory access, bypassing the kernel to relieve CPU overhead there and reducing round-trip message transfer time.

We find that iWARP RDMA accelerates the Ceph network layer (async messenger) and improves 4K random-write performance by up to 17 percent. In addition, Ceph with iWARP RDMA shows great scalability. When scaling the Ceph OSD nodes from two to three, the 4K random-write performance increased by 48.7 percent.

According to system metrics on the OSD node, Ceph with iWARP RDMA consumes more CPU. However, through deeper analysis of the CPU cycle distribution, we identified the two-polling-threads issue in the current RDMA implementation.

Next, we will focus on async messenger RDMA performance optimization, including the two-polling-threads issue. Furthermore, we will explore the opportunity to leverage NVMe-oF and use the high-density storage node as a storage pool to reduce the TCO of a Ceph all-flash array cluster.

References

1. OpenStack User Survey (PDF)

2. Intel Optane technology and Intel® 3D NAND SSDs

3. Ceph system optimization with Intel Optane and Intel Xeon platform

4. RDMA over Commodity Ethernet at Scale (PDF)

5. iWARP RDMA technology brief

6. Accelerating Ceph with RDMA and NVMe-oF

7. Ceph with XIO Messenger performance

8. Hyper-converged Distributed Cache Storage (HDCS)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

§ Configurations: [describe config + what test used + who did testing].

§ For more information go to http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Configure Yardstick Network Services Benchmarking to Measure NFVI and VNF Performance


Introduction

Yardstick Network Services Benchmarking (NSB) is a test framework for measuring the performance of a virtual network function (VNF) in a network function virtualization (NFV) environment. Yardstick NSB may also be used to characterize the performance of an NFV infrastructure (NFVI).

Yardstick NSB may be run in three environments: native Linux* (bare metal), standalone virtualized (VNF running in a virtual machine), and managed virtualization. This tutorial is the first of a two-part series. It details how to install Yardstick NSB on bare metal. Part two shows how to run Yardstick NSB for characterizing an NFV/NFVI.

Installing Yardstick NSB on Bare Metal

In this section, we show how to install Yardstick NSB on bare metal for creating a minimal-scale test system capable of running the NFVI test cases and visualizing the output.

The hardware and software requirements used in these tests include:

  • A jump server to run Yardstick NSB software.
  • Two other servers connected to each other, back-to-back, with two network interfaces on each server. One server is used as a system under test (SuT); the other is used as a traffic generator (TG).
  • A managed network that connects the three servers.
  • A minimum of 20 gigabytes (GB) of available hard disk space on each server.
  • A minimum 8 GB of memory on each server.
  • A minimum of eight cores is needed on the SuT and TG.
  • On the SuT and TG, one 1-Gigabit Ethernet (GbE) network interface is necessary for the managed network and two 10-GbE interfaces for the data plane. The jump server only requires one 1-GbE network interface for the managed network. The network interfaces for the data plane must support the Data Plane Development Kit (DPDK).
  • All three servers have Ubuntu* 16.04 installed.

The following figure illustrates the configuration of the servers:

Network configuration of the servers
Figure 1. Network configuration of the servers. All three servers connect using a managed network. On the SuT and the TG, two network interfaces are connected back-to-back to serve as a data plane network. The network interfaces in the data plane network support DPDK.

Prerequisites on the jump server, the SuT, and the TG

The following requirements are applicable for the jump server, the SuT, and the TG:

Network connectivity

  • All servers should have an Internet connection so they can download and install NSB software.
  • Each server should be able to log in to the others using a Secure Shell (SSH) with root privilege via the management network.
  • For convenience, it is recommended that you configure a static IP address on each server so that, should you have to reboot the machines, their IP addresses in the configuration files remain unchanged (a sketch follows this list).
  • Each server should be able to use HTTP and HTTPS to transfer data between a browser and a website. If you are behind a firewall, you may need to set the HTTP/HTTPS proxy:
$ export http_proxy=http://<myproxy.com>:<myproxyport>
$ export https_proxy=https://<mysecureproxy.com>:<myproxyport>
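On Ubuntu 16.04, one way to pin a static address (as recommended in the list above) is the /etc/network/interfaces file. A minimal sketch, assuming an interface named enp3s0f0 and example addresses:

auto enp3s0f0
iface enp3s0f0 inet static
    address 10.23.3.155
    netmask 255.255.255.0
    gateway 10.23.3.1
    dns-nameservers 8.8.8.8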

Check package resources (optional)

Add repositories at the end of the /etc/apt/sources.list file:

$ sudo vi /etc/apt/sources.list
deb http://archive.ubuntu.com/ubuntu xenial main restricted universe multiverse

Depending on your configuration, you may also have to add the following two proxy lines to the file /etc/apt/apt.conf.d/95proxies:

$ cat /etc/apt/apt.conf.d/95proxies
Acquire::http::proxy "http://<myproxy.com>:<myproxyport>/";
Acquire::https::proxy "https://<mysecureproxy.com>:<myproxyport>/";

Then, reboot the system. After logging on, update the package lists for packages that need upgrading:

$ sudo apt-get update
Ign:1 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/ocata InRelease
Hit:2 http://ppa.launchpad.net/juju/stable/ubuntu xenial InRelease
Ign:3 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/pike InRelease
Hit:4 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/ocata Release
Hit:6 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/pike Release
Get:8 http://us.archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Hit:9 http://us.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:10 http://us.archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:11 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Hit:12 http://ppa.launchpad.net/maas/stable/ubuntu xenial InRelease
Hit:13 http://security.ubuntu.com/ubuntu xenial-security InRelease
Fetched 494 kB in 41s (11.9 kB/s)
Reading package lists... Done

Add additional packages (optional)

If needed, additional packages (openssh-server, xfce4, xfce4-goodies, tightvncserver, tig, apt-transport-https, ca-certificates) may be installed:

$ sudo apt-get update && sudo apt-get install openssh-server \
xfce4 xfce4-goodies tightvncserver tig apt-transport-https ca-certificates

SSH root login

To allow root login using SSH, add "PermitRootLogin yes" to the /etc/ssh/sshd_config file, and then restart the SSH service:

$ sudo vi /etc/ssh/sshd_config
PermitRootLogin yes
$ sudo service ssh restart

Add passwordless sudo for a user

To give a user permission to use sudo without a password, edit /etc/sudoers and add myusername ALL=(ALL) NOPASSWD:ALL at the end, where myusername is the user name of the account. It is better to use visudo to edit the /etc/sudoers file instead of vi, since visudo validates the syntax of the file upon saving.

$ sudo visudo

Then, add this line at the end of the /etc/sudoers file:

myusername ALL=(ALL) NOPASSWD:ALL

If you are behind a firewall, you have to add this line after the "Defaults env_reset" line:

Defaults env_keep = "http_proxy https_proxy"

Install yardstick NSB on the jump server, the SuT, and the TG

Install yardstick NSB software

To install Yardstick NSB software, execute the following instructions on the jump server, the SuT, and the traffic generator. In this tutorial, we install the stable Euphrates* version of Yardstick:

$ git clone https://gerrit.opnfv.org/gerrit/yardstick
$ cd yardstick
$ git checkout stable/euphrates
$ sudo ./nsb_setup.sh

It can take up to 15 minutes to complete the installation. The script not only installs Yardstick NSB, but also installs the DPDK and the realistic traffic generator (TRex*), and downloads the packages required for building the Docker* image of Yardstick. Although only the jump server needs to have Yardstick NSB installed, for simplicity just run the script on all three servers (the jump server, the SuT, and the TG). At the end, you simply delete the Docker image of Yardstick on the SuT and the TG servers.

Verify yardstick NSB installation on the jump server

After successfully installing on the jump server, you may want to verify that the Yardstick image is created. In the jump server, list all of the Docker image(s):

$ sudo docker images
REPOSITORY          TAG             IMAGE ID            CREATED            SIZE
opnfv/yardstick     stable          4a035a71c93f        6 days ago         2.08 GB

You can check the state of the running container(s):

$ sudo docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                NAMES
5c9627810060        opnfv/yardstick:stable   "/usr/bin/supervisord"   25 minutes ago      Up 25 minutes       5000/tcp, 5672/tcp   yardstick

This shows that a Docker container named "yardstick" is currently running on the jump server. To connect to the Yardstick NSB running container from the jump server:

$ sudo docker exec -it yardstick /bin/bash
root@5c9627810060:/home/opnfv/repos#
root@5c9627810060:/home/opnfv/repos# pwd
/home/opnfv/repos
root@5c9627810060:/home/opnfv/repos# ls
storperf  yardstick

To check the IP address of the Yardstick NSB container from its container:

root@5c9627810060:/home/opnfv/repos# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
21: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:08 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.8/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:8/64 scope link
       valid_lft forever preferred_lft forever

Note that the IP address of the Yardstick NSB container is 172.17.0.8.

By default, the results of running Yardstick NSB are displayed on screen as text. To improve the visualization of the results, you need to install InfluxDB* and the Grafana* dashboard.

Installing InfluxDB* on the Jump Server

From the Yardstick NSB container, run the following command to start the Influx container:

root@5c9627810060# yardstick env influxdb
No handlers could be found for logger "yardstick.common.utils"
/usr/local/lib/python2.7/dist-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.restful is deprecated, use flask_restful instead.
  .format(x=modname), ExtDeprecationWarning
* creating influxDB

After it starts, you need to find the name of the Influx container. List all running containers from another jump server terminal:

$ sudo docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                            NAMES
dad03fb24deb        tutum/influxdb:0.13      "/run.sh"                15 seconds ago      Up 14 seconds       0.0.0.0:8083->8083/tcp, 0.0.0.0:8086->8086/tcp   modest_bassi
5c9627810060        opnfv/yardstick:stable   "/usr/bin/supervisord"   31 minutes ago      Up 31 minutes       5000/tcp, 5672/tcp 

Note: A name was generated for the Influx container: "modest_bassi". From a terminal on the jump server, run the following Docker command to connect to the Influx container by passing its name as a parameter:

$ sudo docker exec -it modest_bassi /bin/bash
root@dad03fb24deb:/#
root@dad03fb24deb:/# pwd
/
root@dad03fb24deb:/# ls
bin   config  dev  home  lib64  mnt  proc  run     sbin  sys  usr
boot  data    etc  lib   media  opt  root  run.sh  srv   tmp  var

After connecting to the Influx container, find its IP address by running the following command from the Influx container:

root@dad03fb24deb:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
44: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link
       valid_lft forever preferred_lft forever

It shows that 172.17.0.2 is the IP address of the Influx container. This IP address is used in the yardstick config file. From the Yardstick NSB container, edit the yardstick config file at /etc/yardstick/yardstick.conf, and make sure the Influx IP address is set to 172.17.0.2.

root@5c9627810060:/home/opnfv/repos# cat /etc/yardstick/yardstick.conf
[DEFAULT]
debug = False
dispatcher = influxdb

[dispatcher_http]
timeout = 5
target = http://127.0.0.1:8000/results

[dispatcher_file]
file_path = /tmp/yardstick.out
max_bytes = 0
backup_count = 0

[dispatcher_influxdb]
timeout = 5
target = http://172.17.0.2:8086
db_name = yardstick
username = root
password = root

[nsb]
trex_path = /opt/nsb_bin/trex/scripts
bin_path = /opt/nsb_bin
trex_client_lib = /opt/nsb_bin/trex_client/stl

Note: The dispatcher output file is located at /tmp/yardstick.out in the Yardstick container (see the dispatcher_file section above).

To visualize the data, a web browser (and, if needed, a desktop environment) should be installed on the jump server. Start a browser on the jump server and browse to http://<jumphost_IP>:8083. In this exercise, since the jump server’s IP address is 10.23.3.73, browse to http://10.23.3.73:8083 to access the InfluxDB web UI:

influxDB admin interface
Figure 2. The InfluxDB* web UI.

In the Query box, type SHOW DATABASES. You should see "yardstick".

influxDB show databases interface
Figure 3. Show databases in InfluxDB*.

Alternatively, you can verify from inside the Influx container that the database includes Yardstick NSB. From the Influx container, type influx and verify that the yardstick database is listed:

root@8aa048e69c34:/# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.13.0
InfluxDB shell version: 0.13.0
> SHOW DATABASES
name: databases
---------------
name
yardstick
_internal

> quit
root@8aa048e69c34:/#

Installing Grafana* on the Jump Server

Grafana is used to monitor and visualize data. From the Yardstick NSB container, run the following command to start the Grafana container:

root@5c9627810060:/home/opnfv/repos# yardstick env grafana
No handlers could be found for logger "yardstick.common.utils"
/usr/local/lib/python2.7/dist-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.restful is deprecated, use flask_restful instead.
  .format(x=modname), ExtDeprecationWarning
* creating grafana                                                    [Finished]

To find the name of the running Grafana container, list all running containers from another jump server terminal:

$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
opnfv/yardstick     stable              4a035a71c93f        4 weeks ago         2.08 GB
grafana/grafana     4.4.3               49e2eb4da222        9 months ago        287 MB
tutum/influxdb      0.13                39fa42a093e0        22 months ago       290 MB

$ sudo docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                            NAMES
077052dc73a9        grafana/grafana:4.4.3    "/run.sh"                38 minutes ago      Up 38 minutes       0.0.0.0:1948->3000/tcp                           confident_jones
dad03fb24deb        tutum/influxdb:0.13      "/run.sh"                57 minutes ago      Up 57 minutes       0.0.0.0:8083->8083/tcp, 0.0.0.0:8086->8086/tcp   modest_bassi
5c9627810060        opnfv/yardstick:stable   "/usr/bin/supervisord"   2 weeks ago         Up 17 hours         5000/tcp, 5672/tcp                               yardstick

Note: The name generated for the Grafana container is "confident_jones". From a terminal on the jump server, run the following Docker command to connect to the Grafana container by passing its name as a parameter. As with the Influx container, you can then query the IP address of the Grafana container:

$ sudo docker exec -it confident_jones /bin/bash
root@077052dc73a9:/# ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  run.sh  sbin  srv  sys  tmp  usr  var
root@077052dc73a9:/# pwd
/

root@077052dc73a9:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
48: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:3/64 scope link
       valid_lft forever preferred_lft forever

It shows that 172.17.0.3 is the IP address of the Grafana container.

Configuration of the container is done by logging in to the Grafana web UI. Start a browser on the jump server and browse to http://<jumphost_IP>:1948. Since the jump host server IP address is 10.23.3.74, browse to http://10.23.3.74:1948. To log in, enter admin as the user and admin as the password. This brings you to the Grafana home dashboard.

grafana login screen
Figure 4. Enter admin as the user and admin as the password in the Grafana* web UI.

grafana dashboard UI
Figure 5. The Grafana* Home Dashboard.

Click Create your first data source and fill in the Add data source page with the following, under the Config tab:

Name: yardstick
Type: scroll to InfluxDB

In Http settings:
URL: 172.17.0.2:8086 (the IP address of the Influx container, found above)

In Http Auth:
Check Basic Auth, then fill in the user and password used to log in to Grafana:

In Basic Auth Details:
User: admin
Password: admin

In InfluxDB Details: (refer to the Yardstick database; see /etc/yardstick/yardstick.conf)

Database: yardstick
User: root
Password: root

Then click the Add button.

grafana add data source screen
Figure 6. Fill-in on the Add data source page.

Note: After creating the Influx and Grafana containers, Docker automatically creates a bridge network that connects the containers.

~$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:2a:85:81:8c
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:2aff:fe85:818c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:179308 errors:0 dropped:0 overruns:0 frame:0
          TX packets:179321 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:13156392 (13.1 MB)  TX bytes:48160608 (48.1 MB)

enp3s0f0  Link encap:Ethernet  HWaddr a4:bf:01:00:92:73
          inet addr:10.23.3.155  Bcast:10.23.3.255  Mask:255.255.255.0
          inet6 addr: fe80::a6bf:1ff:fe00:9273/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:246959 errors:0 dropped:0 overruns:0 frame:0
          TX packets:206311 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:53823526 (53.8 MB)  TX bytes:32738472 (32.7 MB)

enp3s0f1  Link encap:Ethernet  HWaddr a4:bf:01:00:92:74
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8776 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2920 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1246733 (1.2 MB)  TX bytes:548262 (548.2 KB)

ens802    Link encap:Ethernet  HWaddr 68:05:ca:2e:76:e0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:510506 errors:0 dropped:0 overruns:0 frame:0
          TX packets:510506 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:210085327 (210.0 MB)  TX bytes:210085327 (210.0 MB)

veth2764a3d Link encap:Ethernet  HWaddr 7e:a4:8f:77:8c:c3
          inet6 addr: fe80::7ca4:8fff:fe77:8cc3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2076 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3745 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:193053 (193.0 KB)  TX bytes:8041958 (8.0 MB)

veth2fbd63e Link encap:Ethernet  HWaddr 7e:a6:d0:a0:66:d2
          inet6 addr: fe80::7ca6:d0ff:fea0:66d2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:179 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1115 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11972 (11.9 KB)  TX bytes:262036 (262.0 KB)

veth92be3da Link encap:Ethernet  HWaddr 72:05:75:a2:00:ca
          inet6 addr: fe80::7005:75ff:fea2:ca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:181816 errors:0 dropped:0 overruns:0 frame:0
          TX packets:181375 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23437842 (23.4 MB)  TX bytes:48355807 (48.3 MB)

virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

root@csp2s22c04:~# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02422a85818c       no              veth2764a3d
                                                        veth2fbd63e
                                                        veth92be3da

Stop Running Yardstick Container on the SuT and TG

Finally, the Yardstick container is not necessary on the SuT and the TG; therefore, after installing Yardstick NSB on the SuT and TG, you may stop the Yardstick container on each of them by issuing the command sudo docker container stop yardstick:

~$ sudo docker container stop yardstick
~$ sudo docker ps
CONTAINER   ID    IMAGE       COMMAND       CREATED     STATUS    PORTS   NAMES

Summary

The first part of this tutorial details how to install Yardstick NSB software on the servers. The Yardstick NSB software also contains the scripts to start the VNFs, the NFVI, and the TG, which are needed on the SuT and on the server that generates packets. Since only the jump server needs to run the NSB tests later, you can disable the Yardstick containers on the TG and the SuT. The second part of this tutorial will show how to run NSB tests.

References

Using TensorFlow* for Deep Learning Training and Testing


Introduction

In this tutorial, you learn how to train and test a single-node Intel® Xeon® Scalable processor system using the TensorFlow* framework with the CIFAR-10 image-recognition dataset. Use these step-by-step instructions as-is, or as the foundation for enhancements and/or modifications.

Prerequisites

Hardware: Steps have been verified on Intel® Xeon® Scalable processors, but should work on any recent Intel® Xeon® processor-based system. None of the software pieces used in this document were performance optimized.
Software: Basic Linux*; familiarity with the concepts of deep learning training.

Install TensorFlow using binary packages or from GitHub* sources. This document describes one way to successfully deploy and test TensorFlow on a single Intel Xeon Scalable processor system running CentOS* 7.3. Other installation methods can be found in 2,18. This document is not meant to give an elaborate description of how to reach state-of-the-art performance; rather, it introduces TensorFlow and runs a simple train and test using the CIFAR-10 dataset on a single-node Intel Xeon Scalable processor system.

Hardware and Software Bill of Materials

The hardware and software bill of materials used for the verified implementation recommended here is detailed in the table below. Intel® Parallel Studio XE Cluster Edition is an optional installation for the single-node implementation, providing most of the basic tools and libraries in one package. Starting with Intel Parallel Studio XE Cluster Edition flattens the learning curve for a multi-node implementation of the same training and testing, as this software is significantly instrumental in a multi-node deep learning implementation.

Item | Manufacturer | Model/Version
Hardware
Intel® Server Chassis | Intel | R1208WT
Intel® Server Board | Intel | S2600WT
2 - Intel® Xeon® Scalable processor | Intel | Intel® Xeon® Gold 6148 processor
6 - 32 GB LRDIMM DDR4 | Crucial* | CT32G4LFD4266
1 - Intel® SSD 1.2 TB | Intel | S3520
Software
CentOS* Linux* Installation DVD | CentOS | 7.3.1611
Intel® Parallel Studio XE Cluster Edition | Intel | 2017.4
TensorFlow* | | setuptools-36.7.2-py2.py3-none-any.whl

Install the Linux* Operating System

This section requires CentOS-7-x86_64-*1611.iso. This software component can be downloaded from the CentOS website.

DVD ISO was used to implement and verify the steps in this document; you can also use Everything ISO and Minimal ISO.

Step 1. Install Linux

1. Insert the CentOS 7.3 1611 install disc/USB. Boot from the drive and select Install CentOS 7.

2. Select Date and Time.

3. If necessary, select Installation Destination.

a. Select the automatic partitioning option.

b. Click Done to return home. Accept all defaults for the partitioning wizard, if prompted.

4. Select Network and host name.

a. Enter "<hostname>" as the hostname.

i. Click Apply for the hostname to take effect.

b. Select Ethernet enp3s0f3 and click Configure to set up the external interface.

i. From the General section, check Automatically connect to this network when it’s available.

ii. Configure the external interface as necessary. Save and Exit.

c. Select the toggle to ON for the interface.

5. Select Software Selection. In the box labeled Base Environment on the left side, select Infrastructure server.

a. Click Done to return home.

b. Wait until the Begin Installation button is available, which may take several minutes. Then click it to continue.

6. While waiting for the installation to finish, set the root password.

7. Click Reboot when the installation is complete.

8. Boot from the primary device and log in as root.

Step 2. Configure YUM*

If the public network implements a proxy server for Internet access, Yellowdog Updater Modified* (YUM*) must be configured in order to use it.

  1. Open the /etc/yum.conf file for editing.
  2. Under the [main] section, append the following line:
    proxy=http://<address>:<port>
    where <address> is the address of the proxy server and <port> is the HTTP port.
  3. Save the file and Exit.

Disable updates and extras. Certain procedures in this document require packages to be built against the kernel. A future kernel update may break the compatibility of these built packages with the new kernel, so we recommend disabling repository updates and extras to provide further longevity to this document.

This document may not work as is when CentOS updates to the next version. To use it after such an update, redefine the repository paths to point to CentOS 7.3 in the CentOS vault. To disable repository updates and extras, run: yum-config-manager --disable updates --disable extras.

Step 3. Install EPEL

Extra Packages for Enterprise Linux (EPEL) provides 100 percent high-quality add-on software packages for Linux distributions. To install the latest EPEL release package (required for the packages below):

yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Step 4. Install GNU* C Compiler

Check whether the GNU Compiler Collection* is installed. It should be part of the development tools install. Verify the installation by typing:

gcc --version or whereis gcc

Step 5. Install TensorFlow*

Using virtualenv18, follow these steps to install TensorFlow:

1. Update to the latest distribution of EPEL:

yum -y install epel-release

2. To install TensorFlow, the following dependencies must be installed10:

  1. NumPy*: a numerical processing package that TensorFlow requires
  2. Devel*: enables adding extensions to Python*
  3. Pip*: enables installing and managing certain Python packages
  4. Wheel*: enables managing Python compressed packages in the wheel format (.whl)
  5. Atlas*: Automatically Tuned Linear Algebra Software
  6. Libffi*: a library providing a Foreign Function Interface (FFI), which allows code written in one language to call code written in another; it provides a portable, high-level programming interface to various calling conventions11

3. Install dependencies:

sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel python-numpy

4. Install virtualenv
There are various ways to install TensorFlow18. This document uses virtualenv, a tool to create isolated Python environments16.


pip install --upgrade virtualenv

5. Create a virtualenv in your target directory:


virtualenv --system-site-packages <targetDirectory>

Example: virtualenv --system-site-packages tensorflow

6. Activate your virtualenv18:


source <targetDirectory>/bin/activate

Example: source ~/tensorflow/bin/activate

7. Upgrade your packages, if needed:


pip install --upgrade numpy scipy wheel cryptography

8. Install the latest TensorFlow wheel (.whl) package:


pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl

OR:


pip install --upgrade tensorflow


Step 6. Train a Convolutional Neural Network (CNN)

1. Download the CIFAR-103 training dataset into the /tmp/ directory.
Download the CIFAR-10 Python version from4,8: https://www.cs.toronto.edu/~kriz/cifar.html

2. Unzip the tar file in the /tmp/ area, as the Python script (cifar10_train.py) looks for data in this directory:


tar -zxf <dir>/cifar-10-python.tar.gz

3. Change directory to TensorFlow:


cd tensorflow

4. Make a new directory:


mkdir git_tensorflow

5. Change directory to the one created in last step:


cd git_tensorflow

6. Clone the TensorFlow repository from GitHub9:

git clone https://github.com/tensorflow/tensorflow.git

7. If the Models folder is missing from the tensorflow/tensorflow directory, clone the models repository from GitHub9 (https://github.com/tensorflow/models.git):


cd tensorflow/tensorflow

git clone https://github.com/tensorflow/models.git

8. Upgrade TensorFlow to the latest version; otherwise, errors could occur when training the model:


pip install --upgrade tensorflow

9. Change to the CIFAR-10 directory to get the training and evaluation Python scripts14:


cd models/tutorials/image/cifar10

10. Before running the training code, review the cifar10_train.py code and change the number of steps from 100K to 60K if needed, as well as the logging frequency from 10 to whatever you prefer.

For this document, tests were done for both 100K steps and 60K steps, for a batch size of 128, and logging frequency of 10.

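For orientation, the values mentioned above live in flag definitions near the top of cifar10_train.py. The following is a rough, illustrative sketch of what those definitions look like in that era of the script, not an authoritative copy; check your own clone for the exact names and defaults:

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs and checkpoints.""")
tf.app.flags.DEFINE_integer('max_steps', 1000000,  # the 1000K typo noted below; change to 100000 or 60000
                            """Number of batches to run.""")
tf.app.flags.DEFINE_integer('log_frequency', 10,
                            """How often to log results to the console.""")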

11. Run the training Python script to train your network:


python cifar10_train.py

This will take a few minutes, and you will see output similar to the image below:

Python code sample

Testing script and dataset terminology

In neural network terminology:

  • One epoch = one forward pass and one backward pass of all the training examples.
  • Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space required. TensorFlow pushes it all through one forward pass (in parallel) and follows with a back-propagation on the same set. This is one iteration, or step.
  • Number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass equals one forward pass plus one backward pass (do not count the forward pass and backward pass as two different passes).
  • The steps parameter tells TensorFlow to run X of these iterations to train the model.

Example: given 1,000 training examples and a batch size of 500, it will take two iterations to complete one epoch.

To learn more about the difference between epoch versus batch size versus iterations, read the article15.

In the cifar10_train.py script:

  • Batch size is set to 128. It represents the number of images to process in a batch.
  • Max step is set to 100,000. It is the number of iterations for all epochs.

    NOTE: The GitHub code has a typo; instead of 100K, the number shows 1000K. Please update before running.

  • The CIFAR-10 binary dataset4 has 60,000 images: 50,000 images to train and 10,000 images to test. The batch size is 128, so the number of batches needed to train one epoch is 50,000/128 ~ 391.
  • The cifar10_train.py script uses 256 epochs, so the number of iterations across all epochs is ~391 x 256 ~ 100K iterations, or steps. (A quick arithmetic check follows below.)
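
As a quick sanity check, the arithmetic behind these step counts can be reproduced in a few lines of Python, using the dataset figures and script defaults given above:

# Back-of-the-envelope check of batches per epoch and total training steps
train_images = 50000   # CIFAR-10 training images
batch_size = 128       # cifar10_train.py default batch size
epochs = 256

batches_per_epoch = -(-train_images // batch_size)  # ceiling division: 391
total_steps = batches_per_epoch * epochs            # 100,096, i.e., ~100K
print(batches_per_epoch, total_steps)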

Step 7. Evaluate the model

Use the cifar10_eval.py script8 to evaluate how well the trained model performs on a hold-out data set:

python cifar10_eval.py

Once the model reaches the expected accuracy, you should see precision @ 1 = 0.862 on your screen when running the above command. The evaluation script can be run while the training script is still working toward its final steps, or after the training script has finished.


Sample results

The cifar10_train.py script shows the following results:

Results of the test

A similar-looking result, shown below, was achieved with the system described in the Hardware and Software Bill of Materials section of this document. Note that these numbers are for educational purposes only, and no specific CPU optimizations were performed.

System                              Step Time (sec/batch)    Accuracy
2 - Intel® Xeon® Gold processors    ~0.105                   85.8% at 60K steps (~2 hours)
2 - Intel® Xeon® Gold processors    ~0.109                   86.2% at 100K steps (~3 hours)

When you finish training and testing your CIFAR-10 dataset, the same models directory also contains MNIST* and AlexNet* benchmarks. For additional learning, go into the MNIST and AlexNet directories and try running the Python scripts to see the results.

References

1. Thoolihan, n.d. "Install TensorFlow on CentOS7," Accessed 6/25/18.

2. n.d., "Installing TensorFlow on Ubuntu*", Accessed 6/25/18.

3. n.d., "Install TensorFlow on CentOS7", Accessed 6/25/18.

4. The CIFAR-10 dataset

5. TensorFlow, MNIST and your own handwritten digits

6. TensorFlow Tutorial

7. Tutorial on CNN on TensorFlow

8. CIFAR-10 Details

9. TensorFlow Models

10. Installing TensorFlow from Sources

11. Libffi

12. Performance Guide for TensorFlow

13. What is batch size in neural network?

14. Learning Multiple Layers of Features from Tiny Images (PDF), Alex Krizhevsky, 2009

15. Epoch vs Batch Size vs Iterations

16. Virtualenv

17. CPU Optimizations

18. Download and Setup

Game Dev with Unity* ML-Agents and Intel® Optimized Python* (Part Two)



Abstract

In the final part of this two-part series on machine learning with Unity* ML-Agents, we will dig deeper into the architecture and create an ML-Agent from scratch. Before training, we will inspect the files that require parameters for machine learning to proceed. Finally, we will train the agent using Intel® optimized Python* and show how the completed system works.

Architecture of Unity* ML-Agents

Figure 1 shows the architecture of Unity ML-Agents:

Figure 1. Unity* ML-Agents architecture.

At first glance, it might seem that the external communicator and Intel-optimized Python can only be used by the external brain, but this is not the case. The external brain can be accessed by other training modes, too.

Every scene will have two entities:

  1. An “Academy,” using an “Academy Script” that will be added later.
  2. “Brains,” the logic inside Unity ML-Agents and where the main connection lies. Agents share the same brain; each agent has an agent script on it that links back to the brain. The brain itself has a brain script on it, and it may or may not have a decision script.

Changes in V3 with Respect to V2

Unity ML-Agents have seen several changes, many based on community feedback. Some of the changes are described below:

  • The ML-Agents reward system changed to “AddReward()” or “SetReward().”
  • When an agent has completed its work in its entirety or performed its function, we now use the “Done()” method.
  • The concept of state has been changed to observations, so “CollectStates()” has been replaced by “CollectObservations().”
  • When we collect observations, we call “AddVectorObs()” with floats, integers, lists or arrays of floats, vectors, and quaternions. (Quaternions represent the orientation of every object in Unity.) The names of the inputs in the Internal Brain have been changed accordingly.
  • We must replace State with “Vector_Observation” and observation with “Visual_Observation.”

The table below summarizes the key changes in V3:

Old (V2)        New (V3)
State           Vector Observation
Observation     Visual Observation
(New)           Text Observation
Action          Vector Action
(New)           Text Action

Table 1. Changes in Unity* ML-Agents from v2 to v3.

Let’s Start with an Example

Use the following steps to start creating your own example of machine learning using Unity ML-Agents and Intel-optimized Python:

  1. Open up the Unity ML cloned project. Everything we do will be kept inside the Examples folder.


    The cloned project is opened in Unity.

  2. Create a new subfolder named “MyBall” within the Examples folder. We will keep all of our resources within this folder.


    The Examples folder is where we are keeping all the content and the resources.

  3. Create a new scene using the suggested name “MyBall(scene).”


    Next, we will create a new scene.

To start setting up machine learning inside the scene, we will have to create 3D objects, using the following steps:

  1. Create a 3D object cube.
  2. Add “rigid body” and make it “kinematic.”
  3. Change the color of the cube. For adding colors to our object, we need to create a new material and name it “Blue.” We will change the color content to blue. (We can also change the color of the background.)
  4. Create a 3D object sphere and add a rigid body to it.
    We will now organize the scene and add an event system from the UI.
  5. Right-click on “Hierarchy” then select “Event System.”

To follow the procedure for Unity ML-Agents, we need to separately create an Academy object and a brain object, and then associate the scripts properly. We will create an Academy object, then have a child object created from Academy named “Brain.” Within the brain, we will add the brain script; but when we do, we will notice an error in the inspector window, which we can quickly resolve.

Adding Functionality to the Academy and the Brain Object

Adding a C# script to the Academy and Brain objects removes the error condition. The script follows a basic flow with some override methods. As we have created the ball Academy object, we can now create a C# script named “MyBallAcademy” and attach the script to the Academy in the hierarchy.

Before editing, the script looks like this:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class MyBallAcademy : MonoBehaviour {

	// Use this for initialization
	void Start () {
		
	}
	
	// Update is called once per frame
	void Update () {
		
	}
}

We will not inherit from MonoBehaviour, as we are not deriving any characteristics from it. After we change the script, everything will be derived from Academy, and we don’t need “void Start()” and “void Update().”

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class MyBallAcademy : Academy {

	// Use this for initialization
	public override void AcademyReset()
	{

	}

	public override void AcademyStep()
	{
		
	}
}

We have inherited from Academy and declared two empty override methods, “AcademyReset()” and “AcademyStep().” We cannot change these method signatures, as this is the required structure for any script derived from Academy. With both of these methods in place, we have a generalized script that can be used within the scene.

With the changes made to the script, we have a basic, bare-bones structure for linking Academy and the brain.

Basic Setup for the Scene

In this scene we will create a cube, which we will refer to as the “platform.” On that platform we will place a sphere, which will act like a ball. By tilting the platform, we adjust the ball to keep it from falling off. If the ball falls off, the scene will reset, and we will restart the balancing act.

We now have our platform and the ball, but to demonstrate machine learning, we need to configure a brain to control the action. Once the system is under the control of the brain, it will drive the platform and then fire off an agent script. Our next job is to write the agent script.

Programming and Scene Setup Logic

We will now create an agent script and name it MyBallAgent; it will inherit from Agent. Once we add the MyBallAgent script to the system, we will immediately see which inherited values we need to fill in. We will drag and drop the brain onto the required inherited values.

First, we will drag and drop the MyBallAgent script created to the cube as shown below.

MyBallAgent script

Then we drag and drop the child we created for Academy as brain to the Brain option, which showed none (shown below).

Brain option

In the Agent code itself, we will write all the controlling parameters we intend to use. We will declare a GameObject “ball,” which we will assign from the Inspector.

public GameObject ball;

Now the flow of the agent is controlled by the Unity ML-Agents plugin. (We will not need Unity’s default update method.)

Overriding common methods.

We need to override these common methods because the type of environment we created might require its own behavior and more training. For that, we change the parameter values and override the common implementations with our own.

First, we have to find out where the transformations and other declarations for the game object will go. In version 0.3, game object changes have been shifted to “AddVectorObs” calls, which are now known as “vector observations.”

For the object transformation, positions, and rigid body, we declare eight AddVectorObs calls (vector observations).

The method that collects them is called CollectObservations().

  AddVectorObs(gameObject.transform.rotation.z);
  AddVectorObs(gameObject.transform.rotation.x);
  AddVectorObs((ball.transform.position.x - gameObject.transform.position.x));
  AddVectorObs((ball.transform.position.y - gameObject.transform.position.y));
  AddVectorObs((ball.transform.position.z - gameObject.transform.position.z));
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.x);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.y);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.z); 

The complete method is shown below.

public override void CollectObservations()
{
  AddVectorObs(gameObject.transform.rotation.z);
  AddVectorObs(gameObject.transform.rotation.x);
  AddVectorObs((ball.transform.position.x - gameObject.transform.position.x));
  AddVectorObs((ball.transform.position.y - gameObject.transform.position.y));
  AddVectorObs((ball.transform.position.z - gameObject.transform.position.z));
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.x);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.y);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.z);
  SetTextObs("Testing " + gameObject.GetInstanceID());

 } 

Here is what the above code does:

  1. We get the x and z rotation; the game object will rotate in two directions.
    AddVectorObs(gameObject.transform.rotation.x);
    AddVectorObs(gameObject.transform.rotation.z);
    
  2. We get the difference between the ball’s x position and the game object’s x position.
  3. Together with the y and z differences, this tells us where the ball is relative to the platform.
  4. We get the ball’s velocity in the x, y, and z directions.
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.x);
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.y);
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.z); 
    

When the Game Resets, What Method Will We Override?

The override method that we will use when the game resets is AgentReset(), which fires when the ball needs to be dropped back onto the platform. Here are some of the key instructions:

  1. Reset everything back to zero:

    gameObject.transform.rotation = new Quaternion(0f, 0f, 0f, 0f);

  2. Change the velocity of the ball back to 0:

    ball.GetComponent<Rigidbody>().velocity = new Vector3(0f, 0f, 0f);

  3. Set the position of the ball back to StartPos:

    ball.transform.position = ballStartPos;

  4. Create a “Vector3” to store the ball’s start position:

    Vector3 ballStartPos;

  5. Configure the starting position by working inside “void Start()” and declaring the following:

    ballStartPos = ball.transform.position;

We have now defined the starting environment, both for when the ball is held for the very first time and for when the system resets.

Controlling the Platform

Once we shift to the “Player” option, we must enable certain keys on the keyboard to control movement; this gives us a way to physically control the platform. This is where all the actions get converted: each keyboard movement we make should produce the corresponding platform motion in the scene, and thus move the ball. As we map the keyboard keys, we need to check that the scene responds the way it is supposed to. The entire updated code for MyBallAgent is shown below:

using System.Collections;
using System.Collections.Generic;
using UnityEngine; 

public class MyBallAgent : Agent {

public GameObject ball;
Vector3 ballStartPos;


void Start()
{
    ballStartPos = ball.transform.position;

}

public override void AgentAction(float[] vectorAction, string textAction)
    {

    
        if (brain.brainParameters.vectorActionSpaceType == SpaceType.continuous)
        {
            float action_z = 2f * Mathf.Clamp(vectorAction[0], -1f, 1f);
            if ((gameObject.transform.rotation.z < 0.25f && action_z > 0f) ||
                (gameObject.transform.rotation.z > -0.25f && action_z < 0f))
            {
                gameObject.transform.Rotate(new Vector3(0, 0, 1), action_z);
            }
            float action_x = 2f * Mathf.Clamp(vectorAction[1], -1f, 1f);
            if ((gameObject.transform.rotation.x < 0.25f && action_x > 0f) ||
                (gameObject.transform.rotation.x > -0.25f && action_x < 0f))
            {
                gameObject.transform.Rotate(new Vector3(1, 0, 0), action_x);
            }

            SetReward(0.1f);

        }
        if ((ball.transform.position.y - gameObject.transform.position.y) < -2f ||
            Mathf.Abs(ball.transform.position.x - gameObject.transform.position.x) > 3f ||
            Mathf.Abs(ball.transform.position.z - gameObject.transform.position.z) > 3f)
        {
            Done();
            SetReward(-1f);
        }
    }

public override void CollectObservations()
{
  AddVectorObs(gameObject.transform.rotation.z);
  AddVectorObs(gameObject.transform.rotation.x);
  AddVectorObs((ball.transform.position.x - gameObject.transform.position.x));
  AddVectorObs((ball.transform.position.y - gameObject.transform.position.y));
  AddVectorObs((ball.transform.position.z - gameObject.transform.position.z));
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.x);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.y);
  AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.z);
  SetTextObs("Testing" + gameObject.GetInstanceID());

 }


 public override void AgentReset()
 {
 gameObject.transform.rotation = new Quaternion(0f, 0f, 0f, 0f);
 ball.GetComponent<Rigidbody>().velocity = new Vector3(0f, 0f, 0f);
 ball.transform.position = ballStartPos;
 }



}

Simulation Using Keyboard Inputs

For a simulation using keyboard inputs with the brain type set to “Player,” we need to configure the brain script. Because there are eight AddVectorObs calls, the Vector Observation space size is eight, and the space type is “continuous.” Make the changes in the Inspector window, shown below:

Figure 2. Configuring the brain script in the Inspector window.

Now we can add continuous player actions to map keyboard inputs. There are four keys to map, so there are four continuous player elements: up arrow, down arrow, right arrow, and left arrow. The parameter values are the following:

Element 0
Key -> Up Arrow
Index->1
Value->1

Element 1
Key->Down Arrow
Index->1
Value->-1

Element 2
Key->Right Arrow
Index->0
Value->-1

Element 3
Key->Left Arrow
Index->0
Value->1

The keyboard mapping is shown in the figure below. (The Index value selects the element of the vectorAction array that AgentAction() receives: index 0 feeds the z-rotation branch and index 1 the x-rotation branch in the agent code above.)

Figure 3. Keyboard mapping for elements 0-3.

Now we can click “Play” to test the scene under player settings and try to keep the ball on the platform using the up, down, left, and right arrow keys.

For training the model using Intel-optimized TensorFlow*, we need to set the brain type to “External” for the build.

Figure 4. Play starts with the ball at the center of the platform.

As we have done before, we need to create the build for the project.

Figure 5. Selecting the scenes and creating the project.

We have added the scene; now we will create the build and name it.

Figure 6. Naming and saving the scene.

Now that the executable has been created, we must train it using our Intel-optimized Python module. However, before training can start, there are some things to know about the “learn.py” and “trainer_config.yaml” files. The “learn.py” file contains certain details for running the training; the key parameters are declared in the config file. The main work of “learn.py” is to initialize general parameters such as run_id and fast_simulation, and to trigger the “trainer_config.yaml” file. We don’t have to make changes to “learn.py”; it has the format shown below:

# # Unity ML Agents
# ## ML-Agent Learning

import logging

import os
from docopt import docopt

from unitytrainers.trainer_controller import TrainerController

if __name__ == '__main__':
    logger = logging.getLogger("unityagents")
    _USAGE = '''
    Usage:
      learn (<env>) [options]
      learn --help

    Options:
      --curriculum=<file>        Curriculum json file for environment [default: None].
      --keep-checkpoints=<n>     How many model checkpoints to keep [default: 5].
      --lesson=<n>               Start learning from this lesson [default: 0].
      --load                     Whether to load the model or randomly initialize [default: False].
      --run-id=<path>            The sub-directory name for model and summary statistics [default: ppo]. 
      --save-freq=<n>            Frequency at which to save model [default: 50000].
      --seed=<n>                 Random seed used for training [default: -1].
      --slow                     Whether to run the game at training speed [default: False].
      --train                    Whether to train model, or only run inference [default: False].
      --worker-id=<n>            Number to add to communication port (5005). Used for multi-environment [default: 0].
      --docker-target-name=<dt>       Docker Volume to store curriculum, executable and model files [default: Empty].
    '''

    options = docopt(_USAGE)
    logger.info(options)
    # Docker Parameters
    if options['--docker-target-name'] == 'Empty':
        docker_target_name = ''
    else:
        docker_target_name = options['--docker-target-name']

    # General parameters
    run_id = options['--run-id']
    seed = int(options['--seed'])
    load_model = options['--load']
    train_model = options['--train']
    save_freq = int(options['--save-freq'])
    env_path = options['<env>']
    keep_checkpoints = int(options['--keep-checkpoints'])
    worker_id = int(options['--worker-id'])
    curriculum_file = str(options['--curriculum'])
    if curriculum_file == "None":
        curriculum_file = None
    lesson = int(options['--lesson'])
    fast_simulation = not bool(options['--slow'])

    # Constants
    # Assumption that this yaml is present in same dir as this file
    base_path = os.path.dirname(__file__)
    TRAINER_CONFIG_PATH = os.path.abspath(os.path.join(base_path, "trainer_config.yaml"))

    tc = TrainerController(env_path, run_id, save_freq, curriculum_file, fast_simulation, load_model, train_model,
                           worker_id, keep_checkpoints, lesson, seed, docker_target_name, TRAINER_CONFIG_PATH)
    tc.start_learning()

The “trainer_config.yaml” file contains the more important information, and some default parameters are already declared. The most important is max_steps: 5.0e4, which is how many iterations we loop through to train the entire thing. For this scene it is 50,000, written as 5.0e4, that is, 5 × 10^4; this is the default value. We can raise it so that the model trains longer. The number of full training passes is known as “epochs”; generally, one epoch is one full training cycle on the set or, in this case, the scene.

The α value, or learning rate, is 3.0e-4.

We can also override some of these values. If we need longer training times, we can increase max_steps so that the scene is trained more, which helps us get better machine-learning results. Within the file there are examples where the default values have been overridden for specific brains, as in the snippet below; a hypothetical override for our own brain follows the snippet.

A small snippet of the “trainer_config.yaml” file is shown below:

default:
    trainer: ppo
    batch_size: 1024
    beta: 5.0e-3
    buffer_size: 10240
    epsilon: 0.2
    gamma: 0.99
    hidden_units: 128
    lambd: 0.95
    learning_rate: 3.0e-4
    max_steps: 5.0e4
    memory_size: 256
    normalize: false
    num_epoch: 3
    num_layers: 2
    time_horizon: 64
    sequence_length: 64
    summary_freq: 1000
    use_recurrent: false

BananaBrain:
    normalize: false
    batch_size: 1024
    beta: 5.0e-3
    buffer_size: 10240

PushBlockBrain:
    max_steps: 5.0e4
    batch_size: 128
    buffer_size: 2048
    beta: 1.0e-2
    hidden_units: 256
    summary_freq: 2000
    time_horizon: 64
    num_layers: 2
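
Following the same pattern, we could append an override block for the brain in our own scene. The snippet below is a hypothetical sketch rather than a tuned configuration: the key must match the name of our brain object (“Brain” here), the values are examples only, and any parameter not listed falls back to the defaults:

Brain:
    max_steps: 1.0e5
    summary_freq: 2000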

Now we can start the training process. The following is the command we will use:

python learn.py mball2.exe --run-id=mball2 --train

As the process runs, the following details are populated:

(idp) C:\Users\abhic\Desktop\ml-agents\python>python learn.py mball2.exe --run-id=mball2 --train
INFO:unityagents:{'--curriculum': 'None',
 '--docker-target-name': 'Empty',
 '--help': False,
 '--keep-checkpoints': '5',
 '--lesson': '0',
 '--load': False,
 '--run-id': 'mball2',
 '--save-freq': '50000',
 '--seed': '-1',
 '--slow': False,
 '--train': True,
 '--worker-id': '0',
 '<env>': 'mball2.exe'}
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :

Unity brain name: Brain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: ,
2018-06-04 05:28:49.992671: I k:\tf_jenkins_freddy\ cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
C:\<path>\conda\envs\idp\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:unityagents:Hyperparameters for the PPO Trainer of brain Brain:
        batch_size:     1024
        beta:   0.005
        buffer_size:    10240
        epsilon:        0.2
        gamma:  0.99
        hidden_units:   128
        lambd:  0.95
        learning_rate:  0.0003
        max_steps:      5.0e4
        normalize:      False
        num_epoch:      3
        num_layers:     2
        time_horizon:   64
        sequence_length:        64
        summary_freq:   1000
        use_recurrent:  False
        graph_scope:
        summary_path:   ./summaries/mball2
        memory_size:    256
INFO:unityagents: Brain: Step: 1000. Mean Reward: 6.975. Std of Reward: 1.993.
INFO:unityagents: Brain: Step: 2000. Mean Reward: 9.367. Std of Reward: 3.598.
INFO:unityagents: Brain: Step: 3000. Mean Reward: 7.258. Std of Reward: 2.252.
INFO:unityagents: Brain: Step: 4000. Mean Reward: 7.333. Std of Reward: 3.324.
INFO:unityagents: Brain: Step: 5000. Mean Reward: 10.700. Std of Reward: 4.618.
INFO:unityagents: Brain: Step: 6000. Mean Reward: 7.183. Std of Reward: 1.750.
INFO:unityagents: Brain: Step: 7000. Mean Reward: 7.038. Std of Reward: 2.464.
INFO:unityagents: Brain: Step: 8000. Mean Reward: 6.400. Std of Reward: 1.561.
INFO:unityagents: Brain: Step: 9000. Mean Reward: 7.664. Std of Reward: 3.189.
INFO:unityagents: Brain: Step: 10000. Mean Reward: 7.333. Std of Reward: 2.236.
INFO:unityagents: Brain: Step: 11000. Mean Reward: 9.622. Std of Reward: 4.135.
INFO:unityagents: Brain: Step: 12000. Mean Reward: 10.938. Std of Reward: 1.323.
INFO:unityagents: Brain: Step: 13000. Mean Reward: 10.578. Std of Reward: 2.623.
INFO:unityagents: Brain: Step: 14000. Mean Reward: 11.986. Std of Reward: 2.559.
INFO:unityagents: Brain: Step: 15000. Mean Reward: 10.411. Std of Reward: 2.383.
INFO:unityagents: Brain: Step: 16000. Mean Reward: 10.925. Std of Reward: 2.178.
INFO:unityagents: Brain: Step: 17000. Mean Reward: 10.633. Std of Reward: 1.173.
INFO:unityagents: Brain: Step: 18000. Mean Reward: 11.957. Std of Reward: 3.645.
INFO:unityagents: Brain: Step: 19000. Mean Reward: 10.511. Std of Reward: 2.343.
INFO:unityagents: Brain: Step: 20000. Mean Reward: 10.975. Std of Reward: 2.469.
INFO:unityagents: Brain: Step: 21000. Mean Reward: 12.025. Std of Reward: 6.786.
INFO:unityagents: Brain: Step: 22000. Mean Reward: 10.538. Std of Reward: 1.935.
INFO:unityagents: Brain: Step: 23000. Mean Reward: 10.311. Std of Reward: 1.044.
INFO:unityagents: Brain: Step: 24000. Mean Reward: 9.844. Std of Reward: 1.023.
INFO:unityagents: Brain: Step: 25000. Mean Reward: 10.167. Std of Reward: 0.886.
INFO:unityagents: Brain: Step: 26000. Mean Reward: 10.388. Std of Reward: 1.628.
INFO:unityagents: Brain: Step: 27000. Mean Reward: 10.000. Std of Reward: 1.332.
INFO:unityagents: Brain: Step: 28000. Mean Reward: 10.322. Std of Reward: 1.240.
INFO:unityagents: Brain: Step: 29000. Mean Reward: 9.644. Std of Reward: 0.837.
INFO:unityagents: Brain: Step: 30000. Mean Reward: 10.244. Std of Reward: 1.606.
INFO:unityagents: Brain: Step: 31000. Mean Reward: 9.922. Std of Reward: 1.576.
INFO:unityagents: Brain: Step: 32000. Mean Reward: 10.200. Std of Reward: 1.060.
INFO:unityagents: Brain: Step: 33000. Mean Reward: 10.413. Std of Reward: 0.877.
INFO:unityagents: Brain: Step: 34000. Mean Reward: 10.233. Std of Reward: 1.104.
INFO:unityagents: Brain: Step: 35000. Mean Reward: 10.411. Std of Reward: 0.825.
INFO:unityagents: Brain: Step: 36000. Mean Reward: 9.875. Std of Reward: 1.221.
INFO:unityagents: Brain: Step: 37000. Mean Reward: 10.067. Std of Reward: 0.550.
INFO:unityagents: Brain: Step: 38000. Mean Reward: 9.660. Std of Reward: 0.759.
INFO:unityagents: Brain: Step: 39000. Mean Reward: 11.063. Std of Reward: 1.467.
INFO:unityagents: Brain: Step: 40000. Mean Reward: 9.722. Std of Reward: 0.989.
INFO:unityagents: Brain: Step: 41000. Mean Reward: 9.656. Std of Reward: 0.732.
INFO:unityagents: Brain: Step: 42000. Mean Reward: 9.689. Std of Reward: 0.839.
INFO:unityagents: Brain: Step: 43000. Mean Reward: 9.689. Std of Reward: 1.152.
INFO:unityagents: Brain: Step: 44000. Mean Reward: 9.570. Std of Reward: 0.593.
INFO:unityagents: Brain: Step: 45000. Mean Reward: 9.856. Std of Reward: 0.510.
INFO:unityagents: Brain: Step: 46000. Mean Reward: 10.278. Std of Reward: 1.219.
INFO:unityagents: Brain: Step: 47000. Mean Reward: 9.988. Std of Reward: 0.924.
INFO:unityagents: Brain: Step: 48000. Mean Reward: 10.311. Std of Reward: 0.788.
INFO:unityagents: Brain: Step: 49000. Mean Reward: 10.044. Std of Reward: 1.192.
INFO:unityagents:Saved Model
INFO:unityagents: Brain: Step: 50000. Mean Reward: 9.210. Std of Reward: 0.730.
INFO:unityagents:Saved Model
INFO:unityagents:Saved Model
INFO:unityagents:List of nodes to export :
INFO:unityagents:       action
INFO:unityagents:       value_estimate
INFO:unityagents:       action_probs
INFO:tensorflow:Restoring parameters from ./models/mball2\model-50000.cptk
INFO:tensorflow:Restoring parameters from ./models/mball2\model-50000.cptk
INFO:tensorflow:Froze 12 variables.
INFO:tensorflow:Froze 12 variables.
Converted 12 variables to const ops.

The bytes file is now generated in the /mball directory.

Figure 7. Directory contents after generating the bytes file.

Inside our project folder there is no TFModels directory, so we will have to create one and keep the bytes file there.

Figure 8. Create the TFModels directory to store the bytes file properly.

After creating the bytes file, copy it to the \TFModels folder. Once that step is complete, go back to the Unity project, move to the Inspector window, and change the brain type to “Internal.” It will show an error.

Figure 9. After the bytes file is created, set the brain to “internal.”

We can now drag and drop the bytes file (inside the TFModels folder) onto the Graph Model field to resolve the error. The system is now ready to test how well the model has been trained.

Summary

Intelligent agents, each acting with dynamic and engaging behavior, offer promise for more realism and better user experiences. After completing the tasks described in part one and part two of this series, you can now create a Unity ML-Agent from scratch, configure the key learning and training files, and understand the key parameters to set up in order to get started with machine learning. Based on what you learned in these articles, you should now be able to incorporate more compelling AI behavior in your own games to boost immersion and attract players.

Resources
