
Qlik Increases Big Data Analysis Speed on Intel® Xeon® Platinum 8168 processor


Qlik enables organizations to analyze disparate data sources using a visual interface.

Performance is essential to enable users to explore their data intuitively. Data is cached in memory. As users make new selections, everything based on the selection is recalculated and the visualization is updated. If a page takes longer than a second or two to update, users will lose their train of thought, and their patience.

Qlik worked with Intel to benchmark the new Intel® Xeon® Platinum 8168 processor, comparing it to the previous-generation Intel® Xeon® processor E5-2699 v4 and to the v3 generation before it. The test used an internal Qlik benchmark that performs over 80 selections at the same time, simulating user interactions. The data set comprised 1 billion records of sales data representing different customers in different countries. The calculations processed the full data set while excluding a single week from the presented results. Reports included sales by year, the top ten customers with their sales totals, and gross margins by product category shown as a treemap (a grid of proportionally sized boxes). The scenario stresses both CPU and memory.

View complete Solution Brief (PDF)


API Without Secrets: Introduction to Vulkan* Part 7


Tutorial 7: Uniform Buffers — Using Buffers in Shaders

Go back to previous tutorial: Introduction to Vulkan Part 6 – Descriptor Sets 

It is time to summarize the knowledge presented so far and create a more typical rendering scenario. The example here is still very simple, yet it reflects the most common way to display 3D geometry on screen. We will extend the code from the previous tutorial by adding a transformation matrix to the shader uniform data. This way we can see how to use multiple different descriptors in a single descriptor set.

Of course, the knowledge presented here applies to many other use cases, as descriptor sets may contain multiple resources of the same or different types. Nothing stops us from creating a descriptor set with many storage buffers or sampled images. We can also mix them as shown in this tutorial — here we use both a texture (combined image sampler) and a uniform buffer. We will see how to create a layout for such a descriptor set, how to allocate the set, and how to populate it with the appropriate resources.

In a previous part of the tutorial we learned how to create images and use them as textures inside shaders. This knowledge is also used in this tutorial, but here we focus only on buffers, and learn how to use them as a source of uniform data. We also see how to prepare a projection matrix, how to copy it to a buffer, and how to access it inside shaders.

Creating a Uniform Buffer

In this example we want to use two types of uniform variables inside shaders: combined image sampler (sampler2D inside shader) and a uniform projection matrix (mat4). In Vulkan*, uniform variables other than opaque types like samplers cannot be declared in a global scope (as in OpenGL*); they must be accessed from within uniform buffers. We start by creating a buffer.

Buffers can be used for many different purposes — they can be a source of vertex data (vertex attributes); we can keep vertex indices in them so they are used as index buffers; they can contain shader uniform data; or we can store data in them from within shaders and use them as storage buffers. We can even keep formatted data inside buffers, access it through buffer views, and treat them as texel buffers (similar to OpenGL's buffer textures). For all the above purposes we use ordinary buffers, always created in the same way; it is the usage specified during buffer creation that defines how we can use a given buffer during its lifetime.

We saw how to create a buffer in Introduction to Vulkan Part 4 – Vertex Attributes, so only source code is presented here without diving into specifics:

VkBufferCreateInfo buffer_create_info = {
  VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO, // VkStructureType      sType
  nullptr,                              // const void          *pNext
  0,                                    // VkBufferCreateFlags  flags
  buffer.Size,                          // VkDeviceSize         size
  usage,                                // VkBufferUsageFlags   usage
  VK_SHARING_MODE_EXCLUSIVE,            // VkSharingMode        sharingMode
  0,                                    // uint32_t             queueFamilyIndexCount
  nullptr                               // const uint32_t      *pQueueFamilyIndices
};

if( vkCreateBuffer( GetDevice(), &buffer_create_info, nullptr, &buffer.Handle ) != VK_SUCCESS ) {
  std::cout << "Could not create buffer!" << std::endl;
  return false;
}

if( !AllocateBufferMemory( buffer.Handle, memoryProperty, &buffer.Memory ) ) {
  std::cout << "Could not allocate memory for a buffer!" << std::endl;
  return false;
}

if( vkBindBufferMemory( GetDevice(), buffer.Handle, buffer.Memory, 0 ) != VK_SUCCESS ) {
  std::cout << "Could not bind memory to a buffer!" << std::endl;
  return false;
}

return true;

1. Tutorial07.cpp, function CreateBuffer()

We first create a buffer by defining its parameters in a variable of type VkBufferCreateInfo. Here we define the buffer's most important parameters — its size and usage. Next, we create a buffer by calling the vkCreateBuffer() function. After that, we need to allocate a memory object (or use a part of another, existing memory object) to bind it to the buffer through the vkBindBufferMemory() function call. Only after that can we use the buffer the way we want to in our application. Allocating a dedicated memory object is performed as follows:

VkMemoryRequirements buffer_memory_requirements;
vkGetBufferMemoryRequirements( GetDevice(), buffer, &buffer_memory_requirements );

VkPhysicalDeviceMemoryProperties memory_properties;
vkGetPhysicalDeviceMemoryProperties( GetPhysicalDevice(), &memory_properties );

for( uint32_t i = 0; i < memory_properties.memoryTypeCount; ++i ) {
  if( (buffer_memory_requirements.memoryTypeBits & (1 << i)) &&
    (memory_properties.memoryTypes[i].propertyFlags & property) ) {

    VkMemoryAllocateInfo memory_allocate_info = {
      VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO, // VkStructureType   sType
      nullptr,                                // const void       *pNext
      buffer_memory_requirements.size,        // VkDeviceSize      allocationSize
      i                                       // uint32_t          memoryTypeIndex
    };

    if( vkAllocateMemory( GetDevice(), &memory_allocate_info, nullptr, memory ) == VK_SUCCESS ) {
      return true;
    }
  }
}
return false;

2. Tutorial07.cpp, function AllocateBufferMemory()

To create a buffer that can be used as a source of shader uniform data, we need to create it with the VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT usage. But, depending on how we want to transfer data to it, we may need additional usage flags as well. Here we want a buffer with device-local memory bound to it, because such memory may have better performance. But, depending on the hardware's architecture, it may not be possible to map such memory and copy data to it directly from the CPU. That's why we use a staging buffer through which data will be copied from the CPU to our uniform buffer. For that, our uniform buffer must also be created with the VK_BUFFER_USAGE_TRANSFER_DST_BIT usage, as it will be the target of a data copy operation. Below, we can see how our buffer is finally created:

Vulkan.UniformBuffer.Size = 16 * sizeof(float);
if( !CreateBuffer( VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, Vulkan.UniformBuffer ) ) {
  std::cout << "Could not create uniform buffer!"<< std::endl;
  return false;
}

if( !CopyUniformBufferData() ) {
  return false;
}

return true;

3. Tutorial07.cpp, function CreateUniformBuffer()

Copying Data to Buffers

The next thing is to upload appropriate data to our uniform buffer. In it we will store the 16 elements of a 4 x 4 matrix. We are using an orthographic projection matrix, but we can store any other type of data; we just need to remember that each uniform variable must be placed at an appropriate offset, counted from the beginning of the buffer's memory. Such an offset must be a multiple of a specific value; in other words, it must have a specific alignment. The alignment of each uniform variable depends on the variable's type, and the specification defines it as follows:

  • A scalar variable whose type has N bytes must be aligned to an address that is a multiple of N.
  • A two-element vector whose components each have N bytes must be aligned to 2N.
  • A three- or four-element vector whose components each have N bytes has an alignment of 4N.
  • An array's alignment is the alignment of its elements, rounded up to a multiple of 16.
  • A structure's alignment is the largest alignment of any of its members, rounded up to a multiple of 16.
  • A row-major matrix with C columns has an alignment equal to the alignment of a vector with C elements of the same type as the elements of the matrix.
  • A column-major matrix has an alignment equal to the alignment of the matrix column type.

The above rules are similar to those of the standard GLSL std140 layout, and we can apply them to Vulkan's uniform buffers as well. But we need to remember that placing data at inappropriate offsets will lead to incorrect values being fetched in shaders.
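To make these rules concrete, below is a minimal, hypothetical C++ mirror of a small uniform block; the struct name and members are illustrative and not part of the tutorial's code. The static_asserts verify the offsets we would have to respect when copying data into such a buffer.

#include <cstddef>

// Hypothetical CPU-side mirror of a uniform block such as:
//   layout(std140) uniform Example { float scale; vec3 color; mat4 matrix; };
// float -> 4-byte alignment, vec3 -> 16-byte alignment (4N), mat4 -> 16 bytes per column.
struct ExampleUniformData {
  float scale;        // offset 0
  float padding[3];   // the vec3 below must start at a multiple of 16
  float color[3];     // offset 16
  float padding2;     // pad the vec3 out to 16 bytes
  float matrix[16];   // offset 32, column-major 4 x 4 matrix
};

static_assert( offsetof( ExampleUniformData, color )  == 16, "vec3 must be 16-byte aligned" );
static_assert( offsetof( ExampleUniformData, matrix ) == 32, "mat4 columns are 16-byte aligned" );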

For the sake of simplicity, our example has only one uniform variable, so it can be placed at the very beginning of the buffer. To transfer data to it, we will use a staging buffer — it is created with the VK_BUFFER_USAGE_TRANSFER_SRC_BIT usage and is backed by memory with the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT property, so we can map it. Below, we can see how data is copied to the staging buffer:

const std::array<float, 16> uniform_data = GetUniformBufferData();

void *staging_buffer_memory_pointer;
if( vkMapMemory( GetDevice(), Vulkan.StagingBuffer.Memory, 0, Vulkan.UniformBuffer.Size, 0, &staging_buffer_memory_pointer) != VK_SUCCESS ) {
    std::cout << "Could not map memory and upload data to a staging buffer!" << std::endl;
    return false;
}

memcpy( staging_buffer_memory_pointer, uniform_data.data(), Vulkan.UniformBuffer.Size );

VkMappedMemoryRange flush_range = {
    VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,  // VkStructureType  sType
    nullptr,                                // const void      *pNext
    Vulkan.StagingBuffer.Memory,            // VkDeviceMemory   memory
    0,                                      // VkDeviceSize     offset
    Vulkan.UniformBuffer.Size               // VkDeviceSize     size
};
vkFlushMappedMemoryRanges( GetDevice(), 1, &flush_range );

vkUnmapMemory( GetDevice(), Vulkan.StagingBuffer.Memory );

4. Tutorial07.cpp, function CopyUniformBufferData()

First, we prepare the projection matrix data. It is stored in a std::array, but we could keep it in any other type of variable. Next, we map the memory bound to the staging buffer. We need access to a memory range at least as big as the data we want to copy, so we must also remember to create a staging buffer big enough to hold it. Next, we copy data to the staging buffer using an ordinary memcpy() call. Now we must tell the driver which parts of the buffer's memory were changed; this operation is called flushing. After that, we unmap the staging buffer's memory. Keep in mind that frequent mapping and unmapping may impact the performance of our application. In Vulkan, resources can stay mapped for as long as we like without affecting the application in any way. So, if we want to frequently transfer data using staging resources, we should map them only once and keep the acquired pointer for future use. Here we unmap the memory just to show how it is done.
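The GetUniformBufferData() function itself is not shown here. A minimal sketch of what such a helper might look like is given below; the helper name is illustrative, and it assumes a column-major orthographic projection mapped to Vulkan's clip space (Y pointing down, depth in the 0..1 range). The matrix actually used by the tutorial, which also accounts for the window's dimensions, may differ.

#include <array>

// Hypothetical helper: column-major orthographic projection for Vulkan clip space
static std::array<float, 16> GetOrthographicProjection( float left, float right, float bottom, float top, float near_z, float far_z ) {
  return {
    2.0f / (right - left),            0.0f,                             0.0f,                      0.0f,
    0.0f,                             2.0f / (bottom - top),            0.0f,                      0.0f,
    0.0f,                             0.0f,                             1.0f / (near_z - far_z),   0.0f,
    -(right + left) / (right - left), -(bottom + top) / (bottom - top), near_z / (near_z - far_z), 1.0f
  };
}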

Now we need to transfer data from the staging buffer to our target, the uniform buffer. In order to do that, we need to prepare a command buffer in which we will record appropriate operations, and which we will submit for these operations to occur.

We start by taking any unused command buffer. It must be allocated from a pool created for a queue family that supports transfer operations. The Vulkan specification requires that at least one general-purpose queue family be available — one whose queues support graphics (rendering), compute, and transfer operations. On Intel® hardware there is only one queue family, with one general-purpose queue, so we don't have this problem. Other hardware vendors may expose other queue families, perhaps even a family dedicated to data transfer. In that case, we should choose a queue from such a family.
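On hardware with more than one queue family, a transfer-capable family could be located with a query like the hedged sketch below; FindTransferQueueFamily() is a hypothetical helper, and the tutorial's own framework simply uses its single general-purpose graphics queue.

#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Returns the index of a queue family that supports transfer operations, preferring a
// dedicated transfer-only family; returns UINT32_MAX if none is found. Note that any
// family with graphics or compute capabilities implicitly supports transfer as well.
static uint32_t FindTransferQueueFamily( VkPhysicalDevice physical_device ) {
  uint32_t count = 0;
  vkGetPhysicalDeviceQueueFamilyProperties( physical_device, &count, nullptr );
  std::vector<VkQueueFamilyProperties> families( count );
  vkGetPhysicalDeviceQueueFamilyProperties( physical_device, &count, families.data() );

  uint32_t fallback = UINT32_MAX;
  for( uint32_t i = 0; i < count; ++i ) {
    VkQueueFlags flags = families[i].queueFlags;
    if( flags & (VK_QUEUE_TRANSFER_BIT | VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT) ) {
      if( (flags & VK_QUEUE_TRANSFER_BIT) && !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)) ) {
        return i;                 // dedicated transfer-only family
      }
      if( fallback == UINT32_MAX ) {
        fallback = i;             // a general-purpose family works too
      }
    }
  }
  return fallback;
}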

We start recording a command buffer by calling the vkBeginCommandBuffer() function. Next, we record the vkCmdCopyBuffer() command that performs the data transfer, where we tell it that we want to copy data from the very beginning of the staging buffer (0th offset) to the very beginning of our uniform buffer (also 0th offset). We also provide the size of data to be copied.

Next, we need to tell the driver that after the data transfer is performed, our whole uniform buffer will be used as, well, a uniform buffer. This is done by placing a buffer memory barrier in which we state that, until now, we were transferring data to the buffer (VK_ACCESS_TRANSFER_WRITE_BIT), but from now on we will use it (VK_ACCESS_UNIFORM_READ_BIT) as a source of data for uniform variables. The buffer memory barrier is placed using the vkCmdPipelineBarrier() function call. It occurs after the data transfer operation (VK_PIPELINE_STAGE_TRANSFER_BIT) but before the vertex shader execution, as we access our uniform variable inside the vertex shader (VK_PIPELINE_STAGE_VERTEX_SHADER_BIT).

Finally, we can end the command buffer and submit it to the queue. The whole process is presented in the code below:

// Prepare command buffer to copy data from staging buffer to a uniform buffer
VkCommandBuffer command_buffer = Vulkan.RenderingResources[0].CommandBuffer;

VkCommandBufferBeginInfo command_buffer_begin_info = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO, // VkStructureType              sType
  nullptr,                                     // const void                  *pNext
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, // VkCommandBufferUsageFlags    flags
  nullptr                                      // const VkCommandBufferInheritanceInfo  *pInheritanceInfo
};

vkBeginCommandBuffer( command_buffer, &command_buffer_begin_info);

VkBufferCopy buffer_copy_info = {
  0,                                // VkDeviceSize       srcOffset
  0,                                // VkDeviceSize       dstOffset
  Vulkan.UniformBuffer.Size         // VkDeviceSize       size
};
vkCmdCopyBuffer( command_buffer, Vulkan.StagingBuffer.Handle, Vulkan.UniformBuffer.Handle, 1, &buffer_copy_info );

VkBufferMemoryBarrier buffer_memory_barrier = {
  VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER, // VkStructureType    sType;
  nullptr,                          // const void        *pNext
  VK_ACCESS_TRANSFER_WRITE_BIT,     // VkAccessFlags      srcAccessMask
  VK_ACCESS_UNIFORM_READ_BIT,       // VkAccessFlags      dstAccessMask
  VK_QUEUE_FAMILY_IGNORED,          // uint32_t           srcQueueFamilyIndex
  VK_QUEUE_FAMILY_IGNORED,          // uint32_t           dstQueueFamilyIndex
  Vulkan.UniformBuffer.Handle,      // VkBuffer           buffer
  0,                                // VkDeviceSize       offset
  VK_WHOLE_SIZE                     // VkDeviceSize       size
};
vkCmdPipelineBarrier( command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0, 0, nullptr, 1, &buffer_memory_barrier, 0, nullptr );

vkEndCommandBuffer( command_buffer );

// Submit command buffer and copy data from staging buffer to the uniform buffer
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,    // VkStructureType    sType
  nullptr,                          // const void        *pNext
  0,                                // uint32_t           waitSemaphoreCount
  nullptr,                          // const VkSemaphore *pWaitSemaphores
  nullptr,                          // const VkPipelineStageFlags *pWaitDstStageMask;
  1,                                // uint32_t           commandBufferCount
  &command_buffer,                  // const VkCommandBuffer *pCommandBuffers
  0,                                // uint32_t           signalSemaphoreCount
  nullptr                           // const VkSemaphore *pSignalSemaphores
};

if( vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, VK_NULL_HANDLE ) != VK_SUCCESS ) {
  return false;
}

vkDeviceWaitIdle( GetDevice() );
return true;

5. Tutorial07.cpp, function CopyUniformBufferData()

In the code above we call the vkDeviceWaitIdle() function to make sure the data transfer operation is finished before we proceed. But in real-life situations, we should perform more appropriate synchronizations by using semaphores and/or fences. Waiting for all the GPU operations to finish may (and probably will) kill performance of our application.
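For example, instead of waiting for the whole device to become idle, we could pass a fence to the vkQueueSubmit() call and wait only for that single submission to finish. A hedged sketch (not part of the tutorial's code) reusing the submit_info prepared above could look like this:

VkFenceCreateInfo fence_create_info = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO, // VkStructureType     sType
  nullptr,                             // const void         *pNext
  0                                    // VkFenceCreateFlags  flags
};

VkFence copy_fence = VK_NULL_HANDLE;
if( vkCreateFence( GetDevice(), &fence_create_info, nullptr, &copy_fence ) != VK_SUCCESS ) {
  return false;
}

// Submit the same command buffer, but let the fence signal when it completes
if( vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, copy_fence ) != VK_SUCCESS ) {
  vkDestroyFence( GetDevice(), copy_fence, nullptr );
  return false;
}

// Wait (with a one-second timeout) only for this transfer, not for the whole device
vkWaitForFences( GetDevice(), 1, &copy_fence, VK_TRUE, 1000000000 );
vkDestroyFence( GetDevice(), copy_fence, nullptr );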

Preparing Descriptor Sets

Now we can prepare descriptor sets — the interface between our application and a pipeline through which we can provide resources used by shaders. We start by creating a descriptor set layout.

Creating Descriptor Set Layout

The most typical way that 3D geometry is rendered is by multiplying vertices by model, view, and projection matrices inside a vertex shader. These matrices may be accumulated in a model-view-projection matrix. We need to provide such a matrix to the vertex shader in a uniform variable. Usually we want our geometry to be textured; the fragment shader needs access to a texture — a combined image sampler. We can also use a separate sampled image and a sampler; using combined image samplers may have better performance on some platforms.

When we issue drawing commands, we want the vertex shader to have access to a uniform variable, and the fragment shader to a combined image sampler. These resources must be provided in a descriptor set. In order to allocate such a set we need to create an appropriate layout, which defines what types of resources are stored inside the descriptor sets.

  std::vector<VkDescriptorSetLayoutBinding> layout_bindings = {
  {
    0,                                         // uint32_t           binding
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, // VkDescriptorType   descriptorType
    1,                                         // uint32_t           descriptorCount
    VK_SHADER_STAGE_FRAGMENT_BIT,              // VkShaderStageFlags stageFlags
    nullptr                                    // const VkSampler *pImmutableSamplers
  },                                                                  
  {                                                                   
    1,                                         // uint32_t           binding
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         // VkDescriptorType   descriptorType
    1,                                         // uint32_t           descriptorCount
    VK_SHADER_STAGE_VERTEX_BIT,                // VkShaderStageFlags stageFlags
    nullptr                                    // const VkSampler *pImmutableSamplers
  }
};

VkDescriptorSetLayoutCreateInfo descriptor_set_layout_create_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO, // VkStructureType  sType
  nullptr,                                             // const void      *pNext
  0,                                                   // VkDescriptorSetLayoutCreateFlags flags
  static_cast<uint32_t>(layout_bindings.size()),       // uint32_t         bindingCount
  layout_bindings.data()                               // const VkDescriptorSetLayoutBinding *pBindings
};

if( vkCreateDescriptorSetLayout( GetDevice(), &descriptor_set_layout_create_info, nullptr, &Vulkan.DescriptorSet.Layout ) != VK_SUCCESS ) {
  std::cout << "Could not create descriptor set layout!"<< std::endl;
  return false;
}

return true;

6. Tutorial07.cpp, function CreateDescriptorSetLayout()

Descriptor set layouts are created by specifying bindings. Each binding defines a separate entry in a descriptor set and has its own unique index within that set. In the above code we define a descriptor set layout that contains two bindings. The first binding, with index 0, is for one combined image sampler accessed by the fragment shader. The second binding, with index 1, is for a uniform buffer accessed by the vertex shader. Both are single resources; they are not arrays. But we can also specify that a binding represents an array of resources by providing a value greater than 1 in the descriptorCount member of the VkDescriptorSetLayoutBinding structure.

Bindings are also used inside shaders. When we define uniform variables, we need to specify the same binding value as the one provided during layout creation:

layout( set=S, binding=B ) uniform <variable type> <variable name>;

Two things are worth mentioning. Bindings do not need to be consecutive. We can create a layout with three bindings occupying, for example, indices 2, 5, and 9. But unused slots may still use some memory, so we should keep bindings as close to 0 as possible.

We also specify which shader stages need access to which types of descriptors (which bindings). If we are not sure, we can provide more stages. For example, let's say we want to create several pipelines, all using descriptor sets with the same layout. In some of these pipelines, a uniform buffer will be accessed in a vertex shader, in others in a geometry shader, and in still others in both vertex and fragment shaders. For such a purpose we can create one layout in which we can specify that the uniform buffer will be accessed by vertex, geometry, and fragment shaders. But we should not provide unnecessary shader stages because, as usual, it may impact the performance (though this does not mean it will).
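Such a shared binding could be declared like the hypothetical one below; the only difference from the tutorial's own binding is the combined stageFlags value:

// Hypothetical binding visible to the vertex, geometry, and fragment stages
VkDescriptorSetLayoutBinding shared_uniform_binding = {
  1,                                     // uint32_t            binding
  VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,     // VkDescriptorType    descriptorType
  1,                                     // uint32_t            descriptorCount
  VK_SHADER_STAGE_VERTEX_BIT |
  VK_SHADER_STAGE_GEOMETRY_BIT |
  VK_SHADER_STAGE_FRAGMENT_BIT,          // VkShaderStageFlags  stageFlags
  nullptr                                // const VkSampler    *pImmutableSamplers
};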

After specifying an array of bindings, we provide a pointer to it in a variable of type VkDescriptorSetLayoutCreateInfo. The pointer to this variable is provided in the vkCreateDescriptorSetLayout() function, which creates the actual layout. When we have it we can allocate a descriptor set. But first, we need a pool of memory from which the set can be allocated.

Creating a Descriptor Pool

When we want to create a descriptor pool, we need to know what types of resources will be stored in the descriptor sets allocated from it. We must specify not only the maximum number of resources of each type that can be stored across descriptor sets allocated from the pool, but also the maximum number of descriptor sets allocated from the pool itself. For example, we can prepare storage for one combined image sampler and one uniform buffer, but for two sets in total. This means that we can have two sets, one with a texture and one with a uniform buffer, or only one set with both a uniform buffer and a texture (in that situation the second set stays empty, because the pool no longer has room for either of these two resource types).

In our example we need only one descriptor set, and we can see below how to create a descriptor pool for it:

std::vector<VkDescriptorPoolSize> pool_sizes = {
  {
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,   // VkDescriptorType  type
    1                                            // uint32_t          descriptorCount
  },
  {
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,           // VkDescriptorType  type
    1                                            // uint32_t          descriptorCount
  }
};

VkDescriptorPoolCreateInfo descriptor_pool_create_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO, // VkStructureType     sType
  nullptr,                                       // const void         *pNext
  0,                                             // VkDescriptorPoolCreateFlags flags
  1,                                             // uint32_t            maxSets
  static_cast<uint32_t>(pool_sizes.size()),      // uint32_t            poolSizeCount
  pool_sizes.data()                              // const VkDescriptorPoolSize *pPoolSizes
};

if( vkCreateDescriptorPool( GetDevice(), &descriptor_pool_create_info, nullptr, &Vulkan.DescriptorSet.Pool ) != VK_SUCCESS ) {
  std::cout << "Could not create descriptor pool!"<< std::endl;
  return false;
}

return true;

7. Tutorial07.cpp, function CreateDescriptorPool()

Now, we are ready to allocate descriptor sets from the pool using the previously created layout.

Allocating Descriptor Sets

Descriptor set allocation is pretty straightforward. We just need a descriptor pool and a layout. We specify the number of descriptor sets to allocate and call the vkAllocateDescriptorSets() function like this:

VkDescriptorSetAllocateInfo descriptor_set_allocate_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO, // VkStructureType               sType
  nullptr,                                        // const void                   *pNext
  Vulkan.DescriptorSet.Pool,                      // VkDescriptorPool              descriptorPool
  1,                                              // uint32_t                      descriptorSetCount
  &Vulkan.DescriptorSet.Layout                    // const VkDescriptorSetLayout  *pSetLayouts
};

if( vkAllocateDescriptorSets( GetDevice(), &descriptor_set_allocate_info, &Vulkan.DescriptorSet.Handle ) != VK_SUCCESS ) {
    std::cout << "Could not allocate descriptor set!"<< std::endl;
    return false;
}

return true;

8. Tutorial07.cpp, function AllocateDescriptorSet()

Updating Descriptor Sets

We have allocated a descriptor set. It is used to provide a texture and a uniform buffer to the pipeline so they can be used inside shaders. Now we must provide specific resources that will be used as descriptors. For the combined image sampler, we need two resources — an image, which can be sampled inside shaders (it must be created with the VK_IMAGE_USAGE_SAMPLED_BIT usage), and a sampler. These are two separate resources, but they are provided together to form a single, combined image sampler descriptor. For details about how to create these two resources, please refer to the Introduction to Vulkan Part 6 – Descriptor Sets. For the uniform buffer we will provide a buffer created earlier. To provide specific resources to a descriptor, we need to update a descriptor set. During updates we specify descriptor types, binding numbers, and counts in exactly the same way as we did during layout creation. These values must match. Apart from that, depending on the descriptor type, we also need to create variables of type:

  • VkDescriptorImageInfo, for samplers, sampled images, combined image samplers, and input attachments
  • VkDescriptorBufferInfo, for uniform and storage buffers and their dynamic variations
  • VkBufferView, for uniform and storage texel buffers

Through them, we provide handles of specific Vulkan resources that should be used for corresponding descriptors. All this is provided to the vkUpdateDescriptorSets() function, as we can see below:

VkDescriptorImageInfo image_info = {
  Vulkan.Image.Sampler,                    // VkSampler        sampler
  Vulkan.Image.View,                       // VkImageView      imageView
  VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL // VkImageLayout    imageLayout
};

VkDescriptorBufferInfo buffer_info = {
  Vulkan.UniformBuffer.Handle,             // VkBuffer         buffer
  0,                                       // VkDeviceSize     offset
  Vulkan.UniformBuffer.Size                // VkDeviceSize     range
};

std::vector<VkWriteDescriptorSet> descriptor_writes = {
  {
    VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,    // VkStructureType     sType
    nullptr,                                   // const void         *pNext
    Vulkan.DescriptorSet.Handle,               // VkDescriptorSet     dstSet
    0,                                         // uint32_t            dstBinding
    0,                                         // uint32_t            dstArrayElement
    1,                                         // uint32_t            descriptorCount
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, // VkDescriptorType    descriptorType
    &image_info,                               // const VkDescriptorImageInfo  *pImageInfo
    nullptr,                                   // const VkDescriptorBufferInfo *pBufferInfo
    nullptr                                    // const VkBufferView *pTexelBufferView
  },
  {
    VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,    // VkStructureType     sType
    nullptr,                                   // const void         *pNext
    Vulkan.DescriptorSet.Handle,               // VkDescriptorSet     dstSet
    1,                                         // uint32_t            dstBinding
    0,                                         // uint32_t            dstArrayElement
    1,                                         // uint32_t            descriptorCount
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         // VkDescriptorType    descriptorType
    nullptr,                                   // const VkDescriptorImageInfo  *pImageInfo
    &buffer_info,                              // const VkDescriptorBufferInfo *pBufferInfo
    nullptr                                    // const VkBufferView *pTexelBufferView
  }
};

vkUpdateDescriptorSets( GetDevice(), static_cast<uint32_t>(descriptor_writes.size()), &descriptor_writes[0], 0, nullptr );
return true;

9. Tutorial07.cpp, function UpdateDescriptorSet()

Now we have a valid descriptor set. We can bind it during command buffer recording. But, for that we need a pipeline object, which is created with an appropriate pipeline layout.

Preparing Drawing State

Created descriptor set layouts are required for two purposes:

  • Allocating descriptor sets from pools
  • Creating pipeline layout

The descriptor set layout specifies what types of resources the descriptor set contains. The pipeline layout specifies what types of resources can be accessed by a pipeline and its shaders. That's why before we can use a descriptor set during command buffer recording, we need to create a pipeline layout.

Creating a Pipeline Layout

The pipeline layout defines the resources that a given pipeline can access. These are divided into descriptors and push constants. To create a pipeline layout, we need to provide a list of descriptor set layouts, and a list of ranges of push constants.

Push constants provide a way to pass data to shaders easily and very quickly. Unfortunately, the amount of data is also very limited — the specification guarantees only 128 bytes of push constant data available to a pipeline at a given time. Hardware vendors may allow us to provide more, but it is still a very small amount compared to typical descriptors such as uniform buffers.
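For reference, a push constant range is described with a VkPushConstantRange structure provided at pipeline layout creation, and the data itself is recorded with vkCmdPushConstants(). A minimal, hypothetical usage might look like this (command_buffer, pipeline_layout, and matrix_data are placeholders):

// Hypothetical range: 16 floats (a 4 x 4 matrix) visible to the vertex shader
VkPushConstantRange push_constant_range = {
  VK_SHADER_STAGE_VERTEX_BIT, // VkShaderStageFlags stageFlags
  0,                          // uint32_t           offset
  16 * sizeof(float)          // uint32_t           size
};

// The range would be referenced from VkPipelineLayoutCreateInfo
// (pushConstantRangeCount = 1, pPushConstantRanges = &push_constant_range),
// and the data would be recorded into a command buffer like this:
// vkCmdPushConstants( command_buffer, pipeline_layout, VK_SHADER_STAGE_VERTEX_BIT,
//                     0, 16 * sizeof(float), matrix_data );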

In this example we don't use push constants ranges, so we only need to provide our descriptor set layout and call the vkCreatePipelineLayout() function. The code below does exactly that:

VkPipelineLayoutCreateInfo layout_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO, // VkStructureType              sType
  nullptr,                                       // const void                  *pNext
  0,                                             // VkPipelineLayoutCreateFlags  flags
  1,                                             // uint32_t                     setLayoutCount
  &Vulkan.DescriptorSet.Layout,                  // const VkDescriptorSetLayout *pSetLayouts
  0,                                             // uint32_t                     pushConstantRangeCount
  nullptr                                        // const VkPushConstantRange   *pPushConstantRanges
};

if( vkCreatePipelineLayout( GetDevice(), &layout_create_info, nullptr, &Vulkan.PipelineLayout ) != VK_SUCCESS ) {
  std::cout << "Could not create pipeline layout!"<< std::endl;
  return false;
}
return true;

10. Tutorial07.cpp, function CreatePipelineLayout()

Creating Shader Programs

Now we need a graphics pipeline. Pipeline creation is a very time-consuming process, from both the performance and code development perspective. I will skip the code and present only the GLSL source code of shaders.

The vertex shader used during drawing takes a vertex position and multiplies it by a projection matrix read from a uniform variable. This variable is stored inside a uniform buffer. The descriptor set, through which we provide our uniform buffer, is the first (and the only one in this case) in the list of descriptor sets specified during pipeline layout creation. So, when we record a command buffer, we can bind it to the 0th index. This is because indices to which we bind descriptor sets must match indices corresponding to descriptor set layouts that are provided during pipeline layout creation. The same set index must be specified inside shaders. The uniform buffer is represented by the second binding within that set (it has an index equal to 1), and the same binding number must also be specified. This is the whole vertex shader source code:

#version 450

layout(set=0, binding=1) uniform u_UniformBuffer {
    mat4 u_ProjectionMatrix;
};

layout(location = 0) in vec4 i_Position;
layout(location = 1) in vec2 i_Texcoord;

out gl_PerVertex
{
    vec4 gl_Position;
};

layout(location = 0) out vec2 v_Texcoord;

void main() {
    gl_Position = u_ProjectionMatrix * i_Position;
    v_Texcoord = i_Texcoord;
}

11. shader.vert, -

Inside the shader we also pass texture coordinates to a fragment shader. The fragment shader takes them and samples the combined image sampler. It is provided through the same descriptor set bound to index 0, but it is the first descriptor inside it, so in this case we specify 0 (zero) as the binding's value. Have a look at the full GLSL source code of the fragment shader:

#version 450

layout(set=0, binding=0) uniform sampler2D u_Texture;

layout(location = 0) in vec2 v_Texcoord;

layout(location = 0) out vec4 o_Color;

void main() {
  o_Color = texture( u_Texture, v_Texcoord );
}

12. shader.frag, -

The above two shaders need to be compiled to SPIR-V* before we can use them in our application. The core Vulkan specification accepts only binary SPIR-V data as a source of shader instructions. From the two binaries we create two shader modules, one for each shader stage, and use them to create a graphics pipeline. The rest of the pipeline state remains unchanged.
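Shader module creation was covered in earlier parts of this series; as a reminder, a hedged sketch of turning a loaded SPIR-V binary (for example, one compiled offline with the glslangValidator tool) into a VkShaderModule could look like the following. CreateShaderModule() here is an illustrative helper, not the tutorial's own function.

#include <vector>
#include <vulkan/vulkan.h>

// spirv_code is assumed to hold the contents of a compiled *.spv file
static VkShaderModule CreateShaderModule( VkDevice device, const std::vector<char> &spirv_code ) {
  VkShaderModuleCreateInfo shader_module_create_info = {
    VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,          // VkStructureType            sType
    nullptr,                                              // const void                *pNext
    0,                                                    // VkShaderModuleCreateFlags  flags
    spirv_code.size(),                                    // size_t                     codeSize
    reinterpret_cast<const uint32_t*>(spirv_code.data())  // const uint32_t            *pCode
  };

  VkShaderModule shader_module = VK_NULL_HANDLE;
  if( vkCreateShaderModule( device, &shader_module_create_info, nullptr, &shader_module ) != VK_SUCCESS ) {
    return VK_NULL_HANDLE;
  }
  return shader_module;
}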

Binding Descriptor Sets

Let's assume we have all the other resources created and ready to be used to draw our geometry. We start recording a command buffer. Drawing commands can only be called inside render passes. Before we can draw any geometry, we need to set all the required states — first and foremost, we need to bind a graphics pipeline. Apart from that, if we are using a vertex buffer, we need to bind the appropriate buffer for this purpose. If we want to issue indexed drawing commands, we need to bind a buffer with vertex indices too. And when we are using shader resources like uniform buffers or textures, we need to bind descriptor sets. We do this by calling the vkCmdBindDescriptorSets() function, in which we need to provide not only the handle of our descriptor set, but also the handle of the pipeline layout (so we need to keep it). Only after that can we record drawing commands. These operations are presented in the code below:

vkCmdBeginRenderPass( command_buffer, &render_pass_begin_info, VK_SUBPASS_CONTENTS_INLINE );

vkCmdBindPipeline( command_buffer, VK_PIPELINE_BIND_POINT_GRAPHICS, Vulkan.GraphicsPipeline );

// ...

VkDeviceSize offset = 0;
vkCmdBindVertexBuffers( command_buffer, 0, 1, &Vulkan.VertexBuffer.Handle, &offset );

vkCmdBindDescriptorSets( command_buffer, VK_PIPELINE_BIND_POINT_GRAPHICS, Vulkan.PipelineLayout, 0, 1, &Vulkan.DescriptorSet.Handle, 0, nullptr );

vkCmdDraw( command_buffer, 4, 1, 0, 0 );

vkCmdEndRenderPass( command_buffer );

13. Tutorial07.cpp, function PrepareFrame()

Don't forget that a typical frame of animation requires us to acquire an image from the swapchain, record the command buffer (or buffers) as presented above, submit it to a queue, and present the previously acquired swapchain image so it gets displayed according to the present mode requested during swapchain creation.
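A hedged outline of that per-frame sequence is sketched below; the variable names (semaphores, queues, swapchain) are illustrative stand-ins rather than the tutorial's actual members, and error handling is omitted:

// Acquire a swapchain image; image_available_semaphore is signaled when it is ready
uint32_t image_index = 0;
vkAcquireNextImageKHR( device, swapchain, UINT64_MAX, image_available_semaphore, VK_NULL_HANDLE, &image_index );

// Record (or reuse) the command buffer for that image, then submit it;
// rendering waits for the acquired image and signals rendering_finished_semaphore
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,    // VkStructureType              sType
  nullptr,                          // const void                  *pNext
  1,                                // uint32_t                     waitSemaphoreCount
  &image_available_semaphore,       // const VkSemaphore           *pWaitSemaphores
  &wait_stage,                      // const VkPipelineStageFlags  *pWaitDstStageMask
  1,                                // uint32_t                     commandBufferCount
  &command_buffer,                  // const VkCommandBuffer       *pCommandBuffers
  1,                                // uint32_t                     signalSemaphoreCount
  &rendering_finished_semaphore     // const VkSemaphore           *pSignalSemaphores
};
vkQueueSubmit( graphics_queue, 1, &submit_info, VK_NULL_HANDLE );

// Present the image once rendering is finished
VkPresentInfoKHR present_info = {
  VK_STRUCTURE_TYPE_PRESENT_INFO_KHR, // VkStructureType        sType
  nullptr,                            // const void            *pNext
  1,                                  // uint32_t               waitSemaphoreCount
  &rendering_finished_semaphore,      // const VkSemaphore     *pWaitSemaphores
  1,                                  // uint32_t               swapchainCount
  &swapchain,                         // const VkSwapchainKHR  *pSwapchains
  &image_index,                       // const uint32_t        *pImageIndices
  nullptr                             // VkResult              *pResults
};
vkQueuePresentKHR( present_queue, &present_info );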

Tutorial 7 Execution

Have a look at how the final image generated by the sample program should appear:

Final image: a track with the Intel logo rendered on the textured quad

We still render a quad that has a texture applied to its surface. But the size of the quad should remain unchanged when we change the size of our application's window.

Cleaning Up

As usual, at the end of our application, we should perform a cleanup.

// ...

if( Vulkan.GraphicsPipeline != VK_NULL_HANDLE ) {
  vkDestroyPipeline( GetDevice(), Vulkan.GraphicsPipeline, nullptr );
  Vulkan.GraphicsPipeline = VK_NULL_HANDLE;
}

if( Vulkan.PipelineLayout != VK_NULL_HANDLE ) {
  vkDestroyPipelineLayout( GetDevice(), Vulkan.PipelineLayout, nullptr );
  Vulkan.PipelineLayout = VK_NULL_HANDLE;
}

// ...

if( Vulkan.DescriptorSet.Pool != VK_NULL_HANDLE ) {
  vkDestroyDescriptorPool( GetDevice(), Vulkan.DescriptorSet.Pool, nullptr );
  Vulkan.DescriptorSet.Pool = VK_NULL_HANDLE;
}

if( Vulkan.DescriptorSet.Layout != VK_NULL_HANDLE ) {
  vkDestroyDescriptorSetLayout( GetDevice(), Vulkan.DescriptorSet.Layout, nullptr );
  Vulkan.DescriptorSet.Layout = VK_NULL_HANDLE;
}

DestroyBuffer( Vulkan.UniformBuffer );

14. Tutorial07.cpp, function destructor

Most of the resources are destroyed as usual, in the order opposite to their creation. Here, only the part of the code relevant to our example is presented. The graphics pipeline is destroyed by calling the vkDestroyPipeline() function. To destroy its layout we call the vkDestroyPipelineLayout() function. We don't need to destroy descriptor sets separately, because when we destroy a descriptor pool, all sets allocated from it are destroyed along with it. To destroy a descriptor pool we call the vkDestroyDescriptorPool() function. The descriptor set layout needs to be destroyed separately with the vkDestroyDescriptorSetLayout() function. After that, we can destroy the uniform buffer with the vkDestroyBuffer() function. But we must not forget to free the memory bound to it with the vkFreeMemory() function, as seen below:

if( buffer.Handle != VK_NULL_HANDLE ) {
  vkDestroyBuffer( GetDevice(), buffer.Handle, nullptr );
  buffer.Handle = VK_NULL_HANDLE;
}

if( buffer.Memory != VK_NULL_HANDLE ) {
  vkFreeMemory( GetDevice(), buffer.Memory, nullptr );
  buffer.Memory = VK_NULL_HANDLE;
}

15. Tutorial07.cpp, function DestroyBuffer()

Conclusion

In this part of the tutorial we extended the example from the previous part by adding the uniform buffer to a descriptor set. Uniform buffers are created as are all other buffers, but with the VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT usage specified during their creation. We also allocated dedicated memory and bound it to the buffer, and we uploaded projection matrix data to the buffer using a staging buffer.

Next, we prepared the descriptor set, starting with creating a descriptor set layout with one combined image sampler and one uniform buffer. Next, we created a descriptor pool big enough to contain these two types of descriptor resources, and we allocated a single descriptor set from it. After that, we updated the descriptor set with handles of a sampler, an image view of a sampled image, and the buffer created in this part of the tutorial.

The rest of the operations were similar to the ones we already know. The descriptor set layout was used during pipeline layout creation, which was then used when we bound the descriptor sets to a command buffer.

We have seen once again how to prepare shader code for both vertex and fragment shaders, and we learned how to access different types of descriptors provided through different bindings of the same descriptor set.

The next parts of the tutorial will be a bit different, as we will see and compare different approaches to managing multiple resources and handling various, more complicated, tasks.

Intel® Computer Vision SDK: Getting Started with the Intel® Computer Vision SDK (Intel® CV SDK)


Using Caffe* with the Intel® Deep Learning Model Optimizer

The Deep Learning Model Optimizer for Caffe* requires the Caffe framework to be installed on the client machine with all relevant dependencies. Caffe should be dynamically compiled and linked, and a shared library named libcaffe.so should be available in the CAFFE_HOME/build/lib directory.

For ease of reference, the Caffe* installation folder is referred to as <CAFFE_HOME> and the Model Optimizer installation folder is referred to as <MO_DIR>.

The installation path to the Model Optimizer depends on whether you use the Intel® CV SDK or Deep Learning Deployment Toolkit. For example, if you are installing with sudo, the default <MO_DIR> directory is:

  • /opt/intel/deeplearning_deploymenttoolkit_<version>/deployment_tools/model_optimizer - if you installed the Deep Learning Deployment Toolkit
  • /opt/intel/computer_vision_sdk_<version>/mo - if you installed the Intel® CV SDK

Installing Caffe

To install Caffe, complete the following steps:

  1. For convenience, set the following environment variables:
      export MO_DIR=<PATH_TO_MO_INSTALL_DIR>
      export CAFFE_HOME=<PATH_TO_YOUR_CAFFE_DIR>
  2. Go to the Model Optimizer folder:
    cd $MO_DIR/model_optimizer_caffe/
  3. To simplify the installation procedure, two additional helper scripts are provided in the $MO_DIR/model_optimizer_caffe/install_prerequisites folder:
    • install_Caffe_dependencies.sh - Installs the required dependencies like Git*, CMake*, GCC*, etc.
    • clone_patch_build_Caffe.sh - Installs the Caffe* distribution on your machine and patches it with the required adapters from the Model Optimizer.
  4. Go to the helper scripts folder and install all the required dependencies:
        cd install_prerequisites/
        ./install_Caffe_dependencies.sh 
  5. Install the Caffe* distribution. By default the script installs BVLC Caffe* from the master branch of the official repository. If you want to install another version of Caffe*, you can edit the content of the clone_patch_build_Caffe.sh script. In particular, the following lines:
        CAFFE_REPO=https://github.com/BVLC/caffe.git # link to the repository with Caffe* distribution
        CAFFE_BRANCH=master # branch to be taken from the repository
        CAFFE_FOLDER=`pwd`/caffe # where to clone the repository on your local machine
        CAFFE_BUILD_SUBFOLDER=build # name of the folder required for building Caffe* 
    To launch installation, just run the following command:
        ./clone_patch_build_Caffe.sh 

NOTE: If you encounter a problem with the hdf5 library while building Caffe on Ubuntu* 16.04, see the following fix.

Once you have configured the Caffe* framework on your machine, you need to configure the Model Optimizer for Caffe* to work with it. For details, refer to the Configure Model Optimizer for Caffe* page.


Modular Concepts for Game and Virtual Reality Assets


In-game environment

A hot topic and trend in the games industry is modularity, or the process of organizing groups of assets into reusable, inter-linkable modules to form larger structures and environments. The core idea is reusing as much work as possible in order to save memory, improve load times, and streamline production. There are, however, drawbacks to overcome with these methods. Creating variation in surfaces is important so that repetition of modules is not noticeable and a viewer feels immersed in a realistic environment.

One of the biggest issues with real-time environments is that we cannot do all the creation in-engine. As artists, we rely on a plethora of programs that all get consolidated into a workflow, with the final product reaching the engine. This is a challenge because it is important that the whole scene shares consistent, equal detail rather than inconsistent pockets of micro detail where the artist spent more time. This requires a lot of rapid iteration and early testing. With current next-gen tools and engines, it is possible to add detail within the final scene by leveraging advanced materials/shaders to increase visual quality across the scene and break up repetition.

Pros

  • Build large environments quickly
  • Memory efficient

Cons

  • Extra planning time required at the start
  • Can look repetitive, boxy or mechanically aligned; boring

In this article I aim to share lessons learned from creating environment art for virtual reality (VR) games that can be applied to any 3D application or experience.

How To Think as a Designer: Basic Fundamentals

When it comes to creating modular assets, understanding how elements of architecture come together to form details and interesting spaces is just as important as knowing the level-design strategies used to convey importance to a user. Understanding how to simplify visuals into believable and reusable prefabs while working within design and hardware constraints is a balancing act that gets easier with practice and study. In addition, this plays into how we as artists look at reference and decide which assets to make first. I highly recommend aiming for the minimum amount of work, or the maximum amount of reusability, when beginning a scene. Iterative strategies such as this improve the overall quality without stretching the budget, and also inform next steps.

Games Versus Commercial

The biggest difference between games and commercial applications is that in games the art accommodates a player, whereas in commercial work our art accommodates a consumer. While a player is still a consumer, a game has rules and mechanics that must be emphasized in the layout and design of our spaces to accent what makes the game fun; poor level design begets a poor user experience. With a pure consumer, by contrast, we must focus on an invitation to someplace comfortable and easy on the eyes. With VR, these lines are merging regardless of whether you are making a game or an experience. Games typically require more lateral thinking and communication between artists, designers, and programmers; it is important to get feedback early so as not to cause problems down the line.

As we build our interlocking assets, we must always check the pieces in the engine, seeing how simple untextured or textured models line up (or don't) when we try to put pieces together to form simple structures. During these tests, the key things to check are:

Ease of use: Does the asset line up well with others? Is the pivot point of the mesh in the correct spot and snapped to the grid? Overall, is it not a hassle for you to use?

Repetition: Do we need new pieces to break up a kit that may be too small? If the viewer can easily pick out each piece and notice the prefab architecture, we will have a hard time immersing people in the space.

Forms and shapes: Does everything have a unique silhouette and play with light well in the engine? Or does a flat surface break up well?

Composition: Is everything working together? Can an interesting image be composed?

Readability, color/detail, scale and the first two seconds: This is a later check after getting in some texture and base lighting for a scene. Make sure that pathing is visible or importance is conveyed right away. To get this right we may need more assets, or sometimes, less. This is all about visual balance and iterating quickly from rough geo and textures to final assets.

Learning how to think as a designer and understand how your work affects the whole process of development as well as the viewer experience is what ultimately allows you to make subjective decisions that can improve the design or limit cluttering of conveyance. Understanding why real-world places look and feel a certain way or how a shape suggests use of an object is the core of being an artist and a designer. I see a lot of technical breakdowns but understanding and studying these fundamentals through observing references will make you better, faster. Technology is just a tool. Modularity is a set of wrenches in the tool box. Building an airplane is easy if you know what you are doing.

Importance of Planning

Understanding the elements of design helps to make decisions when planning out our modular set and assessing its needs. As stated before, modularity requires a great deal of planning ahead of production. Reference should be the first place you start. Begin to break down images and pull out assets and ideas from real-world spaces. Good reference shows variety in a natural sense. Immersion is in the details.

My main tool at this stage is Pinterest*, my ultimate resource for finding reference. Use it in tandem with a free tool called PureRef to view the references and you have a very powerful combo. Trello* can help manage tasks based on images and show progress. These tools can help limit creative friction or indecisiveness, especially if screenshots are posted to Trello to keep track of what you did today so that tomorrow you know where to pick up. Over time this is a real lifesaver for personal projects, keeping them going as you see how far you've come.

When working with a client, get as much reference as possible: pictures, 360-degree videos, similar spaces, and any important details. In some cases it may be good to do storyboards beforehand so that, as we work, we can think laterally about the target goal and the client stays on the same page. This could be as simple as photos of the place with a green-screened character wearing a headset to imply what each scene is. Then move on to a blockout, or rough 3D layout, and then the final look pass. It is also worth considering sites like TurboSquid*, Cubebrush*, or the Unity* Asset Store, which can help cut production time on asset creation. Investing in Quixel Megascans* and Substance Source can really help bring in quality materials early; this also helps because purchased assets often come with lower-quality textures.

General Process Overview

It is important to try different workflows and adopt others' workflows to see whether a process speeds you up or slows you down. One thing I learned as an artist is that there is no one sacred way or secret sauce to being good. Everyone has a different process or tricks depending on how they work, what they are working on, and where they work. Here is a basic process that gives a general overview of what goes into a complete personal environment. You have to be flexible and iterative. Know when something is good enough, and minimize creative friction by keeping tasks and files organized. Figure out a good naming convention for your assets early.

  • Gather reference: Learn your space. Map it out in your mind. What does it feel like?
  • Sort reference into tasks: Assets, materials, modular components.
  • Check 1: Attempt a basic 3D layout or blockout of primitive geometry. Sense of scale, space, and interconnectivity of assets is important.
  • Test early lighting: Adjust elements to suit ideal composition.
  • Once your meshes work well together and the scene is composed, improve the detail of the meshes to final in-game quality and begin unwrapping.
  • Check 2: Apply textures early, focusing on basic albedo, basic normal maps, and the overall readability with temp lighting. Note if your albedo is too dark.
  • Begin to follow your production schedule and continue to work up the detail. High poly creation, final texture passes; focus on larger or frequently used assets first as a baseline to always check if everything is cohesive.
  • Create supporting props to fill the scene with desired detail giving a sense of character, story, and scale.
  • Final lighting pass: Take screenshots, compare changes in external photo viewer, tweak until desired look is achieved.

The greatest advantage, in my opinion, of working modularly is that you are always aware of how many assets you have to keep updating. In a professional setting this helps a lot if changes need to be made, either in layout or optimization.

Texture Types

As we begin to look at the reference we want to capture the atmosphere, space, and harmony. However, it’s also crucial to be able to analyze each bit of reference and see where every material is used, so as to notice where the same materials are applied. The goal should be as few materials as possible to limit the work into a manageable scope and reduce the overall loading of textures, which take up a great amount of memory at runtime. To grasp modular concepts, it's necessary to understand the three types of textures we will be working with.

Tileable: Textures that tile infinitely, either in all four directions or only horizontally or vertically; the latter are also known as trim sheets. Tileable textures are your frontline infantry and should make up most of your scene, as they are efficient and reusable. Oftentimes we can start here and model directly from these textures to get our first building blocks. We will cover that further along.

Tileable texture

Tile notes and tips:

  • Substance Designer is incredible for making just about any tile texture or material imaginable. It excels at achieving hyperrealism without using scan data, and is completely nondestructive. Its node-based system lets you expose parameters for use in the engine, leverage random seeds for quick iterations, customize endlessly on the fly, and stumble onto happy accidents.
  • Priority number one for physically based rendering materials is the height and normal, then roughness, then albedo. This way the initial read feels correct and the details line up, and we can also use or create ambient occlusion (AO), curvature, and normal map color channels to build masks from the height data that drive the roughness and albedo.
  • Decide how the texture should tile and be aware of repetition! My rock texture tiles noticeably, but this study example was meant to be a more unique piece that would blend with a less noticeably repetitive version of the same rock material. This is called having low noise: supplying areas of rest alongside details that invite the viewer in.
  • Use the histogram! To make an albedo work in any lighting condition and look natural, it should have a nice curve in the histogram with the middle landing around a mean luminance of 128. This number can be lower, as albedo information changes for naturally darker surfaces, and vice versa. Furthermore, a wide range of values for the color, or a softer curve, is more natural. Check the value curves on textures at Textures.com to help give yourself proper value and color. It is worth mentioning that mid gray is a value of 186 RGB. I would further recommend using this value on all assets in a scene when tweaking early lighting tests, to see how geometry/height maps add light versus shadow details.
  • Albedo and roughness are two sides of the same coin. Both interpret light to create value. Albedo is just the base coat, but if you look at a wall that has good specular breakup, you notice that areas with more reflectivity are brighter and inherit light color, while duller areas are darker and maintain more of the raw albedo color.
  • Learn how to create generators in Substance Designer to speed up your workflow, and reuse graphs to give yourself a good base. These can be shapes and patterns, edge damage, cracks, moisture damage, and so on. Please credit the generators and graphs supplied by other artists if you use them in a Frankenstein graph.

Trim sheets:

Trim sheet texture

Using Substance smart materials to create interchangeable variants.

Our trim sheet high-polygon mesh:

Trim notes and tips:

  • Create a 1m by 1m plane as your template outline. Keep it to export as your low-poly mesh, and also export it with the high poly to ensure there are no gaps between elements.
  • Snap elements to the grid! This keeps the conversion to scene scale consistent and the texel density accurate.
  • In order for trims to tile in the bake, the geometry has to go past the 1 x 1 plane.
  •  Don’t unwrap! This gets baked to a flat plane so we do not have to worry about geometry or unwrap these objects. The base plane is already unwrapped to 0-1, so we are good to go.
  • Floaters. We can make 3D objects to use as details that sit or float on top of other elements such as the small bolts in the bottom corner. Because the texture is baked flat, the bake won’t recognize the change in depth, as the normals should appear to blend seamlessly. This saves a lot of time as we don't need to model details into complicated meshes, but rather reuse these elements like screws, bolts, or any concave/convex shape to place where we want. This also makes it nondestructive if we are not happy with the bake result.
  • Save space for small, reusable details. Sometimes floaters don’t turn out too well if they are small and there are not enough pixels to support the normal. By putting details at the bottom, we can reuse these on the game mesh by putting small planes as floaters that have the details mapped to them.
  • Forty-five-degree bevels in the normal map help to smooth edges, and at this angle they are very reusable. Sunset Overdrive has a great breakdown of this technique. This isn't to say that each edge needs this, however; it is mostly for lower poly objects with hard edges.
  • Simplicity is often better. Have some unique elements, but low detailed pieces are far more reusable and less noticeable than, say, a trim with lots of floaters, so find a good balance.
  • Your trim can be composed of different materials such as metal trims, wood mouldings, or rubber strips all on one texture sheet. Baking out a material ID mask helps to make a sheet more versatile for saving memory. This is where really good planning helps.
  • It is also possible to create trims without geo in Substance Designer and Quixel* NDO Painter. Details can also be added to trim geo using alpha masks in Substance Painter.

Further creative examples:

example of a texture
Image of texture
This was for a mobile AR project. This method kept the texture size low and the detail high.

 

If trim sheets are made to the grid, they can be easily used interchangeably by adding edge loops to a plane and breaking off the components. This method is quick for prototyping since the UVs will be one to one to ensure tiling. If planned accordingly, one can also interchange textures if the textures share the same grid space. Check out this breakdown by Jacob Norris for details.

Here, the elements come together to form a basic corridor using one trim and one tileable.

Unique: A texture that focuses on one asset like a prop. It utilizes baked maps from a high polygon model to derive unique, non-tiling detail. These should be the finishing touches on a scene, and ideally, similar props that are always featured together share a single texture atlas.

An exception to doing a unique asset first would be with a hero prop, which is a category for a unique asset of great significance to the overall scene. In this case getting a basic or final unique asset can come first.


Chess set uses one texture for all pieces. Each piece is baked from a high poly to a low poly and each surface face has unique detail.

Hybrid: Features the use of both tiling elements and unique elements on the same 0-1 texture. This could also be material blending on larger assets that use unique normals and masks to pump in tiling materials such as ornate stone walls or large rocks.

This hybrid example uses a unique normal map to drive the wear generators in Substance Painter to get masks that blend the rust and metals together. The red lines also indicate where I made cuts to form new geometry from the existing mesh. Yay, recycling!

The elements coming together to make an old sci fi warehouse kit. One trim (two color variants), one hybrid, and two tile textures.

Texel Density and Variation

One last thing to consider at an early stage is texel density, or how many pixels per unit our assets will use. This can be pretty lengthy for first timers, and I highly recommend this article by Leonardo Lezzi. For first-person experiences we would want 1k (a 1024 x 1024 texture) per one meter, or a texel density of 10.24 pixels per cm. We primarily want to use tileable textures in order to maximize visual detail on larger surfaces. An exception to these rules would be anything interactable that will be close to the player, such as a gun. I like to use Nightshade UV for Maya* when unwrapping. The tool has a built-in texel density feature to set or track the pixels per unit. Below is an example of a 3m x 3m wall with a texel density of 1k per meter.

VR needs at least 1K per meter, but as the headsets evolve, that value will likely increase to 2K per meter. This poses a lot of challenges for VR artists as the hardware of computers will still limit texture sizes for many users. VR headsets already render two images at a wider field of view, making higher frame rates and nice details tricky. In order to combat this, it is possible to create shaders in Unreal Engine* 4 (UE4) and Unity* that use lower resolution detail maps to create the illusion of nearly 8K. A short demonstration of this technique can be seen on Youtube: Detail Textures: Quick Tutorial for UE4 & Megascans.

These shader setups are also critical to adding variation and breakup to surface tiling, either by using mask-based material blending or by using vertex colors to blend materials or colors. This topic is a bit of a rabbit hole, as there are so many different ways to achieve variation through shader techniques. Amplify Shader Editor for Unity is a great tool to allow for this style of AAA art development. Additionally, this great breakdown by senior environment artist Yannick Gombart demystifies some of these techniques.

The Grid

I remember that, at first, modularity was a difficult concept to grasp. Take a look at Tetris*. Tetris is simply a stacking game of pieces, or modules, that pair together to create new, interlocking shapes, all falling onto a grid.

The grid is the guide to our modular blueprints according to a set footprint or scale. The footprint informs the basis of our modular construction and usually depends on the type of game we are making. If we were in Dungeons & Dragons*, is the base tile size for our modular corridors 5 feet or 10 feet? How does that impact player movement and sense of space? In Tetris, it would be the base cube size.

Since we are discussing VR, we are talking about first-person games and applications; a good footprint would be three meters by three meters, or four by four. Always work in centimeters when modeling, so that would be 300 by 300. The metric system is easily divisible, making modules easy to break up into round units, and it is what game engines use by default. When deciding a footprint, keep in mind that first-person VR tends to make objects appear smaller than they actually are, so exaggeration in shapes makes things feel more realistic or clear. To begin, we need to change our modeling application's grid to mimic our engine's grid so that they will integrate seamlessly.

How To Set Up our Scale in Maya*

First, let’s ensure we are using centimeters in Maya.
Go to Window > Setting/Preferences > Preferences

Click on Settings and under Working Units check that Linear is set to Centimeter.

Now, let’s set up our grid.

Go to Display > Grid Options

Under Size set the following values.

  • Length/Width: 1,000 units. This is the overall grid size of our perspective view and not the size of each grid unit.
  • Grid lines every: 10 units (this controls grid unit lines as it does in Unity or UE4; if you set this value to 5, 10, 50, or 100, it will match the UE4 grid snaps. For Unity, I use the ProGrids plugin, which mirrors a more robust grid like Unreal's). Changing this value is what will mirror how assets snap and line up with each other in the engine.
  • Subdivisions: 1. This changes how many grid lines we have per unit. At 1 it will be every 10 units, but at 2, we change this to grid lines every 5 units. It’s a quicker way to divide the grid into different snapping values, by sliding the input, rather than inputting a unique number each time for the Grid lines every field.
    Finally, create a cube 100 x 100 x 100 and export it as an .FBX file, in order to see if your asset matches the default cube size in the engine (usually 1m by 1m).
  • Useful hotkeys:
    • By holding X, you can snap an object to the grid as you translate.
    • Holding V snaps the selection to the vertices.
    • Pressing the Ins or the D key allows for moving an object’s pivot point. In tandem with holding X or V, you can get the pivot on the grid.
    • Holding J snaps the rotation.

Again, with whatever modeling software you use, the goal is for the grid to match that of the engine’s grid.

Here is a great walkthrough with images for grid setup, as well as a good site in general for good level design and environment art practices.

Bringing It All Together—The Blockout

Now that we are aware of our metrics and know what textures we need to look out for, we can go one of two ways. We can either make a quick texture set, usually a trim sheet and a tiling texture to create modules from, or we can go straight into a modeling program. Either way, breaking down the reference into working units and materials reduces creative friction early by having a good understanding of the space and sense of production scope.

As an example, here are some ways to look at a reference, in order to plan:

Some modeling programs have a distance tool to measure your units. I have measured a basic human model here to use as a scale reference for when I work in my scenes.

I can now overlay the humanoid reference and scale him accordingly to get the units for the reference. I have also color coded some of the texture callouts, highlighting essentials that I can use to plan a production schedule.

Here is another example of working from a photo reference. I used the park bench to estimate the building scale, sliced it into modules, and highlighted the trims.

Back to the room image. With our units in mind, I can now do a quick blockout in Maya using simple planes to get my scale. This establishes my footprint and serves as the base or root guideline for all of my high poly meshes and game assets.

From here I can create my materials and set of assets such as this:

As mentioned earlier, we need to check this in-engine constantly to see if our modules line up as intended. Work with units of 5, 10, 50, 100 cm. Keep in mind that working with meters ensures perfect tiling of textures from asset to asset (if our texel density is 1k per meter). It is important to export modules with good pivot points snapped on the grid with the front face facing the Z-direction. It is also good to establish naming conventions so it will be easy to look up each module in the engine.

In Unreal, for this project I assembled my level using the base kit. From here it is about polishing upwards by creating materials, high poly meshes, and unwrapping. This scene was made from various references and is a custom space. For this I drew a basic top-down map as a floor plan, then started with my wooden beams and railings, then filled in the other details to support them.

Keeping track of progress looks something like this in Trello:

From the left we have my reference images, which help inform space, props, and materials. Next is my progress, showing each stage. Lastly, there are different columns forming an asset catalog, to keep track of each asset. Each card has a checklist and any images or notes I need to keep in mind. The color coding indicates the material type (tile, trim, hybrid, unique) for each card. From here it's just about polish.

Process Continued—Look Dev

It is important to note that when working within games it’s ideal to be as fast as possible. Just because you know how to save time by using modular concepts does not mean your final will look the way you want it to. There is a great deal of back and forth to do, and in order to mitigate risk, it is key to iterate. Get to a final early and do a look dev test such as this one:

This image is by no means a final environment. There are many changes I want to make and materials to change out. To get to this stage, I used basic primitives, some premade assets, and a mix of my own materials made in Substance Designer and materials downloaded from Substance Source. The goal is selling a sense of atmosphere to inform an entire level. This scene was assembled in a little over a week, which is very fast for one artist!

Now, something I realized early when using modularity and with level design in general is that if you always stay on the grid, a scene can become very boring. You want to strike a balance between a nice, organic composition, and the ease of use the grid offers.

In order to attain this balance, I use Oculus Medium* to start my blockout. Working in VR with a voxel-based sculpting program is incredibly freeing, and makes it simple for an artist to previsualize entire scenes. Using PureRef and Pinterest, it is possible to make large images that can be loaded into Medium to be used as reference, which makes the early creation and inspiration seamless. Additionally, because I am working in VR, I find it easy to judge scale and can get a sense of space far faster and more easily than by working back and forth between the engine and Maya. Furthermore, it is easy to perceive a scene in Medium from any angle, including basic lighting and ambient occlusion. This makes for a powerful iteration tool that gets you to the essence of your scene faster.

Here is my decimated mesh in Maya. It doesn't look like much, but it gives me enough ideas for how I can move on a space, early and more organically.

From here, I add the Oculus mesh to a layer to set as a wireframe, in order to reverse engineer it with modular pieces.

Next is to go wild with primitives, working around my footprint and getting things to snap together.

Now, I set up a camera in Maya and reorient assets to compose something more interesting. From here I save the file as an .MA, which Unity can read, in order to work quickly between Maya and Unity. The far wall with the weird growth was made using Oculus Medium and remeshed in ZBrush*. I wanted to capture a cool set piece that had an organic cyber punk fusion. The stairs are also made in Medium, and reflect my initial sculpt.

Here I have the Maya file imported in order to test lighting early.

Getting the scene filled out a bit more.

It's a little hard to tell, but I overlaid my original Oculus sculpt into this scene. I absolutely love this. This is where the initial payoff comes full circle. Now I can see with fresh eyes and notice a sense of rigidity that I must break up.

Now I have offset the wall to an angle and added some elements to create some more organic motion as eyes look through the scene. The key addition is the stairs, which have a unique workflow that I like using.

To easily get everything tiling perfectly I use two planes, shown on the left. I weld them all together and we have it tiling easily. Next, I add some bevels to remove hard edges. Lastly, I use Maya’s sculpt tool to add in noise to the geo, to break up the rigidity. This is the final stage of getting usable modules unwrapped. I prefer to use Nightshade UV, which makes things much easier than using the default tools.

I then export my scene in chunks as FBX files and begin assigning materials in the engine, finding values that work well with my lighting while also adjusting the lighting. My current step is the hardest part: polish. From here on it's on to the grind: high poly meshes, trims and tiles, and shaders to add variation control. Then, as a final stage, we go on to optimization, to get this scene running in VR. Many of my look dev post effects won't be viable. Also, I have yet to have the right shaders come online for the final look and style, but that's okay, because I have the mood I'm looking for. When optimizing environments, the three major areas to address are:

  • Geo: Simplifying the geometry and edge flow.
  • Textures: Lowering the resolution; finding balance between lowering different maps such as roughness, masks, and ambient occlusion but keeping normal maps at set size.
  • Shaders: Keeping the nodes short and sweet. Blend only up to three full materials within one shader at a time and utilize masks and constants to do the rest.

Note: sometimes it can be outside art causing performance issues (such as characters, scripts, or post effects), so be sure to check everything else first. Optimization is another puzzle and is tricky to get right, but modular workflows help make it a bit easier with fewer assets to optimize, and the possibility of reworking areas if need be.

Last Words and Advice

Projects grind to a halt when they are not well organized, or when too much polish goes into one asset too early and not enough overall progress is made. It is a hard lesson learned. Do not feel the need to be good enough too soon; it's an iterative process. The hardest part about being an artist is knowing how to take feedback positively and build up self-worth when starting out. Sometimes it's grind, grind, grind, and it goes by all too quickly. Other times, problems abound and you are in a lull. In the end, have a goal. Know where you want to be. It's easy to get lost in your work and feel aimless or that you need to do more to get better.

The best advice I ever received was the only thing that truly matters is trying to enjoy the ride. Once you stop having fun a project can die quickly. Or worse, you could wake up when you are 30 and wonder where years of your life vanished. Sometimes you just need a little more time to get better at the individual steps. Learning how to use software is the same as with any tool. The more you use it, the faster you get. You pick up tricks along the way and ideally never stop learning. Create an environment or project that focuses on a goal or something you want to learn or improve on. Don't aspire to just make something cool. That is not a goal and you will likely never be finished.

Modularity is one of the core pillars of game development for environment artists. Once you understand the basic fundamentals, you can begin to break scenes down into more efficient and approachable methods for environment creation. The most valuable asset an artist has is the sense of community within art production. It really is a glass-door community made open and accessible, thanks to ArtStation*, Polycount, YouTube*, Twitch*, and other publishers of articles that focus on art production. Artists are finally in the limelight and you know who the rock stars are. Learn from these people. Follow them on ArtStation and aspire to meet your goals as you watch your peers meet theirs. Keep your head above water, stay inspired at every step of the way, and strive to see things from different angles. The rest will come in time.

Useful Links

AI-Driven Test System Detects Bacteria In Water


Hands in water

“Clean water and health care and school and food and tin roofs and cement floors, all of these things should constitute a set of basics that people must have as birthrights.”1

– Paul Farmer, American Doctor, Anthropologist, Co-Founder,
Partners In Health

Challenge

Obtaining clean water is a critical problem for much of the world’s population. Testing and confirming a clean water source typically requires expensive test equipment and manual analysis of the results. For regions in the world in which access to clean water is a continuing problem, simpler test methods could dramatically help prevent disease and save lives.

Solution

To apply artificial intelligence (AI) techniques to evaluating the purity of water sources, Peter Ma, an Intel® Software Innovator, developed an effective system for identifying bacteria using pattern recognition and machine learning. This offline analysis is accomplished with a digital microscope connected to a laptop computer running the Ubuntu* operating system and the Intel® Movidius™ Neural Compute Stick. After analysis, contamination sites are marked on a map in real time.

Background and History

Peter Ma, a prolific contributor in the Intel® AI Academy program, regularly participates in hackathons and has won awards in a number of them. “I think everything started as a kid; I've always been intrigued by new technologies,” Peter said.

Winning the Move Your App! Developer Challenge in 2010, a contest hosted by TEDprize, led to a speaking appearance at TEDGlobal and reinforced Peter’s desire to use technology to improve human lives. The contest was based on a challenge by a celebrity chef and restaurateur, to tackle child obesity.

Over several years, Peter has established an active consulting business around his design and development skills. “I build prototypes for different clients,” Peter said, “ranging from Fortune 500 to small startups. Outside of my consulting gigs, I attend a lot of hackathons and build out my own ideas. I built the Clean Water AI specifically for World Virtual GovHack, targeting the Water Safety and Food Security challenge.”

Based in Dubai, United Arab Emirates, the GovTechPrize offers awards annually in several different categories to acknowledge technology solutions that target pressing global challenges. The World Virtual GovHack was added to the awards roster, framed as a global virtual hackathon, to encourage students and startups to tackle difficult challenges through experimentation with advanced technologies.


Peter Ma

Figure 1. Peter Ma demonstrates the clean water test system.

“We originally started to work on this December of 2017,” Peter said, “specifically for World Virtual GovHack. I won first place and was presented USD 200,000 by His Highness Mansoor of Dubai at the awards ceremony in February 2018. This makes it possible to take the project much further. We are currently in the prototyping stage and working on the next iteration of the prototype so it can be in one single IoT device. I think in the world of innovation, there is never completion—only improvements from your last iteration.”

Peter’s success rate at hackathons is impressive, inspiring other projects, including Doctor Hazel, Vehicle Rear Vision, and Anti-Snoozer. “I think I do well in most hackathons,” he said, “because I focus mostly on how technologies can better people's lives—rather than just what technologies can do.”

Notable Project Milestones

  • Started development work in December 2017.
  • Submitted the project in February 2018 and won first prize in the World Virtual GovHack, receiving USD 200,000 that will help fund the next phase of the Clean Water AI project.
  • Began work on a prototype for a new version of the test system that can be embedded in a self-contained Internet of Things device.
  • Garnered a first place finish in the SAP Spatial Hackathon by mapping out a water system in San Diego to demonstrate how water contamination can be stopped once it has started.
  • Slated to present a demo at the O’Reilly AI Conference in New York in April 2018.

Peter Ma receives the top GovTechPrize

Figure 2. Peter Ma receives the top GovTechPrize in the Water Safety and Food Security category.

Every minute a newborn dies from infection caused by lack of safe water and an unclean environment.2

– World Health Organization, 2017

Enabling Technologies

The Clean Water AI project benefited from access to the Intel® AI DevCloud, a free hosting platform made available to Intel AI Academy members. Powered by Intel® Xeon® Scalable processors, the platform is optimized for deep learning training and inference compute needs. Peter took advantage of Intel AI DevCloud to train the AI model and Intel Movidius Neural Compute Stick to perform water testing in real time. The Neural Compute Stick supports both the Caffe* and TensorFlow* frameworks, popular with deep learning developers.

The Intel® Movidius™ Software Development Kit also figured heavily in the development, providing a streamlined mechanism to profile, tune, and deploy the convolutional neural network capabilities on the Neural Compute Stick. Because the Clean Water AI test system must be able to perform real-time analysis and identify contaminants without access to the cloud, the self-contained, AI-optimized features of the Neural Compute Stick are essential to the operation of the test system. The Neural Compute Stick is a compact, fanless device, the size of a typical thumb drive, with fast USB 3.0 throughput, making it an effective way to deploy efficient deep learning capabilities at the edge of an Internet of Things network.

“Intel provides both hardware and software needs in artificial intelligence—from training through deployment. For startups, it is relatively inexpensive to build the prototype. The AI is being trained through Intel AI DevCloud for free; anyone can sign up. The Intel Movidius Neural Compute Stick costs about USD 79, and it allows the AI to run in real time.”

– Peter Ma, Software Innovator, Intel AI Academy

The neural compute stick acts as an inference accelerator with the added advantage that it does not require an Internet link to operate. All of the data needed by the neural network is stored locally, which makes the rapid, real-time operation possible. Any test system dependent on accessing data from a remote server is going to be burdened by availability of connections (particularly in rural areas where the testing is very important), as well as potential service disruption and lag time in performing analyses. For developers that need more inference performance for an intensive application, up to four compute sticks can be combined at once for a given solution.

Clean Water AI Test System

The Clean Water AI test system is composed of simple, inexpensive, off-the-shelf components:

  • A digital microscope, available at many sources for USD 100 or less
  • A modestly equipped computer running the Ubuntu operating system
  • An Intel Movidius Neural Compute Stick running machine learning and AI in real time

The entire test system can be constructed for well under USD 500, making it within reach of organizations that cannot usually afford expensive traditional test systems.

Figure 3 shows the basic test setup.

Microscope, laptop, and compute stick

Figure 3. The basic test system—microscope, laptop, and compute stick—can be assembled for less than USD 500.

The convolutional neural network at the heart of the test system determines the shape, color, density, and edges of the bacteria. Identification at this point is limited to Escherichia coli (E. coli) and the bacterium that causes cholera, but because different types of bacteria have distinctive shapes and physical characteristics, the range of identification can be extended to many different types. Project goals on the near horizon include distinguishing between good microbes and harmful bacteria, detection of substances such as minerals, and satisfying the certification requirements necessary in different geographies.

To refine the approach and sharpen the precision of identification, Peter continues to train the model. Currently, the confidence level for distinguishing clean water from contaminated water is above 95 percent, and as high as 99 percent, and this is likely to improve further as additional images are added to the system and more training is performed.

In a video demonstration of the Clean Water AI test system, Peter uses the microscope to first capture an image of clean water and then compares that with a sample showing contaminated water, as shown in Figure 4. The AI immediately detects the harmful bacteria and can flag the contamination on a map. All of these activities are carried out in real time.

Contaminated water

Figure 4. Screenshot of a sampling of contaminated water and the map indicating the location.

E. coli bacteria, shown in the rendering in Figure 5, is typically present in contaminated water and can be accurately identified by the AI according to shape and size.

For more information about Peter Ma's Clean Water AI project, visit https://devmesh.intel.com/projects/ai-bacteria-classification.

E. coli bacteria

Figure 5. A rendering of E. coli bacteria, one of the most common and dangerous water contaminants.

AI is Changing the Landscape of Business and Science

Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations and corporations to uncover and advance solutions that solve major challenges.

For example, an engagement with NASA focused on sifting through many images of the moon and identifying different features, such as craters. By using AI and compute techniques, NASA was able to achieve its results from this project in two weeks rather than a couple of years.

The Intel AI portfolio includes:

Xeon inside

Intel Xeon Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.

Logos

Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.

Movidius chip

Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.

For more information, visit this portfolio page: https://ai.intel.com/technology

Rome fountain

“At Intel we have a pretty pure motivation: we want to change the face of computing and increase the capabilities of humanity and change every industry out there. AI today is really a set of tools. It allows us to sift through data in much more scalable ways, scaling our intelligence up. We want our machines to personalize and change and adapt to the way we shop and the way we interact with others.

There are already vast changes taking place, which are happening under the hood. Intel has a broad portfolio of products for AI. We start with the Intel Xeon Scalable processor, which is a general-purpose computing platform that also has very efficient inference for deep learning.”

– Naveen Rao, Intel VP and GM, Artificial Intelligence Products Group

 

The Promise of AI

The possibilities of AI are only beginning to be recognized and exploited. Intel AI Academy works collaboratively with leaders in this field and talented software developers and system architects exploring new solutions that promise to reshape life in today’s world. We invite interested, passionate innovators to join us in this effort and become part of an exciting community to make contributions and take advanced technology in new directions for the benefit of the global community.

Join today: https://software.intel.com/ai/sign-up

“I see AI playing a major part in helping governments and non government organizations in the future,” Peter said, “especially in terms of monitoring resources, such as ensuring water safety. AI can reduce costs and provide more accurate continuous monitoring than current systems. An AI device for water safety typically requires very little maintenance, because it will be based on optical readings, rather than chemical based.”

“We have the ability to provide clean water for every man, woman and child on the Earth. What has been lacking is the collective will to accomplish this. What are we waiting for? This is the commitment we need to make to the world, now.”3

– Jean-Michel Cousteau

Resources

Intel AI Academy
Clean Water Project in Intel DevMesh
Build an Image Classifier on the Intel Movidius Neural Compute Stick
Getting the Most Out of IA with Caffe* Deep Learning Framework
Rapid Waterborne Pathogen Detection with Mobile Electronics
Intel Movidius Neural Compute Stick
GovTech Prize
Artificial Intelligence: Teaching Machines to Learn Like Humans
Intel Processors for Deep Learning Training

 

1 http://richiespicks.pbworks.com/w/page/65740422/MOUNTAINS%20BEYOND%20MOUNTAINS

2 http://www.who.int/mediacentre/factsheets/fs391/en/

3 http://www.architectsofpeace.org/architects-of-peace/jean-michel-cousteau?page=2

Troubleshooting Visual Studio Command Prompt Integration Issue


Issue Description

Nmake and ifort are not recognized from a command window; however, using Intel Fortran under Visual Studio works perfectly.

Troubleshooting

Follow the checklist below to troubleshoot Visual Studio command environment issues:

1. Verify whether ifort and nmake are installed correctly:

    For Visual Studio 2017, nmake is installed at:

    C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.10.25017\bin\HostX64\x64\nmake.exe

    Find this by running the commands below on a system where ifort or nmake is set up correctly:

> where nmake
  > where ifort

    Also check whether the location is included in the PATH environment variable:

> echo %PATH%

2. If nmake can be found, verify that the Visual Studio setup script runs properly.
    Start a cmd window and run the Visual Studio setup script manually:

> "C:\Program Files (x86)\Microsoft Visual Studio\2017 \Professional\VC\Auxiliary\Build\vcvars64.bat"

    The expected output is shown below:

    vscmd_setup.png

3. If nmake cannot be found, your Visual Studio installation is likely incomplete. Please try re-installing Visual Studio; instructions can be found in the articles below:
 
4. Got an error in step 2 like the following?
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat"
  \Common was unexpected at this time.
If yes, try debugging the setup script by setting the VSCMD_DEBUG environment variable:
> set VSCMD_DEBUG=3
Run the setup script again and redirect the output to a log file:
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat" > setup.log 2>&1

5. If you got the same error as above, there are some references from the Visual Studio community:

    The solution is to remove all quotes from the PATH environment variable value.
 
6. If you got a different error, again capture the expected output from a system that runs the script correctly and compare it with your current one. This will help you locate which command in the setup script triggers the error.
    You may also consider reporting such issues to the Visual Studio community directly at

 

Merging MPI Intercommunicators


Sometimes you may have completely separate instances for an MPI job.  You can connect these separate instances using techniques such as MPI_Comm_accept/MPI_Comm_connect, which creates an intercommunicator.  But in more complex situations, you may have a significant number of separate intercommunicators, and want to send data between arbitrary ranks, or perform collective operations across all ranks, or other functions which are most efficiently handled over a single intracommunicator.

In order to handle this situation, the MPI standard includes a function call MPI_Intercomm_merge.  This function takes three arguments.  The first argument is the intercommunicator to be merged.  The second argument is a boolean argument indicating whether the ranks will be numbered in the high range (true) or the low range (false) in the resulting intracommunicator.  The third argument is a pointer to the new intracommunicator.  When you call MPI_Intercomm_merge, you must call it from every rank in both sides of the intercommunicator, and all ranks on a particular side must have the same high/low argument.  The two sides can have the same or different values.  If the same, the resulting rank order will be arbitrary.  If the two are different, you will end up with the ranks with the low (false) argument having lower rank numbers, and the ranks with the high (true) argument having higher rank numbers.  For example, if you have an intercommunicator with 2 ranks on side A and 3 ranks on side B, and you call MPI_Intercomm_merge with false on side A and true on side B, the side A ranks will have new ranks 0 and 1, and the side B ranks will have rank numbers 2, 3, and 4.

In a more complex situation, you may need to merge multiple intercommunicators.  This can be done in one of several ways, depending on how your ranks join the intercommunicator.  If you have separate ranks joining independently, you can merge them as each joins, and use the resulting intracommunicator as the base intracommunicator for the newly joining ranks.

MPI_Comm_accept(port, MPI_INFO_NULL, 0, localcomm, &intercomm[0]);
MPI_Intercomm_merge(intercomm[0], false, &localcomm);

This will update localcomm to include all ranks as they join.  You can also merge them after all have joined.  This will require multiple steps of creating new intercommunicators to merge, but can also lead to the same end result.
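
For completeness, here is a minimal sketch of the connecting side. It assumes the accepting side opened a port with MPI_Open_port and that the port name reaches this program out of band (in this hypothetical example, as a command-line argument):

#include <mpi.h>
#include <cstring>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Assumption: the port name published by the accepting side is passed as argv[1].
    char port[MPI_MAX_PORT_NAME] = {0};
    std::strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);

    MPI_Comm intercomm, localcomm;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);

    // Pass high = 1 (true) so the joining ranks receive the higher rank numbers
    // in the merged intracommunicator.
    MPI_Intercomm_merge(intercomm, 1, &localcomm);

    // localcomm now spans both sides and can be used for point-to-point and
    // collective operations.
    MPI_Finalize();
    return 0;
}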

Once this is done, you can now use collectives across the new intracommunicator as if you had started all ranks under the same intracommunicator originally.

Getting Started with Parallel STL


Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the working draft N4659 for the next version of the C++ standard, commonly called C++17. The implementation also supports the unsequenced execution policy specified in the ISO* C++ working group paper P0076R3.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms for Intel® processors. For sequential execution, it relies on an available implementation of the C++ standard library.

Parallel STL is available as a part of Intel® Parallel Studio XE and Intel® System Studio.

 

Prerequisites

To use Parallel STL, you must have the following software installed:

  • C++ compiler with:
    • Support for C++11
    • Support for OpenMP* 4.0 SIMD constructs
  • Intel® Threading Building Blocks (Intel® TBB) 2018

The latest version of the Intel® C++ Compiler is recommended for better performance of Parallel STL algorithms, compared to previous compiler versions.

To build an application that uses Parallel STL on the command line, you need to set the environment variables for compilation and linkage. You can do this by calling suite-level environment scripts such as compilervars.{sh|csh|bat}, or you can set just the Parallel STL environment variables by running pstlvars.{sh|csh|bat} in <install_dir>/{linux|mac|windows}/pstl/bin.

<install_dir> is the installation directory. By default, it is:

For Linux* and macOS*:

  • For super-users:      /opt/intel/compilers_and_libraries_<version>
  • For ordinary users:  $HOME/intel/compilers_and_libraries_<version>

For Windows*:

  • <Program Files>\IntelSWTools\compilers_and_libraries_<version>

 

Using Parallel STL

Follow these steps to add Parallel STL to your application:

  1. Add the <install_dir>/pstl/include folder to the compiler include paths. You can do this by calling the pstlvars script.

  2. Add #include "pstl/execution" to your code. Then add a subset of the following set of lines, depending on the algorithms you intend to use:

    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"
  3. When using algorithms and execution policies, specify the namespaces std and std::execution, respectively. See the 'Examples' section below.
  4. For any of the implemented algorithms, pass one of the values seq, unseq, par or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meaning:

     

    • seq: Sequential execution.
    • unseq: Try to use SIMD. This policy requires that all functions provided are SIMD-safe.
    • par: Use multithreading.
    • par_unseq: Combined effect of unseq and par.

     

  5. Compile the code as C++11 (or later) and using compiler options for vectorization:

    • For the Intel® C++ Compiler:
      • For Linux* and macOS*: -qopenmp-simd or -qopenmp
      • For Windows*: /Qopenmp-simd or /Qopenmp
    • For other compilers, find a switch that enables OpenMP* 4.0 SIMD constructs.

    To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are:

    • For Linux* and macOS*: -xHOST, -xSSE4.1, -xCORE-AVX2, -xMIC-AVX512.
    • For Windows*: /QxHOST, /QxSSE4.1, /QxCORE-AVX2, /QxMIC-AVX512.
    If using a different compiler, see its documentation.

     

  6. Link with the Intel TBB dynamic library for parallelism. For the Intel C++ Compiler, use the options:

    • For Linux* and macOS*: -tbb
    • For Windows*: /Qtbb (optional, this should be handled by #pragma comment(lib, <libname>))
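
Putting steps 5 and 6 together, a possible command line with the Intel C++ Compiler on Linux might look like the following sketch (my_app.cpp is a hypothetical source file; substitute the target-platform switch that matches your hardware):

icpc -std=c++11 -qopenmp-simd -xHOST -tbb my_app.cpp -o my_app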

Version Macros

The following macros relate to versioning. You should not redefine these macros.

PSTL_VERSION

Current Parallel STL version. The value is a decimal numeral of the form xyy where x is the major version number and yy is the minor version number.

PSTL_VERSION_MAJOR

PSTL_VERSION/100; that is, the major version number.

PSTL_VERSION_MINOR

PSTL_VERSION - PSTL_VERSION_MAJOR * 100; that is, the minor version number.
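
As a quick illustration, the version macros can be inspected at run time. This is a minimal sketch, assuming the macros become available once any Parallel STL header is included:

#include <cstdio>
#include "pstl/execution"

int main()
{
    // Print the Parallel STL version reported by the library headers.
    std::printf("Parallel STL version %d (major %d, minor %d)\n",
                PSTL_VERSION, PSTL_VERSION_MAJOR, PSTL_VERSION_MINOR);
    return 0;
}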

Macros

PSTL_USE_PARALLEL_POLICIES

This macro controls the use of parallel policies.

When set to 0, it disables the par and par_unseq policies, making their use a compilation error. This is recommended for code that only uses vectorization with the unseq policy, to avoid a dependency on the Intel® TBB runtime library.

When the macro is not defined (default) or evaluates to a non-zero value, all execution policies are enabled.

PSTL_USE_NONTEMPORAL_STORES

This macro enables the use of #pragma vector nontemporal in the algorithms std::copy, std::copy_n, std::fill, std::fill_n, std::generate, std::generate_n with the unseq policy. For further details about the pragma, see the User and Reference Guide for the Intel® C++ Compiler at https://software.intel.com/en-us/node/524559.

If the macro evaluates to a non-zero value, the use of #pragma vector nontemporal is enabled.

When the macro is not defined (default) or set to 0, the macro does nothing.
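
For example, a translation unit that relies only on vectorization could disable the parallel policies as in the following sketch (an assumption here is that the macro must be defined before any Parallel STL header is included):

// Disable par and par_unseq so this translation unit does not depend on the Intel TBB runtime.
#define PSTL_USE_PARALLEL_POLICIES 0
#include "pstl/execution"
#include "pstl/algorithm"

void scale(float* a, int n)
{
    // Only the seq and unseq policies remain usable in this configuration.
    std::transform(std::execution::unseq, a, a + n, a,
                   [](float x) { return 2.0f * x; });
}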

 

Examples

Example 1

The following code calls vectorized copy:

#include "pstl/execution"
#include "pstl/algorithm"
void foo(float* a, float* b, int n) {
    std::copy(std::execution::unseq, a, a+n, b);
}

Example 2

This example calls the parallelized version of fill_n:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

int main()
{
    std::vector<int> data(10000000);
    std::fill_n(std::execution::par_unseq, data.begin(), data.size(), -1);  // Fill the vector with -1

    return 0;
}

Implemented Algorithms

Parallel STL supports all of the aforementioned execution policies only for the algorithms listed below. Adding a policy argument to any of the rest of the C++ standard library algorithms will result in sequential execution. Each algorithm below links to its page at cppreference.com.

  • adjacent_find: http://en.cppreference.com/w/cpp/algorithm/adjacent_find
  • all_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • any_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • copy: http://en.cppreference.com/w/cpp/algorithm/copy
  • copy_if: http://en.cppreference.com/w/cpp/algorithm/copy
  • copy_n: http://en.cppreference.com/w/cpp/algorithm/copy_n
  • count: http://en.cppreference.com/w/cpp/algorithm/count
  • count_if: http://en.cppreference.com/w/cpp/algorithm/count
  • destroy: http://en.cppreference.com/w/cpp/memory/destroy
  • destroy_n: http://en.cppreference.com/w/cpp/memory/destroy_n
  • equal: http://en.cppreference.com/w/cpp/algorithm/equal
  • exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/exclusive_scan
  • fill: http://en.cppreference.com/w/cpp/algorithm/fill
  • fill_n: http://en.cppreference.com/w/cpp/algorithm/fill_n
  • find: http://en.cppreference.com/w/cpp/algorithm/find
  • find_end: http://en.cppreference.com/w/cpp/algorithm/find_end
  • find_if: http://en.cppreference.com/w/cpp/algorithm/find
  • find_if_not: http://en.cppreference.com/w/cpp/algorithm/find
  • for_each: http://en.cppreference.com/w/cpp/algorithm/for_each
  • for_each_n: http://en.cppreference.com/w/cpp/algorithm/for_each_n
  • generate: http://en.cppreference.com/w/cpp/algorithm/generate
  • generate_n: http://en.cppreference.com/w/cpp/algorithm/generate_n
  • inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/inclusive_scan
  • is_sorted: http://en.cppreference.com/w/cpp/algorithm/is_sorted
  • is_sorted_until: http://en.cppreference.com/w/cpp/algorithm/is_sorted_until
  • max_element: http://en.cppreference.com/w/cpp/algorithm/max_element
  • merge: http://en.cppreference.com/w/cpp/algorithm/merge
  • min_element: http://en.cppreference.com/w/cpp/algorithm/min_element
  • minmax_element: http://en.cppreference.com/w/cpp/algorithm/minmax_element
  • mismatch: http://en.cppreference.com/w/cpp/algorithm/mismatch
  • move: http://en.cppreference.com/w/cpp/algorithm/move
  • none_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • partition_copy: http://en.cppreference.com/w/cpp/algorithm/partition_copy
  • reduce: http://en.cppreference.com/w/cpp/algorithm/reduce
  • remove_copy: http://en.cppreference.com/w/cpp/algorithm/remove_copy
  • remove_copy_if: http://en.cppreference.com/w/cpp/algorithm/remove_copy
  • replace_copy: http://en.cppreference.com/w/cpp/algorithm/replace_copy
  • replace_copy_if: http://en.cppreference.com/w/cpp/algorithm/replace_copy
  • search: http://en.cppreference.com/w/cpp/algorithm/search
  • search_n: http://en.cppreference.com/w/cpp/algorithm/search_n
  • sort: http://en.cppreference.com/w/cpp/algorithm/sort
  • stable_sort: http://en.cppreference.com/w/cpp/algorithm/stable_sort
  • transform: http://en.cppreference.com/w/cpp/algorithm/transform
  • transform_exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_exclusive_scan
  • transform_inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
  • transform_reduce: http://en.cppreference.com/w/cpp/algorithm/transform_reduce
  • uninitialized_copy: http://en.cppreference.com/w/cpp/memory/uninitialized_copy
  • uninitialized_copy_n: http://en.cppreference.com/w/cpp/memory/uninitialized_copy_n
  • uninitialized_default_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct
  • uninitialized_default_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct_n
  • uninitialized_fill: http://en.cppreference.com/w/cpp/memory/uninitialized_fill
  • uninitialized_fill_n: http://en.cppreference.com/w/cpp/memory/uninitialized_fill_n
  • uninitialized_move: http://en.cppreference.com/w/cpp/memory/uninitialized_move
  • uninitialized_move_n: http://en.cppreference.com/w/cpp/memory/uninitialized_move_n
  • uninitialized_value_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct
  • uninitialized_value_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct_n
  • unique_copy: http://en.cppreference.com/w/cpp/algorithm/unique_copy

Known limitations

Parallel and vector execution is only supported for a subset of the aforementioned algorithms, and only when random access iterators are provided; otherwise, execution remains serial.

Legal Information

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© Intel Corporation


Using Intel® Compilers to Mitigate Speculative Execution Side-Channel Issues


Table of Contents

  1. Disclaimers
  2. Introduction
  3. Mitigating Bounds Check Bypass (Spectre Variant 1)
  4. Mitigating Branch Target Injection (Spectre Variant 2)
  5. How to Obtain the Latest Intel® C++ Compiler and Intel® Fortran Compiler
  6. Conclusion and Further Reading

Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel provides these materials as-is, with no express or implied warranties.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © [2018], Intel Corporation. All rights reserved.

Introduction

Side channel methods are techniques that may allow an attacker to obtain secret or privileged information through observing the system that they would not normally be able to access, such as measuring microarchitectural properties about the system. For background information relevant to this article, refer to the overview in Intel Analysis of Speculative Execution Side Channels. This article describes Intel® C++ Compiler support and Intel® Fortran Compiler support for speculative execution side channel mitigations.

Mitigating Bounds Check Bypass (Spectre Variant 1)

Please read Intel's Analysis of Speculative Execution Side Channels for details, exploit conditions, and mitigations for the exploit known as bounds check bypass (Spectre variant 1).

One mitigation for Spectre variant 1 is through use of the LFENCE instruction. The LFENCE instruction does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. _mm_lfence() is a compiler intrinsic or assembler inline that issues an LFENCE instruction and also ensures that compiler optimizations do not move memory references across that boundary. Inserting an LFENCE between a bounds check condition and memory loads helps ensure that the loads do not occur until the bounds check condition has actually been completed.

The Intel C++ Compiler and Intel Fortran Compiler both allow programmers to insert LFENCE instructions, which can be used to help mitigate bounds check bypass (Spectre variant 1).

LFENCE in C/C++

You can insert LFENCE instructions in a C/C++ program as shown in the example below:

    #include <intrin.h>
    #pragma intrinsic(_mm_lfence)

    if (user_value >= LIMIT)
    {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    else
    {    
        _mm_lfence();	/* manually inserted by developer */
        x = table[user_value];
        node = entry[x];
    }

LFENCE in Fortran

You can insert an LFENCE instruction in Fortran applications as shown in the example below.
Implement the following subroutine, which calls _mm_lfence() intrinsics:

    interface 
        subroutine for_lfence() bind (C, name = "_mm_lfence") 
            !DIR$ attributes known_intrinsic, default :: for_lfence
        end subroutine for_lfence
    end interface
 
    if (untrusted_index_from_user .le. iarr1%length) then
        call for_lfence()
        ival = iarr1%data(untrusted_index_from_user)
        index2 = (IAND(ival,1)*z'100') + z'200'    
        if (index2 .le. iarr2%length) ival2 = iarr2%data(index2)
    endif

The LFENCE intrinsic is supported in the following Intel compilers:

  • Intel C++ Compiler 8.0 and later for Windows*, Linux*, and macOS*
  • Intel Fortran Compiler 8.0 and later for Windows, Linux and macOS

 

Mitigating Branch Target Injection (Spectre Variant 2)

Intel's whitepaper on Retpoline: A Branch Target Injection Mitigation discusses the details, exploit conditions, and mitigations for the exploit known as branch target injection (Spectre variant 2). While there are a number of possible mitigation techniques for this side channel method, the mitigation technique described in that document is known as retpoline, which is a technique employed by the Intel® C++ and Fortran compilers.

The Intel C++ and Fortran compilers have command line options that can be used to help mitigate branch target injection (Spectre variant 2). These options replace all indirect branches (calls/jumps) with the retpoline code sequence. The thunk-inline option inserts a full retpoline sequence at each indirect branch that needs mitigation. The thunk-extern option reduces code size by sharing the retpoline sequence.

The compiler options implemented are:

  • -mindirect-branch=thunk-inline for Linux or macOS
  • -mindirect-branch=thunk-extern for Linux or macOS
  • /Qindirect-branch:thunk-inline for Windows
  • /Qindirect-branch:thunk-extern for Windows
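
For illustration only, compiling a hypothetical source file app.c with the inline-thunk variant might look like this (icc and icl are the Intel C++ Compiler drivers on Linux/macOS and Windows, respectively):

  • For Linux or macOS: icc -O2 -mindirect-branch=thunk-inline app.c
  • For Windows: icl /O2 /Qindirect-branch:thunk-inline app.c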

The command line option is included in the following Intel compilers:

  • Intel® C++ Compiler 18.0 update 2 and later for Windows, Linux, and macOS
  • Intel® Fortran Compiler 18.0 update 2 and later for Windows, Linux, and macOS

Refer to the Intel Compilers - Supported compiler versions article for updates on the availability of mitigation options in supported Intel Compilers.

How to Obtain the Latest Intel® C++ Compiler and Intel® Fortran Compiler

The Intel® C++ Compiler is distributed as part of the Intel® Parallel Studio XE and Intel® System Studio tool suites. The Intel® Fortran Compiler is distributed as part of Intel® Parallel Studio XE 2018. You can download these from the Intel Registration Center. Intel® Parallel Studio XE 2018 update 2 or later and Intel® System Studio 2018 update 1 or later contain support for retpoline. Refer to the Intel Compilers - Supported compiler versions article for updates on the availability of mitigation options in supported Intel Compilers.

Conclusion and Further Reading

Visit the Intel Press Room for the latest updates regarding the Spectre and Meltdown issues, and Intel’s Side Channel Security Support website for additional software-specific, up-to-date information. You can find more detailed explanations of Speculative Execution Side-Channel Mitigations and Intel’s Mitigation Overview for Potential Side-Channel Cache Exploits in Linux* on our Side-Channel Security Support website.

Refer to our support site for support options if you experience any issues.

Intel continues to work on improving Intel software development products for the identified security issues. We will continue to revise this article with Intel® C++ Compiler and Intel Fortran Compiler product updates as they become available.

Intel® Computer Vision SDK Model Optimizer Guide


Prerequisites

The Model Optimizer requires:

  • Python* 3 or newer
  • In some cases, you must install a framework, such as Caffe*, TensorFlow*, or MXNet*.

How to Configure the Model Optimizer for a Framework

If you are not using a layered model, you do not need to use or configure a framework. In that case, you can disregard these steps.

These instructions assume you have installed the Caffe, TensorFlow, or MXNet framework, and that you have a basic knowledge of how the Model Optimizer works.

Before you can use the Model Optimizer to convert your trained network model to the Intermediate Representation file format required by the Inference Engine, you must configure the Model Optimizer for the framework that was used to train the model. This section tells you how to configure the Model Optimizer either through scripts or using a manual process.

Using Configuration Scripts

You can either configure all three frameworks at the same time, or configure an individual framework. The scripts install all required dependencies.

To configure all three frameworks: Go to the <INSTALL_DIR>/model_optimizer/install_prerequisites folder and run:

  • For Linux*:
    install_prerequisites.sh
  • For Windows*:
    install_prerequisites.bat

To configure a specific framework: Go to the <INSTALL_DIR>/model_optimizer/install_prerequisites folder and run:

  • For Caffe on Linux:
    install_prerequisites_caffe.sh
  • For Caffe on Windows:
    install_prerequisites_caffe.bat
  • For TensorFlow on Linux:
    install_prerequisites_tf.sh
  • For TensorFlow on Windows:
    install_prerequisites_tf.bat
  • For MXNet on Linux:
    install_prerequisites_mxnet.sh
  • For MXNet on Windows:
    install_prerequisites_mxnet.bat

Configuring Manually

If you prefer, you can manually configure the Model Optimizer for your selected framework. This option does not install all of the required dependencies.

  1. Go to the Model Optimizer folder:
    • For Caffe:
      cd <INSTALL_DIR>/model_optimizer/model_optimizer_caffe
    • For TensorFlow: 
      cd <INSTALL_DIR>/model_optimizer/model_optimizer_tensorflow
    • For MXNet:
      cd <INSTALL_DIR>/model_optimizer/model_optimizer_mxnet
  2. Recommended for all global Model Optimizer dependency installations: Create and activate a virtual environment. While not required, this option is strongly recommended since the virtual environment creates a Python* sandbox, and dependencies for the Model Optimizer do not influence the global Python configuration, installed libraries, or other components. In addition, a flag ensures that system-wide Python libraries are available in this sandbox:
    • Create a virtual environment: 
      virtualenv -p /usr/bin/python3.5 .env3 --system-site-packages
    • Activate the virtual environment:
      source .env3/bin/activate
  3. Install all dependencies or only the dependencies for a specific framework:
    • To install dependencies for all frameworks:
      pip3 install -r requirements.txt 
    • To install dependencies only for Caffe:
      pip3 install -r requirements_caffe.txt
    • To install dependencies only for TensorFlow:
      pip3 install -r requirements_tensorflow.txt
    • To install dependencies only for MXNet:
      pip3 install -r requirements_mxnet.txt

Using an Incompatible Caffe Distribution

These steps apply to situations in which your model has custom layers, but you do not have a compatible Caffe distribution installed. In addition to this section, you should also read the information about model layers in the Intel® CV SDK Overview.

This section includes terms that might be unfamiliar:

Term: Description
Proto file: A proto file contains data and services and is compiled with protoc. The proto file is created in the format defined by the associated protocol buffer.
Protobuf: A protobuf is a library for protocol buffers.
Protoc: A compiler that is used to generate code from proto files.
Protocol buffer: Data structures are saved in, and communicated through, protocol buffers. The primary purpose of protocol buffers is network communication. Protocol buffers are used because they are simple and fast.

Many distributions of Caffe are available other than the recommended Intel distribution. Each of these distributions may use a different proto version. If you installed one of these non-Intel distributions, the Model Optimizer uses the Berkeley Vision and Learning Center* (BVLC) Caffe proto, which is distributed with the Model Optimizer. 


Intermission: About the BVLC Caffe

The Model Optimizer contains a proto parser file called caffe_pb2.py that is generated by a protoc with a protobuf, using the Berkeley Vision and Learning Center* (BVLC) Caffe proto. The proto parser loads Caffe models into memory and parses the models according to rules described in the proto file.


If your model was trained in a Caffe distribution other than the Intel version, your proto is probably different from the default BVLC proto, preventing the Model Optimizer from loading your model. As a possible solution, you can generate a parser specifically for your distribution's caffe.proto, assuming your distribution of Caffe includes this file. This is not a guarantee that the Model Optimizer will work with your Caffe distribution, since it is not possible to account for every possible distribution.

Note: The script that follows replaces a file named caffe_pb2.py in the folder MODEL_OPTIMIZER_ROOT/mo/front/caffe/proto/. You might want to back up the existing file before running the script.

  1. Use this script to generate a parser specifically for your caffe.proto file:
    cd MODEL_OPTIMIZER_ROOT
    cd mo/front/caffe/proto/
    python3 generate_caffe_pb2.py --input_proto ${PATH_TO_CUSTOM_CAFFE}/src/caffe/proto/caffe.proto
  2. Check the date and time stamp on the caffe_pb2.py file in the MODEL_OPTIMIZER_ROOT/mo/front/caffe/proto/ folder. If the new file is not there, you might have run the script from a different location and you need to copy the new file to this folder.

When you run the Model Optimizer the new parser loads the model.

 

How to Register a Custom Layer in the Model Optimizer

If you do not know what a custom layer is, or why you might need to register a custom layer, read about model layers in the Intel® CV SDK Overview.

This example uses a .prototext file that describes a well-known topology. The .prototext file looks like this:

name: "my_net"
input: "data"
input_shape {
  dim: 1
  dim: 3
  dim: 227
  dim: 227
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64
    kernel_size: 3
    stride: 2
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "reshape_conv1"
  type: "Reshape"
  bottom: "conv1"
  top: "reshape_conv1"
  reshape_param {
    shape {
      dim: 3
      dim: 2
      dim: 0
      dim: 1
    }
    # omitting other params
  }
}
  1. To customize your layer, edit the prototext file to replace some of the information with your layer information:
    layer {
      name: "reshape_conv1"
      type: "CustomReshape"
      bottom: "conv1"
      top: "reshape_conv1"
      custom_reshape_param {
        shape {
          dim: 3
          dim: 2
          dim: 0
          dim: 1
        }
      }
    }
  2. Run the Model Optimizer against your trained model and watch for error messages that indicate the Model Optimizer was not able to load the model. You might see one of these messages:
    [ ERROR ]: Current caffe.proto does not contain field "custom_reshape_param".
    [ ERROR ]: Unable to create ports for node with id

What Went Wrong

[ ERROR ]: Current caffe.proto does not contain field "custom_reshape_param"

This error message means CustomReshape is not registered as a layer for Caffe. The example error message refers to custom_reshape_param; your error might refer to a different field.

This error occurs because the Model Optimizer uses a protobuf library to parse and load Caffe models. The library needs a grammar file to parse, plus a parser generated from that grammar. As a fallback, the Model Optimizer uses the parser generated for the Caffe-specific .proto file that it ships with. If you have Caffe installed with the Python interface available, make sure the version of the .proto file in the src/caffe/proto folder matches the version of Caffe that created the model.

For the full correction, build the version of Caffe that was used to create the model. As a temporary measure, you can use a Python extension to work with your custom layers without building Caffe. To use the temporary correction, add the layer description to the caffe.proto file and generate the parser for it.

For example, to add the description of the CustomReshape layer, which is an artificial layer and therefore not in any caffe.proto file:

  1. Add these lines to the caffe.proto file:
    // Add this line to the end of the LayerParameter message. The field ID (546 here) can be any number not already used in caffe.proto.
      optional CustomReshapeParameter custom_reshape_param = 546;
    }
    // Add these lines to the end of the file to describe the contents of the new parameter.
    message CustomReshapeParameter {
      optional BlobShape shape = 1; // Use the same parameter type as some other Caffe layers
    }
  2. Generate a new parser:
    cd ROOT_MO_DIR/mo/front/caffe/proto
    python3 generate_caffe_pb2.py --input_proto PATH_TO_PROTO/caffe.proto

The Model Optimizer can now load the model into memory and work with any extensions you have.


[ ERROR ]: Unable to create ports for node with id

This message means the Model Optimizer does not know how to infer the shape of the layer.

To localize the scope, compile the list of custom layers that are:

  • In the topology
  • Not in the list of supported layers for the target framework

 

How to Register a Custom Layer as a Model Optimizer Extension

 

 

How to Convert a Model to Intermediate Representation

The Inference Engine requires the network Intermediate Representation (IR) that is produced by Model Optimizer. The network can be trained with Caffe*, TensorFlow*, or MXNet*.

To convert a model, run the mo.py script from the <INSTALL_DIR>/model_optimizer directory and specify the path to the input model file: python3 mo.py --input_model INPUT_MODEL

The mo.py script provides the universal entry point that can deduce the framework that has produced the input model by a standard extension of the model file:

  • .pb — TensorFlow models
  • .params — MXNet models
  • .caffemodel — Caffe models

If the model files do not have standard extensions, you can use the --framework {tf,caffe,mxnet} option to specify the framework type explicitly.

For example, the following commands are equivalent:

python3 mo.py --input_model /user/models/model.pb
python3 mo.py --framework tf --input_model /user/models/model.pb
python3 mo.py --input_model INPUT_MODEL

To adjust the conversion process, the Model Optimizer additionally provides a wide list of conversion parameters: general (such as path to the model file, model name, output directory, etc.) and framework-specific parameters.

For the full list of additional options, run

python3 mo.py -h

Converting a Model Using General Conversion Parameters

Converting a Caffe* Model

Converting a TensorFlow* Model

Converting an MXNet* Model

 

Frequently Asked Questions

 

Helpful Links

Note: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/view-all

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used with permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

Using the Caffe* Framework with the Intel® Computer Vision SDK


Using Caffe* with the Intel® Computer Vision SDK (Intel® CV SDK)

Caffe* is one of the framework options you have when using a layered model with the Model Optimizer. You will need the steps in this document if:

  • Your model topology contains layers that are not implemented in Model Optimizer AND
  • You decide not to register these unknown layers as custom operations.

To use this option, you must install the Caffe framework with all relevant dependencies, and then compile and link it. You must also make a shared library named libcaffe.so available in the CAFFE_HOME/build/lib directory.

Topologies Supported by the Model Optimizer with the Caffe Framework

  • Classification models:
    • AlexNet
    • VGG-16, VGG-19
    • SqueezeNet v1.0, SqueezeNet v1.1
    • ResNet-50, ResNet-101, ResNet-152
    • Inception v1, Inception v2, Inception v3, Inception v4
    • CaffeNet
    • MobileNet
  • Object detection models:
    • SSD300-VGG16, SSD500-VGG16
    • Faster-RCNN
    • Yolo v2, Yolo Tiny
  • Face detection models:
    • VGG Face
  • Semantic segmentation models:
    • FCN8

Install the Caffe Framework

In this guide, the Caffe* installation directory is referred to as CAFFE_HOME, and the Model Optimizer installation directory is referred to as MO_DIR.

The installation path of the Model Optimizer depends on whether you use the Intel® CV SDK or the Deep Learning Deployment Toolkit, and on whether you run the installation command with sudo. In all cases, this path is referred to as MO_DIR.

To install Caffe:

  1. Set these environment variables:
    export MO_DIR=<PATH_TO_MO_INSTALL_DIR>
    export CAFFE_HOME=<PATH_TO_YOUR_CAFFE_DIR>
  2. Go to the Model Optimizer directory:
    cd $MO_DIR/model_optimizer_caffe/
  3. Install the Caffe dependencies, such as Git*, CMake*, and GCC*:
    cd install_prerequisites/
    ./install_Caffe_dependencies.sh
  4. Apply the Model Optimizer adapters:
    clone_patch_build_Caffe.sh 
  5. Optional: By default, the Caffe installation installs BVLC Caffe from the master branch of the official repository. If you want to install a different version of Caffe, edit the clone_patch_build_Caffe.sh script by changing these lines:
    CAFFE_REPO="https://github.com/BVLC/caffe.git" # link to the repository with Caffe* distribution
    CAFFE_BRANCH=master # branch to be taken from the repository
    CAFFE_FOLDER=`pwd`/caffe # where to clone the repository on your local machine
    CAFFE_BUILD_SUBFOLDER=build # name of the directory required for building Caffe* 
  6. Install Caffe*. To launch the installation, run the following command:
    ./clone_patch_build_Caffe.sh 

NOTE: If you experience problems with the hdf5 library while building Caffe on Ubuntu* 16.04, TBD.

The Caffe framework is installed. Continue with the next section to build the Caffe framework.

Build the Caffe Framework

  1. Build Caffe with Python 3.5:
    export CAFFE_HOME=PATH_TO_CAFFE
    cd $CAFFE_HOME
    rm -rf  ./build
    mkdir ./build
    cd ./build
    cmake -DCPU_ONLY=ON -DOpenCV_DIR=<your opencv install dir> -DPYTHON_EXECUTABLE=/usr/bin/python3.5 ..
    make all # also builds pycaffe
    make install
    make runtest # optional
  2. Add the Caffe Python directory to PYTHONPATH to let it be imported from the Python program:
    export PYTHONPATH=$CAFFE_HOME/python:$PYTHONPATH
  3. Confirm the installation and build worked correctly:
    python3
    import caffe

The Caffe framework is installed and built. Continue to the next section to build a protobuf library.

Implement the protobuf Library

The Model Optimizer uses the protobuf library to load the trained Caffe* model. By default, the library uses the pure Python* implementation, which is slow. These steps enable the faster C++ implementation of the protobuf library on Windows* or Linux*.

Implementing the protobuf Library on Linux*

On Linux, the implementation is completed by setting an environment variable:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp

Implementing the protobuf Library on Windows*

Steps TBD - original content is probably incorrect. The commands are for Linux. See the file ConfigMOForCaffe.html

  1. Clone protobuf and go to its directory:
    git clone https://github.com/google/protobuf.git
    cd protobuf
  2. Create a Visual Studio* solution file:
    C:\Path\to\protobuf\cmake\build>mkdir solution
    C:\Path\to\protobuf\cmake\build>cd solution
    C:\Path\to\protobuf\cmake\build\solution>cmake -G "Visual Studio 12 2013 Win64" ../..
  3. Change the runtime library option for libprotobuf and libprotobuf-lite:
    • Open the project's Property Pages dialog box.
    • Expand the C/C++ tab.
    • Select the Code Generation property page.
    • Set the Runtime Library property to Multi-threaded DLL (/MD).
    • Build the libprotoc, protoc, libprotobuf, and libprotobuf-lite projects for the Release configuration. 
  4. Add the build output directory to the PATH environment variable:
    set PATH=%PATH%;C:\Path\to\protobuf\cmake\build\solution\Release
  5. Go to the python directory:
    cd C:\Path\to\protobuf\python
  6. Use a text editor to open setup.py and change these options:
    • Change libraries = ['protobuf']
      to libraries = ['libprotobuf', 'libprotobuf-lite']
    • Change extra_objects = ['../src/.libs/libprotobuf.a', '../src/.libs/libprotobuf-lite.a']
      to extra_objects = ['../cmake/build/solution/Release/libprotobuf.lib', '../cmake/build/solution/Release/libprotobuf-lite.lib']
  7. Build the Python package with the C++ implementation:
    python setup.py build --cpp_implementation
  8. Install the Python package with the C++ implementation:
    python -m easy_install dist/protobuf-3.5.1-py3.5-win-amd64.egg
  9. Set an environment variable to boost the protobuf performance:
    set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp

You are ready to use Caffe with your trained models. Your next step is to use the Intel® Computer Vision SDK Model Optimizer Guide.

 

Helpful Links

Note: Links open in a new window.

Intel® Computer Vision SDK Model Optimizer Guide: https://software.intel.com/en-us/articles/CVSDK-ModelOptimizer

Intel® Computer Vision SDK Inference Engine Guide: https://software.intel.com/en-us/articles/CVSDK-InferEngine

Intel® Computer Vision SDK Overview: https://software.intel.com/en-us/articles/CVSDK-Overview

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/view-all

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used with permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

Why and How to Replace Perl Compatible Regular Expressions (PCRE) with Hyperscan


Introduction to PCRE and Hyperscan

Perl Compatible Regular Expressions (PCRE) is a widely used regular expression matching library written in the C language, inspired by the regular expression capabilities of the Perl programming language. Its syntax is much more powerful and flexible than that of many other regular expression libraries, such as the Portable Operating System Interface for UNIX* (POSIX) regular expressions.

Hyperscan is a high performance, multi-pattern regular expression matching library developed by Intel, which supports the same syntax as PCRE. In this article, we will describe the API differences and provide a performance contrast between PCRE and Hyperscan, then show how to replace PCRE with Hyperscan in a typical scenario.

Functionality Comparison

PCRE supports only block mode compilation and matching, while Hyperscan supports both block and streaming mode. Streaming mode is more practical and flexible in real network scenarios.

PCRE supports only single pattern compilation and matching, while Hyperscan can support multiple patterns. In real scenarios, it is common that multiple patterns are applied, and Hyperscan can efficiently complete all work in one compilation and scan.

API Comparison

Both PCRE and Hyperscan interfaces have compile time and runtime phases. When replacing PCRE with Hyperscan, the basic idea is to replace the PCRE API with the Hyperscan API at both compile time and runtime.

Compile time API changes

With PCRE, we often use the following API to compile each pattern:

#include <pcre.h>
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr); 

It is very easy to replace this API with the Hyperscan compile time API:

#include <hs_compile.h>
hs_error_t hs_compile(const char *expression, unsigned int flags,
               unsigned int mode, const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

Parameters:

expression – single pattern. Corresponding to “pattern” of pcre.
flags – flag of single pattern. Corresponding to “options” of pcre.
mode – mode selection. HS_MODE_BLOCK corresponds to the block-style scanning done by pcre.
platform – platform information.
db – Generated Hyperscan database. Corresponding to the return value of pcre_compile2.
error – return the compile error.

Return value is HS_SUCCESS(0) or error code.
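
As a minimal sketch of this replacement (the helper function name and the error handling below are illustrative choices, not part of the Hyperscan API):

#include <stdio.h>
#include <hs_compile.h>

/* Compile one pattern into a block-mode Hyperscan database.
 * Returns NULL on failure, mirroring a failed pcre_compile2() call. */
static hs_database_t *build_database(const char *pattern, unsigned int flags)
{
    hs_database_t *db = NULL;
    hs_compile_error_t *compile_err = NULL;

    if (hs_compile(pattern, flags, HS_MODE_BLOCK, NULL,
                   &db, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "hs_compile failed: %s\n", compile_err->message);
        hs_free_compile_error(compile_err);
        return NULL;
    }
    return db;
}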

Hyperscan also provides an API for multi-pattern compilation:

hs_error_t hs_compile_multi(const char *const *expressions,
               const unsigned int *flags, const unsigned int *ids,
               unsigned int elements, unsigned int mode,
               const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

This API supports the compiling of several patterns together to generate one Hyperscan database. There are some differences in the parameters:

expressions – the array of patterns.
flags – the flag array of patterns.
ids – the id array of patterns.
elements – the number of patterns.

The others remain the same.
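
A minimal sketch of multi-pattern compilation follows; the pattern strings, flags, and IDs are placeholders chosen for illustration:

#include <stdio.h>
#include <hs_compile.h>

static hs_database_t *build_multi_database(void)
{
    /* Each pattern has its own flags and a user-defined ID that is
     * reported back by the match callback at runtime. */
    const char *patterns[] = { "foo.*bar", "hello", "[0-9]{4}" };
    unsigned int flags[]   = { 0, HS_FLAG_CASELESS, 0 };
    unsigned int ids[]     = { 1, 2, 3 };

    hs_database_t *db = NULL;
    hs_compile_error_t *compile_err = NULL;

    if (hs_compile_multi(patterns, flags, ids, 3, HS_MODE_BLOCK, NULL,
                         &db, &compile_err) != HS_SUCCESS) {
        /* compile_err->expression is the index of the failing pattern, or -1 */
        fprintf(stderr, "pattern %d failed: %s\n",
                compile_err->expression, compile_err->message);
        hs_free_compile_error(compile_err);
        return NULL;
    }
    return db;
}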

Run time API replacement

To build the compiled PCRE database, we often use the following API to scan the input data:

#include <pcre.h>
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize); 

It is easy to replace it with the Hyperscan runtime API:

#include <hs_runtime.h>
hs_error_t hs_scan(const hs_database_t *db, const char *data,
               unsigned int length, unsigned int flags,
               hs_scratch_t *scratch, match_event_handler onEvent,
               void *context);

Parameters:

db – Hyperscan database. Corresponding to “code” of pcre.
data – input data. Corresponding to “subject” of pcre.
length – length of the input data. Corresponding to “length” of pcre.
flags – reserved options.
scratch – the space storing temporary state during runtime, allocated by hs_alloc_scratch().
onEvent – callback function at match, user-defined.
context – callback function parameter, user-defined.

Return value is HS_SUCCESS(0) or error code.
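
The following sketch shows a match callback and a block-mode scan. It assumes a database built as in the compile-time examples above; the reporting logic is only an illustration:

#include <stdio.h>
#include <hs_runtime.h>

/* Called by hs_scan() for every match. 'to' is the end offset of the match;
 * 'from' is only meaningful when start-of-match reporting is requested. */
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *context)
{
    (void)from; (void)flags; (void)context;
    printf("pattern %u matched, ending at byte offset %llu\n", id, to);
    return 0;   /* return non-zero to stop scanning early */
}

static void scan_block(const hs_database_t *db, const char *buf, size_t buf_len)
{
    hs_scratch_t *scratch = NULL;
    if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS) {
        return;
    }
    /* The flags argument of hs_scan() is reserved for future use and must be zero. */
    hs_scan(db, buf, (unsigned int)buf_len, 0, scratch, on_match, NULL);
    hs_free_scratch(scratch);
}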

Using prefilter mode

Hyperscan does not completely match PCRE syntax; for example, it doesn’t support Back Reference and Zero-Width Assertion. However, Hyperscan’s performance advantage over PCRE makes it worthwhile to convert the unsupported pattern to its superset, which can be supported by Hyperscan’s prefilter mode. For example:

Convert /foo(\d)+bar\1+baz/ to /foo(\d)+bar(\d)+baz/

At compile time, each pattern goes through the classifier first, which decides whether to use Hyperscan, prefilter mode, or PCRE. The patterns supported by Hyperscan and prefilter mode are multi-compiled together to generate one Hyperscan database; in addition, every pattern that uses prefilter mode or PCRE is compiled separately to generate the PCRE database:


Figure 1. Compile time.

At runtime, the input data is scanned against the Hyperscan database and each non-prefilter mode PCRE database. If Hyperscan finds a match, it should be reconfirmed with PCRE:


Figure 2. Runtime.

Please refer to API Reference: Constants for more API information.

Performance Comparison

To contrast Hyperscan and PCRE performance, we used the Hyperscan performance testing tool hsbench. We selected the following 15 regular expressions, which include both plain text strings and regular expression rules:

ID  Signature
1   Twain
2   (?i)Twain
3   [a-z]shing
4   Huck[a-zA-Z]+|Saw[a-zA-Z]+
5   \b\w+nn\b
6   [a-q][^u-z]{13}x
7   Tom|Sawyer|Huckleberry|Finn
8   (?i)Tom|Sawyer|Huckleberry|Finn
9   .{0,2}(Tom|Sawyer|Huckleberry|Finn)
10  .{2,4}(Tom|Sawyer|Huckleberry|Finn)
11  Tom.{10,25}river|river.{10,25}Tom
12  [a-zA-Z]+ing
13  \s[a-zA-Z]{0,12}ing\s
14  ([A-Za-z]awyer|[A-Za-z]inn)\s
15  ["'][^"']{0,30}[?!\.]["']

We ran the test on a single core of an Intel® Core™ i7-8700K processor at 3.70 GHz. We chose an e-book from The Entire Project Gutenberg Works of Mark Twain by Mark Twain, containing about 20M words (18,905,427 bytes) as input, and then looped for 200 times using hsbench. Time spent and throughput of PCRE (v8.41, just-in-time mode) and Hyperscan (v4.7.0) are as follows:

Corpus: mtent.txt; Total Data: 18,905,427 Bytes x 200

ID     Time (s)             Throughput (Mbit/s)
       pcre_jit   hs        pcre_jit    hs
1      3.518      1.452     8,598.26    20,832.43
2      3.631      1.498     8,330.68    20,192.71
3      3.327      2.355     9,091.88    12,844.45
4      1.582      2.246     19,120.53   13,467.80
5      12.901     2.067     2,344.68    14,634.10
6      1.462      1.585     20,689.93   19,084.34
7      4.755      3.037     6,361.45    9,960.05
8      11.075     3.08      2,731.26    9,821.00
9      31.684     3.037     954.70      9,960.05
10     31.014     3.042     975.32      9,943.68
11     3.143      2.097     9,624.14    14,424.74
12     7.42       3.279     4,076.64    9,224.97
13     8.463      4.45      3,574.23    6,797.46
14     6.423      2.89      4,709.43    10,466.67
15     2.395      4.267     12,629.93   7,088.98
Total  132.793    40.382    227.79      749.06
Multi  132.793    13.4      227.79      2,257.36

The results show that Hyperscan has a performance advantage over PCRE for most of the rules tested. The highest throughput (see test 9) is 10.4 times greater using Hyperscan.

Multiple pattern matching test results also show the advantage of using Hyperscan. Multi-pattern matching is very common in practical use. Hyperscan can compile all the patterns simultaneously into one database which is scanned against the input corpus only once. The results above (see the rows labeled Total and Multi) show that it takes Hyperscan only 13.4 seconds to perform multiple pattern matching, and a total of 40.382 seconds when the 15 rules are scanned separately in single pattern scans. Because PCRE supports only single pattern compilation and scanning, each of the 15 rules must be compiled separately and scanned against the input corpus. Altogether, it takes PCRE 132.793 seconds to complete all scans. The throughput histogram is as follows:

Replacement Pseudo-code Samples

Assume that we have a pattern set and input corpus:

// patterns
const char *pats[];
unsigned flags[];

// input data
const char *buf = "blah...................blah";
size_t buf_len = strlen(buf);

When using PCRE, we may have this kind of implementation:

// pcre compile
for each pattern i
    pcres[i] = pcre_compile2(pats[i], flags[i], …);

// pcre runtime
for each pattern i
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern i match at ovector[1]

Now we’ll describe the details of replacing PCRE with Hyperscan.

In addition to a possible requirement for prefiltering mode, we also have to be careful about a pattern having variable width, which means that a pattern may consume a different amount of data to get matches. This is because Hyperscan reports all matches but PCRE only reports one match under greedy or ungreedy mode. We also need to reconfirm the match from a variable width pattern with PCRE. We may use the following function to check whether a pattern has variable width or not:

bool is_variable_width(re, flags) {
    hs_expr_info_t *info = NULL;
    ret = hs_expression_info(re, flags, &info, ...);
    if (ret == HS_SUCCESS) and info and (info->min_width == info->max_width)
        return false
    else
        return true
}
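
Under the same assumptions, a concrete version of this check can be built on hs_expression_info(), which reports the minimum and maximum match width of a pattern. Treating an analysis failure as variable width is a conservative choice made in this sketch, not a requirement of the API:

#include <stdbool.h>
#include <stdlib.h>
#include <hs_compile.h>

/* Returns true when the pattern can match strings of different lengths,
 * so a Hyperscan match still needs to be confirmed with PCRE. */
static bool is_variable_width(const char *re, unsigned int flags)
{
    hs_expr_info_t *info = NULL;
    hs_compile_error_t *err = NULL;

    if (hs_expression_info(re, flags, &info, &err) != HS_SUCCESS) {
        hs_free_compile_error(err);
        return true;   /* be conservative if the pattern cannot be analyzed */
    }

    bool variable = !info || (info->min_width != info->max_width);
    free(info);        /* info is allocated with the misc allocator (malloc by default) */
    return variable;
}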

Here we show two different scenarios - single compile or multi-compile.

Single pattern compile

// try compile hs, prefilter mode and pcre compile
for each pattern i
    dbs[i].pcre = NULL;
    ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &dbs[i].hs, …);
    if ret == HS_SUCCESS
        hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        if pats[i] has variable width
            dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
    else
        dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &dbs[i].hs, …);
        if ret == HS_SUCCESS
            hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        else
            dbs[i].hs = NULL;

// runtime
on_match(…, to, …, ctx) { // hs callback
    if !ctx->pcre // not prefilter mode pattern
        report pattern ctx->i match at to
        return
    
    // got a match from a prefilter mode or variable width pattern, need pcre confirm
    ret = pcre_exec(ctx->pcre, …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern ctx->i match at ovector[1]
    return
}

for each pattern i
    if dbs[i].hs
        ctx = pack dbs[i].pcre and i
        hs_scan(dbs[i].hs, buf, buf_len, 0, dbs[i].scratch, on_match, &ctx);
    else
        ret = pcre_exec(dbs[i].pcre, …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern i match at ovector[1]

// house clean
for each pattern i
    if dbs[i].hs
        hs_free_scratch(dbs[i].scratch);
        hs_free_database(dbs[i].hs);

Multi-pattern compile

// try hs, use prefilter mode and pcre compile if failed
for each pattern i
    ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &hs, …);
    if ret == HS_SUCCESS
        store pats[i] to hs_pats[]
        store flags[i] to hs_flags[]
        store ids[i] to hs_ids[]
        if pats[i] has variable width
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
    else
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &hs, …);
        if ret == HS_SUCCESS
            store pats[i] to hs_pats[]
            store flags[i] | HS_FLAG_PREFILTER to hs_flags[]
            store ids[i] to hs_ids[]
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
        else
            pcres[n_pcre++] = pcre_compile2(pats[i], flags[i], …);

// hs multi compile
hs_compile_multi(hs_pats, hs_flags, hs_ids, n_hs, HS_MODE_BLOCK, …, &hs, …);
hs_alloc_scratch(hs, &scratch);

// hs runtime for multi compiled part
on_match(id, …, to, …, ctx) {
    if ctx[id] // got match from a prefilter mode or variable width pattern, need pcre confirm
        ret = pcre_exec(ctx[id], …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern id match at ovector[1]
    else
        report pattern id match at to
    return
}

ctx = id2pcre; // user defined on_match context
hs_scan(hs, buf, buf_len, 0, scratch, on_match, ctx);

// pcre runtime for the rest
for each db i in pcres[]
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern ids[i] match at ovector[1]

// house clean
hs_free_scratch(scratch);
hs_free_database(hs);

Summary

Hyperscan is a high performance regular expression matching library that is very suitable for multi-pattern matching and is faster than PCRE. In this article, we showed how to replace PCRE with Hyperscan in a typical scenario. In addition to its performance advantage, Hyperscan has another superior feature, streaming mode, which handles input data that arrives split across multiple blocks. For these reasons, Hyperscan is expected to replace many existing regular expression matching engines in a greater number of scenarios.

Intel® Graphics Performance Analyzers (Intel® GPA) 2018 R1 Release Notes


Thank you for choosing the Intel® Graphics Performance Analyzers (Intel® GPA), available as a standalone product and as part of Intel® System Studio.

Contents

Introduction
What's New
System Requirements and Supported Platforms
Installation Notes
Technical Support and Troubleshooting
Known Issues and Limitations
Legal Information

Introduction

Intel® GPA provides tools for graphics analysis and optimizations for making games and other graphics-intensive applications run even faster. The tools support the platforms based on the latest generations of Intel® Core™ and Intel Atom™ processor families, for applications developed for Windows*, Android*, Ubuntu*, or macOS*.

Intel® GPA provides a common and integrated user interface for collecting performance data. Using it, you can quickly see performance opportunities in your application, saving time and getting products to market faster.

For detailed information and assistance in using the product, refer to the following online resources:

  • Home Page - view detailed information about the tool, including links to training and support resources, as well as videos on the product to help you get started quickly.
  • Getting Started - get the main features overview and learn how to start using the tools on different host systems.
  • Training and Documentation - learn at your level with Getting Started guides, videos and tutorials.
  • Online Help for Windows* Host - get details on how to analyze Windows* and Android* applications from a Windows* system.
  • Online Help for macOS* Host - get details on how to analyze Android* or macOS* applications from a macOS* system.
  • Online Help for Ubuntu* Host - get details on how to analyze Android* or Ubuntu* applications from an Ubuntu* system.
  • Support Forum - report issues and get help with using Intel® GPA.

What's New

Intel® GPA 2018 R1 offers the following new features:

New Features for Analyzing All Graphics APIs

Graphics Frame Analyzer

  • API Log pane now contains a new Frame Statistic tab, and separate tabs for Resource History and Pixel History. The Resource History tab enables you to select a target resource, and in the Pixel History tab you can select pixel coordinates. 
  • API Log and Metrics can be exported now.
  • Input/Output Geometry viewer now provides additional information about the topology, primitive count, and bounding box.
  • Frame Overview pane shows full-frame FPS along with a GPU duration time.
  • Information about systems where a frame is captured and replayed is shown.

New Features for Analyzing Microsoft DirectX* Applications

Graphics Monitor

  • New User Interface is now available on Windows*.
  • Remote profiling of DirectX* 9 or DirectX* 10 frames is discontinued.

Graphics Frame Analyzer

  • New User Interface for DirectX* 11 frames. The following Legacy User Interface features are transferred to the new interface:
    • Render Target overdraw view
    • Shader replacement experiment allowing the user to import the HLSL shader code and view performance impacts on the entire frame
  • Default layout of D3D Buffers is now based on a specific buffer usage in a frame.
  • Samples count is shown as a parameter for 2D Multisample Textures or 2D Multisample Texture Arrays.
  • API Call arguments including structures, arrays, and enums are correctly shown for DirectX* 11 frames.
  • API Log contains calls from the D3D11DeviceContext interface only.
  • List of bound shader resources (input elements, SRVs, UAVs, CBVs, Sampler, RTVs, DSV) is shown along with a shader code.
  • Target GPU adapter can be selected on multi-GPU machines for DirectX* 11 and DirectX* 12 frames.
  • Intel Gen Graphics Intermediate Shader Assembly (ISA) code is added for DirectX* 11 frames.
  • Input-Assembly layout is shown for DirectX* 11 and DirectX* 12 frames in the Geometry viewer.

New Features for Analyzing macOS Metal* Applications

Multi-Frame Analyzer

  • Ability to export the Metal source or LLVM disassembly codes for a selected shader.
  • Shader replacement experiment allowing the user to import a modified shader and view the performance impacts on the entire frame.

Many defect fixes and stability improvements

Known Issues

  • Full Intel GPA metrics are not supported on macOS* 10.13.4 for Skylake-based and Kaby Lake-based Mac Pro systems.  For full metric support, please do not upgrade to macOS* 10.13.4.
  • Metrics in the System Analyzer's system view are inaccurate for Intel® Graphics Driver for Windows* Version 15.65.4.4944. You can use Intel® Graphics Driver for Windows* Version 15.60.2.4901 instead.

System Requirements and Supported Platforms

The minimum system requirements are: 

  • Host Processor: Intel® Core™ Processor
  • Target Processor: See the list of supported Windows* and Android* devices below
  • System Memory: 8GB RAM
  • Video Memory: 512MB RAM
  • Minimum display resolution for client system: 1280x1024
  • Disk Space: 300MB for minimal product installation

Direct installation of Intel® GPA on 32-bit Windows* systems is not supported. However, if you need to analyze an application on a 32-bit Windows* target system, you can use the following workaround:

  1. Copy the 32-bit *.msi installer distributed with the 64-bit installation from your analysis system to the target system.
  2. Run the installer on the target system to install System Analyzer and Graphics Monitor.
  3. Start the Graphics Monitor and the target application on the 32-bit system and connect to it from the 64-bit host system.

For details, see the Running System Analyzer on a Windows* 32-bit System article.

The table below shows platforms and applications supported by Intel® GPA 2018 R1.

Target System
(the system where your game runs)
Host System
(your development system where you run the analysis)
Target Application
(types of supported applications running on the target system)

Windows* 7 SP1/8/8.1/10

Windows* 7 SP1/8/8.1/10

Microsoft* DirectX* 9/9Ex, 10.0/10.1, 11.0/11.1/11.2/11.3

Windows* 10

Windows* 10

Microsoft* DirectX* 12, 12.1

Google* Android* 4.1, 4.2, 4.3, 4.4, 5.x, 6.0

The specific version depends on the officially-released OS for commercial version of Android* phones and tablets.
See the list of supported devices below.

NOTE: Graphics Frame Analyzer does not currently support GPU metrics for the Intel® processor code-named Clover Trail+.

Windows* 7 SP1/8/8.1/10
or
macOS* 10.11, 10.12
or
Ubuntu* 16.04

OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, 3.2

Ubuntu* 16.04

Ubuntu* 16.04

OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile

macOS* 10.12 and 10.13

macOS* 10.12 and 10.13

OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile

and

Metal* 1 and 2

Intel® GPA does not support the following Windows* configurations: All server editions, Windows* 8 RT, or Windows* 7 starter kit.

Supported Windows* Graphics Devices

Intel® GPA supports the following graphics devices as targets for analyzing Windows* workloads. All these targets have enhanced metric support:

Target | Processor
Intel® UHD Graphics 630 | 8th generation Intel® Core™ processor
Intel® UHD Graphics 630 | 7th generation Intel® Core™ processor
Intel® UHD Graphics 620 | 7th generation Intel® Core™ processor
Intel® HD Graphics 620 | 7th generation Intel® Core™ processor
Intel® HD Graphics 615 | 7th generation Intel® Core™ m processor
Intel® HD Graphics 530 | 6th generation Intel® Core™ processor
Intel® HD Graphics 515 | 6th generation Intel® Core™ m processor
Iris® graphics 6100 | 5th generation Intel® Core™ processor
Intel® HD Graphics 5500 and 6000 | 5th generation Intel® Core™ processor
Intel® HD Graphics 5300 | 5th generation Intel® Core™ m processor family
Iris® Pro graphics 5200 | 4th generation Intel® Core™ processor
Iris® graphics 5100 | 4th generation Intel® Core™ processor
Intel® HD Graphics 4200, 4400, 4600, and 5000 | 4th generation Intel® Core™ processor
Intel® HD Graphics 2500 and 4000 | 3rd generation Intel® Core™ processor
Intel® HD Graphics | Intel® Celeron® processor N3000, N3050, and N3150; Intel® Pentium® processor N3700

Although the tools may appear to work with other graphics devices, these devices are unsupported. Some features and metrics may not be available on unsupported platforms. If you run into in an issue when using the tools with any supported configuration, please report this issue through the Support Forum.

Driver Requirements for Intel® HD Graphics

When running Intel® GPA on platforms with supported Intel® HD Graphics, the tools require the latest graphics drivers for proper operation. You may download and install the latest graphics drivers from http://downloadcenter.intel.com/.

Intel® GPA inspects your current driver version and notifies you if your driver is out-of-date.

Supported Devices Based on Intel® Atom™ Processor

Intel® GPA supports the following devices based on Intel® Atom™ processor:

Processor Model | GPU | Android* Version | Supported Tools

Intel® Atom™ Z35XX 

Imagination Technologies* PowerVR G6430

Android* 4.4 (KitKat), Android* 5.x (Lollipop)

System Analyzer
Graphics Frame Analyzer
Trace Analyzer [Beta]

Intel® Atom™ Z36XXX/Z37XXX 

Intel® HD Graphics

Android* 4.2.2 (Jelly Bean MR1)
Android* 4.4 (KitKat)
Android* 5.x (Lollipop)

 

System Analyzer
Graphics Frame Analyzer
Trace Analyzer [Beta]

Intel® Atom™ Z25XX 

Imagination Technologies* PowerVR SGX544MP2

Android* 4.2.2 (Jelly Bean MR1)
Android* 4.4 (KitKat)

 

System Analyzer
Graphics Frame Analyzer
Trace Analyzer [Beta]

Intel® Atom™ x7-Z8700, x5-Z8500, and x5-Z8300 

Intel® HD Graphics

Android* 5.x (Lollipop), Android* 6.0 (Marshmallow)

System Analyzer
Graphics Frame Analyzer
Trace Analyzer [Beta]

Supported ARM*-Based Devices

The following devices are supported with Intel® GPA:

Model | GPU | Android* Version

Samsung* Galaxy S5

Qualcomm* Adreno 330

Android* 5.0

Samsung* Galaxy Nexus (GT-i9500)

Imagination Technologies* PowerVR SGX544

Android* 4.4

Samsung* Galaxy S4 Mini (GT-I9190)

Qualcomm* Adreno 305

Android* 4.4

Samsung* Galaxy S III (GT-i9300)

ARM* Mali 400MP

Android* 4.3

Google* Nexus 5

Qualcomm* Adreno 330

Android* 5.1

Nvidia* Shield tablet

NVIDIA* Tegra* K1 processor

Android* 5.1

Your system configuration should satisfy the following requirements:

  • Your ARM*-based device is running Android* 4.1, 4.2, 4.3, 4.4, 5.0, 5.1, or 6.0
  • Your Android* application uses OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, or 3.2
  • Regardless of your ARM* system type, your application must be 32-bit

For support level details for ARM*-based devices, see this article.

Installation Notes

Installing Intel® GPA 

Download the Intel® GPA installer from the Intel® GPA Home Page.

Installing Intel® GPA on Windows* Target and Host Systems

To install the tools on Windows*, download the *.msi package from the Intel® GPA Home Page and run the installer file.

The following prerequisites should be installed before you run the installer:

  • Microsoft DirectX* Runtime June 2010
  • Microsoft .NET 4.0 (via redirection to an external web site for download and installation)

If you use the product in a host/target configuration, install Intel® GPA on both systems. For more information on the host/target configuration, refer to Best Practices.

For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on Ubuntu* Host System

To install Intel® GPA on Ubuntu*, download the .tar package, extract the files, and run the .deb installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device since the tools automatically install the necessary files on the target device when you run System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on macOS* Host System

To install the tools on macOS*, download the .zip package, unzip the files, and run the .pkg installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device because the tools automatically install the necessary files on the target device when you run the System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Technical Support and Troubleshooting

For technical support, including answers to questions not addressed in the installed product, visit the Support Forum.

Troubleshooting Android* Connection Problems

If the target device does not appear when the adb devices command is executed on the client system, do the following:

  1. Disconnect the device
  2. Execute $ adb kill-server
  3. Reconnect the device
  4. Run $ adb devices

If these steps do not work, try restarting the system and running $adb devices again. Consult product documentation for your device to see if a custom USB driver needs to be installed. 

Known Issues and Limitations

General

  • Your system must be connected to the internet while you are installing Intel® GPA.
  • Selecting all ergs might cause significant memory usage in Graphics Frame Analyzer.
  • Intel® GPA uses sophisticated techniques for analyzing graphics performance which may conflict with third-party performance analyzers. Therefore, ensure that other performance analyzers are disabled prior to running any of these tools. For third-party graphics, consult the vendor's website.
  • Intel® GPA does not support use of Remote Desktop Connection.
  • Graphics Frame Analyzer (DirectX* 9,10,11) runs best on systems with a minimum of 4GB of physical memory. Additionally, consider running the Graphics Frame Analyzer (DirectX* 9,10,11) in a networked configuration (the server is your target graphics device, and the client running the Graphics Frame Analyzer is a 64-bit OS with at least 8GB of memory).
  • On 64-bit operating systems with less than 8GB of memory, warning messages, parse errors, very long load times, or other issues may occur when loading a large or complex frame capture file.

Analyzing Android* Workloads

  • Graphics Frame Analyzer does not currently support viewing every available OpenGL/OpenGL ES* texture format.
  • Intel® GPA provides limited support for analyzing browser workloads on Android*. You can view metrics in the System Analyzer, but the tools do not support creating or viewing frame capture files or trace capture files for browser workloads. Attempting to create or view these files may result in incorrect results or program crashes.
  • Intel® GPA may fail to analyze OpenGL* multi-context games.

Analyzing Windows* Workloads

  • The Texture 2x2 experiment might work incorrectly for some DirectX* 12 workloads.
  • Intel® GPA may show offsets used in DirectX* 12 API call parameters in scientific format.
  • Render Target visualization experiments “Highlight” and “Hide” are applied to all Draw calls in a frame. As a result, some objects may disappear and/or be highlighted incorrectly.
  • Frame Analyzer may crash if the ScissorRect experiment is deselected. The application will go back to Frame File open view.
  • Downgrade from 17.2 to 17.1 might not be successful.
  • The Overdraw experiment for Render Targets with 16-bit and 32-bit Alpha channel is not supported now.
  • To view Render Targets with 16-bit and 32-bit Alpha channel, you should disable Alpha channel in the Render Targets viewer.
  • To ensure accurate measurements on platforms based on Intel® HD Graphics, profile your application in the full-screen mode. If windowed mode is required, make sure only your application is running. Intel® GPA does not support profiling multiple applications simultaneously.
  • For best results when analyzing frame or trace capture files on the same system where you run your game, follow these steps:
    • Run your game and capture a frame or trace file.
    • Shut down your game and other non-essential applications.
    • Launch the Intel® GPA.
  • To run Intel® GPA on hybrid graphics solutions (a combination of Intel® Processor Graphics and third-party discrete graphics), you must first disable one of the graphics solutions.
  • Secure Boot, also known as Trusted Boot, is a security feature in Windows* 8 enabled in BIOS settings which can cause unpredictable behavior when the "Auto-detect launched applications" option is enabled in Graphics Monitor Preferences. Disable Secure Boot in the BIOS to use the auto-detection feature for analyzing application performance with Intel® GPA. The current version of the tools can now detect Secure Boot, and warns you of this situation.
  • To view the full metric set with the tools for Intel® Processor Graphics on systems with one or more third-party graphics device(s) and platforms based on Intel® HD Graphics, ensure that Intel is the preferred graphics processor. You can set this in the Control Panel application for the third-party hardware. Applications running under Graphics Monitor and a third-party device show GPU metrics on DirectX* 9 as initialized to 0 and on DirectX* 10/11 as unavailable.
  • When using the Intel® GPA, disable the screen saver and power management features on the target system running the Graphics Monitor — the Screen Saver interferes with the quality of the metrics data being collected. In addition, if the target system is locked (which may happen when a Screen Saver starts), the connection from the host system to the target system will be terminated.
  • Intel® GPA does not support frame capture or analysis for:
    • applications that execute on the Debug D3D runtime system
    • applications that use the Reference D3D Device
  • System Analyzer HUD may not operate properly when applications use copy-protection, anti-debugging mechanisms, or non-standard encrypted launching schemes.
  • Intel® GPA provides analysis functionality by inserting itself between your application and Microsoft DirectX*. Therefore, the tools may not work correctly with certain applications which themselves hook or intercept DirectX* APIs or interfaces.
  • Intel® GPA does not support Universal Windows Platform applications where the graphics API uses compositing techniques such as HTML5 or XAML interop.  Only traditional DirectX* rendering is supported. To workaround this limitation, port your application as a Desktop application, and then use the full Intel® GPA suite of tools.
  • In some cases, the Overview tab in Graphics Frame Analyzer (DirectX* 9,10,11) can present GPU Duration values higher than Frame Duration values measured during game run time. This could be a result of Graphics Frame Analyzer (DirectX* 9,10,11) playing the captured frame back in off-screen mode which can be slower than on-screen rendering done in the game.

    To make playback run on-screen use this registry setting on the target system: HKEY_CURRENT_USER\Software\Intel\GPA\16.4\ForceOnScreenPlaybackForRemoteFA = 1 and connect to the target with Graphics Frame Analyzer (DirectX* 9,10,11) running on a separate host. If these requirements are met, the playback runs in off-screen mode on the target. If the frame was captured from the full-screen game, but playback renders it in a windowed mode, then try pressing Alt+Enter on the target to switch playback to full-screen mode.

  • Frame capture using Graphics Monitor runs best on 64-bit operating systems with a minimum of 4GB of physical memory.
    On 32-bit operating systems (or 64-bit operating systems with <4GB of memory), out of memory or capture failed messages can occur.
  • Scenes that re-create resource views during multi-threaded rendering have limited support in the current Intel® GPA version, and might have issues with frame replays in Graphics Frame Analyzer.

*Other names and brands may be claimed as the property of others.

** Disclaimer: Intel disclaims all liability regarding rooting of devices. Users should consult the applicable laws and regulations and proceed with caution. Rooting may or may not void any warranty applicable to your devices.

Unreal Engine 4 Parallel Processing School of Fish


Nikolay Lazarev

Integrated Computer Solutions, Inc.

General Description of the Flocking Algorithm

The implemented flocking algorithm simulates the behavior of a school, or flock, of fish. The algorithm contains four basic behaviors:

  • Cohesion: Fish search for their neighbors in a radius defined as the Radius of Cohesion. The current positions of all neighbors are summed. The result is divided by the number of neighbors. Thus, the center of mass of the neighbors is obtained. This is the point to which the fish strive for cohesion. To determine the direction of movement of the fish, the current position of the fish is subtracted from the result obtained earlier, and then the resulting vector is normalized.
  • Separation: Fish search for their neighbors in a radius defined as the Separation Radius. To calculate the motion vector of an individual fish in a specific separation direction from a school, the difference in the positions of the neighbors and its own position is summed. The result is divided by the number of neighbors and then normalized and multiplied by -1 to change the initial direction of the fish to swim in the opposite direction of the neighbors.
  • Alignment: Fish search for their neighbors in a radius defined as the Radius of Alignment. The current speeds of all neighbors are summed, then divided by the number of neighbors. The resulting vector is normalized.
  • Reversal: All of the fish can only swim in a given space, the boundaries of which can be specified. The moment a fish crosses a boundary must be identified. If a fish hits a boundary, then the direction of the fish is changed to the opposite vector (thereby keeping the fish within the defined space).

These four basic principles of behavior for each fish in a school are combined to calculate the total position values, speed, and acceleration of each fish. In the proposed algorithm, the concept of weight coefficients was introduced to increase or decrease the influence of each of these three modes of behavior (cohesion, separation, and alignment). The weight coefficient was not applied to the behavior of reversal, because fish were not permitted to swim outside of the defined boundaries. For this reason, reversal had the highest priority. Also, the algorithm provided for maximum speed and acceleration.

According to the algorithm described above, the parameters of each fish were calculated (position, velocity, and acceleration). These parameters were calculated for each frame.

Source Code of the Flocking Algorithm with Comments

To calculate the state of fish in a school, double buffering is used. Fish states are stored in an array of size N x 2, where N is the number of fish, and 2 is the number of copies of states.

The algorithm is implemented using two nested loops. In the internal nested loop, the direction vectors are calculated for the three types of behavior (cohesion, separation, and alignment). In the external nested loop, the final calculation of the new state of the fish is made based on calculations in the internal nested loop. These calculations are also based on the values of the weight coefficients of each type of behavior and the maximum values of speed and acceleration.
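
A minimal sketch of this double-buffered storage follows. The FishState type, the container choice, and the index-swap helper are assumptions made for illustration; the article's actual types may differ:

     #include "CoreMinimal.h"

     // One record per fish; two copies are kept so that a frame can read the
     // previous states while writing the new ones.
     struct FishState
     {
         FVector position;
         FVector velocity;
         FVector acceleration;
     };

     // N fish x 2 state copies, indexed as agents[fishNum][stateIndex].
     TArray<TArray<FishState>> agents;

     int32 currentStatesIndex = 0;    // states written during the current frame
     int32 previousStatesIndex = 1;   // states read during the current frame

     // Called once per frame, after every fish has been updated.
     void SwapStateBuffers()
     {
         Swap(currentStatesIndex, previousStatesIndex);
     }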

External loop: At each iteration of a cycle, a new value for the position of each fish is calculated. As arguments to the lambda function, references are passed to:

agents – Array of fish states
currentStatesIndex – Index of the array where the current states of each fish are stored
previousStatesIndex – Index of the array where the previous states of each fish are stored
kCoh – Weighting factor for cohesion behavior
kSep – Weighting factor for separation behavior
kAlign – Weighting factor for alignment behavior
rCohesion – Radius in which neighbors are sought for cohesion
rSeparation – Radius in which neighbors are sought for separation
rAlignment – Radius in which neighbors are sought for alignment
maxAccel – Maximum acceleration of fish
maxVel – Maximum speed of fish
mapSz – Boundaries of the area in which fish are allowed to move
DeltaTime – Elapsed time since the last calculation
isSingleThread – Parameter that indicates in which mode the loop will run

ParallelFor can be used in either of two modes, depending on the state of the isSingleThread Boolean variable:

     ParallelFor(cnt, [&agents, currentStatesIndex, previousStatesIndex, kCoh, kSep, kAlign, rCohesion, rSeparation, 
            rAlignment, maxAccel, maxVel, mapSz, DeltaTime, isSingleThread](int32 fishNum) {

Initializing directions with a zero vector to calculate each of the three behaviors:

     FVector cohesion(FVector::ZeroVector), separation(FVector::ZeroVector), alignment(FVector::ZeroVector);

Initializing neighbor counters for each type of behavior:

     int32 cohesionCnt = 0, separationCnt = 0, alignmentCnt = 0;

Inner loop. Calculates the direction vectors for the three types of behavior:

     for (int i = 0; i < cnt; i++) {

Each fish should ignore (not calculate) itself:

     if (i != fishNum) {

Calculate the distance between the position of a current fish and the position of each other fish in the array:

     float distance = FVector::Distance(agents[i][previousStatesIndex].position, agents[fishNum][previousStatesIndex].position);

If the distance is less than the cohesion radius:

     if (distance < rCohesion) {

Then the neighbor position is added to the cohesion vector:

     cohesion += agents[i][previousStatesIndex].position;

The value of the neighbor counter is increased:

     cohesionCnt++;
     }

If the distance is less than the separation radius:

     if (distance < rSeparation) {

The difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     separation += agents[i][previousStatesIndex].position - agents[fishNum][previousStatesIndex].position;

The value of the neighbor counter is increased:

     separationCnt++;
     }

If the distance is less than the radius of alignment:

     if (distance < rAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     alignment += agents[i][previousStatesIndex].velocity;

The value of the neighbor counter is increased:

     alignmentCnt++;
                      }
             }

If neighbors were found for cohesion:

     if (cohesionCnt != 0) {

Then the cohesion vector is divided by the number of neighbors and its own position is subtracted:

     cohesion /= cohesionCnt;
     cohesion -= agents[fishNum][previousStatesIndex].position;

The cohesion vector is normalized:

     cohesion.Normalize();
     }

If neighbors were found for separation:

     if (separationCnt != 0) {

The separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

            separation /= separationCnt;
            separation *= -1.f;

The separation vector is normalized:

              separation.Normalize();
     }

If neighbors were found for alignment:

     if (alignmentCnt != 0) {

The alignment vector is divided by the number of neighbors:

            alignment /= alignmentCnt;

The alignment vector is normalized:

            alignment.Normalize();
     }

Based on the weight coefficients of each of the possible types of behavior, a new acceleration vector is determined, limited by the value of the maximum acceleration:

agents[fishNum][currentStatesIndex].acceleration = (cohesion * kCoh + separation * kSep + alignment * kAlign).GetClampedToMaxSize(maxAccel);

The Z component of the acceleration is zeroed, restricting the acceleration to the horizontal plane:

   agents[fishNum][currentStatesIndex].acceleration.Z = 0;

To the previous velocity vector, the product of the new acceleration vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].velocity += agents[fishNum][currentStatesIndex].acceleration * DeltaTime;

The velocity vector is limited to the maximum value:

     agents[fishNum][currentStatesIndex].velocity =
                 agents[fishNum][currentStatesIndex].velocity.GetClampedToMaxSize(maxVel);

To the previous position of a fish, the multiplication of the new velocity vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].position += agents[fishNum][currentStatesIndex].velocity * DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

agents[fishNum][currentStatesIndex].velocity = checkMapRange(mapSz,
               agents[fishNum][currentStatesIndex].position, agents[fishNum][currentStatesIndex].velocity);
               }, isSingleThread);

For each fish, collisions with world-static objects, like underwater rocks, should be detected, before new states are applied:

     for (int i = 0; i < cnt; i++) {

To detect collisions between fish and world-static objects:

            FHitResult hit(ForceInit);
            if (collisionDetected(agents[i][previousStatesIndex].position, agents[i][currentStatesIndex].position, hit)) {

If a collision is detected, then the previously calculated position should be undone. The velocity vector should be changed to the opposite direction and the position recalculated:

                   agents[i][currentStatesIndex].position -= agents[i][currentStatesIndex].velocity * DeltaTime;
                   agents[i][currentStatesIndex].velocity *= -1.0; 
                   agents[i][currentStatesIndex].position += agents[i][currentStatesIndex].velocity * DeltaTime;  
            }
     }

Having calculated the new states of all fish, these updated states will be applied, and all fish will be moved to a new position:

for (int i = 0; i < cnt; i++) {
            FTransform transform;
            m_instancedStaticMeshComponent->GetInstanceTransform(agents[i][0].instanceId, transform);

Set up a new position of the fish instance:

     transform.SetLocation(agents[i][0].position);

Turn the fish head forward in the direction of movement:

     FVector direction = agents[i][0].velocity; 
     direction.Normalize();
     transform.SetRotation(FRotationMatrix::MakeFromX(direction).Rotator().Add(0.f, -90.f, 0.f).Quaternion());

Update instance transform:

            m_instancedStaticMeshComponent->UpdateInstanceTransform(agents[i][0].instanceId, transform, false, false);
     }

Redraw all the fish:

     m_instancedStaticMeshComponent->ReleasePerInstanceRenderData();

     m_instancedStaticMeshComponent->MarkRenderStateDirty();

Swap indexed fish states:

      swapFishStatesIndexes();

Complexity of the Algorithm: How Increasing the Number of Fish Affects Productivity

Suppose that the number of fish participating in the algorithm is N. To determine the new state of each fish, the distance to every other fish must be calculated (not counting the additional operations for determining the direction vectors for the three types of behavior). The complexity of the algorithm is therefore O(N²). For example, 1,000 fish will require 1,000,000 operations.
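In other words, the number of distance calculations per frame grows quadratically with the school size:

     $$ \text{ops}(N) \approx N^2, \qquad \text{ops}(1000) = 10^6, \qquad \text{ops}(20000) = 4 \times 10^8 $$

which matches the Computing Operations column in Table 1 below.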

Figure 1: Computational operations for calculating the positions of all fish in a scene.

Compute Shader with Comments

Structure describing the state of each fish:

     struct TInfo{
              int instanceId;
              float3 position;
              float3 velocity;
              float3 acceleration;
     };

Function for calculating the distance between two vectors:

     float getDistance(float3 v1, float3 v2) {
              return sqrt((v2[0]-v1[0])*(v2[0]-v1[0]) + (v2[1]-v1[1])*(v2[1]-v1[1]) + (v2[2]-v1[2])*(v2[2]-v1[2]));
     }

     RWStructuredBuffer<TInfo> data;

     [numthreads(1, 128, 1)]
     void VS_test(uint3 ThreadId : SV_DispatchThreadID)
     {

Total number of fish:

     int fishCount = constants.fishCount;

This variable, created and initialized in C++, determines the number of fish calculated in each graphics processing unit (GPU) thread (by default: 1):

     int calculationsPerThread = constants.calculationsPerThread;
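For reference, a hedged host-side sketch (a hypothetical helper, not the project's actual dispatch code) of how the number of thread groups could be derived from these two values, given the [numthreads(1, 128, 1)] layout declared above:

     // Hypothetical helper: enough groups along Y so that
     // groups * 128 threads * calculationsPerThread covers every fish.
     int GroupCountY(int fishCount, int calculationsPerThread, int threadsPerGroup = 128)
     {
            const int fishPerGroup = threadsPerGroup * calculationsPerThread;
            return (fishCount + fishPerGroup - 1) / fishPerGroup;   // ceiling division
     }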

Loop for calculating fish states that must be computed in this thread:

     for (int iteration = 0; iteration < calculationsPerThread; iteration++) {

Index of the fish processed in this iteration, derived from the thread identifier; it corresponds to the fish index in the state array:

     int currentThreadId = calculationsPerThread * ThreadId.y + iteration;

The current index is checked to ensure it does not exceed the total number of fish (this is possible, since more threads can be started than there are fish):

     if (currentThreadId >= fishCount)
            return;

To calculate the state of fish, a single double-length array is used. The first N elements of this array are the new states of fish to be calculated; the second N elements are the older states of fish that were previously calculated.

Index of the current fish's previously calculated state (its offset in the second half of the array):

    int currentId = fishCount + currentThreadId;

Copy of the structure of the current state of fish:

     TInfo currentState = data[currentThreadId + fishCount];

Copy of the structure of the new state of fish:

     TInfo newState = data[currentThreadId];

Initialize direction vectors for the three types of behavior:

     float3 steerCohesion = {0.0f, 0.0f, 0.0f};
     float3 steerSeparation = {0.0f, 0.0f, 0.0f};
     float3 steerAlignment = {0.0f, 0.0f, 0.0f};

Initialize neighbors counters for each type of behavior:

     float steerCohesionCnt = 0.0f;
     float steerSeparationCnt = 0.0f;
     float steerAlignmentCnt = 0.0f;

Based on the current state of each fish, direction vectors are calculated for each of the three types of behavior. The loop begins at the middle of the input array, which is where the older states are stored:

     for (int i = fishCount; i < 2 * fishCount; i++) {

Each fish should ignore (not calculate) itself:

     if (i != currentId) {

Calculate the distance between the position of current fish and the position of each other fish in the array:

     float d = getDistance(data[i].position, currentState.position);

If the distance is less than the cohesion radius:

     if (d < constants.radiusCohesion) {

Then the neighbor’s position is added to the cohesion vector:

     steerCohesion += data[i].position;

And the counter of neighbors for cohesion is increased:

            steerCohesionCnt++;
     }

If the distance is less than the separation radius:

     if (d < constants.radiusSeparation) {

Then the difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     steerSeparation += data[i].position - currentState.position;

The counter of the number of neighbors for separation increases:

            steerSeparationCnt++;
     }

If the distance is less than the alignment radius:

     if (d < constants.radiusAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     steerAlignment += data[i].velocity;

The counter of the number of neighbors for alignment increases:

                          steerAlignmentCnt++;
                   }
            }
     }

If neighbors were found for cohesion:

   if (steerCohesionCnt != 0) {

The cohesion vector is divided by the number of neighbors and its own position is subtracted:

     steerCohesion = (steerCohesion / steerCohesionCnt - currentState.position);

The cohesion vector is normalized:

            steerCohesion = normalize(steerCohesion);
     }

If neighbors were found for separation:

     if (steerSeparationCnt != 0) {

Then the separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

     steerSeparation = -1.f * (steerSeparation / steerSeparationCnt);

The separation vector is normalized:

            steerSeparation = normalize(steerSeparation);
     }

If neighbors were found for alignment:

     if (steerAlignmentCnt != 0) {

Then the alignment vector is divided by the number of neighbors:

     steerAlignment /= steerAlignmentCnt;

The alignment vector is normalized:

           steerAlignment = normalize(steerAlignment);
     }

Based on the weight coefficients of each of the three possible types of behaviors, a new acceleration vector is determined, limited by the value of the maximum acceleration:

     newState.acceleration = (steerCohesion * constants.kCohesion + steerSeparation * constants.kSeparation
            + steerAlignment * constants.kAlignment);
     newState.acceleration = clamp(newState.acceleration, -1.0f * constants.maxAcceleration,
            constants.maxAcceleration);

The Z component of the acceleration is zeroed, restricting the acceleration to the horizontal plane:

     newState.acceleration[2] = 0.0f;

To the previous velocity vector, the product of the new acceleration vector and the time elapsed since the last calculation is added. The velocity vector is limited to the maximum value:

     newState.velocity += newState.acceleration * variables.DeltaTime;
     newState.velocity = clamp(newState.velocity, -1.0f * constants.maxVelocity, constants.maxVelocity);

Add to the previous position of the fish the result of the multiplication of the new velocity vector and the time elapsed since the last calculation:

     newState.position += newState.velocity * variables.DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

                   float3 newVelocity = newState.velocity;
                   if (newState.position[0] > constants.mapRangeX || newState.position[0] < -constants.mapRangeX) {
                          newVelocity[0] *= -1.f;
                   }    

                   if (newState.position[1] > constants.mapRangeY || newState.position[1] < -constants.mapRangeY) {
                          newVelocity[1] *= -1.f;
                   }
                   if (newState.position[2] > constants.mapRangeZ || newState.position[2] < -3000.f) {
                          newVelocity[2] *= -1.f;
                   }
                   newState.velocity = newVelocity;

                   data[currentThreadId] = newState;
            }
     }         

Table 1: Comparison of algorithms.

Fish  | CPU SINGLE (FPS) | CPU MULTI (FPS) | GPU MULTI (FPS) | Computing Operations
------|------------------|-----------------|-----------------|---------------------
100   | 62               | 62              | 62              | 10000
500   | 62               | 62              | 62              | 250000
1000  | 62               | 62              | 62              | 1000000
1500  | 49               | 61              | 62              | 2250000
2000  | 28               | 55              | 62              | 4000000
2500  | 18               | 42              | 62              | 6250000
3000  | 14               | 30              | 62              | 9000000
3500  | 10               | 23              | 56              | 12250000
4000  | 8                | 20              | 53              | 16000000
4500  | 6                | 17              | 50              | 20250000
5000  | 5                | 14              | 47              | 25000000
5500  | 4                | 12              | 35              | 30250000
6000  | 3                | 10              | 31              | 36000000
6500  | 2                | 8               | 30              | 42250000
7000  | 2                | 7               | 29              | 49000000
7500  | 1                | 7               | 27              | 56250000
8000  | 1                | 6               | 24              | 64000000
8500  | 0                | 5               | 21              | 72250000
9000  | 0                | 5               | 20              | 81000000
9500  | 0                | 4               | 19              | 90250000
10000 | 0                | 3               | 18              | 100000000
10500 | 0                | 3               | 17              | 110250000
11000 | 0                | 2               | 15              | 121000000
11500 | 0                | 2               | 15              | 132250000
12000 | 0                | 1               | 14              | 144000000
13000 | 0                | 0               | 12              | 169000000
14000 | 0                | 0               | 11              | 196000000
15000 | 0                | 0               | 10              | 225000000
16000 | 0                | 0               | 9               | 256000000
17000 | 0                | 0               | 8               | 289000000
18000 | 0                | 0               | 3               | 324000000
19000 | 0                | 0               | 2               | 361000000
20000 | 0                | 0               | 1               | 400000000

Figure 2: Comparison of algorithms.

Laptop Hardware:
  • CPU – Intel® Core™ i7-3632QM processor, 2.2 GHz with turbo boost up to 3.2 GHz
  • GPU – NVIDIA GeForce* GT 730M
  • RAM – 8 GB DDR3*

Start Amazon Web Services Greengrass* Core on the UP Squared* Development Board


Introduction

This guide shows the steps to start Amazon Web Services (AWS) Greengrass* core on Ubuntu* using the UP Squared* development board.

About the UP Squared* Board

Characterized by low power consumption and high performance, which make it ideal for the Internet of Things (IoT), the UP Squared platform is the fastest x86 maker board based on the Apollo Lake platform from Intel. It is available with either the dual-core Intel® Celeron® processor N3350 or the quad-core Intel® Pentium® processor N4200.

AWS Greengrass*

AWS Greengrass is software that extends AWS cloud capabilities to local devices, allowing them to collect and analyze data on the local devices. This reduces latency between the devices and data processing layer, and reduces storage and bandwidth costs involved with sending data to the cloud. The user can create AWS Lambda functions to enable Greengrass to keep data in sync, filter data for further analysis, and communicate with other devices securely.

Operating System Compatibility

The UP Squared board can run Ubilinux*, Ubuntu*, Windows® 10 IoT Core, Windows® 10, Yocto Project*, and Android* Marshmallow operating systems. For more information on UP Squared, visit this website.

Hardware Components

The hardware components used in this project are listed below:

Create AWS Greengrass* Group

An AWS Greengrass group is a collection of settings for AWS Greengrass core devices and the devices that communicate with them. Let's start by logging in to the Amazon Web Services (AWS) Management Console, opening the AWS IoT console, choosing a region from the top right corner of the navigation bar, and then selecting Greengrass.

On the Welcome to AWS Greengrass screen, choose Get Started.

Figure 1: AWS IoT Console

On the Set up your Greengrass group page, select Use easy creation to create an AWS Greengrass group.

Figure 2: Setting up AWS Greengrass Group

Choose a name for your Greengrass Group, then click Next.

Figure 3: Setting up AWS Greengrass Group: Name the Group

Use the default name for the AWS Greengrass core, then select Next.

Figure 4: Setting up AWS Greengrass Group: Name the Greengrass Core

Select Create group and Core on the Run a scripted easy Group creation page. 

Figure 5: Setting up AWS Greengrass Group: Create Group and Core

You should see the following page while the AWS Greengrass group is being created.

Figure 6: Setting up AWS Greengrass Group: Creating Group and Core

When you see a certificate and public and private keys, you have successfully created the new Greengrass group. Click Download these resources as a tar.gz to save the certificate and private key for later use. Select x86_64 as the CPU architecture, and then click Download Greengrass to download the Greengrass core software.

Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key

Select Finish.

Figure 8: Setting up AWS Greengrass Group: Group Created Successfully

Development Boards

Before you begin, make sure that the Ubuntu* operating system is installed on the UP Squared board. To ensure that the Ubuntu operating system is up to date and that dependent Ubuntu packages are installed, open a command prompt (terminal) and type the following:

sudo apt-get update

Install sqlite3 package by entering the following command in the terminal:

sudo apt-get install sqlite3

Create the Greengrass user and group account:

sudo adduser --system ggc_user
sudo addgroup --system ggc_group

Untar the Greengrass Core software that was downloaded in the “Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key” step earlier.

Download the cmake package by entering the following command in the terminal:

wget https://cmake.org/files/v3.8/cmake-3.8.0.tar.gz

Execute the following commands:

tar -xzvf cmake-3.8.0.tar.gz
cd cmake-3.8.0
./configure
make
sudo make install

Use the following commands to install OpenSSL:

wget https://www.openssl.org/source/openssl-1.0.2k.tar.gz
tar -xzvf openssl-1.0.2k.tar.gz
cd openssl-1.0.2k
./config --prefix=/usr
make
sudo make install
sudo ln -sf /usr/local/ssl/bin/openssl `which openssl`
openssl version -v

Enable Hardlinks and Softlinks Protection

Activate the hardlinks and softlinks protection to improve security on the device. Add the following two lines to /etc/sysctl.d/10-link-restrictions.conf.

fs.protected_hardlinks = 1
fs.protected_symlinks = 1

Reboot the UP Squared board and validate the system variables by running:

sudo sysctl -a | grep fs.protected

Install Greengrass Certificate and Key

Copy the certificate and private key files created in the "Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key" step above to the UP Squared board as follows:

  • cloud.pem.crt: 4f7a73faa9-cert.pem.crt created above
  • cloud.pem.key: 4f7a73faa9-private.pem.key created above
  • root-ca-cert.pem: wget https://www.symantec.com/content/en/us/enterprise/verisign/roots/VeriSign-Class%203-Public-Primary-Certification-Authority-G5.pem -O root-ca-cert.pem

The ~/greengrass/certs folder should look like this:

Edit config.json

Open a command prompt (terminal) and navigate to ~/greengrass/config folder. Edit config.json as follows to configure the Greengrass Core:

{
    "coreThing": {
        "caPath": "root-ca-cert.pem",
        "certPath": "cloud.pem.crt",
        "keyPath": "cloud.pem.key",
        "thingArn": "arn:aws:iot:us-east-1:xxxxxxxxxxxx:thing/MyGreengrass1stGroup_Core",
        "iotHost": "yyyyyyyyyyyy.iot.us-east-1.amazonaws.com",
        "ggHost": "greengrass.iot.us-east-1.amazonaws.com",
	"keepAlive": 600
    },
    "runtime": {
        "cgroup": {
            "useSystemd": "yes"
        }
    },
    "system": {
        "shadowSyncTimeout": 120
    }
}
Note: The default value of shadowSyncTimeout is 1.
  • thingArn: Navigate to AWS IoT console, choose Manage on the left, and then select MyGreengrass1stGroup under Thing.

ThingARN should look like this:

  • iotHost: Navigate to AWS IoT console, the Endpoint is located under Settings on the bottom left corner of the AWS IoT console. 

Start AWS Greengrass* Core

Open a command prompt (terminal) and navigate to the ~/greengrass/ggc/core folder:

cd ~/greengrass/ggc/core
sudo ./greengrassd start

When you see the message "Greengrass successfully started", the Greengrass core has started successfully.

To confirm that the Greengrass core process is running, run the following command:

ps aux | grep greengrass

Summary

We have described how to start the Greengrass core on the UP Squared board. From here, there are several projects you can try to explore the potential of the UP Squared board. For example, you can create a Greengrass deployment, add a group of devices that can communicate with the local IoT endpoint, enable Lambda functions to filter data for further analysis, and more.

References

AWS Greengrass* Developer Guide:
http://docs.aws.amazon.com/greengrass/latest/developerguide/what-is-gg.html
http://docs.aws.amazon.com/greengrass/latest/developerguide/gg-config.html

Up Squared:
http://www.up-board.org/upsquared

Amazon:
https://aws.amazon.com/kinesis/streams/getting-started

IoT References:
https://software.intel.com/en-us/iot/hardware/devkit

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group, working on the Intel Atom® processor and IoT scale enabling projects.

*Other names and brands may be claimed as the property of others.

More on UP Squared


Better Generative Modelling through Wasserstein GANs


The following research uses Intel® AI DevCloud, a cloud-hosted hardware and software platform available for developers, researchers and startups to learn, sandbox and get started on their Artificial Intelligence projects. This free cloud compute is available for Intel® AI Academy members.

Overview

The year 2017 was a period of scientific breakthroughs in deep learning, with the publication of numerous research papers. Every year seems like a big leap toward artificial general intelligence, or AGI.

One exciting development involves generative modelling and the use of Wasserstein GANs (Generative Adversarial Networks). An influential paper on the topic has completely changed the approach to generative modelling, moving beyond the time when Ian Goodfellow published the original GAN paper.

Why Wasserstein GANs are such a big deal:

  • With Wasserstein GAN, you can train the discriminator to convergence. This removes the need to balance generator updates with discriminator updates, which previously had to be tuned against each other with little correlation.
  • The paper (Arjovsky et al.) proposed a new GAN training algorithm that works well on the commonly used GAN datasets.
  • Theory-justified papers rarely provide good empirical results, but the training algorithm presented in this paper is both backed up by theory and explains why WGANs work so much better in practice.

Introduction

This paper differs from earlier work in that its training algorithm is backed up by theory, and few examples exist where theory-justified papers also give good empirical results. The big advance of WGANs is that developers can train their discriminator to convergence, which was not possible earlier and which eliminates the need to balance generator updates with discriminator updates.

What is Earth Mover's Distance?

When dealing with discrete probability distributions, the Wasserstein distance is also known as earth mover's distance (EMD). Imagining the two distributions as heaps of earth in varying quantities, EMD is the minimal total amount of work it takes to transform one heap into the other. Here, work is defined as the product of the amount of earth being moved and the distance it covers. The two distributions are usually denoted Pr and P(theta).

Pr comes from an unknown distribution, and the goal is to learn a P(theta) that approximates Pr.

Calculation of EMD is an optimization process with infinite solution approaches; the challenge is to find the optimal one.
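A tiny worked example (not from the original article): let P place mass 0.5 at x = 1 and 0.5 at x = 3, and let Q place mass 1.0 at x = 2. The cheapest plan moves 0.5 of mass a distance of 1 from x = 1 to x = 2, and 0.5 of mass a distance of 1 from x = 3 to x = 2, so

     $$ \text{EMD}(P, Q) = 0.5 \cdot 1 + 0.5 \cdot 1 = 1 $$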

Calculation of EMD

One approach would be to directly learn the probability density function P(theta). This means that P(theta) is some differentiable function that can be optimized by maximum likelihood estimation. To do that, minimize the KL (Kullback–Leibler) divergence KL(Pr || P(theta)) and add random noise to P(theta) when training the model. The noise ensures that the distribution is defined everywhere; otherwise, if a single real sample lies outside the support of P(theta), the KL divergence can explode.

Adversarial training makes it hard to see whether models are training. It has been shown that GANs are related to actor-critic methods in reinforcement learning. Learn More.

Kullback–Leibler and Jensen–Shannon Divergence

  1. KL (Kullback–Leibler) divergence measures how one probability distribution P diverges from a second expected probability distribution Q.

    $$ D_{KL}(p \| q) = \mathbb{E}_{x \sim p}[\log p(x) - \log q(x)] = \mathbb{E}_{x \sim p}[\log p(x)] - \mathbb{E}_{x \sim p}[\log q(x)] = -H(p) - \mathbb{E}_{x \sim p}[\log q(x)] $$

    We drop −H(p) because it is a constant. If we minimize the LHS (left-hand side), we are maximizing the expectation of log q(x) over the distribution p; minimizing the KL divergence is therefore equivalent to maximizing the log-likelihood of the data.

    DKL achieves the minimum zero when p(x) == q(x) everywhere.

    It is noticeable from the formula that KL divergence is asymmetric. In cases where P(x) is close to zero but Q(x) is significantly non-zero, the effect of q is disregarded. This can produce misleading results when the intention is simply to measure the similarity between two equally important distributions.

  2. Jensen–Shannon (JS) divergence is another measure of similarity between two probability distributions. It is symmetric, relatively smoother, and bounded by [0, 1] (see the formula after this list).

    Consider two Gaussian distributions: P with mean 0 and standard deviation 1, and Q with mean 1 and standard deviation 1. The average of the two distributions is labelled m = (p+q)/2. The KL divergence DKL is asymmetric, but the JS divergence DJS is symmetric.

    Figure: Two Gaussian distributions P and Q (means 0 and 1, standard deviation 1) and their average m = (p+q)/2.
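For reference, the standard form of the JS divergence used in item 2 (restated here; m = (p+q)/2 as above):

     $$ D_{JS}(p \| q) = \tfrac{1}{2} D_{KL}(p \| m) + \tfrac{1}{2} D_{KL}(q \| m) $$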

Generative Adversarial Network (GAN)

GAN consists of two models:

  • A discriminator D estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
  • A generator G outputs synthetic samples given a noise variable input z (z brings in potential output diversity). It is trained to capture the real data distribution so that its generative samples can be as real as possible, or in other words, it can trick the discriminator to offer a high probability.

Figure: GAN model with generator G and discriminator D.

Use Wasserstein Distance as GAN Loss Function

It is almost impossible to exhaust all the joint distributions γ in Π(pr, pg) to compute the infimum over them directly. Instead, the authors proposed a smart transformation of the formula based on the Kantorovich-Rubinstein duality:

$$ W(p_r, p_g) = \frac{1}{K} \sup_{\|f\|_L \le K} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] $$

One big problem involves maintaining the K-Lipschitz continuity of fw during the training to make everything work out. The paper presents a simple but very practical trick: after each gradient update, clamp the weights w to a small window, such as [−0.01, 0.01], resulting in a compact parameter space W; fw thus obtains lower and upper bounds that preserve the Lipschitz continuity.

$$ L(p_r, p_g) = W(p_r, p_g) \approx \max_{w \in W} \; \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))] $$

Compared to the original GAN algorithm, the WGAN undertakes the following changes:

  • After every gradient update on the critic function, clamp its weights to a small fixed range, usually [−c, c] (see the sketch after this list).
  • Use a new loss function derived from the Wasserstein distance. The discriminator model is no longer a direct critic telling fake samples from real ones, but rather a helper for estimating the Wasserstein metric between the real and generated data distributions.
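As an illustration of the clipping step itself, here is a minimal sketch, assuming the critic's weights are exposed as a flat array of floats (a framework-agnostic illustration, not the API of any particular library):

     #include <vector>

     // WGAN weight clipping: clamp every critic weight to [-c, c]
     // after each gradient update.
     void ClipCriticWeights(std::vector<float>& weights, float c = 0.01f)
     {
            for (float& w : weights) {
                   if (w < -c) w = -c;
                   else if (w > c) w = c;
            }
     }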

Empirically, the authors recommended using the RMSProp optimizer on the critic, rather than a momentum-based optimizer such as Adam, which could cause instability in model training.

Improved GAN Training

The following suggestions are proposed to help stabilize and improve the training of GANs.

  • Adding noise - Based on the discussion in the previous section, it is now known that Pr and Pg are disjoint in a high-dimensional space, which is one cause of the vanishing gradient problem. To synthetically "spread out" the distributions and create a higher chance that the two probability distributions overlap, one solution is to add continuous noise to the inputs of the discriminator D.
  • One-sided label smoothing - When feeding the discriminator, instead of providing the labels 1 and 0, the paper proposes using values such as 0.9 and 0.1. This helps reduce the vulnerability of the network.

Wasserstein metric is proposed to replace JS divergence because it has a much smoother value space.

Overview of DCGAN

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. By comparison, unsupervised learning with CNNs has received less attention. Deep convolutional generative adversarial networks (DCGANs) have certain architectural constraints and demonstrate a strong potential for unsupervised learning. Training on various image datasets shows convincing evidence that a deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, the learned features can be used for novel tasks, demonstrating their applicability as general image representations.

Figure: DCGAN architecture.

Problem with GANs

  1. It's harder to achieve Nash equilibrium - Since there are two neural networks (generator and discriminator), they are trained simultaneously to find a Nash equilibrium. In the process, each player updates its cost function independently, without considering the updates made by the other network. This method cannot guarantee convergence, which is the stated objective.
  2. Vanishing gradient - When the discriminator works as required, D(x) approaches 1 when x belongs to Pr and 0 otherwise. In this situation, the loss function L falls to zero and there are no gradients left to update the generator during training; as the discriminator gets better, the gradient vanishes quickly, tending to 0.
  3. Use a better metric of distribution similarity - The loss function proposed in the vanilla GAN (by Goodfellow et al.) measures the JS divergence between the distributions Pr and P(theta). This metric fails to provide a meaningful value when the two distributions are disjoint.

Replacing JS divergence with the Wasserstein metric gives a much smoother value space.

Training a Generative Adversarial Network faces a major problem:

  • If the discriminator works as required, the gradient of the loss function tends to zero. The loss can then no longer be updated, so training becomes very slow or the model gets stuck.
  • If the discriminator behaves badly, the generator does not receive accurate feedback and the loss function does not reflect reality.

Evaluation Metric

GANs have lacked an objective function that gives good insight into the whole training process; a good evaluation metric was needed. The Wasserstein distance seeks to address this problem.

Technologies Involved and Methodologies

GANs are difficult to train since convergence is an issue. Using Intel® AI DevCloud and implementing with TensorFlow* served to hasten the process. The first step was to determine the evaluation metric, followed by getting the generator and discriminator to work as required. Other steps included defining the Wasserstein Distance and making use of Residual Blocks in the generator and discriminator.

Steps and Development Process

Initially the project used plain GANs, which are powerful models but suffer from training instability. Switching to DCGANs and then building the project with WGANs was aimed at making progress toward stable GAN training. Images generated with DCGANs were not of good quality and failed to converge during training. In the WGAN paper, instead of optimizing the Jensen–Shannon divergence, the authors proposed using the Wasserstein metric (a measure of the distance between two probability distributions).

The reason the Wasserstein distance is better than the JS or KL divergence is that when two distributions are located on lower-dimensional manifolds without overlaps, the Wasserstein distance still provides a meaningful and smooth representation of the distance between them.
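A simple illustration of this point (a standard example, not taken from the article): let P be a point mass at 0 and Q a point mass at θ. For any θ ≠ 0,

     $$ D_{KL}(P \| Q) = +\infty, \qquad D_{JS}(P \| Q) = \log 2, \qquad W(P, Q) = |\theta| $$

so KL and JS give no usable gradient as θ changes, while the Wasserstein distance shrinks smoothly as Q moves toward P.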

Additionally, WGANs require almost no hyperparameter tuning.

Intel Development Tools Used

The project made use of a Jupyter* notebook on the Intel® AI DevCloud (using Intel® Xeon® Scalable processors) to write the code and for visualization purposes. Information from the Intel® AI Academy forum was also used.

Few GANs Applications

These are just a few applications of GANs, intended to provide some ideas; they can be extended to do much more than we can currently imagine. Many papers have made use of different GAN architectures, some of which are listed below:

  • Font generation with conditional GANs
  • Interactive image generation
  • Image editing
  • Human pose estimation
  • Synthetic data generation
  • Visual saliency prediction
  • Adversarial examples (defense vs attack)
  • Image blending
  • Super resolution
  • Image inpainting
  • Face aging

Code

The code can be found in this Github* repository.

Empirical Results

The paper (Arjovsky et al.) demonstrated the real difference between a GAN and a WGAN by training a GAN discriminator and a Wasserstein GAN critic to optimality. In the following graph, blue depicts the real Gaussian distribution and green depicts the fake one; the red curve depicts the GAN discriminator output.

Figure: GAN discriminator output and WGAN critic value over the real (blue) and fake (green) Gaussian distributions.

Both the GAN and the WGAN will identify which distribution is fake and which is real, but the GAN discriminator does this in such a way that its gradients vanish over this high-dimensional space. WGANs use weight clamping, which gives them an edge: the critic provides useful gradients at almost every point in space. Wasserstein loss also seems to correlate well with image quality.

Join the Intel® AI Academy

Sign up for the Intel® AI Academy and access essential learning materials, community, tools, and technology to boost your AI development.

Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developers.

Contact the author on Twitter* or Github*.

Artificial Intelligence (AI) Helps with Skin Cancer Screening



"The long-term goal and true potential of AI is to replicate the complexity of human thinking at the macro level, and then surpass it to solve complex problems—problems both well-documented and currently unimaginable in nature."1

Challenge

Skin cancer has reached epidemic proportions in much of the world. A simple test is needed to perform initial screening on a wide scale to encourage individuals to seek treatment when necessary.

Solution

Doctor Hazel, a skin cancer screening service powered by artificial intelligence (AI) that operates in real time, relies on an extensive library of images to distinguish between skin cancer and benign lesions, making it easier for people to seek professional medical advice.

Background and History

Hackathons have proven to be a successful way to channel energy and technical expertise into solving very specific problems and generating bright, new ideas for applied technology. Such is the case for the genesis of Doctor Hazel, a noteworthy project at the TechCrunch Disrupt’s San Francisco 2017 hackathon, co-developed by Intel® Software Innovator, Peter Ma, and Mike Borozdin, VP of Engineering at Ethos Lending and cofounder of Doctor Hazel. (see Figure 1).

Peter noted, "My cofounder and I had a very close mutual friend who died of cancer in his early 30s. That event triggered our desire to do something about curing cancer. After researching AI and cancer, we think we can actually do something— using AI effectively—to screen for skin cancer."

Figure 1. Peter Ma (left) and Mike Borozdin show screening techniques.

With the purchase and aid of an inexpensive, high-powered endoscope camera to capture images, Peter and Mike launched into the creation of the Doctor Hazel website and presented the project at the TechCrunch hackathon to widespread acclaim. "Since we built the first prototype in September 2017," Peter said, "we've been covered on TechCrunch, in The Wall Street Journal, IQ by Intel, and many other outlets and publications. Given our experience, we are confident that we can handle the technical requirements; our biggest challenges are US Food and Drug Administration (FDA) approval and gathering additional classified images."

"For all startups," Peter said, "the ideas are the easiest and execution is the hard work. Most of the projects fail because they can't find the product market fit. I've built out hundreds of prototypes, but very few of them gained interest from anyone. When you show people the demo of Doctor Hazel, everyone wants to join the beta and help out. We are getting hundreds of inquiries every single week from people who want to donate data and try the service."

Notable project milestones

  • First introduction of the Doctor Hazel concept and prototype at the TechCrunch hackathon, September 2017.
  • Launch of the Doctor Hazel website to explain the project and solicit images and information from parties that want to help build the database.
  • Media coverage in a number of different outlets and publications, including The Wall Street Journal, TechCrunch, and IQ by Intel.
  • Demonstrations of the project capabilities at multiple venues, including the Global IoT DevFest II, November 7 and 8, 2017.

Figure 2. Peter Ma demonstrates the technology at Strata Data NY in 2017.

Enabling Technologies

The hardware portion of the project came together easily. Using a high-power endoscope camera acquired from Amazon* for about USD 30, the team captured high resolution images of moles and skin lesions to compare with the images in the growing database. Peter and Mike took advantage of Intel® AI DevCloud to train the AI model. This Intel® Xeon® Scalable processor-powered platform is available to Intel® AI Academy members for free and supports several of the major AI frameworks, including TensorFlow* and Caffe*. To broaden the utility of this diagnostic tool, Doctor Hazel employs the Intel® Movidius™ Neural Compute Stick, which makes it possible to conduct screening in situations where no Internet access is immediately available.

"Intel provides both hardware and software needs in artificial intelligence," Peter said, "from training to deployment. As a startup, it's relatively inexpensive to build up the prototype. The Intel® Movidius™ Neural Compute Stick costs about USD 79 and it allows AI to run in real time. We used the Intel® Movidius™ Software Development Kit (SDK), which proved extremely useful for this project."

Contained in a USB form factor and powered by a low-power Intel® Movidius™ Vision Processing Unit (VPU), the Intel Movidius Neural Compute Stick excels at accelerating deep neural network processing using its self-contained inference engine. Developers have the option of initiating projects with a convolutional neural network model based on the Caffe or TensorFlow frameworks, using one of multiple example networks. A toolkit then makes it possible to profile and tune the neural network, and then compile a version for embedding with the Neural Compute Platform API. Visit this site for tips to start developing with the Intel Movidius Neural Compute Stick.

An extensive image database of suspected and validated skin cancer lesions is a primary requisite for improving machine learning and boosting recognition accuracy.

Thousands of images were downloaded from the International Skin Imaging Collaboration, the Skin Cancer Foundation, and the University of Iowa to seed the learning process initially. In assessing a sample, Doctor Hazel gauges 8,000 variables to detect whether an image sample is likely to be skin cancer, a mole, or a benign lesion.

The driving goal of the project is to provide a means for anyone to get skin cancer screening for free. To build the image database and collect a broader sampling of confirmed skin cancer images, the beta version of the Doctor Hazel site is soliciting input and data. In an interview with TechCrunch, Mike commented, "There's a huge problem in getting AI data for medicine, but amazing results are possible. The more people share, the more accurate the system becomes." The team is working to advance recognition rates past the 90 percent level, a goal that gets closer as the image database expands.

Eventually, the team is planning an app to accompany the platform, and plans are also being considered for a compact, inexpensive image-capturing device to use in screening. An underlying goal of the project is to permit individuals to have themselves tested easily, perhaps at a clinic or through a free center using the real-time test system, and then seek a dermatologist or medical professional if the results indicate a high probability of skin cancer. Doctors will no longer need to perform the initial screening, allowing them to focus on patients that show a greater need for treatment based on a positive indication of cancer (see Figure 3).

Figure 3. Doctor reaching for a dermascope to examine a patient's skin lesion.

AI is opening innovative paths to medical advances

The use of AI in diagnostic medicine and treatment methods is creating new opportunities to enhance healthcare globally. Through the design and development of specialized chips, optimized software and frameworks, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, and corporations to uncover and advance solutions that solve major challenges, while complying with governmental policies and mandates in force.

The Intel® AI portfolio includes:

Intel® Xeon® Scalable processor: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.

Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.

Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.

For more information, visit this portfolio page: https://ai.intel.com/technology

For Intel® AI Academy members, the Intel AI DevCloud provides a cloud platform and framework for machine learning and deep learning training. Powered by Intel Xeon Scalable processors, the Intel AI DevCloud is available for up to 30 days of free remote access to support projects by academy members.

Join today: https://software.intel.com/ai/sign-up

"AI fundamentally will enable us to advance scientific method, which itself is a tool, a process that allows us to have repeatable, reproducible results. Now we need to incorporate more data into those inferences in order to drive the field forward. Gone are the days that a single person goes and looks at some data on their own and comes up with a breakthrough, sitting in a corner. Now it is all about bringing together multiple data sources, collaborating, and the tools are what makes that happen."2

– Naveen Rao, Intel VP and GM, Artificial Intelligence Products Group

Resources

Intel® AI Academy

Skin Cancer Project in Intel Developer Mesh

IQ by Intel article - Skin Cancer Detection Using Artificial Intelligence

Deep-learning Algorithm for Skin Cancer Research

Doctor Hazel Website

Doctor Hazel uses AI for Skin Cancer Research

Getting the Most out of AI Using the Caffe Deep Learning Framework

Intel® Distribution for Caffe*

Intel® Movidius™ Neural Compute Stick

Dermatologist-level classification of skin cancer

References

1. Carty, J., C. Rodarte, and N. Rao. "Artificial Intelligence in Pharma and Care Delivery", HealthXL. 2017

2. https://newsroom.intel.com/news/intel-accelerates-accessibility-ai-developer-cloud-computing-resources/

Intel® System Studio 2018 for FreeBSD* Release Notes


This page provides the Release Notes for Intel® VTune™ Amplifier 2018 component of Intel® System Studio 2018 for FreeBSD*. 

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

You can register and download the Intel® System Studio 2018 package  here.

Intel® VTune™ Amplifier 2018 for FreeBSD* Release Notes

Intel® VTune™ Amplifier 2018 provides an integrated performance analysis and tuning environment with a graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures. It provides a target package for collecting data on the FreeBSD* system; the data is then displayed on a host system supporting the graphical interface, either via the remote capability or by manually copying the results to the host.

This document provides system requirements, issues and limitations, and legal information for both the host and target systems.

System requirements

For an explanation of architecture names, see https://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

Host Processor requirements 

  • For general operations with user interface and all data collection except Hardware event-based sampling analysis:
    • A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium® 4 processor or later, or compatible non-Intel processor).
    • For the best experience, a multi-core or multi-processor system is recommended.
    • Because the VTune Amplifier requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel instructions. In this case, run the analysis with a target executable that contains only Intel® instructions. After you finish using the VTune Amplifier, you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
  • For Hardware event-based sampling analysis (EBS):
    • EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel® processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
    • EBS analysis is not supported on the Intel® Pentium® 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors. However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
  • The list of supported processors is constantly being extended. In general VTune Amplifier supports publicly launched Desktop, Mobile, Server and Embedded Processors listed at https://ark.intel.com/. For pre-release processor support please file a support request at Online Service Center (https://www.intel.com/supporttickets).

System memory requirements

At least 2GB of RAM

Disk Space Requirements

900MB free disk space required for all product features and all architectures

Software Requirements

For software requirements, please refer here

Target FreeBSD* collection

For information on configuring the FreeBSD* collection and target setup please refer here.

What's new

Support for Latest Processors:

  • New Intel® processors including Intel® Xeon® Scalable Processor (code named Skylake-SP)

Issues and limitations

For information on issues and limitations please refer here.

Attributions

Attributions can be found here

Disclaimer and Legal Information

Disclaimer and Legal information can be found here

 

Developer Success Stories Library


Intel® Parallel Studio XE | Intel® System Studio | Intel® Media Server Studio

Intel® Advisor | Intel® Computer Vision SDK | Intel® Data Analytics Acceleration Library 

Intel® Distribution for Python* | Intel® Inspector XE | Intel® Integrated Performance Primitives

Intel® Math Kernel Library | Intel® Media SDK  | Intel® MPI Library | Intel® Threading Building Blocks

Intel® VTune™ Amplifier

 


Intel® Parallel Studio XE


Altair Creates a New Standard in Virtual Crash Testing

Altair advances frontal crash simulation with help from Intel® Software Development products.


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from
Intel® Parallel Studio and Intel® VTune™ Amplifier.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Lab7 Systems Helps Manage an Ocean of Information

Lab7 Systems optimizes BioBuilds™ tools for superior performance using Intel® Parallel Studio XE and Intel® C++ Compiler.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Massachusetts General Hospital Achieves 20X Faster Colonoscopy Screening

Intel® Parallel Studio helps optimize key image processing libraries, reducing compute-intensive colon screening processing time from 60 minutes to 3 minutes.


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Ural Federal University Boosts High-Performance Computing Education and Research

Intel® Developer Tools and online courseware enrich the high-performance computing curriculum at Ural Federal University.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® System Studio


CID Wireless Shanghai Boosts Long-Term Evolution (LTE) Application Performance

CID Wireless boosts performance for its LTE reference design code by 6x compared to the plain C code implementation.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Daresbury Laboratory Speeds Computational Chemistry Software 

Scientists get a speedup to their computational chemistry algorithm from Intel® Advisor’s vectorization advisor.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Intel® Computer Vision SDK


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


Intel® Data Analytics Acceleration Library


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance—and the potential for big data algorithms and visualization.


Intel® Distribution for Python*


DATADVANCE Gets Optimal Design with 5x Performance Boost

DATADVANCE discovers that Intel® Distribution for Python* outpaces standard Python.
 


Intel® Inspector XE


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Intel® Integrated Performance Primitives


JD.com Optimizes Image Processing

JD.com Speeds Image Processing 17x, handling 300,000 images in 162 seconds instead of 2,800 seconds, with Intel® C++ Compiler and Intel® Integrated Performance Primitives.


Tencent Optimizes an Illegal Image Filtering System

Tencent doubles the speed of its illegal image filtering system using SIMD Instruction Set and Intel® Integrated Performance Primitives.


Tencent Speeds MD5 Image Identification by 2x

Intel worked with Tencent engineers to optimize the way the company processes millions of images each day, using Intel® Integrated Performance Primitives to achieve a 2x performance improvement.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Math Kernel Library


DreamWorks Puts the Special in Special Effects

DreamWorks Animation’s Puss in Boots uses Intel® Math Kernel Library to help create dazzling special effects.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance―and the potential for big data algorithms and visualization.


Qihoo360 Technology Co. Ltd. Optimizes Speech Recognition

Qihoo360 optimizes the speech recognition module of the Euler platform using Intel® Math Kernel Library (Intel® MKL), speeding up performance by 5x.


Intel® Media SDK


NetUP Gets Blazing Fast Media Transcoding

NetUP uses Intel® Media SDK to help bring the Rio Olympic Games to a worldwide audience of millions.


Intel® Media Server Studio


ActiveVideo Enhances Efficiency

ActiveVideo boosts the scalability and efficiency of its cloud-based virtual set-top box solutions for TV guides, online video, and interactive TV advertising using Intel® Media Server Studio.


Kraftway: Video Analytics at the Edge of the Network

Today’s sensing, processing, storage, and connectivity technologies enable the next step in distributed video analytics, where each camera itself is a server. With Kraftway* video software platforms can encode up to three 1080p60 streams at different bit rates with close to zero CPU load.


Slomo.tv Delivers Game-Changing Video

Slomo.tv's new video replay solutions, built with the latest Intel® technologies, can help resolve challenging game calls.


SoftLab-NSK Builds a Universal, Ultra HD Broadcast Solution

SoftLab-NSK combines the functionality of a 4K HEVC video encoder and a playout server in one box using technologies from Intel.


Vantrix Delivers on Media Transcoding Performance

HP Moonshot* with HP ProLiant* m710p server cartridges and Vantrix Media Platform software, with help from Intel® Media Server Studio, deliver a cost-effective solution that delivers more streams per rack unit while consuming less power and space.


Intel® MPI Library


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Threading Building Blocks


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Johns Hopkins University Prepares for a Many-Core Future

Johns Hopkins University increases the performance of its open-source Bowtie 2* application by adding multi-core parallelism.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Quasardb Streamlines Development for a Real-Time Analytics Database

To deliver first-class performance for its distributed, transactional database, Quasardb uses Intel® Threading Building Blocks (Intel® TBB), Intel’s C++ threading library for creating high-performance, scalable parallel applications.


University of Bristol Accelerates Rational Drug Design

Using Intel® Threading Building Blocks, the University of Bristol helps slash calculation time for drug development—enabling a calculation that once took 25 days to complete to run in just one day.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® VTune™ Amplifier


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.



Build a Fast Network Stack with Vector Packet Processing (VPP) on an Intel® Architecture Server


Introduction

This tutorial shows how to install the FD.io Vector Packet Processing (VPP) package and build a packet forwarding engine on a bare metal Intel® Xeon® processor server. Two additional Intel Xeon processor platform systems are used to connect to the VPP host to pass traffic using iperf3* and Cisco’s TRex* Realistic Traffic Generator (TRex*). Intel 40 Gigabit Ethernet (GbE) network interface cards (NICs) are used to connect the hosts.

Vector Packet Processing (VPP) Overview

VPP is open source, high-performance packet processing software. It leverages the Data Plane Development Kit (DPDK) to take advantage of fast I/O. DPDK provides fast packet processing libraries and user-space drivers; it receives and sends packets with a minimum number of CPU cycles by bypassing the kernel and using a poll mode driver in user space. Details on how to configure DPDK can be found in the DPDK Documentation.

VPP can be used as a standalone product or as an extended data plane product. It is highly efficient: it scales well on modern Intel® processors and handles packet processing in batches, called vectors, of up to 256 packets at a time. This approach maximizes instruction cache hits.

The VPP platform consists of a set of nodes in a directed graph called a packet processing graph. Each node provides a specific network function to packets, and each directed edge indicates the next network function that will handle them. Instead of processing one packet at a time as the kernel does, the first node in the packet processing graph polls for a burst of incoming packets from a network interface, collects similar packets into a frame (or vector), and passes the frame to the next node indicated by the directed edge. That node processes the packets based on the functionality it provides and passes the frame on, and so on, until the last node processes all the packets in the frame and outputs them on a network interface. When a frame of packets is handled by a node, only the first packet in the frame needs to load the CPU's instructions into the cache; the rest of the packets benefit from the instructions already being in the cache. The VPP architecture is flexible, allowing users to create new nodes, add them to the packet processing graph, and rearrange the graph.
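Once VPP is running (installation is covered below), you can inspect this graph interactively from the VPP CLI; both commands below exist in current VPP releases, although their exact output format varies between versions:

vpp# show vlib graph
vpp# show run

The first command lists every graph node together with the "next" nodes it can hand frames to; the second shows per-node runtime statistics, including the average vector size (packets per frame), and is used later in this tutorial.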

Like DPDK, VPP operates in user space. VPP can be used on bare metal, virtual machines (VMs), or containers.

Build and Install VPP

In this tutorial, three systems named csp2s22c03, csp2s22c04, and net2s22c05 are used. The system csp2s22c03, with VPP installed, is used to forward packets, and the systems csp2s22c04 and net2s22c05 are used to pass traffic. All three systems are equipped with Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz, two sockets with 22 cores per socket, and are running 64-bit Ubuntu* 16.04 LTS. The Intel® Ethernet Converged Network Adapter XL710 10/40 GbE is used to connect these systems. Refer to Figure 1 and Figure 2 for configuration diagrams.

Build the FD.io VPP Binary

The instructions in this section describe how to build the VPP package from FD.io. Skip to the next section if you’d like to use the Debian* VPP packages instead.

Using an account with admin privileges on csp2s22c03, we download a stable version of VPP (version 17.04 is used in this tutorial) and navigate to the build-root directory to build the image:

csp2s22c03$ git clone -b stable/1704 https://gerrit.fd.io/r/vpp fdio.1704
csp2s22c03$ cd fdio.1704/
csp2s22c03$ make install-dep
csp2s22c03$ make bootstrap
csp2s22c03$ cd build-root
csp2s22c03$ source ./path_setup
csp2s22c03$ make PLATFORM=vpp TAG=vpp vpp-install

To build the image with debug symbols:

csp2s22c03$ make PLATFORM=vpp TAG=vpp_debug vpp-install

After you've configured VPP, you can run the VPP binary from the fdio.1704 directory using the src/vpp/conf/startup.conf configuration file:

csp2s22c03$ cd ..
csp2s22c03$ sudo build-root/build-vpp-native/vpp/bin/vpp -c src/vpp/conf/startup.conf

Build the Debian* VPP Packages

If you prefer to use the Debian VPP packages, follow these instructions to build them:

csp2s22c03$ make PLATFORM=vpp TAG=vpp install-deb
csp2s22c03:~/download/fdio.1704/build-root$ ls -l *.deb
-rw-r--r-- 1 plse plse 1667422 Feb 12 16:41 vpp_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 2329572 Feb 12 16:41 vpp-api-java_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 23374 Feb 12 16:41 vpp-api-lua_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 8262 Feb 12 16:41 vpp-api-python_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 44175468 Feb 12 16:41 vpp-dbg_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 433788 Feb 12 16:41 vpp-dev_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 1573956 Feb 12 16:41 vpp-lib_17.04.2-2~ga8f93f8_amd64.deb
-rw-r--r-- 1 plse plse 1359024 Feb 12 16:41 vpp-plugins_17.04.2-2~ga8f93f8_amd64.deb

In this output:

  • vpp is the packet engine
  • vpp-api-java is the Java* binding module
  • vpp-api-lua is the Lua* binding module
  • vpp-api-python is the Python* binding module
  • vpp-dbg is the debug symbol version of VPP
  • vpp-dev is the development support (headers and libraries)
  • vpp-lib is the VPP runtime library
  • vpp-plugins is the plugin module

Next, install the Debian VPP packages (at a minimum, you should install the vpp, vpp-lib, and vpp-plugins packages). We install them on the machine csp2s22c03:

csp2s22c03$ apt list --installed | grep vpp
csp2s22c03$ sudo dpkg -i vpp_17.04.2-2~ga8f93f8_amd64.deb vpp-lib_17.04.2-2~ga8f93f8_amd64.deb vpp-plugins_17.04.2-2~ga8f93f8_amd64.deb

Verify that the VPP packages are installed successfully:

csp2s22c03$ apt list --installed | grep vpp
vpp/now 17.04.2-2~ga8f93f8 amd64 [installed,upgradable to: 18.01.1-release]
vpp-lib/now 17.04.2-2~ga8f93f8 amd64 [installed,upgradable to: 18.01.1-release]
vpp-plugins/now 17.04.2-2~ga8f93f8 amd64 [installed,upgradable to: 18.01.1-release]

Configure VPP

During installation, two configuration files are created: /etc/sysctl.d/80-vpp.conf and /etc/vpp/startup.conf. The /etc/sysctl.d/80-vpp.conf configuration file is used to set up huge pages. The /etc/vpp/startup.conf configuration file is used to start VPP.

Configure huge pages

In the /etc/sysctl.d/80-vpp.conf configuration file, set the parameters as follows: the number of 2 MB huge pages vm.nr_hugepages is 4096; vm.max_map_count is 9216 (2.25 * 4096, which satisfies the required minimum of 2 * vm.nr_hugepages); and the shared memory maximum kernel.shmmax is 8,589,934,592 (4096 * 2 * 1024 * 1024).

csp2s22c03$ cat /etc/sysctl.d/80-vpp.conf
# Number of 2MB hugepages desired
vm.nr_hugepages=4096

# Must be greater than or equal to (2 * vm.nr_hugepages).
vm.max_map_count=9216

# All groups allowed to access hugepages
vm.hugetlb_shm_group=0

# Shared Memory Max must be greator or equal to the total size of hugepages.
# For 2MB pages, TotalHugepageSize = vm.nr_hugepages * 2 * 1024 * 1024
# If the existing kernel.shmmax setting  (cat /sys/proc/kernel/shmmax)
# is greater than the calculated TotalHugepageSize then set this parameter
# to current shmmax value.
kernel.shmmax=8589934592
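As a quick sanity check of the arithmetic above, the values can be reproduced with shell arithmetic (ordinary bash, nothing VPP-specific):

csp2s22c03$ echo $((4096 * 2 * 1024 * 1024))   # total huge page size in bytes = kernel.shmmax
8589934592
csp2s22c03$ echo $((2 * 4096))                 # required minimum for vm.max_map_count
8192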

Apply these memory settings to the system and verify the huge pages:

csp2s22c03$ sudo sysctl -p /etc/sysctl.d/80-vpp.conf
vm.nr_hugepages = 4096
vm.max_map_count = 9216
vm.hugetlb_shm_group = 0
kernel.shmmax = 8589934592

csp2s22c03$ cat /proc/meminfo
MemTotal:       131912940 kB
MemFree:        116871136 kB
MemAvailable:   121101956 kB
...............................
HugePages_Total:    4096
HugePages_Free:     3840
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Configure startup.conf

In the /etc/vpp/startup.conf configuration file, the keyword interactive is added to enable the VPP Command-Line Interface (CLI). Also, four worker threads are selected and pinned to cores 2, 3, 22, and 23 (a quick way to verify the pinning is shown after the listing). Note that you can choose the NIC cards to use in this configuration file, or you can specify them later, as this exercise shows. The modified /etc/vpp/startup.conf configuration file is shown below.

csp2s22c03$ cat /etc/vpp/startup.conf

unix {
  nodaemon
  log /tmp/vpp.log
  full-coredump
  interactive
}

api-trace {
  on
}

api-segment {
  gid vpp
}

cpu {
     ## In the VPP there is one main thread and optionally the user can create worker(s)
        ## The main thread and worker thread(s) can be pinned to CPU core(s) manually or automatically

        ## Manual pinning of thread(s) to CPU core(s)

        ## Set logical CPU core where main thread runs
        main-core 1

        ## Set logical CPU core(s) where worker threads are running
        corelist-workers 2-3,22-23
}

dpdk {
        ## Change default settings for all intefaces
        # dev default {
                ## Number of receive queues, enables RSS
                ## Default is 1
                # num-rx-queues 3

                ## Number of transmit queues, Default is equal
                ## to number of worker threads or 1 if no workers treads
                # num-tx-queues 3

                ## Number of descriptors in transmit and receive rings
                ## increasing or reducing number can impact performance
                ## Default is 1024 for both rx and tx
                # num-rx-desc 512
                # num-tx-desc 512

                ## VLAN strip offload mode for interface
                ## Default is off
                # vlan-strip-offload on
        # }

        ## Whitelist specific interface by specifying PCI address
        # dev 0000:02:00.0

        ## Whitelist specific interface by specifying PCI address and in
        ## addition specify custom parameters for this interface
        # dev 0000:02:00.1 {
        #       num-rx-queues 2
        # }

        ## Change UIO driver used by VPP, Options are: igb_uio, vfio-pci
        ## and uio_pci_generic (default)
        # uio-driver vfio-pci
}

# Adjusting the plugin path depending on where the VPP plugins are:
plugins
{
        path /usr/lib/vpp_plugins
}
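Once VPP is running with this configuration, one way to confirm the thread-to-core pinning is a generic Linux check (not a VPP-specific command):

csp2s22c03$ ps -eLo pid,tid,psr,comm | grep vpp

The psr column shows the logical CPU each thread last ran on; with the cpu section above, the main thread should appear on core 1 and the four workers on cores 2, 3, 22, and 23.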

Run VPP as a Packet Processing Engine

In this section, four examples of running VPP are shown. In the first two examples, the iperf3 tool is used to generate traffic, and in the last two examples the TRex Realistic Traffic Generator is used. For comparison purposes, the first example shows packet forwarding using ordinary kernel IP forwarding, and the second example shows packet forwarding using VPP.

Example 1: Using Kernel Packet Forwarding with iperf3*

In this test, 40 GbE Intel Ethernet Network Adapters are used to connect the three systems. Figure 1 illustrates this configuration.


Figure 1 – VPP runs on a host that connects to two other systems via 40 GbE NICs.

For comparison purposes, in the first test, we configure kernel forwarding in csp2s22c03 and use the iperf3 tool to measure network bandwidth between csp2s22c03 and net2s22c05. In the second test, we start the VPP engine in csp2s22c03 instead of using kernel forwarding.

On csp2s22c03, we configure the system to have the addresses 10.10.1.1/24 and 10.10.2.1/24 on the two 40-GbE NICs. To find all network interfaces available on the system, use the lshw Linux* command to list all network interfaces and the corresponding slots [0000:xx:yy.z]. For example, the 40-GbE interfaces are ens802f0 and ens802f1.

csp2s22c03$ sudo lshw -class network -businfo
Bus info          Device      Class          Description
========================================================
pci@0000:03:00.0  enp3s0f0    network        Ethernet Controller 10-Gigabit X540
pci@0000:03:00.1  enp3s0f1    network        Ethernet Controller 10-Gigabit X540
pci@0000:82:00.0  ens802f0    network        Ethernet Controller XL710 for 40GbE
pci@0000:82:00.1  ens802f1    network        Ethernet Controller XL710 for 40GbE
pci@0000:82:00.0  ens802f0d1  network        Ethernet interface
pci@0000:82:00.1  ens802f1d1  network        Ethernet interface

Configure the system to have 10.10.1.1 and 10.10.2.1 on the two 40-GbE NICs ens802f0 and ens802f1, respectively.

csp2s22c03$ sudo ip addr add 10.10.1.1/24 dev ens802f0
csp2s22c03$ sudo ip link set dev ens802f0 up
csp2s22c03$ sudo ip addr add 10.10.2.1/24 dev ens802f1
csp2s22c03$ sudo ip link set dev ens802f1 up

List the route table:

csp2s22c03$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         jf111-ldr1a-530 0.0.0.0         UG    0      0        0 enp3s0f1
default         192.168.0.50    0.0.0.0         UG    100    0        0 enp3s0f0
10.10.1.0       *               255.255.255.0   U     0      0        0 ens802f0
10.10.2.0       *               255.255.255.0   U     0      0        0 ens802f1
10.23.3.0       *               255.255.255.0   U     0      0        0 enp3s0f1
link-local      *               255.255.0.0     U     1000   0        0 enp3s0f1
192.168.0.0     *               255.255.255.0   U     100    0        0 enp3s0f0
csp2s22c03$ ip route
default via 10.23.3.1 dev enp3s0f1
default via 192.168.0.50 dev enp3s0f0  proto static  metric 100
10.10.1.0/24 dev ens802f0  proto kernel  scope link  src 10.10.1.1
10.10.2.0/24 dev ens802f1  proto kernel  scope link  src 10.10.2.1
10.23.3.0/24 dev enp3s0f1  proto kernel  scope link  src 10.23.3.67
169.254.0.0/16 dev enp3s0f1  scope link  metric 1000
192.168.0.0/24 dev enp3s0f0  proto kernel scope link src 192.168.0.142 metric 100

On csp2s22c04, we configure the system to have the address 10.10.1.2 and use the interface ens802 to route IP packets destined for 10.10.2.0/24. Use the lshw Linux command to list all network interfaces and the corresponding slots [0000:xx:yy.z]. For example, the interface ens802d1 (ens802) is connected to slot [82:00.0]:

csp2s22c04$ sudo lshw -class network -businfo
Bus info          Device      Class       Description
=====================================================
pci@0000:03:00.0  enp3s0f0    network     Ethernet Controller 10-Gigabit X540-AT2
pci@0000:03:00.1  enp3s0f1    network     Ethernet Controller 10-Gigabit X540-AT2
pci@0000:82:00.0  ens802d1    network     Ethernet Controller XL710 for 40GbE QSFP+
pci@0000:82:00.0  ens802      network     Ethernet interface

For kernel forwarding, set 10.10.1.2 on the interface ens802, and add a static route for the 10.10.2.0/24 network:

csp2s22c04$ sudo ip addr add 10.10.1.2/24 dev ens802
csp2s22c04$ sudo ip link set dev ens802 up
csp2s22c04$ sudo ip route add 10.10.2.0/24 via 10.10.1.1
csp2s22c04$ ifconfig
enp3s0f0  Link encap:Ethernet  HWaddr a4:bf:01:00:92:73
          inet addr:10.23.3.62  Bcast:10.23.3.255  Mask:255.255.255.0
          inet6 addr: fe80::a6bf:1ff:fe00:9273/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3411 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1179 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:262230 (262.2 KB)  TX bytes:139975 (139.9 KB)

ens802    Link encap:Ethernet  HWaddr 68:05:ca:2e:76:e0
          inet addr:10.10.1.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::6a05:caff:fe2e:76e0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:5480 (5.4 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:31320 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31320 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:40301788 (40.3 MB)  TX bytes:40301788 (40.3 MB)

After setting the route, we can ping from csp2s22c03 to csp2s22c04, and vice versa:

csp2s22c03$ ping 10.10.1.2 -c 3
PING 10.10.1.2 (10.10.1.2) 56(84) bytes of data.
64 bytes from 10.10.1.2: icmp_seq=1 ttl=64 time=0.122 ms
64 bytes from 10.10.1.2: icmp_seq=2 ttl=64 time=0.109 ms
64 bytes from 10.10.1.2: icmp_seq=3 ttl=64 time=0.120 ms
csp2s22c04$ ping 10.10.1.1 -c 3
PING 10.10.1.1 (10.10.1.1) 56(84) bytes of data.
64 bytes from 10.10.1.1: icmp_seq=1 ttl=64 time=0.158 ms
64 bytes from 10.10.1.1: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 10.10.1.1: icmp_seq=3 ttl=64 time=0.102 ms

Similarly, on net2s22c05, we configure the system to have the address 10.10.2.2 and use the interface ens803f0 to route IP packets destined for 10.10.1.0/24. Use the lshw Linux command to list all network interfaces and the corresponding slots [0000:xx:yy.z]. For example, the interface ens803f0 is connected to slot [87:00.0]:

NET2S22C05$ sudo lshw -class network -businfo
Bus info          Device      Class          Description
========================================================
pci@0000:03:00.0  enp3s0f0    network    Ethernet Controller 10-Gigabit X540-AT2
pci@0000:03:00.1  enp3s0f1    network    Ethernet Controller 10-Gigabit X540-AT2
pci@0000:81:00.0  ens787f0    network    82599 10 Gigabit TN Network Connection
pci@0000:81:00.1  ens787f1    network    82599 10 Gigabit TN Network Connection
pci@0000:87:00.0  ens803f0    network    Ethernet Controller XL710 for 40GbE QSFP+
pci@0000:87:00.1  ens803f1    network    Ethernet Controller XL710 for 40GbE QSFP+

For kernel forwarding, set 10.10.2.2 on the interface ens803f0, and add a static route for the 10.10.1.0/24 network:

NET2S22C05$ sudo ip addr add 10.10.2.2/24 dev ens803f0
NET2S22C05$ sudo ip link set dev ens803f0 up
NET2S22C05$ sudo ip r add 10.10.1.0/24 via 10.10.2.1

After setting the route, you can ping from csp2s22c03 to net2s22c05, and vice versa. However, in order to ping between net2s22c05 and csp2s22c04, kernel IP forwarding in csp2s22c03 has to be enabled:

csp2s22c03$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
csp2s22c03$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
csp2s22c03$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
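Equivalently, you can toggle the setting through the standard sysctl interface; note that neither form persists across a reboot (for that, add the line to a file under /etc/sysctl.d/):

csp2s22c03$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1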

If successful, verify that now you can ping between net2s22c05 and csp2s22c04:

NET2S22C05$ ping 10.10.1.2 -c 3
PING 10.10.1.2 (10.10.1.2) 56(84) bytes of data.
64 bytes from 10.10.1.2: icmp_seq=1 ttl=63 time=0.239 ms
64 bytes from 10.10.1.2: icmp_seq=2 ttl=63 time=0.224 ms
64 bytes from 10.10.1.2: icmp_seq=3 ttl=63 time=0.230 ms

We use the iperf3 utility to measure network bandwidth between hosts. For this test, we install iperf3 on both net2s22c05 and csp2s22c04. On csp2s22c04, we start the iperf3 server, and then on net2s22c05, we start the iperf3 client to connect to it; both commands and the client output are shown below.
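The server side needs no special options for this test; iperf3 listens on its default port, 5201, which is the port the client output below connects to:

csp2s22c04$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------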

NET2S22C05$ iperf3 -c 10.10.1.2
Connecting to host 10.10.1.2, port 5201
[  4] local 10.10.2.2 port 54074 connected to 10.10.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   936 MBytes  7.85 Gbits/sec  2120    447 KBytes
[  4]   1.00-2.00   sec   952 MBytes  7.99 Gbits/sec  1491    611 KBytes
[  4]   2.00-3.00   sec   949 MBytes  7.96 Gbits/sec  2309    604 KBytes
[  4]   3.00-4.00   sec   965 MBytes  8.10 Gbits/sec  1786    571 KBytes
[  4]   4.00-5.00   sec   945 MBytes  7.93 Gbits/sec  1984    424 KBytes
[  4]   5.00-6.00   sec   946 MBytes  7.94 Gbits/sec  1764    611 KBytes
[  4]   6.00-7.00   sec   979 MBytes  8.21 Gbits/sec  1499    655 KBytes
[  4]   7.00-8.00   sec   980 MBytes  8.22 Gbits/sec  1182    867 KBytes
[  4]   8.00-9.00   sec  1008 MBytes  8.45 Gbits/sec  945    625 KBytes
[  4]   9.00-10.00  sec  1015 MBytes  8.51 Gbits/sec  1394    611 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  9.45 GBytes  8.12 Gbits/sec  16474             sender
[  4]   0.00-10.00  sec  9.44 GBytes  8.11 Gbits/sec                  receiver

iperf Done.

Using kernel IP forwarding, iperf3 shows the network bandwidth is about 8.12 Gbits per second.
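If you want to see where the cycles go during the kernel-forwarding run, a generic per-CPU utilization check on csp2s22c03 is enough (mpstat is part of the sysstat package, not a VPP tool):

csp2s22c03$ mpstat -P ALL 1 5

Watch the %soft and %sys columns; with kernel forwarding, most of the per-packet work shows up as kernel and softirq time on the cores servicing the NIC interrupts.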

Example 2: Using VPP with iperf3

First, disable kernel IP forwarding in csp2s22c03 to ensure the host cannot use kernel forwarding (all the settings in net2s22c05 and csp2s22c04 remain unchanged):

csp2s22c03$ echo 0 | sudo tee /proc/sys/net/ipv4/ip_forward
0
csp2s22c03$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0

You can use DPDK’s device binding utility (./install-vpp-native/dpdk/sbin/dpdk-devbind) to list network devices and bind or unbind them from specific drivers. The -s/--status flag shows the status of devices, and the -b/--bind flag selects the driver to bind. The status output on our system indicates that the two 40-GbE XL710 devices are located at 82:00.0 and 82:00.1. Use these slot numbers to bind them to the uio_pci_generic driver:

csp2s22c03$ ./install-vpp-native/dpdk/sbin/dpdk-devbind -s

Network devices using DPDK-compatible driver
============================================
<none>

Network devices using kernel driver
===================================
0000:03:00.0 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f0 drv=ixgbe unused=vfio-pci,uio_pci_generic *Active*
0000:03:00.1 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f1 drv=ixgbe unused=vfio-pci,uio_pci_generic *Active*
0000:82:00.0 'Ethernet Controller XL710 for 40GbE QSFP+' if=ens802f0d1,ens802f0 drv=i40e unused=uio_pci_generic                       
0000:82:00.1 'Ethernet Controller XL710 for 40GbE QSFP+' if=ens802f1d1,ens802f1 drv=i40e unused=uio_pci_generic                        

Other network devices
=====================
<none>


csp2s22c03$ sudo modprobe uio_pci_generic
csp2s22c03$ sudo ./install-vpp-native/dpdk/sbin/dpdk-devbind --bind uio_pci_generic 82:00.0
csp2s22c03$ sudo ./install-vpp-native/dpdk/sbin/dpdk-devbind --bind uio_pci_generic 82:00.1

csp2s22c03$ sudo ./install-vpp-native/dpdk/sbin/dpdk-devbind -s

Network devices using DPDK-compatible driver
============================================
0000:82:00.0 'Ethernet Controller XL710 for 40GbE QSFP+' drv=uio_pci_generic unused=i40e,vfio-pci
0000:82:00.1 'Ethernet Controller XL710 for 40GbE QSFP+' drv=uio_pci_generic unused=i40e,vfio-pci

Network devices using kernel driver
===================================
0000:03:00.0 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f0 drv=ixgbe unused=vfio-pci,uio_pci_generic *Active*
0000:03:00.1 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f1 drv=ixgbe unused=vfio-pci,uio_pci_generic *Active*
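If you later want to return these two ports to the kernel i40e driver (for example, to rerun the kernel-forwarding test from Example 1), the same utility can unbind and rebind them; -u/--unbind is a standard dpdk-devbind flag:

csp2s22c03$ sudo ./install-vpp-native/dpdk/sbin/dpdk-devbind -u 82:00.0 82:00.1
csp2s22c03$ sudo ./install-vpp-native/dpdk/sbin/dpdk-devbind --bind i40e 82:00.0 82:00.1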

Start the VPP service, and verify that VPP is running:

csp2s22c03$ sudo service vpp start
csp2s22c03$ ps -ef | grep vpp
root     105655      1 98 17:34 ?        00:00:02 /usr/bin/vpp -c /etc/vpp/startup.conf
         105675 105512  0 17:34 pts/4    00:00:00 grep --color=auto vpp

To access the VPP CLI, issue the command sudo vppctl. From the VPP prompt, list all interfaces that are bound to DPDK using the command show interface.
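A minimal session looks like the following (the listing format matches the show interface output shown in Example 3 below):

csp2s22c03$ sudo vppctl
vpp# show interface

This lists FortyGigabitEthernet82/0/0, FortyGigabitEthernet82/0/1, and local0, all initially in the down state, together with their packet counters.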

VPP shows that the two 40-Gbps ports located at 82:00.0 and 82:00.1 are bound. Next, you need to assign IP addresses to those interfaces, bring them up, and verify:

vpp# set interface ip address FortyGigabitEthernet82/0/0 10.10.1.1/24
vpp# set interface ip address FortyGigabitEthernet82/0/1 10.10.2.1/24
vpp# set interface state FortyGigabitEthernet82/0/0 up
vpp# set interface state FortyGigabitEthernet82/0/1 up
vpp# show interface address
FortyGigabitEthernet82/0/0 (up):
  10.10.1.1/24
FortyGigabitEthernet82/0/1 (up):
  10.10.2.1/24
local0 (dn):

At this point VPP is operational. You can ping these interfaces from either net2s22c05 or csp2s22c04. Moreover, VPP can forward packets whose addresses fall in 10.10.1.0/24 and 10.10.2.0/24, so you can ping between net2s22c05 and csp2s22c04. You can also run iperf3 as illustrated in the previous example; the measured bandwidth between net2s22c05 and csp2s22c04 increases to 20.3 Gbits per second.
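A quick way to confirm that forwarding now goes through VPP, before looking at the iperf3 numbers, is to repeat the earlier ping tests from net2s22c05 (plain ping, no VPP-specific tooling):

NET2S22C05$ ping 10.10.2.1 -c 3
NET2S22C05$ ping 10.10.1.2 -c 3

The first address is VPP's own interface on the 10.10.2.0/24 side; the second is csp2s22c04, reached through VPP.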

NET2S22C05$ iperf3 -c 10.10.1.2
Connecting to host 10.10.1.2, port 5201
[  4] local 10.10.2.2 port 54078 connected to 10.10.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.02 GBytes  17.4 Gbits/sec  460   1.01 MBytes
[  4]   1.00-2.00   sec  3.28 GBytes  28.2 Gbits/sec    0   1.53 MBytes
[  4]   2.00-3.00   sec  2.38 GBytes  20.4 Gbits/sec  486    693 KBytes
[  4]   3.00-4.00   sec  2.06 GBytes  17.7 Gbits/sec  1099   816 KBytes
[  4]   4.00-5.00   sec  2.07 GBytes  17.8 Gbits/sec  614   1.04 MBytes
[  4]   5.00-6.00   sec  2.25 GBytes  19.3 Gbits/sec  2869   716 KBytes
[  4]   6.00-7.00   sec  2.26 GBytes  19.4 Gbits/sec  3321   683 KBytes
[  4]   7.00-8.00   sec  2.33 GBytes  20.0 Gbits/sec  2322   594 KBytes
[  4]   8.00-9.00   sec  2.28 GBytes  19.6 Gbits/sec  1690  1.23 MBytes
[  4]   9.00-10.00  sec  2.73 GBytes  23.5 Gbits/sec  573    680 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec  13434             sender
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec                  receiver

iperf Done.

The VPP CLI command show run displays the graph runtime statistics. Observe that the average vector per node is 6.76, which means that, on average, a vector of 6.76 packets is handled in a graph node.

[Screenshot: output of the VPP CLI command show run, showing the graph runtime statistics]
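To take this measurement yourself, clear the runtime counters before starting the traffic so the averages reflect only the run of interest; clear run and show run are both standard VPP CLI commands (clear run also appears in Example 3 below):

vpp# clear run
vpp# show run

In the show run output, the Vectors/Call column gives the average number of packets handled each time a graph node is invoked.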

Example 3: Using VPP with the TRex* Realistic Traffic Generator

In this example we use only two systems, csp2s22c03 and net2s22c05. VPP is installed on csp2s22c03 and runs as a packet forwarding engine. On net2s22c05, TRex, a high-performance traffic generator that leverages DPDK and runs in user space, generates both client-side and server-side traffic. Figure 2 illustrates this configuration.

VPP is set up on csp2s22c03 exactly as it was in Example 2. Only the setup on net2s22c05 is modified slightly to run TRex preconfigured traffic files.


Figure 2 – The TRex traffic generator sends packets to the host that has VPP running.

To install TRex on net2s22c05, download and extract the TRex package:

NET2S22C05$ wget --no-cache http://trex-tgn.cisco.com/trex/release/latest
NET2S22C05$ tar -xzvf latest
NET2S22C05$ cd v2.37
NET2S22C05$ sudo ./dpdk_nic_bind.py -s

Network devices using DPDK-compatible driver
============================================
0000:87:00.0 'Ethernet Controller XL710 for 40GbE QSFP+' drv=vfio-pci unused=i40e
0000:87:00.1 'Ethernet Controller XL710 for 40GbE QSFP+' drv=vfio-pci unused=i40e

Network devices using kernel driver
===================================
0000:03:00.0 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f0 drv=ixgbe unused=vfio-pci *Active*
0000:03:00.1 'Ethernet Controller 10-Gigabit X540-AT2' if=enp3s0f1 drv=ixgbe unused=vfio-pci
0000:81:00.0 '82599 10 Gigabit TN Network Connection' if=ens787f0 drv=ixgbe unused=vfio-pci
0000:81:00.1 '82599 10 Gigabit TN Network Connection' if=ens787f1 drv=ixgbe unused=vfio-pci

Other network devices
=====================
<none>

Create the /etc/trex_cfg.yaml configuration file. In this file, the ports must match the interfaces available on the target system (net2s22c05 in our example), and the IP addresses correspond to Figure 2. For more information on the configuration file, refer to the TRex Manual.

NET2S22C05$ cat /etc/trex_cfg.yaml
### Config file generated by dpdk_setup_ports.py ###
- port_limit: 2
  version: 2
  interfaces: ['87:00.0', '87:00.1']
  port_bandwidth_gb: 40
  port_info:
      - ip: 10.10.2.2
        default_gw: 10.10.2.1
      - ip: 10.10.1.2
        default_gw: 10.10.1.1

  platform:
      master_thread_id: 0
      latency_thread_id: 1
      dual_if:
        - socket: 1
          threads: [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]
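Rather than writing /etc/trex_cfg.yaml by hand, you can let TRex generate it: the header of the file above names the helper script that produced it, and recent TRex releases provide an interactive mode for it (the -i flag below is our recollection of that mode; check the script's --help output for your version):

NET2S22C05$ cd v2.37
NET2S22C05$ sudo ./dpdk_setup_ports.py -i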

Stop the previous VPP session and start it again in order to add routes for the new networks 16.0.0.0/8 and 48.0.0.0/8 shown in Figure 2. Those routes are needed because TRex generates packets that use these addresses. Refer to the TRex Manual for details on these traffic templates.

csp2s22c03$ sudo service vpp stop
csp2s22c03$ sudo service vpp start
csp2s22c03$ sudo vppctl
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/

vpp# sho int
              Name               Idx       State          Counter          Count
FortyGigabitEthernet82/0/0        1        down
FortyGigabitEthernet82/0/1        2        down
local0                            0        down

vpp#
vpp# set interface ip address FortyGigabitEthernet82/0/0 10.10.1.1/24
vpp# set interface ip address FortyGigabitEthernet82/0/1 10.10.2.1/24
vpp# set interface state FortyGigabitEthernet82/0/0 up
vpp# set interface state FortyGigabitEthernet82/0/1 up
vpp# ip route add 16.0.0.0/8 via 10.10.1.2
vpp# ip route add 48.0.0.0/8 via 10.10.2.2
vpp# clear run
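If you do not want to retype these CLI commands after every VPP restart, the startup file can point at a script of CLI commands via the startup-config (also spelled exec) option of the unix section; the option exists in VPP's startup.conf, but check your release's documentation, and note that the script file name used here is arbitrary. A sketch of the startup.conf fragment:

unix {
  ...
  startup-config /etc/vpp/local-setup.txt
}

The referenced file then simply contains the CLI commands from above, one per line:

set interface ip address FortyGigabitEthernet82/0/0 10.10.1.1/24
set interface ip address FortyGigabitEthernet82/0/1 10.10.2.1/24
set interface state FortyGigabitEthernet82/0/0 up
set interface state FortyGigabitEthernet82/0/1 up
ip route add 16.0.0.0/8 via 10.10.1.2
ip route add 48.0.0.0/8 via 10.10.2.2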

Now, you can generate a simple traffic flow from net2s22c05 using the traffic configuration file cap2/dns.yaml:

NET2S22C05$ sudo ./t-rex-64 -f cap2/dns.yaml -d 1 -l 1000
 summary stats
 --------------
 Total-pkt-drop       : 0 pkts
 Total-tx-bytes       : 166886 bytes
 Total-tx-sw-bytes    : 166716 bytes
 Total-rx-bytes       : 166886 byte

 Total-tx-pkt         : 2528 pkts
 Total-rx-pkt         : 2528 pkts
 Total-sw-tx-pkt      : 2526 pkts
 Total-sw-err         : 0 pkts
 Total ARP sent       : 4 pkts
 Total ARP received   : 2 pkts
 maximum-latency   : 35 usec
 average-latency   : 8 usec
 latency-any-error : OK

On csp2s22c03, the VPP CLI command show run displays the graph runtime statistics:

[Screenshot: output of the VPP CLI command show run during the Example 3 traffic run]

Example 4: Using VPP with TRex Mixed Traffic Templates

In this example, more complicated traffic with a delay profile is generated on net2s22c05 using the traffic configuration file avl/sfr_delay_10_1g.yaml:

NET2S22C05$ sudo ./t-rex-64 -f avl/sfr_delay_10_1g.yaml -c 2 -m 20 -d 100 -l 1000
summary stats
 --------------
 Total-pkt-drop       : 43309 pkts
 Total-tx-bytes       : 251062132504 bytes
 Total-tx-sw-bytes    : 21426636 bytes
 Total-rx-bytes       : 251040139922 byte

 Total-tx-pkt         : 430598064 pkts
 Total-rx-pkt         : 430554755 pkts
 Total-sw-tx-pkt      : 324646 pkts
 Total-sw-err         : 0 pkts
 Total ARP sent       : 5 pkts
 Total ARP received   : 4 pkts
 maximum-latency   : 1278 usec
 average-latency   : 9 usec
 latency-any-error : ERROR

On csp2s22c03, use the VPP CLI command show run to display the graph runtime statistics. Observe that the average vectors per node are now 10.69 and 14.47:

[Screenshot: output of the VPP CLI command show run during the Example 4 traffic run, showing average vectors per node of 10.69 and 14.47]

Summary

This tutorial showed how to download, compile, and install the VPP binary on an Intel® Architecture platform. Examples of the /etc/sysctl.d/80-vpp.conf and /etc/vpp/startup.conf configuration files were provided to get the user up and running with VPP. The tutorial also illustrated how to detect and bind the network interfaces to a DPDK-compatible driver. You can use the VPP CLI to assign IP addresses to these interfaces and bring them up. Finally, four examples using iperf3 and TRex were included to show how VPP processes packets in batches.

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.
