
Lower Numerical Precision Deep Learning Inference and Training


Introduction

Most commercial deep learning applications today use 32 bits of floating point precision (fp32) for training and inference workloads. Various researchers have demonstrated that both deep learning training and inference can be performed with lower numerical precision, using 16-bit multipliers for training and 8-bit multipliers or fewer for inference, with minimal to no loss in accuracy (higher precision, 16 bits vs. 8 bits, is usually needed during training to accurately represent the gradients during the backpropagation phase). Using these lower numerical precisions (training with 16-bit multipliers accumulated to 32 bits or more, and inference with 8-bit multipliers accumulated to 32 bits) will likely become the standard over the next year, in particular for convolutional neural networks (CNNs).

There are two main benefits of lower precision. First, many operations are memory bandwidth bound, and reducing precision allows better use of caches and reduces bandwidth bottlenecks, so data moves faster through the memory hierarchy to keep the compute resources busy. Second, the hardware may enable higher operations per second (OPS) at lower precision, as these multipliers require less silicon area and power.

In this article, we review the history of lower precision training and inference, describe how Intel is enabling lower precision for inference on the current Intel® Xeon® Scalable processors, and explore lower precision training and inference enabled by hardware and software on future generations of the Intel Xeon Scalable platform. Specifically, we describe new instructions available in the current generation and instructions that will be available in future generations of Intel Xeon Scalable processors. We describe how to quantize the model weights and activations and the lower numerical precision functions available in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). Finally, we describe how deep learning frameworks take advantage of these lower precision functions and reduce the conversion overhead between different numerical precisions. Each section can be read independently, and readers may skip to the sections of interest.

Brief History of Lower Precision in Deep Learning

Researchers have demonstrated deep learning training with 16-bit multipliers and inference with 8-bit or fewer multipliers, accumulated to 32 bits, with minimal to no loss in accuracy across various models. Vanhoucke, et al. (2011) quantized activations and weights to 8 bits and kept the biases and first layer input at fp32 for the task of speech recognition on a CPU. Hwang, et al. (2014) trained a simple network with quantized weights of -1, 0, and 1 in the feed-forward propagation and updated the high precision weights in the backpropagation using the MNIST* and TIMIT* datasets with negligible performance loss. Courbariaux, et al. (2015) used the MNIST, CIFAR-10*, and SVHN* datasets to train with lower precision multipliers and high precision accumulators, and updated the high precision weights. They proposed combining dynamic fixed point (having one shared exponent for a tensor) with Gupta, et al.'s (2015) stochastic rounding as future work. This became the core piece of Koster, et al.'s (2017) Flexpoint format used in the Intel® Nervana™ Neural Network Processor (NNP). Kim and Smaragdis (2016) trained with binary weights and updated the full precision (fp32) weights, with competitive performance on the MNIST dataset. Miyashita, et al. (2016) encoded the weights and activations in a base-2 logarithmic representation (since weights and activations have a non-uniform distribution). They trained CIFAR-10 with 5 bits, resulting in only 0.6% worse performance than full precision. Rastegari, et al. (2016) trained AlexNet with binary weights (except for the first and last layers) and updated the full precision weights, with a 2.9% loss in top-1 accuracy. Based on their experiments, they recommend avoiding binarization in fully connected layers and in convolutional layers with few channels or small filter sizes (e.g., 1x1 kernels). Mellempudi, et al. (2017) from Intel Labs trained ResNet-101 with 4-bit weights and 8-bit activations in the convolutional layers while doing updates in full precision, with a 2% loss in top-1 accuracy. Micikevicius, et al. (2017) trained with 16-bit floating-point (fp16) multipliers and fp32 accumulators, and updated the high precision weights with negligible to no loss in accuracy for AlexNet*, VGG-D, GoogLeNet, ResNet-50, Faster R-CNN, Multibox SSD, DeepSpeech2, Sequence-to-Sequence, bigLSTM, and DCGAN (some models required gradient scaling to match fp32 results). Baidu researchers (2017) used 8 bits of fixed precision with 1 sign bit, 4 bits for the integer part, and 3 bits for the fractional part. Sze, et al. (2017) reviewed various quantization techniques (see Table 3 in their paper) showing minimal to no loss at reduced precision (except for the first and last layers, which were kept at fp32). An anonymous submission to ICLR 2018 details how to achieve state-of-the-art results on ResNet-50, GoogLeNet, VGG-16, and AlexNet using 16-bit integer multipliers and 32-bit accumulators.

Lower Numerical Precision With Intel® Xeon® Scalable Processors

The Intel Xeon Scalable processor (formerly codenamed Skylake-SP) cores include Intel® Advanced Vector Extensions 512 (Intel® AVX-512) units with 512-bit wide fused multiply-add (FMA) instructions. These instructions enable lower precision multiplies with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32 bits requires 3 instructions, with one of the 8-bit vectors in unsigned int8 (u8) format, the other in signed int8 (s8) format, and the accumulation in signed int32 (s32) format. This allows for 4x more input at the cost of 3x more instructions, or 33.33% more compute with ¼ the memory requirement. The reduced memory and the higher frequency available for lower precision operations make it even faster. See Figure 1 for details.1

Figure 1. The Intel Xeon Scalable processor core enables 8-bit multiplies with 32-bit accumulates using 3 instructions: VPMADDUBSW (u8×s8→s16) multiplies and adds adjacent pairs, VPMADDWD (s16→s32) multiplies by a broadcast of 1 and adds adjacent pairs, and VPADDD (s32→s32) adds the result to the accumulator. This allows for 4x more input over fp32 at the cost of 3x more instructions, or 33.33% more compute and ¼ the memory requirement. The reduced memory and higher frequency available with lower precision make it even faster. Image credit to Israel Hirsh.
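To make the 3-instruction sequence concrete, here is a small numpy sketch of its semantics (the function names mirror the instructions, but this is an illustration of the arithmetic, not the actual intrinsics API):

```python
import numpy as np

def vpmaddubsw(a_u8, w_s8):
    """Simulates VPMADDUBSW: u8*s8 products of adjacent pairs,
    summed and saturated to the signed 16-bit range."""
    p = a_u8.astype(np.int32) * w_s8.astype(np.int32)
    return np.clip(p[0::2] + p[1::2], -32768, 32767).astype(np.int16)

def vpmaddwd(x_s16, y_s16):
    """Simulates VPMADDWD: s16*s16 products of adjacent pairs, summed to s32."""
    p = x_s16.astype(np.int32) * y_s16.astype(np.int32)
    return p[0::2] + p[1::2]

a = np.array([4, 250, 17, 3], dtype=np.uint8)      # activations (u8)
w = np.array([-7, 23, 100, -128], dtype=np.int8)   # weights (s8)
acc = np.int32(0)                                  # s32 accumulator

s16 = vpmaddubsw(a, w)                             # instruction 1
s32 = vpmaddwd(s16, np.ones_like(s16))             # instruction 2: multiply by broadcast 1
acc += s32.sum()                                   # instruction 3: VPADDD
print(acc)                                         # 7038 = 4*-7 + 250*23 + 17*100 + 3*-128
```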

The Intel AVX-512 instructions also enable 16-bit multiplies. Multiplying two 16-bit values and accumulating the result to 32 bits requires 2 instructions (2 cycles), with both 16-bit vectors in signed int16 (s16) format and the accumulation in signed int32 (s32) format. This allows for 2x more input at the cost of 2x more instructions, resulting in no additional compute. It does, however, reduce the memory requirement and bandwidth bottlenecks, both of which may improve overall performance. See Figure 2 for details.


Figure 2. The Intel Xeon Scalable processor core is capable of 16-bit multiplies with 32-bit accumulates using 2 instructions: VPMADDWD (s16×s16→s32) multiplies and adds adjacent pairs, and VPADDD (s32→s32) adds the result to the accumulator. This allows for 2x more input over fp32 at the cost of 2x more instructions, or no additional compute and ½ the memory requirement. Image credit to Israel Hirsh.

Intel developed AVX512_VNNI (Vector Neural Network Instructions), a new set of Intel AVX-512 instructions to boost deep learning performance. Ice Lake and other future microarchitectures (see Table 1-1) will have the AVX512_VNNI instructions. AVX512_VNNI includes 1) an FMA instruction for 8-bit multiplies with 32-bit accumulates (u8×s8→s32), as shown in Figure 3, and 2) an FMA instruction for 16-bit multiplies with 32-bit accumulates (s16×s16→s32), as shown in Figure 4. The theoretical peak compute gains are 4x int8 OPS and 2x int16 OPS over fp32 OPS, respectively. In practice, the gains may be lower due to memory bandwidth bottlenecks.

Figure 3. AVX512_VNNI enables 8-bit multiplies with 32-bit accumulates with 1 instruction. The VPMADDUBSW, VPMADDWD, and VPADDD instructions in Figure 1 are fused into the VPDPBUSD instruction (u8×s8→s32). This allows for 4x more input over fp32 and (theoretical peak) 4x more compute with ¼ the memory requirement. Image credit to Israel Hirsh.

Figure 4. AVX512_VNNI enables 16-bit multiplies with 32-bit accumulates with 1 instruction. The VPMADDWD and VPADDD instructions in Figure 2 are fused into the VPDPWSSD instruction (s16×s16→s32). This allows for 2x more input over fp32 and (theoretical peak) 2x more compute with ½ the memory requirement. Image credit to Israel Hirsh.

A potential issue is the saturation on overflow that may occur in the intermediate s16 result of the VPMADDUBSW instruction (u8×s8→s16, see Figure 1). This is a problem when both the u8 and s8 values are near their maximum values.2 It can be mitigated by reducing the precision of the inputs by 1 bit. This is not an issue when using the AVX512_VNNI VPDPBUSD FMA instruction (u8×s8→s32).
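The following self-contained sketch of a single VPMADDUBSW lane shows the saturation and the 1-bit mitigation (the helper function is illustrative, not an actual intrinsic):

```python
def maddubs_pair(a_u8, w_s8):
    """One VPMADDUBSW lane: sum of two adjacent u8*s8 products,
    saturated to the signed 16-bit range."""
    p = a_u8[0] * w_s8[0] + a_u8[1] * w_s8[1]
    return max(-32768, min(32767, p))

# Both inputs near their maxima: the exact pairwise sum is 64770,
# which exceeds the s16 range, so the result saturates and accuracy is lost.
print(maddubs_pair([255, 255], [127, 127]))   # 32767 (saturated)

# Mitigation: quantize the u8 inputs to 7 bits (max 127) instead of 8.
print(maddubs_pair([127, 127], [127, 127]))   # 32258, fits in s16
```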

An overflow is more likely to occur with the AVX512_VNNI VPDPWSSD FMA instruction (s16×s16→s32). This can be similarly mitigated by reducing the precision of the activations and the weights by 1 or 2 bits. Another technique to prevent overflow is to keep a second accumulator in fp32 format and, after a set number of s32 accumulations, convert and add the partial result to it. Preliminary results show that statistical performance does not suffer with these techniques.

Compiler support for the AVX512_VNNI instructions is underway. The GCC 8 development code and the LLVM/Clang 6.0 compiler already support AVX512_VNNI instructions. The October 2017 update of the X86 Encoder Decoder (XED) and the Intel Software Development Emulator (SDE) adds support for the AVX512_VNNI instructions.

Intel® MKL-DNN Library Lower Precision Primitives

The Intel MKL-DNN library contains popular deep learning functions or primitives used across various models, such as inner products, convolutions, rectified linear units (ReLU), and batch normalization (BN), along with functions necessary to manipulate the layout of tensors or high dimensional arrays. Intel MKL-DNN is optimized for Intel processors with Intel AVX-512, Intel® AVX2, and Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) instructions. These functions use fp32 for training and inference workloads. Recently, new functions were introduced to support inference workloads with 8 bits of precision in convolution, ReLU, fused convolution plus ReLU, and pooling layers. Functions for recurrent neural networks (RNNs), other fused operations, and Winograd convolutions with 8 bits are designated as future work. Intel MKL-DNN will add support for 16-bit functions in the future when the AVX512_VNNI instructions are available.

Currently, Intel MKL-DNN does not have local response normalization (LRN), fully connected (FC), softmax, or batch normalization (BN) layers implemented with 8 bits of precision (only with fp32) for the following reasons. Modern models do not use LRN, and older models can be modified to use batch normalization instead. Modern CNN models do not typically have many FC layers, although adding support for FC layers is designated as future work. The softmax function currently requires higher precision, as it does not maintain accuracy with 8 bits of precision. A BN inference layer is not needed, as it can be absorbed by its preceding layer by scaling the weight values and modifying the bias, as discussed in the Enabling Lower Precision in the Frameworks section.

Intel MKL-DNN implements the 8-bit convolution operations with the activation (or input) values in u8 format, the weights in s8 format, and the biases in s32 format (the biases can be kept in fp32 as well, since they account for a very small percentage of the overall compute). Figure 5 shows the process of inference operations with 8-bit multipliers accumulated to s32.

Figure 5. The data layer or the first convolution layer activations are quantized to u8 as inputs to the next convolution layer. The weights are quantized to s8, and the bias is formatted as s32 and added to the s32 convolution accumulate. The framework chooses the format of the convolution output as s8, u8, or s32 depending on the parameters of the following layer. Image credit to Jiong Gong.

8-bit Quantization of Activations With Non-negative Values and Weights

Intel MKL-DNN currently assumes that the activations are non-negative, which is the case after the ReLU activation function. Later in this article we discuss how to quantize activations with negative values. Intel MKL-DNN quantizes the values for a given tensor, or for each channel in a tensor (the choice is up to the framework developers), as follows.

R_{a,w} = max(abs(T_{a,w})), where T_{a,w} is a tensor corresponding to either the weights w or the activations or model inputs a.

Q_a = 255/R_a is the quantization factor for activations with non-negative values, and Q_w = 127/R_w is the quantization factor for the weights. The quantized activations, weights, and bias are:
a_u8 = ||Q_a a_f32|| ∈ [0, 255]
W_s8 = ||Q_w W_f32|| ∈ [−127, 127]
b_s32 = ||Q_a Q_w b_f32|| ∈ [−2^31, 2^31 − 1]
where the function ||⋅|| rounds to the nearest integer. Note that while the s8 format supports −128, the smallest quantized s8 weight value used is −127.

The affine transformation using 8-bit multipliers and 32-bit accumulates results in
x_s32 = W_s8 a_u8 + b_s32 ≈ Q_a Q_w (W_f32 a_f32 + b_f32) = Q_a Q_w x_f32
where the approximation holds because the equation ignores the rounding operation, and
x_f32 = W_f32 a_f32 + b_f32 ≈ (1/(Q_a Q_w)) x_s32 = D x_s32
is the affine transformation in fp32 format, and D = 1/(Q_a Q_w) is the dequantization factor.
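The following numpy sketch (variable and function names are ours) applies this scheme to a small random layer and verifies that dequantizing the s32 result recovers the fp32 affine transformation up to rounding error:

```python
import numpy as np

def quantize_layer(W_f32, a_f32, b_f32):
    """Quantize weights to s8, non-negative activations to u8, and the bias
    to s32, using R = max(abs(T)), Qa = 255/Ra, and Qw = 127/Rw as above."""
    Ra = float(np.abs(a_f32).max())
    Rw = float(np.abs(W_f32).max())
    Qa, Qw = 255.0 / Ra, 127.0 / Rw
    a_u8  = np.rint(Qa * a_f32).clip(0, 255).astype(np.uint8)
    W_s8  = np.rint(Qw * W_f32).clip(-127, 127).astype(np.int8)
    b_s32 = np.rint(Qa * Qw * b_f32).astype(np.int32)
    return a_u8, W_s8, b_s32, 1.0 / (Qa * Qw)       # last value is D

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)).astype(np.float32)
b = rng.normal(size=16).astype(np.float32)
a = np.abs(rng.normal(size=64)).astype(np.float32)  # non-negative, e.g., post-ReLU

a_u8, W_s8, b_s32, D = quantize_layer(W, a, b)
x_s32 = W_s8.astype(np.int32) @ a_u8.astype(np.int32) + b_s32
print(np.abs(D * x_s32 - (W @ a + b)).max())        # small rounding error
```

Per-channel quantization follows the same pattern with a vector-valued Q_w.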

In quantizing to u8 and s8 formats, a zero value maps to a specific value without any rounding. Given that zero is one of the most common values, it is advantageous to have exact mappings to reduce quantization errors and improve statistical accuracy.

The quantization factors above can be in fp32 format on the Intel Xeon Scalable processors. However, some architectures (e.g., FPGAs) do not support divides and use shifts instead. For those architectures, the scalar is rounded to the nearest power of two and the scaling is done with bit shifts. The resulting reduction in statistical accuracy is minimal (usually <1%).
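As a quick illustration with a made-up quantization factor, rounding to the nearest power of two turns the scaling multiply into a bit shift:

```python
import numpy as np

Qa = 37.3                             # example fp32 quantization factor
shift = int(np.rint(np.log2(Qa)))     # 5, since 2**5 = 32 is the nearest power of two
x = 100                               # an integer value to scale
print(x << shift)                     # 3200: x * 32 via a shift instead of x * 37.3
```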

Efficient 8-bit Multiplies

In Figure 6, we demonstrate how to efficiently perform the 8-bit multiplies for A×W. Intel MKL-DNN uses an NHWC data layout for the activation tensors, where N is the batch size, H is the height, W is the width, and C is the number of channels, and an (O/16)K(C/4)T16o4c data layout for the weight tensors, where O is the number of kernels or output channels, C is the number of input channels, K is the height, and T is the width. The first 32 bits (4 int8 values) of tensor A, shown in gray, are broadcast 16 times to fill a 512-bit register. Intel MKL-DNN modifies the data layout of tensor W after quantizing the weights. The tensor W data layout is rearranged as W′ in groups of 16 columns, with each column contributing 32 bits (4 int8 values) read contiguously in memory, starting with the first 4 values of column 1 occupying the first 32 bits of the register (red), the next 4x1 values occupying the next 32 bits (orange), and so forth (green). The second, third, and fourth blocks (yellow) below the first block are rearranged in the same pattern. The next set of blocks (blue) follows. In practice, tensor W is usually transposed before rearranging the memory layout in order to access 1x4 contiguous memory values rather than 4x1 scattered values. Modifying this data layout is usually done once and stored for reuse across all inference iterations.

Figure 6. Efficient use of int8 multiplies to compute the product A×W requires a data layout transformation of tensor W in order to read contiguous bits. Groups of 32 bits of A are broadcast 16 times to fill a 512-bit register, which is multiplied by groups of 512 bits from tensor W′.

The register with the first 4 int8 values (copied 16 times) of A is multiplied by the 64 int8 values (512 bits) of W′ and accumulated. The next 4 values of A are broadcast 16 times to another register, which is multiplied by the next 64 int8 values of W′. This continues until the first row of A is read, with the results accumulated. The outputs (after all 3 instructions of the 8-bit FMA) are the first 16 output values (requiring 512 bits at s32). The first row of A is then multiplied by the next values of W′, producing the next 16 values of the output.
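The relayout itself is just a reshape and transpose. Below is a numpy sketch of our reading of the (O/16)K(C/4)T16o4c layout described above; the actual Intel MKL-DNN layout descriptors may differ in detail:

```python
import numpy as np

O, C, K, T = 32, 8, 3, 3                 # kernels, input channels, height, width
rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(O, C, K, T), dtype=np.int8)

# Split O into blocks of 16 and C into blocks of 4, then reorder so each
# contiguous group of 64 int8 values covers 16 outputs x 4 input channels.
W_blocked = (W.reshape(O // 16, 16, C // 4, 4, K, T)
              .transpose(0, 4, 2, 5, 1, 3)           # (O/16, K, C/4, T, 16o, 4c)
              .copy())                                # materialize the new layout
print(W_blocked.shape)                                # (2, 3, 2, 3, 16, 4)
```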

The Intel Xeon Scalable processors have up to 32 registers. When executing in the 512-bit register port scheme on processors with two FMA units,3 the Port 0 FMA has a latency of 4 cycles and the Port 5 FMA has a latency of 6 cycles. The instructions used for deep learning workloads at int8 support bypass and have a latency of 5 cycles on both ports 0 and 5 (see Section 15.17). In practice, multiple rows of W′ are loaded into multiple registers to hide these latencies.

16-bit Functions for Training

Intel MKL-DNN support for 16-bit functions is designated as future work. Nevertheless, researchers have already shown training of various CNN models using 16-bit multiplies with 32-bit accumulates by taking advantage of the AVX512_4VNNI instructions (also known as QVNNI, available on the Intel® Xeon® Phi™ processors codenamed Knights Mill) and the VP4DPWSSD instruction (similar to the AVX512_VNNI VPDPWSSD instruction discussed earlier, which will be available in some future Intel Xeon Scalable processors).

These researchers matched the fp32 statistical performance of ResNet-50, GoogLeNet-v1, VGG-16, and AlexNet with the same number of iterations as the fp32 models and without changing the hyperparameters. They use s16 to store the activations, weights, and gradients, and also keep a master copy of the fp32 weights for the weight updates, which is quantized back to s16 after each iteration. They use quantization factors that are powers of two, which simplifies managing the quantization and dequantization factors through tensor multiplies.

Enabling Lower Precision in the Frameworks

The popular frameworks enable users to define their model without writing all the function definitions themselves. The implementation details of the various functions can be hidden from framework users; these implementations are written by the framework developers. This section explains the modifications required at the framework level to enable lower precision.

Quantizing the weights is done before inference starts. Quantizing the activations efficiently requires precomputing the quantization factors. The activation quantization factors are usually precomputed by sampling the validation dataset to find the range, as described above. Values in the test dataset that fall outside this range are saturated to it. For negative activation values, the range before saturation could be relaxed to −128 R_a′/127 in order to use the s8 = −128 value, where R_a′ is the maximum absolute value of these activations. These scalars are then written to a file.
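A minimal sketch of this calibration step (the function names and data are illustrative):

```python
import numpy as np

def calibrate(activation_batches):
    """Precompute the activation quantization factor from sampled batches."""
    R = max(float(np.abs(batch).max()) for batch in activation_batches)
    return 255.0 / R                        # Qa for non-negative activations

def quantize_u8(a_f32, Qa):
    """Out-of-range values saturate to the [0, 255] range."""
    return np.clip(np.rint(Qa * a_f32), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
batches = [np.abs(rng.normal(size=128)) for _ in range(10)]  # stand-in for sampled activations
Qa = calibrate(batches)
test_act = np.array([0.0, 1.0, 99.0])       # 99.0 lies far outside the sampled range
print(quantize_u8(test_act, Qa))            # the out-of-range value saturates to 255
```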

8-bit Quantization of Activations or Inputs With Negative Values

Quantizing activations or input values with negative values can be implemented at the framework level as follows. Q_a′ = 127/R_a′ is the quantization factor for activations with negative values. The s8 quantized format is a_s8 = ||Q_a′ a_f32|| ∈ [−128, 127], where the function ||⋅|| rounds to the nearest integer. However, the activations must be in u8 format to take advantage of the VPMADDUBSW AVX-512 instruction or the VPDPBUSD AVX512_VNNI instruction (described in the section Lower Numerical Precision With Intel Xeon Scalable Processors). Therefore, all values in a_s8 are shifted by K = 128 to be non-negative:
a_u8 = a_s8 + K1 ∈ [0, 255], where 1 is a vector of all 1s, and the bias b_f32 is modified as
b′_f32 = b_f32 − (K/Q_a′) W_f32 1

The methodology to quantize the weights and the modified bias is the same as before:
W_s8 = ||Q_w W_f32|| ∈ [−127, 127]
b′_s32 = ||Q_a′ Q_w b′_f32|| ∈ [−2^31, 2^31 − 1]

The affine transformation using 8-bit multipliers and 32-bit accumulates results in
x_s32 = W_s8 a_u8 + b′_s32 ≈ Q_w W_f32 (Q_a′ a_f32 + K1) + Q_a′ Q_w (b_f32 − (K/Q_a′) W_f32 1) = Q_a′ Q_w (W_f32 a_f32 + b_f32) = Q_a′ Q_w x_f32
where
x_f32 = W_f32 a_f32 + b_f32 ≈ (1/(Q_a′ Q_w)) x_s32 = D x_s32
and D = 1/(Q_a′ Q_w) is the dequantization factor.

When the input signal is already in u8 format (e.g., RGB images) but a preprocessing step is required to subtract the mean signal, the above equations can be used where K is the mean, a_u8 is the input signal (not preprocessed), and Q_a′ = 1.

Researchers often keep the first convolution layer in fp32 format and do the other convolutional layers in int8 (see the Brief History of Lower Precision in Deep Learning section for examples). We observe that using these quantization techniques enables the use of all convolution layers in int8 with no significant decrease in statistical accuracy.

To recap: to use activations with negative values, the activations are quantized to s8 format and then shifted by K = 128 to u8 format. The only additional change is to modify the bias: b′_f32 = b_f32 − (K/Q_a′) W_f32 1. For a convolution layer, the product W_f32 1 generalizes to the sum over all values of W_f32 along every dimension except the one shared with b_f32. See Appendix A for details.
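The following numpy sketch applies the scheme to a fully connected layer (names are ours) and verifies that folding the K = 128 shift into the bias recovers the fp32 result up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 64)).astype(np.float32)
b = rng.normal(size=16).astype(np.float32)
a = rng.normal(size=64).astype(np.float32)            # activations with negative values

K = 128
Ra, Rw = np.abs(a).max(), np.abs(W).max()
Qa, Qw = 127.0 / Ra, 127.0 / Rw
a_u8  = (np.rint(Qa * a).clip(-128, 127) + K).astype(np.uint8)  # s8 quantize, shift to u8
W_s8  = np.rint(Qw * W).clip(-127, 127).astype(np.int8)
b_mod = b - (K / Qa) * W.sum(axis=1)                  # b' = b - (K/Qa') * (W_f32 @ 1)
b_s32 = np.rint(Qa * Qw * b_mod).astype(np.int32)

x_s32 = W_s8.astype(np.int32) @ a_u8.astype(np.int32) + b_s32
print(np.abs(x_s32 / (Qa * Qw) - (W @ a + b)).max())  # small quantization error
```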

Fused Quantization

Fused quantization improves performance by combining dequantization and quantization as follows so there is no need to convert to fp32. The activation at layer l+1 is:

a_f32^(l+1) = g(x_f32^(l)) = g(D^(l) x_s32^(l))

where g(⋅) is a non-linear activation function. Assuming the ReLU activation function, the activation can be expressed in u8 format as
a_u8^(l+1) = ||Q_a^(l+1) a_f32^(l+1)|| = ||Q_a^(l+1) D^(l) max(0, x_s32^(l))||
where the product Q_a^(l+1) D^(l) enables computing the next layer's quantized activation in u8 format without computing the fp32 representation.
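A sketch of this fusion with example factor values: the fused scale is computed once offline, and at inference time the ReLU and requantization operate directly on the s32 accumulator:

```python
import numpy as np

D_l    = 0.00042        # dequantization factor of layer l (example value)
Q_next = 181.0          # precomputed activation quantization factor for layer l+1 (example)
scale  = Q_next * D_l   # fused scale, computed once before inference

x_s32 = np.array([-900, 0, 1500, 40000], dtype=np.int32)     # s32 convolution accumulator
# ReLU and requantization fused: no fp32 activation tensor is materialized.
a_u8 = np.clip(np.rint(scale * np.maximum(0, x_s32)), 0, 255).astype(np.uint8)
print(a_u8)             # [  0   0 114 255]; the clip saturates out-of-range values
```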

When g(⋅) is the ReLU function (as in the equations below) and Q ≥ 0, the following property holds:
Q g(D^(l) x_s32^(l) + D^(h) x_s32^(h)) = g(Q D^(l) x_s32^(l) + Q D^(h) x_s32^(h))
This property is useful for models with skip connections such as ResNet, where a skip connection branch may have dependencies on various activations. As an example, using the nomenclature of the ResNet-50 author's Caffe deploy.prototxt (see Figure 7), the quantized input activation of layer res2b_branch2a (abbreviated as 2b2a in the equations below) is

a_u8^(2b2a) = Q_a^(2b2a) g(D^(2a1) x_s32^(2a1) + D^(2a2c) x_s32^(2a2c))

where a_u8^(2b2a) ∈ [0, 127] (instead of [0, 255]) because Q_a^(2b2a) D^(2a1) x_s32^(2a1) ∈ [−128, 127] is in s8 format (the product comes before the ReLU function), and Q_a^(2b2a) = 127/R_a^(2b2a) is the quantization factor. Following this procedure, it is shown in Appendix B that the activation a_u8^(2c2a) depends on x_s32^(2a1), x_s32^(2a2c), and x_s32^(2b2c). Similarly, the activation a_u8^(3a2a) depends on x_s32^(2a1), x_s32^(2a2c), x_s32^(2b2c), and x_s32^(2c2c).

Figure 7. Diagram of the second group of residual blocks in ResNet-50 (and the first branch in the third group) using the nomenclature by the ResNet-50 author in Caffe's deploy.prototxt. The layers marked with a blue arrow have dependencies on 2 or more activations. Image credit to Barukh Ziv, Etay Meiri, Eden Segal.

Batch Normalization

A batch normalization (BN) inference layer is not needed, as it can be absorbed by its preceding layer by scaling the weight values and modifying the bias. This technique only works for inference and is not unique to lower precision. It can be implemented at the framework level instead of in Intel MKL-DNN. BN is usually applied after the affine transformation x = Wa + b and before the activation function (details in the original BN paper). BN normalizes x to be zero mean and unit variance, and then scales and shifts the normalized vector by γ and β, respectively, which are parameters also learned during training. During a training iteration, x is normalized using the mini-batch statistics. For inference, the mean E and variance V of x are precomputed using the statistics of the entire training dataset or a variant, such as a running average of these statistics computed during training. During inference, the BN output y is:
y = BN(x) = γ (x − E1)/√V + β1 = γ (Wa + b − E1)/√V + β1 = (γ/√V) Wa + ((γ/√V)(b − E1) + β1) = W′a + b′
where W′ = (γ/√V) W and b′ = (γ/√V)(b − E1) + β1. That is, during inference the BN layer can be replaced by adjusting the weights and bias of the preceding convolutional or fully connected layer.
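A numpy sketch of this folding for a fully connected layer with per-channel γ, β, E, and V (the small ε term inside the square root is omitted, as in the equations above):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 64))
b = rng.normal(size=16)
gamma, beta = rng.normal(size=16), rng.normal(size=16)          # learned BN scale and shift
E, V = rng.normal(size=16), np.abs(rng.normal(size=16)) + 0.1   # precomputed mean/variance

s = gamma / np.sqrt(V)              # per-channel scale
W_fold = s[:, None] * W             # W' = (gamma/sqrt(V)) W
b_fold = s * (b - E) + beta         # b' = (gamma/sqrt(V)) (b - E) + beta

a = rng.normal(size=64)
y_bn   = s * (W @ a + b - E) + beta # affine transformation followed by BN
y_fold = W_fold @ a + b_fold        # folded layer produces the same output
print(np.abs(y_bn - y_fold).max())  # ~0
```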

Frameworks

Intel enabled 8-bit inference in Intel Optimized Caffe* (also known as IntelCaffe). Optimizations for Intel's DL Inference Engine, Apache MXNet*, and TensorFlow* are expected to be available in Q2 2018. All these 8-bit optimizations are currently limited to CNN models. Support for RNN models, 16-bit training, and other frameworks will follow later in 2018.

In Intel Optimized Caffe, the model.prototxt file is modified to include the precomputed scalars, as shown in Figure 8. Currently, Intel Optimized Caffe can express the quantization factor either as a power of two or as a regular fp32 value, and can use either one quantization factor per tensor or one per channel. These quantization factors are computed using a sampling tool built into Intel Optimized Caffe.

Figure 8. Quantization factors are added to the model.prototxt file. Image credit to Haihao Shen.

Intel's DL Inference Engine is part of Intel's Deep Learning Deployment Toolkit and the Intel® Computer Vision SDK. It is available on Linux* and Windows* OS and supports models trained in Caffe, MXNet, and TensorFlow, with others coming in the future. The Inference Engine facilitates deployment of DL solutions by delivering a unified API for various hardware backends: Intel Xeon processors with AVX2 and AVX-512, Intel Atom processors, Intel® HD Graphics, and Intel® Arria® 10 (Intel® A10) discrete cards, at various numerical precisions depending on the hardware. Support for 8-bit inference on Intel Xeon Scalable processors will be available in Q2 2018.

TensorFlow already supports 8-bit inference and various quantization methods. It can dynamically compute the scale, or collect statistics during training or a calibration phase and then assign a quantization factor. TensorFlow's graph, which includes these scalars, is written to a file. The graph with the respective scalars is quantized and run during inference. TensorFlow supports two methods for quantization. One method is similar to Intel MKL-DNN's, setting the min and max as additive inverses. The other uses arbitrary values for min and max that require an offset plus a scale (not supported in Intel MKL-DNN). See Pete Warden's blog for more details, but note that the blog is outdated, as it does not cover all the ways to quantize in TensorFlow.

TensorFlow also supports retraining or fine-tuning at lower precision, which can improve statistical performance. Given a model trained at fp32, its weights are quantized, and the model is then fine-tuned with the quantized weights, with the weights re-quantized after each training iteration.

GemmLowP is a Google library adopted in TensorFlow Lite*. It uses u8 multiplies, where f32 = D×(u8 − K), K is the u8 value that maps to f32 = 0, and D > 0 is the dequantization factor.

The main Apache MXNet branch currently does not support 8-bit. However, a branch by one of the main MXNet contributors supports 8-bit inference. In that branch, there are two methods to quantize the values: one where the min value is mapped to 0 and the max value to 255 (note that zero does not map to an exact value), and another where the max of the absolute value is mapped to either −127 or 127 (note that zero maps to zero, similar to Intel MKL-DNN). The main difference from the approach presented here is that the scalars in this MXNet branch are not precomputed. Rather, they are computed during the actual inference steps, which reduces the benefits of lower precision. In that branch, the scalars for the activations are computed by multiplying the scalars from the inputs with the scalars from the weights: activation-scalar = input-scalar * weight-scalar, where input = input-scalar * quantized-input, weight = weight-scalar * quantized-weight, and activation = activation-scalar * quantized-activation; the inputs, weights, activations, and scalars are in fp32 format, the quantized inputs and quantized weights are in int8 format, and the quantized activations are in int32 format (see details). While the min and max of the activations are tracked, the values are only dequantized when encountering an fp32 layer (e.g., softmax).

TensorRT quantizes to s8 format similarly to Intel MKL-DNN, with the addition of finding a tighter range by minimizing the KL divergence between the quantized and reference distributions.

The TPU team reports that TPUs, which use int8 multiplies, are being used across a variety of models, including LSTM models. The software stack translates API calls from TensorFlow graphs into TPU instructions.

Caffe2's docs state that there is "flexibility for future directions such as quantized computation," but currently no plans for quantization have been disclosed.

PyTorch has a branch that offers various options to quantize, but there is no discussion of which is best.

Microsoft introduced Project Brainwave* using a custom 8-bit floating point format (ms-fp8) that runs on Intel® Stratix® 10 FPGAs. The details of this format, its quantization techniques, and the framework implementation have not been disclosed. Project Brainwave supports CNTK* and TensorFlow and plans to support many other frameworks by converting models trained in popular frameworks to an internal graph-based intermediate representation.

Model and Graph Optimizations

Model optimizations can further improve inference performance. For example, in ResNet the stride operation can be moved to an earlier layer without modifying the end result, reducing the number of operations, as shown in Figure 9. This modification applies at both 8-bit and 32-bit precision.

Figure 9. The stride-2 operation shown in the layers on the left can be moved to an earlier layer during inference, which reduces the number of operations and does not modify the result. Courtesy of Eden Segal and Etay Meiri.

Conclusion

Lower precision inference and training can improve computational performance with minimal or no reduction in statistical accuracy. Intel is enabling 8-bit precision for inference on the current generation of Intel Xeon Scalable processors. Intel is also enabling 8-bit precision for inference and 16-bit precision for training on future generations of Intel Xeon Scalable processors, with enabling in both hardware and software: compilers, the Intel MKL-DNN library, and popular deep learning frameworks.

Acknowledgements

A special thanks to the framework optimization team leads and Intel Xeon processor architects for the useful discussions including Israel Hirsh, Alex Heinecke, Vadim Pirogov, Frank Zhang, Rinat Rappoport, Barak Hurwitz, Dipankar Das, Dheevatsa Mudigere, Naveen Mellempudi, Dhiraj Kalamkar, Bob Valentine, AG Ramesh, Nagib Hakim as well as the wonderful reviewers R. Chase Adams, Nikhil Murthy, Banu Nagasundaram, Todd Wilson, Alexis Crowell, and Emily Hudson.

About the Authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer working with the Data Center Group (DCG) and Artificial Intelligence Products Group (AIPG) where he designs AI solutions for Intel’s customers and provides technical leadership across Intel for AI products. He has 13 years of experience working in AI. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He holds over 20 peer reviewed publications in journals and conferences, and a book chapter on machine learning.

Eden Segal, is a software developer at the Pre-Enabling team where he optimizes Deep Learning algorithms to find the peak algorithm performance on Intel processors. This knowledge is used to improve Intel’s performance across the entire deep learning stack from the hardware, through the libraries and up to the deep learning framework.

Etay Meiri, is a software developer at the Pre-Enabling team where he optimizes Deep Learning algorithms to find the peak algorithm performance on Intel processors. This knowledge is used to improve Intel’s performance across the entire deep learning stack from the hardware, through the libraries and up to the deep learning framework.

Evarist Fomenko, MD in Applied Mathematics, is a software development engineer in Intel MKL and Intel MKL-DNN where he designs and optimizes library functions, and interacts with internal and external teams to assist with integration. He has 5 years of experience working on hardware optimizations at Intel.

Young Jin Kim, PhD, is a Sr. Machine Learning Engineer with Intel’s AI Products Group (AIPG) where he develops and optimizes deep learning software frameworks for Intel’s hardware architecture by adopting the state-of-the-art techniques. He has over 10 years of experience working in artificial intelligence. Young received his PhD from Georgia Institute of Technology for his research in deep learning and high-performance computing. He holds over 10 peer reviewed publications in journals and conferences.

Haihao Shen, MD in Computer Science, is a deep learning engineer in machine learning and translation team (MLT) with Intel Software and Services Group (SSG). He leads the development of Intel Distribution of Caffe, including low precision inference and model optimizations. He has 6 years of experience working on software optimization and verification at Intel. Prior to joining Intel, he graduated from Shanghai Jiao Tong University.

Barukh Ziv, PhD, is a Senior Software Engineer, working with pre-Enabling group in SSGi, where he designs efficient implementations of DL applications for future generations of Xeon processors. He has 2 years of experience working on DL optimizations. Barukh received his Ph. D. in Technical Sciences from Kaunas University of Technology. He holds over 5 peer reviewed publications in journals and conferences.

Appendix A: Details on Quantization of Activations or Inputs With Negative Values

To convince the reader that these same formulas (see the section 8-bit Quantization of Activations or Inputs With Negative Values) generalize to convolutional layers, we use the indices of each tensor entry and work through the steps to show the convolution output. Let W_f32 ∈ ℜ^(O×C×K×T) be the weight tensor with O kernels or output channels, C input channels, K height, and T width. The modified bias can be represented as:

b′_f32[o_i] = b_f32[o_i] − (K/Q_a′) S_f32[o_i]

where

S_f32[o_i] = Σ_{c_i=1..C} Σ_{κ_i=1..K} Σ_{τ_i=1..T} W_f32[o_i, c_i, κ_i, τ_i]

and o_i, c_i, κ_i, and τ_i are the indices for the kernels or output channels, input channels, kernel height, and kernel width, respectively. The convolution output can be represented as follows. Note that we assume batch size one (to omit the batch index for simplicity), the activations have already been zero padded in fp32 format (or equivalently padded with K = 128 in u8 format), and the convolution stride is one.

x_s32[o_i, h, w] = Σ_{c_i=1..C} Σ_{κ_i=1..K} Σ_{τ_i=1..T} W_s8[o_i, c_i, κ_i, τ_i] a_u8[c_i, h+κ_i−1, w+τ_i−1] + b′_s32[o_i] ≈ Q_a′ Q_w x_f32[o_i, h, w]

Appendix B – Details on Fused Quantization With Skip Connections

The activation inputs to the layers marked by the blue arrow in Figure 7 are as follows, where layer res2b_branch2a is abbreviated as 2b2a in the equations below, with similar abbreviations for the other layers.

a_u8^(2b2a) = Q_a^(2b2a) g(D^(2a1) x_s32^(2a1) + D^(2a2c) x_s32^(2a2c))

a_u8^(2c2a) = Q_a^(2c2a) g(g(D^(2a1) x_s32^(2a1) + D^(2a2c) x_s32^(2a2c)) + D^(2b2c) x_s32^(2b2c))

a_u8^(3a2a) = Q_a^(3a2a) g(g(g(D^(2a1) x_s32^(2a1) + D^(2a2c) x_s32^(2a2c)) + D^(2b2c) x_s32^(2b2c)) + D^(2c2c) x_s32^(2c2c))

  1. The raw compute can be calculated as AVX-512-frequency * number-of-cores * number-of-FMAs-per-core * 2-operations-per-FMA * SIMD-vector-length / number-of-bits-in-numerical-format / number-of-instructions. Two 512-bit FMA units computing in parallel per core are available in the Intel Xeon Platinum processors and the Intel Xeon Gold processors 6000 series and 5122. Other Intel Xeon Scalable processor stock keeping units (SKUs) have one FMA unit per core. fp32, int16, and int8 FMAs require 1, 2, and 3 instructions, respectively, with the Intel AVX-512 instructions. The Intel Xeon Platinum 8180 processor has 28 cores per socket and 2 FMAs per core. The fp32 OPS per socket are approximately 1.99-GHz-AVX-512-frequency * 28-cores * 2-FMA-units-per-core * 2-OPS-per-FMA * 512-bits / 32-bits / 1-instruction = 3.570 fp32 TOPS. The int8 OPS per socket are approximately 2.17-GHz-AVX-512-frequency * 28-cores * 2-FMA-units-per-core * 2-OPS-per-FMA * 512-bits / 8-bits / 3-instructions = 5.185 int8 TOPS. The AVX-512 frequencies for multiple SKUs can be found here (these correspond to fp64 operations; the frequencies for lower precision are higher). The AVX-512 max turbo frequency may not be fully sustained when running high OPS workloads.
  2. In practice, these u8 values are usually closer to their minimum than their maximum when they are activations preceded by the ReLU activation function.
  3. Two 512-bit FMA units computing in parallel per core are available in Intel Xeon Platinum processors, Intel Xeon Gold processors 6000 series and 5122. Other Intel Xeon Scalable processor SKUs have one FMA unit per core.

Notices and Disclaimers:

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

​Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.


Get Ready, Get Noticed, Get Big: A Practical Guide to Marketing Your Indie Game


Intel has supported the PC gaming community since the late 1970s, when the Intel 8088 processor ran at 4.77 MHz inside the IBM PC. While hardware advances received the early headlines and large studios dominated the trade press, the role of independent game developers has always been of interest. The freshest ideas, the most interesting stories, and the most ground-breaking advances still come from the indies who bravely bring their visions to market. Their struggle to balance mastering new technology with conquering competitive marketing is growing in complexity.

Intel’s new Get Ready, Get Noticed, Get Big initiative is designed to help indie game developers with vital tools, information, and guidance during each stage of the marketing process. This marketing guide is a go-to resource packed with current content for individuals and small teams trying to get their titles noticed in the dynamic gaming market.

The mention of any particular game, product, or tool is not an endorsement.

The State of the Industry

According to Newzoo—the leading provider of market intelligence covering global games, eSports, and mobile markets—more than 2.2 billion gamers worldwide are generating an estimated United States dollar (USD) 108.9 billion in game revenue for 2017. That global market for games offers many enticing targets for indie developers.

LAI Global Game Service reports that Western Europe is now the market leader with 31 percent of the total sales, and it boasts the top spending per mobile title at USD 4.40 each. North America is in second place with slower growth prospects, but MENA (Middle East, North Africa, and Turkey) is projected to grow by 21 percent, year-to-year. Asia is growing at an annual 13 percent rate, Latin America increased by 14 percent in 2016, and Eastern Europe (and especially Russia) is another key emerging market. Even long-overlooked regions such as Southeast Asia remain largely untapped.

In early 2017, Apple reported that their App Store brought in USD 20 billion in revenue for the previous year. On January 1, 2017, they set a new record of USD 240 million in revenue in a single day. The top-grossing apps were games, and Super Mario Run* from Nintendo was the number one app overall.


Figure 1. Still popular after all these years, Super Mario Run* was the #1 downloaded app in 2016.

Statistics from the Entertainment Software Association’s February 2017 report tell an encouraging story. “Video game industry growth is likely due to (1) the rise of independent video game developers, who in 2016 made up 98.1 percent of all company additions; and (2) the increasing amount of video game studies, courses, and programs offered across 940 American educational institutions of higher learning.”

According to PCGamesN.com, Steam* will hit a record of 5,000 new releases in 2017, representing an enormous opportunity for developers. Statista breaks down the game industry numbers for 2016; and with some careful study, a clever independent could spot several profitable, growing niches. For example, should you develop for the growing elderly population in the US? According to the US Census Bureau, the US population aged 65 and over is projected to be 83.7 million people by 2050. Targeting a brain-boosting puzzle game or nutrition diary might make sense. At the other end of the scale, a game for preteens in the Middle East might offer lucrative potential.

Multiple opportunities exist for hungry independents in the games market. In a 2017 blog post, Kenneth Tran at Gamasutra.com offered this insight, “The independent games industry is currently in a state of near perfection.” Tran says the market has been “disrupted by digital distribution and self-publishing. Everyone knows this story: the rise of Google Play*, the (Apple) App Store, Unity* Personal Edition, and Free2Play*.”

The key takeaway from Tran’s blog is the concept of perfect competition. An indie game developer can create incredible beauty and compete on equal terms like never before. The barriers to entry are crumbling, with free dev kits, copious training and documentation, and multiple vendors offering assistance. When you combine that newfound muscle with the availability of extensive market research and data, the ability to Get Noticed and grow into something big is enticing. Remember, Minecraft* was a wildly successful indie game before Microsoft Windows* put it on every platform on earth, making Markus Notch Persson a billionaire. In 2016, Forbes magazine listed  Minecraft at second place in the top-selling games of all time, at 107 million copies, though still far behind Tetris*, with an estimated 495 million copies sold.


Figure 2. Minecraft* began as a popular indie game before it was purchased by Microsoft and became a household name.

Why Marketing Matters

Marketing is defined as “an aggregate of functions involved in moving goods from producer to consumer, the process or technique of promoting, selling, and distributing a product or service.”

Effective marketing can mean the difference between a little-known cult classic and a blockbuster. Marketing includes your ability to spread the word, build excitement, satisfy customers, and create enthusiasm; and when done right, it is measurable, plannable, and repeatable. If you can master shaders, physics engines, and compilers, you can certainly compete in the marketing arena. Be prepared to put in the time and effort to learn.

Game designer Sarah Woodrow estimates that developers spend only 30 percent of their working time actually coding. “The rest of the time will be everything else you need to do, especially if you’re a one-man band.” The only way to succeed if you remain small and nimble is to constantly learn and adapt. Try new angles, but fail quickly and move on. Learn how to run a business and how to do marketing and networking as you go, but be prepared to spend some time and money.

Often juggling family, work, and other commitments, indies find that the time they can devote to their projects is already squeezed, and finding time for marketing activities is difficult. Experts sympathize, but they stress that time for marketing must be found. Intel’s Patrick DeFreitas is a partner marketing manager who works with the independent gaming community. For him, the answer to the question, “When should I start marketing?” is obvious: IMMEDIATELY!

“I’ve seen reports that over 4,000 new games are launching every year,” he recently explained. “That’s 11 games a day!” He suggests you start marketing your title early. “Every day you let pass allows another stack of competitors to fall into your same genre, attracting the same customers, pulling for the same share of wallet,” Patrick explained. “You must decide how much time to devote to marketing. If you spend ‘x’ hours developing your title, you must put in an equal amount of time toward its promotion.”

Think of marketing as any activity that starts a conversation and builds a relationship with the gaming community. You’re probably marketing without realizing it. Developer blogs, websites, social media, gamer- and game-developer forums, video trailers, and a host of other tasks let you lift the curtain on your game’s development to promote its progress and features. Every chat with a potential customer is a marketing activity. Along the way, you’ll be able to gather feedback to help iterate designs and even extend the shelf life of your title.

The How to Start Getting Noticed section of this guide delves deeply into these marketing concepts, presents strategies for using them effectively, and discusses how to avoid common pitfalls.

The Right Time is Now

While it’s never too early to think about how to differentiate your game in the marketplace, waiting until release to trumpet its arrival is too late. Most game sales happen shortly after release, so it’s essential that potential customers know well in advance that your game is coming.

Marketing approaches vary depending on where you are in the development process. Use Table 1 to begin crafting your marketing strategy and creating a timeline of activities.

Table 1. Crafting the marketing strategy.

Dev Stage: Initial planning

Activity: While crafting story and gameplay, and choosing a programming language, game engine, graphics, and audio tools, look for anything unique in how you’re approaching a problem. Use it later when creating your value proposition.

Goal: If possible, identify elements of your game development process that set it apart from other games. Maybe you’re Agile with a Twist, or you strictly use student volunteers for QA. Work that into your title’s value proposition. Next, create a timeline that sets deadlines for each milestone in your marketing strategy, from defining your value proposition and writing it down to creating assets, promotional materials, and all of the subsequent steps outlined for each development stage.

Dev Stage: Asset creation, prototyping

Activity: Share game graphics and audio samples on:
  • Forums
  • Your website and blog
  • Social media sites such as Facebook*, Twitter*, YouTube*
  • Meetups*
  • Game events

When your prototype is ready to share, give keys to beta users so they can test drive the gameplay. Establish a communication channel for beta users. Be a speaker or panelist at game events. Create and share at least one trailer (or several trailers) to spur excitement about your game.

Goal: Start communicating your value proposition through visual and audio samples, as well as early gameplay experiences. Collect user feedback and use it to refine gameplay. When community feedback is positive enough, encourage people to start spreading the word about what you’re doing, and encourage them to share gameplay videos to help generate interest. Establish relationships with key influencers, including the press, whose interests align with your game. Get them talking about what you’re up to and sharing your graphics or sound clips. Post gameplay videos and interviews you’ve done about the game’s development. Use this momentum to promote interest among retail and online distribution channels.

Dev Stage: Finished development, ready to release

Activity: Hold contests and value-add promotions (tell-a-friend and get a code to unlock features or Easter eggs). Exhibit at game events, either by purchasing a low-cost booth through a booth aggregator, or by developing a relationship with a tool vendor and landing a spot in their booth. If you can’t exhibit, attend the show and bring promotional handouts, copies of your game, game keys, business cards, t-shirts, and so forth. Continue to blog about your progress and link to your website/blog wherever you can.

Goal: Increase interest and drive sales. Encourage influencers and fans to talk about your game and share their experiences to get other potential customers interested in playing and buying your game.

Dev Stage: Game released and generating interest without any marketing effort

Activity: Exhibit and speak at events, meetups, and trade shows. Hold in-store promotions or online promotions such as contests, podcasts, and value-add promotions to boost momentum as the new factor fades. Use social media and every other available channel to make people aware of where they can play and buy your game. If appropriate, release a new level periodically, or leak clues to unlocking Easter eggs and hidden features. Seed stories in magazines and online via podcasts and influencer websites to build excitement over playing your game.

Goal: Keep your audience engaged and interested in playing your game. Help people find and buy your game. Sustain momentum and rekindle interest in your game after the thrill of its newness has faded. Boost sales around holidays with special add-ons.

How to Start Getting Noticed

During the initial planning stage (see Table 1), think about your target audience and how to reach them. At the 2013 Konsoll conference, dedicated to the advancement of the Norwegian independent game community, famed IndieGameGirl* Emmy Jonassen spoke about How to Successfully Market Your Indie Game on a $0 Budget.

She told the story of Monkey Labour* from Dawn of Play, which didn’t sell many copies until they reached out for a game review at Touch Arcade. After a positive review, sales spiked 600 percent. She also mentioned Hitbox Team, which spent USD 100,000 building DustForce* and very little on marketing. A friend volunteered to write press releases, create trailer videos, manage media communications, and begin marketing efforts before launch. DustForce generated tremendous interest and awareness well ahead of release, landing over 100 articles, including a positive piece on GameSpot. Their return on investment was made in seven days, and the game quickly became profitable.

Reaching out is the first step. Rev up your social media presence, start a blog, begin networking with friends in the industry, and make new contacts—all at little cost. Read  How NOT to Market Your Indie Game at Gamasutra, by Dushan Chaciej, CEO and lead designer at Frozen District, creator of Warlocks 2: God Slayers*. He’s made every possible mistake.

Where to Start: Irresistible Promotional Materials

Create irresistible promotional materials that compel sharing and discussion. Trailer videos, screenshots, press releases, social media presence, a landing page, and a development blog are the best places to invest your time, Jonassen advises, with the trailer being particularly valuable.

Trailer Video

The game development tutorials at Envato include an Indie Game Dev’s Marketing Checklist, written by Robert DellaFave, a self-described logic nerd who founded Fourth Dimensional Gaming. It suggests that the trailer doesn’t have to be “overly flashy or dramatic, but it should leave viewers with a lasting impression of your game,” as he put it. Create a trailer using video-capture software and editing tools. Customers use trailer videos to see if they like a title’s appearance, music, art, concept, and playability. Game journalists rely on trailers to sort out the clutter of a jammed inbox.

As a good example, Jonassen pointed to the trailer video for the gold edition of the PC game Guacamelee!*, a side-scrolling action game from Drink Box Studios. The trailer, created by Kert Gartner, a noted leader in trailer production, is entrancing from the first 3–5 seconds, with artwork, action, and music all combining into a happy frenzy. Lasting less than 60 seconds, it contains testimonial pull-quotes showing that it was updated after positive reviews came in, and it ends with a call to action and the logo. It’s fun, frantic, and leaves you wanting more.


Figure 3. The trailer for the gold edition of the PC version of Guacamelee!* is instantly funny, engaging, and lively, and lasts just 59 seconds.

Virtual reality (VR) games are a new challenge for creating trailers, as you’ll need to convey the excitement of a 3D world in a 2D video. Gartner is experimenting with mixed-reality environments to film players against a green screen, and also using avatar-based trailers to enhance the visual appeal. Such efforts can be time-consuming and require enormous processing power to play, record, mix, and stream, but the new Intel® Core™ i9-7980XE Extreme Edition processor can do it all from a single system.  For more information about simultaneously handling VR game-trailer production tasks that previously required multiple computers, check out this article at the Intel® Developer Zone (Intel® DZ).

Screenshots

Screenshots are another important weapon in your arsenal. They should be high resolution and well-lit, with excellent composition. Avoid dark images, skip the menus and interface unless they are part of your genius, and concentrate on the beauty of the creation. When choosing a screenshot, pull an engaging scene that captures what DellaFave describes as “your game’s most magnificent moments.” Get your viewers to sink into the art and want to see more, like in this scene from Secrets of Raetikon*, created by the Viennese team Broken Rules.


Figure 4. This screenshot from Secrets of Raetikon* shows the dreamy, atmospheric, 2D adventure game.

Press Releases

Press releases have to engage immediately, with passion and sizzle. Keep in mind that stressed-out deadline writers will like pieces from which they can cut and paste, so make sure your writing is good—don’t assume it will get cleaned up before publishing. Jonassen advises that the first paragraph is crucial, and it must summarize all the key points you want to convey. Place the reader into the game, give them a point of view as the player, and sell the sizzle.

Fact Sheets

DellaFave also suggests creating one-page fact sheets with links to your website, landing page, and developer’s blog, plus contact information, website addresses, team history and pedigree, and other titles you have worked on. Readers like quotes because they bring more life to a written page and make the writing more genuine and relatable. Include quotes in your fact sheets when possible.

The screenshots, trailer, press release, and fact sheet are crucial pieces of your promotional materials. Check out presskit()* for guidance. It’s a free resource for beginners who need help getting traction or need templates to speed up their work. See also this infographic from Entrepreneur* Magazine for some timely writing tips.

Landing Page

Once you have someone’s attention, you should drive them somewhere, and that’s where a good landing page comes in. Create a unique web address to convert visitors into customers, and include an instantly-recognizable button for game purchases. The only navigation off that page should be to the purchasing process or to provide more information about the company. The page should be stocked with screenshots, testimonials, and other art. Your landing page should also be easy to share via Facebook, Twitter, LinkedIn*, and Google+*.

Developer’s Blog

Another key to initiating and maintaining customer contact is the developer’s blog, which is one of the best ways to reach out to your fans. Jonassen says that sites with a dev blog bring in 55 percent more traffic than those without blogs. She advises that you post at least once a week.

At becomingminimalist.com, writer Joshua Becker gives some important reasons why he believes more people should blog. Number 4 on the list is, “You’ll develop an eye for meaningful things.” Once you start thinking of your blog as a weekly task, you’ll find yourself making notes about things that readers might like to read about. You will soon realize that you have plenty to say. And because this dynamic industry is constantly evolving and you’re surrounded by interesting, creative personalities, you can cultivate those relationships and build new ones by staying in contact through your blog.

Good dev blogs include eye-catching artwork and give subscribers an easy option for an RSS feed, email subscription, or social subscription. With each post, you’ll ring the bell and get fans (and the press) to return. Include a download button or link to make it easy to pull a demo, or the actual game if you’re in the revision phase and want to keep interest up.


Figure 5. Dev blogs such as this one from Drink Box Studios are a key to driving traffic and continually engaging your customers. Note the easy subscription buttons on the right.

Reaching Out to the Press

Be systematic about contacting the trade press and requesting they officially review your game. There is a herd mentality to getting reviews; once you get that first one, you’re in the herd, and more will follow if you keep reaching out. Some of the many places to find reviewers to choose from include GameSpot, Gamasutra, VentureBeat, IndieGames, GamesIndustry.biz, and Polygon.

Be as methodical about generating press coverage as you are about tracking down a compiler bug. First create a list of press contacts and then grow that into a spreadsheet where you can track what you sent, when you sent it, the response you got, and so forth. To build the list, collect business cards at conferences and conventions, get contact information from bylines on websites, and spend time scouring the web for names. Networking can help—share with other indies and always be on the lookout for new tips. Turnover in the industry is non-stop, so maintaining your press contacts list will require time, but it’s solid gold.

Once your promotional materials are ready to go and have been scrubbed for errors, contact game reviewers via social media and email. Create a template for introducing yourself, your team, and the game, but be sure to personalize the message for each reviewer. Answer the question, “Why should this reviewer care?” Make your pitch compelling, clear, and concise. This is your elevator pitch, your reason for being; take time to polish it into a shiny object.

Following Up

Always follow up with your press contacts and reviewers, especially when you break through and get some much-coveted publicity. Jonassen told of an example where she followed up after a review was published and thanked the writer for taking an interest. She got a reply, made a friend, and expanded her network. That contact now republishes her press releases or passes her info along without fail. And she keeps in touch by faithfully commenting on new articles that her contacts publish. When she sees her contacts at game shows or conferences, the relationships are more genuine.

Your marketing campaign goal is simple—to build and maintain an adoring fan base. Make it easy to discover your game, using social media to post updates. Post on social media daily, if possible, even if you’re just passing something along. Think of it as the “I’m alive” circuit for your marketing efforts. Be active in forums and blogs, and participate in game jams and other events. Show up at your local game development community events. They’re desperate for the help and welcome new faces as well as old veterans. Start a crowdfunder not just to pull in money but to help with your online presence.

Convert Visitors into Active Fans

When you create and update promotional materials and regularly post a blog, you systematically convert casual visitors into paying customers. Even better is when you upgrade your customers into active fans. Regular news about your game and your company is vital to your growth. Nurture your fan base with personal touches, and respond quickly to questions. But don’t just post and respond—drive the conversation with dialogue starters. Make fans feel like part of the process by seeking input on certain decisions. And remember to remain professional when responding to trolls and critics. Simply explain the reason for your decision, but don’t throw gasoline on the fire and engage in a two-week flame war.

According to Jonassen, the Canadian team at Sauropod Studio developed Castle Story* with some polite interest but not many sales. They hadn’t put much effort into sales materials, and their first demo was too short. But once they had their 11-minute story told and recorded, it got posted to Reddit. Within an hour they started getting an avalanche of serious interest from the gaming community, which they were eventually able to convert into sales.

Examples of starter promotional materials such as those for Castle Story are available across the Internet. If you know of a game that does a good job, start at their website and download their materials to get an idea of what you need.

Maintain Your Marketing Momentum

Once you have a playable demo, you’re ready to step into the limelight either as an exhibitor or a vendor. Or create your own event. For example, Inc. Magazine has an eight-point primer on staging your own event. They describe how to employ GPS tracking, add augmented reality, manage social media, and more. Some of their advice may seem costly, but many examples don’t require a big budget. The key is to plug away on social media with continual build-up and updates, via your Twitter handle, Facebook page, and blog posts.

Contests are another inexpensive way to create value-add promotions. Insert Easter eggs into your game around major holidays or in conjunction with an upcoming event. Enlist the aid of your customers by suggesting that if they contact a friend, they’ll get a secret code. This approach was hugely successful for games like Candy Crush*, created by United Kingdom-based King.com, which reached 500 million downloads and an average of 6.7 million daily users by 2013, with an average daily income of USD 633,000 at the iOS App Store alone. In 2015, King was purchased by Activision for USD 5.9 billion.

Podcasts and interviews are another way to spread your message. With the right interviewer, you can talk about your philosophy, passions, and motivations. Interviews are easy to share and blog about when they’re recorded as videos. Remember that if you’re not in the right frame of mind, or your interviewer isn’t well prepared, the results live forever. So come prepared, be alert, and preferably know your questions in advance.

Common Mistakes and Pitfalls to Avoid

Keep these points in mind while working on your marketing activities:

  • Establish a schedule of marketing activities and commit to it. Don’t stop.
  • Plan to spend at least as much time and energy on marketing activities as you do on development. Some experts advise shifting to a ratio of one-third development, two-thirds marketing.
  • Be careful about what you share—don’t give away your secret sauce. Also, if you provide a peek at what you’re doing too early, you can build up a buzz that you can’t satisfy quickly, and attention will drift away.
  • Use a public voice in your blogging and social media that addresses your audience as your peers, not your minions. Keep a sense of humor, humility, and wonder about the road you’re on.
  • Know the roles of your influencers and reviewers. For example, if you’re building a PC game, don’t contact people or publications solely focused on mobile games.
  • Tell your game’s story with an eye toward its value proposition. Whatever it is that makes your game special, be it art, design, story, music, or cleverness, sell that continually. Entrepreneur.com has a good tutorial on developing a value prop, and it’s crucial. Of course, be careful not to oversell it in the process. Let others draw their own conclusions about the game’s quality.

What Makes You Unique?

Your first marketing strategy step is to develop your value proposition. If you know what makes your title unique, and what the target market should see in your product, you’re way ahead. Recall the first brainstorming sessions you conducted, and remember what motivated you to design the game. If you remember thinking, “There’s never been a game like this,” or “Nobody has ever done this before,” you have the beginnings of a unique story. Use the concepts that help you stand out and make you different to develop your company slogan and focus your marketing activities.

To flesh out your ideas, follow these steps from sitepoint.com, written by Alyssa Gregory, founder of Small Business Bonfire, a social, educational, and collaborative community for entrepreneurs.

  1. Describe your target audience. Are they using PCs, smartphones, or tablets? What is their age group? Do they like sports, technology, or just a quick puzzle or game?
  2. Explain the problem you’re solving. Why does the world need another shooter? What’s unique about your puzzle game?
  3. List the big benefits. Will they be entertained, mystified, challenged, or otherwise satisfied?
  4. Define your promise. Do you vow to be the most intriguing, have the most engaging theme or the most beautiful art, or stay true to your mission? A big benefit to any company is to share in a common promise that people can rally around.
  5. With the thoughts from the first four steps, write a full paragraph with three or four full sentences—aim for at least 60 words.
  6. Take out the chainsaw, cut that word salad down, and saw off the sharp edges. Smooth the slogan until it’s memorable, repeatable, and your whole team agrees with every word.

Here are just a few examples of popular value propositions from the business world:

  • Fast, incredibly fast relief. – Anacin*
  • Melts in your mouth, not in your hand. – M&M* Candies
  • Clinically proven to reduce dandruff. – Head & Shoulders*
  • You get fresh, hot pizza delivered to your door in 30 minutes or less—or it’s free. – Domino’s* Pizza
  • When it absolutely, positively has to be there overnight. – FedEx*
  • Get Met. It Pays. – Metropolitan Life*
  • We are THE low-fare airline. – Southwest Airlines*

At Convince & Convert, they list things to consider when developing a value proposition:

  1. Unabashedly appeal to your ideal customer. For example, Abercrombie & Fitch* says its ideal customers are cool, good-looking people. They’re focused on a segment, not mass appeal.
  2. Use unique personalities. If you have a personable, identifiable leader, use him/her!
  3. Avoid the superstar rat race. Don’t strive to be the best—stand out with a unique approach.

Demographics

To create a value proposition, you must know your target audience. How old are they, are they male or female, do they come from one region or are they global, and what are they interested in? What are their buying habits? What makes them tick?

At GameRefinery, Joel Julkunen wrote an article about target audiences and competitors. As the leader of GameRefinery’s analytics department, he creates algorithms and statistical models that pull the data apart and make it understandable. He understands the marketing challenge that game developers face. “The natural strategy is to make a game that appeals to your target audience AND at the same time stands out from the crowd of similar games.” If you’re writing a role-playing game (RPG), you need to appeal to traditional RPG enthusiasts. But at the same time, you must differentiate yourself, or you’re just another title.

Julkunen suggests plotting your game in a matrix to determine where it sits in the cognitive and dimensional spectrums. The horizontal axis runs between physical (sensomotoric) and mental (cognitive) skills, roughly between acting quickly and acting correctly. Think about the skills your game aims to teach. Strategy games focus on a player’s cognitive skills by emphasizing tactical thinking, puzzle solving, and pattern detection; speed is not important, but being logical and methodical are. Shooting games challenge players to develop sensory and motor skills such as speed, aim, and reaction. Plot where your game falls on this axis.


Figure 6. Simple 2x2 matrix used by Julkunen to show where games fall based on mental versus physical, and complex versus simple axes. The real-time strategy (RTS) game Dune* would be just right of center on the X axis, but toward the very top of the Y axis (source: GameRefinery.com).

The vertical axis differentiates between core layers to model complexity. As Julkunen describes it, single-dimensional games are simple, because they focus around one core layer, such as repeatedly solving scrambled word puzzles. At the opposite end of the spectrum are games like Clash of Clans*, a mobile strategy game that requires multidimensional thinking about planning, asset optimization, and resource allocation. An example of an exceedingly intricate challenge would be a real-time strategy game like Dune*, created by Paris-based Cryo in 1992. To win, players must balance offense and defense, create buildings or weapons, plan assaults, conserve resources, and watch for sporadic sandworms, all in real time. There is a lot of clicking, but not a lot of aiming. You now have a two-by-two matrix to plot where your game sits: physical versus mental, and simple versus complex.


Figure 7. In Angry Birds*, players master the physical task of pulling the rubber band on the catapult while calculating how explosions will destroy structures and remove obnoxious pigs. This game includes physical AND mental components in a multilayered challenge.
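
If you’d rather sketch Julkunen’s matrix programmatically than on paper, the short Python sketch below plots a few titles on the two axes. It assumes matplotlib is installed, and the game names and coordinates are illustrative guesses, not GameRefinery data.

    # Plot games on Julkunen-style axes: X = physical (sensomotoric) versus
    # mental (cognitive) skills, Y = simple versus complex core layers.
    # All coordinates below are illustrative guesses, not GameRefinery data.
    import matplotlib.pyplot as plt

    games = {
        "Word puzzle": (0.85, 0.10),     # mental, single core layer
        "Arcade shooter": (0.10, 0.20),  # physical, fairly simple
        "Dune (RTS)": (0.55, 0.95),      # mostly mental, very complex
        "Angry Birds": (0.45, 0.60),     # mixed skills, multilayered
    }

    fig, ax = plt.subplots()
    for name, (x, y) in games.items():
        ax.scatter(x, y)
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

    ax.axvline(0.5, color="gray", linewidth=0.5)  # quadrant dividers
    ax.axhline(0.5, color="gray", linewidth=0.5)
    ax.set_xlabel("Physical (sensomotoric) to mental (cognitive)")
    ax.set_ylabel("Simple (one core layer) to complex (multidimensional)")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    plt.show()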

With your matrix done, build on your assumptions about games you’re already familiar with, and determine what player types to attract. Also consider what you know about top-selling games and decide if you have an easy story to tell that appeals to buyers of equivalent best-selling titles. Study successful franchises for insights into their appeal, approach, marketing, outreach, and other tasks—good examples are everywhere.

Personas: Mythical Prototypes

While in development, create a persona—a mythical prototypical consumer—to guide you along. A persona represents an important cluster of behavior patterns that can be grouped by purchasing decisions, adoption of technology, lifestyle choices, service preferences, and other behaviors, attitudes, and motivations. As you identify these patterns, you can create a generalized character to represent the entire segment.


Figure 8. A detailed persona capturing the motivations and tendencies of a certain gamer type.

At the Gurusability blog, Papa_Lamp discusses some overall persona themes in the gaming world. He cited an article in Gamasutra by Flavio Camasco debating the difference between hardcore and casual gamers, and he argues that a serious Journey* addict is just as hardcore as someone who plays a lot of Shovel Knight* or puts in long hours managing the farm in Stardew Valley*. He argues you might as well differentiate solely on the amount of time a player commits to a title.

Papa_Lamp then describes the gathering of metrics to determine player behavior. He discusses work by Lennart Nacke, from the University of Ontario Institute of Technology, who presented a talk at the 2009 Canada Game Developers Conference (GDC-C). Nacke advocated the use of gameplay metrics to help identify and build personas, and while data can be hard to track down, the premise remains true. Nacke suggested mixing qualitative and quantitative metrics, and described how those feed into the bigger picture to inform game design.

Kevin O’Connor, president of user insight at UXMag, calls personas “the foundation for a great user experience,” and says personas should hold true regardless of age, gender, or education.

O’Connor recommends conducting one-on-one interviews with at least 30 people before studying the results to watch what patterns evolve. He also suggests the interviews be conducted in context, such as where the gamer plays, to ensure there are no missed environmental clues. Such formal studies can cost around USD 35,000 and take three to six months—several lifetimes for an indie developer. You’ll have to use your own insights and anecdotes to replace a formal report, but the underlying science is important to know.

A few online tools are available to help you create a persona, including UpCloseAndPersona.com and ImFORZA. The tools are only as good as the assumptions you use to create them, but they’re a start. You can create simple personas based on a fictional name, an assumed set of details about their background, and a simple statement about their goals. For example, Andre is a French hipster with too much spare time who wants to be entertained by a challenging racing game. Or Tomoko is a middle-aged Tokyo woman with a demanding schedule who needs an easy game or puzzle while riding a commuter train.

If you’ve interacted with the gaming community at large, you probably already know a lot about your ideal customer segment. You may already have an idea of who your target audience is, based on feedback you’ve received. Obviously, the more time and money you spend to identify the persona’s traits, the better.

Once you’ve started selling your title, you can easily answer questions about your customers with quick polls and surveys. SurveyMonkey*, PollEverywhere, Typeform, SoGoSurvey, and many others exist solely to help you ask questions and get answers as you strive to Get Big.

Competitive Analysis

Another key tool in your marketing arsenal is a competitive analysis, which is a broad statement about your business strategy and how you relate to the competition. The more you know about the companies battling in your space, the better. According to Entrepreneur.com, if you can build a clear picture of your competition, you’ll understand their strengths and weaknesses. “With this evaluation, you can establish what makes your product or service unique—and therefore what attributes you should play up to attract your target market,” they write.

Inc.com suggests asking these questions about your competitors, adapted for the gaming world:

  • What are their strengths? Artwork, theme music, playability, extensibility, community following, and established presence are all areas where you may be vulnerable.
  • What weaknesses can you take advantage of? Maybe they are under-staffed, over-worked, under-funded, or missed their target date. Maybe their music is basic, their art is bland, but they have a killer artificial intelligence.
  • What are their basic objectives? Do they seek to gain market share? Do they attempt to capture premium clients? See your industry through their eyes. What are they trying to achieve?
  • What marketing strategies do they use? Look at their advertising, public relations, and so forth.
  • How can you take market share away from them?
  • How will they respond when you enter the market?

This is where you set aside your developer’s hat and put on the business person’s uniform. Think of this process as its own puzzle game that requires quick thinking, long-term strategy, and fast reflexes. Make it fun!

Gather competitors’ information from their websites, such as the size of their team and their expertise. If you don’t yet know who your competitors are, talk to trade show attendees, read community boards, explore gaming events, and talk to sales people. Build a spreadsheet or a grid where you can collect the information. You may not know a competitor’s annual sales, but you can use high, medium, and low for starters.

Try tracking down these key differentiators while building your grid (a starter script follows this list):

  • List similar titles to your game.
  • Estimate their pricing model.
  • Figure out where they distribute.
  • Determine the team size.
  • Analyze their strengths and weaknesses.
  • Locate them on a map.
  • Guess the strength of their reputation.
  • Weigh their commitment to your genre.
  • Rank their threat as strong, medium, or low.
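
As a starting point, here is a minimal Python sketch that seeds the grid as a CSV file you can open in any spreadsheet; the column headings follow the differentiators above, and the competitor names and ratings are placeholders, not real data.

    # Seed a competitor-analysis grid as a CSV; refine the entries as you learn.
    # All competitor names and ratings below are placeholders.
    import csv

    columns = ["Title", "Pricing model", "Distribution", "Team size",
               "Strengths", "Weaknesses", "Location", "Reputation",
               "Genre commitment", "Threat"]

    rows = [
        ["Competitor A", "freemium", "Steam", "medium",
         "killer AI", "bland art", "Berlin", "high", "high", "strong"],
        ["Competitor B", "one-time fee", "itch.io", "low",
         "great music", "missed ship date", "Austin", "low", "medium", "low"],
    ]

    with open("competitors.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)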

Once the grid is built, fine-tune your assumptions, keep gathering information, and keep asking questions. Your future depends on your insights into your competitors and how fast you can get into the market. For example, understand the concept of MVP—Minimum Viable Product. At Agile Alliance, the term refers to how many bells and whistles should exist in a demo to draw instructive feedback. If you extrapolate the concept to competitive analysis, you can determine if you can safely cut levels, complexity, characters, or structures. If your competitors don’t have 50+ levels, 25 weapons, or 12 character options, you probably don’t need those either to release your title.

Talk with anyone you can reach. If you find a competitor at an event, ask them a few questions in person, if possible. Who knows—you may make a friend who could become a partner some day! After all, Digi-Capital* reports that 2016 was a record-breaker for mergers and acquisitions in the games industry.

Strategy and Goals

Your ability to set realistic, achievable marketing goals and strategies is crucial to Get Big. That starts with writing them down. Tadhg Kelly, creative director at Jawfish Games, said it best in a GamesBrief article about the biggest marketing mistakes that indies make. His succinct answer: “Making a game that has no marketing story.” In the same article, Oscar Clark, evangelist for Applifier, says teams should always ask “So what?” In other words, so you’re making a great (insert genre) game…so what? Is that enough of a marketing story?

At GameSparks.com, the team created a blog devoted exclusively to game marketing. They suggest two artifacts to guide your marketing efforts: a marketing strategy and a marketing plan. The strategy sets the overall objective, while the plan lays out the tactics that get you there.

Table 2 includes some keys to what GamesBrief considers a good marketing strategy.

Table 2. Keys to a good marketing strategy.

Element: Description
GaaP versus GaaS: Are you a product or a service? Will you have frequent updates with new add-ons that a subscriber would get excited about? Or will you launch a product and then move on to the next project?
Business Model: Will you charge a one-time fee, or give the title away and collect cash via in-app items?
Target Audience: Define your distribution and marketing choices by who you are targeting, and not the other way around.
Platforms and App Stores: Unless you plan a multiplatform launch, your decisions about console versus PC, smartphone versus tablet, and Xbox* versus Nintendo are vital to know up front.
Geography: Are you going global or staying regional? Do you have translation services, or are you restricted to a single geography? The fewer words on your screens, the less you have to translate, so your early design decisions could be guided by these answers.
Budget: Even if your budget is small, you have time. Always think about where to spend money and time. And consider kill criteria, the point at which you stop spending any effort on a project.
Marketing Channels: Events, reviews, ads, launch parties, blog posts, social media, and other channels can help get your word out. Which ones seem right to you?
Measurement: How can you gather statistics to better allocate your limited resources?

For more information, read How to Publish a Game, by Nicholas Lovell. It’s packed with tips, tools, and strategies. Black Shell Media provides similar tips while offering a full range of marketing solutions. An academic paper by Peter Zackariasson and Timothy L. Wilson at academia.edu goes into more depth.

Once the overall strategy is in place, or at least forming, develop specific tactics for each sector. See these posts and articles for valuable information: 

  • Anecdotes about successful video game marketing from Creative Guerilla Marketing.
  • At entrepreneur.com, Mike Templeman discusses specific tactics to capitalize on Pokémon Go*.
  • Read about the battle between Xbox One* and Sony PS4* at the Strength in Business website.
  • An article from David Murdio gives digital marketing tips for video games, which follows up on his article about video and social media marketing tips.

Your marketing plan is a compilation of the strategies and tactics you intend to use. Keep in mind what US boxer Mike Tyson, famous for knocking out opponents early in his matches, said when asked how well he carried out his plan after entering the ring: “Everyone has a plan until they get punched in the face.” Be prepared to adapt to changing conditions in the marketplace—flexibility is vital.

Marketing Goals

Management guru Peter Drucker has a famous saying, “You can’t manage what you can’t measure.” His point is simple—if you don’t know a statistic before you make a change, you won’t know how much impact your change had on the result. Software engineers are familiar with the principle of changing only one thing at a time to see if they moved the needle on a measurement. If you use a shotgun approach and try several tactics simultaneously, you might not learn which tactic had a big impact.

For indie developers, that might be a hard discipline to follow because you usually don’t have time to try one tactic and measure the results. What you can do is try to devise statistics that capture the effects of a single change. For example, gather Google stats on website traffic, then start blogging more often and measure the changes. Track the number of your Twitter and Facebook followers, and determine the rate of change when you post a video versus a simple comment.

Sometimes your data gathering may consist of simply, “Hey, traffic is up.” You can get a bit more scientific and measure the before-and-after results as you go. Then you can revisit your tactics and goals, and give yourself a specific task, such as, “I want to increase traffic by 10 percent in the next two months.” This gives you something to execute against, focusing your efforts.
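
To make that concrete, here is a minimal Python sketch of the before-and-after arithmetic against a 10 percent goal; the visit counts are hypothetical.

    # Did the single change you made hit the traffic goal?
    # The visit counts below are hypothetical.
    visits_before = 4200  # monthly visits before the change
    visits_after = 4750   # monthly visits after the change
    goal = 0.10           # target: a 10 percent increase

    change = (visits_after - visits_before) / visits_before
    print(f"Traffic changed by {change:.1%} (goal: {goal:.0%})")
    print("Goal met!" if change >= goal else "Keep iterating.")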

Lead Generation

A sales team lives and dies with their lead generation tactics. Because one of your hats is to lead your game’s sales efforts, you must understand the term. Lead generation is the concept of developing a list of names that you can hopefully turn into sales.

On this topic, a web search will pull up lengthy advice from Hubspot, InfusionSoft*, ThriveHive, Lynda.com*, and Salesforce. Most of them want to sell a service, and some, like Unbounce, MarketJS, and DuctTapeMarketing, will help you turn the task into a game of sorts, or offer tips.

Here are some ways to start generating leads:

  • Create a new demo video and circulate it far and wide.
  • Check your website to make sure the call to action still stands out after you squint at it like a cowboy in a dust storm. If it becomes invisible, fix it!
  • Obey Hick’s Law of web design—give your website visitors fewer choices, not more. Focus is good.
  • Capture email addresses in exchange for content.
  • Use services such as FollowerWonk to identify leads on Twitter.
  • Try tools like Quora that use yes/no questions to track links. Here’s a link to a case study about building connections that turn into conversions.
  • Post presentation slide decks to sites such as SlideShare. According to one case study, SlideShare has 70 million visitors, and the site is addictive. Be sure to include a link back to your landing page where your readers can get more information.
  • Speak at events. Look for opportunities to talk about your journey, and get used to making humble brags, spreading around the credit, and thanking your long-suffering significant other in public.
  • Update your email signature. Make sure it includes your contact information and logo. If you recently won a contest or landed some good praise, update your signature block.
  • Try renting an email list. LaunchBit is a good place to start, among many others. 

All of these tips have one goal: build a better list of leads. The end goal is converting those leads into sales. Think of your efforts as beginning a conversation. Your job is to continually create new content to share, new conversations, and new ways to engage your growing fan base. Building a buzz takes a brick-by-brick mentality, and it’s much like tending a garden.

In one of the great post-mortems at Gamasutra, Rob de Lara described one of his problems in getting NyxQuest: Kindred Spirits*, an award-winning action-based platformer for WiiWare*, completed on time. “I know the many hats issue is a common wrong for indie developers, but it took me totally by surprise,” he said. “I didn't expect the management, paperwork, and PR requirements of a video game to take so much time. We had (and still have) to devote a lot of time to write emails, request reviews, prepare trailers and screenshots, and answer interviews. After a few months, we feel that there are still a lot of people who haven't heard of NyxQuest. Some magazines have nominated our game for Best Sleeper Hit, and there's a reason for that. Hopefully, we will be able to address this issue and create more buzz for our next game. We wanted to create a nice blog, dev diary, and additional media content, but because of the enormous amount of work, we had to leave it for the future. Lesson learned: PR is a huge area that requires full-time dedication. The more time you spend here, the better the awareness of your title will be.”

If you are going global, tailor your messaging to individual regions. This obviously takes more time and effort, but trying to use a one-message-fits-all approach could shortchange your relationship-building efforts. Similarly, if you’re collecting demographics data that seem to be pointing you in a certain direction, go with that flow. Building a true buzz in one demographic can help your game catch fire in other sectors. But it takes a spark to get it going.

Creating a Brand

Your brand sums up your appeal, positioning, persona, attitude, and design choices, all in one subtle statement. According to Inc. Magazine, killer brands all do the same things well:

  1. Focus on a single brand.
  2. Snag a good domain name.
  3. Keep it simple.
  4. Choose one: descriptive, evocative, or whimsical.
  5. Avoid branding by committee.
  6. Apply your brand consistently.
  7. Protect your brand.

Forbes has a great checklist for creating a great brand, and so does Branding Strategy Insider. See Strategic Thunder’s list of questions to answer; Brand Butterfly also has some good bullets to consider. Whatever you come up with, maintain it religiously to establish your corporate identity. The brand should be splashed all over your website, business cards, landing page, contact page, download page, dev blog, and other marketing materials. Each facet of your company—from audio to video—needs to be consistently and appropriately branded.

Multiple books and articles have been written about the top branding mistakes companies make; Entrepreneur.com, Precision Intermedia, All Business, and Inc. are just a few. Read them, and extrapolate to the indie game dev world. For example, Xerox is a global term for photocopying—but the company once tried to kill the use of “xerox” as a generic verb. Esurance gave in to a few critics and killed their beloved babe mascot Erin Esurance just as she was gaining serious traction. Colgate thought it could pivot from toothpaste to packaged food, even though the two are hardly related. Burger King*’s creepy King mascot was thought to be a sure winner in the boardroom, but it wasn’t. A Chevy executive once demanded that employees drop the beloved short-hand term and use the whole word—Chevrolet.

Working Without Deep Pockets

Indie budgets are notoriously thin. Nevertheless, start budgeting and tracking costs and expenses because when you start that second title, you can refer back to these costs.

Budgeting is more than just wishful thinking. As you become more of a business leader, you’ll have to become familiar with terms such as return on investment (ROI), risk versus reward, and cost versus benefit. If you don’t know the costs, you can’t calculate the numbers.

ROI is an attempt to use data to steer decisions. If you expect to invest USD 100 on productivity tools, you’d better see at least USD 101 in return on that investment. You could get better returns on upgrading to Unity* Pro or buying more RAM for your main system. The art in calculating ROI isn’t in the numbers; it’s in how you put a number on things that are difficult to quantify.

For example, what is the expected benefit of using an agency to create your branding? Let’s say the cost would be USD 8,000. What’s the benefit of offloading that task to a vendor rather than your overworked teammate? How do you measure the expected (or at least, hoped-for) outcome? Putting a dollar figure on not having to do everything yourself is not easy, but it’s certainly worth doing.
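
The arithmetic itself is simple. The following Python sketch uses the hypothetical USD 8,000 agency example above; the estimated benefit is an assumption you would have to supply yourself.

    # ROI = (benefit - cost) / cost; a positive value means the spend paid off.
    # Both figures below are hypothetical.
    def roi(benefit, cost):
        return (benefit - cost) / cost

    # USD 8,000 branding agency; you estimate the offloaded work plus a
    # stronger brand is worth USD 10,000 to you.
    print(f"ROI: {roi(10000, 8000):.0%}")  # prints "ROI: 25%"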

Figure 9. To calculate an ROI requires a firm grasp on inputs and outcomes.

You can find ROI calculators at many places on the web; Financial Calculators, Easy Calculation, and Money-Zine*, are just a few examples.

Cost-Benefit Analysis

Cost-benefit analysis starts with being systematic and data-driven about where you spend your time and efforts. Multiple online tools and cost-benefit analysis explanations are available at The Balance, Mind Tools*, Chron, and Investopedia*. Allocating time and creating an efficient daily routine may seem difficult when you’re juggling school, work, relationships, and physical well-being, but it helps to have a plan. And in that plan, get granular in your indie project’s budget. Don’t simply allocate 10 hours a week to game-making; break that down further so that your marketing efforts don’t get overlooked as your journey continues.

Elon Musk, the famed South African entrepreneur who changed the world through Tesla*, SpaceX*, and other endeavors, famously declared that he hasn’t read any books on time management. But he manages time in his own way. “It’s very important to have a feedback loop, where you’re constantly thinking about what you’ve done and how you could be doing it better. That’s the single best piece of advice: constantly think about how you could be doing things better and questioning yourself.”

If you have enough data to calculate the ROI on a decision such as branching out to multiple geographies, you’re far ahead of most indie developers. More commonly, game devs make their ROI decisions based on intuition, which is hit-or-miss at best. The good news is that if you at least attempt to calculate ROIs and cost-benefit analyses, you are more memorable to investors. And that alone would make exploring data-driven marketing a little more important.

Metrics: In Data We Trust

Consider this graph of game sales for Beat Hazard*, an indie title from Cold Beam Games. This galactic arcade shooter set to the beat of a player’s chosen music hit USD 2 million in total sales.


Figure 10. Sales graph for Beat Hazard*, showing sales on the Y-axis and time on the X-axis. Marketing around holidays defied the typical initial burst/long tail pattern for most games (source: ColdBeamGames.com).

Most games start with an initial peak and taper off over time, and Beat Hazard’s launch was typical. This particular title benefited from incorporating new content in the game around holiday themes, generating new spikes from refreshed gameplay.

The best indie developers are keen on gathering metrics for everything they can find. Think of all the different aspects of your business that you can track: 

  • Buzz on social media and number of reviews, downloads, and visitors
  • Number of likes on YouTube and Facebook
  • Number of followers on Twitter, Facebook, and Google+
  • Number of market influencers you’ve impacted

Under Drucker’s maxim that you can’t manage something if you can’t measure it, you need to gather metrics on all your key marketing tasks.

In addition, you must recognize when an idea didn’t work. If you had a marketing goal to increase traffic by 10 percent with a new video, and there’s barely a bump, then something wasn’t right with the video, the distribution, the timing, or more. Perhaps the answer lies in the visitor comments. Try again with a new video that is different and completely fresh.

At Developer.com, the staff wrote an intriguing article entitled I’d Rather Be Coding: Gathering Metrics. It explains why gathering metrics is as important to beginning marketers as it is for project managers.

Analytics

The Google Analytics* service is a big advantage for today’s indie developer. Check out the success stories Google compiled at their site, or read some of their tutorials.

Gamasutra has an article by Nemanja Bondzulic, “Google Analytics service in Games,” about tracking how users interact with SUPERVERSE*, an online arcade space shooter. The team needed to know the most popular hardware configurations that users played the game on, so they gathered that information by tracking usage, which proved helpful for their future planning.


Figure 11. Google Analytics service can reveal important data about user events (source: GamaSutra.com).
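
For a desktop title like SUPERVERSE that has no web page to tag, events can be reported through the Google Analytics Measurement Protocol. The Python sketch below is a minimal illustration of that idea, not the SUPERVERSE team’s implementation; the tracking ID and event fields are placeholders, so check Google’s Measurement Protocol documentation before relying on specific parameters.

    # Send a hardware-configuration event to Google Analytics via the
    # Measurement Protocol. The tracking ID and event values are placeholders.
    import uuid

    import requests

    payload = {
        "v": "1",                       # protocol version
        "tid": "UA-XXXXXXXX-Y",         # your GA tracking ID (placeholder)
        "cid": str(uuid.uuid4()),       # anonymous client ID
        "t": "event",                   # hit type
        "ec": "hardware",               # event category
        "ea": "gpu_detected",           # event action
        "el": "Intel HD Graphics 620",  # event label (example value)
    }

    resp = requests.post("https://www.google-analytics.com/collect", data=payload)
    print(resp.status_code)  # 200 means the hit was received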

Some pitfalls to avoid when gathering analytics include:

  • Information that becomes dated quickly due to external events, changes in your game, or other variables.
  • Relying on a single source of statistics; always seek confirmation of the data you gather.
  • Drawing conclusions from statistics for one region that may not apply to other geographies; as with all measurements, use some judgement.

Forbes published an interesting article that describes the 12 most effective SEO strategies for 2017. In it, John Rampton talks about content length as a key factor in ranking position. When writing your blog posts, for example, avoid the tendency to stop too soon. According to Rampton, “virtually every study done to date shows a correlation between longer content and higher rankings. Some suggest 1,200–1,300 words, while others say 1,500 words should be the minimum. If you want your content to rank, aim for a minimum length of 1,200 words for standard blog posts, and 2,000 words+ for [timeless] content.”

As you gain marketing expertise, become familiar with search-engine optimization (SEO). Make sure your game shows up in a results list when a consumer searches for titles similar to yours. The web contains plenty of information on this topic, and an article from Moz is especially helpful with an eight-step process to get you to that promised land.

Marketing Channels

Marketing channels are how goods and services flow to consumers. Game titles can move directly from the creator to the customer via their own website, for example. Or a retailer can be involved in selling boxed games of top titles.

Your choice of distribution channel(s) depends on how hard you want to work. If you collect the cash directly via your website, you’ll become a PayPal expert, you’ll be generating unique product keys, and you’ll be chasing accounts. This method can become quite time consuming.

Channel marketing includes digital marketing, direct marketing, email marketing, and more. Trends come and go quickly, so you’ll want to collect metrics to determine if email campaigns are working better than banner ads for you, for example. According to a 2017 article by Andrew Medal at Entrepreneur.com, as many as 60 percent of all banner ads are clicked on by mistake. About 91 percent are viewed for less than one second. Clearly, the metrics just aren’t there anymore for banner ads.

By far the most common marketing channel for indies now is download sites. Consumers can download indie games from multiple providers that act as wholesalers, such as the Xbox Games Store, Microsoft, Steam, GameJolt, IndieDB, and EpicBundle. While you may lose a percentage of sale revenues via the bigger sites, you benefit from increased exposure and traffic.

Shows and Events

Game jams, trade shows, and gaming events are great places to concentrate your marketing efforts. Your budget may not allow for extensive travel, but crashing on a friend’s sofa and carpooling are still common for many indies who are starting up. And although you might not be able to offer T-shirts, key rings, or USB drives, walk around the shows and events to see what others are doing.

Some of the most well-known indie gatherings include Electronic Entertainment Expo (E3), Game Developers Conference (GDC), Independent Games Festival (IGF), and PAX. See gamesindustry.biz for a continually updated list of industry events that range from eSports to casual connects.

Participating in a panel discussion or presenting a slide deck about your story is a great way to get recognition. You’ll find that public speaking isn’t so difficult while talking about your favorite topic. Be prepared to spread advice and encouragement in your talk, and promote your presentations through all your social media channels.

Jams and Meet-ups

Game jams are formal or informal gatherings for the purpose of planning, designing, and creating one or more games within a short span of time, usually ranging between 24 and 72 hours. Participants include programmers, designers, artists, writers, and fans. Game jams can be intoxicating, exhausting, and exhilarating. They’re a great way to meet other indies, but participating in such gatherings can leave you drained.

PixelProspector maintains a complete list of game jams, as does Wikipedia. Here’s some advice from BáiYù at itch.io:

  1. Avoid crunch and deadline pressure—pace yourself and know your limits.
  2. Know the scope of the project; don’t bite off more than you can chew.
  3. Plan for the worst. If someone drops out, reduce your scope immediately.
  4. Communicate with the team. State your assumptions about who is doing what.
  5. Leave time for testing and bug fixes.
  6. Protect your health. Don’t get caught up in the frenzy. Stop for breaks, fresh air, and stretching.

Meet-ups are another great way to network. Meet-ups range from informal, local hang-outs to formal meetings with speakers and schedules. Some are dominated by developers, and others by players. Search at Meetup.com or elsewhere to see what’s happening around you. At meet-ups, you could bring your game demo for some feedback, show off your game trailer, give a talk, or otherwise network with like-minded indie fans and developers. You may find these are also a good place to look for help with design, coding, graphics, or music.

Closed Alpha Exposure

Project managers use the alpha stage of a software project as the first stage of rigorous testing. While the code may be unstable, it is now at a state where you can gather feedback at meet-ups, jams, and other gatherings. Players can tell you what they like and don’t like, and help you make decisions about features to add or drop. Few indie devs are brave enough to host an open alpha testing phase, where all comers can drop in. That’s why most teams that run alpha events close them to a carefully selected audience. The advantages of gaining player feedback, gauging playability and enthusiasm, and generating buzz may or may not offset any nagging issues or crashes, so use your judgement about how early you want to show the world what you’ve got.

Contests

Entering your game in a contest is a time-tested way to get feedback from accomplished judges and maybe a pat on the back when you most need it. Winning a contest can boost your momentum, give you an instant marketing point, and provide info for your dev blog and social media storm. Pushing yourself through a final scrub to hit an entry deadline can also provide motivation.

Feedback and technical assistance are a key part of the annual Intel® Level Up Game Developer Contest (Intel® Level Up Game Dev Contest). Intel gathers a well-rounded field of judges from the indie world and top development studios, and their insights and observations are a special part of the allure. The winnings are more than just cash prizes; all contest winners in 2017 received a Razer Blade Stealth Ultrabook*, and the Game of the Year winner received USD 5,000, an agency-driven digital marketing campaign tailored to their needs (valued at USD 12,000), and a distribution contract offer with Green Man Gaming.

Don’t Tweet That

Social media is a sharp, two-edged sword which, when put into the wrong hands, can prove deadly. Pick your battles wisely, stay true to your game’s identity and voice, and learn to shrug off criticism, no matter how loud it seems and no matter how well-intentioned it may have been.

Some of the most common social media sites are well-known—Facebook, Twitter, Snapchat*, and Instagram*—and a new one may spring up at any moment. Joel Lee, writing at MakeUseOf.com, listed three awesome social networks just for gamers in 2013: Raptr, Playfire, and Duxter. Of those, Duxter has closed its doors, Playfire moved to Green Man Gaming, and Raptr said good-bye in September 2017. Use caution when investing considerable time in a new site.

Pricing and Monetization Strategies

One of the biggest challenges in game distribution is how to price it. Yu Zhan at the University of North Carolina has a simple guide to pricing strategies, divided into three sectors: Pay to Win, Pay to Play, and Play for Free.

Players can pay for better heroes, better weapons, or more levels after starting out for free, as in All-Star Heroes*. In this game, players can’t really win until they pay.

Dark Souls* is an example of Pay to Play. They sell sequels and downloadable content, plus online versions. Minecraft Realms* is another example of this strategy, and so is World of Warcraft*.

Free-to-play games, sometimes called freemium, use a strategy where the game is free, but it’s full of ads and inducements. Most recent games use this strategy, sometimes offering players the ability to pay up to avoid ads.

Setting the right price involves competitive analysis, described earlier, and knowing how similar titles handle pricing. If you know your target audience and their expectations, you should be able to set a price and stick to it.

Six to 12 months after your game’s launch, determine whether you’re leaving money on the table by not offering discounts, sales, and other promotions. You can push the boundaries of industry-wide pricing trends if you are tracking sales and gathering stats to make informed decisions. But keep in mind that it’s almost impossible to raise your prices once you’re in the marketplace.

Retailers want you to be successful so they can grow, too. Brick-and-mortar locales are growing less important, especially for indies, and most of your sales will probably come from selling online via Steam, Green Man Gaming, Humble Bundle, G2G, or others. At the GameJolt marketplace, look beyond your own borders and list retailers by region. At Statista, find information about game market revenues for a particular region.

Revenue sharing with Steam or other providers is necessary in today’s indie landscape. Gamasutra tackles the question and asks if it’s worth it. The answer is maybe. They conclude that, “If you partner with the wrong folks (or even with the right folks but under the wrong conditions) no contract is going to help you. But going through this process is vital. Most importantly, a contract may help you avoid getting into the wrong partnership. Additionally, a contract will give you a reliable framework if disagreements arise between you and your partners.”

PR and Self-Promotion

According to the Public Relations Society of America, PR is a “strategic communication process that builds mutually beneficial relationships between organizations and their public.” When done properly, PR includes complex planning, metrics-gathering, and development stages. Unfortunately, hiring a public relations company to help promote your game is a luxury most independents can’t afford. However, inexpensive activities you can use to enhance your reputation with the public do exist.

Rich Kahn, founder and CEO of eZanga.com, told smallbusinesspr.com there are five easy things you can do:

  1. Become an authority in your industry. Accept any speaking engagement you can find, volunteer often, and answer questions on a blog for starters. Publish interesting tidbits on Twitter, engage in polite debate, and make a name for yourself. Be sure that your thoughts are consistent with your brand—don’t hype FPS titles if you’re an RPG studio, for example.
  2. Connect with schools. Students are the employees of tomorrow, and getting in front of them as a guest lecturer is an easy way to expand your public presence. You can build on your campus connections to hire interns from engineering, business, graphic design, and writing programs, depending on your needs.
  3. Befriend the media. Reach out to the reviewers and editors whose bylines you respect and chat them up. They may need a quote someday to perfectly capture a key insight, and there’s nothing like seeing your wisdom in a pull-quote, highlighted for everyone to see. The more you learn about the people who cover your industry, or especially your particular niche, the more fun you’ll have at conferences and gatherings, too.
  4. Consider co-branding. If your game was a hit at a local jam or meet-up, the more you share that success, the more you help that entity in their marketing efforts, too. If you spread the word about a positive review at a growing website, you help their efforts as much as your own. Good co-branding is like having good manners at a party—of course you would thank the host and compliment their efforts.
  5. Take the industry pulse. Build your own online surveys about interesting challenges, and publish the results on your blog. Trumpet the news on social media, and share the wealth. You might be picked up by your media friends, which gives you more credibility the next time you survey.

Your game development journey is full of peril and excitement, sometimes in the same day. To Get Ready, Get Noticed, and Get Big, you face multiple challenges along the way. Although this guide covers some of the biggest hurdles you will face, new ones arise every day.

Because most indies’ marketing budgets are tight, you will have to improvise, adapt, and overcome throughout your journey. While this guide offers a few ideas, some elements are mandatory, such as a social network presence, a website landing page, a solid video trailer, and a playable demo. Start early in getting the word out, establish your voice and use it, and avoid the tendency to go quiet. Find panel discussions to join, and tell your war stories. You may not have deep pockets, but you have a unique story and the passion that goes with it. Jaded veterans working on yet another sequel admire your enthusiasm, so ride it as far as you can.

Back at Gamasutra.com, game designer Sarah Woodrow offers this encouragement: “Indie game development will drive the future of games. Indie game developers will be the ones to take games beyond what we know, to create truly innovative and interesting experiences. There are indies who are starting out now who will be the business leaders of the game industry in 10–20 years. We are already seeing a rise of indies; we will see more.”

About the author

Garret Romaine has been covering the game industry since 1992, reviewing games, writing features, and authoring white papers, case studies, and analysis. He writes for RH+M3 and holds an MBA from Portland State University.

Hey Game Developers: Enter to Win an iBUYPOWER Revolt 2 and More!


The original article is published by Intel Game Dev on VentureBeat*: Hey game developers: Enter to win an iBUYPOWER Revolt 2 and more!. Get more game dev news and related topics from Intel on VentureBeat.

iBUYPOWER Revolt 2

Games like Grand Theft Auto and Resident Evil set the bar for commercial success, and indie developers are increasingly rising to the occasion. But it requires sophisticated tech to develop, cash in hand to cover production costs, a solid marketing plan to launch, and confidence that your game will run beautifully across the whole spectrum of consumer hardware out there.

And Intel is on the case. Right now, the company is offering developers free testing and promotion when they submit a PC game through the Intel Test Suite by December 14.

The first 50 PC games accepted to be tested will receive free testing — a $750 value. To make things even sweeter, the top 50 titles to pass testing will also get $5,000 worth of social promotion across Intel channels.

Better yet? Submission also means you’ll be entered to win an Intel® Core™ i9 processor worth $1,000, an iBUYPOWER Revolt 2 Pro Z370 worth $1,750, or an ASRock X299 motherboard worth $390.

Intel Core Series

The Intel Test Suite consists of preferred test methodologies and tools to determine your game’s performance and playability on Intel® Core™ processors and Iris® Graphics. And a Runs Great on Intel® technology certification means that your game delivers the best experience on the most platforms, plus offers you a chance to be considered for software bundles, partner stores, and other opportunities to get big.

The company has long been a source of support for game developers through its Intel Game Developer program, which offers libraries, performance analyzers, and other tools to optimize PC games, a resource library and developer forum, plus contests, networking opportunities, trade show collaborations, and go-to-market programs to expand a game’s reach.

Check out Intel’s suite of free optimization tutorials, testing tools, and other resources before you submit your game for certification.

Hurry – the contest ends December 14. Get your game tested now, and get big!

Deep Learning for Cancer Diagnosis: A Bright Future


Cancer is a leading cause of death and affects millions of lives every year. Its early detection could help save many lives1 in addition to billions of dollars.2 Most healthcare data are obtained from ‘omics’ studies (such as genomics, transcriptomics, proteomics, or metabolomics), clinical trials, research, and pharmacological studies. Such data are highly complex, variable, and multidimensional, and sometimes come from incompatible data sources. Unfortunately, the bulk of this data remains underutilized, even though it could be used for biomarker identification and drug discovery.

Deep learning (DL) is a member of the larger machine learning (ML) and artificial intelligence (AI) family. It has been applied in many fields like computer vision, speech recognition, natural language processing, object detection, and audio recognition.3 Deep learning architectures, including deep neural networks (DNNs) and recurrent neural networks (RNNs), have been consistently improving the state of the art in drug discovery and disease diagnosis.4 Deep learning has the potential to achieve good accuracy for the diagnosis of various types of cancers, such as breast, colon, cervical, and lung cancer. It builds an efficient algorithm based on multiple processing layers of neurons5 (see Figure 1). However, the output (i.e., accuracy) of any deep learning model depends on multiple factors, including, but not limited to, the data type (numeric, text, image, sound, video), data size, architecture, and data ETL (extract, transform, load).

Figure 1

In this article we explore how deep learning has been successfully applied to potential areas of oncology (the study of cancer diagnosis and treatment). It provides insight into deep learning for medical and paramedical professionals, educators, and students. It also highlights the various potential areas of healthcare where data science professionals, such as scientists, data engineers, and developers, can take the lead in building products and services that use Intel® technologies.

Identification of biomarkers useful for cancer diagnosis using deep learning

Figure 2

The human genome is a complex sequence of nucleic acids. It is encoded as DNA within 23 chromosome pairs.6 It is well known that the expression of genes changes according to the situation, and consequently such changes regulate many biological functions. Interestingly, certain genes change only as a result of specific pathological conditions (like cancer) or with treatment. These genes are called biomarker(s) for a specific tumor. Recently, a group of scientists from Oregon State University used a deep learning approach to identify certain genes critical for the diagnosis of breast cancer. As shown in Figure 2, they used a stacked denoising autoencoder (SDAE) for feature extraction and then applied supervised classification models to verify the new features in cancer detection.7 Another group of scientists from China applied a deep learning model for high-level feature extraction between combinatorial SMP (somatic point mutations) and cancer types.8

Discovering drug molecules and biomarkers using deep learning

Several endogenous molecules (chemical compounds or proteins) circulate in body fluids (blood, urine, cerebrospinal fluid). Some of these molecules are considered to be tumor-specific biomarkers. The discovery of such a molecule or its synthetic analog gives new hope for understanding the mechanisms of disease and for creating therapeutic benefits.

The design of a new molecule is based on historical datasets of old molecules and targets. In quantitative structure-activity relationship (QSAR) analysis, scientists try to find known and novel patterns between structure and activity. At the Merck Research Laboratory, Ma et al. used a dataset of thousands of compounds (~5,000) and built a model based on a DNN architecture.9 In another QSAR study, Dahl et al. built neural network models on 19 datasets of 2,000‒14,000 compounds to predict the activity of new compounds.10 Aliper and colleagues built a deep neural network–support vector machine (DNN-SVM) model that was trained on a large transcriptional response dataset and classified various drugs into therapeutic categories.11

AtomNet is the first structure-based deep convolutional neural network. It incorporates structural target information and consequently predicts the bioactivity of small molecules. This application worked successfully to predict new active molecules for targets with no previously known modulators.12 Furthermore, Altae-Tran et al. introduced a new deep learning architecture called iterative refinement long short-term memory (LSTM), which significantly increases predictive power for specific drug discovery problems even with limited data.13

Feature detection in histopathological images using deep learning

Figure 3

Histopathological images are primarily obtained from thin sections of a tumor. These sections are generally stained with specific dyes or antibodies to distinguish cancerous cells. Recently Kaggle* organized the Intel and MobileODT Cervical Cancer Screening competition to improve the precision and accuracy of cervical cancer screening using deep learning.14 The participants used different deep learning models such as the faster R-CNN detection framework with VGG16,15 supervised semantic-preserving deep hashing (SSDH), and U-Net for convolutional networks.16 As shown in Figure 3, Dr. Silva achieved 81 percent accuracy on the validation set using the Intel® Deep Learning SDK and GoogLeNet* on Caffe*.

Turkki et al. also applied convolutional neural networks (CNNs) and a support vector analysis approach to quantify immune cells in breast cancer slides. They achieved up to 90 percent agreement with what pathologists achieve.17 Another application of deep learning is predicting the prognosis of cancer (that is, estimating the stage of cancer). Hyung et al. found 83.5 percent average accuracy in predicting the survival of patients with gastric cancer.18

Feature detection in MRI and ultrasound images using deep learning

Figure 4

Medical technologies such as computed tomography, magnetic resonance imaging (MRI), and ultrasound are rich, noninvasive sources of tumor images. Deep learning models can be used to measure tumor growth over time in cancer patients on medication. As shown in Figure 4, Jaeger et al. applied a CNN architecture to diffusion-weighted MRI. Based on an estimation of the properties of the tumor tissue, this architecture reduced false-positive findings and thereby decreased the number of unnecessary invasive biopsies. The researchers noticed that deep learning reduced motion and vision errors and thus provided more stable results than manual segmentation.19 A study conducted in China showed that deep learning helped to achieve 93 percent accuracy in distinguishing malignant from benign cancer on elastograms from ultrasound shear-wave elastography of 200 patients.2,20

Identification of cancer cell type based on morphological features of cells using deep learning

Figure 5

Several participants in the Kaggle competition successfully applied DNNs to the breast cancer dataset obtained from the University of Wisconsin. Based on the features of each cell nucleus (radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension), a DNN classifier was built to predict the breast cancer type (malignant or benign).21 Similarly, Xu et al. investigated datasets of over 7,000 images of single red blood cells (RBCs) from eight patients with sickle cell disease and applied a DNN classifier to classify the different RBC types (sickle, elongated, granular, oval, discocytes, and so on) (see Figure 5). The trained deep CNN distinguished the subtle differences in texture alteration inside oxygenated and deoxygenated RBCs.22

Challenges of implementing deep learning in healthcare

Data scarcity

Healthcare generates a huge amount of data, and the volume is growing steadily. However, this data is not widely available to scientists and developers in startups, industry, and academia. Although the potential profit from joining AI with the healthcare domain is high, only a small fraction of startups and institutes are exploring AI tools in healthcare; investment of time, money, and human resources in this domain is restricted, probably due to ethical concerns and the many rules and regulations involved. However, the data scarcity trend is changing rapidly, and there is a lot of room in which to grow.

Data sharing

In order to build a deep learning model in the medical field, we need a significant amount of high-quality data. In some cases, the specific clinical or research data are not available (or are very limited) from a particular institute, and it is essential to collaborate with other institutes to get sufficient data. But setting up such collaborations under a mutual agreement is sometimes hard to accomplish and very time-consuming.

Computational skills

People working in the medical, paramedical, and research fields are well educated for performing medical and research jobs. However, they may not have enough training in computer science, in programming languages such as C++, Python*, and Java*, or in hardware and software. Having a team of AI experts, such as deep learning data scientists, developers, and solution architects, may help fill the gaps in computational skill and run healthcare projects.

Deep learning skills

It is essential that a data scientist, developer, or data engineer have knowledge of the healthcare domain, a good understanding of DNNs, and experience in advanced statistical modeling. They should be aware of the latest deep learning frameworks, libraries, APIs, UIs, web interactive notebooks, labs, DevCloud, multinode systems, big data, and so on. Implementations using older libraries or frameworks may delay the project or produce poor results. A team of deep learning experts may help you design optimal sampling procedures and get effective results. They may also help you to not only identify external and internal factors causing variation in your data analysis but also build strategies to reduce optimization difficulties due to poor conditioning or local minima.23

Data Science and Medical Science: A Combined Approach

Deep learning has great potential to help medical and paramedical practitioners by:

  • Reducing the human error rate24 and the workload,
  • Helping in diagnosis and the prognosis of disease, and
  • Analyzing complex data and building a report.

The histopathological examination of thousands of images is complex, time-consuming, and labor intensive. How can AI help?

A team from Harvard Medical School’s Beth Israel Deaconess Medical Center noticed a 2.9 percent error rate with the AI model and a 3.5 percent error rate with pathologists for breast cancer diagnosis. Interestingly, pairing deep learning with a pathologist showed a 0.5 percent error rate, an 85 percent drop.24 Litjens et al. suggest that deep learning holds great promise to improve the efficacy of prostate cancer diagnosis and breast cancer staging.25,26

Learn More about Initiatives in AI from Intel

Intel commends the AI developers who contribute their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes available free AI training and tools through the Intel® AI Academy.

Intel recently published a series of AI hands-on tutorials. There you will learn:

  • How to start your project by defining goals, data sources, and the strategy of building your AI team (ideation and planning)
  • How to select a deep learning framework optimized by Intel, an AI computing infrastructure from Intel, data resources, and so on (technology and infrastructure)
  • How to build an AI model (data and modeling)
  • How to build and deploy an app (app development and deployment)

The same concepts could also be useful in healthcare to solve a similar set of problems. In the next series of articles, we will explore some examples of healthcare datasets where you will learn how to apply deep learning. If you want to try deep learning on your own dataset, please contact the support community at Intel® AI Academy. Our Intel team will help you achieve your project goals.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amounts of data that are generated every minute.

Sign up to get the latest tools, optimized frameworks, and training for AI, machine learning, and deep learning.

References

  1. Howard, J. TED Talk: The wonderful and terrifying implications of computers that can learn. (2014).
  2. Ali, A.-R. Deep Learning in Oncology – Applications in Fighting Cancer. September 14 (2017). 
  3. Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  4. Mamoshina, P., Vieira, A., Putin, E. & Zhavoronkov, A. Applications of Deep Learning in Biomedicine. Molecular Pharmaceutics 13, 1445–1454 (2016).
  5. INSILICO MEDICINE, I. Deep learning applied to drug discovery and repurposing. (2016). 
  6. Wikipedia. Human genome.
  7. Danaee, P., Ghaeini, R. & Hendrix, D. A. A deep learning approach for cancer detection and relevant gene identification. Pac. Symp. Biocomput. 22, 219–229 (2017).
  8. Yuan, Y. et al. Deepgene: An advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics 17, (2016).
  9. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 55, 263–274 (2015).
  10. Dahl, G. E., Jaitly, N. & Salakhutdinov, R. Multi-task Neural Networks for QSAR Predictions. (University of Toronto, Canada. 2014).
  11. Aliper, A. et al. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13, 2524–2530 (2016).
  12. Wallach, I., Dzamba, M. & Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. 1–11 (2015). doi:10.1007/s10618-010-0175-9
  13. Altae-Tran, H., Ramsundar, B., Pappu, A. S. & Pande, V. Low Data Drug Discovery with One-Shot Learning. ACS Cent. Sci. 3, 283–293 (2017).
  14. Kaggle competition-Intel & MobileODT Cervical Cancer Screening. Intel & MobileODT Cervical Cancer Screening. Which cancer treatment will be most effective? (2017).
  15. Intel and MobileODT* Competition on Kaggle*. Faster Convolutional Neural Network Models Improve the Screening of Cervical Cancer. December 22 (2017).
  16. Intel and MobileODT* Competition on Kaggle*. Deep Learning Improves Cervical Cancer Accuracy by 81%, using Intel Technology. December 22 (2017). 
  17. Turkki, R., Linder, N., Kovanen, P., Pellinen, T. & Lundin, J. Antibody-supervised deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin stained breast cancer samples. J. Pathol. Inform. 7, 38 (2016).
  18. Hyung, W. J. et al. Superior prognosis prediction performance of deep learning for gastric cancer compared to Yonsei prognosis prediction model using Cox regression. J Clin Oncol 35, abstract 164 (2017).
  19. Jäger, P. F. et al. Revealing hidden potentials of the q-space signal in breast cancer. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 10433 LNCS, 664–671 (2017).
  20. Zhang, Q. et al. Sonoelastomics for Breast Tumor Classification: A Radiomics Approach with Clustering-Based Feature Selection on Sonoelastography. Ultrasound Med. Biol. 43, 1058–1069 (2017).
  21. Kaggle: Breast Cancer Diagnosis Wisconsin. Breast Cancer Wisconsin (Diagnostic) Data Set: Predict whether the cancer is benign or malignant.
  22. Xu, M. et al. A deep convolutional neural network for classification of red blood cells in sickle cell anemia. PLoS Comput. Biol. 13, 1–27 (2017).
  23. Bengio, Y. Deep learning of representations: Looking forward. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 7978 LNAI, 1–37 (2013).
  24. Kontzer, T. Deep Learning Drops Error Rate for Breast Cancer Diagnoses by 85%. September 19 (2016). 
  25. Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6, (2016).
  26. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).

Vectorization: A Key Tool To Improve Performance On Modern CPUs


Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

The Rise of Parallelism

For the past decade, Moore’s law has continued to prevail, but while chip makers have continued to pack more transistors into every square inch of silicon, the focus of innovation has moved away from greater clock speeds and towards multicore and manycore architectures.

A great deal of focus has been given to engineering applications that are capable of exploiting the growing number of CPU cores by running multi-threaded or grid-distributed calculations. This type of parallelism has become a routine part of designing performance-critical software.

At the same time, as multicore chip design has given rise to task parallelism in software design, chipmakers have also been increasing the power of a second type of parallelism: instruction-level parallelism. Alongside the trend of increasing core count, the width of SIMD (single instruction, multiple data) registers has been steadily growing. The software changes required to exploit instruction-level parallelism are known as ‘vectorization’.

The most recent processors have many cores/threads and the ability to implement single instructions on an increasingly large data set (SIMD width).

A key driver of these architectural changes was the power/performance dynamic of the alternative architectures:

• Wider SIMD – Linear increase in transistors and power

• Multicore – Quadratic increase in transistors and power

• Higher clock frequency – Cubic increase in power

SIMD provides a way to increase performance using less power.

Software design must adapt to take advantage of these new processor technologies. Multi-threading and vectorization are each powerful tools on their own, but only by combining them can performance be maximized. Modern software must leverage both threading and vectorization to get the highest performance possible from the latest generation of processors.

Why Vectorise?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD). For example, a CPU with a 512-bit register could hold 16 32-bit single-precision floating-point values and perform a single calculation on all of them at once.

This can be up to 16 times faster than executing one instruction at a time. Combining this with threading and multi-core CPUs leads to orders-of-magnitude performance gains.
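
To make this concrete, here is a minimal sketch in Python using NumPy (an illustration of the concept rather than this article's compiled-code workflow; NumPy's array kernels are themselves vectorized). Part of the gap measured below comes from interpreter overhead as well as SIMD, but the structural change, operating on whole arrays instead of single values, is exactly what a vectorizing compiler performs on a loop:

import time
import numpy as np

n = 10**6
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar-style loop: one multiply-add per iteration.
start = time.time()
c = np.empty(n)
for i in range(n):
    c[i] = a[i] * b[i] + 1.0
t_scalar = time.time() - start

# Vectorized form: the same arithmetic expressed over whole arrays,
# handled by optimized kernels that process many elements per instruction.
start = time.time()
c_vec = a * b + 1.0
t_vector = time.time() - start

print("scalar: {:.3f}s, vectorized: {:.4f}s".format(t_scalar, t_vector))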

Implementing Vectorization

There is a range of alternatives and tools for implementing vectorization. They vary in terms of complexity, flexibility, and future compatibility. The simplest way to implement vectorization is to start with Intel’s 6-step process. This process leverages Intel tools to provide a clear path to transforming existing code into modern, high-performance software that exploits multicore and manycore processors.

Applying Vectorization to CVA Aggregation

The finance domain provides many good candidates for vectorization. A particularly good example is the aggregation of Credit Value Adjustment (CVA) and other measures of counterparty risk. The most common general-purpose approach to the calculation of CVA is based on a Monte-Carlo simulation of the distribution of forward values for all derivative trades with a counterparty. The evolution of market prices over a series of forward dates is simulated, then the value of each derivative trade is calculated at each forward date using the simulated market prices. This gives us a ‘path’ of projected values over the lifetime of each trade. By running a large number of these randomized simulated ‘paths’, we can estimate the distribution of forward values, giving both the expected and extreme ‘exposures.’ The simulation step results in a 3-dimensional array of exposures. The task of calculating CVA from these exposures occurs in several steps: netting, collateralization, integration over paths, and integration over dates.
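
As a rough illustration of these aggregation steps, the following Python sketch computes a CVA figure from a simulated exposure array. All inputs here (the random exposure array, the flat discount rate, the flat hazard rate, and the loss given default) are illustrative assumptions, and the collateralization step is omitted for brevity:

import numpy as np

np.random.seed(0)

# Illustrative 3-D exposure array from a Monte-Carlo run:
# exposures[path, date, trade] = simulated trade value.
n_paths, n_dates, n_trades = 1000, 40, 25
exposures = np.random.normal(size=(n_paths, n_dates, n_trades))

# Netting: sum trade values per path/date; exposure is the positive part.
netted = np.maximum(exposures.sum(axis=2), 0.0)

# Integration over paths: expected positive exposure (EPE) per date.
epe = netted.mean(axis=0)

# Integration over dates: weight EPE by discounting and default probability.
dt = 0.25                                  # quarterly steps, in years
times = dt * np.arange(1, n_dates + 1)
discount = np.exp(-0.02 * times)           # flat 2% rate (assumption)
hazard = 0.015                             # flat hazard rate (assumption)
pd_step = np.exp(-hazard * (times - dt)) - np.exp(-hazard * times)
lgd = 0.6                                  # loss given default (assumption)

cva = lgd * np.sum(discount * epe * pd_step)
print("Illustrative CVA:", cva)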

More Details

Check out this whitepaper (PDF).

Also see the complete webinar (on Quantifi's site) and the associated slide deck (PDF).

 

Vitruvian Game Wingsuit—VR Automation


Abstract

Through virtual reality and an electronically controlled mechanism, the wingsuit flight simulator calls to mind the Vitruvian Man of Leonardo da Vinci. The user controls the flight with joysticks and wears a battery-powered backpack PC that renders the virtual reality wirelessly. You can fly anywhere for a unique experience. The motors, under full safety controls, allow the person to rotate like a gyroscope.

Introduction

The project puts together multiple technologies and allows the player to control the simulation. It is the first virtual reality (VR) game prototype in which the person is not only a spectator: the joysticks drive the automation, so the physical movement matches what the display shows.

The layout reminds us of the great drawing by Leonardo Da Vinci, who for us is one of the first makers of the modern age.

Areas of Expertise

Virtual Reality

We used the MSI* backpack PC with the HTC Vive*. We created a VR application in Unity* that simulates flying over a city. The 3D map streams over the Internet, and you can choose the Global Positioning System (GPS) coordinates of the starting point.

Iron Construction—Made in Italy

We designed a tailor-made iron structure that is representative of the circle surrounding the Vitruvian Man, a ring inside an outer ring. A person can board the three-meter tall construction to experience the application.

3D Simulation

Before construction, we used computer-aided design to simulate and calculate fabrication tasks, and draw components such as the pin joints and motor couplings with precision.

 

 

Programs and Technologies Used

Virtual Reality

Hardware: MSI* Backpack VR ONE; HTC Vive* full pack
Software: Windows® 10 operating system; Steam* VR; Unity* software; Microsoft Visual Studio*; WRLD 3D* app
Note: Tried to use Google Earth* but no VR API is available.

Automation

Hardware: 2 brushless motors; 2 gear reducers; 1 programmable logic controller (PLC); 1 double driver
Software: Automation Studio*; C language
Note: For safety, we set limits on speed, acceleration, deceleration, and angular position.

Internet of Things

Hardware: 2 normal joysticks; 2 TinyTILE*
Software: Arduino*
Note: One board controls the PLC automation; one board controls the LED light with accelerometer and gyroscope.

3D Simulation and Precision

Hardware: Mac*; computer numerical control (CNC) machine
Software: Rhinoceros*

Application in Unity*

We made a prototype of a wingsuit simulator using HTC Vive to simulate flight in a city-like environment. We used Unity with the Steam* VR software development kit to control the Vive controller and camera head.

To simulate the movement of a character in space, we created a script that uses Euler angles and the vector of the controllers’ direction to go up, down, left, and right.

The large terrain size posed environment-generation problems. We used a map generator called WRLD 3D*. It needs an Internet connection and creates a cartoon-style map of the world at run time, so you can fly above New York or London, or wherever you want, without any limits.

The other difficult thing was attaching the controllers to the Vitruvian structure so that they would move correctly. After many attempts, we found a solution: attaching them to the inner ring near the border, at the middle. The result is like having the controllers in your hands.

IoT—Gamer Control with TinyTILE*

With the TinyTILE* we connected the joysticks, converting their directions into input signals for the PLC that drives the motors, and programmed the board with Arduino*. For safety, the motors are controlled only through the joystick inputs.

Through a second card, we used the accelerometer and the gyro, and we connected the LEDs to visually show the inclination of the person.

It was not very complicated to create the code with Arduino. The complexity was in the conversion logic between the joysticks and the PLC inputs. The player can set four positions (up, down, right, left) for each joystick, for 16 combinations, while the motors have three maneuver options (forward, back, and hold).

We then mapped the joystick positions onto motor command combinations.

There are eight inputs in the TinyTILE (four pins for each joystick) and six outputs that connect to the PLC.
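
The following Python sketch illustrates the conversion idea; the real firmware is Arduino* C on the TinyTILE, and all names here are hypothetical:

FORWARD, BACK, HOLD = "forward", "back", "hold"

def motor_command(axis):
    """Collapse one joystick axis reading (-1, 0, or +1) into a motor option."""
    if axis > 0:
        return FORWARD
    if axis < 0:
        return BACK
    return HOLD

def map_joysticks(left, right):
    """left/right are (x, y) position tuples; returns commands for the two rings."""
    outer = motor_command(left[1])    # left stick vertical axis drives the outer ring
    inner = motor_command(right[0])   # right stick horizontal axis drives the inner ring
    return outer, inner

print(map_joysticks((0, 1), (-1, 0)))   # -> ('forward', 'back')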

Automation (Software and Motors)

We automated the two rings independently with two motors; this was the most expensive investment. We chose brushless technology for its high performance and manageability through a PLC.

When it comes to people, safety comes first, so we limited the range of maneuverability to avoid discomfort for anyone who loses control.

  • The software stops the person at 90 degrees at most; you cannot go upside down
  • Soft acceleration and deceleration
  • STOP: We created a simulator lock condition
  • EMERGENCY: We built an emergency button that always returns the person to the initial position

Motor Technical Information

  • Synchronous motor
  • Number of pole pairs: four
  • Nominal speed: 3000 rpm
  • Stall torque: 2.300 Nm
  • No brake
  • Smooth shaft
  • Angled hybrid connector (swivel)
  • Center diameter 80 mm
  • j6 fit (self-cooling, construction type A)
  • IP64 protection
  • Without oil seal
  • 560 VDC
  • Brake holding torque: 0.00 Nm
  • DA EnDat single-turn (inductive) 32-line

We chose B&R (Perfection in Automation) because their motor responsiveness is the best!

Photo Album and Videos

Complete VR project photos and videos: goo.gl/1nTtpR

Thanks to:

FabLab Network—Coordinator and event manager, www.fablabnetwork.it.

Busnet it—Developed the control board software and the VR software application.

Mingardo Designer Faber—Daniele is the blacksmith who made the custom rings and iron parts.

Smart Meter (applicative design)—Developed the executive prototype on CAD.

ZB Solution—Supplied our mechanical components.

Daniele Squizzato—Automation project.

Manage Deep Learning Networks with IntelChainer for Intel® Architecture


Summary

Chainer is a Python*-based deep learning framework aiming at flexibility and intuitiveness. It provides automatic differentiation APIs based on the define-by-run approach (a.k.a. dynamic computational graphs) as well as object-oriented high-level APIs to build and train neural networks. It supports various network architectures, including feed-forward nets, convnets, recurrent nets, and recursive nets, as well as per-batch architectures. Forward computation can include any control flow statements of Python without losing the ability to backpropagate, which makes code intuitive and easy to debug. IntelChainer, which is optimized for Intel® architecture, is currently integrated with the latest release of the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) 2017, optimized for the Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions supported in Intel® Xeon® and Intel® Xeon Phi™ processors.
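
As a quick illustration of the define-by-run style, the minimal sketch below uses the standard Chainer API (it is not taken from the IntelChainer examples). The forward pass contains an ordinary Python if statement, and backpropagation follows whichever branch actually executed:

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)   # input size inferred on first call
            self.l2 = L.Linear(100, 10)

    def __call__(self, x, deep=True):
        h = F.relu(self.l1(x))
        if deep:                            # plain Python control flow in the forward pass
            h = self.l2(h)
        return h

model = MLP()
x = np.random.rand(8, 784).astype(np.float32)
loss = F.sum(model(x))
loss.backward()                             # backprop follows the branch that ran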

Recommended Environments

We recommend these Linux* distributions.

  • Ubuntu* 14.04/16.04 LTS 64bit
  • CentOS* 7 64bit

The following versions of Python can be used:

  • 2.7.5+, 3.5.2+, and 3.6.0+

The recommended environments above are tested. We cannot guarantee that IntelChainer works in other environments, including Windows* and macOS*, even if it appears to run correctly.

Dependencies

Before installing IntelChainer, we recommend upgrading setuptools if you are using an old version:

$ pip install -U setuptools

The following packages are required to install Chainer.

  • NumPy 1.9, 1.10, 1.11, 1.12, 1.13
  • Six 1.9+
  • Swig 3.0.9+

Image dataset support

  • pillow 2.3+

HDF5 serialization support

  • h5py 2.5+

Testing utilities

  • pytest 3.2.5+

Intel® MKL-DNN

  • You don’t need to install Intel MKL-DNN manually; when IntelChainer is built, Intel MKL-DNN is downloaded and built automatically. Because of this, boost, glog, and gflags are also required.

Installation

Currently, master_v3 is the most stable version. You can use setup.py to install IntelChainer from the cloned repository:

$ git clone -b master_v3 https://github.com/intel/chainer
$ cd chainer
$ python setup.py install

Use pip to uninstall Chainer:

$ pip uninstall chainer

Run with Docker*

We provide Dockerfiles for Ubuntu and CentOS, based on Python 2 and Python 3 respectively, in the chainer/docker directory:

https://github.com/intel/chainer/blob/master_v3/docker/python2/Dockerfile_ubuntu

https://github.com/intel/chainer/blob/master_v3/docker/python3/Dockerfile_ubuntu

https://github.com/intel/chainer/blob/master_v3/docker/python2/Dockerfile_centos

https://github.com/intel/chainer/blob/master_v3/docker/python3/Dockerfile_centos

You can refer to the following wiki to see how to build and run with Docker. Even if you don’t want to use Docker, you can still refer to those Dockerfiles to learn how to set up an IntelChainer environment for Ubuntu and CentOS based on Python 2 and Python 3.

https://github.com/intel/chainer/wiki/How-to-build-and-run-Intel-Chainer-Docker-image

 

Training Examples

Training test with the MNIST dataset:

$ cd examples/mnist
$ python train_mnist.py -g -1

Training test with the CIFAR datasets:

  • Run the CIFAR-100 dataset:

$ cd examples/cifar
$ python train_cifar.py -g -1 --dataset='cifar100'

  • Run the CIFAR-10 dataset:

$ cd examples/cifar
$ python train_cifar.py -g -1 --dataset='cifar10'

Single Node Performance Test Configurations

For Single Node Performance Test Configurations, please refer to following wiki:

https://github.com/intel/chainer/wiki/Intel-Chainer-Single-Node-Performance-Test-Configurations

License

MIT License (see LICENSE file).

Reference

Tokui, S., Oono, K., Hido, S. and Clayton, J., Chainer: a Next-Generation Open Source Framework for Deep Learning, Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), (2015).

More Information

 

 

Android Things* on Intel® Architecture


Android Things* is an operating system from Google used to build connected devices for the Internet of Things.

Intel® Edison and Intel® Joule™ compute modules

Update

In June 2017, Intel announced the discontinuation of the Intel® Edison compute modules, Intel® Joule™ compute modules, and their associated developer kits.

Resources

Keep in Touch

We appreciate your interest and the time you have given to Android Things* on Intel® architecture. We’d love for you to connect with us if you have specific questions or want to stay in touch.


Traffic Light Detection Using the TensorFlow* Object Detection API


Abstract

This case study evaluates the ability of the TensorFlow* Object Detection API to solve a real-time problem such as traffic light detection. The experiment uses the Microsoft Common Objects in Context (COCO) pre-trained model called Single Shot Multibox Detector MobileNet from the TensorFlow Zoo for transfer learning. Intel® Xeon® and Intel® Xeon Phi™ processor-based machines were used for the study. At the end of this experiment, we obtained an accurate model that was able to identify traffic signals with more than 90 percent accuracy.

Introduction

With the advancements in technology, there has been a rapid increase in the development of autonomous cars or smart cars. Accurate detection and recognition of traffic lights is a crucial part in the development of such cars. The concept involves enabling autonomous cars to automatically detect traffic lights using the least amount of human interaction. Automating the process of traffic light detection in cars would also help to reduce accidents.

Traditional approaches in machine learning for traffic light detection and classification are being replaced by deep learning methods to provide state-of-the-art results. However, these methods create various challenges. For example, the distortion or variation in images due to orientation, illumination, and speed fluctuation of vehicles could result in false recognition.

The experiment was implemented using transfer learning of the Microsoft Common Objects in Context (COCO) pre-trained model called Single Shot Multibox Detector (SSD) with MobileNet. A subset of the ImageNet* dataset, which contains traffic lights, was used for further training to improve the performance. For this particular experiment, the entire training was done on an Intel® Xeon Phi™ processor and the inferencing was done on an Intel® Xeon® processor. However, an Intel Xeon processor-based machine can be used for both training and inferencing.

Hardware Details

Tables 1 and 2 list the configuration used for the Intel Xeon Phi and Intel Xeon processors:


Table 1. Intel® Xeon Phi™ processor configuration.

Table 2. Intel® Xeon® processor configuration.

Software Configuration

The development of this use case had the following dependencies as shown in Table 3.

Library                              Version
TensorFlow*                          1.4.0 (built from source)
Python*                              3 or later
Operating system                     CentOS* 7.3.1
Protobuf                             2.6
Pillow                               1.0
Lxml                                 4.1.1
Matplotlib                           2.1.0
MoviePy                              0.2
GCC* (GNU Compiler Collection*)      6+

Table 3. Software configuration

Installation

Building and Installing TensorFlow Optimized for Intel® Architecture

TensorFlow can be installed and used with several combinations of development tools and libraries on a variety of platforms. The following are the steps to build and install TensorFlow optimized for Intel® architecture1 with the Intel® Math Kernel Library 2017 on Ubuntu*-based systems.

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.4
wget https://github.com/bazelbuild/bazel/releases/download/0.7.0/bazel-0.7.0-installer-linux-x86_64.sh
wget --no-check-certificate -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u151-b12/e758a0de34e24606bca991d704f6dcbf/jdk-8u151-linux-x64.tar.gz
export PATH=/opt/intel/intelpython3.5/bin/:${PATH}
conda create -n tensorflow python=3.5
source activate tensorflow
tar -zxvf jdk-8u151-linux-x64.tar.gz
export JAVA_HOME=$WRKDIR/jdk1.8.0_151
export PATH=$JAVA_HOME/bin:$PATH
export PATH=$PATH:$JAVA_HOME/bin:/home/intel-user3/bazel/output
chmod 755 bazel-0.7.0-installer-linux-x86_64.sh
./bazel-0.7.0-installer-linux-x86_64.sh --user --prefix=~/bazel
./configure
bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.0*.whl

Installing LabelImg

Download the latest version of LabelImg, an annotation tool for Microsoft Windows*2. Extract the zip file, and then rename the folder as LabelImg.

Solution Design

The solution was implemented with the TensorFlow Object Detection API using Intel architecture. The detection pipeline is given below.

Traffic detection pipeline

Algorithm 1: Detection Pipeline
boxAssigned ← false
while true do
    f ← nextFrame
    while boxAssigned == false do
        InvokeDetection(f)
        if bounding box is detected then
            boxAssigned ← true
            class ← identifiedClass
            if class is trafficlight then
                drawBoundingBox
            end if
        end if
    end while
end while

Why choose TensorFlow Object Detection API?

TensorFlow’s Object Detection API is a powerful tool that makes it easy to construct, train, and deploy object detection models3. In most cases, training an entire convolutional network from scratch is time-consuming and requires large datasets. This problem can be solved by taking advantage of transfer learning with a pre-trained model using the TensorFlow API. Before getting into the technical details of implementing the API, let’s discuss the concept of transfer learning.

Transfer learning is a research problem in machine learning that focuses on storing the knowledge gained from solving one problem and applying it to a different but related problem. Transfer learning can be applied in three major ways4:

Convolutional neural network (ConvNet) as a fixed feature extractor: In this method the last fully connected layer of a ConvNet is removed, and the rest of the ConvNet is treated as a fixed feature extractor for the new dataset.

Fine-tuning the ConvNet: This method is similar to the previous method, but the difference is that the weights of the pre-trained network are fine-tuned by continuing backpropagation.

Pre-trained models: Since modern ConvNets take weeks to train from scratch, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, TensorFlow Zoo5 is one such place where people share their trained models/checkpoints.

In this experiment, we used a pre-trained model for the transfer learning. The advantage of using a pre-trained model is that instead of building the model from scratch, a model trained for a similar problem can be used as a starting point for training the network. Many pre-trained models are available. This experiment used the COCO pre-trained model/checkpoints SSD MobileNet from the TensorFlow Zoo. This model was used as an initialization checkpoint for training. The model was further trained with images of traffic lights from ImageNet. This fine-tuned model was used for inference.

Now let’s look at how to implement the solution. The TensorFlow Object Detection API has a series of steps to follow, as shown in Figure 1.

Solution design

Figure 1. Solution design

1. Dataset download

The dataset for fine-tuning the pre-trained model was prepared using over 600 traffic light images from ImageNet6. ImageNet contains over ten million URLs of images from various classes. The traffic light images were downloaded from the URLs and saved for annotation.

2. Image Annotation

  1. Configuring the LabelImg tool. Before starting the annotation of images, the classes for labelling need to be defined in the LabelImg/data/predefined_classes.txt file. In this case, there is only one class, trafficlight.
  2. Launch labelimg.exe and then select the dataset folder by clicking the OpenDir icon on the left pane.
  3. For each image that appears, draw a rectangular box across each traffic light by clicking the Create RectBox icon. These rectangular boxes are known as bounding boxes. Select the category trafficlight from the drop-down list that appears.
  4. Repeat this process for every traffic light present in the image. Figure 2 shows an example of a completely annotated image.

Annotated image

Figure 2. Annotated image

Once the annotations for an image are completed, save the image to any folder.

The corresponding eXtensible Markup Language (XML) files will be generated for each image in the specified folder. XML files contain the coordinates of the bounding boxes, filename, category, and so on for each object within the image. These annotations are the ground truth boxes for comparison. Figure 3 represents the XML file of the corresponding image in Figure 2.

XML file structure

Figure 3. XML file structure
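
Because Lxml is already among the dependencies (see Table 3), the generated annotations can be read back with a few lines of Python. The sketch below assumes LabelImg's Pascal VOC-style XML layout (filename, object, name, and bndbox tags):

from lxml import etree

def parse_annotation(xml_path):
    """Return the image filename and a list of (class, xmin, ymin, xmax, ymax)."""
    root = etree.parse(xml_path).getroot()
    filename = root.findtext('filename')
    boxes = []
    for obj in root.findall('object'):
        name = obj.findtext('name')          # e.g., 'trafficlight'
        bb = obj.find('bndbox')
        boxes.append((name,
                      int(bb.findtext('xmin')), int(bb.findtext('ymin')),
                      int(bb.findtext('xmax')), int(bb.findtext('ymax'))))
    return filename, boxes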

3. Label map preparation

Each dataset requires a label map associated with it, which defines a mapping from string class names to integer class IDs. Label maps should always start from ID 1.

As there is only one class, the label map for this experiment file has the following structure:

item {
	id: 1
	name: 'trafficlight'
}

4. TensorFlow records (TFRecords) generation

TensorFlow accepts inputs in a standard format called a TFRecord file, which is a simple record-oriented binary format. Eighty percent of the input data is used for training and 20 percent is used for testing. The split dataset of images and ground truth boxes is converted to train and test TFRecords. Here, the XML files are converted to CSV, and then the TFRecords are created. Sample scripts for generation are available here.
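
For orientation, the core of such a script builds one tf.train.Example per image and writes the serialized examples to a TFRecord file. The sketch below is an assumption-laden outline: the feature keys follow the Object Detection API convention, the helper name is hypothetical, and box coordinates are normalized to [0, 1]:

import tensorflow as tf

def make_example(filename, encoded_jpeg, width, height,
                 xmins, xmaxs, ymins, ymaxs, labels):
    """Build one tf.train.Example; box coordinates are normalized to [0, 1]."""
    def _bytes(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=v))
    def _ints(v): return tf.train.Feature(int64_list=tf.train.Int64List(value=v))
    def _floats(v): return tf.train.Feature(float_list=tf.train.FloatList(value=v))
    feature = {
        'image/filename': _bytes([filename.encode()]),
        'image/encoded': _bytes([encoded_jpeg]),
        'image/format': _bytes([b'jpeg']),
        'image/width': _ints([width]),
        'image/height': _ints([height]),
        'image/object/bbox/xmin': _floats(xmins),
        'image/object/bbox/xmax': _floats(xmaxs),
        'image/object/bbox/ymin': _floats(ymins),
        'image/object/bbox/ymax': _floats(ymaxs),
        'image/object/class/label': _ints(labels),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Typical usage with the TF 1.x writer API:
# writer = tf.python_io.TFRecordWriter('train.record')
# writer.write(make_example(...).SerializeToString())
# writer.close()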

5. Pipeline configuration

This section discusses the configuration of the hyperparameters and the paths to the model checkpoints, TFRecords, and label map. The protobuf files are used to configure the training process and have a few major settings to be modified. A detailed explanation is given in Configuring the Object Detection Training Pipeline. The following are the major settings to be changed for the experiment.

  • In the model config, the major setting to be changed is the num_classes that specifies the number of classes in the dataset.
  • The train config is used to provide model parameters such as batch_size, learning_rate, and fine_tune_checkpoint. The fine_tune_checkpoint field is used to provide the path to the pre-existing checkpoint.
  • The train_input_config and eval_input_config fields are used to provide paths to the TFRecords and the label map for both train as well as test data.

Table 4 depicts the observations of hyperparameter tuning for various trials of batch_size and learning_rate.

Hyperparameter Tuning
Learning Rate    Batch Size    Loss
0.005            16            ~7.2 to 3.4
0.001            16            ~3.5 to 1.4
0.0001           8             ~1.8 to 0.5

Table 4. Hyperparameter tuning

Note: The numbers in Table 4 are indicative. Results may vary depending on hyperparameter tuning.

6. OpenMP* (OMP) parameters configuration

There are various optimization parameters that can be configured to improve system performance. The experiment was run with OMP_NUM_THREADS equal to 8; however, OMP_NUM_THREADS can be set as high as four less than the number of cores.
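
One way to set these parameters is from the launching Python script, before TensorFlow is imported. The specific values below are illustrative starting points rather than tuned recommendations:

import multiprocessing
import os

# Configure threading knobs before TensorFlow is imported so they take effect.
cores = multiprocessing.cpu_count()
os.environ["OMP_NUM_THREADS"] = str(max(1, cores - 4))   # up to four less than the core count
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "0"

import tensorflow as tf   # import after the environment is configured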

7. Training

The final task is to assemble all that has been configured so far and run the training job (see Figure 4). Once the optimization parameters like OMP_NUM_THREADS, KMP_AFFINITY, and the rest are set, the training file is executed. By default, the training job will continue to run until the user terminates it explicitly. The models will be saved at various checkpoints.

Training pipeline

Figure 4. Training pipeline

8. Inference

The video for inference was first converted into frames using MoviePy, a Python* module for video editing. These frames are given to the model trained using transfer learning. After the frames pass through the Object Detection pipeline, bounding boxes are drawn on the detected frames. The frames are finally merged to form the inferred video (see Figure 5).

Inference pipeline

Figure 5. Inference pipeline
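
A minimal sketch of this frame loop with MoviePy is shown below; detect_and_draw is a hypothetical placeholder for running the detection graph on one frame and drawing its boxes:

from moviepy.editor import ImageSequenceClip, VideoFileClip

def detect_and_draw(frame):
    # Hypothetical placeholder: run the detection graph on `frame` and draw
    # the returned bounding boxes; here it simply passes the frame through.
    return frame

clip = VideoFileClip("input_video.mp4")
processed = []
for frame in clip.iter_frames(fps=clip.fps, dtype="uint8"):
    processed.append(detect_and_draw(frame))

# Merge the processed frames back into the inferred video.
out = ImageSequenceClip(processed, fps=clip.fps)
out.write_videofile("inferred_video.mp4")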

Experimental Results

The following detection (see Figures 6 and 7) was obtained when the inference use case was run on a sample YouTube* video available at: https://www.youtube.com/watch?v=BMYsRd7Qq0I

Raw frame

Figure 6. Raw frame

Inferenced frame

Figure 7. Inferenced frame

Conclusion and Future Work

From the results, we observed that the traffic lights were detected with a high level of accuracy. Future work involves parallel inferencing across multiple cores.

About the Authors

Nikhila Haridas and Sandhiya S. are part of an Intel team, working on AI evangelization.

References

1. Build and install TensorFlow on Intel architecture:

https://software.intel.com/en-us/articles/build-and-install-tensorflow-on-intel-architecture

2. LabelImg

https://github.com/tzutalin/labelImg

3. TensorFlow Object Detection API

https://github.com/tensorflow/models/tree/master/research/object_detection

4. Transfer learning

http://cs231n.github.io/transfer-learning

5. TensorFlow detection model zoo

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

6. ImageNet

http://imagenet.stanford.edu/synset?wnid=n06874185

Temporally Stable Conservative Morphological Anti-Aliasing (TSCMAA)

File(s): TSCMAA-CodeSample (76.6 MB)
License: BSD 3-clause
OS: Windows® 10
Hardware: Intel® HD Graphics 630
Software: Microsoft Visual Studio* 2017, Universal Windows* Platform development, Windows 10 SDK 10.0.15063
Prerequisites: An understanding of DirectX* 11 is helpful but not essential.

Seshupriya Alluru, Rebecca David, Matthew Goyder, Anupreet Karla, Yaz Khabiri, Sungye Kim, Pavan Lanka, Filip Strugar, and Kai Xiao contributed to TSCMAA and samples provided in this document.

This article describes Temporally Stable Conservative Morphological Anti-Aliasing (TSCMAA), which was designed and developed as an alternative to multisample anti-aliasing (MSAA), providing better temporal stability and performance for virtual reality (VR) applications.

Introduction

Anti-aliasing (AA) in computer graphics refers to a set of techniques used to overcome aliasing artifacts in the rendered images, which are a byproduct when representing a high-resolution image at a lower resolution (for example, rasterization or sampling). MSAA is the most widely used spatial AA algorithm in various applications but at the cost of performance. Post-processing AA algorithms like Conservative Morphological Anti-Aliasing (CMAA), Fast Approximate Anti-Aliasing (FXAA), and Subpixel Morphological Anti-Aliasing (SMAA) can provide constant performance but suffer from temporal instability. For VR, AA is more critical since the display is closer to the eyes, thus artifacts are more noticeable, and temporal stability becomes a key aspect in providing a good user experience.

This article discusses TSCMAA, which combines an optimized CMAA with temporal accumulation to create a temporally stable post-processing AA solution that can stand in for MSAA without compromising image quality. TSCMAA is also designed to run efficiently on low- and medium-end graphics processing units (GPUs), such as integrated GPUs, and to be minimally invasive. This makes it acceptable as an MSAA alternative with better temporal stability in a wide range of applications, including aliasing-prone geometry such as text, patterns, lines, and foliage.

Figure 1 shows a TSCMAA flow that combines a CMAA pass (top) and a TAA pass (bottom). In TSCMAA, CMAA and TAA passes run only for edge candidates identified by edge detection, resulting in fast indirect dispatch of shaders.


Figure 1. TSCMAA flow.

Conservative Morphological Anti-Aliasing

CMAA takes a rendered color frame as an input and updates the frame with spatially anti-aliased edge pixels by processing edge candidates (any shape and Z-shape edges). To identify such edge candidates, edge detection uses the luminance difference in color with a given edge threshold. In TSCMAA, a default edge threshold value is 0.045f (1.f/22.f) based on experiments with our sample scene, but it can be adjusted for other scenes as a quality versus performance knob. The details of CMAA are found in the original article.

Temporal Anti-Aliasing

To reduce temporal aliasing such as shimmering or crawling caused by motion between frames, TAA blends a current pixel with a history pixel. The output of TAA then becomes the history frame for the next TAA frame in a feedback loop. In VR, TAA is more crucial because head pose changes introduce considerable jitter over time, so rendered frames always contain motion in a head-mounted display (HMD). TSCMAA takes care of both spatial and temporal aliasing by combining CMAA and TAA, and a major benefit arises because TSCMAA runs only for edge pixels, resulting in faster processing and less overall blur compared to other TAA algorithms.

TAA Edge Candidates

Only a portion of edge candidates from edge detection are considered as TAA edge candidates. In TSCMAA, we use 50 percent of CMAA edge candidates for TAA. This default value is based on experiments with our sample scene but can be adjusted for other scenes as a quality versus performance knob. Figure 2 shows edge detection results with the default edge threshold value for CMAA edge candidates on the left and TAA edge candidates on the right. The TAA pass is applied only to the pixels in TAA edge candidates, resulting in fast TAA and less blurring on non-edge pixels.


Figure 2. Edge candidates for CMAA (left) and TAA (right). TAA edge candidates are 50 percent of the CMAA edge candidates.

History Pixel Sampling

The history frame maintains the previous TSCMAA frame so that a CMAA pixel is blended with a history pixel to generate the current TSCMAA frame (see Figure 1). To find the correct history pixel, we reproject a texture coordinate with the view and projection of the current rendered frame based on depth. A history pixel is then sampled at the reprojected texture coordinate using bicubic sampling. TSCMAA employs a five-tap approximation of a Hermite/Catmull-Rom bicubic filter, which preserves sharpness when sampling the history frame at a lower cost.

Variance Clipping

When there are moving objects in a scene, reusing stale pixels from the history frame creates a ghosting artifact. Neighborhood color clipping using an axis-aligned bounding box (AABB) has been used as an alternative to expensive color clipping using a convex hull. However, color clipping using AABB results in poor quality when the new clipped color is too far from the original pixel. Many articles discuss the benefits of variance clipping in reducing ghosting artifacts. TSCMAA uses variance clipping in YCoCg space to minimize ghosting artifacts on moving objects.

Blending with the CMAA Pixel and TSCMAA History Pixel for TAA Edge Candidates

Since the TAA pass is processed only for TAA edge candidates, blending between CMAA pixels and history pixels is also done only for TAA edge candidates, with a blend weight of 0.8f; other non-TAA edge pixels are taken directly from the CMAA pixels, which is equivalent to using a blend weight of 0.f. The final blended output forms the TSCMAA frame and becomes the history frame for the next TSCMAA frame, as shown in Figure 1.

Sample Applications

In the TSCMAA sample package attachment, we provide two VR sample applications using the same test scene, shown in Figure 3. One (AASample_WMR) is for Microsoft’s Windows Mixed Reality (WMR) HMD and the other (AASample) uses OpenVR* so that it works on all OpenVR-compatible HMDs. AASample can also run as a desktop application without the “_OPENVR_” preprocessor definition in the project configuration properties.


Figure 3. TSCMAA sample scene with lines, foliage, and a transparent object.

Quality and Performance

Figure 4 shows a quality comparison among MSAA2x, MSAA4x, TSCMAA, and no AA. With the default edge threshold values, TSCMAA quality is roughly equivalent to MSAA4x when viewed with an HMD and overall better than MSAA2x. Since the TAA pass in TSCMAA is based on temporal accumulation, some blurring in the final TSCMAA frame is inevitable. However, we minimize the blurring by accumulating only TAA edge pixels in the history frame, which provides better temporal stability. Since we cannot show the TAA benefits with static screenshots, this article also provides an attached sample code package with the TSCMAA shaders.


Figure 4. Quality comparison among MSAA2x (top left), MSAA4x (top right), TSCMAA (bottom left), and No AA (bottom right).

TSCMAA performance depends on multiple factors, such as the edge count in the scene and the render target resolution. For our test scene, TSCMAA shows MSAA4x quality at around a 40 percent lower cost and is equivalent to MSAA2x performance with default edge threshold values. Here we also show how TSCMAA performance scales as the edge count in the scene and the render target resolution change. These experiments were performed at 1280x1280 per eye on an Intel® HD Graphics 630 system at 1150 MHz.

Since TSCMAA runs only on edge candidates after edge detection, performance depends on the edge pixel count in the current view. Table 1 shows TSCMAA performance scaling for different edge counts from 53K to 100K pixels. The baseline contains 73K edge pixels, from the view shown in Figure 3. Table 1 shows TSCMAA performance changes of 10‒15 percent for a 30 percent difference in edge count.

Table 1. TSCMAA performance scaling ratio for different edge count. Baseline (1.0) has 73K edges and we compare with different scene (or views) where edge count is 30 percent less or greater than baseline.

Scale Factor over Base    53K Edges    73K Edges (Base = 1.0)    100K Edges
Edge count                0.72         1.0                       1.36
TSCMAA time               0.86         1.0                       1.08

Table 2 shows TSCMAA performance scaling for 2Kx2K render target resolution while maintaining the same edge count. Compared to a baseline with 1280x1280 resolution, a 2Kx2K render target contains a 2.56x larger number of pixels, contributing to a TSCMAA performance increase of up to 1.6x. The increase is mostly from an edge detection step which is resolution dependent.

Table 2. TSCMAA performance-scaling ratio for different render target resolution. Baseline (1.0) is 1280x1280 per eye, and we compare with 2Kx2K per eye while preserving edge count.

Scale Factor over Base    1280x1280/Eye (Base = 1.0)    2Kx2K/Eye
Pixel count               1.0                           2.56
Edge count                1.0                           1.0
TSCMAA time               1.0                           1.59

Table 3 shows the combined contribution from larger render target resolution and thereby the increased number of edge pixels.

Table 3. TSCMAA performance scaling ratio for different render target resolution.

Scale Factor over Base    1280x1280/Eye (Base = 1.0)    2Kx2K/Eye
Pixel count               1.0                           2.56
Edge count                1.0                           1.91
TSCMAA time               1.0                           2.12

Supported Resource Formats

To apply TSCMAA, applications are required to use one of the TSCMAA-supported resource formats for the render target and depth resources. For the render target texture, TSCMAA supports 32-bit RGBA and BGRA typeless formats, such as:

  • DXGI_FORMAT_R8G8B8A8_TYPELESS
  • DXGI_FORMAT_B8G8R8A8_TYPELESS

and creates internal view resources with the corresponding UNORM and UNORM_SRGB formats:

  • DXGI_FORMAT_R8G8B8A8_UNORM
  • DXGI_FORMAT_R8G8B8A8_UNORM_SRGB
  • DXGI_FORMAT_B8G8R8A8_UNORM
  • DXGI_FORMAT_B8G8R8A8_UNORM_SRGB

The 32-bit BGRA format is the only resource format that Microsoft’s WMR application uses at the time of this writing.

For depth, TSCMAA supports 24- and 32-bit resource formats with or without a stencil, such as:

  • DXGI_FORMAT_D32_FLOAT
  • DXGI_FORMAT_D24_UNORM_S8_UINT
  • DXGI_FORMAT_D32_FLOAT_S8X24_UINT
  • DXGI_FORMAT_R32_TYPELESS
  • DXGI_FORMAT_R24G8_TYPELESS
  • DXGI_FORMAT_R32G8X24_TYPELESS

TSCMAA Interface

TSCMAA provides a simple interface to support both standard desktop and VR applications.

TSCMAA::Create(…) initializes TSCMAA.

HRESULT TSCMAA::Create(ID3D11Device * pDevice,
                       DXGI_FORMAT format,
                       int width,
                       int height,
                       unsigned int numEyes = 1);
  • pDevice: D3D11 device pointer from application
  • format: Render target resource format
  • width: Render target width
  • height: Render target height
  • numEyes: The number of eyes. Default is 1 for standard desktop application. For VR applications, numEyes should be 2.

TSCMAA::Resize(…) is optionally called to resize resources in TSCMAA when the application render target is resized.

HRESULT TSCMAA::Resize(ID3D11Device * pDevice,
                       DXGI_FORMAT format,
                       int width,
                       int height);
  • pDevice: D3D11 device pointer from application
  • format: Render target resource format
  • width: Render target width
  • height: Render target height

TSCMAA::Draw(…) applies TSCMAA to the given color and depth resources and returns the final TSCMAA resource in ppOutTex. To provide input color and depth resources from the application to TSCMAA, the application should prepare a ColorDepthIn by calling ColorDepthIn::Create(…).

HRESULT TSCMAA::Draw(ID3D11DeviceContext * pContext,
	                ColorDepthIn * pColorDepthIn,
	                DirectX::XMFLOAT4x4 &projection,
                     DirectX::XMFLOAT4x4 &view,
	                ID3D11Texture2D ** ppOutTex,
	                unsigned int eye = 0);
  • pContext: D3D11 device context pointer from the application
  • pColorDepthIn: A pointer of ColorDepthIn for TSCMAA that is created in the application side by calling ColorDepthIn::Create(…)
  • projection: Projection matrix
  • view: View matrix
  • ppOutTex: A pointer of TSCMAA output texture resource, which resides in the TSCMAA side
  • eye: Eye index, where the left eye is 0 and the right eye is 1

To destroy TSCMAA, call TSCMAA::Destroy(), which will also release all resources in TSCMAA.

void TSCMAA::Destroy();

TSCMAA has a few control knobs to adjust the number of edges. Since the CMAA and TAA passes run only on edge candidates, the number of edges determines performance and quality. To adjust the number of edge pixels detected, TSCMAA provides SetEdgeThresholds(…), which sets an edge threshold and a non-dominant edge removal amount for edge detection.

void TSCMAABase::SetEdgeThresholds(float edgeDetectionThreshold,
                                   float nonDominantEdgeRemovalAmount,
                                   float bluriness);
  • edgeDetectionThreshold: The recommended value range is [1.f/32.f, 1.f/1.f]; the default value is 1.f/22.f.
  • nonDominantEdgeRemovalAmount: The recommended value range is [0.f, 3.f]; the default value is 0.5f.
  • bluriness: The recommended value range is [0.5f, 2.f]; the default value is 0.7f.

Note that bluriness does not affect edge detection but does affect processing edge candidates. Hence edgeDetectionThreshold and nonDominantEdgeRemovalAmount are knobs to control the number of edges in edge candidates.

To get the current edge thresholds, use TSCMAA::GetEdgeThresholds(…).

void TSCMAABase::GetEdgeThresholds(float &edgeDetectionThreshold,
                                   float &nonDominantEdgeRemovalAmount,
                                   float &bluriness);

How to Integrate TSCMAA into Other DirectX 11* Applications

To integrate the TSCMAA library into the application:

  1. Build the TSCMAA library with “Intel/TSCMAA.sln” if you do not have TSCMAA.lib.
  2. Link “Intel/lib/TSCMAA.lib” to your application project.
  3. Include “Intel/TSCMAA/TSCMAA.h” in your application.
  4. Create a TSCMAA instance and ColorDepthIn instance.
  5. Create resource views from the application’s color and depth textures for TSCMAA by calling ColorDepthIn::Create(…). To create color and depth textures, use one of the formats supported by TSCMAA.
  6. Initialize TSCMAA by calling TSCMAA::Create().
  7. Render color and depth textures in an application render loop.
  8. Apply TSCMAA by calling TSCMAA::Draw() after application render.
  9. Submit TSCMAA output to HMD (or backbuffer).
  10. Repeat steps 7‒9 for every frame.
  11. Destroy TSCMAA resources by calling TSCMAA::Destroy() and ColorDepthIn::Destroy().

The sample code is shown below.

#include "Intel/TSCMAA/TSCMAA.h"
TSCMAA::TSCMAA _tscmaa;
TSCMAA::ColorDepthIn _tscmaaColorDepthIn;

// In application init
_tscmaaColorDepthIn.Create(pDevice, pEyeTex, pEyeDepthTex);
_tscmaa.Create(pDevice, DXGI_FORMAT_B8G8R8A8_TYPELESS, 1080, 1200, 2);

// In app render loop
for (each frame){
    ID3D11Texture2D * pEyeTSCMAAOutTex[2] = { nullptr, };
    for (each eye) {
        // App renders into pEyeTex and pEyeDepthTex
        // …
        _tscmaa.Draw(pContext,
                &_tscmaaColorDepthIn,
                projectionMat,
                viewMat,
                &pEyeTSCMAAOutTex[eye],
                eye);
    }
    // Submit pEyeTSCMAAOutTex[2] to your HMD back buffer
    // …
}

// In application destroy
_tscmaaColorDepthIn.Destroy();
_tscmaa.Destroy();

Summary

This article described TSCMAA and how easily VR application developers can integrate it into their applications as an MSAA alternative with competitive spatial quality and better temporal stability. TSCMAA quality and performance are also adjustable through the number of edges, since the CMAA and TAA passes are processed only on edge candidates, resulting in faster, less blurred pixels. The attached TSCMAA sample package has been optimized for Intel HD Graphics 630, but it runs on any platform.

Face Detection with Intel® Distribution for Python*


Abstract

Artificial Intelligence (AI) can be used to solve a wide range of problems, including those related to computer vision, such as image recognition, object detection, and medical imaging. In the present paper we show how to integrate OpenCV* (Open Source Computer Vision Library) with a neural network backend. To achieve this aim, we first explain how the video stream is manipulated through a Python* programming interface and provide guidelines on how to use it. Finally, we discuss a working example of an OpenCV application. OpenCV is one of the packages that ship with Intel® Distribution for Python* 2018.

Introduction

Today, the possibilities of artificial intelligence (AI) are accessible to almost everyone. There are a number of AI applications, and many of them require computer vision techniques. One of the libraries most commonly used for detection and matching, motion estimation, and tracking is OpenCV1. OpenCV is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and is now maintained by Itseez. The library is cross-platform and free for use under the open source BSD license.

Usually, the OpenCV library is used to detect something on a video or image that is used as input for some AI application or deep learning framework like MXNet*2, Caffe*3, Caffe2*4, Torch*5, Theano*6, TensorFlow*7, and others. There are several AI applications that use computer vision, for instance:

  • Advanced driver assistance systems (ADAS) and autonomous cars
  • Image recognition, object detection and tracking, and automatic document analysis
    • Real-time detection of unattended baggage in airport, train, and bus stations
  • Face detection and recognition, normally used for security issues
  • Medical image processing
  • IoT (Internet of Things) applications

In this work, we give an overview of video preprocessing techniques that are used to detect a feature that may be present in a video stream, and to generate images with the detected features. Our example focuses on face detection, which is used as a preprocessing phase for a face recognition system. The face recognition system can be an AI application, a deep learning framework, or a cloud service such as Amazon Rekognition*8, Microsoft Azure* Cognitive Services9, Google Cloud Vision10, and others.


Figure 1. Video preprocessing.

Figure 1 shows the flow diagram of the face detection process from a video stream:

  • Input video: Can be from a surveillance camera, webcam, notebook camera, and so on.
  • Backend stream video: Sometimes OpenCV cannot directly open the video from the camera. In this case, we can use a tool to record, convert, and stream the video to a format/encoding that OpenCV understands. The main tool for that is the FFmpeg* library11. FFmpeg is a free software project that produces libraries and programs for handling multimedia data. It is a leading multimedia framework, capable of decoding, encoding, transcoding, muxing, demuxing, streaming, filtering, and playing nearly any format available, from the most obscure legacy formats to the most cutting-edge ones. FFmpeg also supports any signal source, such as a screen or camera, as well as file input.
  • OpenCV video reader: Open software used to read the video stream and process it to make face or object detection.
  • Frames (fps): Frames per second processed by OpenCV.
  • Image file (.jpg): Output file; this is the OpenCV image recognition result.

Environment

The environment used for this work is composed of one surveillance camera and one computer running CentOS* 7 Linux* with the Intel® Distribution for Python 2018. Intel® Distribution for Python complies with the SciPy* Stack specification, and includes the package OpenCV and a deep learning framework such as Caffe, Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)12, Theano, or TensorFlow. Python packages have been accelerated with Intel® Performance Libraries, including Intel® Math Kernel Library (Intel® MKL), Intel® Threading Building Blocks, and Intel® Data Analytics Acceleration Library. All the aforementioned software packages have been optimized to take advantage of parallelism through the use of threading, multiple nodes, and vectorization.

Intel® Distribution for Python 2018 has improved OpenCV performance compared to the OpenCV available in the CentOS Linux distribution. The performance was measured by comparing the time, in seconds, it takes to capture and process full-HD frames in OpenCV. The gain was around 92 percent. The machine used in our test has two Intel® Xeon® E5-2630 CPUs with 8 GB of RAM.

To capture a video stream it is necessary to create a VideoCapture* object. The input argument to create such an object can be either the device index or the name of a video file. The device index is simply the number that identifies which camera will provide images for the system. When the VideoCapture object is created, the image provided by the specified camera or video file is captured frame by frame. When the built-in laptop webcam or an external camera is used, it is possible to open the video directly in OpenCV, using the sequence of commands shown in Figure 2.

import cv2

cap = cv2.VideoCapture(0)

Figure 2. Capture video from camera.
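
Once the capture is open, frames are typically pulled in a loop with cap.read(). The following minimal sketch (our own example built on the standard OpenCV API, not part of the original scripts) shows the pattern:

import cv2

cap = cv2.VideoCapture(0)        # 0 selects the first camera device
while True:
    ret, frame = cap.read()      # ret is False when no frame is available
    if not ret:
        break
    # operations on 'frame' (a NumPy BGR image) go here
cap.release()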

However, OpenCV cannot handle our surveillance camera directly. For this reason, it is necessary to use a backend video stream to convert the input from the camera to a format that OpenCV can understand. We use the FFmpeg multimedia framework to convert the video stream to MJPEG format.

Figure 3 shows the flow diagram for the backend video stream. The components used for this solution are described below.

  • Camera: Device camera.
  • ffmpeg: A tool used to copy the video stream to the file cam.ffm.
  • ffserver: A tool that converts the video stream from the camera, saved to the file cam.ffm, to an MJPEG video stream (cam.mjpeg) that will be used by OpenCV.
  • OpenCV: Reads the video stream from ffserver (cam.mjpeg) and processes it frame by frame. It is possible to use filters to help the face detection process.
  • File cam.ffm: Used as a buffer from the ffmpeg tool to ffserver.
  • Stream cam.mjpeg: Served by ffserver as a buffer to OpenCV.


Figure 3. Backend video stream.

The ffmpeg software gets input from a video camera and writes to the file named “cam.ffm”. An IP address is assigned to the video camera and an authentication system is used (“user:password”) to grant the user access to the video stream. ffmpeg uses the Real-Time Streaming Protocol (RTSP) over TCP; see Figure 4. In the present case, the Uniform Resource Identifier (URI) uses channel 1, which corresponds to the original video camera.

ffmpeg -rtsp_transport tcp -i
rtsp://user:password@192.168.1.100:554/Streaming/Channels/1
http://localhost:8090/cam.ffm

Figure 4. ffmpeg tool.

Provided that the “cam.ffm” file is created, it is read by ffserver, which decodes the “cam.ffm” file and encodes it to MJPEG format, exposing it as the stream named “cam.mjpeg”. ffserver needs a configuration file, named “ffserver.config”. Figure 5 shows a basic configuration file. ffserver has a number of options that can be set, but in this application we need only the following basic configuration options:

  • Enable access to http port 8090.
  • Up to 10 clients are allowed.
  • Read the “cam.ffm” and allow access only localhost.
  • Generate a video stream in MJPEG format (cam.mjpeg) with the settings below:
    • 20 frames per second
    • Full HD resolution
  • Access from localhost and the 192.168.1./24 network is allowed.

The “ffserver.config” file can be stored in the default directory (/etc); alternatively, a custom location for the configuration file can be provided. If this file is stored in the user’s local directory, ffserver can be called by means of the command line shown in Figure 6. On the other hand, if the “ffserver.config” file is stored in the default directory, use the command shown in Figure 7 to call ffserver.

Finally, OpenCV opens the “cam.mjpeg” stream with the cv2.VideoCapture function and processes the video frame by frame.

HTTPPort 8090
HTTPBindAddress 0.0.0.0
MaxClients 10
MaxBandWidth 50000
CustomLog -
#NoDaemon

<Feed cam.ffm>
   File /tmp/cam.ffm
   FileMaxSize 1G
   ACL allow 127.0.0.1
   ACL allow localhost
</Feed>

<Stream cam.mjpeg>
   Feed cam.ffm
   Format mpjpeg
   VideoFrameRate 20
   VideoBitRate 10240
   VideoBufferSize 20480
   VideoSize 1920x1080
   VideoQMin 3
   VideoQMax 31
   NoAudio
   Strict -1
</Stream>

<Stream stat.html>
   Format status
   # Only allow local people to get the status
   ACL allow localhost
   ACL allow 192.168.1.0 192.168.1.255
</Stream>

<Redirect index.html>
   URL http://www.ffmpeg.org/
</Redirect>

Figure 5. ffserver.config file.

ffserver -d -f ./ffserver.config

Figure 6. ffserver tool—local directory.

ffserver -d -f /etc/ffserver.config

Figure 7. ffserver tool—default.
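
If the OpenCV build includes FFmpeg support, the MJPEG stream served by ffserver can also be opened directly with cv2.VideoCapture, instead of parsing the HTTP stream by hand as in the scripts below. A minimal sketch, assuming the stream URL configured above:

import cv2

# Open the MJPEG stream served by ffserver (requires FFmpeg support in OpenCV)
cap = cv2.VideoCapture('http://localhost:8090/cam.mjpeg')
ok, frame = cap.read()
print 'stream opened:', cap.isOpened(), '- first frame read:', ok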

Face Detection

When OpenCV is correctly configured by means of the procedure described above, it reads and processes all frames from the video stream. OpenCV has several built-in pretrained classifiers for face, eyes, and smile detection, among others. We use the frontal face Haar-Cascade classifier for the detection process. The details of this classifier are given in the file named haarcascade_frontalface_default.xml.

Figure 8 shows the Python script to detect faces. Below, we describe how the Python script works.

  • The script has a function called “detect”, which is used for face detection.
  • The script opens the video stream and runs in an infinite loop, identifying each beginning and end of frame.
  • Then, the frame is converted to gray to serve as input to the detect function.
  • If any face is identified within the frame, the script saves a JPEG file, with the naming convention YYYYMMDD_HH_MM_SS_frame_cam.jpg, where:
    • YYYYMMDD: stands for year, month and day
    • HH_MM_SS: stands for hour, minutes and seconds
import cv2, platform
import numpy as np
import urllib
import os
from time import strftime

def detect(img, cascade, scale, neigh, size):
    rects = cascade.detectMultiScale(img, scaleFactor=scale,
            minNeighbors=neigh, minSize=(size, size))
    if len(rects) == 0:
        return []
    rects[:,2:] += rects[:,:2]
    return rects

face_cascade = cv2.CascadeClassifier('/opt/intel/intelpython2/pkgs/opencv-3.1.0-np113py27_intel_6/share/OpenCV/haarcascades/haarcascade_frontalface_default.xml')

cam = "http://localhost:8090/cam.mjpeg"

stream = urllib.urlopen(cam)
bytes = ''
nframe = 0
nfaces = 0

scale = 1.3
neigh = 3
size = 50
margin = 40
while True:
    # to read mjpeg frame -
    bytes += stream.read(8192)
    a = bytes.find('\xff\xd8')
    b = bytes.find('\xff\xd9')
    if a!=-1 and b!=-1:
        nframe = nframe+1
        jpg = bytes[a:b+2]
        bytes= bytes[b+2:]

        if (nframe % 20)!=0:
          continue
        frame = cv2.imdecode(np.fromstring(jpg,
                dtype=np.uint8), cv2.IMREAD_COLOR)
        # we now have frame stored in frame.

        # Our operations on the frame come here
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Detect faces in the image
        rects = detect(gray, face_cascade, scale, neigh, size)

        # Save the whole frame if at least one face was detected
        if len(rects):
            filename = strftime("%Y%m%d_%H_%M_%S")+"_frame_cam.jpg"
            cv2.imwrite(filename, frame)

    # Press 'q' to quit
    #if cv2.waitKey(1) & 0xFF == ord('q'):
    #    break

cv2.destroyAllWindows()

Figure 8. Face detection script—save the frame.

Figure 10 shows another Python script to detect faces. The idea is basically the same, but this script does not save the whole frame; it saves only the detected faces. The script identifies the faces and adds some margin to retain more context, which helps the recognition software. After adding the margin, the script crops the frame and saves the result as a small image. Figure 9 shows the face detection rectangle (green, inner) and the face detection with margin (blue, outer rectangle).


Figure 9. Face detection and margin.

import cv2, platform
import numpy as np
import urllib
from time import strftime

def detect(img, cascade, scale, neigh, size):
    rects = cascade.detectMultiScale(img, scaleFactor=scale,
            minNeighbors=neigh, minSize=(size, size))
    if len(rects) == 0:
        return []
    rects[:,2:] += rects[:,:2]
    return rects

face_cascade = cv2.CascadeClassifier(
 '/usr/share/OpenCV/haarcascades/haarcascade_frontalface_default.xml')

cam = "http://localhost:8090/cam.mjpeg"

stream = urllib.urlopen(cam)
bytes = ''
nframe = 0
nfaces = 0

scale = 1.3
neigh = 3
size = 50
margin = 40
xfhd = 1920
yfhd = 1080

while True:
    # to read mjpeg frame -
    bytes += stream.read(8192)
    a = bytes.find('\xff\xd8')
    b = bytes.find('\xff\xd9')
    if a!=-1 and b!=-1:
        nframe = nframe+1
        jpg = bytes[a:b+2]
        bytes= bytes[b+2:]

        if (nframe % 20) != 0:
          continue
        frame = cv2.imdecode(np.fromstring(jpg,
                dtype=np.uint8), cv2.IMREAD_COLOR)
        # we now have frame stored in frame.

        # Our operations on the frame come here
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Detect faces in the image
        rects = detect(gray, face_cascade, scale, neigh, size)

        # Crop and save each detected face
        if len(rects):
            iface = 0
            for x1, y1, x2, y2 in rects:
                iface = iface+1
                nfaces = nfaces+1
                sface = "_%02d" % iface
                filename = strftime("%Y%m%d_%H_%M_%S")+ sface +".jpg"
                yf1 = y1 - margin
                if yf1 < 0:
                   yf1 = 0
                yf2 = y2 + margin
                if yf2 >= yfhd:
                   yf2 = yfhd - 1
                xf1 = x1 - margin
                if xf1 < 0:
                   xf1 = 0
                xf2 = x2 + margin
                if xf2 >= xfhd:
                   xf2 = xfhd - 1
                crop_img = frame[yf1:yf2, xf1:xf2]
                cv2.imwrite(filename, crop_img)

    # Press 'q' to quit
    #if cv2.waitKey(1) & 0xFF == ord('q'):
    #    break

cv2.destroyAllWindows()

Figure 10. Face detection script—save only the faces.

Conclusions

This work shows how the OpenCV library can be used to provide adequate input to face recognition software. Intel Distribution for Python 2018 greatly improves OpenCV performance. Any deep learning framework included in Intel Distribution for Python can be used to build the recognition software. Depending on the image, some OpenCV filters can be used to improve image sharpness; for example, histogram equalization.

If the face recognition software runs in the cloud, the second script, shown in Figure 10, is more appropriate because it saves only the face, not the entire frame. This saves storage space, since the file is smaller, and makes uploads faster. In the examples discussed above, the frame size was around 400 KB and the face size was around 35 KB. Figure 11 shows a fragment of the script for uploading a file to S3* on Amazon Web Services* (AWS); note that an S3 client object must be created first. Once the upload is complete, the file is removed. This code fragment can be included in any of the previously shown scripts.

import boto3
import os

# Create the S3 client (credentials are read from the environment or the
# AWS configuration files)
s3 = boto3.client('s3')

# ...

# Upload the frame to AWS and remove the local file after the upload
s3.upload_file(filename, 'YOUR_BUCKET', 'YOUR_FOLDER/' + filename)
os.remove(filename)

Figure 11. Code fragment for uploading to the AWS* Cloud.

References

  1. OpenCV: http://opencv.org/
  2. MXNet: https://mxnet.incubator.apache.org/
  3. Caffe: http://caffe.berkeleyvision.org/
  4. Caffe2: https://caffe2.ai/
  5. Torch: http://torch.ch/
  6. Theano: http://deeplearning.net/software/theano/
  7. TensorFlow: https://www.tensorflow.org/
  8. Amazon Web Services—Rekognition: https://aws.amazon.com/rekognition/
  9. Microsoft Azure—Cognitive Services:  https://azure.microsoft.com/en-us/services/cognitive-services/
  10. Google Cloud Platform*—Cloud Vision: https://cloud.google.com/vision/
  11. FFmpeg multimedia framework: https://www.ffmpeg.org/
  12. Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN): https://01.org/mkl-dnn


Optimizing OpenStack* Swift* Performance with PyPy*


Isolate and address challenges, understand different solutions, and learn best-known methods associated with adopting a PyPy* just-in-time interpreter for a leading cloud computing solution.

Introduction

Python* is an open source, general purpose programming language. Applications based on Python are used in data centers for cloud computing and similar workloads. By optimizing the Python core language—the interpreter itself—we can improve the performance of almost any application implemented in Python. For example, OpenStack* Swift*, a leading open source object storage solution, is mostly written in Python. We demonstrate in this paper that, by simply switching the interpreter, OpenStack Swift performance can be improved. We achieved up to a 2.2x increase in throughput, with up to a 78 percent reduction in response latency, as measured with benchmarks.

In this paper, we share technical insights for achieving optimal OpenStack Swift performance using a just-in-time (JIT) Python interpreter, the PyPy* JIT. We will present our best-known methods (BKM) for optimizing application performance using the JIT solution.

Terminology

CPython*: Python is commonly known as a programming language. However, strictly speaking, Python is actually the language specification, because application source code written in Python (or that complies with the Python specification) can be interpreted by different runtime implementations. The reference or standard Python interpreter, named CPython, is implemented in the C programming language. CPython is an open source implementation and has a wide developer support community. It is used as the baseline interpreter for the current experiment.

Python version: Within CPython, there are two main branches, 2.7 and 3, often referred to as Python 2 and Python 3. This paper focuses on CPython 2.7 and the PyPy JIT that is compatible with CPython 2.7. In this context, CPython refers to CPython 2.7, and PyPy JIT refers to PyPy2 JIT, unless otherwise noted. Python 3 or PyPy 3 is not discussed here.

Python interpreter: We use the term interpreter to refer to these typical terms: runtime, core language, and compiler. The interpreter is the piece of code (binary) that understands the Python application source codes and executes them on the user’s behalf.

Modules, library and extension: A Python interpreter may include additions providing commonly used functions, known as standard libraries, which are packaged and shipped with the interpreter itself. These functions may be implemented within the interpreter binary (built-in), in libraries written with pure Python scripts (*.py files), or in libraries written in the C programming language (*.so files on Linux*-based operating systems, and *.dll files on Windows* operating systems). A Python developer may create additional C modules, also known as customized C extensions. From a general Python application developer’s point of view, module, library, and extension are interchangeable in many cases.

Python application: An application may be written in pure Python scripts (ASCII text files), with or without C modules. Pure Python codes are hardware-agnostic and can be distributed as a software product. C modules, however, are typically compiled into binaries targeting a specific underlying hardware before being shipped out as a product.

Python execution: To simply illustrate how Python works, Figure 1 shows the relationship among the application, the C extensions, and the Python runtime. The application must be interpreted, or run under a runtime. The interpreter (or runtime) could also call system libraries in the underlying operating systems to get system services such as input/output (I/O) or networking.

Figure 1. Relationship among different software pieces within a Python* execution environment. The Python application has a dependency on the Python interpreter, which in turn has a dependency on the underlying operating system and services.

Dynamic translation: Python is a dynamic scripting language. A simple way to understand the word dynamic is to imagine how a variable can hold a string in one code line, and then hold an integer in another line at run time. This is in contrast with static compiled languages, where data types must be declared at compile time and cannot be changed at run time. However, dynamic translation is also understood as bytecodes being compiled into machine codes at run time (dynamically) for better performance.
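
As a minimal illustration of this dynamic behavior (our own example, not taken from the OpenStack code base):

x = "a string"    # the name x holds a string
x = 42            # the same name now holds an integer; legal at run time
print type(x)     # <type 'int'> (Python 2 syntax)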

Just-in-time compilation: A form of dynamic translation, combining the speed of compiled code with the flexibility of interpretation. The JIT technique is common among dynamic scripting language runtimes (HHVM*, Node.js*, and Lua*, to give a few examples), all of which target performance as a design goal. A JITed code section is a memory location where the dynamically generated and optimized instructions for the running machine are stored at run time. JITed code may be generated and destroyed constantly at run time.

PyPy: PyPy is an alternative implementation of Python, with the aim of faster execution speed. Unlike CPython, which does not have a JIT capability, PyPy provides the JIT compilation feature. Also unlike CPython, which has a very large development community and has the largest user base among all Python implementations, PyPy is maintained by a much smaller developer community. In this paper, PyPy and PyPy JIT are interchangeable.

OpenStack and Swift: OpenStack is an open source software platform for cloud computing, mostly deployed as an infrastructure-as-a-service (IaaS). The software platform consists of interrelated components (or services) that control diverse, multivendor hardware pools of processing, storage, and networking resources throughout a data center. Swift, a scalable, redundant storage system, is a part of the OpenStack services. Objects and files are written to multiple disk drives spread throughout servers in one or more data centers, with the OpenStack software responsible for ensuring data replication and integrity across the clusters. Storage clusters scale horizontally simply by adding new nodes. Swift release 2.11 was used for this experiment. More details about Swift will be provided in later sections.

A node here refers to a single physical machine (bare metal). A proxy server typically refers to a daemon process running on a proxy node and listening for incoming client requests. A proxy node will typically have multiple proxy server processes running, while a storage node will have multiple storage server processes running. For the sake of simplicity, server and node are interchangeable in this paper.

Python* performance

Python is a popular scripting language for two main reasons: it makes it easy to quickly create code, and it has an increasingly rich set of open source libraries.

So yes, Python offers a lot of benefits, but as a dynamic scripting language it is also inherently slow. One issue is that the hardware-agnostic CPython—the standard Python interpreter—runs in pure interpreted mode: the application source code must be analyzed at run time (dynamically) before being executed on the target's underlying hardware. Applications written in statically compiled languages, on the other hand, are already converted into machine code for the target hardware (in this case, CPUs) at compile time, before being distributed as a product. The latter run much faster. Most developers are aware of Python's slower performance and can accept it. However, in some enterprise applications (such as OpenStack), users demand better performance.

Common approaches to improve Python performance

One of the solutions adopted by the Python community has been to write C modules, and thus offload CPU-intensive operations to performant C code. A similar approach, called Cythonizing*, converts existing Python code to C code. This type of approach adds complexity, while increasing development, deployment, and maintenance costs.

In the scientific world, writing code blocks with decorators within the applications is also common. The decorators can instruct the underlying interpreter to execute the Python application code blocks in optimized or JITed mode.
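
As an illustrative sketch of the decorator approach, the example below uses the numba* package, one such JIT; this is our own example and is not part of the OpenStack Swift stack:

import numpy as np
from numba import jit

@jit                      # ask the JIT to compile this function at run time
def dot(a, b):
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print dot(a, b)           # 11.0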

Yet another alternative is to rewrite the application in a different programming language, such as Golang* (also known as Go*). Again, the cost of development, implementation, and maintenance remains a big factor to consider.

Using PyPy* JIT as a CPython* alternative for OpenStack* Swift*

We chose a different approach: we applied a Python JIT engine (PyPy JIT) as the interpreter. This approach does not require application source code changes and/or additional hardware. Considering that the OpenStack code base runs to millions of lines, we think this is one of the best ways to lower the total cost of ownership (TCO) while maintaining good performance and scalability that rival other solutions.

Enriching open source communities

One of our goals is enriching the open source community. We want to drive awareness of PyPy’s capabilities and potential by using and improving OpenStack Swift. To that end, this paper presents information that may be useful to both software architects and developers. The information here will help the audience understand the performance benefits of PyPy. This information will also help to better understand the challenges, solutions, and BKMs associated with adopting PyPy.

OpenStack Swift

Overview

OpenStack Swift is designed to durably store unstructured data at a massive scale, and keep that data highly available for fast access. To do this, Swift provides an HTTP application programming interface (API) to clients. API requests use standard HTTP verbs, and API responses use standard HTTP response codes.

Figure 2. Schematic view of a generic OpenStack* Swift* architecture.

At a high level, a customer’s request for any object (a photo, for instance) reaches a Swift cluster after being routed by a load balancer. The Swift cluster may contain multiple proxy server nodes, while the storage nodes may be divided into multiple regions (such as a data center in the United States and another one in Europe). Each region may contain a number of zones (such as a full rack within a data center), and each zone typically contains multiple storage nodes. A user request will eventually be fulfilled by fetching (or writing) the object file from (or to) one of the disks somewhere in the service provider’s facility.

Proxy server and storage servers

Internally, Swift has two major logical parts.

Proxy server: responsible for implementing most of the API. The proxy server coordinates and brokers all communication between clients and storage nodes. The proxy server is responsible for determining where data lives in the cluster. The proxy server is also responsible for choosing the right response to send to clients, depending on how the storage servers handled the initial request. Basically, the proxy server is responsible for accepting network connections from clients, creating network connections to storage servers, and moving data between the two.

Storage servers: responsible for persisting data on durable media, serving data when requested, replicating, and auditing data for correctness.

Focusing on the proxy server only

To limit our scope, in this paper we are only focusing on improving the performance of the proxy server where significant CPU-bound performance bottlenecks were observed in the past. No change was made on any of the storage servers. This does not imply that no CPU performance issues were observed on the storage server, which deserves a separate study.

Reading and writing objects

To write a new object into Swift:

  • A client sends a PUT request to a Uniform Resource Identifier (URI) that represents the full name of the new object. The body of the request contains the data the client wants to store in the system.
  • The proxy server (proxy) accepts the request.
  • The proxy then deterministically chooses the correct storage locations for the data, and opens connections to the backend storage servers.
  • The proxy then reads in bytes from the client request body, and sends that data to the storage servers.
  • For redundancy, a new object is typically written into multiple storage servers.

To read an object:

  • A client sends a GET request to the URI that represents the full name of the object.
  • The proxy server accepts the request.
  • The proxy deterministically chooses where the data should be in the cluster, and opens the necessary connections to backend storage servers. (A minimal client-side sketch of both request types is shown below.)
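
The following sketch shows what both requests look like from the client side, using Python 2's urllib2; the token and storage URL are hypothetical placeholders that a real client first obtains from the auth endpoint:

import urllib2

token = 'AUTH_tk_example'                              # hypothetical token
storage_url = 'http://controller:8080/v1/AUTH_system'  # hypothetical account URL

# PUT: write a new object into a container
req = urllib2.Request(storage_url + '/mycontainer/hello.txt',
                      data='hello swift',
                      headers={'X-Auth-Token': token})
req.get_method = lambda: 'PUT'         # urllib2 defaults to POST when data is set
print urllib2.urlopen(req).getcode()   # expect 201 (Created)

# GET: read the object back
req = urllib2.Request(storage_url + '/mycontainer/hello.txt',
                      headers={'X-Auth-Token': token})
print urllib2.urlopen(req).read()      # 'hello swift'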

Benchmarking, System Configuration, and Performance Metrics

Selecting microbenchmark for system health check and tuning

Before evaluating OpenStack Swift performance, we need to conduct a system-level health check and configure the systems to ensure performance consistency. At the time this work was started, the Grand Unified Python Benchmark* (GUPB) suite was the only Python benchmark suite recommended by the Python developer community. It was used during the initial system health check and performance comparison. The suite included more than 50 microbenchmarks (each a single-threaded Python application running a specific task, such as regular expressions, a ray trace algorithm, a JSON parser, and so on). The suite was hosted in one of the Python Mercurial* repositories and is now obsolete as of this writing. A similar yet different benchmark suite is now recommended and hosted on GitHub*.

System configuration BKM to reduce run-to-run performance variation

After a number of experiments running the GUPB, we established the following best known method (BKM) to reduce run-to-run performance variations to a minimum.

  1. Set all CPU cores to run at the exact same and fixed frequency. We did this by disabling the P-state in the system BIOS during system boot. On the Ubuntu* operating system, the CPU frequency could be set by setting the parameter value at “/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq” and “/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq” with sudo user. An application performance behavior could be quite different when a CPU runs at 1GHz versus 2GHz at run time. The dynamic frequency adjustment could be done by the hardware or the operating system, and may be due to multiple factors, including thermal and cooling.
  2. Disable ASLR (address space layout randomization), a Linux security feature that is on by default, using this command:
    echo 0 > /proc/sys/kernel/randomize_va_space

After setting the CPU frequency at a fixed value as a baseline, the next two graphs show the difference before and after disabling ASLR, running one of the microbenchmarks within the GUPB suite, CALL_METHOD. This method evaluates the overhead of calling Python functions. In this example, the performance metric is execution time (seconds), with lower values indicating better performance. To demonstrate the data scattering, a median value is calculated first from all the data points, then the delta relative to the median from each data point is plotted in the graph.
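
The plotted quantity can be computed in a few lines; the sketch below uses NumPy, and the sample values are ours:

import numpy as np

times = np.array([0.52, 0.51, 0.58, 0.50, 0.53])   # example execution times (s)
median = np.median(times)
delta_pct = (times - median) / median * 100.0      # per-run delta relative to the median
print delta_pct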

Before disabling ASLR:

Figure 3. Significant run-to-run performance variation with microbenchmark CALL_METHOD before turning off ASLR.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates that are intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

After disabling ASLR:

Figure 4. Much reduced run-to-run performance variation with same microbenchmark after turning off ASLR.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates that are intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

When ASLR is enabled (Figure 3), the run-to-run delta was as high as 11 percent; the data scattering across the 30 repeated runs is clearly visible. When ASLR is disabled (Figure 4), the run-to-run variation drops to nearly zero, with just a single outlier, running the very same microbenchmark on the very same hardware.

Choosing a benchmark to measure OpenStack Swift performance

We needed a benchmark tool that issues HTTP load, stresses the OpenStack Swift cluster, and scales well. In addition, we needed a tool that can track the performance of each request and create a histogram of the entire benchmark run. Only one tool matched our requirements: an open source tool called ssbench*, written in Python and originally developed by SwiftStack*.

Ssbench generates payloads to the proxy server to simulate clients sending requests to the server. It prints running status to the console during the run, which is very useful. It also provides a summary at the end, reporting requests completed per second (RPS), as the performance metric. This metric is reported as throughput. Another performance metric is latency, measured in seconds, reflecting the round trip time required to complete each request.

Running ssbench* with input parameters

The following code is an example CRUD (Create, Read, Update and Delete) input file for ssbench, set at [0, 100, 0, 0], or 100 percent READ scenario. We created this file using an existing template from the ssbench package under the scenario folder.

In this experiment, we used 4k (or 4,096 bytes to be exact) as the object size to get the maximum I/O rate out of the storage nodes. 4k is the block size of our file systems on the disk. Unless explicitly specified, 4k was the default object size for the current experiment. Reading or writing 4k chunks aligns with the block boundary, and is the most efficient and fastest way to get bytes out of (or in to) the disk. One way to confirm the device block size is this command:

sudo blockdev --getbsz /dev/sda1
4096

In the following code experiment, we used the 4k size for both tiny and small objects while doing READ only:

{"name": "Small test scenario","sizes": [{"name": "tiny","size_min": 4096,"size_max": 4096
 }, {"name": "small","size_min": 4096,"size_max": 4096
 }],"initial_files": {"tiny": 50,"small": 50
 },"operation_count": 500,"crud_profile": [0, 100, 0, 0],"user_count": 4,"container_base": "ssbench","container_count": 100,"container_concurrency": 10
}

You can overwrite some of the parameters from the command-line input.

Here is a sample command line:

ssbench-master run-scenario -f ./very_small.scenario -A http://controller:8080/auth/v1.0 -U system:root -K testpass --pctile 90 --workers 4 -r 600 -u 256 -s ./ssbench-results/very_small.scenario.out

In this test case, we used a v1.0 authentication scheme. We did not use v2.0 or Keystone*, since we noticed that the Keystone service was another performance bottleneck, and should be studied separately.

The above command line dictates 256 concurrent users, which overwrites the user_count value set in the input file. This is the default value for our experiment (unless specified explicitly). The command line dumps the results to the “very_small.scenario.out” file in ASCII text format. We keep the number of workers constant at 4 (--workers 4) to match the four CPU cores available on the client machine, while varying the number of users (user concurrency) to adjust the load or stress level on the server. The “-r 600” parameter specifies that each ssbench run lasts 600 seconds.

Tuning Memcached* and resolving its scaling issue

Memcached* is an important component in the OpenStack Swift software stack. It runs as a Linux service and provides key/value pairs as a cache. The application is written in C, and is highly efficient. However, if not used correctly, Memcached can lead to performance issues—we show these issues in the next throughput chart.

First, we looked at Memcached performance issues and symptoms. We then adjusted the way we ran Memcached, in order to optimize performance.

Figure 5 provides a histogram of throughput when running ssbench. Each dot represents a single throughput number for that second during the run.

Before resolving the Memcached issue, the histogram shows a wider spread (the red dots).

Figure 5. Throughput data histogram comparison before and after tuning Memcached*. The red dots represent the results before tuning, and the green dots represent the results after tuning.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates that are intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

From the ssbench client’s terminal/console, we noticed a large number of errors coming back from the Swift proxy server. Digging into the proxy server log files, we found a significant number of timeouts. There could be multiple reasons for any given timeout, such as a storage node with an I/O bottleneck, intermittent loss of network packets, proxy or storage CPUs being saturated, and so on. In real-world environments, the root cause of a timeout can be complex.

However, the errors we saw were not caused by any of the usual suspects. It was only after we did application code instrumentation and network traffic sniffing that we narrowed the root cause to Memcached. Specifically, Memcached was not responding as quickly as expected.

That was surprising, so the first thing we did was run a Memcached benchmark to make sure the response was actually within the expected time range. The second thing we did was spread the workload in a distributed fashion; in other words, we installed and ran Memcached on two to five separate nodes. This Memcached load spreading is configured in the /etc/swift/proxy-server.conf file, under [filter:cache]; for example:

memcache_servers = 192.168.0.101:11211,192.168.0.102:11211,192.168.0.103:11211

(This sets Memcached to run on three separate machines with three different IP addresses.)

Distributing the workload helped, but the issue still existed, even after adding five machines (also known as nodes in this context) to host the Memcached services. Also, the number of response errors from the proxy server increased as we increased the stress level by raising the number of concurrent users (-u option) from ssbench. After significant searching, we finally saw that one parameter was commented out in the proxy configuration file (proxy-server.conf):

#memcache_max_connections = 2

This meant that the Memcached server was configured with a default maximum of two connections at any point in time. For our tests, we increased that value to 256, and we went back to a single Memcached server running on the proxy server itself. With those changes, all timeout errors disappeared.
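
A quick round-trip probe, along the lines of the benchmark mentioned above, can be sketched in a few lines of Python. The pymemcache client here is an assumption for illustration (we used our own instrumentation); any Memcached client would do:

import time
from pymemcache.client.base import Client

# Time N set/get round trips against the Memcached instance on the proxy.
client = Client(("127.0.0.1", 11211))
N = 10000
start = time.time()
for i in range(N):
    key = "probe-%d" % i
    client.set(key, b"x")
    client.get(key)
elapsed = time.time() - start
print("average round trip: %.3f ms" % (1000.0 * elapsed / (2 * N)))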

After our changes, not only was the data scatter significantly reduced, but the maximum throughput also increased (green dots in Figure 5). On average, we achieved a 1.75x speedup, or a whopping 75 percent improvement in throughput, just by tuning Memcached alone.

With system tuning and configuration complete, ssbench runs produced consistent throughput, giving us a good baseline. We were then ready to collect performance data and compare results before and after using PyPy.

OpenStack Swift Performance Improvement with PyPy

We ran ssbench for 600 seconds with two types of experiments: 100 percent READ and 100 percent WRITE. Ssbench tracks the throughput during the entire run, allowing us to check the throughput delta between CPython and PyPy from the beginning to the end of the benchmark run. In Figure 6, the green dots represent the PyPy impact on throughput during 100 percent READ, and the blue dots represent the impact during 100 percent WRITE. Any data value above 0.0 percent is good, indicating a performance improvement from PyPy.

Figure 6. OpenStack* Swift* ssbench* run histograms at 100 percent READ and 100 percent WRITE, comparing PyPy* with CPython*.

As shown in Figure 6, the overall throughput from PyPy is higher than that from CPython for both READ (green) and WRITE (blue), as demonstrated by the positive percentage values. However, in the first 10 seconds, PyPy is actually lower than CPython. Over the next 20 seconds or so, PyPy rises and gradually exceeds CPython until it reaches a plateau. The chart also illustrates a greater performance improvement from PyPy during READ than during WRITE (the majority of the green READ dots sit above the blue WRITE dots).

PyPy warmup

The PyPy behavior during the first 30 seconds is known as warm-up. During that time, the PyPy interpreter runs in interpreted mode (the same as CPython) while, at the same time, collecting instruction traces, generating JITed instructions, and optimizing those just-generated instructions. This is the initial overhead of PyPy, and it cannot be avoided.

Up to 2.2x speedup PyPy over CPython during 100 percent READ

After the warmup period, the PyPy-over-CPython delta remains flat (a near-constant throughput rate). As discussed earlier, server processes are long running, so comparing average values or trends over a long window, such as greater than 30 seconds, is reasonable. With that in mind, PyPy throughput is roughly 120 percent higher than CPython, a speedup of 2.2x, at 100 percent READ.

Up to 1.6x speedup PyPy over CPython during 100 percent WRITE

Figure 6 shows the comparison at 100 percent WRITE, where the PyPy-over-CPython throughput improvement is roughly 60 percent, or a 1.6x speedup on average after the warmup. This value is much smaller than the 2.2x because WRITE is more I/O constrained than READ. With READ, various data caching mechanisms reduce the actual I/O activity on the disks in the OpenStack Swift storage nodes, so the workload is more CPU-bound. At 100 percent WRITE, disk utilization reached 60 percent in our experiment (based on data collected from iostat*, an open source tool), a sign that the workload was increasingly I/O-bound. We can also see some blue dots dropping below 0.0 percent, indicating a performance regression from PyPy for the same reason: a faster proxy server sending more requests than the back-end storage servers can handle puts more pressure on the disks, slowing everything down (doing more harm than good).

Performance impact by object size and user concurrency

In the previous experiments, we used a 4KB object size with user concurrency fixed at 256. In the next experiment, we studied the performance impact of PyPy while varying the object size, which affects I/O, and the user concurrency, which affects load. The result is shown in Figure 7. The chart reproduces the roughly 2x (100 percent) performance gain using a 1KB object size during 100 percent READ. However, with a 10MB object size for READ, the performance gain is much smaller, in the 10 to 20 percent range. For WRITE, the performance gain is in the 8 to 20 percent range with a 1KB object size, and 40+ percent with the 10MB object size. In this experiment, increasing user concurrency from 200 to 300 and 400 improves the performance gain slightly.

Figure 7. Ssbench* throughput comparison between CPython* and PyPy* at 100 percent READ and 100 percent WRITE with varying user concurrency and object size.

Up to 78 percent reduction in response latency during READ

Throughput, which measures requests per second (RPS), is one vector of the performance metrics. Another way to track performance is to measure the time it takes to complete a request, in seconds. Figure 8 shows the response latency reduced by 57 percent at 256 user concurrency and 4k object size, and by 78 percent at 2048 user concurrency and 4k object size, during 100 percent READ. Throughput is what service providers see on the back-end server side, while the latency impact is what the customer feels directly; in that sense, the latter metric is valuable for simulating or predicting customer impact during peak traffic hours. This result shows a greater PyPy performance benefit when the system is under heavier load at 2048 user concurrency (versus the 256 user concurrency baseline).

Figure 8. Latency reduction.

Additional Discussion

How and why PyPy helps

We demonstrated that PyPy can improve performance. This is because PyPy optimizes the machine code (the JITed instructions) dynamically, reducing or eliminating redundant work, such as repeated data type checking in a loop while the data types of some variables remain constant. In short, PyPy reduces the number of CPU cycles required to complete the same amount of work; this is called CPU pathlength reduction. A deeper analysis would require a full and separate white paper.
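
As a toy illustration of the kind of loop the JIT rewards (a sketch for demonstration, not code from Swift itself), run the same type-stable hot loop under CPython and PyPy and compare the times:

import time

def checksum(data):
    # CPython re-checks the types of total and b on every iteration;
    # PyPy's JIT proves them stable and compiles the loop to tight
    # machine code after a brief warm-up.
    total = 0
    for b in data:
        total = (total + b) & 0xFFFFFFFF
    return total

data = bytearray(range(256)) * 40000  # roughly 10 MB of bytes
start = time.time()
checksum(data)
print("%.3f s" % (time.time() - start))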

PyPy only helps if mostly CPU-bound

To showcase the PyPy value, the application must be CPU-bound, with relatively small impact from other long-latency hardware constraints, such as memory, networking, or I/O. We achieved this by using a fast network switch and fast memory DIMMs, and by adjusting the user-concurrency parameter in ssbench to create the appropriate payloads. During our initial experiments, we noticed that our Intel® Xeon® processor-based proxy node had a very low CPU utilization rate of 10 percent, no matter how much user concurrency we applied in ssbench. After some further digging, we realized we were bottlenecked on the storage. In that kind of setup, we knew we would not get much performance benefit from switching the Python interpreter on the proxy node; we were simply not CPU-bound there. However, as stated earlier, we wanted to focus on the proxy node in this experiment. To make the proxy node the performance bottleneck, we reduced the number of available CPU cores on it to 10. After that, we were able to push CPU utilization above 80 percent on all CPU cores on the proxy node. Even so, our results were still partially limited by I/O, especially during 100 percent WRITE, as demonstrated earlier.
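
One common way to reduce the live core count on a running Linux system is to take cores offline through sysfs. The sketch below is illustrative only (it is not our exact command); it must run as root, and cpu0 typically cannot be taken offline:

import glob
import re

KEEP = 10  # number of cores to leave online, as in our experiment

def cpu_index(path):
    return int(re.search(r"cpu(\d+)", path).group(1))

# cpu0 usually has no 'online' file, so this covers cpu1..cpuN.
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/online"),
                   key=cpu_index):
    if cpu_index(path) >= KEEP:
        with open(path, "w") as f:
            f.write("0")  # take this core offline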

On Linux-based operating systems, CPU utilization can be monitored dynamically with a tool called Top. If the application is not CPU-bound, the performance of the programming language itself is no longer the main factor, and even hand-tweaked assembly code may not help. While collecting performance data from ssbench on the client machine, we also collected system performance data, including CPU, memory, networking, and I/O, on the proxy server and storage servers. To monitor CPU behavior under load, we collected CPU profiles using a Linux tool called Perf. The Perf data revealed that the Python interpreter itself took greater than 80 percent of the CPU cycles, with CPython’s main loop function, PyEval_EvalFrameEx, taking 24 percent of the CPU cycles. The Top results indicated that we were mostly CPU-bound, and the Perf data identified CPython as the performance bottleneck, before we started working with PyPy.

The experiment results discussed in this paper demonstrate the potential performance benefit of PyPy. In a real-world production setup, where multiple factors are often present and influence each other, a comprehensive system performance analysis is recommended before optimizing the Python runtime.

PyPy and module compatibility

One important aspect of OpenStack Swift’s design is that it supports third-party extensions. These middleware modules are typically loaded into the proxy, and can act on either a request or a response. Many features in Swift are implemented as middleware; in fact, a wide range of functionality in the ecosystem is implemented this way, including S3 API compatibility. When considering ways to improve the performance of Swift’s proxy server, one requirement is to ensure seamless integration with these third-party modules (including C extensions). Fortunately, as of this writing, we have resolved all known issues with integrating PyPy into OpenStack Swift, such as a memory leak in the Eventlet* module; a patch was created and has been merged since Eventlet 0.19.

OpenStack Swift application optimization

While analyzing the OpenStack Swift source code and runtime characteristics, we noticed some additional issues. First, the hottest Python code on the proxy server spends the majority of its CPU cycles repeatedly creating and destroying network sockets while communicating with the storage servers. This is an architectural issue, and is not related to the implementation language itself.

We think there is an alternative way to communicate with the storage servers: create a connection pool to each storage server from the proxy server as soon as the processes are up and running. We believe there is huge potential for performance improvement with that kind of change. Additionally, a better software-level data caching implementation in the proxy server could help minimize or eliminate trips to the storage servers, making use cases such as READ much more efficient.
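
To make the idea concrete, here is an illustrative sketch of such a pool. It is not the Swift implementation; Python 3 module names are used (on the Python 2.7 stack in this study they would be httplib and Queue), and the node address and port are examples:

import http.client
import queue

class ConnectionPool:
    # A fixed-size pool of persistent HTTP connections to one storage node,
    # built once at startup so hot request paths stop paying connect and
    # teardown costs.
    def __init__(self, host, port, size=8):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(http.client.HTTPConnection(host, port))

    def request(self, method, path):
        conn = self._pool.get()
        try:
            conn.request(method, path)
            resp = conn.getresponse()
            body = resp.read()  # drain the response so the socket can be reused
            return resp.status, body
        finally:
            self._pool.put(conn)

# Build one pool per storage node when the proxy worker starts:
pools = {"192.168.0.201": ConnectionPool("192.168.0.201", 6000)}
status, body = pools["192.168.0.201"].request("GET", "/healthcheck")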

Summary

With no application (Swift) source code changes or hardware upgrades, we have demonstrated that PyPy can potentially improve OpenStack Swift throughput by up to 2.2x and reduce response latency by up to 78 percent. We believe this is a good way to enhance existing system performance without increasing the total cost of ownership, and we recommend this approach to OpenStack Swift architects and developers. We also shared tuning best-known methods (BKMs) for bringing the systems into optimal condition by configuring ASLR and Memcached appropriately. We believe these technical insights can be applied to other large Python applications where performance is one of the design criteria.

Configurations

                          Client Nodes           Proxy Nodes              Storage Nodes
Number of Nodes           1                      1                        15
Processor                 Intel® Core™ i7-4770   Intel® Xeon® E5-2699 v4  Intel® Atom™ C2750
CPU Cores/per node        4                      10                       8
CPU Frequency             3.4 GHz                1.8 GHz                  2.4 GHz
CPU Hyper-Threading       Off                    Off                      Off
Memory                    8 GB                   32 GB                    8 GB
Operating System          Ubuntu* 14.04 LTS      Ubuntu* 14.04 LTS        Ubuntu* 14.04 LTS
OpenStack* Swift Version  N/A                    2.11                     2.11
Ssbench* Version          0.2.23                 N/A                      N/A
Python* Version           2.7.10                 2.7.10                   2.7.10

Intel® VTune™ Amplifier - Impact of recent OS security updates

The Intel® VTune™ Amplifier product team is investigating the impact of recent OS security updates. We are providing initial guidance which will allow you to continue using VTune Amplifier 2018 Update 1 and earlier.

Analyses using precise events or hardware event-based sampling with stacks may cause an application or system crash. We recommend that you do not use such analyses. These include:

  • Advanced Hotspots
  • All analyses listed under microarchitecture analysis
  • HPC Performance Characterization
  • Custom Analysis that uses the collect stacks option or precise events

The following analyses do not use stacks or precise events and work correctly:

  • Basic Hotspots
  • Concurrency
  • Locks and Waits
  • Platform analysis

On Linux, Perf-based (driverless) collection is not affected by the OS security updates and will work correctly for all analyses. Uninstall the VTune Amplifier sampling driver to enable perf-based collection. See instructions for uninstalling the VTune Amplifier sampling driver.

In addition, building the sampling driver may fail on kernels newer than 4.13.

If you continue to experience issues, please submit a ticket at https://supporttickets.intel.com. For information on security updates, see the Intel Security Center site.

We are working on Intel® VTune™ Amplifier product updates which will restore proper operation.  We will keep you posted on product updates as they become available. Stay tuned.


Pedestrian Detection Using Deep Neural Networks on Intel® Architecture

Abstract

This paper explains the process to train and infer the pedestrian detection problem using the TensorFlow* deep learning framework on Intel® architecture. A transfer learning approach was used by taking the frozen weights from a Single Shot MultiBox Detector model with Inception* v2 topology trained on the Microsoft Common Objects in Context* (COCO) dataset, and then using those weights on a Caltech pedestrian dataset to train and validate. The trained model was used for inference on traffic videos to detect pedestrians. The experiments were run on Intel® Xeon Phi™ and Intel® Xeon® Scalable Gold processor-powered systems. Improved model detection performance was observed by creating a new dataset from the Caltech images, selectively filtering it based on the ratio of object size to image size, and training the model on this new dataset.

Introduction

With the world becoming more vulnerable to pronounced security threats, intelligent video surveillance systems are becoming increasingly significant. Video monitoring in public areas is now common; prominent examples of its use include the provision of security in urban centers and the monitoring of transportation systems. These systems can monitor and detect many elements, such as pedestrians, in a given interval of time. Detecting a pedestrian is an essential and significant task in any intelligent video surveillance system, as it provides fundamental information for semantic understanding of the video footages. This information has an obvious extension to automotive applications due to its potential for improving safety systems.

Continued research in the deep learning space has resulted in the evolution of many frameworks to solve the complex problem of image classification, detection, and segmentation. These frameworks have been optimized specific to the hardware on which they are run in order to achieve better accuracy, reduced loss, and increased speed. Intel has optimized the TensorFlow* library for better performance on its Intel® Xeon® Scalable and Intel® Xeon Phi™ processors.

This paper discusses the training and inferencing pedestrian detection problem that was built using the Inception* v2 topology with the TensorFlow framework on an Intel® processor-powered cluster. A transfer learning approach was used by taking the weights for the Inception v2 topology on the Microsoft Common Objects in Context* (COCO) dataset and using those weights on a Caltech dataset to train and validate. Inference was done using traffic videos to detect the pedestrians.

Train and Infer Procedures

This section describes in detail the steps we used to train and infer the pedestrian detection problem.

Choosing the Environment

Hardware Configuration

We began the experiments on an Intel Xeon Phi processor-powered system. Since Intel had launched its new and faster Intel Xeon Scalable Gold processor family at that time, and expecting faster training, we decided to perform the experiments on the new Intel Xeon Scalable Gold processor as well.

The hardware details for the Intel Xeon Scalable Gold and Intel Xeon Phi processors used for the experiments are listed in Table 1 and Table 2.

Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte order           Little endian
CPU(s)               24
Core(s) per socket   6
Socket(s)            2
CPU family           6
Model                85
Model name           Intel® Xeon® Gold 6128 processor @ 3.40 GHz
RAM                  92 GB

Table 1. Intel® Xeon® Scalable Gold processor configuration.

Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte order           Little endian
CPU(s)               256
Core(s) per socket   64
Socket(s)            1
CPU family           6
Model                87
Model name           Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
RAM                  110 GB

Table 2. Intel® Xeon Phi™ processor configuration.

Software Configuration

The TensorFlow framework optimized for Intel® architecture and the Intel® Distribution for Python* were used as the software configuration, as shown in Table 3 and Table 4.

TensorFlow*   1.3.0 (Intel® optimized)
Python*       3.5.3 (Intel® Distribution)

Table 3. Software configuration for the Intel® Xeon® Gold processor.

TensorFlow*   1.4.0-rc1 (Intel® optimized)
Python*       2.7.5

Table 4. Software configuration for the Intel® Xeon Phi™ processor.

These software configurations were readily available for the chosen hardware environments, so no source build of TensorFlow was necessary.

TensorFlow Object Detection API

We used the TensorFlow Object Detection API for the experiments on the pedestrian detection problem; it is an open source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models.

Dataset

We chose the Caltech Pedestrian Dataset1 for training and validation. This dataset consisted of approximately 10 hours of 640x480 30-Hz video that was taken from a vehicle driving through regular traffic in an urban environment. To accommodate multiple scenarios, about 250,000 frames (in approximately 137 one-minute-long segments) with a total of 350,000 bounding boxes and 2,300 unique pedestrians were annotated. The dataset consisted of the following elements:

  • A list of bounding boxes for the image. Each bounding box contained:
    • Bounding box coordinates (with the origin in the upper-left corner), defined by four floating point numbers [ymin, xmin, ymax, xmax]. We stored the normalized coordinates (x / width, y / height) in the TFRecord dataset.
    • The class of the object in the bounding box.
  • The dataset was organized into six training sets and five test sets.
  • Each set consisted of 6‒13 one-minute-long .seq files with annotations in .vbb file format.
  • Each RGB image in the dataset was encoded as a jpeg.

Topology

The Inception architecture was built with the intent of improving the use of computing resources inside a deep neural network. The main idea behind Inception is the ability to approximate a sparse structure with spatially repeated dense components, and to use dimension reductions like those in a “network in network” architecture to keep the computational complexity in bounds, but only when required. The computational cost of Inception is also much lower than that of other topologies. More information on Inception is given in this paper2. Figure 1 shows the Inception architecture.

Figure 1. GoogLeNet* Inception* model.3

Inception v2 has a slight structural change in the Inception module. Figure 2 shows the Inception v2 module structure.

Figure 2. Inception* v2 module.3

To accelerate the training process, we applied a transfer learning technique by using the pretrained Inception v2 model from GoogLeNet* on the COCO dataset. The pretrained model had already learned the knowledge on the data and stored that in the form of weights. These weights were directly used as initial weights and readjusted when the model was retrained on the Caltech dataset.

The pretrained model (265 MB) was downloaded from the following link: http://download.tensorflow.org/models/object_detection/ssd_inception_v2_coco_2017_11_17.tar.gz

Methodology

This section covers the steps we followed to train and infer pedestrian detection on Intel architecture. These steps included:

  • Preparing the input
  • Training the model
  • Experimental runs and inference

Preparing the Input

TFRecord Format

To use the pedestrian dataset with the TensorFlow Object Detection API, it must be converted into the TFRecord file format. Reading data from a TFRecord file is much faster in TensorFlow than reading individual image files.

The Caltech dataset consisted of images in the jpg format and their corresponding annotations in XML format.

To convert the dataset into TFRecord format, we did the following:

  1. Images from the .seq files were extracted into an Images folder.
  2. The annotations from the corresponding .vbb files were extracted into an annotations folder.

The following code was used to convert the Caltech dataset into TFRecord format:

DATASET_DIR=./CALTECH/train/
OUTPUT_DIR=./tfrecords
python tf_convert_data.py \
    --dataset_name=caltech \
    --dataset_dir=${DATASET_DIR} \
    --output_name=caltech_tfrecord \
    --output_dir=${OUTPUT_DIR}
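
Internally, a converter like this writes one tf.train.Example per image. The following sketch shows the general shape, using the feature keys conventional in the TensorFlow Object Detection API; the file names and box values are illustrative:

import tensorflow as tf

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def float_list_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

width, height = 640, 480
encoded_jpg = open("example.jpg", "rb").read()  # illustrative image file

# Normalized coordinates (x / width, y / height), as described earlier.
xmins, xmaxs = [120.0 / width], [170.0 / width]
ymins, ymaxs = [200.0 / height], [330.0 / height]

example = tf.train.Example(features=tf.train.Features(feature={
    "image/encoded": bytes_feature(encoded_jpg),
    "image/format": bytes_feature(b"jpeg"),
    "image/width": int64_feature(width),
    "image/height": int64_feature(height),
    "image/object/bbox/xmin": float_list_feature(xmins),
    "image/object/bbox/xmax": float_list_feature(xmaxs),
    "image/object/bbox/ymin": float_list_feature(ymins),
    "image/object/bbox/ymax": float_list_feature(ymaxs),
    "image/object/class/label": int64_feature(1),  # 'person' in the label map
}))

writer = tf.python_io.TFRecordWriter("caltech_train_000.tfrecord")
writer.write(example.SerializeToString())
writer.close()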

Label Map

Each dataset is required to have an associated label map that defines a mapping from string class names to integer class IDs. The label map created for the pedestrian class was as follows:

item {
  id: 1
  name: 'person'
}

Configuring Training Pipeline

The TensorFlow Object Detection API uses protobuf files to configure the training and evaluation process. The configuration file is structured into five sections; the changes made in each section are listed below, followed by an abridged sketch of the resulting file.

  1. In the model section, set num_classes to one (num_classes: 1), since only one class (pedestrian) has to be detected.
  2. In the train_config section, set the checkpoint file path (fine_tune_checkpoint: "~/research/object_detection/models/model/ssd_inception_v2_coco_2017_11_08/model.ckpt").
  3. In the train_input_reader section, set the input_path (input_path: "~/caltech/cal_tfrecord/caltech_train_000.tfrecord") and label_map_path (label_map_path: "~/research/object_detection/data/ped_label_map.pbtxt"). The paths given are examples; set them to match the locations of the files on individual systems.
  4. In the eval_config section, set the number of samples to be evaluated.
  5. In the eval_input_reader section, the label_map_path is set the same as in train_input_reader, and the input_path is set to point to the evaluation dataset (input_path: "~/caltech/cal_tfrecord/caltech_train_001.tfrecord").
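
Putting those changes together, an abridged sketch of the resulting ssd_inception_v2_caltech.config looks as follows. All paths are examples, and the unchanged sections of the file are omitted:

model {
  ssd {
    num_classes: 1
  }
}
train_config {
  fine_tune_checkpoint: "~/research/object_detection/models/model/ssd_inception_v2_coco_2017_11_08/model.ckpt"
}
train_input_reader {
  tf_record_input_reader {
    input_path: "~/caltech/cal_tfrecord/caltech_train_000.tfrecord"
  }
  label_map_path: "~/research/object_detection/data/ped_label_map.pbtxt"
}
eval_config {
  num_examples: 500  # example value; set to the number of samples to evaluate
}
eval_input_reader {
  tf_record_input_reader {
    input_path: "~/caltech/cal_tfrecord/caltech_train_001.tfrecord"
  }
  label_map_path: "~/research/object_detection/data/ped_label_map.pbtxt"
}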

Training the Model

After making the necessary changes listed in the previous section, experimental runs were done to retrain the model on the Intel Xeon Scalable Gold and Intel Xeon Phi processors. We tried different values for the threading and environment options, and found that the following combination worked best:

"OMP_NUM_THREADS = "8" or "6""KMP_BLOCKTIME" = "0""KMP_SETTINGS" = "1""KMP_AFFINITY"= "granularity=fine, verbose, compact, 1, 0"'inter_op' = 1'intra_op' = 8 or 6

Values of both 6 and 8 gave a per-step execution time that varied between 2 and 4 seconds.
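
For reference, the same settings can be expressed in code. This is a sketch using the TensorFlow 1.x API; train.py constructs its own session, so in practice we set the environment variables in the shell, but the ConfigProto fields shown below are where inter_op and intra_op apply:

import os
import tensorflow as tf

# OpenMP / MKL threading controls, matching the values listed above.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=8)
sess = tf.Session(config=config)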

Experimental Runs and Inference

On the Intel Xeon Scalable Gold processor (DevCloud Cluster)

To run the training on the DevCloud cluster, we used the following command to submit the training job:

qsub -l walltime=24:00:00 ped_train.sh

On this cluster, jobs are restricted to a walltime of six hours by default; the maximum value that can be set is 24 hours. As shown in the qsub command, the walltime is set to 24 hours.

The job script ped_train.sh has the following code:

#PBS -l nodes=1:skl
cd $PBS_O_WORKDIR
protoc object_detection/protos/*.proto --python_out=.
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
numactl --interleave=all python ~/research/object_detection/train.py --logtostderr \
--pipeline_config_path=~/research/object_detection/models/model/ssd_inception_v2_caltech.config --train_dir=~/research/object_detection/models/model/ckpt_train

Table 5 lists the details of the run iterations.

Run #   Iteration Count   Batch Size   Loss
1       3,600             24           4.3007
2       8,509             64           2.5600
3       10,441            24           2.7100
4       15,501            24           3.6926
5       21,304            24           1.9600
6       28,406            24           2.4090
7       35,940            24           2.2240

Table 5. Run iteration details.

On an Intel Xeon Scalable Gold processor dedicated cluster

To run on the dedicated cluster, the walltime setting is not required; the rest of the code listed in the DevCloud cluster section above remains the same.

The training was done for 190K+ iterations and the variation of loss is shown in Figure 3.

Figure 3. Variation of loss on an Intel® Xeon® Scalable Gold processor.

After the model was trained, we exported it to a TensorFlow graph proto. A checkpoint typically consists of three files:

  • model.ckpt-${CHECKPOINT_NUMBER}.data-00000-of-00001
  • model.ckpt-${CHECKPOINT_NUMBER}.index
  • model.ckpt-${CHECKPOINT_NUMBER}.meta

After identifying a candidate checkpoint, we used the following script to export the trained model file for inference:

#PBS -l nodes=1:skl
cd $PBS_O_WORKDIR
protoc object_detection/protos/*.proto --python_out=.
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
python object_detection/export_inference_graph.py  --input_type=image_tensor  --pipeline_config_path=~/research/object_detection/models/model/ssd_inception_v2_caltech.config  --trained_checkpoint_prefix=~/research/object_detection/models/model/ckpt_train/model.ckpt-34150  --output_directory=~/research/object_detection/models/model/output_inference_graph

Figure 4 shows the inference output for the model.

Figure 4. Raw6 and inferenced frames on the Intel® Xeon® Gold processor.

On an Intel Xeon Phi processor

To execute on the Intel Xeon Phi processor-based server, we used the following command to submit the training job:

nohup numactl --interleave=all python ~/research/object_detection/train.py --pipeline_config_path=~/research/object_detection/models/model/ssd_inception_v2_caltech.config --train_dir=~/research/object_detection/models/model/ckpt_train > pedestrian_log.txt 2>&1 &

The training was done for 100K+ iterations. The variation of loss is shown in Figure 5.

Figure 5. Variation of loss on the Intel® Xeon Phi™ processor.

The trained model was then exported for inference with the following command:

nohup numactl --interleave=all python ~/research/object_detection/export_inference_graph.py --input_type image_tensor --pipeline_config_path=~/research/object_detection/models/model/ssd_inception_v2_caltech.config --trained_checkpoint_prefix=~/research/object_detection/models/model/ckpt_train/model.ckpt-64343 --output_directory=~/research/object_detection/models/model/output_pedestrian_inference_graph

Figure 6 shows the inference output for the model.

Figure 6. Raw6 and inferenced frames on the Intel® Xeon Phi™ processor.

Results and Improvement

The inference runs on the Intel Xeon Scalable Gold and Intel Xeon Phi processors resulted in a Mean Average Precision (mAP) of close to 30 percent. To boost the accuracy, we looked at other options for treating the training data.

To achieve better detection performance, the ratio of object size to image size needs to be tracked and adjusted.

The Caltech dataset is dominated by images in which the pedestrian objects are roughly 50 to 70 pixels tall, which is less than 15 percent of the image height. The presence of so many small-scale objects could cause the model to underperform on pedestrian detection when trained on this dataset. Treating the dataset could therefore help improve the detection performance of the model.

Data Treatment, Training, and Inference

The following steps were performed on the Caltech data to selectively choose the right data (a sketch of the filtering rule follows the list):

  1. Filter out those images in which any object is smaller than 5 percent of the image size. This forms a new dataset (A).
  2. From the dataset created in step 1, filter out those images in which any object is smaller than 10 percent of the image size (B).
  3. From the dataset created in step 2, filter out those images in which any object is smaller than 15 percent of the image size (C).
  4. Remove the dataset created in step 2 from the one created in step 1; the difference (A-B) is the training set.
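
The filtering rule itself is simple. The sketch below is illustrative only (the caltech collection and its fields are hypothetical), interpreting object size as the ratio of bounding box height to image height:

def keep_image(boxes, image_height, ratio):
    # boxes: list of (ymin, xmin, ymax, xmax) in pixels.
    # Keep an image only if every object is at least `ratio` of the height.
    return all((ymax - ymin) / float(image_height) >= ratio
               for (ymin, xmin, ymax, xmax) in boxes)

dataset_a = [im for im in caltech if keep_image(im.boxes, im.height, 0.05)]
dataset_b = [im for im in dataset_a if keep_image(im.boxes, im.height, 0.10)]
dataset_c = [im for im in dataset_b if keep_image(im.boxes, im.height, 0.15)]
training = [im for im in dataset_a if im not in dataset_b]  # A - B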

All the datasets were converted into TFRecord format for training and inference.

The dataset created in step 1 was used for training, while the ones in steps 2 and 3 were used for testing.

Table 6 summarizes the counts of the datasets created.

Caltech database                60,000
5% Object Size Filtering (A)    6,279
10% Object Size Filtering (B)   1,270
15% Object Size Filtering (C)   270
Training (A-B)                  5,000
Inference1 (B)                  1,270
Inference2 (C)                  270

Table 6. New treated dataset details.

The model was run for 33K iterations using the new training dataset of 5,000 images. Table 7 details the training performed.

Iteration Count   Batch Size   Loss
33,103            24           1.3527

Table 7. New run iteration details.

Inference was run on 1,270 and 270 count datasets. Table 8 shows the results of the inference.

Inference #   Image Count   mAP
1             1,270         46%
2             270           73%

Table 8. Inference results.

Figure 7 shows the inference output for the model.

Figure 7. Raw6 and inferenced frames trained on a treated dataset on the Intel® Xeon® Gold processor.

Comparing the results, the model detection was better on the treated dataset.

Summary

In this paper, we discussed training and inferencing a pedestrian detection model built using the Inception v2 topology with the TensorFlow framework on Intel architecture, applying the transfer learning technique. The weights from the model trained on the COCO dataset were used as initial weights on the Inception v2 topology, and were readjusted when the model was retrained using the Caltech dataset on the Intel Xeon Scalable Gold and Intel Xeon Phi processor-powered environments. The model trained better as the iterations increased on both systems, but the mAP was observed to be low. Selectively filtering out the Caltech images in which pedestrian object sizes were less than 5 percent of the image size, and training the model on this new dataset, improved the mAP. As a next step, more generalization of the model can be achieved by creating custom pedestrian datasets with varied object sizes and training on those datasets to improve the model detection performance.

About the Author(s)

Ajit Kumar Pookalangara, Rajeswari Ponnuru, and Ravi Keron Nidamarty are part of the Intel and Tata Consultancy Services relationship team, working to evangelize artificial intelligence.

References

1. Caltech dataset for training:

http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/

2. Going deeper with convolutions:

https://arxiv.org/pdf/1409.4842v1.pdf

3. Rethinking the Inception Architecture for Computer Vision:

https://arxiv.org/pdf/1512.00567v3.pdf

4. TensorFlow Object Detection API:

https://github.com/tensorflow/models/tree/master/research/object_detection

5. Single Shot MultiBox Detector in Tensorflow:

https://github.com/balancap/SSD-Tensorflow

6. Traffic Video:

https://www.videvo.net/video/traffic-in-downtown-chicago/5069/

Related Resources

SSD: Single Shot MultiBox Detector: https://arxiv.org/abs/1512.02325

TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Build and Install TensorFlow* on Intel® Architecture: https://software.intel.com/en-us/articles/build-and-install-tensorflow-on-intel-architecture

TensorFlow Issue #1907: https://github.com/tensorflow/models/issues/1907

Clapfoot: The Seven-Year Overnight Success Story

The original article is published by Intel Game Dev on VentureBeat*: Clapfoot: The seven-year overnight success story. Get more game dev news and related topics from Intel on VentureBeat.

Indie studios are lucky to survive a few years, let alone seven like Clapfoot has. And 2017 is by far the best year the Toronto developer has ever had.

In August, Clapfoot released a work-in-progress version of Foxhole, an open-ended war game, onto Steam’s Early Access program. Foxhole plunges 140 players into a pseudo-WWII setting, and it’s up to them to figure out how they want to fight their war. The community, which began forming over a year earlier during a free pre-alpha phase, grew exponentially. The Foxhole server on Discord, a popular chat application, blew up to 50,000 people.

“Not only has Foxhole been the most successful game for the studio, it’s also the game we’ve always wanted to work on. … It’s been very stressful, but it’s also just — it’s kind of surreal. We never thought we’d be able to work on a dream project and actually get a business out of it. It’s been pretty fantastic,” said cofounder and programmer Mark Ng.

That success helped the team grow from seven employees to 12, and they’re eager to hire more people. They’re finally making the type of game they always wanted to make, but until recently, didn’t have the resources or technical know-how to do it.

Clapfoot started out creating mobile games. But as the market changed from favoring premium-priced games to free-to-play titles, the team knew it was time to leave.

“At the beginning, when we first started working on mobile games, it was more about gameplay. The focus was on making a high quality game, where the gameplay was the dominant feature that would sell your game,” said Ng.

“Then it shifted over to the free-to-play model, where the important things are the monetization and analytics. It just wasn’t something we were very good at, and we also weren’t very interested in it.”

The developers turned to console and PC games instead. The idea for Foxhole — a massive multiplayer game where battles can last for days or even weeks — was still there, but they didn’t have any experience making multiplayer games. So to hone their skills, Clapfoot made Fortified, a four-player cooperative shooter with a 1950s alien invasion theme, for Xbox One and PC.

After Fortified’s release in 2016, the team felt like they were ready to tackle Foxhole’s more ambitious ideas.

Above: Nothing about Foxhole’s wars is predetermined.

Fostering teamwork

Foxhole is less about shooting and more about meticulous planning. When players enter the battlefield (fighting either for the Colonial or the Warden faction), they have to figure out how they want to contribute to the war effort. That requires coordination and strategizing with their fellow teammates.

“Supply lines, logistics, even mission objectives, they’re all set by the player. As an example, if there are weapons in the game, it’s not because the game spawned them. Somebody made those weapons. Somebody put them on a truck and drove them to the front line and now you have a weapon. The entire supply chain that drives the war effort is also in the hands of players,” said Ng.

Foxhole encourages teamwork through a number of different ways. First, and perhaps most importantly, Clapfoot wanted to make sure that the actions players could take would benefit the group as a whole. When the game was still in pre-alpha, players used to be able to make weapons for themselves by walking up to a building and spending the team’s resources to produce a gun. But it became apparent that some people were abusing the system by deliberately wasting resources.

In response, Clapfoot changed it so that players could no longer make individual weapons for themselves. They can only produce guns by the crate.

“Then, if you make a crate of weapons, suddenly you’re sharing with a bunch of people. You’re working with the group. You can’t just help yourself on your own…We try to avoid those types of mechanics. But it isn’t perfect yet,” Ng said.

This philosophy also ties into Foxhole’s progression system. Unlike other military-themed games, Foxhole doesn’t base your rank on the number of kills you get in a given battle. Instead, players can commend each other for being helpful, and the more commendations you have, the higher your rank will be. Clapfoot still wants to improve how commendations work, but it hopes that the system encourages people to play fairly.

“The game is still in early access, so we’re still trying to figure out ways for players to help each other,” said Ng.

Above: Clapfoot recently added amphibious vehicles to the game.

Making each war meaningful

Foxhole experienced a major side effect due to its growing popularity: individual wars weren’t as meaningful as they used to be. Before launching on Steam, and back when Foxhole had a much smaller community, Clapfoot would run one giant battle per week where everyone participated.

This made each war feel like a big deal, creating lasting memories for the players. But as word-of-mouth spread and more players joined in, Foxhole naturally became more segmented. It became easy to lose track of all the different player-driven narratives with multiple wars going on at the same time.

“We always wanted the wars to have weight. We didn’t want them to feel like, ‘Oh, I’m just fighting 100 wars over and over again.’ We have that problem somewhat right now, and we’re trying to fix it. But our goal was always to make it so each war feels like a unique event,” said Ng.

To address this, developers are exploring a new mode that will combine all the other wars into one match, increasing the max number of players in the process. Clapfoot is also trying to find the best way to award the victors. In the early days, the studio used to put together a war report after every match, which listed the names of all the players on the winning side, as well as a detailed breakdown of various stats (like the number of weapons made, buildings they destroyed, etc.). But that, too, had to stop as the game became bigger.

With the new mode and reward system, the goal is to bring back what helped make Foxhole so special in the first place. And it’s important for Clapfoot to get this right because Foxhole represents the type of emergent multiplayer experience they want to keep making in the years to come.

“We have huge plans for the future. I think that a game like Foxhole is just the first of many in this genre. We want this to be our thing,” said Ng.

 

Kert Gartner Explains the Genius Behind Mixed-Reality VR Game Trailers


The worlds of bicycle motocross (BMX) and video-game marketing collide in the person of Kert Gartner, one of the leading indie game trailer producers. Gartner's first film-making experiences consisted of filming his buddies while they careened around empty lots in Winnipeg, Manitoba, Canada, performing crazy stunts. He quickly learned the craft of staging action, film editing, and adding special effects, and he even sold a few BMX videos around town. This was followed by stints at a local television station, and working on more than 28 movies as a visual effects artist (including Superman Returns and Silent Hill). Eventually, Gartner struck out on his own, and now specializes in producing award winning video-game trailers for indie game developers.

Figure 1. Kert Gartner is a magic man when it comes to producing trailers for VR games.

Lately, he's been involved with creating virtual reality (VR) trailers using a mixed-reality technique. Showing the excitement of a 3D world in a 2D video puts a heavy load on the computing systems he uses. Playing the VR game, capturing the incoming stream, mixing them, and outputting the content aren't simple tasks, or even megatasks; they're extreme megatasks. His current computer is a Hackintosh—a system based on the Intel® Core™ i7-6700K processor, with 64 GB RAM, fast solid-state drives (SSDs), and an Nvidia GeForce* GTX 980 Ti video card. Most days, that's enough power to get his work done smoothly.

"I remember using Adobe After Effects* on a 100-megahertz Apple Mac* way back in the day," Gartner said. "But I could always use ‘faster.' I'm still waiting for beach balls every now and then, but I think that's got more to do with the software at this point." After all, he's pushing around highly compressed and optimized 4K files in H264 format. "The computing power that's necessary to decompress all that stuff, and actually work in real time, is pretty intense. It seems like, as the computers are getting faster, what we've been asked to do with them is getting pushed up as well," he said.

Gamer at Heart

Gartner has always enjoyed gaming, dating back to his younger days playing on Atari and Nintendo* NES systems. His workspace is surrounded by posters of PacMan and StarFox*, and there is "old NES junk all over the place," he admits. He just bought a new Nintendo Switch*, and he is currently enjoying Super Mario Odyssey*. Still, Gartner has always appreciated the work of indie game devs.

His very first trailer was for Canabalt 2p*, a custom two-player version of the indie game released in 2009 by Adam Saltsman that sparked the left-to-right, single-button endless runner genre. He shot the trailer in two days, compositing tiny live-action pixel art characters dashing across the skyline of Winnipeg. From a side hustle, his trailer work grew into a full-time job, and he's busier than ever.

Figure 2. Gartner's first video-game trailer was for Canabalt 2p, an "infinite runner," scrolling from left to right across the Winnipeg skyline.

Great Game Trailers Enhance Marketing

Compelling trailers are crucial to get noticed as you market your game. In her presentation "Marketing Indie Games On a $0 Budget" (given at the 2013 Konsoll conference), Emmy Jonassen listed her five essential pieces of marketing content that developers should have as soon as they can produce them—a compelling game trailer was #1. (FYI, the other four were screenshots, a press release, a landing page, and a development blog.)

Figure 3. Check out kertgartner.com for some of his past work.

Gartner has produced so many trailers that he's lost count—he guesses in the hundreds—and he showcases 30 of them at a time on his website. He sees trailers as one of the most important marketing tools for indies, as he explained in this blog post, "Trailers are usually the first impression that a player's going to get of the game." While many indies may believe that positive press coverage and glowing reviews are the key to getting noticed, Gartner points out that trailers are usually the first thing people see. "If someone is browsing games on Steam*, the trailer is the first thing that auto-plays," he notes. "They're not going to read an article, they're going to watch a video about it."

The Five-Second Rule for Attracting Viewers

Gartner stresses hooking the viewer, and keeping them engaged. He uses catchy music, extreme close-ups, amazing stunts, and other emotional cues, with the first few seconds of a video being "make or break," as he puts it. "I remember seeing some viewer stats on Steam about how fast people bail on a video. It's literally under five seconds. Now you're starting to see a five-second trailer before a movie trailer. I think Blade Runner 2049 did this—five seconds of super, super-fast cuts, just like boom, boom, boom. Cool shot, cool shot, Blade Runner [2049], boom, boom. Then it fades out, and the actual trailer starts."

What drives Gartner crazy are lazy trailers that start slow and go nowhere fast. "I see too many game trailers that spend the first ten seconds on title cards and logos of companies we've never heard of before, and I just ask, ‘Who cares?' Just put that at the end of the trailer!"

Another rule that Gartner preaches to indies who need to get noticed is to keep the total length at around 80–90 seconds. "If I can get it to a minute and 20 seconds, that's ideal. Unless there's a very good reason to go over a minute and 20 seconds, don't do it. People's attention spans are short as it is, so trying to keep their attention for more than 90 seconds is usually a tough sell."

Producing a tight, crisp trailer that informs the viewer is not an easy skill, so many beginners face a key decision early on: budget for a professional, or make it on your own. "A lot of people think, oh I've got Adobe Premiere*. I'll just capture some game playing, slap some music under it, and be done with it. But really, there's a lot more to it."

Sometimes Gartner will script out a plan with a storyboard, but it's often immediately obvious to him what the game trailer must capture. For example, his trailer for Rick and Morty: Virtual Rick-ality starts with a bang, and quickly develops the story, the characters, and the game play. "It has to have a beginning, a middle, and an end," he said. "Then you figure out what pieces you should slot into each of those sections. Pretty soon, you have a plan."

Figure 4. The trailer for Rick and Morty: Virtual Rick-ality packs the whole story into 90 seconds.

It's a Relationship, Not a Project

Trying to distill what a game feels like, and how to play it, can be easier for an outsider to grasp than for someone who has labored on the project for a long time. "It's tougher for the game developer, because they're closer to the project. I can come at it with a different perspective," Gartner said, noting that he has to be sure to break that news gently.

Gartner is basically a would-be partner looking for a story to tell, and once he figures out the story to be told, he can swat away suggestions that don't lend themselves to the finished product. "The thing is, with a good game, it will usually market itself to a certain extent. A good game will have a very good hook to it, and there will be something really easy to latch on to. If a game doesn't have that, sometimes that's a warning sign that there might be other problems with the game itself," he said.

For example, Gartner was inspired by Dropsy, a non-traditional take on the classic point-and-click adventure game. It was unique—the art and style were unlike anything Gartner had ever seen. "It's a game about this clown whose parents died in a circus fire. It's got this crazy, amazing pixel art, really trippy art style to it, and just looking at that it was really obvious that we needed to focus on the crazy story that's happened to this clown, with this interesting art style." Once he established the key points, he was ready to tell the story.

Figure 5. Inspired by the pixelated design of Dropsy, Gartner created a trailer that celebrated its trippy art style.

What happens next is an intense working session that can last for days. Gartner often needs custom programming, such as invisibility cloaks to keep the game play going, or drops to put him at certain points in the game. He might need access to every weapon, and that requires more code. Or, there might be bugs that prevent a certain shot. Having the developer nearby is a huge help.

New Challenge—VR Trailers

Lately, Gartner has been shooting trailers for VR titles, and the challenges have increased exponentially for his system, the software stack, and the post-production work. "The problem with most VR trailers is that they're shot from a first-person perspective, because that's typically the raw output from the head-mounted display. That is literally the worst possible thing you can use, because your head is not a camera," he said. When players are in a VR world, their eyes are darting around, and their head is making constant micro adjustments, but their brain cancels all of that out. The footage of that view is almost unwatchable, it's so shaky and unfocused. "It's a horrible way to convey what it actually feels like to be in that environment," he said. As a result, many VR trailers fall flat.

The answer is to film from the third-person perspective, as though the viewer is looking over the shoulder of the player. Typically, that would require two video feeds—one of the player against a green screen, then composite that over a feed of the actual game play. "When you have the ability to film that person inside of VR from a different perspective, like you're filming them on a virtual sound stage, or a virtual set, all of a sudden, your options just explode. You have new decisions about where you can put the camera, how you can frame this person within that environment, and [you can] now structure a shot that actually explains to the viewer in a concise way what they're actually doing in that world," he said.

Figure 6. The Job Simulator* trailer used mixed reality in a fast-paced, 87-second story.

The Job Simulator VR game from Owlchemy Labs simulates our inevitable, fully automated, cubicle-dwelling robot future. For Gartner's game trailer, the developers had to code a custom camera that was tied to the actual head-mounted display and smoothed its output.

The trailer for Rick and Morty: Virtual Rick-ality presented a number of unique challenges. The game is a collaboration between Owlchemy Labs and Adult Swim Games, and pushes the player to explore Morty's world while solving puzzles. The linear nature of the story, and the vast number of environments and locations involved, were challenging—Gartner needed a way to access the different points in the story quickly so he could get as many unique shots as possible. Owlchemy Labs created a customized build of the game, with its own user interface (UI), that allowed him to jump to all the major points in the game quickly. If, for example, ten or more takes of a single place in the game were required, it would have been maddening to have to play through from the beginning for each shot.

Custom tools were used to turn off the in-game voice, in order to capture the raw gameplay sounds. Owlchemy Labs created a toggle to remove the large whiteboard in the garage for certain shots, added player invincibility, and enabled different facial expressions that matched the action in the scene. "Having the same static mouth throughout the video would have felt emotionally flat and weird," Gartner said.

Gartner used a third-person view almost entirely, and he's probably not going back. "Every single VR trailer that I do going forward, I basically talk to the developer first. If they don't want to shoot in third-person format, I'm not really interested in doing it, because I don't think that it's going to be that good of a result."

Avatars Take Mixed Reality One Step Further

One of his favorite recent projects was the trailer for Space Pirate Trainer*. Made by Dirk Van Welden of Brussels-based I-Illusions, it's a classic arcade-style shooter, putting the player inside the arcade machine, blasting away. Van Welden built a full-body, inverse-kinematic body rig that moved the entire body, based on the position of the head and hands. The result was like having a motion-captured person, or avatar. "It looked so good," Gartner said. "It was like we were filming a virtual movie. It was super, super fun to do."

Figure 7. The trailer for Space Pirate Trainer required custom programming for an in-game avatar.

Had Space Pirate Trainer been shot in first person, it would have been dry and boring, with the viewer seeing the same thing over and over. "We knew we needed to shoot this in third person, so we could show the Space Pirate on the landing platform, and shoot it from a wide variety of angles. We would put the camera way out into the distance, then pull it in close. My friend Vince came up with the idea of putting the camera on the drones so we could capture the drones flying at the player. I talked to Dirk, and he coded it in an evening, and had it in the next build."

Using in-game avatars instead of mixed reality cuts down on expenses, too. "We can experiment for a while trying out weird or interesting camera angles and if we don't get anything out of it, it's no big deal. Then we can come back tomorrow if we just get too tired after the day. It's not costing us any more money to experiment during the shoot, because we don't have a bunch of extra people that we have to bring in for another day on a green screen set."

Gartner is a seasoned professional, and the industry's go-to guy for trailers; his work shows how an experienced eye can make all the difference. However, using the latest Intel® processors and the best software tools can help anyone taking their first steps toward creating mixed-reality trailers. The extreme workloads involved put tremendous pressure on a system, and may require multiple PCs to play, capture, and mix at high quality.

Some of Gartner's work in mixed reality inspired an Intel team to try to replicate the entire process of playing, capturing, compositing, and streaming a session on a single, 18-core Intel® Core™ i9-7980XE Extreme Edition Processor, shown live at Computex Taipei in 2017. The system handled the chores easily, and the audience could see the utilization graphs for each core as the system worked. Gartner is seriously thinking about making the new processor his next upgrade. "I would kill for that," he joked.

Resources

Kert's Blog: http://www.kertgartner.com

Adobe Premiere Pro video editing software

Intel Game Developer program: https://software.intel.com/en-us/gamedev

Intel® Core™ i9-7980XE Extreme Edition Processor Landing Page

Watch VR reunite a group of former N64 developers

The original article is published by Intel Game Dev on VentureBeat*: Watch VR reunite a group of former N64 developers. Get more game dev news and related topics from Intel on VentureBeat.

You might not recognize the name, but chances are good that a lot of the folks at Phaser Lock Interactive worked on a game you love — especially if you were a diehard Nintendo 64 fan.

The small group of coders and animators that founded the Austin-based company first met 20 years ago when they were colleagues at Iguana Entertainment, which was best known for games like Turok: Dinosaur Hunter, Turok 2, and South Park. That experience led to lifelong friendships and a Facebook page that helps them keep in touch.

But they never thought they'd be back more than a decade later to work on cutting-edge virtual reality games.

Cofounder and chief creative officer Michael Daubert is one of those Iguana veterans. He recently gave us a tour of Phaser Lock HQ, where he talked about the studio's past, present, and future. Watch the video below to find out how VR brought the band back together and what it's like to be one of the first indie devs to champion the new technology.

Get Ready: Pricing Your Indie Game

As if making a game wasn't stressful enough, once you've made one, you then have to stick a price on it, and ask people to buy it. How the heck do you know what to charge? And, in a more existential sense, how can you put a price on your own creative outpourings? Well, think of it this way: if you don't get paid, this could be the last game you ever make—and none of us want that. So it's time to bite the bullet, go cap in hand, and finally get what you deserve for all those months of late nights and creative anxiety.

Setting the right price for your indie personal computer (PC) game is a key piece of the marketing puzzle, and, in a constantly shape-shifting market, it's a tricky one to nail. The price of your game is so much more than just the money someone hands over; it guides player expectations of quality and content, and—crucially for you, the developer—it's what stands between making or losing money.

There are commentators outside the industry who say games are too expensive compared to their movie cousins. However, some within the business point to rising development costs, increasing technical demands of hardware, mushrooming player expectations, and the many hours of fun that one game can deliver as evidence that games aren't expensive enough. Recently, PC Gamer* explored the question of underpricing in the indie market, leading to some weighty debate on the issue.

All debate aside, when it comes to pricing your game, stay focused on two things: making sure the player feels like they're getting a fair deal and ensuring you, the developer, get paid. The goal of this marketing guide is to help you tick both boxes by throwing some light on different pricing strategies for indie PC games.

Dodging Anomalies

The indie-game renaissance is showing no signs of abating, although with such buoyant success comes stiff competition. Steam* is at the forefront of bringing many of these games to its PC-playing public, and, according to one analyst, the platform was deluged with more than 6,000 new titles in 2017, almost as many as the total number released between 2005 and 2015. It's usually the indie success stories that make headlines, the leading contender for 2017 being Cuphead*. The self-published game was made by a small, independent team utilizing the free-to-use Unity* game engine, and sold for USD 20. Two weeks after release, Cuphead had sold over a million copies on PC and Xbox* One.

Massively successful indie games such as Cuphead, however, are anomalies, and not representative of most of what happens in the market. When it comes to the price, would Cuphead have sold a million more if it had cost less? Maybe it sold as many as it did because that was the natural limit of the market for it, regardless of the price. We can only guess.

Cuphead developer Chad Moldenhauer pretty much nailed it in StudioMDHR's MIGS 2017 keynote: "The commercial success really just means we get to keep making games for hopefully as long as we live." This is almost certainly what most indie developers want their games to deliver, and so the price needs to be set accordingly.

Pricing Checklist

There's a lot to consider when making pricing decisions. The following checklist covers most of the key areas that you should check off and revisit as you approach the release of your game, with each one explored in more detail below.

Break-even pricing: How much money do I need to earn to be able to keep making games? If you're considering Kickstarter*, there are other factors to take into account.

Market pricing: How are my past, present, and future competitors priced, and what are the pricing limits for comparable games? Also consider downloadable content (DLC) pricing, episodic pricing (if it's relevant to your game), and geographical pricing.

Perceived value pricing: What else do I want to communicate with the price?

Promotional pricing: What kind of discounts and offers should I be thinking about and at what point in the product life cycle?

DLC pricing: How should I price post-launch content?

Break-Even Pricing

It may seem obvious, but the first thing you need to do is work out how much money you need in order to pay yourself and your team a living wage, both for the work you've done up to the launch of your game and to keep going afterward. Even if you have additional funding from an investor or a Kickstarter campaign, you still need to do the math.

The break-even point is the point at which your revenue covers your costs. Costs need to include not only your team's burn rate, but also the percentage of revenue that goes to the retailer (usually 30 percent in the case of Steam and GOG*, for example) and to your publisher, if you have one.

Once you've got that break-even figure, you can start working out how you're going to reach it by multiplying the price by the estimated number of copies you're going to sell. Since you'll never know exactly how many games you're going to sell, try different scenarios and make good use of all the historical data for comparable titles on Steam Spy. UK-based independent developer Ninja Theory* gave us a glimpse of how this process works in the post-mortem video it released for its 2017 self-published title Hellblade: Senua's Sacrifice*.

Screenshot of a sales figure graph
Figure 1. Ninja Theory's expected-versus-actual sales for Hellblade: Senua's Sacrifice*.

The image shown in Figure 1 was taken from the video and shows the projected sales (in blue) of the game compared to the actual sales (in green). Ninja Theory anticipated it would need to sell 500,000 copies in six months to break even, which means it was able to plan its finances in advance, with the aim of being able to stay solvent for six months following launch. In the end, Ninja Theory exceeded its own sales expectations, but, by creating this projection, the company knew exactly what it had to do and could price its game accordingly—in this case at USD 30.
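A quick script makes it easy to run those what-if scenarios yourself. The sketch below is a minimal illustration only; the costs, price, and store cut are invented numbers, not Ninja Theory's actual figures or anyone else's:

```python
import math

def break_even_units(total_costs, price, store_cut=0.30, publisher_cut=0.0):
    """Copies you must sell for net revenue to cover total costs."""
    net_per_unit = price * (1 - store_cut) * (1 - publisher_cut)
    return math.ceil(total_costs / net_per_unit)

# Hypothetical scenario: USD 600,000 in development and living costs,
# a USD 30 price, a 30 percent store cut, and no publisher.
print(break_even_units(600_000, 30.0))   # 28,572 copies

# Try different price points to see how the target moves.
for price in (20, 30, 40):
    print(price, break_even_units(600_000, price))
```

Running a handful of prices through a function like this, against realistic unit estimates from Steam Spy, quickly shows which price points are even plausible for your cost base.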

Kickstarter* Pricing

It may be that you follow in the footsteps of many successful indie teams and launch a Kickstarter to help fund the development of your game. Crowdfunding a game on Kickstarter is something of an art in itself. The pricing process is complicated by the number of different reward tiers, often involving the production and distribution of unique physical and digital items, as well as the game itself.

If you're heading down the Kickstarter path, you need to calculate the true cost of the rewards—not just the goods themselves, but the resources required to fulfill them. Items such as figurines, art books, and vinyl records are expensive to produce and ship. Add everything to your costs—including labor costs—and use that to help set the prices of each reward tier, and work out your overall break-even point.
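As a hedged sketch of that per-tier math, here is one way to check whether each pledge actually covers its true cost. The tiers, fulfillment costs, and the rough 10 percent fee allowance are all invented for illustration, not real campaign figures:

```python
# pledge amount: estimated fulfillment cost (goods + shipping + labor)
tiers = {
    25: 8.00,    # digital game + soundtrack
    60: 38.00,   # + physical art book and shipping
    120: 95.00,  # + figurine
}
FEES = 0.10  # rough allowance for platform and payment-processing fees

for pledge, cost in tiers.items():
    net = pledge * (1 - FEES) - cost
    print(f"USD {pledge} tier nets USD {net:.2f} per backer")
```

If any tier comes out near zero or negative, either raise the pledge, simplify the reward, or accept that the tier is a marketing expense rather than funding.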

Market Pricing

The key to getting the price right is understanding the market you're entering. You need to ask some important questions about how competing games are priced, be they direct competitors in your genre, indirect competitors aimed at a similar market, or share-of-wallet competitors releasing at the same time as you.

Steam Spy and SteamDB give access to historical sales and pricing data for every game on Steam so you can see how they performed. If a certain price worked for a similar game, there's a good chance it will work for you—although it's always a roll of the dice, and, ultimately, nothing replaces having an excellent game to sell in the first place.

This is not to say that you can't do things differently if you want to—just make sure you've done the research so you know exactly why you want to do it another way and what the risks are. Relying on hunches alone is crazy when there's so much data available.

Competitor Pricing

The clearest indicator of how you should price your game should come from your competitors, which you can group as follows:

  • Direct competitors: Games competing in the same or similar game genre.
  • Indirect competitors: Games that share a broad audience target, and have a similar amount of content and level of production values without being in the same genre.
  • Share-of-wallet competitors: Games released in the same period of time as yours, competing for finite financial resources.

Build a representative list of a dozen or so recent historical competitors (from the last year or two) that fit into each of the first two competitor types, and then list the following:

  • Competitor type (direct or indirect)
  • Release date
  • Release price
  • Sales data (for example, from Steam Spy, or their own announcements)
  • Revenue estimate (minus retailer percentage, which is approximately 30 percent)
  • Time period between launch and first discount
  • The sales the game has been in, with prices and percentage discounts

With that data, you will be able to paint a detailed picture of the effects of different pricing strategies on competitors' sales, glean information on best practices and pitfalls, and start to define what your own pricing strategy should look like.
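If a spreadsheet starts to feel unwieldy, the same records are easy to keep in code and re-run as new data arrives. This is just one possible shape for the data, with a single invented example entry rather than real sales figures:

```python
from dataclasses import dataclass

@dataclass
class Competitor:
    """One row of the competitor research list described above."""
    name: str
    kind: str                    # "direct" or "indirect"
    release_price: float
    est_units: int               # e.g., from Steam Spy or announcements
    days_to_first_discount: int

    def est_net_revenue(self, store_cut=0.30):
        # Crude upper bound: assumes every copy sold at full price.
        return self.est_units * self.release_price * (1 - store_cut)

# Invented example entry, not real data.
rival = Competitor("Example Shooter", "direct", 19.99, 120_000, 75)
print(f"~USD {rival.est_net_revenue():,.0f} net")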

Screenshot of a price history graph
Figure 2. Price history for Reikon Games' Ruiner*, released in September 2017 (SteamDB*).

For games coming out around the same time as yours—share-of-wallet competitors—look at the audience they're targeting and estimate the degree of crossover with yours, and then gather as much relevant information as you can about them. Check their pricing, pre-order strategy, media presence (including social media, streams, and YouTube*), and try to gauge the overall level of anticipation compared to your game. This will give you an idea of how you'll fare in the battle for market bandwidth at launch.

What's key is doing your research and not losing your nerve. Price your game at what you believe it's worth based on the data available and on the revenue you need in order to break even and keep going.

Downloadable Content Pricing

DLC is all about extending the life cycle and revenue potential of a game. It has become a regular fixture in the world of PC gaming, not least because it's much easier to do on a PC than a console (first-party certification demands are notoriously slow and complex to navigate). Publishers tend to favor a mix of paid and free DLC over a game's life cycle, so those who don't want to pay more than the shelf price still get a piece of the action.

The Season Pass signs up players for a whole set of forthcoming DLC for a game. The player gets a discount, and the publisher gets guaranteed income and an indication of interest, which helps predict sales of the DLC, especially with the help of historical data. A win-win.

Of course, it's vital to make sure the perceived value of your DLC matches the price you're asking. A couple of skins and a new map might be fodder for free DLC, whereas major new game modes and a new game world sound more like the stuff of paid DLC. Look at the DLC and relative pricing for other games, and if it's your first time avoid big risks—stick with what it looks like the market will bear easily.

Episodic Pricing

Releasing a game in episodes has become an accepted way of delivering game content, and it can be a great way of generating regular revenue over a sustained period of time.

A recent success story is Life is Strange, which was sold in five episodes, at USD 5 per episode. The Season Pass pricing mechanic often used for DLC is also employed for episodic games, and with Life is Strange players could buy all five episodes for USD 20—a discount of 20 percent.

Selling a game experience in episodic form can put constraints on pricing—asking USD 40 each for three or more episodes is likely to be contentious—but it's well-suited to narrative-driven experiences, for example, where relatively lower development budgets allow for accessible pricing per episode.

Screenshot from game, Life is Strange
Figure 3. All five episodes of Life is Strange* were offered for USD/EUR 19.99, a 20 percent discount on the price when bought separately (since then, episode one has been made free).

In addition to generating revenue over a longer period, releasing a game in episodes can be extremely useful for small, resource-strapped development teams. Creating episodes allows them to release a part of the game, generate revenue, and reinvest in ongoing development. Ebb Software is intending to release the first part of Scorn in 2018; it will hopefully be successful enough to fund the development of part two.

Screenshot from game, Scorn
Figure 4. Part one of Ebb Software's first-person horror game Scorn* is entitled DASEIN, and planned for 2018.

Geographical Pricing

Another factor to bear in mind is the (often surprising) price difference between geographical regions. At the time of this writing, on SteamDB, the USD-equivalent local prices for PlayerUnknown's Battlegrounds* range from USD 14.94 in Indonesia to USD 37.41 in the UK, a spread of roughly 2.5x.

Image of list of prices per country
Figure 5. Some of the geographical price variations for PlayerUnknown's Battlegrounds* (SteamDB*).

The availability of lower prices in certain countries encourages some buyers to purchase from another territory (using a VPN for example) or buy keys via resellers such as G2A*. This phenomenon is proving challenging to parts of the games market where margins are narrow; squeezing those margins too hard can mean the difference between success and failure. It is, however, possible to region-lock Steam codes, although the practice is not that widespread.

It's worth building best- and worst-case financial scenarios based on the lowest and highest prices you estimate your game will be sold at. While it will never happen that all your players, worldwide, buy at the cheapest price available, you'll be prepared if some of them do.
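A minimal sketch of that exercise, using the two PlayerUnknown's Battlegrounds prices quoted above; the unit count and store cut are assumptions for illustration:

```python
# Bound net revenue between everyone buying at the lowest regional price
# and everyone buying at the highest.
regional_prices = {"Indonesia": 14.94, "UK": 37.41}  # from the SteamDB example
units, store_cut = 50_000, 0.30                       # assumed figures

low = units * min(regional_prices.values()) * (1 - store_cut)
high = units * max(regional_prices.values()) * (1 - store_cut)
print(f"net revenue range: USD {low:,.0f} to USD {high:,.0f}")
```

The gap between those two numbers is the size of the risk you're pricing in; if the worst case doesn't cover your break-even point, revisit your regional pricing.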

Perceived Value Pricing

When pricing your creation, you need to get inside the head of your customer, and try to figure out what price they think is fair. It's all about perception and context.

The price communicates information about where your game sits in the market and its intrinsic value—the perception of which depends on the context. For example, a USD 60 price immediately suggests a AAA-quality game or blockbuster, for which gamers would expect to receive a significant amount of content (hours-of-play being the simplest yardstick) and high production values. At the other end of the spectrum, a USD 5 or USD 10 indie game may be equally or more creative in its ideas but will have less overall content and lower production values, whether in terms of the graphics or in other areas.

Where it gets really interesting, and more complicated to judge, is in the mid-market, especially in the digital space. Indie games with higher production values may be priced anywhere from USD 20 to USD 40. Shadow Warrior 2 from independent team Flying Wild Hog was released in October 2016 and has sold around 355,000 copies with a standard price of USD 40 (not including discounted sales). In terms of price expectations, Shadow Warrior 2 was consistent with its predecessor and with what players expect to pay for a complex first-person action game with relatively high production values—certainly for an indie game.

Screenshot from game, Shadow Warrior 2
Figure 6. Flying Wild Hog's Shadow Warrior 2*, an indie game with high production values.

There's no doubt that 2017 release Cuphead also has high production values, but its genre is a simpler one—a 2D side-scrolling platformer. Hence its USD 20 price—half the price of Shadow Warrior 2—feels about right, perhaps even generous to the consumer, given the game's impressive artistic realization and commitment to its creative vision. While Cuphead has sold more copies (1.3 million on Steam as of January 2018), in revenue terms, both titles have performed well. The real test is whether they've covered costs and entered profit, while being priced at a level that is acceptable to buyers, according to their perception of the game's value. That's the sweet spot you need to find.

Promotional Pricing

The world of digital game distribution is littered with sales and promotions, from pre-order incentives, to seasonal sales and bundling. You need to watch the markets, study the competition, and work out the right timing and type of promotions to support the success of your game.

Pre-Order Pricing

Many games offer pre-order discounts (typically 10 percent on Steam) or added value, such as exclusive in-game content. The former option can eat into revenue, while the latter risks alienating players who don't want to pre-order. A well-judged pre-order incentive is worth it, however, as it can provide a guide to future sales—especially when combined with historical data from previous titles.

Discounting and Sales

Steam, GOG, Green Man Gaming, and other digital marketplaces have had a profound effect on consumer behavior when it comes to discounts and sales. As with physical retail, day one of release is still the moment when the sales are likely to be highest, but the endless shelf-life of digital products means that the life cycle can last years, and many buyers are happy to wait to bag a bargain.

Games compete with each other not only within their genre, but also for a share of the finite amount of money consumers have to spend at any one time (share-of-wallet). With so many games hitting the market each month, gamers have to make tough choices on how to spend their money. Looking for quiet gaps in which to release a new intellectual property has become something of a sport for many publishers. Crowded release periods can reduce day-one sales and encourage potential buyers to wait for a discount.

In that context, it's important to look beyond day-one sales figures and consider the number of people who have put your game on their wish list for an indication of the game's potential over time.

When it comes to discounting, don't do it too quickly after launch—you risk devaluing your product ("Why is it on sale already? Is it no good?") and annoying your early buyers who picked it up at full price. A couple of months post-release is probably a good guide, but watch the market, competitors, your sales, and make a smart call.

You may find that deep discounts of 75 to 90 percent don't deliver a proportionate increase in sales for the massive slice they take out of your revenue. Generally, reductions on digital stores are more modest, with the majority of games in the Steam 2017 Autumn Sale on offer with discounts of between 20 and 60 percent.
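The arithmetic behind that warning is worth internalizing. This tiny sketch, a simple proportionality check rather than a real demand model, shows how steeply the required unit sales climb as the discount deepens:

```python
# Copies a discounted sale must move just to match the revenue of one
# full-price copy. Real demand curves are messier than this.
for discount in (0.20, 0.40, 0.60, 0.75, 0.90):
    multiplier = 1 / (1 - discount)
    print(f"{discount:.0%} off -> {multiplier:.1f}x the unit sales to stand still")
```

At 75 percent off you need four times the sales just to hold revenue flat, and at 90 percent off, ten times. Unless a sale drives that kind of volume, it is trading revenue for visibility.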

Again, the price of your game communicates more than just the hard cash spent, so always discount thoughtfully, and take your lead from the market and other comparable games.

Bundle Pricing

Humble Bundle has put PC game bundling on the map, offering limited-time bundles and donating a portion of the profits to charity. It's a fascinating model, and one that many publishers use to monetize digital content that would otherwise be gathering virtual dust.

Bundles are also a permanent fixture on Steam and other digital stores, and they're a win-win for both the consumers and the game publishers. The buyer gets a bunch of games for a great price, and the publisher generates revenue from games that may otherwise have stalled by bundling them with more desirable ones. It's a useful strategy for further into the product life cycle once sales have reached a plateau and can breathe new life into a game's sales.

Sales Metrics

The final word: measure everything for the entire duration of your marketing campaign, pre- and post-launch. Log every article, stream, YouTube video, social-media post, and user review, and, after launch, look for correlations with peaks and troughs in the sales data. Compare your sales with your initial predictions, and build up a picture of what works, what doesn't, and how you can do it better. Then do it better.

This guide gives you a good start toward finding a pricing strategy that works for you. As you go through the process of releasing games, gathering data, and gaining experience, you will become your own expert at using price to increase your chances of success in the fascinating, frustrating, challenging, and rewarding indie game market.

Resources

Intel® Developer Zone: https://software.intel.com/en-us
