Introduction
The Intel® Deep Learning SDK is a free set of tools for data scientists and software developers to develop, train, and deploy deep learning solutions. The SDK encompasses a training tool and a deployment tool that can be used separately or together in a complete deep learning workflow. In this case study, we explore LeNet*, one of the prominent image recognition topologies for handwritten digit recognition, and show how the training tool can be used to visually set up, tune, and train a model on the Modified National Institute of Standards and Technology (MNIST) dataset using Caffe* optimized for Intel® architecture. Data scientists are the intended audience.
Human Visual System and Convolutional Neural Networks
Before we dive into the use of the Intel Deep Learning SDK, it helps to have a basic understanding of how the human visual system works and how it relates to the design of computer neural networks. The neuron is the basic computational unit in the brain. It receives input from dendrites, and when the combination of all input exceeds a certain threshold, it fires an output that triggers the connected neurons. Mathematically, this biological system can be represented as shown below [1]:
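In one common formulation (the exact notation varies by source, so take this as illustrative rather than a direct reproduction of [1]), the neuron computes

y = f\left(\sum_{i} w_i x_i + b\right)

where the x_i are the inputs arriving on the dendrites, the w_i are the synaptic weights, b is a bias, and f is an activation (firing) function that produces a strong output only when the combined input is large enough.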
This is an overly simplistic model. In reality, the human brain processes the input signal through multiple layers within the visual cortex, which handle feature extraction, feature detection, and classification. Feature extraction is handled by cells in the visual cortex that subsample overlapping areas in the visual field (receptive fields), thus acting as filters over the input image. Feature detection is handled in cortical areas 17, 18, and 19, and classification in areas 20 and 21. Take a look at the picture below [2]. The processed information is also fed forward and back-propagated to the previous layers before an image is correctly recognized.
In convolutional neural networks (CNN), the convolutional layers act as feature detectors and the fully connected layers as classifiers. Feature extraction happens during training by passing multiple inputs through the CNN at once (in groups called mini-batches) and adjusting the weights in each iteration, thus aggregating features in each forward pass and fine-tuning them during backward propagation. Now let’s take a look at the LeNet topology, which is a prominent CNN for handwritten-digit recognition.
The LeNet topology
The LeNet-5 architecture as published in [3] is shown below:
The topology has seven layers excluding the input: two sets of convolutional and subsampling layers followed by two fully connected layers and an output classifying layer.
The first convolution layer, C1, has six feature maps of dimension 28×28. A kernel size of 5×5 with random weights and a constant bias is chosen. Across the six feature maps, this amounts to a total of 156 trainable parameters (6 feature maps * (5 * 5 + 1 bias term)). The input is scanned in overlapping areas by moving one pixel at a time (a stride of one), thus forming a total of 122,304 connections in this very first layer. Depending on the complexity of the problem being solved, you can see how quickly the number of neurons in each layer can grow. To reduce the number of computational units, we use subsampling. In layer S2, there are six feature maps of dimension 14×14. This reduction in dimension is obtained by sampling 2×2 pixels in the corresponding feature map in C1, adding the four inputs, multiplying the sum by a trainable coefficient, and adding a trainable bias. Note that the 2×2 regions in the subsampling (pooling) layer are non-overlapping. So in S2, we end up with 12 trainable parameters ((one coefficient + one bias) * 6 feature maps) and 5,880 connections.
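As a quick sanity check of these counts, the short standalone Python calculation below (illustrative only, not part of the SDK or of Caffe) reproduces the parameter and connection totals quoted above:

# Parameter and connection counts for LeNet-5 layers C1 and S2,
# following the arithmetic described in the text.

# C1: 6 feature maps, 5x5 kernel plus one bias per map, 28x28 outputs
c1_params = 6 * (5 * 5 + 1)                  # 156 trainable parameters
c1_connections = 6 * 28 * 28 * (5 * 5 + 1)   # 122,304 connections

# S2: 6 feature maps, one coefficient + one bias per map, 14x14 outputs,
# each output connected to a 2x2 input window plus the bias
s2_params = 6 * (1 + 1)                      # 12 trainable parameters
s2_connections = 6 * 14 * 14 * (2 * 2 + 1)   # 5,880 connections

print(c1_params, c1_connections, s2_params, s2_connections)
# -> 156 122304 12 5880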
Now look at layer C3. There are 16 feature maps that are 10×10 each. The table below explains how we achieved the reduction in number of pixels between S2 and C3 [3]:
The asymmetric choice of pixels from S2 into C3 ensures that different feature maps extract different features as they each get different pixels while keeping the number of connections reasonable. I will keep the explanation of S4 brief. The concept is exactly the same as S2.
Finally, let’s look at the fully connected layers. Since these are classifiers, all of the features extracted in previous layers are used to match the input to the correct output. Remember that in C1, S2, C3, and S4 we chose random weights and biases in the first pass. When the output of the CNN is evaluated against the provided label, the accuracy, as expected, would be very low. Our goal is to increase the accuracy of the model in such a way that for most (if not all) inputs, the output will match the provided label. The problem-solving method we use to accomplish this is gradient descent on a cost function. In the simplest terms, the formula can be shown as below [4]:
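(The equation below is the quadratic cost function as given in [4].)

C(w, b) \equiv \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^2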
C is the cost function, which is a function of the weights w and biases b. Our goal is to minimize the error between the desired output y(x) for each input x and the output a that the network actually produces, averaged over all n training inputs, using the features extracted with weights w and biases b.
Later, we’ll elaborate on how to create each convolution, pooling, and fully connected layer in Caffe and show how the gradient descent solver and other hyper-parameters can be chosen using the Intel Deep Learning SDK.
The MNIST dataset
The MNIST dataset is a repository of 70k grayscale images of handwritten digits [5]. Each image is 28×28 in dimension. The collection was created from two NIST datasets, one of which was collected by Census Bureau employees and another by high-school students. To increase the variation in data, the final MNIST collection uses 30k images from each dataset for training and 5k images from each for testing.
You can obtain the dataset from here. We use the LeNet topology explained above with the MNIST dataset to demonstrate the training tool in the Intel Deep Learning SDK. Before we begin, make sure to organize the data so that the category labels (0, 1, …, 9) are the top-level folders of the directory structure, with the corresponding images inside each folder.
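For illustration, here is a minimal Python sketch of one way to produce that layout from the raw MNIST idx files; it is not part of the SDK, it assumes the idx files have already been downloaded and decompressed into the working directory, and the output folder name is arbitrary:

# Unpack the raw MNIST idx files into one folder per label (0/, 1/, ..., 9/).
import os
import struct
import numpy as np
from PIL import Image

def idx_to_label_folders(image_file, label_file, out_dir):
    with open(image_file, "rb") as f:
        _, num, rows, cols = struct.unpack(">IIII", f.read(16))
        images = np.frombuffer(f.read(), dtype=np.uint8).reshape(num, rows, cols)
    with open(label_file, "rb") as f:
        _, num_labels = struct.unpack(">II", f.read(8))
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    assert num == num_labels
    for i, (img, label) in enumerate(zip(images, labels)):
        label_dir = os.path.join(out_dir, str(label))
        os.makedirs(label_dir, exist_ok=True)
        Image.fromarray(img).save(os.path.join(label_dir, f"{i:05d}.png"))

idx_to_label_folders("train-images-idx3-ubyte", "train-labels-idx1-ubyte", "mnist_upload")

The resulting top-level folder can then be zipped and uploaded in Step 1 below.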
Now let’s dive into training the model using the Intel Deep Learning SDK. If you have not already installed the Intel Deep Learning SDK, you can do so now from here. There are installers for both Windows* and macOS*. Behind the scenes, the installer sets up the SDK on an Ubuntu* 14.04 or higher machine using Docker. The training process runs on the Caffe framework optimized for Intel architecture. Read [7] for assistance. The Intel Deep Learning SDK Training Tool provides a graphical user interface to manipulate parameters and visualize the training process.
Using the Intel® Deep Learning SDK to train the model
One of the main advantages of using the Intel Deep Learning SDK to train a model is its ease of use. As a data scientist, your focus is on preparing training data, using existing topologies where possible, designing new models if required, and training models with automated experiments and advanced visualizations. The training tool provides all of these benefits while also simplifying the installation of popular deep learning frameworks. Below is a snapshot of the training tool:
Now let’s look at the steps required to generate a trained model from an existing topology. To begin, launch the training UI using the IP address and the port of the device on which you have installed the training tool. If a port number is not specified, it defaults to 8080. Enter your username and the password you created during installation and log in to the interface.
Step 1: Upload the dataset
From the Uploads tab on the left, select the dataset zip file containing the labeled folders and associated images as explained above. Choose a folder path, and then click Upload. Once complete, the uploaded dataset path can be obtained from the Completed section.
Step 2: Create a new dataset
Select training/validation percentages and data augmentation
In the Datasets tab, click New Dataset. Name the dataset and choose an already uploaded dataset from the drop-down list. You can choose the percentage of data you want to use for training, validation, and testing.
The effectiveness of the training process lies in the variability of the data in the dataset, so without altering our base dataset, we can augment the inputs to the model by applying some transformations. Some augmentation techniques available are rotate, bidirectional shift, zoom, and mirror. You can learn more about each of these options in [8].
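The training tool applies these transformations for you; purely as an illustration of what each one does to a single image, here is a minimal Pillow sketch (my own example with an assumed file path, not the tool’s implementation):

# Rough illustration of rotate / shift / zoom / mirror on one image (Pillow).
from PIL import Image, ImageOps

img = Image.open("mnist_upload/3/00042.png")   # hypothetical path from the earlier sketch
w, h = img.size

rotated  = img.rotate(15)                                            # rotate by 15 degrees
mirrored = ImageOps.mirror(img)                                      # horizontal mirror
shifted  = img.transform((w, h), Image.AFFINE, (1, 0, 3, 0, 1, -2))  # shift the content by a few pixels
zoomed   = img.crop((2, 2, w - 2, h - 2)).resize((w, h))             # crop the border and scale back up = zoom in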
Process the data
The MNIST dataset we are using has grayscale images that are 28×28 pixels, so we adjust the settings accordingly. If you need to resize your data, use one of the available options; the user guide has more details on each one.
Select the database backend and image format
Create the dataset
Click Create Dataset. Once the process is complete, you can visualize the number of images in each label for both the training and testing datasets.
Step 3: Create a model
Select the dataset
In the Models tab on the left, click New Model, and then name the model (for simplicity, I choose the same name as the dataset, but you can choose a different name).
Select the topology
The training tool currently supports three image recognition topologies: LeNet, AlexNet*, and GoogLeNet*. Since we are training a model for handwritten digit recognition, I will choose LeNet, which is well suited for the job. The base topology definition (the number of channels in the convolution layers, the pooling layers, the fully connected layers, and so on) is obtained from the Intel® Distribution for Caffe* that runs underneath. If you are training a different model, say using color images from CIFAR-100* or ImageNet, you could choose AlexNet or GoogLeNet to train the model.
You also have the provision to create a new custom topology based on LeNet, AlexNet, or GoogLeNet, which I discuss in a later section.
Perform the data transformation
If you need to introduce some randomness during the training process, you can perform data transformation operations without affecting the raw data in your dataset. These steps are covered in detail in the user guide. For now, I will use the default settings.
Select hyper-parameters
Here you choose settings to fine-tune your model. Some of the important parameters include the following:
Epoch: One complete pass of all the data files in the dataset through the network during training.
Batch size: CNN datasets are usually large. To allow for more effective training, the dataset is split into batches of “x” images; a complete pass of one batch through the CNN is called an iteration. While a large batch size may reduce the variance in the parameter updates, it also consumes a lot of memory, so it is important to balance the two (see the sketch after this list).
Base learning rate: In the gradient descent algorithm we discussed earlier, we said that on the first pass, since the weights and biases are chosen at random, the accuracy is low. The base learning rate controls the size of the steps the algorithm takes as it works toward the minimum of the cost function. While a larger learning rate can mean faster convergence, a rate that is too large can drive the model away from convergence altogether; a smaller learning rate, on the other hand, takes tiny steps toward the minimum and converges slowly.
Solver type: The default option is stochastic gradient descent, which takes random samples of the input dataset in successive batches to help the solution converge more quickly. You can choose from the other options described in the user guide.
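To make the relationship between epochs, batch size, and iterations concrete, here is a small sketch; the training-set size and epoch count are illustrative values, not numbers reported by the tool:

import math

n_train    = 54000   # illustrative: number of images in the training split
batch_size = 64      # images per mini-batch (one iteration = one batch)
epochs     = 15      # complete passes over the training data

iters_per_epoch = math.ceil(n_train / batch_size)  # iterations needed to see every image once
max_iter        = epochs * iters_per_epoch         # total iterations for the whole run

print(iters_per_epoch, max_iter)   # -> 844 12660

The max_iter and snapshot values in the generated solver.prototxt shown later in this article appear to come from this kind of calculation.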
Now you are ready to run the model. Once complete, you will see the accuracy and loss at each epoch of execution and final training, validation, and testing accuracies.
At this point, the training is complete. We will need the list of Caffe files generated by the training tool. Click Download, and then save the files. In the next section, we’ll walk through each of these files and connect the concepts we have learned so far.
Understanding the Caffe* model files
The downloaded model.zip has all the necessary files to understand the CNN. This is important in case you want to create a custom topology or debug the model. These files are also used on the deployment platform to validate the model against real-time data. Let’s take a closer look at some of the files in this archive.
The most important file is train_val.prototxt, which contains the architecture of the model [9].
The data layer
Let’s take a look at the description of the data layer.
name: "MNISTLeNet" layer { name: "MNISTFinal" type: "Data" top: "data" top: "label" include { phase: TRAIN } transform_param { mirror: false scale: 0.00390625 } data_param { source: "/workspace/dl-genie/jobs/datasets/ca04ddcd-66b2-422d-8de8-dcc1ad38cfd4/train.txt_LMDB" batch_size: 64 backend: LMDB } }
The code above generates two blobs, as indicated by “top”: one for the data and another for the labels. The “include” section indicates that this data layer is created for the TRAIN phase. Since we need the same structure for validation as well, you will notice another data layer within the train_val.prototxt file that indicates TEST in its “include” section. Next, we scale the pixels so that they all fall in the range 0 to 1 (by multiplying by 1/256 = 0.00390625). Following this, we set the data source to point to the Docker container where the dataset was created. Note that the backend is set to LMDB and the batch size to 64, which is what we indicated in the training tool.
The convolutional layer
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" param { lr_mult: 1 } param { lr_mult: 2 } convolution_param { num_output: 20 kernel_size: 5 stride: 1 weight_filler { type: "xavier" } bias_filler { type: "constant" } } }
From the LeNet topology we discussed above, we know that the first convolution layer takes as input the 28×28 grayscale image. This is indicated by the “bottom:” parameter, which specifies the data blob as the input. The “param” sections indicate the learning rate adjustments for the weights and the bias, which are the learnable parameters in this layer. The learning rate for the weights (lr_mult: 1) is set to the value that we indicated in the hyper-parameters. The bias learning rate is set to twice the weight learning rate; empirically, this is known to provide better convergence rates.
Next, let’s define the number of channels in the convolution layer. This is defined by “num_output”. Note that the number of channels here differs from the LeNet topology we described above; the LeNet topology in the Caffe framework uses a variant of that architecture, but conceptually they are similar. The kernel_size is set to 5×5 pixels (the same as LeNet-5), which scans the 28×28 input image by moving one pixel at a time (as indicated by “stride”). Also recall that the weights and biases are chosen at random on the first iteration and are fine-tuned as training progresses. The weight_filler in this case is set to “xavier”, which samples weights from a uniform distribution [-scale, scale] where scale = sqrt(3/n) and n is the number of inputs. If you are curious to learn more about the other types of fillers available in Caffe, read this file. We then set the bias_filler to a constant value of 0.
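As a rough illustration of the “xavier” and “constant” fillers as described above (a standalone NumPy sketch, not Caffe’s actual implementation):

import numpy as np

n_inputs = 1 * 5 * 5                       # fan-in for conv1: 1 input channel x 5x5 kernel
scale = np.sqrt(3.0 / n_inputs)            # scale = sqrt(3 / n)
weights = np.random.uniform(-scale, scale, size=(20, 1, 5, 5))  # 20 output channels, as in conv1
bias = np.zeros(20)                        # bias_filler "constant" with value 0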
The sub-sampling or pooling layer
layer { name: "pool1" type: "Pooling" bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 2 stride: 2 } }
Defining a sub-sampling layer is easier. It takes as input the previous convolution layer and performs max pooling over 2×2 pixel areas, moving two pixels at a time as indicated by “stride”, so the pooling regions do not overlap. By subsampling, we get feature maps that are half the width and height of the feature maps produced by the first convolutional layer.
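For reference, the standard relationship between input size, kernel size, stride, and output size for convolution and pooling layers (without padding) is

W_{out} = \left\lfloor \frac{W_{in} - k}{s} \right\rfloor + 1

so a 2×2 pooling kernel with a stride of 2 halves the width and height of each feature map, which is the reduction described above.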
The train_val.prototxt file has two more sections listing the second convolutional layer, with 50 channels, and the second pooling layer. We will not explain those here as they are similar to the above.
The fully connected layers
layer { name: "ip1" type: "InnerProduct" bottom: "pool2" top: "ip1" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 500 weight_filler { type: "xavier" } bias_filler { type: "constant" } } } layer { name: "relu1" type: "ReLU" bottom: "ip1" top: "ip1" } layer { name: "ip2" type: "InnerProduct" bottom: "ip1" top: "ip2" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 10 weight_filler { type: "xavier" } bias_filler { type: "constant" } } }
In Caffe, “InnerProduct” refers to a fully connected layer. The only change to note here is the number of outputs, which is set to 500. The Rectified Linear Unit (ReLU) layer applies the element-wise function f(x) = max(0, x). The last fully connected layer outputs 10 values, one per digit class; the output with the strongest activation (say, output 3) indicates that the trained model predicts the input to be a 3.
The loss layer
layer { name: "loss" type: "SoftmaxWithLoss" bottom: "ip2" bottom: "label" top: "loss" }
As mentioned previously, in CNNs, information about the accuracy of the model is back-propagated so that the weights and biases can be adjusted for better feature extraction and prediction. The loss layer is responsible for taking as input the label blob and the current prediction, computing the loss, and propagating that information back through the network during back-propagation.
You can also include certain rules within layer definitions. For example, to compute a top-5 accuracy layer only during the test phase, you could include the rule as shown below:
layer {
  name: "accuracy_top5"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy_top5"
  include {
    phase: TEST
  }
  accuracy_param {
    top_k: 5
  }
}
Let’s look at the solver.prototxt file. This file includes all of the hyper-parameters that we defined using the training tool interface. You can experiment by changing one or more of these parameters to see how the training changes.
# The train/test net protocol buffer definition
net: "/workspace/dl-genie/jobs/models/ce7c60e9-be99-47bf-b166-7f9036cde0c8/train_val.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 5.0E-4
# The learning rate policy
lr_policy: "inv"
gamma: 1.0E-4
power: 0.75
# Display every 100 iterations
display: 31
# The maximum number of iterations
max_iter: 12645
# snapshot intermediate results
snapshot: 843
snapshot_prefix: "/workspace/dl-genie/jobs/models/ce7c60e9-be99-47bf-b166-7f9036cde0c8/snapshot"
# solver mode: CPU or GPU
solver_mode: CPU
type: "SGD"
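One detail worth noting in this file is the learning rate policy. With lr_policy set to "inv", Caffe computes the effective learning rate at each iteration as base_lr * (1 + gamma * iter)^(-power). Here is a small Python sketch of how that decay plays out with the values above:

# Effective learning rate under Caffe's "inv" policy, using the solver values above.
base_lr, gamma, power = 0.01, 1.0e-4, 0.75

def inv_lr(iteration):
    return base_lr * (1.0 + gamma * iteration) ** (-power)

for it in (0, 1000, 5000, 12645):
    print(it, inv_lr(it))   # the rate starts at 0.01 and decays gradually over the run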
Note that the Intel Deep Learning SDK by default runs the training on the CPU without having to configure Caffe manually. If you want to run the training on Caffe optimized for Intel architecture outside of the SDK, please follow the instructions in this article.
Now that you understand how to interpret the Caffe model files, let’s look at how you can customize the topology using the training tool interface.
Customizing topologies using the Intel® Deep Learning SDK
Let’s revisit the process of creating a new model. After creating a new model name and selecting the dataset, go to the Topology tab and select one of the existing topologies. In this case, I will select the LeNet topology. Click edit as shown below:
A text box prepopulated with the train_val.prototxt file for the chosen topology is displayed:
You can now change the parameters for the convolutional, pooling, and fully connected layers. Type a name for this new topology, and then save it. Once saved, it appears in the list of custom topologies available for you to choose from.
You can now proceed with the rest of the training process as explained above. Once complete, the model files downloaded from the training tool can be used on your deployment platform to make predictions in real time.
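As a sketch of what that deployment step can look like when using pycaffe directly, the snippet below classifies a single 28×28 grayscale digit image. The file names, the deploy-style prototxt, and the output blob name "prob" are assumptions to adapt to the files you actually downloaded:

# Minimal sketch: load the trained LeNet weights and classify one 28x28 grayscale image.
import caffe
import numpy as np
from PIL import Image

caffe.set_mode_cpu()   # inference on the CPU, matching the training setup

# Hypothetical file names standing in for the downloaded model files.
net = caffe.Net("deploy.prototxt", "snapshot_iter_12645.caffemodel", caffe.TEST)

img = np.array(Image.open("digit.png").convert("L"), dtype=np.float32)
img *= 0.00390625                       # same 1/256 scaling used in the data layer
net.blobs["data"].reshape(1, 1, 28, 28)
net.blobs["data"].data[...] = img       # broadcast into shape (1, 1, 28, 28)

output = net.forward()
print("Predicted digit:", output["prob"].argmax())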
References and Resources
[1] Using Convolutional Neural Networks for Image Recognition
[2] A Neural Network Architecture for General Image Recognition
[3] Gradient-Based Learning Applied to Document Recognition
[4] Using neural nets to recognize handwritten digits
[7] Intel® Deep Learning SDK – Training Tool Installation Guide
[8] Intel® Deep Learning SDK – Training Tool User Guide
[9] Training LeNet on MNIST with Caffe
[10] Training and deploying deep learning networks with Caffe* optimized for Intel® Architecture
Notice
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
This sample source code is released under the Intel Sample Source Code License Agreement.