
CIFAR-10 Classification using Intel® Optimization for TensorFlow*


Abstract

This work describes experiments to train and test the AlexNet* deep learning topology with the Intel® Optimization for TensorFlow* library, using the CIFAR-10 classification dataset, on machines powered by Intel® Xeon Phi™ processors. The experiments were conducted with options set at compile time and at run time. From these runs, the training, validation, and testing accuracies were captured for different compiler switches and environment configurations to identify the optimal configuration. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.

Introduction

Many deep learning frameworks running on different processors have evolved in recent years to solve complex problems in image classification, detection, and segmentation. Continued research in this space has helped optimize these frameworks and the underlying hardware to improve training speed, inference accuracy, and overall performance. Intel has optimized the TensorFlow* library for Intel® Xeon Phi™ processors. The Intel Xeon Phi processor is designed to scale out in a near-linear fashion across cores and nodes to reduce the time required to train deep learning models. During the experiments, various optimization options were tried to train and test the AlexNet* topology with CIFAR-10 images using Intel-optimized TensorFlow on Intel Xeon Phi processors, and the optimal combination was identified based on the results.

Document Content

Environment Setup

The following hardware and software environments were used to perform the experiments.

Hardware

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte order: Little endian
CPU(s): 256
Online CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
Non-uniform memory access (NUMA) node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
Stepping: 1
CPU MHz: 1153.648
BogoMIPS (bogus million instructions per second): 2593.69
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-255

Software

TensorFlow*: 1.3.0 (Intel® optimized)
Python*: 3.5.3 (Intel® distribution)
GNU Compiler Collection* (GCC): 6.2.1
Virtual environment: Conda*

Choosing the Optimal Software Configuration

Trial runs were performed to choose the optimal software configuration. For these runs, the Intel-optimized TensorFlow library was built from source using Bazel* 0.4.5.

TensorFlow versions 1.2 and 1.3 were tried with Python* 2.7.5 on an AlexNet benchmark script. TensorFlow 1.3 showed 16 times faster performance than TensorFlow 1.2 (refer to the Configurations section). Further, Python 2.7.5 and 3.5.3 were tried with TensorFlow 1.3 on the AlexNet topology, with 10,000 images.

The results of the evaluation are as follows.

TensorFlow* Version | Python* Version | No. of Epochs | Compiler Switches | Accuracy
1.3 | 2.7.5 | 20 | DEIGEN_USE_VML, config=mkl | 16.10%
1.3 | 2.7.5 | 20 | mfma, Intel® AVX2, DEIGEN_USE_VML, config=mkl | 16.40%
1.3 | 3.5.3 (Intel) | 20 | mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl | 51.9%

From the above results, the software configuration listed in the Software table was finalized, together with the compiler switches mfma and Intel® Advanced Vector Extensions 2 (Intel® AVX2) and the Intel® Math Kernel Library (Intel® MKL) build configuration (config=mkl).

Network Topology and Model Training

This section details the dataset adopted, AlexNet architecture, and training the model in the current work.

Dataset

The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The dataset was taken from Kaggle*3. The following figure shows a sample set of images for each classification.


Figure 1: CIFAR-10 sample images.

For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.
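
To make the data handling concrete, the following is a minimal sketch of loading CIFAR-10 and obtaining the 50,000/10,000 split described above. It assumes the Keras CIFAR-10 loader exposed as tf.keras.datasets in newer TensorFlow releases (the experiments themselves used the Kaggle download, and TensorFlow 1.3 exposes Keras under tf.contrib.keras), so treat it as an illustration rather than the exact data pipeline used.

    import numpy as np
    import tensorflow as tf

    # Assumed loader: tf.keras.datasets ships CIFAR-10 with the standard
    # 50,000 training / 10,000 test split used in this article.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    # Scale pixels to [0, 1] and flatten the label arrays to integer class ids.
    x_train = x_train.astype(np.float32) / 255.0
    x_test = x_test.astype(np.float32) / 255.0
    y_train = y_train.reshape(-1).astype(np.int64)
    y_test = y_test.reshape(-1).astype(np.int64)

    print(x_train.shape)  # (50000, 32, 32, 3) -- training and validation pool
    print(x_test.shape)   # (10000, 32, 32, 3) -- held-out test images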

AlexNet* Architecture

The AlexNet network consists of five convolution layers, max-pooling layers, dropout layers, and three fully connected layers. It was originally designed for classification with 1,000 possible categories.


Figure 2: AlexNet* architecture (credit: MIT2).
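
The article does not list the exact layer parameters used, so the sketch below is an assumed AlexNet-style graph adapted to 32 x 32 CIFAR-10 inputs and 10 output classes, built with the TensorFlow 1.x tf.layers API. The filter counts, kernel sizes, and dropout rate are illustrative choices, not the configuration from the experiments.

    import tensorflow as tf

    def alexnet_cifar10(images, training=False):
        """AlexNet-style network: five convolution layers with interleaved
        max pooling, dropout, and three fully connected layers, with the
        final layer reduced from 1,000 to 10 classes for CIFAR-10."""
        x = tf.layers.conv2d(images, 64, 5, padding='same', activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 3, 2, padding='same')      # 32 -> 16
        x = tf.layers.conv2d(x, 192, 5, padding='same', activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 3, 2, padding='same')      # 16 -> 8
        x = tf.layers.conv2d(x, 384, 3, padding='same', activation=tf.nn.relu)
        x = tf.layers.conv2d(x, 256, 3, padding='same', activation=tf.nn.relu)
        x = tf.layers.conv2d(x, 256, 3, padding='same', activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 3, 2, padding='same')      # 8 -> 4

        x = tf.reshape(x, [-1, 4 * 4 * 256])                      # flatten
        x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
        x = tf.layers.dropout(x, rate=0.5, training=training)
        x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
        x = tf.layers.dropout(x, rate=0.5, training=training)
        return tf.layers.dense(x, 10)                             # class logits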

Model Training

In these experiments, it was decided to train the model from scratch using the CIFAR-10 dataset. The dataset was split into 50,000 images for training and validation and 10,000 images for testing.
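
The training script itself is not reproduced in the article; the following is a minimal sketch of how such a run can be driven in TensorFlow 1.x, reusing the loader and the alexnet_cifar10 graph sketched above. The optimizer, learning rate, and the use of the test set for per-epoch evaluation are assumptions made for illustration.

    import tensorflow as tf

    # Placeholders for a mini-batch of images and integer labels.
    images = tf.placeholder(tf.float32, [None, 32, 32, 3])
    labels = tf.placeholder(tf.int64, [None])
    is_training = tf.placeholder(tf.bool)

    logits = alexnet_cifar10(images, training=is_training)   # sketched above
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)  # assumed hyperparameters
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(tf.argmax(logits, 1), labels), tf.float32))

    batch_size, epochs = 64, 25   # one of the configurations evaluated below
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(epochs):
            for start in range(0, len(x_train), batch_size):
                sl = slice(start, start + batch_size)
                sess.run(train_op, feed_dict={images: x_train[sl],
                                              labels: y_train[sl],
                                              is_training: True})
            acc = sess.run(accuracy, feed_dict={images: x_test,
                                                labels: y_test,
                                                is_training: False})
            print('epoch %d: test accuracy %.2f%%' % (epoch + 1, 100 * acc))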

Experimental Runs

The experiment was conducted in two steps.

In Step 1, multiple compiler switches were set and runs were performed for different batch sizes, with epoch counts of 25 and 100. The aim of this step was to observe the accuracy and throughput for each batch size.

In Step 2, the Intel-suggested environment configuration was applied on top of the compiler switches set in Step 1. Benchmark scripts were run to observe the throughput and, based on that, AlexNet runs using CIFAR-10 were executed to obtain the top-1 and top-5 accuracies.

Step 1: With Compiler Switches

The following are the compiler switches that were set during the Bazel build:

mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl

The runs were performed for different batch sizes and the following results were obtained. For 25 epochs:

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 25 | 71.87% | 69.22% | 67.47%
96 | 25 | 68.50% | 66.63% | 67.16%
128 | 25 | 65.80% | 64.55% | 64.82%
256 | 25 | 59.30% | 58.98% | 59.16%

 


Figure 3: Training with 25 epochs.

It was observed that with a larger batch size the quality of the model degrades, because the stochasticity of the gradient descent is reduced. The drop in accuracy is steeper when the batch size increases from 128 to 256. In general, processors perform better when the batch size is a power of 2. Considering this, it was decided to perform runs with a higher epoch count for batch sizes of 64 and 128.

For 100 epochs:

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 100 | 94.98% | 72.62% | 72.29%
128 | 100 | 89.19% | 72.23% | 70.94%

 


Figure 4: Training with 100 epochs.

As the epoch count increased, the network showed improved accuracy, but significant overfitting of the model was observed. At this stage it became necessary to consider options beyond compiler flags to best utilize the Intel Xeon Phi processor's capabilities, improve the performance of the model, and reduce the training time.

Step 2: With Environment Configurations

Retaining the compiler options as-is, in this step different environment parameters suggested by Intel1 were set. These parameters are as follows (a sketch of applying them follows the list):

OMP_NUM_THREADS = "136"

KMP_BLOCKTIME = "30"

KMP_SETTINGS = "1"

KMP_AFFINITY = "granularity=fine,verbose,compact,1,0"

inter_op = 1

intra_op = 136
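
A minimal sketch of applying these settings from Python is shown below, assuming the standard TensorFlow 1.x session API; the environment variables can equally be exported in the shell before launching the training script.

    import os
    import tensorflow as tf

    # OpenMP / KMP settings suggested by Intel (reference 1).
    os.environ["OMP_NUM_THREADS"] = "136"
    os.environ["KMP_BLOCKTIME"] = "30"
    os.environ["KMP_SETTINGS"] = "1"
    os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

    # The inter_op / intra_op thread pools are passed through the session config.
    config = tf.ConfigProto(inter_op_parallelism_threads=1,
                            intra_op_parallelism_threads=136)
    sess = tf.Session(config=config)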

Using the TensorFlow setup built in Step 1 (with the compiler switches listed there) and setting the above environment parameters, the AlexNet topology was run on the CIFAR-10 dataset for 1,000 epochs to capture the top-1 and top-5 accuracies. The results are as follows:

Sr. No | Top-n Accuracy | Training Accuracy | Testing Accuracy
1 | Top-5 | 99.74% | 96.98%
2 | Top-1 | 93.26% | 70.94%
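
The article does not show how the top-1 and top-5 numbers were computed; a common way in TensorFlow is tf.nn.in_top_k, illustrated in the assumed, self-contained snippet below.

    import tensorflow as tf

    logits = tf.placeholder(tf.float32, [None, 10])   # network outputs for a batch
    labels = tf.placeholder(tf.int64, [None])         # ground-truth class ids

    # in_top_k is True where the true label is among the k largest logits.
    top1_accuracy = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=1), tf.float32))
    top5_accuracy = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=5), tf.float32))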

The following graph represents the top-1 and top-5 accuracies for training and testing for every 100 epochs:


Figure 5: Training accuracy comparison.

Comparing the top-1 training and testing accuracy, it can be inferred that the network tends to overfit after 500 epochs. The likely reason is that the model is training on the same data repeatedly.

Conclusion

The experiments on training the AlexNet topology on Intel Xeon Phi processor powered machines, with Intel-optimized TensorFlow and the CIFAR-10 classification dataset, illustrate that performance gains on Intel Xeon Phi processors can be achieved by setting the compiler switches (mfma, Intel AVX2), the build configuration option (Intel® Math Kernel Library, config=mkl), and the environment options (as suggested1).

Further, making the Intel Xeon Phi processor numactl-aware helps optimize performance by about 1.2x (refer to the Configurations section). Similar runs can be performed on newly released Intel® Xeon® Gold processor powered machines to experience enhanced performance.

About the Authors

Rajeswari Ponnuru, Ajit Kumar Pookalangara, and Ravi Keron Nidamarty are part of the Intel-Tata Consultancy Services relationship, working on AI evangelization in academia.

Acronyms and Abbreviations

Term/Acronym | Definition
CIFAR | Canadian Institute for Advanced Research
CIFAR-10 | Established computer-vision dataset used for object recognition
GCC | GNU Compiler Collection*

Configurations

For performance reference under the Choosing the Optimal Software Configuration section:

    Hardware: refer to Hardware under Environment Setup

    Software:

        Virtual environment 1: Intel Optimized TensorFlow 1.2, Python* 2.7.5

        Virtual environment 2: Intel Optimized TensorFlow 1.3, Python* 2.7.5

    Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For performance reference under the Conclusion section:

    Hardware: refer to Hardware under Environment Setup

    Software: Intel Optimized TensorFlow 1.3, Python* 3.5.3

    Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For more information go to http://www.intel.com/performance.

References

  1. TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
  2. Alexnet topology diagram: http://vision03.csail.mit.edu/cnn_art/
  3. CIFAR-10 dataset taken from: https://www.kaggle.com/c/cifar-10/data

Related Resources

AlexNet details: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

About CIFAR-10 data: https://www.cs.toronto.edu/~kriz/cifar.html

