CIFAR-10 Classification using Intel® Optimization for TensorFlow*

Abstract

This work describes experiments that train and test the AlexNet* deep learning topology with the Intel® Optimization for TensorFlow* library, using CIFAR-10 classification data on machines powered by Intel® Xeon Phi™ processors. The experiments were conducted with options set at compile time and at run time. Training, validation, and testing accuracy were captured for different compiler switches and environment configurations to identify the optimal configuration. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.

Introduction

Many deep learning frameworks running on different processors have evolved in recent years to solve complex problems in image classification, detection, and segmentation. Continued research in this space has helped optimize these frameworks and the hardware they run on, improving training and inference accuracy as well as speed. Intel has optimized the TensorFlow* library for Intel® Xeon Phi™ processors. The Intel Xeon Phi processor is designed to scale out in a near-linear fashion across cores and nodes to reduce the time needed to train machine learning and deep learning models. During the experiments, various optimization options were tried to train and test the AlexNet* topology with CIFAR-10 images using Intel® optimized TensorFlow on Intel Xeon Phi processors. The optimal combination was identified based on the results.

Document Content

Environment Setup

The following hardware and software environments were used to perform the experiments.

Hardware

Architecture	x86_64
CPU op-mode(s)	32 bit, 64 bit
Byte order	Little endian
CPU(s)	256
Online CPU(s) list	0-255
Thread(s) per core	4
Core(s) per socket	64
Socket(s)	1
Non-uniform memory access (NUMA) node(s)	2
Vendor ID	Genuine Intel
CPU family	6
Model	87
Model name	Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
Stepping	1
CPU MHz	1153.648
Bogus Million Instructions Per Second (BogoMIPS)	2593.69
L1d cache	32K
L1i cache	32K
L2 cache	1024K
NUMA node0 CPU(s)	0-255

Software

TensorFlow*	1.3.0 (Intel® optimized)
Python*	3.5.3 (Intel distributed)
GNU Compiler Collection* (GCC)	6.2.1
Virtual environment	Conda*

Choosing the Optimal Software Configuration

Trial runs were performed to choose the optimal software configuration. For these runs, an Intel optimized TensorFlow library was built from the sources using Bazel* 0.4.5.

TensorFlow versions 1.2 and 1.3 with Python* version 2.7.5 were tried on an AlexNet benchmark script. TensorFlow 1.3 was found to be 16 times faster than TensorFlow 1.2 (refer to Configurations). Further, Python 2.7.5 and 3.5.1 versions were tried with TensorFlow 1.3 on the AlexNet topology, with 10,000 images.

The results of the evaluation are as follows.

TensorFlow* Version	Python* Version	No. of Epochs	Compiler Switches	Accuracy
1.3	2.7.5	20	DEIGEN_USE_VML, config=mkl	16.10%
1.3	2.7.5	20	mfma, Intel® AVX2, DEIGEN_USE_VML, config=mkl	16.40%
1.3	3.5.3 (Intel)	20	mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl	51.9%

From the above results, the software configuration listed in the Software table was finalized with the following compiler switches: mfma, Intel® Advanced Vector Extensions 2 (Intel® AVX2), and the Intel® Math Kernel Library (Intel® MKL) config.

Network Topology and Model Training

This section describes the dataset, the AlexNet architecture, and how the model was trained in the current work.

Dataset

The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The dataset was taken from Kaggle*³. The following figure shows a sample set of images for each classification.

Figure 1: CIFAR-10 sample images.

For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.
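The split described above can be sketched in a few lines of Python. The article does not say how the images were selected, so the seeded random shuffle and the helper name `split_indices` are assumptions made for illustration only.

```python
import random

# The ten CIFAR-10 class names listed in the Dataset section.
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]


def split_indices(num_images=60000, num_test=10000, seed=0):
    """Shuffle image indices and split them into training and testing sets.

    Assumption: a seeded random shuffle; the article does not describe
    the selection method that was actually used.
    """
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices()
print(len(CLASSES), len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible across runs, which matters when accuracies from different configurations are compared.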

AlexNet* Architecture

The AlexNet network is made of five convolution layers, max-pooling layers, dropout layers, and three fully connected layers. The network was designed to be used for classification with 1,000 possible categories.

Figure 2: AlexNet* architecture (credit: MIT²).

Model Training

In these experiments, the model was trained from scratch on the CIFAR-10 dataset. The dataset was split into 50,000 images for training and validation and 10,000 images for testing.

Experimental Runs

The experiment was conducted in two steps.

In Step 1, multiple compiler switches were used and runs were performed for different batch sizes. The epoch counts considered for these runs were 25 and 100. The aim of this step was to observe the accuracy and throughput for each batch size.

In Step 2, the environment configuration suggested by Intel was used on top of the compiler switches set in Step 1. Benchmark scripts were run to observe the throughput and, based on that, AlexNet runs using CIFAR-10 were executed to get the top-1 and top-5 accuracies.

Step 1: With Compiler Switches

The following are the compiler switches that were set during the Bazel build:

mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl

The runs were performed for different batch sizes and the following results were obtained. For 25 epochs:

Batch Size	Epochs	Training Accuracy	Validation Accuracy	Testing Accuracy
64	25	71.87%	69.22%	67.47%
96	25	68.50%	66.63%	67.16%
128	25	65.80%	64.55%	64.82%
256	25	59.30%	58.98%	59.16%

Figure 3: Training with 25 epochs.

It was observed that while using a larger batch, there is a degradation in the quality of the model as there is a reduction in the stochasticity of the gradient descent. The accuracy fall is steeper when there is an increase in the batch size from 128 to 256. In general, the performance of processors is better if the batch size is a power of 2. Considering this, it was decided to perform runs with a higher epoch count on batch sizes of 64 and 128.
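A related effect can be made concrete with simple arithmetic: for a fixed epoch count, a larger batch also means fewer weight updates per epoch. The sketch below assumes that all 50,000 training images are used in every epoch (the article does not state the size of the validation split), and the function name `updates_per_epoch` is hypothetical.

```python
import math

# Assumption: all 50,000 training images are used in each epoch.
TRAIN_IMAGES = 50000


def updates_per_epoch(batch_size, num_images=TRAIN_IMAGES):
    """Number of gradient updates in one pass over the training data."""
    return math.ceil(num_images / batch_size)


# Batch sizes used in the 25-epoch runs.
for batch_size in (64, 96, 128, 256):
    print(batch_size, updates_per_epoch(batch_size))
```

Under this assumption, a batch size of 256 performs roughly a quarter of the updates that a batch size of 64 performs in the same number of epochs.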

For 100 epochs:

Batch Size	Epochs	Training Accuracy	Validation Accuracy	Testing Accuracy
64	100	94.98%	72.62%	72.29%
128	100	89.19%	72.23%	70.94%

Figure 4: Training with 100 epochs.

As the epoch count increased, the network showed improved accuracy, but significant overfitting was also observed. At this stage, it was warranted to consider options beyond compiler flags to better utilize the Intel Xeon Phi processor, improve the performance of the model, and reduce the training time.
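One simple way to quantify the overfitting is the gap between training and testing accuracy. The sketch below uses only figures from the two tables above; the helper name `generalization_gap` is an illustrative choice, not something taken from the experiments.

```python
# (batch size, epochs) -> (training accuracy %, testing accuracy %),
# copied from the 25-epoch and 100-epoch tables.
RESULTS = {
    (64, 25): (71.87, 67.47),
    (128, 25): (65.80, 64.82),
    (64, 100): (94.98, 72.29),
    (128, 100): (89.19, 70.94),
}


def generalization_gap(train_acc, test_acc):
    """Training accuracy minus testing accuracy, in percentage points."""
    return round(train_acc - test_acc, 2)


for (batch, epochs), (train_acc, test_acc) in sorted(RESULTS.items()):
    print(batch, epochs, generalization_gap(train_acc, test_acc))
```

For batch size 64, the gap grows from about 4 percentage points at 25 epochs to more than 22 at 100 epochs, which is the overfitting described above.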

Step 2: With Environment Configurations

Retaining the compiler options as-is, in this step different environment parameters, as suggested by Intel¹, were set. These parameters are as follows:

"OMP_NUM_THREADS" = "136"

"KMP_BLOCKTIME" = "30"

"KMP_SETTINGS" = "1"

"KMP_AFFINITY"= "granularity=fine, verbose, compact, 1, 0"

'inter_op' = 1

'intra_op' = 136
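A minimal sketch of how these parameters might be applied from a Python training script follows. The values are the ones listed above, with the KMP_AFFINITY value written in comma-separated form without spaces. The dictionary `SESSION_THREADS` is a hypothetical holder for the two TensorFlow thread settings, and the TensorFlow 1.x calls appear only in comments because the article does not show the actual script.

```python
import os

# Environment parameters listed above. Setting them before TensorFlow is
# imported is the safe choice, so the OpenMP runtime sees them at startup.
os.environ["OMP_NUM_THREADS"] = "136"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# Hypothetical holder for the two TensorFlow session thread settings.
SESSION_THREADS = {"inter_op": 1, "intra_op": 136}

# With TensorFlow 1.x these would typically be passed to the session
# (shown as comments; not executed here):
#   import tensorflow as tf
#   config = tf.ConfigProto(
#       inter_op_parallelism_threads=SESSION_THREADS["inter_op"],
#       intra_op_parallelism_threads=SESSION_THREADS["intra_op"])
#   session = tf.Session(config=config)
```

Keeping the settings at the top of the script makes it easy to vary one parameter at a time when comparing throughput between runs.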

Using the TensorFlow build from Step 1 (with the same compiler switches) and the environment parameters above, the AlexNet topology was run on the CIFAR-10 dataset for 1,000 epochs to capture the top-1 and top-5 accuracies. The results are as follows:

Sr. No	Top-n Accuracy	Training Accuracy	Testing Accuracy
1	Top-5	99.74%	96.98%
2	Top-1	93.26%	70.94%
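For readers unfamiliar with the metric, top-n accuracy counts a prediction as correct when the true label is among the n highest-scoring classes. The sketch below is a plain-Python illustration on toy scores; the function `top_k_accuracy` and the sample data are hypothetical and are not taken from the experiments.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for three samples over four hypothetical classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true label 1: correct at top-1
    [0.5, 0.1, 0.3, 0.1],  # true label 2: correct only from top-2
    [0.4, 0.3, 0.2, 0.1],  # true label 3: correct only at top-4
]
labels = [1, 2, 3]

print(top_k_accuracy(scores, labels, 1))
print(top_k_accuracy(scores, labels, 2))
```

Because the top-5 criterion is looser than top-1, top-5 accuracy is always at least as high as top-1 accuracy, which matches the pattern in the table above.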

The following graph represents the top-1 and top-5 accuracies for training and testing for every 100 epochs:

Figure 5: Training accuracy comparison.

Comparing the top-1 training and testing accuracies, it can be inferred that the network tends to overfit after 500 epochs. One possible reason is that the model keeps training on the same data in every epoch.

Conclusion

The experiments on training the AlexNet topology on Intel Xeon Phi processor powered machines, with Intel optimized TensorFlow and the CIFAR-10 classification dataset, illustrate that performance gains on Intel Xeon Phi processors can be achieved by setting the compiler switches (mfma, Intel AVX2), the configuration option (Intel® Math Kernel Library), and the environment options (as suggested¹).

Further, making the runs on the Intel Xeon Phi processor numactl-aware improves performance by a factor of 1.2 (refer to Configurations). Similar runs can be performed on newly released Intel® Xeon® Gold processor powered machines to experience enhanced performance.

About the Authors

Rajeswari Ponnuru, Ajit Kumar Pookalangara, and Ravi Keron Nidamarty are part of the Intel-Tata Consultancy Services relationship, working on the AI academia evangelization.

Acronyms and Abbreviations

Term/Acronym	Definition
CIFAR	Canadian Institute for Advanced Research
CIFAR-10	Established computer-vision dataset used for object recognition
GCC	GNU Compiler Collection*

Configurations

For performance reference under the Choosing the Optimal Software Configuration section:

Hardware: refer to Hardware under Environment Setup

Software:

Virtual environment 1: Intel optimized TensorFlow 1.2, Python* version 2.7.5

Virtual environment 2: Intel optimized TensorFlow 1.3, Python* version 2.7.5

Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For performance reference under the Conclusion section:

Hardware: refer to Hardware under Environment Setup

Software: Intel optimized TensorFlow 1.3, Python* version 3.5.3

Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For more information go to http://www.intel.com/performance.

References

1. TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
2. AlexNet topology diagram: http://vision03.csail.mit.edu/cnn_art/
3. CIFAR-10 dataset taken from: https://www.kaggle.com/c/cifar-10/data

Related Resources

AlexNet details: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

About CIFAR-10 data: https://www.cs.toronto.edu/~kriz/cifar.html
