Abstract
This work describes experiments to train and test the deep learning AlexNet* topology with the Intel® Optimization for TensorFlow* library, using CIFAR-10 classification data on Intel® Xeon Phi™ processor powered machines. The experiments were conducted with different options set at compile time and at run time. From these runs, the training, validation, and testing accuracy numbers were captured for different compiler switches and environment configurations to identify the optimal configuration. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.
Introduction
Many deep learning frameworks running on different processors have evolved in recent years to solve complex problems in image classification, detection, and segmentation. Continued research in this space has helped to optimize these frameworks and the underlying hardware, improving training and inference accuracy as well as speed. Intel has optimized the TensorFlow* library for Intel® Xeon Phi™ processors. The Intel Xeon Phi processor is designed to scale out in a near-linear fashion across cores and nodes to reduce the time needed to train machine learning and deep learning models. During the experiments, various optimization options were tried to train and test the AlexNet* topology with CIFAR-10 images using Intel® optimized TensorFlow on Intel Xeon Phi processors. The optimal combination was identified based on the results.
Document Content
Environment Setup
The following hardware and software environments were used to perform the experiments.
Hardware
Architecture | x86_64 |
CPU op-mode(s) | 32 bit, 64 bit |
Byte order | Little endian |
CPU(s) | 256 |
Online CPU(s) list | 0-255 |
Thread(s) per core | 4 |
Core(s) per socket | 64 |
Socket(s) | 1 |
Non-uniform memory access (NUMA) node(s) | 2 |
Vendor ID | Genuine Intel |
CPU family | 6 |
Model | 87 |
Model name | Intel® Xeon Phi™ processor 7210 @ 1.30 GHz |
Stepping | 1 |
CPU MHz | 1153.648 |
Bogus Million Instructions Per Second (BogoMIPS) | 2593.69 |
L1d cache | 32K |
L1i cache | 32K |
L2 cache | 1024K |
NUMA node0 CPU(s) | 0-255 |
Software
TensorFlow* | 1.3.0 (Intel® optimized) |
Python* | 3.5.3 (Intel distributed) |
GNU Compiler Collection* (GCC) | 6.2.1 |
Virtual environment | Conda* |
Choosing the Optimal Software Configuration
Trial runs were performed to choose the optimal software configuration. For these runs, an Intel optimized TensorFlow library was built from the sources using Bazel* 0.4.5.
TensorFlow versions 1.2 and 1.3 with Python* version 2.7.5 were tried on an AlexNet benchmark script. TensorFlow 1.3 was found to be 16 times faster than TensorFlow 1.2 (refer to Configurations). Further, Python versions 2.7.5 and 3.5.3 were tried with TensorFlow 1.3 on the AlexNet topology, with 10,000 images.
The results of the evaluation are as follows.
TensorFlow* Version | Python* Version | No. of Epochs | Compiler Switches | Accuracy |
1.3 | 2.7.5 | 20 | DEIGEN_USE_VML, config=mkl | 16.10% |
1.3 | 2.7.5 | 20 | mfma, Intel® AVX2, DEIGEN_USE_VML, config=mkl | 16.40% |
1.3 | 3.5.3 (Intel) | 20 | mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl | 51.9% |
From the above results, the software configuration listed in the Software table was finalized with the following compiler switches: mfma, Intel® Advanced Vector Extensions 2 (Intel® AVX2), and the Intel® Math Kernel Library (Intel® MKL) config.
Network Topology and Model Training
This section details the dataset adopted, AlexNet architecture, and training the model in the current work.
Dataset
The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
The dataset was taken from Kaggle* [3]. The following figure shows a sample set of images for each class.
Figure 1: CIFAR-10 sample images.
For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.
AlexNet* Architecture
The AlexNet network is made of five convolution layers, max-pooling layers, dropout layers, and three fully connected layers. The network was designed to be used for classification with 1,000 possible categories.
Figure 2: AlexNet* architecture (credit: MIT [2]).
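To make the layer structure concrete, the following minimal Python sketch traces the spatial size of the feature maps through the five convolution layers and the max-pooling layers. The kernel sizes, strides, padding, channel counts, and the 227 x 227 input are assumptions taken from the commonly cited ImageNet version of AlexNet; the article does not state which hyperparameters were used for the 32 x 32 CIFAR-10 images, so this is an illustration of the topology, not a description of the trained model.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
# ASSUMPTION: values follow the commonly cited ImageNet AlexNet, not
# necessarily the CIFAR-10 variant trained in this article.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]


def trace_shapes(input_size=227, channels=3):
    """Return (layer name, spatial size, channels) after each layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, pad, out_channels in LAYERS:
        size = conv_out(size, kernel, stride, pad)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, channels))
    return shapes


if __name__ == "__main__":
    for name, size, channels in trace_shapes():
        print("{}: {} x {} x {}".format(name, size, size, channels))
```

Under these assumptions the last pooling layer produces a 6 x 6 x 256 volume, which is flattened and fed to the three fully connected layers. With a 32 x 32 input the same hyperparameters shrink the feature map to nothing after the second pooling layer, so a CIFAR-10 run needs either resized images or adjusted layer parameters.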
Model Training
In these experiments, it was decided to train the model from scratch using the CIFAR-10 dataset. The dataset is split into 50,000 images for training and validation and 10,000 images for testing.
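The split described above can be sketched in a few lines of Python. The article does not say how the 50,000 images were divided between training and validation, so the 10% validation fraction, the fixed seed, and the function name `split_indices` are hypothetical choices made only for illustration.

```python
import random


def split_indices(n_total=60000, n_test=10000, val_fraction=0.1, seed=0):
    """Shuffle image indices and split them into train, validation, and test.

    ASSUMPTION: val_fraction and seed are illustrative; the article only
    states 50,000 images for training and validation and 10,000 for testing.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)

    test = indices[:n_test]
    remaining = indices[n_test:]

    n_val = round(len(remaining) * val_fraction)
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices()
    print(len(train), len(val), len(test))
```

Working with index lists rather than image arrays keeps the sketch independent of how the Kaggle files are loaded.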
Experimental Runs
The experiment was conducted in two steps.
Step 1: Multiple compiler switches were used and runs were performed for different batch sizes. The epoch counts considered for these runs were 25 and 100. The aim of this step was to observe the accuracy and throughput for each batch size.
Step 2: The Intel-suggested environment configuration was used on top of the compiler switches set in Step 1. Benchmark scripts were run to observe the throughput and, based on that, AlexNet runs using CIFAR-10 were executed to get the top-1 and top-5 accuracies.
Step 1: With Compiler Switches
The following are the compiler switches that were set during the Bazel build:
mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl
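As a rough illustration of how these switches might be passed to Bazel, the sketch below assembles a build command line in Python. The article lists only the switch names; the exact flag spellings (`--copt=-mfma`, `--copt=-mavx2`, `--copt=-DEIGEN_USE_VML`), the `-c opt` mode, and the pip-package build target are assumptions based on the usual TensorFlow 1.x source build and may differ from the command the authors ran.

```python
import shlex


def bazel_build_command():
    """Assemble a Bazel command line for an MKL-enabled TensorFlow build.

    ASSUMPTION: flag spellings and the build target follow the usual
    TensorFlow 1.x source-build instructions, not the article itself.
    """
    copts = ["-mfma", "-mavx2", "-DEIGEN_USE_VML"]

    cmd = ["bazel", "build", "--config=mkl", "-c", "opt"]
    cmd += ["--copt=" + flag for flag in copts]
    cmd.append("//tensorflow/tools/pip_package:build_pip_package")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(shlex.quote(part) for part in bazel_build_command()))
```

Keeping the command as a list of arguments makes it easy to add or remove a single switch when comparing builds, as was done in the trial runs.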
The runs were performed for different batch sizes and the following results were obtained. For 25 epochs:
Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
64 | 25 | 71.87% | 69.22% | 67.47% |
96 | 25 | 68.50% | 66.63% | 67.16% |
128 | 25 | 65.80% | 64.55% | 64.82% |
256 | 25 | 59.30% | 58.98% | 59.16% |
Figure 3: Training with 25 epochs.
It was observed that while using a larger batch, there is a degradation in the quality of the model as there is a reduction in the stochasticity of the gradient descent. The accuracy fall is steeper when there is an increase in the batch size from 128 to 256. In general, the performance of processors is better if the batch size is a power of 2. Considering this, it was decided to perform runs with a higher epoch count on batch sizes of 64 and 128.
For 100 epochs:
Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
64 | 100 | 94.98% | 72.62% | 72.29% |
128 | 100 | 89.19% | 72.23% | 70.94% |
Figure 4: Training with 100 epochs.
As the epoch count increased, the network showed improvement in accuracy, but significant overfitting of the model was observed. At this stage, it was warranted to consider additional options beyond compiler flags to make better use of the Intel Xeon Phi processor, improve the performance of the model, and reduce the training time.
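One simple way to quantify the overfitting is the gap between training and testing accuracy. The sketch below computes that gap for batch sizes 64 and 128 at 25 and 100 epochs, using only the figures from the two tables above; the dictionary layout and function name are illustrative.

```python
# (training accuracy %, testing accuracy %) from the tables above,
# keyed by (batch size, epochs).
RUNS = {
    (64, 25): (71.87, 67.47),
    (128, 25): (65.80, 64.82),
    (64, 100): (94.98, 72.29),
    (128, 100): (89.19, 70.94),
}


def generalization_gap(train_acc, test_acc):
    """Training accuracy minus testing accuracy, in percentage points."""
    return round(train_acc - test_acc, 2)


if __name__ == "__main__":
    for (batch, epochs), (train_acc, test_acc) in sorted(RUNS.items()):
        gap = generalization_gap(train_acc, test_acc)
        print("batch {:>3}, {:>3} epochs: gap {} points".format(batch, epochs, gap))
```

For batch size 64 the gap grows from 4.4 points at 25 epochs to 22.69 points at 100 epochs, while testing accuracy improves by less than 5 points, which is the overfitting pattern described above.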
Step 2: With Environment Configurations
Retaining the compiler options as-is, different environment parameters suggested by Intel [1] were set in this step. These parameters are as follows:
"OMP_NUM_THREADS" = "136"
"KMP_BLOCKTIME" = "30"
"KMP_SETTINGS" = "1"
"KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0"
'inter_op' = 1
'intra_op' = 136
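A minimal Python sketch of applying these settings is shown below. It assumes that the four OpenMP/KMP values are exported as environment variables before TensorFlow is imported, and that 'inter_op' and 'intra_op' correspond to the `inter_op_parallelism_threads` and `intra_op_parallelism_threads` fields of the TensorFlow 1.x session configuration. The helper name and the removal of the spaces inside the KMP_AFFINITY value are also assumptions.

```python
import os

# Values from the list above. ASSUMPTION: spaces inside KMP_AFFINITY
# are dropped, since the variable is a comma-separated list.
OPENMP_SETTINGS = {
    "OMP_NUM_THREADS": "136",
    "KMP_BLOCKTIME": "30",
    "KMP_SETTINGS": "1",
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
}

# ASSUMPTION: 'inter_op' and 'intra_op' map to these session-config fields.
THREAD_POOLS = {
    "inter_op_parallelism_threads": 1,
    "intra_op_parallelism_threads": 136,
}


def apply_openmp_settings(env=os.environ):
    """Write the OpenMP/KMP settings into the given environment mapping."""
    for name, value in OPENMP_SETTINGS.items():
        env[name] = value
    return env


if __name__ == "__main__":
    apply_openmp_settings()  # call before importing TensorFlow
    for name in OPENMP_SETTINGS:
        print(name, "=", os.environ[name])
```

In a TensorFlow 1.x script, the thread-pool values would then typically be passed as `tf.Session(config=tf.ConfigProto(**THREAD_POOLS))`.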
Using the TensorFlow build from Step 1 (with the same compiler switches) and the above environment parameters, the AlexNet topology was run on the CIFAR-10 dataset for 1,000 epochs to capture the top-5 and top-1 accuracies. The results are as follows:
Sr. No | Top-n Accuracy | Training Accuracy | Testing Accuracy |
1 | Top-5 | 99.74% | 96.98% |
2 | Top-1 | 93.26% | 70.94% |
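For reference, top-n accuracy counts a prediction as correct when the true class is among the n highest-scoring classes. The sketch below is a plain Python illustration of that definition on a small made-up score matrix; it is not the evaluation code used in the experiments, and the toy scores and labels are invented for the example.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy data (invented): 3 samples, 4 classes.
    scores = [
        [0.1, 0.6, 0.2, 0.1],
        [0.5, 0.3, 0.1, 0.1],
        [0.2, 0.3, 0.4, 0.1],
    ]
    labels = [1, 1, 0]
    for k in (1, 2, 3):
        print("top-{} accuracy: {:.2f}".format(k, top_k_accuracy(scores, labels, k)))
```

Because CIFAR-10 has only 10 classes, top-5 accepts half of all possible classes, which is why the top-5 figures are much higher than the top-1 figures.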
The following graph represents the top-1 and top-5 accuracies for training and testing for every 100 epochs:
Figure 5: Training accuracy comparison.
Comparing the top-1 training and testing accuracies, it can be inferred that the network tends to overfit after 500 epochs. A likely reason is that the model keeps training on the same data in every epoch.
Conclusion
The experiments on training the AlexNet topology on Intel Xeon Phi processor powered machines with Intel optimized TensorFlow, using the CIFAR-10 classification dataset, illustrate that performance gains on Intel Xeon Phi processors can be achieved by setting the compiler switches (mfma, Intel AVX2), the configuration option (Intel® Math Kernel Library), and the environment options (as suggested in [1]).
Further, making the Intel Xeon Phi processor numactl-aware improves performance by 1.2x (refer to Configurations). Similar runs can be performed on newly released Intel® Xeon® Gold processor powered machines to experience enhanced performance.
About the Authors
Rajeswari Ponnuru, Ajit Kumar Pookalangara, and Ravi Keron Nidamarty are part of the Intel-Tata Consultancy Services relationship, working on the AI academia evangelization.
Acronyms and Abbreviations
Term/Acronym | Definition |
CIFAR | Canadian Institute for Advanced Research |
CIFAR-10 | Established computer-vision dataset used for object recognition |
GCC | GNU Compiler Collection* |
Configurations
For the performance reference in the Choosing the Optimal Software Configuration section:
Hardware: refer to Hardware under Environment Setup
Software:
Virtual environment 1: Intel optimized TensorFlow 1.2, Python* version 2.7.5
Virtual environment 2: Intel optimized TensorFlow 1.3, Python* version 2.7.5
Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks
For the performance reference in the Conclusion section:
Hardware: refer to Hardware under Environment Setup
Software: Intel optimized TensorFlow 1.3, Python* version 3.5.3
Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks
For more information go to http://www.intel.com/performance.
References
1. TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
2. AlexNet topology diagram: http://vision03.csail.mit.edu/cnn_art/
3. CIFAR-10 dataset taken from: https://www.kaggle.com/c/cifar-10/data
Related Resources
Alexnet details: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
About CIFAR-10 data: https://www.cs.toronto.edu/~kriz/cifar.html