Introduction
This document provides step-by-step instructions for training and testing an image-recognition model on a single-node Intel® Xeon® Scalable processor platform system, using the TensorFlow* framework with the CIFAR-10 dataset. The instructions are beginner-level, and both training and inference happen on the same system. The steps have been verified on Intel Xeon Scalable processors, but should work on any recent Intel Xeon processor-based system. None of the software components used in this document were performance optimized. This document is follow-on documentation to the article Deep Learning Training and Testing on a Single-Node Intel® Xeon® Scalable Processor System Using Intel® Optimized Caffe*.
This document is targeted toward a beginner-level audience that wants to learn how to train and test a deep learning model using the TensorFlow framework on Intel Xeon CPU-based hardware. It assumes that the reader has basic Linux* knowledge and is familiar with the concepts of deep learning training. The instructions can be used as they are, or can serve as the foundation for enhancements and modifications.
There are various ways to install TensorFlow: from binary packages or from GitHub* sources. This document describes one way that was successfully deployed and tested on a single Intel Xeon Scalable processor system running CentOS* 7.3. Other installation methods can be found in [2,18]. The goal of this document is not to give an elaborate description of how to reach state-of-the-art performance; rather, it is to dip a toe into TensorFlow and run a simple training and test pass using the CIFAR-10 dataset on a single-node Intel Xeon Scalable processor system.
This document is divided into six major sections, including the introduction. Section II details the hardware and software bill of materials used to implement and verify the training. Section III covers installing CentOS Linux as the base operating system. Section IV covers installing and deploying TensorFlow using one of the many available methods. Sections V and VI list the steps needed to train and test the model with the CIFAR-10 dataset.
The hardware and software bill of materials used for the verified implementation is listed in Section II. Users can try a different configuration, but the configuration in Section II is recommended. Intel® Parallel Studio XE Cluster Edition is an optional installation for a single-node implementation; it provides most of the basic tools and libraries in one package. Starting with Intel Parallel Studio XE Cluster Edition from the beginning shortens the learning curve for a multi-node implementation of the same training and testing, as this software is instrumental in a multi-node deep learning implementation.
Hardware and Software Bill of Materials
Item | Manufacturer | Model/Version |
---|---|---|
Hardware | ||
Intel® Server Chassis | Intel | R1208WT |
Intel® Server Board | Intel | S2600WT |
(2x) Intel® Xeon® Scalable processor | Intel | Intel Xeon® Gold 6148 processor |
(6x) 32 GB LRDIMM DDR4 | Crucial* | CT32G4LFD4266 |
(1x) Intel® SSD 1.2 TB | Intel | S3520 |
Software | ||
CentOS Linux* Installation DVD | | 7.3.1611 |
Intel® Parallel Studio XE Cluster Edition | | 2017.4 |
TensorFlow* | | setuptools-36.7.2-py2.py3-none-any.whl |
Installing the Linux* Operating System
This section requires the following software component: CentOS-7-x86_64-*1611.iso. The software can be downloaded from the CentOS website.
The DVD ISO was used for implementing and verifying the steps in this document, but the reader can use the Everything ISO or Minimal ISO if preferred.
- Insert the CentOS 7.3.1611 install disc/USB. Boot from the drive and select Install CentOS 7.
- Select Date and Time.
- If necessary, select Installation Destination.
- Select the automatic partitioning option.
- Click Done to return home. Accept all defaults for the partitioning wizard if prompted.
- Select Network and host name.
- Enter “<hostname>” as the hostname.
- Click Apply for the hostname to take effect.
- Select Ethernet enp3s0f3 and click Configure to set up the external interface.
- From the General section, check Automatically connect to this network when it’s available.
- Configure the external interface as necessary. Save and Exit.
- Select the toggle to ON for the interface.
- Click Done to return home.
- Select Software Selection.
- In the box labeled Base Environment on the left side, select Infrastructure server.
- Click Done to return home.
- Wait until the Begin Installation button is available, which may take several minutes. Then click it to continue.
- While waiting for the installation to finish, set the root password.
- Click Reboot when the installation is complete.
- Boot from the primary device.
- Log in as root.
Configure YUM*
If the public network implements a proxy server for Internet access, Yellowdog Updater Modified* (YUM*) must be configured in order to use it.
- Open the /etc/yum.conf file for editing.
- Under the [main] section, append the following line:
proxy=http://<address>:<port>
where <address> is the address of the proxy server and <port> is the HTTP port.
- Save the file and exit.
Disable updates and extras. Certain procedures in this document require packages to be built against the kernel. A future kernel update may break the compatibility of these built packages with the new kernel, so we disable repository updates and extras to provide further longevity to this document.
This document may not be used as is when CentOS updates to the next version. To use this document after such an update, it is necessary to redefine repository paths to point to CentOS 7.3 in the CentOS vault. To disable repository updates and extras:
yum-config-manager --disable updates --disable extras
Install EPEL
Extra Packages for Enterprise Linux (EPEL) provides high-quality add-on software packages for Enterprise Linux distributions. To install the latest EPEL release:
yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
Install GNU* C Compiler
Check whether the GNU Compiler Collection* is installed. It should be part of the development tools install. You can verify the installation by typing:
gcc --version or whereis gcc
Install TensorFlow* Using virtualenv [18]
- Update to the latest distribution of EPEL:
yum -y install epel-release
- To install TensorFlow, you must have the following dependencies installed [10]:
- NumPy*: a numerical processing package that TensorFlow requires.
- Devel*: enables building extensions to Python*.
- Pip*: enables installing and managing certain Python packages.
- Wheel*: enables managing Python compressed packages in the wheel format (.whl).
- Atlas*: Automatically Tuned Linear Algebra Software.
- Libffi*: a library providing a Foreign Function Interface (FFI), which allows code written in one language to call code written in another. It provides a portable, high-level programming interface to various calling conventions [11].
- Install dependencies:
sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel python-numpy
- Install virtualenv.
There are various ways to install TensorFlow [18]. In this document we use virtualenv, a tool to create isolated Python environments [16]:
pip install --upgrade virtualenv
- Create a virtualenv in your target directory:
virtualenv --system-site-packages <targetDirectory>
Example: virtualenv --system-site-packages tensorflow
- Activate your virtualenv [18]:
source ~/<targetDirectory>/bin/activate
Example: source ~/tensorflow/bin/activate
- Upgrade your packages, if needed:
pip install --upgrade numpy scipy wheel cryptography
- Install TensorFlow. The simplest way is to install the latest Python compressed TensorFlow package:
pip install --upgrade tensorflow
Alternatively, install a specific wheel directly:
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
(This older wheel was the one that worked during verification of this document; none of the other versions did.)
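As an aside on the libffi dependency listed above: Python's ctypes module is built on top of libffi, so a quick way to see the Foreign Function Interface in action is to call a C library function directly from Python. This is an illustrative sketch, not part of the installation:

```python
# ctypes is implemented on top of libffi: it constructs C call frames at
# runtime, which is exactly what the libffi dependency provides.
import ctypes
import ctypes.util

# Load the C standard library (name resolution is platform-dependent;
# find_library("c") resolves to libc.so.6 on most Linux systems).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature of abs(), then call it from Python.
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]
result = libc.abs(-42)
print(result)  # 42
```

Python never sees the C declaration of abs(); libffi builds the call according to the platform's calling convention at runtime.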
Train a Convolutional Neural Network (CNN) Using the CIFAR-10 Dataset [3]
- Download the CIFAR-10 training data to the /tmp/ directory:
Download the CIFAR-10 Python version from [4,8]: https://www.cs.toronto.edu/~kriz/cifar.html
- Unzip the tar file in the /tmp/ area, as the Python script (cifar10_train.py) looks for the data in this directory:
tar -zxf <dir>/cifar-10-python.tar.gz
- Change directory to tensorflow:
cd tensorflow
- Make a new directory:
mkdir git_tensorflow
- Change directory to the one created in last step:
cd git_tensorflow
- Clone the TensorFlow repository from GitHub [9]:
git clone https://github.com/tensorflow/tensorflow.git
- If the Models folder is missing from the tensorflow/tensorflow directory, clone the models repository [9] (https://github.com/tensorflow/models.git):
cd tensorflow/tensorflow
git clone https://github.com/tensorflow/models.git
- Upgrade TensorFlow to the latest version, or you might see errors when training your model:
pip install --upgrade tensorflow
- Change directory to the CIFAR-10 directory to get the training and evaluation Python scripts [14]:
cd models/tutorials/image/cifar10
- Before running the training code, review cifar10_train.py; if desired, reduce the number of steps from 100K to 60K, and change the logging frequency from 10 to whatever you prefer.
For this document, tests were run for both 100K and 60K steps, with a batch size of 128 and a logging frequency of 10.
- Now run the training Python script to train your network:
python cifar10_train.py
This will take a few minutes, and the script will print periodic output reporting the loss and per-batch step time.
Testing Script and Dataset Terminology:
In the neural network terminology:
- One epoch = one forward pass and one backward pass of all the training examples.
- Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need. TensorFlow (TF) pushes all of those through one forward pass (in parallel) and follows with a back-propagation on the same set. This is one iteration, or step.
- Number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass equals one forward pass plus one backward pass (we do not count the forward pass and backward pass as two different passes).
- The steps parameter tells TF to run X of these iterations to train the model.
Example: if you have 1000 training examples, and your batch size is 500, then it will take two iterations to complete one epoch.
To learn more about the difference between epochs, batch size, and iterations, read the article in [15].
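The terminology above can be checked with a little arithmetic; here is a minimal sketch in Python, using the numbers from the example:

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of forward+backward passes (iterations) needed to see
    every training example once, i.e. to complete one epoch."""
    return math.ceil(num_examples / batch_size)

# Example from the text: 1000 training examples, batch size 500
print(iterations_per_epoch(1000, 500))  # 2
```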
In the cifar10_train.py script:
- batch_size is set to 128. It is the number of images to process in a batch.
- max_steps is set to 100000. It is the total number of iterations across all epochs. In the GitHub code there is a typo; instead of 100K, the number shows 1000K. Please update it before running.
- The CIFAR-10 binary dataset [4] has 60000 images: 50000 images to train and 10000 images to test. With a batch size of 128, the number of batches needed to train is 50000/128 ≈ 391 batches for one epoch.
- cifar10_train.py uses 256 epochs, so the number of iterations across all epochs is ~391 * 256 ≈ 100K iterations, or steps.
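The step count follows directly from the dataset size, batch size, and epoch count; sketched out:

```python
train_images = 50000   # CIFAR-10 training set size
batch_size = 128       # as set in cifar10_train.py
epochs = 256           # effective epoch count used by cifar10_train.py

# Iterations needed to pass over the training set once
batches_per_epoch = train_images / batch_size

# Total iterations (steps) across all epochs
total_steps = batches_per_epoch * epochs

print(round(batches_per_epoch))  # ~391 iterations per epoch
print(round(total_steps))        # ~100000 total steps
```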
Evaluate the Model
To evaluate how well the trained model performs on a hold-out dataset, we will use the cifar10_eval.py script [8]:
python cifar10_eval.py
When the expected accuracy is reached, you will see a line such as precision @ 1 = 0.862 printed on your screen. The evaluation script can be run while the training script above is still approaching its final step count, or after training has finished.
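precision @ 1 simply measures the fraction of test images whose single top-scoring prediction matches the true label. A minimal illustration in plain Python (the labels and predictions below are made up for demonstration):

```python
def precision_at_1(predicted_labels, true_labels):
    """Fraction of examples whose top-1 prediction equals the true label."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Hypothetical top-1 class predictions for 10 test images
preds = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
truth = [3, 1, 4, 1, 5, 9, 2, 6, 5, 5]  # last image misclassified
print(precision_at_1(preds, truth))  # 0.9
```

cifar10_eval.py computes the same ratio over the 10000-image CIFAR-10 test set.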
Sample Results
The cifar10_train.py script prints loss values and per-batch step times as it runs. The table below summarizes results achieved with the system described in Section II of this document. Please be advised that these numbers are for educational purposes only; no CPU-specific optimizations were done.
System | Step Time (sec/batch) | Accuracy |
---|---|---|
2S Intel® Xeon® Gold processors | ~ 0.105 | 85.8% at 60K steps (~2 hrs) |
2S Intel Xeon Gold processors | ~0.109 | 86.2% at 100K steps (~3 hrs) |
Once you have finished training and testing with the CIFAR-10 dataset, the same models directory contains MNIST* and AlexNet* examples. It could be educational to go into the MNIST and AlexNet directories and try running the Python scripts there to see the results.
References:
1. Install TensorFlow on CentOS 7, https://gist.github.com/thoolihan/28679cd8156744a62f88
2. Installing TensorFlow on Ubuntu*, https://www.tensorflow.org/install/install_linux
3. Install TensorFlow on CentOS 7, http://www.cnblogs.com/ahauzyy/p/4957520.html
4. The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html
5. TensorFlow, MNIST and your own handwritten digits, http://opensourc.es/blog/tensorflow-mnist
6. TensorFlow Tutorial, https://github.com/Hvass-Labs/TensorFlow-Tutorials
7. Tutorial on CNNs in TensorFlow, https://www.tensorflow.org/tutorials/deep_cnn
8. CIFAR-10 details, https://www.tensorflow.org/tutorials/deep_cnn
9. TensorFlow models, https://github.com/tensorflow/models.git
10. Installing TensorFlow from sources, https://www.tensorflow.org/install/install_sources
11. Libffi, https://sourceware.org/libffi/
12. Performance guide for TensorFlow, https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu
13. What is batch size in neural network?, https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
14. Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
15. Epoch vs Batch Size vs Iterations, https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
16. Virtualenv, https://virtualenv.pypa.io/en/stable/
17. CPU optimizations, https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
18. Download and Setup, https://www.tensorflow.org/versions/r0.12/get_started/os_setup