
Using TensorFlow* for Deep Learning Training and Testing


Introduction

In this tutorial, you will learn how to train and test a single-node Intel® Xeon® Scalable processor platform system using the TensorFlow* framework with the CIFAR-10 image recognition dataset. Use these step-by-step instructions as-is, or as the foundation for enhancements and/or modifications.

Prerequisites

Hardware: Steps have been verified on Intel® Xeon® Scalable processors, but should work on any recent Intel® Xeon® processor-based system. None of the software used in this document was performance optimized.
Software: Basic Linux* knowledge and familiarity with the concepts of deep learning training.
  

Install TensorFlow using binary packages or from GitHub* sources. This document describes one way to successfully deploy and test TensorFlow on a single Intel Xeon Scalable processor system running CentOS* 7.3; other installation methods can be found in references [2] and [18]. This document is not meant to give an elaborate description of how to reach state-of-the-art performance; rather, it introduces TensorFlow and runs a simple train and test using the CIFAR-10 dataset on a single-node Intel Xeon Scalable processor system.

Hardware and Software Bill of Materials

The hardware and software bill of materials used for the verified implementation recommended here is detailed in the table below. Intel® Parallel Studio XE Cluster Edition is an optional installation for a single-node implementation, providing most of the basic tools and libraries in one package. Starting with Intel Parallel Studio XE Cluster Edition shortens the learning curve for a multi-node implementation of the same training and testing, as this software is instrumental in a multi-node deep learning implementation.

Item | Manufacturer | Model/Version
Hardware
Intel® Server Chassis | Intel | R1208WT
Intel® Server Board | Intel | S2600WT
2 - Intel® Xeon® Scalable processor | Intel | Intel® Xeon® Gold 6148 processor
6 - 32 GB LRDIMM DDR4 | Crucial* | CT32G4LFD4266
1 - Intel® SSD 1.2 TB | Intel | S3520
Software
CentOS* Linux* Installation DVD | CentOS | 7.3.1611
Intel® Parallel Studio XE Cluster Edition | Intel | 2017.4
TensorFlow* | — | setuptools-36.7.2-py2.py3-none-any.whl

Install the Linux* Operating System

This section requires CentOS-7-x86_64-*1611.iso. This software component can be downloaded from the CentOS website.

The DVD ISO was used to implement and verify the steps in this document; you can also use the Everything ISO or the Minimal ISO.

Step 1. Install Linux

1. Insert the CentOS 7.3 1611 install disc/USB. Boot from the drive and select Install CentOS 7.

2. Select Date and Time.

3. If necessary, select Installation Destination.

a. Select the automatic partitioning option.

b. Click Done to return home. Accept all defaults for the partitioning wizard, if prompted.

4. Select Network and host name.

a. Enter "<hostname>" as the hostname.

i. Click Apply for the hostname to take effect.

b. Select Ethernet enp3s0f3 and click Configure to set up the external interface.

i. From the General section, check Automatically connect to this network when it’s available.

ii. Configure the external interface as necessary. Save and Exit.

c. Select the toggle to ON for the interface.

5. Select Software Selection. In the box labeled Base Environment on the left side, select Infrastructure server.

a. Click Done to return home.

b. Wait until the Begin Installation button is available, which may take several minutes. Then click it to continue.

6. While waiting for the installation to finish, set the root password.

7. Click Reboot when the installation is complete.

8. Boot from the primary device and log in as root.

Step 2. Configure YUM*

If the public network implements a proxy server for Internet access, Yellowdog Updater Modified* (YUM*) must be configured in order to use it.

  1. Open the /etc/yum.conf file for editing.
  2. In the [main] section, append the following line:
    proxy=http://<address>:<port>
    where <address> is the address of the proxy server and <port> is the HTTP port.
  3. Save the file and Exit.
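
For example, with a hypothetical proxy reachable at proxy.example.com on port 8080 (placeholders for illustration only), the [main] section of /etc/yum.conf would end up looking something like this:

[main]
# ... existing settings left unchanged ...
proxy=http://proxy.example.com:8080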

Disable updates and extras. Certain procedures in this document require packages to be built against the kernel. A future kernel update may break the compatibility of these built packages with the new kernel, so we recommend disabling the updates and extras repositories to give this document further longevity.

This document may not work as-is once CentOS updates to the next version. To use it after such an update, redefine the repository paths to point to CentOS 7.3 in the CentOS vault. To disable the updates and extras repositories, run:

yum-config-manager --disable updates --disable extras
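
Note that yum-config-manager is provided by the yum-utils package; if the command is not found, install that package first:

yum -y install yum-utils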

Step 3. Install EPEL

Extra Packages for Enterprise Linux (EPEL) provides free, high-quality add-on software packages for Enterprise Linux distributions. To install the latest EPEL release for the packages required here, run:

yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Step 4. Install GNU* C Compiler

Check whether the GNU Compiler Collection* is installed. It should be part of the development tools install. Verify the installation by typing:

gcc --version
or:
whereis gcc
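
If gcc is not present, it can be installed from the standard repositories (it is also pulled in by the dependency install in Step 5):

yum -y install gcc gcc-c++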

Step 5. Install TensorFlow*

Using virtualenv [18], follow these steps to install TensorFlow:

1. Update to the latest distribution of EPEL:

yum -y install epel-release

2. To install TensorFlow, the following dependencies must be installed [10]:

  1. NumPy*: a numerical processing package that TensorFlow requires
  2. Devel*: enables adding extensions to Python*
  3. Pip*: enables installing and managing certain Python packages
  4. Wheel*: enables managing Python compressed packages in the wheel format (.whl)
  5. Atlas*: Automatically Tuned Linear Algebra Software
  6. Libffi*: a library that provides a Foreign Function Interface (FFI), allowing code written in one language to call code written in another. It provides a portable, high-level programming interface to various calling conventions [11]

3. Install dependencies:

sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel python-numpy

4. Install virtualenv
There are various ways to install TensorFlow [18]. This document uses virtualenv, a tool to create isolated Python environments [16].


pip install --upgrade virtualenv

5. Create a virtualenv in your target directory:


virtualenv --system-site-packages <targetDirectory>

Example: virtualenv --system-site-packages tensorflow

6. Activate your virtualenv [18]:


source <targetDirectory>/bin/activate

Example: source ~/tensorflow/bin/activate
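
When you are finished working in the environment, you can leave it with the deactivate command that virtualenv places on your path:

deactivate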

7. Upgrade your packages, if needed:


pip install --upgrade numpy scipy wheel cryptography

8. Install the latest TensorFlow package (a compressed Python wheel):


pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl

OR:


pip install --upgrade tensorflow
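
To confirm that TensorFlow is installed and importable inside the virtualenv, you can run a quick check; this is a minimal sketch using the TensorFlow 1.x session API provided by the packages above:

import tensorflow as tf

# Build and run a trivial graph to confirm the installation works.
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))  # should print: Hello, TensorFlow!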


Step 6. Train a Convolutional Neural Network (CNN)

1. Download the CIFAR-10 [3] training dataset into the /tmp/ directory:
Download the CIFAR-10 Python version from [4, 8]: https://www.cs.toronto.edu/~kriz/cifar.html
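
For example, with wget (assuming the cifar-10-python.tar.gz file name currently linked from that page):

wget -P /tmp https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz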

2. Unzip the tar file into the /tmp/ area, as the Python script (cifar10_train.py) looks for data in this directory:


tar -zxf <dir>/cifar-10-python.tar.gz

3. Change directory to TensorFlow:


cd tensorflow

4. Make a new directory:


mkdir git_tensorflow

5. Change directory to the one created in the last step:


cd git_tensorflow

6. Download a clone of the TensorFlow repository from GitHub [9]:

git clone https://github.com/tensorflow/tensorflow.git

7. If the Models folder is missing from the tensorflow/tensorflow directory, clone the models repository from GitHub [9] (https://github.com/tensorflow/models.git):


cd tensorflow/tensorflow

git clone https://github.com/tensorflow/models.git

8. Upgrade TensorFlow to the latest version; otherwise, errors could occur when training the model:


pip install --upgrade tensorflow

9. Change directory to the CIFAR-10 directory to get the training and evaluation Python scripts [14]:


cd models/tutorials/image/cifar10

10. Before running the training code, check the cifar10_train.py code and, if needed, change the number of steps from 100K to 60K, and the logging frequency from 10 to whatever you prefer.

For this document, tests were done for both 100K steps and 60K steps, with a batch size of 128 and a logging frequency of 10. The relevant flag definitions are sketched below.
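
In the version of cifar10_train.py available when this document was written, these values are defined as TensorFlow flags near the top of the script; the exact wording may differ in the current repository, but they look approximately like this:

tf.app.flags.DEFINE_integer('max_steps', 1000000,
                            """Number of batches to run.""")  # change to 60000 or 100000
tf.app.flags.DEFINE_integer('log_frequency', 10,
                            """How often to log results to the console.""")

Because these are TensorFlow flags, they can usually also be overridden on the command line without editing the file, for example: python cifar10_train.py --max_steps=60000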


11. Run the training Python script to train your network:


python cifar10_train.py

This will take a few minutes, and you will see output similar to the screenshot below:

(Screenshot: cifar10_train.py training output)

Testing script and dataset terminology

In neural network terminology:

  • One epoch = one forward pass and one backward pass of all the training examples.
  • Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space required. TensorFlow pushes it all through one forward pass (in parallel) and follows with a back-propagation on the same set. This is one iteration, or step.
  • Number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass equals one forward pass plus one backward pass (do not count the forward pass and backward pass as two different passes).
  • The steps parameter tells TensorFlow to run X of these iterations to train the model.

Example: given 1,000 training examples and a batch size of 500, it will take two iterations to complete one epoch.

To learn more about the difference between epochs, batch size, and iterations, read the article in [15].

In the cifar10_train.py script:

  • Batch size is set to 128. It represents the number of images to process in a batch.
  • Max step is set to 100,000. It is the number of iterations for all epochs.

    NOTE: The GitHub code has a typo; instead of 100K, the number shows 1000K. Please update before running.

  • The CIFAR-10 binary dataset in [4] has 60,000 images: 50,000 images to train and 10,000 images to test. The batch size is 128, so the number of batches needed to train for one epoch is 50,000/128 ≈ 391.
  • The cifar10_train.py script uses 256 epochs, so the number of iterations for all the epochs is approximately 391 x 256 ≈ 100K iterations, or steps; the sketch below works through this arithmetic.
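
A quick Python sketch of that arithmetic (the variable names here are just for illustration):

import math

train_images = 50000   # CIFAR-10 training images
batch_size = 128       # batch size used by cifar10_train.py
epochs = 256

# Number of iterations (steps) needed to cover one epoch, rounding up.
batches_per_epoch = int(math.ceil(train_images / float(batch_size)))  # 391
total_steps = batches_per_epoch * epochs                              # 100,096, i.e. ~100K
print(batches_per_epoch, total_steps)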

Step 7. Evaluate the model

Use the cifar10_eval.py script [8] to evaluate how well the trained model performs on a hold-out dataset:

python cifar10_eval.py

Once the model reaches the expected accuracy, you should see precision @ 1 = 0.862 on your screen when running the above command. The evaluation script can be run while the training script is still running toward the end of its steps, or after the training script has finished.

(Screenshot: cifar10_eval.py output showing precision @ 1)
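
In the version of cifar10_eval.py available at the time of writing, the script also defines a run_once flag. By default it re-evaluates the latest checkpoint periodically, so if you only want a single evaluation pass, something like the following should work:

python cifar10_eval.py --run_once=True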

Sample results

The cifar10_train.py script shows the following results:

(Screenshot: sample results from the training run)

A result similar to the one shown below was achieved with the system described in the Hardware and Software Bill of Materials section of this document. Note that these numbers are only for educational purposes and no specific CPU optimizations were performed.

System | Step Time (sec/batch) | Accuracy
2 - Intel® Xeon® Gold processors | ~0.105 | 85.8% at 60K steps (~2 hours)
2 - Intel® Xeon® Gold processors | ~0.109 | 86.2% at 100K steps (~3 hours)

When you finish training and testing with the CIFAR-10 dataset, the same models directory also contains tutorials for the MNIST* and AlexNet* image benchmarks. For additional learning, go into the MNIST and AlexNet directories and try running the Python scripts to see the results.
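
At the time of writing, those tutorials lived alongside the CIFAR-10 one in the cloned models repository; assuming that layout has not changed, the entry points (relative to the directory where you cloned models) are roughly:

python models/tutorials/image/mnist/convolutional.py
python models/tutorials/image/alexnet/alexnet_benchmark.py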

References

1. Thoolihan, "Install TensorFlow on CentOS7," n.d., accessed 6/25/18.

2. "Installing TensorFlow on Ubuntu*," n.d., accessed 6/25/18.

3. "Install TensorFlow on CentOS7," n.d., accessed 6/25/18.

4. The CIFAR-10 dataset

5. TensorFlow, MNIST and your own handwritten digits

6. TensorFlow Tutorial

7. Tutorial on CNN on TensorFlow

8. CIFAR-10 Details

9. TensorFlow Models

10. Installing TensorFlow from Sources

11. Libffi

12. Performance Guide for TensorFlow

13. What is batch size in neural network?

14. Learning Multiple Layers of Features from Tiny Images (PDF), Alex Krizhevsky, 2009

15. Epoch vs Batch Size vs Iterations

16. Virtualenv

17. CPU Optimizations

18. Download and Setup

