Review of Architecture and Optimization on Intel® Xeon® Scalable Processors in context of Intel® Optimized TensorFlow* on Intel® AI DevCloud

When I joined the Intel® Student Developer Program in late 2017, I was pretty excited to try the Intel® Xeon® Scalable processors [1] that were part of the Intel® AI DevCloud, which launched around the same time. To get everyone on the same page, I would like to begin with what Intel Xeon Scalable processors are and how they affect computation. Later, I will discuss what I learned about optimizing deep learning TensorFlow* [2] code to squeeze the last drop of performance out of this beast.

The Intel Xeon Scalable processor family on the "Purley" platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly the Broadwell microarchitecture). The core reason I believe Intel Xeon Scalable processors handle artificial intelligence (AI) computations so well is the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) [3] instruction set. It provides ultra-wide 512-bit vector operations, which cover most of the heavy numerical work TensorFlow* requires: the basic units of computation in TensorFlow are tensors flowing through operations, and those operations are parallelized across the vector processing units. These are Single Instruction Multiple Data (SIMD) [4] operations. To give an example, if we add two vectors the natural way, we loop over the dimension and add the corresponding elements one at a time. A CPU with vector support instead adds whole chunks of the vectors with a single add instruction, cutting the instruction count roughly by the vector width; a 512-bit register holds 16 single-precision floats, so one instruction handles 16 additions at once. In practice this gives a performance boost of up to 3 to 4x over CPUs without such wide vector units.
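
To make this concrete, here is a small, illustrative Python sketch (not from the original article) contrasting an element-by-element loop with a vectorized add. NumPy dispatches the vectorized form to SIMD kernels under the hood, which is the same principle AVX-512 exploits:

    import numpy as np

    a = np.random.rand(4096).astype(np.float32)
    b = np.random.rand(4096).astype(np.float32)

    # Scalar style: one addition per loop iteration.
    c_loop = np.empty_like(a)
    for i in range(a.shape[0]):
        c_loop[i] = a[i] + b[i]

    # Vectorized style: a single expression that the library maps onto wide
    # SIMD adds (e.g., 16 float32 lanes per 512-bit instruction on AVX-512).
    c_vec = a + b

    assert np.allclose(c_loop, c_vec)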

Now I will turn to the experiments I conducted, which bear this out. When I joined the program I already had a roughly 1500-line codebase for a neural image captioning system to try out, which I had previously run only on the Google* Cloud Platform [5]. A neural captioning system generates captions for images through an encoder-decoder neural network. In my case I followed the work of Vinyals et al. [6] with slight modifications. My image encoder is a VGG16 model [7], a convolutional neural network presented at ILSVRC [8] for the object recognition competition. It turns out to be a good feature extractor, so I removed everything after its second fully connected layer (commonly called fc7) and used the resulting 4096-dimensional vector. This approach is popularly known as transfer learning in the deep learning community. I pre-extracted the features of all the images in the Microsoft COCO [9] dataset and then performed Principal Component Analysis (PCA) over the data to reduce its dimension to 512. I conducted experiments with both datasets (PCA and non-PCA). The work is still in progress, and the codebase is in my GitHub repository if you want to take a look.
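
For illustration, here is a minimal sketch of that feature-extraction step, assuming the Keras VGG16 application (where the second 4096-unit fully connected layer is named "fc2") and scikit-learn's PCA; it is not the author's actual code, and the helper name is a placeholder:

    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.models import Model
    from sklearn.decomposition import PCA

    # Truncate VGG16 after the second 4096-unit fully connected layer ("fc2").
    base = VGG16(weights="imagenet", include_top=True)
    feature_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

    def extract_and_reduce(images):
        """images: (N, 224, 224, 3) array with pixel values in [0, 255].

        Returns an (N, 512) array of PCA-reduced features. Fitting 512 PCA
        components needs at least 512 images, which MS COCO easily provides.
        """
        feats = feature_model.predict(preprocess_input(images.astype("float32")))  # (N, 4096)
        pca = PCA(n_components=512)
        return pca.fit_transform(feats)                                            # (N, 512)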

When I first tried running the code out of the box on the Intel Xeon processors, I got absolutely zero performance increase; in fact, it was a bit slower. So I spent the last month hoping I could figure it out before Santa knocked on my door. I would like to share some steps that I found need to be done before you see improved performance from this complicated yet powerful processor.

  1. As we batch through the data in every epoch, avoid any kind of disk read/write as far as possible. On Intel AI DevCloud our home folder is a Network File System (NFS) share between the compute nodes and the login node. Reads and writes take a long time on the cluster because the storage sits much further away than a home PC's local disk. So how do we go about it? TensorFlow provides an elegant queue-based mechanism through its Dataset API [10]. The API lets us build complex input pipelines from reusable pieces: it wraps your data with the pre-processing operations of your choice and batches the results together. This drastically reduces latency because the dataset can be cached (within memory limits) and all of the operations are embedded in the computation graph. A minimal sketch of such a pipeline appears after this list.
     
  2. There is a very nice paper by Colfax Research [11] from which I would like to share a few tips that directly affected my performance. It deals with optimizing an object detection network based on YOLO [12] and recommends tuning certain variables that are quite important from a performance point of view.
  • KMP_BLOCKTIME: This is one of several environment variables that control the behavior of the OpenMP* API, the parallel programming interface responsible for most of the multi-threading inside Intel-optimized TensorFlow. It sets the time, in milliseconds, that an OpenMP thread waits after finishing a parallel region before going to sleep. A large value keeps your data hot in the caches, but it can also starve other threads of CPU time, so it needs to be tuned to suit your workload. In my case I kept it at 30.
    os.environ["KMP_BLOCKTIME"] = “30”
  • OMP_NUM_THREADS: This is the number of parallel OpenMP threads that TensorFlow operations can use. The recommended setting for TensorFlow is the number of physical cores. I tried 136 and it worked in my case.
    os.environ["OMP_NUM_THREADS"] = “136”
  • KMP_AFFINITY: This controls the placement of OpenMP threads on physical cores. The recommended setting for TensorFlow is 'granularity=fine,compact,1,0'. 'fine' pins each thread to a single hardware context, preventing migration and thereby reducing cache misses. 'compact' places neighboring threads close together. The '1' spreads threads across free physical cores before placing a second thread on a hyper-threaded sibling of the same core, similar to how electron orbitals are filled in atoms. The trailing '0' is the offset at which core indexing starts.
    os.environ["KMP_BLOCKTIME"] = “30"
  • Inter- and intra-op parallelism threads: These are the settings TensorFlow provides to control how many operations can run simultaneously (inter-op) and how many parallel threads each operation can use (intra-op). In my case I kept the former at two and set the latter equal to OMP_NUM_THREADS, as recommended; the combined configuration is shown in the sketch after this list.
    tf.app.flags.DEFINE_integer('inter_op', 2, """Inter-op parallelism threads""")
    tf.app.flags.DEFINE_integer('intra_op', 136, """Intra-op parallelism threads""")
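
Putting these pieces together, here is a minimal sketch, in TensorFlow 1.x style to match the tf.app.flags usage above, of how the threading variables, the session configuration, and a Dataset API input pipeline might be wired into a training script. The record file name and the parse function are placeholders, not taken from the original code:

    import os

    # Set the threading environment variables before TensorFlow/MKL spins up
    # its thread pools.
    os.environ["KMP_BLOCKTIME"] = "30"
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    os.environ["OMP_NUM_THREADS"] = "136"

    import tensorflow as tf

    def parse_fn(record):
        # Placeholder: decode one serialized example into (feature, caption) tensors.
        return record

    # Dataset API pipeline: pre-processing, caching, and batching stay inside the
    # graph instead of round-tripping through the shared NFS home directory.
    dataset = (tf.data.TFRecordDataset("features.tfrecord")   # placeholder file name
               .map(parse_fn, num_parallel_calls=8)
               .cache()        # keep decoded records in memory after the first pass
               .shuffle(10000)
               .batch(64)
               .prefetch(1))   # overlap input preparation with training
    next_batch = dataset.make_one_shot_iterator().get_next()

    # Session-level parallelism, mirroring the inter_op/intra_op flags above.
    config = tf.ConfigProto(inter_op_parallelism_threads=2,
                            intra_op_parallelism_threads=136)

    with tf.Session(config=config) as sess:
        batch = sess.run(next_batch)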

After tuning all of these variables I got a performance increase of up to 4x, reducing my per-epoch time from 2.5 hours to about 30 minutes. As I said, Intel Xeon Scalable processors are pretty powerful, and the Intel AI DevCloud promises a theoretical peak of 260 TFLOPS, but you can't expect that out of the box unless certain cards fall into the right place.

References

  1. Intel® Xeon® Scalable Platform
  2. arXiv:1603.04467: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
  3. Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  4. Single Instruction Multiple Data (SIMD)
  5. Google Cloud Platform
  6. arXiv:1411.4555: Show and Tell: A Neural Image Caption Generator
  7. arXiv:1409.1556: Very Deep Convolutional Networks for Large-Scale Image Recognition
  8. ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  9. arXiv:1405.0312: Microsoft COCO: Common Objects in Context
  10. TensorFlow Dataset API
  11. Colfax Research: Optimization of Real-Time Object Detection on Intel® Xeon® Scalable Processors
  12. arXiv:1506.02640: You Only Look Once: Unified, Real-Time Object Detection
