Overview
In November 2017, UC Berkeley, U-Texas, and UC Davis researchers trained ResNet-50* in a record time of 31 minutes and AlexNet* in a record time of 11 minutes on CPUs to state-of-the-art accuracy1. These results were obtained on Intel Xeon® Scalable processors (formerly codenamed Skylake-SP). The main reasons for this performance are:
- The compute and memory capacity of Intel Xeon Scalable processors
- Software optimizations in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and in the popular deep learning frameworks
- Recent advancements in distributed training algorithms for supervised deep learning workloads
This level of performance shows that Intel Xeon processors are an excellent hardware platform for deep learning training. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training, and continue using them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance using one server node, and further reduce the time-to-train by using more server nodes, scaling near linearly to hundreds of nodes.
In this 4-part article, we explore each of the three main factors contributing to record-setting speed, and provide examples of industry use cases.
Part 1: Compute and Memory Capacity of Intel Xeon® Scalable Processors
Training deep learning models often requires significant compute. For example, training ResNet-502 requires a total of about one exa (10^18) single-precision operations1. Hardware capable of high compute throughput can reduce the training time, provided that high utilization is achieved. High utilization requires high-bandwidth memory and clever memory management to keep the compute busy on the chip3. The new generation of Intel Xeon processors offers the necessary features: a large core count at high processor frequency, fast system memory, a large per-core mid-level cache (MLC or L2 cache), and new SIMD instructions, making it an excellent platform for training deep learning models. In Part 1, we review the main hardware features of the Intel Xeon Scalable processors, including compute and memory, and compare their performance to previous generations of Intel Xeon processors for deep learning workloads.
In July 2017, Intel launched the Intel Xeon Scalable processor family built on 14 nm process technology. The Intel Xeon Scalable processors support up to 28 physical cores (56 threads) per socket (up to 8 sockets) at 2.50 GHz processor base frequency and 3.80 GHz max turbo frequency, and six memory channels with up to 1.5 TB of 2,666 MHz DDR4 memory. The top-bin Intel Xeon Platinum 8180 processor provides up to 199 GB/s of STREAM Triad performance on a 2-socket system [a, b]. For inter-socket data transfers, the Intel Xeon Scalable processors introduce the new Ultra Path Interconnect (UPI), a coherent interconnect that replaces QuickPath Interconnect (QPI) and increases the data rate to 10.4 GT/s per UPI port, with up to 3 UPI ports in a 2-socket configuration4.
Additional improvements include a 38.5 MB shared non-inclusive last-level cache (LLC or L3 cache), meaning that memory reads fill directly into the L2 and not into both the L2 and L3, and 1 MB of private L2 cache per core. The Intel Xeon Scalable processor core now includes 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine, with up to two 512-bit FMA units computing in parallel per core (previously introduced in the Intel Xeon Phi™ processor product line)1 [4]. This provides a significant performance boost over the 256-bit wide AVX2 instructions in the previous Intel Xeon processor v3 and v4 generations (formerly codenamed Haswell and Broadwell, respectively) for both training and inference workloads.
The Intel Xeon Platinum 8180 processor provides up to 3.57 TFLOPS (FP32) and up to 5.18 TOPS (INT8) per socket2. The 512-bit wide FMAs essentially double the FLOPS that the Intel Xeon Scalable processors can deliver and significantly speed up single-precision matrix arithmetic. Comparing SGEMM and IGEMM performance, we observe a 2.3x and 3.4x improvement, respectively, over the previous Intel Xeon processor v4 generation [c, e]. Comparing performance on a full deep learning model, we observed a 2.2x training and 2.4x inference throughput improvement in FP32 using the ResNet-18 model with the neon framework over the previous Intel Xeon processor v4 generation [d, f].
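As a rough illustration of where these numbers come from, the sketch below applies the peak-throughput formula from the footnotes. The sustained AVX-512 frequency used here is an assumed placeholder rather than a published figure, and the utilization factor is arbitrary.

```python
# Back-of-the-envelope peak FP32 throughput for one Intel Xeon Platinum 8180 socket,
# using the formula from the footnotes. The AVX-512 frequency below is an assumed
# placeholder; the actual sustained frequency is workload dependent.
cores = 28
fma_units_per_core = 2        # two 512-bit FMA units per core
ops_per_fma = 2               # a fused multiply-add counts as two operations
fp32_lanes = 512 // 32        # 16 single-precision lanes per 512-bit vector
avx512_freq_ghz = 2.0         # assumed sustained AVX-512 frequency (placeholder)

peak_gflops = cores * fma_units_per_core * ops_per_fma * fp32_lanes * avx512_freq_ghz
print(f"Peak FP32 per socket: {peak_gflops / 1e3:.2f} TFLOPS")   # ~3.58 TFLOPS at 2.0 GHz

# Rough ideal time for ~1 exa (1e18) operations (the ResNet-50 figure quoted above)
# on a 2-socket server at an assumed 80% utilization.
seconds = 1e18 / (2 * peak_gflops * 1e9 * 0.80)
print(f"Ideal 2-socket time: {seconds / 3600:.1f} hours")
```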
Part 2: Software Optimizations in Intel MKL-DNN and the Main Frameworks
Software optimization is essential to high compute utilization and improved performance. Intel Optimized Caffe* (sometimes referred to as Intel Caffe), TensorFlow*, Apache* MXNet*, and Intel® neon™ are optimized for training and inference. Optimizations for other frameworks such as Caffe2*, CNTK*, PyTorch*, and PaddlePaddle* are also in progress. In Part 2, we compare the performance of Intel-optimized vs. non-optimized models; we explain how the Intel MKL-DNN library enables high compute utilization; we discuss the difference between Intel MKL and Intel MKL-DNN; and we explain additional optimizations at the framework level that further improve performance.
Two years ago, deep learning performance was sub-optimal on Intel® CPUs because software optimizations were limited and compute utilization was low. Deep learning scientists incorrectly assumed that CPUs were not good for deep learning workloads. Over the past two years, Intel has diligently optimized deep learning functions, achieving high utilization and enabling deep learning scientists to use their existing general-purpose Intel CPUs for deep learning training. By simply setting a configuration flag when building the popular deep learning frameworks (the framework will automatically download and build Intel MKL-DNN by default), data scientists can take advantage of Intel CPU optimizations.
With the Intel MKL-DNN library, Intel Xeon processors can provide over a 100x performance increase. For example, inference across all available CPU cores on AlexNet, GoogleNet v1, ResNet-50, and GoogleNet v3 with Apache MXNet on the Intel Xeon processor E5-2666 v3 (c4.8xlarge AWS* EC2* instance) provides 111x, 109x, 66x, and 92x higher throughput, respectively [5]. Inference across all CPU cores on AlexNet with Caffe2 on the Intel Xeon processor E5-2699 v4 provides 39x higher throughput6. Training AlexNet, GoogleNet, and VGG* with TensorFlow on the Intel Xeon processor E5-2699 v4 provides 17x, 6.7x, and 40x higher throughput, respectively7. Training AlexNet across all CPU cores with Intel Optimized Caffe and Intel MKL-DNN on the Intel Xeon Platinum 8180 processor has 113x higher throughput than BVLC*-Caffe without Intel MKL-DNN on the Intel Xeon processor E5-2699 v3 [d, g]. Figure 1 compares the training throughput across multiple Intel Xeon processor generations with the Intel MKL-DNN library.
Figure 1: Training throughput of Intel Optimized Caffe across Intel Xeon processor v2 (formerly codenamed Ivy Bridge) [h], Intel Xeon processor v3 (formerly codenamed Haswell) [g], Intel Xeon processor v4 (formerly codenamed Broadwell) [f], Intel Xeon Gold processor [e], and Intel Xeon Platinum processor (formerly codenamed Skylake) [d] with AlexNet using batch size (BS) equal to 256, GoogleNet v1 BS=96, ResNet-50 BS=50, and VGG-19 BS=64. Intel® MKL-DNN provides significant performance gains starting with the Intel Xeon processor v3, when AVX2 instructions were introduced, and another significant jump with the Intel Xeon Scalable processors, when AVX-512 instructions were introduced.
At the heart of these optimizations are the Intel® Math Kernel Library (Intel® MKL)8 and the Intel MKL-DNN library9. There are a variety of deep learning models, and they may seem very different. However, most models are built from a limited set of building blocks, known as primitives, that operate on tensors. Some of these primitives are inner products, convolutions, rectified linear units (ReLU), and batch normalization, along with the functions necessary to manipulate tensors. These building blocks, or low-level deep learning functions, have been optimized for the Intel Xeon product family inside the Intel MKL library. Intel MKL contains many mathematical functions, only some of which are used for deep learning. In order to provide a more targeted deep learning library and to collaborate with deep learning developers, Intel MKL-DNN was released as open source under an Apache 2 license with all the key building blocks necessary to build complex models. Intel MKL-DNN allows industry and academic deep learning developers to distribute the library and contribute new or improved functions. Intel MKL-DNN is expected to lead in performance, as all new optimizations will first be introduced there.
Deep learning primitives are optimized in the Intel MKL-DNN library by incorporating prefetching, data layout, cache-blocking, data reuse, vectorization, and register-blocking strategies9. High utilization requires that data be available when the execution units (EUs) need it. This requires prefetching the data and reusing the data in cache instead of fetching that same data multiple times from main memory. For cache-blocking, the goal is to maximize the computation on a given block of data that fits in cache, typically in the MLC. The data layout is arranged consecutively in memory so that access in the innermost loops is as contiguous as possible, avoiding unnecessary gather/scatter operations. This results in better utilization of cache lines (and hence bandwidth) and improves prefetcher performance. As we loop through the block, we constrain the outer dimension of the block to be a multiple of the SIMD width, with the innermost dimension looping over groups of SIMD width to enable efficient vectorization. Register blocking may be needed to hide the latency of the FMA instructions3.
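To make cache-blocking concrete, here is a minimal NumPy sketch of a blocked matrix multiplication. The block size is arbitrary, and a tuned library such as Intel MKL-DNN additionally applies vectorization, register blocking, prefetching, and SIMD-friendly layouts on top of this idea.

```python
# Illustrative NumPy sketch of cache-blocking for C = A * B: the loops visit one
# block of A, B, and C at a time so that the working set fits in the L2 cache and
# is reused many times before being evicted.
import numpy as np

def blocked_matmul(a, b, block=64):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, block):              # block of rows of A and C
        for j0 in range(0, n, block):          # block of columns of B and C
            for k0 in range(0, k, block):      # one block of the reduction dimension
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block] @ b[k0:k0 + block, j0:j0 + block]
                )
    return c

a = np.random.rand(256, 192).astype(np.float32)
b = np.random.rand(192, 320).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-3)
```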
Additional parallelism across cores is important for high CPU utilization, such as parallelizing across a batch using OpenMP*. This requires improving the load balance so that each core is doing an equal amount of work and reducing synchronization events across cores. Efficiently using all cores in a balanced way requires additional parallelization within a given layer.
These sets of optimizations ensure that all the key deep learning primitives, such as convolution, matrix multiplication, and batch normalization, are efficiently vectorized to the latest SIMD instructions and parallelized across the cores10. Intel MKL-DNN primitives are implemented in C, with C and C++ API bindings, for the most widely used deep learning functions11:
- Direct batched convolution
- Inner product
- Pooling: maximum, minimum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU), softmax
- Fused primitives: convolution+ReLU
- Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale
- Coming soon: Long short-term memory (LSTM) and Gated recurrent units (GRU)
There are multiple deep learning frameworks such as Caffe, TensorFlow, MXNet, and PyTorch. Modifications (code refactoring) at the framework level are required to efficiently take advantage of the Intel MKL-DNN primitives. The frameworks carefully replace calls to existing deep learning functions with the appropriate Intel MKL-DNN APIs, preventing the framework and the Intel MKL-DNN library from competing for the same threads. During setup, the framework manages layout conversions from the framework to Intel MKL-DNN and allocates temporary arrays if the output and input data layouts do not match. To improve performance, graph optimizations may be required to keep conversions between different data layouts to a minimum. During the execution step, the data is fed to the network in a plain layout like BCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers, the data layout is preserved, and conversions are made only when it is necessary to perform operations that are not supported by Intel MKL-DNN [10].
Figure 2: Operation graph flow. MKL-DNN primitives are shown in blue. Framework optimizations attempt to reduce layout conversions so that the data stays in the MKL-DNN layout across consecutive primitive operations. Image courtesy of [10].
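As an illustration of what such a SIMD-friendly layout looks like, the sketch below converts a plain NCHW tensor to the blocked layout that MKL-DNN calls nChw16c, in which groups of 16 channels sit in the innermost dimension so a 512-bit FMA can consume them with one instruction. This is illustrative only; in practice the conversion is performed by MKL-DNN reorder primitives, not by user code.

```python
# Sketch: convert a plain NCHW tensor to the blocked nChw16c layout
# ([N][C/16][H][W][16c]) used by MKL-DNN for AVX-512 friendly access.
import numpy as np

def nchw_to_nChw16c(x):
    n, c, h, w = x.shape
    assert c % 16 == 0, "channels are padded to a multiple of 16 in practice"
    # Split C into C/16 blocks of 16 channels, then move each block of 16 channels
    # to the innermost (fastest-varying) dimension.
    return np.ascontiguousarray(x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2))

x = np.random.rand(1, 64, 7, 7).astype(np.float32)
print(nchw_to_nChw16c(x).shape)   # (1, 4, 7, 7, 16)
```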
Part 3: Advancements in Distributed Training Algorithms for Deep Learning
Training a large deep learning model often takes days or even weeks. Distributing the computational requirement among multiple server nodes can reduce the time to train. However, regardless of the hardware used, distributing training poses algorithmic challenges; recent advances in distributed algorithms mitigate some of them. In Part 3, we review the gradient descent and stochastic gradient descent (SGD) algorithms and explain the limitations of training with very large batches; we discuss model and data parallelism; we review synchronous SGD (SSGD), asynchronous SGD (ASGD), and allreduce/broadcast algorithms; finally, we present recent advances that enable larger-batch SSGD training and present state-of-the-art results.
In supervised deep learning, input data is passed through the model and the output is compared to the ground truth or expected output. A penalty or loss is then computed. Training the model involves adjusting the model parameters to reduce this loss. There are various optimization algorithms that can be used to minimize the loss function such as gradient descent, or variants such as stochastic gradient descent, Adagrad, Adadelta, RMSprop, Adam, etc.
In gradient descent (GD), also known as steepest descent, the loss function for a particular model defined by the set of weights w is computed over the entire dataset. The weights are updated by moving in the direction opposite to the gradient; that is, moving towards the local minimum: updated-weights = current-weights – learning-rate * gradient.
In stochastic gradient descent (SGD), or more precisely mini-batch gradient descent, the dataset is broken into several batches. The loss is computed with respect to a batch, and the weights are updated using the same update rule as gradient descent. There are other variants that speed up the training process by accumulating velocity (known as momentum) in the direction opposite to the gradients, or that reduce the data scientist's burden of choosing a good learning rate by automatically modifying the learning rate depending on the norm of the gradients. An in-depth discussion of these variants can be found elsewhere12.
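The update rules above fit in a few lines of NumPy. The sketch below runs mini-batch SGD with momentum on a toy, noiseless least-squares problem; the data, model, and hyperparameters are illustrative, not taken from the article.

```python
# Minimal sketch of mini-batch SGD with momentum on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.standard_normal((1024, 10))
true_w = rng.standard_normal(10)
y_train = x_train @ true_w

w = np.zeros(10)
velocity = np.zeros(10)
learning_rate, momentum, batch_size = 0.1, 0.9, 64

for epoch in range(20):
    perm = rng.permutation(len(x_train))                       # reshuffle each epoch
    for start in range(0, len(x_train), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x_train[idx], y_train[idx]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(yb)             # gradient w.r.t. the batch
        velocity = momentum * velocity - learning_rate * grad   # accumulate velocity
        w += velocity                                           # same update rule as GD, per batch

print(np.allclose(w, true_w, atol=1e-3))                        # True: converges to the true weights
```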
The behavior of SGD approaches the behavior of GD as the batch size increases, and the two become identical when the batch size equals the entire dataset. There are three main challenges with GD (and with SGD when the batch size is very large). First, each step is computationally expensive, as it requires computing the loss over the entire dataset. Second, learning slows near saddle points or areas where the gradient is close to zero. Third, according to Intel and Northwestern researchers13, the optimization space appears to have many sharp minima. Gradient descent does not explore the optimization space but rather moves towards the local minimum directly underneath its starting position, which is often a sharp minimum. Sharp minima do not generalize. While the overall loss function with respect to the test dataset is similar to that of the training dataset, the actual cost at a sharp minimum may be very different. A cartoonish way to visualize this is shown in Figure 3, where the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. This shift means that models converging to a sharp minimum have a high cost with respect to the test dataset; that is, they do not generalize well for data outside the training set. On the other hand, models that converge to a flat minimum have a low cost with respect to the test dataset, meaning they generalize well for data outside the training set.
Figure 3: In this cartoonish figure, the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. The sharp minimum has a high cost with respect to the test loss function. Image courtesy of [13].
Small-batch SGD (SB-SGD) resolves these issues. First, each SB-SGD iteration is computationally inexpensive and therefore fast. Second, it is extremely unlikely to get stuck at a saddle point using SB-SGD, since the gradients with respect to some of the batches in the training set are likely nonzero even if the gradient with respect to the entire training set is zero. Third, it is more likely to find a flat minimum, since SB-SGD better explores the solution space instead of moving towards the local minimum directly underneath its starting position. On the other hand, very small or tiny batches are also not ideal, because it is difficult to achieve high CPU (or GPU) utilization. This becomes more problematic when distributing the computational workload of the small batch across several worker nodes. Therefore, it is important to find a batch size large enough to maintain high CPU utilization but not so large that it runs into the issues of GD. This becomes even more important for the synchronous data-parallel SGD discussed below.
Efficiently distributing the workload across several worker nodes can reduce the overall time-to-train. The two main techniques used are model parallelism and data parallelism. In model parallelism, the model is split among the worker nodes, with each node working on the same batch. Model parallelism is used in practice when the model's memory requirements exceed the worker's memory. Data parallelism is the more common approach and works best for models with fewer weights. In data parallelism, the batch is split among the worker nodes, with each node having the full model and processing a piece of the batch, known as the node-batch. Each worker node computes the gradient with respect to its node-batch. These gradients are then aggregated using some allreduce algorithm (a list of allreduce options is discussed below) to compute the gradient with respect to the overall batch. The model weights are then updated, and the updated weights are broadcast to each worker node. At the end of each iteration or cycle through a batch, all the worker nodes have the same updated model; that is, the nodes are synchronized. This is known as synchronous SGD (SSGD).
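A minimal sketch of one such synchronous data-parallel step, written with mpi4py; the toy linear-regression node-batch and hyperparameters are placeholders introduced for illustration.

```python
# One synchronous data-parallel SGD step with mpi4py (illustrative sketch).
# Run with e.g.: mpirun -n 4 python ssgd_step.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, num_workers = comm.Get_rank(), comm.Get_size()

dim = 1000
weights = np.zeros(dim)                          # every node holds the full model

# Each node draws its own node-batch (toy data, different per rank).
rng = np.random.default_rng(seed=rank)
x = rng.standard_normal((32, dim))
y = rng.standard_normal(32)

# Gradient of the mean squared error with respect to this node's node-batch.
node_grad = 2.0 * x.T @ (x @ weights - y) / len(y)

# Sum the node gradients across all workers; every node receives the same result.
global_grad = np.empty_like(node_grad)
comm.Allreduce(node_grad, global_grad, op=MPI.SUM)
global_grad /= num_workers                       # gradient with respect to the overall batch

weights -= 0.01 * global_grad                    # identical update on every node: stays in sync
```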
Asynchronous SGD (ASGD) alleviates the overhead of synchronization. However, ASGD has additional challenges. ASGD requires more tuning of hyperparameters such as momentum, and requires more iterations to train. Furthermore, it does not match single node performance and therefore it is more difficult to debug. In practice ASGD has not been shown to scale and retain accuracy on large models. Stanford, LBNL, and Intel researchers have shown that an ASGD/SSGD hybrid approach can work where the nodes are clustered in up to 8 groups. Updates within a group are synchronous and between groups asynchronous. Going beyond 8 groups reduces performance due to the ASGD challenges14.
One strategy for communicating gradients is to appoint one node as the parameter server, which computes the sum of the node gradients, updates the model, and sends the updated weights back to each worker. However, there is a bottleneck in sending and receiving all of the gradients using one parameter server. Unless ASGD is used, a parameter server strategy is not recommended.
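For contrast, here is a sketch of one parameter-server step with mpi4py, in which rank 0 plays the server and the worker gradients are random placeholders; it makes clear why all traffic funnels through a single node.

```python
# One parameter-server step with mpi4py (illustrative sketch).
# Run with e.g.: mpirun -n 4 python ps_step.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
dim, lr = 1000, 0.01

weights = np.zeros(dim)
comm.Bcast(weights, root=0)                    # server sends current weights to workers

# Workers compute a node gradient; the server contributes zeros.
if rank == 0:
    node_grad = np.zeros(dim)
else:
    node_grad = np.random.default_rng(seed=rank).standard_normal(dim)

# All node gradients funnel into the single parameter server (rank 0).
summed = np.empty(dim) if rank == 0 else None
comm.Reduce(node_grad, summed, op=MPI.SUM, root=0)

if rank == 0:
    weights -= lr * summed / max(size - 1, 1)  # server averages and applies the update
comm.Bcast(weights, root=0)                    # updated weights go back to every worker
```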
Allreduce and broadcast algorithms are used to communicate and sum the node gradients and then broadcast the updated weights. There are various allreduce algorithms, including Tree, Butterfly, and Ring. Butterfly is optimal for latency, scaling at O(log(P)) iterations, where P is the number of worker nodes, and it combines allreduce and broadcast. Ring is optimal for bandwidth; for large data transfers, the communication per node scales at O(1) with the number of nodes. In bandwidth-constrained clusters, e.g., using 10 GbE, Ring allreduce is usually the better algorithm. A detailed explanation of the Ring allreduce algorithm is found elsewhere15.
Figure 4: Various communication strategies. Butterfly allreduce is optimal for latency. Ring allreduce is optimal for bandwidth. Image courtesy of Amir Gholami, Peter Jin, Yang You, Kurt Keutzer, and the PALLAS group at UC Berkeley.
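To illustrate why the ring algorithm is bandwidth optimal, here is a small single-process simulation of ring allreduce (an illustrative sketch, not production code): each node splits its gradient into P chunks, a reduce-scatter phase sums one chunk per step around the ring, and an allgather phase then circulates the finished chunks. Each node sends roughly 2x the gradient size in total, independent of P.

```python
# Single-process simulation of ring allreduce over P simulated nodes.
import numpy as np

def ring_allreduce(node_chunks):
    # node_chunks[p] is the list of P chunks held by node p; modified in place.
    P = len(node_chunks)
    # Phase 1: reduce-scatter. At step s, node p sends chunk (p - s) % P to node p + 1,
    # which adds it to its own copy of that chunk.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, node_chunks[p][(p - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            node_chunks[(p + 1) % P][c] += data
    # Now node p holds the fully summed chunk (p + 1) % P.
    # Phase 2: allgather. At step s, node p forwards chunk (p + 1 - s) % P to node p + 1,
    # which overwrites its stale copy.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, node_chunks[p][(p + 1 - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            node_chunks[(p + 1) % P][c] = data
    return node_chunks

P, n = 4, 8                                               # 4 workers, gradient of length 8
grads = [np.random.rand(n) for _ in range(P)]             # per-node gradients
chunks = [[c.copy() for c in np.split(g, P)] for g in grads]
ring_allreduce(chunks)
assert all(np.allclose(np.concatenate(c), sum(grads)) for c in chunks)
```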
In November 2014, Jeff Dean spoke of Google's research goal to reduce training time from six weeks to a day16. Three years later, CPUs were used to train AlexNet in 11 minutes! This was accomplished by using larger batch sizes that allow distributing the computational workload to 1000+ nodes. To scale efficiently, the communication of the gradients and updated weights must be hidden within the computation of those gradients.
Increasing the overall batch size is possible with these techniques: 1) increasing the learning rate proportionally with the batch size; 2) slowly increasing the learning rate during the initial part of training (known as learning rate warm-up); and 3) using a different learning rate for each layer in the model via the layer-wise adaptive rate scaling (LARS) algorithm. Let's go through each technique in more detail.
The larger the batch size, the more confidence one has in the gradient, and therefore a larger learning rate can be used. As a rule of thumb, the learning rate is increased proportionally to the increase in batch size4,17. This technique allowed UC Berkeley researchers to increase the batch size from 256 to 1024 with the GoogleNet model and scale to 128 K20-GPU nodes, reducing the time-to-train from 21 days to 10.5 hours18, and Intel researchers to increase the batch size from 128 to 512 with the VGG-A model and scale to 128 Intel Xeon processor E5-2698 v3 nodes19.
A large learning rate can lead to divergence (the loss increases with each iteration rather than decreasing), in particular during the initial training phase. This is because the norm of the gradients is much greater than the norm of the weights during the initial training phase20. This is mitigated by gradually increasing the learning rate during the initial training phase, for example during the first 5 epochs, until the target learning rate is reached. This technique allowed Facebook* researchers to increase the batch size from 256 to 8192 with ResNet-50 and scale to 256 P100 GPUs, reducing the time-to-train from 29 hours (using 8 P100 GPUs) to 1 hour21. This technique also allowed SURFsara and Intel researchers to scale to 512 2-socket Intel Xeon Platinum processors, reducing the ResNet-50 time-to-train to 40 minutes22.
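A sketch of how the linear scaling rule and the warm-up combine into a simple learning rate schedule; all values here are illustrative placeholders, not the settings used in the cited work.

```python
# Linear scaling rule combined with a 5-epoch learning-rate warm-up (illustrative).
base_lr = 0.1                 # reference learning rate tuned for the reference batch size
base_batch_size = 256
large_batch_size = 8192
warmup_epochs = 5

# Linear scaling rule: grow the learning rate proportionally with the batch size.
target_lr = base_lr * large_batch_size / base_batch_size

def learning_rate(epoch):
    """Ramp linearly from base_lr to target_lr over the warm-up, then hold."""
    if epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

for epoch in range(8):
    print(epoch, learning_rate(epoch))
```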
Researchers at NVIDIA* observed that the ratio of the norm of the gradients to the norm of the weights varies greatly between the layers of a model [20]. They proposed having a different learning rate for each layer that is inversely proportional to this ratio. This technique (combined with the ones above) allowed them to increase the batch size to 32K.
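A minimal sketch of a LARS-style update in that spirit; the trust coefficient and toy layers are illustrative, and the full rule in [20] also folds in weight decay and momentum.

```python
# LARS-style per-layer learning rate: local_lr is proportional to ||w|| / ||grad||,
# so layers whose gradients are large relative to their weights take smaller steps.
import numpy as np

def lars_update(weights, grads, global_lr=1.0, trust_coeff=0.001, eps=1e-9):
    for name, w in weights.items():
        g = grads[name]
        local_lr = trust_coeff * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
        w -= global_lr * local_lr * g        # per-layer effective step size

layers = {"conv1": np.random.rand(64, 3, 3, 3), "fc": np.random.rand(1000, 512)}
grads = {name: np.random.rand(*w.shape) for name, w in layers.items()}
lars_update(layers, grads)
```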
UC Berkeley, U-Texas, and UC Davis researchers used these techniques to achieve record training times: AlexNet in 11 minutes and ResNet-50 in 31 minutes on Intel CPUs to state-of-the-art accuracy1. They used 1024 and 1600 2-socket Intel Xeon Platinum 8160 processor servers, respectively, with the Intel MKL-DNN library and the Intel Optimized Caffe framework5. SURFsara and Intel researchers trained ResNet-50 in 42 minutes on 1024 2-socket Intel Xeon Platinum 8160 processors to state-of-the-art accuracy23.
Part 4: Commercial Use Cases
The additional compute and memory capacity, coupled with the software optimizations and advancements in distributed training, are enabling industries to use their existing Intel Xeon processors for deep learning workloads. In Part 4, we present two commercial use cases: one at an Intel assembly and test factory, and one for Honeywell's drone-based inspection service.
An Intel assembly and test factory benefited from Intel Optimized Caffe on Intel Xeon processors to improve package fault detection in silicon manufacturing. This project aimed to reduce the human review rate for package cosmetic damage at the final inspection point, while keeping the false negative ratio at the same level as the human rate. The input was package photos, and the goal was to perform binary classification on each of them, indicating whether the package was rejected or passed. The GoogleNet model was modified for this task. Using 8 Intel Xeon Platinum 8180 processors connected via 10 Gb Ethernet, training was completed within 1 hour. The false negative rate consistently met the expected human-level accuracy. Automating this process saved 70% of the inspectors' time. Details of this project are found elsewhere24.
Honeywell recently launched its first commercial unmanned aerial vehicle (UAV) inspection service25. Honeywell successfully used Faster-RCNN* with Intel Optimized Caffe for the task of solar panel defect detection. Three hundred original solar panel images, augmented with 36-degree rotations, were used in training. Training on an Intel Xeon Platinum 8180 processor took 6 hours and achieved a detection accuracy of 96.3%, even under some adverse environmental conditions. The inference performance is 188 images per second. This is a general solution that can be used for various inspection services in markets including oil and gas (pipeline seepage and leakage), utilities (transmission lines and substations), and emergency crisis response.
Conclusion
Intel's newest Xeon Scalable processors, along with optimized deep learning functions in the Intel MKL-DNN library, provide sufficient compute for deep learning training workloads (in addition to inference, classical machine learning, and other AI algorithms). Popular deep learning frameworks are now incorporating these optimizations, increasing the effective performance delivered by a single server by over 100x in some cases. Recent advances in distributed algorithms have also enabled the use of hundreds of servers to reduce the time to train from weeks to minutes. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training, and continue using them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance using one Intel CPU, and further reduce the time-to-train by using multiple CPU nodes, scaling near linearly to hundreds of nodes.
Footnotes
- Available in Intel Xeon Platinum processors, the Intel Xeon Gold processor 6000 series, and the Intel Xeon Gold 5122 processor.
- The raw compute can be calculated as AVX-512-frequency * number-of-cores * number-of-FMAs-per-core * 2-operations-per-FMA * SIMD-vector-length / number-of-bits-in-numerical-format. The Intel Xeon Platinum 8180 has AVX-512-frequency * 28 * 2 * 2 * 512/32 = 1792 * AVX-512-frequency peak FLOPS, that is, 1792 GFLOPS per GHz of sustained AVX-512 frequency. The AVX-512 frequencies for multiple SKUs can be found at https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html. The frequencies shown correspond to FP64 operations; the frequencies for FP32 may be slightly higher than the ones shown. For deep learning workloads, the AVX-512 max turbo frequency may not be sustained when running high-FLOPS workloads.
- The FMA latency in the Intel Xeon Scalable processors is 4 clock cycles per FMA (it was 5 in the previous Intel Xeon processor generation). An Intel Xeon Scalable processor with 2 FMA units requires at least 8 registers to hide these latencies. In practice, blocking by 10 registers is desired, e.g., at least 8 for the data and at least 1 for the deep learning model weights.
- This does not work as we approach large batch sizes of 8K+26. Beyond 8K, the learning rate should increase proportionally to the square root of the increase in batch size.
- The researchers added some modifications that will be committed to the main Intel Optimized Caffe branch.
Acknowledgements
A special thanks to the performance team for collecting and reviewing the data: Deepthi Karkada, Adam Procter, William (Prune) Wickart, Vikram Saletore, and Banu Nagasundaram, and to the wonderful reviewers Alexis Crowell, Chase Adams, Eric Dahlen, Wei Li, Akhilesh Kumar, Mike Pearce, R. Vivek Rane, Dave Hill, Mike Ferron-Jones, Allyson Klein, Frank Zhang, and Israel Hirsh for ensuring technical correctness and improving the clarity and content of the article.
About the authors
Andres Rodriguez, PhD, is a Sr. Principal Engineer with Intel's Artificial Intelligence Products Group (AIPG), where he designs AI solutions for Intel's customers and provides technical leadership across Intel for AI products. He has 13 years of experience working in AI. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He has over 20 peer-reviewed publications in journals and conferences, and a book chapter on machine learning.
Frank Zhang is the Intel Optimized Caffe product manager in the Intel Software and Services Group, where he is responsible for product management of the Intel Optimized Caffe deep learning framework, including development, product release, and customer support. He has more than 20 years of industry experience in software development at multiple companies including NEC, TI, and Marvell. Frank graduated from the University of Texas at Dallas with a master's degree in Electrical Engineering.
Jiong Gong is a senior software engineer in Intel's Software and Services Group, where he is responsible for the architectural design of Intel Optimized Caffe, making optimizations to show its performance advantage on both single-node and multi-node IA platforms. Jiong has more than 10 years of industry experience in system software and AI. Jiong graduated from Shanghai Jiao Tong University with a master's degree in computer science. He holds 4 US patents on AI and machine learning.
Chong Yu is a software engineer in Intel's Software and Services Group, currently working on Intel Optimized Caffe framework development and optimization on IA platforms. Chong won the Intel Fellowship and joined Intel 5 years ago. Chong obtained a master's degree in information science and technology from Fudan University. Chong has published 20 journal papers and holds 2 Chinese patents. His research areas include artificial intelligence, robotics, 3D reconstruction, remote sensing, and steganography.
Configuration Details
a. STREAM: 1-Node, 2 x Intel Xeon Platinum 8180 processor on Neon City with 384 GB Total Memory on Red Hat Enterprise Linux* 7.2-kernel 3.10.0-327 using STREAM AVX 512 Binaries. Data Source: Request Number: 2500, Benchmark: STREAM - Triad, Score: 199 Higher is better
b. SGEMM: System Summary 1-Node, 1 x Intel Xeon Platinum 8180 processor GEMM - GF/s 3570.48 processor Intel® Xeon® Platinum 8180 processor (38.5M Cache, 2.50 GHz)Vendor Intel Nodes 1 Sockets 1 Cores 28 Logical processors 56 Platform Lightning Ridge SKX Platform Comments Slots 12 Total Memory 384 GB Memory Configuration 12 slots / 32 GB / 2666 MT/s / DDR4 RDIMM Memory Comments OS Red Hat Enterprise Linux* 7.3 OS/Kernel Comments kernel 3.10.0-514.el7.x86_64 Primary / Secondary Software ic17 update2 Other Configurations BIOS Version: SE5C620.86B.01.00.0412.020920172159 HT No Turbo Yes 1-Node, 1 x Intel® Xeon® Platinum 8180 processor on Lightning Ridge SKX with 384 GB Total Memory on Red Hat Enterprise Linux* 7.3 using ic17 update2. Data Source: Request Number: 2594, Benchmark: SGEMM, Score: 3570.48 Higher is better
c. SGEMM, IGEMM proof point: SKX: Intel Xeon Platinum 8180 CPU Cores per Socket 28 Number of Sockets 2 (only 1 socket was used for experiments) TDP Frequency 2.5 GHz BIOS Version SE5C620.86B.01.00.0412.020920172159 Platform Wolf Pass OS Ubuntu 16.04 Memory 384 GB Memory Speed Achieved 2666 MHz
BDX: Intel(R) Xeon(R) CPU E5-2699 v4 Cores per Socket 22 Number of Sockets 2 (only 1 socket was used for experiments) TDP Frequency 2.2 GHz BIOS Version GRRFSDP1.86B.0271.R00.1510301446 Platform Cottonwood Pass OS Red Hat 7.0 Memory 64 GB Memory Speed Achieved 2400 MHz
d. Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).
Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact', OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance
Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.
TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 56, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.
MxNet: (https://github.com/dmlc/mxnet/), revision 5efd91a71f36fea483e882b0358c8d46b5a7aa20. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425.
Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.
e. Platform: 2S Intel® Xeon® Gold 6148 CPU @ 2.40GHz (20 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).
Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact', OMP_NUM_THREADS=40, CPU Freq set with cpupower frequency-set -d 2.4G -u 3.7G -g performance
Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.
TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 40, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.
MxNet: (https://github.com/dmlc/mxnet/), revision 5efd91a71f36fea483e882b0358c8d46b5a7aa20. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425.
Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.
f. Platform: 2S Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz (22 cores), HT enabled, turbo disabled, scaling governor set to “performance” via acpi-cpufreq driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC).
Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0', OMP_NUM_THREADS=44, CPU Freq set with cpupower frequency-set -d 2.2G -u 2.2G -g performance
Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.
TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 44, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.
MxNet: (https://github.com/dmlc/mxnet/), revision e9f281a27584cdb78db8ce6b66e648b3dbc10d37. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.
Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.
g. Platform: 2S Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.el7.x86_64. OS drive: Seagate* Enterprise ST2000NX0253 2 TB 2.5" Internal Hard Drive.
Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0', OMP_NUM_THREADS=36, CPU Freq set with cpupower frequency-set -d 2.3G -u 2.3G -g performance
Intel Caffe: (http://github.com/intel/caffe/), revision b0ef3236528a2c7d2988f249d347d5fdae831236. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, MKLML version 2017.0.2.20170110.
BVLC-Caffe: (http://github.com/BVLC/caffe), revision 91b09280f5233cafc62954c98ce8bc4c204e7475 (commit date 5/14/2017). Inference and training measured with "caffe time" command. For "ConvNet" topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BLAS: atlas ver. 3.10.1.
h. Platform: 2S Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz (12 cores), HT enabled, turbo enabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR3-1600 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.21.1.el7.x86_64. SSD: Intel® SSD 520 Series 240GB, 2.5in SATA 6Gb/s, 25nm, MLC.
Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0', OMP_NUM_THREADS=24, CPU Freq set with cpupower frequency-set -d 2.7G -u 3.5G -g performance
Caffe: (http://github.com/intel/caffe/), revision b0ef3236528a2c7d2988f249d347d5fdae831236. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.
Bibliography
- Y. You, et al., “ImageNet training in minutes.” Nov. 2017. https://arxiv.org/pdf/1709.05011.pdf
- K. He, et al., "Deep residual learning for image recognition." Dec. 2015. https://arxiv.org/abs/1512.03385
- N. Rao, “Comparing dense compute platforms for AI.” June 2017. https://www.intelnervana.com/comparing-dense-compute-platforms-ai/
- D. Mulnix, “Intel® Xeon® processor scalable family technical overview.” Sept. 2017. https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview
- A. Rodriguez and J. Riverio, “Deep learning at cloud scale: improving video discoverability by scaling up Caffe on AWS.” AWS Re-Invent, Nov. 2016. https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-deep-learning-at-cloud-scale-improving-video-discoverability-by-scaling-up-caffe-on-aws-mac205
- A. Rodriguez and N. Sundaram, “Intel and Facebook collaborate to boost Caffe2 performance on Intel CPUs.” Apr. 2017. https://software.intel.com/en-us/blogs/2017/04/18/intel-and-facebook-collaborate-to-boost-caffe2-performance-on-intel-cpu-s
- E. Ould-Ahmed-Vall, et al., "TensorFlow optimizations on modern Intel® architecture." Aug. 2017. https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
- Intel® MKL https://software.intel.com/en-us/mkl
- Intel® MKL-DNN. https://github.com/01org/mkl-dnn
- V. Pirogov and G. Federov, “Introducing DNN primitives in Intel® Math Kernel Library.” Oct. 2016. https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl
- https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
- S. Ruder, “An overview of gradient descent optimization algorithms.” June 2017. http://ruder.io/optimizing-gradient-descent/
- N. Keskar, et al., “On large-batch training for deep learning: generalization gap and sharp minima.” Feb. 2017. https://arxiv.org/abs/1609.04836
- T. Kurth, et al., “Deep learning at 15PF: supervised and semi-supervised classification for scientific data.” Aug. 2017. https://arxiv.org/pdf/1708.05256.pdf
- A. Gibiansky, “Bringing HPC techniques to deep learning.” Feb. 2017. http://research.baidu.com/bringing-hpc-techniques-deep-learning/
- J. Dean, “Large scale deep learning.” Nov. 2014. https://www.slideshare.net/hustwj/cikm-keynotenov2014
- A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks.” Apr. 2014. https://arxiv.org/pdf/1404.5997.pdf
- F. Iandola, et al., “FireCaffe: near-linear acceleration of deep neural network training on compute clusters.” Jan. 2016. https://arxiv.org/abs/1511.00175
- D. Das, et al., “Distributed deep learning using synchronous stochastic gradient descent.” Feb. 2016. https://arxiv.org/abs/1602.06709
- Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks.” https://arxiv.org/pdf/1708.03888.pdf
- P. Goyal, et al., “Accurate, large minibatch SGD: training ImageNet in 1 hour.” June 2017. https://arxiv.org/abs/1706.02677
- V. Codreanu, D. Podareanu and V. Saletore, “Achieving deep learning Training in less than 40 minutes on ImageNet-1K & best accuracy and training time on ImageNet-22K & Places-365 with scale-out Intel® Xeon®/Xeon Phi™ architectures.” Sep. 2017. https://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/
- V. Codreanu, D. Podareanu, and V. Saletore, "Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train." Nov. 2017. https://arxiv.org/abs/1711.04291
- "Manufacturing package fault detection using deep learning." https://software.intel.com/en-us/articles/manufacturing-package-fault-detection-using-deep-learning
- “Honeywell launches UAV industrial inspection service, teams with Intel on innovative offering.” Sept. 2017. https://www.honeywell.com/newsroom/pressreleases/2017/09/honeywell-launches-uav-industrial-inspection-service-teams-with-intel-on-innovative-offering