Deep neural network (DNN) training is computationally intensive and can take days or weeks on modern computing platforms. In the recent article, Single-node Caffe Scoring and Training on Intel® Xeon® E5 Family, we demonstrated a tenfold performance increase of the Caffe* framework on the AlexNet* topology and reduced the training time to 5 days on a single node. Intel continues to deliver on the machine learning vision outlined in Pradeep Dubey’s Blog, and in this technical preview, we demonstrate how the training time for Caffe can be reduced from days to hours in a multi-node, distributed-memory environment.
Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and one of the most popular community frameworks for image recognition. Caffe is often used as a benchmark together with AlexNet*, a neural network topology for image recognition, and ImageNet*, a database of labeled images.
The Caffe framework does not support multi-node, distributed-memory systems out of the box, and adding that support requires extensive changes to the code. We perform strong scaling of the synchronous minibatch stochastic gradient descent (SGD) algorithm with the help of the Intel® MPI Library: the computation for one iteration is spread across multiple nodes, such that the multi-threaded, multi-node parallel implementation is equivalent to the single-node, single-threaded serial implementation.
We use three approaches to scale the computation: data parallelism, model parallelism, and hybrid parallelism. In data parallelism, every node holds a full copy of the weights and processes its own share of each minibatch, which requires communicating weights and weight gradients. In model parallelism, the model (the weights) is partitioned across nodes, so that each node owns a part of the weights and processes all the data points in a minibatch; this requires communicating activations and activation gradients instead. Hybrid parallelism combines the two, applying a different scheme to different parts of the network.
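To make the data-parallel case concrete (a generic illustration, not notation taken from the package): a minibatch B is split into disjoint shards B_1, …, B_N, one per node. Each node computes the weight gradient over its own shard, the partial gradients are summed across nodes (for example, with an MPI allreduce), and every node then applies the same update

w ← w − (η / |B|) · Σ_{n=1..N} Σ_{i ∈ B_n} ∇L(w; x_i)

where η is the learning rate and L is the loss for a single example x_i. This is arithmetically the same update a single node would compute on the full minibatch, which is what makes the synchronous multi-node implementation equivalent to the single-node one.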
With this additional level of distributed parallelization, we trained AlexNet on the full ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012) dataset and reached 80% top-5 accuracy in just over 5 hours on a 64-node cluster of systems based on the Intel® Xeon® processor E5 family.
Getting Started
While we are working to incorporate the new functionality outlined in this article into future versions of Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL), you can use the technology preview package attached to this article to reproduce the demonstrated performance results and even train AlexNet on your own dataset. The preview includes both the single-node and the multi-node implementations. Note that the current implementation is limited to the AlexNet topology and may not work with other popular DNN topologies.
The package supports the AlexNet topology and introduces the ‘intel_alexnet’ and ‘mpi_intel_alexnet’ models. Both are similar to ‘bvlc_alexnet’ but add two new layers, ‘IntelPack’ and ‘IntelUnpack’, as well as optimized convolution, pooling, and normalization layers; the ‘mpi_intel_alexnet’ model additionally provides MPI-based implementations of all these layers. We also changed the validation parameters to facilitate vectorization: the validation minibatch size is increased from 50 to 256 and the number of test iterations is reduced from 1,000 to 200, which keeps the total number of images used in the validation run roughly constant (see the illustrative excerpt below). The package provides these model files:
- models/intel_alexnet/deploy.prototxt
- models/intel_alexnet/solver.prototxt
- models/intel_alexnet/train_val.prototxt
- models/mpi_intel_alexnet/deploy.prototxt
- models/mpi_intel_alexnet/solver.prototxt
- models/mpi_intel_alexnet/train_val.prototxt
- models/mpi_intel_alexnet/train_val_shared_db.prototxt
- models/mpi_intel_alexnet/train_val_split_db.prototxt
Both the ‘intel_alexnet’ and ‘mpi_intel_alexnet’ models allow you to train and test on the ILSVRC-2012 dataset.
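For reference, the validation settings described above map to standard Caffe prototxt fields such as the following. This is only an illustrative sketch with placeholder values for everything except the minibatch size and iteration count quoted above; the actual files shipped in the package may differ:

# solver.prototxt (excerpt)
test_iter: 200        # 200 validation iterations
test_interval: 1000   # placeholder; use the interval shipped with the package

# train_val.prototxt (excerpt): validation data layer
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "/path/to/ilsvrc12_val_lmdb"   # placeholder path to the validation database
    batch_size: 256                        # validation minibatch size
    backend: LMDB
  }
}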
To start working with the package, ensure that all the regular Caffe dependencies and Intel software tools listed in the System Requirements and Limitations section are installed on your system.
Running on Single Node
- Unpack the package.
- Specify the paths to the database, snapshot location, and image mean file in these ‘intel_alexnet’ model files (illustrated in the excerpt after these steps):
- models/intel_alexnet/deploy.prototxt
- models/intel_alexnet/solver.prototxt
- models/intel_alexnet/train_val.prototxt
- Set up a runtime environment for the software tools listed in the System Requirements and Limitations section.
- Add the path to ./build/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
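For example, if the package was unpacked into the current directory:
$> export LD_LIBRARY_PATH=$(pwd)/build/lib:$LD_LIBRARY_PATH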
- Set the threading environment as follows:
$> export OMP_NUM_THREADS=<N_processors * N_cores>
$> export KMP_AFFINITY=compact,granularity=fine
Note: OMP_NUM_THREADS must be an even number equal to at least 2.
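For example, on a hypothetical two-socket system with 18 cores per socket (36 physical cores in total):
$> export OMP_NUM_THREADS=36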
- Run timing on a single node using this command:
$> ./build/tools/caffe time \
-iterations <number of iterations> \
--model=models/intel_alexnet/train_val.prototxt
- Run training on a single node using this command:
$> ./build/tools/caffe train \
--solver=models/intel_alexnet/solver.prototxt
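For illustration, the database, image mean file, and snapshot paths mentioned in the steps above typically live in fields like the ones below. This is a sketch using standard Caffe prototxt fields and placeholder paths; adjust it to the actual entries in the package’s files:

# train_val.prototxt (excerpt): training data layer
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mean_file: "/path/to/imagenet_mean.binaryproto"   # image mean file
  }
  data_param {
    source: "/path/to/ilsvrc12_train_lmdb"            # training database
    batch_size: 256
    backend: LMDB
  }
}

# solver.prototxt (excerpt)
net: "models/intel_alexnet/train_val.prototxt"
snapshot_prefix: "/path/to/snapshots/caffe_alexnet_train"   # snapshot location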
Running on Cluster
- Unpack the package.
- Set up a runtime environment for the software tools listed in the System Requirements and Limitations section.
- Add the path to ./build-mpi/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
- Set the NP environment variable to the number of MPI ranks to be used, as follows:
$> export NP=<number-of-mpi-ranks>
Note: the best performance is achieved with one MPI rank per node.
- Create a node file named x${NP}.hosts in the root directory of the application. For instance, on a cluster managed by PBS*, run the following command:
$> cat $PBS_NODEFILE > x${NP}.hosts
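If your cluster does not provide a scheduler-generated node file, you can create x${NP}.hosts by hand. The host names below are placeholders for your own compute nodes, assuming NP is set to 4 in this example:
$> printf '%s\n' node01 node02 node03 node04 > x${NP}.hosts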
- Specify the paths to the database, snapshot location, and image mean file in the following ‘mpi_intel_alexnet’ model files:
- models/mpi_intel_alexnet/deploy.prototxt
- models/mpi_intel_alexnet/solver.prototxt
- models/mpi_intel_alexnet/train_val_shared_db.prototxt
Note: on some system configurations, a shared-disk file system may become a performance bottleneck. In that case, pre-distributing the image database to the compute nodes is recommended to achieve the best performance; refer to the readme files included with the package for instructions.
- Set the threading environment as follows:
$> export OMP_NUM_THREADS=<N_processors * N_cores>
$> export KMP_AFFINITY=compact,granularity=fine
Note: OMP_NUM_THREADS must be an even number equal to at least 2.
- Run timing using this command:
$> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
./build-mpi/tools/caffe time \
-iterations <number of iterations> \
--model=models/mpi_intel_alexnet/train_val.prototxt
- Run training using this command:
$> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
./build-mpi/tools/caffe train \
--solver=models/mpi_intel_alexnet/solver.prototxt
System Requirements and Limitations
The package has the same software dependencies as non-optimized Caffe:
- boost* 1.53.0
- opencv* 2.4.9
- protobuf* 3.0.0-beta1
- glog* 0.3.4
- gflags* 2.1.2
- lmdb* 0.9.16
- leveldb* 1.18
- hdf5* 1.8.15
- Red Hat Enterprise Linux* 6.5 or higher
Intel software tools:
- Intel MKL 11.3 or higher
- Intel MPI Library 5.0
Limitations:
This software was validated with the AlexNet topology only and may not work with other topologies or configurations.
Support
Please direct questions and comments on this package to intel.mkl@intel.com.