Summary

BigDL is a distributed deep learning library for Apache Spark*. With BigDL, users can write their deep learning applications as standard Spark programs, which can run directly on top of existing Spark or Hadoop* clusters. This can enable deep learning functionalities on existing Hadoop/Spark clusters and analyze data that is already present in HDFS*, HBase*, Hive*, etc. Other common features of BigDL include:

Rich, deep learning support. Modeled after Torch*, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high-level neural networks; in addition, users can load pre-trained Caffe* or Torch models into Spark programs using BigDL.
Extremely high performance. To achieve high performance, BigDL uses Intel® Math Kernel Library (Intel® MKL) and multi-threaded programming in each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open source Caffe, Torch, or TensorFlow on a single-node Intel® Xeon® processor (that is, comparable with a mainstream GPU).
Efficiently scale out. BigDL can efficiently scale out to perform data analytics at big data scale, by leveraging Apache Spark (a lightning-fast distributed data processing framework), as well as efficient implementations of synchronous SGD and all-reduce communications on Spark.

Figure 1 shows a basic overview of how a BigDL program is executed on an existing Spark cluster. With the help of a cluster manager and an application master process or a driver program, Spark tasks are distributed across the Spark worker nodes or containers (executors). BigDL enables faster execution of Spark tasks using Intel MKL.

Figure 1.Basic overview of BigDL program running on Spark* cluster.

Experimental Setup

Virtual Hadoop Cluster

The Cloudera* administrator training guide for Apache Hadoop was referenced for setting up an experimental four-node virtual Hadoop cluster with YARN* as a resource manager. Standalone Spark and Spark on YARN were both installed on the cluster.

Virtual Machine	Node_1	Node_2	Node_3	Node_4
Services	NameNode	Secondary NameNode	ResourceManager	JobHistoryServer
	NodeManager	NodeManager	NodeManager	NodeManager
	DataNode	DataNode	DataNode	DataNode
	Spark Master	Spark Worker	Spark Worker	Spark Worker
	Spark Worker

Physical Machine (Host) – System Configuration

System/Host Processor	Intel® Xeon® processor E7-8890 v4 @ 2.20 GHz (4 sockets)
Total Physical Cores	96
Host Memory	512 GB DDR-1600 MHz
Host OS	Linux*; version 3.10.0-327.el7.x86_64
Virtual Guests	4

Virtual Machine Guest - System Configuration

System/Guest Processor	Intel® Xeon® processor E7-8890 v4 @ 2.20 GHz
Physical Cores	18
Host Memory	96 GB DDR-1600 MHz
Host OS	Linux*; version 2.6.32-642.13.1.el6.x86_64
Java version	1.8.0_121
Spark version	1.6
Scala version	2.10.5
CDH version	5.10

BigDL Installation

Prerequisites

Java* Java is required for building BigDL. The latest version of Java can be downloaded from the Oracle website. It is highly recommended to use Java 8 when running with Spark 2.0; otherwise, you may observe performance issues.

 export JAVA_HOME=/usr/java/jdk1.8.0_121/

Maven* Apache Maven as a software management tool is required for downloading and building BigDL. The latest version of Maven can be downloaded and installed from the Maven website.

wget http://mirrors.ibiblio.org/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
export M2_HOME=/home/training/Downloads/apache-maven-3.3.9
export PATH=${M2_HOME}/bin:$PATH
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

When compiling with Java 7, you need to add the option “-XX:MaxPermSize=1G” to avoid OutOfMemoryError while using BigDL with Java 7.

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m -XX:MaxPermSize=1G"

Building BigDL

Download BigDL. BigDL source code is available at GitHub*.

git clone https://github.com/intel-analytics/BigDL.git

It is highly recommended that you build BigDL using the make-dist.sh script.

bash make-dist.sh

This creates a directory dist with utility script (${BigDL_HOME}/dist/bin/bigdl.sh) to set up the environment for BigDL and create packaged JAR* files with required dependencies for Spark, Python*, and other supporting tools and libraries.

By default, make-dist.sh uses Scala* 2.10 for Spark 1.5.x or 1.6.x, and Scala 2.11 for Spark 2.0. Alternative ways to build BigDL are published on the BigDL build page.

BigDL and Spark Environment

BigDL can be used with a variety of local and cluster environments using Java, standalone Spark, Spark with Hadoop YARN, or Amazon EC2 cloud. Here, we use LeNet* as an example to explain and differentiate each mode. Further details about LeNet model and usage is explained in the following sections.

Local Java application - In this mode, the application can be launched with BigDL using the local Java environment.

${BigDL_HOME}/dist/bin/bigdl.sh -- java \
  -cp ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
   com.intel.analytics.bigdl.models.lenet.Train \
   -f $MNIST_DIR \
   --core 8 --node 1 \
   --env local -b 512 -e 1

Spark standalone - In this mode, Spark’s own cluster manager is used to allocate resources across applications running with BigDL.

Local environment - In this mode a BigDL application is launched locally using the –master=local[$NUM-OF_THREADS] and --env local flag. For example, LeNet model training can be started on a local node as follows:

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master local[16] \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Train \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $MNIST_DIR \
  --core 16 --node 1 \
  -b 512 -e 1 --env local

Spark cluster environment - In this mode a BigDL application is launched in a cluster environment. Depending on where the driver is deployed there are two ways in which BigDL can be used in a Spark cluster environment.

Spark standalone cluster in client deploy mode—In this mode the driver program is launched locally as an external client. This is the default mode, where application progress can be viewed on the client.

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master spark://Node_1:7077 \
   --deploy-mode client --executor-cores 8 --executor-memory 4g --total-executor-cores 32 \
   --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
   --class com.intel.analytics.bigdl.models.lenet.Train \
   ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
   -f $MNIST_DIR \
   --core 8 --node 4 \
   -b 512 -e 1 --env spark

Spark standalone cluster in cluster deploy mode - In this mode the driver is launched on one of the worker nodes. You can use webUI* or Spark log files to track the progress of your application.

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master spark://Node_1:7077 \
  --deploy-mode cluster --executor-cores 8 --executor-memory 4g \
  --driver-cores 1 --driver-memory 4g --total-executor-cores 33 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Train \
   ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
   -f $MNIST_DIR \
   --core 8 --node 4 \
   -b 512 -e 1 --env spark

Spark with YARN as a cluster manager - In this mode Hadoop’s YARN cluster manager is used to allocate resources across applications running with BigDL.

Client deployment mode - In this mode the spark driver runs on the host where the job is submitted.

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn \
  --deploy-mode client --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 4g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Train \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $MNIST_DIR \
  --core 16 --node 4 \
  -b 512 -e 1 --env spark

Cluster deployment mode - In this mode the spark driver runs on the cluster host chosen by YARN.

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn --deploy-mode cluster \
  --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 4g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Train \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $MNIST_DIR \
  --core 16 --node 4 \
  -b 512 -e 1 --env spark

Running on Amazon EC2 - The BigDL team has made available a public Amazon Machine Image* (AMI*) file for experimenting with BigDL with Spark on EC2. Detailed information on steps to run BigDL examples on Spark in the Amazon EC2 environment is provided on GitHub.

BigDL Sample Models

This tutorial shows training and testing for two sample models, LeNet and VGG*, to demonstrate usage of BigDL for distributed deep learning on Apache Spark.

LeNet

LeNet 5 is a classical CNN model used in digital number classification. For detailed information, please refer to http://yann.lecun.com/exdb/lenet/.

The MNIST* database can be downloaded from http://yann.lecun.com/exdb/mnist/. We downloaded images and labels for both training and validation data.

A JAR file for training and testing the sample LeNet model is created as part of the BigDL installation. If not yet created, please refer to the section on Building BigDL.

Training the LeNet Model

An example command to train the LeNet model using BigDL with Spark running on YARN can be given as follows:

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn \
  --deploy-mode cluster --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 4g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Train \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $MNIST_DIR \
  --core 16 --node 4 \
  -b 512 -e 5 --env spark --checkpoint ~/models

Usage:
LeNet parameters
  -f <value> | --folder <value>
        where you put the MNIST data
  -b <value> | --batchSize <value>
        batch size
  --model <value>
        model snapshot location
  --state <value>
        state snapshot location
  --checkpoint <value>
        where to cache the model
  -r <value> | --learningRate <value>
        learning rate
  -e <value> | --maxEpoch <value>
        epoch numbers
  -c <value> | --core <value>
        cores number on each node
  -n <value> | --node <value>
        node number to train the model
  -b <value> | --batchSize <value>
        batch size (currently this value should be multiple of (–-core * –-node)
  --overWrite
        overwrite checkpoint files
  --env <value>
        execution environment
YARN parameters
                --master yarn --deploy-mode cluster : Using spark with YARN cluster manager in cluster deployment mode
       --executor-cores 16 --num-executors 4: This sets the number of executors and cores per executor for YARN to match with --core and –-node parameters for LeNet training. Currently this is a known issue and hence required for successful cluster training with BigDL using Spark

Testing the LeNet Model

An example command to test the LeNet model using BigDL with Spark running on YARN can be given as follows:

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn --deploy-mode cluster \
  --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 4g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.lenet.Test \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $MNIST_DIR \
  --core 16 --nodeNumber 4 \
  -b 512 --env spark --model ~/models/model.591

Usage:
    -f <value> | --folder <value>
       where you put the MNIST data
  --model <value>
        model snapshot location (model.iteration#)
  -c <value> | --core <value>
        cores number on each node
  -n <value> | --nodeNumber <value>
        nodes number to train the model
  -b <value> | --batchSize <value>
        batch size
  --env <value>
        execution environment

For quick verification, results for model accuracy can be seen as follows:

yarn logs -applicationId application_id | grep accuracy

Refer to the Hadoop cluster WebUI for additional information about this training.

VGG model on CIFAR-10*

This example demonstrates the use of BigDL to train and test a VGG-like model on a CIFAR-10* dataset. Details about this model can be found here.

You can download the binary version of the CIFAR-10 dataset from here.

A JAR file for training and testing the sample VGG model is created as part of the BigDL installation. If not yet created, please refer to the section Building BigDL.

Training the VGG Model

An example command to train the VGG model on the CIFAR-10 dataset using BigDL with Spark running on YARN can be given as follows:

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn --deploy-mode cluster \
  --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 16g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.vgg.Train \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $VGG_DIR \
  --core 16 --node 4 \
  -b 512 -e 5 --env spark --checkpoint ~/models

Usage:
  -f <value> | --folder <value>
        where you put the Cifar10 data
  --model <value>
        model snapshot location
  --state <value>
        state snapshot location
  --checkpoint <value>
        where to cache the model and state
  -c <value> | --core <value>
        cores number on each node
  -n <value> | --node <value>
        node number to train the model
  -e <value> | --maxEpoch <value>
        epoch numbers
  -b <value> | --batchSize <value>
        batch size
  --overWrite
        overwrite checkpoint files
  --env <value>
        execution environment

Testing the VGG Model

An example command to test the VGG model on the CIFAR-10 dataset using BigDL with Spark running on YARN can be given as follows:

${BigDL_HOME}/dist/bin/bigdl.sh -- spark-submit --master yarn \
  --deploy-mode cluster --executor-cores 16 --executor-memory 64g \
  --driver-cores 1 --driver-memory 16g --num-executors 4 \
  --driver-class-path ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.bigdl.models.vgg.Test \
  ${BigDL_HOME}/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f $VGG_DIR \
  --core 16 --nodeNumber 4 \
  -b 512 --env spark --model ~/models/model.491

Usage:
  -f <value> | --folder <value>
        where you put the Cifar10 data
  --model <value>
        model snapshot location
  -c <value> | --core <value>
        cores number on each node
  -n <value> | --nodeNumber <value>
        nodes number to train the model
  -b <value> | --batchSize <value>
        batch size
  --env <value>
        execution environment

Detailed steps for training and testing other sample models, like a recurrent neural network (RNN), a residual network (ResNet), Inception*, Autoencoder*, and so on using BigDL, are published on the BigDL GitHub site.

BigDL can also be used to load pre-trained Torch and Caffe models into the Spark program for classification or prediction. One such example is shown on the BigDL GitHub site.

Performance Scaling

Figure 2 shows performance scaling of training VGG and ResNet models using BigDL on Spark with an increasing number of cores and nodes (virtual nodes as per the current setup). Here, we compare the average time taken to train both models on the CIFAR-10 dataset for five epochs.

Figure 2:Performance scaling of VGG and ResNet with BigDL on Spark running with YARN.

Conclusion

In this article, we validated the steps to install and use BigDL for training and testing some of the commonly used deep neural network models on Apache Spark using a four-node virtual Hadoop cluster. We saw how BigDL can easily enable deep learning functionalities on existing Hadoop/Spark clusters. The total time to train a model can be significantly reduced; first, with the help of the Intel MKL and multi-threaded programming in each Spark task, and then by distributing Spark tasks across multiple nodes on a Hadoop/Spark cluster.