
Distributed Training of Deep Networks on Amazon Web Services* (AWS)


Ravi Panchumarthy (Intel), Thomas “Elvis” Jones (AWS), Andres Rodriguez (Intel), Joseph Spisak (Intel)

Deep neural networks are capable of amazing levels of representation power, resulting in state-of-the-art accuracy in areas such as computer vision, speech recognition, natural language processing, and various data analytics domains. Deep networks require large amounts of computation to train, and the time to train is often days or weeks. Intel is optimizing popular frameworks such as Caffe*, TensorFlow*, Theano*, and others to significantly improve performance and reduce the overall time to train on a single node. In addition, Intel is adding or enhancing multinode distributed training capabilities in these frameworks to share the computational requirements across multiple nodes and further reduce the time to train. A workload that previously required days can now be trained in a matter of hours.

Amazon Web Services* (AWS) Virtual Private Cloud (VPC) provides a great environment to facilitate multinode distributed deep network training. AWS and Intel partnered to create a simple set of scripts for creating clusters that allow developers to easily deploy and train deep networks, leveraging the scale of AWS. In this article, we provide the steps to set up the AWS CloudFormation* environment and train deep networks using the Caffe* framework.

AWS CloudFormation Setup

The following steps create a VPC that has an Elastic Compute Cloud (EC2) t2.micro instance as the AWS CloudFormation cluster (cfncluster) controller. The cfncluster controller is then used to create a cluster composed of a master EC2 instance and a number of compute EC2 instances within the VPC.

Steps to deploy AWS CloudFormation and cfncluster

  1. Use the AWS Management Console to launch the AWS CloudFormation (Figure 1).


    Figure 1. CloudFormation in Amazon Web Services

  2. Click Create Stack.
  3. In the section labeled Choose a template (Figure 2), select Specify an Amazon S3 template URL, and then enter https://s3.amazonaws.com/caffecfncluster/1.0/intelcaffe_cfncluster.template. Click Next.


    Figure 2. Entering the template URL.

  4. Give the Stack a name, such as myFirstStack. Under Select a key pair, select your key pair (follow these instructions if you need to create a key pair). Leave the rest of the Parameters as they are. Click Next.
  5. Enter a Key, for example, name, and a Value, such as cfnclustercaffe.
    Note that you can give any names to the key and value; they do not have to match the key pair from the previous step.
  6. Click Next.
  7. Review the stack, check the acknowledgement box, and then click Create. Creating the stacks will take a few minutes. Wait until the status of all three created stacks is CREATE_COMPLETE.
  8. The template used in Step 3 calls two other nested templates, creating a VPC with an EC2 t2.micro instance (Figure 3). Select the stack with the EC2 instance, and then select Resources. Click the Physical ID of the cfnclusterMaster.


    Figure 3. Selecting the Physical ID from the Resources tab.

  9. This will take you to the AWS EC2 console (Figure 4). Under Description, note the VPC ID and the Subnet ID; you’ll need them in a later step. Right-click the instance, select Connect, and follow the instructions.


    Figure 4. AWS EC2 console.

  10. Once you ssh into the instance, prepare to modify the cluster’s configuration with the following commands:

    cd .cfncluster
    cp config.edit_this_cfncluster_config config
    vi config

  11. Follow the comments in the config file (opened with the final command in Step 10) to fill in the appropriate information; a sketch of a filled-in config appears after this list.

    Note that while the master node is not labeled as a compute node, it also acts as a compute node. Therefore, if the total number of nodes to be used in training is 32, set queue_size to 31 compute nodes.

    • Use the VPC ID and Subnet ID obtained in Step 9.
    • The latest custom_ami to use is ami-77aa6117; this article will be updated when newer AMIs are provided.
  12. Launch a cluster with the command cfncluster create <vpc_name_chosen_in_config_file>. This will launch more AWS CloudFormation templates; you can follow their progress on the AWS CloudFormation page in the AWS Management Console.
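
As a reference, here is a rough sketch of the kind of values the config file asks for, using the settings from the steps above. The section layout and option names below are illustrative only; the authoritative names are in the comments of the shipped config file, and all IDs are placeholders:

    [aws]
    aws_region_name = us-west-2

    [cluster default]
    key_name = <your-key-pair-name>     # EC2 key pair used to SSH into the nodes
    vpc_settings = public
    queue_size = 31                     # compute nodes; the master also computes
    custom_ami = ami-77aa6117           # latest AMI at the time of writing

    [vpc public]
    vpc_id = vpc-xxxxxxxx               # VPC ID noted in Step 9
    master_subnet_id = subnet-xxxxxxxx  # Subnet ID noted in Step 9

Once the cluster is created, cfncluster also provides status, list, and delete subcommands (for example, cfncluster status <name>) to inspect or tear down the stacks it launched.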

Sample Scripts to Train a Few Popular Networks

After the AWS CloudFormation setup is complete, if you configured the size of the cluster to be N, there will be N+1 instances created (1 master node and N compute nodes). Note that the master node is also treated as a compute node. The created cluster has a drive shared among all N+1 instances. The instances contain intelcaffe, Intel® Math Kernel Library (Intel® MKL), and sample scripts to train CIFAR-10 and GoogLeNet.
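
To work with these scripts you need a shell on the master node. The sketch below shows a minimal SSH command; the login user depends on the AMI (ec2-user is assumed here), and the key file path and address are placeholders:

    # Hypothetical example: connect to the master node with the key pair chosen earlier
    ssh -i ~/path/to/your-key-pair.pem ec2-user@<master-node-public-ip>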

To start training a sample network, configure the provided scripts: CIFAR-10 (~/scripts/aws_ic_mn_run_cifar.sh) and GoogLeNet (~/scripts/aws_ic_mn_run_googlenet.sh). Both scripts have the following variables, which need to be edited before running:

	# Set stackname_tag to the VPC name prefixed with cfncluster-. For example: cfncluster-myvpc-name. The VPC name is the same as the value for vpc_settings.
	stackname_tag=cfncluster-
	num_instances=
	aws_region=us-west-2
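
For example, a hypothetical filled-in version for a VPC setting named myvpc-name and a 32-node training run might look like the following (check the comments in the script for whether num_instances counts the master node):

	stackname_tag=cfncluster-myvpc-name
	num_instances=32
	aws_region=us-west-2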

There are a few other configurable variables for further customization in both ~/scripts/aws_ic_mn_run_cifar.sh and ~/scripts/aws_ic_mn_run_googlenet.sh.

To run CIFAR-10 training, after editing the variables mentioned above in the script, run:

cd ~/scripts/
./aws_ic_mn_run_cifar.sh

To run GoogLeNet training, after editing the variables mentioned above in the script, run:

cd ~/scripts/
./aws_ic_mn_run_googlenet.sh

The script aws_ic_mn_run_cifar.sh creates a hosts file (~/hosts.aws) by querying and retrieving the instances’ information based on the stackname_tag variable. It then updates the solver and train_val prototxt files. The script starts the data server, which provides data to the compute nodes; there is a small amount of overhead on the master node because the data server runs alongside its compute work. After the data server is launched, the distributed training is launched using the mpirun command.
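
The query itself is handled inside the script, but as a rough sketch of the idea, the running instances belonging to the stack could be discovered with the AWS CLI along these lines (the tag key and the use of private IP addresses are assumptions for illustration):

    # Hypothetical sketch: collect the private IPs of running instances tagged with the
    # cfncluster stack name and write them to the hosts file used by mpirun
    aws ec2 describe-instances --region ${aws_region} \
        --filters "Name=tag:Application,Values=${stackname_tag}" \
                  "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PrivateIpAddress" \
        --output text | tr '\t' '\n' > ~/hosts.aws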

The script aws_ic_mn_run_googlenet.sh creates a hosts file (~/hosts.aws) by querying and retrieving the instances’ information based on the stackname_tag variable. Unlike the CIFAR-10 example, where the data server provides the data, in GoogLeNet training each worker reads its own data. The script creates separate solver, train_val prototxt, and train.txt files for each worker, and then launches the job using the mpirun command.
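
In both cases the final launch is an MPI job over the generated hosts file. A simplified sketch of such a launch is shown below; the Caffe binary path, solver path, and option set are assumptions, and the actual scripts pass additional options:

    # Hypothetical sketch of the distributed launch performed by the scripts
    mpirun -n ${num_instances} -machinefile ~/hosts.aws \
        ~/caffe/build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt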

Notices

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information go to http://www.intel.com/performance.

