
Distributed Training of Deep Networks on Amazon Web Services* (AWS)


Ravi Panchumarthy (Intel), Thomas “Elvis” Jones (AWS), Andres Rodriguez (Intel), Joseph Spisak (Intel)

Deep neural networks are capable of amazing levels of representation power, resulting in state-of-the-art accuracy in areas such as computer vision, speech recognition, natural language processing, and various data analytics domains. Deep networks require large amounts of computation to train, and the time to train is often days or weeks. Intel is optimizing popular frameworks such as Caffe*, TensorFlow*, Theano*, and others to significantly improve performance and reduce the overall time to train on a single node. In addition, Intel is adding or enhancing multinode distributed training capabilities in these frameworks to share the computational requirements across multiple nodes and further reduce the time to train. A workload that previously required days can now be trained in a matter of hours.

Amazon Web Services* (AWS) Virtual Private Cloud (VPC) provides a great environment to facilitate multinode distributed deep network training. AWS and Intel partnered to create a simple set of scripts for creating clusters that allow developers to easily deploy and train deep networks, leveraging the scale of AWS. In this article, we provide the steps to set up the AWS CloudFormation* environment and train deep networks using the Caffe* framework.

AWS CloudFormation Setup

The following steps create a VPC that has an Elastic Compute Cloud (EC2) t2.micro instance as the AWS CloudFormation cluster (cfncluster) controller. The cfncluster controller is then used to create a cluster composed of a master EC2 instance and a number of compute EC2 instances within the VPC.

Steps to deploy AWS CloudFormation and cfncluster

  1. Use the AWS Management Console to launch the AWS CloudFormation (Figure 1).


    Figure 1. CloudFormation in Amazon Web Services

  2. Click Create Stack.
  3. In the section labeled Choose a template (Figure 2), select Specify an Amazon S3 template URL, and then enter https://s3.amazonaws.com/caffecfncluster/1.0/intelcaffe_cfncluster.template. Click Next.


    Figure 2. Entering the template URL.

  4. Give the Stack a name, such as myFirstStack. Under Select a key pair, select your key pair (follow these instructions if you need to create a key pair). Leave the rest of the Parameters as they are. Click Next.
  5. Enter a Key, for example, name, and a Value, such as cfnclustercaffe.
    Note that you can give any names to the key and value; they do not have to match the key pair from the previous step.
  6. Click Next.
  7. Review the stack, check the acknowledgement box, and then click Create. Creating the stacks will take a few minutes. Wait until the status of all three created stacks is CREATE_COMPLETE.
  8. The template used in Step 3 calls two other nested templates, creating a VPC with an EC2 t2.micro instance (Figure 3). Select the stack with the EC2 instance, and then select Resources. Click the Physical ID of the cfnclusterMaster.


    Figure 3. Selecting the Physical ID from the Resources tab.

  9. This will take you to the AWS EC2 console (Figure 4). Under Description, note the VPC ID and the Subnet ID; you’ll need them in a later step. Right-click the instance, select Connect, and follow the instructions.


    Figure 4. AWS EC2 console.

  10. Once you ssh into the instance, prepare to modify the cluster’s configuration with the following commands:

    cd .cfncluster
    cp config.edit_this_cfncluster_config config
    vi config

  11. Follow the comments in the config file (opened with the final command in Step 10) to fill in the appropriate information; a sketch of a filled-in config appears after this list.

    Note that while the master node is not labeled as a compute node, it also acts as a compute node. Therefore, if the total number of nodes to be used in training is 32, set queue_size to 31 compute nodes.

    • Use the VPC ID and Subnet ID obtained in Step 9.
    • The latest custom_ami to use is ami-77aa6117; this article will be updated when newer AMIs are provided.
  12. Launch a cluster with the command cfncluster create <vpc_name_chosen_in_config_file>. This will launch more AWS CloudFormation templates; you can follow their progress on the AWS CloudFormation page in the AWS Management Console.
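
As a reference, here is a rough sketch of the kind of values the config file asks for, using the settings from the steps above. The section layout and option names below are illustrative only; the authoritative names are in the comments of the shipped config file, and all IDs are placeholders:

    [aws]
    aws_region_name = us-west-2

    [cluster default]
    key_name = <your-key-pair-name>     # EC2 key pair used to SSH into the nodes
    vpc_settings = public
    queue_size = 31                     # compute nodes; the master also computes
    custom_ami = ami-77aa6117           # latest AMI at the time of writing

    [vpc public]
    vpc_id = vpc-xxxxxxxx               # VPC ID noted in Step 9
    master_subnet_id = subnet-xxxxxxxx  # Subnet ID noted in Step 9

Once the cluster is created, cfncluster also provides status, list, and delete subcommands (for example, cfncluster status <name>) to inspect or tear down the stacks it launched.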

Sample Scripts to Train a Few Popular Networks

After the AWS CloudFormation setup is complete, if you configured the size of the cluster to be N, there will be N+1 instances created (1 master node and N compute nodes). Note that the master node is also treated as a compute node. The created cluster has a drive shared among all N+1 instances. The instances contain intelcaffe, Intel® Math Kernel Library (Intel® MKL), and sample scripts to train CIFAR-10 and GoogLeNet.
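
To work with these scripts you need a shell on the master node. The sketch below shows a minimal SSH command; the login user depends on the AMI (ec2-user is assumed here), and the key file path and address are placeholders:

    # Hypothetical example: connect to the master node with the key pair chosen earlier
    ssh -i ~/path/to/your-key-pair.pem ec2-user@<master-node-public-ip>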

To start training a sample network, configure the provided scripts: CIFAR-10 (~/scripts/aws_ic_mn_run_cifar.sh) and GoogLeNet (~/scripts/aws_ic_mn_run_googlenet.sh). Both scripts have the following variables, which need to be edited before running:

	# Set stackname_tag to the VPC name prefixed with cfncluster-. For example: cfncluster-myvpc-name. The VPC name is the same as the value for vpc_settings.
	stackname_tag=cfncluster-
	num_instances=
	aws_region=us-west-2
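
For example, a hypothetical filled-in version for a VPC setting named myvpc-name and a 32-node training run might look like the following (check the comments in the script for whether num_instances counts the master node):

	stackname_tag=cfncluster-myvpc-name
	num_instances=32
	aws_region=us-west-2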

There are a few other configurable variables for further customization in both ~/scripts/aws_ic_mn_run_cifar.sh and ~/scripts/aws_ic_mn_run_googlenet.sh.

To run CIFAR-10 training, after editing the variables mentioned above in the script, run:

cd ~/scripts/
./aws_ic_mn_run_cifar.sh

To run GoogLeNet training, after editing the variables mentioned above in the script, run:

cd ~/scripts/
./aws_ic_mn_run_googlenet.sh

The script aws_ic_mn_run_cifar.sh creates a hosts file (~/hosts.aws) by querying and retrieving the instances’ information based on the stackname_tag variable. It then updates the solver and train_val prototxt files. The script starts the data server, which provides data to the compute nodes; there is a small amount of overhead on the master node because the data server runs alongside its compute work. After the data server is launched, the distributed training is launched using the mpirun command.
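
The query itself is handled inside the script, but as a rough sketch of the idea, the running instances belonging to the stack could be discovered with the AWS CLI along these lines (the tag key and the use of private IP addresses are assumptions for illustration):

    # Hypothetical sketch: collect the private IPs of running instances tagged with the
    # cfncluster stack name and write them to the hosts file used by mpirun
    aws ec2 describe-instances --region ${aws_region} \
        --filters "Name=tag:Application,Values=${stackname_tag}" \
                  "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PrivateIpAddress" \
        --output text | tr '\t' '\n' > ~/hosts.aws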

The script aws_ic_mn_run_googlenet.sh creates a hosts file (~/hosts.aws) by querying and retrieving the instances’ information based on the stackname_tag variable. Unlike the CIFAR-10 example, where the data server provides the data, in GoogLeNet training each worker reads its own data. The script creates separate solver, train_val prototxt, and train.txt files for each worker, and then launches the job using the mpirun command.
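
In both cases the final launch is an MPI job over the generated hosts file. A simplified sketch of such a launch is shown below; the Caffe binary path, solver path, and option set are assumptions, and the actual scripts pass additional options:

    # Hypothetical sketch of the distributed launch performed by the scripts
    mpirun -n ${num_instances} -machinefile ~/hosts.aws \
        ~/caffe/build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt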

Notices

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information go to http://www.intel.com/performance.

