Overview
This article describes how to obtain, compile, and run ROME 1.0 SML on Intel® Xeon® processors and Intel® Xeon Phi™ processors. Because SML uses the output of the MAP processing phase, MAP must be run first, so this document covers both phases. Follow the instructions below to run the MAP and SML workloads.
The source and test workloads for this version of ROME can be downloaded from: http://ipccsb.dfci.harvard.edu/rome/download.html.
Introduction
ROME (Refinement and Optimization via Machine lEarning for cryo-EM) is one of the major research software packages from the Dana-Farber Cancer Institute. ROME is a parallel computing software system dedicated to high-resolution cryo-EM structure determination and data analysis, implementing advanced machine learning approaches optimized for HPC clusters. ROME 1.0 introduces SML (statistical manifold learning)-based deep classification, following MAP-based (maximum a posteriori) image alignment. More information about ROME can be found at http://ipccsb.dfci.harvard.edu/rome/index.html.
The ROME system has been optimized for both Intel® Xeon® processors and Intel® Xeon Phi™ processors. Detailed information about the underlying algorithms and optimizations can be found at http://arxiv.org/abs/1604.04539.
In this document, we used three workloads: Inflammasome, RP-a and RP-b. The workload descriptions are as follows:
- Inflammasome data: 16306 images of NLRC4/NAIP2 inflammasome with a size of 250 × 250 pixels
- RP-a: 57001 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels
- RP-b: 35407 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels
In this document, we use “ring11_all” to refer to the Inflammasome workload, “data6” to refer to the RP-a workload, and “data8” to refer to the RP-b workload.
Preliminaries
- To match these results, the Intel Xeon Phi processor machine needs to be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information. The Intel Xeon processor system does not need to be started in any special manner.
- To build this package, install the Intel® MPI Library for Linux* 5.1 (Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3), or later versions of these products, on your systems.
- Download the source ROME1.0a.tar.gz from http://ipccsb.dfci.harvard.edu/rome/download.html
- Unpack the source code to /home/users.
> cp ROME1.0a.tar.gz /home/users
> cd /home/users
> tar -xzvf ROME1.0a.tar.gz
- The workloads are provided by the Intel® Parallel Computing Center for Structural Biology (http://ipccsb.dfci.harvard.edu/) and, as noted above, can be downloaded from http://ipccsb.dfci.harvard.edu/rome/download.html. Following the EMPIAR-10069 link, download Inf_data1.* (Set 1) and rename the files to ring11_all.*, download RP_data2.* (Set 2) and rename the files to data8.*, and download RP_data4.* (Set 4) and rename the files to data6.*. The run scripts referred to below are in the file KNL_LAUNCH.tgz, which is available from the same download page.
- Copy the workloads and run scripts to /home/users. The following files are needed:
>cp ring11_all.star /home/users
>cp ring11_all.mrcs /home/users
>cp data6.star /home/users
>cp data6.mrcs /home/users
>cp data8.star /home/users
>cp data8.mrcs /home/users
>cp run_ring11_all_map_XEON.sh /home/users
>cp run_ring11_all_sml_XEON.sh /home/users
>cp run_ring11_all_map_XEONPHI.sh /home/users
>cp run_ring11_all_sml_XEONPHI.sh /home/users
>cp run_data6_map_XEON.sh /home/users
>cp run_data6_sml_XEON.sh /home/users
>cp run_data6_map_XEONPHI.sh /home/users
>cp run_data6_sml_XEONPHI.sh /home/users
>cp run_data8_map_XEON.sh /home/users
>cp run_data8_sml_XEON.sh /home/users
>cp run_data8_map_XEONPHI.sh /home/users
>cp run_data8_sml_XEONPHI.sh /home/users
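The file list above can be checked before building, which catches a missed download or rename early. The following is a minimal sketch and is not part of the ROME distribution: the function name check_rome_inputs is hypothetical, and the file names are simply those listed above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ROME): report which of the expected
# workload files and run scripts are missing from a directory.
check_rome_inputs() {
    dir="${1:-/home/users}"
    missing=0
    for w in ring11_all data6 data8; do
        # Each workload needs its .star and .mrcs files plus four run scripts.
        for f in "$w.star" "$w.mrcs" \
                 "run_${w}_map_XEON.sh" "run_${w}_sml_XEON.sh" \
                 "run_${w}_map_XEONPHI.sh" "run_${w}_sml_XEONPHI.sh"; do
            if [ ! -f "$dir/$f" ]; then
                echo "missing: $dir/$f"
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only when all 18 files are present.
    [ "$missing" -eq 0 ]
}
```

For example, `check_rome_inputs /home/users && echo "all files present"` prints one "missing:" line per absent file and the confirmation only when nothing is missing.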
Prepare the binaries for the Intel Xeon processor and the Intel Xeon Phi processor
- Set up the Intel® MPI Library and Intel® C++ Compiler environments:
> source /opt/intel/impi/<version>/bin64/mpivars.sh
> source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
> source /opt/intel/mkl/<version>/bin/mklvars.sh intel64
- Set environment variables for compilation of ROME:
>export ROME_CC=mpiicpc
- Build the binaries for the Intel Xeon processor.
>cd /home/users/ROME1.0a
>make
>mkdir bin
>mv rome_map bin/rome_map
>mv rome_sml bin/rome_sml
- Build the binaries for the Intel Xeon Phi processor.
>cd /home/users/ROME1.0a
>vi makefile
Change the FLAGS line to the following:
FLAGS := -mkl -fopenmp -O3 -xMIC-AVX512 -DNDEBUG -std=c++11
>make
>mkdir bin_knl
>mv rome_map bin_knl/rome_map
>mv rome_sml bin_knl/rome_sml
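The manual makefile edit in the Intel Xeon Phi build can also be scripted, which helps when the build is repeated. The sketch below is an illustration only: the function name set_rome_flags is hypothetical, and it assumes the makefile contains a single line beginning with "FLAGS :=", as the instruction above implies. The original makefile is kept as makefile.orig.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ROME): replace the "FLAGS :=" line of
# a makefile with new flags, keeping the original as <makefile>.orig.
set_rome_flags() {
    makefile="$1"
    flags="$2"
    cp "$makefile" "$makefile.orig" || return 1
    # Rewrite only lines that begin with "FLAGS :=" and copy all others as-is.
    # The flags reach awk through the environment, so they need no escaping.
    NEW_FLAGS="$flags" awk '
        /^FLAGS *:=/ { print "FLAGS := " ENVIRON["NEW_FLAGS"]; next }
        { print }
    ' "$makefile.orig" > "$makefile"
}
```

With this helper, the edit step becomes `set_rome_flags /home/users/ROME1.0a/makefile "-mkl -fopenmp -O3 -xMIC-AVX512 -DNDEBUG -std=c++11"`, followed by the same make, mkdir, and mv commands as above.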
Run the test workloads on the Intel Xeon processor (an Intel® Xeon® processor E5-2697 v4 is assumed by the scripts)
- Running the ROME MAP phase for these workloads:
Running workload1: ring11_all
>cd /home/users/
>sh run_ring11_all_map_XEON.sh
Running workload2: data6
>cd /home/users/
>sh run_data6_map_XEON.sh
Running workload3: data8
>cd /home/users/
>sh run_data8_map_XEON.sh
- Running the ROME SML phase for these workloads:
Running workload1: ring11_all
>cd /home/users/
>sh run_ring11_all_sml_XEON.sh
Running workload2: data6
>cd /home/users/
>sh run_data6_sml_XEON.sh
Running workload3: data8
>cd /home/users/
>sh run_data8_sml_XEON.sh
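Because SML depends on the output of MAP, the run sequence above can be wrapped in a small driver that stops at the first failure. This is a minimal sketch rather than a script from KNL_LAUNCH.tgz: the function name run_rome_workloads is hypothetical, it runs MAP and then SML per workload instead of all MAP runs first, and it assumes the run scripts return a non-zero exit status when a run fails.

```shell
#!/bin/sh
# Hypothetical driver (not shipped with ROME): for each workload, run the MAP
# script and then the SML script, stopping at the first failure so that SML
# is never started without a completed MAP run.
run_rome_workloads() {
    platform="$1"              # XEON or XEONPHI, matching the script suffix
    dir="${2:-/home/users}"    # directory holding the run scripts
    for w in ring11_all data6 data8; do
        for phase in map sml; do
            script="run_${w}_${phase}_${platform}.sh"
            echo "running $script"
            # Run from the script directory, as the manual steps do.
            ( cd "$dir" && sh "$script" ) || {
                echo "failed: $script" >&2
                return 1
            }
        done
    done
}
```

Calling `run_rome_workloads XEON` reproduces the six runs above. The same function accepts XEONPHI as the platform argument.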
Run the test workloads on the Intel Xeon Phi processor
- Running the ROME MAP phase for these workloads:
Running workload1: ring11_all
>cd /home/users/
>sh run_ring11_all_map_XEONPHI.sh
Running workload2: data6
>cd /home/users/
>sh run_data6_map_XEONPHI.sh
Running workload3: data8
>cd /home/users/
>sh run_data8_map_XEONPHI.sh
- Running the ROME SML phase for these workloads:
Running workload1: ring11_all
>cd /home/users/
>sh run_ring11_all_sml_XEONPHI.sh
Running workload2: data6
>cd /home/users/
>sh run_data6_sml_XEONPHI.sh
Running workload3: data8
>cd /home/users/
>sh run_data8_sml_XEONPHI.sh
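To compare a phase across the two platforms, it can help to record the wall-clock time of each run. The wrapper below is a rough sketch with a hypothetical name, time_phase. It assumes a date command that supports +%s (as on GNU/Linux), reports whole seconds only, and does not reproduce how the timings in this article were measured.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ROME): run a command, print its
# wall-clock time in whole seconds, and pass its exit status through.
time_phase() {
    label="$1"; shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)"
    return "$status"
}
```

For example, `time_phase "ring11_all SML" sh run_ring11_all_sml_XEONPHI.sh` runs the script unchanged and prints one summary line when it finishes.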
Performance gain seen with ROME SML
For the workloads described above, the following graph shows the speedups achieved by running this code on the Intel Xeon Phi processor. A speedup of up to 2.37x is achieved for the ring11_all workload when running on one Intel® Xeon Phi™ processor 7250 compared with one two-socket Intel Xeon processor E5-2697 v4 system. The data used below were stored on a Lustre* file system.
Testing platform configuration:
Intel Xeon processor E5-2697 v4: BDW-EP node with dual sockets, 18 cores/socket HT enabled @2.3 GHz 145W (Intel Xeon processor E5-2697 v4 w/128 GB RAM), Red Hat Enterprise Linux Server release 6.7 (Santiago)
Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250 68 core, 272 threads, 1400 MHz core freq. MCDRAM 16 GB 7.2 GT/s, DDR4 96 GB 2400 MHz, Red Hat Enterprise Linux Server release 6.7 (Santiago), quad cluster mode, MCDRAM cache mode.