I. Overview
This article provides a recipe for how to obtain, compile, and run an optimized version of relion-1.4 on Intel® Xeon® processors and Intel® Xeon Phi™ processors.
The source for this version of relion-1.4 can be downloaded from: http://www2.mrc-lmb.cam.ac.uk/relion/index.php
II. Introduction
RELION is an image-processing software package that is widely used to obtain high-resolution cryo-EM structures. It uses a Bayesian MAP+EM algorithm, which provides more reliable structures than existing methods and is better suited to heterogeneous data. RELION is distributed under a GPL license; it is completely free, open-source software for both academia and industry. The code is written in C++. Parallelization is achieved through MPI and Pthreads. More information about RELION is available at http://www2.mrc-lmb.cam.ac.uk/relion/index.php
This project optimizes the performance of the auto-refine part of RELION on both Intel® Xeon® processors and Intel® Xeon Phi™ processors.
Optimizations in this package include:
- Align data on 64-byte boundaries. This alignment yields about a 10% performance improvement for this workload.
- Vectorize the hotspot loops. The first hotspot loop in particular is executed very frequently at run time. Vectorizing the two hotspot loops yields more than a 30% performance improvement for this workload.
- RELION is a memory-bound application, so taking advantage of the fast MCDRAM available on the Intel® Xeon Phi™ processor 7250 should improve performance. Using the MCDRAM in cache mode, we see about a 10% performance boost for this workload.
III. Preliminaries
- To match these results, the Intel® Xeon Phi™ processor machine needs to be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information.
- To build this package, install the Intel® MPI Library for Linux* 5.1 (Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3), or later versions, on your system.
- Download relion-1.4.tar.bz2 from http://www2.mrc-lmb.cam.ac.uk/relion/index.php
Set up the Intel® MPI Library and Intel® C++ Compiler environments:
> source /opt/intel/impi/<version>/bin64/mpivars.sh
> source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
Unpack the source code to /home/users.
> cp relion-1.4.tar.bz2 /home/users
> cd /home/users
> tar -xjvf relion-1.4.tar.bz2
> cd ./relion-1.4
- To get the test workload, please contact Yanan Zhu at Peking University <ynzhu@pku.edu.cn> and request the version used for the Intel KNL Recipes.
Copy the workload to your home directory and unpack it:
> cp relion-autorefine-5000.tar.gz /home/users
> cd /home/users
> tar -xzvf relion-autorefine-5000.tar.gz
IV. Add optimized code into relion
Overload operators new and delete of class MultidimArray in src/multidim_array.h:
> cd /home/users/relion-1.4
> vi src/multidim_array.h
Insert the following optimized code before line 496:
void *operator new(size_t size) { return _mm_malloc(size, 64); }
void operator delete(void *p) { _mm_free(p); }
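The effect of a class-level operator new like the one above can be checked in isolation. The following is a minimal sketch, not RELION code: `AlignedBox` and `is_aligned_64` are hypothetical names, and POSIX `posix_memalign` stands in for `_mm_malloc` so that the example does not depend on x86 intrinsics headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a class such as MultidimArray.
struct AlignedBox {
    double payload[8];

    // Class-level operator new: every heap-allocated AlignedBox starts on a
    // 64-byte boundary. posix_memalign is an assumption of this sketch; the
    // article itself uses _mm_malloc(size, 64).
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        if (posix_memalign(&p, 64, size) != 0) throw std::bad_alloc();
        return p;
    }

    // Matching operator delete: posix_memalign memory is released with free.
    static void operator delete(void* p) noexcept { std::free(p); }
};

// Returns true when the address is a multiple of 64.
inline bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Calling `new AlignedBox()` and passing the result to `is_aligned_64` should return true. Note that an overload of this kind aligns the object itself; any buffer the object allocates internally is aligned only if that allocation is also made with an aligned allocator.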
Vectorize the first hotspot loop
> cd /home/users/relion-1.4
> vi src/multidim_array.h
Replace the original code with the optimized code.
Original code (about line 930):
for (long int l = 0; l < Ndim; l++)
    for (long int k = 0; k < Zdim; k++)
        for (long int i = 0; i < Ydim; i++)
            for (long int j = 0; j < Xdim; j++)
            {
                T val;
                if (k >= ZSIZE(*this))
                    val = 0;
                else if (i >= YSIZE(*this))
                    val = 0;
                else if (j >= XSIZE(*this))
                    val = 0;
                else
                    val = DIRECT_A3D_ELEM(*this, k, i, j);
                new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
            }
Optimized code is:
if ( (ZSIZE(*this)<= Zdim) && (YSIZE(*this)<= Ydim) && (XSIZE(*this)<= Xdim) )
{
    for (long int l = 0; l < Ndim; l++)
        for (long int k = 0; k < Zdim; k++)
            for (long int i = 0; i < Ydim; i++)
            {
                #pragma simd
                for (long int j = 0; j < Xdim; j++)
                {
                    new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = 0;
                }
            }
    for (long int l = 0; l < Ndim; l++)
        for (long int k = 0; k < ZSIZE(*this); k++)
            for (long int i = 0; i < YSIZE(*this); i++)
            {
                #pragma simd
                for (long int j = 0; j < XSIZE(*this); j++)
                {
                    new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = DIRECT_A3D_ELEM(*this, k, i, j);
                }
            }
}
else
{
    for (long int l = 0; l < Ndim; l++)
        for (long int k = 0; k < Zdim; k++)
            for (long int i = 0; i < Ydim; i++)
                for (long int j = 0; j < Xdim; j++)
                {
                    T val;
                    if (k >= ZSIZE(*this))
                        val = 0;
                    else if (i >= YSIZE(*this))
                        val = 0;
                    else if (j >= XSIZE(*this))
                        val = 0;
                    else
                        val = DIRECT_A3D_ELEM(*this, k, i, j);
                    new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                }
}
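The idea behind this rewrite is to move the bounds tests out of the innermost loop so that the loop body becomes a plain, branch-free copy that a compiler can vectorize. The sketch below shows the same pattern on a standalone type. `Volume`, `resize_branchy`, and `resize_split` are hypothetical names, not RELION code, and the portable `#pragma omp simd` (enabled only when OpenMP is on) is an assumed substitute for the Intel-specific `#pragma simd`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat 3D volume; element (k, i, j) lives at (k*ydim + i)*xdim + j.
struct Volume {
    std::size_t zdim, ydim, xdim;
    std::vector<float> data;  // zero-initialized on construction

    Volume(std::size_t z, std::size_t y, std::size_t x)
        : zdim(z), ydim(y), xdim(x), data(z * y * x, 0.0f) {}

    std::size_t index(std::size_t k, std::size_t i, std::size_t j) const {
        return (k * ydim + i) * xdim + j;
    }
};

// Reference version: the bounds tests sit inside the innermost loop.
Volume resize_branchy(const Volume& src, std::size_t Z, std::size_t Y, std::size_t X) {
    Volume dst(Z, Y, X);
    for (std::size_t k = 0; k < Z; k++)
        for (std::size_t i = 0; i < Y; i++)
            for (std::size_t j = 0; j < X; j++) {
                float val = 0.0f;
                if (k < src.zdim && i < src.ydim && j < src.xdim)
                    val = src.data[src.index(k, i, j)];
                dst.data[dst.index(k, i, j)] = val;
            }
    return dst;
}

// Split version: when the volume only grows, copy whole rows without branches.
Volume resize_split(const Volume& src, std::size_t Z, std::size_t Y, std::size_t X) {
    if (src.zdim > Z || src.ydim > Y || src.xdim > X)
        return resize_branchy(src, Z, Y, X);  // shrinking: keep the general path

    Volume dst(Z, Y, X);  // the zero fill plays the role of the first loop nest
    for (std::size_t k = 0; k < src.zdim; k++)
        for (std::size_t i = 0; i < src.ydim; i++) {
            const float* in = src.data.data() + src.index(k, i, 0);
            float* out = dst.data.data() + dst.index(k, i, 0);
#if defined(_OPENMP)
#pragma omp simd
#endif
            for (std::size_t j = 0; j < src.xdim; j++)
                out[j] = in[j];
        }
    return dst;
}
```

Both functions should produce identical output for any target size; only the growing case takes the branch-free path, which mirrors the if/else structure of the optimized code.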
Vectorize the second hotspot loop
> cd /home/users/relion-1.4
> vi src/ml_optimiser.cpp
Replace the original code with the optimized code.
Original code (line 3652):
FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
{
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
    suma2 += norm(DIRECT_MULTIDIM_ELEM(Frefctf, n));
}
Optimized code is:
Complex *opp;
FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
{
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
    opp = & DIRECT_MULTIDIM_ELEM(Frefctf, n);
    suma2 += opp->real * opp->real + opp->imag * opp->imag;
}
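The change above replaces the call to norm() with its arithmetic written out inline, which leaves only multiplies and adds in the loop body. The following sketch reproduces the two forms on a standalone type so they can be compared. `Cplx`, `Sums`, `norm_sq`, and the two accumulate functions are hypothetical names; it assumes, as the optimized code implies, that norm() returns real*real + imag*imag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a complex type with public real/imag members.
struct Cplx { double real; double imag; };

// The two running sums computed by the loop.
struct Sums { double diff2; double suma2; };

// Squared magnitude through a separate function, as in the original loop.
double norm_sq(const Cplx& z) { return z.real * z.real + z.imag * z.imag; }

// Original form: the squared magnitude is obtained through a function call.
Sums accumulate_with_call(const std::vector<Cplx>& ref, const std::vector<Cplx>& img) {
    assert(ref.size() == img.size());
    Sums s{0.0, 0.0};
    for (std::size_t n = 0; n < ref.size(); n++) {
        s.diff2 -= ref[n].real * img[n].real;
        s.diff2 -= ref[n].imag * img[n].imag;
        s.suma2 += norm_sq(ref[n]);
    }
    return s;
}

// Optimized form: the same arithmetic is written out through a pointer.
Sums accumulate_inlined(const std::vector<Cplx>& ref, const std::vector<Cplx>& img) {
    assert(ref.size() == img.size());
    Sums s{0.0, 0.0};
    const Cplx* opp = nullptr;
    for (std::size_t n = 0; n < ref.size(); n++) {
        s.diff2 -= ref[n].real * img[n].real;
        s.diff2 -= ref[n].imag * img[n].imag;
        opp = &ref[n];
        s.suma2 += opp->real * opp->real + opp->imag * opp->imag;
    }
    return s;
}
```

For ref = {(1,2), (3,4)} and img = {(5,6), (7,8)}, both functions give diff2 = -70 and suma2 = 30, since the operations and their order are unchanged.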
- RELION 1.4 has a known issue, described at the following link:
http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Known_issue
Change line 405 in src/ml_optimiser_mpi.cpp from:
length_fn_ctf = exp_fn_img.length() + 1; // +1 to include \0 at the end of the string
into:
length_fn_ctf = exp_fn_ctf.length() + 1; // +1 to include \0 at the end of the string
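One way to see why the length has to come from exp_fn_ctf itself: if a string is passed through a character buffer whose size was computed from a different, shorter string, it is cut off. The sketch below illustrates this with a hypothetical helper, `through_buffer`; it is not RELION's MPI code and makes no claim about how the bug shows up in a real run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: passes s through a buffer of exactly len bytes
// (len counts the terminating '\0') and returns what arrives at the far end.
std::string through_buffer(const std::string& s, std::size_t len) {
    assert(len >= 1);
    std::vector<char> buf(len, '\0');
    // Copy as many characters as fit, always leaving room for the terminator.
    std::size_t n = std::min(s.length(), len - 1);
    std::memcpy(buf.data(), s.data(), n);
    return std::string(buf.data());
}
```

With fn_img = "img.mrcs" and fn_ctf = "ctf_long_name.mrc" (both made-up names), `through_buffer(fn_ctf, fn_img.length() + 1)` returns "ctf_long", while `through_buffer(fn_ctf, fn_ctf.length() + 1)` returns the full name.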
V. Prepare for Intel® Xeon® processor
Set environment variables for compilation of relion:
> export CC=icc
> export CXX=icpc
> export F77=ifort
> export MPICC=mpiicc
> export MPICXX=mpiicpc
> export CFLAGS="-O3 -xHost -fno-alias -align"
> export FFLAGS="-O3 -xHost -fno-alias -align"
> export CXXFLAGS="-O3 -xHost -fno-alias -align"
Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to see the optimization report.
Build the library for the Intel® Xeon® processor:
> cd /home/users
> cd ./relion-1.4
> ./INSTALL.sh
VI. Prepare for Intel® Xeon Phi™ processor
Set environment variables for compilation of relion:
> export CC=icc
> export CXX=icpc
> export F77=ifort
> export MPICC=mpiicc
> export MPICXX=mpiicpc
> export CFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
> export FFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
> export CXXFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to see the optimization report.
Build the library for the Intel® Xeon Phi™ processor:
> cd /home/users
> cd ./relion-1.4
> ./INSTALL.sh
VII. Run the test workload on Intel® Xeon® processor
Create a run script for this workload:
> vi auotrefine.sh
#!/bin/sh
nprocs=9
nthreads=4
mrcsfile="adkc_05000.mrcs.mrcs"
starfile="adkc_05000.star"
defocusfile="3eulerctf_05000.dat"
echo "" > $starfile
echo "data_" >> $starfile
echo "" >> $starfile
echo "loop_" >> $starfile
echo "_rlnVoltage #1" >> $starfile
echo "_rlnDefocusU #2" >> $starfile
echo "_rlnDefocusV #3" >> $starfile
echo "_rlnDefocusAngle #4" >> $starfile
echo "_rlnSphericalAberration #5" >> $starfile
echo "_rlnAmplitudeContrast #6" >> $starfile
echo "_rlnImageName #7" >> $starfile
awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
mkdir -p Refine/adkc_05000
mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads \
    --iter 10 --split_random_halves --i $starfile --particle_diameter 86 \
    --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf \
    --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 \
    --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale \
    --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
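The echo and awk lines in the script build a STAR file: a fixed header followed by one row per particle, taken from fields 1, 3, 4, 5, 6, and 7 of each non-comment line of the defocus file plus a running particle index. The sketch below restates that logic in C++ for readers less familiar with awk. `build_star` is a hypothetical name, and skipping blank or short lines is an assumption of this sketch rather than something the awk one-liner does.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turns defocus-file text into STAR-file text,
// following the echo/awk steps of the run script.
std::string build_star(std::istream& defocus, const std::string& mrcsfile) {
    std::ostringstream out;
    // Header written by the echo lines of the script.
    out << "\ndata_\n\nloop_\n"
        << "_rlnVoltage #1\n"
        << "_rlnDefocusU #2\n"
        << "_rlnDefocusV #3\n"
        << "_rlnDefocusAngle #4\n"
        << "_rlnSphericalAberration #5\n"
        << "_rlnAmplitudeContrast #6\n"
        << "_rlnImageName #7\n";

    std::string line;
    int npart = 0;
    while (std::getline(defocus, line)) {
        // Split the line on whitespace, as awk does by default.
        std::istringstream fields(line);
        std::vector<std::string> f;
        std::string tok;
        while (fields >> tok) f.push_back(tok);

        // Assumption of this sketch: blank and short lines are skipped.
        if (f.size() < 7) continue;
        // Lines whose first field starts with ';' are comments.
        if (f[0][0] == ';') continue;

        ++npart;
        // Fields 1,3,4,5,6,7 (1-based), then "<index>@<stack file>".
        out << f[0] << ' ' << f[2] << ' ' << f[3] << ' ' << f[4] << ' '
            << f[5] << ' ' << f[6] << ' ' << npart << '@' << mrcsfile << '\n';
    }
    return out.str();
}
```

For an input line such as "300 0 10000 10100 45 2.0 0.1" (made-up values) and the stack name "stack.mrcs", the generated row is "300 10000 10100 45 2.0 0.1 1@stack.mrcs".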
Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload:
> export PATH=$PATH:/home/users/relion-1.4/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
Run the workload:
> cd /home/users/5000_vec1
> sh auotrefine.sh
VIII. Run the test workload on Intel® Xeon Phi™ processor
Create a run script for this workload:
> vi auotrefine.sh
#!/bin/sh
nprocs=65
nthreads=4
mrcsfile="adkc_05000.mrcs.mrcs"
starfile="adkc_05000.star"
defocusfile="3eulerctf_05000.dat"
echo "" > $starfile
echo "data_" >> $starfile
echo "" >> $starfile
echo "loop_" >> $starfile
echo "_rlnVoltage #1" >> $starfile
echo "_rlnDefocusU #2" >> $starfile
echo "_rlnDefocusV #3" >> $starfile
echo "_rlnDefocusAngle #4" >> $starfile
echo "_rlnSphericalAberration #5" >> $starfile
echo "_rlnAmplitudeContrast #6" >> $starfile
echo "_rlnImageName #7" >> $starfile
awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
mkdir -p Refine/adkc_05000
mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads \
    --iter 10 --split_random_halves --i $starfile --particle_diameter 86 \
    --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf \
    --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 \
    --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale \
    --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload:
> export PATH=$PATH:/home/users/relion-1.4/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
Run the workload:
> cd /home/users/5000_vec1
> sh auotrefine.sh
IX. Performance gain
For the autorefine workload described above, the following graph shows the speedup achieved from this optimization. Up to a 1.31x speedup can be achieved when running this code on one Intel® Xeon Phi™ 7250 vs. one 2-Socket Intel® Xeon® Processor E5-2697 v4. Up to a 1.23x speedup can be achieved when running this code on one Intel® Xeon Phi™ 7210 vs. one 2-Socket Intel® Xeon® Processor E5-2697 v4.
- 2S Intel® Xeon® processor E5-2697 v4, (18 Ranks)
- 1 Intel® Xeon Phi™ 7210 (63 Ranks)
- 1 Intel® Xeon Phi™ 7250 (65 Ranks)
Testing platform configuration:
Intel® Xeon® processor E5-2697 v4: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT off), DDR4 128GB, 2400 MHz, CentOS release 6.7 (Final)
Intel® Xeon Phi™ processor 7210 (64 cores): Intel® Xeon Phi™ processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (Turbo ON), MCDRAM 16 GB 6.4 GT/s, BIOS GVPRCRB1.86B.0010.R00.1603251732, DDR4 96GB 2133 MHz, Red Hat 7.2 (Maipo), quad cluster mode, MCDRAM cache mode
Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (Turbo ON), MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0010.D42.1604182214, DDR4 96GB 2400 MHz, Red Hat 7.2 (Maipo), quad cluster mode, MCDRAM cache mode