Recipe: RELION for Intel® Xeon Phi™ 7250 processor

I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of relion-1.4 on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of relion-1.4 can be downloaded from: http://www2.mrc-lmb.cam.ac.uk/relion/index.php

II. Introduction

RELION is an image-processing software package that is widely used to obtain high-resolution cryo-EM structures. It uses a Bayesian MAP+EM algorithm, which provides more reliable structures than existing methods and is better suited to heterogeneous data. RELION is distributed under a GPL license; it is completely free, open-source software for both academia and industry. The code is written in C++, and parallelization is achieved through MPI and Pthreads. More information about RELION is available at http://www2.mrc-lmb.cam.ac.uk/relion/index.php

This project optimizes the performance of the auto-refine part of RELION on both Intel® Xeon® processors and Intel® Xeon Phi™ processors.

Optimizations in this package include:

Align data to 64-byte boundaries. This alignment yields about a 10% performance improvement for this workload.
Vectorize the two hotspot loops. The first hotspot loop in particular is executed very frequently while the program runs. Vectorizing both loops yields more than a 30% performance improvement for this workload.
RELION is a memory-bound application, so taking advantage of the fast MCDRAM available on the Intel® Xeon Phi™ 7250 processor should improve performance. Using the MCDRAM in cache mode, we see about a 10% performance boost for this workload.

III. Preliminaries

To match these results, the Intel® Xeon Phi™ processor machine must be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information.
To build this package, install the Intel® MPI Library for Linux* 5.1 (Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3), or higher versions of these products, on your system.
Download relion-1.4.tar.bz2 from http://www2.mrc-lmb.cam.ac.uk/relion/index.php

Set up the Intel® MPI Library and Intel® C++ Compiler environments:

> source /opt/intel/impi/<version>/bin64/mpivars.sh
> source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64

Unpack the source code to /home/users.

> cp relion-1.4.tar.bz2 /home/users
> cd /home/users
> tar -xjvf relion-1.4.tar.bz2
> cd ./relion-1.4

To obtain the test workload, please contact Yanan Zhu <ynzhu@pku.edu.cn> at Peking University and request the version used for the Intel KNL Recipes.

Copy the workload archive to your home directory and unpack it:

> cp relion-autorefine-5000.tar.gz /home/users
> cd /home/users
> tar -xzvf relion-autorefine-5000.tar.gz

IV. Add optimized code into relion

Overload operators new and delete of class MultidimArray in src/multidim_array.h

> cd /home/users/relion-1.4
> vi src/multidim_array.h

Insert the optimized code below before line 496:

void *operator new(size_t size)
{
     return _mm_malloc(size, 64);
}
void operator delete(void *p)
{
     _mm_free(p);
}
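
The effect of the overloaded operators can be checked with a small standalone sketch. It is not RELION code: the `AlignedBuffer` type and the helper names are hypothetical, and `posix_memalign` is used as a portable stand-in for the `_mm_malloc` call in the recipe, on the assumption that both return memory aligned to the requested boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free

// Hypothetical stand-in for a class such as MultidimArray (not RELION code).
struct AlignedBuffer {
    double values[8];

    // Class-level operator new: every heap-allocated AlignedBuffer starts
    // on a 64-byte boundary.
    static void* operator new(std::size_t size) {
        void* p = 0;
        // Assumption: posix_memalign replaces _mm_malloc for portability.
        if (posix_memalign(&p, 64, size) != 0)
            throw std::bad_alloc();
        return p;
    }

    // Matching operator delete: posix_memalign memory is released with free.
    static void operator delete(void* p) {
        free(p);
    }
};

// Returns true when the pointer lies on a 64-byte boundary.
bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Allocates one object through the overloaded operators and checks its address.
void check_heap_alignment() {
    AlignedBuffer* b = new AlignedBuffer();
    assert(is_aligned_64(b));
    delete b;
}
```

Calling `check_heap_alignment()` from a test program exercises both operators. Note that only objects created with `new` are affected; stack objects keep the default alignment of the type.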

Vectorize the first hotspot loop

> cd /home/users/relion-1.4
> vi src/multidim_array.h

Replace the original code with the optimized code.

The original code (at about line 930) is:
       for (long int l = 0; l < Ndim; l++)
            for (long int k = 0; k < Zdim; k++)
                for (long int i = 0; i < Ydim; i++)
                    for (long int j = 0; j < Xdim; j++)
                    {
                        T val;
                        if (k >= ZSIZE(*this))
                            val = 0;
                        else if (i >= YSIZE(*this))
                            val = 0;
                        else if (j >= XSIZE(*this))
                            val = 0;
                        else
                            val = DIRECT_A3D_ELEM(*this, k, i, j);
                        new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                    }

Optimized code is:

        if ((ZSIZE(*this) <= Zdim) && (YSIZE(*this) <= Ydim) && (XSIZE(*this) <= Xdim))
        {
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                    {
                        #pragma simd
                        for (long int j = 0; j < Xdim; j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = 0;
                        }
                    }
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < ZSIZE(*this); k++)
                    for (long int i = 0; i < YSIZE(*this); i++)
                    {
                        #pragma simd
                        for (long int j = 0; j < XSIZE(*this); j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = DIRECT_A3D_ELEM(*this, k, i, j);
                        }
                    }
        }
        else
        {
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                        for (long int j = 0; j < Xdim; j++)
                        {
                            T val;
                            if (k >= ZSIZE(*this))
                                val = 0;
                            else if (i >= YSIZE(*this))
                                val = 0;
                            else if (j >= XSIZE(*this))
                                val = 0;
                            else
                                val = DIRECT_A3D_ELEM(*this, k, i, j);
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                        }
        }
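
The idea behind this rewrite can be illustrated with a simplified two-dimensional sketch. It is not RELION code: the function names and the use of `std::vector<float>` are assumptions made for illustration. When the new array is at least as large as the old one in every dimension, the per-element bounds tests can be replaced by a zero-fill pass followed by a copy pass, leaving branch-free inner loops that a compiler can vectorize more easily.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: one loop nest with a bounds test on every element,
// mirroring the structure of the original loop.
std::vector<float> resize_branchy(const std::vector<float>& src,
                                  std::size_t oldY, std::size_t oldX,
                                  std::size_t newY, std::size_t newX) {
    assert(src.size() == oldY * oldX);
    std::vector<float> dst(newY * newX);
    for (std::size_t i = 0; i < newY; i++)
        for (std::size_t j = 0; j < newX; j++)
            dst[i * newX + j] = (i >= oldY || j >= oldX) ? 0.0f : src[i * oldX + j];
    return dst;
}

// Split version, valid only when the array grows (or stays the same size)
// in both dimensions: zero-fill everything, then copy the old contents.
std::vector<float> resize_split(const std::vector<float>& src,
                                std::size_t oldY, std::size_t oldX,
                                std::size_t newY, std::size_t newX) {
    assert(src.size() == oldY * oldX);
    assert(oldY <= newY && oldX <= newX);
    std::vector<float> dst(newY * newX);
    // Pass 1: branch-free zero fill over the whole new array.
    for (std::size_t i = 0; i < newY; i++)
        for (std::size_t j = 0; j < newX; j++)
            dst[i * newX + j] = 0.0f;
    // Pass 2: branch-free copy of the old contents into the top-left corner.
    for (std::size_t i = 0; i < oldY; i++)
        for (std::size_t j = 0; j < oldX; j++)
            dst[i * newX + j] = src[i * oldX + j];
    return dst;
}
```

Both functions return the same array for any growing resize. The split version writes the overlapping region twice, which trades some extra stores for inner loops without conditionals.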

Vectorize the second hotspot loop

> cd /home/users/relion-1.4
> vi src/ml_optimiser.cpp

Replace the original code with the optimized code.
The original code (line 3652) is:

FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
{
diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
suma2 += norm(DIRECT_MULTIDIM_ELEM(Frefctf, n));
}

Optimized code is:

Complex *opp;
FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
{
diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
opp = & DIRECT_MULTIDIM_ELEM(Frefctf, n);
suma2 += opp->real * opp->real + opp->imag * opp->imag;
}
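
The change above replaces the call to norm() with an explicit sum of squares taken through a pointer to the element. A minimal standalone sketch of the same accumulation follows; the `Cplx` and `Sums` types and the `accumulate` function are hypothetical names, and the sketch assumes a complex type with public `real` and `imag` members, as the recipe's code suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal complex type with public real/imag members.
struct Cplx {
    double real;
    double imag;
};

// Result of the accumulation: cross term (diff2) and reference power (suma2).
struct Sums {
    double diff2;
    double suma2;
};

// Accumulates the negated real dot product of ref and img into diff2 and
// the squared magnitude of ref into suma2, using only multiplies and adds.
Sums accumulate(const std::vector<Cplx>& ref, const std::vector<Cplx>& img) {
    assert(ref.size() == img.size());
    Sums s = {0.0, 0.0};
    for (std::size_t n = 0; n < ref.size(); n++) {
        const Cplx* opp = &ref[n];
        s.diff2 -= opp->real * img[n].real;
        s.diff2 -= opp->imag * img[n].imag;
        // Explicit squares in place of a function call.
        s.suma2 += opp->real * opp->real + opp->imag * opp->imag;
    }
    return s;
}
```

With the loop body reduced to plain arithmetic on two arrays, the compiler has a better chance of treating both accumulations as vector reductions.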

There is a known issue in relion-1.4, as referenced at the following link:
http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Known_issue
Change line 405 in src/ml_optimiser_mpi.cpp from:
length_fn_ctf = exp_fn_img.length() + 1; // +1 to include \0 at the end of the string
to:
length_fn_ctf = exp_fn_ctf.length() + 1; // +1 to include \0 at the end of the string

V. Prepare for Intel® Xeon® processor

Set environment variables for compilation of relion:

> export CC=icc
> export CXX=icpc
> export F77=ifort
> export MPICC=mpiicc
> export MPICXX=mpiicpc
> export CFLAGS="-O3 -xHost -fno-alias -align"
> export FFLAGS="-O3 -xHost -fno-alias -align"
> export CXXFLAGS="-O3 -xHost -fno-alias -align"

Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to generate an optimization report.

Build the library for the Intel® Xeon® processor.
```
cd /home/users
cd ./relion-1.4
./INSTALL.sh
```

VI. Prepare for Intel® Xeon Phi™ processor

Set environment variables for compilation of relion:

> export CC=icc
> export CXX=icpc
> export F77=ifort
> export MPICC=mpiicc
> export MPICXX=mpiicpc
> export CFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
> export FFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
> export CXXFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"

Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to generate an optimization report.

Build the library for the Intel® Xeon Phi™ processor.
```
cd /home/users
cd ./relion-1.4
./INSTALL.sh 
```

VII. Run the test workload on Intel® Xeon® processor

Create a run script for this workload:

>vi auotrefine.sh
#!/bin/sh
nprocs=9
nthreads=4
mrcsfile="adkc_05000.mrcs.mrcs"
starfile="adkc_05000.star"
defocusfile="3eulerctf_05000.dat"

echo ""> $starfile
echo "data_">> $starfile
echo "">> $starfile
echo "loop_">> $starfile
echo "_rlnVoltage #1">> $starfile
echo "_rlnDefocusU #2">> $starfile
echo "_rlnDefocusV #3">> $starfile
echo "_rlnDefocusAngle #4">> $starfile
echo "_rlnSphericalAberration #5">> $starfile
echo "_rlnAmplitudeContrast #6">> $starfile
echo "_rlnImageName #7">> $starfile

awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
mkdir -p Refine/adkc_05000
mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40

Set PATH and LD_LIBRARY_PATH so that the RELION binaries and libraries are found when running the autorefine workload:

> export PATH=$PATH:/home/users/relion-1.4/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib

Run the workload:

> cd /home/users/5000_vec1
> sh auotrefine.sh

VIII. Run the test workload on Intel® Xeon Phi™ processor

Create a run script for this workload:

>vi auotrefine.sh
#!/bin/sh
nprocs=65
nthreads=4
mrcsfile="adkc_05000.mrcs.mrcs"
starfile="adkc_05000.star"
defocusfile="3eulerctf_05000.dat"

echo ""> $starfile
echo "data_">> $starfile
echo "">> $starfile
echo "loop_">> $starfile
echo "_rlnVoltage #1">> $starfile
echo "_rlnDefocusU #2">> $starfile
echo "_rlnDefocusV #3">> $starfile
echo "_rlnDefocusAngle #4">> $starfile
echo "_rlnSphericalAberration #5">> $starfile
echo "_rlnAmplitudeContrast #6">> $starfile
echo "_rlnImageName #7">> $starfile

awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
mkdir -p Refine/adkc_05000
mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
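
The rank and thread counts in these scripts can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `plan_layout` and `Layout` are hypothetical names, and the split into one master rank plus two equal groups of worker ranks is an assumption about how relion_refine_mpi distributes work when --split_random_halves is used, which would explain the odd rank counts (9 and 65).

```cpp
#include <cassert>

// Summary of a run configuration (hypothetical helper type).
struct Layout {
    int workers_per_half;  // worker ranks assigned to each random half
    int total_threads;     // nprocs * nthreads, to compare with hardware threads
};

// nprocs: number of MPI ranks passed to mpirun -np
// nthreads: threads per rank passed to --j
// Assumption: one rank is a master and the rest split evenly into two halves,
// so nprocs is expected to be odd and at least 3.
Layout plan_layout(int nprocs, int nthreads) {
    assert(nprocs >= 3 && nprocs % 2 == 1);
    assert(nthreads >= 1);
    Layout l;
    l.workers_per_half = (nprocs - 1) / 2;
    l.total_threads = nprocs * nthreads;
    return l;
}
```

For the Intel® Xeon Phi™ 7250 script, `plan_layout(65, 4)` gives 260 threads in total, which fits within the 272 hardware threads of that processor; for the Intel® Xeon® script, `plan_layout(9, 4)` gives 36 threads, matching the 36 cores of the dual-socket system.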

Set PATH and LD_LIBRARY_PATH so that the RELION binaries and libraries are found when running the autorefine workload:

> export PATH=$PATH:/home/users/relion-1.4/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib

Run the workload:

> cd /home/users/5000_vec1
> sh auotrefine.sh

IX. Performance gain

For the autorefine workload described above, the following graph shows the speedup achieved by these optimizations. Up to a 1.31x speedup is achieved when running this code on one Intel® Xeon Phi™ 7250 processor vs. one 2-socket Intel® Xeon® processor E5-2697 v4, and up to a 1.23x speedup when running on one Intel® Xeon Phi™ 7210 processor vs. the same 2-socket Intel® Xeon® processor E5-2697 v4 system.

Figure: speedup of the autorefine workload for the following configurations:
2S Intel® Xeon® processor E5-2697 v4 (18 ranks)
1 Intel® Xeon Phi™ 7210 (63 ranks)
1 Intel® Xeon Phi™ 7250 (65 ranks)

Testing platform configuration:

Intel® Xeon® processor E5-2697 v4: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT off), DDR4 128GB, 2400 MHz, CentOS release 6.7 (Final)

Intel® Xeon Phi™ processor 7210 (64 cores): Intel® Xeon Phi™ processor 7210 64 core, 256 threads, 1300 MHz core freq. (Turbo ON), MCDRAM 16 GB 6.4 GT/s, BIOS GVPRCRB1.86B.0010.R00.1603251732, DDR4 96GB 2133 MHz, Red Hat 7.2(Maipo), quad cluster mode, MCDRAM cache mode

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0010.D42.1604182214, DDR4 96GB 2400 MHz, Red Hat 7.2(Maipo), quad cluster mode, MCDRAM cache mode
