Channel: Intel Developer Zone Articles

BLAST for the Intel® Xeon Phi™ Coprocessor


Purpose

This code recipe describes how to get, build, and use the BLAST+ code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture.

Introduction

The NCBI BLAST algorithm and its associated application suite are used extensively in the field of bioinformatics. BLAST looks for similarities (homologs) in a nucleotide or protein query sequence of length n by aligning it against a subject sequence of length m. The NCBI BLAST application suite supplies two flavors of sequence match searching applications: nucleotide-based (blastn-type) and protein-based (blastp-type). At the time of publishing this document, the most recent publicly available version of BLAST is version 2.2.30 (referred to as version 30 in this document).

BLAST uses heuristics to prune the query-subject search space. The heuristics are based on the concept of hitting a seed (starter match) and growing a larger match around it, thus reducing the computational cost from O(n*m) to O(n+m).  

BLAST’s runtime can be broken down into three stages:

a. Preliminary search: starting with a minimal match (seeding), then extending the match bidirectionally, first without gaps (ungapped) and then allowing gaps (gapped).

b. Post search (Gapped extension with Traceback, GAT): match scores obtained in a are corrected by tracing matches back to the streamlined subject stream.

c. Output Formatting Section (OFS): populating a structure that holds the output of a and b, specialized for streaming to an output file or the screen.

BLAST is highly scalable on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors because the runtime stages a, b and c can be efficiently parallelized. In BLAST version 30, stages a and b are parallelized for blastn, while only stage a is parallelized for blastp. A working version of BLAST which also parallelizes stage c is in development, and is available on request.

This document discusses a heterogeneous mode of operation involving the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor(s) (referred to as ‘coprocessor(s)’ in this document). Heterogeneous operation has been tested on a single node. In heterogeneous mode, host and coprocessor(s) handle a balanced split of a multi-query workload, where the query set is partitioned into complete and mutually disjoint host/coprocessor(s) sets.

Code Access

BLAST+ code is maintained by NCBI and is available under the license:

This software/database is a "United States Government Work" under the
terms of the United States Copyright Act.  It was written as part of
the author's official duties as a United States Government employee and
thus cannot be copyrighted.  This software/database is freely available
to the public for use. The National Library of Medicine and the U.S.
Government have not placed any restriction on its use or reproduction.

Please cite the author in any work or product based on this material.

To get access to the code and test workloads:

a) Source code: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/

b) Subject data bases: ftp://ftp.ncbi.nlm.nih.gov/blast/db/

c) Download benchmark2013.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/demo/benchmark/

A working version of BLAST which parallelizes all runtime stages, including stage c, is available on request from the author of this document (albert.golembiowski.@intel.com).

Build Directions

To run a workload on a coprocessor, as well as in heterogeneous mode, two separate builds are necessary: first for the host, then for the coprocessor, in that order. The recipe below delineates the build process for BLAST version 30; however, it applies equally well to version 29.

  1. On the host, uncompress ncbi-blast-2.2.30+-src.tar.gz, obtained from (a), in an empty directory <src>
  2. Make the Intel® C++ Compiler (icc) discoverable by installing Intel® Parallel Studio XE 2015 Composer Edition and then invoking the environment setup script:
    > source /opt/intel/composer_xe_2015/bin/iccvars.csh intel64
  3. Change the directory to <src>/ncbi-blast-2.2.30+-src and duplicate the top directory node
    > cp -rp c++ icc
    Both the host and the coprocessor builds can be processed as separate configurations off the common icc source tree, as shown below (by means of the `--with-build-root' flag), or the coprocessor source tree can be made separate
    > cp -rp c++ mic

    Host:

  4. Change the directory to <src>/ncbi-blast-2.2.30+-src/icc, and run
    > ./configure --with-bin-release --without-debug --without-gui --with-mt --without-boost --without-z --without-bz2 --without-strip CXX=icpc CC=icc CFLAGS='-O3' CXXFLAGS='-O3' LDFLAGS='-O3' --with-symbols
    The flag --without-strip is optional; in the file
    ReleaseMT/build/Makefile.mk
    it changes the value of the variable CONF_STRIP from strip to @:. The optional flag --with-symbols appends -g to all occurrences of -O3 in the file Makefile.mk, which is essential to retain symbols (needed, for example, for Intel® VTune analysis). The flags
    --without-boost --without-z --without-bz2 can be substituted by a single flag
    --without-3psw which inhibits incorporating 3rd party software into the build. This has no effect on blastn/blastp performance; however, it inhibits building some of the test suites.
    A debug build from the shared icc source tree is obtained by swapping --without-debug with --with-debug and adding --with-build-root=DebugMT:
    > ./configure --with-bin-release --with-debug --without-gui --with-mt --without-boost --without-z --without-bz2 CXX=icpc CC=icc --with-build-root=DebugMT
  5. Edit the file src/algo/blast/core/blast_gapalign.c by inserting the following line just ahead of the BLAST_GetGappedScore function definition (at code line 3401 in BLAST version 30):
    #pragma intel optimization_level 1
    Dropping the optimization level for this one function is needed for ICC 2015 to build a stable version of blastp (it is not needed for blastn). Future versions of ICC will not need it.
  6. Step into directory ReleaseMT/build
  7. Run (flag -k is essential to ignore errors, -j8 is optional)
    > make -k all_r -j8
    The executables (blastn, blastp, blastx, ...) are built in directory ReleaseMT/build/app/blast and are copied to directory ReleaseMT/bin
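
    The edit in Host: step 5 can also be scripted. The following is a minimal sketch and not part of the original recipe: the helper name insert_pragma is hypothetical, and the target line number is supplied by the caller (3401 for version 30, per step 5) rather than detected, so it should be checked against the actual source file before use.

```shell
#!/bin/sh
# insert_pragma FILE LINE
# Hypothetical helper: inserts the pragma from Host: step 5 so that it becomes
# line LINE of FILE, i.e. it lands just ahead of whatever was on that line.
insert_pragma() {
    file=$1
    line=$2
    pragma='#pragma intel optimization_level 1'
    # Leave the file untouched if the pragma is already there, so that
    # running the helper twice does not insert it twice.
    if grep -qF "$pragma" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Print the pragma before line LINE, then every original line unchanged.
    awk -v n="$line" -v p="$pragma" 'NR == n { print p } { print }' "$file" > "$tmp" &&
        cat "$tmp" > "$file"      # cat keeps the permissions of the original file
    status=$?
    rm -f "$tmp"
    return $status
}
```

    For example, run from the icc source tree: insert_pragma src/algo/blast/core/blast_gapalign.c 3401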

    Coprocessor:
  1. Change the directory to <src>/ncbi-blast-2.2.30+-src/icc, and repeat Host: step 4 above with the flags --host=x86_64-k1om-linux and --with-build-root=ReleaseMIC added, this time making the flag --without-3psw mandatory, i.e.
    > ./configure --with-bin-release --without-debug --without-gui --with-mt --without-3psw --without-strip --with-symbols CXX=icpc CC=icc CFLAGS='-O3 -g -mmic' CXXFLAGS='-O3 -mmic' LDFLAGS='-O3 -mmic' --host=x86_64-k1om-linux --with-build-root=ReleaseMIC

    Again, the flags --without-strip and --with-symbols are optional, and the same remark as in Host: step 4 applies. The build root name ReleaseMIC is picked by the user. If the value of the flag --host is not recognized as a valid Intel® MIC architecture (k1om is the Intel® code name for the Knights Corner coprocessor), then it may need to be dropped, together with -mmic, and the file ReleaseMIC/build/Makefile.mk hand edited by appending -mmic to all occurrences of -O3.
    In the case of a separate coprocessor mic source tree, skip the above, change the directory to <src>/ncbi-blast-2.2.30+-src/mic and execute the same ./configure command; the flag --with-build-root=ReleaseMIC is now optional (the build root defaults to ReleaseMT). Set the environment variable NCBI_DATATOOL_PATH to point to the directory that contains the executables generated in Host: step 7, i.e.
    > export NCBI_DATATOOL_PATH=<src>/ncbi-blast-2.2.30+-src/icc/ReleaseMT/bin

  2. Edit the file include/corelib/ncbifloat.h: search for the line (line 71 in BLAST version 30)
    # if __cplusplus >= 201103L && defined(_GLIBCXX_constexpr)
    and append to it
    && !defined (__MIC__)
    This has the effect of eliding `ISNAN_CONSTEXPR'.
  3. Change the directory to ReleaseMIC/build
  4. Build using the command in Host: step 7 above.
    The coprocessor executables are built in directory ReleaseMIC/build/app/blast and are copied to directory ReleaseMIC/bin
  5. If you don’t see the executables in directory ReleaseMIC/bin, execute:
    > make -C corelib
    > make -C util
    > make -C serial
    > make -C serial/datatool
    > make -k all_r -j8
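
    The header edit in Coprocessor: step 2 can be sketched as a small shell helper. This is an illustration under assumptions, not part of the original recipe: the helper name add_mic_guard is hypothetical, and it matches the preprocessor line exactly as quoted in step 2, so the match should be verified against the actual contents of ncbifloat.h.

```shell
#!/bin/sh
# add_mic_guard FILE
# Hypothetical helper: appends "&& !defined (__MIC__)" to the preprocessor
# line quoted in Coprocessor: step 2. Lines that already mention __MIC__ are
# left alone, so the helper can safely be run more than once.
add_mic_guard() {
    file=$1
    tmp=$(mktemp) || return 1
    # index() does plain substring matching, so no regex escaping is needed.
    awk '
        index($0, "__cplusplus >= 201103L") &&
        index($0, "defined(_GLIBCXX_constexpr)") &&
        !index($0, "__MIC__") {
            print $0 " && !defined (__MIC__)"
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

    For example, run from the source tree used for the coprocessor build: add_mic_guard include/corelib/ncbifloat.h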

Run Directions

Heterogeneous Intel® Xeon® processor/Intel® Xeon Phi™ coprocessor workload distribution model on a single node:

  1. Create a working directory <work_dir> on a local disk (host) and NFS mount it on the coprocessor. This process is explained in the Intel® Manycore Platform Software Stack (Intel® MPSS) documentation, i.e.
    /opt/intel/mic/docs/MPSS_Boot_Config_Guide.pdf
  2. Download the refseq_rna.00 refseq_rna.01 refseq_rna.02 reference databases per Code Access instruction above and place them in <work_dir>/db
  3. Download the sample database queries per Code Access instruction above and place them in <work_dir>/queries
  4. Create a directory called <work_dir>/queries_m/blastn, where the concatenated queries formed from the 99 blastn queries contained in <work_dir>/queries/blastn/ will be placed.
  5. Concatenate the first 80 out of 99 query files into a single multiple-query file, say <work_dir>/queries_m/blastn/NM_80_all, by stepping into directory <work_dir>/queries/blastn/ and running
    > echo "cat " `ls *` "> ../../queries_m/blastn/NM_80_all" > nm_99_all.sh
    Strip the tail-end 19 query file names by editing the file nm_99_all.sh, save it as nm_80_all.sh, and execute
    > source nm_80_all.sh
  6. Concatenate the last 19 out of 99 queries into a single multiple-query file, for example <work_dir>/queries_m/blastn/_NM_19_all, in a similar manner to step 5 (this time delete all but the last 19 query file names).
  7. Copy the blastn executables obtained in Host: step 7 and Coprocessor: step 4 above
    > cp <src>/ncbi-blast-2.2.30+-src/icc/ReleaseMT/bin/blastn <work_dir>/blastn
    > cp <src>/ncbi-blast-2.2.30+-src/mic/ReleaseMT/bin/blastn <work_dir>/blastn_mic
    The second command assumes the separate mic source tree with the default build root; with the shared icc source tree, the coprocessor executable is in icc/ReleaseMIC/bin instead.

This is optional, as both executables may be run from where they were built, provided the build directory is also visible from the coprocessor, i.e. is NFS mounted. Similarly, you may copy the libiomp5.so for the Intel® Xeon Phi™ coprocessor
    > cp /opt/intel/composer_xe_2015/compiler/lib/mic/libiomp5.so <work_dir>

Copying the executables to a local disk <work_dir> is advisable as it may reduce NFS access overhead. Create a script to run the workload natively on the coprocessor and the host, say blastn_mic_host.sh, containing

#!/bin/sh
# Usage: blastn_mic_host.sh <query_mic> <threads_mic> <query_host> <threads_host> [<merged_out>]
QUERY_MIC=$1
THREADS_MIC=$2
QUERY_HOST=$3
THREADS_HOST=$4
OUT_NAME=$5        # only used by the (commented out) merge at the end
WORK_DIR=<work_dir>
DB_NAME=refseq_rna # label used only in the output file names (assumed; not set in the original script)
BATCH_SIZE=48000
OUT_NAME_MIC=${WORK_DIR}/${DB_NAME}_${QUERY_MIC}_mic.out
OUT_NAME_HOST=${WORK_DIR}/${DB_NAME}_${QUERY_HOST}.out
# ---------------------- running on coprocessor (mic1), in the background ----------------------
/usr/bin/time -p ssh mic1 "cd ${WORK_DIR}; export LD_LIBRARY_PATH=${WORK_DIR}; export BATCH_SIZE=${BATCH_SIZE}; ./blastn_mic -task blastn -use_index false -db 'db/refseq_rna.00 db/refseq_rna.01 db/refseq_rna.02' -query queries_m/blastn/${QUERY_MIC} -num_threads ${THREADS_MIC} > ${OUT_NAME_MIC}" &
# ---------------------- running on host ----------------------
cd ${WORK_DIR}; export BATCH_SIZE=${BATCH_SIZE}; /usr/bin/time -p ./blastn -task blastn -use_index false -db 'db/refseq_rna.00 db/refseq_rna.01 db/refseq_rna.02' -query queries_m/blastn/${QUERY_HOST} -num_threads ${THREADS_HOST} > ${OUT_NAME_HOST}
# ---------------------- merging results (have to wait for host and coprocessor to finish!) ----------------------
# cat ${OUT_NAME_MIC} ${OUT_NAME_HOST} > ${OUT_NAME}
exit

The environment variable BATCH_SIZE is specific to BLAST internals, and it governs resource allocation critical to efficient multiple-query handling. In the script, remember to replace <work_dir> with an actual path.
To run an 80/19 query host/coprocessor distributed workload, for example, execute the following command from the host
> ./blastn_mic_host.sh _NM_19_all 180 NM_80_all 48
where the coprocessor is given the 19 queries contained in file _NM_19_all, running 180 threads, and the host is given the 80 queries contained in file NM_80_all, running 48 threads. The final result of running the Intel® Xeon® processor/Intel® Xeon Phi™ coprocessor distributed workload is obtained by a merge, executed after all processes launched by the script above terminate
> cat ${OUT_NAME_MIC} ${OUT_NAME_HOST} > NM_99_merged
where ${OUT_NAME_MIC} and ${OUT_NAME_HOST} are defined in the script above.
To keep the script blastn_mic_host.sh simple, the merge call is commented out: the merge can be executed only after both the coprocessor and the host tasks have finished, and implementing that wait adds complexity to the script.
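
One way to implement the wait before the merge is the shell's built-in wait command. The following is a minimal sketch rather than the script used for the measurements in this document: the helper name run_split_and_merge is hypothetical, both workloads are passed in as command strings (for this recipe they would be the ssh mic1 ... and ./blastn ... command lines), and each command's standard output is captured on the host side.

```shell
#!/bin/sh
# run_split_and_merge MIC_CMD HOST_CMD MIC_OUT HOST_OUT MERGED
# Hypothetical helper: starts both commands in the background, waits for
# both to finish, and merges the two outputs only if both succeeded.
run_split_and_merge() {
    mic_cmd=$1
    host_cmd=$2
    mic_out=$3
    host_out=$4
    merged=$5

    sh -c "$mic_cmd" > "$mic_out" &
    mic_pid=$!
    sh -c "$host_cmd" > "$host_out" &
    host_pid=$!

    # "wait PID" blocks until that job ends and returns its exit status.
    wait "$mic_pid"
    mic_status=$?
    wait "$host_pid"
    host_status=$?

    if [ "$mic_status" -ne 0 ] || [ "$host_status" -ne 0 ]; then
        echo "not merging: coprocessor status $mic_status, host status $host_status" >&2
        return 1
    fi
    # Same order as the merge command above: coprocessor output first.
    cat "$mic_out" "$host_out" > "$merged"
}
```

Because the merge runs only after both wait calls return, the slower of the two tasks determines when the merged file becomes available.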

8. blastp can be run in a heterogeneous manner in exactly the same way.
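
The manual editing in steps 5 and 6 above can be replaced by a short script. The following is a sketch under assumptions, not the procedure used by the author: the helper name split_queries is hypothetical, the query directory is assumed to contain only query files with no newlines in their names, and the files are taken in C-locale sorted name order, which may differ from the `ls *` order in your locale.

```shell
#!/bin/sh
# split_queries QUERY_DIR OUT_DIR N_HOST
# Hypothetical helper: concatenates the first N_HOST query files into
# OUT_DIR/NM_<N_HOST>_all and the remaining ones into OUT_DIR/_NM_<rest>_all,
# so that the two sets are disjoint and together cover every query.
split_queries() {
    query_dir=$1
    out_dir=$2
    n_host=$3
    mkdir -p "$out_dir" || return 1
    list=$(mktemp) || return 1
    # One file name per line, in a locale-independent order.
    ( cd "$query_dir" && ls ) | LC_ALL=C sort > "$list"
    total=$(wc -l < "$list" | tr -d ' ')
    if [ "$n_host" -gt "$total" ]; then
        echo "only $total query files in $query_dir" >&2
        rm -f "$list"
        return 1
    fi
    n_mic=$((total - n_host))
    # Host share: the first N_HOST files.
    head -n "$n_host" "$list" | while read -r f; do
        cat "$query_dir/$f"
    done > "$out_dir/NM_${n_host}_all"
    # Coprocessor share: the remaining files.
    tail -n "$n_mic" "$list" | while read -r f; do
        cat "$query_dir/$f"
    done > "$out_dir/_NM_${n_mic}_all"
    rm -f "$list"
}
```

With 99 query files, split_queries <work_dir>/queries/blastn <work_dir>/queries_m/blastn 80 would produce NM_80_all and _NM_19_all; changing the last argument gives the other splits listed under Expected performance.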

Expected performance (in seconds)

The results of running in heterogeneous mode as per Run Directions step 7 above are presented below. In this experiment, the total list of 99 queries is split into 80 concatenated queries given to the host and the remaining 19 queries given to the coprocessor, both in a multiple-query, multiple-db-volume invocation model.
A user running this model should be cautioned that coprocessor memory saturation issues (excessive runtime, crashes, hangs) may arise when the combined size of the subject database volumes, the blastn/blastp executable and the runtime memory demand comes close to or exceeds the available coprocessor RAM. The three subject db volumes used (refseq_rna.00 refseq_rna.01 refseq_rna.02) have a combined size of ~4 GB; together with the executable size (0.25 GB) and the runtime heap memory demand (~2 GB) this amounts to ~6.25 GB, which is well within the available coprocessor memory (the coprocessor we used had 16 GB of RAM). If the available RAM would be exceeded, the larger BLAST query/db problem has to be subdivided into sub-problems that fit into the available coprocessor RAM.
Also, for accurate runtime performance analysis, the first runs on the coprocessor(s) and the host should be discarded, as their runtimes are padded with the time it takes to build the NFS cache in coprocessor and host RAM, largely due to retrieval of the NFS mounted db volumes.
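
The memory estimate above is simple arithmetic and can be checked before launching a run. The following sketch is illustrative only: the helper name fits_in_coprocessor_ram is hypothetical, all sizes are supplied by the user in GB, and how much headroom to leave below the RAM size is left to the user.

```shell
#!/bin/sh
# fits_in_coprocessor_ram DB_GB EXE_GB HEAP_GB RAM_GB
# Hypothetical helper: prints the estimated footprint in GB and returns 0
# only if it is strictly below the coprocessor RAM size.
fits_in_coprocessor_ram() {
    awk -v db="$1" -v exe="$2" -v heap="$3" -v ram="$4" 'BEGIN {
        need = db + exe + heap
        printf "%.2f\n", need
        exit ((need < ram) ? 0 : 1)
    }'
}

# Figures quoted above: ~4 GB of db volumes, 0.25 GB executable,
# ~2 GB heap, 16 GB of coprocessor RAM.
fits_in_coprocessor_ram 4 0.25 2 16    # prints 6.25
```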

                                  Col B     Col C          Col D                   Col E
                                  host      coprocessor    min completion time:    speedup
                                                           max(Col B, Col C)
number of threads -->             48        180
query split (host/coprocessor)
99/0                              54.5s     0s             54.5s
81/18                             43.8s     39s            43.8s
80/19                             43.3s     41s            43.3s                   1.26
79/20                             42.8s     43.7s          43.7s
78/21                             42.7s     44.2s          44.2s
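
The derived columns of the table can be reproduced with a few lines of shell. In this sketch the helper name completion_and_speedup is hypothetical, and the speedup is assumed to be the host-only (99/0) time divided by the completion time of the split run, which is consistent with the 1.26 reported for the 80/19 split.

```shell
#!/bin/sh
# completion_and_speedup HOST_S MIC_S BASELINE_S
# Hypothetical helper: prints the completion time max(host, coprocessor)
# followed by the speedup relative to BASELINE_S (all times in seconds).
completion_and_speedup() {
    awk -v host="$1" -v mic="$2" -v base="$3" 'BEGIN {
        t = (host > mic) ? host : mic   # the slower side finishes last
        printf "%.1f %.2f\n", t, base / t
    }'
}

# 80/19 split from the table: host 43.3 s, coprocessor 41 s, host-only 54.5 s.
completion_and_speedup 43.3 41 54.5    # prints 43.3 1.26
```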

Platform Configurations

Platform
    Intel R2208GZ4GC platform
    2U chassis
    hot-swap drives, 24 DIMMs, 1 750W Redundant Power Supply
CPU/Stepping
    Xeon E5-2697 v2
    2.7 GHz, 12 core, 8GT/s dual QPI links, 130 W, 3.5GHz Max Turbo Frequency
    768kB instr L1 / 3072kB L2 / 30MB L3 cache
Coprocessor
    Intel Xeon Phi 7110 and 7120; 61 cores, 1.1 and 1.238 GHz
    ECC enabled, TURBO disabled
    Software Details:
    MPSS version - 2.1.6720-15
    Flash version - 2.1.03.0386
Memory
    Hynix HMT31GR7BFR-PB
    64GB total 8*8GB 1600MHZ Reg ECC DDR3
Chipset
    Rev 4.6
    SE5C600.86B.99.99.x069.071520130923
BIOS
    BIOS Configuration: default except:
    Turbo Enabled
    EIST Enabled
    SMT enabled
    NUMA enabled
    Memory speed 1600MHz
    Note: all prefetchers enabled (this is default)
GigE Node Adapter
    Intel Ethernet Controller I350 (rev 01)
    4 Gigabit network connections
    Only one connection in use
IB switch
    36 port switch/rack Mellanox FDR (model MSX6025F-1BFR)
    Firmware version: 9_2_4002
IB adapters
    MCX353A-FCAT memfree
    PCI-Express x8
    FDR InfiniBand 8x
    Firmware version: 2.30.3200
HDD Specs
    SEAGATE ST9600205SS (scsi)
    1x600 GB SAS HDD 10kRPM
OS
    RHEL 6.4
