Purpose
This code recipe describes how to get, build, and use the BLAST+ code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture.
Introduction
The NCBI BLAST algorithm, and the associated application suite, is used extensively in the field of bioinformatics. BLAST looks for similarities (homologs) in a nucleotide or protein query sequence of length n by aligning it against a subject sequence of length m. The NCBI BLAST application suite supplies two flavors of sequence search applications: nucleotide-based (blastn-type) and protein-based (blastp-type). At the time of publishing this document, the most recent publicly available version of BLAST is version 2.2.30 (referred to as version 30 in this document).
BLAST uses heuristics to prune the query-subject search space. The heuristics are based on the concept of hitting a seed (starter match) and growing a larger match around it, thus reducing the computational cost from O(n*m) to O(n+m).
BLAST’s runtime can be broken down into three stages:
a. Preliminary search: starting with a minimal match (seeding), the match is extended bidirectionally, first without gaps (ungapped) and then with gaps allowed (gapped).
b. Post search (gapped extension with traceback, GAT): the match scores obtained in stage a are corrected by tracing the matches back to the streamlined subject stream.
c. Output Formatting Section (OFS): a structure holding the output of stages a and b is populated and specialized for streaming to an output file or the screen.
BLAST is highly scalable on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors when runtime stages a, b and c are efficiently parallelized. In BLAST v.30, stages a and b are parallelized for blastn, while only stage a is parallelized for blastp. A working version of BLAST that also parallelizes stage c is in development and is available on request.
This document discusses a heterogeneous mode of operation involving the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor(s) (referred to as ‘coprocessor(s)’ in this document). Heterogeneous operation has been tested on a single node. In heterogeneous mode, host and coprocessor(s) handle a balanced split of a multi-query workload, where the query set is partitioned into complete and mutually disjoint host/coprocessor(s) sets.
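The partitioning into complete and mutually disjoint host/coprocessor sets can be illustrated with a small shell sketch. The helper name `partition_queries` and the `host`/`mic` labels are hypothetical and not part of BLAST; the sketch assumes that the split is made by position in the sorted file list and that file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLAST): label every file in QUERY_DIR as
# belonging to the host set (the first N_HOST files in sorted order) or to
# the coprocessor set (all remaining files). Each file gets exactly one
# label, so the two sets are disjoint and together cover the whole query set.
partition_queries() {
    query_dir=$1
    n_host=$2
    ls "$query_dir" | sort |
        awk -v n="$n_host" '{ print (NR <= n ? "host" : "mic"), $0 }'
}

# Demo on a throwaway directory holding five dummy query files.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    echo ">query_$i" > "$demo_dir/q$i.fa"
done
partition_queries "$demo_dir" 3
rm -rf "$demo_dir"
```

With 99 query files and `n_host` set to 80, the same call would yield the 80/19 split used later in this document.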
Code Access
BLAST+ code is maintained by NCBI and is available under the license:
This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction. Please cite the author in any work or product based on this material.
To get access to the code and test workloads:
a) Source code: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/
b) Subject data bases: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
c) Download benchmark2013.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/demo/benchmark/
A working version of BLAST which parallelizes all runtime stages, including stage c, is available on request from the author of this document (albert.golembiowski@intel.com).
Build Directions
To run a workload on a coprocessor, as well as in heterogeneous mode, two separate builds are necessary: first for the host, and then for the coprocessor, in that order. The recipe below delineates the build process for BLAST version 30; however, it applies equally well to version 29.
- On the host, uncompress `ncbi-blast-2.2.30+-src.tar.gz`, obtained from (a) in Code Access, in an empty directory `<src>`
- Make the Intel® C++ Compiler (icc) discoverable by installing Intel® Parallel Studio XE 2015 Composer Edition and then invoking the environment setup script:
> source /opt/intel/composer_xe_2015/bin/iccvars.csh intel64
- Change the directory to `<src>/ncbi-blast-2.2.30+-src` and duplicate the top directory node:
> cp -rp c++ icc
Both the host and the coprocessor builds can be processed as separate configurations off the common `icc` source tree, as shown below (by means of the `--with-build-root` flag), or the coprocessor source tree can be made separate:
> cp -rp c++ mic
Host:
- Change the directory to `<src>/ncbi-blast-2.2.30+-src/icc` and run
> ./configure --with-bin-release --without-debug --without-gui --with-mt --without-boost --without-z --without-bz2 --without-strip CXX=icpc CC=icc CFLAGS='-O3' CXXFLAGS='-O3' LDFLAGS='-O3' --with-symbols
The flag `--without-strip` is optional; in the file `ReleaseMT/build/Makefile.mk` it changes the value of the variable `CONF_STRIP` from `strip` to `@:`. The optional flag `--with-symbols` appends `-g` to all occurrences of `-O3` in the file `Makefile.mk` (essential to retain symbols, which are needed, for example, for Intel® VTune analysis). The flags `--without-boost --without-z --without-bz2` can be replaced by the single flag `--without-3psw`, which inhibits incorporating third-party software into the build. This has no effect on blastn/blastp performance; however, it inhibits building some of the test suites.
A debug build from the shared `icc` source tree is obtained by swapping `--without-debug` with `--with-debug` and adding `--with-build-root=DebugMT`:
> ./configure --with-bin-release --with-debug --without-gui --with-mt --without-boost --without-z --without-bz2 CXX=icpc CC=icc --with-build-root=DebugMT
- Edit the file `src/algo/blast/core/blast_gapalign.c` by inserting the line `#pragma intel optimization_level 1` just ahead of the `BLAST_GetGappedScore` function definition (at code line 3401 in BLAST version 30). Dropping the optimization level for this one function is needed for ICC 2015 to build a stable version of blastp (it is not needed for blastn); future versions of ICC will not need it.
- Step into the directory `ReleaseMT/build`
- Run (the flag `-k` is essential to ignore errors; `-j8` is optional):
> make -k all_r -j8
The executables (blastn, blastp, blastx, ...) are built in the directory `ReleaseMT/build/app/blast` and are copied to the directory `ReleaseMT/bin`.
Coprocessor:
- Change the directory to `<src>/ncbi-blast-2.2.30+-src/icc` and repeat the Host configure step above with the flags `--host=x86_64-k1om-linux` and `--with-build-root=ReleaseMIC` added, this time making the flags `--without-3psw` and `--without-strip` mandatory, i.e.
> ./configure --with-bin-release --without-debug --without-gui --with-mt --without-3psw --without-strip --with-symbols CXX=icpc CC=icc CFLAGS='-O3 -g -mmic' CXXFLAGS='-O3 -mmic' LDFLAGS='-O3 -mmic' --host=x86_64-k1om-linux --with-build-root=ReleaseMIC
Again, the flags `--without-strip` and `--with-symbols` are optional, and the same remark as in the Host configure step applies. The build root name `ReleaseMIC` is picked by the user. If the value of the flag `--host` is not recognized as a valid Intel® MIC architecture (`k1om` is the Intel® code name for the Knights Corner coprocessor), then it may need to be dropped, together with `-mmic`, and the file `ReleaseMIC/build/Makefile.mk` hand edited by appending `-mmic` to all occurrences of `-O3`.
In the case of a separate coprocessor `mic` source tree, skip the above, change the directory to `<src>/ncbi-blast-2.2.30+-src/mic` and execute the same `./configure` command; the flag `--with-build-root=ReleaseMIC` is now optional (the build root defaults to `ReleaseMT`).
- Set the environment variable `NCBI_DATATOOL_PATH` to point to the directory that contains the executables generated by the Host `make` step, i.e.
> export NCBI_DATATOOL_PATH=<src>/ncbi-blast-2.2.30+-src/icc/ReleaseMT/bin
- Edit the file `include/corelib/ncbifloat.h`: search for the line `# if __cplusplus >= 201103L && defined(_GLIBCXX_constexpr)` and append `&& !defined (__MIC__)` (line 71 in BLAST version 30; this has the effect of eliding `ISNAN_CONSTEXPR`).
- Change the directory to `ReleaseMIC/build`
- Build using the same `make` command as in the Host build above.
The coprocessor executables are built in the directory `ReleaseMIC/build/app/blast` and are copied to the directory `ReleaseMIC/bin`.
- If you don't see the executables in the directory `ReleaseMIC/bin`, execute:
> make -C corelib
> make -C util
> make -C serial
> make -C serial/datatool
> make -k all_r -j8
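Whether the executables actually landed in a build's `bin` directory can be checked with a few lines of shell. This is only a sketch: the helper name `missing_executables` is hypothetical, and the list of expected program names is whatever you pass in.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every expected executable that is
# missing (or not executable) in BIN_DIR. Returns 0 if all are present and
# 1 otherwise.
missing_executables() {
    bin_dir=$1
    shift
    rc=0
    for name in "$@"; do
        if [ ! -x "$bin_dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}

# Demo on a throwaway directory that contains blastn but not blastp.
demo_bin=$(mktemp -d)
: > "$demo_bin/blastn"
chmod +x "$demo_bin/blastn"
missing_executables "$demo_bin" blastn blastp || echo "some executables are missing"
rm -rf "$demo_bin"
```

Against a real build, the call would look like `missing_executables ReleaseMIC/bin blastn blastp`, run from the source tree that holds that build root.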
Run Directions
Heterogeneous Intel® Xeon®/Intel® Xeon Phi™ workload distribution model on a single node:
- Create a working directory `<work_dir>` on a local disk (host) and NFS mount it on the coprocessor. This process is explained in the Intel® Manycore Platform Software Stack (Intel® MPSS) documentation, i.e. `/opt/intel/mic/docs/MPSS_Boot_Config_Guide.pdf`
- Download the `refseq_rna.00`, `refseq_rna.01` and `refseq_rna.02` reference databases per the Code Access instructions above and place them in `<work_dir>/db`
- Download the sample database queries per the Code Access instructions above and place them in `<work_dir>/queries`
- Create a directory called `<work_dir>/m_queries/blastn`, where the concatenated queries formed from the 99 blastn queries contained in `<work_dir>/queries/blastn/` will be placed.
- Concatenate the first 80 of the 99 query files into a single multiple-query file, say `<work_dir>/m_queries/blastn/NM_80_all`, by stepping into the directory `<work_dir>/queries/blastn/` and running
> echo "cat " `ls *` "> ../../m_queries/blastn/NM_80_all" > nm_99_all.sh
Strip the tail-end 19 query file names by editing the file `nm_99_all.sh`, save it as `nm_80_all.sh`, and execute
> source nm_80_all.sh
- Concatenate the last 19 of the 99 queries into a single multiple-query file, for example `<work_dir>/m_queries/blastn/_NM_19_all`, in a similar manner to the previous step (this time delete all but the last 19 query file names).
- Copy the `blastn` executables obtained in the Host and Coprocessor builds above:
> cp <src>/ncbi-blast-2.2.30+-src/icc/ReleaseMT/bin/blastn <work_dir>/blastn
> cp <src>/ncbi-blast-2.2.30+-src/mic/ReleaseMT/bin/blastn <work_dir>/blastn_mic
If the coprocessor build was configured in the shared `icc` source tree, its executable is found in `icc/ReleaseMIC/bin` instead. Copying is optional, as both executables may be run from where they were built, provided the build directory is also visible from the coprocessor, i.e. is NFS mounted. Similarly, you may copy `libiomp5.so` for the Intel® Xeon Phi™ coprocessor:
> cp /opt/intel/composer_xe_2015/compiler/lib/mic/libiomp5.so <work_dir>
Copying the executables to a local disk `<work_dir>` is advisable as it may reduce NFS access overhead.
- Create a script to run the workload natively on the coprocessor and the host, say `blastn_mic_host.sh`, containing
    #!/bin/sh
    QUERY_MIC=$1
    THREADS_MIC=$2
    QUERY_HOST=$3
    THREADS_HOST=$4
    OUT_NAME=$5
    WORK_DIR=<work_dir>
    BATCH_SIZE=48000
    # DB_NAME is not set in this script; unless it is exported beforehand, it expands to an empty string
    OUT_NAME_MIC=${WORK_DIR}/${DB_NAME}_${QUERY_MIC}_mic.out
    OUT_NAME_HOST=${WORK_DIR}/${DB_NAME}_${QUERY_HOST}.out
    # ---------------------- running on coprocessor (mic1) ----------------------
    /usr/bin/time -p ssh mic1 "cd ${WORK_DIR}; export LD_LIBRARY_PATH=${WORK_DIR}; export BATCH_SIZE=${BATCH_SIZE}; ./blastn_mic -task blastn -use_index false -db 'db/refseq_rna.00 db/refseq_rna.01 db/refseq_rna.02' -query m_queries/blastn/${QUERY_MIC} -num_threads ${THREADS_MIC} > ${OUT_NAME_MIC}" &
    # ---------------------- running on host ----------------------
    cd ${WORK_DIR}; export BATCH_SIZE=${BATCH_SIZE}; /usr/bin/time -p ./blastn -task blastn -use_index false -db 'db/refseq_rna.00 db/refseq_rna.01 db/refseq_rna.02' -query m_queries/blastn/${QUERY_HOST} -num_threads ${THREADS_HOST} > ${OUT_NAME_HOST}
    # ---------------------- merging results (have to wait for host and coprocessor to finish!) ----------------------
    # cat ${OUT_NAME_MIC} ${OUT_NAME_HOST} > ${OUT_NAME}
    exit
The environment variable `BATCH_SIZE` is specific to BLAST internals and governs resource allocation critical to efficient multiple-query handling. In the script, remember to replace `<work_dir>` with an actual path.
To run an 80/19 query host/coprocessor distributed workload, for example, execute the following command from the host:
> ./blastn_mic_host.sh _NM_19_all 180 NM_80_all 48
where the coprocessor is given the 19 queries contained in the file `_NM_19_all`, running 180 threads, and the host is given the 80 queries contained in the file `NM_80_all`, running 48 threads. The final result of running the Intel® Xeon® processor/Intel® Xeon Phi™ coprocessor distributed workload is obtained by a merge, executed after all processes launched by the script above terminate:
> cat ${OUT_NAME_MIC} ${OUT_NAME_HOST} > NM_99_merged
where `${OUT_NAME_MIC}` and `${OUT_NAME_HOST}` are defined in the script above.
To keep the script `blastn_mic_host.sh` simple, the merge call is commented out, since waiting for both the coprocessor and the host tasks to finish before executing the merge requires additional logic.
- blastp can be run in a heterogeneous manner in exactly the same way.
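The wait-then-merge logic that the run script leaves out can be sketched in plain POSIX shell. This is a minimal illustration, not the author's implementation: the function name `run_and_merge` and its argument order are hypothetical, and the two stand-in commands in the demo take the place of the `ssh mic1 ...` and `./blastn ...` invocations.

```shell
#!/bin/sh
# Hypothetical helper: run_and_merge MIC_CMD HOST_CMD MIC_OUT HOST_OUT MERGED
# Starts both commands in the background, waits for both to finish, and only
# then concatenates the two outputs (coprocessor output first, matching the
# order of the cat command in the text).
run_and_merge() {
    mic_cmd=$1
    host_cmd=$2
    mic_out=$3
    host_out=$4
    merged=$5

    sh -c "$mic_cmd" > "$mic_out" &
    mic_pid=$!
    sh -c "$host_cmd" > "$host_out" &
    host_pid=$!

    # "wait PID" blocks until that job ends and returns its exit status.
    wait "$mic_pid"
    mic_status=$?
    wait "$host_pid"
    host_status=$?

    if [ "$mic_status" -ne 0 ] || [ "$host_status" -ne 0 ]; then
        echo "a job failed (mic=$mic_status host=$host_status); not merging" >&2
        return 1
    fi
    cat "$mic_out" "$host_out" > "$merged"
}

# Demo with stand-in commands; the slower "coprocessor" job still comes first
# in the merged file because the merge happens only after both have finished.
work=$(mktemp -d)
run_and_merge "sleep 1; echo mic-result" "echo host-result" \
    "$work/mic.out" "$work/host.out" "$work/merged.out"
cat "$work/merged.out"
rm -rf "$work"
```

Checking both exit statuses before the merge avoids producing a merged file that silently lacks one side's results.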
Expected performance (in seconds)
The results of running in heterogeneous mode as per the Run Directions above are presented below. In this experiment, the total list of 99 queries is split into 80 concatenated queries given to the host and the remainder of the queries given to the coprocessor, both in a multiple-query, multiple-db-volume invocation model.
A user running this model should be cautioned that coprocessor internal memory saturation issues may arise (excessive runtime, crashes or hangs) when the combined size of the subject database volumes, the blastn/blastp executable and the runtime memory demand comes close to or exceeds the available coprocessor RAM. The three subject db volumes used (refseq_rna.00, refseq_rna.01 and refseq_rna.02) have a combined size of ~4 GB; together with the executable size (0.25 GB) and the runtime heap memory demand (~2 GB) this adds up to ~6.25 GB, which is well within the available coprocessor memory (the coprocessor we used had 16 GB of RAM). If the available RAM would be exceeded, the larger BLAST query/db problem has to be subdivided into sub-problems that fit into the available coprocessor RAM.
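The memory budget above is simple arithmetic and can be scripted as a sanity check before launching a run. The helper name `fits_in_ram` is hypothetical; the figures in the example call are the approximate ones quoted in the text, and the sketch applies no safety margin, so a total that comes close to the limit should still be treated with caution.

```shell
#!/bin/sh
# Hypothetical helper: fits_in_ram RAM_GB SIZE_GB...
# Adds up the given sizes (in GB) and reports whether the total stays below
# the available coprocessor RAM.
fits_in_ram() {
    ram=$1
    shift
    echo "$@" | awk -v ram="$ram" '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        printf "%.2f GB of %.2f GB: %s\n", total, ram,
               (total < ram ? "fits" : "does not fit")
    }'
}

# Figures from the text: db volumes ~4 GB, executable 0.25 GB, heap ~2 GB,
# coprocessor with 16 GB of RAM.
fits_in_ram 16 4 0.25 2   # prints "6.25 GB of 16.00 GB: fits"
```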
Also, for accurate runtime performance analysis, the first runs on the coprocessor(s) and the host should be discarded, as their runtimes are inflated by the time it takes to build the NFS cache in coprocessor and host RAM, largely due to retrieval of the NFS-mounted db volumes.
query split (host/coprocessor) | Col B: host, 48 threads | Col C: coprocessor, 180 threads | Col D: min completion time, max(Col B, Col C) | Col E: speedup |
---|---|---|---|---|
99/0 | 54.5s | 0s | 54.5s | |
81/18 | 43.8s | 39s | 43.8s | |
80/19 | 43.3s | 41s | 43.3s | 1.26 |
79/20 | 42.8s | 43.7s | 43.7s | |
78/21 | 42.7s | 44.2s | 44.2s | |
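The derived columns of the table can be recomputed with a short sketch. It assumes that the completion time of a split is the larger of the host and coprocessor runtimes, and that the speedup is the host-only time (the 99/0 row) divided by that completion time; this assumption reproduces the 1.26 reported for the 80/19 split. The function names are hypothetical.

```shell
#!/bin/sh
# completion_time HOST_S MIC_S
# A heterogeneous run is finished only when the slower side is finished.
completion_time() {
    awk -v h="$1" -v m="$2" 'BEGIN { printf "%s\n", (h > m ? h : m) }'
}

# speedup BASELINE_S HOST_S MIC_S
# Assumed definition: host-only runtime divided by the completion time.
speedup() {
    awk -v b="$1" -v h="$2" -v m="$3" \
        'BEGIN { t = (h > m ? h : m); printf "%.2f\n", b / t }'
}

completion_time 43.3 41    # 80/19 split: prints 43.3
completion_time 42.8 43.7  # 79/20 split: prints 43.7
speedup 54.5 43.3 41       # 80/19 split versus host-only: prints 1.26
```

The 80/19 split gives the shortest completion time in the table because it is the point where host and coprocessor runtimes are closest to balanced without the coprocessor becoming the slower side.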
Platform Configurations
Component | Details |
---|---|
Platform | Intel R2208GZ4GC platform, 2U chassis, hot-swap drives, 24 DIMMs, 1 750W redundant power supply |
CPU/Stepping | Xeon E5-2697 v2 2.7 GHz, 12 core, 8GT/s dual QPI links, 130 W, 3.5GHz Max Turbo Frequency, 768kB instr L1 / 3072kB L2 / 30MB L3 cache |
Coprocessor | Intel Xeon Phi 7110 and 7120; 61 cores, 1.1 and 1.238 GHz; ECC enabled, TURBO disabled. Software details: MPSS version 2.1.6720-15, Flash version 2.1.03.0386 |
Memory | Hynix HMT31GR7BFR-PB, 64GB total, 8*8GB 1600MHz Reg ECC DDR3 |
Chipset | Rev 4.6 SE5C600.86B.99.99.x069.071520130923 |
BIOS | BIOS configuration: default except Turbo enabled, EIST enabled, SMT enabled, NUMA enabled, memory speed 1600MHz. Note: all prefetchers enabled (this is the default) |
GigE Node Adapter | Intel Ethernet Controller I350 (rev 01), 4 Gigabit network connections, only one connection in use |
IB switch | 36 port switch/rack Mellanox FDR (model MSX6025F-1BFR), firmware version 9_2_4002 |
IB adapters | MCX353A-FCAT memfree PCI-Express x8 FDR InfiniBand 8x, firmware version 2.30.3200 |
HDD Specs | SEAGATE ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM |
OS | RHEL 6.4 |