Intel(R) Cluster Studio XE 2013 is a tool suite for developing applications. It includes the low-latency Intel MPI Library, high-performance C++ and Fortran compilers, a native profiling component (VTune(TM) Amplifier XE 2013), a node-level analysis component (Intel(R) Trace Collector/Analyzer), and a threading and memory correctness component (Inspector XE 2013).
The purposes of this article are to:
- Get familiar with using Intel® Software Development Products on the Intel® Xeon Phi™ coprocessor
- Learn the different usage modes for development
- Get familiar with Intel® Trace Collector/Analyzer and VTune™ Amplifier XE
Notes:
1. All demo code is attached in a zip file, so you can practice the demos below
2. Use amplxe-gui to open VTune results. Some screenshots are shown in the demos
Intel® Xeon Phi™ coprocessor software configuration
Key features of the Intel® Xeon Phi™ Coprocessor:
- 50+ cores which run the Intel instruction set architecture
- 4 threads per physical core
- 512-bit registers for SIMD operations (vector operations)
- 512 KB L2 cache per core
- High-speed bi-directional ring connecting the 50+ cores
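The per-core figures above translate into thread and vector-lane counts with simple arithmetic. The sketch below assumes a hypothetical 60-core coprocessor (the list only says 50+), and the function names are illustrative, not part of any Intel tool.

```shell
# Hardware threads = cores x 4 threads per physical core (from the list above).
# The core count is an argument because it differs between coprocessor models.
phi_hw_threads() {
    echo $(( $1 * 4 ))
}

# Number of elements that fit in one 512-bit SIMD register,
# given the element width in bits.
simd_lanes() {
    echo $(( 512 / $1 ))
}

phi_hw_threads 60   # hypothetical 60-core part: prints 240
simd_lanes 32       # single-precision floats: prints 16
simd_lanes 64       # double-precision floats: prints 8
```

These numbers give a rough upper bound when choosing values such as OMP_NUM_THREADS for the demos.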
Getting Ready…
- Ensure the Xeon Phi™ coprocessor is running
- Use “service mpss status” to check
- Use “service mpss start” to start it if it is stopped
- Install Intel(R) Cluster Studio XE 2013
- Install the VTune™ Amplifier driver on the Phi coprocessor
- Check that the driver is loaded on the Phi coprocessor
# ssh mic0
# lsmod | grep sep3
e.g.: sep3_8 45016 0
If the driver is not installed:
# cd vtune_root/bin64/k1om/
# ./sep_micboot_install.sh
Then use “service mpss restart” to restart MPSS
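The driver check above can be scripted. The following is a minimal sketch: the helper name sep_driver_loaded is hypothetical, and it only inspects lsmod output passed on standard input, so it does not contact the coprocessor by itself.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and succeeds
# if a module whose name starts with "sep3" is listed.
sep_driver_loaded() {
    grep -q '^sep3'
}

# Example using the sample line shown above. On a real system the input
# would come from `ssh mic0 lsmod` (assumes ssh access to mic0).
if printf 'sep3_8 45016 0\n' | sep_driver_loaded; then
    echo "sep driver loaded"
else
    echo "sep driver missing: run sep_micboot_install.sh, then restart mpss"
fi
```

On the host, a typical call would be: ssh mic0 lsmod | sep_driver_loaded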
Setting environment variables and copying runtime libraries to the coprocessor
- source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
- source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
- source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
- source /opt/intel/itac/8.1.0.024/bin/itacvars.sh impi4
- export I_MPI_MIC=1
- export I_MPI_FABRICS=shm:tcp
- export VT_LOGFILE_FORMAT=stfsingle
- scp -r /opt/intel/composer_xe_2013.2.146/compiler/lib/mic/* mic0:/lib64/
- scp -r /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin/
- scp -r /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64/
- scp -r /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/* mic0:/lib64
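Before running the demos it can help to confirm that the exports above took effect in the current shell. This is a minimal sketch; the function name missing_env_vars is hypothetical and the variable list simply mirrors the three exports above.

```shell
# Print the name of every required variable that is unset or empty.
# Note: uses the shell-global helper variables "name" and "value".
missing_env_vars() {
    for name in I_MPI_MIC I_MPI_FABRICS VT_LOGFILE_FORMAT; do
        # Indirect lookup that also works in plain POSIX sh.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing=$(missing_env_vars)
if [ -n "$missing" ]; then
    echo "not set:" $missing
else
    echo "environment looks complete"
fi
```

The check does not verify that the sourced *vars.sh scripts ran; it only covers the variables exported by hand.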
Demo #1, OpenMP* program on Xeon Phi coprocessor
Compile OpenMP code for Xeon Phi Coprocessor
# icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o omp_pi.MIC
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
Copy binary to the target device
# scp omp_pi.MIC mic0:/root
omp_pi.MIC 100% 20KB 19.7KB/s 00:00
Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/omp_pi.MIC
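The three steps above (compile, copy, collect) follow a fixed pattern, so they can be generated from the source file name. The sketch below only prints the commands and does not run them; both function names are hypothetical, and the flags and paths are copied from the commands above.

```shell
# Hypothetical helper: derive the coprocessor binary name used in the demos
# (source file name without its extension, plus ".MIC").
mic_binary_name() {
    base=${1##*/}          # strip any directory part
    echo "${base%.*}.MIC"  # replace the extension with .MIC
}

# Print (do not run) the three steps for a native OpenMP C source file.
native_demo_steps() {
    bin=$(mic_binary_name "$1")
    echo "icc -g -O3 -mmic -openmp -openmp-report $1 -o $bin"
    echo "scp $bin mic0:/root"
    echo "amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/$bin"
}

native_demo_steps omp_pi.c
```

Printing the commands first makes it easy to review them before pasting them into a shell on the host.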
Demo #2, Intel® TBB built program on Xeon Phi coprocessor
Compile TBB code for Xeon Phi Coprocessor
# icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/libtbb_debug.so.2 tbb_pi.cpp -o tbb_pi.MIC -lpthread
Copy binary to the target device
# scp tbb_pi.MIC mic0:/root
tbb_pi.MIC 100% 91KB 90.8KB/s 00:00
Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/tbb_pi.MIC
Demo #3, “Offload” program on Xeon Phi coprocessor
Compile “offload” code for Xeon Phi Coprocessor
# icc -g -O3 -openmp -openmp-report offload_pi.c -o offload_pi
offload_pi.c(18): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
offload_pi.c(18): (col. 9) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
Use VTune™ Amplifier XE to analyze
# amplxe-cl -collect knc-lightweight-hotspots -- ./offload_pi
Demo #4, MPI program on Xeon Phi coprocessor
Compile MPI code for Xeon and Xeon Phi Coprocessor
# mpiicc -g -openmp -O3 -o test-openmp test-openmp.c
# mpiicc -g -openmp -mmic -O3 -o test-openmp.MIC test-openmp.c
Copy binary to the target device
# scp test-openmp.MIC mic0:/root
test-openmp.MIC 100% 17KB 17.2KB/s 00:00
First, run the Intel MPI tests on the host and on the coprocessor separately:
# mpirun -host `hostname` -n 2 ./test-openmp
# mpirun -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
Run the MPI program in hybrid mode (host and coprocessor together):
# mpirun -env OMP_NUM_THREADS 2 -host `hostname` -n 2 ./test-openmp : -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
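The hybrid command line consists of two argument sets separated by a colon, one for the host and one for mic0. The sketch below assembles that line from rank and thread counts so the two halves are easy to vary; hybrid_mpirun_cmd is a hypothetical helper that only prints the command, and the binary names are the ones used above.

```shell
# Hypothetical helper: print (do not run) a hybrid mpirun command line.
# Arguments: host name, host ranks, host threads per rank,
#            coprocessor ranks, coprocessor threads per rank.
hybrid_mpirun_cmd() {
    echo "mpirun -env OMP_NUM_THREADS $3 -host $1 -n $2 ./test-openmp" \
         ": -env OMP_NUM_THREADS $5 -host mic0 -n $4 /root/test-openmp.MIC"
}

# Same rank and thread counts as the command above.
hybrid_mpirun_cmd "$(hostname)" 2 2 2 4
```

To execute the generated line instead of printing it, the output could be passed to eval, after checking it by eye.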
Demo #5, Use VTune™ Amplifier XE to analyze
Compile MPI code for Xeon Phi™ Coprocessor
# make clean && make MIC
Copy binary to the target device
# scp poisson.MIC mic0:/root
Run the program on the coprocessor under VTune™ Amplifier XE:
# amplxe-cl -collect knc-general-exploration -cpu-mask=1-64 --search-dir all:rp=. -- ssh mic0 OMP_NUM_THREADS=64 /root/poisson.MIC -n 3500 -iter 10
Demo #6, Intel Trace Collector / Analyzer
Compile MPI code for Xeon and Xeon Phi™ Coprocessor
# make clean && make
# make clean && make MIC
Note: the Makefile uses the “-tcollect” option
Copy binary to the target device
# scp poisson.MIC mic0:/root
Run the MPI program to collect the trace:
# export VT_LOGFILE_FORMAT=stfsingle
# mpirun -env OMP_NUM_THREADS 1 -host `hostname` -n 2 ./poisson -n 3500 -iter 10 : -env OMP_NUM_THREADS 1 -host mic0 -n 6 /root/poisson.MIC -n 3500 -iter 10
Open the resulting trace file in Intel Trace Analyzer:
# traceanalyzer poisson.single.stf