The Intel® MPI Library and OpenMP* runtime libraries can create affinities between processes or threads and hardware resources. This affinity keeps an MPI process or OpenMP thread from migrating to a different hardware resource, which can have a dramatic effect on the execution speed of a program.
Hardware Threading
The Intel® Xeon Phi™ processor x200 (code-named Knights Landing) supports up to four hardware thread contexts per core. Two cores that share a single level 2 cache form one tile, as shown in Figure 1.
Figure 1: An Intel® Xeon Phi™ processor x200 tile has two cores, four vector processing units, a 1 MB cache shared by the two cores, and a cache home agent.
Additional hardware threads help hide latencies: while one hardware thread is stalled, another can be scheduled on the core. The optimal number of hardware threads per core or per tile depends on the application. Some applications may benefit from running only one thread per tile. All examples in this paper assume an Intel Xeon Phi processor with 34 tiles (68 cores).
OpenMP Thread Affinity
OpenMP separates allocating hardware resources from pinning threads to the hardware resources.
Intel compilers (version 13.0 and later) support both the OpenMP 4 affinity settings and the Intel OpenMP runtime extensions. The following settings allocate hardware resources and pin OpenMP threads to those resources.
| | OpenMP* 4 Affinity | Intel OpenMP Runtime Extensions |
|---|---|---|
| Allocate hardware threads | OMP_PLACES | KMP_PLACE_THREADS |
| Pin OpenMP threads to hardware threads | OMP_PROC_BIND | KMP_AFFINITY |
Thread Affinity Using Intel OpenMP Runtime Extensions
KMP_PLACE_THREADS controls allocation of hardware resources. An OpenMP application may be assigned a number of cores and a number of threads per core. The letter C indicates cores, and T indicates threads. For example, 68c,4t specifies four threads per core on 68 cores, and 34c,2t specifies two threads per core on 34 cores.
KMP_AFFINITY controls how OpenMP threads are bound to resources. Common choices are COMPACT, SCATTER, and BALANCED. The granularity can be set to CORE or THREAD. The affinity choices are illustrated in Figure 2, Figure 3, and Figure 4.
Figure 2: KMP_AFFINITY=compact
Figure 3: KMP_AFFINITY=balanced
Figure 4: KMP_AFFINITY=scatter
A full explanation of KMP_PLACE_THREADS and KMP_AFFINITY is available in the Thread Affinity Interface section of the Intel compiler documentation. For Intel Xeon Phi processors x200, the Intel compilers default to KMP_AFFINITY=compact. For Intel® Xeon® processors, the default setting is KMP_AFFINITY=none.
The following examples demonstrate how to pin OpenMP threads to a specific number of threads per tile or core on Intel Xeon Phi processors x200 using Intel OpenMP runtime extensions on a Linux* system.
The default affinity KMP_AFFINITY=compact is assumed.
1 thread per tile
KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit
1 thread per core
KMP_PLACE_THREADS=1T
2 threads per core
KMP_PLACE_THREADS=2T
3 threads per core
KMP_PLACE_THREADS=3T
4 threads per core
Default, no additional setting needed
Tip: Use the KMP_AFFINITY VERBOSE modifier to see how threads are mapped to OS processors. This modifier also shows how the OS processors map to physical cores.
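Beyond the VERBOSE modifier, the binding can also be checked from inside an application. The following is a minimal sketch (not from the original examples), assuming a Linux system and an OpenMP-enabled compile such as icc -qopenmp; each thread asks the OS which processor it is running on:

```c
/* Minimal affinity check: each OpenMP thread reports the OS processor it is
 * currently running on. Compare the output with KMP_AFFINITY=verbose.
 * Hypothetical build line: icc -qopenmp -o check_affinity check_affinity.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("OpenMP thread %3d of %3d is running on OS processor %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```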
The same settings work when undersubscribing the cores.
1 thread per tile
KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit OMP_NUM_THREADS=4
1 thread per core
KMP_PLACE_THREADS=1T OMP_NUM_THREADS=8
2 threads per core
KMP_PLACE_THREADS=2T OMP_NUM_THREADS=16
3 threads per core
KMP_PLACE_THREADS=3T OMP_NUM_THREADS=24
4 threads per core
OMP_NUM_THREADS=32
Thread Affinity Using OpenMP 4 Affinity
Version 4 of the OpenMP standard introduced affinity settings controlled by the OMP_PLACES and OMP_PROC_BIND environment variables. OMP_PLACES specifies the hardware resources. Its value is either an abstract name describing a list of places or, less commonly, an explicit list of places; common abstract names are THREADS and CORES. OMP_PROC_BIND controls how OpenMP threads are bound to those places. Common values for OMP_PROC_BIND are CLOSE and SPREAD.
The following examples show how to run an OpenMP threaded application using one to four hardware threads per core using OpenMP 4 affinity.
1 thread per tile
OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=34
1 thread per core
OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=68
2 threads per core
OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=136
3 threads per core
OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=204
4 threads per core
OMP_PROC_BIND=close OMP_PLACES=threads
These examples show how to undersubscribe the cores using OpenMP 4 affinity.
1 thread per tile
OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=4
1 thread per core
OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=8
2 threads per core
OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=16
3 threads per core
OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=24
4 threads per core
OMP_PROC_BIND=close OMP_PLACES=threads OMP_NUM_THREADS=32
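With OpenMP 4 affinity, the runtime itself can report which place each thread was bound to. The sketch below is an illustration, not part of the original examples; it assumes an OpenMP 4.5 (or newer) runtime that provides the place query routines omp_get_num_places() and omp_get_place_num(), plus Linux sched_getcpu():

```c
/* Print the place each OpenMP thread is bound to, plus the OS processor it
 * is running on. Requires an OpenMP 4.5 (or newer) runtime for the place
 * query routines. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    printf("place list contains %d places\n", omp_get_num_places());
    #pragma omp parallel
    {
        printf("thread %3d -> place %3d (OS processor %3d)\n",
               omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
    }
    return 0;
}
```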
Nested Thread Affinity Using OpenMP 4 Affinity Settings
When an application has more than one level of OpenMP threading, additional per-level values are specified for OMP_NUM_THREADS and OMP_PROC_BIND. The following example executes nested threads using one hardware thread per core. For additional hardware threads, increase the second value of OMP_NUM_THREADS to 4, 6, or 8.
OMP_NESTED=1
OMP_MAX_ACTIVE_LEVELS=2
KMP_HOT_TEAMS_MODE=1
KMP_HOT_TEAMS_MAX_LEVEL=2
OMP_NUM_THREADS=34,2
OMP_PROC_BIND=spread,spread
OMP_PLACES=cores
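For illustration only (not the article's application), a minimal C sketch of the two-level nested region these settings would control: 34 outer threads, each spawning an inner team of two.

```c
/* Two-level nested OpenMP region matching OMP_NUM_THREADS=34,2:
 * 34 outer threads (one per tile), each creating an inner team of 2. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel                /* outer level: 34 threads */
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel            /* inner level: 2 threads per outer thread */
        {
            printf("outer thread %2d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}
```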
MPI Library Affinity
MPI library affinity is controlled by the I_MPI_PIN_PROCESSOR_LIST environment variable. Its value may be an explicit list of logical processors or a processor set defined by keywords. Common keywords include ALL, ALLCORES, GRAIN, and SHIFT.
- ALL specifies all logical processors, including the hardware threads.
- ALLCORES specifies the physical cores.
- GRAIN specifies the pinning granularity.
- SHIFT specifies the granularity of the round-robin scheduling in GRAIN units.
The following are examples of how to run an MPI executable with one rank per tile, and one, two, or four ranks per core.
1 rank per tile
mpirun -perhost 34 -env I_MPI_PIN_PROCESSOR_LIST all:shift=cache2
1 rank per core
mpirun -perhost 68 -env I_MPI_PIN_PROCESSOR_LIST allcores
2 ranks per core
mpirun -perhost 136 -env I_MPI_PIN_PROCESSOR_LIST all:grain=2,shift=2
4 ranks per core
mpirun -perhost 272 -env I_MPI_PIN_PROCESSOR_LIST all
Tips:
- Set I_MPI_DEBUG to 4 or higher to see how ranks are mapped to OS processors.
- The Intel MPI Library cpuinfo utility shows how the OS processors map to physical cores and caches.
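The pinning can also be verified from inside the application. Below is a minimal sketch (an illustration, not from the original text), assuming Linux and an Intel MPI compiler wrapper such as mpiicc; each rank prints the OS processors in its affinity mask:

```c
/* Each MPI rank prints the OS processors it is allowed to run on, as set by
 * the MPI library's pinning.
 * Hypothetical build line: mpiicc -o rank_affinity rank_affinity.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Read the affinity mask applied to this rank. */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    printf("rank %3d may run on OS processors:", rank);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}
```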
Intel® MPI Library Interoperability with OpenMP
Intel® MPI and OpenMP affinity settings may be combined for hybrid execution. When all cores are used, specifying one to four hardware threads per core is straightforward, as the following examples show, using either the Intel OpenMP runtime extensions or OpenMP 4 affinity.
Intel MPI/OpenMP affinity examples using Intel OpenMP runtime extensions, for Intel Xeon Phi processors x200:
1 thread per core
mpirun -env KMP_PLACE_THREADS 1T
2 threads per core
mpirun -env KMP_PLACE_THREADS 2T
3 threads per core
mpirun -env KMP_PLACE_THREADS 3T
4 threads per core
Default, no extra settings needed
Intel MPI/OpenMP affinity examples using OpenMP 4 affinity, for Intel Xeon Phi processors x200:
1 thread per tile
mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 8
1 thread per core
mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 17
2 threads per core
mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 34
4 threads per core
mpirun -env OMP_PROC_BIND close -env OMP_PLACES threads
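To see the combined effect of the MPI and OpenMP settings, the two checks shown earlier can be merged into one hybrid program. A minimal sketch, again an illustration assuming Linux and a build such as mpiicc -qopenmp:

```c
/* Hybrid affinity check: every OpenMP thread of every MPI rank reports the
 * OS processor it is running on.
 * Hypothetical build line: mpiicc -qopenmp -o hybrid_affinity hybrid_affinity.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %3d, thread %3d -> OS processor %3d\n",
           rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}
```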
Intel MPI also provides an environment variable, I_MPI_PIN_DOMAIN, for use with executables launching both MPI ranks and OpenMP threads. The variable is used to define a number of non-overlapping subsets of logical processors, binding one MPI rank to each of these domains. An explicit domain binding is especially useful for undersubscribing the cores. The following examples run a hybrid MPI/OpenMP executable on fewer than the 68 cores of the Intel Xeon Phi processor x200.
1 thread per core on 2 quadrants
mpirun -perhost 2 -env I_MPI_PIN_DOMAIN 68 -env KMP_PLACE_THREADS 1T
12 ranks, 1 rank per tile, 2 threads per core
mpirun -perhost 12 -env I_MPI_PIN_DOMAIN 8 -env KMP_PLACE_THREADS 2T
Tip: When I_MPI_PIN_DOMAIN is set, I_MPI_PIN_PROCESSOR_LIST is ignored.
Future
The Intel MPI library and Intel OpenMP runtime extensions will be extended in 2016 and 2017 to simplify placing processes and threads on Intel Xeon Phi processor x200 NUMA domains.
Conclusion
The Intel MPI Library and OpenMP runtimes provide mechanisms to bind MPI ranks and OpenMP threads to specific processors. Our examples showed how to experiment with different core and hardware thread configurations on the Intel® Xeon Phi™ processor x200 (code-named Knights Landing). Following the examples, we can discover whether an application performs best with one, two, three, or four hardware threads per core, and we can look for the optimal combination of MPI ranks and OpenMP threads.
More Information
Intel® Fortran Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_f
Intel® C++ Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_c
OpenMP* 4.0 Complete Specifications, http://openmp.org
Intel® MPI Library Developer Reference for Linux* OS, Process Pinning, https://software.intel.com/en-us/intel-mpi-library/documentation
Intel® MPI Library Developer Reference for Linux* OS, Interoperability with OpenMP* API, https://software.intel.com/en-us/mpi-refman-lin-html
Using Nested Parallelism In OpenMP, https://software.intel.com/en-us/videos/using-nested-parallelism-in-openmp
Beginning Hybrid MPI/OpenMP Development, https://software.intel.com/en-us/articles/beginning-hybrid-mpiopenmp-development