Process and Thread Affinity for Intel® Xeon Phi™ Processors x200

The Intel® MPI Library and OpenMP* runtime libraries can create affinities between processes or threads, and hardware resources. This affinity keeps an MPI process or OpenMP thread from migrating to a different hardware resource, which can have a dramatic effect on the execution speed of a program.

Hardware Threading

The Intel® Xeon Phi™ processor x200 (code-named Knights Landing) supports up to four hardware thread contexts per core. Two cores sharing a single level 2 cache comprise one tile, as in Figure 1.


Figure 1: An Intel® Xeon Phi™ processor x200 tile has two cores, four vector processing units, a 1 MB level 2 cache shared by the two cores, and a cache home agent.

Additional hardware threads help hide latencies: while one hardware thread is stalled, another can be scheduled on the core. The optimal number of hardware threads per core or per tile depends on the application; some applications may even benefit from running only one thread per tile. All examples in this paper assume an Intel Xeon Phi processor with 34 tiles (68 cores).
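Before experimenting with affinity settings, it can help to confirm how many logical processors the OpenMP runtime actually sees. The following minimal sketch (the file name topo_check.c is hypothetical) assumes only a standard OpenMP runtime; on a 68-core Intel Xeon Phi processor x200 with four hardware threads per core it should report 272 logical processors.

#include <omp.h>
#include <stdio.h>

/* Report the logical processor count and default team size seen by the OpenMP runtime. */
int main(void)
{
    printf("Logical processors visible to OpenMP: %d\n", omp_get_num_procs());
    printf("Default threads for a parallel region: %d\n", omp_get_max_threads());
    return 0;
}

Compile it with the Intel compiler's OpenMP option, for example icc -qopenmp topo_check.c.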

OpenMP Thread Affinity

OpenMP separates allocating hardware resources from pinning threads to the hardware resources.

Intel compilers support both OpenMP 4 affinity settings (as of version 13.0) and the Intel OpenMP runtime extensions. The following settings are used to allocate hardware resources and pin OpenMP threads to hardware resources.

 

                                         OpenMP* 4 Affinity    Intel OpenMP Runtime Extensions

Allocate hardware threads                OMP_PLACES            KMP_PLACE_THREADS

Pin OpenMP threads to hardware threads   OMP_PROC_BIND         KMP_AFFINITY

Thread Affinity Using Intel OpenMP Runtime Extensions

KMP_PLACE_THREADS controls allocation of hardware resources. An OpenMP application may be assigned a number of cores and a number of threads per core. The letter C indicates cores, and T indicates threads. For example, 68c,4t specifies four threads per core on 68 cores, and 34c,2t specifies two threads per core on 34 cores.

KMP_AFFINITY controls how OpenMP threads are bound to resources. Common choices are COMPACT, SCATTER, and BALANCED. The granularity can be set to CORE or THREAD. The affinity choices are illustrated in Figure 2, Figure 3, and Figure 4.


Figure 2:KMP_AFFINITY=compact


Figure 3:KMP_AFFINITY=balanced


Figure 4:KMP_AFFINITY=scatter

A full explanation of KMP_PLACE_THREADS and KMP_AFFINITY is available in the Thread Affinity Interface section of the Intel compiler documentation. For Intel Xeon Phi processors x200, the Intel compilers default to KMP_AFFINITY=compact. For Intel® Xeon® processors, the default setting is KMP_AFFINITY=none.

The following examples demonstrate how to pin OpenMP threads to a specific number of threads per tile or core on Intel Xeon Phi processors x200 using Intel OpenMP runtime extensions on a Linux* system.

The default affinity KMP_AFFINITY=compact is assumed.

1 thread per tile

KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit

1 thread per core

KMP_PLACE_THREADS=1T

2 threads per core

KMP_PLACE_THREADS=2T

3 threads per core

KMP_PLACE_THREADS=3T

4 threads per core

Default, no additional setting needed

Tip: Use the KMP_AFFINITY VERBOSE modifier to see how threads are mapped to OS processors. This modifier also shows how the OS processors map to physical cores.
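The verbose mapping can also be cross-checked from inside a program. The sketch below (the file name affinity_check.c is hypothetical; the code is Linux-only because sched_getcpu() is a glibc extension) has each OpenMP thread report the OS processor it is running on.

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Output order is not deterministic; lines from different threads may interleave. */
        printf("OpenMP thread %3d of %3d runs on OS processor %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Running it with, for example, KMP_PLACE_THREADS=2T and KMP_AFFINITY=verbose,compact should show two OpenMP threads placed on each core, matching the verbose output.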

 

The same settings also work when undersubscribing the cores; add OMP_NUM_THREADS to limit the total number of threads.

1 thread per tile

KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit OMP_NUM_THREADS=4

1 thread per core

KMP_PLACE_THREADS=1T OMP_NUM_THREADS=8

2 threads per core

KMP_PLACE_THREADS=2T OMP_NUM_THREADS=16

3 threads per core

KMP_PLACE_THREADS=3T OMP_NUM_THREADS=24

4 threads per core

OMP_NUM_THREADS=32

Thread Affinity Using OpenMP 4 Affinity

Version 4 of the OpenMP standard introduced affinity settings controlled by the OMP_PLACES and OMP_PROC_BIND environment variables. OMP_PLACES specifies the hardware resources; its value can be either an abstract name describing a list of places or (less commonly) an explicit list of places. The abstract names used here are THREADS and CORES. OMP_PROC_BIND controls how OpenMP threads are bound to those places; common values include CLOSE and SPREAD.
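Whether the resulting place list matches expectations can be checked with the OpenMP place-query routines. The sketch below (the file name places_check.c is hypothetical) assumes an OpenMP 4.5-capable compiler and runtime, since omp_get_num_places() and omp_get_place_num() are not part of OpenMP 4.0.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Number of places built from OMP_PLACES. */
    printf("Number of places: %d\n", omp_get_num_places());

    #pragma omp parallel
    {
        /* Place each thread was bound to according to OMP_PROC_BIND. */
        printf("Thread %3d is bound to place %3d\n",
               omp_get_thread_num(), omp_get_place_num());
    }
    return 0;
}

With OMP_PLACES=threads on the processor described above, the program should report 272 places; OMP_PROC_BIND=spread or close then determines how the threads are distributed across them.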

The following examples show how to run an OpenMP threaded application using one to four hardware threads per core using OpenMP 4 affinity.

1 thread per tile

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=34

1 thread per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=68

2 threads per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=136

3 threads per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=204

4 threads per core

OMP_PROC_BIND=close OMP_PLACES=threads

 

These examples show how to undersubscribe the cores using OpenMP 4 affinity.

1 thread per tile

OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=4

1 thread per core

OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=8

2 threads per core

OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=16

3 threads per core

OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=24

4 threads per core

OMP_PROC_BIND=close OMP_PLACES=threads OMP_NUM_THREADS=32

Nested Thread Affinity Using OpenMP 4 Affinity Settings

When an application has more than one level of OpenMP threading, additional values are specified for OMP_PLACES and OMP_NUM_THREADS. The following example executes nested threads using one hardware thread per core. For additional hardware threads, increase the second value of OMP_NUM_THREADS to 4, 6, or 8.

OMP_NESTED=1

OMP_MAX_ACTIVE_LEVELS=2

KMP_HOT_TEAMS_MODE=1

KMP_HOT_TEAMS_MAX_LEVEL=2

OMP_NUM_THREADS=34,2

OMP_PROC_BIND=spread,spread

OMP_PLACES=cores
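For reference, the nested structure these settings target looks like the following sketch (the file name nested.c is hypothetical; sched_getcpu() is Linux/glibc-specific). With OMP_NUM_THREADS=34,2 the outer region forks 34 threads, and each outer thread forks an inner team of two.

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel          /* outer level: spread across the core places */
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel      /* inner level: threads within each outer thread's partition */
        {
            printf("outer %2d / inner %d on OS processor %3d\n",
                   outer, omp_get_thread_num(), sched_getcpu());
        }
    }
    return 0;
}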

MPI Library Affinity

MPI library affinity is controlled by the environment variable I_MPI_PIN_PROCESSOR_LIST. Its value may be an explicit list of logical processors or a processor set defined by keywords. Common keywords include ALL, ALLCORES, GRAIN, and SHIFT.

  • ALL specifies all logical processors, including the hardware threads.
  • ALLCORES specifies the physical cores.
  • GRAIN specifies the pinning granularity.
  • SHIFT specifies the granularity of the round-robin scheduling in GRAIN units.

The following are examples of how to run an MPI executable with one rank per tile, and one, two, or four ranks per core.

1 rank per tile

mpirun -perhost 34 -env I_MPI_PIN_PROCESSOR_LIST all:shift=cache2

1 rank per core

mpirun -perhost 68 -env I_MPI_PIN_PROCESSOR_LIST allcores

2 ranks per core

mpirun -perhost 136 -env I_MPI_PIN_PROCESSOR_LIST all:grain=2,shift=2

4 ranks per core

mpirun -perhost 272 -env I_MPI_PIN_PROCESSOR_LIST all

Tips:

  • Set I_MPI_DEBUG to 4 or higher to see how ranks are mapped to OS processors.
  • The cpuinfo utility provided with the Intel MPI Library shows how the OS processors map to physical cores and caches.
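Pinning can also be verified from inside the application. The sketch below (the file name rank_affinity.c is hypothetical; Linux-only) has each rank report the size of its affinity mask using the glibc sched_getaffinity() and CPU_COUNT() interfaces, which can be compared against the I_MPI_DEBUG output.

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Query the set of OS processors this rank is allowed to run on. */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("Rank %3d may run on %3d OS processors (currently on %3d)\n",
               rank, CPU_COUNT(&mask), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Build it with the Intel MPI compiler wrapper (for example, mpiicc rank_affinity.c) and launch it with any of the pinning settings above.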

Intel® MPI Library Interoperability with OpenMP

Intel MPI and OpenMP affinity settings may be combined for hybrid execution. When all cores are used, specifying one to four hardware threads per core is straightforward with either the Intel OpenMP runtime extensions or OpenMP 4 affinity, as the following examples show.
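A hybrid variant of the rank_affinity.c sketch above can be used to check the combined settings: the code below (the file name hybrid_check.c is hypothetical; Linux-only) initializes MPI with MPI_THREAD_FUNNELED and has every OpenMP thread of every rank report the OS processor it runs on.

#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each OpenMP thread of each rank reports its current OS processor. */
    #pragma omp parallel
    printf("rank %3d thread %3d on OS processor %3d\n",
           rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}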

Intel MPI/OpenMP affinity examples using Intel OpenMP runtime extensions, for Intel Xeon Phi processors x200:

1 thread per core

mpirun -env KMP_PLACE_THREADS 1T

2 threads per core

mpirun -env KMP_PLACE_THREADS 2T

3 threads per core

mpirun -env KMP_PLACE_THREADS 3T

4 threads per core

Default, no extra settings needed

 

Intel MPI/OpenMP affinity examples using OpenMP 4 affinity, for Intel Xeon Phi processors x200:

1 thread per tile

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 8

1 thread per core

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 17

2 threads per core

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 34

4 threads per core

mpirun -env OMP_PROC_BIND close -env OMP_PLACES threads

 

Intel MPI also provides an environment variable, I_MPI_PIN_DOMAIN, for use with executables launching both MPI ranks and OpenMP threads. The variable is used to define a number of non-overlapping subsets of logical processors, binding one MPI rank to each of these domains. An explicit domain binding is especially useful for undersubscribing the cores. The following examples run a hybrid MPI/OpenMP executable on fewer than the 68 cores of the Intel Xeon Phi processor x200.

1 thread per core on 2 quadrants

mpirun -perhost 2 -env I_MPI_PIN_DOMAIN 68 -env KMP_PLACE_THREADS 1T

12 ranks, 1 rank per tile, 2 threads per core

mpirun -perhost 12 -env I_MPI_PIN_DOMAIN 8 -env KMP_PLACE_THREADS 2T
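As a cross-check of the domain sizes above (assuming the 68-core, 272-logical-processor configuration used throughout): I_MPI_PIN_DOMAIN counts logical processors, so a domain of 68 covers 68 / 4 = 17 cores, and 2 ranks x 17 cores = 34 cores, half of the chip (two quadrants); a domain of 8 covers 8 / 4 = 2 cores, exactly one tile, so 12 ranks occupy 12 of the 34 tiles.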

Tip: When I_MPI_PIN_DOMAIN is set, I_MPI_PIN_PROCESSOR_LIST is ignored.

Future

The Intel MPI library and Intel OpenMP runtime extensions will be extended in 2016 and 2017 to simplify placing processes and threads on Intel Xeon Phi processor x200 NUMA domains.

Conclusion

The Intel MPI Library and OpenMP runtimes provide mechanisms to bind MPI ranks and OpenMP threads to specific processors. Our examples showed how to experiment with different core and hardware thread configurations on Intel® Xeon Phi™ processor x200 (code-named Knights Landing). Following the examples, we can discover whether an application performs best using from one to four hardware threads per core, and we can look for optimal combinations of MPI ranks and OpenMP threads.

More Information

Intel® Fortran Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_f

Intel® C++ Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_c

OpenMP* 4.0 Complete Specifications, http://openmp.org

Intel® MPI Library Developer Reference for Linux* OS, Process Pinning, https://software.intel.com/en-us/intel-mpi-library/documentation

Intel® MPI Library Developer Reference for Linux* OS, Interoperability with OpenMP* API, https://software.intel.com/en-us/mpi-refman-lin-html

Using Nested Parallelism In OpenMP, https://software.intel.com/en-us/videos/using-nested-parallelism-in-openmp

Beginning Hybrid MPI/OpenMP Development, https://software.intel.com/en-us/articles/beginning-hybrid-mpiopenmp-development

