Introduction
Large matrix operations are the cornerstones of many important numerical and machine learning applications. In this article, we provide some recommendations for using Scipy or Numpy operations on large matrices, meaning matrices with more than 5,000 elements in each dimension.
General Advice for Setting up Python*
Use the latest version of the Intel® Distribution for Python* (version 2018.0.0 at the time of this writing), preferably with Python (version 3.6) for better memory deallocation and timing performance. This means downloading the Miniconda* installation script (that is, Miniconda3-latest-Linux-x86_64.sh) with Python (version 3.6). After downloading the script, the following example bash commands execute it to install conda*, a Python package manager that we will use throughout this guide:
INSTALLATION_DIR=$HOME/miniconda3
bash <DOWNLOAD_PATH>/Miniconda3-latest-Linux-x86_64.sh -b -p $INSTALLATION_DIR -f
CONDA=${INSTALLATION_DIR}/bin/conda
ACTIVATE=${INSTALLATION_DIR}/bin/activate
For the installation of the Python packages from Intel, you can create a conda environment called idp as follows:
$CONDA create -y -c intel -n idp intelpython3_core python=3
To activate the conda environment for use, run
source $ACTIVATE idp
Then you will be able to use the installed packages from the Intel Distribution for Python.
Tips for Using the Matrix Operations
These recommendations may help you obtain faster computational performance for large matrix operations on compatible Intel® processors. From our benchmarks, we see great speedups of these large matrix operations when run in parallel on server-class Intel processors, such as the Intel® Xeon® processors and Intel® Xeon Phi™ processors. These speedups are a result of the parallel computation at the multithreading layer of the Numpy and Scipy libraries.
We based the five tips described below on the performance observed for the matrix operation benchmarks, which are version controlled. After activating the conda environment for the Intel Distribution for Python, you can install our benchmarks with the following steps at the bash command line:
git clone https://github.com/IntelPython/ibench.git
cd ibench
python setup.py install
The example SLURM job script for running the benchmark on an Intel Xeon Phi processor (formerly code-named Knights Landing) can be found at the following link, while the example SLURM job script for an Intel Xeon processor can be found at the following link.
Tip 1: Tune the relevant environment variable settings
To make the best use of the multithreading capabilities of server-class Intel processors, we recommend you tune the threading and memory allocation settings accordingly. Factors that can affect the performance of the matrix operations include:
- Shape and size of the input matrix
- Matrix operation used
- Amount of computational resources on the system
- Usage pattern of computational resources by the specific code base that runs the matrix operation(s)
For Intel® Xeon® Phi™ processors
Some example (bash environment variable) settings are listed below as a baseline for tuning the performance.
The first set of parameters determines how many threads will be used.
export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS / 4 ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS / 4 ))
The parameters OMP_NUM_THREADS and MKL_NUM_THREADS specify the number of threads for the matrix operations. On an Intel Xeon Phi processor node dedicated to the computation job, we recommend using one thread per available core as a starting point for tuning the performance; since each Intel Xeon Phi core exposes four hardware threads, the logical CPU count above is divided by four.
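If you prefer to compute these settings from within Python rather than in bash, the sketch below shows one way to do it. It is a minimal sketch, assuming the variables are set before Numpy is first imported, because the threading layer reads them when the library is loaded.

import os

# On an Intel Xeon Phi processor, each core exposes four hardware threads,
# so dividing the logical CPU count by four approximates one thread per core.
num_threads = max(1, (os.cpu_count() or 1) // 4)

# These must be set before the first "import numpy"; the underlying
# threading layer reads them at load time.
os.environ["OMP_NUM_THREADS"] = str(num_threads)
os.environ["MKL_NUM_THREADS"] = str(num_threads)

import numpy as np  # imported only after the thread settings are in place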
The second set of parameters specifies the behavior of each thread.
export KMP_BLOCKTIME=800
export KMP_AFFINITY=granularity=fine,compact
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
The parameter KMP_BLOCKTIME specifies how long a thread should stay active after the completion of a compute task. When KMP_BLOCKTIME is longer than the default value of 200 milliseconds, there is less overhead for waking up the thread for subsequent computation(s).
The KMP_AFFINITY parameter dictates the placement of neighboring OpenMP thread contexts. For instance, the compact setting assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed. A detailed guide for the KMP_AFFINITY setting can be found at this link.
The KMP_HW_SUBSET setting specifies the number of threads placed onto each processor core.
The last parameter controls the memory allocation behavior.
export HPL_LARGEPAGE=1
The parameter HPL_LARGEPAGE enables a large page size for the memory allocation of data objects. A huge page size can enable more efficient memory access patterns for large matrices due to better translation lookaside buffer (TLB) behavior.
For Intel® Xeon® processors (generations with a code name newer than Haswell)
Some example (bash environment variable) settings are:
export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS ))
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
export HPL_LARGEPAGE=1
export KMP_BLOCKTIME=800
Tip 2: Use Fortran* memory layout for Numpy arrays (assuming matrix operations are called repeatedly)
Numpy arrays can use either Fortran memory layout (column-major order) or C memory layout (row-major order). Many Numpy and Scipy APIs are implemented with LAPACK and BLAS, which require Fortran memory layout. If a C memory layout Numpy array is passed to a Numpy or Scipy API that uses Fortran order internally, it performs a costly transpose first. If a Numpy array is used repeatedly, convert it to Fortran order before the first use.
The Python function that enables this memory layout conversion is numpy.asfortranarray. Here is a short code example:
import numpy as np

matrix_input = np.random.rand(5000, 5000)
matrix_fortran = np.asfortranarray(matrix_input, dtype=matrix_input.dtype)
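To confirm that the conversion took place, you can inspect the array flags; this short check uses only standard Numpy attributes:

# The default Numpy array is C-ordered; the converted one is Fortran-ordered.
print(matrix_input.flags['F_CONTIGUOUS'])    # False
print(matrix_fortran.flags['F_CONTIGUOUS'])  # True

# asfortranarray returns its input unchanged when the layout already
# matches, so calling it repeatedly does not copy the data again.
print(matrix_fortran is np.asfortranarray(matrix_fortran))  # True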
Tip 3: Save the result of a matrix operation in the input matrix (kwargs: overwrite_a=True)
It is natural for matrix operations that take large matrices as inputs to produce large outputs, and creating additional large data structures to hold those outputs adds overhead.
Many Scipy matrix linear algebra functions have an optional parameter called overwrite_a, which can be set to True. This option makes the function provide the result by overwriting an input instead of allocating a new Numpy array. Using the LU decomposition from the Scipy library as an example, we have:
import scipy.linalg

# "matrix" is a Fortran-ordered Numpy array, as prepared in Tip 2.
p, l, u = scipy.linalg.lu(a=matrix, overwrite_a=True)
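One caveat to keep in mind: with overwrite_a=True the input array may be destroyed by the call, so copy it first if the original values are still needed. A minimal sketch of that pattern:

# Keep an untouched Fortran-ordered copy before the in-place factorization.
matrix_backup = matrix.copy(order='F')
p, l, u = scipy.linalg.lu(a=matrix, overwrite_a=True)
# "matrix" may now hold factorization scratch data; read matrix_backup instead.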
Tip 4: Use the TBB or SMP Python modules to avoid the oversubscription of threads
Efficient parallelism is a known recipe for unlocking the performance of an Intel processor with more than one core. Reasoning about parallelism within Python, however, is sometimes less than transparent. Individual Python libraries may implement their own mechanism and level of parallelism, and when several such libraries are combined, the result can be an unintended usage pattern of the computational resources. For example, some Python modules (for example, multiprocessing) and libraries (for example, Scikit-Learn*) introduce parallelism into some functions by forking multiple Python processes, and each of those processes can spin up a number of threads as specified by the library.
Sometimes the number of threads can exceed the amount of available CPU resources, a condition known as thread oversubscription. Thread oversubscription can slow down a computational job. Within the Intel Distribution for Python, several modules help manage threads more efficiently by avoiding oversubscription: the Static Multi-Processing library (SMP) and the Python module for Intel® Threading Building Blocks (Intel® TBB).
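To make the problem concrete, the sketch below is a hypothetical example (the worker function and matrix sizes are ours, not from the benchmarks) of how oversubscription arises: several forked processes each call a multithreaded BLAS-backed Numpy routine, so the total thread demand is the process count times the per-process MKL thread count.

import multiprocessing as mp

import numpy as np

def worker(seed):
    # Each worker calls a BLAS-backed routine that is itself multithreaded;
    # with N workers and T MKL threads apiece, up to N * T threads can
    # compete for the cores at the same time.
    rng = np.random.RandomState(seed)
    a = rng.rand(2000, 2000)
    return float(np.linalg.norm(a @ a))

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        print(pool.map(worker, range(8)))

Running the same script under python -m smp or python -m tbb (see below) coordinates the two levels of parallelism instead of letting them multiply.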
In Table 1, Intel engineers Anton Gorshkov and Anton Malakhov provide recommendations for which module to use for a Python application based on the parallelism characteristics, in order to achieve good thread subscription for parallel Python applications. These recommendations can serve as a starting point for tuning the parallelism.
Table 1. Module recommendations based on parallelism characteristics.

| Innermost parallelism level | Outermost level: balanced work, low subscription | Outermost level: balanced work, high subscription | Outermost level: unbalanced work |
| --- | --- | --- | --- |
| Low subscription | Python | Python with SMP | Python with Intel® Threading Building Blocks |
| High subscription | KMP_COMPOSABILITY | KMP_COMPOSABILITY | KMP_COMPOSABILITY |
Below we give the exact command line instructions for each entry in the table.
The SMP and Intel TBB modules can be easily installed in an Anaconda* or a Miniconda installation of Python using the conda installer commands:
conda install -c intel smp
conda install -c intel tbb
More specifically, we explain the bash commands for each possible scenario outlined in Table 1. Assuming the Python script that we want to run is called PYTHONPROGRAM.py, the syntax for invoking it at a bash command line is:
python PYTHONPROGRAM.py # No change
If we want to use the Python modules with the script, we invoke one of them by supplying it as the argument of the -m flag of the Python executable:
python -m smp PYTHONPROGRAM.py  # Use SMP with Python script
python -m tbb PYTHONPROGRAM.py  # Use TBB with Python script
Additionally, the TBB module has various options. For example, we can supply a -p flag to tbb to limit the maximum number of threads:
python -m tbb -p $MAX_NUM_OF_THREADS PYTHONPROGRAM.py
And for a Python script with high thread subscription in the inner parallelism level and unbalanced work, we can set the variable KMP_COMPOSABILITY for a bash shell as follows:
KMP_COMPOSABILITY=mode=exclusive python PYTHONPROGRAM.py
For additional resources on Intel TBB and SMP, we recommend the SciPy 2017 tutorial given by Anton Malakhov, or the tutorial Unleash the Parallel Performance of Python* Programs.
Tip 5: Turn on transparent hugepages for memory allocation on Cray Linux*
The Linux* OS has a feature called hugepages that can accelerate programs with a large memory footprint by 50 percent or more. On many Linux systems, the hugepages functionality is transparently provided by the kernel, but enabling hugepages for Python on Cray Linux requires some extra steps. Our team at Intel has verified that hugepages are needed to achieve the best possible performance for some large matrix operations (> 5000 x 5000 matrix size). These operations include the LU, QR, SVD, and Cholesky decompositions.
First, on the relevant Cray system, check that the system hugepages module is loaded. For example, if Tcl* modules are used, you can find and load the system hugepages module via instructions similar to:
module avail craype-hugepages2M
module load craype-hugepages2M
You will also need to run the one-line installation instruction for the hugetlbfs package within a conda Python environment or installation:
conda install -c intel hugetlbfs
After that, the Python binary within the conda environment will allocate the memory for objects using hugepages.
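To check whether hugepages help on your system, a simple timing of one of the affected decompositions before and after the setup is usually enough. The sketch below is an illustrative check of our own, not one of the version-controlled ibench benchmarks:

import time

import numpy as np
import scipy.linalg

# A 5000 x 5000 Fortran-ordered input, the size range where hugepages matter.
matrix = np.asfortranarray(np.random.rand(5000, 5000))

start = time.perf_counter()
scipy.linalg.lu(a=matrix, overwrite_a=True)
elapsed = time.perf_counter() - start
print(f"LU decomposition of a 5000 x 5000 matrix took {elapsed:.2f} s")
# Run once before and once after installing hugetlbfs, and compare timings.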