Introduction
Large matrix operations are the cornerstones of many important numerical and machine learning applications. In this article, we provide some recommendations for using Scipy or Numpy operations on large matrices, meaning matrices with more than 5,000 elements in each dimension.
General Advice for Setting up Python*
Use the latest version of the Intel® Distribution for Python* (version 2018.0.0 at the time of this writing), preferably with Python (version 3.6) for better memory deallocation and timing performance. This means downloading the Miniconda* installation script (that is, Miniconda3-latest-Linux-x86_64.sh) with Python (version 3.6). After downloading the script, the following example bash commands execute it to install conda*, a Python package manager that we will use throughout this guide:
INSTALLATION_DIR=$HOME/miniconda3
bash <DOWNLOAD_PATH>/Miniconda3-latest-Linux-x86_64.sh -b -p $INSTALLATION_DIR -f
CONDA=${INSTALLATION_DIR}/bin/conda
ACTIVATE=${INSTALLATION_DIR}/bin/activate
For the installation of the Python packages from Intel, you can create a conda environment called idp as follows:
$CONDA create -y -c intel -n idp intelpython3_core python=3
To activate the conda environment for use, run
source $ACTIVATE idp
Then you will be able to use the installed packages from the Intel Distribution for Python.
Tips for Using the Matrix Operations
These recommendations may help you obtain faster computational performance for large matrix operations on compatible Intel® processors. From our benchmarks, we see great speedups of these large matrix operations when run in parallel on server-class Intel processors, such as the Intel® Xeon® processors and Intel® Xeon Phi™ processors. These speedups are a result of the parallel computation at the multithreading layer of the Numpy and Scipy libraries.
We based the five tips described below on the performance observed for the matrix operation benchmarks, which are version controlled. After activating the conda environment for the Intel Distribution for Python, you can install our benchmarks with the following steps at the bash command line:
git clone https://github.com/IntelPython/ibench.git
cd ibench
python setup.py install
The example SLURM job script for running the benchmark on an Intel Xeon Phi processor (formerly code-named Knights Landing) can be found at the following link, while the example SLURM job script for an Intel Xeon processor can be found at the following link.
Tip 1: Tune the relevant environment variable settings
To make the best use of the multithreading capabilities of server-class Intel processors, we recommend you tune the threading and memory allocation settings accordingly. Factors that can affect the performance of the matrix operations include:
- Shape and size of the input matrix
- Matrix operation used
- Amount of computational resources on the system
- Usage pattern of computational resources by the specific code base that runs the matrix operation(s)
For Intel® Xeon® Phi™ processors
Some example (bash environment variable) settings are listed below as a baseline for tuning the performance.
The first set of parameters determines how many threads will be used.
export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS / 4 ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS / 4 ))
The parameters OMP_NUM_THREADS and MKL_NUM_THREADS specify the number of threads for the matrix operations. On an Intel Xeon Phi processor node dedicated to the computation job, we recommend using one thread per available core as a starting point for tuning the performance; since each Intel Xeon Phi core exposes four hardware threads, the logical CPU count above is divided by four.
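If you prefer to compute these settings from within Python rather than in bash, the sketch below shows one way to do it. It is a minimal sketch, assuming the variables are set before Numpy is first imported, because the threading layer reads them when the library is loaded.

import os

# On an Intel Xeon Phi processor, each core exposes four hardware threads,
# so dividing the logical CPU count by four approximates one thread per core.
num_threads = max(1, (os.cpu_count() or 1) // 4)

# These must be set before the first "import numpy"; the underlying
# threading layer reads them at load time.
os.environ["OMP_NUM_THREADS"] = str(num_threads)
os.environ["MKL_NUM_THREADS"] = str(num_threads)

import numpy as np  # imported only after the thread settings are in place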
The second set of parameters specifies the behavior of each thread.
export KMP_BLOCKTIME=800
export KMP_AFFINITY=granularity=fine,compact
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
The parameter KMP_BLOCKTIME specifies how long a thread should stay active after the completion of a compute task. When KMP_BLOCKTIME is longer than the default value of 200 milliseconds, there is less overhead for waking up the thread for subsequent computation(s).
The KMP_AFFINITY parameter dictates the placement of neighboring OpenMP thread contexts. For instance, the compact setting assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed. A detailed guide for the KMP_AFFINITY setting can be found at this link.
The KMP_HW_SUBSET setting specifies the number of threads placed onto each processor core.
The last parameter controls the memory allocation behavior.
export HPL_LARGEPAGE=1
The parameter HPL_LARGEPAGE enables a large page size for the memory allocation of data objects. A huge page size can enable more efficient memory access patterns for large matrices due to better translation lookaside buffer (TLB) behavior.
For Intel® Xeon® processors (generations with a code name newer than Haswell)
Some example (bash environment variable) settings are:
export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS ))
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
export HPL_LARGEPAGE=1
export KMP_BLOCKTIME=800
Tip 2: Use Fortran* memory layout for Numpy arrays (assuming matrix operations are called repeatedly)
Numpy arrays can use either Fortran memory layout (column-major order) or C memory layout (row-major order). Many Numpy and Scipy APIs are implemented with LAPACK and BLAS, which require Fortran memory layout. If a C memory layout Numpy array is passed to a Numpy or Scipy API that uses Fortran order internally, it performs a costly transpose first. If a Numpy array is used repeatedly, convert it to Fortran order before the first use.
The Python function that enables this memory layout conversion is numpy.asfortranarray. Here is a short code example:
import numpy as np

matrix_input = np.random.rand(5000, 5000)
matrix_fortran = np.asfortranarray(matrix_input, dtype=matrix_input.dtype)
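To confirm that the conversion took place, you can inspect the array flags; this short check uses only standard Numpy attributes:

# The default Numpy array is C-ordered; the converted one is Fortran-ordered.
print(matrix_input.flags['F_CONTIGUOUS'])    # False
print(matrix_fortran.flags['F_CONTIGUOUS'])  # True

# asfortranarray returns its input unchanged when the layout already
# matches, so calling it repeatedly does not copy the data again.
print(matrix_fortran is np.asfortranarray(matrix_fortran))  # True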
Tip 3: Save the result of a matrix operation in the input matrix (kwargs: overwrite_a=True)
It is natural for matrix operations that take large matrices as inputs to produce large outputs, and creating additional large data structures to hold those outputs adds overhead.
Many Scipy matrix linear algebra functions have an optional parameter called overwrite_a, which can be set to True. This option makes the function provide the result by overwriting an input instead of allocating a new Numpy array. Using the LU decomposition from the Scipy library as an example, we have:
import scipy.linalg

# "matrix" is a Fortran-ordered Numpy array, as prepared in Tip 2.
p, l, u = scipy.linalg.lu(a=matrix, overwrite_a=True)
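One caveat to keep in mind: with overwrite_a=True the input array may be destroyed by the call, so copy it first if the original values are still needed. A minimal sketch of that pattern:

# Keep an untouched Fortran-ordered copy before the in-place factorization.
matrix_backup = matrix.copy(order='F')
p, l, u = scipy.linalg.lu(a=matrix, overwrite_a=True)
# "matrix" may now hold factorization scratch data; read matrix_backup instead.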
Tip 4: Use the TBB or SMP Python modules to avoid the oversubscription of threads
Efficient parallelism is a known recipe for unlocking the performance of an Intel processor with more than one core. Reasoning about parallelism within Python, however, is sometimes less than transparent. Individual Python libraries may implement their own mechanism and level of parallelism, and when several such libraries are combined, the result can be an unintended usage pattern of the computational resources. For example, some Python modules (for example, multiprocessing) and libraries (for example, Scikit-Learn*) introduce parallelism into some functions by forking multiple Python processes, and each of those processes can spin up a number of threads as specified by the library.
Sometimes the number of threads can exceed the amount of available CPU resources, a condition known as thread oversubscription. Thread oversubscription can slow down a computational job. Within the Intel Distribution for Python, several modules help manage threads more efficiently by avoiding oversubscription: the Static Multi-Processing library (SMP) and the Python module for Intel® Threading Building Blocks (Intel® TBB).
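To make the problem concrete, the sketch below is a hypothetical example (the worker function and matrix sizes are ours, not from the benchmarks) of how oversubscription arises: several forked processes each call a multithreaded BLAS-backed Numpy routine, so the total thread demand is the process count times the per-process MKL thread count.

import multiprocessing as mp

import numpy as np

def worker(seed):
    # Each worker calls a BLAS-backed routine that is itself multithreaded;
    # with N workers and T MKL threads apiece, up to N * T threads can
    # compete for the cores at the same time.
    rng = np.random.RandomState(seed)
    a = rng.rand(2000, 2000)
    return float(np.linalg.norm(a @ a))

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        print(pool.map(worker, range(8)))

Running the same script under python -m smp or python -m tbb (see below) coordinates the two levels of parallelism instead of letting them multiply.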
In Table 1, Intel engineers Anton Gorshkov and Anton Malakhov provide recommendations for which module to use for a Python application based on the parallelism characteristics, in order to achieve good thread subscription for parallel Python applications. These recommendations can serve as a starting point for tuning the parallelism.
Table 1. Module recommendations based on parallelism characteristics.

| Innermost parallelism level | Outermost level: balanced work, low subscription | Outermost level: balanced work, high subscription | Outermost level: unbalanced work |
| --- | --- | --- | --- |
| Low subscription | Python | Python with SMP | Python with Intel® Threading Building Blocks |
| High subscription | KMP_COMPOSABILITY | KMP_COMPOSABILITY | KMP_COMPOSABILITY |
Below we give the exact command line instructions for each entry in the table.
The SMP and Intel TBB modules can be easily installed in an Anaconda* or a Miniconda installation of Python using the conda installer commands:
conda install -c intel smp
conda install -c intel tbb
More specifically, we explain the bash commands for each possible scenario outlined in Table 1. Assuming the Python script that we want to run is called PYTHONPROGRAM.py, the syntax for invoking it at a bash command line is:
python PYTHONPROGRAM.py # No change
If we want to use the Python modules with the script, we invoke one of them by supplying it as the argument of the -m flag of the Python executable:
python -m smp PYTHONPROGRAM.py  # Use SMP with Python script
python -m tbb PYTHONPROGRAM.py  # Use TBB with Python script
Additionally, the TBB module has various options. For example, we can supply a -p flag to tbb to limit the maximum number of threads:
python -m tbb -p $MAX_NUM_OF_THREADS PYTHONPROGRAM.py
And for a Python script with high thread subscription in the inner parallelism level and unbalanced work, we can set the variable KMP_COMPOSABILITY for a bash shell as follows:
KMP_COMPOSABILITY=mode=exclusive python PYTHONPROGRAM.py
For additional resources on Intel TBB and SMP, we recommend the SciPy 2017 tutorial given by Anton Malakhov, or the tutorial Unleash the Parallel Performance of Python* Programs.
Tip 5: Turn on transparent hugepages for memory allocation on Cray Linux*
The Linux* OS has a feature called hugepages that can accelerate programs with a large memory footprint by 50 percent or more. On many Linux systems, the hugepages functionality is transparently provided by the kernel, but enabling hugepages for Python on Cray Linux requires some extra steps. Our team at Intel has verified that hugepages are needed to achieve the best possible performance for some large matrix operations (> 5000 x 5000 matrix size). These operations include the LU, QR, SVD, and Cholesky decompositions.
First, on the relevant Cray system, check that the system hugepages module is loaded. For example, if Tcl* modules are used, you can find and load the system hugepages module via instructions similar to:
module avail craype-hugepages2M
module load craype-hugepages2M
You will also need to run the one-line installation instruction for the hugetlbfs package within a conda Python environment or installation:
conda install -c intel hugetlbfs
After that, the Python binary within the conda environment will allocate the memory for objects using hugepages.
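To check whether hugepages help on your system, a simple timing of one of the affected decompositions before and after the setup is usually enough. The sketch below is an illustrative check of our own, not one of the version-controlled ibench benchmarks:

import time

import numpy as np
import scipy.linalg

# A 5000 x 5000 Fortran-ordered input, the size range where hugepages matter.
matrix = np.asfortranarray(np.random.rand(5000, 5000))

start = time.perf_counter()
scipy.linalg.lu(a=matrix, overwrite_a=True)
elapsed = time.perf_counter() - start
print(f"LU decomposition of a 5000 x 5000 matrix took {elapsed:.2f} s")
# Run once before and once after installing hugetlbfs, and compare timings.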