One of the major new features in Intel MKL 11.2 is greatly improved performance for small problem sizes. In 11.2, this improvement focuses on the xGEMM functions (matrix multiplication). Out of the box, there is already a version-to-version improvement from Intel MKL 11.1 to Intel MKL 11.2. On top of that, Intel MKL 11.2 introduces a new control that can give a further significant performance boost for small matrices. Users enable this control by defining the preprocessor macro MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ when compiling code that calls Intel MKL, that is, by specifying "-DMKL_DIRECT_CALL" or "-DMKL_DIRECT_CALL_SEQ" on the compile line. At run time, execution is dispatched to a fast path for small input matrices. The fast path skips error checking and several layers of function calls, and so improves performance by reducing the associated overhead. The matrices have to be small, for example only a few dozen rows and columns. For larger matrices the regular execution path is taken; there, MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ do not help, but they do no harm either.
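The dispatch idea described above can be pictured with a small conceptual sketch. This is not Intel MKL's implementation: the names gemm_dispatch, gemm_regular and multiply_kernel, and the cutoff SMALL_DIM_LIMIT, are all hypothetical and chosen only for illustration. It assumes square, column-major matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff for this sketch; the real threshold is internal to Intel MKL. */
enum { SMALL_DIM_LIMIT = 32 };

typedef enum { PATH_FAST, PATH_REGULAR, PATH_ERROR } gemm_path;

/* Shared kernel: C = A * B for square n x n column-major matrices.
   Callers are responsible for passing valid arguments. */
static void multiply_kernel(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i + k * n] * b[k + j * n];
            c[i + j * n] = sum;
        }
    }
}

/* Regular path: validates its arguments before doing any work. */
static gemm_path gemm_regular(int n, const double *a, const double *b, double *c)
{
    if (n <= 0 || a == NULL || b == NULL || c == NULL)
        return PATH_ERROR;
    multiply_kernel(n, a, b, c);
    return PATH_REGULAR;
}

/* Entry point: small sizes go straight to the kernel with no pointer checks. */
gemm_path gemm_dispatch(int n, const double *a, const double *b, double *c)
{
    if (n > 0 && n <= SMALL_DIM_LIMIT) {
        multiply_kernel(n, a, b, c);   /* fast path: a NULL pointer is not caught here */
        return PATH_FAST;
    }
    return gemm_regular(n, a, b, c);
}
```

The return value only exists so that the chosen path can be observed in a test; a real library would not expose it.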
The chart below compares four ways of computing double-precision matrix-matrix multiplication for small matrices (ranging from 4x4 to 20x20).
- A naive implementation using triple-nested loops, compiled with flags "-O3 -xCORE-AVX2" using Intel C++ Compiler 15.0.
- Using DGEMM from Intel MKL 11.1.1.
- Using DGEMM from Intel MKL 11.2.
- Using DGEMM from Intel MKL 11.2 and with "-DMKL_DIRECT_CALL" enabled.
The matrices used in this chart are all square. The version-to-version improvement of Intel MKL 11.2 over 11.1.1, as well as the additional benefit brought by MKL_DIRECT_CALL, are evident.
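For reference, the naive baseline in the first scenario could look roughly like the sketch below. The actual benchmark source is not shown in this article, so the function name naive_dgemm, the loop order and the column-major storage are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Naive C = A * B for square n x n matrices in column-major order
   (the layout the Fortran-style DGEMM interface expects). */
void naive_dgemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; ++j) {          /* column of C */
        for (size_t i = 0; i < n; ++i) {      /* row of C */
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i + k * n] * b[k + j * n];
            c[i + j * n] = sum;
        }
    }
}
```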
How to use MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ
These are preprocessor macros that you define at compile time to instruct Intel MKL to pick the fast path for small matrices. Use the first macro, MKL_DIRECT_CALL, when you link to the parallel Intel MKL library. Use the second, MKL_DIRECT_CALL_SEQ, when you link to the sequential Intel MKL library. These macros have no effect on larger matrices.
For a program in the C language on a Linux system, simply add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ to the compile line. On Windows, the syntax is /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ. Usually, the flag -std=c99 (/Qstd=c99 on Windows) is also needed. This has been tested with mainstream C and C++ compilers such as Intel C++ Compiler, GCC, and Microsoft Visual Studio.
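A minimal C sketch of a call that this feature can affect is shown below. It uses the Fortran-style dgemm interface declared in mkl.h rather than cblas_dgemm. The macro HAVE_MKL and the function small_matmul are hypothetical names for this example; when HAVE_MKL is not defined, a simple stand-in dgemm (no-transpose case only) is used so that the sketch still compiles on a machine without Intel MKL.

```c
#include <assert.h>

#ifdef HAVE_MKL              /* hypothetical build switch for this sketch */
#include <mkl.h>             /* declares dgemm and MKL_INT */
#else
/* Stand-in so the sketch builds without Intel MKL: handles 'N','N' only. */
typedef int MKL_INT;
static void dgemm(const char *transa, const char *transb,
                  const MKL_INT *m, const MKL_INT *n, const MKL_INT *k,
                  const double *alpha, const double *a, const MKL_INT *lda,
                  const double *b, const MKL_INT *ldb,
                  const double *beta, double *c, const MKL_INT *ldc)
{
    assert(*transa == 'N' && *transb == 'N');
    for (MKL_INT j = 0; j < *n; ++j) {
        for (MKL_INT i = 0; i < *m; ++i) {
            double sum = 0.0;
            for (MKL_INT p = 0; p < *k; ++p)
                sum += a[i + p * *lda] * b[p + j * *ldb];
            /* Like BLAS, do not read C when beta is zero. */
            c[i + j * *ldc] = (*beta == 0.0)
                ? *alpha * sum
                : *alpha * sum + *beta * c[i + j * *ldc];
        }
    }
}
#endif

/* C = A * B for small square column-major matrices. */
void small_matmul(MKL_INT n, const double *a, const double *b, double *c)
{
    const double one = 1.0, zero = 0.0;
    dgemm("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
}
```

With the Intel C++ Compiler on Linux, a build against the parallel library might look something like "icc -std=c99 -DHAVE_MKL -DMKL_DIRECT_CALL small.c -mkl", and against the sequential library like "icc -std=c99 -DHAVE_MKL -DMKL_DIRECT_CALL_SEQ small.c -mkl=sequential". The file name small.c is an assumption, and the file would also need a main function that calls small_matmul.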
For a program in Fortran, first include "mkl_direct_call.fi". Then add -DMKL_DIRECT_CALL (/DMKL_DIRECT_CALL on Windows) or -DMKL_DIRECT_CALL_SEQ (/DMKL_DIRECT_CALL_SEQ on Windows) to the compile line. If you are using the Intel Fortran Compiler, pass -fpp (/fpp on Windows) to enable Fortran preprocessing. If you are using the PGI Fortran compiler, pass -Mpreprocess instead. This feature does not work with the GNU Fortran compiler. See below for an example from the "Intel MKL User's Guide".
# include "mkl_direct_call.fi"
      program DGEMM_MAIN
      ....
*     Call Intel MKL DGEMM
      ....
      call sub1()
      stop 1
      end

*     A subroutine that calls DGEMM
      subroutine sub1
*     Call Intel MKL DGEMM
      end
Limitations
There are a few limitations of using this feature.
- The performance gain comes from skipping error checking and from function inlining. No error is reported if incorrect parameters are passed to the function call. For this reason, users should not use this feature during code development and debugging, and should enable it only when the code is ready for deployment.
- The "verbose mode" (another new feature introduced in Intel MKL 11.2) does not work for functions that take the fast path enabled by this feature.
- This feature currently has no effect on the CGEMM3M and ZGEMM3M functions.
- This feature does not work with CBLAS function calls.
- The performance benefit of this feature on Intel Xeon Phi coprocessors is marginal. Work is in progress to fully extend this feature to cover Intel Xeon Phi coprocessors.
- CNR (Conditional Numerical Reproducibility) is not supported.
- For Fortran programs, the GNU Fortran compiler is not supported.
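Because the fast path reports no errors, one option is to validate DGEMM arguments yourself in debug builds. The sketch below is one possible approach, not part of Intel MKL: check_dgemm_args and DGEMM_DEBUG_CHECK are hypothetical names. The checks and the returned argument positions follow the ones performed by the reference BLAS DGEMM before it calls XERBLA.

```c
#include <assert.h>

static int max_int(int x, int y) { return x > y ? x : y; }

static int is_trans_flag(char t)
{
    return t == 'N' || t == 'n' || t == 'T' || t == 't' || t == 'C' || t == 'c';
}

/* Returns 0 if the arguments are valid, otherwise the 1-based position of
   the first invalid DGEMM argument (1, 2, 3, 4, 5, 8, 10 or 13). */
int check_dgemm_args(char transa, char transb, int m, int n, int k,
                     int lda, int ldb, int ldc)
{
    int nota = (transa == 'N' || transa == 'n');
    int notb = (transb == 'N' || transb == 'n');
    int nrowa = nota ? m : k;   /* rows of A as stored */
    int nrowb = notb ? k : n;   /* rows of B as stored */

    if (!is_trans_flag(transa)) return 1;
    if (!is_trans_flag(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    if (lda < max_int(1, nrowa)) return 8;
    if (ldb < max_int(1, nrowb)) return 10;
    if (ldc < max_int(1, m)) return 13;
    return 0;
}

/* Active in debug builds; compiled out when NDEBUG is defined. */
#define DGEMM_DEBUG_CHECK(ta, tb, m, n, k, lda, ldb, ldc) \
    assert(check_dgemm_args((ta), (tb), (m), (n), (k), (lda), (ldb), (ldc)) == 0)
```

Placing DGEMM_DEBUG_CHECK immediately before each dgemm call keeps the check during development and removes it from release builds compiled with -DNDEBUG.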