Contents
Introduction
An Overview of the Classic Matrix Multiplication Algorithm
Total Number of Floating Point Operations
Implementation Complexity
Optimization Techniques
Memory Allocation Schemes
Loop Processing Schemes
Compute Schemes
Error Analysis
Performance on Intel® Xeon Phi™ Processor System
OpenMP* Product Thread Affinity Control
Recommended Intel® C++ Compiler Command-Line Options
Conclusion
References
Downloads
Abbreviations
Appendix A - Technical Specifications of Intel Xeon Phi Processor System
Appendix B - Comparison of Processing Times for MMAs vs. MTA
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
Appendix D - Performance of MMAs for Different MASs
About the Author
Introduction
Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple: it can be easily implemented in any programming language, and its performance improves significantly when different optimization techniques are applied.
Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to the highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL) [7]. Tests are completed on a computer system with an Intel® Xeon Phi™ processor 7210 [5] running the Red Hat* Linux* operating system in the 'All2All' cluster mode and for the 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux* [6].
The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.
If needed, the reader may review References [1] or [2] for a description of the mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.
An Overview of the Classic Matrix Multiplication Algorithm
A fundamental property of any algorithm is its asymptotic complexity (AC) [3].
In generic form, AC for MMA can be expressed as follows:
MMA AC = O(N^Omega)
where O is the Big O notation used in computer science; N is the dimension of the matrix, and omega is the matrix multiplication exponent, which equals 3.0 for the CMMA. That is:
CMMA AC = O(N^3)
In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.
An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.
A list of several MMAs with different values of omega is as follows:
Algorithm | Omega | Note |
---|---|---|
Francois Le Gall | 2.3728639 | (1) |
Virginia Vassilevska Williams | 2.3728642 | |
Stothers | 2.3740000 | |
Coppersmith-Winograd | 2.3760000 | |
Bini | 2.7790000 | |
Pan | 2.7950000 | |
Strassen | 2.8070000 | (2) |
Strassen-Winograd | 2.8070000 | |
Classic | 3.0000000 | (3) |
Table 1. Algorithms are sorted by omega in ascending order.
Total Number of Floating Point Operations
Let's assume that:
M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]
There are three relations between M, N and P:
Relation #1: the number of columns of A equals the number of rows of B, that is, A[...,N] and B[N,...]
Relation #2: the number of rows of A equals the number of rows of C, that is, A[M,...] and C[M,...]
Relation #3: the number of columns of B equals the number of columns of C, that is, B[...,P] and C[...,P]
If one of these three relations is not met, the product of two matrices cannot be computed.
In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:
A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]
The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).
N | Omega = 2.3728639 (1) | Omega = 2.807 (2) | Omega = 3.0 (3) |
---|---|---|---|
128 | 100,028 | 822,126 | 2,097,152 |
256 | 518,114 | 5,753,466 | 16,777,216 |
512 | 2,683,668 | 40,264,358 | 134,217,728 |
1024 | 13,900,553 | 281,781,176 | 1,073,741,824 |
2048 | 72,000,465 | 1,971,983,042 | 8,589,934,592 |
4096 | 372,939,611 | 13,800,485,780 | 68,719,476,736 |
8192 | 1,931,709,091 | 96,579,637,673 | 549,755,813,888 |
16384 | 10,005,641,390 | 675,891,165,093 | 4,398,046,511,104 |
32768 | 51,826,053,965 | 4,730,074,351,662 | 35,184,372,088,832 |
65536 | 268,442,548,034 | 33,102,375,837,652 | 281,474,976,710,656 |
Table 2.
For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.
Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:
~35,184,372,088,832 / ~51,826,053,965 = ~678.9
In the case of using a famous Strassen (2) MMA, ~91 days would be needed:
~4,730,074,351,662 / ~51,826,053,965 = ~91.3
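The multiplication counts in Table 2, and the ratios above, can be reproduced with a few lines of C by evaluating N^omega directly. Here is a minimal sketch; the printed values are rounded, so small differences from Table 2 are possible:

```c
#include <math.h>
#include <stdio.h>

int main( void )
{
    // Matrix exponents from Table 1: Le Gall (1), Strassen (2), Classic (3)
    const double omegas[3] = { 2.3728639, 2.807, 3.0 };

    printf( "%8s %24s %24s %24s\n", "N", "Omega=2.3728639", "Omega=2.807", "Omega=3.0" );
    for( long n = 128; n <= 65536; n *= 2 )
    {
        printf( "%8ld", n );
        for( int i = 0; i < 3; i += 1 )
            printf( " %24.0f", pow( ( double )n, omegas[i] ) );   // ~number of multiplications
        printf( "\n" );
    }
    return 0;
}
```

Compile with, for example, 'icc table2.c -o table2 -lm' (the file name is illustrative).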
In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.
With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.
However, an estimated value of the TN of FP operations (multiplications and additions) for the CMMA with square matrices can be easily calculated. Here are two simple examples:
Example 1: N = 2
```
Multiplications = 8            // 2 * 2 * 2 = 2^3
Additions       = 4            // 2 * 2 * 1 = 2^2 * (2-1)
TN FP Ops       = 8 + 4 = 12
```
Example 2: N = 3
```
Multiplications = 27           // 3 * 3 * 3 = 3^3
Additions       = 18           // 3 * 3 * 2 = 3^2 * (3-1)
TN FP Ops       = 27 + 18 = 45
```
It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:
TN FP Ops = (N^3) + ((N^2) * (N-1))
Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.
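For reference, the formula above can be wrapped in a small helper that also converts a measured processing time into a FLOPS value. This is a minimal sketch; the matrix dimension and the processing time below are hypothetical placeholders:

```c
#include <stdio.h>

// TN of FP operations for the CMMA on square N x N matrices:
// N^3 multiplications plus N^2 * (N - 1) additions.
static double cmma_total_fp_ops( double n )
{
    return ( n * n * n ) + ( ( n * n ) * ( n - 1.0 ) );
}

int main( void )
{
    double n       = 1024.0;   // hypothetical matrix dimension
    double seconds = 1.0;      // hypothetical measured processing time
    double gflops  = cmma_total_fp_ops( n ) / seconds / 1.0e9;

    printf( "N = %.0f  TN FP Ops = %.0f  GFLOPS = %.3f\n",
            n, cmma_total_fp_ops( n ), gflops );
    return 0;
}
```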
Implementation Complexity
In the C programming language only four lines of code are needed to implement a core part of the CMMA:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
```
Therefore, CMMA's implementation complexity (IC) could be rated as very simple.
Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.
More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.
Optimization Techniques
In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.
Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):
```c
float fA[N*N];
float fB[N*N];
float fC[N*N];
```
and this is what a core part of the CMMA looks like:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[N*i+j] += A[N*i+k] * B[N*k+j];
```
Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):
```c
float fA[N][N];
float fB[N][N];
float fC[N][N];
```
and this is what the core part of CMMA looks like:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
```
Many other variants of the core part of CMMA are possible and they will be reviewed.
Memory Allocation Schemes
In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:
Declaration of matrices A, B, and C as 1-D data structures:
```c
__attribute__( ( aligned( 64 ) ) ) float *fA;
__attribute__( ( aligned( 64 ) ) ) float *fB;
__attribute__( ( aligned( 64 ) ) ) float *fC;
```
and this is how memory needs to be allocated:
```c
fA = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fB = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fC = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
```
Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.
Declaration of matrices A, B, and C as 2-D data structures:
```c
__attribute__( ( aligned( 64 ) ) ) float **fA;
__attribute__( ( aligned( 64 ) ) ) float **fB;
__attribute__( ( aligned( 64 ) ) ) float **fC;
```
and this is how memory needs to be allocated:
```c
fA = ( float ** )calloc( N, sizeof( float * ) );
fB = ( float ** )calloc( N, sizeof( float * ) );
fC = ( float ** )calloc( N, sizeof( float * ) );

for( i = 0; i < N; i += 1 )
{
    fA[i] = ( float * )calloc( N, sizeof( float ) );
    fB[i] = ( float * )calloc( N, sizeof( float ) );
    fC[i] = ( float * )calloc( N, sizeof( float ) );
}
```
Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.
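One common way to keep the convenience of 2-D indexing and still get a contiguous, aligned data block is to allocate the matrix data as a single block and build a separate array of row pointers into it. The following is a minimal sketch; the helper names are illustrative and are not taken from the test programs:

```c
#include <stdlib.h>
#include <immintrin.h>    // _mm_malloc, _mm_free

// Allocate an N x N matrix of floats as one contiguous 64-byte aligned block
// plus an array of row pointers into that block; returns NULL on failure.
static float **alloc_matrix_2d( size_t n )
{
    float **rows = ( float ** )_mm_malloc( n * sizeof( float * ), 64 );
    float  *data = ( float *  )_mm_malloc( n * n * sizeof( float ), 64 );

    if( rows == NULL || data == NULL )
    {
        if( rows != NULL ) _mm_free( rows );
        if( data != NULL ) _mm_free( data );
        return NULL;
    }
    for( size_t i = 0; i < n; i += 1 )
        rows[i] = &data[ n * i ];              // row i points into the contiguous block
    return rows;
}

static void free_matrix_2d( float **rows )
{
    if( rows != NULL )
    {
        _mm_free( rows[0] );                   // frees the contiguous data block
        _mm_free( rows );
    }
}
```

With this scheme the 2-D core part of the CMMA, C[i][j] += A[i][k] * B[k][j], can be used unchanged, while the data layout in memory is identical to the 1-D case.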
In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system [5] a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from the memkind library [11] when the MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:
```c
fA = ( float * )hbw_malloc( N * N * sizeof( float ) );
fB = ( float * )hbw_malloc( N * N * sizeof( float ) );
fC = ( float * )hbw_malloc( N * N * sizeof( float ) );
```
Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.
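When it is not known in advance whether MCDRAM can be allocated directly (for example, in the 'Cache' MCDRAM mode it cannot), the memkind library's 'hbw_check_available' function can be used to fall back to a regular allocation. A minimal sketch follows; the helper name is illustrative:

```c
#include <stdlib.h>
#include <hbwmalloc.h>     // hbw_check_available, hbw_malloc, hbw_free

// Allocate N * N floats from MCDRAM if high-bandwidth memory is available,
// otherwise fall back to a regular DDR4 allocation.
static float *alloc_matrix_hbw( size_t n, int *from_mcdram )
{
    if( hbw_check_available() == 0 )           // 0 means HBW memory is available
    {
        *from_mcdram = 1;
        return ( float * )hbw_malloc( n * n * sizeof( float ) );
    }
    *from_mcdram = 0;
    return ( float * )malloc( n * n * sizeof( float ) );
}
```

A matrix allocated this way must be released with 'hbw_free' or 'free' accordingly, and the program must be linked with the memkind library (for example, with '-lmemkind').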
On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:
Matrix A | Matrix B | Matrix C | Note |
---|---|---|---|
DDR4 | DDR4 | DDR4 | (1) |
DDR4 | DDR4 | MCDRAM | (2) |
DDR4 | MCDRAM | DDR4 | |
DDR4 | MCDRAM | MCDRAM | |
MCDRAM | DDR4 | DDR4 | |
MCDRAM | DDR4 | MCDRAM | |
MCDRAM | MCDRAM | DDR4 | |
MCDRAM | MCDRAM | MCDRAM |
Table 3.
It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, which is ~5 times higher than the ~80 GB/s bandwidth of DDR4 memory [5].
Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):
```
Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 50.918 seconds

Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 47.385 seconds
```
It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.
Loop Processing Schemes
A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:
```c
for( i = 0; i < N; i += 1 )                    // loop 1
    for( j = 0; j < N; j += 1 )                // loop 2
        for( k = 0; k < N; k += 1 )            // loop 3
            C[i][j] += A[i][k] * B[k][j];
```
corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.
Table 4 below includes short descriptions of different LPSs:
LPS | Note |
---|---|
1:1:1 | Loops not unrolled |
1:1:2 | 3rd loop unrolls to 2-in-1 computations |
1:1:4 | 3rd loop unrolls to 4-in-1 computations |
1:1:8 | 3rd loop unrolls to 8-in-1 computations |
1:2:1 | 2nd loop unrolls to 2-in-1 computations |
1:4:1 | 2nd loop unrolls to 4-in-1 computations |
1:8:1 | 2nd loop unrolls to 8-in-1 computations |
Table 4.
For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:
```c
for( i = 0; i < N; i += 1 )                    // :1
{
    for( j = 0; j < N; j += 1 )                // :1
    {
        for( k = 0; k < N; k += 2 )            // :2 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
        }
    }
}
```
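Following the same pattern, the LPS=1:1:4 variant described in Table 4 might look like the following sketch (it assumes N is a multiple of 4):

```c
for( i = 0; i < N; i += 1 )                    // :1
{
    for( j = 0; j < N; j += 1 )                // :1
    {
        for( k = 0; k < N; k += 4 )            // :4 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
            C[i][j] += A[i][k+2] * B[k+2][j];
            C[i][j] += A[i][k+3] * B[k+3][j];
        }
    }
}
```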
Note: A C++ compiler can also unroll loops when the corresponding command-line options are used. A software engineer should avoid combining such compiler unrolling with unrolling in the source code, because the combination can prevent vectorization of the inner loops and degrade performance of processing.
Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:
```c
for( i = 0; i < N; i += 1 )                    // loop 1
    for( k = 0; k < N; k += 1 )                // loop 2
        for( j = 0; j < N; j += 1 )            // loop 3
            C[i][j] += A[i][k] * B[k][j];
```
It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.
Loop unrolling and the LIOT improve performance of processing because elements of matrices A and B are accessed more efficiently.
Compute Schemes
A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.
In a CMMA an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[k][j]
and its CS is ij:ik:kj.
However, elements of matrix B are accessed in a very inefficient way: the next element of matrix B needed in the calculation is located at a distance of (N * sizeof( datatype )) bytes. For very small matrices this is not critical because they fit into the CPU caches. For larger matrices, however, performance of computations can be significantly degraded due to cache misses.
In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.
In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[j][k]
and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].
It is very important to use the fastest possible algorithm for the transposition of matrix B before processing is started. Appendix B gives an example of how much time is needed to transpose a square matrix of dimension 32,768 (32,768 x 32,768), and how much time is needed to compute the product, on an Intel Xeon Phi processor system.
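A minimal sketch of the transpose-based scheme is shown below; for brevity it uses a simple classic in-place transposition of matrix B (see Appendix B for timings of faster MTAs), followed by the CS=ij:ik:jk multiplication:

```c
// Transpose square matrix B in place (simple classic MTA)
for( i = 0; i < N; i += 1 )
{
    for( j = i + 1; j < N; j += 1 )
    {
        float t = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = t;
    }
}

// CS = ij:ik:jk - both A and the transposed B are now accessed row-wise
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[j][k];
```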
Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:
```c
for( i = 0; i < N; i += BlockSize )
{
    for( j = 0; j < N; j += BlockSize )
    {
        for( k = 0; k < N; k += BlockSize )
        {
            for( ii = i; ii < ( i+BlockSize ); ii += 1 )
                for( jj = j; jj < ( j+BlockSize ); jj += 1 )
                    for( kk = k; kk < ( k+BlockSize ); kk += 1 )
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
        }
    }
}
```
Note: A detailed description of the LBOT can be found in [10].
Table 5 shows four examples of CSs:
CS | Note |
---|---|
ij:ik:kj | Default |
ij:ik:jk | Transposed |
iijj:iikk:kkjj | Default LBOT |
iijj:iikk:jjkk | Transposed LBOT |
Table 5.
Error Analysis
In any version of MMA, many FP operations need to be done in order to compute the values of the elements of matrix C. Since the SP and DP FP data types have limited precision [4], rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in floating point arithmetic (FPA), rounding errors also depend on the range of the input values, not only on the size of the input matrices.
However, a very simple optimization technique allows improvement in the accuracy of computations.
If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:
```c
for( i = 0; i < N; i += 1 )
{
    for( j = 0; j < N; j += 1 )
    {
        double sum = 0.0;
        for( k = 0; k < N; k += 1 )
        {
            sum += ( double )( A[i][k] * B[k][j] );
        }
        C[i][j] = sum;
    }
}
```
The accuracy of computations will be improved, but performance of processing can be lower.
An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).
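As an illustration of how absolute errors like the ones in Table 6 could be measured, the matrices can be initialized so that the exact result is known in advance: if every element of A is 1.0 and every element of B is 1.00001, every element of the product equals N times the stored SP value of 1.00001. The sketch below follows that idea; the initialization values and the matrix dimension are illustrative and are not necessarily the ones used in mmatest4.c:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main( void )
{
    float *A = ( float * )malloc( ( size_t )N * N * sizeof( float ) );
    float *B = ( float * )malloc( ( size_t )N * N * sizeof( float ) );
    float *C = ( float * )calloc( ( size_t )N * N, sizeof( float ) );
    int i, j, k;

    if( A == NULL || B == NULL || C == NULL )
        return 1;

    for( i = 0; i < N * N; i += 1 ) { A[i] = 1.0f; B[i] = 1.00001f; }

    // SP CMMA with accumulation in SP
    for( i = 0; i < N; i += 1 )
        for( j = 0; j < N; j += 1 )
            for( k = 0; k < N; k += 1 )
                C[N*i+j] += A[N*i+k] * B[N*k+j];

    // Every element of the exact product equals N * B[0] (computed here in DP)
    double reference = ( double )N * ( double )B[0];
    printf( "Calculated SP value: %.6f  Absolute error: %.6f\n",
            ( double )C[0], ( double )C[0] - reference );

    free( A ); free( B ); free( C );
    return 0;
}
```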
Performance on the Intel® Xeon Phi™ Processor System
Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to the highly optimized 'cblas_sgemm' function of the Intel MKL [7]. Also see Appendix D for more evaluations.
Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.
Here are the names of source files with a short description of tests:
mmatest1.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in the 'Flat' MCDRAM mode for the DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('Flat', 'Hybrid 50-50', and 'Cache') for the DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, the MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on an Intel Xeon Phi processor.
OpenMP* Product Thread Affinity Control
OpenMP* compiler directives can easily be used to parallelize processing and significantly speed it up. However, it is very important to distribute OpenMP threads across different logical CPUs of modern multicore processors in order to utilize their internal resources as efficiently as possible.
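For example, the outermost loop of the CMMA core part can be parallelized with a single OpenMP directive, and thread placement can then be controlled with the affinity settings described below. This is a minimal sketch; the function name is illustrative, and the program must be compiled with the compiler's OpenMP option (for example, '-qopenmp' for the Intel C++ Compiler):

```c
#include <omp.h>

// Core part of the CMMA for square matrices stored as 1-D arrays;
// iterations of the outer 'i' loop are distributed across OpenMP threads.
void cmma_openmp( int n, const float *A, const float *B, float *C )
{
    #pragma omp parallel for
    for( int i = 0; i < n; i += 1 )
        for( int j = 0; j < n; j += 1 )
            for( int k = 0; k < n; k += 1 )
                C[n*i+j] += A[n*i+k] * B[n*k+j];
}
```

Because each OpenMP thread writes to a distinct set of rows of matrix C, no synchronization is needed inside the loop nest.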
In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:
```
KMP_AFFINITY=scatter
KMP_AFFINITY=balanced
KMP_AFFINITY=compact
```
These two screenshots of the Htop* utility [12] demonstrate how OpenMP threads are assigned (pinned) to logical CPUs of the Intel Xeon Phi processor 7210 [5] during processing of an MMA using 64 cores of the processor:
Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.
Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.
Recommended Intel® C++ Compiler Command-Line Options
Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:
O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans
Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch
The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.
Conclusion
The application of different optimization techniques to the CMMA was reviewed in this article.
Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to the highly optimized 'cblas_sgemm' function of the Intel MKL [7].
Tests were completed on a computer system with an Intel® Xeon Phi processor 7210 [5] running the Red Hat Linux operating system in the 'All2All' cluster mode and for the 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.
It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.
Advantages of using CMMA to compute the product of two matrices are as follows:
- In any programming language, simple to implement to run on CPUs or GPUs [9]
- Highly portable source codes when implemented in C, C++, or Java programming languages
- Simple to integrate with existing software for a wide range of computer platforms
- Simple to debug and troubleshoot
- Predictable memory footprint at run time
- Easy to optimize using parallelization and vectorization techniques
- Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
- Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)
Disadvantages of using CMMA to compute a product of two matrices are as follows:
- Poor performance for large matrices with sizes greater than 2,048 x 2,048
- Poor performance when implemented using high-level programming languages due to processing overheads
- Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)
References
1. Matrix Multiplication on Mathworld
http://mathworld.wolfram.com/MatrixMultiplication.html
2. Matrix Multiplication on Wikipedia
https://en.wikipedia.org/wiki/Matrix_multiplication
3. Asymptotic Complexity of an Algorithm
https://en.wikipedia.org/wiki/Time_complexity
4. The IEEE 754 Standard for Floating Point Arithmetic
5. Intel® Many Integrated Core Architecture
https://software.intel.com/en-us/xeon-phi/x200-processor
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
https://software.intel.com/en-us/forums/intel-many-integrated-core
6. Intel® C++ Compiler
https://software.intel.com/en-us/c-compilers
https://software.intel.com/en-us/forums/intel-c-compiler
7. Intel® MKL
https://software.intel.com/en-us/intel-mkl
https://software.intel.com/en-us/intel-mkl/benchmarks
https://software.intel.com/en-us/forums/intel-math-kernel-library
8. Intel® Developer Zone Forums
https://software.intel.com/en-us/forum
9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9
https://software.intel.com/en-us/articles/sgemm-ocl-opt
10. Performance Tools for Software Developers Loop Blocking
https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking
11. Memkind library
https://github.com/memkind/memkind
12. Htop* monitoring utility
https://sourceforge.net/projects/htop
Downloads
List of all files (sources, test reports, and so on):
Performance_CMMA_system.pdf - Copy of this paper.
mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.
dataset1.txt - Results of tests.
mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
dataset2.txt - Results of tests.
mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs.
dataset3.txt - Results of tests.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on Intel® Xeon Phi processors.
dataset4.txt - Results of tests.
Note: Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*
Abbreviations
CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
AVX - Advanced vector extensions
FP - Floating point
FPA - Floating point arithmetic [4]
SP - Single precision [4]
DP - Double precision [4]
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler [6]
MKL - Math Kernel Library [7]
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone [8]
IEEE - Institute of Electrical and Electronics Engineers [4]
GB - Gigabytes
TN - Total number
Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System
Summary of the Intel Xeon Phi processor system used for testing:
Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)
Detailed processor specifications:
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Summary of a Linux operating system:
[guest@... ~]$ uname -a
Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[guest@... ~]$ cat /proc/version
Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016
Appendix B - Comparison of Processing Times for MMAs versus MTA
Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:
[Intel MKL & CMMA]
Matrix A [32768 x 32768] Matrix B [32768 x 32768]
Number of OpenMP threads: 64
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds
[MTA]
Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs
Compared to the processing time of:
MKL 'cblas_sgemm', the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
N | MMA | Calculated SP Value | Absolute Error |
---|---|---|---|
8 | MKL | 8.000080 | 0.000000 |
8 | CMMA | 8.000080 | 0.000000 |
16 | MKL | 16.000160 | 0.000000 |
16 | CMMA | 16.000160 | 0.000000 |
32 | MKL | 32.000309 | -0.000011 |
32 | CMMA | 32.000320 | 0.000000 |
64 | MKL | 64.000671 | 0.000031 |
64 | CMMA | 64.000641 | 0.000001 |
128 | MKL | 128.001160 | -0.000120 |
128 | CMMA | 128.001282 | 0.000002 |
256 | MKL | 256.002319 | -0.000241 |
256 | CMMA | 256.002563 | 0.000003 |
512 | MKL | 512.004639 | -0.000481 |
512 | CMMA | 512.005005 | -0.000115 |
1024 | MKL | 1024.009521 | -0.000719 |
1024 | CMMA | 1024.009888 | -0.000352 |
2048 | MKL | 2048.019043 | -0.001437 |
2048 | CMMA | 2048.021484 | 0.001004 |
4096 | MKL | 4096.038574 | -0.002386 |
4096 | CMMA | 4096.037109 | -0.003851 |
8192 | MKL | 8192.074219 | -0.007701 |
8192 | CMMA | 8192.099609 | 0.017689 |
16384 | MKL | 16384.14648 | -0.017356 |
16384 | CMMA | 16384.09961 | -0.064231 |
32768 | MKL | 32768.33594 | 0.008258 |
32768 | CMMA | 32768.10156 | -0.226118 |
65536 | MKL | 65536.71875 | 0.063390 |
65536 | CMMA | 65536.10156 | -0.553798 |
Table 6.
Appendix D - Performance of MMAs for Different MASs
Figure 2. Performance of Intel® MKL 'cblas_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.
Figure 3. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 4. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 5. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.
About the Author
Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.