Contents
Introduction
An Overview of the Classic Matrix Multiplication Algorithm
Total Number of Floating Point Operations
Implementation Complexity
Optimization Techniques
Memory Allocation Schemes
Loop Processing Schemes
Compute Schemes
Error Analysis
Performance on Intel® Xeon Phi™ Processor System
OpenMP* Product Thread Affinity Control
Recommended Intel® C++ Compiler Command-Line Options
Conclusion
References
Downloads
Abbreviations
Appendix A - Technical Specifications of Intel Xeon Phi Processor System
Appendix B - Comparison of Processing Times for MMAs vs. MTA
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
Appendix D - Performance of MMAs for Different MASs
About the Author
Introduction
Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple: it can be easily implemented in any programming language, and its performance improves significantly when different optimization techniques are applied.
Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to the highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL) [7]. Tests are completed on a computer system with an Intel® Xeon Phi™ processor 7210 [5] running the Red Hat* Linux* operating system in the 'All2All' cluster mode and for the 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux* [6].
The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.
If needed, the reader may review References [1] or [2] for a description of the mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.
An Overview of the Classic Matrix Multiplication Algorithm
A fundamental property of any algorithm is its asymptotic complexity (AC) [3].
In generic form, AC for MMA can be expressed as follows:
MMA AC = O(N^Omega)
where O is the Big O notation used in computer science; N is the dimension of the matrix, and omega is the matrix multiplication exponent, which equals 3.0 for the CMMA. That is:
CMMA AC = O(N^3)
In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.
An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.
A list of several MMAs with different values of omega is as follows:
Algorithm | Omega | Note |
---|---|---|
Francois Le Gall | 2.3728639 | (1) |
Virginia Vassilevska Williams | 2.3728642 | |
Stothers | 2.3740000 | |
Coppersmith-Winograd | 2.3760000 | |
Bini | 2.7790000 | |
Pan | 2.7950000 | |
Strassen | 2.8070000 | (2) |
Strassen-Winograd | 2.8070000 | |
Classic | 3.0000000 | (3) |
Table 1. Algorithms are sorted by omega in ascending order.
Total Number of Floating Point Operations
Let's assume that:
M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]
There are three relations between M, N and P:
Relation #1: the number of columns of A equals the number of rows of B, that is, A[...,N] and B[N,...]
Relation #2: the number of rows of A equals the number of rows of C, that is, A[M,...] and C[M,...]
Relation #3: the number of columns of B equals the number of columns of C, that is, B[...,P] and C[...,P]
If one of these three relations is not met, the product of two matrices cannot be computed.
In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:
A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]
The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).
N | Omega = 2.3728639 (1) | Omega = 2.807 (2) | Omega = 3.0 (3) |
---|---|---|---|
128 | 100,028 | 822,126 | 2,097,152 |
256 | 518,114 | 5,753,466 | 16,777,216 |
512 | 2,683,668 | 40,264,358 | 134,217,728 |
1024 | 13,900,553 | 281,781,176 | 1,073,741,824 |
2048 | 72,000,465 | 1,971,983,042 | 8,589,934,592 |
4096 | 372,939,611 | 13,800,485,780 | 68,719,476,736 |
8192 | 1,931,709,091 | 96,579,637,673 | 549,755,813,888 |
16384 | 10,005,641,390 | 675,891,165,093 | 4,398,046,511,104 |
32768 | 51,826,053,965 | 4,730,074,351,662 | 35,184,372,088,832 |
65536 | 268,442,548,034 | 33,102,375,837,652 | 281,474,976,710,656 |
Table 2.
For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.
Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:
~35,184,372,088,832 / ~51,826,053,965 = ~678.9
In the case of using a famous Strassen (2) MMA, ~91 days would be needed:
~4,730,074,351,662 / ~51,826,053,965 = ~91.3
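The multiplication counts in Table 2, and the ratios above, can be reproduced with a few lines of C by evaluating N^omega directly. Here is a minimal sketch; the printed values are rounded, so small differences from Table 2 are possible:

```c
#include <math.h>
#include <stdio.h>

int main( void )
{
    // Matrix exponents from Table 1: Le Gall (1), Strassen (2), Classic (3)
    const double omegas[3] = { 2.3728639, 2.807, 3.0 };

    printf( "%8s %24s %24s %24s\n", "N", "Omega=2.3728639", "Omega=2.807", "Omega=3.0" );
    for( long n = 128; n <= 65536; n *= 2 )
    {
        printf( "%8ld", n );
        for( int i = 0; i < 3; i += 1 )
            printf( " %24.0f", pow( ( double )n, omegas[i] ) );   // ~number of multiplications
        printf( "\n" );
    }
    return 0;
}
```

Compile with, for example, 'icc table2.c -o table2 -lm' (the file name is illustrative).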
In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.
With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.
However, an estimated value of the TN of FP operations (multiplications and additions) for the CMMA with square matrices can be easily calculated. Here are two simple examples:
Example 1: N = 2
```
Multiplications = 8            // 2 * 2 * 2 = 2^3
Additions       = 4            // 2 * 2 * 1 = 2^2 * (2-1)
TN FP Ops       = 8 + 4 = 12
```
Example 2: N = 3
```
Multiplications = 27           // 3 * 3 * 3 = 3^3
Additions       = 18           // 3 * 3 * 2 = 3^2 * (3-1)
TN FP Ops       = 27 + 18 = 45
```
It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:
TN FP Ops = (N^3) + ((N^2) * (N-1))
Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.
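For reference, the formula above can be wrapped in a small helper that also converts a measured processing time into a FLOPS value. This is a minimal sketch; the matrix dimension and the processing time below are hypothetical placeholders:

```c
#include <stdio.h>

// TN of FP operations for the CMMA on square N x N matrices:
// N^3 multiplications plus N^2 * (N - 1) additions.
static double cmma_total_fp_ops( double n )
{
    return ( n * n * n ) + ( ( n * n ) * ( n - 1.0 ) );
}

int main( void )
{
    double n       = 1024.0;   // hypothetical matrix dimension
    double seconds = 1.0;      // hypothetical measured processing time
    double gflops  = cmma_total_fp_ops( n ) / seconds / 1.0e9;

    printf( "N = %.0f  TN FP Ops = %.0f  GFLOPS = %.3f\n",
            n, cmma_total_fp_ops( n ), gflops );
    return 0;
}
```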
Implementation Complexity
In the C programming language only four lines of code are needed to implement a core part of the CMMA:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
```
Therefore, CMMA's implementation complexity (IC) could be rated as very simple.
Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.
More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.
Optimization Techniques
In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.
Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):
```c
float fA[N*N];
float fB[N*N];
float fC[N*N];
```
and this is what a core part of the CMMA looks like:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[N*i+j] += A[N*i+k] * B[N*k+j];
```
Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):
```c
float fA[N][N];
float fB[N][N];
float fC[N][N];
```
and this is what the core part of CMMA looks like:
```c
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
```
Many other variants of the core part of CMMA are possible and they will be reviewed.
Memory Allocation Schemes
In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:
Declaration of matrices A, B, and C as 1-D data structures:
```c
__attribute__( ( aligned( 64 ) ) ) float *fA;
__attribute__( ( aligned( 64 ) ) ) float *fB;
__attribute__( ( aligned( 64 ) ) ) float *fC;
```
and this is how memory needs to be allocated:
```c
fA = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fB = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fC = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
```
Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.
Declaration of matrices A, B, and C as 2-D data structures:
```c
__attribute__( ( aligned( 64 ) ) ) float **fA;
__attribute__( ( aligned( 64 ) ) ) float **fB;
__attribute__( ( aligned( 64 ) ) ) float **fC;
```
and this is how memory needs to be allocated:
```c
fA = ( float ** )calloc( N, sizeof( float * ) );
fB = ( float ** )calloc( N, sizeof( float * ) );
fC = ( float ** )calloc( N, sizeof( float * ) );

for( i = 0; i < N; i += 1 )
{
    fA[i] = ( float * )calloc( N, sizeof( float ) );
    fB[i] = ( float * )calloc( N, sizeof( float ) );
    fC[i] = ( float * )calloc( N, sizeof( float ) );
}
```
Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.
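One common way to keep the convenience of 2-D indexing and still get a contiguous, aligned data block is to allocate the matrix data as a single block and build a separate array of row pointers into it. The following is a minimal sketch; the helper names are illustrative and are not taken from the test programs:

```c
#include <stdlib.h>
#include <immintrin.h>    // _mm_malloc, _mm_free

// Allocate an N x N matrix of floats as one contiguous 64-byte aligned block
// plus an array of row pointers into that block; returns NULL on failure.
static float **alloc_matrix_2d( size_t n )
{
    float **rows = ( float ** )_mm_malloc( n * sizeof( float * ), 64 );
    float  *data = ( float *  )_mm_malloc( n * n * sizeof( float ), 64 );

    if( rows == NULL || data == NULL )
    {
        if( rows != NULL ) _mm_free( rows );
        if( data != NULL ) _mm_free( data );
        return NULL;
    }
    for( size_t i = 0; i < n; i += 1 )
        rows[i] = &data[ n * i ];              // row i points into the contiguous block
    return rows;
}

static void free_matrix_2d( float **rows )
{
    if( rows != NULL )
    {
        _mm_free( rows[0] );                   // frees the contiguous data block
        _mm_free( rows );
    }
}
```

With this scheme the 2-D core part of the CMMA, C[i][j] += A[i][k] * B[k][j], can be used unchanged, while the data layout in memory is identical to the 1-D case.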
In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system [5] a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from the memkind library [11] when the MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:
```c
fA = ( float * )hbw_malloc( N * N * sizeof( float ) );
fB = ( float * )hbw_malloc( N * N * sizeof( float ) );
fC = ( float * )hbw_malloc( N * N * sizeof( float ) );
```
Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.
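When it is not known in advance whether MCDRAM can be allocated directly (for example, in the 'Cache' MCDRAM mode it cannot), the memkind library's 'hbw_check_available' function can be used to fall back to a regular allocation. A minimal sketch follows; the helper name is illustrative:

```c
#include <stdlib.h>
#include <hbwmalloc.h>     // hbw_check_available, hbw_malloc, hbw_free

// Allocate N * N floats from MCDRAM if high-bandwidth memory is available,
// otherwise fall back to a regular DDR4 allocation.
static float *alloc_matrix_hbw( size_t n, int *from_mcdram )
{
    if( hbw_check_available() == 0 )           // 0 means HBW memory is available
    {
        *from_mcdram = 1;
        return ( float * )hbw_malloc( n * n * sizeof( float ) );
    }
    *from_mcdram = 0;
    return ( float * )malloc( n * n * sizeof( float ) );
}
```

A matrix allocated this way must be released with 'hbw_free' or 'free' accordingly, and the program must be linked with the memkind library (for example, with '-lmemkind').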
On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:
Matrix A | Matrix B | Matrix C | Note |
---|---|---|---|
DDR4 | DDR4 | DDR4 | (1) |
DDR4 | DDR4 | MCDRAM | (2) |
DDR4 | MCDRAM | DDR4 | |
DDR4 | MCDRAM | MCDRAM | |
MCDRAM | DDR4 | DDR4 | |
MCDRAM | DDR4 | MCDRAM | |
MCDRAM | MCDRAM | DDR4 | |
MCDRAM | MCDRAM | MCDRAM |
Table 3.
It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, which is ~5 times higher than the ~80 GB/s bandwidth of DDR4 memory [5].
Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):
```
Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 50.918 seconds

Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 47.385 seconds
```
It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.
Loop Processing Schemes
A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:
```c
for( i = 0; i < N; i += 1 )                    // loop 1
    for( j = 0; j < N; j += 1 )                // loop 2
        for( k = 0; k < N; k += 1 )            // loop 3
            C[i][j] += A[i][k] * B[k][j];
```
corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.
Table 4 below includes short descriptions of different LPSs:
LPS | Note |
---|---|
1:1:1 | Loops not unrolled |
1:1:2 | 3rd loop unrolls to 2-in-1 computations |
1:1:4 | 3rd loop unrolls to 4-in-1 computations |
1:1:8 | 3rd loop unrolls to 8-in-1 computations |
1:2:1 | 2nd loop unrolls to 2-in-1 computations |
1:4:1 | 2nd loop unrolls to 4-in-1 computations |
1:8:1 | 2nd loop unrolls to 8-in-1 computations |
Table 4.
For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:
```c
for( i = 0; i < N; i += 1 )                    // :1
{
    for( j = 0; j < N; j += 1 )                // :1
    {
        for( k = 0; k < N; k += 2 )            // :2 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
        }
    }
}
```
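Following the same pattern, the LPS=1:1:4 variant described in Table 4 might look like the following sketch (it assumes N is a multiple of 4):

```c
for( i = 0; i < N; i += 1 )                    // :1
{
    for( j = 0; j < N; j += 1 )                // :1
    {
        for( k = 0; k < N; k += 4 )            // :4 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
            C[i][j] += A[i][k+2] * B[k+2][j];
            C[i][j] += A[i][k+3] * B[k+3][j];
        }
    }
}
```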
Note: A C++ compiler can also unroll loops when the corresponding command-line options are used. A software engineer should avoid combining such compiler unrolling with unrolling in the source code, because the combination can prevent vectorization of the inner loops and degrade performance of processing.
Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:
```c
for( i = 0; i < N; i += 1 )                    // loop 1
    for( k = 0; k < N; k += 1 )                // loop 2
        for( j = 0; j < N; j += 1 )            // loop 3
            C[i][j] += A[i][k] * B[k][j];
```
It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.
Loop unrolling and the LIOT improve performance of processing because elements of matrices A and B are accessed more efficiently.
Compute Schemes
A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.
In a CMMA an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[k][j]
and its CS is ij:ik:kj.
However, elements of matrix B are accessed in a very inefficient way: the next element of matrix B needed in the calculation is located at a distance of (N * sizeof( datatype )) bytes. For very small matrices this is not critical because they fit into the CPU caches. For larger matrices, however, performance of computations can be significantly degraded due to cache misses.
In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.
In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[j][k]
and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].
It is very important to use the fastest possible algorithm for the transposition of matrix B before processing is started. Appendix B gives an example of how much time is needed to transpose a square matrix of dimension 32,768 (32,768 x 32,768), and how much time is needed to compute the product, on an Intel Xeon Phi processor system.
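A minimal sketch of the transpose-based scheme is shown below; for brevity it uses a simple classic in-place transposition of matrix B (see Appendix B for timings of faster MTAs), followed by the CS=ij:ik:jk multiplication:

```c
// Transpose square matrix B in place (simple classic MTA)
for( i = 0; i < N; i += 1 )
{
    for( j = i + 1; j < N; j += 1 )
    {
        float t = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = t;
    }
}

// CS = ij:ik:jk - both A and the transposed B are now accessed row-wise
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[j][k];
```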
Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:
```c
for( i = 0; i < N; i += BlockSize )
{
    for( j = 0; j < N; j += BlockSize )
    {
        for( k = 0; k < N; k += BlockSize )
        {
            for( ii = i; ii < ( i+BlockSize ); ii += 1 )
                for( jj = j; jj < ( j+BlockSize ); jj += 1 )
                    for( kk = k; kk < ( k+BlockSize ); kk += 1 )
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
        }
    }
}
```
Note: A detailed description of the LBOT can be found in [10].
Table 5 shows four examples of CSs:
CS | Note |
---|---|
ij:ik:kj | Default |
ij:ik:jk | Transposed |
iijj:iikk:kkjj | Default LBOT |
iijj:iikk:jjkk | Transposed LBOT |
Table 5.
Error Analysis
In any version of MMA, many FP operations need to be done in order to compute the values of the elements of matrix C. Since the SP and DP FP data types have limited precision [4], rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in floating point arithmetic (FPA), rounding errors also depend on the range of the input values, not only on the size of the input matrices.
However, a very simple optimization technique allows improvement in the accuracy of computations.
If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:
```c
for( i = 0; i < N; i += 1 )
{
    for( j = 0; j < N; j += 1 )
    {
        double sum = 0.0;
        for( k = 0; k < N; k += 1 )
        {
            sum += ( double )( A[i][k] * B[k][j] );
        }
        C[i][j] = sum;
    }
}
```
The accuracy of computations will be improved, but performance of processing can be lower.
An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).
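As an illustration of how absolute errors like the ones in Table 6 could be measured, the matrices can be initialized so that the exact result is known in advance: if every element of A is 1.0 and every element of B is 1.00001, every element of the product equals N times the stored SP value of 1.00001. The sketch below follows that idea; the initialization values and the matrix dimension are illustrative and are not necessarily the ones used in mmatest4.c:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main( void )
{
    float *A = ( float * )malloc( ( size_t )N * N * sizeof( float ) );
    float *B = ( float * )malloc( ( size_t )N * N * sizeof( float ) );
    float *C = ( float * )calloc( ( size_t )N * N, sizeof( float ) );
    int i, j, k;

    if( A == NULL || B == NULL || C == NULL )
        return 1;

    for( i = 0; i < N * N; i += 1 ) { A[i] = 1.0f; B[i] = 1.00001f; }

    // SP CMMA with accumulation in SP
    for( i = 0; i < N; i += 1 )
        for( j = 0; j < N; j += 1 )
            for( k = 0; k < N; k += 1 )
                C[N*i+j] += A[N*i+k] * B[N*k+j];

    // Every element of the exact product equals N * B[0] (computed here in DP)
    double reference = ( double )N * ( double )B[0];
    printf( "Calculated SP value: %.6f  Absolute error: %.6f\n",
            ( double )C[0], ( double )C[0] - reference );

    free( A ); free( B ); free( C );
    return 0;
}
```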
Performance on the Intel® Xeon Phi™ Processor System
Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to the highly optimized 'cblas_sgemm' function of the Intel MKL [7]. Also see Appendix D for more evaluations.
Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.
Here are the names of source files with a short description of tests:
mmatest1.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in the 'Flat' MCDRAM mode for the DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('Flat', 'Hybrid 50-50', and 'Cache') for the DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, the MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on an Intel Xeon Phi processor.
OpenMP* Product Thread Affinity Control
OpenMP* compiler directives can easily be used to parallelize processing and significantly speed it up. However, it is very important to distribute OpenMP threads across different logical CPUs of modern multicore processors in order to utilize their internal resources as efficiently as possible.
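For example, the outermost loop of the CMMA core part can be parallelized with a single OpenMP directive, and thread placement can then be controlled with the affinity settings described below. This is a minimal sketch; the function name is illustrative, and the program must be compiled with the compiler's OpenMP option (for example, '-qopenmp' for the Intel C++ Compiler):

```c
#include <omp.h>

// Core part of the CMMA for square matrices stored as 1-D arrays;
// iterations of the outer 'i' loop are distributed across OpenMP threads.
void cmma_openmp( int n, const float *A, const float *B, float *C )
{
    #pragma omp parallel for
    for( int i = 0; i < n; i += 1 )
        for( int j = 0; j < n; j += 1 )
            for( int k = 0; k < n; k += 1 )
                C[n*i+j] += A[n*i+k] * B[n*k+j];
}
```

Because each OpenMP thread writes to a distinct set of rows of matrix C, no synchronization is needed inside the loop nest.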
In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:
```
KMP_AFFINITY=scatter
KMP_AFFINITY=balanced
KMP_AFFINITY=compact
```
These two screenshots of the Htop* utility [12] demonstrate how OpenMP threads are assigned (pinned) to logical CPUs of the Intel Xeon Phi processor 7210 [5] during processing of an MMA using 64 cores of the processor:
Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.
Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.
Recommended Intel® C++ Compiler Command-Line Options
Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:
O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans
Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch
The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.
Conclusion
The application of different optimization techniques to the CMMA was reviewed in this article.
Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to the highly optimized 'cblas_sgemm' function of the Intel MKL [7].
Tests were completed on a computer system with an Intel® Xeon Phi processor 7210 [5] running the Red Hat Linux operating system in the 'All2All' cluster mode and for the 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.
It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.
Advantages of using CMMA to compute the product of two matrices are as follows:
- In any programming language, simple to implement to run on CPUs or GPUs [9]
- Highly portable source codes when implemented in C, C++, or Java programming languages
- Simple to integrate with existing software for a wide range of computer platforms
- Simple to debug and troubleshoot
- Predictable memory footprint at run time
- Easy to optimize using parallelization and vectorization techniques
- Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
- Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)
Disadvantages of using CMMA to compute a product of two matrices are as follows:
- Poor performance for large matrices with sizes greater than 2,048 x 2,048
- Poor performance when implemented using high-level programming languages due to processing overheads
- Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)
References
1. Matrix Multiplication on Mathworld
http://mathworld.wolfram.com/MatrixMultiplication.html
2. Matrix Multiplication on Wikipedia
https://en.wikipedia.org/wiki/Matrix_multiplication
3. Asymptotic Complexity of an Algorithm
https://en.wikipedia.org/wiki/Time_complexity
4. The IEEE 754 Standard for Floating Point Arithmetic
5. Intel® Many Integrated Core Architecture
https://software.intel.com/en-us/xeon-phi/x200-processor
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
https://software.intel.com/en-us/forums/intel-many-integrated-core
6. Intel® C++ Compiler
https://software.intel.com/en-us/c-compilers
https://software.intel.com/en-us/forums/intel-c-compiler
7. Intel® MKL
https://software.intel.com/en-us/intel-mkl
https://software.intel.com/en-us/intel-mkl/benchmarks
https://software.intel.com/en-us/forums/intel-math-kernel-library
8. Intel® Developer Zone Forums
https://software.intel.com/en-us/forum
9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9
https://software.intel.com/en-us/articles/sgemm-ocl-opt
10. Performance Tools for Software Developers Loop Blocking
https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking
11. Memkind library
https://github.com/memkind/memkind
12. Htop* monitoring utility
https://sourceforge.net/projects/htop
Downloads
List of all files (sources, test reports, and so on):
Performance_CMMA_system.pdf - Copy of this paper.
mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.
dataset1.txt - Results of tests.
mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
dataset2.txt - Results of tests.
mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs.
dataset3.txt - Results of tests.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on Intel® Xeon Phi processors.
dataset4.txt - Results of tests.
Note: Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*
Abbreviations
CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
AVX - Advanced vector extensions
FP - Floating point
FPA - Floating point arithmetic [4]
SP - Single precision [4]
DP - Double precision [4]
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler [6]
MKL - Math Kernel Library [7]
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone [8]
IEEE - Institute of Electrical and Electronics Engineers [4]
GB - Gigabytes
TN - Total number
Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System
Summary of the Intel Xeon Phi processor system used for testing:
Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)
Detailed processor specifications:
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Summary of a Linux operating system:
[guest@... ~]$ uname -a
Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[guest@... ~]$ cat /proc/version
Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016
Appendix B - Comparison of Processing Times for MMAs versus MTA
Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:
[Intel MKL & CMMA]
Matrix A [32768 x 32768] Matrix B [32768 x 32768]
Number of OpenMP threads: 64
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds
[MTA]
Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs
Compared to the processing time of:
MKL 'cblas_sgemm', the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
N | MMA | Calculated SP Value | Absolute Error |
---|---|---|---|
8 | MKL | 8.000080 | 0.000000 |
8 | CMMA | 8.000080 | 0.000000 |
16 | MKL | 16.000160 | 0.000000 |
16 | CMMA | 16.000160 | 0.000000 |
32 | MKL | 32.000309 | -0.000011 |
32 | CMMA | 32.000320 | 0.000000 |
64 | MKL | 64.000671 | 0.000031 |
64 | CMMA | 64.000641 | 0.000001 |
128 | MKL | 128.001160 | -0.000120 |
128 | CMMA | 128.001282 | 0.000002 |
256 | MKL | 256.002319 | -0.000241 |
256 | CMMA | 256.002563 | 0.000003 |
512 | MKL | 512.004639 | -0.000481 |
512 | CMMA | 512.005005 | -0.000115 |
1024 | MKL | 1024.009521 | -0.000719 |
1024 | CMMA | 1024.009888 | -0.000352 |
2048 | MKL | 2048.019043 | -0.001437 |
2048 | CMMA | 2048.021484 | 0.001004 |
4096 | MKL | 4096.038574 | -0.002386 |
4096 | CMMA | 4096.037109 | -0.003851 |
8192 | MKL | 8192.074219 | -0.007701 |
8192 | CMMA | 8192.099609 | 0.017689 |
16384 | MKL | 16384.14648 | -0.017356 |
16384 | CMMA | 16384.09961 | -0.064231 |
32768 | MKL | 32768.33594 | 0.008258 |
32768 | CMMA | 32768.10156 | -0.226118 |
65536 | MKL | 65536.71875 | 0.063390 |
65536 | CMMA | 65536.10156 | -0.553798 |
Table 6.
Appendix D - Performance of MMAs for Different MASs
Figure 2. Performance of Intel® MKL 'cblas_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.
Figure 3. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 4. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 5. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.
About the Author
Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.