1. Background
Sparse Matrix-Vector Multiplication (SpMxV) is a common linear algebra operation that often appears in recognition-related problems such as speech recognition. In the standard framework of speech/facial recognition, the input data extracted from the outside world are not directly suitable for pattern matching. A mandatory step is to transform the input data into more compact and amenable feature data by multiplying them with a huge constant sparse parameter matrix.
Figure 1: Linear Feature Extraction Equation
A matrix is characterized as sparse if most of its elements are zero. The density of a matrix is defined as the percentage of non-zero elements in the matrix, which for the matrices considered here varies from 0% to 50%. The basic idea in optimizing a SpMxV is to concentrate the non-zero elements so as to avoid as many unnecessary multiply-with-zero operations as possible. In general, concentration methods can be classified into two kinds.
The first is the widely used Compressed Row Storage (CRS), which stores, for each row, only the non-zero elements and their position information. However, it is so unfriendly to modern SIMD architectures that it can hardly be vectorized with SIMD, and it outperforms SIMD-accelerated ordinary matrix-vector multiplication only when the matrix is extremely sparse. A variant of this scheme, tailored for SIMD implementation, is Blocked Compressed Row Storage (BCRS), in which a fixed-size block instead of a single element is handled in the same way. Because of the indirect memory accesses involved, its performance may degrade severely as matrix density increases. A reference kernel is sketched below.
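For concreteness, the following is a minimal scalar CRS kernel for y = M*x; the struct layout is illustrative rather than prescribed by this paper. The indirect access x[col[j]] is exactly what resists SIMD vectorization.

```c
/* Illustrative CRS layout: row i owns entries [row_start[i], row_start[i+1]). */
typedef struct {
    int    rows;
    int   *row_start;  /* rows + 1 entries */
    int   *col;        /* column index of each stored non-zero */
    float *val;        /* value of each stored non-zero */
} crs_matrix;

void crs_spmv(const crs_matrix *m, const float *x, float *y)
{
    for (int i = 0; i < m->rows; i++) {
        float sum = 0.0f;
        /* Indirect, data-dependent loads of x make this loop hard to vectorize. */
        for (int j = m->row_start[i]; j < m->row_start[i + 1]; j++)
            sum += m->val[j] * x[m->col[j]];
        y[i] = sum;
    }
}
```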
The second is to reorder the matrix rows/columns via permutation. The key to these algorithms is finding the best matrix permutation scheme as measured by some criterion correlated with the degree of non-zero concentration, such as:
- Grouping non-zero elements together to facilitate partitioning the matrix into sub-matrices
- Minimizing the total count of contiguous N x 1 blocks of non-zeros
Figure 2: Permutation to minimize N x 1 blocks
However, in some applications, such as speech/facial recognition, there exist permutation-insensitive sparse matrices, i.e., matrices for which no permutation brings about a significant improvement for SpMxV. An extremely simplified example matrix is:
Figure 3: Simplest permutation-insensitive matrix
If the non-zero elements are uniformly distributed inside a sparse matrix, it may happen that exchanging any two columns benefits nearly as many rows as it hurts. When this is the case, the matrix is permutation-insensitive.
Additionally, for sparse matrices of somewhat high density, if no help can be expected from the two methods above, we have to resort to ordinary matrix-vector multiplication merely accelerated by SIMD instructions, illustrated in Figure 4, which is totally sparseness-unaware. In hopes of alleviating this problem, we devised and generalized a gathering-based SpMxV algorithm that is effective not only for evenly distributed but also for irregular constant sparse matrices.
2. Terms and Task
Before detailing the algorithm, we introduce some terms, definitions, and assumptions to ease the description.
A SIMD Block is a memory block of the same size as a SIMD register. A SIMD BlockSet consists of one or several SIMD Blocks. A SIMD value is either a SIMD Block or a SIMD register, and can be a SIMD instruction operand.
An element is the underlying basic data unit of a SIMD value. An element's type can be a built-in integer or floating-point type. The type of the whole SIMD value is called the SIMD type, and is a vector of the element type. The element index is the element's LSB-order position in the SIMD value, equal to element-offset / element-byte-size.
Instructions that load a SIMD Block into a SIMD register are symbolized as SIMD_LOAD. For most element types, there are corresponding SIMD multiplication or multiply-accumulate instructions; on X86, examples are PMADDUBSW/PMADDWD for integers and MULPS/MULPD for floats. These instructions are symbolized as SIMD_MUL.
Angle brackets "< >" indicate parameterization, similar to C++ templates.
For a value X in memory or a register, X<L>[i] is the ith L-bit slice of X, in LSB order.
On modern SIMD processors, an ordinary matrix-vector multiplication can be greatly accelerated with SIMD instructions, as in the following pseudo-code:
Figure 4: Plain Matrix-Vector SIMD Multiplication
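As one possible rendering of this pseudo-code (Figure 4 itself is element-type-agnostic), a float kernel for one matrix row with SSE might look like the sketch below, assuming 16-byte-aligned data and a column count divisible by 4:

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps is SIMD_LOAD, _mm_mul_ps is SIMD_MUL */

/* Sparseness-unaware row-times-vector product: every block is loaded and
   multiplied, including all-zero ones. */
void plain_simd_row_mv(const float *row, const float *v, int cols, float *out)
{
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < cols; j += 4) {
        __m128 mb = _mm_load_ps(row + j);           /* SIMD_LOAD matrix block */
        __m128 vb = _mm_load_ps(v + j);             /* SIMD_LOAD vector block */
        acc = _mm_add_ps(acc, _mm_mul_ps(mb, vb));  /* SIMD_MUL + accumulate  */
    }
    /* Horizontal sum of the four partial sums. */
    __m128 hi = _mm_movehl_ps(acc, acc);            /* lanes 2,3 down to 0,1  */
    __m128 s  = _mm_add_ps(acc, hi);
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     /* lane 0 += lane 1       */
    _mm_store_ss(out, s);
}
```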
In the case of a sparse matrix, we propose an innovative technique to compact the non-zeros of the matrix while sustaining SpMxV's implementability via the SIMD ISA as in the above pseudo-code, with the goal of reducing unnecessary SIMD_MUL instructions. Since the matrix is assumed to be constant, compacting the non-zeros is considered preprocessing on the matrix, which can be completed during program initialization or off-line matrix data preparation, so that no runtime cost is incurred per matrix-vector multiplication.
3. Description
GATHER Operation
First of all, we define a conceptual GATHER operation, which is the basis of this work. Its general description is:
GATHER<T, K>(destination = [D0, D1, …, DE–1], source = [S0, S1, …, SK*E–1], hint = [H0, H1, …, HE–1])
The parameters destination and source are SIMD values whose SIMD type is specified by T. destination is a single SIMD value whose element count is denoted by E, while source consists of K SIMD value(s) whose total element count is K*E. The parameter hint, called the Relocation Hint, holds E integer values, each of which is called a Relocation Index. A Relocation Index is derived from a virtual index ranging from –1 to K*E–1, and can be described by a mathematical mapping:
RELOCATION_INDEX<T>(index), abbreviated as RI<T>(index)
The GATHER operation moves elements of source into destination based on the Relocation Indices, according to the two rules below (a scalar reference model follows the list):
- If Hi is RI<T>(–1), GATHER retains the content of Di.
- If Hi is RI<T>(j) (0 ≤ j < K*E), GATHER moves Sj to Di.
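Ignoring how Relocation Indices are encoded, these two rules amount to the scalar reference model sketched below; float elements and the plain-array hint are purely illustrative:

```c
/* Scalar reference model of the conceptual GATHER<T, K>. E is the element
   count of one SIMD value; hint holds raw virtual indices in [-1, K*E). */
void gather_reference(float *dst,        /* destination: E elements */
                      const float *src,  /* source: K*E elements    */
                      const int *hint,   /* E virtual indices       */
                      int E)
{
    for (int i = 0; i < E; i++) {
        if (hint[i] >= 0)
            dst[i] = src[hint[i]];  /* Hi = RI(j): move Sj into Di */
        /* hint[i] == -1 (RI(-1)): retain the existing content of Di */
    }
}
```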
The implementation of the GATHER operation is specific to the processor's ISA. Correspondingly, the RI mapping depends on the instruction selection for GATHER. Likewise, the materialization of hint may be a SIMD value or an integer array, or even a mix with other Relocation Hints; it is totally instruction-specific.
Depending on the ISA availability of a given SIMD processor, we consider only those GATHER operations, called fast or intrinsic GATHER operations, that can be translated into a simple and efficient instruction sequence costing few CPU cycles.
Fast GATHER on X86
On X86 processors, we propose a method to construct a fast GATHER using a BLEND and SHUFFLE instruction pair.
Given a SIMD type T, idealized BLEND and SHUFFLE instructions are defined as:
- BLEND<T, L>(operand1, operand2, mask) -> result
L is a power of 2, not more than the element bit length of T. operand1, operand2, and result are values of type T; mask is a SIMD value whose elements are L-bit integers, with element count denoted by E. For the ith (0 ≤ i < E) element of mask, we have:
- operand1<L>[i] -> result<L>[i] (if the element’s MSB is 0)
- operand2<L>[i] -> result<L>[i] (if the element’s MSB is 1)
- SHUFFLE<T, L>(operand1, mask) -> result
The parameter description is the same as for BLEND. In an element of mask, only the low log2(E) bits, called the SHUFFLE INDEX BITS, and the MSB are significant. For the ith (0 ≤ i < E) element of mask, we have:
- operand1<L>[mask<L>[i] & (E–1)] -> result<L>[i] (if the element's MSB is 0)
- instruction-specific value -> result<L>[i] (if the element's MSB is 1)
Then we construct the fast GATHER<T, K> from a SHUFFLE<T, LS> and BLEND<T, LB> instruction pair. The element bit length of T is denoted by LT, and the number of SHUFFLE INDEX BITS by SIB. The Relocation Hint is materialized as one SIMD value in which each Relocation Index is an LT-bit integer. The mathematical mapping RI<T>() is defined as:
RI<T>(virtual index = –1) = –1
If the virtual index is ≥ 0, we can suppose that the element indicated by this index is actually the pth element of the kth (0 ≤ k < K) source SIMD value. The final result, denoted by rid, is computed according to the following formulas:
Case LS ≤ LB (0 ≤ i < LT/LS):
- rid<LS>[i] = k * 2^SIB + p * LT/LS + i   (if i = n * LB/LS – 1 for some integer n)
- rid<LS>[i] = ? * 2^SIB + p * LT/LS + i   (otherwise)
Case LS > LB (0 ≤ i < LT/LB):
- rid<LB>[i] = k * 2^SIB + p * LT/LS + i * LB/LS   (if i = n * LS/LB for some integer n)
- rid<LB>[i] = k * 2^SIB + (? & (2^SIB – 1))   (otherwise)
Here "?" denotes an arbitrary don't-care value.
Figure 5 is an example illustrating the Relocation Hint for a GATHER<8*int16, 2> with LS = LB = 8.
Figure 5: Relocation Hint For Gathering 2 SSE Blocks
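Instantiating the formulas for the Figure 5 case (T = 8*int16, so LT/LS = 2 and SIB = 4; LS = LB means LB/LS = 1, so the first formula applies to every byte), a Relocation Hint can be materialized as in the sketch below; the function name and array layout are illustrative only:

```c
#include <stdint.h>

/* Builds the 16 hint bytes of a GATHER<8*int16, K> Relocation Hint with
   LS = LB = 8. vidx holds 8 virtual indices: -1 to retain the destination
   element, or k*8 + p for element p of source block k. */
void make_hint_8xint16(const int vidx[8], uint8_t rid[16])
{
    for (int e = 0; e < 8; e++) {
        for (int i = 0; i < 2; i++) {                /* LT/LS = 2 bytes/element */
            if (vidx[e] < 0) {
                rid[e * 2 + i] = 0xFF;               /* RI(-1): MSB stays set   */
            } else {
                int k = vidx[e] / 8;                 /* source block number     */
                int p = vidx[e] % 8;                 /* element within block    */
                /* rid<8>[i] = k * 2^SIB + p * LT/LS + i */
                rid[e * 2 + i] = (uint8_t)(k * 16 + p * 2 + i);
            }
        }
    }
}
```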
The code sequence of the fast GATHER<T, K> is depicted in Figure 6. The destination and the Relocation Hint are symbolized as D and H, and the source values are represented by B0, B1, …, BK–1. In addition, an essential SIMD constant I is used, whose element bit length is min(LS, LB) and whose every element is the integer 2^SIB. Furthermore, the condition K ≤ 2^(min(LS, LB) – SIB – 1) must be satisfied, which gives K ≤ 8 for the above case.
Figure 6: Fast GATHER Code Sequence
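As one concrete rendering of this sequence (Figure 6 itself is symbolic), a GATHER<8*int16, 2> built from the SSE128-Integer selection PSHUFB + PBLENDVB might look like the following sketch, consuming a hint such as the one produced by make_hint_8xint16() above:

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB)   */
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 (PBLENDVB) */

/* Fast GATHER<8*int16, 2>: d is the destination D (elements hinted RI(-1)
   are retained), b0/b1 are the source blocks B0/B1, h is the hint H. */
static __m128i gather_2blocks(__m128i d, __m128i b0, __m128i b1, __m128i h)
{
    const __m128i I = _mm_set1_epi8(16);   /* every element = 2^SIB */

    /* Step k = 0: bytes hinting into block 0 have MSB = 0, so PSHUFB picks
       them from b0 and PBLENDVB keeps every MSB = 1 byte from d. */
    __m128i s = _mm_shuffle_epi8(b0, h);
    d = _mm_blendv_epi8(s, d, h);          /* BLEND: MSB = 1 selects d */

    /* Step k = 1: one SIMD_SUB rebases the hint; indices that pointed into
       block 1 drop into [0, 16) and their MSB becomes 0, while already
       resolved (or retained) bytes go negative and keep MSB = 1. */
    h = _mm_sub_epi8(h, I);
    s = _mm_shuffle_epi8(b1, h);
    d = _mm_blendv_epi8(s, d, h);
    return d;
}
```

Note that a byte hinting into block 1 picks up a junk value at step 0 (its MSB is still 0 there), but step 1 overwrites it with the correct element, so the final result is exact.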
Depending on the SIMD type and the processor's SIMD ISA, SHUFFLE and BLEND should be mapped to specific instructions as optimally as possible. Some existing instruction selections are listed as examples:
| SIMD type | SHUFFLE + BLEND | LS | LB |
|---|---|---|---|
| SSE128 - Integer | PSHUFB + PBLENDVB | 8 | 8 |
| SSE128 - Float | VPERMILPS + BLENDPS | 32 | 32 |
| SSE128 - Double | VPERMILPD + BLENDPD | 64 | 64 |
| AVX256 - Int32/64 | VPERMD + VPBLENDVB | 32 | 8 |
| AVX256 - Float | VPERMPS + BLENDPS | 32 | 32 |
Sparse Matrix Re-organization
In a SpMxV, the two operands, the matrix and the vector, are expressed by M and V respectively. Each row in M is partitioned into several pieces, in units of SIMD Blocks, according to a certain scheme. Non-zero elements in a piece are compacted into one SIMD Block as far as possible. If some non-zero elements remain outside the compaction, the piece's SIMD Blocks containing them should be as few as possible. These leftover elements are moved to a leftover matrix ML. Obviously, M*V is theoretically broken up into (M–ML)*V and ML*V. When a proper partition scheme is adopted, which is especially possible for nearly evenly distributed sparse matrices, ML is intended to be an ultra-sparse matrix, far sparser than M, so that the computation time of ML*V is insignificant in the total. We can apply a standard compression-based algorithm or the like, not covered in this invention, to ML*V. The organization of ML is dictated by its multiplication algorithm, and its storage is separate from M's compacted data, whose organization is detailed in the following.
Given a piece, suppose it contains N+1 SIMD Blocks of type T, expressed by MB0, MB1, …, MBN. We use MB0 as the containing Block, and select and gather the non-zero elements of the other N Blocks into MB0. Without loss of generality, we assume that this gathering-N-Blocks operation is synthesized from one or several intrinsic GATHERs whose 'K' parameters are K1, K2, …, KG, subject to N = K1 + K2 + … + KG. That is to say, the N Blocks are divided into G groups of sizes K1, K2, …, KG, and these groups are gathered into MB0 one by one. To achieve the best performance, we should find a decomposition that minimizes G. This is a classical knapsack-type problem and can be solved with either dynamic programming or a greedy method, as sketched below. As a special case, when an intrinsic GATHER<T, N> exists, G = 1.
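By way of illustration, a greedy decomposition could look like the sketch below; it assumes K = 1 is always in the available set so a decomposition always exists, and dynamic programming can be substituted where the greedy choice is suboptimal for a given K set:

```c
/* Greedily decompose N = K1 + K2 + ... + KG over the K values for which
   intrinsic GATHER<T, K> exists. avail_k must be sorted descending and
   contain 1. Returns G and writes the group sizes into groups[]. */
int decompose_groups(int N, const int *avail_k, int n_avail, int *groups)
{
    int G = 0;
    while (N > 0) {
        for (int i = 0; i < n_avail; i++) {
            if (avail_k[i] <= N) {        /* take the largest K that fits */
                groups[G++] = avail_k[i];
                N -= avail_k[i];
                break;
            }
        }
    }
    return G;
}
```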
The Relocation Hints of those G intrinsic GATHERs are expressed by MH1, MH2, …, MHG. The piece is then replaced with its compacted form consisting of two parts: MB0 after compaction, and ₡(MH1, MH2, …, MHG). The former is called the Data Block. The latter is called the Relocation Block and denotes some combined form of all the Relocation Hints, which is specific to implementation or optimization considerations outside the scope of this paper. The combined form may be affected by alignment enforcement, memory optimization, or other instruction-specific reasons. For example, if a Relocation Index occupies only half a byte, two Relocation Indices from two Relocation Hints can be merged into one byte to reduce memory usage. Ordinarily, a simple way is to lay the Relocation Hints out end to end. Figure 5 also shows how to create the Data Block and Relocation Block for a 3-Block piece. A blank in a SIMD Block denotes a zero-valued element.
Sparse Matrix Partitioning Scheme
To guide the decision on how to partition a row of the matrix, we introduce a cost model. For a piece of N+1 SIMD Blocks, suppose that R (R ≤ N) SIMD Blocks contain non-zero elements that must be moved to ML. The cost of this piece is 1 + N*CostG + R*(1 + CostL), in which (see the sketch after this list):
- 1 is the cost of one SIMD multiplication in the piece.
- CostG (CostG < 1) is the cost of gathering one SIMD Block.
- CostL is the extra effort for a SIMD multiplication in ML, and is always a very small value.
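Expressed in code, the model is a one-liner; the constants below are illustrative placeholders that would be calibrated on the target processor:

```c
#define COST_G 0.4  /* illustrative: cost of gathering one SIMD Block (< 1)      */
#define COST_L 0.1  /* illustrative: extra effort per SIMD multiplication in ML  */

/* Cost of one piece of N+1 SIMD Blocks, R of which spill non-zeros to ML. */
static double piece_cost(int N, int R)
{
    return 1.0 + N * COST_G + R * (1.0 + COST_L);
}
```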
In the following description, one or several adjacent pieces in a row will be referred to as a whole, termed a piece clique. All rows of the matrix share the same partitioning scheme:
- A row is cut into identical primary cliques, except for a possible leftover clique with fewer pieces than a primary one.
- The number of pieces in any clique must not exceed a predefined count limit C (C ≥ 1), which is statically deduced from the characteristics of the sparse matrix's non-zero distribution and is also used to control code complexity in the final implementation.
- The total cost of all pieces in the matrix should be minimal for the given count limit C. To find this optimal scheme, we may rely on an exhaustive search or an improved beam-search algorithm; the beam algorithm will be covered in a separate patent and is not described here.
An example partitioning of a 40-Block row when C = 3 is [4, 5, 2], [4, 5, 2], [4, 5, 2], [2, 5], where '[ ]' denotes a piece clique. For evenly distributed matrices, C = 1 is always chosen.
Gather-Based Matrix-Vector Multiplication
The multiplication between vector V and a row of M is broken up into sub-multiplications on the partitioned pieces. Given a piece in M whose original form has N+1 SIMD Blocks, the corresponding SIMD Blocks in vector V are expressed by VB0, VB1, …, VBN. Previous symbol definitions for a piece carry over to this section.
With the new compacted form, a piece multiplication between [MB0, MB1, …, MBN] and [VB0, VB1, …, VBN] is transformed into operations that gather the effective vector elements into VB0, followed by only one SIMD multiplication on the Data Block and VB0. Figure 7 depicts the pseudo-code of the new multiplication, in which the Data Block is MD, the Relocation Block is MR, and the vector is VB. We refer to a conceptual function EXTRACT_HINT(MR, i) (1 ≤ i ≤ G), which extracts the ith Relocation Hint from MR and is the reverse of the aforementioned ₡(MH1, MH2, …, MHG). To improve performance, this function may keep internal temporaries; for example, the register holding the previous Relocation Hint can be retained to avoid a memory access. The details of this function are out of the scope of this article.
Figure 7: Multiplication For Compacted Form of N+1 SIMD Blocks
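As a minimal concrete instance of Figure 7's flow, take the int16 case with G = 1 and K1 = 2 (a 3-Block piece), so EXTRACT_HINT(MR, 1) degenerates to reading MR itself; the sketch reuses gather_2blocks() from the fast-GATHER example above and uses PMADDWD as the SIMD_MUL:

```c
#include <emmintrin.h>  /* SSE2: _mm_madd_epi16 (PMADDWD), _mm_add_epi32 */

/* One piece multiplication in compacted form: gather the effective vector
   elements into VB0, then a single SIMD_MUL against the Data Block.
   acc accumulates int32 partial dot products across pieces. */
static __m128i piece_mul_3blocks(__m128i acc,
                                 __m128i md,          /* Data Block MD       */
                                 __m128i mr,          /* Relocation Block MR */
                                 const __m128i vb[3]) /* VB0, VB1, VB2       */
{
    /* G = 1: the whole Relocation Block is the single Relocation Hint. */
    __m128i v = gather_2blocks(vb[0], vb[1], vb[2], mr);
    /* One SIMD_MUL replaces three: PMADDWD multiplies int16 pairs and sums
       adjacent products into int32 lanes. */
    return _mm_add_epi32(acc, _mm_madd_epi16(md, v));
}
```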
In the code, the original N SIMD multiplications are replaced by G gathering operations. Therefore, computation acceleration is possible and meaningful only if the former is much more time-consuming than the latter. We should compose efficient intrinsic GATHERs to guarantee this assertion. This is easily done on some processors, such as ARM, where an intrinsic GATHER of SIMD integer type maps directly to a single low-cost hardware instruction. More specifically, the fast GATHER constructed above for X86 also satisfies the assertion: for the ith (1 ≤ i ≤ G) SIMD Block group in the piece, Ki SIMD_MULs are replaced by Ki rather faster BLEND and SHUFFLE pairs, and Ki–1 SIMD_LOADs from the matrix are avoided, replaced by Ki–1 much more CPU-cycle-saving SIMD_SUBs.
Finally, the new SpMxV algorithm can be described by the following flowchart:
Figure 8: New Sparse Matrix-Vector Multiplication
4. Summary
The algorithm can be used to improve sparse matrix-vector and matrix-matrix multiplication in any numerical computation. There are many applications involving semi-sparse matrix computation in High-Performance Computing. Additionally, in popular perceptual-computing low-level engines, especially speech and facial recognition, semi-sparse matrices are very common. Therefore, this invention can be applied to mathematical libraries dedicated to these kinds of recognition engines.