Algorithms to vectorize load groups in x86

The article describes algorithms used to achieve vectorization described in the blog.

Basically the algorithms convert an array of structures to the structure of arrays, where array length is vector length. For example:

We have an array of 32bit float coordinates X, Y, Z. Vector length is 128 bits. Each vector consists of 4 32bit float elements. As we have 3 elements in structure we need 4 XYZ structures (gray, blue, green and orange):

to get 3 vectors of X, Y and Z:

It’s obvious that this works only when number of elements in a vector is greater than number of elements in a structure.

There are at least 2 algorithms to complete this. Let’s call them: Direct permutations and Shift permutations.

1. Direct permutations. Complexity N²+N(N-1). Complexity to get 1 element 2N-1. Where N is number of elements in structure.

To get a vector of X elements we put X into its position (using “shuffle” instructions) in each vector and then unite all elements (using “or” or “blend” instructions):

The complexity to get 1 structure element is N “shuffles” + (N-1) “ors”. If we need vectors for all structure elements the complexity will be N*N “shuffles” + N*(N-1) “ors”.

This algorithm has high parallelism. Potentially we can count all N vectors in parallel.

2. Shift permutations. Complexity N+N²-1. Complexity to get 1 element (N²+N+2)/2. Where N is number of elements in structure.

First we group all structure elements in a vector (using shuffles). Should take N shuffles:

Now we are able to operate with groups. What we need is to put each group one-by-one. We will do this in N-1 steps using “palignr” instructions:

Here we have Z vector ready and therefore need just N-1 “palignr” more:

Finally X and Y vectors are ready.

The complexity of the algorithm is N “shuffles” + (N-1) steps * N “palignr” instructions + (N-1) “pailgnr” instructions at final step: N + (N-1)*N + N-1 = N²+N-1.

The complexity for 1 structure element is N “shuffles” + (N-1 + N-2 + … +1) + 1 (if on final step a vector is not ready we need 1 more “palignr”. Finally 1 element complexity is (N²+N+2)/2.

This algorithm has less complexity than previous but lower parallelism.

3. Power of M case: N = M^K

When number of elements in structure N is power of some natural number N = M^K we can lower complexity of the algorithms to ~N*log(N).

Let A be one of described above algorithms, A(N) its complexity for N element structures. When its complexity could be lowered for the case N = M^K if we K times sequentially apply algorithm for M element structures to each M vectors group. The complexity will be N/M – number of groups to which we should apply the algorithm multiplied by K*A(M). So finally N * K * A(M) / M = N * log_M(N) * A(M) / M.

Below is example for N = 4, M = 2

Generally when we decompose N on multipliers = M * L, we can get final result applying algorithm for M elements to N/M groups and then algorithm for L elements to N/L groups. The complexity of such algorithm will be N * (A(M)/M + A(L)/L).

The algorithms are implemented in GCC compiler (functions vect_permute_load_chain and vect_shift_permute_load_chain): https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/tree-vect-data-refs.c?revision=218160&view=markup&pathrev=218160

Algorithms to vectorize load groups in x86

1. Direct permutations. Complexity N²+N(N-1). Complexity to get 1 element 2N-1. Where N is number of elements in structure.

2. Shift permutations. Complexity N+N²-1. Complexity to get 1 element (N²+N+2)/2. Where N is number of elements in structure.

3. Power of M case: N = M^K

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112

1. Direct permutations. Complexity N2+N*(N-1). Complexity to get 1 element 2*N-1. Where N is number of elements in structure.

2. Shift permutations. Complexity N+N2-1. Complexity to get 1 element (N2+N+2)/2. Where N is number of elements in structure.

3. Power of M case: N = MK

Trending Articles

1. Direct permutations. Complexity N²+N(N-1). Complexity to get 1 element 2N-1. Where N is number of elements in structure.

2. Shift permutations. Complexity N+N²-1. Complexity to get 1 element (N²+N+2)/2. Where N is number of elements in structure.

3. Power of M case: N = M^K