
Efficient SIMD in Animation with SIMD Data Layout Templates (SDLT) and Data Preconditioning


Introduction

To get the most out of SIMD,1 the key is to put in effort beyond just getting the loop to vectorize.2 It is tempting to add a #pragma omp simd3 to your loop, see that the compiler successfully vectorized it, and be satisfied, especially if you get a speedup. However, it is possible that there is no speedup at all, or even a slowdown. In either case, to maximize the benefits of SIMD execution, you often have to rework your algorithm and data layout so that the generated SIMD code is as efficient as possible. As an added bonus, even the scalar (non-vectorized) version of your code will often perform better.

In this paper, we walk through a 3D animation algorithm example to demonstrate, step by step, what was done beyond just adding a “#pragma”. In the process, we cover techniques and methodologies that may benefit your next vectorization endeavors. We also integrate the algorithm with SIMD Data Layout Templates (SDLT), a feature of the Intel® C++ Compiler, to improve data layout and SIMD efficiency. All the source code in this paper is available for download and includes other details not mentioned here.

Background and Problem Statement

Sometimes just getting your loop to vectorize is not enough to improve the performance of your algorithm. The Intel® C++ Compiler may tell you that it “could vectorize but may be inefficient.” Just because a loop can be vectorized does not mean the generated code is more efficient than if the loop were not vectorized. If vectorization does not provide a speedup, it is up to you to investigate why. Getting efficient SIMD code often requires refactoring the data layout and the algorithm. In many instances, the optimizations that benefit SIMD account for the majority of the speedup whether or not the loop is vectorized; and by improving the efficiency of your algorithm, the SIMD speedups will be much greater.

In this paper, we introduce a loop from the example source code plus four other versions of it that represent the changes made to improve SIMD efficiency. Use Figure 1 as a reference for this paper as well as for the downloaded source code. The sections for Versions #0 through #3 form the core of this paper, and for extra credit, Version #4 discusses an advanced SDLT feature for overcoming SIMD conversion overhead.

Algorithm Version
#0: Original
#1: SIMD
#2: SIMD, data sorting
#3: SIMD, data sorting, SDLT container
#4: SIMD, data sorting, SDLT container, sdlt::uniform_soa_over1d

Figure 1: Legend of version number and corresponding description of the set of code changes in available source code. Version numbers also imply the order of changes.

Algorithms requiring data to be gathered and scattered can inhibit performance, for both scalar and SIMD code. And if you have chains of gathers (or scatters), the underperformance is further exacerbated. If your loop contains indirect accesses (or non-unit-strided memory accesses4) as in Figure 2, the compiler will likely generate gather instructions (whether an explicit gather instruction or several instructions emulating a gather). With indirect accesses to large structs, the number of gathers grows proportionally with the number of data primitives. For example, if struct “A” contains 4 doubles, an indirect access to this struct generates 4 gathers. Indirect accesses in your algorithm may be unavoidable, but you should investigate and search for ways to avoid them where possible. Avoiding inefficiencies such as gathers (or scatters) can greatly improve SIMD performance.

Furthermore, data alignment can improve SIMD performance. If your loop is operating on data that is not aligned to SIMD data lanes, your performance may be reduced.


Figure 2: Indirect memory addressing can be either a gather or a scatter, and is where a loop index is used to look up another index. Gathers are indexed loads. Scatters are indexed stores.
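To make the pattern in Figure 2 concrete, below is a minimal sketch (not taken from the example source code; the array and index names are illustrative) of a loop whose indexed load typically compiles to a gather and whose indexed store typically compiles to a scatter.

#include <cstddef>

// Minimal illustration of indirect memory accesses.
// 'lookupIdx' and 'outIdx' hold per-iteration indexes, so the load and
// store below address non-adjacent memory locations.
void gatherScatterExample(const double* values, const int* lookupIdx,
                          double* out, const int* outIdx, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const double v = values[lookupIdx[i]]; // indexed load  -> gather
        out[outIdx[i]] = v * 2.0;              // indexed store -> scatter
    }
}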

We present a simple 3D mesh deformation algorithm example to illustrate some techniques that can be employed to improve efficiency of generated code that benefits both scalar and SIMD. In Figure 3, each Vertex of the 3D mesh has an Attachment that contains data that influences the deformation of that Vertex. Each Attachment indirectly references 4 Joints. The Attachments and Joints are stored in 1D arrays.


Figure 3: Example algorithm of 3D mesh deformation.

Version #0: The Algorithm

In Figure 4, the algorithm loop iterates over an array of “Attachments.” Each Attachment contains 4 Joint index values that index indirectly into an array of “Joints,” and each Joint contains a transformation matrix (3x4) of 12 doubles. So each loop iteration requires gathers for 48 doubles (12 doubles for each of the 4 Joints). This large number of gathers can contribute to slower SIMD performance, so if we can somehow reduce or avoid these gathers, SIMD performance can be greatly improved.

typedef std::vector<Attachment> AttachmentsArray;
AttachmentsArray mAttachments;

void
computeAffectedPositions(std::vector<Vec3d>& aAffectedPositions)
{
    const int count = static_cast<int>(mAttachments.size());
#pragma novector
    for (int i = 0; i < count; ++i) {
        const Attachment a = mAttachments[i];
 
        // Compute affected position
        // NOTE: Requires many gathers (indirect accesses)
        Vec3d deformed0 = a.offset * mJoints[a.index0].xform * a.weight0;
        Vec3d deformed1 = a.offset * mJoints[a.index1].xform * a.weight1;
        Vec3d deformed2 = a.offset * mJoints[a.index2].xform * a.weight2;
        Vec3d deformed3 = a.offset * mJoints[a.index3].xform * a.weight3;
 
        Vec3d deformedPos = deformed0 + deformed1 + deformed2 + deformed3;
 
        // Scatter result
        aAffectedPositions[i] = deformedPos;
    }
}

Figure 4: Version #0: Example algorithm with 48 gathers per loop iteration.

Version #1: SIMD

For Version #1, we vectorize the loop. In our example, the loop vectorizes successfully by simply adding “#pragma omp simd” (see Figure 5) because it already meets the criteria for vectorization (for example, no function calls, single entry and single exit, and straight-line code5). In addition, it follows SDLT’s vectorization strategy, which restricts objects so that the compiler can succeed in privatizing local variables.6 It should be noted, however, that in many cases simply adding the pragma will result in compilation errors or incorrectly generated code.7 Code refactoring is often required to get the loop into a vectorizable state.

#pragma omp simd

Figure 5: Version #1: Change line 8 of Version #0 (see Figure 4) to vectorize the loop.

Figure 6 shows the Intel® C++ Compiler (ICC) Opt-report8 for the loop in Version #1. For an Intel® Advanced Vector Extensions (Intel® AVX)9 build, you can see that the Opt-report states that even though the loop did vectorize, it estimates only a 5 percent speedup. However, in our case the actual performance of Version #1 was 15 percent slower than that of Version #0. Regardless of the estimated speedup reported by the Opt-report, you should test for actual performance.

Furthermore, Figure 6 shows 48 “indirect access” masked indexed loads, one for each of the doubles in the transformation matrices of the 4 Joints. Correspondingly, it generates 48 “indirect access” remarks such as the one in Figure 7. Opt-report remarks should not be ignored; you should investigate their cause and attempt to resolve them.


Figure 6: Version #1: Intel® C++ Compiler Opt-report for loop.


Figure 7: Version #1: Intel® C++ Compiler Opt-report, indirect access remark.

Even though the loop was vectorized, any potential performance gain from SIMD is hindered by the large number of gathers from indirect accesses.

Solution

After successful vectorization, you may or may not get speedups. In either case, getting your loop to vectorize should be only the beginning of the optimization process, not the end. Utilize tools (for example, the Opt-report, assembly code, Intel® VTune™ Amplifier XE, and Intel® Advisor XE) to investigate inefficiencies, and then implement solutions to improve your SIMD code.

Version #2 (Part 1): Data Preconditioning by Sorting for Uniform Data

In our example, the Opt-report reported 48 gathers and corresponding “indirect access” remarks. The indirect access remarks were particularly alarming, since the report was littered with them. Investigating further, we discovered that they corresponded to the 3x4 matrix values of each of the 4 Joints being indirectly accessed from inside the vectorized loop, totaling 48 gathers. We know that gathers (and scatters) can affect performance. But what can we do about them? Are they necessary, or is there a way to avoid them?

For our example, we asked ourselves, “Is there any uniform data being accessed from within the loop body that can be hoisted to outside the loop?” The initial answer was “no,” so then we asked, “Is there a way to refactor the algorithm so that we do have uniform data that is loop-invariant?”


Figure 8: Sorting algorithm data. On the left, the loop iterates over all Attachments where Joint indexes vary. On the right, the Attachments are sorted so that each (inner) sub-loop has the same set of uniform Joint data.

As shown in Figure 8, many individual Attachments share the same set of Joint index values. Sorting the Attachments so that all of the ones sharing the same indexes are grouped together creates an opportunity to loop over a subset of Attachments whose Joints are uniform (loop-invariant) over the sub-loop’s iteration space. This allows hoisting of the Joint data outside the vectorized inner loop, and consequently the inner vectorized loop has no gathers.

void
computeAffectedPositions(std::vector<Vec3d>& aAffectedPositions)
{
    // Here we have a "sorted" array of Attachments, and an array of IndiceSets.
    // Each IndiceSet specifies the range of Attachment-indexes that share a common
    // set of Joint-indexes. So we loop over the IndiceSets (outer loop), and
    // loop over the Attachments in that range (inner SIMD loop).
    const int setCount = static_cast<int>(mIndiceSetArray.size());
    for (int setIndex = 0; setIndex < setCount; ++setIndex) {
        const auto & indiceSet = mIndiceSetArray[setIndex];
        const int startAt = indiceSet.rangeStartAt;
        const int endBefore = indiceSet.rangeEndBefore;

        // Uniform (loop-invariant) data, hoisted outside inner loop
        // NOTE: Avoids indirection, therefore gathers
        const Joint joint0 = mJoints[indiceSet.index0];
        const Joint joint1 = mJoints[indiceSet.index1];
        const Joint joint2 = mJoints[indiceSet.index2];
        const Joint joint3 = mJoints[indiceSet.index3];

#pragma omp simd
        for (int i = startAt; i < endBefore; ++i) {
            const Attachment a = mAttachmentsSorted[i];

            // Compute an affected position
            const Vec3d deformed0 = a.offset * joint0.xform * a.weight0;
            const Vec3d deformed1 = a.offset * joint1.xform * a.weight1;
            const Vec3d deformed2 = a.offset * joint2.xform * a.weight2;
            const Vec3d deformed3 = a.offset * joint3.xform * a.weight3;

            const Vec3d deformedPos = deformed0 + deformed1 + deformed2 + deformed3;

            // Scatter result
            aAffectedPositions[a.workIdIndex] = deformedPos;
        }
    }
}

Figure 9: Version #2: Algorithm refactored to create uniform (loop-invariant) data.

Figure 9 shows the resulting code. The Attachments are stored in a sorted array that groups together elements sharing uniform data, and the original loop becomes an outer loop plus an inner (vectorized) loop that avoids gathers. The array of IndiceSets, mIndiceSetArray, tracks the start and stop indexes of each group in the sorted array; this is why we now have an outer loop and an inner loop. Also, because the data is reordered, we added workIdIndex to track the original location where each result is written. A sketch of this sorting and grouping step is shown below.
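The downloadable source contains the actual sorting code; the following is only a minimal sketch of how the sorted Attachment array and the IndiceSet ranges might be built. The member names (index0–index3, workIdIndex, rangeStartAt, rangeEndBefore) follow Figure 9, while the reduced struct definitions and the grouping logic are illustrative assumptions.

#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

struct Attachment {                        // reduced to the members used here
    int index0, index1, index2, index3;    // Joint indexes
    int workIdIndex;                       // original position, for the scatter
    // ... offset, weight0..weight3, etc. omitted
};

struct IndiceSet {
    int index0, index1, index2, index3;    // Joint indexes shared by the group
    int rangeStartAt, rangeEndBefore;      // [start, end) range in the sorted array
};

static std::tuple<int, int, int, int> jointKey(const Attachment& a)
{
    return std::make_tuple(a.index0, a.index1, a.index2, a.index3);
}

void buildSortedAttachments(std::vector<Attachment> attachments,   // copied, then reordered
                            std::vector<Attachment>& sortedOut,
                            std::vector<IndiceSet>& indiceSetsOut)
{
    // Remember each element's original position so results can be written
    // back in the original order after sorting (Figure 9's workIdIndex).
    for (int i = 0; i < static_cast<int>(attachments.size()); ++i)
        attachments[i].workIdIndex = i;

    // Group Attachments that share the same set of 4 Joint indexes.
    std::sort(attachments.begin(), attachments.end(),
              [](const Attachment& a, const Attachment& b) {
                  return jointKey(a) < jointKey(b);
              });

    // Record the [rangeStartAt, rangeEndBefore) range of each group.
    indiceSetsOut.clear();
    const int count = static_cast<int>(attachments.size());
    for (int i = 0; i < count; /* advanced inside */) {
        IndiceSet set;
        set.index0 = attachments[i].index0;
        set.index1 = attachments[i].index1;
        set.index2 = attachments[i].index2;
        set.index3 = attachments[i].index3;
        set.rangeStartAt = i;
        while (i < count && jointKey(attachments[i]) == jointKey(attachments[set.rangeStartAt]))
            ++i;
        set.rangeEndBefore = i;
        indiceSetsOut.push_back(set);
    }
    sortedOut = std::move(attachments);
}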

Now the Opt-report (see Figure 10) no longer reports 48 indexed masked loads (or gathers) due to the Joints. And it “estimates” a 2.35x speedup for Intel® AVX. In our case, the actual speedup was 2.30x.


Figure 10: Version #2: Intel® C++ Compiler Opt-report of refactored loop with uniform Joint data.

In Figure 10, notice that the Opt-report still reports 8 “gathers” or “masked strided loads.” They result from the array-of-structures memory layout of the mAttachmentsSorted array. Ideally, we want to achieve “unmasked aligned unit stride” loads; later we will demonstrate how to improve this with SDLT. Also notice in the Opt-report (see Figure 10) that we now have 3 scatters. This is because we reordered the input data and thus need to write the results to the output in the correct order (the “Scatter result” statement near the end of Figure 9). But it is better to scatter 3 values than to gather 48 values; we are introducing a small overhead to remove a much larger cost.

Version #2 (Part 2): Data Padding

At this point, the Opt-report estimated a good speedup. However, we have reordered our original, very large Attachments loop into many smaller sub-loops, and we noticed that actual execution of the loop may not be optimal when processing short trip counts. For short trip counts, a significant portion of the execution time might be spent in the Peel or Remainder loop, which are not fully vectorized. Figure 11 provides an example where unaligned data can result in execution in the Peel, Main, or Remainder loop. This happens when either the start or end index (or both) of the iteration space is not a multiple of the SIMD vector lane count. Ideally, we want all of the execution time to be spent in the Main SIMD loop.


Figure 11: Anatomy of a SIMD loop. When the compiler vectorizes a loop, it generates code for 3 types of loops: the Main SIMD loop, the Peel loop, and the Remainder loop. This diagram shows a 4-vector-lane example where the loop iteration space is 3 to 18. The Main loop processes 4 elements at a time, starting at SIMD lane boundary 4 and ending at 15, while the Peel loop processes element 3 and the Remainder loop processes elements 16–18.


Figure 12: Intel® VTune™ Amplifier XE (2016) can be used to see where time is spent in the corresponding assembly code. When inspecting (executed) assembly within Intel VTune Amplifier XE, alongside the scrollbar there are blue horizontal bars that indicate execution time. By identifying the Peel, Main, and Remainder loops in the assembly, you can determine how much time is being spent outside your vectorized Main loop (if any).

Therefore, in addition to sorting the Attachments, SIMD performance can be improved by padding the Attachment data so that each sub-range aligns to a multiple of the SIMD vector lane count. Figure 13 illustrates how padding the data array can allow all execution to occur in the Main SIMD loop, which is ideal. Results may vary, but padding your data is generally a beneficial technique; a sketch of such a padding step follows Figure 13.


Figure 13: Padding data array. In this example of 4 vector lanes, the Attachments are sorted and grouped into two sub-loops. (Left) For sub-loop #1, Attachments 0–3 are processed in the Main loop, while the remaining two elements (4 and 5) are processed by the Remainder loop. For sub-loop #2, with a trip count of only 3, all 3 are processed by the Peel loop. (Right) We padded each sub-loop to align with a multiple of the 4 SIMD lanes, allowing all Attachments to be processed by the vectorized Main loop.
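The padding code is not shown in this paper, so the following is only a minimal sketch (reusing the Attachment and IndiceSet structs from the sorting sketch above) of one way to pad each sorted sub-range up to a multiple of the SIMD lane count. Copying a real entry for the dummy elements and scattering their redundant results into a scratch output slot are illustrative assumptions; the downloadable source may handle this differently.

#include <vector>

// Round n up to the next multiple of laneCount (e.g., 4 lanes for AVX doubles).
inline int roundUpToLanes(int n, int laneCount)
{
    return ((n + laneCount - 1) / laneCount) * laneCount;
}

void padSortedAttachments(std::vector<Attachment>& sorted,
                          std::vector<IndiceSet>& indiceSets,
                          int laneCount,
                          int scratchOutputIndex)  // harmless slot to scatter dummy results into
{
    std::vector<Attachment> padded;
    padded.reserve(sorted.size() + indiceSets.size() * (laneCount - 1));

    for (IndiceSet& set : indiceSets) {
        const int newStart  = static_cast<int>(padded.size());
        const int groupSize = set.rangeEndBefore - set.rangeStartAt;

        // Copy the group's real Attachments.
        for (int i = set.rangeStartAt; i < set.rangeEndBefore; ++i)
            padded.push_back(sorted[i]);

        // Append dummy Attachments until the group length is a lane multiple.
        // Each dummy copies a real entry but scatters its (redundant) result
        // into a scratch slot, so the padded lanes do no harmful work.
        Attachment dummy  = padded.back();
        dummy.workIdIndex = scratchOutputIndex;
        const int paddedSize = roundUpToLanes(groupSize, laneCount);
        for (int i = groupSize; i < paddedSize; ++i)
            padded.push_back(dummy);

        // Update the range so the inner loop begins and ends on lane boundaries.
        set.rangeStartAt   = newStart;
        set.rangeEndBefore = newStart + paddedSize;
    }
    sorted.swap(padded);
}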

Version #3: SDLT Container

Now that we have refactored our algorithm to avoid gathers and significantly improved our SIMD performance, we can leverage SDLT to further improve the efficiency of the generated SIMD code. Until now, all loads have been “masked” and unaligned; ideally, we want unmasked, aligned, unit-stride loads. We utilize SDLT Primitives and Containers to achieve this. SDLT helps the compiler succeed in privatizing local variables in the SIMD loop, meaning each SIMD lane gets a private instance of the variable, and the SDLT Containers and Accessors automatically handle data transformation and alignment.

In Figure 14, the source code shows the changes needed to integrate SDLT. The key changes are to declare the SDLT_PRIMITIVE for the struct AttachmentSorted and then to convert the input data container for the array of Attachments from the std::vector container, which has an Array of Structures (AOS) data layout, to an SDLT container (a sketch of the primitive declaration appears after Figure 14). The programmer uses operator [] on SDLT accessors as if they were C arrays or std::vector. Initially, we used the SDLT Structure of Arrays (SOA) container (sdlt::soa1d_container), but the Array of Structures of Arrays (ASA) container (sdlt::asa1d_container) yielded better performance. It is easy to switch (that is, using a typedef) between SDLT container types to experiment and test for the best performance, and you are encouraged to do so. In Figure 14, we also introduce the SDLT_SIMD_LOOP macros, which are a “Preview” feature in ICC 16.2 (SDLT v2) and are compatible with both ASA and SOA container types.

// typedef sdlt::soa1d_container<AttachmentSorted> AttachmentsSdltContainer;
typedef sdlt::asa1d_container<AttachmentSorted, sdlt::simd_traits<double>::lane_count>    AttachmentsSdltContainer;
AttachmentsSdltContainer mAttachmentsSdlt;

void
computeAffectedPositions(std::vector<Vec3d>& aAffectedPositions)
{
    // SDLT access for inputs
    auto sdltInputs = mAttachmentsSdlt.const_access();

    Vec3d* affectedPos = &aAffectedPositions[0];
    for (int setIndex=0; setIndex < setCount; ++setIndex) {
        // . . .

        // SIMD inner loop
        // The ‘sdlt::asa1d_container’ needs a compound index that identifies the AOS index as
        // well as the SOA lane index, and the macro SDLT_SIMD_LOOP provides a compatible index
        // over ranges that begin/end on SIMD lane count boundaries (because we padded our data).
        // NOTE: sdlt::asa1d_container and SDLT_SIMD_LOOP are “Preview” features in ICC 16.2, SDLT v2.
        SDLT_SIMD_LOOP_BEGIN(index, startAt, endBefore, sdlt::simd_traits<double>::lane_count)
        {
            const AttachmentSorted a = sdltInputs[index];

            // . . .

            affectedPos[a.workIdIndex] = deformedPos;
        }
        SDLT_SIMD_LOOP_END
    }
}

Figure 14: Version #3. Integrate SDLT Container (lines 1–3,7) and Accessor (lines 8 and 19); also, using “Preview” features of SDLT_SIMD_LOOP macros (lines 17 and 23). Only shows differences to Version #2.
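Figure 14 shows the container, accessor, and loop-macro changes but not the SDLT_PRIMITIVE declaration itself. Below is a minimal sketch of what that declaration might look like; the exact member list of AttachmentSorted is an assumption based on how it is used in Figures 9 and 14, and nested member types (such as a Vec3d offset) must themselves be declared as primitives first.

#include <sdlt/primitive.h>   // SDLT_PRIMITIVE macro (header name per the SDLT headers layout)

// Nested member types must be declared as primitives before the outer struct.
struct Vec3d { double x, y, z; /* operators omitted */ };
SDLT_PRIMITIVE(Vec3d, x, y, z)

// Assumed members, based on how AttachmentSorted is used in the loops.
struct AttachmentSorted {
    Vec3d  offset;
    double weight0, weight1, weight2, weight3;
    int    workIdIndex;
};
SDLT_PRIMITIVE(AttachmentSorted, offset, weight0, weight1, weight2, weight3, workIdIndex)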


Figure 15: Version #3: Intel® C++ Compiler Opt-report using SDLT Primitive and Containers.

In Figure 15, the Opt-report estimates a 1.88x speedup for Version #3. Keep in mind that this is just an estimate, not the actual speedup; in our case, the actual speedup was 3.17x. Furthermore, recall that the Opt-report for Version #2 (Figure 10) reported “masked strided” loads. Notice now (Figure 15) that the loads are “unmasked,” “aligned,” and “unit stride.” This is ideal for performance and was facilitated by using an SDLT container to improve data layout and memory access efficiency.

Version #4: sdlt::uniform_soa_over1d

In Version #4 of the algorithm, we discover further opportunities for improvement. Notice that from one sub-loop to the next, the same 3 out of 4 Joints’ data are pulled for uniform access in the inner loop. Also be aware that there is a large overhead in getting uniform data ready for every entry into a SIMD loop, and we incur this cost at every iteration of the outer loop for every piece of uniform data.

For SIMD loops, depending on the SIMD instruction set,10 there is overhead in prepping uniform data before iteration starts. For each uniform value, the compiler may (1) load the scalar value into a register, (2) broadcast the scalar value to all lanes of a SIMD register, and then (3) store the SIMD register to a new location on the stack for use inside the SIMD loop body. For long trip counts, this overhead is easily amortized, but for short trip counts it can hurt performance. In Version #3, every iteration of the outer loop incurs this overhead for the 4 Joints, which means 48 doubles (12 doubles per Joint) in total. A conceptual sketch of this per-value prep work follows.
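To make the three steps concrete, here is a conceptual sketch, written with hand-coded Intel AVX intrinsics rather than actual compiler output, of the per-value prep work on a 4-lane (double) target:

#include <immintrin.h>

// Conceptual per-uniform-value prep before entering a SIMD loop (4 double lanes).
void prepUniformValue(const double* jointValue,
                      double* stackCopy /* 32-byte aligned spill slot */)
{
    const double s      = *jointValue;        // (1) load the scalar value
    const __m256d lanes = _mm256_set1_pd(s);  // (2) broadcast it to all 4 lanes
    _mm256_store_pd(stackCopy, lanes);        // (3) store the widened copy for
                                              //     use inside the SIMD loop body
}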


Figure 16: Finding trip counts: Intel® Advisor XE (2016) has a useful feature that provides trip counts for loop execution. This allows you to easily identify short versus long trip counts.

For a scenario such as this, SDLT provides a way to explicitly manage this SIMD data conversion by letting you determine when to incur the overhead, rather than incurring it automatically. This advanced SDLT feature is sdlt::uniform_soa_over1d. It decouples the SIMD conversion overhead from the loop and puts the user in control of when to incur it. It does this by storing the loop-invariant data in a SIMD-ready format so that SIMD loops can access the data directly, without conversion. It also enables partial updates and reuse of uniform data, which benefits the performance of our example.

To illustrate where the SIMD data conversion overhead occurs and how SDLT can help mitigate it, we provide pseudo-code examples in Figures 17 and 18. Figure 17 shows that the overhead is incurred for every iteration of the outer loop (at line 8) and for every double that is accessed from UniformData (12 doubles). Figure 18 shows how using sdlt::uniform_soa_over1d reduces the overall cost by incurring the overhead only once (at line 6). This is an advanced feature that may provide a benefit in specific scenarios; users should experiment, and results may vary.


Figure 17: Before entering a SIMD loop on line 12, for each uniform value (of UniformData) the compiler may (1) load the scalar value into the register, (2) broadcast the scalar value in the register to all lanes of SIMD register, and then (3) store the SIMD register to a new location on the stack for use inside the SIMD loop body. For long trip counts, the overhead can be easily amortized. But for short trip counts, it can hurt performance.


Figure 18: By using sdlt::uniform_soa_over1d, you can explicitly manage this SIMD data conversion by determining when to incur the overhead, rather than having to automatically incur the cost. This advanced feature of SDLT offers the ability to decouple SIMD conversion overhead from the loop and puts you in control of when to incur this overhead. It does this by storing the loop-invariant data in a SIMD-ready format so that the SIMD loops can directly access the data without conversion.

As a first step to further improve performance for short trip counts, we refactor the algorithm to reuse 3 of the 4 Joints’ data from outer-loop index i to i+1, as illustrated in Figure 19. Using this SDLT feature helps mitigate the accumulated overhead of prepping the SIMD data for the sub-loops.


Figure 19: Version #4: Uniform data for 3 out of 4 Joints can be reused in subsequent sub-loops. Partial updating of uniform data can be implemented to minimize loads (or gathers in the outer loop). And sdlt::uniform_soa_over1d is used to store uniform data in a SIMD-ready format to minimize the SIMD conversion overhead across all sub-loops.

By refactoring our loop to reuse uniform data from one sub-loop to the next, only partial updates are needed; on average, we have to update only a quarter of the uniform data. Thus, we save 75 percent of the overhead involved in setting up the uniform data for use in a SIMD loop. A simplified sketch of this partial-update logic follows.
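As a simplified illustration, using plain local Joint copies rather than the sdlt::uniform_soa_over1d storage and assuming the IndiceSet member names from Figure 9, the partial update might look like this:

// Partial update of the 4 uniform Joints between consecutive sub-loops.
int prevIndex[4] = { -1, -1, -1, -1 };
Joint uniformJoint[4];

for (int setIndex = 0; setIndex < setCount; ++setIndex) {
    const auto& indiceSet = mIndiceSetArray[setIndex];
    const int newIndex[4] = { indiceSet.index0, indiceSet.index1,
                              indiceSet.index2, indiceSet.index3 };

    for (int j = 0; j < 4; ++j) {
        if (newIndex[j] != prevIndex[j]) {            // on average only ~1 of the 4
            uniformJoint[j] = mJoints[newIndex[j]];   // joints actually changes
            prevIndex[j]    = newIndex[j];
        }
    }

    // ... inner SIMD loop uses uniformJoint[0..3] exactly as in Figure 9 ...
}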

Conclusion


Figure 20: Performance speedups for the Intel® Advanced Vector Extensions build on the Intel® Xeon® processor E5-2699 v3 (code-named Haswell)i

Getting your code to vectorize is only the beginning of SIMD speedups; it should not be the end. Thereafter, you should use available resources and tools (for example, optimization reports, Intel VTune Amplifier XE, and Intel Advisor XE) to investigate the efficiency of the generated code. Through analysis, you may discover opportunities that benefit both scalar and SIMD code. You can then employ and experiment with techniques, whether common ones or ones like those presented in this document. You may need to rethink the algorithm and the data layout to ultimately improve the efficiency of your code and, especially, of the generated assembly.

In our example, the biggest payoff came from Version #2, which applied data preconditioning to the algorithm so that we could eliminate all of the indirection (gathers). Version #3 then yielded additional speedups by using SDLT to improve memory accesses with unmasked, aligned, unit-stride loads and by padding data to align with SIMD lane boundaries. And for the scenario of short trip counts, we utilized an advanced SDLT feature to help minimize the overall cost of the uniform data overhead.

References

  1. Github repository to download example source code:
    https://github.intel.com/amwells/animation-simd-sdlt-whitepaper
  2. SDLT documentation (contains some code examples):
    https://software.intel.com/en-us/node/600110
  3. SIGGRAPH 2015: DreamWorks Animation (DWA): How We Achieved a 4x Speedup of Skin Deformation with SIMD:
    http://www.slideshare.net/IntelSoftware/dreamwork-animation-dwa
  4. For “try before buy” evaluation copy of Intel® Parallel Studio XE:
    http://software.intel.com/intel-parallel-studio-xe/try-buy
  5. For free copy of Intel® Parallel Studio XE for qualified students, educators, academic researchers and open source contributors:
    https://software.intel.com/en-us/qualify-for-free-software
  6. Intel® VTune™ Amplifier 2016:
    https://software.intel.com/en-us/intel-vtune-amplifier-xe
  7. Intel® Advisor:
    https://software.intel.com/en-us/intel-advisor-xe

Footnotes

1 Single instruction, multiple data (SIMD) refers to the exploitation of data-level parallelism, where a single instruction processes multiple data elements simultaneously. This is in contrast to conventional “scalar” operations, where a single instruction processes each individual data element.

2 Vectorization is where a computer program is converted from a scalar implementation to a vector (or SIMD) implementation.

3 pragma simd: https://software.intel.com/en-us/node/583427. pragma omp simd: https://software.intel.com/en-us/node/583456

4 Non-unit-stride memory access means that as your loop increments consecutively, you access data from non-adjacent locations in memory. This can add significant overhead to performance. In contrast, accessing memory in a unit-stride (or sequential) fashion can be much more efficient.
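A minimal illustration (assuming a hypothetical double array a with at least 8*n elements):

double sum = 0.0;
for (int i = 0; i < n; ++i)
    sum += a[i];        // unit stride: consecutive addresses, SIMD-friendly
for (int i = 0; i < n; ++i)
    sum += a[8 * i];    // stride of 8: non-adjacent accesses, far less efficient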

5 For reference: https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf

6 SDLT Primitives restrict objects to help the compiler succeed in the privatization of local variables in a SIMD loop, which means that each SIMD lane gets a private instance of the variable. To meet these criteria, the objects must be Plain Old Data (POD) and have inlined object members, no nested arrays, and no virtual functions.

7 In the process of vectorizing, the developer should experiment with various pragmas (for example, simd, omp simd, ivdep, and vector {always [assert]}) and utilize Opt-reports.

8 For the Intel® C++ Compiler 16.0 (2016) on Linux*, we added the command-line options “-qopt-report=5 -qopt-report-phase=vec” to generate the Opt-report (*.optrpt).

9 To generate Intel® Advanced Vector Extensions (Intel® AVX) instructions using the Intel® C++ Compiler 16.0, add the option “-xAVX” to the compile command line.

10 The Intel® AVX-512 instruction set has broadcast load instructions that can reduce the SIMD overhead of prepping uniform data before iteration starts.

i Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit http://www.intel.com/performance.

Configurations: Intel® Xeon® processor E5-2699 v3 (45M Cache, 2.30 GHz). CPUs: Two 18-Core C1, 2.3 GHz. UNCORE: 2.3 GHz. Intel® QuickPath Interconnect: 9.6 GT/sec. RAM: 128 GB, DDR4-2133 MHz (16 x 8 GB). Disks: 7200 RPM SATA disk. 800 GB SSD. Hyper-Threading OFF, Turbo OFF. Red Hat Enterprise Linux* Server release 7.0. 3.10.0-123.el7.x86_64.

