Channel: Intel Developer Zone Articles

Calculating “FLOP” using Intel® Software Development Emulator (Intel® SDE)


Purpose

Floating point operations (FLOP) rate is used widely by the High Performance Computing (HPC) community as a metric for analysis and/or benchmarking purposes. Many HPC nominations (e.g., Gordon Bell) require the FLOP rate be specified for their application submissions.

The methodology described here DOES NOT rely on the Performance Monitoring Unit (PMU) events/counters. This is an alternative software methodology to evaluate FLOP using the Intel® SDE.

Methodology

  • We split the FLOP (Floating point Operations) count into two categories:
    • Unmasked FLOP: For Intel® Architectures that do not support masking feature
    • Unmasked + Masked FLOP: For Intel® Architectures that do support masking feature
      Examples of some Intel® Architectures that do not support the masking feature:

      Processor Name                          Code Name
      2nd gen Intel® Core™ processor family   Sandy Bridge
      3rd gen Intel® Core™ processor family   Ivy Bridge
      4th gen Intel® Core™ processor family   Haswell

      Examples of some Intel® Architectures that do support the masking feature:

      Processor Name                  Code Name
      Intel® Xeon Phi™ coprocessor    Knights Landing
  • There is some debate on what is considered to be a floating point instruction/operation.
  • Provided below is the list of general floating point instructions used in this method: ADD, SUB, MUL, DIV, SQRT, RCP, FMA, FMS, DPP, MAX, MIN (each has many flavors)
  • The high level idea is:
    • Decode every floating point instruction to identify the following:
      • Vector (packed) vs. Scalar
      • Data Type (Single Precision vs. Double Precision)
      • Register Type Used (xmm – 128 bits, ymm – 256 bits, zmm – 512 bits)
      • Masking – masked vs. unmasked instruction
    • Use the above information with its “dynamic execution” count to evaluate the FLOP count for that instruction. Example: vfmadd231pd zmm0, zmm30, zmm1 executed 500 times
      • p – packed instruction (vector), without any mask
      • d – double precision data type (64 bit)
      • zmm – operating on 512 bit registers
      • fma – fused multiply and add (2 floating point operations)

        The FLOP count for the above instruction = 8 * 2 (fma) * 500 (execution count) = 8000 FLOP.
  • You do not need to parse/decode all of the above for every floating point instruction to evaluate the FLOP count for your application.
  • Intel SDE’s instruction mix histogram and dynamic mask profile provide a set of pre-evaluated counters (using the methodology described above + more) that can be used to evaluate the FLOP count on your application.
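The per-instruction calculation above can be sketched in a few lines of Python. This is a minimal sketch; the function name and arguments are illustrative, not part of Intel SDE:

```python
# Sketch of the per-instruction FLOP calculation described above.
# Element count follows from register width / element size; FMA-family
# instructions count as 2 floating point operations per element.

REG_BITS = {"xmm": 128, "ymm": 256, "zmm": 512}
ELEM_BITS = {"single": 32, "double": 64}

def flop_for_instruction(reg, precision, is_fma, exec_count, packed=True):
    """FLOP contributed by one instruction form over all its executions."""
    elements = REG_BITS[reg] // ELEM_BITS[precision] if packed else 1
    ops_per_element = 2 if is_fma else 1
    return elements * ops_per_element * exec_count

# vfmadd231pd zmm0, zmm30, zmm1 executed 500 times:
# 8 elements * 2 (fma) * 500 = 8000 FLOP
print(flop_for_instruction("zmm", "double", True, 500))  # 8000
```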

The next section describes the details on this.

Instructions to Count Unmasked FLOP

  • This is applicable for all Intel architectures (Sandy Bridge, Ivy Bridge, Haswell, Knights Landing, etc.)
  • Obtain the latest version of Intel SDE here.
  • Generate the instruction mix histogram for your application using Intel SDE as follows:
    • sde -<arch> -iform 1 -omix myapp_mix.out -top_blocks 5000 -- ./myapp.exe
      1. <arch> is the architecture that you want to run on.
      2. Compile the binary correctly for the architecture you are running on.
      3. Multi-threaded runs are supported.
        Example:
        sde -knl -iform 1 -omix myapp_knl_mix.out -top_blocks 5000 -- ./myapp.knl.exe
  • In the instruction mix output (e.g., myapp_mix.out), under the “EMIT_GLOBAL_DYNAMIC_STATS” section, check for the following pre-evaluated counters:
    1. *elements_fp_(single/double)_(1/2/4/8/16)
    2. *elements_fp_(single/double)_(8/16)_masked
  • The different counters mean the following:
    elements_fp_single_1 – floating point instructions with single precision and one element (probably scalar) and no mask

    elements_fp_double_4 – floating point instructions with double precision and four elements and no mask (ymm)

    elements_fp_double_8 – floating point instructions with double precision and eight elements and no mask (zmm)

    … elements_fp_single_16_masked – similar to the above, but with masks

    (Note: you will see the mask counts only on architectures + ISA that support masking)
  • The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP by the above counters.
  • “EMIT_GLOBAL_DYNAMIC_STATS” section also prints dynamic counts of every type/flavor of FMA executed in your application. Look for the following:
    VFMADD213SD_XMMdq_XMMdq_XMMdq
    scalar, double precision, on xmm (128 bit) = 1 element
    VFMADD231PD_YMMdq_YMMdq_YMMdq
    packed, double precision, on ymm (256 bit) = 4 elements
    VFMADD132PS_ZMMdq_ZMMdq_ZMMdq
    packed, single precision, on zmm (512 bit) = 16 elements
    ......
    Other flavors of FMA like VFNMSUB132PD_YMMqq_YMMqq_MEMqq, VFNMADD231SD_XMMdq_XMMq_XMM, etc. may also be present.
  • Counting FLOPs
    Step 1
    • For each data type (single/double), use the “dynamic” instruction count corresponding to each of the above counters and multiply by the number of elements (1/2/4/8/16) to get the FLOP count.

      Example:

      Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the below section under EMIT_GLOBAL_DYNAMIC_STATS).

Unmasked FLOP (Double Precision) =
(23513724690 * 1 + 274320019 * 2 + 37317021308 * 4) = ~173.3304 GFLOP
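Step 1 can be automated with a small parser. The sketch below assumes counter lines in the mix output take the form `*elements_fp_double_4 <dynamic count>` (counter name followed by its count); the exact layout may differ between Intel SDE versions, so treat this as a starting point:

```python
import re

# Sum unmasked FLOP from the "*elements_fp_*" counters of an SDE
# instruction-mix output.  Assumes each counter line holds the counter
# name followed by its dynamic count.
COUNTER_RE = re.compile(r"\*elements_fp_(single|double)_(\d+)(_masked)?\s+(\d+)")

def unmasked_flop(mix_text):
    total = {"single": 0, "double": 0}
    for prec, elems, masked, count in COUNTER_RE.findall(mix_text):
        if masked:  # masked counters are handled in the masked-FLOP section
            continue
        total[prec] += int(elems) * int(count)
    return total

# Counter values from the Haswell example above:
sample = """
*elements_fp_double_1  23513724690
*elements_fp_double_2  274320019
*elements_fp_double_4  37317021308
"""
print(unmasked_flop(sample)["double"])  # 173330449960 (~173.3304 GFLOP)
```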

Note/Caveats:

  • The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP (see “Step 2” on how to take that into account).
  • For Intel® AVX-512 (KNL) instruction mix output you may see “*elements_fp*_masked” counters as well. Counting masked FLOP is covered in the next section.
  • Also, the “masked” counters above do not specify the actual mask values, so they cannot be taken into account here anyway.
  • Step 2
    • Taking into account FMA and its flavors

For each FMA flavor, based on data type (single vs. double), packed vs. scalar, and register type as described above + the “dynamic” instruction count corresponding to each FMA, compute and add the corresponding FLOP “just one more time” to the above FLOP count computed in Step 1.

Example:

Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the VFM* section under EMIT_GLOBAL_DYNAMIC_STATS).

VFMADD213PD_XMMdq_XMMdq_XMMdq 1728000

VFMADD213PD_YMMqq_YMMqq_YMMqq 47496488

VFMADD213SD_XMMdq_XMMq_MEMq 825422220

VFMADD213SD_XMMdq_XMMq_XMMq 5733116808

VFMADD231PD_XMMdq_XMMdq_XMMdq 432000

VFMADD231PD_YMMqq_YMMqq_YMMqq 3189961568

VFMADD231SD_XMMdq_XMMq_MEMq 4

VFMADD231SD_XMMdq_XMMq_XMMq 475482133

VFMSUB213PD_YMMqq_YMMqq_YMMqq 1594141168

VFMSUB231PD_YMMqq_YMMqq_YMMqq 47064488

VFMSUB231SD_XMMdq_XMMq_XMMq 1656723

Unmasked FMA FLOP (Double Precision) =
(1728000 * 2 + 47496488 * 4 + 825422220 * 1 + 5733116808 * 1 + 432000 * 2 + 3189961568 * 4 + 4 * 1 + 475482133 * 1 + 1594141168 * 4 + 47064488 * 4 + 1656723 * 1) = ~26.5546 GFLOP
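The FMA accounting in this step can likewise be scripted. The sketch below assumes each FMA line in the mix output is the iform name followed by its dynamic count, and derives the element count from the SS/SD/PS/PD suffix and the destination register width (an assumption that holds for the iform names shown above):

```python
import re

# Register widths used to derive the packed element count per FMA iform.
REG_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}

def fma_elements(iform):
    """Elements per execution, e.g. VFMADD231PD_YMMqq_... -> 4."""
    opcode, dest = iform.split("_")[:2]
    m = re.search(r"(S|P)(S|D)$", opcode)   # scalar/packed, single/double
    packed, prec = m.group(1) == "P", m.group(2)
    if not packed:
        return 1
    elem_bits = 32 if prec == "S" else 64
    return REG_BITS[dest[:3]] // elem_bits

def fma_extra_flop(lines):
    """FLOP to add 'one more time' for the FMA instructions in Step 2."""
    total = 0
    for line in lines:
        iform, count = line.split()
        total += fma_elements(iform) * int(count)
    return total

# FMA dynamic counts from the Haswell example above:
sample = [
    "VFMADD213PD_XMMdq_XMMdq_XMMdq 1728000",
    "VFMADD213PD_YMMqq_YMMqq_YMMqq 47496488",
    "VFMADD213SD_XMMdq_XMMq_MEMq 825422220",
    "VFMADD213SD_XMMdq_XMMq_XMMq 5733116808",
    "VFMADD231PD_XMMdq_XMMdq_XMMdq 432000",
    "VFMADD231PD_YMMqq_YMMqq_YMMqq 3189961568",
    "VFMADD231SD_XMMdq_XMMq_MEMq 4",
    "VFMADD231SD_XMMdq_XMMq_XMMq 475482133",
    "VFMSUB213PD_YMMqq_YMMqq_YMMqq 1594141168",
    "VFMSUB231PD_YMMqq_YMMqq_YMMqq 47064488",
    "VFMSUB231SD_XMMdq_XMMq_XMMq 1656723",
]
print(fma_extra_flop(sample))  # 26554652736 (~26.5546 GFLOP)
```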

Note/Caveats:

  • For Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) will be marked as masked (e.g., VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
  • The next section “Instructions to Count Masked FLOP” will cover that.
    • Step 3: Add the FLOP counted in Step 1 and Step 2.

      Example (for the Advection routine): Total Unmasked FLOP (Double Precision) = 173.3304 + 26.5546 = 199.885 GFLOP
    • If running on an architecture that does not support masking, then you have your total FLOP count (can skip the next section).
    • For floating point operation per second (FLOPS), divide the FLOP count computed using the above method by the application run time measured on appropriate hardware.
  • On another note, the FLOP count of an application will most likely be the same irrespective of the architecture it is run on (unless the compiler generates completely different code that changes the FLOP count for the two binaries, which is very rare). Thus, to find the FLOP count for an application, compute it as described above on Ivy Bridge (or Haswell), which has no hardware masking feature, and use the same count for other architectures (such as Knights Landing). This way you do not have to deal with masking at all while evaluating the FLOP count.
  • But if you still need to evaluate the FLOP count on architecture with masking support, refer to the next section, which describes how to count masked FLOP using the dynamic mask profile feature from Intel SDE.
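The FLOPS conversion in the bullet above is a one-line calculation; the run time used below is a made-up figure for illustration:

```python
# Convert a FLOP count into a FLOPS rate using the measured wall-clock
# run time of the application on the target hardware.
total_flop = 199.885e9   # total unmasked FLOP from the example above
run_time_s = 50.0        # hypothetical measured run time in seconds

gflops = total_flop / run_time_s / 1e9
print(f"{gflops:.4f} GFLOPS")  # 3.9977 GFLOPS
```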

Instructions to Count Masked FLOP

  • Intel SDE has a dynamic mask profile feature that evaluates and prints the number of operations for each executed instruction with a mask.
  • Generate the dynamic mask profile for your application using Intel SDE as follows:
    • sde -<arch> -iform 1 -odyn_mask_profile myapp_msk.out -top_blocks 5000 -- ./myapp.exe
      1. <arch> is the architecture that you want to run on. Note that not all architectures support masking.
      2. Compile the binary correctly for the architecture you are running on.
      3. Multi-threaded runs are supported.
        Example:
        sde -knl -iform 1 -odyn_mask_profile myapp_knl_msk.out -top_blocks 5000 -- ./myapp.knl.exe
  • The dynamic mask profile is an XML output, with a summary table per thread of the different categories of instructions with and without masking and their total instruction and operation count.
  • In addition, the mask profile also prints the dynamic instruction count and operation count per instruction.
  • Summary Table (Dynamic Mask Profile)
    Example: Intel® SDE (Knights Landing) dynamic mask profile output. The summary table columns are:

    Column    Header       Description
    First     mask         Classifies masked vs. unmasked instructions
    Second    cat          Classifies the category of the instruction, e.g., memory (data transfer), sparse (gather/scatter), computational (mask)
    Third     vec-length   Specifies the vector register width
    Fourth    #elements    Specifies the maximum number of elements possible in the vector register, given the vec-length (third column) and element size (fifth column)
    Fifth     element_s    Specifies the element size (data type) in bits (e.g., 64b = 64 bits = 8 bytes)
    Sixth     element_t    Classifies the element type (fp – floating point vs. int – integer)
    Seventh   icount       Total instruction count for each category/type
    Eighth    comp_count   Corresponding computation count for the executed instructions of each category/type
    Ninth     %max-comp    % vector lane utilization for each category/type
    • In your run, look mainly for “masked” instructions with the “mask” category and element_t = fp; only those rows are used for the “masked” FLOP count.
    • The “comp_count” number is basically the masked FLOP count.
      • But again FMA is counted as only one FLOP in the comp_count counter.
      • See the next section on how to take into account masked FMA (to count them as 2 FLOP).
  • Per Instruction Details (Dynamic Mask Profile)
    • In addition to the “summary table” per thread, the dynamic mask profile also prints the computational count on a per instruction basis.
    • For example, a masked “vfmadd213pd” instruction with an execution count of 862280 has a computation count of 5052521. Thus not all executions of this instruction are using all the vector lanes.
    • By contrast, an unmasked “vfmadd213pd” instruction with an execution count of 4000 has a computation count of 32000 (4000 * 8). Thus all executions of this instruction are using all the vector lanes (no mask).
    • Since the summary table accounts for the FMA instructions (and its flavors) as only 1 FLOP, you have to add the computation count for all the masked FMA instructions from the instruction-details (as above) “one more time” to account for 2 FLOP per FMA.
  • Counting Masked FLOP

    Step 1
    • From the summary table add the “comp_count” value from all “masked” instructions with “mask” category and “element_t = fp”.
  • Step 2
    • Parse all the FMA instructions with mask, from per instruction-details and add the “computation-counts” to the above sum evaluated in Step 1 one more time.

      Thus you have the total Masked FLOP count.

      Note/Caveats:
      • As mentioned in the previous section, in Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) are marked as masked (e.g. VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
      • Thus you can use the dynamic mask profile “instruction-details” to evaluate the “computation-count” for all FMA instructions (masked or unmasked – full vectors).
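The two steps can be sketched as follows, assuming the relevant comp_count values have already been extracted from the dynamic mask profile XML (the XML schema itself is not reproduced here; all numbers except 5052521 are hypothetical):

```python
# Masked FLOP = Step 1 + Step 2 as described above.

def masked_flop(summary_fp_mask_comp_counts, fma_comp_counts):
    """Step 1: sum comp_count over all 'masked' summary rows with
    category 'mask' and element_t = fp (these already include each FMA
    once).  Step 2: add the per-instruction comp_count of every masked
    FMA one more time, so each FMA counts as 2 FLOP per element."""
    return sum(summary_fp_mask_comp_counts) + sum(fma_comp_counts)

# 5052521 is the masked vfmadd213pd computation count from the
# per-instruction details discussed above; 9000000 is hypothetical.
print(masked_flop([9000000, 5052521], [5052521]))  # 19105042
```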

Validation

The above methodology may look a bit overwhelming at first, but the reason for such detailed instructions is so that you can write your own simple scripts to parse the above information. We hope to provide the scripts (currently used internally) to evaluate FLOP count as part of the Intel SDE releases in the future.

Below is a summary of the FLOP count validation on some applications.

  • The error margin is the difference between the reference count and the FLOP count evaluated using Intel SDE.
  • The difference can arise from factors such as theoretical evaluation vs. actual code generation, which instructions are counted as FLOP, etc. We have not looked into the details of this difference.
  • As you can see, the error margin is minimal.
Workload     Reference FLOP    MPI Ranks    FLOP Count            Error Margin
             (from NERSC)                   (using Intel® SDE)
MiniFE       5.05435E+12       144          5.18039E+12           1.03
miniGhost    6.55500E+12       96           6.85624E+12           1.05
AMG          1.30418E+12       96           1.43311E+12           1.10
UMT          1.30211E+13       96           1.38806E+13           1.07
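The tabulated error margins are close to the ratio of the Intel SDE count to the reference count; the sketch below recomputes the margins under that interpretation (an assumption, but the results agree with the table to within rounding):

```python
# Recompute the error margin as SDE count / reference count for each
# workload in the validation table.
data = {
    "MiniFE":    (5.05435e12, 5.18039e12),
    "miniGhost": (6.55500e12, 6.85624e12),
    "AMG":       (1.30418e12, 1.43311e12),
    "UMT":       (1.30211e13, 1.38806e13),
}
for name, (ref, sde) in data.items():
    print(f"{name}: {sde / ref:.4f}")
```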

Footnotes:

Masking: Even on Intel® AVX/AVX2 (Ivy Bridge/Haswell) the compiler supports "masking" internally with blends and so forth. Thus, in vectorized loops with conditionals there will be unused computations (e.g., the compiler computes both the true and false branches and then blends them, throwing away the unused parts). This means that the FLOP count will be an overestimate of the useful computation. Arguably the masked version (KNL/SKL) will be more accurate, since the pop count of the mask is exact (assuming the compiler uses masks everywhere).

About the Author:

Karthik Raman is a Software Architect at Intel, focusing mainly on performance analysis and optimization of HPC workloads for the Intel® Xeon Phi™ architecture. He focuses on analyzing for optimal compiler code generation, vectorization, and assessing key architectural features for performance. He helps deliver transformative methods and tools to expose new opportunities and insights. Karthik received his Masters of Science in Electrical Engineering (with specialization in VLSI and Computer Architecture) from the University of Southern California, Los Angeles.

