This is the first article in a series about High Performance Computing with the Intel Xeon Phi. The Intel Xeon Phi is Intel's first commercial product to incorporate the Many Integrated Core (MIC) architecture. In this article I will present the basics of the Xeon Phi architecture, the available programming models and what we can do to measure the performance of micro benchmarks in cycles.
The Intel Xeon Phi
The Intel Xeon Phi is the first commercially available product of the Intel MIC architecture. It was codenamed Intel Knights Corner (KNC) and is the successor of Knights Ferry (KNF). It has 60 cores and runs at a fixed clock speed of 1.053 GHz. It contains 8 GB of GDDR5 random access memory with a bandwidth of 320 GB/s. On the cache side we have 32 KB for instructions and 32 KB for data in L1 (each 8-way, with 64 B line size). The L2 cache consists of 512 KB slices per core, but it can also be thought of as a single fully coherent cache with a total size equal to the sum of the slices. Data can be replicated into the slice of each core that uses it to provide the fastest possible local access, or a single copy can be shared by all cores to provide maximum cache capacity. The L2 cache holds both instructions and data (again 8-way and 64 B line size).
It is important to know that the instruction set of the Intel MIC is quite special. While it is based on x86, it adds a dedicated set of vector instructions that operate on the very wide 512-bit vector unit, which allows us to use SIMD programming very efficiently. The cores also support fused multiply-add (FMA) instructions, which we should try to exploit for peak floating-point throughput. Furthermore, a single hardware thread can only issue instructions every other cycle, so at least two threads per core are needed to keep a core busy.
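To make this a little more concrete, here is a minimal sketch (not from the original article; the function name and flags are just illustrative) of a SAXPY-style loop that the Intel compiler can typically map onto the 512-bit vector unit and FMA instructions when compiled for the MIC, e.g. with icc -mmic -O3 -vec-report2:

// Sketch: a loop that is a natural candidate for 512-bit SIMD and FMA.
// The multiply and add in the loop body can be fused into one FMA per lane.
void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

The vectorization report should confirm that the loop was vectorized; if it is not, alignment hints or a simd pragma may be required.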
The Xeon Phi co-processor utilizes multi-threading on each core as a key to masking the latencies inherent in its in-order micro-architecture. This should not be confused with hyper-threading on Xeon processors, which exists primarily to feed a dynamic execution engine more fully. In HPC workloads hyper-threading can very often be ignored or even turned off without degrading performance. This is not true of the Xeon Phi co-processor's hardware threads: multi-threading of programs must not be ignored there, and the hardware threads cannot be turned off.
The Intel Xeon Phi co-processor offers four hardware threads per core, and its memory and floating-point capabilities are such that a single thread per core generally cannot come close to either limit. Highly tuned kernels may reach saturation with two threads, but most applications need at least three or four active threads per core to access all that the co-processor can offer. For this reason, the number of threads per core should be a tunable parameter in an application, set based on experience with running it; a minimal sketch follows below.
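As a hedged sketch of such a tunable (the environment variable name PHI_THREADS_PER_CORE and the default of four threads are assumptions, not part of the original article), a natively built OpenMP program could configure its thread count like this:

// Sketch: make the number of threads per core a runtime-tunable parameter.
#include <stdlib.h>
#include <omp.h>

#define CORES 60   // physical cores on the co-processor

void configure_threads(void)
{
    const char *env = getenv("PHI_THREADS_PER_CORE");
    int per_core = env ? atoi(env) : 4;      // default to 4 hardware threads per core
    omp_set_num_threads(CORES * per_core);   // e.g. 240 threads for 4 threads per core
}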
Programming the Xeon Phi
Given that we know how to program the Intel Xeon processors in the host system, the question that arises is how to involve the Intel Xeon Phi co-processor in an application. There are two major approaches:
- The processor-centric "offload" model, where the program is viewed as running on the processors and offloading selected work to the co-processors.
- The "native" model where the program runs natively on processors and co-processors which may communicate with each other by various methods.
An MPI program can be structured using either model, e.g. a program with ranks only on the processors may employ offload to access the performance of the co-processors, or a program may run natively with ranks on both processors and co-processors. There is really no machine "mode" in either case, only a programming style that can be intermingled in a single application if desired.
Offload is generally used for finer-grained parallelism and as such usually involves localized changes to a program. MPI is more often used in a coarse-grained manner, which typically requires more scattered changes to a program in order to add the MPI calls. Intel MPI is tuned for both processors and co-processors, so it can exploit hardware features like remote direct memory access (RDMA).
Let's first have a look at the "offload" model.
Programming with the offload model
The offload model for the Intel Xeon Phi is quite rich. The syntax and semantics of the Intel Language Extensions for Offload include capabilities not present in some other offload models such as OpenACC (which is constrained by GPU compatibility). This provides greater interoperability with OpenMP, the ability to manage multiple Xeon Phi cards, and the ability to offload complex program components that the Intel Xeon Phi can process but a GPU could not.
void doMult(int size, float (* restrict A)[size], float (* restrict B)[size], float (* restrict C)[size])
{
    #pragma offload target(mic:MIC_DEV) in(A:length(size*size)) in(B:length(size*size)) out(C:length(size*size))
    {
        // Zero the C matrix
        #pragma omp parallel for default(none) shared(C, size)
        for (int i = 0; i < size; ++i)
            for (int j = 0; j < size; ++j)
                C[i][j] = 0.f;

        // Compute matrix multiplication.
        #pragma omp parallel for default(none) shared(A, B, C, size)
        for (int i = 0; i < size; ++i)
            for (int k = 0; k < size; ++k)
                for (int j = 0; j < size; ++j)
                    C[i][j] += A[i][k] * B[k][j];
    }
}
The offload pragma (as shown above) provides the additional annotations the compiler needs to correctly move data to and from the external Xeon Phi. Note that multiple OpenMP loops can be contained within the scope of the offload directive. We will now briefly discuss the different clauses.
The offload pragma keyword specifies that the following clauses contain information relevant to offloading to the target device. Here target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device. In the example, the target will be a Xeon Phi card associated with the number specified by the constant MIC_DEV.
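Note that MIC_DEV is not predefined by the compiler; it is simply a device index chosen by the programmer. As a hedged sketch (assuming the Intel offload runtime header offload.h and its _Offload_number_of_devices() query are available), it could be set up like this:

// Sketch: pick the first co-processor and check how many cards are present.
// MIC_DEV is our own constant, not something predefined by the compiler.
#include <offload.h>
#include <stdio.h>

#define MIC_DEV 0

void report_devices(void)
{
    int n = _Offload_number_of_devices();   // provided by the Intel offload runtime
    printf("Found %d co-processor(s), offloading to device %d\n", n, MIC_DEV);
}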
The in(var-list : modifiers) clause (the modifiers are optional) explicitly copies data from the host to the co-processor. By default, memory is allocated on the device when entering the scope of the directive and deallocated on exiting it. The alloc_if(condition) and free_if(condition) modifiers can change this behavior.
The out(var-list : modifiers) clause explicitly copies data from the co-processor back to the host. Again by default, the specified memory is deallocated on exiting the scope of the directive; the free_if(condition) modifier can be used to change this default behavior.
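To illustrate these modifiers, here is a minimal, hedged sketch (the array data and its size n are placeholders) that keeps a buffer allocated on the co-processor across two offload regions so it is transferred only once:

// First offload: allocate data on the device, copy it in, but do not free it.
#pragma offload target(mic:MIC_DEV) in(data : length(n) alloc_if(1) free_if(0))
{
    // ... first kernel working on data ...
}

// Second offload: reuse the device copy without transferring it again,
// and free it when this region ends.
#pragma offload target(mic:MIC_DEV) nocopy(data : length(n) alloc_if(0) free_if(1))
{
    // ... second kernel reusing data ...
}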
Finally we also want to have a look at the native model.
Programming with the native model
In total there are three different programming models. We already touched on the offload model briefly, where the host processor runs the application and offloads compute-intensive code and the associated data to the device, as specified by the programmer via pragmas in the source code. Another possibility is to run the code as a traditional OpenMP application on the host. This is rather uninteresting, since we do not need a Xeon Phi card for it. The opposite of this so-called host model is the native programming model: here the entire application runs on the Phi card.
The most comfortable way to use the native model is to think of it as writing a program for a processor that simply has many cores. We do not have to care about being on the Phi card at all. The only difference is that in the end we compile our code for the MIC and execute it on the device. This transfer can either be done by hand (copy everything over with scp, connect via ssh and run it) or by using a utility like micnativeloadex. It is important to note that possible dependencies have to be resolved on the Phi as well, i.e. environment variables might need to be set accordingly for programs to execute successfully.
#include <stdio.h>

void say_hello()
{
#ifdef __MIC__
    printf("Hello, I am MIC!\n");
#else
    printf("We are still on the host!\n");
#endif
}

int main(int argc, char **argv)
{
    say_hello();
    return 0;
}
Now we need to compile this, e.g. with "icc -mmic -o hello-mic hello-mic.c". Executing with the micnativeloadex utility works by calling "micnativeloadex hello-mic".
The difference from building offload applications is that we compile with the -mmic flag, whereas offload applications are compiled with the offload-build flag instead. Programs using the offload programming model also contain the previously introduced pragmas in their source code.
We will learn more about programming for the Xeon Phi in the next articles.
Measuring performance
There are several ways to measure the duration of something in seconds, milliseconds or even microseconds; measuring time in cycles (or nanoseconds), however, is not so straightforward. Of course one could always count instructions in the assembly code, but this requires knowing how long the individual operations take and does not account for any overhead. Additionally we want to gather real-world data, not just theoretical estimates, which is why we need to take measurements with cycle precision.
Luckily there is something built into the CPU: the so-called Time Stamp Counter (TSC), a 64-bit register present on all modern x86 processors. It counts the number of cycles since reset. The corresponding assembly instruction is called RDTSC (RD stands for read) and returns the TSC in EDX:EAX.
The RDTSC instruction has long been an excellent high-resolution, low-overhead way of getting CPU timing information. With the advent of multi-core and hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC can no longer be relied on to provide accurate results in general. In particular, there is no promise that the TSCs of multiple CPUs on a single motherboard are synchronized, even though great care might be taken.
In case of the Intel MIC we can assume all counters to be roughly equal, and the clock rate is fixed anyway. The much bigger problem is that modern CPUs support out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. The solution on recent processors is an instruction called RDTSCP, where the additional P stands for "and processor ID". This version waits until all preceding instructions have executed before reading the counter, so the ordering is guaranteed.
The bad news is that the Intel MIC does not support the RDTSCP instruction; however, we can write an equivalent ourselves. The offsets between the counters of different cores could differ, but since such an offset remains constant we can neglect it in most measurements, or set up the experiment in such a way that the per-core offsets drop out.
In C/C++ we can create the following function:
// Read the Time Stamp Counter, serialized by a preceding CPUID.
static inline unsigned long rdtsc()
{
    unsigned int hi, lo;
    __asm volatile (
        "xorl %%eax, %%eax\n\t"   // select CPUID leaf 0
        "cpuid\n\t"               // serializing instruction: drains the pipeline
        "rdtsc\n\t"               // read the TSC into EDX:EAX
        : "=a"(lo), "=d"(hi)
        :
        : "%ebx", "%ecx");        // CPUID also clobbers EBX and ECX
    return ((unsigned long)hi << 32) | lo;
}
This is a serialized version of the RDTSC instruction: the CPUID instruction executed right before it is serializing, so all preceding instructions have to complete before the time stamp is read. The first thing we might want to measure using this piece of code is the precision, i.e. the overhead, of the time measurement itself.
static int tsc_overhead()
{
    unsigned long t0, t1;
    t0 = rdtsc();
    t1 = rdtsc();
    return (int)(t1 - t0);
}
Usually the overhead is on the order of 100 cycles, which corresponds to roughly 0.1 µs of precision (depending on the clock frequency). So even for an operation that takes just a single cycle we will measure at least on the order of 100 cycles.
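A single short operation should therefore never be timed directly. Here is a minimal sketch (the repetition count and the dummy operation are placeholders, not from the original article) of how the two helpers above can be combined: time many repetitions, subtract the measurement overhead and divide by the repetition count:

#define REPETITIONS 1000000

// Estimate the average cost in cycles of one tiny operation by amortizing
// the RDTSC overhead over many repetitions.
static double cycles_per_iteration(void)
{
    volatile double x = 1.0;                 // volatile so the loop is not optimized away
    int overhead = tsc_overhead();
    unsigned long t0 = rdtsc();
    for (int i = 0; i < REPETITIONS; ++i)
        x *= 1.0000001;                      // the operation under test (placeholder)
    unsigned long t1 = rdtsc();
    return (double)(t1 - t0 - overhead) / REPETITIONS;
}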
Conclusion
In this article we have seen the basic architecture of the Intel Xeon Phi and the available programming models. We also had a short look at measuring execution time. Even though we have access to a high-resolution counter, we still need to keep some rules in mind: we should measure many repetitions of a reasonably small number of instructions to get meaningful numbers, and we have to remember that a constant offset might be present which is unique to each core.
In the next article we will walk through some of the available threading models, namely pThreads / C++11 Threads, Intel Cilk Plus, OpenMP and Intel TBB.