One of the most frequently asked questions in the Cilk Plus forum takes the following form:
I modified my code to use Cilk Plus, and it is not speeding up over the serial version the way I think it should be. What is going wrong?
This two-part series describes 10 common pitfalls that Cilk Plus programmers, both old and new, sometimes fall into. This article (Part 1) covers the first 5 items. The next article (Part 2) will cover the remaining items.
Many Cilk Plus programs that seem to run slower than their serial equivalents may exhibit one of the following problems:
- Parallel regions with insufficient work.
- Parallel regions with insufficient parallelism.
- Code that explicitly depends on the number of workers/cores.
- Tasks that are too fine-grained.
- Different compiler optimizations between serial and parallel versions.
- Reducer inefficiencies.
- Data races or contention from sharing.
- Parallel regions that are memory-bandwidth bound.
- Bugs in timing methodology.
- Nested calls to code parallelized using operating-system threads or OpenMP.
In this article, I will describe many performance problems in the context of a simple parallel loop:
    cilk::reducer_opadd<int> sum(0);
    cilk_for (int i = 0; i < n; ++i) {
        sum += f(i);
    }
    std::printf("Final sum is %d\n", sum.get_value());

Figure 1: A simple parallel loop in Cilk Plus.
Many of the problems and solutions also apply to more general codes
that use cilk_spawn
and cilk_sync
for nested
fork-join parallelism. But it turns out that we can get surprisingly
far in our discussion considering only the parallel loop shown in
Figure 1.
In this article, I use the notation P
as a shorthand
for the number of worker threads being used to execute a Cilk Plus
program. By default, the Cilk Plus runtime automatically sets
P
to be the number of cores detected on a system, unless
the user has overridden the default value (e.g., by setting the
environment variable CILK_NWORKERS
). I will also assume
that P
is no greater than the number of cores on the
system, and that the user program is the only one running on the
system. In fact, the performance of Cilk Plus is robust even in an
environment with other processes running concurrently. If some cores
are spending significant time executing other processes, however, it
does not make sense to expect linear
speedup, i.e., to expect the running time to decrease
linearly as P
increases.
1. Parallel regions with insufficient work
Consider the following question about the loop in Figure 1: for
what values of n
do we expect to see reasonable speedups?
In general, if n
is small and the function f
is simple, then there may be too little work in the entire loop to
make it worthwhile to exploit parallelism using threads.
In any platform, using multiple threads to execute a parallel loop requires some communication between threads to coordinate and distribute work. To have any hope of seeing speedup in a parallel loop, the time it takes to finish the entire loop serially must be sufficiently larger than the time it takes for the threads to communicate and figure out the work they need to do.
Cilk Plus uses a work-stealing scheduler to distribute work. For a
cilk_for
, all the work of the loop conceptually begins on
the worker thread that starts executing the loop. An idle worker
thread may steal part of the work of the
loop from this initial worker and start executing loop iterations.
Similarly, other idle workers may steal work from other workers that
are currently executing part of the loop. If there are enough loop
iterations and iterations are sufficiently expensive, then eventually
all P
workers will manage to steal successfully and help
execute the loop.
On the other hand, a steal can be an expensive operation. If the
time it takes for the initial worker to finish the entire loop is not
much longer than the time it takes for a steal to happen, then we
would expect to see no speedup or even slowdown as P
increases, since there is not enough time for work to get
distributed.
One mistake that new Cilk Plus programmers sometimes make is to
time a parallel loop that does not have enough work. If the loop is
small enough to complete before any steals are likely to occur, then
the time to finish the loop does not decrease as P
increases. In this situation, the fastest running time is likely to
occur with P=1
; more workers should not help, and might
even hurt performance.
One threshold I often use when coding for my multicore desktop machine is the "microsecond test." If the time to execute the entire loop serially is only a few microseconds, then the loop is probably too small to parallelize with multithreading. If the loop requires hundreds of microseconds to execute, then it may benefit from parallelization. The microsecond test derives from a rough estimate of the cost of a steal in Cilk Plus: a successful steal incurs an overhead on the order of microseconds.
Of course, there is a large gray area where it can be hard to make any predictions either way, and there are always exceptions to every rule. Also, steal overheads may vary quite a bit depending on the platform. But with a little bit of experimentation, one can usually figure out a reasonable approximate threshold for a new system.
2. Parallel regions with insufficient parallelism
Suppose we have a parallel loop that has sufficient work (e.g., it runs for tens of milliseconds on a multicore machine), but we are not seeing linear speedup as we add more worker threads. This loop may have insufficient parallelism.
In Cilk Plus, one can actually define the parallelism of a region precisely, as described in the online documentation. The parallelism of a region is the total work of the region divided by the span of a region, i.e., the work done along a longest path (also sometimes called critical path) through the region. For a simple parallel loop where each loop iteration executes serially, the parallelism of the loop is the total work of all iterations, divided by the work of the largest iteration.
If a parallel region has insufficient parallelism, then there is
not enough work to keep all the cores busy, and we won't see linear
speedup as we increase P
.
One way to check whether a parallel region has sufficient parallelism is to use Intel Cilk view. Roughly speaking, to see linear speedups, we want Cilk view to report that a parallel region has "burdened parallelism" that is at least an order of magnitude more than the number of processors available on a machine. But I'll leave a more detailed discussion of Cilk view for another day. :)
3. Code that explicitly depends on the number of workers/cores
This pitfall is almost always a special case of a region having insufficient parallelism. But it is particularly pernicious and deserves special mention.
Arguably, the most common mistake we see among Cilk Plus programmers, beginner and expert alike, is parallelizing a loop in one of the ways in Figure 2.
These codes are terrible because they explicitly depend on
P
. In particular, they attempt to divide the work of the
loop into P
pieces. In my experience, 99% of the time,
writing one of the two loops above leads to suboptimal performance for
a parallel loop in Cilk Plus. These loops have insufficient
parallelism.
Imagine for a moment that n
is perfectly divisible by
P
and that all loop iterations f(i)
perform
the same amount of work. Then the parallelism of this loop is exactly
P
. Shouldn't we expect a P
-fold speedup
running on P
processors? Unfortunately, this logic has
several flaws.
- First, it is not always reasonable to assume that each of the P chunks of the loop will perform the same amount of work. Any imbalance in the work of the f(i)'s can only decrease the parallelism to a value less than P, since the work of the largest of the P chunks is at least the average work of all P chunks.
- Second, even if the user code for f(i) does create perfectly balanced chunks in terms of computation, the underlying system or the Cilk Plus runtime itself may introduce variations in the running time. For instance, worker 1 might execute f(i) faster than worker 2 can execute f(i+1) if worker 1 already had most of the necessary data in its cache, but worker 2 needs to load data from main memory. Also, because Cilk Plus uses a work-stealing scheduler, all the work begins on a single worker, and different workers will likely incur slightly different overheads to steal their chunks.
- Finally, if we are running on a typical desktop machine instead of a server dedicated specifically to running our program, then chances are good that some background OS process will periodically need to run on the machine. We never know when such a process might run, consume one of the cores, and wreck our "perfect" load balancing.
For Cilk Plus, it is better to have an overdecomposition —
divide the loop iterations into many more than P
chunks.
By having more chunks, the runtime is better able to load balance the
work of the loop between cores.
Most of the time, the correct way to write the parallel loop above is to leave the loop in its natural form, i.e.,
cilk_for(int i = 0; i < n; ++i) { f(i); }
By default, the Cilk Plus runtime automatically chooses a
grainsize of roughly n/8P
, a value that we have observed
to usually work pretty well in practice. Alternatively, if the loop
is not scaling well using the default grainsize, then the optimal
setting might be to specify a grainsize of 1, which exposes as much
parallelism as possible. A grainsize of 1 may be optimal if
n
is large and each f(i)
performs a lot of
work. Are there ever reasons to use a grainsize value different from
the default or 1? It can happen, but in my experience, such
situations are quite rare.
The temptation to write the loops in Figure 2 is understandable, especially for programmers who are familiar with tuning multithreaded code on other platforms. But to be an effective Cilk Plus programmer, one must first get out of the habit of writing code to do explicit load balancing, which is the fundamental problem exhibited by the loops in Figure 2.
The analogy I would use is that of cars with manual vs. automatic
transmissions. When I drive a car with an automatic transmission, I
don't expect to be shifting gears back and forth while I am driving.
Even if I could somehow shift gears in automatic mode, the results are
unlikely to be as good as if I had left the system alone to do what it
was designed to do in the first place. Similarly, if I use Cilk Plus,
then I am relying on the runtime system's work-stealing scheduler to
handle load-balancing automatically. Writing a loop that explicitly
divides the work of loops into P
chunks is making an
explicit scheduling decision that is likely only going to interfere
with the runtime's ability to load-balance. Instead, one should rely
upon the default grainsize of a cilk_for
, which usually
automatically achieves a good balance between exposing sufficient
parallelism and keeping overheads low.
This principle applies more generally to code whose parallelism
structure is more complicated than parallel loops, e.g., code with
recursive nested parallelism. It is generally bad practice to write a
Cilk Plus program that explicitly depends on the value of
P
or other scheduling-specific constructs such as worker
thread IDs. For these kinds of codes, there is often a better way
express the same algorithm in Cilk Plus without using such
constructs.
4. Tasks that are too fine-grained
When trying to avoid the previously mentioned pitfalls, we may find ourselves at the opposite extreme, namely a program with tasks that are too fine-grained. In this case, we often discover that the parallel version of our program runs significantly slower than its corresponding serial version.
As an example, consider the two loops shown in Figure 3, where
each call to a function f
only performs a tiny amount of
work. Figure 3 also forces the grainsize of the parallel loop to be
1. The parallel loop may run much more slowly than the serial loop,
even when run using P=1
.
In Cilk Plus, every cilk_spawn
statement incurs some
additional overhead, modifying some runtime data structures to enable
work to be stolen by other processors. This overhead, which I will
refer to as spawn overhead, is paid even
when P=1
, i.e., when we tell the runtime to execute a
Cilk Plus program to use only 1 worker thread. Similarly, every
cilk_for
incurs some spawn overhead, since the runtime
executes a cilk_for
using a divide-and-conquer algorithm
that uses cilk_spawn
statements.
In general, Cilk Plus tries to minimize the spawn overhead, choosing whenever possible to shift any extra bookkeeping into the overhead of a steal. Some spawn overhead is inevitable, however, and for simple codes such as Figure 3, it can be significant.
How can we tell if spawn overhead is a problem in a program? The rule I usually follow when parallelizing a program is to always compare a Cilk Plus program with its serialization. More precisely, compare two versions of your code:
- The normal Cilk Plus program executed with P=1, and
- The serialization of the Cilk Plus program, that is, the code with all Cilk keywords elided.

If the running time of the Cilk Plus program with P=1 is
nearly the same as the time for the serialization, then spawn overhead is
not a significant performance problem. Otherwise, if the Cilk Plus
version runs for much longer than the serialization, we may want to
coarsen the computation, so that the code spawns larger tasks, or
eliminate Cilk Plus keywords altogether from certain functions,
especially simple functions called within inner loops.
The Intel compiler actually provides a flag that makes it easy to generate the serialization, i.e., replacing Cilk Plus keywords with serial equivalents.
Of course, as mentioned earlier, there is a natural tension between making tasks sufficiently coarse-grained, and making sure the program has sufficient parallelism. Which one is more important to optimize for? Usually, I find minimizing spawn overhead to be the more important factor. More specifically, I often find it useful to first limit the spawn overhead in a serial execution below an acceptable threshold, and then expose as much parallelism in the program as possible without going above that threshold. Often, it does not help performance to pay more spawn overhead in a program to increase the available parallelism. Increasing spawn overhead usually means making tasks more fine-grained, which means the work in each stolen task is likely to become even smaller compared to the cost of a steal. Since a steal is already much more expensive than a spawn, having a small task stolen might not improve performance unless that task happens to be on the span/critical path and finishing it also enables more work.
One must be careful in applying this logic, though, because not
every spawn in a Cilk Plus program is guaranteed to be stolen. Also,
one should avoid writing code that explicitly depends on
P
, as discussed in pitfall
#3. But keeping spawn and steal overheads in mind can provide
some helpful intuition when tuning a Cilk Plus program for
performance.
5. Different compiler optimizations between serial and parallel versions
This pitfall is one of the most subtle but important ones to watch out for. It comes up quite often when comparing Cilk Plus programs against other platforms, such as TBB or OpenMP, and especially in the context of vectorization.
When comparing the execution of a Cilk Plus program with
P=1
and the serialization, we might discover that the
serialization runs significantly faster (e.g., an order of magnitude
faster). We might be tempted to blame pitfall #4, and conclude that we have too
much spawn overhead in our program. Sometimes, however, there is
another, sneakier culprit: the compiler might be generating
different optimized code for the Cilk Plus version and the serialization!
When it comes to performance, optimizing compilers can simultaneously be our best friend and worst enemy. A good optimizing compiler may perform code transformations inside a loop that improve performance by an order of magnitude, especially if the compiler is able to vectorize an inner for loop. Unfortunately, this performance improvement may also suddenly disappear if you make a subtle code change that prevents the compiler from employing that optimization. This performance drop may be unavoidable if the code change actually invalidates the compiler's original optimization. Other times, the performance drop may be a bug, in that the compiler is missing an opportunity to optimize when it could have. In either case, the overall conclusion remains the same: when compiler optimizations are involved, "simple" code changes may not be as innocuous as they seem.
In the context of Cilk Plus, it is important to keep in mind that
any function f
that uses Cilk Plus keywords
(cilk_spawn
, cilk_sync
, or
cilk_for
) is compiled differently from the serial
equivalent of f
, since the compiler must perform some
code transformations for the Cilk Plus keywords. This transformation
for Cilk Plus can sometimes interact with other compiler optimizations
in unexpected ways. For example, for the code in Figure 4, with some
older Cilk Plus compilers, we have observed that the Version 2 of the
parallel loop sometimes executes faster than Version 1, because the
compiler manages to more aggressively optimize the serial function
loop_body
.
Often the most significant performance variation one sees in compiled code is whether or not the compiler vectorizes a given loop. If a performance-critical loop vectorizes in the serialization but not in the Cilk Plus program, then one can easily see an order-of-magnitude performance difference.
The Intel compiler provides a -vec-report
flag to
generate a vectorization
report, which one can use to help track down significant
performance differences due to vectorization. For loops that are
vectorizing in the serialization, but not in the Cilk Plus version,
judicious use of #pragma simd
on the loop may help.
This pragma gives the compiler permission to vectorize a loop, even in
cases where auto-vectorization might fail. For some additional
information about vectorization in Intel Cilk Plus, check out the
webinar “Introduction to Vectorization using Intel® Cilk™ Plus
Extensions.”
This pitfall sometimes arises when comparing OpenMP code and Cilk
Plus code. If one sees an OpenMP loop that runs faster on a parallel
machine than a Cilk Plus loop, then one might think that the
difference is due to runtime scheduling. The first thing that one
should check, however, is whether the serial execution times of both
versions are comparable. If they are not, then differences in
compiler optimizations may be the culprit. Semantically, a compiler
should always be able to vectorize a cilk_for
loop if it
can vectorize the equivalent OpenMP loop. In practice, however, there
can sometimes be differences due to artifacts in a given
implementation.
Summary
I've described the first half of our list of performance pitfalls for Cilk Plus programs. Programs that have insufficient parallelism or too much spawn overhead may not speed up relative to their serial equivalents. One can avoid these pitfalls by making sure that parallel regions in a program have enough work, enough parallelism, and sufficiently coarse-grained tasks. To diagnose these issues, one can also use Cilk view to profile parallelism.
Different compiler optimizations between the serial version and a parallel version of a program can also explain a lack of parallel speedup. When measuring performance, it is important to always compare a Cilk Plus program executed using one worker thread to its serialization.
Stay tuned for Part 2, which will describe the remaining pitfalls on our list!
For more information about Intel Cilk Plus, see the website http://cilkplus.org . For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.