One of the most frequently asked questions in the Cilk Plus forum takes the following form:
I modified my code to use Cilk Plus, and it is not speeding up over the serial version the way I think it should be. What is going wrong?
This two-part series describes 10 common pitfalls that Cilk Plus programmers, both old and new, sometimes fall into. This article (Part 1) covers the first 5 items. The next article (Part 2) will cover the remaining items.
Many Cilk Plus programs that seem to run slower than their serial equivalents may exhibit one of the following problems:
- Parallel regions with insufficient work.
- Parallel regions with insufficient parallelism.
- Code that explicitly depends on the number of workers/cores.
- Tasks that are too fine-grained.
- Different compiler optimizations between serial and parallel versions.
- Reducer inefficiencies.
- Data races or contention from sharing.
- Parallel regions that are memory-bandwidth bound.
- Bugs in timing methodology.
- Nested calls to code parallelized using operating-system threads or OpenMP.
In this article, I will describe many performance problems in the context of a simple parallel loop:
    cilk::reducer_opadd<int> sum(0);
    cilk_for (int i = 0; i < n; ++i) {
        sum += f(i);
    }
    std::printf("Final sum is %d\n", sum.get_value());

Figure 1: A simple parallel loop in Cilk Plus.
Many of the problems and solutions also apply to more general codes
that use cilk_spawn
and cilk_sync
for nested
fork-join parallelism. But it turns out that we can get surprisingly
far in our discussion considering only the parallel loop shown in
Figure 1.
In this article, I use the notation P
as a shorthand
for the number of worker threads being used to execute a Cilk Plus
program. By default, the Cilk Plus runtime automatically sets
P
to be the number of cores detected on a system, unless
the user has overridden the default value (e.g., by setting the
environment variable CILK_NWORKERS
). I will also assume
that P
is no greater than the number of cores on the
system, and that the user program is the only one running on the
system. In fact, the performance of Cilk Plus is robust even in an
environment with other processes running concurrently. If some cores
are spending significant time executing other processes, however, it
does not make sense to expect linear
speedup, i.e., to expect the running time to decrease
linearly as P
increases.
1. Parallel regions with insufficient work
Consider the following question about the loop in Figure 1: for
what values of n
do we expect to see reasonable speedups?
In general, if n
is small and the function f
is simple, then there may be too little work in the entire loop to
make it worthwhile to exploit parallelism using threads.
In any platform, using multiple threads to execute a parallel loop requires some communication between threads to coordinate and distribute work. To have any hope of seeing speedup in a parallel loop, the time it takes to finish the entire loop serially must be sufficiently larger than the time it takes for the threads to communicate and figure out the work they need to do.
Cilk Plus uses a work-stealing scheduler to distribute work. For a
cilk_for
, all the work of the loop conceptually begins on
the worker thread that starts executing the loop. An idle worker
thread may steal part of the work of the
loop from this initial worker and start executing loop iterations.
Similarly, other idle workers may steal work from other workers that
are currently executing part of the loop. If there are enough loop
iterations and iterations are sufficiently expensive, then eventually
all P
workers will manage to steal successfully and help
execute the loop.
On the other hand, a steal can be an expensive operation. If the
time it takes for the initial worker to finish the entire loop is not
much longer than the time it takes for a steal to happen, then we
would expect to see no speedup or even slowdown as P
increases, since there is not enough time for work to get
distributed.
One mistake that new Cilk Plus programmers sometimes make is to
time a parallel loop that does not have enough work. If the loop is
small enough to complete before any steals are likely to occur, then
the time to finish the loop does not decrease as P
increases. In this situation, the fastest running time is likely to
occur with P=1
; more workers should not help, and might
even hurt performance.
One threshold I often use when coding for my multicore desktop machine is the "microsecond test." If the time to execute the entire loop serially is only a few microseconds, then the loop is probably too small to parallelize with multithreading. If the loop requires hundreds of microseconds to execute, then it may benefit from parallelization. The microsecond test derives from a rough estimate of the cost of a steal in Cilk Plus: a successful steal incurs an overhead on the order of microseconds.
Of course, there is a large gray area where it can be hard to make any predictions either way, and there are always exceptions to every rule. Also, steal overheads may vary quite a bit depending on the platform. But with a little bit of experimentation, one can usually figure out a reasonable approximate threshold for a new system.
2. Parallel regions with insufficient parallelism
Suppose we have a parallel loop that has sufficient work (e.g., it runs for tens of milliseconds on a multicore machine), but we are not seeing linear speedup as we add more worker threads. This loop may have insufficient parallelism.
In Cilk Plus, one can actually define the parallelism of a region precisely, as described in the online documentation. The parallelism of a region is the total work of the region divided by the span of a region, i.e., the work done along a longest path (also sometimes called critical path) through the region. For a simple parallel loop where each loop iteration executes serially, the parallelism of the loop is the total work of all iterations, divided by the work of the largest iteration.
If a parallel region has insufficient parallelism, then there is
not enough work to keep all the cores busy, and we won't see linear
speedup as we increase P
.
One way to check whether a parallel region has sufficient parallelism is to use Intel Cilk view. Roughly speaking, to see linear speedups, we want Cilk view to report that a parallel region has "burdened parallelism" that is at least an order of magnitude more than the number of processors available on a machine. But I'll leave a more detailed discussion of Cilk view for another day. :)
3. Code that explicitly depends on the number of workers/cores
This pitfall is almost always a special case of a region having insufficient parallelism. But it is particularly pernicious and deserves special mention.
Arguably, the most common mistake we see among Cilk Plus programmers, beginner and expert alike, is parallelizing a loop in one of the ways in Figure 2.
These codes are terrible because they explicitly depend on
P
. In particular, they attempt to divide the work of the
loop into P
pieces. In my experience, 99% of the time,
writing one of the two loops above leads to suboptimal performance for
a parallel loop in Cilk Plus. These loops have insufficient
parallelism.
Imagine for a moment that n
is perfectly divisible by
P
and that all loop iterations f(i)
perform
the same amount of work. Then the parallelism of this loop is exactly
P
. Shouldn't we expect a P
-fold speedup
running on P
processors? Unfortunately, this logic has
several flaws.
- First, it is not always reasonable to assume that each of the P chunks of the loop will perform the same amount of work. Any imbalance in the work of the f(i)'s can only decrease the parallelism to a value less than P, since the work of the largest of the P chunks is at least the average work of all P chunks.
- Second, even if the user code for f(i) does create perfectly balanced chunks in terms of computation, the underlying system or the Cilk Plus runtime itself may introduce variations in the running time. For instance, worker 1 might execute f(i) faster than worker 2 can execute f(i+1) if worker 1 already had most of the necessary data in its cache, but worker 2 needs to load data from main memory. Also, because Cilk Plus uses a work-stealing scheduler, all the work begins on a single worker, and different workers will likely incur slightly different overheads to steal their chunks.
- Finally, if we are running on a typical desktop machine instead of a server dedicated specifically to running our program, then chances are good that some background OS process will periodically need to run on the machine. We never know when such a process might run, consume one of the cores, and wreck our "perfect" load balancing.
For Cilk Plus, it is better to have an overdecomposition —
divide the loop iterations into many more than P
chunks.
By having more chunks, the runtime is better able to load balance the
work of the loop between cores.
Most of the time, the correct way to write the parallel loop above is to leave the loop in its natural form, i.e.,
cilk_for(int i = 0; i < n; ++i) { f(i); }
By default, the Cilk Plus runtime automatically chooses a
grainsize of roughly n/8P
, a value that we have observed
to usually work pretty well in practice. Alternatively, if the loop
is not scaling well using the default grainsize, then the optimal
setting might be to specify a grainsize of 1, which exposes as much
parallelism as possible. A grainsize of 1 may be optimal if
n
is large and each f(i)
performs a lot of
work. Are there ever reasons to use a grainsize value different from
the default or 1? It can happen, but in my experience, such
situations are quite rare.
The temptation to write the loops in Figure 2 is understandable, especially for programmers who are familiar with tuning multithreaded code on other platforms. But to be an effective Cilk Plus programmer, one must first get out of the habit of writing code to do explicit load balancing, which is the fundamental problem exhibited by the loops in Figure 2.
The analogy I would use is that of cars with manual vs. automatic
transmissions. When I drive a car with an automatic transmission, I
don't expect to be shifting gears back and forth while I am driving.
Even if I could somehow shift gears in automatic mode, the results are
unlikely to be as good as if I had left the system alone to do what it
was designed to do in the first place. Similarly, if I use Cilk Plus,
then I am relying on the runtime system's work-stealing scheduler to
handle load-balancing automatically. Writing a loop that explicitly
divides the work of loops into P
chunks is making an
explicit scheduling decision that is likely only going to interfere
with the runtime's ability to load-balance. Instead, one should rely
upon the default grainsize of a cilk_for
, which usually
automatically achieves a good balance between exposing sufficient
parallelism and keeping overheads low.
This principle applies more generally to code whose parallelism
structure is more complicated than parallel loops, e.g., code with
recursive nested parallelism. It is generally bad practice to write a
Cilk Plus program that explicitly depends on the value of
P
or other scheduling-specific constructs such as worker
thread IDs. For these kinds of codes, there is often a better way
express the same algorithm in Cilk Plus without using such
constructs.
4. Tasks that are too fine-grained
When trying to avoid the previously mentioned pitfalls, we may find ourselves at the opposite extreme, namely a program with tasks that are too fine-grained. In this case, we often discover that the parallel version of our program runs significantly slower than its corresponding serial version.
As an example, consider the two loops shown in Figure 3, where
each call to a function f
only performs a tiny amount of
work. Figure 3 also forces the grainsize of the parallel loop to be
1. The parallel loop may run much more slowly than the serial loop,
even when run using P=1
.
In Cilk Plus, every cilk_spawn
statement incurs some
additional overhead, modifying some runtime data structures to enable
work to be stolen by other processors. This overhead, which I will
refer to as spawn overhead, is paid even
when P=1
, i.e., when we tell the runtime to execute a
Cilk Plus program to use only 1 worker thread. Similarly, every
cilk_for
incurs some spawn overhead, since the runtime
executes a cilk_for
using a divide-and-conquer algorithm
that uses cilk_spawn
statements.
In general, Cilk Plus tries to minimize the spawn overhead, choosing whenever possible to shift any extra bookkeeping into the overhead of a steal. Some spawn overhead is inevitable, however, and for simple codes such as Figure 3, it can be significant.
How can we tell if spawn overhead is a problem in a program? The rule I usually follow when parallelizing a program is to always compare a Cilk Plus program with its serialization. More precisely, compare two versions of your code:
- The normal Cilk Plus program executed with P=1, and
- The serialization of the Cilk Plus program, that is, the code with all Cilk keywords elided.

If the running time of the Cilk Plus program with P=1 is
nearly the same as the time for the serialization, then spawn overhead is
not a significant performance problem. Otherwise, if the Cilk Plus
version runs for much longer than the serialization, we may want to
coarsen the computation, so that the code spawns larger tasks, or
eliminate Cilk Plus keywords altogether from certain functions,
especially simple functions called within inner loops.
The Intel compiler actually provides a flag that makes it easy to generate the serialization, i.e., replacing Cilk Plus keywords with serial equivalents.
Of course, as mentioned earlier, there is a natural tension between making tasks sufficiently coarse-grained, and making sure the program has sufficient parallelism. Which one is more important to optimize for? Usually, I find minimizing spawn overhead to be the more important factor. More specifically, I often find it useful to first limit the spawn overhead in a serial execution below an acceptable threshold, and then expose as much parallelism in the program as possible without going above that threshold. Often, it does not help performance to pay more spawn overhead in a program to increase the available parallelism. Increasing spawn overhead usually means making tasks more fine-grained, which means the work in each stolen task is likely to become even smaller compared to the cost of a steal. Since a steal is already much more expensive than a spawn, having a small task stolen might not improve performance unless that task happens to be on the span/critical path and finishing it also enables more work.
One must be careful in applying this logic, though, because not
every spawn in a Cilk Plus program is guaranteed to be stolen. Also,
one should avoid writing code that explicitly depends on
P
, as discussed in pitfall
#3. But keeping spawn and steal overheads in mind can provide
some helpful intuition when tuning a Cilk Plus program for
performance.
5. Different compiler optimizations between serial and parallel versions
This pitfall is one of the most subtle but important ones to watch out for. It comes up quite often when comparing Cilk Plus programs against other platforms, such as TBB or OpenMP, and especially in the context of vectorization.
When comparing the execution of a Cilk Plus program with
P=1
and the serialization, we might discover that the
serialization runs significantly faster (e.g., an order of magnitude
faster). We might be tempted to blame pitfall #4, and conclude that we have too
much spawn overhead in our program. Sometimes, however, there is
another, sneakier culprit: the compiler might be generating
different optimized code for the Cilk Plus version and the serialization!
When it comes to performance, optimizing compilers can simultaneously be our best friend and worst enemy. A good optimizing compiler may perform code transformations inside a loop that improve performance by an order of magnitude, especially if the compiler is able to vectorize an inner for loop. Unfortunately, this performance improvement may also suddenly disappear if you make a subtle code change that prevents the compiler from employing that optimization. This performance drop may be unavoidable if the code change actually invalidates the compiler's original optimization. Other times, the performance drop may be a bug, in that the compiler is missing an opportunity to optimize when it could have. In either case, the overall conclusion remains the same: when compiler optimizations are involved, "simple" code changes may not be as innocuous as they seem.
In the context of Cilk Plus, it is important to keep in mind that
any function f
that uses Cilk Plus keywords
(cilk_spawn
, cilk_sync
, or
cilk_for
) is compiled differently from the serial
equivalent of f
, since the compiler must perform some
code transformations for the Cilk Plus keywords. This transformation
for Cilk Plus can sometimes interact with other compiler optimizations
in unexpected ways. For example, for the code in Figure 4, with some
older Cilk Plus compilers, we have observed that the Version 2 of the
parallel loop sometimes executes faster than Version 1, because the
compiler manages to more aggressively optimize the serial function
loop_body
.
Often the most significant performance variation one sees in compiled code is whether or not the compiler vectorizes a given loop. If a performance-critical loop vectorizes in the serialization but not in the Cilk Plus program, then one can easily see an order-of-magnitude performance difference.
The Intel compiler provides a -vec-report
flag to
generate a vectorization
report, which one can use to help track down significant
performance differences due to vectorization. For loops that are
vectorizing in the serialization, but not in the Cilk Plus version,
judicious use of #pragma simd
on the loop may help.
This pragma gives the compiler permission to vectorize a loop, even in
cases where auto-vectorization might fail. For some additional
information about vectorization in Intel Cilk Plus, check out the
webinar “Introduction to Vectorization using Intel® Cilk™ Plus
Extensions.”
This pitfall sometimes arises when comparing OpenMP code and Cilk
Plus code. If one sees an OpenMP loop that runs faster on a parallel
machine than a Cilk Plus loop, then one might think that the
difference is due to runtime scheduling. The first thing that one
should check, however, is whether the serial execution times of both
versions are comparable. If they are not, then differences in
compiler optimizations may be the culprit. Semantically, a compiler
should always be able to vectorize a cilk_for
loop if it
can vectorize the equivalent OpenMP loop. In practice, however, there
can sometimes be differences due to artifacts in a given
implementation.
Summary
I've described the first half of our list of performance pitfalls for Cilk Plus programs. Programs that have insufficient parallelism or too much spawn overhead may not speed up relative to their serial equivalents. One can avoid these pitfalls by making sure that parallel regions in a program have enough work, enough parallelism, and sufficiently coarse-grained tasks. To diagnose these issues, one can also use Cilk view to profile parallelism.
Different compiler optimizations between the serial version and a parallel version of a program can also explain a lack of parallel speedup. When measuring performance, it is important to always compare a Cilk Plus program executed using one worker thread to its serialization.
Stay tuned for Part 2, which will describe the remaining pitfalls on our list!
For more information about Intel Cilk Plus, see the website http://cilkplus.org . For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.