Introduction
Starting with the Intel® C++ Compiler 18.0 Gold Release, Intel® Cilk™ Plus constructs are marked as deprecated and will eventually be removed in a future release. When it was introduced in 2010, Intel® Cilk™ Plus was a one-stop programming model for enabling both multi-threading and vectorization in programs. The Intel® Cilk™ Plus specification is openly available at https://www.cilkplus.org/specs. With the OpenMP* 4.0 SIMD extensions, both multi-threading and explicit vectorization can be enabled using OpenMP pragmas. C++ programmers who prefer Intel® TBB as the threading model can continue to use TBB for threading and use the OpenMP 4.0 SIMD pragmas for vectorization. Below is a table which quickly summarizes the options for moving from Intel® Cilk™ Plus to either OpenMP or Intel® TBB.
| Migration Cheat Sheet | Intel® Cilk™ Plus | OpenMP | Intel® Threading Building Blocks (Intel® TBB) |
|---|---|---|---|
| Task Parallelism | cilk_spawn | #pragma omp task | task_group t; t.run([](){ }); |
| | cilk_sync | #pragma omp taskwait | t.wait(); |
| Data Parallelism (Threading) | cilk_for | #pragma omp parallel for | tbb::parallel_for() |
| Data Parallelism (Explicit vectorization for loops) | #pragma simd | #pragma omp simd | TBB doesn’t support vectorization; use TBB for threading and the OpenMP pragma #pragma omp simd for vectorization |
| Data Parallelism (Vector functions) | __declspec(vector()) or __attribute__((vector())) | #pragma omp declare simd | #pragma omp declare simd |
| Control the number of threads | __cilkrts_set_param("nworkers", nthreads); or environment variable CILK_NWORKERS=nthreads | omp_set_num_threads(nthreads); or environment variable OMP_NUM_THREADS=nthreads | task_scheduler_init init(nthreads); or global_control c(global_control::max_allowed_parallelism, nthreads); |
Each scenario listed in the table above is elaborated in sequence with simple code snippets; the thread-count row is covered by the short sketch right below.
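Controlling the number of threads does not get its own case later, so here is a minimal sketch of the equivalent calls; the thread count of 4 is just an illustrative value, and the TBB header name assumes a recent TBB release that ships global_control.

#include <omp.h>
#include "tbb/global_control.h"

int main()
{
    // OpenMP: equivalent of __cilkrts_set_param("nworkers", nthreads)
    // (or set the environment variable OMP_NUM_THREADS before running).
    omp_set_num_threads(4);

    // Intel TBB: cap the number of worker threads for the whole program.
    tbb::global_control c(tbb::global_control::max_allowed_parallelism, 4);

    // ... parallel work goes here ...
    return 0;
}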
Case 1: Task Parallelism
Below is a single-threaded implementation that computes the nth Fibonacci number:
int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    a = fibonacci(num - 1);
    b = fibonacci(num - 2);
    return (a + b);
}
In the above kernel, fibonacci(num-1) and fibonacci(num-2) can be computed independently. The procedure used to implement task level parallelism in Intel® Cilk™ Plus is demonstrated below:
#include <cilk/cilk.h>

int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    a = cilk_spawn fibonacci(num - 1);
    b = fibonacci(num - 2);
    cilk_sync;
    return (a + b);
}
cilk_spawn marks the call as a task that may run in parallel with the rest of the function; the Cilk runtime's worker threads can pick it up and execute it. cilk_sync is a barrier: execution does not continue past it until all tasks spawned in the function have completed.
Equivalent OpenMP code:
int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    #pragma omp task shared(a)
    a = fibonacci(num - 1);
    #pragma omp task shared(b)
    b = fibonacci(num - 2);
    #pragma omp taskwait
    return (a + b);
}
The OpenMP standard provides #pragma omp task (equivalent to cilk_spawn), which is used here to annotate the sections of code that should be executed as independent tasks. The shared clause specifies that the variables "a" and "b" need to be shared between threads. The task barrier is implemented using #pragma omp taskwait (equivalent to cilk_sync).
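One point to note when migrating: OpenMP tasks execute in parallel only when they are created inside an active parallel region. Below is a minimal sketch of a call site; the argument 40 is just an illustrative value.

#include <cstdio>

int main()
{
    int result;
    // The tasks spawned inside fibonacci() are executed by this thread team;
    // "single" ensures only one thread makes the top-level call.
    #pragma omp parallel
    {
        #pragma omp single
        result = fibonacci(40);
    }
    printf("%d\n", result);
    return 0;
}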
Equivalent Intel® TBB code:
#include "tbb\task_group.h" int fibonacci(int num) { int a, b; task_group p; if (num == 0) return 0; if (num == 1) return 1; p.run([&] {a = fibonacci(num - 1); }); p.run([&] {b = fibonacci(num - 2); }); p.wait(); return(a+b); }
Intel® TBB allows the developer to create a task group by creating an instance of the task_group class and use its run() member function to create new independent tasks. The wait() member function of task_group implements the barrier.
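For comparison, a call site for the Intel® TBB version needs no special setup, since current TBB versions initialize the task scheduler automatically on first use; the argument 40 below is again just an illustrative value.

#include <cstdio>

int main()
{
    // No enclosing parallel region is required, unlike the OpenMP task version.
    int result = fibonacci(40);
    printf("%d\n", result);
    return 0;
}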
Case 2: Data Parallelism (Threading)
Consider the below loop which does vector addition in serial fashion:
void vector_add(float *a, float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
The above operation is highly data parallel and can be done using multiple threads. Intel® Cilk™ Plus offers the cilk_for keyword for this, as shown below:
void vector_add(float *a, float *b, float *c, int N)
{
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
Annotating the for loop with the cilk_for keyword tells the compiler to generate parallel code for the loop body.
Equivalent OpenMP code:
void vector_add(float *a, float *b, float *c, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
#pragma omp parallel for (the equivalent of cilk_for) is the OpenMP way of expressing the same loop parallelism.
Equivalent Intel® TBB code:
#include "tbb/parallel_for.h" void vector_add(float *a, float *b, float *c, int N) { tbb::parallel_for(0, N, 1, [&](int i) { c[i] = a[i] + b[i]; }); }
tbb::parallel_for is a template function which performs parallel iteration over the above range.
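A commonly used alternative form iterates over a tbb::blocked_range, which lets TBB split the iteration space into chunks for its worker threads. The following is a sketch of the same vector addition in that style, not code from the original article.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void vector_add(float *a, float *b, float *c, int N)
{
    // TBB divides [0, N) into sub-ranges and runs each sub-range on a worker thread.
    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] + b[i];
        });
}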
Case 3: Data Parallelism (Explicit Vectorization)
Consider the below loop, which applies a 3x3 Gaussian filter in serial fashion:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
The above operation is highly data parallel. By default, the compiler targets the innermost loop for vectorization, but here the innermost loops have a trip count of just 3. So it makes sense to vectorize the outer loop (the loop with index j). Intel® Cilk™ Plus provides #pragma simd, which annotates the loop as shown below:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        #pragma simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
Equivalent OpenMP code:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    #pragma omp parallel for private(value)
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        #pragma omp simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
The OpenMP 4.0 SIMD extensions offer #pragma omp simd, which is equivalent to #pragma simd in Intel® Cilk™ Plus and annotates loops that are good candidates for vectorization. In the above case, it also overrides the compiler's default behavior by enabling outer loop vectorization.
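As a practical note when migrating builds, the OpenMP pragmas take effect when OpenMP support is enabled on the command line, and the SIMD constructs can also be enabled on their own without OpenMP threading. The switches below are Intel® C++ Compiler options, and the file name gaussian.cpp is just a placeholder:

icpc -O2 -qopenmp gaussian.cpp        # Linux: full OpenMP (threading + SIMD pragmas)
icpc -O2 -qopenmp-simd gaussian.cpp   # Linux: honor only the OpenMP SIMD pragmas
icl /O2 /Qopenmp gaussian.cpp         # Windows equivalent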
Even if the developer chooses to use Intel® TBB threading for the outer loop, the loop with index j can still be vectorized using the OpenMP pragma, as shown below:
#include "tbb/parallel_for.h"

void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    tbb::parallel_for(1, rows, 1, [&](int i) {
        int index = i*cols+1;
        float value;
        #pragma omp simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    });
    return;
}
Case 4: Data Parallelism (Vector Functions)
Consider the below code:
#include <iostream>
#include <stdlib.h>

__declspec(noinline)
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
Compiling the above loop produces the following vectorization report:
LOOP BEGIN at test.cc(16,2)
   remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.
LOOP END
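For reference, an optimization report like the one above can be requested from the Intel® C++ Compiler with its opt-report switches; the report level 5 below is just one reasonable choice.

icpc -c -O2 -qopt-report=5 -qopt-report-phase=vec test.cc    # Linux
icl /c /O2 /Qopt-report:5 /Qopt-report-phase:vec test.cc     # Windows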
Function calls traditionally take scalar arguments and return scalar values. By enabling SIMD mode for functions, they can accept vector arguments and return vector values, which enables vectorization of loop bodies that invoke them. Intel® Cilk™ Plus offers __declspec(vector(<clauses>)) on Windows and __attribute__((vector(<clauses>))) on Linux; when a function is annotated this way, the compiler generates a vector variant of the function body, as shown below:
#include <iostream>
#include <stdlib.h>

__declspec(noinline, vector(uniform(a,b,c), linear(i:1)))
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
Equivalent OpenMP Code:
#include <iostream>
#include <stdlib.h>

#pragma omp declare simd uniform(a,b,c) linear(i)
__declspec(noinline)
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
The OpenMP 4.0 SIMD constructs support #pragma omp declare simd, which can be used to annotate a function so that the compiler generates a vector variant of it.
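To make it more likely that the compiler vectorizes the calling loop and uses the generated vector variant, the loop at the call site can also be annotated; below is a minimal sketch of just that loop, with the rest of main() unchanged.

// With "declare simd" on vector_add() and "omp simd" here, the compiler can
// vectorize this loop and invoke the vector variant of vector_add().
#pragma omp simd
for (int i = 0; i < 100; i++)
    vector_add(a, b, c, i);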