Introduction
Starting with the Intel® C++ Compiler 18.0 Gold Release, Intel® Cilk™ Plus constructs are marked as deprecated and will eventually be removed in a future release. When it was introduced in 2010, Intel® Cilk™ Plus was a one-stop programming model for enabling both multi-threading and vectorization in programs. The Intel® Cilk™ Plus specification is openly available at https://www.cilkplus.org/specs. With the OpenMP* 4.0 SIMD extensions, both multi-threading and explicit vectorization can be enabled using OpenMP pragmas. C++ programmers who prefer Intel® TBB as the threading model can continue to use TBB for threading and use the OpenMP 4.0 SIMD pragmas for vectorization. Below is a table which quickly summarizes the options for moving from Intel® Cilk™ Plus to either OpenMP or Intel® TBB.
| Migration Cheat Sheet | Intel® Cilk™ Plus | OpenMP | Intel® Threading Building Blocks (Intel® TBB) |
|---|---|---|---|
| Task Parallelism | cilk_spawn | #pragma omp task | task_group t; t.run([](){ }); |
| | cilk_sync | #pragma omp taskwait | t.wait(); |
| Data Parallelism (Threading) | cilk_for | #pragma omp parallel for | tbb::parallel_for() |
| Data Parallelism (Explicit vectorization for loops) | #pragma simd | #pragma omp simd | TBB doesn’t support vectorization; use TBB for threading and the OpenMP pragma #pragma omp simd for vectorization |
| Data Parallelism (Vector functions) | __declspec(vector()) or __attribute__((vector())) | #pragma omp declare simd | #pragma omp declare simd |
| Control the number of threads | __cilkrts_set_param("nworkers", nthreads); or environment variable CILK_NWORKERS=nthreads | omp_set_num_threads(nthreads); or environment variable OMP_NUM_THREADS=nthreads | task_scheduler_init init(nthreads); or global_control c(global_control::max_allowed_parallelism, nthreads); |
Each scenario listed in the table above is elaborated in sequence with simple code snippets; the thread-count row is covered by the short sketch right below.
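Controlling the number of threads does not get its own case later, so here is a minimal sketch of the equivalent calls; the thread count of 4 is just an illustrative value, and the TBB header name assumes a recent TBB release that ships global_control.

#include <omp.h>
#include "tbb/global_control.h"

int main()
{
    // OpenMP: equivalent of __cilkrts_set_param("nworkers", nthreads)
    // (or set the environment variable OMP_NUM_THREADS before running).
    omp_set_num_threads(4);

    // Intel TBB: cap the number of worker threads for the whole program.
    tbb::global_control c(tbb::global_control::max_allowed_parallelism, 4);

    // ... parallel work goes here ...
    return 0;
}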
Case 1: Task Parallelism
Below is a single-threaded implementation that computes the nth Fibonacci number:
int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    a = fibonacci(num - 1);
    b = fibonacci(num - 2);
    return (a + b);
}
In the above kernel, fibonacci(num-1) and fibonacci(num-2) can be computed independently. The procedure used to implement task level parallelism in Intel® Cilk™ Plus is demonstrated below:
#include <cilk/cilk.h>

int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    a = cilk_spawn fibonacci(num - 1);
    b = fibonacci(num - 2);
    cilk_sync;
    return (a + b);
}
cilk_spawn marks the call as a task that may run in parallel with the rest of the function; the Cilk runtime's worker threads can pick it up and execute it. cilk_sync is a barrier: execution does not continue past it until all tasks spawned in the function have completed.
Equivalent OpenMP code:
int fibonacci(int num)
{
    int a, b;
    if (num == 0) return 0;
    if (num == 1) return 1;
    #pragma omp task shared(a)
    a = fibonacci(num - 1);
    #pragma omp task shared(b)
    b = fibonacci(num - 2);
    #pragma omp taskwait
    return (a + b);
}
The OpenMP standard provides #pragma omp task (equivalent to cilk_spawn), which is used here to annotate the sections of code that should be executed as independent tasks. The shared clause specifies that the variables "a" and "b" need to be shared between threads. The task barrier is implemented using #pragma omp taskwait (equivalent to cilk_sync).
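One point to note when migrating: OpenMP tasks execute in parallel only when they are created inside an active parallel region. Below is a minimal sketch of a call site; the argument 40 is just an illustrative value.

#include <cstdio>

int main()
{
    int result;
    // The tasks spawned inside fibonacci() are executed by this thread team;
    // "single" ensures only one thread makes the top-level call.
    #pragma omp parallel
    {
        #pragma omp single
        result = fibonacci(40);
    }
    printf("%d\n", result);
    return 0;
}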
Equivalent Intel® TBB code:
#include "tbb\task_group.h" int fibonacci(int num) { int a, b; task_group p; if (num == 0) return 0; if (num == 1) return 1; p.run([&] {a = fibonacci(num - 1); }); p.run([&] {b = fibonacci(num - 2); }); p.wait(); return(a+b); }
Intel® TBB allows the developer to create a task group by creating an instance of the task_group class and use its run() member function to create new independent tasks. The wait() member function of task_group implements the barrier.
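For comparison, a call site for the Intel® TBB version needs no special setup, since current TBB versions initialize the task scheduler automatically on first use; the argument 40 below is again just an illustrative value.

#include <cstdio>

int main()
{
    // No enclosing parallel region is required, unlike the OpenMP task version.
    int result = fibonacci(40);
    printf("%d\n", result);
    return 0;
}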
Case 2: Data Parallelism (Threading)
Consider the below loop which does vector addition in serial fashion:
void vector_add(float *a, float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
The above operation is highly data parallel and can be done using multiple threads. Intel® Cilk™ Plus offers the cilk_for keyword for this, as shown below:
void vector_add(float *a, float *b, float *c, int N)
{
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
Annotating the for loop with the cilk_for keyword tells the compiler to generate parallel code for the loop body.
Equivalent OpenMP code:
void vector_add(float *a, float *b, float *c, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
#pragma omp parallel for (the equivalent of cilk_for) is the OpenMP way of expressing the same loop parallelism.
Equivalent Intel® TBB code:
#include "tbb/parallel_for.h" void vector_add(float *a, float *b, float *c, int N) { tbb::parallel_for(0, N, 1, [&](int i) { c[i] = a[i] + b[i]; }); }
tbb::parallel_for is a template function which performs parallel iteration over the above range.
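A commonly used alternative form iterates over a tbb::blocked_range, which lets TBB split the iteration space into chunks for its worker threads. The following is a sketch of the same vector addition in that style, not code from the original article.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void vector_add(float *a, float *b, float *c, int N)
{
    // TBB divides [0, N) into sub-ranges and runs each sub-range on a worker thread.
    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] + b[i];
        });
}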
Case 3: Data Parallelism (Explicit Vectorization)
Consider the below loop, which applies a 3x3 Gaussian filter in serial fashion:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
The above operation is highly data parallel. By default, the compiler targets the innermost loop for vectorization, but here the innermost loops have a trip count of just 3. So it makes sense to vectorize the outer loop (the loop with index j). Intel® Cilk™ Plus provides #pragma simd, which annotates the loop as shown below:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        #pragma simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
Equivalent OpenMP code:
void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    float value;
    #pragma omp parallel for private(value)
    for (int i = 1; i < rows; i++) {
        int index = i*cols+1;
        #pragma omp simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    }
    return;
}
The OpenMP 4.0 SIMD extensions offer #pragma omp simd, which is equivalent to #pragma simd in Intel® Cilk™ Plus and annotates loops that are good candidates for vectorization. In the above case, it also overrides the compiler's default behavior by enabling outer loop vectorization.
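As a practical note when migrating builds, the OpenMP pragmas take effect when OpenMP support is enabled on the command line, and the SIMD constructs can also be enabled on their own without OpenMP threading. The switches below are Intel® C++ Compiler options, and the file name gaussian.cpp is just a placeholder:

icpc -O2 -qopenmp gaussian.cpp        # Linux: full OpenMP (threading + SIMD pragmas)
icpc -O2 -qopenmp-simd gaussian.cpp   # Linux: honor only the OpenMP SIMD pragmas
icl /O2 /Qopenmp gaussian.cpp         # Windows equivalent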
Even if the developer chooses to use Intel® TBB threading for the outer loop, the loop with index j can still be vectorized using the OpenMP pragma, as shown below:
#include "tbb/parallel_for.h"

void GaussianFilter(Mat &output, Mat &input)
{
    float filter[3][3] = {{0.04491, 0.12210, 0.04491},
                          {0.12210, 0.33191, 0.12210},
                          {0.04491, 0.12210, 0.04491}};
    int rows = output.rows;
    int cols = output.cols;
    tbb::parallel_for(1, rows, 1, [&](int i) {
        int index = i*cols+1;
        float value;
        #pragma omp simd private(value)
        for (int j = index; j < index+cols; j++) {
            value = 0.0f;
            for (int k1 = -1; k1 <= 1; k1++) {
                int index1 = j+(k1*cols);
                for (int k2 = -1; k2 <= 1; k2++)
                    value += filter[k1+1][k2+1]*input.data[index1+k2];
            }
            output.data[j] = value;
        }
    });
    return;
}
Case 4: Data Parallelism (Vector Functions)
Consider the below code:
#include <iostream>
#include <stdlib.h>

__declspec(noinline)
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
Compiling the above loop produces the following vectorization report:
LOOP BEGIN at test.cc(16,2)
   remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.
LOOP END
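For reference, an optimization report like the one above can be requested from the Intel® C++ Compiler with its opt-report switches; the report level 5 below is just one reasonable choice.

icpc -c -O2 -qopt-report=5 -qopt-report-phase=vec test.cc    # Linux
icl /c /O2 /Qopt-report:5 /Qopt-report-phase:vec test.cc     # Windows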
Function calls traditionally take scalar arguments and return scalar values. By enabling SIMD mode for functions, they can accept vector arguments and return vector values, which enables vectorization of loop bodies that invoke them. Intel® Cilk™ Plus offers __declspec(vector(<clauses>)) on Windows and __attribute__((vector(<clauses>))) on Linux; when a function is annotated this way, the compiler generates a vector variant of the function body, as shown below:
#include <iostream>
#include <stdlib.h>

__declspec(noinline, vector(uniform(a,b,c), linear(i:1)))
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
Equivalent OpenMP Code:
#include <iostream>
#include <stdlib.h>

#pragma omp declare simd uniform(a,b,c) linear(i)
__declspec(noinline)
void vector_add(float *a, float *b, float *c, int i)
{
    c[i] = a[i] + b[i];
}

int main(int argc, char *argv[])
{
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i;
        b[i] = 100 - i;
        c[i] = 0;
    }
    for (int i = 0; i < 100; i++)
        vector_add(a, b, c, i);
    std::cout << c[0] << "\n";
    return 0;
}
The OpenMP 4.0 SIMD constructs support #pragma omp declare simd, which can be used to annotate a function so that the compiler generates a vector variant of it.
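To make it more likely that the compiler vectorizes the calling loop and uses the generated vector variant, the loop at the call site can also be annotated; below is a minimal sketch of just that loop, with the rest of main() unchanged.

// With "declare simd" on vector_add() and "omp simd" here, the compiler can
// vectorize this loop and invoke the vector variant of vector_add().
#pragma omp simd
for (int i = 0; i < 100; i++)
    vector_add(a, b, c, i);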