Channel: Intel Developer Zone Articles

Parallel STL: Parallel Algorithms in Standard Template Library


The C++17 standard enables multi-threading and vectorization for STL algorithms. Intel® C++ Compiler 18.0 Beta and above supports Parallel STL. The beauty of the STL is that data storage (STL containers) is abstracted from the operations performed on the data (STL algorithms) by the concept of STL iterators. Irrespective of which container a developer chooses for an application, most operations, such as traversing the container, sorting, etc., are common. For instance, consider two different STL containers:

std::vector<int> a(N);
std::unordered_map<int, int> b(N);

One STL algorithm, std::for_each, can traverse both containers (note that dereferencing an unordered_map iterator yields a key/value pair):

std::for_each(a.begin(), a.end(), [](auto &c){ std::cout << c << "\n"; });
std::for_each(b.begin(), b.end(), [](auto &c){ std::cout << c.first << " " << c.second << "\n"; });

But the above STL algorithm is single threaded, while modern processors are multi-core with SIMD units in each core. For more efficient use of the silicon, the operation performed by the STL algorithm needs to be multi-threaded and vectorized. The Parallel STL (PSTL) feature in the C++17 standard provides different execution policies which control how an algorithm runs. The implementation of Parallel STL in the Intel Compiler is under the pstl namespace. The four execution policies which the Intel Compiler implements for PSTL are:

Execution Policy            | Description
pstl::execution::seq        | Single threaded, scalar
pstl::execution::unseq      | Single threaded, vectorized
pstl::execution::par        | Multi-threaded, scalar
pstl::execution::par_unseq  | Multi-threaded, vectorized

To evaluate the Intel Compiler’s PSTL implementation, please refer to the Getting Started article. Parallel STL is explained in more detail, with simple examples, in an Intel Parallel Universe Magazine article.

Does traditional STL coexist with Parallel STL implementation?

Yes, they coexist. The traditional STL implementation is in std namespace while the Parallel STL implementation is in pstl namespace.

Does Intel Compiler’s PSTL implementation coexist with Microsoft’s or GNU’s PSTL implementation?

Yes, they coexist. Microsoft’s PSTL implementation is under the std::experimental::parallel namespace and GNU’s PSTL implementation is under the __gnu_parallel namespace.

Which threading model is used for Parallelism?

Intel Compiler’s PSTL implementation uses the Intel® Threading Building Blocks (Intel® TBB) runtime, GNU’s PSTL implementation uses the OpenMP runtime, and Microsoft’s PSTL implementation uses native threads.

PSTL Threading:

Consider a simple sorting example using std::sort() to start with:

#include<iostream>
#include<algorithm>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
                myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the STL algorithm
        timer_start = std::chrono::system_clock::now();
        std::sort(myvec1.begin(), myvec1.end(), [](int j, int k){ return j > k; });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Standard STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Enabling multi-threading using Parallel STL:

#include<iostream>
#include<pstl/execution>
#include<pstl/algorithm>
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include<tbb/global_control.h>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        tbb::global_control c(tbb::global_control::max_allowed_parallelism, 2);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
                myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the Parallel STL algorithm
        timer_start = std::chrono::system_clock::now();
        std::sort(pstl::execution::par, myvec1.begin(), myvec1.end(), [](int j, int k){ return j > k; });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Parallel STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Does just enabling multi-threading in an STL algorithm work in every case?

Not necessarily. In the above case, the vector is broken down into mutually exclusive chunks which are handed to individual threads for sorting, so no two threads ever access the same location. But that need not be the case with every algorithm. For instance, consider a Naïve Bayes supervised classification algorithm, which is based on Bayes' theorem. The example involves training the program with a census dataset from the UCI Machine Learning Repository. The dataset has 14 attributes for each person (such as age, sex, capital gain, etc.), and each person's annual salary is tagged as either <=50K or >50K. Each attribute can have multiple values. The data structure used to hold the learnt model is:

std::vector<std::vector<std::unordered_map<std::string, int> > >

The outermost vector is of size two, one for each annual salary category (<=50K and >50K). The inner vector holds the 14 attributes, and for each attribute the unordered_map stores the attribute value and its number of occurrences as a <key, value> pair. The main computation happens in the loop below:

	for_each(dataset.begin(), dataset.end(), [&](std::string &s) {
		size_t start = 0, end = 0, ques = 0, index;
		for (size_t num_of_cols = 0; num_of_cols < 15; num_of_cols++) {
			end = s.substr(start, s.length()).find(',');
			std::string newString = s.substr(start, end);
			ques = newString.find('?');
			if (ques != std::string::npos) {
				start = start + end + 1;
				continue;
			}
			//Windowing for certain numeric fields with continuous values
			switch (num_of_cols) {
			case 0:
				if (newString.find("<=50K") != std::string::npos)
					index = 0;
				else
					index = 1;
				break;
			case 1:
				newString = std::to_string((atoi(newString.c_str()) / 10) * 10);
				break;
			case 3:
				newString = std::to_string((atoi(newString.c_str()) / 10000) * 10000);
				break;
			case 11:
			case 12:
				newString = std::to_string((atoi(newString.c_str()) / 1000) * 1000);
				break;
			case 13:
				newString = std::to_string((atoi(newString.c_str()) / 10) * 10);
				break;
			default: break;
			}
			std::pair<std::unordered_map<std::string, int>::iterator, bool> p =
				(iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
			if (!p.second)
				p.first->second++;
			start = start + end + 1;
		}
	});

To enable parallelism using Parallel STL, try adding the execution policy pstl::execution::par for the above for_each algorithm as shown below:

for_each(pstl::execution::par, dataset.begin(), dataset.end(), [&](std::string &s) {
….
});

When executing this program in multi-threaded mode, it crashes. Debugging the crash points clearly to the insert() of the STL unordered_map.

The insert() of STL unordered_map is not thread safe, so when two threads try to concurrently insert values into the unordered_map the program errors out. Intel® TBB offers a thread-safe equivalent of STL unordered_map, tbb::concurrent_unordered_map, which supports the same interface the STL container offers. Modify the data structure in the program to replace unordered_map with concurrent_unordered_map:

From:

std::vector<std::vector<std::unordered_map<std::string, int> > >

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >

With the above modification, multiple threads can concurrently insert into the map, and the program demonstrates a 2x performance improvement with 2 TBB threads. But checking the output file reveals that, although the program ran to completion with good performance, the frequencies of the individual attribute values are recorded incorrectly. This is because incrementing the frequency of an attribute value when the record already exists in the map is not atomic, which results in a data race. This can be fixed by changing the data structure as follows:

From:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >
.
.
std::pair<std::unordered_map<std::string, int>::iterator, bool> p =
	(iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
if (!p.second)
	p.first->second++;

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > > >
.
.
auto p = (iter[index].begin())[num_of_cols].insert(
	std::make_pair(newString, tbb::atomic<int>()));
// the new entry value-initializes to 0, so increment unconditionally
p.first->second.fetch_and_increment();

With these changes, the code runs faster with multiple threads without compromising accuracy. The key lesson from this exercise is to watch out for the need for concurrent containers and for potential data races.

PSTL Vectorization:

Consider the example of searching for an integer in a std::vector:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 999999999
#ifdef PSTL
#include"pstl/algorithm"
#include"pstl/execution"
#endif
#ifdef GNU_PSTL
#include"parallel/algorithm"
#include<omp.h>
#endif
using namespace std;
std::vector<long long>::iterator mysearch(long long n1, std::vector<long long> &n2) {
#ifdef PSTL
        return find(pstl::execution::unseq, n2.begin(), n2.end(), n1);
#elif defined(GNU_PSTL)
        return __gnu_parallel::find(n2.begin(), n2.end(), n1);
#else
        return find(n2.begin(), n2.end(), n1);
#endif
}
int main(int argc, char *argv[]){
        long long num_to_search;
        if(argc < 2)
        {
                std::cout<<"Enter the number to be searched as a command line argument, range [0 - 999999999]\n";
                return 0;
        }
        else
                num_to_search = atoll(argv[1]);
        static long long p;
        std::vector<long long> myvec(N1);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), [&]() { return p++; });
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #else
                generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        auto result = mysearch(num_to_search, myvec);
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by find algorithm is "<<elapsed_seconds.count()<<"\n";
        if(result != myvec.end())
                std::cout<<"Found the element "<<*result<<", p = "<<p<<"\n";
        else
                std::cout<<"Element not found, p = "<<p<<"\n";
        return 0;
}

The Intel Compiler auto-vectorizes the code, and the vectorized code performs better than the GCC 5.1 generated binary. Please download the attached code samples to evaluate and compare the PSTL implementations provided by different compiler vendors. The PSTL-specific code path and the auto-vectorized code path perform the same in this case. In general, the Intel Compiler has very good vectorization heuristics to identify different code patterns and vectorize them when it is safe to do so. For instance, consider the histogram loop pattern shown below:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 9999999
#ifdef PSTL
#include"pstl/algorithm"
#endif
#ifdef GNU_PSTL
#include<parallel/algorithm>
#include<omp.h>
#endif
using namespace std;
int main(int argc, char *argv[]){
        std::vector<long long> hist(10);
        fill(hist.begin(), hist.end(), 0);
        std::vector<long long> myvec(N1);
        std::cout<<"---------------------\n";
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), std::rand);
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), std::rand);
        #else
                generate(myvec.begin(), myvec.end(), std::rand);
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                for_each(pstl::execution::unseq, myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #elif defined(GNU_PSTL)
                __gnu_parallel::for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #else
                for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by for_each algorithm is "<<elapsed_seconds.count()<<"\n";
        for_each(hist.begin(), hist.end(), [&](const long long &q){ std::cout<<q<<"\n"; });
        return 0;
}

When the pstl::execution::unseq execution policy is used with the Intel Compiler, vectorization is forced using #pragma omp simd (the SIMD pragma from OpenMP 4.0). One important point to remember is that with #pragma omp simd, the compiler's vectorization heuristics do not perform the routine data-dependency and data-flow analysis; the compiler simply follows the user's directive and vectorizes. So always exercise caution when using it. For instance, in the above program, (p%4) results in the values 0,1,2,3 in the SIMD register when targeting SSE (no duplicate values), but when targeting AVX the SIMD register holds 0,1,2,3,0,1,2,3. Executing hist[p%4] in vectorized mode then creates a data race within the register. The Intel® AVX-512 instruction set supports conflict detection instructions (VPCONFLICTD/VPCONFLICTQ) which detect such conflicts (duplicate indices) in a SIMD register. Please try the above example with the attached build script and compare the performance of the Intel Compiler generated code and the GNU compiler generated code.

