Channel: Intel Developer Zone Articles

Parallel STL: Parallel Algorithms in Standard Template Library


The C++17 standard enables multi-threading and vectorization for STL algorithms. Intel® C++ Compiler 18.0 Beta and above supports Parallel STL. The beauty of the STL is that data storage (STL containers) is abstracted from the operations performed on the data (STL algorithms) by the concept of STL iterators. Irrespective of which container a developer chooses for an application, most operations, such as traversing the container, sorting, etc., are common. For instance, consider two different STL containers:

std::vector<int> a(N);
std::unordered_map<int, int> b(N);

One STL algorithm, std::for_each, can traverse both containers (note that dereferencing an unordered_map iterator yields a key/value pair):

std::for_each(a.begin(), a.end(), [](auto &c){ std::cout << c << "\n"; });
std::for_each(b.begin(), b.end(), [](auto &c){ std::cout << c.first << " " << c.second << "\n"; });

But the above STL algorithm is single threaded, while modern processors are multi-core with SIMD units in each core. For more efficient use of the silicon, the operation performed by the STL algorithm needs to be multi-threaded and vectorized. The Parallel STL (PSTL) feature in the C++17 standard provides different execution policies which control how an algorithm runs. The implementation of Parallel STL in the Intel Compiler is under the pstl namespace. The four execution policies which the Intel Compiler implements for PSTL are:

Execution Policy            | Description
pstl::execution::seq        | Single threaded, scalar
pstl::execution::unseq      | Single threaded, vectorized
pstl::execution::par        | Multi-threaded, scalar
pstl::execution::par_unseq  | Multi-threaded, vectorized

To evaluate the Intel Compiler’s PSTL implementation, please refer to the Getting Started article. Parallel STL is explained in more detail, with simple examples, in an Intel Parallel Universe Magazine article.

Does traditional STL coexist with Parallel STL implementation?

Yes, they coexist. The traditional STL implementation is in std namespace while the Parallel STL implementation is in pstl namespace.

Does Intel Compiler’s PSTL implementation coexist with Microsoft’s or GNU’s PSTL implementation?

Yes, they coexist. Microsoft’s PSTL implementation is under the std::experimental::parallel namespace and GNU’s PSTL implementation is under the __gnu_parallel namespace.

Which threading model is used for Parallelism?

Intel Compiler’s PSTL implementation uses the Intel® Threading Building Blocks (Intel® TBB) runtime, GNU’s PSTL implementation uses the OpenMP runtime, and Microsoft’s PSTL implementation uses native threads.

PSTL Threading:

Consider a simple sorting example using std::sort() to start with:

#include<iostream>
#include<algorithm>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
                myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the STL algorithm
        timer_start = std::chrono::system_clock::now();
        std::sort(myvec1.begin(), myvec1.end(), [](int j, int k){ return j > k; });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Standard STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Enabling multi-threading using Parallel STL:

#include<iostream>
#include<pstl/execution>
#include<pstl/algorithm>
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include<tbb/global_control.h>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        tbb::global_control c(tbb::global_control::max_allowed_parallelism, 2);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
                myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the Parallel STL algorithm
        timer_start = std::chrono::system_clock::now();
        std::sort(pstl::execution::par, myvec1.begin(), myvec1.end(), [](int j, int k){ return j > k; });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Parallel STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Does just enabling multi-threading in an STL algorithm work in every case?

Not necessarily. In the above case, the vector is broken down into mutually exclusive chunks which are handed to individual threads for sorting, so no two threads ever access the same location. But that need not be the case with every algorithm. For instance, consider a Naïve Bayes supervised classification algorithm, which is based on Bayes' theorem. The example involves training the program with a census dataset from the UCI Machine Learning Repository. The dataset has 14 attributes for each person (such as age, sex, capital gain, etc.), and each person's annual salary is tagged as either <=50K or >50K. Each attribute can have multiple values. The data structure used to hold the learnt model is:

std::vector<std::vector<std::unordered_map<std::string, int> > >

The outermost vector is of size two, one for each annual salary category (<=50K and >50K). The inner vector holds the 14 attributes, and for each attribute the unordered_map stores the attribute value and its number of occurrences as a <key, value> pair. The main computation happens in the loop below:

	for_each(dataset.begin(), dataset.end(), [&](std::string &s) {
		size_t start = 0, end = 0, ques = 0, index;
		for (size_t num_of_cols = 0; num_of_cols < 15; num_of_cols++) {
			end = s.substr(start, s.length()).find(',');
			std::string newString = s.substr(start, end);
			ques = newString.find('?');
			if (ques != std::string::npos) {
				start = start + end + 1;
				continue;
			}
			//Windowing for certain numeric fields with continuous values
			switch (num_of_cols) {
			case 0:
				if (newString.find("<=50K") != std::string::npos)
					index = 0;
				else
					index = 1;
				break;
			case 1:
				newString = std::to_string((atoi(newString.c_str()) / 10) * 10);
				break;
			case 3:
				newString = std::to_string((atoi(newString.c_str()) / 10000) * 10000);
				break;
			case 11:
			case 12:
				newString = std::to_string((atoi(newString.c_str()) / 1000) * 1000);
				break;
			case 13:
				newString = std::to_string((atoi(newString.c_str()) / 10) * 10);
				break;
			default: break;
			}
			std::pair<std::unordered_map<std::string, int>::iterator, bool> p =
				(iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
			if (!p.second)
				p.first->second++;
			start = start + end + 1;
		}
	});

To enable parallelism using Parallel STL, try adding the execution policy pstl::execution::par for the above for_each algorithm as shown below:

for_each(pstl::execution::par, dataset.begin(), dataset.end(), [&](std::string &s) {
….
});

When executing this program in multi-threaded mode, it crashes. Debugging the crash points clearly to the insert() of the STL unordered_map.

The insert() of STL unordered_map is not thread safe, so when two threads try to concurrently insert values into the unordered_map the program errors out. Intel® TBB offers a thread-safe equivalent of STL unordered_map, tbb::concurrent_unordered_map, which supports the same interface the STL container offers. Modify the data structure in the program to replace unordered_map with concurrent_unordered_map:

From:

std::vector<std::vector<std::unordered_map<std::string, int> > >

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >

With the above modification, multiple threads can concurrently insert into the map, and the program demonstrates a 2x performance improvement with 2 TBB threads. But checking the output file reveals that, although the program ran to completion with good performance, the frequencies of the individual attribute values are recorded incorrectly. This is because incrementing the frequency of an attribute value when the record already exists in the map is not atomic, which results in a data race. This can be fixed by changing the data structure as follows:

From:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >
.
.
std::pair<std::unordered_map<std::string, int>::iterator, bool> p =
	(iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
if (!p.second)
	p.first->second++;

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > > >
.
.
auto p = (iter[index].begin())[num_of_cols].insert(
	std::make_pair(newString, tbb::atomic<int>()));
// the new entry value-initializes to 0, so increment unconditionally
p.first->second.fetch_and_increment();

With these changes, the code runs faster with multiple threads without compromising accuracy. The key lesson from this exercise is to watch out for the need for concurrent containers and for potential data races.

PSTL Vectorization:

Consider the example of searching for an integer in a std::vector:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 999999999
#ifdef PSTL
#include"pstl/algorithm"
#include"pstl/execution"
#endif
#ifdef GNU_PSTL
#include"parallel/algorithm"
#include<omp.h>
#endif
using namespace std;
std::vector<long long>::iterator mysearch(long long n1, std::vector<long long> &n2) {
#ifdef PSTL
        return find(pstl::execution::unseq, n2.begin(), n2.end(), n1);
#elif defined(GNU_PSTL)
        return __gnu_parallel::find(n2.begin(), n2.end(), n1);
#else
        return find(n2.begin(), n2.end(), n1);
#endif
}
int main(int argc, char *argv[]){
        long long num_to_search;
        if(argc < 2)
        {
                std::cout<<"Enter the number to be searched as a command line argument, range [0 - 999999999]\n";
                return 0;
        }
        else
                num_to_search = atoll(argv[1]);
        static long long p;
        std::vector<long long> myvec(N1);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), [&]() { return p++; });
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #else
                generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        auto result = mysearch(num_to_search, myvec);
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by find algorithm is "<<elapsed_seconds.count()<<"\n";
        if(result != myvec.end())
                std::cout<<"Found the element "<<*result<<", p = "<<p<<"\n";
        else
                std::cout<<"Element not found, p = "<<p<<"\n";
        return 0;
}

The Intel Compiler auto-vectorizes the code, and the vectorized code performs better than the GCC 5.1 generated binary. Please download the attached code samples to evaluate and compare the PSTL implementations provided by different compiler vendors. The PSTL-specific code path and the auto-vectorized code path perform the same in this case. In general, the Intel Compiler has very good vectorization heuristics to identify different code patterns and vectorize them when it is safe to do so. For instance, consider the histogram loop pattern shown below:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 9999999
#ifdef PSTL
#include"pstl/algorithm"
#endif
#ifdef GNU_PSTL
#include<parallel/algorithm>
#include<omp.h>
#endif
using namespace std;
int main(int argc, char *argv[]){
        std::vector<long long> hist(10);
        fill(hist.begin(), hist.end(), 0);
        std::vector<long long> myvec(N1);
        std::cout<<"---------------------\n";
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), std::rand);
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), std::rand);
        #else
                generate(myvec.begin(), myvec.end(), std::rand);
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                for_each(pstl::execution::unseq, myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #elif defined(GNU_PSTL)
                __gnu_parallel::for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #else
                for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by for_each algorithm is "<<elapsed_seconds.count()<<"\n";
        for_each(hist.begin(), hist.end(), [&](const long long &q){ std::cout<<q<<"\n"; });
        return 0;
}

When the pstl::execution::unseq execution policy is used with the Intel Compiler, vectorization is forced using #pragma omp simd (the SIMD pragma from OpenMP 4.0). One important point to remember is that with #pragma omp simd, the compiler's vectorization heuristics do not perform the routine data-dependency and data-flow analysis; the compiler simply follows the user's directive and vectorizes. So always exercise caution when using it. For instance, in the above program, (p%4) results in the values 0,1,2,3 in the SIMD register when targeting SSE (no duplicate values), but when targeting AVX the SIMD register holds 0,1,2,3,0,1,2,3. Executing hist[p%4] in vectorized mode then creates a data race within the register. The Intel® AVX-512 instruction set supports conflict detection instructions (VPCONFLICTD/VPCONFLICTQ) which detect such conflicts (duplicate indices) in a SIMD register. Please try the above example with the attached build script and compare the performance of the Intel Compiler generated code and the GNU compiler generated code.

