Introduction
The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors.
The benchmark suite with all its applications and input sets is available as open source free of charge. Some of the benchmark programs have their own licensing terms, which might limit their use in some cases.
The Black-Scholes benchmark is a one of the 13 benchmarks in the PARSEC. This benchmark does option pricing with Black-Scholes Partial Differential Equation (PDE). The Black-Scholes equation is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes.
The formula for a put option is similar. The cumulative normal distribution function, CND(x), gives the probability that a normally distributed random variable will have a value less than x. There is no closed form expression for this function, and as such it must be evaluated numerically. Alternatively, the values of this function may be pre-computed and hard-coded in the table; in this case, they can be obtained at runtime using table lookup.
Based on this formula, one can compute the option price analytically based on the five input parameters. Using this analytical approach to price option, the limiting factor lies with the amount of floating-point calculation a processor can perform.
The Hotspots
If we look at the results of Intel® VTune™ Amplifier XE profiling, we see two major hotspots.
- Read from input file and write to output file.
- Black-Scholes calculations.
Let’s look at each one in more detail.
Read from input file and write to output file
The problem is that the input file contains 10 million rows of data. Every row is an element of struct OptionData which contains nine parameters. That’s why before we start the calculations we have to spend a lot of time reading and parsing the data from the input file to an array of OptionData struct. This array is called data.
Obviously the same problem occurs after all the calculations, and you have to write the results in an output file.
These problems can be solved by using parallel read and write by pointers. For more details you can read about it here: Optimization of Data Read/Write in a Parallel Application | Intel® Developer Zone.
Black-Scholes calculations
We will consider this case in more detail. All calculations contain two functions: CNDF and BlkSchlsEqEuroNoDiv.
CNDF
This function realizes the cumulative distribution function of the standard normal distribution. For more details, refer to any book about the theory of probability.
In our case, this function is as follows:
#define inv_sqrt_2xPI 0.39894228040143270286 fptype CNDF ( fptype InputX ) { int sign; fptype OutputX; fptype xInput; fptype xNPrimeofX; fptype expValues; fptype xK2; fptype xK2_2, xK2_3; fptype xK2_4, xK2_5; fptype xLocal, xLocal_1; fptype xLocal_2, xLocal_3; // Check for negative value of InputX if (InputX < 0.0) { InputX = -InputX; sign = 1; } else sign = 0; xInput = InputX; // Compute NPrimeX term common to both four & six decimal accuracy calcs expValues = exp(-0.5f * InputX * InputX); xNPrimeofX = expValues; xNPrimeofX = xNPrimeofX * inv_sqrt_2xPI; xK2 = 0.2316419 * xInput; xK2 = 1.0 + xK2; xK2 = 1.0 / xK2; xK2_2 = xK2 * xK2; xK2_3 = xK2_2 * xK2; xK2_4 = xK2_3 * xK2; xK2_5 = xK2_4 * xK2; xLocal_1 = xK2 * 0.319381530; xLocal_2 = xK2_2 * (-0.356563782); xLocal_3 = xK2_3 * 1.781477937; xLocal_2 = xLocal_2 + xLocal_3; xLocal_3 = xK2_4 * (-1.821255978); xLocal_2 = xLocal_2 + xLocal_3; xLocal_3 = xK2_5 * 1.330274429; xLocal_2 = xLocal_2 + xLocal_3; xLocal_1 = xLocal_2 + xLocal_1; xLocal = xLocal_1 * xNPrimeofX; xLocal = 1.0 - xLocal; OutputX = xLocal; if (sign) { OutputX = 1.0 - OutputX; } return OutputX; }
BlkSchlsEqEuroNoDiv
This is known as the Black-Scholes model, which gives a theoretical estimate of the price of European-style options. All we need to know for our purposes is a Black-Scholes formula, which will be used for calculations. The formula is as follows:
C(S,t)=N(d1)S-N(d2)Ke-r(T-t)
- The value of a call option for a non-dividend-paying underlying stock.
P(S,t)=Ke-r(T-t)N(-d2)-N(d1)S
- The price of a corresponding put option.
d1= 1/(σ√(T-t))(ln(S/K)+(r+σ2/2)(T-t))
d2= 1/(σ√(T-t))(ln(S/K)+(r-σ2/2)(T-t))
For both, as above:
- N(∙) is the CNDF function.
- T - t is the time to maturity.
- S is the spot price of the underlying asset.
- K is the strike price.
- r is the risk.
- σ is the volatility of returns of the underlying asset.
In our case, this function is as follows:
fptype BlkSchlsEqEuroNoDiv( fptype sptprice, fptype strike, fptype rate, fptype volatility, fptype time, int otype, float timet ) { fptype OptionPrice; // local private working variables for the calculation fptype xStockPrice; fptype xStrikePrice; fptype xRiskFreeRate; fptype xVolatility; fptype xTime; fptype xSqrtTime; fptype logValues; fptype xLogTerm; fptype xD1; fptype xD2; fptype xPowerTerm; fptype xDen; fptype d1; fptype d2; fptype FutureValueX; fptype NofXd1; fptype NofXd2; fptype NegNofXd1; fptype NegNofXd2; xStockPrice = sptprice; xStrikePrice = strike; xRiskFreeRate = rate; xVolatility = volatility; xTime = time; xSqrtTime = sqrt(xTime); logValues = log( sptprice / strike ); xLogTerm = logValues; xPowerTerm = xVolatility * xVolatility; xPowerTerm = xPowerTerm * 0.5; xD1 = xRiskFreeRate + xPowerTerm; xD1 = xD1 * xTime; xD1 = xD1 + xLogTerm; xDen = xVolatility * xSqrtTime; xD1 = xD1 / xDen; xD2 = xD1 - xDen; d1 = xD1; d2 = xD2; NofXd1 = CNDF( d1 ); NofXd2 = CNDF( d2 ); FutureValueX = strike * ( exp( -(rate)*(time) ) ); if (otype == 0) { OptionPrice = (sptprice * NofXd1) - (FutureValueX * NofXd2); } else { NegNofXd1 = (1.0 - NofXd1); NegNofXd2 = (1.0 - NofXd2); OptionPrice = (FutureValueX * NegNofXd2) - (sptprice * NegNofXd1); } return OptionPrice; }
Optimizations
The only function that is in main() is the bs_thread(), this function is as follows:
#ifdef WIN32 DWORD WINAPI bs_thread(LPVOID tid_ptr){ #else int bs_thread(void *tid_ptr) { #endif int i, j; fptype price; fptype priceDelta; int tid = *(int *)tid_ptr; int start = tid * (numOptions / nThreads); int end = start + (numOptions / nThreads); for (j=0; j<NUM_RUNS; j++) { #ifdef ENABLE_OPENMP #pragma omp parallel for private(i, price, priceDelta) for (i=0; i<numOptions; i++) { #else //ENABLE_OPENMP for (i=start; i<end; i++) { #endif //ENABLE_OPENMP /* Calling main function to calculate option value based on * Black & Scholes's equation. */ price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i], rate[i], volatility[i], otime[i], otype[i], 0); prices[i] = price; #ifdef ERR_CHK priceDelta = data[i].DGrefval - price; if( fabs(priceDelta) >= 1e-4 ){ printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n", i, price, data[i].DGrefval, priceDelta); numError ++; } #endif } } return 0; } #endif //ENABLE_TBB
Let’s take a loop from 0 to numOptions and place it to BlkSchlsEqEuroNoDiv and add one more parameter to this function. The prices array will be this parameter. Of course our function BlkSchlsEqEuroNoDiv will take sptprice, strike, rate, volatility, time, otype arrays entirely. We also need to add some OpenMP* pragmas to parallelize this loop:
#pragma omp parallel for private(i) #pragma simd #pragma vector aligned
Thus we will get a performance gain by vectorization and parallelization.
If we read more about the CDF function of the standard normal distribution we will see that this function can be expressed by using the error function:
N(x)= 1/2(1+erf(x/√2))
Let’s create this implementation of the CDF function and try to use it:
# define ERF(x) erff(x) # define INVSQRT(x) 1.0f/sqrtf(x) # define HALF 0.5f fptype cdfnorm (fptype input) { fptype buf; fptype output; buf = input*INVSQRT(2); output = HALF + HALF*ERF(buf); return output; }
Before we use it in a calculation, I want to mention that I used the erff(x) function, which is contained in the Intel® Math Kernel Library. I used #include <mathimf.h> at the top of source file.
Now we are ready to conduct the experiments!
The system
The system I used to test my modification was the dual Intel® Xeon® E5-2697 v3 server. The Intel Xeon E5-2697 v3 processor contains 14 cores with 2.6 GHz core frequency. Also it has Intel® Hyper-Threading Technology on. Thus we have 28 threads and 35 MB of L3 cache per package. In sum we have 56 available threads.
The compiler
I used the Intel® C++ Compilers 2013.
The compiler keys that I used are as follows:
export CFLAGS="-O2 -xCORE-AVX2 -funroll-loops -opt-prefetch -g -traceback"
These keys may be changed in the configure file «icc.bldconf».
Experiments
I launched the blackscholes benchmark, and the results were as follows:
Summary
Thus, we made some performance gains due to the modifications of the computational functions.
With the I/O parallelization using the method specified earlier in this article, we can achieve a big performance gain.