Introduction:
An options contract is a financial instrument that gives its buyer the right, but not the obligation, to buy or sell an underlying asset at an agreed price within an agreed period. Options differ in exercise style: European options can be exercised only on the expiration date of the contract, while American options can be exercised at any time up to the expiration date. Options come in two forms: call options and put options. In this article, we compute the call option price.
A call option is a contract in which the buyer obtains from a seller the right to buy a certain financial instrument at a particular strike price, valid for a specific time period. To enter the contract, the buyer does not pay for the full transaction upfront but pays a premium. This premium is called the call option price. For instance, a buyer who wants 50 shares of company A because he predicts that the share value will go up buys a call option with a certain strike price (say $52/share) for a certain time period. The advantage of the call option is that, even if the market value of the share rises beyond $52, the buyer can exercise the option at the strike price. He then gains (current share value - strike price) per share, before accounting for the premium paid. If the stock price does not increase as the buyer predicted, he need not exercise the call option. In that case the seller keeps the premium as profit.
An Asian option is a type of exotic option whose price is path-dependent: it is a function of the price of the underlying asset at multiple points along the path. More information on Asian options can be found at https://en.wikipedia.org/wiki/Asian_option. This article demonstrates how an Asian options pricing program can be optimized to run on CPU cores and offloaded to the GPU cores of Intel® Processor Graphics.
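The two payoffs described above can be written down in a few lines. The following is a minimal sketch, not code from the attached sample; the function names `callPayoff` and `asianCallPayoff` are hypothetical, and the premium is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Payoff of a plain call at exercise: max(S - K, 0), per share,
// before subtracting the premium. (Hypothetical helper name.)
double callPayoff(double spotAtExercise, double strike) {
    return std::max(spotAtExercise - strike, 0.0);
}

// Payoff of an arithmetic-average Asian call: max(mean(S_t) - K, 0),
// where pathPrices holds the asset price sampled along one path.
double asianCallPayoff(const std::vector<double>& pathPrices, double strike) {
    assert(!pathPrices.empty());
    const double sum = std::accumulate(pathPrices.begin(), pathPrices.end(), 0.0);
    const double mean = sum / static_cast<double>(pathPrices.size());
    return std::max(mean - strike, 0.0);
}
```

With the $52 strike from the example, a share worth $60 at exercise gives a plain call payoff of $8, while a share worth $50 gives 0 because the buyer simply does not exercise.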
The attached code sample calculates the price of 1,000 Asian options using the Monte Carlo method. Each option is priced by simulating 100,000 paths with 87 time steps along each path. The payoff uses the arithmetic mean of the simulated prices.
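To make the method concrete, here is a minimal serial sketch of pricing one arithmetic-average Asian call by Monte Carlo. It is not the attached sample: it assumes geometric Brownian motion for the asset price, discounting at a constant risk-free rate, and the standard library's `std::mt19937` generator; the function name `priceAsianCall` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical helper, not taken from the attached sample.
// Assumes geometric Brownian motion and a constant risk-free rate.
double priceAsianCall(double spot, double strike, double rate, double vol,
                      double maturity, int steps, int paths, unsigned seed) {
    assert(steps > 0 && paths > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);

    const double dt = maturity / steps;
    const double drift = (rate - 0.5 * vol * vol) * dt;
    const double diffusion = vol * std::sqrt(dt);

    double payoffSum = 0.0;
    for (int p = 0; p < paths; ++p) {
        double price = spot;
        double runningSum = 0.0;
        for (int s = 0; s < steps; ++s) {
            // Advance the price by one time step and accumulate it.
            price *= std::exp(drift + diffusion * gauss(rng));
            runningSum += price;
        }
        // Arithmetic mean of the path, compared against the strike.
        payoffSum += std::max(runningSum / steps - strike, 0.0);
    }
    // Discounted average payoff over all simulated paths.
    return std::exp(-rate * maturity) * payoffSum / paths;
}
```

A call such as `priceAsianCall(50.0, 52.0, 0.05, 0.2, 1.0, 87, 100000, 42u)` would mirror the path and step counts quoted above; the market parameters in that call are illustrative only.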
Performance tuning on CPU:
Modern Intel® Core™ processors come with multiple processing cores, and each core has SIMD registers that support vectorization. To optimize the code for the CPU, the serial code is first profiled using Intel® VTune™ Amplifier to find the hotspots. The serial version of the program, in single-precision floating point, produces the following:
>icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__SERIAL__ /D__DO_FLOAT__ /Qmkl
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 6.5645 seconds.
Computation rate (in options/sec): 152.3349.
The next step is to exploit any potential thread-level and data parallelism in the hotspot. In this program, the hotspot is the Monte Carlo simulation function. Each thread handles a different option from the options array, and each option's Monte Carlo simulation is done in SIMD mode. Both the threading and the vectorization are enabled using OpenMP. Below are the code snippets:
Enable threading (AsianOptions.cpp: MonteCarloTimeStepParallel()):
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    MonteCarloTimeStepLoopBody(h_CallResult, S, X, T, l_Random, opt);
}
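The reason this loop parallelizes cleanly is that every iteration touches a different option and writes a different result slot. The sketch below shows the same pattern in a self-contained form; `priceOneOption` is a hypothetical stand-in for the sample's `MonteCarloTimeStepLoopBody()`, and the arithmetic inside it is a placeholder rather than a simulation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-option work; the sample's
// MonteCarloTimeStepLoopBody() runs the actual simulation instead.
static double priceOneOption(double spot, double strike) {
    return spot > strike ? spot - strike : 0.0;
}

// Each iteration reads its own inputs and writes only results[opt],
// so iterations are independent and need no synchronization.
void priceAllOptions(const std::vector<double>& spots,
                     const std::vector<double>& strikes,
                     std::vector<double>& results) {
    assert(spots.size() == results.size());
    assert(strikes.size() == results.size());
    const int n = static_cast<int>(results.size());
    #pragma omp parallel for
    for (int opt = 0; opt < n; ++opt) {
        results[opt] = priceOneOption(spots[opt], strikes[opt]);
    }
}
```

If the compiler is invoked without its OpenMP switch, the pragma is ignored and the loop runs serially with the same results, which is a convenient way to check correctness before measuring speedup.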
Enable vectorization (AsianOptions.cpp: MonteCarloTimeStepLoopBody()):
for (int pos = 0; pos < RAND_N; pos += VECLEN)
{
    __declspec(align(64)) tfloat prevStepResult[VECLEN];
    __declspec(align(64)) tfloat nextStepResult[VECLEN];
    __declspec(align(64)) tfloat avgMean[VECLEN];
    __declspec(align(64)) tfloat callValue[VECLEN];
    prevStepResult[:] = Sval;
    avgMean[:] = tfloat(0.0);
    tfloat *ptrZ = l_Random;
    for (int simStep = 0; simStep < SIMSTEPS; simStep++)
    {
        int location = pos*SIMSTEPS + simStep*VECLEN;
        monteCarloByTimeStepKernel(prevStepResult, dt, &ptrZ[location], VBySqrtT, uByT, nextStepResult);
        avgMean[:] += nextStepResult[:];
        prevStepResult[:] = nextStepResult[:];
    }
    //Use Arithmetic Mean
    callValue[:] = max(((avgMean[:] / SIMSTEPS) - Xval), 0);
    val += __sec_reduce_add(callValue[:]) / VECLEN;
}
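The `[:]` array notation and `__sec_reduce_add` in the snippet above are Intel compiler extensions. For readers on other compilers, the following sketch expresses the same blocked pattern (a group of paths advanced together, one lane per path) with plain loops and `#pragma omp simd`. It makes several assumptions: the lane count `kVecLen` is illustrative, the per-step update is taken to be a geometric Brownian motion step because the body of `monteCarloByTimeStepKernel` is not shown, and the function returns the payoff sum so that the caller divides by the number of paths.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kVecLen = 8;  // assumed lane count; the sample's VECLEN may differ

// Sums Asian call payoffs over numPaths paths, kVecLen paths at a time.
// z holds standard normal draws laid out as in the snippet above:
// block starting at path `pos`, step `s`, lane `i` -> z[pos*steps + s*kVecLen + i].
float sumAsianCallPayoffs(float spot, float strike, float drift, float diffusion,
                          const std::vector<float>& z, int numPaths, int steps) {
    assert(steps > 0 && numPaths % kVecLen == 0);
    assert(z.size() == static_cast<std::size_t>(numPaths) * steps);
    float total = 0.0f;
    for (int pos = 0; pos < numPaths; pos += kVecLen) {
        alignas(64) float price[kVecLen];
        alignas(64) float sum[kVecLen];
        for (int i = 0; i < kVecLen; ++i) { price[i] = spot; sum[i] = 0.0f; }

        for (int s = 0; s < steps; ++s) {
            const float* zBlock =
                &z[static_cast<std::size_t>(pos) * steps + static_cast<std::size_t>(s) * kVecLen];
            // One lane per path: every lane performs the same update.
            #pragma omp simd
            for (int i = 0; i < kVecLen; ++i) {
                price[i] *= std::exp(drift + diffusion * zBlock[i]);
                sum[i] += price[i];
            }
        }
        // Arithmetic mean per lane, then reduce the payoffs across lanes.
        #pragma omp simd reduction(+:total)
        for (int i = 0; i < kVecLen; ++i) {
            total += std::max(sum[i] / steps - strike, 0.0f);
        }
    }
    return total;
}
```

Keeping the random numbers contiguous per step and per block, as the `location` index in the original does, means each SIMD update reads one unit-stride run of `kVecLen` values.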
The optimized CPU version of the program, in single-precision floating point, produces the following:
> icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__DO_FLOAT__ /Qmkl
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 2.0421 seconds.
Computation rate (in options/sec): 489.6819.
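As a quick check on these figures, assuming the computation rate is the 1,000 options divided by the elapsed time, the reported rates and the CPU speedup over the serial run follow directly from the two timings (small differences in the last digits come from rounding of the printed times):

```latex
\[
\text{rate}_{\text{serial}} = \frac{1000}{6.5645\ \text{s}} \approx 152.3\ \text{options/s},
\qquad
\text{rate}_{\text{CPU}} = \frac{1000}{2.0421\ \text{s}} \approx 489.7\ \text{options/s}
\]
\[
\text{speedup}_{\text{CPU}} = \frac{6.5645\ \text{s}}{2.0421\ \text{s}} \approx 3.21
\]
```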
Offloading the kernel to GPU:
The kernel can be offloaded to the GPU by annotating the functions and the code segment with offload directives such as #pragma omp target. The relevant code snippet is shown below:
Enable threading on GPU (AsianOptions.cpp: MonteCarloTimeStepParallel()):
#ifdef __DO_OFFLOAD__
#pragma omp target map(tofrom: h_CallResult[0:OPT_N], S[0:OPT_N], X[0:OPT_N], T[0:OPT_N], l_Random[0:RAND_N*SIMSTEPS])
#endif
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    MonteCarloTimeStepLoopBody(h_CallResult, S, X, T, l_Random, opt);
}
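For comparison, here is a self-contained sketch of the same idea written against the standard OpenMP device constructs. It is an illustration under assumptions, not the sample's code: it has not been checked against the /Qopenmp-offload:gfx toolchain used in this article, `intrinsicValue` is a hypothetical placeholder for the per-option work, and the `map` clauses copy inputs one way (`to`) and results one way (`from`) rather than using `tofrom` for everything.

```cpp
#include <cassert>

// Hypothetical per-option work, marked so a device version is also built.
#pragma omp declare target
static float intrinsicValue(float spot, float strike) {
    return spot > strike ? spot - strike : 0.0f;
}
#pragma omp end declare target

// Inputs are copied to the device, results are copied back.
// Without an offload-capable compiler or device, the loop runs on the host.
void priceOnDevice(const float* spots, const float* strikes, float* results, int n) {
    assert(n >= 0);
    #pragma omp target teams distribute parallel for \
        map(to: spots[0:n], strikes[0:n]) map(from: results[0:n])
    for (int opt = 0; opt < n; ++opt) {
        results[opt] = intrinsicValue(spots[opt], strikes[opt]);
    }
}
```

Mapping read-only arrays with `to` avoids copying them back from the device after the region, which matters most for the large random-number array.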
The explicit vectorization techniques for the GPU are the same as on the CPU. The optimized GPU version of the program, in single-precision floating point, produces the following:
> icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__DO_OFFLOAD__ /Qmkl /Qopenmp-offload:gfx /Qoffload-arch:haswell:visa3.1 /D__DO_FLOAT__
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 0.5996 seconds.
Computation rate (in options/sec): 1667.7982.
More information on how to set up the machine for GPU offloading and how to enable the code for offload is available at https://software.intel.com/en-us/articles/getting-started-with-compute-offload-to-intelr-graphics-technology.
Performance Numbers for Single Precision Floating Point:
Version | Speedup | Compiler Version | Compiler Options | System Specifications |
OpenMP threading + vectorization on CPU cores | 3.21x | Intel® C++ Compiler 16.0 | /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc /Qopenmp /D__DO_FLOAT__ /Qmkl | Processor: Intel® Core™ i7-4770R @ 3.2 GHz; RAM: 8 GB; Processor Graphics: Intel® Iris™ Pro Graphics |
GPU offloaded version | 11.12x | Intel® C++ Compiler 16.0 | /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc /Qopenmp /D__DO_OFFLOAD__ /Qmkl /Qopenmp-offload:gfx /Qoffload-arch:haswell:visa3.1 /D__DO_FLOAT__ | Processor: Intel® Core™ i7-4770R @ 3.2 GHz; RAM: 8 GB |
Future Work:
Stay tuned for a heterogeneous version of this algorithm that splits the workload between the CPU and the GPU.
References: