Offloading Asian Options Pricing Algorithm to Intel(R) Processor Graphics

Introduction:

An options contract is a financial instrument that gives its buyer the right, but not the obligation, to buy or sell an underlying asset at an agreed price; the seller (grantor) cannot revoke the offer while the contract is valid. Option styles differ in when they can be exercised: European options can be exercised only on the expiration date of the contract, while American options can be exercised at any time up to the expiration date. Options come in two forms: call options and put options. This article computes the call option price.

A call option is a contract in which the buyer agrees with a seller to buy certain financial instruments at a particular strike price within a specific time period. To enter the contract, the buyer does not pay the full transaction amount upfront but pays a premium, which is called the call option price. For instance, a buyer who wants 50 shares of company A and expects the share value to rise buys a call option with a certain strike price (say $52/share) for a certain time period. Even if the market value of the share rises above $52, the buyer can still exercise the option at the strike price and gain (current share value - strike price) per share, before accounting for the premium. If the stock price does not rise as predicted, the buyer need not exercise the call option, and the seller keeps the premium as profit.

An Asian option is a type of exotic option whose price is path-dependent: it is a function of the price of the underlying asset at multiple points along the path. More information on Asian options can be found at https://en.wikipedia.org/wiki/Asian_option. This article demonstrates how an Asian options pricing program can be optimized to run on CPU cores and then offloaded to the GPU cores of Intel® Processor Graphics.

The attached code sample calculates the price of 1,000 Asian options using the Monte Carlo method. Each option is simulated with 100,000 paths and 87 time steps along each path. The arithmetic mean is used for the payoff.
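
To make the two payoffs concrete, here is a minimal C++ sketch. The function names are illustrative and are not taken from the attached sample; the $52 strike comes from the example above, while any market prices passed in are hypothetical inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Payoff per share of a plain call exercised when the underlying trades at `price`.
double callPayoff(double price, double strike) {
    return std::max(price - strike, 0.0);
}

// Payoff per share of an arithmetic-average Asian call: the average of the
// prices sampled along the path takes the place of the final price.
double asianCallPayoff(const std::vector<double>& sampledPrices, double strike) {
    assert(!sampledPrices.empty());
    const double sum = std::accumulate(sampledPrices.begin(), sampledPrices.end(), 0.0);
    return callPayoff(sum / static_cast<double>(sampledPrices.size()), strike);
}
```

With a $52 strike, callPayoff(60.0, 52.0) gives a payoff of 8 per share, while a path sampled at 50, 54 and 58 averages 54 and gives an Asian payoff of 2 per share. Neither figure accounts for the premium paid.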

Performance tuning on CPU:

Modern Intel® Core™ processors have multiple processing cores, and each core has SIMD registers that support vectorization. To optimize the code for the CPU, the serial code is first profiled with Intel® VTune™ Amplifier to find the hotspots. The serial, single-precision version of the program produces the following:

>icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__SERIAL__ /D__DO_FLOAT__ /Qmkl
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 6.5645 seconds.
Computation rate (in options/sec): 152.3349.

The next step is to exploit any thread-level and data-level parallelism in the hotspot. In this program, the hotspot is the Monte Carlo simulation routine. Each thread handles a different option from the options array, and each option's Monte Carlo simulation runs in SIMD mode. Both the threading and the vectorization are enabled using OpenMP. The code snippets are shown below:

Enable threading (AsianOptions.cpp: MonteCarloTimeStepParallel()):

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    // Each iteration prices one option, so iterations are independent
    MonteCarloTimeStepLoopBody(h_CallResult, S, X, T, l_Random, opt);
}

Enable vectorization (AsianOptions.cpp: MonteCarloTimeStepLoopBody()):

for (int pos = 0; pos < RAND_N; pos += VECLEN)
{
    // Per-lane state for the VECLEN paths simulated together
    __declspec(align(64)) tfloat prevStepResult[VECLEN];
    __declspec(align(64)) tfloat nextStepResult[VECLEN];
    __declspec(align(64)) tfloat avgMean[VECLEN];
    __declspec(align(64)) tfloat callValue[VECLEN];
    prevStepResult[:] = Sval;
    avgMean[:] = tfloat(0.0);
    tfloat *ptrZ = l_Random;
    for (int simStep = 0; simStep < SIMSTEPS; simStep++)
    {
        // Start of the random numbers for this block of paths at this time step
        int location = pos*SIMSTEPS + simStep*VECLEN;
        monteCarloByTimeStepKernel(prevStepResult, dt, &ptrZ[location], VBySqrtT, uByT, nextStepResult);
        avgMean[:] += nextStepResult[:];   // running sum of prices along each path
        prevStepResult[:] = nextStepResult[:];
    }
    // Use Arithmetic Mean
    callValue[:] = max(((avgMean[:] / SIMSTEPS) - Xval), tfloat(0.0));
    val += __sec_reduce_add(callValue[:]) / VECLEN;
}
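
For readers who want to follow the arithmetic without the array notation, the following is a scalar C++ sketch of the same structure: one loop over options that can be threaded, and, per option, a loop over paths and time steps. It is not the code from the attached sample. The struct, the function names, the geometric Brownian motion update, the roles assigned to uByT and VBySqrtT, and the discounting of the mean payoff are all assumptions made for illustration. The index helper assumes the kernel reads VECLEN consecutive random numbers starting at pos*SIMSTEPS + simStep*VECLEN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical parameters for one option (names are illustrative).
struct AsianOption {
    double spot;    // S: initial price of the underlying
    double strike;  // X: strike price
    double years;   // T: time to expiry in years
};

// Index of the normal draw for (path, step) in a buffer laid out in blocks of
// `veclen` paths, step-major inside each block. This mirrors
// pos*SIMSTEPS + simStep*VECLEN from the snippet, plus the lane offset.
std::size_t randomIndex(std::size_t path, std::size_t step,
                        std::size_t steps, std::size_t veclen) {
    const std::size_t pos = (path / veclen) * veclen;  // first path of the block
    const std::size_t lane = path % veclen;            // position inside the block
    return pos * steps + step * veclen + lane;
}

// Scalar Monte Carlo estimate of an arithmetic-average Asian call.
// Assumes geometric Brownian motion with rate r and volatility v, and
// discounts the mean payoff at r; both are assumptions of this sketch.
double priceAsianCall(const AsianOption& opt, double r, double v,
                      const std::vector<double>& z,  // pre-generated N(0,1) draws
                      std::size_t paths, std::size_t steps, std::size_t veclen) {
    assert(paths % veclen == 0 && z.size() >= paths * steps);
    const double dt = opt.years / static_cast<double>(steps);
    const double drift = (r - 0.5 * v * v) * dt;  // assumed role of uByT
    const double diffusion = v * std::sqrt(dt);   // assumed role of VBySqrtT
    double payoffSum = 0.0;
    for (std::size_t path = 0; path < paths; ++path) {
        double price = opt.spot;
        double priceSum = 0.0;
        for (std::size_t step = 0; step < steps; ++step) {
            const double draw = z[randomIndex(path, step, steps, veclen)];
            price *= std::exp(drift + diffusion * draw);
            priceSum += price;
        }
        const double mean = priceSum / static_cast<double>(steps);
        payoffSum += std::fmax(mean - opt.strike, 0.0);
    }
    return std::exp(-r * opt.years) * payoffSum / static_cast<double>(paths);
}

// One option per loop iteration. The random buffer is only read, so the
// iterations are independent and the loop can be threaded with OpenMP.
std::vector<double> priceAll(const std::vector<AsianOption>& opts, double r, double v,
                             const std::vector<double>& z,
                             std::size_t paths, std::size_t steps, std::size_t veclen) {
    std::vector<double> result(opts.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(opts.size()); ++i) {
        result[i] = priceAsianCall(opts[i], r, v, z, paths, steps, veclen);
    }
    return result;
}
```

Sharing one pre-generated buffer of random numbers across all options, as the l_Random argument in the snippets suggests, keeps the threaded loop free of writes to shared state. The sketch compiles with or without OpenMP enabled; without it, the pragma is ignored and the loop runs serially.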

The CPU-optimized, single-precision version of the program produces the following:

> icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__DO_FLOAT__ /Qmkl
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 2.0421 seconds.
Computation rate (in options/sec): 489.6819.

Offloading the kernel to GPU:

The kernel can be offloaded to the GPU by annotating the functions and the code segment with offload directives. The snippet below uses the OpenMP #pragma omp target directive:

Enable threading on GPU (AsianOptions.cpp: MonteCarloTimeStepParallel()):

#ifdef __DO_OFFLOAD__
// Copy the option arrays and the random numbers to the GPU and the results back
#pragma omp target map(tofrom: h_CallResult[0:OPT_N], S[0:OPT_N], X[0:OPT_N], T[0:OPT_N], \
                               l_Random[0:RAND_N*SIMSTEPS])
#endif
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    MonteCarloTimeStepLoopBody(h_CallResult, S, X, T, l_Random, opt);
}
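
The same guard-and-offload pattern can be tried on a much smaller loop. The sketch below is a stand-alone illustration using standard OpenMP directives, not code from the attached sample; DEMO_OFFLOAD and squareAll are hypothetical names, and whether the loop actually runs on a device depends on the compiler and its offload options.

```cpp
#include <cassert>

// Squares every element of `data` in place. When DEMO_OFFLOAD (a hypothetical
// macro for this sketch) is defined and the compiler supports OpenMP target
// offload, the loop is offloaded; otherwise it runs on the host.
void squareAll(float* data, int n) {
    assert(data != nullptr && n >= 0);
#ifdef DEMO_OFFLOAD
    // tofrom: copy the array to the device before the loop and back afterwards
#pragma omp target map(tofrom: data[0:n])
#endif
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        data[i] = data[i] * data[i];
    }
}
```

Keeping the target directive behind a macro, as the article's snippet does with __DO_OFFLOAD__, lets one source file build both the CPU and the offloaded variants. In standard OpenMP, arrays that are only read can be mapped with map(to: ...) so that they are not copied back; whether that changes the timings on processor graphics is not measured here.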

The explicit vectorization techniques for the GPU are the same as on the CPU. The GPU-offloaded, single-precision version of the program produces the following:

> icl /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc Driver.cpp AsianOptions.cpp Timer.cpp /Qopenmp /D__DO_OFFLOAD__ /Qmkl /Qopenmp-offload:gfx /Qoffload-arch:haswell:visa3.1 /D__DO_FLOAT__
>Driver.exe
Monte Carlo Asian Option Pricing in Single Precision By Time Step
Time to complete option pricing: 0.5996 seconds.
Computation rate (in options/sec): 1667.7982.

More information on how to set up the machine for GPU offloading and how to enable the code for offload is available at https://software.intel.com/en-us/articles/getting-started-with-compute-offload-to-intelr-graphics-technology.

Performance Numbers for Single Precision Floating Point:

| Version | Speedup | Compiler Version | Compiler Options | System Specifications |
|---|---|---|---|---|
| OpenMP threading + vectorization on CPU cores | 3.21x | Intel® C++ Compiler 16.0 | /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc /Qopenmp /D__DO_FLOAT__ /Qmkl | Processor: Intel® Core™ i7-4770R @3.2GHz; RAM: 8GB; OS: Windows 7 Enterprise SP1; Processor Graphics: Intel® Iris™ Pro Graphics |
| GPU offloaded version | 11.12x | Intel® C++ Compiler 16.0 | /Qrestrict /W3 /QxCORE-AVX2 /Zi /Ox /Ob2 /Qipo /Oi /EHsc /Qopenmp /D__DO_OFFLOAD__ /Qmkl /Qopenmp-offload:gfx /Qoffload-arch:haswell:visa3.1 /D__DO_FLOAT__ | Processor: Intel® Core™ i7-4770R @3.2GHz; RAM: 8GB; OS: Windows 7 Enterprise SP1; Processor Graphics: Intel® Iris™ Pro Graphics |
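
The speedups can be cross-checked against the timings printed in the run logs earlier in the article. The helper names below are illustrative. Using the logged times, 6.5645 s / 2.0421 s gives about 3.21x for the CPU version and 6.5645 s / 0.5996 s gives about 10.95x for the GPU version; the 11.12x reported in the table is slightly higher, presumably because it comes from a separate measurement.

```cpp
#include <cassert>
#include <cmath>

// Speedup of an optimized run relative to a baseline run, from wall-clock times.
double speedup(double baselineSeconds, double optimizedSeconds) {
    assert(baselineSeconds > 0.0 && optimizedSeconds > 0.0);
    return baselineSeconds / optimizedSeconds;
}

// Throughput in options per second for `options` options priced in `seconds`.
double optionsPerSecond(int options, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(options) / seconds;
}
```

The same timings reproduce the logged computation rates to within rounding: 1,000 options in 2.0421 seconds is roughly 489.7 options per second.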

Future Work:

Stay tuned for a heterogeneous version of this algorithm, which will split the workload between the CPU and the GPU.
