This post shows how to write a GPU accelerated Monte Carlo simulation for financial derivative pricing from scratch with F# and Alea GPU. We first look at the generation of random numbers with the Alea.CUDA.Unbound library and then write a Monte Carlo simulation running on the GPU. A brief sample of writing simple functions can be seen in the previous post

The post assume that you have some F# programming experience. Knowledge of the CUDA programming model, the geometric Brownian motion, and the Monte Carlo method are helpful but not necessary to understand most of the code.

To run the code you need an F# development environment such as Visual Studio or Xamarin and a CUDA capable GPU. The code was written and tested with Alea GPU 2.0, Visual Studio 2013 and CUDA 6.5 on Windows.

## Generating Random Numbers

Generating random numbers using Alea.CUDA actually requires no understanding of the CUDA programming model at all. Alea.CUDA comes with the Alea.CUDA.Unbound library that contains two random number generators (XorShift7, Mrg32k3a) and one quasi random number generator (Sobol) out of the box for your CUDA device and the host.

To generate nmc random numbers on your CUDA device you first have to set up a random number generator with

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
module Randoms open Alea.CUDA open Alea.CUDA.Utilities open Alea.CUDA.Unbound.Rng let example () = // parameters of random number generator let numDimensions = 1 let seed = 1u let nmc = 100000 // determine worker let worker = Worker.Default let target = GPUModuleTarget.Worker(worker) |

In the code above we have created a XorShift7 random number generator that returns normally distributed single-precision random numbers. numDimensions can be understood as the number of dimensions of the underlying problem. For example if you are randomly drawing points in a plane, you need 2 random numbers to determine a sample and numDimensions = 2. We have to allocate memory on the device to which the random numbers can be written.

1 2 3 |
// setup random number generator use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, numDimensions, seed) :> IRandom<float32> use prngBuffer = cudaRandom.AllocCUDAStreamBuffer (nmc) |

It’s instructive to look at the type of
prngBuffer. It is
DeviceMemory<float32>. Variables of type
DeviceMemory<'T> represent memory allocated on the device with the ability to transfer from and to

the device. They can be roughly though of as device versions of
'T[]. A close “relative” is
deviceptr<'T> which can be obtained by calling
.Ptr on a
DeviceMemory<'T> and which we can think of as the pointer to the data on the device that have have to pass to a kernel (more later). Having allocated the memory on the device, we have to instruct the generator to fill the memory with
nmc samples and then return the result from the device to the host memory.

1 2 3 4 |
// create random numbers cudaRandom.Fill(0, nmc, prngBuffer) // transfer results from device to host prngBuffer.Gather() |

Again the types are instructive. DeviceMemory<'T> has a member function Gather : unit -> 'T [] and we can infer from the type that Gather transfers the content from the device to the host.

Having generated the random numbers we can use them on the CPU. Alea.CUDA.Unbound also contains other algorithms that can be used out of the box without a detailed knowledge of the GPU.

The full example is

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
module Randoms open Alea.CUDA open Alea.CUDA.Utilities open Alea.CUDA.Unbound.Rng let example () = // parameters of random number generator let numDimensions = 1 let seed = 1u let nmc = 100000 // determine worker let worker = Worker.Default let target = GPUModuleTarget.Worker(worker) // setup random number generator use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, numDimensions, seed) :> IRandom<float32> use prngBuffer = cudaRandom.AllocCUDAStreamBuffer (nmc) // create random numbers cudaRandom.Fill(0, nmc, prngBuffer) // transfer results from device to host prngBuffer.Gather() |

The plot of the distribution of random numbers generated by the code should look like

## A small Monte Carlo Simulation

The problem we will solve is a simulation of a geometric Brownian motion of the form

$$\frac{dS_t}{S_t} = \mu(t) \, dt + \sigma(t,S_t) \, dW_t$$

also known as a local volatility model in finance due to the dependence of the volatility on the coordinate $S_t$. A discussion of the model is far beyond the scope of this post. We discretize the time axis of the problem into $n$ time steps $t_i$ from $t_0 = 0$ to $t_{n} = T$ and get

$$S_{t_{i+1}} = S_{t_i} \cdot \exp \left[ dt_i \left( \mu(t_i) – \frac{\sigma(t_i,S_{t_i})^2}{2} \right) + \sqrt{dt_i} \, \sigma(t_i,S_{t_i}) \, dW_i \right],$$

where $dt_i = t_{i+1} – t_i$ and the $dW_i$ are drawn independently from a standard normal distribution. This is not accidentally the distribution we have used in the example of the random numbers. We will now implement a Monte Carlo simulation for the GPU bottom up. We start by the implementation of the time step.

1 2 3 4 5 6 7 |
type Step = int type Real = float32 /// Performs an Euler step in log-coordinate space [<ReflectedDefinition>] let eulerLogStep (dt : Real) (mu : Real) (sigma : Real) (dW : Real) s = s * (exp <| dt * (mu - (1G / 2G) * sigma * sigma) + sqrt dt * sigma * dW) |

This is the formula of the discretization above and not much more can be said. Next we implement the time stepping for a single sample.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
/// Time loop [<ReflectedDefinition>] let mcLoop (tStart : Step) (tEnd : Step) (dt : Step -> Real) (s0 : Real) (mu : Step -> Real) (sigma : Step -> Real -> Real) (dW : Step -> Real) = let mutable t = tStart let mutable s = s0 while t < tEnd do let s' = eulerLogStep (dt t) (mu t) (sigma t s) (dW t) s s <- s' t <- t + 1 s |

We are using the fact that we can pass functions to a kernel. dt, mu, and dW are functions from Step to value and sigma is – as in the formulas above a function from time and value to value and represented as such. This loop is for a single sample only and indifferent to the functional form we use for example for sigma. We could implement and pass a linear interpolation of data. We could also and will do so later pass a constant. Before we can create the final kernel to run we also have to provide an implementation of the sample loop.

1 2 3 4 5 6 7 8 9 |
/// Sample loop [<ReflectedDefinition>] let forAll (n : int) (f : int -> unit) = let iStart = blockIdx.x * blockDim.x + threadIdx.x let iStep = gridDim.x * blockDim.x let mutable i = iStart while i < n do f i i <- i + iStep |

Compare the sample loop given by forAll to apply. The are nearly identical. Only the input and output parameters are absent and the function f has the sample id as an additional input.

For the final kernel we decide to have mu and sigma as constants over time and pass the random numbers as a deviceptr<_>.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
/// Simulation kernel [<ReflectedDefinition;AOTCompile>] let mcSim (nmc : Sample) (nt : Step) (dt : deviceptr<Real>) (s0 : Real) (mu : Real) (sigma : Real) (dW : deviceptr<Real>) (st : deviceptr<Real>) = let go sample = st.[sample] <- mcLoop 0 nt (fun t -> dt.[t]) s0 (fun _ -> mu) (fun _ _ -> sigma) (fun t -> dW.[nt * sample + t]) forAll nmc go |

We provide all required functions and as the resulting function is a function of basic types and deviceptr<_> only, it is a kernel we can compile ahead of time and also do so. We finally combine the kernel with parameters and the random number generation and obtain

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
let example () = // model parameters let T = 1.0f let s0 = 1.0f let mu = 0.02f let sigma = 0.20f // simulation parameters let nt = 40 let nmc = 1000000 let dt = Array.create nt (T / float32 nt) // parameters of random number generator let numDimensions = nt let seed = 1u // determine worker let worker = Worker.Default let target = GPUModuleTarget.Worker(worker) // setup random number generator use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, numDimensions, seed) :> IRandom<float32> use prngBuffer = cudaRandom.AllocCUDAStreamBuffer (nmc) // create random numbers cudaRandom.Fill(0, nmc, prngBuffer) // transfer to device let devDt = worker.Malloc dt let devSt = worker.Malloc nmc // determine launch parameters let lp = launchParam worker nmc // launch kernel worker.Launch <@ mcSim @> lp nmc nt devDt.Ptr s0 mu sigma prngBuffer.Ptr devSt.Ptr // transfer results from device to host devSt.Gather() |

The full example is

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 |
module MonteCarlo open Alea.CUDA open Alea.CUDA.Utilities open Alea.CUDA.Unbound.Rng type Step = int type Real = float32 type Sample = int /// Performs an Euler step in log-coordinate space [<ReflectedDefinition>] let eulerLogStep (dt : Real) (mu : Real) (sigma : Real) (dW : Real) s = s * (exp <| dt * (mu - (1G / 2G) * sigma * sigma) + sqrt dt * sigma * dW ) /// Time loop [<ReflectedDefinition>] let mcLoop (tStart : Step) (tEnd : Step) (dt : Step -> Real) (s0 : Real) (mu : Step -> Real) (sigma : Step -> Real -> Real) (dW : Step -> Real) = let mutable t = tStart let mutable s = s0 while t < tEnd do let s' = eulerLogStep (dt t) (mu t) (sigma t s) (dW t) s s <- s' t <- t + 1 s /// Sample loop [<ReflectedDefinition>] let forAll (n : Sample) (f : Sample -> unit) = let iStart = blockIdx.x * blockDim.x + threadIdx.x let iStep = gridDim.x * blockDim.x let mutable i = iStart while i < n do f i i <- i + iStep /// Simulation kernel [<ReflectedDefinition;AOTCompile>] let mcSim (nmc : Sample) (nt : Step) (dt : deviceptr<Real>) (s0 : Real) (mu : Real) (sigma : Real) (dW : deviceptr<Real>) (st : deviceptr<Real>) = let go sample = st.[sample] <- mcLoop 0 nt (fun t -> dt.[t]) s0 (fun _ -> mu) (fun _ _ -> sigma) (fun t -> dW.[nt * sample + t]) forAll nmc go let launchParam (worker : Worker) n = let blockSize = 256 let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT let gridSize = min (16 * numSm) (divup n blockSize) LaunchParam(gridSize, blockSize) let example () = // model parameters let T = 1.0f let s0 = 1.0f let mu = 0.02f let sigma = 0.20f // simulation parameters let nt = 40 let nmc = 1000000 let dt = Array.create nt (T / float32 nt) // parameters of random number generator let numDimensions = nt let seed = 1u // determine worker let worker = Worker.Default let target = GPUModuleTarget.Worker(worker) // setup random number generator use prngBuffer = cudaRandom.AllocCUDAStreamBuffer (nmc) // create random numbers cudaRandom.Fill(0, nmc, prngBuffer) // transfer to device let devDt = worker.Malloc dt let devSt = worker.Malloc nmc // determine launch parameters let lp = launchParam worker nmc // launch kernel worker.Launch <@ mcSim @> lp nmc nt devDt.Ptr s0 mu sigma prngBuffer.Ptr devSt.Ptr // transfer results from device to host devSt.Gather() |

and the plot of the distribution of $S_T$ should look like

The code is relatively simple and straightforward. By the way, only minor changes are necessary to run the code on the CPU and functions such as eulerLogStep and mcLoop can be reused.

Enjoy playing with Alea GPU …