Tag Archives: CUDA


Writing Monte Carlo Simulation for Derivative Pricing with Alea GPU

This post shows how to write a GPU-accelerated Monte Carlo simulation for financial derivative pricing from scratch with F# and Alea GPU. We first look at the generation of random numbers with the Alea.CUDA.Unbound library and then write a Monte Carlo simulation running on the GPU. A brief sample of writing simple GPU functions can be found in the previous post.

This post assumes that you have some F# programming experience. Knowledge of the CUDA programming model, geometric Brownian motion, and the Monte Carlo method is helpful but not necessary to understand most of the code.

To run the code you need an F# development environment such as Visual Studio or Xamarin and a CUDA capable GPU. The code was written and tested with Alea GPU 2.0, Visual Studio 2013 and CUDA 6.5 on Windows.

Generating Random Numbers

Generating random numbers using Alea.CUDA actually requires no understanding of the CUDA programming model at all. Alea.CUDA comes with the Alea.CUDA.Unbound library that contains two random number generators (XorShift7, Mrg32k3a) and one quasi random number generator (Sobol) out of the box for your CUDA device and the host.

To generate nmc random numbers on your CUDA device you first have to set up a random number generator with

In the code above we have created a XorShift7 random number generator that returns normally distributed single-precision random numbers. numDimensions can be understood as the number of dimensions of the underlying problem. For example, if you are randomly drawing points in a plane, you need 2 random numbers to determine one sample, so numDimensions = 2. Next we have to allocate memory on the device to which the random numbers can be written.

It’s instructive to look at the type of prngBuffer. It is DeviceMemory<float32>. Variables of type DeviceMemory<'T> represent memory allocated on the device with the ability to transfer data from and to the device. They can be roughly thought of as device versions of 'T[]. A close “relative” is deviceptr<'T>, which can be obtained by calling .Ptr on a DeviceMemory<'T> and which we can think of as the pointer to the data on the device that we have to pass to a kernel (more on this later). Having allocated the memory on the device, we have to instruct the generator to fill the memory with nmc samples and then return the result from the device to the host memory.

Again the types are instructive. DeviceMemory<'T> has a member function Gather : unit -> 'T [] and we can infer from the type that Gather transfers the content from the device to the host.

Having generated the random numbers we can use them on the CPU. Alea.CUDA.Unbound also contains other algorithms that can be used out of the box without a detailed knowledge of the GPU.

The full example is

The plot of the distribution of random numbers generated by the code should look like

A small Monte Carlo Simulation

The problem we will solve is a simulation of a geometric Brownian motion of the form

$$\frac{dS_t}{S_t} = \mu(t) \, dt + \sigma(t,S_t) \, dW_t$$

also known as a local volatility model in finance due to the dependence of the volatility on the coordinate $S_t$. A discussion of the model is far beyond the scope of this post. We discretize the time axis of the problem into $n$ time steps $t_i$ from $t_0 = 0$ to $t_{n} = T$ and get

$$S_{t_{i+1}} = S_{t_i} \cdot \exp \left[ dt_i \left( \mu(t_i) - \frac{\sigma(t_i,S_{t_i})^2}{2} \right) + \sqrt{dt_i} \, \sigma(t_i,S_{t_i}) \, dW_i \right],$$

where $dt_i = t_{i+1} - t_i$ and the $dW_i$ are drawn independently from a standard normal distribution. Not by accident, this is the distribution we used in the random number example. We will now implement a Monte Carlo simulation for the GPU from the bottom up. We start with the implementation of the time step.

This is the formula of the discretization above and not much more can be said. Next we implement the time stepping for a single sample.

We are using the fact that we can pass functions to a kernel. dt, mu, and dW are functions from the step index to a value, and sigma is, as in the formulas above, a function from time and value to a value and is represented as such. This loop handles a single sample only and is indifferent to the functional form we choose, for example, for sigma. We could implement and pass a linear interpolation of data. We could also pass a constant, and we will do so later. Before we can create the final kernel, we also have to provide an implementation of the sample loop.

Compare the sample loop given by forAll to the apply function from the previous post. They are nearly identical. Only the input and output parameters are absent, and the function f takes the sample id as an additional input.

For the final kernel we decide to have mu and sigma as constants over time and pass the random numbers as a deviceptr<_>.

We provide all required functions. Because the resulting function takes only basic types and deviceptr<_> as arguments, it is a kernel that can be compiled ahead of time, and we do so. Finally we combine the kernel with the parameters and the random number generation and obtain

The full example is

and the plot of the distribution of $S_T$ should look like

The code is relatively simple and straightforward. By the way, only minor changes are necessary to run the code on the CPU and functions such as eulerLogStep and mcLoop can be reused.
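As a host-side cross-check of the discretization, the same scheme can be sketched on the CPU in a few lines of plain Python. This is only an illustration and not the Alea GPU code of this post: the names `euler_log_step` and `mc_loop` mirror eulerLogStep and mcLoop, but their signatures, the `simulate` helper, the seed, and all parameter values are assumptions chosen for the example.

```python
import math
import random


def euler_log_step(dt, mu, sigma, dw, s):
    """One log-Euler step: S -> S * exp(dt*(mu - sigma^2/2) + sqrt(dt)*sigma*dW)."""
    return s * math.exp(dt * (mu - 0.5 * sigma * sigma) + math.sqrt(dt) * sigma * dw)


def mc_loop(s0, mu, sigma, t_end, n_steps, rng):
    """Time stepping for a single sample; mu(t) and sigma(t, s) are passed as functions."""
    dt = t_end / n_steps  # equidistant time grid (an assumption of this sketch)
    s = s0
    for i in range(n_steps):
        t = i * dt
        dw = rng.gauss(0.0, 1.0)  # standard normal draw for this step
        s = euler_log_step(dt, mu(t), sigma(t, s), dw, s)
    return s


def simulate(n_samples, seed=42, **kwargs):
    """Sample loop: run mc_loop once per sample with a shared, seeded generator."""
    rng = random.Random(seed)
    return [mc_loop(rng=rng, **kwargs) for _ in range(n_samples)]


# Constant drift and volatility, as in the final kernel of this post.
samples = simulate(20_000, s0=1.0, mu=lambda t: 0.05,
                   sigma=lambda t, s: 0.2, t_end=1.0, n_steps=10)
mean = sum(samples) / len(samples)
print(mean, math.exp(0.05))  # the two numbers should be close
```

For constant mu and sigma the expected value of $S_T$ is $S_0 e^{\mu T}$, so comparing the sample mean with `math.exp(0.05)` gives a quick sanity check that can also be applied to the GPU results.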

Enjoy playing with Alea GPU …


Simple GPU Functions with Alea GPU

This post shows how to write simple GPU-accelerated data-parallel functions with F# and Alea GPU.

Short Guide to CUDA

It’s likely that your computer has not only a CPU but also a GPU that can be used for computations. While your CPU has a few calculation units or cores (~ 1 – 8) that can perform a wide variety of computations efficiently, your GPU has many cores (~ 100 – 1000). These cores are simpler than CPU cores and are optimized for Single Instruction Multiple Data workloads, which perform the same calculation on a wide range of inputs.

As of today the CPU (also called the host) and the GPU (also called the device) often have separate memory. A typical calculation therefore consists of the following steps:

  • transfer data from host to device
  • start the calculations on the device
  • return result from device to host

Writing and Running GPU Computations

Assume we have an array of single precision numbers in memory and want to calculate the sine of each element of the array on the GPU. Allocating memory on the device and transferring data to the device can be done with

Malloc can be of type int -> DeviceMemory<'T> or 'T [] -> DeviceMemory<'T>. The first version allocates memory for the given number of 'T’s. The second version allocates the memory and transfers the data of the array from host to device. A function that transfers data from host to already allocated device memory is Scatter. Once we also have allocated memory for the result using

we launch the yet to be written GPU function with

The launch parameter essentially determines how many threads are scheduled for execution on the GPU and in which layout they are scheduled. A more detailed discussion can be found, e.g., here.

For our simple example it is enough to know that we schedule nThread = gridDim.x * blockDim.x threads and each thread is identified with the index iThread = blockIdx.x * blockDim.x + threadIdx.x where 0 <= iThread < nThread.
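The index formula can be checked with a small host-side emulation in plain Python. The helper names below are hypothetical, and the grid-stride distribution of elements to threads is an assumption made for illustration; it is one common way to let nThread threads cover an array of arbitrary length, not necessarily the scheme used in this post.

```python
def thread_indices(grid_dim_x, block_dim_x):
    """Enumerate iThread = blockIdx.x * blockDim.x + threadIdx.x for a 1D launch."""
    indices = []
    for block_idx_x in range(grid_dim_x):
        for thread_idx_x in range(block_dim_x):
            indices.append(block_idx_x * block_dim_x + thread_idx_x)
    return indices


def elements_for_thread(i_thread, n_thread, n):
    """Array elements handled by one thread in a grid-stride loop (assumed scheme)."""
    return list(range(i_thread, n, n_thread))


grid_dim_x, block_dim_x = 2, 4
n_thread = grid_dim_x * block_dim_x

print(thread_indices(grid_dim_x, block_dim_x))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(elements_for_thread(3, n_thread, 20))     # [3, 11, 19]
```

Every index from 0 to nThread - 1 occurs exactly once, and with the stride equal to nThread each array element is visited by exactly one thread.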

The following function applies an operator f to each element of an input array on the device and writes the result to an output array on the device

Here we meet a deviceptr<'T> which represents the memory on the device.

The [<ReflectedDefinition>] annotation causes F# to include a representation of apply in the form of an Expr<_> in the assembly. The Alea GPU compiler can take this Expr<_> and compile it to executable GPU code.

The operator f is an argument of the apply function. The type of f is float32 -> float32. We can now pass the sin operator to the apply function

The arguments of the resulting applySin function are of primitive types such as int, float, float32, or deviceptr<_> of basic types. In other words, we are left with data that can be transferred from the host to the device. This contrasts with the operator f in the apply function, which cannot be transferred from host to device as a function object.

For applySin we can use ahead-of-time compilation with the annotation AOTCompile so that the corresponding GPU code is created at compile time and embedded in the assembly. In contrast, the apply function still has a parameter f that is not yet determined. This code can only be compiled once the operator f is known, so we have to resort to just-in-time compilation.

We launch the computations on the GPU with

and then transfer the result from device to host with Gather.

The full example is

and the result should look like


Alea GPU 2.0 Final Release

In February 2015 we made the release candidate of Alea GPU 2.0 available.

Just in time for GTC 2015 we released the final Alea GPU on nuget.org.

Alea GPU is a professional cross-platform GPU development environment for .NET.

  • Supports all .NET languages, including C#, F# and VB
  • Improves developer productivity with first class tooling for coding, debugging and profiling, fully integrated in Visual Studio™
  • Reduces development time with pre-fabricated GPU algorithms and libraries
  • Increases agility with GPU scripting and a REPL in Visual Studio™ for rapid prototyping of GPU code

With Alea GPU the .NET framework can be used as a cross platform technology for the CPU and GPU code base.

  • Runs on Windows, Linux and Mac OS X
  • Single code base for multiple platforms – build once and run on any platform supporting either .NET or Mono
  • Simplified deployment because generated assemblies are binary compatible for all platforms

Radically Simplified GPU Parallelization: The Alea Dataflow Programming Model

Many programmers still leave the massive GPU parallel power unused – be it because of lacking experience in CUDA or because of limited time and budget. We aim to drastically simplify GPU parallelization by introducing our Alea dataflow programming model based on .NET. Complex computations can be easily and rapidly composed of a set of prefabricated and customizable operations that underlie asynchronous execution. The run-time system automatically translates this abstract model to efficient GPU code and schedules the operations with minimum memory transfers. By way of illustrative application cases of finance and statistics, we explain the model, take a look at the run-time system, and demonstrate its performance that proves to be as good as in manually optimized CUDA implementations.

Presentation slides


GPU Accelerated Backtesting and ML for Quant Trading Strategies

In algorithmic trading, large amounts of time series data are analyzed to derive buy and sell orders such that the strategy is profitable while its risk measures stay at an acceptable level.
Bootstrapping walk forward optimization is becoming increasingly popular as a way to avoid curve fitting and data snooping. It is computationally extremely expensive but can be distributed very well across a GPU cluster.

We present a framework for bootstrapping walk forward optimization of trading strategies on GPU clusters, which allows us to analyze strategies in minutes instead of days. Moreover, we show how signal generation can be combined with Machine Learning to make the strategies more adaptive to further improve the robustness and profitability.

Presentation slides