This blog post is QuantAlea’s contribution to the F# Advent Calendar. Many thanks to Sergey Tihon for organizing this series. The code for the least correlated subset problem solved below in F# with GPUs can be found on GitHub.
QuantAlea has specialized in GPU computing with a strong focus on the financial services industry. You might have heard already of Alea GPU, QuantAlea’s professional GPU development environment for .NET. It allows to compile F#, C# or VB code directly to executable GPU code. Alea GPU has some unique features:
 Its JIT compiler generates GPU code on the fly, which is very useful for GPU scripting.
 Alea GPU produces binary compatible cross platform code, which runs on Windows, Linux or Mac OS X without recompilation.
 Alea GPU is a fully fledged compiler with native debugging support and is independent of any third party tools.
Other solutions are far less complete and not as convenient to use as Alea GPU. For instance CudaFy generates C code under the hood. It requires a post build step to compile the generated C code with a CUDA C compiler, breaking the integrity of the development process. The library managedCuda is even simpler, it is just a wrapper for the CUDA Driver API and requires that all the GPU code is programmed in C or C++.
Instead of giving an introduction to Alea GPU and describing how to program GPUs with F# in general I thought it would be more illustrative show how we use F# and GPUs for our own algorithmic trading at InCube Capital. F# and GPUs are a very good combination to develop sophisticated algorithmic trading strategies and analyze and optimize their profitability. I’d like to illustrate that with a few examples.
Coding Strategies in F#
Let me begin by explaining how we develop the core trading logic in F#. For this we assume that our strategy decides to buy and sell based on a signal which has sufficient predictive power to forecast price movements and turning points. The more complex the trading logic becomes, the easier it is to miss a case, which then generates an error during paper trading or even worse, produces wrong orders in live execution. We extensively rely on exhaustive pattern matching. It is a very powerful tool to improve the correctness and robustness of our trading logic. We first define some active patterns. Here is an active pattern for the signal transition from an previous value to an actual value:

let (FromZeroToZeroNoChangeChange) (f, s) = match UnitSignalValue.OfFloat f, UnitSignalValue.OfFloat s with  Zero, Pos > FromZero  Zero, Neg > FromZero  Pos , Zero > ToZero  Neg , Zero > ToZero  Neg , Pos > Change  Pos , Neg > Change  Pos , Pos > NoChange  Neg , Neg > NoChange  Zero, Zero > NoChange 
Yet another active pattern determines the different trading phases during the day. We have three time intervals. We wait until the trading start date, then we trade and after some time we try to liquidate the positions. If we do not manage to get rid of the position during the liquidation time we do a forced liquidation.

let (WaitTradeLiquidateStop) (input:(TimeSpan*TimeSpan*TimeSpan)*DateTime) = let (t1, t2, t3), d = input let dt = d.TimeOfDay if dt < t1 then Wait else if t1 <= dt && dt <= t2 then Trade else if t2 < dt && dt <= t3 then Liquidate else Stop 
With these active patterns we can now formulate the core trading logic as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

let newSignal = filterStep past filter.Coeffs match (periods, quote.TimeStamp), pnlTransactions.HasPosition, (signal, newSignal) with  Wait, _ , _ > ()  Trade, _ , ToZero > clearOpenPosition "on signal to zero" quote.Quote  Trade, false, FromZero > ()  Trade, false, NoChange > ()  Trade, false, Change > enterNewPosition newSignal quote.Quote  Trade, true , Change > switchSide newSignal quote.Quote  Trade, true , NoChange > ()  Trade, true , FromZero > failwithf "inconsistent zero unit signal with position %d" pnlTransactions.Position  Liquidate, true , ToZero  Liquidate, true , Change > clearOpenPosition (sprintf "on signal change after %A" filter.StopTime) quote.Quote  Liquidate, _ , _ > ()  Stop , true , _ > clearOpenPosition (sprintf "after exit time %A" filter.LiquidationTime) quote.Quote  Stop , _ , _ > () 
You can see that we match on the trading periods, the existence of a position and the pair of previous and new signal value. The trading logic is very compact, descriptive and easy to understand: we open a position for the first time if there is a signal change, switch positions on a signal change, liquidate in the liquidation period only on a signal change and clear all remaining position after the stop date. The crucial point however is that the pattern match is exhaustive so that we know for sure that all cases are properly handled.
Backtesting and Strategy Selection with F# GPUs
Once the strategy is coded it has to be configured and validated with backtesting. I already blogged on our GPU cluster framework for bootstrapping walk forward optimization of trading strategies and machine learning, which I also presented at the NVIDIA GTC conference early this year.
Instead of going into more details about bootstrap walk forward optimization or machine learning I would like to present another technique to choose suitable strategies from a collection of candidates. Let’s assume that we somehow identified 20 to 50 signal configurations. Each signal gives rise to a trading strategy. We can backtest these strategies and deduce various characteristics and statistics such as the daily returns, Sharp ratio, draw down, mean, skewness, kurtosis and the daily return correlation between them. The idea is to select out of these 20 to 50 candidates a handful of strategies which have the least amount of correlation, the best overall sharp ratio and are as Gaussian as possible.
To keep it simple we just consider the case of selecting a fixed number of strategies which have the least amount of correlation. Unfortunately, there is no known closed form solution to this problem and the straightforward is a search algorithm. As soon as we add additional filters and constraints as mentioned above a full blown search is inevitable anyway, so we didn’t even bother about finding an alternative solution approach. Assuming that we have $n$ candidate strategies and we want to find the $k$ strategies with the least amount of correlation we would need to examine
$$ \binom{n}{k} = \frac{n!}{k! \cdot (nk)!} $$
different combinations of choosing $k$ elements from a set of $n$ elements. This becomes very soon a prohibitively large number.
$k$ 
$\binom{20}{k}$ 
$\binom{50}{k}$ 
2 
190 
1225 
3 
1140 
19600 
4 
4845 
230300 
5 
15504 
2118760 
6 
38760 
15890700 
7 
77520 
99884400 
8 
125970 
536878650 
9 
167960 
2505433700 
10 
184756 
10272278170 
First we have to implement a function to calculate $\binom{n}{k}$. For this we should rely on the product formula
$$ \binom{n}{k} = \prod_{i=1}^{l} \frac{n – k + i}{i} $$
The change between multiplication and division prevents temporary values from unnecessary growth. Of course we have to use
uint64 because the range of
int is not sufficient. The implementation is pretty straightforward. The reason why we add the attribute
[<ReflectedDefinition>] will become clear soon.

[<ReflectedDefinition>] let choose (n:int) (k:int) = if k > n then 0UL else if k = 0  k = n then 1UL else if k = 1  k = n  1 then uint64 n else let delta, iMax = if k < n  k then uint64 (n  k), k else uint64 k, n  k let mutable res = delta + 1UL for i = 2 to iMax do res < (res * (delta + uint64 i)) / (uint64 i) res 
Next we have to find an algorithm to list all the $\binom{n}{k}$ different combination. A serial algorithm is straightforward. Here is an example with $n = 5$ and $k=3$. We start with the first combination [0; 1; 2] and then increase the last element until it cannot be increased any further. Then the preceding element is increased, always keeping the elements in increasing order.
$m$ 
Combination 
0 
[0; 1; 2] 
1 
[0; 1; 3] 
2 
[0; 1; 4] 
3 
[0; 2; 3] 
4 
[0; 2; 4] 
5 
[0; 3; 4] 
6 
[1; 2; 3] 
7 
[1; 2; 4] 
8 
[1; 3; 4] 
9 
[2; 3; 4] 
For a parallel implementation we need a way to generate the $m$th combination directly. Here is an implementation which does the job.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

[<ReflectedDefinition>] let largest a b x = let mutable v = a  1 while choose v b > x do v < v  1 v let subset (n:int) (k:int) (m:uint64) = let sub = Array.zeroCreate k let mutable x = (choose n k)  1UL  m let mutable a = n let mutable b = k for i = 0 to k1 do sub.[i] < largest a b x x < x  choose sub.[i] b a < sub.[i] b < b  1 for i = 0 to k1 do sub.[i] < (n  1)  sub.[i] sub 
The function
choose did not require any memory allocation or object creation and can run as such on a GPU. This is not the case for
subset. We refactor it as follows

[<ReflectedDefinition>] let subset (getSub : int > int) (setSub : int > int > unit) (n:int) (k:int) (m:uint64) = let mutable x = (choose n k)  1UL  m let mutable a = n let mutable b = k for i = 0 to k1 do largest a b x > setSub i x < x  choose (getSub i) b a < getSub i b < b  1 for i = 0 to k1 do (n  1)  getSub i > setSub i 
The functions
getSub : int > int and
setSub : int > int > unit abstract the memory for storing the combination. The attribute
[<ReflectedDefinition>] instructs the F# compiler to generate a quotation expression so that our Alea GPU compiler can transform it to GPU code. The curious reader can find more details about our compilation process in the manual.
Having solved the combinatoric problem we can continue with the calculation of the correlation of a combination. For efficiency reasons we store the correlation matrix in a compact format, which only stores the lower triangular part as an array. We do not store the diagonal as it is always 1, nor the upper triangular part because the matrix is symmetric. Let’s take as an example a 3×3 correlation matrix
r/c 
0 
1 
2 
0 
1.0000 
0.9419 
0.5902 
1 
0.9419 
1.0000 
0.0321 
2 
0.5902 
0.0321 
1.0000 
We only have to store the 3 matrix elements
r/c 
0 
1 
2 
0 
* 
* 
* 
1 
0.9419 
* 
* 
2 
0.5902 
0.0321 
* 
which we flatten into a vector [0.9419; 0.5902; 0.0321]. Here is the F# implementation:

let lowerTriangularPacked (A:float[][]) = A > Array.mapi (fun i row > Array.sub row 0 i) > Array.concat 
The length of the packed vector for a matrix of dimension $n$ is
$$ \mathbf{size}(n) = \frac{n(n+1)}{2} – n = \frac{n(n1)}{2}, $$
because we do not store the diagonal. The offset in the packed vector of an element $(i,j)$ with $i > j$ is calculated as
$$ \mathbf{offset}(i, j) = \frac{i(i1)}{2} + j = \mathbf{size}(i) + j.$$
Given a combination $c$ as calculated by the function
subset we build the correlation submatrix again in packed storage format as follows:

[<ReflectedDefinition>] let subMatixPacked (matrix: int > float) (subMatrix: int > float > unit) (selection:int > int) selLength = let mutable l = 0 for i = 0 to selLength  1 do for j = 0 to i  1 do //printfn "(%d, %d) > (%d, %d)" i j s.[i] s.[j] let offset = offset (selection i) (selection j) matrix offset > subMatrix l l < l + 1 
As in the previous code fragment we externalize the correlation matrix access and the selection indices with functions
matrix: int > float, respectively
selection:int > int and provide a function
subMatrix: int > float > unit to set the elements of the submatrix. Now we have everything together to calculate the amount of correlation of a submatrix determined by a combination. Mathematically, we want to measure how far the submatrix is from the $k$ dimensional identity matrix
$$  C_{sub}(c) – \mathbf{Id}_k , $$
where $\cdot$ is a suitable matrix norm. We decided to choose the Frobenius norm
$$  C _{F} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} C_{ij}^2 }. $$
There are of course different choices such as the $L_1$ or $L_{\infty}$ norm. In our special case the implementation of the Frobenius distance to the identity is very simple. It is just the square root of the sum of squares of all the elements in the packed vector

[<ReflectedDefinition>] let distToIdentity n (A: int > float) = let mutable dist = 0.0 for i = 0 to n1 do dist < dist + square(A i) sqrt dist 
We would like to stress that all of the critical numerical code runs on CPU and GPU, so that we have 100% code reuse. Before we present the final GPU implementation of the least correlated subset problem, let’s see how we can use our numerical building blocks to implement the algorithm for CPU execution. All what we have to do is to manage array creation and call the numerical building blocks with a lambda function to access the data. Here is the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

let distToIdentity (A:float[]) = distToIdentity A.Length (fun i > A.[i]) let subMatixPacked (A:float[]) (selection:int[]) = let subMatrix = Array.zeroCreate (size selection.Length) subMatixPacked (fun i > A.[i]) (fun i v > subMatrix.[i] < v) (fun i > selection.[i]) selection.Length subMatrix let allDistToIdentity (C : float[]) n k = let cnk = Binomial.choose n k Array.init (int cnk) (fun i > let selection = Binomial.Cpu.subset n k (uint64 i) i, subMatixPacked C selection > distToIdentity ) let leastCorrelated (C : float[]) n k = let best = allDistToIdentity C n k > Array.minBy snd Binomial.Cpu.subset n k (fst best > uint64) 
Note that the function
allDistToIdentity only works for small values of $n$ and $k$ where $\binom{n}{k}$ is in the range of
int. The function
leastCorrelated returns the best combination.
Lets move on to the final GPU implementation. Up to now no particular GPU knowledge was required appart from adding the attribute []
. Now we need to know some basic CUDA terms. For those of you who are GPU freshmen I recommend that you briefly look at this example, where you can learn the basic GPU coding principles. To be able to follow the example you should know that the GPU has a memory space that is different from the CPU and that the GPU runs functions, called kernels, in parallel in many threads. These threads are grouped into block of threads. Each thread in a block is identified with its
threadIdx and each block with its
blockIdx. The size of the block is given by
blockDim, whereas the number of blocks can be read from
gridDim.
We start with a very simple implementation of the calculation of the distance to the identity for many combinations in parallel. We write a so called GPU kernel, that is a function which will be executed on the GPU in parallel in many different instances.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

[<ReflectedDefinition>] let distToIdentityKernelDevice (n:int) (k:int) (cnk:int) (ns:int) (C:deviceptr<float>) (dist:deviceptr<float>) (selection:deviceptr<int>) (subMatrix:deviceptr<float>) = let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let selection = selection + start * k let subMatrix = subMatrix + start * ns let mutable m = start while m < cnk do subset (fun i > selection.[i]) (fun i v > selection.[i] < v) n k (uint64 m) subMatixPacked (fun i > C.[i]) (fun i v > subMatrix.[i] < v) (fun i > selection.[i]) k dist.[m] < distToIdentity ns (fun i > subMatrix.[i]) m < m + stride 
Note that we use the type
deviceptr<float> and
deviceptr<int> to access data on the GPU. You immediately see that we call our functions
subset and
subMatrixPacked with lambda functions such as
fun i > selection.[i] or
fun i v > subMatrix.[i] < v. In this implementation the temporary storage for
selection and
subMatrix is in the so called global device memory, which is the largest memory pool on a GPU, but also the slowest in terms of access latency. How do we now call such a kernel function? For this we have to allocate storage on the GPU, copy data to the GPU, define launch parameters and call the kernel function. Here is the code which does all these steps:

let allDistToIdentityDevice (C : float[]) n k = let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT let cnk = choose n k > int let ns = size k let blockSize = 128 let gridSize = 8 * numSm use dC = worker.Malloc(C) use dDist = worker.Malloc(cnk) use dSelection = worker.Malloc(gridSize * blockSize * k) use dSubMatrix = worker.Malloc(gridSize * blockSize * ns) let lp = new LaunchParam(gridSize, blockSize) worker.Launch <@ distToIdentityKernelDevice @> lp n k cnk ns dC.Ptr dDist.Ptr dSelection.Ptr dSubMatrix.Ptr dDist.Gather() 
You see from the
use binding that all our allocated memory on the GPU is disposable. The kernel function
distToIdentityKernelDevice is passed as an expression to the kernel launch
worker.Launch <@ distToIdentityKernelDevice @>. Here
worker is essentially a GPU. For simplicity we choose 128 threads per block and launch 8 times more blocks than streaming multiprocessors on the GPU. The last step
dDist.Gather() copies the calculated distances back to the CPU memory.
Let’s do a first optimization. As already mentioned global device memory has a high access latency. Registers and thread local memory, as long as it does not spill over to device memory, are faster. Yet another memory type is shared memory. As suggested by its name it can be accessed from multiple threads, i.e. shared between all the threads of a block. This is very useful for the original correlation matrix $C$. Here is the implementation using shared memory for
C as well as for the temporaries
selection and
subMatrix:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

[<ReflectedDefinition>] let distToIdentityKernelShared (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int) (C:deviceptr<float>) (dist:deviceptr<float>) = let sizeC = size n let sizeSelection = blockDim.x * k let shared = __shared__.ExternArray<byte>() > __array_to_ptr let sharedC = shared.Reinterpret<float>() let selection = (sharedC + sizeC).Reinterpret<int>() let subMatrix = (selection + sizeSelection).Reinterpret<float>() let mutable i = threadIdx.x while i < sizeC do sharedC.[i] < C.[i] i < i + blockDim.x __syncthreads() let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let selection = selection + threadIdx.x * k let subMatrix = subMatrix + threadIdx.x * ns let mutable m = start while m < partitionSize do subset (fun i > selection.[i]) (fun i v > selection.[i] < v) n k (m0 + uint64 m) subMatixPacked (fun i > sharedC.[i]) (fun i v > subMatrix.[i] < v) (fun i > selection.[i]) k dist.[m] < distToIdentity ns (fun i > subMatrix.[i]) m < m + stride 
Variable amount of shared memory is allocated as part of the kernel launch. We show how to do that later. In this new kernel we first get the externally allocated shared memory
shared = <strong>shared</strong>.ExternArray(), calculate the offsets for the temporaries and then copy the matrix
C to
sharedC in shared memory in parallel and then call
__syncthreads() to wait for its completion. The rest of the kernel is identical. This performs much better. Note also the small change that we added an additional argument
m0 for the offset where to start the calculations.
So far we just calculated the distance to the identity but did not yet do the minimization. Determining the smallest element in a vector is a classical reduction problem. Alea GPU provides a very efficient and generic reduction implementation. Because we have to find the index of the smallest element we have to create a special type.

[<Struct>] type IndexAndValue = val Index : uint64 val Value : float [<ReflectedDefinition>] new(i, v) = { Index = i; Value = v } override this.ToString() = sprintf "(%d, %f)" this.Index this.Value [<ReflectedDefinition>] static member Min (a:IndexAndValue) (b:IndexAndValue) = if a.Value < b.Value then a else b 
Because instances of this type have to live on the GPU it must be a value type, as enforced by the attribute
[<Struct>]. Now we can easily find the index of the minimal value in an array with the parallel reduction from Alea GPU Unbound library. First create an instance of
DeviceReduceModule:

let minReductionModule = new DeviceReduceModule<IndexAndValue>(GPUModuleTarget.Worker(worker), <@ IndexAndValue.Min @>) 
Then we create a reducer for arrays of length up to
len and launch the reduction

use minimum = minReductionModule.Create(len) use idxAndValues = worker.Malloc<IndexAndValue>(len) minimum.Reduce(idxAndValues.Ptr, len) 
Currently, because of some library limitations you have to set the context type to “threaded” as follows:

Alea.CUDA.Settings.Instance.Worker.DefaultContextType < "threaded" 
Let us now combine this with the kernel to calculate the distance to the identity.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

let leastCorrelatedWithTransform (partitionSize:int) (C : float[]) n k = let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK let ns = size k let cnk = choose n k let blockSize = 128 let gridSize = 8 * numSm let sharedSize = __sizeof<float>() * size n + blockSize * __sizeof<int>() * k + blockSize * __sizeof<float>() * ns if sharedSize > maxSharedMem then failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize use dC = worker.Malloc(C) use dDist = worker.Malloc(partitionSize) use idxAndValues = worker.Malloc<IndexAndValue>(partitionSize) use minimum = minReductionModule.Create(partitionSize) let lpd = new LaunchParam(gridSize, blockSize, sharedSize) let lpr = LaunchParam(gridSize, blockSize) let findBest m = worker.Launch <@ distToIdentityKernelShared @> lpd m partitionSize n k ns dC.Ptr dDist.Ptr worker.Launch <@ transform @> lpr dDist.Ptr idxAndValues.Ptr m partitionSize minimum.Reduce(idxAndValues.Ptr, partitionSize) let numPartitions = divup cnk (uint64 partitionSize) let bestInPartitions = [0UL..numPartitions  1UL] > List.map (fun p > let m = p * (uint64 partitionSize) in findBest m) let best = bestInPartitions > List.minBy (fun v > v.Value) let bestSelection = Binomial.Cpu.subset n k best.Index bestSelection 
This implementation of
leastCorrelatedWithTransform first calculates the amount of shared memory
sharedSize that is required. It then partitions the $\binom{n}{k}$ different combinations into blocks of size
partitionSize and loops over all the partitions

let bestInPartitions = [0UL..numPartitions  1UL] > List.map (...) 
The final reduction is done on the CPU. To call the reduction kernel we have to transform the vector of distances calculated by
distToIdentityKernelShared to a vector of
IndexAndValue.

[<ReflectedDefinition>] let transform (inputs:deviceptr<float>) (outputs:deviceptr<IndexAndValue>) (m0:uint64) (n:int) = let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let mutable i = start while i < n do outputs.[i] < IndexAndValue(m0 + uint64 i, inputs.[i]) i < i + stride 
We can avoid the transform and directly generate a vector of
IndexAndValue. Here is the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

[<ReflectedDefinition>] let distToIdentityKernelSharedIndexAndValue (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int) (C:deviceptr<float>) (dist:deviceptr<IndexAndValue>) = let sizeC = size n let sizeSelection = blockDim.x * k let shared = __shared__.ExternArray<byte>() > __array_to_ptr let sharedC = shared.Reinterpret<float>() let selection = (sharedC + sizeC).Reinterpret<int>() let subMatrix = (selection + sizeSelection).Reinterpret<float>() let mutable i = threadIdx.x while i < sizeC do sharedC.[i] < C.[i] i < i + blockDim.x __syncthreads() let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let selection = selection + threadIdx.x * k let subMatrix = subMatrix + threadIdx.x * ns let mutable m = start while m < partitionSize do subset (fun i > selection.[i]) (fun i v > selection.[i] < v) n k (m0 + uint64 m) subMatixPacked (fun i > sharedC.[i]) (fun i v > subMatrix.[i] < v) (fun i > selection.[i]) k dist.[m] < IndexAndValue(m0 + uint64 m, distToIdentity ns (fun i > subMatrix.[i])) m < m + stride let leastCorrelated (partitionSize:int) (C : float[]) n k = let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK let ns = size k let cnk = choose n k let blockSize = 128 let gridSize = 8 * numSm let sharedSize = __sizeof<float>() * size n + blockSize * __sizeof<int>() * k + blockSize * __sizeof<float>() * ns if sharedSize > maxSharedMem then failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize use dC = worker.Malloc(C) use dDist = worker.Malloc<IndexAndValue>(partitionSize) use minimum = minReductionModule.Create(partitionSize) let lpd = new LaunchParam(gridSize, blockSize, sharedSize) let findBest m = worker.Launch <@ distToIdentityKernelSharedIndexAndValue @> lpd m partitionSize n k ns dC.Ptr dDist.Ptr minimum.Reduce(dDist.Ptr, partitionSize) let numPartitions = divup cnk (uint64 partitionSize) let bestInPartitions = [0UL..numPartitions  1UL] > List.map (fun p > let m = p * (uint64 partitionSize) in findBest m) let best = bestInPartitions > List.minBy (fun v > v.Value) let bestSelection = Binomial.Cpu.subset n k best.Index bestSelection 
Of course the GPU implementation is more difficult than the CPU version. But it can handle much larger problems faster. The important point is that it shares all the core functions with the CPU implementation. Let us have a look at the performance of our implementation.
n 
k 
CPU 
GPU 
Speedup 
20 
5 
33.65 ms 
2.70 ms 
12.40 
20 
6 
69.40 ms 
6.41 ms 
10.81 
20 
7 
169.02 ms 
22.57 ms 
7.49 
20 
8 
334.72 ms 
40.40 ms 
8.28 
20 
9 
513.01 ms 
57.76 ms 
8.88 
50 
3 
25.52 ms 
2.93 ms 
8.70 
50 
4 
369.69 ms 
57.76 ms 
12.32 
50 
5 
5455.00 ms 
329.20 ms 
16.57 
Optimizing with Constant and Local Memory
This is not the end of the optimization, there is still room for improvement. First, because shared memory is always very scarce we should try to save it. Because the algorithm is only reading from C
we can as well place it in constant memory, which has a read latency comparable to shared memory. For this we have to use the cuda
computational workflow to define the constant memory resource of a GPU module. Note that the constant memory size must be a GPU compile time constant. This is not a big problem because we can always JITcompile a new configuration. The next optimization is to store some temporary data, such as the selection indices, into thread local memory. Note that the size of the local memory has to be also a GPU compile time constant. Here is the code of the version that uses constant memory and local memory:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

let leastCorrelatedModuleUsingConstAndLocalMem maxDim maxK = cuda { let sizeC = size maxDim let! constC = Compiler.DefineConstantArray<float>(sizeC) let! kernel = <@ fun (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int) (dist:deviceptr<IndexAndValue>) > let subMatrix = __shared__.ExternArray<float>() > __array_to_ptr let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let selection = __local__.Array<int>(maxK) let subMatrix = subMatrix + threadIdx.x * ns let mutable m = start while m < partitionSize do subset (fun i > selection.[i]) (fun i v > selection.[i] < v) n k (m0 + uint64 m) subMatixPacked (fun i > constC.[i]) (fun i v > subMatrix.[i] < v) (fun i > selection.[i]) k dist.[m] < IndexAndValue(m0 + uint64 m, distToIdentity ns (fun i > subMatrix.[i])) m < m + stride @> > Compiler.DefineKernel return Entry(fun (program:Program) > let worker = program.Worker let constC = program.Apply(constC) let kernel = program.Apply(kernel) let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK let run (partitionSize:int) (C : float[]) n k = if size n > sizeC then failwithf "dimension %d of C is too large only support dimension up to %d" n maxDim let ns = size k let cnk = choose n k let blockSize = 128 let gridSize = 8 * numSm let sharedSize = blockSize * __sizeof<float>() * ns printfn "block size %d, shared memory size %d (%d), const memory size %d" blockSize sharedSize maxSharedMem sizeC if sharedSize > maxSharedMem then failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize constC.Scatter C use dDist = worker.Malloc<IndexAndValue>(partitionSize) use minimum = minReductionModule.Create(partitionSize) let lpd = new LaunchParam(gridSize, blockSize, sharedSize) let findBest m = kernel.Launch lpd m partitionSize n k ns dDist.Ptr minimum.Reduce(dDist.Ptr, partitionSize) let numPartitions = divup cnk (uint64 partitionSize) let bestInPartitions = [0UL..numPartitions  1UL] > List.map (fun p > let m = p * (uint64 partitionSize) in findBest m) let best = bestInPartitions > List.minBy (fun v > v.Value) let bestSelection = Binomial.Cpu.subset n k best.Index bestSelection run ) } 
You see that the kernel also becomes slightly shorter. The performance of the smaller problems is pretty much the same, but for larger problems we see an improvement.
n 
k 
CPU 
GPU 
Speedup 
50 
3 
25.52 ms 
2.19 ms 
11.65 
50 
4 
369.69 ms 
10.31 ms 
35.85 
50 
5 
5455.00 ms 
170.89 ms 
31.92 
There is still more optimization potential which can increase performance by another factor. Namely, each single distance calculation can be done with a parallel reduction too. We leave this for a future improvement.
Debugging and Profiling
To enable profiling of JIT compiled code you have to set the JIT compile level

Alea.CUDA.Settings.Instance.JITCompile.Level < "Diagnostic" 
Then all that is required is to build the code in debug mode and place a breakpoint in the GPU code. Start the NVIDIA debugger from the Visual Studio menu NSIGHT and choose Start CUDA Debugging. Data across different warps are best analyzed in the CUDA warp watch window. Here is an example
Profiling is crucial to identify further optimization potential. To facilitate the profiling of your GPU code Alea GPU is well integrated with the NVIDIA profiler. For details we refer to the profiling chapter in our tutorial.
Conclusion
We have shown how we use F# and GPUs for our own internal algorithmic trading. F# has a lot of extremely helpful language features, such as active patterns and exhaustive pattern matching, which allow us to program strategies faster, more expressive and more robust.
The usability of GPUs is illustrated with the practically relevant example of finding the subset of least correlated strategies. We went through some optimization and refactoring to illustrate how we develop GPU kernels, starting with simple implementations and adding more and more optimizations. The key point is that with the right level of abstraction and by using lambda functions we can share critical code between CPU and GPU in an elegant manner. This improves development time and reduces maintainability costs.