A Performance Comparison of Alea GPU C# and CUDA C++
In the last post we gave a sneak preview of the upcoming Alea GPU version 3. Alea GPU version 3 sets completely new standards for GPU development on .NET. The highlights of the new version are:
 Higher level abstractions such as GPU LINQ, a GPU parallelfor and parallel aggregation.

Automatic memory management and data transfer, which makes GPU programming easier and reduces a lot of boiler plate code.

Integration with the .NET type system so that .NET arrays can be used directly in GPU kernels.

CUDA programming model integrated with C# and F#.
In the last post we discussed the new API and its usability. This time we benchmark Alea GPU against multithreaded .NET and CUDA C++. For the performance study we use an application from quantitative finance that calculates the price of an Asian option with Monte Carlo simulation based on the celebrated BlackScholes model. Nowadays Monte Carlo simulations are a common practice to calculate the price of complex financial products and risk figures. GPUs significantly reduce the calculation time.
We look at different implementation versions using different features of Alea GPU.
 Implicit memory management: we use the Alea GPU parallelfor and parallel aggregate together with implicit memory management, i.e. we let Alea GPU manage the memory allocation and data transfer between the CPU and the GPU.

Explicit memory management: we use the Alea GPU parallelfor and parallel aggregate but we explicitly copy the data to the GPU.

Transformreduce: we combine the parallelfor and parallel aggregate into a single parallel transformreduce and use implicit memory management.

LINQ workflow: we express the calculation using Alea GPU LINQ, which allows us to do kernel fusing and in a future version also more aggressive memory transfer optimizations.

Constant memory: we use constant memory to store data that is readonly on the GPU.
The CPU implementation is fully multithreaded and also uses the fast CPU random number generators that come with the NVIDIA cuRand library. We run the computations on an Intel i74770 CPU at 3.5 GHz and on different GPU configurations. Let us look at the timings in milliseconds:
As expected the Kepler K40 is the fastest GPUs thanks to its high number of 2880 CUDA cores and very good double precision support. The version using implicit memory management is pretty close to the explicit one. Using constant memory also slightly improves performance, because it allows to cache data reads. Now we compare the speedup relative to a multithreaded C# implementation:
Already with a rather weak GeForce GT 750M mobile GPU we gain a performance increase of roughly a factor of 4. The highend K40 GPU boosts performance by a factor 100. The version using explicit memory management scales best to multiple GPUs as shown by running it on two GeForce GTX Titan Black GPUs as well as on an AWS instance with four GRID K520 GPUs. In the current version of Alea GPU the implicit memory management uses a global tracker, which is not yet optimized for multiple GPUs. This explains the why the implicit memory management version scales poorly to multiple GPUs.
Now let us look at the performance of Alea GPU compared to a CUDA C++ implementation. Again we run the computations on GPUs of different hardware generations. Here are the timings in milliseconds of the Asian option pricing in C# using explicit management and a CUDA C++ implementation:
Investigating the runtime behavior in more details with the Visual Profiler reveals a few interesting details: The CUDA C++ Asian option pricing kernel uses 37 to 40 registers, depending on the CUDA compute architecture, where the .NET version compiled with Alea GPU only uses 30 registers. This has some consequences on the number of blocks inflight and the overall occupancy. The .NET version can use 52 thread blocks and achieves 100% occupancy. The CUDA C++ version can only launch 39 thread blocks and achieves 75% occupancy. However, the kernel execution time is almost the same. Regarding the overall execution time, the .NET version is slightly slower because of in .NET we have a small overhead to launch a kernel.
For those who are interested in further details we sketch the C# as well as the CUDA C++ implementation now.
GPU and CPU Implementation with Alea GPU
We like share the core algorithmic code between the CPU and GPU implementation. We achieve this by encapsulating the main calculation in a delegate:

public static double AsianCall(int i, double spot0, double strike, double[] dt, double[] rates, double[] volas, double[] gaussian) { var sum = 0.0; var spot = spot0; for (var k = 0; k < dt.Length; k++) { var sigma = volas[k]; var drift = dt[k] * (rates[k]  sigma * sigma / 2); spot = spot * DeviceFunction.Exp(drift + DeviceFunction.Sqrt(dt[k]) * sigma * gaussian[k * dt.Length + i]); sum += spot; } return DeviceFunction.Max(sum / dt.Length  strike, 0.0); } 
This delegate calculates the payoff along path i
. The price computation simulates multiple sample batches so that we can scaleup the number of simulations. For each batch we generate numSamplesPerBatch
normally distributed random numbers with the NVIDIA cuBLAS library. Then we use the Alea GPU parallelfor to generate the paths and the prices using the AsianCall
delegate. Then we apply the Alea GPU average reduction to calculate the batch mean. The final price is the average of all batch prices.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

[GpuManaged] public static double PriceGpu(int numBatches, int numSamplesPerBatch, double spot0, double strike, double[] dt, double[] rates, double[] volas) { using (var rng = Generator.CreateGpu(gpu, RngType.PSEUDO_XORWOW)) { var nt = dt.Length; var gaussian = gpu.Allocate<double>(numSamplesPerBatch*nt); var prices = gpu.Allocate<double>(numSamplesPerBatch); var batchPrices = new double[numBatches]; for (var batch = 0; batch < numBatches; batch++) { rng.SetGeneratorOffset((ulong)batch * (ulong)(numSamplesPerBatch * nt)); rng.GenerateNormal(gaussian, 0, 1); gpu.For(0, numSamplesPerBatch, i => prices[i] = AsianCall(i, spot0, strike, dt, rates, volas, gaussian)); batchPrices[batch] = gpu.Average(prices); } Gpu.Free(gaussian); Gpu.Free(prices); return batchPrices.Average(); } } 
Note that PriceGpu
has the attribute [GpuManaged]
because we let Alea GPU to do the memory transfer for us. We only create storage for the random numbers and the prices on the device. All the input data, such as dt
, rates
and volas
are captured in a closure and moved automatically to the GPU. The explicit deallocation of the GPU memory with Gpu.Free
is optional but good practice: GPU memory is pagelocked, so to avoid unnecessary GPU memory limitations, it is good to release unused GPU memory immediately and not to wait for the next garbage collection run.
We can very easily program a CPU version:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

public static double PriceCpu(int numBatches, int numSamplesPerBatch, double spot0, double strike, double[] dt, double[] rates, double[] volas) { using (var rng = Generator.CreateCpu(rngType)) { var nt = dt.Length; var gaussian = new double[numSamplesPerBatch * nt]; var prices = new double[numSamplesPerBatch]; var batchPrices = new double[numBatches]; for (var batch = 0; batch < numBatches; batch++) { rng.SetGeneratorOffset((ulong)batch * (ulong)(numSamplesPerBatch * nt)); rng.GenerateNormal(gaussian, 0, 1); Parallel.For(0, numSamplesPerBatch, i => prices[i] = AsianCall(i, spot0, strike, dt, rates, volas, gaussian)); batchPrices[batch] = prices.Average(); } return batchPrices.Average(); } } 
This time we use the .NET Parallel.For
and reuse the core path generation and payoff logic from AsianCall
. The other versions also share the same core algorithm AsianCall
. We skip further code listings.
Implementation with CUDA C++
The CUDA C++ version looks very similar, execpt that there is no GPU parallel for, so we program the kernel ourselves. Because in CUDA C++ there are no array types as in .NET we have to pass the array length as additional arguments to the kernel.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

__global__ void asianCall(int numSamplesPerBatch, double spot0, double strike, int nt, double* dt, double* rates, double* volas, double* gaussian, double* prices) { unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x; unsigned int step = gridDim.x * blockDim.x; for (unsigned int i = tid ; i < numSamplesPerBatch ; i += step) { double sum = 0.0; double spot = spot0; for (int k = 0; k < nt; k++) { double sigma = volas[k]; double drift = dt[k] * (rates[k]  sigma * sigma / 2); spot = spot * exp(drift + sqrt(dt[k]) * sigma * gaussian[k * nt + i]); sum += spot; } prices[i] = max(sum / (double)nt  strike, 0.0); } } 
The orchestration of the calculations are now a bit more involved because there is no automatic memory management and transfer. Also because cudaMemcpy
takes the number of bytes to copy, we have to calculate the size ourselves. Typecasts further complicate the code. The average price per batch is calculated with the reduction function of CUDA Thrust.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51

double priceGpu(int gridSize, int blockSize, int numBatches, int numSamplesPerBatch, double spot0, double strike, int nt, double* dtHost, double* ratesHost, double* volasHost) { curandGenerator_t gen; double *gaussian, *prices, *batchPrices, *dt, *rates, *volas; batchPrices = (double *)calloc(numBatches, sizeof(double)); clock_t t; t = clock(); CUDA_CALL(cudaMalloc((void **)&gaussian, numSamplesPerBatch*nt*sizeof(double))); CUDA_CALL(cudaMalloc((void **)&prices, numSamplesPerBatch*sizeof(double))); CUDA_CALL(cudaMalloc((void **)&dt, nt*sizeof(double))); CUDA_CALL(cudaMalloc((void **)&rates, nt*sizeof(double))); CUDA_CALL(cudaMalloc((void **)&volas, nt*sizeof(double))); CUDA_CALL(cudaMemcpy(dt, dtHost, nt*sizeof(double), cudaMemcpyHostToDevice)); CUDA_CALL(cudaMemcpy(rates, ratesHost, nt*sizeof(double), cudaMemcpyHostToDevice)); CUDA_CALL(cudaMemcpy(volas, volasHost, nt*sizeof(double), cudaMemcpyHostToDevice)); CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW)); CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen, 1234ULL)); thrust::device_ptr<double> pricesPtr(prices); for (int batch = 0; batch < numBatches; batch++) { CURAND_CALL(curandSetGeneratorOffset(gen, (unsigned long long)batch * (unsigned long long)(numSamplesPerBatch * nt))); CURAND_CALL(curandGenerateNormalDouble(gen, gaussian, numSamplesPerBatch*nt, 0.0, 1.0)); asianCall<<<gridSize, blockSize>>>(numSamplesPerBatch, spot0, strike, nt, dt, rates, volas, gaussian, prices); batchPrices[batch] = thrust::reduce(pricesPtr, pricesPtr + numSamplesPerBatch, 0.0) / (double)numSamplesPerBatch; } double batchSum = 0.0; for (int batch = 0; batch < numBatches; batch++) { batchSum += batchPrices[batch]; } double price = batchSum / (double)numBatches; cudaFree(prices); cudaFree(gaussian); cudaFree(dt); cudaFree(rates); cudaFree(volas); free(batchPrices); return price; } 
Conclusion
We have seen that Alea GPU is very competitive compared to CUDA C++. Using our higher level abstractions such as parallelfor and parallel aggregate, C# programmers can implement performance critical algorithms targetng GPUs without explicit knowlege of CUDA C++ but with the same performance benefits as native CUDA C++.
Your Feedback
We are interested to hear your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.
If you would like to already play around with Alea GPU V3 come and join us April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.