Another Reason to Use a GPU!

I recently gave a talk at an F# meetup hosted by Jet.com about deficiencies of .NET CLR JIT compilers.

We know that often C# or F# does not perform at the level of native C++ because the CLR JIT compiler is not optimizing the code well enough. In worst cases we loose a factor of 2 to 4 against native code. To investigate this problem in more depth you can check how .NET CLR JIT compilers compile simple loops and nested loops. It is not enough to just look at MSIL code. We have to dig deep into the optimized assembly code, generated by the CLR JIT compilers. We find that the CLR JIT compilers are not capable to remove array bound checks or optimize array access patterns of the form a[i*lda + j] in simple nested loops. This is very bad news for performance critical code in .NET.

Fortunately, you can get around these problems by moving performance critical code to the GPU. The Floyd-Warshall all shortest path algorithm serves as an example: an advanced GPU implementation fully written in C# and compiled with Alea GPU gives a significant speedup. It runs at the same speed as a native GPU version coded in C++ and 125 times faster than the initial C# version!

Developing such efficient algorithms is not straightforward at all and requires some experience. We therefore take a step back and show that simpler problems can often be solved efficiently with parallel-for and parallel aggregate patterns running on the GPU with a dramatic performance increase of a factor of 50 to 100.

Here are the slides.

GPU Computing on .NET at Speed of CUDA C++

A Performance Comparison of Alea GPU C# and CUDA C++

In the last post we gave a sneak preview of the upcoming Alea GPU version 3. Alea GPU version 3 sets completely new standards for GPU development on .NET. The highlights of the new version are:

1. Higher level abstractions such as GPU LINQ, a GPU parallel-for and parallel aggregation.

2. Automatic memory management and data transfer, which makes GPU programming easier and reduces a lot of boiler plate code.

3. Integration with the .NET type system so that .NET arrays can be used directly in GPU kernels.

4. CUDA programming model integrated with C# and F#.

In the last post we discussed the new API and its usability. This time we benchmark Alea GPU against multi-threaded .NET and CUDA C++. For the performance study we use an application from quantitative finance that calculates the price of an Asian option with Monte Carlo simulation based on the celebrated Black-Scholes model. Nowadays Monte Carlo simulations are a common practice to calculate the price of complex financial products and risk figures. GPUs significantly reduce the calculation time.

We look at different implementation versions using different features of Alea GPU.

1. Implicit memory management: we use the Alea GPU parallel-for and parallel aggregate together with implicit memory management, i.e. we let Alea GPU manage the memory allocation and data transfer between the CPU and the GPU.

2. Explicit memory management: we use the Alea GPU parallel-for and parallel aggregate but we explicitly copy the data to the GPU.

3. Transform-reduce: we combine the parallel-for and parallel aggregate into a single parallel transform-reduce and use implicit memory management.

4. LINQ work-flow: we express the calculation using Alea GPU LINQ, which allows us to do kernel fusing and in a future version also more aggressive memory transfer optimizations.

5. Constant memory: we use constant memory to store data that is read-only on the GPU.

The CPU implementation is fully multi-threaded and also uses the fast CPU random number generators that come with the NVIDIA cuRand library. We run the computations on an Intel i7-4770 CPU at 3.5 GHz and on different GPU configurations. Let us look at the timings in milliseconds:

As expected the Kepler K40 is the fastest GPUs thanks to its high number of 2880 CUDA cores and very good double precision support. The version using implicit memory management is pretty close to the explicit one. Using constant memory also slightly improves performance, because it allows to cache data reads. Now we compare the speed-up relative to a multi-threaded C# implementation:

Already with a rather weak GeForce GT 750M mobile GPU we gain a performance increase of roughly a factor of 4. The high-end K40 GPU boosts performance by a factor 100. The version using explicit memory management scales best to multiple GPUs as shown by running it on two GeForce GTX Titan Black GPUs as well as on an AWS instance with four GRID K520 GPUs. In the current version of Alea GPU the implicit memory management uses a global tracker, which is not yet optimized for multiple GPUs. This explains the why the implicit memory management version scales poorly to multiple GPUs.

Now let us look at the performance of Alea GPU compared to a CUDA C++ implementation. Again we run the computations on GPUs of different hardware generations. Here are the timings in milliseconds of the Asian option pricing in C# using explicit management and a CUDA C++ implementation:

Investigating the runtime behavior in more details with the Visual Profiler reveals a few interesting details: The CUDA C++ Asian option pricing kernel uses 37 to 40 registers, depending on the CUDA compute architecture, where the .NET version compiled with Alea GPU only uses 30 registers. This has some consequences on the number of blocks in-flight and the overall occupancy. The .NET version can use 52 thread blocks and achieves 100% occupancy. The CUDA C++ version can only launch 39 thread blocks and achieves 75% occupancy. However, the kernel execution time is almost the same. Regarding the overall execution time, the .NET version is slightly slower because of in .NET we have a small overhead to launch a kernel.

For those who are interested in further details we sketch the C# as well as the CUDA C++ implementation now.

GPU and CPU Implementation with Alea GPU

We like share the core algorithmic code between the CPU and GPU implementation. We achieve this by encapsulating the main calculation in a delegate:

This delegate calculates the payoff along path i. The price computation simulates multiple sample batches so that we can scale-up the number of simulations. For each batch we generate numSamplesPerBatch normally distributed random numbers with the NVIDIA cuBLAS library. Then we use the Alea GPU parallel-for to generate the paths and the prices using the AsianCall delegate. Then we apply the Alea GPU average reduction to calculate the batch mean. The final price is the average of all batch prices.

Note that PriceGpu has the attribute [GpuManaged] because we let Alea GPU to do the memory transfer for us. We only create storage for the random numbers and the prices on the device. All the input data, such as dt, rates and volas are captured in a closure and moved automatically to the GPU. The explicit deallocation of the GPU memory with Gpu.Free is optional but good practice: GPU memory is pagelocked, so to avoid unnecessary GPU memory limitations, it is good to release unused GPU memory immediately and not to wait for the next garbage collection run.

We can very easily program a CPU version:

This time we use the .NET Parallel.For and reuse the core path generation and payoff logic from AsianCall. The other versions also share the same core algorithm AsianCall. We skip further code listings.

Implementation with CUDA C++

The CUDA C++ version looks very similar, execpt that there is no GPU parallel for, so we program the kernel ourselves. Because in CUDA C++ there are no array types as in .NET we have to pass the array length as additional arguments to the kernel.

The orchestration of the calculations are now a bit more involved because there is no automatic memory management and transfer. Also because cudaMemcpy takes the number of bytes to copy, we have to calculate the size ourselves. Typecasts further complicate the code. The average price per batch is calculated with the reduction function of CUDA Thrust.

Conclusion

We have seen that Alea GPU is very competitive compared to CUDA C++. Using our higher level abstractions such as parallel-for and parallel aggregate, C# programmers can implement performance critical algorithms targetng GPUs without explicit knowlege of CUDA C++ but with the same performance benefits as native CUDA C++.

We are interested to hear your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.

If you would like to already play around with Alea GPU V3 come and join us April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.

Play with Particles III – Visualize using OpenGL

In the last two posts we implemented three different solvers for the n-body problem, one on the CPU and two different GPU versions. In this post we visualize the n-body simulation graphically with OpenGL to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

A Simple OpenGL Example with OpenTK

There exist essentially two different 3d visualization technologies: Microsoft’s DirectX and OpenGL. DirectX is targeting the Windows platform. Examples showing the interoperability of Alea GPU and DirectX can be found in the sample gallery of Alea GPU. The benefit of OpenGL is platform independence. A good starting point is the OpenGL tutorial.

In this blog article we use OpenGL through OpenTK which wraps OpenGL for .NET. You might find the OpenTK documentation, the OpenTK tutorial, how to set up OpenTK, and this OpenTK example useful resources.

We introduce OpenTK with a simplistic example, which renders a triangle. Add the OpenTK NuGet package to your project and reference System.Drawing. Then open the following namespaces:

Create a new class OpenTKExample which inherits from the OpenTK class GameWindow:

Then overwrite the following methods:

On load set the Background to DarkBlue:

On resize of the window use the whole area to paint and set the projection:

On render the frame is where all the drawing happens. Clear the buffer and draw a triangle and three points with different colors. Instead of a triangle we could also draw many different other figures such as points, lines, etc. End the triangle mode and swap the buffers.

Add a function creating an instance of our OpenTKExample class and running it:

The result is a colored triangle on a dark blue background:

We will use the same structure in order to display our particles.

Displaying Particles Directly on the GPU

The main difference between the simple example above and the n-body simulation is that in case of the n-body simulation the data already resides in GPU memory and we do not want copy it from GPU to CPU and back to an OpenGL buffer to finally display the particles. This needs some infrastructure code. First we show how to create two buffers accessible from OpenGL and Alea GPU, in which we save our positions:

1. Generate an array consisting of GLuint.
2. Create a buffer using GL.GenBuffers.
3. For every element of the array:
1. bind the buffer;
2. allocate the memory;
3. get the buffer-size;
4. unbind the buffer;
5. register the buffer with cuGLRegisterBufferObject available from Alea GPU.

We now have two buffers which can be accessed from OpenTK and Alea GPU. We also need some resource pointers corresponding to the buffers. We obtain them by calling the Alea GPU function cuGraphicsGLRegisterBuffer inside a cuSafeCall:

If we work with the buffers outside of OpenTK we need to lock their positions. We therefore write a function lockPos which locks the positions, calls a function f on the positions and unlocks the positions again:

To share the buffers we require an Alea.Worker on the same context as used by OpenTK. The following function creates a CUDA context on the machine’s default device:

We use this function to create an Alea.Worker on the same context using the same CUDA device.

We can now initialize the positions using the lockPos function and our newly generated worker:

Recall that we read from oldPos and write to newPos in the GPU implementation. We need to swap the buffers before each integration step using the function swapPos:

In the OnRenderFrame method we swap the buffers and perform an integration step:

We bind the buffer and draw the positions:

Implementation Details

We point out some implementation details. To use the different GPU implementations and test them for different block sizes we introduce a queue of ISimulator objects. During simulation we walk through the queue with an “S” key down event.

We create the simulators and put them into the queue. Note that we also return a dispose function to clean up the simulators at the end:

Here is the logic to switch between simulators:

We use Matrix4.LookAt to inform OpenTK that our viewing position is (0,0,50) and that the viewing direction is along the z axis:

These additional comments should be helpful to understand how the positions are displayed using OpenTK and how the data is directly read from GPU memory. The previous blog posts explain the physics, the CPU implementation, the two GPU implementations and their differences. All that remains is to run the example.

Play with Particles II – GPU Implementation

In our last post we considered the CPU implementation of n-body simulations. In this post we will focus on the parallelization strategy and two GPU implementations using a dynamic and a static thread block size.

Parallelization Strategy

Recall the CPU implementation of the n-body simulation. The algorithm has essentially two parts:

1. Calculating and summing up the accelerations for each pair of particles.
2. Integrating velocity and position given the calculated accelerations.

The expensive part is the calculation of the acceleration which is $O\left(n^2\right)$ compared to $O\left(n \right)$ for the integration. As the acceleration for every pair of particles has to be calculated, we need to read every particle position $n$, leading to $2n^2-n$ global GPU memory reads. Accessing global GPU memory can become a critical bottleneck. Fortunately, global memory is not the only memory type an NVIDIA GPU has:

• Memory accessible globally on the GPU:
• Global memory
• Constant memory
• Texture memory
• Memory accessible from all threads in a given block:
• Shared memory
• Memory only accessible from a given thread:
• Registers
• Local memory

The less accessible the memory is, the bigger is its bandwidth. We will not go deeper into the different memory types, you can find more information here, here, here, here, and here. The take home message is that shared memory is a lot faster than global memory. The plan is to only load in each thread block every position once from global memory and store it into shared memory. We then can read $2n^2-n$ positions from shared memory after having loaded all positions once. More precisely, we do the following steps:

• Every block reads in the first blockSize particles into shared memory. Every thread is responsible for reading one element.
• Synchronize, i.e. wait until all threads of the block have stored their values in shared memory.
• Calculating the acceleration from the first blockSize particles. Every thread calculates the acceleration to a different particle.
• Synchronize again and iterate this loop until the acceleration from all particles has been calculated.

Now the acceleration for the first blockSize particles has been calculated. Running several blocks takes care of the rest.

We use the datatype float4 for the positions and save the particle masses in the w-coordinate. The use of float4 data allows coalesced memory access to the global device memory which improves the IO bandwidth.

A sketch of the F# kernel code looks as follows:

We pass the memory size during the kernel launch using the launch parameters:

If the size of the shared memory is known at compile time it can be allocated directly with

and the memory size does not need to be given during the kernel launch.

GPU Implementation with Dynamic Block Size

We organize our GPU resources using the class GPUModule:

We implement the particles’ acceleration using shared memory as described above in a method class method:

The GPU kernel calculates the accelerations using the method ComputeBodyAccel, which we already introduced in the previous post, and updates the velocities and positions accordingly. It reads the positions from oldPos and writes them into newPos, hence these two variables have to be flipped before calling the method a second time. The idea behind having two position variables will become clear when displaying the positions using OpenGL through OpenTK which allows to render and to calculate at the same time, see the next blog post.

As in the CPU version we add two methods to create implementations for ISimulator and ISimulatorTester. Note that for ISimulatorTester we switch positions, for ISimulator this has to be done externally.

GPU Implementation for Static Block Size

In the previous implementation the blockSize and the numTiles are not GPU compile time constant. It makes the code more flexible, but the GPU compiler can do less optimizations, in particular:

•  the for-loop in ComputeBodyAccel cannot be unrolled;
•  the size of the shared memory is not known to the compiler.

We optimize the GPU implementation so that the block size is constant at compile time.

We will only discuss the differences to the implementation with dynamic block size. The SimulatorModule inherits again from GPUModule and has an additional constructor argument for the block size. As a consequence, we cannot AOT compile this class.

To enable AOT compilation we inherit from SimulatorModule and set the block size to a fixed constant:

The shared memory size is now fixed. In the method ComputeBodyAccel we can declare it as follows:

The launch parameters simplify as well because the shared memory size does not need to be specified anymore:

Having fixed the number of tiles, hence the number of loop iterations, we can add a statement for loop unrolling to the innermost loop in the method ComputeBodyAccel. Note that loop unrolling is currently only possible for F# but not for C#:

The performance is measured in terms of frames per seconds achieved by the OpenGL visualization, which is presented in the next blog post. We compare different block sizes and the static and dynamic block size implementation for F# and C# on Mono installed on a Ubuntu Linux machine with a Titan Black GPU:

Apparently a block size of 128 is optimal for this GPU and static block sizes slightly improve the performance.

Play with Particles I – The Physics and CPU Implementation

This blog starts a sequence of posts introducing an implementation of the n-body problem, i.e. the simulation of Newton’s equations for many bodies or particles. The goal of these posts is to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

This post explains the physics and a CPU implementation. In the second post we provide two GPU implementations using Alea GPU and benchmark them on Mono running on a Ubuntu Linux machine. The last post considers how to visualize the particle dynamics with OpenGL. All code is fully cross platform and runs on Windows, Linux or Mac OS X. We even tested it on the Jestson TK1 embedded development kit.

N-body simulation can be solved analytically for two particles, but in general has no analytic solution for $n>2$. It is often used in astronomy, where particles mostly represent stars (but might also represent parts of a planet as in impact studies resulting in moon formation) and the interacting forces are the classical Newtonian gravitational forces. Hence, n-body simulations can be used to study the formation of galaxies.

In physics, chemistry and material sciences, where particles represent atoms or molecules, the method is called molecular dynamics. Often periodic boundary conditions are used to avoiding strong boundary effects, and thermo- and bariostates are attached to simulate constant temperature and/or constant pressure ensemble, known as the Nosé-Hoover thermostat algorithm. Some interesting simulations can also be found on YouTube, for example a simulation of water vaporizing (annoying sound) and melting ice.

Finally, the method is applied in computer graphics to simulate explosions, the flow of water, smoke and many more.

The Algorithm and the Physics Behind It

The n-body problem is all about solving Newton’s equations

$$\frac{d^2}{dt_i^2} x = \frac{F_i}{m_i},$$

where we use the gravitational force

$$F_i = m_i \sum_{i \neq j} \frac{m_j}{\left| x_i – x_j \right|}$$

induced by all other particles. In fact we will slightly change this potential by adding a softening factor so that the gravitational force becomes

$$F_i = m_i \sum_{i \neq j} \frac{m_j}{\sqrt{\left| x_i – x_j \right|^2+ \rm{softening}^2}}.$$

With this modified gravitational force, although physically an approximation, we have removed the singularity for close particles. The acceleration on particle $i$ is then

$$a_i = \sum_{i \neq j} \frac{m_j}{\sqrt{\left| x_i – x_j \right|^2+ \rm{softening}^2}}.$$

Newton’s equations imply conservation of energy, momentum and angular momentum. Using the wrong scheme for numerical integration or too big time steps destroys these conservation laws. Integration schemes that preserve the conservation laws are the simplectic integrators such as verlet and the velocity verlet algorithm. In this post we follow NVIDIA’s example and use an integration scheme including a damping term $<0$, which slightly reduces the energy in every integration step

$$v_{t+dt} = \rm{damping} \left(v_t + a_t dt\right),$$

$$x_{t+dt} = x_{t} + v_{t+dt} dt.$$

Note that there is a factor of $\frac{1}{2}$ missing for the position integration. This is adding slightly more energy to the system.

CPU Implementation

We use an abstract classes so that we can implement different versions for CPUs and GPUs.

Furthermore, we need some methods to initialize the particles’ position and velocity:

The next function adjusts the momentum of every particle such that the total momentum is 0:

The function bodyBodyInteraction will used on the CPU as well as on the GPU and therefore has the attribute []. It calculates the acceleration between two particles at positions bi and bj. The other arguments are the softeningSquared factor and the acceleration ai on particle $i$ from particle interactions calculated so far. It returns ai plus the acceleration of the particle at position bj exerting on the particle at position bi. The positions are of type float4, where the w-coordinate is the particle’s mass. We use the intrinsic GPU function __nv_rsqrtf which calculates $f(x) = \frac{1}{sqrt(x)}$ and has an optimized GPU implementation.

We now need a method which calls the bodyBodyInteraction function for all pairs of particles. Because of our modified gravitational force with a softening factor we can simplify the implementation and also add the acceleration of each particle with itself, which is just zero. We can safe an if condition and obtain branch free code, which is important on the GPU. The time integration using a damping term can now be implemented as follows:

We have everything ready to implement the abstract interfaces ISimulator and ISimulatorTester.

This completes the CPU implementation of the n-body simulation. We can now create an ISimulator and steps forward in time using the Integrate method.

In the next blog post we will focus on the GPU implementation. In a third post we will show how to use OpenGL through the Open Toolkit library to visualize the results directly on the GPU.

﻿