Play with Particles II – GPU Implementation

In our last post we considered the CPU implementation of n-body simulations. In this post we will focus on the parallelization strategy and two GPU implementations using a dynamic and a static thread block size.

Parallelization Strategy

Recall the CPU implementation of the n-body simulation. The algorithm has essentially two parts:

  1. Calculating and summing up the accelerations for each pair of particles.
  2. Integrating velocity and position given the calculated accelerations.

The expensive part is the calculation of the acceleration which is $O\left(n^2\right)$ compared to $O\left(n \right)$ for the integration. As the acceleration for every pair of particles has to be calculated, we need to read every particle position $n$, leading to $2n^2-n$ global GPU memory reads. Accessing global GPU memory can become a critical bottleneck. Fortunately, global memory is not the only memory type an NVIDIA GPU has:

  • Memory accessible globally on the GPU:
    • Global memory
    • Constant memory
    • Texture memory
  • Memory accessible from all threads in a given block:
    • Shared memory
  • Memory only accessible from a given thread:
    • Registers
    • Local memory

The less accessible the memory is, the bigger is its bandwidth. We will not go deeper into the different memory types, you can find more information here, here, here, here, and here. The take home message is that shared memory is a lot faster than global memory. The plan is to only load in each thread block every position once from global memory and store it into shared memory. We then can read $2n^2-n$ positions from shared memory after having loaded all positions once. More precisely, we do the following steps:

  • Every block reads in the first blockSize particles into shared memory. Every thread is responsible for reading one element.
  • Synchronize, i.e. wait until all threads of the block have stored their values in shared memory.
  • Calculating the acceleration from the first blockSize particles. Every thread calculates the acceleration to a different particle.
  • Synchronize again and iterate this loop until the acceleration from all particles has been calculated.

Now the acceleration for the first blockSize particles has been calculated. Running several blocks takes care of the rest.

We use the datatype float4 for the positions and save the particle masses in the w-coordinate. The use of float4 data allows coalesced memory access to the global device memory which improves the IO bandwidth.

A sketch of the F# kernel code looks as follows:

We pass the memory size during the kernel launch using the launch parameters:

If the size of the shared memory is known at compile time it can be allocated directly with

and the memory size does not need to be given during the kernel launch.

GPU Implementation with Dynamic Block Size

We organize our GPU resources using the class GPUModule:

We implement the particles’ acceleration using shared memory as described above in a method class method:

The GPU kernel calculates the accelerations using the method ComputeBodyAccel, which we already introduced in the previous post, and updates the velocities and positions accordingly. It reads the positions from oldPos and writes them into newPos, hence these two variables have to be flipped before calling the method a second time. The idea behind having two position variables will become clear when displaying the positions using OpenGL through OpenTK which allows to render and to calculate at the same time, see the next blog post.

As in the CPU version we add two methods to create implementations for ISimulator and ISimulatorTester. Note that for ISimulatorTester we switch positions, for ISimulator this has to be done externally.

GPU Implementation for Static Block Size

In the previous implementation the blockSize and the numTiles are not GPU compile time constant. It makes the code more flexible, but the GPU compiler can do less optimizations, in particular:

  •  the for-loop in ComputeBodyAccel cannot be unrolled;
  •  the size of the shared memory is not known to the compiler.

We optimize the GPU implementation so that the block size is constant at compile time.

We will only discuss the differences to the implementation with dynamic block size. The SimulatorModule inherits again from GPUModule and has an additional constructor argument for the block size. As a consequence, we cannot AOT compile this class.

To enable AOT compilation we inherit from SimulatorModule and set the block size to a fixed constant:

The shared memory size is now fixed. In the method ComputeBodyAccel we can declare it as follows:

The launch parameters simplify as well because the shared memory size does not need to be specified anymore:

Having fixed the number of tiles, hence the number of loop iterations, we can add a statement for loop unrolling to the innermost loop in the method ComputeBodyAccel. Note that loop unrolling is currently only possible for F# but not for C#:

The performance is measured in terms of frames per seconds achieved by the OpenGL visualization, which is presented in the next blog post. We compare different block sizes and the static and dynamic block size implementation for F# and C# on Mono installed on a Ubuntu Linux machine with a Titan Black GPU:


Apparently a block size of 128 is optimal for this GPU and static block sizes slightly improve the performance.