Tag Archives: N-Body


Play with Particles III – Visualize using OpenGL

In the last two posts we implemented three different solvers for the n-body problem, one on the CPU and two different GPU versions. In this post we visualize the n-body simulation graphically with OpenGL to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

A Simple OpenGL Example with OpenTK

There exist essentially two different 3d visualization technologies: Microsoft’s DirectX and OpenGL. DirectX is targeting the Windows platform. Examples showing the interoperability of Alea GPU and DirectX can be found in the sample gallery of Alea GPU. The benefit of OpenGL is platform independence. A good starting point is the OpenGL tutorial.

In this blog article we use OpenGL through OpenTK which wraps OpenGL for .NET. You might find the OpenTK documentation, the OpenTK tutorial, how to set up OpenTK, and this OpenTK example useful resources.

We introduce OpenTK with a simplistic example, which renders a triangle. Add the OpenTK NuGet package to your project and reference System.Drawing. Then open the following namespaces:

Create a new class OpenTKExample which inherits from the OpenTK class GameWindow:

Then overwrite the following methods:

On load set the Background to DarkBlue:

On resize of the window use the whole area to paint and set the projection:

On render the frame is where all the drawing happens. Clear the buffer and draw a triangle and three points with different colors. Instead of a triangle we could also draw many different other figures such as points, lines, etc. End the triangle mode and swap the buffers.

Add a function creating an instance of our OpenTKExample class and running it:

The result is a colored triangle on a dark blue background:


We will use the same structure in order to display our particles.

Displaying Particles Directly on the GPU

The main difference between the simple example above and the n-body simulation is that in case of the n-body simulation the data already resides in GPU memory and we do not want copy it from GPU to CPU and back to an OpenGL buffer to finally display the particles. This needs some infrastructure code. First we show how to create two buffers accessible from OpenGL and Alea GPU, in which we save our positions:

  1. Generate an array consisting of GLuint.
  2. Create a buffer using GL.GenBuffers.
  3. For every element of the array:
    1. bind the buffer;
    2. allocate the memory;
    3. get the buffer-size;
    4. unbind the buffer;
    5. register the buffer with cuGLRegisterBufferObject available from Alea GPU.

We now have two buffers which can be accessed from OpenTK and Alea GPU. We also need some resource pointers corresponding to the buffers. We obtain them by calling the Alea GPU function cuGraphicsGLRegisterBuffer inside a cuSafeCall:

If we work with the buffers outside of OpenTK we need to lock their positions. We therefore write a function lockPos which locks the positions, calls a function f on the positions and unlocks the positions again:

To share the buffers we require an Alea.Worker on the same context as used by OpenTK. The following function creates a CUDA context on the machine’s default device:

We use this function to create an Alea.Worker on the same context using the same CUDA device.

We can now initialize the positions using the lockPos function and our newly generated worker:

Recall that we read from oldPos and write to newPos in the GPU implementation. We need to swap the buffers before each integration step using the function swapPos:

In the OnRenderFrame method we swap the buffers and perform an integration step:

We bind the buffer and draw the positions:

Implementation Details

We point out some implementation details. To use the different GPU implementations and test them for different block sizes we introduce a queue of ISimulator objects. During simulation we walk through the queue with an “S” key down event.

We create the simulators and put them into the queue. Note that we also return a dispose function to clean up the simulators at the end:

Here is the logic to switch between simulators:

We use Matrix4.LookAt to inform OpenTK that our viewing position is (0,0,50) and that the viewing direction is along the z axis:

These additional comments should be helpful to understand how the positions are displayed using OpenTK and how the data is directly read from GPU memory. The previous blog posts explain the physics, the CPU implementation, the two GPU implementations and their differences. All that remains is to run the example.


Play with Particles II – GPU Implementation

In our last post we considered the CPU implementation of n-body simulations. In this post we will focus on the parallelization strategy and two GPU implementations using a dynamic and a static thread block size.

Parallelization Strategy

Recall the CPU implementation of the n-body simulation. The algorithm has essentially two parts:

  1. Calculating and summing up the accelerations for each pair of particles.
  2. Integrating velocity and position given the calculated accelerations.

The expensive part is the calculation of the acceleration which is $O\left(n^2\right)$ compared to $O\left(n \right)$ for the integration. As the acceleration for every pair of particles has to be calculated, we need to read every particle position $n$, leading to $2n^2-n$ global GPU memory reads. Accessing global GPU memory can become a critical bottleneck. Fortunately, global memory is not the only memory type an NVIDIA GPU has:

  • Memory accessible globally on the GPU:
    • Global memory
    • Constant memory
    • Texture memory
  • Memory accessible from all threads in a given block:
    • Shared memory
  • Memory only accessible from a given thread:
    • Registers
    • Local memory

The less accessible the memory is, the bigger is its bandwidth. We will not go deeper into the different memory types, you can find more information here, here, here, here, and here. The take home message is that shared memory is a lot faster than global memory. The plan is to only load in each thread block every position once from global memory and store it into shared memory. We then can read $2n^2-n$ positions from shared memory after having loaded all positions once. More precisely, we do the following steps:

  • Every block reads in the first blockSize particles into shared memory. Every thread is responsible for reading one element.
  • Synchronize, i.e. wait until all threads of the block have stored their values in shared memory.
  • Calculating the acceleration from the first blockSize particles. Every thread calculates the acceleration to a different particle.
  • Synchronize again and iterate this loop until the acceleration from all particles has been calculated.

Now the acceleration for the first blockSize particles has been calculated. Running several blocks takes care of the rest.

We use the datatype float4 for the positions and save the particle masses in the w-coordinate. The use of float4 data allows coalesced memory access to the global device memory which improves the IO bandwidth.

A sketch of the F# kernel code looks as follows:

We pass the memory size during the kernel launch using the launch parameters:

If the size of the shared memory is known at compile time it can be allocated directly with

and the memory size does not need to be given during the kernel launch.

GPU Implementation with Dynamic Block Size

We organize our GPU resources using the class GPUModule:

We implement the particles’ acceleration using shared memory as described above in a method class method:

The GPU kernel calculates the accelerations using the method ComputeBodyAccel, which we already introduced in the previous post, and updates the velocities and positions accordingly. It reads the positions from oldPos and writes them into newPos, hence these two variables have to be flipped before calling the method a second time. The idea behind having two position variables will become clear when displaying the positions using OpenGL through OpenTK which allows to render and to calculate at the same time, see the next blog post.

As in the CPU version we add two methods to create implementations for ISimulator and ISimulatorTester. Note that for ISimulatorTester we switch positions, for ISimulator this has to be done externally.

GPU Implementation for Static Block Size

In the previous implementation the blockSize and the numTiles are not GPU compile time constant. It makes the code more flexible, but the GPU compiler can do less optimizations, in particular:

  •  the for-loop in ComputeBodyAccel cannot be unrolled;
  •  the size of the shared memory is not known to the compiler.

We optimize the GPU implementation so that the block size is constant at compile time.

We will only discuss the differences to the implementation with dynamic block size. The SimulatorModule inherits again from GPUModule and has an additional constructor argument for the block size. As a consequence, we cannot AOT compile this class.

To enable AOT compilation we inherit from SimulatorModule and set the block size to a fixed constant:

The shared memory size is now fixed. In the method ComputeBodyAccel we can declare it as follows:

The launch parameters simplify as well because the shared memory size does not need to be specified anymore:

Having fixed the number of tiles, hence the number of loop iterations, we can add a statement for loop unrolling to the innermost loop in the method ComputeBodyAccel. Note that loop unrolling is currently only possible for F# but not for C#:

The performance is measured in terms of frames per seconds achieved by the OpenGL visualization, which is presented in the next blog post. We compare different block sizes and the static and dynamic block size implementation for F# and C# on Mono installed on a Ubuntu Linux machine with a Titan Black GPU:


Apparently a block size of 128 is optimal for this GPU and static block sizes slightly improve the performance.


Play with Particles I – The Physics and CPU Implementation

This blog starts a sequence of posts introducing an implementation of the n-body problem, i.e. the simulation of Newton’s equations for many bodies or particles. The goal of these posts is to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

This post explains the physics and a CPU implementation. In the second post we provide two GPU implementations using Alea GPU and benchmark them on Mono running on a Ubuntu Linux machine. The last post considers how to visualize the particle dynamics with OpenGL. All code is fully cross platform and runs on Windows, Linux or Mac OS X. We even tested it on the Jestson TK1 embedded development kit.

N-body simulation can be solved analytically for two particles, but in general has no analytic solution for $n>2$. It is often used in astronomy, where particles mostly represent stars (but might also represent parts of a planet as in impact studies resulting in moon formation) and the interacting forces are the classical Newtonian gravitational forces. Hence, n-body simulations can be used to study the formation of galaxies.

In physics, chemistry and material sciences, where particles represent atoms or molecules, the method is called molecular dynamics. Often periodic boundary conditions are used to avoiding strong boundary effects, and thermo- and bariostates are attached to simulate constant temperature and/or constant pressure ensemble, known as the Nosé-Hoover thermostat algorithm. Some interesting simulations can also be found on YouTube, for example a simulation of water vaporizing (annoying sound) and melting ice.

Finally, the method is applied in computer graphics to simulate explosions, the flow of water, smoke and many more.

In this article we focus on an example form astronomy. It is similar to NVIDIA’s example code (see also NVIDIA’s documentation).

The Algorithm and the Physics Behind It

The n-body problem is all about solving Newton’s equations

$$\frac{d^2}{dt_i^2} x = \frac{F_i}{m_i},$$

where we use the gravitational force

$$F_i = m_i \sum_{i \neq j} \frac{m_j}{\left| x_i – x_j \right|}$$

induced by all other particles. In fact we will slightly change this potential by adding a softening factor so that the gravitational force becomes

$$F_i = m_i \sum_{i \neq j} \frac{m_j}{\sqrt{\left| x_i – x_j \right|^2+ \rm{softening}^2}}.$$

With this modified gravitational force, although physically an approximation, we have removed the singularity for close particles. The acceleration on particle $i$ is then

$$a_i = \sum_{i \neq j} \frac{m_j}{\sqrt{\left| x_i – x_j \right|^2+ \rm{softening}^2}}.$$

Newton’s equations imply conservation of energy, momentum and angular momentum. Using the wrong scheme for numerical integration or too big time steps destroys these conservation laws. Integration schemes that preserve the conservation laws are the simplectic integrators such as verlet and the velocity verlet algorithm. In this post we follow NVIDIA’s example and use an integration scheme including a damping term $<0$, which slightly reduces the energy in every integration step

$$v_{t+dt} = \rm{damping} \left(v_t + a_t dt\right),$$

$$x_{t+dt} = x_{t} + v_{t+dt}  dt.$$

Note that there is a factor of $\frac{1}{2}$ missing for the position integration. This is adding slightly more energy to the system.

CPU Implementation

We use an abstract classes so that we can implement different versions for CPUs and GPUs.

Furthermore, we need some methods to initialize the particles’ position and velocity:

The next function adjusts the momentum of every particle such that the total momentum is 0:

The function bodyBodyInteraction will used on the CPU as well as on the GPU and therefore has the attribute [<ReflectedDefinition>]. It calculates the acceleration between two particles at positions bi and bj. The other arguments are the softeningSquared factor and the acceleration ai on particle $i$ from particle interactions calculated so far. It returns ai plus the acceleration of the particle at position bj exerting on the particle at position bi. The positions are of type float4, where the w-coordinate is the particle’s mass. We use the intrinsic GPU function __nv_rsqrtf which calculates $f(x) = \frac{1}{sqrt(x)}$ and has an optimized GPU implementation.

We now need a method which calls the bodyBodyInteraction function for all pairs of particles. Because of our modified gravitational force with a softening factor we can simplify the implementation and also add the acceleration of each particle with itself, which is just zero. We can safe an if condition and obtain branch free code, which is important on the GPU. The time integration using a damping term can now be implemented as follows:

We have everything ready to implement the abstract interfaces ISimulator and ISimulatorTester.

This completes the CPU implementation of the n-body simulation. We can now create an ISimulator and steps forward in time using the Integrate method.

In the next blog post we will focus on the GPU implementation. In a third post we will show how to use OpenGL through the Open Toolkit library to visualize the results directly on the GPU.