Tag Archives: .NET


“Auto-magic” GPU Programming

We all agree, GPUs are not as easy to program as a CPU, even though CUDA and OpenCL provide a well established abstraction of the GPU processor design. GPU support in high-level programming languages such as Java, C# or Python still leaves a lot to be desired. Only developers familiar with C++ can directly access a rich set of APIs and tools for general-purpose GPU programming.

We believe this does not need to be so. This blog is a sneak preview of the new upcoming version 3 of Alea GPU which sets completely new standards for GPU development on managed platforms such as .NET or the JVM. So what is so exciting about version 3?

  1. It adds higher level abstractions such as GPU LINQ, a GPU parallel-for and parallel aggregate, which can build and execute GPU workflows or run delegates and lambda expressions on the the GPU. An additional benefit is code reuse – the same delegate can be executed on a GPU as well as on the CPU.

  2. It makes GPU programming much easier and often “auto-magic” – we manage memory allocation and transfer to and from the GPU in an economic way. It is built for developers who grew up without pointers, malloc and free.

  3. We integrate well with the .NET type system, which means for example that .NET arrays can be used directly in GPU kernels.

  4. Of course for all the hard core CUDA programmers with .NET affinity, we still support CUDA in C# and F# as we did already with our previous versions – even better.

Let’s look at the API from a usability point of view and at the underlying technology and implementation details.


LINQ is a technology that extends languages such as C# with powerful query capabilities for data access and transformation. It can be extended to support virtually any kind of data store, also data and collections that reside on a GPU. Alea GPU LINQ introduces new LINQ extensions to express GPU computations with LINQ expressions that are optimized and compiled to efficient GPU code. The main advantages of coding whole GPU workflows with LINQ expressions are:

  1. The execution of GPU LINQ workflows is delayed. They are composable , which facilitates code modularity and reuse.

  2. GPU LINQ workflows provide many standard operations such as parallel aggregation or parallel map, which makes them more expressive and reduces boiler plate code.

  3. We can apply various kernel optimization techniques, such as GPU kernel fusing, which results in fewer GPU kernel launches and more compact GPU code.

  4. Having the full code as an expression allows us to better optimize memory management and data transfer.

Here is a simple but interesting example, which determines the largest value of a array of values on the GPU in two steps. First we index the sequence and then we reduce the new array of indexed values with a custom binary operator that just compares the values to find the maximal value and its index. A priori this would require two GPU kernels. The first is a parallel map, the second is a parallel reduction. Alea GPU can fuse the two kernels into one.

A more complex example is the calculation of the fair value of an Asian option with Monte Carlo simulation based on the celebrated Black-Scholes model.

The Monte Carlo simulation runs in multiple batches, each batch consists of numSamplesPerBatch samples. Workflows are composable. The outer workflow is a map-reduce, which launches the batch sampling and reduces the match mean to the mean across all batches. The inner workflow does the actual Monte Carlo simulation. It first binds storage to the workflow which is then populated with normally distributed random numbers. The core algorithm is in the Select. For each sample index i it generates a path and prices the option along the path. The Aggregate method of the inner workflow calculates the batch sample mean with a parallel reduction.

Black Scholes Simulation

GPU Parallel-For and Parallel Aggregate

An alternative abstraction is provided with the GPU parallel-for and parallel aggregate pattern. Together with automatic memory management, they allows us to write parallel GPU code as if you would write serial CPU code. The usability is very simple. We select a GPU device and pass a delegate to the gpu.For method. All the variables used in the delegate are captured in a closure that is then passed to the parallel-for body. The data is automatically moved to the GPU and the results are brought back automatically to the host.

Parallel-For Closure

The element-wise sum on the GPU is now as simple as this:

The delegate accesses data elements arg1 and arg2 that are defined outside of the loop body and writes the result directly to a .NET array. The runtime system takes care of all the memory management and transfer. Because the delegate does not rely on any GPU specific features such as shared memory,the delegate can execute on the CPU and the GPU. The runtime system also takes care of selecting the thread block size based on the occupancy of the generated kernel.

The parallel aggregate works in the same way. It requires a binary associative operator which is used to reduce the input collection to a single value. Our implementation does not require that the operator is commutative. The following code calculates the sum of the array elements on the GPU:

Automatic Memory Management

Our automatic memory management system handles memory allocation and data movement between the different memory spaces of the CPU and the GPU without the programmer having to manage this manually. It is efficient – unnecessary copy operations are avoided by analyzing the memory access. The implementation is based on code instrumentation, a technique that inserts additional instructions into an existing execution path. Alea GPU modifies the CPU code by inserting instructions that monitor array accesses and perform minimum data transfers between the CPU and GPU. As these runtime checks generate a slight performance overhead, the scope of analysis is limited to the code carrying the attribute [GpuManaged]. Leaving out this attribute never means that data will not be copying – it may only affect unnecessary intermediate copying.

To illustrate the automatic memory management in more detail, we look at an example. We iterate 100 times a parallel-for loop that increments the input by one. First of all, we consider the situation without the [GpuManaged] attribute. In this case, the data is automatically copying, although more frequently than necessary due to a limited scope of analysis.

We check the memory copy operations by using the NVIDIA Visual Nsight profiler. As expected the low level CUDA driver functions cuLaunchKernel, cuMemcpyHtoD_v2 and cuMemcpyDtoH_v2 to launch the kernel and to perform memory copy are called 100 times each. This means that the data is copied in and out for each of the 100 sequential parallel for launches. Let us add the attribute [GpuManaged] to turn on automatic memory management.

We see that cuMemcpyHtoD_v2 and cuMemcpyDtoH_v2 are now called just once. The reason is that result data of a preceding GPU parallel-for loop can stay on the GPU for the succeeding parallel-for loop without need of copying the intermediate data back and forth to CPU. Copying is only involved for the input of the first GPU execution as well as for the output of the last GPU computation.

Using .NET Arrays and Value Types in Kernels

For a C# developer it would be very convenient to use .NET arrays and other standard .NET types also directly in a GPU kernel and that all the memory management and data movement is handled automatically. .NET types are either reference types or value types. Value types are types that hold both data and memory at the same location, a reference type has a pointer which points to the memory location. Structs are value types and classes are reference types. Blittable types are types that have a common representation in both managed and unmanaged memory, in particular reference types are always non-blittable. Copying non-blittable types from one memory space to another requires marshalling, which is usually slow.

From the point of view of efficiency we made the decision to only support .NET arrays with blittable element types as well as jagged arrays thereof. This is a good compromise between usability and performance. To illustrate the benefits let’s look at how to write an optimized matrix transpose. With Alea GPU version 2 you have to work with device pointers and all the matrix index calculations have to be done by hand.

Alea GPU version 2 requires that kernels and other GPU resources are in a class that inherits from ILGPUModule. Apart from this the kernel implementation resembles the CUDA C implementation very closely.

With Alea GPU V3 you don’t need to inherit from a base module class anymore. You can directly work with .NET arrays in the kernel, also for the shared memory tile. We save the error prone matrix element index calculations and only need to map the thread block to the matrix tile.

Alea GPU version 2 requires explicit memory allocation, data copying and calling the kernel with device pointers. An additional inconvenience is that matrices stored in two-dimensional arrays first have to be flatten.

Here is the kernel launch code that relies on automatic memory management. The developer allocates a .NET array for the result, passes that, together with the input matrix directly to the kernel.

Without compromizing the usability, the programmer can also work with explicit memory management.

Here the arrays a and at are fake arrays representing arrays on the GPU device and he can use them in a GPU kernel the same way as an ordinary .NET array. The only difference is that he is now responsible to copy back the result explicitly with CopyToHost. Of course the deviceptr API is still available and often useful for low level primitives or to write highly optimized code.

Improved Support for Delegates and Generic Structs

Alea GPU version 3 also has better support for delegates and lambda expressions. Here is a simple generic transform that takes a binary function object as argument and applies it to arrays of input data:

The next example defines a struct representing a complex number which becomes a blittable value type.

We define a delegate that adds two complex numbers. It creates the result directly with the default constructor. Note that this delegate is free of any GPU specific code and can be executed on the CPU and GPU alike.

It can be used in the parallel Gpu.For to perform element-wise complex addition

or in the above generic transform kernel.

JIT Compiling Delegates to GPU Code

From an implementation point of view a challenge is that delegates are runtime objects. This means we have to JIT compile the delegate code at runtime. Fortunately our compiler has this feature since its initial version. For a delegate such as

the C# compiler will generate a closure class with fields and an Invoke method:

To instantiate the delegate instance, the C# compiler generates code to instantiate the closure class, set its fields, and to create the delegate instance with the closure instance and the method’s function pointer:

The above code is just illustrative and not legal C# code. Both ldftn and methodof are in fact the
real IL instructions that C# compiler generates.

Whenever the Alea GPU compiler finds this delegate, it translates the closure class into a kernel struct, and JIT compiles the GPU kernel code that comes from the Invoke method of the compiler generated class. Alea GPU caches the result of JIT compilations in a dictionary using the key methodof(CompilerGenerated.Invoke), so it will not compile delegates with same method multiple times.

There is one thing needs to be noted. Since we translate the closure class into a struct and pass it to the GPU as a kernel argument, it is not possible to change the values of the fields. For example a delegate like i => result = arg1 does not work.

Code Instrumentation for Automatic Memory Management

The core component of our automatic memory management system is a memory tracker. It tracks .NET arrays and their counterparts residing in GPU device memory. Every array has a flag that indicates if it is out of date. The tracking of an array starts the first time it is used (implicitly in a closure or explicitly as an argument) in a GPU kernel launch. A weak reference table stores for every tracked array the host-out-of-date flag and for every GPU the corresponding device memory, together with the device-out-of-date flag.

The memory tracker has the following methods:

  1. Start tracking a host array
  2. Make an array up to date on a specific GPU
  3. Make an array up to date on host
  4. Make all arrays up to date on a specific GPU
  5. Make all arrays up to date on host

The default procedure is as follows: If an array is used in a kernel launch on a GPU the tracker makes the array up to date on that GPU by copying it to device memory just before the kernel is launched. After the kernel launch the tracker makes the array again up to date on the host by copying it back to the CPU memory. This very simple strategy always works but often leads to unnecessary memory transfers. The basic idea of our automatic memory management system is to defer the synchronization of a host arrays with its device counterpart to the point when the host array is actually accessed again. We implement this deferred synchronization strategy with code instrumentation, which inserts additional checks and memory tracker method calls at the right place.

Because instrumentation adds additional overhead we narrow down the ranges of instrumentation. A function can either be GpuManaged or GpuUnmanaged. By default, a function is GpuUnmanaged, which means that it does not defer the memory synchronization and thus its code is not instrumented. If a function has the GpuManaged attribute, we insert code and method calls to track the array access and defer the synchronization. At least, the functions Alea.Gpu.Launch and Alea.Gpu.For are GpuManaged.

Methods with the attribute GpuManaged are inspected in a post-build process. We check if a function contains IL instructions such as ldelem , ldelema , stelem, call Array.GetItem(), call Array.SetItem(), etc. to access a specific array. In this case we extract the array operand and insert code to defer its synchronization. A standard use case is a loop over all the array elements to set or modify them. In such a case we can optimize the tracking by creating local cache flags. Here is an example:

Instrumention produces code that is functionally equivalent to the following source code:

Calling a method like MemoryTracker.HostUpToDateFor() many times to check if an array has to be synchronized is generating a huge overhead. We use the flag to bypass the call once we know the array is synchronized and reset the flag after kernel launches. At the end of GpuManaged method, we will insert code to bring all out-of-date implicitly traced array back to host. A frequent case is calling other functions from a GpuManaged function. These other functions could be either GpuManaged or GpuUnmanaged. We need to notify the callee to defer memory synchronization. We use some mechanism to pass the managed session to the callee, so that it won’t bring back all out-of-date array to host, because it is not the end of the GpuManaged session.

The implementation relies on Mono.Cecil and Fody. There is a sketch of the full code instrumentation that is executed in a post build step:

  1. Load the compiled assemblies with Mono.Cecil through Fody
  2. For each GpuManaged function
  3. Add memory barrier code

– for every array element access add cache flag and call to HostUpToDateFor()
– for GpuManaged functions call SetFlagOnCurrentThread() before, reset all cache flags after
– for GpuUnmanaged functions call HostUpToDateForAllArrays() before calling them
1. Add try finally block and in finally call HostUpToDateForAll() if the caller is GpuUnmanaged
1. Weave the modified assembly via Fody

Your Feedback

We hope that after reading this post you share the same excitement for the new upcoming version 3 of the Alea GPU compiler for .NET as we do.

Of course we are interested to hear all of your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.

The features that we presented here are still in preview and might slightly change until we finally release version 3.

If you would like to already play around with Alea GPU V3 come and join us April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.


Don’t Get Lost in the Forest IV – Performance Test

Having discussed and implemented a random forest algorithm in the last three posts, we will finalize this series by comparing the performance of our two implementations with the random forest implementation of the Python package sklearn. For the test we will use a randomly created data set consisting of 20000 samples with 20 features and two classes. We will train 1000 trees and make two examples with different maximum tree-depth:

  1. depth = 1 (stumps)
  2. depth = 5

In addition to the discussed GPU implementation, we have a version using CUDA-Streams, a way to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism), which helps hidding time lost by copying memory. We will leave this topic for a possible future post, but will use the implementation for performance tests.

The tests have been performed on a machine with an Intel Core i7-4771 CPU @ 3.50GHz, 16 GB RAM and a NVIDIA Titan Black GPU.

getting the following result:

For Python we use the following script, reading in the same test data as in .NET.

It builds the 1000 trees in between 43 s (single CPU core) and 9.6 s (all CPU cores). This is much faster than our CPU implementation, which was expected as our code, though reasonably well optimized, is not expected to catch up with the highly optimized sklearn library. Our GPU implementation however still gets a nice speed up of over 4 against the most parallel Python run.

For deeper trees with depth 5 we get the following results:

For these parameters the Python code needs between 3min 40s (single CPU core) and 44 s (all CPU cores). Here the speedup is smaller, what can be explained as the GPU implementation is parallel on the number of samples and they shrink with every split.


Don’t Get Lost in the Forest III – The GPU Implementation

In the previous post we looked at the CPU implementation of a random forest training algorithm. We also discussed two parallelization strategies at different levels:

  1. Build the independent trees in parallel.
  2. Search for the optimal split for all features in parallel.

The first strategy is straightforward. Here we focus on the second strategy and discuss how it can be implemented on the GPU. We can reuse most of the CPU code. We only have to replace the function optimizeFeatures with a proper GPU implementation:

This function performs the following calculations:

  • Create a cumulative histograms by:
    • Find and prepare the indices with non-zero weights.
    • Expand the weights: get initial weights for given features and the sample indices before sorting and sort them according to their class (see below for more details).
    • Calculate the histograms using a cumulative sum function applied to the expanded weights.
  • Calculate the entropy and it’s minimum.

In the following we discuss each step in more detail using example data consisting of four features with four samples

feature 1 feature 2 feature 3 feature 4 label
1 3 7 8 0
2 2 2 9 1
3 1 1 7 1
4 5 3 5 0

and the weights

1 1 0 1

After sorting the features we get four triples feature values, labels and index before sorting. This corresponds to the SortedFeatures of the discriminated union type LabeledFeatureSet discussed in the previous post:

1,0,0 1,1,2 1,1,2 5,0,3
2,1,1 2,1,1 2,1,1 7,1,2
3,1,2 3,0,0 3,0,3 8,0,0
4,0,3 5,0,3 7,0,0 9,1,1

After sorting we do not need the feature values anymore. It is therefore enough to keep the two matrices, the first one storing the labels

0 1 1 0
1 1 1 1
1 0 0 0
0 0 0 1

and the second one the indices

0 2 2 3
1 1 1 2
2 0 3 0
3 3 0 1

Find Non-Zero Indices

This function prepares the input data such that all samples with a non-zero weight, respectively indices of these samples, are stored consecutively.

It calls the GPU kernel LogicalWeightExpansionKernel, which creates for every feature a vector of elements in ${0,1}$ depending on weights being non-zero (looking at the expanded weights, i.e. initial weights before sorting). Here is the actual GPU code:

For our sample data this kernel produces the following result:

1 0 0 1
1 1 1 0
0 1 1 1
1 1 1 1

After calculating a column-wise cumulative sum we obtain

1 0 0 1
2 1 1 0
2 2 2 2
3 3 3 3

The function FindNonZeroIndicesKernel

then creates the matrix of indices

0 1 1 0
1 2 2 2
3 3 3 3

It records the next index for which the feature has a non-zero weight. Note that the index here is related to the features after sorting.

Expand the Weights and Create Histograms

The WeightExpansionKernel uses the indices matrix and returns for every sample with non-zero-weight its weight, sorted according to the class (label) the sample belongs to. Given the following weights and labels:

Weight: Label:
1 0
2 1
1 1
2 0

We get the following expanded weights where a zero indicates that a feature has a different class:


The weight expansion function

transforms our example data to

1 0 0 1
0 1 1 1
1 1 1 0
0 1 1 0
1 0 0 0
0 0 0 1

To calculate the cumulative histogram for each feature over all labels we use a row-wise cumulative sum (row-wise because the matrix layout in the code is different to the layout in our example), which is implemented with the Alea.Unbound primitive ConsumeRangeConsecutiveInclusive

and a cumulative sum kernel

The cumulative histograms are then used for the entropy calculation:

1 0 0 1
1 1 1 2
2 2 2 2
2 3 3 2
3 3 3 2
3 3 3 3

Calculate the Entropy and it’s Minimum

The entropy for every possible split and for every feature is calculated from the cumulative histograms with the EntropyKernel function:

The optimal split for every feature, i.e. the split leading to the lowest entropy, can now be calculated with the MinimumEntropy function

To find tie minimal entropy the function LaunchOptAndArgOptKernelDoubleWithIdcs uses a the Alea.Unbound library function MultiChannelReducePrimitive. Then the optimal splits are checked for edge cases in the same way as in the function optimizeFeatures.

This explains the main functionality of the GPU implementation. In our last blog post, we will consider the performance of the different implementations and look at an implementations in Python based on scikit-learn.


Don’t Get Lost in the Forest II – The CPU Implementation

This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.

First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.

Implementation of the Forecasting Function

Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record Split consists of a feature index Feature and a Threshold

where type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of Leaf objects (representing predicted labels) and Node objects (consisting of a triple of a split and two sub-trees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

A sample is a vector of feature values, which are all float values:

In the case of the Iris data set we have the four values for the features sepal length and width and the petal length stored in an array [|sepalLength; sepalWidth; petalLength; petalWidth|].

We can predict the label for a given sample using a function traversing the tree recursively:

A random forest is then represented by the type Model which consists of an array of trees and the number of classes:

We can predict the label of a sample using a random forest model by calling the function forecastTree for every tree in the forest and apply a majority vote on the resulting labels:

Here maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

We will revisit this function later and improve its performance with an alternative implementation.

CPU Implementation of the Training Algorithm

Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

  1. optimizeFeatures returns the optimal split position and its entropy for every feature;
  2. trainTrees uses the function optimizeFeatures and trains decision trees.

A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. As the old position of the feature served as the index for accessing the weights, we save them as indices. We introduce a discriminated union type to represent the data.

The canonical form of the input data is represented by LabeledSamples, which is an array of tuples Sample * Label. Recall that a Sample is just an array of float vectors and Label is an integer value representing the class of the sample. For the Iris data set a LabeledSample is a tuple of a float vector of four values and an integer ([|sepalLength; sepalWidth; petalLength; petalWidth|], 1).

The type SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a FeatureArray, the corresponding labels in Labels and the indices representing the old position of the features.

Note that both the Sample and FeatureArray are both float arrays. The reason why we distinguish them is because a Sample is an observation of one feature value per feature, whereas a FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows would correspond to samples, whereas the columns would correspond to feature arrays.

The Function optimizeFeatures

The function optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a mode, which determines the parallelization scheme, the options for the entropy-optimization, the number of classes, the labels per feature, the indices per features, and the weights. It performs the following computations:

  • Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
  • Project all features and weights which are non-zero to the beginning of the array.
  • Implement the helper functions
    • mapper which, depending on the mode, is either Array.mapi or Array.Parallel.mapi.
    • translator which translates the index after projection to the post projection index, or the last index in case for the last non-zero index.
    • A mapping function, the argument for the mapper, which for all non-masked features performs the following:
      • Get the weights.
      • Create the histogram for every split possibility for the given feature.
      • Mask splits which are clearly not optimal. This is not necessary, but helps speeding up the CPU implementation.
      • Calculate the entropy for remaining split possibilities.
      • Return the lowest entropy and its index and translates the index to its value before the projection using the translator function.
  • Select a subset of the features using a FeatureSelector.
  • Calculate the entropy for the remaining split possibilities using the helper functions.

Here is the full source code:

The Function trainTrees:

The function trainTrees trains decision trees:

  • Calculate entropy, split index and feature index with the minimal split entropy.
  • Find split threshold using the function  featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
  • Set weights for samples with value below respectively above the split threshold to zero for the lower respectively upper branch of the next iteration.
  • If depth = 0 then return the tree, else train new trees.

Note that each FeatureArray that is given in the argument sortedTrainingSet has to be sorted in ascending order.

To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.

Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.

CPU Parallelization

For the CPU implementation we exploit two parallelization strategies:

  1. Use Array.Parallel.mapi instead of Array.mapi.
  2. Partition the trees in n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows to use different devices and even to mix CPU and GPU units.

Performance Relevant Implementations (minArgmin)

We need to have a well performing CPU implementation so that we can later compare the CPU and GPU implementation. To improve the performance of the CPU implementation we introduce the entropyMask mentioned earlier and have to make a compromise between a clean functional implementation and a performance optimized implementation.

A first bottleneck is the function minAndArgMin. A compact functional implementation uses Seq.minBy:

The optimized implementation which relies on the .NET IEnumerable interface is about 10% faster:

Note that using sequences rather than arrays gives us some additional performance benefits.

Applying the Code to the Iris Data Set

We now apply our CPU random forest classifier to the Iris data set. First we read the data.

Then we splitting the data in to test and training data:

We set up the parameters for the training algorithm:

The SquareRootFeatureSelector chooses randomly $\sqrt{n}$ features from the existing $n$ features. In case of the Iris data set this results in two features out of the existing four features, randomly selected for every tree in the forest. Now we train the model:

Once the model is trained we can use it to predict the labels on the test data:

You can call these functionality by calling:

leading to the following output:

Now it is your turn. Download the code, play with it and have fun!


Play with Particles III – Visualize using OpenGL

In the last two posts we implemented three different solvers for the n-body problem, one on the CPU and two different GPU versions. In this post we visualize the n-body simulation graphically with OpenGL to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

A Simple OpenGL Example with OpenTK

There exist essentially two different 3d visualization technologies: Microsoft’s DirectX and OpenGL. DirectX is targeting the Windows platform. Examples showing the interoperability of Alea GPU and DirectX can be found in the sample gallery of Alea GPU. The benefit of OpenGL is platform independence. A good starting point is the OpenGL tutorial.

In this blog article we use OpenGL through OpenTK which wraps OpenGL for .NET. You might find the OpenTK documentation, the OpenTK tutorial, how to set up OpenTK, and this OpenTK example useful resources.

We introduce OpenTK with a simplistic example, which renders a triangle. Add the OpenTK NuGet package to your project and reference System.Drawing. Then open the following namespaces:

Create a new class OpenTKExample which inherits from the OpenTK class GameWindow:

Then overwrite the following methods:

On load set the Background to DarkBlue:

On resize of the window use the whole area to paint and set the projection:

On render the frame is where all the drawing happens. Clear the buffer and draw a triangle and three points with different colors. Instead of a triangle we could also draw many different other figures such as points, lines, etc. End the triangle mode and swap the buffers.

Add a function creating an instance of our OpenTKExample class and running it:

The result is a colored triangle on a dark blue background:


We will use the same structure in order to display our particles.

Displaying Particles Directly on the GPU

The main difference between the simple example above and the n-body simulation is that in case of the n-body simulation the data already resides in GPU memory and we do not want copy it from GPU to CPU and back to an OpenGL buffer to finally display the particles. This needs some infrastructure code. First we show how to create two buffers accessible from OpenGL and Alea GPU, in which we save our positions:

  1. Generate an array consisting of GLuint.
  2. Create a buffer using GL.GenBuffers.
  3. For every element of the array:
    1. bind the buffer;
    2. allocate the memory;
    3. get the buffer-size;
    4. unbind the buffer;
    5. register the buffer with cuGLRegisterBufferObject available from Alea GPU.

We now have two buffers which can be accessed from OpenTK and Alea GPU. We also need some resource pointers corresponding to the buffers. We obtain them by calling the Alea GPU function cuGraphicsGLRegisterBuffer inside a cuSafeCall:

If we work with the buffers outside of OpenTK we need to lock their positions. We therefore write a function lockPos which locks the positions, calls a function f on the positions and unlocks the positions again:

To share the buffers we require an Alea.Worker on the same context as used by OpenTK. The following function creates a CUDA context on the machine’s default device:

We use this function to create an Alea.Worker on the same context using the same CUDA device.

We can now initialize the positions using the lockPos function and our newly generated worker:

Recall that we read from oldPos and write to newPos in the GPU implementation. We need to swap the buffers before each integration step using the function swapPos:

In the OnRenderFrame method we swap the buffers and perform an integration step:

We bind the buffer and draw the positions:

Implementation Details

We point out some implementation details. To use the different GPU implementations and test them for different block sizes we introduce a queue of ISimulator objects. During simulation we walk through the queue with an “S” key down event.

We create the simulators and put them into the queue. Note that we also return a dispose function to clean up the simulators at the end:

Here is the logic to switch between simulators:

We use Matrix4.LookAt to inform OpenTK that our viewing position is (0,0,50) and that the viewing direction is along the z axis:

These additional comments should be helpful to understand how the positions are displayed using OpenTK and how the data is directly read from GPU memory. The previous blog posts explain the physics, the CPU implementation, the two GPU implementations and their differences. All that remains is to run the example.