Simple GPU Functions with Alea GPU

This post shows how to write simple GPU accelerated data parallel functions with F# and Alea GPU.

Short Guide to CUDA

It’s likely that your computer has not only a CPU but also a GPU that can be used for computations. While your CPU has a few cores (~ 1 – 8) that can perform a wide variety of computations efficiently, your GPU has many cores (~ 100 – 1000). These cores are simpler than CPU cores and are optimized for Single Instruction Multiple Data (SIMD) workloads, which perform the same calculation on a wide range of input.

As of today the CPU (also called the host) and the GPU (also called the device) often have separate memory. A typical calculation therefore consists of the following steps:

  • transfer data from host to device
  • start the calculations on the device
  • transfer the result from device back to host

Writing and Running GPU Computations

Assume we have an array of single precision numbers in memory and want to calculate the sine of each element of the array on the GPU. Allocating memory on the device and transferring data to the device can be done with
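A sketch of these two steps, assuming the Alea GPU v2 API (the worker name and the sample input array are illustrative choices; exact namespaces may differ between versions):

```fsharp
open Alea.CUDA

// The default worker wraps the first CUDA-capable GPU in the machine.
let worker = Worker.Default

// Sample input: single precision numbers on the host.
let input = [| 0.0f .. 7.0f |]

// Allocate device memory and copy the input from host to device.
use dInput = worker.Malloc(input)
```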

Malloc can be of type int -> DeviceMemory<'T> or 'T [] -> DeviceMemory<'T>. The first version allocates memory for the given number of 'T’s. The second version allocates the memory and transfers the data of the array from host to device. A function that transfers data from host to already allocated device memory is Scatter. Once we also have allocated memory for the result using
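Allocating uninitialized device memory for the result might look like this (same assumed API as above):

```fsharp
// Allocate device memory for input.Length elements of type float32;
// no data is transferred here.
use dOutput = worker.Malloc<float32>(input.Length)
```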

we launch the yet to be written GPU function with
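For example (applySin is the GPU function defined later in the post; the grid and block sizes are illustrative):

```fsharp
// 16 blocks of 256 threads each.
let lp = LaunchParam(16, 256)
worker.Launch <@ applySin @> lp dInput.Ptr dOutput.Ptr input.Length
```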

The launch parameter essentially determines how many threads are scheduled for execution on the GPU and how they are organized. A more detailed discussion can be found e.g. here.

For our simple example it is enough to know that we schedule nThread = gridDim.x * blockDim.x threads and each thread is identified with the index iThread = blockIdx.x * blockDim.x + threadIdx.x where 0 <= iThread < nThread.

The following function applies an operator f to each element of an input array on the device and writes the result to an output array on the device
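A sketch of such a function, using the thread indexing described above in a grid-stride loop (names and signature are my choice):

```fsharp
[<ReflectedDefinition>]
let apply (f : float32 -> float32)
          (input : deviceptr<float32>) (output : deviceptr<float32>) (n : int) =
    let iThread = blockIdx.x * blockDim.x + threadIdx.x
    let nThread = gridDim.x * blockDim.x
    // Each thread handles the elements iThread, iThread + nThread, ...
    // so the kernel works for any array length n.
    let mutable i = iThread
    while i < n do
        output.[i] <- f input.[i]
        i <- i + nThread
```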

Here we meet the type deviceptr<'T>, which represents memory on the device.

The [<ReflectedDefinition>] annotation causes the F# compiler to include a representation of apply in the form of an Expr<_> in the assembly. The Alea GPU compiler can take this Expr<_> and compile it to executable GPU code.

The operator f is an argument of the apply function; its type is float32 -> float32. We can now pass the sin operator to the apply function
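One way to write this, assuming apply is the kernel described above (the AOTCompile annotation is discussed below):

```fsharp
[<ReflectedDefinition; AOTCompile>]
let applySin (input : deviceptr<float32>) (output : deviceptr<float32>) (n : int) =
    apply sin input output n
```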

The arguments of the resulting applySin function are of primitive types such as int, float, float32 or deviceptr<_> of basic types. We are thus left with only data that can be transferred from the host to the device. This contrasts with the operator f in the apply function, which cannot be transferred from host to device as a function object.

For applySin we can use ahead-of-time compilation with the AOTCompile annotation, so that the corresponding GPU code is generated at compile time and embedded in the assembly. In contrast, the apply function still has an undetermined parameter f. Its code can only be compiled once the operator f is known, so we have to resort to just-in-time compilation.

We launch the computations on the GPU with
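With illustrative names (lp for the launch parameter, dInput and dOutput for the device memory allocated earlier):

```fsharp
worker.Launch <@ applySin @> lp dInput.Ptr dOutput.Ptr input.Length
```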

and then transfer the result from device to host with Gather.
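For example:

```fsharp
// Copy the results back into a new host array.
let result : float32 [] = dOutput.Gather()
```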

The full example is
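A complete sketch, again against the Alea GPU v2 API (a real project also needs a reference to the Alea.CUDA package, and names such as sinOnGpu are my choice):

```fsharp
open Alea.CUDA

// Apply an operator f to each element of a device array,
// using a grid-stride loop over the input.
[<ReflectedDefinition>]
let apply (f : float32 -> float32)
          (input : deviceptr<float32>) (output : deviceptr<float32>) (n : int) =
    let iThread = blockIdx.x * blockDim.x + threadIdx.x
    let nThread = gridDim.x * blockDim.x
    let mutable i = iThread
    while i < n do
        output.[i] <- f input.[i]
        i <- i + nThread

// Concrete kernel: the operator is fixed to sin, so the
// GPU code can be compiled ahead of time.
[<ReflectedDefinition; AOTCompile>]
let applySin (input : deviceptr<float32>) (output : deviceptr<float32>) (n : int) =
    apply sin input output n

let sinOnGpu () =
    let worker = Worker.Default
    let input = [| 0.0f .. 7.0f |]
    // Host -> device transfer and result allocation.
    use dInput = worker.Malloc(input)
    use dOutput = worker.Malloc<float32>(input.Length)
    // Schedule 16 blocks of 256 threads and run the kernel.
    let lp = LaunchParam(16, 256)
    worker.Launch <@ applySin @> lp dInput.Ptr dOutput.Ptr input.Length
    // Device -> host transfer.
    let result = dOutput.Gather()
    Array.iter2 (fun x y -> printfn "sin %.1f = %f" x y) input result
```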

and the result should look like
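For an input of the numbers 0.0f through 7.0f, printed as "sin x = y", the computed sines would be approximately:

```
sin 0.0 = 0.000000
sin 1.0 = 0.841471
sin 2.0 = 0.909297
sin 3.0 = 0.141120
sin 4.0 = -0.756802
sin 5.0 = -0.958924
sin 6.0 = -0.279415
sin 7.0 = 0.656987
```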