Tag Archives: GPU


Algo Trading with F# and GPUs

This blog post is QuantAlea’s contribution to the F# Advent Calendar. Many thanks to Sergey Tihon for organizing this series. The code for the least correlated subset problem solved below in F# with GPUs can be found on GitHub.

QuantAlea has specialized in GPU computing with a strong focus on the financial services industry. You might have already heard of Alea GPU, QuantAlea’s professional GPU development environment for .NET. It allows compiling F#, C# or VB code directly to executable GPU code. Alea GPU has some unique features:

  • Its JIT compiler generates GPU code on the fly, which is very useful for GPU scripting.
  • Alea GPU produces binary compatible cross platform code, which runs on Windows, Linux or Mac OS X without recompilation.
  • Alea GPU is a fully fledged compiler with native debugging support and is independent of any third party tools.

Other solutions are far less complete and not as convenient to use as Alea GPU. For instance, CudaFy generates C code under the hood and requires a post-build step to compile the generated C code with a CUDA C compiler, breaking the integrity of the development process. The library managedCuda is even simpler: it is just a wrapper around the CUDA Driver API and requires that all the GPU code is programmed in C or C++.

Instead of giving an introduction to Alea GPU and describing how to program GPUs with F# in general, I thought it would be more illustrative to show how we use F# and GPUs for our own algorithmic trading at InCube Capital. F# and GPUs are a very good combination to develop sophisticated algorithmic trading strategies and to analyze and optimize their profitability. I’d like to illustrate that with a few examples.

Coding Strategies in F#

Let me begin by explaining how we develop the core trading logic in F#. For this we assume that our strategy decides to buy and sell based on a signal which has sufficient predictive power to forecast price movements and turning points. The more complex the trading logic becomes, the easier it is to miss a case, which then generates an error during paper trading or, even worse, produces wrong orders in live execution. We rely extensively on exhaustive pattern matching. It is a very powerful tool to improve the correctness and robustness of our trading logic. We first define some active patterns. Here is an active pattern for the signal transition from a previous value to the current value:
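A minimal sketch of such a pattern; the classification into Up, Down and Unchanged and the numeric signal type are assumptions:

```fsharp
// Sketch of an active pattern classifying the transition from the previous
// to the current signal value (names and cases are assumptions).
let (|Unchanged|Up|Down|) (previous, current) =
    if   current = previous then Unchanged
    elif current > previous then Up
    else Down
```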

Yet another active pattern determines the different trading phases during the day. We have three time intervals: we wait until the trading start date, then we trade, and after some time we try to liquidate the positions. If we do not manage to get rid of the position during the liquidation period, we do a forced liquidation:
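Again only a sketch, assuming the phase is derived from the current time and three configured dates:

```fsharp
open System

// Sketch of an active pattern for the trading phases of a day
// (phase names and the date parameters are assumptions).
let (|Waiting|Trading|Liquidating|ForcedLiquidation|)
        (now : DateTime, startDate : DateTime, liquidationStart : DateTime, stopDate : DateTime) =
    if   now < startDate        then Waiting
    elif now < liquidationStart then Trading
    elif now < stopDate         then Liquidating
    else ForcedLiquidation
```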

With these active patterns we can now formulate the core trading logic as follows:
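The following sketch, built on the two active patterns above, illustrates the shape of such a match; the Side and Action types and the concrete case analysis are assumptions rather than our actual strategy code:

```fsharp
type Side = Long | Short
type Action = DoNothing | OpenPosition of Side | SwitchPosition | ClosePosition

// Sketch of the core trading logic: match on the trading phase, the current
// position and the (previous, current) signal pair. The match is exhaustive,
// so the compiler warns us if a case is not handled.
let decide (now, startDate, liquidationStart, stopDate) (position : Side option) (previousSignal, currentSignal) =
    match (now, startDate, liquidationStart, stopDate), position, (previousSignal, currentSignal) with
    | Waiting, _, _                    -> DoNothing
    | Trading, None, Up                -> OpenPosition Long
    | Trading, None, Down              -> OpenPosition Short
    | Trading, Some _, (Up | Down)     -> SwitchPosition
    | Trading, _, Unchanged            -> DoNothing
    | Liquidating, Some _, (Up | Down) -> ClosePosition
    | Liquidating, _, _                -> DoNothing
    | ForcedLiquidation, Some _, _     -> ClosePosition
    | ForcedLiquidation, None, _       -> DoNothing
```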

You can see that we match on the trading periods, the existence of a position and the pair of previous and new signal value. The trading logic is very compact, descriptive and easy to understand: we open a position for the first time if there is a signal change, switch positions on a signal change, liquidate in the liquidation period only on a signal change and clear all remaining positions after the stop date. The crucial point however is that the pattern match is exhaustive, so that we know for sure that all cases are properly handled.

Back-testing and Strategy Selection with F# GPUs

Once the strategy is coded it has to be configured and validated with back-testing. I already blogged about our GPU cluster framework for bootstrapping walk forward optimization of trading strategies and machine learning, which I also presented at the NVIDIA GTC conference earlier this year.

Instead of going into more details about bootstrap walk forward optimization or machine learning I would like to present another technique to choose suitable strategies from a collection of candidates. Let’s assume that we somehow identified 20 to 50 signal configurations. Each signal gives rise to a trading strategy. We can back-test these strategies and deduce various characteristics and statistics such as the daily returns, Sharpe ratio, drawdown, mean, skewness, kurtosis and the daily return correlation between them. The idea is to select out of these 20 to 50 candidates a handful of strategies which have the least amount of correlation, the best overall Sharpe ratio and are as Gaussian as possible.

To keep it simple we just consider the case of selecting a fixed number of strategies which have the least amount of correlation. Unfortunately, there is no known closed form solution to this problem, and the straightforward approach is a search algorithm. As soon as we add additional filters and constraints as mentioned above, a full-blown search is inevitable anyway, so we did not bother looking for an alternative solution approach. Assuming that we have $n$ candidate strategies and we want to find the $k$ strategies with the least amount of correlation, we would need to examine

$$ \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!} $$

different combinations of $k$ elements chosen from a set of $n$ elements. This quickly becomes a prohibitively large number.

$k$ $\binom{20}{k}$ $\binom{50}{k}$
2 190 1225
3 1140 19600
4 4845 230300
5 15504 2118760
6 38760 15890700
7 77520 99884400
8 125970 536878650
9 167960 2505433700
10 184756 10272278170

First we have to implement a function to calculate $\binom{n}{k}$. For this we should rely on the product formula

$$ \binom{n}{k} = \prod_{i=1}^{k} \frac{n - k + i}{i} $$

Alternating between multiplication and division prevents the temporary values from growing unnecessarily. Of course we have to use uint64 because the range of int is not sufficient. The implementation is pretty straightforward. The reason why we add the attribute [<ReflectedDefinition>] will become clear soon.
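A possible reconstruction along these lines:

```fsharp
// Binomial coefficient via the product formula; multiplication and division
// alternate so the intermediate values stay small. The result after step i is
// C(n-k+i, i), hence every division is exact. [<ReflectedDefinition>] lets
// Alea GPU compile the very same function to GPU code later on.
[<ReflectedDefinition>]
let choose (n : int) (k : int) : uint64 =
    if k < 0 || k > n then 0UL
    else
        let mutable result = 1UL
        for i in 1 .. k do
            result <- result * uint64 (n - k + i) / uint64 i
        result
```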

Next we have to find an algorithm to list all the $\binom{n}{k}$ different combinations. A serial algorithm is straightforward. Here is an example with $n = 5$ and $k=3$. We start with the first combination [0; 1; 2] and then increase the last element until it cannot be increased any further. Then the preceding element is increased, always keeping the elements in increasing order.

$m$ Combination
0 [0; 1; 2]
1 [0; 1; 3]
2 [0; 1; 4]
3 [0; 2; 3]
4 [0; 2; 4]
5 [0; 3; 4]
6 [1; 2; 3]
7 [1; 2; 4]
8 [1; 3; 4]
9 [2; 3; 4]

For a parallel implementation we need a way to generate the $m$-th combination directly. Here is an implementation which does the job.
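A sketch of such a function: it walks through the candidate elements and subtracts the number of combinations skipped, as given by choose:

```fsharp
// Sketch: return the m-th combination (0-based, lexicographic order) of k
// elements out of {0, ..., n-1}. It allocates the result array, so it cannot
// run on the GPU in this form.
let subset n k (m : uint64) =
    let combination = Array.zeroCreate<int> k
    let mutable candidate = 0
    let mutable rank = m
    let mutable remaining = k
    for i in 0 .. k - 1 do
        let mutable searching = true
        while searching do
            // number of combinations that start with the current candidate
            let c = choose (n - 1 - candidate) (remaining - 1)
            if rank < c then
                combination.[i] <- candidate
                remaining <- remaining - 1
                searching <- false
            else
                rank <- rank - c
            candidate <- candidate + 1
    combination
```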

The function choose does not require any memory allocation or object creation and can therefore run as is on a GPU. This is not the case for subset. We refactor it as follows:
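A sketch of the refactored version, replacing the allocating function above; the storage of the combination is abstracted behind the two functions described next:

```fsharp
// Same algorithm as before, but the combination is written through setSub
// (and could be read back through getSub), so no memory is allocated here.
[<ReflectedDefinition>]
let subset n k (m : uint64) (getSub : int -> int) (setSub : int -> int -> unit) =
    let mutable candidate = 0
    let mutable rank = m
    let mutable remaining = k
    for i in 0 .. k - 1 do
        let mutable searching = true
        while searching do
            let c = choose (n - 1 - candidate) (remaining - 1)
            if rank < c then
                setSub i candidate
                remaining <- remaining - 1
                searching <- false
            else
                rank <- rank - c
            candidate <- candidate + 1
```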

The functions getSub : int -> int and setSub : int -> int -> unit abstract the memory for storing the combination. The attribute [<ReflectedDefinition>] instructs the F# compiler to generate a quotation expression so that our Alea GPU compiler can transform it to GPU code. The curious reader can find more details about our compilation process in the manual.

Having solved the combinatorial problem we can continue with the calculation of the correlation of a combination. For efficiency reasons we store the correlation matrix in a compact format, which only stores the lower triangular part as an array. We do not store the diagonal as it is always 1, nor the upper triangular part because the matrix is symmetric. Let’s take as an example a 3×3 correlation matrix

r/c 0 1 2
0 1.0000 0.9419 0.5902
1 0.9419 1.0000 0.0321
2 0.5902 0.0321 1.0000

We only have to store the 3 matrix elements

r/c 0 1 2
0 * * *
1 0.9419 * *
2 0.5902 0.0321 *

which we flatten into a vector [0.9419; 0.5902; 0.0321]. Here is the F# implementation:
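A sketch of the packing helpers (the function names are assumptions):

```fsharp
// Packed storage of the strictly lower triangular part of a symmetric matrix.
let packedSize n = n * (n - 1) / 2

// Offset of element (i, j) with i > j in the packed vector.
let packedOffset i j = i * (i - 1) / 2 + j

// Flatten the strictly lower triangular part of a correlation matrix,
// row by row, into a vector: [| C(1,0); C(2,0); C(2,1); ... |].
let packCorrelation (corr : float[,]) =
    let n = Array2D.length1 corr
    [| for i in 1 .. n - 1 do
         for j in 0 .. i - 1 do
           yield corr.[i, j] |]
```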

The length of the packed vector for a matrix of dimension $n$ is

$$ \mathbf{size}(n) = \frac{n(n+1)}{2} - n = \frac{n(n-1)}{2}, $$

because we do not store the diagonal. The offset in the packed vector of an element $(i,j)$ with $i > j$ is calculated as

$$ \mathbf{offset}(i, j) = \frac{i(i-1)}{2} + j = \mathbf{size}(i) + j.$$

Given a combination $c$ as calculated by the function subset we build the correlation sub-matrix again in packed storage format as follows:
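A sketch of this step, with all data access going through lambda functions so that the same code can run on the CPU and on the GPU:

```fsharp
// Build the packed sub-matrix picked out by a combination: element (i, j) of
// the sub-matrix (i > j) is element (selection i, selection j) of the full
// matrix, both in packed storage.
[<ReflectedDefinition>]
let subMatrixPacked k (matrix : int -> float) (selection : int -> int) (subMatrix : int -> float -> unit) =
    let mutable idx = 0
    for i in 1 .. k - 1 do
        for j in 0 .. i - 1 do
            let row = selection i   // row > col because the combination is increasing
            let col = selection j
            subMatrix idx (matrix (row * (row - 1) / 2 + col))
            idx <- idx + 1
```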

As in the previous code fragment we externalize the correlation matrix access and the selection indices with the functions matrix : int -> float and selection : int -> int, and provide a function subMatrix : int -> float -> unit to set the elements of the sub-matrix. Now we have everything together to calculate the amount of correlation of a sub-matrix determined by a combination. Mathematically, we want to measure how far the sub-matrix is from the $k$ dimensional identity matrix

$$ | C_{sub}(c) - \mathbf{Id}_k |, $$

where $|\cdot|$ is a suitable matrix norm. We decided to choose the Frobenius norm

$$ | C |_{F} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} |C_{ij}|^2 }. $$

There are of course different choices such as the $L_1$ or $L_{\infty}$ norm. In our special case the implementation of the Frobenius distance to the identity is very simple. It is just the square root of the sum of squares of all the elements in the packed vector (up to a constant factor of $\sqrt{2}$, because every off-diagonal element appears twice in the symmetric matrix, which does not change the minimizer):
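A possible reconstruction (the function name is an assumption):

```fsharp
// Frobenius distance of the packed sub-matrix to the identity: the diagonal
// of C_sub - Id is zero, so only the packed off-diagonal elements contribute.
[<ReflectedDefinition>]
let distToIdentityPacked size (subMatrix : int -> float) =
    let mutable sum = 0.0
    for i in 0 .. size - 1 do
        let v = subMatrix i
        sum <- sum + v * v
    sqrt sum
```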

We would like to stress that all of the critical numerical code runs on CPU and GPU, so that we have 100% code reuse. Before we present the final GPU implementation of the least correlated subset problem, let’s see how we can use our numerical building blocks to implement the algorithm for CPU execution. All we have to do is manage the array creation and call the numerical building blocks with lambda functions to access the data. Here is the code:
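A sketch of the CPU driver; the function names follow the text, the bodies are reconstructions:

```fsharp
// All combinations' distances to the identity; only valid while choose n k
// fits into an int.
let allDistToIdentity n k (packedCorr : float[]) =
    let total = int (choose n k)
    let selection = Array.zeroCreate<int> k
    let subMatrix = Array.zeroCreate<float> (k * (k - 1) / 2)
    Array.init total (fun m ->
        subset n k (uint64 m) (fun i -> selection.[i]) (fun i v -> selection.[i] <- v)
        subMatrixPacked k (fun i -> packedCorr.[i]) (fun i -> selection.[i]) (fun i v -> subMatrix.[i] <- v)
        distToIdentityPacked subMatrix.Length (fun i -> subMatrix.[i]))

// Index of the smallest distance, mapped back to the combination itself.
let leastCorrelated n k (packedCorr : float[]) =
    let dists = allDistToIdentity n k packedCorr
    let best = dists |> Array.mapi (fun i d -> i, d) |> Array.minBy snd |> fst
    let selection = Array.zeroCreate<int> k
    subset n k (uint64 best) (fun i -> selection.[i]) (fun i v -> selection.[i] <- v)
    selection
```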

Note that the function allDistToIdentity only works for small values of $n$ and $k$ where $\binom{n}{k}$ is in the range of int. The function leastCorrelated returns the best combination.

Let’s move on to the final GPU implementation. Up to now no particular GPU knowledge was required apart from adding the attribute [<ReflectedDefinition>]. Now we need to know some basic CUDA terms. For those of you who are GPU freshmen I recommend that you briefly look at this example, where you can learn the basic GPU coding principles. To be able to follow the example you should know that the GPU has a memory space that is different from the CPU and that the GPU runs functions, called kernels, in parallel in many threads. These threads are grouped into blocks of threads. Each thread in a block is identified with its threadIdx and each block with its blockIdx. The size of the block is given by blockDim, whereas the number of blocks can be read from gridDim.

We start with a very simple implementation of the calculation of the distance to the identity for many combinations in parallel. We write a so-called GPU kernel, that is, a function which will be executed on the GPU in parallel in many different instances:
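A sketch of such a kernel, using a grid-stride loop over the combinations and per-thread scratch space in global memory:

```fsharp
// Assumes open Alea.CUDA. Each thread processes combinations m, m + stride,
// m + 2*stride, ... and reuses its own slice of the scratch buffers.
[<ReflectedDefinition>]
let distToIdentityKernelDevice n k (matrix : deviceptr<float>)
                               (selection : deviceptr<int>) (subMatrix : deviceptr<float>)
                               (numCombinations : int) (dist : deviceptr<float>) =
    let packedK = k * (k - 1) / 2
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let selBase = tid * k
    let subBase = tid * packedK
    let mutable m = tid
    while m < numCombinations do
        subset n k (uint64 m) (fun i -> selection.[selBase + i]) (fun i v -> selection.[selBase + i] <- v)
        subMatrixPacked k (fun i -> matrix.[i]) (fun i -> selection.[selBase + i]) (fun i v -> subMatrix.[subBase + i] <- v)
        dist.[m] <- distToIdentityPacked packedK (fun i -> subMatrix.[subBase + i])
        m <- m + stride
```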

Note that we use the types deviceptr<float> and deviceptr<int> to access data on the GPU. You immediately see that we call our functions subset and subMatrixPacked with lambda functions such as fun i -> selection.[i] or fun i v -> subMatrix.[i] <- v. In this implementation the temporary storage for selection and subMatrix is in the so-called global device memory, which is the largest memory pool on a GPU, but also the slowest in terms of access latency. How do we now call such a kernel function? For this we have to allocate storage on the GPU, copy data to the GPU, define the launch parameters and call the kernel function. Here is the code which does all these steps:
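A sketch of the host side, assuming open Alea.CUDA and an existing Worker instance; the exact way to query the number of streaming multiprocessors may differ between Alea GPU versions:

```fsharp
let distToIdentityGpu (worker : Worker) n k (packedCorr : float[]) =
    let numCombinations = int (choose n k)
    let blockSize = 128
    // 8 blocks per streaming multiprocessor
    let gridSize = 8 * worker.Device.Attributes.MULTIPROCESSOR_COUNT
    let numThreads = blockSize * gridSize
    use dMatrix    = worker.Malloc(packedCorr)                              // copy input to the GPU
    use dSelection = worker.Malloc<int>(numThreads * k)                     // per-thread scratch
    use dSubMatrix = worker.Malloc<float>(numThreads * (k * (k - 1) / 2))   // per-thread scratch
    use dDist      = worker.Malloc<float>(numCombinations)
    let lp = LaunchParam(gridSize, blockSize)
    worker.Launch <@ distToIdentityKernelDevice @> lp
        n k dMatrix.Ptr dSelection.Ptr dSubMatrix.Ptr numCombinations dDist.Ptr
    dDist.Gather()                                                          // copy results back
```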

You see from the use binding that all our allocated memory on the GPU is disposable. The kernel function distToIdentityKernelDevice is passed as an expression to the kernel launch worker.Launch <@ distToIdentityKernelDevice @>. Here worker is essentially a GPU. For simplicity we choose 128 threads per block and launch 8 times more blocks than streaming multiprocessors on the GPU. The last step dDist.Gather() copies the calculated distances back to the CPU memory.

Let’s do a first optimization. As already mentioned global device memory has a high access latency. Registers and thread local memory, as long as it does not spill over to device memory, are faster. Yet another memory type is shared memory. As suggested by its name it can be accessed from multiple threads, i.e. shared between all the threads of a block. This is very useful for the original correlation matrix $C$. Here is the implementation using shared memory for C as well as for the temporaries selection and subMatrix:

A variable amount of shared memory is allocated as part of the kernel launch; we show how to do that later. In this new kernel we first get the externally allocated shared memory with shared = __shared__.ExternArray(), calculate the offsets for the temporaries, copy the matrix C to sharedC in shared memory in parallel and then call __syncthreads() to wait for its completion. The rest of the kernel is identical. This performs much better. Note also the small change that we added an additional argument m0 for the offset where to start the calculations.

So far we just calculated the distance to the identity but did not yet do the minimization. Determining the smallest element in a vector is a classical reduction problem. Alea GPU provides a very efficient and generic reduction implementation. Because we have to find the index of the smallest element we have to create a special type.
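A sketch of such a type (the field names are assumptions):

```fsharp
// Pairs the index of a combination with its distance, so that a minimum
// reduction can keep track of where the minimum was found.
[<Struct>]
type IndexAndValue =
    val Index : int
    val Value : float
    new (index, value) = { Index = index; Value = value }
```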

Because instances of this type have to live on the GPU it must be a value type, as enforced by the attribute [<Struct>]. Now we can easily find the index of the minimal value in an array with the parallel reduction from the Alea GPU Unbound library. First create an instance of DeviceReduceModule:

Then we create a reducer for arrays of length up to len and launch the reduction

Currently, because of some library limitations you have to set the context type to “threaded” as follows:

Let us now combine this with the kernel to calculate the distance to the identity.

This implementation of leastCorrelatedWithTransform first calculates the amount of shared memory sharedSize that is required. It then partitions the $\binom{n}{k}$ different combinations into blocks of size partitionSize and loops over all the partitions

The final reduction is done on the CPU. To call the reduction kernel we have to transform the vector of distances calculated by distToIdentityKernelShared to a vector of IndexAndValue.

We can avoid the transform and directly generate a vector of IndexAndValue. Here is the code:

Of course the GPU implementation is more difficult than the CPU version. But it can handle much larger problems faster. The important point is that it shares all the core functions with the CPU implementation. Let us have a look at the performance of our implementation.

n k CPU GPU Speedup
20 5 33.65 ms 2.70 ms 12.40
20 6 69.40 ms 6.41 ms 10.81
20 7 169.02 ms 22.57 ms 7.49
20 8 334.72 ms 40.40 ms 8.28
20 9 513.01 ms 57.76 ms 8.88
50 3 25.52 ms 2.93 ms 8.70
50 4 369.69 ms 57.76 ms 12.32
50 5 5455.00 ms 329.20 ms 16.57

Optimizing with Constant and Local Memory

This is not the end of the optimization; there is still room for improvement. First, because shared memory is always scarce we should try to save it. Because the algorithm only reads from C, we can just as well place it in constant memory, which has a read latency comparable to shared memory. For this we have to use the cuda computational workflow to define the constant memory resource of a GPU module. Note that the constant memory size must be a GPU compile time constant. This is not a big problem because we can always JIT-compile a new configuration. The next optimization is to store some temporary data, such as the selection indices, in thread local memory. Note that the size of the local memory also has to be a GPU compile time constant. Here is the code of the version that uses constant memory and local memory:

You see that the kernel also becomes slightly shorter. The performance of the smaller problems is pretty much the same, but for larger problems we see an improvement.

n k CPU GPU Speedup
50 3 25.52 ms 2.19 ms 11.65
50 4 369.69 ms 10.31 ms 35.85
50 5 5455.00 ms 170.89 ms 31.92

There is still more optimization potential which could increase performance by another factor: each single distance calculation can be done with a parallel reduction too. We leave this for a future improvement.

Debugging and Profiling

To enable profiling of JIT compiled code you have to set the JIT compile level

Then all that is required is to build the code in debug mode and place a breakpoint in the GPU code. Start the NVIDIA debugger from the Visual Studio menu NSIGHT and choose Start CUDA Debugging. Data across different warps is best analyzed in the CUDA warp watch window. Here is an example:

(Screenshot: data across different warps in the CUDA warp watch window)

Profiling is crucial to identify further optimization potential. To facilitate the profiling of your GPU code Alea GPU is well integrated with the NVIDIA profiler. For details we refer to the profiling chapter in our tutorial.

Conclusion

We have shown how we use F# and GPUs for our own internal algorithmic trading. F# has a lot of extremely helpful language features, such as active patterns and exhaustive pattern matching, which allow us to program strategies faster and make them more expressive and more robust.

The usability of GPUs is illustrated with the practically relevant example of finding the subset of least correlated strategies. We went through several optimization and refactoring steps to illustrate how we develop GPU kernels, starting with simple implementations and adding more and more optimizations. The key point is that with the right level of abstraction and by using lambda functions we can share critical code between CPU and GPU in an elegant manner. This shortens development time and reduces maintenance costs.


Finally – GPUs are Coming to Azure!

GPU cloud computing is gaining more and more momentum because a growing number of applications and use cases rely on fast enterprise GPU hardware, such as deep learning applied to image and speech recognition or natural language processing, data mining, photo-realistic real-time rendering, and more. Some of these applications also benefit from scaling to multi-GPU servers and to GPU server clusters with high speed network interconnects.

Public cloud market leader Amazon AWS was the first to provide GPU accelerated instances. Its g2.2xlarge instance has 8 CPU cores and 1 GRID K520 GPU, the larger g2.8xlarge instance has 32 CPU cores and 4 GRID K520 GPUs, available on demand at hourly rates.

Other providers such as Nimbix, Peer1 Hosting, Penguin Computing or Rapid Switch have been offering GPU powered infrastructure for quite some time.

This summer IBM SoftLayer launched a new server line with Nvidia Tesla K80 GPUs. And just a few days later the Chinese e-commerce giant Alibaba announced GPU support on its Aliyun public cloud.

Finally, last Tuesday at AzureCon 2015, Microsoft announced the new N-series GPU-enabled virtual machines. They use K80 GPUs for compute-intensive workloads and M60 GPUs with the recently released Nvidia Grid 2.0 technology for virtualized graphics. With this hardware infrastructure Microsoft Azure can now deliver GPU accelerated computations and graphics to any connected device.

(Image: Azure N-series)

The new GPU cloud infrastructure of Azure uses GPU virtualization with Hyper-V and DDA technology. Fast RDMA network connections can be used to build clusters of multiple GPU nodes for HPC workloads.

(Image: GPU cloud architecture)

For more details check out the Nvidia press release, Mark Staveley’s presentation at AzureCon, and Jen-Hsun Huang speaking about GPUs for the Azure Cloud.

For us at QuantAlea this is great news. We are already developing Alea GPU V3, which will be yet another major release of our GPU compiler infrastructure for .NET, C# and F#. The key features of V3 will be a clever new memory management system, a simplified user interface and API so that even novice C# developers can easily write GPU code, and of course tight integration with the Azure GPU infrastructure and higher level Azure cloud services.

This news also had a very positive effect on the share price of Nvidia, which gained 2.7% to close at USD 24.36, hitting a 52-week high of USD 24.58 on Wednesday after the announcement. With all these events happening in just a few months, the blog post published on The Platform in early July this year could be extended with a few more sections.

We are curious to see how and when Google will enhance its public cloud with GPU computing capabilities.


Play with Particles III – Visualize using OpenGL

In the last two posts we implemented three different solvers for the n-body problem, one on the CPU and two different GPU versions. In this post we visualize the n-body simulation graphically with OpenGL to illustrate the cross platform capabilities of Alea GPU and its interoperability with OpenGL.

A Simple OpenGL Example with OpenTK

There exist essentially two different 3D visualization technologies: Microsoft’s DirectX and OpenGL. DirectX targets the Windows platform. Examples showing the interoperability of Alea GPU and DirectX can be found in the sample gallery of Alea GPU. The benefit of OpenGL is platform independence. A good starting point is the OpenGL tutorial.

In this blog article we use OpenGL through OpenTK which wraps OpenGL for .NET. You might find the OpenTK documentation, the OpenTK tutorial, how to set up OpenTK, and this OpenTK example useful resources.

We introduce OpenTK with a simple example which renders a triangle. Add the OpenTK NuGet package to your project and reference System.Drawing. Then open the following namespaces:
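With the OpenTK NuGet package referenced, the opens could look like this:

```fsharp
open System
open System.Drawing
open OpenTK
open OpenTK.Graphics.OpenGL
```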

Create a new class OpenTKExample which inherits from the OpenTK class GameWindow.

Then overwrite the following methods:

On load, set the background to DarkBlue.

On resize of the window, use the whole area to paint and set the projection.

Rendering the frame is where all the drawing happens: clear the buffer and draw a triangle and three points with different colors. Instead of a triangle we could also draw many other figures such as points, lines, etc. Finally, end the triangle mode and swap the buffers.
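Putting the three overrides together, a minimal sketch of the class could look as follows; the window size, colors, camera and vertex coordinates are assumptions, and only the triangle is drawn:

```fsharp
type OpenTKExample() =
    inherit GameWindow(800, 600)

    // On load: set the background color.
    override this.OnLoad(e : EventArgs) =
        base.OnLoad(e)
        GL.ClearColor(Color.DarkBlue)

    // On resize: use the whole window as viewport and set the projection.
    override this.OnResize(e : EventArgs) =
        base.OnResize(e)
        GL.Viewport(0, 0, this.Width, this.Height)
        let mutable projection =
            Matrix4.CreatePerspectiveFieldOfView(
                float32 Math.PI / 4.0f,
                float32 this.Width / float32 this.Height, 0.1f, 100.0f)
        GL.MatrixMode(MatrixMode.Projection)
        GL.LoadMatrix(&projection)

    // On render: clear, draw one colored triangle, swap the buffers.
    override this.OnRenderFrame(e : FrameEventArgs) =
        base.OnRenderFrame(e)
        GL.Clear(ClearBufferMask.ColorBufferBit ||| ClearBufferMask.DepthBufferBit)
        let mutable modelview = Matrix4.LookAt(Vector3(0.0f, 0.0f, 5.0f), Vector3.Zero, Vector3.UnitY)
        GL.MatrixMode(MatrixMode.Modelview)
        GL.LoadMatrix(&modelview)
        GL.Begin(PrimitiveType.Triangles)
        GL.Color3(Color.Red);   GL.Vertex3(-1.0f, -1.0f, 0.0f)
        GL.Color3(Color.Green); GL.Vertex3( 1.0f, -1.0f, 0.0f)
        GL.Color3(Color.Blue);  GL.Vertex3( 0.0f,  1.0f, 0.0f)
        GL.End()
        this.SwapBuffers()
```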

Add a function creating an instance of our OpenTKExample class and running it:
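For example:

```fsharp
let runOpenTKExample () =
    use window = new OpenTKExample()
    window.Run(60.0)
```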

The result is a colored triangle on a dark blue background:

(Image: a colored triangle on a dark blue background)

We will use the same structure in order to display our particles.

Displaying Particles Directly on the GPU

The main difference between the simple example above and the n-body simulation is that in the case of the n-body simulation the data already resides in GPU memory, and we do not want to copy it from the GPU to the CPU and back to an OpenGL buffer to finally display the particles. This needs some infrastructure code. First we show how to create two buffers accessible from OpenGL and Alea GPU, in which we save our positions:

  1. Generate an array consisting of GLuint.
  2. Create a buffer using GL.GenBuffers.
  3. For every element of the array:
    1. bind the buffer;
    2. allocate the memory;
    3. get the buffer-size;
    4. unbind the buffer;
    5. register the buffer with cuGLRegisterBufferObject available from Alea GPU.

We now have two buffers which can be accessed from OpenTK and Alea GPU. We also need some resource pointers corresponding to the buffers. We obtain them by calling the Alea GPU function cuGraphicsGLRegisterBuffer inside a cuSafeCall:

If we work with the buffers outside of OpenTK we need to lock their positions. We therefore write a function lockPos which locks the positions, calls a function f on the positions and unlocks the positions again:

To share the buffers we require an Alea.Worker on the same context as used by OpenTK. The following function creates a CUDA context on the machine’s default device:

We use this function to create an Alea.Worker on the same context using the same CUDA device.

We can now initialize the positions using the lockPos function and our newly generated worker:

Recall that we read from oldPos and write to newPos in the GPU implementation. We need to swap the buffers before each integration step using the function swapPos:

In the OnRenderFrame method we swap the buffers and perform an integration step:

We bind the buffer and draw the positions:

Implementation Details

We point out some implementation details. To use the different GPU implementations and test them for different block sizes we introduce a queue of ISimulator objects. During simulation we walk through the queue with an “S” key down event.

We create the simulators and put them into the queue. Note that we also return a dispose function to clean up the simulators at the end:

Here is the logic to switch between simulators:

We use Matrix4.LookAt to inform OpenTK that our viewing position is (0,0,50) and that the viewing direction is along the z axis:
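For instance, along these lines (the up vector is an assumption):

```fsharp
// Camera at (0, 0, 50), looking towards the origin along the z axis.
let mutable modelview = Matrix4.LookAt(Vector3(0.0f, 0.0f, 50.0f), Vector3.Zero, Vector3.UnitY)
GL.MatrixMode(MatrixMode.Modelview)
GL.LoadMatrix(&modelview)
```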

These additional comments should be helpful to understand how the positions are displayed using OpenTK and how the data is directly read from GPU memory. The previous blog posts explain the physics, the CPU implementation, the two GPU implementations and their differences. All that remains is to run the example.


Play with Particles II – GPU Implementation

In our last post we considered the CPU implementation of n-body simulations. In this post we will focus on the parallelization strategy and two GPU implementations using a dynamic and a static thread block size.

Parallelization Strategy

Recall the CPU implementation of the n-body simulation. The algorithm has essentially two parts:

  1. Calculating and summing up the accelerations for each pair of particles.
  2. Integrating velocity and position given the calculated accelerations.

The expensive part is the calculation of the acceleration which is $O\left(n^2\right)$ compared to $O\left(n \right)$ for the integration. As the acceleration for every pair of particles has to be calculated, we need to read every particle position many times, leading to $2n^2-n$ global GPU memory reads. Accessing global GPU memory can become a critical bottleneck. Fortunately, global memory is not the only memory type an NVIDIA GPU has:

  • Memory accessible globally on the GPU:
    • Global memory
    • Constant memory
    • Texture memory
  • Memory accessible from all threads in a given block:
    • Shared memory
  • Memory only accessible from a given thread:
    • Registers
    • Local memory

The less accessible the memory is, the higher its bandwidth. We will not go deeper into the different memory types; you can find more information here, here, here, here, and here. The take home message is that shared memory is a lot faster than global memory. The plan is to load, in each thread block, every position only once from global memory and store it in shared memory. We can then read the $2n^2-n$ positions from shared memory after having loaded all positions once. More precisely, we do the following steps:

  • Every block reads the first blockSize particles into shared memory. Every thread is responsible for reading one element.
  • Synchronize, i.e. wait until all threads of the block have stored their values in shared memory.
  • Calculate the acceleration from the first blockSize particles. Every thread calculates the acceleration for a different particle.
  • Synchronize again and iterate this loop until the acceleration from all particles has been calculated.

Now the acceleration for the first blockSize particles has been calculated. Running several blocks takes care of the rest.

We use the datatype float4 for the positions and save the particle masses in the w-coordinate. The use of float4 data allows coalesced memory access to the global device memory, which improves the effective memory bandwidth.

A sketch of the F# kernel code looks as follows:
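The following sketch shows the tiling structure; it assumes Alea GPU's float4 type with an (x, y, z, w) constructor and a softening term to avoid division by zero, and it omits details such as particle counts that are not a multiple of the block size:

```fsharp
// Assumes open Alea.CUDA. One thread per particle; shared memory holds one
// tile of blockDim.x positions at a time.
[<ReflectedDefinition>]
let integrateBodiesKernel numBodies numTiles (deltaTime : float32) (softeningSquared : float32)
                          (newPos : deviceptr<float4>) (oldPos : deviceptr<float4>) (vel : deviceptr<float4>) =
    let sharedPos = __shared__.ExternArray<float4>()
    let index = blockIdx.x * blockDim.x + threadIdx.x
    let bodyPos = oldPos.[index]
    let mutable ax = 0.0f
    let mutable ay = 0.0f
    let mutable az = 0.0f
    for tile in 0 .. numTiles - 1 do
        // every thread loads one particle of the current tile into shared memory
        sharedPos.[threadIdx.x] <- oldPos.[tile * blockDim.x + threadIdx.x]
        __syncthreads()
        // accumulate the acceleration contributions of this tile
        for j in 0 .. blockDim.x - 1 do
            let p = sharedPos.[j]
            let dx = p.x - bodyPos.x
            let dy = p.y - bodyPos.y
            let dz = p.z - bodyPos.z
            let distSqr = dx*dx + dy*dy + dz*dz + softeningSquared
            let invDist = 1.0f / sqrt distSqr
            let s = p.w * invDist * invDist * invDist   // the mass is stored in w
            ax <- ax + dx * s
            ay <- ay + dy * s
            az <- az + dz * s
        __syncthreads()
    if index < numBodies then
        // integrate velocity and position
        let v = vel.[index]
        let vx = v.x + ax * deltaTime
        let vy = v.y + ay * deltaTime
        let vz = v.z + az * deltaTime
        vel.[index]    <- float4(vx, vy, vz, v.w)
        newPos.[index] <- float4(bodyPos.x + vx * deltaTime,
                                 bodyPos.y + vy * deltaTime,
                                 bodyPos.z + vz * deltaTime,
                                 bodyPos.w)
```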

We pass the memory size during the kernel launch using the launch parameters:
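For example, with the shared memory size in bytes as the third launch parameter:

```fsharp
// Assumes open Alea.CUDA.
let launchParamFor numBodies =
    let blockSize = 128
    let numBlocks = (numBodies + blockSize - 1) / blockSize
    LaunchParam(numBlocks, blockSize, blockSize * sizeof<float4>)   // shared memory in bytes
```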

If the size of the shared memory is known at compile time, it can be allocated directly inside the kernel and the memory size does not need to be given during the kernel launch. We come back to this in the static block size implementation below.

GPU Implementation with Dynamic Block Size

We organize our GPU resources using the class GPUModule:

We implement the calculation of the particles’ accelerations using shared memory, as described above, in a class method:

The GPU kernel calculates the accelerations using the method ComputeBodyAccel, which we already introduced in the previous post, and updates the velocities and positions accordingly. It reads the positions from oldPos and writes them into newPos, hence these two variables have to be flipped before calling the method a second time. The idea behind having two position variables will become clear when displaying the positions using OpenGL through OpenTK, which allows rendering and computing at the same time; see the next blog post.

As in the CPU version we add two methods to create implementations for ISimulator and ISimulatorTester. Note that for ISimulatorTester we switch the positions internally, whereas for ISimulator this has to be done externally.

GPU Implementation for Static Block Size

In the previous implementation the blockSize and the numTiles are not GPU compile time constants. This makes the code more flexible, but the GPU compiler can do fewer optimizations; in particular:

  •  the for-loop in ComputeBodyAccel cannot be unrolled;
  •  the size of the shared memory is not known to the compiler.

We optimize the GPU implementation so that the block size is constant at compile time.

We will only discuss the differences from the implementation with dynamic block size. The SimulatorModule inherits again from GPUModule and has an additional constructor argument for the block size. As a consequence, we cannot AOT compile this class.

To enable AOT compilation we inherit from SimulatorModule and set the block size to a fixed constant:

The shared memory size is now fixed. In the method ComputeBodyAccel we can declare it as follows:
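For a compile-time block size of 128 this could look like:

```fsharp
// Statically sized shared memory: one float4 per thread of the block.
let sharedPos = __shared__.Array<float4>(128)
```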

The launch parameters simplify as well because the shared memory size does not need to be specified anymore:

Having fixed the number of tiles, hence the number of loop iterations, we can add a statement for loop unrolling to the innermost loop in the method ComputeBodyAccel. Note that loop unrolling is currently only possible for F# but not for C#:

The performance is measured in terms of frames per second achieved by the OpenGL visualization, which is presented in the next blog post. We compare different block sizes and the static and dynamic block size implementations for F# and C# on Mono, installed on an Ubuntu Linux machine with a Titan Black GPU:

(Chart: frames per second for different block sizes, F# and C#, dynamic and static block size implementations)

Apparently a block size of 128 is optimal for this GPU, and the static block size implementation slightly improves the performance.