Tag Archives: Performance


Deficiencies of .NET CLR JIT Compilers

Another Reason to Use a GPU!

I recently gave a talk at an F# meetup hosted by Jet.com about deficiencies of .NET CLR JIT compilers.

We know that often C# or F# does not perform at the level of native C++ because the CLR JIT compiler is not optimizing the code well enough. In worst cases we loose a factor of 2 to 4 against native code. To investigate this problem in more depth you can check how .NET CLR JIT compilers compile simple loops and nested loops. It is not enough to just look at MSIL code. We have to dig deep into the optimized assembly code, generated by the CLR JIT compilers. We find that the CLR JIT compilers are not capable to remove array bound checks or optimize array access patterns of the form a[i*lda + j] in simple nested loops. This is very bad news for performance critical code in .NET.

Fortunately, you can get around these problems by moving performance critical code to the GPU. The Floyd-Warshall all shortest path algorithm serves as an example: an advanced GPU implementation fully written in C# and compiled with Alea GPU gives a significant speedup. It runs at the same speed as a native GPU version coded in C++ and 125 times faster than the initial C# version!

Developing such efficient algorithms is not straightforward at all and requires some experience. We therefore take a step back and show that simpler problems can often be solved efficiently with parallel-for and parallel aggregate patterns running on the GPU with a dramatic performance increase of a factor of 50 to 100.

Here are the slides.


Don’t Get Lost in the Forest IV – Performance Test

Having discussed and implemented a random forest algorithm in the last three posts, we will finalize this series by comparing the performance of our two implementations with the random forest implementation of the Python package sklearn. For the test we will use a randomly created data set consisting of 20000 samples with 20 features and two classes. We will train 1000 trees and make two examples with different maximum tree-depth:

  1. depth = 1 (stumps)
  2. depth = 5

In addition to the discussed GPU implementation, we have a version using CUDA-Streams, a way to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism), which helps hidding time lost by copying memory. We will leave this topic for a possible future post, but will use the implementation for performance tests.

The tests have been performed on a machine with an Intel Core i7-4771 CPU @ 3.50GHz, 16 GB RAM and a NVIDIA Titan Black GPU.

getting the following result:

For Python we use the following script, reading in the same test data as in .NET.

It builds the 1000 trees in between 43 s (single CPU core) and 9.6 s (all CPU cores). This is much faster than our CPU implementation, which was expected as our code, though reasonably well optimized, is not expected to catch up with the highly optimized sklearn library. Our GPU implementation however still gets a nice speed up of over 4 against the most parallel Python run.

For deeper trees with depth 5 we get the following results:

For these parameters the Python code needs between 3min 40s (single CPU core) and 44 s (all CPU cores). Here the speedup is smaller, what can be explained as the GPU implementation is parallel on the number of samples and they shrink with every split.