Tag Archives: .NET

GPU Accelerated Deep Learning also Wanted for .NET

We took the chance to do a second Channel 9 recording on our GPU accelerated machine learning project in the Microsoft offices at Times Square, New York City. It was a great experience to do the recording with Seth Juarez. Many thanks, Seth!

Several deep learning libraries already exist, but none of them targets .NET. Alea TK is a new open source project that aims to develop a complete GPU accelerated deep learning stack for .NET. It is built on top of Alea GPU and is designed from the ground up to be GPU accelerated, easy to extend and easy to deploy. It is in an early phase. Contributors and developers are welcome! The recording explains why we started the project and what we plan to do with it in the future.

Check out Alea TK on our project web site and on GitHub.

Radically Simplified GPU Programming with C#

We were very happy to do a Channel 9 recording for our new Alea GPU version 3 in the Microsoft offices at Times Square, New York City. It was a great experience to do the recording with Seth Juarez. Many thanks, Seth!

GPU computing is all about number crunching and performance. Do you have a lot of parallel calculations? Then try running them on the GPU with C#. With the new Alea GPU parallel GPU methods, utilizing the power of GPUs is as easy as changing a few lines of code. No GPU in your box? Don’t worry, you can get them from Azure or other cloud providers. In the recording I explained how easy it is to run C# code on the GPU, with full debugging support in Visual Studio.

Check out Alea GPU on our product web site.

A new Deep Learning Stack for .NET

I gave a talk at GTC Europe 2016 in Amsterdam about our new open source project Alea TK.

Alea TK is a library for general purpose numerical computing and deep learning based on tensors and tensor expressions. It supports imperative calculations as well as symbolic calculations with auto-differentiation. It is designed from the ground up with CUDA acceleration in mind. It is easy to extend, install and deploy, and is well suited for rapid prototyping and new model development. Alea TK is built entirely in .NET and C#. It relies on the Alea GPU compiler and uses NVIDIA’s cuDNN library to accelerate many standard deep neural network primitives.

Alea TK is still a young project. I explained the main design principles and presented the framework, with a particular focus on GPU kernel fusion technology.

Check out the slides and our poster.

Deficiencies of .NET CLR JIT Compilers

Another Reason to Use a GPU!

I recently gave a talk at an F# meetup hosted by Jet.com about deficiencies of .NET CLR JIT compilers.

We know that C# or F# often does not perform at the level of native C++ because the CLR JIT compiler does not optimize the code well enough. In the worst cases we lose a factor of 2 to 4 against native code. To investigate this problem in more depth, you can check how the .NET CLR JIT compilers compile simple loops and nested loops. It is not enough to just look at the MSIL code. We have to dig deep into the optimized assembly code generated by the CLR JIT compilers. We find that the CLR JIT compilers are not capable of removing array bounds checks or optimizing array access patterns of the form a[i*lda + j] in simple nested loops. This is very bad news for performance critical code in .NET.
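To make the access pattern concrete, here is a minimal C++ sketch. The function names sum_naive and sum_hoisted are hypothetical and not taken from the talk. The second function spells out by hand the kind of transformation an optimizing native compiler would typically apply on its own to the first one: the row offset i*lda is computed once per row instead of once per element.

```cpp
#include <cstddef>
#include <vector>

// Naive form: the flat index i * lda + j is recomputed for every element.
double sum_naive(const std::vector<double>& a, std::size_t rows,
                 std::size_t cols, std::size_t lda) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            sum += a[i * lda + j];
    return sum;
}

// Hand-optimized form: the row offset is hoisted out of the inner loop,
// which then walks a contiguous range through a pointer.
double sum_hoisted(const std::vector<double>& a, std::size_t rows,
                   std::size_t cols, std::size_t lda) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rows; ++i) {
        const double* row = a.data() + i * lda;
        for (std::size_t j = 0; j < cols; ++j)
            sum += row[j];
    }
    return sum;
}
```

Both functions return the same value. The sketch assumes lda >= cols and that the vector holds at least rows * lda elements.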

Fortunately, you can get around these problems by moving performance critical code to the GPU. The Floyd-Warshall all-pairs shortest path algorithm serves as an example: an advanced GPU implementation written entirely in C# and compiled with Alea GPU gives a significant speedup. It runs at the same speed as a native GPU version coded in C++ and 125 times faster than the initial C# version!
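For readers who do not have the algorithm in mind, the textbook sequential form of Floyd-Warshall can be sketched in C++ as follows. This is only the plain triple loop on a row-major matrix, not the advanced GPU implementation from the talk. The names floyd_warshall and kNoEdge are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// "No edge" marker; infinity survives addition without overflow.
const double kNoEdge = std::numeric_limits<double>::infinity();

// In-place Floyd-Warshall on a row-major n x n distance matrix.
// After the call, dist[i * n + j] is the length of the shortest
// path from node i to node j, or kNoEdge if no path exists.
void floyd_warshall(std::vector<double>& dist, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                dist[i * n + j] = std::min(dist[i * n + j],
                                           dist[i * n + k] + dist[k * n + j]);
}
```

Note that the inner loops use exactly the a[i*lda + j] access pattern discussed above, which is one reason this algorithm is a good stress test for a JIT compiler.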

Developing such efficient algorithms is not straightforward at all and requires some experience. We therefore take a step back and show that simpler problems can often be solved efficiently with parallel-for and parallel aggregate patterns running on the GPU with a dramatic performance increase of a factor of 50 to 100.

Here are the slides.

GPU Computing on .NET at Speed of CUDA C++

A Performance Comparison of Alea GPU C# and CUDA C++

In the last post we gave a sneak preview of the upcoming Alea GPU version 3. Alea GPU version 3 sets completely new standards for GPU development on .NET. The highlights of the new version are:

  1. Higher level abstractions such as GPU LINQ, a GPU parallel-for and parallel aggregation.

  2. Automatic memory management and data transfer, which makes GPU programming easier and removes a lot of boilerplate code.

  3. Integration with the .NET type system so that .NET arrays can be used directly in GPU kernels.

  4. CUDA programming model integrated with C# and F#.

In the last post we discussed the new API and its usability. This time we benchmark Alea GPU against multi-threaded .NET and CUDA C++. For the performance study we use an application from quantitative finance that calculates the price of an Asian option with Monte Carlo simulation based on the celebrated Black-Scholes model. Nowadays Monte Carlo simulation is common practice for calculating the prices of complex financial products and risk figures. GPUs significantly reduce the calculation time.

We look at different implementation versions using different features of Alea GPU.

  1. Implicit memory management: we use the Alea GPU parallel-for and parallel aggregate together with implicit memory management, i.e. we let Alea GPU manage the memory allocation and data transfer between the CPU and the GPU.

  2. Explicit memory management: we use the Alea GPU parallel-for and parallel aggregate but we explicitly copy the data to the GPU.

  3. Transform-reduce: we combine the parallel-for and parallel aggregate into a single parallel transform-reduce and use implicit memory management.

  4. LINQ work-flow: we express the calculation using Alea GPU LINQ, which allows us to do kernel fusing and in a future version also more aggressive memory transfer optimizations.

  5. Constant memory: we use constant memory to store data that is read-only on the GPU.

The CPU implementation is fully multi-threaded and also uses the fast CPU random number generators that come with the NVIDIA cuRAND library. We run the computations on an Intel i7-4770 CPU at 3.5 GHz and on different GPU configurations. Let us look at the timings in milliseconds:

Timings .NET GPU C#

As expected, the Kepler K40 is the fastest GPU thanks to its 2880 CUDA cores and very good double precision support. The version using implicit memory management is pretty close to the explicit one. Using constant memory also slightly improves performance, because it allows data reads to be cached. Now we compare the speed-up relative to a multi-threaded C# implementation:

Speed-up relative to .NET C#

Even with a rather weak GeForce GT 750M mobile GPU we gain a performance increase of roughly a factor of 4. The high-end K40 GPU boosts performance by a factor of 100. The version using explicit memory management scales best to multiple GPUs, as shown by running it on two GeForce GTX Titan Black GPUs as well as on an AWS instance with four GRID K520 GPUs. In the current version of Alea GPU the implicit memory management uses a global tracker, which is not yet optimized for multiple GPUs. This explains why the implicit memory management version scales poorly to multiple GPUs.

Now let us look at the performance of Alea GPU compared to a CUDA C++ implementation. Again we run the computations on GPUs of different hardware generations. Here are the timings in milliseconds of the Asian option pricing in C# using explicit management and a CUDA C++ implementation:

Timings .NET GPU C# versus CUDA C++

Investigating the runtime behavior in more detail with the Visual Profiler reveals a few interesting points. The CUDA C++ Asian option pricing kernel uses 37 to 40 registers, depending on the CUDA compute architecture, whereas the .NET version compiled with Alea GPU uses only 30 registers. This affects the number of blocks in flight and the overall occupancy. The .NET version can use 52 thread blocks and achieves 100% occupancy. The CUDA C++ version can only launch 39 thread blocks and achieves 75% occupancy. However, the kernel execution time is almost the same. Regarding the overall execution time, the .NET version is slightly slower because launching a kernel from .NET incurs a small overhead.
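The link between register usage and occupancy can be sketched with some simple arithmetic. The C++ below is a simplified model under stated assumptions, not the profiler's exact calculation: it assumes Kepler-class limits of 65536 registers and 2048 resident threads per multiprocessor, and it ignores register allocation granularity, shared memory and the cap on resident blocks. The names blocks_per_sm and occupancy are hypothetical.

```cpp
#include <algorithm>

// Per-multiprocessor limits assumed for a Kepler-class GPU
// (compute capability 3.5); these are assumptions, not from the post.
const int kRegistersPerSm = 65536;
const int kMaxThreadsPerSm = 2048;

// Resident blocks per multiprocessor, limited by registers and threads.
int blocks_per_sm(int regs_per_thread, int threads_per_block) {
    int by_registers = kRegistersPerSm / (regs_per_thread * threads_per_block);
    int by_threads = kMaxThreadsPerSm / threads_per_block;
    return std::min(by_registers, by_threads);
}

// Occupancy as the fraction of the maximum resident threads in use.
double occupancy(int regs_per_thread, int threads_per_block) {
    return static_cast<double>(blocks_per_sm(regs_per_thread, threads_per_block)
                               * threads_per_block) / kMaxThreadsPerSm;
}
```

With a hypothetical block size of 512 threads, 30 registers per thread allow 4 resident blocks per multiprocessor and 100% occupancy, while 37 to 40 registers allow only 3 blocks and 75% occupancy. The 4:3 ratio is consistent with the reported 52 versus 39 thread blocks, which would correspond to 13 multiprocessors. The block size and the multiprocessor count are inferred here and are not stated in the post.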

For those who are interested in further details we sketch the C# as well as the CUDA C++ implementation now.

GPU and CPU Implementation with Alea GPU

We like to share the core algorithmic code between the CPU and GPU implementation. We achieve this by encapsulating the main calculation in a delegate.

This delegate calculates the payoff along path i. The price computation simulates multiple sample batches so that we can scale up the number of simulations. For each batch we generate numSamplesPerBatch normally distributed random numbers with the NVIDIA cuRAND library. Then we use the Alea GPU parallel-for to generate the paths and the prices using the AsianCall delegate. Then we apply the Alea GPU average reduction to calculate the batch mean. The final price is the average of all batch prices.
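The original C# listing is not reproduced here. As a rough illustration of the structure just described, the following CPU-only C++ sketch computes the payoff of an arithmetic-average Asian call along one path and averages it over batches. It is written under several assumptions: the names asian_call and price, the parameter lists, the layout of the random numbers (one standard normal draw per path and time step) and the omission of discounting are all choices made for this sketch, and the standard library generator stands in for cuRAND.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Payoff of an arithmetic-average Asian call along path i.
// gaussian holds one standard normal draw per path and time step.
double asian_call(std::size_t i, double spot0, double strike,
                  const std::vector<double>& dt,
                  const std::vector<double>& rates,
                  const std::vector<double>& volas,
                  const std::vector<double>& gaussian) {
    const std::size_t steps = dt.size();
    double spot = spot0;
    double sum = 0.0;
    for (std::size_t k = 0; k < steps; ++k) {
        const double sigma = volas[k];
        const double drift = (rates[k] - 0.5 * sigma * sigma) * dt[k];
        const double shock = sigma * std::sqrt(dt[k]) * gaussian[i * steps + k];
        spot *= std::exp(drift + shock);  // log-normal Black-Scholes step
        sum += spot;
    }
    return std::max(sum / steps - strike, 0.0);
}

// Mean payoff over num_batches batches of samples_per_batch paths each.
// Undiscounted; a sequential loop stands in for the GPU parallel-for.
double price(std::size_t num_batches, std::size_t samples_per_batch,
             double spot0, double strike,
             const std::vector<double>& dt,
             const std::vector<double>& rates,
             const std::vector<double>& volas, unsigned seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    std::vector<double> gaussian(samples_per_batch * dt.size());
    double batch_mean_sum = 0.0;
    for (std::size_t b = 0; b < num_batches; ++b) {
        for (double& g : gaussian) g = normal(rng);  // fresh draws per batch
        double sum = 0.0;
        for (std::size_t i = 0; i < samples_per_batch; ++i)
            sum += asian_call(i, spot0, strike, dt, rates, volas, gaussian);
        batch_mean_sum += sum / samples_per_batch;
    }
    return batch_mean_sum / num_batches;  // average of the batch means
}
```

The sketch assumes at least one time step, one batch and one sample per batch. Because the per-path function only reads its inputs and depends on the path index i alone, the loop over i is exactly the part that a parallel-for can distribute across GPU threads.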

Note that PriceGpu has the attribute [GpuManaged] because we let Alea GPU do the memory transfer for us. We only create storage for the random numbers and the prices on the device. All the input data, such as dt, rates and volas, are captured in a closure and moved automatically to the GPU. The explicit deallocation of the GPU memory with Gpu.Free is optional but good practice: GPU memory is page-locked, so to avoid unnecessary GPU memory limitations, it is good to release unused GPU memory immediately and not to wait for the next garbage collection run.

We can very easily program a CPU version.

This time we use the .NET Parallel.For and reuse the core path generation and payoff logic from AsianCall. The other versions also share the same core algorithm AsianCall. We skip further code listings.

Implementation with CUDA C++

The CUDA C++ version looks very similar, except that there is no GPU parallel-for, so we program the kernel ourselves. Because CUDA C++ has no array types as in .NET, we have to pass the array lengths as additional arguments to the kernel.

The orchestration of the calculations is now a bit more involved because there is no automatic memory management and transfer. Also, because cudaMemcpy takes the number of bytes to copy, we have to calculate the size ourselves. Typecasts further complicate the code. The average price per batch is calculated with the reduction function of CUDA Thrust.
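The byte-count bookkeeping mentioned above looks roughly like the following host-only C++ sketch. It does not call the CUDA runtime: std::memcpy stands in for cudaMemcpy, since both take untyped pointers and a size in bytes. The helpers byte_size and copy_doubles are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Number of bytes occupied by count elements of type T.
template <typename T>
std::size_t byte_size(std::size_t count) { return count * sizeof(T); }

// Copies count doubles through untyped pointers; std::memcpy stands in
// for cudaMemcpy here, since both take a size in bytes.
void copy_doubles(void* dst, const void* src, std::size_t count) {
    std::memcpy(dst, src, byte_size<double>(count));
}
```

Passing the element count instead of the byte count is an easy mistake to make in this style of code, which is part of what automatic memory management removes.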

Conclusion

We have seen that Alea GPU is very competitive compared to CUDA C++. Using our higher level abstractions such as parallel-for and parallel aggregate, C# programmers can implement performance critical algorithms targeting GPUs without explicit knowledge of CUDA C++ but with the same performance benefits as native CUDA C++.

Your Feedback

We are interested to hear your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.

If you would like to try Alea GPU V3 already, come and join us on April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.