# GPU Accelerated Deep Learning also Wanted for .NET

We took the chance and did a second Channel 9 recording on our GPU accelerated Machine Learning project in the Microsoft offices at Time Square New York City. It was a great experience to do the recording with Seth Juarez. Many thanks Seth!

There exist already several deep learning libraries but none of them targets .NET. Alea TK is a new open source project that aims to develop a complete GPU accelerated deep learning stack for .NET. It is built on top of Alea GPU and is designed from ground up to be GPU accelerated, easy to extend and to deploy. It is in an early phase. Contributors and developers are welcome! The recording explains why we started the project and what we plan to do with it in the future.

Check out Alea TK on our project web site and on GitHub.

# A new Deep Learning Stack for .NET

I gave a talk at GTC Europe 2016 in Amsterdam about our new open source project Alea TK.

Alea TK is a library for general purpose numerical computing and Deep Learning based on tensors and tensor expressions supporting imperative calculations as well as symbolic calculations with auto-differentiation. It is designed from ground up with CUDA acceleration in mind. It is easy to extend, install and deploy and is perfectly suited for rapid prototyping and new model development. Alea TK is built entirely in .NET and C#. It relies on the Alea GPU compiler and uses NVIDIA’s cuDNN library to accelerate many standard deep neural networks primitives.

Alea TK is still a young project. I explained the main design principles and presented the framework, with a particular focus on GPU kernel fusion technology.

Check out the slides and our poster.

# Don’t Get Lost in the Forest IV – Performance Test

Having discussed and implemented a random forest algorithm in the last three posts, we will finalize this series by comparing the performance of our two implementations with the random forest implementation of the Python package sklearn. For the test we will use a randomly created data set consisting of 20000 samples with 20 features and two classes. We will train 1000 trees and make two examples with different maximum tree-depth:

1. depth = 1 (stumps)
2. depth = 5

In addition to the discussed GPU implementation, we have a version using CUDA-Streams, a way to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism), which helps hidding time lost by copying memory. We will leave this topic for a possible future post, but will use the implementation for performance tests.

The tests have been performed on a machine with an Intel Core i7-4771 CPU @ 3.50GHz, 16 GB RAM and a NVIDIA Titan Black GPU.

getting the following result:

For Python we use the following script, reading in the same test data as in .NET.

It builds the 1000 trees in between 43 s (single CPU core) and 9.6 s (all CPU cores). This is much faster than our CPU implementation, which was expected as our code, though reasonably well optimized, is not expected to catch up with the highly optimized sklearn library. Our GPU implementation however still gets a nice speed up of over 4 against the most parallel Python run.

For deeper trees with depth 5 we get the following results:

For these parameters the Python code needs between 3min 40s (single CPU core) and 44 s (all CPU cores). Here the speedup is smaller, what can be explained as the GPU implementation is parallel on the number of samples and they shrink with every split.

# Don’t Get Lost in the Forest III – The GPU Implementation

In the previous post we looked at the CPU implementation of a random forest training algorithm. We also discussed two parallelization strategies at different levels:

1. Build the independent trees in parallel.
2. Search for the optimal split for all features in parallel.

The first strategy is straightforward. Here we focus on the second strategy and discuss how it can be implemented on the GPU. We can reuse most of the CPU code. We only have to replace the function optimizeFeatures with a proper GPU implementation:

This function performs the following calculations:

• Create a cumulative histograms by:
• Find and prepare the indices with non-zero weights.
• Expand the weights: get initial weights for given features and the sample indices before sorting and sort them according to their class (see below for more details).
• Calculate the histograms using a cumulative sum function applied to the expanded weights.
• Calculate the entropy and it’s minimum.

In the following we discuss each step in more detail using example data consisting of four features with four samples

 feature 1 feature 2 feature 3 feature 4 label 1 3 7 8 0 2 2 2 9 1 3 1 1 7 1 4 5 3 5 0

and the weights

 1 1 0 1

After sorting the features we get four triples feature values, labels and index before sorting. This corresponds to the SortedFeatures of the discriminated union type LabeledFeatureSet discussed in the previous post:

 1,0,0 1,1,2 1,1,2 5,0,3 2,1,1 2,1,1 2,1,1 7,1,2 3,1,2 3,0,0 3,0,3 8,0,0 4,0,3 5,0,3 7,0,0 9,1,1

After sorting we do not need the feature values anymore. It is therefore enough to keep the two matrices, the first one storing the labels

 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1

and the second one the indices

 0 2 2 3 1 1 1 2 2 0 3 0 3 3 0 1

# Find Non-Zero Indices

This function prepares the input data such that all samples with a non-zero weight, respectively indices of these samples, are stored consecutively.

It calls the GPU kernel LogicalWeightExpansionKernel, which creates for every feature a vector of elements in ${0,1}$ depending on weights being non-zero (looking at the expanded weights, i.e. initial weights before sorting). Here is the actual GPU code:

For our sample data this kernel produces the following result:

 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1

After calculating a column-wise cumulative sum we obtain

 1 0 0 1 2 1 1 0 2 2 2 2 3 3 3 3

The function FindNonZeroIndicesKernel

then creates the matrix of indices

 0 1 1 0 1 2 2 2 3 3 3 3

It records the next index for which the feature has a non-zero weight. Note that the index here is related to the features after sorting.

# Expand the Weights and Create Histograms

The WeightExpansionKernel uses the indices matrix and returns for every sample with non-zero-weight its weight, sorted according to the class (label) the sample belongs to. Given the following weights and labels:

 Weight: Label: 1 0 2 1 1 1 2 0

We get the following expanded weights where a zero indicates that a feature has a different class:

 1 0 0 2 0 2 1 0

The weight expansion function

transforms our example data to

 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1

To calculate the cumulative histogram for each feature over all labels we use a row-wise cumulative sum (row-wise because the matrix layout in the code is different to the layout in our example), which is implemented with the Alea.Unbound primitive ConsumeRangeConsecutiveInclusive

and a cumulative sum kernel

The cumulative histograms are then used for the entropy calculation:

 1 0 0 1 1 1 1 2 2 2 2 2 2 3 3 2 3 3 3 2 3 3 3 3

# Calculate the Entropy and it’s Minimum

The entropy for every possible split and for every feature is calculated from the cumulative histograms with the EntropyKernel function:

The optimal split for every feature, i.e. the split leading to the lowest entropy, can now be calculated with the MinimumEntropy function

To find tie minimal entropy the function LaunchOptAndArgOptKernelDoubleWithIdcs uses a the Alea.Unbound library function MultiChannelReducePrimitive. Then the optimal splits are checked for edge cases in the same way as in the function optimizeFeatures.

This explains the main functionality of the GPU implementation. In our last blog post, we will consider the performance of the different implementations and look at an implementations in Python based on scikit-learn.

# Don’t Get Lost in the Forest II – The CPU Implementation

This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.

First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.

# Implementation of the Forecasting Function

Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record Split consists of a feature index Feature and a Threshold

where type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of Leaf objects (representing predicted labels) and Node objects (consisting of a triple of a split and two sub-trees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

A sample is a vector of feature values, which are all float values:

In the case of the Iris data set we have the four values for the features sepal length and width and the petal length stored in an array [|sepalLength; sepalWidth; petalLength; petalWidth|].

We can predict the label for a given sample using a function traversing the tree recursively:

A random forest is then represented by the type Model which consists of an array of trees and the number of classes:

We can predict the label of a sample using a random forest model by calling the function forecastTree for every tree in the forest and apply a majority vote on the resulting labels:

Here maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

We will revisit this function later and improve its performance with an alternative implementation.

# CPU Implementation of the Training Algorithm

Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

1. optimizeFeatures returns the optimal split position and its entropy for every feature;
2. trainTrees uses the function optimizeFeatures and trains decision trees.

A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. As the old position of the feature served as the index for accessing the weights, we save them as indices. We introduce a discriminated union type to represent the data.

The canonical form of the input data is represented by LabeledSamples, which is an array of tuples Sample * Label. Recall that a Sample is just an array of float vectors and Label is an integer value representing the class of the sample. For the Iris data set a LabeledSample is a tuple of a float vector of four values and an integer ([|sepalLength; sepalWidth; petalLength; petalWidth|], 1).

The type SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a FeatureArray, the corresponding labels in Labels and the indices representing the old position of the features.

Note that both the Sample and FeatureArray are both float arrays. The reason why we distinguish them is because a Sample is an observation of one feature value per feature, whereas a FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows would correspond to samples, whereas the columns would correspond to feature arrays.

<

h2>The Function optimizeFeatures

The function optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a mode, which determines the parallelization scheme, the options for the entropy-optimization, the number of classes, the labels per feature, the indices per features, and the weights. It performs the following computations:

• Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
• Project all features and weights which are non-zero to the beginning of the array.
• Implement the helper functions
• mapper which, depending on the mode, is either Array.mapi or Array.Parallel.mapi.
• translator which translates the index after projection to the post projection index, or the last index in case for the last non-zero index.
• A mapping function, the argument for the mapper, which for all non-masked features performs the following:
• Get the weights.
• Create the histogram for every split possibility for the given feature.
• Mask splits which are clearly not optimal. This is not necessary, but helps speeding up the CPU implementation.
• Calculate the entropy for remaining split possibilities.
• Return the lowest entropy and its index and translates the index to its value before the projection using the translator function.
• Select a subset of the features using a FeatureSelector.
• Calculate the entropy for the remaining split possibilities using the helper functions.

Here is the full source code:

## The Function trainTrees:

The function trainTrees trains decision trees:

• Calculate entropy, split index and feature index with the minimal split entropy.
• Find split threshold using the function  featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
• Set weights for samples with value below respectively above the split threshold to zero for the lower respectively upper branch of the next iteration.
• If depth = 0 then return the tree, else train new trees.

Note that each FeatureArray that is given in the argument sortedTrainingSet has to be sorted in ascending order.

To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.

Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.

## CPU Parallelization

For the CPU implementation we exploit two parallelization strategies:

1. Use Array.Parallel.mapi instead of Array.mapi.
2. Partition the trees in n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows to use different devices and even to mix CPU and GPU units.

## Performance Relevant Implementations (minArgmin)

We need to have a well performing CPU implementation so that we can later compare the CPU and GPU implementation. To improve the performance of the CPU implementation we introduce the entropyMask mentioned earlier and have to make a compromise between a clean functional implementation and a performance optimized implementation.

A first bottleneck is the function minAndArgMin. A compact functional implementation uses Seq.minBy:

The optimized implementation which relies on the .NET IEnumerable interface is about 10% faster:

Note that using sequences rather than arrays gives us some additional performance benefits.

# Applying the Code to the Iris Data Set

We now apply our CPU random forest classifier to the Iris data set. First we read the data.

Then we splitting the data in to test and training data:

We set up the parameters for the training algorithm:

The SquareRootFeatureSelector chooses randomly $\sqrt{n}$ features from the existing $n$ features. In case of the Iris data set this results in two features out of the existing four features, randomly selected for every tree in the forest. Now we train the model:

Once the model is trained we can use it to predict the labels on the test data:

You can call these functionality by calling: