# Don’t Get Lost in the Forest IV – Performance Test

Having discussed and implemented a random forest algorithm in the last three posts, we will finalize this series by comparing the performance of our two implementations with the random forest implementation of the Python package sklearn. For the test we will use a randomly created data set consisting of 20000 samples with 20 features and two classes. We will train 1000 trees and make two examples with different maximum tree-depth:

1. depth = 1 (stumps)
2. depth = 5

In addition to the discussed GPU implementation, we have a version using CUDA-Streams, a way to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism), which helps hidding time lost by copying memory. We will leave this topic for a possible future post, but will use the implementation for performance tests.

The tests have been performed on a machine with an Intel Core i7-4771 CPU @ 3.50GHz, 16 GB RAM and a NVIDIA Titan Black GPU.

getting the following result:

For Python we use the following script, reading in the same test data as in .NET.

It builds the 1000 trees in between 43 s (single CPU core) and 9.6 s (all CPU cores). This is much faster than our CPU implementation, which was expected as our code, though reasonably well optimized, is not expected to catch up with the highly optimized sklearn library. Our GPU implementation however still gets a nice speed up of over 4 against the most parallel Python run.

For deeper trees with depth 5 we get the following results:

For these parameters the Python code needs between 3min 40s (single CPU core) and 44 s (all CPU cores). Here the speedup is smaller, what can be explained as the GPU implementation is parallel on the number of samples and they shrink with every split.

# Don’t Get Lost in the Forest III – The GPU Implementation

In the previous post we looked at the CPU implementation of a random forest training algorithm. We also discussed two parallelization strategies at different levels:

1. Build the independent trees in parallel.
2. Search for the optimal split for all features in parallel.

The first strategy is straightforward. Here we focus on the second strategy and discuss how it can be implemented on the GPU. We can reuse most of the CPU code. We only have to replace the function optimizeFeatures with a proper GPU implementation:

This function performs the following calculations:

• Create a cumulative histograms by:
• Find and prepare the indices with non-zero weights.
• Expand the weights: get initial weights for given features and the sample indices before sorting and sort them according to their class (see below for more details).
• Calculate the histograms using a cumulative sum function applied to the expanded weights.
• Calculate the entropy and it’s minimum.

In the following we discuss each step in more detail using example data consisting of four features with four samples

 feature 1 feature 2 feature 3 feature 4 label 1 3 7 8 0 2 2 2 9 1 3 1 1 7 1 4 5 3 5 0

and the weights

 1 1 0 1

After sorting the features we get four triples feature values, labels and index before sorting. This corresponds to the SortedFeatures of the discriminated union type LabeledFeatureSet discussed in the previous post:

 1,0,0 1,1,2 1,1,2 5,0,3 2,1,1 2,1,1 2,1,1 7,1,2 3,1,2 3,0,0 3,0,3 8,0,0 4,0,3 5,0,3 7,0,0 9,1,1

After sorting we do not need the feature values anymore. It is therefore enough to keep the two matrices, the first one storing the labels

 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1

and the second one the indices

 0 2 2 3 1 1 1 2 2 0 3 0 3 3 0 1

# Find Non-Zero Indices

This function prepares the input data such that all samples with a non-zero weight, respectively indices of these samples, are stored consecutively.

It calls the GPU kernel LogicalWeightExpansionKernel, which creates for every feature a vector of elements in ${0,1}$ depending on weights being non-zero (looking at the expanded weights, i.e. initial weights before sorting). Here is the actual GPU code:

For our sample data this kernel produces the following result:

 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1

After calculating a column-wise cumulative sum we obtain

 1 0 0 1 2 1 1 0 2 2 2 2 3 3 3 3

The function FindNonZeroIndicesKernel

then creates the matrix of indices

 0 1 1 0 1 2 2 2 3 3 3 3

It records the next index for which the feature has a non-zero weight. Note that the index here is related to the features after sorting.

# Expand the Weights and Create Histograms

The WeightExpansionKernel uses the indices matrix and returns for every sample with non-zero-weight its weight, sorted according to the class (label) the sample belongs to. Given the following weights and labels:

 Weight: Label: 1 0 2 1 1 1 2 0

We get the following expanded weights where a zero indicates that a feature has a different class:

 1 0 0 2 0 2 1 0

The weight expansion function

transforms our example data to

 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1

To calculate the cumulative histogram for each feature over all labels we use a row-wise cumulative sum (row-wise because the matrix layout in the code is different to the layout in our example), which is implemented with the Alea.Unbound primitive ConsumeRangeConsecutiveInclusive

and a cumulative sum kernel

The cumulative histograms are then used for the entropy calculation:

 1 0 0 1 1 1 1 2 2 2 2 2 2 3 3 2 3 3 3 2 3 3 3 3

# Calculate the Entropy and it’s Minimum

The entropy for every possible split and for every feature is calculated from the cumulative histograms with the EntropyKernel function:

The optimal split for every feature, i.e. the split leading to the lowest entropy, can now be calculated with the MinimumEntropy function

To find tie minimal entropy the function LaunchOptAndArgOptKernelDoubleWithIdcs uses a the Alea.Unbound library function MultiChannelReducePrimitive. Then the optimal splits are checked for edge cases in the same way as in the function optimizeFeatures.

This explains the main functionality of the GPU implementation. In our last blog post, we will consider the performance of the different implementations and look at an implementations in Python based on scikit-learn.

# Don’t Get Lost in the Forest II – The CPU Implementation

This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.

First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.

# Implementation of the Forecasting Function

Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record Split consists of a feature index Feature and a Threshold

where type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of Leaf objects (representing predicted labels) and Node objects (consisting of a triple of a split and two sub-trees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

A sample is a vector of feature values, which are all float values:

In the case of the Iris data set we have the four values for the features sepal length and width and the petal length stored in an array [|sepalLength; sepalWidth; petalLength; petalWidth|].

We can predict the label for a given sample using a function traversing the tree recursively:

A random forest is then represented by the type Model which consists of an array of trees and the number of classes:

We can predict the label of a sample using a random forest model by calling the function forecastTree for every tree in the forest and apply a majority vote on the resulting labels:

Here maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

We will revisit this function later and improve its performance with an alternative implementation.

# CPU Implementation of the Training Algorithm

Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

1. optimizeFeatures returns the optimal split position and its entropy for every feature;
2. trainTrees uses the function optimizeFeatures and trains decision trees.

A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. As the old position of the feature served as the index for accessing the weights, we save them as indices. We introduce a discriminated union type to represent the data.

The canonical form of the input data is represented by LabeledSamples, which is an array of tuples Sample * Label. Recall that a Sample is just an array of float vectors and Label is an integer value representing the class of the sample. For the Iris data set a LabeledSample is a tuple of a float vector of four values and an integer ([|sepalLength; sepalWidth; petalLength; petalWidth|], 1).

The type SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a FeatureArray, the corresponding labels in Labels and the indices representing the old position of the features.

Note that both the Sample and FeatureArray are both float arrays. The reason why we distinguish them is because a Sample is an observation of one feature value per feature, whereas a FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows would correspond to samples, whereas the columns would correspond to feature arrays.

## The Function optimizeFeatures

The function optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a mode, which determines the parallelization scheme, the options for the entropy-optimization, the number of classes, the labels per feature, the indices per features, and the weights. It performs the following computations:

• Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
• Project all features and weights which are non-zero to the beginning of the array.
• Implement the helper functions
• mapper which, depending on the mode, is either Array.mapi or Array.Parallel.mapi.
• translator which translates the index after projection to the post projection index, or the last index in case for the last non-zero index.
• A mapping function, the argument for the mapper, which for all non-masked features performs the following:
• Get the weights.
• Create the histogram for every split possibility for the given feature.
• Mask splits which are clearly not optimal. This is not necessary, but helps speeding up the CPU implementation.
• Calculate the entropy for remaining split possibilities.
• Return the lowest entropy and its index and translates the index to its value before the projection using the translator function.
• Select a subset of the features using a FeatureSelector.
• Calculate the entropy for the remaining split possibilities using the helper functions.

Here is the full source code:

## The Function trainTrees:

The function trainTrees trains decision trees:

• Calculate entropy, split index and feature index with the minimal split entropy.
• Find split threshold using the function  featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
• Set weights for samples with value below respectively above the split threshold to zero for the lower respectively upper branch of the next iteration.
• If depth = 0 then return the tree, else train new trees.

Note that each FeatureArray that is given in the argument sortedTrainingSet has to be sorted in ascending order.

To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.

Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.

## CPU Parallelization

For the CPU implementation we exploit two parallelization strategies:

1. Use Array.Parallel.mapi instead of Array.mapi.
2. Partition the trees in n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows to use different devices and even to mix CPU and GPU units.

## Performance Relevant Implementations (minArgmin)

We need to have a well performing CPU implementation so that we can later compare the CPU and GPU implementation. To improve the performance of the CPU implementation we introduce the entropyMask mentioned earlier and have to make a compromise between a clean functional implementation and a performance optimized implementation.

A first bottleneck is the function minAndArgMin. A compact functional implementation uses Seq.minBy:

The optimized implementation which relies on the .NET IEnumerable interface is about 10% faster:

Note that using sequences rather than arrays gives us some additional performance benefits.

# Applying the Code to the Iris Data Set

We now apply our CPU random forest classifier to the Iris data set. First we read the data.

Then we splitting the data in to test and training data:

We set up the parameters for the training algorithm:

The SquareRootFeatureSelector chooses randomly $\sqrt{n}$ features from the existing $n$ features. In case of the Iris data set this results in two features out of the existing four features, randomly selected for every tree in the forest. Now we train the model:

Once the model is trained we can use it to predict the labels on the test data:

You can call these functionality by calling:

leading to the following output:

Now it is your turn. Download the code, play with it and have fun!

# Don’t Get Lost in the Forest I – The Algorithm

With this blog post we start a series on the machine learning algorithm random forest and how to accelerate the training process with GPUs. For simplicity we focus on continuous features. If you are interested in categorical features you might want read Mathias’ blog.

In this first post we explain what a decision tree is, how to go from a decision tree to a random forest and discuss the advantages of random forests over decision trees. We finish by describing a random forest training algorithm. In a second post we go through the important aspects of the algorithm’s CPU implementation and will do the same for the GPU implementation in a third post. The series will end with a post on the performance of the different implementations.

Random forests were developed by Leo Breiman and Adele Cutler, their dominance in the field of machine learning has been illustrated nicely in a blog post at Strata 2012: “Ensembles of decision trees (often known as random forests) have been the most successful general-purpose algorithm in modern times” when it comes to “automatically identify the structure, interactions, and relationships in the data”. Furthermore it is noted that “most Kaggle competitions have at least one top entry that heavily uses this approach”. Random forests also have been the algorithm of choice for the body part recognition of Microsoft’s Kinect, a motion sensing input devices for Xbox consoles and Windows PCs. More details can be found in the paper Why did Microsoft decide to use Random Forests in the Kinect.

Random forests consist of an ensemble of decision trees. We therefore start to look at decision trees. A simple decision tree could be used to decide what you need to take with you when leaving the house:

This tree has been constructed by reason only, not involving any data. In machine learning however the trees are created such that they optimally predict a given sample of data (training data).

For this series we will use the well-known Iris data set as training data. It contains four features (the length and the width of sepals and petals) of three Iris species:

Random forests can play their strength when having hundreds of different features. But few features make a data set much better accessible as an example. Checking the scatter plot of all features, you can see the clustering of the different types.

The scatter plot illustrates that Setosa is nicely distinguishable from the other two types. Keeping Virginica apart from Versicolor is much more difficult. A decision tree predicting the Iris specie from these features might look like:

This tree does not classify all the examples correctly. We could change this by increasing the maximum tree depth. By doing so, the tree can predict the sample data 100% correctly, but only by learning the noise in the samples. In the most extreme case the algorithm is equivalent to a dictionary containing every sample. This is knows as over-fitting and leads to bad results when using it for out of sample predictions. In order to overcome over-fitting, we can train multiple decision trees, by introducing weights for the samples and only considering a random subset of the features for each split. The final decision of the random forest will be determined by a majority vote on the trees’ predictions. This technique is also known as bagging. It helps to decrease the variance (error from sensitivity to noise in the training set) without increasing the bias (error introduced from not enough flexibility of the model).

# Algorithm

We first explain, how to build decision trees. In a second part we look at how to create a random forest.

## Building Decision Trees

To build good decision trees a discrimination measure is required. A common choice is to use the split information or entropy as described in Efficient C4.5 as measure. It is definded as

$$\rm{Split}(T) = \sum_{i,b} \frac{T_{i,b}}{T} \log_2 \left( \frac{T_{i,b}}{T} \right),$$

where $T_{i,b}$ is the number of samples with label $i$ in branch $b$ after the split and $T$ is the total number of samples. Other possible metrics are the information gain, the Gini impurity and the variance reduction.

The search for a decision tree which is a global optimum for the discrimination measure is known as a NP-complete problem. We therefore focus on the a local optimization where for each split the local optimum is chosen, i.e. for every node we check which feature to use for splitting and where to split to get the lowest split information.

The steps to build a decision tree are:

1. For all features (e.g. for Iris data set the sepal-length, sepal-width, petal-length, petal-width), take all samples and order them by value.
2. For every split possibility, i.e. two successive samples with different values, calculate the split entropy:
– Make a histogram based on labels for both branches after the split.
– Calculate the split entropy $\rm{Split}(T)$.
3. Choose the split for the feature and split threshold with the smallest split entropy.
4. Apply the algorithm from step 1 on all resulting sub-branches, until either all features have the same label, or the predefined maximal three depth has been reached.

We illustrate the calculation of the split entropy with a simple example. Assume we have ten samples for a single continuous feature, with three different labels all of them with weight one and the following value-label-map:

 feature value (sorted) label 1 1 2 2 3 1 4 1 5 1 6 1 7 3 8 3 9 2 10 3

We have nine split possibilities 1.5, 2.5, …, 9.5. The split between 1 and 2 at 1.5 would lead to the following two histograms. The lower branch would be

 label counts 1 1 2 0 3 0

whereas the upper branch would be

 label counts 1 4 2 2 3 3

The split at 1.5 leads to a split entropy of

$$\frac{1}{1} \log_2 \left( \frac{1}{1} \right) + \frac{0}{1} \log_2 \left( \frac{0}{1} \right) + \frac{0}{1} \log_2 \left( \frac{0}{1} \right) + \frac{4}{9} \log_2 \left( \frac{4}{9} \right) + \frac{2}{9} \log_2 \left( \frac{2}{9} \right) + \frac{3}{9} \log_2 \left( \frac{3}{9} \right)= 13.77.$$

We calculate the split entropies for all the other splits possibilities and plot them as a function of the split threshold:

We see from the plot that a split between 6 and 7 is ideal. The same procedure is now applied to the lower (samples 1-6) and the upper (samples 7-10) branch and the sub-branches thereof until either the maximum depth has been reached or the sub-branch only contains a single sample.

## Bagging Decision Trees to Random Forests

Decision trees, in particular those with many levels, suffer from over-fitting. To reduce the risk of over-fitting many different decision trees are trained and combined to an ensemble with majority vote. This decreases the variance of the decision trees without increasing the bias. In order to get different trees, randomness is introduced at two points:

1. Every sample gets a weight which is randomly chosen for every tree. The weights are chosen by randomly selecting n samples out of the existing n samples with replacement. The histograms are then created not by counts, but by sums on these weights.
2. For every split only some randomly chosen features are considered.

To gain a deeper understanding the following video of Jeremy Howard is helpful:

﻿