# Don’t Get Lost in the Forest II – The CPU Implementation

This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.

First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.

# Implementation of the Forecasting Function

Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record Split consists of a feature index Feature and a Threshold

where type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of Leaf objects (representing predicted labels) and Node objects (consisting of a triple of a split and two sub-trees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

A sample is a vector of feature values, which are all float values:

In the case of the Iris data set we have the four values for the features sepal length and width and the petal length stored in an array [|sepalLength; sepalWidth; petalLength; petalWidth|].

We can predict the label for a given sample using a function traversing the tree recursively:

A random forest is then represented by the type Model which consists of an array of trees and the number of classes:

We can predict the label of a sample using a random forest model by calling the function forecastTree for every tree in the forest and apply a majority vote on the resulting labels:

Here maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

We will revisit this function later and improve its performance with an alternative implementation.

# CPU Implementation of the Training Algorithm

Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

1. optimizeFeatures returns the optimal split position and its entropy for every feature;
2. trainTrees uses the function optimizeFeatures and trains decision trees.

A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. As the old position of the feature served as the index for accessing the weights, we save them as indices. We introduce a discriminated union type to represent the data.

The canonical form of the input data is represented by LabeledSamples, which is an array of tuples Sample * Label. Recall that a Sample is just an array of float vectors and Label is an integer value representing the class of the sample. For the Iris data set a LabeledSample is a tuple of a float vector of four values and an integer ([|sepalLength; sepalWidth; petalLength; petalWidth|], 1).

The type SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a FeatureArray, the corresponding labels in Labels and the indices representing the old position of the features.

Note that both the Sample and FeatureArray are both float arrays. The reason why we distinguish them is because a Sample is an observation of one feature value per feature, whereas a FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows would correspond to samples, whereas the columns would correspond to feature arrays.

<

h2>The Function optimizeFeatures

The function optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a mode, which determines the parallelization scheme, the options for the entropy-optimization, the number of classes, the labels per feature, the indices per features, and the weights. It performs the following computations:

• Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
• Project all features and weights which are non-zero to the beginning of the array.
• Implement the helper functions
• mapper which, depending on the mode, is either Array.mapi or Array.Parallel.mapi.
• translator which translates the index after projection to the post projection index, or the last index in case for the last non-zero index.
• A mapping function, the argument for the mapper, which for all non-masked features performs the following:
• Get the weights.
• Create the histogram for every split possibility for the given feature.
• Mask splits which are clearly not optimal. This is not necessary, but helps speeding up the CPU implementation.
• Calculate the entropy for remaining split possibilities.
• Return the lowest entropy and its index and translates the index to its value before the projection using the translator function.
• Select a subset of the features using a FeatureSelector.
• Calculate the entropy for the remaining split possibilities using the helper functions.

Here is the full source code:

## The Function trainTrees:

The function trainTrees trains decision trees:

• Calculate entropy, split index and feature index with the minimal split entropy.
• Find split threshold using the function  featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
• Set weights for samples with value below respectively above the split threshold to zero for the lower respectively upper branch of the next iteration.
• If depth = 0 then return the tree, else train new trees.

Note that each FeatureArray that is given in the argument sortedTrainingSet has to be sorted in ascending order.

To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.

Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.

## CPU Parallelization

For the CPU implementation we exploit two parallelization strategies:

1. Use Array.Parallel.mapi instead of Array.mapi.
2. Partition the trees in n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows to use different devices and even to mix CPU and GPU units.

## Performance Relevant Implementations (minArgmin)

We need to have a well performing CPU implementation so that we can later compare the CPU and GPU implementation. To improve the performance of the CPU implementation we introduce the entropyMask mentioned earlier and have to make a compromise between a clean functional implementation and a performance optimized implementation.

A first bottleneck is the function minAndArgMin. A compact functional implementation uses Seq.minBy:

The optimized implementation which relies on the .NET IEnumerable interface is about 10% faster:

Note that using sequences rather than arrays gives us some additional performance benefits.

# Applying the Code to the Iris Data Set

We now apply our CPU random forest classifier to the Iris data set. First we read the data.

Then we splitting the data in to test and training data:

We set up the parameters for the training algorithm:

The SquareRootFeatureSelector chooses randomly $\sqrt{n}$ features from the existing $n$ features. In case of the Iris data set this results in two features out of the existing four features, randomly selected for every tree in the forest. Now we train the model:

Once the model is trained we can use it to predict the labels on the test data:

You can call these functionality by calling: