
Don’t Get Lost in the Forest I – The Algorithm

With this blog post we start a series on the random forest machine learning algorithm and how to accelerate its training with GPUs. For simplicity we focus on continuous features. If you are interested in categorical features you might want to read Mathias’ blog.

In this first post we explain what a decision tree is, how to go from a decision tree to a random forest, and what advantages random forests have over single decision trees. We finish by describing a random forest training algorithm. In the second post we go through the important aspects of the algorithm’s CPU implementation, and in the third post we do the same for the GPU implementation. The series will end with a post on the performance of the different implementations.

Random forests were developed by Leo Breiman and Adele Cutler. Their dominance in the field of machine learning has been illustrated nicely in a blog post from Strata 2012: “Ensembles of decision trees (often known as random forests) have been the most successful general-purpose algorithm in modern times” when it comes to “automatically identify the structure, interactions, and relationships in the data”. Furthermore it is noted that “most Kaggle competitions have at least one top entry that heavily uses this approach”. Random forests have also been the algorithm of choice for the body part recognition of Microsoft’s Kinect, a motion sensing input device for Xbox consoles and Windows PCs. More details can be found in the paper Why did Microsoft decide to use Random Forests in the Kinect.

Random forests consist of an ensemble of decision trees, so we start by looking at decision trees. A simple decision tree could be used to decide what you need to take with you when leaving the house:

Simplistic decision tree

This tree has been constructed by reason only, not involving any data. In machine learning however the trees are created such that they optimally predict a given sample of data (training data).

Example: Petal vs. Sepal

For this series we will use the well-known Iris data set as training data. It contains four features (the length and the width of sepals and petals) of three Iris species:

 

Iris Setosa

Iris versicolor

Iris virginica
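As a side note, the Iris data set is bundled with scikit-learn, which makes it easy to inspect in Python. This is only a convenience for the reader and an assumption of ours; the series itself does not depend on scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # sepal length/width and petal length/width (in cm)
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)      # (150, 4): 150 samples, 4 continuous features
```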

Random forests play to their strength when there are hundreds of different features, but a data set with only a few features is much more accessible as an example. Looking at the scatter plot of all the features, you can see how the different species cluster:

Scatter plot of the Iris data

The scatter plot illustrates that Setosa is nicely distinguishable from the other two species. Keeping Virginica apart from Versicolor is much more difficult. A decision tree predicting the Iris species from these features might look like this:

Decision tree of depth 3 for the Iris data

This tree does not classify all the samples correctly. We could change this by increasing the maximum tree depth. By doing so, the tree can predict the training data 100% correctly, but only by learning the noise in the samples; in the most extreme case the tree is equivalent to a dictionary containing every sample. This is known as over-fitting and leads to bad results for out-of-sample predictions. In order to overcome over-fitting, we can train multiple decision trees by introducing random weights for the samples and by only considering a random subset of the features for each split. The final decision of the random forest is then determined by a majority vote on the trees’ predictions. This technique is also known as bagging. It helps to decrease the variance (the error from sensitivity to noise in the training set) without increasing the bias (the error introduced by a model that is not flexible enough).

Algorithm

We first explain how to build decision trees. In a second part we look at how to create a random forest from them.

Building Decision Trees

To build good decision trees, a discrimination measure is required. A common choice is the split information or entropy as described in Efficient C4.5. It is defined as

$$\rm{Split}(T) = -\sum_{i,b} T_{i,b} \log_2 \left( \frac{T_{i,b}}{T_b} \right),$$

where $T_{i,b}$ is the number of samples with label $i$ in branch $b$ after the split and $T_b$ is the total number of samples in branch $b$. Other possible metrics are the information gain, the Gini impurity and the variance reduction.
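To make the measure concrete, here is a minimal Python sketch (our own illustration, not code from the implementation discussed later in the series) that computes the split entropy from the label lists of the branches:

```python
import math
from collections import Counter

def split_entropy(*branches):
    """Split entropy of a candidate split: minus the sum over branches b and
    labels i of T_ib * log2(T_ib / T_b). Each branch is passed as the list of
    labels of the samples that fall into it after the split."""
    total = 0.0
    for branch in branches:
        histogram = Counter(branch)          # T_ib for every label i in this branch
        branch_size = len(branch)            # T_b
        for count in histogram.values():     # labels with zero count never appear
            total -= count * math.log2(count / branch_size)
    return total

print(split_entropy(["a"], ["a", "b"]))  # 2.0: the pure branch contributes nothing
```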

The search for a decision tree which is a global optimum of the discrimination measure is known to be NP-complete. We therefore resort to a greedy local optimization where for each split the local optimum is chosen, i.e. for every node we check which feature to use for splitting and where to split in order to get the lowest split information.

The steps to build a decision tree are:

  1. For all features (e.g. for the Iris data set the sepal length, sepal width, petal length and petal width), take all samples and order them by value.
  2. For every split possibility, i.e. every threshold between two successive samples with different values, calculate the split entropy:
    – Make a histogram based on labels for both branches after the split.
    – Calculate the split entropy $\rm{Split}(T)$.
  3. Choose the feature and split threshold with the smallest split entropy.
  4. Apply the algorithm from step 1 to all resulting sub-branches, until either all samples in a branch have the same label or the predefined maximal tree depth has been reached (a sketch of these steps follows below).
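Assuming the split_entropy helper sketched above, the four steps might translate into Python roughly as follows. This is a simplified illustration of the greedy procedure, not the CPU or GPU implementation discussed later in the series:

```python
from collections import Counter
# reuses split_entropy from the sketch above

def build_tree(rows, labels, max_depth):
    """Greedily build a decision tree over continuous features (steps 1-4).
    `rows` is a list of feature vectors, `labels` the matching class labels."""
    if max_depth == 0 or len(set(labels)) == 1:
        # leaf node: predict the most common label among the remaining samples
        return {"label": Counter(labels).most_common(1)[0][0]}
    best = None
    for f in range(len(rows[0])):                   # step 1: consider every feature
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):      # step 2: every possible threshold
            t = 0.5 * (lo + hi)
            left = [l for row, l in zip(rows, labels) if row[f] <= t]
            right = [l for row, l in zip(rows, labels) if row[f] > t]
            entropy = split_entropy(left, right)
            if best is None or entropy < best[0]:
                best = (entropy, f, t)              # step 3: keep the smallest entropy
    if best is None:                                # no split possible: make a leaf
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, f, t = best
    left = [(row, l) for row, l in zip(rows, labels) if row[f] <= t]
    right = [(row, l) for row, l in zip(rows, labels) if row[f] > t]
    return {"feature": f, "threshold": t,           # step 4: recurse into both branches
            "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
            "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1)}
```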

We illustrate the calculation of the split entropy with a simple example. Assume we have ten samples of a single continuous feature with three different labels, all samples with weight one and the following value-label map:

| feature value (sorted) | label |
|------------------------|-------|
| 1  | 1 |
| 2  | 2 |
| 3  | 1 |
| 4  | 1 |
| 5  | 1 |
| 6  | 1 |
| 7  | 3 |
| 8  | 3 |
| 9  | 2 |
| 10 | 3 |

We have nine split possibilities 1.5, 2.5, …, 9.5. The split between 1 and 2 at 1.5 would lead to the following two histograms. The lower branch would be

| label | counts |
|-------|--------|
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |

whereas the upper branch would be

| label | counts |
|-------|--------|
| 1 | 4 |
| 2 | 2 |
| 3 | 3 |

The split at 1.5 leads to a split entropy of

$$-\left[ 1 \log_2 \left( \frac{1}{1} \right) + 4 \log_2 \left( \frac{4}{9} \right) + 2 \log_2 \left( \frac{2}{9} \right) + 3 \log_2 \left( \frac{3}{9} \right) \right] \approx 13.77,$$

where labels with a count of zero contribute nothing to the sum.

We calculate the split entropies for all the other split possibilities and plot them as a function of the split threshold:

Split entropy as a function of the split threshold

We see from the plot that a split between 6 and 7 is ideal. The same procedure is now applied to the lower branch (samples 1–6), the upper branch (samples 7–10) and their sub-branches, until either the maximum depth has been reached or a sub-branch only contains samples with the same label.
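As a cross-check of the worked example, the split_entropy sketch from above can be evaluated at every candidate threshold; the numbers come from the toy table, the code itself is only our illustration:

```python
# reuses split_entropy from the sketch above
values = list(range(1, 11))                        # feature values 1, 2, ..., 10
labels = [1, 2, 1, 1, 1, 1, 3, 3, 2, 3]            # labels from the table above

for threshold in [v + 0.5 for v in values[:-1]]:   # nine candidates 1.5, ..., 9.5
    lower = [l for v, l in zip(values, labels) if v <= threshold]
    upper = [l for v, l in zip(values, labels) if v > threshold]
    print(threshold, round(split_entropy(lower, upper), 2))
# the smallest value (about 7.15) is printed for threshold 6.5,
# i.e. the split between 6 and 7
```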

Bagging Decision Trees into Random Forests

Decision trees, in particular those with many levels, suffer from over-fitting. To reduce the risk of over-fitting, many different decision trees are trained and combined into an ensemble with a majority vote. This decreases the variance of the decision trees without increasing the bias. In order to get different trees, randomness is introduced at two points (a sketch follows the list):

  1. Every sample gets a weight which is chosen randomly for every tree: n samples are drawn out of the existing n samples with replacement, and the weight of a sample is the number of times it was drawn. The histograms are then built not from counts, but from sums of these weights.
  2. For every split only some randomly chosen features are considered.
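Putting the two sources of randomness together, a bagged ensemble might be sketched as follows. The sketch is our own simplification: it resamples the data instead of carrying explicit weights, it reuses the build_tree sketch from above, and a full implementation would additionally restrict each split to a random subset of the features (point 2 above):

```python
import random
from collections import Counter
# reuses build_tree from the sketch above

def train_forest(rows, labels, n_trees, max_depth, seed=42):
    """Train an ensemble of trees, each on a bootstrap sample: drawing
    n out of n samples with replacement (point 1 above)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(len(rows)) for _ in rows]  # n draws with replacement
        forest.append(build_tree([rows[i] for i in indices],
                                 [labels[i] for i in indices],
                                 max_depth))
    return forest

def predict_tree(tree, row):
    """Walk down a single tree until a leaf is reached."""
    while "label" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]

def predict_forest(forest, row):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Resampling n out of n samples with replacement has the same effect as the integer weights described in point 1 above; the weighted-histogram formulation simply avoids materializing the resampled data.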

To gain a deeper understanding, the following video by Jeremy Howard is helpful: