This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.

First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.

# Implementation of the Forecasting Function

Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record Split consists of a feature index Feature and a Threshold:

```fsharp
type Split =
    { Feature : int
      Threshold : FeatureValue }
```

where type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of Leaf objects (representing predicted labels) and Node objects (consisting of a triple of a split and two sub-trees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

```fsharp
type Tree =
    | Leaf of Label
    | Node of low : Tree * split : Split * high : Tree
```

A sample is a vector of feature values, which are all float values:

```fsharp
type Sample = FeatureValue[]
```

In the case of the Iris data set we have four feature values, the sepal length and width and the petal length and width, stored in an array [|sepalLength; sepalWidth; petalLength; petalWidth|].

We can predict the label for a given sample using a function traversing the tree recursively:

```fsharp
let rec forecastTree tree (sample : Sample) =
    match tree with
    | Tree.Leaf label -> label
    | Tree.Node(low, split, high) ->
        if sample.[split.Feature] <= split.Threshold then
            forecastTree low sample
        else
            forecastTree high sample
```
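To see the traversal in action, here is a small hand-built example (a sketch; the stump, its threshold of 2.45 on the petal length, and the class labels are illustrative, not taken from a trained model):

```fsharp
// hypothetical stump: split on feature 2 (petal length) at 2.45,
// predicting class 0 below the threshold and class 1 above it
let stump =
    Tree.Node(Tree.Leaf 0, { Feature = 2; Threshold = 2.45 }, Tree.Leaf 1)

forecastTree stump [| 5.1; 3.5; 1.4; 0.2 |] // petal length 1.4 <= 2.45 -> 0
forecastTree stump [| 7.0; 3.2; 4.7; 1.4 |] // petal length 4.7 > 2.45 -> 1
```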

A random forest is then represented by the type Model which consists of an array of trees and the number of classes:

```fsharp
type Model =
    | RandomForest of trees : Tree[] * numClasses : int
```

We can predict the label of a sample using a random forest model by calling the function forecastTree for every tree in the forest and applying a majority vote to the resulting labels:

```fsharp
let forecast model sample : Label =
    match model with
    | RandomForest(trees, numClasses) ->
        (Array.zeroCreate numClasses, trees)
        ||> Seq.fold (fun hist tree ->
            let label = forecastTree tree sample
            hist.[label] <- hist.[label] + 1
            hist)
        |> maxAndArgMax |> snd
```

Here maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

```fsharp
let inline maxAndArgMax (source : seq<'T>) : 'T * int =
    source
    |> Seq.mapi (fun i e -> e, i)
    |> Seq.maxBy fst
```

We will revisit this function later and improve its performance with an alternative implementation.

# CPU Implementation of the Training Algorithm

Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

- optimizeFeatures returns the optimal split position and its entropy for every feature;
- trainTrees uses the function optimizeFeatures and trains decision trees.

A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. Because the original position of each feature value serves as the index for accessing the weights, we store these original positions as indices. We introduce a discriminated union type to represent the data.

```fsharp
type LabeledFeatureSet =
    | LabeledSamples of LabeledSample[]
    | LabeledFeatures of FeatureArrays * Labels
    | SortedFeatures of (FeatureArray * Labels * Indices)[]
```

The canonical form of the input data is represented by LabeledSamples, which is an array of tuples Sample * Label. Recall that a Sample is just a float array and Label is an integer value representing the class of the sample. For the Iris data set a LabeledSample is a tuple of a float vector of four values and an integer ([|sepalLength; sepalWidth; petalLength; petalWidth|], 1).

The type SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a FeatureArray, the corresponding labels in Labels and the indices representing the old position of the features.

Note that Sample and FeatureArray are both float arrays. We distinguish them because a Sample holds one value per feature for a single observation, whereas a FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows correspond to samples, whereas the columns correspond to feature arrays.
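The relation between the two views is a simple transposition, which can be sketched as follows (an illustration only; the library's own conversion from LabeledSamples to feature arrays may differ):

```fsharp
// three samples with two features each (rows = samples)
let samples : Sample[] = [| [| 5.1; 3.5 |]; [| 4.9; 3.0 |]; [| 6.2; 2.9 |] |]

// transpose: one FeatureArray per feature (columns of the sample matrix)
let featureArrays : FeatureArray[] =
    Array.init samples.[0].Length (fun f -> samples |> Array.map (fun s -> s.[f]))
// featureArrays.[0] = [| 5.1; 4.9; 6.2 |]
// featureArrays.[1] = [| 3.5; 3.0; 2.9 |]
```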

## The Function optimizeFeatures

The function optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a mode, which determines the parallelization scheme, the options for the entropy optimization, the number of classes, the labels per feature, the indices per feature, and the weights. It performs the following computations:

- Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
- Project all features and weights which are non-zero to the beginning of the array.
- Implement the helper functions:
    - mapper which, depending on the mode, is either Array.mapi or Array.Parallel.mapi;
    - translator which maps an index after the projection back to its pre-projection index (the last non-zero index is mapped to the last index of the weights array);
    - a mapping function, the argument for the mapper, which for all non-masked features performs the following:
        - Get the weights.
        - Create the histogram for every split possibility for the given feature.
        - Mask splits which are clearly not optimal. This is not strictly necessary, but it helps to speed up the CPU implementation.
        - Calculate the entropy for the remaining split possibilities.
        - Return the lowest entropy and its index, translated back to its pre-projection value using the translator function.

- Select a subset of the features using a FeatureSelector.
- Calculate the entropy for the remaining split possibilities using the helper functions.
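The cumulative class histograms used in the mapping function can be sketched as follows (a simplified, hypothetical version of cumHistograms; the library implementation may return a lazy sequence and use different types):

```fsharp
// for a feature sorted in ascending order, the i-th histogram counts the
// weighted samples of class c among positions 0..i, i.e. below split i
let cumHistogramsSketch numClasses (labels : int[]) (weights : int[]) =
    let hist = Array.zeroCreate numClasses
    (labels, weights) ||> Array.map2 (fun label weight ->
        hist.[label] <- hist.[label] + weight
        Array.copy hist) // snapshot of the running histogram at each split
```

The split entropy for split i can then be computed from the i-th histogram (samples going to the lower branch) and its difference from the total histogram (samples going to the upper branch).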

Here is the full source code:

```fsharp
let optimizeFeatures (mode : CpuMode) (options : EntropyOptimizationOptions) numClasses
                     (labelsPerFeature : Labels[]) (indicesPerFeature : Indices[]) weights =
    // remove zero weights
    let weightsPerFeature = Array.expandWeights indicesPerFeature weights
    let nonZeroIdcsPerFeature = Array.findNonZeroWeights weightsPerFeature

    let mapper f =
        match mode with
        | Sequential -> Array.mapi f
        | Parallel -> Array.Parallel.mapi f

    let projector (sources : _[][]) =
        sources |> mapper (fun i source -> source |> Array.projectIdcs nonZeroIdcsPerFeature.[i])

    let weightsPerFeature = projector weightsPerFeature
    let labelsPerFeature = projector labelsPerFeature

    let upperBoundNonZero = nonZeroIdcsPerFeature.[0].GetUpperBound(0)
    let upperBoundWeights = weights.GetUpperBound(0)

    let translator featureIdx (x, i) =
        if i = upperBoundNonZero then x, upperBoundWeights
        else x, nonZeroIdcsPerFeature.[featureIdx].[i]

    let totals = countTotals numClasses labelsPerFeature.[0] weightsPerFeature.[0]
    let total = totals |> snd

    // heuristic for avoiding small splits
    let combinedMinWeight = options.MinWeight numClasses total
    let rounding (value : float) = System.Math.Round(value, options.Decimals)
    let featureSelectingMask = options.FeatureSelector(labelsPerFeature |> Array.length)

    // for each feature find minimum entropy and corresponding index
    let mapping =
        (fun featureIdx labels ->
            if (featureSelectingMask.[featureIdx]) <> 0 then
                let featureWeights = weightsPerFeature.[featureIdx]
                let countsPerSplit = cumHistograms numClasses labels featureWeights
                let mask = entropyMask featureWeights labels total combinedMinWeight
                let entropyPerSplit = splitEntropies mask countsPerSplit totals
                entropyPerSplit
                |> Seq.map rounding
                |> minAndArgMin
                |> translator featureIdx
            else (infinity, upperBoundWeights))

    labelsPerFeature |> mapper mapping
```

## The Function trainTrees

The function trainTrees trains decision trees:

- Calculate entropy, split index and feature index with the minimal split entropy.
- Find split threshold using the function featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
- For the next iteration, set the weights of samples above the split threshold to zero in the lower branch and the weights of samples below the split threshold to zero in the upper branch.
- If depth = 0 then return the tree, else train new trees.
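The weight-restriction step can be sketched with two small helpers (hypothetical implementations of restrictBelow and restrictAbove, named after the calls in the listing below; the actual library versions may differ):

```fsharp
// keep weights at indices 0..idx, zero out the rest (lower branch)
let restrictBelow idx (weights : int[]) =
    weights |> Array.mapi (fun i w -> if i <= idx then w else 0)

// keep weights at indices idx.., zero out the rest (upper branch)
let restrictAbove idx (weights : int[]) =
    weights |> Array.mapi (fun i w -> if i >= idx then w else 0)
```

Because samples with zero weight are projected away in optimizeFeatures, each branch only ever works on the samples it actually contains.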

Note that each FeatureArray that is given in the argument sortedTrainingSet has to be sorted in ascending order.

```fsharp
let rec trainTrees depth (optimizer : IEntropyOptimizer) numClasses
                   (sortedTrainingSet : (FeatureArray*Labels*Indices)[]) (weights : Weights[]) =
    let triples =
        if depth = 0 then
            weights |> Array.map (fun weights -> nan, weights.GetUpperBound(0), 0)
        else
            optimizer.Optimize(weights)
            |> Array.map (fun results ->
                results
                |> Array.mapi (fun i (entropy, splitIdx) -> entropy, splitIdx, i)
                |> Array.minBy (fun (entropy, _, _) -> entropy))

    let trees0 =
        triples |> Array.mapi (fun i (entropy, splitIdx, featureIdx) ->
            let weights = weights.[i]
            let featureArray, labels, indices = sortedTrainingSet.[featureIdx]
            let sortedWeights = weights |> Array.projectIdcs indices
            let threshold = featureArrayThreshold sortedWeights featureArray splitIdx

            if DEBUG then
                printfn "depth: %A, entropy: %A, splitIdx: %A, featureIdx: %A" depth entropy splitIdx featureIdx
                printf "Labels:\n["
                (sortedWeights, labels) ||> Array.iteri2 (fun i weight label ->
                    if weight <> 0 then printf " (%A * %A) " label weight
                    if (i = splitIdx) then printf "|")
                printfn "]"

            match threshold with
            | Some num ->
                // set weights to 0 for elements which aren't included in left and right
                // branches respectively
                let lowWeights = restrictBelow splitIdx sortedWeights
                let highWeights = restrictAbove (splitIdx + 1) sortedWeights

                if DEBUG then
                    printfn "Low counts: %A" (countSamples lowWeights numClasses labels)
                    printfn "High counts: %A" (countSamples highWeights numClasses labels)

                let lowWeights = Array.permByIdcs indices lowWeights
                let highWeights = Array.permByIdcs indices highWeights

                let set (lowTree : Tree) (highTree : Tree) =
                    match lowTree, highTree with
                    | Leaf lowLabel, Leaf highLabel when lowLabel = highLabel ->
                        i, Tree.Leaf lowLabel
                    | _ ->
                        i, Tree.Node(lowTree, { Feature = featureIdx
                                                Threshold = num }, highTree)

                None, Some lowWeights, Some highWeights, Some set
            | None ->
                let label = findMajorityClass sortedWeights numClasses labels
                Some(Tree.Leaf label), None, None, None)
```

To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.

Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.

## CPU Parallelization

For the CPU implementation we exploit two parallelization strategies:

- Use Array.Parallel.mapi instead of Array.mapi.
- Partition the trees into n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows us to use different devices and even to mix CPU and GPU units.
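The binning strategy can be sketched as follows (an illustration; trainForest stands for a hypothetical per-bin trainer that trains a given number of trees, not an actual library function):

```fsharp
// distribute numTrees over numBins as evenly as possible,
// then train each bin in parallel and concatenate the results
let trainInBins numBins numTrees (trainForest : int -> Tree[]) =
    Array.init numBins (fun bin ->
        numTrees / numBins + (if bin < numTrees % numBins then 1 else 0))
    |> Array.Parallel.map trainForest
    |> Array.concat
```

Because each bin is independent, trainForest could equally well dispatch to a CPU or a GPU device.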

## Performance-Relevant Implementations (minAndArgMin)

We need a well-performing CPU implementation so that we can later compare the CPU and GPU implementations fairly. To improve the performance of the CPU implementation we introduce the entropyMask mentioned earlier and have to compromise between a clean functional implementation and a performance-optimized one.

A first bottleneck is the function minAndArgMin. A compact functional implementation uses Seq.minBy:

```fsharp
let inline minAndArgMin (source : seq<'T>) : 'T * int =
    source
    |> Seq.mapi (fun i e -> e, i)
    |> Seq.minBy fst
```

The optimized implementation which relies on the .NET IEnumerable interface is about 10% faster:

```fsharp
let inline minAndArgMin (source : seq<'T>) : 'T * int =
    use e = source.GetEnumerator()
    if not (e.MoveNext()) then invalidArg "source" "empty sequence"
    let mutable i = 0
    let mutable accv = e.Current
    let mutable acci = i
    while e.MoveNext() do
        i <- i + 1
        if e.Current < accv then
            accv <- e.Current
            acci <- i
    accv, acci
```

Note that operating on lazy sequences rather than materialized arrays avoids allocating intermediate arrays, which gives us some additional performance benefit.

# Applying the Code to the Iris Data Set

We now apply our CPU random forest classifier to the Iris data set. First we read the data.

```fsharp
// read data
let path = @"..\src\fsharp\examples\random_forest\irisExample.csv"
let data = CsvFile.Load(path).Cache()
let trainingData =
    [| for row in data.Rows ->
        [| row.GetColumn "Sepal length" |> float
           row.GetColumn "Sepal width" |> float
           row.GetColumn "Petal length" |> float
           row.GetColumn "Petal width" |> float |],
        row.GetColumn "Species" |> (fun x ->
            match x with
            | "I. setosa" -> 0
            | "I. versicolor" -> 1
            | "I. virginica" -> 2
            | x -> failwithf "No such Label: %A" x) |]

irisScatterPlot trainingData

let cpuDevice = CPU(CpuMode.Parallel)
let gpuDevice = GPU(GpuMode.SingleWeightWithStream 10, GpuModuleProvider.DefaultModule)

printFractionOfCorrectForcasts trainingData cpuDevice
printFractionOfCorrectForcasts trainingData gpuDevice
```

Then we split the data into training and test data:

```fsharp
let trainingData, testData =
    randomlySplitUpArray (getRngFunction 42) (50 * Array.length trainingData / 100) trainingData
```

We set up the parameters for the training algorithm:

```fsharp
let options =
    { TreeOptions.Default with
        MaxDepth = 3
        Device = device
        EntropyOptions = EntropyOptimizationOptions.DefaultWithSquareRootFeatureSelector }
printfn "%A" options
```

The SquareRootFeatureSelector randomly chooses $\sqrt{n}$ features from the existing $n$ features. In the case of the Iris data set this results in two out of the existing four features, randomly selected for every tree in the forest. Now we train the model:
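A square-root feature selector can be sketched as follows (an illustrative implementation, not the library's; it produces the kind of 0/1 mask that optimizeFeatures checks via featureSelectingMask):

```fsharp
// choose sqrt(n) of n features at random and return a 0/1 mask over the features
let squareRootFeatureSelector (rng : System.Random) n =
    let k = float n |> sqrt |> round |> int
    let chosen =
        Array.init n id
        |> Array.sortBy (fun _ -> rng.Next()) // random shuffle by random sort keys
        |> Array.take k
    let mask = Array.zeroCreate n
    chosen |> Array.iter (fun i -> mask.[i] <- 1)
    mask // e.g. for n = 4, exactly two entries are 1
```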

```fsharp
let trainingData = LabeledSamples trainingData
let model = randomForestClassifier options (getRngFunction 42) 100 trainingData
```

Once the model is trained we can use it to predict the labels on the test data:

```fsharp
let features, expectedLabels = Array.unzip testData
let forecastedLabels = Array.map (forecast model) features
let fraction =
    (forecastedLabels, expectedLabels)
    ||> Array.map2 (fun x y -> if x = y then 1.0 else 0.0)
    |> Array.average
printf "%f %% of forecasts were correct (out of sample)" (fraction * 100.0)
```

You can run this functionality by calling:

```
Tutorial.Fs.exe ExamplesRandomForestIrisExample
```

leading to the following output:

```
{MaxDepth = 3;
 Device = CPU Parallel;
 EntropyOptions = {AbsMinWeight = 1;
                   RelMinDivisor = 10;
                   RelMinBound = 25;
                   Decimals = 6;
                   FeatureSelector = <fun:get_DefaultWithSquareRootFeatureSelector@161>;};}
98.666667 % of forecasts were correct (out of sample)
{MaxDepth = 3;
 Device = GPU (SingleWeightWithStream 10,DefaultModule);
 EntropyOptions = {AbsMinWeight = 1;
                   RelMinDivisor = 10;
                   RelMinBound = 25;
                   Decimals = 6;
                   FeatureSelector = <fun:get_DefaultWithSquareRootFeatureSelector@161>;};}
98.666667 % of forecasts were correct (out of sample)
Done.
```

Now it is your turn. Download the code, play with it and have fun!