This is the second post in a series about GPU accelerated random forests. In our last post we introduced random forests and described how to train them with an algorithm based on the split entropy. In this post we will focus on a CPU implementation and leave the GPU implementation for the next blog post.
First we look at the implementation of the forecasting function. We will then discuss the training of the random forest and dive into the two main functions of the training algorithm.
Implementation of the Forecasting Function
Before implementing the forecasting function, we need to discuss how we represent a decision tree in code. The record
Split consists of a feature index
Feature and a
Threshold

type Split = { Feature : int Threshold : FeatureValue } 
where
type FeatureValue = float because we restrict to continuous features for the sake of simplicity. We define a tree consisting of
Leaf objects (representing predicted labels) and
Node objects (consisting of a triple of a split and two subtrees, one for the elements with feature values below the threshold and one for the elements with feature values above the threshold):

type Tree =  Leaf of Label  Node of low : Tree * split : Split * high : Tree 
A sample is a vector of feature values, which are all float values:

type Sample = FeatureValue[] 
In the case of the Iris data set we have the four values for the features sepal length and width and the petal length stored in an array
[sepalLength; sepalWidth; petalLength; petalWidth].
We can predict the label for a given sample using a function traversing the tree recursively:

let rec forecastTree tree (sample : Sample) = match tree with  Tree.Leaf label > label  Tree.Node(low, split, high) > if sample.[split.Feature] <= split.Threshold then forecastTree low sample else forecastTree high sample 
A random forest is then represented by the type
Model which consists of an array of trees and the number of classes:

type Model =  RandomForest of trees : Tree[] * numClasses : int 
We can predict the label of a sample using a random forest model by calling the function
forecastTree for every tree in the forest and apply a majority vote on the resulting labels:

let forecast model sample : Label = match model with  RandomForest(trees, numClasses) > (Array.zeroCreate numClasses, trees) > Seq.fold (fun hist tree > let label = forecastTree tree sample hist.[label] maxAndArgMax > snd) 
Here
maxAndArgMax is a function returning the maximum of an array and the maximum’s index. A simple functional implementation looks as follows:

let inline maxAndArgMax (source : seq<'T>) : ('T * int) = source > Seq.mapi (fun i e > e, i) > Seq.maxBy (fun e > fst e) 
We will revisit this function later and improve its performance with an alternative implementation.
CPU Implementation of the Training Algorithm
Rather than going into every detail of the implementation we will focus on the main ideas. The main part of the training algorithm is implemented in two functions:

optimizeFeatures returns the optimal split position and its entropy for every feature;

trainTrees uses the function
optimizeFeatures and trains decision trees.
A crucial step in the training algorithm as described in the previous post is the sorting of the features according to their values. As the old position of the feature served as the index for accessing the weights, we save them as indices. We introduce a discriminated union type to represent the data.

type LabeledFeatureSet =  LabeledSamples of LabeledSample[]  LabeledFeatures of FeatureArrays*Labels  SortedFeatures of (FeatureArray*Labels*Indices)[] 
The canonical form of the input data is represented by
LabeledSamples, which is an array of tuples
Sample * Label. Recall that a
Sample is just an array of float vectors and
Label is an integer value representing the class of the sample. For the Iris data set a
LabeledSample is a tuple of a float vector of four values and an integer
([sepalLength; sepalWidth; petalLength; petalWidth], 1).
The type
SortedFeatures is used to represent the sorted features during the implementation of the algorithm. For each feature, it holds the feature values from all samples in a
FeatureArray, the corresponding labels in
Labels and the indices representing the old position of the features.
Note that both the
Sample and
FeatureArray are both float arrays. The reason why we distinguish them is because a
Sample is an observation of one feature value per feature, whereas a
FeatureArray holds the value of a single feature for a collection of samples. If we stack the samples row by row in a matrix, the rows would correspond to samples, whereas the columns would correspond to feature arrays.
The Function optimizeFeatures
The function
optimizeFeatures returns for every feature the best split entropy and its corresponding index in the sorted samples. Its arguments are a
mode, which determines the parallelization scheme, the options for the entropyoptimization, the number of classes, the labels per feature, the indices per features, and the weights. It performs the following computations:
 Reorder the weights such that every weight is stored at the same index as the corresponding sample, which is done for every feature independently.
 Project all features and weights which are nonzero to the beginning of the array.
 Implement the helper functions

mapper which, depending on the
mode, is either
Array.mapi or
Array.Parallel.mapi.

translator which translates the index after projection to the post projection index, or the last index in case for the last nonzero index.
 A
mapping function, the argument for the
mapper, which for all nonmasked features performs the following:
 Get the weights.
 Create the histogram for every split possibility for the given feature.
 Mask splits which are clearly not optimal. This is not necessary, but helps speeding up the CPU implementation.
 Calculate the entropy for remaining split possibilities.
 Return the lowest entropy and its index and translates the index to its value before the
projection using the
translator function.
 Select a subset of the features using a
FeatureSelector.
 Calculate the entropy for the remaining split possibilities using the helper functions.
Here is the full source code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

let optimizeFeatures (mode : CpuMode) (options : EntropyOptimizationOptions) numClasses (labelsPerFeature : Labels[]) (indicesPerFeature : Indices[]) weights = // remove zero weights let weightsPerFeature = Array.expandWeights indicesPerFeature weights let nonZeroIdcsPerFeature = Array.findNonZeroWeights weightsPerFeature let mapper f = match mode with  Sequential > Array.mapi f  Parallel > Array.Parallel.mapi f let projector (sources : _[][]) = sources > mapper (fun i source > source > Array.projectIdcs nonZeroIdcsPerFeature.[i]) let weightsPerFeature = projector weightsPerFeature let labelsPerFeature = projector labelsPerFeature let upperBoundNonZero = nonZeroIdcsPerFeature.[0].GetUpperBound(0) let upperBoundWeights = weights.GetUpperBound(0) let translator featureIdx (x, i) = if i = upperBoundNonZero then x, upperBoundWeights else x, nonZeroIdcsPerFeature.[featureIdx].[i] let totals = countTotals numClasses labelsPerFeature.[0] weightsPerFeature.[0] let total = totals > snd // heuristic for avoiding small splits let combinedMinWeight = options.MinWeight numClasses total let rounding (value : float) = System.Math.Round(value, options.Decimals) let featureSelectingMask = options.FeatureSelector(labelsPerFeature > Array.length) // for each feature find minimum entropy and corresponding index let mapping = (fun featureIdx labels > if (featureSelectingMask.[featureIdx]) <> 0 then let featureWeights = weightsPerFeature.[featureIdx] let countsPerSplit = cumHistograms numClasses labels featureWeights let mask = entropyMask featureWeights labels total combinedMinWeight let entropyPerSplit = splitEntropies mask countsPerSplit totals entropyPerSplit > Seq.map rounding > minAndArgMin > translator featureIdx else (infinity, upperBoundWeights)) labelsPerFeature > mapper mapping 
The Function trainTrees:
The function
trainTrees trains decision trees:
 Calculate entropy, split index and feature index with the minimal split entropy.
 Find split threshold using the function
featureArrayThreshold. If all indices are zero, stop branching and create a leaf.
 Set weights for samples with value below respectively above the split threshold to zero for the lower respectively upper branch of the next iteration.
 If depth = 0 then return the tree, else train new trees.
Note that each
FeatureArray that is given in the argument
sortedTrainingSet has to be sorted in ascending order.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53

let rec trainTrees depth (optimizer : IEntropyOptimizer) numClasses (sortedTrainingSet : (FeatureArray*Labels*Indices)[]) (weights : Weights[]) = let triples = if depth = 0 then weights > Array.map (fun weights > nan, weights.GetUpperBound(0), 0) else optimizer.Optimize(weights) > Array.map (fun results > results > Array.mapi (fun i (entropy, splitIdx) > entropy, splitIdx, i) > Array.minBy (fun (entropy, _, _) > entropy)) let trees0 = triples > Array.mapi (fun i (entropy, splitIdx, featureIdx) > let weights = weights.[i] let featureArray, labels, indices = sortedTrainingSet.[featureIdx] let sortedWeights = weights > Array.projectIdcs indices let threshold = featureArrayThreshold sortedWeights featureArray splitIdx if DEBUG then printfn "depth: %A, entropy: %A, splitIdx: %A, featureIdx: %A" depth entropy splitIdx featureIdx printf "Labels:\n[" (sortedWeights, labels) > Array.iteri2 (fun i weight label > if weight <> 0 then printf " (%A * %A) " label weight if (i = splitIdx) then printf "") printfn "]" match threshold with  Some num > // set weights to 0 for elements which aren't included in left and right // branches respectively let lowWeights = restrictBelow splitIdx sortedWeights let highWeights = restrictAbove (splitIdx + 1) sortedWeights if DEBUG then printfn "Low counts: %A" (countSamples lowWeights numClasses labels) printfn "High counts: %A" (countSamples highWeights numClasses labels) let lowWeights = Array.permByIdcs indices lowWeights let highWeights = Array.permByIdcs indices highWeights let set (lowTree : Tree) (highTree : Tree) = match lowTree, highTree with  Leaf lowLabel, Leaf highLabel when lowLabel = highLabel > i, Tree.Leaf lowLabel  _ > i, Tree.Node(lowTree, { Feature = featureIdx Threshold = num }, highTree) None, Some lowWeights, Some highWeights, Some set  None > let label = findMajorityClass sortedWeights numClasses labels Some(Tree.Leaf label), None, None, None) 
To go from a single decision tree to a random forest, we need to train multiple trees and add random weights for every sample in every tree. This is implemented in
randomForestClassifier. We show how to apply this function to the Iris data set at the end of this post.
Some additional code is required to exploit parallelism and to delegate certain parts of the algorithm to the GPU.
CPU Parallelization
For the CPU implementation we exploit two parallelization strategies:
 Use
Array.Parallel.mapi instead of
Array.mapi.
 Partition the trees in
n bins and calculate the bins in parallel. Since there is no need to exchange data between bins, this approach allows to use different devices and even to mix CPU and GPU units.
Performance Relevant Implementations (minArgmin)
We need to have a well performing CPU implementation so that we can later compare the CPU and GPU implementation. To improve the performance of the CPU implementation we introduce the
entropyMask mentioned earlier and have to make a compromise between a clean functional implementation and a performance optimized implementation.
A first bottleneck is the function
minAndArgMin. A compact functional implementation uses
Seq.minBy:

let inline minAndArgMin (source: seq) : ('T * int) = source > Seq.mapi (fun i e > e, i) > Seq.minBy (fun e > fst e) 
The optimized implementation which relies on the .NET
IEnumerable interface is about 10% faster:

let inline minAndArgMin (source : seq<'T>) : 'T*int = use e = source.GetEnumerator() if not (e.MoveNext()) then invalidArg "source" "empty sequence" let mutable i = 0 let mutable accv = e.Current let mutable acci = i while e.MoveNext() do i 
Note that using sequences rather than arrays gives us some additional performance benefits.
Applying the Code to the Iris Data Set
We now apply our CPU random forest classifier to the Iris data set. First we read the data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

// read data let path = @"..\src\fsharp\examples\random_forest\irisExample.csv" let data = CsvFile.Load(path).Cache() let trainingData = [ for row in data.Rows > [ row.GetColumn "Sepal length" > float row.GetColumn "Sepal width" > float row.GetColumn "Petal length" > float row.GetColumn "Petal width" > float ], row.GetColumn "Species" > (fun x > match x with  "I. setosa" > 0  "I. versicolor" > 1  "I. virginica" > 2  x > failwithf "No such Label: %A" x) ] irisScatterPlot trainingData let cpuDevice = CPU(CpuMode.Parallel) let gpuDevice = GPU(GpuMode.SingleWeightWithStream 10, GpuModuleProvider.DefaultModule) printFractionOfCorrectForcasts trainingData cpuDevice printFractionOfCorrectForcasts trainingData gpuDevice 
Then we splitting the data in to test and training data:

let trainingData, testData = randomlySplitUpArray (getRngFunction 42) (50*Array.length trainingData/100) trainingData 
We set up the parameters for the training algorithm:

let options = { TreeOptions.Default with MaxDepth = 3 Device = device EntropyOptions = EntropyOptimizationOptions.DefaultWithSquareRootFeatureSelector } printfn "%A" options 
The
SquareRootFeatureSelector chooses randomly $\sqrt{n}$ features from the existing $n$ features. In case of the Iris data set this results in two features out of the existing four features, randomly selected for every tree in the forest. Now we train the model:

let trainingData = LabeledSamples trainingData let model = randomForestClassifier options (getRngFunction 42) 100 trainingData 
Once the model is trained we can use it to predict the labels on the test data:

let features, expectedLabels = Array.unzip testData let forecastedLabels = Array.map (forecast model) features let fraction = (forecastedLabels, expectedLabels) > Array.map2 (fun x y > if x = y then 1.0 else 0.0) > Array.average printf "%f %% of forecasts were correct (out of sample)" (fraction*100.0) 
You can call these functionality by calling:

Tutorial.Fs.exe ExamplesRandomForestIrisExample 
leading to the following output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

{MaxDepth = 3; Device = CPU Parallel; EntropyOptions = {AbsMinWeight = 1; RelMinDivisor = 10; RelMinBound = 25; Decimals = 6; FeatureSelector = <fun:get_DefaultWithSquareRootFeatureSelector@161>;};} 98.666667 % of forecasts were correct (out of sample) {MaxDepth = 3; Device = GPU (SingleWeightWithStream 10,DefaultModule); EntropyOptions = {AbsMinWeight = 1; RelMinDivisor = 10; RelMinBound = 25; Decimals = 6; FeatureSelector = <fun:get_DefaultWithSquareRootFeatureSelector@161>;};} 98.666667 % of forecasts were correct (out of sample) Done. 
Now it is your turn. Download the code, play with it and have fun!