[Figure: decision tree of depth 3 for the Iris data]

Don’t Get Lost in the Forest III – The GPU Implementation

In the previous post we looked at the CPU implementation of a random forest training algorithm. We also discussed two parallelization strategies at different levels:

  1. Build the independent trees in parallel.
  2. Search for the optimal split for all features in parallel.

The first strategy is straightforward. Here we focus on the second strategy and discuss how it can be implemented on the GPU. We can reuse most of the CPU code; we only have to replace the function optimizeFeatures with a proper GPU implementation.

This function performs the following calculations:

  • Create the cumulative histograms by:
    • Finding and preparing the indices of the samples with non-zero weights.
    • Expanding the weights: looking up the initial weights of the samples via their indices before sorting, and sorting them according to their class (see below for more details).
    • Calculating the histograms by applying a cumulative sum to the expanded weights.
  • Calculate the entropy and its minimum.

In the following we discuss each step in more detail using example data consisting of four features and four samples

feature 1   feature 2   feature 3   feature 4   label
    1           3           7           8         0
    2           2           2           9         1
    3           1           1           7         1
    4           5           3           5         0

and the weights

1 1 0 1

After sorting the features we get, for every feature, four triples of feature value, label and index before sorting. This corresponds to the SortedFeatures case of the discriminated union type LabeledFeatureSet discussed in the previous post:

(1,0,0)   (1,1,2)   (1,1,2)   (5,0,3)
(2,1,1)   (2,1,1)   (2,1,1)   (7,1,2)
(3,1,2)   (3,0,0)   (3,0,3)   (8,0,0)
(4,0,3)   (5,0,3)   (7,0,0)   (9,1,1)

After sorting we do not need the feature values anymore. It is therefore enough to keep two matrices, the first one storing the labels

0 1 1 0
1 1 1 1
1 0 0 0
0 0 0 1

and the second one the indices

0 2 2 3
1 1 1 2
2 0 3 0
3 3 0 1
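
As an aside, these two matrices can be reproduced with a few lines of numpy. This is only a CPU-side illustration, not the GPU code, and the variable names are ours:

    import numpy as np

    # Example data: one row per sample, one column per feature.
    features = np.array([[1, 3, 7, 8],
                         [2, 2, 2, 9],
                         [3, 1, 1, 7],
                         [4, 5, 3, 5]])
    labels = np.array([0, 1, 1, 0])

    # Sort every feature column independently and keep, for every sorted
    # value, the index of the sample it came from.
    indices_before_sorting = np.argsort(features, axis=0)
    sorted_labels = labels[indices_before_sorting]

    print(sorted_labels)           # the label matrix above
    print(indices_before_sorting)  # the index matrix above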

Find Non-Zero Indices

This function prepares the input data such that all samples with a non-zero weight, or rather the indices of these samples, are stored consecutively.

It calls the GPU kernel LogicalWeightExpansionKernel, which creates for every feature a vector with elements in $\{0,1\}$ indicating whether the corresponding weight is non-zero (looking at the expanded weights, i.e. the initial weights before sorting).
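
The kernel code itself is not reproduced here; the following numpy lines are only a CPU-side sketch of what the kernel computes (the variable names are ours):

    import numpy as np

    weights = np.array([1, 1, 0, 1])              # per-sample weights (before sorting)
    indices_before_sorting = np.array([[0, 2, 2, 3],
                                       [1, 1, 1, 2],
                                       [2, 0, 3, 0],
                                       [3, 3, 0, 1]])

    # 1 where the weight of the underlying sample is non-zero, 0 otherwise
    non_zero_indicator = (weights[indices_before_sorting] != 0).astype(int)
    print(non_zero_indicator)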

For our sample data this kernel produces the following result:

1 0 0 1
1 1 1 0
0 1 1 1
1 1 1 1

After calculating a column-wise cumulative sum we obtain

1 0 0 1
2 1 1 1
2 2 2 2
3 3 3 3

The function FindNonZeroIndicesKernel then creates the matrix of indices

0 1 1 0
1 2 2 2
3 3 3 3

For every feature it records the indices of the samples with non-zero weight, packed consecutively. Note that these indices refer to the positions after sorting.
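
Sketched on the CPU with numpy (presumably the cumulative sum tells the GPU kernel where to write each index; in the sketch we simply collect the non-zero positions per feature):

    import numpy as np

    non_zero_indicator = np.array([[1, 0, 0, 1],
                                   [1, 1, 1, 0],
                                   [0, 1, 1, 1],
                                   [1, 1, 1, 1]])

    # Column-wise cumulative sum: number of non-zero-weight samples seen
    # so far in every feature column.
    cumulated = np.cumsum(non_zero_indicator, axis=0)

    # For every feature, the positions (after sorting) of the samples
    # with non-zero weight, packed consecutively.
    non_zero_indices = np.array([np.flatnonzero(non_zero_indicator[:, f])
                                 for f in range(non_zero_indicator.shape[1])]).T
    print(cumulated)
    print(non_zero_indices)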

Expand the Weights and Create Histograms

The WeightExpansionKernel uses the indices matrix and returns for every sample with non-zero weight its weight, sorted according to the class (label) the sample belongs to. Given the following weights and labels:

Weight   Label
  1        0
  2        1
  1        1
  2        0

We get the following expanded weights, where a zero indicates that the sample belongs to a different class (the first four entries are the class-0 slots, the last four the class-1 slots):

1
0
0
2
0
2
1
0

The weight expansion function transforms our example data to

1 0 0 1
0 1 1 1
1 1 1 0
0 1 1 0
1 0 0 0
0 0 0 1
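
A CPU-side numpy sketch of this expansion (again only an illustration; the class-major layout, first all class-0 slots and then all class-1 slots, is inferred from the example above):

    import numpy as np

    weights = np.array([1, 1, 0, 1])
    sorted_labels = np.array([[0, 1, 1, 0],     # labels in sorted order
                              [1, 1, 1, 1],
                              [1, 0, 0, 0],
                              [0, 0, 0, 1]])
    indices_before_sorting = np.array([[0, 2, 2, 3],
                                       [1, 1, 1, 2],
                                       [2, 0, 3, 0],
                                       [3, 3, 0, 1]])
    num_classes = 2

    n_samples, n_features = sorted_labels.shape
    n_non_zero = np.count_nonzero(weights)
    expanded = np.zeros((num_classes * n_non_zero, n_features), dtype=int)

    for f in range(n_features):
        row = 0
        for s in range(n_samples):                     # walk the sorted samples
            w = weights[indices_before_sorting[s, f]]  # weight before sorting
            if w == 0:
                continue                               # skip zero-weight samples
            c = sorted_labels[s, f]
            expanded[c * n_non_zero + row, f] = w      # class-major layout
            row += 1

    print(expanded)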

To calculate the cumulative histogram for each feature over all labels we use a row-wise cumulative sum (row-wise because the matrix layout in the code differs from the layout in our example), which is implemented with the Alea.Unbound primitive ConsumeRangeConsecutiveInclusive and a cumulative sum kernel.
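
On the CPU the same step is just a cumulative sum over the expanded weights of every feature (column-wise in the layout of our example):

    import numpy as np

    expanded = np.array([[1, 0, 0, 1],
                         [0, 1, 1, 1],
                         [1, 1, 1, 0],
                         [0, 1, 1, 0],
                         [1, 0, 0, 0],
                         [0, 0, 0, 1]])

    # Running weight totals down every feature column: the cumulative
    # histograms shown below.
    cumulative_histograms = np.cumsum(expanded, axis=0)
    print(cumulative_histograms)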

We obtain the following cumulative histograms, which are then used for the entropy calculation:

1 0 0 1
1 1 1 2
2 2 2 2
2 3 3 2
3 3 3 2
3 3 3 3

Calculate the Entropy and its Minimum

The entropy for every possible split and for every feature is calculated from the cumulative histograms with the EntropyKernel function.
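
The kernel code is not shown here; as a sketch of the underlying calculation we assume the usual definition, namely that the entropy of a split is the weighted average of the Shannon entropies of the two resulting branches. The helper below is ours and works on the per-class weights of one feature:

    import numpy as np

    def shannon_entropy(counts):
        """Entropy of a vector of per-class weight totals (0 for an empty branch)."""
        total = counts.sum()
        if total == 0:
            return 0.0
        p = counts[counts > 0] / total
        return float(-(p * np.log2(p)).sum())

    def split_entropies(class_weights):
        """Entropy of every candidate split of one feature.

        class_weights: (n_samples, n_classes) per-class weights of the sorted,
        non-zero-weight samples; split i puts samples 0..i into the left branch.
        """
        left = np.cumsum(class_weights, axis=0)   # per-class totals left of the split
        total = left[-1]
        result = []
        for i in range(len(class_weights)):
            right = total - left[i]               # per-class totals right of the split
            n_left, n_right = left[i].sum(), right.sum()
            h = (n_left * shannon_entropy(left[i]) +
                 n_right * shannon_entropy(right)) / (n_left + n_right)
            result.append(h)
        return np.array(result)

    # Feature 1 of the example: the non-zero-weight samples have labels 0, 1, 0
    print(split_entropies(np.array([[1, 0], [0, 1], [1, 0]])))
    # -> roughly [0.667 0.667 0.918]; the last entry corresponds to "no split"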

The optimal split for every feature, i.e. the split leading to the lowest entropy, can now be calculated with the MinimumEntropy function.

To find the minimal entropy the function LaunchOptAndArgOptKernelDoubleWithIdcs uses the Alea.Unbound library function MultiChannelReducePrimitive. The optimal splits are then checked for edge cases in the same way as in the function optimizeFeatures.
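
On the CPU side this reduction boils down to a per-feature arg-min over the split entropies. Using the entropy definition from the sketch above, the (rounded) values for our four example features are:

    import numpy as np

    # Rows = split positions, columns = features; values computed with the
    # split_entropies sketch above (rounded).
    entropies = np.array([[0.667, 0.000, 0.000, 0.667],
                          [0.667, 0.667, 0.667, 0.000],
                          [0.918, 0.918, 0.918, 0.918]])

    best_split = entropies.argmin(axis=0)   # index of the optimal split per feature
    best_entropy = entropies.min(axis=0)    # the corresponding minimal entropy
    print(best_split)    # optimal split index per feature: [0 0 0 1]
    print(best_entropy)  # minimal entropy per feature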

This explains the main functionality of the GPU implementation. In the last blog post of this series we will consider the performance of the different implementations and look at an implementation in Python based on scikit-learn.