Random Forests and Decision Trees

Introduction: Random Forest in Python

Here we will build a Random Forest in Python, the Pythonic way. Python libraries make it so easy to assemble very complex machines from one or two imports that it is worth delving into the basic topics before dipping our nose into untameable beasts.

Let us start with a single “decision tree” on a simple problem. After that we will extend the idea, learn to build a Random Forest, and apply it to a real problem.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
from collections import Counter
from sklearn.metrics import confusion_matrix
import itertools
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.tree import export_text
from sklearn import metrics, datasets, ensemble
# Set the random seed to guarantee reproducibility

RSEED = 50

Easy start: a basic problem

To warm up, we will start with a toy problem with only two features and two classes. The latter condition has a name: “binary classification”. Let us create the features (X) and the class labels (y). With only two features, we will have no trouble visualizing our data.

X = np.array([[2, 2], 
              [2, 1],
              [2, 3], 
              [1, 2], 
              [1, 1],
              [3, 3]])

y = np.array([0, 1, 1, 1, 0, 1])

print(X)
print(y)

Data visualization

To better grasp the sense of this exercise, we can plot the data, drawing each point as its class label.

# Plot formatting
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.figure(figsize = (8, 8))

# Every point as its label
for x1, x2, label in zip(X[:, 0], X[:, 1], y):
    plt.text(x1, x2, str(label), fontsize = 40, color = 'g',
             ha='center', va='center')
    
# Plot formatting
plt.grid(None);
plt.xlim((0, 3.5));
plt.ylim((0, 3.5));
plt.xlabel('x1', size = 20); plt.ylabel('x2', size = 20); plt.title('data', size = 24)

Although this problem has only two features, it is not linearly separable: a simple linear classifier cannot draw a single hyperplane that separates the two classes. A decision tree, on the other hand, will be able to distinguish the classes perfectly, because it can draw several linear boundaries. A decision tree is a non-parametric model: the number of parameters grows with the amount of training data instead of being fixed in advance.
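
As a quick check of that claim, we can fit a linear classifier on the same six points and see that it cannot reach 100% training accuracy (LogisticRegression is used here purely for comparison; it is not part of the rest of the notebook):

# Sanity check: a linear model cannot perfectly separate these six points,
# because the class-0 point (2, 2) lies exactly between the class-1 points (2, 1) and (2, 3)
from sklearn.linear_model import LogisticRegression

linear_clf = LogisticRegression()
linear_clf.fit(X, y)
print(f'Linear classifier training accuracy: {linear_clf.score(X, y)}')  # expected to be below 1.0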

A single Decision tree

Here we will build and train a single “Decision tree” on our data by using Scikit-learn. The tree will learn to separate the classes by building a decision diagram based on feature thresholds. At every step, the “Decision tree” will branch, maximizing the reduction of the so-called “Gini Impurity”.

We will use the default hyperparameters for our decision tree, which means it is allowed to grow as deep as needed to completely separate the classes. This in turn means it will overfit the training data. Usually we want to limit the tree depth in order to push for generalization.

# Build a decision tree and train it
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(X, y)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')

Our decision tree has 9 nodes and a maximum depth of 3. It should have reached 100% accuracy on the training data: since we did not limit the depth, it can classify every training point perfectly (i.e., overfit).

print(f'Model Accuracy: {tree.score(X, y)}')

Visualize the decision tree

To get a grasp of how the decision tree “thinks”, it is useful to look at its entire structure, node by node. Since the tree is relatively small, we can plot it in one piece.

First, export the tree as a .dot file, making sure we label features and classes explicitly.

# export in dot format
export_graphviz(tree, 'tree.dot', rounded = True, 
                feature_names = ['x1', 'x2'], 
                class_names = ['0', '1'], filled = True)

Then, use a system call to the “dot” command of the Graphviz software to convert it to PNG format. This requires Graphviz to be installed on your computer.

# Convert in png format
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=400']);

Finally, visualize the whole tree

Image('tree.png')

A decision tree is an intuitive model: it makes decisions by walking through a series of simple questions about feature thresholds. For each node (excluding the leaf nodes), the five lines are read like this:

  1. A question about a threshold on one specific feature: the answer determines the direction of the decision path.
  2. gini: the Gini impurity of the node. The mean Gini impurity (weighted by the number of observations) decreases as we descend the tree.
  3. samples: the number of training observations that reach this node.
  4. value: the number of observations of the first and the second class, respectively.
  5. class: the class predicted for every point passing through this node, should the tree end there.

Leaf nodes (the nodes at the end of each branch/path) have no “question” because they are where the prediction is made. All the observations ending in the same leaf receive the same prediction (here with Gini = 0.0, we’ll get there in a moment).
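
If Graphviz is not handy, a plain-text view of the same structure can be printed with export_text (imported at the top); it lists the same thresholds, one indented line per node:

# Text rendering of the fitted toy tree
print(export_text(tree, feature_names=['x1', 'x2']))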

Gini Impurity

The Gini impurity of a node can be read as the probability that a random sample passing through that node would be classified with the wrong label, if it were labeled at random according to the class proportions in the node. At the root node, that probability is 44.4%. The Gini impurity is also what drives the splits: at each node, the tree picks the feature threshold that maximizes the reduction in Gini impurity. In mathematical terms, the weighted mean of the Gini impurity decreases as we traverse the tree from root to leaves.
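
To make that 44.4% concrete, here is the impurity of the root node computed by hand from the toy labels y (Gini = 1 minus the sum of squared class proportions):

# Hand computation of the Gini impurity at the root node (all six training points)
counts = np.bincount(y)            # [2, 4]: two samples of class 0, four of class 1
p = counts / counts.sum()
gini_root = 1 - np.sum(p ** 2)
print(f'Root node Gini impurity: {gini_root:.3f}')   # 0.444, i.e. ~44.4%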

Eventually, the Gini impurity reaches 0 when every point in a node belongs to a single class. But remember that correctly classifying every training point is not a good proxy for a model’s success, because it usually means overfitting.

This is why we should limit the depth of the tree and assess the model’s ability to generalize on a separate “test set”.

Constraining the maximum depth

Although this will likely reduce our training accuracy, a model is of little use if it cannot generalize (unless, to be fair, you deliberately want to overfit a model just to describe your data thoroughly, but that is another story). Our aim is to score high on the test set.

# Constrain maximum depth and train

short_tree = DecisionTreeClassifier(max_depth = 2, random_state=RSEED)
short_tree.fit(X, y)

print(f'Model Accuracy: {short_tree.score(X, y)}')

Now follow the same procedure to visualize the tree

# Export as dot and convert in png
export_graphviz(short_tree, 'shorttree.dot', rounded = True, 
                feature_names = ['x1', 'x2'], 
                class_names = ['0', '1'], filled = True)

call(['dot', '-Tpng', 'shorttree.dot', '-o', 'shorttree.png', '-Gdpi=400']);
Image('shorttree.png')

Our model no longer reaches maximum accuracy on the training set, but we can expect better accuracy on the test set because we pushed towards generalization. This is a classic example of the bias-variance trade-off, an important topic in machine learning. A model with high variance learns the training data very well but usually fails to predict new points. On the other hand, a model with high bias cannot learn enough even from the training data.

By constraining the depth of a single decision tree, we are testing one way of building a lower-variance model. Another option is to grow an entire forest (of trees, obviously), training each one on a sub-sample of our training data. The final model then consults the “council” of all trees (an Entmoot, if you will) to reach a final classification. This is the idea behind a Random Forest.
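
Before moving to the real thing, here is a minimal sketch of that idea on the toy data from the beginning. This is only the “council” part (bagging plus a majority vote); the real Random Forest below also samples a random subset of features at every split:

# Minimal bagging sketch: a few trees on bootstrap samples, combined by majority vote
rng = np.random.RandomState(RSEED)
votes = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)          # bootstrap sample
    t = DecisionTreeClassifier(random_state=RSEED).fit(X[idx], y[idx])
    votes.append(t.predict(X))
majority = (np.mean(votes, axis=0) >= 0.5).astype(int)           # the "council" of trees
print(majority)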

A real dataset

Dataset available here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system (download only the dataset called “2015”).

(Data direct link: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system/download/m5nxxInjHYIFKmB0gGTi%2Fversions%2FdK5GOra7ircJK4cHLtSe%2Ffiles%2F2015.csv?datasetVersionNumber=1)

Sign up for Kaggle; it is worth your time.

The dataset comes from the CDC and includes socio-economic indicators and lifestyle information for hundreds of thousands of individuals. The goal is to predict the health status of a person: 0 for poor health and 1 for good health. We will limit the data to 100,000 individuals to speed up training (for learning purposes only).

The problem is imbalanced (one class has far more representatives than the other), so we will not use accuracy as our metric; instead we will look at recall, precision, and the ROC curve with its area (AUC). Accuracy is not a good measure on imbalanced problems: a model that always predicts the majority class scores high on accuracy without learning anything.
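
A tiny illustration of that last point (a toy example, unrelated to the CDC dataset, using make_classification which we imported above):

# With ~95% of samples in one class, always predicting the majority class
# gives ~95% accuracy while never identifying the minority class
X_imb, y_imb = make_classification(n_samples=1000, weights=[0.95], random_state=RSEED)
always_majority = np.zeros(len(y_imb), dtype=int)
print('Accuracy:', metrics.accuracy_score(y_imb, always_majority))
print('Recall on the minority class:', metrics.recall_score(y_imb, always_majority, pos_label=1))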

Data cleaning

We’ll read the data and do some cleaning.

df = pd.read_csv('2015.csv').sample(100000, random_state = RSEED)
df.head()
df = df.select_dtypes('number')

Label exploration

df['_RFHLTH'] = df['_RFHLTH'].replace({2: 0})
df = df.loc[df['_RFHLTH'].isin([0, 1])].copy()
df = df.rename(columns = {'_RFHLTH': 'label'})
df['label'].value_counts()

The imbalanced labels confirm that accuracy is probably NOT the best metric here.

We won’t do any EDA here, but remember that, in general, EDA is good practice. It can help with feature engineering and with catching bad data before it is misused.

First, we will remove a set of columns that we won’t use when building the model.

# Remove columns with missing values
df = df.drop(columns = ['POORHLTH', 'PHYSHLTH', 'GENHLTH', 'PAINACT2', 
                        'QLMENTL2', 'QLSTRES2', 'QLHLTH2', 'HLTHPLN1', 'MENTHLTH'])

Training-Test split

To evaluate our predictions, we need a training and a test set. The model is fit on the training set and makes predictions on the test set. This is how we understand how well our model generalizes.

We will use 30% of the data as the test set

# Extract labels
labels = np.array(df.pop('label'))
labels
# 30% as test
train, test, train_labels, test_labels = train_test_split(df, labels, 
                                                          stratify = labels,
                                                          test_size = 0.3, 
                                                          random_state = RSEED)

Missing value imputation

We will fill missing values with the mean of their column. It is essential that the missing values in the test set are filled with the training set means (that is, the column means computed on the training data). The reason is simple: for genuinely new points you won’t have anything to compute a mean from, so the only option is to reuse the means computed on the training set; we treat the test set the same way to avoid leaking information from it.

# Column means computed on the training set only
train_means = train.mean()
train = train.fillna(train_means)
# The test set is filled with the *training* means, as explained above
test = test.fillna(train_means)

# Features for feature importances
features = list(train.columns)
train.shape
test.shape

Decision tree on real data

First, we will train the decision tree on the real data, allowing it to grow unconstrained so we can see the overfitting. (With 0/1 labels, the mean absolute error printed below is simply the fraction of misclassified test points.)

# Train tree
tree.fit(train, train_labels)
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')
predictions = tree.predict(test)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

Evaluating performances

Given the node number and the maximum depth of our decision tree, we can expect it to overfit on the training set. Remember that this means that it will work great on training data but poorly on new (test) data.

# Predicted probability of the positive class (good health) on train and test
train_probs = tree.predict_proba(train)[:, 1]
probs = tree.predict_proba(test)[:, 1]

train_predictions = tree.predict(train)
predictions = tree.predict(test)
print(f'Train ROC AUC Score: {roc_auc_score(train_labels, train_probs)}')
print(f'Test ROC AUC  Score: {roc_auc_score(test_labels, probs)}')

Remember that an AUC of 0.5 corresponds to a binary classifier that guesses at random. 0.66 is only slightly better than that; we can do better.
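
As a quick, purely illustrative check of that 0.5 baseline, random scores should indeed give an AUC close to 0.5:

# Random scores yield an AUC close to 0.5
random_scores = np.random.RandomState(RSEED).rand(len(test_labels))
print(f'Random-guess ROC AUC: {roc_auc_score(test_labels, random_scores):.3f}')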

Generalizing evaluation

Let’s write a function that computes a series of metrics, so we can easily evaluate different set-ups.

def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Compare machine learning model to baseline performance.
    Computes statistics and shows ROC curve."""
    
    baseline = {}
    
    baseline['recall'] = recall_score(test_labels, [1 for _ in range(len(test_labels))])
    baseline['precision'] = precision_score(test_labels, [1 for _ in range(len(test_labels))])
    baseline['roc'] = 0.5
    
    results = {}
    
    results['recall'] = recall_score(test_labels, predictions)
    results['precision'] = precision_score(test_labels, predictions)
    results['roc'] = roc_auc_score(test_labels, probs)
    
    train_results = {}
    train_results['recall'] = recall_score(train_labels, train_predictions)
    train_results['precision'] = precision_score(train_labels, train_predictions)
    train_results['roc'] = roc_auc_score(train_labels, train_probs)
    
    for metric in ['recall', 'precision', 'roc']:
        print(f'{metric.capitalize()} Baseline: {round(baseline[metric], 2)} Test: {round(results[metric], 2)} Train: {round(train_results[metric], 2)}')
    
    # Calculate false positive rates and true positive rates
    base_fpr, base_tpr, _ = roc_curve(test_labels, [1 for _ in range(len(test_labels))])
    model_fpr, model_tpr, _ = roc_curve(test_labels, probs)

    plt.figure(figsize = (8, 6))
    plt.rcParams['font.size'] = 16
    
    # Plot both curves
    plt.plot(base_fpr, base_tpr, 'b', label = 'baseline')
    plt.plot(model_fpr, model_tpr, 'r', label = 'model')
    plt.legend();
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate'); plt.title('ROC Curves');

# Distribution of predicted probabilities and predicted classes on the test set
print(Counter(probs))
print(Counter(predictions))
evaluate_model(predictions, probs, train_predictions, train_probs)

Here is the problem with a single decision tree with no constraint on depth: overfitting.

Another useful set of diagnostics comes from the confusion matrix.

Confusion Matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Oranges):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.figure(figsize = (10, 10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size = 24)
    plt.colorbar(aspect=4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size = 14)
    plt.yticks(tick_marks, classes, size = 14)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
    # Labeling the plot
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), fontsize = 20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
        
    plt.grid(None)
    plt.tight_layout()
    plt.ylabel('True label', size = 18)
    plt.xlabel('Predicted label', size = 18)
cm = confusion_matrix(test_labels, predictions)
plot_confusion_matrix(cm, classes = ['Poor Health', 'Good Health'],
                      title = 'Health Confusion Matrix')

This shows the classes predicted by the model on the test set, together with the real labels. With “good health” (label 1) as the positive class, we can see that our model produces many false negatives (people in good health predicted to be in poor health) and many false positives (people in poor health predicted to be in good health).
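
To keep the terminology straight, the four cells of the matrix can be unpacked directly (rows are true labels, columns are predictions; class 0 = poor health, class 1 = good health):

# Unpack the confusion matrix computed above
tn, fp, fn, tp = cm.ravel()
print(f'Poor health predicted as poor health (TN): {tn}')
print(f'Poor health predicted as good health (FP): {fp}')
print(f'Good health predicted as poor health (FN): {fn}')
print(f'Good health predicted as good health (TP): {tp}')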

Feature importance

We can take a look at the features the decision tree considers most important. The values are computed by summing the reduction in Gini impurity over every node where the feature is used to split.

fi = pd.DataFrame({'feature': features,
                   'importance': tree.feature_importances_}).\
                    sort_values('importance', ascending = False)
fi.head()
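
A quick bar chart of the top ten importances can make the table easier to read (optional, built from the fi dataframe just defined):

# Plot the ten most important features according to the single tree
ax = fi.head(10).plot(x='feature', y='importance', kind='barh', figsize=(8, 6), legend=False)
ax.invert_yaxis()   # most important feature on top
ax.set_xlabel('Importance'); ax.set_title('Top 10 features (single decision tree)');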

We won’t be looking at the definitions here, but you can inspect the data to determine the meaning of each feature here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system#2015_formats.json

Visualize the tree

This time we will limit the depth shown in the plot (via the max_depth argument of export_graphviz); otherwise the full tree would be impossible to read.

# Save tree as dot file
export_graphviz(tree, 'tree_real_data.dot', rounded = True, 
                feature_names = features, max_depth = 6,
                class_names = ['poor health', 'good health'], filled = True)

# Convert to png
call(['dot', '-Tpng', 'tree_real_data.dot', '-o', 'tree_real_data.png', '-Gdpi=200'])

# Visualize
Image(filename='tree_real_data.png')

We can see that the model is extremely complex. To reduce the variance of our model we could use more trees, each one trained on a random sampling of the observations.

Entering: Random Forest

Random Forest

Let’s move on to a more powerful model, the Random Forest. The basic idea is an ensemble of decision trees, each one trained on a random sample of the observations and considering a random subset of the features at every split. When making a prediction, the Random Forest combines the predictions of all its trees, a sort of “majority vote”.

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100, #number of trees in the forest 
                               random_state=RSEED, 
                               max_features = 'sqrt',
                               n_jobs=-1, verbose = 1)

# n_jobs=-1 means using all available CPU cores

# Fit on training data
model.fit(train, train_labels)

We can inspect the average number of nodes of the trees and their average maximum depth. There are 100 trees in the forest.

n_nodes = []
max_depths = []

for ind_tree in model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')
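
A note on the “majority vote” mentioned above: in scikit-learn the forest actually averages the class probabilities of its trees (soft voting). A quick check, comparing the forest’s probabilities with the mean of the individual trees’ probabilities (the trees inside the forest were fit on plain arrays, so we pass an array to avoid feature-name warnings):

# The forest's probabilities should match the average over its trees
test_array = np.asarray(test)
tree_mean = np.mean([t.predict_proba(test_array) for t in model.estimators_], axis=0)
print(np.allclose(model.predict_proba(test), tree_mean))   # should print True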

Random Forest results

train_rf_predictions = model.predict(train)
train_rf_probs = model.predict_proba(train)[:, 1]

rf_predictions = model.predict(test)
rf_probs = model.predict_proba(test)[:, 1]
evaluate_model(rf_predictions, rf_probs, train_rf_predictions, train_rf_probs)

The model reaches perfect performance on the training set, but this time the performance on the test set is good too. Comparing the ROC AUC, we can see that the Random Forest does far better than the single decision tree.

cm = confusion_matrix(test_labels, rf_predictions)
plot_confusion_matrix(cm, classes = ['Poor Health', 'Good Health'],
                      title = 'Health Confusion Matrix')

Fewer false positives (but more false negatives). Overall, it works better.

fi_model = pd.DataFrame({'feature': features,
                   'importance': model.feature_importances_}).\
                    sort_values('importance', ascending = False)
fi_model.head(10)

The interpretation is straightforward: the Random Forest is better at generalizing.

Random Forest optimization through random parameter search

To get the best performance out of the Random Forest, we can run a random search over the hyperparameter space.

# Hyperparameter grid
param_grid = {
    'n_estimators': np.linspace(10, 200).astype(int),
    'max_depth': [None] + list(np.linspace(3, 20).astype(int)),
    # 'auto' (an alias of 'sqrt' for classifiers) is removed in recent scikit-learn versions
    'max_features': ['sqrt', None] + list(np.arange(0.5, 1, 0.1)),
    'max_leaf_nodes': [None] + list(np.linspace(10, 50, 500).astype(int)),
    'min_samples_split': [2, 5, 10],
    'bootstrap': [True, False]
}

# Estimator for use in random search
estimator = RandomForestClassifier(random_state = RSEED)

# Create the random search model
rs = RandomizedSearchCV(estimator, param_grid, n_jobs = -1, 
                        scoring = 'roc_auc', cv = 3, 
                        n_iter = 10, verbose = 1, random_state=RSEED)

# Fit 
rs.fit(train, train_labels)

# This may take a few minutes to run...
rs.best_params_

The best parameters are often not the default ones. This shows the importance of tuning a model before discarding it or using it as-is: every dataset has different characteristics, and the model that best fits one dataset differs from the one that best fits another.
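
We can also look at the mean cross-validated ROC AUC of the best combination:

# Mean ROC AUC over the three cross-validation folds for the best parameters
print(f'Best CV ROC AUC: {rs.best_score_:.3f}')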

Using the best model

We can now evaluate the performance of the best model.

best_model = rs.best_estimator_
train_rf_predictions = best_model.predict(train)
train_rf_probs = best_model.predict_proba(train)[:, 1]

rf_predictions = best_model.predict(test)
rf_probs = best_model.predict_proba(test)[:, 1]
n_nodes = []
max_depths = []

for ind_tree in best_model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

The best maximum depth found by the search is a finite value: another demonstration of how limiting the depth can help the model generalize.

evaluate_model(rf_predictions, rf_probs, train_rf_predictions, train_rf_probs)

In this case the performance of the optimized model is comparable to the default one. More iterations of the random search would probably improve the results, or it may be that we have reached the limit of what a Random Forest can achieve with this problem and this data representation.

estimator = best_model.estimators_[1]

# Export a tree 
export_graphviz(estimator, 'tree_from_optimized_forest.dot', rounded = True, 
                feature_names=train.columns, max_depth = 8, 
                class_names = ['poor health', 'good health'], filled = True)
call(['dot', '-Tpng', 'tree_from_optimized_forest.dot', '-o', 'tree_from_optimized_forest.png', '-Gdpi=200'])
Image('tree_from_optimized_forest.png')

Conclusions

Summarizing the key concepts:

  1. A single decision tree is an intuitive model that makes decisions based on a series of thresholds on the features. It has high variance, which typically shows up as overfitting.
  2. Gini impurity: the metric minimized by the decision tree. It represents the probability that a random sample in a node would receive a label different from the predicted one.
  3. Random Forest: an ensemble model made up of many single decision trees (hundreds to thousands) whose final prediction combines the predictions of each individual tree.

Remember: it’s fun to use machine learning libraries, but it’s important to know what is happening inside the black boxes. Every method will give you some kind of result; knowing what to apply, when, and how to correctly interpret the result is what distinguishes a person from a monster.

This notebook is freely reinterpreted from this article: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 (original author: https://github.com/WillKoehrsen).