import pandas as pd
from sklearn import datasets
import seaborn as sns
import skimpy

This notebook will use the Iris flower dataset from sklearn to introduce classification with Random Forest.
Feature importance and partial dependency plots will be created once the model is trained.
Overview
- Set up train-test data
- Review decision trees
- Visualize a decision tree
- Introduction to random forest
- Train and predict with a random forest
- Visualize feature importances
- Create partial dependency plots for random forest
- Knowledge check and questions
- Additional EDA & Visualizations
Prerequisites
- Python imports
- Train-test split
- Classification metrics
- Decision Trees
- Measures of node impurity (Shannon Entropy and Gini Index; a short refresher follows this list)
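As a quick refresher on the impurity measures listed above, here is a minimal sketch (the helper names `gini` and `entropy` are just for illustration, not part of the notebook's code):

```python
import numpy as np

def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has impurity 0; an evenly mixed two-class node is maximally impure.
print(gini(['setosa'] * 10))                        # 0.0
print(gini(['setosa'] * 5 + ['virginica'] * 5))     # 0.5
print(entropy(['setosa'] * 5 + ['virginica'] * 5))  # 1.0
```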
Learning Objectives
- Apply a random forest classifier to a dataset
- Visualize feature importances from a trained random forest model
Dependencies
This notebook was made with the following packages:
1. python=3.7.6
1. sklearn=1.2.2
1. matplotlib=3.1.3
1. pandas=1.0.1
Set up data set
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df['label'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | label |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |
df.shape
(150, 6)
df.label.value_counts()
label
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df[data.feature_names],
df['label'],
random_state=42,
stratify=df['label']
)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(112, 4)
(38, 4)
(112,)
(38,)
Decision Tree Review
We will first build a single decision tree to review its structure before moving on to random forest classification.
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
DecisionTreeClassifier(random_state=42)
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt,
feature_names=data.feature_names,
class_names=data.target_names,
filled=True)
predicted_labels = dt.predict(X_test)

from sklearn.metrics import (
accuracy_score,
recall_score,
precision_score,
f1_score
)
def print_metrics(y_test, y_pred):
    """Display model metrics in a human-readable fashion.

    y_test: array of true y values
    y_pred: array of predicted y values
    """
    scores = [accuracy_score, recall_score, precision_score, f1_score]
    s_labels = ['Accuracy', 'Recall', 'Precision', 'F1']
    for score, s_label in zip(scores, s_labels):
        if s_label == 'Accuracy':
            print(f"{s_label}: {score(y_test, y_pred):.2f}")
        else:
            print(f"{s_label}: {score(y_test, y_pred, average='weighted'):.2f}")

print_metrics(y_test, predicted_labels)
Accuracy: 0.89
Recall: 0.89
Precision: 0.90
F1: 0.89
from sklearn.metrics import ConfusionMatrixDisplay
cm_disp = ConfusionMatrixDisplay.from_predictions(y_true=y_test,y_pred=predicted_labels, cmap='Blues')
Decision Tree Knowledge
- Are decision trees deterministic?
- Yes, they are deterministic. The best split is found at each step and is always used.
- How are decision tree splits determined?
- By information gain (entropy reduction) or by reduction in Gini impurity.
- Are decision trees parametric?
- No, they are not parametric. They make no assumptions about the form of the data, and the learned splits depend entirely on the observed values.
- Decision trees often have high variance. Why might that be?
- They may split on noisy or unimportant features and overfit the training data, so a small change in the training data can produce a very different tree (see the sketch below).
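To see this variance in action, the sketch below (an illustration added here, not original notebook code) fits trees on bootstrap resamples of the training data defined earlier and compares their root splits and test accuracy:

```python
from sklearn.utils import resample

# Each bootstrap resample of the same training data can yield a noticeably
# different tree: the root split and the test accuracy both move around.
for seed in range(3):
    Xb, yb = resample(X_train, y_train, random_state=seed)
    t = DecisionTreeClassifier(random_state=42).fit(Xb, yb)
    root_feature = data.feature_names[t.tree_.feature[0]]
    print(f"seed={seed}: root split on '{root_feature}' <= {t.tree_.threshold[0]:.2f}, "
          f"test accuracy {t.score(X_test, y_test):.2f}")
```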
From tree to forest
- How might multiple decision trees be leveraged to reduce variance?
- Create multiple classifiers and aggregate their results. Many weak learners can often be combined into a strong learner.
- Would multiple deterministic decision trees be useful?
- Only if they were NOT deterministic; identical trees trained on the same data would all make the same errors.
- How could they be made non-deterministic?
- By bootstrapping the data and limiting the features considered at each split.
Random Forest
- An ensemble method that combines many decision trees, each trained on a different subset of the data and features, to create a strong learner
- The final prediction is made by majority vote across the trees (a from-scratch sketch of this idea follows this list)
- Reduces variance by making the individual trees non-deterministic
- Generally uses a large number of bushy (deep) trees
- Can achieve excellent performance with minimal tuning
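A rough from-scratch sketch of the bagging idea described above (assumptions: it reuses X_train/y_train, DecisionTreeClassifier, and print_metrics from earlier cells, and it omits the per-split feature sub-sampling that RandomForestClassifier adds on top of bagging):

```python
from sklearn.utils import resample

# Bagging: train one tree per bootstrap resample, then predict by majority vote.
n_trees = 25
trees = []
for seed in range(n_trees):
    Xb, yb = resample(X_train, y_train, random_state=seed)
    trees.append(DecisionTreeClassifier(random_state=seed).fit(Xb, yb))

# Rows = trees, columns = test samples; the column-wise mode is the majority vote.
all_preds = pd.DataFrame([t.predict(X_test) for t in trees])
majority_vote = all_preds.mode(axis=0).iloc[0]
print_metrics(y_test, majority_vote)
```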
Random Forest Example
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=1)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

cm_disp = ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=rf_preds, cmap='Blues')
print_metrics(y_test, rf_preds)
Accuracy: 0.92
Recall: 0.92
Precision: 0.92
F1: 0.92
Visualizing trees in the forest
def plot_tree(est_num=0):
    """Plot a single tree from the fitted random forest."""
    fn = data.feature_names
    cn = data.target_names
    tree.plot_tree(rf.estimators_[est_num],
                   feature_names=fn,
                   class_names=cn,
                   filled=True)

plot_tree(est_num=0)
plot_tree(est_num=1)
Focusing only on the first split in the two trees above, we can see differences in the splits used to build the forest.
The first tree splits on sepal length <= 5.35 and the second on petal length <= 2.45. The depth of the trees also varies. This is due to the randomness introduced by bootstrapping and random feature selection.
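Rather than plotting every tree, we can also loop over the fitted estimators and print each tree's root split and depth (a small sketch using the estimators_ and tree_ attributes of the fitted forest):

```python
# Print the root split and depth of every tree in the forest: bootstrapping and
# random feature selection make the trees differ from one another.
for i, est in enumerate(rf.estimators_):
    root_feature = data.feature_names[est.tree_.feature[0]]
    print(f"tree {i}: root split on '{root_feature}' <= {est.tree_.threshold[0]:.2f}, "
          f"depth {est.get_depth()}")
```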
Random Forest Feature Importances
ft_import = pd.DataFrame()
ft_import['Features'] = data.feature_names
ft_import['Importance'] = rf.feature_importances_
ft_import

|   | Features | Importance |
|---|---|---|
| 0 | sepal length (cm) | 0.259131 |
| 1 | sepal width (cm) | 0.047562 |
| 2 | petal length (cm) | 0.332972 |
| 3 | petal width (cm) | 0.360335 |
# All importances sum to 1
ft_import.Importance.sum()
np.float64(1.0)
ft_import.plot.bar(x='Features', y='Importance', rot=30)
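Impurity-based importances like the ones above are computed on the training data and can favor features with many distinct values; as a cross-check, scikit-learn's permutation importance scores the model on held-out data (a brief sketch; n_repeats=10 is an arbitrary choice):

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time in the test set and measure the drop in score.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'Features': data.feature_names,
    'Importance': perm.importances_mean,
}).sort_values('Importance', ascending=False)
perm_df.plot.bar(x='Features', y='Importance', rot=30)
```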
Partial Dependency Plots
A partial dependence plot shows the marginal effect of a feature on the model's predictions.
For each value on a grid, the feature is set to that value for every observation, the model's predictions are averaged, and the averaged predictions are plotted against the grid.
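To make that definition concrete, here is a rough hand-rolled version of a one-way partial dependence curve for the fitted forest (a sketch only; the PartialDependenceDisplay call below does this, plus plotting, for us):

```python
import numpy as np

def manual_partial_dependence(model, X, feature, grid_points=20):
    """Average predicted class probabilities as one feature is swept over a grid."""
    grid = np.linspace(X[feature].min(), X[feature].max(), grid_points)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value  # set the feature to this value for ALL observations
        averaged.append(model.predict_proba(X_mod).mean(axis=0))  # average over rows
    return grid, np.array(averaged)  # shape: (grid_points, n_classes)

grid, pdp = manual_partial_dependence(rf, X_train, 'petal length (cm)')
setosa_idx = list(rf.classes_).index('setosa')
plt.plot(grid, pdp[:, setosa_idx])
plt.xlabel('petal length (cm)')
plt.ylabel("Partial dependence for 'setosa'")
```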
from sklearn.inspection import PartialDependenceDisplay
pd_plot = PartialDependenceDisplay.from_estimator(rf, X_train, ['petal length (cm)'], target='setosa')
for target in data.target_names:
    print('Target is: ', target)
    # plot_partial_dependence(rf, X_train, data.feature_names, target=target)
    PartialDependenceDisplay.from_estimator(rf, X_train, data.feature_names, target=target)
Target is:  setosa
Target is:  versicolor
Target is:  virginica
Advantages of Random Forests
- Ensemble model (Wisdom of the Crowd)
- Good out-of-box performance
- Trees can be trained in parallel (see the sketch after the cons list below)
Cons of Random Forests
- Expensive to train
- Can produce very large model files
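A short sketch illustrating two of the points above: trees can be trained in parallel via the n_jobs argument, and a serialized forest can take up a fair amount of disk space (joblib ships with scikit-learn; the file name is arbitrary):

```python
import os
import joblib

# n_jobs=-1 trains the trees on all available CPU cores.
rf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=1)
rf_parallel.fit(X_train, y_train)

# Serialize the fitted forest and check how large the file is.
joblib.dump(rf_parallel, 'rf_iris.joblib')
print(f"Model file size: {os.path.getsize('rf_iris.joblib') / 1024:.1f} KiB")
```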
Model Comparison and Evaluation
- Which model performed better?
- Which feature had the most influence on the random forest model?
Knowledge Check
- Is a random forest the same as simply combining many decision trees?
- Name two ways in which random forests are made non-deterministic.
- Are random forest classifiers parametric?
- Name two visuals that help interpret random forest models.
Review Objectives
- Apply a random forest classifier to a dataset.
- Visualize feature importances from a trained random forest model.
Next Steps
- Tune the random forest classifier (a GridSearchCV sketch follows this list)
- Number of estimators
- Criterion: default is gini, can also try entropy
- Max depth, min samples
- Number of features
- Create a random forest regressor and test it on a regression dataset such as sklearn's California housing data (the Boston housing dataset was removed in scikit-learn 1.2)
- Compare to decision tree model
- Plot feature importances
- Create partial dependency plots
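As a starting point for the tuning step above, here is a minimal GridSearchCV sketch (the grid values are arbitrary starting points, not recommendations):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5],
    'min_samples_leaf': [1, 3, 5],
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(f"Best cross-validated accuracy: {grid_search.best_score_:.2f}")
```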
Full Day Activities
- Code random forest from scratch
- Code a partial dependency plot function from scratch
Additional EDA & Visualizations
skimpy.skim(df)

Data summary: 150 rows, 6 columns (4 float64, 1 int64, 1 string); no missing values.

| column | NA | NA % | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | 0 | 0 | 5.843 | 0.8281 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | ▃██▇▅▂ |
| sepal width (cm) | 0 | 0 | 3.057 | 0.4359 | 2 | 2.8 | 3 | 3.3 | 4.4 | ▁▇█▇▂▁ |
| petal length (cm) | 0 | 0 | 3.758 | 1.765 | 1 | 1.6 | 4.35 | 5.1 | 6.9 | █ ▂▇▆▂ |
| petal width (cm) | 0 | 0 | 1.199 | 0.7622 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | █ ▂▆▄▄ |
| target | 0 | 0 | 1 | 0.8192 | 0 | 0 | 1 | 2 | 2 | █ █ █ |

The string column `label` has no missing values (150 single-word values; shortest 'setosa', longest 'versicolor').
features = df.columns[:4]
features
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
for feat in features:
    plt.figure()
    sns.histplot(x=df[feat], hue=df['label']).set(title=feat)
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='label')
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='label')
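If desired, a seaborn pairplot gives the same pairwise views for all four features in one figure (a one-line sketch):

```python
sns.pairplot(df, hue='label', vars=features)
```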