Visualizing tree-based regressors
The dtreeviz library is designed to help machine learning practitioners visualize and interpret decision trees and decision-tree-based models, such as gradient boosting machines.
The purpose of this notebook is to illustrate the main capabilities and functions of the dtreeviz API. To do that, we will use scikit-learn and the toy but well-known Titanic data set for illustrative purposes. Currently, dtreeviz supports the following decision tree libraries:
- scikit-learn
- XGBoost
- Spark MLlib
- LightGBM
- TensorFlow
To interoperate with these different libraries, dtreeviz uses an adaptor object, obtained from the function dtreeviz.model(), to extract the model information necessary for visualization. Given such an adaptor object, all of the dtreeviz functionality is available to you through the same programming interface. The basic dtreeviz usage recipe is:
- Import dtreeviz and your decision tree library
- Acquire and load data into memory
- Train a classifier or regressor model using your decision tree library
- Obtain a dtreeviz adaptor model using viz_model = dtreeviz.model(your_trained_model,...)
- Call dtreeviz functions, such as viz_model.view() or viz_model.explain_prediction_path(sample_x)
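For instance, a minimal sketch of that recipe with a small scikit-learn classifier on the familiar iris data might look like this (a warm-up example; the rest of this notebook applies the same pattern to a regressor):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

# Train a small classifier on the iris data
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(iris.data, iris.target)

# Wrap the trained model in a dtreeviz adaptor
viz_model = dtreeviz.model(clf,
                           X_train=iris.data, y_train=iris.target,
                           feature_names=iris.feature_names,
                           target_name="variety",
                           class_names=list(iris.target_names))

viz_model.view()   # render the tree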
The four categories of dtreeviz functionality are:
- Tree visualizations
- Prediction path explanations
- Leaf information
- Feature space exploration
We have grouped code examples by classifiers and regressors, with a follow-up section on partitioning feature space.
These examples require dtreeviz 2.0 or above because the code uses the new API introduced in 2.0.
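To verify which version is installed, a quick check (assuming the package exposes a __version__ attribute, as recent releases do):

import dtreeviz
print(dtreeviz.__version__)  # expect 2.0 or above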
Setup
import sys
import os

%config InlineBackend.figure_format = 'retina'  # Make visualizations look good
#%config InlineBackend.figure_format = 'svg'
%matplotlib inline

if 'google.colab' in sys.modules:
    !pip install -q dtreeviz

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import dtreeviz

random_state = 1234  # get reproducible trees
Load Sample Data
= "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/titanic/titanic.csv"
dataset_url = pd.read_csv(dataset_url)
dataset # Fill missing values for Age
"Age":dataset.Age.mean()}, inplace=True)
dataset.fillna({# Encode categorical variables
"Sex_label"] = dataset.Sex.astype("category").cat.codes
dataset["Cabin_label"] = dataset.Cabin.astype("category").cat.codes
dataset["Embarked_label"] = dataset.Embarked.astype("category").cat.codes dataset[
To demonstrate regressor tree visualization, we start by creating a regressor model that predicts age instead of survival:
= ["Pclass", "Fare", "Sex_label", "Cabin_label", "Embarked_label", "Survived"]
features_reg = "Age"
target_reg
= DecisionTreeRegressor(max_depth=3, random_state=random_state, criterion="mae")
tree_regressor tree_regressor.fit(dataset[features_reg].values, dataset[target_reg].values)
DecisionTreeRegressor(criterion='mae', max_depth=3, random_state=1234)
Initialize dtreeviz model (adaptor)
viz_rmodel = dtreeviz.model(model=tree_regressor,
                            X_train=dataset[features_reg],
                            y_train=dataset[target_reg],
                            feature_names=features_reg,
                            target_name=target_reg)
Tree structure visualizations
viz_rmodel.view()
="LR") viz_rmodel.view(orientation
=False) viz_rmodel.view(fancy
=(0, 2)) viz_rmodel.view(depth_range_to_display
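A rendering returned by view() can also be saved to disk instead of being displayed inline; a minimal sketch, assuming SVG output (the file name is just an illustration):

v = viz_rmodel.view()        # capture the render object
v.save("decision_tree.svg")  # write the rendering to an SVG file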
Prediction path explanations
x = dataset[features_reg].iloc[10]
x
Pclass 3.0
Fare 16.7
Sex_label 0.0
Cabin_label 145.0
Embarked_label 2.0
Survived 1.0
Name: 10, dtype: float64
viz_rmodel.view(x=x)
viz_rmodel.view(show_just_path=True, x=x)
print(viz_rmodel.explain_prediction_path(x))
1.5 <= Pclass
Fare < 27.82
139.5 <= Cabin_label
viz_rmodel.instance_feature_importance(x, figsize=(3.5, 2))
Leaf info
viz_rmodel.leaf_sizes(figsize=(3.5, 2))
viz_rmodel.rtree_leaf_distributions()
viz_rmodel.node_stats(node_id=4)
| | Pclass | Fare | Sex_label | Cabin_label | Embarked_label | Survived |
|---|---|---|---|---|---|---|
| count | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 |
| mean | 1.0 | 152.167936 | 0.347222 | 39.25 | 0.916667 | 0.763889 |
| std | 0.0 | 97.808005 | 0.479428 | 26.556742 | 1.031203 | 0.427672 |
| min | 1.0 | 66.6 | 0.0 | -1.0 | -1.0 | 0.0 |
| 25% | 1.0 | 83.1583 | 0.0 | 20.75 | 0.0 | 1.0 |
| 50% | 1.0 | 120.0 | 0.0 | 40.0 | 0.0 | 1.0 |
| 75% | 1.0 | 211.3375 | 1.0 | 63.0 | 2.0 | 1.0 |
| max | 1.0 | 512.3292 | 1.0 | 79.0 | 2.0 | 1.0 |
viz_rmodel.leaf_purity(figsize=(3.5, 2))
Partitioning
To demonstrate regression, let’s load a toy Cars data set and visualize the partitioning of univariate and bivariate feature spaces.
= "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/cars.csv"
dataset_url = pd.read_csv(dataset_url)
df_cars = df_cars.drop('MPG', axis=1)
X = df_cars['MPG']
y = list(X.columns) features
dtr_cars = DecisionTreeRegressor(max_depth=3, criterion="mae")
dtr_cars.fit(X.values, y.values)
DecisionTreeRegressor(criterion='mae', max_depth=3)
viz_rmodel = dtreeviz.model(dtr_cars, X, y,
                            feature_names=features,
                            target_name='MPG')
The following visualization illustrates how the decision tree breaks up the WGT (car weight) feature in order to get relatively pure MPG (miles per gallon) target values.
viz_rmodel.rtree_feature_space(features=['WGT'])
To visualize a two-dimensional feature space, we can draw in three dimensions:
viz_rmodel.rtree_feature_space3D(features=['WGT','ENG'],
                                 fontsize=10,
                                 elev=30, azim=20,
                                 show={'splits', 'title'},
                                 colors={'tessellation_alpha': .5})
Equivalently, we can show a heat map as if we were looking at the three-dimensional plot from the top down:
viz_rmodel.rtree_feature_space(features=['WGT','ENG'])