Visualizing tree-based regressors
The dtreeviz library is designed to help machine learning practitioners visualize and interpret decision trees and decision-tree-based models, such as gradient boosting machines.
The purpose of this notebook is to illustrate the main capabilities and functions of the dtreeviz API. To do that, we will use scikit-learn and the toy but well-known Titanic data set for illustrative purposes. Currently, dtreeviz supports the following decision tree libraries:
- scikit-learn
- XGBoost
- Spark MLlib
- LightGBM
- TensorFlow
To interoperate with these different libraries, dtreeviz uses an adaptor object, obtained from the function dtreeviz.model(), to extract the model information necessary for visualization. Given such an adaptor object, all of the dtreeviz functionality is available to you through the same programming interface. The basic dtreeviz usage recipe is:
- Import dtreeviz and your decision tree library
- Acquire and load data into memory
- Train a classifier or regressor model using your decision tree library
- Obtain a dtreeviz adaptor model using viz_model = dtreeviz.model(your_trained_model,...)
- Call dtreeviz functions, such as viz_model.view() or viz_model.explain_prediction_path(sample_x)
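For instance, a minimal sketch of that recipe with a small scikit-learn classifier on the familiar iris data might look like this (a warm-up example; the rest of this notebook applies the same pattern to a regressor):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

# Train a small classifier on the iris data
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(iris.data, iris.target)

# Wrap the trained model in a dtreeviz adaptor
viz_model = dtreeviz.model(clf,
                           X_train=iris.data, y_train=iris.target,
                           feature_names=iris.feature_names,
                           target_name="variety",
                           class_names=list(iris.target_names))

viz_model.view()   # render the tree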
The four categories of dtreeviz functionality are:
- Tree visualizations
- Prediction path explanations
- Leaf information
- Feature space exploration
We have grouped code examples by classifiers and regressors, with a follow-up section on partitioning feature space.
These examples require dtreeviz 2.0 or above because the code uses the new API introduced in 2.0.
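To verify which version is installed, a quick check (assuming the package exposes a __version__ attribute, as recent releases do):

import dtreeviz
print(dtreeviz.__version__)  # expect 2.0 or above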
Setup
import sys
import os

%config InlineBackend.figure_format = 'retina'  # Make visualizations look good
#%config InlineBackend.figure_format = 'svg'
%matplotlib inline

if 'google.colab' in sys.modules:
    !pip install -q dtreeviz

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import dtreeviz

random_state = 1234  # get reproducible trees
Load Sample Data
= "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/titanic/titanic.csv"
dataset_url = pd.read_csv(dataset_url)
dataset # Fill missing values for Age
"Age":dataset.Age.mean()}, inplace=True)
dataset.fillna({# Encode categorical variables
"Sex_label"] = dataset.Sex.astype("category").cat.codes
dataset["Cabin_label"] = dataset.Cabin.astype("category").cat.codes
dataset["Embarked_label"] = dataset.Embarked.astype("category").cat.codes dataset[
To demonstrate regressor tree visualization, we start by creating a regressor model that predicts age instead of survival:
= ["Pclass", "Fare", "Sex_label", "Cabin_label", "Embarked_label", "Survived"]
features_reg = "Age"
target_reg
= DecisionTreeRegressor(max_depth=3, random_state=random_state, criterion="mae")
tree_regressor tree_regressor.fit(dataset[features_reg].values, dataset[target_reg].values)
DecisionTreeRegressor(criterion='mae', max_depth=3, random_state=1234)
Initialize dtreeviz model (adaptor)
viz_rmodel = dtreeviz.model(model=tree_regressor,
                            X_train=dataset[features_reg],
                            y_train=dataset[target_reg],
                            feature_names=features_reg,
                            target_name=target_reg)
Tree structure visualizations
viz_rmodel.view()
="LR") viz_rmodel.view(orientation
=False) viz_rmodel.view(fancy
=(0, 2)) viz_rmodel.view(depth_range_to_display
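A rendering returned by view() can also be saved to disk instead of being displayed inline; a minimal sketch, assuming SVG output (the file name is just an illustration):

v = viz_rmodel.view()        # capture the render object
v.save("decision_tree.svg")  # write the rendering to an SVG file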
Prediction path explanations
x = dataset[features_reg].iloc[10]
x
Pclass 3.0
Fare 16.7
Sex_label 0.0
Cabin_label 145.0
Embarked_label 2.0
Survived 1.0
Name: 10, dtype: float64
viz_rmodel.view(x=x)
viz_rmodel.view(show_just_path=True, x=x)
print(viz_rmodel.explain_prediction_path(x))
1.5 <= Pclass
Fare < 27.82
139.5 <= Cabin_label
viz_rmodel.instance_feature_importance(x, figsize=(3.5, 2))
Leaf info
viz_rmodel.leaf_sizes(figsize=(3.5, 2))
viz_rmodel.rtree_leaf_distributions()
viz_rmodel.node_stats(node_id=4)
| | Pclass | Fare | Sex_label | Cabin_label | Embarked_label | Survived |
|---|---|---|---|---|---|---|
| count | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 |
| mean | 1.0 | 152.167936 | 0.347222 | 39.25 | 0.916667 | 0.763889 |
| std | 0.0 | 97.808005 | 0.479428 | 26.556742 | 1.031203 | 0.427672 |
| min | 1.0 | 66.6 | 0.0 | -1.0 | -1.0 | 0.0 |
| 25% | 1.0 | 83.1583 | 0.0 | 20.75 | 0.0 | 1.0 |
| 50% | 1.0 | 120.0 | 0.0 | 40.0 | 0.0 | 1.0 |
| 75% | 1.0 | 211.3375 | 1.0 | 63.0 | 2.0 | 1.0 |
| max | 1.0 | 512.3292 | 1.0 | 79.0 | 2.0 | 1.0 |
viz_rmodel.leaf_purity(figsize=(3.5, 2))
Partitioning
To demonstrate regression, let’s load a toy Cars data set and visualize the partitioning of univariate and bivariate feature spaces.
= "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/cars.csv"
dataset_url = pd.read_csv(dataset_url)
df_cars = df_cars.drop('MPG', axis=1)
X = df_cars['MPG']
y = list(X.columns) features
dtr_cars = DecisionTreeRegressor(max_depth=3, criterion="mae")
dtr_cars.fit(X.values, y.values)
DecisionTreeRegressor(criterion='mae', max_depth=3)
viz_rmodel = dtreeviz.model(dtr_cars, X, y,
                            feature_names=features,
                            target_name='MPG')
The following visualization illustrates how the decision tree breaks up the WGT (car weight) feature in order to get relatively pure MPG (miles per gallon) target values.
viz_rmodel.rtree_feature_space(features=['WGT'])
To visualize a two-dimensional feature space, we can draw in three dimensions:
viz_rmodel.rtree_feature_space3D(features=['WGT','ENG'],
                                 fontsize=10,
                                 elev=30, azim=20,
                                 show={'splits', 'title'},
                                 colors={'tessellation_alpha': .5})
Equivalently, we can show a heat map as if we were looking at the three-dimensional plot from the top down:
viz_rmodel.rtree_feature_space(features=['WGT','ENG'])