In [1]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
# Use plotly as the default plotting backend for pandas,
# and a consistent template for all plotly figures.
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

import warnings
warnings.simplefilter('ignore')

Lecture 24 – Decision Trees, Grid Search, Multicollinearity¶

DSC 80, Spring 2023¶

Agenda¶

  • Cross-validation.
  • Example: Decision trees 🌲.
  • Grid search.
  • Multicollinearity.

Cross-validation¶

Recap¶

  • Suppose we've decided to fit a polynomial regressor on a dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, but we're unsure of what degree of polynomial to use (1, 2, 3, ..., 25).
    • Note that polynomial degree is a hyperparameter – it is something we choose before the model is fit to the training data (see the short sketch after this list).
  • Or, suppose we have a dataset of restaurant tips, bills, table sizes, days of the week, and so on, and want to decide which features to use in a linear model that predicts tips.
  • Remember, more complicated models (that is, models with more features) don't necessarily generalize well to unseen data!
  • Goal: Find the best hyperparameter, or best choice of features, so that our fit model generalizes well to unseen data.
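
Here's a minimal sketch of the distinction, assuming sklearn's Pipeline API (the names X and y below are illustrative stand-ins for training data): the polynomial degree is fixed when the model is constructed, while the polynomial's coefficients are learned only when the model is fit.

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=3 is a hyperparameter: we choose it before seeing any training data.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())

# The coefficients of the polynomial, in contrast, are parameters:
# they would be learned from training data by calling model.fit(X, y).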

$k$-fold cross-validation¶

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation (or simply "cross-validation") is the technique we will use for finding hyperparameters.
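
For example, here's a small sketch (assuming sklearn's KFold; the 10-point array is a toy stand-in for a real dataset) showing that with $k = 5$, every point appears in the validation set exactly once:

from sklearn.model_selection import KFold

data = np.arange(10)  # toy "dataset" of 10 points.
kf = KFold(n_splits=5, shuffle=True, random_state=23)
for fold, (train_idx, val_idx) in enumerate(kf.split(data), start=1):
    print(f'fold {fold}: train={data[train_idx]}, validation={data[val_idx]}')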

$k$-fold cross-validation¶

First, shuffle the dataset randomly and split it into $k$ disjoint groups (or "folds"). Then (see the code sketch after this list):

  • For each hyperparameter:
    • For each unique group:
      • Let the unique group be the "validation set".
      • Let all other groups be the "training set".
      • Train a model using the selected hyperparameter on the training set.
      • Evaluate the model on the validation set.
    • Compute the average validation score (e.g. RMSE) for the particular hyperparameter.
  • Choose the hyperparameter with the best average validation score.
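
Here's a minimal code sketch of the full procedure, assuming sklearn's KFold and Pipeline utilities and a hypothetical 1D dataset; the hyperparameter being searched is polynomial degree (1 through 25), as in the recap above.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data for illustration; any numeric X and y work here.
rng = np.random.default_rng(23)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=23)

avg_rmses = {}
for degree in range(1, 26):                    # for each hyperparameter:
    fold_rmses = []
    for train_idx, val_idx in kf.split(X):     # for each unique group:
        # Train on all other groups, using the selected hyperparameter.
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X[train_idx], y[train_idx])
        # Evaluate on the held-out validation group.
        preds = model.predict(X[val_idx])
        fold_rmses.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
    avg_rmses[degree] = np.mean(fold_rmses)    # average validation RMSE.

# Choose the hyperparameter with the best (lowest) average validation RMSE.
best_degree = min(avg_rmses, key=avg_rmses.get)

This double loop is exactly what grid search, discussed later in this lecture, automates.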

As a reminder, here's what "sample 1" looks like.

In [2]:
sample_1 = pd.read_csv(os.path.join('data', 'sample-1.csv'))
px.scatter(x=sample_1['x'], y=sample_1['y'], template=TEMPLATE)