# Lecture 24 – Decision Trees, Grid Search, Multicollinearity

## DSC 80, Winter 2023

### Announcements

• Lab 9 (pipelines) is due on Monday, March 13th at 11:59PM.
• Project 5 (prediction) is due on Thursday, March 23rd at 11:59PM (no slip days allowed)!
• practice.dsc80.com now contains 3 past finals. Start reviewing!
    • Prioritize the Spring 2022 final.
• There is no live lecture next Wednesday or Friday; videos will be posted instead.

### Agenda

• Cross-validation.
• Example: Decision trees 🌲.
• Grid search.
• Multicollinearity.

## Cross-validation

### Recap

• Suppose we've decided to fit a polynomial regressor on a dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, but we're unsure of what degree of polynomial to use (1, 2, 3, ..., 25).
    • Note that polynomial degree is a hyperparameter – it is something we can control before our model is fit to the training data.
• Or, suppose we have a dataset of restaurant tips, bills, table sizes, days of the week, and so on, and want to decide which features to use in a linear model that predicts tips.
    • Remember, more complicated models (that is, models with more features) don't necessarily generalize well to unseen data!
• Goal: Find the best hyperparameter, or best choice of features, so that our fit model generalizes well to unseen data.

### $k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation (or simply "cross-validation") is the technique we will use for finding hyperparameters.
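
To make the "training $k-1$ times, validation once" claim concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` and the choice of `n = 10` points are assumptions made for illustration; they are not part of the lecture's code.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

n, k = 10, 5
folds = k_fold_indices(n, k)

# Count how often each point lands in the training and validation sets.
train_counts = [0] * n
val_counts = [0] * n
for i, val_fold in enumerate(folds):
    # Everything outside fold i is the training set for this round.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    for idx in val_fold:
        val_counts[idx] += 1
    for idx in train:
        train_counts[idx] += 1

print(val_counts)    # every entry is 1
print(train_counts)  # every entry is k - 1 = 4
```

Because the folds are disjoint and together cover the whole dataset, no point is ever used for training and validation in the same round.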

### $k$-fold cross-validation

First, shuffle the dataset randomly and split it into $k$ disjoint groups. Then:

• For each hyperparameter:
    • For each unique group:
        • Let the unique group be the "validation set".
        • Let all other groups be the "training set".
        • Train a model using the selected hyperparameter on the training set.
        • Evaluate the model on the validation set.
    • Compute the average validation score (e.g. RMSE) for the particular hyperparameter.
• Choose the hyperparameter with the best average validation score.
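
The procedure above can be sketched with NumPy for the polynomial-degree example from the recap. This is a rough illustration under stated assumptions, not the lecture's implementation: the function name `cross_validate_degrees`, the synthetic noisy-cubic dataset, and the candidate degrees 1 through 8 are all made up for the example.

```python
import numpy as np

def cross_validate_degrees(x, y, degrees, k=5, seed=0):
    """Return {degree: average validation RMSE} from k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then split into k disjoint groups shared by all degrees.
    folds = np.array_split(rng.permutation(len(x)), k)

    avg_rmse = {}
    for degree in degrees:                      # for each hyperparameter
        fold_rmses = []
        for i in range(k):                      # for each unique group
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            # Train on the other k - 1 groups.
            coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
            # Evaluate on the held-out group.
            preds = np.polyval(coefs, x[val_idx])
            fold_rmses.append(np.sqrt(np.mean((y[val_idx] - preds) ** 2)))
        avg_rmse[degree] = float(np.mean(fold_rmses))
    return avg_rmse

# Synthetic data (an assumption for illustration): a noisy cubic.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
y = x ** 3 - 2 * x + rng.normal(0, 1, size=200)

scores = cross_validate_degrees(x, y, degrees=range(1, 9))
# For RMSE, "best" means lowest.
best_degree = min(scores, key=scores.get)
print(scores)
print("best degree:", best_degree)
```

Note that the same $k$ groups are reused for every degree, so the average validation RMSEs are compared on identical splits.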

As a reminder, here's what "sample 1" looks like.