In [1]:

```
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'
import warnings
warnings.simplefilter('ignore')
```

- Lab 9 (pipelines) is due on Monday, March 13th at 11:59PM.
- Project 5 (prediction) is due on Thursday, March 23rd at 11:59PM (no slip days allowed)!
- practice.dsc80.com now contains 3 past finals. Start reviewing!
- Prioritize the Spring 2022 final.

- There is no live lecture next Wednesday or Friday; videos will be posted instead.

- Cross-validation.
- Example: Decision trees 🌲.
- Grid search.
- Multicollinearity.

- Suppose we've decided to fit a polynomial regressor on a dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, but we're unsure of what degree of polynomial to use (1, 2, 3, ..., 25).
- Note that polynomial degree is a **hyperparameter** – it is something we can control *before* our model is fit to the training data.
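To make this concrete, here's a minimal sketch of how choosing a degree before fitting might look in scikit-learn. The data here is synthetic and for illustration only; the course may set this up differently.

```
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data (illustration only): y is roughly quadratic in x.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = 2 * x.ravel() ** 2 - x.ravel() + rng.normal(0, 1, size=100)

# The degree is chosen BEFORE calling .fit -- that's what makes it a hyperparameter.
degree = 2
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x, y)
```

Changing `degree` changes the model class itself, not its fitted coefficients – which is exactly why we need a principled way to choose it.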
- Or, suppose we have a dataset of restaurant tips, bills, table sizes, days of the week, and so on, and want to decide which features to use in a linear model that predicts tips.

- Remember, more complicated models (that is, models with more features) don't necessarily **generalize** well to **unseen data**!

**Goal**: Find the best hyperparameter, or best choice of features, so that our fit model **generalizes well to unseen data**.

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.
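We can verify this bookkeeping directly with `sklearn.model_selection.KFold` (a sketch, using 5 folds to match the example above):

```
import numpy as np
from sklearn.model_selection import KFold

n = 20
train_counts = np.zeros(n, dtype=int)
val_counts = np.zeros(n, dtype=int)

# Split 20 points into k = 5 disjoint folds.
kf = KFold(n_splits=5, shuffle=True, random_state=23)
for train_idx, val_idx in kf.split(np.arange(n)):
    train_counts[train_idx] += 1
    val_counts[val_idx] += 1

# Every point lands in the training set k - 1 = 4 times
# and in the validation set exactly once.
print(train_counts, val_counts)
```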

$k$-fold cross-validation (or simply "cross-validation") is **the** technique we will use for finding hyperparameters.

First, **shuffle** the dataset randomly and **split** it into $k$ disjoint groups. Then:

- For each hyperparameter:
    - For each unique group:
        - Let the unique group be the "validation set".
        - Let all other groups be the "training set".
        - Train a model using the selected hyperparameter on the training set.
        - Evaluate the model on the validation set.
    - Compute the **average** validation score (e.g. RMSE) for the particular hyperparameter.
- Choose the hyperparameter with the best average validation score.
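The steps above translate almost directly to code. Here's a hedged sketch using `cross_val_score` on synthetic data (scikit-learn maximizes scores, so we use negative RMSE and flip the sign):

```
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data (illustration only): y is roughly cubic in x.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=150).reshape(-1, 1)
y = x.ravel() ** 3 - 4 * x.ravel() + rng.normal(0, 2, size=150)

avg_rmses = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cv=5: five folds; each fold serves as the validation set exactly once.
    scores = cross_val_score(model, x, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    avg_rmses[degree] = -scores.mean()  # average validation RMSE

# Best hyperparameter = lowest average validation RMSE.
best_degree = min(avg_rmses, key=avg_rmses.get)
```

Note that `cross_val_score` handles the inner loop (splitting, training, and evaluating) for us; the outer loop over hyperparameters is what grid search automates.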

As a reminder, here's what "sample 1" looks like.

In [2]:

```
sample_1 = pd.read_csv(os.path.join('data', 'sample-1.csv'))
px.scatter(x=sample_1['x'], y=sample_1['y'], template=TEMPLATE)
```