```
from dsc80_utils import *
import lec15_util as util
```

*Earlier today.*

### Announcements 📣¶

- Project 3 is due **tonight**.
- Lab 8 is due on **Monday, March 4th**.
- Project 4 is released! Read all about it at **dsc80.com/proj04**.
  - The checkpoint is due on **Thursday, March 7th**.
  - The full project is due on **Thursday, March 21st**. You **cannot** use slip days on the final deadline.
- The Final Exam is on **Tuesday, March 19th from 3-6PM**.
  - Practice by working through old exams at practice.dsc80.com.

### RSVP to the senior capstone showcase on March 15th!¶

The senior capstone showcase is on Friday, March 15th in the **Price Center East Ballroom**. The DSC seniors will be presenting posters on their capstone projects. Come and ask them questions; if you're a DSC major, this will be you one day!

*Last year's showcase.*

The session is broken into two blocks:

- Block 1: 11AM-12:30PM.
- Block 2: 1-2:30PM.

### Look at the list of topics and RSVP at hdsishowcase.com!

### Agenda 📆¶

- Standardization.
- Multicollinearity.
- Generalization.
- Bias and variance.
- Train-test splits.

Today's lecture will be *mostly* theoretical!

### Question 🤔 (Answer at q.dsc80.com)

Remember, you can always ask questions at **q.dsc80.com**!

## Standardization¶

### Review: Transformers, models, and `Pipeline`s¶

Last class, we learned how to build a `Pipeline` in `sklearn`. A `Pipeline` consists of:

- transformers (from `sklearn.preprocessing`), which **engineer features**, and
- a model (from `sklearn.linear_model`), which is trained to make predictions.

Let's import the necessary classes and functions from `sklearn`.

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error # New!
```

Let's also re-import our trusty `tips` DataFrame.

```
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
```

|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |

### An example `Pipeline`¶

One of the transformers we used was the `StandardScaler` transformer, which **standardizes** columns.
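As a quick refresher, standardizing a column means converting each value to a z-score: subtract the column's mean and divide by its standard deviation. A minimal sketch, using hypothetical data, showing that `StandardScaler` matches the manual computation (note that it uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation.
Z = StandardScaler().fit_transform(X)

# Manual z-scores: z = (x - mean) / std, with ddof=0.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(Z, manual))  # True
```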

Let's build a `Pipeline` that:

- Takes in the `'total_bill'` and `'size'` features of `tips`.
- Standardizes those features.
- Uses the resulting standardized features to fit a linear model that predicts `'tip'`.

```
# Let's define these once, since we'll use them repeatedly.
X = tips[['total_bill', 'size']]
y = tips['tip']
```

```
model_with_std = Pipeline([
('standardize', StandardScaler()),
('lin-reg', LinearRegression())
])
model_with_std.fit(X, y)
```

Pipeline(steps=[('standardize', StandardScaler()), ('lin-reg', LinearRegression())])

How well does our model do? We can compute its $R^2$ and RMSE.

```
model_with_std.score(X, y)
```

0.46786930879612587

```
mean_squared_error(y, model_with_std.predict(X), squared=False)
```

1.007256127114662

Does this model perform any better than one that *doesn't* standardize its features? Let's find out.

```
model_without_std = LinearRegression()
model_without_std.fit(X, y)
```

LinearRegression()

```
model_without_std.score(X, y)
```

0.46786930879612587

```
mean_squared_error(y, model_without_std.predict(X), squared=False)
```

1.007256127114662

**No!**

### The purpose of standardizing features¶

If you're performing "vanilla" linear regression – that is, using the `LinearRegression`

object – then standardizing your features **will not** change your model's performance.

- There are other models where standardizing your features
*will*improve performance, because the methods assume features are standardized.

- There
*is*a benefit to standardizing features when performing vanilla linear regression, as we saw in DSC 40A: the features are brought to the same scale, so the coefficients can be compared directly.

```
# Total bill, table size.
model_without_std.coef_
```

array([0.09, 0.19])

```
# Total bill, table size.
model_with_std.named_steps['lin-reg'].coef_
```

array([0.82, 0.18])
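Why do the two sets of coefficients differ the way they do? Since standardizing divides each feature by its standard deviation, the coefficient on a standardized feature equals the raw coefficient times that feature's standard deviation. A sketch with hypothetical data (the variable names here are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(3, 50, 200), rng.integers(1, 7, 200)])
y = 0.1 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, 200)

raw = LinearRegression().fit(X, y)
std = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LinearRegression()),
]).fit(X, y)

# Standardized coefficients are the raw coefficients scaled by each
# feature's standard deviation (ddof=0, matching StandardScaler).
print(np.allclose(std.named_steps['lr'].coef_, raw.coef_ * X.std(axis=0)))  # True
```

This is why standardizing doesn't change predictions or $R^2$: the two models are exact reparameterizations of each other.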

## Multicollinearity¶

```
people_path = Path('data') / 'SOCR-HeightWeight.csv'
people = pd.read_csv(people_path).drop(columns=['Index'])
people.head()
```

|   | Height (Inches) | Weight (Pounds) |
|---|---|---|
| 0 | 65.78 | 112.99 |
| 1 | 71.52 | 136.49 |
| 2 | 69.40 | 153.03 |
| 3 | 68.22 | 142.34 |
| 4 | 67.79 | 144.30 |

```
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)',
title='Weight vs. Height for 25,000 18 Year Olds')
```