from dsc80_utils import *
import lec15_util as util
Announcements 📣¶
- Lab 7 due tomorrow.
- Final Project Checkpoint 1 due Tuesday.
- Midterm Redemption due Wednesday.
Agenda 📆¶
- Pipelines.
- Multicollinearity.
- Generalization.
- Bias and variance.
- Train-test splits.
Pipelines¶
So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single Pipeline.
Pipelines in sklearn¶
From sklearn's documentation:
Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.
Intermediate steps of the pipeline must be "transforms", that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
- General template: pl = Pipeline([trans_1, trans_2, ..., model]).
  - Note that the model is optional.
- Once a Pipeline is instantiated, you can fit all steps (transformers and model) using pl.fit(X, y).
- To make predictions using raw, untransformed data, use pl.predict(X).
- The actual list we provide Pipeline with must be a list of tuples, where:
  - The first element is a "name" (that we choose) for the step.
  - The second element is a transformer or estimator instance.
Our first Pipeline¶
Let's build a Pipeline that:
- One hot encodes the categorical features in tips.
- Fits a regression model on the one hot encoded data.
tips = px.data.tips()
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()
 | sex | smoker | day | time
---|---|---|---|---
0 | Female | No | Sun | Dinner
1 | Male | No | Sun | Dinner
2 | Male | No | Sun | Dinner
3 | Male | No | Sun | Dinner
4 | Female | No | Sun | Dinner
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
pl = Pipeline([
('one-hot', OneHotEncoder()),
('lin-reg', LinearRegression())
])
Now that pl is instantiated, we fit it the same way we would fit the individual steps.
pl.fit(tips_cat, tips['tip'])
Pipeline(steps=[('one-hot', OneHotEncoder()), ('lin-reg', LinearRegression())])
Now, to make predictions using raw data, all we need to do is use pl.predict:
pl.predict(tips_cat.iloc[:5])
array([3.1 , 3.27, 3.27, 3.27, 3.1 ])
pl performs both feature transformation and prediction with just a single call to predict!
We can access individual "steps" of a Pipeline through the named_steps attribute:
pl.named_steps
{'one-hot': OneHotEncoder(), 'lin-reg': LinearRegression()}
pl.named_steps['one-hot'].transform(tips_cat).toarray()
array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])
pl.named_steps['one-hot'].get_feature_names_out()
array(['sex_Female', 'sex_Male', 'smoker_No', 'smoker_Yes', 'day_Fri', 'day_Sat', 'day_Sun', 'day_Thur', 'time_Dinner', 'time_Lunch'], dtype=object)
pl.named_steps['lin-reg'].coef_
array([-0.09, 0.09, -0.04, 0.04, -0.2 , -0.13, 0.14, 0.19, 0.25, -0.25])
pl also has a score method, the same way a fit LinearRegression instance does:
# Why is this so low?
pl.score(tips_cat, tips['tip'])
0.02749679020147555
More sophisticated Pipelines¶
- In the previous example, we one hot encoded every input column. What if we want to perform different transformations on different columns?
- Solution: Use a ColumnTransformer.
  - Instantiate a ColumnTransformer using a list of tuples, where:
    - The first element is a "name" we choose for the transformer.
    - The second element is a transformer instance (e.g. OneHotEncoder()).
    - The third element is a list of relevant column names.
Planning our first ColumnTransformer¶
from sklearn.compose import ColumnTransformer
Let's perform different transformations on the quantitative and categorical features of tips (note that we are not transforming 'tip').
tips_features = tips.drop('tip', axis=1)
tips_features.head()
 | total_bill | sex | smoker | day | time | size
---|---|---|---|---|---|---
0 | 16.99 | Female | No | Sun | Dinner | 2
1 | 10.34 | Male | No | Sun | Dinner | 3
2 | 21.01 | Male | No | Sun | Dinner | 3
3 | 23.68 | Male | No | Sun | Dinner | 2
4 | 24.59 | Female | No | Sun | Dinner | 4
- We will leave the 'total_bill' column untouched.
- To the 'size' column, we will apply the Binarizer transformer with a threshold of 2 (big tables vs. small tables).
- To the categorical columns, we will apply the OneHotEncoder transformer.
- In essence, we will create a transformer that reproduces the following DataFrame:
 | size | x0_Female | x0_Male | x1_No | x1_Yes | x2_Fri | x2_Sat | x2_Sun | x2_Thur | x3_Dinner | x3_Lunch | total_bill
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 16.99
1 | 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 10.34
2 | 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 21.01
3 | 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 23.68
4 | 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 24.59
Building a Pipeline using a ColumnTransformer¶
Let's start by creating our ColumnTransformer.
from sklearn.preprocessing import Binarizer
preproc = ColumnTransformer(
transformers=[
('size', Binarizer(threshold=2), ['size']),
('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
],
# Specify what to do with all other columns ('total_bill' here) – drop or passthrough.
remainder='passthrough',
# Keep original dtypes for remaining columns
force_int_remainder_cols=False,
)
Now, let's create a Pipeline using preproc as a transformer, and fit it:
pl = Pipeline([
('preprocessor', preproc),
('lin-reg', LinearRegression())
])
pl.fit(tips_features, tips['tip'])
Pipeline(steps=[('preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('size', Binarizer(threshold=2), ['size']), ('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])])), ('lin-reg', LinearRegression())])
Prediction is as easy as calling predict:
tips_features.head()
 | total_bill | sex | smoker | day | time | size
---|---|---|---|---|---|---
0 | 16.99 | Female | No | Sun | Dinner | 2
1 | 10.34 | Male | No | Sun | Dinner | 3
2 | 21.01 | Male | No | Sun | Dinner | 3
3 | 23.68 | Male | No | Sun | Dinner | 2
4 | 24.59 | Female | No | Sun | Dinner | 4
# Note that we fit the Pipeline using tips_features, not tips_features.head()!
pl.predict(tips_features.head())
array([2.74, 2.32, 3.37, 3.37, 3.75])
Aside: FunctionTransformer¶
A transformer you'll often use as part of a ColumnTransformer is the FunctionTransformer, which enables you to use your own functions on entire columns. Think of it as the sklearn equivalent of apply.
from sklearn.preprocessing import FunctionTransformer
f = FunctionTransformer(np.sqrt)
f.transform([1, 2, 3])
array([1. , 1.41, 1.73])
💡 Pro-Tip: Using make_pipeline and make_column_transformer¶
Instead of using the Pipeline and ColumnTransformer classes directly, scikit-learn provides nifty shortcut functions called make_pipeline and make_column_transformer:
# Old code
preproc = ColumnTransformer(
transformers=[
('size', Binarizer(threshold=2), ['size']),
('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
],
remainder='passthrough'
)
pl = Pipeline([
('preprocessor', preproc),
('lin-reg', LinearRegression())
])
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
preproc = make_column_transformer(
(Binarizer(threshold=2), ['size']),
(OneHotEncoder(), ['sex', 'smoker', 'day', 'time']),
remainder='passthrough',
)
pl = make_pipeline(preproc, LinearRegression())
# Notice that the steps in the pipeline and column transformer are
# automatically named
pl
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder='passthrough', transformers=[('binarizer', Binarizer(threshold=2), ['size']), ('onehotencoder', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])])), ('linearregression', LinearRegression())])
An example Pipeline¶
One of the transformers we used was the StandardScaler transformer, which standardizes columns:
$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
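As a quick sanity check (on made-up numbers), StandardScaler should match this formula computed by hand. Note that it uses the population SD, which is also np.std's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers, just to check the formula against sklearn.
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# z(x_i) = (x_i - mean of x) / (SD of x); np.std defaults to the
# population SD, which is what StandardScaler uses.
by_hand = (x - x.mean()) / x.std()
by_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(by_hand, by_sklearn))  # → True
```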
Let's build a Pipeline that:
- Takes in the 'total_bill' and 'size' features of tips.
- Standardizes those features.
- Uses the resulting standardized features to fit a linear model that predicts 'tip'.
# Let's define these once, since we'll use them repeatedly.
X = tips[['total_bill', 'size']]
y = tips['tip']
from sklearn.preprocessing import StandardScaler
model_with_std = make_pipeline(
StandardScaler(),
LinearRegression(),
)
model_with_std.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()), ('linearregression', LinearRegression())])
How well does our model do? We can compute its $R^2$ and RMSE.
model_with_std.score(X, y)
0.46786930879612587
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y, model_with_std.predict(X))
np.float64(1.007256127114662)
Does this model perform any better than one that doesn't standardize its features? Let's find out.
model_without_std = LinearRegression()
model_without_std.fit(X, y)
LinearRegression()
model_without_std.score(X, y)
0.46786930879612587
root_mean_squared_error(y, model_without_std.predict(X))
np.float64(1.007256127114662)
No!
The purpose of standardizing features¶
If you're performing "vanilla" linear regression – that is, using the LinearRegression object – then standardizing your features will not change your model's error.
- There are other models where standardizing your features will improve performance, because the methods assume features are standardized.
- There is a benefit to standardizing features when performing vanilla linear regression, as we saw in DSC 40A: the features are brought to the same scale, so the coefficients can be compared directly.
# Total bill, table size.
model_without_std.coef_
array([0.09, 0.19])
# Total bill, table size.
model_with_std.named_steps['linearregression'].coef_
array([0.82, 0.18])
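The claim that standardizing doesn't change vanilla linear regression's predictions can be checked numerically. A sketch on synthetic data (not tips), with two features on wildly different scales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on very different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0, 1000, size=100),   # "big" feature
                     rng.uniform(0, 1, size=100)])     # "small" feature
y = 0.002 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, size=100)

plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Same predictions (so same R^2 and RMSE)...
same = np.allclose(plain.predict(X), scaled.predict(X))
# ...but only the standardized coefficients are on a common scale.
print(same, plain.coef_, scaled.named_steps['linearregression'].coef_)
```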
Aside: Pipelines of just transformers¶
If you want to apply multiple transformations to the same column in a dataset, you can create a Pipeline just for that column.
For example, suppose we want to:
- One hot encode the 'sex', 'smoker', and 'time' columns.
- One hot encode the 'day' column, but as either 'Weekday', 'Sat', or 'Sun'.
- Binarize the 'size' column.
Here's how we might do that:
def is_weekend(s):
# The input to is_weekend is a Series!
return s.replace({'Thur': 'Weekday', 'Fri': 'Weekday'})
pl_day = make_pipeline(
FunctionTransformer(is_weekend),
OneHotEncoder(),
)
col_trans = make_column_transformer(
(pl_day, ['day']),
(OneHotEncoder(drop='first'), ['sex', 'smoker', 'time']),
(Binarizer(threshold=2), ['size']),
remainder='passthrough',
force_int_remainder_cols=False,
)
pl = make_pipeline(
col_trans,
LinearRegression(),
)
pl.fit(tips.drop('tip', axis=1), tips['tip'])
Pipeline(steps=[('columntransformer', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function is_weekend at 0x3263f4d60>)), ('onehotencoder', OneHotEncoder())]), ['day']), ('onehotencoder', OneHotEncoder(drop='first'), ['sex', 'smoker', 'time']), ('binarizer', Binarizer(threshold=2), ['size'])])), ('linearregression', LinearRegression())])
Question 🤔 (Answer at dsc80.com/q)
Code: weights
How many weights does this linear model have?
Multicollinearity¶
people_path = Path('data') / 'SOCR-HeightWeight.csv'
people = pd.read_csv(people_path).drop(columns=['Index'])
people.head()
 | Height (Inches) | Weight (Pounds)
---|---|---
0 | 65.78 | 112.99
1 | 71.52 | 136.49
2 | 69.40 | 153.03
3 | 68.22 | 142.34
4 | 67.79 | 144.30
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)',
title='Weight vs. Height for 25,000 18 Year Olds')
Motivating example¶
Suppose we fit a simple linear regression model that uses height in inches to predict weight in pounds.
$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)}$$
X = people[['Height (Inches)']]
y = people['Weight (Pounds)']
lr_one_feat = LinearRegression()
lr_one_feat.fit(X, y)
LinearRegression()
$w_0^*$ and $w_1^*$ are shown below, along with the model's training set RMSE.
lr_one_feat.intercept_, lr_one_feat.coef_
(np.float64(-82.57574306454093), array([3.08]))
root_mean_squared_error(y, lr_one_feat.predict(X))
np.float64(10.079113675632819)
Now, suppose we fit another regression model that uses height in inches AND height in centimeters to predict weight.
$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$
people['Height (cm)'] = people['Height (Inches)'] * 2.54 # 1 inch = 2.54 cm
X2 = people[['Height (Inches)', 'Height (cm)']]
lr_two_feat = LinearRegression()
lr_two_feat.fit(X2, y)
LinearRegression()
What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's training RMSE?
lr_two_feat.intercept_, lr_two_feat.coef_
(np.float64(-82.57525639659859), array([ 3.38e+10, -1.33e+10]))
root_mean_squared_error(y, lr_two_feat.predict(X2))
np.float64(10.079113677131511)
Observation: The intercept is the same as before (roughly -82.57), as is the RMSE. However, the coefficients on 'Height (Inches)' and 'Height (cm)' are massive in size!
What's going on?
Redundant features¶
Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.
$$\text{predicted weight (pounds)} = -80 + 3 \cdot \text{height (inches)}$$
In the second model, we have:
$$\begin{align*}\text{predicted weight (pounds)} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$
In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.
So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.
Infinitely many parameter choices¶
Issue: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!
$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$
$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$
- Both prediction rules look very different, but actually make the same predictions.
- lr.coef_ could return either set of coefficients, or any other of the infinitely many options.
- But neither set of coefficients has any meaning!
(-80 - 10 * people.iloc[:, 0] + (13 / 2.54) * people.iloc[:, 2]).head()
0    117.35
1    134.55
2    128.20
3    124.65
4    123.36
dtype: float64
(-80 + 10 * people.iloc[:, 0] - (7 / 2.54) * people.iloc[:, 2]).head()
0    117.35
1    134.55
2    128.20
3    124.65
4    123.36
dtype: float64
Multicollinearity¶
- Multicollinearity occurs when features in a regression model are highly correlated with one another.
- In other words, multicollinearity occurs when a feature can be predicted using a linear combination of other features, fairly accurately.
- When multicollinearity is present in the features, the coefficients in the model are uninterpretable – they have no meaning.
- A "slope" represents "the rate of change of $y$ with respect to a feature", when all other features are held constant – but if there's multicollinearity, you can't hold other features constant.
- Note: Multicollinearity doesn't impact a model's predictions!
- It doesn't impact a model's ability to generalize to unseen data.
- If features are multicollinear in the training data, they will probably be multicollinear in the test data too.
- Solutions:
- Manually remove highly correlated features.
- Use a dimensionality reduction technique (such as PCA) to automatically reduce dimensions.
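For the first solution, a rough (pairwise) diagnostic is the feature correlation matrix: entries near ±1 flag candidate redundant features. A sketch on a few hypothetical height measurements:

```python
import pandas as pd

# Hypothetical rows; 'Height (cm)' is an exact multiple of 'Height (Inches)',
# so the two features are perfectly correlated.
X = pd.DataFrame({'Height (Inches)': [65.78, 71.52, 69.40, 68.22, 67.79]})
X['Height (cm)'] = X['Height (Inches)'] * 2.54

corr = X.corr()
print(corr.loc['Height (Inches)', 'Height (cm)'])  # essentially 1.0
```

Note this only catches pairwise correlation; a feature that is a linear combination of several others needs a different check (e.g. the condition number of the design matrix, or PCA).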
Example: One hot encoding¶
A one hot encoding will result in multicollinearity unless you drop one of the one hot encoded features.
Suppose we have the following fitted model:
$$ \begin{aligned} H(x) = 1 + 2 \cdot (\text{smoker==Yes}) - 2 \cdot (\text{smoker==No}) \end{aligned} $$
This is equivalent to:
$$ \begin{aligned} H(x) = 10 - 7 \cdot (\text{smoker==Yes}) - 11 \cdot (\text{smoker==No}) \end{aligned} $$
Solution: Drop one of the one hot encoded columns. sklearn.preprocessing.OneHotEncoder has an option to do this.
Key takeaways¶
- Multicollinearity is present in a linear model when one feature can be accurately predicted using one or more other features.
- In other words, it is present when a feature is redundant.
- Multicollinearity doesn't pose an issue for prediction; it doesn't hinder a model's ability to generalize. Instead, it renders the coefficients of a linear model meaningless.
Question 🤔 (Answer at dsc80.com/q)
Code: wi23q9
(Wi23 Final Q9)
One piece of information that may be useful as a feature is the proportion of SAT test takers in a state in a given year that qualify for free lunches in school. The Series lunch_props contains 8 values, each of which is either "low", "medium", or "high". Since we can’t use strings as features in a model, we decide to encode these strings using the following Pipeline:
# Note: The FunctionTransformer is only needed to change the result
# of the OneHotEncoder from a "sparse" matrix to a regular matrix
# so that it can be used with StandardScaler;
# it doesn't change anything mathematically.
pl = Pipeline([
("ohe", OneHotEncoder(drop="first")),
("ft", FunctionTransformer(lambda X: X.toarray())),
("ss", StandardScaler())
])
After calling pl.fit(lunch_props), pl.transform(lunch_props) evaluates to the following array:
array([[ 1.29099445, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, 2.64575131],
[ 1.29099445, -0.37796447],
[ 1.29099445, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, -0.37796447]])
and pl.named_steps["ohe"].get_feature_names() evaluates to the following array:
array(["x0_low", "x0_med"], dtype=object)
Fill in the blanks: Given the above information, we can conclude that lunch_props has ____________ value(s) equal to "low", ____________ value(s) equal to "medium", and _____________ value(s) equal to "high". (Note: You should write one positive integer in each box such that the numbers add up to 8.)
Generalization¶
Motivation¶
- You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a practice exam.
- Your logic: If you do well on the practice exam, you should do well on the real exam.
- You each take the practice exam once and look at the solutions afterwards.
- Your strategy: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."
- Billy's strategy: Learn high-level concepts from the solutions, e.g. "data are NMAR if the likelihood of missingness depends on the missing values themselves."
- Who will do better on the practice exam? Who will probably do better on the real exam? 🧐
Evaluating the quality of a model¶
- So far, we've computed the RMSE (and $R^2$) of our fit regression models on the data that we used to fit them, i.e. the training data.
- We've said that Model A is better than Model B if Model A's RMSE is lower than Model B's RMSE.
- Remember, our training data is a sample from some population.
- Just because a model fits the training data well doesn't mean it will generalize and work well on similar, unseen samples from the same population!
Example: Overfitting and underfitting¶
Let's collect two samples $\{(x_i, y_i)\}$ from the same population.
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
x = np.linspace(-2, 3, n)
y = x ** 3 + (np.random.normal(0, 3, size=n))
return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()
For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$ (remember, in reality, you won't get to see the population).
px.scatter(sample_1, x='x', y='y', title='Sample 1')
Polynomial regression¶
Let's fit three polynomial models on Sample 1:
- Degree 1.
- Degree 3.
- Degree 25.
The PolynomialFeatures transformer will be helpful here.
from sklearn.preprocessing import PolynomialFeatures
# fit_transform fits and transforms the same input.
d3 = PolynomialFeatures(3)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8.],
       [ 1.,  3.,  9., 27.],
       [ 1.,  4., 16., 64.],
       [ 1., -2.,  4., -8.]])
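Fitting one of these polynomial models by hand looks something like the following sketch (synthetic cubic data mirroring sample_from_pop; the actual fitting and plotting live in util.train_and_plot, whose internals aren't shown here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data, mirroring sample_from_pop.
rng = np.random.default_rng(23)
x = np.linspace(-2, 3, 100)
y = x ** 3 + rng.normal(0, 3, size=100)

# Degree-3 polynomial regression as a two-step Pipeline.
cubic = make_pipeline(PolynomialFeatures(3), LinearRegression())
cubic.fit(x.reshape(-1, 1), y)

print(cubic.score(x.reshape(-1, 1), y))  # R^2 on the training sample
```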
Below, we look at our three models' predictions on Sample 1 (which they were trained on).
# Look at the definition of train_and_plot in lec15_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25], data_name='Sample 1')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')
The degree 25 polynomial has the lowest RMSE on Sample 1.
How do the same fit polynomials look on Sample 2?
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')