In [1]:
from dsc80_utils import *
import lec15_util as util

Lecture 15 – Pipelines, Multicollinearity, and Generalization¶

DSC 80, Spring 2025¶

Agenda 📆¶

  • Pipelines.
  • Multicollinearity.
  • Generalization.
    • Bias and variance.
    • Train-test splits.

Pipelines¶


So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single Pipeline.

Pipelines in sklearn¶

  • Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.
  • General template: pl = Pipeline([('trans-1', trans_1), ('trans-2', trans_2), ..., ('model', model)]).
    • The final model is optional.
  • Note that the list we provide Pipeline with must be a list of tuples, where
    • The first element is a "name" (that we choose) for the step.
    • The second element is a transformer or estimator instance.
  • Once a Pipeline is instantiated, you can fit all steps (transformers and model) using pl.fit(X, y).
  • To make predictions using raw, untransformed data, use pl.predict(X).

Our first Pipeline¶

Let's build a Pipeline that:

  • One hot encodes the categorical features in tips.
  • Fits a regression model on the one hot encoded data.
In [2]:
tips = px.data.tips()
In [3]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()
Out[3]:
sex smoker day time
0 Female No Sun Dinner
1 Male No Sun Dinner
2 Male No Sun Dinner
3 Male No Sun Dinner
4 Female No Sun Dinner
In [4]:
from sklearn.pipeline import Pipeline
In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression())
])

Now that pl is instantiated, we fit it the same way we would fit the individual steps.

In [6]:
pl.fit(tips_cat, tips['tip'])
Out[6]:
Pipeline(steps=[('one-hot', OneHotEncoder()), ('lin-reg', LinearRegression())])

Now, to make predictions using raw data, all we need to do is use pl.predict:

In [7]:
pl.predict(tips_cat.iloc[:5])
Out[7]:
array([3.1 , 3.27, 3.27, 3.27, 3.1 ])

pl performs both feature transformation and prediction with just a single call to predict!

We can access individual "steps" of a Pipeline through the named_steps attribute:

In [8]:
pl.named_steps
Out[8]:
{'one-hot': OneHotEncoder(), 'lin-reg': LinearRegression()}
In [9]:
pl.named_steps['one-hot'].transform(tips_cat).toarray()
Out[9]:
array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])
In [10]:
pl.named_steps['one-hot'].get_feature_names_out()
Out[10]:
array(['sex_Female', 'sex_Male', 'smoker_No', 'smoker_Yes', 'day_Fri',
       'day_Sat', 'day_Sun', 'day_Thur', 'time_Dinner', 'time_Lunch'],
      dtype=object)
In [11]:
pl.named_steps['lin-reg'].coef_
Out[11]:
array([-0.09,  0.09, -0.04,  0.04, -0.2 , -0.13,  0.14,  0.19,  0.25,
       -0.25])

pl also has a score method, the same way a fit LinearRegression instance does:

In [12]:
# Why is this so low?
pl.score(tips_cat, tips['tip'])
Out[12]:
0.02749679020147555
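
One plausible explanation, as a quick sketch: this Pipeline only sees the four categorical features, none of which says much about 'tip' on its own. Compare with a model built on 'total_bill' alone (lin is just a name chosen for this sketch):

# 'total_bill' by itself should yield a much higher R^2 than the
# categorical features above.
lin = LinearRegression()
lin.fit(tips[['total_bill']], tips['tip'])
lin.score(tips[['total_bill']], tips['tip'])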

More sophisticated Pipelines¶

  • In the previous example, we one hot encoded every input column. What if we want to perform different transformations on different columns?
  • Solution: Use a ColumnTransformer.
    • Instantiate a ColumnTransformer using a list of tuples, where:
      • The first element is a "name" we choose for the transformer.
      • The second element is a transformer instance (e.g. OneHotEncoder()).
      • The third element is a list of relevant column names.

Planning our first ColumnTransformer¶

In [13]:
from sklearn.compose import ColumnTransformer

Let's perform different transformations on the quantitative and categorical features of tips (note that we are not transforming 'tip').

In [14]:
tips_features = tips.drop('tip', axis=1)
tips_features.head()
Out[14]:
total_bill sex smoker day time size
0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
  • We will leave the 'total_bill' column untouched.
  • To the 'size' column, we will apply the Binarizer transformer with a threshold of 2 (big tables vs. small tables); see the quick demo after the table below.
  • To the categorical columns, we will apply the OneHotEncoder transformer.
  • In essence, we will create a transformer that reproduces the following DataFrame:
size x0_Female x0_Male x1_No x1_Yes x2_Fri x2_Sat x2_Sun x2_Thur x3_Dinner x3_Lunch total_bill
0 0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 16.99
1 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 10.34
2 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 21.01
3 0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 23.68
4 1 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 24.59
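
As promised, a quick demo of the Binarizer step on its own. Per sklearn's documentation, values strictly greater than the threshold map to 1, and everything else maps to 0:

from sklearn.preprocessing import Binarizer

# Tables with more than 2 people are "big" (1); tables of 1 or 2 are "small" (0).
Binarizer(threshold=2).fit_transform(tips[['size']])[:5]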

Building a Pipeline using a ColumnTransformer¶

Let's start by creating our ColumnTransformer.

In [15]:
from sklearn.preprocessing import Binarizer

preproc = ColumnTransformer(
    transformers=[
        ('size', Binarizer(threshold=2), ['size']),
        ('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ],
    # Specify what to do with all other columns ('total_bill' here) – drop or passthrough.
    remainder='passthrough',
    # Refer to the leftover columns by name rather than by integer index.
    force_int_remainder_cols=False,
)

Now, let's create a Pipeline using preproc as a transformer, and fit it:

In [16]:
pl = Pipeline([
    ('preprocessor', preproc), 
    ('lin-reg', LinearRegression())
])
In [17]:
pl.fit(tips_features, tips['tip'])
Out[17]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('size',
                                                  Binarizer(threshold=2),
                                                  ['size']),
                                                 ('categorical_cols',
                                                  OneHotEncoder(),
                                                  ['sex', 'smoker', 'day',
                                                   'time'])])),
                ('lin-reg', LinearRegression())])

Prediction is as easy as calling predict:

In [18]:
tips_features.head()
Out[18]:
total_bill sex smoker day time size
0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
In [19]:
# Note that we fit the Pipeline using tips_features, not tips_features.head()!
pl.predict(tips_features.head())
Out[19]:
array([2.74, 2.32, 3.37, 3.37, 3.75])

Aside: FunctionTransformer¶

A transformer you'll often use as part of a ColumnTransformer is the FunctionTransformer, which enables you to use your own functions on entire columns. Think of it as the sklearn equivalent of apply.

In [20]:
from sklearn.preprocessing import FunctionTransformer
In [21]:
f = FunctionTransformer(np.sqrt)
f.transform([1, 2, 3])
Out[21]:
array([1.  , 1.41, 1.73])
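
FunctionTransformer works on DataFrames, too, which is how you'll typically use it inside a ColumnTransformer. A minimal sketch (log_bill is just a name chosen for illustration):

# Log-transform 'total_bill'; np.log is applied elementwise to the column.
log_bill = FunctionTransformer(np.log)
log_bill.fit_transform(tips[['total_bill']]).head()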

💡 Pro-Tip: Using make_pipeline and make_column_transformer¶

Instead of using the Pipeline and ColumnTransformer classes directly, scikit-learn provides nifty shortcut functions called make_pipeline and make_column_transformer:

In [25]:
# Old code

preproc = ColumnTransformer(
    transformers=[
        ('size', Binarizer(threshold=2), ['size']),
        ('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ],
    remainder='passthrough' 
)

pl = Pipeline([
    ('preprocessor', preproc), 
    ('lin-reg', LinearRegression())
])
pl
Out[25]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('size',
                                                  Binarizer(threshold=2),
                                                  ['size']),
                                                 ('categorical_cols',
                                                  OneHotEncoder(),
                                                  ['sex', 'smoker', 'day',
                                                   'time'])])),
                ('lin-reg', LinearRegression())])
In [26]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

preproc = make_column_transformer(
    (Binarizer(threshold=2), ['size']),
    (OneHotEncoder(), ['sex', 'smoker', 'day', 'time']),
    remainder='passthrough',
)

pl = make_pipeline(preproc, LinearRegression())
# Notice that the steps in the pipeline and column transformer are
# automatically named
pl
Out[26]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binarizer',
                                                  Binarizer(threshold=2),
                                                  ['size']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['sex', 'smoker', 'day',
                                                   'time'])])),
                ('linearregression', LinearRegression())])

An example Pipeline¶

One of the transformers we've used before is the StandardScaler transformer, which standardizes columns.

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
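
As a sanity check, here's a minimal sketch showing that StandardScaler computes exactly these z-scores. One subtlety: StandardScaler uses the population SD (ddof=0), which is also np.std's default.

from sklearn.preprocessing import StandardScaler

# Standardize 'total_bill' by hand, then with StandardScaler.
manual = (tips['total_bill'] - tips['total_bill'].mean()) / np.std(tips['total_bill'])
scaled = StandardScaler().fit_transform(tips[['total_bill']])[:, 0]
np.allclose(manual, scaled)  # Should be True.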

Let's build a Pipeline that:

  • Takes in the 'total_bill' and 'size' features of tips.
  • Standardizes those features.
  • Uses the resulting standardized features to fit a linear model that predicts 'tip'.
In [29]:
# Let's define these once, since we'll use them repeatedly.
X = tips[['total_bill', 'size']]
y = tips['tip']
In [34]:
from sklearn.preprocessing import StandardScaler

model_with_std = make_pipeline(
    StandardScaler(),
    LinearRegression(),
)

model_with_std.fit(X, y)
Out[34]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

How well does our model do? We can compute its $R^2$ and RMSE.

In [35]:
model_with_std.score(X, y)
Out[35]:
0.46786930879612587
In [36]:
from sklearn.metrics import root_mean_squared_error

root_mean_squared_error(y, model_with_std.predict(X))
Out[36]:
np.float64(1.007256127114662)

Does this model perform any better than one that doesn't standardize its features? Let's find out.

In [37]:
model_without_std = LinearRegression()
model_without_std.fit(X, y)
Out[37]:
LinearRegression()
In [38]:
model_without_std.score(X, y)
Out[38]:
0.46786930879612587
In [39]:
root_mean_squared_error(y, model_without_std.predict(X))
Out[39]:
np.float64(1.007256127114662)

No!

The purpose of standardizing features¶

If you're performing "vanilla" linear regression – that is, using the LinearRegression object – then standardizing your features will not change your model's error.

  • There are other models where standardizing your features will improve performance, because those methods assume the features are on comparable scales.
    • Regularized linear regression (see DSC 140A).
    • PCA (assumes centered data, not necessarily standardized; see DSC 140B).
    • Clustering algorithms, e.g. $k$-means clustering (seen in DSC 40A!).
  • There is a benefit to standardizing features when performing vanilla linear regression, as we saw in DSC 40A: the features are brought to the same scale, so the coefficients can be compared directly.
In [40]:
# Total bill, table size.
model_without_std.coef_
Out[40]:
array([0.09, 0.19])
In [41]:
# Total bill, table size.
model_with_std.named_steps['linearregression'].coef_
Out[41]:
array([0.82, 0.18])
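
A quick check of the claim above, as a sketch: standardizing is an invertible linear change of the features, so the two fit models should make identical predictions.

# Should be True: same predictions, despite different coefficients.
np.allclose(model_with_std.predict(X), model_without_std.predict(X))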

Aside: Pipelines of just transformers¶

If you want to apply multiple transformations to the same column in a dataset, you can create a Pipeline just for that column.

For example, suppose we want to:

  • One hot encode the 'sex', 'smoker', and 'time' columns.
  • One hot encode the 'day' column, but as either 'Weekday', 'Sat', or 'Sun'.
  • Binarize the 'size' column.

Here's how we might do that:

In [50]:
def is_weekend(s):
    # The input here is a one-column DataFrame, since we'll select ['day'] in the
    # ColumnTransformer below; .replace works on DataFrames and Series alike.
    return s.replace({'Thur': 'Weekday', 'Fri': 'Weekday'})
In [51]:
pl_day = make_pipeline(
    FunctionTransformer(is_weekend),
    OneHotEncoder(),
)
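
As a sketch of what this transformer-only Pipeline produces on its own: fit_transform first maps 'Thur' and 'Fri' to 'Weekday', then one hot encodes the result, so we should get three columns ('Sat', 'Sun', and 'Weekday', in sorted order).

pl_day.fit_transform(tips[['day']]).toarray()[:5]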
In [52]:
col_trans = make_column_transformer(
    (pl_day, ['day']),
    (OneHotEncoder(drop='first'), ['sex', 'smoker', 'time']),
    (Binarizer(threshold=2), ['size']),
    remainder='passthrough',
    force_int_remainder_cols=False,
)
In [53]:
pl = make_pipeline(
    col_trans,
    LinearRegression(),
)

pl.fit(tips.drop('tip', axis=1), tips['tip'])
Out[53]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('pipeline',
                                                  Pipeline(steps=[('functiontransformer',
                                                                   FunctionTransformer(func=<function is_weekend at 0x283024180>)),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['day']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['sex', 'smoker', 'time']),
                                                 ('binarizer',
                                                  Binarizer(threshold=2),
                                                  ['size'])])),
                ('linearregression', LinearRegression())])

Question 🤔 (Answer at dsc80.com/q)

Code: weights

How many weights does this linear model have?

In [54]:
pl.named_steps
Out[54]:
{'columntransformer': ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                   transformers=[('pipeline',
                                  Pipeline(steps=[('functiontransformer',
                                                   FunctionTransformer(func=<function is_weekend at 0x283024180>)),
                                                  ('onehotencoder',
                                                   OneHotEncoder())]),
                                  ['day']),
                                 ('onehotencoder', OneHotEncoder(drop='first'),
                                  ['sex', 'smoker', 'time']),
                                 ('binarizer', Binarizer(threshold=2),
                                  ['size'])]),
 'linearregression': LinearRegression()}
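
One way to check your answer after fitting, as a sketch: a linear model has one weight per transformed feature, plus one for the intercept (lin_reg is a name chosen here).

# Count the coefficients of the fit model, plus 1 for the intercept.
lin_reg = pl.named_steps['linearregression']
len(lin_reg.coef_) + 1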

Multicollinearity¶

Heights and weights¶

We have a dataset containing the weights and heights of 25,000 18-year-olds, taken from here.

In [55]:
people_path = Path('data') / 'SOCR-HeightWeight.csv'
people = pd.read_csv(people_path).drop(columns=['Index'])
people.head()
Out[55]:
Height (Inches) Weight (Pounds)
0 65.78 112.99
1 71.52 136.49
2 69.40 153.03
3 68.22 142.34
4 67.79 144.30
In [56]:
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)', 
            title='Weight vs. Height for 25,000 18 Year Olds')