from dsc80_utils import *
import lec16_util as util
Announcements 📣¶
- Final Project Checkpoint 1 is due today.
- Lab 8 is due on Friday.
Agenda 📆¶
- Hyperparameters.
- Cross-validation.
- Decision trees.
Review: Pipelines¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, Binarizer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
tips = px.data.tips()
def is_weekend(s):
    # The input to is_weekend is a Series!
    return s.replace({'Thur': 'Weekday', 'Fri': 'Weekday'})

pl_day = make_pipeline(
    FunctionTransformer(is_weekend),
    OneHotEncoder(),
)

col_trans = make_column_transformer(
    (pl_day, ['day']),
    (OneHotEncoder(drop='first'), ['sex', 'smoker', 'time']),
    (Binarizer(threshold=2), ['size']),
    remainder='passthrough',
    force_int_remainder_cols=False,
)

pl = make_pipeline(
    col_trans,
    LinearRegression(),
)
pl.fit(tips.drop('tip', axis=1), tips['tip'])
Pipeline(steps=[('columntransformer', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function is_weekend at 0x16a399c60>)), ('onehotencoder', OneHotEncoder())]), ['day']), ('onehotencoder', OneHotEncoder(drop='first'), ['sex', 'smoker', 'time']), ('binarizer', Binarizer(threshold=2), ['size'])])), ('linearregression', LinearRegression())])
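Once the pipeline is fit, we can make predictions directly from raw, un-transformed rows; the pipeline applies every feature transformation before the regression step. A quick sketch (the specific predicted values depend on the fitted coefficients):
# Predict the tip for the first five rows of the raw data.
pl.predict(tips.drop('tip', axis=1).head())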
Question 🤔 (Answer at dsc80.com/q)
Code: weights
How many weights does this linear model have?
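After working out the answer by hand, one way to sanity-check your count in code (a sketch; here we count the intercept as a weight, too):
# Number of fitted coefficients, plus one for the intercept.
len(pl.named_steps['linearregression'].coef_) + 1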
Question 🤔 (Answer at dsc80.com/q)
Code: wi23q9
(Wi23 Final Q9)
One piece of information that may be useful as a feature is the proportion of SAT test takers in a state in a given year that qualify for free lunches in school. The Series lunch_props contains 8 values, each of which is either "low", "medium", or "high". Since we can't use strings as features in a model, we decide to encode these strings using the following Pipeline:
# Note: The FunctionTransformer is only needed to change the result
# of the OneHotEncoder from a "sparse" matrix to a regular matrix
# so that it can be used with StandardScaler;
# it doesn't change anything mathematically.
pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler())
])
After calling pl.fit(lunch_props), pl.transform(lunch_props) evaluates to the following array:
array([[ 1.29099445, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, 2.64575131],
[ 1.29099445, -0.37796447],
[ 1.29099445, -0.37796447],
[-0.77459667, -0.37796447],
[-0.77459667, -0.37796447]])
and pl.named_steps["ohe"].get_feature_names() evaluates to the following array:
array(["x0_low", "x0_med"], dtype=object)
Fill in the blanks: Given the above information, we can conclude that lunch_props has ____________ value(s) equal to "low", ____________ value(s) equal to "medium", and _____________ value(s) equal to "high". (Note: You should write one positive integer in each box such that the numbers add up to 8.)
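If you want to check a guess empirically, here is a hedged sketch: lunch_props itself isn't defined in this notebook, so we build a candidate Series with hypothetical counts (replace them with your guess) and push it through the same Pipeline. OneHotEncoder expects a 2D input, so we pass a one-column DataFrame; the distinct standardized values and how often each appears depend only on the category counts, so you can compare them to the array above.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical counts -- replace with your guess (they must sum to 8).
guess = pd.Series(['low'] * 2 + ['medium'] * 3 + ['high'] * 3)

pl_check = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler())
])

# OneHotEncoder needs a 2D input, so pass a one-column DataFrame.
pl_check.fit_transform(guess.to_frame())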
Question 🤔 (Answer at dsc80.com/q)
Code: fa23q94
(Fa23 Final 9.4)
Determine how each change below affects model bias and variance compared to this model:
For each change, choose all of the following that apply: increase bias, decrease bias, increase variance, decrease variance.
- Add degree 3 polynomial features.
- Add a feature of numbers chosen at random between 0 and 1.
- Collect 100 more points for the training set.
- Don’t use the 'veg' feature.
Review: Hyperparameters¶
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + np.random.normal(0, 3, size=n)
    return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()
px.scatter(sample_1, x='x', y='y', title='Sample 1')
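As a preview of why hyperparameters matter, here is a sketch (the helper fit_poly is ours, not part of sklearn) that fits polynomial regression models to sample_1, where the polynomial degree is a hyperparameter we choose before fitting:
from sklearn.preprocessing import PolynomialFeatures

def fit_poly(sample, degree):
    # The degree is a hyperparameter: we pick it before calling fit.
    pl_poly = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    pl_poly.fit(sample[['x']], sample['y'])
    return pl_poly

# Fit models of increasing complexity on the same sample.
models = {d: fit_poly(sample_1, d) for d in [1, 3, 25]}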