from dsc80_utils import *
# The dataset is built into plotly (and seaborn)!
# We shuffle here so that the head of the DataFrame contains rows where smoker is Yes and smoker is No,
# purely for illustration purposes (it doesn't change any of the math).
np.random.seed(1)
tips = px.data.tips().sample(frac=1).reset_index(drop=True)
Announcements 📣¶
- Project 3 is due today.
- The Final Project is coming out soon!
- It will be worth two projects (because it used to be two separate projects).
- It will have two short checkpoints.
- You can request an extension on the checkpoints.
- You cannot request an extension on the final submission deadline, Friday, Dec 6 (the last day of classes).
About the Midterm¶
- Midterm grades will be out soon (hopefully today) on Gradescope.
- Regrades will be open for two days after scores are released.
- Given the CBTF troubles, I'm giving you a chance to redeem your midterm score.
- A Midterm Redemption assignment will be released on GitHub + Gradescope.
- As with labs and projects, you can redo your midterm and redeem up to 50% of the points you lost.
- Hidden tests, just like labs and projects.
- Treat it like a take-home exam: work alone, and course staff won't give help.
- You can use outside resources, including lecture materials and LLMs.
- The redemption assignment probably won't match your exam version exactly, sorry!
About the Final¶
- The final is pencil-and-paper.
- Saturday, Dec 7, 11:30am-2:30pm. Room TBD.
- Two notes sheets allowed.
We'll also send an Ed post with more complete details. Thanks for being patient with us!
Agenda 📆¶
- Review: Predicting tips.
- $R^2$.
- Feature engineering.
- Example: Predicting tips.
- One hot encoding.
- Example: Predicting ratings ⭐️.
- Dropping features.
- Ordinal encoding.
- Example: Horsepower 🚗.
- Quantitative scaling.
- Example: Predicting tips.
- Feature engineering in sklearn.
- Transformer classes.
- Creating Pipelines.
Review: Predicting tips 🧑🍳¶
tips
| total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 |
242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 |
244 rows × 7 columns
Linear models¶
Last time, we fit three linear models to predict restaurant tips:
- Constant model: $\text{predicted tip} = h$.
- Simple linear regression: $\text{predicted tip} = w_0 + w_1 \cdot \text{total bill}$.
- Multiple linear regression: $\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size}$.
In the constant model case, we know that the optimal model parameter, when using squared loss, is $h^* = \text{mean tip}$.
mean_tip = tips['tip'].mean()
In the other two cases, we used the LinearRegression class from sklearn to help us find optimal model parameters.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=tips[['total_bill']], y=tips['tip'])
model_two = LinearRegression()
model_two.fit(X=tips[['total_bill', 'size']], y=tips['tip'])
LinearRegression()
Root mean squared error¶
To compare the performance of different models, we used the root mean squared error (RMSE).
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$
def rmse(actual, pred):
return np.sqrt(np.mean((actual - pred) ** 2))
rmse_dict = {}
rmse_dict['constant tip amount'] = rmse(tips['tip'], mean_tip)
all_preds = model.predict(tips[['total_bill']])
rmse_dict['one feature: total bill'] = rmse(tips['tip'], all_preds)
rmse_dict['two features'] = rmse(
tips['tip'], model_two.predict(tips[['total_bill', 'size']])
)
pd.DataFrame({'rmse': rmse_dict.values()}, index=rmse_dict.keys())
| rmse |
---|---|
constant tip amount | 1.38 |
one feature: total bill | 1.02 |
two features | 1.01 |
The .score method of a LinearRegression object¶
Model objects in sklearn that have already been fit have a score method.
model_two.score(tips[['total_bill', 'size']], tips['tip'])
0.46786930879612565
That doesn't look like the RMSE... what is it? 🤔
Aside: $R^2$¶
- $R^2$, or the coefficient of determination, is a measure of the quality of a linear fit.
- There are a few equivalent ways of computing it, assuming your model is linear and has an intercept term:
$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$
$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$
- Interpretation: $R^2$ is the proportion of variance in $y$ that the linear model explains.
- In the simple linear regression case, it is the square of the correlation coefficient, $r$.
- Key idea: $R^2$ ranges from 0 to 1. The closer it is to 1, the better the linear fit is.
- $R^2$ has no units of measurement, unlike RMSE.
Calculating $R^2$¶
Let's calculate the $R^2$ for model_two's predictions in three different ways.
pred = tips.assign(predicted=model_two.predict(tips[['total_bill', 'size']]))
pred
| total_bill | tip | sex | smoker | day | time | size | predicted |
---|---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 1.15 |
1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 2.80 |
2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 3.71 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 | 2.67 |
242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 | 1.99 |
243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 | 2.82 |
244 rows × 8 columns
Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$
np.var(pred['predicted']) / np.var(pred['tip'])
np.float64(0.46786930879612615)
Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$
Note: By correlation here, we are referring to $r$, the same correlation coefficient you saw in DSC 10.
pred[['predicted', 'tip']].corr().loc['predicted', 'tip'] ** 2
np.float64(0.46786930879612526)
Method 3: LinearRegression.score
model_two.score(tips[['total_bill', 'size']], tips['tip'])
0.46786930879612565
All three methods provide the same result!
Relationship between $R^2$ and RMSE¶
For linear models with an intercept term,
$$R^2 = 1 - \frac{\text{RMSE}^2}{\text{var}(\text{actual $y$ values})}$$
1 - rmse(pred['tip'], pred['predicted']) ** 2 / np.var(pred['tip'])
np.float64(0.46786930879612554)
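Why does this identity hold? Here's a short sketch. It relies on the model being a least squares fit with an intercept term, so that the residuals $y_i - H(x_i)$ have mean zero and are uncorrelated with the predictions:
$$\text{var}(\text{actual $y$ values}) = \text{var}(\text{predicted $y$ values}) + \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2 = \text{var}(\text{predicted $y$ values}) + \text{RMSE}^2$$
Dividing both sides by $\text{var}(\text{actual $y$ values})$ and rearranging recovers both the formula above and the ratio-of-variances formula for $R^2$.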
What's next?¶
tips.head()
| total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
- So far, in our journey to predict 'tip', we've only used the existing numerical features in our dataset, 'total_bill' and 'size'.
- There's a lot of information in tips that we didn't use – 'sex', 'smoker', 'day', and 'time', for example. We can't use these features in their current form, because they're non-numeric.
- How do we use categorical features in a regression model?
Feature engineering ⚙️¶
The goal of feature engineering¶
- Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
- A feature function $\phi$ (phi, pronounced "fee") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$. (See the sketch after this list.)
- If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."
- A "good" choice of features depends on many factors:
- The kind of data, i.e. quantitative, ordinal, or nominal.
- The relationship(s) being modeled.
- The model type, e.g. linear models, decision tree models, neural networks.
- To introduce different feature functions, we'll look at several different example datasets.
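Here's a tiny, purely illustrative sketch of a feature function for the tips data. The particular features chosen (and the name phi) are arbitrary; the point is just to make the notation concrete.
# A toy feature function: maps one raw row of tips to a vector in R^3.
# The choice of features here is purely for illustration.
def phi(row):
    return np.array([
        row['total_bill'],            # already quantitative: keep as-is
        row['size'],                  # already quantitative: keep as-is
        int(row['smoker'] == 'Yes'),  # nominal -> binary indicator
    ])

phi(tips.iloc[0])
Two rows that are "similar" in the raw data (similar bill, party size, and smoking status) map to nearby vectors in $\mathbb{R}^3$, which is exactly the property we want.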
One hot encoding¶
- One hot encoding is a transformation that turns a categorical feature into several binary features.
- Suppose a column has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:
$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x = A_i \\ 0 & {\rm if\ } x\neq A_i \\ \end{array}\right. $$
- Note that 1 means "yes" and 0 means "no".
- One hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".
Example: One hot encoding 'smoker'¶
For each unique value of 'smoker' in our dataset, we must create a column for just that value. (Remember, 'smoker' is 'Yes' when the table was in the smoking section of the restaurant and 'No' otherwise.)
tips.head()
| total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
tips['smoker'].value_counts()
smoker
No     151
Yes     93
Name: count, dtype: int64
(tips['smoker'] == 'Yes').astype(int).head()
0    1
1    0
2    1
3    0
4    0
Name: smoker, dtype: int64
for val in tips['smoker'].unique():
tips[f'smoker == {val}'] = (tips['smoker'] == val).astype(int)
tips.head()
| total_bill | tip | sex | smoker | ... | time | size | smoker == Yes | smoker == No |
---|---|---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | ... | Dinner | 1 | 1 | 0 |
1 | 18.78 | 3.00 | Female | No | ... | Dinner | 2 | 0 | 1 |
2 | 26.59 | 3.41 | Male | Yes | ... | Dinner | 3 | 1 | 0 |
3 | 14.26 | 2.50 | Male | No | ... | Lunch | 2 | 0 | 1 |
4 | 21.16 | 3.00 | Male | No | ... | Lunch | 2 | 0 | 1 |
5 rows × 9 columns
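Aside: pandas can build indicator columns like these for us. Here's a minimal sketch using pd.get_dummies; note that it names the columns slightly differently ('smoker_Yes' instead of 'smoker == Yes'). We'll see sklearn's transformer classes, which do the same job, later in the lecture.
# pd.get_dummies creates one indicator column per unique value of 'smoker'.
# .astype(int) converts the boolean indicators to 0/1, matching the columns above.
pd.get_dummies(tips['smoker'], prefix='smoker').astype(int).head()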
Model #4: Multiple linear regression using total bill, table size, and smoker status¶
Now that we've converted 'smoker' to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:
$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size} + w_3 \cdot \text{smoker == Yes}$$
Subtlety: There's no need to use both 'smoker == No' and 'smoker == Yes'. If we know the value of one, we already know the value of the other. We can use either one.
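To see the redundancy concretely, we can check that the two indicator columns we created earlier always sum to 1 – every table is either in the smoking section or not, so either column is completely determined by the other.
# Each row has exactly one of the two indicators set to 1.
(tips['smoker == Yes'] + tips['smoker == No']).unique()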
model_three = LinearRegression()
model_three.fit(tips[['total_bill', 'size', 'smoker == Yes']], tips['tip'])
LinearRegression()
The following cell gives us our $w^*$s:
model_three.intercept_, model_three.coef_
(np.float64(0.7090155167346053), array([ 0.09, 0.18, -0.08]))
Thus, our trained linear model to predict tips given total bills, table sizes, and smoker status (yes or no) is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$
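As a quick sanity check, we can ask model_three for a prediction directly. The bill amount, table size, and smoker status below are made up purely for illustration; plugging them into the equation above by hand gives roughly $0.709 + 0.094 \cdot 30 + 0.180 \cdot 2 \approx 3.89$ dollars.
# A hypothetical table: a $30 bill, party of 2, not in the smoking section.
example = pd.DataFrame({'total_bill': [30], 'size': [2], 'smoker == Yes': [0]})
model_three.predict(example)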
Visualizing Model #4¶
Our new fit model is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$
To visualize our data and linear model, we'd need 4 dimensions:
- One for total bill.
- One for table size.
- One for 'smoker == Yes'.
- One for tip.
Humans can't visualize in 4D, but there may be a solution. We know that 'smoker == Yes' only has two possible values, 1 or 0, so let's look at those cases separately.
Case 1: 'smoker == Yes' is 1, meaning that the table was in the smoking section.
$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 1 \\ &= 0.626 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$
Case 2: 'smoker == Yes' is 0, meaning that the table was not in the smoking section.
$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 0 \\ &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$
Key idea: These are two parallel planes in 3D, with different $z$-intercepts!
Note that the two planes are very close to one another – you'll have to zoom in to see the difference.
# pio.renderers.default = 'plotly_mimetype+notebook' # If it doesn't render, try uncommenting this.
# Grid of (total bill, table size) pairs over which to draw the two planes.
XX, YY = np.mgrid[0:50:2, 0:8:1]
# Predicted tips when 'smoker == Yes' is 0 (non-smoking) vs. 1 (smoking).
Z_0 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 0
Z_1 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 1
plane_0 = go.Surface(x=XX, y=YY, z=Z_0, colorscale='Greens')
plane_1 = go.Surface(x=XX, y=YY, z=Z_1, colorscale='Purples')
fig = go.Figure(data=[plane_0, plane_1])
# Overlay the actual data points, colored by smoking section.
tips_0 = tips[tips['smoker'] == 'No']
tips_1 = tips[tips['smoker'] == 'Yes']
fig.add_trace(go.Scatter3d(x=tips_0['total_bill'],
                           y=tips_0['size'],
                           z=tips_0['tip'], mode='markers', marker={'color': 'green'}))
fig.add_trace(go.Scatter3d(x=tips_1['total_bill'],
                           y=tips_1['size'],
                           z=tips_1['tip'], mode='markers', marker={'color': 'purple'}))
fig.update_layout(scene=dict(
                      xaxis_title='Total Bill',
                      yaxis_title='Table Size',
                      zaxis_title='Tip'),
                  title='Tip vs. Total Bill and Table Size (Green = Non-Smoking Section, Purple = Smoking Section)',
                  width=1000, height=800,
                  showlegend=False)