from dsc80_utils import *
# The dataset is built into plotly (and seaborn)!
# We shuffle here so that the head of the DataFrame contains rows where smoker is Yes and smoker is No,
# purely for illustration purposes (it doesn't change any of the math).
np.random.seed(1)
tips = px.data.tips().sample(frac=1).reset_index(drop=True)
def rmse(actual, pred):
    # Root mean squared error between actual and predicted values.
    return np.sqrt(np.mean((actual - pred) ** 2))
📣 Announcements 📣¶
- Project 4 out, due next Friday, Dec 1.
- No project checkpoint due because of Thanksgiving, so everyone gets full credit for the checkpoint!
- But start early! Historically, the last two questions of the project take up about 75% of the project time.
- Lab 8 out, due Nov 27.
- No lecture / discussion / OH on Thurs or Fri. Happy Thanksgiving! 🦃
📆 Agenda¶
- Review: Linear models for restaurant tips
- Feature engineering
- Scikit-learn pipelines
Review: Linear Models for Restaurant Tips 🧑🍳¶
tips
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 |
| 242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
| 243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 |
244 rows × 7 columns
from sklearn.linear_model import LinearRegression
mean_tip = tips['tip'].mean()
model = LinearRegression()
model.fit(X=tips[['total_bill']], y=tips['tip'])
model_two = LinearRegression()
model_two.fit(X=tips[['total_bill', 'size']], y=tips['tip'])
LinearRegression()
rmse_dict = {}
rmse_dict['constant tip amount'] = rmse(tips['tip'], mean_tip)
all_preds = model.predict(tips[['total_bill']])
rmse_dict['one feature: total bill'] = rmse(tips['tip'], all_preds)
rmse_dict['two features'] = rmse(
tips['tip'], model_two.predict(tips[['total_bill', 'size']])
)
rmse_dict
{'constant tip amount': 1.3807999538298956, 'one feature: total bill': 1.0178504025697377, 'two features': 1.0072561271146618}
Calculating $R^2$¶
Below, we assign model_two's predicted 'tip' for every row in tips to a new column, 'predicted'.
pred = tips.assign(predicted=model_two.predict(tips[['total_bill', 'size']]))
pred
| | total_bill | tip | sex | smoker | day | time | size | predicted |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 1.15 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 2.80 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 3.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 | 2.67 |
| 242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 | 1.99 |
| 243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 | 2.82 |
244 rows × 8 columns
Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$
np.var(pred['predicted']) / np.var(pred['tip'])
0.4678693087961255
Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$
Note: By correlation here, we are referring to $r$, the same correlation coefficient you saw in DSC 10.
# There was a typo here last lecture, the correct code is:
(np.corrcoef(pred['predicted'], pred['tip'])) ** 2
array([[1.  , 0.47],
       [0.47, 1.  ]])
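Since np.corrcoef returns the full 2×2 correlation matrix, the single number we want is the off-diagonal entry, squared. A minimal sketch (using the pred DataFrame from above) that extracts it:

r = np.corrcoef(pred['predicted'], pred['tip'])[0, 1]  # correlation between predicted and actual tips
r ** 2

This gives roughly 0.468, matching Method 1 above.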
Method 3: LinearRegression.score
model_two.score(tips[['total_bill', 'size']], tips['tip'])
0.46786930879612565
All three methods provide the same result!
What's next?¶
tips.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| 3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
| 4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
- So far, in our journey to predict 'tip', we've only used the existing numerical features in our dataset, 'total_bill' and 'size'.
- There's a lot of information in tips that we didn't use – 'sex', 'smoker', 'day', and 'time', for example. We can't use these features in their current form, because they're non-numeric.
- How do we use categorical features in a regression model?
Feature engineering ⚙️¶
The goal of feature engineering¶
Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
A feature function $\phi$ (phi, pronounced "fee") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.
- If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."
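As a toy sketch (the function name and choice of features here are illustrative, not part of the lecture), a feature function for one row of tips might look like:

def phi(row):
    # Illustrative feature function: maps one raw row of `tips` to a vector in R^2.
    return np.array([
        row['total_bill'],                     # quantitative feature, used as-is
        1 if row['smoker'] == 'Yes' else 0,    # nominal feature, encoded as 0/1
    ])

phi(tips.iloc[0])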
A "good" choice of features depends on many factors:
- The kind of data (quantitative, ordinal, nominal).
- The relationship(s) and association(s) being modeled.
- The model type (e.g. linear models, decision tree models, neural networks).
One hot encoding¶
One hot encoding is a transformation that turns a categorical feature into several binary features.
Suppose column 'col' has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:

$$\phi_i(x) = \begin{cases} 1 & \text{if } x = A_i \\ 0 & \text{otherwise} \end{cases}$$

Note that 1 means "yes" and 0 means "no".
One hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".
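In practice, pandas can build these binary columns for us. A minimal sketch using pd.get_dummies (newer pandas versions return booleans, hence the astype(int)):

pd.get_dummies(tips['smoker'], prefix='smoker').astype(int).head()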
Example: One hot encoding 'smoker'¶

For each unique value of 'smoker' in our dataset, we must create a column for just that value. (Remember, 'smoker' is 'Yes' when the table was in the smoking section of the restaurant and 'No' otherwise.)
tips.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| 3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
| 4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
tips['smoker'].value_counts()
No     151
Yes     93
Name: smoker, dtype: int64
(tips['smoker'] == 'Yes').astype(int).head()
0    1
1    0
2    1
3    0
4    0
Name: smoker, dtype: int64
# Create one binary column per unique value of 'smoker'.
for val in tips['smoker'].unique():
    tips[f'smoker == {val}'] = (tips['smoker'] == val).astype(int)
tips.head()
| | total_bill | tip | sex | smoker | ... | time | size | smoker == Yes | smoker == No |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | ... | Dinner | 1 | 1 | 0 |
| 1 | 18.78 | 3.00 | Female | No | ... | Dinner | 2 | 0 | 1 |
| 2 | 26.59 | 3.41 | Male | Yes | ... | Dinner | 3 | 1 | 0 |
| 3 | 14.26 | 2.50 | Male | No | ... | Lunch | 2 | 0 | 1 |
| 4 | 21.16 | 3.00 | Male | No | ... | Lunch | 2 | 0 | 1 |
5 rows × 9 columns
Model #4: Multiple linear regression using total bill, table size, and smoker status¶
Now that we've converted 'smoker' to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size} + w_3 \cdot \text{smoker == Yes}$$
Subtlety: There's no need to use both 'smoker == No' and 'smoker == Yes'. If we know the value of one, we already know the value of the other. We can use either one.
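If you one hot encode with pd.get_dummies, the drop_first=True option implements this idea directly by keeping only one of the two columns (a sketch; which column is kept depends on the category ordering):

pd.get_dummies(tips['smoker'], prefix='smoker', drop_first=True).astype(int).head()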
model_three = LinearRegression()
model_three.fit(tips[['total_bill', 'size', 'smoker == Yes']], tips['tip'])
LinearRegression()
The following cell gives us our $w^*$s:
model_three.intercept_, model_three.coef_
(0.7090155167346053, array([ 0.09, 0.18, -0.08]))
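As a sanity check (a quick sketch using the first row of tips), combining the intercept and coefficients by hand should match model_three.predict:

first_row = tips[['total_bill', 'size', 'smoker == Yes']].iloc[[0]]
by_hand = model_three.intercept_ + (model_three.coef_ * first_row.iloc[0]).sum()
by_hand, model_three.predict(first_row)[0]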
Thus, our trained linear model to predict tips given total bills, table sizes, and smoker status (yes or no) is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

Visualizing Model #4¶
Our new fit model is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

To visualize our data and linear model, we'd need 4 dimensions:
- One for total bill.
- One for table size.
- One for 'smoker == Yes'.
- One for tip.
Humans can't visualize in 4D, but there may be a solution. We know that 'smoker == Yes' only has two possible values, 1 or 0, so let's look at those cases separately.

Case 1: 'smoker == Yes' is 1, meaning that the table was in the smoking section. Plugging in 1:

$$\text{predicted tip} = (0.709 - 0.083) + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size}$$

Case 2: 'smoker == Yes' is 0, meaning that the table was not in the smoking section. Plugging in 0:

$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size}$$
Key idea: These are two parallel planes in 3D, with different $z$-intercepts!
Note that the two planes are very close to one another – you'll have to zoom in to see the difference.
XX, YY = np.mgrid[0:50:2, 0:8:1]
Z_0 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 0
Z_1 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 1
plane_0 = go.Surface(x=XX, y=YY, z=Z_0, colorscale='Greens')
plane_1 = go.Surface(x=XX, y=YY, z=Z_1, colorscale='Purples')
fig = go.Figure(data=[plane_0, plane_1])
tips_0 = tips[tips['smoker'] == 'No']
tips_1 = tips[tips['smoker'] == 'Yes']
fig.add_trace(go.Scatter3d(x=tips_0['total_bill'],
y=tips_0['size'],
z=tips_0['tip'], mode='markers', marker = {'color': 'green'}))
fig.add_trace(go.Scatter3d(x=tips_1['total_bill'],
y=tips_1['size'],
z=tips_1['tip'], mode='markers', marker = {'color': 'purple'}))
fig.update_layout(scene = dict(
xaxis_title='Total Bill',
yaxis_title='Table Size',
zaxis_title='Tip'),
title='Tip vs. Total Bill and Table Size (Green = Non-Smoking Section, Purple = Smoking Section)',
width=1000, height=800,
showlegend=False)