from dsc80_utils import *
Agenda 📆¶
- Modeling.
- Case study: Restaurant tips 🧑🍳.
- Regression in sklearn.
Conceptually, today will mostly be review from DSC 40A, but we'll introduce a few new practical tools that we'll build upon next week.
Question 🤔 (Answer at q.dsc80.com)
Taken from the Spring 2022 Final Exam.
The DataFrame below contains a corpus of four song titles, labeled from 0 to 3.
 | track_name |
---|---|
0 | i hate you i love you i hate that i love you |
1 | love me like a love song |
2 | love you better |
3 | hate sosa |
Part 1: What is the TF-IDF of the word "hate" in Song 0's title? Use base 2 in your logarithm, and give your answer as a simplified fraction.
Part 2: Which word in Song 0's title has the highest TF-IDF?
Part 3: Let $\text{tfidf}(t, d)$ be the TF-IDF of term $t$ in document $d$, and let $\text{bow}(t, d)$ be the number of occurrences of term $t$ in document $d$.
Select all correct answers below.
- If $\text{tfidf}(t, d) = 0$, then $\text{bow}(t, d) = 0$.
- If $\text{bow}(t, d) = 0$, then $\text{tfidf}(t, d) = 0$.
- Neither of the above statements is necessarily true.
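To check your work on a question like this, the computation can be sketched in plain Python. This sketch assumes the usual definitions: term frequency is the proportion of words in the document equal to the term, and inverse document frequency is the base-2 log of (number of documents) / (number of documents containing the term). The helper names `tf`, `idf`, and `tfidf` are ours, not from any library.

```python
import math

# The four song titles from the question, labeled 0 to 3.
titles = [
    "i hate you i love you i hate that i love you",
    "love me like a love song",
    "love you better",
    "hate sosa",
]
docs = [title.split() for title in titles]


def tf(term, doc):
    # Term frequency: proportion of words in `doc` that equal `term`.
    return doc.count(term) / len(doc)


def idf(term, docs):
    # Inverse document frequency with a base-2 logarithm.
    # Assumes `term` appears in at least one document.
    n_containing = sum(term in doc for doc in docs)
    return math.log2(len(docs) / n_containing)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# TF-IDF of every distinct word in Song 0's title.
song0_scores = {term: tfidf(term, docs[0], docs) for term in set(docs[0])}
```

Try the question by hand first, then compare against `song0_scores["hate"]` and `max(song0_scores, key=song0_scores.get)`.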
Modeling¶
Reflection¶
So far this quarter, we've learned how to:
- Extract information from tabular data using pandas and regular expressions.
- Clean data so that it best represents an underlying data generating process.
  - Missingness analyses and imputation.
- Collect data from the internet through scraping and APIs, and parse it using BeautifulSoup.
- Perform exploratory data analysis through aggregation, visualization, and the computation of summary statistics like TF-IDF.
- Infer about the relationships between samples and populations through hypothesis and permutation testing.
Now, let's make predictions.
Modeling¶
A model is a set of assumptions about how data were generated.
George Box, a famous statistician, once said "All models are wrong, but some are useful." What did he mean?
Philosophy¶
"It has been said that "all models are wrong but some models are useful." In other words, any model is at best a useful fiction—there never was, or ever will be, an exactly normal distribution or an exact linear relationship. Nevertheless, enormous progress has been made by entertaining such fictions and using them as approximations."
"Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."
Goals of modeling¶
To make accurate predictions regarding unseen data.
- Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
- Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (classification)
To make inferences about complex phenomena in nature.
- Is there a linear relationship between the heights of children and the heights of their biological mothers?
- The weights of smoking and non-smoking mothers' babies in my sample are different – how confident am I that this difference exists in the population?
Of these two goals of modeling, we will focus on prediction.
Within the taxonomy of machine learning, we will focus on supervised learning.
We'll start with regression before moving to classification.
Features¶
A feature is a measurable property of a phenomenon being observed.
- Other terms for "feature" include "(explanatory) variable" and "attribute".
- Typically, features are the inputs to models.
In DataFrames, features typically correspond to columns, while rows typically correspond to different individuals.
Some features come as part of a dataset, e.g. weight and height, but others we need to create given existing features, for example: $$\text{BMI} = \frac{\text{weight (kg)}}{\text{[height (m)]}^2}$$
Example: TF-IDF creates features that summarize documents!
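As a rough sketch of creating a feature from existing ones, here is the BMI formula applied to a small DataFrame. The column names `weight_kg` and `height_m` and all of the values are made up for illustration; they don't come from any real dataset.

```python
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
people = pd.DataFrame({
    "weight_kg": [60.0, 72.5, 81.0],
    "height_m": [1.65, 1.80, 1.75],
})

# A new feature computed from two existing features:
# BMI = weight (kg) / [height (m)]^2.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
```

Each row is still one individual; we've just added a column, which a model could later use as an input.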
Example: Restaurant tips 🧑🍳¶
About the data¶
What features does the dataset contain? Is this likely a recent dataset, or an older one?
# The dataset is built into plotly!
tips = px.data.tips()
tips
 | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
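One way to start answering "what features does the dataset contain?" is to separate numeric columns from categorical ones. So that it runs without plotly, the sketch below rebuilds only the six rows displayed above; with the full dataset you would call the same methods on `tips`. The name `tips_excerpt` is ours.

```python
import pandas as pd

# Only the six rows displayed above, keyed by their original index labels.
tips_excerpt = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 22.67, 17.82, 18.78],
        "tip": [1.01, 1.66, 3.50, 2.00, 1.75, 3.00],
        "sex": ["Female", "Male", "Male", "Male", "Male", "Female"],
        "smoker": ["No", "No", "No", "Yes", "No", "No"],
        "day": ["Sun", "Sun", "Sun", "Sat", "Sat", "Thur"],
        "time": ["Dinner"] * 6,
        "size": [2, 3, 3, 2, 2, 2],
    },
    index=[0, 1, 2, 241, 242, 243],
)

# Numeric columns can be used in arithmetic directly;
# the remaining columns hold categories stored as text.
numeric_features = tips_excerpt.select_dtypes(include="number").columns.tolist()
categorical_features = tips_excerpt.select_dtypes(exclude="number").columns.tolist()
```

Note that `tip` shows up among the numeric columns, but it's the quantity we want to predict, not an input.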
Predicting tips¶
- Goal: Given various information about a table at a restaurant, we want to predict the tip that a server will earn.
- Why might a server be interested in doing this?
  - To determine which tables are likely to tip the most (inference).
  - To predict earnings over the next month (prediction).
Exploratory data analysis¶
- The most natural feature to look at first is total bills.
- As such, we should explore the relationship between total bills and tips. Moving forward:
- $x$: Total bills.
- $y$: Tips.
fig = tips.plot(kind='scatter', x='total_bill', y='tip', title='Tip vs. Total Bill')
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')
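Alongside the scatter plot, a single number that summarizes the linear association between $x$ and $y$ is the correlation coefficient. The sketch below uses only the six rows shown in the preview of `tips`, so its value is illustrative and will differ from the full dataset's; there, the analogous call would be `tips['total_bill'].corr(tips['tip'])`.

```python
import pandas as pd

# x (total bills) and y (tips) for the six rows shown in the preview.
excerpt = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 22.67, 17.82, 18.78],
    "tip": [1.01, 1.66, 3.50, 2.00, 1.75, 3.00],
})

# Pearson correlation between total bill and tip.
# Values near 1 or -1 suggest a strong linear association.
r = excerpt["total_bill"].corr(excerpt["tip"])
```

A correlation coefficient only measures linear association, so it complements the scatter plot rather than replacing it.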