# Lecture 20 – Modeling and Linear Regression

## DSC 80, Winter 2023

### 📣 Announcements

• Project 4 (Language Models 🗣) is released.
• The checkpoint is due tomorrow at 11:59PM.
• The full project is due on Thursday, March 9th at 11:59PM.
• Lab 8 (modeling) is due on Monday, March 6th at 11:59PM.

### RSVP to the capstone showcase on Wednesday, March 15th!

The senior capstone showcase is on Wednesday, March 15th in the Price Center East Ballroom. The DSC seniors will be presenting posters on their capstone projects. Come and ask them questions; if you're a DSC major, this will be you one day!

The session is broken into two blocks:

• Block 1: 11AM-12:30PM.
• Block 2: 1-2:30PM.

### Look at the list of topics and RSVP here!

There will be no live DSC 80 lecture on the day of the showcase – instead, lecture will be pre-recorded!

### Agenda

• Modeling.
• Case study: Restaurant tips 🧑‍🍳.
• Regression in sklearn.

## Modeling

### Reflection

So far this quarter, we've learned how to:

• Extract information from tabular data using pandas and regular expressions.
• Clean data so that it best represents a data generating process.
• Perform missingness analyses and imputation.
• Collect data from the internet through scraping and APIs, and parse it using BeautifulSoup.
• Perform exploratory data analysis through aggregation, visualization, and the computation of summary statistics like TF-IDF.
• Make inferences about the relationships between samples and populations through hypothesis and permutation testing.

What we haven't learned yet is how to make predictions.

### Modeling

• Data generating process: A real-world phenomenon that we are interested in studying.
• Example: Every year, city employees are hired and fired, earn salaries and benefits, etc.
• Unless we work for the city, we can't observe this process directly.
• Model: A theory about the data generating process.
• Example: If an employee is $X$ years older than average, then they will make \$100,000 in salary.
• Fit model: A model that is learned from a particular set of observations, i.e. training data.
• Example: If an employee is 5 years older than average, they will make \$100,000 in salary.
• How is this estimate determined? What makes it "good"?
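To make the distinction between a model and a fit model concrete, here is a minimal sketch. It assumes a different model form than the example above (a straight line, salary = w1 · age + w0), and the ages and salaries are made up purely for illustration. The model is the line with unknown `w1` and `w0`; the fit model is the specific pair of numbers learned from the training data by least squares.

```python
import numpy as np

# Hypothetical training data, invented for illustration only.
ages = np.array([25, 30, 35, 40, 45])
salaries = np.array([50_000, 60_000, 70_000, 80_000, 90_000])

# Model: salary = w1 * age + w0, where w1 and w0 are unknown.
# Fit model: the particular w1 and w0 that minimize squared error
# on this training data. np.polyfit returns the highest-degree
# coefficient first, i.e. (slope, intercept) for deg=1.
w1, w0 = np.polyfit(ages, salaries, deg=1)

def predict_salary(age):
    """Predict a salary with the fit model."""
    return w1 * age + w0

print(f"fit model: salary = {w1:.2f} * age + {w0:.2f}")
print(f"predicted salary at age 50: {predict_salary(50):.2f}")
```

One possible answer to "what makes it good?" is the criterion used here: among all lines, the fit one has the smallest total squared error on the training data.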

### Goals of modeling

1. To make accurate predictions regarding unseen data drawn from the data generating process.
• Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
• Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (classification)
2. To make inferences about the structure of the data generating process, i.e. to understand complex phenomena.
• Is there a linear relationship between the heights of children and the heights of their biological mothers?
• The weights of smoking and non-smoking mothers' babies in my sample are different – how confident am I that this difference exists in the population?
• Of these two goals, we will focus on prediction.
• Within the taxonomy of machine learning, we will focus on supervised learning.

### Features

• A feature is a measurable property of a phenomenon being observed.
• Other terms for "feature" include "(explanatory) variable" and "attribute".
• Typically, features are the inputs to models.
• In DataFrames, features typically correspond to columns, while rows typically correspond to different individuals.
• There are two types of features:
• Features that come as part of a dataset, e.g. weight and height.
• Features that we create, e.g. $\text{BMI} = \frac{\text{weight (kg)}}{\text{[height (m)]}^2}$.
• Example: TF-IDF is a feature we've created that summarizes documents!
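As a rough sketch of creating a feature from existing columns, the BMI formula above could be computed in pandas as follows. The DataFrame `people`, its column names, and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical individuals; each row is one person, each column a feature.
people = pd.DataFrame({
    "weight_kg": [70.0, 85.0, 54.0],
    "height_m": [1.75, 1.80, 1.60],
})

# Create a new feature from the two that came with the dataset:
# BMI = weight (kg) / [height (m)]^2
people = people.assign(bmi=people["weight_kg"] / people["height_m"] ** 2)

print(people)
```

The original columns are kept alongside the new one, so a model can use either the raw features, the created feature, or both.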

## Example: Restaurant tips 🧑‍🍳

What features does the dataset contain?

### Predicting tips

• Goal: Given various information about a table at a restaurant, we want to predict the tip that a server will earn.
• Why might a server be interested in doing this?
• To determine which tables are likely to tip the most (inference).
• To predict earnings over the next month (prediction).

### Exploratory data analysis (EDA)

• The most natural feature to look at first is 'total_bill'.
• As such, we should explore the relationship between 'total_bill' and 'tip', as well as the distributions of both columns individually.
• As we do so, try to describe each distribution in words.
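One way to start this exploration numerically is sketched below. The column names `'total_bill'` and `'tip'` come from the text; the four rows are made up and stand in for the real dataset, so the numbers printed say nothing about actual tipping behavior.

```python
import pandas as pd

# Made-up rows standing in for the real tips data (illustration only).
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.5, 7.0],
})

# Distribution of each column individually: count, mean, std, quartiles.
summary = tips[["total_bill", "tip"]].describe()

# Strength of the linear relationship between the two columns
# (Pearson correlation coefficient).
r = tips["total_bill"].corr(tips["tip"])

print(summary)
print(f"correlation between total_bill and tip: {r:.3f}")
```

Summary statistics like these are a complement to plots, not a replacement: a scatter plot of `'total_bill'` against `'tip'` and a histogram of each column would show shape and outliers that the numbers alone can hide.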