```
from dsc80_utils import *
```

# Lecture 6 – Hypothesis Testing

## DSC 80, Winter 2024

There was an important **Pre-Lecture Reading** for this lecture – we'll assume you've done it.

### Announcements 📣

- Project 1 is due on **Saturday, January 27th**.
- Lab 3 is due on **Monday, January 29th**.
- If you submitted Lab 2 and went to discussion yesterday, make sure to submit the Lab 2 Reflection form on Gradescope **tonight** for extra credit!
- The notebook from discussion is also posted on the course website, next to the podcast for Discussion 3.

### Come help us interview teaching professor candidates!

HDSI is interviewing candidates for the teaching professor role, and we want students to attend their job talks (which are mock lectures)! Your feedback will be valued in the evaluation process.

Here are the talk times in the coming weeks:

- Monday 1/29, 2-3:30PM.
- Monday 2/5, 2-3:30PM.
- Thursday 2/8, 1-2:30PM.

If you're interested in attending any of these, **please send me an email ASAP** or stay after class to tell me!

### Agenda 📆

- Data scope.
- Overview of hypothesis testing.
- Example: Total variation distance.
- Permutation testing.
- Example: Birth weight and smoking 🚬.
- Example (that you'll read on your own): Permutation testing meets TVD.

### Why are we learning hypothesis testing again?

You may say,

> Didn't we already learn this in DSC 10?

Yes, but:

- It's an important concept, but one that's often confusing the first time you learn about it.

- In addition, in order to properly handle missing values (next lecture), we need to learn how to identify different **missingness mechanisms**. Doing so requires performing a hypothesis test.

### Question 🤔 (Answer at q.dsc80.com)

Remember, you can always ask questions at **q.dsc80.com**! If the link doesn't work for you, click the **🤔 Lecture Questions** link in the top right corner of the course website.

## Data scope

### Where are we in the data science lifecycle?

Hypothesis testing is a tool for helping us **understand the world (some population)**, given our **understanding of the data (some sample)**.

### Data scope

**Statistical inference**: The practice of drawing conclusions about a population, given a sample.

**Target population**: All elements of the population you ultimately want to draw conclusions about.

**Access frame**: All elements that are accessible to you for measurement and observation.

**Sample**: The subset of the access frame that you actually measured / observed.

### Example: Wikipedia awards

A 2012 paper asked:

> If we give awards to Wikipedia contributors, will they contribute more?

To test this question, they took the top 1% of Wikipedia contributors, excluded those who had already received an award, and then took a random sample of 200 contributors.

### Example: Who will win the election?

In the 2016 US Presidential Election, most pollsters predicted Clinton to win over Trump, even though Trump ultimately won.

To poll, they randomly selected **potential** voters and asked them a question over the phone.

### 🔑 Key Idea: Random samples look like the access frame they were sampled from!

- This enables statistical inference!

- But keep in mind, random samples look like their access frame, which can be different from the population itself.
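
To see this concretely, here's a minimal sketch with a made-up population and access frame (the 50/50 and 70/30 splits below are hypothetical), showing that a random sample mirrors the frame it was drawn from, not the target population:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: the target population is 50% group A, 50% group B,
# but the access frame over-represents group A (70% A, 30% B).
frame = np.array(['A'] * 700 + ['B'] * 300)

# A large random sample drawn from the access frame...
sample = rng.choice(frame, 10_000)

# ...has a proportion of A's close to the frame's 0.7,
# not the population's 0.5.
prop_a = (sample == 'A').mean()
```

Any inference based on this sample describes the 70/30 access frame, so conclusions drawn about the 50/50 population would be biased.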

### Sampling in practice

In DSC 10, you used a few key functions/methods to draw samples from populations.

- To draw samples from a known sequence (e.g. an array or Series), you used `np.random.choice`.

```
names = np.load(Path('data') / 'names.npy', allow_pickle=True)
# By default, the sampling is done WITH replacement.
np.random.choice(names, 10)
```

array(['Matilda', 'Jack', 'Kaitly', 'Chenxi', 'Suraj', 'Tianqi', 'Seanna', 'David', 'Chenxi', 'Dawson'], dtype=object)

```
# To sample WITHOUT replacement, set replace=False.
# This is known as "simple random sampling."
np.random.choice(names, 10, replace=False)
```

array(['Monica', 'Jason', 'Minghan', 'Deepika', 'Daniel', 'Zhihan', 'Jiaye', 'Matilda', 'Tianqi', 'Yash'], dtype=object)

- The DataFrame `.sample` method also allowed you to draw samples from a known sequence.

```
# Samples WITHOUT replacement by default (the opposite of np.random.choice).
pd.DataFrame(names, columns=['name']).sample(10)
```

|     | name      |
|----:|-----------|
| 28  | Eshaan    |
| 13  | Brendan   |
| 75  | Nida      |
| ... | ...       |
| 90  | Stephanie |
| 68  | Mihir     |
| 29  | Ethan     |

10 rows × 1 columns

- To sample from a **categorical** distribution, you used `np.random.multinomial`. Note that in the cell below, we don't see `array([50, 50])` every time, and that's due to randomness!

```
# Draws 100 elements from a population in which 50% are group 0 and 50% are group 1.
# This sampling is done WITH replacement.
# In other words, each sampled element has a 50% chance of being group 0 and a 50% chance of being group 1.
np.random.multinomial(100, [0.5, 0.5])
```

array([49, 51])

## Overview of hypothesis testing

### What problem does hypothesis testing solve?

Suppose we've performed an experiment, or identified something interesting in our data.

- Say we've created a new vaccine.

- To assess its efficacy, we give one group the vaccine, and another a placebo.

- We notice that the flu rate among those who received the vaccine is lower than among those who received the placebo (i.e. didn't receive the vaccine).

- One possibility: the vaccine doesn't actually do anything, and by chance, those with the vaccine happened to have a lower flu rate.

- Another possibility: receiving the vaccine made a difference – the flu rate among those who received the vaccine is lower than we'd expect due to random chance.

**Hypothesis testing allows us to determine whether an observation is "significant."**

### Why hypothesis testing is difficult to learn

- It's like "[proof by contradiction](https://brilliant.org/wiki/contradiction/#:~:text=Proof%20by%20contradiction%20(also%20known,the%20opposite%20must%20be%20true.)."

- If I want to show that my vaccine works, I consider a world where it doesn't (null hypothesis).

- Then, I "attack the baseline" by showing that under the null hypothesis my data would be very unlikely.

- Showing something is not true is a lot easier than showing something is true!

### The hypothesis testing "recipe"

Faced with a question about the data raised by an observation...

- Decide on null and alternative hypotheses.
- The null hypothesis should be a well-defined probability model that reflects the baseline you want to compare against.
- The alternative hypothesis should be the "alternate reality" that you suspect may be true.

- Decide on a **test statistic**, such that a large observed statistic would point to one hypothesis and a small observed statistic would point to the other.

- Compute an empirical distribution of the test statistic under the null by drawing samples from the null hypothesis' probability model.

- Assess whether the observed test statistic is consistent with the empirical distribution of the test statistic by computing a **p-value**.
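
As a minimal illustration of the whole recipe, here's a hypothetical coin-flip test (all of the numbers are made up): the null hypothesis is that the coin is fair, and the test statistic is the number of heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observation: 60 heads in 100 flips of a coin.
n_flips = 100
observed = 60

# Simulate the test statistic under the null (a fair coin).
simulated = rng.binomial(n_flips, 0.5, size=100_000)

# The p-value is the proportion of simulated statistics
# at least as extreme as the observed one.
p_value = (simulated >= observed).mean()
```

Here the p-value comes out around 0.03, so under a conventional 0.05 cutoff, we'd reject the null hypothesis that the coin is fair.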

### Exercise

Spend some time discussing **Problem 10 from the Spring 2023 DSC 10 Final Exam**.

It focuses on the art 🎨 of choosing an appropriate test statistic.

## Example: Total variation distance

```
eth = pd.DataFrame(
    [['Asian', 0.15, 0.51],
     ['Black', 0.05, 0.02],
     ['Latino', 0.39, 0.16],
     ['White', 0.35, 0.2],
     ['Other', 0.06, 0.11]],
    columns=['Ethnicity', 'California', 'UCSD']
).set_index('Ethnicity')
eth
```

| Ethnicity | California | UCSD |
|-----------|-----------:|-----:|
| Asian     | 0.15       | 0.51 |
| Black     | 0.05       | 0.02 |
| Latino    | 0.39       | 0.16 |
| White     | 0.35       | 0.20 |
| Other     | 0.06       | 0.11 |

- The two distributions above are clearly different.

- One possibility: UCSD students **do** look like a random sample of California residents, and the distributions above look different purely due to random chance.

- Another possibility: UCSD students **don't** look like a random sample of California residents, because the distributions above look too different.

### Is the difference between the two distributions significant?

Let's establish our hypotheses.

**Null Hypothesis**: UCSD students **were** selected at random from the population of California residents.

**Alternative Hypothesis**: UCSD students **were not** selected at random from the population of California residents.

**Observation**: Ethnic distribution of UCSD students.

**Test Statistic**: We need a way of quantifying **how different** two categorical distributions are.

```
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')
```

How can we summarize the difference, or **distance**, between these two distributions using just a single number?

### Total variation distance

The total variation distance (TVD) is a test statistic that describes the **distance between two categorical distributions**.

If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is

$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k \big|a_i - b_i\big|$$

Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution. We *could* define a function to do this (and you can use this in assignments):

```
def tvd(dist1, dist2):
    return np.abs(dist1 - dist2).sum() / 2
```
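
As a quick sanity check of this function (on made-up distributions, with the definition repeated so the cell stands alone): the TVD between identical distributions is 0, and between distributions with no overlap it is 1.

```python
import numpy as np

def tvd(dist1, dist2):
    return np.abs(dist1 - dist2).sum() / 2

# Identical distributions are distance 0 apart.
same = tvd(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

# Distributions with no overlap are distance 1 apart,
# the largest possible TVD.
disjoint = tvd(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```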

But let's try to work with the `eth` DataFrame directly, using the `diff` method.

```
# The diff method finds the differences of consecutive elements in a Series.
pd.Series([4, 5, -2]).diff()
```

```
observed_tvd = eth.diff(axis=1).abs().sum().iloc[1] / 2
observed_tvd
```

0.41000000000000003

The issue is that we don't know whether this is a large value or a small value – we don't know where it lies in the **distribution of TVDs under the null**.

### The plan

To conduct our hypothesis test, we will:

1. Repeatedly generate samples of size 30,000 (the number of UCSD students) from the ethnic distribution of all of California.
2. Each time, compute the TVD between the simulated distribution and California's distribution.
    - **This will generate an empirical distribution of TVDs, under the null.**
3. Finally, determine whether the observed TVD (0.41) is consistent with the empirical distribution of TVDs.

### Generating one random sample

Again, to sample from a categorical distribution, we use `np.random.multinomial`.

**Important**: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.

```
# Number of students at UCSD in this example.
N_STUDENTS = 30_000
```

```
eth['California']
```

Ethnicity
Asian     0.15
Black     0.05
Latino    0.39
White     0.35
Other     0.06
Name: California, dtype: float64

```
np.random.multinomial(N_STUDENTS, eth['California'])
```

array([ 4568, 1482, 11727, 10491, 1732])

```
np.random.multinomial(N_STUDENTS, eth['California']) / N_STUDENTS
```

array([0.15, 0.05, 0.39, 0.35, 0.06])

### Generating many random samples and computing TVDs, without a `for`-loop

We *could* write a `for`-loop to repeat the process on the previous slide many times (and you *can* in labs and projects). However, the Pre-Lecture Reading told us about the `size` argument in `np.random.multinomial`, so let's use that here.

```
eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=100_000) / N_STUDENTS
eth_draws
```

array([[0.15, 0.05, 0.39, 0.35, 0.06],
       [0.14, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06],
       ...,
       [0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06]])

```
eth_draws.shape
```

(100000, 5)

Notice that each row of `eth_draws` sums to 1, because each row is a simulated categorical distribution.

```
# The values here appear rounded.
tvds = np.abs(eth_draws - eth['California'].to_numpy()).sum(axis=1) / 2
tvds
```

array([0. , 0.01, 0. , ..., 0. , 0. , 0. ])

### Visualizing the empirical distribution of the test statistic

```
observed_tvd
```

0.41000000000000003

```
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=20, histnorm='probability',
                   title='Empirical Distribution of the TVD')
fig
```
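
The last step of the recipe is computing a p-value, which the cells above stop short of. Here's a self-contained sketch that recreates the simulation (using the same null distribution and observed TVD of 0.41 from earlier) and computes it:

```python
import numpy as np

rng = np.random.default_rng(0)

# California's ethnic distribution and the observed TVD from above.
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])
observed_tvd = 0.41
N_STUDENTS = 30_000

# Simulate 100,000 samples of 30,000 students under the null...
draws = rng.multinomial(N_STUDENTS, california, size=100_000) / N_STUDENTS

# ...and compute each simulated distribution's TVD from California's.
tvds = np.abs(draws - california).sum(axis=1) / 2

# p-value: the proportion of simulated TVDs at least as large
# as the observed one.
p_value = (tvds >= observed_tvd).mean()
```

No simulated TVD comes anywhere near 0.41, so the p-value is 0, and we reject the null: UCSD students don't look like a random sample of California residents.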