```
from dsc80_utils import *
```

### Announcements 📣¶

- Lab 3 is due on **Friday, Aug 16**.

### Agenda 📆¶

- Data scope.
- Overview of hypothesis testing.
- Example: Total variation distance.
- Permutation testing.
- Example: Birth weight and smoking 🚬.
- Example (that you'll read on your own): Permutation testing meets TVD.

### Why are we learning hypothesis testing again?¶

You may say,

> Didn't we already learn this in DSC 10?

Yes, but:

- It's an important concept, and one that's often confusing the first time you learn it.
- In addition, to properly handle missing values (next lecture), we need to learn how to identify different **missingness mechanisms**. Doing so requires performing a hypothesis test.

## Data scope¶

### Where are we in the data science lifecycle?¶

Hypothesis testing is a tool for helping us **understand the world (some population)**, given our **understanding of the data (some sample)**.

### Data scope¶

- **Statistical inference**: The practice of drawing conclusions about a population, given a sample.
- **Target population**: All elements of the population you ultimately want to draw conclusions about.
- **Access frame**: All elements that are accessible to you for measurement and observation.
- **Sample**: The subset of the access frame that you actually measured / observed.

### Example: Wikipedia awards¶

A 2012 paper asked:

If we give awards to Wikipedia contributors, will they contribute more?

To test this question, they took the top 1% of all Wikipedia contributors, excluded those who already received an award, and then took a random sample of 200 contributors.

### Example: Who will win the election?¶

In the 2016 US Presidential Election, most pollsters predicted Clinton to win over Trump, even though Trump ultimately won.

To poll, they randomly selected **potential** voters and asked them a question over the phone.

### 🔑 Key Idea: Random samples look like the access frame they were sampled from!¶

This enables statistical inference!

But keep in mind: random samples look like their access frame, which can be different from the population itself.

### Sampling in practice¶

In DSC 10, you used a few key functions/methods to draw samples from populations.

- To draw samples from a known sequence (e.g. an array or Series), you used `np.random.choice`.

```
names = np.load(Path('data') / 'names.npy', allow_pickle=True)
# By default, the sampling is done WITH replacement.
np.random.choice(names, 10)
```

array(['Aritra', 'Zhihan', 'Sarah', 'Raymond', 'Nian-Nian', 'Natasha', 'Yeogyeong', 'Cecilia', 'Chenxi', 'Nida'], dtype=object)

```
# To sample WITHOUT replacement, set replace=False.
# This is known as "simple random sampling."
np.random.choice(names, 10, replace=False)
```

array(['Joyce', 'Yutian', 'Monica', 'Bingyan', 'Chi-En', 'Diya', 'Sebastian', 'Kening', 'Katelyn', 'Lacha'], dtype=object)
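As a quick sanity check of the distinction above (using a small hypothetical name list, since the contents of `names.npy` aren't shown here), simple random sampling can never produce duplicates:

```
import numpy as np

names = np.array(['Aritra', 'Zhihan', 'Sarah', 'Raymond', 'Natasha'])

# With replace=False, each element can be drawn at most once,
# so a sample of all 5 names contains no duplicates.
without = np.random.choice(names, 5, replace=False)
print(len(set(without)))  # 5
```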

- The DataFrame `.sample` method also allowed you to draw samples from a known sequence.

```
# Samples WITHOUT replacement by default (the opposite of np.random.choice).
pd.DataFrame(names, columns=['name']).sample(10)
```

|  | name |
|---|---|
| 17 | Chenxi |
| 110 | Zhenghao |
| 39 | Jesse |
| ... | ... |
| 84 | Sarah |
| 105 | Yiran |
| 37 | Jasmine |

10 rows × 1 columns
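A small sketch (on a hypothetical 5-row DataFrame) of the `.sample` defaults: like `np.random.choice(..., replace=False)`, it draws without replacement unless you say otherwise, and `frac=` lets you specify a fraction of the rows instead of a count:

```
import pandas as pd

df = pd.DataFrame({'name': ['Aritra', 'Zhihan', 'Sarah', 'Raymond', 'Natasha']})

# Without replacement by default: the 3 sampled rows are all distinct.
sample = df.sample(3)
print(sample['name'].is_unique)  # True

# replace=True samples WITH replacement; frac=1 asks for 100% of the rows.
boot = df.sample(frac=1, replace=True)
print(len(boot))  # 5
```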

- To sample from a **categorical** distribution, you used `np.random.multinomial`. Note that in the cell below, we don't see `array([50, 50])` every time, and that's due to randomness!

```
# Draws 100 elements from a population in which 50% are group 0 and 50% are group 1.
# This sampling is done WITH replacement.
# In other words, each sampled element has a 50% chance of being group 0 and a 50% chance of being group 1.
np.random.multinomial(100, [0.5, 0.5])
```

array([45, 55])
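Although the individual counts vary from draw to draw, each draw always distributes exactly `n` elements across the categories. A quick check, simulating several samples at once with the `size=` argument:

```
import numpy as np

# Each row of draws is one simulated sample of 100 elements from a
# 50/50 categorical distribution; the two counts in a row always sum to 100.
draws = np.random.multinomial(100, [0.5, 0.5], size=10)
print((draws.sum(axis=1) == 100).all())  # True
```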

## Overview of hypothesis testing¶

### What problem does hypothesis testing solve?¶

Suppose we've performed an experiment, or identified something interesting in our data.

Say we've created a new vaccine.

To assess its efficacy, we give one group the vaccine, and another a placebo.

We notice that the flu rate among those who received the vaccine is lower than among those who received the placebo (i.e. didn't receive the vaccine).

One possibility: the vaccine doesn't actually do anything, and by chance, those with the vaccine happened to have a lower flu rate.

Another possibility: receiving the vaccine made a difference – the flu rate among those who received the vaccine is lower than we'd expect due to random chance.

**Hypothesis testing allows us to determine whether an observation is "significant."**

### Why hypothesis testing is difficult to learn¶

It's like "proof by contradiction."

If I want to show that my vaccine works, I consider a world where it doesn't (null hypothesis).

Then, I show that under the null hypothesis my data would be very unlikely.

Why go through these mental hurdles? Showing something is not true is usually easier than showing something is true!

### The hypothesis testing "recipe"¶

Faced with a question about the data raised by an observation...

1. Decide on null and alternative hypotheses.
   - The null hypothesis should be a well-defined probability model that reflects the baseline you want to compare against.
   - The alternative hypothesis should be the "alternate reality" that you suspect may be true.
2. Decide on a **test statistic**, such that a large observed statistic would point to one hypothesis and a small observed statistic would point to the other.
3. Compute an empirical distribution of the test statistic under the null by drawing samples from the null hypothesis' probability model.
4. Assess whether the observed test statistic is consistent with the empirical distribution of the test statistic by computing a **p-value**.
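The recipe above can be sketched end-to-end on a toy example. Suppose a coin flipped 100 times lands heads 60 times (a made-up observation for illustration), and the null hypothesis is that the coin is fair:

```
import numpy as np

np.random.seed(42)  # for reproducibility

# Our "interesting observation": 60 heads in 100 flips.
observed_heads = 60

# Step 3: simulate the test statistic (number of heads) under the null
# hypothesis that the coin is fair (a 50/50 categorical distribution).
simulated = np.random.multinomial(100, [0.5, 0.5], size=10_000)[:, 0]

# Step 4: the p-value is the proportion of simulated statistics at least
# as extreme as the observed one.
p_value = (simulated >= observed_heads).mean()
print(p_value < 0.05)  # True: 60 heads in 100 flips is rare for a fair coin
```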

### Question 🤔 (Answer at q.dsc80.com)


Let's complete **Problem 10 from the Spring 2023 DSC 10 Final Exam** together. Submit your answers to q.dsc80.com.

## Example: Total variation distance¶

```
eth = pd.DataFrame(
[['Asian', 0.15, 0.51],
['Black', 0.05, 0.02],
['Latino', 0.39, 0.16],
['White', 0.35, 0.2],
['Other', 0.06, 0.11]],
columns=['Ethnicity', 'California', 'UCSD']
).set_index('Ethnicity')
eth
```

| Ethnicity | California | UCSD |
|---|---|---|
| Asian | 0.15 | 0.51 |
| Black | 0.05 | 0.02 |
| Latino | 0.39 | 0.16 |
| White | 0.35 | 0.20 |
| Other | 0.06 | 0.11 |

The two distributions above are clearly different.

- One possibility: UCSD students **do** look like a random sample of California residents, and the distributions above look different purely due to random chance.
- Another possibility: UCSD students **don't** look like a random sample of California residents, because the distributions above look too different.

### Is the difference between the two distributions significant?¶

Let's establish our hypotheses.

- **Null Hypothesis**: UCSD students **were** selected at random from the population of California residents.
- **Alternative Hypothesis**: UCSD students **were not** selected at random from the population of California residents.
- **Observation**: Ethnic distribution of UCSD students.
- **Test Statistic**: We need a way of quantifying **how different** two categorical distributions are.
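One standard choice for such a statistic, and the namesake of this section, is the **total variation distance (TVD)**: half the sum of the absolute differences between two categorical distributions. A minimal sketch, using the `eth` table defined above:

```
import pandas as pd

eth = pd.DataFrame(
    [['Asian', 0.15, 0.51],
     ['Black', 0.05, 0.02],
     ['Latino', 0.39, 0.16],
     ['White', 0.35, 0.2],
     ['Other', 0.06, 0.11]],
    columns=['Ethnicity', 'California', 'UCSD']
).set_index('Ethnicity')

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two categorical distributions."""
    return (dist1 - dist2).abs().sum() / 2

observed_tvd = total_variation_distance(eth['California'], eth['UCSD'])
print(round(observed_tvd, 2))  # 0.41
```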

```
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')
```