from dsc80_utils import *
# Used for plotting examples.
def create_kde_plotly(df, group_col, group1, group2, vals_col, title=''):
    fig = ff.create_distplot(
        hist_data=[df.loc[df[group_col] == group1, vals_col], df.loc[df[group_col] == group2, vals_col]],
        group_labels=[group1, group2],
        show_rug=False, show_hist=False,
        colors=['#ef553b', '#636efb'],
    )
    return fig.update_layout(title=title)
Announcements 📣
- Good job on Project 1!
- Project 2 released. Checkpoint due next Tues, Oct 22. Full project due the week after on Oct 29.
- Lab 4 due tomorrow.
- When submitting answers for attendance, copy-paste what you have. Answers that are some variant of "I don't know" will be treated as no submission (because that's not participation). Take your best guess!
Agenda 📆
- Permutation testing.
- Missingness mechanisms.
    - Why do data go missing?
    - Missing by Design.
    - Not Missing at Random.
    - Missing Completely at Random.
    - Missing at Random.
    - Formal definitions.
    - Identifying missingness mechanisms in data.
Additional resources
These recent lectures have been quite conceptual! Here are a few other places to look for readings; many of these are linked on the course homepage and on the Resources tab of the course website.
- Permutation testing:
    - Extra lecture notebook: Fast Permutation Tests.
    - Great visualization from Jared Wilber.
- Missingness mechanisms:
Permutation testing
Hypothesis testing vs. permutation testing
- So far, we've used hypothesis tests to answer questions of the form:
I know the population distribution, and I have one sample. Is this sample a likely draw from the population?
- Next, we want to consider questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from different populations? That is, do these two samples look "different"?
The hypothesis testing framework requires us to be able to draw samples from the baseline population, but what if we don't know that population?
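To make the first kind of question concrete, here is a minimal sketch of a simulation-based hypothesis test in which the population is known. The fair-coin model and the observed count of 60 heads in 100 flips are hypothetical values chosen for illustration; they are not from the lecture's data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: the population model is a fair coin (known),
# and the one observed sample is 100 flips containing 60 heads.
n_flips = 100
observed_heads = 60
expected_heads = n_flips * 0.5

# Because the population is known, we can draw as many samples from it as we like
# and compute the test statistic (number of heads) for each one.
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=10_000)

# p-value: the fraction of simulated samples at least as far from 50 heads as ours.
p_value = np.mean(
    np.abs(simulated_heads - expected_heads) >= abs(observed_heads - expected_heads)
)
print(p_value)
```

The key step is `rng.binomial(...)`, which samples directly from the known population. With two samples and no population model, there is nothing to sample from in this way.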
Example: Birth weight and smoking 🚬
For familiarity, we'll start with an example from DSC 10. This means we'll move quickly!
Let's start by loading in the data. Each row corresponds to a mother/baby pair.
baby = pd.read_csv(Path('data') / 'babyweights.csv')
baby
| | Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
|---|---|---|---|---|---|---|
| 0 | 120 | 284 | 27 | 62 | 100 | False |
| 1 | 113 | 282 | 33 | 64 | 135 | False |
| 2 | 128 | 279 | 28 | 64 | 115 | True |
| ... | ... | ... | ... | ... | ... | ... |
| 1171 | 130 | 291 | 30 | 65 | 150 | True |
| 1172 | 125 | 281 | 21 | 65 | 110 | False |
| 1173 | 117 | 297 | 38 | 65 | 129 | False |
1174 rows × 6 columns
We're only interested in the 'Birth Weight' and 'Maternal Smoker' columns.
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| 3 | True | 108 |
| 4 | False | 136 |
Note that there are two samples:
- Birth weights of smokers' babies (rows where 'Maternal Smoker' is True).
- Birth weights of non-smokers' babies (rows where 'Maternal Smoker' is False).
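As a rough sketch, the two samples can be pulled apart with a Boolean mask. To keep the snippet runnable on its own, it retypes only the five rows shown by `baby.head()`; the names `baby_head`, `smokers_weights`, and `non_smokers_weights` are illustrative and not part of the lecture code.

```python
import pandas as pd

# The five rows displayed by baby.head(), retyped so this example is self-contained.
baby_head = pd.DataFrame({
    'Maternal Smoker': [False, False, True, True, False],
    'Birth Weight': [120, 113, 128, 108, 136],
})

# The 'Maternal Smoker' column is already Boolean, so it can be used directly as a mask.
is_smoker = baby_head['Maternal Smoker']

smokers_weights = baby_head.loc[is_smoker, 'Birth Weight']       # sample 1
non_smokers_weights = baby_head.loc[~is_smoker, 'Birth Weight']  # sample 2

print(smokers_weights.tolist())
print(non_smokers_weights.tolist())
```

The same two lines would apply to the full `baby` DataFrame, assuming its 'Maternal Smoker' column has Boolean dtype as the displayed values suggest.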
Exploratory data analysis
How many babies are in each group? What is the average birth weight within each group?
baby.groupby('Maternal Smoker')['Birth Weight'].agg(['mean', 'count'])
| Maternal Smoker | mean | count |
|---|---|---|
| False | 123.09 | 715 |
| True | 113.82 | 459 |
Birth weights here are measured in ounces. There are 16 ounces in 1 pound, so the group means above are roughly 7 to 8 pounds.
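The conversion can be checked with a few lines of arithmetic. This sketch copies the two group means from the table above; the dictionary names are illustrative.

```python
OUNCES_PER_POUND = 16

# Group means in ounces, copied from the groupby output above.
mean_ounces = {'non-smoker': 123.09, 'smoker': 113.82}

# Divide by 16 to convert each mean from ounces to pounds.
mean_pounds = {
    group: round(ounces / OUNCES_PER_POUND, 2)
    for group, ounces in mean_ounces.items()
}
print(mean_pounds)  # {'non-smoker': 7.69, 'smoker': 7.11}
```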
Visualizing birth weight distributions
Below, we draw the distributions of both sets of birth weights.
fig = px.histogram(baby, color='Maternal Smoker', histnorm='probability', marginal='box',
                   title="Birth Weight by Mother's Smoking Status", barmode='overlay', opacity=0.7)
fig