import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
It does not help us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
That's where permutation testing comes in.
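As a rough preview of the idea, here is a minimal sketch of a permutation test on two made-up samples. The data, the names `sample_a` and `sample_b`, the choice of test statistic (difference in means), and the number of repetitions are all assumptions for illustration, not the baby data used below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical samples, for illustration only (NOT the baby data).
sample_a = rng.normal(loc=120, scale=15, size=200)
sample_b = rng.normal(loc=115, scale=15, size=150)

def diff_in_means(a, b):
    """Test statistic: mean of the first group minus mean of the second."""
    return a.mean() - b.mean()

observed = diff_in_means(sample_a, sample_b)

# Under the null hypothesis, both samples come from the same population,
# so the group labels are interchangeable: pool the values and reshuffle.
pooled = np.concatenate([sample_a, sample_b])
n_a = len(sample_a)

n_repetitions = 1000  # assumed; more repetitions give a smoother estimate
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = rng.permutation(pooled)
    simulated[i] = diff_in_means(shuffled[:n_a], shuffled[n_a:])

# Two-sided p-value: how often is a shuffled difference at least as extreme
# as the one we actually observed?
p_value = np.mean(np.abs(simulated) >= abs(observed))
print(p_value)
```

Shuffling the pooled values and splitting them back into groups of the original sizes is equivalent to shuffling the group labels, which is what makes this a permutation test rather than a bootstrap.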
Note: For familiarity, we'll start with an example from DSC 10. This means we'll move quickly!
Let's start by loading in the data.
baby = pd.read_csv(os.path.join('data', 'baby.csv'))
baby.head()
 | Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
---|---|---|---|---|---|---|
0 | 120 | 284 | 27 | 62 | 100 | False |
1 | 113 | 282 | 33 | 64 | 135 | False |
2 | 128 | 279 | 28 | 64 | 115 | True |
3 | 108 | 282 | 23 | 67 | 125 | True |
4 | 136 | 286 | 25 | 62 | 93 | False |
We're only interested in the 'Birth Weight' and 'Maternal Smoker' columns.
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
 | Maternal Smoker | Birth Weight |
---|---|---|
0 | False | 120 |
1 | False | 113 |
2 | True | 128 |
3 | True | 108 |
4 | False | 136 |
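Note that there are two samples: the birth weights of babies whose mothers smoked, and the birth weights of babies whose mothers did not.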
How many babies are in each group? What is the average birth weight within each group?
baby.groupby('Maternal Smoker')['Birth Weight'].agg(['mean', 'count'])
Maternal Smoker | mean | count |
---|---|---|
False | 123.085315 | 715 |
True | 113.819172 | 459 |
Birth weights are recorded in ounces. There are 16 ounces in 1 pound, so the average weights above are roughly 7 to 8 pounds.
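To make the conversion concrete, here is a small sketch that divides the two group means by 16. The values are copied by hand from the `groupby` output above, and the dictionary keys are just illustrative labels.

```python
OUNCES_PER_POUND = 16

# Group means in ounces, copied from the groupby output above.
mean_ounces = {'non-smoker': 123.085315, 'smoker': 113.819172}

# Convert each mean from ounces to pounds.
mean_pounds = {group: oz / OUNCES_PER_POUND for group, oz in mean_ounces.items()}

for group, lb in mean_pounds.items():
    print(f'{group}: {lb:.2f} lb')
# non-smoker: 7.69 lb
# smoker: 7.11 lb
```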
Below, we draw the distributions of both sets of birth weights.
# Overlaid, normalized histograms of birth weight, one per smoking status.
px.histogram(baby, x='Birth Weight', color='Maternal Smoker', histnorm='probability', marginal='box',
             title="Birth Weight by Mother's Smoking Status", barmode='overlay', opacity=0.7)