import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
It does not help us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
That's where permutation testing comes in.
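As a rough preview of the idea, here is a minimal sketch of a permutation test for two samples. The function name `permutation_test`, the choice of test statistic (absolute difference in group means), the number of repetitions, and the synthetic data are all assumptions made for illustration; they are not taken from the lecture's dataset.

```python
import numpy as np

def permutation_test(sample_a, sample_b, n_repetitions=1000, seed=0):
    """Estimate a p-value using the absolute difference in group means.

    Hypothetical helper for illustration; the statistic and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)

    # Observed statistic computed from the two original samples.
    observed = abs(sample_a.mean() - sample_b.mean())

    # Under the null hypothesis the group labels are interchangeable,
    # so we pool the values and repeatedly reassign them to two groups.
    pooled = np.concatenate([sample_a, sample_b])
    n_a = len(sample_a)
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        shuffled = rng.permutation(pooled)
        simulated[i] = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())

    # Proportion of shuffled statistics at least as extreme as the observed one.
    p_value = np.mean(simulated >= observed)
    return observed, p_value

# Synthetic (made-up) samples drawn from the same distribution.
data_rng = np.random.default_rng(42)
group_a = data_rng.normal(loc=120, scale=15, size=200)
group_b = data_rng.normal(loc=120, scale=15, size=150)

observed, p_value = permutation_test(group_a, group_b)
print(f'observed difference: {observed:.2f}, p-value: {p_value:.3f}')
```

Note that the sketch never refers to a population distribution: everything is computed from the two samples alone, which is exactly the setting described above.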
Note: For familiarity, we'll start with an example from DSC 10. This means we'll move quickly!
Let's start by loading in the data.
baby = pd.read_csv(os.path.join('data', 'baby.csv'))
baby.head()
| Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
We're only interested in the 'Birth Weight' and 'Maternal Smoker' columns.
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
| Maternal Smoker | Birth Weight |
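Note that there are two samples: the birth weights of babies whose mothers smoked, and the birth weights of babies whose mothers did not smoke.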
How many babies are in each group? What is the average birth weight within each group?
baby.groupby('Maternal Smoker')['Birth Weight'].agg(['mean', 'count'])
Note that birth weights are recorded in ounces. There are 16 ounces in 1 pound, so the average weights above are roughly 7-8 pounds.
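To make the unit conversion concrete, here is a small sketch on a toy DataFrame. The rows below are made up for illustration and are not taken from baby.csv; storing 'Maternal Smoker' as booleans is also an assumption.

```python
import pandas as pd

OUNCES_PER_POUND = 16

# Made-up rows for illustration only; not taken from baby.csv.
toy = pd.DataFrame({
    'Maternal Smoker': [False, False, True, True],
    'Birth Weight': [120, 128, 112, 116],  # in ounces
})

# Average birth weight per group, in ounces and then in pounds.
mean_ounces = toy.groupby('Maternal Smoker')['Birth Weight'].mean()
mean_pounds = mean_ounces / OUNCES_PER_POUND
print(mean_pounds)
```

Dividing the grouped means by 16 keeps the group labels as the index, so the converted values line up with the same two groups.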
Below, we draw the distributions of both sets of birth weights.
px.histogram(
    baby,
    color='Maternal Smoker',
    histnorm='probability',
    marginal='box',
    title="Birth Weight by Mother's Smoking Status",
    barmode='overlay',
    opacity=0.7,
)