import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px

pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
Permutation testing helps us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
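To make the two-sample question concrete, here is a minimal sketch of a permutation test. It is an illustration under stated assumptions rather than the method used later in the lecture: the function name `permutation_test_mean_diff`, the choice of the absolute difference in means as the test statistic, and the synthetic samples are all made up for this example.

```python
import numpy as np

def permutation_test_mean_diff(sample_a, sample_b, n_repetitions=2000, seed=0):
    """Estimate a p-value for the absolute difference in means by shuffling group membership.

    Hypothetical helper for illustration; the test statistic is an assumption.
    """
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)

    # Observed statistic on the data as collected.
    observed = abs(sample_a.mean() - sample_b.mean())

    # Under the null hypothesis both samples come from the same population,
    # so any split of the pooled values into two groups is equally likely.
    pooled = np.concatenate([sample_a, sample_b])
    n_a = len(sample_a)
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        shuffled = rng.permutation(pooled)
        simulated[i] = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())

    # Proportion of shuffles at least as extreme as what we observed.
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Synthetic samples (made up): both drawn from the same normal distribution.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=50, scale=10, size=120)
observed, p_value = permutation_test_mean_diff(group_a, group_b)
print(observed, p_value)
```

Note that no population distribution appears anywhere in the sketch: the shuffled splits of the pooled data play the role that simulated draws from a known population play in "standard" hypothesis testing.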
Let's load in a cleaned version of the couples dataset from the last lecture.
couples_fp = os.path.join('data', 'married_couples_cleaned.csv')
couples = pd.read_csv(couples_fp)
couples.head()
| 0 | married | Working as paid employee | M | 51 |
| 1 | married | Working as paid employee | F | 53 |
| 2 | married | Working as paid employee | M | 57 |
| 3 | married | Working as paid employee | F | 57 |
| 4 | married | Working as paid employee | M | 60 |

| 774 | married | Not working - looking for work | M | 35 |
| 668 | unmarried | Working as paid employee | M | 50 |
| 490 | married | Working as paid employee | M | 52 |
| 367 | married | Not working - disabled | F | 55 |
| 208 | married | Working as paid employee | M | 49 |
To see how employment status relates to household type, let's compute the distribution of employment status conditional on household type (married vs. unmarried).
# Note that this is a shortcut to picking a column for values and using aggfunc='count'.
empl_cnts = couples.pivot_table(index='empl_status', columns='mar_status', aggfunc='size')
cond_distr = empl_cnts / empl_cnts.sum()
cond_distr
| empl_status | married | unmarried |
|---|---|---|
| Not working - disabled | 0.048518 | 0.077055 |
| Not working - looking for work | 0.047844 | 0.118151 |
| Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
| Not working - other | 0.122642 | 0.056507 |
| Not working - retired | 0.063342 | 0.018836 |
| Working as paid employee | 0.610512 | 0.594178 |
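The normalization step above divides each column of counts by that column's total, so each household type gets its own distribution that sums to 1. The sketch below checks this on a tiny made-up DataFrame (the rows in `toy` are invented for illustration and are not from the couples data). It builds the counts with `pd.crosstab` instead of `pivot_table`, and compares the manual division against `pd.crosstab(..., normalize='columns')`, which should give the same result in one call.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'mar_status': ['married', 'married', 'married', 'unmarried', 'unmarried'],
    'empl_status': ['Working', 'Working', 'Retired', 'Working', 'Retired'],
})

# Counts of each employment status within each household type.
counts = pd.crosstab(toy['empl_status'], toy['mar_status'])

# counts.sum() gives one total per column; dividing aligns on column labels,
# so every column is scaled by its own total.
by_division = counts / counts.sum()

# The same conditional distributions, computed directly by crosstab.
by_crosstab = pd.crosstab(toy['empl_status'], toy['mar_status'], normalize='columns')

print(by_division)
print(by_division.sum())  # each column total should be 1
```

Dividing by `counts.sum(axis=1)` instead would not work as a plain `/`, because the row totals are indexed by employment status rather than by column label; conditioning on rows would need `counts.div(counts.sum(axis=1), axis=0)` or `normalize='index'`.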
Are the distributions of employment status for married people and for unmarried people who live with their partners different?
Is this difference just due to noise?
cond_distr.plot(kind='barh', title='Distribution of Employment Status, Conditional on Household Type', barmode='group')
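One way to probe whether a difference between two categorical distributions is just noise is to shuffle the group labels and see how often a difference at least as large shows up by chance. The sketch below is a rough illustration under several assumptions: it uses the total variation distance (TVD) as the test statistic, which is a choice not stated above; the helper names `tvd_between_groups` and `permutation_p_value` are hypothetical; and the data in `toy` are made up rather than taken from the couples dataset.

```python
import numpy as np
import pandas as pd

def tvd_between_groups(df, group_col, category_col):
    """Total variation distance between the category distributions of two groups.

    Assumes group_col contains exactly two distinct values.
    """
    dists = pd.crosstab(df[category_col], df[group_col], normalize='columns')
    return dists.iloc[:, 0].sub(dists.iloc[:, 1]).abs().sum() / 2

def permutation_p_value(df, group_col, category_col, n_repetitions=300, seed=0):
    """Shuffle the group labels repeatedly and compare simulated TVDs to the observed one."""
    rng = np.random.default_rng(seed)
    observed = tvd_between_groups(df, group_col, category_col)

    shuffled = df.copy()
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffling labels breaks any link between group and category,
        # which is what the null hypothesis claims.
        shuffled[group_col] = rng.permutation(df[group_col].to_numpy())
        simulated[i] = tvd_between_groups(shuffled, group_col, category_col)

    return observed, (simulated >= observed).mean()

# Made-up data: 60 married and 40 unmarried people with two employment categories.
toy = pd.DataFrame({
    'mar_status': ['married'] * 60 + ['unmarried'] * 40,
    'empl_status': (['Working'] * 40 + ['Not working'] * 20
                    + ['Working'] * 25 + ['Not working'] * 15),
})

observed_tvd, p_value = permutation_p_value(toy, 'mar_status', 'empl_status')
print(observed_tvd, p_value)
```

A large p-value here would suggest that a gap of this size is consistent with random labeling, while a small one would suggest the two groups' distributions really differ.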