import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
Permutation testing helps us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
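The idea can be sketched on synthetic data. The snippet below is a minimal illustration, not the lecture's own code: the two toy samples, the test statistic (absolute difference in means), and the number of repetitions are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy samples (NOT from the couples dataset).
sample_a = rng.normal(loc=0.0, scale=1.0, size=50)
sample_b = rng.normal(loc=0.5, scale=1.0, size=60)

def diff_in_means(values, labels):
    """Absolute difference between the means of the two labeled groups."""
    return abs(values[labels].mean() - values[~labels].mean())

# Pool both samples and record which group each value came from.
values = np.concatenate([sample_a, sample_b])
labels = np.array([True] * len(sample_a) + [False] * len(sample_b))

observed = diff_in_means(values, labels)

# Under the null hypothesis the group labels are interchangeable,
# so shuffle them and recompute the statistic many times.
n_repetitions = 1000
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = rng.permutation(labels)
    simulated[i] = diff_in_means(values, shuffled)

# Proportion of shuffles at least as extreme as the observed statistic.
p_value = (simulated >= observed).mean()
print(observed, p_value)
```

Shuffling the labels keeps the group sizes fixed while breaking any link between group and value, which is what "drawn from the same population" would look like.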
Let's load in a cleaned version of the couples dataset from the last lecture.
couples_fp = os.path.join('data', 'married_couples_cleaned.csv')
couples = pd.read_csv(couples_fp)
couples.head()
| | mar_status | empl_status | gender | age |
|---|---|---|---|---|
| 0 | married | Working as paid employee | M | 51 |
| 1 | married | Working as paid employee | F | 53 |
| 2 | married | Working as paid employee | M | 57 |
| 3 | married | Working as paid employee | F | 57 |
| 4 | married | Working as paid employee | M | 60 |
couples.sample(5)
| | mar_status | empl_status | gender | age |
|---|---|---|---|---|
| 1523 | married | Working, self-employed | F | 34 |
| 1759 | unmarried | Working as paid employee | F | 37 |
| 437 | married | Working as paid employee | F | 45 |
| 324 | married | Working as paid employee | M | 51 |
| 2014 | unmarried | Working as paid employee | M | 20 |
To compare married and unmarried couples, let's compute the distribution of employment status conditional on household type (married vs. unmarried).
# aggfunc='size' counts the rows in each group; it is a shortcut for
# picking a column for values and using aggfunc='count'.
empl_cnts = couples.pivot_table(index='empl_status', columns='mar_status', aggfunc='size')
# Dividing by the column sums makes each column a distribution that sums to 1.
cond_distr = empl_cnts / empl_cnts.sum()
cond_distr
| empl_status | married | unmarried |
|---|---|---|
| Not working - disabled | 0.048518 | 0.077055 |
| Not working - looking for work | 0.047844 | 0.118151 |
| Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
| Not working - other | 0.122642 | 0.056507 |
| Not working - retired | 0.063342 | 0.018836 |
| Working as paid employee | 0.610512 | 0.594178 |
| Working, self-employed | 0.092992 | 0.113014 |
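To see why dividing counts by their column sums gives a conditional distribution, here is a small sketch on a made-up DataFrame. The `toy` data and its category names are hypothetical, and `pd.crosstab` is swapped in for `pivot_table` as one equivalent way to get the counts.

```python
import pandas as pd

# Hypothetical toy data (NOT the couples dataset).
toy = pd.DataFrame({
    'mar_status': ['married', 'married', 'married', 'unmarried', 'unmarried'],
    'empl_status': ['Working', 'Working', 'Retired', 'Working', 'Retired'],
})

# Counts of each (empl_status, mar_status) combination.
counts = pd.crosstab(toy['empl_status'], toy['mar_status'])

# counts.sum() sums down each column, so this division normalizes
# every column separately: each column becomes a distribution.
cond = counts / counts.sum()

# crosstab can also normalize by column directly.
direct = pd.crosstab(toy['empl_status'], toy['mar_status'], normalize='columns')

print(cond)
print(cond.sum())  # each column sums to 1
```

Each column of `cond` answers "given this household type, what fraction falls in each employment status?", which is the same reading as the table above.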
Are the distributions of employment status for married people and for unmarried people who live with their partners different?
Is this difference just due to noise?
cond_distr.plot(kind='barh', title='Distribution of Employment Status, Conditional on Household Type', barmode='group')
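To decide whether the difference is just noise, we first need a single number that measures how far apart the two distributions are. One common choice for categorical distributions is the total variation distance (TVD); using it here is an assumption of this sketch, not something this section has established. The proportions below are copied from the `cond_distr` output above so the snippet runs on its own.

```python
import pandas as pd

# Proportions copied from the cond_distr table shown above.
cond_distr_copy = pd.DataFrame(
    {
        'married': [0.048518, 0.047844, 0.014151, 0.122642,
                    0.063342, 0.610512, 0.092992],
        'unmarried': [0.077055, 0.118151, 0.022260, 0.056507,
                      0.018836, 0.594178, 0.113014],
    },
    index=[
        'Not working - disabled',
        'Not working - looking for work',
        'Not working - on a temporary layoff from a job',
        'Not working - other',
        'Not working - retired',
        'Working as paid employee',
        'Working, self-employed',
    ],
)

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two distributions."""
    return (dist1 - dist2).abs().sum() / 2

observed_tvd = total_variation_distance(
    cond_distr_copy['married'], cond_distr_copy['unmarried']
)
print(observed_tvd)
```

A TVD of 0 means the two distributions are identical and 1 means they share no categories, so a permutation test can ask how often shuffled household labels produce a TVD at least as large as the observed one.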