import pandas as pd
import numpy as np
import os
import util
import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'
Recall, the "missing value flowchart" says that we should:
To decide between MAR and MCAR, we can look at the data itself.
Today, we'll use the same heights dataset as we did last time.
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
heights
.rename(columns={'childHeight': 'child', 'childNum': 'number'})
.drop('midparentHeight', axis=1)
)
heights.head()
| | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
Missingness of 'child' heights on 'father''s heights (MCAR)
Question: Is the missingness of 'child' heights dependent on the 'father' column?
To answer this, compare the distribution of 'father' when 'child' is missing with the distribution of 'father' when 'child' is not missing.
If the two distributions look similar, then the missingness of 'child' looks to be independent of 'father'.
Aside: In util.py, there are several functions that we've created to help us with this lecture.
make_mcar takes in a dataset and intentionally drops values from a column such that they are MCAR.
make_mar does the same for MAR.
# Generating MCAR data.
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mcar.isna().mean()
family      0.0
father      0.0
mother      0.0
children    0.0
number      0.0
gender      0.0
child       0.5
dtype: float64
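The source of util.py is not shown here, so the following is only a minimal sketch of how a helper like make_mcar could work, not the lecture's actual implementation. The function name make_mcar_sketch, its seed parameter, and the toy values are all assumptions made for illustration. The key idea is that the rows to blank out are chosen uniformly at random, without looking at any column's values.

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(df, col, pct=0.5, seed=None):
    """Hypothetical stand-in for util.make_mcar (the real helper may differ).

    Returns a copy of df in which a fraction pct of the rows, chosen
    uniformly at random, have their value in col replaced by NaN.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_missing = int(round(pct * len(out)))
    # Pick row positions without consulting any data values: this is
    # what makes the resulting missingness MCAR.
    positions = rng.choice(len(out), size=n_missing, replace=False)
    out[col] = out[col].astype(float)  # make sure the column can hold NaN
    out.iloc[positions, out.columns.get_loc(col)] = np.nan
    return out

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    'father': [78.5, 75.5, 75.0, 74.0, 73.0, 72.0, 70.0, 69.0],
    'child': [73.2, 73.5, 71.0, 70.5, 68.0, 67.5, 66.0, 65.5],
})
toy_mcar = make_mcar_sketch(toy, 'child', pct=0.5, seed=42)
print(toy_mcar.isna().mean())
```

Because the selection ignores the data entirely, the proportion of missing 'child' values should be about the same at every value of 'father'.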
Missingness of 'child' heights on 'father''s heights (MCAR)
heights_mcar['child_missing'] = heights_mcar['child'].isna()
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
"Father's Height by Missingness of Child Height (MCAR example)")
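The visual comparison can be backed up with a quick numeric check. The sketch below is an illustration under stated assumptions: it uses synthetic data (the demo DataFrame and its normal-distribution parameters are made up, not taken from the heights dataset) and compares the two 'father' distributions by their summary statistics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heights data (made-up parameters).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

# Blank out roughly half of 'child' at random, independent of 'father' (MCAR).
demo.loc[rng.random(n) < 0.5, 'child'] = np.nan
demo['child_missing'] = demo['child'].isna()

# Split 'father' by whether 'child' is missing.
father_when_missing = demo.loc[demo['child_missing'], 'father']
father_when_present = demo.loc[~demo['child_missing'], 'father']

# Under MCAR, the two groups should have similar summaries.
print(demo.groupby('child_missing')['father'].describe())
diff_in_means = abs(father_when_missing.mean() - father_when_present.mean())
print('Absolute difference in mean father height:', diff_in_means)
```

A small difference in means is consistent with MCAR but does not prove it. Two distributions can share a mean and still differ in shape, which is why looking at the full distributions is still useful.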