import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
# Used for plotting examples.
def create_kde_plotly(df, group_col, group1, group2, vals_col, title=''):
    fig = ff.create_distplot(
        hist_data=[
            df.loc[df[group_col] == group1, vals_col],
            df.loc[df[group_col] == group2, vals_col],
        ],
        group_labels=[group1, group2],
        show_rug=False, show_hist=False,
        colors=['#ef553b', '#636efb'],
    )
    return fig.update_layout(title=title)
Run conda activate dsc80 before working on assignments!

A good strategy is to assess missingness in the following order.
In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

- 'gender' and 'age'. 'age' has missing values.
- 'self-reported education level', which contains missing values.
- 'Version 1', 'Version 2', and 'Version 3'. $\frac{2}{3}$ of the entries in the table are NaN.

We won't spend much time on these in lecture, but you may find them helpful.
Suppose we have:
Data is missing completely at random (MCAR) if
$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$
That is, adding information about the dataset doesn't change the likelihood that data is missing!
Suppose we have:
Data is missing at random (MAR) if
$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$
That is, MAR data is actually MCAR, conditional on $Y_{obs}$.
Suppose we have:
Data is not missing at random (NMAR) if
$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$
cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
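To make the three definitions concrete, here is a minimal simulation sketch. Everything in it is a hypothetical choice made for illustration and not part of the lecture data: the toy DataFrame, the column names 'group' and 'value', and the specific missingness probabilities.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical toy data: 'group' is always observed, 'value' may go missing.
toy = pd.DataFrame({
    'group': rng.choice(['a', 'b'], size=n),
    'value': rng.normal(loc=50, scale=10, size=n),
})

# MCAR: every entry is dropped with the same probability, regardless of the data.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of being dropped depends only on the observed column 'group'.
p_mar = np.where(toy['group'] == 'a', 0.5, 0.1)
mar_mask = rng.random(n) < p_mar

# NMAR: the chance of being dropped depends on the would-be-missing value itself.
p_nmar = np.where(toy['value'] > 60, 0.6, 0.1)
nmar_mask = rng.random(n) < p_nmar

# Series.mask replaces entries with NaN wherever the mask is True.
toy_mcar = toy.assign(value=toy['value'].mask(mcar_mask))
toy_mar = toy.assign(value=toy['value'].mask(mar_mask))
toy_nmar = toy.assign(value=toy['value'].mask(nmar_mask))

def missing_rate_by_group(df):
    """Proportion of missing 'value' entries within each 'group'."""
    return df['value'].isna().groupby(df['group']).mean()

rates = pd.DataFrame({
    'MCAR': missing_rate_by_group(toy_mcar),
    'MAR': missing_rate_by_group(toy_mar),
    'NMAR': missing_rate_by_group(toy_nmar),
})
```

In this sketch, the per-group missing rates in `rates` should differ noticeably only in the MAR column. The NMAR column looks as unremarkable as the MCAR column, even though the observed mean of 'value' is biased downward, which illustrates why NMAR generally cannot be diagnosed from the observed data alone.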
Phone | Screen Size | Price |
---|---|---|
iPhone 14 | 6.06 | 999 |
Galaxy Z Fold 4 | 7.6 | NaN |
OnePlus 9 Pro | 6.7 | 799 |
iPhone 13 Pro Max | 6.68 | NaN |
Suppose you have a DataFrame with columns named $\text{col}_1$, $\text{col}_2$, ..., $\text{col}_k$, and want to test whether values in $\text{col}_X$ are MCAR. To test whether $\text{col}_X$'s missingness is independent of all other columns in the DataFrame:
For $i = 1, 2, ..., k$, where $i \neq X$:
If all pairs of distributions are the same, then $\text{col}_X$ is MCAR.
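The loop above can be sketched with permutation tests. This is one possible implementation, not the lecture's own code: it assumes each comparison is between the distribution of $\text{col}_i$ where $\text{col}_X$ is missing and where it is not, and it assumes both groups are non-empty. The helper names (`tvd`, `abs_diff_means`, `missingness_p_value`, `mcar_p_values`) and the choice of test statistics (absolute difference in means for numeric columns, total variation distance otherwise) are hypothetical.

```python
import numpy as np
import pandas as pd

def tvd(a, b):
    """Total variation distance between two categorical samples."""
    pa = a.value_counts(normalize=True)
    pb = b.value_counts(normalize=True)
    return pa.subtract(pb, fill_value=0).abs().sum() / 2

def abs_diff_means(a, b):
    """Absolute difference in means between two numeric samples."""
    return abs(a.mean() - b.mean())

def missingness_p_value(df, col_x, col_i, n_repetitions=500, seed=0):
    """Permutation-test p-value for whether col_i looks different
    when col_x is missing versus when col_x is present."""
    rng = np.random.default_rng(seed)
    keep = df[col_i].notna()  # ignore rows where col_i itself is missing
    values = df.loc[keep, col_i].reset_index(drop=True)
    missing = df.loc[keep, col_x].isna().to_numpy()

    # Assumed choice of statistic, based on the dtype of col_i.
    stat = abs_diff_means if pd.api.types.is_numeric_dtype(values) else tvd
    observed = stat(values[missing], values[~missing])

    simulated = np.empty(n_repetitions)
    for r in range(n_repetitions):
        # Shuffling the missingness labels breaks any link to col_i.
        shuffled = rng.permutation(missing)
        simulated[r] = stat(values[shuffled], values[~shuffled])
    return (simulated >= observed).mean()

def mcar_p_values(df, col_x, **kwargs):
    """One p-value per column other than col_x."""
    others = [c for c in df.columns if c != col_x]
    return pd.Series(
        {c: missingness_p_value(df, col_x, c, **kwargs) for c in others}
    )
```

A small p-value for some column suggests the missingness of `col_x` depends on that column, so the data is unlikely to be MCAR. Large p-values across the board are consistent with MCAR but do not prove it, and running one test per column raises the usual multiple-testing concerns.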
heights = pd.read_csv('data/midparent.csv')
heights = heights.rename(columns={'childHeight': 'child'})
heights = heights[['father', 'mother', 'gender', 'child']]
heights.head()
| | father | mother | gender | child |
---|---|---|---|---|
0 | 78.5 | 67.0 | male | 73.2 |
1 | 78.5 | 67.0 | female | 69.2 |
2 | 78.5 | 67.0 | female | 69.0 |
3 | 78.5 | 67.0 | female | 69.0 |
4 | 75.5 | 66.5 | male | 73.5 |
Proof that there aren't currently any missing values in heights:
heights.isna().sum()
father    0
mother    0
gender    0
child     0
dtype: int64
We have three numerical columns: 'father', 'mother', and 'child'. Let's visualize them simultaneously.
fig = px.scatter_matrix(heights.drop(columns=['gender']))
fig