from dsc80_utils import *
from lec08_utils import *
Announcements 📣¶
- Project 2 checkpoint due tomorrow. No extensions on project checkpoints!
- Lab 4 is due on Wed, Feb 5th.
- Please fill out the early survey and let me know how the course is going for you.
Agenda 📆¶
- Review: Missingness mechanisms.
- Identifying missingness mechanisms in data.
- How do we decide between MCAR and MAR using a permutation test?
- The Kolmogorov-Smirnov test statistic.
- Imputation.
- Mean imputation.
- Probabilistic imputation.
Review: Missingness mechanisms¶
Flowchart¶
A good strategy is to assess missingness in the following order: first ask whether values are missing by design, then whether they are NMAR, then whether they are MAR, and conclude MCAR only if none of the earlier mechanisms apply.
Question 🤔 (Answer at dsc80.com/q)
Code: lec08-imdb
Taken from the Winter 2023 DSC 80 Midterm Exam.
The DataFrame tv_excl contains all of the information we have for TV shows that are only available for streaming on a single streaming service.

Given no information other than a TV show's "Title" and "IMDb" rating, what is the most likely missingness mechanism of the "IMDb" column?
A. Missing by design
B. Not missing at random
C. Missing at random
D. Missing completely at random
Question 🤔 (Answer at dsc80.com/q)
Code: lec08-imdb2
Taken from the Winter 2023 DSC 80 Midterm Exam.

Now, suppose we discover that the median "Rotten Tomatoes" rating among TV shows with a missing "IMDb" rating is a 13, while the median "Rotten Tomatoes" rating among TV shows with a present "IMDb" rating is a 52.

Given this information, what is the most likely missingness mechanism of the "IMDb" column?
A. Missing by design
B. Not missing at random
C. Missing at random
D. Missing completely at random
Question 🤔 (Answer at dsc80.com/q)
Code: lec08-blood
Suppose Sam collects the blood pressures of 30 people in January. Then, in February, he asks a subset of the people to come back for a second reading (which means that there are missing blood pressures for February). What are the missing mechanisms for the blood pressures in February in the following situations?
- Sam uses np.random.choice() to select 7 individuals from January.
- Sam asks the individuals who had hypertension (blood pressure > 140) in January to come back in February.
- Sam asks everyone to come back for a second reading in February, but only records the data for people who had hypertension (blood pressure > 140).
Identifying missingness mechanisms in data¶
Example: Heights¶
- Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
- The dataset does not contain any missing values – we will artificially introduce missing values such that the values are MCAR, for illustration.
heights_path = Path('data') / 'midparent.csv'
heights = pd.read_csv(heights_path).rename(columns={'childHeight': 'child'})[['father', 'mother', 'gender', 'child']]
heights.head()
|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | 69.0 |
| 3 | 78.5 | 67.0 | female | 69.0 |
| 4 | 75.5 | 66.5 | male | 73.5 |
Simulating MCAR data¶
- We will make 'child' MCAR by taking a random subset of heights and setting the corresponding 'child' heights to np.nan.
- This is equivalent to flipping a (biased) coin for each row.
  - If heads, we delete the 'child' height.
- You will not do this in practice!
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan
heights_mcar.head(10)
|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | NaN |
| ... | ... | ... | ... | ... |
| 7 | 75.5 | 66.5 | female | NaN |
| 8 | 75.0 | 64.0 | male | 71.0 |
| 9 | 75.0 | 64.0 | female | 68.0 |

10 rows × 4 columns
heights_mcar.isna().mean()
father    0.0
mother    0.0
gender    0.0
child     0.3
dtype: float64
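The bullet points above described this process as flipping a biased coin per row. As a sanity check, here's a minimal sketch of that coin-flip implementation (heights_mcar_coin is a hypothetical name introduced just for illustration); unlike sample(frac=0.3), it deletes roughly, not exactly, 30% of the values.
np.random.seed(42)
heights_mcar_coin = heights.copy()
# Flip a coin with P(heads) = 0.3 for each row; delete 'child' on heads.
heads = np.random.rand(len(heights_mcar_coin)) < 0.3
heights_mcar_coin.loc[heads, 'child'] = np.nan
heights_mcar_coin['child'].isna().mean()  # Approximately, but not exactly, 0.3.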
Verifying that child heights are MCAR in heights_mcar¶
- Each row of heights_mcar belongs to one of two groups:
  - Group 1: 'child' is missing.
  - Group 2: 'child' is not missing.
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
|   | father | mother | gender | child | child_missing |
|---|--------|--------|--------|-------|---------------|
| 0 | 78.5 | 67.0 | male | 73.2 | False |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | NaN | True |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | 73.5 | False |
- We need to look at the distributions of every other column – 'gender', 'mother', and 'father' – separately for these two groups, and check to see if they are similar.
gender_dist = (
    heights_mcar
    .assign(child_missing=heights_mcar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
gender_dist = gender_dist / gender_dist.sum()
gender_dist
| gender | child_missing = False | child_missing = True |
|--------|-----------------------|----------------------|
| female | 0.49 | 0.48 |
| male | 0.51 | 0.52 |
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MCAR Example)', barmode='group')
Since 'gender' is categorical, we're comparing two categorical distributions here. The two distributions look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.
create_kde_plotly(heights_mcar, 'child_missing', True, False, 'father',
                  "Father's Height by Missingness of Child Height (MCAR Example)")
Since fathers' heights ('father') are numerical, we're comparing two numerical distributions here. The two distributions look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.
create_kde_plotly(heights_mcar, 'child_missing', True, False, 'mother',
                  "Mother's Height by Missingness of Child Height (MCAR Example)")
Again, mothers' heights ('mother') are numerical, so we're comparing two numerical distributions. The two distributions look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.
Concluding that 'child' is MCAR¶
- We need to run three permutation tests – one for each column in heights_mcar other than 'child' (one such test is sketched below).
- For each of these columns, if we fail to reject the null hypothesis that the column's distribution when 'child' is missing is the same as its distribution when 'child' is not missing, then the data are consistent with 'child' being MCAR.
  - In such a case, its missingness is not tied to any other columns.
  - For instance, children with shorter fathers are not any more likely to have missing heights than children with taller fathers.
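For concreteness, here is a minimal sketch of one of these three tests (the one for 'father'), using the absolute difference in group means as the test statistic, the same recipe this lecture applies later to synthetic data.
n_repetitions = 500
shuffled = heights_mcar.copy()
diffs = []
for _ in range(n_repetitions):
    # Shuffle the 'father' heights while keeping the missingness labels in place.
    shuffled['father'] = np.random.permutation(shuffled['father'])
    # Compute and store the absolute difference in group means.
    diffs.append(shuffled.groupby('child_missing')['father'].mean().diff().abs().iloc[-1])
observed = heights_mcar.groupby('child_missing')['father'].mean().diff().abs().iloc[-1]
# A large p-value means we fail to reject the null, consistent with MCAR.
np.mean(np.array(diffs) >= observed)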
Simulating MAR data¶
Now, we will make 'child' heights MAR by deleting them according to a random procedure that depends on other columns.
np.random.seed(42) # So that we get the same results each time (for lecture).
def make_missing(r):
    # Missingness depends on 'father' and 'gender'; this dependence on
    # other columns is what makes the resulting missingness MAR.
    rand = np.random.uniform()  # Random real number between 0 and 1.
    if r['father'] > 72 and rand < 0.5:
        return np.nan
    elif r['gender'] == 'female' and rand < 0.3:
        return np.nan
    else:
        return r['child']
heights_mar = heights.copy()
heights_mar['child'] = heights_mar.apply(make_missing, axis=1)
heights_mar['child_missing'] = heights_mar['child'].isna()
heights_mar.head()
|   | father | mother | gender | child | child_missing |
|---|--------|--------|--------|-------|---------------|
| 0 | 78.5 | 67.0 | male | NaN | True |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | 69.0 | False |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | NaN | True |
Comparing null and non-null 'child' distributions for 'gender', again¶
This time, the distribution of 'gender' in the two groups is very different.
gender_dist = (
    heights_mar
    .assign(child_missing=heights_mar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
gender_dist = gender_dist / gender_dist.sum()
gender_dist
| gender | child_missing = False | child_missing = True |
|--------|-----------------------|----------------------|
| female | 0.4 | 0.88 |
| male | 0.6 | 0.12 |
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MAR Example)', barmode='group')
Comparing null and non-null 'child' distributions for 'father', again¶
create_kde_plotly(heights_mar, 'child_missing', True, False, 'father',
                  "Father's Height by Missingness of Child Height (MAR Example)")
- The above two distributions look quite different.
  - This is because we artificially created missingness in the dataset in a way that depended on 'father' and 'gender'.
- However, their difference in means is small:
(
    heights_mar
    .groupby('child_missing')
    ['father']
    .mean()
    .diff()
    .iloc[-1]
)
np.float64(1.0055466604787853)
- If we ran a permutation test with the difference in means as our test statistic, we would fail to reject the null.
- Using just the difference in means, it is hard to tell these two distributions apart.
The Kolmogorov-Smirnov test statistic¶
Recap: Permutation tests¶
- Permutation tests help decide whether two samples look like they were drawn from the same population distribution.
- In a permutation test, we simulate data under the null by shuffling either group labels or numerical features.
- In effect, this randomly assigns individuals to groups.
- If the two distributions are numerical, we've used as our test statistic the difference in group means or medians.
- If the two distributions are categorical, we've used as our test statistic the total variation distance (TVD).
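As a refresher, the TVD is half the sum of the absolute differences between two categorical distributions' proportions. Here's a minimal sketch, using the gender-by-missingness proportions from the MCAR example earlier as inputs.
# Two categorical distributions over the same categories (from the MCAR
# gender table above).
dist_not_missing = pd.Series({'female': 0.49, 'male': 0.51})
dist_missing = pd.Series({'female': 0.48, 'male': 0.52})
# The TVD: half the sum of absolute differences in proportions.
tvd = (dist_not_missing - dist_missing).abs().sum() / 2
tvd  # 0.01, which is tiny, as we'd expect when 'child' is MCAR.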
Difference in means¶
The difference in means works well in some cases. Let's look at one such case.
Below, we artificially generate two numerical datasets.
np.random.seed(42) # So that we get the same results each time (for lecture).
N = 1000 # Number of samples for each distribution.
# Distribution 'A'.
distr1 = pd.Series(np.random.normal(0, 1, size=N // 2))
# Distribution 'B'.
distr2 = pd.Series(np.random.normal(3, 1, size=N // 2))
data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})
meanA, meanB = data.groupby('group')['data'].mean().round(7).tolist()
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')
Different distributions with the same mean¶
Let's generate two distributions that look very different but have the same mean.
np.random.seed(42) # So that we get the same results each time (for lecture).
N = 1000 # Number of samples for each distribution.
# Distribution 'A'.
a = pd.Series(np.random.normal(0, 1, size=N//2))
b = pd.Series(np.random.normal(4, 1, size=N//2))
distr1 = pd.concat([a,b], ignore_index=True)
# Distribution 'B'.
distr2 = pd.Series(np.random.normal(distr1.mean(), distr1.std(), size=N))
data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})
meanA, meanB = data.groupby('group')['data'].mean().round(7).tolist()
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')
In this case, if we use the difference in means as our test statistic in a permutation test, we will fail to reject the null hypothesis that the two samples come from the same distribution, even though their shapes are clearly different.
n_repetitions = 500
shuffled = data.copy()
diff_means = []
for _ in range(n_repetitions):
    # Shuffling the values, while keeping the group labels in place.
    shuffled['data'] = np.random.permutation(shuffled['data'])
    # Computing and storing the absolute difference in means.
    diff_mean = shuffled.groupby('group')['data'].mean().diff().abs().iloc[-1]
    diff_means.append(diff_mean)
observed_diff = data.groupby('group')['data'].mean().diff().abs().iloc[-1]
fig = px.histogram(pd.DataFrame(diff_means), x=0, nbins=50, histnorm='probability',
                   title='Empirical Distribution of the Absolute Difference in Means')
fig.add_vline(x=observed_diff, line_color='red', line_width=1, opacity=1)
fig.add_annotation(text=f'<span style="color:red">Observed Absolute Difference in Means = {round(observed_diff, 2)}</span>',
                   x=2 * observed_diff, showarrow=False, y=0.07)
# The computed p-value is fairly large.
np.mean(np.array(diff_means) >= observed_diff)
np.float64(0.108)
Telling numerical distributions apart¶
- The difference in means only works as a test statistic in permutation tests if the two distributions have similar shapes.
- It tests to see if one is a shifted version of the other.
- We need a better test statistic to differentiate between numerical distributions with different shapes.
- In other words, we need a distance metric between numerical distributions.
- The TVD is a distance metric between categorical distributions.
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')
The Kolmogorov-Smirnov test statistic¶
- The K-S test statistic measures how different two numerical distributions are.
- It is defined in terms of the cumulative distribution function (CDF) of a given distribution.
- If $f(x)$ is a distribution, then the CDF $F(x)$ is the proportion of values in distribution $f$ that are less than or equal to $x$.
- The K-S statistic is defined as the largest absolute difference between the two CDFs: $D = \max_x |F_A(x) - F_B(x)|$, where $F_A$ and $F_B$ are the CDFs of the two distributions being compared.
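As a preview of how this computation looks in code, here's a minimal sketch on two hypothetical samples, once by hand (evaluating both empirical CDFs on the pooled data) and once with scipy.stats.ks_2samp, which returns the statistic along with a p-value.
from scipy.stats import ks_2samp

np.random.seed(42)
# Two hypothetical samples with the same mean but different shapes.
a = np.random.normal(0, 1, 500)
b = np.random.normal(0, 2, 500)

# By hand: evaluate both empirical CDFs at every pooled data point, then
# take the largest absolute difference between them.
pooled = np.sort(np.concatenate([a, b]))
cdf_a = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
cdf_b = np.searchsorted(np.sort(b), pooled, side='right') / len(b)
np.max(np.abs(cdf_a - cdf_b))

# With scipy: ks_2samp computes the same statistic.
ks_2samp(a, b).statistic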
Aside: Cumulative distribution functions¶
First, some setup.
fig1 = create_kde_plotly(data, 'group', 'A', 'B', 'data', f'Distributions of A and B')
# Think about what this function is doing!
def create_cdf(group):
    return data.loc[data['group'] == group, 'data'].value_counts(normalize=True).sort_index().cumsum()
fig2 = go.Figure()
fig2.add_trace(
    go.Scatter(x=create_cdf('A').index, y=create_cdf('A'), name='CDF of A')
)
fig2.add_trace(
    go.Scatter(x=create_cdf('B').index, y=create_cdf('B'), name='CDF of B')
)
fig2.update_layout(title='CDFs of A and B')
from plotly.subplots import make_subplots
for i in range(2):
    fig2.data[i]['marker']['color'] = fig1.data[i]['marker']['color']
    fig2.data[i]['showlegend'] = False
fig = make_subplots(rows=1, cols=2, subplot_titles=['Distributions', 'CDFs'])
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig1.data[1], row=1, col=1)
fig.add_trace(fig2.data[0], row=1, col=2)
fig.add_trace(fig2.data[1], row=1, col=2)
fig.update_layout(width=1000, height=400);
Aside: Cumulative distribution functions¶
Let's look at the CDFs of our two synthetic distributions.
fig