In [1]:
from dsc80_utils import *
In [2]:
from lec08_utils import *

Lecture 8 – Imputation¶

DSC 80, Spring 2025¶

Agenda 📆¶

  • Review: Missingness mechanisms.
  • Identifying missingness mechanisms in data.
    • How do we decide between MCAR and MAR using a permutation test?
    • The Kolmogorov-Smirnov test statistic.
  • Imputation.
    • Mean imputation.
    • Probabilistic imputation.

Review: Missingness mechanisms¶

Flowchart¶

A good strategy is to assess missingness in the following order.

Missing by design (MD)
Can I determine the missing value exactly by looking at the other columns? 🤔
⬇️
Not missing at random (NMAR)
Is there a good reason why the missingness depends on the values themselves? 🤔
⬇️
Missing at random (MAR)
Do other columns tell me anything about the likelihood that a value is missing? 🤔
⬇️
Missing completely at random (MCAR)
The missingness must not depend on other columns or the values themselves. 😄

Question 🤔 (Answer at dsc80.com/q)

Code: lec08-imdb

Taken from the Winter 2023 DSC 80 Midterm Exam.

The DataFrame tv_excl contains all of the information we have for TV shows that are only available for streaming on a single streaming service.

[Figure: a preview of the tv_excl DataFrame.]

Given no other information other than a TV show’s "Title" and "IMDb" rating, what is the most likely missingness mechanism of the "IMDb" column?

  • A. Missing by design

  • B. Not missing at random

  • C. Missing at random

  • D. Missing completely at random

Question 🤔 (Answer at dsc80.com/q)

Code: lec08-imdb2

Taken from the Winter 2023 DSC 80 Midterm Exam.

[Figure: a preview of the tv_excl DataFrame.]

Now, suppose we discover that the median "Rotten Tomatoes" rating among TV shows with a missing "IMDb" rating is a 13, while the median "Rotten Tomatoes" rating among TV shows with a present "IMDb" rating is a 52.

Given this information, what is the most likely missingness mechanism of the "IMDb" column?

  • A. Missing by design

  • B. Not missing at random

  • C. Missing at random

  • D. Missing completely at random

Question 🤔 (Answer at dsc80.com/q)

Code: lec08-blood

Suppose Sam collects the blood pressures of 30 people in January. Then, in February, he asks a subset of the people to come back for a second reading (which means that some February blood pressures are missing). What are the missingness mechanisms for the February blood pressures in the following situations?

  1. Sam uses np.random.choice() to select 7 individuals from January.
  2. Sam asks the individuals who had hypertension (blood pressure > 140) in January to come back in February.
  3. Sam asks everyone to come back for a second reading in February, but only records the data for people who had hypertension (blood pressure > 140).

Identifying missingness mechanisms in data¶

Example: Heights¶

  • Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
  • The dataset does not contain any missing values – we will artificially introduce missing values such that the values are MCAR, for illustration.
In [3]:
heights_path = Path('data') / 'midparent.csv'
heights = pd.read_csv(heights_path).rename(columns={'childHeight': 'child'})[['father', 'mother', 'gender', 'child']]
heights.head()
Out[3]:
father mother gender child
0 78.5 67.0 male 73.2
1 78.5 67.0 female 69.2
2 78.5 67.0 female 69.0
3 78.5 67.0 female 69.0
4 75.5 66.5 male 73.5

Simulating MCAR data¶

  • We will make 'child' MCAR by taking a random subset of heights and setting the corresponding 'child' heights to np.nan.
  • This is equivalent to flipping a (biased) coin for each row.
    • If heads, we delete the 'child' height.
  • You will not do this in practice!
In [4]:
np.random.seed(42) # So that we get the same results each time (for lecture).

heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan
In [5]:
heights_mcar.head(10)
Out[5]:
father mother gender child
0 78.5 67.0 male 73.2
1 78.5 67.0 female 69.2
2 78.5 67.0 female NaN
... ... ... ... ...
7 75.5 66.5 female NaN
8 75.0 64.0 male 71.0
9 75.0 64.0 female 68.0

10 rows × 4 columns

In [6]:
heights_mcar.isna().mean()
Out[6]:
father    0.0
mother    0.0
gender    0.0
child     0.3
dtype: float64

Verifying that child heights are MCAR in heights_mcar¶

  • Each row of heights_mcar belongs to one of two groups:
    • Group 1: 'child' is missing.
    • Group 2: 'child' is not missing.
In [7]:
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
Out[7]:
father mother gender child child_missing
0 78.5 67.0 male 73.2 False
1 78.5 67.0 female 69.2 False
2 78.5 67.0 female NaN True
3 78.5 67.0 female 69.0 False
4 75.5 66.5 male 73.5 False
  • We need to look at the distributions of every other column – 'gender', 'mother', and 'father' – separately for these two groups, and check to see if they are similar.
In [8]:
gender_dist = (
    heights_mcar
    .assign(child_missing=heights_mcar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)

# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']

gender_dist = gender_dist / gender_dist.sum()
gender_dist
Out[8]:
child_missing = False child_missing = True
gender
female 0.49 0.48
male 0.51 0.52
In [9]:
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MCAR Example)', barmode='group')

Since 'gender' is categorical, we're comparing two categorical distributions here. The two distributions look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.

In [10]:
create_kde_plotly(heights_mcar, 'child_missing', True, False, 'father', 
                  "Father's Height by Missingness of Child Height (MCAR Example)")

Since fathers' heights are numerical, we're comparing two numerical distributions here. The two distributions look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.

In [11]:
create_kde_plotly(heights_mcar, 'child_missing', True, False, 'mother', 
                  "Mother's Height by Missingness of Child Height (MCAR Example)")

Again, mothers' heights are numerical, so we're comparing two numerical distributions. These also look similar, but to formalize what we mean by "similar," we'd need to run a permutation test.

Concluding that 'child' is MCAR¶

  • We need to run three permutation tests – one for each column in heights_mcar other than 'child'.

  • For every other column, if we fail to reject the null that the column's distribution is the same whether or not 'child' is missing, then the data are consistent with 'child' being MCAR.

    • In such a case, its missingness is not tied to any other columns.
    • For instance, children with shorter fathers are not any more likely to have missing heights than children with taller fathers.
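One of these permutation tests can be sketched as follows. This is a minimal sketch, not the lecture's code: the helper name `mean_diff_permutation_test` is ours, a small synthetic DataFrame stands in for heights_mcar, and the test statistic is the absolute difference in group means.

```python
import numpy as np
import pandas as pd

def mean_diff_permutation_test(df, col, missing_col, n_repetitions=500, seed=42):
    """p-value for the null that the distribution of `col` is the same
    whether `missing_col` is True or False.
    Test statistic: absolute difference in group means."""
    rng = np.random.default_rng(seed)
    observed = df.groupby(missing_col)[col].mean().diff().abs().iloc[-1]
    shuffled = df[[col, missing_col]].copy()
    stats = []
    for _ in range(n_repetitions):
        # Shuffle the values, keeping the group labels in place.
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        stats.append(shuffled.groupby(missing_col)[col].mean().diff().abs().iloc[-1])
    return float(np.mean(np.array(stats) >= observed))

# Synthetic MCAR data: missingness is independent of 'father',
# so the p-value should usually be large.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'father': rng.normal(69, 3, size=300)})
demo['child_missing'] = rng.random(300) < 0.3
p_value = mean_diff_permutation_test(demo, 'father', 'child_missing')
```

To conclude MCAR for heights_mcar, you'd run this once per non-'child' column, using the TVD instead of the difference in means for the categorical 'gender' column.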

Simulating MAR data¶

Now, we will make 'child' heights MAR by deleting 'child' heights according to a random procedure that depends on other columns.

In [12]:
np.random.seed(42) # So that we get the same results each time (for lecture).

def make_missing(r):
    rand = np.random.uniform() # Random real number between 0 and 1.
    if r['father'] > 72 and rand < 0.5:
        return np.nan
    elif r['gender'] == 'female' and rand < 0.3:
        return np.nan
    else:
        return r['child']
    
heights_mar = heights.copy()
heights_mar['child'] = heights_mar.apply(make_missing, axis=1)
heights_mar['child_missing'] = heights_mar['child'].isna()
In [13]:
heights_mar.head()
Out[13]:
father mother gender child child_missing
0 78.5 67.0 male NaN True
1 78.5 67.0 female 69.2 False
2 78.5 67.0 female 69.0 False
3 78.5 67.0 female 69.0 False
4 75.5 66.5 male NaN True

Comparing null and non-null 'child' distributions for 'gender', again¶

This time, the distribution of 'gender' in the two groups is very different.

In [16]:
gender_dist = (
    heights_mar
    .assign(child_missing=heights_mar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)

# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']

gender_dist = gender_dist / gender_dist.sum()
gender_dist
Out[16]:
child_missing = False child_missing = True
gender
female 0.4 0.88
male 0.6 0.12
In [15]:
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MAR Example)', barmode='group')

Comparing null and non-null 'child' distributions for 'father', again¶

In [17]:
create_kde_plotly(heights_mar, 'child_missing', True, False, 'father', 
                  "Father's Height by Missingness of Child Height (MAR Example)")
  • The above two distributions look quite different.
    • This is because we artificially created missingness in the dataset in a way that depended on 'father' and 'gender'.
  • However, their difference in means is small:
In [18]:
(
    heights_mar
    .groupby('child_missing')
    ['father']
    .mean()
    .diff()
    .iloc[-1]
)
Out[18]:
np.float64(1.0055466604787853)
  • If we ran a permutation test with the difference in means as our test statistic, we would fail to reject the null.
    • Using just the difference in means, it is hard to tell these two distributions apart.

The Kolmogorov-Smirnov test statistic¶

Recap: Permutation tests¶

  • Permutation tests help decide whether two samples look like they were drawn from the same population distribution.
  • In a permutation test, we simulate data under the null by shuffling either group labels or numerical features.
    • In effect, this randomly assigns individuals to groups.
  • If the two distributions are numerical, we've used as our test statistic the difference in group means or medians.
  • If the two distributions are categorical, we've used as our test statistic the total variation distance (TVD).
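As a refresher, the TVD takes only a couple of lines to compute. A minimal sketch (the function name `tvd` is ours, not from the lecture code):

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two categorical distributions,
    given as arrays of proportions that each sum to 1."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))) / 2)

# Example: the gender distributions from the MCAR example are nearly identical.
tvd([0.49, 0.51], [0.48, 0.52])
```

The TVD is half the sum of absolute differences, so it ranges from 0 (identical distributions) to 1 (no overlap at all).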

Difference in means¶

The difference in means works well in some cases. Let's look at one such case.

Below, we artificially generate two numerical datasets.

In [19]:
np.random.seed(42) # So that we get the same results each time (for lecture).

N = 1000 # Number of samples for each distribution.

# Distribution 'A'.
distr1 = pd.Series(np.random.normal(0, 1, size=N // 2))

# Distribution 'B'.
distr2 = pd.Series(np.random.normal(3, 1, size=N // 2))

data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})

meanA, meanB = data.groupby('group')['data'].mean().round(7).tolist()
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')

Different distributions with the same mean¶

Let's generate two distributions that look very different but have the same mean.

In [20]:
np.random.seed(42) # So that we get the same results each time (for lecture).

N = 1000 # Number of samples for each distribution.

# Distribution 'A'.
a = pd.Series(np.random.normal(0, 1, size=N//2))
b = pd.Series(np.random.normal(4, 1, size=N//2))
distr1 = pd.concat([a,b], ignore_index=True)

# Distribution 'B'.
distr2 = pd.Series(np.random.normal(distr1.mean(), distr1.std(), size=N))

data = pd.concat([distr1, distr2], axis=1, keys=['A', 'B']).unstack().reset_index().drop('level_1', axis=1)
data = data.rename(columns={'level_0': 'group', 0: 'data'})

meanA, meanB = data.groupby('group')['data'].mean().round(7).tolist()
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')

In this case, if we use the difference in means as our test statistic in a permutation test, we will fail to reject the null that the two samples come from the same distribution.

In [21]:
n_repetitions = 500
shuffled = data.copy()

diff_means = []
for _ in range(n_repetitions):
    
    # Shuffling the values, while keeping the group labels in place.
    shuffled['data'] = np.random.permutation(shuffled['data'])
    
    # Computing and storing the absolute difference in means.
    diff_mean = shuffled.groupby('group')['data'].mean().diff().abs().iloc[-1]
    diff_means.append(diff_mean)
In [22]:
observed_diff = data.groupby('group')['data'].mean().diff().abs().iloc[-1]
fig = px.histogram(pd.DataFrame(diff_means), x=0, nbins=50, histnorm='probability', 
                   title='Empirical Distribution of the Absolute Difference in Means')
fig.add_vline(x=observed_diff, line_color='red', line_width=1, opacity=1)
fig.add_annotation(text=f'<span style="color:red">Observed Absolute Difference in Means = {round(observed_diff, 2)}</span>',
                   x=2 * observed_diff, showarrow=False, y=0.07)
In [23]:
# The computed p-value is fairly large.
np.mean(np.array(diff_means) >= observed_diff)
Out[23]:
np.float64(0.108)

Telling numerical distributions apart¶

  • The difference in means only works as a test statistic in permutation tests if the two distributions have similar shapes.
    • It tests to see if one is a shifted version of the other.
  • We need a better test statistic to differentiate between numerical distributions with different shapes.
  • In other words, we need a distance metric between numerical distributions.
    • The TVD is a distance metric between categorical distributions.
In [24]:
create_kde_plotly(data, 'group', 'A', 'B', 'data', f'mean of A: {meanA}<br>mean of B: {meanB}')

The Kolmogorov-Smirnov test statistic¶

  • The K-S test statistic measures the difference between two distributions.
  • It is defined in terms of the cumulative distribution function (CDF) of a given distribution.
    • If $f(x)$ is a distribution, then the CDF $F(x)$ is the proportion of values in distribution $f$ that are less than or equal to $x$.
  • The K-S statistic is roughly defined as the largest difference between two CDFs.
    [Figure: two CDFs, with the largest vertical gap between them highlighted.]
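The "largest difference between two CDFs" definition can be computed directly from the two samples. A minimal sketch (the function name `ks_statistic` is ours; in practice you'd use scipy.stats.ks_2samp):

```python
import numpy as np

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    # The gap can only change at a data point, so evaluating both CDFs
    # at every observed value finds the maximum.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Completely separated samples attain the maximum possible statistic, 1.
ks_statistic([1, 2], [3, 4])  # 1.0
```

The statistic is always between 0 (identical empirical CDFs) and 1 (non-overlapping samples).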

Aside: Cumulative distribution functions¶

First, some setup.

In [25]:
fig1 = create_kde_plotly(data, 'group', 'A', 'B', 'data', f'Distributions of A and B')

# Think about what this function is doing!
def create_cdf(group):
    return data.loc[data['group'] == group, 'data'].value_counts(normalize=True).sort_index().cumsum()

fig2 = go.Figure()

fig2.add_trace(
    go.Scatter(x=create_cdf('A').index, y=create_cdf('A'), name='CDF of A')
)

fig2.add_trace(
    go.Scatter(x=create_cdf('B').index, y=create_cdf('B'), name='CDF of B')
)

fig2.update_layout(title='CDFs of A and B')

from plotly.subplots import make_subplots

for i in range(2):
    fig2.data[i]['marker']['color'] = fig1.data[i]['marker']['color']
    fig2.data[i]['showlegend'] = False
    
fig = make_subplots(rows=1, cols=2, subplot_titles=['Distributions', 'CDFs'])
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig1.data[1], row=1, col=1)
fig.add_trace(fig2.data[0], row=1, col=2)
fig.add_trace(fig2.data[1], row=1, col=2)
fig.update_layout(width=1000, height=400);

Aside: Cumulative distribution functions¶

Let's look at the CDFs of our two synthetic distributions.

In [26]:
fig