# Lecture 13 – Imputation

## DSC 80, Spring 2023

### Midterm Exam Logistics

• The Midterm Exam is in-class, in-person on Friday, May 5th.
• It will cover Lectures 1-13, Labs 1-5, and Projects 1-2.
• You can bring a single, two-sided note sheet.
• To review problems from old exams, go to practice.dsc80.com.
• Also look at the Resources tab on the course website.

### Agenda

• Recap: Identifying missingness mechanisms.
• Overview of imputation.
• Mean imputation.
• Probabilistic imputation.

## Recap: Identifying missingness mechanisms

### Review: Missingness mechanisms

• Missing by design (MD): Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict whether a value will be missing given the other columns, the data is MD.
• Not missing at random (NMAR): The chance that a value is missing depends on the actual missing value!
• Missing at random (MAR): The chance that a value is missing depends on other columns, but not the actual missing value itself.
• Missing completely at random (MCAR): The chance that a value is missing is completely independent of other columns and the actual missing value.

### Deciding between MAR and MCAR

Recall, the "missing value flowchart" says that we should:

• First, determine whether values are missing by design (MD).
• Then, reason about whether values are not missing at random (NMAR).
• Finally, decide whether values are missing at random (MAR) or missing completely at random (MCAR).

To decide between MAR and MCAR, we can look at the data itself.

### Deciding between MAR and MCAR

• If the missingness of column $X$ is explainable via the other columns in the data, then the missing data is missing at random (MAR).
• The distribution of missing values in column $X$ may look different from the distribution of observed values in column $X$ – that's fine, as long as the missingness can be explained solely by other columns in the data.
• If the missingness of column $X$ doesn't depend on any values in the observed data, it is missing completely at random (MCAR).
• MCAR is equivalent to data being MAR, without dependence on any other columns.
• To decide if the missingness in column $X$ looks MCAR, for every other column, compare:
    • The distribution of the other column when $X$ is missing.
    • The distribution of the other column when $X$ is not missing.
• If this pair of distributions looks similar for every other column, then the values in column $X$ may be MCAR.
• Caution: you can't prove that data are MCAR. A permutation test can only reject or fail to reject the null hypothesis; it never lets you accept it!
• See Lab 5, Question 4.
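
The comparison described above can be sketched in pandas. This is a minimal illustration on synthetic data, not the lecture's heights dataset: the column names `'father'` and `'child'` are borrowed from the heights example, while the values, the sample size, and the 30% missing rate are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in data (an assumption, not the real heights dataset).
df = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

# Drop roughly 30% of 'child' values independently of everything else,
# so the missingness is MCAR by construction.
df.loc[rng.random(n) < 0.3, 'child'] = np.nan

# Summarize 'father' separately for rows where 'child' is missing
# and rows where 'child' is observed.
summary = (
    df.assign(child_missing=df['child'].isna())
      .groupby('child_missing')['father']
      .describe()
)
print(summary)
```

If the two rows of the summary look similar, that is consistent with (but does not prove) MCAR. The same summary would need to be repeated for every other column in the data.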

### Example: Heights

Today, we'll use the same heights dataset as we did last time.

### Example: Missingness of 'child' heights on 'father' heights (MCAR)

• Question: Is the missingness of 'child' heights dependent on the 'father' column?
• To answer, we can look at two distributions:
    • The distribution of 'father' when 'child' is missing.
    • The distribution of 'father' when 'child' is not missing.
• If the two distributions look similar, then the missingness of 'child' looks to be independent of 'father'.
• To test whether two distributions look similar, we use a permutation test.
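
A permutation test of this kind might look like the sketch below. It uses the absolute difference in means as the test statistic, which is one reasonable choice among several and not necessarily the one used in lecture. The helper name `permutation_test_missingness`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd

def permutation_test_missingness(df, missing_col, other_col,
                                 n_repetitions=500, seed=0):
    """Return (observed statistic, p-value) for the null hypothesis that
    the missingness of `missing_col` is unrelated to `other_col`.

    Test statistic: absolute difference in the mean of `other_col`
    between rows where `missing_col` is missing and rows where it is not.
    """
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[other_col].to_numpy()

    def stat(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffle the missingness labels to simulate the null hypothesis.
    simulated = np.array([
        stat(rng.permutation(is_missing)) for _ in range(n_repetitions)
    ])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Synthetic stand-in data (an assumption, not the real heights dataset).
rng = np.random.default_rng(1)
n = 400
heights = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

# MCAR by construction: 'child' is dropped independently of 'father'.
mcar = heights.copy()
mcar.loc[rng.random(n) < 0.3, 'child'] = np.nan

# MAR by construction: 'child' is mostly dropped when 'father' is tall.
mar = heights.copy()
tall_father = (heights['father'] > 70).to_numpy()
mar.loc[tall_father & (rng.random(n) < 0.8), 'child'] = np.nan

_, p_mcar = permutation_test_missingness(mcar, 'child', 'father')
_, p_mar = permutation_test_missingness(mar, 'child', 'father')
print(p_mcar, p_mar)
```

A small p-value is evidence that the missingness of 'child' depends on 'father' (MAR). A large p-value only means we fail to reject independence.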

Aside: In util.py, there are several functions that we've created to help us with this lecture.

• make_mcar takes in a dataset and intentionally drops values from a column such that they are MCAR.
• make_mar does the same for MAR.
• You wouldn't actually do this in practice – in practice, you'll obtain a dataset with no prior knowledge of the missingness mechanism!
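
To make the idea concrete, here is one way such helpers could be written. These are hypothetical stand-ins: the real `make_mcar` and `make_mar` in util.py are not shown in these notes and may have different signatures and behavior. The names `make_mcar_sketch` and `make_mar_sketch`, their parameters, and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(data, col, pct=0.5, seed=0):
    """Return a copy of `data` where each value in `col` is dropped
    independently with probability `pct` (MCAR by construction)."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    drop = rng.random(len(out)) < pct
    out.loc[drop, col] = np.nan
    return out

def make_mar_sketch(data, col, dep_col, pct=0.5, seed=0):
    """Return a copy of `data` where values in `col` are more likely to
    be dropped when `dep_col` is above its median (MAR by construction)."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    high = (out[dep_col] > out[dep_col].median()).to_numpy()
    # Higher drop probability for the upper half, lower for the lower half.
    prob = np.where(high, min(1.0, 1.5 * pct), 0.5 * pct)
    drop = rng.random(len(out)) < prob
    out.loc[drop, col] = np.nan
    return out

# Synthetic stand-in data (an assumption, not the real heights dataset).
rng = np.random.default_rng(7)
n = 500
heights = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

mcar = make_mcar_sketch(heights, 'child', pct=0.3)
mar = make_mar_sketch(heights, 'child', 'father', pct=0.3)

high_father = heights['father'] > heights['father'].median()
print(mcar['child'].isna().mean())
print(mar.loc[high_father, 'child'].isna().mean(),
      mar.loc[~high_father, 'child'].isna().mean())
```

Both helpers return a modified copy and leave the input untouched, so the same complete dataset can be reused to simulate several missingness mechanisms.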