Lecture 13 – Imputation¶

DSC 80, Winter 2023¶

📣 Announcements¶

• Discussion 4 is today at 5PM.
• Lab 4 scores and solutions (to non-discussion problems) have been released, or will be shortly.
• Project 2 is due tomorrow at 11:59PM.
• Lab 5 is due on Monday, February 13th at 11:59PM.
• Lab 5 does not have any hidden tests – but the content is all on the Midterm Exam, so make sure you thoroughly understand it.
• Look at this notebook for more examples of missingness.
• Speaking of the Midterm Exam...

Midterm Exam Logistics¶

• The Midterm Exam is in-class, in-person on Wednesday, February 15th.
• It will cover Lectures 1-13, Labs 1-5, and Projects 1-2.
• You can bring a single, two-sided note sheet.
• To review problems from old exams, go to practice.dsc80.com.
• We'll be adding more office hours on Tuesday 2/14, the day before the exam. Come with questions!
• Also look at the Resources tab on the course website.

Agenda¶

• Recap: Identifying missingness mechanisms.
• Overview of imputation.
• Mean imputation.
• Probabilistic imputation.

Recap: Identifying missingness mechanisms¶

Review: Missingness mechanisms¶

• Missing by design (MD): Whether or not a value is missing is determined entirely by the data in other columns. In other words, if we can always predict with certainty whether a value will be missing given the other columns, the data is MD.
• Not missing at random (NMAR): The chance that a value is missing depends on the actual missing value!
• Missing at random (MAR): The chance that a value is missing depends on other columns, but not the actual missing value itself.
• Missing completely at random (MCAR): The chance that a value is missing is completely independent of other columns and the actual missing value.
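One way to state these four definitions more formally is sketched below. The notation is an assumption made for illustration and is not from the lecture: $M$ is an indicator that equals 1 when a value in column $X$ is missing, and $Y$ stands for the other columns.

$$\begin{aligned}
\text{MD:} \quad & P(M = 1 \mid Y) \in \{0, 1\} \\
\text{NMAR:} \quad & P(M = 1 \mid X, Y) \text{ depends on } X \\
\text{MAR:} \quad & P(M = 1 \mid X, Y) = P(M = 1 \mid Y) \\
\text{MCAR:} \quad & P(M = 1 \mid X, Y) = P(M = 1)
\end{aligned}$$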

Deciding between MAR and MCAR¶

Recall that the "missing value flowchart" says that we should:

• First, determine whether values are missing by design (MD).
• Then, reason about whether values are not missing at random (NMAR).
• Finally, decide whether values are missing at random (MAR) or missing completely at random (MCAR).

To decide between MAR and MCAR, we can look at the data itself.

Deciding between MAR and MCAR¶

• If the missingness of column $X$ is explainable via the other columns in the data, then the missing data is missing at random (MAR).
• The distribution of missing values in column $X$ may look different than the distribution of observed data in column $X$ – that's fine, as long as the missingness can be explained solely by other columns in the data.
• If the missingness of column $X$ doesn't depend on any values in the observed data, it is missing completely at random (MCAR).
• MCAR is the special case of MAR in which the missingness does not depend on any other columns.
• To decide if the missingness in column $X$ looks MCAR, for every other column, compare:
    • The distribution of the other column when $X$ is missing.
    • The distribution of the other column when $X$ is not missing.
• If this pair of distributions looks similar for every other column, then the values in column $X$ may be MCAR.
• Caution: you can't prove that data are MCAR, as permutation tests don't allow you to accept the null hypothesis!
• See Lab 5, Question 4.
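As a rough first look before any formal test, the comparison described above can be sketched as a summary table. The function name `compare_by_missingness` and the toy DataFrame are hypothetical and made up for illustration. The sketch only handles numeric columns and only compares means and standard deviations, which is a simplification of comparing full distributions.

```python
import numpy as np
import pandas as pd

def compare_by_missingness(df, col):
    """Summarize every other numeric column, split by whether `col` is missing.

    Hypothetical helper for illustration only.
    """
    # Label each row by whether `col` is missing in that row.
    status = df[col].isna().map({True: 'missing', False: 'observed'})
    others = df.drop(columns=col).select_dtypes(include='number')
    # Rows of the result: (other column, statistic); columns: 'missing', 'observed'.
    return others.groupby(status).agg(['mean', 'std']).T

# Toy data (made up): 'child' is missing exactly when 'father' is tall.
toy = pd.DataFrame({
    'father': [65.0, 66.0, 68.0, 70.0, 72.0, 74.0],
    'mother': [62.0, 63.0, 64.0, 62.0, 63.0, 64.0],
    'child': [66.0, 67.0, 68.5, np.nan, np.nan, np.nan],
})
summary = compare_by_missingness(toy, 'child')
print(summary)
```

In this toy example, the mean of 'father' differs between the two groups while the mean of 'mother' does not, which suggests that the missingness of 'child' is related to 'father'. A summary like this only hints at a dependence. Whether a difference is larger than chance would produce is a question for a permutation test.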

Example: Heights¶

Today, we'll use the same heights dataset as we did last time.

Example: Does the missingness of 'child' heights depend on 'father' heights? (MCAR)¶

• Question: Is the missingness of 'child' heights dependent on the 'father' column?
• To answer, we can look at two distributions:
    • The distribution of 'father' when 'child' is missing.
    • The distribution of 'father' when 'child' is not missing.
• If the two distributions look similar, then the missingness of 'child' looks to be independent of 'father'.
• To test whether two distributions look similar, we use a permutation test.
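A minimal sketch of such a permutation test follows. Several things here are assumptions rather than the lecture's own choices: the function name `missingness_permutation_test` is hypothetical, the test statistic is the absolute difference in group means (other statistics could be used instead), and the data is synthetic rather than the actual heights dataset.

```python
import numpy as np
import pandas as pd

def missingness_permutation_test(df, missing_col, other_col,
                                 n_repetitions=1000, seed=0):
    """Test whether the missingness of `missing_col` depends on `other_col`.

    Hypothetical helper; uses the absolute difference in means of `other_col`
    between rows where `missing_col` is missing and rows where it is not.
    """
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[other_col].to_numpy()

    def abs_diff_in_means(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = abs_diff_in_means(is_missing)
    # Under the null, missingness labels are unrelated to `other_col`,
    # so shuffling the labels should give similar statistics.
    simulated = np.array([
        abs_diff_in_means(rng.permutation(is_missing))
        for _ in range(n_repetitions)
    ])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Synthetic heights (made up for this sketch).
rng = np.random.default_rng(42)
fathers = rng.normal(69, 2.5, size=500)
children = fathers + rng.normal(0, 2, size=500)

# Case 1: 'child' values dropped independently of everything (MCAR-like).
heights_mcar = pd.DataFrame({'father': fathers, 'child': children})
heights_mcar.loc[rng.random(500) < 0.3, 'child'] = np.nan

# Case 2: 'child' values dropped whenever 'father' is above 70.
heights_dep = pd.DataFrame({'father': fathers, 'child': children})
heights_dep.loc[heights_dep['father'] > 70, 'child'] = np.nan

obs_mcar, p_mcar = missingness_permutation_test(heights_mcar, 'child', 'father')
obs_dep, p_dep = missingness_permutation_test(heights_dep, 'child', 'father')
print(p_mcar, p_dep)
```

In the second case the p-value should be very small, so we would reject the hypothesis that the missingness of 'child' is independent of 'father'. In the first case, a large p-value would mean only that we fail to reject that hypothesis, not that the data are proven to be MCAR.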

Aside: In util.py, there are several functions that we've created to help us with this lecture.

• make_mcar takes in a dataset and intentionally drops values from a column such that they are MCAR.
• make_mar does the same for MAR.
• You wouldn't actually do this in practice – in practice, you'll obtain a dataset with no prior knowledge of the missingness mechanism!
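To make the idea concrete, here is one way functions like these could work. This is not the code in util.py: the names `make_mcar_sketch` and `make_mar_sketch`, their parameters, and the rule that larger values of the dependency column are more likely to trigger missingness are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(data, col, pct=0.5, seed=None):
    """Return a copy of `data` with a fraction `pct` of `col` set to NaN,
    chosen uniformly at random (hypothetical helper, not util.make_mcar)."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    n_missing = int(round(pct * len(out)))
    # Every row is equally likely to lose its value.
    positions = rng.choice(len(out), size=n_missing, replace=False)
    out.iloc[positions, out.columns.get_loc(col)] = np.nan
    return out

def make_mar_sketch(data, col, dep_col, pct=0.5, seed=None):
    """Return a copy of `data` with a fraction `pct` of `col` set to NaN,
    where rows with larger `dep_col` are more likely to be chosen
    (hypothetical helper, not util.make_mar)."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    n_missing = int(round(pct * len(out)))
    # Sampling weights grow with the percentile rank of `dep_col`.
    ranks = out[dep_col].rank(pct=True).to_numpy()
    weights = ranks / ranks.sum()
    positions = rng.choice(len(out), size=n_missing, replace=False, p=weights)
    out.iloc[positions, out.columns.get_loc(col)] = np.nan
    return out

# Synthetic heights (made up for this sketch).
rng = np.random.default_rng(0)
fathers = rng.normal(69, 2.5, size=200)
heights = pd.DataFrame({
    'father': fathers,
    'child': fathers + rng.normal(0, 2, size=200),
})

mcar = make_mcar_sketch(heights, 'child', pct=0.25, seed=1)
mar = make_mar_sketch(heights, 'child', 'father', pct=0.25, seed=1)
print(mcar['child'].isna().mean(), mar['child'].isna().mean())
```

Both sketches return a modified copy and leave the input untouched, so the complete data stays available for comparison.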