# Lecture 12 – Identifying Missingness Mechanisms

## DSC 80, Winter 2023

### Announcements

• Lab 4 is due tonight at 11:59PM.
• Project 2 is due on Thursday, February 9th at 11:59PM.
• The Grade Report on Gradescope now contains scores and slip days through Lab 3.
• Remember to run `conda activate dsc80` before working on assignments!

### Agenda

• Review and discussion: missingness mechanisms.
• Deciding between MCAR and MAR using permutation tests.
• A new test statistic for permutation tests: the Kolmogorov-Smirnov statistic.

## Missingness mechanisms

### Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)

The missingness must not depend on other columns or the values themselves. 😄

### Discussion Question

In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

• A table for a medical study has columns for 'gender' and 'age'. 'age' has missing values.
• Measurements from the Hubble Space Telescope are dropped during transmission.
• A table has a single column, 'self-reported education level', which contains missing values.
• A table of grades contains three columns, 'Version 1', 'Version 2', and 'Version 3'. $\frac{2}{3}$ of the entries in the table are NaN.

### Why do we care again?

• If a dataset contains missing values, it is likely not an accurate picture of the data generating process (DGP).
• By identifying missingness mechanisms, we can best fill in missing values, to gain a better understanding of the DGP.

## Formal definitions

We won't spend much time on these in lecture, but you may find them helpful.

### Formal definition: MCAR

Suppose we have:

• A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
• A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

### Formal definition: MAR

Suppose we have:

• A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
• A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.

### Formal definition: NMAR

Suppose we have:

• A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
• A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
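
To make the three definitions concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the columns `'age'` and `'income'`, the probabilities, and the helper `summarize` are hypothetical, and the data is synthetic. Because we generate the data ourselves, we can peek at the "true" incomes behind each missingness mask, which is never possible with real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)
n = 10_000

# Synthetic data (not from the lecture): 'age' is always observed, 'income' may go missing.
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(60_000, 15_000, size=n),
})

# MCAR: every income has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of being missing depends only on the observed column 'age'.
mar_mask = rng.random(n) < np.where(df['age'] < 40, 0.35, 0.05)

# NMAR: the chance of being missing depends on the income value itself.
nmar_mask = rng.random(n) < np.where(df['income'] > 70_000, 0.4, 0.05)

def summarize(mask):
    """Mean age and mean true income, split by whether income would be missing."""
    status = pd.Series(np.where(mask, 'missing', 'present'), name='income_status')
    return df.groupby(status)[['age', 'income']].mean()

for name, mask in [('MCAR', mcar_mask), ('MAR', mar_mask), ('NMAR', nmar_mask)]:
    print(name)
    print(summarize(mask).round(1), end='\n\n')
```

Under the MCAR mask, both groups should look alike. Under the MAR mask, the `'missing'` group should be noticeably younger. Under the NMAR mask, the `'missing'` group should have higher true incomes, which is exactly the information we wouldn't have in practice.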

## Assessing missingness through data

### Assessing missingness through data

• Suppose I believe that the missingness mechanism of a column is NMAR, MAR, or MCAR.
• I've ruled out missing by design (a good first step).
• Can I check whether this is true, by looking at the data?

### Assessing NMAR

• We can't determine if data is NMAR just by looking at the data, as whether or not data is NMAR depends on the unobserved data.
• To establish if data is NMAR, we must:
  • reason about the data generating process, or
  • collect more data.
• Example: Consider a survey dataset of students' self-reported happiness. The data contains PIDs and happiness scores, and nothing else. Some happiness scores are missing. Are happiness scores likely NMAR?

### Assessing MAR

• Data are MAR if the missingness only depends on observed data.
• After reasoning about the data generating process, if you establish that data is not NMAR, then it must be either MAR or MCAR.
• The more columns we have in our dataset, the "weaker the NMAR effect" is.
  • Adding more columns -> controlling for more variables -> moving from NMAR to MAR.
• Example: With no other columns, income in a census is NMAR. But once we look at location, education, and occupation, incomes are closer to being MAR.

### Deciding between MCAR and MAR

• For data to be MCAR, the chance that values are missing should not depend on any other column or the values themselves.
• Example: Consider a dataset of phones, in which we store the screen size and price of each phone. Some prices are missing.

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 14 | 6.06 | 999 |
| Galaxy Z Fold 4 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 13 Pro Max | 6.68 | NaN |

• If prices are MCAR, then the distribution of screen size should be the same for:
  • phones whose prices are missing, and
  • phones whose prices aren't missing.
• We can use a permutation test to decide between MAR and MCAR! We are asking the question, did these two samples come from the same underlying distribution?
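
As a first look at the two groups, here is a small sketch that splits the four phones from the table above by whether `'Price'` is missing and compares their mean screen sizes. The column name `'Price Status'` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# The four rows from the table above.
phones = pd.DataFrame({
    'Phone': ['iPhone 14', 'Galaxy Z Fold 4', 'OnePlus 9 Pro', 'iPhone 13 Pro Max'],
    'Screen Size': [6.06, 7.6, 6.7, 6.68],
    'Price': [999, np.nan, 799, np.nan],
})

# Label each phone by whether its price is missing (hypothetical helper column).
phones['Price Status'] = np.where(phones['Price'].isna(), 'missing', 'present')

# Compare the screen sizes of the two groups.
group_means = phones.groupby('Price Status')['Screen Size'].mean()
print(group_means)
```

The mean screen size is about 7.14 for phones with missing prices and about 6.38 for phones with known prices. With only four rows, that difference tells us very little on its own; a permutation test on a larger dataset is what lets us judge whether a gap like this could plausibly be due to chance.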

### Deciding between MCAR and MAR

Suppose you have a DataFrame with columns named $\text{col}_1$, $\text{col}_2$, ..., $\text{col}_k$, and want to test whether values in $\text{col}_X$ are MCAR. To test whether $\text{col}_X$'s missingness is independent of all other columns in the DataFrame:

For $i = 1, 2, ..., k$, where $i \neq X$:

• Look at the distribution of $\text{col}_i$ when $\text{col}_X$ is missing.
• Look at the distribution of $\text{col}_i$ when $\text{col}_X$ is not missing.
• Check if these two distributions are the same. (What do we mean by "the same"?)
  • If so, then $\text{col}_X$'s missingness doesn't depend on $\text{col}_i$.
  • If not, then $\text{col}_X$ is MAR dependent on $\text{col}_i$.

If all pairs of distributions are the same, then $\text{col}_X$ is MCAR.
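
The procedure above can be sketched as a loop of permutation tests, one per column. This is only a sketch under stated assumptions: the function `missingness_pvalue` is hypothetical, the test statistic (absolute difference in group means) is one possible choice among many, $\text{col}_i$ is assumed to be numeric with no missing values of its own, and the demonstration data is synthetic.

```python
import numpy as np
import pandas as pd

def missingness_pvalue(df, col_x, col_i, n_repetitions=500, seed=0):
    """Permutation test: does col_i look different when col_x is missing?

    Test statistic (an assumed choice): absolute difference in the mean of
    col_i between rows where col_x is missing and rows where it is not.
    """
    rng = np.random.default_rng(seed)
    is_missing = df[col_x].isna().to_numpy()
    values = df[col_i].to_numpy()

    def stat(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffling the missingness labels simulates "missingness is unrelated to col_i".
    simulated = np.array([stat(rng.permutation(is_missing)) for _ in range(n_repetitions)])
    return (simulated >= observed).mean()

# Synthetic demonstration data (hypothetical, not from the lecture).
rng = np.random.default_rng(80)
n = 1_000
demo = pd.DataFrame({
    'col_1': rng.normal(size=n),
    'col_2': rng.normal(size=n),
    'col_X': rng.normal(size=n),
})
# Make col_X more likely to be missing when col_1 is positive (MAR dependent on col_1).
demo.loc[rng.random(n) < np.where(demo['col_1'] > 0, 0.5, 0.1), 'col_X'] = np.nan

# One test per column other than col_X.
p_values = {col: missingness_pvalue(demo, 'col_X', col)
            for col in demo.columns if col != 'col_X'}
print(p_values)
```

A small p-value for `'col_1'` is evidence that the missingness of `'col_X'` depends on `'col_1'`. A large p-value for `'col_2'` means we have no evidence of dependence on that column, which is not the same as proving independence. Comparing means is also only one way to define "the same"; two distributions can share a mean and still differ in shape.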

### Example: Heights

• Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
• The dataset does not contain any missing values – we will artificially introduce missing values such that the values are MCAR, for illustration.

Proof that there aren't currently any missing values in heights:
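
The notebook cell that ran this check is not reproduced here. As a sketch of what such a check could look like, the code below counts missing values per column and then blanks out some `'child'` heights completely at random. Note the assumptions: the DataFrame is a synthetic stand-in with made-up values (only the column names `'father'`, `'mother'`, and `'child'` come from the lecture), and the 30% missingness rate is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Galton's data: the values are made up,
# only the column names come from the lecture.
rng = np.random.default_rng(80)
heights = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=200).round(1),
    'mother': rng.normal(64, 2.3, size=200).round(1),
    'child': rng.normal(67, 3.5, size=200).round(1),
})

# Number of missing values in each column; all zeros means nothing is missing.
missing_counts = heights.isna().sum()
print(missing_counts)

# Make roughly 30% of 'child' values missing, independent of every column (MCAR).
# The 30% rate is an assumption chosen for illustration.
heights_mcar = heights.copy()
heights_mcar.loc[rng.random(len(heights)) < 0.3, 'child'] = np.nan
print(heights_mcar.isna().sum())
```

Because each row is dropped with the same probability, regardless of any height, the missingness in `heights_mcar['child']` is MCAR by construction.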

We have three numerical columns – 'father', 'mother', and 'child'. Let's visualize them simultaneously.