In [1]:

```
import pandas as pd
import numpy as np
import os
import util
import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'
```

- Discussion 4 is today at 5PM.
- Lab 4 scores and solutions (to non-discussion problems) have been released, or will be shortly.

- Project 2 is due **tomorrow at 11:59PM**.
- Lab 5 is due on **Monday, February 13th at 11:59PM**.
- Lab 5 does not have any hidden tests – but the content is all on the Midterm Exam, so make sure you thoroughly understand it.

- Look at this notebook for more examples of missingness.

- Aside: Interested in basketball 🏀? Look at this visualization.

- Speaking of the Midterm Exam...

- The Midterm Exam is **in-class, in-person on Wednesday, February 15th**.
- It will cover Lectures 1-13, Labs 1-5, and Projects 1-2.
- You can bring a single, two-sided note sheet.
- To review problems from old exams, go to practice.dsc80.com.
- We'll be adding more office hours on Tuesday 2/14, the day before the exam. Come with questions!
- Also look at the Resources tab on the course website.

- Recap: Identifying missingness mechanisms.
- Overview of imputation.
- Mean imputation.
- Probabilistic imputation.

- **Missing by design (MD)**: Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict if a value will be missing given the other columns, the data is MD.
- **Not missing at random (NMAR)**: The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR)**: The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR)**: The chance that a value is missing is **completely independent** of other columns and the actual missing value.
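To make these definitions concrete, here is a minimal sketch that simulates each mechanism on synthetic data. The `'age'` and `'income'` columns, the thresholds, and the missingness probabilities are all made up for illustration; they don't come from any dataset used in this lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (an assumption for illustration): 'income' will get missing values.
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})

# MD: missingness is fully determined by another column.
md = df['income'].mask(df['age'] < 21)

# MCAR: every value has the same 20% chance of being missing.
mcar = df['income'].mask(rng.random(n) < 0.2)

# MAR: the chance of being missing depends on 'age', not on 'income' itself.
mar = df['income'].mask(rng.random(n) < np.where(df['age'] < 30, 0.4, 0.1))

# NMAR: the chance of being missing depends on the (unseen) income value.
nmar = df['income'].mask(rng.random(n) < np.where(df['income'] > 60_000, 0.4, 0.1))

# Compare the observed mean of each version to the true mean.
print(df['income'].mean(), mcar.mean(), mar.mean(), nmar.mean())
```

In the NMAR version, larger incomes are more likely to be missing, so the observed values are pulled toward smaller incomes. Nothing in the observed data alone reveals this.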

Recall, the "missing value flowchart" says that we should:

- First, determine whether values are **missing by design (MD)**.
- Then, reason about whether values are **not missing at random (NMAR)**.
- Finally, decide whether values are **missing at random (MAR)** or **missing completely at random (MCAR)**.

To decide between MAR and MCAR, we can look at the data itself.

- If the missingness of column $X$ is explainable via the other columns in the data, then the missing data is missing at random (MAR).
- The distribution of missing values in column $X$ may look different than the distribution of observed data in column $X$ – that's fine, as long as the missingness can be explained solely by other columns in the data.

- If the missingness of column $X$ doesn't depend on any values in the observed data, it is missing completely at random (MCAR).
- MCAR is equivalent to data being MAR, without dependence on any other columns.

- To decide if the missingness in column $X$ looks MCAR, for every other column, compare:
- The distribution of the other column when $X$ is missing.
- The distribution of the other column when $X$ is not missing.

- If this pair of distributions looks similar for every other column, then the values in column $X$ *may* be MCAR.
- Caution: you can't **prove** that data are MCAR, as permutation tests don't allow you to accept the null hypothesis!
- See Lab 5, Question 4.
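The comparison described above can be sketched as a permutation test. This is a rough illustration on synthetic data, not the lecture's own implementation: the function name `missingness_perm_test`, the columns `'x'` and `'other'`, and the choice of test statistic (absolute difference in group means) are all assumptions. For other columns, a different statistic may be more appropriate.

```python
import numpy as np
import pandas as pd

def missingness_perm_test(df, col_with_nans, other_col, n_repetitions=1000, seed=0):
    """Test whether `other_col` looks the same when `col_with_nans` is
    missing vs. not missing, using the absolute difference in group means."""
    rng = np.random.default_rng(seed)
    is_missing = df[col_with_nans].isna().to_numpy()
    values = df[other_col].to_numpy()

    def stat(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffle the missingness labels to simulate the null hypothesis.
    simulated = np.array([
        stat(rng.permutation(is_missing)) for _ in range(n_repetitions)
    ])
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Synthetic example: 'x' has missing values, 'other' is fully observed.
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({'other': rng.normal(0, 1, n), 'x': rng.normal(0, 1, n)})

# Missingness of 'x' unrelated to 'other' (MCAR-like).
toy_mcar = toy.assign(x=toy['x'].mask(rng.random(n) < 0.3))
# Missingness of 'x' more likely when 'other' is positive (MAR-like).
toy_mar = toy.assign(
    x=toy['x'].mask(rng.random(n) < np.where(toy['other'] > 0, 0.5, 0.1))
)

obs_mcar, p_mcar = missingness_perm_test(toy_mcar, 'x', 'other')
obs_mar, p_mar = missingness_perm_test(toy_mar, 'x', 'other')
print(p_mcar, p_mar)
```

A small p-value is evidence that the missingness of `'x'` depends on `'other'`. A large p-value only means we fail to reject the null; it does not prove the data are MCAR.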

Today, we'll use the same `heights` dataset as we did last time.

In [2]:

```
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
    heights
    .rename(columns={'childHeight': 'child', 'childNum': 'number'})
    .drop('midparentHeight', axis=1)
)
heights.head()
```

Out[2]:

| | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |

Missingness of `'child'` heights on `'father'`'s heights (MCAR)

**Question**: Is the missingness of `'child'` heights dependent on the `'father'` column?

- To answer, we can look at two distributions:
- The distribution of `'father'` when `'child'` is missing.
- The distribution of `'father'` when `'child'` is not missing.

- If the two distributions look similar, then the missingness of `'child'` looks to be independent of `'father'`.
- To test whether two distributions look similar, we use a permutation test.

Aside: In `util.py`, there are several functions that we've created to help us with this lecture.

- `make_mcar` takes in a dataset and intentionally drops values from a column such that they are MCAR.
- `make_mar` does the same for MAR.
- You wouldn't actually do this in practice – in practice, you'll obtain a dataset with no prior knowledge of the missingness mechanism!
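The source of `util.py` isn't shown here, so the following is only a hypothetical sketch of how helpers like these could work. The names `make_mcar_sketch` and `make_mar_sketch`, the `dep_col` parameter, and the rank-based weighting are all assumptions, not the actual `util.make_mcar` or `util.make_mar`.

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(data, col, pct=0.5):
    """Hypothetical stand-in: return a copy of `data` in which a uniformly
    random `pct` fraction of the values in `col` is replaced by NaN."""
    out = data.copy()
    n_missing = int(round(pct * len(out)))
    drop_idx = np.random.choice(out.index, size=n_missing, replace=False)
    out.loc[drop_idx, col] = np.nan
    return out

def make_mar_sketch(data, col, dep_col, pct=0.5):
    """Hypothetical stand-in: drop values of `col` so that rows with larger
    `dep_col` values are more likely to be chosen."""
    out = data.copy()
    ranks = out[dep_col].rank(pct=True)           # percentile ranks in (0, 1]
    weights = (ranks / ranks.sum()).to_numpy()    # sampling probabilities
    n_missing = int(round(pct * len(out)))
    drop_idx = np.random.choice(out.index, size=n_missing, replace=False, p=weights)
    out.loc[drop_idx, col] = np.nan
    return out

# Synthetic heights (made up for illustration, not the lecture's dataset).
np.random.seed(42)
toy = pd.DataFrame({
    'father': np.random.normal(69, 2.5, size=1000),
    'child': np.random.normal(67, 3.5, size=1000),
})

toy_mcar = make_mcar_sketch(toy, 'child', pct=0.5)
toy_mar = make_mar_sketch(toy, 'child', 'father', pct=0.5)
print(toy_mcar.isna().mean())
```

Both sketches use `np.random` directly, so calling `np.random.seed` beforehand makes the dropped rows reproducible.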

In [3]:

```
# Generating MCAR data.
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mcar.isna().mean()
```

Out[3]:

    family      0.0
    father      0.0
    mother      0.0
    children    0.0
    number      0.0
    gender      0.0
    child       0.5
    dtype: float64

Missingness of `'child'` heights on `'father'`'s heights (MCAR)

In [4]:

```
heights_mcar['child_missing'] = heights_mcar['child'].isna()
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
"Father's Height by Missingness of Child Height (MCAR example)")
```