```
from dsc80_utils import *
```

### Midterm Exam 📝

**Tuesday during lecture time in PCH 120.**

- Pen and paper only. No calculators, phones, or watches allowed.
- You are allowed to bring one double-sided 8.5" x 11" sheet of handwritten notes.
    - No reference sheet given, unlike DSC 10!
- We will display clarifications and the time remaining during the exam.
- Covers Lectures 1-8 and all related assignments.
- To review problems from old exams, go to practice.dsc80.com.
- Also look at the Resources tab on the course website.

### Agenda 📆

- Review: Missingness mechanisms.
- Identifying missingness mechanisms in data.
    - How do we decide between MCAR and MAR using a permutation test?
    - The Kolmogorov-Smirnov test statistic.
- Imputation.
    - Mean imputation.
    - Probabilistic imputation.

## Review: Missingness mechanisms

### Flowchart

A good strategy is to assess missingness in the following order.

**Missing by design (MD)**

*Can I determine the missing value exactly by looking at the other columns?*🤔

**Not missing at random (NMAR)**

*Is there a good reason why the missingness depends on the values themselves?*🤔

**Missing at random (MAR)**

*Do other columns tell me anything about the likelihood that a value is missing?*🤔

**Missing completely at random (MCAR)**

*The missingness must not depend on other columns or the values themselves.*😄

### Question 🤔 (Answer at q.dsc80.com)

*Taken from the Winter 2023 DSC 80 Midterm Exam.*

The DataFrame `tv_excl` contains all of the information we have for TV shows that are only available for streaming on a single streaming service.

Given no other information other than a TV show’s `"Title"` and `"IMDb"` rating, what is the most likely missingness mechanism of the `"IMDb"` column?

A. Missing by design

B. Not missing at random

C. Missing at random

D. Missing completely at random

### Question 🤔 (Answer at q.dsc80.com)

*Taken from the Winter 2023 DSC 80 Midterm Exam.*

Now, suppose we discover that the median `"Rotten Tomatoes"` rating among TV shows with a missing `"IMDb"` rating is a 13, while the median `"Rotten Tomatoes"` rating among TV shows with a present `"IMDb"` rating is a 52.

Given this information, what is the most likely missingness mechanism of the `"IMDb"` column?

A. Missing by design

B. Not missing at random

C. Missing at random

D. Missing completely at random
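
The comparison described in this question can be sketched in pandas. This is a minimal illustration only: `tv_toy` and every value in it are made up for the example and are not the exam's `tv_excl` data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NOT the real `tv_excl` DataFrame.
tv_toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'IMDb': [8.1, np.nan, 7.4, np.nan, 6.9, np.nan],
    'Rotten Tomatoes': [80, 10, 55, 20, 60, 15],
})

# Boolean Series: True where the "IMDb" rating is missing.
imdb_missing = tv_toy['IMDb'].isna().rename('imdb_missing')

# Median "Rotten Tomatoes" rating within each missingness group.
medians = tv_toy.groupby(imdb_missing)['Rotten Tomatoes'].median()
print(medians)
```

If the two medians differ by a lot, that suggests the missingness of `"IMDb"` is related to `"Rotten Tomatoes"`.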

## Identifying missingness mechanisms in data

### Example: Heights

- Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
- The dataset does not contain any missing values – we will **artificially introduce missing values** such that the values are MCAR, for illustration.

```
heights_path = Path('data') / 'midparent.csv'
heights = pd.read_csv(heights_path).rename(columns={'childHeight': 'child'})[['father', 'mother', 'gender', 'child']]
heights.head()
```

| | father | mother | gender | child |
|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | 69.0 |
| 3 | 78.5 | 67.0 | female | 69.0 |
| 4 | 75.5 | 66.5 | male | 73.5 |

### Simulating MCAR data

- We will make `'child'` MCAR by taking a random subset of `heights` and setting the corresponding `'child'` heights to `np.nan`.
    - This is equivalent to flipping a (biased) coin for each row.
        - If heads, we delete the `'child'` height.
- **You will not do this in practice!**

```
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan # np.NaN was removed in NumPy 2.0; np.nan works in all versions.
```

```
heights_mcar.head(10)
```

| | father | mother | gender | child |
|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | NaN |
| ... | ... | ... | ... | ... |
| 7 | 75.5 | 66.5 | female | NaN |
| 8 | 75.0 | 64.0 | male | 71.0 |
| 9 | 75.0 | 64.0 | female | 68.0 |

10 rows × 4 columns

```
heights_mcar.isna().mean()
```

    father    0.0
    mother    0.0
    gender    0.0
    child     0.3
    dtype: float64
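
The coin-flip description above can be sketched directly. This is a minimal illustration under stated assumptions: `toy` is a made-up stand-in for `heights`, and the random generator, seed, and sizes are choices made for this example, not the lecture's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `heights`; all values are synthetic.
rng = np.random.default_rng(42)
n = 10_000
toy = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

# Flip a biased coin for each row: heads with probability 0.3.
heads = rng.random(n) < 0.3

# Wherever the coin landed heads, delete the child height.
toy_mcar = toy.copy()
toy_mcar.loc[heads, 'child'] = np.nan

# Proportion of missing child heights; close to 0.3, but not exactly 0.3.
prop_missing = toy_mcar['child'].isna().mean()
print(prop_missing)
```

One difference worth noting: `sample(frac=0.3)` removes exactly 30% of the rows, while independent coin flips remove about 30%. In both cases, the chance that a value is deleted does not depend on any column.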

### Verifying that child heights are MCAR in `heights_mcar`

- Each row of `heights_mcar` belongs to one of two **groups**:
    - Group 1: `'child'` is missing.
    - Group 2: `'child'` is not missing.

```
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
```

| | father | mother | gender | child | child_missing |
|---|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 | False |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | NaN | True |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | 73.5 | False |

- We need to look at the distributions of every other column – `'gender'`, `'mother'`, and `'father'` – separately for these two groups, and check to see if they are similar.

```
gender_dist = (
    heights_mcar
    .assign(child_missing=heights_mcar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
# Divide each column by its total, so that each column is a distribution summing to 1.
gender_dist = gender_dist / gender_dist.sum()
gender_dist

| gender | child_missing = False | child_missing = True |
|---|---|---|
| female | 0.49 | 0.48 |
| male | 0.51 | 0.52 |
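
As an aside, the same kind of table (the distribution of `'gender'` within each missingness group) can also be sketched with `pd.crosstab`. This is an alternative to the pivot table above, not the lecture's approach, and `toy` is a small made-up DataFrame used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `heights_mcar`.
toy = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male',
               'female', 'male', 'male', 'female'],
    'child': [73.2, np.nan, 69.0, np.nan, 65.5, 70.0, 71.0, np.nan],
})

# Rows: gender. Columns: whether 'child' is missing.
# normalize='columns' divides each column by its total,
# so each column is a distribution that sums to 1.
toy_dist = pd.crosstab(
    toy['gender'],
    toy['child'].isna().rename('child_missing'),
    normalize='columns',
)
print(toy_dist)
```

Here `normalize='columns'` plays the same role as dividing by `gender_dist.sum()`: it turns counts into proportions within each missingness group.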

```
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MCAR Example)', barmode='group')
```