Lecture 8 – Unfaithful Data, Hypothesis Testing

DSC 80, Spring 2023


Messy data

More data type ambiguities

Example: The Norway problem 🇳🇴
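To make the ambiguity concrete, here is a minimal sketch in plain Python. It assumes the "Norway problem" refers to the well-known YAML 1.1 behavior in which unquoted scalars such as `NO` are read as booleans. The function `infer_value` is a hypothetical, simplified stand-in for that kind of type inference, not the code of any real parser.

```python
# Hypothetical, simplified type inference in the spirit of YAML 1.1,
# where unquoted yes/no/on/off/true/false are read as booleans.
TRUTHY = {"yes", "true", "on"}
FALSY = {"no", "false", "off"}


def infer_value(raw):
    """Guess a type for an unquoted scalar (illustrative only)."""
    lowered = raw.lower()
    if lowered in TRUTHY:
        return True
    if lowered in FALSY:
        return False
    return raw


# Country codes for Sweden, Denmark, Norway, and Finland.
country_codes = ["SE", "DK", "NO", "FI"]
parsed = [infer_value(code) for code in country_codes]
print(parsed)  # ['SE', 'DK', False, 'FI']
```

Norway's code silently turns into `False` while its neighbors stay strings, so the column ends up with mixed types.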

Unfaithful data

Is the data "faithful" to the data generating process (DGP)?

Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

Data types

Are the data types correct? If not, are they easily fixable?
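One way to check whether a column's type is "easily fixable" is to try converting it and count what fails. The sketch below uses plain Python and made-up raw values; the helper `to_int_or_none` is hypothetical, and the strings are not taken from the actual dataset.

```python
def to_int_or_none(raw):
    """Convert a raw entry to int; return None when it is not a whole number."""
    try:
        return int(raw.strip())
    except (ValueError, AttributeError):
        # ValueError: text such as "No Age" or ""; AttributeError: raw is None.
        return None


# Made-up examples of ages stored as strings.
raw_ages = ["24", " 31", "No Age", "", "57", None]

ages = [to_int_or_none(value) for value in raw_ages]
n_unparseable = sum(age is None for age in ages)

print(ages)           # [24, 31, None, None, 57, None]
print(n_unparseable)  # 3
```

If only a small share of entries fail to convert, the type is probably fixable; the failures themselves are worth inspecting before they are dropped.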


Ages range from 0 to 220. Was a 220-year-old really pulled over?

What about all of the stops that involved people under the legal driving age?
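A simple first step is to flag implausible ages rather than delete them. The sketch below is plain Python over made-up records; the `stop_id` field and both cutoffs are assumptions chosen for illustration, not values from the lecture.

```python
MIN_PLAUSIBLE_AGE = 16   # assumption: rough minimum driving age
MAX_PLAUSIBLE_AGE = 100  # assumption: arbitrary upper cutoff

# Made-up records; 'stop_id' is a hypothetical identifier.
stops = [
    {"stop_id": 1, "subject_age": 0},
    {"stop_id": 2, "subject_age": 24},
    {"stop_id": 3, "subject_age": 220},
    {"stop_id": 4, "subject_age": 15},
    {"stop_id": 5, "subject_age": 67},
]


def is_plausible(age):
    """Return True when the age falls inside the assumed plausible range."""
    return MIN_PLAUSIBLE_AGE <= age <= MAX_PLAUSIBLE_AGE


suspicious = [stop for stop in stops if not is_plausible(stop["subject_age"])]
suspicious_ids = [stop["stop_id"] for stop in suspicious]
print(suspicious_ids)  # [1, 3, 4]
```

Flagging keeps the rows available for inspection. An age of 220 is almost certainly an entry error, while an age below the driving age may or may not be one.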

Unfaithful 'subject_age'

Human-entered data

Let's look at all unique stop causes. Notice that there are three different labels related to bicycles, which should probably be merged into a single cause.
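Merging near-duplicate categories can be sketched as follows. The labels below are hypothetical stand-ins for human-entered values, not the dataset's actual stop causes, and `normalize_cause` is an illustrative helper.

```python
from collections import Counter

# Hypothetical human-entered labels, including three bicycle variants.
raw_causes = [
    "Moving Violation",
    "Bicycle",
    "bicycle ",
    "Equipment Violation",
    "Bicycle Bicycle",
    "Moving Violation",
]


def normalize_cause(cause):
    """Collapse whitespace and case differences and merge bicycle variants."""
    cleaned = cause.strip().lower()
    if "bicycle" in cleaned:
        return "Bicycle"
    return cleaned.title()


counts_before = Counter(raw_causes)
counts_after = Counter(normalize_cause(cause) for cause in raw_causes)

print(len(counts_before))  # 5 distinct labels before cleaning
print(counts_after["Bicycle"])  # 3 stops under one merged label
```

Comparing the number of distinct labels before and after is a quick check that the mapping merged what was intended and nothing else.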

Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice? How could we address this?
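Restricting to a range and counting can be sketched without a plotting library. The ages below are made up, and treating both endpoints (15 and 85) as inclusive is an assumption.

```python
from collections import Counter

# Made-up ages, including values outside the range of interest.
ages = [0, 17, 20, 20, 25, 25, 25, 31, 40, 40, 85, 99, 220]
LOW, HIGH = 15, 85  # assumption: both endpoints are inclusive

in_range = [age for age in ages if LOW <= age <= HIGH]
counts = Counter(in_range)

# Text histogram: one '#' per stop at each age.
for age in sorted(counts):
    print(f"{age:>3} {'#' * counts[age]}")
```

With real data, a bar for every individual age (rather than wide bins) makes it easier to spot unusual spikes at particular values.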