• Messy data.
• Unfaithful data.
• Hypothesis testing.

## Messy data

### More data type ambiguities

• 1649043031 looks like a number, but is probably a date.
  • As we saw earlier, Unix timestamps count the number of seconds since January 1st, 1970 (UTC).

• "USD 1,000,000" looks like a string, but is actually a number and a unit.

• 92093 looks like a number, but is really a zip code (and isn't equal to 92,093).

• Sometimes, False appears in a column of country codes. Why might this be? 🤔
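The first three ambiguities above can each be handled in a line or two of pandas. The following is a minimal sketch, not the only way to do it: the variable names are made up for illustration, the timestamp is assumed to be in seconds (UTC), and 2139 is a made-up example of a zip code that lost its leading zero when stored as a number.

```python
import pandas as pd

# A Unix timestamp: seconds since 1970-01-01 (assumed to be UTC here).
ts = pd.to_datetime(1649043031, unit='s')
print(ts)  # a date in April 2022

# A string that holds both a unit and a number.
raw = 'USD 1,000,000'
unit, amount_str = raw.split(' ')
amount = int(amount_str.replace(',', ''))
print(unit, amount)

# Zip codes are labels, not quantities, so store them as strings.
# Storing them as integers silently drops leading zeros.
zips = pd.Series([92093, 2139]).astype(str).str.zfill(5)
print(zips.tolist())
```

Keeping zip codes as strings also prevents accidental arithmetic on them, such as taking a mean.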

## Unfaithful data

### Is the data "faithful" to the DGP?

• In other words, how well does the data represent reality?

• Does the data contain unrealistic or "incorrect" values?
  • Dates in the future for events in the past.
  • Locations that don't exist.
  • Negative counts.
  • Misspellings of names.
  • Large outliers.
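Two of these checks, future dates and negative counts, translate directly into Boolean masks. Here is a minimal sketch on a made-up DataFrame. The column names `'event_date'` and `'num_passengers'` and the collection date are assumptions chosen for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical records; none of these values come from a real dataset.
events = pd.DataFrame({
    'event_date': pd.to_datetime(['2016-03-01', '2031-07-15', '2016-11-30']),
    'num_passengers': [2, 1, -3],
})

# The date the data was collected (an assumed value for this sketch).
collected_on = pd.Timestamp('2017-01-01')

# An event recorded in the past can't have a date after collection.
future_dates = events['event_date'] > collected_on

# Counts can't be negative.
negative_counts = events['num_passengers'] < 0

# Flag suspicious rows rather than dropping them right away.
events['suspicious'] = future_dates | negative_counts
print(events)
```

Flagging instead of deleting keeps the decision about what to do with these rows separate from the act of finding them.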

### Is the data "faithful" to the DGP?

• Does the data violate obvious dependencies?
  • Age and birthday don't match.
• Was the data entered by hand?
  • Spelling errors.
  • Fields shifted.
  • Did the form require fields or provide default values?
• Are there obvious signs of data falsification (also known as "curbstoning")?
  • Repeated names.
  • Repeated use of uncommon names or fields.
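As a rough illustration of the dependency and repetition checks above, the sketch below recomputes each person's age from their birthday and compares it to the recorded age, then looks for repeated names. The DataFrame, its column names, and the reference date `as_of` are all hypothetical.

```python
import pandas as pd

# Hypothetical records, made up for this sketch.
people = pd.DataFrame({
    'name': ['Ana', 'Bo', 'Bo', 'Cy'],
    'birthday': pd.to_datetime(
        ['1990-05-20', '1985-01-10', '1985-01-10', '2000-12-31']
    ),
    'age': [26, 31, 31, 40],
})

# The date on which the ages were supposedly recorded (assumed).
as_of = pd.Timestamp('2016-12-31')

# Has each person already had their birthday in the reference year?
month = people['birthday'].dt.month
day = people['birthday'].dt.day
had_birthday = (month < as_of.month) | (
    (month == as_of.month) & (day <= as_of.day)
)

# Age implied by the birthday: the year difference, minus one if the
# birthday hasn't happened yet in the reference year.
implied_age = (
    as_of.year - people['birthday'].dt.year - (~had_birthday).astype(int)
)

# Rows where the recorded age disagrees with the birthday.
mismatch = people['age'] != implied_age

# Rows whose name appears more than once.
repeated_name = people.duplicated(subset='name', keep=False)

print(people[mismatch | repeated_name])
```

A repeated name is not proof of falsification on its own, since common names repeat naturally. It is only a prompt to look more closely.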

### Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

### Data types

Are the data types correct? If not, are they easily fixable?
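One way to answer both questions is to look at `dtypes` and then attempt a conversion to see which entries get in the way. The sketch below uses a tiny stand-in DataFrame with made-up values. Only the column name `'subject_age'` is taken from the dataset.

```python
import pandas as pd

# A tiny stand-in for the stops data; the values are made up.
stops = pd.DataFrame({
    'subject_age': ['24', '31', 'No Age', '0', '220'],
})

# 'subject_age' is stored as strings here, not as numbers.
print(stops.dtypes)

# Attempt the conversion, turning anything unparseable into NaN.
as_numbers = pd.to_numeric(stops['subject_age'], errors='coerce')

# The entries that blocked a clean conversion.
blockers = stops.loc[as_numbers.isna(), 'subject_age'].unique()
print(blockers)
```

If only a handful of distinct values block the conversion, the type is usually easy to fix.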

### Unfaithfulness

• Are there suspicious values?
• If a value is suspicious, can we trust the observation?
• For example, consider 'subject_age' – some are too high to be true, some are too low to be true.

Ages range all over the place, from 0 to 220. Was a 220-year-old really pulled over?

What about all of the stops that involved people under the legal driving age?

### Unfaithful 'subject_age'

• Ages of 'No Age' and 0 are likely explicit null values.
• What do we do about the exceptionally small and large ages?
  • Do we throw the entire row away, even if the rest of the row is well-formed?
• What about the 14- and 15-year-olds?
  • Each has more than one occurrence – these could be real entries!
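One possible answer is to keep every row and blank out only the implausible ages. The sketch below does this on made-up data. The bounds of 14 and 100 are an assumed judgment call, and both `'stop_cause'` and its value are hypothetical placeholders for the rest of the row.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the stops data; the values are made up.
stops = pd.DataFrame({
    'subject_age': ['24', 'No Age', '0', '14', '15', '220', '4'],
    'stop_cause': ['Moving Violation'] * 7,
})

# 'No Age' can't be parsed, so it becomes NaN.
ages = pd.to_numeric(stops['subject_age'], errors='coerce')

# Treat 0 as an explicit null as well.
ages = ages.replace(0, np.nan)

# Assumed plausibility bounds; 14 and 15 are kept since they may be real.
plausible = ages.between(14, 100)

# Keep the row, but replace implausible ages with NaN.
stops['subject_age_clean'] = ages.where(plausible)
print(stops)
```

This way, the well-formed parts of a row stay available for analyses that don't involve age.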

### Human-entered data

• Which fields were likely entered by a human?
• Which fields were likely generated by code?
• What was the original source?

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.
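Here is a minimal sketch of listing the unique causes and collapsing the bicycle-related ones into a single label. The column name `'stop_cause'` and every cause string below are made up for illustration, so the actual values in the dataset may differ.

```python
import pandas as pd

# A tiny stand-in for the stops data; these cause strings are hypothetical.
stops = pd.DataFrame({
    'stop_cause': [
        'Moving Violation',
        'Bicycle',
        'Bicycle Bicycle',
        'bicycle',
        'Equipment Violation',
    ],
})

# All unique stop causes.
print(stops['stop_cause'].unique())

# Find every cause that mentions bicycles, ignoring case.
is_bicycle = stops['stop_cause'].str.lower().str.contains('bicycle')

# Keep non-bicycle causes as they are; replace the rest with one label.
stops['stop_cause_clean'] = stops['stop_cause'].where(~is_bicycle, 'Bicycle')
print(stops['stop_cause_clean'].unique())
```

Writing the result to a new column leaves the original, human-entered values available for comparison.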

Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice? How could we address this?
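As a sketch of the first step, the code below restricts a made-up Series of ages to the range 15 to 85 and counts how often each age occurs. The ages are invented for illustration, so the counts say nothing about the real data.

```python
import pandas as pd

# Made-up ages standing in for the cleaned 'subject_age' column.
ages = pd.Series([14, 15, 20, 20, 25, 30, 30, 30, 47, 85, 86, 220])

# Keep only ages in the chosen range; between is inclusive on both ends.
in_range = ages[ages.between(15, 85)]

# Number of stops at each age, ordered by age.
age_counts = in_range.value_counts().sort_index()
print(age_counts)
```

Calling `age_counts.plot(kind='bar')` on the result would draw the distribution, provided matplotlib is installed.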