import pandas as pd
import numpy as np
import os
pd.options.plotting.backend = 'plotly'
1649043031 looks like a number, but is probably a date.
"USD 1,000,000" looks like a string, but is actually a number and a unit.
92093 looks like a number, but is really a zip code (and isn't equal to 92,093).
Sometimes, False
appears in a column of country codes. Why might this be?
🤔
import yaml
player = '''
name: Magnus Carlsen
age: 32
country: NO
'''
yaml.safe_load(player)
{'name': 'Magnus Carlsen', 'age': 32, 'country': False}
In other words, how well does the data represent reality?
Does the data contain unrealistic or "incorrect" values?
The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.
stops = pd.read_csv(os.path.join('data', 'vehicle_stops_2016_datasd.csv'))
stops.head()
stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |
1 | 1308172 | Moving Violation | 520 | B | M | 25 | 2016-01-01 00:10:00 | 2016-01-01 | 0:10 | N | N | N | NaN | NaN | NaN |
2 | 1308171 | Moving Violation | 110 | H | F | 31 | 2016-01-01 00:14:00 | 2016-01-01 | 0:14 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1308170 | Moving Violation | Unknown | W | F | 29 | 2016-01-01 00:16:00 | 2016-01-01 | 0:16 | N | N | N | NaN | NaN | NaN |
4 | 1308197 | Moving Violation | 230 | W | M | 52 | 2016-01-01 00:30:00 | 2016-01-01 | 0:30 | N | N | N | NaN | NaN | NaN |
stops.shape
(103051, 15)
Are the data types correct? If not, are they easily fixable?
stops.head(1)
stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |
stops.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 103051 entries, 0 to 103050 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 stop_id 103051 non-null int64 1 stop_cause 103044 non-null object 2 service_area 103051 non-null object 3 subject_race 102920 non-null object 4 subject_sex 102865 non-null object 5 subject_age 100284 non-null object 6 timestamp 102879 non-null object 7 stop_date 103051 non-null object 8 stop_time 103051 non-null object 9 sd_resident 83689 non-null object 10 arrested 84400 non-null object 11 searched 83330 non-null object 12 obtained_consent 4791 non-null object 13 contraband_found 4969 non-null object 14 property_seized 4924 non-null object dtypes: int64(1), object(14) memory usage: 11.8+ MB
'subject_age'
– some are too high to be true, some are too low to be true.stops['subject_age'].unique()
array(['28', '25', '31', '29', '52', '24', '20', '50', '23', '54', '53', '35', '38', '18', '48', '68', '45', '63', '49', '42', '27', '19', '55', '32', '47', '33', '41', '59', '60', '58', '26', '36', '40', '39', '21', '64', '30', '43', '17', '51', '34', '56', '44', '22', '69', '46', '16', '57', '37', '65', '72', '67', '66', '70', '62', '73', '74', '0', '77', nan, '89', '79', '61', '78', '99', '75', '85', '82', '71', '15', '80', '81', '93', '84', '76', '2', '4', '86', '91', '83', '88', '98', '87', 'No Age', '9', '100', '14', '95', '96', '92', '119', '1', '90', '163', '5', '114', '94', '10', '212', '220', '6', '145', '97', '120'], dtype=object)
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.describe()
count 99648.000000 mean 37.277697 std 14.456934 min 0.000000 25% 25.000000 50% 34.000000 75% 47.000000 max 220.000000 Name: subject_age, dtype: float64
Ages range all over the place, from 0 to 220. Was a 220 year old really pulled over?
stops[ages > 100]
stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25029 | 1333254 | Equipment Violation | 820 | W | F | 119 | 2016-03-19 22:20:00 | 2016-03-19 | 22:20 | N | N | N | NaN | NaN | NaN |
34179 | 1375483 | Moving Violation | 510 | W | M | 163 | 2016-04-18 16:35:00 | 2016-04-18 | 16:35 | N | N | N | NaN | NaN | NaN |
36968 | 1378369 | Moving Violation | 240 | A | F | 114 | 2016-04-28 11:44:00 | 2016-04-28 | 11:44 | NaN | NaN | NaN | NaN | NaN | NaN |
63570 | 1405478 | Moving Violation | 110 | V | M | 212 | 2016-08-04 18:10:00 | 2016-08-04 | 18:10 | Y | N | N | NaN | NaN | NaN |
70267 | 1411694 | Moving Violation | 310 | H | F | 220 | 2016-08-30 18:28:00 | 2016-08-30 | 18:28 | Y | N | N | NaN | NaN | NaN |
77038 | 1418885 | Moving Violation | 830 | B | M | 145 | 2016-09-22 20:35:00 | 2016-09-22 | 20:35 | Y | N | Y | N | N | NaN |
99449 | 1440889 | Equipment Violation | 120 | W | F | 120 | 2016-12-15 19:28:00 | 2016-12-15 | 19:28 | Y | N | N | NaN | NaN | NaN |
What about all of the stops that involved people under the legal driving age?
ages[ages < 16].value_counts()
0.0 218 15.0 27 2.0 6 14.0 5 4.0 4 1.0 2 10.0 2 9.0 1 5.0 1 6.0 1 Name: subject_age, dtype: int64
stops[ages < 16]
stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
547 | 1308628 | Moving Violation | 120 | W | M | 0 | 2016-01-03 16:20:00 | 2016-01-03 | 16:20 | NaN | NaN | NaN | NaN | NaN | NaN |
747 | 1308820 | Moving Violation | Unknown | H | F | 0 | 2016-01-04 22:13:00 | 2016-01-04 | 22:13 | N | N | N | NaN | NaN | NaN |
1686 | 1309896 | Moving Violation | 120 | W | M | 15 | 2016-01-08 19:30:00 | 2016-01-08 | 19:30 | N | N | N | NaN | NaN | NaN |
1752 | 1309975 | Moving Violation | Unknown | O | M | 0 | 2016-01-08 23:20:00 | 2016-01-08 | 23:20 | N | N | N | NaN | NaN | NaN |
2054 | 1310377 | Equipment Violation | 430 | H | F | 0 | 2016-01-10 16:42:00 | 2016-01-10 | 16:42 | Y | N | N | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100465 | 1442806 | Moving Violation | 810 | O | M | 15 | 2016-12-20 10:45:00 | 2016-12-20 | 10:45 | Y | N | N | NaN | NaN | NaN |
101495 | 1443500 | Moving Violation | 320 | W | M | 0 | 2016-12-23 22:30:00 | 2016-12-23 | 22:30 | NaN | NaN | NaN | NaN | NaN | NaN |
101784 | 1443745 | Moving Violation | Unknown | O | M | 0 | 2016-12-26 19:30:00 | 2016-12-26 | 19:30 | NaN | NaN | NaN | NaN | NaN | NaN |
101790 | 1443747 | Moving Violation | 320 | W | F | 0 | 2016-12-26 20:00:00 | 2016-12-26 | 20:00 | NaN | NaN | NaN | NaN | NaN | NaN |
102848 | 1444647 | Moving Violation | 320 | W | M | 0 | 2016-12-31 17:45:00 | 2016-12-31 | 17:45 | NaN | NaN | NaN | NaN | NaN | NaN |
267 rows × 15 columns
'subject_age'
¶'No Age'
and 0
are likely explicit null values.Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.
stops['stop_cause'].value_counts()
Moving Violation 75200 Equipment Violation 26234 Radio Call/Citizen Contact 571 Muni, County, H&S Code 319 Personal Knowledge/Informant 289 No Cause Specified on a Card 184 Suspect Info (I.S., Bulletin, Log) 181 Personal Observ/Knowledge 45 MUNI, County, H&S Code 14 Bicycle 2 Suspect Info 2 BICYCLE 1 B & P 1 Bicycle Bicycle 1 Name: stop_cause, dtype: int64
Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice? How could we address this?
ages[(ages > 15) & (ages <= 85)].plot(kind='hist')