In [1]:
from dsc80_utils import *
Announcements 📣¶
- Project 3 Checkpoint due today.
- Lab 6 due Friday.
- We're working on Midterm grades now, but it's going to take a bit of time!
- Also, keep an eye out for the Mid-Quarter Survey. If 80% of class fills it out, everyone gets +1% EC for the midterm.
Agenda 📆¶
- Text features.
- Bag of words.
- Cosine similarity.
- TF-IDF.
- Example: State of the Union addresses 🎤.
Question 🤔 (Answer at dsc80.com/q)
Code: weird_re
^\w{2,5}.\d*\/[^A-Z5]{1,}
Select all strings below that contain any match with the regular expression above.
"billy4/Za"
"billy4/za"
"DAI_s2154/pacific"
"daisy/ZZZZZ"
"bi_/_lly98"
"!@__!14/atlantic"
(This problem was from Sp22 Final question 7.2.)
Text features¶
Review: Regression and features¶
- In DSC 40A, our running example was to use regression to predict a data scientist's salary, given their GPA, years of experience, and years of education.
- After minimizing empirical risk to determine optimal parameters, $w_0^*, \dots, w_3^*$, we made predictions using:
$$\text{predicted salary} = w_0^* + w_1^* \cdot \text{GPA} + w_2^* \cdot \text{experience} + w_3^* \cdot \text{education}$$
- GPA, years of experience, and years of education are features – they represent a data scientist as a vector of numbers.
- e.g. Your feature vector may be [3.5, 1, 7].
- This approach requires features to be numeric.
Moving forward¶
Suppose we'd like to predict the sentiment of a piece of text from 1 to 10.
- 10: Very positive (happy).
- 1: Very negative (sad, angry).
Example:
Input: "DSC 80 is a pretty good class."
Output: 7.
We can frame this as a regression problem, but we can't directly use what we learned in 40A, because here our inputs are text, not numbers.
Text features¶
- Big question: How do we represent a text document as a feature vector of numbers?
- If we can do this, we can:
- use a text document as input in a regression or classification model (in a few lectures).
- quantify the similarity of two text documents (today).
Example: San Diego employee salaries¶
- Transparent California publishes the salaries of all City of San Diego employees.
- Let's look at the 2022 data.
In [2]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2022.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' Xxxx'
In [3]:
salaries.head()
Out[3]:
Employee Name | Job Title | Base Pay | Overtime Pay | ... | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|
0 | Mara Xxxx | City Attorney | 227441.53 | 0.00 | ... | 2022 | NaN | San Diego | FT |
1 | Todd Xxxx | Mayor | 227441.53 | 0.00 | ... | 2022 | NaN | San Diego | FT |
2 | Terence Xxxx | Assistant Police Chief | 227224.32 | 0.00 | ... | 2022 | NaN | San Diego | FT |
3 | Esmeralda Xxxx | Police Sergeant | 124604.40 | 162506.54 | ... | 2022 | NaN | San Diego | FT |
4 | Marcelle Xxxx | Assistant Retirement Administrator | 279868.04 | 0.00 | ... | 2022 | NaN | San Diego | FT |
5 rows × 13 columns
Aside on privacy and ethics¶
- Even though the data we downloaded is publicly available, employee names still correspond to real people.
- Be careful when dealing with PII (personably identifiable information).
- Only work with the data that is needed for your analysis.
- Even when data is public, people have a reasonable right to privacy.
- Remember to think about the impacts of your work outside of your Jupyter Notebook.
Goal: Quantifying similarity¶
- Our goal is to describe, numerically, how similar two job titles are.
- For instance, our similarity metric should tell us that
'Deputy Fire Chief'
and'Fire Battalion Chief'
are more similar than'Deputy Fire Chief'
and'City Attorney'
.
- Idea: Two job titles are similar if they contain shared words, regardless of order. So, to measure the similarity between two job titles, we could count the number of words they share in common.
- Before we do this, we need to be confident that the job titles are clean and consistent – let's explore.
Exploring job titles¶
In [4]:
jobtitles = salaries['Job Title']
jobtitles.head()
Out[4]:
0 City Attorney 1 Mayor 2 Assistant Police Chief 3 Police Sergeant 4 Assistant Retirement Administrator Name: Job Title, dtype: object
How many employees are in the dataset? How many unique job titles are there?
In [5]:
jobtitles.shape[0], jobtitles.nunique()
Out[5]:
(12831, 611)
What are the most common job titles?
In [6]:
jobtitles.value_counts().iloc[:100]
Out[6]:
Job Title Police Officer Ii 1082 Police Sergeant 311 Fire Fighter Ii 306 ... Public Works Supervisor 29 Project Assistant 29 Associate Engineer - Traffic 29 Name: count, Length: 100, dtype: int64
In [7]:
jobtitles.value_counts().iloc[:10].sort_values().plot(kind='barh')