import pandas as pd
import numpy as np
import re
pd.options.plotting.backend = 'plotly'
import util
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
beginning of line | ^ark | 'ark two' 'ark o ark' | 'dark' |
end of line | ark$ | 'dark' 'ark o ark' | 'ark two' |
zero or one | cat? | 'ca' 'cat' | 'cart' (matches 'ca' only) |
built-in character classes* | \w+ \d+ | 'billy' '231231' | 'this person' '858 people' |
character class negation | [^a-z]+ | 'KINGTRITON551' '1721$$' | 'porch' 'billy.edu' |
*Note: in Python's implementation of regex,
- \d refers to digits.
- \w refers to alphanumeric characters and underscores ([A-Za-z0-9_]).
- \s refers to whitespace.
- \b is a word boundary.

What does \d{3} \d{3}-\d{4} match?
What does \bcat\b match? Does it find a match in 'my cat is hungry'? What about 'concatenate'?

Write a regular expression that matches any string that:
- is between 5 and 10 characters long, and
- is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.

Examples include 'yoo.ee.IOU' and 'AI.I oey'.
^[aeiouyAEIOUY. ]{5,10}$
Note: inside a character class ([...]), special characters do not generally need to be escaped.
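The exercise patterns above can be checked directly in Python. This is a minimal sketch: the test strings (other than the two examples given above) and the variable names `phone` and `vowels` are made up for illustration.

```python
import re

# Three digits, a space, three digits, a hyphen, four digits.
phone = r'\d{3} \d{3}-\d{4}'
print(re.findall(phone, 'call 555 123-4567 today'))   # ['555 123-4567']

# \b only matches at the edge of a word, so the 'cat' inside 'concatenate' is skipped.
print(re.findall(r'\bcat\b', 'my cat is hungry'))     # ['cat']
print(re.findall(r'\bcat\b', 'concatenate'))          # []

# The vowel pattern from the exercise; '.' is a literal period inside [...].
vowels = r'^[aeiouyAEIOUY. ]{5,10}$'
results = {text: bool(re.search(vowels, text))
           for text in ['yoo.ee.IOU', 'AI.I oey', 'hello', 'aeio']}
print(results)
```

'hello' fails because it contains consonants, and 'aeio' fails because it is only 4 characters long.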
re in Python

The re package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.
import re
re.search takes in a string regex and a string text and returns the location and substring corresponding to the first match of regex in text.
re.search('AB*A',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
<re.Match object; span=(26, 31), match='ABBBA'>
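The Match object printed above can be inspected programmatically. A short sketch, assuming the same pattern and text; the variable names `match` and `missing` are illustrative only.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'
match = re.search('AB*A', text)

# A successful search returns a Match object holding the span and the matched text.
print(match.span())                       # (26, 31)
print(match.group())                      # ABBBA
print(text[match.start():match.end()])    # same substring, via slicing

# When nothing matches, re.search returns None, so check before using the result.
missing = re.search('XYZ', text)
print(missing is None)                    # True
```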
re.findall takes in a string regex and a string text and returns a list of all matches of regex in text. You'll use this most often.
re.findall('AB*A',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
['ABBBA', 'ABBBBBBBA']
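One detail worth keeping in mind is that re.findall scans left to right and returns non-overlapping matches. The sketch below illustrates this with a made-up string, and also shows re.finditer, which yields Match objects when positions are needed as well.

```python
import re

# The middle 'A' is consumed by the first match, so only one 'ABA' is found.
overlap = re.findall('ABA', 'ABABA')
print(overlap)   # ['ABA']

# re.finditer yields Match objects, which carry positions as well as text.
text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'
spans = [(m.group(), m.span()) for m in re.finditer('AB*A', text)]
print(spans)
```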
re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
re.sub('AB*A',
'billy',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
'here is a string for you: billy. here is another: billy'
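re.sub also accepts an optional count argument, and repl may be a function instead of a string. The following is a small sketch of both; the helper `describe` is a hypothetical name used only for this example.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'

# count=1 replaces only the first match.
first_only = re.sub('AB*A', 'billy', text, count=1)
print(first_only)

# repl may be a function that receives each Match and returns the replacement text.
def describe(match):
    return f'<{len(match.group())} chars>'

described = re.sub('AB*A', describe, text)
print(described)
```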
When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'.
re.findall('\bcat\b', 'my cat is hungry')
[]
re.findall(r'\bcat\b', 'my cat is hungry')
['cat']
# Huh?
print('\bcat\b')
cat
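The reason the non-raw pattern finds nothing is that Python itself interprets '\b' as the backspace character before the regex engine ever sees it. A brief sketch that makes the difference visible:

```python
import re

# In a regular string literal, '\b' is the single backspace character (code point 8).
print(len('\b'), ord('\b'))    # 1 8

# In a raw string the backslash is kept, so the regex engine receives \b (word boundary).
print(len(r'\b'))              # 2
print(r'\b' == '\\b')          # True

raw_matches = re.findall(r'\bcat\b', 'my cat is hungry')
plain_matches = re.findall('\bcat\b', 'my cat is hungry')
print(raw_matches, plain_matches)   # ['cat'] []
```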
Use ( and ) to define a capture group within a pattern.
re.findall(r'\w+@(\w+)\.edu',
'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['notucsd', 'ucsd']
Notice what happens if we remove the ( and )!
re.findall(r'\w+@\w+\.edu',
'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['billy@notucsd.edu', 'notbilly@ucsd.edu']
When using re.findall, all groups are treated as capturing groups.
# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
[('oo', '124')]
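One way to keep the grouping for the alternation without capturing it is Python's non-capturing group syntax, (?:...). This goes slightly beyond what is shown above, so treat it as an optional sketch rather than the approach this lecture relies on.

```python
import re

# (?:...) groups the alternation without capturing it,
# so re.findall returns only the digits group.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
print(digits_only)   # ['124']

# For comparison, the original pattern captures both groups.
both = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
print(both)          # [('oo', '124')]
```

Note that 'ee' is never part of a match here, because it is followed by 'oo1' rather than three digits.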
Web servers typically record every request made of them in the "logs".
s = '''132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''
Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)
[('24', 'Feb', '2023', '12', '26', '15')]
While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)
[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]
.* matches every possible string, but we don't use it very often.
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
- \d{2} matches exactly two consecutive digits.
- [A-Z]{1} matches any single occurrence of any uppercase letter.
- [a-z]{2} matches any 2 consecutive occurrences of lowercase letters.
- Special characters such as [ and ] need to be escaped with \. (In Python, escaping / is harmless but not required.)
s
'132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)
[('24', 'Feb', '2023', '12', '26', '15')]
A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
other_s
'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(new_exp, other_s)
[]
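When a pattern has this many groups, positional tuples can get hard to read. One option is Python's named-group syntax, (?P<name>...), sketched below. The name `named_exp` and the group names are chosen for illustration and are not part of the original example.

```python
import re

s = '132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

# Same structure as new_exp, but each group is given a (hypothetical) name.
named_exp = (
    r'\[(?P<day>\d{2})/(?P<month>[A-Z][a-z]{2})/(?P<year>\d{4})'
    r':(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) -\d{4}\]'
)

def parse_timestamp(line):
    """Return a dict of timestamp fields, or None if the line doesn't fit the format."""
    match = re.search(named_exp, line)
    return match.groupdict() if match else None

print(parse_timestamp(s))
print(parse_timestamp(other_s))   # None
```

Returning None for malformed input keeps the strictness of new_exp while making the extracted fields self-describing.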
Writing a regular expression is like writing a program.
Regular expressions are terrible at certain types of problems. Examples:
Below is a regular expression that validates email addresses in Perl. See this article for more details.
StackOverflow crashed due to regex! See this article for the details.
Suppose we'd like to predict the sentiment of a piece of text from 1 to 10.
Example:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2021.csv')
util.anonymize_names(salaries)
salaries.head()
 | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Pension Debt | Total Pay & Benefits | Year | Notes | Agency | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mara Xxxx | City Attorney | 218759.0 | 0.0 | -2560.00 | 108652.0 | 216199.0 | 427749.18 | 752600.18 | 2021 | NaN | San Diego | FT |
1 | Todd Xxxx | Mayor | 218759.0 | 0.0 | -81.00 | 95549.0 | 218678.0 | 427749.18 | 741976.18 | 2021 | NaN | San Diego | FT |
2 | Elizabeth Xxxx | Investment Officer | 259732.0 | 0.0 | -870.00 | 71438.0 | 258862.0 | 221041.09 | 551341.09 | 2021 | NaN | San Diego | FT |
3 | Terence Xxxx | Police Officer | 212837.0 | 0.0 | 39683.00 | 56569.0 | 252520.0 | 222375.06 | 531464.06 | 2021 | NaN | San Diego | FT |
4 | Andrea Xxxx | Independent Budget Analyst | 224312.0 | 0.0 | 59819.00 | 54213.0 | 284131.0 | 192126.79 | 530470.79 | 2021 | NaN | San Diego | FT |
'Deputy Fire Chief' and 'Fire Battalion Chief' are more similar than 'Deputy Fire Chief' and 'City Attorney'.
jobtitles = salaries['Job Title']
jobtitles.head()
0                 City Attorney
1                         Mayor
2            Investment Officer
3                Police Officer
4    Independent Budget Analyst
Name: Job Title, dtype: object
How many employees are in the dataset? How many unique job titles are there?
jobtitles.shape[0], jobtitles.nunique()
(12305, 588)
What are the most common job titles?
jobtitles.value_counts().iloc[:100]
Police Officer                   2123
Fire Fighter Ii                   331
Assistant Engineer - Civil        284
Grounds Maintenance Worker Ii     250
Fire Captain                      248
                                 ...
Grounds Maintenance Manager        27
Electrician                        27
Executive Assistant                26
Paralegal                          26
Librarian Iv                       25
Name: Job Title, Length: 100, dtype: int64
jobtitles.value_counts().iloc[:25].sort_values().plot(kind='barh')