Lecture 18 – Regular Expressions, Bag of Words

DSC 80, Spring 2023

Agenda

More regular expressions

Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
| --- | --- | --- | --- |
| escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
| beginning of line | ^ark | 'ark two', 'ark o ark' | 'dark' |
| end of line | ark$ | 'dark', 'ark o ark' | 'ark two' |
| zero or one | cat? | 'ca', 'cat' | 'cart' (matches 'ca' only) |
| built-in character classes* | \w+, \d+ | 'billy', '231231' | 'this person', '858 people' |
| character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu' |
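To check some of these rows in Python, here is a minimal sketch. It assumes that the anchor and escape rows are meant as "the pattern occurs somewhere in the string" (`re.search`), and that the character class rows are meant as "the whole string matches" (`re.fullmatch`); that reading is an interpretation of the table, not something it states.

```python
import re

# Rows read as "pattern occurs somewhere in the string" (re.search).
search_cases = [
    (r'ucsd\.edu', 'ucsd.edu'),   # escaped dot matches only a literal '.'
    (r'ucsd\.edu', 'ucsd!edu'),
    (r'^ark', 'ark two'),         # ^ anchors to the start
    (r'^ark', 'dark'),
    (r'ark$', 'dark'),            # $ anchors to the end
    (r'ark$', 'ark two'),
]
search_results = [bool(re.search(p, t)) for p, t in search_cases]

# Rows read as "the entire string matches" (re.fullmatch).
full_cases = [
    (r'\w+', 'billy'),
    (r'\w+', 'this person'),      # the space is not a word character
    (r'\d+', '231231'),
    (r'\d+', '858 people'),
    (r'[^a-z]+', 'KINGTRITON551'),
    (r'[^a-z]+', 'billy.edu'),    # contains lowercase letters
]
full_results = [bool(re.fullmatch(p, t)) for p, t in full_cases]

for (p, t), ok in zip(search_cases + full_cases, search_results + full_results):
    print(f'{p!r:14} {t!r:18} {ok}')
```

Each pair of rows prints `True` for the first string and `False` for the second.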

Example (built-in character classes)

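*Note: in Python's implementation of regex, \d matches any digit, \w matches any letter, digit, or underscore, and \s matches any whitespace character.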

Exercise

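Write a regular expression that matches any string that is between 5 and 10 characters long and is made up of only vowels (uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.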

Examples include 'yoo.ee.IOU' and 'AI.I oey'.


✅ Try it yourself at regex101.com before reading the answer. One answer: ^[aeiouyAEIOUY. ]{5,10}$
Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.
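As a quick check of that answer, the sketch below runs it against the two example strings plus two strings chosen here for illustration ('hello' and 'aei' are not from the exercise).

```python
import re

# The '.' inside [...] is a literal period, so it needs no backslash.
pattern = r'^[aeiouyAEIOUY. ]{5,10}$'

tests = ['yoo.ee.IOU', 'AI.I oey', 'hello', 'aei']
results = {text: bool(re.search(pattern, text)) for text in tests}

for text, ok in results.items():
    print(repr(text), ok)
```

'hello' fails because 'h' and 'l' are not in the character class, and 'aei' fails because it is shorter than 5 characters.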

Regex in Python

re in Python

The re package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

re.search takes in a string regex and a string text and returns a match object describing the location and substring of the first match of regex in text, or None if there is no match.
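A minimal sketch of `re.search`, using a made-up sentence as the text:

```python
import re

text = 'my cat is hungry, concatenate!'

# Only the first occurrence of 'cat' is reported.
match = re.search(r'cat', text)
print(match.span())    # (3, 6)
print(match.group())   # cat

# When nothing matches, re.search returns None rather than raising.
no_match = re.search(r'dog', text)
print(no_match)        # None
```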

re.findall takes in a string regex and a string text and returns a list of all matches of regex in text. You'll use this most often.
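A minimal sketch of `re.findall`, using a made-up string with placeholder phone numbers. Note that when the pattern contains parenthesized groups, `re.findall` returns the groups (as tuples) rather than the full matches.

```python
import re

text = 'Call 858-555-0100 or 619-555-0199'

# No groups: a list of every full match, in order.
digits = re.findall(r'\d+', text)
print(digits)   # ['858', '555', '0100', '619', '555', '0199']

# With groups: one tuple of groups per match.
parts = re.findall(r'(\d+)-(\d+)-(\d+)', text)
print(parts)    # [('858', '555', '0100'), ('619', '555', '0199')]
```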

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
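A minimal sketch of `re.sub`, again on a made-up sentence. The second call adds word boundaries (`\b`) so that only the standalone word is replaced.

```python
import re

text = 'my cat is hungry, concatenate!'

# Every occurrence of 'cat' is replaced, even inside other words.
everywhere = re.sub(r'cat', 'dog', text)
print(everywhere)   # my dog is hungry, condogenate!

# \b restricts the match to 'cat' as a whole word.
whole_word = re.sub(r'\bcat\b', 'dog', text)
print(whole_word)   # my dog is hungry, concatenate!
```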

Raw strings

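When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'. In a raw string, Python leaves backslashes alone instead of treating them as string escape sequences, so the regex engine sees the pattern exactly as written.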
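To see why this matters, here is a small sketch comparing the same pattern written with and without the r prefix. In a normal string, '\b' is the backspace character; in a raw string it stays as a backslash followed by 'b', which the regex engine reads as a word boundary.

```python
import re

plain = '\bcat\b'    # '\b' becomes a single backspace character
raw = r'\bcat\b'     # backslashes are kept as-is

print(len(plain), len(raw))   # 5 7

plain_matches = re.findall(plain, 'my cat')
raw_matches = re.findall(raw, 'my cat')
print(plain_matches)   # []  (there is no backspace character in the text)
print(raw_matches)     # ['cat']
```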

Capture groups

Example: Log parsing

Web servers typically record every request made to them in their logs.

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
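A minimal sketch of this kind of extraction follows. The log line and the pattern are hypothetical stand-ins written for illustration (the log line follows the common Apache-style format); they are not necessarily the s and expression used in lecture.

```python
import re

# Hypothetical log line, assumed for illustration.
s = '132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# Four capture groups: day, month, year, time.
# (.+?) is lazy, so the year stops at the first colon.
loose_exp = r'\[(.+)/(.+)/(.+?):(.+) .+\]'

day, month, year, time = re.search(loose_exp, s).groups()
print(day, month, year, time)   # 24 Feb 2023 12:26:15
```

Each pair of parentheses defines one capture group, and `.groups()` returns the captured substrings in order.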

While the regex above works, it is not very specific: it also matches incorrectly formatted log strings.

The more specific, the better!

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
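The sketch below illustrates the difference with two hypothetical patterns, loose_exp and strict_exp, standing in for exp and new_exp (whose exact definitions may differ). The malformed string is also made up.

```python
import re

good = '[24/Feb/2023:12:26:15 -0800]'
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'   # made-up malformed entry

# Loose: any characters between the separators.
loose_exp = r'\[(.+)/(.+)/(.+?):(.+) .+\]'
# Strict: 2-digit day, 3-letter month, 4-digit year, HH:MM:SS, numeric offset.
strict_exp = r'\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{2}:\d{2}:\d{2}) -\d{4}\]'

loose_good = re.findall(loose_exp, good)
strict_good = re.findall(strict_exp, good)
loose_bad = re.findall(loose_exp, bad)
strict_bad = re.findall(strict_exp, bad)

print(loose_good)    # [('24', 'Feb', '2023', '12:26:15')]
print(strict_good)   # [('24', 'Feb', '2023', '12:26:15')]
print(loose_bad)     # [('adr', 'jduy', 'wffsdffs', 'r4s4:4wsgdfd:asdf')]
print(strict_bad)    # []
```

Both patterns agree on the well-formed string, but only the strict one rejects the malformed one.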

Limitations

Limitations of regexes

Writing a regular expression is like writing a program.

Regular expressions are terrible at certain types of problems. Examples:

Below is a regular expression that validates email addresses in Perl. See this article for more details.

StackOverflow crashed due to regex! See this article for the details.

Text features

Review: Regression and features

$$\text{predicted salary} = w_0^* + w_1^* \cdot \text{GPA} + w_2^* \cdot \text{experience} + w_3^* \cdot \text{education}$$
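As a reminder of how such a model turns features into a prediction, here is a small sketch. The weights and feature values are invented purely for illustration; they are not fitted to any data.

```python
# Hypothetical optimal parameters w_0*, w_1*, w_2*, w_3* (made up).
w0, w1, w2, w3 = 30000, 5000, 2000, 4000

# Hypothetical feature values for one individual (made up).
gpa, experience, education = 3.5, 4, 2

# The prediction is the intercept plus a weighted sum of the features.
predicted_salary = w0 + w1 * gpa + w2 * experience + w3 * education
print(predicted_salary)   # 63500.0
```

The key point is that every feature must be a number, which is why text needs to be converted into numerical features first.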

Moving forward

Suppose we'd like to predict the sentiment of a piece of text on a scale from 1 to 10.

Example:

Text features

Example: San Diego employee salaries

Aside on privacy and ethics

Goal: Quantifying similarity

Exploring job titles

How many employees are in the dataset? How many unique job titles are there?

What are the most common job titles?
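A sketch of how these questions can be answered, using a short made-up list in place of the real job-title column (the titles below are hypothetical). With a pandas Series of titles, `nunique` and `value_counts` play the same roles.

```python
from collections import Counter

# Hypothetical stand-in for the job-title column: one entry per employee.
job_titles = [
    'Police Officer', 'Fire Fighter II', 'Police Officer',
    'Librarian', 'Police Officer', 'Fire Fighter II',
]

num_employees = len(job_titles)             # one row per employee
num_unique_titles = len(set(job_titles))    # distinct titles
most_common = Counter(job_titles).most_common(2)

print(num_employees)       # 6
print(num_unique_titles)   # 3
print(most_common)         # [('Police Officer', 3), ('Fire Fighter II', 2)]
```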