# Lecture 18 – Regular Expressions, Bag of Words

## DSC 80, Winter 2023

### 📣 Announcements

• Lab 7 (regular expressions and text features) is due on Monday, February 27th at 11:59PM.
• We won't cover the necessary ideas for Question 3 until Monday; see this old lecture if you'd like to finish it over the weekend.
• Nice work on Project 3! Project 4 (Language Models 🗣) will be released over the weekend.
• The checkpoint will be due on Thursday, March 2nd at 11:59PM.
• The full project will be due on Thursday, March 9th at 11:59PM.

## More regular expressions

### Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
| --- | --- | --- | --- |
| escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
| beginning of line | ^ark | 'ark two', 'ark o ark' | 'dark' |
| end of line | ark$ | 'dark', 'ark o ark' | 'ark two' |
| zero or one | cat? | 'ca', 'cat' | 'cart' (matches 'ca' only) |
| built-in character classes* | \w+, \d+ | 'billy', '231231' | 'this person', '858 people' |
| character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu' |

### Example (built-in character classes)

*Note: in Python's implementation of regex,

• \d refers to digits.
• \w refers to alphanumeric characters and the underscore ([A-Za-z0-9_] for ASCII text).
• \s refers to whitespace.
• \b is a word boundary.

• What does \d{3} \d{3}-\d{4} match?
• What does \bcat\b match? Does it find a match in 'my cat is hungry'? What about 'concatenate'?

### Exercise

Write a regular expression that matches any string that:

• is between 5 and 10 characters long, and
• is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.

Examples include 'yoo.ee.IOU' and 'AI.I oey'.

✅ Try it yourself at regex101.com before reading the answer.

One answer: ^[aeiouyAEIOUY. ]{5,10}$

Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.

## Regex in Python

### re in Python

The re module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

• re.search takes in a string regex and a string text and returns a match object holding the location and substring of the first match of regex in text, or None if there is no match.
• re.findall takes in a string regex and a string text and returns a list of all matches of regex in text. You'll use this most often.
• re.sub takes in a string regex, a string repl, and a string text, and returns a copy of text in which all matches of regex are replaced with repl.

### Raw strings

When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'. In a raw string, backslashes are passed to the regex engine unchanged, so sequences like \b are not first interpreted as Python string escapes.

### Capture groups

• Surround a regex with ( and ) to define a capture group within a pattern.
• Capture groups are useful for extracting relevant parts of a string.
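As a minimal sketch of the three re functions and of capture groups, the example below runs a few patterns on a made-up sentence. The text and the email patterns are illustrative assumptions, not taken from the lecture's own cells.

```python
import re

# Hypothetical example text (an assumption for illustration).
text = 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu'

# re.search returns a match object for the first match, or None.
first = re.search(r'\w+@\w+\.edu', text)
first_span = first.span()      # (start, end) indices of the first match
first_match = first.group()    # the matched substring

# Without a capture group, re.findall returns the full matches.
full_matches = re.findall(r'\w+@\w+\.edu', text)

# With one capture group, re.findall returns only the captured part.
domains = re.findall(r'\w+@(\w+)\.edu', text)

# With several capture groups, re.findall returns a list of tuples.
pairs = re.findall(r'(\w+)@(\w+)\.edu', text)

# re.sub returns a new string with every match replaced.
redacted = re.sub(r'\w+@\w+\.edu', '<email>', text)

print(first_span, first_match)
print(full_matches)
print(domains)
print(pairs)
print(redacted)
```

Comparing full_matches with domains shows the effect of the parentheses: the pattern matches the same substrings either way, but re.findall reports only what the group captured.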
• Notice what happens if we remove the ( and )!
• Earlier, we also saw that parentheses can be used to group parts of a regex together. When using re.findall, every group written with plain parentheses is treated as a capturing group; write (?:...) to group without capturing.

## Example: Log parsing

Web servers typically record every request made to them in logs.

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

### The more specific, the better!

• Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
• .* matches every possible string, but we don't use it very often.
• A better date extraction regex: \[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
• \d{2} matches any two consecutive digits.
• [A-Z]{1} matches any single occurrence of any uppercase letter.
• [a-z]{2} matches any 2 consecutive occurrences of lowercase letters.
• Remember, special characters such as [ and ] need to be escaped with \. Escaping / is harmless in Python, but it is only required in regex flavors that use / as a delimiter.

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.

## Limitations

### Limitations of regexes

Writing a regular expression is like writing a program.

• You need to know the syntax well.
• They can be easier to write than to read.
• They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:

• Anything involving counting (same number of instances of a and b).
• Anything involving complex structure (palindromes).
• Parsing highly complex text structure (HTML, for instance).

Below is a regular expression that validates email addresses in Perl. See this article for more details.

Stack Overflow crashed due to regex! See this article for the details.
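To make the counting limitation concrete, here is a small sketch. A pattern like a+b+ can check the shape "some a's followed by some b's", but a classical regular expression cannot require the two counts to be equal, so the sketch does the counting in plain Python. The pattern, the helper name same_count, and the test strings are all illustrative assumptions.

```python
import re

# A regex can check the shape "one or more a's, then one or more b's"...
shape = re.compile(r'a+b+')

def same_count(s):
    """Hypothetical helper: True if s is a's then b's, with equal counts."""
    # ...but the equal-count requirement is checked in ordinary Python.
    return shape.fullmatch(s) is not None and s.count('a') == s.count('b')

results = {}
for s in ['aabb', 'aaab', 'abab']:
    results[s] = (shape.fullmatch(s) is not None, same_count(s))
    print(s, results[s])
```

Here 'aaab' passes the shape check even though it has three a's and one b, which is the kind of string a counting requirement is meant to reject.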
## Text features

### Review: Regression and features

• In DSC 40A, our running example was to use regression to predict a data scientist's salary, given their GPA, years of experience, and years of education.
• After minimizing empirical risk to determine optimal parameters, $w_0^*, \dots, w_3^*$, we made predictions using:

$$\text{predicted salary} = w_0^* + w_1^* \cdot \text{GPA} + w_2^* \cdot \text{experience} + w_3^* \cdot \text{education}$$
• GPA, years of experience, and years of education are features – they represent a data scientist as a vector of numbers.
• e.g. Your feature vector may be [3.5, 1, 7].
• This approach requires features to be numerical.
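As a rough sketch of the prediction rule above, the snippet below applies it to the example feature vector [3.5, 1, 7]. The parameter values in w_star are made up for illustration; real values would come from minimizing empirical risk on data.

```python
# Hypothetical optimal parameters w_0*, ..., w_3* (made up for illustration).
w_star = [20000.0, 10000.0, 5000.0, 2000.0]

# Feature vector from the text: GPA, years of experience, years of education.
features = [3.5, 1, 7]

def predict_salary(w, x):
    """Return w[0] + w[1]*x[0] + w[2]*x[1] + w[3]*x[2]."""
    return w[0] + sum(w_i * x_i for w_i, x_i in zip(w[1:], x))

predicted = predict_salary(w_star, features)
print(predicted)  # 20000 + 35000 + 5000 + 14000 = 74000.0
```

The function only works because every feature is a number, which is why text needs to be converted into numbers before it can be used this way.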

### Moving forward

Suppose we'd like to predict the sentiment of a piece of text from 1 to 10.

• 10: Very positive (happy).
• 1: Very negative (sad, angry).

Example:

• Input: "DSC 80 is a pretty good class."
• Output: 7.
• We can frame this as a regression problem, but we can't directly use what we learned in 40A, because here our inputs are text, not numbers.

### Text features

• Big question: How do we represent a text document as a feature vector of numbers?
• If we can do this, we can:
• use a text document as input in a regression or classification model (in a few lectures).
• quantify the similarity of two text documents (today).

### Example: San Diego employee salaries

• Transparent California publishes the salaries of all City of San Diego employees.
• The latest available data is from 2021.

### Aside on privacy and ethics

• Even though the data we downloaded is publicly available, employee names still correspond to real people.
• Be careful when dealing with PII (personally identifiable information).
• Only work with the data that is needed for your analysis.
• Even when data is public, people have a reasonable right to privacy.
• For instance, our similarity metric should tell us that 'Deputy Fire Chief' and 'Fire Battalion Chief' are more similar than 'Deputy Fire Chief' and 'City Attorney'.
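One way to sketch such a similarity metric is to represent each job title as a bag of words (word counts) and compare the counts with cosine similarity. Cosine similarity is an assumption here, chosen as one common option; the helper names bag_of_words and cosine_similarity are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(title):
    """Count how many times each lowercase word appears in a title."""
    return Counter(title.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two non-empty word-count Counters."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

deputy = bag_of_words('Deputy Fire Chief')
battalion = bag_of_words('Fire Battalion Chief')
attorney = bag_of_words('City Attorney')

sim_fire = cosine_similarity(deputy, battalion)
sim_attorney = cosine_similarity(deputy, attorney)
print(sim_fire, sim_attorney)
```

The two fire-related titles share the words 'fire' and 'chief', so their similarity is positive, while 'Deputy Fire Chief' and 'City Attorney' share no words and get a similarity of 0.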