In [4]:
from dsc80_utils import *

Lecture 11 – Regular Expressions¶

DSC 80, Winter 2026¶

Agenda 📆¶

  • Most of today's lecture will be about regular expressions. Good resources:
    • regex101.com, a helpful site to have open while writing regular expressions.
    • Python re library documentation and how-to.
      • The "how-to" is great, read it!
    • regex "cheat sheet" (taken from here).
    • These are all on the resources tab of the course website as well.

Motivation¶

In [5]:
contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

Who called? 📞¶

  • Goal: Extract all phone numbers from a piece of text, assuming they are of the form '(###) ###-####'.
In [6]:
print(contact)
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.

  • We can do this using the same string methods we've come to know and love.

  • Strategy:

    • Split by spaces.
    • Check if there are any consecutive "words" where:
      • the first "word" looks like an area code, like '(678)'.
      • the second "word" looks like the last 7 digits of a phone number, like '999-8212'.

Let's first write a function that takes in a string and returns whether it looks like an area code.

In [7]:
def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return (s.startswith('(') and 
            s.endswith(')') and 
            s[1:4].isnumeric() and 
            len(s) == 5)
In [8]:
is_possibly_area_code('(123)')
Out[8]:
True
In [9]:
is_possibly_area_code('(99)')
Out[9]:
False

Let's also write a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.

In [10]:
def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    parts = s.split('-')
    if len(parts) == 2:
        return (parts[0].isnumeric() and 
                parts[1].isnumeric() and 
                len(parts[0])==3 and 
                len(parts[1])==4)
    return False
In [11]:
is_last_7_phone_number('999-8217')
Out[11]:
True
In [12]:
is_last_7_phone_number('534 1100')
Out[12]:
False

Finally, let's split the entire text by spaces into "words", and check whether there are any consecutive words that look like an area code followed by the last 7 digits of a phone number.

In [13]:
# Removes punctuation from the end of each word.
words = [s.rstrip('.,?;"\'') for s in contact.split()]

for i in range(len(words) - 1):
    if is_possibly_area_code(words[i]):
        if is_last_7_phone_number(words[i+1]):
            print(words[i], words[i+1])
(800) 867-5309
(800) 123-4567

Is there a better way?¶

  • This was an example of pattern matching.
  • It can usually be done with string methods, but there is often a better approach: regular expressions.
In [14]:
print(contact)
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.

In [15]:
import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', contact)
Out[15]:
['(800) 867-5309', '(800) 123-4567']

🤯

Basic regular expressions¶

Regular expressions¶

  • A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.
    • For example, \(\d{3}\) \d{3}-\d{4} describes a pattern that matches US phone numbers of the form '(XXX) XXX-XXXX'.
    • Think of regex as a "mini-language" (formally: they are a grammar for describing a language).
  • Pros: They are very powerful and are widely used (virtually every programming language has a module for working with them).
  • Cons: They can be hard to read and have many different "dialects."
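To see how the phone number pattern above breaks down piece by piece, here is a minimal sketch that rewrites it with the re.VERBOSE flag, which allows whitespace and comments inside a pattern. The names phone_pattern and sample_text are made up for this illustration and are not part of the lecture code.

```python
import re

# Hypothetical names, used only for this illustration.
phone_pattern = re.compile(r"""
    \(\d{3}\)   # area code: a literal (, exactly 3 digits, a literal )
    [ ]         # one space, written as a class because VERBOSE ignores bare spaces
    \d{3}       # exactly 3 digits
    -           # a literal hyphen
    \d{4}       # exactly 4 digits
""", re.VERBOSE)

sample_text = 'Call (800) 867-5309 or (800) 123-4567; do not call (100).'
matches = phone_pattern.findall(sample_text)
print(matches)  # ['(800) 867-5309', '(800) 123-4567']
```

The verbose form matches the same strings as the one-line pattern; it is only easier to read.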

Writing regular expressions¶

  • You will ultimately write most of your regular expressions in Python, using the re module. We will see how to do so shortly.

  • However, a useful tool for designing regular expressions is regex101.com.

  • We will use it heavily during lecture; you should have it open as we work through examples. If you're trying to revisit this lecture in the future, you'll likely want to watch the podcast.

Literals¶

  • A literal is a character that has no special meaning.

  • Letters, numbers, and some symbols are all literals.

  • Some symbols, like ., *, (, and ), are special characters.

  • Example: The regex hey matches the string 'hey'. The regex he. also matches the string 'hey'.
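The difference between a literal and a special character can be checked directly in Python. The sketch below uses re.fullmatch, which succeeds only when the entire string matches the pattern; the variable names are made up for illustration.

```python
import re

# Literals match themselves; the special character . matches any one character.
hey_literal = re.fullmatch(r'hey', 'hey') is not None    # h, e, y match themselves
hey_wildcard = re.fullmatch(r'he.', 'hey') is not None   # . stands in for 'y'
hex_wildcard = re.fullmatch(r'he.', 'hex') is not None   # . stands in for 'x' too
hex_literal = re.fullmatch(r'hey', 'hex') is not None    # the literal y is not 'x'

print(hey_literal, hey_wildcard, hex_wildcard, hex_literal)  # True True True False
```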

Regex building blocks 🧱¶

The four main building blocks for all regexes are shown below (table source, inspiration).

| operation | order of op. | example | matches ✅ | does not match ❌ |
| --- | --- | --- | --- | --- |
| concatenation | 3 | AABAAB | 'AABAAB' | every other string |
| or | 4 | AA\|BAAB | 'AA', 'BAAB' | every other string |
| closure (zero or more) | 2 | AB*A | 'AA', 'ABBBBBBA' | 'AB', 'ABABA' |
| parentheses | 1 | A(A\|B)AAB | 'AAAAB', 'ABAAB' | every other string |
| | | (AB)*A | 'A', 'ABABABABA' | 'AA', 'ABBA' |

Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.
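As a quick sanity check, the four building blocks can be tried out with re.fullmatch. This is only a sketch: the matching examples come from the table, while some of the non-matching strings (such as 'AAB' and 'ACAAB') are made-up stand-ins for "every other string".

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r'AABAAB',    ['AABAAB'],          ['AABAA', 'AAB']),
    (r'AA|BAAB',   ['AA', 'BAAB'],      ['AAB', 'AABAAB']),
    (r'AB*A',      ['AA', 'ABBBBBBA'],  ['AB', 'ABABA']),
    (r'A(A|B)AAB', ['AAAAB', 'ABAAB'],  ['AAB', 'ACAAB']),
    (r'(AB)*A',    ['A', 'ABABABABA'],  ['AA', 'ABBA']),
]

all_ok = True
for pattern, should_match, should_not_match in cases:
    for s in should_match:
        all_ok = all_ok and re.fullmatch(pattern, s) is not None
    for s in should_not_match:
        all_ok = all_ok and re.fullmatch(pattern, s) is None

print(all_ok)  # True
```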

Example (or, parentheses):

  • What does DSC 30|80 match?
  • What does DSC (30|80) match?

Example (closure, parentheses):

  • What does blah* match?
  • What does (blah)* match?
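One way to explore these examples is to test a few candidate strings against each pattern. The sketch below assumes we care about whole-string matches, so it uses re.fullmatch; the helper fully_matches and the candidate strings are made up for illustration.

```python
import re

def fully_matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

courses = ['DSC 30', 'DSC 80', '80']
# Without parentheses, | splits the whole pattern into 'DSC 30' or '80'.
without_parens = [s for s in courses if fully_matches(r'DSC 30|80', s)]
# With parentheses, | only chooses between '30' and '80' after 'DSC '.
with_parens = [s for s in courses if fully_matches(r'DSC (30|80)', s)]

blahs = ['bla', 'blahhh', 'blahblah']
# Without parentheses, * applies only to the final 'h'.
star_on_h = [s for s in blahs if fully_matches(r'blah*', s)]
# With parentheses, * applies to the whole group 'blah'.
star_on_group = [s for s in blahs if fully_matches(r'(blah)*', s)]

print(without_parens, with_parens)  # ['DSC 30', '80'] ['DSC 30', 'DSC 80']
print(star_on_h, star_on_group)     # ['bla', 'blahhh'] ['blahblah']
```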

Question 🤔

    

Write a regular expression that matches 'food', 'fooood', 'fooooood', etc. with a positive even number of 'o's.

  • First step: any number of 'o's.
  • Second step: an even number of 'o's.
  • Third step: a positive even number of 'o's.
In [ ]:
 

Question 🤔

Write a regular expression that matches 'sorry', 'sorrrry', 'soggy', 'soggggy', etc. with a positive even number of 'r's or 'g's in the middle.

In [ ]:
 

Intermediate regex¶

More regex syntax¶

| operation | example | matches ✅ | does not match ❌ |
| --- | --- | --- | --- |
| wildcard | .U.U.U. | 'CUMULUS', 'JUGULUM' | 'SUCCUBUS', 'TUMULTUOUS' |
| character class | [A-Za-z][a-z]* | 'word', 'Capitalized' | 'camelCase', '4illegal' |
| at least one | bi(ll)+y | 'billy', 'billlllly' | 'biy', 'bily' |
| between $i$ and $j$ occurrences (inclusive) | m[aeiou]{1,2}m | 'mem', 'maam', 'miem' | 'mm', 'mooom', 'meme' |

., [, ], +, {, and } are also special characters, in addition to |, (, ), and *.

Example (character classes, at least one): [A-E]+ is just shorthand for (A|B|C|D|E)(A|B|C|D|E)*.
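The claim that the two patterns are interchangeable can be spot-checked by brute force. The sketch below compares them on every string of length 0 to 3 over a small made-up alphabet that includes one letter outside A-E; it is a sanity check, not a proof.

```python
import re
from itertools import product

short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'

# All strings of length 0 through 3 over the letters A, B, E, and F.
alphabet = 'ABEF'
samples = [''.join(p) for n in range(4) for p in product(alphabet, repeat=n)]

# The two patterns agree if they accept and reject exactly the same samples.
agree = all(
    (re.fullmatch(short_form, s) is None) == (re.fullmatch(long_form, s) is None)
    for s in samples
)
print(len(samples), agree)  # 85 True
```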

Example (wildcard):

  • What does . match?
  • What does he. match?
  • What does ... match?

Example (at least one, closure):

  • What does 123+ match?
  • What does 123* match?

Example (number of occurrences): What does tri{3, 5} match?

Example (character classes, number of occurrences): What does [1-6a-f]{3}-[7-9E-S]{2} match?

Question 🤔

Write a regular expression that matches any lowercase string with a doubled vowel, such as 'noon', 'peel', 'festoon', or 'zeebraa'.

In [ ]:
 

Question 🤔

Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include 'dsc80', '3 Tickets', and 'Gr8'.

In [ ]:
 

Even more regex syntax¶

| operation | example | matches ✅ | does not match ❌ |
| --- | --- | --- | --- |
| escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
| beginning of line | ^ark | 'ark two' | 'dark' |
| end of line | ark$ | 'dark' | 'ark two' |
| word boundary | k\b | 'dark' | 'darker' |
| | \bk | 'kid' | 'dark' |
| zero or one | cat? | 'ca', 'cat' | 'catt' |
| built-in character classes | \w+ | 'person' | 'this person' |
| | \d+ | '231231' | '231 people' |
| character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu' |

Note: in Python's implementation of regex,

  • \d refers to digits.
  • \w refers to alphanumeric characters and the underscore ([A-Za-z0-9_] for ASCII text). Whenever we say "alphanumeric" in an assignment, we're referring to \w!
  • \s refers to whitespace.
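A short sketch of the three built-in classes in action, using re.findall on a made-up string (the variable names are for illustration only):

```python
import re

# The \t in this (non-raw) string is a real tab character.
text = 'dsc_80 meets at 11am\tin room 121'

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of letters, digits, and underscores
spaces = re.findall(r'\s', text)    # individual whitespace characters

print(digits)  # ['80', '11', '121']
print(words)   # ['dsc_80', 'meets', 'at', '11am', 'in', 'room', '121']
print(spaces)  # [' ', ' ', ' ', '\t', ' ', ' ']
```

Note that the underscore in 'dsc_80' counts as a \w character, and the tab counts as \s.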

Example (escaping):

  • What does he. match?
  • What does he\. match?
  • What does (858) match?
  • What does \(858\) match?

Example (anchors):

  • What does 858-534 match?
  • What does ^858-534 match?
  • What does 858-534$ match?
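To experiment with the anchors, one option is to filter a few made-up strings with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. The list numbers and the other names below are hypothetical.

```python
import re

numbers = ['858-534-2230', 'call 858-534-2230', '619-858-534']

# No anchor: the pattern may appear anywhere in the string.
unanchored = [s for s in numbers if re.search(r'858-534', s)]
# ^ requires the match to begin at the start of the string.
at_start = [s for s in numbers if re.search(r'^858-534', s)]
# $ requires the match to finish at the end of the string.
at_end = [s for s in numbers if re.search(r'858-534$', s)]

print(unanchored)  # all three strings
print(at_start)    # ['858-534-2230']
print(at_end)      # ['619-858-534']
```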

Example (built-in character classes):

  • What does \d{3} \d{3}-\d{4} match?
  • What does \bcat\b match? Does it find a match in 'my cat is hungry'? What about 'concatenate' or 'kitty cat'?

Question 🤔

Write a regular expression that matches any string that:

  • is between 5 and 10 characters long, and
  • is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.

Examples include 'yoo.ee.IOU' and 'AI.I oey'.

In [ ]:
 

Regex in Python¶

re in Python¶

The re package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [16]:
import re 

re.search takes in a string regex and a string text and returns a Match object describing the location and substring of the first match of regex in text, or None if there is no match.

In [17]:
re.search('AB*A', 
          'here is a string for you: ABBBA. here is another: ABBBBBBBA')
Out[17]:
<re.Match object; span=(26, 31), match='ABBBA'>

re.findall takes in a string regex and a string text and returns a list of all non-overlapping matches of regex in text. You'll use this most often.

In [18]:
re.findall('AB*A', 
           'here is a string for you: ABBBA. here is another: ABBBBBBBA')
Out[18]:
['ABBBA', 'ABBBBBBBA']
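The two functions return different kinds of objects, which affects how you use the result. The sketch below pulls the pieces out of the Match object returned by re.search and contrasts them with the plain list returned by re.findall; the variable names are made up for illustration.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'

m = re.search(r'AB*A', text)
first = m.group()                    # the matched substring
location = m.span()                  # (start, end) indices into text
sliced = text[m.start():m.end()]     # slicing with those indices gives the same substring

every = re.findall(r'AB*A', text)    # a list of strings, with no location information

missing = re.search(r'ZZZ', text)    # no match, so the result is None

print(first, location, sliced)  # ABBBA (26, 31) ABBBA
print(every)                    # ['ABBBA', 'ABBBBBBBA']
print(missing)                  # None
```

Because re.search can return None, it is worth checking the result before calling .group() on it.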

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.

In [19]:
re.sub('AB*A', 
       'found', 
       'here is a string for you: ABBBA. here is another: ABBBBBBBA')
Out[19]:
'here is a string for you: found. here is another: found'

Raw strings¶

When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'.

In [20]:
re.findall('\bcat\b', 'my cat is hungry')
Out[20]:
[]
In [21]:
re.findall(r'\bcat\b', 'my cat is hungry cat concatenate')
Out[21]:
['cat', 'cat']
In [22]:
# Huh?
print('cat\b')
ca
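The reason the first attempt failed is that, in an ordinary Python string, '\b' is a single backspace character, so the regex engine never sees a word-boundary marker. A small sketch of the difference (variable names are made up for illustration):

```python
import re

plain = '\b'      # one character: the backspace control character
raw = r'\b'       # two characters: a backslash followed by 'b'
escaped = '\\b'   # the same two characters, written without a raw string

print(len(plain), len(raw), raw == escaped)  # 1 2 True

# Doubling each backslash works, but the raw string is easier to read.
doubled = re.findall('\\bcat\\b', 'my cat is hungry')
with_raw = re.findall(r'\bcat\b', 'my cat is hungry')
print(doubled, with_raw)  # ['cat'] ['cat']
```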

Capture groups¶

  • Surround a regex with ( and ) to define a capture group within a pattern.
  • Capture groups are useful for extracting relevant parts of a string.
In [23]:
re.findall(r'\w+@(\w+)\.edu', 
           'My old email was noah@sdccd.edu. My new email is noah@ucsd.edu.')
Out[23]:
['sdccd', 'ucsd']
  • Notice what happens if we remove the ( and )!
In [24]:
re.findall(r'\w+@\w+\.edu', 
           'My old email was noah@sdccd.edu. My new email is noah@ucsd.edu.')
Out[24]:
['noah@sdccd.edu', 'noah@ucsd.edu']
  • Earlier, we also saw that parentheses can be used to group parts of a regex together. When using re.findall, every plain parenthesized group (...) is treated as a capturing group, even if it was only meant for grouping.
In [25]:
# A regex that matches strings with a doubled vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'ahaaa124')
Out[25]:
[('aa', '124')]
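If only the digits are wanted, one option that Python's re module supports is a non-capturing group, written (?:...). It groups just like ordinary parentheses but is left out of the results. This syntax has not been introduced in the lecture so far, so treat the sketch below as an optional aside.

```python
import re

# (?:...) groups the alternatives without capturing them,
# so re.findall returns only the one remaining capture group.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'ahaaa124')
print(digits_only)  # ['124']
```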

Example: Log parsing¶

Web servers typically record every request made to them in log files ("logs").

In [26]:
s = '''132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, hours, minutes, and seconds from the log string s.

In [27]:
exp = r'\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)
Out[27]:
[('24', 'Feb', '2023', '12', '26', '15')]

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

In [28]:
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)
Out[28]:
[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]

The more specific, the better!¶

  • Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
    • .* matches every possible string, but we don't use it very often.
  • A better date extraction regex:
\[(\d{2})/([A-Z]{1}[a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
    • \d{2} matches any 2-digit number.
    • [A-Z]{1} matches any single occurrence of any uppercase letter.
    • [a-z]{2} matches any 2 consecutive occurrences of lowercase letters.
    • Remember, special characters like [ and ] need to be escaped with \, since they have another meaning in regular expressions (defining character classes).
In [29]:
s
Out[29]:
'132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
In [30]:
new_exp = r'\[(\d{2})/([A-Z]{1}[a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)
Out[30]:
[('24', 'Feb', '2023', '12', '26', '15')]

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.

In [31]:
other_s
Out[31]:
'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
In [32]:
re.findall(new_exp, other_s)
Out[32]:
[]
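Because the specific pattern either matches in full or not at all, it can be wrapped in a small helper that returns labeled fields for well-formed lines and None otherwise. This is a sketch only: the function parse_timestamp and the list field_names are hypothetical and not part of the lecture code.

```python
import re

pattern = r'\[(\d{2})/([A-Z]{1}[a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
field_names = ['day', 'month', 'year', 'hour', 'minute', 'second']

def parse_timestamp(line):
    """Return a dict of timestamp fields, or None if the line is malformed."""
    m = re.search(pattern, line)
    if m is None:
        return None
    # m.groups() holds the six captured strings, in order.
    return dict(zip(field_names, m.groups()))

good = '132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

print(parse_timestamp(good))
print(parse_timestamp(bad))  # None
```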

Question 🤔

^\w{2,5}.\d*/[^A-Z5]{1,}

Select all strings below that contain any match with the regular expression above.

  • "billy4/Ha"
  • "billy4/ha"
  • "DAI_s2154/pacific"
  • "daisy/ZZZZZ"
  • "bi_/_lly98"
  • "!@__!14/atlantic"

Limitations of regular expressions¶

Writing a regular expression is like writing a program.

  • You need to know the syntax well.
  • They can be easier to write than to read.
  • They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:

  • Anything involving counting (same number of instances of a and b).
  • Anything involving complex structure (palindromes).
  • Parsing highly complex text structure (HTML, for instance).
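For the first two kinds of problems, ordinary Python is often simpler than any regex. The sketch below shows one way to write each check with plain string operations; the function names are made up for illustration.

```python
def same_number_of_a_and_b(s):
    """True if s contains as many 'a' characters as 'b' characters."""
    return s.count('a') == s.count('b')

def is_palindrome(s):
    """True if s reads the same forwards and backwards."""
    return s == s[::-1]

print(same_number_of_a_and_b('aabb'), same_number_of_a_and_b('aab'))  # True False
print(is_palindrome('racecar'), is_palindrome('regex'))               # True False
```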

Summary, next time¶

Summary¶

  • Regular expressions are used to match and extract patterns from text.
  • You don't need to force yourself to "memorize" regex syntax. Instead, refer to this notebook and the resources linked in the Agenda section and on the Resources tab of the course website.
  • Remember, you don't always have to use regular expressions! If Python or pandas string methods work for your task, no need to overcomplicate things.
    • pandas .str methods can use regular expressions; just set regex=True.
  • Play Regex Golf to practice! 🏌️

Next time¶

  • Text features: Bag of words, TF-IDF