Lecture 16 – More Parsing Examples

DSC 80, Spring 2023


Parsing HTML using Beautiful Soup

BeautifulSoup objects

Finding elements in a tree

The most common methods you'll use to find tags in a soup object are:

Using find_all

find_all returns a list of all matches.
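As a minimal sketch of the difference between finding one tag and finding all of them, the snippet below parses a small, made-up HTML string (not taken from any real page). It assumes the `bs4` package is installed and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html><body>
  <div id="content">
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None if nothing matches).
first_p = soup.find("p")

# find_all returns a list-like collection of every matching tag.
all_ps = soup.find_all("p")

# Because the result behaves like a list, we can loop over it.
p_texts = [p.text for p in all_ps]

print(first_p.text)
print(p_texts)
```

Since find_all always returns a collection, an empty result (no matches) is an empty list rather than an error.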

Node attributes

The get method must be called directly on the node that contains the attribute you're looking for.
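To illustrate why the node matters, here is a small sketch on a made-up snippet (the URL is a placeholder). Calling get on the parent `<div>` does not reach into its children, so the attribute has to be read from the `<a>` tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the href attribute belongs to the <a>, not the <div>.
html = '<div class="card"><a href="https://example.com/profile">Profile</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
link = soup.find("a")

# The <div> has no href attribute of its own, so get returns None.
on_div = div.get("href")

# The <a> node is the one that carries the attribute.
on_link = link.get("href")

print(on_div)
print(on_link)
```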

Example: Scraping the HDSI Faculty page


Let's try to extract a list of HDSI Faculty from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

It seems like the relevant <div>s for faculty are the ones where the data-entry-type attribute is equal to 'individual'. Let's find all of those.
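One way to filter on an attribute like this is sketched below. The HTML is a simplified, made-up stand-in for the faculty page (the names and the 'organization' value are placeholders). Because `data-entry-type` contains hyphens, it cannot be passed as a Python keyword argument, so the sketch uses the `attrs` dictionary instead.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the faculty page's markup.
html = """
<div data-entry-type="individual"><a href="#">Ada Lovelace</a></div>
<div data-entry-type="organization"><a href="#">Some Lab</a></div>
<div data-entry-type="individual"><a href="#">Alan Turing</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <div>s whose data-entry-type attribute equals 'individual'.
divs = soup.find_all("div", attrs={"data-entry-type": "individual"})

print(len(divs))
```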

Within here, we need to extract each faculty member's name. It seems like names are stored as text within the <a> tag.

We can also extract job titles:

Let's create a DataFrame consisting of names and job titles for each faculty member.
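The column-by-column approach might look like the sketch below. The markup is invented for illustration: in particular, storing job titles in a `<div class="title">` is an assumption, and the real page may be structured differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the 'title' class is an assumption for this sketch.
html = """
<div data-entry-type="individual">
  <a href="#">Ada Lovelace</a>
  <div class="title">Professor</div>
</div>
<div data-entry-type="individual">
  <a href="#">Alan Turing</a>
  <div class="title">Associate Professor</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", attrs={"data-entry-type": "individual"})

# Build each column separately, one list per column.
names = [div.find("a").text for div in divs]
titles = [div.find("div", class_="title").text for div in divs]

faculty = pd.DataFrame({"name": names, "title": titles})
print(faculty)
```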

Now we have a DataFrame!

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an <img> tag.
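A sketch of collecting image URLs across several entries follows. The markup and URLs are placeholders, and the guard for a missing `<img>` is an added precaution rather than something the real page is known to need.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second entry deliberately has no <img>.
html = """
<div data-entry-type="individual">
  <img src="https://example.com/photos/ada.jpg" alt="Ada Lovelace">
  <a href="#">Ada Lovelace</a>
</div>
<div data-entry-type="individual">
  <a href="#">Alan Turing</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", attrs={"data-entry-type": "individual"})

photo_urls = []
for div in divs:
    img = div.find("img")
    # find returns None when there is no <img>, so check before calling get.
    photo_urls.append(img.get("src") if img is not None else None)

print(photo_urls)
```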

Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.

Specifically, let's try to make a DataFrame that looks like the one below:

  | quote | author | author_url | tags
0 | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | change,deep-thoughts,thinking,world
1 | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices
2 | “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | inspirational,life,live,miracle,miracles

The plan

Eventually, we will create a single function – quote_df – which takes in an integer n and returns a DataFrame with the quotes on the first n pages of https://quotes.toscrape.com/.

To do this, we will define several helper functions:

Key principle: some of our helper functions will make requests, and others will parse, but none will do both!

Downloading a single page

In quote_df, we will call download_page repeatedly – once for i = 1, once for i = 2, and so on, up to i = n. For now, we will work with just page 5 (chosen arbitrarily).
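A request-only helper could be sketched as follows. This is not necessarily the lecture's implementation: it uses the standard library's `urllib.request` so that it runs without extra installs, it returns raw HTML text rather than a parsed object, and `page_url` is a hypothetical helper added here so that the URL scheme (`/page/<i>/`) is easy to check on its own.

```python
from urllib.request import urlopen

BASE_URL = "https://quotes.toscrape.com"


def page_url(i):
    """Build the URL for page i (hypothetical helper; assumes /page/<i>/)."""
    return f"{BASE_URL}/page/{i}/"


def download_page(i):
    """Request page i and return its HTML as text. No parsing happens here."""
    with urlopen(page_url(i), timeout=10) as response:
        return response.read().decode("utf-8")


print(page_url(5))
```

Calling `download_page(5)` would then make exactly one request and hand the resulting text to the parsing helpers, in keeping with the principle that no helper both requests and parses.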

Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

From this <div>, we can extract the quote, author name, author's URL, and tags.

Let's implement our next function, process_quote, which takes in a <div> corresponding to a single quote and returns a Series containing the quote's information.

Note that this approach is different from the approach taken in the HDSI Faculty page example – there, we created each column of our final DataFrame separately, while here we are creating one row of our final DataFrame at a time.

Our last helper function will take in a list of <div>s, call process_quote on each <div> in the list, and return a DataFrame.
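The two parsing helpers might be sketched as below. The HTML is a simplified stand-in for a quotes page, and the class names (`quote`, `text`, `author`, `keywords`) are assumptions about the page's markup. The name `process_page` is also a placeholder, since the notes do not name the last helper.

```python
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"

# Simplified, hypothetical stand-in for the markup of two quotes.
html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small>
    <a href="/author/Sample-Author">(about)</a></span>
  <div class="tags"><meta class="keywords" content="sample,example"></div>
</div>
<div class="quote">
  <span class="text">Another sample quote.</span>
  <span>by <small class="author">Other Author</small>
    <a href="/author/Other-Author">(about)</a></span>
  <div class="tags"><meta class="keywords" content="other"></div>
</div>
"""


def process_quote(div):
    """Turn the <div> for a single quote into one row (a Series)."""
    quote = div.find("span", class_="text").text
    author = div.find("small", class_="author").text
    # The first <a> in the <div> is assumed to link to the author's page.
    author_url = BASE_URL + div.find("a").get("href")
    tags = div.find("meta", class_="keywords").get("content")
    return pd.Series(
        {"quote": quote, "author": author, "author_url": author_url, "tags": tags}
    )


def process_page(divs):
    """Call process_quote on each <div> and stack the rows into a DataFrame."""
    return pd.DataFrame([process_quote(div) for div in divs])


soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="quote")
quotes = process_page(divs)
print(quotes)
```

Neither function makes a request: both only parse objects they are handed, which keeps them easy to test on saved or hand-written HTML.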

Putting it all together

The elements in the 'tags' column are all strings, but they look like lists. This is not ideal, as we will see shortly.

Key takeaways

Nested vs. flat data formats

Aside: JSON Crack

The site jsoncrack.com allows you to upload a JSON file and visualizes it. Let's try it with data/family.json!

Example: Scraping quotes, again

Note that for a single quote, we have keys for 'auth_url', 'quote_auth', 'quote_text', 'bio', 'dob', and 'tags'.

Since each line is a separate JSON object, let's read in each line one at a time.
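Reading one JSON object per line can be sketched with the standard library alone. The two records below are invented, use only a few of the keys mentioned above, and live in an in-memory `io.StringIO` that stands in for the real file.

```python
import io
import json

# In-memory stand-in for a file with one JSON object per line (made-up data).
raw = io.StringIO(
    '{"quote_auth": "Author A", "quote_text": "First sample quote.", "tags": ["sample", "first"]}\n'
    '{"quote_auth": "Author B", "quote_text": "Second sample quote.", "tags": ["sample"]}\n'
)

# Parse each non-empty line separately; json.loads handles one object at a time.
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))
print(records[0]["tags"])
```

Parsing the whole file with a single `json.loads` call would fail here, because the file as a whole is not one valid JSON document.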

Let's convert the result to a DataFrame.

What data type is the 'tags' column?

Let's save df to a CSV and read it back in.

What data type is the 'tags' column now?
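The round trip can be reproduced on a tiny made-up DataFrame, with an in-memory buffer standing in for the CSV file. The sketch checks the Python type of a single cell before and after. Recovering the lists with `ast.literal_eval` is one possible fix, not necessarily the one used in lecture.

```python
import ast
import io

import pandas as pd

# Made-up data: each cell in 'tags' is a genuine Python list.
df = pd.DataFrame(
    {
        "quote_text": ["First sample quote.", "Second sample quote."],
        "tags": [["sample", "first"], ["sample"]],
    }
)
before = type(df.loc[0, "tags"])

# Write to an in-memory CSV and read it back in.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_csv = pd.read_csv(buffer)
after = type(df_csv.loc[0, "tags"])

print(before, after)

# CSV stores only text, so each list comes back as a string that looks like a list.
# ast.literal_eval safely converts that string back into a list.
df_csv["tags"] = df_csv["tags"].apply(ast.literal_eval)
print(df_csv.loc[0, "tags"])
```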

One-hot encoding

Let's write a function that takes in the list of tags (taglist) for a given quote and returns the one-hot-encoded sequence of 1s and 0s for that quote.
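One possible shape for such a function is sketched below on made-up tag lists. The names `all_tags` and `one_hot` are placeholders, and the sketch assumes the 'tags' column holds real lists rather than strings.

```python
import pandas as pd

# Made-up data: each row's 'tags' value is a list of strings.
df = pd.DataFrame({"tags": [["inspiration", "life"], ["humor"], ["life", "humor"]]})

# Collect every distinct tag, in a fixed (sorted) order.
all_tags = sorted({tag for taglist in df["tags"] for tag in taglist})


def one_hot(taglist):
    """Return a Series with 1 for each tag present in taglist and 0 otherwise."""
    return pd.Series({tag: int(tag in taglist) for tag in all_tags})


# Applying a function that returns a Series produces one column per tag.
tag_df = df["tags"].apply(one_hot)
print(tag_df)
```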

Let's combine this one-hot-encoded DataFrame with df.

If we want all quotes tagged 'inspiration', we can simply query:
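As a sketch of the combine-then-query step, the example below uses two small made-up DataFrames that share the same row index. `tag_df` stands in for the one-hot-encoded DataFrame, and its contents are invented.

```python
import pandas as pd

# Made-up quotes and a matching one-hot-encoded tag table (same row order).
df = pd.DataFrame({"quote_text": ["Quote A", "Quote B", "Quote C"]})
tag_df = pd.DataFrame(
    {"humor": [0, 1, 1], "inspiration": [1, 0, 0], "life": [1, 0, 1]}
)

# Place the tag columns next to the original columns, aligning on the index.
full = pd.concat([df, tag_df], axis=1)

# Every quote tagged 'inspiration' has a 1 in that column.
inspiring = full[full["inspiration"] == 1]
print(inspiring)
```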

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?

Summary, next time


Next time

All about regular expressions!