# Lecture 19 – Bag of Words, TF-IDF¶

## DSC 80, Winter 2023¶

### 📣 Announcements¶

• Lab 7 (regular expressions and text features) is due tonight at 11:59PM.
• Project 4 (Language Models 🗣) is released.
  • The checkpoint will be due on Thursday, March 2nd at 11:59PM.
  • The full project will be due on Thursday, March 9th at 11:59PM.
• We've updated the Grade Report to account for Labs 1-6, Projects (and checkpoints) 1-2, and the Midterm Exam. Check it out!
• We will be opening 1-on-1 tutoring slots to the entire class – see Ed for more details.

### Agenda¶

• Bag of words 💰.
• Cosine similarity.
• TF-IDF.
• Example: State of the Union addresses 🎤.

## Bag of words 💰¶

### Example: San Diego employee salaries¶

Recall, we're working with a (real) dataset of salary data for all San Diego city employees.

Our goal is to quantify how similar two job titles are; so far, our metric has been the number of shared words between the two titles.

### A counts matrix¶

Let's create a "counts" matrix, such that:

• there is 1 row per job title,
• there is 1 column per unique word that is used in job titles, and
• the value in row title and column word is the number of occurrences of word in title.
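
A minimal sketch of how such a counts matrix could be built with pandas. The four job titles below are hypothetical stand-ins for the real dataset, and the name `split_titles` is introduced only for this illustration.

```python
import pandas as pd

# Hypothetical job titles standing in for the real salary dataset.
titles = pd.Series([
    'deputy fire chief',
    'fire battalion chief',
    'police officer',
    'fire fighter',
])

# Split each title into a list of words.
split_titles = titles.str.split()

# Every unique word used in any title, sorted for a stable column order.
unique_words = sorted(split_titles.explode().unique())

# One column per unique word; each entry counts that word in one title.
counts_df = pd.DataFrame({
    word: split_titles.apply(lambda words: words.count(word))
    for word in unique_words
})
counts_df.index = titles.to_numpy()  # label each row with its job title
```

Here `counts_df.loc['deputy fire chief']` is the row vector for that title, with a 1 under `'deputy'`, `'fire'`, and `'chief'` and a 0 everywhere else.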

### Question: What job titles are most similar to 'deputy fire chief'?¶

• Remember, our idea was to count the number of shared words between two job titles.
• We now have access to counts_df, which contains a row vector for each job title.
• How can we use it to count the number of shared words between two job titles, i.e. the similarity of two job titles?

To start, let's compare the row vectors for 'deputy fire chief' and 'fire battalion chief'.

We can stack these two vectors horizontally.

One way to measure how similar the above two vectors are is through their dot product.

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
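
To see why, here is a small sketch with a four-word vocabulary chosen just for this illustration (`vocab` and `to_vector` are hypothetical names). A term in the sum is 1 only when both vectors have a 1 in that position, i.e. when the word is in both titles.

```python
import numpy as np

# Assumed vocabulary (column order) for this small illustration.
vocab = ['battalion', 'chief', 'deputy', 'fire']

def to_vector(title):
    """Return the word-count vector of `title` over `vocab`."""
    words = title.split()
    return np.array([words.count(w) for w in vocab])

dfc = to_vector('deputy fire chief')     # [0, 1, 1, 1]
fbc = to_vector('fire battalion chief')  # [1, 1, 0, 1]

# 0*1 + 1*1 + 1*0 + 1*1 = 2: the shared words are 'chief' and 'fire'.
shared = np.dot(dfc, fbc)
```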

### Aside: Dot product¶

• Recall, if $\vec{a} = \begin{bmatrix} a_1 & a_2 & ... & a_n \end{bmatrix}^T$ and $\vec{b} = \begin{bmatrix} b_1 & b_2 & ... & b_n \end{bmatrix}^T$ are two vectors, then their dot product $\vec{a} \cdot \vec{b}$ is defined as:
$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$
• The dot product also has a geometric interpretation. If $|\vec{a}|$ and $|\vec{b}|$ are the $L_2$ norms (lengths) of $\vec{a}$ and $\vec{b}$, and $\theta$ is the angle between $\vec{a}$ and $\vec{b}$, then:
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$
• $\cos \theta$ is equal to its maximum value (1) when $\theta = 0$, i.e. when $\vec{a}$ and $\vec{b}$ point in the same direction.
• 🚨 Key idea: The more similar two unit vectors are, the larger their dot product is!

### Computing similarities¶

To find the job title that is most similar to 'deputy fire chief', we can compute the dot product of the 'deputy fire chief' word vector with all other titles' word vectors, and find the title with the highest dot product.

To do so, we can apply np.dot to each row that doesn't correspond to 'deputy fire chief'.
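
One way this could look, sketched on a hypothetical four-title counts matrix (the real `counts_df` is much larger, and the names `target`, `others`, `dots`, and `ranked` are assumptions made for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature counts matrix (rows: titles, columns: words).
counts_df = pd.DataFrame(
    [[0, 1, 1, 0, 1, 0, 0],
     [1, 1, 0, 0, 1, 0, 0],
     [0, 0, 0, 1, 1, 0, 0],
     [0, 0, 0, 0, 0, 1, 1]],
    index=['deputy fire chief', 'fire battalion chief',
           'fire fighter', 'police officer'],
    columns=['battalion', 'chief', 'deputy', 'fighter',
             'fire', 'officer', 'police'],
)

target = counts_df.loc['deputy fire chief']
others = counts_df.drop(index='deputy fire chief')

# Dot product of every other title's row vector with the target vector.
dots = others.apply(
    lambda row: np.dot(row.to_numpy(), target.to_numpy()), axis=1
)

# Titles sorted from most to fewest shared words.
ranked = dots.sort_values(ascending=False)
```

The same numbers can be computed in one step with `others.dot(target)`, which performs all of the row-wise dot products as a single matrix-vector product.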

The unique job titles that are most similar to 'deputy fire chief' are given below.

Note that they all share two words in common with 'deputy fire chief'.

Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors.

### Bag of words¶

• The bag of words model represents texts (e.g. job titles, sentences, documents) as vectors of word counts.
• The "counts" matrices we have worked with so far were created using the bag of words model.
• The bag of words model defines a vector space in $\mathbb{R}^{\text{number of unique words}}$.
• It is called "bag of words" because it doesn't consider order.

### Aside: Interactive bag of words demo¶

Check this site out – it automatically generates a bag of words matrix for you!


## Cosine similarity¶

### Cosine similarity and bag of words¶

To measure the similarity between two word vectors, we compute their normalized dot product, also known as their cosine similarity.

$$\cos \theta = \boxed{\frac{\vec{a} \cdot \vec{b}}{|\vec{a}| | \vec{b}|}}$$

If $\cos \theta$ is large, the two word vectors are similar. It is important to normalize by the lengths of the vectors; otherwise, texts with more words will have artificially high similarities with other texts.

Note: Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:

$$\text{dist}(\vec{a}, \vec{b}) = 1 - \cos \theta$$

If $\text{dist}(\vec{a}, \vec{b})$ is small, the two word vectors are similar.
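
A short sketch of both quantities using NumPy. The function names and the three toy vectors are assumptions made for this example. Note how the raw dot product rewards the longer vector, while the cosine similarity does not.

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized dot product of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    """Complement of the cosine similarity."""
    return 1 - cosine_similarity(a, b)

short = [1, 1, 0]
longer = [3, 3, 0]   # same direction as `short`, but three times as long
other = [0, 0, 1]    # shares no words with `short`

raw_short = np.dot(short, short)    # 2
raw_longer = np.dot(short, longer)  # 6, larger only because `longer` is longer
```

Here `cosine_similarity(short, longer)` is 1 (up to floating point error) because the vectors point in the same direction, and `cosine_similarity(short, other)` is 0 because they share no words.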

### A recipe for computing similarities¶

Given a set of documents, to find the most similar text to one document $d$ in particular:

• Use the bag of words model to create a counts matrix, in which:
  • there is 1 row per document,
  • there is 1 column per unique word that is used across documents, and
  • the value in row doc and column word is the number of occurrences of word in doc.
• Compute the cosine similarity between $d$'s row vector and all other documents' row vectors.
• The other document with the greatest cosine similarity is the most similar, under the bag of words model.
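
The recipe above can be sketched in plain Python. As a simplification, this version stores each row of the counts matrix as a `collections.Counter` (a sparse word-count vector) instead of building the full matrix. The function names and example documents are hypothetical.

```python
from collections import Counter
import math

def cosine_similarity(c1, c2):
    """Cosine similarity of two word-count Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)  # missing words count as 0
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

def most_similar(docs, i):
    """Index of the document most similar to docs[i], excluding i itself."""
    bags = [Counter(doc.split()) for doc in docs]
    sims = {j: cosine_similarity(bags[i], bags[j])
            for j in range(len(docs)) if j != i}
    return max(sims, key=sims.get)

docs = ['deputy fire chief', 'fire battalion chief',
        'police officer', 'fire fighter']
closest = most_similar(docs, 0)  # index of the closest title to docs[0]
```

If several documents tie for the greatest similarity, `max` returns the first one it encounters, so a real analysis may want to report all of the ties.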

### Example: Global warming 🌎¶

Consider the following three documents.

Let's represent each document using the bag of words model.

Let's now find the cosine similarity between each document.
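
A sketch of one way to compute every pairwise cosine similarity at once. The three sentences are the ones quoted in the discussion that follows; their order, and the names `counts`, `unit`, and `cos_sim`, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

sentences = pd.Series([
    'I really really want global peace',
    'I must enjoy global warming',
    'We must solve climate change',
])

# Bag of words: one row per sentence, one column per unique word.
split = sentences.str.split()
words = sorted(split.explode().unique())
counts = pd.DataFrame(
    {w: split.apply(lambda ws: ws.count(w)) for w in words}
)

# Scale each row to unit length. The product of the scaled matrix with
# its transpose then holds the cosine similarity of every pair of rows.
row_norms = np.sqrt((counts ** 2).sum(axis=1))
unit = counts.div(row_norms, axis=0)
cos_sim = unit @ unit.T
```

The entry `cos_sim.loc[i, j]` is the cosine similarity between sentences `i` and `j`, and the diagonal entries are all 1 because every sentence is identical to itself.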

Issue: Bag of words only encodes the words that each document uses, not their meanings.

• "I really really want global peace" and "We must solve climate change" have similar meanings, but have no shared words, and thus a low cosine similarity.
• "I really really want global peace" and "I must enjoy global warming" have very different meanings, but a relatively high cosine similarity.

### Pitfalls of the bag of words model¶

Remember, the key assumption underlying the bag of words model is that two documents are similar if they share many words in common.

• The bag of words model doesn't consider order.
  • The job titles 'deputy fire chief' and 'chief fire deputy' are treated as the same.
• The bag of words model doesn't consider the meaning of words.
  • 'I love data science' and 'I hate data science' share 75% of their words, but have very different meanings.
• The bag of words model treats all words as being equally important.
  • 'deputy' and 'fire' have the same importance, even though 'fire' is probably more important in describing someone's job title.
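
The first two pitfalls can be checked directly. This is a small sketch using `collections.Counter` as the bag; the helper name `bag` is hypothetical.

```python
from collections import Counter

def bag(text):
    """Bag of words for `text`: word counts, with order discarded."""
    return Counter(text.split())

# Order is ignored, so these two titles have identical bags.
same_bag = bag('deputy fire chief') == bag('chief fire deputy')

# Meaning is ignored: the bags overlap in 3 of their 4 words.
love = bag('I love data science')
hate = bag('I hate data science')
shared = sum((love & hate).values())            # size of the overlap
shared_fraction = shared / sum(love.values())   # 3 / 4
```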

## TF-IDF¶

### The importance of words¶

Issue: The bag of words model doesn't know which words are "important" in a document. Consider the following document:

"my brother has a friend named billy who has an uncle named billy"

How do we determine which words are important?

• The most common words ("the", "has") often don't have much meaning!
• The very rare words are also less important!

Goal: Find a way of quantifying the importance of a word in a document by balancing the above two factors, i.e. find the word that best summarizes a document.

### Term frequency¶

• The term frequency of a word (term) $t$ in a document $d$, denoted $\text{tf}(t, d)$, is the proportion of words in document $d$ that are equal to $t$.
$$\text{tf}(t, d) = \frac{\text{# of occurrences of t in d}}{\text{total # of words in d}}$$
• Example: What is the term frequency of "billy" in the following document?
  • "my brother has a friend named billy who has an uncle named billy"
• Answer: $\frac{2}{13}$.
• Intuition: Words that occur often within a document are important to the document's meaning.
  • If $\text{tf}(t, d)$ is large, then word $t$ occurs often in $d$.
  • If $\text{tf}(t, d)$ is small, then word $t$ does not occur often in $d$.
• Issue: "has" also has a TF of $\frac{2}{13}$, but it seems less important than "billy".
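
A minimal sketch of the term frequency calculation, assuming words are separated by whitespace (the function name `tf` mirrors the notation above but is otherwise an assumption):

```python
def tf(term, doc):
    """Proportion of the words in `doc` that are equal to `term`."""
    words = doc.split()
    return words.count(term) / len(words)

doc = 'my brother has a friend named billy who has an uncle named billy'

tf_billy = tf('billy', doc)  # 2 occurrences out of 13 words
tf_has = tf('has', doc)      # also 2 out of 13
```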

### Inverse document frequency¶

• The inverse document frequency of a word $t$ in a set of documents $d_1, d_2, ...$ is
$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which t appears}} \right)$$
• Example: What is the inverse document frequency of "billy" in the following three documents?
  • "my brother has a friend named billy who has an uncle named billy"
  • "my favorite artist is named jilly boel"
  • "why does he talk about someone named billy so often"
• Answer: $\log \left(\frac{3}{2}\right) \approx 0.4055$.
• Intuition: If a word appears in every document (like "the" or "has"), it is probably not a good summary of any one document.
  • If $\text{idf}(t)$ is large, then $t$ is rarely found in documents.
  • If $\text{idf}(t)$ is small, then $t$ is commonly found in documents.
• Think of $\text{idf}(t)$ as the "rarity factor" of $t$ across documents – the larger $\text{idf}(t)$ is, the more rare $t$ is.
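
A minimal sketch of the inverse document frequency calculation on the three documents above, using the natural logarithm and whitespace splitting (both are assumptions of this sketch, consistent with the answer of roughly 0.4055):

```python
import math

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / n_containing)

docs = [
    'my brother has a friend named billy who has an uncle named billy',
    'my favorite artist is named jilly boel',
    'why does he talk about someone named billy so often',
]

idf_billy = idf('billy', docs)  # log(3 / 2): appears in 2 of 3 documents
idf_named = idf('named', docs)  # log(3 / 3) = 0: appears in every document
```

This sketch raises a `ZeroDivisionError` for a word that appears in no document, so it should only be called on words from the corpus.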

### Intuition¶

$$\text{tf}(t, d) = \frac{\text{# of occurrences of t in d}}{\text{total # of words in d}}$$

$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which t appears}} \right)$$

Goal: Quantify how well word $t$ summarizes document $d$.

• If $\text{tf}(t, d)$ is small, then $t$ doesn't occur very often in $d$, so $t$ can't be a good summary of $d$.
• If $\text{idf}(t)$ is small, then $t$ occurs often amongst all documents, and so it is not a good summary of any one document.
• If $\text{tf}(t, d)$ and $\text{idf}(t)$ are both large, then $t$ occurs often in $d$ but rarely overall. This makes $t$ a good summary of document $d$.

### Term frequency-inverse document frequency¶

The term frequency-inverse document frequency (TF-IDF) of word $t$ in document $d$ is the product:

\begin{align*}
\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\
&= \frac{\text{# of occurrences of t in d}}{\text{total # of words in d}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which t appears}} \right)
\end{align*}

• If $\text{tfidf}(t, d)$ is large, then $t$ is a good summary of $d$, because $t$ occurs often in $d$ but rarely across all documents.
• TF-IDF is a heuristic – it is motivated by intuition rather than derived from a probabilistic model.
• To know if $\text{tfidf}(t, d)$ is large for one particular word $t$, we need to compare it to $\text{tfidf}(t_i, d)$, for several different words $t_i$.

### Computing TF-IDF¶

Question: What is the TF-IDF of "global" in the second sentence?

Question: Is this big or small? Is "global" the best summary of the second sentence?
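
A sketch that works through both questions by hand. It assumes the three documents are the sentences from the global warming example, in the order used here, with the second one being "I must enjoy global warming". The helper names are hypothetical.

```python
import math

sentences = [
    'I really really want global peace',
    'I must enjoy global warming',
    'We must solve climate change',
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    n_containing = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

second = sentences[1]

# tf = 1/5 and idf = log(3/2), so the product is roughly 0.0811.
global_score = tfidf('global', second, sentences)

# To judge whether that is big, compare against the sentence's other words.
scores = {w: tfidf(w, second, sentences) for w in second.split()}
```

In `scores`, both `'enjoy'` and `'warming'` come out higher than `'global'`, since each of them appears in only one of the three sentences.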

### TF-IDF of all words in all documents¶

On its own, the TF-IDF of a word in a document doesn't really tell us anything; we must compare it to TF-IDFs of other words in that same document.
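
One possible way to compute the TF-IDF of every word in every document at once with pandas. This is a sketch under the same assumptions as before (whitespace splitting, natural log, and the three example sentences in this order). The name `tfidf` matches the DataFrame referred to in the text, but how the lecture builds it is not shown here.

```python
import numpy as np
import pandas as pd

sentences = pd.Series([
    'I really really want global peace',
    'I must enjoy global warming',
    'We must solve climate change',
])

split = sentences.str.split()
words = split.explode().unique()

# tf: one row per sentence, one column per word.
tf = pd.DataFrame(
    {w: split.apply(lambda ws: ws.count(w) / len(ws)) for w in words}
)

# idf: one value per word, from the number of sentences containing it.
idf = np.log(len(sentences) / (tf > 0).sum(axis=0))

# Multiplying a DataFrame by a Series scales each column by its idf.
tfidf = tf * idf
```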

### Interpreting TF-IDFs¶

The above DataFrame tells us that:

• the TF-IDF of 'peace' in the first sentence is 0.183102,
• the TF-IDF of 'climate' in the second sentence is 0.

Note that there are two ways that $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$ can be 0:

• If $t$ appears in every document, because then $\text{idf}(t) = \log (\frac{\text{# documents}}{\text{# documents}}) = \log(1) = 0$.
• If $t$ does not appear in document $d$, because then $\text{tf}(t, d) = \frac{0}{\text{len}(d)} = 0$.

The word that best summarizes a document is the word with the highest TF-IDF for that document:

Look closely at the rows of tfidf – in documents 1 and 2 (the rows labeled 1 and 2, i.e. the second and third sentences), the max TF-IDF is not unique!
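
A sketch of how ties can hide behind `idxmax`, which reports only the first column that attains each row's maximum. The construction of `tfidf` repeats the assumptions used earlier (three example sentences, whitespace splitting, natural log), and `tied` is a hypothetical name.

```python
import numpy as np
import pandas as pd

sentences = pd.Series([
    'I really really want global peace',
    'I must enjoy global warming',
    'We must solve climate change',
])

split = sentences.str.split()
words = split.explode().unique()
tf = pd.DataFrame(
    {w: split.apply(lambda ws: ws.count(w) / len(ws)) for w in words}
)
tfidf = tf * np.log(len(sentences) / (tf > 0).sum(axis=0))

# Only the first word attaining the maximum in each row is returned.
best_word = tfidf.idxmax(axis=1)

# Every word that attains the maximum, per row.
row_max = tfidf.max(axis=1)
tied = {
    i: [w for w in tfidf.columns if np.isclose(tfidf.loc[i, w], row_max[i])]
    for i in tfidf.index
}
```

Row 0 has a single best word, while rows 1 and 2 each have several words tied for the maximum.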

## Example: State of the Union addresses 🎤¶

### State of the Union addresses¶

The 2023 State of the Union address was on February 7th, 2023.

### The data¶

The entire corpus (another word for "set of documents") is over 10 million characters long... let's not display it in our notebook.

Each speech is separated by '***'.

Note that each "speech" currently contains other information, like the name of the president and the date of the address.

Let's extract just the speech text.
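
The exact layout of each speech's header is not shown here, so the sketch below invents a tiny corpus purely for illustration. Only the `'***'` separator comes from the text above; the header lines, the blank line that ends the header, and the names `corpus`, `speeches`, and `extract_text` are all assumptions.

```python
# Hypothetical miniature corpus. The header layout (title, president,
# date, then a blank line) is an assumption, not the real file format.
corpus = (
    "***\n\nState of the Union Address\nPresident A\nJanuary 1, 1900\n\n"
    "First speech text.\n"
    "***\n\nState of the Union Address\nPresident B\nJanuary 1, 1901\n\n"
    "Second speech text.\n"
)

# Split on the separator and drop any empty pieces.
speeches = [s for s in corpus.split('***') if s.strip()]

def extract_text(speech):
    """Return the text after the header, assuming the header block
    ends at the first blank line."""
    header, _, body = speech.strip().partition('\n\n')
    return body.strip()

texts = [extract_text(s) for s in speeches]
```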

### Finding the most important words in each speech¶

Here, a "document" is a speech. We have 233 documents.

A rough sketch of what we'll compute:

for each word t:
    for each speech d:
        compute tfidf(t, d)
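
A runnable version of this sketch, shown on three short hypothetical "speeches" in place of the real corpus. It loops over words explicitly and lets pandas handle the inner loop over speeches. Splitting on whitespace is a simplifying assumption.

```python
import math
import pandas as pd

# Hypothetical stand-ins for the speech texts.
speeches = pd.Series([
    'the union is strong',
    'the economy is growing',
    'the war is over',
])

split = speeches.str.split()
unique_words = split.explode().unique()

tfidf_dict = {}
for word in unique_words:
    # tf: proportion of each speech's words that are equal to `word`.
    tf = split.apply(lambda ws: ws.count(word) / len(ws))
    # idf: log of (number of speeches / number of speeches with `word`).
    idf = math.log(len(speeches) / (tf > 0).sum())
    tfidf_dict[word] = tf * idf  # one TF-IDF value per speech

tfidf = pd.DataFrame(tfidf_dict)
```

Words such as `'the'` and `'is'` appear in every speech here, so their idf is $\log(1) = 0$ and their entire columns are 0.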

Note that the TF-IDFs of many common words are all 0!

### Summarizing speeches¶

By using idxmax, we can find the word with the highest TF-IDF in each speech.

What if we want to see the 5 words with the highest TF-IDFs, for each speech?
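
A sketch of both steps on a small hypothetical TF-IDF table (the values, the row labels, and the names `top_words` and `keywords` are made up for illustration). `idxmax` gives the single best word per row, and `nlargest` generalizes this to the top `k`.

```python
import pandas as pd

# Hypothetical TF-IDF values (rows: speeches, columns: words).
tfidf = pd.DataFrame(
    {'union': [0.30, 0.00, 0.05],
     'economy': [0.10, 0.40, 0.00],
     'war': [0.00, 0.20, 0.50],
     'peace': [0.20, 0.10, 0.30]},
    index=['speech A', 'speech B', 'speech C'],
)

def top_words(row, k=5):
    """Comma-separated string of the k words with the highest TF-IDF."""
    return ', '.join(row.nlargest(k).index)

# The single word with the highest TF-IDF in each speech.
best = tfidf.idxmax(axis=1)

# The top 2 words per speech; use k=5 on a realistically sized table.
keywords = tfidf.apply(top_words, axis=1, k=2)
```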

Run the cell below to see every single row of keywords_df.

### Aside: What if we remove the $\log$ from $\text{idf}(t)$?¶

Let's try it and see what happens.

### The role of $\log$ in $\text{idf}(t)$¶

\begin{align*}
\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\
&= \frac{\text{# of occurrences of t in d}}{\text{total # of words in d}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which t appears}} \right)
\end{align*}

• Remember, for any positive input $x$, $\log(x)$ is smaller than $x$, and much smaller when $x$ is large.
• In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{# documents}}{\text{# documents with } t}$.
• If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.
• If a word is very rare, the ratio will be very large. However, for instance, a word being seen in 2 out of 50 documents is not very different from being seen in 2 out of 500 documents (it is very rare in both cases), and so $\text{idf}(t)$ should be similar in both cases.
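
The 2-out-of-50 versus 2-out-of-500 comparison can be checked numerically. This is a small sketch using the natural log; the `use_log` flag is a hypothetical parameter added only to show the version without the $\log$.

```python
import math

def idf(n_docs, n_containing, use_log=True):
    """Inverse document frequency, optionally without the log."""
    ratio = n_docs / n_containing
    return math.log(ratio) if use_log else ratio

# Without the log, the two "rarity factors" differ by a factor of 10.
without_log = (idf(50, 2, use_log=False), idf(500, 2, use_log=False))

# With the log, they are log(25) and log(250), which are much closer.
with_log = (idf(50, 2), idf(500, 2))

ratio_without = without_log[1] / without_log[0]
ratio_with = with_log[1] / with_log[0]
```

With the log, the second value is less than twice the first, so the two rare words end up with similar idf values.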

## Summary, next time¶

### Summary¶

• One way to turn documents, like 'deputy fire chief', into feature vectors, is to count the number of occurrences of each word in the document, ignoring order. This is done using the bag of words model.
• To measure the similarity of two documents under the bag of words model, compute the cosine similarity of their two word vectors.
• Term frequency-inverse document frequency (TF-IDF) is a statistic that tries to quantify how important a word (term) is to a document. It balances:
  • how often a word appears in a particular document, $\text{tf}(t, d)$, with
  • how rarely a word appears across documents, $\text{idf}(t)$.
• For a given document, the word with the highest TF-IDF is thought to "best summarize" that document.

### Next time¶

Modeling and feature engineering.