# Lecture 26 – Classifier Evaluation

## DSC 80, Winter 2023

### Announcements

• Project 5 (prediction) is due on Thursday, March 23rd at 11:59PM (no slip days allowed)!
• There is no live lecture today or Friday; videos will be posted instead.
• The Final Exam is on Wednesday, March 22nd from 11:30AM-2:30PM, location TBD.
  • Lectures 1-26 (everything before this Friday) are in scope, as are all assignments.
  • You can bring 2 two-sided notes sheets. More details to come on Ed.
• practice.dsc80.com now contains 3 past finals. Start reviewing!
• If at least 80% of the class fills out BOTH CAPEs and the End-of-Quarter Survey, then everyone will receive an extra 0.5% added to their overall course grade.
  • Deadline: Saturday, March 18th at 8AM.
This lecture will not be delivered live! Instead, a pre-recorded version of this lecture can be found here.

### Agenda

• Classifier evaluation.
• Example: Tumor malignancy prediction (via logistic regression).

## Classifier evaluation

### Accuracy isn't everything!

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$
• Accuracy is defined as the proportion of predictions that are correct.
• It weighs all correct predictions the same, and weighs all incorrect predictions the same.
• But some incorrect predictions may be worse than others!
• Example: Suppose you take a COVID test 🦠. Which is worse:
  • The test saying you have COVID, when you really don't, or
  • The test saying you don't have COVID, when you really do?
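
As a minimal sketch of the definition above, accuracy can be computed by comparing predictions to true labels one at a time. The label lists here are made up for illustration; note that the one false negative and the one false positive each cost exactly the same amount of accuracy.

```python
# Hypothetical labels: 1 = has COVID, 0 = does not have COVID.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

acc = accuracy(y_true, y_pred)
print(acc)  # 0.75
```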

### The Boy Who Cried Wolf 👦😭🐺


A shepherd boy gets bored tending the town's flock. To have some fun, he cries out, "Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then get really mad when they realize the boy was playing a joke on them.

Repeat the previous paragraph many, many times.

One night, the shepherd boy sees a real wolf approaching the flock and calls out, "Wolf!" The villagers refuse to be fooled again and stay in their houses. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.

### The wolf classifier

• Predictor: Shepherd boy.
• Positive prediction: "There is a wolf."
• Negative prediction: "There is no wolf."

• What is an example of an incorrect, positive prediction?
• Was there a correct, negative prediction?
• There are four possibilities. What are the consequences of each?
  • (predict yes, predict no) × (actually yes, actually no).

### The wolf classifier

Below, we present a confusion matrix, which summarizes the four possible outcomes of the wolf classifier.

### Outcomes in binary classification

When performing binary classification, there are four possible outcomes.

(Note: A "positive prediction" is a prediction of 1, and a "negative prediction" is a prediction of 0.)

| Outcome of Prediction | Definition | True Class |
| --- | --- | --- |
| True positive (TP) ✅ | The predictor correctly predicts the positive class. | P |
| False negative (FN) ❌ | The predictor incorrectly predicts the negative class. | P |
| True negative (TN) ✅ | The predictor correctly predicts the negative class. | N |
| False positive (FP) ❌ | The predictor incorrectly predicts the positive class. | N |

⬇️

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN ✅ | FP ❌ |
| Actually Positive | FN ❌ | TP ✅ |

The confusion matrix above is organized the same way that sklearn's confusion matrices are (but differently than in the wolf example).

Note that in the four acronyms – TP, FN, TN, FP – the first letter is whether the prediction is correct, and the second letter is what the prediction is.
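
The naming rule can be sketched in code: the first letter comes from whether the prediction matches the truth, and the second from the prediction itself. The labels below are hypothetical, and the resulting matrix uses the same rows-are-actual, columns-are-predicted layout described above.

```python
from collections import Counter

def outcome(actual, predicted):
    """Label a single (actual, predicted) pair as TP, FN, TN, or FP."""
    correctness = "T" if actual == predicted else "F"   # is the prediction correct?
    prediction = "P" if predicted == 1 else "N"         # what was predicted?
    return correctness + prediction

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

counts = Counter(outcome(t, p) for t, p in zip(y_true, y_pred))

# Rows are actual classes, columns are predicted classes.
matrix = [[counts["TN"], counts["FP"]],
          [counts["FN"], counts["TP"]]]
print(matrix)  # [[4, 1], [1, 2]]
```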

### Example: COVID testing 🦠

• UCSD Health administers hundreds of COVID tests a day. The tests are not fully accurate.
• Each test comes back either
  • positive, indicating that the individual has COVID, or
  • negative, indicating that the individual does not have COVID.
• Question: What is a TP in this scenario? FP? TN? FN?
  • TP: The test predicted that the individual has COVID, and they do ✅.
  • FP: The test predicted that the individual has COVID, but they don't ❌.
  • TN: The test predicted that the individual doesn't have COVID, and they don't ✅.
  • FN: The test predicted that the individual doesn't have COVID, but they do ❌.

### Accuracy of COVID tests

The results of 100 UCSD Health COVID tests are given below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN = 90 ✅ | FP = 1 ❌ |
| Actually Positive | FN = 8 ❌ | TP = 1 ✅ |

UCSD Health test results

🤔 Question: What is the accuracy of the test?

🙋 Answer: $$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{1 + 90}{100} = 0.91$$

• Followup: At first, the test seems good. But, suppose we build a classifier that predicts that nobody has COVID. What would its accuracy be?
• Answer to followup: Also 0.91! There is severe class imbalance in the dataset, meaning that most of the data points are in the same class (no COVID). Accuracy doesn't tell the full story.
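
To check the arithmetic, here is a small sketch that computes accuracy from the four counts, first for the UCSD Health test and then for a classifier that predicts "no COVID" for everyone. The helper function `accuracy` is defined here for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# UCSD Health test results from the table above.
test_acc = accuracy(tp=1, fp=1, fn=8, tn=90)

# Predicting "no COVID" for everyone: the 9 actual positives all become
# false negatives, and the 91 actual negatives all become true negatives.
nobody_acc = accuracy(tp=0, fp=0, fn=9, tn=91)

print(test_acc, nobody_acc)  # 0.91 0.91
```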

### Recall

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN = 90 ✅ | FP = 1 ❌ |
| Actually Positive | FN = 8 | TP = 1 |

UCSD Health test results

🤔 Question: What proportion of individuals who actually have COVID did the test identify?

🙋 Answer: $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$

More generally, the recall of a binary classifier is the proportion of actually positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

To compute recall, look at the bottom (positive) row of the above confusion matrix.
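
As a quick sketch, recall can be read directly off the bottom row of the matrix, here stored as a nested list in `[[TN, FP], [FN, TP]]` order.

```python
# UCSD Health confusion matrix, laid out as [[TN, FP], [FN, TP]].
matrix = [[90, 1],
          [8, 1]]

# Recall only uses the bottom (actually positive) row.
fn, tp = matrix[1]
recall = tp / (tp + fn)
print(round(recall, 2))  # 0.11
```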

### Recall isn't everything, either!

$$\text{recall} = \frac{TP}{TP + FN}$$

🤔 Question: Can you design a "COVID test" with perfect recall?

🙋 Answer: Yes – just predict that everyone has COVID!

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN = 0 ✅ | FP = 91 ❌ |
| Actually Positive | FN = 0 | TP = 9 |

everyone-has-COVID classifier

$$\text{recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 0} = 1$$

Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!

### Precision

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN = 0 ✅ | FP = 91 |
| Actually Positive | FN = 0 ❌ | TP = 9 |

everyone-has-COVID classifier

The precision of a binary classifier is the proportion of predicted positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

To compute precision, look at the right (positive) column of the above confusion matrix.

• Tip: A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).
• Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.
• 🚨 Key idea: There is a "tradeoff" between precision and recall. Ideally, you want both to be high. For a particular prediction task, one may be more important than the other.
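
The tradeoff shows up when the two classifiers seen so far are compared side by side. The sketch below uses the counts from the two confusion matrices above; the helper functions and dictionary keys are named here for illustration.

```python
def precision(tp, fp):
    """Proportion of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actual positives that are predicted positive."""
    return tp / (tp + fn)

classifiers = {
    "UCSD Health test": {"tn": 90, "fp": 1, "fn": 8, "tp": 1},
    "everyone-has-COVID": {"tn": 0, "fp": 91, "fn": 0, "tp": 9},
}

results = {}
for name, c in classifiers.items():
    results[name] = (precision(c["tp"], c["fp"]), recall(c["tp"], c["fn"]))
    print(name, "precision:", round(results[name][0], 2),
          "recall:", round(results[name][1], 2))
```

The UCSD Health test has a precision of 0.5 but a recall of only about 0.11, while the everyone-has-COVID classifier has a recall of 1 but a precision of only 0.09.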


### Precision and recall

$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \: \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$

🤔 Question: When might high precision be more important than high recall?

🙋 Answer: For instance, in deciding whether or not someone committed a crime. Here, false positives are really bad – they mean that an innocent person is charged!

🤔 Question: When might high recall be more important than high precision?

🙋 Answer: For instance, in medical tests. Here, false negatives are really bad – they mean that someone's disease goes undetected!

### Discussion Question

Consider the confusion matrix shown below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actually Negative | TN = 22 ✅ | FP = 2 ❌ |
| Actually Positive | FN = 23 ❌ | TP = 18 ✅ |

What is the accuracy of the above classifier? The precision? The recall?

After calculating all three on your own, check your answers below.

| Metric | Value |
| --- | --- |
| Accuracy | (22 + 18) / (22 + 2 + 23 + 18) = 40 / 65 |
| Precision | 18 / (18 + 2) = 9 / 10 |
| Recall | 18 / (18 + 23) = 18 / 41 |
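
These answers can be double-checked with exact fractions. This is only a verification sketch; note that `Fraction` reduces 40 / 65 to its lowest terms, 8 / 13.

```python
from fractions import Fraction

# Counts from the discussion question's confusion matrix.
tn, fp, fn, tp = 22, 2, 23, 18

accuracy = Fraction(tp + tn, tp + fp + fn + tn)
precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)

print(accuracy, precision, recall)  # 8/13 9/10 18/41
```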

## Example: Tumor malignancy prediction (via logistic regression)

### Wisconsin breast cancer dataset

The Wisconsin breast cancer dataset (WBCD) is a commonly-used dataset for demonstrating binary classification. It is built into sklearn.datasets.

1 stands for "malignant", i.e. cancerous, and 0 stands for "benign", i.e. safe.

Our goal is to use the features in bc to predict labels.
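
A minimal sketch of how `bc` and `labels` might be constructed, assuming `bc` is a pandas DataFrame of features. One detail worth flagging: `load_breast_cancer` itself encodes 0 as malignant and 1 as benign, so the sketch flips the target to match the convention stated above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

loaded = load_breast_cancer()

# Features: one row per tumor, one column per measurement.
bc = pd.DataFrame(loaded["data"], columns=loaded["feature_names"])

# sklearn encodes 0 = malignant and 1 = benign, so flip the labels
# to get 1 = malignant and 0 = benign.
labels = 1 - loaded["target"]

print(bc.shape)       # (569, 30)
print(labels.mean())  # proportion of tumors that are malignant
```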

### Aside: Logistic regression

Logistic regression is a linear classification technique that builds upon linear regression. It models the probability of belonging to class 1, given a feature vector:

$$P(y = 1 | \vec{x}) = \sigma (\underbrace{w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}}_{\text{linear regression model}})$$

Here, $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function; its outputs are between 0 and 1 (which means they can be interpreted as probabilities).
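
The formula can be sketched directly: compute the linear part, then pass it through the sigmoid. The weights and feature values below are hypothetical, chosen only to illustrate the calculation.

```python
import math

def sigmoid(t):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-t))

# Hypothetical intercept, weights, and feature vector.
w0 = -1.0
w = [0.8, -0.5]
x = [2.0, 1.0]

# Linear regression part: w0 + w1 * x1 + w2 * x2 (about 0.1 here).
linear_output = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))

# Predicted probability of class 1.
prob = sigmoid(linear_output)
print(round(prob, 3))  # 0.525
```

Note that `sigmoid(0)` is exactly 0.5, so a predicted probability above 0.5 corresponds to a positive linear output.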

🤔 Question: Suppose our logistic regression model predicts the probability that a tumor is malignant is 0.75. What class do we predict – malignant or benign? What if the predicted probability is 0.3?

🙋 Answer: We have to pick a threshold (e.g. 0.5)!

• If the predicted probability is above the threshold, we predict malignant (1).
• Otherwise, we predict benign (0).
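
The thresholding rule above can be sketched as a one-line function. The name `predict_class` is illustrative, not part of any library.

```python
def predict_class(prob, threshold=0.5):
    """Return 1 (malignant) if prob is above the threshold, else 0 (benign)."""
    return int(prob > threshold)

print(predict_class(0.75))                 # 1
print(predict_class(0.3))                  # 0
print(predict_class(0.75, threshold=0.8))  # 0
```

The last line shows that the same predicted probability can lead to a different class once the threshold changes.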

### Fitting a logistic regression model

How did clf come up with 1s and 0s?

It turns out that the predicted labels come from applying a threshold of 0.5 to the predicted probabilities. We can access the predicted probabilities via the predict_proba method:

Note that our model still has $w^*$s:
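
A minimal sketch of the steps in this section. The train-test split settings (`random_state=1`) and `max_iter=10000` are assumptions made here so the example is reproducible and the solver has room to converge; they are not necessarily the settings used in lecture.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
labels = 1 - y  # flip so that 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=1)

clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# Column 1 of predict_proba holds the predicted probability of class 1 (malignant).
probs = clf.predict_proba(X_test)[:, 1]

# Manually applying a 0.5 threshold should reproduce clf.predict.
manual_preds = (probs > 0.5).astype(int)
print((manual_preds == clf.predict(X_test)).mean())

# The fitted model still has an intercept (w0) and one weight per feature.
print(clf.intercept_.shape, clf.coef_.shape)  # (1,) (1, 30)
```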

### Evaluating our model

Let's see how well our model does on the test set.
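
One way this evaluation might look, using functions from `sklearn.metrics`. As before, the split settings and `max_iter` are assumptions, so the exact numbers will differ from run to run with a different split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
labels = 1 - y  # flip so that 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=1)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# sklearn's layout: rows are actual classes, columns are predicted classes,
# i.e. [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

print(cm)
print("accuracy:", acc, "precision:", prec, "recall:", rec)
```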

Which metric is more important for this task – precision or recall?

### What if we choose a different threshold?

🤔 Question: Suppose we choose a threshold higher than 0.5. What will happen to our model's precision and recall?

🙋 Answer: Precision will increase, while recall will decrease*.

• If the "bar" is higher to predict 1, then we will have fewer false positives.
• The denominator in $\text{precision} = \frac{TP}{TP + FP}$ will get smaller, and so precision will typically increase (true positives can also be lost, but usually more slowly than false positives).
• However, the number of false negatives will increase, as we are being more "strict" about what we classify as positive, and so $\text{recall} = \frac{TP}{TP + FN}$ will decrease.
• *It is possible for either or both to stay the same, if changing the threshold slightly (e.g. from 0.5 to 0.500001) doesn't change any predictions.

Similarly, if we decrease our threshold, our model's precision will decrease, while its recall will increase.
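
A toy example makes the effect concrete. The predicted probabilities and true labels below are invented for illustration, and the helper assumes that at least one point is predicted positive (otherwise precision is undefined).

```python
# Hypothetical predicted probabilities and true labels (1 = malignant).
probs  = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
y_true = [1,    1,    1,    0,    1,    0,    1,    0]

def precision_recall(probs, y_true, threshold):
    """Threshold the probabilities, then compute (precision, recall)."""
    y_pred = [int(p > threshold) for p in probs]
    tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))
    return tp / (tp + fp), tp / (tp + fn)

default = precision_recall(probs, y_true, 0.5)
higher = precision_recall(probs, y_true, 0.65)
lower = precision_recall(probs, y_true, 0.2)

print(default)  # (0.8, 0.8)
print(higher)   # (1.0, 0.6)
print(lower)    # precision drops to 5/7, recall rises to 1.0
```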

### Trying several thresholds

The classification threshold is not actually a hyperparameter of LogisticRegression, because the threshold doesn't change the coefficients ($w^*$s) of the logistic regression model itself (see this article for more details).

As such, if we want to imagine how our predicted classes would change with thresholds other than 0.5, we need to manually threshold.
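
A sketch of manual thresholding, under the same assumed split and `max_iter` settings as before. The grid of thresholds is an arbitrary choice, and `zero_division=0` tells sklearn to report 0 instead of warning when no points are predicted positive.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
labels = 1 - y  # flip so that 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=1)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Predicted probabilities of class 1; the model is fit only once.
probs = clf.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0.1, 0.9, 9)
precisions, recalls = [], []
for t in thresholds:
    # Only the cutoff changes; the coefficients stay the same.
    y_pred = (probs > t).astype(int)
    precisions.append(precision_score(y_test, y_pred, zero_division=0))
    recalls.append(recall_score(y_test, y_pred, zero_division=0))

for t, p, r in zip(thresholds, precisions, recalls):
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Because raising the threshold can only shrink the set of predicted positives, recall never increases as the threshold goes up; precision usually rises but is not guaranteed to.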

Let's visualize the results in plotly, which is interactive.