Lecture 7 – Relational Algebra, Messy Data

DSC 80, Spring 2023


Relational algebra

Birds-eye view of the course

Relational algebra

$$\sigma_{\text{temps.City = countries.City}} \big( \text{temps} \times \text{countries} \big)$$

For illustration purposes, let's look at the temperatures and countries example DataFrames from the last lecture, with slight modifications.

Projection ($\Pi$)

Used to project (keep) columns in a relation. Duplicates rows are dropped.

$$\Pi_{(\text{City, Humid})}(\text{temps})$$

Selection ($\sigma$)

Used to keep rows in a relation that satisfy certain conditions.

$$\sigma_{(\text{Temperature} > 50)}(\text{temps})$$

Operators can be composed:

$$\Pi_{(\text{City, Humid})} \big(\sigma_{(\text{Temperature} > 50)}(\text{temps}) \big)$$

Cross product ($\times$)

Used to create every possible combination of rows in the first relation with rows in the second relation.

$$\text{temps} \times \text{countries}$$

The cross product is not incredibly useful on its own, but it can be used with other operators to perform more meaningful operations.

What does the following compute?

$$\sigma_{\text{temps.City = countries.City}} \big( \text{temps} \times \text{countries} \big)$$

Union ($\cup$)

Used to combine the rows of two relations. Duplicate rows are dropped. Only works if the two relations have the same attributes (column names).

$$\text{temps} \cup \text{other_temps}$$

Difference ($-$)

Used to find the rows that are in one relation but not the other. Only works if the two relations have the same attributes (column names).

$$\text{temps} - \text{other_temps}$$

Brief summary

Introduction to messy data

There is no such thing as "raw data"!

Data generating process

Example: COVID case counts 🦠

Suppose our goal is to determine the number of COVID cases in the US yesterday.

Why do you think so few cases were reported on Christmas Day – is it because COVID was less prevalent on Christmas Day as compared to the days before and after, or is it likely for some other reason? 🎅

Data provenance

The bigger picture question we're asking here is, can we trust our data?

Data cleaning 🧹

Keys to data cleaning

Data cleaning often addresses:

Let's focus on the latter two.

Kinds of data

Kinds of data

Discussion Question

Determine the kind of each of the following variables.

Example: DSC 80 students

In the next cell, we'll load in an example dataset containing information about past DSC 80 students.

What needs to be changed in the DataFrame to extract meaningful insights?

How much has each student paid in total tuition in 2021 and 2022?

Check the data types of students!

Cleaning '2021 tuition' and '2022 tuition'

We can loop through the columns of students to apply the above procedure. (Looping through columns is fine, just avoid looping through rows.)

Alternatively, we can do this without a loop by using str.contains to find only the columns that contain tuition information.

Cleaning 'Paid'

Cleaning 'Month', 'Day', and 'Year'

Cleaning 'DSC 80 Final Grade'

Cleaning 'Student Name'

More data type ambiguities

Example: the Norway problem 🇳🇴

Unfaithful data

Is the data "faithful" to the DGP?

Is the data "faithful" to the DGP?

Summary, next time


Next time