# Lecture 7 – Relational Algebra, Messy Data

## DSC 80, Winter 2023

### Announcements

• Discussion 2 is tonight at 11:59PM.
• To earn 0.3% of extra credit, you must submit Lab 2, attend Discussion 2, and submit the Lab 2 Reflection Form by Saturday at 11:59PM.
• Lab 2 scores and solutions (to non-discussion problems) have been released.
• Project 1 is due tomorrow at 11:59PM.
• Lab 3 is due on Monday, January 30th at 11:59PM.
• Please fill out the anonymous Week 3 Feedback Survey to let us know how the course has been going so far!

### Agenda

• Relational algebra.
• Working with messy, real-world data.

## Relational algebra

### Bird's-eye view of the course

• In just two weeks, we've covered most of the core DataFrame manipulation tools you'll need to extract insight from data.
• In future courses, like DSC 100, you'll revisit how to perform many of the same operations in SQL, a language designed to work with relational databases.
• A relational database is a database that stores tabular data.

### Relational algebra

• Relational algebra is a system for describing operations that are performed on relations (tables).
• There are five primitive relational operators, each of which produces a new relation.
• Like in regular arithmetic, expressions are made up of values and operators. For instance, the following is an expression in relational algebra:
$$\sigma_{\text{temps.City = countries.City}} \big( \text{temps} \times \text{countries} \big)$$
• Here, we'll give you a brief, imprecise introduction to relational algebra to make the connections more clear in future courses.

For illustration purposes, let's look at the temperatures and countries example DataFrames from the last lecture, with slight modifications.

### Projection ($\Pi$)

Used to project (keep) columns in a relation. Duplicate rows are dropped.

$$\Pi_{(\text{City, Humid})}(\text{temps})$$
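In pandas, projection corresponds to selecting columns and dropping duplicate rows. A minimal sketch, using a hypothetical version of the temps relation (the column names are assumed from the examples in this lecture):

```python
import pandas as pd

# Hypothetical temps relation; contents are made up for illustration.
temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome', 'Toronto'],
    'Temperature': [76, 28, 56, 28],
    'Humid': ['no', 'yes', 'yes', 'yes'],
})

# Projection keeps only the listed columns. Unlike plain column selection
# in pandas, relational projection also drops duplicate rows.
projected = temps[['City', 'Humid']].drop_duplicates()
```

Note that `temps[['City', 'Humid']]` alone keeps both 'Toronto' rows; the `drop_duplicates()` call is what makes this a true relational projection.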

### Selection ($\sigma$)

Used to keep rows in a relation that satisfy certain conditions.

$$\sigma_{(\text{Temperature} > 50)}(\text{temps})$$

Operators can be composed:

$$\Pi_{(\text{City, Humid})} \big(\sigma_{(\text{Temperature} > 50)}(\text{temps}) \big)$$
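Selection corresponds to Boolean filtering in pandas, and composing operators corresponds to chaining operations. A sketch with the same hypothetical temps relation:

```python
import pandas as pd

# Hypothetical temps relation; contents are made up for illustration.
temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56],
    'Humid': ['no', 'yes', 'yes'],
})

# Selection: keep only the rows satisfying the condition.
warm = temps[temps['Temperature'] > 50]

# Composition: select first, then project.
warm_cities = warm[['City', 'Humid']].drop_duplicates()
```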

### Cross product ($\times$)

Used to create every possible combination of rows in the first relation with rows in the second relation.

$$\text{temps} \times \text{countries}$$

The cross product is not incredibly useful on its own, but it can be used with other operators to perform more meaningful operations.

What does the following compute?

$$\sigma_{\text{temps.City = countries.City}} \big( \text{temps} \times \text{countries} \big)$$
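A selection over a cross product on matching 'City' values recovers an inner join. A sketch in pandas, using hypothetical temps and countries relations (column names assumed from the expression above):

```python
import pandas as pd

# Hypothetical relations; contents are made up for illustration.
temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto'],
    'Temperature': [76, 28],
})
countries = pd.DataFrame({
    'City': ['Toronto', 'San Diego', 'Paris'],
    'Country': ['Canada', 'USA', 'France'],
})

# Cross product: every row of temps paired with every row of countries.
crossed = temps.merge(countries, how='cross', suffixes=('_temps', '_countries'))

# Selecting the rows where the two City columns agree yields an inner join.
joined = crossed[crossed['City_temps'] == crossed['City_countries']]
```

This is conceptually what `temps.merge(countries, on='City')` computes, though pandas does not actually materialize the full cross product.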

### Union ($\cup$)

Used to combine the rows of two relations. Duplicate rows are dropped. Only works if the two relations have the same attributes (column names).

$$\text{temps} \cup \text{other_temps}$$
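In pandas, union corresponds to concatenating rows and then dropping duplicates, since relational union is a set operation. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical relations with the same attributes; contents are made up.
temps = pd.DataFrame({'City': ['San Diego', 'Rome'], 'Temperature': [76, 56]})
other_temps = pd.DataFrame({'City': ['Rome', 'Cairo'], 'Temperature': [56, 94]})

# Stack the rows, then drop duplicates (the 'Rome' row appears in both).
union = pd.concat([temps, other_temps]).drop_duplicates().reset_index(drop=True)
```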

### Difference ($-$)

Used to find the rows that are in one relation but not the other. Only works if the two relations have the same attributes (column names).

$$\text{temps} - \text{other_temps}$$
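pandas has no single difference method, but one way to compute it is an indicator merge, which tags each row with the relation it came from. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical relations with the same attributes; contents are made up.
temps = pd.DataFrame({'City': ['San Diego', 'Rome'], 'Temperature': [76, 56]})
other_temps = pd.DataFrame({'City': ['Rome', 'Cairo'], 'Temperature': [56, 94]})

# indicator=True adds a '_merge' column recording each row's origin.
# Merging on all shared columns, rows marked 'left_only' are exactly
# those in temps but not in other_temps.
both = temps.merge(other_temps, how='left', indicator=True)
difference = both[both['_merge'] == 'left_only'].drop(columns='_merge')
```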

### Brief summary

• We saw five operators: project ($\Pi$), select ($\sigma$), cross product ($\times$), union ($\cup$), and difference ($-$).
• You'll learn about more operations in relational algebra in the future, including rename ($\rho$), intersection ($\cap$), join, aggregate, etc. Additional resources: Wikipedia, UNSW, UCSD DSC 100.
• Takeaway: Think about what each line of code you write is doing – don't just guess-and-check until you pass all the otter test cases.

## Introduction to messy data

### There is no such thing as "raw data"!

• Data are the result of measurements that must be recorded.
• Humans design the measurements and record the results.
• Data is always an imperfect record of the underlying process being measured.

### Data generating process

• A data generating process is the underlying, real-world (probabilistic) mechanism that generates observed data.
• Observed data is an incomplete artifact of the data generating process.
• A data generating process is what a statistical model attempts to describe.
• From DSC 10: a model is a set of assumptions about how data were generated.
• More on this later in the quarter.
• Data cleaning requires an understanding of the data generating process.

### Example: COVID case counts 🦠

Suppose our goal is to determine the number of COVID cases in the US yesterday.

• What are we really asking for – the number of people who tested positive yesterday, or the number of people who contracted COVID yesterday?
• Tested positive on what type of test? How accurate is that type of test?
• How often are test results reported? Is there a delay in when test results are reported?

Why do you think so few cases were reported on Christmas Day – is it because COVID was less prevalent on Christmas Day as compared to the days before and after, or is it likely for some other reason? 🎅

### Data provenance

• As data scientists, we often need to work with datasets that others collected, for a purpose that is different than our current interest.
• As such, it's important to understand the "story" of how a dataset came to be, or the provenance of the data. Specifically, we need to be aware of:
1. Assumptions about the data generating process.
2. How the initial values in the dataset came to be.
3. How any data processing or storage decisions affected the values in the dataset.

The bigger picture question we're asking here is, can we trust our data?

### Data cleaning 🧹

• Data cleaning is the process of transforming data so that it best represents the underlying data generating process.

• In practice, data cleaning is often detective work to understand data provenance.

• Always be skeptical of your data!

### Keys to data cleaning

• The structure of the recorded data.
• Is the data stored in a tabular format (e.g. CSV, SQL, Google Sheets) or in another format (JSON, XML)?
• Are the individuals properly represented as rows?
• The encoding and format of the values in the data.
• Are the data types of all columns reflective of the kinds of data they contain?
• Corrupt and "incorrect" data, and missing values.
• Were there flaws in the data recording process? In other words, is our data faithful to the data generating process?

Let's focus on the latter two.

## Kinds of data

### Discussion Question

Determine the kind of each of the following variables.

• Fuel economy in miles per gallon.
• Number of quarters at UCSD.
• Class standing (freshman, sophomore, etc.).
• Income bracket (low, medium, high).
• Bank account number.

### Example: DSC 80 students

In the next cell, we'll load in an example dataset containing information about past DSC 80 students.

• 'PID' and 'Student Name': student PID and name.
• 'Month', 'Day', 'Year': date when the student was accepted to UCSD.
• '2021 tuition' and '2022 tuition': amount paid in tuition in 2021 and 2022, respectively.
• 'Percent Growth': growth between the two aforementioned columns.
• 'Paid': whether or not the student has paid tuition for this quarter yet.
• 'DSC 80 Final Grade': either 'Pass', 'Fail', or a number.

What needs to be changed in the DataFrame to extract meaningful insights?

### Check the data types of students!

• What kinds of data should each column have?
• Qualitative or quantitative?
• Discrete or continuous?
• Ordinal or nominal?
• What data type should each column have?

• Use the dtypes attribute or the info method to peek at the data types.

### Cleaning '2021 tuition' and '2022 tuition'

• '2021 tuition' and '2022 tuition' are stored as objects (strings), not numerical values.
• The '\$' character causes the entries to be interpreted as strings.
• We can use str methods to strip the dollar sign.
• Recall, whatever method/operator comes immediately after .str will be applied to each element of the Series individually, rather than the Series as a whole.

We can loop through the columns of students to apply the above procedure. (Looping through columns is fine, just avoid looping through rows.)

Alternatively, we can do this without a loop by using str.contains to find only the columns that contain tuition information.
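A sketch of both approaches on a hypothetical slice of the students DataFrame (the tuition values here are made up, and I assume they may contain commas as well as dollar signs):

```python
import pandas as pd

# Hypothetical students DataFrame; only the tuition columns are shown.
students = pd.DataFrame({
    '2021 tuition': ['$10,000', '$12,500'],
    '2022 tuition': ['$11,000', '$13,000'],
})

# Use str.contains on the column index to find the tuition columns,
# then loop over just those columns (looping over columns is fine).
for col in students.columns[students.columns.str.contains('tuition')]:
    students[col] = (
        students[col]
        .str.strip('$')              # remove the leading dollar sign
        .str.replace(',', '')        # remove thousands separators
        .astype(float)               # now the conversion succeeds
    )
```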

### Cleaning 'Paid'

• Currently, 'Paid' contains the strings 'Y' and 'N'.
• 'Y's and 'N's typically result from manual data entry.
• The 'Paid' column should contain Trues and Falses, or 1s and 0s.
• One solution: create a Boolean Series through comparison. Could also use the Series replace method.
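A sketch of both solutions on a hypothetical 'Paid' column:

```python
import pandas as pd

# Hypothetical 'Paid' column resulting from manual data entry.
paid = pd.Series(['Y', 'N', 'Y'])

# Solution 1: a comparison produces a Boolean Series directly.
paid_bool = paid == 'Y'

# Solution 2: replace maps each string to a Boolean explicitly.
paid_bool_2 = paid.replace({'Y': True, 'N': False})
```

The comparison approach is usually preferred: it is shorter, and any unexpected value (say, a stray 'y') becomes `False` rather than silently surviving as a string.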

### Cleaning 'Month', 'Day', and 'Year'

• Currently, these are stored separately using the int64 data type. This could be fine for certain purposes, but ideally they are stored as a single column (e.g. for sorting).
• Solution: use pd.to_datetime to convert dates to datetime64 objects.
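`pd.to_datetime` can assemble dates directly from a DataFrame whose columns are named like year/month/day. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical date columns from the students DataFrame.
students = pd.DataFrame({
    'Year': [2020, 2021],
    'Month': [9, 1],
    'Day': [15, 4],
})

# Passing the three columns together produces a single datetime64 column,
# which sorts and compares correctly.
students['Date'] = pd.to_datetime(students[['Year', 'Month', 'Day']])
```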

### Cleaning 'DSC 80 Final Grade'

• Currently, the values in 'DSC 80 Final Grade' are stored as objects (strings).
• Unless we convert this column to a numeric type, we can't do any arithmetic with it.
• However, due to the existence of strings like 'Pass', we can't use astype to convert it.
• Solution: use pd.to_numeric(s, errors='coerce'), where s is a Series.
• ⚠️ Be careful with this!
• errors='coerce' can cause uninformed destruction of data.
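A sketch on a hypothetical grade column, showing both the conversion and why it deserves caution:

```python
import pandas as pd

# Hypothetical 'DSC 80 Final Grade' values: a mix of numbers and strings.
grades = pd.Series(['92', 'Pass', '85', 'Fail'])

# errors='coerce' turns anything that can't be parsed as a number into NaN.
numeric_grades = pd.to_numeric(grades, errors='coerce')

# Always check how much data was destroyed before trusting the result!
n_coerced = numeric_grades.isna().sum()
```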

### Cleaning 'Student Name'

• We want names to be formatted as 'Last Name, First Name', a common format.
• Solution: use str methods once again.
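A sketch with hypothetical names, assuming each name has exactly two parts (real data would need more care for middle names, suffixes, etc.):

```python
import pandas as pd

# Hypothetical 'Student Name' values in 'First Last' format.
names = pd.Series(['Billy Triton', 'Sally King'])

# Split each name on whitespace, then reassemble as 'Last, First'.
parts = names.str.split()
formatted = parts.str[1] + ', ' + parts.str[0]
```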

### More data type ambiguities

• 1649043031 looks like a number, but is probably a date.

• As we saw in the last lecture, Unix timestamps count the number of seconds since January 1st, 1970.
• "USD 1,000,000" looks like a string, but is actually a number and a unit.

• 92093 looks like a number, but is really a zip code (and isn't equal to 92,093).

• Sometimes, False appears in a column of country codes. Why might this be? 🤔
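To see that 1649043031 is plausibly a date, interpret it as a Unix timestamp:

```python
import pandas as pd

# unit='s' interprets the integer as seconds since January 1st, 1970 (UTC).
ts = pd.to_datetime(1649043031, unit='s')  # an ordinary date in April 2022
```

(As for `False` in a column of country codes: a CSV reader that auto-detects Booleans can turn Norway's two-letter code "NO" into `False`.)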

## Summary, next time

### Summary

• Relational algebra provides a system for describing operations on relations (tables).
• Data provenance describes the "origin story" of a dataset, from the data generating process to its storage.
• Data cleaning is the process of transforming data so that it best represents the underlying data generating process.
• We must ensure that each column in a DataFrame uses the correct data type for the kind of data in the column.

### Next time

• Working with unfaithful data – that is, data that is just plain wrong.
• (Re)introduction to hypothesis testing.