# Lecture 5 – Pivoting and Simpson's Paradox¶

## DSC 80, Winter 2023¶

### Announcements¶

• To earn 0.3% of extra credit, submit the Lab 1 Reflection Form by tomorrow at 11:59PM.
• No slip days are allowed for lab reflection forms.
• To earn the EC, you must also have submitted Lab 1 and attended Discussion 1.
• Lab 1 scores and solutions (for questions not covered in discussion) are posted on Ed.
• Project 1 is due on Thursday, January 26th at 11:59PM.
• Lab 2 is due on Monday, January 23rd at 11:59PM.
• Check the calendar for the latest office hours schedule – we're regularly adding more OH.
• Aside: dsc-courses.github.io has links to course websites for many DSC courses.

### Agenda¶

• Grouping with multiple columns.
• Pivoting.
• Distributions.
• Reshaping DataFrames.
• Aside: Time series data.

## Grouping¶

Recall, last class, we started working with a dataset that involves various measurements taken of three species of penguins in Antarctica.

### Discussion Question¶

Find the 'island' on which the heaviest penguin of each 'species' lives.

### Grouping with multiple columns¶

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.

### Grouping and indexes¶

• The groupby method creates an index based on the specified columns.
• When grouping by multiple columns, the resulting DataFrame has a MultiIndex.
• Advice: When working with a MultiIndex, use reset_index or set as_index=False in groupby.

## Pivoting¶

### Mean body mass for every combination of species and island¶

We just found the above information by grouping by both 'species' and 'island'.

But we can also create a pivot table, which contains the same information in a different orientation.

Let's visualize how the pivot table was created using Pandas Tutor.

### pivot_table¶

The pivot_table DataFrame method aggregates a DataFrame using two columns. To use it:

df.pivot_table(index=index_col,
columns=columns_col,
values=values_col,
aggfunc=func)


The resulting DataFrame will have:

• One row for every unique value in index_col.
• One column for every unique value in columns_col.
• Values determined by applying func on values in values_col.

### Example¶

Find the number of penguins per island and species.

Note that there is a NaN at the intersection of 'Biscoe' and 'Chinstrap', because there were no Chinstrap penguins on Biscoe Island.

We can either use the fillna method afterwards or the fill_value argument to fill in NaNs.

### Example¶

Find the median body mass per species and sex.

Important: In penguins, each row corresponds to an individual/observation. In the pivot table above, that is no longer true.

## Distributions¶

### Joint distribution¶

When using aggfunc='count', a pivot table describes the joint distribution of two categorical variables. This is also called a contingency table.

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as probabilities that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

### Marginal probabilities¶

If we sum over one of the axes, we can compute marginal probabilities, i.e. unconditional probabilities.

For instance, the second Series tells us that a randomly selected penguin has a 0.357357 chance of being of species 'Gentoo'.

### Conditional probabilities¶

Using counts, how might we compute conditional probabilities like $$P(\text{species } = \text{"Adelie"} \mid \text{sex } = \text{"Female"})?$$

\begin{align*} P(\text{species} = c \mid \text{sex} = x) &= \frac{P(\text{species} = c \text{ and } \text{sex} = x)}{P(\text{sex = }x)} \\ &= \frac{\frac{\# \: (\text{species } = \: c \text{ and } \text{sex } = \: x)}{N}}{\frac{\# \: (\text{sex } = \: x)}{N}} \\ &= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)} \end{align*}

Answer: To find conditional probabilities of species given sex, divide by column sums. To find conditional probabilities of sex given species, divide by row sums.

### Conditional probabilities¶

To find conditional probabilities of species given sex, divide by column sums. To find conditional probabilities of sex given species, divide by row sums.

The conditional distribution of species given sex is below. Note that in this new DataFrame, the 'Female' and 'Male' columns each sum to 1.

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species 'Adelie' given that they are of sex 'Female' is 0.442424.

Task: Try and find the conditional distribution of sex given species.

## Reshaping DataFrames¶

### pivot_table aggregates and reshapes¶

• The pivot_table method does two things. It:
• Aggregates based on two columns.
• Reshapes the data from "long" to "wide".
• Rows no longer correspond to observations.
• At times, we may only want to do the second step – reshape the data. To do so, we can use the pivot method.

### Example: Tic-tac-toe¶

The pivot method only reshapes a DataFrame. It does not change any of the values in it (i.e. aggfunc doesn't work with pivot).

### pivot_table = groupby + pivot¶

• pivot_table is a shortcut for using groupby and then using pivot.
• For example, both of the following code cells find the mean body mass per species and sex.

aggfunc='mean' plays the same role that .mean() does.

### Reshaping¶

• pivot_table and pivot reshape DataFrames from "long" to "wide".
• Other DataFrame reshaping methods:
• melt: Un-pivots a DataFrame.
• stack: Pivots multi-level columns to multi-indices.
• unstack: Pivots multi-indices to columns.

• Two students, Lisa and Bart, just finished freshman year. They both took a different number of classes in Fall, Winter, and Spring.
• Each quarter, Lisa had a higher GPA than Bart.
• But Bart has a higher overall GPA.
• How is this possible? 🤔

Run this cell to create DataFrames that contain each students' grades.

### Quarter-specific vs. overall GPAs¶

Note: The number of "grade points" earned for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

For instance, an A- in a 4 unit course earns $3.7 \cdot 4 = 14.8$ grade points.

Lisa had a higher GPA in all three quarters:

But Lisa's overall GPA was less than Bart's overall GPA:

### What happened?¶

• When Lisa and Bart both performed poorly, Lisa took more units than Bart. This brought down 📉 Lisa's overall average.
• When Lisa and Bart both performed well, Bart took more units than Annie. This brought up 📈 Bart's overall average.

• Simpson's paradox occurs when grouped data and ungrouped data show opposing trends.
• It is named after Edward H. Simpson, not Lisa or Bart Simpson.
• It is purely arithmetic – it is a consequence of weighted averages.
• It often happens because there is a hidden factor (i.e. a confounder) within the data that influences results.
• Question: What is the "correct" way to summarize your data? What if you had to act on these results?

### Example: How Berkeley was almost sued for gender discrimination (1973)¶

What do you notice?

### What happened?¶

• The overall acceptance rate for women (30%) was lower than it was for men (45%).
• However, most departments (A, B, D, F) had a higher acceptance rate for women.
• Department A had a 62% acceptance rate for men and an 82% acceptance rate for women!
• 31% of men applied to Department A.
• 6% of women applied to Department A.
• Department F had a 6% acceptance rate for men and a 7% acceptance rate for women!
• 14% of men applied to Department F.
• 19% of women applied to Department F.
• Conclusion: Women tended to apply to departments with a lower acceptance rate.

### Caution!¶

This doesn't mean that admissions are free from gender discrimination!

From Moss-Racusin et al., 2012, PNAS (cited 2600+ times):

In a randomized double-blind study (n = 127), science faculty from research-intensive universities rated the application materials of a student—who was randomly assigned either a male or female name—for a laboratory manager position. Faculty participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

### But then...¶

Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

### Do these conflict?¶

Not necessarily. One explanation, from William and Ceci:

Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong, as is the case with tenure-track hiring.

### Do these conflict?¶

From Witteman, et al, 2019, in The Lancet:

Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action.

### Example: Restaurant reviews and phone types¶

• You are deciding whether to eat at Dirty Birds or The Loft.
• Suppose Yelp shows ratings aggregated by phone type (Android vs. iPhone).
Phone Type Stars for Dirty Birds Stars for The Loft
Android 4.24 4.0
iPhone 2.99 2.79
All 3.32 3.37
• Question: Should you choose Dirty Birds or The Loft?
• Answer: The type of phone you use likely has nothing to do with your taste in food – pick the restaurant that is rated higher overall.
• Remember, Simpson's paradox is merely a property of weighted averages!

Aggregated means:

Disaggregated means:

### Takeaways¶

Be skeptical of...

• Aggregate statistics.
• People misusing statistics to "prove" that discrimination doesn't exist.
• Drawing conclusions from individual publications ($p$-hacking, publication bias, narrow focus, etc.).
• Everything!

## Aside: Working with time series data¶

### Time series – why now?¶

• Next lecture, we'll start looking at how to combine multiple DataFrames.
• Data is often partitioned by time. For instance, there may be one .csv file per day for 1 year.
• We will need to load in the files as DataFrames and pd.concat them together.
• Note: "time series" is a general term and is not related to Series in pandas.

### Datetime types¶

When working with time data, you will see two different kinds of "times":

• Datetimes reference particular moments in time (e.g. November 26th, 1998 at 8:26AM).
• Could just be a date, e.g. January 20, 2023.
• Could just be a time, e.g. 4:45 AM.
• Datetimes typically don't keep track of timezones.
• Timedeltas, or durations, reference an exact length of time (e.g. a duration of 3 hours).

### The datetime module¶

Python has an in-built datetime module, which contains datetime and timedelta types. These are much more convenient to deal with than strings that contain times.

Unix timestamps count the number of seconds since January 1st, 1970.

### Times in pandas¶

• pd.Timestamp is the pandas equivalent of datetime.
• pd.to_datetime converts strings to pd.Timestamp objects.

Timestamps have time-related attributes, e.g. dayofweek, hour, min, sec.

Subtracting timestamps yields pd.Timedelta objects.

### Example: Exam speeds¶

Below, we have the Final Exam starting and ending times for two sections of a course.

Question: Who took the longest time to finish the exam?

## Summary, next time¶

### Summary¶

• pivot_table aggregates data based on two categorical columns, and reshapes the result to be "wide" instead of "long".
• Simpson's paradox occurs when grouped data and ungrouped data show opposing trends.
• It is a consequence of arithmetic.
• Timestamps in pandas are stored using pd.Timestamp and pd.Timedelta objects.

### Next time¶

Combining DataFrames.