# Lecture 4 – Grouping

## DSC 80, Winter 2023

### Announcements

• Lab 1 is due today at 4PM – no slip days are allowed!
• Discussion 1 is today at 5PM.
• If you submit Lab 1, attend Discussion 1, and satisfactorily fill out the reflection form for Lab 1, you'll receive 0.3% extra credit.
• Project 1's checkpoint is due tomorrow at 11:59PM, and the full project is due on Thursday, January 26th at 11:59PM.
• Lab 2 is due on Monday, January 23rd at 11:59PM.
• Check the calendar for the latest office hours schedule – we're regularly adding more OH.
• Aside: dsc-courses.github.io has links to course websites for many DSC courses.

### Agenda

• Data granularity.
• Grouping.
• DataFrameGroupBy objects and aggregation.
• Other DataFrameGroupBy methods.

## Data granularity

### Granularity

• Granularity refers to the level of detail present in data.
• Fine: small details.
• Coarse: bigger picture.
• Most commonly, rows in a DataFrame correspond to individuals, and columns correspond to attributes. Data formatted in this way is called tidy data.
• In the following example, what is an individual?
| Name | Assignment | Score |
|---|---|---|
| Billy | Homework 1 | 94 |
| Sally | Homework 1 | 98 |
| Molly | Homework 1 | 82 |
| Sally | Homework 2 | 47 |

### Levels of granularity

Each student submits CAPEs once for each course they are in.

| Student Name | Quarter | Course | Instructor | Recommend? | Expected Grade | Hours Per Week | Comments |
|---|---|---|---|---|---|---|---|
| Billy | WI23 | DSC 80 | Suraj Rampure | No | A- | 14 | I hate this class |
| Billy | WI23 | DSC 40B | Justin Eldridge | Yes | B+ | 9 | go big O |
| Sally | WI23 | DSC 10 | Janine Tiefenbruck | Yes | A | 11 | baby pandas are so cute |
| Molly | WI23 | DSC 80 | Suraj Rampure | Yes | A+ | 2 | I wish there was music in class |
| Molly | WI23 | DSC 95 | Marina Langlois | No | A | 3 | I loved DSC 30, but 95 wasn't hard enough :( |

Only instructors can see individual responses. At cape.ucsd.edu, overall class statistics are visible.

| Quarter | Course | Instructor | Recommend (%) | Expected Grade | Hours Per Week |
|---|---|---|---|---|---|
| WI23 | DSC 80 | Suraj Rampure | 6% | 3.15 (B) | 13.32 |
| WI23 | DSC 40B | Justin Eldridge | 89% | 3.35 (B+) | 8.54 |
| WI23 | DSC 10 | Janine Tiefenbruck | 94% | 3.45 (B+) | 11.49 |
| WI23 | DSC 95 | Marina Langlois | 91% | 4.0 (A) | 9.21 |

The university may be interested in looking at CAPEs results by department.

| Quarter | Department | Recommend (%) | Expected Grade | Hours Per Week |
|---|---|---|---|---|
| WI23 | DSC | 91% | 3.01 (B) | 12.29 |
| WI23 | BILD | 85% | 2.78 (C+) | 13.21 |

Prospective students may be interested in comparing course evaluations across different universities.

| University | Recommend (%) | Average GPA | Hours Per Week |
|---|---|---|---|
| UC San Diego | 94% | 3.12 (B) | 42.19 |
| UC Irvine | 89% | 3.15 (B) | 38.44 |
| SDSU | 88% | 2.99 (B-) | 36.89 |

### Collecting data

• If you can control how your dataset is created, you should opt for finer granularity, i.e. for more detail.
• You can always remove detail, but you cannot add detail if it is not already present in the dataset.
• However, obtaining fine-grained data can take more time and space.

### Manipulating granularity

• In the CAPEs example, we looked at the same information – course evaluations – at varying levels of detail.
• We'll now explore how to change the level of granularity present in our dataset.
• While it may seem like we are "losing information," removing detail can help us understand bigger-picture trends in our data.

### Example: Penguins

Artwork by @allison_horst

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

### Discussion Question

Create a Series, indexed by 'species', that contains the mean 'body_mass_g' of each 'species'.

### Naive approach: looping through unique values

• For each unique 'species', we make a pass through the entire dataset.
• The asymptotic runtime of this procedure is $\Theta(ns)$, where $n$ is the number of rows and $s$ is the number of unique species.
• While there are other loop-based solutions that only involve a single pass over the DataFrame, we'd like to avoid Python loops entirely, as they're slow.
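As a sketch of this naive approach, here it is on a tiny made-up stand-in for the penguins DataFrame (the values below are invented for illustration; the real dataset has many more rows and columns):

```python
import pandas as pd

# Tiny made-up stand-in for the lecture's penguins DataFrame.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750, 3800, 3700, 5000, 5400],
})

# Naive approach: one full pass over the DataFrame per unique species.
means = {}
for species in penguins['species'].unique():
    means[species] = penguins.loc[
        penguins['species'] == species, 'body_mass_g'
    ].mean()
species_means = pd.Series(means)
```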

## Grouping

### 🤔

Somehow, the groupby method computes what we're looking for in just one line. How?
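On a small made-up stand-in for the penguins DataFrame (invented values), the one-liner looks like this:

```python
import pandas as pd

# Made-up mini version of the penguins DataFrame.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750, 3800, 3700, 5000, 5400],
})

# One line: split by species, take the mean within each group, combine.
mean_mass = penguins.groupby('species')['body_mass_g'].mean()
```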

### Aside: Pandas Tutor

Pandas Tutor is the pandas equivalent of Python Tutor, which you saw in DSC 20.

You can use it by:

• Going to the website and entering pandas code directly.
• Installing it locally (!pip install pandas-tutor), enabling it in your notebook (%reload_ext pandas_tutor), and using the %%pt cell magic to visualize the last expression in a cell.

We'll do the latter, since it doesn't require us to leave our Jupyter Notebook (and because the website doesn't allow us to upload datasets).

### Split-apply-combine

The groupby method involves three steps: split, apply, and combine. This is the same terminology that the pandas documentation uses.

• Split breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.
• Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.
• Combine stitches the results of these operations into an output DataFrame.
• The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.

### More examples

Before we dive deep into the internals, let's look at a few more examples.

Which 'species' has the highest median 'bill_length_mm'?

What proportion of penguins of each 'species' live on 'Dream' island?

Note that groupby is a declarative operation – the user just specifies what computation needs to be done, and pandas figures out how to do it under the hood.
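The two questions above might be expressed like this, again on a made-up mini DataFrame (invented values, so the answers below hold only for this toy data):

```python
import pandas as pd

# Made-up mini penguins table.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [38.5, 39.2, 49.0, 47.5, 46.1],
})

# Which species has the highest median bill length?
highest = penguins.groupby('species')['bill_length_mm'].median().idxmax()

# What proportion of each species lives on 'Dream'?
dream_prop = (
    penguins.assign(on_dream=penguins['island'] == 'Dream')
            .groupby('species')['on_dream']
            .mean()
)
```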

## DataFrameGroupBy objects and aggregation

### DataFrameGroupBy objects

We've just evaluated a few expressions of the following form.

There are two method calls in the expression above: .groupby('species') and .mean(). What happens if we remove the latter?

### Peeking under the hood

If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.

This object represents the "split" in "split-apply-combine".

DataFrameGroupBy objects have a groups attribute, which is a dictionary-like object in which the keys are group names and the values are the row labels belonging to each group.

DataFrameGroupBy objects also have a get_group(key) method, which returns a DataFrame containing only the rows for the given key.
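A sketch of both, using a tiny made-up DataFrame:

```python
import pandas as pd

# Made-up data for illustration.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Adelie'],
    'body_mass_g': [3750, 3700, 3800],
})

grouped = penguins.groupby('species')

# groups maps each group name to the row labels belonging to that group.
groups = grouped.groups

# get_group returns a DataFrame containing only one group's rows.
adelie = grouped.get_group('Adelie')
```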

We usually don't use these attributes and methods, but they're useful in understanding how groupby works under the hood.

### Aggregation

• Once we create a DataFrameGroupBy object, we need to apply some function to each group, and combine the results.
• The most common operation we apply to each group is an aggregation.
• Aggregation refers to the process of reducing many values to one.
• To perform an aggregation, use an aggregation method on the DataFrameGroupBy object, e.g. .mean(), .max(), or .median().

Let's look at some examples.

### Column independence

Within each group, the aggregation method is applied to each column independently.
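As a sketch, here is a made-up mini DataFrame deliberately constructed so that, within the 'Adelie' group, the maximum 'island' and the maximum 'body_mass_g' come from different rows:

```python
import pandas as pd

# Made-up rows, chosen so the per-column maxima disagree on which row they come from.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3750, 4775, 5400],
})

# max is computed per column, independently, within each group.
maxes = penguins.groupby('species').max()
```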

The above result tells us that:

• The maximum 'island' among 'Adelie' penguins is 'Torgersen'.
• The maximum 'body_mass_g' among 'Adelie' penguins is 4775.0.

It is not telling us that there is an 'Adelie' penguin on 'Torgersen' island with a 'body_mass_g' of 4775.0!

### Discussion Question

Find the 'island' on which the heaviest penguin of each 'species' lives.

### Column selection and performance implications

• By default, the aggregator will be applied to all columns that it can be applied to.
• max and min are defined on strings, while median and mean are not.
• If we only care about one column, we can select that column before aggregating to save time.
• DataFrameGroupBy objects support [] notation, just like DataFrames.

To demonstrate that aggregating every column and then selecting is slower than selecting a single column before aggregating, we can use %%timeit. For reference, we'll also include our earlier for-loop-based solution.

### Takeaways

• It's important to understand what each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.

```python
# Slower
penguins.groupby('species').mean()['bill_length_mm']

# Faster
penguins.groupby('species')['bill_length_mm'].mean()
```
• The groupby method is much quicker than for-looping over the DataFrame in Python. It can often produce results using just a single, fast pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.

### Beyond default aggregation methods

• There are many built-in aggregation methods.
• What if you want to apply different aggregation methods to different columns?
• What if the aggregation method you want to use doesn't already exist in pandas?

### The aggregate method

• The DataFrameGroupBy object has a general aggregate method, which aggregates using one or more operations.
• Remember, aggregation refers to the process of reducing many values to one.
• There are many ways of using aggregate; refer to the documentation for a comprehensive list.
• Example arguments:
• A single function.
• A list of functions.
• A dictionary mapping column names to functions.
• Per the documentation, agg is an alias for aggregate.
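Here is a sketch of all three argument styles on made-up data (invented values):

```python
import pandas as pd

# Made-up mini penguins table.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'bill_length_mm': [38.5, 39.2, 47.5],
    'body_mass_g': [3750, 3800, 5000],
})

g = penguins.groupby('species')

# A single function.
single = g.agg('mean')

# A list of functions: one output column per (column, function) pair.
multi = g[['body_mass_g']].agg(['min', 'max'])

# A dictionary mapping column names to functions.
out = g.agg({'bill_length_mm': 'max', 'body_mass_g': 'mean'})
```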

### Example

How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each species?

Note what happens when we don't select a column before aggregating.

### Example

What is the maximum 'bill_length_mm' of each 'species', and on which 'island's is each 'species' found?

### Example

What is the interquartile range of the 'body_mass_g' of each 'species'?
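Since the IQR isn't a built-in aggregation method, one way to express this is to pass a custom function to agg (sketched on made-up data):

```python
import pandas as pd

# Made-up masses for a single species.
penguins = pd.DataFrame({
    'species': ['Adelie'] * 5,
    'body_mass_g': [3000, 3250, 3500, 3750, 4000],
})

# IQR = 75th percentile minus 25th percentile.
def iqr(s):
    return s.quantile(0.75) - s.quantile(0.25)

out = penguins.groupby('species')['body_mass_g'].agg(iqr)
```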

## Other DataFrameGroupBy methods

### Split-apply-combine, revisited

When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.

Instead of aggregating during the apply step, we could instead perform a:

• Transformation, in which we perform operations to every value within each group.
• Filtration, in which we keep only the groups that satisfy some condition.

### Transformations

• Suppose we want to convert the 'body_mass_g' column to z-scores (i.e. standard units), separately for each 'species':
$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
• To do so, we can use the transform method on a DataFrameGroupBy object. The transform method takes in a function, which itself takes in a Series and returns a new Series.
• A transformation produces a DataFrame or Series of the same size – it is not an aggregation!
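A sketch of this transformation on made-up data (using pandas' default sample standard deviation):

```python
import pandas as pd

# Made-up masses for two species.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3700, 3900, 5000, 5400],
})

# z-score each group's masses using that group's own mean and SD.
z = penguins.groupby('species')['body_mass_g'].transform(
    lambda s: (s - s.mean()) / s.std()
)
penguins = penguins.assign(z_mass=z)
```

Note that the result has one value per row, not one value per group.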

Note that below, penguin 276 has a larger 'body_mass_g' than penguin 65, but a lower 'z_mass'.

• Penguin 65 has an above average 'body_mass_g' among 'Adelie' penguins.
• Penguin 276 has a below average 'body_mass_g' among 'Gentoo' penguins. Remember from earlier that the average 'body_mass_g' of 'Gentoo' penguins is much higher than for other species.

### Filtering

• To keep only the groups that satisfy a particular condition, use the filter method on a DataFrameGroupBy object.
• The filter method takes in a function, which itself takes in a DataFrame/Series and returns a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returned True.

For example, suppose we want only the 'species' whose 'bill_length_mm' is above 39.
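One way to express this, assuming the condition applies to each group's mean bill length (an assumption on my part), sketched on made-up data:

```python
import pandas as pd

# Made-up bill lengths; the Adelie group's mean is below 39.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap'],
    'bill_length_mm': [38.5, 39.1, 49.0],
})

# Keep only the groups whose mean bill length exceeds 39.
kept = penguins.groupby('species').filter(
    lambda df: df['bill_length_mm'].mean() > 39
)
```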

No more 'Adelie's!

Or, as another example, suppose we only want the 'species' with at least 100 penguins:
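A sketch of this filter, with made-up group sizes (120 'Adelie' rows and 50 'Chinstrap' rows):

```python
import pandas as pd

# Made-up counts: only Adelie clears the 100-row threshold.
penguins = pd.DataFrame({
    'species': ['Adelie'] * 120 + ['Chinstrap'] * 50,
})

# Keep only the groups with at least 100 rows.
kept = penguins.groupby('species').filter(lambda df: df.shape[0] >= 100)
```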

No more 'Chinstrap's!

### Grouping with multiple columns

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.
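For instance, grouping on both 'species' and 'island' (made-up data) produces one group per unique pair:

```python
import pandas as pd

# Made-up rows covering three (species, island) combinations.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'island': ['Dream', 'Torgersen', 'Biscoe'],
    'body_mass_g': [3750, 3800, 5000],
})

# One group per unique (species, island) combination.
out = penguins.groupby(['species', 'island'])['body_mass_g'].mean()
```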

### Grouping and indexes

• The groupby method creates an index based on the specified columns.
• When grouping by multiple columns, the resulting DataFrame has a MultiIndex.
• Advice: When working with a MultiIndex, use reset_index or set as_index=False in groupby.
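Both options are sketched below on made-up data; each keeps the grouping columns as regular columns instead of a MultiIndex:

```python
import pandas as pd

# Made-up data with one row per (species, island) pair.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'island': ['Dream', 'Biscoe'],
    'body_mass_g': [3750, 5000],
})

# Option 1: as_index=False keeps the grouping columns as regular columns.
flat = penguins.groupby(['species', 'island'], as_index=False)['body_mass_g'].mean()

# Option 2: group normally, then reset_index to flatten the MultiIndex.
also_flat = (
    penguins.groupby(['species', 'island'])['body_mass_g'].mean().reset_index()
)
```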

## Summary, next time

### Summary, next time

• Grouping allows us to change the level of granularity in a DataFrame.
• Grouping involves three steps – split, apply, and combine.
• The groupby method returns a DataFrameGroupBy object, which creates one group for every unique combination of values in the column(s) being grouped on.
• Most often, we will use an aggregation method on a DataFrameGroupBy object, but we can also use transform, filter, or the more general apply methods. Each one of these methods acts on each group individually.
• Next time: pivot and pivot_table. Simpson's paradox.