from dsc80_utils import *
# Pandas Tutor setup
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Announcements 📣¶
- Lab 1 is due tomorrow at 11:59pm.
- The Welcome Survey is part of your Lab 1; we won't count your Lab 1 as submitted unless your Welcome Survey is also submitted.
- Project 1 is released.
- The checkpoint (Questions 1-7) is due on Tuesday, Oct 8th.
- The full project is due on Tuesday, Oct 15th.
- Lab 2 will be released tomorrow.
Agenda¶
- Adding and modifying columns
- Data granularity and the
groupby
method. DataFrameGroupBy
objects and aggregation.- Other
DataFrameGroupBy
methods. - Pivot tables using the
pivot_table
method.
You will need to code a lot today – make sure to pull the course repository
I will code a lot too, and the filled-in slides are posted online.
Question 🤔 (Answer at dsc80.com/q)
Code: hello
A question for you now: Have you set up your development environment? Have you started on Lab 1?
- A. Yes, I have set up my development environment. I've started Lab 1.
- B. Yes, I have set up my development environment. I haven't started Lab 1.
- C. No, I am running into difficulties setting up my environment.
- D. No, I should really start getting set up!
Adding and modifying columns¶
Adding and modifying columns, using a copy¶
- To add a new column to a DataFrame, use the
assign
method.- To change the values in a column, add a new column with the same name as the existing column.
- Like most
pandas
methods,assign
returns a new DataFrame.- Pro ✅: This doesn't inadvertently change any existing variables.
- Con ❌: It is not very space efficient, as it creates a new copy each time it is called.
dogs = pd.read_csv(Path('data') / 'dogs43.csv', index_col='breed')
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
💡 Pro-Tip: Method chaining¶
Chain methods together instead of writing long, hard-to-read lines.
# Finds the rows corresponding to the five cheapest to own breeds on a per-year basis.
(dogs
.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
.sort_values('cost_per_year')
.iloc[:5]
)
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 | 1557.88 |
Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.50 | 1582.69 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 | 1589.02 |
Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 | 1590.91 |
Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 | 1602.42 |
💡 Pro-Tip: assign
for column names with special characters¶
You can also use assign
when the desired column name has spaces (and other special characters) by unpacking a dictionary:
dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
kind | lifetime_cost | longevity | size | weight | height | cost per year 💵 | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Adding and modifying columns, in-place¶
- You can assign a new column to a DataFrame in-place using
[]
.- This works like dictionary assignment.
- This modifies the underlying DataFrame, unlike
assign
, which returns a new DataFrame.
- This is the more "common" way of adding/modifying columns.
- ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Note that we never reassigned dogs_copy
in the cell above – that is, we never wrote dogs_copy = ...
– though it was still modified.
Mutability¶
DataFrames, like lists, arrays, and dictionaries, are mutable. As you learned in DSC 20, this means that they can be modified after being created. (For instance, the list .append
method mutates in-place.)
Not only does this explain the behavior on the previous slide, but it also explains the following:
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
def cost_in_thousands():
dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
# What happens when we run this twice?
cost_in_thousands()
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22.59 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21.99 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18.99 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13.94 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13.58 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20.02 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
⚠️ Avoid mutation when possible¶
Note that dogs_copy
was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!
To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy()
as the first line in functions that take DataFrames as input.
Also, some methods let you use the inplace=True
argument to mutate the original. Don't use this argument, since future pandas
releases plan to remove it.
pandas
and numpy
¶
pandas
is built upon numpy
!¶
- A Series in
pandas
is anumpy
array with an index. - A DataFrame is like a dictionary of columns, each of which is a
numpy
array. - Many operations in
pandas
are fast because they usenumpy
's implementations, which are written in fast languages like C. - If you need access the array underlying a DataFrame or Series, use the
to_numpy
method.
dogs['lifetime_cost']
breed Brittany 22589.0 Cairn Terrier 21992.0 English Cocker Spaniel 18993.0 ... Bullmastiff 13936.0 Mastiff 13581.0 Saint Bernard 20022.0 Name: lifetime_cost, Length: 43, dtype: float64
dogs['lifetime_cost'].to_numpy()
array([22589., 21992., 18993., ..., 13936., 13581., 20022.])
Axes¶
- The rows and columns of a DataFrame are both stored as Series.
- The axis specifies the direction of a slice of a DataFrame.
- Axis 0 refers to the index (rows).
- Axis 1 refers to the columns.
- These are the same axes definitions that 2D
numpy
arrays have!
DataFrame methods with axis
¶
- Many Series methods work on DataFrames.
- In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an
axis
argument; the default is usuallyaxis=0
.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Max element in each column.
dogs.max()
kind working lifetime_cost 26686.0 longevity 16.5 size small weight 175.0 height 30.0 dtype: object
# Max element in each row – errors, since there are different types in each row.
# dogs.max(axis=1)
# The number of unique values in each column.
dogs.nunique()
kind 7 lifetime_cost 43 longevity 40 size 3 weight 37 height 30 dtype: int64
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()
lifetime_cost | longevity | weight | height | |
---|---|---|---|---|
count | 43.00 | 43.00 | 43.00 | 43.00 |
mean | 20532.84 | 11.34 | 49.35 | 18.34 |
std | 3290.78 | 2.05 | 39.42 | 6.83 |
... | ... | ... | ... | ... |
50% | 21006.00 | 11.81 | 36.50 | 18.50 |
75% | 22072.50 | 12.52 | 67.50 | 25.00 |
max | 26686.00 | 16.50 | 175.00 | 30.00 |
8 rows × 4 columns
Data granularity and the groupby
method¶
Example: Palmer Penguins¶
The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
width=560, height=315)
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin. Data formatted in this way is called tidy data.
Granularity¶
- Granularity refers to what each observation in a dataset represents.
- Fine: small details.
- Coarse: bigger picture.
- If you can control how your dataset is created, you should opt for finer granularity, i.e. for more detail.
- You can always remove details, but it's difficult to add detail that isn't already there.
- But obtaining fine-grained data can take more time/money.
- Today, we'll focus on how to remove details from fine-grained data, in order to help us understand bigger-picture trends in our data.
Aggregating¶
Aggregating is the act of combining many values into a single value.
- What is the mean
'body_mass_g'
for all penguins?
penguins['body_mass_g'].mean()
np.float64(4207.057057057057)
- What is the mean
'body_mass_g'
for each species?
# ???
Naive approach: looping through unique values¶
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
Adelie 3706.16 Chinstrap 3733.09 Gentoo 5092.44 dtype: float64
- For each unique
'species'
, we make a pass through the entire dataset.- The asymptotic runtime of this procedure is $\Theta(ns)$, where $n$ is the number of rows and $s$ is the number of unique species.
- While there are other loop-based solutions that only involve a single pass over the DataFrame, we'd like to avoid Python loops entirely, as they're slow.
Grouping¶
A better solution, as we know from DSC 10, is to use the groupby
method.
# Before:
penguins['body_mass_g'].mean()
np.float64(4207.057057057057)
# After:
penguins.groupby('species')['body_mass_g'].mean()
species Adelie 3706.16 Chinstrap 3733.09 Gentoo 5092.44 Name: body_mass_g, dtype: float64
Somehow, the groupby
method computes what we're looking for in just one line. How?
%%pt
penguins.groupby('species')['body_mass_g'].mean()
"Split-apply-combine" paradigm¶
The groupby
method involves three steps: split, apply, and combine. This is the same terminology that the pandas
documentation uses.
Split breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.
Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.
Combine stitches the results of these operations into an output DataFrame.
The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.
More examples¶
Before we dive into the internals, let's look at a few more examples.
Question 🤔 (Answer at dsc80.com/q)
Code: dream
What proportion of penguins of each 'species'
live on 'Dream'
island?
Your output should look like:
species
Adelie 0.38
Chinstrap 1.00
Gentoo 0.00
# Fill this in, then respond on dsc80.com/q
DataFrameGroupBy
objects and aggregation¶
DataFrameGroupBy
objects¶
We've just evaluated a few expressions of the following form.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.82 Chinstrap 48.83 Gentoo 47.57 Name: bill_length_mm, dtype: float64
There are two method calls in the expression above: .groupby('species')
and .mean()
. What happens in the .groupby()
call?
penguins.groupby('species')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x31dccaa80>
Peeking under the hood¶
If df
is a DataFrame, then df.groupby(key)
returns a DataFrameGroupBy
object.
This object represents the "split" in "split-apply-combine".
# Simplified DataFrame for demonstration:
penguins_small = penguins.iloc[[0, 150, 300, 1, 251, 151, 301], [0, 5, 6]]
penguins_small
species | body_mass_g | sex | |
---|---|---|---|
0 | Adelie | 3750.0 | Male |
156 | Chinstrap | 3725.0 | Male |
308 | Gentoo | 4875.0 | Female |
1 | Adelie | 3800.0 | Female |
258 | Gentoo | 4350.0 | Female |
157 | Chinstrap | 3950.0 | Female |
309 | Gentoo | 5550.0 | Male |
# Creates one group for each unique value in the species column.
penguin_groups = penguins_small.groupby('species')
penguin_groups
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x31dccbc20>
%%pt
penguin_groups
DataFrameGroupBy
objects have a groups
attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.
penguin_groups.groups
{'Adelie': [0, 1], 'Chinstrap': [156, 157], 'Gentoo': [308, 258, 309]}
DataFrameGroupBy
objects also have a get_group(key)
method, which returns a DataFrame with only the values for the given key.
penguin_groups.get_group('Chinstrap')
species | body_mass_g | sex | |
---|---|---|---|
156 | Chinstrap | 3725.0 | Male |
157 | Chinstrap | 3950.0 | Female |
# Same as the above!
penguins_small.query('species == "Chinstrap"')
species | body_mass_g | sex | |
---|---|---|---|
156 | Chinstrap | 3725.0 | Male |
157 | Chinstrap | 3950.0 | Female |
We usually don't use these attributes and methods, but they're useful in understanding how groupby
works under the hood.
Aggregation¶
Once we create a
DataFrameGroupBy
object, we need to apply some function to each group, and combine the results.The most common operation we apply to each group is an aggregation.
- Remember, aggregation is the act of combining many values into a single value.
To perform an aggregation, use an aggregation method on the
DataFrameGroupBy
object, e.g..mean()
,.max()
, or.median()
.
Let's look at some examples.
penguins_small
species | body_mass_g | sex | |
---|---|---|---|
0 | Adelie | 3750.0 | Male |
156 | Chinstrap | 3725.0 | Male |
308 | Gentoo | 4875.0 | Female |
1 | Adelie | 3800.0 | Female |
258 | Gentoo | 4350.0 | Female |
157 | Chinstrap | 3950.0 | Female |
309 | Gentoo | 5550.0 | Male |
penguins_small.groupby('species')['body_mass_g'].mean()
species Adelie 3775.0 Chinstrap 3837.5 Gentoo 4925.0 Name: body_mass_g, dtype: float64
# Whoa, what happened in the sex column?
penguins_small.groupby('species').sum()
body_mass_g | sex | |
---|---|---|
species | ||
Adelie | 7550.0 | MaleFemale |
Chinstrap | 7675.0 | MaleFemale |
Gentoo | 14775.0 | FemaleFemaleMale |
penguins_small.groupby('species').last()
body_mass_g | sex | |
---|---|---|
species | ||
Adelie | 3800.0 | Female |
Chinstrap | 3950.0 | Female |
Gentoo | 5550.0 | Male |
penguins_small.groupby('species').max()
body_mass_g | sex | |
---|---|---|
species | ||
Adelie | 3800.0 | Male |
Chinstrap | 3950.0 | Male |
Gentoo | 5550.0 | Male |
Column independence¶
Within each group, the aggregation method is applied to each column independently.
penguins_small.groupby('species').max()
body_mass_g | sex | |
---|---|---|
species | ||
Adelie | 3800.0 | Male |
Chinstrap | 3950.0 | Male |
Gentoo | 5550.0 | Male |
It is not telling us that there is a 'Male'
'Adelie'
penguin with a 'body_mass_g'
of 3800.0
!
# This penguin is Female!
penguins_small.loc[(penguins['species'] == 'Adelie') & (penguins['body_mass_g'] == 3800.0)]
species | body_mass_g | sex | |
---|---|---|---|
1 | Adelie | 3800.0 | Female |
Question 🤔 (Answer at dsc80.com/q)
Code: heavy
Find the species
, island
, and body_mass_g
of the heaviest Male
and Female
penguins in penguins
(not penguins_small
).
# Your code goes here.
Column selection and performance implications¶
- By default, the aggregator will be applied to all columns that it can be applied to.
max
,min
, andsum
are defined on strings, whilemedian
andmean
are not.
- If we only care about one column, we can select that column before aggregating to save time.
DataFrameGroupBy
objects support[]
notation, just likeDataFrame
s.
# Back to the big penguins dataset!
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
# Works, but involves wasted effort since the other columns had to be aggregated for no reason.
penguins.groupby('species').sum()['bill_length_mm']
species Adelie 5668.3 Chinstrap 3320.7 Gentoo 5660.6 Name: bill_length_mm, dtype: float64
# This is a SeriesGroupBy object!
penguins.groupby('species')['bill_length_mm']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x31dd512e0>
# Saves time!
penguins.groupby('species')['bill_length_mm'].sum()
species Adelie 5668.3 Chinstrap 3320.7 Gentoo 5660.6 Name: bill_length_mm, dtype: float64
To demonstrate that the former is slower than the latter, we can use %%timeit
. For reference, we'll also include our earlier for
-loop-based solution.
%%timeit
penguins.groupby('species').sum()['bill_length_mm']
259 μs ± 15.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
penguins.groupby('species')['bill_length_mm'].sum()
84.8 μs ± 3.2 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
636 μs ± 21.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Takeaways¶
It's important to understand what each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.
# Slower penguins.groupby('species').sum()['bill_length_mm'] # Faster penguins.groupby('species')['bill_length_mm'].sum()
The
groupby
method is much quicker thanfor
-looping over the DataFrame in Python. It can often produce results using just a single, fast pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.You should always select the columns you want after
groupby
, unless you really know what you're doing!
Beyond default aggregation methods¶
- There are many built-in aggregation methods.
- What if you want to apply different aggregation methods to different columns?
- What if the aggregation method you want to use doesn't already exist in
pandas
?
The aggregate
method¶
- The
DataFrameGroupBy
object has a generalaggregate
method, which aggregates using one or more operations.- Remember, aggregation is the act of combining many values into a single value.
- There are many ways of using
aggregate
; refer to the documentation for a comprehensive list. - Example arguments:
- A single function.
- A list of functions.
- A dictionary mapping column names to functions.
- Per the documentation,
agg
is an alias foraggregate
.
Example¶
How many penguins are there of each 'species'
, and what is the mean 'body_mass_g'
of each 'species'
?
(penguins
.groupby('species')
['body_mass_g']
.aggregate(['count', 'mean'])
)
count | mean | |
---|---|---|
species | ||
Adelie | 146 | 3706.16 |
Chinstrap | 68 | 3733.09 |
Gentoo | 119 | 5092.44 |
Example¶
What is the maximum 'bill_length_mm'
of each 'species'
, and which 'island'
s is each 'species'
found on?
(penguins
.groupby('species')
.aggregate({'bill_length_mm': 'max', 'island': 'unique'})
)
bill_length_mm | island | |
---|---|---|
species | ||
Adelie | 46.0 | [Torgersen, Biscoe, Dream] |
Chinstrap | 58.0 | [Dream] |
Gentoo | 59.6 | [Biscoe] |
Example¶
What is the interquartile range of the 'body_mass_g'
of each 'species'
?
# Here, the argument to agg is a function,
# which takes in a pd.Series and returns a scalar.
def iqr(s):
return np.percentile(s, 75) - np.percentile(s, 25)
(penguins
.groupby('species')
['body_mass_g']
.agg(iqr)
)
species Adelie 637.5 Chinstrap 462.5 Gentoo 800.0 Name: body_mass_g, dtype: float64
Other DataFrameGroupBy
methods¶
Split-apply-combine, revisited¶
When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.
Instead of aggregating during the apply step, we could instead perform a:
Transformation, in which we perform operations to every value within each group.
Filtration, in which we keep only the groups that satisfy some condition.
Transformations¶
Suppose we want to convert the 'body_mass_g'
column to to z-scores (i.e. standard units):
$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
def z_score(x):
return (x - x.mean()) / x.std(ddof=0)
z_score(penguins['body_mass_g'])
0 -0.57 1 -0.51 2 -1.19 ... 341 1.92 342 1.23 343 1.48 Name: body_mass_g, Length: 333, dtype: float64
Transformations within groups¶
Now, what if we wanted the z-score within each group?
To do so, we can use the
transform
method on aDataFrameGroupBy
object. Thetransform
method takes in a function, which itself takes in a Series and returns a new Series.A transformation produces a DataFrame or Series of the same size – it is not an aggregation!
z_mass = (penguins
.groupby('species')
['body_mass_g']
.transform(z_score))
z_mass
0 0.10 1 0.21 2 -1.00 ... 341 1.32 342 0.22 343 0.62 Name: body_mass_g, Length: 333, dtype: float64
penguins.assign(z_mass=z_mass)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | z_mass | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0.10 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0.21 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -1.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.32 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.22 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | 0.62 |
333 rows × 8 columns
display_df(penguins.assign(z_mass=z_mass), rows=8)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | z_mass | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0.10 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0.21 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -1.00 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | -0.56 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | -0.49 |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.32 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.22 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | 0.62 |
333 rows × 8 columns
Note that above, penguin 340 has a larger 'body_mass_g'
than penguin 0, but a lower 'z_mass'
.
- Penguin 0 has an above average
'body_mass_g'
among'Adelie'
penguins. - Penguin 340 has a below average
'body_mass_g'
among'Gentoo'
penguins. Remember from earlier that the average'body_mass_g'
of'Gentoo'
penguins is much higher than for other species.
penguins.groupby('species')['body_mass_g'].mean()
species Adelie 3706.16 Chinstrap 3733.09 Gentoo 5092.44 Name: body_mass_g, dtype: float64
Filtering groups¶
To keep only the groups that satisfy a particular condition, use the
filter
method on aDataFrameGroupBy
object.The
filter
method takes in a function, which itself takes in a DataFrame/Series and return a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returnedTrue
.
For example, suppose we want only the 'species'
whose average 'bill_length_mm'
is above 39.
(penguins
.groupby('species')
.filter(lambda df: df['bill_length_mm'].mean() > 39)
)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
No more 'Adelie'
s!
Or, as another example, suppose we only want 'species'
with at least 100 penguins:
(penguins
.groupby('species')
.filter(lambda df: df.shape[0] > 100)
)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
265 rows × 7 columns
No more 'Chinstrap'
s!
Question 🤔 (Answer at dsc80.com/q)
Code: aggs
Answer the following questions about grouping:
- In
.agg(fn)
, what is the input tofn
? What is the output offn
? - In
.transform(fn)
, what is the input tofn
? What is the output offn
? - In
.filter(fn)
, what is the input tofn
? What is the output offn
?
Grouping with multiple columns¶
When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
species_and_island = (
penguins
.groupby(['species', 'island'])
[['bill_length_mm', 'body_mass_g']]
.mean()
)
species_and_island
bill_length_mm | body_mass_g | ||
---|---|---|---|
species | island | ||
Adelie | Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 | |
Torgersen | 39.04 | 3708.51 | |
Chinstrap | Dream | 48.83 | 3733.09 |
Gentoo | Biscoe | 47.57 | 5092.44 |
Grouping and indexes¶
- The
groupby
method creates an index based on the specified columns. - When grouping by multiple columns, the resulting DataFrame has a
MultiIndex
. - Advice: When working with a
MultiIndex
, usereset_index
or setas_index=False
ingroupby
.
species_and_island
bill_length_mm | body_mass_g | ||
---|---|---|---|
species | island | ||
Adelie | Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 | |
Torgersen | 39.04 | 3708.51 | |
Chinstrap | Dream | 48.83 | 3733.09 |
Gentoo | Biscoe | 47.57 | 5092.44 |
species_and_island['body_mass_g']
species island Adelie Biscoe 3709.66 Dream 3701.36 Torgersen 3708.51 Chinstrap Dream 3733.09 Gentoo Biscoe 5092.44 Name: body_mass_g, dtype: float64
species_and_island.loc['Adelie']
bill_length_mm | body_mass_g | |
---|---|---|
island | ||
Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 |
Torgersen | 39.04 | 3708.51 |
species_and_island.loc[('Adelie', 'Torgersen')]
bill_length_mm 39.04 body_mass_g 3708.51 Name: (Adelie, Torgersen), dtype: float64
species_and_island.reset_index()
species | island | bill_length_mm | body_mass_g | |
---|---|---|---|---|
0 | Adelie | Biscoe | 38.98 | 3709.66 |
1 | Adelie | Dream | 38.52 | 3701.36 |
2 | Adelie | Torgersen | 39.04 | 3708.51 |
3 | Chinstrap | Dream | 48.83 | 3733.09 |
4 | Gentoo | Biscoe | 47.57 | 5092.44 |
(penguins
.groupby(['species', 'island'], as_index=False)
[['bill_length_mm', 'body_mass_g']]
.mean()
)
species | island | bill_length_mm | body_mass_g | |
---|---|---|---|---|
0 | Adelie | Biscoe | 38.98 | 3709.66 |
1 | Adelie | Dream | 38.52 | 3701.36 |
2 | Adelie | Torgersen | 39.04 | 3708.51 |
3 | Chinstrap | Dream | 48.83 | 3733.09 |
4 | Gentoo | Biscoe | 47.57 | 5092.44 |
Question 🤔 (Answer at dsc80.com/q)
Code: names
Find the most popular Male
and Female
baby Name
for each Year
in baby
. Exclude Year
s where there were fewer than 1 million births recorded.
baby_path = Path('data') / 'baby.csv'
baby = pd.read_csv(baby_path)
baby
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2085158 rows × 4 columns
# Your code goes here.
Pivot tables using the pivot_table
method¶
Pivot tables: an extension of grouping¶
Pivot tables are a compact way to display tables for humans to read:
Sex | F | M |
---|---|---|
Year | ||
2018 | 1698373 | 1813377 |
2019 | 1675139 | 1790682 |
2020 | 1612393 | 1721588 |
2021 | 1635800 | 1743913 |
2022 | 1628730 | 1733166 |
- Notice that each value in the table is a sum over the counts, split by year and sex.
- You can think of pivot tables as grouping using two columns, then "pivoting" one of the group labels into columns.
pivot_table
¶
The pivot_table
DataFrame method aggregates a DataFrame using two columns. To use it:
df.pivot_table(index=index_col,
columns=columns_col,
values=values_col,
aggfunc=func)
The resulting DataFrame will have:
- One row for every unique value in
index_col
. - One column for every unique value in
columns_col
. - Values determined by applying
func
on values invalues_col
.
last_5_years = baby.query('Year >= 2018')
last_5_years
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
159444 | Zyrie | M | 5 | 2018 |
159445 | Zyron | M | 5 | 2018 |
159446 | Zzyzx | M | 5 | 2018 |
159447 rows × 4 columns
last_5_years.pivot_table(
index='Year',
columns='Sex',
values='Count',
aggfunc='sum',
)
Sex | F | M |
---|---|---|
Year | ||
2018 | 1698373 | 1813377 |
2019 | 1675139 | 1790682 |
2020 | 1612393 | 1721588 |
2021 | 1635800 | 1743913 |
2022 | 1628730 | 1733166 |
# Look at the similarity to the snippet above!
(last_5_years
.groupby(['Year', 'Sex'])
[['Count']]
.sum()
)
Count | ||
---|---|---|
Year | Sex | |
2018 | F | 1698373 |
M | 1813377 | |
2019 | F | 1675139 |
... | ... | ... |
2021 | M | 1743913 |
2022 | F | 1628730 |
M | 1733166 |
10 rows × 1 columns
Example¶
Find the number of penguins per 'island'
and 'species'
.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
penguins.pivot_table(
index='species',
columns='island',
values='bill_length_mm', # Choice of column here doesn't actually matter!
aggfunc='count',
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44.0 | 55.0 | 47.0 |
Chinstrap | NaN | 68.0 | NaN |
Gentoo | 119.0 | NaN | NaN |
Note that there is a NaN
at the intersection of 'Biscoe'
and 'Chinstrap'
, because there were no Chinstrap penguins on Biscoe Island.
We can either use the fillna
method afterwards or the fill_value
argument to fill in NaN
s.
penguins.pivot_table(
index='species',
columns='island',
values='bill_length_mm',
aggfunc='count',
fill_value=0,
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44 | 55 | 47 |
Chinstrap | 0 | 68 | 0 |
Gentoo | 119 | 0 | 0 |
Granularity, revisited¶
Take another look at the pivot table from the previous slide. Each row of the original penguins
DataFrame represented a single penguin, and each column represented features of the penguins.
What is the granularity of the DataFrame below?
penguins.pivot_table(
index='species',
columns='island',
values='bill_length_mm',
aggfunc='count',
fill_value=0,
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44 | 55 | 47 |
Chinstrap | 0 | 68 | 0 |
Gentoo | 119 | 0 | 0 |
Reshaping¶
pivot_table
reshapes DataFrames from "long" to "wide".- Other DataFrame reshaping methods:
melt
: Un-pivots a DataFrame. Very useful in data cleaning.pivot
: Likepivot_table
, but doesn't do aggregation.stack
: Pivots multi-level columns to multi-indices.unstack
: Pivots multi-indices to columns.- Google and the documentation are your friends!
We will most likely end lecture here.
Distributions¶
Joint distribution¶
When using aggfunc='count'
, a pivot table describes the joint distribution of two categorical variables. This is also called a contingency table.
counts = penguins.pivot_table(
index='species',
columns='sex',
values='body_mass_g',
aggfunc='count',
fill_value=0
)
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as probabilities that a randomly selected penguin from the dataset belongs to a given combination of species and sex.
joint = counts / counts.sum().sum()
joint
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.22 | 0.22 |
Chinstrap | 0.10 | 0.10 |
Gentoo | 0.17 | 0.18 |
Marginal probabilities¶
If we sum over one of the axes, we can compute marginal probabilities, i.e. unconditional probabilities.
joint
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.22 | 0.22 |
Chinstrap | 0.10 | 0.10 |
Gentoo | 0.17 | 0.18 |
# Recall, joint.sum(axis=0) sums across the rows,
# which computes the sum of the **columns**.
joint.sum(axis=0)
sex Female 0.5 Male 0.5 dtype: float64
joint.sum(axis=1)
species Adelie 0.44 Chinstrap 0.20 Gentoo 0.36 dtype: float64
For instance, the second Series tells us that a randomly selected penguin has a 0.36 chance of being of species 'Gentoo'
.
Conditional probabilities¶
Using counts
, how might we compute conditional probabilities like $$P(\text{species } = \text{"Adelie"} \mid \text{sex } = \text{"Female"})?$$
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
$$\begin{align*} P(\text{species} = c \mid \text{sex} = x) &= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)} \end{align*}$$
➡️ Click here to see more of a derivation.
$$\begin{align*} P(\text{species} = c \mid \text{sex} = x) &= \frac{P(\text{species} = c \text{ and } \text{sex} = x)}{P(\text{sex = }x)} \\ &= \frac{\frac{\# \: (\text{species } = \: c \text{ and } \text{sex } = \: x)}{N}}{\frac{\# \: (\text{sex } = \: x)}{N}} \\ &= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)} \end{align*}$$Answer: To find conditional probabilities of 'species'
given 'sex'
, divide by column sums. To find conditional probabilities of 'sex'
given 'species'
, divide by row sums.
Conditional probabilities¶
To find conditional probabilities of 'species'
given 'sex'
, divide by column sums. To find conditional probabilities of 'sex'
given 'species'
, divide by row sums.
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
counts.sum(axis=0)
sex Female 165 Male 168 dtype: int64
The conditional distribution of 'species'
given 'sex'
is below. Note that in this new DataFrame, the 'Female'
and 'Male'
columns each sum to 1.
counts / counts.sum(axis=0)
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.44 | 0.43 |
Chinstrap | 0.21 | 0.20 |
Gentoo | 0.35 | 0.36 |
For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of 'species'
'Adelie'
given that they are of 'sex'
'Female'
is 0.442424.
Question 🤔 (Answer at dsc80.com/q)
Code: dist
Find the conditional distribution of 'sex'
given 'species'
.
Hint: Use .T
.
# Your code goes here.
Summary, next time¶
Summary¶
- Grouping allows us to change the level of granularity in a DataFrame.
- Grouping involves three steps – split, apply, and combine.
- Usually, what is applied is an aggregation, but it could be a transformation or filtration.
pivot_table
aggregates data based on two categorical columns, and reshapes the result to be "wide" instead of "long".
Next time¶
- Simpson's paradox.
- Merging.
- Review this diagram from DSC 10!
- The pitfalls of the
apply
method.