In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_rows', 9)

Lecture 2 – DataFrame Fundamentals¶

DSC 80, Winter 2023¶

Announcements 📣¶

  • Lab 1 is released, and is due on Wednesday, January 18th at 4PM (no slip days)!
    • Watch this video 🎥 for tips on how to set up your environment and work on assignments.
    • If you set up your environment before the lab was released, you may have to recreate your conda environment – see the Tech Support page for instructions, and this Ed post for debugging.
  • Discussion starts next week.
  • Lecture recordings are available here.
  • Make sure to fill out the Welcome Survey.

Agenda¶

  • Introduction to pandas.
  • Selecting columns.
    • get vs. [].
    • Useful Series methods.
  • Selecting rows (and columns).
    • loc and iloc.
    • Querying.

Remember, we are not going to cover every single detail! The pandas documentation will be your friend.

Introduction to pandas 🐼¶

pandas¶

  • pandas is the Python library for tabular data manipulation.
  • Before pandas was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
  • Wes McKinney, the original developer of pandas, wanted a library which would allow everything to be done in Python.
    • Python is faster to develop in than Java, and is more general-purpose than R.

pandas data structures¶

There are three key data structures at the core of pandas:

  • DataFrame: 2 dimensional tables.
  • Series: 1 dimensional array-like object, typically representing a column or row.
  • Index: sequence of column or row labels.

Importing pandas and related libraries¶

We've already run this at the top of the notebook, so we won't repeat it here. But pandas is almost always imported in conjunction with numpy:


import pandas as pd
import numpy as np

Example: Universities in California 📚¶

To refresh our memory on the basics of pandas, let's work with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.

  • We use pd.read_csv to load a DataFrame from file.
  • Aside: os.path.join('x', 'y.csv') evaluates to 'x/y.csv' on Unix machines and 'x\y.csv' on Windows.
In [2]:
schools_path = os.path.join('data', 'california_universities.csv')
schools = pd.read_csv(schools_path)
schools
Out[2]:
Name City County Enrollment Founded
0 Humboldt State University Arcata Humboldt 7,774 1913
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
2 University of California, Berkeley Berkeley Alameda 42,519 1869
3 California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ... ...
28 University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
29 California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
31 California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 5 columns

Exploring a new DataFrame¶

To extract the first or last few rows of a DataFrame, use the head or tail methods.

In [3]:
schools.head()
Out[3]:
Name City County Enrollment Founded
0 Humboldt State University Arcata Humboldt 7,774 1913
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
2 University of California, Berkeley Berkeley Alameda 42,519 1869
3 California State University Channel Islands Camarillo Ventura 7,095 2002
4 California State University, Dominguez Hills Carson Los Angeles 15,741 1960
In [4]:
schools.tail(2)
Out[4]:
Name City County Enrollment Founded
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
31 California State University Maritime Academy Vallejo Solano 1,017 1929

The shape attribute returns the DataFrame's number of rows and columns.

In [5]:
schools.shape
Out[5]:
(32, 5)

The anatomy of a DataFrame¶

Each row and column of a DataFrame is a Series.

  • Note that Series' look like arrays, but contain an index.
  • The column labels and row labels are each stored as Index types.
  • You can think of a DataFrame as a dictionary that maps column names to Series.
In [6]:
schools
Out[6]:
Name City County Enrollment Founded
0 Humboldt State University Arcata Humboldt 7,774 1913
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
2 University of California, Berkeley Berkeley Alameda 42,519 1869
3 California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ... ...
28 University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
29 California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
31 California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 5 columns

In [7]:
# Index is 0, 1, 2, ..., 31
schools['City']
Out[7]:
0             Arcata
1        Bakersfield
2           Berkeley
3          Camarillo
           ...      
28        Santa Cruz
29    Seaside-Marina
30           Turlock
31           Vallejo
Name: City, Length: 32, dtype: object
In [8]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
schools.index
Out[8]:
RangeIndex(start=0, stop=32, step=1)
In [9]:
# Index is 'Name', 'City', 'County', ...
schools.iloc[-5]
Out[9]:
Name          University of California, Santa Barbara
City                                    Santa Barbara
County                                  Santa Barbara
Enrollment                                     24,346
Founded                                          1891
Name: 27, dtype: object
In [10]:
schools.columns
Out[10]:
Index(['Name', 'City', 'County', 'Enrollment', 'Founded'], dtype='object')

Sorting¶

The order of the rows in schools does not seem to be meaningful right now. To sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.

In [11]:
schools
Out[11]:
Name City County Enrollment Founded
0 Humboldt State University Arcata Humboldt 7,774 1913
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
2 University of California, Berkeley Berkeley Alameda 42,519 1869
3 California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ... ...
28 University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
29 California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
31 California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 5 columns

In [12]:
schools.sort_values('Founded')
Out[12]:
Name City County Enrollment Founded
24 San Jose State University San Jose Santa Clara 32,828 1857
2 University of California, Berkeley Berkeley Alameda 42,519 1869
13 University of California, Los Angeles Los Angeles Los Angeles 45,428 1882
5 California State University, Chico Chico Butte 17,488 1887
... ... ... ... ... ...
26 California State University San Marcos San Marcos San Diego 14,511 1988
29 California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
3 California State University Channel Islands Camarillo Ventura 7,095 2002
14 University of California, Merced Merced Merced 8,544 2005

32 rows × 5 columns

In [13]:
schools.sort_values('Name', ascending=False)
Out[13]:
Name City County Enrollment Founded
28 University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
27 University of California, Santa Barbara Santa Barbara Santa Barbara 24,346 1891
21 University of California, San Diego San Diego San Diego 38,798 1960
17 University of California, Riverside Riverside Riverside 23,278 1954
... ... ... ... ... ...
31 California State University Maritime Academy Vallejo Solano 1,017 1929
3 California State University Channel Islands Camarillo Ventura 7,095 2002
16 California State Polytechnic University, Pomona Pomona Los Angeles 26,443 1938
25 California Polytechnic State University San Luis Obispo San Luis Obispo 21,812 1901

32 rows × 5 columns

In [14]:
# Why isn't this sorting correctly?
schools.sort_values('Enrollment')
Out[14]:
Name City County Enrollment Founded
31 California State University Maritime Academy Vallejo Solano 1,017 1929
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
26 California State University San Marcos San Marcos San Diego 14,511 1988
... ... ... ... ... ...
3 California State University Channel Islands Camarillo Ventura 7,095 2002
0 Humboldt State University Arcata Humboldt 7,774 1913
14 University of California, Merced Merced Merced 8,544 2005
18 Sonoma State University Rohnert Park Sonoma 9,201 1960

32 rows × 5 columns

Setting the index¶

Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.

In [15]:
# By reassigning schools, our changes will persist.
schools = schools.set_index('Name')
schools
Out[15]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [16]:
# Only 4 columns now!
schools.shape
Out[16]:
(32, 4)

Selecting columns¶

Selecting columns in babypandas 👶🐼¶

  • In babypandas, you selected columns using the .get method.
  • .get also works in pandas, but it is not idiomatic – people don't usually use it.
In [17]:
schools
Out[17]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [18]:
schools.get('County')
Out[18]:
Name
Humboldt State University                         Humboldt
California State University, Bakersfield              Kern
University of California, Berkeley                 Alameda
California State University Channel Islands        Ventura
                                                   ...    
University of California, Santa Cruz            Santa Cruz
California State University, Monterey Bay         Monterey
California State University, Stanislaus         Stanislaus
California State University Maritime Academy        Solano
Name: County, Length: 32, dtype: object
In [19]:
# This doesn't error, but sometimes we'd like it to.
schools.get('State')

Selecting columns with []¶

  • The standard way to select a column in pandas is by using the [] operator.
    • Think of a DataFrame as a dictionary of arrays!
  • Specifying a column name returns the column as a Series.
  • Specifying a list of column names returns a DataFrame.
In [20]:
schools
Out[20]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [21]:
# Returns a Series.
schools['City']
Out[21]:
Name
Humboldt State University                               Arcata
California State University, Bakersfield           Bakersfield
University of California, Berkeley                    Berkeley
California State University Channel Islands          Camarillo
                                                     ...      
University of California, Santa Cruz                Santa Cruz
California State University, Monterey Bay       Seaside-Marina
California State University, Stanislaus                Turlock
California State University Maritime Academy           Vallejo
Name: City, Length: 32, dtype: object
In [22]:
# Returns a DataFrame.
schools[['Founded', 'County']]
Out[22]:
Founded County
Name
Humboldt State University 1913 Humboldt
California State University, Bakersfield 1965 Kern
University of California, Berkeley 1869 Alameda
California State University Channel Islands 2002 Ventura
... ... ...
University of California, Santa Cruz 1965 Santa Cruz
California State University, Monterey Bay 1994 Monterey
California State University, Stanislaus 1957 Stanislaus
California State University Maritime Academy 1929 Solano

32 rows × 2 columns

In [23]:
# 🤔
schools[['Founded']]
Out[23]:
Founded
Name
Humboldt State University 1913
California State University, Bakersfield 1965
University of California, Berkeley 1869
California State University Channel Islands 2002
... ...
University of California, Santa Cruz 1965
California State University, Monterey Bay 1994
California State University, Stanislaus 1957
California State University Maritime Academy 1929

32 rows × 1 columns

In [24]:
# Names are stored in the index, which is not a column!
schools['Name']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance)
   3360 try:
-> 3361     return self._engine.get_loc(casted_key)
   3362 except KeyError as err:

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Name'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[24], line 2
      1 # Names are stored in the index, which is not a column!
----> 2 schools['Name']

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key)
   3456 if self.columns.nlevels > 1:
   3457     return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
   3459 if is_integer(indexer):
   3460     indexer = [indexer]

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance)
   3361         return self._engine.get_loc(casted_key)
   3362     except KeyError as err:
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

KeyError: 'Name'
In [25]:
schools.index
Out[25]:
Index(['Humboldt State University', 'California State University, Bakersfield',
       'University of California, Berkeley',
       'California State University Channel Islands',
       'California State University, Dominguez Hills',
       'California State University, Chico', 'University of California, Davis',
       'California State University, Fresno',
       'California State University, Fullerton',
       'California State University, East Bay',
       'University of California, Irvine',
       'California State University, Long Beach',
       'California State University, Los Angeles',
       'University of California, Los Angeles',
       'University of California, Merced',
       'California State University, Northridge',
       'California State Polytechnic University, Pomona',
       'University of California, Riverside', 'Sonoma State University',
       'California State University, Sacramento',
       'California State University, San Bernardino',
       'University of California, San Diego', 'San Diego State University',
       'San Francisco State University', 'San Jose State University',
       'California Polytechnic State University',
       'California State University San Marcos',
       'University of California, Santa Barbara',
       'University of California, Santa Cruz',
       'California State University, Monterey Bay',
       'California State University, Stanislaus',
       'California State University Maritime Academy'],
      dtype='object', name='Name')

Useful Series methods¶

There are a variety of useful methods that work on Series. You can see the entire list here. As we'll see next lecture, many of these methods work on DataFrames directly, too – how?

In [26]:
schools
Out[26]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [27]:
# What's the most common county?
schools['County'].mode()
Out[27]:
0    Los Angeles
dtype: object
In [28]:
# How many unique counties are represented?
schools['County'].nunique()
Out[28]:
23
In [29]:
# What's the distribution of counties?
schools['County'].value_counts()
Out[29]:
Los Angeles    6
San Diego      3
Alameda        2
Orange         2
              ..
Yolo           1
Butte          1
Ventura        1
Solano         1
Name: County, Length: 23, dtype: int64
In [30]:
# What's the mean of the 'Founded' column?
schools['Founded'].mean()
Out[30]:
1938.625
In [31]:
# Tell me more about the 'Founded' column.
schools['Founded'].describe()
Out[31]:
count      32.000000
mean     1938.625000
std        38.954688
min      1857.000000
25%      1904.000000
50%      1951.500000
75%      1961.250000
max      2005.000000
Name: Founded, dtype: float64
In [32]:
# Sort the 'Founded' column. Note that here we're using sort_values on a Series, not a DataFrame!
schools['Founded'].sort_values()
Out[32]:
Name
San Jose State University                      1857
University of California, Berkeley             1869
University of California, Los Angeles          1882
California State University, Chico             1887
                                               ... 
California State University San Marcos         1988
California State University, Monterey Bay      1994
California State University Channel Islands    2002
University of California, Merced               2005
Name: Founded, Length: 32, dtype: int64

Selecting rows (and columns)¶

Using loc to select rows using row labels¶

If df is a DataFrame, then:

  • df.loc[idx] returns the Series whose index is idx.
  • df.loc[idx_list] returns a DataFrame containing the rows whose indexes are in idx_list.
In [33]:
schools
Out[33]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [34]:
schools.loc['University of California, San Diego']
Out[34]:
City          San Diego
County        San Diego
Enrollment       38,798
Founded            1960
Name: University of California, San Diego, dtype: object
In [35]:
schools.loc[['University of California, San Diego', 'San Diego State University']]
Out[35]:
City County Enrollment Founded
Name
University of California, San Diego San Diego San Diego 38,798 1960
San Diego State University San Diego San Diego 34,881 1897

Boolean indexing¶

  • The loc operator also supports Boolean sequences (lists, arrays, Series) as input.
  • The length of the sequence must be the same as the number of rows in the DataFrame.
  • The result is a filtered DataFrame, containing only the rows in which the sequence contained True.
In [36]:
schools
Out[36]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [37]:
random_bools = np.random.choice([True, False], 32)
random_bools
Out[37]:
array([ True,  True, False,  True,  True, False,  True,  True,  True,
       False, False,  True,  True,  True, False, False,  True, False,
        True, False, False,  True,  True, False, False, False,  True,
       False,  True,  True,  True,  True])
In [38]:
schools.loc[random_bools]
Out[38]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
California State University Channel Islands Camarillo Ventura 7,095 2002
California State University, Dominguez Hills Carson Los Angeles 15,741 1960
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

19 rows × 4 columns

Querying¶

  • Comparisons with arrays (Series) result in Boolean arrays (Series).
  • We can use comparisons along with the loc operator to query a DataFrame.
  • Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).
In [39]:
schools
Out[39]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [40]:
schools['Founded'] > 1998
Out[40]:
Name
Humboldt State University                       False
California State University, Bakersfield        False
University of California, Berkeley              False
California State University Channel Islands      True
                                                ...  
University of California, Santa Cruz            False
California State University, Monterey Bay       False
California State University, Stanislaus         False
California State University Maritime Academy    False
Name: Founded, Length: 32, dtype: bool
In [41]:
schools.loc[schools['Founded'] > 1998]
Out[41]:
City County Enrollment Founded
Name
California State University Channel Islands Camarillo Ventura 7,095 2002
University of California, Merced Merced Merced 8,544 2005
In [42]:
schools.loc[schools.index.str.contains('University of California')]
Out[42]:
City County Enrollment Founded
Name
University of California, Berkeley Berkeley Alameda 42,519 1869
University of California, Davis Davis Yolo 39,152 1905
University of California, Irvine Irvine Orange 35,220 1965
University of California, Los Angeles Los Angeles Los Angeles 45,428 1882
University of California, Merced Merced Merced 8,544 2005
University of California, Riverside Riverside Riverside 23,278 1954
University of California, San Diego San Diego San Diego 38,798 1960
University of California, Santa Barbara Santa Barbara Santa Barbara 24,346 1891
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
In [43]:
# Using loc is not strictly necessary when indexing with Boolean sequences.
schools[schools.index.str.contains('University of California')]
Out[43]:
City County Enrollment Founded
Name
University of California, Berkeley Berkeley Alameda 42,519 1869
University of California, Davis Davis Yolo 39,152 1905
University of California, Irvine Irvine Orange 35,220 1965
University of California, Los Angeles Los Angeles Los Angeles 45,428 1882
University of California, Merced Merced Merced 8,544 2005
University of California, Riverside Riverside Riverside 23,278 1954
University of California, San Diego San Diego San Diego 38,798 1960
University of California, Santa Barbara Santa Barbara Santa Barbara 24,346 1891
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965

Note that because we set the index to 'Name' earlier, we can select rows based on school names without having to query.

In [44]:
schools
Out[44]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [45]:
# Series!
schools.loc['University of California, San Diego']
Out[45]:
City          San Diego
County        San Diego
Enrollment       38,798
Founded            1960
Name: University of California, San Diego, dtype: object

If 'Name' was instead a column, then we'd need to query to access information about a particular school.

In [46]:
schools_reset = schools.reset_index()
schools_reset
Out[46]:
Name City County Enrollment Founded
0 Humboldt State University Arcata Humboldt 7,774 1913
1 California State University, Bakersfield Bakersfield Kern 10,493 1965
2 University of California, Berkeley Berkeley Alameda 42,519 1869
3 California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ... ...
28 University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
29 California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
30 California State University, Stanislaus Turlock Stanislaus 10,214 1957
31 California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 5 columns

In [47]:
# DataFrame!
schools_reset[schools_reset['Name'] == 'University of California, San Diego']
Out[47]:
Name City County Enrollment Founded
21 University of California, San Diego San Diego San Diego 38,798 1960

Discussion Question¶

Write an expression that evaluates to the number of UC schools founded after 1950.

In [48]:
...
Out[48]:
Ellipsis

Selecting columns and rows simultaneously¶

So far, we used [] to select columns and loc to select rows.

For instance, to find the cities for all schools in San Diego county:

In [49]:
schools
Out[49]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [50]:
schools.loc[schools['County'] == 'San Diego']['City']
Out[50]:
Name
University of California, San Diego        San Diego
San Diego State University                 San Diego
California State University San Marcos    San Marcos
Name: City, dtype: object

Selecting columns and rows simultaneously¶

loc can also be used to select both rows and columns. The general pattern is:

df.loc[<row selector>, <column selector>]

Examples:

  • df.loc[idx_list, col_list] returns a DataFrame containing the rows in idx_list and columns in col_list.
  • df.loc[bool_arr, col_list] returns a DataFrame contaning the rows for which bool_arr is True and columns in col_list.
  • Both the row and column selectors can be slices, which use : syntax (e.g. 'City': 'Enrollment').

There are many, many more – see the pandas documentation for more.

In [51]:
schools
Out[51]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [52]:
# Find the city and enrollment for all schools in San Diego county.
schools.loc[schools['County'] == 'San Diego', ['City', 'Enrollment']]
Out[52]:
City Enrollment
Name
University of California, San Diego San Diego 38,798
San Diego State University San Diego 34,881
California State University San Marcos San Marcos 14,511
In [53]:
# Find the county, enrollment, and year founded for all schools founded after 1950.
schools.loc[schools['Founded'] > 1950, 'County':]
Out[53]:
County Enrollment Founded
Name
California State University, Bakersfield Kern 10,493 1965
California State University Channel Islands Ventura 7,095 2002
California State University, Dominguez Hills Los Angeles 15,741 1960
California State University, Fullerton Orange 39,774 1957
... ... ... ...
California State University San Marcos San Diego 14,511 1988
University of California, Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Monterey 7,079 1994
California State University, Stanislaus Stanislaus 10,214 1957

16 rows × 3 columns

Don't forget iloc!¶

  • iloc stands for "integer location".
  • iloc is like loc, but it selects rows and columns based off of integer positions only.
In [54]:
schools
Out[54]:
City County Enrollment Founded
Name
Humboldt State University Arcata Humboldt 7,774 1913
California State University, Bakersfield Bakersfield Kern 10,493 1965
University of California, Berkeley Berkeley Alameda 42,519 1869
California State University Channel Islands Camarillo Ventura 7,095 2002
... ... ... ... ...
University of California, Santa Cruz Santa Cruz Santa Cruz 19,700 1965
California State University, Monterey Bay Seaside-Marina Monterey 7,079 1994
California State University, Stanislaus Turlock Stanislaus 10,214 1957
California State University Maritime Academy Vallejo Solano 1,017 1929

32 rows × 4 columns

In [55]:
schools.iloc[3:7, :-1]
Out[55]:
City County Enrollment
Name
California State University Channel Islands Camarillo Ventura 7,095
California State University, Dominguez Hills Carson Los Angeles 15,741
California State University, Chico Chico Butte 17,488
University of California, Davis Davis Yolo 39,152

iloc is often most useful when we sort first. For instance, to find the enrollment of the youngest school in the dataset:

In [56]:
schools.sort_values('Founded', ascending=False)['Enrollment'].iloc[0]
Out[56]:
'8,544'
In [57]:
# Finding the name involves sorting, but not iloc.
schools.sort_values('Founded', ascending=False)['Enrollment'].index[0]
Out[57]:
'University of California, Merced'

More Practice¶

Consider the DataFrame below.

In [58]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack
Out[58]:
1 1
0 fee fo
1 fi fum

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own.

In [59]:
# jack[1]
In [60]:
# jack[[1]]
In [61]:
# jack['1']
In [62]:
# jack[[1, 1]]
In [63]:
# jack.loc[1]
In [64]:
# jack.loc[jack[1] == 'fo']
In [65]:
# jack[1, ['1', 1]]
In [66]:
# jack.loc[1,1]

Summary, next time¶

Summary¶

  • pandas is the library for tabular data manipulation in Python.
  • There are three key data structures in pandas: DataFrame, Series, and Index.
  • Refer to the lecture notebook and the pandas documentation for tips.

Next time¶

  • How pandas and numpy work together (and when they disagree).
  • Performance implications.