# You'll start seeing this cell in most lectures.
# It exists to hide all of the import statements and other setup
# code we need in lecture notebooks.
from dsc80_utils import *
Announcements 📣¶
- Lab 1 is released, and is due Friday, Sept 4 at 11:59pm.
- See the Tech Support page for instructions and watch this video 🎥 for tips on how to set up your environment and work on assignments.
- Please try to set up your computer ASAP so that you have enough time to debug your environment.
- Project 1 will be released by Wednesday.
- Please fill out the Welcome Survey ASAP.
- Lecture recordings are available here, and are linked on the course website.
Agenda¶
- numpy arrays.
- From babypandas to pandas.
  - Deep dive into DataFrames.
- Accessing subsets of rows and columns in DataFrames.
  - .loc and .iloc.
  - Querying (i.e. filtering).
- Adding and modifying columns.
- pandas and numpy.
We can't cover every single detail! The pandas documentation will be your friend.
Throughout lecture, ask questions!¶
- You're always free to ask questions during lecture, and I'll try to stop for them frequently.
- But, you may not feel like asking your question out loud.
- You can type your questions throughout lecture at the following link:
dsc80.com/q
Bookmark it!
- I'll check the form responses periodically.
- You'll also use this form to answer questions that I ask you during lecture.
Question 🤔 (Answer at dsc80.com/q)
dogs = pd.read_csv('data/dogs43.csv')
dogs.head(2)
| | breed | kind | lifetime_cost | longevity | size | weight | height |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
What does this code do?
whoa = np.random.choice([True, False], size=len(dogs))
(dogs[whoa]
.groupby('size')
.max()
.get('longevity')
)
size
large     11.92
medium    13.58
small     16.50
Name: longevity, dtype: float64
numpy arrays¶
numpy overview¶
- numpy stands for "numerical Python". It is a commonly-used Python module that enables fast computation involving arrays and matrices.
- numpy's main object is the array. In numpy, arrays are:
  - Homogeneous – all values are of the same type.
  - (Potentially) multi-dimensional.
- Computation in numpy is fast because:
  - Much of it is implemented in C.
  - numpy arrays are stored more efficiently in memory than, say, Python lists.
- This site provides a good overview of numpy arrays.
We used numpy in DSC 10 to work with sequences of data:
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# The shape (10,) means that the array only has a single dimension,
# of size 10.
arr.shape
(10,)
2 ** arr
array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
Arrays come equipped with several handy methods; some examples are below, but you can read about them all here.
(2 ** arr).sum()
np.int64(1023)
(2 ** arr).mean()
np.float64(102.3)
(2 ** arr).max()
np.int64(512)
(2 ** arr).argmax()
np.int64(9)
⚠️ The dangers of for-loops¶
- for-loops are slow when processing large datasets. You will rarely write for-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!
- One of the biggest benefits of numpy is that it supports vectorized operations.
  - If a and b are two arrays of the same length, then a + b is a new array of the same length containing the element-wise sum of a and b.
- To illustrate how much faster numpy arithmetic is than using a for-loop, let's compute the squares of the integers from 0 up to (but not including) 1,000,000:
  - Using a for-loop.
  - Using vectorized arithmetic, through numpy.
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)
30.3 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In vanilla Python, this takes about 0.03 seconds per loop.
%%timeit
squares = np.arange(1_000_000) ** 2
1.2 ms ± 131 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In numpy, this only takes about 0.001 seconds per loop, roughly 25x faster than the for-loop version above! Note that under the hood, numpy is also using a for-loop, but it's a for-loop implemented in C, which is much faster than Python.
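As a small sketch of the element-wise behavior described above, the cell below checks that vectorized addition gives the same values as an explicit for-loop. The names a, b, loop_sum, and vec_sum are made up for illustration and are not used elsewhere in the lecture.

```python
import numpy as np

# Two hypothetical arrays of the same length.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise sum with an explicit for-loop.
loop_sum = []
for i in range(len(a)):
    loop_sum.append(a[i] + b[i])

# The same computation, vectorized.
vec_sum = a + b
vec_sum  # array([11, 22, 33, 44])

# Both approaches produce the same values.
np.array_equal(vec_sum, np.array(loop_sum))  # True
```

The vectorized version is shorter and, on large arrays, should show the same kind of speedup as the timing cells above.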
Multi-dimensional arrays¶
While we didn't see these very often in DSC 10, multi-dimensional lists/arrays may have since come up in DSC 20, 30, or 40A (especially in the context of linear algebra).
We'll spend a bit of time talking about 2D (and 3D) arrays here, since in some ways, they behave similarly to DataFrames.
Below, we create a 2D array from scratch.
nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])
nums
array([[5, 1, 9, 7], [9, 8, 2, 3], [2, 5, 0, 4]])
# nums has 3 rows and 4 columns.
nums.shape
(3, 4)
We can also create 2D arrays by reshaping other arrays.
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a
array([[1, 2, 3], [4, 5, 6]])
Operations along axes¶
In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).
a
array([[1, 2, 3], [4, 5, 6]])
If we specify axis=0, a.sum will "compress" along axis 0. That is, it collapses the rows and returns one sum per column.
a.sum(axis=0)
array([5, 7, 9])
If we specify axis=1, a.sum will "compress" along axis 1. That is, it collapses the columns and returns one sum per row.
a.sum(axis=1)
array([ 6, 15])
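One way to keep the two axes straight is to look at shapes: the axis that gets "compressed" is the one that disappears. The sketch below rebuilds the same array a and uses the made-up names col_sums and row_sums; the max calls are only there to suggest that other array methods appear to follow the same convention.

```python
import numpy as np

a = np.arange(1, 7).reshape((2, 3))

# axis=0 collapses the rows, leaving one sum per column.
col_sums = a.sum(axis=0)   # array([5, 7, 9])

# axis=1 collapses the columns, leaving one sum per row.
row_sums = a.sum(axis=1)   # array([ 6, 15])

# The compressed axis is dropped from the shape.
a.shape, col_sums.shape, row_sums.shape   # ((2, 3), (3,), (2,))

# max accepts axis in the same way.
a.max(axis=0)   # array([4, 5, 6])
a.max(axis=1)   # array([3, 6])
```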
Selecting rows and columns from 2D arrays¶
You can use [square brackets] to slice rows and columns out of an array, using the same slicing conventions you saw in DSC 20.
a
array([[1, 2, 3], [4, 5, 6]])
# Accesses row 0 and all columns.
a[0, :]
array([1, 2, 3])
# Same as the above.
a[0]
array([1, 2, 3])
# Accesses all rows and column 1.
a[:, 1]
array([2, 5])
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]
array([2, 3])
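A few more slicing patterns, sketched on the nums array from earlier rather than on a. The names last_row, block, and elem are made up for this example.

```python
import numpy as np

nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])

# Negative positions count from the end: this is the last row.
last_row = nums[-1]      # array([2, 5, 0, 4])

# Rows 1 and onwards, first two columns. The result is still 2D.
block = nums[1:, :2]     # array([[9, 8], [2, 5]])

# A single position on both axes returns one element.
elem = nums[0, 2]        # 9

# Every other column of row 1.
nums[1, ::2]             # array([9, 2])
```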
Question 🤔 (Answer at dsc80.com/q)
Try to predict the value of grid[-1, 1:].sum() without running the code below.
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
# grid[-1, 1:].sum()
Question 🤔 (Answer at dsc80.com/q)
Ask ChatGPT:
- To explain what the code above does.
- To tell you what the code outputs.
Example: Image processing¶
numpy arrays are homogeneous and potentially multi-dimensional.
It turns out that images can be represented as 3D numpy arrays. The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Here, each of these varies from 0 to 1, since we divide the raw 0 to 255 values by 255 below.
from PIL import Image
img_path = Path('imgs') / 'bentley.jpg'
img = np.asarray(Image.open(img_path)) / 255
img
array([[[0.4 , 0.33, 0.24], [0.42, 0.35, 0.25], [0.43, 0.36, 0.26], ..., [0.5 , 0.44, 0.36], [0.51, 0.44, 0.36], [0.51, 0.44, 0.36]], [[0.39, 0.33, 0.23], [0.42, 0.36, 0.26], [0.44, 0.37, 0.27], ..., [0.51, 0.44, 0.36], [0.52, 0.45, 0.37], [0.52, 0.45, 0.38]], [[0.38, 0.31, 0.21], [0.41, 0.35, 0.24], [0.44, 0.37, 0.27], ..., [0.52, 0.45, 0.38], [0.53, 0.46, 0.39], [0.53, 0.47, 0.4 ]], ..., [[0.71, 0.64, 0.55], [0.71, 0.65, 0.55], [0.68, 0.62, 0.52], ..., [0.58, 0.49, 0.41], [0.56, 0.47, 0.39], [0.56, 0.47, 0.39]], [[0.5 , 0.44, 0.34], [0.42, 0.37, 0.26], [0.44, 0.38, 0.28], ..., [0.4 , 0.33, 0.25], [0.55, 0.48, 0.4 ], [0.58, 0.5 , 0.42]], [[0.38, 0.33, 0.22], [0.49, 0.44, 0.33], [0.56, 0.51, 0.4 ], ..., [0.15, 0.08, 0. ], [0.28, 0.21, 0.13], [0.42, 0.35, 0.27]]])
img.shape
(200, 263, 3)
plt.imshow(img)
plt.axis('off');
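The shape (200, 263, 3) reads as 200 rows of pixels, 263 columns of pixels, and 3 color values per pixel. To see that layout on something small enough to inspect by hand, here is a sketch using a made-up 2-by-3 image called tiny (not part of the lecture data).

```python
import numpy as np

# A made-up image: 2 rows of pixels, 3 columns of pixels, 3 color channels.
tiny = np.zeros((2, 3, 3))
tiny[0, :, 0] = 1.0     # top row: pure red
tiny[1, :, 2] = 1.0     # bottom row: pure blue

tiny.shape        # (2, 3, 3)

# The RGB values of the top-left pixel.
tiny[0, 0]        # array([1., 0., 0.])

# The red channel alone is a 2D array with one value per pixel.
red = tiny[:, :, 0]
red.shape         # (2, 3)
```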
Applying a greyscale filter¶
One way to convert an image to greyscale is to average its red, green, and blue values.
mean_2d = img.mean(axis=2)
mean_2d
array([[0.32, 0.34, 0.35, ..., 0.43, 0.44, 0.44], [0.31, 0.35, 0.36, ..., 0.44, 0.45, 0.45], [0.3 , 0.33, 0.36, ..., 0.45, 0.46, 0.47], ..., [0.64, 0.64, 0.6 , ..., 0.49, 0.47, 0.47], [0.43, 0.35, 0.37, ..., 0.32, 0.48, 0.5 ], [0.31, 0.42, 0.49, ..., 0.07, 0.21, 0.34]])
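This is just a single channel! Since mean_2d is 2D, plt.imshow colors it using its default colormap instead of showing shades of grey.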
plt.imshow(mean_2d)
plt.axis('off');
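One possible way to get an image that actually renders in grey is to use the averaged value for all three color channels. The sketch below does this on a made-up 2-by-2 image; tiny, grey_2d, and grey_3d are hypothetical names, and np.stack is just one of several numpy functions that could do the copying.

```python
import numpy as np

# A made-up 2x2 RGB image with values between 0 and 1.
tiny = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
])

# Average over the color axis: one brightness value per pixel.
grey_2d = tiny.mean(axis=2)
grey_2d.shape     # (2, 2)

# Copy that value into the red, green, and blue channels.
grey_3d = np.stack([grey_2d, grey_2d, grey_2d], axis=2)
grey_3d.shape     # (2, 2, 3)

# Every pixel now has equal red, green, and blue values,
# which is what makes it a shade of grey.
```

Since plt.imshow interprets an array of shape (rows, columns, 3) as RGB, passing it an array built like grey_3d should display in grey.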