```
# You'll start seeing this cell in most lectures.
# It exists to hide all of the import statements and other setup
# code we need in lecture notebooks.
from dsc80_utils import *
```

### Announcements 📣

- Lab 1 is released, and is due **Friday, Sept 4 at 11:59pm**.
  - See the Tech Support page for instructions and watch this video 🎥 for tips on how to set up your environment and work on assignments.
  - Please try to set up your computer ASAP so that you have enough time to debug your environment.
- Project 1 will be released by Wednesday.
- Please fill out the Welcome Survey **ASAP**.
- Lecture recordings are available here, and are linked on the course website.

### Agenda

- `numpy` arrays.
- From `babypandas` to `pandas`.
  - Deep dive into DataFrames.
- Accessing subsets of rows and columns in DataFrames.
  - `.loc` and `.iloc`.
  - Querying (i.e. filtering).
- Adding and modifying columns.
- `pandas` and `numpy`.

We can't cover every single detail! The `pandas` documentation will be your friend.

### Throughout lecture, ask questions!

- You're always free to ask questions during lecture, and I'll try to stop for them frequently.
- But, you may not feel like asking your question out loud.
- You can **type your questions throughout lecture** at the following link:

### dsc80.com/q

#### Bookmark it!

- I'll check the form responses periodically.
- You'll also use this form to answer questions that I ask you during lecture.

### Question 🤔 (Answer at dsc80.com/q)

```
dogs = pd.read_csv('data/dogs43.csv')
dogs.head(2)
```

| | breed | kind | lifetime_cost | longevity | size | weight | height |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |

**What does this code do?**

```
whoa = np.random.choice([True, False], size=len(dogs))
(dogs[whoa]
 .groupby('size')
 .max()
 .get('longevity')
)
```

size large 11.92 medium 13.58 small 16.50 Name: longevity, dtype: float64

## `numpy` arrays

### `numpy` overview

- `numpy` stands for "numerical Python". It is a commonly-used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
  - Homogeneous – all values are of the same type.
  - (Potentially) multi-dimensional.
- Computation in `numpy` is fast because:
  - Much of it is implemented in C.
  - `numpy` arrays are stored more efficiently in memory than, say, Python lists.
- This site provides a good overview of `numpy` arrays.
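
To make "homogeneous" and "multi-dimensional" concrete, here is a minimal sketch. The names `mixed`, `py_list`, and `grid_example` are made up for illustration and do not come from the lecture.

```python
import numpy as np

# Mixing ints and a float: numpy stores every element as the same type,
# so the ints are converted to floats.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# A plain Python list, by contrast, keeps each value's own type.
py_list = [1, 2, 3.5]
print([type(x).__name__ for x in py_list])   # ['int', 'int', 'float']

# Arrays can have more than one dimension; ndim counts them.
grid_example = np.zeros((2, 3))
print(grid_example.ndim, grid_example.shape)   # 2 (2, 3)
```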

We used `numpy` in DSC 10 to work with sequences of data:

```
arr = np.arange(10)
arr
```

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

```
# The shape (10,) means that the array only has a single dimension,
# of size 10.
arr.shape
```

(10,)

```
2 ** arr
```

array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512])

Arrays come equipped with several handy methods; some examples are below, but you can read about them all here.

```
(2 ** arr).sum()
```

np.int64(1023)

```
(2 ** arr).mean()
```

np.float64(102.3)

```
(2 ** arr).max()
```

np.int64(512)

```
(2 ** arr).argmax()
```

np.int64(9)

### ⚠️ The dangers of `for`-loops

- `for`-loops are slow when processing large datasets. **You will rarely write** `for`-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!
- One of the biggest benefits of `numpy` is that it supports **vectorized** operations.
  - If `a` and `b` are two arrays of the same length, then `a + b` is a new array of the same length containing the element-wise sum of `a` and `b`.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the squares of the first 1,000,000 non-negative integers (0 through 999,999):
  - Using a `for`-loop.
  - Using vectorized arithmetic, through `numpy`.
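
As a quick aside before the timing comparison, here is a small sketch of the element-wise addition described above, next to the loop it replaces. The arrays and their values are invented for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Vectorized: one expression adds matching positions.
elementwise_sum = a + b
print(elementwise_sum)   # [11 22 33]

# The equivalent for-loop, which is the pattern to avoid on large data.
loop_sum = []
for x, y in zip(a, b):
    loop_sum.append(int(x + y))
print(loop_sum)          # [11, 22, 33]
```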

```
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)
```

30.3 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In vanilla Python, this takes about 0.03 seconds per loop.

```
%%timeit
squares = np.arange(1_000_000) ** 2
```

1.2 ms ± 131 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In `numpy`, this only takes about 0.001 seconds per loop, roughly 25x faster than the 30.3 ms measured for the `for`-loop! Note that under the hood, `numpy` is also using a `for`-loop, but it's a `for`-loop implemented in C, which is much faster than Python.
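
`%%timeit` only works inside a notebook. Outside one, a rough version of the same comparison can be sketched with the standard library's `timeit` module. The function names and the smaller `n` below are assumptions made for this example, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the lecture's 1,000,000 so this runs quickly

def squares_with_loop():
    squares = []
    for i in range(n):
        squares.append(i * i)
    return squares

def squares_with_numpy():
    # int64 is requested explicitly so the squares cannot overflow on
    # platforms where numpy's default integer is 32 bits.
    return np.arange(n, dtype=np.int64) ** 2

# Both approaches should produce exactly the same values.
same = squares_with_loop() == squares_with_numpy().tolist()
print(same)

# Total time, in seconds, for 5 calls of each function.
loop_time = timeit.timeit(squares_with_loop, number=5)
numpy_time = timeit.timeit(squares_with_numpy, number=5)
print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```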

### Multi-dimensional arrays

While we didn't see these very often in DSC 10, multi-dimensional lists/arrays may have since come up in DSC 20, 30, or 40A (especially in the context of linear algebra).

We'll spend a bit of time talking about 2D (and 3D) arrays here, since in some ways, they behave similarly to DataFrames.

Below, we create a 2D array from scratch.

```
nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])
nums
```

array([[5, 1, 9, 7], [9, 8, 2, 3], [2, 5, 0, 4]])

```
# nums has 3 rows and 4 columns.
nums.shape
```

(3, 4)

We can also create 2D arrays by *reshaping* other arrays.

```
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a
```

array([[1, 2, 3], [4, 5, 6]])
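
Reshaping only works when the new shape holds exactly as many elements as the original array. The sketch below, with names invented for illustration, shows a valid reshape where `-1` asks `numpy` to infer one dimension, and an invalid one that raises a `ValueError`.

```python
import numpy as np

flat = np.arange(1, 7)   # 6 elements: 1 through 6

# -1 tells numpy to work out that dimension: 6 elements / 3 rows = 2 columns.
three_by_two = flat.reshape((3, -1))
print(three_by_two.shape)   # (3, 2)

# 6 elements cannot fill a 4x2 grid, so this reshape fails.
reshape_error = None
try:
    flat.reshape((4, 2))
except ValueError as err:
    reshape_error = str(err)
print(reshape_error)
```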

### Operations along axes

In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).

```
a
```

array([[1, 2, 3], [4, 5, 6]])

If we specify `axis=0`, `a.sum` will "compress" along axis 0, adding up the values in each column.

```
a.sum(axis=0)
```

array([5, 7, 9])

If we specify `axis=1`, `a.sum` will "compress" along axis 1, adding up the values in each row.

```
a.sum(axis=1)
```

array([ 6, 15])
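
One way to remember which axis gets "compressed" is to look at the shapes: the axis you pass disappears from the result. The sketch below rebuilds the same array `a`; the other variable names are made up for illustration.

```python
import numpy as np

a = np.arange(1, 7).reshape((2, 3))   # same array as above

col_totals = a.sum(axis=0)   # axis 0 (length 2) disappears: one value per column
row_totals = a.sum(axis=1)   # axis 1 (length 3) disappears: one value per row
print(a.shape, col_totals.shape, row_totals.shape)   # (2, 3) (3,) (2,)

# Other array methods accept the axis argument in the same way.
col_max = a.max(axis=0)      # largest value in each column
row_means = a.mean(axis=1)   # mean of each row
print(col_max, row_means)    # [4 5 6] [2. 5.]
```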

### Selecting rows and columns from 2D arrays

You can use `[`square brackets`]` to **slice** rows and columns out of an array, using the same slicing conventions you saw in DSC 20.

```
a
```

array([[1, 2, 3], [4, 5, 6]])

```
# Accesses row 0 and all columns.
a[0, :]
```

array([1, 2, 3])

```
# Same as the above.
a[0]
```

array([1, 2, 3])

```
# Accesses all rows and column 1.
a[:, 1]
```

array([2, 5])

```
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]
```

array([2, 3])
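
A few more indexing patterns in the same spirit, as a sketch. These follow the usual Python conventions for negative indices and exclusive slice ends; the variable names are invented for illustration.

```python
import numpy as np

a = np.arange(1, 7).reshape((2, 3))   # same array as above

# A single element: row 1, column 0.
single_value = a[1, 0]
print(single_value)      # 4

# Negative indices count from the end: all rows, last column.
last_column = a[:, -1]
print(last_column)       # [3 6]

# Slice ends are exclusive: all rows, columns 0 and 1.
first_two_cols = a[:, :2]
print(first_two_cols)
```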

### Question 🤔 (Answer at dsc80.com/q)

Try and predict the value of `grid[-1, 1:].sum()` without running the code below.

```
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
# grid[-1, 1:].sum()
```

### Question 🤔 (Answer at dsc80.com/q)

**Ask ChatGPT:**

- To explain what the code above does.
- To tell you what the code outputs.

### Example: Image processing

`numpy` arrays are homogeneous and potentially multi-dimensional.

It turns out that **images** can be represented as 3D `numpy` arrays. The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Each of these can vary from 0 to 1.

```
from pathlib import Path
from PIL import Image

img_path = Path('imgs') / 'bentley.jpg'
# Dividing by 255 rescales the 0-255 pixel values to the range 0 to 1.
img = np.asarray(Image.open(img_path)) / 255
```

```
img
```

array([[[0.4 , 0.33, 0.24], [0.42, 0.35, 0.25], [0.43, 0.36, 0.26], ..., [0.5 , 0.44, 0.36], [0.51, 0.44, 0.36], [0.51, 0.44, 0.36]], [[0.39, 0.33, 0.23], [0.42, 0.36, 0.26], [0.44, 0.37, 0.27], ..., [0.51, 0.44, 0.36], [0.52, 0.45, 0.37], [0.52, 0.45, 0.38]], [[0.38, 0.31, 0.21], [0.41, 0.35, 0.24], [0.44, 0.37, 0.27], ..., [0.52, 0.45, 0.38], [0.53, 0.46, 0.39], [0.53, 0.47, 0.4 ]], ..., [[0.71, 0.64, 0.55], [0.71, 0.65, 0.55], [0.68, 0.62, 0.52], ..., [0.58, 0.49, 0.41], [0.56, 0.47, 0.39], [0.56, 0.47, 0.39]], [[0.5 , 0.44, 0.34], [0.42, 0.37, 0.26], [0.44, 0.38, 0.28], ..., [0.4 , 0.33, 0.25], [0.55, 0.48, 0.4 ], [0.58, 0.5 , 0.42]], [[0.38, 0.33, 0.22], [0.49, 0.44, 0.33], [0.56, 0.51, 0.4 ], ..., [0.15, 0.08, 0. ], [0.28, 0.21, 0.13], [0.42, 0.35, 0.27]]])

```
img.shape
```

(200, 263, 3)
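
The three numbers in the shape are (rows of pixels, columns of pixels, color channels). To see how that layout works without loading a file, here is a sketch using a made-up 2x2 "image"; `tiny_img` and its pixel values are invented for illustration and are not part of the lecture's data.

```python
import numpy as np

# A hypothetical 2x2 image: each innermost list is one pixel's [red, green, blue].
tiny_img = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # top row: red pixel, green pixel
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],   # bottom row: blue pixel, white pixel
])
print(tiny_img.shape)   # (2, 2, 3)

# Indexing the first two axes picks out one pixel's three color values.
top_left_pixel = tiny_img[0, 0]
print(top_left_pixel)   # [1. 0. 0.]

# Indexing the last axis picks out one channel for every pixel.
red_channel = tiny_img[:, :, 0]
print(red_channel.shape)   # (2, 2)
```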

```
plt.imshow(img)
plt.axis('off');
```

### Applying a greyscale filter

One way to convert an image to greyscale is to average its red, green, and blue values.

```
mean_2d = img.mean(axis=2)
mean_2d
```

array([[0.32, 0.34, 0.35, ..., 0.43, 0.44, 0.44], [0.31, 0.35, 0.36, ..., 0.44, 0.45, 0.45], [0.3 , 0.33, 0.36, ..., 0.45, 0.46, 0.47], ..., [0.64, 0.64, 0.6 , ..., 0.49, 0.47, 0.47], [0.43, 0.35, 0.37, ..., 0.32, 0.48, 0.5 ], [0.31, 0.42, 0.49, ..., 0.07, 0.21, 0.34]])

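Note that `mean_2d` is a 2D array: it holds just a single channel of brightness values, not separate red, green, and blue channels. As a result, `plt.imshow` colors it using a colormap by default instead of drawing it in shades of grey.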

```
plt.imshow(mean_2d)
plt.axis('off');
```
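
One possible way to get a true greyscale image is to copy the averaged channel into all three color channels, so that red, green, and blue are equal at every pixel. The sketch below does this with `np.repeat` on a made-up 2x2 image; `tiny_img` and `grey_3d` are names invented for illustration, and this is not necessarily the approach the lecture goes on to use.

```python
import numpy as np

# A hypothetical 2x2 RGB image with values between 0 and 1.
tiny_img = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
])

# Average the three channels of each pixel: the result is 2D.
mean_2d = tiny_img.mean(axis=2)
print(mean_2d.shape)   # (2, 2)

# Add a length-1 channel axis, then repeat it 3 times along that axis.
grey_3d = np.repeat(mean_2d[:, :, np.newaxis], 3, axis=2)
print(grey_3d.shape)   # (2, 2, 3)
```

Passing an array like `grey_3d` to `plt.imshow` should render in shades of grey, since all three channels match. Another option is to keep the 2D array and call `plt.imshow(mean_2d, cmap='gray')`.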