In [1]:

```
import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
```

- Permutation testing.

"Standard" hypothesis testing helps us answer questions of the form:

I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

- Sample: 59 heads and 41 tails. Population: A fair coin.

- Sample: Ethnic distribution of UCSD. Population: Ethnic distribution of California. (Comparing categorical distributions with the TVD.)

- Sample: Sample of Torgersen Island penguins. Population: All 333 penguins. (Comparing a subgroup statistic to a population parameter.)
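As a rough sketch of the first example, the code below simulates 100 flips of a fair coin many times over and asks how often the simulation produces at least 59 heads. The test statistic (number of heads), the one-sided comparison, the number of simulations, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is arbitrary, chosen for reproducibility

n_flips = 100          # size of the observed sample
observed_heads = 59    # observed statistic: 59 heads, 41 tails
n_sims = 100_000       # number of simulated samples (an assumption)

# Each entry is the number of heads in 100 flips of a fair coin,
# i.e. one sample drawn from the population under the null hypothesis.
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# One-sided p-value: the proportion of simulated samples with
# at least as many heads as we observed.
p_value = np.mean(simulated_heads >= observed_heads)
print(p_value)
```

The key point is that the simulation needs a known population (here, a fair coin) to draw from.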

Standard hypothesis testing **does not** help us answer questions of the form:

I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

That's where permutation testing comes in.
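A minimal sketch of the idea, under stated assumptions: the test statistic is the difference in group means, the p-value is two-sided, and the two samples are synthetic arrays made up for illustration. The function name `permutation_test` and its parameters are hypothetical, not part of any library.

```python
import numpy as np

def permutation_test(group_a, group_b, n_repetitions=10_000, seed=0):
    """Return (observed difference in means, two-sided p-value)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()

    # Under the null hypothesis both samples come from the same population,
    # so the group labels are interchangeable: pool the values and reshuffle.
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        shuffled = rng.permutation(pooled)
        simulated[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # Proportion of shuffles at least as extreme as what we observed.
    p_value = np.mean(np.abs(simulated) >= abs(observed))
    return observed, p_value

# Synthetic samples (illustrative only): the second is shifted up by 1.
rng = np.random.default_rng(1)
sample_a = rng.normal(loc=0.0, scale=1.0, size=50)
sample_b = rng.normal(loc=1.0, scale=1.0, size=50)
observed, p_value = permutation_test(sample_a, sample_b)
print(observed, p_value)
```

Note that no population distribution appears anywhere: the shuffled samples are built entirely from the two observed samples.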

**Note**: For familiarity, we'll start with an example from DSC 10. This means we'll move quickly!

Let's start by loading in the data.

In [2]:

```
baby = pd.read_csv(os.path.join('data', 'baby.csv'))
baby.head()
```

Out[2]:

| | Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
|---|---|---|---|---|---|---|
| 0 | 120 | 284 | 27 | 62 | 100 | False |
| 1 | 113 | 282 | 33 | 64 | 135 | False |
| 2 | 128 | 279 | 28 | 64 | 115 | True |
| 3 | 108 | 282 | 23 | 67 | 125 | True |
| 4 | 136 | 286 | 25 | 62 | 93 | False |

We're only interested in the `'Birth Weight'` and `'Maternal Smoker'` columns.

In [3]:

```
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
```

Out[3]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| 3 | True | 108 |
| 4 | False | 136 |

Note that there are **two samples**:

- Birth weights of smokers' babies.
- Birth weights of non-smokers' babies.

How many babies are in each group? What is the average birth weight within each group?

In [4]:

```
baby.groupby('Maternal Smoker')['Birth Weight'].agg(['mean', 'count'])
```

Out[4]:

| Maternal Smoker | mean | count |
|---|---|---|
| False | 123.085315 | 715 |
| True | 113.819172 | 459 |

Note that birth weights are recorded in ounces. There are 16 ounces in 1 pound, so the group means above are roughly 7 to 8 pounds.
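The conversion can be checked directly. The sketch below hard-codes the two group means from the table above instead of reading `baby.csv`, so it runs on its own; the variable names are illustrative.

```python
OUNCES_PER_POUND = 16

# Group means copied from the groupby output above (in ounces).
nonsmoker_mean_oz = 123.085315
smoker_mean_oz = 113.819172

# Convert each mean to pounds.
nonsmoker_mean_lb = nonsmoker_mean_oz / OUNCES_PER_POUND
smoker_mean_lb = smoker_mean_oz / OUNCES_PER_POUND

# Gap between the two group means, in ounces.
diff_oz = nonsmoker_mean_oz - smoker_mean_oz

print(round(nonsmoker_mean_lb, 2), round(smoker_mean_lb, 2), round(diff_oz, 2))
```

On average, the babies of non-smokers in this dataset are a little over 9 ounces heavier than the babies of smokers.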

Below, we draw the distributions of both sets of birth weights.

In [5]:

```
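# Overlaid, normalized histograms of birth weight, one per smoking status.
# Passing x explicitly labels the horizontal axis with the column name.
px.histogram(baby, x='Birth Weight', color='Maternal Smoker',
             histnorm='probability', marginal='box',
             title="Birth Weight by Mother's Smoking Status",
             barmode='overlay', opacity=0.7)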
```