In [1]:

```
import numpy as np
import pandas as pd
import os
import plotly.express as px
pd.options.plotting.backend = 'plotly'
```

- A permutation test, like all simulation-based hypothesis tests, generates an
**approximation**of the distribution of the test statistic.- If we found
**all**permutations, the distribution would be exact! - If there are $a$ elements in one group and $b$ in the other, the total number of permutations is ${a + b \choose a}$.
- If $a = 100$ and $b = 150$, there are more than ${250 \choose 100} \approx 6 \cdot 10^{71}$ permutations!

- If we found

- The more repetitions we use, the better our approximation will be.
- Unfortunately, our code is pretty slow, so we can't use many repetitions.

In [2]:

```
baby_fp = os.path.join('data', 'baby.csv')
baby = pd.read_csv(baby_fp)
baby = baby[['Maternal Smoker', 'Birth Weight']]
```

In [3]:

```
baby.head()
```

Out[3]:

Maternal Smoker | Birth Weight | |
---|---|---|

0 | False | 120 |

1 | False | 113 |

2 | True | 128 |

3 | True | 108 |

4 | False | 136 |

Recall our permutation test from last class:

**Null Hypothesis**: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.**Alternative Hypothesis**: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed difference in our samples cannot be explained by random chance alone.

We'll use 3000 repetitions instead of 500.

In [4]:

```
%%time
n_repetitions = 3000
differences = []
for _ in range(n_repetitions):
# Step 1: Shuffle the weights and store them in a DataFrame.
with_shuffled = baby.assign(Shuffled_Weights=np.random.permutation(baby['Birth Weight']))
# Step 2: Compute the test statistic.
# Remember, alphabetically, False comes before True,
# so this computes True - False.
group_means = (
with_shuffled
.groupby('Maternal Smoker')
.mean()
.loc[:, 'Shuffled_Weights']
)
difference = group_means.diff().iloc[-1]
# Step 3: Store the result.
differences.append(difference)
```

CPU times: user 2.46 s, sys: 1.71 ms, total: 2.46 s Wall time: 2.46 s

In [5]:

```
fig = px.histogram(pd.DataFrame(differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Original Approach')
fig.update_layout(xaxis_range=[-5, 5])
```