import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
Welcome to DSC 80! 🎉
About the instructor
Prof. Sam Lau
- Assistant Teaching Professor, HDSI, UCSD
- Personal: https://www.samlau.me/
I design curriculum and invent tools for teaching programming and data science.
Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).
In addition to the instructor, we have 2 TAs and 7 Tutors, who are here to help you in discussion, office hours, and on Ed:
Giorgia Nicolaou, Dylan Stockard
Gabriel Cha, Jiayu (John) Chen, Doris Gao, Zelong (Alan) Wang, Sunan Xu, Tiffany Yu, Luran (Lauren) Zhang
- Used Python to explore and visualize data.
- Used simulation to make inferences about a population, given just a sample.
- Made predictions about the future given data from the past.
Let's look at a few more definitions of data science.
The chart below is taken from the follow-up 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:
What technologies will have the biggest effect on compensation in the coming year?
As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!
Let's look at some examples of data science in practice.
Data science involves people 🧍
The decisions that we make as data scientists have the potential to impact the livelihoods of other people.
- Flu case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
What this course is about:
Good data analysis is not:
- A simple application of a statistics formula.
- A simple application of computer programs.
There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!
DSC 80 teaches you to think like a data scientist.
In this course, you will...
- Practice translating potentially vague questions into quantitative questions about measurable observations.
- Learn to reason about "black-box" processes (e.g. complicated models).
- Understand computational and statistical implications of working with data.
- Learn to use real data tools (and rely on documentation).
- Get a taste of the "life of a data scientist."
After this course, you will...
- Be prepared for internships and data science "take home" interviews!
- Be ready to create your own portfolio of personal projects.
- Have the background and maturity to succeed in upper-division courses.
- Week 1: From
- Week 2: DataFrames
- Week 3: Working with messy data, hypothesis testing
- Week 4: Missing values
- Week 5: HTML, Midterm Exam
- Week 6: Web data
- Week 7: Text data, modeling
- Week 8: Feature engineering,
- Week 9: sklearn pipelines and model evaluation
- Week 10: Classifier evaluation, fairness
- Week 11: Final Exam
Accessing course content on GitHub
You will access all course content by pulling the course GitHub repository:
We will post HTML versions of lecture notebooks on the course website, but otherwise you must git pull from this repository to access all course materials (including blank copies of assignments).
- You have two choices:
- Set up your own Python environment (strongly recommended).
- Use DataHub.
- Either way, follow the instructions on the Tech Support page of the course website.
- Once you set up your environment, you will pull the course repo every time a new assignment comes out.
- Note: You will submit your work to Gradescope directly, without using Git.
- We will post a demo video with Lab 1.
In this course, you will learn by doing!
- Labs (30%): 9 total. Due weekly on Mondays.
- Projects (35% + 5% checkpoints): 5 total. Due on Wednesdays, and usually have a "checkpoint."
In DSC 80, assignments will usually consist of both a Jupyter Notebook and a .py file. You will write your code in the .py file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.
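As a rough picture of that split (Lab 1 remains the authoritative guide), the sketch below shows a function of the kind that might live in the .py file, followed by a check of the kind a notebook test cell might run. The function name `first_letter_counts` and the test are hypothetical and are not taken from any actual assignment.

```python
# --- would live in the assignment's .py file (hypothetical example) ---
def first_letter_counts(names):
    """Return a dict mapping each first letter to how many names start with it."""
    counts = {}
    for name in names:
        letter = name[0]  # assumes every name is a non-empty string
        counts[letter] = counts.get(letter, 0) + 1
    return counts


# --- the kind of check a notebook test cell might run (hypothetical) ---
result = first_letter_counts(["Luna", "Liam", "Siri"])
assert result == {"L": 2, "S": 1}
```

Keeping the logic in a plain function makes it easy to import and test from the notebook without copying code between the two files.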
Discussions and lab reflections
In order to have you reflect on your lab work, we will offer extra credit each week if you do all 3 of the following:
- Submit the lab.
- Attend discussion in-person (Fridays 10-10:50AM in Center Hall 212), where we discuss solutions to the most recent lab.
- Submit a lab reflection form to Gradescope by Saturday.
Each week you do all 3, you'll earn 0.25% of extra credit – this could total 2%.
This scheme starts this week. Discussion will be podcasted.
It is no secret that this course requires a lot of work – becoming fluent at working with data is hard!
- You will learn how to solve problems independently – documentation and the internet will be your friends.
- Learning how to effectively check your work and debug is extremely useful.
- Learning to stick with a problem (tenacity) is a very valuable skill, but don't be afraid to ask for help.
Once you've tried to solve problems on your own, we're glad to help.
- Office hours are offered – most are in-person, but a few are remote. See the Calendar 📆 for details.
- Ed is your friend too. Make your conceptual questions public, and make your debugging questions private.
However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?
All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.
Goal: See if claims are true, based on the data.
baby = pd.read_csv('data/baby.csv')
baby
2085158 rows × 4 columns
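# Total, per year, over all names that start with "L".
# .str[0] extracts the first character of each name.
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 .sum()
 .plot()
);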
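To see what this chained expression computes without loading the full dataset, here is a minimal sketch on a tiny made-up table. The column names `Name` and `Year` come from the lecture code; the `Count` column name and every row are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: the 'Count' column name and all values are made up.
toy = pd.DataFrame({
    "Name": ["Luna", "Liam", "Emma", "Luna", "Noah"],
    "Year": [2000, 2000, 2000, 2001, 2001],
    "Count": [10, 20, 30, 40, 50],
})

l_by_year = (
    toy
    .assign(first_letter=toy["Name"].str[0])  # first character of each name
    .query('first_letter == "L"')             # keep only names starting with "L"
    .groupby("Year")["Count"]                 # one group per year
    .sum()                                    # add up the counts in each group
)
print(l_by_year)
```

In this toy table, the year 2000 total combines Luna and Liam, while 2001 contains only Luna. Selecting the single numeric column before `.sum()` is a design choice in this sketch so that text columns are not aggregated.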
(baby
 .query('Name == "Luna"')
 .groupby('Year')
 .sum()
 .plot()
);

(baby
 .query('Name == "Siri"')
 .groupby('Year')
 .sum()
 .plot()
);
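The two queries above differ only in the name, so they could be wrapped in a small helper. This is a sketch rather than code from the lecture: the function `count_by_year`, the `Count` column name, and the toy rows are all hypothetical.

```python
import pandas as pd


def count_by_year(df, name, count_col="Count"):
    """Sum `count_col` per year over the rows whose Name equals `name`."""
    # Boolean indexing avoids assembling a query string by hand.
    return df.loc[df["Name"] == name].groupby("Year")[count_col].sum()


# Hypothetical toy data: column names beyond 'Name' and 'Year' are assumed.
toy = pd.DataFrame({
    "Name": ["Luna", "Siri", "Luna", "Siri", "Luna"],
    "Sex": ["F", "F", "F", "F", "F"],
    "Year": [2010, 2010, 2011, 2011, 2011],
    "Count": [5, 3, 7, 1, 2],
})

luna = count_by_year(toy, "Luna")
siri = count_by_year(toy, "Siri")
```

Calling `.plot()` on either result would draw one line per name, in the same spirit as the plots above.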