data science
the scientific and ethical process of extracting value from data
steps of the data science process
Ask an interesting question
Get the data
Explore the data
Model the data
Communicate and visualize the results
why do we start with questions?
to know what to look for
to know what kind of analysis to do
to think about ethics
goal
start with a ____ in mind
questions
____ drive the choice of data
criteria of a data science question
specific
measurable
answerable with data that’s possible to collect
have relevance or value to something or someone
big enough to be interesting
small enough to be accomplishable
ingredients for a data science question
topic
outcome/target
features
timeframe
topic
who or what are you studying?
outcome/target
what exactly are you measuring?
features
what are the factors or comparisons that matter?
timeframe
over what timescale?
some goals of data science questions…
describe or understand patterns
predict
test cause-and-effect
decide what to do or optimize something
ethics
principles and standards that guide choices about how data is collected and how models are built and used
privacy
a person’s ability to control access to themselves and their own information
The Cobra Effect
the unintended consequences that result from an attempt to solve a problem
class imbalance problem
the classes in the training data are unevenly represented, so models trained on them inherit that bias (e.g. a hiring-recommendation model trained mostly on men's résumés discriminates against women)
what your devices do know
demographic data
location data
search history
browsing history
purchase history
physical interactions
cookies
small pieces of data a website stores in your browser to remember you; they can also track your activity across pages
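A cookie is just a small labeled value passed in HTTP headers. A minimal sketch with Python's standard library (the cookie names and values here are invented):

```python
# Parse a cookie string the way a browser stores a Set-Cookie header.
# Names/values are made up for illustration.
from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load('session_id=abc123; visited_pages=3')

print(jar['session_id'].value)     # abc123
print(jar['visited_pages'].value)  # 3
```

On later requests the browser sends these values back, which is what lets a site remember (and track) you across pages.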
how to practice more responsible data science
learn from the past
use checklists: agree before the project starts on how data will be handled ethically and responsibly
incorporate ethics into hiring decisions
organize governing bodies to determine how data will be regulated
ensure safeguards (e.g. when to “pull the plug”)
Dean’s Ethics Checklist
meant to provoke discussion around:
data collection
data storage
analysis
modeling
deployment
data
anything that can be measured or observed about the world
quantitative
a value that can be objectively measured or counted
qualitative
describes categories or qualities
categories of qualitative data
categorical
nominal (unordered)
ordinal (ordered)
text, audio, images (in raw form)
categories of quantitative data
discrete: countable values
continuous: any value in range
quantitative data visualization
histogram
box plot/violin plot
scatter plot
line chart (over time)
qualitative data visualization
bar chart
grouped/stacked bar chart
pie chart
structured data
can be organized into an n x p table of values
n
the number of observations/sample size
p
the number of features
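A quick sketch of the n x p idea using a tiny CSV (the values are invented): n is the number of data rows, p the number of columns.

```python
# Parse a small structured dataset from CSV text and compute
# n (observations) and p (features). Data values are invented.
import csv, io

raw = """name,age,score
Ana,20,88
Ben,22,91
Caro,21,79
"""
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
n = len(data)    # number of observations (rows)
p = len(header)  # number of features (columns)
print(n, p)      # 3 3
```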
structured data: storage
spreadsheets (.csv, Excel (.xls, .xlsx))
databases: SQL (structured query language)
SQL as an example of a structured database
enables you to relate tables of data together, perform actions on data
can ask questions (query), add, update, delete information
manage + analyze large datasets
structured data: JOIN operation
combine rows from 2+ tables by matching columns
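A minimal JOIN sketch using Python's built-in sqlite3 (table names and rows are invented): two tables are combined by matching a shared column.

```python
# Sketch of a SQL JOIN: combine rows from two tables by matching
# students.id to grades.student_id. All data here is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (id INTEGER, name TEXT)")
con.execute("CREATE TABLE grades (student_id INTEGER, grade TEXT)")
con.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
con.executemany("INSERT INTO grades VALUES (?, ?)", [(1, "A"), (2, "B")])

rows = con.execute("""
    SELECT students.name, grades.grade
    FROM students
    JOIN grades ON students.id = grades.student_id
    ORDER BY students.name
""").fetchall()
print(rows)  # [('Ana', 'A'), ('Ben', 'B')]
```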
unstructured data
no simple way to represent in table format
unstructured data: data storage
text documents
emails + messaging
images
audio
video
presentations
web content
sentiment analysis as unstructured data
using Natural Language Processing (NLP) to identify and categorize opinions expressed in text
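Real NLP is far more sophisticated, but the core idea — turning unstructured text into a categorized opinion — can be sketched with a toy word-list approach (the word lists are invented):

```python
# NOT real NLP: a toy lexicon-based sentiment classifier, just to show
# the idea of categorizing opinions in raw text. Word lists are invented.
POSITIVE = {"great", "love", "helpful"}
NEGATIVE = {"bad", "hate", "boring"}

def toy_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(toy_sentiment("I love this great course"))  # positive
print(toy_sentiment("boring and bad"))            # negative
```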
semi-structured data
doesn’t fit neatly into rows and columns, but has labels + keys that give it organization
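JSON is the classic semi-structured example: keys give labels and nesting gives organization, but records need not share the same fields. A small invented record:

```python
# Semi-structured data: labeled keys and nesting instead of fixed
# rows/columns. This record is invented for illustration.
import json

record = json.loads("""
{
  "user": "ana",
  "courses": [
    {"name": "Data Science", "grade": "A"},
    {"name": "Statistics"}
  ]
}
""")
# The keys organize the data, but the second course has no "grade" —
# something a rigid n x p table couldn't represent without blanks.
print(record["user"])                   # ana
print(record["courses"][0]["grade"])    # A
print("grade" in record["courses"][1])  # False
```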
how to get data
collect it yourself
use public/open-source data
buy it
request it from an API
scrape it from a website
source it from your company
steps to request data from an API
send a request
API sends back semi-structured data, often JSON
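The two steps above can be sketched with the standard library. A real API call would use an https:// URL and often an API key; here a `data:` URL stands in for the endpoint so the sketch runs offline:

```python
# Sketch of an API request: (1) send the request, (2) parse the JSON
# that comes back. The data: URL is a stand-in for a real endpoint;
# real APIs use https:// URLs and often require a key in the headers.
import json
from urllib.request import urlopen

url = 'data:application/json,{"name":"Ada","id":7}'  # stand-in endpoint
with urlopen(url) as response:             # step 1: send a request
    payload = json.loads(response.read())  # step 2: parse JSON response

print(payload["name"], payload["id"])  # Ada 7
```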
web scraping
a way to automatically collect information from websites
used when there’s no API or the existing API is insufficient
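A toy scrape with only the standard library (the HTML snippet is invented; real scrapers must respect robots.txt and a site's terms):

```python
# Toy web scraping: pull link URLs out of HTML using the built-in parser.
# The HTML is an invented snippet standing in for a downloaded page.
from html.parser import HTMLParser

html = '<ul><li><a href="/a.html">A</a></li><li><a href="/b.html">B</a></li></ul>'

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect every href on an <a> tag
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['/a.html', '/b.html']
```

This also shows why scraping is fragile: if the site changes its markup, the parser silently stops finding what you wanted.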
API: definition, pros, cons
a site’s official way to access data
structured + stable
may require a key + have rate limits
web scraping: pros + cons
useful when there’s no API
fragile; websites change
can violate terms & conditions
questions to ask before using data
allowed? restrictions? ethical, privacy, legal concerns?
why and how was the data collected?
how was the data cleaned or altered? missing data?
who/what is included or excluded? is there bias in data collection or measurement?
documentation?
trustworthy?
robots.txt
small text file a website can publish that tells automatic crawlers which parts of the site they’re allowed to access
voluntary
websites enforce real restrictions with logins, paywalls, server-side blocking
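Python's standard library can check a robots.txt before crawling. A sketch with a made-up rules file:

```python
# Check a (made-up) robots.txt before crawling. The rules say:
# everything is allowed except paths under /private/.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

In practice `rp.set_url(".../robots.txt")` plus `rp.read()` fetches the live file; the check is still only advisory, which is the point of the card above.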
“Identity crisis” and the stats vs. data science debate
Donoho is pushing back on the idea that data science is simply a subset of stats or a rebrand. The argument isn’t “stats is irrelevant”—it’s that data science includes major concerns that weren’t historically central in stats curricula (especially computing and end-to-end workflow).
The “two cultures” (generative vs. predictive)
This frames the tension between modeling to explain and understand mechanisms (generative) vs. modeling for prediction and performance on held-out data (predictive).
The Common Task Framework (CTF)
CTF is a way to compare methods using a shared dataset, a defined task, and a scoring metric—often with a holdout test set. It encourages a prediction-centric culture and can drive rapid progress, but it can also reward narrow optimization that may not generalize.
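The CTF's ingredients can be sketched in a few lines: a shared dataset, a defined task (predict the label), and a scoring metric (accuracy) computed on a holdout set. The data and the deliberately naive "method" are invented:

```python
# Minimal Common Task Framework sketch: shared data, a defined task,
# and one scoring metric on a holdout set. Data and method are invented.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Shared dataset, split into train and holdout portions
train = [(1, "spam"), (2, "ham"), (3, "spam"), (4, "spam")]
test_x, test_y = [5, 6], ["spam", "ham"]

# A naive competing method: always predict the majority training label
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)
predictions = [majority for _ in test_x]

print(accuracy(test_y, predictions))  # 0.5
```

Any method scored this way is directly comparable — which drives progress, but also invites narrow optimization against the one metric.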
“Science about data science”: cross-study and cross-workflow thinking
Donoho’s examples highlight something uncomfortable but important: different reasonable workflows can produce different results, even on the same data. This raises questions about reliability, reproducibility, and the credibility of conclusions.
key ideas: “Data’s Day of Reckoning”
Data-driven products can be weaponized (misuse, abuse) when incentives don’t align with the public good.
Ethics must be integrated into everyday practice and data science education, and not treated as an optional afterthought.
Use concrete mechanisms like checklists and “stop-the-line” culture to make responsibility actionable. Build routines for consent, bias checks, abuse cases, monitoring, and the ability to pause/shut down systems.
key ideas: “Myths & Fallacies of Personal Identifiable Information (PII)”
The “PII fallacy”: removing a fixed list of identifiers doesn’t guarantee privacy. PII is hard to define and context-dependent.
Re-identification can happen without names or obvious identifiers. Any sufficiently rich set of behavioral/transactional attributes can uniquely “fingerprint” people, especially when linked with other data.
Privacy protection should move beyond “de-identification” toward stronger approaches.
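The re-identification point can be made concrete with a toy check: even with names removed, a combination of "non-identifying" attributes can single people out. The records below are invented:

```python
# Toy re-identification check: count how many records are uniquely
# "fingerprinted" by the combination (zip, birth_year, sex) alone.
# All records are invented.
from collections import Counter

records = [
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "M"},
    {"zip": "02139", "birth_year": 1985, "sex": "F"},
    {"zip": "90210", "birth_year": 1990, "sex": "F"},
]

combos = Counter((r["zip"], r["birth_year"], r["sex"]) for r in records)
unique = sum(1 for count in combos.values() if count == 1)
print(f"{unique} of {len(records)} records are uniquely fingerprinted")
```

With richer attributes (purchases, locations, timestamps) uniqueness rises fast, which is why stripping a fixed PII list doesn't guarantee privacy.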