data science
the scientific and ethical process of extracting value from data
steps of the data science process
Ask an interesting question
Get the data
Explore the data
Model the data
Communicate and visualize the results
why do we start with questions?
to know what to look for
to know what kind of analysis to do
to think about ethics
goal
start with a ____ in mind
questions
____ drive the choice of data
criteria of a data science question
specific
measurable
answerable with data that’s possible to collect
have relevance or value to something or someone
big enough to be interesting
small enough to be accomplishable
ingredients for a data science question
topic
outcome/target
features
timeframe
topic
who or what are you studying?
outcome/target
what exactly are you measuring?
features
what are the factors or comparisons that matter?
timeframe
over what timescale?
some goals of data science questions…
describe or understand patterns
predict
test cause-and-effect
decide what to do or optimize something
ethics
principles and standards that guide choices about how data is collected and how models are built and used
privacy
a person’s ability to control access to themselves and their own information
The Cobra Effect
the unintended consequences that result from an attempt to solve a problem
class imbalance problem
the classes in the training data are unevenly represented, so models trained on them inherit that bias (e.g. a hiring-recommendation model trained mostly on men's résumés discriminates against women)
what your devices do know
demographic data
location data
search history
browsing history
purchase history
physical interactions
cookies
small pieces of data a website stores in your browser to remember you; they can also track your activity across pages
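A cookie is just a small labeled value passed in HTTP headers. A minimal sketch with Python's standard library (the cookie names and values here are invented):

```python
# Parse a cookie string the way a browser stores a Set-Cookie header.
# Names/values are made up for illustration.
from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load('session_id=abc123; visited_pages=3')

print(jar['session_id'].value)     # abc123
print(jar['visited_pages'].value)  # 3
```

On later requests the browser sends these values back, which is what lets a site remember (and track) you across pages.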
how to practice more responsible data science
learn from the past
use checklists: agree before the project starts on how data will be handled ethically and responsibly
incorporate ethics into hiring decisions
organize governing bodies to determine how data will be regulated
ensure safeguards (e.g. when to “pull the plug”)
Dean’s Ethics Checklist
meant to provoke discussion around:
data collection
data storage
analysis
modeling
deployment
data
anything that can be measured or observed about the world
quantitative
a value that can be objectively measured or counted
qualitative
describes categories or qualities
categories of qualitative data
categorical
nominal (unordered)
ordinal (ordered)
text, audio, images (in raw form)
categories of quantitative data
discrete: countable values
continuous: any value in range
quantitative data visualization
histogram
box plot/violin plot
scatter plot
line chart (over time)
qualitative data visualization
bar chart
grouped/stacked bar chart
pie chart
structured data
can be organized into an n x p table of values
n
the number of observations/sample size
p
the number of features
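A quick sketch of the n x p idea using a tiny CSV (the values are invented): n is the number of data rows, p the number of columns.

```python
# Parse a small structured dataset from CSV text and compute
# n (observations) and p (features). Data values are invented.
import csv, io

raw = """name,age,score
Ana,20,88
Ben,22,91
Caro,21,79
"""
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
n = len(data)    # number of observations (rows)
p = len(header)  # number of features (columns)
print(n, p)      # 3 3
```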
structured data: storage
spreadsheets (.csv, Excel (.xls, .xlsx))
databases: SQL (structured query language)
SQL as an example of a structured database
enables you to relate tables of data together, perform actions on data
can ask questions (query), add, update, delete information
manage + analyze large datasets
structured data: JOIN operation
combine rows from 2+ tables by matching columns
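A minimal JOIN sketch using Python's built-in sqlite3 (table names and rows are invented): two tables are combined by matching a shared column.

```python
# Sketch of a SQL JOIN: combine rows from two tables by matching
# students.id to grades.student_id. All data here is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (id INTEGER, name TEXT)")
con.execute("CREATE TABLE grades (student_id INTEGER, grade TEXT)")
con.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
con.executemany("INSERT INTO grades VALUES (?, ?)", [(1, "A"), (2, "B")])

rows = con.execute("""
    SELECT students.name, grades.grade
    FROM students
    JOIN grades ON students.id = grades.student_id
    ORDER BY students.name
""").fetchall()
print(rows)  # [('Ana', 'A'), ('Ben', 'B')]
```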
unstructured data
no simple way to represent in table format
unstructured data: data storage
text documents
emails + messaging
images
audio
video
presentations
web content
sentiment analysis as unstructured data
using Natural Language Processing (NLP) to identify and categorize opinions expressed in text
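Real NLP is far more sophisticated, but the core idea — turning unstructured text into a categorized opinion — can be sketched with a toy word-list approach (the word lists are invented):

```python
# NOT real NLP: a toy lexicon-based sentiment classifier, just to show
# the idea of categorizing opinions in raw text. Word lists are invented.
POSITIVE = {"great", "love", "helpful"}
NEGATIVE = {"bad", "hate", "boring"}

def toy_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(toy_sentiment("I love this great course"))  # positive
print(toy_sentiment("boring and bad"))            # negative
```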
semi-structured data
doesn’t fit neatly into rows and columns, but has labels + keys that give it organization
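JSON is the classic semi-structured example: keys give labels and nesting gives organization, but records need not share the same fields. A small invented record:

```python
# Semi-structured data: labeled keys and nesting instead of fixed
# rows/columns. This record is invented for illustration.
import json

record = json.loads("""
{
  "user": "ana",
  "courses": [
    {"name": "Data Science", "grade": "A"},
    {"name": "Statistics"}
  ]
}
""")
# The keys organize the data, but the second course has no "grade" —
# something a rigid n x p table couldn't represent without blanks.
print(record["user"])                   # ana
print(record["courses"][0]["grade"])    # A
print("grade" in record["courses"][1])  # False
```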
how to get data
collect it yourself
use public/open-source data
buy it
request it from an API
scrape it from a website
source it from your company
steps to request data from an API
send a request
API sends back semi-structured data, often JSON
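The two steps above can be sketched with the standard library. A real API call would use an https:// URL and often an API key; here a `data:` URL stands in for the endpoint so the sketch runs offline:

```python
# Sketch of an API request: (1) send the request, (2) parse the JSON
# that comes back. The data: URL is a stand-in for a real endpoint;
# real APIs use https:// URLs and often require a key in the headers.
import json
from urllib.request import urlopen

url = 'data:application/json,{"name":"Ada","id":7}'  # stand-in endpoint
with urlopen(url) as response:             # step 1: send a request
    payload = json.loads(response.read())  # step 2: parse JSON response

print(payload["name"], payload["id"])  # Ada 7
```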
web scraping
a way to automatically collect information from websites
used when there’s no API or the existing API is insufficient
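A toy scrape with only the standard library (the HTML snippet is invented; real scrapers must respect robots.txt and a site's terms):

```python
# Toy web scraping: pull link URLs out of HTML using the built-in parser.
# The HTML is an invented snippet standing in for a downloaded page.
from html.parser import HTMLParser

html = '<ul><li><a href="/a.html">A</a></li><li><a href="/b.html">B</a></li></ul>'

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect every href on an <a> tag
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['/a.html', '/b.html']
```

This also shows why scraping is fragile: if the site changes its markup, the parser silently stops finding what you wanted.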
API: definition, pros, cons
a site’s official way to access data
structured + stable
may require a key + have rate limits
web scraping: pros + cons
useful when there’s no API
fragile; websites change
can violate terms & conditions
questions to ask before using data
allowed? restrictions? ethical, privacy, legal concerns?
why and how was the data collected?
how was the data cleaned or altered? missing data?
who/what is included or excluded? is there bias in data collection or measurement?
documentation?
trustworthy?
robots.txt
small text file a website can publish that tells automatic crawlers which parts of the site they’re allowed to access
voluntary
websites enforce real restrictions with logins, paywalls, server-side blocking
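Python's standard library can check a robots.txt before crawling. A sketch with a made-up rules file:

```python
# Check a (made-up) robots.txt before crawling. The rules say:
# everything is allowed except paths under /private/.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

In practice `rp.set_url(".../robots.txt")` plus `rp.read()` fetches the live file; the check is still only advisory, which is the point of the card above.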
“Identity crisis” and the stats vs. data science debate
Donoho is pushing back on the idea that data science is simply a subset of stats or a rebrand. The argument isn’t “stats is irrelevant”—it’s that data science includes major concerns that weren’t historically central in stats curricula (especially computing and end-to-end workflow).
The “two cultures” (generative vs. predictive)
This frames the tension between modeling to explain and understand mechanisms (generative) vs. modeling for prediction and performance on held-out data (predictive).
The Common Task Framework (CTF)
CTF is a way to compare methods using a shared dataset, a defined task, and a scoring metric—often with a holdout test set. It encourages a prediction-centric culture and can drive rapid progress, but it can also reward narrow optimization that may not generalize.
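The CTF's ingredients can be sketched in a few lines: a shared dataset, a defined task (predict the label), and a scoring metric (accuracy) computed on a holdout set. The data and the deliberately naive "method" are invented:

```python
# Minimal Common Task Framework sketch: shared data, a defined task,
# and one scoring metric on a holdout set. Data and method are invented.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Shared dataset, split into train and holdout portions
train = [(1, "spam"), (2, "ham"), (3, "spam"), (4, "spam")]
test_x, test_y = [5, 6], ["spam", "ham"]

# A naive competing method: always predict the majority training label
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)
predictions = [majority for _ in test_x]

print(accuracy(test_y, predictions))  # 0.5
```

Any method scored this way is directly comparable — which drives progress, but also invites narrow optimization against the one metric.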
“Science about data science”: cross-study and cross-workflow thinking
Donoho’s examples highlight something uncomfortable but important: different reasonable workflows can produce different results, even on the same data. This raises questions about reliability, reproducibility, and the credibility of conclusions.
key ideas: “Data’s Day of Reckoning”
Data-driven products can be weaponized (misuse, abuse) when incentives don’t align with the public good.
Ethics must be integrated into everyday practice and data science education, and not treated as an optional afterthought.
Use concrete mechanisms like checklists and “stop-the-line” culture to make responsibility actionable. Build routines for consent, bias checks, abuse cases, monitoring, and the ability to pause/shut down systems.
key ideas: “Myths & Fallacies of Personal Identifiable Information (PII)”
The “PII fallacy”: removing a fixed list of identifiers doesn’t guarantee privacy. PII is hard to define and context-dependent.
Re-identification can happen without names or obvious identifiers. Any sufficiently rich set of behavioral/transactional attributes can uniquely “fingerprint” people, especially when linked with other data.
Privacy protection should move beyond “de-identification” toward stronger approaches.
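The re-identification point can be made concrete with a toy check: even with names removed, a combination of "non-identifying" attributes can single people out. The records below are invented:

```python
# Toy re-identification check: count how many records are uniquely
# "fingerprinted" by the combination (zip, birth_year, sex) alone.
# All records are invented.
from collections import Counter

records = [
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "M"},
    {"zip": "02139", "birth_year": 1985, "sex": "F"},
    {"zip": "90210", "birth_year": 1990, "sex": "F"},
]

combos = Counter((r["zip"], r["birth_year"], r["sex"]) for r in records)
unique = sum(1 for count in combos.values() if count == 1)
print(f"{unique} of {len(records)} records are uniquely fingerprinted")
```

With richer attributes (purchases, locations, timestamps) uniqueness rises fast, which is why stripping a fixed PII list doesn't guarantee privacy.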