COGS 9 Exam 1

Last updated 6:10 AM on 4/18/26

50 Terms

1

data science

the scientific and ethical process of extracting value from data

2

steps of the data science process

  1. Ask an interesting question

  2. Get the data

  3. Explore the data

  4. Model the data

  5. Communicate and visualize the results

3

why do we start with questions?

  • to know what to look for

  • to know what kind of analysis to do

  • to think about ethics

4

goal

start with a ____ in mind

5

questions

____ drive the choice of data

6

criteria of a data science question

  • specific

  • measurable

  • answerable with data that’s possible to collect

  • have relevance or value to something or someone

  • big enough to be interesting

  • small enough to be accomplishable

7

ingredients for a data science question

  • topic

  • outcome/target

  • features

  • timeframe

8

topic

who or what are you studying?

9

outcome/target

what exactly are you measuring?

10

features

what are the factors or comparisons that matter?

11

timeframe

over what timescale?

12

some goals of data science questions…

  • describe or understand patterns

  • predict

  • test cause-and-effect

  • decide what to do or optimize something

13

ethics

principles and standards that guide choices about how data is collected and how models are built and used

14

privacy

a person’s ability to control access to themselves and their own information

15

The Cobra Effect

the unintended consequences that result from an attempt to solve a problem

16

class imbalance problem

some groups or outcomes are heavily under-represented in the training data, so the data themselves are biased (e.g. a trained hiring-recommendation model discriminates against women)
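
A minimal sketch of why imbalance is a problem, using made-up labels (the 950 vs. 50 split and the label names are assumptions for illustration, not course data). A "model" that always predicts the majority class looks accurate overall while never finding a single minority-class case.

```python
# Hypothetical, made-up labels: 950 majority-class and 50 minority-class examples.
labels = ["no_hire"] * 950 + ["hire"] * 50

# A "model" that ignores its input and always predicts the majority class.
predictions = ["no_hire"] * len(labels)

# Overall accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: of the true "hire" cases, how many were found?
preds_for_true_hires = [p for p, y in zip(predictions, labels) if y == "hire"]
minority_recall = sum(p == "hire" for p in preds_for_true_hires) / len(preds_for_true_hires)

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

High accuracy here hides the fact that the minority group is never predicted correctly, which is why accuracy alone can be misleading on imbalanced data.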

17

what your devices know about you

  • demographic data

  • location data

  • search history

  • browsing history

  • purchase history

  • physical interactions

18

cookies

small pieces of data a website stores in your browser to remember you; they can also be used to track your activity across pages

19

how to practice more responsible data science

  • learn from the past

  • use checklists: decide how data will be handled ethically and responsibly before the project starts

  • incorporate ethics into hiring decisions

  • organize governing bodies to determine how data will be regulated

  • ensure safeguards (e.g. when to “pull the plug”)

20

Deon ethics checklist

meant to provoke discussion around:

  • data collection

  • data storage

  • analysis

  • modeling

  • deployment

21

data

anything that can be measured or observed about the world

22

quantitative

a value that can be objectively measured or counted

23

qualitative

describes categories or qualities

24

categories of qualitative data

  • categorical

    • nominal (unordered)

    • ordinal (ordered)

  • text, audio, images (in raw form)

25

categories of quantitative data

  • discrete: countable values

  • continuous: any value within a range

26

quantitative data visualization

  • histogram

  • box plot/violin plot

  • scatter plot

  • line chart (over time)
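
As a rough illustration of what a histogram does with quantitative data, the sketch below groups values into equal-width bins and counts how many fall in each. The measurements and the bin width of 10 are made up for the example.

```python
from collections import Counter

# Made-up continuous measurements (e.g. heights in cm), for illustration only.
values = [151.2, 158.9, 160.4, 162.0, 163.7, 167.5, 168.1, 171.3, 174.8, 182.6]
bin_width = 10

# Map each value to the lower edge of its bin, then count values per bin.
counts = Counter(int(v // bin_width) * bin_width for v in values)

# Print a text histogram: one row per bin, one '#' per value in that bin.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width}: {'#' * counts[lower]}")
```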

27

qualitative data visualization

  • bar chart

  • grouped/stacked bar chart

  • pie chart

28

structured data

can be organized into an n x p table of values

29

n

the number of observations/sample size

30

p

the number of features
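
To make n and p concrete, here is a small sketch with a made-up table (the column names and values are assumptions for illustration). Each row is one observation and each column is one feature.

```python
# A made-up structured dataset: each dict is one observation (row),
# and each key is one feature (column).
rows = [
    {"age": 19, "major": "cogsci", "gpa": 3.4},
    {"age": 21, "major": "math", "gpa": 3.7},
    {"age": 20, "major": "cogsci", "gpa": 3.1},
    {"age": 22, "major": "data science", "gpa": 3.9},
]

n = len(rows)            # number of observations (rows)
p = len(rows[0].keys())  # number of features (columns)

print(f"n = {n}, p = {p}")
```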

31

structured data: storage

  • spreadsheets (.csv, Excel (.xls, .xlsx))

  • databases: queried with SQL (Structured Query Language)

32

SQL as an example of working with a structured database

  • enables you to relate tables of data together, perform actions on data

    • can ask questions (query), add, update, delete information

  • manage + analyze large datasets
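
The four actions listed above (query, add, update, delete) can be sketched with Python's built-in `sqlite3` module. The `students` table and its contents are hypothetical, chosen only to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)")

# Add rows.
conn.executemany(
    "INSERT INTO students (id, name, year) VALUES (?, ?, ?)",
    [(1, "Ana", 1), (2, "Ben", 2), (3, "Caro", 2)],
)

# Update a row.
conn.execute("UPDATE students SET year = 3 WHERE name = ?", ("Ben",))

# Delete a row.
conn.execute("DELETE FROM students WHERE name = ?", ("Caro",))

# Ask a question (query).
remaining = conn.execute("SELECT name, year FROM students ORDER BY id").fetchall()
print(remaining)

conn.close()
```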

33

structured data: JOIN operation

combine rows from 2+ tables by matching columns
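
A minimal sketch of a JOIN, again using `sqlite3`. The two tables and the shared `student_id` column are made up for illustration; the join matches rows wherever the two tables have the same `student_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT);
    CREATE TABLE grades (student_id INTEGER, course TEXT, grade TEXT);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO grades VALUES (1, 'COGS 9', 'A'), (2, 'COGS 9', 'B'), (1, 'MATH 18', 'B');
""")

# Combine rows from both tables by matching on the shared student_id column.
joined = conn.execute("""
    SELECT students.name, grades.course, grades.grade
    FROM students
    JOIN grades ON students.student_id = grades.student_id
    ORDER BY students.name, grades.course
""").fetchall()

for row in joined:
    print(row)

conn.close()
```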

34

unstructured data

no simple way to represent in table format

35

unstructured data: data storage

  • text documents

  • emails + messaging

  • images

  • audio

  • video

  • presentations

  • web content

36

sentiment analysis as an example of analyzing unstructured data

using Natural Language Processing (NLP) to identify and categorize opinions expressed in text
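
A toy sketch of the idea, assuming a tiny hand-written word list. Real NLP systems are far more sophisticated; the word sets and the `sentiment` function here are hypothetical and only show how free text can be turned into a category.

```python
# Hypothetical, hand-picked word lists (not from any real sentiment lexicon).
POSITIVE = {"great", "love", "helpful", "clear"}
NEGATIVE = {"boring", "confusing", "hate", "slow"}

def sentiment(text: str) -> str:
    """Label text by counting positive and negative words."""
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in [
    "I love this class, the lectures are clear!",
    "The pacing was slow and confusing.",
    "The exam is on Friday.",
]:
    print(sentiment(review), "|", review)
```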

37

semi-structured data

doesn’t fit neatly into rows and columns, but has labels + keys that give it organization
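
JSON is a common semi-structured format. The sketch below parses a made-up JSON record (the field names are assumptions) and flattens the nested part into table-like rows, which shows how the labels and keys provide organization even though the data is not a table to begin with.

```python
import json

# A made-up semi-structured record: keys label the data, but posts are nested
# and each post can have a different number of tags.
raw = """
{
  "user": "ana",
  "posts": [
    {"id": 1, "text": "hello", "tags": ["intro"]},
    {"id": 2, "text": "data!", "tags": ["cogs9", "data"]}
  ]
}
"""

record = json.loads(raw)

# Flatten into one row per post so it fits an n x p table.
rows = [
    {"user": record["user"], "post_id": post["id"], "n_tags": len(post["tags"])}
    for post in record["posts"]
]

for row in rows:
    print(row)
```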

38

how to get data

  1. collect it yourself

  2. use public/open-source data

  3. buy it

  4. request it from an API

  5. scrape it from a website

  6. source it from your company

39

steps to request data from an API

  1. send a request

  2. API sends back semi-structured data, often JSON
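
The two steps can be sketched offline as follows. The endpoint URL, the query parameters, the API key placeholder, and the response body are all hypothetical; no real API is being described, and the request is built but never sent so the example runs without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Step 1: build the request (hypothetical endpoint and parameters).
BASE_URL = "https://api.example.com/v1/observations"
params = {"city": "San Diego", "limit": 2}
request = Request(
    f"{BASE_URL}?{urlencode(params)}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
)
print(request.get_method(), request.full_url)

# Sending it would look like: urllib.request.urlopen(request).read()
# A made-up stand-in response body is used here instead.
response_body = '{"results": [{"temp_c": 21.5}, {"temp_c": 19.0}]}'

# Step 2: the API's JSON reply is parsed into Python dicts and lists.
data = json.loads(response_body)
temps = [item["temp_c"] for item in data["results"]]
print(temps)
```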

40

web scraping

a way to automatically collect information from websites

  • used when there’s no API or an insufficient API
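
A minimal scraping sketch using Python's built-in `html.parser`. The HTML snippet, the `title` class name, and the `TitleScraper` class are made up for illustration; a real scraper would first download the page and would need to respect the site's terms and robots.txt.

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect the text inside <h2 class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

# Made-up page content standing in for a downloaded web page.
page = """
<html><body>
  <h2 class="title">First article</h2><p>Some text</p>
  <h2 class="title">Second article</h2>
  <h2 class="ad">Sponsored</h2>
</body></html>
"""

scraper = TitleScraper()
scraper.feed(page)
print(scraper.titles)
```

The scraper depends on the page keeping the same tags and class names, which is one reason scraping is fragile when websites change.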

41

API: definition, pros, cons

a site’s official way to access data

  • structured + stable

  • may require a key + have rate limits

42

web scraping: pros + cons

  • useful when there’s no API

  • fragile; websites change

  • can violate terms & conditions

43

questions to ask before using data

  • allowed? restrictions? ethical, privacy, legal concerns?

  • why and how was the data collected?

  • how was the data cleaned or altered? missing data?

  • who/what is included or excluded? biased in data collection/measurement?

  • documentation?

  • trustworthy?

44

robots.txt

small text file a website can publish that tells automatic crawlers which parts of the site they’re allowed to access

  • voluntary

  • websites enforce real restrictions with logins, paywalls, server-side blocking
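
A well-behaved crawler can check these rules itself before fetching a page. The sketch below uses Python's built-in `urllib.robotparser` on a made-up robots.txt; the rules, the crawler name, and the URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: all crawlers are asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a (hypothetical) crawler may fetch each URL.
private_ok = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
public_ok = parser.can_fetch("my-crawler", "https://example.com/public/page.html")

print("private page allowed:", private_ok)
print("public page allowed:", public_ok)
```

Nothing stops a crawler from skipping this check, which is what "voluntary" means here.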

45

“Identity crisis” and the stats vs. data science debate

Donoho is pushing back on the idea that data science is simply a subset of stats or a rebrand. The argument isn’t “stats is irrelevant”—it’s that data science includes major concerns that weren’t historically central in stats curricula (especially computing and end-to-end workflow).

46

The “two cultures” (generative vs. predictive)

This frames the tension between modeling for explanation, interpretation, and understanding mechanisms (generative) and modeling for prediction and performance on held-out data (predictive).

47

The Common Task Framework (CTF)

CTF is a way to compare methods using a shared dataset, a defined task, and a scoring metric—often with a holdout test set. It encourages a prediction-centric culture and can drive rapid progress, but it can also reward narrow optimization that may not generalize.
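
A toy sketch of the three CTF ingredients: a shared dataset, a defined task, and a scoring metric applied to a holdout set. The data, the two competing methods, and the choice of accuracy as the metric are all made up for illustration.

```python
from statistics import mean

# Shared dataset (made up): (hours_studied, passed) pairs.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
holdout = [(1.5, 0), (2.5, 0), (3.5, 1), (5.5, 1), (6.5, 1)]  # used for scoring only

# Scoring metric shared by every method: accuracy on the given data.
def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Method A: always predict the most common label in the training set.
majority = round(mean(y for _, y in train))
def method_a(x):
    return majority

# Method B: threshold halfway between the two class means in the training set.
threshold = (mean(x for x, y in train if y == 0) + mean(x for x, y in train if y == 1)) / 2
def method_b(x):
    return int(x >= threshold)

# "Leaderboard": every method is scored the same way on the same holdout set.
scores = {"method_a": accuracy(method_a, holdout), "method_b": accuracy(method_b, holdout)}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Because every method is judged on the same holdout set with the same metric, comparisons are easy, but tuning repeatedly against that one set is how narrow optimization can creep in.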

48

“Science about data science”: cross-study and cross-workflow thinking

Donoho’s examples highlight something uncomfortable but important: different reasonable workflows can produce different results, even on the same data. This raises questions about reliability, reproducibility, and the credibility of conclusions.

49

key ideas: “Data’s Day of Reckoning”

  1. Data-driven products can be weaponized (misuse, abuse) when incentives don’t align with the public good. 

  2. Ethics must be integrated into everyday practice and data science education, and not treated as an optional afterthought.

  3. Use concrete mechanisms like checklists and “stop-the-line” culture to make responsibility actionable. Build routines for consent, bias checks, abuse cases, monitoring, and the ability to pause/shut down systems.

50

key ideas: “Myths & Fallacies of Personally Identifiable Information (PII)”

  1. The “PII fallacy”: removing a fixed list of identifiers doesn’t guarantee privacy. PII is hard to define and context-dependent.

  2. Re-identification can happen without names or obvious identifiers. Any sufficiently rich set of behavioral/transactional attributes can uniquely “fingerprint” people, especially when linked with other data.

  3. Privacy protection should move beyond “de-identification” toward stronger approaches.
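
A small sketch of the "fingerprinting" idea in point 2, using made-up records with no names in them. The field names, values, and the `unique_fraction` helper are hypothetical; the point is that combining a few ordinary attributes can single people out.

```python
from collections import Counter

# Made-up "de-identified" records: no names, only ordinary attributes.
records = [
    {"zip": "92093", "birth_year": 2004, "gender": "F"},
    {"zip": "92093", "birth_year": 2004, "gender": "F"},
    {"zip": "92093", "birth_year": 2003, "gender": "M"},
    {"zip": "92122", "birth_year": 2004, "gender": "F"},
    {"zip": "92122", "birth_year": 2001, "gender": "M"},
]

def unique_fraction(fields):
    """Fraction of records whose combination of the given fields is unique."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(counts[k] == 1 for k in keys) / len(keys)

print("gender only:", unique_fraction(["gender"]))
print("zip only:", unique_fraction(["zip"]))
print("zip + birth_year + gender:", unique_fraction(["zip", "birth_year", "gender"]))
```

No single attribute identifies anyone in this toy data, but the combination of all three makes most records unique.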