The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
What is big data?
Big data is defined as a movement that posits the idea that we can learn from a large body of information things we would never have learnt from smaller amounts (so size)
It also refers to the ability to render many aspects of the world into data in ways we have never done before (e.g. location via GPS, friendship via likes) (datafication)
It encompasses a new way of thinking. Big data has marked 3 large shifts:
Collecting as much data as possible rather than relying on small samples
Accepting messy, imperfect data rather than demanding precision
Looking for correlations (what tends to happen alongside what) rather than always seeking causation (why something happens)
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Approaching N = ALL
Big data has facilitated the collection of much more data, which has changed how science approaches data collection.
Historically, collecting and processing data was so costly and time-consuming that working with small, carefully chosen samples was the only practical option.
Issues with samples
They only work well for broad, simple questions but fall apart with more specific ones.
e.g. samples wouldn't work to predict the voting preferences of Asian-American women who aren't university educated > within a normal sample there won't be enough people from that subgroup to draw meaningful conclusions.
But if you collect all the data ("n = all" in statistical terminology), that problem disappears entirely.
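A quick simulation makes the subgroup problem concrete. This is a toy Python sketch: the population size, the 0.5% subgroup share, and the 1,000-person poll are all invented numbers for illustration, not figures from the reading.

```python
import random

random.seed(0)

# Toy population: 1,000,000 voters; assume the subgroup of interest
# makes up 0.5% of them (an invented proportion).
POPULATION = 1_000_000
SUBGROUP_RATE = 0.005

population = [random.random() < SUBGROUP_RATE for _ in range(POPULATION)]

# A typical national poll samples around 1,000 people.
sample = random.sample(population, 1_000)
print(f"Subgroup members in a 1,000-person sample: {sum(sample)}")
# Expected: around 5 people, far too few to say anything about their voting.

# With n = all, the subgroup is fully observed.
print(f"Subgroup members in the full dataset: {sum(population)}")
```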
Secondary benefit
when you collect everything upfront, you don't need to decide in advance what questions you'll ask. The data can be re-examined later for purposes nobody anticipated at the time of collection.
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
From clean to messy
Closely linked to the "n = all" argument, this section challenges the long-held assumption that data must be pristine and exact to be useful. The authors argue that our historical obsession with accuracy was really just a product of scarcity — when you only have a little data, every data point must count, so it must be correct.
With big data, the calculus changes. A small amount of inaccuracy becomes tolerable when offset by the sheer volume and breadth of what you're working with.
Example
Google drastically improved translation by using far more data sourced from messy, imperfect internet sources. The messier but vastly larger dataset produced significantly better results across 65 languages.
The lesson: volume and variety can outweigh cleanliness.
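A toy sketch of why volume can offset mess (synthetic data; the noise levels are assumptions, not Google's actual figures): averaging a million noisy points gives a smaller standard error than averaging a hundred clean ones.

```python
import random

random.seed(1)
TRUE_VALUE = 10.0

def estimate(n, noise_sd):
    """Average n measurements, each corrupted by Gaussian noise."""
    return sum(TRUE_VALUE + random.gauss(0, noise_sd) for _ in range(n)) / n

# Small, carefully cleaned dataset: 100 points, low noise.
clean = estimate(n=100, noise_sd=1.0)

# Big, messy dataset: 1,000,000 points, much higher noise.
messy = estimate(n=1_000_000, noise_sd=5.0)

print(f"clean (n=100):       error = {abs(clean - TRUE_VALUE):.4f}")
print(f"messy (n=1,000,000): error = {abs(messy - TRUE_VALUE):.4f}")
# The messy estimate is usually closer: its standard error is
# 5/sqrt(1e6) = 0.005 vs 1/sqrt(100) = 0.1 for the clean one.
```

Note this only works when the mess behaves like random noise rather than systematic bias; variety of sources helps wash out individual quirks.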
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
From causation to correlation
Big data has constituted a MAJOR shift in science
It encourages us to stop asking WHY things happen and focus instead on WHAT is happening and WHAT tends to follow.
Understanding causation is obviously desirable BUT it's often difficult to establish causes AND humans have a well-documented cognitive tendency (highlighted by behavioural economics) to perceive causation where none actually exists.
Correlation has practical value in itself
Examples
UPS > fitted heat and vibration sensors into UPS vans to predict a breakdown before it occurred. The system worked. There was no explanation as to WHY these factors predicted breakdown but the CORRELATION WAS STRONG ENOUGH TO ACT ON
Google Flu > Google could track seasonal flu outbreaks in near-real time simply by analysing search query patterns, cross-referencing them against CDC historical flu data. The system had no understanding of why people were searching those terms BUT correlation was enough.
But caveat > Google's system later overestimated flu cases > a reminder that correlations are probabilistic, NOT certain. They can be disrupted by external factors.
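A minimal sketch of acting on correlation alone, loosely in the spirit of the UPS example (all sensor readings and thresholds below are made up; this is not UPS's actual system): pick whatever vibration level best separated past breakdowns, with no model of why.

```python
# Hypothetical sensor logs: (mean_vibration, broke_down_within_30_days)
history = [
    (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.6, 1),
    (0.7, 0), (0.8, 1), (0.9, 1), (1.0, 1), (1.1, 1),
]

# No causal model: just pick the vibration threshold that best
# separated past breakdowns from non-breakdowns.
def accuracy(threshold):
    return sum((vib >= threshold) == bool(broke) for vib, broke in history)

best = max((v for v, _ in history), key=accuracy)
print(f"Act when mean vibration >= {best}")

# Apply to the current fleet: pure correlation, no explanation of WHY.
fleet = {"van_17": 0.95, "van_23": 0.35}
for van, vib in fleet.items():
    print(van, "-> schedule maintenance" if vib >= best else "-> OK")
```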
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Back-End Operations (Datafication)
Big data introduces datafication
it means taking aspects of life that were never previously thought of as data and converting them into quantifiable, analysable information.
The key point is that once something is datafied, it can be repurposed in ways its originators never intended.
Datafication unlocks entirely new categories of knowledge and commercial value.
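A tiny sketch of datafication using the "friendship via likes" idea from earlier (all events invented): raw behaviour becomes a quantified, analysable table that can later be repurposed for things its creators never intended.

```python
from collections import Counter

# Raw "like" events between users (invented data).
like_events = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("bob", "alice"), ("carol", "alice"), ("alice", "bob"),
]

# Datafy: count interactions per (unordered) pair as a crude
# "friendship strength" measure.
strength = Counter(frozenset(pair) for pair in like_events)

for pair, n in strength.most_common():
    print(sorted(pair), n)

# Once datafied, the same table can be repurposed: ad targeting,
# churn prediction, contact tracing... uses nobody anticipated.
```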
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Big Data in the Big Apple (Government and Politics)
Argument that big data will change the way governments function
NYC Mayor Bloomberg example
NYC had an issue with fire inspections > received 25,000 complaints but had only 200 inspectors
analytics team built a database of all 900,000 buildings in the city and enriched it with data from 19 municipal agencies (tax records, missed payments, crime rates, ambulance call-outs)
Led to a dramatically more efficient inspection system > the share of inspections resulting in evacuation orders rose from 13% to 70% of visits.
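A stripped-down sketch of what cross-agency risk scoring like this could look like (fabricated records and weights; the real system drew on far richer data from the 19 agencies): merge signals per building, score, and send the 200 inspectors to the worst first.

```python
# Toy building records merged from several hypothetical agency feeds.
buildings = [
    {"id": "B001", "tax_arrears": 1, "missed_payments": 2, "ambulance_calls": 5},
    {"id": "B002", "tax_arrears": 0, "missed_payments": 0, "ambulance_calls": 0},
    {"id": "B003", "tax_arrears": 1, "missed_payments": 4, "ambulance_calls": 9},
]

# Invented weights, standing in for whatever the analytics team learned
# from past inspection outcomes.
WEIGHTS = {"tax_arrears": 3.0, "missed_payments": 1.5, "ambulance_calls": 0.8}

def risk_score(b):
    return sum(WEIGHTS[k] * b[k] for k in WEIGHTS)

# Prioritise inspections by descending risk.
for b in sorted(buildings, key=risk_score, reverse=True):
    print(b["id"], round(risk_score(b), 1))
```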
Big data also pushes governments to publish the data they hold > enhancing democratic transparency > However, they warn that governments must also act to prevent big data from enabling unhealthy corporate monopolies
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Big data or Big Brother
Argument that big data amplifies existing power asymmetries between states and citizens, potentially enabling what they call "big-data authoritarianism."
eg:
Department of Homeland Security's FAST programme > attempted to identify potential terrorists by analysing physiological and behavioural data.
Crime prediction systems too
The deeper concern is about pre-crime intervention and free will. Even well-intentioned uses of predictive data — for example, identifying which teenagers are statistically likely to drop out of school or become pregnant, in order to offer support — carry the risk of stigmatising individuals for things they haven't yet done, and may never do. The correlation-based prediction becomes a kind of pre-emptive judgement, eroding the presumption of innocence > links to the "death of the legal subject" idea
IMO > isn't this just akin to having preconceived notions about people based on their background? If a human were to do this it would be discrimination.
Big Data and Prediction: Four Case Studies > Northcott
The main argument is that claims that big data is revolutionary are overstated. Whilst big data has certainly been a useful tool (evidenced by the weather and political elections case studies), it is not revolutionary, because big data by itself fails to predict well when the Pietsch conditions are not satisfied (especially stationarity).
So saying that ‘theory is dead’ is premature and an overstatement.
Big Data and Prediction: Four Case Studies > Northcott
What is non-stationarity?
When the underlying relationship between variables keeps changing over time.
This is fatal for big data methods, which depend on learning patterns from past data and assuming those patterns will continue.
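A toy demonstration of the point (the drift and all numbers are invented): a model fitted on past data is frozen, while the true relationship keeps shifting underneath it.

```python
# True relationship drifts over time: y = (2 + 0.1*t) * x  (invented numbers)
def true_slope(t):
    return 2.0 + 0.1 * t

# "Train" once on period t = 0: with noiseless data the fitted slope is exact.
fitted_slope = true_slope(0)  # 2.0, learned from past data

# Keep using the frozen model while the world drifts.
for t in range(0, 25, 5):
    x = 5.0
    prediction = fitted_slope * x
    reality = true_slope(t) * x
    print(f"t={t:2d}  prediction={prediction:5.1f}  reality={reality:5.1f}  "
          f"error={abs(prediction - reality):4.1f}")
# The error grows steadily: patterns learned from the past stop holding.
```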
Big Data and Prediction: Four Case Studies > Northcott
Political elections
Opinion polling is the most successful method for predicting elections. But it is difficult to get right:
sample size
sample balance (to ensure reflects the voting population) > Getting this wrong systematically skews results regardless of sample size.
Political campaigns have used big data heavily, most famously "microtargeting", which tracks hundreds of variables (shopping habits, media preferences, demographics) to predict individual voters' preferences and likelihood of voting. Obama's 2008 campaign tracked over 800 voter variables.
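A hedged sketch of the mechanics of microtargeting (made-up features and labels, nothing from any real campaign): fit a classifier on per-voter variables, then score individuals to decide where to spend outreach effort. Uses scikit-learn's LogisticRegression.

```python
from sklearn.linear_model import LogisticRegression

# Invented per-voter feature vectors (in reality: hundreds of variables
# such as shopping habits, media preferences, demographics).
X = [
    [1, 0, 34, 1],   # [magazine_sub, gun_license, age, urban]
    [0, 1, 61, 0],
    [1, 0, 25, 1],
    [0, 1, 47, 0],
    [1, 1, 52, 1],
    [0, 0, 39, 0],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = leaned toward the candidate in past canvassing

model = LogisticRegression().fit(X, y)

# Score a new voter: probability of support, used to decide whether
# a door-knock or ad spend is worth it.
new_voter = [[1, 0, 29, 1]]
print(f"P(support) = {model.predict_proba(new_voter)[0][1]:.2f}")
```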
Whilst prediction has improved modestly thanks to more data and better polling aggregation, big data hasn't revolutionised it, and structural limitations mean it's unlikely to.
Non-stationarity
Also, elections are rare events, so there simply isn't enough data to train good predictive algorithms.
Big Data and Prediction: Four Case Studies > Northcott
Weather
Weather forecasting has genuinely improved significantly — seven-day forecasts today are as reliable as three-day forecasts were 20 years ago
why?
more and better data
better models
ensemble forecasting > instead of running one simulation, forecasters now run many and aggregate the results probabilistically (see the sketch below)
more computing
More data has played an instrumental role > classic example of big data
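A toy version of ensemble forecasting (a chaotic logistic map stands in for the atmosphere; every parameter here is invented): run many simulations from slightly perturbed initial conditions and report a probability instead of a single deterministic answer.

```python
import random

random.seed(3)

def simulate(x0, steps=50, r=3.9):
    """Chaotic logistic map as a stand-in for an atmospheric model."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

observed = 0.600  # best estimate of the current state (instrument error ~1e-3)

# Ensemble: many runs from slightly perturbed initial conditions.
ensemble = [simulate(observed + random.gauss(0, 0.001)) for _ in range(500)]

# Aggregate probabilistically instead of trusting one deterministic run.
p_rain = sum(x > 0.5 for x in ensemble) / len(ensemble)
print(f"P('rain', i.e. state > 0.5, in 50 steps) = {p_rain:.2f}")
```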
BUT
Background theory is still necessary — for example, deciding what kinds of data to collect requires physical understanding of the atmosphere.
if the weather system is truly chaotic, only probabilistic forecasts will ever be possible — you'll never get precise deterministic predictions far in advance. (esp with climate change)
So whilst big data has helped, theory still remains important here.
Big Data and Prediction: Four Case Studies > Northcott
GDP
Big data has NOT been a fix for GDP predictions. In one study, out of 60 actual instances of negative growth, economists' consensus forecasts predicted negative growth only three times.
WHY
Non-stationarity — the economy's causal structure keeps changing
Open system — the economy is constantly buffeted by non-economic factors (elections, pandemics, wars) that no economic model includes
Reflexivity — if a forecast becomes widely known, people may change their behaviour in response to it, making the forecast self-defeating (see the toy loop after this list)
Measurement problems — GDP itself is only estimated, with large revisions after the fact
Possible chaos — meaning at best only probabilistic forecasts are ever possible
Unobservable variables — economic agents' expectations drive behaviour but can't be directly measured
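Of those barriers, reflexivity is the easiest to see in a toy loop (entirely invented dynamics and coefficients): publishing the forecast changes the very behaviour it was forecasting.

```python
# Toy self-defeating forecast (invented numbers).
baseline_growth = -1.0          # what would happen with no forecast: recession

published_forecast = baseline_growth   # forecasters announce -1.0%

# Agents react to the published forecast: the central bank cuts rates,
# firms adjust, and the reaction partially offsets the predicted slump.
policy_response = -0.8 * published_forecast
actual_growth = baseline_growth + policy_response

print(f"forecast: {published_forecast:+.2f}%   actual: {actual_growth:+.2f}%")
# forecast: -1.00%   actual: -0.20%  -> the forecast helped falsify itself
```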
YET > this isn't just a big data issue; other methods do just as badly:
extrapolation,
complex econometric models,
surveys
Big Data and Prediction: Four Case Studies > Northcott
Economic Auctions (US Spectrum Auctions)
TLDR > Big data methods were irrelevant here; predictive success required theory-guided experiment design and context-specific expertise, since the setting was niche.
Game theory provided a starting point, as the crucial "data" here wasn't a pre-existing dataset to mine; it had to be actively created through experiments.
Big Data and Prediction: Four Case Studies > Northcott
Pietsch conditions
Pietsch identifies four conditions that must be satisfied for big data methods to predict successfully:
Good vocabulary — the variables you're tracking must be genuinely stable causal categories, not arbitrary composites
All relevant parameters known — you can't predict well if key causal factors are missing from your model
Stable background conditions (stationarity) — correlations found in past data must still hold in the future
Sufficient data — enough examples of every relevant combination of causes and effects to learn from
Northcott nuances
these conditions are NECESSARY BUT NOT SUFFICIENT. (GDP shows there can be additional barriers (chaos, reflexivity, measurement error) even when the conditions are broadly met.)
Nonetheless, background theory is indispensable: even for pure prediction, you need theory to choose the right variables and decide what data to collect > big data doesn't eliminate the need for human expertise and theoretical knowledge.
Prediction is often an amalgam of methods — big data might help with some sub-tasks even when it can't improve overall prediction.