The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
What is big data?
Big data is defined as a movement that posits the idea that we can learn from a large body of information things we would never have learnt from smaller amounts (so size)
It also refers to the ability to render many aspects of the world into data in ways we have never done before (e.g. location via GPS, friendship via likes) (datafication)
It encompasses a new way of thinking. Big data has marked 3 large shifts:
Collecting as much data as possible rather than relying on small samples
Accepting messy, imperfect data rather than demanding precision
Looking for correlations (what tends to happen alongside what) rather than always seeking causation (why something happens)
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Approaching N = ALL
Big data has facilitated the collection of much more data, which has changed how science approaches data collection.
Historically, collecting and processing data was so costly and time-consuming that working with small, carefully chosen samples was the only practical option.
Issues with samples
They only work well for broad, simple questions but fall apart with more specific ones.
e.g. samples wouldn't work to predict the voting preferences of Asian-American women who aren't university educated > within a normal sample there won't be enough people from that subgroup to draw meaningful conclusions.
But if you collect all the data ("n = all" in statistical terminology), that problem disappears entirely.
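A quick simulation makes the subgroup problem concrete. This is a toy Python sketch: the population size, the 0.5% subgroup share, and the 1,000-person poll are all invented numbers for illustration, not figures from the reading.

```python
import random

random.seed(0)

# Toy population: 1,000,000 voters; assume the subgroup of interest
# makes up 0.5% of them (an invented proportion).
POPULATION = 1_000_000
SUBGROUP_RATE = 0.005

population = [random.random() < SUBGROUP_RATE for _ in range(POPULATION)]

# A typical national poll samples around 1,000 people.
sample = random.sample(population, 1_000)
print(f"Subgroup members in a 1,000-person sample: {sum(sample)}")
# Expected: around 5 people, far too few to say anything about their voting.

# With n = all, the subgroup is fully observed.
print(f"Subgroup members in the full dataset: {sum(population)}")
```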
Secondary benefit
when you collect everything upfront, you don't need to decide in advance what questions you'll ask. The data can be re-examined later for purposes nobody anticipated at the time of collection.
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
From clean to messy
Closely linked to the "n = all" argument, this section challenges the long-held assumption that data must be pristine and exact to be useful. The authors argue that our historical obsession with accuracy was really just a product of scarcity — when you only have a little data, every data point must count, so it must be correct.
With big data, the calculus changes. A small amount of inaccuracy becomes tolerable when offset by the sheer volume and breadth of what you're working with.
Example
Google drastically improved translation by using far more data sourced from messy, imperfect internet sources. The messier but vastly larger dataset produced significantly better results across 65 languages.
The lesson: volume and variety can outweigh cleanliness.
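A toy sketch of why volume can offset mess (synthetic data; the noise levels are assumptions, not Google's actual figures): averaging a million noisy points gives a smaller standard error than averaging a hundred clean ones.

```python
import random

random.seed(1)
TRUE_VALUE = 10.0

def estimate(n, noise_sd):
    """Average n measurements, each corrupted by Gaussian noise."""
    return sum(TRUE_VALUE + random.gauss(0, noise_sd) for _ in range(n)) / n

# Small, carefully cleaned dataset: 100 points, low noise.
clean = estimate(n=100, noise_sd=1.0)

# Big, messy dataset: 1,000,000 points, much higher noise.
messy = estimate(n=1_000_000, noise_sd=5.0)

print(f"clean (n=100):       error = {abs(clean - TRUE_VALUE):.4f}")
print(f"messy (n=1,000,000): error = {abs(messy - TRUE_VALUE):.4f}")
# The messy estimate is usually closer: its standard error is
# 5/sqrt(1e6) = 0.005 vs 1/sqrt(100) = 0.1 for the clean one.
```

Note this only works when the mess behaves like random noise rather than systematic bias; variety of sources helps wash out individual quirks.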
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
From causation to correlation
Big data has constituted a MAJOR shift in science
It encourages us to stop asking WHY things happen and focus instead on WHAT is happening and WHAT tends to follow.
Understanding causation is obviously desirable BUT it's often difficult to establish causes AND humans have a well-documented cognitive tendency (highlighted by behavioural economics) to perceive causation where none actually exists.
Correlation has practical value in itself
Examples
UPS > fitted heat and vibration sensors into UPS vans to predict a breakdown before it occurred. The system worked. There was no explanation as to WHY these factors predicted breakdown but the CORRELATION WAS STRONG ENOUGH TO ACT ON
Google Flu > Google could track seasonal flu outbreaks in near-real time simply by analysing search query patterns, cross-referencing them against CDC historical flu data. The system had no understanding of why people were searching those terms BUT correlation was enough.
But caveat > Google's system later overestimated flu cases > a reminder that correlations are probabilistic, NOT certain. They can be disrupted by external factors.
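A minimal sketch of acting on correlation alone, loosely in the spirit of the UPS example (all sensor readings and thresholds below are made up; this is not UPS's actual system): pick whatever vibration level best separated past breakdowns, with no model of why.

```python
# Hypothetical sensor logs: (mean_vibration, broke_down_within_30_days)
history = [
    (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.6, 1),
    (0.7, 0), (0.8, 1), (0.9, 1), (1.0, 1), (1.1, 1),
]

# No causal model: just pick the vibration threshold that best
# separated past breakdowns from non-breakdowns.
def accuracy(threshold):
    return sum((vib >= threshold) == bool(broke) for vib, broke in history)

best = max((v for v, _ in history), key=accuracy)
print(f"Act when mean vibration >= {best}")

# Apply to the current fleet: pure correlation, no explanation of WHY.
fleet = {"van_17": 0.95, "van_23": 0.35}
for van, vib in fleet.items():
    print(van, "-> schedule maintenance" if vib >= best else "-> OK")
```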
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Back-End Operations (Datafication)
Big data introduces datafication
it means taking aspects of life that were never previously thought of as data and converting them into quantifiable, analysable information.
The key point is that once something is datafied, it can be repurposed in ways its originators never intended.
Datafication unlocks entirely new categories of knowledge and commercial value.
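A tiny sketch of datafication using the "friendship via likes" idea from earlier (all events invented): raw behaviour becomes a quantified, analysable table that can later be repurposed for things its creators never intended.

```python
from collections import Counter

# Raw "like" events between users (invented data).
like_events = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("bob", "alice"), ("carol", "alice"), ("alice", "bob"),
]

# Datafy: count interactions per (unordered) pair as a crude
# "friendship strength" measure.
strength = Counter(frozenset(pair) for pair in like_events)

for pair, n in strength.most_common():
    print(sorted(pair), n)

# Once datafied, the same table can be repurposed: ad targeting,
# churn prediction, contact tracing... uses nobody anticipated.
```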
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Big Data in the Big Apple (Government and Politics)
Argument that big data will change the way governments function
NYC Mayor Bloomberg example
NYC had an issue with fire inspections > received 25,000 complaints but had only 200 inspectors
analytics team built a database of all 900,000 buildings in the city and enriched it with data from 19 municipal agencies (tax records, missed payments, crime rates, ambulance call-outs)
Led to a dramatically more efficient inspection system > the share of inspections resulting in evacuation orders rose from 13% to 70% of visits.
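A stripped-down sketch of what cross-agency risk scoring like this could look like (fabricated records and weights; the real system drew on far richer data from the 19 agencies): merge signals per building, score, and send the 200 inspectors to the worst first.

```python
# Toy building records merged from several hypothetical agency feeds.
buildings = [
    {"id": "B001", "tax_arrears": 1, "missed_payments": 2, "ambulance_calls": 5},
    {"id": "B002", "tax_arrears": 0, "missed_payments": 0, "ambulance_calls": 0},
    {"id": "B003", "tax_arrears": 1, "missed_payments": 4, "ambulance_calls": 9},
]

# Invented weights, standing in for whatever the analytics team learned
# from past inspection outcomes.
WEIGHTS = {"tax_arrears": 3.0, "missed_payments": 1.5, "ambulance_calls": 0.8}

def risk_score(b):
    return sum(WEIGHTS[k] * b[k] for k in WEIGHTS)

# Prioritise inspections by descending risk.
for b in sorted(buildings, key=risk_score, reverse=True):
    print(b["id"], round(risk_score(b), 1))
```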
Big data also pushes governments to publish the data they hold > enhancing democratic transparency > However, they warn that governments must also act to prevent big data from enabling unhealthy corporate monopolies
The rise of big data > Kenneth Cukier and Viktor Mayer-Schönberger
Big data or Big Brother
Argument that big data amplifies existing power asymmetries between states and citizens, potentially enabling what they call "big-data authoritarianism."
eg:
Department of Homeland Security's FAST programme > attempted to identify potential terrorists by analysing physiological and behavioural data.
Crime prediction systems too
The deeper concern is about pre-crime intervention and free will. Even well-intentioned uses of predictive data — for example, identifying which teenagers are statistically likely to drop out of school or become pregnant, in order to offer support — carry the risk of stigmatising individuals for things they haven't yet done, and may never do. The correlation-based prediction becomes a kind of pre-emptive judgement, eroding the presumption of innocence > links to the "death of the legal subject" idea
IMO > isn't this just akin to having preconceived notions about people based on their background? If a human were to do this it would be discrimination.
Big Data and Prediction: Four Case Studies > Northcott
The main argument is that claims that big data is revolutionary are overstated. Whilst big data has certainly been a useful tool (evidenced by the weather and political elections case studies), it is not revolutionary, because big data by itself fails to predict well when the Pietsch conditions are not satisfied (especially stationarity).
So saying that ‘theory is dead’ is premature and an overstatement.
Big Data and Prediction: Four Case Studies > Northcott
What is non-stationarity?
When the underlying relationship between variables keeps changing over time.
This is fatal for big data methods, which depend on learning patterns from past data and assuming those patterns will continue.
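A toy demonstration of the point (the drift and all numbers are invented): a model fitted on past data is frozen, while the true relationship keeps shifting underneath it.

```python
# True relationship drifts over time: y = (2 + 0.1*t) * x  (invented numbers)
def true_slope(t):
    return 2.0 + 0.1 * t

# "Train" once on period t = 0: with noiseless data the fitted slope is exact.
fitted_slope = true_slope(0)  # 2.0, learned from past data

# Keep using the frozen model while the world drifts.
for t in range(0, 25, 5):
    x = 5.0
    prediction = fitted_slope * x
    reality = true_slope(t) * x
    print(f"t={t:2d}  prediction={prediction:5.1f}  reality={reality:5.1f}  "
          f"error={abs(prediction - reality):4.1f}")
# The error grows steadily: patterns learned from the past stop holding.
```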
Big Data and Prediction: Four Case Studies > Northcott
Political elections
Opinion polling is the most successful method for predicting elections. But it is difficult to get right:
sample size
sample balance (to ensure reflects the voting population) > Getting this wrong systematically skews results regardless of sample size.
Political campaigns have used big data heavily, most famously "microtargeting", which tracks hundreds of variables (shopping habits, media preferences, demographics) to predict individual voters' preferences and likelihood of voting. Obama's 2008 campaign tracked over 800 voter variables.
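A hedged sketch of the mechanics of microtargeting (made-up features and labels, nothing from any real campaign): fit a classifier on per-voter variables, then score individuals to decide where to spend outreach effort. Uses scikit-learn's LogisticRegression.

```python
from sklearn.linear_model import LogisticRegression

# Invented per-voter feature vectors (in reality: hundreds of variables
# such as shopping habits, media preferences, demographics).
X = [
    [1, 0, 34, 1],   # [magazine_sub, gun_license, age, urban]
    [0, 1, 61, 0],
    [1, 0, 25, 1],
    [0, 1, 47, 0],
    [1, 1, 52, 1],
    [0, 0, 39, 0],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = leaned toward the candidate in past canvassing

model = LogisticRegression().fit(X, y)

# Score a new voter: probability of support, used to decide whether
# a door-knock or ad spend is worth it.
new_voter = [[1, 0, 29, 1]]
print(f"P(support) = {model.predict_proba(new_voter)[0][1]:.2f}")
```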
Whilst prediction has improved modestly thanks to more data and better polling aggregation, big data hasn't revolutionised it, and structural limitations mean it's unlikely to.
Non-stationarity
Also, elections are rare events, so there simply isn't enough data to train good predictive algorithms.
Big Data and Prediction: Four Case Studies > Northcott
Weather
Weather forecasting has genuinely improved significantly — seven-day forecasts today are as reliable as three-day forecasts were 20 years ago
why?
more and better data
better models
ensemble forecasting > instead of running one simulation, forecasters now run many and aggregate the results probabilistically (see the sketch below)
more computing
More data has played an instrumental role > classic example of big data
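A toy version of ensemble forecasting (a chaotic logistic map stands in for the atmosphere; every parameter here is invented): run many simulations from slightly perturbed initial conditions and report a probability instead of a single deterministic answer.

```python
import random

random.seed(3)

def simulate(x0, steps=50, r=3.9):
    """Chaotic logistic map as a stand-in for an atmospheric model."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

observed = 0.600  # best estimate of the current state (instrument error ~1e-3)

# Ensemble: many runs from slightly perturbed initial conditions.
ensemble = [simulate(observed + random.gauss(0, 0.001)) for _ in range(500)]

# Aggregate probabilistically instead of trusting one deterministic run.
p_rain = sum(x > 0.5 for x in ensemble) / len(ensemble)
print(f"P('rain', i.e. state > 0.5, in 50 steps) = {p_rain:.2f}")
```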
BUT
Background theory is still necessary — for example, deciding what kinds of data to collect requires physical understanding of the atmosphere.
if the weather system is truly chaotic, only probabilistic forecasts will ever be possible — you'll never get precise deterministic predictions far in advance. (esp with climate change)
So whilst big data has helped, theory still remains important here.
Big Data and Prediction: Four Case Studies > Northcott
GDP
Big data has NOT been a fix for GDP predictions. In one study, out of 60 actual instances of negative growth, economists' consensus forecasts predicted negative growth only three times.
WHY
Non-stationarity — the economy's causal structure keeps changing
Open system — the economy is constantly buffeted by non-economic factors (elections, pandemics, wars) that no economic model includes
Reflexivity — if a forecast becomes widely known, people may change their behaviour in response to it, making the forecast self-defeating (see the toy loop after this list)
Measurement problems — GDP itself is only estimated, with large revisions after the fact
Possible chaos — meaning at best only probabilistic forecasts are ever possible
Unobservable variables — economic agents' expectations drive behaviour but can't be directly measured
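Of those barriers, reflexivity is the easiest to see in a toy loop (entirely invented dynamics and coefficients): publishing the forecast changes the very behaviour it was forecasting.

```python
# Toy self-defeating forecast (invented numbers).
baseline_growth = -1.0          # what would happen with no forecast: recession

published_forecast = baseline_growth   # forecasters announce -1.0%

# Agents react to the published forecast: the central bank cuts rates,
# firms adjust, and the reaction partially offsets the predicted slump.
policy_response = -0.8 * published_forecast
actual_growth = baseline_growth + policy_response

print(f"forecast: {published_forecast:+.2f}%   actual: {actual_growth:+.2f}%")
# forecast: -1.00%   actual: -0.20%  -> the forecast helped falsify itself
```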
YET > this isn't just a big data issue; other methods do just as badly:
extrapolation,
complex econometric models,
surveys
Big Data and Prediction: Four Case Studies > Northcott
Economic Auctions (US Spectrum Auctions)
TLDR > Big data methods were irrelevant here; predictive success required theory-guided experiment design and context-specific expertise, since the setting was niche.
Game theory provided a starting point, as the crucial "data" here wasn't a pre-existing dataset to mine; it had to be actively created through experiments.
Big Data and Prediction: Four Case Studies > Northcott
Pietsch conditions
Pietsch identifies four conditions that must be satisfied for big data methods to predict successfully:
Good vocabulary — the variables you're tracking must be genuinely stable causal categories, not arbitrary composites
All relevant parameters known — you can't predict well if key causal factors are missing from your model
Stable background conditions (stationarity) — correlations found in past data must still hold in the future
Sufficient data — enough examples of every relevant combination of causes and effects to learn from
Northcott nuances
these conditions are NECESSARY BUT NOT SUFFICIENT. (GDP shows there can be additional barriers (chaos, reflexivity, measurement error) even when the conditions are broadly met.)
Nonetheless, background theory is indispensable: even for pure prediction, you need theory to choose the right variables and decide what data to collect > big data doesn't eliminate the need for human expertise and theoretical knowledge.
Prediction is often an amalgam of methods — big data might help with some sub-tasks even when it can't improve overall prediction.