The Use and Misuse of Data
The University of Sydney - SCIE1001: The Use and Misuse of Data
Overview
Course Title: The Use and Misuse of Data
Date: March 3, 2026
Lecture Topics for Weeks 2 & 3
Use and Misuse of Statistics:
Focus on false positives and negatives in testing.
Explore the counterintuitive nature of statistical outcomes.
Discuss inherent uncertainties in the scientific process.
Understanding Basics:
A foundation in probability and statistics is essential for recognising potential misuses.
Interactive Learning:
Engage in interactive lectures and tutorials with peers.
Statistics and Its Role in Science
Definition: Statistics is the science of collecting, describing, and analyzing data.
Importance in Science:
Science relies on observations and experiments, leading to data collection.
Statistics transforms raw data into meaningful information to inform decisions.
Uncertainty and Variation
Nature of Scientific Inquiry:
Falsification: Testing hypotheses amidst uncertainty.
Exploratory Science: Aiming for precise measurements and accurate estimates.
Role of Statistics:
Facilitates intelligent judgment and quantification of decisions amidst uncertainty.
Basic Terminology and Descriptive Statistics
Probability vs. Statistics
Probability (Theoretical):
Begins with a known model.
Describes randomness and uncertainty.
Addresses: What is the chance of an event occurring?
Statistics (Practical):
Starts with observed data.
Interprets data variability.
Addresses: What can we infer from data?
Populations and Samples
Population Definition:
A well-defined collection of objects.
Examples of Populations:
Response to a specific drug by Type 2 diabetes patients.
All galaxies in the observable universe.
Sampling Method:
Due to constraints, data collection on entire populations is impractical.
A sample, or a subset, is selected for analysis.
Sampling Examples:
Patients with Type 2 diabetes at a specific hospital over one week.
Galaxies observed in a specific astronomical image.
Reasoning
Probability vs. Statistics Reasoning:
Probability uses deductive reasoning from population to sample.
Inferential statistics uses inductive reasoning from sample to population.
Inferential Statistics
Statistical Methods Applied:
Descriptive Statistics: Charts, graphs, and numerical summaries.
Parameter Estimation: Inferring population characteristics from samples.
Hypothesis Testing: Testing claims about population characteristics.
Importance: Understanding these methodologies is vital for assessing data use and potential misuse.
Data Visualization
Types of Visualization:
Bar charts, pie charts, line graphs, etc.
Construct and Interpretation:
Understanding graph construction is crucial since graphs can mislead.
Common Misleading Techniques:
Truncated Axes: Cutting off part of an axis hides the baseline and exaggerates differences.
Cherry-Picked Ranges: Showing only the data range that supports a desired conclusion.
3D Effects: Perspective distorts the apparent sizes being compared.
Omitted Baselines: Presenting data that starts from a non-zero point, without showing the baseline, distorts perception.
Using Areas for 1D Data: Scaling both the width and height of a shape to show a single number exaggerates differences.
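To see how much a truncated axis can exaggerate a difference, the short sketch below compares the drawn heights of two bars under two axis choices. The scores and the function name `drawn_height_ratio` are hypothetical and chosen only for illustration; they are not taken from the lecture.

```python
def drawn_height_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn heights of bars b and a when the axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must lie below both values")
    return (b - axis_start) / (a - axis_start)

# Hypothetical scores (percent) for two items being compared.
score_a, score_b = 50.0, 52.0

honest = drawn_height_ratio(score_a, score_b)           # axis starts at 0
truncated = drawn_height_ratio(score_a, score_b, 48.0)  # axis starts at 48

print(f"zero baseline: bar B is {honest:.2f}x the height of bar A")
print(f"axis from 48:  bar B is {truncated:.2f}x the height of bar A")
```

With these assumed numbers, a 4% real difference is drawn as a bar twice as tall once the axis starts at 48.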
Example
Case Study: OpenAI's live stream about GPT-5
Subsequent Data Observations:
Accuracy percentages presented for comparison purposes.
Numerical Summaries
Measures of Central Tendency
Purpose: Summarizes datasets into a single representative value.
Common Measures:
Mean: Arithmetic average, sensitive to extreme values (outliers).
Median: Middle value, resistant to outliers.
Consideration: A sole measure can be misleading depending on the data's distribution characteristics.
Example: Reaction Time Question
Query: Which measure (mean or median) better represents the reaction time of students?
Key Takeaway
Asymmetry in Data: When data are skewed, the median reflects the typical value more accurately than the mean. When data are symmetric, the mean and median are close together.
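A small sketch of the mean-versus-median comparison, using Python's standard `statistics` module. The reaction times are hypothetical values made up for illustration, with one slow response added to skew the data to the right.

```python
from statistics import mean, median

# Hypothetical reaction times in milliseconds; the 900 ms value is a deliberate outlier.
reaction_times_ms = [250, 260, 270, 280, 290, 300, 900]

avg = mean(reaction_times_ms)
mid = median(reaction_times_ms)

print(f"mean   = {avg:.1f} ms")   # pulled upward by the single slow response
print(f"median = {mid} ms")       # stays with the bulk of the data
```

In this made-up dataset the mean is larger than six of the seven observations, while the median sits in the middle of the typical values.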
Measures of Spread
Definition: Measures the variability around a dataset's center.
Common Measures:
Range: Maximum - Minimum value.
Interquartile Range (IQR): Spread of the middle 50%.
Standard Deviation: Roughly the typical distance of data points from the mean (the square root of the average squared deviation).
Contextual Importance: Two datasets can show the same mean but grossly differ in variability.
Example: Air Pollution Statistics
Comparative Analysis:
City A Reporting: Consistent readings between 11 and 13 µg/m³ throughout the year.
City B Fluctuation: Readings can drop to 5 µg/m³ or rise to 30 µg/m³ during certain seasons, indicating variability despite the same average figure.
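The comparison can be sketched numerically. The monthly readings below are hypothetical, invented so that both cities share the same mean while matching the ranges described above; `summarise` is an illustrative helper, not part of the course material.

```python
from statistics import mean, quantiles, stdev

# Hypothetical monthly readings (µg/m³), chosen to match the ranges described above.
city_a = [11, 12, 13, 12, 11, 13, 12, 12, 11, 13, 12, 12]
city_b = [5, 6, 7, 8, 9, 10, 10, 12, 14, 15, 18, 30]

def summarise(readings):
    q1, _, q3 = quantiles(readings, n=4)   # quartile cut points
    return {
        "mean": mean(readings),
        "range": max(readings) - min(readings),
        "iqr": q3 - q1,
        "stdev": stdev(readings),          # sample standard deviation
    }

summary_a = summarise(city_a)
summary_b = summarise(city_b)
for name, s in (("City A", summary_a), ("City B", summary_b)):
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Both summaries report the same mean, but every measure of spread is larger for City B, which is the information a mean-only report would hide.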
Key Takeaways: Numerical Summaries
Data Distribution: Understand distributions to determine suitable measures of center and spread.
Reporting Standards: Must include center, spread, and shape in numerical summaries of quantitative data.
Introduction to Parameter Estimation
Case Study: Estimating Panzer Tank Production in WWII
Contextual Background: Allies captured various Panzer tanks and noted their unique serial numbers.
Analysis Objective: Determine whether these numbers could estimate total German tank production.
Outcome Quality: Post-war production figures showed that estimates derived from serial numbers were more accurate than estimates from traditional espionage.
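These notes do not say which formula the analysts used. One standard estimator for this kind of problem takes the largest observed serial number m from k captured tanks and estimates the total as m + m/k - 1. The sketch below assumes serial numbers run from 1 to N without gaps and that captured tanks behave like a simple random sample; the serial numbers and the name `estimate_total` are hypothetical.

```python
from typing import Sequence

def estimate_total(serials: Sequence[int]) -> float:
    """Estimate the population size N from serial numbers assumed drawn at random from 1..N."""
    k = len(serials)   # number of captured tanks
    m = max(serials)   # largest serial number seen
    return m + m / k - 1

captured = [19, 40, 42, 60]       # hypothetical serial numbers
print(estimate_total(captured))   # 60 + 60/4 - 1 = 74.0
```

The intuition is that the true maximum probably lies a little above the largest serial observed, by about the average gap between observed serials.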
Research Question Case Study: The Five-Second Rule
Definition: A widely held belief that dropped food is safe to eat if picked up within five seconds.
Research Objective: Determine the percentage of USYD students who believe in this rule, referred to as "fivers."
Parameter Estimation Process
Population Proportion (p):
Example Calculation: For 70,000 USYD students with 19,250 identified as "fivers":
$$p = \frac{19250}{70000} = 0.275 \text{ (or 27.5\%)}$$
Hypothetical Nature: Assume a true fixed proportion for the population.
Understanding Parameters and Statistics
Parameter Definition: A number characterizing a population aspect, such as proportions.
Statistic Definition: A number computed from the sample data.
Example Clarification:
Population parameter: Proportion of all first-year science students who are “fivers”.
Sample statistic: Proportion of “fivers” from the sampled students.
Historical Example: Dewey Defeats Truman
Summary of Events: In 1948 the Chicago Daily Tribune printed the headline "Dewey Defeats Truman" before the results were final, relying on incomplete data. Truman ultimately won, illustrating the pitfalls of premature conclusions based on sample data.
Simple Random Samples
Importance of Sampling Methodology
Sampling Bias: Occurs when the sample selection method produces a sample that misrepresents the population.
If bias exists, generalizing conclusions from samples to populations may be invalid.
Simple Random Sample Definition: Each population unit has an equal chance of selection, minimizing bias influence.
Representative Nature: Simple random samples generally yield a good reflection of the population characteristics.
Estimation Example for “Fivers”
Main Objective: Estimate the true proportion of “fivers” among first-year USYD students.
Equation of Proportion (p):
For proportional estimation:
$$p = \text{sample proportion} + \text{chance error} + \text{bias}$$
With a simple random sample, selection bias can be treated as negligible, leaving only chance error to account for.
Variability and Chance Error
Statistical Variability: Sample statistics fluctuate from sample to sample.
Chance Error Definition: Variability inherent in sample selection affecting estimations.
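Sample-to-sample fluctuation can be illustrated with a short simulation. The sketch below builds the hypothetical population from the earlier example (19,250 "fivers" among 70,000 students) and draws several simple random samples. The sample size of 200 and the fixed seed are assumptions made only for this illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

POPULATION_SIZE = 70_000
TRUE_FIVERS = 19_250   # gives the hypothetical p = 0.275
SAMPLE_SIZE = 200      # assumed sample size

# 1 marks a "fiver", 0 marks everyone else.
population = [1] * TRUE_FIVERS + [0] * (POPULATION_SIZE - TRUE_FIVERS)

sample_proportions = []
for _ in range(5):
    # random.sample draws without replacement, so every student is equally likely to be chosen.
    sample = random.sample(population, SAMPLE_SIZE)
    sample_proportions.append(sum(sample) / SAMPLE_SIZE)

print(sample_proportions)
```

Each run of the loop gives a slightly different sample proportion even though the population never changes. That difference is the chance error.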
Confidence Intervals
Definition and Purpose
Concept of Confidence Interval: A range of plausible values for an unknown parameter.
Example of a Confidence Interval: For our scenario, one might report:
(0.55, 0.71)
Importance of Confidence Levels
Balancing Act: A trade-off exists between confidence level and interval specificity.
Interval Width Influence: Governed by the chosen confidence level (higher confidence gives a wider interval) and by the sample size.
Computing a Confidence Interval
Formula: The confidence interval is derived from:
$$\text{sample proportion} \pm \text{chance error (CE)}$$
Example Computation: If the sample proportion is 0.55 and CE = 0.04, then the confidence interval is:
(0.55 - 0.04, 0.55 + 0.04) = (0.51, 0.59)
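The notes give CE = 0.04 without saying how it is obtained. One common approach, assumed here, is the normal approximation for a proportion at the 95% level: CE ≈ 1.96 × sqrt(p̂(1 − p̂)/n). The sample size of 600 below is a hypothetical value chosen because it reproduces a chance error of about 0.04; `proportion_ci` is an illustrative name.

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Approximate confidence interval for a proportion (normal approximation)."""
    chance_error = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - chance_error, p_hat + chance_error

# Hypothetical sample of 600 students with a sample proportion of 0.55.
low, high = proportion_ci(0.55, 600)
print(f"({low:.2f}, {high:.2f})")  # (0.51, 0.59)
```

Increasing `n` shrinks the chance error, while a larger `z` (a higher confidence level) widens the interval.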
Interpretation of Confidence Intervals
Sample-Based Interpretations: Drawing conclusions based on sample-derived intervals.
Repeated-Sampling Interpretation: If we collected 100 samples and built a 95% confidence interval from each, we would expect roughly 95 of those intervals to contain the true population parameter.
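The repeated-sampling idea can be checked by simulation. The sketch below assumes a true proportion of 0.275 (the hypothetical value from the earlier example), a sample size of 400, and the normal-approximation interval at the 95% level. It uses 1,000 repetitions rather than 100 so the observed fraction is more stable. All of these settings are assumptions for illustration.

```python
import math
import random

random.seed(2026)  # fixed seed so the sketch is repeatable

TRUE_P = 0.275      # hypothetical population proportion
N = 400             # assumed sample size
Z = 1.96            # multiplier for an approximate 95% interval
NUM_SAMPLES = 1000  # number of repeated samples

def interval_covers_truth() -> bool:
    """Draw one sample, build its interval, and report whether it contains TRUE_P."""
    hits = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = hits / N
    ce = Z * math.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - ce <= TRUE_P <= p_hat + ce

covered = sum(interval_covers_truth() for _ in range(NUM_SAMPLES))
coverage = covered / NUM_SAMPLES
print(f"{coverage:.1%} of intervals contained the true proportion")
```

The fraction printed should be close to 95%. The 95% describes how often the procedure succeeds over many samples, not the chance that any one computed interval is correct.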
Misinterpretation Risks
Common Misstatement: Claiming that a 95% confidence interval means a sample mean will fall within the interval 95% of the time. This misconstrues what the interval describes.
Correct Interpretation: The interval comes from a procedure that captures the population mean in about 95% of repeated samples. The statement is about the population parameter, not individual samples, and no single interval is guaranteed to contain it.
Introduction to Hypothesis Testing
Research Question Case Study: Detecting Parkinson’s Disease by Smell
Background Narrative: Joy Milne claimed to detect a “subtle musky odor” linked to her husband's Parkinson’s.
Importance of Research: If the disease can be detected by smell, this has implications for how it could be diagnosed.
Hypothesis Formation
Competing Claims:
Null Hypothesis (H0): Joy’s detection ability mirrors random guessing.
Alternative Hypothesis (HA): Joy’s ability surpasses random guessing.
Aim of Hypothesis Testing
Data Evaluation: Assessing whether the collected data would be surprising if the null hypothesis were true.
Null and Alternative Hypotheses
Population Parameter of Interest: The proportion of shirts that Joy classifies correctly.
Testing Parameter Values:
H0: p = 0.5 (random guessing)
HA: p > 0.5 (correct smell detection)
Hypothesis Testing Process
Focus on Evidence: The data are used to weigh the null hypothesis against the alternative.
Statistical Claim Testing: Examine whether the data provide enough evidence to reject H0. If they do not, we fail to reject H0 rather than accept it.
P-value Definition
Conceptual Definition: The probability of obtaining a sample statistic at least as extreme as the observed value, assuming the null hypothesis is true.
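For the hypotheses H0: p = 0.5 and HA: p > 0.5, a p-value can be computed exactly from the binomial distribution. The notes do not report the number of shirts or correct answers, so the counts below (11 correct out of 12) are hypothetical and used only to show the calculation; `binomial_p_value` is an illustrative name.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical result: 11 of 12 shirts classified correctly.
p_value = binomial_p_value(11, 12)
print(f"{p_value:.4f}")  # 0.0032
```

Under random guessing, a result this extreme would occur in roughly 3 of every 1,000 experiments, so such data would count as strong evidence against H0.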