The Use and Misuse of Data

The University of Sydney - SCIE1001: The Use and Misuse of Data

Overview

Course Title: The Use and Misuse of Data
Date: March 3, 2026

Lecture Topics for Weeks 2 & 3

Use and Misuse of Statistics:
- Focus on false positives and negatives in testing.
- Explore the counterintuitive nature of statistical outcomes.
- Discuss inherent uncertainties in the scientific process.
Understanding Basics:
- A foundation in probability and statistics is essential for recognizing potential misuses.
Interactive Learning:
- Engage in interactive lectures and tutorials with peers.

Statistics and Its Role in Science

Definition: Statistics is the science of collecting, describing, and analyzing data.
Importance in Science:
- Science relies on observations and experiments, leading to data collection.
- Statistics transforms raw data into meaningful information to inform decisions.

Uncertainty and Variation

Nature of Scientific Inquiry:
- Falsification: Testing hypotheses amidst uncertainty.
- Exploratory Science: Aiming for precise measurements and accurate estimates.
Role of Statistics:
- Facilitates intelligent judgment and quantification of decisions amidst uncertainty.

Basic Terminology and Descriptive Statistics

Probability vs. Statistics

Probability (Theoretical):
- Begins with a known model.
- Describes randomness and uncertainty.
- Addresses: What is the chance of an event occurring?
Statistics (Practical):
- Starts with observed data.
- Interprets data variability.
- Addresses: What can we infer from data?

Populations and Samples

Population Definition:
- A well-defined collection of objects.
Examples of Populations:
- Response to a specific drug by Type 2 diabetes patients.
- All galaxies in the observable universe.
Sampling Method:
- Due to constraints, data collection on entire populations is impractical.
- A sample, or a subset, is selected for analysis.
Sampling Examples:
- Patients with Type 2 diabetes at a specific hospital over one week.
- Galaxies observed in a specific astronomical image.

Reasoning

Probability vs. Statistics Reasoning:
- Probability uses deductive reasoning from population to sample.
- Inferential statistics uses inductive reasoning from sample to population.

Inferential Statistics

Statistical Methods Applied:
- Descriptive Statistics: Charts, graphs, and numerical summaries.
- Parameter Estimation: Inferring population characteristics from samples.
- Hypothesis Testing: Testing claims about population characteristics.
Importance: Understanding these methodologies is vital for assessing data use and potential misuse.

Data Visualization

Types of Visualization:
- Bar charts, pie charts, line graphs, etc.
Construct and Interpretation:
- Understanding graph construction is crucial since graphs can mislead.
Common Misleading Techniques:
- Truncated Axes: Cutting off part of an axis exaggerates small differences between values.
- Cherry-Picked Ranges: Showing only a selected range of the data to support a preferred conclusion.
- 3D Effects: Perspective distorts the apparent sizes of bars and slices.
- Omitted Baselines: Starting the scale at a non-zero point distorts the perceived size of the values.
- Using Areas for 1D Data: Scaling a shape's area to show a single quantity exaggerates differences.
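
To see how much a truncated axis can exaggerate a difference, the sketch below compares the true ratio of two values with the ratio of their bar heights as drawn. The two values and the axis start are made-up numbers for illustration, and `drawn_ratio` is a hypothetical helper name, not something from the lecture.

```python
def drawn_ratio(a, b, axis_start=0.0):
    """Ratio of the bar heights for b and a when the axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical values: two accuracy scores of 50% and 52%.
low, high = 50.0, 52.0

true_ratio = drawn_ratio(low, high)              # axis starts at zero
truncated_ratio = drawn_ratio(low, high, 49.0)   # axis starts at 49

print(f"Ratio of the values themselves: {true_ratio:.2f}")
print(f"Ratio of the bars as drawn:     {truncated_ratio:.2f}")
```

With the axis starting at 49, a 4% difference is drawn as a bar three times as tall.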

Example

Case Study: OpenAI's live stream about GPT-5
Subsequent Data Observations:
- Accuracy percentages presented for comparison purposes.

Numerical Summaries

Measures of Central Tendency

Purpose: Summarizes datasets into a single representative value.
Common Measures:
- Mean: Arithmetic average, sensitive to extreme values (outliers).
- Median: Middle value, resistant to outliers.
Consideration: A sole measure can be misleading depending on the data's distribution characteristics.

Example: Reaction Time Question

Query: Which measure (mean or median) better represents the reaction time of students?

Key Takeaway

Asymmetry in Data: When data are skewed, the median reflects the typical value better than the mean. When data are symmetric, the mean and median are close to each other.
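
A minimal sketch of the effect, using Python's standard `statistics` module. The reaction times are invented for illustration (they are not the class data); one slow response creates a long right tail.

```python
from statistics import mean, median

# Hypothetical reaction times in seconds; the last value is an outlier.
reaction_times = [0.25, 0.27, 0.28, 0.30, 0.31, 0.33, 2.40]

avg = mean(reaction_times)
mid = median(reaction_times)

print(f"Mean:   {avg:.3f} s")
print(f"Median: {mid:.3f} s")
```

The single outlier pulls the mean well above every other observation, while the median stays with the bulk of the data.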

Measures of Spread

Definition: Measures the variability around a dataset's center.
Common Measures:
- Range: Maximum - Minimum value.
- Interquartile Range (IQR): Spread of the middle 50%.
- Standard Deviation: Roughly the typical distance of data points from the mean (the square root of the average squared deviation).
Contextual Importance: Two datasets can show the same mean but grossly differ in variability.

Example: Air Pollution Statistics

Comparative Analysis:
City A Reporting: Consistent readings between 11-13 µg/m³ annually.
City B Fluctuation: Readings can drop to 5 µg/m³ or rise to 30 µg/m³ during certain seasons, indicating variability despite the same average figure.
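
The comparison can be sketched numerically. The monthly readings below are hypothetical, chosen only so that both cities average 12 µg/m³ and stay within the ranges described above; `summarize` is an illustrative helper name.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical monthly readings (µg/m³); both lists sum to 144, so both average 12.
city_a = [11, 12, 13, 12, 11, 13, 12, 12, 11, 13, 12, 12]
city_b = [5, 6, 8, 10, 12, 30, 28, 14, 9, 7, 8, 7]

def summarize(readings):
    """Return the center and three measures of spread for a list of readings."""
    q1, _, q3 = quantiles(readings, n=4)  # quartile cut points
    return {
        "mean": mean(readings),
        "range": max(readings) - min(readings),
        "iqr": q3 - q1,
        "sd": pstdev(readings),  # population standard deviation
    }

summary_a = summarize(city_a)
summary_b = summarize(city_b)

for name, s in (("City A", summary_a), ("City B", summary_b)):
    print(f"{name}: mean={s['mean']}, range={s['range']}, "
          f"IQR={s['iqr']:.2f}, SD={s['sd']:.2f}")
```

The means match, but every measure of spread is larger for City B, which is why reporting the center alone hides the difference.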

Key Takeaways: Numerical Summaries

Data Distribution: Understand distributions to determine suitable measures of center and spread.
Reporting Standards: Must include center, spread, and shape in numerical summaries of quantitative data.

Introduction to Parameter Estimation

Case Study: Estimating Panzer Tank Production in WWII

Contextual Background: Allies captured various Panzer tanks and noted their unique serial numbers.
Analysis Objective: Determine whether these numbers could estimate total German tank production.
Outcome Quality: Post-war production figures showed that estimates derived from serial numbers were more accurate than traditional espionage data collection.
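
The notes do not say which estimator was used. As a minimal sketch, one commonly cited estimator for this kind of problem is m + m/k - 1, where m is the largest serial number observed and k is the number of tanks captured. It assumes serial numbers run 1, 2, ..., N and that the captured tanks behave like a simple random sample drawn without replacement. The serial numbers below are invented, and `estimate_total` is a hypothetical name.

```python
def estimate_total(serials):
    """Estimate the population size N from observed serial numbers.

    Assumes serials are a random sample, without replacement, from 1..N.
    """
    k = len(serials)   # number of tanks captured
    m = max(serials)   # largest serial number seen
    return m + m / k - 1

captured = [19, 40, 42, 60]  # hypothetical serial numbers
estimate = estimate_total(captured)
print(f"Estimated total production: {estimate:.0f}")  # 60 + 60/4 - 1 = 74
```

The idea is that the largest observed serial number underestimates the total, so it is pushed upward by the average gap between observed serial numbers.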

Research Question Case Study: The Five-Second Rule

Definition: A widely held belief that dropped food is safe to eat if it is picked up within five seconds.
Research Objective: Determine the percentage of USYD students who believe in this rule, referred to as "fivers."

Parameter Estimation Process

Population Proportion (p):
- Example Calculation: For 70,000 USYD students with 19,250 identified as "fivers":
  p = \frac{19250}{70000} = 0.275 \text{ (or 27.5\%)}
- Hypothetical Nature: Assume a true fixed proportion for the population.

Understanding Parameters and Statistics

Parameter Definition: A number characterizing an aspect of a population, such as a proportion.
Statistic Definition: A number computed from sample data.
Example Clarification:
- Population parameter: Proportion of all first-year science students who are “fivers”.
- Sample statistic: Proportion of “fivers” from the sampled students.

Historical Example: Dewey Defeats Truman

Summary of Events: Pre-election publication claimed Dewey's victory based on incomplete data. Truman ultimately won, illustrating the pitfalls of premature conclusions based on sample data.

Simple Random Samples

Importance of Sampling Methodology

Sampling Bias: Occurs when sample selection method skews the population representation.
- If bias exists, generalizing conclusions from the sample to the population may be invalid.
Simple Random Sample Definition: Each population unit has an equal chance of selection, minimizing bias influence.
Representative Nature: Simple random samples generally yield a good reflection of the population characteristics.

Estimation Example for “Fivers”

Main Objective: Estimate the true proportion of “fivers” among first-year USYD students.
Equation of Proportion (p):
- For proportional estimation:
  p = \text{sample proportion} + \text{margin of error} + \text{bias}
- If utilizing a simple random sample, bias can be negated, leaving only chance error consideration.

Variability and Chance Error

Statistical Variability: Sample statistics fluctuate from sample to sample.
Chance Error Definition: Variability inherent in sample selection affecting estimations.
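
A small simulation can make chance error visible. The sketch below reuses the hypothetical population from the earlier example (70,000 students, 19,250 "fivers", so the true proportion is 0.275) and draws several simple random samples. The sample size of 200 and the seed are arbitrary choices for illustration.

```python
import random

POPULATION_SIZE = 70_000
FIVERS = 19_250  # hypothetical count from the example; true proportion is 0.275

def sample_proportion(n, rng):
    """Proportion of fivers in a simple random sample of n students."""
    # Label students 0..FIVERS-1 as fivers, then sample indices without replacement.
    chosen = rng.sample(range(POPULATION_SIZE), n)
    return sum(1 for i in chosen if i < FIVERS) / n

rng = random.Random(1)  # fixed seed so reruns give the same samples
estimates = [sample_proportion(200, rng) for _ in range(5)]

for i, est in enumerate(estimates, start=1):
    print(f"Sample {i}: proportion of fivers = {est:.3f}")
```

Each sample gives a slightly different estimate even though the population never changes. That sample-to-sample fluctuation is the chance error.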

Confidence Intervals

Definition and Purpose

Concept of Confidence Interval: A range of plausible values for an unknown parameter.
Example of a Confidence Interval: For our scenario, one might report:
(0.55, 0.71)

Importance of Confidence Levels

Balancing Act: A trade-off exists between confidence level and interval specificity.
Interval Width Influence: Governed by the chosen confidence level and by the sample size (larger samples give narrower intervals).

Computing a Confidence Interval

Formula: The confidence interval is derived from:
\text{sample proportion} \pm \text{chance error (CE)}
Example Computation: If the sample proportion is 0.55 and CE = 0.04, the confidence interval is:
(0.55 - 0.04, 0.55 + 0.04) = (0.51, 0.59)
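
The notes give CE = 0.04 without saying how it is obtained. A minimal sketch, assuming the usual normal-approximation formula for a 95% interval, CE = 1.96 × sqrt(p̂(1 - p̂)/n), where p̂ is the sample proportion and n the sample size. The sample size n = 600 is an assumption chosen because it gives a CE close to 0.04; `proportion_ci` is a hypothetical name.

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Approximate confidence interval for a proportion.

    z = 1.96 corresponds to a 95% confidence level under the normal approximation.
    """
    ce = z * sqrt(p_hat * (1 - p_hat) / n)  # chance error
    return p_hat - ce, p_hat + ce

lower, upper = proportion_ci(0.55, 600)  # n = 600 is an assumed sample size
print(f"({lower:.2f}, {upper:.2f})")     # (0.51, 0.59)
```

Raising the confidence level means a larger z and a wider interval; increasing n shrinks the chance error.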

Interpretation of Confidence Intervals

Sample-Based Interpretations: Drawing conclusions based on sample-derived intervals.
Repeated Sampling: If we collected 100 samples and built a 95% confidence interval from each one, we would expect roughly 95 of those intervals to contain the true population parameter.
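
The repeated-sampling idea can be checked by simulation. This sketch assumes a true proportion of 0.275 (the hypothetical "fivers" figure), a sample size of 400, and the normal-approximation interval with z = 1.96. None of these choices come from the lecture, and sampling is done with replacement as an approximation to a simple random sample from a large population.

```python
import random
from math import sqrt

TRUE_P = 0.275   # hypothetical population proportion
N = 400          # assumed sample size
Z = 1.96         # multiplier for a 95% confidence level

def interval_covers(rng):
    """Draw one sample, build a 95% interval, and report whether it contains TRUE_P."""
    hits = sum(rng.random() < TRUE_P for _ in range(N))
    p_hat = hits / N
    ce = Z * sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - ce <= TRUE_P <= p_hat + ce

rng = random.Random(2026)  # fixed seed so reruns give the same result
covered = sum(interval_covers(rng) for _ in range(100))
print(f"{covered} of 100 intervals contain the true proportion")
```

The count will not be exactly 95 every time. The 95% describes the long-run behavior of the procedure, not any single interval.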

Misinterpretation Risks

Common Misstatement: Claiming that a 95% confidence interval tells us where a sample mean will fall. This misconstrues what the interval describes.
Correct Interpretation: The interval is a statement about the population parameter (here, the mean), not about individual samples. It does not guarantee that the parameter lies inside any one interval; rather, about 95% of intervals built this way from repeated samples would contain it.

Introduction to Hypothesis Testing

Research Question Case Study: Detecting Parkinson’s Disease by Smell

Background Narrative: Joy Milne claimed to detect a “subtle musky odor” linked to her husband's Parkinson’s.
Importance of Research: If Parkinson's can be detected through olfactory cues, this has implications for how the disease is diagnosed.

Hypothesis Formation

Competing Claims:
- Null Hypothesis (H0): Joy’s detection ability mirrors random guessing.
- Alternative Hypothesis (HA): Joy’s ability surpasses random guessing.

Aim of Hypothesis Testing

Data Evaluation: Assessing whether the collected data would be surprising if the null hypothesis were true.

Null and Alternative Hypotheses

Population Parameter of Interest: The proportion of shirts that Joy classifies correctly.
Testing Parameter Values:
- H0: p = 0.5 (random guessing)
- HA: p > 0.5 (correct smell detection)

Hypothesis Testing Process

Focus on Evidence: The data are used to weigh how plausible the null hypothesis is compared with the alternative.
Statistical Claim Testing: Examine whether the data provide enough evidence to reject H0. If they do not, we fail to reject H0, which is not the same as accepting it.

P-value Definition

Conceptual Definition: The probability of obtaining a sample statistic at least as extreme as the observed value, assuming the null hypothesis is true.
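
A minimal sketch of this definition for the hypotheses above (H0: p = 0.5, HA: p > 0.5). Under H0 the number of correct answers follows a binomial distribution, so the p-value is the upper-tail probability of getting at least as many correct answers as were observed. The counts used here (10 correct out of 12 shirts) are hypothetical and are not the study's actual results; `p_value_at_least` is an illustrative name.

```python
from math import comb

def p_value_at_least(k, n, p0=0.5):
    """P(X >= k) when X ~ Binomial(n, p0): the one-sided p-value for HA: p > p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical data: 10 correct out of 12 shirts.
p_val = p_value_at_least(10, 12)
print(f"p-value = {p_val:.4f}")  # (66 + 12 + 1) / 4096 ≈ 0.0193
```

A small p-value means results this extreme would be rare under random guessing, which counts as evidence against H0.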