The Use and Misuse of Data

The University of Sydney - SCIE1001: The Use and Misuse of Data

Overview

  • Course Title: The Use and Misuse of Data

  • Date: March 3, 2026

Lecture Topics for Weeks 2 & 3

  1. Use and Misuse of Statistics:

    • Focus on false positives and negatives in testing.

    • Explore the counterintuitive nature of statistical outcomes.

    • Discuss inherent uncertainties in the scientific process.

  2. Understanding Basics:

    • A foundation in probability and statistics is essential for recognizing potential misuses.

  3. Interactive Learning:

    • Engage in interactive lectures and tutorials with peers.

Statistics and Its Role in Science

  • Definition: Statistics is the science of collecting, describing, and analyzing data.

  • Importance in Science:

    • Science relies on observations and experiments, leading to data collection.

    • Statistics transforms raw data into meaningful information to inform decisions.

Uncertainty and Variation

  • Nature of Scientific Inquiry:

    • Falsification: Testing hypotheses amidst uncertainty.

    • Exploratory Science: Aiming for precise measurements and accurate estimates.

  • Role of Statistics:

    • Facilitates intelligent judgment and quantification of decisions amidst uncertainty.

Basic Terminology and Descriptive Statistics

Probability vs. Statistics
  • Probability (Theoretical):

    • Begins with a known model.

    • Describes randomness and uncertainty.

    • Addresses: What is the chance of an event occurring?

  • Statistics (Practical):

    • Starts with observed data.

    • Interprets data variability.

    • Addresses: What can we infer from data?

Populations and Samples
  • Population Definition:

    • A well-defined collection of objects.

  • Examples of Populations:

    • Response to a specific drug by Type 2 diabetes patients.

    • All galaxies in the observable universe.

  • Sampling Method:

    • Due to constraints, data collection on entire populations is impractical.

    • A sample, or a subset, is selected for analysis.

  • Sampling Examples:

    • Patients with Type 2 diabetes at a specific hospital over one week.

    • Galaxies observed in a specific astronomical image.

Reasoning
  • Probability vs. Statistics Reasoning:

    • Probability uses deductive reasoning from population to sample.

    • Inferential statistics uses inductive reasoning from sample to population.

Inferential Statistics

  • Statistical Methods Applied:

    • Descriptive Statistics: Charts, graphs, and numerical summaries.

    • Parameter Estimation: Inferring population characteristics from samples.

    • Hypothesis Testing: Testing claims about population characteristics.

  • Importance: Understanding these methodologies is vital for assessing data use and potential misuse.

Data Visualization
  • Types of Visualization:

    • Bar charts, pie charts, line graphs, etc.

  • Construction and Interpretation:

    • Understanding graph construction is crucial since graphs can mislead.

  • Common Misleading Techniques:

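    • Truncated Axes: Cutting the axis range short exaggerates small differences between values.

    • Cherry-Picked Ranges: Selecting specific data ranges to manipulate outcomes.

    • 3D Effects: Perspective distorts the apparent size of bars or slices.

    • Omitted Baselines: Starting the data from a non-zero point, with no reference baseline shown, distorts the perceived magnitude.

    • Using Areas for 1D Data: Scaling both the width and height of a shape by a one-dimensional value exaggerates differences, because area grows with the square of the scaling.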
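The distortion from a truncated axis or an area encoding can be quantified with simple arithmetic. The sketch below is illustrative only: the values 50 and 52, the axis start of 48, and both helper function names are assumptions chosen for the example, not figures from the lecture.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)


def area_ratio(value_ratio):
    """Ratio of drawn areas when both width and height are scaled by value_ratio."""
    return value_ratio ** 2


# Hypothetical values: 52 is only 4% larger than 50.
honest = apparent_ratio(50, 52)         # axis starts at zero
truncated = apparent_ratio(50, 52, 48)  # axis starts at 48

print(f"bar height ratio with zero baseline: {honest:.2f}")
print(f"bar height ratio with axis from 48:  {truncated:.2f}")
print(f"area ratio when a value doubles:     {area_ratio(2):.0f}")
```

With the axis starting at 48, the second bar is drawn twice as tall as the first even though the underlying values differ by 4%.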

Example
  • Case Study: OpenAI's live stream about GPT-5

  • Subsequent Data Observations:

    • Accuracy percentages presented for comparison purposes.

Numerical Summaries

Measures of Central Tendency
  • Purpose: Summarizes datasets into a single representative value.

  • Common Measures:

    • Mean: Arithmetic average, sensitive to extreme values (outliers).

    • Median: Middle value, resistant to outliers.

  • Consideration: A single measure can be misleading, depending on the shape of the data's distribution.

Example: Reaction Time Question
  • Query: Which measure (mean or median) better represents the reaction time of students?

Key Takeaway
  • Asymmetry in Data: When data are skewed, the median reflects the typical value better than the mean. In symmetric data, the mean and median are close to each other.
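A small sketch of how one extreme value pulls the mean but not the median. The reaction times below are hypothetical values invented for illustration, not data from the course.

```python
from statistics import mean, median

# Hypothetical reaction times in milliseconds (illustrative only).
# One very slow response skews the data to the right.
reaction_times_ms = [250, 260, 270, 280, 290, 300, 310, 1200]

mean_ms = mean(reaction_times_ms)
median_ms = median(reaction_times_ms)

print(f"mean   = {mean_ms:.1f} ms")    # pulled upward by the 1200 ms outlier
print(f"median = {median_ms:.1f} ms")  # stays near the bulk of the data
```

Here the mean is 395 ms, larger than seven of the eight observations, while the median of 285 ms sits in the middle of the typical values.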

Measures of Spread
  • Definition: Measures the variability around a dataset's center.

  • Common Measures:

    • Range: Maximum - Minimum value.

    • Interquartile Range (IQR): Spread of the middle 50%.

    • Standard Deviation: Roughly the typical distance of data points from the mean.

  • Contextual Importance: Two datasets can show the same mean but grossly differ in variability.

Example: Air Pollution Statistics
  • Comparative Analysis:

    • City A Reporting: Consistent readings between 11 and 13 µg/m³ throughout the year.

    • City B Fluctuation: Readings can drop to 5 µg/m³ or rise to 30 µg/m³ during certain seasons, indicating high variability despite the same average figure.
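The comparison can be sketched numerically. The monthly readings below are hypothetical values constructed so that both cities share a mean of 12 µg/m³ while matching the ranges described above; the `summarize` helper is an illustrative name, and the population standard deviation (`pstdev`) is used as one reasonable choice.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical monthly readings in µg/m³ (illustrative values only).
city_a = [11, 12, 13, 12, 11, 13, 12, 12, 11, 13, 12, 12]
city_b = [5, 6, 7, 8, 9, 10, 10, 11, 12, 16, 20, 30]


def summarize(readings):
    """Return the center and three measures of spread for a list of readings."""
    q1, _, q3 = quantiles(readings, n=4)  # quartile cut points
    return {
        "mean": mean(readings),
        "range": max(readings) - min(readings),
        "iqr": q3 - q1,
        "sd": pstdev(readings),
    }


summary_a = summarize(city_a)
summary_b = summarize(city_b)

for name, summary in (("City A", summary_a), ("City B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both summaries report the same mean, but every measure of spread is several times larger for City B.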

Key Takeaways: Numerical Summaries

  • Data Distribution: Understand distributions to determine suitable measures of center and spread.

  • Reporting Standards: Must include center, spread, and shape in numerical summaries of quantitative data.

Introduction to Parameter Estimation

Case Study: Estimating Panzer Tank Production in WWII
  • Contextual Background: Allies captured various Panzer tanks and noted their unique serial numbers.

  • Analysis Objective: Determine whether these numbers could estimate total German tank production.

  • Outcome Quality: Post-war production figures showed that estimates derived from serial numbers were more accurate than traditional espionage data collection.
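One classical way to turn observed serial numbers into a production estimate is sketched below. The notes do not state which formula was used, so this is an assumption: it supposes serial numbers run consecutively from 1 to the unknown total N and that captured tanks are a random sample, in which case N can be estimated as m + m/k - 1, where m is the largest serial seen and k is the number of tanks captured. The serial numbers in the example are hypothetical.

```python
def estimate_total(serials):
    """Estimate the total count N from a sample of serial numbers.

    Assumes serials are numbered 1..N and sampled at random without replacement.
    """
    k = len(serials)  # number of captured tanks
    m = max(serials)  # largest serial number observed
    return m + m / k - 1


# Hypothetical captured serial numbers (illustrative only).
captured = [19, 40, 42, 60]
estimate = estimate_total(captured)

print(f"estimated total production: {estimate:.0f}")
```

The intuition is that the largest observed serial underestimates N, so it is pushed upward by the average gap between observed serials.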

Research Question Case Study: The Five-Second Rule
  • Definition: A widely held belief that dropped food is safe to eat if retrieved within five seconds.

  • Research Objective: Determine the percentage of USYD students who believe in this rule, referred to as "fivers."

Parameter Estimation Process
  • Population Proportion (p):

    • Example Calculation: For 70,000 USYD students with 19,250 identified as "fivers":
      p = \frac{19250}{70000} = 0.275 \text{ (or 27.5\%)}

    • Hypothetical Nature: Assume a true fixed proportion for the population.

Understanding Parameters and Statistics
  • Parameter Definition: A number characterizing a population aspect, such as proportions.

  • Statistic Definition: A number computed from the sample data.

  • Example Clarification:

    • Population parameter: Proportion of all first-year science students who are “fivers”.

    • Sample statistic: Proportion of “fivers” from the sampled students.

Historical Example: Dewey Defeats Truman
  • Summary of Events: In the 1948 US presidential election, a newspaper went to print with the headline "Dewey Defeats Truman" before the results were final, relying on incomplete data. Truman won, illustrating the pitfalls of drawing premature conclusions from sample data.

Simple Random Samples

Importance of Sampling Methodology
  • Sampling Bias: Occurs when the sample selection method systematically over- or under-represents parts of the population.

    • If bias exists, generalizations from the sample to the population may be invalid.

  • Simple Random Sample Definition: Each population unit has an equal chance of selection, minimizing bias influence.

  • Representative Nature: Simple random samples generally yield a good reflection of the population characteristics.

Estimation Example for “Fivers”
  • Main Objective: Estimate the true proportion of “fivers” among first-year USYD students.

  • Equation of Proportion (p):

    • The sample proportion differs from the true proportion p by two kinds of error:
      \text{sample proportion} = p + \text{bias} + \text{chance error}

    • With a simple random sample, selection bias is eliminated, leaving only chance error to account for.

Variability and Chance Error
  • Statistical Variability: Sample statistics fluctuate from sample to sample.

  • Chance Error Definition: The difference between a sample statistic and the population parameter that arises purely from the random selection of the sample.
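A minimal simulation sketch of sample-to-sample variability, reusing the hypothetical population from these notes (70,000 students, 19,250 of them "fivers"). The sample size of 200, the number of repetitions, and the random seed are arbitrary assumptions made for illustration.

```python
import random

POPULATION_SIZE = 70_000
FIVERS = 19_250

# 1 marks a "fiver", 0 marks everyone else.
population = [1] * FIVERS + [0] * (POPULATION_SIZE - FIVERS)
true_p = FIVERS / POPULATION_SIZE

rng = random.Random(1001)  # fixed seed so the run is reproducible


def sample_proportion(n):
    """Draw a simple random sample of size n and return its proportion of fivers."""
    sample = rng.sample(population, n)  # sampling without replacement
    return sum(sample) / n


proportions = [sample_proportion(200) for _ in range(5)]
for p_hat in proportions:
    print(f"sample proportion = {p_hat:.3f}, chance error = {p_hat - true_p:+.3f}")
```

Each run of `sample_proportion` gives a slightly different answer even though the population never changes; that difference is the chance error.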

Confidence Intervals

Definition and Purpose
  1. Concept of Confidence Interval: A range of plausible values for an unknown parameter.

  2. Example of a Confidence Interval: For our scenario, one might report:
    (0.55, 0.71)

Importance of Confidence Levels
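  1. Balancing Act: A trade-off exists between confidence level and precision. A higher confidence level requires a wider interval.

  2. Interval Width Influence: Governed by the chosen confidence level and by the sample size, since larger samples give narrower intervals.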

Computing a Confidence Interval
  • Formula: The confidence interval is derived from:
    \text{sample proportion} \pm \text{chance error (CE)}

  • Example Computation: If the sample proportion is \hat{p} = 0.55 and CE = 0.04, then the confidence interval is:
    (0.55 - 0.04, 0.55 + 0.04) = (0.51, 0.59)
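The computation above can be sketched in code. The notes do not say how CE is obtained, so the `chance_error` function below is an assumption: it uses the common normal approximation for a proportion, z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96 for roughly 95% confidence. The sample sizes of 600 and 2400 are hypothetical.

```python
from math import sqrt


def chance_error(p_hat, n, z=1.96):
    """Approximate chance error for a sample proportion (normal approximation)."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


def confidence_interval(p_hat, ce):
    """Return (lower, upper) as sample proportion minus/plus the chance error."""
    return (p_hat - ce, p_hat + ce)


# Reproduces the worked example: 0.55 ± 0.04.
low, high = confidence_interval(0.55, 0.04)
print(f"interval: ({low:.2f}, {high:.2f})")

# Under the assumed formula, a sample of about 600 gives a CE close to 0.04,
# and quadrupling the sample size roughly halves it.
ce_600 = chance_error(0.55, 600)
ce_2400 = chance_error(0.55, 2400)
print(f"CE with n=600:  {ce_600:.3f}")
print(f"CE with n=2400: {ce_2400:.3f}")
```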

Interpretation of Confidence Intervals
  • Sample-Based Interpretations: Drawing conclusions based on sample-derived intervals.

  • Repeated-Sampling Interpretation: If we collected 100 samples and built a 95% confidence interval from each, we would expect roughly 95 of those intervals to contain the true population parameter.
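The repeated-sampling idea can be checked with a short simulation. This is a sketch under stated assumptions: the true proportion of 0.275 is the hypothetical value from earlier in these notes, each sample has 400 independent draws (a close stand-in for a simple random sample from a large population), and the interval uses the normal approximation with z = 1.96, which the notes do not specify.

```python
import random
from math import sqrt

TRUE_P = 0.275      # hypothetical population proportion from the notes
N = 400             # assumed sample size
NUM_SAMPLES = 100   # number of repeated samples
Z = 1.96            # multiplier for roughly 95% confidence

rng = random.Random(2026)  # fixed seed so the run is reproducible


def interval_from_sample():
    """Draw one sample and return its approximate 95% confidence interval."""
    hits = sum(rng.random() < TRUE_P for _ in range(N))
    p_hat = hits / N
    ce = Z * sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - ce, p_hat + ce


intervals = [interval_from_sample() for _ in range(NUM_SAMPLES)]
covered = sum(low <= TRUE_P <= high for low, high in intervals)

print(f"{covered} of {NUM_SAMPLES} intervals contain the true proportion")
```

The count will vary with the seed, but it should land near 95. Any single interval either contains the true proportion or it does not; the 95% describes the method across many samples.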

Misinterpretation Risks
  • Common Misstatement: Claiming that a 95% confidence interval means a sample mean will fall within the interval. This misconstrues what a confidence interval describes.

  • Correct Interpretation: The interval is a statement about the population parameter, not about individual samples. It does not guarantee that the parameter lies inside any one interval; rather, the method produces intervals that capture the population parameter in about 95% of repeated samples.

Introduction to Hypothesis Testing

Research Question Case Study: Detecting Parkinson’s Disease by Smell
  • Background Narrative: Joy Milne claimed to detect a “subtle musky odor” linked to her husband's Parkinson’s.

  • Importance of Research: If such detection by smell is genuinely possible, it could have implications for how the disease is diagnosed.

Hypothesis Formation
  1. Competing Claims:

    • Null Hypothesis (H0): Joy’s detection ability mirrors random guessing.

    • Alternative Hypothesis (HA): Joy’s ability surpasses random guessing.

Aim of Hypothesis Testing
  • Data Evaluation: Assessing whether the collected data would be surprising if the null hypothesis were true.

Null and Alternative Hypotheses
  1. Population Parameter of Interest: The proportion p of shirts that Joy diagnoses correctly.

  2. Testing Parameter Values:

    • H0: p = 0.5 (random guessing)

    • HA: p > 0.5 (correct smell detection)

Hypothesis Testing Process
  1. Focus on Evidence: The data are used to judge how plausible the null hypothesis is relative to the alternative.

  2. Statistical Claim Testing: Examine whether the data provide enough evidence to reject H0. If they do not, we fail to reject H0, which is not the same as accepting it.

P-value Definition
  • Conceptual Definition: The probability of obtaining a sample statistic at least as extreme as the observed value, assuming the null hypothesis is true.
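For the hypotheses H0: p = 0.5 versus HA: p > 0.5, a p-value can be sketched as an upper-tail binomial probability. The counts used below (11 correct out of 12 shirts) are assumed purely for illustration, since these notes do not give the experimental results, and `binomial_p_value` is a hypothetical helper name.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """P(X >= successes) when X ~ Binomial(trials, p_null).

    This is the one-sided p-value for the alternative p > p_null.
    """
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumed illustrative result: 11 correct diagnoses out of 12 shirts.
p_value = binomial_p_value(11, 12)
print(f"p-value = {p_value:.4f}")
```

Under pure guessing, a result this good or better would occur in 13 of the 4096 equally likely outcomes, about 0.3% of the time, so such data would be strong evidence against H0.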