Chapter 2: Generalization and Inference for a Single Quantitative Variable

Generalization: How Broadly Do the Results Apply?

Inference for a Single Quantitative Variable

Vocabulary
  • Summary Statistics: Statistics calculated from a sample that provide numeric information about the sample.
    • Mean / Average: Calculated by adding up all the values and then dividing by the total number of values. Represented by \bar{x} (sample mean) or \mu (population mean).
    • Median: The middle value when the data is ordered. Half of the data is above it, and half is below it. It is considered robust to outliers.
    • Outliers: Unusual data points that differ significantly from other observations.
    • Robust: A summary statistic is robust if its value does not change much when outliers are included. The median is robust, while the mean is not. The standard deviation is also affected by outliers.
    • Skew: A distribution is skewed if it is not symmetric.
      • Right Skewed: The tail of the distribution is on the right side. The mean is pulled toward the longer (right) tail.
      • Left Skewed: The tail of the distribution is on the left side. The mean is pulled toward the longer (left) tail.
Example: Mean and Median with Outliers
  • Original Data (Hypothetical): Suppose we observe a dataset (e.g., 5, 8, 10, 12, 15).
    • Mean = (5+8+10+12+15) / 5 = 50 / 5 = 10
    • Median = 10
  • Data with Outlier: Suppose instead of 15, we observed 30 (e.g., 5, 8, 10, 12, 30).
    • Mean = (5+8+10+12+30) / 5 = 65 / 5 = 13
    • Median = 10
  • Robustness: This example demonstrates that the median (10) remained unchanged when the outlier was introduced, making it robust. The mean, however, shifted from 10 to 13, indicating it is not robust to outliers.
  • Standard Deviation: The standard deviation will increase if an outlier is added to the data set.
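
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, median, stdev

original = [5, 8, 10, 12, 15]
with_outlier = [5, 8, 10, 12, 30]  # 15 replaced by the outlier 30

for label, data in (("original", original), ("with outlier", with_outlier)):
    # stdev() is the sample standard deviation (divides by n - 1)
    print(f"{label}: mean={mean(data)}, median={median(data)}, sd={stdev(data):.2f}")
```

The median stays at 10 in both cases, the mean moves from 10 to 13, and the sample standard deviation rises from about 3.81 to about 9.85.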
Symbols in Statistical Inference
  • Sample Mean: \bar{x} (observed statistic from a sample).
  • Sample Standard Deviation: s (calculated from a sample).
  • Population Mean: \mu (hypothesized value for the entire population).
  • Population Standard Deviation: \sigma (true standard deviation for the entire population, often unknown).
Comparison: Population vs. Sample (Quantitative)
  Feature   | Population | Sample
  Mean      | \mu        | \bar{x}
  Std. Dev. | \sigma     | s
Single Quantitative Variable: Hypothesis Testing Framework
  • Variable of Interest: Typically a quantitative measurement (e.g., "number of [context]").
  • Parameter: The long-run average of the variable in context (represented by \mu).
  • Hypotheses:
    • Words:
      • Null Hypothesis (H_0): The long-run average (context) is equal to a previously known or hypothesized number.
      • Alternative Hypothesis (H_A): The long-run average (context) is greater than, less than, or different from a previously known or hypothesized number.
    • Symbols:
      • H_0: \mu = \text{previously known number}
      • H_A: \mu > \text{previously known number}
      • H_A: \mu < \text{previously known number}
      • H_A: \mu \neq \text{previously known number}
Example: Average Number of Bikes Sold
  • Scenario: We are interested in whether the average number of bikes sold at a store per day is less than 20.
  • Variable of Interest: The number of bikes sold in a day (quantitative).
  • Parameter: The long-run average number of bikes sold at the store (\mu).
  • Hypotheses:
    • Words:
      • Null: The long-run average number of bikes sold at the store is equal to 20.
      • Alternative: The long-run average number of bikes sold at the store is less than 20.
    • Symbols:
      • H_0: \mu = 20
      • H_A: \mu < 20
Simulation vs. Theory-Based Approach
  • Categorical Data: For categorical data, simulations often involve physical models like spinners or coins to create a null distribution and evaluate sample deviation.
  • Quantitative Data: Coin flips and spinners produce categorical outcomes, so they cannot effectively model a quantitative variable. Instead, we use a theory-based method to obtain standardized statistics and p-values. Although applets may not be used directly for quantitative data, we still examine the distribution of the sample data.
Theory-Based Approach for Quantitative Variables
  • Validity Conditions (at least one of the following must hold):
    1. The quantitative variable should have a symmetric distribution.
    2. The sample size (n) is at least 20 and the sample distribution (the distribution of the observed data) is not strongly skewed.
  • Standard Error (SE): An approximation of the standard deviation of the sample mean when the population standard deviation (\sigma) is unknown. It is calculated using the sample standard deviation (s).
    • SE = \frac{s}{\sqrt{n}}
  • Standardized Statistic: Measures how many standard deviations (z) or standard errors (t) the observed statistic falls from the null value.
    • Background (Z-statistic): If the population standard deviation (\sigma) were known, the standardized statistic (z-score) would be: z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, where \mu_0 is the null hypothesized value.
    • T-statistic: Since \sigma is usually unknown, we use the sample standard deviation (s) to approximate the standard deviation of the sample mean, leading to the t-statistic:
      • t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  • T-statistic vs. Z-statistic: For categorical data, the Z (Standard Normal) distribution is often used as a reference. For quantitative variables when \sigma is unknown and estimated by s, a different reference distribution called the t-distribution is used.
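
The idea that the sample mean has a standard deviation of roughly \sigma / \sqrt{n} can be checked with a small computer simulation. The sketch below assumes a hypothetical normal population with mean 20 and standard deviation 2.1. These values, the seed, and the number of repetitions are illustrative choices and do not come from the notes.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption for illustration only)
POP_MEAN, POP_SD = 20.0, 2.1
n = 30        # size of each simulated sample
reps = 5000   # number of simulated samples

sample_means = []
for _ in range(reps):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    sample_means.append(sum(sample) / n)

simulated_sd = stdev(sample_means)   # spread of the simulated sample means
theoretical_sd = POP_SD / sqrt(n)    # sigma / sqrt(n)

print(f"SD of simulated sample means: {simulated_sd:.3f}")
print(f"sigma / sqrt(n):              {theoretical_sd:.3f}")
```

The two printed values should be close. In practice \sigma is unknown, which is why the standard error replaces it with s.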
General Standardized Statistic (SS) for Quantitative Variables
  • Formula: SS = \frac{\text{observed statistic} - \text{hypothesized value}}{\text{SD or SE}}
  • For One Quantitative Variable (using the Theory-Based Method):
    • SE = \frac{s}{\sqrt{n}}
    • SS = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, where \mu_0 is the null hypothesized value
  • Cut-off Values: The rule of thumb for cut-off values (e.g., "outside the 2's" to reject the null hypothesis) remains the same as in Chapter 1.
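
The formulas above can be sketched in Python for raw data. The function names and the sample values here are hypothetical, invented only to show the arithmetic. The data are roughly symmetric, which is what the first validity condition asks for.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, null_value):
    """Return (standard error, t) for a one-sample test of the mean."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)          # SE = s / sqrt(n)
    t = (mean(sample) - null_value) / se  # (observed - hypothesized) / SE
    return se, t

def outside_the_twos(standardized):
    """Rule of thumb: |statistic| > 2 counts as evidence against the null."""
    return abs(standardized) > 2

# Hypothetical measurements, invented purely for illustration
sample = [18, 21, 19, 22, 17, 20, 19, 18, 21, 20]
se, t = t_statistic(sample, null_value=20)
print(f"SE = {se:.3f}, t = {t:.2f}, outside the 2's: {outside_the_twos(t)}")
```

The rule-of-thumb check uses the absolute value, so it treats large positive and large negative statistics alike. For a one-sided alternative, the sign of t should also match the direction of H_A.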
Bicycle Example: Calculation and Conclusion
  • Hypotheses: H_0: \mu = 20 versus H_A: \mu < 20
  • Observed Data: Sample of 30 days (n=30). Average sales \bar{x} = 19.5. Sample standard deviation s = 2.1.
  • Calculate Standard Error (SE):
    • SE = \frac{s}{\sqrt{n}} = \frac{2.1}{\sqrt{30}} \approx \frac{2.1}{5.477} \approx 0.38
  • Calculate Standardized Statistic (t):
    • t = \frac{\text{observed} - \text{hypothesized}}{SE} = \frac{19.5 - 20}{0.38} = \frac{-0.5}{0.38} \approx -1.32
  • Conclusion: The standardized statistic is -1.32. Since -1.32 is inside the range of -2 to 2 (i.e., "inside the 2's"), we do not have sufficient evidence to reject the null hypothesis.
    • Therefore, we do not have evidence to conclude that the long-run average number of bikes sold at the store is less than 20.
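
The bicycle calculation can be reproduced from its summary statistics. This is a minimal sketch; the variable names are illustrative, and the numbers are the ones given in the example.

```python
from math import sqrt

n, x_bar, s = 30, 19.5, 2.1   # summary statistics from the example
mu_0 = 20                     # null hypothesized value

se = s / sqrt(n)
t_full = (x_bar - mu_0) / se               # no intermediate rounding
t_rounded = (x_bar - mu_0) / round(se, 2)  # SE rounded to 0.38 first, as in the hand calculation

print(f"SE = {se:.4f}")
print(f"t (full precision) = {t_full:.2f}")
print(f"t (SE rounded)     = {t_rounded:.2f}")
print("inside the 2's:", abs(t_full) <= 2)
```

Rounding the standard error to 0.38 before dividing gives -1.32, while carrying full precision gives about -1.30. Either way the statistic is inside the 2's, so the conclusion does not change.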

Errors and Significance

Section 2.3 Introduction

  • This section will cover the concepts of errors in hypothesis testing and statistical significance.