Comprehensive Study Notes: Probability and the Sampling Distribution of the Sample Mean
Variability and Randomness
- Statistics is fundamentally the study of variability.
- Uncertainty is managed by investigating random behavior.
- Random behavior is characterized by two distinct patterns relative to time:
- Short-run: Outcomes are unpredictable and appear haphazard.
- Long-run: Outcomes exhibit a regular, predictable distribution.
- A phenomenon is defined as random if individual outcomes are uncertain, but a stable distribution of outcomes emerges over a large number of repetitions.
- A random experiment is defined as any process or activity involving uncertainty that results in two or more possible outcomes.
- In everyday language, "randomness" is often equated with chaos or haphazard events because we often do not observe the phenomenon enough times to perceive the emerging long-run pattern.
Understanding Probability
- The foundation of probability lies in the fact that regular patterns emerge only after many repeated trials (e.g., rolling dice, tossing coins, or lottery outcomes).
- The Coin Toss Experiment:
- Assuming a fair coin, the likelihood of observing a Head is equal to that of observing a Tail (50% chance each).
- In a sequence like , the observed proportions of Heads are .
- Proportions vary considerably in the early stages, but in the long run the proportion of Heads stays very close to $0.5$.
- Empirical Threshold: If a fair coin is tossed times, it is almost certain that one will observe between and Heads.
- Definition of Probability: The probability of any outcome of a random phenomenon is the proportion of times that specific outcome would occur in an infinitely long series of trials.
- Probability Theory: This branch of mathematics describes random behavior using mathematical models. Because we cannot perform an experiment infinite times, we use models to describe what would happen theoretically.
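The long-run behaviour described above can be illustrated with a small simulation. This is a minimal Python sketch, not part of the original notes: the function name `running_proportions`, the seed, and the number of tosses are all illustrative choices.

```python
import random


def running_proportions(num_tosses, seed=0):
    """Simulate fair coin tosses; return the proportion of Heads after each toss."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    proportions = []
    for toss in range(1, num_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1 (a Head)
        proportions.append(heads / toss)
    return proportions


props = running_proportions(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} tosses: proportion of Heads = {props[n - 1]:.4f}")
```

The early proportions typically wander noticeably, while the later ones cluster near 0.5, which is the pattern the definition of probability relies on.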
Proportions vs. Probability
- Proportion: A value that is known or has been observed. It is spoken of in the present tense.
- Probability: A theoretical value representing the proportion after an infinitely long series of trials. It relates to future events.
Probability Models and Sample Spaces
- A probability model consists of two components:
- A list of all possible outcomes.
- A probability assigned to each outcome.
- The Sample Space ($S$): The set of all possible outcomes for a random phenomenon.
- Simple Examples:
- Tossing a coin once: $S = \{H, T\}$.
- Tossing a coin three times: $S = \{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}$.
- Complex Examples:
- Lotto 6/49: Choosing six numbers from 49 leads to nearly 14 million possible combinations ($\binom{49}{6} = 13{,}983{,}816$).
- Sports (Valour FC soccer): Considering the next two games ($W$ = Win, $T$ = Tie, $L$ = Loss), the order matters: $S = \{WW, WT, WL, TW, TT, TL, LW, LT, LL\}$. Note that winning first then losing ($WL$) is distinct from losing first then winning ($LW$).
- Rolling Two Dice: The sample space contains 36 outcomes ($6 \times 6$ ordered pairs). If the variable of interest is the sum of the two dice, then $S = \{2, 3, \dots, 12\}$.
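The sample spaces in this section are small enough to enumerate directly. The sketch below is a Python illustration using only the standard library; the variable names and the letters used for outcomes (H/T for coin faces, W/T/L for game results) are labels chosen here.

```python
from itertools import product
from math import comb

# Three coin tosses: every ordered sequence of H and T
three_tosses = ["".join(t) for t in product("HT", repeat=3)]

# Two soccer games: every ordered pair of Win / Tie / Loss
two_games = ["".join(g) for g in product("WTL", repeat=2)]

# Two dice: every ordered pair (first die, second die)
two_dice = list(product(range(1, 7), repeat=2))

# If only the sum matters, the sample space shrinks to the distinct sums
dice_sums = sorted({a + b for a, b in two_dice})

# Lotto 6/49: unordered choices of 6 numbers out of 49
lotto_combinations = comb(49, 6)

print(len(three_tosses), three_tosses)
print(len(two_games), two_games)
print(len(two_dice), dice_sums)
print(lotto_combinations)
```

Note that the same physical experiment (two dice) has different sample spaces depending on which variable is recorded.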
Rules and Probability Distributions
- For a sample space $S$, let the probability of individual outcome $i$ be denoted as $p_i$.
- Fundamental Conditions:
- Each individual probability must be between 0 and 1: $0 \le p_i \le 1$ for all $i$.
- The sum of all probabilities must equal exactly 1: $\sum_i p_i = 1$.
- Events:
- An event is a subset of outcomes from the sample space.
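- Example (Rolling two dice): If event $A$ is "At Least One 4", then $A = \{(1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (4,1), (4,2), (4,3), (4,5), (4,6)\}$, which contains 11 outcomes. If event $B$ is "Sum is 9", then $B = \{(3,6), (4,5), (5,4), (6,3)\}$.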
- Probability of Events: Calculated by adding the probabilities of all individual outcomes contained within that event.
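- Example: With fair dice, each of the 36 outcomes has probability $1/36$, so the event "Sum is 9" (four outcomes) has probability $4/36 = 1/9$.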
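- Complements ($A^c$): The event containing all outcomes in the sample space not found in $A$.
- Rule: $P(A^c) = 1 - P(A)$.
- Examples: for "At Least One 4" (11 of 36 outcomes), the complement has probability $1 - 11/36 = 25/36$; for "Sum is 9" (4 of 36 outcomes), the complement has probability $1 - 4/36 = 32/36$.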
- Probability Distribution: A table or rule that provides all possible values of a variable and the specific probability for each value.
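The event and complement rules can be checked by brute force for the two-dice example. This is a minimal Python sketch that assumes fair dice, so each of the 36 ordered outcomes gets probability 1/36; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(sample_space))  # assumption: fair dice, equally likely outcomes


def prob(event):
    """Add up the probabilities of the outcomes for which event(outcome) is True."""
    return sum(p_outcome for outcome in sample_space if event(outcome))


p_a = prob(lambda o: 4 in o)        # "At Least One 4"
p_b = prob(lambda o: sum(o) == 9)   # "Sum is 9"
p_not_a = 1 - p_a                   # complement rule

print(p_a, p_b, p_not_a)
```

Using `Fraction` keeps the results exact, so they can be compared directly with hand calculations.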
Random Variables and Calculations
- A random variable ($X$) provides a numerical description of the outcome of a statistical experiment.
- Case Study: NHL Atlantic Division Winner:
- Teams: Montreal, Ottawa, Toronto, and others.
- Probabilities: Montreal (, Ottawa (), Toronto (), Team 4 (), Team 5 (), Team 6 (), Team 7 (), Team 8 ().
- Calculation for : .
- Probability a Canadian team wins: .
- Case Study: NHL Pacific Division (Complementary Logic):
- Probabilities provided: Calgary (), Edmonton (), Vancouver (). Others are incomplete.
- Probability an American team wins: .
Continuous Random Variables
- While discrete random variables (like dice sums or coin counts) take only certain values, continuous random variables can take any value in an interval.
- Sample Space Example: Time for a light bulb to burn out, $S = \{t : t \ge 0\}$.
- Probability Assignment: Because there are infinitely many outcomes, probabilities are assigned to intervals of values rather than individual points.
- Density Curves: The area under a density curve represents the probability of observing an outcome in that interval.
- Normal Probability Distribution: Denoted as $N(\mu, \sigma)$, with mean $\mu$ and standard deviation $\sigma$, where probabilities correspond to the area under the Normal curve.
- Example (Pulse Rates): Adult females have pulse rates with and .
- .
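To make the "probability is area over an interval" idea concrete, here is a small Python sketch using the standard Normal distribution. The standard Normal is chosen only for illustration (it is not the pulse-rate distribution above), and the helper `prob_between` is a hypothetical name.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard Normal, used purely as an illustration


def prob_between(dist, lo, hi):
    """Area under the density curve between lo and hi."""
    return dist.cdf(hi) - dist.cdf(lo)


p_within_1 = prob_between(z, -1, 1)   # area within one standard deviation of the mean
p_point = prob_between(z, 0.5, 0.5)   # a single point has zero width, so zero area

print(round(p_within_1, 4), p_point)
```

The second result shows why continuous random variables assign probability to intervals rather than to individual values.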
The Sampling Distribution of the Sample Mean ($\bar{X}$)
- Instead of observing single individuals, researchers often take a Random Sample of size $n$ and calculate the sample mean ($\bar{X}$).
- The Sampling Distribution of a Statistic is the distribution of values taken by that statistic in all possible samples of the same size from the same population.
- Conceptual Experiment:
- Repeatedly take samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$.
- Calculate $\bar{x}$ for each sample.
- Plot the histogram of the $\bar{x}$ values.
- Key Characteristics of the Distribution of $\bar{X}$:
- The mean of the sampling distribution is equal to the population mean ($\mu_{\bar{X}} = \mu$).
- The standard deviation (Standard Error) of the sampling distribution is lower than the population standard deviation whenever $n > 1$: $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
- Averages are consistently less variable than individual observations.
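The formulas for the mean and standard deviation of the sample mean can be derived in a few lines. This is a sketch that assumes the observations $X_1, \dots, X_n$ are independent draws from a population with mean $\mu$ and standard deviation $\sigma$; the subscripted symbols are notation introduced here.

$$
\begin{aligned}
E(\bar{X}) &= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}(n\mu) = \mu, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \\
\sigma_{\bar{X}} &= \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
$$

The second line is where independence is used: the variance of a sum equals the sum of the variances only when the observations are uncorrelated.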
The Central Limit Theorem (CLT)
- The Theorem: When taking a Simple Random Sample (SRS) of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$, the sampling distribution of $\bar{X}$ is approximately Normal if $n$ is sufficiently large.
- Notation: $\bar{X} \sim N\left(\mu, \sigma/\sqrt{n}\right)$, approximately.
- Significance: The original population distribution does not need to be symmetric or Normal. As $n$ increases, the skewness of the original distribution is overcome.
- Sample Size Guidelines:
- For symmetric distributions, $\bar{X}$ becomes approximately Normal at very low $n$.
- For strongly skewed distributions, a higher $n$ is required.
- Course Rule: It is safe to apply the CLT when .
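The way skewness fades as the sample size grows can be seen in a simulation. The sketch below is a Python illustration under stated assumptions: the population is exponential with rate 1 (mean 1, standard deviation 1, strongly right-skewed), and the sample sizes, repetition count, seed, and helper names are all arbitrary choices made here.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return fmean(((v - m) / s) ** 3 for v in values)


def sample_means(n, reps, rng):
    """Draw `reps` samples of size n from an exponential(1) population; return their means."""
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]


rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for n in (1, 5, 40):
    means = sample_means(n, reps=10_000, rng=rng)
    results[n] = (fmean(means), pstdev(means), skewness(means))
    print(f"n={n:>2}: mean={results[n][0]:.3f}  sd={results[n][1]:.3f}  "
          f"(sigma/sqrt(n)={1 / n ** 0.5:.3f})  skewness={results[n][2]:.2f}")
```

In each row the simulated mean should sit near 1 and the simulated standard deviation near $\sigma/\sqrt{n}$, while the skewness shrinks toward 0 as $n$ increases.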
Practical Examples and R Code
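Male Heights Case Study: Population $X \sim N(178, 6)$.
- Individual Probability: $P(X > 180) = P\left(Z > \frac{180 - 178}{6}\right) = P(Z > 0.33) \approx 0.37$.
- Sample Probability ($n = 10$): $P(\bar{X} > 180) = P\left(Z > \frac{180 - 178}{6/\sqrt{10}}\right) = P(Z > 1.05) \approx 0.146$.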
- R Code for Sample Mean:
pnorm(180, 178, 6/sqrt(10), lower.tail = FALSE) yields 0.1459203.
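For readers who want to cross-check the R result in another tool, here is a Python sketch using `statistics.NormalDist` from the standard library. The numbers 178, 6, 10, and 180 are taken from the R call above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 178, 6, 10  # values read off the R call above

individual = NormalDist(mu, sigma)             # distribution of one height X
sample_mean = NormalDist(mu, sigma / sqrt(n))  # distribution of the mean of n heights

p_individual = 1 - individual.cdf(180)    # upper-tail area for a single individual
p_sample_mean = 1 - sample_mean.cdf(180)  # upper-tail area for the sample mean

print(f"P(X > 180)    = {p_individual:.4f}")
print(f"P(Xbar > 180) = {p_sample_mean:.7f}")  # should agree with the R output above
```

The sample-mean probability is much smaller than the individual one because the standard deviation has been divided by $\sqrt{10}$.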
Light Bulb Lifetimes Case Study:
- Population is right-skewed with and .
- For bulbs, calculate the probability the mean lifetime exceeds .
- Since , use CLT: .
Metal Bolts Case Study (, ):
- Probability that an SRS of has a mean diameter between and .
- .
- Constraint: If we only select bolts, and the underlying distribution of is unknown, we cannot calculate the probability because the sample size is too small for the CLT.
Summary Classification for $\bar{X}$
- Scenario 1: Population is Normal:
- $\bar{X} \sim N\left(\mu, \sigma/\sqrt{n}\right)$.
- Result: $\bar{X}$ is exactly Normal for any sample size $n$.
- Scenario 2: Population is Not Normal / Unknown:
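- If $n$ is large enough to satisfy the course rule for the CLT: $\bar{X}$ is approximately Normal by the CLT.
- If $n$ is too small for the CLT: the distribution of $\bar{X}$ cannot be assumed Normal, so Normal probability techniques cannot be applied.
- Universal Truth: For any population distribution with mean $\mu$ and standard deviation $\sigma$, the mean of the sample mean is $\mu$ and its standard deviation is $\sigma/\sqrt{n}$.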
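The decision logic in this summary can be captured in one helper. This is a hedged Python sketch, not a method given in the notes: the function name `prob_mean_between` and its parameters are hypothetical, and `clt_min_n` is a caller-supplied threshold that should be set to whatever rule your course uses (the value 30 below is only an illustrative assumption).

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_between(lo, hi, mu, sigma, n, *, population_normal, clt_min_n):
    """Return P(lo < Xbar < hi) when a Normal model for Xbar is justified.

    population_normal: True if the population itself is Normal (Scenario 1).
    clt_min_n: smallest n at which the CLT is treated as safe (caller's assumption).
    """
    if not population_normal and n < clt_min_n:
        # Scenario 2 with a small sample: no basis for a Normal calculation
        raise ValueError("population not known to be Normal and n is too small for the CLT")
    xbar = NormalDist(mu, sigma / sqrt(n))  # mean mu, standard deviation sigma/sqrt(n)
    return xbar.cdf(hi) - xbar.cdf(lo)


# Usage with the male heights numbers (Normal population, so any n is fine)
p = prob_mean_between(176, 180, 178, 6, 10, population_normal=True, clt_min_n=30)
print(round(p, 4))
```

Raising an error in the small-sample case mirrors the notes: the probability is not zero or undefined, it simply cannot be computed with Normal techniques.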