Parameters and Statistics:
A parameter describes a characteristic of a population; its value is usually unknown.
A statistic describes a characteristic of a sample; its value can be computed from the sample data and varies from sample to sample.
Statistics are used to estimate unknown parameters.
Mnemonic: "s and p" - statistics come from samples, parameters come from populations.
µ (mu) represents the population mean, and σ represents the population standard deviation.
\bar{x} (x-bar) represents the sample mean, and s represents the sample standard deviation.
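To make the sample quantities concrete, here is a minimal sketch using Python's standard `statistics` module; the scores are hypothetical values invented for illustration:

```python
import statistics

# A hypothetical sample of 8 exam scores drawn from a larger population.
sample = [72, 85, 90, 68, 77, 95, 81, 88]

x_bar = statistics.mean(sample)  # sample mean, written x-bar
s = statistics.stdev(sample)     # sample standard deviation s (divides by n - 1)

print(x_bar)         # 82.0
print(round(s, 2))   # 9.26
```

The population mean µ and standard deviation σ of the underlying population remain unknown; \bar{x} and s are the statistics we would use to estimate them.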
Statistical Estimation:
Statistical inference involves using sample information to draw conclusions about the wider population.
Different random samples yield different statistics, necessitating the description of the sampling distribution of possible statistic values.
Sampling Variability:
Sampling variability refers to the variation of a statistic's value in repeated random sampling.
To understand sampling variability, consider what would happen if many samples were taken.
Sampling Distributions:
The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
It consists of all possible values of the statistic and the relative frequency of each value.
This distribution can be plotted using a histogram.
Simulation:
In practice, obtaining the actual sampling distribution by taking all possible samples is difficult.
Simulation can be used to imitate the process of taking many samples.
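The simulation idea can be sketched with the standard library alone. The population below is hypothetical (10,000 values generated from a Normal model); the point is only the process of drawing many samples and recording each sample mean:

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 10,000 values (mean 100, SD 15).
population = [random.gauss(100, 15) for _ in range(10_000)]

# Imitate taking many SRSs of size n and record each sample mean.
n = 25
sample_means = [
    statistics.mean(random.sample(population, n))
    for _ in range(2_000)
]

# The 2,000 recorded means approximate the sampling distribution of x-bar.
print(round(statistics.mean(sample_means), 1))   # close to the population mean
print(round(statistics.stdev(sample_means), 1))  # much smaller than the population SD
```

Plotting `sample_means` as a histogram would display the approximate sampling distribution described above.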
Bias and Variability:
Bias concerns the center of the sampling distribution.
A statistic is unbiased if the mean of its sampling distribution equals the true value of the parameter being estimated.
Variability is the spread of the sampling distribution, determined by the sampling design and sample size n.
Larger samples have smaller spreads.
Analogy:
The true population parameter is like the bull’s-eye on a target, and the sample statistic is like an arrow fired at the target.
Bias and variability describe the pattern of many shots at the target.
Managing Bias and Variability:
Reduce bias by using random sampling.
Reduce variability by using a larger sample size.
The variability of a statistic from a random sample does not depend on the population size, as long as the population is at least 20 times larger than the sample.
Why Randomize?
The purpose of a sample is to provide information about a larger population, and inference is the process of drawing conclusions about a population based on sample data.
Reasons to use random sampling:
Eliminates bias in choosing the sample.
Allows trustworthy inference using probability laws, including a margin of error.
Larger random samples provide better information.
Population Distribution:
The population distribution is the distribution of values of a variable among all individuals in the population.
It is also the probability distribution of the variable when one individual is chosen at random.
In some cases, the population of interest does not actually exist, such as future exam scores.
Mean and Standard Deviation of a Sample Mean:
The mean of the sampling distribution of the sample mean equals the population mean µ, so \bar{x} is an unbiased estimator of µ.
The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample.
It is smaller than the standard deviation of the population by a factor of \sqrt{n}. Averages are less variable than individual observations.
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
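A quick arithmetic check of the formula, with hypothetical values σ = 12 and n = 36:

```python
import math

# If the population SD is sigma = 12 and we average n = 36 observations,
# the SD of the sample mean is sigma / sqrt(n).
sigma, n = 12, 36
sigma_xbar = sigma / math.sqrt(n)
print(sigma_xbar)  # 2.0
```

Averaging 36 observations cuts the variability by a factor of 6, illustrating why averages are less variable than individual observations.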
The Sampling Distribution of Sample Means:
When choosing many SRSs from a population, the sampling distribution of the sample mean is centered at the population mean µ and is less spread out than the population distribution.
The Central Limit Theorem:
As the sample size increases, the distribution of sample means begins to resemble a Normal distribution, regardless of the population distribution shape, provided the population has a finite standard deviation.
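The theorem can be illustrated even with a strongly skewed population. The sketch below draws from an exponential distribution (mean 1), which is far from Normal, yet the averages of n = 64 draws behave symmetrically around 1:

```python
import random
import statistics

random.seed(3)

# Strongly right-skewed population: exponential with mean 1.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Distribution of averages of n = 64 skewed observations.
means = [sample_mean(64) for _ in range(2_000)]

print(round(statistics.mean(means), 2))    # near 1.0, the population mean
print(round(statistics.median(means), 2))  # median close to mean: roughly symmetric
```

A histogram of `means` would look close to the N(1, 1/8) curve the theorem predicts, even though individual observations are heavily skewed.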
A Few More Facts:
Any linear combination of independent Normal random variables is also Normally distributed.
More generally, the central limit theorem notes that the distribution of a sum or average of many small random quantities is close to Normal.
The central limit theorem also applies to discrete random variables.
An average of discrete random variables will never result in a continuous sampling distribution, but the Normal distribution often serves as a good approximation.
The Binomial Setting:
A binomial setting arises when performing several independent trials of the same chance process and recording the number of times a particular outcome (success) occurs.
The four conditions (BINS) are:
Binary: Outcomes can be classified as "success" or "failure."
Independent: Trials must be independent.
Number: The number of trials n must be fixed in advance.
Success: The probability p of success must be the same on every trial.
Binomial Distribution:
The count X of successes in a binomial setting has the binomial distribution with parameters n and p, denoted as X \sim B(n, p).
The possible values of X are whole numbers from 0 to n.
Form of the Binomial Distribution:
In a binomial setting with n trials and success probability p, the probability of exactly k successes is:
P(X = k) = {n \choose k} p^k (1 - p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1 - p)^{n-k}
k! means k(k − 1)(k − 2) ⋯ (2)(1). For example, 5! = 5(4)(3)(2)(1) = 120, and 0! = 1.
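The formula above translates directly into code. A minimal sketch using `math.comb` for the binomial coefficient:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), straight from the binomial formula."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 2 successes in 5 trials with p = 0.5.
print(binomial_pmf(2, 5, 0.5))  # C(5,2) * 0.5^5 = 10/32 = 0.3125
print(math.factorial(5))        # 5! = 120
```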
Binomial Mean and Standard Deviation:
If X has a binomial distribution with n trials and success probability p, the mean and standard deviation of X are:
µ_X = np
σ_X = \sqrt{np(1 - p)}
Note: These formulas work ONLY for binomial distributions.
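A quick check of both formulas with hypothetical values n = 100, p = 0.25:

```python
import math

n, p = 100, 0.25
mu = n * p                        # mean of X
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of X

print(mu)               # 25.0
print(round(sigma, 2))  # 4.33, i.e. sqrt(100 * 0.25 * 0.75)
```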
Normal Approximation for Binomial Distributions:
When n is large, the distribution of X is approximately Normal with mean and standard deviation:
µ_X = np
σ_X = \sqrt{np(1 - p)}
Rule of thumb: Use the Normal approximation when np ≥ 10 and n(1 – p) ≥ 10.
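The quality of the approximation can be checked directly, since the exact binomial probability can be summed from the formula and the Normal probability computed with the error function. The values n = 100, p = 0.4 below are hypothetical and satisfy the rule of thumb (np = 40, n(1 − p) = 60):

```python
import math

def binomial_cdf(k, n, p):
    """Exact P(X <= k) by summing the binomial formula."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Phi((x - mu) / sigma), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 100, 0.4                          # np = 40 and n(1 - p) = 60, both >= 10
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binomial_cdf(45, n, p)
approx = normal_cdf(45, mu, sigma)
print(round(exact, 3), round(approx, 3))  # the two values are close
```

The small remaining gap comes from approximating a discrete distribution by a continuous one; it shrinks as n grows.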
Sample Proportion:
There is an important connection between the sample proportion \hat{p} and the number of “successes” X in the sample.
\hat{p} = \frac{\text{count of successes in sample}}{\text{size of sample}} = \frac{X}{n}
Sampling Distribution of a Sample Proportion:
Choose an SRS of size n from a population of size N with proportion p of successes. Let \hat{p} be the sample proportion of successes. Then:
The mean of the sampling distribution is p.
The standard deviation of the sampling distribution is σ_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}.
For large n, \hat{p} has approximately the N(p, \sqrt{\frac{p(1 - p)}{n}}) distribution.
Sampling Distribution of a Sample Proportion Example:
Continuing the earlier online shopping example: what is the probability that at least 58% of an SRS of 2500 adults agree?
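A worked sketch of the calculation. The true proportion p = 0.60 used below is a hypothetical stand-in, since the earlier example's value is not reproduced here; with that assumption, standardize 0.58 and use the Normal approximation:

```python
import math

# Hypothetical setup: suppose the true population proportion is p = 0.60
# (the actual value comes from the earlier example, not shown here).
p, n = 0.60, 2500

sigma_phat = math.sqrt(p * (1 - p) / n)  # SD of the sampling distribution of p-hat
z = (0.58 - p) / sigma_phat              # standardize 0.58

# P(p-hat >= 0.58) = 1 - Phi(z), with Phi computed via the error function.
prob = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(sigma_phat, 4))  # 0.0098
print(round(prob, 3))        # 0.979
```

With such a large sample, the sampling distribution of \hat{p} is tightly concentrated near p, so a result of 58% or more is very likely under the assumed p = 0.60.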
Normal Approximation for Counts and Proportions:
Draw an SRS of size n from a large population having population proportion p of successes. Let X be the count of successes in the sample and \hat{p} = X/n be the sample proportion of successes. When n is large, the sampling distributions of these statistics are approximately Normal:
X is approximately N(np, \sqrt{np(1 - p)}).
\hat{p} is approximately N(p, \sqrt{\frac{p(1 - p)}{n}}).
As a rule of thumb, use this approximation for values of n and p that satisfy np ≥ 10 and n(1 − p) ≥ 10.