Under certain conditions (roughly n >= 30 with no particularly extreme outliers), a sample mean (x̄) can be modelled using a normal distribution centred at the population mean; a t-distribution is used otherwise.
Central Limit Theorem for sample mean:
“When we collect a sufficiently large sample of n independent observations from a population with mean μ and standard deviation σ, the sampling distribution of x̄ will be nearly normal with
Mean = μ, SE = σ/sqrt(n)”
Notice that we cannot directly calculate the standard error, since realistically we will not know the population standard deviation; hence we take the sample standard deviation in its place, so SE = s/sqrt(n). This works well when we have a lot of data, as the sample value then estimates the population standard deviation accurately, but it is less precise with smaller samples, causing problems when using the normal distribution to model x̄. Hence we have the t-distribution, as its thicker tails account for the additional uncertainty created when we use s.
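As a quick numeric sketch of those thicker tails (assuming scipy is available; the values s = 3.2, n = 25 are invented for illustration), compare the probability of falling beyond two standard errors under each model:

    import math
    from scipy import stats

    s, n = 3.2, 25               # hypothetical sample SD and size
    SE = s / math.sqrt(n)        # estimated standard error, 0.64

    # Probability of falling beyond two standard errors of the centre:
    print(2 * stats.norm.sf(2))        # normal: ~0.046
    print(2 * stats.t.sf(2, df=n - 1)) # t with df = 24: ~0.057, thicker tails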
Conditions for applying the Central Limit Theorem for x̄:
Independence: observations must be independent.
Normality: observations come from a normally distributed population.
‘Normality check’:
n < 30: If there are no clear outliers, assume the data come from a nearly normal distribution. Note that if the population distribution is not normal then the CLT may not hold.
n >= 30: If there are no particularly extreme outliers, assume the distribution of x̄ is nearly normal even if the underlying distribution of individual observations is not; the CLT holds.
A mental check may also be appropriate to evaluate whether we believe the underlying population would have moderate skew (if n < 30) or particularly extreme outliers (if n >= 30) beyond what is represented in the data.
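A small simulation can make these heuristics concrete (a sketch assuming numpy; the strongly skewed exponential population and the sample sizes are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    reps = 10_000

    # Draw repeated samples from a right-skewed population (exponential,
    # mean 1) and inspect the skewness of the resulting sample means.
    for n in (5, 30, 100):
        xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
        skew = ((xbars - xbars.mean()) ** 3).mean() / xbars.std() ** 3
        print(n, round(skew, 2))  # skew shrinks towards 0 as n grows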
Properties of the t-distribution:
Shape: bell-shaped; shallower than the normal distribution, with “thicker” tails => observations are more likely to fall beyond two standard deviations.
Centre: 0
Parameter: df (= n - 1) describes the precise form of the bell shape; larger df => closer to the normal distribution.
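The effect of df can be seen by tracking a fixed tail area as df grows (a sketch, assuming scipy):

    from scipy import stats

    # P(|T| > 2) approaches the normal value as df increases
    for df in (2, 5, 30, 100):
        print(df, round(2 * stats.t.sf(2, df=df), 4))
    print("normal", round(2 * stats.norm.sf(2), 4))  # ~0.0455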
Confidence intervals:
x̄ ± t*_df × SE
t* is a cutoff based on the confidence level and the t-distribution. Find t* such that the fraction of the t-distribution with df degrees of freedom within a distance (t*) of 0 matches the confidence level of interest.
Identify x̄, s, n and the confidence level
Verify our conditions are met
Compute SE, find t*, construct interval
Interpret CI
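Putting the four steps together (a minimal sketch assuming scipy; the summary statistics x̄ = 3.2, s = 1.74, n = 100 are invented for illustration):

    import math
    from scipy import stats

    xbar, s, n = 3.2, 1.74, 100   # hypothetical summary statistics
    conf = 0.95

    SE = s / math.sqrt(n)                               # 0.174
    tstar = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t(99; 0.975) ~ 1.98
    lo, hi = xbar - tstar * SE, xbar + tstar * SE
    print(f"{conf:.0%} CI: ({lo:.2f}, {hi:.2f})")       # ~ (2.85, 3.55)

With these invented numbers we would interpret the result as: we are 95% confident that the population mean lies between roughly 2.85 and 3.55.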
Notation – t(df;v) denotes the value from the t-distribution with df degrees of freedom that has an area of v to its left. Then P(t < t(df;v)) = v: the probability that t is less than this value equals v.
[1] In the t-distribution table, the column headings are v for a left-hand tail (a heading of v corresponds to a right-hand tail area of 1 - v, since v is the area to the left of the cutoff), the row headings are df, and the table entries are the values t(df;v).
When the desired confidence level is [(1 - α) × 100]%, use t(n - 1; 1 - α/2); so a 95% CI => t(n - 1; 0.975)
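In code, t(df;v) is the inverse CDF, so scipy's ppf plays the role of the table lookup (a sketch; the df and n values are arbitrary):

    from scipy import stats

    # t(df;v): the point with area v to its left
    print(stats.t.ppf(0.975, df=10))   # t(10; 0.975) ~ 2.228
    print(stats.t.ppf(0.025, df=10))   # t(10; 0.025) ~ -2.228 (symmetry)

    # 95% CI => alpha = 0.05 => t(n - 1; 1 - alpha/2)
    n, alpha = 15, 0.05
    print(stats.t.ppf(1 - alpha / 2, df=n - 1))  # t(14; 0.975) ~ 2.145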
Hypothesis tests:
Follows the same procedure as that for sample proportions, with the minor difference of using a T-score instead of a Z-score (really just a change of name). For a two-sided alternative we still calculate the p-value as double the tail area and evaluate it against our significance level to reach a conclusion.
T = (x̄ - μ0)/SE, where μ0 is the null value of the mean and SE = s/sqrt(n)
Identify the parameter of interest, hypotheses, significance level and x̄, s and n
Verify our conditions are met
Compute SE, find T and calculate p-value
Evaluate by comparison of p-value and significance level and conclude
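The same four steps in code (a sketch with simulated data; H0: μ = 5, the sample, and the seed are all invented, and scipy's ttest_1samp is used only as a cross-check):

    import math
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.normal(loc=5.4, scale=2.0, size=40)  # hypothetical sample
    mu0, alpha = 5.0, 0.05                       # H0: mu = 5 vs HA: mu != 5

    xbar, s, n = x.mean(), x.std(ddof=1), len(x)
    SE = s / math.sqrt(n)
    T = (xbar - mu0) / SE
    p = 2 * stats.t.sf(abs(T), df=n - 1)         # double the tail area
    print(f"T = {T:.2f}, p = {p:.3f}, reject H0: {p < alpha}")

    # Cross-check against the built-in one-sample t-test
    print(stats.ttest_1samp(x, popmean=mu0))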
Rejection regions:
Using the alternative hypothesis, define a rejection region of values of the test statistic that are extreme under the null in the direction of the alternative hypothesis. If the observed test statistic is within the RR, then reject the null hypothesis in favour of the alternative hypothesis. Otherwise, fail to reject.
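For a two-sided alternative the rejection region is |T| > t(n - 1; 1 - α/2). A sketch (the observed T-score is invented):

    from scipy import stats

    n, alpha = 40, 0.05
    cutoff = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t(39; 0.975) ~ 2.023

    T = 1.26                       # hypothetical observed T-score
    # RR = {T : |T| > cutoff}; |1.26| < 2.023, so fail to reject H0
    print(abs(T) > cutoff)         # False

This decision always agrees with comparing the two-sided p-value against the significance level α.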