Study Notes on Hypothesis Testing, Examples, and Statistical Significance
Hypothesis Testing
Overview
Hypothesis testing is a formal statistical procedure used to evaluate specific claims or ideas (hypotheses) about a population parameter, such as a mean or proportion, based on data collected from a sample.
The primary goal is to determine if an observed effect or relationship in a sample is statistically significant, meaning it is unlikely to have occurred purely by chance (sampling error), and thus can be generalized to the larger population. It helps researchers decide between competing hypotheses regarding a population.
Key Concepts in Hypothesis Testing
Distribution of Sample Means
The distribution of sample means (DSM) is a theoretical probability distribution of all possible sample means that could be obtained by drawing samples of a given size (n) from a population. It is fundamental because it allows researchers to determine the probability of obtaining a specific sample mean if the null hypothesis were true.
This concept establishes the foundation for calculating test statistics (like z-scores) and making inferences about the population.
Three Distributions
Original Population of IQ Scores:
Represents the entire group of interest (e.g., all adults).
Mean (\mu) = 100, which is the true average IQ of the population.
Standard deviation (\sigma) = 15, indicating the typical spread of individual IQ scores around the population mean.
Sample of n = 25 IQ Scores:
A subset drawn from the original population, used to make inferences.
Sample mean (M) = 101.2, the average IQ of the 25 individuals in this specific sample.
Sample standard deviation (S) = 11.5, the spread of IQ scores within this sample.
Score range: 80 to 130, providing context on the variability within the sample.
Distribution of Sample Means (Sampling Distribution of the Mean):
A theoretical distribution of sample means if we were to take an infinite number of samples of size n=25 from the population. It helps connect a single sample mean to the population.
Population mean (\mu) = 100.
Mean of the distribution of sample means: \mu_M = \mu = 100. This states that the average of all possible sample means will equal the population mean.
Standard deviation of the distribution of sample means, also known as the standard error of the mean (\sigma_M), is calculated as: \sigma_M = \sigma/\sqrt{n} = 15/\sqrt{25} = 3. It quantifies the average amount that sample means deviate from the population mean, essentially measuring the precision of the sample mean as an estimate of the population mean.
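A minimal simulation (standard-library Python; a sketch, not part of the original notes) illustrates both properties for the IQ example: drawing many samples of n = 25 from a normal population with \mu = 100 and \sigma = 15, the mean of the sample means approaches \mu and their standard deviation approaches \sigma/\sqrt{25} = 3.

```python
import random
import statistics

# Population parameters from the IQ example: mu = 100, sigma = 15, n = 25
random.seed(1)
mu, sigma, n = 100, 15, 25

# Draw many samples of size n and record each sample's mean
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20_000)
]

# mu_M should approximate mu = 100; sigma_M should approximate sigma/sqrt(n) = 3
print(statistics.mean(sample_means))   # close to 100
print(statistics.stdev(sample_means))  # close to 3
```

The simulated values converge on the theoretical \mu_M and \sigma_M as the number of simulated samples grows.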
What Is Hypothesis Testing?
Hypothesis testing is a structured process to make statistical decisions using experimental data. It involves the following sequential steps:
State the Hypotheses about a Population: Formulate both a null hypothesis (H0) and an alternative hypothesis (H1) that are mutually exclusive and exhaustive.
Obtain a Random Sample from the Population: Collect data through random sampling to ensure the sample is representative of the population and minimize bias.
Compare the Sample Data with the Chance Model: Evaluate how likely the observed sample statistic (e.g., sample mean) is to occur if the null hypothesis were true, often using a specific probability distribution (like the distribution of sample means).
The core of the test is to evaluate whether the observed sample statistic is so extreme that it is unlikely to be attributed solely to random chance (sampling error), suggesting a real effect or difference exists in the population.
Steps in Hypothesis Testing
Four Steps:
State the Hypothesis
Null Hypothesis (H_0): This is a statement of "no effect," "no change," or "no difference." It assumes that any observed difference in the sample is due to random chance. It always includes an equality sign (e.g., =, \le, \ge).
Example: A new drug has no effect on patient recovery time (H_0: \mu_{drug} = \mu_{standard}).
Alternate Hypothesis (H_1): This is the scientific hypothesis, representing what the researcher is trying to prove. It states that there is a significant effect, change, or difference in the population parameter. It never includes an equality sign (e.g., \neq, <, >).
Example: A new drug reduces patient recovery time (H_1: \mu_{drug} < \mu_{standard}).
H0 and H1 are mutually exclusive (only one can be true) and exhaustive (cover all possibilities).
Set Decision Criteria
This step involves establishing a standard for rejecting or failing to reject H0. It requires determining what the distribution of sample means would look like if H0 were true.
Body: Represents sample means that are likely to occur if H_0 is true.
Tails (Critical Regions): Represents sample means that are very unlikely to occur if H0 is true. These extreme values lead to the rejection of H0.
The alpha level (\alpha), also known as the significance level, is chosen (commonly 0.05 or 0.01). It represents the maximum probability of making a Type I error (rejecting a true H_0).
Critical values (e.g., z-scores, t-scores) define the boundaries of the critical region(s). If the calculated sample statistic falls within these critical regions, H_0 is rejected.
Example: For a two-tailed test with \alpha = 0.05, the critical z-values are Z = -1.96 and Z = 1.96. If the calculated z-score of the sample mean is less than -1.96 or greater than 1.96, we reject H_0.
Collect Data and Compute Sample Statistic
Carry out the research experiment or data collection, selecting a random sample of size n.
Compute the relevant sample statistic (e.g., sample mean, M).
Convert the sample statistic into a test statistic (e.g., z-score, t-score) to locate it within the sampling distribution that assumes H_0 is true. This standardized score indicates how many standard errors the sample mean is from the hypothesized population mean.
For means, the z-score formula often used is: Z = \frac{M - \mu}{ \sigma_M} = \frac{M - \mu}{\sigma/\sqrt{n}}.
Make a Decision
Compare the computed test statistic to the critical values defined in Step 2.
If the test statistic falls in the critical region (i.e., its probability of occurrence under H_0 is less than \alpha), we reject H_0. This means the observed sample result is considered statistically significant; it is unlikely to be due to chance, suggesting the alternative hypothesis (H_1) is supported.
If the test statistic does not fall in the critical region, then we fail to reject H0. This means the observed sample result is not statistically significant; there isn't enough evidence to conclude that an effect exists beyond what could be attributed to random chance. Note: We do not "accept H0" because we are only testing against the possibility of a specific effect, not proving its absence.
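The computation in Step 3 and the decision in Step 4 can be combined in a small helper (a sketch; the function name z_test is illustrative, not a standard API), using the IQ sample from earlier in these notes (M = 101.2, \mu = 100, \sigma = 15, n = 25):

```python
import math

def z_test(M, mu, sigma, n, z_crit=1.96):
    """Step 3: standardize the sample mean; Step 4: compare to the critical value."""
    sigma_M = sigma / math.sqrt(n)   # standard error of the mean
    z = (M - mu) / sigma_M           # how many standard errors M lies from mu
    return z, abs(z) > z_crit        # True -> reject H0 (two-tailed)

# IQ example: M = 101.2, mu = 100, sigma = 15, n = 25
z, reject = z_test(101.2, 100, 15, 25)
print(round(z, 2), reject)           # 0.4 False -> fail to reject H0
```

Here z = 0.4 sits well inside the body of the distribution, so the sample mean of 101.2 is entirely consistent with chance.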
Example: Marketing Impact on Startups
Known Population: Average revenue growth for US startups is \mu = 300\%, with standard deviation \sigma = 50\%.
Hypothesis: Does a greater marketing budget (18%) affect growth?
Sample Selection: 25 new startups all adopting a marketing budget of 18%.
Sample Result: First-year growth (M_{obt}) = 550%.
Hypothesis Testing Steps:
Null Hypothesis (H0): Greater marketing does not affect growth (H0: \mu = 300\%).
Alternate Hypothesis (H1): Greater marketing affects growth (H1: \mu \neq 300\%) (non-directional, two-tailed test chosen here).
Decision Criteria: Set alpha level at 0.05. For a two-tailed test, the critical z-values are Z = \pm 1.96. This means if our calculated z-score is below -1.96 or above 1.96, we will reject H_0.
Compute Sample Statistic (Standard Error and Z-score):
Standard Error: \sigma_M = \sigma/\sqrt{n} = 50/\sqrt{25} = 50/5 = 10.
Test Statistic: Z_{obt} = (M_{obt} - \mu)/\sigma_M = (550 - 300)/10 = 250/10 = 25.
Make a Decision: The computed z-score is Z_{obt} = 25. Since 25 > 1.96, the sample statistic falls in the critical region. Therefore, we reject H_0. There is statistically significant evidence to suggest that greater marketing affects growth; the observed 550% growth is not due to chance alone.
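The arithmetic above can be checked with a few lines (a sketch; variable names are illustrative):

```python
import math

# Marketing example: mu = 300, sigma = 50, n = 25, observed M = 550
mu, sigma, n, M = 300, 50, 25, 550

sigma_M = sigma / math.sqrt(n)   # 50 / 5 = 10
z_obt = (M - mu) / sigma_M       # 250 / 10 = 25

# Two-tailed decision at alpha = 0.05: reject H0 if |z| > 1.96
reject_H0 = abs(z_obt) > 1.96
print(z_obt, reject_H0)          # 25.0 True
```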
Directional vs. Non-Directional Tests
Directional (One-Tailed) Test: Used when the researcher has a specific prediction about the direction of the effect. The critical region is entirely in one tail of the distribution. This approach increases statistical power if the prediction is correct.
Example: H0: \mu \le 300\% vs. H1: \mu > 300\% (only interested in growth greater than 300%). The critical region would be solely in the upper tail (e.g., Z > 1.645 for \alpha = 0.05).
Non-Directional (Two-Tailed) Test: Used when the researcher is interested in any difference from the null hypothesis, regardless of direction (e.g., simply "affect growth" without specifying increase or decrease). The critical region is split between both tails of the distribution.
Example: H0: \mu = 300\% vs. H1: \mu \neq 300\% (interested if growth is different from 300%, either higher or lower). The critical region is split (e.g., Z < -1.96 or Z > 1.96 for \alpha = 0.05).
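Critical values for both test types come from the standard normal inverse CDF; Python's standard library exposes it through statistics.NormalDist (the helper critical_z below is illustrative, not a library function):

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Critical z-value for a test at significance level alpha."""
    z = NormalDist()
    if two_tailed:
        # Split alpha between both tails: boundary at the (1 - alpha/2) quantile
        return z.inv_cdf(1 - alpha / 2)
    # One-tailed: all of alpha concentrated in a single tail
    return z.inv_cdf(1 - alpha)

print(round(critical_z(0.05), 3))                    # 1.96  (two-tailed)
print(round(critical_z(0.05, two_tailed=False), 3))  # 1.645 (one-tailed)
```

The smaller one-tailed boundary (1.645 vs. 1.96) is exactly why one-tailed tests have more power in the predicted direction.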
One-tailed tests are generally used when prior theory or research strongly supports a specific direction. However, two-tailed tests are more conservative and are common in many research areas, as they allow for the detection of effects in either direction.
Chance Models
A chance model is a theoretical framework or distribution that describes the expected outcomes if random processes alone were at play, i.e., assuming the null hypothesis is true.
In behavioral sciences and other fields, these models often involve estimating the probability distribution of a sample statistic (like a mean or proportion) under the null hypothesis.
By comparing the observed sample statistic to this chance model (typically using standardized test statistics like z-scores, t-scores, or F-ratios), researchers can determine whether the observed outcome is a plausible result of chance or if it indicates a genuine underlying effect.
Statistical Significance
An event or result is considered statistically significant if its observed occurrence has a probability less than or equal to the predetermined alpha level (\alpha) of happening strictly by chance, assuming the null hypothesis is true. Commonly, an \alpha of 0.05 is used, meaning there's less than a 5% chance the observed effect is a random fluctuation.
Statistical significance does not necessarily imply practical importance, only that the result is unlikely to be random.
Examples of significance:
30 heads in 50 coin tosses: Z = 1.41. This z-score is not in the critical region for typical \alpha levels (e.g., for \alpha=0.05, |Z_{critical}| = 1.96), so it is not statistically significant and could easily be due to chance.
8 kings in 20 picks from a standard deck: Z = 5.42. This z-score is well beyond typical critical values, indicating that it is statistically significant and highly unlikely to occur by chance alone.
50 baby turtles reaching the ocean out of 1000 hatchlings (assuming a very low survival probability under chance): Z = 11.34. This is an extremely high z-score, suggesting a very strong and statistically significant deviation from chance expectations.
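The first two z-scores follow from the normal approximation to the binomial, z = (k - np)/\sqrt{np(1-p)}. A sketch reproduces them (the helper name z_for_count is illustrative):

```python
import math

def z_for_count(k, n, p):
    """Normal approximation: z-score of observing k successes in n trials
    when each success has probability p under the chance model."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (k - mean) / sd

# 30 heads in 50 fair coin tosses
print(round(z_for_count(30, 50, 0.5), 2))    # 1.41 -> not significant

# 8 kings in 20 draws (with replacement) from a 52-card deck, p = 4/52
print(round(z_for_count(8, 20, 4 / 52), 2))  # 5.42 -> significant
```

The turtle example would follow the same pattern once a chance survival probability is specified.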
Understanding the Distribution of Sample Means
The distribution of sample means (DSM) is critical for hypothesis testing as it provides the probability framework for evaluating how well a sample mean represents its population mean. It describes the characteristics of sample means if infinite samples of a given size were drawn.
It relies on two key properties derived from the Central Limit Theorem:
Mean of the distribution of sample means (\mu_M): This is always equal to the population mean (\mu). In symbols: E(M) = \mu. This means sample means tend to cluster around the true population mean.
Standard error of the mean (\sigma_M): This is the standard deviation of the sample means, quantifying the average variability of sample means around the population mean. It is denoted as \sigma_M = \sigma/\sqrt{n}. As sample size (n) increases, the standard error decreases, meaning sample means become more clustered around the population mean, leading to more precise estimates.
Central Limit Theorem (CLT)
The Central Limit Theorem is a cornerstone of statistical inference. It states that, for a sufficiently large sample size (n), the distribution of sample means will approximate a normal distribution, regardless of the shape of the original population distribution.
Key aspects of the CLT:
Normality: Even if the population distribution is skewed or non-normal, the distribution of sample means will become approximately normal as n increases (typically, n \ge 30 is considered sufficient for this to hold).
Mean: The mean of this sampling distribution will be equal to the population mean (\mu_M = \mu).
Standard Deviation (Standard Error): The standard deviation of this sampling distribution (standard error) will be \sigma_M = \sigma/\sqrt{n}.
The CLT guarantees that we can use the properties of the normal distribution (like standard z-tables) to calculate probabilities for sample means, which is essential for determining critical values and p-values in hypothesis testing.
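A quick simulation makes the CLT concrete: even when sampling from a heavily skewed exponential population, the sample means cluster around \mu with spread \sigma/\sqrt{n} (a sketch using only the Python standard library; the parameters are illustrative):

```python
import random
import statistics

# CLT demo: sample means from a heavily skewed (exponential) population
# still pile up in a roughly normal shape around the population mean.
random.seed(7)
pop_mean = 1.0   # exponential with rate 1 has mu = 1 and sigma = 1
n = 36           # n >= 30, large enough for the CLT approximation

means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(10_000)
]

print(statistics.mean(means))   # close to mu = 1.0
print(statistics.stdev(means))  # close to sigma/sqrt(n) = 1/6 ~ 0.167
```

Plotting a histogram of `means` would show the familiar bell shape despite the strongly right-skewed parent distribution.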
Error Types in Hypothesis Testing
When making a decision based on sample data, there's always a possibility of making an incorrect inference. These are known as Type I and Type II errors.
Type I Error
Definition: Occurs when the null hypothesis (H_0) is rejected, but in reality, the treatment or effect has no true impact on the population (a "false positive"). This means a researcher concludes there is an effect when there isn't one.
Probability: The probability of making a Type I error is directly defined by alpha (\alpha), the significance level set by the researcher (e.g., if \alpha = 0.05, there's a 5% chance of committing a Type I error).
Consequence: Can lead to unnecessary interventions, misallocation of resources, or false scientific claims.
Type II Error
Definition: Occurs when the null hypothesis (H_0) fails to be rejected, but in reality, a real treatment effect exists in the population (a "false negative"). This means a researcher concludes there is no effect when there actually is one.
Probability: Denoted by beta (\beta). Unlike alpha, beta is not typically set directly but is influenced by several factors.
Consequence: Can lead to missed opportunities, failure to implement effective treatments, or overlooking important scientific discoveries.
Power of a Test: The power of a statistical test is defined as 1 - \beta. It represents the probability of correctly rejecting H_0 when it is false (i.e., detecting a real effect). Researchers aim for high power, conventionally 0.80 or greater.
Factors Affecting Power
The power of a hypothesis test is crucial for its ability to detect a true effect. Several factors influence power:
Alpha Level (\alpha): A lower alpha level (e.g., 0.01 vs. 0.05) makes it harder to reject H_0 (requires more extreme evidence), thus resulting in a smaller critical region and lower power. Conversely, a higher alpha level increases power but also increases the risk of a Type I error.
Test Type (One-tailed vs. Two-tailed): Two-tailed hypothesis tests possess less power than comparable one-tailed tests for a given effect size and alpha level. This is because the critical region in a two-tailed test is split between both tails, requiring a more extreme test statistic in either direction to achieve significance. One-tailed tests concentrate all the critical region in one tail, making it easier to detect an effect in the predicted direction.
Effect Size: Larger treatment effects produce greater power. When the true difference between the population parameter under the null hypothesis and the actual parameter value (\mu_{actual} - \mu_{H_0}) is large, the distributions of sample means under H_0 and under H_1 are farther apart, making it easier to distinguish between them and reject H_0.
Sample Size (n): Larger sample sizes lead to smaller standard errors (\sigma_M = \frac{\sigma}{\sqrt{n}}). A smaller standard error means the distribution of sample means is narrower and more clustered around its mean. This increased precision makes it easier to detect a true effect, thereby enhancing power.
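For an upper-tailed z-test, these factors combine into a direct power calculation: power = \Phi\big((\mu_{actual} - \mu_{H_0})/\sigma_M - z_{crit}\big). A sketch (the helper power_one_tailed is illustrative, and the example numbers are hypothetical):

```python
import math
from statistics import NormalDist

def power_one_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test: P(reject H0 | true mean = mu1)."""
    z = NormalDist()
    se = sigma / math.sqrt(n)         # standard error shrinks as n grows
    z_crit = z.inv_cdf(1 - alpha)     # critical value under H0 (one-tailed)
    # Probability the sample mean lands beyond the rejection boundary
    return 1 - z.cdf(z_crit - (mu1 - mu0) / se)

# Power grows with sample size for a fixed effect (mu1 - mu0 = 5, sigma = 15)
for n in (9, 25, 100):
    print(n, round(power_one_tailed(100, 105, 15, n), 3))
```

Running this shows power climbing toward 1 as n increases, which is why underpowered studies are usually fixed by collecting larger samples.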
Effect Size Measurement: Cohen’s d
While statistical significance (p-value) indicates whether an effect is likely due to chance, effect size measures the magnitude or practical importance of the observed effect. Cohen's d is a common standardized measure of effect size for differences between two means.
It is calculated as: d = \frac{\text{mean difference}}{\text{standard deviation}}. Cohen's d is generally the difference between means divided by the population standard deviation (or a pooled standard deviation for two-sample tests).
Standardized units mean that Cohen's d is independent of the original unit of measurement, allowing for comparison across different studies. It provides an indication of the practical significance of the treatment effect, independent of sample size.
Categories (Cohen's conventions):
Small effect: d < 0.2 (e.g., a mean difference of 0.2 standard deviations).
Medium effect: 0.2 \le d < 0.8 (e.g., a mean difference of 0.5 standard deviations, noticeable to the naked eye).
Large effect: d \ge 0.8 (e.g., a mean difference of 0.8 standard deviations, clearly visible and substantial).
Reporting both statistical significance and effect size provides a more complete understanding of research findings.
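For the marketing example from these notes, a one-sample Cohen's d using the population standard deviation can be sketched as follows (cohens_d is an illustrative helper, not a library function):

```python
def cohens_d(sample_mean, pop_mean, sigma):
    """Cohen's d: mean difference expressed in standard-deviation units."""
    return (sample_mean - pop_mean) / sigma

# Marketing example: M = 550, mu = 300, sigma = 50
d = cohens_d(550, 300, 50)
print(d)  # 5.0 -> far beyond Cohen's "large" threshold of 0.8
```

Note that d is unaffected by sample size, unlike the z-score of 25 computed earlier, which is why the two statistics answer different questions.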
Conclusion
A thorough understanding of hypothesis testing, encompassing its fundamental concepts, systematic steps, the potential for Type I and Type II errors, and the importance of statistical power and effect size, is critical for researchers. These tools enable the making of informed, data-driven decisions and interpreting the impact of experimental conditions on population parameters.
Grasping concepts like the Central Limit Theorem, distribution of sample means, and Cohen's d significantly enhances the meaningful interpretation of research findings, moving beyond mere p-values to understand the practical importance and reliability of results.