Key Points on Statistical Power
Definition of Power
Power: Probability of detecting a real effect in a study.
Even well-designed studies may not always produce significant results.
Calculating power requires estimating population mean differences and standard deviations.
Sports Analogy (Yankees vs. Red Sox)
A team (or treatment) being better doesn’t guarantee a win, just a higher probability of winning.
Similarly, a real effect in an experiment won’t always yield statistical significance.
Meta-analysis (covered in Chapter 21) helps synthesize multiple studies to draw conclusions.
Misconceptions About Experiments
People assume experiments always confirm or reject a theory clearly, but real-world data is messy.
A single nonsignificant result doesn’t invalidate a theory—it could be due to sample size or variability.
Type I vs. Type II Errors
Type I error (α): False positive—rejecting a true null hypothesis.
Type II error (β): False negative—not detecting a real effect.
Type II errors can lead researchers to abandon promising theories prematurely.
Neglect of Type II Errors
Researchers historically focused on controlling Type I errors.
Type II errors were often ignored, despite their importance in evaluating study effectiveness.
Jacob Cohen played a major role in making statistical power a key concern in psychology.
Jacob Cohen & His Contributions
Developed Cohen’s Kappa, a widely used measure of agreement.
1968: Integrated linear regression & ANOVA, making statistical methods more accessible.
1969: Statistical Power Analysis for the Behavioral Sciences explained power, how to calculate it, and how to design stronger studies.
His work led to questioning the reliability of hypothesis testing in psychology.
Power in Statistical Terms
Power = 1 − β: the probability of correctly rejecting a false null hypothesis.
A more powerful experiment has a higher chance of detecting a real effect.
Key Points on the Basic Concept of Power (15.1)
Understanding Power Through Resampling
Power is the probability of detecting a real effect.
It can be analyzed using resampling methods before conducting a study.
Stereotype Threat Study (Aronson et al., 1998)
Investigated whether stereotype threat could affect White male students' math performance.
Control Group (n=11): Simply completed a difficult math exam.
Threat Group (n=12): Told that Asians typically outperform others in math.
Results:
Control Group: Mean = 9.64, SD = 3.17
Threat Group: Mean = 6.58, SD = 3.03
t-test result: Significant difference (p < 0.05), supporting the stereotype threat effect.
Estimating Power for a Replication Study
A replication was planned with 20 participants per group.
To predict the likelihood of success, researchers estimated population means & SDs from the original study.
Simulation Process:
Randomly draw 20 values from each population.
Calculate a t-statistic for each simulated sample.
Repeat this process 10,000 times to generate a distribution of t-values.
Compare the t-values to the critical value (±2.024, df = 38) to determine significance. (A sketch of this simulation follows below.)
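A minimal sketch of this simulation in Python, assuming normal populations with the means and SDs estimated from the original study; the variable names and the use of numpy/scipy are illustrative, not the textbook's own code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Population parameters estimated from the original study
mu_control, sd_control = 9.64, 3.17
mu_threat, sd_threat = 6.58, 3.03
n_per_group = 20
t_crit = stats.t.ppf(0.975, df=2 * n_per_group - 2)  # ~2.024 for df = 38

n_sims = 10_000
significant = 0
for _ in range(n_sims):
    control = rng.normal(mu_control, sd_control, n_per_group)
    threat = rng.normal(mu_threat, sd_threat, n_per_group)
    result = stats.ttest_ind(control, threat)
    if abs(result.statistic) >= t_crit:
        significant += 1

print(f"Estimated power: {significant / n_sims:.2f}")  # ~0.86
```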
Results of the Resampling Study
86% of t-values were statistically significant, meaning the power of the experiment was 0.86.
14% of trials failed to detect a real effect, indicating some risk of a Type II error.
A power level of 0.86 is considered strong for practical research.
Online Power Calculation Tools
The estimated power matched results from online power analysis tools.
Example: Lenth’s Power Applet (University of Iowa).
Key Points on Factors Affecting the Power of a Test (15.2)
Power is Influenced by Four Main Factors:
Significance Level (α):
Lower α (e.g., 0.01 instead of 0.05) reduces the risk of Type I error but also lowers power.
Higher α increases power but raises the chance of a false positive.
True Alternative Hypothesis (Effect Size, δ):
Larger true effects (stronger relationships) are easier to detect, increasing power.
Small effects require larger samples to reach statistical significance.
Sample Size (n):
Increasing n reduces variability and increases the likelihood of detecting a real effect.
Power grows as n increases because the standard error decreases.
Type of Statistical Test Used:
Some tests are more powerful than others under certain conditions.
Example: Matched-pair tests tend to have higher power than independent samples tests when assumptions are met.
The book mainly discusses the most powerful tests available for each scenario. (The paired-versus-independent comparison is simulated below.)
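A quick simulation illustrating the matched-pair point: when scores are positively correlated, the related-samples t test detects the same mean shift more often than the independent-samples test. The data-generating model (a shared subject-level component giving correlation ≈ 0.6) is an assumption for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, effect, rho, sims = 20, 0.5, 0.6, 5_000

paired_hits = indep_hits = 0
for _ in range(sims):
    # Pre/post scores share a subject-level component, giving correlation rho
    subject = rng.normal(0, np.sqrt(rho), n)
    noise_sd = np.sqrt(1 - rho)
    pre = subject + rng.normal(0, noise_sd, n)
    post = subject + rng.normal(effect, noise_sd, n)
    paired_hits += stats.ttest_rel(pre, post).pvalue < 0.05
    indep_hits += stats.ttest_ind(pre, post).pvalue < 0.05

print(f"Paired power:      {paired_hits / sims:.2f}")   # noticeably higher
print(f"Independent power: {indep_hits / sims:.2f}")
```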
Key Points: A Short Review
Two Sampling Distributions:
Left Distribution (Null Hypothesis True, H0)
Represents expected sample means if H0 is correct.
Right Distribution (Null Hypothesis False, H1)
Represents sample means when the true population mean differs from H0.
The position of this distribution depends on the true effect size.
Power and Error Types:
Type I Error (α):
The right tail of the H0 distribution.
Represents the probability of falsely rejecting H0.
Type II Error (β):
The left tail of the H1 distribution.
Represents the probability of failing to reject a false H0.
Power (1 - β):
The unshaded area of the H1 distribution.
Probability of correctly rejecting a false H0.
Higher power means a greater chance of detecting a real effect.
Key Points: Power as a Function of α, True Hypothesis, and Sample Size
Power as a Function of α:
Increasing α moves the cutoff point to the left.
Decreases β (Type II error).
Increases power (correctly rejecting a false null).
Increases Type I error (α).
Power as a Function of the True Hypothesis (Effect Size):
A larger difference between the means under H0 and H1 increases power.
Figure 15.3 shows increased power with a greater effect size.
Larger differences make it easier to detect a real effect.
Power as a Function of Sample Size (n) and Standard Deviation (σ):
Increasing sample size (n) or decreasing standard deviation (σ) reduces the standard error of the mean.
Figure 15.4 shows that reducing the overlap between the two distributions increases power.
Power increases as sample size increases or standard deviation decreases.
Manipulating Sample Size (n):
Sample size (n) is easier to control than α or effect size.
Power analysis therefore usually focuses on choosing n, although changes in experimental design can also improve power.
Calculating Power the Traditional Way
Overlap and Power:
Power depends on the overlap between the sampling distributions under the null hypothesis (H0) and the alternative hypothesis (H1).
This overlap is influenced by:
The distance between H0 and H1 (effect size).
The standard error, which is influenced by sample size.
Effect Size (d):
Effect size (d) measures the difference between H0 and H1 in terms of standard deviations (rather than raw scores).
Cohen's d is the standardized difference between the means of two groups (used for estimating how different they are, relative to the variability within the data).
For example, if the means differ by one standard deviation, d = 1.
Standardized Measure:
d provides a way to express the effect size without incorporating the sample size (n) initially.
Formula for d:
d = (M1 − M2)/σ
Where M1 and M2 are the means of the two distributions, and σ is the pooled standard deviation.
Effect size (d) is computed from the means and the standard deviation (σ), independently of the sample size (n). (A small helper for computing d from data is sketched below.)
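A minimal helper for computing d from two samples, assuming two arrays of scores; the function name cohens_d is hypothetical:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
```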
Estimating the Effect Size (d)
There are three main ways to estimate the effect size (d) for power calculations:
Prior Research:
Use data from past studies to estimate the effect size.
Look at sample means and variances to make an informed guess for your study.
Rough approximations are often sufficient and more useful than no estimate at all.
Personal Assessment of Importance:
Researcher defines the minimum meaningful difference before the study.
Example: A 10-point difference in test scores is considered meaningful.
Estimate the standard deviation from other data to calculate d.
Example: If the standard deviation is 100 and the smallest meaningful difference is 40 points, d = 40/100 = 0.40.
Use of Special Conventions (Cohen’s Guidelines):
If prior research or personal assessment isn't available, use Cohen's conventions:
Small effect size: d=0.20 (92% overlap between distributions).
Medium effect size: d=0.50 (80% overlap between distributions).
Large effect size: d=0.80 (69% overlap between distributions).
Cohen’s guidelines were meant as a fallback when no other estimation method is available. (The overlap figures above are verified numerically below.)
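The quoted overlap percentages can be reproduced with the overlapping coefficient for two equal-variance normal distributions, OVL = 2Φ(−d/2); that this is the formula behind the figures is my assumption:

```python
from scipy.stats import norm

for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    overlap = 2 * norm.cdf(-d / 2)  # overlapping coefficient of the two curves
    print(f"{label}: d = {d:.2f}, overlap = {overlap:.0%}")
# small: 92%, medium: 80%, large: 69%
```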
Important Notes:
Thompson’s Warning: Don’t use Cohen’s conventions mindlessly. Focus on both the difference (numerator) and the variability (denominator) in your study.
Researchers should try to define the effect size before conducting the experiment for realistic expectations and relevant power analysis.
Recombining the Effect Size and Sample Size (n)
Earlier, we separated the effect size (d) and sample size (n) for convenience.
Now, we combine them using a statistic (denoted as δ) to account for both d and n in power calculations.
The statistic δ is defined differently for each specific test.
This approach allows the use of a single table of δ for power calculations across various statistical procedures.
Power Calculations for the One-Sample t Test
For the one-sample t test, δ combines the effect size d and the sample size n: δ = d√n.
δ is known as the noncentrality parameter in this case. (A power computation based on it is sketched below.)
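A sketch of the power computation the tables perform, using the normal approximation to the noncentral distribution; the function name and defaults are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def one_sample_power(d, n, alpha=0.05):
    """Approximate two-tailed power from delta = d * sqrt(n), as in the power tables."""
    delta = d * np.sqrt(n)               # noncentrality parameter
    z_crit = norm.ppf(1 - alpha / 2)
    # Probability the test statistic lands beyond either cutoff
    return norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)

# Everitt CBT example below: d = 3.00/7.31 = 0.41, n = 30
print(f"{one_sample_power(3.00 / 7.31, 30):.2f}")   # ~0.61, matching the table estimate
```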
Example Scenario:
A clinical psychologist is replicating a study by Everitt on cognitive behavior therapy (CBT) as a treatment for anorexia.
The psychologist assumes Everitt’s study data represent the population parameters:
Population mean weight gain: μ = 3.00
Standard deviation: s = 7.31 (treated as the population σ)
The null hypothesis: CBT does not lead to weight gain (H0: μ = 0).
Power Calculation Process:
Sample Size: The psychologist plans to use the same sample size as in Everitt’s study: n=30
Noncentrality Parameter (δ): Calculated as δ = d√n = (3.00/7.31)√30 = 0.41 × 5.48 ≈ 2.25.
Determine Power: Using a table of power values for different values of δ, we locate the power associated with δ ≈ 2.25 and a significance level of α = 0.05.
The table brackets the power between 0.60 and 0.63.
Using linear interpolation, we estimate the power to be approximately 0.61.
Interpretation:
The result implies that if the true mean weight gain from CBT is indeed 3 pounds, the study has only about a 60% chance of obtaining a significant result.
This is considered low power, meaning 40% of the time, the study would fail to detect a real effect (Type II error).
Estimating Required Sample Sizes
The sample size needed for a study depends on the desired level of power.
The goal is to calculate the required sample size n for a given power level (e.g., 0.80).
Example Scenario (with Desired Power of 0.80):
Given:
Desired power = 0.80
From Table E.5, the corresponding δ for 0.80 power is 2.80.
Solving for Sample Size (n):
Formula: δ = d√n, with d = 3.00/7.31 ≈ 0.41.
Rearranging the formula to solve for n: n = (δ/d)².
Calculating n: Using δ = 2.80 and d = 0.41, n = (2.80/0.41)² ≈ 46.6.
Round up to 47 participants. (A short solver is sketched below.)
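The same rearrangement as a short sketch; the table value δ = 2.80 for power 0.80 at α = .05 comes from the notes above, and the function name is hypothetical:

```python
import numpy as np

def n_for_power(d, delta_table=2.80):
    """Required n from delta = d * sqrt(n), i.e. n = (delta / d)**2, rounded up."""
    return int(np.ceil((delta_table / d) ** 2))

print(n_for_power(3.00 / 7.31))  # 47
```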
Interpretation:
To achieve 80% power, the experimenter needs to have 47 clients.
Why Choose Power of 0.80?
Power of 0.80 means there is a 20% chance of making a Type II error (failing to reject the null hypothesis when it is false).
Practicality: Higher power (e.g., 0.95 or 0.99) requires a larger sample size, which can become impractical or cost-prohibitive:
Power = 0.95 (δ = 3.60): Requires approximately 77 participants.
Power = 0.99: Requires around 105 participants.
Diminishing Returns: As sample size increases, the additional benefit in power is smaller.
Ethical Concerns: Institutional Review Boards (IRBs) may limit sample sizes, especially for studies involving human or animal subjects.
Power Calculations for Differences between Two Independent Means
Effect Size Calculation (d):
Formula: d = (μ1 − μ2)/σ
μ1−μ2: Difference between the population means under the alternative hypothesis.
σ: Standard deviation of the populations.
This is the same as Cohen’s d or Hedges’s g, expressed in terms of population means.
Two Cases:
Equal Sample Sizes (n1=n2):
The power calculation is straightforward.
Unequal Sample Sizes (n1≠n2)
Power calculation adjusts for differences between the two groups.
Equal Sample Sizes (15-5a)
Assumptions:
Expected difference: 5 points between the two treatments.
Standard deviation: σ=10
Effect size (d): Half a standard deviation (d=0.5).
Initial Power Calculation:
Sample size per group: n=25
Noncentrality parameter δ: δ = d√(n/2) = 0.5 × √12.5 ≈ 1.77.
Power for δ ≈ 1.77: approximately 0.43 (from interpolation, two-tailed test at α = 0.05).
Type II error rate: 1−0.43=0.57.
Increasing Power to 0.80:
For 80% power, δ=2.80.
Sample size: n = 2(δ/d)² = 2(2.80/0.5)² ≈ 63 subjects per group (total of 126 subjects). (These figures are reproduced in the sketch below.)
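A sketch of both equal-n calculations, under the assumptions above (expected difference = 5, σ = 10):

```python
import numpy as np

d = 5 / 10                        # expected difference / sigma
n = 25
delta = d * np.sqrt(n / 2)        # delta for two equal independent groups
print(f"delta = {delta:.2f}")     # 1.77 -> power ~0.43 from the delta table

# n per group needed for power 0.80 (table value delta = 2.80):
n_needed = int(np.ceil(2 * (2.80 / d) ** 2))
print(n_needed)                   # 63 per group (126 total)
```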
Unequal Sample Sizes (15-5b)
Problem with Unequal Sample Sizes:
Different sample sizes (n1≠n2) complicate the calculation of δ
Incorrect method: Using the arithmetic mean of n1 and n2.
Correct Method:
Use the harmonic mean of the two sample sizes: n̄h = 2n1n2/(n1 + n2).
The harmonic mean accounts for the relationship between variance and sample size.
Example: Stereotype Threat Study:
Control group: 18 males, Threat group: 12 males.
Effective sample size: n̄h = 2(18)(12)/(18 + 12) = 14.4.
The study with unequal sample sizes is equivalent in power to a study with 28.8 subjects (equal sample sizes).
Power calculation: Approx. 0.75 for unequal sample sizes.
Improving Power:
Increase sample size in the unequal groups.
Adding 20 more students to the study:
New sample sizes: 28 in the Control group, 22 in the Threat group.
Power now equals 0.93, which is sufficient for the study. (Both scenarios are checked in the sketch below.)
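A sketch of the unequal-n calculation; the pooled σ ≈ 3.10 is an assumption based on the two group SDs reported earlier (3.17 and 3.03):

```python
import numpy as np

def harmonic_mean_n(n1, n2):
    """Effective per-group n for unequal samples: 2*n1*n2 / (n1 + n2)."""
    return 2 * n1 * n2 / (n1 + n2)

d = (9.64 - 6.58) / 3.10          # effect size from the stereotype threat means
for n1, n2 in [(18, 12), (28, 22)]:
    nh = harmonic_mean_n(n1, n2)
    delta = d * np.sqrt(nh / 2)
    print(f"n = ({n1}, {n2}): effective n = {nh:.1f}, delta = {delta:.2f}")
# delta ~ 2.65 -> power ~0.75; delta ~ 3.47 -> power ~0.93
```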
Key Takeaway:
Unequal sample sizes reduce power.
When possible, make the smaller group as large as possible to preserve power.
Power Calculations for the t Test for Related Samples
Challenge in Matched Samples:
Testing the difference between two matched (related) samples requires accounting for the correlation between the two sets of observations.
The calculation for power is similar to independent samples, but with the added complexity of the correlation.
Effect Size (d):
d is defined as the difference in means, just as for independent samples, but computed on the difference scores.
Substitute sample statistics for the parameters when they are unknown (i.e., use the mean and SD of the difference scores).
For the t test for related samples, calculate δ as δ = d√n, where n is the number of pairs in the sample.
Power Calculation:
Use Table E.5 for δ values.
If parameters are not available, approximate power by treating the two groups as independent, giving a lower bound for power.
True power is likely higher than this estimate due to the correlation between the two sets of scores.
Example: Family Therapy for Anorexia
Original Data:
Mean difference in weights: 7.26 pounds.
Standard deviation of difference scores: 7.16 pounds.
Assume replication study with same sample size and population parameters.
Power Calculation:
From Appendix E: For a two-tailed test at α = 0.05, power is approximately 0.99.
Conclusion: With 17 subjects, the study has a 99% chance of finding a significant result if the effect size is as large as the original study suggested.
Key Point: This calculation assumes the same (large) effect size as the original study, which can be unrealistic in practice, but it demonstrates the potential for high power when the effect is large. (The arithmetic is reproduced below.)
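Reproducing the numbers under the stated assumptions (d computed on difference scores, δ = d√n, normal approximation as in the power tables):

```python
import numpy as np
from scipy.stats import norm

mean_diff, sd_diff, n = 7.26, 7.16, 17
d = mean_diff / sd_diff                  # effect size on difference scores
delta = d * np.sqrt(n)                   # delta = d * sqrt(number of pairs)
z_crit = norm.ppf(0.975)                 # two-tailed, alpha = .05

# Table-style normal approximation: power ~ Phi(delta - z_crit)
power = norm.cdf(delta - z_crit)
print(f"d = {d:.2f}, delta = {delta:.2f}, power = {power:.2f}")  # 1.01, 4.18, 0.99
```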
Power Considerations in Terms of Sample Size
Large Sample Sizes are Essential:
High power (e.g., 0.80) requires large sample sizes, especially for detecting small effects.
If an effect is small, failing to use a large sample increases the risk of Type II errors (failing to detect a real effect).
Cohen’s Effect Size Categories & Sample Requirements:
Small Effect: Requires very large sample sizes.
Medium Effect: Requires moderate sample sizes.
Large Effect: Can be detected with smaller sample sizes.
Power is Costly:
Table 15.3 (not included here) shows the required total sample sizes at α = 0.05, two-tailed for different effect sizes.
Small effects demand disproportionately high sample sizes, making power an "expensive commodity."
Strategic Considerations for Researchers:
Seek large effects whenever possible.
Use large samples if resources allow.
Utilize sensitive experimental designs, such as:
Repeated measures designs (using the same participants multiple times).
Reducing experimental error, which increases the likelihood of detecting an effect.
Conclusion:
While large sample sizes are often necessary, choosing an efficient experimental design can enhance power without requiring excessive sample sizes.
Post-Hoc (Retrospective) Power
Definition & Controversy
Post-hoc power (retrospective power) is calculated after running a study.
A misleading use of post-hoc power is claiming that if a non-significant result had high power, the null hypothesis is likely true.
This logic is flawed: observed (post-hoc) power is a direct function of the p-value, so a nonsignificant result can never be accompanied by high post-hoc power.
Common Misconceptions
Some researchers argue:
“I had high power, but my result was not significant, so the null must be true.”
“My study had low power, so don’t blame me for not rejecting the null.”
Hoenig & Heisey (2001) debunk this, showing that if p = .05, retrospective power is ~.50, and it drops further as p increases.
It is mathematically impossible to have a non-significant result and still claim high power.
Illustrating the Issue
Lenth’s Piface program demonstrates this by assigning post-hoc power of either 0.00 or 1.00, based solely on whether the null was rejected—showing the concept is flawed. (A small numeric demonstration follows.)
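A small demonstration of the Hoenig & Heisey point using the normal approximation to a two-tailed z test (an illustrative setup, not their exact derivation): the observed power implied by a result is determined entirely by its p-value, and at p = .05 it is about .50.

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

for p in [0.05, 0.10, 0.25, 0.50]:
    z_obs = norm.ppf(1 - p / 2)                    # |z| implied by the p value
    observed_power = 1 - norm.cdf(z_crit - z_obs)  # treat z_obs as the true effect
    print(f"p = {p:.2f} -> observed power = {observed_power:.2f}")
# p = .05 -> ~0.50; larger p values give even lower observed power.
```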
A Valid Use of Post-Hoc Power
Some statistical tools (e.g., G*Power) compute post-hoc power in a meaningful way.
Instead of using post-hoc power to justify current results, it can be used to:
Estimate the power for a future replication based on past study parameters.
Assess whether an experiment was well-designed for detecting an effect.
Key Takeaway
Do not use post-hoc power to justify or explain away non-significant results.
A proper approach is to use prior data to estimate power for future studies.
Summary of Power Analysis
Definition of Power
Power = Probability of rejecting the null hypothesis (H0) when it is false.
Related to Type I error (α): Rejecting a true H0.
Related to Type II error (β): Failing to reject a false H0.
Factors Affecting Power
Significance level (α): Increasing α raises power but also increases Type I errors.
Effect size (μ1 − μ2): Larger differences between population means increase power.
Population standard deviation (σ): Smaller σ increases power (less variability).
Sample size (n): Larger samples = higher power.
Key Concepts
Cohen’s d: Measures effect size as the difference in means relative to standard deviation.
Delta (δ): Combines sample size and effect size to determine power from tables.
Power in Different Tests
One-sample t test: Examines whether a sample mean differs from a known population mean.
Two independent samples t test: Power depends on equal vs. unequal sample sizes.
Paired (related) samples t test: Accounts for the correlation between observations.
Main Takeaway
Detecting small effects requires large samples.
If you want high power, aim for strong effects or large sample sizes.
Q: What is a Type I error?
A: Rejecting the null hypothesis when it is actually true.
Q: What is a Type II error?
A: Failing to reject the null hypothesis when it is actually false.
Q: What does α (alpha) represent in hypothesis testing?
A: The probability of making a Type I error (rejecting H0 when it is true).
Q: What does β (beta) represent in hypothesis testing?
A: The probability of making a Type II error (not rejecting H0 when it is false).
Q: What is statistical power?
A: The probability of correctly rejecting H0 when it is false (1 − β).
Q: How does increasing sample size affect power?
A: It increases power by reducing variability and making it easier to detect a true effect.
Q: Why is choosing the most powerful test important?
A: Because it minimizes Type II errors and increases the chance of detecting a true effect.
Q: If a study fails to reject H0 but H1 is true, what can you conclude?
A: The study may not have had enough power to detect the difference.
Suppose you are planning to use a one-sample t test to test the following one-tailed hypotheses.
H0: μ = 0
H1: μ > 0
Since you want to maximize the power of your study, you are considering which factors might increase power so that you can adjust your plans to incorporate them, if possible, before conducting your research.
The following are your considerations for increasing power. Fill in any missing words/values. (Hint: Remember that for this alternative hypothesis, you will reject the null hypothesis for large values of the test statistic.)
• An increase in α (such as using α = .10 instead of α = .05) increases the power of the one-sample t test, because this change in α makes it more likely you will reject the null hypothesis. This change makes it easier to correctly reject a false null hypothesis—meaning that power is increased.
Explanation:
Power is the probability of correctly rejecting a false null hypothesis. Thus, anything that will make you more likely to reject the null hypothesis (whether true or false) increases the power of your test. The larger the significance level (alpha or α), the more likely you are to reject the null hypothesis.
An increase in α (meaning using α = .10 instead of α = .05, for example) makes it more likely you will reject the null hypothesis, increasing the power.
• An increase in X̄ increases the power of the one-sample t test, because this change in the numerator of the test statistic for the one-sample t test increases the overall test statistic, making it more likely you will reject the null hypothesis.
Explanation:
You will reject the null hypothesis for large values of the test statistic. Thus, anything that makes the test statistic larger increases the power, because you are then more likely to reject the null hypothesis.
The value X̄ is the numerator of the test statistic for the one-sample t test when the null hypothesis is H0: μ = 0. When the numerator increases, the overall test statistic increases, making it more likely you will reject the null hypothesis and increasing the power.
• The denominator of the test statistic for the one-sample t test is sX̄. A decrease in this denominator increases the power of the one-sample t test, because this change increases the overall test statistic, making it more likely you will reject the null hypothesis. This denominator decreases when either the sample size (N) increases or the population variance decreases.
Explanation:
You will reject the null hypothesis for large values of the test statistic. Thus, anything that makes the test statistic larger increases the power, because you are then more likely to reject the null hypothesis.
While increasing the numerator of the test statistic had the effect of increasing the overall test statistic for the one-sample t test, the opposite is true of the denominator. That is, decreasing the denominator has the effect of increasing the overall test statistic.
The denominator of the test statistic for the one-sample t test is sX̄. A decrease in this denominator then makes it more likely you will reject the null hypothesis.
The tricky part is that this denominator is a fraction itself:
sX̄ = s/√N
This denominator decreases when its numerator (directly related to the population variance) decreases or when its denominator (directly related to the sample size—N) increases.
All of the factors discussed previously affect the power of the one-sample t test. As the person making the decisions about how the research will be conducted, which of the following factors can you possibly influence? Check all that apply.
X̄
sX̄
α
Sample size (N)