Psychological Research Methods and Analysis - Statistical Power and Limitations of t-tests
Statistical Power
- Statistical power is the likelihood that a test will detect an effect if there is a true effect in the population.
- When an effect is real (H1 is true), the aim is to detect it in the study instead of committing a Type II error.
Key Concepts
- Statistical Power: The probability of detecting a real effect in a study.
- β = p(Type II error): Probability of failing to reject H0 when H1 is true.
- Statistical power = 1 – β: the probability that the test correctly rejects the null hypothesis H0 when the research hypothesis H1 is true.
- Even when a real effect exists, a study might not yield a significant result, leading to a Type II error.
- Example simulation: with a population effect of d = 1, a simulation of 500 studies with N = 9 yielded 430 significant results at an α-level of 5%. So 86% of studies correctly inferred the effect, and 14% committed Type II errors.
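The simulation described above can be sketched as follows. This is an illustrative reconstruction, assuming a one-sample t-test against zero with a two-tailed α of .05; the notes do not say which test or tail the original simulation used, so the estimated power may not land exactly on 86%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n, d, alpha = 500, 9, 1.0, 0.05

significant = 0
for _ in range(n_studies):
    # Draw N = 9 scores from a population with a true effect of d = 1
    sample = rng.normal(loc=d, scale=1.0, size=n)
    # One-sample t-test against 0 (two-tailed)
    if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
        significant += 1

power_estimate = significant / n_studies   # proportion of correct rejections
type_ii_rate = 1 - power_estimate          # proportion of Type II errors
print(power_estimate, type_ii_rate)
```

The two proportions always sum to 1: every simulated study in which H1 is true either detects the effect or commits a Type II error.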
Factors Affecting Statistical Power
- α-level
- Effect size
- Sample size
Effect Size
- A stronger treatment effect increases statistical power.
- As effect size increases, statistical power increases.
- Cohen's d = (M₁ − M₂) / SD: measures the magnitude of the difference between means in standard-deviation units.
- Larger effect sizes make it easier to detect significant results, thus increasing statistical power.
- Ways to increase effect size in an experiment:
- Use a strong manipulation (increases the difference in means).
- Reduce within-group variability (decreases SD).
- Use a homogenous sample (limiting the population).
- Standardize procedures and measure carefully.
- Example: large variability within groups (e.g., a community sample) gives d = 1.
- The same mean difference with less within-group variability (e.g., a student sample) gives d = 1.4.
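A short sketch of how a smaller SD raises Cohen's d for the same mean difference. The means and SDs here are hypothetical illustration values, not taken from the notes.

```python
def cohens_d(m1, m2, sd):
    """Cohen's d for two means sharing a common (pooled) standard deviation."""
    return (m1 - m2) / sd

# Hypothetical example: identical mean difference, different within-group variability
d_community = cohens_d(110, 100, 10.0)  # heterogeneous sample, larger SD
d_students  = cohens_d(110, 100, 7.0)   # homogeneous sample, smaller SD
print(d_community, d_students)
```

Shrinking the SD from 10 to 7 lifts d from 1.0 to about 1.43 without touching the means, which is exactly why reducing within-group variability increases power.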
Homogeneous vs Heterogeneous Samples
- Homogeneous sample:
- Limiting the population (e.g., one particular subgroup or people who share similar characteristics).
- Reduces within-group variability.
- Therefore increasing d.
- Increases power of experiments.
- Heterogeneous sample:
- Capturing widest range of perspectives possible – maximum variation sampling.
- Reduces d in experiments.
- Increases power in correlations – increases the observed correlation (remember r is an effect size!).
- Within-subjects studies tend to have higher power than between-subjects studies because they reduce within-group variability.
Alpha Level
- Statistical power increases as α increases.
- Increasing the alpha level increases the power of a study (e.g., from 0.05 to 0.10), making it more likely to find a true effect.
- However, this increases the risk of a Type I error.
- Lowering the significance level (e.g., from 0.05 to 0.01) decreases the likelihood of making a Type I error.
- However, this reduces statistical power, and makes it more likely to commit a Type II error.
- With a lower alpha level, the critical value for rejecting the null hypothesis becomes more stringent, requiring stronger evidence to reject it.
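The "more stringent critical value" point can be made concrete with scipy. The degrees of freedom below are a hypothetical example value.

```python
from scipy import stats

df = 42  # hypothetical degrees of freedom for illustration

# Two-tailed critical t-values: the cutoff grows as alpha shrinks
crit_05 = stats.t.ppf(1 - 0.05 / 2, df)  # alpha = .05
crit_01 = stats.t.ppf(1 - 0.01 / 2, df)  # alpha = .01
print(crit_05, crit_01)
```

With α = .01 the observed t must clear a noticeably higher bar than with α = .05, which is why lowering α trades Type I error protection for reduced power.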
Sample Size
- Statistical power increases as sample size N increases.
- Larger samples provide more information about the population, reducing the influence of random variability.
- Small sample sizes lead to low power, increasing the likelihood of failing to detect true effects (Type II error).
- A power analysis is needed to determine the minimum sample size required to detect a specified effect size with a desired level of power (usually set at 0.8 / 80%).
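A minimal sketch of such a power analysis for an independent-samples t-test, using the normal approximation (an exact noncentral-t calculation gives slightly larger Ns). The formula and defaults are standard, but this is an illustration, not the specific procedure the notes refer to.

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-tailed independent-samples t-test
    (normal approximation to the power function)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical z for two-tailed alpha
    z_beta = norm.ppf(power)           # z corresponding to desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.5))  # medium effect needs ~63 per group
print(n_per_group(0.8))  # large effect needs far fewer
```

Note how the required N falls rapidly as the expected effect size grows, mirroring the "power increases with effect size" point above.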
Importance of Power
- To plan studies: Determine the necessary N for sufficient power, given the estimated effect size.
- To interpret non-significant results: with a small N, power is low, so the study is likely to miss even a medium-sized effect; a non-significant result is then inconclusive.
Key Summary
- Power increases as α increases.
- Power increases as N increases.
- Power increases as effect size increases.
- A non-significant result is meaningless when power is low.
- A conclusion that there is no effect in the population is only sensible when power is high in a study.
Power Calculation Example
- A study on a new teaching method shows:
- Control: M = 2473, SD = 461, N = 22
- Experimental: M = 3105, SD = 425, N = 22
- Effect size: d = 1.4
- Calculation helps determine the necessary sample size for replication to have adequate power.
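The effect size in this example can be reproduced from the summary statistics, assuming the common pooled-SD version of Cohen's d (with equal group sizes, a simple average of the two variances):

```python
import math

# Summary statistics from the teaching-method example above
m_exp, sd_exp = 3105, 425  # experimental group, N = 22
m_ctl, sd_ctl = 2473, 461  # control group, N = 22

# Pooled SD: with equal Ns this is the root mean of the two variances
sd_pooled = math.sqrt((sd_exp**2 + sd_ctl**2) / 2)
d = (m_exp - m_ctl) / sd_pooled
print(round(d, 2))  # close to the reported d = 1.4
```

This d then feeds into a power analysis (as above) to find the N needed for an adequately powered replication.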
Limitations of t-tests and Alternatives
- Non-parametric tests are based on fewer assumptions and often based on ranking data.
- These are used when CIs or t-tests don’t work well.
When t-tests Don't Work
- t-tests might not be suitable for all data; the goal is to identify when data are inappropriate for these tests and to learn easy alternatives.
- Alternatives will be covered for t-tests, but there are no easy alternatives for computing CIs.
Assumptions of t-tests
- All statistical tests have ‘assumptions’; these are conditions or requirements that must be met by the data for statistical techniques to be valid.
- If assumptions are not met and the analysis is run, the results won’t reflect the population, leading to distortion.
T-test Assumptions
- Paired Samples t-test:
- Measurement on interval or ratio scale
- Populations have a normal distribution
- Independent Samples t-test:
- Measurement on interval or ratio scale
- Populations have a normal distribution
- Both samples are drawn from populations with equal variance
Consequences of Violating Assumptions
- If assumptions are met: p(sig. result | H0 is true) = α.
- If assumptions are violated (e.g., unequal variance), the t-test may not be robust, and the p-value becomes misleading.
Guidelines for t-tests
- For all t-tests:
- Don’t use if data are not at least on an interval scale.
- Strongly skewed data are problematic (especially with one-tailed testing).
- Additionally, for independent samples t-test:
- If variances are extremely unequal and sample sizes are also unequal, do not use the t-test (it only becomes problematic when both conditions hold at once).
Alternatives to t-tests
- When data are strongly skewed, they are sometimes transformed to make them easier to model.
- A transformation is a rescaling of the data using a function.
- One common transformation is the log transformation.
- If you have numbers that are really spread out (like 1, 10, 100, 1000), taking the log of each number makes them closer together (log(1) = 0, log(10) = 1, log(100) = 2, log(1000) = 3).
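The base-10 log example above, written out with numpy:

```python
import numpy as np

skewed = np.array([1, 10, 100, 1000], dtype=float)
logged = np.log10(skewed)  # base-10 log compresses the spread
print(logged)              # roughly [0, 1, 2, 3]
```

Each factor-of-10 jump in the raw data becomes a step of 1 on the log scale, which pulls a long right tail in toward the bulk of the distribution.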
Tests Based on Ranks (Non-Parametric Tests)
- A non-parametric test doesn’t rely on specific assumptions about the distribution of data.
- It doesn’t rely on exact numerical values of the data, but the ranks/orders of data points.
Benefits
- Do not assume normality.
- Less susceptible to non-normality and extreme values (outliers).
Disadvantage
- Lower power because some information is lost when values are converted to ranks.
Rank-Based Tests
- For independent samples: the U test (Mann–Whitney).
- For paired samples: Wilcoxon signed-ranks test.
- For both, H0 assumes that the medians of relevant populations are equal.
U-Test
- For two independent samples.
- Example result: if H0 were true, there would be only a 3.4% chance of observing a difference in mean ranks of 3.5 or more.
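A minimal U-test run with scipy, on hypothetical data (the group scores below are illustration values, not the example that produced the 3.4% figure):

```python
from scipy.stats import mannwhitneyu

# Hypothetical scores for two independent groups; group_a has an extreme value
group_a = [310, 325, 340, 355, 980]
group_b = [400, 415, 430, 450, 470]

# The U test compares rank orderings, so the outlier 980 cannot dominate
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)
```

Because only ranks enter the calculation, replacing 980 with any value above 470 would leave the result unchanged, which is what makes the test robust to outliers.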
Wilcoxon Signed-Rank Test
- For two paired samples.
- Given that H0 is true, the expected mean signed rank is zero; the test assesses whether the observed mean signed rank deviates significantly from this expectation.
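A corresponding sketch for paired samples with scipy's Wilcoxon signed-rank test; the before/after scores are hypothetical illustration values:

```python
from scipy.stats import wilcoxon

# Hypothetical paired scores (e.g., before vs after an intervention)
before = [12, 15, 11, 18, 14, 16, 15, 17]
after  = [14, 18, 12, 21, 15, 19, 17, 20]

# Ranks the absolute pair differences, then checks whether the signed
# ranks deviate from the zero sum expected under H0
stat, p_value = wilcoxon(before, after)
print(stat, p_value)
```

Here every pair improves, so the signed ranks all point the same way and the test rejects H0 even with only eight pairs.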