Psychological Research Methods and Analysis - Statistical Power and Limitations of t-tests

Statistical Power

  • Statistical power is the likelihood that a test will detect an effect if there is a true effect in the population.
  • When an effect is real (H1 is true), the aim is to detect it in the study instead of committing a Type II error.

Key Concepts

  • Statistical Power: The probability of detecting a real effect in a study.
  • β = p(Type II error): Probability of failing to reject H0 when H1 is true.
  • Statistical power = 1 – β: the probability that the test correctly rejects the null hypothesis H0 when the research hypothesis H1 is true.
  • Even when a real effect exists, a study might not yield a significant result, leading to a Type II error.
  • Example simulation: with a true population effect of d = 1, 500 simulated studies with N = 9 each produced 430 significant results at an alpha level of 5%: 86% correctly detected the effect, while 14% were Type II errors.
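A simulation of this kind can be sketched with the standard library. This is a minimal sketch, assuming a one-sample design with population SD = 1 and a two-tailed test (the lecture's exact design is not specified, so the simulated power will only be in the rough vicinity of the figure above):

```python
import random
import statistics

random.seed(1)

def one_sample_t(sample, mu0=0.0):
    """t statistic for a one-sample t-test against mu0."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / n ** 0.5)

T_CRIT = 2.306          # two-tailed critical t for df = 8, alpha = .05
N_STUDIES, N, D = 500, 9, 1.0

# Each simulated study draws N scores from a population with true effect d = 1.
hits = sum(
    abs(one_sample_t([random.gauss(D, 1.0) for _ in range(N)])) > T_CRIT
    for _ in range(N_STUDIES)
)
print(f"{hits}/{N_STUDIES} studies significant")
```

Studies that fall below the critical value despite the real effect are the Type II errors in this simulation.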

Factors Affecting Statistical Power

  • \alpha-level (significance level)
  • Effect size
  • Sample size

Effect Size

  • A stronger treatment effect increases statistical power.
  • As effect size increases, statistical power increases.
  • Cohen’s d = \frac{M_1 - M_2}{SD}: measures the standardized magnitude of the difference between two means.
  • Larger effect sizes make it easier to detect significant results, thus increasing statistical power.
  • Ways to increase effect size in an experiment:
    • Use a strong manipulation (increases the difference in means).
    • Reduce within-group variability (decreases SD).
      • Use a homogeneous sample (limiting the population).
      • Standardize procedures and measure carefully.
  • Example: with the same mean difference, large within-group variability (e.g., a community sample) gives d = 1, while smaller variability (e.g., a student sample) gives d = 1.4.
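A quick illustration of how the same mean difference yields a larger d when the SD shrinks (the means and SDs here are hypothetical, chosen to reproduce d = 1 vs. d = 1.4):

```python
def cohens_d(m1, m2, sd):
    """Cohen's d: standardized mean difference."""
    return (m1 - m2) / sd

# Same 10-point mean difference, different within-group variability:
print(cohens_d(110, 100, 10.0))            # heterogeneous sample, SD = 10   -> 1.0
print(round(cohens_d(110, 100, 7.14), 2))  # homogeneous sample, smaller SD -> 1.4
```

Shrinking the SD by about 30% is enough to move a "large" effect (d = 1) to a very large one (d = 1.4).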

Homogeneous vs. Heterogeneous Samples

  • Homogeneous sample:
    • Limiting the population (e.g., one particular subgroup or people who share similar characteristics).
    • Reduces within-group variability.
    • Thereby increases d.
    • Increases power of experiments.
  • Heterogeneous sample:
    • Captures the widest range of perspectives possible (maximum variation sampling).
    • Reduces d in experiments.
    • Increases power in correlations – increases the observed correlation (remember r is an effect size!).
  • Within-subjects studies tend to have higher power than between-subjects studies because they reduce within-group variability.

Alpha Level

  • Statistical power increases as \alpha increases.
  • Increasing the alpha level increases the power of a study (e.g., from 0.05 to 0.10), making it more likely to find a true effect.
  • However, this increases the risk of a Type I error.
  • Lowering the significance level (e.g., from 0.05 to 0.01) decreases the likelihood of making a Type I error.
  • However, this reduces statistical power, and makes it more likely to commit a Type II error.
  • With a lower alpha level, the critical value for rejecting the null hypothesis becomes more stringent, requiring stronger evidence to reject it.
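The trade-off above can be sketched numerically. This is a rough normal approximation of power for a two-tailed one-sample test (the function name and the values d = 0.5, N = 30 are illustrative, not from the lecture):

```python
from statistics import NormalDist

def approx_power(d, n, alpha):
    """Normal-approximation power of a two-tailed one-sample test
    (ignores the negligible lower rejection region)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - d * n ** 0.5)

for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(approx_power(0.5, 30, alpha), 2))
```

Holding d and N fixed, power rises monotonically with alpha, which is exactly the Type I / Type II trade-off described above.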

Sample Size

  • Statistical power increases as sample size N increases.
  • Larger samples provide more information about the population, reducing the influence of random variability.
  • Small sample sizes lead to low power, increasing the likelihood of failing to detect true effects (Type II error).
  • A power analysis determines the minimum sample size required to detect a specified effect size with a desired level of power (usually set at 0.8, i.e., 80%).
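A back-of-the-envelope version of such a power analysis can be written with a normal approximation (a sketch only: it slightly underestimates the exact t-based sample size, which dedicated power software computes; the function name is mine):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-tailed independent-samples t-test,
    using the normal approximation n = 2 * ((z_crit + z_power) / d) ** 2."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(n_per_group(0.5))  # medium effect: 63 per group under this approximation
print(n_per_group(0.8))  # large effect:  25 per group under this approximation
```

Note how quickly the required N falls as the expected effect size grows, which is why the effect-size estimate matters so much when planning a study.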

Importance of Power

  • To plan studies: Determine the necessary N for sufficient power, given the estimated effect size.
  • To interpret non-significant results: with a small N, power is low, so the study is likely to miss even a medium-sized effect; a non-significant result is then uninformative.

Key Summary

  • Power increases as \alpha increases.
  • Power increases as N increases.
  • Power increases as effect size increases.
  • A non-significant result is meaningless when power is low.
  • A conclusion that there is no effect in the population is only sensible when power is high in a study.

Power Calculation Example

  • A study on a new teaching method shows:
    • Control: M = 2473, SD = 461, N = 22
    • Experimental: M = 3105, SD = 425, N = 22
    • Effect size: d = 1.4
  • Calculation helps determine the necessary sample size for replication to have adequate power.
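The effect size in this example can be checked directly. With equal group sizes, the pooled SD is simply the square root of the average of the two variances:

```python
from math import sqrt

m_control, sd_control = 2473, 461
m_experimental, sd_experimental = 3105, 425

# Equal n in both groups, so pooled SD = sqrt of the mean of the two variances.
sd_pooled = sqrt((sd_control**2 + sd_experimental**2) / 2)
d = (m_experimental - m_control) / sd_pooled
print(round(d, 1))  # 1.4
```

The mean difference of 632 points divided by a pooled SD of about 443 reproduces the reported d = 1.4.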

Limitations of t-tests and Alternatives

  • Non-parametric tests rely on fewer assumptions and are often based on ranked data.
  • They are used when CIs or t-tests don’t work well.

When t-tests Don't Work

  • T-tests are not suitable for all data; the goal is to identify when data are inappropriate for these tests and to learn about simple alternatives.
  • Alternatives will be covered for t-tests, but there are no easy alternatives for computing CIs.

Assumptions of t-tests

  • All statistical tests have ‘assumptions’; these are conditions or requirements that must be met by the data for statistical techniques to be valid.
  • If assumptions are not met and the analysis is run anyway, the results may not reflect the population, leading to distorted conclusions.

T-test Assumptions

  • Paired Samples t-test:
    • Measurement on interval or ratio scale
    • Populations have a normal distribution
  • Independent Samples t-test:
    • Measurement on interval or ratio scale
    • Populations have a normal distribution
    • Both samples are drawn from populations with equal variance

Consequences of Violating Assumptions

  • If assumptions are met: p(sig. result | H0 is true) = \alpha.
  • If assumptions are violated (e.g., unequal variance), the t-test may not be robust, and the p-value becomes misleading.

Guidelines for t-tests

  • For all t-tests:
    • Don’t use if data are not at least on an interval scale.
    • Strongly skewed data are problematic (especially with one-tailed testing).
  • Additionally, for independent samples t-test:
    • If variances are extremely unequal and sample sizes are also unequal, do not use the t-test (it is only problematic when both conditions are met).

Alternatives to t-tests

Data Transformation

  • When data are strongly skewed, they are sometimes transformed to make them easier to model.
  • A transformation is a rescaling of the data using a function.
  • One common transformation is the log transformation.
  • If you have numbers that are really spread out (like 1, 10, 100, 1000), taking the log of each number makes them closer together (log(1) = 0, log(10) = 1, log(100) = 2, log(1000) = 3).
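The example above in code, using base-10 logs:

```python
from math import log10

# Strongly spread-out values become evenly spaced after a log transformation.
data = [1, 10, 100, 1000]
print([log10(x) for x in data])  # [0.0, 1.0, 2.0, 3.0]
```

Multiplicative steps (each value 10x the last) become additive steps of 1, which pulls in the long right tail of a skewed distribution.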

Tests Based on Ranks (Non-Parametric Tests)

  • A non-parametric test doesn’t rely on specific assumptions about the distribution of data.
  • It relies not on the exact numerical values of the data, but on the ranks (order) of the data points.
Benefits
  • Do not assume normality.
  • Less susceptible to non-normality and extreme values (outliers).
Disadvantage
  • Lower power because some information is lost when values are converted to ranks.

Rank-Based Tests

  • For independent samples: the Mann-Whitney U test.
  • For paired samples: Wilcoxon signed-ranks test.
  • For both, H0 assumes that the medians of relevant populations are equal.
U-Test
  • For two independent samples.
  • Example: if H0 were true, there would be a 3.4% chance of observing a difference in mean ranks of 3.5 or more.
Wilcoxon Signed-Rank Test
  • For two paired samples.
  • Given that H0 is true, the expectation for the mean signed rank is zero; the test computes if the observed mean signed rank deviates significantly from this expectation.
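A stdlib sketch of the key ingredient of both rank-based tests: converting values to ranks (with midranks for ties) and comparing mean ranks between two independent groups, the quantity the U test is sensitive to. The data are made up; this is not a full test implementation (no p-value is computed):

```python
def ranks(values):
    """Rank values from 1..n; tied values share the average (mid) rank."""
    srt = sorted(values)
    return [srt.index(v) + srt.count(v) / 2 + 0.5 for v in values]

def mean_rank_difference(group_a, group_b):
    """Difference in mean ranks between two independent samples."""
    r = ranks(group_a + group_b)
    mean_a = sum(r[:len(group_a)]) / len(group_a)
    mean_b = sum(r[len(group_a):]) / len(group_b)
    return mean_a - mean_b

print(mean_rank_difference([1, 2, 3], [4, 5, 6]))    # -3.0
print(mean_rank_difference([1, 2, 3], [4, 5, 600]))  # still -3.0
```

Replacing 6 with the extreme value 600 leaves every rank, and hence the statistic, unchanged, which is exactly why rank-based tests are robust to outliers (at the cost of discarding the information in the raw values).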