Psychological Research Methods and Analysis - Statistical Power and Limitations of t-tests
Statistical Power
- Statistical power is the likelihood that a test will detect an effect if there is a true effect in the population.
- When an effect is real (H1 is true), the aim is to detect it in the study instead of committing a Type II error.
Key Concepts
- Statistical Power: The probability of detecting a real effect in a study.
- β = p(Type II error): Probability of failing to reject H0 when H1 is true.
- Statistical power = 1 – β: the probability that the test correctly rejects the null hypothesis H0 when the research hypothesis H1 is true.
- Even when a real effect exists, a study might not yield a significant result, leading to a Type II error.
- Example simulation: with a population effect of d = 1, a simulation of 500 studies with N = 9 yielded 430 significant results at an α-level of 5%. So 86% of studies correctly inferred the effect, and 14% committed Type II errors.
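The simulation described above can be sketched as follows. This is an illustrative reconstruction, assuming a one-sample t-test against zero with a two-tailed α of .05; the notes do not say which test or tail the original simulation used, so the estimated power may not land exactly on 86%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n, d, alpha = 500, 9, 1.0, 0.05

significant = 0
for _ in range(n_studies):
    # Draw N = 9 scores from a population with a true effect of d = 1
    sample = rng.normal(loc=d, scale=1.0, size=n)
    # One-sample t-test against 0 (two-tailed)
    if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
        significant += 1

power_estimate = significant / n_studies   # proportion of correct rejections
type_ii_rate = 1 - power_estimate          # proportion of Type II errors
print(power_estimate, type_ii_rate)
```

The two proportions always sum to 1: every simulated study in which H1 is true either detects the effect or commits a Type II error.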
Factors Affecting Statistical Power
- α-level
- Effect size
- Sample size
Effect Size
- A stronger treatment effect increases statistical power.
- As effect size increases, statistical power increases.
- Cohen's d = (M₁ − M₂) / SD: measures the magnitude of the difference between means in standard-deviation units.
- Larger effect sizes make it easier to detect significant results, thus increasing statistical power.
- Ways to increase effect size in an experiment:
- Use a strong manipulation (increases the difference in means).
- Reduce within-group variability (decreases SD).
- Use a homogenous sample (limiting the population).
- Standardize procedures and measure carefully.
- Example: large variability within groups (e.g., a community sample) gives d = 1.
- The same mean difference with less within-group variability (e.g., a student sample) gives d = 1.4.
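A short sketch of how a smaller SD raises Cohen's d for the same mean difference. The means and SDs here are hypothetical illustration values, not taken from the notes.

```python
def cohens_d(m1, m2, sd):
    """Cohen's d for two means sharing a common (pooled) standard deviation."""
    return (m1 - m2) / sd

# Hypothetical example: identical mean difference, different within-group variability
d_community = cohens_d(110, 100, 10.0)  # heterogeneous sample, larger SD
d_students  = cohens_d(110, 100, 7.0)   # homogeneous sample, smaller SD
print(d_community, d_students)
```

Shrinking the SD from 10 to 7 lifts d from 1.0 to about 1.43 without touching the means, which is exactly why reducing within-group variability increases power.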
Homogeneous vs Heterogeneous Samples
- Homogeneous sample:
- Limiting the population (e.g., one particular subgroup or people who share similar characteristics).
- Reduces within-group variability.
- Therefore increasing d.
- Increases power of experiments.
- Heterogeneous sample:
- Capturing widest range of perspectives possible – maximum variation sampling.
- Reduces d in experiments.
- Increases power in correlations – increases the observed correlation (remember r is an effect size!).
- Within-subjects studies tend to have higher power than between-subjects studies because they reduce within-group variability.
Alpha Level
- Statistical power increases as α increases.
- Increasing the alpha level increases the power of a study (e.g., from 0.05 to 0.10), making it more likely to find a true effect.
- However, this increases the risk of a Type I error.
- Lowering the significance level (e.g., from 0.05 to 0.01) decreases the likelihood of making a Type I error.
- However, this reduces statistical power, and makes it more likely to commit a Type II error.
- With a lower alpha level, the critical value for rejecting the null hypothesis becomes more stringent, requiring stronger evidence to reject it.
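The "more stringent critical value" point can be made concrete with scipy. The degrees of freedom below are a hypothetical example value.

```python
from scipy import stats

df = 42  # hypothetical degrees of freedom for illustration

# Two-tailed critical t-values: the cutoff grows as alpha shrinks
crit_05 = stats.t.ppf(1 - 0.05 / 2, df)  # alpha = .05
crit_01 = stats.t.ppf(1 - 0.01 / 2, df)  # alpha = .01
print(crit_05, crit_01)
```

With α = .01 the observed t must clear a noticeably higher bar than with α = .05, which is why lowering α trades Type I error protection for reduced power.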
Sample Size
- Statistical power increases as sample size N increases.
- Larger samples provide more information about the population, reducing the influence of random variability.
- Small sample sizes lead to low power, increasing the likelihood of failing to detect true effects (Type II error).
- A power analysis is needed to determine the minimum sample size required to detect a specified effect size with a desired level of power (usually set at 0.8 / 80%).
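A minimal sketch of such a power analysis for an independent-samples t-test, using the normal approximation (an exact noncentral-t calculation gives slightly larger Ns). The formula and defaults are standard, but this is an illustration, not the specific procedure the notes refer to.

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-tailed independent-samples t-test
    (normal approximation to the power function)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical z for two-tailed alpha
    z_beta = norm.ppf(power)           # z corresponding to desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.5))  # medium effect needs ~63 per group
print(n_per_group(0.8))  # large effect needs far fewer
```

Note how the required N falls rapidly as the expected effect size grows, mirroring the "power increases with effect size" point above.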
Importance of Power
- To plan studies: Determine the necessary N for sufficient power, given the estimated effect size.
- To interpret non-significant results: with a small N, power is low, so the study is likely to miss even a medium-sized effect; a non-significant result is then inconclusive.
Key Summary
- Power increases as α increases.
- Power increases as N increases.
- Power increases as effect size increases.
- A non-significant result is meaningless when power is low.
- A conclusion that there is no effect in the population is only sensible when power is high in a study.
Power Calculation Example
- A study on a new teaching method shows:
- Control: M = 2473, SD = 461, N = 22
- Experimental: M = 3105, SD = 425, N = 22
- Effect size: d = 1.4
- Calculation helps determine the necessary sample size for replication to have adequate power.
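The effect size in this example can be reproduced from the summary statistics, assuming the common pooled-SD version of Cohen's d (with equal group sizes, a simple average of the two variances):

```python
import math

# Summary statistics from the teaching-method example above
m_exp, sd_exp = 3105, 425  # experimental group, N = 22
m_ctl, sd_ctl = 2473, 461  # control group, N = 22

# Pooled SD: with equal Ns this is the root mean of the two variances
sd_pooled = math.sqrt((sd_exp**2 + sd_ctl**2) / 2)
d = (m_exp - m_ctl) / sd_pooled
print(round(d, 2))  # close to the reported d = 1.4
```

This d then feeds into a power analysis (as above) to find the N needed for an adequately powered replication.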
Limitations of t-tests and Alternatives
- Non-parametric tests are based on fewer assumptions and often based on ranking data.
- These are used when CIs or t-tests don’t work well.
When t-tests Don't Work
- t-tests might not be suitable for all data; the goal is to identify when data are inappropriate for these tests and to learn easy alternatives.
- Alternatives will be covered for t-tests, but there are no easy alternatives for computing CIs.
Assumptions of t-tests
- All statistical tests have ‘assumptions’; these are conditions or requirements that must be met by the data for statistical techniques to be valid.
- If assumptions are not met and the analysis is run, the results won’t reflect the population, leading to distortion.
T-test Assumptions
- Paired Samples t-test:
- Measurement on interval or ratio scale
- Populations have a normal distribution
- Independent Samples t-test:
- Measurement on interval or ratio scale
- Populations have a normal distribution
- Both samples are drawn from populations with equal variance
Consequences of Violating Assumptions
- If assumptions are met: p(sig. result | H0 is true) = α.
- If assumptions are violated (e.g., unequal variance), the t-test may not be robust, and the p-value becomes misleading.
Guidelines for t-tests
- For all t-tests:
- Don’t use if data are not at least on an interval scale.
- Strongly skewed data are problematic (especially with one-tailed testing).
- Additionally, for independent samples t-test:
- If variances are extremely unequal and sample sizes are also unequal, do not use the t-test (it only becomes problematic when both conditions hold at once).
Alternatives to t-tests
- When data are strongly skewed, they are sometimes transformed to make them easier to model.
- A transformation is a rescaling of the data using a function.
- One common transformation is the log transformation.
- If you have numbers that are really spread out (like 1, 10, 100, 1000), taking the log of each number makes them closer together (log(1) = 0, log(10) = 1, log(100) = 2, log(1000) = 3).
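The base-10 log example above, written out with numpy:

```python
import numpy as np

skewed = np.array([1, 10, 100, 1000], dtype=float)
logged = np.log10(skewed)  # base-10 log compresses the spread
print(logged)              # roughly [0, 1, 2, 3]
```

Each factor-of-10 jump in the raw data becomes a step of 1 on the log scale, which pulls a long right tail in toward the bulk of the distribution.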
Tests Based on Ranks (Non-Parametric Tests)
- A non-parametric test doesn’t rely on specific assumptions about the distribution of data.
- It doesn’t rely on exact numerical values of the data, but the ranks/orders of data points.
Benefits
- Do not assume normality.
- Less susceptible to non-normality and extreme values (outliers).
Disadvantage
- Lower power because some information is lost when values are converted to ranks.
Rank-Based Tests
- For independent samples: the U test (Mann–Whitney).
- For paired samples: Wilcoxon signed-ranks test.
- For both, H0 assumes that the medians of relevant populations are equal.
U-Test
- For two independent samples.
- Example result: if H0 were true, there would be only a 3.4% chance of observing a difference in mean ranks of 3.5 or more.
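A minimal U-test run with scipy, on hypothetical data (the group scores below are illustration values, not the example that produced the 3.4% figure):

```python
from scipy.stats import mannwhitneyu

# Hypothetical scores for two independent groups; group_a has an extreme value
group_a = [310, 325, 340, 355, 980]
group_b = [400, 415, 430, 450, 470]

# The U test compares rank orderings, so the outlier 980 cannot dominate
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)
```

Because only ranks enter the calculation, replacing 980 with any value above 470 would leave the result unchanged, which is what makes the test robust to outliers.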
Wilcoxon Signed-Rank Test
- For two paired samples.
- Given that H0 is true, the expected mean signed rank is zero; the test assesses whether the observed mean signed rank deviates significantly from this expectation.
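A corresponding sketch for paired samples with scipy's Wilcoxon signed-rank test; the before/after scores are hypothetical illustration values:

```python
from scipy.stats import wilcoxon

# Hypothetical paired scores (e.g., before vs after an intervention)
before = [12, 15, 11, 18, 14, 16, 15, 17]
after  = [14, 18, 12, 21, 15, 19, 17, 20]

# Ranks the absolute pair differences, then checks whether the signed
# ranks deviate from the zero sum expected under H0
stat, p_value = wilcoxon(before, after)
print(stat, p_value)
```

Here every pair improves, so the signed ranks all point the same way and the test rejects H0 even with only eight pairs.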