Null Hypothesis Testing and Statistical Inference

Null hypothesis tests are used to draw conclusions about a population based on sample data.
The choice of test depends on:
- Number of groups being assessed.
- Underlying distribution of the data.

Many common null hypothesis tests require data to be normally distributed or approximately normally distributed.
A normal distribution is a symmetric bell-shaped curve in which about 95% of observations fall within two standard deviations of the mean.
Some advanced tests also require homogeneity of variance, meaning the spread of scores is similar across the groups being compared.
If data is not normally distributed (e.g., skewed data), the reliability and validity of inferences can be impacted.
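
As a quick check on the bell-curve description above, the proportion of a normal distribution lying within a given number of standard deviations of the mean can be computed with Python's standard library. This is only a sketch; the helper name `coverage_within` is made up for illustration.

```python
from statistics import NormalDist

def coverage_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Roughly 68%, 95%, and 99.7% for 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k):.4f}")
```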

T-tests are commonly used for basic-level statistical analysis when comparing two groups.
Types of t-tests:
- Independent t-test:
  - Used to assess the difference in test scores between two independent groups defined by a categorical (non-quantitative) variable.
  - Example: comparing selected vs. non-selected athletes based on competition status.
- Dependent t-test (paired t-test):
  - Used to determine whether a difference exists between two dependent groups.
  - Each participant contributes an observation to each group or testing time point.
  - Example: assessing jump height in the same participants over two different testing sessions (T1 and T2).
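
The two t statistics described above can be sketched by hand as follows. The data are invented for illustration, the function names are hypothetical, and the independent version assumes equal variances (the pooled-variance form). Only the t statistic and degrees of freedom are computed here; a statistics package would normally also report the p-value.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic for two independent groups; returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

def paired_t(time_1, time_2):
    """Paired t statistic on within-participant differences (T2 - T1); returns (t, df)."""
    diffs = [after - before for before, after in zip(time_1, time_2)]
    n = len(diffs)
    t = mean(diffs) / sqrt(variance(diffs) / n)
    return t, n - 1

# Hypothetical jump heights (cm), invented for illustration only.
selected = [42.0, 45.5, 44.0, 47.5, 43.0]
non_selected = [39.5, 41.0, 43.5, 40.0, 38.0]
session_t1 = [40.0, 42.5, 38.0, 45.0, 41.5]
session_t2 = [41.5, 43.0, 39.5, 46.5, 41.0]

print("independent:", independent_t(selected, non_selected))
print("paired:", paired_t(session_t1, session_t2))
```

The paired version works on the differences within each participant, which is why it needs the two lists to be in the same participant order.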

When there are more than two groups or more than two testing time points, such as in longitudinal designs, F-tests (ANOVA) are used.
Types of ANOVA:
- One-way ANOVA:
  - Used for independent groups with one grouping factor.
  - Example: assessing differences in bench press 1RM between three competition levels in Rugby League.
- Factorial ANOVA:
  - Used when there is more than one grouping factor.
  - Example: grouping based on competition level and selection status within that level.
- One-way repeated measures ANOVA:
  - Used for multiple dependent measures, i.e., repeated measurements on the same participant across more than two occasions.
- Factorial (two-way) repeated measures ANOVA:
  - Used for two or more groups (e.g., selected vs. not selected) measured over multiple time points.
  - Example: comparing selected and non-selected athletes at the start, middle, and end of the season.
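
For the simplest case in the list above, the one-way ANOVA for independent groups, the F statistic is the ratio of between-group to within-group variability. A minimal sketch, assuming made-up bench press values and a hypothetical function name:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way independent-groups ANOVA; returns (F, df_between, df_within)."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical bench press 1RM values (kg) for three competition levels.
levels = [
    [100.0, 105.0, 110.0, 102.5],
    [112.5, 115.0, 120.0, 117.5],
    [125.0, 130.0, 122.5, 127.5],
]
print(one_way_anova_f(levels))
```

The repeated measures and factorial variants partition the variability further and are not covered by this sketch.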

F-tests (ANOVA) indicate if there is a difference between groups but do not specify where that difference lies.
A main effect for a factor (e.g., competition level) indicates a statistical difference exists somewhere among the groups.
Post hoc comparisons are used to determine exactly where the differences lie through pairwise comparisons.
Performing multiple comparisons increases the risk of a Type I error (rejecting the null hypothesis when it's true).
Corrections for Type I error:
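- Bonferroni correction: Divides the significance level (alpha) by the number of comparisons and tests each p-value against that stricter threshold.
- Holm (sequentially rejective Bonferroni): Orders the p-values from smallest to largest and tests each against a progressively less strict threshold, which reduces the conservatism of the standard Bonferroni correction.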
- Tukey test and Scheffé test.
- The researcher decides which correction is most appropriate for the research design.
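
To make the Type I error inflation and two of the corrections concrete, here is a small sketch of the family-wise error rate (assuming independent comparisons), the Bonferroni correction, and the Holm (sequentially rejective Bonferroni) procedure. The p-values are invented and the function names are illustrative.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null hypothesis whose p-value is at or below alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are also retained
    return reject

# Hypothetical p-values from four pairwise comparisons.
p_values = [0.012, 0.015, 0.020, 0.300]
print(familywise_error_rate(0.05, 6))   # six uncorrected comparisons
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

With these example values Bonferroni rejects only the first comparison, while Holm rejects the first three, which illustrates why Holm is described as less conservative.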

Non-parametric tests are used when data does not fit a normal distribution.
Alternatives to t-tests:
- Two-sample Mann-Whitney U test: For assessing two independent groups.
- Wilcoxon Signed Rank test: For paired or dependent data in two time points.
Alternative to one-way ANOVA: Kruskal-Wallis test (for more than two groups).
For designs with more than one factor (factorial designs):
- Aligned Rank Transform: Aligns the data for each effect and then ranks it, so that main and interaction effects can be assessed with a standard factorial ANOVA.
For one group in a longitudinal within-participant design: Friedman test.
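
These alternatives work on ranks rather than raw scores. As an illustration, the Mann-Whitney U statistic for two independent groups can be sketched as below, with tied values given their average rank. The function names and data are assumptions for the example, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of 1-based ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) computed from the rank sum of group_a in the pooled data."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a

# Hypothetical skewed scores for two independent groups.
print(mann_whitney_u([3.1, 4.0, 4.0, 9.8], [2.2, 2.9, 3.5, 4.0]))
```

A U value near zero or near the product of the two group sizes indicates that one group's scores sit almost entirely above or below the other's.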

Null hypothesis tests indicate whether there is a statistical difference between groups or time points.
They are somewhat binary; either there is or isn't a difference based on a statistical outcome.
Statistical outcomes don't always align with practically relevant outcomes.
Small changes in performance can have large effects on competition outcomes, which may not be detected statistically without very large data sets.
Hypothesis tests assess performance change or differences between groups on a group level, not at an individual athlete level.
They don't provide information about the magnitude of the difference.
Tests are heavily impacted by the distribution of the data. T-tests are sensitive to non-normally distributed data.
ANOVA models are somewhat robust to violations of distribution assumptions but are still impacted by large deviations.
The size of the sample impacts the accuracy of statistical outcomes.
Small samples may not be representative of the population, which limits the inferences that can be drawn from the tests and especially increases the risk of Type II errors (failing to detect a difference that really exists).
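
The link between sample size and Type II error can be illustrated with an approximate power calculation for a two-group comparison. This sketch rests on assumptions not stated in the notes: a two-sided test, a normal approximation to the test statistic, equal group sizes, and a standardized mean difference (`effect_size`) chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Type II error rate = 1 - power, shown for an assumed moderate difference.
for n in (10, 20, 64):
    power = approx_power(0.5, n)
    print(f"n per group = {n:>2}: power = {power:.2f}, Type II risk = {1 - power:.2f}")
```

Under these assumptions, power is about 0.2 with 10 participants per group and about 0.8 with 64, so the small-sample study would miss a real difference most of the time.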