Lecture Notes on Pearson Statistics, Null Hypothesis Testing, and Bayesian Approach

Pearson Statistics and Null Hypothesis Significance Testing

The Pearson style of statistics and null hypothesis significance testing are built on the premise of relative frequency of events.
When testing a null hypothesis against an alternate hypothesis, we consider the relative frequency of events.
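Even if the null hypothesis is true, a sample statistic can still fall in the rejection region, because the sampling distribution has tails and extreme values occur by chance.
This leads to false positives (Type I errors), where a true null hypothesis is incorrectly rejected.
A p-value less than 0.05 means that, if the null hypothesis were true, a result at least this extreme would occur less than 5% of the time; rejecting at this threshold therefore limits the long-run false positive rate to 5%.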

Empirical vs. Subjective Probabilities

Empirical probabilities arise from repeated observations and data collection.
Subjective probabilities consider personal uncertainties and knowledge about events.
Example: A subjectivist would estimate the probability of a die landing on 1 or 2 as one-third based on the physics and past experience.
Subjective probabilities can apply to non-repeatable events, such as the probability of a prime minister running for election again.
These probabilities are influenced by personal knowledge and circumstances.
Example: Someone working in the prime minister's office might have a better estimate than a random person.

Classical Approaches to Probability

Classical approaches use the tail ends of probability distributions to make decisions.
Example: Determining if a population is evenly divided between males and females.
If the probability of choosing a female is 0.5, a sample of 10 people can be analyzed.
The probability of a sample containing zero females and ten males is 1 in 1024.
The probability of one female and nine males is 10 out of 1024.
In a two-tailed distribution, the null hypothesis might be rejected if a sample of 10 has zero or one females, or nine or ten females.
Problematic issues:
- A sample of 10 with nine females might lead to the incorrect conclusion that the population isn't balanced.
- The significance level includes probabilities of events that did not occur.
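The arithmetic in this example can be checked directly. The sketch below is a minimal illustration using only the Python standard library; the helper name `binom_pmf` is an assumption made for this example. It computes the exact binomial probabilities under the null hypothesis (probability of a female = 0.5) and the total probability of the two-tailed rejection region, which is the false positive rate when the population really is balanced.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k: int, n: int, p: Fraction) -> Fraction:
    """Exact probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p_female = 10, Fraction(1, 2)
pmf = [binom_pmf(k, n, p_female) for k in range(n + 1)]

# Rejection region from the notes: 0, 1, 9 or 10 females in the sample.
rejection_region = [0, 1, 9, 10]
false_positive_rate = sum(pmf[k] for k in rejection_region)

print(pmf[0])  # 1/1024
print(pmf[1])  # 5/512, i.e. 10/1024
print(false_positive_rate, float(false_positive_rate))  # 11/512 0.021484375
```

With this rejection region, a balanced population would be wrongly declared unbalanced in about 2% of samples of 10.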

Addressing Problems with Data Collection

Collecting more data can mitigate these issues, because sample proportions settle toward their long-run value as the sample grows.
Classical inference predicts long-run probabilities before data collection.
If an experiment were run 100 times, about 95 of the resulting 95% confidence intervals would be expected to contain the true value. However, for any one sample, we do not know whether the true value is within the interval. This interpretation rests on treating probabilities as long-run frequencies.
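The long-run reading of a confidence interval can be illustrated by simulation. The sketch below is not from the lecture: it assumes normally distributed scores with a known standard deviation, and the true mean (100), standard deviation (15), and sample size (25) are arbitrary values chosen for illustration. It repeats the experiment many times and counts how often the 95% interval contains the true mean.

```python
import random
import statistics


def ci_covers(true_mean, sigma, n, rng, z=1.96):
    """Draw one sample and report whether its 95% CI contains true_mean."""
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    mean = statistics.fmean(sample)
    half_width = z * sigma / n**0.5  # known-sigma (z) interval
    return mean - half_width <= true_mean <= mean + half_width


rng = random.Random(0)  # fixed seed so the run is repeatable
runs = 10_000
coverage = sum(ci_covers(100.0, 15.0, 25, rng) for _ in range(runs)) / runs
print(round(coverage, 3))  # close to 0.95 in the long run
```

Any single interval either contains the true mean or it does not; the 95% describes the procedure across repetitions, not one interval.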
Challenges arise when individuals belong to multiple collectives with different probabilities for an event.
Example: Estimating the probability of someone having black hair in Australia is more complicated if that person has Celtic or South Asian ancestry.
Base probabilities are less useful unless subgroups are narrowly defined.
The null hypothesis approach specifies which outcomes are probable enough under the null hypothesis for it to be retained.

Bayesian Approach to Probability

The Bayesian approach considers probabilities given a specific situation.
Example: What is the probability of having black hair given Southeast Asian descent?
It incorporates factors that update likelihoods, improving accuracy.
It addresses the probability of a theory being true given the available data.
Notation: $P(\text{BlackHair} \mid \text{SoutheastAsian})$, where $\mid$ means "given".
The frequentist approach assesses the probability of getting the data given a theory, $P(\text{data} \mid \text{theory})$; it cannot give the probability that the null hypothesis is true given the data.
With a $p$-value greater than 0.05, no inference that the null hypothesis is true can be made; the result only shows that the data were not extreme enough to reject it.

Examples of Bayesian vs. Frequentist Thinking

Example 1: If someone's head is bitten off by a shark, the probability of being dead is 1. However, being dead does not necessarily imply the head was bitten off by a shark.
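The asymmetry in Example 1 can be made concrete with counts. The numbers below are hypothetical and invented purely for illustration; only the direction of the comparison matters.

```python
from fractions import Fraction

# Hypothetical yearly counts, invented purely for illustration.
deaths_total = 60_000_000   # all deaths from any cause
heads_bitten_off = 2        # people whose head was bitten off by a shark
dead_and_bitten = 2         # every one of them is dead

# P(dead | head bitten off): condition on the shark event.
p_dead_given_bitten = Fraction(dead_and_bitten, heads_bitten_off)

# P(head bitten off | dead): condition on being dead.
p_bitten_given_dead = Fraction(dead_and_bitten, deaths_total)

print(p_dead_given_bitten)          # 1
print(float(p_bitten_given_dead))   # a very small number
```

The two conditional probabilities share a numerator but have very different denominators, which is why the probability of the data given a theory cannot be read as the probability of the theory given the data.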
Example 2: A weighted coin lands on heads 60% of the time. If Joanna flips it five times and gets three heads, a significance test would not reject the null hypothesis, and she might incorrectly conclude the coin is fair.
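The coin example can be worked through numerically. The sketch below uses an exact two-sided binomial test, which is an assumption here about a suitable test for five coin flips; the helper name `pmf` is also made up for this example. It compares the p-value under the fair-coin hypothesis with how well each hypothesis predicts the observed data.

```python
from fractions import Fraction
from math import comb


def pmf(k, n, theta):
    """Exact binomial probability of k heads in n flips."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


flips, heads = 5, 3
fair, weighted = Fraction(1, 2), Fraction(3, 5)

# Two-sided exact p-value under the fair-coin null: total probability
# of every outcome that is no more probable than the observed one.
observed = pmf(heads, flips, fair)
p_value = sum(
    pmf(k, flips, fair)
    for k in range(flips + 1)
    if pmf(k, flips, fair) <= observed
)

# How much better the weighted-coin hypothesis predicts the same data.
likelihood_ratio = pmf(heads, flips, weighted) / pmf(heads, flips, fair)

print(float(p_value))           # 1.0
print(float(likelihood_ratio))  # 1.10592
```

The test gives no grounds to reject a fair coin, yet the data are slightly better predicted by the 60% coin. Five flips are simply too few to tell the two apart.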
The Bayesian approach uses data to adjust theories rather than supporting or refuting them.
It replaces significance, $p$-values, and power with a weight of evidence for updating expectations.
It can address both the probability of a hypothesis being true before seeing the data (prior probability) and the probability of the hypothesis given the data (posterior probability).
The posterior probability is proportional to the likelihood times the prior: $P(H \mid D) \propto P(D \mid H) \, P(H)$.
In the frequentist approach, the $p$-value and the conclusion drawn from it depend on the decision procedure (for example the stopping rule, the number of tests, and the selected significance threshold), which is not the case in the Bayesian approach.
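A small worked update shows how a prior becomes a posterior. The sketch below assumes only two candidate hypotheses about a coin (heads probability 0.5 or 0.6) and equal prior probabilities for them; both choices are illustrative assumptions, not part of the lecture. The data are three heads in five flips.

```python
from math import comb


def likelihood(heads, flips, theta):
    """P(data | theta): binomial probability of the observed flips."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)


# Assumed hypotheses and priors (equal prior belief in each).
thetas = {"fair": 0.5, "weighted": 0.6}
priors = {"fair": 0.5, "weighted": 0.5}
heads, flips = 3, 5

# Posterior is proportional to likelihood times prior ...
unnormalised = {h: likelihood(heads, flips, thetas[h]) * priors[h] for h in priors}
# ... and dividing by the total makes the posteriors sum to 1.
evidence = sum(unnormalised.values())
posterior = {h: value / evidence for h, value in unnormalised.items()}

for name, prob in posterior.items():
    print(name, round(prob, 3))
```

The posterior for the weighted coin moves only slightly above its prior of 0.5, reflecting how little five flips can say.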

Bayes Factor

The Bayes factor replaces the $p$-value by comparing two separate theories.
Start with the prior odds of the two theories: the probability of theory one divided by the probability of theory two.
Bayes factor: The ratio of the likelihood of the data under one theory to the likelihood under the other. Multiplying the prior odds by the Bayes factor gives the posterior odds.
- BF > 1: Supports the experimental hypothesis over the null hypothesis.
- BF < 1: Supports the null hypothesis over the experimental hypothesis.
- BF ≈ 1: The experiment provides no new information.
A Bayes factor of 3 or more is often treated as a threshold for substantial evidence, roughly comparable to the conventional $p$-value of 0.05.
Bayes factors range from 0 to infinity, with values further from 1 indicating more sensitive evidence.
As more data support the alternative hypothesis, the Bayes factor increases, whereas the $p$-value decreases.
In the null hypothesis significance testing approach, a null result means absence of evidence, not evidence of absence. The Bayes factor approach can indicate evidence of absence.
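The opposite movement of the Bayes factor and the $p$-value can be seen in a simple case. The sketch below is a minimal illustration under stated assumptions: a coin observed to land heads 60% of the time at every sample size, a point alternative of 0.6 against a null of 0.5, and prior odds of 1. Real analyses usually place a distribution over the alternative rather than a single point.

```python
from math import comb


def bayes_factor(heads, flips, theta_h1=0.6, theta_h0=0.5):
    """Likelihood of the data under H1 divided by likelihood under H0."""
    like_h1 = theta_h1**heads * (1 - theta_h1) ** (flips - heads)
    like_h0 = theta_h0**heads * (1 - theta_h0) ** (flips - heads)
    return like_h1 / like_h0


def one_sided_p(heads, flips, theta_h0=0.5):
    """P(at least this many heads) if the null hypothesis is true."""
    return sum(
        comb(flips, k) * theta_h0**k * (1 - theta_h0) ** (flips - k)
        for k in range(heads, flips + 1)
    )


sample_sizes = [10, 50, 100, 200]
prior_odds = 1.0  # assumed: both theories equally plausible beforehand
bayes_factors, p_values = [], []
for n in sample_sizes:
    heads = 3 * n // 5  # exactly 60% heads at each sample size
    bf = bayes_factor(heads, n)
    bayes_factors.append(bf)
    p_values.append(one_sided_p(heads, n))
    posterior_odds = bf * prior_odds
    print(n, round(bf, 2), round(posterior_odds, 2), round(p_values[-1], 4))
```

As the same 60% pattern accumulates, the Bayes factor grows and the $p$-value shrinks.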

Problems with the Frequentist Approach

Scenario: A study with 20 participants yields a non-significant result ($p = 0.08$). The researcher adds 20 more participants and gets $p = 0.01$.
- Should the researcher submit the study with 40 participants or stick to the original plan?
- A third option: Use a method that is not sensitive to when data collection stops.
P-hacking (for example, testing repeatedly and stopping once the result is significant) increases the chance of a Type I error.
The Bayesian approach does not depend on a fixed stopping rule and does not require alpha correction.
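The cost of the "test, then add participants and test again" strategy can be estimated by simulation. The sketch below is an illustration under simplifying assumptions that are not in the lecture: scores are normal with a known standard deviation of 1, the null hypothesis (mean 0) is true, and a two-sided z-test is used at each look.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, with known sigma."""
    z = fmean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_study(rng, first_n=20, extra_n=20, alpha=0.05):
    """Return True if the study ever reaches significance (a false positive)."""
    data = [rng.gauss(0.0, 1.0) for _ in range(first_n)]
    if two_sided_p(data) < alpha:
        return True  # significant at the first look: stop and report
    data += [rng.gauss(0.0, 1.0) for _ in range(extra_n)]
    return two_sided_p(data) < alpha  # second look with 40 participants


rng = random.Random(1)  # fixed seed so the run is repeatable
runs = 20_000
false_positive_rate = sum(run_study(rng) for _ in range(runs)) / runs
print(round(false_positive_rate, 3))  # noticeably above the nominal 0.05
```

Each look is a fresh chance to cross the threshold, so two looks at alpha = 0.05 give an overall Type I error rate above 5%.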

Planned vs. Post Hoc Comparisons

Scenario: A three-way design yields an unexpected partial two-way interaction significant at $p = 0.03$ for males but not females.
How should these results be accounted for?
- Write an introduction to fit the data.
- Treat the results as non-significant if they didn't fit the initial theory.
- Determine how strong the evidence is before moving on.
Frequentist approach: A priori predictions are treated differently than post hoc tests.
Bayesian approach: It does not matter whether a comparison is a priori or post hoc.
- The likelihood is unaffected by when the hypothesis was formulated relative to data collection.
Multiple tests in the frequentist approach: If running 100 correlations with all null hypotheses true, about 5 are expected to be significant by chance at the 0.05 level.
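The multiple-testing arithmetic is short enough to write out. The sketch below assumes the 100 tests are independent and that every null hypothesis is true. The Bonferroni line shows one common form of alpha correction; it is included as an example and is not named in the lecture.

```python
alpha = 0.05
tests = 100

# Expected number of significant results when every null is true.
expected_false_positives = alpha * tests

# Probability that at least one of the tests is significant by chance.
p_at_least_one = 1 - (1 - alpha) ** tests

# One common alpha correction (Bonferroni): divide alpha by the number of tests.
bonferroni_alpha = alpha / tests

print(expected_false_positives)   # about 5
print(round(p_at_least_one, 4))   # 0.9941
print(bonferroni_alpha)
```

With 100 uncorrected tests, at least one chance "finding" is almost certain.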

Bayesian Approach and Rationality

It is based on two key definitions of rationality:
- Sufficient justification for beliefs.
- Subjecting beliefs to critical scrutiny.
It fits in with the idea of critical rationalism.
- Statistical inference can't tell you how confident to be in a given hypothesis.

Effect Size in Bayesian Analysis

We need to specify the sort of effect sizes the theory predicts in order to calculate our Bayes factor.
It distinguishes between evidence that there is no relevant effect and no evidence of a relevant effect.

Specifying Effect Sizes

We need to specify effect sizes on theoretical grounds.
If you don't have data from the field, you can use a preset value of d = 0.5.
We need to consider what range of effect sizes to expect and how they are distributed.
- Is it a uniform distribution, a normal distribution, etc.?
Scales impose limits; consider a rating scale from zero to five.
- The difference between conditions cannot exceed five.
When hypothesizing that one group scores higher than another, the prior expectation is that the difference is between 0 and 5, not -5 to 5.
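The effect of the prior range on the Bayes factor can be sketched for the rating-scale case. The code below is a minimal illustration, not the lecture's procedure: it assumes the observed mean difference is normally distributed around the true difference with a known standard error, uses a uniform prior over the predicted range, and takes an observed difference of 1.0 with standard error 0.3 as made-up example values. The function name `bayes_factor_uniform` is hypothetical.

```python
from statistics import NormalDist


def bayes_factor_uniform(obs_diff, se, lower, upper):
    """Bayes factor for H1 (difference uniform on [lower, upper]) vs H0 (difference 0)."""
    # Likelihood of the observed difference if the true difference is 0.
    likelihood_h0 = NormalDist(0.0, se).pdf(obs_diff)
    # Likelihood under H1: the normal likelihood averaged over the uniform
    # prior, which reduces to a difference of two normal CDF values.
    std = NormalDist()
    area = std.cdf((upper - obs_diff) / se) - std.cdf((lower - obs_diff) / se)
    likelihood_h1 = area / (upper - lower)
    return likelihood_h1 / likelihood_h0


# Made-up data: observed difference of 1.0 scale points, standard error 0.3.
directional = bayes_factor_uniform(1.0, 0.3, 0.0, 5.0)       # predicts 0 to 5
non_directional = bayes_factor_uniform(1.0, 0.3, -5.0, 5.0)  # predicts -5 to 5
# Made-up null-like data: observed difference of 0.0 with the same standard error.
null_like = bayes_factor_uniform(0.0, 0.3, 0.0, 5.0)

print(round(directional, 1), round(non_directional, 1), round(null_like, 3))
```

Under these assumptions the directional prior yields the larger Bayes factor when the data fall in the predicted direction, because it does not spread prior belief over differences the theory rules out. An observed difference near zero gives a Bayes factor well below 1, which counts as evidence for the null.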

Weaknesses of the Bayesian Approach

Calculating the Bayes factor depends on making theory predictions before collecting any data.
Making predictions requires explicit consideration of the theory, including effect sizes.
These predictions are subjective probabilities based on the theory.
There is pressure to create prior expectations that will survive scrutiny from others.
Some people address this by reporting likelihoods across the full range of possible values, so that readers can apply their own priors.
Bayesian procedures are not concerned with long-run frequencies in the way the NHST approach is.
The Bayesian approach does not guarantee that we will avoid Type I or Type II errors.
Frequentist statistics keep the same Type I error rate no matter how many participants you have. With a Bayes factor, the more participants included, the smaller the probability of weak or misleading evidence.
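The contrast with sample size can be illustrated with an exact calculation in a simple model. The sketch below rests on assumptions that are not in the lecture: scores are normal with standard deviation 1, the null is a mean of 0, the alternative is a point value of 0.5, and "misleading evidence" means a Bayes factor above 3 in favor of the alternative when the null is actually true.

```python
from math import log, sqrt
from statistics import NormalDist


def p_misleading(n, delta=0.5, threshold=3.0):
    """P(BF > threshold in favor of H1) when H0 is actually true.

    Assumed model: scores ~ Normal(mu, 1); H0: mu = 0, H1: mu = delta.
    Then BF = exp(n * (delta * mean - delta**2 / 2)), so BF > threshold
    exactly when the sample mean exceeds delta/2 + log(threshold)/(n*delta).
    """
    cutoff = delta / 2 + log(threshold) / (n * delta)
    # Under H0 the sample mean is Normal(0, 1/sqrt(n)).
    return 1 - NormalDist(0.0, 1 / sqrt(n)).cdf(cutoff)


alpha = 0.05  # Type I error rate of a significance test: fixed by design
sample_sizes = (10, 40, 160)
misleading = [p_misleading(n) for n in sample_sizes]
for n, prob in zip(sample_sizes, misleading):
    print(n, alpha, round(prob, 4))
```

For these sample sizes the probability of misleading evidence falls as participants are added, while the significance test's Type I error rate stays at alpha.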