Week 1: Introduction, Data & Experiments - Overview of the course and basic experimental principles.
Week 2: Distributions, Samples & Populations - Understanding how data is distributed and the basics of sampling.
Week 3: Testing Hypotheses: One-Sample t-test - Introduction to hypothesis testing using a one-sample t-test.
Week 4: Testing Hypotheses: Independent Sample t-test - Learning to test hypotheses with independent samples.
Week 5: Statistical Inference: p-values and effect sizes - Understanding the importance of p-values and effect sizes in statistical inference.
Week 6: Consolidation week - A week to catch up on material and consolidate knowledge.
Week 7: Non-Parametric alternative tests - Exploring non-parametric tests as alternatives to parametric tests.
Week 8: Comparing multiple means - Techniques for comparing multiple means.
Week 9: Qualitative Methods - Introduction to qualitative research methods.
Week 10: Advanced Thematic Analysis - Advanced techniques in thematic analysis.
Week 11: Revision & Open Science - Review of key concepts and introduction to open science practices.
t-values: Evidence against the null hypothesis, indicating the strength of statistical evidence. Larger t-values suggest stronger evidence against the null hypothesis.
Degrees of freedom: Number of values that can vary in the dataset, influencing the shape of the t-distribution and the statistical power of the test. Degrees of freedom are crucial for determining the appropriate critical value for hypothesis testing.
p-values: Probability of observing results if the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis, typically compared against a significance level (e.g., 0.05) to make a decision.
Effect sizes: Absolute magnitude of the difference, quantifying the practical significance of the results. Effect sizes provide a measure of the real-world importance of the findings beyond statistical significance.
Reporting t-tests: Guidelines on how to report t-tests, including all necessary information for reproducibility and transparency. Reporting should include t-value, degrees of freedom, p-value, effect size, and confidence intervals.
Shapiro-Wilk test: Used to test for normal distribution, assessing whether the data follow the normality assumption required for parametric tests. This test is essential for validating the assumptions of t-tests and other parametric statistical methods.
Levene’s test: Used to test for homogeneity of variance, ensuring that different groups have equal variances, an assumption for independent samples t-tests. Violation of this assumption may require the use of Welch’s t-test or non-parametric alternatives.
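Both assumption checks can be run outside Jamovi in a few lines of Python; a minimal sketch using scipy.stats, with simulated data purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=30)  # simulated scores, group A
group_b = rng.normal(loc=55, scale=10, size=30)  # simulated scores, group B

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
w, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_norm:.3f}")

# Levene's test: the null hypothesis is that the groups have equal variances
stat, p_var = stats.levene(group_a, group_b)
print(f"Levene: W = {stat:.3f}, p = {p_var:.3f}")

# In both cases a small p-value (e.g. < .05) signals the assumption may be violated
```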
| Data to be Analyzed | Test Choices |
|---|---|
| Compare one sample to a reference value | One-sample t-test |
| Compare two samples | Independent samples: Student's t-test, Welch's t-test; dependent samples: paired t-test |
If the assumption of normality is violated:
- One-sample test -> Wilcoxon signed-rank test. This non-parametric test is used when the data do not meet the normality assumption required for the one-sample t-test.
- Independent samples test -> Mann-Whitney U test. The Mann-Whitney U test is used as a non-parametric alternative when the assumption of normality is violated for independent samples.
Consider Welch’s t-test if groups do not have comparable variance. Welch’s t-test adjusts for unequal variances between groups, providing a more robust analysis in such cases.
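The decision tree above maps directly onto functions in scipy.stats; a sketch with simulated data (all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(82, 5, 25)    # e.g. attendance percentages
group1 = rng.normal(50, 10, 30)
group2 = rng.normal(55, 15, 30)   # deliberately wider spread than group1

# One sample vs. a reference value of 80
print(stats.ttest_1samp(sample, popmean=80))   # parametric
print(stats.wilcoxon(sample - 80))             # Wilcoxon signed-rank alternative

# Two independent samples
print(stats.ttest_ind(group1, group2))                   # Student's t-test
print(stats.ttest_ind(group1, group2, equal_var=False))  # Welch's t-test
print(stats.mannwhitneyu(group1, group2))                # non-parametric alternative
```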
Testing Hypotheses: Focus on tests of difference rather than relationships. This involves determining whether observed differences between groups are statistically significant.
Using data to answer questions. Applying statistical tests to real-world data to derive meaningful conclusions.
Running t-tests: Understanding how a one-sample t-test works. Gaining proficiency in performing and interpreting one-sample t-tests.
Practical Examples: The AI hyperrealism example. Demonstrating the application of t-tests in the context of AI-generated faces.
Running t-tests in Jamovi. Step-by-step guide on how to conduct t-tests using Jamovi software.
Assumptions of t-tests. Reviewing the key assumptions that must be met to ensure the validity of t-tests.
The one-sample t-statistic relates the difference between the sample mean and the comparison value to the standard error of the mean:

t_{\text{df}} = \frac{\bar{x} - \mu}{SEM}
The t-value grows as the difference between the observed data mean and comparison value gets bigger (numerator increases). This indicates a stronger effect or difference.
The t-value shrinks as the variance of the observed data gets bigger (denominator increases). Higher variance reduces the likelihood of finding a statistically significant result.
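Both effects can be verified numerically by computing t directly from its definition; a sketch with simulated data:

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu):
    """t = (sample mean - comparison value) / SEM, with SEM = s / sqrt(n)."""
    x = np.asarray(x)
    sem = x.std(ddof=1) / np.sqrt(len(x))
    return (x.mean() - mu) / sem

rng = np.random.default_rng(1)
low_var = rng.normal(55, 5, 20)    # mean above 50, tight spread
high_var = rng.normal(55, 20, 20)  # similar mean, four times the spread

print(one_sample_t(low_var, 50))   # larger t: small SEM in the denominator
print(one_sample_t(high_var, 50))  # smaller t: variance inflates the SEM

# Sanity check against scipy's implementation
print(stats.ttest_1samp(low_var, popmean=50).statistic)
```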
If each person in one group contributes two data points and we want to test whether those points differ, we can reuse the one-sample t-test. Paired samples are used when the data points are related (e.g., pre- and post-test scores).
We run the one-sample t-test comparing the pairwise differences to zero. This approach simplifies the analysis by focusing on the change within each subject.
Condition 1 vs. Condition 2 becomes (Condition 1 - Condition 2) vs. a reference value of zero. Comparing two related conditions reduces to testing their pairwise differences against a zero baseline.
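The equivalence is exact and easy to confirm in code: a paired-samples t-test gives the same result as a one-sample t-test on the pairwise differences against zero. A sketch with simulated pre/post scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pre = rng.normal(60, 8, 25)        # scores before an intervention
post = pre + rng.normal(3, 4, 25)  # each person shifts by roughly +3

# Paired-samples t-test
print(stats.ttest_rel(post, pre))

# Identical result: one-sample t-test on the differences, compared to zero
print(stats.ttest_1samp(post - pre, popmean=0))
```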
When we compute a t-test, we account for uncertainty in our estimate of the mean by using the standard error of the mean. The standard error of the mean (SEM) reflects the variability of sample means around the true population mean.
We also have to estimate the standard deviation from the data. Estimating standard deviation is crucial for understanding the spread of the data.
The estimate of the standard deviation is even noisier than the estimate of the mean. The standard deviation estimate is more susceptible to sampling variability, especially in small samples.
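Concretely, the SEM combines the estimated standard deviation with the sample size:

SEM = \frac{s}{\sqrt{n}}

where s is the standard deviation estimated from the data and n is the number of observations, so any noise in s feeds directly into the SEM.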
For a long time, analyses were restricted to very large samples where we can be confident in the standard deviation. Large samples provide more reliable estimates of population parameters.
William Sealy Gosset provided the answer, publishing under the pseudonym 'Student' (hence 'Student's t-test'). Gosset developed the t-test to address the challenges of small sample sizes.
This was required because the Guinness brewery couldn't create large samples when experimenting with different crops or recipes. The t-test allowed for accurate analysis even with limited data.
The sampling distribution of t-values (when the null is true) is not quite ‘normally’ distributed – it is ‘t’ distributed. The t-distribution accounts for the increased uncertainty in small samples.
This accounts for additional uncertainty in smaller samples with its wider tails. The wider tails of the t-distribution result in more conservative p-values.
In a small sample, a t-value of 2.5 would be larger than 99% of values in a standard normal distribution but only larger than 95% of values in a t-distribution. This highlights the difference between using a t-distribution versus a normal distribution for small samples.
This more conservative comparison accounts for the additional uncertainty in having to estimate the standard deviation from the data. Using the t-distribution ensures more accurate statistical inference with small samples.
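The difference is easy to see by comparing cumulative probabilities under the two distributions; a sketch using scipy (df = 5 is chosen purely for illustration):

```python
from scipy import stats

t_value = 2.5
print(stats.norm.cdf(t_value))     # ~0.994: beats ~99% of a standard normal
print(stats.t.cdf(t_value, df=5))  # ~0.973: a smaller share of a t-distribution

# The same t-value is less 'surprising' under the t-distribution's wider tails,
# so small samples get larger (more conservative) p-values.
```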
Apophenia is the tendency to see meaningful connections between unrelated things. This can lead to misinterpretation of data and spurious findings.
Pareidolia is the perception of images like faces in random stimuli. This psychological phenomenon can bias how people interpret patterns.
Just because something looks like it shows a pattern, it doesn't necessarily mean that it is a universal effect. It could be random chance, a bias, or something that can't yet be explained. Observed patterns may not always reflect genuine effects.
We need some way of protecting against false discoveries. Statistical methods help mitigate the risk of false positives.
We must remember that we’re taking random samples from a wider population – there is uncertainty in how well a single sample represents the whole. Sampling variability can lead to differences between the sample and the overall population.
These relate the mean of a sample to a pre-specified comparison value:
- Attendance in class is more than 80%. Comparing sample attendance to a benchmark.
- Medical doctors' stress levels are higher than the average stress in the UK. Assessing stress levels in a specific group against a national average.
- People will remember word pairings better than chance level when the pairs are situated in a story. Testing memory performance against a baseline.
A hypothesis test compares the null hypothesis of no effect with the alternative hypothesis. This involves formally stating the expectations being tested.
We assume the null is true until proven otherwise. The null hypothesis represents the absence of an effect.
This is proof by contradiction. We start with the null and put the burden of proof on the alternative hypothesis. The alternative hypothesis requires sufficient evidence to reject the null.
This is a clear and fair starting point for research. Starting with the null provides a neutral foundation for investigation.
We know everything about our statistics when there is no effect. Statistical distributions are well-defined under the null hypothesis.
We can write down the exact distribution of t-values for every experiment if the data are normally distributed and random. The t-distribution enables precise calculations under the null hypothesis.
But the same distribution when there is an effect is unknown. When the null hypothesis is false, the distribution is more complex.
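This can be demonstrated by simulation: drawing many samples in which the null is true and computing a t-value for each reproduces the theoretical t-distribution. A sketch (sample size and repetition count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, mu, n_sims = 10, 50, 100_000

# Simulate experiments where the null is true (the population mean really is 50)
samples = rng.normal(mu, 10, size=(n_sims, n))
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_values = (samples.mean(axis=1) - mu) / sems

# The empirical 97.5th percentile matches the theoretical critical value (df = n - 1)
print(np.percentile(t_values, 97.5))  # ~2.26
print(stats.t.ppf(0.975, df=n - 1))   # 2.262...
```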
- Attendance in class is NO DIFFERENT FROM 80%. This null hypothesis posits no difference in attendance compared to the specified value.
- Medical doctors' stress levels are NO DIFFERENT FROM the average stress in the UK. The null hypothesis suggests no difference in stress levels between doctors and the general population.
- People will remember word pairings AT chance level when the pairs are situated in a story. The null hypothesis assumes memory performance is at chance.
…and any observed differences will be due to random chance. Any deviations from the null are attributed to random variability.
| Decision about null hypothesis (H0) | H0 is true | H0 is false |
|---|---|---|
| Not reject | Correct inference (true negative) | Type II error (false negative) |
| Reject | Type I error (false positive) | Correct inference (true positive) |
If we have a specific direction in mind for our hypothesis – we can specify a one-tailed test. A one-tailed test examines whether the data mean is significantly greater or less than the comparison value, but not both.
This tests strictly whether our data mean is either greater than or less than our comparison value. The focus is on one direction of effect.
If we predict that our data will be higher than the comparison and it turns out to be much lower – this would not be a significant result. A one-tailed test is not appropriate if the effect is in the opposite direction.
Two-tailed tests account for both possible differences, considering deviations in either direction (greater or less than).
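In scipy this choice is made with the `alternative` argument (available in scipy >= 1.6); a minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
attendance = rng.normal(84, 6, 30)  # simulated attendance percentages

# Two-tailed (default): is attendance different from 80% in either direction?
print(stats.ttest_1samp(attendance, popmean=80, alternative='two-sided'))

# One-tailed: is attendance specifically greater than 80%?
print(stats.ttest_1samp(attendance, popmean=80, alternative='greater'))
```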
We need a mechanism to protect against false discoveries. Statistical hypothesis testing serves this purpose.
Hypothesis testing provides a solution. It offers a structured approach for evaluating evidence.
We assume no effect, and place a burden of proof on new evidence against that null hypothesis. The null hypothesis is the starting point, requiring strong evidence to reject.
Statistics is a way of using data to provide a principled answer to hypotheses in the face of noisy data and uncertainty. It allows for objective assessment of hypotheses.
Hypotheses must be clearly specified, and we must remember that we’re actually testing a null… The focus is on assessing the null hypothesis.
Artificial intelligence has become extremely good at simulating realistic images of faces. AI-generated faces are often indistinguishable from real ones.
Humans frequently cannot tell the difference. Distinguishing AI-generated faces from real faces poses a significant challenge.
Hyperrealism refers to AI-generated faces that human participants identify as more human than real human faces. This phenomenon highlights the advanced capabilities of AI.
The three faces rated most 'human' were actually AI-generated. This surprising result underscores the effectiveness of AI in creating realistic images.
Millet et al. computed a metric called d-prime that indicates how effectively participants could distinguish AI-generated faces from real faces. The d-prime metric quantifies the discriminability between real and AI-generated faces.
A higher value for d-prime indicates that the true faces can be more readily detected. This means participants are better at identifying real faces.
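Millet et al.'s exact pipeline isn't reproduced here, but the standard signal-detection definition of d-prime is the difference between the z-transformed hit and false-alarm rates; a sketch (the mapping of 'hit' and 'false alarm' onto this task is an illustrative assumption):

```python
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """Standard signal-detection d': z(hit rate) - z(false-alarm rate)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Illustrative mapping: a 'hit' is labelling a real face as real,
# a 'false alarm' is labelling an AI face as real.
print(d_prime(0.80, 0.30))  # good discrimination: d' ~ 1.37
print(d_prime(0.50, 0.50))  # chance performance: d' = 0
```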
This was statistically assessed using one-sample t-tests. T-tests were used to determine if the d-prime values were significantly different from chance.
Chance level = 50%. This represents the baseline probability of correctly identifying a real or AI-generated face by random guessing.
White AI faces were judged as human significantly more often than chance (= 50% in the two-alternative forced choice task), M_{White-AI} = 69.5\%, t(314) = 16.01, p < .001, d = 0.90, 95% CI = [0.77, 1.03]. This demonstrates a significant bias in the perception of White AI faces.
In contrast, non-White AI faces were judged as human at around chance levels, M_{non-White-AI} = 50.5\%, t(314) = 0.41, p = .682, d = 0.02, 95% CI = [−0.09, 0.13]. There was no significant bias observed for non-White AI faces.
Generative adversarial networks (GANs) are biased toward the statistical regularities of their most common inputs, which we argue produces AI hyperrealism. GANs tend to replicate patterns present in their training data.
The study found evidence of White racial bias in algorithmic training that produces racial differentials in the presence of AI hyperrealism, with significant implications for the use of AI faces online and in science. This highlights the potential for bias in AI systems.
Participants rated whether the L1 or L2 image was higher in perceived Experience, Agency, or Realness. The ratings were used to compare different image versions.
In the dataset, there is one row per image and one column indicating the proportion of participants who rated L1 > L2 for that percept. The data structure allowed for analysis of preference proportions.
Averaging over participants and including one row per image is a little unusual; it would be clearer to average over images and have one row per participant – the outcome is the same for our purposes… This clarifies the data aggregation approach.
“One-sample Student’s t tests confirmed that the proportion of participants choosing L1 over L2 was significantly above the chance level of 50% for Realness [t(28) = 17.99, P < 0.0001, d = 3.34], Experience [t(28) = 9.10, P < 0.0001, d = 1.69], and Agency [t(28) = 5.70, P < 0.0001, d = 1.06]. That is, on all three dimensions, minds were perceived more keenly in L1 than in L2.” This summarizes the main findings of the study.
Reporting T-Tests
Mean and standard deviation of data observations: descriptive statistics are essential for interpreting the results.
Comparison level: the reference point for comparison.
t-value with degrees of freedom: the test statistic and its associated degrees of freedom.
p-value: the probability of obtaining the observed results under the null hypothesis.
Experience
t_{\text{df}} = \frac{\bar{x} - \mu}{SEM}
t_{28} = \frac{69.9 - 50}{2.19} = 9.10
Agency
t_{\text{df}} = \frac{\bar{x} - \mu}{SEM}
t_{28} = \frac{68.7 - 50}{3.28} = 5.70
Realness
t_{\text{df}} = \frac{\bar{x} - \mu}{SEM}
t_{28} = \frac{84.3 - 50}{1.91} = 17.99
“A one-sample t-test showed that more participants selected the L1 image as more ‘Real’ (M = 84.3%, SD = 10.3) than the chance level comparison (50%), t(28) = 17.99, p < 0.001.” This provides a detailed example of how to report the results of a t-test.
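The reported values can be recovered from the summary statistics alone; a sketch (n = 29 follows from df = 28, and small rounding differences are expected):

```python
import numpy as np
from scipy import stats

m, sd, n, mu = 84.3, 10.3, 29, 50     # Realness summary statistics from above

sem = sd / np.sqrt(n)                 # ~1.91, the denominator shown earlier
t = (m - mu) / sem                    # ~17.9, matching t(28) = 17.99 up to rounding
p = 2 * stats.t.sf(abs(t), df=n - 1)  # two-tailed p-value

print(f"t({n - 1}) = {t:.2f}, p = {p:.2g}")
```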
Comparing two groups with independent samples t-test