LECTURE 2: DISTRIBUTION, SAMPLES & POPULATIONS

Outline & Objectives

  • Topics covered: Introduction, Data & Experiments, Distributions, Samples & Populations, Testing Hypotheses (One-Sample t-test, Independent Sample t-test), Statistical Inference, Consolidation week, Non-Parametric alternative tests, Comparing multiple means, Qualitative Methods, Advanced Thematic Analysis, Revision & Open Science
  • Lecture 2 focuses on Distributions, Samples, and Populations.
  • A dataset is a single sample of a broader population.
  • Assuming a normal distribution simplifies many statistical calculations.
  • Resampling is used to estimate variability.
  • Key questions:
    • How precisely have we estimated the population mean from our sample? This leads to uncertainty in the parameters we estimate.
    • Not all data is normally distributed.
    • Data samples can be systematically biased due to participant recruitment and data collection methods.

Populations

  • A 'population' is the entire group we want to test (e.g., all primary school children in the UK, all living adults, everyone with depressive symptoms).
  • We usually cannot test everyone, so we take a 'sample' from the population.
  • Populations are heterogeneous, leading to additional variability that is hard to control.

Samples & Populations

  • A histogram represents the sample distribution of our value of interest.
  • There is an underlying population distribution that we can't directly measure.
  • Each sample we measure is an approximation of the underlying population.
  • Larger samples tend to give more accurate approximations.

Summary 1

  • Histograms represent the distribution of data samples.
  • We can use the distribution of data samples to infer things about the population (assuming the sample is large enough and doesn't have too much bias).

Sampling

  • Potential bias in sampling:
    • Random chance: Can be addressed by taking another sample or recruiting a larger sample.
    • Systematic bias: Occurs when certain people are more or less likely to respond (e.g., to a recruitment email); a larger sample won't fix this.
    • Unrecognized bias: Bias in an aspect we aren’t even aware of.

Sampling Methods

  • Random: Participants selected at random from a list.
  • Systematic: Structured approach, e.g., every 5th participant selected from a list.
  • Opportunity/Convenience: Recruitment from people closest and/or most accessible, e.g., participants who attended that day's lecture.
  • Stratified: Recruitment aims to match key characteristics of the target population, e.g., participant group that matches known age, sex, and political characteristics.
  • Cluster: Whole groups are recruited at once, can be combined with other methods (e.g., whole university football team recruited).
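
As a rough sketch (not part of the lecture materials), the first two methods could be implemented in Python; the participant list, sample size, and interval here are made-up examples:

    import random

    participants = [f"P{i:03d}" for i in range(1, 101)]  # hypothetical list of 100 participant IDs

    # Random sampling: participants selected at random from the list
    random_sample = random.sample(participants, k=20)

    # Systematic sampling: every 5th participant selected from the list
    systematic_sample = participants[::5]

    print(random_sample[:5])
    print(systematic_sample[:5])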

Ethics & Sampling

  • Biased or poorly considered sampling can lead to ethical concerns.
  • It can reinforce inequalities and exclude certain groups.
  • Psychologists must:
    • Avoid any unfair, prejudiced, or discriminatory practice in participant selection or research content.
    • Accept that individuals may choose not to participate or may withdraw their data.
    • Be alert to the possible consequences of unexpected as well as predicted outcomes of work, and the often public nature of the interpretation of research findings.
  • Example 2 - WEIRD populations
    • Behavioral scientists often make broad claims based on samples from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies.
    • Researchers often assume little variation across human populations, or that “standard subjects” are representative of the species, which may not be justified.

Sampling Methods - Summary

  • Samples can approximate target populations.
  • We can't test everyone in a target population and often have to use a 'sample.'
  • We can use statistics to estimate how variable our statistics are likely to be across repeated samples from a population.
  • Samples contain randomness, and different samples may give different results.
  • Careful planning is essential:
    • Who will we recruit?
    • How will they be recruited?
    • How will we maintain our obligation to conduct ethical research?
  • Statistics cannot save us from a strongly biased dataset.

The ‘Normal’ Distribution

  • A special distribution with convenient properties.
  • Summarized by two parameters: the mean and the standard deviation.
  • 68.2% of observations lie within 1 standard deviation of the mean.
  • 95.4% lie within 2 standard deviations.
  • 99.7% lie within 3 standard deviations.
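
These proportions can be checked numerically. A minimal sketch, assuming scipy is available (not part of the lecture materials):

    from scipy.stats import norm

    # Proportion of a normal distribution lying within k standard deviations of the mean
    for k in (1, 2, 3):
        proportion = norm.cdf(k) - norm.cdf(-k)
        print(f"within {k} SD: {proportion:.1%}")  # ~68.3%, ~95.4%, ~99.7%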

The ‘Normal’ Distribution

  • Many tests assume data is 'normally' distributed, simplifying calculations; these are 'parametric' tests.
  • Alternatives exist for non-normally distributed data, but we typically start with parametric tests.
  • Assessing data distribution shape and meeting parametric assumptions is a good first step in any analysis.

Data Skills & Coding

  • Example: Personality data from 500 participants.
  • Testing for a normal distribution.
  • The Shapiro-Wilk test objectively tests whether data is normally distributed.
    • Shapiro-Wilk W: A statistic indicating how ‘normal’ the data is; values closer to 1 indicate more normal data.
    • Shapiro-Wilk p: A probability indicating whether any deviation from normality is statistically significant; a small p (e.g., below .05) suggests the data are not normally distributed.
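
The lecture runs this test in Jamovi; a minimal Python sketch with scipy, using simulated scores in place of the real personality data:

    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(1)
    scores = rng.normal(loc=3.5, scale=0.35, size=500)  # simulated stand-in for the 500 participants

    w, p = shapiro(scores)
    print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
    # W close to 1 and p above .05: no significant deviation from normality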

Data types revisited!

  • Are Likert Scales Ordinal or Interval?
  • Ambiguous and tricky to decide.
  • Critical factor: does the data have an interpretable mean and standard deviation?
  • Ordinal data: Does not have an interpretable mean as its value depends on how we have coded the values
  • Interval data: Inherently numeric and does have a meaningful mean and standard deviation

Data types revisited!

  • This can be difficult as researchers make a lot of decisions when presenting questions. Some of these examples might be more naturally ordinal and some more interval. This is a massive, ongoing debate!
  • Did the participants see words or numbers? Were the numbers presented continuously?
  • Individual questions are often ordinal
  • Scores aggregated across several single items are often interval (this is the computer practical data)

Data types revisited!

  • In my opinion, it comes down to distributions. Do the data have a distribution that you think is fairly summarized with a mean and standard deviation?
  • If yes, then we can proceed with interval; if no, we should proceed with ordinal.
  • This matters because it determines whether we can use parametric statistics or need a non-parametric alternative.

Sampling Variability

  • Each sample only gives an estimate of the ‘true’ mean
  • The mean is the sum of all the individual data points divided by the total number of data points.
  • The standard deviation is the square root of the sum of the squared differences between the sample mean and each individual data point, divided by the total number of data points minus one.
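
Written in symbols (with x_i the individual data points and N the total number of data points), the two definitions above are:

    \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}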

Bessel’s Correction

  • Estimates of the population mean from a sample might be wrong, but they are equally likely to be too large or too small.
  • Estimates of the population standard deviation are biased. They are nearly always too small!
  • Bessel’s correction makes the estimated standard deviation a bit bigger to account for this bias.
  • Jamovi will always give you the corrected estimate.
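
A minimal numpy sketch of the correction (the simulated values are illustrative only): ddof=1 divides by N - 1, which is the corrected estimate Jamovi reports, while ddof=0 divides by N.

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=3.5, scale=0.35, size=10)  # a small simulated sample

    sd_uncorrected = np.std(sample, ddof=0)  # divides by N: tends to underestimate the population SD
    sd_corrected = np.std(sample, ddof=1)    # Bessel's correction: divides by N - 1

    print(sd_uncorrected, sd_corrected)      # the corrected estimate is slightly larger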

Sampling — Variability

  • Notation:
    • \bar{x}: Sample mean; calculated directly from the raw data.
    • \mu: True population mean; almost never known for sure.
    • \hat{\mu}: Estimate of the population mean; identical to the sample mean in most cases.
  • The best we can do is estimate how close our sample mean might be to the underlying population mean

Standard Error of the Mean

  • The standard error of the mean is the likely variability in our estimate of the population mean from a given data sample
  • SEM = \frac{s}{\sqrt{N}}: the standard deviation of the sample divided by the square root of the total number of data points.
  • A shortcut to the variability in the sampling distribution of the mean when the data are normally distributed.
  • It decreases as the sample size grows larger - larger samples are more reliable!
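
For example, plugging in the extraversion descriptives used later in this lecture (s = 0.355, N = 500):

    SEM = \frac{0.355}{\sqrt{500}} \approx 0.0159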

95% Confidence Intervals

  • 95% CI = 1.96 * SEM
    • Upper = \bar{x} + CI
    • Lower = \bar{x} - CI
  • Confidence intervals are an intuitive way to communicate how reliable our estimate of the mean is likely to be.
  • They provide two values which define a range that has a 95% chance of containing the true mean.
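
A minimal Python sketch of these formulas, using the extraversion descriptives from the lecture example (the 1.96 multiplier assumes a large, roughly normal sample):

    import math

    mean, sd, n = 3.49, 0.355, 500   # extraversion descriptives from the lecture example

    sem = sd / math.sqrt(n)          # standard error of the mean
    ci = 1.96 * sem                  # half-width of the 95% confidence interval

    lower, upper = mean - ci, mean + ci
    print(f"SEM = {sem:.4f}, 95% CI = [{lower:.2f}, {upper:.2f}]")  # roughly [3.46, 3.52]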

Summary

Metric | Symbol | Meaning
Population Mean | \mu | The true, unobservable mean of our population
Sample Mean | \bar{x} | A sample estimate of the mean
Standard Deviation | s | A sample estimate of the variability in the data points
Standard Error | SEM | The precision to which our sample mean has been estimated
Confidence Intervals | CI | A range of values around the sample mean that has a 95% chance of containing the population mean

Data Skills & Coding

  • Example with Extraversion Data
  • Extraversion is a measure of how sociable and outgoing someone is. Let's take a look at some example data.
  • You will go through a similar process with another dataset in the computer practical sessions.
  • We can visualise the sample distribution using a histogram (or perhaps a box and whisker chart)
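
The lecture does this in Jamovi; a minimal matplotlib sketch, using simulated values in place of the real extraversion data:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    extraversion = rng.normal(loc=3.49, scale=0.355, size=500)  # simulated stand-in for the sample

    plt.hist(extraversion, bins=30)    # histogram of the sample distribution
    plt.xlabel("Extraversion score")
    plt.ylabel("Frequency")
    plt.show()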

Descriptives

  • Descriptive statistics give us our sample statistics.
  • We have a total of 500 observations, a mean value of 3.49 (this matches what we see on the histogram), and a standard deviation of 0.355.
  • We can compute both the standard error and the confidence intervals in the Jamovi Descriptives tab.

Descriptives

  • The standard error of the mean is very small - only 0.0159. This is because we have a large data sample.
  • As a result, the confidence intervals specify a very small range: the true population mean is likely to be between 3.46 and 3.52, which is a small proportion of the overall variability.
  • Simulated data: the amount of data affects our precision. With N = 10, 6 of the simulated samples show a false positive; with N = 25, only 3 do.
  • Examples from really big data: with massive datasets we can be extremely precise (Fig. 1: gender rating gap across platforms).

Summary

  • The normal distribution: Assuming a normal distribution simplifies many calculations in statistics. Not all data is normally distributed though…

  • Samples & Populations: A dataset is a single sample of a broader population - we can very rarely sample the whole Population. Data samples can be systematically biased due to participant recruitment and data collection methods.

  • Standard Error of the Mean: How precisely have we estimated the population mean from our sample? This leads to uncertainty in the parameters we estimate

  • LECTURE 3: TESTING HYPOTHESES: ONE-SAMPLE T-TEST (preview of the next 3 weeks)
  • Differences:
    • Compare one sample to a reference (One-Sample t-test).
    • Compare two samples (Independent Samples and Dependent Samples).
  • Assumptions:
    • Normality: checked with the Shapiro-Wilk test. If the assumption of normality is violated, consider a non-parametric alternative: one-sample test -> Wilcoxon Rank Test; independent-samples test -> Mann-Whitney U. For dependent samples, check the normality of the paired differences, not the raw data.
    • Equal variance: checked with Levene's test. Use Student's t-test when groups have comparable variance; consider Welch's t-test if they do not.