
Lecture 2: Samples and Populations

Samples & Populations

Outline & Objectives
  • A dataset is a single sample of a broader population.

    • It's rare to sample the entire population.

    • This leads to uncertainty in estimated parameters.

  • Data samples can be systematically biased due to participant recruitment and data collection methods.

  • Assuming a normal distribution simplifies statistical calculations.

    • However, not all data is normally distributed.

  • Standard Error of the Mean

    • Resampling to estimate variability.

    • Precision in estimating the population mean from the sample.

Sampling & Populations
  • A histogram represents the sample distribution of the value of interest.

  • There's an underlying population distribution that can't be directly measured.

  • Each measured sample is an approximation of the underlying population.

  • Bigger samples tend to provide more accurate approximations.
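The claim that bigger samples give more accurate approximations can be checked with a small simulation (a sketch in Python using a made-up population, not the lecture's data): draw repeated samples of different sizes and measure how far each sample mean lands from the population mean.

```python
import random
import statistics

# Hypothetical population of 100,000 values (mean ~100, SD ~15).
random.seed(1)
population = [random.gauss(100, 15) for _ in range(100_000)]
pop_mean = statistics.mean(population)

def mean_abs_error(n, repeats=200):
    """Typical distance between a sample mean and the population mean,
    averaged over many repeated samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, n)
        errors.append(abs(statistics.mean(sample) - pop_mean))
    return statistics.mean(errors)

print(f"N=10:   typical error {mean_abs_error(10):.2f}")
print(f"N=100:  typical error {mean_abs_error(100):.2f}")
print(f"N=1000: typical error {mean_abs_error(1000):.2f}")
```

The typical error shrinks as N grows, which is exactly the pattern the standard error of the mean (introduced later) describes.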

Summary 1
  • Histograms represent the distribution of data samples.

  • The distribution of data samples approximates population distributions, assuming a large enough sample size and minimal bias.

Sampling
  • A ‘population’ is the total set of everyone within a group that we want to test.

    • Examples include all primary school children in the UK, all living adults, and everyone with depressive symptoms.

  • It's usually impossible to test everyone in a population, so we take a ‘sample’.

  • Populations aren't homogeneous, leading to variability that we can't control.

Sampling
  • It's possible to get lucky and find a sub-sample that properly represents the population.

  • However, it's more likely to get a sample with some form of bias, where certain traits are over-represented compared to the population.

Sampling
  • Bias might occur randomly, in which case we can try another sample or recruit a larger sample to balance things out.

  • A bigger issue is systematic bias, e.g., certain people are more or less likely to respond to a recruitment email; this remains true even when recruiting a larger sample.

  • Worst of all, there may be bias in some aspect that we aren’t even aware of…

Sampling Methods

  • Random – Recruitment done completely by chance; participants selected at random from a list.

  • Systematic – A structured approach to selecting participants; every nth participant is selected from a list.

  • Opportunity/Convenience – Recruitment from people closest and/or most accessible to the experimenter; e.g., participants who attended that day’s lecture.

  • Stratified – Recruitment aims to match key characteristics of the target population; e.g., a participant group that matches known age, sex, and political characteristics.

  • Cluster – Whole groups are recruited at once, sometimes combined with other methods; e.g., a whole university football team recruited at once.
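The random and systematic methods from the table can be sketched in a few lines of Python (the sampling frame here is hypothetical):

```python
import random

# Hypothetical sampling frame: a numbered list of 1,000 potential participants.
frame = [f"participant_{i}" for i in range(1, 1001)]

# Random sampling: every member of the frame is equally likely to be chosen.
random.seed(0)
random_sample = random.sample(frame, 50)

# Systematic sampling: every nth member (here n = 20), starting from a
# random offset within the first interval.
n = 20
start = random.randrange(n)
systematic_sample = frame[start::n]

print(len(random_sample), len(systematic_sample))  # 50 participants each
```

Note that both methods assume an unbiased frame to start from; if the list itself under-represents a group, neither method fixes that.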


Ecological Validity
  • Do the variables and conclusions of a study sufficiently reflect the real-world context of its population?

  • If writing about a very specific population (e.g., elite athletes), recruiting a relatively homogeneous sample may be appropriate.

  • However, if drawing conclusions about a very broad population, perhaps even all humans, a narrow sample is unlikely to be appropriate.

Ethics & Sampling
  • Psychologists:

    • Avoid unfair, prejudiced, or discriminatory practice, e.g., in participant selection or in the research content itself.

    • Accept that individuals may choose not to be involved in research, or may withdraw their data.

    • Are alert to the possible consequences of unexpected outcomes and acknowledge the problematic nature of interpreting research findings.

  • Biased or poorly considered sampling can lead to ethical concerns.

  • It can reinforce inequalities and marginalize certain groups.

Example 1 - Students
  • Can the body of knowledge on the psychology of prejudice garnered largely from student samples be applied to the general adult population?

  • For sure, much of it can.

  • The problem, however, is that at this point we cannot be sure which parts, or how much, can. These questions are empirical ones that we as a science should not lose sight of.

Example 2 – WEIRD populations
  • Behavioral scientists routinely publish broad claims about human psychology and behavior based on samples drawn entirely from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies.

  • Researchers implicitly assume either there is little variation across human populations or that these “standard subjects” are as representative of the species as any other population.

  • Are these assumptions justified?

Discussion point
  • What could go wrong if our sampling methods are biased?

  • How can we recruit more diverse & representative samples in our research?

Sampling Methods - Summary
  • Samples can approximate target populations.

    • We can’t test everyone in a target population and often have to use a ‘sample’.

    • Samples contain randomness, and different samples may give different results.

  • We must carefully plan our sample.

    • Who will we recruit?

    • How will they be recruited?

    • How will we maintain our obligation to conduct ethical research?

  • Sampling is a major element in statistics.

    • We can use statistics to estimate how variable our estimates are likely to be across repeated samples from a population.

    • But no statistics can save us from a strongly biased dataset.

Revising Distributions
The ‘Normal’ Distribution
  • A special distribution with some convenient properties.

  • It can be summarised by 2 parameters – the mean and the standard deviation.

  • 68.2% of observations lie within 1 standard deviation of the mean.

  • 95.4% lie within 2 standard deviations.

  • 99.7% lie within 3 standard deviations.
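These percentages can be recovered from the normal distribution itself, since the proportion within k standard deviations is P(|Z| < k) = erf(k/√2). A quick check using only Python's standard library:

```python
import math

def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations
    of the mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.1%}")
```

This reproduces the 68.2% / 95.4% / 99.7% rule quoted above.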

The ‘Standard Normal’ Distribution
  • A special case of the normal distribution in which the mean is zero and the standard deviation is one.
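Any variable can be converted to the standard normal scale ("z-scores") by subtracting the mean and dividing by the standard deviation. A minimal sketch with made-up scores:

```python
import statistics

# Hypothetical raw scores.
scores = [12, 15, 9, 20, 14, 11, 17, 13]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # Bessel-corrected sample SD

# Standardise: subtract the mean, divide by the SD.
z_scores = [(x - mean) / sd for x in scores]

# The standardised values have mean 0 and standard deviation 1
# (up to floating-point precision).
print(f"mean of z: {statistics.mean(z_scores):.6f}")
print(f"SD of z:   {statistics.stdev(z_scores):.6f}")
```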

The ‘Normal’ Distribution
  • Many tests make use of the assumption that our data is ‘normally’ distributed, which simplifies a lot of calculations.

  • These are called ‘parametric’ tests.

  • There are alternatives if data are not normally distributed, but we typically start with parametric tests.

  • Assessing the shape of our data distribution and whether it meets parametric assumptions is a good first step in any analysis.

Testing for a normal distribution
  • Personality data from 500 participants

The Shapiro-Wilk
  • The Shapiro-Wilk provides an objective test for whether data is normally distributed.

  • Shapiro-Wilk W – a statistic indicating how ‘normal’ the data are; higher values indicate more normal data.

  • Shapiro-Wilk p – the probability of observing this much deviation from normality if the data were truly normal; small values suggest the data are not normally distributed.
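In Python the same test is available as `scipy.stats.shapiro`. The sketch below runs it on simulated normal and skewed data (not the lecture's personality dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data: one normal variable, one strongly right-skewed variable.
normal_data = rng.normal(loc=3.5, scale=0.35, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

w_norm, p_norm = stats.shapiro(normal_data)
w_skew, p_skew = stats.shapiro(skewed_data)

print(f"normal data: W = {w_norm:.3f}, p = {p_norm:.3f}")
print(f"skewed data: W = {w_skew:.3f}, p = {p_skew:.3g}")
```

The skewed data gets a lower W and a tiny p, flagging a significant departure from normality; the normal data's W sits close to 1.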

Data types revisited! Are Likert Scales Ordinal or Interval?
  • This is a bit ambiguous and can be tricky to decide.

  • The critical factor is – does the data have a meaningful mean and standard deviation?

  • Ordinal data does not have an interpretable mean, as its value will depend on how we have coded the values.

  • Interval data is more inherently numeric and does have a meaningful mean and standard deviation

Data types revisited! Are Likert Scales Ordinal or Interval?
  • This can be difficult as researchers make a lot of decisions when presenting questions. Some of these examples might be more naturally ordinal and some more interval.

  • This is a massive, ongoing debate!

  • Did the participants see words or numbers? Were the numbers presented continuously?

    • Individual questions are often ordinal

    • Scores aggregated across several single items are often interval (this is the computer practical data)

Data types revisited! Are Likert Scales Ordinal or Interval?
  • In my opinion, it comes down to distributions. Do the data have a distribution that you think is fairly summarised with a mean and standard deviation?

  • If yes, then we can proceed with interval – if no, we should proceed with ordinal.

  • This matters as it relates to whether we can use parametric statistics or whether we need a non-parametric alternative.

Standard Error of the Mean
  • It's all about precision!

Sampling Variability
  • The sample mean is the sum of all the individual data points divided by the total number of data points:

    $\bar{x} = \frac{\sum_j x_j}{N}$

  • The sample standard deviation is the square root of the sum of the squared differences between the sample mean and each individual data point, divided by the total number of data points minus one:

    $s = \sqrt{\frac{\sum_j (x_j - \bar{x})^2}{N - 1}}$

  • Each sample only gives an estimate of the ‘true’ mean.
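These two formulas can be computed by hand; a minimal sketch with a made-up six-point sample:

```python
import math
import statistics

# Hypothetical sample of 6 scores.
x = [4, 7, 5, 6, 8, 6]
N = len(x)

# Sample mean: sum of the data points divided by N.
mean = sum(x) / N

# Sample standard deviation with Bessel's correction (divide by N - 1).
sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (N - 1))

print(mean)           # 6.0
print(round(sd, 4))   # 1.4142

# The stdlib gives the same (Bessel-corrected) answer.
assert math.isclose(sd, statistics.stdev(x))
```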

Bessel’s Correction
  • Why do we need this minus one?

  • Estimates of the population mean from a sample might be wrong, but they are equally likely to be too large or too small.

  • Estimates of the population standard deviation are biased. They are nearly always too small!

  • Bessel’s correction makes the estimated standard deviation a bit bigger to account for this bias. Jamovi will always give you the corrected estimate.

  • Uncorrected: $\hat{\sigma} = \sqrt{\frac{\sum_j (x_j - \bar{x})^2}{N}}$

  • Corrected: $s = \sqrt{\frac{\sum_j (x_j - \bar{x})^2}{N - 1}}$
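The bias, and how Bessel's correction reduces it, can be demonstrated by simulation: draw many small samples from a population whose standard deviation is known, and average the two estimates.

```python
import random
import statistics

# Population with a known standard deviation of 10.
random.seed(7)
true_sd = 10.0

def sd_uncorrected(x):
    """Standard deviation dividing by N (no Bessel's correction)."""
    m = sum(x) / len(x)
    return (sum((xi - m) ** 2 for xi in x) / len(x)) ** 0.5

corrected, uncorrected = [], []
for _ in range(5_000):
    sample = [random.gauss(0, true_sd) for _ in range(5)]  # tiny N = 5 samples
    corrected.append(statistics.stdev(sample))    # divides by N - 1
    uncorrected.append(sd_uncorrected(sample))    # divides by N

print(f"true SD:                {true_sd}")
print(f"mean corrected est.:    {statistics.mean(corrected):.2f}")
print(f"mean uncorrected est.:  {statistics.mean(uncorrected):.2f}")
```

The uncorrected estimate averages well below 10, while the corrected one sits much closer. (Even the corrected standard deviation remains slightly below the true value: Bessel's correction makes the *variance* unbiased, but its square root is still biased a little downward for small samples.)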

Sampling Variability
  • Each sample only gives an estimate of the ‘true’ mean

  • We can’t directly measure the population mean in most cases

  • The best we can do is estimate how close our sample mean might be to the underlying population mean
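One way to estimate that closeness without a formula is resampling, as flagged in the outline: resample the dataset with replacement many times (the bootstrap) and look at how much the mean varies across resamples. A sketch with simulated data:

```python
import random
import statistics

# Simulated sample of 100 observations (not the lecture's dataset).
random.seed(11)
sample = [random.gauss(3.5, 0.35) for _ in range(100)]

boot_means = []
for _ in range(2_000):
    # Resample WITH replacement, same size as the original sample.
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

# The spread of the bootstrap means approximates the standard error of the mean.
boot_se = statistics.stdev(boot_means)
analytic_se = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {boot_se:.4f}, formula SE: {analytic_se:.4f}")
```

The two estimates agree closely, which is why the SEM formula in the next section works as a shortcut when its assumptions hold.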

Standard Error of the Mean
  • The standard error of the mean is the likely variability in our estimate of the population mean from a given data sample

  • A shortcut to the variability of the sampling distribution of the mean when the data are normally distributed.

  • It decreases as the sample size grows larger – larger samples are more reliable!

  • The standard error of the mean is the standard deviation of the sample divided by the square root of the total number of data points

  • $SEM = \frac{s}{\sqrt{N}}$
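The $\sqrt{N}$ in the denominator means quadrupling the sample size roughly halves the SEM; a quick simulated check:

```python
import math
import random
import statistics

# Two simulated datasets from the same population, N = 100 vs N = 400.
random.seed(3)
data_small = [random.gauss(50, 12) for _ in range(100)]
data_large = [random.gauss(50, 12) for _ in range(400)]

sem_small = statistics.stdev(data_small) / math.sqrt(len(data_small))
sem_large = statistics.stdev(data_large) / math.sqrt(len(data_large))

print(f"SEM with N=100: {sem_small:.3f}")
print(f"SEM with N=400: {sem_large:.3f}")  # roughly half of the above
```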

95% Confidence Intervals
  • Confidence intervals are an intuitive way to communicate how reliable our estimate of the mean is likely to be.

  • They provide two values which define a range that has a 95% chance of containing the true mean.

  • $95\%\ \mathrm{CI} = 1.96 \times SEM$

  • $Upper = \bar{x} + CI$

  • $Lower = \bar{x} - CI$
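Putting the SEM and confidence-interval formulas together for a small made-up sample:

```python
import math
import statistics

# Hypothetical sample of 8 scores.
x = [3.2, 3.6, 3.4, 3.9, 3.1, 3.5, 3.7, 3.3]

mean = statistics.mean(x)
sem = statistics.stdev(x) / math.sqrt(len(x))

# 95% confidence interval: mean plus/minus 1.96 standard errors.
ci = 1.96 * sem
lower, upper = mean - ci, mean + ci

print(f"mean = {mean:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

(For small samples like this one, a t-based multiplier slightly wider than 1.96 is technically more accurate; 1.96 is the large-sample value used in the lecture.)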

Summary

  • Population Mean ($\mu$) – The true, unobservable mean of our population.

  • Sample Mean ($\bar{x}$) – A sample estimate of the mean.

  • Standard Deviation ($s$) – A sample estimate of the variability in the data points.

  • Standard Error of the Mean (SEM) – The precision to which our sample mean has been estimated.

  • Confidence Interval (CI) – A range of values around the sample mean that has a 95% chance of containing the population mean.

Example with Extraversion Data
  • Extraversion is a measure of how sociable and outgoing someone is.

  • Let’s take a look at some example data.

Results
  • We can visualise the sample distribution using a histogram (or perhaps a box and whisker chart)


  • Descriptive statistics give us our sample statistics. We have a total of 500 observations and a mean value of 3.49 (this matches what we see on the histogram) and the standard deviation is 0.355


  • We can compute both the standard error and the confidence intervals in the Jamovi Descriptives tab.


  • The standard error of the mean is very small – only 0.0159. This is because we have a large data sample. As a result, the confidence interval specifies a very small range: the true population mean is likely to lie between 3.46 and 3.52. This is a small proportion of the overall variability.
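Those figures follow directly from the formulas above; re-deriving them from the reported N, mean, and SD:

```python
import math

# Figures reported for the extraversion data: N = 500, mean = 3.49, SD = 0.355.
N, mean, sd = 500, 3.49, 0.355

sem = sd / math.sqrt(N)   # standard error of the mean
ci = 1.96 * sem           # half-width of the 95% confidence interval

print(f"SEM    = {sem:.4f}")                            # 0.0159
print(f"95% CI = [{mean - ci:.2f}, {mean + ci:.2f}]")   # [3.46, 3.52]
```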

Examples from data
  • The amount of data affects our precision

  • Simulated data with N = 10

  • Simulated data with N = 25

Examples from really big data
  • With massive datasets we can be extremely precise

Summary
  1. The normal distribution

    • Assuming a normal distribution simplifies many calculations in statistics

    • Not all data is normally distributed, though.

  2. Samples & Populations

    • A dataset is a single sample of a broader population – we can very rarely sample the whole population

    • This leads to uncertainty in the parameters we estimate

    • Data samples can be systematically biased due to participant recruitment and data collection methods.

  3. Standard Error of the Mean

    • How precisely have we estimated the population mean from our sample?

    • The standard error of the mean is the likely variability in our estimate of the population mean from a given data sample
