Sampling, External Validity, and Intro to Data Analysis – Detailed Lecture Notes

External Validity

• Four pillars of validity recap:

• Construct validity – Are variables operationalized well? This refers to how well a conceptual variable is measured or manipulated in a study. It asks if the chosen operational definitions truly reflect the theoretical constructs intended.

• Internal validity – Does the design pin down causality? This ensures that the observed changes in the dependent variable are indeed caused by the independent variable, ruling out alternative explanations.

• Statistical validity – Are numeric conclusions correct? This involves ensuring the statistical analyses are appropriate, accurate, and that the conclusions drawn from them are reliable, including considering effect size and significance.

• External validity – Do findings generalize? This is the extent to which the conclusions from a study can be applied to other people, settings, times, or situations.

• Key forms of external validity:

• Population validity – Do results generalize from the sample to the target population? This addresses whether the study's findings, based on the specific group of participants (sample), can be applied to a larger group of people (population).

• Ecological validity – Do results generalize to real-world contexts? This concerns whether the experimental setup and findings can be replicated or hold true in natural, everyday settings outside of the laboratory.

• Experimental realism – Do procedures engage participants so that behaviour is genuine? This refers to the extent to which participants become absorbed in the study and behave naturally, rather than feeling like they are in an artificial experiment.

• Replication – Can the same conclusion be reproduced in new samples/contexts? The ability to repeat a study and obtain similar results across different populations, settings, or variations of the original methodology is crucial for establishing external validity.

• Representative (population-valid) samples:

• Definition: A sample mirroring all important characteristics of the population of interest. A truly representative sample allows for confident generalization of findings from the sample to the broader population.

• Threats: sampling bias (systematic differences between the sample and the population), non-response bias (differences between those who participate and those who don't), situational factors (unique aspects of the study's context), & temporal factors (findings being specific to a particular historical moment).

• Ecological validity:

• Extent to which lab findings extend to natural environments. High ecological validity means the study's conditions closely resemble the real-world situations the researcher is interested in.

• Increases when tasks/settings mimic everyday life. For example, studying social interaction in a naturalistic setting rather than a sterile lab room.

• Experimental realism:

• Depth of engagement; participants behave spontaneously rather than "acting" for the lab. This focuses on the psychological impact of the study on participants, making them feel like the situation is real and important.

• Often achieved with immersive, believable procedures. An example is a well-constructed deception or cover story that makes the experimental task seem meaningful to participants.

• Replication typology:

• Direct replication – Re-run the same study with new participants using the exact same procedures, materials, and settings to verify the original findings.

• Conceptual replication – Re-test the same hypothesis using different operationalizations, methods, or settings. This helps to establish the generalizability of the underlying theoretical relationship, even if the specific variables are measured or manipulated differently.

Sampling

• Population → Sampling frame → Sample

• Population of interest: All units you want to generalize to. This is the entire group of individuals or cases that a researcher wants to study and draw conclusions about.

• Research population (sampling frame): List/definition you can actually access. This is the subset of the population of interest from which a sample can actually be drawn, often defined by an available list or clear criteria.

• Actual sample: Subset that participates. The specific individuals or units from the sampling frame who are included in the study.

• Two broad approaches:

• Probability sampling – every unit has a known, non-zero chance of selection; supports strong external validity. These methods involve random selection, ensuring that each member of the population has a calculable probability of being included, thus minimizing sampling bias and allowing for statistical inference to the population.

• Non-probability sampling – selection based on researcher judgement or participant availability; cheaper, faster, but riskier for generalization. These methods do not involve random selection, making it difficult to determine the probability of any given unit being selected. While convenient, they often lead to samples that are not representative and limit external validity.

• Research modes & external validity:

• Generalization mode – Explicit goal is external validity → probability sampling ideal. This mode is used when the primary aim is to make claims about a specific population, such as in survey research or public opinion polls, where representativeness is paramount.

• Theory-testing mode – Focus on causal/mechanistic tests → probability sampling less critical. This mode emphasizes internal validity and theoretical understanding of relationships between variables. While some generalizability is desirable, the focus is more on demonstrating a cause-effect relationship that can be applied to, or explain, human behavior in general, rather than describing a specific population.

Probability Sampling Methods

• Simple random sample (SRS)

• Every member has an equal chance of selection, and every possible sample of a given size is equally likely to be drawn.

• Procedure: Obtain full list (sampling frame), assign a unique number to each member, use a random number generator (RNG) to pick n IDs for the sample. This method relies on complete randomness.

• Pros: Unbiased, simple conceptualization, and provides a clear basis for statistical inference. Cons: Often impractical (cost, access to a complete list of the population, difficulty in reaching selected individuals).
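The SRS procedure above can be sketched in Python; the sampling frame of 500 numbered students is hypothetical:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample: every member of the frame has an
    equal chance of selection, sampled without replacement."""
    rng = random.Random(seed)  # seeded RNG so the draw is reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: unique IDs 1..500, one per student.
frame = list(range(1, 501))
sample = simple_random_sample(frame, n=25, seed=42)
print(len(sample), len(set(sample)))  # 25 25 -- no duplicates
```

Note that `random.sample` draws without replacement, mirroring the usual RNG-based procedure: number every member, then pick n distinct IDs.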

• Cluster sampling

• Divide population into naturally occurring clusters (e.g., schools, postal codes, neighborhoods). These clusters are groups within the population that are often geographically or organizationally defined.

• Randomly sample whole clusters; test every unit within them or sub-sample later. All individuals within the selected clusters are included, or a further sampling stage is applied within them.

• Efficient for wide-spread populations where a complete list of individuals is unavailable, or where collecting data from widely dispersed individuals is too costly. Risk of cluster-level bias if clusters are not truly representative or if they are too homogeneous internally.

• Multi-stage cluster sampling

• Example hierarchy: campuses → divisions → departments → classes → students. This involves a hierarchical breakdown of the population into successively smaller clusters.

• Randomly sample at each stage for added efficiency. For example, first randomly select a subset of campuses, then randomly select divisions within those campuses, and so on, until the final units (students) are chosen. This method reduces geographical spread and resources needed for data collection while maintaining a level of randomness.
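A minimal two-stage sketch of this idea (the school names and sizes are invented): randomly select whole clusters, then randomly sample units within each selected cluster.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly select whole clusters.
    Stage 2: draw a simple random sample within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample

# Hypothetical clusters: four schools with student IDs.
schools = {
    "North": [f"N{i}" for i in range(100)],
    "South": [f"S{i}" for i in range(80)],
    "East":  [f"E{i}" for i in range(120)],
    "West":  [f"W{i}" for i in range(60)],
}
sample = two_stage_cluster_sample(schools, n_clusters=2, n_per_cluster=10, seed=1)
print(len(sample))  # 20 students, drawn from only 2 of the 4 schools
```

Extending the same pattern with more nesting levels (campuses → divisions → …) gives the full multi-stage design described above.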

• Stratified sampling

• Define strata based on key demographics (sex, ethnicity, SES …). The population is divided into subgroups (strata) that share similar characteristics thought to be relevant to the study.

• Proportionate stratified – sample within each stratum in proportion to population. For instance, if 60% of a population is female, then 60% of the sample will be female, ensuring demographic representativeness.

• Disproportionate (oversampling) – intentionally recruit extra from small strata for analytic precision, particularly when small groups are of specific research interest and need sufficient representation for meaningful analysis; weight later if needed during data analysis to accurately reflect population proportions.
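Proportional allocation can be sketched as follows (the 60/40 strata are hypothetical):

```python
import random

def proportionate_stratified_sample(strata, total_n, seed=None):
    """Sample within each stratum in proportion to its population share."""
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = []
    for name, members in strata.items():
        k = round(total_n * len(members) / pop_size)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60% female, 40% male.
strata = {
    "female": [f"F{i}" for i in range(600)],
    "male":   [f"M{i}" for i in range(400)],
}
sample = proportionate_stratified_sample(strata, total_n=100, seed=0)
females = sum(1 for s in sample if s.startswith("F"))
print(len(sample), females)  # 100 60 -- the sample mirrors the 60/40 split
```

A disproportionate design would simply replace the proportional `k` with fixed, larger quotas for small strata, with weights applied later in analysis.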

• Combining methods

• Example: Stratify first, then cluster within strata, then SRS within clusters. This creates a highly refined and efficient sampling design, leveraging the strengths of multiple methods to achieve specific research goals.

• Distinctions:

• Cluster = units are grouped based on location/group membership, and whole groups are randomly selected.

• Stratified = units are divided based on participant characteristics, and individuals are sampled proportionally (or disproportionately) from each characteristic group.

Non-Probability Sampling Methods

• Convenience sample – recruit whoever is easiest and most accessible (e.g., students in a class, participants from an online panel like MTurk, individuals responding to general advertisements). This is often used for pilot studies or when generalizability is not the primary concern.

• Purposive sample – deliberately recruit people with specific attributes needed for the study. Participants are chosen based on the researcher's judgment about who will be most informative (e.g., experts in a field, individuals with a rare condition). Maximize the likelihood of obtaining specific information.

• Snowball sample – participants recruit acquaintances; useful for hard-to-reach or hidden populations (e.g., parents of twins, gang members, individuals with specific rare diseases) where a complete sampling frame does not exist. Initial participants refer others, creating a network effect.

• Quota sample – set numeric targets for sub-groups (e.g., 50 males, 50 females; 30 young adults, 30 middle-aged adults) and recruit until quotas met; non-probability analog of stratified sampling. While it attempts to achieve a certain composition like stratified sampling, the selection within each quota is non-random, relying on convenience or judgment.

External-Validity Threats Recap

• Non-response bias – responders differ systematically from non-responders. If those who choose to participate are different in meaningful ways from those who don't, the sample may not accurately represent the population.

• Sampling bias – any feature that makes the sample unrepresentative of the population of interest. This occurs when the method of selecting participants systematically favors certain individuals or groups over others.

• Situational factors – unique context (time, place, experimenter particularities) limits generalization. The specific conditions, environment, or even the personality of the experimenter during data collection might influence results, making them less applicable to other situations.

• Temporal validity – findings tied to one historical moment. Results observed at one point in time might not hold true in the past or future due to societal, cultural, or technological changes.

Sampling Examples – Kelly et al. (2018)

• Study on social-media use & adolescent depression; N = 10,904 (age 14) from the UK Millennium Cohort.

• Step 1: Families selected from a random sample of 398 electoral wards across UK → Cluster sampling. This allowed the researchers to efficiently sample from a large, geographically dispersed population without needing a list of every single family.

• Step 2: Over-sampled disadvantaged, minority-ethnic, and smaller-nation sub-groups → Stratified (disproportionate) sampling. This was done to ensure sufficient representation of these smaller groups in the sample, allowing for more precise statistical analysis within these specific strata.

• Only 61% of the original cohort was present at the age-14 interview → Potential non-response bias threatens population validity. The significant dropout rate raises concerns that the 61% who remained may be systematically different from the 39% who did not, potentially limiting the generalizability of the findings to the full cohort or population.

Sampling vs. Random Assignment

• Random sampling – how you choose participants → boosts external validity. This refers to randomly selecting individuals from a population to be in your study, making your sample representative and allowing you to generalize findings to that larger population.

• Random assignment – how you allocate participants to conditions → boosts internal validity. This refers to randomly distributing participants who are already in your study to different experimental groups or conditions, which helps ensure that any observed differences between groups are due to the manipulation of the independent variable, rather than pre-existing differences among participants.
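The distinction can be made concrete in a short sketch (the population size and group sizes are arbitrary):

```python
import random

rng = random.Random(7)

# Random sampling: WHO gets into the study (external validity).
population = list(range(200))             # hypothetical population of 200 people
participants = rng.sample(population, 20)

# Random assignment: WHICH condition each chosen participant receives
# (internal validity) -- shuffle, then split into two equal groups.
rng.shuffle(participants)
treatment, control = participants[:10], participants[10:]
print(len(treatment), len(control))  # 10 10
```

The two steps are independent: a convenience sample can still use random assignment (strong internal, weak external validity), and a probability sample can be observed without any assignment at all.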

Describing Data (Descriptive Statistics)

• Every variable yields a distribution of scores. A distribution shows all the possible values of a variable and how often they occur.

• Visual summaries: frequency table (lists categories/scores and their counts/percentages), histogram (a bar graph that displays the frequencies of a numerical variable, with bars touching to represent continuous data).

Histogram insights

• Central tendency ("where most scores fall") – refers to the typical or central value of the data.

• Spread/variance (width of distribution) – indicates how much the scores in a distribution vary from each other and from the center.

• Shape (symmetry, modality, skew) – describes the overall form of the distribution. Is it symmetrical or skewed? Does it have one peak (unimodal) or multiple peaks (bimodal, multimodal)?

Measures of Central Tendency

• Mode – most frequent score (works best for unimodal data). Applicable to all types of data (nominal, ordinal, interval, ratio). Less informative for highly varied or continuous data.

• Median – 50th percentile; robust to extreme values (outliers). It is the middle score when data is ordered from least to greatest. Because it is based on position rather than magnitude, it is not affected by extremely high or low scores, making it a good measure for skewed distributions.

• Mean \bar{X} – arithmetic average; sensitive to every score & to skew. Calculated by summing all scores and dividing by the number of scores. It is the most commonly used measure but can be heavily influenced by outliers or skewed data because it incorporates every value.
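All three measures are available in Python's standard `statistics` module; the scores below are invented, with one outlier (40) included to show the mean's sensitivity:

```python
from statistics import mean, median, mode

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 40]  # hypothetical scores; 40 is an outlier

print(mode(scores))    # 5   -- most frequent score
print(median(scores))  # 5.0 -- middle of the ordered scores, unmoved by the outlier
print(mean(scores))    # 8   -- pulled upward by the single extreme score
```

The gap between the median (5.0) and the mean (8) is exactly the skew effect described in the next section: one extreme score drags the mean toward the tail.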

Skew

• Normal distribution – symmetric, bell-shaped; \text{mean} = \text{median} = \text{mode}. This ideal distribution is perfectly symmetrical, with the highest frequency in the middle and frequencies tapering off equally in both directions.

• Skewed distribution – tail pulls mean; median resists skew better. If the tail points to the right (positive skew), the mean > median > mode. If the tail points to the left (negative skew), the mean < median < mode. The mean is pulled in the direction of the tail due to the influence of extreme scores, while the median remains a better indicator of the center for skewed data.

Variability

• Range = X_{\max} - X_{\min} – the difference between the highest and lowest scores in a distribution. It is a simple measure of variability but highly sensitive to outliers.

• Sample standard deviation S = \sqrt{\frac{\sum (X - \bar{X})^2}{N-1}} – roughly the typical deviation of scores from the mean. This is the most common measure of spread, indicating the typical distance that scores fall from the mean. A larger standard deviation indicates greater variability in the data. The denominator (N-1) is used for the sample standard deviation so that the squared spread is an unbiased estimate of the population variance (Bessel's correction).
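The formula can be checked against `statistics.stdev`, which uses the same N-1 denominator (the six scores are made up):

```python
from math import sqrt
from statistics import stdev

def sample_sd(scores):
    """S = sqrt(sum((X - mean)^2) / (N - 1)), the sample standard deviation."""
    n = len(scores)
    xbar = sum(scores) / n  # sample mean
    return sqrt(sum((x - xbar) ** 2 for x in scores) / (n - 1))

scores = [4, 8, 6, 5, 3, 7]
print(round(sample_sd(scores), 3))  # 1.871
print(round(stdev(scores), 3))      # 1.871 -- same N-1 formula
```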

Descriptive vs. Inferential Statistics

• Descriptive – summarize sample. These are techniques used to organize, summarize, and describe characteristics of a data set, usually a sample. They provide a clear and concise picture of the data at hand (e.g., calculating means, standard deviations, creating histograms).

• Inferential – use sample to estimate/decide about population. These are techniques that allow researchers to make generalizations about a larger population based on data from a sample. They involve hypothesis testing and confidence intervals to draw conclusions beyond the immediate data (e.g., determining if a sample difference reflects a real population difference).

Point Estimates & Sampling Error

• Point estimate – single-value statistic from a sample (e.g., a 546-word difference in means). A single computed value from a sample that is used to estimate a population parameter.

• Population parameter – unknown "true" value. This is the true, but usually unknown, characteristic of the entire population (e.g., the true average speaking rate of all women).

• Sampling variability – different random samples → different point estimates. Because samples are rarely perfectly identical to the population, different random samples drawn from the same population will yield slightly different point estimates. This natural, expected variation is known as sampling error.

• Most are close to truth; some are far off (outliers). The distribution of sample means (the sampling distribution) tends to cluster around the true population mean, but a few samples might, by chance, produce estimates far from the true value.

Margin of Error & Confidence Intervals

• Margin of error (MoE) – radius around point estimate likely to include parameter. It is a statistic expressing the amount of random sampling error in a survey's results. It represents how much the sample results are likely to differ from the true population value.

• Generic formula: \text{MoE} = z_{\alpha/2} \times \frac{\sigma}{\sqrt{N}} (when \sigma is known). Here, z_{\alpha/2} is the critical z-score corresponding to the desired confidence level (e.g., 1.96 for a 95% CI), \sigma is the population standard deviation (or its estimate, the sample standard deviation), and N is the sample size. This formula shows that MoE decreases with larger sample sizes and smaller population variability.

• Confidence interval (CI)

• \text{CI} = \text{Point estimate} \pm \text{MoE}. This range is calculated from the sample data and is likely to contain the unknown population parameter.

• 95% CI captures the parameter in 95% of hypothetical repeated samples. This means that if you were to draw many (e.g., 100) random samples from the same population and calculate a 95% CI for each, approximately 95 of those intervals would contain the true population parameter. It is not the probability that the current interval contains the parameter.

• 99% CI widens the interval → higher confidence, less precision. To be more confident that the interval contains the true parameter, the interval must be wider, thus sacrificing precision.
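The MoE and CI formulas above combine into a short sketch (the sample mean, SD, and N are invented):

```python
import math

def margin_of_error(sigma, n, z=1.96):
    """MoE = z * sigma / sqrt(N); z = 1.96 corresponds to a 95% CI."""
    return z * sigma / math.sqrt(n)

def confidence_interval(xbar, sigma, n, z=1.96):
    """CI = point estimate +/- MoE."""
    moe = margin_of_error(sigma, n, z)
    return (xbar - moe, xbar + moe)

# Hypothetical sample: mean 100, SD 15, N = 225.
low, high = confidence_interval(100, 15, 225)
print(round(low, 2), round(high, 2))  # 98.04 101.96
```

Passing z = 2.576 instead of 1.96 gives the wider 99% interval described above.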

• Effect of N

• Larger N \Rightarrow smaller \frac{\sigma}{\sqrt{N}} \Rightarrow narrower MoE. As sample size increases, the variability of sample means decreases, leading to more precise estimates of the population parameter.

• Example: N = 1000, MoE \pm 3\% vs. N = 10, MoE \pm 30\% for same proportion. This highlights the crucial role of sample size in the precision and representativeness of a claim.
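The \pm 3\% vs. \pm 30\% contrast can be reproduced with the proportion version of the MoE formula, in which \sigma is replaced by \sqrt{p(1-p)} (using p = 0.5, the worst case):

```python
import math

def proportion_moe(p, n, z=1.96):
    """95% MoE for a sample proportion: z * sqrt(p * (1 - p) / N)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (10, 100, 1000):
    pts = round(proportion_moe(0.5, n) * 100, 1)
    print(f"N = {n:4d}: +/- {pts} percentage points")
# N =   10: +/- 31.0 percentage points
# N =  100: +/- 9.8 percentage points
# N = 1000: +/- 3.1 percentage points
```

Note the square-root relationship: multiplying N by 100 shrinks the MoE by only a factor of 10.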

Applied Example – Mehl et al. (2007) “Do women talk more?”

• Participants: 210 women (M = 16,215 words, SD = 7,301); 186 men (M = 15,669 words, SD = 8,633).

• Difference in means: 16,215 - 15,669 = 546 words. This is the point estimate of the difference in daily word count between women and men in this specific sample.

• 95% MoE:

• Women: \pm 987 words

• Men: \pm 1,240 words

• Overlapping CIs → cannot claim women definitively talk more; difference may be sampling error. The confidence intervals for women's average word count and men's average word count largely overlap. This overlap suggests that the observed difference of 546 words in the sample could plausibly be due to random sampling variability, and we cannot confidently conclude a true difference in the population.
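The overlap check uses only the numbers reported above; a quick sketch:

```python
# Point estimates and 95% MoEs from the Mehl et al. (2007) summary above.
women_ci = (16215 - 987, 16215 + 987)    # (15228, 17202)
men_ci   = (15669 - 1240, 15669 + 1240)  # (14429, 16909)

# Two intervals overlap when each lower bound sits below the other's upper bound.
overlap = women_ci[0] < men_ci[1] and men_ci[0] < women_ci[1]
print(overlap)  # True -- the 546-word gap could be sampling error
```

Because each group's interval contains the other group's point estimate, the sample difference is well within the range attributable to sampling variability.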

Statistical Validity of Frequency Claims

• Ask: What is the sample size? (bigger → smaller MoE). A larger sample size generally leads to a smaller margin of error, indicating a more precise estimate of the population frequency.

• What is the MoE? (smaller → more precise). The margin of error directly quantifies the precision of the frequency claim. A smaller MoE means the sample proportion is a more reliable estimate of the population proportion.

• Frequency claims without MoE can mislead ("59% of Canadians…" means little unless you know \pm 3\% or \pm 30\%). Without a margin of error, it is impossible to gauge the reliability or precision of a frequency claim, making it difficult to interpret or generalize the reported percentage.

Practical & Philosophical Implications

• Ethics: Oversampling minority groups can improve equity by ensuring their voices are adequately represented in research, but demands careful weighting to avoid misrepresentation when generalizing to the overall population. Without proper weighting, such samples could skew overall population estimates.

• Real-world relevance: High ecological validity & replication build public trust in psychological science. When research findings are relevant to real-world problems and can be consistently reproduced, the public and policymakers are more likely to accept and apply the scientific insights.

• Trade-offs: Cost vs. representativeness (probability sampling is expensive and time-consuming; convenience sampling is cheap but risks bias). Researchers must balance the ideal of highly representative samples with practical constraints such as budget, time, and access to participants.

• Designing for generalization mode requires thoughtful sampling, prioritizing representative methods like probability sampling; theory-testing can focus resources on internal validity by ensuring rigorous experimental control and manipulation, even if the sample itself is a less representative convenience sample.