QTM 100 Lecture 2 Notes: Describing Variables and Study Design

  • Purpose of lecture: Introduce how to describe variables, distinguish variable types, and compare observational studies versus experiments; cover sampling design, bias, and key experimental design concepts.

Describing Variables

  • A variable is any characteristic observed in a study.

  • Categorical variables

    • Observations belong to one of a set of categories.

    • Typically contain descriptive words or phrases.

    • Examples: gender (male/female), types of shipping (standard, firstClass, upsGround, etc.).

  • Quantitative variables

    • Observations take on numeric values.

  • For any data set, rows = observations and columns = variables.

    • Example data set (Mario Kart on eBay):

    • Variables include: obs, nBids, cond, startPr, totalPr, shipSp, wheels.

    • A sample row: 1 20 new 0.99 51.55 standard 1

    • The full data frame has 143 observations (rows) and 13 variables (columns); only some of the variables are listed above.

  • Describing quantitative variables further

    • Quantitative variables can be discrete or continuous.

    • Discrete: possible values form a set of separate, countable values, typically counts (0, 1, 2, …).

    • Continuous: a continuum of possible values (e.g., times, measurements).

    • Example illustrating continuous data: times for the 200m race at the Olympics (e.g., 1:54.27, 1:54.90, 1:56.22, etc.).

  • Describing categorical variables further

    • A categorical variable with two categories is dichotomous (e.g., gender: male/female).

    • A categorical variable with more than two categories is simply multi-category; some have a natural ordering (ordinal), e.g., year in college: freshman, sophomore, junior, senior.

  • Response vs explanatory variables (in data analysis)

    • Both can be categorical or quantitative.

    • The response and an explanatory variable are associated when the distribution of the response differs across values of the explanatory variable.

    • If no association is present, the response and explanatory variables are independent.

  • Quick conceptual notes

    • An association does not imply causation in observational studies.

    • The way data are collected (the design) influences what conclusions are valid.
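
As a rough illustration of the variable types above, the sketch below stores the sample Mario Kart row from these notes as a Python dictionary and labels each variable by how its value is stored. The function name variable_type and its rule are assumptions made for this example, not part of the lecture. Storage type is only a heuristic: obs is really a row label, and a variable coded with numbers can still be categorical.

```python
# One observation (row) from the Mario Kart example in the notes.
row = {
    "obs": 1,
    "nBids": 20,
    "cond": "new",
    "startPr": 0.99,
    "totalPr": 51.55,
    "shipSp": "standard",
    "wheels": 1,
}


def variable_type(value):
    """Hypothetical rule of thumb: numbers are quantitative, text is categorical."""
    if isinstance(value, bool):
        # True/False behaves like a two-category (dichotomous) variable.
        return "categorical"
    if isinstance(value, int):
        return "quantitative (discrete)"
    if isinstance(value, float):
        return "quantitative (continuous)"
    return "categorical"


# Columns are variables: label each one based on the value in this row.
types = {name: variable_type(value) for name, value in row.items()}
for name, kind in types.items():
    print(f"{name:8s} {kind}")
```

A real analysis would check each label against what the variable means rather than trusting the stored type.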

Types of Studies

  • Types of studies distinguish how data are collected and whether treatments are assigned.

  • Observational studies

    • Researchers observe the response and the explanatory variable without assigning a treatment.

    • Non-experimental; potential confounding variables can affect results.

    • Association does not equal causation in general.

  • Experiments

    • Subjects are assigned to experimental conditions (treatments) and then the response is observed.

    • Experimental conditions are called treatments.

    • A key advantage: stronger potential to infer causation due to randomization and control of variables.

Observational Studies: Types

  • Prospective studies (cohort)

    • Identify a group and observe them over a period of time; characteristics recorded along the way.

  • Retrospective studies (case-control)

    • Look back in time or use existing records.

  • Cross-sectional studies

    • Collect information about individuals at a single point in time or over a very short period.

Conducting an Experiment vs Observational Study (examples)

  • Research question example: does exercise improve energy level?

    • Observational approach: select people with/without exercise habits and measure energy level.

    • Experimental approach: randomly assign subjects to exercise vs no exercise and compare energy levels.

  • General takeaway: experimental designs can better isolate causal effects than observational designs.
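
The experimental approach in the exercise example can be sketched as follows. This is a minimal sketch, assuming 20 hypothetical subjects and an even split into two groups; the function name randomly_assign, the subject labels, and the seed are all made up for illustration.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"exercise": shuffled[:half], "no exercise": shuffled[half:]}


# Hypothetical subject labels; a real study would use its own roster.
subjects = [f"subject_{i}" for i in range(1, 21)]
groups = randomly_assign(subjects, seed=100)

for treatment, members in groups.items():
    print(treatment, len(members), members[:3])
```

Because chance alone decides who exercises, characteristics such as age or diet tend to balance across the two groups, which is what separates this design from simply comparing people who already exercise with people who do not.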

Sampling Design and Generalizability

  • Goal of sampling: obtain a representative sample so that inferences generalize to the population.

  • If the sample is not representative, results may be biased.

  • Bias sources in observational studies

    • Sampling bias / coverage bias: the sampling frame does not represent the entire population; sampling may not be random.

    • Undercoverage: some groups are underrepresented.

    • Nonresponse bias: those who participate differ from those who do not participate; missing data may arise if respondents skip questions.

    • Response bias: participants provide inaccurate answers or are influenced by question wording.

  • Practical example: survey bias illustrated via media/press coverage of a study; headlines might mislead about causation; emphasis on correct interpretation (association vs causation).

Sampling Methods (how to select individuals)

  • Random sampling methods

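    • Simple random sample: every individual (and every possible sample of a given size) has an equal chance of being selected; unbiased by design, although any single sample can still differ from the population by chance.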

    • Stratified sample: population divided into strata; random sample drawn from within each stratum; useful for ensuring representation of specific groups.

    • Cluster sample: natural groups (clusters) are sampled; then individuals within chosen clusters are sampled; good when a reliable sampling frame is unavailable.

  • Non-random sampling methods

    • Volunteer sample: participants volunteer to join.

    • Convenience sample: obtain subjects who are easy to recruit.

  • Non-random designs are more prone to sampling bias/undercoverage.

  • Quick summaries of methods:

    • Simple Random Sample: each individual equally likely to be sampled; generally most representative.

    • Cluster Sample: sample clusters first, then sample within clusters.

    • Stratified Sample: ensure representation across key subgroups by stratifying first.
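
The three random sampling methods summarized above can be sketched with Python's standard random module. The population here is hypothetical: 24 students, each tagged with a class year (used as the stratum) and a dorm (used as the cluster). All function names and sample sizes are assumptions chosen for the example.

```python
import random
from collections import Counter

YEARS = ["freshman", "sophomore", "junior", "senior"]
DORMS = ["A", "B", "C"]

# Hypothetical population: 24 students, 6 per class year and 8 per dorm.
population = [
    {"id": i, "year": YEARS[i % 4], "dorm": DORMS[i % 3]}
    for i in range(24)
]


def group_by(units, key):
    """Collect units into lists keyed by the value of one variable."""
    groups = {}
    for unit in units:
        groups.setdefault(unit[key], []).append(unit)
    return groups


def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)


def stratified_sample(units, key, n_per_stratum, rng):
    """Draw a random sample from within every stratum."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(units, key, n_clusters, n_per_cluster, rng):
    """Randomly pick some clusters, then sample units within each one."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample


rng = random.Random(100)
srs = simple_random_sample(population, 6, rng)
stratified = stratified_sample(population, "year", 2, rng)
clustered = cluster_sample(population, "dorm", 2, 3, rng)

print("SRS years:       ", Counter(u["year"] for u in srs))
print("Stratified years:", Counter(u["year"] for u in stratified))
print("Cluster dorms:   ", Counter(u["dorm"] for u in clustered))
```

The stratified sample always contains two students from each class year, while the simple random sample may over- or under-represent a year by chance. The cluster sample only ever includes students from the two selected dorms.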

Example Scenarios and Questions (practice concepts)

  • Example: three-variable dataset related to hiring rates (Simpson’s Paradox) used to illustrate that the direction of an association can change after conditioning on a third variable.

  • Observational headlines vs true results: caution in interpreting causal claims from observational data; must consider confounding variables and potential biases.

  • Quick practice questions from the slides:

    • Question: In the helper vs hinder study, where 14 out of 16 (87.5%) infants prefer the helper toy, what type of variable was studied?

    • Answer: C. categorical, dichotomous (two categories: helper vs hinder).

    • Question: An experiment regarding the physiological cost of reproduction on male fruit flies includes multiple variables; how many quantitative variables does this dataset contain?

    • Answer (based on the slide): D. 3 (lifespan, thorax length, sleep).

Response vs Explanatory Variables and Data Relationships

  • Response variable (outcome) can be categorical or quantitative.

  • Explanatory variable (predictor) can be categorical or quantitative.

  • Association between response and explanatory variables may exist even in the absence of a causal relationship in observational data.

  • If a third variable explains both the explanatory and response variables, this is a confounding variable.

Experimental Design: Core Concepts

  • Key concepts in experimental design

    • Control: compare treatment of interest to a control group.

    • Randomize: randomly assign subjects to treatment and control groups to balance characteristics.

    • Replicate: use a large enough sample size or replicate the study to ensure reliability.

    • Block: account for known or suspected variables that affect the response; group units into blocks and assign treatments within blocks.

  • Random assignment benefits

    • Helps determine whether an intervention was effective by comparing treatment vs control.

    • Balances other characteristics across groups, reducing confounding.

    • Supports causal inference if well designed.

  • Terminology

    • Placebo: fake treatment used as control.

    • Placebo effect: change observed due to belief in treatment.

    • Blinding: experimental units do not know which group they are in.

    • Double-blind: neither experimental units nor researchers know group assignments.

  • Multifactor experiments

    • Explanatory variables may be multiple factors (treatment conditions).

    • Example: antidepressants and nicotine patches in a four-group, cross-classified design:
      1) antidepressant + nicotine patch
      2) antidepressant + placebo patch
      3) placebo antidepressant + nicotine patch
      4) placebo antidepressant + placebo patch

  • Blocking vs factors

    • Factors: experimental conditions imposed on units (e.g., treatment vs control).

    • Blocking variables: characteristics of units that you want to control for.

    • Blocking in experiments is analogous to stratification in observational studies when sampling.
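
The four-group design and the idea of randomizing within blocks can be combined in a short sketch. The two factors and their levels come from the antidepressant and nicotine patch example above; the 16 units, the age_group blocking variable, and the function name assign_within_blocks are hypothetical and only serve the illustration.

```python
import itertools
import random
from collections import Counter

# Two factors with two levels each give 2 x 2 = 4 treatment combinations.
drug_levels = ["antidepressant", "placebo antidepressant"]
patch_levels = ["nicotine patch", "placebo patch"]
treatments = list(itertools.product(drug_levels, patch_levels))

# Hypothetical experimental units with one blocking variable (age group).
units = [
    {"id": i, "age_group": "under 40" if i < 8 else "40 and over"}
    for i in range(16)
]


def assign_within_blocks(units, block_key, treatments, seed=None):
    """Shuffle units inside each block, then deal out treatments in turn."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(unit[block_key], []).append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        for position, unit in enumerate(shuffled):
            # Cycling through the treatments keeps each block balanced.
            assignment[unit["id"]] = treatments[position % len(treatments)]
    return assignment


assignment = assign_within_blocks(units, "age_group", treatments, seed=100)

# Count how many units in each block received each treatment.
balance = Counter(
    (unit["age_group"], assignment[unit["id"]]) for unit in units
)
for (block, treatment), count in sorted(balance.items()):
    print(block, treatment, count)
```

With 8 units per block and 4 treatments, every treatment appears exactly twice in each age group, so age group cannot be confused with a treatment effect.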

How to Analyze and Interpret Study Designs

  • Experimental studies

    • Random assignment aims to limit confounding and support causal conclusions.

    • Control groups and blinding reduce bias and placebo effects.

  • Observational studies

    • Can reveal associations and real-world behavior but are more susceptible to confounding; causation is harder to establish.

  • Simpson’s Paradox (illustrative problem)

    • A relationship that appears in aggregated data can reverse when data are stratified by a confounding factor (e.g., gender, field of study, or other grouping).
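
A small numeric sketch can make the reversal concrete. The counts below are hypothetical, invented only to produce the pattern; they are not the hiring data from the lecture slides. The department and gender labels are likewise assumptions.

```python
# Hypothetical (hired, applicants) counts, invented for illustration only.
counts = {
    ("Dept A", "men"): (80, 100),
    ("Dept A", "women"): (9, 10),
    ("Dept B", "men"): (2, 10),
    ("Dept B", "women"): (30, 100),
}


def rate(hired, applicants):
    """Hiring rate as a proportion."""
    return hired / applicants


# Rates within each department (the stratified view).
within = {key: rate(*value) for key, value in counts.items()}

# Rates after pooling the departments (the aggregated view).
overall = {}
for gender in ("men", "women"):
    hired = sum(h for (_, g), (h, _) in counts.items() if g == gender)
    applicants = sum(a for (_, g), (_, a) in counts.items() if g == gender)
    overall[gender] = rate(hired, applicants)

for (dept, gender), value in within.items():
    print(f"{dept} {gender:6s} {value:.1%}")
for gender, value in overall.items():
    print(f"Overall {gender:6s} {value:.1%}")
```

In these made-up counts, women have the higher hiring rate within each department, yet men have the higher rate overall (roughly 75% versus 35%). The reversal happens because most men applied to the department with the high hiring rate and most women applied to the one with the low rate, so department acts as the confounding variable.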

Practical Examples and Recaps

  • Example: data collection on Mario Kart eBay listings uses a data frame with both quantitative variables (nBids, startPr, totalPr, and wheels, a count of wheels included) and categorical variables (cond, shipSp).

  • Example: measuring IQ in children and spanking status illustrates how two variables (spanking and IQ) can be associated but may be confounded by age group (2-4 vs 5-9) or other variables; headlines can misrepresent association as causation.

  • Example of study design choices:

    • To investigate eye color and emotional sensitivity, an experimental study would be inappropriate; stratified sampling or simple random sampling among eye color groups would be more appropriate to study association in an observational framework.

Quick Reference: Glossary Highlights

  • Observational study: researchers observe without assigning treatments; potential confounding.

  • Experimental study: researchers assign treatments and observe outcomes; stronger causal claims possible.

  • Randomization: random assignment to groups to balance known and unknown factors.

  • Blocking: grouping similar units and randomizing within blocks to control for blocking variables.

  • Placebo and placebo effect: control treatment and belief-driven response.

  • Blinding and double-blinding: reduce bias from participants and researchers.

  • Confounding variable: a third variable that influences both explanatory and response variables, creating a spurious association.

  • Prospective (cohort): follow a group over time.

  • Retrospective (case-control): look back at past data.

  • Cross-sectional: snapshot at a single point in time.

  • Simpson’s Paradox: aggregated association can reverse within strata or subgroups.

Key Equations and Notation (illustrative)

  • Proportion and percent example from a small study:

    • If 14 of 16 infants prefer the helper toy, the proportion is $\frac{14}{16} = 0.875 = 87.5\%$.

  • Example data frame notations (variables in Mario Kart dataset):

    • Observations: rows

    • Variables: columns such as obs, nBids, cond, startPr, totalPr, shipSp, wheels.

    • Example values: nBids = 20, cond = 'new', startPr = 0.99, totalPr = 51.55, shipSp = 'standard', wheels = 1.

  • Example terminologies in an experimental design context:

    • Treatments: the different conditions applied to experimental units.

    • Blocking variable example: gender, age group, baseline health status.

    • Cross-classified factors: factorial design with multiple factors and their levels.

