QTM 100 Lecture 2 Notes: Describing Variables and Study Design
Purpose of lecture: Introduce how to describe variables, distinguish variable types, and compare observational studies versus experiments; cover sampling design, bias, and key experimental design concepts.
Describing Variables
A variable is any characteristic observed in a study.
Categorical variables
Observations belong to one of a set of categories.
Typically contain descriptive words or phrases.
Examples: gender (male/female), types of shipping (standard, firstClass, upsGround, etc.).
Quantitative variables
Observations take on numeric values.
For any data set, rows = observations and columns = variables.
Example data set (Mario Kart on eBay):
Variables include: obs, nBids, cond, startPr, totalPr, shipSp, wheels.
A sample row: 1 20 new 0.99 51.55 standard 1
The full data frame has 143 observations and 13 variables.
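As a minimal sketch of the rows-and-columns layout, the sample row above can be stored as a Python dictionary keyed by variable name. The helper `variable_type` is hypothetical and not part of the lecture. It applies a deliberately simple rule (numbers are quantitative, text is categorical), which is an assumption made only for illustration.

```python
# One observation (row) from the Mario Kart example, keyed by variable (column) name.
row = {
    "obs": 1,
    "nBids": 20,
    "cond": "new",
    "startPr": 0.99,
    "totalPr": 51.55,
    "shipSp": "standard",
    "wheels": 1,
}

def variable_type(value):
    """Hypothetical helper: guess a variable's type from one stored value."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "quantitative"
    return "categorical"

types = {name: variable_type(value) for name, value in row.items()}
for name, kind in types.items():
    print(f"{name}: {kind}")
```

The rule looks only at how a value is stored. A numerically stored column such as obs (a row identifier) is labeled quantitative here even though it measures nothing, so the final classification should follow the meaning of the variable rather than its storage type.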
Describing quantitative variables further
Quantitative variables can be discrete or continuous.
Discrete: a finite or countable set of separate possible values (e.g., counts like 0, 1, 2, …).
Continuous: a continuum of possible values (e.g., times, measurements).
Example illustrating continuous data: times for the 200m race at the Olympics (e.g., 1:54.27, 1:54.90, 1:56.22, etc.).
Describing categorical variables further
A categorical variable with two categories is dichotomous (e.g., gender: male/female).
A categorical variable with more than two categories is simply multi-category; some have a natural ordering (ordinal), e.g., year in college: freshman, sophomore, junior, senior.
Response vs explanatory variables (in data analysis)
Both can be categorical or quantitative.
There is an association between the response and an explanatory variable when the distribution of the response changes with the value of the explanatory variable.
If no association is present, the response and explanatory variables are independent.
Quick conceptual notes
An association does not imply causation in observational studies.
The way data are collected (the design) influences what conclusions are valid.
Types of Studies
Types of studies distinguish how data are collected and whether treatments are assigned.
Observational studies
Researchers observe the response and the explanatory variable without assigning a treatment.
Non-experimental; potential confounding variables can affect results.
Association does not equal causation in general.
Experiments
Subjects are assigned to experimental conditions (treatments) and then the response is observed.
Experimental conditions are called treatments.
A key advantage: stronger potential to infer causation due to randomization and control of variables.
Observational Studies: Types
Prospective studies (cohort)
Identify a group and observe them over a period of time; characteristics recorded along the way.
Retrospective studies (case-control)
Look back in time or use existing records.
Cross-sectional studies
Collect information about individuals at a single point in time or over a very short period.
Conducting an Experiment vs Observational Study (examples)
Research question example: does exercise improve energy level?
Observational approach: select people with/without exercise habits and measure energy level.
Experimental approach: randomly assign subjects to exercise vs no exercise and compare energy levels.
General takeaway: experimental designs can better isolate causal effects than observational designs.
Sampling Design and Generalizability
Goal of sampling: obtain a representative sample so that inferences generalize to the population.
If the sample is not representative, results may be biased.
Bias sources in observational studies
Sampling bias / coverage bias: the sampling frame does not represent the entire population; sampling may not be random.
Undercoverage: some groups are underrepresented.
Nonresponse bias: those who participate differ from those who do not participate; missing data may arise if respondents skip questions.
Response bias: participants provide inaccurate answers or are influenced by question wording.
Practical example: survey bias illustrated via media/press coverage of a study; headlines might mislead about causation; emphasis on correct interpretation (association vs causation).
Sampling Methods (how to select individuals)
Random sampling methods
Simple random sample: every individual has an equal chance of being selected, so the selection process itself does not favor any group; samples drawn this way tend to be representative.
Stratified sample: population divided into strata; random sample drawn from within each stratum; useful for ensuring representation of specific groups.
Cluster sample: natural groups (clusters) are sampled; then individuals within chosen clusters are sampled; good when a reliable sampling frame is unavailable.
Non-random sampling methods
Volunteer sample: participants volunteer to join.
Convenience sample: obtain subjects who are easy to recruit.
Non-random designs are more prone to sampling bias/undercoverage.
Quick summaries of methods:
Simple Random Sample: each individual equally likely to be sampled; generally most representative.
Cluster Sample: sample clusters first, then sample within clusters.
Stratified Sample: ensure representation across key subgroups by stratifying first.
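The three random sampling methods can be sketched with Python's standard `random` module. The population, the field names (`id`, `year`, `dorm`), and the function names are all hypothetical and chosen only for illustration; this is a sketch under those assumptions, not a survey-grade implementation.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 12 students, each with an id, a class year, and a dorm.
years = ["freshman", "sophomore", "junior", "senior"]
dorms = ["A", "B", "C"]
population = [
    {"id": i, "year": years[i % 4], "dorm": dorms[i % 3]}
    for i in range(12)
]

def simple_random_sample(units, n):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum):
    """Split units into strata by `key`, then sample randomly within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, n_per_cluster):
    """Randomly choose clusters, then sample units within each chosen cluster."""
    clusters = {}
    for unit in units:
        clusters.setdefault(unit[key], []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample

srs = simple_random_sample(population, 4)
strat = stratified_sample(population, "year", 1)   # one student per class year
clust = cluster_sample(population, "dorm", 2, 2)   # two dorms, two students each
```

The stratified sample always contains every class year, while the cluster sample leaves one dorm out entirely. That is the practical difference between the two designs: strata are all represented, clusters are themselves sampled.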
Example Scenarios and Questions (practice concepts)
Example: three-variable dataset related to hiring rates (Simpson’s Paradox) used to illustrate that the direction of an association can change after conditioning on a third variable.
Observational headlines vs true results: caution in interpreting causal claims from observational data; must consider confounding variables and potential biases.
Quick practice questions from the slides:
Question: In the helper vs hinder study, where 14 out of 16 (87.5%) infants prefer the helper toy, what type of variable was studied?
Answer: C. categorical, dichotomous (two categories: helper vs hinder).
Question: An experiment regarding the physiological cost of reproduction on male fruit flies includes multiple variables; how many quantitative variables does this dataset contain?
Answer (based on the slide): D. 3 (lifespan, thorax length, sleep).
Response vs Explanatory Variables and Data Relationships
Response variable (outcome) can be categorical or quantitative.
Explanatory variable (predictor) can be categorical or quantitative.
Association between response and explanatory variables may exist even in the absence of a causal relationship in observational data.
If a third variable explains both the explanatory and response variables, this is a confounding variable.
Experimental Design: Core Concepts
Key concepts in experimental design
Control: compare treatment of interest to a control group.
Randomize: randomly assign subjects to treatment and control groups to balance characteristics.
Replicate: use a large enough sample size or replicate the study to ensure reliability.
Block: account for known or suspected variables that affect the response; group units into blocks and assign treatments within blocks.
Random assignment benefits
Helps determine whether an intervention was effective by comparing treatment vs control.
Balances other characteristics across groups, reducing confounding.
Supports causal inference if well designed.
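Random assignment can be sketched as shuffling the subjects and splitting the shuffled list in half. The function name `randomly_assign` and the subject labels are hypothetical; for the exercise example, "treatment" would correspond to the exercise group and "control" to the no-exercise group.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle subjects and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

subjects = [f"subject_{i}" for i in range(1, 11)]  # hypothetical subject labels
groups = randomly_assign(subjects, seed=42)
print(groups["treatment"])
print(groups["control"])
```

Because chance alone decides who lands in each group, characteristics such as age or baseline fitness tend to balance out across the groups as the number of subjects grows.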
Terminology
Placebo: fake treatment used as control.
Placebo effect: change observed due to belief in treatment.
Blinding: experimental units do not know which group they are in.
Double-blind: neither experimental units nor researchers know group assignments.
Multifactor experiments
Explanatory variables may be multiple factors (treatment conditions).
Example: antidepressants and nicotine patches in a four-group, cross-classified design:
1) antidepressant + nicotine patch
2) antidepressant + placebo patch
3) placebo antidepressant + nicotine patch
4) placebo antidepressant + placebo patch
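The four groups are every combination of the levels of two factors. A short sketch using `itertools.product` enumerates them; the factor labels `pill` and `patch` are names assumed here for illustration.

```python
from itertools import product

# Two factors, each with two levels, as in the example above.
factors = {
    "pill": ["antidepressant", "placebo antidepressant"],
    "patch": ["nicotine patch", "placebo patch"],
}

# Cross every level of one factor with every level of the other.
treatments = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(f"{number}) {treatment['pill']} + {treatment['patch']}")
```

The loop prints the same four groups in the same order as the list above. Adding a third factor or an extra level only requires another entry in `factors`.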
Blocking vs factors
Factors: experimental conditions imposed on units (e.g., treatment vs control).
Blocking variables: characteristics of units that you want to control for.
Blocking in experiments is analogous to stratification in observational studies when sampling.
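A minimal sketch of a blocked design: units are first grouped by the blocking variable, and the treatment factor is then randomized separately inside each block. The function `block_randomize`, the `age_group` blocking variable, and the eight units are hypothetical.

```python
import random

def block_randomize(units, block_key, seed=None):
    """Randomly assign treatment/control separately within each block."""
    rng = random.Random(seed)

    # Group units into blocks by the blocking variable.
    blocks = {}
    for unit in units:
        blocks.setdefault(unit[block_key], []).append(unit)

    # Randomize within each block so both groups appear in every block.
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit["id"]] = "treatment"
        for unit in shuffled[half:]:
            assignment[unit["id"]] = "control"
    return assignment

# Hypothetical units: 8 subjects blocked on age group.
units = [
    {"id": i, "age_group": "under 40" if i < 4 else "40 and over"}
    for i in range(8)
]
assignment = block_randomize(units, "age_group", seed=1)
```

Each age group ends up with two treated and two control subjects, so age group cannot differ between the treatment and control groups.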
How to Analyze and Interpret Study Designs
Experimental studies
Random assignment aims to limit confounding and support causal conclusions.
Control groups and blinding reduce bias and placebo effects.
Observational studies
Can reveal associations and real-world behavior but are more susceptible to confounding; causation is harder to establish.
Simpson’s Paradox (illustrative problem)
A relationship that appears in aggregated data can reverse when data are stratified by a confounding factor (e.g., gender, field of study, or other grouping).
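The reversal is easiest to see with numbers. The hiring counts below are invented for illustration and are not the data from the lecture slides; the department names and the `rate` and `overall` helpers are likewise hypothetical.

```python
# Hypothetical hiring counts (invented for illustration): (hired, applicants).
data = {
    "Department A": {"men": (80, 100), "women": (9, 10)},
    "Department B": {"men": (2, 10), "women": (30, 100)},
}

def rate(hired, applicants):
    """Hiring rate as a proportion."""
    return hired / applicants

# Within each department, compare hiring rates.
for dept, groups in data.items():
    men, women = rate(*groups["men"]), rate(*groups["women"])
    print(f"{dept}: men {men:.1%}, women {women:.1%}")

def overall(group):
    """Aggregate over departments by summing counts, not by averaging rates."""
    hired = sum(groups[group][0] for groups in data.values())
    applicants = sum(groups[group][1] for groups in data.values())
    return rate(hired, applicants)

print(f"Overall: men {overall('men'):.1%}, women {overall('women'):.1%}")
```

In these made-up counts, women have the higher hiring rate within each department (90% vs 80% and 30% vs 20%), yet the lower rate overall (39 of 110 vs 82 of 110). The reversal comes from most women applying to the department that hires few applicants of either gender, so department acts as the confounding variable.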
Practical Examples and Recaps
Example: data collection on Mario Kart eBay listings uses a data frame with both quantitative variables (nBids, startPr, totalPr, and wheels, which is a count) and categorical variables (cond, shipSp).
Example: measuring IQ in children and spanking status illustrates how two variables (spanking and IQ) can be associated but may be confounded by age group (2-4 vs 5-9) or other variables; headlines can misrepresent association as causation.
Example of study design choices:
To investigate eye color and emotional sensitivity, an experimental study would be inappropriate because eye color cannot be assigned as a treatment; stratified sampling or simple random sampling among eye color groups would be more appropriate to study the association in an observational framework.
Quick Reference: Glossary Highlights
Observational study: researchers observe without assigning treatments; potential confounding.
Experimental study: researchers assign treatments and observe outcomes; stronger causal claims possible.
Randomization: random assignment to groups to balance known and unknown factors.
Blocking: grouping similar units and randomizing within blocks to control for blocking variables.
Placebo and placebo effect: control treatment and belief-driven response.
Blinding and double-blinding: reduce bias from participants and researchers.
Confounding variable: a third variable that influences both explanatory and response variables, creating a spurious association.
Prospective (cohort): follow a group over time.
Retrospective (case-control): look back at past data.
Cross-sectional: snapshot at a single point in time.
Simpson’s Paradox: aggregated association can reverse within strata or subgroups.
Key Equations and Notation (illustrative)
Proportion and percent example from a small study:
If 14 of 16 infants prefer the helper toy, the proportion is $\frac{14}{16} = 0.875 = 87.5\%$.
Example data frame notations (variables in Mario Kart dataset):
Observations: rows
Variables: columns such as obs, nBids, cond, startPr, totalPr, shipSp, wheels.
Example values: nBids = 20, cond = 'new', startPr = 0.99, totalPr = 51.55, shipSp = 'standard', wheels = 1.
Example terminologies in an experimental design context:
Treatments: the different conditions applied to experimental units.
Blocking variable example: gender, age group, baseline health status.
Cross-classified factors: factorial design with multiple factors and their levels.