Key Concepts: Individuals, Variables, Measurement, Study Types, and Sampling
Individuals vs Variables
Individuals (or objects) are the people or items included in the study.
A variable is a characteristic of the individual that we will measure.
Example: in a group of students, each specific student is an individual; their height, hair color, and GPA are examples of variables.
Common mistakes: using “variance” when you mean “variable.”
Quantitative vs Qualitative
Quantitative variables deal with numbers; focus on quantity. Examples: height, weight, GPA, time.
Qualitative variables (also called categorical) are not inherently numerical; they describe qualities. Examples: color, favorite ice cream.
Some qualitative data can be represented with numbers (labels), e.g., social security numbers or jersey numbers, but they are labels, not quantities to be mathematically operated on.
A simple heuristic: qualitative observations use senses/labels; quantitative observations are measured numbers.
Population vs Sample
Population data come from every individual of interest; they are all-encompassing.
Sample data come from only some individuals of the entire population; a subset.
Population parameter: a numerical measure that describes an aspect of the population; reports the true characteristic of the population.
Sample statistic: a numerical measure that describes an aspect of the sample; provides an estimate of the population parameter, with uncertainty that can be quantified by a margin of error and a confidence interval.
Key reminder: a sample is a subset of your population.
If confused, remember the initials: population parameter vs sample statistic.
Picture idea: population → sample → inferences about the population based on the sample.
Course structure in context: Part I focuses on describing populations (complete information); Part II leans toward inferential statistics (drawing conclusions about a population from samples).
Real-world nuance: in practice, we usually don’t know the entire population’s behavior; we take samples to infer about the larger population.
Terminology examples:
Population measurable quantity = parameter (e.g., population mean μ, population standard deviation σ).
Sample measurable quantity = statistic (e.g., sample mean \bar{X}, sample standard deviation s).
A statistic is a point estimate of the population parameter; a margin of error and a confidence interval built around it quantify how uncertain that estimate is.
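A minimal sketch of the parameter/statistic distinction, assuming a made-up population of 1,000 student heights (the data, the seed, and the sample size of 30 are illustrative choices, not from the course). It computes μ and σ from the whole population and \bar{X} and s from one sample.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of 1,000 students.
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameters: computed from every individual in the population.
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population std. dev. (divides by N)

# Statistics: computed from a sample, used to estimate the parameters.
sample = rng.sample(population, 30)
x_bar = statistics.fmean(sample)       # sample mean
s = statistics.stdev(sample)           # sample std. dev. (divides by n - 1)

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Rerunning with a different seed changes \bar{X} and s from sample to sample, while μ and σ stay fixed for a given population.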
Levels of Measurement (Nominal, Ordinal, Interval, Ratio)
Nominal level: names, labels, or categories with no implied order.
Ordinal level: categories that can be ordered; differences between data values are not necessarily meaningful.
Interval level: differences between values are meaningful; data are ordered; there is no true zero.
Ratio level: has all the properties of interval data plus a true zero; allows multiplication/division (ratios are meaningful).
Key features:
Ordinal, Interval, and Ratio are ordered (ordinal is the basic ordering; interval adds meaningful differences; ratio adds meaningful zero).
Interval and Ratio are quantitative; Nominal and Ordinal are qualitative.
True zero concept: a true zero means a zero value indicates the absence of the quantity (enables multiplication/division).
Example: Temperature: interval data (e.g., Celsius, Fahrenheit) — not meaningful to say "twice as hot".
Example: Height, Weight, Time: ratio data — "twice as tall" or "half the weight" are meaningful.
Hierarchy note: the four levels form a progression of structure and information from least to most informative for numerical operations.
Mathematical associations:
For a dataset X with n observations X_1, \dots, X_n, a common statistic is the mean, \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, which requires at least interval data to be meaningful.
Recap mapping:
Nominal and Ordinal = qualitative levels.
Interval and Ratio = quantitative levels.
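The interval vs ratio distinction can be checked with a little arithmetic. The sketch below uses made-up values (20 and 10 degrees Celsius, 180 and 90 cm) and shows that a temperature "ratio" changes with the unit, while a height ratio does not. Function and variable names are illustrative.

```python
import math

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def cm_to_in(cm):
    """Convert centimeters to inches."""
    return cm / 2.54

# Interval data: differences are meaningful, ratios are not.
warm_c, cool_c = 20.0, 10.0
diff_c = warm_c - cool_c                    # 10 Celsius degrees
diff_f = c_to_f(warm_c) - c_to_f(cool_c)    # 18 Fahrenheit degrees, same physical gap
ratio_c = warm_c / cool_c                   # 2.0 in Celsius
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)   # 68 / 50 = 1.36 in Fahrenheit

# Ratio data: a true zero makes the ratio independent of the unit.
tall_cm, short_cm = 180.0, 90.0
ratio_cm = tall_cm / short_cm
ratio_in = cm_to_in(tall_cm) / cm_to_in(short_cm)

print(f"temperature ratio: {ratio_c:.2f} (C) vs {ratio_f:.2f} (F)")
print(f"height ratio: {ratio_cm:.2f} (cm) vs {ratio_in:.2f} (in)")
assert math.isclose(ratio_cm, ratio_in)
```

Because the zero point of Celsius and Fahrenheit is arbitrary, "twice as hot" depends on the scale chosen; "twice as tall" does not.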
Observational Studies vs Experiments
Observational study: observe and measure individuals without applying any treatment or deliberate intervention; external factors may influence results.
Experiments: deliberately impose a treatment on individuals to observe a possible change in the response or measured variable.
Observational studies support weaker causal conclusions because outside factors are not controlled and may explain the observed results.
Experiments aim for stronger causal conclusions by controlling the treatment and other variables.
In experiments, you often see a treatment group and a control group.
Placebo effect: when individuals respond to the belief that they are receiving treatment rather than to the treatment itself; for example, control-group members given an inert placebo may still show improved outcomes.
Double-blind experiment: neither the participants nor the researchers know who is receiving the treatment; helps prevent bias in treatment administration and assessment.
Medical trials commonly use double-blind designs when feasible to avoid bias.
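One way to picture a treatment group, a control group, and blinding is the sketch below. It assumes random assignment to groups (not spelled out in these notes, but the usual practice) and uses made-up participant IDs and kit codes; it is an illustration, not a trial protocol.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical participant IDs P01..P20.
participants = [f"P{i:02d}" for i in range(1, 21)]

# Random assignment: shuffle, then split in half.
shuffled = participants[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
assignment = {pid: "treatment" for pid in shuffled[:half]}
assignment.update({pid: "control" for pid in shuffled[half:]})

# Blinding: participants and assessing researchers see only an opaque kit code.
kit_numbers = rng.sample(range(1000, 10000), len(participants))
blinded_view = {pid: f"KIT-{num}" for pid, num in zip(participants, kit_numbers)}

# The key linking kit codes to groups is held back until the data are collected.
unblinding_key = {blinded_view[pid]: assignment[pid] for pid in participants}

for pid in participants[:3]:
    print(pid, blinded_view[pid])  # no group label is visible here
```

Keeping `unblinding_key` separate from `blinded_view` is what makes the design double-blind in this sketch: neither the participant nor the person recording outcomes can tell the groups apart.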
Types of Studies and How They Relate to Data
Observational studies can reveal associations but support weaker causal inference, because outside factors are not controlled.
Experiments support stronger causal inference, because the researcher controls the treatment and compares it against a control group.
The choice of study design affects how confidently we can attribute observed effects to the treatment rather than external factors.
Sampling: Concepts and Methods
Simple random sample (SRS): a subset of n individuals chosen from the population so that every possible sample of size n has an equal chance of being selected; ignores any attributes of individuals (no bias toward characteristics).
Stratified sampling: divide the population into distinct subgroups (strata) based on characteristics (e.g., age, income, education); draw random samples from each stratum; useful for comparisons across strata and when you want to ensure representation from each subgroup.
Note: sample sizes from strata do not have to be equal, but equal or proportional sizes often make comparisons easier.
Cluster sampling: divide the population into preexisting segments or clusters (often geographic); randomly select some clusters and include every member of the selected clusters in the sample.
Difference from stratified: stratified sampling draws a random sample from every group (stratum); cluster sampling selects only some groups (clusters) and includes every member of the selected clusters.
Systematic sampling: number all population members, pick a random starting point, and then select every k-th member (a fixed interval) thereafter.
Example pattern: start at position 11 and pick every 6th person (11, 17, 23, 29, 35, …).
Multistage sampling: combine multiple sampling methods across stages to form a final sample; often ends with clusters selected in the last stage.
Convenience sampling: select individuals who are readily available (e.g., the first 30 people who walk by or people in a room); highly convenient but biased and often not representative of the population.
Caveat: convenience samples make generalizations about the population very weak due to bias.
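A minimal sketch of simple random and systematic sampling, assuming a made-up population of IDs 1 to 200. The helper name `systematic_sample` is illustrative, and picking the random start within the first k members is one common convention, assumed here.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

population = list(range(1, 201))  # hypothetical member IDs 1..200

# Simple random sample: every subset of size 30 is equally likely.
srs = rng.sample(population, 30)

def systematic_sample(members, k, rng):
    """Pick a random start among the first k members, then every k-th member."""
    start = rng.randrange(k)  # index 0..k-1
    return members[start::k]

systematic = systematic_sample(population, 6, rng)

# The pattern from the notes: start at position 11, then every 6th person.
pattern = population[10::6][:5]

print("SRS (first 5):", srs[:5])
print("Systematic (first 5):", systematic[:5])
print("Start at 11, step 6:", pattern)
```

`random.sample` draws without replacement, so no individual appears twice in the simple random sample.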
Quick recap of the methods:
Simple Random Sampling
Stratified Sampling
Cluster Sampling
Systematic Sampling
Multistage Sampling
Convenience Sampling
Practical notes on the spreadsheet example used in class:
Population of 200 individuals broken into chunks of 10 to illustrate geographic partitioning for cluster sampling.
Right-hand column shows a secondary classification (e.g., political affiliation: Republican, Democrat, Other; or gender: male, female, other/decline to answer).
Simple random sample demonstration: 30 individuals selected randomly (greens indicate selected).
Stratified sampling demonstration: 30 individuals selected with 10 from each stratum (e.g., 10 from each quality group 1, 2, and 3); sample sizes across strata can be equal or unequal, but equal sizes often simplify analysis.
Cluster sampling demonstration: three entire blocks chosen; within each block, all individuals are included.
Systematic sampling demonstration: select every k-th member after a random start (e.g., step size of 6).
Convenience sampling demonstration: first 30 individuals recorded; highly convenient but very biased and not representative.
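The stratified and cluster demonstrations can be mimicked in code. The sketch below builds a stand-in for the class spreadsheet (the actual data are not reproduced here): 200 individuals in blocks of 10, each with a randomly assigned group label 1, 2, or 3. All field names are hypothetical.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(7)  # fixed seed so the run is repeatable

# Stand-in population: 200 people, block = chunk of 10 consecutive IDs,
# group = secondary label (1, 2, or 3) assigned in shuffled order.
labels = ([1, 2, 3] * 67)[:200]
rng.shuffle(labels)
people = [
    {"id": i, "block": (i - 1) // 10, "group": labels[i - 1]}
    for i in range(1, 201)
]

# Stratified sample: 10 people drawn at random from each group.
strata = defaultdict(list)
for person in people:
    strata[person["group"]].append(person)
stratified = [p for g in sorted(strata) for p in rng.sample(strata[g], 10)]

# Cluster sample: 3 whole blocks chosen at random, everyone in them included.
chosen_blocks = set(rng.sample(range(20), 3))
cluster = [p for p in people if p["block"] in chosen_blocks]

print("stratified counts by group:", dict(Counter(p["group"] for p in stratified)))
print("cluster blocks:", sorted(chosen_blocks), "size:", len(cluster))
```

Both samples have 30 people, but the stratified sample is guaranteed 10 from each group, while the group mix of the cluster sample depends on which blocks happen to be chosen.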
Key distinction between stratified and cluster sampling:
Stratified sampling: divide the population into meaningful groups (strata) and sample within each stratum; aims to compare or highlight differences between strata.
Cluster sampling: divide the population into clusters (often geographic) and sample whole clusters; one or more clusters are selected, and all individuals in those clusters are included in the sample.
Practical Implications and Takeaways
Inference in statistics relies on samples to draw conclusions about populations.
Different sampling methods have different biases and trade-offs in terms of cost, accuracy, and representativeness.
When planning a study, consider whether you need representation across subgroups (stratified) or efficiency through clusters (cluster), or a simple random approach (SRS).
Be aware of potential biases: systematic patterns, convenience bias, or non-representative strata/clusters can distort inferences.
Remember the core goal: use the sample to describe the population and, where appropriate, infer population characteristics with quantified uncertainty (margin of error, confidence intervals).
Quick References and Mnemonics
Parameter vs Statistic memory aid: Population Parameter (θ) vs Sample Statistic (\hat{θ}); population mean μ vs sample mean \bar{X}.
Levels of Measurement mnemonic: Nominal (names) → Ordinal (order) → Interval (differences meaningful) → Ratio (true zero).
Relationship reminder: Qualitative = Nominal/Ordinal; Quantitative = Interval/Ratio.
Double-blind importance: reduces bias in treatment effects and outcome assessment in experiments.
Systematic sampling pitfall: beware of cyclic patterns that may coincide with the sampling interval and introduce bias.
Convenience sampling caveat: easy but biases limit generalizability.
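The systematic sampling pitfall can be shown with a small made-up example: daily sales over 52 weeks with a weekend spike (values invented for illustration). The start position is fixed rather than random so the result is repeatable.

```python
# Hypothetical daily sales for 364 days (52 weeks): 100 on weekdays,
# 200 on the two weekend days (day-of-week index 5 and 6).
days = range(364)
sales = [200 if day % 7 in (5, 6) else 100 for day in days]
true_mean = sum(sales) / len(sales)  # (5 * 100 + 2 * 200) / 7

# Interval of 7 matches the weekly cycle: every pick lands on the same weekday.
every_7th = sales[5::7]
mean_7 = sum(every_7th) / len(every_7th)

# Interval of 6 drifts through the week, so weekdays and weekends both appear.
every_6th = sales[5::6]
mean_6 = sum(every_6th) / len(every_6th)

print(f"true mean:        {true_mean:.2f}")
print(f"every 7th (bias): {mean_7:.2f}")
print(f"every 6th:        {mean_6:.2f}")
```

With the interval equal to the cycle length, the sample contains only weekend days and badly overstates the mean; an interval that does not line up with the cycle lands close to the true value.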