Key Concepts: Individuals, Variables, Measurement, Study Types, and Sampling

Individuals vs Variables

  • Individuals (or objects) are the people or items included in the study.

  • A variable is a characteristic of the individual that we will measure.

  • Example: a group of students where a specific student is an individual; their height, hair color, GPA are examples of variables.

  • Common mistake: saying “variance” when you mean “variable.” Variance is a measure of spread, not a characteristic of an individual.

Quantitative vs Qualitative

  • Quantitative variables deal with numbers; focus on quantity. Examples: height, weight, GPA, time.

  • Qualitative variables (also called categorical variables) are not inherently numerical; they describe qualities. Examples: color, favorite ice cream.

  • Some qualitative data are recorded as numbers, e.g., Social Security numbers or jersey numbers, but these numbers are labels, not quantities; arithmetic on them is not meaningful.

  • A simple heuristic: qualitative observations use senses/labels; quantitative observations are measured numbers.
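
The difference between a numeric label and a quantity can be sketched in a few lines of Python. The roster, its field names, and its values are made up for illustration. The point is that averaging jersey numbers runs without error yet describes nothing, while counting how often each label occurs is a sensible summary.

```python
from collections import Counter
from statistics import mean

# Hypothetical roster: jersey numbers are labels, heights (cm) are quantities.
roster = [
    {"jersey": 7, "height_cm": 180.0},
    {"jersey": 23, "height_cm": 175.0},
    {"jersey": 10, "height_cm": 185.0},
]

# Meaningful: the average of a quantitative variable.
mean_height = mean(p["height_cm"] for p in roster)

# Runs without error, but the result says nothing about the players.
mean_jersey = mean(p["jersey"] for p in roster)

# A sensible summary for a label: how many times each category occurs.
jersey_counts = Counter(p["jersey"] for p in roster)

print(f"mean height: {mean_height:.1f} cm")
print(f"'mean jersey' (not meaningful): {mean_jersey:.2f}")
print(f"jersey counts: {dict(jersey_counts)}")
```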

Population vs Sample

  • Population data come from every individual of interest; it is all-encompassing.

  • Sample data come from only some individuals of the entire population; a subset.

  • Population parameter: a numerical measure that describes an aspect of the population; reports the true characteristic of the population.

  • Sample statistic: a numerical measure that describes an aspect of the sample; it serves as an estimate of the population parameter, and the uncertainty of that estimate is expressed with a margin of error and a confidence interval.

  • Key reminder: a sample is a subset of your population.

  • If confused, match the initials: Population goes with Parameter (P, P); Sample goes with Statistic (S, S).

  • Picture idea: population → sample → inferences about the population based on the sample.

  • Course structure in context: Part I focuses on describing populations (complete information); Part II leans toward inferential statistics (drawing conclusions about a population from samples).

  • Real-world nuance: in practice, we usually don’t know the entire population’s behavior; we take samples to infer about the larger population.

  • Terminology examples:

    • Population measurable quantity = parameter (e.g., population mean μ, population standard deviation σ).

    • Sample measurable quantity = statistic (e.g., sample mean \bar{X}, sample standard deviation s).

    • A statistic is a point estimate of the population parameter; a margin of error and a confidence interval are built around it to express how uncertain the estimate is.
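
The parameter/statistic distinction above can be illustrated with a small simulation. This is a sketch under assumptions: the population is synthetic (1,000 heights drawn from a normal distribution with arbitrary settings), and the sample size of 30 is chosen only for illustration. In real studies the population mean μ is usually unknown, which is why the sample mean is used to estimate it.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 1,000 individuals.
population = [random.gauss(170, 10) for _ in range(1000)]

# Parameter: computed from every individual in the population.
mu = mean(population)

# Statistic: computed from a sample, which is a subset of the population.
sample = random.sample(population, 30)
x_bar = mean(sample)

print(f"population mean (parameter) = {mu:.2f}")
print(f"sample mean (statistic)     = {x_bar:.2f}")
```

Rerunning with a different seed changes the sample mean slightly while the population mean stays fixed for a given population; that variation from sample to sample is what a margin of error describes.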

Levels of Measurement (Nominal, Ordinal, Interval, Ratio)

  • Nominal level: names, labels, or categories with no implied order.

  • Ordinal level: categories that can be ordered; differences between data values are not necessarily meaningful.

  • Interval level: differences between values are meaningful; data are ordered; there is no true zero.

  • Ratio level: has a true zero; allows multiplication/division (ratios are meaningful).

  • Key features:

    • Ordinal, Interval, and Ratio are ordered (ordinal is the basic ordering; interval adds meaningful differences; ratio adds meaningful zero).

    • Interval and Ratio are quantitative; Nominal and Ordinal are qualitative.

  • True zero concept: a true zero means a zero value indicates the absence of the quantity (enables multiplication/division).

    • Example: Temperature: interval data (e.g., Celsius, Fahrenheit) — not meaningful to say "twice as hot".

    • Example: Height, Weight, Time: ratio data — "twice as tall" or "half the weight" are meaningful.

  • Hierarchy note: the four levels form a progression of structure and information from least to most informative for numerical operations.

  • Mathematical associations:

    • For a dataset X with n observations, a common statistic is the mean, \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, which requires at least interval data to be meaningful.

  • Recap mapping:

    • Nominal and Ordinal = qualitative levels.

    • Interval and Ratio = quantitative levels.
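
One way to see why ratios are meaningful only with a true zero is to check whether a ratio survives a change of units. The sketch below uses made-up values. For temperature (interval), the ratio changes between Celsius and Fahrenheit, so "twice as hot" has no fixed meaning. For height (ratio), the ratio is the same in centimeters and inches.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32


def cm_to_inches(cm: float) -> float:
    """Convert centimeters to inches (1 inch = 2.54 cm)."""
    return cm / 2.54


# Interval data: zero is an arbitrary point on the scale, not "no temperature".
cool_c, warm_c = 10.0, 20.0
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Ratio data: zero height means no height, so ratios survive a unit change.
short_cm, tall_cm = 90.0, 180.0
ratio_cm = tall_cm / short_cm
ratio_inches = cm_to_inches(tall_cm) / cm_to_inches(short_cm)

print(f"temperature ratio: {ratio_celsius:.2f} (C) vs {ratio_fahrenheit:.2f} (F)")
print(f"height ratio:      {ratio_cm:.2f} (cm) vs {ratio_inches:.2f} (in)")
```

Differences behave well at both levels: a 10-degree gap in Celsius is always an 18-degree gap in Fahrenheit, which is why means and differences are still meaningful for interval data.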

Observational Studies vs Experiments

  • Observational study: observe and measure individuals without applying any treatment or deliberate intervention; external factors may influence results.

  • Experiments: deliberately impose a treatment on individuals to observe a possible change in the response or measured variable.

  • Observational studies support weaker causal conclusions, because uncontrolled outside factors may explain an observed association.

  • Experiments aim for stronger causal conclusions by controlling the treatment and other variables.

  • In experiments, you often see a treatment group and a control group.

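  • Placebo effect: individuals who receive an inert treatment (a placebo) may still respond because they believe they are being treated; a placebo can improve outcomes in some cases, so the control group is often given one to separate this response from the effect of the treatment itself.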

  • Double-blind experiment: neither the participants nor the researchers know who is receiving the treatment; helps prevent bias in treatment administration and assessment.

  • Medical trials commonly use double-blind designs when feasible to avoid bias.
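
Forming a treatment group and a control group can be sketched as below. Note the assumption: the notes do not say how individuals are assigned to groups, and this sketch uses random assignment, a common way to keep the researcher's choices from influencing who gets the treatment. The participant IDs and the function name `assign_groups` are hypothetical.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into a treatment and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant IDs P01 ... P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = assign_groups(participants, seed=42)

print("treatment:", sorted(groups["treatment"]))
print("control:  ", sorted(groups["control"]))
```

In a double-blind design, the mapping from participant to group would be held by someone other than the participants and the researchers who assess outcomes.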

Types of Studies and How They Relate to Data

  • Observational studies support weaker causal inference, because the researcher does not control which individuals are exposed to which conditions.

  • Experiments support stronger causal inference, because the researcher imposes the treatment and can control other variables.

  • The choice of study design affects how confidently we can attribute observed effects to the treatment rather than external factors.

Sampling: Concepts and Methods

  • Simple random sample (SRS): a subset of n individuals chosen from the population so that every possible sample of size n has an equal chance of being selected; ignores any attributes of individuals (no bias toward characteristics).

  • Stratified sampling: divide the population into distinct subgroups (strata) based on characteristics (e.g., age, income, education); draw random samples from each stratum; useful for comparisons across strata and when you want to ensure representation from each subgroup.

    • Note: sample sizes from strata do not have to be equal, but equal or proportional sizes often make comparisons easier.

  • Cluster sampling: divide the population into preexisting segments or clusters (often geographic); randomly select some clusters and include every member of the selected clusters in the sample.

    • Difference from stratified: stratified sampling draws a random sample from every stratum; cluster sampling selects only some clusters and includes every member of each selected cluster.

  • Systematic sampling: number all population members, pick a random starting point, and then select every k-th member (a fixed interval) thereafter.

    • Example pattern: start at position 11 and pick every 6th person (11, 17, 23, 29, 35, …).

  • Multistage sampling: combine multiple sampling methods across stages to form a final sample; often ends with clusters selected in the last stage.

  • Convenience sampling: select individuals who are readily available (e.g., the first 30 people who walk by or people in a room); highly convenient but biased and often not representative of the population.

    • Caveat: convenience samples make generalizations about the population very weak due to bias.

  • Quick recap of the methods:

    • Simple Random Sampling

    • Stratified Sampling

    • Cluster Sampling

    • Systematic Sampling

    • Multistage Sampling

    • Convenience Sampling

  • Practical notes on the spreadsheet example used in class:

    • Population of 200 individuals broken into chunks of 10 to illustrate geographic partitioning for cluster sampling.

    • Right-hand column shows a secondary classification (e.g., political affiliation: Republican, Democrat, Other; or gender: male, female, other/decline to answer).

    • Simple random sample demonstration: 30 individuals selected at random (green cells mark the selected individuals).

    • Stratified sampling demonstration: 30 individuals selected with 10 from each stratum (e.g., 10 from each quality group 1, 2, and 3); sample sizes across strata can be equal or unequal, but equal sizes often simplify analysis.

    • Cluster sampling demonstration: three entire blocks chosen; within each block, all individuals are included.

    • Systematic sampling demonstration: select every k-th member after a random start (e.g., step size of 6).

    • Convenience sampling demonstration: first 30 individuals recorded; highly convenient but very biased and not representative.

  • Key distinction between stratified and cluster sampling:

    • Stratified sampling: divide the population into meaningful groups (strata) and sample within each stratum; aims to compare or highlight differences between strata.

    • Cluster sampling: divide the population into clusters (often geographic) and sample whole clusters; one or more clusters are selected, and all individuals in those clusters are included in the sample.
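
The sampling methods in this section can be sketched in Python on a made-up population shaped like the class spreadsheet: 200 individuals in blocks of 10. The field names (`id`, `block`, `stratum`), the cyclic assignment of strata, and the helper function names are assumptions for illustration, not part of the class example.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 200 individuals, 20 blocks of 10 (clusters),
# each tagged with one of 3 strata (assigned cyclically for simplicity).
population = [
    {"id": i, "block": (i - 1) // 10, "stratum": (i % 3) + 1}
    for i in range(1, 201)
]


def simple_random_sample(pop, n):
    """Every possible subset of size n is equally likely."""
    return rng.sample(pop, n)


def stratified_sample(pop, n_per_stratum):
    """Draw a random sample from every stratum."""
    sample = []
    for s in sorted({p["stratum"] for p in pop}):
        members = [p for p in pop if p["stratum"] == s]
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(pop, n_clusters):
    """Pick some blocks at random and keep everyone in those blocks."""
    blocks = sorted({p["block"] for p in pop})
    chosen = set(rng.sample(blocks, n_clusters))
    return [p for p in pop if p["block"] in chosen]


def systematic_sample(pop, k, start=None):
    """Take every k-th member after a starting index (random if not given)."""
    if start is None:
        start = rng.randrange(k)  # assumption: start within the first k members
    return pop[start::k]


def convenience_sample(pop, n):
    """Take whoever comes first; easy, but not random."""
    return pop[:n]


srs = simple_random_sample(population, 30)
strat = stratified_sample(population, 10)          # 10 from each of 3 strata
clus = cluster_sample(population, 3)               # 3 whole blocks of 10
syst = systematic_sample(population, 6, start=10)  # index 10 is individual 11
conv = convenience_sample(population, 30)

print("systematic ids:", [p["id"] for p in syst][:5])
```

The stratified function touches every stratum, while the cluster function touches only the chosen blocks but keeps all of their members, which mirrors the distinction drawn above.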

Practical Implications and Takeaways

  • Inference in statistics relies on samples to draw conclusions about populations.

  • Different sampling methods have different biases and trade-offs in terms of cost, accuracy, and representativeness.

  • When planning a study, consider whether you need representation across subgroups (stratified) or efficiency through clusters (cluster), or a simple random approach (SRS).

  • Be aware of potential biases: systematic patterns, convenience bias, or non-representative strata/clusters can distort inferences.

  • Remember the core goal: use the sample to describe the population and, where appropriate, infer population characteristics with quantified uncertainty (margin of error, confidence intervals).

Quick References and Mnemonics

  • Parameter vs Statistic memory aid: Population Parameter (θ) vs Sample Statistic (\hat{θ}); population mean μ vs sample mean \bar{X}.

  • Levels of Measurement mnemonic: Nominal (names) → Ordinal (order) → Interval (differences meaningful) → Ratio (true zero).

  • Relationship reminder: Qualitative = Nominal/Ordinal; Quantitative = Interval/Ratio.

  • Double-blind importance: reduces bias in treatment effects and outcome assessment in experiments.

  • Systematic sampling pitfall: beware of cyclic patterns that may coincide with the sampling interval and introduce bias.

  • Convenience sampling caveat: easy but biases limit generalizability.
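
The systematic sampling pitfall noted above can be shown with a small made-up dataset: 12 weeks of daily values in which every seventh day spikes. If the sampling interval equals the length of the cycle, the sample lands on the same point of the cycle every time. The data and the function name `systematic` are hypothetical.

```python
from statistics import mean

# Hypothetical data: 84 days (12 weeks); the last day of each week is 300,
# every other day is 100.
values = [300 if day % 7 == 6 else 100 for day in range(84)]
true_mean = mean(values)


def systematic(data, k, start):
    """Take every k-th value beginning at index `start`."""
    return data[start::k]


# Interval equals the cycle length: every pick falls on the same weekday.
only_spikes = systematic(values, k=7, start=6)  # all values are 300
no_spikes = systematic(values, k=7, start=0)    # all values are 100

# Interval that does not line up with the cycle: picks rotate through weekdays.
mixed = systematic(values, k=5, start=0)

print(f"true mean:        {true_mean:.2f}")
print(f"k=7, start=6:     {mean(only_spikes):.2f}")
print(f"k=7, start=0:     {mean(no_spikes):.2f}")
print(f"k=5, start=0:     {mean(mixed):.2f}")
```

With an interval of 7, the estimate depends entirely on the starting day; with an interval of 5, the sample mixes ordinary and spike days and lands much closer to the true mean.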