Sampling, Quantitative vs Qualitative Data, and the Foundations of Random Sampling

Quantitative vs. Qualitative Measurement

  • Median survival illustration
    • Two arms of a cancer-vaccine trial: one arm had median survival \approx 10\text{ months}, the other \approx 6\text{ months}.
    • Quantitative view (actual months lived) revealed a near-significant difference; qualitative view ("responded yes/no") appeared flat.
    • Core moral: quantitative variables carry more “information content” than binary/qualitative variables, enlarging power to detect effects.

  • Engineer’s aphorism: “If you can measure numerically, do it.”
    • Binary outcomes are unavoidable in some contexts (e.g., “had surgery: yes/no”), but whenever a continuum exists, prefer it.

  • Covariates
    • Length of treatment, disease stage, etc., mentioned as ancillary variables that must be “taken into account” in analysis.
    • Term introduced: covariate = a variable not itself the primary outcome but possibly associated with outcome or exposure.

Populations, Samples, Measurements (Quick Recap)

  • Identify population ⇒ decide variables ⇒ design study ⇒ collect data.

  • Data collection begins with sampling.

Why Sampling Quality Matters

  • Sample = small fraction of population; if collected poorly, every inference drawn from it inherits the flaw.

  • Ultimate mission: obtain a sample representative of the population.
    • Trivial statement, yet operationally non-trivial: “how” is the hard part.

Random (Probability) vs. Non-Random (Non-Probability) Sampling

  • Two overarching families:

    1. Random / Probability sampling.
      • Guiding principle: randomness = unpredictability ⇒ immunizes against manipulation.
      • Enables computation of probabilities for sample statistics.

    2. Non-Random / Non-Probability sampling.
      • At least one population element has 0 chance of selection (or selection chances are simply unknown).
      • Examples: web-site pop-up polls, “clipboard on Bruin Walk” intercept surveys.

  • Size does not rescue bias
    • Millions of web responses are still a bad sample, just a huge one, if selection is biased.
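
The point that size does not rescue bias can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population, the trait values, and the "only above-average people respond" rule are all hypothetical and not from the lecture.

```python
import random


def compare_samples(seed=0, pop_size=100_000, srs_size=100):
    """Compare a huge self-selected sample with a small simple random sample."""
    rng = random.Random(seed)

    # Hypothetical population: one numeric trait per person (mean 3, SD 1).
    population = [rng.gauss(3.0, 1.0) for _ in range(pop_size)]
    true_mean = sum(population) / pop_size

    # Biased "web poll" (assumption): only people with trait > 3.0 respond,
    # so everyone at or below 3.0 has zero chance of selection.
    web_poll = [x for x in population if x > 3.0]

    # Small simple random sample drawn without replacement.
    srs = rng.sample(population, srs_size)

    return {
        "true_mean": true_mean,
        "web_poll_n": len(web_poll),
        "web_poll_error": abs(sum(web_poll) / len(web_poll) - true_mean),
        "srs_n": srs_size,
        "srs_error": abs(sum(srs) / srs_size - true_mean),
    }


result = compare_samples()
for key, value in result.items():
    print(key, round(value, 3))
```

Under these assumptions the web poll has hundreds of times more respondents than the SRS, yet its estimate of the mean sits much farther from the truth, because every respondent was drawn from the same lopsided half of the population.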

Clinical-Trial Reality Check

  • Clinical studies are inherently non-random samples:
    • Geography limits who can attend a study site.
    • Informed consent allows refusal.
    • Investigator discretion & site capacity limit who is even asked.

  • Consequence: must assume sample is “as if” random (representative) to apply statistical inference.
    • If that assumption fails, results lose generalizability.

Types of Random Samples (overview)

  1. Simple Random Sample (SRS) ⟶ detailed below.

  2. Stratified random sample.

  3. Cluster (multistage) sample.

  4. Systematic sample.

  (Only SRS has been introduced so far.)

Simple Random Sample (SRS)

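  • Definition: every possible sample of size n has the same chance of being chosen; as a consequence, every element has equal chance of selection.
    • SRS ⊂ random sampling (where only a known, “non-zero chance” is required for each element).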

  • Classic metaphor: lottery balls.
    • 65 numbered balls bounce with forced air; each ball has a 1/65 chance of being the first one drawn.

Sampling With vs. Without Replacement

  • Lottery uses without replacement (ball is set aside).
    • Probabilities shift: given that ball i was not drawn first, P(\text{draw}=i \text{ on 2nd}) = 1/64, etc.

  • For large populations with small samples, difference between with/without replacement is negligible ⇒ treat as with replacement to simplify probability math.
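
A quick way to see why the distinction stops mattering for large populations is to compare the second-draw probability under both schemes. The helper names below are made up for illustration; the without-replacement figure is conditional on the element not having been drawn first.

```python
from fractions import Fraction


def second_draw_probs(N):
    """Chance that one specific, still-available element is picked on draw 2.

    Returns (with_replacement, without_replacement) as exact fractions.
    """
    with_replacement = Fraction(1, N)         # population is unchanged
    without_replacement = Fraction(1, N - 1)  # one element has been set aside
    return with_replacement, without_replacement


def relative_gap(N):
    """Relative difference between the two second-draw probabilities."""
    with_repl, without_repl = second_draw_probs(N)
    return float((without_repl - with_repl) / with_repl)


# Lottery-sized population versus the survey-sized population in these notes.
for N in (65, 45_000):
    w, wo = second_draw_probs(N)
    print(N, w, wo, relative_gap(N))
```

The relative gap works out to 1/(N - 1): about 1.6% for 65 lottery balls, but only a few thousandths of a percent when N is 45,000.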

Practical Walk-Through: UCLA Diet Survey

  • Goal: SRS of n = 100 undergraduates from population N \approx 45{,}000.

Step-by-step:

  1. Create sampling frame: obtain registrar list of all enrolled students.

  2. Assign unique equal-length IDs: UID already nine digits; if homemade, pad with leading zeros: 00001,\ldots,45000.

  3. Random-number generation: use an RNG (Excel RAND, calculator RND key, Python random, etc.).
    • Algorithm picks digits 0–9 with equal probability to form strings of the ID length (9 digits for UIDs).
    • If generated ID is not in the frame, or was already selected, discard and redraw (still random).

  4. Repeat until 100 valid IDs gathered.

  5. Contact those students for the diet questionnaire.

  • Notation reminder:
    • n = sample size (here 100).
    • N = population size (here \approx 45{,}000).
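
The walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not the registrar's actual procedure: the frame is a made-up list of homemade five-digit IDs (00001 to 45000), and `draw_srs` is a hypothetical helper name.

```python
import random


def draw_srs(frame, n, id_length, seed=None):
    """Draw a simple random sample of n IDs by generating random digit strings.

    Strings that are not in the frame, or that were already chosen,
    are discarded and redrawn.
    """
    rng = random.Random(seed)
    valid_ids = set(frame)
    if n > len(valid_ids):
        raise ValueError("sample size cannot exceed the size of the frame")

    chosen = []
    already_chosen = set()
    while len(chosen) < n:
        # Each digit 0-9 is picked with equal probability.
        candidate = "".join(rng.choice("0123456789") for _ in range(id_length))
        if candidate in valid_ids and candidate not in already_chosen:
            already_chosen.add(candidate)
            chosen.append(candidate)
    return chosen


# Hypothetical sampling frame: homemade IDs padded with leading zeros.
frame = [f"{i:05d}" for i in range(1, 45_001)]
sample = draw_srs(frame, n=100, id_length=5, seed=2024)
print(sample[:5])
```

In practice, `random.sample(frame, 100)` from the standard library draws without replacement in a single call; the loop above is spelled out only to mirror the discard-and-redraw steps. With real nine-digit UIDs the same loop would still work, but almost every generated string would be rejected.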

RNGs & Philosophical Aside

  • Computers are deterministic; “random” numbers are produced by algorithms (pseudo-random).

  • Lecturer foreshadows deeper discussion on adequacy of pseudo-randomness for statistical work.
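
The deterministic nature of pseudo-random numbers is easy to demonstrate: the same seed reproduces the same "random" draws. The sketch below uses Python's standard `random` module (a Mersenne Twister generator); the function name and seed values are arbitrary choices for illustration.

```python
import random


def first_draws(seed, k=5):
    """Return the first k pseudo-random ball numbers (1-65) for a given seed."""
    rng = random.Random(seed)
    return [rng.randint(1, 65) for _ in range(k)]


run_a = first_draws(seed=42)
run_b = first_draws(seed=42)  # same seed, so the same sequence
run_c = first_draws(seed=43)  # different seed, different starting state

print(run_a)
print(run_b)
print(run_c)
```

Reproducibility is useful for auditing a sampling procedure, but it also shows that the unpredictability comes from not knowing the seed rather than from any physical randomness.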

Probability Example (cards)

  • If deck truly random and you don’t manipulate:
    • P(\text{5 red cards}) = \dfrac{\binom{26}{5}}{\binom{52}{5}}.

  • If you secretly arrange the deck (non-random), the outcome is forced: the probability is 1 or 0 by design, so the counting formula above no longer applies.
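
The five-red-cards probability can be checked numerically. This sketch assumes a standard 52-card deck with 26 red cards and a well-shuffled five-card draw without replacement; `prob_all_red` is an illustrative name.

```python
from math import comb


def prob_all_red(hand_size=5, red=26, deck=52):
    """Probability that every card in a random hand is red."""
    return comb(red, hand_size) / comb(deck, hand_size)


def prob_all_red_sequential(hand_size=5, red=26, deck=52):
    """Same probability, built draw by draw: 26/52 * 25/51 * ..."""
    p = 1.0
    for k in range(hand_size):
        p *= (red - k) / (deck - k)
    return p


p = prob_all_red()
print(round(p, 4))
```

Both routes give about 0.0253, roughly a 1-in-40 event, which is the kind of statement that is only available when the shuffle is genuinely random.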

Key Takeaways & Best Practices

  • Quantitative outcomes > qualitative for detecting effects.

  • Representativeness > sample size; bias cannot be “averaged out.”

  • Clinical research relies on optimistic assumption of representativeness; critical readers must scrutinize this claim.

  • Understand varieties of random sampling; SRS is simplest but not only method.

  • Random number generators enable practical SRS; awareness of pseudo-random versus true random helps assess rigor.

Formulas & Notation Recap

  • Median definition (sample): the ordered data value at position \dfrac{n+1}{2} (if n odd) or average of middle pair if n even.

  • Lottery selection probabilities (without replacement), for any ball still in the machine:
    P_1 = \dfrac{1}{65},\;P_2 = \dfrac{1}{64}, \ldots

  • Card example (5 red):
    P=\dfrac{\binom{26}{5}}{\binom{52}{5}} = \dfrac{65{,}780}{2{,}598{,}960} \approx 0.0253.
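
The sample-median rule from this recap can be written out directly. The survival times in the usage lines are invented for illustration and are not the trial's data.

```python
def sample_median(values):
    """Median of a sample: middle value if n is odd, mean of middle pair if even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical survival times in months.
print(sample_median([6, 10, 8]))
print(sample_median([4, 6, 10, 12]))
```

The standard library's `statistics.median` follows the same rule and is the more convenient choice in real analyses.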