Sampling, Quantitative vs Qualitative Data, and the Foundations of Random Sampling

Median survival illustration
• Two arms of a cancer-vaccine trial: one arm had median survival $\approx 10\text{ months}$, the other $\approx 6\text{ months}$.
• Quantitative view (actual months lived) revealed a near-significant difference; qualitative view ("responded yes/no") appeared flat.
• Core moral: quantitative variables carry more “information content” than binary/qualitative variables, enlarging power to detect effects.
Engineer’s aphorism: “If you can measure numerically, do it.”
• Binary outcomes are unavoidable in some contexts (e.g., “had surgery: yes/no”), but whenever a continuum exists, prefer it.
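A minimal simulation sketch of the "information content" point. Everything specific here is an assumption for illustration, not from the lecture: normally distributed outcomes, a treatment shift of half a standard deviation, 50 subjects per arm, a yes/no cutoff at 0.25, and simple two-sided z-tests. The same simulated data are analyzed once as raw numbers and once after being collapsed to yes/no.

```python
import math
import random


def z_rejects(diff, se, z_crit=1.96):
    """Two-sided z-test: reject when |diff / se| exceeds the critical value."""
    return se > 0 and abs(diff / se) > z_crit


def simulate_power(n_per_arm=50, shift=0.5, cutoff=0.25, reps=2000, seed=1):
    """Estimate power of a quantitative vs. a dichotomized comparison."""
    rng = random.Random(seed)
    hits_quant = hits_binary = 0
    for _ in range(reps):
        a = [rng.gauss(shift, 1.0) for _ in range(n_per_arm)]  # treated arm
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]    # control arm

        # Quantitative analysis: compare means of the raw measurements.
        mean_a, mean_b = sum(a) / n_per_arm, sum(b) / n_per_arm
        var_a = sum((x - mean_a) ** 2 for x in a) / (n_per_arm - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n_per_arm - 1)
        se = math.sqrt(var_a / n_per_arm + var_b / n_per_arm)
        hits_quant += z_rejects(mean_a - mean_b, se)

        # Binary analysis: keep only "above cutoff: yes/no", compare proportions.
        p_a = sum(x > cutoff for x in a) / n_per_arm
        p_b = sum(x > cutoff for x in b) / n_per_arm
        pooled = (p_a + p_b) / 2
        se_bin = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
        hits_binary += z_rejects(p_a - p_b, se_bin)
    return hits_quant / reps, hits_binary / reps


power_quant, power_binary = simulate_power()
print(f"power, quantitative outcome: {power_quant:.2f}")
print(f"power, binary outcome:       {power_binary:.2f}")
```

Under these assumptions the quantitative analysis rejects the null noticeably more often than the yes/no analysis of the same data, which is the sense in which dichotomizing discards information.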
Covariates
• Length of treatment, disease stage, etc., mentioned as ancillary variables that must be “taken into account” in analysis.
• Term introduced: covariate = a variable not itself the primary outcome but possibly associated with outcome or exposure.

Sample = small fraction of population; if collected poorly, inferences are by definition poor.
Ultimate mission: obtain a sample representative of the population.
• Trivial statement, yet operationally non-trivial: “how” is the hard part.

Two overarching families:
1. Random / Probability sampling.
  • Guiding principle: randomness = unpredictability ⇒ immunizes against manipulation.
  • Enables computation of probabilities for sample statistics.
2. Non-Random / Non-Probability sampling.
  • At least one population element has $0$ (or an unknown) chance of selection.
  • Examples: web-site pop-up polls, “clipboard on Bruin Walk” intercept surveys.
Size does not rescue bias
• Millions of web responses still = huge but bad sample if selection is biased.
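A small simulation sketch of why size does not rescue bias. The population size, the 30% true share, and the response rates below are made-up numbers chosen only to illustrate the point.

```python
import random

rng = random.Random(0)

# Hypothetical population of 200,000 people; about 30% hold the opinion "yes".
N = 200_000
population = [rng.random() < 0.30 for _ in range(N)]
true_share = sum(population) / N


def responds(holds_opinion):
    """Biased pop-up poll: "yes" holders respond far more often (assumed rates)."""
    return rng.random() < (0.60 if holds_opinion else 0.20)


# Huge self-selected sample versus a small simple random sample.
biased_sample = [x for x in population if responds(x)]
srs = rng.sample(population, 500)

biased_estimate = sum(biased_sample) / len(biased_sample)
srs_estimate = sum(srs) / len(srs)

print(f"true share:          {true_share:.3f}")
print(f"biased (n={len(biased_sample)}): {biased_estimate:.3f}")
print(f"SRS    (n={len(srs)}):   {srs_estimate:.3f}")
```

The self-selected sample is more than a hundred times larger, yet its estimate lands far from the true share, while the 500-person random sample lands close to it.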

Clinical studies are inherently non-random samples:
• Geography limits who can attend a study site.
• Informed consent allows refusal.
• Investigator discretion & site capacity limit who is even asked.
Consequence: must assume sample is “as if” random (representative) to apply statistical inference.
• If that assumption fails, results lose generalizability.

Definition (simple random sample, SRS): every element has an equal chance of selection.
• SRS ⊂ random sampling (where only “non-zero chance” is required).
Classic metaphor: lottery balls.
• $65$ numbered balls bounce with forced air; each ball has a $1/65$ chance of being selected on the first draw.

Lottery draws are made without replacement (each drawn ball is set aside).
• Probabilities shift: for a ball $i$ still in the machine, $P(\text{ball } i \text{ on 2nd draw}) = 1/64$, etc.
For large populations with small samples, difference between with/without replacement is negligible ⇒ treat as with replacement to simplify probability math.
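A short sketch of how much the selection probability drifts between the first and last draw without replacement. The helper names are illustrative, and the 6-ball lottery draw is an assumed number; the 100-from-45,000 case is an assumed example of a small sample from a large population.

```python
from fractions import Fraction


def prob_on_draw(population_size, draw_number):
    """Chance that a specific, not-yet-drawn item is picked on a given draw
    when sampling without replacement."""
    remaining = population_size - (draw_number - 1)
    return Fraction(1, remaining)


def relative_shift(population_size, sample_size):
    """Relative change in selection probability between first and last draw."""
    first = prob_on_draw(population_size, 1)
    last = prob_on_draw(population_size, sample_size)
    return float((last - first) / first)


print(prob_on_draw(65, 1), prob_on_draw(65, 2))  # 1/65 1/64
print(f"{relative_shift(65, 6):.3%}")            # small population: 8.333%
print(f"{relative_shift(45_000, 100):.3%}")      # large population: 0.220%
```

With 65 balls the per-draw probability moves by several percent over six draws; with 100 draws from 45,000 it moves by about a fifth of a percent, which is why the with-replacement approximation is usually harmless there.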

Step-by-step:

1. Create sampling frame: obtain registrar list of all enrolled students.
2. Assign unique equal-length IDs: UID is already nine digits; if homemade, pad with leading zeros: $00001,\ldots,45000$.
3. Random-number generation: use an RNG (Excel RAND, calculator RND key, Python random, etc.).
  • Algorithm picks digits 0–9 with equal probability to form strings of the ID length (9 digits for UIDs).
  • If the generated ID is not in the frame, discard and redraw (still random).
4. Repeat until 100 valid IDs are gathered.
5. Contact those students for the diet questionnaire.
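The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the lecturer's code: it uses the homemade 5-digit IDs rather than real 9-digit UIDs (so that few draws are discarded), the function name `draw_srs` is hypothetical, and it also discards repeated IDs so that the 100 students are distinct.

```python
import random

DIGITS = "0123456789"


def draw_srs(frame, n, id_length, seed=None):
    """Draw n distinct IDs from frame by generating random digit strings."""
    valid_ids = set(frame)
    if n > len(valid_ids):
        raise ValueError("sample size exceeds the number of IDs in the frame")
    rng = random.Random(seed)
    chosen = []
    seen = set()
    while len(chosen) < n:
        # Each digit 0-9 is equally likely, so each string is equally likely.
        candidate = "".join(rng.choice(DIGITS) for _ in range(id_length))
        # Discard strings that are not in the frame or were already drawn.
        if candidate in valid_ids and candidate not in seen:
            seen.add(candidate)
            chosen.append(candidate)
    return chosen


# Hypothetical frame: homemade 5-digit IDs 00001 ... 45000.
frame = [f"{i:05d}" for i in range(1, 45_001)]
sample = draw_srs(frame, n=100, id_length=5, seed=42)
print(len(sample), sample[:5])
```

In practice `random.sample(frame, 100)` gives a simple random sample directly; the digit-by-digit version is shown only because it mirrors the discard-and-redraw procedure described in the steps.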

Notation reminder:
• $n$ = sample size (here $100$).
• $N$ = population size (here $\approx 45{,}000$).

Computers are deterministic; “random” numbers are produced by algorithms (pseudo-random).
Lecturer foreshadows deeper discussion on adequacy of pseudo-randomness for statistical work.
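A quick way to see the determinism is to seed two generators identically. The sketch below uses Python's standard `random` module (a Mersenne Twister generator); the seed value is arbitrary.

```python
import random

# Two independent generator objects started from the same seed.
gen_a = random.Random(12345)
gen_b = random.Random(12345)

run_a = [gen_a.randint(0, 9) for _ in range(10)]
run_b = [gen_b.randint(0, 9) for _ in range(10)]

print(run_a)
print(run_b)
print(run_a == run_b)  # True: same seed, same "random" digits
```

The sequence looks unpredictable, yet it is fully reproducible once the seed is known, which is what "pseudo-random" means.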

If the deck is truly random (well shuffled) and you don't manipulate it:
• $P(\text{5 red cards in a 5-card hand}) = \dfrac{\binom{26}{5}}{\binom{52}{5}}$.
If you secretly arrange the deck (non-random), the outcome is predetermined: the probability is $1$ (forced) or $0$, so no meaningful probability can be computed.
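The random-deck calculation can be checked numerically. The sketch below computes the exact ratio with `math.comb` and compares it with a shuffled-deck simulation; the number of trials and the seed are arbitrary choices.

```python
import math
import random

# Exact probability: choose 5 of the 26 red cards out of all 5-card hands.
exact = math.comb(26, 5) / math.comb(52, 5)


def simulate_five_red(trials=50_000, seed=0):
    """Deal 5 cards at random from a 52-card deck and count all-red hands."""
    rng = random.Random(seed)
    deck = ["red"] * 26 + ["black"] * 26
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 5)  # 5 cards without replacement
        hits += all(card == "red" for card in hand)
    return hits / trials


estimate = simulate_five_red()
print(f"exact:     {exact:.4f}")
print(f"simulated: {estimate:.4f}")
```

The exact value is about 0.0253, and the simulated frequency should land close to it. The agreement depends on the deals being random; with a stacked deck the frequency would be whatever the arranger chose.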

Quantitative outcomes > qualitative for detecting effects.
Representativeness > sample size; bias cannot be “averaged out.”
Clinical research relies on optimistic assumption of representativeness; critical readers must scrutinize this claim.
Understand varieties of random sampling; SRS is simplest but not only method.
Random number generators enable practical SRS; awareness of pseudo-random versus true random helps assess rigor.

Median definition (sample): the ordered data value at position $\dfrac{n+1}{2}$ (if $n$ odd) or average of middle pair if $n$ even.
Lottery selection probabilities (without replacement):
$P_1 = \dfrac{1}{65},\;P_2 = \dfrac{1}{64}, \ldots$
Card example (5 red):
$P=\dfrac{\binom{26}{5}}{\binom{52}{5}}=\dfrac{65{,}780}{2{,}598{,}960}\approx 0.0253$.