Comprehensive Notes on Sampling and Sampling Methods

Population vs. Sample

Population: the entire group you want to study (e.g., all dogs in the world).
- Feasibility issue: you usually cannot examine every member of a very large population.
Sample: a subset of the population used to study and draw inferences about the population.
- The smaller the sample, the more sampling error tends to occur.
- If the population is small, you can in principle examine everyone (e.g., all dogs a vet sees in a week).
Census vs. Survey
- Census: collecting information from every member of the population.
- Surveys are used when a census is impractical; they sample the population.
Population parameter vs. sample statistic
- Parameter: a numerical characteristic of the population (a true value).
- Statistic: a numerical characteristic computed from the sample, used as an estimate of the parameter.
- Example (eye color of dogs):
  - Parameter: the true percentage of all dogs with blue eyes in the world population.
  - Statistic: the percentage of blue-eyed dogs found in a sample of, say, 100 dogs.
Relationship between parameter and statistic
- A statistic is an approximation/estimate of the corresponding parameter.
- As the sample size grows, the statistic tends to get closer to the parameter (conceptually tied to the idea behind the law of large numbers).
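The relationship above can be sketched with a small simulation. This is a minimal illustration, not data from the lecture: the population size and the 8% blue-eyed share are made-up assumptions, and `sample_proportion` is a hypothetical helper name.

```python
import random

def sample_proportion(population, n, rng):
    """Return the share of 1s (blue-eyed dogs) in a simple random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n

# Hypothetical population: 100,000 dogs, 8,000 of them blue-eyed (made-up figures).
POPULATION_SIZE = 100_000
BLUE_EYED = 8_000
population = [1] * BLUE_EYED + [0] * (POPULATION_SIZE - BLUE_EYED)

parameter = sum(population) / len(population)  # true share, known only in a simulation
rng = random.Random(2024)                      # fixed seed so the run is repeatable

for n in (100, 1_000, 10_000, 100_000):
    statistic = sample_proportion(population, n, rng)
    print(f"n={n:>7}  statistic={statistic:.4f}  error={statistic - parameter:+.4f}")
```

The last row samples the whole population, so it is effectively a census and the statistic equals the parameter. The smaller samples usually land near the parameter but not exactly on it.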

The f-counting demonstration: statistic vs. parameter in practice

Setup: counting f’s in a paragraph to illustrate sampling vs. population counts.
- Task: count how many f’s appear in a paragraph. People tend to miscount because there are many lines and f’s are easy to miss.
The exercise showed that counting across a whole population (all f’s in the entire text) would give the true count (the parameter).
Instead, you take a sample line, count f’s, and use it as a statistic to estimate the total number of f’s in the entire population of lines.
Numbers from the demonstration (illustrative, not exact):
- True total f’s in the whole text (parameter) = 34.
- Count in one sampled line (statistic) = 30.
- Sampling error = statistic − parameter = 30 − 34 = −4 (the sample undercounts by 4).
- Another sample yielded a nearby result (a statistic between about 30 and 33), showing that estimates vary from sample to sample; larger samples generally narrow that variation.
Takeaway:
- The statistic is an estimate of the true parameter.
- Random sampling (a sample chosen at random) helps ensure the sample is representative and the statistic is a good estimate of the parameter.
- The more you sample, the more accurate the statistic tends to be.
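One way to mimic the demonstration in code is to count the f's in a few randomly chosen lines and scale up to the whole text. The sketch below assumes a short stand-in text (not the paragraph used in class), and `count_f` and `estimate_total_f` are hypothetical helper names.

```python
import random

def count_f(line):
    """Count the letter f (either case) in one line of text."""
    return line.lower().count("f")

def estimate_total_f(lines, sample_size, rng):
    """Estimate the total f count from a random sample of lines.

    Scales the mean count per sampled line up to the full number of lines.
    """
    sampled = rng.sample(lines, sample_size)
    mean_per_line = sum(count_f(line) for line in sampled) / sample_size
    return mean_per_line * len(lines)

# Stand-in text for illustration only.
lines = [
    "Finished files are the result of",
    "years of scientific study combined",
    "with the experience of fifty years",
    "of effort from a staff of friends",
]

parameter = sum(count_f(line) for line in lines)  # census: count every line
rng = random.Random(7)
statistic = estimate_total_f(lines, 2, rng)       # estimate from 2 of the 4 lines
print(f"parameter={parameter}, statistic={statistic}, error={statistic - parameter}")
```

Sampling all four lines reproduces the parameter exactly. Sampling fewer lines gives an estimate whose error depends on which lines happen to be drawn.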

Sampling methods: overview and implications

Why sampling methods matter
- Different sampling methods affect bias, precision, and generalizability.
- A key way to critique a study is to examine how the sample was obtained.
Polling example after 2016 election
- Early polling relied heavily on cold calls, which biased the sample toward people who answer phones.
- Result: polls could misrepresent voting intentions; modern polling uses more sophisticated sampling to reduce bias and improve representativeness.
Convenience sampling (the default, least desirable in many cases)
- Sample whoever is easy to reach (e.g., people who happen to be at one spot during lunch).
- Pros: quick and easy.
- Cons: highly biased and often not representative.
Simple random sampling (SRS)
- Each member of the population has an equal chance of being selected.
- Methods to implement:
  - Use a computer random number generator (true randomness is elusive; see the pseudo-random discussion below).
  - Example tool: random.org for generating random numbers.
  - In Excel: use RANDBETWEEN or similar functions; you can repeat the sampling by filling the formula down.
- Practical notes:
  - Truly random numbers are hard to produce; most generators are pseudo-random but sufficiently random for many applications.
  - Repetition: depending on the context, you may want repeatable draws (a seeded generator) or fresh draws on every run.
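A simple random sample can also be drawn with Python's standard `random` module instead of Excel. The sketch below is an illustration rather than the lecture's method: the roster of 98 numbered students and the helper name `simple_random_sample` are assumptions.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member has the same chance of selection."""
    # seed=None uses system entropy; an integer seed makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical roster: students numbered 1 to 98.
roster = list(range(1, 99))
picked = simple_random_sample(roster, 10, seed=42)
print(sorted(picked))
```

Passing the same seed again returns the same ten students, which is useful when a result needs to be checked. Omitting the seed gives a fresh draw each time.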
Systematic sampling
- Steps:
  - Determine population size N and desired sample size n; compute the sampling interval $k = \frac{N}{n}$, rounding down to $k = \left\lfloor \frac{N}{n} \right\rfloor$ when N is not a multiple of n.
  - Choose a random starting point r in {1, 2, …, k}.
  - Select units: r, r+k, r+2k, …, r+(n−1)k.
- Example from lecture: class with 98 students, want n = 10 samples.
  - Interval: $k = \left\lfloor \frac{98}{10} \right\rfloor = 9$.
  - Random start r ∈ {1,…,9} (e.g., r = 6).
  - Selected students: 6, 15, 24, 33, 42, 51, 60, 69, 78, 87.
- Pros: easy to implement, ensures coverage across the population.
- Cons: can be biased if the population has a hidden pattern aligned with the interval (the lecture's caveat: bee sexes and the Fibonacci sequence occasionally aligning with sampling intervals).
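The systematic sampling steps can be sketched as a short function. It uses the floor-based interval from the lecture example; the name `systematic_sample` and its parameters are hypothetical.

```python
import random

def systematic_sample(N, n, r=None, rng=None):
    """Return unit numbers r, r+k, ..., r+(n-1)k with k = floor(N / n)."""
    k = N // n  # sampling interval, rounded down
    if k < 1:
        raise ValueError("sample size n cannot exceed population size N")
    if r is None:
        rng = rng or random.Random()
        r = rng.randint(1, k)  # random start in {1, ..., k}, both ends inclusive
    if not 1 <= r <= k:
        raise ValueError("start r must lie in 1..k")
    return [r + i * k for i in range(n)]

# Lecture example: 98 students, 10 picks, start r = 6.
print(systematic_sample(98, 10, r=6))
# [6, 15, 24, 33, 42, 51, 60, 69, 78, 87]
```

Because r is at most k, the largest selected unit is at most n·k, which never exceeds N when k is rounded down.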
Cluster sampling
- Divide the population into clusters (e.g., groups by state, classroom rows, etc.).
- Randomly select a subset of clusters and survey everyone in those clusters.
- Pros: efficient with large populations; reduces travel/communication costs.
- Cons: if the clusters differ systematically from one another (so that no single cluster resembles the population as a whole), estimates may be biased or have higher variance.
- Example from lecture: survey random rows in a class, survey everyone in those rows; could still be biased if rows correlate with outcomes (e.g., grades).
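A minimal sketch of the row example follows, assuming a made-up classroom of 5 rows with 4 seats each; `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, num_clusters, rng):
    """Pick whole clusters at random, then include every member of each pick."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical classroom: 5 rows of 4 seats.
clusters = {
    f"row{i}": [f"row{i}-seat{j}" for j in range(1, 5)]
    for i in range(1, 6)
}

rng = random.Random(3)
chosen_rows, surveyed = cluster_sample(clusters, 2, rng)
print(chosen_rows)
print(surveyed)
```

The randomness applies to rows, not to individual students: once a row is chosen, everyone in it is surveyed. That is why a link between row and outcome (such as grades) carries straight into the sample.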
Stratified sampling
- Divide the population into strata (mutually exclusive subgroups) and sample from every stratum, either in proportion to each stratum's share of the population or with an explicitly chosen sample size per stratum.
- Key idea: ensure representation of important subgroups (control for confounding variables).
- Example discussed: Texas vs. Wyoming students. Stratifying by state and sampling proportionally within each state guarantees both states are represented, whereas a simple cluster approach could select only one of them.
- Uses:
  - Control for confounding variables (e.g., gender, age, location).
  - Compare effects across strata (e.g., medication effects in men vs. women).
- Advantage: often more precise estimates when strata are internally homogeneous.
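Proportional allocation can be sketched as below. The stratum sizes (950 Texas students, 50 Wyoming students) are invented for illustration, and `stratified_sample` is a hypothetical helper name.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Proportional allocation: each stratum contributes its population share."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding means the pieces may not add up to total_n exactly.
        n_h = max(1, round(total_n * len(members) / population_size))
        sample[name] = rng.sample(members, min(n_h, len(members)))
    return sample

# Hypothetical strata sizes, for illustration only.
strata = {
    "Texas": [f"TX-{i}" for i in range(950)],
    "Wyoming": [f"WY-{i}" for i in range(50)],
}

sample = stratified_sample(strata, 100, random.Random(11))
for state, members in sample.items():
    print(state, len(members))
```

With these sizes a sample of 100 takes 95 students from Texas and 5 from Wyoming, so the small stratum is guaranteed a place instead of being left to chance.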
Convenience vs. random sampling in practice
- Convenience sampling is common but risky due to bias.
- Stratified or simple random sampling generally provides more reliable inferences.
Reading sampling methods in studies
- Real-world example: an American Airlines survey that randomly selects 87 flights and surveys all passengers on those flights (a cluster-like approach).
- The distinction between cluster and simple random sampling is subtle in practice; clarity about how clusters are chosen matters.

Randomness, tools, and practical tips

True randomness vs. pseudo-randomness
- Deterministic computer algorithms cannot generate truly random numbers; most software uses pseudo-random algorithms.
- Sources of unpredictability: natural processes (raindrops, star pulsations) or cryptographic algorithms can be used to seed generators; the Pokemon random-start timing example illustrates that non-deterministic timing can feel random.
Random number generators in practice
- Random.org: generates numbers using atmospheric noise; useful for demonstrations and some experiments.
- Excel/TI calculators: provide RANDBETWEEN(a,b) or equivalent; results are pseudo-random but adequate for many classroom purposes.
- Important caveat: in very large datasets, random draws can still produce patterns or clusters by chance; this is a normal property of randomness rather than a flaw in the generator, and it is usually acceptable for practical sampling.
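The pseudo-random point can be seen directly in Python, used here as a stand-in for Excel or a calculator. In this sketch, `random.randint(a, b)` plays the role of RANDBETWEEN(a,b), and the seed value is arbitrary.

```python
import random

# Two generators given the same seed produce the same "random" sequence,
# which shows the numbers come from a deterministic algorithm.
a = random.Random(12345)
b = random.Random(12345)
draws_a = [a.randint(1, 6) for _ in range(10)]  # randint includes both endpoints
draws_b = [b.randint(1, 6) for _ in range(10)]
print(draws_a == draws_b)  # True

# SystemRandom draws from the operating system's entropy source instead;
# it ignores seeds, so its output cannot be replayed.
os_rng = random.SystemRandom()
print([os_rng.randint(1, 6) for _ in range(10)])
```

Seeded generators suit classroom work that must be reproducible. An entropy-backed source fits cases where nobody should be able to predict the draw.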
Examples and exercises using Excel functions
- Random integer between a and b: $\text{RANDBETWEEN}(a,b)$.
- Floor function to round down: $\text{FLOOR}(x) = \lfloor x \rfloor$ (e.g., $\lfloor 11.7 \rfloor = 11$).
- Combining functions to form sampling intervals and starting points (e.g., using FLOOR with random inputs to compute starting points and intervals).
Practical sampling considerations with Excel
- You can generate multiple random picks and then tally outcomes or summarize statistics.
- Repetition vs. non-repetition: sometimes you want to avoid repeating the same individual, sometimes you want to allow repeats depending on context.
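The difference between allowing and forbidding repeats maps onto two functions in Python's `random` module. The roster of ten numbered individuals below is a made-up example.

```python
import random
from collections import Counter

rng = random.Random(5)
roster = list(range(1, 11))  # hypothetical individuals numbered 1 to 10

# Without replacement: no individual can be picked twice.
without_repeats = rng.sample(roster, 5)

# With replacement: the same individual may be drawn again.
with_repeats = rng.choices(roster, k=5)

print(without_repeats)
print(with_repeats)

# Generate many picks and tally how often each individual comes up.
tally = Counter(rng.choices(roster, k=1000))
for individual in roster:
    print(individual, tally[individual])
```

The tally is roughly even across individuals but not perfectly so, which is the chance variation expected from random draws.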
Be mindful of potential sampling bias when using automated/random methods
- Even random processes can align with hidden patterns in the population, creating confounding effects.
- When sampling millions of cases, randomization alone may not address all systematic biases; stratification or careful cluster design can help.

Sampling error vs. non-sampling error (biases)

Sampling error
- Definition: the difference between the sample statistic and the true population parameter.
- Not “bad” in itself; it’s an expected part of sampling due to having only a subset.
- Example from the f-counting exercise: parameter = 34, statistic = 30; sampling error = $30 - 34 = -4$ (an undercount of 4).
- As sample size increases, sampling error tends to decrease on average (law of large numbers intuition).
Non-sampling error (bias, measurement error)
- Causes include biases in who responds, how questions are worded, and how samples are selected.
- Types of bias:
- Sampling bias / selection bias: individuals who choose to respond differ systematically from those who do not (self-selection bias).
  - Example: asking people to shout out their exam grades in a classroom; those with higher grades may be more likely to respond.
- Response bias: poor phrasing or formats skew responses (e.g., Rotten Tomatoes scoring phrasing).
  - Example: a binary good/bad question producing misleading interpretations of quality.
- Question phrasing and vagueness bias: vague questions yield inconsistent interpretations (e.g., vague family-related questions when assessing abuse risks).
- Publication bias and sponsor influence: researchers or sponsors may prefer to publish favorable results; negative results may be suppressed.
- Population representation bias in medical trials: underrepresentation of minorities (e.g., mostly white populations) can confound results and limit generalizability; stratified sampling can help mitigate this.
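Self-selection bias can be illustrated with a simulation of the exam-grade example. Everything numeric here is an assumption made for the sketch: the grade distribution, the class size, and the response model in which the chance of answering rises with the grade.

```python
import random
from statistics import mean

rng = random.Random(99)

# Hypothetical grades: roughly bell-shaped around 70, clipped to 0..100.
grades = [min(100.0, max(0.0, rng.gauss(70, 12))) for _ in range(10_000)]
parameter = mean(grades)

def responds(grade, rng):
    """Assumed response model: probability of answering equals grade / 100."""
    return rng.random() < grade / 100

# Self-selected respondents versus a random sample of the same size.
volunteers = [g for g in grades if responds(g, rng)]
random_sample = rng.sample(grades, len(volunteers))

print(f"true mean          = {parameter:.2f}")
print(f"volunteer mean     = {mean(volunteers):.2f}")
print(f"random-sample mean = {mean(random_sample):.2f}")
```

Both samples are the same size, yet only the volunteer mean drifts upward. A larger sample shrinks sampling error, but it does not remove a bias built into who responds.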
Best practices to reduce biases
- Use clear, specific questions; avoid vague terms; specify what is being measured.
- Ensure sampling frames include diverse subgroups; use stratified sampling to ensure proportional representation.
- Document sampling method in detail so readers can assess potential biases.
- Consider multiple sampling methods when feasible and compare results across strata or clusters.

Quick reference: key formulas and concepts (LaTeX)

Population parameter vs. sample statistic
- Parameter:
  - Denote as $\theta$, representing a true population value.
- Statistic:
  - Denote as $\hat{\theta}$.
  - It is an estimator of the parameter:
    $\text{Estimator} \to \text{Parameter as } n \to \text{large}$
Sampling error (basic form)
- $\text{Sampling error} = \text{Statistic} - \text{Parameter} = \hat{\theta} - \theta$
Systematic sampling interval and start (example)
- Population size: $N$
- Desired sample size: $n$
- Interval: $k = \frac{N}{n}$ (often rounded; if using floor, $k = \left\lfloor \frac{N}{n} \right\rfloor$)
- Random start: $r \sim \text{Uniform}\{1,2,\dots,k\}$
- Sample units: $\{ r, r+k, r+2k, \dots, r+(n-1)k \}$
Random number generation references
- Random integer between a and b: $\text{RANDBETWEEN}(a,b)$ (Excel-like function)
- Fractional random numbers: $\text{RAND}()$ typically in [0,1)
- Floor function example: $\lfloor x \rfloor$ (e.g., $\lfloor 11.7 \rfloor = 11$)
Be mindful of randomness claims
- True randomness is often unattainable in deterministic computers; pseudo-random generators are used in practice.
Conceptual terms
- Population: the entire group of interest.
- Sample: a subset used to draw inferences about the population.
- Census: measurement of the entire population.
- Survey: measurement on a sample to infer population parameters.
- Parameter: a true population value.
- Statistic: a sample-based estimate of the parameter.
- Bias: systematic error that can arise from sampling or response issues.

Practical takeaways for exams and real-world studies

Always ask: How was the sample obtained? Is it random? Are there strata? Could there be biases?
Prefer simple random or stratified sampling over convenience sampling when possible.
When using systematic sampling, ensure the population is not arranged in a way that introduces bias with the chosen interval.
Understand that sampling error is natural and can be reduced by larger, well-designed samples; non-sampling error requires careful survey design and execution.
If you are reporting results, be explicit about the sampling method, the sample size, and any limitations due to bias or representation.