Comprehensive Notes on Sampling and Sampling Methods
Population vs. Sample
- Population: the entire group you want to study (e.g., all dogs in the world).
- Feasibility issue: you usually cannot examine every member of a very large population.
- Sample: a subset of the population used to study and draw inferences about the population.
- The smaller the sample relative to the population, the more sampling error may occur.
- If the population is small, you can in principle examine everyone (e.g., all dogs a vet sees in a week).
- Census vs. Survey
- Census: collecting information from every member of the population.
- Surveys are used when a census is impractical; they sample the population.
- Population parameter vs. sample statistic
- Parameter: a numerical characteristic of the population (a true value).
- Statistic: a numerical characteristic computed from the sample, used as an estimate of the parameter.
- Example (eye color of dogs):
- Parameter: the true percentage of all dogs with blue eyes in the world population.
- Statistic: the percentage of blue-eyed dogs found in a sample of, say, 100 dogs.
- Relationship between parameter and statistic
- A statistic is an approximation/estimate of the corresponding parameter.
- As the sample size grows, the statistic tends to get closer to the parameter (the intuition behind the law of large numbers).
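The parameter/statistic relationship above can be sketched with a small simulation. This is only an illustration under stated assumptions: the population of 100,000 dogs and the 8% blue-eye rate are invented, and `sample_proportion` is a hypothetical helper name.

```python
import random

def sample_proportion(population, n, rng):
    """Return the share of 1s in a simple random sample of size n."""
    sample = rng.sample(population, n)  # sampling without replacement
    return sum(sample) / n

# Hypothetical population: 100,000 dogs, 8% coded 1 (blue eyes), rest 0.
# The 8% figure is made up for illustration only.
POPULATION_SIZE = 100_000
population = [1] * 8_000 + [0] * 92_000
parameter = sum(population) / POPULATION_SIZE  # true proportion

rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (100, 1_000, 10_000):
    statistic = sample_proportion(population, n, rng)
    print(f"n={n:>6}  statistic={statistic:.4f}  error={statistic - parameter:+.4f}")
```

Any single run can buck the trend by chance, but the errors typically shrink as n grows. Sampling the whole population (a census) reproduces the parameter exactly.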
The f-counting demonstration: statistic vs. parameter in practice
- Setup: counting f’s in a paragraph to illustrate sampling vs. population counts.
- Task: count how many f’s appear in a paragraph. People tend to miscount because there are many lines and f’s are easy to miss.
- The exercise showed that counting across a whole population (all f’s in the entire text) would give the true count (the parameter).
- Instead, you take a sample line, count f’s, and use it as a statistic to estimate the total number of f’s in the entire population of lines.
- Numbers from the demonstration (illustrative, not exact):
- True total f’s in the whole text (parameter) = 34.
- Count in one sampled line (statistic) = 30.
- Sampling error = statistic − parameter = 30 − 34 = −4 (the sample undercounts by 4).
- Another sample yielded close results (e.g., a statistic near 30 to 33), illustrating that larger samples generally improve accuracy.
- Takeaway:
- The statistic is an estimate of the true parameter.
- Random sampling (a sample chosen at random) helps ensure the sample is representative and the statistic is a good estimate of the parameter.
- The more you sample, the more accurate the statistic tends to be.
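The f-counting idea can be tried on any text. The sketch below uses a short stand-in paragraph (not the one from the demonstration) and two hypothetical helpers, `count_f` and `estimate_total`. It samples a few lines, scales the average count up to the whole text, and compares the result with the full count.

```python
import random

# Hypothetical text standing in for the lecture paragraph (not the original).
TEXT_LINES = [
    "finished files are the result of years",
    "of scientific study combined with the",
    "experience of years of effort from staff",
    "who often find fault with first drafts",
]

def count_f(line):
    """Count occurrences of the letter f, ignoring case."""
    return line.lower().count("f")

def estimate_total(lines, sample_size, rng):
    """Estimate total f's: mean count in sampled lines times number of lines."""
    sampled = rng.sample(lines, sample_size)
    mean_per_line = sum(count_f(line) for line in sampled) / sample_size
    return mean_per_line * len(lines)

parameter = sum(count_f(line) for line in TEXT_LINES)  # census of every line
rng = random.Random(0)
statistic = estimate_total(TEXT_LINES, 2, rng)
print("parameter:", parameter)
print("statistic:", statistic)
print("sampling error (statistic - parameter):", statistic - parameter)
```

Sampling all four lines turns the estimate into a census, so the error drops to zero. With fewer lines, the estimate depends on which lines happen to be drawn.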
Sampling methods: overview and implications
- Why sampling methods matter
- Different sampling methods affect bias, precision, and generalizability.
- A key way to critique a study is to examine how the sample was obtained.
- Polling example after 2016 election
- Early polling relied heavily on cold calls, which biased the sample toward people who answer phones.
- Result: polls could misrepresent voting intentions; modern polling uses more sophisticated sampling to reduce bias and improve representativeness.
- Convenience sampling (the default, least desirable in many cases)
- Sample whoever is easy to reach (e.g., whoever happens to be at a particular spot during lunch).
- Pros: quick and easy.
- Cons: highly biased and often not representative.
- Simple random sampling (SRS)
- Each member of the population has an equal chance of being selected.
- Methods to implement:
- Use a computer random number generator (true randomness is elusive; see pseudo-random discussion below).
- Example tool: random.org for generating random numbers.
- In Excel: use RANDBETWEEN(a, b) or a similar function; you can repeat the sampling by dragging the formula into more cells.
- Practical notes:
- In practice, truly random numbers are hard to produce; most generators are pseudo-random but random enough for many applications.
- Repetition: depending on the context, you may want repeatable draws (seeded randomness) or a fresh, non-repeating draw each time.
- Systematic sampling
- Steps:
- Determine population size N and desired sample size n; compute sampling interval k = \frac{N}{n} (often rounded, e.g., floor for a fixed interval in some examples).
- Choose a random starting point r in {1, 2, …, k}.
- Select units: r, r+k, r+2k, …, r+(n−1)k.
- Example from lecture: class with 98 students, want n = 10 samples.
- Interval: k = \left\lfloor \frac{98}{10} \right\rfloor = 9.
- Random start r ∈ {1,…,9} (e.g., r = 6).
- Selected students: 6, 15, 24, 33, 42, 51, 60, 69, 78, 87.
- Pros: easy to implement, ensures coverage across the population.
- Cons: can be biased if the population has a hidden pattern that lines up with the interval (lecture caveat: bee sexes and the Fibonacci sequence can occasionally align with the sampling interval).
- Cluster sampling
- Divide the population into clusters (e.g., groups by state, classroom rows, etc.).
- Randomly select a subset of clusters and survey everyone in those clusters.
- Pros: efficient with large populations; reduces travel/communication costs.
- Cons: if the clusters differ systematically from one another (so that no single cluster resembles the whole population), estimates may be biased or have higher variance.
- Example from lecture: survey random rows in a class, survey everyone in those rows; could still be biased if rows correlate with outcomes (e.g., grades).
- Stratified sampling
- Divide the population into strata (mutually exclusive subgroups) and sample from each stratum, either in proportion to the stratum's share of the population or with a set sample size per stratum.
- Key idea: ensure representation of important subgroups (control for confounding variables).
- Example discussed: Texas vs Wyoming students; stratify by state, sample proportionally within each state, vs. simple cluster approach.
- Uses:
- Control for confounding variables (e.g., gender, age, location).
- Compare effects across strata (e.g., medication effects in men vs. women).
- Advantage: often more precise estimates when strata are internally homogeneous.
- Convenience vs. random sampling in practice
- Convenience sampling is common but risky due to bias.
- Stratified or simple random sampling generally provides more reliable inferences.
- Reading sampling methods in studies
- A real-world example: an American Airlines survey that randomly selects 87 flights and surveys all passengers on those flights (a cluster approach).
- The distinction between cluster and simple random sampling is subtle in practice; clarity about how clusters are chosen matters.
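To make the differences between the methods concrete, here is a minimal sketch of simple random, cluster, and stratified sampling on one roster. The roster is invented (98 students, a made-up TX/WY split, seven seating rows), and the three function names are hypothetical helpers rather than a standard API.

```python
import random
from collections import defaultdict

# Hypothetical roster: (student_id, state, row). All values are invented.
roster = [(i, "TX" if i % 5 else "WY", i % 7) for i in range(1, 99)]

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of selection."""
    return rng.sample(units, n)

def cluster_sample(units, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep everyone in them."""
    clusters = defaultdict(list)
    for unit in units:
        clusters[cluster_of(unit)].append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for key in chosen for unit in clusters[key]]

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    picked = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        picked.extend(rng.sample(members, k))
    return picked

rng = random.Random(7)
srs = simple_random_sample(roster, 10, rng)
by_row = cluster_sample(roster, lambda u: u[2], 2, rng)      # rows as clusters
by_state = stratified_sample(roster, lambda u: u[1], 0.1, rng)  # states as strata
print(len(srs), len(by_row), len(by_state))
```

The cluster sample contains only the chosen rows, while the stratified sample is guaranteed to include both states. A simple random sample of 10 could miss the smaller state entirely.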
Randomness, tools, and practical tips
- True randomness vs. pseudo-randomness
- Deterministic algorithms cannot generate truly random numbers, so most software relies on pseudo-random algorithms.
- Examples: natural processes (raindrops, star pulsations) or cryptographic algorithms can be used to seed generators; the Pokemon/random-start timing example illustrates that non-deterministic timing can feel random.
- Random number generators in practice
- Random.org: generates numbers using atmospheric noise; useful for demonstrations and some experiments.
- Excel/TI calculators: provide RANDBETWEEN(a,b) or equivalent; results are pseudo-random but adequate for many classroom purposes.
- Important caveat: even with very large datasets, random draws can still show patterns or clusters by chance; this is expected chance variation rather than bias, and it is usually acceptable for practical sampling.
- Examples and exercises using Excel functions
- Random integer between a and b: \text{RANDBETWEEN}(a,b).
- Floor function to round down: \text{FLOOR}(x) = \lfloor x \rfloor (e.g., floor(11.7) = 11).
- Combining functions to form sampling intervals and starting points (e.g., using FLOOR with random inputs to compute starting points and intervals).
- Practical sampling considerations with Excel
- You can generate multiple random picks and then tally outcomes or summarize statistics.
- Repetition vs. non-repetition: sometimes you want to avoid repeating the same individual, sometimes you want to allow repeats depending on context.
- Be mindful of potential sampling bias when using automated/random methods
- Even random processes can align with hidden patterns in the population, creating confounding effects.
- When sampling millions of cases, randomization alone may not address all systematic biases; stratification or careful cluster design can help.
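The points about seeding and about allowing or avoiding repeats can be illustrated outside Excel as well. The sketch below uses Python's standard `random` module; the list of names is a hypothetical roster.

```python
import random

names = ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"]  # hypothetical roster

# Seeding: two generators with the same seed produce the same picks.
first = random.Random(2024).sample(names, 3)
second = random.Random(2024).sample(names, 3)
print(first == second)  # same seed, same sequence

rng = random.Random(2024)
without_repeats = rng.sample(names, 4)   # each person at most once
with_repeats = rng.choices(names, k=4)   # the same person may recur
print(without_repeats)
print(with_repeats)
```

Sampling without repeats suits picking distinct individuals for a survey. Sampling with repeats matches settings such as repeated die rolls, where the same outcome may legitimately occur again.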
Sampling error vs. non-sampling error (biases)
- Sampling error
- Definition: the difference between the sample statistic and the true population parameter.
- Not “bad” in itself; it’s an expected part of sampling due to having only a subset.
- Example from the f-counting exercise: parameter = 34, statistic = 30; sampling error = 30 − 34 = −4 (an undercount of 4).
- As sample size increases, sampling error tends to decrease on average (law of large numbers intuition).
- Non-sampling error (bias, measurement error)
- Causes include biases in who responds, how questions are worded, and how samples are selected.
- Types of bias:
- Sampling bias / selection bias: individuals who choose to respond differ systematically from those who do not (self-selection bias).
- Example: asking people to shout out their exam grades in a classroom; those with higher grades may be more likely to respond.
- Response bias: poor phrasing or formats skew responses (e.g., Rotten Tomatoes scoring phrasing).
- Example: a binary good/bad question producing misleading interpretations of quality.
- Question phrasing and vagueness bias: vague questions yield inconsistent interpretations (e.g., vague family-related questions when assessing abuse risks).
- Publication bias and sponsor influence: researchers or sponsors may prefer to publish favorable results; negative results may be suppressed.
- Population representation bias in medical trials: underrepresentation of minorities (e.g., mostly white populations) can confound results and limit generalizability; stratified sampling can help mitigate this.
- Best practices to reduce biases
- Use clear, specific questions; avoid vague terms; specify what is being measured.
- Ensure sampling frames include diverse subgroups; use stratified sampling to ensure proportional representation.
- Document sampling method in detail so readers can assess potential biases.
- Consider multiple sampling methods when feasible and compare results across strata or clusters.
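The contrast between sampling error and self-selection bias can be sketched with a simulation of the classroom grade example. Everything here is assumed for illustration: the grade distribution, the response model in `respond_probability`, and the helper names.

```python
import random

rng = random.Random(1)

# Hypothetical class: exam grades drawn uniformly from 50..100 (invented).
grades = [rng.randint(50, 100) for _ in range(5_000)]
true_mean = sum(grades) / len(grades)

def respond_probability(grade):
    """Assumed response model: higher grades are more willing to answer."""
    return (grade - 40) / 60  # 50 -> about 0.17, 100 -> 1.0

def voluntary_sample(population, rng):
    """Keep only the people who choose to respond."""
    return [g for g in population if rng.random() < respond_probability(g)]

def mean(values):
    return sum(values) / len(values)

random_sample = rng.sample(grades, 500)
volunteers = voluntary_sample(grades, rng)

print(f"true mean        : {true_mean:.1f}")
print(f"random sample    : {mean(random_sample):.1f}  (n={len(random_sample)})")
print(f"voluntary sample : {mean(volunteers):.1f}  (n={len(volunteers)})")
```

Under this assumed response model the voluntary sample is much larger than the random one, yet its mean sits well above the true mean. A bigger sample reduces sampling error, but it does not remove bias in who responds.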
Quick reference: key formulas and concepts (LaTeX)
Population parameter vs. sample statistic
Parameter:
Denote as \theta, representing a true population value.
Statistic:
Denote as \hat{\theta}.
It is an estimator of the parameter:
\text{Estimator} \to \text{Parameter as } n \to \text{large}
Sampling error (basic form)
- \text{Sampling error} = \text{Statistic} - \text{Parameter} = \hat{\theta} - \theta
Systematic sampling interval and start (example)
- Population size: N
- Desired sample size: n
- Interval: k = \frac{N}{n} (often rounded, e.g., floor) -> if using floor, k = \left\lfloor \frac{N}{n} \right\rfloor
- Random start: r \sim \text{Uniform}\{1,2,\dots,k\}
- Sample units: \{ r, r+k, r+2k, \dots, r+(n-1)k \}
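The interval-and-start recipe above can be written as a short function. This is a sketch, not a standard routine: `systematic_sample` is a hypothetical helper, and it assumes units are numbered 1 to N and that the floor version of k is used.

```python
import random

def systematic_sample(N, n, start=None, rng=None):
    """Return n unit numbers (1-based) spaced k = floor(N / n) apart."""
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    k = N // n                      # integer division is floor(N / n)
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)   # random start r in {1, ..., k}
    elif not 1 <= start <= k:
        raise ValueError("start must lie in 1..k")
    return [start + i * k for i in range(n)]

print(systematic_sample(98, 10, start=6))
# [6, 15, 24, 33, 42, 51, 60, 69, 78, 87]
```

With N = 98, n = 10, and r = 6 this reproduces the class example. Because the last unit is r + (n − 1)k, which is at most nk and so at most N, the selection never runs past the end of the list.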
Random number generation references
- Random integer between a and b: \text{RANDBETWEEN}(a,b) (Excel-like function)
- Fractional random numbers: \text{RAND}() typically in [0,1)
- Floor function example: \lfloor x \rfloor (e.g., \text{Floor}(11.7) = 11)
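For readers working outside a spreadsheet, the three functions above have close counterparts in Python's standard library. The sketch below is one possible mapping; the seed value and k = 9 are arbitrary choices for the demo.

```python
import math
import random

rng = random.Random(123)          # seeded only so the demo is repeatable

die = rng.randint(1, 6)           # like RANDBETWEEN(1, 6): both ends included
u = rng.random()                  # like RAND(): a float in [0, 1)
print(math.floor(11.7))           # 11

# Building an integer in 1..k from a uniform draw with floor:
k = 9
start = math.floor(u * k) + 1     # floor(u * k) is 0..k-1, so start is 1..k
print(die, u, start)
```

The last step shows how a floor applied to a scaled uniform draw yields a random starting point, which is equivalent to calling `rng.randint(1, k)` directly.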
Be mindful of randomness claims
- True randomness is often unattainable in deterministic computers; pseudo-random generators are used in practice.
Conceptual terms
- Population: the entire group of interest.
- Sample: a subset used to draw inferences about the population.
- Census: measurement of the entire population.
- Survey: measurement on a sample to infer population parameters.
- Parameter: a true population value.
- Statistic: a sample-based estimate of the parameter.
- Bias: systematic error that can arise from sampling or response issues.
Practical takeaways for exams and real-world studies
- Always ask: How was the sample obtained? Is it random? Are there strata? Could there be biases?
- Prefer simple random or stratified sampling over convenience sampling when possible.
- When using systematic sampling, ensure the population is not arranged in a way that introduces bias with the chosen interval.
- Understand that sampling error is natural and can be reduced by larger, well-designed samples; non-sampling error requires careful survey design and execution.
- If you are reporting results, be explicit about the sampling method, the sample size, and any limitations due to bias or representation.