Statistical Sampling Techniques - Key Concepts and Applications
Sampling Overview
- Sampling is the process of selecting a subset (the sample) from a larger population.
- It is practical and cost-effective because studying an entire population is often time-consuming and expensive.
- Real-world examples:
- The government census, conducted roughly every 10 years, gathers data describing the entire population.
- Opinion polls during presidential elections typically survey about 10^3-10^4 people to represent national views.
- Manufacturers sample products (e.g., 16-ounce cans) to verify content accuracy and consistency across the batch.
- The sample is examined, information is interpreted, and conclusions are applied to the entire population.
- Purpose: to draw inferences about the population from the sample.
Sampling Requirements
- Careful sample selection is needed so the population is represented well and results are meaningful.
- All samples should be:
- Representative: the sample has the same relevant characteristics as the population and does not favor any subgroup.
- Random: selected by chance, with no intentional bias or ulterior motive.
- Five sampling techniques covered in this lesson (considered as individual techniques):
- Random
- Systematic
- Convenience
- Stratified
- Cluster
- In practice, researchers often use a combination of two or more techniques to improve representation and randomness.
Statistical Sampling Techniques
RANDOM (Simple Random Sample)
- Each sample of the same size has an equal opportunity of being chosen.
- Initially, every member of the population has an equal chance of being selected for the sample.
- Common examples:
- Drawing names from a hat.
- Using a Random Number Generator.
- Mathematical note: for a population of size N and a desired sample size n, the number of possible samples is \binom{N}{n}, and each specific sample has probability \frac{1}{\binom{N}{n}}.
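The mathematical note above can be checked with a small sketch. It assumes a hypothetical roster of 20 students and a sample size of 5; the function names and the roster are illustrative and not part of the lesson.

```python
import math
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)          # seed only makes the demo repeatable
    return rng.sample(population, n)   # sampling without replacement

def count_possible_samples(N, n):
    """Number of distinct size-n samples from a population of size N."""
    return math.comb(N, n)

# Hypothetical roster (assumption): N = 20 students, sample size n = 5.
population = [f"student_{i}" for i in range(1, 21)]
sample = simple_random_sample(population, 5, seed=42)
total = count_possible_samples(len(population), 5)

print(sample)
print(total)       # 15504 possible samples
print(1 / total)   # probability of any one specific sample
```

Drawing without replacement is what makes each of the 15504 possible samples equally likely here.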
SYSTEMATIC
- Select a random starting point, then select every L^{th} subject in the population.
- Simple to use and thus commonly employed.
- Description cue: look for a "th" in the description (e.g., every 50th).
- Example form: let the sampling interval be L; choose starting index r \text{ with } 1 \le r \le L, then select indices r, r+L, r+2L, \dots until the desired sample size is reached.
- Quick example from the material: selecting every 50^{th} student starting from a randomly chosen starting point to obtain 75 students.
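The procedure above (random start, then every L-th subject) might be sketched as follows. The roster of 3750 students is an assumption chosen so that 75 selections at interval 50 always fit; the function name is illustrative.

```python
import random

def systematic_sample(population, interval, n, seed=None):
    """Pick a random start r with 1 <= r <= interval, then take
    positions r, r + L, r + 2L, ... until n subjects are selected."""
    if interval * n > len(population):
        raise ValueError("population too small for this interval and sample size")
    rng = random.Random(seed)
    r = rng.randint(1, interval)                       # 1-based starting position
    positions = [r + j * interval for j in range(n)]   # r, r+L, r+2L, ...
    return [population[p - 1] for p in positions]      # convert to 0-based indexing

# Hypothetical roster (assumption): students numbered 1..3750 in list order.
roster = list(range(1, 3751))
sample = systematic_sample(roster, interval=50, n=75, seed=1)
print(sample[:3], len(sample))
```

The size check is a design choice for this sketch: it refuses to run unless every possible starting point yields a full sample, rather than wrapping around the list.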
CONVENIENCE
- Use results that are readily available or easily accessible.
- Examples: family members, students in a classroom, mall shoppers.
- CAUTION: can lead to a non-representative or biased sample because it may over- or under-represent certain groups.
STRATIFIED
- Divide the population into at least two distinct groups (strata) whose members share one or more common characteristics.
- Draw a sample from each stratum, typically at random.
- Benefits:
- Results in a more representative sample.
- Helps preserve certain characteristics of the population within the sample.
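A minimal sketch of the stratified approach, assuming a simple random draw inside each stratum (the lesson only says subjects are drawn from each stratum). The roster, the class labels attached to it, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=None):
    """Group records into strata, then draw per_stratum records from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for name in sorted(strata):                         # fixed order for repeatability
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

# Hypothetical roster (assumption): 400 students, 100 in each class.
classes = ["freshman", "sophomore", "junior", "senior"]
roster = [(i, classes[i % 4]) for i in range(400)]

sample = stratified_sample(roster, lambda rec: rec[1], per_stratum=25, seed=0)
print(len(sample))   # 100 students, 25 from each class
```

Taking the same number from every stratum mirrors the 25-per-class tuition example; sampling in proportion to stratum size is another common choice.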
CLUSTER
- Divide the population into groups (clusters).
- Randomly select some clusters, then collect data from ALL members of the selected clusters.
- Used extensively by government and private research organizations.
- Example: exit polls (sampling whole clusters like a geographic area or polling location).
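Cluster sampling as described above could look like this sketch. The polling locations and voter labels are made up for illustration, as is the function name.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Randomly choose k clusters and return every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)   # pick cluster names, not individuals
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical polling locations (assumption), each with its own voters.
clusters = {
    "location_A": ["a1", "a2", "a3"],
    "location_B": ["b1", "b2"],
    "location_C": ["c1", "c2", "c3", "c4"],
    "location_D": ["d1"],
}

chosen, members = cluster_sample(clusters, k=2, seed=3)
print(chosen, members)
```

Randomness enters only when choosing the clusters. Once a cluster is chosen, all of its members are included, which is what separates this technique from stratified sampling.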
Examples and Scenarios (Applying the Techniques)
Scenario 1 (Stratified): Determine the average tuition paid by San Jose State undergraduate students per Fall semester.
- Sample: 100 undergraduates organized by class (freshman, sophomore, junior, senior); 25 from each class are selected.
- Technique: Stratified because the population is divided by class and samples are drawn from each class.
- Alternative described: Use a random number generator to select a starting student from the entire alphabetical list, then select every 50th student until 75 students are included.
- Classification in notes: the first method is Stratified; the alternative is Systematic (random start, then every 50th student).
Scenario 2 (Random / Cluster): Same tuition study, two further sampling methods.
- Method A: A completely random process selects 75 students.
- Classification in notes: Random (simple random sample).
- Method B: The freshman, sophomore, junior, and senior levels are numbered 1, 2, 3, 4. A random number generator picks two levels, and all students in those two levels are included in the sample.
- Classification in notes: Cluster (the class years are the clusters; all students in the selected years are included).
Scenario 3 (Convenience): Administrative assistant stands by the library and asks the first 100 undergraduate students encountered about their tuition.
- Method: Convenience sampling due to easy access.
- Classification in notes: Convenience.
Scenario 4 (Stratified): Soccer coach forms a recreational team by age groups.
- Ages: 8-10, 11-12, 13-14.
- Selection: 6 players from 8-10, 7 players from 11-12, 3 players from 13-14.
- Classification in notes: Stratified by age groups.
Scenario 5 (Cluster): Pollster interviews all HR personnel in five different high-tech companies.
- Method: Cluster sampling by company.
Scenario 6 (Stratified): High school educational researcher interviews 50 female teachers and 50 male teachers.
- Method: Stratified by gender.
Scenario 7 (Systematic): Medical researcher interviews every third cancer patient from a hospital list.
- Method: Systematic with interval L=3.
Scenario 8 (Random): High school counselor generates 50 random numbers and selects students whose names correspond to those numbers.
- Method: Random sampling.
Scenario 9 (Convenience): A student interviews classmates in algebra class to determine the average number of jeans owned per student.
- Method: Convenience sampling.
Key Takeaways and Connections
- Sampling aims to represent the population while saving time and resources.
- The five core techniques each have strengths and weaknesses:
- Random: every member has an equal chance of selection, which guards against selection bias.
- Systematic: simple and fast, but potential biases if there is an underlying pattern aligned with the interval.
- Convenience: highly practical but prone to bias.
- Stratified: improves representativeness by preserving subgroup characteristics.
- Cluster: practical for large populations and when natural groups exist; can be efficient but may introduce cluster-level bias if clusters are not representative.
- Real-world relevance:
- Government censuses inform policy and resource allocation.
- Public opinion polls guide political strategy and media coverage.
- Quality control in manufacturing relies on sampling to ensure product consistency.
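The systematic-sampling caveat in the takeaways (bias when a pattern in the list lines up with the interval) can be illustrated with a small sketch. The data are invented: 364 daily counts in which every 7th day is much higher. The helper name is illustrative.

```python
from statistics import mean

# Invented data (assumption): 52 weeks of daily counts, every 7th day is a spike.
population = [100 if day % 7 == 0 else 10 for day in range(1, 365)]

def systematic_mean(values, interval, start):
    """Mean of every interval-th value, beginning at 1-based position start."""
    return mean(values[start - 1::interval])

true_mean = mean(population)

# Interval 7 matches the weekly pattern: each start sees only one kind of day.
aligned = [systematic_mean(population, 7, r) for r in range(1, 8)]

# Interval 10 does not match the pattern: each start sees a mix of days.
unaligned = [systematic_mean(population, 10, r) for r in range(1, 11)]

print(true_mean)
print(aligned)
print(unaligned)
```

With interval 7, six of the seven starting points see only ordinary days and one sees only spike days, so no start gives an estimate near the true mean. With interval 10, every start lands close to it.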
Formulas and Notation Highlights
- Simple Random Sampling: probability of selecting a specific sample is \frac{1}{\binom{N}{n}}.
- Systematic Sampling (conceptual):
- Fix the interval L, choose a random starting point r \in \{1,2,\dots, L\}, and select indices i_j = r + (j-1)L, \ j = 1,2,\dots, t where t is the number of selections needed (t = n when the desired sample size is n).
- General relationships:
- Population size: N
- Desired sample size: n
- Sampling interval (systematic): L (e.g., 50 in the provided example)
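As a worked illustration of the notation above, here are both formulas with small numbers plugged in. The values N = 5, n = 2 and the starting point r = 7 are assumptions chosen for the arithmetic; L = 50 and t = 75 come from the lesson's example.

```latex
% Simple random sampling with N = 5, n = 2
\binom{5}{2} = \frac{5!}{2!\,3!} = 10,
\qquad
P(\text{one specific sample}) = \frac{1}{\binom{5}{2}} = \frac{1}{10}

% Systematic sampling with L = 50, t = 75, and an assumed start r = 7
i_j = 7 + (j-1)\cdot 50,
\qquad
i_1 = 7,\; i_2 = 57,\; \dots,\; i_{75} = 7 + 74 \cdot 50 = 3707
```

For this starting point the list must hold at least r + (t-1)L = 3707 entries for all 75 selections to exist.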