Notes on Data Collection, Sampling, and Measurement Levels
Four general steps of statistics
Formulate questions (e.g., which candidate is most likely to win)
Collect and analyze data to address the questions
Describe the data
Draw conclusions using appropriate methods
Course focus and key definitions
Statistics is the study of procedures for collecting, describing, and drawing conclusions from information.
Emphasis in this course:
Chapter 1: collecting data in an unbiased way
Chapters 2–3: describing data graphically and numerically
Remaining chapters: methods for drawing conclusions from data
Population vs. Sample:
Population: the entire collection of individuals about which information is sought
Sample: a subset of a population containing individuals that are actually observed
Statistic: a number that describes a sample
Parameter: a number that describes a population
Is it a statistic or a parameter? (Examples)
57% of the teachers at Central High School are female
57% describes the entire population of teachers in that school → parameter
In a sample of 100 surgery patients given a new pain reliever, 78% reported significant pain relief
78% describes a sample → statistic
Constructing a simple random sample
Simple random sample (SRS) of size n: a sample chosen so that every collection of n population items is equally likely to make up the sample
Analogy: lottery
Example: 10,000 lottery tickets; five tickets drawn as winners; any group of five tickets equally likely
Example 1 (random integers):
A physical education professor has 20,000 students, numbered 1 to 20,000. The professor uses a computer to generate 100 distinct random integers between 1 and 20,000 and invites the students with those numbers
Is this a simple random sample? Yes, because every group of 100 students was equally likely to be chosen
Example 2 (class-based):
The professor wants a sample of 50 students to fill out a questionnaire; she uses her 10:00 AM class of 50 students and has them complete the questionnaire during the first 20 minutes
Is this a simple random sample? No, because students outside that class had no chance of being chosen, so not every group of 50 students from the population was equally likely
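Example 1 can be sketched in Python. This is a minimal illustration, assuming students are identified by the integers 1 to 20,000; the function name draw_srs and the seed value are hypothetical choices, not part of the original example.

```python
import random

POPULATION_SIZE = 20_000  # students numbered 1..20,000, as in Example 1
SAMPLE_SIZE = 100

def draw_srs(population_size, sample_size, seed=None):
    """Return sample_size distinct ID numbers chosen uniformly at random."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no ID repeats and
    # every set of sample_size IDs is equally likely to be chosen.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

invited = draw_srs(POPULATION_SIZE, SAMPLE_SIZE, seed=1)
print(len(invited), "students invited; first few IDs:", invited[:5])
```

Drawing without replacement matters here: independently generated random integers could repeat, which would leave fewer than 100 distinct students.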
Samples of convenience
When random sampling is difficult or impossible, a convenient method is used
Definition: a sample not drawn by a well-defined random method
Example: 1,000 concrete blocks in a pile; test 10 blocks for crushing strength
Why is it hard to draw an SRS from the pile? You would have to remove blocks from the center and bottom of the pile; taking 10 blocks off the top is convenient but may be biased
Problems: samples of convenience may differ systematically from the population
If it is reasonable to believe that no important systematic difference exists between the sample and the population, a convenience sample may be treated as if it were random; otherwise, bias is likely present
Other sampling methods (stratified, cluster, systematic, voluntary response)
Stratified random sampling:
Population divided into groups (strata); then a simple random sample drawn from each stratum
Useful when strata differ from one another but individuals within a stratum are similar
Example: a company has 800 full-time and 200 part-time employees; to draw a sample of 100, draw a simple random sample of 80 full-time employees and a simple random sample of 20 part-time employees, so each stratum is represented in proportion to its size
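The company example can be sketched as follows. This is an illustration that assumes proportional allocation (each stratum contributes in proportion to its size); the employee labels and the function name stratified_sample are hypothetical.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of sizes[name] items from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical employee labels for the two strata in the example.
strata = {
    "full_time": [f"FT{i}" for i in range(1, 801)],   # 800 employees
    "part_time": [f"PT{i}" for i in range(1, 201)],   # 200 employees
}

total = sum(len(members) for members in strata.values())
sample_total = 100
# Proportional allocation; integer division is exact for these numbers,
# but in general the rounding would need to be handled explicitly.
sizes = {name: sample_total * len(members) // total
         for name, members in strata.items()}

sample = stratified_sample(strata, sizes, seed=7)
print(sizes)
```

With these counts the allocation is 80 full-time and 20 part-time employees.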
Cluster sampling:
Population is divided into groups (clusters); some clusters are then selected at random and all individuals in the chosen clusters are included
Useful when the population is large and spread out
Example: clusters = households in a county; select a random sample of households and survey all adults in each selected household
Question: What are the clusters? Why is this a cluster sample? Each household (a group of adults) is a cluster; it is a cluster sample because households are chosen at random and every adult in each chosen household is surveyed
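A small sketch of the household example, treating each household as one cluster. The county below is made up (50 households with 1 to 4 adults each), and cluster_sample is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters cluster IDs at random and keep every member of each."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), n_clusters)
    return {cid: clusters[cid] for cid in chosen_ids}

# Hypothetical county: household h has 1 + (h % 4) adults.
households = {
    f"H{h}": [f"H{h}-adult{a}" for a in range(1, 2 + h % 4)]
    for h in range(1, 51)
}

selected = cluster_sample(households, n_clusters=5, seed=42)
# Every adult in a selected household is surveyed; nobody else is.
surveyed = [adult for adults in selected.values() for adult in adults]
print(len(selected), "households selected,", len(surveyed), "adults surveyed")
```

The randomness enters only at the cluster level, which is what distinguishes this from a stratified sample, where individuals are sampled within every group.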
Systematic sampling:
Items are ordered; select every k-th item after a starting point
Often used for quality checks in production
Example: automobiles on an assembly line; start with the 3rd car, then sample every 5th car: 3, 8, 13, 18, 23, 28, …
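The assembly-line rule can be written in a few lines. This is a sketch; the production run of 30 cars is an assumed figure used only to make the list finite.

```python
def systematic_sample(n_items, start, k):
    """1-based positions selected: start, start + k, start + 2k, ... up to n_items."""
    return list(range(start, n_items + 1, k))

# Start with the 3rd car, then take every 5th car, out of an assumed 30 cars.
positions = systematic_sample(n_items=30, start=3, k=5)
print(positions)  # [3, 8, 13, 18, 23, 28]
```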
Voluntary response sampling:
Participants are invited to respond (e.g., call-in, log on, text, tweet)
Often used by the media; the results are not reliable
Why not reliable? People with strong opinions are more likely to participate; those with moderate/no opinions underrepresented
Data collection in practice: identifying data and variables
Example scenario: poll with 6 voters on political affiliation, age, and voting history
Individuals: 6
Variables: political affiliation, age, voted last election
For individual 3: example data could be Democrat, 21, no
Collecting information by sampling yields a data set; individuals have characteristics called variables; the observed values are the data
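One way to picture individuals, variables, and data is as a record type. The sketch below is illustrative: the class name Voter and its field names are hypothetical, and only individual 3's values come from the example above.

```python
from dataclasses import dataclass, fields

@dataclass
class Voter:
    """One individual in the poll; each field is a variable."""
    political_affiliation: str   # qualitative
    age: int                     # quantitative
    voted_last_election: bool    # qualitative (yes/no)

# Data for individual 3 from the example: Democrat, 21, no.
individual_3 = Voter(political_affiliation="Democrat", age=21,
                     voted_last_election=False)

variables = [f.name for f in fields(Voter)]
print(variables)
print(individual_3)
```

A full data set for the poll would be a list of six such records, one per individual.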
Qualitative vs. Quantitative data
Qualitative data: classify individuals into categories
Quantitative data: tell how much or how many
Example question: Which of the following are quantitative?
A) A person’s age → quantitative (tells how much time has elapsed since birth)
B) A person’s gender → qualitative (categories: male/female)
C) The mileage of a car → quantitative (miles driven)
D) The color of a car → qualitative (categories of colors)
Ordinal vs. nominal data (within qualitative data)
Ordinal: natural ordering
Nominal: no natural ordering
Examples:
A) State of residence → nominal (no natural order)
B) Gender → nominal (no natural order)
C) Letter grade (A, B, C, D, F) → ordinal (A > B > C > D > F)
D) Size of soft drink (small, medium, large, extra large) → ordinal
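The practical difference shows up when sorting or comparing. The sketch below encodes the soft-drink ordering explicitly; the list of orders is made up for illustration.

```python
# Natural ordering of the ordinal variable, smallest to largest.
SIZE_ORDER = ["small", "medium", "large", "extra large"]
rank = {size: position for position, size in enumerate(SIZE_ORDER)}

# Hypothetical drink orders.
orders = ["large", "small", "extra large", "medium", "small"]

by_size = sorted(orders, key=rank.__getitem__)   # uses the natural ordering
alphabetical = sorted(orders)                    # ordering with no meaning here
largest = max(orders, key=rank.__getitem__)

print(by_size)       # ['small', 'small', 'medium', 'large', 'extra large']
print(alphabetical)  # ['extra large', 'large', 'medium', 'small', 'small']
print(largest)       # extra large
```

For a nominal variable such as state of residence there is no rank table to write down, so operations like max or "greater than" have no meaningful interpretation.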
Continuous vs. discrete data (within quantitative data)
Discrete: possible values can be listed
Continuous: can take any value in an interval
Examples:
A) Age at last birthday → discrete (whole years)
B) Height of a person → continuous
C) Number of siblings → discrete
D) Distance to work → continuous
Levels of measurement: ratio vs. interval
Quantitative variables can be classified as ratio or interval
Ratio level: zero represents the absence of the quantity; ratios are meaningful
Interval level: zero does not represent absence; ratios are not meaningful
Examples:
Number of siblings → ratio (zero means no siblings; 4 siblings is twice 2)
Outdoor temperature in degrees Celsius → interval (zero degrees does not mean no heat; 10°C is not twice as hot as 5°C in a meaningful way)
Year of the next presidential election → interval (zero year doesn’t denote the start of time)
Price of a pair of shoes → ratio (zero dollars means no cost; $100 is twice $50)
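A quick check of why ratios are meaningful at the ratio level but not at the interval level: a meaningful ratio should not change when the units change. The sketch below uses the standard Celsius-to-Fahrenheit conversion and a dollars-to-cents conversion; the function names are hypothetical.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def dollars_to_cents(d):
    return d * 100

# Interval level: the ratio of two temperatures depends on the scale.
ratio_celsius = 10 / 5                                                     # 2.0
ratio_fahrenheit = celsius_to_fahrenheit(10) / celsius_to_fahrenheit(5)   # 50 / 41

# Ratio level: the ratio of two prices is the same in any unit.
ratio_dollars = 100 / 50                                                   # 2.0
ratio_cents = dollars_to_cents(100) / dollars_to_cents(50)                 # 2.0

print(f"Temperature ratio: {ratio_celsius:.2f} (C) vs {ratio_fahrenheit:.2f} (F)")
print(f"Price ratio: {ratio_dollars:.2f} (dollars) vs {ratio_cents:.2f} (cents)")
```

The temperature ratio changes because the zero point moves between scales; the price ratio does not because zero means "no cost" in every unit.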
Other examples to classify as qualitative vs. quantitative and nominal vs. ordinal:
Genre, release year, ticket sales in millions, running time in minutes → ticket sales, running time, and release year are quantitative; genre is qualitative and, having no natural ordering, nominal
Bias and reliability in studies
Unbiased vs biased studies:
An unbiased study yields correct results on average across many samples
Biased studies systematically misestimate the population value
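"Correct on average" can be illustrated by repeating a study many times. The simulation below is a sketch with invented numbers: a population of 10,000 in which 30% have some trait, and a biased sampling frame from which half of the people without the trait are missing.

```python
import random

rng = random.Random(0)

# Hypothetical population: 1 = has the trait, 0 = does not (30% have it).
population = [1] * 3_000 + [0] * 7_000
true_p = sum(population) / len(population)

# Biased frame (assumption): only 3,500 of the 7,000 without the trait
# can be reached, so trait holders are overrepresented.
reachable = population[:6_500]

def average_estimate(frame, n=50, repeats=2_000):
    """Average of the sample proportions from many samples of size n."""
    estimates = [sum(rng.sample(frame, n)) / n for _ in range(repeats)]
    return sum(estimates) / repeats

unbiased_avg = average_estimate(population)
biased_avg = average_estimate(reachable)

print(f"True proportion:       {true_p:.3f}")
print(f"SRS, averaged:         {unbiased_avg:.3f}")
print(f"Biased frame, averaged: {biased_avg:.3f}")
```

Individual samples from either method vary, but only the simple random samples center on the true proportion; the biased frame centers near 3,000 / 6,500, well above it.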
Voluntary response bias:
People invited to participate may not reflect the population; those with strong opinions are more likely to participate
Self-interest bias:
Advertisers or sponsors may omit data that is unfavorable to their product
Social acceptability bias:
Respondents may misreport behaviors they think are socially undesirable (e.g., claiming to have voted when they did not)
Leading question bias:
Wording of questions can push respondents toward a particular answer (e.g., favoring tax cuts with loaded language)
Nonresponse bias:
Some people refuse to participate; nonresponders may differ from responders
Sampling bias:
Some population members are more likely to be included than others (e.g., calling landlines excludes cell-phone-only individuals)
Important takeaway:
A sample of n = 80,000 people at a football game may still be biased if the sampling method overrepresents a subgroup; a larger sample does not fix bias
Illustration:
Example: 80,000 football-game attendees are polled about a proposed use of tax money; 70% support it
A simple random sample of 500 voters finds 30% support
The owner claims the stadium sample is more reliable because it is larger, but people attending a game are not representative of all voters, so the stadium sample is biased; a bigger sample does not overcome bias
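The stadium illustration can be simulated. Everything about the electorate below is an assumption chosen to match the percentages in the example: 400,000 voters in total, 30% support overall, and 70% support among the 80,000 attendees.

```python
import random

rng = random.Random(2024)

# Hypothetical electorate (True = supports the proposal).
attendees = [True] * 56_000 + [False] * 24_000    # 80,000 at the game, 70% support
others = [True] * 64_000 + [False] * 256_000      # everyone else, assumed 20% support
voters = attendees + others                       # 120,000 of 400,000 support: 30%

def support_rate(group):
    """Proportion of True entries in the group."""
    return sum(group) / len(group)

true_rate = support_rate(voters)
stadium_estimate = support_rate(attendees)             # n = 80,000, biased
srs_estimate = support_rate(rng.sample(voters, 500))   # n = 500, simple random sample

print(f"True support:    {true_rate:.1%}")
print(f"Stadium sample:  {stadium_estimate:.1%}")
print(f"SRS of 500:      {srs_estimate:.1%}")
```

The stadium figure is off by 40 percentage points no matter how many attendees are asked, while the much smaller random sample lands close to the true rate, with only ordinary sampling variation.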
Summary of practical implications
The method of sampling directly impacts the reliability and validity of conclusions
Always assess potential biases (voluntary response, nonresponse, sampling bias, leading questions, etc.) in addition to sample size
Distinguish properly between population parameters and sample statistics when interpreting results
Use appropriate measurement scales (nominal, ordinal, interval, ratio) to choose valid analysis methods
Recognize when a convenience or nonrandom sample may be acceptable (if no systematic differences are believed) versus when it introduces bias that cannot be ignored