Notes on Data Collection, Sampling, and Measurement Levels

Four general steps of statistics
  • Formulate questions (e.g., which candidate is most likely to win)

  • Collect and analyze data to address the questions

  • Describe the data

  • Draw conclusions using appropriate methods

Course focus and key definitions
  • Statistics is the study of procedures for collecting, describing, and drawing conclusions from information.

  • Emphasis in this course:

    • Chapter 1: collecting data in an unbiased way

    • Chapters 2–3: describing data graphically and numerically

    • Remaining chapters: methods for drawing conclusions from data

  • Population vs. sample, parameter vs. statistic:

    • Population: the entire collection of individuals about which information is sought

    • Sample: a subset of a population containing individuals that are actually observed

    • Statistic: a number that describes a sample

    • Parameter: a number that describes a population

Is it a statistic or a parameter? (Examples)
  • 57% of the teachers at Central High School are female

    • 57% describes the entire population of teachers in that school → parameter

  • In a sample of 100 surgery patients given a new pain reliever, 78% reported significant pain relief

    • 78% describes a sample → statistic
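
The distinction above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the roster of 100 teachers is a hypothetical assumption (the notes give only the 57% figure), and the sample size of 20 and the seed are arbitrary.

```python
import random

# Hypothetical roster: 100 teachers, 57 of them female.
# Only the 57% figure comes from the notes; the roster size is assumed.
teachers = ["F"] * 57 + ["M"] * 43

# Parameter: computed from the ENTIRE population, so it is a fixed number.
parameter = teachers.count("F") / len(teachers)

# Statistic: computed from a sample, so it changes from sample to sample.
rng = random.Random(1)             # arbitrary seed, for reproducibility only
sample = rng.sample(teachers, 20)  # arbitrary sample size
statistic = sample.count("F") / len(sample)

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

Rerunning with a different seed changes the statistic but never the parameter.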

Constructing a simple random sample
  • Simple Random Sample (SRS) of size n: a sample chosen so that every collection of n population items is equally likely to make up the sample

  • Analogy: lottery

    • Example: 10,000 lottery tickets; five tickets drawn as winners; any group of five tickets equally likely

  • Example 1 (random integers):

    • Physical education professor has 20,000 students numbered 1 to 20,000, uses a computer to generate 100 distinct random integers between 1 and 20,000, and invites the students with those numbers

    • Is this a simple random sample? Yes, because every group of 100 students was equally likely to be chosen

  • Example 2 (class-based):

    • Professor wants a sample of 50 students to fill out a questionnaire; she uses her 10:00 AM class of 50 students and has them complete it in the first 20 minutes of class

    • Is this a simple random sample? No: students outside the 10:00 AM class had no chance of being selected, so not every group of 50 students from the population was equally likely
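
Example 1 can be sketched with Python's standard library. This is one possible way to do it, assuming students are identified by the integers 1 to 20,000; the seed is arbitrary. `random.sample` draws without replacement, so no student is invited twice and every group of 100 is equally likely.

```python
import random

POPULATION_SIZE = 20_000  # students, numbered 1..20,000 (from Example 1)
SAMPLE_SIZE = 100

rng = random.Random(42)   # arbitrary seed, for reproducibility only
student_ids = range(1, POPULATION_SIZE + 1)

# Sampling WITHOUT replacement: all IDs in the result are distinct.
srs = rng.sample(student_ids, SAMPLE_SIZE)

print(sorted(srs)[:10])   # first few selected IDs
```

Generating 100 integers independently (for example with repeated calls to `randint`) could produce duplicates, which is why a without-replacement draw is used here.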

Samples of convenience
  • When drawing a random sample is difficult or impossible, a sample of convenience is sometimes used instead

  • Definition: a sample not drawn by a well-defined random method

  • Example: 1,000 concrete blocks in a pile; test 10 blocks for crushing strength

  • Why is it hard to draw an SRS from the pile? You’d have to remove blocks from the center and bottom; sampling 10 blocks from the top is convenient but may be biased

  • Problem: a sample of convenience may differ systematically from the population

  • If there is no reason to believe the sample differs systematically from the population, a convenience sample may be treated as if it were random; otherwise, bias is likely present

Other sampling methods (stratified, cluster, systematic, voluntary response)
  • Stratified random sampling:

    • Population divided into groups (strata); then a simple random sample drawn from each stratum

    • Useful when strata differ from one another but individuals within a stratum are similar

    • Example: a company with 800 full-time and 200 part-time employees; to draw a sample of 100, draw a simple random sample of 80 full-time employees and a simple random sample of 20 part-time employees

  • Cluster sampling:

    • Population is divided into groups (clusters); then some clusters are selected at random and all individuals in the chosen clusters are included

    • Useful when the population is large and spread out

    • Example: clusters = households in a county; select a simple random sample of households and survey all adults in each selected household

    • Question: What are the clusters? Why is this a cluster sample? Each household (the group of adults living in it) is a cluster; it is a cluster sample because clusters are chosen at random and every individual in a chosen cluster is surveyed

  • Systematic sampling:

    • Items are ordered; select every k-th item after a starting point

    • Often used for quality checks in production

    • Example: automobiles on an assembly line; start with the 3rd car, then sample every 5th car: 3, 8, 13, 18, 23, 28, …

  • Voluntary response sampling:

    • Participants are invited to respond (e.g., call-in, log on, text, tweet)

    • Often used by media; results are not reliable

    • Why not reliable? People with strong opinions are more likely to participate; those with moderate or no opinions are underrepresented
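
The three random methods above (stratified, cluster, systematic) can be sketched as follows. This is an illustration under stated assumptions: the employee and household labels are hypothetical, the 100 stratified slots are assumed to be split 80/20 in proportion to stratum sizes, and the county of 50 households with 1 to 4 adults each is invented for the example.

```python
import random

rng = random.Random(0)  # arbitrary seed, for reproducibility only

# Stratified: draw a separate SRS from each stratum.
# Stratum sizes (800 / 200) are from the notes; the labels are hypothetical.
full_time = [f"FT{i}" for i in range(1, 801)]
part_time = [f"PT{i}" for i in range(1, 201)]
stratified_sample = rng.sample(full_time, 80) + rng.sample(part_time, 20)

# Cluster: pick whole clusters at random, then keep EVERY member of each.
# Hypothetical county: 50 households, each with 1 to 4 adults.
households = {
    h: [f"H{h}-A{a}" for a in range(1, rng.randint(1, 4) + 1)]
    for h in range(1, 51)
}
chosen_households = rng.sample(sorted(households), 5)
cluster_sample = [adult for h in chosen_households for adult in households[h]]

# Systematic: start at a given position, then take every k-th item.
def systematic_sample(n_items, start, k):
    """Return 1-based positions start, start + k, ... up to n_items."""
    return list(range(start, n_items + 1, k))

cars = systematic_sample(30, start=3, k=5)
print(cars)  # [3, 8, 13, 18, 23, 28]
```

Note that the cluster sample's size is not fixed in advance: it depends on how many adults live in the households that happen to be chosen.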

Data collection in practice: identifying data and variables
  • Example scenario: poll with 6 voters on political affiliation, age, and voting history

    • Individuals: 6

    • Variables: political affiliation, age, voted last election

    • For individual 3: example data could be Democrat, 21, no

  • Collecting information by sampling yields a data set; individuals have characteristics called variables; the observed values are the data
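
One way to picture the data set is as a list of records, one per individual, with one field per variable. In this sketch only the values for individual 3 come from the notes; the other five rows and the field names are hypothetical placeholders.

```python
# Each dict is one individual; each key is one variable.
# Only row 3 (Democrat, 21, no) is from the notes; other rows are hypothetical.
voters = [
    {"affiliation": "Republican",  "age": 34, "voted_last_election": True},
    {"affiliation": "Independent", "age": 52, "voted_last_election": True},
    {"affiliation": "Democrat",    "age": 21, "voted_last_election": False},
    {"affiliation": "Democrat",    "age": 45, "voted_last_election": True},
    {"affiliation": "Republican",  "age": 67, "voted_last_election": False},
    {"affiliation": "Independent", "age": 29, "voted_last_election": True},
]

n_individuals = len(voters)   # number of individuals
variables = list(voters[0])   # names of the variables
individual_3 = voters[2]      # lists are 0-indexed, so index 2 is individual 3

print(n_individuals, variables, individual_3)
```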

Qualitative vs. Quantitative data
  • Qualitative data: classify individuals into categories

  • Quantitative data: tell how much or how many

  • Example question: Which of the following are quantitative?

    • A) A person’s age → quantitative (tells how much time has elapsed since birth)

    • B) A person’s gender → qualitative (categories: male/female)

    • C) The mileage of a car → quantitative (miles driven)

    • D) The color of a car → qualitative (categories of colors)

Ordinal vs. nominal data (within qualitative data)
  • Ordinal: natural ordering

  • Nominal: no natural ordering

  • Examples:

    • A) State of residence → nominal (no natural order)

    • B) Gender → nominal (no natural order)

    • C) Letter grade (A, B, C, D, F) → ordinal (A > B > C > D > F)

    • D) Size of soft drink (small, medium, large, extra large) → ordinal

Continuous vs. discrete data (within quantitative data)
  • Discrete: possible values can be listed

  • Continuous: can take any value in an interval

  • Examples:

    • A) Age at last birthday → discrete (whole years)

    • B) Height of a person → continuous

    • C) Number of siblings → discrete

    • D) Distance to work → continuous

Levels of measurement: ratio vs. interval
  • Quantitative variables can be classified as ratio or interval

  • Ratio level: zero represents the absence of the quantity; ratios are meaningful

  • Interval level: zero does not represent absence; ratios are not meaningful

  • Examples:

    • Number of siblings → ratio (zero means no siblings; 4 siblings is twice 2)

    • Outdoor temperature in degrees Celsius → interval (zero degrees does not mean no heat; 10°C is not twice as hot as 5°C in a meaningful way)

    • Year of the next presidential election → interval (zero year doesn’t denote the start of time)

    • Price of a pair of shoes → ratio (zero dollars means no cost; $100 is twice $50)

  • Other examples to classify as qualitative vs. quantitative and by level of measurement (movie data):

    • Variables: genre, ticket sales in millions, running time in minutes, release year → genre is qualitative (nominal); ticket sales, running time, and release year are quantitative (ticket sales and running time are ratio level; release year is interval level)
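
A quick way to check whether ratios are meaningful is to change units and see whether the ratio survives. The sketch below applies that check to the Celsius and shoe-price examples above; the function name `c_to_f` is just a helper introduced for illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval level: the ratio depends on where the scale puts its zero.
ratio_celsius = 10 / 5                         # 2.0
ratio_fahrenheit = c_to_f(10) / c_to_f(5)      # 50 / 41, about 1.22

# Ratio level: the ratio is the same in any unit (dollars or cents).
ratio_dollars = 100 / 50                       # 2.0
ratio_cents = (100 * 100) / (50 * 100)         # 2.0

print(ratio_celsius, round(ratio_fahrenheit, 2), ratio_dollars, ratio_cents)
```

The same two temperatures give a ratio of 2 in Celsius but about 1.22 in Fahrenheit, so "twice as hot" has no unit-independent meaning; the price ratio stays at 2 regardless of currency unit.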

Bias and reliability in studies
  • Unbiased vs biased studies:

    • An unbiased study yields correct results on average across many samples

    • Biased studies systematically misestimate the population value

  • Voluntary response bias:

    • People invited to participate may not reflect the population; those with strong opinions are more likely to participate

  • Self-interest bias:

    • Advertisers or sponsors may omit data that is unfavorable to their product

  • Social acceptability bias:

    • Respondents may misreport behaviors they think are socially undesirable (e.g., claiming to have voted when they did not)

  • Leading question bias:

    • Wording of questions can push respondents toward a particular answer (e.g., favoring tax cuts with loaded language)

  • Nonresponse bias:

    • Some people refuse to participate; nonresponders may differ from responders

  • Sampling bias:

    • Some population members are more likely to be included than others (e.g., calling landlines excludes cell-phone-only individuals)

  • Important takeaway:

    • A sample of n = 80,000 people at a football game may still be biased if the sampling method overrepresents a subgroup; a larger sample does not fix bias

  • Illustration:

    • Example: 80,000 football-game attendees are polled on a question about tax money; 70% express support

    • A simple random sample of 500 voters finds 30% support

    • The owner claims the stadium sample is more reliable because it is larger, but game attendees are not representative of all voters, so the stadium sample is biased; a larger sample size does not overcome bias
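
The illustration above can be mimicked with a small simulation. Everything about the population here is a hypothetical assumption chosen to match the notes' percentages: an electorate of 400,000 voters, of whom 100,000 are football fans with 70% support, and 300,000 non-fans whose support level brings the overall figure to 30%.

```python
import random

rng = random.Random(7)  # arbitrary seed, for reproducibility only

# Hypothetical electorate (True = supports the proposal).
fans = [True] * 70_000 + [False] * 30_000        # 100,000 fans, 70% support
non_fans = [True] * 50_000 + [False] * 250_000   # 300,000 others
population = fans + non_fans

true_support = sum(population) / len(population)  # 120,000 / 400,000 = 0.30

# Large but biased sample: 80,000 people drawn only from the fans.
stadium_sample = rng.sample(fans, 80_000)
stadium_estimate = sum(stadium_sample) / len(stadium_sample)

# Small but unbiased sample: SRS of 500 from the whole electorate.
srs = rng.sample(population, 500)
srs_estimate = sum(srs) / len(srs)

print(f"true = {true_support:.2f}")
print(f"stadium (n=80,000) = {stadium_estimate:.2f}")
print(f"SRS (n=500) = {srs_estimate:.2f}")
```

Under these assumptions the stadium estimate lands near 0.70 no matter how many fans are polled, while the much smaller SRS lands close to the true 0.30: sample size reduces random variation, not systematic error.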

Summary of practical implications
  • The method of sampling directly impacts the reliability and validity of conclusions

  • Always assess potential biases (voluntary response, nonresponse, sampling bias, leading questions, etc.) in addition to sample size

  • Distinguish properly between population parameters and sample statistics when interpreting results

  • Use appropriate measurement scales (nominal, ordinal, interval, ratio) to choose valid analysis methods

  • Recognize when a convenience or nonrandom sample may be acceptable (when there is no reason to believe it differs systematically from the population) versus when it introduces bias that cannot be ignored