Statistical Inference: Population, Sampling, and Confidence Intervals

Review: Key components of a study

  • A research question or hypothesis

  • Independent and dependent variables

  • Measurement for the variables

  • Data collection

  • For what? Statistical inference (to answer the research question)

  • On whom? (next lecture) = Sampling

Preview

  • Population vs. Sample

  • Statistical Inference

  • Sampling

  • Confidence Intervals

Population vs. Sample

  • Population

    • The entire pool from which a sample is drawn

    • The individuals or group the sample is supposed to represent

    • In general, large numbers of people

  • Sample

    • Usually who researchers actually deal with

    • A portion of the population

    • In general, a much smaller number of individuals

Population vs. Sample (Parameters vs. Statistics)

  • Parameter

    • A characteristic of the population

    • Ex: population mean

    • Example: Average height for an adult male in the U.S.

  • Sample statistic

    • A characteristic of a sample calculated from data in the sample

    • Ex: sample mean

    • Example: Average height for an adult male in the sample of 100 U.S. male participants

  • Statistical inference

    • Generalizing from a sample to a population with calculated degree of certainty

Population vs. Sample (Symbols)

  • Population → Parameter(s): N, μ, σ

  • Sample → Statistic(s): n, x̄, s

  • Note: Parameter represents the true population value; statistics are derived from the sample data
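The symbols above can be made concrete with a small simulation. This is only a sketch: the population below is made up (randomly generated heights), and the variable names simply mirror the slide's symbols.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 heights in inches (made-up values, not real data).
population = [random.gauss(69.0, 3.0) for _ in range(100_000)]

# Parameters describe the whole population.
N = len(population)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD divides by N

# Statistics describe one random sample drawn from that population.
sample = random.sample(population, 100)
n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample SD divides by n - 1

print(f"Parameters: N={N}, mu={mu:.2f}, sigma={sigma:.2f}")
print(f"Statistics: n={n}, x_bar={x_bar:.2f}, s={s:.2f}")
```

In real research only the second line of output is available; the parameters are what inference tries to estimate.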

Statistical Inference

  • Determining a specific population value to better understand the population

  • Determining if there is a relationship between variables in a population

  • A population value = a population parameter = a "true" value

Examples of Statistical Inference

  • "The average Netflix subscriber spends 2 hours a day on the service."

  • "On average, people live 78.93 years."

  • "The average income of former UH Manoa students who are working and no longer in school is $49,300."

Statistical Inference and Sampling

  • Researchers use sophisticated sampling procedures and collect data from the sample

  • Researchers infer, with a stated level of confidence (commonly 95%), the characteristics of a given population from the characteristics of a representative sample

  • Concept: Statistical inference

Example: Netflix hours (how many hours, on average, do Netflix subscribers spend a day on the service?)

  • Population: Netflix subscribers (182.8 million)

  • A population value or parameter: 2 hours (only Netflix knows this)

  • How can we find out for ourselves?

Statistical Inference and Sampling (continued)

  • Our sample should be representative of the population of interest to enable generalization

  • In reality, there is no such thing as a perfectly representative sample

  • There will almost always be some difference between a sample statistic and the “true” value that would have been derived from the entire population

Statistical Inference and Sampling (error and solutions)

  • Example context: Average hours spent on Netflix with a sample size of 400 participants

  • Potential sources of error:

    • Measurement error

    • Respondent error

    • Researcher error

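  • How to reduce random (sampling) error: draw multiple smaller samples, calculate the mean of each, and take the average of those means

  • Averaging the means of k independent samples of the same size reduces the standard error by a factor of √k (for example, 4 samples cut it in half); systematic errors such as a biased measure do not average out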

Example: four samples of 100 each

  • Take a 100-sample and calculate the mean

    • Sample A: 115 minutes

  • Take another 100-sample, calculate the mean

    • Sample B: 128 minutes

  • Take another 100-sample, calculate the mean

    • Sample C: 100 minutes

  • Take another 100-sample, calculate the mean

    • Sample D: 145 minutes

  • Incorporate results from these four samples

  • Take the average of the 4 means

  • Average:

    • (115 + 128 + 100 + 145) / 4 = 122 minutes
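The arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the four sample means given in the example; the dictionary name is just an illustrative choice.

```python
import statistics

# Sample means (in minutes) from the four samples of 100 in the example.
sample_means = {"A": 115, "B": 128, "C": 100, "D": 145}

mean_of_means = statistics.fmean(sample_means.values())
print(mean_of_means)  # 122.0
```

Because all four samples are the same size, this mean of means equals the mean of all 400 participants pooled together.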

In Real Research

  • The greater the number of samples (and participants) we use, the closer the sample mean tends to be to the true population mean

  • In real research, we can’t randomly sample as many times as we want

  • This leads to a certain degree of error

  • Statisticians have identified conditions under which sample means can be trusted:

    • Representative sample

    • Large sample size

    • If these conditions are met, the mean of the sample means ≈ the population mean
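The claim that the mean of the sample means approaches the population mean can be illustrated with a simulation. This is a sketch under stated assumptions: the population is made up (randomly generated, right-skewed "viewing minutes"), and `mean_of_sample_means` is a hypothetical helper written for this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of daily viewing minutes (made-up, right-skewed values).
population = [random.expovariate(1 / 120) for _ in range(50_000)]
mu = statistics.fmean(population)

def mean_of_sample_means(num_samples, sample_size):
    """Draw num_samples simple random samples and average their means."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.fmean(means)

for k in (1, 10, 100, 1000):
    estimate = mean_of_sample_means(k, 100)
    print(f"{k:>4} samples of 100: {estimate:7.2f} (population mean {mu:.2f})")
```

With more samples the estimate typically settles closer to the population mean, even though the population itself is not bell-shaped.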

In Real Research (continued)

  • There is almost always a degree of difference between a sample statistic and a true population value

  • Researchers DO NOT deterministically report findings from a sample

  • Instead, they report findings with a certain degree of confidence

  • Concept: Confidence interval

Confidence Intervals

  • A range computed using sample statistics to estimate an unknown population parameter with a stated level of confidence

  • A range that is likely to contain the population parameter (the true value), at a stated confidence level (usually 95%)

  • Lower end = x̄ − 1.96 × SE

  • Upper end = x̄ + 1.96 × SE
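The two formulas above can be sketched as a short function. Two assumptions go beyond the slide: the standard error is taken to be SE = s / √n (the usual standard error of the mean), and the data values are hypothetical. The 1.96 multiplier is a large-sample approximation, so for a sample this small a t-based multiplier would normally be preferred.

```python
import math
import statistics

def confidence_interval_95(data):
    """Return (lower, upper) for the mean using x_bar ± 1.96 * SE."""
    n = len(data)
    x_bar = statistics.fmean(data)
    s = statistics.stdev(data)   # sample standard deviation
    se = s / math.sqrt(n)        # assumed definition: standard error of the mean
    margin = 1.96 * se
    return x_bar - margin, x_bar + margin

# Hypothetical daily viewing times in minutes (illustrative values only).
minutes = [115, 128, 100, 145, 90, 132, 120, 110, 150, 105]
lower, upper = confidence_interval_95(minutes)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

The interval is always centered on the sample mean, and it narrows as n grows because SE shrinks.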

Confidence Interval Interpretation

  • "We are 95% confident that the confidence interval contains the population parameter (the true value)."

  • "If we drew 100 samples from the same population, about 95 of them (95%) would produce confidence intervals that contain the population parameter (the true value)."
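The repeated-sampling interpretation can be checked by simulation. This is a sketch, not a proof: the population parameters below are hypothetical, and SE is assumed to be s / √n.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD = 120.0, 40.0   # hypothetical population parameters
SAMPLE_SIZE, NUM_SAMPLES = 100, 2000

hits = 0
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Count the interval as a hit when it contains the true mean.
    if x_bar - 1.96 * se <= TRUE_MEAN <= x_bar + 1.96 * se:
        hits += 1

coverage = hits / NUM_SAMPLES
print(f"{coverage:.1%} of intervals contain the true mean")
```

The share of intervals that capture the true mean should land near 95%. Any single interval either contains the true value or does not; the 95% describes the procedure across many samples.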

Example 1 (UH height)

  • Researchers collected data from a random sample of 525 UH students

  • Mean height = 67 inches

  • 95% CI: [66.62, 67.38]

  • Population: All UH students

  • Sample: 525 UH students (random sample)

  • Interpretation: "We are 95% confident that the mean height of all UH students (population parameter, the true value) is between 66.62 inches and 67.38 inches."
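Working backward from the reported interval shows what it implies. This sketch assumes the interval was built as x̄ ± 1.96 × SE; `margin_and_se` is a hypothetical helper, and the slide does not report SE directly.

```python
def margin_and_se(lower, upper, z=1.96):
    """Recover the margin of error and standard error from a symmetric CI."""
    margin = (upper - lower) / 2
    return margin, margin / z

margin, se = margin_and_se(66.62, 67.38)
print(f"margin = {margin:.2f} inches, SE = {se:.3f} inches")
# margin = 0.38 inches, SE = 0.194 inches
```

The midpoint of the interval, (66.62 + 67.38) / 2 = 67, matches the reported sample mean, as it should for an interval of this form.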

Example 2 (MSU IQ)

  • Random sample of 300 students at MSU

  • Mean IQ = 102.2

  • 95% CI: [98.72, 108.38]

  • Population: All MSU students

  • Sample: 300 MSU students (random sample)

  • Interpretation:

    • "We are 95% confident that the mean IQ score of all students at Michigan State University is between 98.72 and 108.38."

A Relationship between Variables

  • Statistical inference is also used to determine whether variables are related to each other in a population

  • Examples:

    • Aggression among high school students is highly associated with playing violent video games

    • Anti-smoking ads depicting serious health consequences significantly decreased smoking cessation intention

  • Often used for correlation studies

Example: Relationship between social media use and loneliness (UH Manoa)

  • Sample: 325 UH students surveyed in Spring 2021

  • Found that social media use and loneliness are positively correlated, with Pearson’s correlation coefficient r = 0.67

  • 95% confidence interval for the correlation:

    • [0.54, 0.71]

  • Interpretation guidelines:

    • 0.5 ≤ |r| < 0.7 = Moderate correlation

    • |r| ≥ 0.7 = Strong correlation
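For readers who want to see where a value of r comes from, here is a minimal sketch of Pearson's correlation coefficient computed from paired scores. The survey responses below are hypothetical and are not the UH Manoa data; `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y scaled by their separate variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

# Hypothetical responses: daily social media hours and a loneliness score.
hours = [1, 2, 2, 3, 4, 5, 5, 6]
loneliness = [2, 1, 3, 3, 5, 4, 6, 5]

r = pearson_r(hours, loneliness)
print(f"r = {r:.2f}")
```

The result always falls between -1 and 1, and its sign gives the direction of the relationship.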

Interpretation of the UH Manoa example

  • We are 95% confident that the correlation between social media use and loneliness in the population of all UH Manoa students is between 0.54 and 0.71

  • This provides evidence of a moderate to strong relationship between social media use and loneliness among UH Manoa students; generalizing to all college students would require a sample drawn from that broader population

To Recap

  • Sampling: Population → Sample

  • Collect data → Calculate sample statistics (e.g., sample mean)

  • Infer population parameters (e.g., population mean)

  • Statistical inference: drawing conclusions about a population from sample data with a stated level of confidence

  • Core concepts: population, sample, parameter, statistic, sampling error, confidence intervals, and relationships between variables (correlation)