Statistical Inference: Population, Sampling, and Confidence Intervals
Review: Key components of a study
A research question or hypothesis
Independent and dependent variables
Measurement for the variables
Data collection
For what? Statistical inference (to answer the research question)
On whom? (next lecture) = Sampling
Preview
Population vs. Sample
Statistical Inference
Sampling
Confidence Intervals
Population vs. Sample
Population
The entire pool from which a sample is drawn
The individuals or group the sample is supposed to represent
In general, large numbers of people
Sample
The individuals researchers actually collect data from
A portion of the population
In general, a much smaller number of individuals
Population vs. Sample (Parameters vs. Statistics)
Parameter
A characteristic of the population
Ex: population mean
Example: Average height for an adult male in the U.S.
Sample statistic
A characteristic of a sample calculated from data in the sample
Ex: sample mean
Example: Average height for an adult male in the sample of 100 U.S. male participants
Statistical inference
Generalizing from a sample to a population with a calculated degree of certainty
Population vs. Sample (Symbols)
Population → Parameter(s): N, μ, σ
Sample → Statistic(s): n, x̄, s
Note: A parameter is the true population value; statistics are calculated from the sample data
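The sample-side symbols can be computed directly from data. The sketch below uses made-up height values (an assumption for illustration, not data from the lecture) to show n, x̄, and s. The population-side values N, μ, and σ are usually unknown, which is why they are estimated from the sample.

```python
from statistics import mean, stdev

# Hypothetical sample: heights in inches for 8 participants (made-up values)
heights = [66, 68, 70, 65, 67, 69, 71, 68]

n = len(heights)        # sample size n
x_bar = mean(heights)   # sample mean x̄
s = stdev(heights)      # sample standard deviation s (divides by n - 1)

print(f"n = {n}, sample mean = {x_bar}, sample SD = {s:.2f}")
```

Note that `stdev` divides by n - 1, which is the usual choice when the data are a sample rather than the whole population.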
Statistical Inference
Determining a specific population value to better understand the population
Determining if there is a relationship between variables in a population
A population value = a population parameter = a "true" value
Examples of Statistical Inference
"The average Netflix subscriber spends 2 hours a day on the service."
"On average, people live 78.93 years."
"The average income of former UH Manoa students who are working and no longer in school is $49,300."
Statistical Inference and Sampling
Researchers use sophisticated sampling procedures and collect data from the sample
Researchers infer, with a stated level of confidence (commonly 95%), the characteristics of a given population from the characteristics of a representative sample
Concept: Statistical inference
Example: Netflix hours (how many hours, on average, do Netflix subscribers spend a day on the service?)
Population: Netflix subscribers (182.8 million)
A population value or parameter: 2 hours (only Netflix knows this)
How can we find out for ourselves?
Statistical Inference and Sampling (continued)
Our sample should be representative of the population of interest to enable generalization
In reality, there is no such thing as a perfectly representative sample
There will ALWAYS be a degree of difference between a sample statistic and a “true” value that would have been derived from the entire population
Statistical Inference and Sampling (error and solutions)
Example context: Average hours spent on Netflix with a sample size of 400 participants
Potential sources of error:
Measurement error
Respondent error
Researcher error
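How to reduce random error: draw multiple samples, calculate the mean of each one, and take the average of those means
Averaging over multiple samples lets random errors partly cancel out; the error shrinks roughly with the square root of the total number of observations, not in direct proportion to the number of samples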
Example: four samples of 100 each
Draw a sample of 100 and calculate the mean
Sample A: 115 minutes
Draw another sample of 100 and calculate the mean
Sample B: 128 minutes
Draw another sample of 100 and calculate the mean
Sample C: 100 minutes
Draw another sample of 100 and calculate the mean
Sample D: 145 minutes
Incorporate results from these four samples
Take the average of the 4 means
Average:
\frac{115 + 128 + 100 + 145}{4} = 122 \text{ minutes}
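The arithmetic for the four samples can be checked in a few lines. This is a minimal sketch: the four means come from the example, while the name `mean_of_means` is only an illustrative choice. Because the four samples are the same size (100 each), the average of the four means equals the mean of all 400 observations pooled together.

```python
from statistics import mean

# Sample means (in minutes) reported for the four samples of 100
sample_means = {"A": 115, "B": 128, "C": 100, "D": 145}

def mean_of_means(means):
    """Average several sample means (matches the pooled mean only for equal-size samples)."""
    return mean(list(means))

overall = mean_of_means(sample_means.values())
print(f"Mean of the four sample means: {overall} minutes")
```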
In Real Research
The greater the number of samples (and participants) we use, the closer the sample mean tends to be to the true population mean
In real research, we can’t randomly sample as many times as we want
This leads to a certain degree of error
Statistical theory (the law of large numbers and the central limit theorem) points to two conditions:
Representative sample
Large sample size
If these conditions are met, the mean of the sample means ≈ the population mean
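A small simulation can illustrate the claim above. It is a sketch under stated assumptions: the population is simulated (120 minutes on average, with a spread of 45, both invented for this example), and the names `population` and `spread_of_sample_means` are hypothetical. It draws repeated random samples of different sizes and compares how far the sample means scatter around the population mean.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily viewing minutes for 100,000 subscribers (simulated)
population = [max(0.0, rng.gauss(120, 45)) for _ in range(100_000)]
mu = fmean(population)  # the "true" value, known here only because we simulated it

def spread_of_sample_means(sample_size, repetitions=300):
    """Draw repeated random samples; return (mean of sample means, SD of sample means)."""
    means = [fmean(rng.sample(population, sample_size)) for _ in range(repetitions)]
    return fmean(means), pstdev(means)

results = {size: spread_of_sample_means(size) for size in (25, 100, 400)}
for size, (center, spread) in results.items():
    print(f"n={size:>3}: mean of sample means={center:.1f}, spread={spread:.2f}, mu={mu:.1f}")
```

With random sampling, the mean of the sample means stays close to the population mean at every sample size, while the spread of individual sample means narrows as the sample size grows.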
In Real Research (continued)
There is almost always a degree of difference between a sample statistic and a true population value
Researchers DO NOT report findings from a sample as if they were exact population values
Instead, they report findings with a certain degree of confidence
Concept: Confidence interval
Confidence Intervals
A range computed using sample statistics to estimate an unknown population parameter with a stated level of confidence
A range within which the population parameter (the true value) is likely to fall, at a stated confidence level (usually 95%)
Lower end = \bar{x} - 1.96 \times SE
Upper end = \bar{x} + 1.96 \times SE
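The two bounds can be sketched in code. The slide does not define SE, so this example assumes it is the standard error of the mean, estimated as s / sqrt(n). The function names and population values are hypothetical. A short simulation then illustrates the long-run meaning of the 95% level: across many repeated samples, roughly 95% of intervals built this way contain the true mean.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def ci95(sample):
    """Return (lower, upper): sample mean ± 1.96 × SE, with SE estimated as s / sqrt(n)."""
    x_bar = fmean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return x_bar - 1.96 * se, x_bar + 1.96 * se

# Hypothetical population values, assumed only for this simulation
TRUE_MEAN, TRUE_SD = 120.0, 45.0
rng = random.Random(7)  # fixed seed for reproducibility

def coverage(num_samples=1000, sample_size=100):
    """Share of simulated samples whose 95% CI contains TRUE_MEAN."""
    hits = 0
    for _ in range(num_samples):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
        lower, upper = ci95(sample)
        hits += lower <= TRUE_MEAN <= upper
    return hits / num_samples

share = coverage()
print(f"Share of intervals containing the true mean: {share:.3f}")
```

The multiplier 1.96 is a large-sample value from the normal distribution; for small samples, a slightly larger multiplier from the t distribution is the more common choice.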
Confidence Interval Interpretation
"We are 95% confident that the confidence interval contains the population parameter (the true value)."
"If we drew many samples from the same population, about 95 out of 100 (95%) would produce confidence intervals that contain the population parameter (the true value)."
Example 1 (UH height)
Researchers collected data from a random sample of 525 UH students
Mean height = 67 inches
95% CI: [66.62, 67.38]
Population: All UH students
Sample: 525 UH students (random sample)
Interpretation: "We are 95% confident that the mean height of all UH students (population parameter, the true value) is between 66.62 inches and 67.38 inches."
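The reported interval can be worked backward to see what it implies. This sketch assumes the interval was built as the sample mean ± 1.96 × SE, with SE = s / sqrt(n); the slide does not say how it was computed. The implied standard error and standard deviation are derived values, not figures reported in the example, and the function name is hypothetical.

```python
from math import sqrt

def margin_and_se(lower, upper, z=1.96):
    """Recover the margin of error and standard error from a symmetric interval."""
    margin = (upper - lower) / 2
    return margin, margin / z

lower, upper, n = 66.62, 67.38, 525   # values from the UH height example
midpoint = (lower + upper) / 2        # should match the reported mean of 67
margin, se = margin_and_se(lower, upper)
implied_s = se * sqrt(n)              # implied sample SD under the stated assumption

print(f"midpoint={midpoint:.2f}, margin={margin:.2f}, SE={se:.3f}, implied s={implied_s:.2f}")
```

The midpoint matches the reported mean of 67 inches, and the implied standard deviation of roughly 4.4 inches is a plausible spread for heights, so the example is consistent with the formula.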
Example 2 (MSU IQ)
Random sample of 300 students at MSU
Mean IQ = 102.2
95% CI: [98.72, 108.38]
Population: All MSU students
Sample: 300 MSU students (random sample)
Interpretation:
"We are 95% confident that the mean IQ score of all students at Michigan State University is between 98.72 and 108.38."
A Relationship between Variables
Statistical inference is also used to determine whether variables are related in a population (correlation)
Examples:
Aggression among high school students is highly associated with playing violent video games
Anti-smoking ads depicting serious health consequences significantly increased smoking cessation intention
Often used for correlation studies
Example: Relationship between social media use and loneliness (UH Manoa)
Sample: 325 UH students surveyed in Spring 2021
Found that social media use and loneliness are positively correlated, with Pearson’s correlation coefficient r = 0.67
95% confidence interval for the correlation:
[0.54, 0.71]
Interpretation guidelines:
0.5 ≤ |r| < 0.7 = Moderate correlation
|r| ≥ 0.7 = Strong correlation
Interpretation of the UH Manoa example
We are 95% confident that the correlation between social media use and loneliness in the population of all UH Manoa students is between 0.54 and 0.71
This provides evidence of a moderate to strong relationship between social media use and loneliness among UH Manoa students
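Pearson's r and the interpretation guidelines can be sketched as follows. The survey data are not available, so the two lists below are made-up values used only to exercise the function. The names `pearson_r` and `describe_strength` are hypothetical, and the label for values under 0.5 is an assumption because the guidelines do not name that range.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def describe_strength(r):
    """Apply the guidelines: 0.5 and above is moderate, 0.7 and above is strong."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weaker than moderate"  # assumed label; not covered by the guidelines

# Hypothetical data: daily social media hours and loneliness scores (made-up values)
hours = [1, 2, 3, 4, 5, 6]
loneliness = [2, 1, 4, 3, 6, 5]
r = pearson_r(hours, loneliness)
print(f"r = {r:.2f} ({describe_strength(r)})")

# Labels for the reported estimate and the two ends of its 95% CI
for value in (0.54, 0.67, 0.71):
    print(value, describe_strength(value))
```

Applying the labels to both ends of the interval is what supports the "moderate to strong" wording: the lower bound falls in the moderate range and the upper bound in the strong range.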
To Recap
Sampling: Population → Sample
Collect data → Calculate sample statistics (e.g., sample mean)
Infer population parameters (e.g., population mean)
Statistical inference: drawing conclusions about a population from sample data with a stated level of confidence
Core concepts: population, sample, parameter, statistic, sampling error, confidence intervals, and relationships between variables (correlation)