Statistical Inference: Population, Sampling, and Confidence Intervals

Review: Key components of a study

A research question or hypothesis
Independent and dependent variables
Measurement for the variables
Data collection
For what? Statistical inference (to answer the research question)
On whom? (next lecture) = Sampling

Preview

Population vs. Sample
Statistical Inference
Sampling
Confidence Intervals

Population vs. Sample

Population
- The entire pool from which a sample is drawn
- The individuals or group the sample is supposed to represent
- In general, large numbers of people
Sample
- The individuals researchers actually collect data from
- A portion of the population
- In general, a much smaller number of individuals

Population vs. Sample (Parameters vs. Statistics)

Parameter
- A characteristic of the population
- Ex: population mean
- Example: Average height of all adult males in the U.S.
Sample statistic
- A characteristic of a sample calculated from data in the sample
- Ex: sample mean
- Example: Average height of the 100 U.S. adult males in a sample
Statistical inference
- Generalizing from a sample to a population with a calculated degree of certainty

Population vs. Sample (Symbols)

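Population → Parameter(s): N (population size), μ (population mean), σ (population standard deviation)
Sample → Statistic(s): n (sample size), x̄ (sample mean), s (sample standard deviation)
Note: A parameter is the true population value; a statistic is calculated from the sample data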
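
To make the symbols concrete, here is a minimal Python sketch. It assumes a simulated population of heights; the population size, the mean of 69 inches, and the standard deviation of 3 inches are made-up values for illustration, not figures from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (inches) of 100,000 adults.
population = [random.gauss(69.0, 3.0) for _ in range(100_000)]

# Parameters: computed from the entire population (rarely possible in practice).
N = len(population)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD, divides by N

# Statistics: computed from a simple random sample of 100.
sample = random.sample(population, 100)
n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample SD, divides by n - 1

print(f"Parameters: N={N}, mu={mu:.2f}, sigma={sigma:.2f}")
print(f"Statistics: n={n}, x_bar={x_bar:.2f}, s={s:.2f}")
```

The sample values land near the population values but do not match them exactly, which is the gap statistical inference has to account for.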

Statistical Inference

Determining a specific population value to better understand the population
Determining if there is a relationship between variables in a population
A population value = a population parameter = a "true" value

Examples of Statistical Inference

"The average Netflix subscriber spends 2 hours a day on the service."
"On average, people live 78.93 years."
"The average income of former UH Manoa students who are working and no longer in school is $49,300."

Statistical Inference and Sampling

Researchers use sophisticated sampling procedures and collect data from the sample
Researchers infer, with a stated level of confidence (commonly 95%), the characteristics of a given population from the characteristics of a representative sample
Concept: Statistical inference

Example: Netflix hours (how many hours, on average, do Netflix subscribers spend a day on the service?)

Population: Netflix subscribers (182.8 million)
A population value or parameter: 2 hours (only Netflix knows this)
How can we find out for ourselves?

Statistical Inference and Sampling (continued)

Our sample should be representative of the population of interest to enable generalization
In reality, there is no such thing as a perfectly representative sample
There will ALWAYS be a degree of difference between a sample statistic and a “true” value that would have been derived from the entire population

Statistical Inference and Sampling (error and solutions)

Example context: Average hours spent on Netflix with a sample size of 400 participants
Potential sources of error:
- Measurement error
- Respondent error
- Researcher error
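How to reduce error: draw multiple samples, calculate the mean of each, and take the average of those means
Averaging across samples lets random errors partly cancel out; the standard error of the combined estimate shrinks with the square root of the total number of observations (four equal samples cut random error roughly in half, not to one quarter)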

Example: four samples of 100 each

Take a sample of 100 and calculate the mean
- Sample A: 115 minutes
Take another sample of 100 and calculate the mean
- Sample B: 128 minutes
Take another sample of 100 and calculate the mean
- Sample C: 100 minutes
Take another sample of 100 and calculate the mean
- Sample D: 145 minutes
Incorporate results from these four samples
Take the average of the 4 means
Average:
- \frac{115 + 128 + 100 + 145}{4} = 122 minutes
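
The same arithmetic as a short Python sketch, using the four sample means from the example (the dictionary name is only illustrative):

```python
import statistics

# Sample means (minutes per day) from the four samples of 100 in the example.
sample_means = {"A": 115, "B": 128, "C": 100, "D": 145}

# Because every sample has the same size, the mean of the four means equals
# the mean of all 400 observations pooled together.
mean_of_means = statistics.fmean(list(sample_means.values()))
print(mean_of_means)  # 122.0
```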

In Real Research

The greater the number of samples (and participants) we use, the closer the sample mean tends to be to the true population mean
In real research, we can’t randomly sample as many times as we want
This leads to a certain degree of error
Statisticians have identified conditions under which sample estimates can be trusted:
- Representative sample
- Large sample size
- If these conditions are met, the mean of the sample means ≈ the population mean
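
A small simulation can illustrate this claim. The sketch below assumes a hypothetical, skewed population of daily viewing times with a mean near 120 minutes; the distribution, the sample sizes, and the number of repeated samples are illustrative choices, not values from the lecture.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of daily viewing times (minutes). The true mean is
# known here only because we generate the population ourselves.
population = [random.expovariate(1 / 120) for _ in range(200_000)]
mu = statistics.fmean(population)

def mean_of_sample_means(sample_size, num_samples):
    """Draw repeated random samples; return the mean and SD of their means."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)

results = {}
for size in (25, 100, 400):
    results[size] = mean_of_sample_means(size, 500)
    center, spread = results[size]
    print(f"n={size}: mean of sample means={center:.1f}, spread={spread:.1f}, mu={mu:.1f}")
```

In this setup the mean of the sample means stays close to the population mean at every sample size, while the spread of individual sample means narrows as the sample size grows.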

In Real Research (continued)

There is almost always a degree of difference between a sample statistic and a true population value
Researchers DO NOT report a sample statistic as if it were the exact population value
Instead, they report findings with a certain degree of confidence
Concept: Confidence interval

Confidence Intervals

A range computed using sample statistics to estimate an unknown population parameter with a stated level of confidence
Informally, a range that is likely to contain the population parameter (the true value) at a stated confidence level (usually 95%)
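Lower end = x̄ - 1.96 \times SE
Upper end = x̄ + 1.96 \times SE
Here SE is the standard error of the mean, SE = s / \sqrt{n}, and 1.96 is the multiplier for a 95% confidence level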
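
The two formulas can be sketched in Python as follows, taking SE to be the standard error of the mean, s / sqrt(n). The function name `mean_ci_95` and the simulated viewing-hours data are hypothetical, and the fixed 1.96 multiplier assumes a reasonably large sample (for small samples a t-based multiplier is usually used instead).

```python
import math
import random
import statistics

def mean_ci_95(data):
    """Return (lower, upper) for an approximate 95% CI of the mean."""
    n = len(data)
    x_bar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)  # SE = s / sqrt(n)
    margin = 1.96 * se
    return x_bar - margin, x_bar + margin

# Hypothetical sample: daily viewing hours reported by 400 participants.
random.seed(4)
hours = [random.gauss(2.0, 0.8) for _ in range(400)]
low, high = mean_ci_95(hours)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```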

Confidence Interval Interpretation

"We are 95% confident that the confidence interval contains the population parameter (the true value)."
"If we drew many samples from the same population, about 95 out of every 100 (95%) would produce confidence intervals that contain the population parameter (the true value)."

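The repeated-sampling reading of a 95% confidence interval can be checked by simulation. This sketch assumes a hypothetical normal population with a known mean and standard deviation (values chosen only for illustration) and counts how often the interval x̄ ± 1.96 × SE captures the true mean.

```python
import math
import random
import statistics

random.seed(2)

TRUE_MEAN, TRUE_SD = 120.0, 60.0       # hypothetical population values
SAMPLE_SIZE, NUM_SAMPLES = 400, 2_000  # illustrative choices

hits = 0
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Count the interval as a hit when it contains the true mean.
    if x_bar - 1.96 * se <= TRUE_MEAN <= x_bar + 1.96 * se:
        hits += 1

coverage = hits / NUM_SAMPLES
print(f"{coverage:.1%} of the intervals contain the true mean")
```

Any single interval either contains the true mean or does not; the 95% describes how often the procedure succeeds across many samples.
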
Example 1 (UH height)

Researchers collected data from a random sample of 525 UH students
Mean height = 67 inches
95% CI: [66.62, 67.38]
Population: All UH students
Sample: 525 UH students (random sample)
Interpretation: "We are 95% confident that the mean height of all UH students (population parameter, the true value) is between 66.62 inches and 67.38 inches."
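
As a quick consistency check on this interval, the sketch below works backward from the reported bounds. It assumes the interval was built as x̄ ± 1.96 × SE with SE = s / sqrt(n); the implied standard deviation is an inference under that assumption, not a figure reported in the example.

```python
import math

n = 525
lower, upper = 66.62, 67.38

midpoint = (lower + upper) / 2  # should match the reported mean of 67
margin = (upper - lower) / 2    # half-width of the interval
se = margin / 1.96              # implied standard error
s = se * math.sqrt(n)           # implied sample SD, if SE = s / sqrt(n)

print(f"midpoint={midpoint:.2f}, margin={margin:.2f}, SE={se:.3f}, s={s:.2f}")
```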

Example 2 (MSU IQ)

Random sample of 300 students at MSU
Mean IQ = 102.2
95% CI: [98.72, 108.38]
Population: All MSU students
Sample: 300 MSU students (random sample)
Interpretation:
- "We are 95% confident that the mean IQ score of all students at Michigan State University is between 98.72 and 108.38."

A Relationship between Variables

Statistical inference is also used to determine whether variables are related in a population, typically in correlational studies
Examples:
- Aggression among high school students is highly associated with playing violent video games
- Anti-smoking ads depicting serious health consequences significantly decreased smoking cessation intention

Example: Relationship between social media use and loneliness (UH Manoa)

Sample: 325 UH students surveyed in Spring 2021
Found that social media use and loneliness are positively correlated, with Pearson’s correlation coefficient r = 0.67
95% confidence interval for the correlation:
- [0.54, 0.71]
Interpretation guidelines:
- 0.5 ≤ |r| < 0.7 = Moderate correlation
- |r| ≥ 0.7 = Strong correlation

Interpretation of the UH Manoa example

We are 95% confident that the correlation between social media use and loneliness in the population of all UH Manoa students is between 0.54 and 0.71
This provides evidence of a moderate to strong relationship between social media use and loneliness among UH Manoa students
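
The example does not say how its interval was computed. One common approach for a Pearson correlation is the Fisher z-transformation, sketched below as an assumption rather than as the method used in the study. The data are simulated and the helper names are hypothetical, so the output is not expected to reproduce the numbers above.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci_95(r, n):
    """Approximate 95% CI for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

# Simulated (hypothetical) scores for 325 respondents.
random.seed(3)
social_media = [random.gauss(0, 1) for _ in range(325)]
loneliness = [0.7 * x + random.gauss(0, 0.7) for x in social_media]

r = pearson_r(social_media, loneliness)
low, high = fisher_ci_95(r, len(social_media))
print(f"r = {r:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

Because the transformation is nonlinear, the resulting interval is generally not symmetric around r.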

To Recap

Sampling: Population → Sample
Collect data → Calculate sample statistics (e.g., sample mean)
Infer population parameters (e.g., population mean)
Statistical inference: drawing conclusions about a population from sample data with a stated level of confidence
Core concepts: population, sample, parameter, statistic, sampling error, confidence intervals, and relationships between variables (correlation)