Statistics Master Doc

Unit 3: Collecting Data

Measuring Data with Populations

A ‘census’ collects data on every individual in the population. That is difficult to achieve, so data is usually taken from a ‘sample’ instead – a subset of individuals from the population.

However, sampling can lead to ‘sampling bias’ – when some members of a population are systematically more likely to be selected for a sample than others. One form is ‘undercoverage’ – when part of the population has a reduced chance of being included in a sample.

Types of Samples

Stratified Random Sample: Divides the population into homogeneous groups called strata (e.g., grade levels) and randomly selects individuals from each group.
Systematic Sample: Selects individuals at fixed intervals (e.g., every 3rd person).
Voluntary Response Sample: Individuals choose to participate after being invited.
Simple Random Sample (SRS): Every individual, and every possible group of individuals of the sample size, has an equal chance of being selected.
Convenience Sample: Uses individuals who are easy to reach (e.g., surveying people at a mall).
Cluster Sample: The population is divided into groups called "clusters," a random selection of those clusters is chosen, and data is collected from every member of the chosen clusters.
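The three group-based methods above can be sketched in Python. This is only an illustration: the population of 120 students, the `grade` and `homeroom` fields, and the function names are all made up for the example, and the random starting point in the systematic sample is an added assumption.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 120 students, 30 in each of four grade levels,
# split into 12 homerooms of 10 students each.
population = [
    {"id": i, "grade": 9 + i // 30, "homeroom": i // 10} for i in range(120)
]

def stratified_sample(pop, per_stratum):
    """Take a random sample of `per_stratum` students from every grade level."""
    chosen = []
    for grade in sorted({p["grade"] for p in pop}):
        stratum = [p for p in pop if p["grade"] == grade]
        chosen.extend(rng.sample(stratum, per_stratum))
    return chosen

def systematic_sample(pop, k):
    """Pick a random starting point, then take every k-th student."""
    start = rng.randrange(k)
    return pop[start::k]

def cluster_sample(pop, n_clusters):
    """Randomly choose whole homerooms and keep every student in them."""
    homerooms = sorted({p["homeroom"] for p in pop})
    picked = set(rng.sample(homerooms, n_clusters))
    return [p for p in pop if p["homeroom"] in picked]

strat = stratified_sample(population, 5)  # 5 students from each grade
syst = systematic_sample(population, 3)   # every 3rd student
clus = cluster_sample(population, 2)      # all students in 2 homerooms
```

Note the contrast: the stratified sample takes some individuals from every group, while the cluster sample takes every individual from some groups.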

REMEMBER LANGUAGE: Be SPECIFIC → number the individuals 1-100, generate random numbers, select the individuals whose numbers correspond
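The "number, generate, correspond" wording maps directly onto a few lines of Python. This is a minimal sketch: the population of 100 placeholder names and the sample size of 10 are assumptions made for the example.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# NUMBER: give each of 100 hypothetical individuals a unique number 1-100.
names = [f"person_{i}" for i in range(1, 101)]
labels = dict(zip(range(1, 101), names))

# GENERATE: draw 10 different random integers from 1 to 100 (no repeats).
picked_numbers = rng.sample(range(1, 101), 10)

# CORRESPOND: the sample is the individuals whose numbers were generated.
srs = [labels[n] for n in picked_numbers]
```

Sampling without repeats matters here: if the same number could come up twice, fewer than 10 different individuals might be selected.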

Types of Selection Bias

Nonresponse Bias: When individuals chosen for a sample don’t respond
Undercoverage Bias: Segments of the target population are excluded or less represented
Voluntary Response Bias: The sample is composed of volunteers

Example of Selection Bias

Claim: “This college states that 90% of their graduates get a job in their chosen major.”

Population: All graduates. Sample: Graduates who responded to the survey.
Bias Effect: Graduates without jobs may be less likely to respond.
Impact on Estimate: Overestimates the true employment rate.
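A small simulation can show why this kind of nonresponse inflates the estimate. Every number below is invented for illustration (a true rate of 60%, and response rates of 50% for employed graduates versus 10% for the rest); none of it comes from a real college.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 graduates, about 60% employed in their major.
graduates = [rng.random() < 0.60 for _ in range(10_000)]
true_rate = sum(graduates) / len(graduates)

# Assumed response behavior: employed graduates answer 50% of the time,
# graduates without such a job answer only 10% of the time.
def responds(employed):
    return rng.random() < (0.50 if employed else 0.10)

responses = [g for g in graduates if responds(g)]
survey_rate = sum(responses) / len(responses)

print(f"true rate:   {true_rate:.2f}")
print(f"survey rate: {survey_rate:.2f}")
```

Under these assumed rates, the survey figure lands well above the true rate, because the respondents are mostly the graduates who have good news to report.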

Types of Survey Bias

Confusing Wording Bias: When survey questions are confusing or leading
Self-reported response bias: When individuals inaccurately report their own traits

Wald’s Bullet Hole Problem:

A famous example of survivorship bias: a mistake where we only look at the things that survived and ignore the ones that didn’t – in this example, aircraft.

This teaches us that ignoring missing data can lead to bad conclusions. It’s why we need random sampling (SRS) – to consider all cases, not just the ones we can see.
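The effect of looking only at survivors can be sketched with a toy simulation. The three sections and the loss chances are made-up values chosen to make the pattern visible; they are not historical figures.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Assumed (made-up) chance that a plane is lost after a hit in each section.
loss_chance = {"engine": 0.80, "fuselage": 0.10, "wings": 0.10}
sections = list(loss_chance)

all_hits = {s: 0 for s in sections}       # every plane that was hit
observed_hits = {s: 0 for s in sections}  # only planes that made it back

for _ in range(10_000):
    hit = rng.choice(sections)            # hits land evenly across sections
    all_hits[hit] += 1
    if rng.random() >= loss_chance[hit]:  # plane survives and can be inspected
        observed_hits[hit] += 1

engine_share_all = all_hits["engine"] / sum(all_hits.values())
engine_share_seen = observed_hits["engine"] / sum(observed_hits.values())

print(f"engine share of all hits:      {engine_share_all:.2f}")
print(f"engine share of observed hits: {engine_share_seen:.2f}")
```

Engine hits are about a third of all hits in this setup, yet they are rare among the planes that can be inspected, because most planes hit there never return.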

Observational Study

Observational study: a study in which data is collected without imposing any treatments

Retrospective study: examines existing data on individuals
Prospective study: follows individuals to gather future data (over time)
- Neither shows cause and effect, because neither controls for confounding!

Experimental Study

Experiment: a study in which treatments are imposed on subjects.

If well designed, experiments can show cause-effect relationships by controlling for confounding variables (units, variables)

Experimental unit: the object to which a treatment is randomly assigned
Explanatory variable: the variable that is purposely manipulated – also known as the factor.
Treatments: the different levels of the explanatory variable in the experiment.
Response variable: the measured experiment outcome that is compared between treatment groups

Confounding variable: something that potentially affects the results of a study but is not accounted for in the study itself

The Four Principles of Experimental Design

Comparison
Random Assignment
Replication (definition: having many experimental units in each treatment group)
Control (one way to control for confounding factors is to keep conditions identical across groups except for the explanatory variable)

Completely randomized design: An experimental design in which experimental units are assigned to treatments completely at random.

• This is the “SRS” of experiments – the simplest (but still effective) randomized experiment.
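A completely randomized design can be sketched as "shuffle, then split." The 20 unit names and the two treatment labels below are placeholders, and equal group sizes are an assumption of this example.

```python
import random

rng = random.Random(4)  # fixed seed so the sketch is repeatable

# Hypothetical experimental units and treatments.
units = [f"unit_{i}" for i in range(1, 21)]
treatments = ["treatment", "control"]

# Shuffle the units, then deal them out so each group gets the same number.
shuffled = units[:]
rng.shuffle(shuffled)
group_size = len(shuffled) // len(treatments)
assignment = {
    t: shuffled[i * group_size:(i + 1) * group_size]
    for i, t in enumerate(treatments)
}
```

Because chance alone decides which group a unit lands in, no trait of the units can systematically line up with one treatment.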

REMEMBER LANGUAGE: Random sample reduces bias. Random assignment reduces confounding

Variation in Random Assignment: Natural fluctuations that occur when experimental units are randomly assigned to different treatment groups
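To see these fluctuations, one can repeat the random assignment many times on the same units when no treatment has any effect. The response values below are simulated placeholders (normal, mean 50, standard deviation 10), chosen only for illustration.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Hypothetical response values for 20 units when NO treatment has any effect.
responses = [rng.gauss(50, 10) for _ in range(20)]

def difference_after_random_assignment(values):
    """Randomly split the units into two equal groups; return mean A - mean B."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:])

# Repeat the random assignment 1,000 times and record each difference.
differences = [difference_after_random_assignment(responses) for _ in range(1000)]
spread = statistics.stdev(differences)

print(f"typical size of a chance-only difference: {spread:.1f}")
```

The differences center near zero but are rarely exactly zero, so a difference seen in a real experiment has to be judged against this chance-only spread.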

Using Tiers: Shows less overlap → shows variability

Randomized Complete Block Design: Experimental units are first blocked (grouped) by a similar trait that may affect response. Then, units from each block are randomly assigned to treatment

This is the “Stratified sample” of experiments
Helps show that differences are due to the treatment rather than chance variation in random assignment
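Blocking first and randomizing second can be sketched as follows. The `age_group` trait, the 20 units, and the two treatment labels are hypothetical choices for this example.

```python
import random
from collections import defaultdict

rng = random.Random(6)  # fixed seed so the sketch is repeatable

# Hypothetical units, each with a trait that may affect the response.
units = [{"id": i, "age_group": "younger" if i < 12 else "older"} for i in range(20)]
treatments = ["treatment", "control"]

# Step 1: form blocks of units that share the trait.
blocks = defaultdict(list)
for u in units:
    blocks[u["age_group"]].append(u)

# Step 2: randomly assign treatments separately within each block.
assignment = {}
for block_name, members in blocks.items():
    shuffled = members[:]
    rng.shuffle(shuffled)
    for position, u in enumerate(shuffled):
        assignment[u["id"]] = treatments[position % len(treatments)]
```

Each block ends up split evenly between the treatments, so the blocking trait cannot be unbalanced across treatment groups by bad luck.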

Matched Pairs: A type of experimental design where participants are paired up based on similar characteristics (like age or ability level). Within each pair, one person is randomly assigned to one treatment and the other person is assigned to the other treatment. This allows a more controlled comparison between the two treatments by minimizing variability due to other factors.

Examples of Matched Pairs – Studying depression

Difficult to control variables – individuals have different levels of depression, symptoms
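One way to handle differing baseline levels is to pair participants with similar baselines and randomize within each pair. In this sketch the 12 participants, the `baseline` severity scores, and the treatment labels are all invented for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical participants with a baseline severity score used for matching.
participants = [{"id": i, "baseline": rng.randint(10, 40)} for i in range(12)]

# Pair up participants with similar baselines: sort, then take neighbors.
ordered = sorted(participants, key=lambda p: p["baseline"])
pairs = [ordered[i:i + 2] for i in range(0, len(ordered), 2)]

# Within each pair, a coin flip decides who gets which treatment.
assignment = {}
for first, second in pairs:
    if rng.random() < 0.5:
        first, second = second, first
    assignment[first["id"]] = "treatment A"
    assignment[second["id"]] = "treatment B"
```

Comparing outcomes within each pair means the two people being compared started from similar levels, so baseline differences contribute less to the result.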

Placebo Effect: Subjects show a significant benefit from a treatment even though it is inactive

Single-blind study: either the subjects or the researchers are unaware of who receives the active treatment or the placebo
Double-blind study: both the subjects and the researchers are unaware of who receives the active treatment or the placebo
- Prevents confounding or bias in evaluations

Statistical Concepts:

Generalization: Applying study results to a larger population based on the sample’s findings.

Statistical Significance: Results are statistically significant when they are highly unlikely to have occurred due to chance alone – If the probability of obtaining the results by random chance is extremely low, the treatment effect is considered real.
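The idea of "unlikely to have occurred by chance alone" can be checked with a simulation that reshuffles the group labels. This is a sketch of one common approach (a randomization test), not the only way to assess significance, and the response data are made up for the example.

```python
import random

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# Made-up response data for two treatment groups.
treatment = [68, 72, 75, 71, 77, 74, 70, 76]
control = [65, 66, 70, 63, 69, 67, 64, 68]
observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# If the treatment did nothing, group labels are arbitrary, so reshuffle them
# many times and see how often chance alone gives a difference this large.
pooled = treatment + control
n_treat = len(treatment)
n_ctrl = len(control)
trials = 10_000
at_least_as_large = 0
for _ in range(trials):
    rng.shuffle(pooled)
    diff = sum(pooled[:n_treat]) / n_treat - sum(pooled[n_treat:]) / n_ctrl
    if diff >= observed:
        at_least_as_large += 1

proportion_by_chance = at_least_as_large / trials
print(f"observed difference: {observed:.3f}")
print(f"proportion of reshuffles at least as large: {proportion_by_chance:.4f}")
```

With these invented numbers, almost no reshuffle produces a difference as large as the observed one, which is the situation the notes describe as statistically significant.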