STAT Unit 4

4.1

Population - entire group you want to study

Sample - Smaller group taken from the population to study

  • Sample Mean (x̄) - average from sample

  • Population Mean (μ) - average for whole population

Convenience Sample

  • taking data that’s easy to reach

  • asking friends or family or first 10 customers

  • Problem → High Bias - does not represent everyone fairly

Voluntary Response

  • people choose to respond

  • Problem → High Bias - only people with strong opinions tend to answer

Simple Random Sample

  • lowers bias

  • every individual has an equal chance of being chosen

Steps

  1. Label everyone

  2. Use a random number generator or draw names

  3. Select without repeats
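
The three steps above can be sketched in Python. This is only an illustration, not a required method: the function name `simple_random_sample`, the 30-student list, and the seed are assumptions made up for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals, each equally likely to be chosen."""
    rng = random.Random(seed)  # seed is optional; it only makes the run repeatable
    # Step 1: label everyone 1..N
    labels = list(range(1, len(population) + 1))
    # Steps 2-3: random number generator picks n labels with no repeats
    chosen_labels = rng.sample(labels, n)
    return [population[label - 1] for label in chosen_labels]

# Hypothetical population of 30 students
students = [f"student_{i}" for i in range(1, 31)]
srs = simple_random_sample(students, 5, seed=1)
print(srs)
```

`random.sample` never returns the same label twice, which is what "select without repeats" asks for.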

4.2

Stratified Random Sample

  • split population into groups (strata) that share something in common (homogeneous)

  • then randomly sample from each group

  • “Sample some from all groups”

  • reduces variability

  • within each group: label members 1 to N, use a random number generator to pick the needed number of labels with no repeats, and select the individuals matching those labels

Steps

  1. Split population into homogeneous groups

  2. Number individuals within each group (1 to N)

  3. Use a random number generator to select people - NO REPEATS

  4. repeat for each group
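
A minimal sketch of these steps, assuming the strata are already known. The grade-level groups, the sample size of 5 per group, and the name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, n_per_group, seed=None):
    """Randomly sample n_per_group individuals from every stratum."""
    rng = random.Random(seed)
    picked = {}
    for group, members in strata.items():
        # Number individuals 1..N within the group, then draw labels with no repeats
        labels = rng.sample(range(1, len(members) + 1), n_per_group)
        picked[group] = [members[label - 1] for label in labels]
    return picked

# Hypothetical strata: students grouped by grade level
grades = {
    "freshmen": [f"fr_{i}" for i in range(1, 41)],
    "seniors": [f"sr_{i}" for i in range(1, 26)],
}
stratified = stratified_sample(grades, 5, seed=2)
print(stratified)
```

Every group appears in the result, which is the "sample some from all groups" idea.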

4.3

Cluster Sample

  • split population into heterogeneous groups (clusters)

  • randomly select a few entire clusters

  • each cluster is like a mini version of the whole population

  • Ex - choose 1-2 floors and survey every room on those floors

  • “Sample all from some groups”
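
The floor example can be sketched as follows. The building layout (5 floors, 8 rooms each) and the name `cluster_sample` are assumptions for illustration only.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick cluster names at random
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical building: floors 1-5, eight rooms per floor
floors = {
    floor: [f"room_{floor}{r:02d}" for r in range(1, 9)]
    for floor in range(1, 6)
}
surveyed = cluster_sample(floors, 2, seed=3)
print(surveyed)
```

Only two floors are chosen, but every room on those floors is surveyed.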

Systematic Random Sample

  1. Label everyone 1-N

  2. Choose a random starting point

  3. Pick every kth person (every 10th person) until sample is complete
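
A short sketch of the three steps, assuming a population of 100 people and k = 10. The numbers and the name `systematic_sample` are illustrative, and the random start is taken from the first k people, which is one common way to do it.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k individuals, then every kth after it."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # 0-based index of the random starting point
    return population[start::k]   # every kth person from the start onward

people = list(range(1, 101))      # Step 1: label everyone 1-100
every_tenth = systematic_sample(people, 10, seed=4)
print(every_tenth)
```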

Stratified → reduces variability + ensures groups are represented

Cluster → quick + easy to collect data, good when groups are mixed

Systematic → spreads sample evenly through population

4.4

Types of Bias

Undercoverage

  • Some groups in the population are less likely to be chosen or not represented.

  • Calling only landlines (misses people with only cell phones).

Nonresponse

  • People are selected for the sample but don’t respond.

  • Sending an email survey and most people don’t reply.

Response Bias

  • Occurs when the survey design or wording influences responses, or people lie.

  • A firefighter in uniform asks about cutting fire funding; people might lie.

How to Reduce Bias

  • Use random selection to include everyone equally.

  • Follow up with nonrespondents (phone calls, reminders).

  • Keep surveys anonymous to reduce pressure or lying.

  • Use neutral wording (avoid emotional or leading questions).

4.5

  • Explanatory variable - the cause/what you change

  • Response variable - effect/outcome measured

Confounding Variable - hidden variable that affects both

  • related to explanatory variable

  • also affects response variable

  • messes up results

  • Ex → motivation - impacts SAT score (response variable)

Observational Study - researchers just observe - no treatment given

  • can show correlation (association) only, not causation

Experiment - researchers impose treatment

  • shows causation or cause-and-effect

Experimental Units (Subjects) - who the experiment is being done on

Control Group - group that does not get treatment, used as a benchmark to compare the effects of the treatment on the experimental group.

  • consistent results

Placebo - fake treatment given to compare effects

Steps to Design a Good Experiment

  1. Randomly Assign them to groups

  2. One group gets treatment

  3. Other group gets no treatment = control group

  4. Compare results

Example - Does SAT prep (class) improve SAT scores?

4.6

Treatment - what is being done

Random Assignment - randomly putting subjects into groups (random number generator)

  1. Label 2. Randomize 3. Assign

Blinding - subjects and/or researchers don’t know who gets what

  • reduces bias

Single Blind - subjects don’t know - researchers know

Double Blind - subjects and researchers don’t know

  • best way to avoid bias

4 Parts of a Good Experiment (CRRC)

  • Comparison → use 2 or more groups (to compare treatments)

  • Random Assignment → randomly assign subjects to groups (makes groups fair)

  • Replication → enough subjects in each group (reliable)

  • Control → (control group) keep other variables same

Example Problems

What is wrong with this experiment?

  • He only tested for a month, and there is no control group, so we don’t know whether his beard would have grown the same without the oil.

What could be done to improve this experiment?

  • Measure beard growth for a month with no oil (control group)

  • Measure beard for 7 months - replication

What is wrong with this experiment?

  • Athletes were able to choose which treatment they wanted, which makes it more of an observational study (not randomized).

How could you randomly assign the subjects?

Number athletes 1–120 → randomly pick 60 → Strength group

Remaining 60 → Relaxed group
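
The assignment described above can be sketched like this. The seed is an assumption, used only so the split can be reproduced.

```python
import random

rng = random.Random(5)                    # fixed seed only so the split is repeatable
athletes = list(range(1, 121))            # number athletes 1-120
strength_group = sorted(rng.sample(athletes, 60))   # randomly pick 60 → Strength group
chosen = set(strength_group)
relaxed_group = [a for a in athletes if a not in chosen]  # remaining 60 → Relaxed group

print(len(strength_group), len(relaxed_group))
```

Because chance alone decides the groups, the two groups should be similar on average before the treatment starts.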

What is the benefit of using random assignment?

  • we can determine if the strength workout actually worked (caused faster times)

What is wrong with this experiment?

  • The students knew what pill (treatment) they were taking.

What could be done to improve this experiment?

  • no labels on the pills - single blind

  • there’s 1200 cows in total

4.7

Completely Randomized Design

  • all subjects are randomly assigned to treatment groups

  • no grouping/blocking beforehand

When to use

  • when subjects are similar - no obvious differences

Examples

Randomized Block Design

  • subjects are first grouped by a variable that may affect the response variable (block), then randomly assigned to treatments within each block.

  • controls confounding variable - reduces variability

When to use

  • when you think groups differ in a way that affects results

  • age, gender, skill level
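
A minimal sketch of blocking, assuming skill level is the blocking variable. The subject names, block sizes, and the function name `randomized_block_assignment` are hypothetical.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Within each block, shuffle subjects and deal them out across treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # the randomization happens inside the block
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks by skill level, 8 subjects each
skill_blocks = {
    "beginner": [f"b{i}" for i in range(1, 9)],
    "advanced": [f"a{i}" for i in range(1, 9)],
}
block_plan = randomized_block_assignment(skill_blocks, ["treatment", "control"], seed=6)
print(block_plan)
```

Each block ends up with the same number of subjects in each treatment, so skill level cannot be mixed up with the treatment effect.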

Examples

Matched Pairs Design

  • each block has 2 subjects, or each subject gets 2 treatments

  • reduces variability

  • two-sample paired - pair similar subjects → randomly assign one subject in each pair to each treatment

  • repeated measures - one person gets both treatments in random order
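
Both versions can be sketched briefly. The treatment labels "A" and "B", the subject names, and the seed are made up for illustration.

```python
import random

rng = random.Random(7)          # fixed seed only so the sketch is repeatable
treatments = ("A", "B")

# Paired version: each pair holds two similar subjects (hypothetical names)
pairs = [("p1", "p2"), ("p3", "p4"), ("p5", "p6")]
paired_plan = {}
for first, second in pairs:
    order = rng.sample(treatments, 2)   # random order of the two treatments
    paired_plan[first], paired_plan[second] = order

# Repeated measures: every subject gets both treatments, in random order
subjects = ["s1", "s2", "s3", "s4"]
order_plan = {s: rng.sample(treatments, 2) for s in subjects}

print(paired_plan)
print(order_plan)
```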

Examples

4.8

Simulation - a way to model what could happen by random chance

  • repeated random trials to model chance

Statistical Significance

  • a result is statistically significant if it is unlikely to happen by chance alone

  • usual cutoff: less than 5% likely to happen by chance

If a difference is statistically significant

  • we have evidence that the treatment caused the effect, not just random chance

Steps to Test Statistical Significance Using Simulation

| Step | What you do | Example |
|---|---|---|
| 1. Start with experiment data | Compute difference between groups | Ad A − Ad B = 4% |
| 2. Assume no real difference (null hypothesis) | Shuffle or randomly assign outcomes | Randomly mix click results |
| 3. Run many trials | Repeat 50–100 simulations | Each time record difference |
| 4. Compare | See how often a result equal to or bigger than actual result appears by chance | 42 out of 100 simulations ≥ 4% |
| 5. Decide significance | If < 5%, significant | 42% → NOT significant |

Interpreting Results

If p-value < 5%

  • unlikely due to chance

  • statistically significant

  • evidence treatment worked

If p-value ≥ 5%

  • could be due to chance

  • not statistically significant

  • no strong evidence that treatment worked

Example: Yelp A/B Ad Test

| Group | Clicked | Conversion Rate |
|---|---|---|
| Ad A | 21/50 | 42% |
| Ad B | 19/50 | 38% |

Difference = 4%

Simulation shows 42% of random trials gave a difference ≥ 4%.

Conclusion:

  • 42% > 5% → Not statistically significant

  • The difference could easily happen by chance
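
The shuffle simulation for this example can be sketched as below. It uses the click counts from the notes (21/50 and 19/50); the function name `simulate_p_value`, the 10,000 trials, and the seed are assumptions, and this is not necessarily how the original simulation was run.

```python
import random

def simulate_p_value(clicks_a, n_a, clicks_b, n_b, trials=10_000, seed=None):
    """Shuffle click outcomes between groups and count how often chance alone
    gives a difference at least as big as the observed one."""
    rng = random.Random(seed)
    observed = clicks_a / n_a - clicks_b / n_b
    total_clicks = clicks_a + clicks_b
    # 1 = clicked, 0 = did not click; under the null hypothesis the ad does not matter
    outcomes = [1] * total_clicks + [0] * (n_a + n_b - total_clicks)
    at_least_as_big = 0
    for _ in range(trials):
        rng.shuffle(outcomes)                 # randomly mix click results
        diff = sum(outcomes[:n_a]) / n_a - sum(outcomes[n_a:]) / n_b
        if diff >= observed - 1e-12:          # small tolerance for float rounding
            at_least_as_big += 1
    return observed, at_least_as_big / trials

observed_diff, p_value = simulate_p_value(21, 50, 19, 50, seed=8)
print(f"observed difference = {observed_diff:.2f}, simulated p-value = {p_value:.3f}")
```

The simulated proportion should land in the same neighborhood as the figure in the notes, though it changes a little from run to run and with the number of trials.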

Example: John vs Jennifer Study (Gender Bias Experiment)

Measured mean rating difference:

$\bar{x}_{John} - \bar{x}_{Jenn} = 1.26$

Simulation showed 6.7% of random assignments gave a difference ≥ 1.26.

Conclusion:

  • 6.7% > 5% → Not statistically significant

  • The result could be due to random assignment

  • Some evidence of bias, but not strong

| Term | Definition |
|---|---|
| Null Hypothesis ($H_0$) | Assumes no difference or no treatment effect |
| Observed Difference | The difference from the real experiment |
| Simulation | Repeated random trials to model chance |
| p-value | Probability of getting a result at least as extreme as the observed one by chance alone (assuming the null hypothesis is true) |
| Statistically Significant | p-value < 5% |

4.9

When we finish a study, we want to know 2 things

  • can we generalize to a population? - does it apply to more people (needs a random sample, RS)

  • can we show cause-and-effect? - did treatment cause results (needs random assignment, RA)

Examples

| Term | Meaning |
|---|---|
| Random Sample | People are randomly chosen from a population → lets us generalize |
| Random Assignment | People randomly put into groups → allows cause & effect |
| Association | Two things are related but one didn’t necessarily cause the other |
| Causation | One thing caused a change in another (requires experiment) |