Notation
Actual values of the response variable: y
Predicted value of the response variable: y-hat; ŷ
Residual
Positive if point is above the line
Negative if point is below the line
e = y - ŷ
What would you get if you added up all the residuals from the scatterplot?
Zero
Example
y = 220 lbs (actual) and e = -20 lbs (residual), so the predicted value is ŷ = y - e = 240 lbs
How do we choose where the regression line goes?
Regression line minimizes squared residuals
Least Squares Regression Line (LSRL) or line of best fit
Line of best fit formula
ŷ = b0 + b1x
Slope Formula
b1 = r(sy/sx)
How well does the regression line fit the data?
R²
Values are between 0 and +1
Represents the fraction of the variation (specifically the variance) in the response variable that is explained by the regression line
R² close to 1 indicates the model explains a lot
R² = r² (the square of the correlation coefficient)
Practice
Of 50 units of variance in the response variable, the predicted values explain 40 units
R² = 40/50 = 0.80
r = (0.80)^(1/2) ≈ 0.89
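A minimal Python sketch (hypothetical data) tying the formulas above together; the intercept formula b0 = ȳ - b1·x̄ is standard but not stated in these notes:

```python
import statistics as st

x = [2, 4, 5, 7, 9]           # hypothetical explanatory values
y = [65, 80, 84, 95, 110]     # hypothetical response values

r = st.correlation(x, y)              # Pearson r (Python 3.10+)
b1 = r * st.stdev(y) / st.stdev(x)    # slope: b1 = r(sy/sx)
b0 = st.mean(y) - b1 * st.mean(x)     # intercept: line passes through (x̄, ȳ)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"sum of residuals = {sum(residuals):.10f}")  # essentially zero
print(f"R^2 = r^2 = {r**2:.3f}")
```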
Assumptions for regression
Quantitative variable condition
Straight enough condition
No outliers condition
“Does the Plot Thicken?” condition
Residuals must have similar spread
The most common violation is when the residuals get more spread out
(P. 187)
Can check using a residual plot, plotting the residuals on the y-axis and the explanatory variable on the x-axis
Similar spread = homoscedasticity; spread that changes = heteroscedasticity
Regression models are appropriate only when they capture an underlying relationship
Nothing interesting would be left behind
Residuals incorporate everything that is left behind
This means that the residuals should not be interesting
Plotting the residuals against the explanatory variable should show no relationship
(from p. 181)
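A sketch of such a residual plot, assuming matplotlib and made-up residuals:

```python
import matplotlib.pyplot as plt

x = [2, 4, 5, 7, 9]                      # hypothetical explanatory values
residuals = [1.2, -0.8, 0.3, -1.1, 0.4]  # hypothetical residuals from a fit

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")  # healthy residuals scatter evenly around 0
plt.xlabel("explanatory variable")
plt.ylabel("residual")
plt.title("Look for no pattern and constant spread")
plt.show()
```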
Standard Error:
Summarizes typical residual size
Rough estimate of how much the model is “off” by
R² revisited
R² tells us the proportion of variation in the response variable that is explained by the explanatory variable
“Signal”
The leftover unexplained variation is summarized by the residuals
“Noise”
Total variance of the response variable = variance of the predicted values (from the regression model) + variance of the residuals
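A quick numerical check of this decomposition on hypothetical data (the identity holds exactly for a least-squares line):

```python
import statistics as st

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical data

r = st.correlation(x, y)
b1 = r * st.stdev(y) / st.stdev(x)
b0 = st.mean(y) - b1 * st.mean(x)
yhat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

# Var(y) = Var(yhat) + Var(residuals)
print(st.pvariance(y))
print(st.pvariance(yhat) + st.pvariance(resid))
```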
Regression to the mean: when a sample is extreme, the next sample is likely to be closer to the mean
“I trust Spike more than me”
Joe Walch, 2024
R²: The percentage of the variation in the response variable that is explained by the explanatory variable
Total Variance = Unexplained Variance + Explained Variance
To test whether the conditions for a regression are met, use a residual plot
Should see no patterns on the residual plot
Shifting, rescaling, and standardizing variables will not change the correlation coefficient, but will change the slope and intercept
Outliers, leverage and Influence
Outliers:
Large residuals
High leverage
Leverage:
Data points that are far from the mean
Will pull the line closer to themselves, making the residual deceptively small
Influential Point
If omitting a data point results in a model with a very different slope, then the point is influential
Lurking variables can lead to spurious associations
Regression and causation
Regressions do not show causation
Be careful about lurking variables
Be careful when interpreting slopes
INTRO TO PROBABILITY
Random Phenomena:
Situation where we know which outcomes could happen, but do not know which particular outcome will happen
E.G. Coin Flip, drawing cards
Trial:
A single attempt of a random phenomenon
E.G. A single coin flip
Outcome:
Value that is measured, observed, or reported for a trial
Event:
A collection of outcomes
Denoted with bold capital letters
E.G. flipping 2 coins and recording the outcomes
Getting heads and heads is one event
Sample Space:
Collection of all possible outcomes
Denoted with S={...}
E.G. flipping 2 coins
S = {HH, HT, TH, TT}
What is the sample space for flipping 3 coins?
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
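A small sketch generating this sample space with Python's itertools:

```python
from itertools import product

n = 3
S = ["".join(flips) for flips in product("HT", repeat=n)]
print(S)        # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']
print(len(S))   # 2**n = 8 outcomes
```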
Law of large numbers:
The long-run relative frequency of repeated independent events gets closer and closer to the true relative frequency as the number of trials increases
LLN
Sometimes mistakenly referred to as the “Law of Averages” which doesn’t exist
Gambler’s Fallacy
LLN only works over the long run; it doesn’t say anything about the short run
“The house always wins”
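A simulation sketch of the LLN with a fair coin (seed chosen arbitrarily for reproducibility): the running relative frequency of heads drifts toward 0.5 only over the long run.

```python
import random

random.seed(1)  # reproducible run
heads = 0
for flip in range(1, 100_001):
    heads += random.random() < 0.5  # simulate one fair coin flip
    if flip in (10, 100, 1_000, 10_000, 100_000):
        print(f"{flip:>7} flips: relative frequency = {heads / flip:.4f}")
```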
Probability:
Long-run relative frequency of an event’s occurrence
Represented by a number between 0 and 1
Typically in decimal or fraction form
The probability of event A occurring is denoted P(A)
If P(A)=1, then A will occur
If P(A)=0, then A will never occur
If P(A)=0.5, then A will occur half of the time over the long run
Independence:
Two events are independent if learning that one event occurs does not change the probability of the other event occurring
A fan might say that they are 40% sure that their team will win the game. Is that the same type of probability that we have been discussing?
Subjective probability vs. Theoretical probability
Theoretical:
When a probability is based on a mathematical model
Fair coin toss/dice roll, shuffled deck of cards
Subjective:
Probability that represents someone’s personal degree of belief
“I’m 90% sure we will win the game”
TREE DIAGRAM:
5 probability rules
Probability must be between 0 and 1
Probability Assignment rule
The probabilities of all outcomes in the sample space must add up to 1: P(S) = 1
Complement rule
P(Aᶜ) = 1 - P(A)
Complement:
everything that is not in A is the complement of A
Addition rule
For two disjoint events A and B, the probability that one or the other occurs is the sum of the two probabilities
Addition Rule:
For two disjoint events A and B, the probability that one or the other occurs is the sum of their probabilities: P(A ∪ B) = P(A) + P(B)
Disjoint Events:
Events that have no outcomes in common
General Addition Rule:
More flexible than addition rule
Used when events are not disjoint
Formal equation:
P(A ∪ B) = P(A) + P(B) – P(A ⋂ B)
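Worked example (hypothetical numbers): if P(A) = 0.5, P(B) = 0.4, and P(A ∩ B) = 0.2, then P(A ∪ B) = 0.5 + 0.4 - 0.2 = 0.7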
Conditional Probability:
The probability of an event given the occurrence of another event
Probability applied to a conditional distribution
P(B | A)=P(A∩B)/P(A)
Probability of B, conditioned on A
B “given” A
Independent when
P(B | A) = P(B)
∴ A & B are independent
Venn Diagram
Uses both a rectangle and some circles
General Product Rule
P(A⋂B)=P(A) * P(B | A)
→ Disjoint events are required for the simple addition rule; independent events for the simple multiplication rule (worked example below)
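Worked example (hypothetical numbers): if P(A) = 0.5 and P(A ∩ B) = 0.2, then P(B | A) = 0.2/0.5 = 0.4; if also P(B) = 0.4, then P(B | A) = P(B), so A and B are independent, and the product rule checks out: P(A ∩ B) = P(A) · P(B | A) = 0.5 · 0.4 = 0.2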
Random Variable:
Variable whose value depends on a random event
Denoted by ‘X’
Values are denoted by ‘x’
E.G. coin flips, dice rolls, card draws, etc.
Probability Model:
Function that associates a probability with each value of a discrete random variable
Typically in a table form with at least 3 columns
Expected Value:
Theoretical long run average of a random variable
Center of a probability model for the random variable (like the mean)
Denoted by E(X) or μ
Calculated as the sum of the products of values and probabilities: E(X) = Σ x·P(x)
Analogous to the ”break even” point or house edge
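A minimal sketch of E(X) = Σ x·P(x) for a hypothetical raffle (win $100 with probability 0.01, ticket costs $2; numbers are made up):

```python
# Net winnings x mapped to P(x)
model = {100 - 2: 0.01, -2: 0.99}

ev = sum(x * p for x, p in model.items())  # E(X) = sum of x * P(x)
print(f"E(X) = {ev:.2f}")  # -1.00: lose $1 per ticket over the long run
```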
Random:
An outcome is random if we know the possible outcomes but not which value it actually takes
Random outcomes are free of human influence
Don’t use “random” in place of “unexpected”
Examples
“Random” phone call
“Random” actions
Simulation: Using random numbers to represent the outcomes of uncertain events
Trial:
In a simulation, the sequence of events that we are pretending will take place
For each trial, we get a simulated answer to our question (a simulated outcome)
DISCRETE VS CONTINUOUS
Discrete - takes a finite (countable) number of values
Continuous - can take any value within an interval
Bernoulli Trials:
Collection of trials where:
Each has exactly two outcomes: “success” or “failure”
p: probability of success
q: probability of failure
P(“success”) is constant
All trials are independent
Geometric Probability Model:
Used with random variables that count the number of Bernoulli trials until our first success
X = the number of trials until the first success
p = the probability of success
q = the probability of failure
q=1-p
p and q are complements
P(X = x) = q^(x-1) · p
E(X) = 1/p
Note → on the AP exam, 1-p will be shown instead of q
Var(X) = q/p^2
Standard Deviation: SD(X) = √(q/p^2) = √q/p
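A sketch of the geometric formulas, assuming the classic example of rolling a die until the first six:

```python
p = 1 / 6   # probability of success (rolling a six)
q = 1 - p   # probability of failure

def geom_pmf(x: int) -> float:
    """P(X = x) = q^(x-1) * p: first success on trial x."""
    return q ** (x - 1) * p

print(geom_pmf(1), geom_pmf(3))       # first six on roll 1, on roll 3
print("E(X) =", 1 / p)                # about 6 rolls on average
print("SD(X) =", (q / p**2) ** 0.5)   # sqrt(q/p^2)
```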
10% Condition
Remember that one of the requirements for Bernoulli trials is independence, and trials are not independent when we sample without replacement
However, it is still ok to use this model as long as we randomly sample less than 10% of the population
Binomial Model:
Appropriate for a random variable that counts the number of successes in a fixed number of Bernoulli Trials
Example: getting 2 heads with 4 coin flips
Probability of getting x successes in n trials
Details:
x → number of successes
n → number of trials
p → probability of success (1-q)
q → probability of failure (1-p)
P(X = x) = (n! / (x!(n-x)!)) · p^x · q^(n-x)
Var(X) = npq
SD(X) = √(npq)
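A sketch using math.comb for the binomial coefficient, checking the example above (2 heads in 4 flips); the mean formula E(X) = np is standard but not listed in these notes:

```python
from math import comb, sqrt

n, p = 4, 0.5
q = 1 - p

def binom_pmf(x: int) -> float:
    """P(X = x) = C(n, x) * p^x * q^(n-x)."""
    return comb(n, x) * p**x * q ** (n - x)

print(binom_pmf(2))                # 0.375 = 6/16
print("E(X) =", n * p)             # mean: np (standard fact)
print("SD(X) =", sqrt(n * p * q))  # sqrt(npq)
```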
Systematic Sample
Simple Random Sample: the SRS is the gold standard, but often not the most practical
Systematic Sample -
Still has randomness, but not every sample is equally likely
Stratified Random Sample
Population divided into several subpopulations
SRS within each stratum
Used when the subgroups differ and we want to capture those differences proportionally
Cluster Sample
Population is divided into groups or clusters
Each cluster is similar to other clusters
Done for convenience, practicality and/or cost
Multistage Sampling
Combo of multiple methods (usually Stratified and Cluster)
E.G.
For Kauai, we can stratify by moku, then cluster by neighborhood or city block
Surveys:
How are you asking your questions?
Specific questions
Careful with phrasing
See p. 290-291
Pilot Survey:
Small trial run of a survey to test whether the questions and setup are good and clear
What can go wrong?
Voluntary response sample:
A large group is invited to respond and anyone who chooses to respond is counted
Leads to a Voluntary Response Bias:
Example: Very strongly opinionated people might be more likely to volunteer
Convenience sample:
Sample made up of the individuals who are easiest to reach; convenient, but rarely representative of the population
Bad Sampling Coverage:
If the sampling frame excludes people from the population
Undercoverage:
E.G. minorities being undercounted during the census
Nonresponse Bias: bias introduced when a large fraction of those sampled fail to respond to a survey
Response Bias: Anything in a survey that influences responses (like leading questions or unclear phrasing)
The Success/Failure Condition:
A binomial model is approximately normal if we expect at least 10 successes and 10 failures
np ≥ 10
nq ≥ 10
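A one-line check of the condition, with made-up poll numbers:

```python
n, p = 100, 0.3
q = 1 - p
print(n * p >= 10 and n * q >= 10)  # True → normal approximation reasonable
```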
Discrete vs Continuous models
Normal Distribution is continuous
Binomial model is discrete
Statistical Significance: The results of a study are considered statistically significant if there is a very low probability that they happened by chance
Are the results extreme enough to reject a hypothesis?
Sampling Distribution
Distribution of sample means
Shifting data affects center but not spread
E(X+C)=E(X)+C
E(X±Y)=E(X)±E(Y)
SD(X+C)=SD(X) (same standard deviation)
Var(X+C)=Var(X) (same variance)
Var(X±Y)=Var(X)+Var(Y) (if X and Y are independent)
Rescaling data affects center and spread
E(X*C)=E(X)*C
SD(X*C)=SD(X)*|C|
Var(X*C)=Var(X)*C^2 (variance scales with the square of the constant factor)
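A simulation sketch (hypothetical die rolls, arbitrary seed) confirming the shift and rescale rules:

```python
import random
import statistics as st

random.seed(2)
x = [random.randint(1, 6) for _ in range(100_000)]
c = 3

shifted = [xi + c for xi in x]
scaled = [c * xi for xi in x]

print(st.mean(shifted), st.mean(x) + c)            # E(X + c) = E(X) + c
print(st.stdev(shifted), st.stdev(x))              # SD(X + c) = SD(X)
print(st.variance(scaled), c**2 * st.variance(x))  # Var(cX) = c^2 Var(X)
```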