Untitled Flashcards Set

  • Notation

    • Actual values of the response variable: y

    • Predicted value of the response variable: y-hat; ŷ

  • Residual

    • Positive if the point is above the line

    • Negative if point is below the line

    • e = y - ŷ

    • What would you get if you added up all the residuals from the scatterplot

      • Zero

  • Example

    • 220lbs

    • -20lbs

    • Y, e

  • How do we choose where the regression line goes

    • Regression line minimizes the sum of the squared residuals

    • Least Squares Regression Line (LSRL) or line of best fit

  • Line of best fit formula

    • ŷ = b₀ + b₁x

  • Slope Formula

    • b₁ = r(s_y / s_x)
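
As a rough illustration of the residual, line, and slope formulas above, the following sketch fits a least squares line to a small made-up data set and checks that the residuals add up to zero. The data values and the helper name `correlation` are hypothetical, and the intercept formula b₀ = ȳ - b₁x̄ is the standard one, assumed here because these notes do not state it.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

def correlation(xs, ys):
    """Pearson correlation coefficient r (hypothetical helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(x, y)
b1 = r * stdev(y) / stdev(x)   # slope: b1 = r * (s_y / s_x)
b0 = mean(y) - b1 * mean(x)    # intercept (assumed standard formula, not from the notes)

y_hat = [b0 + b1 * xi for xi in x]                  # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e = y - y-hat

print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")
print(f"sum of residuals = {sum(residuals):.10f}")  # zero up to rounding error
```

Positive residuals correspond to points above the fitted line and negative residuals to points below it, which is why they cancel out in the sum.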

  • How well does the regression line fit the data?

    • R²

      • Values are between 0 and +1

      • Represents the fraction of the variation (specifically the variance) in the response variable that is explained by the regression line

        • R² close to 1 indicates the model explains a lot

      • R² = r² (the square of the correlation coefficient r)

  • Practice

    • Predicted value

    • 40 units are explained

    • R² = 40/50 = 0.80

    • r = ±(0.80)^(1/2) ≈ ±0.89 (r takes the same sign as the slope)

  • Assumptions for regression

    • Quantitative variable condition

    • Straight enough condition

    • No outliers condition

    • “Does the Plot Thicken”

      • Residuals must have similar spread

      • Most common violation: the residuals get more spread out (the plot “thickens”)

      • (P. 187)

      • Can check using a residual plot, plotting the residuals on the y-axis and the explanatory variable on the x-axis

  • Homoscedasticity (equal spread) vs. heteroscedasticity (unequal spread)

  • Regression models are appropriate only when they capture an underlying relationship

    • Nothing interesting would be left behind

    • Residuals incorporate everything that is left behind

    • This means that the residuals should not be interesting

    • Plotting the residuals against the explanatory variable should show no relationship

    • (from p. 181)

  • Standard Error:

    • Summarizes typical residual size

    • Rough estimate of how much the model is “off” by

  • R2 revisited

    • R² tells us the proportion of variation in the response variable that is explained by the explanatory variable

      • “Signal”

    • The leftover unexplained variation is summarized by the residuals

      • “Noise”

    • Total variance of the response variable = variance coming from the predicted response variable (from the regression model) + variance coming from the residuals
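
The variance split described above can be checked numerically. This is a minimal sketch on made-up data (the numbers are hypothetical); it fits the least squares line directly and compares the variance of the response with the variance of the predictions plus the variance of the residuals.

```python
from statistics import mean, pvariance

# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Least squares slope and intercept (equivalent to b1 = r * s_y / s_x).
mx, my = mean(x), mean(y)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

y_hat = [b0 + b1 * xi for xi in x]                  # predicted values ("signal")
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # leftovers ("noise")

var_total = pvariance(y)
var_explained = pvariance(y_hat)
var_unexplained = pvariance(residuals)

r_squared = var_explained / var_total   # fraction of variance explained

print(f"total       = {var_total:.4f}")
print(f"explained   = {var_explained:.4f}")
print(f"unexplained = {var_unexplained:.4f}")
print(f"R^2         = {r_squared:.4f}")
```

The explained and unexplained pieces add back up to the total only for a least squares line that includes an intercept, which is the case assumed here.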

  • Regression to the mean: when a sample is extreme, the next sample is likely to be closer to the mean

  • “I trust Spike more than me”

  • Joe Walch, 2024

  • R²: The percentage of the variation in the response variable that is explained by the explanatory variable

  • Total Variance= Unexplained Variance + Explained Variance

  • To test whether the conditions for a regression are met, use a residual plot

    • Should see no patterns on the residual plot

  • Shifting, rescaling, and standardizing variables will not change the correlation coefficient, but they can change the slope and intercept

  • Outliers, leverage and Influence

    • Outliers:

      • Large residuals

      • High leverage

    • Leverage:

      • Data points that are far from the mean

      • Will pull the line closer to themselves, making the residual deceptively small

    • Influential Point

      • If omitting a data point results in a model with a very different slope, then the point is influential
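
One way to see leverage and influence in action is to fit the line with and without a suspect point. The sketch below uses made-up data in which the last point sits far from the mean of x; the data and the helper name `slope` are hypothetical.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope b1 (hypothetical helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Hypothetical data: the first five points lie exactly on y = 2x;
# the last point (x = 20) is far from the mean of x, so it has high leverage.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 5.0]

slope_with = slope(x, y)                # fit using every point
slope_without = slope(x[:-1], y[:-1])   # fit after omitting the last point

print(f"slope with the point:    {slope_with:.3f}")
print(f"slope without the point: {slope_without:.3f}")
```

Because the slope changes drastically when the point is omitted, the point counts as influential under the definition above.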

  • Lurking variables can lead to spurious associations

  • Regression and causation

    • Regressions do not show causation

    • Be careful about lurking variables

    • Be careful when interpreting slopes


INTRO TO PROBABILITY

  • Random Phenomena:

    • Situation where we know which outcomes could happen, but do not know which particular outcome will happen

    • E.G. Coin Flip, drawing cards

  • Trial:

    • A single attempt of a random phenomenon

    • E.G. A single coin flip

  • Outcome:

    • Value that is measured, observed, or reported for a trial

  • Event:

    • A collection of outcomes

    • Denoted with bold capital letters

    • E.G. flipping 2 coins and recording the outcomes

      • Getting heads on both coins is one event

  • Sample Space:

    • Collection of all possible outcomes

    • Denoted with S={...}

    • E.G. flipping 2 coins

      • S = {HH, HT, TH, TT}

  • What is the sample space for flipping 3 coins?

    • S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
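
Sample spaces like the ones above can be listed mechanically. This is a small sketch, with a hypothetical helper name `sample_space`, that enumerates every sequence of heads and tails for a given number of coins.

```python
from itertools import product

def sample_space(n_coins):
    """List every possible outcome of flipping n_coins coins (hypothetical helper)."""
    return ["".join(flips) for flips in product("HT", repeat=n_coins)]

space_2 = sample_space(2)
space_3 = sample_space(3)

print(space_2)
print(space_3)
print(len(space_3))   # each extra coin doubles the number of outcomes
```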

  • Law of large numbers:

    • The long-run relative frequency of repeated independent events gets closer and closer to the true probability as the number of trials increases

    • LLN

    • Sometimes mistakenly referred to as the “Law of Averages” which doesn’t exist

      • Gambler’s Fallacy

    • LLN only works over the long run; doesn’t say anything about the short run

      • “The house always wins”
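
The Law of Large Numbers can be illustrated with a simulation. The sketch below, under the assumption of a fair coin and with a hypothetical function name and checkpoint sizes, tracks the fraction of heads as the number of flips grows.

```python
import random

def running_heads_fraction(n_flips, checkpoints, seed=0):
    """Simulate fair coin flips and record the fraction of heads at each checkpoint."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = 0
    fractions = {}
    for flip in range(1, n_flips + 1):
        if rng.random() < 0.5:  # heads with probability 0.5
            heads += 1
        if flip in checkpoints:
            fractions[flip] = heads / flip
    return fractions

results = running_heads_fraction(100_000, {10, 100, 1_000, 10_000, 100_000})
for n in sorted(results):
    print(f"{n:>7} flips: fraction of heads = {results[n]:.4f}")
```

The early checkpoints can land far from 0.5, which is the short-run behavior the Gambler's Fallacy ignores; only the long-run fraction is expected to settle near 0.5.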

  • Probability:

    • Long-run relative frequency of an event’s occurrence

      • Represented by a number between 0 and 1

      • Typically in decimal or fraction form

    • The probability of event A occurring is denoted P(A)

      • If P(A)=1, then A will always occur

      • If P(A)=0, then A will never occur

      • If P(A)=0.5, then A will occur half of the time over the long run

  • Independence:

    • Two events are independent if learning that one event occurs does not change the probability of the other event occurring

  • A fan might say that they are 40% sure that their team will win the game. Is that the same type of probability that we have been discussing?

    • Subjective probability vs. Theoretical probability

      • Theoretical:

        • When a probability is based on a mathematical model

          • Fair coin toss/dice roll, shuffled deck of cards

      • Subjective:

        • Probability that represents someone’s personal degree of belief

          • “I’m 90% sure we will win the game”

  • TREE DIAGRAM:

  • 5 probability rules

    • Probability must be between 0 and 1

    • Probability Assignment rule

      • All probabilities must add up to 1

    • Complement rule

      • P(Aᶜ) = 1 - P(A)

      • Complement:

        • everything that is not in A is the complement of A

    • Addition rule

      • For two disjoint events A and B, the probability that one or the other occurs is the sum of the two probabilities: P(A ∪ B) = P(A) + P(B)

  • 5. 15.38461538%

  • 6. Addition Rule(?)

  • N.E.I.

  • Addition Rule:

    • For two disjoint events A and B, the probability that one or the other occurs is the sum of the two probabilities: P(A ∪ B) = P(A) + P(B)

  • Disjoint Events:

    • Events that have no outcomes in common

  • General Addition Rule:

    • More flexible than addition rule

    • Used when events are not disjoint

    • Formal equation:

      • P(A ∪ B) = P(A) + P(B) – P(A ⋂ B)
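
To make the general addition rule concrete, here is a sketch that counts cards in a standard 52-card deck. The choice of events (hearts and face cards) and the helper name `prob` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))

def prob(event):
    """Probability of an event when every card is equally likely (hypothetical helper)."""
    return Fraction(len(event), len(deck))

hearts = {card for card in deck if card[1] == "hearts"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}

# These events are NOT disjoint: three cards are both hearts and face cards.
direct = prob(hearts | faces)                                   # P(A or B) by counting
by_rule = prob(hearts) + prob(faces) - prob(hearts & faces)     # general addition rule

print(direct, by_rule)
```

Subtracting P(A ∩ B) removes the three cards that would otherwise be counted twice.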

  • Conditional Probability:

    • The probability of an event given the occurrence of another event

    • Probability applied to a conditional distribution

    • P(B | A)=P(A∩B)/P(A)

      • Probability of B, conditioned on A

        • B “given” A

    • Independent when

      • P(B | A)=P(B)

        • ∴ A and B are independent

  • Venn Diagram

    • Uses both a rectangle and some circles

  • General Product Rule

    • P(A⋂B)=P(A) * P(B | A)
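
The conditional probability formula, the independence check, and the general product rule can all be tried on a deck of cards. This is a sketch under the assumption of a standard 52-card deck; the events and the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))

def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A)."""
    return prob(a & b) / prob(a)

face = {card for card in deck if card[0] in {"J", "Q", "K"}}
heart = {card for card in deck if card[1] == "hearts"}
king = {card for card in deck if card[0] == "K"}

p_heart_given_face = cond_prob(heart, face)
p_king_given_face = cond_prob(king, face)

# Independence check: does knowing "face card" change the probability?
heart_independent_of_face = p_heart_given_face == prob(heart)
king_independent_of_face = p_king_given_face == prob(king)

# General product rule: P(A and B) = P(A) * P(B | A)
product_rule_holds = prob(face & king) == prob(face) * cond_prob(king, face)

print(p_heart_given_face, heart_independent_of_face)
print(p_king_given_face, king_independent_of_face)
print(product_rule_holds)
```

Learning that a card is a face card does not change the chance that it is a heart, but it does raise the chance that it is a king, so only the first pair of events is independent.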

  • → Disjoint/independent events are required to use the simple addition/multiplication rules

  • Random Variable:

    • Variable whose value depends on a random event

    • Denoted by ‘X’

    • Values are denoted by ‘x’

    • E.G. coin flips, dice rolls, card draws, etc.

  • Probability Model:

    • Function that associates a probability with each value of a discrete random variable

    • Typically in a table form with at least 3 columns

  • Expected Value:

    • Theoretical long run average of a random variable

    • Center of a probability model for the random variable(like the mean)

    • Denoted by E(X) or μ

    • Calculated by the sum of the products of variable values and probabilities

    • Analogous to the “break even” point or house edge
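
The "sum of the products of variable values and probabilities" can be written out directly. The sketch below uses a fair six-sided die as a hypothetical probability model, with a hypothetical helper name `expected_value`.

```python
from fractions import Fraction

# Hypothetical probability model: value -> probability for a fair six-sided die.
model = {value: Fraction(1, 6) for value in range(1, 7)}

def expected_value(prob_model):
    """E(X) = sum of x * P(X = x) over every value x."""
    return sum(x * p for x, p in prob_model.items())

total_probability = sum(model.values())   # a valid model must total 1
mu = expected_value(model)

print(total_probability)
print(mu)
```

The expected value here is not a value the die can actually show, which is a reminder that E(X) describes the long-run average rather than any single trial.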

  • Random:

    • An outcome is random if we know the possible outcomes but not which value it actually takes

    • Random outcomes are free of human influence

    • Don’t use “random” in place of “unexpected”

    • Examples

      • “Random” phone call

      • “Random” actions

  • Simulation: Using random numbers to represent the outcomes of uncertain events

  • Trial:

    • In a simulation, the sequence of events that we are pretending will take place

    • For each trial, we get a simulated answer to our question(simulated outcome)

  • DISCRETE VS CONTINUOUS

    • D - countable (listable) number of values

    • C - any value within an interval

  • Bernoulli Trials:

    • Collection of trials where:

      • Each trial has exactly two outcomes: “success” or “failure”

        • p: probability of success

        • q: probability of failure

      • P(“success”) is constant

      • All trials are independent

  • Geometric Probability Model:

    • Used with random variables that count the number of Bernoulli trials until our first success

    • X = the number of trials until the first success

    • p = the probability of success

    • q = the probability of failure

      • q=1-p

    • p and q are complements

    • P(X=x) = q^(x-1) · p

    • E(X) = μ = 1/p

      • Note → on the AP exam, 1-p will be shown instead of q

    • Var(X) = q/p²

    • Standard Deviation = σ = √(q/p²) = √q / p
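
The geometric model formulas above translate into a few lines of code. This is a minimal sketch; the function names and the example value p = 0.25 are hypothetical.

```python
from math import sqrt

def geometric_pmf(x, p):
    """P(X = x): the first success happens on trial x, i.e. x-1 failures then a success."""
    q = 1 - p
    return q ** (x - 1) * p

def geometric_summary(p):
    """Mean, variance, and standard deviation of the geometric model."""
    q = 1 - p
    return {"mean": 1 / p, "variance": q / p ** 2, "sd": sqrt(q) / p}

p = 0.25   # hypothetical probability of success on each trial
p_first_success_on_3 = geometric_pmf(3, p)
summary = geometric_summary(p)

print(p_first_success_on_3)
print(summary)
```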

  • 10% Condition

    • Remember that one of the requirements for Bernoulli trials is independence, and trials are not independent when we sample without replacement

    • However, it is still ok to use this model as long as we randomly sample less than 10% of the population

  • Binomial Model:

    • Appropriate for a random variable that counts the number of successes in a fixed number of Bernoulli Trials

    • Example: getting 2 heads with 4 coin flips

    • Probability of getting x successes in n trials

    • Details:

      • x → number of successes

      • n → number of trials

      • p → probability of success (1-q)

      • q → probability of failure (1-p)

      • P(X=x) = [n! / (x!(n-x)!)] · p^x · q^(n-x)

      • Var(X) = npq

      • SD(X) = √(npq)
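
The binomial formulas can be checked against the coin example mentioned above (2 heads in 4 flips). This is a sketch assuming a fair coin, with hypothetical function names.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x): exactly x successes in n Bernoulli trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

def binomial_spread(n, p):
    """Variance npq and standard deviation sqrt(npq)."""
    q = 1 - p
    variance = n * p * q
    return variance, sqrt(variance)

p_two_heads = binomial_pmf(2, 4, 0.5)   # 2 heads in 4 fair coin flips
variance, sd = binomial_spread(4, 0.5)

print(p_two_heads)
print(variance, sd)
```

Here `comb(n, x)` supplies the n! / (x!(n-x)!) term, the number of different orders in which the x successes can occur.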

  • Systematic Sample

    • Simple Random Sample (SRS) is the gold standard, but often not the most practical

      • Systematic Sample

        • Still has randomness, but each possible sample is not equally likely

      • Stratified Random Sample

        • Population divided into several subpopulations

        • SRS within each stratum

        • Used when there are differences among the subgroups and we want to capture those differences proportionally

      • Cluster Sample

        • Population is divided into groups or clusters

        • Each cluster is similar to other clusters

        • Done for convenience, practicality and/or cost

      • Multistage Sampling

        • Combo of multiple methods (usually Stratified and Cluster)

        • EG

          • For Kauai, we can stratify by moku, then cluster by neighborhood or city block

  • Surveys:

    • How are you asking your questions?

    • Specific questions

    • Careful with phrasing

    • See p. 290-291

    • Pilot Survey:

      • Small Trial run of a Survey to test whether the questions and setup are good and clear

  • What can go wrong?

    • Voluntary response sample:

      • A large group is invited to respond and everyone who chooses to respond is counted

      • Leads to a Voluntary Response Bias:

        • Example: Very strongly opinionated people might be more likely to volunteer

    • Convenience sample:

      • D

    • Bad Sampling Coverage:

      • If the sampling frame excludes people from the population

    • Undercoverage:

      • Minorities during the census

    • Nonresponse Bias: bias introduced when a large fraction of those sampled fails to respond to a survey

    • Response Bias: Anything in a survey that influences responses (like leading questions or unclear phrasing)

  • The Success/Failure Condition:

    • A binomial model is approximately normal if we expect at least 10 successes and 10 failures

      • np ≥ 10

      • nq ≥ 10
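
The Success/Failure Condition is a simple pair of checks. A minimal sketch, with a hypothetical function name and made-up example values:

```python
def success_failure_ok(n, p):
    """Return True when we expect at least 10 successes and at least 10 failures."""
    q = 1 - p
    expected_successes = n * p
    expected_failures = n * q
    return expected_successes >= 10 and expected_failures >= 10

# Hypothetical examples.
large_balanced = success_failure_ok(50, 0.5)    # expect 25 successes, 25 failures
small_skewed = success_failure_ok(20, 0.1)      # expect only 2 successes

print(large_balanced, small_skewed)
```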

  • Discrete vs Continuous models

    • Normal Distribution is continuous

    • Binomial model is discrete

  • Statistical Significance: The results of a study are considered statistically significant if there is a very low probability that they happened by chance alone

    • Are the results extreme enough to reject a hypothesis?

  • Sampling Distribution

    • Distribution of sample means

  • Shifting data affects center but not spread

    • E(X+C)=E(X)+C

    • E(X±Y)=E(X)±E(Y)

    • SD(X+C)=SD(X) (same standard deviation)

    • Var(X+C)=Var(X) (same variance)

    • Var(X±Y)=Var(X)+Var(Y) (only when X and Y are independent)

  • Rescaling data affects center and spread

    • E(X*C)=E(X)*C

    • SD(X*C)=SD(X)*|C|

    • Var(X*C)=Var(X)*C^2

      • Variance scales with the square of the constant: multiplying a random variable by C multiplies its variance by C^2
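
The shifting and rescaling rules can be checked on a small data set. The sketch below uses made-up values and a made-up constant, treating the list as the full set of equally likely values of X (so population variance is used).

```python
from statistics import mean, pstdev, pvariance

# Hypothetical values of X (equally likely) and a hypothetical constant C.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
c = 3.0

shifted = [x + c for x in data]   # X + C
scaled = [x * c for x in data]    # X * C

# Shifting moves the center but leaves the spread alone.
print(mean(data), mean(shifted))
print(pstdev(data), pstdev(shifted))

# Rescaling changes both: the mean and SD scale by C, the variance by C^2.
print(mean(scaled), pstdev(scaled), pvariance(scaled))
print(mean(data) * c, pstdev(data) * abs(c), pvariance(data) * c ** 2)
```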
