Review- Ch 3 Statistics


Description and Tags

Statistics

Linear Regression and Correlation

AP Statistics

Unit 2: Exploring Two-Variable Data


24 Terms

New cards

r=Correlation coefficient, the average cross product of z scores

Definition

Measures the relationship BETWEEN 2 numeric variables
Strength and association
- Measures direction(+-) and strength (-1 to 1), not shape
HOW closely points cluster around the “center” of data

Data

Univariate data→ mean
bivariate data→ regression line
Unitless, so changing the units does nothing
r must be BETWEEN -1 and 1, with -1 or 1 meaning perfect linear correlation
Not affected by which variable(x,y) is changing units
SAME Sign(+-) as the direction of the slope
STRONGLY affected by extreme values
If 1 variable has an equation→ use it
- (s-x) and (x) have a negative correlation because there is a negative

Math

r = Σ [ ((x - x̄)/s_x) * ((y - ȳ)/s_y) ] / (n - 1)

= Σ (z_x * z_y) / (n - 1)
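As a rough illustration of the z-score formula for r, here is a minimal Python sketch. The function name `correlation` and the hours/score data are made up for this example and are not from the cards.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score cross products divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = correlation(hours, score)

# r is unitless: converting hours to minutes should leave it unchanged
r_minutes = correlation([h * 60 for h in hours], score)
```

Because each value is standardized before multiplying, rescaling either variable cancels out, which is why `r` and `r_minutes` agree.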


New cards

Strength of r

STRENGTH of r, correlation coefficient

Numbers

|r| from 0 to 0.5 →weak

|r| from 0.5 to 0.8 →moderate

|r| from 0.8 onwards→strong

A negative r: negative association (y tends to decrease as x increases)
A positive r: positive association (y tends to increase as x increases)
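The cutoffs on this card can be turned into a small helper. This is only a sketch: the function names are hypothetical, and treating a value exactly on a cutoff (0.5 or 0.8) as the higher category is an assumption, since the card's ranges overlap at the boundaries.

```python
def strength_of_r(r):
    """Label the strength of r using the card's cutoffs, applied to |r|."""
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size < 0.5:
        return "weak"
    if size < 0.8:
        return "moderate"  # assumption: exactly 0.5 counts as moderate
    return "strong"        # assumption: exactly 0.8 counts as strong

def direction_of_r(r):
    """Label the direction of the association from the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"
```

For example, r = -0.9 would be described as a strong negative association.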

New cards

Least Squares Regression Line (LSRL)

Estimates and predictions, not actual values
Reasonable only WITHIN the domain of the data (interpolation)
MUST pass through the mean(x̄, ȳ)
Regression OUTLIERS
- indicated by a point falling far away from the overall pattern
- points with relatively large discrepancies BETWEEN the value of the response variable, y, and a predicted value for the response variable,ŷ

Math

LSRL=ŷ =a+bx

a =y intercept
b=slope

b=r(s_y / s_x)

SSE= Σ(y-ŷ)²

y= Actual
ŷ=predicted
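The formulas on this card can be sketched in plain Python as follows. The function names and the small data set are invented for illustration. The intercept formula a = ȳ - b·x̄ is not written on the card; it follows from the rule that the LSRL must pass through (x̄, ȳ).

```python
from statistics import mean, stdev

def fit_lsrl(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x, using b = r * (s_y / s_x)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)      # slope
    a = y_bar - b * x_bar    # intercept: forces the line through (x-bar, y-bar)
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared residuals: the sum of (y - y-hat)^2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_lsrl(xs, ys)
total_error = sse(xs, ys, a, b)
```

For this made-up data the line works out to roughly ŷ = 2.2 + 0.6x, and plugging in x̄ = 3 gives back ȳ = 4.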

New cards

r²

r² = Coefficient of determination

Calculates the proportion of the variance(variability) of one variable that is PREDICTED by the other variable
- “r² (as a %) of the total variation in Y can be explained by the linear relationship BETWEEN X and Y in the regression line.”
What % of the total variation in Y can be explained by the regression line?
Greater r² %→ Better fit

Math

1 - r² = HOW much variability in Y is unaccounted for by the regression line.
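One way to see r² as "explained variation" is to compute it as 1 - SSE/SST, where SST is the total variation of y around ȳ. The sketch below assumes that standard identity for a least squares line; the function name and data are made up for illustration.

```python
from statistics import mean

def r_squared(xs, ys):
    """r^2 computed as 1 - SSE/SST for the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation
    return 1 - sse / sst

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

explained = r_squared(xs, ys)   # share of variation in y explained by the line
unexplained = 1 - explained     # the 1 - r^2 share left over
```

Here about 60% of the variation in y is explained by the line, leaving about 40% unaccounted for.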

New cards

Describing Scatterplots

SOFA

S:Strength( Strong, Moderate, Weak, variability and Heteroscedasticity)

O: Outliers( in x, y direction, or BOTH)

F: Form(Linear or curved)

A: Association (Positive, negative, or no association)

Describing SOFA relationship BETWEEN variables

STEPS

Identify the variables, cases, and scale of measure
Describe overall shape
Describe the trend through the slope
describe strength
Generalization
Note any lurking variables OR causation

New cards

Heteroscedasticity

Unequal variation in the plot
“Fanning left/right”
Doesn’t cause bias in the coefficient estimates, but makes them less precise.
- Lower precision increases the likelihood that the coefficient estimates are further from the correct population value.
tends to produce p-values that are smaller than they should be

New cards

Scatterplots

Graph

change can be seen in frequency bar charts
clusters→ modes(peaks, which can also show bimodal)
Scatterplots are only for bivariate data

New cards

Z score

Standardised Z
The signs (+-) of the standardized x and y values place each point in one of the 4 quadrants of the coordinate plane, with the origin (0,0) at the intersection; after standardizing, the origin corresponds to the mean point (x̄, ȳ)

New cards

Regression

HOW 2 numerical variables AFFECT each other
(x, y) are not interchangeable
Can suggest a “causal” effect, but does NOT prove causation
Positive when independent and dependent variables are both increasing or decreasing together
Negative when independent and dependent variables are going opposite ways(ie. one is increasing the other is decreasing)

Mean

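Regression to the mean: happens in ANY elliptical cloud of points whenever the correlation, r, is not perfect
- In standardized (z-score) units with a positive association, the line through the long axis of the elliptical cloud has a slope of 1, but the LSRL has slope r, so predictions fall closer to the mean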
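Regression to the mean can be checked numerically: when both variables are converted to z-scores, the least squares slope equals r, which is less steep than 1 whenever |r| < 1. The sketch below uses made-up data and hypothetical helper names.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = z_scores(xs), z_scores(ys)
r = sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)
slope_z = slope(zx, zy)  # slope of the LSRL in standardized units
```

Since `slope_z` matches `r`, a point 1 SD above the mean in x is predicted to be only r SDs above the mean in y.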

New cards

Interpolation

Predicting a data value WITHIN the range of the dataset

New cards

Extrapolation

Predicting a data value OUTSIDE the range of the dataset
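A quick way to tell the two apart is to compare the new x value against the smallest and largest x values in the data. This is a minimal sketch with a hypothetical function name and made-up data.

```python
def prediction_type(x_new, xs):
    """Classify a prediction by whether x_new lies inside the observed x range."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"   # inside the data: prediction is reasonable
    return "extrapolation"       # outside the data: prediction is risky

# Hypothetical x values, invented for illustration only
xs = [1, 2, 3, 4, 5]

inside = prediction_type(3.5, xs)
outside = prediction_type(10, xs)
```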

New cards

Slope interpretation

“For every 1 unit increase in the explanatory variable, x, the predicted response variable, ŷ, increases/decreases by [slope] units.”

New cards

SSE

The sum of squared residuals (errors)

New cards

Residuals

*distance measurement

The sum (and so the mean) of the residuals=0
The DIFFERENCE between an observed Y value and its predicted value from the regression line
- Shrinks in size when the regression line fits the data MORE closely

Math

Residual= Y-ŷ

Positive output: linear model UNDERestimated the actual response variable

Negative output: linear model OVERestimated the actual response variable
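The residual rule can be sketched in Python as below. The data and function name are made up for illustration; the line is fitted by least squares so that the residuals sum to zero.

```python
from statistics import mean

def residuals(xs, ys):
    """Residual = observed y minus predicted y-hat from the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

res = residuals(xs, ys)
total = sum(res)  # should be 0, up to floating-point rounding

# Positive residual: the line underestimated; negative: it overestimated
labels = ["under" if e > 0 else "over" if e < 0 else "exact" for e in res]
```

With this data the first point has a negative residual (the line overestimates it) and the third has a positive one (the line underestimates it).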

New cards

Residual Plots

Scatter plot of regression residuals AGAINST the predicted y values
a “barometer” for HOW well the regression lines fit the data
curvature →sign of curvature in the original plot, meaning the original relationship was nonlinear and a linear model was not appropriate
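To see how curvature shows up in residuals, the sketch below fits a straight line to made-up data that actually follows y = x². The residuals come out positive at both ends and negative in the middle, a U-shaped pattern. The data and names are hypothetical.

```python
from statistics import mean

def line_residuals(xs, ys):
    """Residuals (y - y-hat) from the least squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical curved data: y = x squared
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

res = line_residuals(xs, ys)

# Signs of the residuals, in x order: a run of +, then -, then + suggests a curve
signs = ["+" if e > 0 else "-" for e in res]
```

A residual plot with no such pattern, just random scatter around 0, would support using a line.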

New cards

rules for regression

The sum of residuals=0
horizontal line: mean of residuals=0
Residuals scattered randomly=better fit for data
Residuals have a pattern/curve= Not an appropriate line

New cards

Missed features in Scatterplots

These points will change the measurement
Influential points
High leverage points
outliers
lurking variables

New cards

Influential points

examples: Outliers, high-leverage
removal of points→sharply CHANGE the regression line
High leverage
- x values are far from x̄
- line up with pattern: doesn’t influence equation, strengthens correlation, r, and determination, r²
- Not line up with pattern: dramatically CHANGES the equation, an influential point
Outliers
- may cause r² and S to CHANGE
lurking variables
- Correlation ≠ causation
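The effect of an influential point can be sketched by fitting the line with and without it. In this made-up example, five points follow roughly y = 2x, and one extra point has a far-away x value (high leverage) that does not line up with the pattern. All names and data are hypothetical.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Hypothetical data close to y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# One high-leverage point (x far from x-bar) that breaks the pattern
xs_with = xs + [20]
ys_with = ys + [5]

slope_without = slope(xs, ys)
slope_with = slope(xs_with, ys_with)
```

Removing the single point moves the slope from nearly flat back to about 2, which is what makes it influential.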

New cards

Slope Changing Transformations

Line of fit to a scatterplot should be considered for a plot with curvature → adjust the plot using transformations
Nonlinear transformations change the shape of the graph, linear won’t
- in terms of slope and correlation,r
ONLY required if a linear model/scatterplot has curvature
- use log(y) vs. x or log(y) vs. log(x), depending on the plot
- these straighten exponential and power relationships, respectively
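As a sketch of why a log transformation helps, the example below builds made-up exponential data (y = 3·2^x) and compares the correlation before and after taking the natural log of y. The function name and data are hypothetical.

```python
from math import log
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-score cross products."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical exponential data: y = 3 * 2^x
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

r_raw = correlation(xs, ys)                    # curved plot, so |r| < 1
r_log = correlation(xs, [log(y) for y in ys])  # ln(y) vs. x is a straight line
```

After the transformation the points fall on a line, so `r_log` is essentially 1, while `r_raw` is lower because the original plot is curved.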