Sample correlation coefficient:
r_xy
Measures tightness of fit (how tight that "walk from the beach" is, in the lecture's analogy) — how tightly the dots cluster around the line.
Measures the strength of the linear relationship between X and Y.
Gives the average change in standard deviations of Y for every 1 standard deviation increase in X.
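As a sketch, r_xy can be computed directly from its sum-of-squares pieces. The temperature/sales numbers below are made up for illustration (not the lecture's data):

```python
# Compute the sample correlation coefficient r_xy from scratch.
# The temperature (x) and sales (y) numbers are made-up illustration data.
import math

x = [60, 65, 70, 75, 80, 85, 90]          # high temperature
y = [200, 240, 260, 310, 330, 390, 400]   # sales

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
ss_yy = sum((yi - mean_y) ** 2 for yi in y)

# r_xy = SS_xy / sqrt(SS_xx * SS_yy); always between -1 and 1
r_xy = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r_xy, 3))
```

For this made-up data r_xy comes out close to 1, i.e. a strong positive linear relationship.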
Y:
Dependent.
Response.
Should depend on X in the model.
What you’re trying to explain, predict, or measure.
X:
Independent.
Explanatory.
Independent not because it is independent of Y, but because X is determined independently of the model (called exogeneity).
The factor you think influences or explains Y.
Least squares Regression Line:
The line that fits the data the best.
A line that minimizes the sum of the squared distances between the dots/actual values of Y and the line.
The best estimator for the expected value of Y given X.
Minimizes the SSE.
ŷᵢ = b₀ + b₁xᵢ
ŷi:
The predicted value of Y given X.
The estimator of E(Y|X), the CEF.
bo:
The estimated intercept.
Estimator of β₀.
Gives the predicted value of Y when X = 0.
If the X-variable cannot possibly be equal to zero, this is said to be "non-meaningful."
b1:
The estimated slope.
Estimator of β₁.
The average change in units of Y for every 1 unit increase in X.
It estimates the relationship between X and Y.
It has a variance and a standard deviation (it is a random variable across samples).
b₁ = SS_xy/SS_xx
Residual:
How far off the prediction of Y is for each observation.
How much is the error of the estimation.
The distance between the dots and the line.
The difference between the actual values of Y/dots and the predicted line.
We want the dots to be as close to the line as possible (small residuals).
eᵢ = yᵢ − ŷᵢ
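A minimal sketch of the least-squares computations above, using made-up temperature/sales data: it finds b₀ and b₁, the fitted values ŷᵢ, and the residuals eᵢ = yᵢ − ŷᵢ.

```python
# Fit the least-squares line by hand and compute the residuals.
# The temperature (x) and sales (y) data are made up for illustration.
x = [60, 65, 70, 75, 80, 85, 90]
y = [200, 240, 260, 310, 330, 390, 400]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)

b1 = ss_xy / ss_xx          # estimated slope, b1 = SS_xy / SS_xx
b0 = mean_y - b1 * mean_x   # estimated intercept; the line passes through (x̄, ȳ)

y_hat = [b0 + b1 * xi for xi in x]                  # predicted values ŷᵢ
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # eᵢ = yᵢ − ŷᵢ

print(round(b1, 3), round(b0, 3))
print(round(sum(residuals), 8))   # least-squares residuals sum to 0
```

Some residuals are positive (dots above the line), some negative (below), and they always sum to zero for the least-squares line.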
How well did the model work?
What was the purpose of the line?
The purpose of the line was to explain the variation in Y: to use the variation in X (if it's hotter than average, x̄, or colder than average) to explain the variation in Y.
SST:
Total variation in Y.
Total sum of squares.
Σ(yᵢ − ȳ)²
SSR:
Explained variation in Y.
How much of the variation in Y was explained.
When our line leaves the mean, rising above or below ȳ, it explains the variation in Y.
Sum of squares from regression.
Σ(ŷᵢ − ȳ)²
SSE:
Unexplained variation in Y.
The sum, from the first observation to the last, of the squared differences between the actual values of Y and the predicted values.
Sum of squared errors.
The sum of the squared distances in the sample between the dots and the line.
Also the sum of the squared residuals: Σeᵢ²
Σ(yᵢ − ŷᵢ)²
The SSE and SSR are:
Mutually exclusive (variation can't be both explained and unexplained) and collectively exhaustive (all variation has to be one or the other, explained or unexplained): SST = SSR + SSE.
R²:
The coefficient of determination.
The % of the variation in Y explained by the model.
R² = SSR/SST = Explained/Total
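The SST = SSR + SSE decomposition and R² can be checked numerically. This sketch reuses made-up temperature/sales data and the hand-computed least-squares fit:

```python
# Verify SST = SSR + SSE and compute R² on made-up illustration data.
x = [60, 65, 70, 75, 80, 85, 90]
y = [200, 240, 260, 310, 330, 390, 400]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation in Y
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)          # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

r_squared = ssr / sst   # % of the variation in Y explained by the model
print(round(sst, 2), round(ssr + sse, 2))  # the two totals match
print(round(r_squared, 3))
```

Because explained and unexplained variation are mutually exclusive and collectively exhaustive, SSR + SSE reproduces SST exactly (up to floating-point rounding).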
E(Y|X):
Conditional Expectation Function (CEF)
The expected value of Y given X.
Regression Equation:
Before using this, you have to specify a functional form. For example: which shape are we estimating?
Used to, in the example used in the lecture, predict sales (Y) given high temperature (X).
Used to learn more about the relationship between the variables, for example: how much evidence is there that they're related at all?
Used when specifying the straight line.
E(Y|X) = β₀ + β₁xᵢ
It's a positive trend if the line is going up.
You can only draw the actual values of Y on the graph; the true CEF itself is unobserved.
Bo:
A true population parameter, and a constant.
The true population intercept.
β₀ = E(Y|X = 0)
(If X cannot be equal to zero, β₀ is said to be "non-meaningful.")
B1:
A true population parameter, and a constant.
The true population slope.
The true relationship between X and Y.
How to interpret it: the change in Y for every 1 unit increase in X.
The change in E(Y|X) from a 1 unit increase in X.
If there is a small, positive, weak relationship, β₁ will be small and positive.
If there is a strong, positive relationship, β₁ will be larger and positive.
Goals of the Regression Equation:
Main motivation: how is X related to Y?
β₁ is our target parameter.
Used to demonstrate whether there is a true relationship between high temperature, X, and sales, Y.
yi:
The actual values of Y (the dots around the line)
Error Term:
The distance between the actual values of Y and the true CEF.
The spread of the dots around the line.
This represents the influence on Y of unobserved factors (unobserved factors = what's causing the dots to be spread out).
The distance between the actual values of Y and the mean of Y given X.
If you had all of these, they would sum to 0: E(εᵢ) = 0.
Some of these will be positive, some of these will be negative.
The deviation of the actual values of Y from their expected values.
εᵢ = yᵢ − E(Y|X)
SD(Ei):
The average spread of the error terms.
The average spread of the dots around the line.
This determines the level of precision of the model.
Population Model:
yᵢ = β₀ + β₁xᵢ + εᵢ
β₀ + β₁xᵢ is the same as E(Y|X).
Ei:
What’s causing the dots to be spread out.
These are the unobserved factors.
What makes estimating the line difficult.
The deviation of yᵢ from E(Y|X) caused by other factors.
Sampling Distribution of b1:
If the 4 regression assumptions are met, then b₁ is normally distributed with a mean of β₁ and a standard deviation of SD(b₁) = SD(εᵢ)/√SS_xx.
b₁ ~ N(β₁, Var(b₁)): this can be translated as "b₁ is normally distributed with a mean of β₁ and a variance of Var(b₁)."
When we drew the regression equation/population CEF, E(Y|X), we gave β₁ a positive slope, so:
The b₁ estimates shape into a normal distribution centered on that positive β₁.
True std deviation of b₁:
We won't ever have the true std deviation of b₁ because it's a function of a parameter we don't have: SD(εᵢ), the standard deviation of the error terms.
SD(b₁) = SD(εᵢ)/√SS_xx
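A small simulation sketch of this claim, with made-up values for β₀, β₁, and SD(εᵢ): drawing many samples from the population model and re-estimating b₁ each time, the estimates center on β₁ with spread close to SD(εᵢ)/√SS_xx.

```python
# Simulate the sampling distribution of b1 under the population model
# y_i = β0 + β1·x_i + ε_i, with made-up (assumed) parameter values.
import random
import statistics

random.seed(0)
beta0, beta1, sd_eps = 10.0, 2.0, 3.0    # assumed "true" parameters
x = [float(i) for i in range(1, 21)]
mean_x = sum(x) / len(x)
ss_xx = sum((xi - mean_x) ** 2 for xi in x)

def estimate_slope():
    # Draw one sample of y values, then compute b1 = SS_xy / SS_xx.
    y = [beta0 + beta1 * xi + random.gauss(0, sd_eps) for xi in x]
    mean_y = sum(y) / len(y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_xx

slopes = [estimate_slope() for _ in range(5000)]
print(round(statistics.mean(slopes), 3))    # close to β1
print(round(statistics.stdev(slopes), 3))   # close to SD(ε)/√SS_xx
print(round(sd_eps / ss_xx ** 0.5, 3))      # the theoretical SD(b1)
```

In the simulation we know SD(εᵢ) because we chose it; with real data it has to be estimated from the residuals.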
k:
The number of slopes we have to estimate.
Individual significance test (the t-test):
This tests if there is a relationship between X & Y.
Is there actually a relationship between temperature and sales?
Test statistic:
t = (b₁ − 0)/SE(b₁); this tests whether the two variables are related to each other.
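A sketch of the test statistic t = b₁/SE(b₁) on made-up temperature/sales data. Since the true SD(εᵢ) is never known, SE(b₁) replaces it with the estimate s = √(SSE/(n − 2)):

```python
# Compute the individual significance t statistic for H0: β1 = 0.
# The temperature (x) and sales (y) data are made up for illustration.
import math

x = [60, 65, 70, 75, 80, 85, 90]
y = [200, 240, 260, 310, 330, 390, 400]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # estimate of SD(ε); n − 2 = n − k − 1 with k = 1 slope
se_b1 = s / math.sqrt(ss_xx)   # estimated standard error of b1

t_stat = (b1 - 0) / se_b1      # test statistic for H0: β1 = 0
print(round(t_stat, 2))
```

A t statistic far from 0 (compared to a t distribution with n − 2 degrees of freedom) is evidence against H₀, i.e. evidence that X and Y are related.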
Linearity:
The relationship between X and Y is linear.
Violated when: parabola, curvilinear relationship, misspecified model.
Independent:
Each observation is independent of, i.e. not related to, the others.
Violated when: time series data (measuring variables over different time periods), autocorrelation (the observations are either positively or negatively related to each other).
Constant variance:
The variance of the error terms is constant/ The spread of the Error terms is constant.
If met: homoskedasticity → the variance of the errors is constant.
Violated when: heteroskedasticity → the variance of the errors is not constant.
Normality:
The error term is normally distributed.
If met: never completely met, but becomes approximately met with large sample sizes.
Violated when: small sample sizes.
If the four regression assumptions are met (or approximately met):
the estimators in the model are unbiased, efficient, and consistent.
Let’s say we want to predict sales at a frozen yogurt shop, maybe YaYa’s Yogurt in Oxford:
Use expected sales or mean sales: E(Y) = μ_Y.
2 Reasons to Estimate the Conditional Expectation Function:
1. To predict the value of Y (forecasting).
2. To measure relationships between variables (statistical inference).
As long as X is related to Y, E(Y|X) is a:
better predictor of Y than the unconditional mean E(Y) alone.
Regression Equation: a linear CEF.

Conditional Expectation Function (CEF)
E(Y|X) = the expected (or average) value of Y given X.
The true CEF cuts through the middle of the data.
Individual significance test:
H₀: β₁ = 0 (vs. Hₐ: β₁ ≠ 0)
If we reject H₀, we'll say that X is significantly related to Y, AND we'll say that we found evidence that β₁ ≠ 0.
If we don't reject H₀, we'll say that X is insignificantly related to Y, AND we'll say that we found no evidence of a relationship.