regression
covariance
a way to see if there is a relationship between two or more variables
how much does each score deviate from the mean
changes in one variable are met by similar changes in the other variable
if both variables deviate from the mean by a similar amount, they are likely to be related
formula
Cov(x, y) = ∑(xi - x̄)(yi - ȳ) / (N - 1)
x = scores on first variable
y = scores on second variable
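The formula above can be sketched in a few lines of Python; the data (study hours and test scores) is hypothetical and only illustrates the calculation:

```python
# Sample covariance: sum of cross-products of deviations from the means,
# divided by N - 1 (matches the formula Cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (N - 1)).
def covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

hours = [1, 2, 3, 4, 5]        # x = scores on first variable (hypothetical)
scores = [52, 55, 61, 64, 70]  # y = scores on second variable (hypothetical)
print(covariance(hours, scores))  # positive → the variables move together
```

A positive result means both variables tend to deviate from their means in the same direction; a negative one means they deviate in opposite directions.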
regression
predicting Y from someone’s score on X based on knowledge of the relationship between X and Y
the degree of relationship between X and Y
see if there's a connection between variables and how strong it is
key components
define predictor (IV) and outcome (DV) variables
regression coefficients
show the strength and direction of the relationship between each independent variable and the dependent variable
ex. how much test scores increase for each extra hour of study
residuals / errors
differences between the actual values of the dependent variable and the values predicted by the regression model
show how well the model fits the data
intercept
starting point of the dependent variable when all independent variables are zero
like a baseline
assumptions
linearity, independence of errors, constant variance, and normality of residuals
regression equation (&its components)
Y = b0 + b1x + e (predicted value: Ŷ = b0 + b1x)
b0 = intercept = value of y when x = 0
b1 = slope = change in y due to 1 unit change in x
e = residual / error
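The coefficients b0 and b1 can be estimated from data with the standard least-squares formulas (b1 = Cov(x, y) / Var(x), b0 = ȳ - b1·x̄). A small sketch with hypothetical study-hours data; `fit_line` is an illustrative helper, not a library function:

```python
# Estimate intercept (b0) and slope (b1) for Y = b0 + b1x + e.
def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope: change in y per 1-unit change in x
    b0 = y_bar - b1 * x_bar  # intercept: value of y when x = 0
    return b0, b1

hours = [1, 2, 3, 4, 5]        # hypothetical predictor (IV)
scores = [52, 55, 61, 64, 70]  # hypothetical outcome (DV)
b0, b1 = fit_line(hours, scores)
print(b0, b1)
```

Here b1 answers "how much do test scores increase for each extra hour of study?", and b0 is the baseline score when study time is zero.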
difference between correlation & regression
correlation as checking if two things are linked, and regression as exploring and using that link to make predictions
Correlation measures how strongly two things move together, but it doesn't show cause and effect.
adv. - simple and quick to calculate; gives a single value summarizing the strength and direction of a relationship; enough when you only need to know whether two variables are related
disadvantage of correlation in comparison
Correlation doesn’t show causation or directionality, whereas regression models the potential causal relationship.
doesn’t allow for predictions or an understanding of how changes in one variable might affect another.
Regression not only shows the relationship but also helps predict one variable (like test scores) from another (like study hours), and shows how much change in one leads to a change in the other.
adv. - more detail; quantifies the exact nature of the relationship
sketch different relationships & suggest relationship type based on scatter plot
non-Linear relationship
Points follow a curved pattern (e.g., a U-shape). Suggests a non-linear regression model might work better than linear regression
clustered relationship
Points group into clusters. This may indicate the presence of distinct groups or categories within the data, requiring additional analysis
what does the F-test in regression measure?
it tests whether the independent variables, as a group, significantly explain the variation in the dependent variable
whether the overall regression model provides a better fit to the data than a model with no predictors
if the F-test gives a low p-value (e.g., < 0.05), it suggests that the model explains a significant portion of the variability in the dependent variable
the resulting F-statistic indicates whether the improvement in fit is statistically significant
evaluates the overall significance of a regression model, especially in multiple regression where there are several predictors
residual / error variance
measures the average squared difference between the observed values of the dependent variable and the values predicted by the regression model
how much variation in the dependent variable is left unexplained by the model
assesses the quality of the model
lower residual variance → the model’s predictions are closer to the actual data, indicating a better fit
higher residual variance → the model is not capturing the data well, leaving a lot of variation unexplained
formula
residual variance = residual sum of squares (RSS) / (n - k - 1)
n = number of observations
k = number of independent variables (predictors) in the model
n - k - 1 = degrees of freedom for residuals
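A quick sketch of the formula, reusing the hypothetical study-hours data and its least-squares coefficients (b0 = 46.9, b1 = 4.5):

```python
# Residual variance = RSS / (n - k - 1)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5                      # least-squares fit for this data
predicted = [b0 + b1 * x for x in hours]
rss = sum((y - yh) ** 2 for y, yh in zip(scores, predicted))
k = 1                                   # one predictor
resid_var = rss / (len(scores) - k - 1)  # divide by residual degrees of freedom
print(resid_var)
```

A small residual variance relative to the spread of the raw scores means the model's predictions stay close to the observed values.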
standard error of the estimate (SEE)
measure of the accuracy of predictions made by a regression model
indicates the average distance that the observed values fall from the regression line, in the same units as the dependent variable
when comparing two regression models, the one with the smaller SEE generally has better predictive accuracy
formula
SEE = √(RSS / (n - k - 1))
RSS = residual sum of squares
k = number of predictors (IVs) in a model
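The SEE is simply the square root of the residual variance, which puts it back in the units of the dependent variable. Continuing the same hypothetical example:

```python
import math

# SEE = sqrt(RSS / (n - k - 1)): typical distance of observed values
# from the regression line, in the DV's own units (here, test-score points).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5  # least-squares fit for this data
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
see = math.sqrt(rss / (len(scores) - 1 - 1))  # k = 1 predictor
print(round(see, 4))
```

Because SEE is in the DV's units, it is directly interpretable ("predictions are typically off by about this many points") and can be compared across competing models of the same outcome.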
R squared
coefficient of determination
measures the proportion of variance in the DV that is predictable from the IV
proportion of variability in Y scores that is predicted by the regression equation
how scattered the data is around the regression line
how much of the variation in the DV can be explained by the variation in the IV
value between 0 and 1, where higher values indicate a better fit
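R² can be computed as 1 - RSS/TSS (residual sum of squares over total sum of squares). A sketch with the same hypothetical data:

```python
# R² = 1 - RSS / TSS: proportion of variance in Y explained by the model.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5  # least-squares fit for this data
y_bar = sum(scores) / len(scores)
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))  # unexplained
tss = sum((y - y_bar) ** 2 for y in scores)                          # total
r_squared = 1 - rss / tss
print(round(r_squared, 4))  # near 1 → points lie close to the regression line
```

An R² near 1 means little scatter around the line (most variability in Y is predicted); near 0 means the line explains almost none of it.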
method of least squares
a mathematical approach used in regression analysis to find the line (or curve) that best fits a set of data points by minimizing the differences between observed values and predicted values
finds the line (or regression function) that minimizes the sum of the squared residuals
technique for estimating the coefficients in linear regression models
how
Start with the data points: You have some points on a graph (e.g., hours studied and test scores) and you want to draw a line that best represents the pattern in the data
Draw a line: This line is your guess for how the two variables are related (e.g., "if you study more, your score goes up")
Measure the errors (residuals): For each point, measure how far it is from the line. These distances are the errors (or residuals).
Square the errors: Instead of just adding the distances (because positive and negative could cancel out), you square each distance so they are all positive.
Find the smallest total: The least squares method adjusts the line’s position (its slope and where it crosses the vertical axis) to make the total of those squared errors as small as possible.
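The steps above can be sketched as an iterative search: start with a guessed line, measure the squared errors, and repeatedly nudge the slope and intercept downhill until the total squared error stops shrinking. This is a simple gradient-descent illustration of the idea (in practice the least-squares line has a direct closed-form solution); the data and learning rate are hypothetical:

```python
# Iteratively adjust b0 (intercept) and b1 (slope) to shrink
# the mean of the squared residuals.
def least_squares_gd(x, y, lr=0.01, steps=20000):
    b0, b1 = 0.0, 0.0  # step 2: an initial guess for the line
    n = len(x)
    for _ in range(steps):
        # step 3: residuals = observed minus predicted
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # steps 4-5: gradient of the mean squared error w.r.t. b0 and b1
        g0 = -2 * sum(resid) / n
        g1 = -2 * sum(r * xi for r, xi in zip(resid, x)) / n
        b0 -= lr * g0  # move the line to reduce the total squared error
        b1 -= lr * g1
    return b0, b1

hours = [1, 2, 3, 4, 5]        # hypothetical data points (step 1)
scores = [52, 55, 61, 64, 70]
b0, b1 = least_squares_gd(hours, scores)
print(b0, b1)  # converges to the same line as the closed-form formulas
```

The search settles on the unique line whose sum of squared residuals cannot be made any smaller, which is exactly what the closed-form least-squares formulas give directly.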
when
Linear Regression: To estimate the best-fit line when analyzing the relationship between two variables (simple regression) or multiple variables (multiple regression).
Predictive Modeling: When you want to predict outcomes based on explanatory variables.
Error Minimization: Whenever you need a model that minimizes prediction errors in squared terms
limitations & warnings
prediction is not perfect (unless r = ±1)
regression equation should not be used to predict Y for values of X outside the range of the original data (extrapolation)