regression
covariance
a way to see if there is a relationship between two or more variables
how much does each score deviate from the mean
changes in one variable are met by similar changes in the other variable
if both variables deviate from the mean by a similar amount, they are likely to be related
formula
Cov(x, y) = ∑(xi - x̄)(yi - ȳ) / (N - 1)
x = scores on first variable
y = scores on second variable
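The formula above can be sketched in a few lines of Python; the data (study hours and test scores) is hypothetical and only illustrates the calculation:

```python
# Sample covariance: sum of cross-products of deviations from the means,
# divided by N - 1 (matches the formula Cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (N - 1)).
def covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

hours = [1, 2, 3, 4, 5]        # x = scores on first variable (hypothetical)
scores = [52, 55, 61, 64, 70]  # y = scores on second variable (hypothetical)
print(covariance(hours, scores))  # positive → the variables move together
```

A positive result means both variables tend to deviate from their means in the same direction; a negative one means they deviate in opposite directions.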
regression
predicting Y from someone’s score on X based on knowledge of the relationship between X and Y
the degree of relationship between X and Y
see if there's a connection between variables and how strong it is
key components
define predictor (IV) and outcome (DV) variables
regression coefficients
show the strength and direction of the relationship between each independent variable and the dependent variable
ex. how much test scores increase for each extra hour of study
residuals / errors
differences between the actual values of the dependent variable and the values predicted by the regression model
show how well the model fits the data
intercept
starting point of the dependent variable when all independent variables are zero
like a baseline
assumptions
linearity, independence of errors, constant variance, and normality of residuals
regression equation (&its components)
Y = b0 + b1x + e (predicted value: Ŷ = b0 + b1x)
b0 = intercept = value of y when x = 0
b1 = slope = change in y due to 1 unit change in x
e = residual / error
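The coefficients b0 and b1 can be estimated from data with the standard least-squares formulas (b1 = Cov(x, y) / Var(x), b0 = ȳ - b1·x̄). A small sketch with hypothetical study-hours data; `fit_line` is an illustrative helper, not a library function:

```python
# Estimate intercept (b0) and slope (b1) for Y = b0 + b1x + e.
def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope: change in y per 1-unit change in x
    b0 = y_bar - b1 * x_bar  # intercept: value of y when x = 0
    return b0, b1

hours = [1, 2, 3, 4, 5]        # hypothetical predictor (IV)
scores = [52, 55, 61, 64, 70]  # hypothetical outcome (DV)
b0, b1 = fit_line(hours, scores)
print(b0, b1)
```

Here b1 answers "how much do test scores increase for each extra hour of study?", and b0 is the baseline score when study time is zero.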
difference between correlation & regression
correlation as checking if two things are linked, and regression as exploring and using that link to make predictions
Correlation measures how strongly two things move together, but it doesn't show cause and effect.
adv. - simple and quick to calculate; gives a single value summarizing the strength and direction of a relationship; enough when you only need to know whether two variables are related
disadvantage of correlation in comparison
Correlation doesn’t show causation or directionality, whereas regression models the potential causal relationship.
doesn’t allow for predictions or an understanding of how changes in one variable might affect another.
Regression not only shows the relationship but also helps predict one variable (like test scores) from another (like study hours), and shows how much change in one leads to a change in the other.
adv. - more detail; quantifies the exact nature of the relationship
sketch different relationships & suggest relationship type based on scatter plot
non-Linear relationship
Points follow a curved pattern (e.g., a U-shape). Suggests a non-linear regression model might work better than linear regression
clustered relationship
Points group into clusters. This may indicate the presence of distinct groups or categories within the data, requiring additional analysis
what does the F-test in regression measure?
it tests whether the independent variables, as a group, significantly explain the variation in the dependent variable
whether the overall regression model provides a better fit to the data than a model with no predictors
if the F-test gives a low p-value (e.g., < 0.05), it suggests that the model explains a significant portion of the variability in the dependent variable
the resulting F-statistic indicates whether the improvement in fit is statistically significant
evaluates the overall significance of a regression model, especially in multiple regression where there are several predictors
residual / error variance
measures the average squared difference between the observed values of the dependent variable and the values predicted by the regression model
how much variation in the dependent variable is left unexplained by the model
assesses the quality of the model
lower residual variance → the model’s predictions are closer to the actual data, indicating a better fit
higher residual variance → the model is not capturing the data well, leaving a lot of variation unexplained
formula
residual variance = residual sum of squares (RSS) / (n - k - 1)
n = number of observations
k = number of independent variables (predictors) in the model
n - k - 1 = degrees of freedom for residuals
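A quick sketch of the formula, reusing the hypothetical study-hours data and its least-squares coefficients (b0 = 46.9, b1 = 4.5):

```python
# Residual variance = RSS / (n - k - 1)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5                      # least-squares fit for this data
predicted = [b0 + b1 * x for x in hours]
rss = sum((y - yh) ** 2 for y, yh in zip(scores, predicted))
k = 1                                   # one predictor
resid_var = rss / (len(scores) - k - 1)  # divide by residual degrees of freedom
print(resid_var)
```

A small residual variance relative to the spread of the raw scores means the model's predictions stay close to the observed values.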
standard error of the estimate (SEE)
measure of the accuracy of predictions made by a regression model
indicates the average distance that the observed values fall from the regression line, in the same units as the dependent variable
when comparing two regression models, the one with the smaller SEE generally has better predictive accuracy
formula
SEE = √(RSS / (n - k - 1))
RSS = residual sum of squares
k = number of predictors (IVs) in a model
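The SEE is simply the square root of the residual variance, which puts it back in the units of the dependent variable. Continuing the same hypothetical example:

```python
import math

# SEE = sqrt(RSS / (n - k - 1)): typical distance of observed values
# from the regression line, in the DV's own units (here, test-score points).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5  # least-squares fit for this data
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
see = math.sqrt(rss / (len(scores) - 1 - 1))  # k = 1 predictor
print(round(see, 4))
```

Because SEE is in the DV's units, it is directly interpretable ("predictions are typically off by about this many points") and can be compared across competing models of the same outcome.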
R squared
coefficient of determination
measures the proportion of variance in the DV that is predictable from the IV
proportion of variability in Y scores that is predicted by the regression equation
how scattered the data is around the regression line
how much of the variation in the DV can be explained by the variation in the IV
value between 0 and 1, where higher values indicate a better fit
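R² can be computed as 1 - RSS/TSS (residual sum of squares over total sum of squares). A sketch with the same hypothetical data:

```python
# R² = 1 - RSS / TSS: proportion of variance in Y explained by the model.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
b0, b1 = 46.9, 4.5  # least-squares fit for this data
y_bar = sum(scores) / len(scores)
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))  # unexplained
tss = sum((y - y_bar) ** 2 for y in scores)                          # total
r_squared = 1 - rss / tss
print(round(r_squared, 4))  # near 1 → points lie close to the regression line
```

An R² near 1 means little scatter around the line (most variability in Y is predicted); near 0 means the line explains almost none of it.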
method of least squares
a mathematical approach used in regression analysis to find the line (or curve) that best fits a set of data points by minimizing the differences between observed values and predicted values
finds the line (or regression function) that minimizes the sum of the squared residuals
technique for estimating the coefficients in linear regression models
how
Start with the data points: You have some points on a graph (e.g., hours studied and test scores) and you want to draw a line that best represents the pattern in the data
Draw a line: This line is your guess for how the two variables are related (e.g., "if you study more, your score goes up")
Measure the errors (residuals): For each point, measure how far it is from the line. These distances are the errors (or residuals).
Square the errors: Instead of just adding the distances (because positive and negative could cancel out), you square each distance so they are all positive.
Find the smallest total: The least squares method adjusts the line’s position (its slope and where it crosses the vertical axis) to make the total of those squared errors as small as possible.
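The steps above can be sketched as an iterative search: start with a guessed line, measure the squared errors, and repeatedly nudge the slope and intercept downhill until the total squared error stops shrinking. This is a simple gradient-descent illustration of the idea (in practice the least-squares line has a direct closed-form solution); the data and learning rate are hypothetical:

```python
# Iteratively adjust b0 (intercept) and b1 (slope) to shrink
# the mean of the squared residuals.
def least_squares_gd(x, y, lr=0.01, steps=20000):
    b0, b1 = 0.0, 0.0  # step 2: an initial guess for the line
    n = len(x)
    for _ in range(steps):
        # step 3: residuals = observed minus predicted
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # steps 4-5: gradient of the mean squared error w.r.t. b0 and b1
        g0 = -2 * sum(resid) / n
        g1 = -2 * sum(r * xi for r, xi in zip(resid, x)) / n
        b0 -= lr * g0  # move the line to reduce the total squared error
        b1 -= lr * g1
    return b0, b1

hours = [1, 2, 3, 4, 5]        # hypothetical data points (step 1)
scores = [52, 55, 61, 64, 70]
b0, b1 = least_squares_gd(hours, scores)
print(b0, b1)  # converges to the same line as the closed-form formulas
```

The search settles on the unique line whose sum of squared residuals cannot be made any smaller, which is exactly what the closed-form least-squares formulas give directly.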
when
Linear Regression: To estimate the best-fit line when analyzing the relationship between two variables (simple regression) or multiple variables (multiple regression).
Predictive Modeling: When you want to predict outcomes based on explanatory variables.
Error Minimization: Whenever you need a model that minimizes prediction errors in squared terms
limitations & warnings
prediction is not perfect (unless r = ±1)
regression equation should not be used to predict Y for values of X outside the range of the original data (extrapolation)