lecture 4: simple linear regression

notes

  • intercept = the y value we would get when x equals 0

  • slope → indicates the direction of the association between x and y

    • if the slope is positive, y increases as a function of x → positive association

    • if the slope is negative, y decreases as a function of x → negative association

  • purpose of simple linear regression → to estimate the intercept and slope based on the collected data

    • it refers to the case where we have 1 explanatory variable, as opposed to multiple regression, where there are multiple x variables

  • the least squares method:

    • this method estimates the intercept and the slope using a criterion that corresponds to the best fit

    • first, specific values for the intercept and the slope are chosen, and then a predicted value ^y is calculated for each x value in the dataset

    • however, there will generally be some deviation between y and ^y

    • the fit is good if the predicted values ^y lie close to the observed values y

    • a good measure of the total difference between the observed and predicted values is Q → the residual sum of squares; it is a measure of how closely our points fit the line for the given values of the intercept and the slope

    • the differences between the predicted values ^y and the observed values y are referred to as the residuals and are given by ei = yi − ^yi (formula in slide 9)

    • if ei is larger than 0, y lies above the regression line

    • if ei is equal to 0, y lies on the regression line

    • if ei is less than 0, y lies below the regression line
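
    The least squares steps above can be sketched numerically. This is a minimal illustration with made-up data (all variable names are my own), estimating the slope and intercept with the standard closed-form formulas and then computing the residuals and Q:

    ```python
    # minimal least-squares sketch with made-up data (names are illustrative)
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # closed-form least-squares estimates:
    # slope b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    # intercept b0 = y_bar - b1 * x_bar
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # predicted values ^y and residuals ei = yi - ^yi
    y_hat = [b0 + b1 * xi for xi in x]
    residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

    # Q, the residual sum of squares, measures how closely the points fit the line
    Q = sum(e ** 2 for e in residuals)
    print(b0, b1, Q)
    ```

    Note that the residuals of the least-squares line always sum to (numerically) zero, which is a quick sanity check on the fit.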

  • calculation in R:

    • the function lm() is used for linear regression

    • it matters what goes on the left and right side of the symbol ~
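
  In R, lm(y ~ x) regresses y on x, so swapping the sides changes the fitted line (see question 5 below). A quick numeric check of this, sketched in Python with made-up data since the notes only name the R call:

  ```python
  # demonstrate that regressing y on x is not the same as regressing x on y
  def ls_slope(u, v):
      """Least-squares slope for regressing v on u."""
      n = len(u)
      mu, mv = sum(u) / n, sum(v) / n
      return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / \
          sum((ui - mu) ** 2 for ui in u)

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]

  b_yx = ls_slope(x, y)   # slope from y ~ x
  b_xy = ls_slope(y, x)   # slope from x ~ y

  # the two fits minimise different residuals (vertical vs horizontal
  # distances), so b_xy is not 1 / b_yx unless the points lie exactly
  # on a line; in fact b_yx * b_xy equals the squared correlation
  print(b_yx, b_xy, 1 / b_yx)
  ```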

  • the least squares method allows us to find the line that best describes the linear relationship among the observations

  • statistical inference → the calculation of p values and confidence intervals; it is concerned with testing whether the estimated slope and intercept differ significantly from 0, and with constructing confidence intervals for the estimated parameters
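
  The inference step can be sketched with the same made-up data: from the residual sum of squares we get the standard error of the slope, a t statistic for H0: slope = 0, and a 95% confidence interval. The t quantile for n − 2 = 3 degrees of freedom is hard-coded (≈ 3.182) to keep the sketch dependency-free:

  ```python
  # hedged sketch: t test and 95% CI for the slope (illustrative data)
  import math

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]

  n = len(x)
  mean_x, mean_y = sum(x) / n, sum(y) / n
  sxx = sum((xi - mean_x) ** 2 for xi in x)
  b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
  b0 = mean_y - b1 * mean_x

  # residual sum of squares Q and residual variance s^2 = Q / (n - 2)
  Q = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
  s2 = Q / (n - 2)

  # standard error of the slope and its t statistic for H0: slope = 0
  se_b1 = math.sqrt(s2 / sxx)
  t_stat = b1 / se_b1

  # 97.5% quantile of the t distribution with 3 df, hard-coded
  t_crit = 3.182
  ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
  print(t_stat, ci)
  ```

  Since the interval excludes 0, we would conclude (for this toy dataset) that the slope differs significantly from 0, which is exactly what the graphical confidence intervals in question 7 show visually.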

questions

  1. Why do we use simple linear regression? Why do we need this? Can’t we do the same with the Pearson correlation or t-test?

  2. How does a simple linear regression work? What is the main principle? What are the main parameters to estimate?

  3. How do we account for uncertainty in this estimate?

  4. How can we find the best regression line?

  5. Will X ~ Y result in the same regression coefficients as Y ~ X?

  6. What is the ANOVA table? What is the interpretation of R²? How does it measure the reduction of the total error sum of squares when we use the information in X?

  7. How can we see from the graphical confidence intervals if the slope differs from 0?

  8. What are the main assumptions of a simple linear regression?

  9. How do we report a simple linear regression?