Math1041 Lecture Notes: Regression, Residuals, and Continuous Random Variables

Announcements

  • Lab test next week: Questions provided in advance, except for one.
  • Assignment: Will be emailed by the end of the week; more details in the next lecture.

Least Squares Regression Line

  • Objective: Find the line that minimizes the sum of the squares of the vertical distances between data points and the line.
  • Calculus and Linear Algebra: Obtaining the least squares regression line requires calculus and linear algebra.
  • Notation:
    • X and Y: Denote the coordinates of the data points.
    • \hat{y} (y-hat): Denotes values computed from the least squares line (fitted values).
  • Fitted Value: The value of y on the least squares line for a particular x.
  • Predicted Value: Another term for the fitted value, emphasizing the use of the line for prediction.
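The slope and intercept described above can be computed directly from deviation sums. A minimal sketch with made-up data (the formulas, not the numbers, are the point):

```python
# Least squares from scratch:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0 = ybar - b1 * xbar
# The data below is invented purely to illustrate the formulas.

def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))
```

Fitted values \hat{y} are then just b0 + b1 * x for each x.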

Ski Jump Example

  • Explanatory Variable: The score from the first jump.
  • Response Variable: The sum of the two jumps.
  • Equation: y = b0 + b1 * x (where b0 is the intercept and b1 is the coefficient for the explanatory variable x).
  • Prediction: To predict the sum of the two jumps for a new first jump score, find the corresponding y value on the line of best fit.
  • Example: If an athlete scored 137 in round one, predict their final score by plugging 137 into the regression equation.
  • Calculation: Put 137 into the equation to get y = 269.36.
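Prediction is just evaluating the fitted line at a new x. The coefficients below are hypothetical, chosen only so that x = 137 reproduces the lecture's prediction of 269.36; the actual ski-jump coefficients were not given in the notes.

```python
# Prediction with a fitted line: y_hat = b0 + b1 * x.
# b0 and b1 are hypothetical values consistent with the
# lecture's result, NOT the real fitted coefficients.
b0, b1 = 50.16, 1.6

def predict(x):
    return b0 + b1 * x

print(round(predict(137), 2))
```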

Implementation

  • Rarely done by hand; statistical software is used in practice.
  • Functions in R:
    • lsfit: Returns the least squares coefficients: the intercept b0 and the slope b1 on x.
    • lm: Linear Model. Fits a regression model; used in the lab.

Extrapolation

  • Definition: Using the line of best fit to predict values outside the range of the data used to create the line.
  • Caution: Should be avoided, as it can lead to inaccurate or nonsensical predictions.

Population of Farmers Example

  • Data: US population of farmers from 1935 to 1980.
  • Trend: The population of farmers decreases over time.
  • Problem: Using the line to predict the farmer population in 2020 or 2050 gives a negative number, which is not realistic.
  • Illustration: Plugging these years into the fitted equation shows why extrapolated predictions can be nonsensical.
  • Guideline: Only use the regression line to make predictions within or just slightly beyond the range of the original data.
  • Lab Test Questions: Pay attention to whether it's appropriate to use the line for a given x value.
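The farmers example can be sketched numerically: fit a decreasing line and evaluate it far outside the data range. The population figures below are invented for illustration, not the lecture's actual data.

```python
# Extrapolation danger: a line fitted to 1935-1980 data gives a
# plausible answer inside that range but a negative (impossible)
# farmer population when pushed out to 2050. Numbers are made up.
years = [1935, 1950, 1965, 1980]
farmers = [32.0, 23.0, 12.0, 6.0]   # millions, invented

n = len(years)
xbar = sum(years) / n
ybar = sum(farmers) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(years, farmers)) / \
     sum((x - xbar) ** 2 for x in years)
b0 = ybar - b1 * xbar

print(round(b0 + b1 * 1970, 2))  # inside the data range: plausible
print(round(b0 + b1 * 2050, 2))  # far outside the range: negative
```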

Residuals and R-squared

  • Focus: Understanding residuals and R-squared value.

Checking Linear Regression Assumptions

  • Linearity: Ensure the data is approximately linear before fitting a linear regression.
  • R Value: Don't rely solely on the value of R to determine linearity; plot the data.

Anscombe's Quartet

  • Description: Four datasets with similar summary statistics but very different patterns.
  • Summary Statistics: Similar means, variances, and R values.
  • Importance: Demonstrates the danger of relying solely on summary statistics without visualizing the data.
  • Dataset A: The most linear and suitable for linear regression.
  • Dataset D: Not suitable for linear regression.
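The matching summary statistics can be checked directly. Below are the standard published values for two of Anscombe's datasets (the lecture labels them A–D; the published quartet numbers them I–IV): the shapes differ greatly, yet the means and correlations nearly coincide.

```python
# Anscombe's quartet, datasets I and II (standard published values).
x12 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    xb, yb = mean(x), mean(y)
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

for y in (y1, y2):
    print(round(mean(y), 2), round(corr(x12, y), 3))
```

Both datasets print a mean near 7.5 and r near 0.816, even though only the first is genuinely linear.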

Residuals

  • Definition: The difference between the observed response value and the fitted value (residual = y_i - \hat{y}_i).
  • Purpose: Assessing the appropriateness of a linear model.
  • Good Model: Points should be scattered randomly on both sides of the line with no apparent structure.
  • Bad Model: Model consistently over or underestimating, or exhibiting structure. The randomness of when it overestimates or underestimates is the important piece.
  • Overfitting: Creating a model that fits the data too closely, which can be unhelpful.
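Computing residuals is a one-liner once the fitted line is known. A toy sketch (line and data invented for illustration):

```python
# Residuals: r_i = y_i - y_hat_i (observed minus fitted).
# A good fit shows residuals of mixed sign with no pattern.
b0, b1 = 1.0, 2.0            # hypothetical fitted line
x = [1, 2, 3, 4]
y = [3.2, 4.9, 7.1, 8.8]

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])
```

Here the signs alternate with no structure, which is the pattern a well-fitting line should produce.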

Janka Hardness Test Example

  • Context: Measures the hardness of wood.
  • Variables: Density (x-axis) and log of hardness (y-axis).
  • Goal: Determine if the assumptions of linear regression are met.

Residual Plot

  • Construction: Plot the residuals (vertical axis) against the explanatory variable x (horizontal axis).
  • Zero Line: A residual of zero means the point lies exactly on the line of best fit; the zero line is the ideal around which residuals should scatter.
  • Interpretation: Look for patterns in the residuals.
  • Random Scatter: Indicates a good fit for the linear model.
  • Non-Random Pattern: Suggests a non-linear relationship or other issues with the model.

Example

  • For densities around 30, the line tends to overestimate.
  • For mid-to-high densities, it tends to underestimate.
  • At the highest densities, it slightly overestimates again.

  • Recommendation: If patterns are observed, consider a non-linear model.

  • Plot Appearance: Avoid distorting the plot by squashing the axes.

R-Squared Value

  • Coefficient of Determination: Measures the strength of the regression.
  • Definition: The square of the Pearson correlation coefficient, r^2.
  • Range: Between 0 and 1.
  • Interpretation: The proportion of variation in the response variable (y) that is explained by the explanatory variable (x).
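For simple linear regression, r^2 can equivalently be computed as 1 minus the residual sum of squares over the total sum of squares; both routes agree. A sketch with toy data:

```python
# Two routes to R^2 that coincide for simple linear regression:
#   (1) square of Pearson's r
#   (2) 1 - SS_res / SS_tot  ("variation explained")
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
sxx = sum((a - xb) ** 2 for a in x)
syy = sum((b - yb) ** 2 for b in y)

r = sxy / (sxx * syy) ** 0.5
b1 = sxy / sxx
b0 = yb - b1 * xb
ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

print(round(r ** 2, 6), round(1 - ss_res / syy, 6))  # identical
```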

Temperature and Log Electricity Example

  • R Value: -0.9
  • R-squared Value: 0.81
  • Interpretation: 81% of the variance in log electricity usage is explained by the temperature.

Standardized Values

  • Process: Convert x and y values to standardized values (subtract the mean and divide by the standard deviation).
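One useful consequence of standardizing: after converting both variables to z-scores, the least squares slope equals the correlation r. A sketch with invented data:

```python
# Standardizing: z = (value - mean) / sd. On standardized data the
# least squares slope through the origin equals Pearson's r.
import statistics

x = [1, 2, 3, 4, 5]
y = [2.0, 3.9, 6.1, 8.0, 9.9]

def standardize(v):
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

zx, zy = standardize(x), standardize(y)
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)  # Pearson r
print(round(slope, 6), round(r, 6))  # equal
```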

Outliers

Barbara's Car Example

  • Context: Barbara wants to buy used cars for her catering business.
  • Data: Age vs. price of cars.
  • Outlier: A car that is seven years old but very expensive.
  • Possible Explanations:
    • Typos in the ad.
    • Classic car.
    • Barely used.
  • Impact: The outlier affects the line of best fit and the residuals.
  • Residual Plot: Shows the outlier.
  • Removal: Removing the outlier from the model did not have a noticeable impact on the fit.

Lift Example

  • Data: Maximum number of people vs. maximum weight for lifts.
  • Outliers:
    • Lift with a max of 60 people.
    • Low max number of people with low max weight.
  • Impact: Removing these outliers does not make a big difference to the model; since the points are likely data-entry mistakes, it is reasonable to exclude them.

Rule of Thumb with Outliers

  • If removing the outlier doesn't make a big difference, just leave it in.
  • If it does make a difference, you need a good reason to remove it.
  • Avoid removing it if it represents genuine structure in the data.
  • If you don't have a good reason to remove it, discuss that case and the effect it has on the results.
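The "does it make a difference" check is easy to run: fit the line with and without the suspect point and compare slopes. The car data below is invented, loosely following Barbara's example:

```python
# Outlier impact sketch: same fit with and without one extreme point.
# Ages/prices are invented; the 7-year-old expensive car is the outlier.
def fit(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
         sum((a - xb) ** 2 for a in x)
    return yb - b1 * xb, b1

ages   = [1, 2, 3, 4, 5, 6, 7]
prices = [20, 18, 15, 13, 11, 9, 30]   # thousands; last point is the outlier

full = fit(ages, prices)
trimmed = fit(ages[:-1], prices[:-1])
print(round(full[1], 3), round(trimmed[1], 3))  # slopes differ sharply
```

Here the single outlier flips the sign of the slope, so its removal would need to be justified and discussed.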

Chapter 4: Continuous Random Variables

Terminology

  • True Mean: The average if you were able to check the height of every student at UNSW.
  • Sample Mean: The average height of the people in a sample.
  • Observed Sample Mean: The mean of the sample you actually get.
  • Emphasis: The sample mean is itself a random variable.
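The point that the sample mean is a random variable can be seen by simulation: draw repeated samples from one fixed population and watch the mean change. The "heights" below are fake, for illustration only:

```python
# The sample mean varies from sample to sample, so it is itself
# a random variable; the true mean is a fixed (unknown) number.
import random

random.seed(1)
population = [150 + i % 50 for i in range(10000)]  # fake heights, cm
true_mean = sum(population) / len(population)

sample_means = []
for _ in range(5):
    sample = random.sample(population, 30)
    sample_means.append(sum(sample) / len(sample))

print(round(true_mean, 1))
print([round(m, 1) for m in sample_means])  # varies across samples
```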

Continuous Random Variables

  • Discrete: Takes countably many possible values (you can list them).
  • Continuous: Takes uncountably many possible values (any value in an interval), so the outcomes cannot be listed one by one.

Continuous and Discrete data

  • Daily rainfall in millimeters: Continuous
  • A person's eye color: Categorical, so it is not a (numeric) random variable.
  • Daily maximum C temperature: Continuous
  • Number of houses sold each week: Discrete
  • Final score of one team in a football match: Discrete
  • Flow rate of a river in meters per second at a given depth: Continuous
  • Weight of the next green sea turtle captured in the ocean: Continuous