Math1041 Lecture Notes: Regression, Residuals, and Continuous Random Variables
Announcements
- Lab test next week: Questions provided in advance, except for one.
- Assignment: Will be emailed by the end of the week; more details in the next lecture.
Least Squares Regression Line
- Objective: Find the line that minimizes the sum of the squares of the vertical distances between data points and the line.
- Calculus and Linear Algebra: Obtaining the least squares regression line requires calculus and linear algebra.
- Notation:
- X and Y: Denote the coordinates of the data points.
- \hat{y} (y-hat): Denotes values computed from the least squares line (fitted values).
- Fitted Value: The value of y on the least squares line for a particular x.
- Predicted Value: Another term for the fitted value, emphasizing the use of the line for prediction.
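The least squares idea above can be sketched numerically. The data below are made up for illustration; the slope b1 and intercept b0 come from the standard least squares formulas.

```python
import numpy as np

# Hypothetical data points (made up for illustration)
x = np.array([120.0, 125.0, 130.0, 135.0, 140.0])
y = np.array([238.0, 246.0, 255.0, 263.0, 272.0])

# Least squares estimates: slope b1 and intercept b0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Fitted values \hat{y}: the values on the least squares line at each x
y_hat = b0 + b1 * x
print(b0, b1)
```

These estimates agree with what a library routine such as `np.polyfit(x, y, 1)` returns, which is why the calculation is rarely done by hand.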
Ski Jump Example
- Explanatory Variable: The score from the first jump.
- Response Variable: The sum of the two jumps.
- Equation: y = b0 + b1 * x (where b0 is the intercept and b1 is the coefficient for the explanatory variable x).
- Prediction: To predict the sum of the two jumps for a new first jump score, find the corresponding y value on the line of best fit.
- Example: If an athlete scored 137 in round one, predict their final score by plugging 137 into the regression equation.
- Calculation: Substituting x = 137 into the equation gives y = 269.36.
Implementation
- Rarely done by hand; statistical software is used in practice.
- Functions in R:
- lsfit: A function that returns the least squares fit; the intercept is b0 and the coefficient on x is b1.
- lm: Linear Model; the function used in the lab.
Extrapolation
- Definition: Using the line of best fit to predict values outside the range of the data used to create the line.
- Caution: Should be avoided, as it can lead to inaccurate or nonsensical predictions.
Population of Farmers Example
- Data: US population of farmers from 1935 to 1980.
- Trend: The population of farmers decreases over time.
- Problem: Using the line to predict the population in 2050 would result in a negative number, which is not realistic.
- The regression equation fitted to this data illustrates why such prediction is bad.
- Prediction for 2020: Plugging the year into the equation gives a negative farmer population.
- Guideline: Only use the regression line to make predictions within or just slightly beyond the range of the original data.
- Lab Test Questions: Pay attention to whether it's appropriate to use the line for a given x value.
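The extrapolation danger can be demonstrated with a sketch. The farmer counts below are made up for illustration (not the lecture's actual data), but they show the same behaviour: a declining linear fit eventually predicts an impossible negative population.

```python
import numpy as np

# Made-up illustration: declining farmer counts (millions) by year
years = np.array([1935.0, 1945.0, 1955.0, 1965.0, 1975.0, 1980.0])
farmers = np.array([32.0, 25.0, 19.0, 12.0, 8.0, 6.0])

# Fit the least squares line (slope first, then intercept)
b1, b0 = np.polyfit(years, farmers, 1)

inside = b0 + b1 * 1970    # within the data range: a plausible value
outside = b0 + b1 * 2050   # far outside the range: a negative, nonsensical count
print(inside, outside)
```

The line itself has no way of knowing that a population cannot be negative, which is why predictions should stay within or only slightly beyond the original data range.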
Residuals and R-squared
- Focus: Understanding residuals and R-squared value.
Checking Linear Regression Assumptions
- Linearity: Ensure the data is approximately linear before fitting a linear regression.
- R Value: Don't rely solely on the value of R to determine linearity; plot the data.
Anscombe's Quartet
- Description: Four datasets with similar summary statistics but very different patterns.
- Summary Statistics: Similar means, variances, and R values.
- Importance: Demonstrates the danger of relying solely on summary statistics without visualizing the data.
- Dataset A: The most linear and suitable for linear regression.
- Dataset D: Not suitable for linear regression.
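The quartet's point can be checked numerically. The values below are the widely published first two of Anscombe's datasets: the first is roughly linear, the second is a clean curve, yet their summary statistics nearly coincide.

```python
import numpy as np

# First two of Anscombe's four datasets (published values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Pearson correlation with the shared x values
r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(y1.mean(), y2.mean())   # nearly identical means
print(r1, r2)                 # nearly identical r values (about 0.816)
```

The numbers alone cannot distinguish the two datasets; only a plot reveals that one is linear and the other is not.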
Residuals
- Definition: The difference between the observed response value and the fitted value (residual = y_i - \hat{y}_i).
- Purpose: Assessing the appropriateness of a linear model.
- Good Model: Points should be scattered randomly on both sides of the line with no apparent structure.
- Bad Model: The model consistently over- or underestimates, or the residuals show structure; whether over- and underestimation occurs at random is the key point.
- Overfitting: Creating a model that fits the data too closely, which can be unhelpful.
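Computing residuals is a one-line operation once the line is fitted. A sketch with made-up data, using the definition above (observed minus fitted):

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the least squares line and compute fitted values
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

residuals = y - y_hat      # observed minus fitted
print(residuals)
print(residuals.sum())     # least squares residuals always sum to ~0
```

Because the residuals sum to zero by construction, what matters for model checking is not their overall size but whether their signs scatter randomly along the line.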
Janka Hardness Test Example
- Context: Measures the hardness of wood.
- Variables: Density (x-axis) and log of hardness (y-axis).
- Goal: Determine if the assumptions of linear regression are met.
Residual Plot
- Construction: Plot the residuals against x (the explanatory variable).
- Zero Line: Represents the ideal situation in which the line of best fit passes through every point (all residuals are zero).
- Interpretation: Look for patterns in the residuals.
- Random Scatter: Indicates a good fit for the linear model.
- Non-Random Pattern: Suggests a non-linear relationship or other issues with the model.
Example
- For densities around 30, the model tends to overestimate.
- At higher densities, it tends to underestimate.
- At the highest densities, it slightly overestimates again.
- Recommendation: If patterns are observed, consider a non-linear model.
- Plot Appearance: Avoid distorting the plot by squashing the axes.
R-Squared Value
- Also called the Coefficient of Determination; measures the strength of the regression.
- Definition: The square of the Pearson correlation coefficient (r).
- Range: Between 0 and 1.
- Interpretation: Represents the percentage of variation in the response variable (y) that is explained by the explanatory variable (x).
- Formula: r^2.
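The "percentage of variation explained" interpretation can be verified directly: squaring the Pearson correlation gives the same number as one minus the ratio of residual variation to total variation. The data below are made up for illustration.

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Fraction of variation in y explained by the regression line
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r_squared = 1 - ss_res / ss_tot
print(r ** 2, r_squared)                # the two quantities agree
```

So r = -0.9 in the electricity example gives r^2 = 0.81 either way: 81% of the variation in y is accounted for by the line.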
Temperature and Log Electricity Example
- R Value: -0.9
- R-squared Value: 0.81
- Interpretation: 81% of the variance in log electricity usage is explained by the temperature.
Standardized Values
- Process: Convert x and y values to standardized values (subtract the mean and divide by the standard deviation).
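One payoff of standardizing is that the regression of standardized y on standardized x has slope exactly r and intercept 0. A sketch with made-up data:

```python
import numpy as np

# Made-up data for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.5, 6.0, 9.0])

# Standardize: subtract the mean, divide by the standard deviation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(zx, zy, 1)
print(slope, r)        # on standardized values the slope equals r
print(intercept)       # and the intercept is 0
```

This follows from the general slope formula b1 = r * (s_y / s_x): standardized values have standard deviation 1, so the ratio drops out and only r remains.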
Outliers
Barbara's Car Example
- Context: Barbara wants to buy used cars for her catering business.
- Data: Age vs. price of cars.
- Outlier: A car that is seven years old but very expensive.
- Possible Explanations:
- Typos in the ad.
- Classic car.
- Barely used.
- Impact: The outlier affects the line of best fit and the residuals.
- Residual Plot: Shows the outlier.
- Removal: Removing it from the model did not have a noticeable impact.
Lift Example
- Data: Maximum number of people vs. maximum weight for lifts.
- Outliers:
- Lift with a max of 60 people.
- A lift with a low max number of people and a low max weight.
- Impact: Removing these outliers does not change the model much; the unusual entries are likely data-entry mistakes.
Rule of Thumb with Outliers
- If removing the outlier doesn't make a big difference, just leave it in.
- If it does make a difference, you need a good reason to remove it.
- Avoid removing it if it represents real structure in the data.
- If you don't have a good reason to remove it, it is worth discussing that case and what effect it has on the results.
Chapter 4: Continuous Random Variables
Terminology
- True Mean: The average if you were able to check the height of every student at UNSW.
- Sample Mean: The average height of the people in a sample.
- Observed Sample Mean: The mean of the sample you actually get.
- Emphasis: The sample mean is a random variable.
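The emphasis can be made concrete with a simulation. The "population" of heights below is simulated (not real UNSW data): every fresh sample of 25 people produces a different observed sample mean, which is exactly what it means for the sample mean to be a random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "population" of heights in cm; its average is the true mean
population = rng.normal(170, 10, size=50_000)
true_mean = population.mean()

# Each sample of 25 people yields a different observed sample mean
sample_means = [rng.choice(population, size=25).mean() for _ in range(5)]
print(true_mean)
print(sample_means)   # varies from sample to sample
```

The true mean is a fixed (but usually unknown) number; the sample mean bounces around it from one sample to the next.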
Continuous Random Variables
- Discrete: Takes countably many values.
- Continuous: Takes uncountably many values; you cannot list them one after another.
Continuous and Discrete data
- Daily rainfall in millimeters: Continuous
- A person's eye color: Categorical; not a (numerical) random variable.
- Daily maximum temperature in °C: Continuous
- Number of houses sold each week: Discrete
- Final score of one team in a football match: Discrete
- Flow rate of a river in meters per second at a given depth: Continuous
- Weight of the next green sea turtle captured in the ocean: Continuous