Math1041 Lecture Notes: Regression, Residuals, and Continuous Random Variables
Announcements
- Lab test next week: Questions provided in advance, except for one.
- Assignment: Will be emailed by the end of the week; more details in the next lecture.
Least Squares Regression Line
- Objective: Find the line that minimizes the sum of the squares of the vertical distances between data points and the line.
- Calculus and Linear Algebra: Obtaining the least squares regression line requires calculus and linear algebra.
- Notation:
- X and Y: Denote the coordinates of the data points.
- \hat{y} (y-hat): Denotes values computed from the least squares line (fitted values).
- Fitted Value: The value of y on the least squares line for a particular x.
- Predicted Value: Another term for the fitted value, emphasizing the use of the line for prediction.
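The least squares idea above can be sketched numerically. The data below are made up for illustration; the slope b1 and intercept b0 come from the standard least squares formulas.

```python
import numpy as np

# Hypothetical data points (made up for illustration)
x = np.array([120.0, 125.0, 130.0, 135.0, 140.0])
y = np.array([238.0, 246.0, 255.0, 263.0, 272.0])

# Least squares estimates: slope b1 and intercept b0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Fitted values \hat{y}: the values on the least squares line at each x
y_hat = b0 + b1 * x
print(b0, b1)
```

These estimates agree with what a library routine such as `np.polyfit(x, y, 1)` returns, which is why the calculation is rarely done by hand.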
Ski Jump Example
- Explanatory Variable: The score from the first jump.
- Response Variable: The sum of the two jumps.
- Equation: y = b0 + b1 * x (where b0 is the intercept and b1 is the coefficient for the explanatory variable x).
- Prediction: To predict the sum of the two jumps for a new first jump score, find the corresponding y value on the line of best fit.
- Example: If an athlete scored 137 in round one, predict their final score by plugging 137 into the regression equation.
- Calculation: Substituting x = 137 into the equation gives y = 269.36.
Implementation
- Rarely done by hand; statistical software is used in practice.
- Functions in R:
- lsfit: A function that returns the least squares fit; the intercept is b0 and the coefficient on x is b1.
- lm: Linear Model; the function used in the lab.
Extrapolation
- Definition: Using the line of best fit to predict values outside the range of the data used to create the line.
- Caution: Should be avoided, as it can lead to inaccurate or nonsensical predictions.
Population of Farmers Example
- Data: US population of farmers from 1935 to 1980.
- Trend: The population of farmers decreases over time.
- Problem: Using the line to predict the population in 2050 would result in a negative number, which is not realistic.
- The regression equation fitted to this data illustrates why such prediction is bad.
- Prediction for 2020: Plugging the year into the equation gives a negative farmer population.
- Guideline: Only use the regression line to make predictions within or just slightly beyond the range of the original data.
- Lab Test Questions: Pay attention to whether it's appropriate to use the line for a given x value.
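The extrapolation danger can be demonstrated with a sketch. The farmer counts below are made up for illustration (not the lecture's actual data), but they show the same behaviour: a declining linear fit eventually predicts an impossible negative population.

```python
import numpy as np

# Made-up illustration: declining farmer counts (millions) by year
years = np.array([1935.0, 1945.0, 1955.0, 1965.0, 1975.0, 1980.0])
farmers = np.array([32.0, 25.0, 19.0, 12.0, 8.0, 6.0])

# Fit the least squares line (slope first, then intercept)
b1, b0 = np.polyfit(years, farmers, 1)

inside = b0 + b1 * 1970    # within the data range: a plausible value
outside = b0 + b1 * 2050   # far outside the range: a negative, nonsensical count
print(inside, outside)
```

The line itself has no way of knowing that a population cannot be negative, which is why predictions should stay within or only slightly beyond the original data range.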
Residuals and R-squared
- Focus: Understanding residuals and R-squared value.
Checking Linear Regression Assumptions
- Linearity: Ensure the data is approximately linear before fitting a linear regression.
- R Value: Don't rely solely on the value of R to determine linearity; plot the data.
Anscombe's Quartet
- Description: Four datasets with similar summary statistics but very different patterns.
- Summary Statistics: Similar means, variances, and R values.
- Importance: Demonstrates the danger of relying solely on summary statistics without visualizing the data.
- Dataset A: The most linear and suitable for linear regression.
- Dataset D: Not suitable for linear regression.
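The quartet's point can be checked numerically. The values below are the widely published first two of Anscombe's datasets: the first is roughly linear, the second is a clean curve, yet their summary statistics nearly coincide.

```python
import numpy as np

# First two of Anscombe's four datasets (published values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Pearson correlation with the shared x values
r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(y1.mean(), y2.mean())   # nearly identical means
print(r1, r2)                 # nearly identical r values (about 0.816)
```

The numbers alone cannot distinguish the two datasets; only a plot reveals that one is linear and the other is not.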
Residuals
- Definition: The difference between the observed response value and the fitted value (residual = y_i - \hat{y}_i).
- Purpose: Assessing the appropriateness of a linear model.
- Good Model: Points should be scattered randomly on both sides of the line with no apparent structure.
- Bad Model: The model consistently over- or underestimates, or the residuals show structure; whether over- and underestimation occurs at random is the key point.
- Overfitting: Creating a model that fits the data too closely, which can be unhelpful.
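Computing residuals is a one-line operation once the line is fitted. A sketch with made-up data, using the definition above (observed minus fitted):

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the least squares line and compute fitted values
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

residuals = y - y_hat      # observed minus fitted
print(residuals)
print(residuals.sum())     # least squares residuals always sum to ~0
```

Because the residuals sum to zero by construction, what matters for model checking is not their overall size but whether their signs scatter randomly along the line.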
Janka Hardness Test Example
- Context: Measures the hardness of wood.
- Variables: Density (x-axis) and log of hardness (y-axis).
- Goal: Determine if the assumptions of linear regression are met.
Residual Plot
- Construction: Plot the residuals against x (the explanatory variable).
- Zero Line: Represents the ideal situation in which the line of best fit passes through every point (all residuals are zero).
- Interpretation: Look for patterns in the residuals.
- Random Scatter: Indicates a good fit for the linear model.
- Non-Random Pattern: Suggests a non-linear relationship or other issues with the model.
Example
- For densities around 30, the model tends to overestimate.
- At higher densities, it tends to underestimate.
- At the highest densities, it slightly overestimates again.
- Recommendation: If patterns are observed, consider a non-linear model.
- Plot Appearance: Avoid distorting the plot by squashing the axes.
R-Squared Value
- Also called the Coefficient of Determination; measures the strength of the regression.
- Definition: The square of the Pearson correlation coefficient (r).
- Range: Between 0 and 1.
- Interpretation: Represents the percentage of variation in the response variable (y) that is explained by the explanatory variable (x).
- Formula: r^2.
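The "percentage of variation explained" interpretation can be verified directly: squaring the Pearson correlation gives the same number as one minus the ratio of residual variation to total variation. The data below are made up for illustration.

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Fraction of variation in y explained by the regression line
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r_squared = 1 - ss_res / ss_tot
print(r ** 2, r_squared)                # the two quantities agree
```

So r = -0.9 in the electricity example gives r^2 = 0.81 either way: 81% of the variation in y is accounted for by the line.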
Temperature and Log Electricity Example
- R Value: -0.9
- R-squared Value: 0.81
- Interpretation: 81% of the variance in log electricity usage is explained by the temperature.
Standardized Values
- Process: Convert x and y values to standardized values (subtract the mean and divide by the standard deviation).
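One payoff of standardizing is that the regression of standardized y on standardized x has slope exactly r and intercept 0. A sketch with made-up data:

```python
import numpy as np

# Made-up data for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.5, 6.0, 9.0])

# Standardize: subtract the mean, divide by the standard deviation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(zx, zy, 1)
print(slope, r)        # on standardized values the slope equals r
print(intercept)       # and the intercept is 0
```

This follows from the general slope formula b1 = r * (s_y / s_x): standardized values have standard deviation 1, so the ratio drops out and only r remains.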
Outliers
Barbara's Car Example
- Context: Barbara wants to buy used cars for her catering business.
- Data: Age vs. price of cars.
- Outlier: A car that is seven years old but very expensive.
- Possible Explanations:
- Typos in the ad.
- Classic car.
- Barely used.
- Impact: The outlier affects the line of best fit and the residuals.
- Residual Plot: Shows the outlier.
- Removal: Removing it from the model did not have a noticeable impact.
Lift Example
- Data: Maximum number of people vs. maximum weight for lifts.
- Outliers:
- Lift with a max of 60 people.
- A lift with a low max number of people and a low max weight.
- Impact: Removing these outliers does not change the model much; the unusual entries are likely data-entry mistakes.
Rule of Thumb with Outliers
- If removing the outlier doesn't make a big difference, just leave it in.
- If it does make a difference, you need a good reason to remove it.
- Avoid removing it if it represents real structure in the data.
- If you don't have a good reason to remove it, it is worth discussing that case and what effect it has on the results.
Chapter 4: Continuous Random Variables
Terminology
- True Mean: The average if you were able to check the height of every student at UNSW.
- Sample Mean: The average height of the people in a sample.
- Observed Sample Mean: The mean of the sample you actually get.
- Emphasis: The sample mean is a random variable.
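The emphasis can be made concrete with a simulation. The "population" of heights below is simulated (not real UNSW data): every fresh sample of 25 people produces a different observed sample mean, which is exactly what it means for the sample mean to be a random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "population" of heights in cm; its average is the true mean
population = rng.normal(170, 10, size=50_000)
true_mean = population.mean()

# Each sample of 25 people yields a different observed sample mean
sample_means = [rng.choice(population, size=25).mean() for _ in range(5)]
print(true_mean)
print(sample_means)   # varies from sample to sample
```

The true mean is a fixed (but usually unknown) number; the sample mean bounces around it from one sample to the next.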
Continuous Random Variables
- Discrete: Takes countably many values.
- Continuous: Takes uncountably many values; you cannot list them one after another.
Continuous and Discrete data
- Daily rainfall in millimeters: Continuous
- A person's eye color: Categorical; not a (numerical) random variable.
- Daily maximum temperature in °C: Continuous
- Number of houses sold each week: Discrete
- Final score of one team in a football match: Discrete
- Flow rate of a river in meters per second at a given depth: Continuous
- Weight of the next green sea turtle captured in the ocean: Continuous