Two Quantitative Variables and Regression – Comprehensive Study Notes

1. Introduction

  • Big idea: comparing groups and relationships between quantitative variables. Distinctions touched on in the transcript:
    • Within groups vs across groups: differences across groups are themselves a characteristic to study, not just variation within a single group.
    • Quantitative vs quantitative comparisons: relationships are described verbally (direction, form, strength), shown with scatter plots, and summarized by correlation when the relationship is linear.
    • Recap from prior: correlation can be described verbally (direction, form, strength, outliers) and computed for linear relationships.
    • Context for today: a deeper dive into the two quantitative variable scenario (bivariate data).
  • Class logistics and signposts discussed:
    • Upcoming midterm in two weeks.
    • Each module has slides with signposts:
      • A slide at the start telling you what you’re about to learn.
      • A mid-module slide summarizing what’s been learned so far.
      • An end-of-module summary slide.
    • These signpost slides help with studying and with identifying which slides to focus on for homework.
    • Example of signpost use: Homework 2.2 is between the blue slide 21 and blue slide 35.
  • New goal for today: go beyond verbal descriptions to predicting with a line of best fit (regression line) for two quantitative variables.
    • Use the line of best fit to predict the response for a given explanatory value (e.g., if a car weighs X pounds, what is its miles-per-gallon?).
    • The regression line is the standard, agreed-upon way of choosing a best-fit line.
  • Key concept: residuals and prediction error
    • Prediction error is the vertical distance between actual y and predicted ŷ on the regression line.
    • Residual for a data point i: e_i = y_i - \hat{y}_i
    • Positive residual means the data point is above the line; negative means below.
    • Why minimize residuals? We want predictions to be as accurate as possible (minimize error).
    • Why not just sum residuals? If you sum all residuals, the positives and negatives cancel, giving zero. So we minimize something else.
    • The standard approach: square residuals and minimize the sum of squared residuals (Least Squares).
    • This is why it’s called the "least squares" method: we square then minimize.
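  To see the cancellation concretely, here is a minimal pure-Python sketch with made-up numbers (not the lecture's data): fit the least-squares line using the closed-form slope and intercept formulas, then check that the residuals sum to essentially zero while their squares do not cancel.

```python
# Illustrative sketch with hypothetical data: the residuals of a least-squares
# fit always sum to (essentially) zero, which is why we minimize the SUM OF
# SQUARED residuals rather than the plain sum.
x = [1, 2, 3, 4, 5]            # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # hypothetical responses

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares slope and intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(sum(residuals), 10))                 # ~0: positives and negatives cancel
print(round(sum(e ** 2 for e in residuals), 4))  # squared sum stays positive
```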
  • Notation for the regression line in simple linear regression:
    • The line is written as: \hat{Y} = a + bX
    • Here, a is the intercept (the predicted Y when X = 0); b is the slope (the change in predicted Y for a one-unit change in X).
    • In statistics, a and b are sample estimates (sometimes written \hat{a}, \hat{b}); they estimate the population parameters \alpha and \beta.
    • \hat{Y} (with a hat) denotes the predicted (fitted) response.
    • The slope corresponds to the coefficient of the explanatory variable in the regression table.
  • Practical note on interpretation of the intercept:
    • The intercept is often not meaningful by itself if X = 0 is outside the observed data range.
    • Interpreting the intercept only makes sense when X = 0 is a plausible value of the explanatory variable in context (e.g., zero beers plausibly corresponds to a blood alcohol content of zero).
  • Example setup described in the lecture: using car weight (X) and miles per gallon (Y).
    • Data range example: weight values observed roughly between 2500 and 6500 (pounds).
    • Regression practice: data were put into Excel to run regression; the output includes an intercept and a slope (coefficients) that define the regression line.
    • The displayed regression table labels may vary (e.g., "Intercept" or "Constant" for the intercept; the other row for the explanatory variable may be labeled with the variable name or its coefficient).
    • Important practical note: later modules introduce standard error, t-statistics, and p-values; ignore these outputs for now and focus on the intercept and slope.
  • Summary takeaway for Section 1:
    • Two quantitative variables can be described with a line of best fit to predict one from the other.
    • The regression line is chosen to minimize squared residuals (the least-squares criterion).
    • The regression equation has an intercept and a slope; interpretation depends on the context and data range.

2. Line of Best Fit

  • Core idea: predicting the response using the line of best fit (regression line).
  • Regression line formula: \hat{Y} = a + bX where:
    • a is the intercept (predicted Y when X=0).
    • b is the slope (change in Y for a one-unit change in X).
    • \hat{Y} is the predicted response.
  • Why we need a standard method to determine the line: hand-drawn lines would vary, so we use the least-squares criterion to pick the "best" line.
  • Two key data concepts:
    • Predict the response for a specific explanatory value: e.g., given a car weight, predict mpg.
    • The range of observed X matters for prediction; predictions outside the observed range are extrapolations and can be unreliable.
  • Regression output in practice:
    • In Excel (regression tool), the output includes a regression table with a row for the intercept (often labeled "Intercept" or "Constant") and a row for the slope (the explanatory variable coefficient).
    • The regression table outputs include many statistics (standard error, t-statistics, p-values, etc.), which will be covered later in modules 7 and 8; for now, focus on the intercept and slope and the resulting equation.
  • Worked example from the lecture (car data):
    • Intercept (a) \approx 45.645
    • Slope (b) \approx -0.005 (MPG per pound of weight)
    • The example uses these values to form the equation for predicting MPG from weight.
  • Intuition check: with an intercept around 45.6 MPG and a slope of -0.005 MPG per pound, predicted MPG drops by about 5 for every additional 1,000 pounds, so heavier cars tend to have lower MPG; the intercept itself should not be interpreted literally, since a weight of zero is far outside the observed values.
  • Important concepts introduced:
    • Y-hat is the predicted response; X is the explanatory variable; the axis orientation is always: X on the horizontal, Y on the vertical.
    • The slope tells you the rate of change in the predicted response per unit change in X.
    • The range of X observed in the dataset defines the valid domain for predictions; outside this range, predictions are extrapolations.
  • Practice takeaway: learn to read the regression table, identify intercept and slope, and form the predictive equation. Ignore standard errors and p-values for now.
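  As a sketch, the lecture's estimated coefficients can be turned into a small predictive function; the helper name `predict_mpg` and the range check are my additions, not part of the lecture:

```python
# Forming the predictive equation MPG-hat = a + b * weight from the lecture's
# car-data coefficients, with a simple extrapolation warning.
A = 45.645   # intercept from the regression table
B = -0.005   # slope: change in predicted MPG per additional pound

X_MIN, X_MAX = 2500, 6500  # observed weight range in the example data

def predict_mpg(weight):
    """Predicted miles per gallon for a car of the given weight (pounds)."""
    if not (X_MIN <= weight <= X_MAX):
        print(f"warning: {weight} lbs is outside the observed range; "
              "this prediction is an extrapolation")
    return A + B * weight

print(round(predict_mpg(3970), 3))   # 25.795, within the observed range
predict_mpg(10000)                   # triggers the extrapolation warning
```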

3. Fitting the Line

  • Interpret the intercept and slope in context:
    • Intercept: the predicted Y when X = 0. Often nonsensical if 0 is not a plausible value for X in the data (e.g., predicting MPG from weight). It is mainly a mathematical component that helps fit the line.
    • Slope: the change in the predicted response for a one-unit change in the explanatory variable; here, the relationship between weight and MPG.
  • Remember the data context:
    • The response variable is the Y-axis (e.g., Miles per Gallon).
    • The explanatory variable is the X-axis (e.g., car weight).
    • Observed data range in the example: X values ranged roughly from about 2500 to 6500.
  • Predicting a specific data point:
    • Given an observed (x, y) pair, compute the predicted ŷ from the regression equation and then the residual: e = y - \hat{y}.
    • Example from the lecture: an actual car with weight around 3970 and actual mpg y = 21; predicted mpg from the line is ŷ \approx 25.795; residual is:
    • Residual: e = y - \hat{y} = 21 - 25.795 = -4.795
    • The residual is negative because the actual mpg is below the predicted value by the line at that weight.
  • Concept check:
    • If a point lies above the regression line, its residual is positive; if below, residual is negative.
    • This residual reflects the error in the prediction for that data point.
  • Exercise: Students are invited to practice computing predicted values and residuals for given data points.
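  The lecture's residual computation can be reproduced in a few lines of Python:

```python
# Reproducing the lecture's residual example: a car weighing about 3970 lbs
# with an actual MPG of 21, using the fitted intercept and slope.
a, b = 45.645, -0.005        # intercept and slope from the car-data regression

x_obs, y_obs = 3970, 21      # observed (weight, MPG) pair
y_hat = a + b * x_obs        # predicted MPG from the line
residual = y_obs - y_hat     # e = y - y-hat

print(round(y_hat, 3))       # 25.795
print(round(residual, 3))    # -4.795: the point sits below the line
```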

4. Data Conditions (LINE Criteria)

  • The regression line (line of best fit) relies on certain data conditions for trustworthiness. The acronym LINE helps remember them:
    • L = Linear: the relationship between X and Y is approximately linear (not curved).
    • I = Independent: observations are independent of one another; this usually comes from a random data-generation process, such as random sampling from a large population.
    • N = Normal: at each value of X, the response (equivalently, the residuals) is approximately normally distributed around the line (picture the bell curve rotated to vertical).
    • E = Equal variance: the spread of the residuals (vertical spread) is roughly constant across all levels of X (homoscedasticity).
  • Visualization intuition:
    • If X is height and Y is wingspan, most people with a given height have wingspans around a central value with some variability; the data should show a roughly linear pattern with constant spread around the line.
    • The visual analogy shows a vertical normal distribution rotated to illustrate the constant spread around the regression line.
  • How these conditions show up in practice:
    • L: a roughly straight-line pattern in a scatter plot.
    • I: random scatter around the line in a dataset drawn from a larger population.
    • N: the distribution of residuals around zero is roughly bell-shaped when considering the errors at a given X.
    • E: the vertical spread of points around the line is similar across all X values (no funnel or fan shape).
  • Practical note: In real data, you often won’t see perfect LINE conditions; some violations are common. The goal is to assess whether the line is still a reasonable approximation given the data collection process and the context.
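  The LINE conditions can be illustrated with a small simulation (my own construction, not from the lecture): draw independent X values, add normal errors with constant spread around a true line, and check that least squares recovers something close to that line.

```python
# Simulating data that satisfies LINE: a Linear trend, Independent draws,
# Normal errors, and Equal (constant) error variance across X.
import random

random.seed(0)  # reproducible draws
true_a, true_b, sigma = 5.0, 2.0, 1.0   # hypothetical "population" line and spread

x = [random.uniform(0, 10) for _ in range(200)]                  # independent X values
y = [true_a + true_b * xi + random.gauss(0, sigma) for xi in x]  # normal, equal-variance errors

# Least-squares fit should land close to the true line
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
print(round(a, 2), round(b, 2))  # close to 5.0 and 2.0
```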

5. Residual Plots

  • Purpose of residual plots: to diagnose whether the regression model is appropriate.
  • What a residual plot shows:
    • Plot residuals e = y - \hat{y} on the vertical axis against the explanatory variable X on the horizontal axis (or sometimes against ŷ). Zero residuals lie at the center.
    • A residual of zero occurs for points exactly on the regression line.
  • Ideal residual plot characteristics:
    • Residuals look random (no obvious patterns) and centered around zero.
    • The spread of residuals is roughly constant across X (no changing variance).
  • Interpretation of pattern violations (as shown in the slide exercise in the lecture):
    • Curved pattern indicates nonlinearity (violation of LINE for the linear model).
    • Funnel or increasing/decreasing spread indicates non-constant variance (violation of E).
    • Clustering or gaps may indicate non-independence or outliers.
    • Outliers are points far from the rest of the data and may indicate data entry errors or unusual cases; they require investigation.
  • Outliers and influence:
    • Outliers: data points with large residuals; may or may not be influential.
    • The exercise emphasized identifying which residual plots violate the conditions and noting that I (independence) is typically not visible in a plot but is addressed by understanding data collection.
  • Practical note: in real data, perfect LINE conditions are rare; residual plots help assess the adequacy of the model and signal potential issues.
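  One crude numerical stand-in for eyeballing a residual plot (my construction, not the lecture's) is to compare the residual spread over the lower and upper halves of X; a much larger spread on one side signals a funnel shape, violating the equal-variance condition:

```python
# Detecting a funnel shape numerically: made-up data whose noise grows with x,
# so the residual spread for large x should dwarf the spread for small x.
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.05, 3.95, 6.10, 7.90, 11.50, 10.50, 16.50, 13.50]

# Least-squares fit and residuals
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

spread_low = statistics.stdev(residuals[: n // 2])   # spread for small x
spread_high = statistics.stdev(residuals[n // 2 :])  # spread for large x
print(round(spread_low, 2), round(spread_high, 2))   # low << high here
```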

6. Lurking Variables and R-squared

  • Lurking variables: other variables not included in the model can influence the response, leading to residual patterns or mis-specified models.
    • Example in diamond pricing: if carat weight (size) is the only predictor, then color, clarity, cut, etc. are lurking variables that also influence price.
    • If you include multiple predictors and possibly nonlinear fits, the residual pattern can improve, but the simple two-variable line might be too simplistic for some datasets.
  • When multiple predictors are considered:
    • The goal is to understand how much each predictor contributes to explaining the response.
    • A common summary measure is R-squared (R^2): the proportion of variation in the response explained by the model.
  • Relationship between correlation and R-squared:
    • R-squared is the square of the correlation coefficient r: R^2 = r^2
    • r can range between -1 and 1, so R^2 ranges between 0 and 1.
    • Higher R^2 means the predictor explains more of the variation in the response.
  • Interpreting R-squared in context (example from the lecture):
    • Two predictors were considered to model a dependent variable (DPI in the notes; the exact label varies by example).
    • One predictor (Work Satisfaction) yielded R^2 \approx 0.709, i.e., about 71% of the variation in the response is explained by this predictor.
    • Another predictor yielded a higher R-squared value of 0.84, indicating it explains more variation than the Work Satisfaction predictor in that comparison (the exact mapping of predictors to numbers depends on the dataset).
  • Practical takeaway:
    • R^2 helps compare which predictor or model explains more of the variation in the response.
    • R^2 alone is not all you need; consider context, model assumptions, and whether multiple predictors would increase explanatory power.
  • Extended ideas hinted in the lecture:
    • When dealing with multiple predictors, a combined model can capture more variation than any single-predictor model.
    • There are many other factors (sleep, mental health, prior GPA, etc.) that could influence a student’s class grade; these illustrate how multiple inputs can affect outcomes.
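  The identity R^2 = r^2 for a single predictor can be verified directly with made-up data (pure Python, no libraries):

```python
# Checking that R-squared from the least-squares fit equals the squared
# correlation r^2 in simple linear regression.
import math

x = [1, 2, 3, 4, 5, 6]               # hypothetical predictor values
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8]   # hypothetical responses with a linear trend

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)       # correlation coefficient

b = sxy / sxx                        # least-squares slope
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / syy            # proportion of variation explained

print(round(r, 4), round(r_squared, 4))  # r_squared matches r ** 2
```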

7. Influential Points and Conclusion

  • Influential points vs outliers:
    • Influential points are a special type of outlier: they are extreme in the horizontal (x) direction, and because of that x-value location they exert a strong pull on the regression line.
    • Not all outliers are influential; an outlier with an extreme y-value but typical x-value might not pull the line much.
  • How to handle influential points:
    • Inspect the data point for potential errors or data quality issues (typos, measurement errors).
    • Refit the regression line with and without the influential point to assess its impact on the line and on R^2.
    • If the point substantially changes the line, investigate its validity and consider whether to keep or remove it (with justification).
  • Demonstrated effect of influential points:
    • A line fitted without the influential point vs with the influential point shows how the point pulls the line toward itself.
    • If the influential point lies on the overall trend, R^2 may increase when included; if it lies far from the trend, R^2 may decrease or the fit may worsen.
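  The with/without comparison can be sketched numerically; the `ols` helper and the data below are hypothetical, not the lecture's car data. The added point (20, 5) is extreme in the x-direction and drags the slope well below the trend of the other points:

```python
# Demonstrating an influential point: refit the line with and without one
# observation whose x-value is far from the rest.
def ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    return y_bar - b * x_bar, b

# Core data follow y ≈ x; the extra point (20, 5) is far out in the x-direction
x_core = [1, 2, 3, 4, 5]
y_core = [1.1, 2.0, 2.9, 4.2, 5.0]

a1, b1 = ols(x_core, y_core)                    # slope near 1 without the point
a2, b2 = ols(x_core + [20], y_core + [5.0])     # influential point included

print(round(b1, 2), round(b2, 2))  # slope drops sharply when the point is included
```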
  • Final takeaway:
    • Understanding and diagnosing residuals, normality, equal variance, and influential points are essential for validating linear regression models and ensuring that predictions are reliable within the data context.

Equations and Key Symbols

  • Regression line (simple linear): \hat{Y} = a + bX
  • Residual for observation i: e_i = y_i - \hat{y}_i
  • Sum of residuals property in ordinary least squares: \sum_i e_i = 0
  • Least squares objective (minimize squared residuals): \min_{a,b} \sum_i (y_i - (a + bX_i))^2
  • Predicted value for a given X: \hat{y} = \hat{a} + \hat{b}X
  • Interpreting R-squared: R^2 = r^2, \quad 0 \le R^2 \le 1
  • Conceptual mapping: linear relationship, independence, normality, equal variance (LINE) for trusting the line
  • Data range example (X): 2500 \le X \le 6500
  • Example numbers from the car MPG data (illustrative):
    • Intercept: a \approx 45.645
    • Slope: b \approx -0.005
    • Predicted MPG at a given weight: use \hat{Y} = a + bX
    • Example residual: for an observed (X, y) = (3970, 21) and predicted \hat{y} = 25.795, e = 21 - 25.795 = -4.795

Notes on How to Study from These Notes

  • Use the section headings to navigate through concepts in order: Introduction, Line of Best Fit, Fitting the Line, Data Conditions (LINE Criteria), Residual Plots, Lurking Variables and R-squared, Influential Points and Conclusion.
  • For each section, focus on the definitions, the purpose of the regression line, how residuals are computed, and what residual plots diagnose.
  • Practice interpreting intercepts and slopes in the context of the data, and remember the importance of the data range when making predictions.
  • Review the LINE criteria and residual plot diagnostics regularly to assess model suitability.
  • When encountering multiple predictors, remember that R^2 helps compare explanatory power, but model context and assumptions remain crucial.