CS 422: Data Mining

  • Instructor: Vijay K. Gurbani, Ph.D., Illinois Institute of Technology

  • Email: vgurbani@iit.edu

Linear Regression: Theory

Definition

  • A statistical process for estimating the relationship among variables.

    • Response Variable (Dependent Variable, Y): the variable you aim to predict.

    • Predictor Variable(s) (Independent Variable(s), X): the variable(s) used to make the prediction.

  • Widely used for predicting and forecasting outcomes.

Method of Least Squares

  • The primary technique studied here for estimating the model coefficients.

  • Many other estimation techniques exist.

Linear Regression Equation

Simple Linear Regression Equation

  • Expressed as: Y = β0 + β1X + ε

    • β0 (Intercept): the predicted value of Y when X is zero.

    • β1 (Slope): the change in Y for a one-unit change in X.

    • Model Coefficients (Parameters/Weights): β0 and β1 are estimated from the dataset and then used to predict Y for new values of X.

Residual Analysis

Error Components

  • Since β0 and β1 are estimates:

    • There will be some difference between the observed response (Y) and the predicted response (Y-hat).

    • This difference is called the Residual (ε).

    • Residual Sum of Squares (RSS): the sum of the squared residuals. The goal is to choose the coefficients that minimize RSS; minimizing the plain sum of residuals would let positive and negative errors cancel.
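The residual and RSS definitions above can be sketched in a few lines of Python. The four data points and the candidate line (β0 = 0, β1 = 2) are made up for illustration, and the helper names `predict` and `rss` are hypothetical.

```python
# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def predict(b0, b1, xs):
    """Return the predicted responses y_hat = b0 + b1 * x for each x."""
    return [b0 + b1 * xi for xi in xs]

def rss(b0, b1, xs, ys):
    """Residual Sum of Squares: sum of (observed - predicted) squared."""
    y_hat = predict(b0, b1, xs)
    return sum((yi - yh) ** 2 for yi, yh in zip(ys, y_hat))

# Residuals for one candidate line, y_hat = 0 + 2 * x.
residuals = [yi - yh for yi, yh in zip(y, predict(0.0, 2.0, x))]

print("residuals:       ", residuals)
print("sum of residuals:", sum(residuals))
print("RSS:             ", rss(0.0, 2.0, x, y))
```

On this toy data the raw residuals cancel to roughly zero while the RSS is about 0.10, which is why the squared residuals, not the residuals themselves, are summed.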

Analytical Derivation

  • The coefficients can be derived analytically by minimizing RSS using calculus: set the partial derivatives of RSS with respect to β0 and β1 to zero and solve.

  • Once the coefficients are obtained, predictions can be made for new values of X using the derived linear equation.

  • The residual (ε) accounts for what is missed by the model.
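For reference, a sketch of the calculus step for the simple (one-predictor) case. The notation is an assumption, not taken from the slides: m is the number of observations, (x_i, y_i) is the i-th observation, and x-bar and y-bar are the sample means.

```latex
\begin{aligned}
\mathrm{RSS}(\beta_0, \beta_1) &= \sum_{i=1}^{m} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_0} &= -2 \sum_{i=1}^{m} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_1} &= -2 \sum_{i=1}^{m} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{aligned}
```

The first condition says the residuals of the fitted line sum to zero; substituting the resulting expression for β0 into the second condition gives the slope formula.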

Geometric Interpretation

Best Regression Line

  • The best regression line is the one that minimizes the Residual Sum of Squares (RSS), i.e., the sum of the squared vertical distances between the observed points and the line.
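A minimal sketch of this idea, assuming the standard closed-form least-squares solution for one predictor (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)). The data and the function names `fit_simple_ols` and `rss` are made up for illustration. The code fits the line and then checks that nudging either coefficient makes the RSS worse.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def rss(b0, b1, xs, ys):
    """Residual Sum of Squares for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))

# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

b0, b1 = fit_simple_ols(x, y)
best = rss(b0, b1, x, y)
print("fitted line: y_hat = %.3f + %.3f * x, RSS = %.4f" % (b0, b1, best))

# Any small change to the intercept or slope should increase the RSS.
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    nudged = rss(b0 + db0, b1 + db1, x, y)
    print("nudge (%+.1f, %+.1f): RSS = %.4f" % (db0, db1, nudged))
```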

Types of Regression

Uni-variate vs. Multi-variate Regression

  • Uni-variate Regression: Y = β0 + β1X + ε

  • Multi-variate Regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

    • Extends the model to n independent variables, each with its own coefficient.
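As a rough sketch of how the least-squares idea carries over to several predictors, the code below builds the normal equations (XᵀX)β = Xᵀy and solves them with plain Gaussian elimination. This is one common approach, not necessarily the one used in the course; the function names and the noise-free toy data (generated from Y = 1 + 2·X1 + 3·X2) are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit_multiple_ols(rows, ys):
    """Return [b0, b1, ..., bn] minimizing RSS for y = b0 + b1*x1 + ... + bn*xn."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise), for illustration.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

beta = fit_multiple_ols(rows, ys)
print("estimated coefficients [b0, b1, b2]:", beta)
```

Because the toy responses contain no noise, the estimates should recover the generating coefficients up to floating-point error. In practice a numerical library routine would normally be preferred over hand-rolled elimination.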

Linear Regression Example

Empirical Advertising Data Example

  • The data is a 200 × 4 data frame (200 observations, 4 columns):

    • Response Variable (Y): Sales (in thousands of units of a product).

    • Predictors (X): TV, Radio, and Newspaper advertising expenditures (in thousands of dollars).

Effect of Radio Advertising on Sales

  • Regression Equation: sales = β0 + β1 × radio = 9.312 + 0.203 × radio

  • Interpretation: A $1,000 increase in radio advertising spending is associated with an average increase of about 203 units in sales (0.203 × 1,000, since sales are measured in thousands of units).
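The arithmetic behind this interpretation can be checked with a short sketch. The coefficients are the ones from the regression equation above; the $20,000 starting budget and the function name are hypothetical.

```python
# Coefficients from the fitted radio-only equation in the notes.
B0_RADIO = 9.312  # intercept (sales in thousands of units)
B1_RADIO = 0.203  # slope (thousands of units per $1,000 of radio spend)

def predict_sales_from_radio(radio_thousands):
    """Predicted sales (thousands of units) for a radio budget in $1,000s."""
    return B0_RADIO + B1_RADIO * radio_thousands

# Hypothetical budgets: $20,000 versus $21,000 of radio advertising.
base = predict_sales_from_radio(20.0)
bumped = predict_sales_from_radio(21.0)

# Convert the difference from thousands of units to units.
extra_units = (bumped - base) * 1000
print("predicted sales at $20k: %.3f thousand units" % base)
print("extra units from +$1,000 of radio: %.0f" % extra_units)
```

The difference does not depend on the starting budget, because the model is linear in the radio variable.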

All Advertising Media Effect on Sales

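  • Regression Equation: sales = β0 + β1 × TV + β2 × radio + β3 × newspaper = 2.939 + 0.046 × TV + 0.189 × radio - 0.001 × newspaper

  • Each coefficient estimates the average change in sales for a $1,000 increase in that medium while the other two are held fixed. Here the newspaper coefficient is close to zero.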
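A small sketch of how the three-predictor equation would be used for prediction. The coefficients come from the equation above; the budget values and the function name are hypothetical.

```python
# Coefficients from the fitted three-predictor equation in the notes.
COEF = {"intercept": 2.939, "TV": 0.046, "radio": 0.189, "newspaper": -0.001}

def predict_sales(tv, radio, newspaper):
    """Predicted sales (thousands of units); budgets are in $1,000s."""
    return (COEF["intercept"]
            + COEF["TV"] * tv
            + COEF["radio"] * radio
            + COEF["newspaper"] * newspaper)

# Hypothetical budgets: $100k TV, $20k radio, $30k newspaper.
pred = predict_sales(100.0, 20.0, 30.0)
print("predicted sales: %.3f thousand units" % pred)

# Effect of +$1,000 on radio with TV and newspaper held fixed, in units.
radio_effect_units = (predict_sales(100.0, 21.0, 30.0) - pred) * 1000
print("extra units from +$1,000 of radio: %.0f" % radio_effect_units)
```

Holding TV and newspaper fixed, the extra $1,000 of radio spending changes the prediction by the radio coefficient alone (about 189 units), which is the "other predictors held fixed" reading of a coefficient in a multi-variate model.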