CS 422: Data Mining
Instructor: Vijay K. Gurbani, Ph.D., Illinois Institute of Technology
Email: vgurbani@iit.edu
Linear Regression: Theory
Definition
A statistical process for estimating the relationship among variables.
Response Variable (Dependent Variable, Y): The variable you aim to predict.
Predictor Variable(s) (Independent Variable(s), X): The variable(s) used for prediction.
Widely used for predicting and forecasting outcomes.
Method of Least Squares
The primary technique studied.
There are many other estimation techniques available.
Linear Regression Equation
Simple Linear Regression Equation
Expressed as: Y = β0 + β1X + ε
β0 (Intercept): the predicted value of Y when X is zero.
β1 (Slope): the change in Y for a one-unit change in X.
Model Coefficients (Parameters/Weights): β0 and β1 are estimated from the dataset and then used to predict Y for new values of X.
Residual Analysis
Error Components
Since β0 and β1 are estimates:
There will be some error between the observed response (Y) and the predicted response (Y-hat).
This error is called the Residual (ε): ε = Y - Y-hat.
Residual Sum of Squares (RSS): The sum of the squared residuals over all observations. The goal is to choose the coefficients that minimize RSS.
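To make the RSS definition concrete, here is a minimal sketch that computes it for a candidate line. The function name `rss` and the toy data are made up for illustration and are not from the lecture.

```python
def rss(xs, ys, b0, b1):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical): the points lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = rss(xs, ys, 1.0, 2.0)  # line passes through every point
worse = rss(xs, ys, 0.0, 2.0)    # same slope, intercept off by 1

print(perfect, worse)
```

The second line misses every point by 1, so its RSS is larger. Least squares searches for the (β0, β1) pair with the smallest such value.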
Analytical Derivation
The coefficients can be derived analytically by setting the partial derivatives of RSS with respect to β0 and β1 to zero and solving.
Once the coefficients are obtained, predictions can be made by plugging new values of X into the fitted linear equation.
The residual (ε) accounts for what is missed by the model.
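The analytical solution for simple linear regression has a well-known closed form: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below applies it, assuming toy data and a hypothetical helper name `fit_simple`.

```python
def fit_simple(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Toy data (hypothetical, not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1)
```

A quick sanity check on any least-squares fit with an intercept is that the residuals sum to (approximately) zero.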
Geometric Interpretation
Best Regression Line
The best regression line is one that minimizes the Sum of Squared Residuals (RSS).
Types of Regression
Uni-variate vs. Multi-variate Regression
Uni-variate Regression: Y = β0 + β1X + ε
Multi-variate Regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Extension to multiple independent variables.
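With several predictors, the same least-squares idea can be carried out by solving the normal equations (XᵀX)β = Xᵀy, where X has a leading column of ones for the intercept. The following is a minimal pure-Python sketch under that assumption. The helpers `solve` and `fit_multiple` and the noise-free toy data are hypothetical, and in practice a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Toy data (hypothetical): generated without noise from y = 1 + 2*x1 - 0.5*x2.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]

coeffs = fit_multiple(rows, ys)
print(coeffs)
```

Because the toy responses contain no noise, the fit should recover the generating coefficients up to floating-point error.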
Linear Regression Example
Empirical Advertising Data Example
Data consists of a 200x4 data frame:
Response Variable (Y): Sales (in thousands of units of a product).
Predictors (X): TV, Radio, and Newspaper expenditures (in thousands of $).
Effect of Radio Advertising on Sales
Regression Equation: sales = β0 + β1 × radio = 9.312 + 0.203 × radio
Interpretation: A $1000 increase in radio advertising spending is associated with an average increase of about 203 units in sales (0.203 × 1000 units).
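The interpretation above can be checked by plugging values into the fitted equation. The sketch below uses the coefficients from the slide. The function name `predict_sales` and the radio budgets (20 and 21, in thousands of $) are illustrative assumptions.

```python
# Coefficients from the fitted radio-only model on the slide.
B0, B1 = 9.312, 0.203

def predict_sales(radio):
    """Predicted sales (thousands of units) for a radio budget in thousands of $."""
    return B0 + B1 * radio

base = predict_sales(20.0)    # hypothetical budget: $20,000
bumped = predict_sales(21.0)  # the same budget plus $1,000

# Convert the difference from thousands of units to units.
extra_units = (bumped - base) * 1000
print(base, bumped, extra_units)
```

The difference between the two predictions is the slope, which is where the figure of roughly 203 additional units per extra $1000 comes from.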
All Advertising Media Effect on Sales
Regression Equation: sales = β0 + β1 × TV + β2 × radio + β3 × newspaper = 2.939 + 0.046 × TV + 0.189 × radio - 0.001 × newspaper
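This shows the combined effect of all three advertising media on sales: each coefficient is the average change in sales (in thousands of units) for a $1000 increase in that medium, holding the other two fixed. The newspaper coefficient is close to zero.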
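As a small illustration of using the three-predictor equation, the sketch below evaluates it for one set of budgets. The coefficients are those on the slide. The function name `predict_sales_all` and the budget values are made up for illustration.

```python
# Coefficients from the fitted three-media model on the slide.
COEF = {"intercept": 2.939, "TV": 0.046, "radio": 0.189, "newspaper": -0.001}

def predict_sales_all(tv, radio, newspaper):
    """Predicted sales (thousands of units); budgets are in thousands of $."""
    return (COEF["intercept"]
            + COEF["TV"] * tv
            + COEF["radio"] * radio
            + COEF["newspaper"] * newspaper)

# Hypothetical budgets: $100k TV, $20k radio, $10k newspaper.
pred = predict_sales_all(100.0, 20.0, 10.0)

# Effect of $1,000 more on radio with TV and newspaper held fixed, in units.
radio_effect = (predict_sales_all(100.0, 21.0, 10.0) - pred) * 1000
print(pred, radio_effect)
```

Holding TV and newspaper fixed, an extra $1000 on radio changes the prediction by about 189 units, matching the radio coefficient.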