Introduction to Regression
Overview of Topics to be Discussed
Scatter Plots
Correlation
Bivariate Regression
Transition toward Multiple Regression (more than one independent variable)
Scatter Plots
Definition: Scatter plots visualize the relationship between two continuous variables.
Values are plotted on a grid based on their X-values (independent variable) and Y-values (dependent variable).
Common Usage: Exploratory analysis of observational data before fitting a model.
Example Dataset: 1970s Cars data
X-axis: Car weight (in pounds).
Y-axis: Fuel efficiency (miles per gallon).
Plot example:
Car weighs ~3000 lbs and gets ~23 mpg.
Car weighs ~4000 lbs and gets ~15 mpg.
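The plot described above can be sketched in a few lines of matplotlib. The data points here are illustrative stand-ins, not the actual 1970s cars dataset:

```python
# A minimal sketch of the weight-vs-mpg scatter plot.
# Data points are hypothetical, chosen to resemble the example above.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

weights = [2200, 2600, 3000, 3400, 4000, 4500]  # pounds (illustrative)
mpg = [30, 27, 23, 20, 15, 13]                  # miles per gallon (illustrative)

fig, ax = plt.subplots()
ax.scatter(weights, mpg)
ax.set_xlabel("Car weight (lbs)")        # X-axis: independent variable
ax.set_ylabel("Fuel efficiency (mpg)")   # Y-axis: dependent variable
ax.set_title("Car weight vs. fuel efficiency")
fig.savefig("weight_vs_mpg.png")
```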
Key Factors to Analyze in Scatter Plots:
Shape:
Straight line
Curvilinear
Shapeless blob
Direction:
Positive correlation: As X increases, Y increases.
Negative correlation: As X increases, Y decreases.
Variability:
How closely observations cluster around a pattern.
Identification of outliers: Observations that deviate strongly from a trend.
Shape Analysis:
Straight Line: Positive or negative; uniform relationship.
Example: Increasing weight is tightly associated with decreasing fuel efficiency.
Curvilinear: A non-linear, variable relationship.
Blob: No clear association between variables.
Outliers:
Defined as observations that significantly deviate from the expected pattern.
Examples of analyzing outliers relative to expected patterns.
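One simple way to flag outliers of the kind described above is a z-score rule: mark any observation more than a chosen number of standard deviations from the mean. This is a sketch with illustrative values (the function name and threshold are assumptions, not from the lecture):

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Illustrative life-expectancy series with one sharp dip,
# loosely echoing the 1918 Spanish Flu example from the notes
life_exp = [54, 55, 56, 39, 57, 58, 59, 60]
print(flag_outliers(life_exp))  # the dip to 39 is flagged
```

A residual-based rule (distance from a fitted line) is often preferable once a regression line is available, but the z-score version needs no model.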
Examples of Scatter Plots
Car Weight vs. Fuel Efficiency: Negative relationship confirmed.
Lighter cars tend to display more variability in fuel efficiency.
No significant outliers noted.
Life Expectancy Over Time:
X-axis: Year
Y-axis: Life expectancy at birth
A generally clear positive trend is observed as years progress, with a significant outlier in 1918 due to the Spanish Flu.
Life Expectancy vs. GDP per Capita:
Positive correlation observed.
Identified outlier: Haiti.
Correlation and Correlation Coefficient
Definition: Correlation measures the direction and strength of the relationship between two variables.
Denoted by a lowercase, italic $r$.
Range: -1 to 1
$r = 1$: Perfect positive correlation.
$r = -1$: Perfect negative correlation.
$r = 0$: No correlation.
Implications of Correlation:
Values of $|r|$ close to 1 indicate strong linear associations, while values close to 0 indicate weak ones.
Caution: Correlation coefficient can be highly influenced by outliers, which may distort the perceived strength of the correlation.
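The coefficient and its sensitivity to outliers can both be seen with a short pure-Python implementation of Pearson's $r$ (the function name is an assumption for illustration; the data are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfect positive correlation (close to 1)
print(pearson_r(x, [2, 4, 6, 8, 100]))  # a single outlier weakens the linear fit
```

Replacing one point with an extreme value drags $r$ well below 1, illustrating the caution above.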
Bivariate Regression
Definition: A statistical method that models the relationship between two variables by fitting a linear equation to the observed data.
Key Components of Bivariate Linear Regression:
Dependent Variable (Y): Outcome variable being predicted.
Independent Variable (X): Predictor variable.
Regression Line:
The line drawn to minimize the sum of the squared vertical distances (residuals) of the observed points from the line.
Form of linear equation: $Y = \beta_0 + \beta_1 X + \text{error}$
$\beta_0$: Intercept
$\beta_1$: Slope (indicates the change in Y for each one-unit change in X).
Example of Regression Analysis:
Predicting life expectancy based on year: Plugging in different years to get expected life expectancy values.
Example with car weight affecting fuel efficiency.
Each additional pound of weight ($X$) is associated with a small decrease in mpg ($Y$).
Residuals and Least Squares Estimation
Residuals: Difference between observed and predicted values.
Positive if above the line, negative if below.
Goal: Minimize the sum of squared residuals to ascertain the best fit line.
Ordinary Least Squares (OLS):
Method used to estimate parameters in regression by minimizing the sum of squared residuals.
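For the bivariate case, the OLS estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A minimal sketch (function name is an assumption; the data are a noise-free line chosen so the fit is exact):

```python
def ols_fit(xs, ys):
    """OLS estimates (b0, b1) for the bivariate model Y = b0 + b1*X + error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared X-deviations
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx  # intercept: line passes through (mean X, mean Y)
    return b0, b1

# Points lying exactly on y = 1 + 2x: OLS recovers intercept 1 and slope 2
b0, b1 = ols_fit([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(b0, b1)
```

By construction these estimates minimize the sum of squared residuals described above.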
Correlation vs. Causation
Important Consideration:
Correlation does not imply causation.
Study of non-causal correlations remains important, particularly in social sciences.
Example: Examining correlations between income and happiness, while noting complexities in establishing direct causality.
Understanding the Predictive Power of Regression
R-squared ($R^2$):
Represents the proportion of variance in the dependent variable that can be predicted from the independent variable.
$R^2$ values range from 0 (no explanatory power) to 1 (perfect explanatory power).
Weak Association and Predictive Limitations:
Just knowing X may not provide strong predictive power for Y due to other influencing variables.
Example discussing age and income illustrates that while there is a positive correlation, age alone isn’t a strong predictor of income due to other factors.
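$R^2$ can be computed as $1 - SS_{res}/SS_{tot}$: the residual sum of squares from the fitted line relative to the total variation around the mean of Y. A sketch with made-up noisy data (the function name is an assumption for illustration):

```python
def r_squared(xs, ys):
    """R^2 = 1 - SS_res/SS_tot for a bivariate OLS fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Noisy positive trend: x explains much of the variance in y, but not all
print(r_squared([1, 2, 3, 4, 5], [2, 5, 4, 8, 7]))
```

A value well below 1 here mirrors the age-and-income point: a real positive association can still leave substantial variance unexplained.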
Predictive Validity and Limitations
Limitations on Predictions:
Avoid predicting Y for values of X outside the range of the observed data (extrapolation), e.g., car weights outside realistic ranges.
Identification of Distinct Groups:
In datasets with multiple groups, overall trends may mask significant variations across subsets (e.g., first-year vs. second-year students).
Conclusion and Next Steps
Continued Studies on Regression:
Future sessions will elaborate on regression methodologies and interpretations.
Encouragement for Questions:
Students are encouraged to seek clarification on any points discussed.