Introduction to Regression

  • Overview of Topics to be Discussed

    • Scatter Plots

    • Correlation

    • Bivariate Regression

    • Transition toward Multiple Regression

Scatter Plots

  • Definition: Scatter plots visualize the relationship between two continuous variables.

    • Values are plotted on a grid based on their X-values (independent variable) and Y-values (dependent variable).

    • Common Usage: To analyze data from observations.

  • Example Dataset: 1970s Cars data

    • X-axis: Car weight (in pounds).

    • Y-axis: Fuel efficiency (miles per gallon).

    • Plot example:

      • Car weighs ~3000 lbs and gets ~23 mpg.

      • Car weighs ~4000 lbs and gets ~15 mpg.

  • Key Factors to Analyze in Scatter Plots:

    • Shape:

      • Straight line

      • Curvilinear

      • Shapeless blob

    • Direction:

      • Positive correlation: As X increases, Y increases.

      • Negative correlation: As X increases, Y decreases.

    • Variability:

      • How closely observations cluster around a pattern.

      • Identification of outliers: Observations that deviate strongly from a trend.

  • Shape Analysis:

    • Straight Line: Positive or negative; Y changes at a roughly constant rate as X changes.

      • Example: As car weight increases, fuel efficiency decreases, with points clustered tightly around a line.

    • Curvilinear: A non-linear relationship; the rate of change in Y varies across values of X.

    • Blob: No clear association between variables.

  • Outliers:

    • Defined as observations that significantly deviate from the expected pattern.

    • Examples of analyzing outliers relative to expected patterns.
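
To make the reading of shape, direction, and variability concrete, here is a minimal, dependency-free sketch that draws a coarse character-grid scatter plot. The data points are invented for illustration (they only loosely echo the ~3000 lbs / ~23 mpg and ~4000 lbs / ~15 mpg points mentioned above) and are not the actual 1970s cars data; the function name `text_scatter` is hypothetical. A plotting library would normally be used instead.

```python
# Hypothetical (weight in lbs, fuel efficiency in mpg) pairs, invented for
# illustration; they are NOT the actual 1970s cars data.
cars = [(2000, 31), (2400, 28), (2800, 24), (3000, 23),
        (3400, 19), (3800, 17), (4000, 15), (4600, 12)]


def text_scatter(points, width=40, height=12):
    """Return a coarse character-grid scatter plot of (x, y) pairs.

    Assumes the points span more than one distinct X and Y value.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        # Row 0 is the top of the plot, so larger Y values get smaller rows.
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    # The leading "|" stands in for the Y-axis.
    return "\n".join("|" + "".join(line) for line in grid)


plot = text_scatter(cars)
print(plot)
```

With these made-up points the stars run from the upper left to the lower right: a roughly straight-line shape, a negative direction, and little variability around the pattern.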

Examples of Scatter Plots

  • Car Weight vs. Fuel Efficiency: Negative relationship confirmed.

    • Lighter cars tend to display more variability in fuel efficiency.

    • No significant outliers noted.

  • Life Expectancy Over Time:

    • X-axis: Year

    • Y-axis: Life expectancy at birth

    • A generally clear positive trend as years progress, with a significant outlier in 1918 due to the Spanish Flu.

  • Life Expectancy vs. GDP per Capita:

    • Positive correlation observed.

    • Identified outlier: Haiti.

Correlation and Correlation Coefficient

  • Definition: Correlation measures the direction and strength of the relationship between two variables.

    • Denoted as $r$ (lower case italicized).

    • Range: -1 to 1

    • $r = 1$: Perfect positive correlation.

    • $r = -1$: Perfect negative correlation.

    • $r = 0$: No linear correlation.

  • Implications of Correlation:

    • Values of $|r|$ close to 1 (that is, $r$ near 1 or -1) indicate strong linear associations, while values close to 0 indicate weak linear associations.

    • Caution: The correlation coefficient can be highly influenced by outliers, which may distort the perceived strength of the correlation.
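
As a rough illustration of how $r$ is computed and how a single outlier can distort it, the sketch below implements the standard Pearson formula by hand. The car data and the added outlier are invented for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of X and Y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical car weights (lbs) and fuel efficiency (mpg); invented values.
weights = [2000, 2400, 2800, 3000, 3400, 3800, 4000, 4600]
mpg = [31, 28, 24, 23, 19, 17, 15, 12]

r_clean = pearson_r(weights, mpg)

# Add one invented outlier: a heavy car with unusually high fuel efficiency.
r_outlier = pearson_r(weights + [4500], mpg + [35])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

On these made-up numbers, the one added point moves $r$ from very close to -1 to a much weaker negative value, even though the other eight observations are unchanged.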

Bivariate Regression

  • Definition: A statistical method that models the relationship between two variables by fitting a linear equation to the observed data.

  • Key Components of Bivariate Linear Regression:

    • Dependent Variable (Y): Outcome variable being predicted.

    • Independent Variable (X): Predictor variable.

    • Regression Line:

      • The line drawn to minimize the sum of the squared vertical distances (residuals) of the observed points from the line.

    • Form of linear equation: $Y = \beta_0 + \beta_1 X + \text{error}$

      • $\beta_0$: Intercept

      • $\beta_1$: Slope (indicates change in Y for each unit change in X).

  • Example of Regression Analysis:

    • Predicting life expectancy from year: Plug a given year into the fitted equation to get the expected life expectancy for that year.

    • Predicting fuel efficiency from car weight:

      • Each additional pound of weight ($X$) is associated with a lower predicted mpg ($Y$); the slope is negative.
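
The fitting and "plugging in" steps can be sketched as follows, using the standard closed-form least-squares estimates (slope = co-variation of X and Y divided by variation of X; intercept = mean of Y minus slope times mean of X). The data are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical car weights (lbs) and fuel efficiency (mpg); invented values.
weights = [2000, 2400, 2800, 3000, 3400, 3800, 4000, 4600]
mpg = [31, 28, 24, 23, 19, 17, 15, 12]

b0, b1 = fit_line(weights, mpg)

# "Plugging in" a weight gives the predicted fuel efficiency for that weight.
predicted_3000 = b0 + b1 * 3000

print(f"intercept = {b0:.2f}, slope = {b1:.4f} mpg per lb")
print(f"predicted mpg at 3000 lbs = {predicted_3000:.1f}")
```

The slope here is in mpg per pound, so it is a small negative number; multiplying it by 1000 gives the predicted change in mpg per additional 1000 lbs, which is often easier to read.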

Residuals and Least Squares Estimation

  • Residuals: Difference between observed and predicted values (observed minus predicted).

    • Positive if the observation lies above the line, negative if it lies below.

    • Goal: Minimize the sum of squared residuals to find the best-fit line.

  • Ordinary Least Squares (OLS):

    • Method used to estimate parameters in regression by minimizing the sum of squared residuals.
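
A small sketch of what "minimizing the sum of squared residuals" means in practice: compute the residuals for the least-squares line, then check that nudging the intercept or slope away from the least-squares estimates only increases the sum. The data are invented for illustration, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed - predicted)^2 for the line Y = intercept + slope * X."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical car weights (lbs) and fuel efficiency (mpg); invented values.
weights = [2000, 2400, 2800, 3000, 3400, 3800, 4000, 4600]
mpg = [31, 28, 24, 23, 19, 17, 15, 12]

b0, b1 = fit_line(weights, mpg)

# Residual = observed minus predicted: positive above the line, negative below.
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, mpg)]
ssr_ols = sum_squared_residuals(weights, mpg, b0, b1)

# Any other line should have a larger sum of squared residuals.
ssr_shifted = sum_squared_residuals(weights, mpg, b0 + 1.0, b1)
ssr_steeper = sum_squared_residuals(weights, mpg, b0, b1 * 1.05)

print("residuals:", [round(e, 2) for e in residuals])
print("SSR (least squares):", round(ssr_ols, 3))
print("SSR (intercept + 1):", round(ssr_shifted, 3))
print("SSR (slope * 1.05): ", round(ssr_steeper, 3))
```

A side effect of the least-squares fit with an intercept is that the residuals sum to zero (up to floating-point rounding), so positive and negative residuals balance out.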

Correlation vs. Causation

  • Important Consideration:

    • Correlation does not imply causation.

    • Study of non-causal correlations remains important, particularly in social sciences.

    • Example: Examining correlations between income and happiness, while noting complexities in establishing direct causality.

Understanding the Predictive Power of Regression

  • R-squared ($R^2$):

    • Represents the proportion of variance in the dependent variable that can be predicted from the independent variable.

    • $R^2$ values range from 0 (no explanatory power) to 1 (perfect explanatory power).

  • Weak Association and Predictive Limitations:

    • Just knowing X may not provide strong predictive power for Y due to other influencing variables.

    • Example: Age and income are positively correlated, but age alone isn’t a strong predictor of income because many other factors also influence income.
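
To illustrate the contrast between strong and weak predictive power, the sketch below computes $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of Y. Both datasets are invented for illustration (the age and income values in particular are made up), and `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line: 1 - (residual SS) / (total SS)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical car weights (lbs) and fuel efficiency (mpg); invented values.
weights = [2000, 2400, 2800, 3000, 3400, 3800, 4000, 4600]
mpg = [31, 28, 24, 23, 19, 17, 15, 12]

# Hypothetical ages (years) and incomes (thousands); invented values with a
# positive but noisy relationship.
ages = [22, 30, 35, 41, 48, 55, 60, 64]
incomes = [30, 75, 40, 95, 50, 60, 110, 45]

r2_cars = r_squared(weights, mpg)
r2_income = r_squared(ages, incomes)

print(f"R^2, mpg from weight: {r2_cars:.3f}")
print(f"R^2, income from age: {r2_income:.3f}")
```

In bivariate regression, $R^2$ equals the square of the correlation coefficient $r$, so a modest correlation translates into an even smaller share of explained variance.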

Predictive Validity and Limitations

  • Limitations on Predictions:

    • Avoid predicting Y for values of X that are outside the range of the available data (extrapolation), e.g., predicting fuel efficiency for car weights far outside the observed range.

  • Identification of Distinct Groups:

    • In datasets with multiple groups, overall trends may mask significant variations across subsets (e.g., first-year vs. second-year students).
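
Both limitations can be sketched in a few lines. The example below fits a line to each group separately and to the pooled data, and wraps prediction in a range check that refuses to extrapolate. The X and Y values are invented placeholder measurements (the notes do not say which variables were compared for the student groups), and `fit_line` and `predict_in_range` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_in_range(x_new, xs, intercept, slope):
    """Predict Y at x_new, refusing to extrapolate beyond the observed X range."""
    if not min(xs) <= x_new <= max(xs):
        raise ValueError(
            f"x={x_new} is outside the observed range [{min(xs)}, {max(xs)}]"
        )
    return intercept + slope * x_new


# Hypothetical measurements for two groups of students; values are invented.
first_year_x, first_year_y = [1, 2, 3, 4], [60, 62, 64, 66]
second_year_x, second_year_y = [6, 7, 8, 9], [40, 42, 44, 46]

_, slope_first = fit_line(first_year_x, first_year_y)
_, slope_second = fit_line(second_year_x, second_year_y)

pooled_x = first_year_x + second_year_x
pooled_y = first_year_y + second_year_y
pooled_intercept, slope_pooled = fit_line(pooled_x, pooled_y)

print("slope within first-year group: ", slope_first)
print("slope within second-year group:", slope_second)
print("slope for pooled data:         ", slope_pooled)
```

In this constructed example the relationship is positive within each group but negative in the pooled data, because the groups differ in their overall levels. Calling `predict_in_range` with an X value beyond the observed range raises a `ValueError` instead of returning an extrapolated prediction.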

Conclusion and Next Steps

  • Continued Studies on Regression:

    • Future sessions will elaborate on regression methodologies and interpretations.

  • Encouragement for Questions:

    • Students are encouraged to seek clarification on any points discussed.