Introduction to Regression
Overview of Topics to be Discussed
Scatter Plots
Correlation
Bivariate Regression
Transition toward Multiple Regression (more than one independent variable)
Scatter Plots
Definition: Scatter plots visualize the relationship between two continuous variables.
Values are plotted on a grid based on their X-values (independent variable) and Y-values (dependent variable).
Common Usage: Exploratory analysis of observational data before fitting a model.
Example Dataset: 1970s Cars data
X-axis: Car weight (in pounds).
Y-axis: Fuel efficiency (miles per gallon).
Plot example:
Car weighs ~3000 lbs and gets ~23 mpg.
Car weighs ~4000 lbs and gets ~15 mpg.
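The plot described above can be sketched in a few lines of matplotlib. The data points here are illustrative stand-ins, not the actual 1970s cars dataset:

```python
# A minimal sketch of the weight-vs-mpg scatter plot.
# Data points are hypothetical, chosen to resemble the example above.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

weights = [2200, 2600, 3000, 3400, 4000, 4500]  # pounds (illustrative)
mpg = [30, 27, 23, 20, 15, 13]                  # miles per gallon (illustrative)

fig, ax = plt.subplots()
ax.scatter(weights, mpg)
ax.set_xlabel("Car weight (lbs)")        # X-axis: independent variable
ax.set_ylabel("Fuel efficiency (mpg)")   # Y-axis: dependent variable
ax.set_title("Car weight vs. fuel efficiency")
fig.savefig("weight_vs_mpg.png")
```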
Key Factors to Analyze in Scatter Plots:
Shape:
Straight line
Curvilinear
Shapeless blob
Direction:
Positive correlation: As X increases, Y increases.
Negative correlation: As X increases, Y decreases.
Variability:
How closely observations cluster around a pattern.
Identification of outliers: Observations that deviate strongly from a trend.
Shape Analysis:
Straight Line: Positive or negative; uniform relationship.
Example: Increasing weight is tightly associated with decreasing fuel efficiency.
Curvilinear: A non-linear, variable relationship.
Blob: No clear association between variables.
Outliers:
Defined as observations that significantly deviate from the expected pattern.
Examples of analyzing outliers relative to expected patterns.
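One simple way to flag outliers of the kind described above is a z-score rule: mark any observation more than a chosen number of standard deviations from the mean. This is a sketch with illustrative values (the function name and threshold are assumptions, not from the lecture):

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Illustrative life-expectancy series with one sharp dip,
# loosely echoing the 1918 Spanish Flu example from the notes
life_exp = [54, 55, 56, 39, 57, 58, 59, 60]
print(flag_outliers(life_exp))  # the dip to 39 is flagged
```

A residual-based rule (distance from a fitted line) is often preferable once a regression line is available, but the z-score version needs no model.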
Examples of Scatter Plots
Car Weight vs. Fuel Efficiency: Negative relationship confirmed.
Lighter cars tend to display more variability in fuel efficiency.
No significant outliers noted.
Life Expectancy Over Time:
X-axis: Year
Y-axis: Life expectancy at birth
A generally clear positive trend is observed as years progress, with a significant outlier in 1918 due to the Spanish Flu.
Life Expectancy vs. GDP per Capita:
Positive correlation observed.
Identified outlier: Haiti.
Correlation and Correlation Coefficient
Definition: Correlation measures the direction and strength of the relationship between two variables.
Denoted by a lowercase, italic $r$.
Range: -1 to 1
$r = 1$: Perfect positive correlation.
$r = -1$: Perfect negative correlation.
$r = 0$: No correlation.
Implications of Correlation:
Values of $|r|$ close to 1 indicate strong linear associations, while values close to 0 indicate weak ones.
Caution: Correlation coefficient can be highly influenced by outliers, which may distort the perceived strength of the correlation.
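The coefficient and its sensitivity to outliers can both be seen with a short pure-Python implementation of Pearson's $r$ (the function name is an assumption for illustration; the data are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfect positive correlation (close to 1)
print(pearson_r(x, [2, 4, 6, 8, 100]))  # a single outlier weakens the linear fit
```

Replacing one point with an extreme value drags $r$ well below 1, illustrating the caution above.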
Bivariate Regression
Definition: A statistical method that models the relationship between two variables by fitting a linear equation to the observed data.
Key Components of Bivariate Linear Regression:
Dependent Variable (Y): Outcome variable being predicted.
Independent Variable (X): Predictor variable.
Regression Line:
The line drawn to minimize the sum of the squared vertical distances (residuals) of the observed points from the line.
Form of linear equation: $Y = \beta_0 + \beta_1 X + \text{error}$
$\beta_0$: Intercept
$\beta_1$: Slope (indicates the change in Y for each one-unit change in X).
Example of Regression Analysis:
Predicting life expectancy based on year: Plugging in different years to get expected life expectancy values.
Example with car weight affecting fuel efficiency.
Each additional pound of weight ($X$) is associated with a small decrease in mpg ($Y$).
Residuals and Least Squares Estimation
Residuals: Difference between observed and predicted values.
Positive if above the line, negative if below.
Goal: Minimize the sum of squared residuals to ascertain the best fit line.
Ordinary Least Squares (OLS):
Method used to estimate parameters in regression by minimizing the sum of squared residuals.
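For the bivariate case, the OLS estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A minimal sketch (function name is an assumption; the data are a noise-free line chosen so the fit is exact):

```python
def ols_fit(xs, ys):
    """OLS estimates (b0, b1) for the bivariate model Y = b0 + b1*X + error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared X-deviations
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx  # intercept: line passes through (mean X, mean Y)
    return b0, b1

# Points lying exactly on y = 1 + 2x: OLS recovers intercept 1 and slope 2
b0, b1 = ols_fit([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(b0, b1)
```

By construction these estimates minimize the sum of squared residuals described above.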
Correlation vs. Causation
Important Consideration:
Correlation does not imply causation.
Study of non-causal correlations remains important, particularly in social sciences.
Example: Examining correlations between income and happiness, while noting complexities in establishing direct causality.
Understanding the Predictive Power of Regression
R-squared ($R^2$):
Represents the proportion of variance in the dependent variable that can be predicted from the independent variable.
$R^2$ values range from 0 (no explanatory power) to 1 (perfect explanatory power).
Weak Association and Predictive Limitations:
Just knowing X may not provide strong predictive power for Y due to other influencing variables.
Example discussing age and income illustrates that while there is a positive correlation, age alone isn’t a strong predictor of income due to other factors.
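$R^2$ can be computed as $1 - SS_{res}/SS_{tot}$: the residual sum of squares from the fitted line relative to the total variation around the mean of Y. A sketch with made-up noisy data (the function name is an assumption for illustration):

```python
def r_squared(xs, ys):
    """R^2 = 1 - SS_res/SS_tot for a bivariate OLS fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Noisy positive trend: x explains much of the variance in y, but not all
print(r_squared([1, 2, 3, 4, 5], [2, 5, 4, 8, 7]))
```

A value well below 1 here mirrors the age-and-income point: a real positive association can still leave substantial variance unexplained.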
Predictive Validity and Limitations
Limitations on Predictions:
Avoid predicting Y for values of X outside the range of the observed data (extrapolation), e.g., car weights outside realistic ranges.
Identification of Distinct Groups:
In datasets with multiple groups, overall trends may mask significant variations across subsets (e.g., first-year vs. second-year students).
Conclusion and Next Steps
Continued Studies on Regression:
Future sessions will elaborate on regression methodologies and interpretations.
Encouragement for Questions:
Students are encouraged to seek clarification on any points discussed.