Introduction to Statistics
Class 6: Covariance, correlation, and regression
Presented by: Department of Statistics - UC3M
Chapter Overview
Covariance
Correlation
Regression analysis
Recommended reading: Examples of spurious correlations.
Objectives of the Study
Analyze whether abstention rates correlate with average income (data drawn from various sources), focusing on the 2021 Madrid regional elections.
Linear Relationships
Approaching the Relationship
Investigate if abstention rates and income have a linear relationship.
Determine the best-fit line for the data and check how well it fits, to judge whether a linear model is reasonable.
Covariance
Definition
Covariance measures the direction of the linear relationship between two variables: its sign indicates whether the relationship is positive or negative.
It is computed using data points (x1,y1), (x2,y2), ..., (xn,yn).
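As a minimal sketch of that computation, the snippet below averages the products of the deviations from the two means. It assumes the convention of dividing by n (some texts divide by n - 1 instead), and the data values are hypothetical, not the election data.

```python
def covariance(xs, ys):
    """Covariance of paired data, dividing by n (assumed convention)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of deviations from the two means.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data: average income (euros) and abstention (%) in five districts.
income = [20000, 25000, 30000, 35000, 40000]
abstention = [30.0, 27.0, 24.0, 18.0, 16.0]

cov = covariance(income, abstention)
print(cov)  # negative: higher income goes with lower abstention in this toy data
```

The sign is the informative part here; the magnitude depends on the units of both variables.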
Relationship Interpretation
For positive relationships, data clusters in the top right and bottom left (relative to the point of means), leading to positive covariance.
For negative relationships, data clusters in the top left and bottom right, resulting in negative covariance.
Example calculation: Covariance can yield a value like σ_xy = -24199, which raises the question of how strong this relationship actually is.
Unit of Measurement Impact
Covariance Implications
Measuring in different units can drastically change covariance values (e.g., from -24199 to -0.24199), highlighting that covariance is sensitive to units.
Alternative measures are needed that are not unit-dependent.
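The unit dependence can be checked numerically. The sketch below uses hypothetical data and assumes, purely for illustration, that income is converted from euros to thousands of euros and abstention from percent to a proportion. The slides do not say which units were changed in their example.

```python
def covariance(xs, ys):
    """Covariance of paired data, dividing by n (assumed convention)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data, not the election data.
income_eur = [20000, 25000, 30000, 35000, 40000]
abstention_pct = [30.0, 27.0, 24.0, 18.0, 16.0]

# Same information, different units (assumed conversions).
income_thousands = [x / 1000 for x in income_eur]
abstention_prop = [y / 100 for y in abstention_pct]

cov_original = covariance(income_eur, abstention_pct)
cov_rescaled = covariance(income_thousands, abstention_prop)

# Rescaling x by 1/1000 and y by 1/100 divides the covariance by 100000.
ratio = cov_original / cov_rescaled
print(cov_original, cov_rescaled, ratio)
```

The relationship between the variables is unchanged, yet the covariance shrinks by the product of the two scale factors.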
Correlation
Concept
Correlation is the covariance divided by the product of the two standard deviations: r_xy = σ_xy / (σ_x σ_y). It is unitless, so it does not change when either variable is rescaled.
Example correlation from election data: r_xy = -0.895.
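A short sketch of the correlation coefficient as covariance divided by the product of the standard deviations, using hypothetical data. It also checks that changing units (here, an assumed conversion to thousands of euros and to proportions) leaves the value unchanged.

```python
import math


def correlation(xs, ys):
    """Correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return s_xy / (s_x * s_y)


# Hypothetical data, not the election data.
income_eur = [20000, 25000, 30000, 35000, 40000]
abstention_pct = [30.0, 27.0, 24.0, 18.0, 16.0]

r_original = correlation(income_eur, abstention_pct)
r_rescaled = correlation([x / 1000 for x in income_eur],
                         [y / 100 for y in abstention_pct])
print(r_original, r_rescaled)  # the two values agree up to rounding error
```

Dividing by n or by n - 1 gives the same correlation, as long as the same choice is used in the numerator and the denominator.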
Properties of Correlation
Range: -1 ≤ r_xy ≤ 1.
Interpretation:
r_xy = 1: Perfect positive linear relationship.
r_xy = -1: Perfect negative linear relationship.
r_xy = 0: No linear relationship.
Values closer to 1 or -1 indicate that the data points lie closer to a straight line.
Correlation Examples
Levels of Correlation
Stronger correlation: the data points lie closer to a straight line.
Zero correlation does not imply no relationship: the linear trend can be flat while the variables are still related in a nonlinear way.
High correlation does not guarantee a good fit for a regression line; visual examination of data is essential.
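The point about zero correlation can be illustrated with a small constructed example (hypothetical data): y is fully determined by x through y = x², yet the correlation is zero because the relationship is not linear.

```python
import math


def correlation(xs, ys):
    """Correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return s_xy / (s_x * s_y)


# Constructed example: x symmetric around zero, y an exact function of x.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)  # zero correlation despite a perfect (nonlinear) relationship
```

A scatter plot would reveal the parabola immediately, which is why the data should always be inspected visually.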
Regression Analysis
Predictive Modeling
Regression helps predict values of Y based on X.
For example, abstention rates in hypothetical districts can be predicted by substituting their average income into the regression equation y = a + bx, where a is the intercept and b is the slope.
Selecting the Best Fit Line
Many lines can be drawn through the data points, so a criterion is needed to compare how well each one fits.
Residuals (errors between observed and predicted values) are calculated: r_i = y_i - (a + bx_i).
Objective: Minimize overall error.
Least Squares Regression
Minimizing the sum of squared errors (a quantity similar to a variance) selects the best regression line, called the least squares regression line.
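The least squares criterion can be sketched as follows. The code uses the standard closed-form solution for simple linear regression (slope = covariance of x and y divided by the variance of x, intercept chosen so the line passes through the point of means) and compares its sum of squared residuals with that of an arbitrary alternative line. The data and the alternative line are hypothetical.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared residuals r_i = y_i - (a + b*x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not the election data.
income = [20000, 25000, 30000, 35000, 40000]
abstention = [30.0, 27.0, 24.0, 18.0, 16.0]

a, b = least_squares(income, abstention)
sse_best = sum_squared_residuals(income, abstention, a, b)

# An arbitrary competing line (hypothetical) fits worse by this criterion.
sse_other = sum_squared_residuals(income, abstention, 45.0, -0.0007)
print(a, b, sse_best, sse_other)
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as that of the least squares line.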
Regression Output
Excel Example
The regression summary output reports the R and R² values and the estimated coefficients.
Example equation resulting from analysis: % of abstention = 39.24 - 0.00087 × Average Income.
Predictions from Regression
Practical Application
Predictions can be made for abstention rates based on average income.
Caution: Predictions outside the data range (e.g., incomes of 60,000) yield implausible results (-13% abstention).
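The extrapolation problem can be reproduced directly from the fitted equation above. The function name and the 20,000 test value are illustrative assumptions; the coefficients are the ones reported in the regression output.

```python
def predict_abstention(average_income):
    """Predicted % of abstention from the fitted line in the regression output."""
    intercept = 39.24
    slope = -0.00087
    return intercept + slope * average_income


# An illustrative income value (whether it lies inside the data range is not
# stated in the notes).
in_range_prediction = predict_abstention(20000)

# An income of 60,000 gives a negative percentage, which is impossible.
extrapolated_prediction = predict_abstention(60000)

print(round(in_range_prediction, 2), round(extrapolated_prediction, 2))
```

The line is only a description of the data that produced it, so predictions should be restricted to incomes similar to those observed.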
Coefficient of Determination (R²)
Measure of Fit
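R² is the proportion of the variance in Y that is explained by X; an R² of 80% means the regression line accounts for 80% of the variation in Y, which indicates effective prediction. In simple linear regression R² equals the square of r_xy, so r_xy = -0.895 corresponds to R² ≈ 0.80.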
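A minimal sketch of how R² can be computed, using the common definition R² = 1 - SSE/SST (residual sum of squares over total sum of squares), and a check that it matches the squared correlation in simple linear regression. The data are hypothetical.

```python
import math

# Hypothetical data, not the election data.
xs = [20000, 25000, 30000, 35000, 40000]
ys = [30.0, 27.0, 24.0, 18.0, 16.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

# Least squares line y = a + b*x.
b = s_xy / s_xx
a = mean_y - b * mean_x

# SSE: variation left unexplained by the line. SST: total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sst = s_yy
r_squared = 1 - sse / sst

# Correlation coefficient, for comparison with R².
r = s_xy / math.sqrt(s_xx * s_yy)
print(r_squared, r ** 2)  # the two values agree up to rounding error
```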
Evaluating the Fit
Residuals Analysis
Graphing residuals helps assess regression quality; residuals scattered randomly around zero, with no visible pattern, suggest a good fit.
Patterns in residuals may indicate that simple linear regression is inadequate.
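To illustrate what a pattern in the residuals looks like, the sketch below fits a least squares line to constructed data that actually follow a curve (y = x²). The residuals are positive at both ends and negative in the middle, a systematic U shape rather than random scatter. The data are hypothetical.

```python
# Constructed data following a curve, not a line.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares line y = a + b*x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

# Residuals r_i = y_i - (a + b*x_i), listed in order of increasing x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if r > 0 else "-" for r in residuals]
print(residuals, signs)
```

The residuals still sum to zero, as they always do for a least squares line with an intercept, so the sum alone cannot reveal the problem; the plot of residuals against x can.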
Exercises
Analysis Tasks
Correlation between well-being (Better Life Index) and wealth (GDP per person) - Evaluate various options based on the provided data.
Assess the correlation between SEDA scores and happiness levels from multiple countries - Identify correct statistical relationships.