Class 6: Covariance, correlation, and regression
Presented by: Department of Statistics - UC3M
Covariance
Correlation
Regression analysis
Recommended reading: Examples of spurious correlations.
Case study: analyzing whether abstention rates correlate with average income (data compiled from various sources), focusing on the 2021 Madrid regional elections.
Investigate whether abstention rates and income have a linear relationship.
Determine the best-fitting line for the data and check how well it fits, to judge whether it is reasonable.
Covariance measures how two variables vary together, indicating whether their relationship is positive or negative.
It is computed from the data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) (see the formula below).
In a positive relationship, the data cluster in the top-right and bottom-left quadrants relative to the means, giving a positive covariance.
In a negative relationship, the data cluster in the top-left and bottom-right quadrants, giving a negative covariance.
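As a formula (the notes do not state which convention is used; the divide-by-n version is shown here, while the sample version divides by n − 1):
σ_xy = (1/n) · Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ), where x̄ and ȳ are the means of the two variables.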
Example calculation: the covariance can come out as a value like σ_xy = -24199, which raises the question of how strong this relationship actually is.
Measuring income in different units drastically changes the covariance (e.g., from -24199 to -0.24199), showing that covariance depends on the units of measurement.
Alternative measures are needed that are not unit-dependent.
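One way to see this unit dependence: multiplying one variable by a constant c multiplies the covariance by the same c, i.e. Cov(c·X, Y) = c·Cov(X, Y). The change from -24199 to -0.24199 corresponds to a factor of 1/100,000, presumably from re-expressing income in much larger monetary units.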
The correlation coefficient is a unitless measure of how two variables are linearly related, and it is invariant to changes of scale.
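Concretely, the correlation coefficient standardises the covariance by the two standard deviations, r_xy = σ_xy / (σ_x · σ_y), which is what removes the dependence on units.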
Example correlation from election data: r_xy = -0.895.
Range: -1 ≤ r_xy ≤ 1.
Interpretation:
r_xy = 1: Perfect positive linear relationship.
r_xy = -1: Perfect negative linear relationship.
r_xy = 0: No linear relationship.
Values closer to 1 or -1 indicate that the data points lie closer to a straight line, i.e., a stronger linear relationship.
Zero correlation does not imply there is no relationship at all; it only means the linear trend is flat, and a strong nonlinear relationship can still have r_xy ≈ 0.
High correlation does not guarantee a good fit for a regression line; visual examination of data is essential.
Regression helps predict values of Y based on X.
For example, abstention rates in hypothetical districts can be predicted from average income using the regression equation y = a + bx.
Many different lines can be drawn through the data points, so their quality of fit must be compared.
Residuals (errors between observed and predicted values) are calculated: r_i = y_i - (a + bx_i).
Objective: Minimize overall error.
Minimizing the sum of squared residuals (squaring the errors, just as the variance squares deviations) selects the best line, called the least squares regression line; its coefficients are given below.
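For reference (standard results, not derived in these notes), the least squares coefficients can be written in terms of the quantities above: slope b = σ_xy / σ_x² and intercept a = ȳ − b·x̄.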
The summary output of the regression analysis reports the correlation and R² values along with the estimated coefficients a and b.
Example equation resulting from analysis: % of abstention = 39.24 - 0.00087 × Average Income.
Predictions can be made for abstention rates based on average income.
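A minimal Python sketch of such a fit, using made-up district data (the income and abstention values below are purely illustrative, not the Madrid election data):

```python
# Sketch: least squares fit of "abstention = a + b * income" on hypothetical data.
import numpy as np

income = np.array([18_000, 22_000, 27_000, 35_000, 41_000], dtype=float)  # euros (invented)
abstention = np.array([22.0, 23.5, 15.0, 10.5, 5.0])                      # percent (invented)

x_bar, y_bar = income.mean(), abstention.mean()
cov_xy = np.mean((income - x_bar) * (abstention - y_bar))   # sigma_xy (divide-by-n version)
var_x = np.mean((income - x_bar) ** 2)                      # sigma_x^2

b = cov_xy / var_x                                          # slope of the least squares line
a = y_bar - b * x_bar                                       # intercept
r = cov_xy / (income.std() * abstention.std())              # correlation coefficient

print(f"fitted line: abstention = {a:.2f} + ({b:.5f}) * income   (r = {r:.3f})")
print(f"predicted abstention for income 30,000: {a + b * 30_000:.1f} %")
```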
Caution: Predictions outside the data range (e.g., incomes of 60,000) yield implausible results (-13% abstention).
The R² value indicates the proportion of the variance in Y explained by X; an R² of 80% indicates that the regression predicts effectively.
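For simple linear regression, R² equals the squared correlation coefficient; this is consistent with the figures quoted here, since r_xy = -0.895 gives R² = (-0.895)² ≈ 0.80, i.e. about 80%.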
Graphing the residuals helps assess regression quality; a random, patternless scatter signifies a good fit (see the sketch below).
Systematic patterns in the residuals may indicate that simple linear regression is inadequate.
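A short sketch of such a residual plot, reusing the same invented district data as above (matplotlib is assumed to be available):

```python
# Sketch: residual plot for a least squares fit on hypothetical data.
# A patternless cloud around zero suggests the linear model is adequate;
# curvature or a funnel shape suggests simple linear regression is not enough.
import numpy as np
import matplotlib.pyplot as plt

income = np.array([18_000, 22_000, 27_000, 35_000, 41_000], dtype=float)  # invented
abstention = np.array([22.0, 23.5, 15.0, 10.5, 5.0])                      # invented

b, a = np.polyfit(income, abstention, deg=1)   # slope and intercept of the least squares line
fitted = a + b * income
residuals = abstention - fitted                # r_i = y_i - (a + b * x_i)

plt.scatter(fitted, residuals)
plt.axhline(0, linewidth=1)                    # reference line at zero
plt.xlabel("Fitted abstention (%)")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values")
plt.show()
```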
Correlation between well-being (Better Life Index) and wealth (GDP per person): evaluate the various options based on the provided data.
Assess the correlation between SEDA scores and happiness levels across multiple countries and identify the correct statistical relationships.