Introduction to Statistics

  • Class 6: Covariance, correlation, and regression

  • Presented by: Department of Statistics - UC3M

Chapter Overview

  1. Covariance

  2. Correlation

  3. Regression analysis

  • Recommended reading: Examples of spurious correlations.

Objectives of the Study

  • Analyze whether abstention rates correlate with average income, using data from various sources and focusing on the 2021 Madrid regional elections.

Linear Relationships

Approaching the Relationship

  • Investigate if abstention rates and income have a linear relationship.

  • Determine the best-fit line for the data and check whether the fit is reasonable.

Covariance

Definition

  • Covariance measures the direction of the linear relationship between two variables: its sign indicates whether the relationship is positive or negative.

  • It is computed using data points (x1,y1), (x2,y2), ..., (xn,yn).
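
As a minimal sketch of this computation, the Python snippet below takes the covariance to be the average product of the deviations from the two means. The divisor n - 1 (sample covariance) is an assumption, since the slides do not state whether n or n - 1 is used, and the numbers are made-up toy values, not the Madrid election data.

```python
def covariance(xs, ys):
    """Covariance of paired data; the n - 1 divisor is an assumption."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of deviations from the means, then average them.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)


# Hypothetical toy data: income rises while abstention falls.
income = [10, 20, 30, 40, 50]
abstention = [38, 35, 30, 28, 24]

s_xy = covariance(income, abstention)
print(s_xy)  # negative, because the two variables move in opposite directions
```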

Relationship Interpretation

  • For positive relationships, the data cluster in the top right and bottom left of the scatter plot (relative to the point of means), leading to positive covariance.

  • For negative relationships, the data cluster in the top left and bottom right, resulting in negative covariance.

  • Example calculation: Covariance can yield a value like 𝛔_xy = -24199. The sign indicates a negative relationship, but the magnitude alone does not tell us how strong that relationship is.

Unit of Measurement Impact

Covariance Implications

  • Changing the units of measurement can drastically change the covariance (e.g., from -24199 to -0.24199) even though the underlying relationship is the same: covariance is sensitive to units.

  • Alternative measures are needed that are not unit-dependent.
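
The unit sensitivity can be illustrated with a short sketch. The data and the choice of units (euros versus thousands of euros, percentages versus fractions) are hypothetical; the slides do not say which units were changed in the original example. The n - 1 divisor is likewise an assumption.

```python
def covariance(xs, ys):
    """Covariance of paired data; the n - 1 divisor is an assumption."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical toy data in the "original" units.
income_eur = [10000, 20000, 30000, 40000, 50000]
abstention_pct = [38, 35, 30, 28, 24]

# The same data expressed in other units.
income_thousands = [x / 1000 for x in income_eur]
abstention_fraction = [y / 100 for y in abstention_pct]

cov_original = covariance(income_eur, abstention_pct)
cov_rescaled = covariance(income_thousands, abstention_fraction)

# Rescaling x by 1/1000 and y by 1/100 rescales the covariance by 1/100000.
ratio = cov_original / cov_rescaled
print(cov_original, cov_rescaled, ratio)
```

The relationship between the variables is identical in both versions; only the reported number changes.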

Correlation

Concept

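  • Correlation is the covariance divided by the product of the two standard deviations: r_xy = 𝛔_xy / (𝛔_x 𝛔_y). It is unitless, so it describes how two variables relate regardless of the measurement scale.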

  • Example correlation from election data: r_xy = -0.895.
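
A minimal sketch of the correlation coefficient, computed as the covariance divided by the product of the standard deviations (the divisors cancel, so the code works with raw sums). The data are hypothetical toy values, not the election data; the sketch also checks that changing units leaves the result unchanged.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical toy data in two different sets of units.
income_eur = [10000, 20000, 30000, 40000, 50000]
abstention_pct = [38, 35, 30, 28, 24]
income_thousands = [x / 1000 for x in income_eur]
abstention_fraction = [y / 100 for y in abstention_pct]

r_original = correlation(income_eur, abstention_pct)
r_rescaled = correlation(income_thousands, abstention_fraction)
print(r_original, r_rescaled)  # the two values agree up to rounding error
```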

Properties of Correlation

  • Range: -1 ≤ r_xy ≤ 1.

  • Interpretation:

    • r_xy = 1: Perfect positive linear relationship.

    • r_xy = -1: Perfect negative linear relationship.

    • r_xy = 0: No linear relationship.

  • Values closer to 1 or -1 indicate that the data points lie closer to a straight line.

Correlation Examples

Levels of Correlation

  • Stronger correlation: the data points lie closer to a straight line.

  • Zero correlation does not imply no relationship: the linear trend is flat, but the variables can still be related in a nonlinear way.

  • High correlation does not guarantee that a straight line is a good fit; always examine the data visually.
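
To see why a correlation of zero does not rule out a relationship, consider a hypothetical example in which y is completely determined by x through y = x², yet the correlation is zero because the relationship is not linear.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: y depends on x exactly, but through a parabola.
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

r = correlation(x, y)
print(r)  # zero: the positive and negative deviation products cancel
```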

Regression Analysis

Predictive Modeling

  • Regression helps predict values of Y based on X.

  • For example, we can predict the abstention rate of a hypothetical district by plugging its average income into a regression equation: y = a + bx.

Selecting the Best Fit Line

  • Many lines can be drawn through the data points, so we need a way to compare how well each one fits.

  • Residuals (errors between observed and predicted values) are calculated: r_i = y_i - (a + bx_i).

  • Objective: Minimize overall error.

Least Squares Regression

  • The best regression line is the one that minimizes the sum of squared residuals (squaring, as in the variance, keeps positive and negative errors from canceling). It is called the least squares regression line.
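
A minimal sketch of the least squares fit, using the standard closed-form solution for a single predictor: the slope b is the covariance of x and y divided by the variance of x, and the intercept a makes the line pass through the point of means. The data are hypothetical toy values, and the helper names are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared residuals r_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, not the election data.
income = [10, 20, 30, 40, 50]
abstention = [38, 35, 30, 28, 24]

a, b = least_squares_line(income, abstention)
sse_best = sum_squared_residuals(income, abstention, a, b)

# Any other line, such as one with a shifted intercept, has a larger error.
sse_other = sum_squared_residuals(income, abstention, a + 1.0, b)
print(a, b, sse_best, sse_other)
```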

Regression Output

Excel Example

  • Excel's regression output reports summary statistics (such as R and R²) and the estimated coefficients.

  • Example equation resulting from analysis: % of abstention = 39.24 - 0.00087 × Average Income.

Predictions from Regression

Practical Application

  • Predictions can be made for abstention rates based on average income.

  • Caution: Predictions outside the range of the data (extrapolation) can be implausible; e.g., an income of 60,000 gives 39.24 - 0.00087 × 60,000 ≈ -13% abstention.
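
The extrapolation problem can be reproduced with a short sketch that plugs an income into the fitted equation from the Excel example. The coefficients come from the slides; the function names and the plausibility check (a percentage must lie between 0 and 100) are illustrative additions.

```python
# Coefficients of the fitted line reported in the Excel example.
INTERCEPT = 39.24
SLOPE = -0.00087


def predict_abstention(average_income):
    """Predicted abstention (in %) for a given average income."""
    return INTERCEPT + SLOPE * average_income


def is_plausible_percentage(value):
    """A percentage outside [0, 100] cannot be a real abstention rate."""
    return 0.0 <= value <= 100.0


prediction = predict_abstention(60_000)
print(round(prediction, 2), is_plausible_percentage(prediction))
```

The line is only a description of the observed data; nothing constrains it to stay within 0 to 100 once the income leaves the observed range.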

Coefficient of Determination (R²)

Measure of Fit

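  • R² is the proportion of the variance in Y that is explained by the regression on X. In simple linear regression R² = r_xy², so r_xy = -0.895 gives R² ≈ 0.80: about 80% of the variation in Y is accounted for by X.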
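
A minimal sketch of how R² can be computed: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of Y. The toy data are hypothetical; the last lines square the correlation reported earlier in the slides (-0.895), which is consistent with the 80% figure.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line."""
    a, b = least_squares_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)  # total variation
    return 1 - sse / sst


# Hypothetical toy data, not the election data.
income = [10, 20, 30, 40, 50]
abstention = [38, 35, 30, 28, 24]
r2_toy = r_squared(income, abstention)

# In simple linear regression, R² equals the squared correlation.
r_from_slides = -0.895
r2_from_slides = r_from_slides ** 2
print(r2_toy, r2_from_slides)
```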

Evaluating the Fit

Residuals Analysis

  • Plotting the residuals helps assess the quality of the regression: residuals scattered randomly around zero, with no visible pattern, indicate a good fit.

  • Systematic patterns in the residuals (e.g., a curve) suggest that simple linear regression is inadequate.
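
As a sketch of what a problematic residual pattern looks like, the snippet below fits a straight line to hypothetical data that actually follow a parabola. The residuals are positive at both ends and negative in the middle, a systematic shape rather than random scatter. A plot would normally be used; the signs are printed here to keep the example free of plotting libraries.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


# Hypothetical data following a curve, not a line.
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

a, b = least_squares_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Summarize the pattern by the sign of each residual, ordered by x.
signs = "".join("+" if r > 0 else "-" if r < 0 else "0" for r in residuals)
print(residuals, signs)
```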

Exercises

Analysis Tasks

  1. Correlation between well-being (Better Life Index) and wealth (GDP per person) - Evaluate various options based on the provided data.

  2. Assess the correlation between SEDA scores and happiness levels from multiple countries - Identify correct statistical relationships.
