Stats

Overview of Covariance and Correlation

  • Covariance Definition

    • Covariance is a measure of the joint variability of two random variables, denoted as x and y.
    • Calculation involves ordered pairs: for each observation, the distance of its x coordinate from the mean (x̄) is multiplied by the corresponding distance of its y coordinate from ȳ.
  • Difference from Standard Deviation

    • Covariance can be negative, unlike standard deviation which is always non-negative.
    • Positive covariance indicates that both variables tend to increase together, while negative covariance indicates that as one variable increases, the other decreases.
    • If no pattern exists between the variables, covariance approaches zero.
  • Uniqueness of Covariance

    • The value of covariance is unbounded: its magnitude varies based on the scale of x and y.
    • Data measured on larger scales yield larger covariance values, while data with smaller values result in smaller covariance.
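The deviation-product definition above can be sketched in a few lines of Python (the data here are invented for illustration):

```python
# Sample covariance: the scaled sum of products of paired deviations
# from the means. Illustrative data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# SSxy: sum over ordered pairs of (x_i - x̄)(y_i - ȳ)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xy = ss_xy / (n - 1)  # sample covariance
print(s_xy)
```

Here the result is positive, so x and y tend to move in the same direction; its magnitude, however, still depends on the units of x and y.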

Calculation Example of Covariance

  • Dataset Examination

    • The analysis begins with a small dataset; for each observation, the distances of its x and y coordinates from their respective means are examined.
  • Step-by-Step Calculation

    • For the first observation:
    • Distance from x bar: -1, distance from y bar: -2.3.
    • Multiply: (−1) × (−2.3) = 2.3.
    • For the second observation:
    • Distance: -0.5 (x) and -2.9 (y).
    • Multiply: (−0.5) × (−2.9) = 1.45.
    • For the third observation, the x coordinate equals x̄ (distance 0), so its covariance contribution is zero.
    • Compute covariance contributions for the remaining observations and sum to get total SSxy = 7.5.
  • Covariance Interpretation

    • Covariance is computed as s_xy = SSxy / (n − 1) = 7.5 / 4 = 1.875.
    • The only insight from covariance is the direction (positive in this case), not its strength.
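The step-by-step tally above can be reproduced in Python. The first three deviation pairs come from the worked example; the last two pairs are assumed values, chosen only so that SSxy totals 7.5 as in the notes:

```python
# Deviations from the means for each of five observations.
# Pairs 1-3 are from the worked example; pairs 4-5 are assumed
# so that the contributions sum to SSxy = 7.5.
dx = [-1.0, -0.5, 0.0, 0.5, 1.0]   # x_i - x̄
dy = [-2.3, -2.9, 0.2, 2.5, 2.5]   # y_i - ȳ

contributions = [a * b for a, b in zip(dx, dy)]
ss_xy = sum(contributions)          # 2.3 + 1.45 + 0 + ... = 7.5
s_xy = ss_xy / (len(dx) - 1)        # covariance = 7.5 / 4 = 1.875
print(contributions, ss_xy, s_xy)
```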

Relationship and Correlation Definition

  • Properties of Correlation

    • Correlation measures both strength and direction of linear relationships between two variables.
    • Can be calculated in two equivalent ways:
    1. From SSxy: r = SSxy / √(SSx × SSy).
    2. From covariance: r = Cov(X, Y) / (s_X × s_Y).
    • The correlation is dimensionless, meaning it is not affected by the units used.
    • Bounds of correlation: between -1 (perfect negative) and +1 (perfect positive). A value close to zero indicates a weak relationship.
  • Practical Use Case

    • Correlation alone does not imply causality; a strong correlation does not guarantee a direct cause-and-effect relationship.
    • Exploring examples of different correlation magnitudes (0, -0.25, 0.5, etc.) can help determine the strength of linear relationships.
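A quick sketch (with made-up data) confirming that the two formulas for r give the same value:

```python
import math

# Two equivalent routes to the correlation r. Illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Route 1: directly from the sums of squares
r1 = ss_xy / math.sqrt(ss_x * ss_y)

# Route 2: from the covariance and the two standard deviations
s_xy = ss_xy / (n - 1)
s_x = math.sqrt(ss_x / (n - 1))
s_y = math.sqrt(ss_y / (n - 1))
r2 = s_xy / (s_x * s_y)

print(r1, r2)  # identical values, always within [-1, 1]
```

Because the (n − 1) factors cancel, the two routes are algebraically the same; either one is dimensionless.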

Limitations of Correlation

  • Non-Linear Relationships

    • While correlation indicates the strength of linear relationships, it can be misleading in cases of non-linear relationships (curvilinear data).
    • When the pattern of the data is quadratic, the correlation can come out near zero, incorrectly suggesting that no relationship exists.
  • Assessment of Weak vs. Strong Relationships

    • A graphical representation should accompany the correlation statistic so the relationship can also be assessed visually.
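A small demonstration of the quadratic pitfall described above: y is a perfect function of x, yet r comes out to zero.

```python
import math

# Symmetric quadratic data: y depends exactly on x, yet r = 0.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]           # perfect curvilinear relationship

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
r = ss_xy / math.sqrt(ss_x * ss_y)
print(r)  # 0.0 — the linear measure misses the obvious pattern
```

The positive and negative deviation products cancel exactly, which is why plotting the data matters.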

Simple Linear Regression Overview

  • Concept of Linear Regression

    • Linear regression uses one quantitative predictor to describe or predict a quantitative response, forming the equation:
      ŷ = β₀ + β₁x, where:
    • β₀: y-intercept
    • β₁: slope of the line
    • ŷ: predicted value based on x.
  • Finding Regression Parameters

    • Parameters are estimated through the least squares method (minimizing the sum of squared errors):
      • SSE (sum of squares due to error) is the sum of the squared differences between the actual responses and the predictions on the regression line.
    • The slope β₁ and intercept β₀ are determined from summary quantities of the data (such as SSxy and SSx).
  • Application

    • Regression can be used to predict relationships in data, such as fuel efficiency as a function of vehicle acceleration. Calculating SSE provides a systematic check of model fit.
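A minimal sketch of the least-squares computation, using the standard estimates β₁ = SSxy / SSx and β₀ = ȳ − β₁x̄ (illustrative data):

```python
# Least-squares slope, intercept, and SSE for simple linear regression.
# Illustrative data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_x          # slope estimate (β₁)
b0 = y_bar - b1 * x_bar    # intercept estimate (β₀)

# SSE: squared gaps between observed y and the line's predictions
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(b1, b0, sse)
```

By construction, no other choice of slope and intercept gives a smaller SSE for this data.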

Example Calculation of Regression Parameters

  • Finding Parameters
    • Example:
      • Given SSxy = 7.5 and SSx, calculate slope.
      • Once the slope is identified (β₁ = SSxy / SSx), the intercept follows from β₀ = ȳ − β₁ × x̄.
    • The resulting regression equation characterizes the relationship; the fitted line should always be checked against the trend in the data.
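As a sketch of this final step: SSxy = 7.5 matches the earlier covariance example, while SSx and the two means are assumed values, since the notes do not record them.

```python
# Slope and intercept from summary statistics.
ss_xy = 7.5                 # from the worked covariance example
ss_x = 2.5                  # assumed value for illustration
x_bar, y_bar = 3.0, 10.0    # assumed sample means

b1 = ss_xy / ss_x           # slope: 7.5 / 2.5 = 3.0
b0 = y_bar - b1 * x_bar     # intercept: 10.0 - 3.0 * 3.0 = 1.0
print(f"ŷ = {b0} + {b1}x")
```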