Stats
Overview of Covariance and Correlation
Covariance Definition
- Covariance is a measure of the joint variability of two random variables, denoted as x and y.
- Calculation involves ordered pairs: for each observation, the deviation of its x coordinate from the mean (x − x̄) is multiplied by the corresponding deviation of its y coordinate from the mean (y − ȳ).
Difference from Standard Deviation
- Covariance can be negative, unlike standard deviation which is always non-negative.
- Positive covariance indicates that both variables increase together, while negative covariance indicates that while one variable increases, the other decreases.
- If no pattern exists between the variables, covariance approaches zero.
Uniqueness of Covariance
- The value of covariance is unbounded: its magnitude varies based on the scale of x and y.
- Variables measured on larger scales yield larger covariance values, while the same relationship measured in smaller units produces a smaller covariance.
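A rough illustration of this scale dependence (the numbers below are made up for the sketch, not taken from the lecture): rescaling one variable rescales the covariance by the same factor, even though the relationship is unchanged.

```python
import numpy as np

# Hypothetical data, chosen only to illustrate scale dependence.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 4.0, 3.5, 5.0])

cov_original = np.cov(x, y)[0, 1]        # sample covariance (ddof = 1)
cov_rescaled = np.cov(x * 100, y)[0, 1]  # same pattern, x in different units

print(cov_original)  # positive: the variables increase together
print(cov_rescaled)  # 100x larger, though the relationship is identical
```

This is exactly why covariance alone cannot measure strength: its magnitude depends on the units of measurement.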
Calculation Example of Covariance
Dataset Examination
- The example works through a small dataset, using each observation's distance from the mean in the x and y coordinates.
Step-by-Step Calculation
- For the first observation:
- Distance from x̄: -1; distance from ȳ: -2.3.
- Multiply: (-1)(-2.3) = 2.3.
- For the second observation:
- Distances: -0.5 (x) and -2.9 (y).
- Multiply: (-0.5)(-2.9) = 1.45.
- The third observation lies exactly at x̄ (distance 0), so its contribution is zero.
- Compute the contributions for the remaining observations and sum them all to get SSxy = 7.5.
Covariance Interpretation
- Covariance is computed as s_xy = SSxy / (n − 1); with n = 5, this gives 7.5 / 4 = 1.875.
- The only insight from covariance is the direction (positive in this case), not its strength.
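The steps above can be sketched in code. The dataset below is hypothetical, constructed so its deviations match the worked values (first two x-deviations of -1 and -0.5, a third x-deviation of 0, and a total SSxy of 7.5):

```python
import numpy as np

# Hypothetical 5-point dataset consistent with the lecture's worked values.
x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([7.7, 7.1, 10.7, 11.5, 13.0])

x_dev = x - x.mean()           # distances from x-bar
y_dev = y - y.mean()           # distances from y-bar
SSxy = np.sum(x_dev * y_dev)   # sum of cross-products: 7.5
s_xy = SSxy / (len(x) - 1)     # sample covariance: 7.5 / 4 = 1.875

print(SSxy, s_xy)
```

The positive sign tells us the variables move together; the magnitude 1.875 by itself says nothing about strength.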
Relationship and Correlation Definition
Properties of Correlation
- Correlation measures both strength and direction of linear relationships between two variables.
- Can be calculated in two equivalent ways:
- From SSxy: r = SSxy / sqrt(SSx × SSy).
- From covariance: r = Cov(X, Y) / (s_x × s_y).
- The correlation is dimensionless, meaning it is not affected by the units used.
- Bounds of correlation: between -1 (perfect negative) and +1 (perfect positive). A value close to zero indicates a weak relationship.
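Both formulas can be checked numerically on a hypothetical dataset (the values below are constructed to be consistent with the SSxy = 7.5 example, not taken verbatim from the lecture):

```python
import numpy as np

# Hypothetical dataset consistent with the SSxy = 7.5 worked example.
x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([7.7, 7.1, 10.7, 11.5, 13.0])

x_dev, y_dev = x - x.mean(), y - y.mean()
SSxy = np.sum(x_dev * y_dev)
SSx, SSy = np.sum(x_dev ** 2), np.sum(y_dev ** 2)

r_from_ss = SSxy / np.sqrt(SSx * SSy)               # r = SSxy / sqrt(SSx * SSy)
cov = SSxy / (len(x) - 1)
r_from_cov = cov / (x.std(ddof=1) * y.std(ddof=1))  # r = cov / (s_x * s_y)

print(r_from_ss, r_from_cov)  # both routes agree; value lies in [-1, 1]
```

The two routes give the same number because dividing covariance by both standard deviations is algebraically the same as dividing SSxy by sqrt(SSx × SSy).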
Practical Use Case
- Correlation alone does not imply causality; a strong correlation does not guarantee a direct cause-and-effect relationship.
- Exploring examples of different correlation magnitudes (0, -0.25, 0.5, etc.) can help determine the strength of linear relationships.
Limitations of Correlation
Non-Linear Relationships
- While correlation indicates the strength of linear relationships, it can be misleading in cases of non-linear relationships (curvilinear data).
- Where the pattern of the data is quadratic, the correlation can come out near zero, incorrectly suggesting no relationship.
Assessment of Weak vs. Strong Relationships
- Always plot the data alongside the correlation statistic so the relationship can be assessed visually.
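A minimal sketch of this pitfall: on a perfectly quadratic, symmetric dataset, the correlation comes out as zero even though the relationship is exact.

```python
import numpy as np

# Perfectly quadratic (non-linear) relationship on symmetric x values.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2   # exact relationship: y = x^2

r = np.corrcoef(x, y)[0, 1]
print(r)  # essentially 0: correlation misses the perfect non-linear pattern
```

A scatter plot of these points would immediately reveal the parabola that the single number r = 0 hides.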
Simple Linear Regression Overview
Concept of Linear Regression
- Linear regression uses one quantitative predictor to describe or predict a quantitative response, forming the equation:
ŷ = β0 + β1 · x, where:
- β0: y-intercept
- β1: slope of the line
- ŷ: predicted value based on x.
Finding Regression Parameters
- Parameters are estimated through the least squares method (minimizing the sum of squared errors):
- Compute the squared differences between the actual responses and the predicted values on the regression line.
- Determine the slope β1 and the intercept β0 from the data's sums of squares (SSxy and SSx).
Application
- Regression can be used to predict relationships in data, such as fuel efficiency as a function of vehicle acceleration. Calculating the SSE allows systematic assessment of model fit.
Example Calculation of Regression Parameters
- Finding Parameters
- Example:
- Given SSxy = 7.5 and SSx, the slope is β1 = SSxy / SSx.
- Once the slope is identified, the intercept follows from β0 = ȳ − β1 · x̄.
- Conclude with a regression equation that characterizes the relationship, checking that the fitted line matches the data's trend.
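The example can be sketched end to end. The dataset below is hypothetical, constructed so that SSxy = 7.5; the value SSx = 2.5 is an assumption of the sketch, since the lecture does not state it.

```python
import numpy as np

# Hypothetical dataset consistent with SSxy = 7.5; SSx = 2.5 is assumed.
x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([7.7, 7.1, 10.7, 11.5, 13.0])

x_dev = x - x.mean()
SSxy = np.sum(x_dev * (y - y.mean()))  # 7.5
SSx = np.sum(x_dev ** 2)               # 2.5

b1 = SSxy / SSx                # slope: 7.5 / 2.5 = 3.0
b0 = y.mean() - b1 * x.mean()  # intercept: y-bar - b1 * x-bar

y_hat = b0 + b1 * x             # predictions on the fitted line
SSE = np.sum((y - y_hat) ** 2)  # sum of squared errors the fit minimizes
print(b1, b0, SSE)
```

Any other line through these points would produce a larger SSE; that minimization is exactly what "least squares" means.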