Intro to Linear Regression
PSTAT 5LS: Statistics for Life Sciences
Introduction
Purpose of Course: Introduces concepts of simple linear regression.
Bivariate Data: Measuring two quantitative variables on the same individual.
Power of Linear Regression: Statistical technique showcasing relationships between variables.
Definition of Linear Relationship
Term: Linear Relationship
Pronunciation: ['li-nē-ǝr ri-'lā-shen-ship]
Definition: A relationship between two variables that, when graphed, is represented by a straight line.
Concept of Variables in Regression
Variables Defined:
Explanatory (Predictor) Variable: Denoted by $x$; specific values denoted by $x_i$.
Represents variables thought to explain changes in another variable.
Response (Outcome) Variable: Denoted by $y$; specific values denoted by $y_i$.
Measures the outcome of the study.
Caution:
Designating one variable as $x$ and the other as $y$ does not imply that $x$ causes $y$. Association does not imply causation.
Alternative terms: Some texts use independent/dependent variables; these terms are avoided here because they have specific meanings in probability.
Scatterplots
Purpose: The initial step to examine relationships between two quantitative variables.
Definition: A graphical display (scatterplot) showcasing individual data points for two variables.
Each point represents an individual in the dataset.
Useful for identifying patterns and deviations from those patterns.
Characteristics to Examine in Scatterplots
Characteristics:
Direction: Positive or Negative
Form: Linear or Non-linear
Strength: Weak, Moderate, or Strong
Outliers: Points deviating significantly from the main pattern
Mnemonic: DOFS (Direction, Outliers, Form, Strength) for remembering these characteristics.
Example Dataset: Palmer Penguins
Dataset Information: Contains data on 333 penguins in the Palmer Archipelago
Focus: Relationship between flipper length (mm) and bill length (mm).
Variable Assignment:
Explanatory Variable: Bill length (x)
Response Variable: Flipper length (y)
Visual Insights with Species Categorization
Including species in the scatterplot can reveal variations in relationships among different species.
Multiple Regression
Concept Introduction: Incorporating species variable indicates a shift to multiple regression.
Note: Detailed analysis of multiple regression will be covered in future courses.
Correlation
Definition: The correlation (denoted by $r$) quantifies the strength and direction of the linear relationship between two quantitative variables, $x$ and $y$.
Analytic Measure: Whereas the standard deviation describes the variability of a single variable, correlation describes how two variables vary together.
Correlation Coefficient Formula
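Calculation:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
where:
$n$ = sample size
$x_i$ = explanatory variable for the i-th observation
$\bar{x}$ = mean of the explanatory variable
$s_x$ = standard deviation of $x$
$y_i$ = response variable for the i-th observation
$\bar{y}$ = mean of the response variable
$s_y$ = standard deviation of $y$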
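The formula can be traced step by step in code. The following is a minimal Python sketch, assuming the sample standard deviation uses the divisor $n - 1$; the function name `correlation` and the measurements are made up for illustration and are not taken from the Palmer Penguins data.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Standardize each observation, multiply the pairs, then sum.
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Made-up illustrative measurements (hypothetical, not real penguin data).
bill_mm = [45.0, 46.5, 47.2, 48.1, 50.3]
flipper_mm = [210.0, 214.0, 213.0, 219.0, 222.0]

r = correlation(bill_mm, flipper_mm)
print(round(r, 3))
```

In practice a statistics package would be used instead; the hand-written version only makes each term of the formula visible.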
Intuition Behind the Correlation Coefficient
Standardization Method: $\frac{x_i - \bar{x}}{s_x}$ standardizes $x_i$, transforming it into a dimensionless value, termed the z-score.
Summation of Products: Summing the products of the standardized values shows how $x$ and $y$ deviate from their means together; a sum that is large in magnitude indicates a strong linear relationship, while a near-zero sum indicates a weak one.
Correlation Coefficient Example - Gentoo Penguins
Calculated Value: the computed $r$ shows a moderate positive linear relationship between bill length and flipper length.
Strength of Correlation Guidelines
Guidelines:
Weak: 0 < |r| < 0.3
Moderate: 0.3 < |r| < 0.7
Strong: 0.7 < |r| < 1
Contextual Value: Interpretation may vary across disciplines.
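The guidelines can be written as a small lookup. This is a sketch only: the function name `describe_strength` is hypothetical, and because the guidelines do not say where a value of exactly 0.3 or 0.7 belongs, the code assumes a boundary value falls in the stronger category. The absolute value is used so that negative correlations are classified by size as well.

```python
def describe_strength(r):
    """Label a correlation as weak, moderate, or strong by its magnitude."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size < 0.3:
        return "weak"
    if size < 0.7:  # assumption: exactly 0.3 counts as moderate
        return "moderate"
    return "strong"  # assumption: exactly 0.7 counts as strong


for value in (0.12, -0.45, 0.83):
    direction = "positive" if value > 0 else "negative"
    print(value, describe_strength(value), direction)
```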
Finding the Correlation Coefficient
Methodology: Use technology for efficient calculation. The quantity is often referred to as Pearson's correlation coefficient (or Pearson's $r$).
Historical Context: Karl Pearson was linked to the eugenics movement; this historical note complicates his statistical legacy.
Properties of Correlation
No Variable Distinction: Correlation does not differentiate between $x$ and $y$; switching the variables does not affect the correlation value.
Unit-Free Measure: Correlation has no units; its value does not change when the measurement units of either variable change.
Directionality: Correlation direction (positive or negative association) is tied to the correlation sign.
Boundaries: Correlation always lies between $-1$ and $1$, inclusive; the values $-1$ and $1$ occur only when the points fall exactly on a straight line.
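These properties can be checked numerically. The sketch below assumes a hand-written helper named `correlation` and made-up measurements (not real penguin data); it swaps the two variables and converts flipper length from millimetres to centimetres, then compares the results.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Made-up illustrative measurements (hypothetical).
bill_mm = [45.0, 46.5, 47.2, 48.1, 50.3]
flipper_mm = [210.0, 214.0, 213.0, 219.0, 222.0]
flipper_cm = [value / 10 for value in flipper_mm]  # same lengths, new unit

r_xy = correlation(bill_mm, flipper_mm)
r_yx = correlation(flipper_mm, bill_mm)   # roles of x and y switched
r_cm = correlation(bill_mm, flipper_cm)   # units of y changed

print(abs(r_xy - r_yx) < 1e-12)  # switching variables leaves r unchanged
print(abs(r_xy - r_cm) < 1e-9)   # changing units leaves r unchanged
print(-1 <= r_xy <= 1)           # r stays within its bounds
```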
Important Facts About Correlation
Quantitative Variables Only: Both variables must be quantitative.
Sensitivity to Outliers: Extreme values significantly influence correlation values.
Average vs Individual Variability: Correlations computed from averages are typically stronger than those computed from individual data points, because averaging reduces variability.
Not Complete Summary: Correlation does not fully describe bivariate data; visual inspection via scatterplots is crucial.
Data-Relationship Specificity: Correlation measures only linear relationships. Anscombe's quartet shows four datasets with nearly identical correlation coefficients but very different scatterplot patterns.
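The point about Anscombe's quartet can be illustrated with two of its four datasets. The sketch below uses the published values for sets I and II, which share the same $x$ values; the helper `correlation` is hand-written for illustration. Set I is roughly linear, while set II follows a smooth curve, yet the two coefficients agree to about three decimal places.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Anscombe's quartet, sets I and II (shared x values).
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y_set1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_set2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r_set1 = correlation(x, y_set1)
r_set2 = correlation(x, y_set2)

# Nearly identical coefficients, even though only set I looks linear
# when plotted.
print(round(r_set1, 2), round(r_set2, 2))
```

Only a scatterplot reveals the difference, which is why the coefficient should never be reported without looking at the data.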
Regression Modeling
Regression Line Purpose: Provides a mathematical model to describe the linear relationship between two variables. The regression line allows prediction of responses based on explanatory variables.
Equations:
General form: $\hat{y} = b_0 + b_1 x$
where
$\hat{y}$ = estimated response variable,
$b_0$ = y-intercept,
$b_1$ = slope of the regression line.
Evaluating Regression Line Fit
Difference Calculation: The difference between observed responses and predicted responses, termed residuals, assists in evaluating regression fit.
Formula: residual $= y_i - \hat{y}_i$ (observed response minus predicted response).
Residuals
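Definition: A residual is the difference between an observed response and the response predicted by the regression line. A positive residual means the line underpredicts that observation; a negative residual means it overpredicts. Residuals that are small in magnitude indicate that the line fits the data well.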
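Computing residuals is a matter of subtracting each prediction from the matching observation. A minimal sketch, assuming a line with hypothetical intercept 1.0 and slope 2.0 and made-up data points; the function names are illustrative only.

```python
def predict(intercept, slope, x):
    """Predicted response from a line: intercept + slope * x."""
    return intercept + slope * x


def residuals(intercept, slope, xs, ys):
    """Observed response minus predicted response, one value per observation."""
    return [y - predict(intercept, slope, x) for x, y in zip(xs, ys)]


# Hypothetical line and made-up observations.
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.0]
res = residuals(1.0, 2.0, xs, ys)

for x, y, e in zip(xs, ys, res):
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}: observed {y}, residual {e}, point lies {position} the line")
```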
Properties of Least-Squares Regression Line
Property Summary:
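Slope Estimate: $b_1 = r \, \frac{s_y}{s_x}$
Intercept Estimate: $b_0 = \bar{y} - b_1 \bar{x}$
Interpretation: The slope estimates the average change in the response for a one-unit increase in the explanatory variable. The intercept estimates the average response when the explanatory variable equals zero, which is meaningful only when zero lies within the range of the observed data.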
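The least-squares slope and intercept can be computed from summary statistics alone, using the standard identities: slope equals $r \cdot s_y / s_x$, and intercept equals $\bar{y}$ minus slope times $\bar{x}$. The sketch below assumes those identities, hand-written helper names, and made-up data.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    slope = correlation(xs, ys) * stdev(ys) / stdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope


# Made-up illustrative data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(xs, ys)
print(f"y_hat = {intercept:.3f} + {slope:.3f} x")

# The residuals of a least-squares line sum to zero (up to rounding).
residual_sum = sum(y - (intercept + slope * x) for x, y in zip(xs, ys))
print(abs(residual_sum) < 1e-9)
```

A consequence of the intercept formula is that the line always passes through the point $(\bar{x}, \bar{y})$.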
Example: Playing Possum (Common Brushtail Possums)
Study Relationship: Examines total length and tail length.
Scatterplots: Examine the scatterplot and the correlation to describe the relationship between the two variables.
Compute Residuals and Predictions
Answer Questions: Estimate a length using the regression line, compute residuals as observed minus predicted values, and interpret the slope and intercept in context, noting whether each interpretation is realistic.
Avoiding Extrapolation
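Definition: Extrapolation is using the regression line to predict responses for values of the explanatory variable outside the range of the observed data. Such predictions are unreliable because nothing in the data shows that the linear pattern continues there.
Importance: To keep predictions valid, use the model only within the observed range of the explanatory variable.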
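One way to guard against extrapolation is to refuse predictions outside the observed range. The sketch below is one possible design, not a standard library feature; the function name, the coefficients, and the range limits are all hypothetical.

```python
def predict_in_range(intercept, slope, x, x_min, x_max):
    """Predict a response only when x lies within the observed range of x."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation"
        )
    return intercept + slope * x


# Hypothetical line fitted to x values observed between 0 and 5.
print(predict_in_range(1.0, 2.0, 3.0, x_min=0.0, x_max=5.0))

try:
    predict_in_range(1.0, 2.0, 9.0, x_min=0.0, x_max=5.0)
except ValueError as error:
    print("refused:", error)
```

Raising an error is a deliberately strict choice; a softer variant could return the prediction together with a warning.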
Summary of Regression Analysis
Evaluation Metric: $R^2$ (the coefficient of determination) conveys the proportion of variability in the response explained by the regression model. Higher values indicate better model fit.
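The coefficient of determination can be computed as one minus the ratio of residual variation to total variation in the response. The sketch below assumes that definition, uses made-up data, and computes the slope as the ratio of the sum of cross-deviations to the sum of squared $x$ deviations, which is algebraically the same as $r \cdot s_y / s_x$. Function names are illustrative.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def r_squared(xs, ys):
    """Proportion of variability in y explained by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    y_bar = mean(ys)
    # Variation left over after fitting the line (sum of squared residuals).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y about its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up illustrative data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(xs, ys)
print(f"{r2:.1%} of the variability in y is explained by the line")
```

For a regression with a single explanatory variable, this value equals the square of the correlation $r$.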
Next Steps in Learning
Further Exploration: Inference methods for regression, hypothesis testing about variable relationships, and constructing confidence intervals for slope and intercept estimates will follow for better understanding of linear regression applications.