Intro to linear regression life

PSTAT 5LS: Statistics for Life Sciences

Introduction

  • Purpose of Course: Introduces concepts of simple linear regression.

  • Bivariate Data: Measuring two quantitative variables on the same individual.

  • Power of Linear Regression: A statistical technique for describing and quantifying the relationship between two quantitative variables.

Definition of Linear Relationship
  • Term: Linear Relationship

    • Pronunciation: ['li-nē-ər ri-'lā-shən-ship]

    • Definition: A relationship between two variables that, when graphed, is represented by a straight line.

Concept of Variables in Regression

  • Variables Defined:

    • Explanatory (Predictor) Variable: Denoted by $X$; specific values denoted by $x$.

      • Represents variables thought to explain changes in another variable.

    • Response (Outcome) Variable: Denoted by $Y$; specific values denoted by $y$.

      • Measures the outcome of the study.

  • Caution:

    • Choosing a variable as $X$ does not imply that it causes $Y$. Association does not imply causation.

    • Alternative terms: Some sources call these the independent and dependent variables. Those terms are avoided here because "independent" and "dependent" have specific meanings in probability.

Scatterplots

  • Purpose: The initial step to examine relationships between two quantitative variables.

  • Definition: A graphical display (scatterplot) showcasing individual data points for two variables.

    • Each point represents an individual in the dataset.

    • Useful for identifying patterns and deviations from those patterns.

Characteristics to Examine in Scatterplots
  • Characteristics:

    1. Direction: Positive or Negative

    2. Form: Linear or Non-linear

    3. Strength: Weak, Moderate, or Strong

    4. Outliers: Points deviating significantly from the main pattern

  • Mnemonic: DOFS (Direction, Outliers, Form, Strength) for remembering these characteristics.

Example Dataset: Palmer Penguins

  • Dataset Information: Contains data on 333 penguins in the Palmer Archipelago

    • Focus: Relationship between flipper length (mm) and bill length (mm).

  • Variable Assignment:

    • Explanatory Variable: Bill length (x)

    • Response Variable: Flipper length (y)

Visual Insights with Species Categorization
  • Including species in the scatterplot can reveal variations in relationships among different species.

Multiple Regression

  • Concept Introduction: Adding species as a further explanatory variable moves the analysis from simple linear regression to multiple regression.

  • Note: Detailed analysis of multiple regression will be covered in future courses.

Correlation

  • Definition: The correlation (denoted by $r$) quantifies the strength and direction of the linear relationship between two quantitative variables, $x$ and $y$.

  • Analytic Measure: Whereas a standard deviation measures the variability of a single variable, correlation measures how two variables vary together.

Correlation Coefficient Formula
  • Calculation: $r = \frac{1}{n-1} \times \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$ where:

    • $n$ = sample size

    • $x_i$ = explanatory variable for the $i$-th observation

    • $\bar{x}$ = mean of the explanatory variable

    • $s_x$ = standard deviation of $X$

    • $y_i$ = response variable for the $i$-th observation

    • $\bar{y}$ = mean of the response variable

    • $s_y$ = standard deviation of $Y$
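
The formula can be translated term by term into a short function. This is a minimal sketch using only Python's standard library; the function name `correlation` and the five data points are hypothetical and are not taken from the penguin data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: r = (1/(n-1)) * sum((x_i - x_bar)(y_i - y_bar)) / (s_x * s_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical measurements, chosen only to keep the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = correlation(x, y)
print(round(r, 4))  # 0.7746
```

Python 3.10 and later also provide `statistics.correlation`, which should agree with this hand-written version on the same data.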

Intuition Behind the Correlation Coefficient

  • Standardization Method: $\frac{x_i - \bar{x}}{s_x}$ standardizes $X$, transforming it into a dimensionless value, termed the z-score.

  • Summation of Products: Each product of standardized values is positive when $x$ and $y$ fall on the same side of their means and negative when they fall on opposite sides. A sum that is large in magnitude indicates a strong linear relationship, while a sum near zero indicates a weak one.
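
To make the intuition concrete, the sketch below computes the z-scores for each variable, multiplies them pair by pair, and averages the products with a divisor of $n - 1$. The helper name `z_scores` and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data (not from the penguin dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

zx, zy = z_scores(x), z_scores(y)
products = [a * b for a, b in zip(zx, zy)]

# A product is positive when both values sit on the same side of their means.
for xi, yi, p in zip(x, y, products):
    print(f"x={xi}, y={yi}, z_x * z_y = {p:+.3f}")

r = sum(products) / (len(x) - 1)
print(round(r, 4))  # 0.7746
```

Here the first point lies below both means and the last lies above both, so each contributes a positive product, which pulls $r$ toward $+1$.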

Correlation Coefficient Example - Gentoo Penguins
  • Calculated $r$ Value: $r = +0.6642$, showing a moderate positive linear relationship between bill length and flipper length.

Strength of Correlation Guidelines

  • Guidelines:

    • Weak: $0 < |r| < 0.3$

    • Moderate: $0.3 < |r| < 0.7$

    • Strong: $0.7 < |r| < 1$

  • Contextual Value: Interpretation may vary across disciplines.
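
The guideline can be written as a small lookup. This is only a sketch: the function name `describe_correlation` is hypothetical, and assigning the boundary values 0.3 and 0.7 to the stronger category is an assumption, since the guideline leaves them unspecified.

```python
def describe_correlation(r):
    """Label r using the rough guideline cutoffs 0.3 and 0.7 applied to |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    if r == 0:
        return "no linear relationship"
    size = abs(r)
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:      # assumption: exactly 0.3 counts as moderate
        strength = "moderate"
    else:                 # assumption: exactly 0.7 counts as strong
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_correlation(0.6642))  # moderate positive
print(describe_correlation(-0.85))   # strong negative
```

The first call uses the Gentoo value reported earlier and reproduces the "moderate positive" description.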

Finding the Correlation Coefficient

  • Methodology: Use technology to compute the correlation, which is often referred to as Pearson's correlation coefficient (or Pearson's $r$).

  • Historical Context: Karl Pearson was closely linked to the eugenics movement, which complicates his statistical legacy.

Properties of Correlation

  1. No Variable Distinction: Correlation does not distinguish between $X$ and $Y$; switching which variable is explanatory and which is response does not change the value of $r$.

  2. Unit-Free Measure: Correlation has no units; its value does not change when the units of measurement of either variable change.

  3. Directionality: The sign of $r$ gives the direction of the association: positive $r$ for a positive association, negative $r$ for a negative one.

  4. Boundaries: Correlation is always between $-1$ and $+1$, inclusive.
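
These properties can be checked numerically. The sketch below is illustrative only: the `correlation` helper and the five bill and flipper measurements are made up for this example and are not taken from the Palmer Penguins data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

# Hypothetical measurements in millimetres.
bill_mm = [39.1, 41.3, 45.2, 47.6, 50.0]
flipper_mm = [181.0, 190.0, 205.0, 210.0, 222.0]

r = correlation(bill_mm, flipper_mm)

# Property 1: swapping the roles of the two variables leaves r unchanged.
r_swapped = correlation(flipper_mm, bill_mm)

# Property 2: converting bill length from mm to cm leaves r unchanged.
bill_cm = [v / 10 for v in bill_mm]
r_cm = correlation(bill_cm, flipper_mm)

# Property 3: reversing the direction of one variable flips the sign of r.
r_reversed = correlation([-v for v in bill_mm], flipper_mm)

# Property 4: r stays within [-1, +1].
print(round(r, 3), round(r_swapped, 3), round(r_cm, 3), round(r_reversed, 3))
```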

Important Facts About Correlation
  1. Quantitative Variables Only: Both variables must be quantitative.

  2. Sensitivity to Outliers: Extreme values significantly influence correlation values.

  3. Average vs Individual Variability: Correlations computed from averages tend to be stronger than correlations computed from individual data points, because averaging reduces variability.

  4. Not Complete Summary: Correlation does not fully encapsulate bivariate data; visual inspection via scatterplots is crucial.

  5. Data-Relationship Specificity: Correlation measures only the strength of linear relationships. Anscombe's quartet illustrates this with datasets that share nearly identical correlation coefficients yet have very different scatterplots.
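
Two of these facts are easy to see with small made-up datasets. The sketch below assumes a hypothetical `correlation` helper and hypothetical data: a perfectly curved relationship whose correlation is zero, and a patternless cloud whose correlation jumps once a single extreme point is added.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

# Linear relationships only: y = x^2 is a perfect (curved) relationship, yet r = 0.
x_curve = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_curve = [v ** 2 for v in x_curve]
r_curve = correlation(x_curve, y_curve)

# Sensitivity to outliers: five points with no linear pattern ...
x_base = [1.0, 2.0, 3.0, 4.0, 5.0]
y_base = [2.0, 2.1, 1.9, 2.1, 2.0]
r_base = correlation(x_base, y_base)

# ... and the same points plus one extreme observation at (15, 10).
r_outlier = correlation(x_base + [15.0], y_base + [10.0])

print(round(r_curve, 3), round(r_base, 3), round(r_outlier, 3))
```

In both cases the scatterplot tells the story that the single number hides, which is why the plot should be inspected alongside $r$.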

Regression Modeling

  • Regression Line Purpose: Provides a mathematical model to describe the linear relationship between two variables. The regression line allows prediction of responses based on explanatory variables.

  • Equations:

    • General form: $\hat{y} = b_0 + b_1 x$

      • where

      • $\hat{y}$ = estimated response variable,

      • $b_0$ = y-intercept,

      • $b_1$ = slope of the regression line.

Evaluating Regression Line Fit
  • Difference Calculation: The difference between observed responses and predicted responses, termed residuals, assists in evaluating regression fit.

    • Formula: $\text{Residual} = \text{Observed Response} - \text{Estimated Response}$.

Residuals

  • Definition: A residual is the difference $y - \hat{y}$ between an observed response and the response predicted by the regression line. Residuals indicate how well the line fits: the smaller the residuals, the better the fit.
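
As a small worked example, the sketch below computes predictions and residuals for a made-up line. The intercept 10.0, the slope 0.5, both data points, and the function names are hypothetical and chosen only so the arithmetic is easy to verify.

```python
def predict(b0, b1, x):
    """Estimated response from the line: intercept plus slope times x."""
    return b0 + b1 * x

def residual(b0, b1, x, y_observed):
    """Residual = observed response - estimated response."""
    return y_observed - predict(b0, b1, x)

b0, b1 = 10.0, 0.5  # hypothetical intercept and slope

# A point above the line has a positive residual ...
print(residual(b0, b1, 80.0, 52.0))  # 2.0
# ... and a point below the line has a negative residual.
print(residual(b0, b1, 90.0, 53.0))  # -2.0
```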

Properties of Least-Squares Regression Line
  • Property Summary:

    1. Slope Estimate: $b_1 = r \frac{s_y}{s_x}$

    2. Intercept Estimate: $b_0 = \bar{y} - b_1 \bar{x}$

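  • Interpretation: The slope $b_1$ is the predicted change in the response for a one-unit increase in $x$. The intercept $b_0$ is the predicted response when $x = 0$, which is meaningful only if $x = 0$ is a realistic value within the range of the data.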
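
The two estimates can be computed directly from the summary statistics. The following is a minimal sketch; the function name `least_squares_line` and the data are hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data (not the possum or penguin measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares_line(x, y)
print(round(b0, 3), round(b1, 3))  # 2.2 0.6

# The residuals of a least-squares line sum to zero (up to rounding error).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)  # True
```

For these data, each one-unit increase in $x$ is associated with a predicted increase of 0.6 in the response. Python 3.10 and later also offer `statistics.linear_regression`, which should return the same slope and intercept.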

Example: Playing Possum (Common Brushtail Possums)

  • Study Relationship: Examines total length and tail length.

  • Scatterplots: Examine the scatterplot and the correlation to assess the direction, form, and strength of the relationship.

Compute Residuals and Predictions
  • Answer Questions: Use the regression line to predict a length, compute residuals as observed minus predicted values, and interpret the slope and intercept in the context of the data.

Avoiding Extrapolation

  • Definition: Extrapolation is using the regression line to predict responses for $X$ values outside the range of the observed data. Such predictions are unreliable because no data support the linear pattern there.

  • Importance: To keep predictions valid, use the model only for $X$ values within the observed range.
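
One way to enforce this in practice is to refuse predictions outside the observed range. The sketch below is one possible design, not a standard routine: the function name `predict_within_range`, the coefficients, and the data are hypothetical.

```python
def predict_within_range(b0, b1, x_new, xs_observed):
    """Predict from the line b0 + b1 * x, but only inside the observed range of x."""
    lowest, highest = min(xs_observed), max(xs_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lowest}, {highest}]; "
            "predicting here would be extrapolation"
        )
    return b0 + b1 * x_new

xs_observed = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory values
b0, b1 = 2.2, 0.6                        # hypothetical fitted coefficients

print(round(predict_within_range(b0, b1, 3.0, xs_observed), 2))  # 4.0

try:
    predict_within_range(b0, b1, 10.0, xs_observed)
except ValueError as err:
    print("refused:", err)
```

Raising an error is a deliberately strict choice; a gentler variant could return the prediction together with a warning.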

Summary of Regression Analysis

  • Evaluation Metric: $R^2$ (the coefficient of determination) gives the proportion of the variability in the response variable that is explained by the regression model. Higher values indicate a better fit.
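
As a rough illustration, $R^2$ can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data below are hypothetical. In simple linear regression the result equals the square of the correlation $r$.

```python
from statistics import mean

def r_squared(xs, ys):
    """Proportion of the variability in y explained by the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx            # algebraically the same as r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    return 1 - sse / sst

# Hypothetical data; the correlation for these points is about 0.7746.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r2 = r_squared(x, y)
print(round(r2, 3))  # 0.6
```

Here about 60% of the variability in the response is explained by the line, which matches $0.7746^2 \approx 0.6$.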

Next Steps in Learning

  • Further Exploration: Inference methods for regression, hypothesis testing about variable relationships, and constructing confidence intervals for slope and intercept estimates will follow for better understanding of linear regression applications.