Linear Regression - In Depth Notes
6.4 Types of Data Consideration
Spatial Data:
Involves clustering based on geographic location which can affect the behavior and relationships between variables.
Example: House prices tend to be similar within a suburb due to shared amenities and neighborhood factors, whereas prices may differ significantly between suburbs due to disparities in desirability, local economy, and school quality.
Spatial analysis methods such as Geographically Weighted Regression (GWR) can be employed to model relationships that vary across space.
Repeated Measures:
Involves longitudinal studies where the same participants are measured multiple times, which can introduce correlation between measurements.
Measurements from the same participants tend to be more similar to each other than measurements taken from different participants, leading to potential biases if not accounted for.
Common methods for analyzing repeated measures data include using Generalized Estimating Equations (GEE) or Linear Mixed Models (LMM).
Independence Assessment:
When analyzing existing data, it's crucial to assess the independence of observations to ensure the validity of statistical conclusions.
Common conclusions:
Confidence in independence based on robust data collection methods, ensuring all subjects are independent of each other.
Lack of information about independence, allowing one to proceed with analysis but with a critical caveat on the assumption of independence, recognizing the limitations it poses.
Identified problems with independence, which necessitate halting analysis or employing alternative statistical techniques to address the dependency.
Example case: A study of father-son height relationships typically must acknowledge that there is insufficient information about the independence of observations; this caveat is vital when drawing valid conclusions about the heritability of traits.
6.5 Statistical Evidence in Linear Regression
6.5.1 Understanding Analysis Decision
Simple Linear Regression: Focus on the slope parameter to understand the relationship between the explanatory variable (x) and the response variable (y).
Null Hypothesis (H0):
H0: \beta_1 = 0
No association between x and y, indicating that x does not predict y in the population.
Fitted model (as reported by statistical software such as R):
\hat{y} = b_0 + b_1 x, where:
b_0 is the y-intercept, representing the expected value of y when x is zero.
b_1 represents the estimated change in y for a one-unit increase in x.
The model parameters are sample-specific; different samples could yield slightly different results, which is important to evaluate through confidence intervals.
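As a sketch of how these sample estimates arise, the least-squares coefficients can be computed directly from the usual formulas. The data and function name below are illustrative assumptions, not from the notes:

```python
# Least-squares estimates for simple linear regression (illustrative sketch):
# b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2),  b0 = ȳ - b1 * x̄

def fit_simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1

# Toy data (assumed for illustration): y roughly linear in x
x = [1, 2, 3, 4, 5]
y = [2.6, 3.0, 3.4, 4.1, 4.4]
b0, b1 = fit_simple_ols(x, y)
```

A different sample would give slightly different values of b0 and b1, which is exactly why the estimates are evaluated with confidence intervals.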
6.5.2 Framework for Testing the Null Hypothesis
Components:
Test Statistic: Measures how closely observed data aligns with hypothesized values, allowing evaluation of the effect of x on y.
Distribution of Test Statistic: Visualizes variations under the null hypothesis, typically following a t-distribution for regression contexts.
P-value: Indicates the probability of obtaining the test statistic under the null hypothesis. Small p-values suggest significant evidence against H0, leading to potential rejection of the null hypothesis.
Key Formula for Test Statistic (for regression): t = \frac{b_1 - \beta_1}{SE(b_1)}, which under H0 (\beta_1 = 0) reduces to t = b_1 / SE(b_1). Here SE(b_1) is the standard error of the slope estimate, reflecting the precision of the estimate.
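Continuing the sketch, SE(b1) is computed from the residual standard error, and the t statistic follows. The data and function name below are assumptions for illustration:

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for simple linear regression, testing H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard error uses n - 2 degrees of freedom
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    t = (b1 - 0) / se_b1  # beta1 = 0 under H0
    return b1, se_b1, t

x = [1, 2, 3, 4, 5]            # toy data, assumed for illustration
y = [2.6, 3.0, 3.4, 4.1, 4.4]
b1, se_b1, t = slope_t_statistic(x, y)
```

A small SE(b1) relative to b1 yields a large |t|, i.e. stronger evidence against H0.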
6.5.3 P-values and Statistical Significance
Large test statistics indicate that the null hypothesis is unlikely, suggesting significant evidence against H0.
P-values are calculated to determine significance, often with a standard significance level (α). Commonly, this value is set at 0.05, but adjustments may be made for multiple comparisons.
If p < 0.05, we reject H0; if p \geq 0.05, we fail to reject H0 and conclude that the data do not provide sufficient evidence of an association.
P-value Example: p = 0.0352 implies a 3.52% chance of observing a test statistic at least this extreme if the null hypothesis were true, indicating a statistically significant result at the 0.05 level.
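For a quick sense of the calculation, the two-sided p-value can be sketched with the standard normal distribution as a large-sample stand-in for the t-distribution (the exact computation uses the t-distribution with n - 2 degrees of freedom; this stdlib-only approximation is an assumption of the sketch):

```python
from statistics import NormalDist

def two_sided_p_normal_approx(t_stat):
    """Two-sided p-value for a test statistic, using the standard normal
    as a large-sample approximation to the t-distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

p = two_sided_p_normal_approx(2.1)
reject_h0 = p < 0.05  # compare against the conventional 0.05 threshold
```

For small samples, the normal approximation understates the p-value slightly; dedicated software uses the exact t-distribution.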
6.5.4 Predictions and Fitted Values
The regression line estimates responses for given explanatory variable values:
\hat{y} = b_0 + b_1 x_0
where \hat{y} represents the predicted value. The regression model should be applied only within the observed range of x to avoid unreliable extrapolation.
Example: For a father with a height of 67 inches, predicted son height is calculated as:
\hat{y} = 33.88660 + 0.51409 \times 67 = 68.33063
Thus, the average predicted son height is approximately 68.3 inches.
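The worked example above can be checked directly:

```python
# Predicted son height from the fitted line in the example above
b0, b1 = 33.88660, 0.51409
father_height = 67
son_pred = b0 + b1 * father_height  # ≈ 68.33063, i.e. about 68.3 inches
```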
6.5.5 Prediction and Confidence Intervals
Prediction Interval: Provides an estimated range for a new response value at a given explanatory value; these intervals are wider than confidence intervals because they also account for the variability of individual observations.
Confidence Interval: Provides a range of plausible values for a parameter such as the slope, centered on the point estimate.
General Formula for Confidence Interval:
\text{CI} = \text{estimate} \pm k \times SE(\text{estimate})
- Critical Value (k): Depends on the chosen confidence level and is typically derived from the t-distribution, especially for small sample sizes.
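The general formula can be applied mechanically. Here k = 1.96 is used as the large-sample normal critical value for a 95% interval (for small samples k would come from the t-distribution), and the SE value is an assumed number for illustration:

```python
def confidence_interval(estimate, se, k=1.96):
    """CI = estimate ± k * SE(estimate); k = 1.96 gives an approximate
    95% interval for large samples."""
    return estimate - k * se, estimate + k * se

# Slope estimate from the example, with a hypothetical standard error of 0.03
lo, hi = confidence_interval(0.51409, 0.03)
```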
6.5.6 Model Variability Explanation (R²)
R²: Quantifies the proportion of variance in the response variable explained by the regression model:
R^2 = \frac{\text{Regression Sums of Squares}}{\text{Total Sums of Squares}}
- Interpretation of R² is context-dependent; higher values suggest better model fit, though they do not imply causation.
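For a model with an intercept, R² can equivalently be computed as 1 − RSS/TSS, which this sketch uses; the data below are toy values assumed for illustration:

```python
def r_squared(y, y_hat):
    """R^2 = regression SS / total SS = 1 - RSS/TSS (for models with an intercept)."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)          # total sums of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sums of squares
    return 1 - rss / tss

y = [2.6, 3.0, 3.4, 4.1, 4.4]
y_hat = [2.56, 3.03, 3.50, 3.97, 4.44]  # fitted values from a toy regression
r2 = r_squared(y, y_hat)
```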
6.5.7 Reporting Findings
Effective communication of findings must include:
Evidence against H0 (state the p-value clearly).
Interpretation of the effect (the change in the response y for a one-unit increase in x, with a confidence interval to convey the uncertainty).
Predictions and associated intervals, giving insight into expected outcomes at given explanatory values.
The proportion of variability in the response explained by the model (R²), for a complete picture of model adequacy.
6.5.8 Technical Report Example
Report format should encompass hypotheses, equations, verification of assumptions, and model outcomes such as R², p-values, and detailed interpretations to foster clarity and systematic understanding of the analysis conducted.
Assumptions of Linear Regression
Key assumptions to check include:
Linearity: Relationship between x and y should be consistent, depicted suitably by scatterplots.
Independence of residuals: Residuals must not exhibit patterns, indicating appropriate model structure.
Homoscedasticity: Requires constant variance of residuals across all levels of x, pivotal for effective statistical inference.
Normality of error terms: Assumes that the residuals are normally distributed, enabling valid hypothesis testing with t-statistics in regression modeling.