Simple linear regression is a statistical technique used to model the relationship between a response variable and a single explanatory variable [1, 2]. It is a type of predictive modeling where the target variable to be estimated is continuous [3, 4]. The goal of simple linear regression is to find a linear function that best fits the data [5].
Core Concepts
Linear Relationship: Simple linear regression assumes a linear relationship between the explanatory and response variables, meaning that the relationship can be represented by a straight line [6].
Equation: The relationship is expressed by the equation y = b0 + b1x, where y is the response variable, x is the explanatory variable, b0 is the intercept, and b1 is the slope [7]. The intercept (b0) represents the value of y when x is zero, while the slope (b1) represents the change in y for a one-unit increase in x [7, 8].
Error Term: The model also includes an error term, denoted by ε, to account for the fact that the observed values of y do not fall perfectly on the regression line [9, 10]. The error term is assumed to follow a normal distribution with a mean of zero and constant variance [10]. (A small simulation illustrating this is sketched after this list.)
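As a minimal illustration, the R sketch below simulates data from this model; the intercept of 2, the slope of 0.5, the error standard deviation of 1, and the variable names x and y are arbitrary choices for the example, not values from the sources.
# Simulate data from y = b0 + b1*x + error (illustrative values: b0 = 2, b1 = 0.5)
set.seed(42)                            # make the simulation reproducible
x <- runif(100, 0, 10)                  # explanatory variable
error <- rnorm(100, mean = 0, sd = 1)   # errors: normal, mean zero, constant variance
y <- 2 + 0.5 * x + error                # observed responses scatter around the line
plot(x, y)                              # scatter plot of the simulated relationship
abline(a = 2, b = 0.5)                  # the true line y = 2 + 0.5x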
Assumptions of Simple Linear Regression
The simple linear regression model makes several assumptions [3, 6, 11] (a residual-diagnostics sketch for checking them informally follows this list):
Linearity: The relationship between the response and explanatory variables is linear [6].
Independence: The errors are independent of each other [6].
Normality: The errors are normally distributed [6].
Equal Variance (Homoscedasticity): The variance of the errors is constant across all values of the explanatory variable [6].
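As noted above, residual diagnostic plots give one informal way to check these assumptions. The sketch below reuses the simulated x and y from the earlier example; it is a common approach rather than a definitive test.
fit <- lm(y ~ x)        # fit the model to the simulated data above
par(mfrow = c(2, 2))    # arrange the four default diagnostic plots in a grid
plot(fit)               # residuals vs fitted, normal Q-Q, scale-location, leverage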
Ordinary Least Squares (OLS)
The most common method for determining the regression line is the Ordinary Least Squares (OLS) method [5, 12]. OLS minimizes the sum of the squared vertical distances between the observed data points and the predicted values on the regression line [12].
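This criterion can be written as a function of candidate intercept and slope values and minimized numerically; the sketch below does so for the simulated data, with the function name sse and the starting values c(0, 0) chosen arbitrarily for illustration.
sse <- function(b) sum((y - b[1] - b[2] * x)^2)   # sum of squared vertical distances
optim(c(0, 0), sse)$par                           # numerical minimum: roughly (b0, b1)
coef(lm(y ~ x))                                   # OLS estimates from lm() for comparison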
Model Fitting
Scatter Plot: A scatter plot is a useful first step to visualize the relationship between the explanatory and response variables [13].
Parameter Estimation: The OLS method is used to find the estimates of b0 and b1, written β̂0 and β̂1, that minimize the sum of squared residuals. The formulas for the slope and intercept estimates are [14]:
β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², where both sums run over i = 1, …, n [15, 16].
β̂0 = ȳ − β̂1x̄ [14, 16]. Where x̄ and ȳ represent the sample means of x and y, respectively [14, 15].
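Applied to the simulated data from the earlier sketch, these formulas can be evaluated directly and compared with the estimates returned by lm(); the object names b1_hat and b0_hat are arbitrary.
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
b0_hat <- mean(y) - b1_hat * mean(x)                                 # intercept estimate
c(b0_hat, b1_hat)                                # should agree with coef(lm(y ~ x))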
Model Evaluation
Coefficient of Determination (R²): The coefficient of determination (R²) measures the proportion of the variance in the response variable that is explained by the model [17]. R² values range from 0 to 1, with higher values indicating a better fit [17, 18]. The formula for R² is:
R² = SSReg / SST = 1 − SSE / SST [17]. Where SSReg is the sum of squares for regression, SST is the total sum of squares, and SSE is the sum of squares for error [17, 19].
Residual Standard Error (RSE): RSE is the standard deviation of the residuals, which measures the typical size of the errors [20].
Residuals: Residuals are the differences between the observed values and the predicted values [21].
ANOVA table: The decomposition of variance is organized into an ANOVA table, which includes the sum of squares, degrees of freedom, and mean square for the regression and error components [22].
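As a sketch of these quantities on the simulated data, the code below computes the residuals, R², and RSE by hand and then prints the ANOVA decomposition; the object names fit, sst, and sse_val are arbitrary.
fit <- lm(y ~ x)                   # fitted simple linear regression
res <- residuals(fit)              # observed minus predicted values
sst <- sum((y - mean(y))^2)        # total sum of squares (SST)
sse_val <- sum(res^2)              # sum of squares error (SSE)
1 - sse_val / sst                  # R-squared; compare with summary(fit)$r.squared
sqrt(sse_val / (length(y) - 2))    # residual standard error (n - 2 degrees of freedom)
anova(fit)                         # ANOVA table: sums of squares, df, mean squares, F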
Making Predictions
Once the regression line is determined, it can be used to predict the value of the response variable for a given value of the explanatory variable [23]. The predicted value is denoted ŷ [23].
The fitted line is given by: ŷ = β̂0 + β̂1x [23].
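A minimal sketch, again using the model fitted to the simulated data; the new x values 2, 5, and 8 are arbitrary choices for the example.
fit <- lm(y ~ x)                                         # fitted regression line
new_x <- data.frame(x = c(2, 5, 8))                      # hypothetical new values of x
predict(fit, newdata = new_x)                            # point predictions (y-hat)
predict(fit, newdata = new_x, interval = "confidence")   # CIs for the mean response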
Inference for Simple Linear Regression
Gauss-Markov Theorem: The Gauss-Markov theorem states that the OLS estimators are the best linear unbiased estimators (BLUE) of the parameters of the simple linear regression model [24].
Sampling Distributions: The estimators for slope and intercept have their own sampling distributions, which allows us to create confidence intervals and perform hypothesis tests about these parameters [25, 26].
Hypothesis Tests: Hypothesis tests can be used to determine if there is a significant linear relationship between the explanatory and response variables [27, 28].
The null hypothesis for this test is generally that the slope of the regression line is equal to zero (H0: β1 = 0).
A t-test can be used to test the hypothesis about the slope parameter, and the test statistic is the estimate of the slope divided by the standard error of the slope [27, 29].
An F-test can also be used to test for the significance of the regression, which is equivalent to the t-test in simple linear regression [30].
Confidence Intervals: Confidence intervals can be created for the slope and intercept parameters, and the mean response for a given value of the explanatory variable [31].
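On the simulated data, these tests and intervals can be read directly from the fitted model; a brief sketch:
fit <- lm(y ~ x)               # fitted model on the simulated data
summary(fit)$coefficients      # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.95)     # 95% confidence intervals for intercept and slope
anova(fit)                     # F-test for the regression (equivalent to the slope t-test)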
R Implementation
The lm() function in R can be used to fit linear models [32, 33].
The summary() function can be used to display the results of the model, including the parameter estimates, standard errors, t-values, p-values, and R² [29].
The anova() function can be used to compare nested models and perform F-tests [34].
The predict() function can be used to obtain predictions and confidence intervals for the mean response [35].
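Putting these functions together, the sketch below fits a simple linear regression to R's built-in cars data set (stopping distance as a function of speed); the choice of data set and the speed value of 15 are arbitrary for illustration.
fit_cars <- lm(dist ~ speed, data = cars)    # fit the linear model
summary(fit_cars)                            # estimates, SEs, t values, p-values, R-squared
anova(fit_cars)                              # F-test for the regression
predict(fit_cars, newdata = data.frame(speed = 15),
        interval = "confidence")             # predicted mean distance at speed 15, with CI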
Limitations
Simple linear regression assumes a linear relationship, which may not always be the case.
The model is sensitive to outliers, which can disproportionately influence the regression line.
The model assumes that the explanatory variable is fixed, not random [2, 6].
Simple linear regression is a fundamental statistical technique that serves as a basis for understanding more complex regression models and is a useful tool for modeling and predicting continuous response variables.