Simple Linear Regression

Simple linear regression is a statistical technique used to model the relationship between a response variable and a single explanatory variable [1, 2]. It is a type of predictive modeling where the target variable to be estimated is continuous [3, 4]. The goal of simple linear regression is to find a linear function that best fits the data [5].

Core Concepts

  • Linear Relationship: Simple linear regression assumes a linear relationship between the explanatory and response variables, meaning that the relationship can be represented by a straight line [6].

  • Equation: The relationship is expressed by the equation y = b0 + b1x, where y is the response variable, x is the explanatory variable, b0 is the intercept, and b1 is the slope [7]. The intercept (b0) represents the value of y when x is zero, while the slope (b1) represents the change in y for a one-unit increase in x [7, 8].

  • Error Term: The model also includes an error term, denoted by πœ–, to account for the fact that the observed values of y do not fall perfectly on the regression line [9, 10]. The error term is assumed to follow a normal distribution with a mean of zero and constant variance [10].
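
To make the equation and the error term concrete, here is a minimal R sketch that simulates data from y = b0 + b1x + πœ–; the sample size and the values b0 = 2 and b1 = 0.5 are assumptions chosen purely for illustration:

    set.seed(1)                         # reproducible example
    n  <- 100                           # sample size (assumed)
    x  <- runif(n, 0, 10)               # explanatory variable
    b0 <- 2; b1 <- 0.5                  # illustrative intercept and slope
    eps <- rnorm(n, mean = 0, sd = 1)   # error term: normal, mean zero, constant variance
    y  <- b0 + b1 * x + eps             # response generated by the linear model
    plot(x, y)                          # points scatter around a straight line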

Assumptions of Simple Linear Regression

The simple linear regression model makes several assumptions [3, 6, 11]:

  • Linearity: The relationship between the response and explanatory variables is linear [6].

  • Independence: The errors are independent of each other [6].

  • Normality: The errors are normally distributed [6].

  • Equal Variance (Homoscedasticity): The variance of the errors is constant across all values of the explanatory variable [6].
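
These assumptions are typically checked from the residuals of a fitted model. A minimal R sketch, assuming the simulated x and y from the example above:

    fit <- lm(y ~ x)           # fit the simple linear regression
    par(mfrow = c(2, 2))
    plot(fit)                  # residuals vs fitted (linearity, equal variance),
                               # normal Q-Q plot (normality), scale-location, leverage
    par(mfrow = c(1, 1))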

Ordinary Least Squares (OLS)

The most common method for determining the regression line is the Ordinary Least Squares (OLS) method [5, 12]. The OLS method minimizes the sum of the squared distances between the observed data points and the predicted values on the regression line [12].
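
The quantity being minimized can be written down directly. A small illustrative R sketch, reusing the simulated x and y from above, with an arbitrary candidate line chosen only for comparison:

    sse <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)   # sum of squared distances
    fit <- lm(y ~ x)
    sse(1, 1, x, y)                          # an arbitrary candidate line
    sse(coef(fit)[1], coef(fit)[2], x, y)    # the OLS coefficients give the smallest value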

Model Fitting

  • Scatter Plot: A scatter plot is a useful first step to visualize the relationship between the explanatory and response variables [13].

  • Parameter Estimation: The OLS method is used to find the values of b0 and b1 that minimize the sum of squared residuals. The formulas for calculating the slope and intercept are [14]:

  • Ξ²Μ‚1 = Ξ£(xα΅’ βˆ’ xΜ„)(yα΅’ βˆ’ yΜ„) / Ξ£(xα΅’ βˆ’ xΜ„)Β², where the sums run over i = 1, …, n [15, 16].

  • Ξ²Μ‚0 = yΜ„ βˆ’ Ξ²Μ‚1xΜ„ [14, 16], where xΜ„ and yΜ„ represent the sample means of x and y, respectively [14, 15]. These calculations are illustrated in the R sketch below.
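
A minimal R sketch of these formulas, applied to the simulated x and y from the earlier example and checked against R's built-in fit:

    b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
    b0_hat <- mean(y) - b1_hat * mean(x)                                 # intercept estimate
    c(b0_hat, b1_hat)
    coef(lm(y ~ x))            # should match the hand-computed estimates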

Model Evaluation

  • Coefficient of Determination (RΒ²): The coefficient of determination (RΒ²) measures the proportion of the variance in the response variable that is explained by the model [17]. RΒ² values range from 0 to 1, with higher values indicating a better fit [17, 18]. The formula for RΒ² is:

  • RΒ² = SSReg / SST = 1 βˆ’ SSE / SST [17], where SSReg is the sum of squares regression, SST is the total sum of squares, and SSE is the sum of squares error [17, 19]. These quantities are computed in the R sketch after this list.

  • Residual Standard Error (RSE): RSE is the standard deviation of the residuals, which measures the typical size of the errors [20].

  • Residuals: Residuals are the differences between the observed values and the predicted values [21].

  • ANOVA table: The decomposition of variance is organized into an ANOVA table, which includes the sum of squares, degrees of freedom, and mean square for the regression and error components [22].
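
A short R sketch of these quantities, continuing the running example; the hand computations are shown only to illustrate the definitions:

    fit    <- lm(y ~ x)
    res    <- residuals(fit)                 # observed minus predicted values
    ss_err <- sum(res^2)                     # sum of squares error (SSE)
    ss_tot <- sum((y - mean(y))^2)           # total sum of squares (SST)
    r2     <- 1 - ss_err / ss_tot            # coefficient of determination
    rse    <- sqrt(ss_err / (length(y) - 2)) # residual standard error (n - 2 degrees of freedom)
    c(r2, rse)
    summary(fit)$r.squared                   # matches r2
    anova(fit)                               # ANOVA table: Sum Sq, Df, Mean Sq, F value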

Making Predictions

  • Once the regression line is determined, it can be used to predict the value of the response variable for a given value of the explanatory variable [23]. The predicted value is denoted as Ε· [23].

  • The fitted line is given by: Ε· = Ξ²Μ‚0 + Ξ²Μ‚1x [23].
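
A minimal sketch of prediction in R, continuing the running example; the new value x = 5 is an arbitrary choice for illustration:

    fit <- lm(y ~ x)
    coef(fit)[1] + coef(fit)[2] * 5     # Ε· computed directly from the fitted line
    predict(fit, data.frame(x = 5))     # the same value via predict()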

Inference for Simple Linear Regression

  • Gauss-Markov Theorem: The Gauss-Markov theorem states that the OLS estimators are the best linear unbiased estimators (BLUE) of the parameters of the simple linear regression model [24].

  • Sampling Distributions: The estimators for slope and intercept have their own sampling distributions, which allows us to create confidence intervals and perform hypothesis tests about these parameters [25, 26].

  • Hypothesis Tests: Hypothesis tests can be used to determine if there is a significant linear relationship between the explanatory and response variables [27, 28].

  • The null hypothesis for this test is generally that the slope of the regression line is equal to zero (H0: Ξ²1 = 0).

  • A t-test can be used to test the hypothesis about the slope parameter, and the test statistic is the estimate of the slope divided by the standard error of the slope [27, 29].

  • An F-test can also be used to test for the significance of the regression, which is equivalent to the t-test in simple linear regression [30].

  • Confidence Intervals: Confidence intervals can be created for the slope and intercept parameters, and the mean response for a given value of the explanatory variable [31].
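
A minimal R sketch of these inferences, continuing the running example; the 95% level and the value x = 5 are conventional, illustrative choices:

    fit <- lm(y ~ x)
    s   <- summary(fit)
    s$coefficients                           # estimates, standard errors, t values, p-values
    s$coefficients["x", "Estimate"] /
      s$coefficients["x", "Std. Error"]      # t statistic for H0: Ξ²1 = 0
    confint(fit, level = 0.95)               # confidence intervals for intercept and slope
    predict(fit, data.frame(x = 5),
            interval = "confidence")         # confidence interval for the mean response at x = 5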

R Implementation

  • The lm() function in R can be used to fit linear models [32, 33].

  • The summary() function can be used to display the results of the model including the parameter estimates, standard errors, t-values, p-values, and RΒ² [29].

  • The anova() function can be used to compare nested models and perform F-tests [34].

  • The predict() function can be used to obtain predictions and confidence intervals for the mean response [35].
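
A compact end-to-end sketch tying these functions together, using R's built-in cars data set (chosen here only for illustration; the value speed = 15 is arbitrary):

    fit  <- lm(dist ~ speed, data = cars)    # fit the simple linear regression
    summary(fit)                             # estimates, standard errors, t values, p-values, RΒ²
    anova(fit)                               # F-test for the significance of the regression
    null <- lm(dist ~ 1, data = cars)        # intercept-only model
    anova(null, fit)                         # comparing nested models gives the same F-test
    predict(fit, data.frame(speed = 15),
            interval = "confidence")         # confidence interval for the mean response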

Limitations

  • Simple linear regression assumes a linear relationship, which may not always be the case.

  • The model is sensitive to outliers, which can disproportionately influence the regression line.

  • The model assumes that the explanatory variable is fixed, not random [2, 6].

Simple linear regression is a fundamental statistical technique that serves as a basis for understanding more complex regression models and is a useful tool for modeling and predicting continuous response variables.
