In regression analysis we wish to investigate the relationship between an independent variable and a dependent variable.
If we have one independent variable it is called simple linear regression.
If we have more than one independent variable it is called multiple linear regression.
Statement of Theory or Hypothesis
Example: Theoretically, as age increases, security awareness decreases. Therefore, an increase of one year in age should result in a decrease in security awareness. If we let variable X represent age and variable Y represent security awareness, then Y is expected to decrease as X increases.
Specify a Statistical Model
In statistics we often wish to build models of the type: $Outcome_i = Model_i + Error_i$
In regression analysis this model can be written as:
$Y_i = β_0 + β_1X_{1i} + ε_i$
where $𝛽_1 < 0$ according to our theory and:
$Y$ = dependent variable,
$X_1$ = independent variable 1,
$ε$ = error term.
Example: We have the model: $Y_i = 𝛽_0 + 𝛽_1X_1 + 𝜀_i,$ where $Y$ = dependent variable corresponding to security awareness, $X_1$ = independent variable corresponding to age, $ε$ = error term.
Find Data
Data obtained for regression analysis may be:
Cross-sectional (e.g. country data, firm data),
Time series data (e.g. country data over time),
Panel data (a mix of cross-sectional and time series data).
Estimate the Model
We want to estimate the effect of $X$ on $Y$ and we need to estimate the parameters $β_0$ and $β_1$.
Test Hypothesis
Make Predictions
The error term stands for all variables that affect the dependent variable but are not included as explanatory variables. Why have it rather than throwing in all possible variables?
Measurement error caused by, for example, poor proxy variables or poor indices.
Specification error: core variables versus peripheral (less important) variables. Peripheral variables may most appropriately be included in the error term, but core variables need to be included.
Intrinsic randomness in human behaviour.
In the previous model, $β_0$ and $β_1$ are population parameters. We then estimate these parameters by $b_0$ and $b_1$ using ordinary least squares. Hence the population model $Y_i = β_0 + β_1X_{1i} + ε_i$ is estimated using the following equation:
$Y_i = b_0 + b_1X_{1i} + ε_i$
To estimate the parameters ($β_0$ and $β_1$) we minimise the sum of squared residuals.
The parameter for the independent variable $X$ may be interpreted as the estimated linear effect on $Y$ of a one-unit change in $X$.
Example: We wish to investigate the effect of age on security awareness and estimate the model: $Y_i = 𝛽_0 + 𝛽_1X_1 + 𝜀_i$ which is estimated by the following model: $Y_i = b_0 + b_1X_1 + ε_i$, where $Y$ = dependent variable corresponding to security awareness, $X$ = independent variable corresponding to age, and $ε$ = error term.
SPSS result
We have the model: $Y_i = b_0 + b_1X_{1i} + ε_i = 20.047 − 0.201X_{1i} + ε_i$.
If age increases by 1 year then security awareness decreases by 0.201 (it is an index from 0–20).
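As a minimal sketch, the same estimation can be reproduced in Python with statsmodels; the data below are simulated stand-ins (the SPSS output above comes from the actual survey data, which are not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(18, 70, 200)                      # X: age in years
awareness = 20 - 0.2 * age + rng.normal(0, 2, 200)  # Y: awareness index, 0-20

X = sm.add_constant(age)            # adds the intercept column for b_0
model = sm.OLS(awareness, X).fit()
print(model.params)                 # b_0 (const) and b_1 (slope on age)
print(model.summary())
```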
We have the following model:
$Y_i = β_0 + β_1X_{1i} + β_2X_{2i} + ... + β_KX_{Ki} + ε_i$
where:
$Y$ = dependent variable,
$X_1$ = independent variable 1,
$X_2$ = independent variable 2,
$X_K$ = independent variable K,
$ε$ = error term.
We have: $Y_i = 𝛽_0 + 𝛽_1X_{1i} + 𝛽_2X_{2i} + ... + 𝛽_KX_{Ki} + 𝜀_i$, which is estimated by:
$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + ... + b_KX_{Ki} + ε_i$
Example: To include other independent variables that may affect the credibility of a website we could estimate: $Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + ... + b_KX_{Ki} + ε_i$, which is a multiple regression model where $Y$ = website credibility (dependent variable) and the independent variables are: $X_1$ = age, $X_2$ = gender, $X_3$ = experience, and possibly further independent variables.
To estimate the parameters we minimise the sum of squared residuals.
Each parameter for the independent variables $X_i (i = 1, ..., K$) may be interpreted as the estimated linear effect on $Y$ of a one-unit change in $X_i$ after removing the estimated linear effects of the other $X_j$ ($j ≠ i$) on both $Y$ and $X_i$.
Example (Website Credibility): We estimate the following model: $Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + ... + b_KX_{Ki} + ε_i$, where $Y$ = website credibility (1–6), $X_1$ = age in years, $X_2$ = gender (1 if female and 0 otherwise), $X_3$ = number of times they use the internet per day.
SPSS result based on 408 observations.
Based on 408 observations we obtain:
If age increases by 1 year then credibility increases by 0.014 points, holding the other variables constant.
If internet frequency increases by one use per day then credibility increases by 0.129 points, holding the other variables constant.
The coefficient of determination, or $R^2$, measures how much of the variation in the dependent variable is explained by the independent variables in the regression model. It is calculated as:
$R^2 = \frac{SS_M}{SS_T}$
$SS_T$ — the total variation in the dependent variable, $Y$, calculated as: $\sum(y_i - \bar{y})^2$
$SS_R$ — the total variation in the error term, $ε$, calculated as: $\sum(ε_i - \bar{ε})^2$
$SS_M$ — the difference between $SS_T$ and $SS_R$ and, therefore, how much of the variation is explained by the regression model, calculated as: $\sum(\hat{y}_i - \bar{y})^2$
$R^2 = \frac{SS_M}{SS_T}$ is, therefore, the proportion of variance explained by the regression model.
$R^2$ represents the fraction (between 0 and 1 inclusive) of the sample variance in $Y$ the regression explains. As $R^2$ increases, the fit of the regression to the observations increases. Having $R^2 = 1$ means a perfect fit to the observations.
Example: The R square is 0.048. Therefore our independent variables explain 0.048, or 4.8%, of the variation in website credibility.
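As a sketch of how $SS_T$, $SS_R$, and $SS_M$ fit together, the decomposition can be verified numerically on simulated data (illustrative values, not the SPSS output):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = model.fittedvalues
ss_t = np.sum((y - y.mean()) ** 2)       # SS_T: total variation in Y
ss_r = np.sum((y - y_hat) ** 2)          # SS_R: unexplained (residual) variation
ss_m = np.sum((y_hat - y.mean()) ** 2)   # SS_M: variation explained by the model
print(ss_m / ss_t, model.rsquared)       # the two values agree
```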
Output from SPSS.
We use a t-test to test the hypothesis $H_0: β_i = 0$ vs $H_1: β_i ≠ 0$.
$H_0$ : the independent variable $i$ does not have an effect on the dependent variable.
$H_1$ : the independent variable $i$ has an effect on the dependent variable.
This may be done using a t-statistic that follows a t-distribution with $N - K - 1$ degrees of freedom:
$t = \frac{b_i}{S_{b_i}}$
Test if the individual parameter has an impact.
Output from SPSS.
For age, we have:
$t = \frac{b_i}{S_{b_i}} = \frac{0.014}{0.006} = 2.503$ (SPSS computes the ratio from unrounded values; the rounded coefficients shown give roughly 2.33).
Using p-values: The p-value is 0.013 which is less than $α=0.05$ and therefore we reject the null hypothesis.
For internet frequency, we have:
$t = \frac{b_i}{S_{b_i}} = \frac{0.129}{0.0620} = 2.076$
Using p-values: The p-value is 0.039 which is less than $α=0.05$ and therefore we reject the null hypothesis.
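A sketch of the same computation in Python, using the rounded SPSS numbers for the age coefficient (assumed inputs; SPSS itself works from unrounded values):

```python
from scipy import stats

b, se, n, k = 0.014, 0.006, 408, 3   # rounded SPSS values for age
t = b / se
df = n - k - 1                       # N - K - 1 degrees of freedom
p = 2 * stats.t.sf(abs(t), df)       # two-sided p-value
print(t, p)                          # ~2.33 here; SPSS reports 2.503 (unrounded)
```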
We may also want to test the overall significance of a multiple regression model. Hence, we may want to test the following hypothesis: $H_0: β_1 = β_2 = ... = β_K = 0$
vs $H_1$: at least one $β_i$ $(i = 1, ..., K)$ differs from 0.
Hence, we test if:
$H_0$ : no independent variable has an effect on the dependent variable
$H_1$ : at least one independent variable has an effect on the dependent variable
This may be done using an F-statistic that follows an F-distribution with $K$ degrees of freedom in the numerator and $N - K - 1$ degrees of freedom in the denominator:
$F = \frac{MS_M}{MS_R}$
$MS_R$ is the mean variation in the error term, $ε$, calculated as:
$MS_R = \frac{SS_R}{N - K - 1}$
$MS_M$ is the average variation explained by the regression model calculated as:
$MS_M = \frac{SS_M}{K}$
Test the overall significance at the 5 % level of significance.
Based on the output from SPSS, we have that:
$F = \frac{MS_M}{MS_R} = \frac{SS_M/K}{SS_R/(N - K - 1)}$
The p-value is less than 0.001 which is less than 0.05 and therefore we reject the null hypothesis.
Conclusion: at least one independent variable has an effect on the dependent variable.
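A sketch of the overall F-test computed directly from the sums of squares; the data are simulated, so the F value will not match the SPSS output:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n, k = 200, 2
X = rng.normal(size=(n, k))
y = 1 + X @ np.array([0.5, -0.3]) + rng.normal(0, 1, n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

ss_r = np.sum(fit.resid ** 2)                      # SS_R
ss_m = np.sum((fit.fittedvalues - y.mean()) ** 2)  # SS_M
F = (ss_m / k) / (ss_r / (n - k - 1))              # MS_M / MS_R
p = stats.f.sf(F, k, n - k - 1)
print(F, p, fit.fvalue, fit.f_pvalue)              # matches statsmodels' F-test
```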
Sometimes we want to investigate the effect of a non-metric variable on a metric variable.
💡 When an intercept is included:
If there are $k$ categories for a qualitative explanatory variable, then include in the regression only $k$-1 dummy variables.
The left-out category (i.e. with no dummy variable) is called the base, benchmark, control, comparison, reference, or omitted category.
The estimated intercept is the mean value of the dependent variable in the benchmark category.
The coefficient estimate for a dummy variable represents how the intercept would change due to being in that category rather than in the benchmark category; it is how the mean value of the dependent variable in that category differs from that in the benchmark category.
For example, suppose that the variable $Y$ is reading satisfaction, which is explained theoretically by font, screen size, age etc.
To take font into account when you have two different fonts, you can include among your independent variables the variable $font$, which equals:
1 — if an observation has font size 1,
0 — if not.
To take font into account when comparing 3 different fonts we use two dummy variables:
The first can be named $font \space 2$ and defined as: 1 if the observation has font size 2 and 0 otherwise.
The second may be called $font \space 3$ and defined as: 1 if the observation has font size 3 and 0 otherwise.
Font size 1, which has no dummy of its own, is the benchmark category.
By using regression techniques we may estimate whether there is a difference between font sizes, even when accounting for factors such as differences in internet usage, education, etc. We estimate the model:
$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + 𝜀_i$
Where:
$Y$ — the reading satisfaction,
$X_{1i}$ — denoted $font \space 2$: 1 if font size 2 and 0 otherwise,
$X_{2i}$ — denoted $font \space 3$: 1 if font size 3 and 0 otherwise.
The SPSS result.
The reading satisfaction for $font \space size \space 2$ is on average 0.279 lower than that for $font \space size \space 1$.
The reading satisfaction for $font \space size \space 3$ is on average 0.158 higher than that for $font \space size \space 1$.
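A sketch of $k-1$ dummy coding in Python; the column names and data are illustrative, not from the original study:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "satisfaction": [3.9, 3.5, 4.1, 3.6, 3.4, 4.0, 3.8, 3.3, 4.2],
    "font": ["1", "2", "3", "2", "2", "3", "1", "2", "3"],
})

# k - 1 = 2 dummies; font size 1 is the omitted benchmark category.
dummies = pd.get_dummies(df["font"], prefix="font", drop_first=True)
print(dummies.head())  # columns font_2 and font_3

# Equivalent via a formula: C(font) drops the first level by default.
fit = smf.ols("satisfaction ~ C(font)", data=df).fit()
print(fit.params)  # intercept = benchmark mean; dummies = differences from it
```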
Functional form
Moderation effect
Assumptions of regression analysis
After the independent variables are chosen, the next step is to choose the functional form of the relationship between the dependent variable and each of the independent variables.
A logarithmic transformation makes a non-linear relationship more linear.
Interpretation: In this functional form the slope parameters are interpreted as elasticity coefficients: a one percent change in $x$ will cause a $β_1$ percent change in $y$. For example, if the estimated coefficient is −2, a 1% increase in $x$ generates a 2% decrease in $y$.
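For completeness, a short sketch of the standard log-log (double-log) form behind this elasticity interpretation:

```latex
% Log-log (double-log) functional form: both y and x enter in logarithms.
\ln y = \beta_0 + \beta_1 \ln x + \varepsilon
% Differentiating with respect to x: (1/y)(dy/dx) = \beta_1 (1/x), hence
\beta_1 = \frac{dy/y}{dx/x}
% i.e. the percentage change in y caused by a one percent change in x.
```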
Choosing a functional form
Hypothesis 1: Visual Design (VD) will have a non-linear relationship with students’ perceived usefulness of collaborative web-based learning.
If visual design increases by 1% then perceived usefulness of web-based learning increases by 1.223%.
Scatter plots
Polynomial functional forms express Y as a function of the independent variables, some of which are raised to powers other than 1. For example, in a second-degree polynomial (also called a quadratic) equation, at least one independent variable is squared:
$Y_i = β_0 + β_1X_i + β_2X_i^2 + ε_i$
Hypothesis 2: Perceived ease of use will have a non-linear relationship with usefulness in collaborative web-based learning.
U-shape (Left) and Inverted U-shape (Right)
We can see that the coefficient on ease of use is positive and the coefficient on ease of use squared is negative, indicating a non-linear relationship (an inverted U-shape).
We want to model the relationship between ease of use and usability of a web learning system. We have the following:
$Y$ = Usefulness (1–5)
$X$ = Ease of use (1–5)
The Relationship Between Usefulness and Ease of Use.
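A sketch of estimating the quadratic form for this example: include both $x$ and $x^2$ as regressors. The data are simulated to produce an inverted U-shape, as described in the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
ease = rng.uniform(1, 5, 300)                                      # X: ease of use (1-5)
useful = 1 + 1.6 * ease - 0.2 * ease**2 + rng.normal(0, 0.3, 300)  # Y: usefulness

X = sm.add_constant(np.column_stack([ease, ease**2]))  # x and x squared
fit = sm.OLS(useful, X).fit()
print(fit.params)  # expect b_1 > 0 and b_2 < 0: an inverted U-shape
```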
In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. It is usually expressed as follows, where $x_2$ is the moderator:
$y = β_0 + β_1x_1 + β_2x_2 + β_3x_1x_2 + ε$
The effect of how much you exercise on muscle increase also depends on age.
$y$ = muscle increase due to training
$x_1$ = how much you exercise
$x_2$ = age
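A sketch of the moderation model in Python: the product term $x_1x_2$ lets the effect of exercise vary with age. Variable names follow the example; the data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exercise = rng.uniform(0, 10, 300)   # x1: hours of exercise per week
age = rng.uniform(20, 60, 300)       # x2: age, the moderator
muscle = (0.5 + 0.8 * exercise - 0.01 * exercise * age
          + rng.normal(0, 0.5, 300))  # effect of exercise weakens with age

X = sm.add_constant(np.column_stack([exercise, age, exercise * age]))
fit = sm.OLS(muscle, X).fit()
print(fit.params)  # the coefficient on x1*x2 captures the moderation effect
```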
In an intervening variable model, a variable $X$ is postulated to exert an effect on an outcome variable $Y$ through one or more intervening variables called mediators ($M$).
Mediational models advance an $X → M → Y$ causal sequence, and seek to illustrate the mechanisms through which $X$ and $Y$ are related.
Conditions of mediation
Mediational designs advance a time-based model of events whereby $X$ occurs before $M$, which in turn occurs before $Y$. It is the temporal relationships of the underlying phenomena that are at issue, not necessarily the timing of measurements.
Types of Mediation.
Equation $X → Y$
$y = β_0 + β_{yx}x + ε_{yx}$
Equation $X → M$
$m = β_0 + β_{mx}x + ε_{mx}$
Equation $X, M → Y$
$y = β_0 + β_{ym}m + β_{yxm}x + ε_{yxm}$
Testing for Indirect Effects.
Testing for Partial Mediation Effects.
Testing for Full Mediation Effects.
$X$ — AI-based technology applications
$M$ — the human computer interaction experience which is a mediator
$Y$ — learning effectiveness
Step 1: $X → Y$
$y = β_0 + β_{yx}x + ε_{yx} = 1.98 + 0.4878x + ε_{yx} \space (p–value = 0.00001)$
Step 2: $X → M$
$m = β_0 + β_{mx}x + ε_{mx} = 0.87 + 0.845x + ε_{mx} \space (p–value = 0.013)$
Step 3 and 4
$y = β_0 + β_{ym}m + β_{yxm}x + ε_{yxm} = 0.81 + 0.6325m + 0.35x + ε_{yxm} \space (p–value \space β_{ym} = 0.013, \space p–value \space β_{yxm} = 0.09)$
Conclusions: At the 5% level of significance, $β_{yxm}$ is not significant, indicating full mediation. However, it is not very small compared with $β_{yx}$, indicating that the mediation is not so substantial.
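A sketch of the three regression steps above in Python, on simulated data where $X$ affects $Y$ partly through $M$; names follow the AI-technology example, but the numbers will not match the text's output:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 400)                            # X: AI-based technology use
m = 0.9 + 0.8 * x + rng.normal(0, 1, 400)            # M: HCI experience (mediator)
y = 0.8 + 0.6 * m + 0.3 * x + rng.normal(0, 1, 400)  # Y: learning effectiveness

step1 = sm.OLS(y, sm.add_constant(x)).fit()                        # X -> Y
step2 = sm.OLS(m, sm.add_constant(x)).fit()                        # X -> M
step3 = sm.OLS(y, sm.add_constant(np.column_stack([m, x]))).fit()  # X, M -> Y
print(step1.params[1], step2.params[1], step3.params[1:])
```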
Statistical methods are always based on some assumption regarding the data.
We have the following simple linear regression model: $y = β_0 + β_1x + ε$,
or the multiple linear regression model: $y = β_0 + β_1x_1 + β_2x_2 + β_3x_3 + ε$.
The expected value of the error term is 0, hence: $E(ε) = 0$.
This implies there is no systematic over- or under-prediction by the regression line.
We assume that the variance of the error term $ε$ is constant over all observations; hence it is equal to $σ^2$ for all observations. This is usually referred to as a homoscedastic error term. If the variance is not constant then the error term is heteroscedastic.
If the spread in the residual plot expands, it suggests heteroscedasticity; if it does not, it suggests homoscedasticity.
The error term $ε$ is normally distributed.
The assumptions of homoscedasticity and normality may be validated using graphs. This is usually called residual analysis, where one analyses the following residual:
$y - \hat{y} = y - (b_0 + b_1x)$
By plotting the residuals against an independent variable, one may see whether the variance is constant or not.
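A sketch of such a residual plot in Python, on simulated data: plot $y - \hat{y}$ against the independent variable and look for a constant spread:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 1, 200)
fit = sm.OLS(y, sm.add_constant(x)).fit()

residuals = y - fit.fittedvalues      # y - (b_0 + b_1 x)
plt.scatter(x, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Independent variable x")
plt.ylabel("Residual")
plt.title("A fan shape would suggest heteroscedasticity")
plt.show()
```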
We want to investigate the effect of Visual Design (VD) on perceived usefulness of collaborative web-based learning. Hence, we are interested in the model: $y = 𝛽_0 + 𝛽_1x + e$, which is estimated by: $y = b_0 + b_1x$, where $y$ = perceived usefulness of collaborative web-based learning and $x$ = Visual Design (VD).
The Estimated Model.
Statistical solutions:
Apply Weighted least squares.
Apply White's heteroscedasticity-consistent covariance matrix (most common; see the sketch after this list).
We might need to rethink our model:
Is there omitted-variable bias?
Is the functional form correct?
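A sketch of the two statistical fixes named above, via statsmodels; the data and the weights are simulated, illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)   # error variance grows with x
X = sm.add_constant(x)

robust = sm.OLS(y, X).fit(cov_type="HC0")  # White's robust standard errors
print(robust.bse)                          # corrected standard errors

wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights proportional to 1/variance
print(wls.params)
```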
Scatter Plot.
Log Transform with Even Distribution.
The main consequence of violating this assumption is that we cannot trust the estimated standard errors, and therefore we cannot trust the t-test and F-test.
A histogram is a good way of investigating the normality assumption.
A Histogram for Non-normal Residuals.
A Histogram for Normally Distributed Residuals.
The main consequence of violating this assumption is that we cannot trust the estimated standard errors, and therefore the t-test and F-test, but only in small samples.
If we have two or more explanatory variables that are highly correlated then we have a multicollinearity problem. This may be detected using a correlation matrix: if the Pearson correlation is high then we have a multicollinearity problem.
The estimated parameters become unstable and we have an inflated risk of committing a type II error (not rejecting a false null hypothesis).
We want to estimate the following model: $\hat{y} = b_0 + b_1x_1 + b_2x_2$, where $y$ = consumption, $x_1$ = income, $x_2$ = wealth.
The Result.
The Correlation Matrix.
Based on this result we may conclude that income does not affect consumption, since the p-value (0.29) is greater than any of the conventional significance levels (0.01, 0.05 and 0.1).
We may also conclude that wealth does not affect consumption, since the p-value (0.615) is greater than any of the conventional significance levels (0.01, 0.05 and 0.1).
This is most likely due to a type II error (not rejecting a false null hypothesis) caused by a multicollinearity problem. Solutions to mitigate this error are to use a single variable or to create an index; a detection sketch follows below.
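A sketch of detecting multicollinearity with a correlation matrix and variance inflation factors (VIF); income and wealth are simulated here to be highly correlated, as in the consumption example:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
income = rng.normal(50, 10, 200)
wealth = 5 * income + rng.normal(0, 5, 200)   # wealth almost determined by income
X = sm.add_constant(np.column_stack([income, wealth]))

print(np.corrcoef(income, wealth)[0, 1])      # Pearson correlation close to 1
for i in (1, 2):                              # skip the constant column
    print(variance_inflation_factor(X, i))    # VIF far above 10 signals a problem
```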