
Regression Analysis

In regression analysis we wish to investigate the relationship between one or more independent variables and a dependent variable.

  • If we have one independent variable it is called simple linear regression.

  • If we have more than one independent variable it is called multiple linear regression.

Method for Regression Analysis

  1. Statement of Theory or Hypothesis

Example: Theoretically, as age increases, security awareness decreases. Therefore, an increase of one year in age should result in a decrease in security awareness. If we let variable X represent age and variable Y represent security awareness, then Y is expected to decrease as X increases.


  2. Specify a Statistical Model

In statistics we often wish to build models of the type: $Outcome_i = Model_i + Error_i$

In regression analysis this model can be written as:

$Y_i = β_0 + β_1X_{1i} + ε_i,$

where $β_1 < 0$ according to our theory and:

  • $Y$ = dependent variable,

  • $X_1$ = independent variable 1,

  • $ε$ = error term.

Example: We have the model $Y_i = β_0 + β_1X_{1i} + ε_i$, where $Y$ = dependent variable corresponding to security awareness, $X_1$ = independent variable corresponding to age, and $ε$ = error term.


  3. Find Data

Data obtained for regression analysis may be:

  • Cross-sectional (e.g. country data, firm data),

  • Time series data (e.g. country data over time),

  • Panel data (a mix of cross-sectional and time series data).


  4. Estimate the Model

We want to estimate the effect of $X$ on $Y$ and we need to estimate the parameters $β_0$ and $β_1$.


  5. Test Hypothesis

  6. Make Predictions

Why do we have an error term in the regression?

The error term stands for all variables that affect the dependent variable but are not included as explanatory variables. Why have it rather than throwing in all possible variables?

  1. Measurement error, caused for example by poor proxy variables or poor indices.

  2. Specification error: core variables versus peripheral (less important) variables. Peripheral variables may most appropriately be included in the error term, but core variables need to be included.

  3. Intrinsic randomness in human behaviour.

Simple Linear Regression

In the previous model, $β_0$ and $β_1$ are population parameters. We estimate these parameters using ordinary least squares, obtaining the estimates $b_0$ and $b_1$. Hence the population model $Y_i = β_0 + β_1X_{1i} + ε_i$ is estimated using the following equation:

$\hat{Y}_i = b_0 + b_1X_{1i}$

Simple Regression Analysis

To estimate the parameters ($β_0$ and $β_1$) we minimise the sum of squared residuals (the estimated error terms).

The parameter for the independent variable $X$ may be interpreted as the estimated linear effect on $Y$ of a one-unit change in $X$.

Example: We wish to investigate the effect of age on security awareness and specify the model $Y_i = β_0 + β_1X_{1i} + ε_i$, which is estimated by $\hat{Y}_i = b_0 + b_1X_{1i}$, where $Y$ = dependent variable corresponding to security awareness, $X$ = independent variable corresponding to age, and $ε$ = error term.

SPSS result

We have the estimated model: $\hat{Y}_i = b_0 + b_1X_{1i} = 20.047 − 0.201X_{1i}$.

If age increases by 1 year then security awareness decreases by 0.201 (it is an index from 0–20).
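
As a minimal sketch of how such a model can be estimated outside SPSS (using simulated, hypothetical data and Python's statsmodels, not the study's actual observations):

```python
import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) data: age in years and a 0-20 security-awareness index.
rng = np.random.default_rng(0)
age = rng.uniform(18, 65, 200)
awareness = 20.0 - 0.2 * age + rng.normal(0, 2, 200)

X = sm.add_constant(age)          # adds the intercept column for b0
fit = sm.OLS(awareness, X).fit()  # ordinary least squares
print(fit.params)                 # [b0, b1]; cf. 20.047 and -0.201 in the SPSS output
```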

Multiple Linear Regression

We have the following model:

$Y_i = β_0 + β_1X_{1i} + β_2X_{2i} + \dots + β_KX_{Ki} + ε_i,$

where:

  • $Y$ = dependent variable,

  • $X_1$ = independent variable 1,

  • $X_2$ = independent variable 2,

  • $X_K$ = independent variable K,

  • $ε$ = error term.


We have: $Y_i = β_0 + β_1X_{1i} + β_2X_{2i} + \dots + β_KX_{Ki} + ε_i$, which is estimated by:

$\hat{Y}_i = b_0 + b_1X_{1i} + b_2X_{2i} + \dots + b_KX_{Ki}$

Example: To include other independent variables that may affect the credibility of a website we could estimate the multiple regression model $\hat{Y}_i = b_0 + b_1X_{1i} + b_2X_{2i} + \dots + b_KX_{Ki}$, where $Y$ = website credibility (dependent variable) and the independent variables are $X_1$ = age, $X_2$ = gender, and $X_3$ = experience, with further independent variables added as needed.

Multiple Regression Analysis

To estimate the parameters we minimise the sum of squared residuals.

Each parameter for the independent variables $X_i$ ($i = 1, \dots, K$) may be interpreted as the estimated linear effect on $Y$ of a one-unit change in $X_i$, after removing the estimated linear effects of the other $X_j$ ($j ≠ i$) on both $Y$ and $X_i$.

Example (Website Credibility): We estimate the following model: $\hat{Y}_i = b_0 + b_1X_{1i} + b_2X_{2i} + b_3X_{3i}$, where $Y$ = website credibility (1–6), $X_1$ = age in years, $X_2$ = gender (1 if female and 0 otherwise), and $X_3$ = number of times the internet is used per day.

SPSS result: based on 408 observations, we obtain the following estimates.

If age increases by 1 year then credibility increases by 0.014 points.

If internet frequency increases by one use per day, then credibility increases by 0.129 points.
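
A comparable sketch for the multiple regression, again with simulated (hypothetical) data whose coefficients are chosen to echo the example:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated (hypothetical) data echoing the website-credibility example.
rng = np.random.default_rng(1)
n = 408
df = pd.DataFrame({
    "age": rng.uniform(18, 70, n),
    "female": rng.integers(0, 2, n),     # gender dummy: 1 = female, 0 otherwise
    "internet_freq": rng.poisson(5, n),  # internet uses per day
})
df["credibility"] = (3.0 + 0.014 * df["age"] + 0.129 * df["internet_freq"]
                     + rng.normal(0, 1, n))

X = sm.add_constant(df[["age", "female", "internet_freq"]])
fit = sm.OLS(df["credibility"], X).fit()
print(fit.summary())  # coefficient table with t-values and p-values, as in SPSS
```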

Coefficient of Determination

The coefficient of determination, or $R^2$, measures how much of the variation in the dependent variable is explained by the independent variables. It is built from the following sums of squares:

  • $SS_T$ — the total variation in the dependent variable, $Y$, calculated as: $\sum(y_i - \bar{y})^2$

  • $SS_R$ — the total variation in the error term, $ε$, calculated as the sum of squared residuals: $\sum(y_i - \hat{y}_i)^2$

  • $SS_M$ — the difference between $SS_T$ and $SS_R$, i.e. how much of the variation is explained by the regression model, calculated as: $\sum(\hat{y}_i - \bar{y})^2$

$R^2 = \frac{SS_M}{SS_T}$ is, therefore, the proportion of variance explained by the regression model.

$R^2$ represents the fraction (between 0 and 1 inclusive) of the sample variance in $Y$ the regression explains. As $R^2$ increases, the fit of the regression to the observations increases. Having $R^2 = 1$ means a perfect fit to the observations.

Example: $R^2$ is 0.048. Therefore our independent variables explain 0.048, or 4.8%, of the variation in website credibility.
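
The computation itself is simple; a small Python helper, assuming arrays of observed and fitted values:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = SS_M / SS_T, computed from observed y and fitted y_hat."""
    ss_t = np.sum((y - np.mean(y)) ** 2)  # total variation in y
    ss_r = np.sum((y - y_hat) ** 2)       # variation left in the residuals
    ss_m = ss_t - ss_r                    # variation explained by the model
    return ss_m / ss_t

# e.g. with the fitted model from the previous sketch:
# print(r_squared(df["credibility"].to_numpy(), fit.fittedvalues.to_numpy()))
```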

Output from SPSS.

Hypothesis Testing

We use a t-test to test the hypotheses $H_0: β_i = 0$ vs $H_1: β_i ≠ 0$.

$H_0$ : the independent variable $i$ does not have an effect on the dependent variable.

$H_1$ : the independent variable $i$ has an effect on the dependent variable.

This may be done using a t-statistic that follows a t-distribution with $N - K - 1$ degrees of freedom:

$t = \frac{b_i}{S_{b_i}}$

Example

Test if the individual parameter has an impact.

Output from SPSS.


For age, we have:

$t = \frac{b_i}{S_{b_i}} = \frac{0.014}{0.006} ≈ 2.503$ (SPSS computes the ratio from unrounded values; with the rounded figures shown, $0.014/0.006 ≈ 2.33$).

Using p-values: The p-value is 0.013 which is less than $α=0.05$ and therefore we reject the null hypothesis.


For internet frequency, we have:

$t = \frac{b_i}{S_{b_i}} = \frac{0.129}{0.0620} = 2.076$

Using p-values: The p-value is 0.039 which is less than $α=0.05$ and therefore we reject the null hypothesis.
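
A sketch of the same t-test in Python, using scipy.stats; the figures are the rounded SPSS values from the table above:

```python
from scipy import stats

def t_test(b, se, df_resid):
    """Two-sided t-test of H0: beta_i = 0 from an estimate and its standard error."""
    t = b / se
    p = 2 * stats.t.sf(abs(t), df_resid)  # survival function = 1 - CDF
    return t, p

# Age coefficient from the SPSS table; df = N - K - 1 = 408 - 3 - 1 = 404.
# (Rounded inputs give t ~ 2.33; SPSS reports 2.503 from unrounded values.)
print(t_test(0.014, 0.006, 404))
```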

Hypothesis Testing (Multiple Regression)

We may also want to test the overall significance of a multiple regression model. Hence, we may want to test the following hypotheses: $H_0: β_1 = β_2 = \dots = β_K = 0$ vs $H_1$: at least one $β_i ≠ 0$ ($i = 1, \dots, K$).

Hence, we test if:

$H_0$ : no independent variable has an effect on the dependent variable

$H_1$ : at least one independent variable has an effect on the dependent variable

This may be done using the F-test, which follows an F-distribution with $K$ degrees of freedom in the numerator and $N - K - 1$ degrees of freedom in the denominator:

$F = \frac{MS_M}{MS_R}$

$MS_R$ is the mean variation in the error term, $ε$, calculated as $MS_R = \frac{SS_R}{N - K - 1}$.

$MS_M$ is the average variation explained by the regression model, calculated as $MS_M = \frac{SS_M}{K}$.
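
A small helper computing this F-statistic and its p-value with scipy.stats, assuming the sums of squares are already known:

```python
from scipy import stats

def f_test(ss_m, ss_r, k, n):
    """Overall F-test with K and N - K - 1 degrees of freedom."""
    ms_m = ss_m / k            # average variation explained by the model
    ms_r = ss_r / (n - k - 1)  # mean variation in the error term
    f = ms_m / ms_r
    p = stats.f.sf(f, k, n - k - 1)
    return f, p
```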

Example (Website Credibility)

Test the overall significance at the 5 % level of significance.


Based on the output from SPSS, the p-value of the F-test is less than 0.001, which is less than 0.05, and therefore we reject the null hypothesis.

Conclusion: at least one independent variable has an effect on the dependent variable.

Dummy Variables

Sometimes we want to investigate the effect of a non-metric variable on a metric variable.

💡 When an intercept is included:

  • If there are $k$ categories for a qualitative explanatory variable, then include only $k - 1$ dummy variables in the regression.

  • The left-out category (i.e. with no dummy variable) is called the base, benchmark, control, comparison, reference, or omitted category.

  • The estimated intercept is the mean value of the dependent variable in the benchmark category.

  • The coefficient estimate for a dummy variable represents how the intercept changes due to being in that category rather than in the benchmark category; it is how the mean value of the dependent variable in that category differs from that in the benchmark category.

Example

For example, suppose that the variable $Y$ is reading satisfaction, which is explained theoretically by font, screen size, age etc.

To take font into account when you have two different fonts, you can include among your independent variables the dummy variable $font$, which equals:

  • 1 — if an observation has font size 1,

  • 0 — if not.

To take font into account when comparing 3 different fonts, we need two dummy variables:

  • The first can be named $font \space 2$ and equals 1 for font size 2 and 0 otherwise (font size 1 is the benchmark).

  • The second may be called $font \space 3$ and equals 1 for font size 3 and 0 otherwise.

Example (Font Size and Reading Satisfaction)

By using regression techniques we may estimate whether there is a difference between font sizes, even when we account for factors such as differences in internet usage, education, etc. We estimate the model:

$\hat{Y}_i = b_0 + b_1X_{1i} + b_2X_{2i}$

Where:

  • $Y$ — the reading satisfaction,

  • $X_{1i}$ — the dummy $font \space 2$: 1 for font size 2, 0 otherwise,

  • $X_{2i}$ — the dummy $font \space 3$: 1 for font size 3, 0 otherwise.

The SPSS result.

The reading satisfaction for $font \space size \space 2$ is on average 0.279 lower than that for $font \space size \space 1$.

The reading satisfaction for $font \space size \space 3$ is on average 0.158 higher than that for $font \space size \space 1$.
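
A sketch of dummy coding in Python, using a hypothetical toy data set; statsmodels' formula interface builds the $k - 1$ dummies automatically:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with a three-level font factor.
df = pd.DataFrame({
    "satisfaction": [4.1, 3.8, 4.5, 3.2, 4.0, 4.6],
    "font": ["font1", "font2", "font3", "font2", "font1", "font3"],
})

# C(font) expands into k - 1 = 2 dummies; font1 becomes the benchmark category.
fit = smf.ols("satisfaction ~ C(font)", data=df).fit()
print(fit.params)  # intercept = mean for font1; dummies = differences from it
```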

We now turn to three further topics:

  • Functional form

  • Moderation effect

  • Assumptions of regression analysis

Functional Form

After the independent variables are chosen, the next step is to choose the functional form of the relationship between the dependent variable and each of the independent variables.

Log Transform

A log transform makes a non-linear relationship more linear. Taking logs of both sides gives the log-log model:

$\ln(Y_i) = β_0 + β_1\ln(X_i) + ε_i$

Interpretation: In this functional form the slope parameters are interpreted as elasticity coefficients: a one percent change in $x$ causes a $β_1$ percent change in $y$. For example, if the estimated coefficient is −2, a 1% increase in $x$ generates a 2% decrease in $y$.
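
A simulated (hypothetical) illustration of fitting a log-log model in Python:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) log-log data with a true elasticity of 1.2.
rng = np.random.default_rng(2)
x = rng.uniform(1, 5, 100)
y = np.exp(0.5 + 1.2 * np.log(x) + rng.normal(0, 0.1, 100))

fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print(fit.params)  # the slope is the elasticity of y with respect to x
```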

Example: Log Transform

Choosing a functional form

Hypothesis 1: Visual Design (VD), will have a non-linear relationship with students’ perceived usefulness of collaborative web-based learning.

If visual design increases by 1%, then perceived usefulness of web-based learning increases by 1.223%.

Output from SPSS.

Scatter plots


Polynomial Form

Polynomial functional forms express $Y$ as a function of the independent variables, some of which are raised to powers other than 1. For example, in a second-degree polynomial (also called a quadratic) equation, at least one independent variable is squared:

$Y_i = β_0 + β_1X_i + β_2X_i^2 + ε_i$
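
A simulated (hypothetical) illustration of fitting a quadratic in Python:

```python
import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) inverted U-shape: b1 > 0 and b2 < 0.
rng = np.random.default_rng(3)
x = rng.uniform(1, 5, 150)
y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(0, 0.2, 150)

X = sm.add_constant(np.column_stack([x, x**2]))  # include both x and x squared
fit = sm.OLS(y, X).fit()
print(fit.params)  # expect a positive estimate for x and a negative one for x^2
```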

Example: Polynomial Form

Hypothesis 2: Perceived ease of use will have a non-linear relationship with usefulness in collaborative web-based learning.

U-shape (Left) and Inverted U-shape (Right)


We can see that the coefficient on ease of use is positive and the coefficient on ease of use squared is negative, indicating a non-linear relationship (an inverted U-shape).

We want to model the relationship between ease of use and usability of a web learning system. We have the following:

$Y$ = Usefulness (1–5)

$X$ = Ease of use (1–5)

The Relationship Between Usefulness and Ease of Use.

Moderation Effect

In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable, referred to as the moderator variable or simply the moderator. It is usually expressed as follows, where $x_2$ is the moderator:

$y = β_0 + β_1x_1 + β_2x_2 + β_3x_1x_2 + ε$

Example

The effect of how much you exercise on muscle increase also depends on age.

$y$ = muscle increase due to training

$x_1$ = how much you exercise

$x_2$ = age
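
A sketch of fitting such an interaction in Python, with a hypothetical toy data set:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data for the exercise/age moderation example.
df = pd.DataFrame({
    "muscle":   [2.1, 3.5, 1.2, 2.8, 0.9, 1.7],
    "exercise": [3, 5, 2, 4, 1, 3],
    "age":      [25, 30, 50, 35, 60, 45],
})

# exercise:age adds the x1*x2 interaction term that captures the moderation.
fit = smf.ols("muscle ~ exercise + age + exercise:age", data=df).fit()
print(fit.params)
```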

Mediation Analysis

In an intervening-variable model, a variable $X$ is postulated to exert an effect on an outcome variable $Y$ through one or more intervening variables called mediators ($M$).

Mediational models advance an $X → M → Y$ causal sequence, and seek to illustrate the mechanisms through which $X$ and $Y$ are related.

Conditions of mediation

Mediational designs advance a time-based model of events whereby $X$ occurs before $M$, which in turn occurs before $Y$. It is the temporal relationships of the underlying phenomena that are at issue, not necessarily the timing of measurements.

Types of Mediation.

Different Equations

Equation $X → Y$: $y = β_0 + β_{yx}x + ε_{yx}$

Equation $X → M$: $m = β_0 + β_{mx}x + ε_{mx}$

Equation $X, M → Y$: $y = β_0 + β_{ym}m + β_{yxm}x + ε_{yxm}$

Testing for Indirect Effects.

Testing for Partial Mediation Effects.

Testing for Full Mediation Effects.

Example

$X$ — AI-based technology applications

$M$ — the human computer interaction experience which is a mediator

$Y$ — learning effectiveness


Step 1: $X → Y$

$y = β_0 + β_{yx}x + ε_{yx} = 1.98 + 0.4878x + ε_{yx}$ (p-value = 0.00001)

Step 2: $X → M$

$m = β_0 + β_{mx}x + ε_{mx} = 0.87 + 0.845x + ε_{mx}$ (p-value = 0.013)

Step 3 and 4

$y = β_0 + β_{ym}m + β_{yxm}x + ε_{yxm} = 0.81 + 0.6325m + 0.35x + ε_{yxm}$ (p-value $β_{ym}$ = 0.013, p-value $β_{yxm}$ = 0.09)

Conclusions: At the 5% level of significance, $β_{yxm}$ is not significant, indicating full mediation. However, its estimate (0.35) is not much smaller than $β_{yx}$ (0.4878), so the mediation is not substantial.
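
A sketch of the three mediation regressions in Python; the data frame `df` and its column names are assumptions, not the study's data:

```python
import statsmodels.formula.api as smf

# A sketch of the three mediation regressions, assuming a data frame `df`
# with columns x (AI applications), m (HCI experience) and y (learning).
step1 = smf.ols("y ~ x", data=df).fit()      # X -> Y: total effect beta_yx
step2 = smf.ols("m ~ x", data=df).fit()      # X -> M: effect on the mediator
step3 = smf.ols("y ~ m + x", data=df).fit()  # X, M -> Y: direct effect beta_yxm

# Full mediation: beta_yxm insignificant in step 3 while steps 1 and 2 hold.
print(step1.params["x"], step2.params["x"], step3.params["x"])
```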

Assumptions of Regression Model

Statistical methods are always based on some assumptions regarding the data.

We have the following simple linear regression model: $y = β_0 + β_1x + ε$,

or the multiple linear regression model: $y = β_0 + β_1x_1 + β_2x_2 + β_3x_3 + ε$.

Expected Value of the Error Term

The expected value of the error term is 0, hence: $E(ε) = 0$.

This implies there is no systematic over- or under-prediction by the regression line.

Homoscedasticity (Constant Variance)

We assume that the variance of the error term $ε$ is constant over all observations; it equals $σ^2$ for every observation. This is usually referred to as a homoscedastic error term. If the variance is not constant, the error term is heteroscedastic.


If the spread of the residuals widens across the plot, it suggests heteroscedasticity; if it stays constant, it suggests homoscedasticity.

Normality Assumption

The error term $ε$ is normally distributed.

Validating Model Assumptions

The assumptions of homoscedasticity and normality may be validated using graphs. This is usually called residual analysis, where one analyses the residuals:

$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1x_i)$

Investigating Homoscedasticity Assumption

By plotting the residuals against an independent variable, one may see whether the variance is constant or not; a minimal plotting sketch follows.
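
A minimal plotting sketch, assuming a fitted statsmodels result `fit` and a regressor `x` from one of the earlier sketches:

```python
import matplotlib.pyplot as plt

# Residual plot for a fitted statsmodels result `fit` and a regressor `x`
# (both hypothetical names from an earlier sketch).
plt.scatter(x, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()  # a spread that widens across x suggests heteroscedasticity
```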

Example

We want to investigate the effect of Visual Design (VD) on perceived usefulness of collaborative web-based learning. Hence, we are interested in the model $y = β_0 + β_1x + ε$, which is estimated by $\hat{y} = b_0 + b_1x$, where $y$ = perceived usefulness of collaborative web-based learning and $x$ = Visual Design (VD).

The Estimated Model.

Solutions

Statistical solutions:

  • Apply weighted least squares.

  • Apply White's heteroscedasticity-consistent covariance matrix (most common).

We might need to rethink our model:

  • Is there omitted-variable bias?

  • Is the functional form correct?

Scatter Plot.

Log Transform with Even Distribution.

Consequences of Violating the Homoscedasticity Assumption

The main consequence of violating the assumptions is that we cannot trust the estimated standard errors and therefore we cannot trust the t-test and F-test.

Investigating Normality Assumption

A histogram is a good way of investigating the normality assumption.

A Histogram for Non-normal Residuals.

A Histogram for Normally Distributed Residuals.
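
A minimal sketch, again assuming a fitted statsmodels result `fit` (a hypothetical name from an earlier sketch):

```python
import matplotlib.pyplot as plt

# Histogram of the residuals from a fitted statsmodels result `fit`.
plt.hist(fit.resid, bins=20)
plt.xlabel("residual")
plt.ylabel("frequency")
plt.show()  # a roughly bell-shaped histogram supports the normality assumption
```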

Consequences of Violating the Normality Assumption

The main consequence of violating this assumption is that we cannot trust the estimated standard errors, and therefore we cannot trust the t-test and F-test. This matters only in small samples, however.

Multicollinearity

If we have two or more explanatory variables that are highly correlated, then we have a multicollinearity problem. This may be detected using a correlation matrix: if the Pearson correlation between regressors is high, we have a multicollinearity problem. A short detection sketch follows.
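
A sketch of the check in Python; the data frame `df` and its column names are hypothetical, and the variance inflation factor (VIF) is a complementary diagnostic not mentioned in the notes above:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming a data frame `df` with columns income and wealth (hypothetical names):
print(df[["income", "wealth"]].corr())  # Pearson correlation matrix

# Variance inflation factors as a complementary check;
# a VIF well above 10 is a common rule-of-thumb warning sign.
X = sm.add_constant(df[["income", "wealth"]])
print([variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])])
```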

Consequence of Multicollinearity

The estimated parameters become unstable and we have an inflated risk of committing a type II error (not rejecting a false null hypothesis).

Example

We want to estimate the following model: $\hat{y} = b_0 + b_1x_1 + b_2x_2$, where $y$ = consumption, $x_1$ = income, $x_2$ = wealth.

The Result.

The Correlation Matrix.

Based on this result we may conclude that income does not affect consumption, since the p-value (0.29) is greater than any of the conventional significance levels (0.01, 0.05 and 0.1).

We may also conclude that wealth does not affect consumption, since the p-value (0.615) is greater than any of the conventional significance levels (0.01, 0.05 and 0.1).

This is most likely due to a type II error (not rejecting a false null hypothesis) caused by the multicollinearity problem. Solutions to mitigate this error are either to use a single variable or to create an index.
