Linear Regression Notes
Recap of Last Week's Concepts
Last week's discussion covered association, correlation, and significance. Association measures the strength of the relationship between two variables. Correlation measures the strength and directionality of the relationship between two variables. Significance helps determine if an observed parameter reflects a real population phenomenon or is due to statistical coincidence.
Recap - Significance
Significance testing, illustrated by Fisher's tea-tasting experiment with Muriel Bristol (the "lady tasting tea"), uses the P-value. The P-value represents the probability of observing a result at least as extreme as the one found, assuming the null hypothesis (H0) is true. A normal distribution helps visualize how “common” or “unlikely” results are.
Regression and Correlation: Distinctions
Correlation measures the extent to which two variables (x and y) move together. However, correlation has limitations:
- It cannot identify relationships between more than two variables.
- It cannot predict values. For example, while education and income are correlated (e.g., 0.64), correlation alone cannot predict how much more a BA earner makes compared to a high school graduate.
- It cannot determine the unique relevance of an individual variable within a group of correlated variables.
Consider the relationships between education, intelligence, and income. European Studies graduates from FASoS earn more, but they are also more intelligent. Intelligence also predicts education. Correlation can't determine if higher earnings are due to education or intelligence.
To isolate the effect of education on income, we need a tool that assesses income differences for people of comparable intelligence.
What is Linear Regression Analysis?
Linear regression is the estimation of the values of one variable from the values of another variable (Diamond et al., 2015). It's a statistical tool that helps:
- Sort out which factors matter most in explaining an outcome.
- Identify factors that can be ignored.
- Understand how these factors interact.
- Determine the certainty of these factors.
Linear regression is a fundamental technique used in many fields, including artificial intelligence (AI). In AI, linear regression is used as a simple machine learning algorithm to predict a target variable based on input variables. It is used in supervised learning to train a model on input-output pairs. Linear regression is a building block for complex machine learning algorithms, where it can preprocess data and estimate initial weights of neural networks.
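The supervised-learning framing above can be sketched in a few lines: fit a weight and bias on input-output pairs by gradient descent on the mean squared error. The data, learning rate, and step count below are made up for illustration and are not from the lecture.

```python
# Linear regression as a tiny supervised-learning algorithm:
# learn w and b in y = w*x + b by gradient descent on mean squared error.

def train(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]   # generated by y = 2x + 1, so the fit should approach w=2, b=1
w, b = train(xs, ys)
print(w, b)
```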
Basic Glossary
- Dependent variable: The variable being predicted (outcome variable).
- Independent variable: The variable used to predict the dependent variable (predictor).
- Simple regression: Linear regression with a single independent variable.
- Multiple regression: Linear regression with multiple independent variables.
For example:
- IV (Independent Variable): Education
- DV (Dependent Variable): Income
Linear Regression
Linear regression aims to fit a straight line to the dependent variable. The dependent variable needs to be interval/ratio, but the independent variable(s) can vary.
This type of regression predicts a linear relationship where every change in X results in a linear change in Y.
How Regression Analysis Works
Regression analysis starts with a theory: Phenomenon Y (DV) can be explained by X (IV). Examples:
- Yearly ice-cream consumption (y) can be explained by total hours of sunshine (x).
- Vote for leftist parties (y) could be driven by income (x).
- Income (y) can be explained by a person's education (x).
The linear regression equation is:
Y = a + bX + e
Where:
- Y is the Dependent Variable
- a is the intercept (mean income if someone had no education)
- b is the coefficient for the independent variable (Education)
- e is the error term (the part of the dependent variable that remains unexplained by our co-variates)
Aim
The aim is to draw a straight (linear!) line that best fits the data. This is the regression line. It is achieved through the method of ordinary least squares (OLS). The goal of least squares is to find the line that minimizes the sum of the squared distances between the regression line and the observed values.
The regression line is a linear representation of two variables based on observed values. We use the slope and intercept to make predictions about income (DV) based on education (IV).
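The least-squares calculation can be sketched directly: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The (education, income) pairs below are made up for illustration; they are not the lecture's data.

```python
# Ordinary least squares for a single IV, computed from the textbook formulas.

def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x  # the line passes through the mean point
    return intercept, slope

education = [8, 10, 12, 14, 16]          # years of schooling (made up)
income = [2400, 2900, 3200, 3700, 4100]  # monthly income in euros (made up)
a, b = ols_fit(education, income)
print(a, b)  # intercept and slope of the best-fitting line
```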
Slope
The slope is the estimated amount by which the DV goes up for every unit increase in the IV. In our case: by how many euros does your income increase for every year of education? This predicted change is the same at every point on the line, because it is linear. This means that every change in x should result in the same change in y!
For example, the difference between the points on the regression line for 0 years and 1 year of education is 198.6. So, based on our observed information, we can expect that for every additional year of education, individuals earn an additional €198.6 a month.
Mathematically this means we can express income as:
Income = 198.6 × years of education
Wherever we look on the line, a one-unit change in X (education) leads to a 198.6 increase in Y (income).
Thus, the equation becomes:
Y = a + 198.6X + e
Intercept
The intercept tells us the expected value of Y when X is 0; in other words, the expected value of our dependent variable when our independent variable is 0! In our case, it is 870.38. The intercept is the point at which our regression line crosses the y-axis.
Thus, the equation becomes:
Y = 870.38 + 198.6X + e
In principle, this is equal to the € you would earn if you did not have any education plus the additional income you earn as the result of every year of education.
With this information, we can now make predictions about income based on the observed years of education. Let us try for 10 years.
Intercept + Slope*years of education = income
870.38 + 198.6*years of education = income
870.38 + 198.6*10 = Income = 2,856.38
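This prediction step can be written directly as code, using the intercept and slope from the notes:

```python
# Predicting income from years of education with the estimated line
# income = 870.38 + 198.6 * years (values taken from the notes).
INTERCEPT = 870.38  # expected monthly income at 0 years of education
SLOPE = 198.6       # expected euro increase per extra year of education

def predict_income(years_of_education):
    return INTERCEPT + SLOPE * years_of_education

print(round(predict_income(10), 2))  # 2856.38
```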
Estimation and Error
Estimation is only an estimate based on observed values, and it is an imperfect estimate. The line we plot is a simplified model of reality: “all models are wrong, but some are useful” (George Box). In our case: income is clearly more than a function of your education, yet looking at it can help us to understand the role of education!
To work with these models, we therefore need to know “how wrong” they are!
To measure how far our estimated values typically lie from the observed values, we use the standard error. It is calculated by taking the square root of the sum of the squared distances between observed and estimated values, divided by the number of observations. The standard error is a measure that tells us how wrong your model is, by giving the average distance between the observed and estimated values.
Standard error = √( Σ(observed − estimated)² / n )
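The formula can be sketched as code; the observed and fitted values below are invented for illustration (some texts divide by n − 2 instead of n, but the notes use n):

```python
import math

# Standard error of the estimate: the typical distance between
# observed values and the values predicted by the regression line.
def standard_error(observed, fitted):
    residuals = [y - y_hat for y, y_hat in zip(observed, fitted)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(observed))

observed = [2400, 2900, 3200, 3700, 4100]  # made-up observed incomes
fitted = [2420, 2840, 3260, 3680, 4100]    # made-up values on a regression line
print(standard_error(observed, fitted))    # 40.0
```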
Explained Variance
For this we look at the R2:
- Tells us the proportion of explained variance.
- Ranges from 0 to 1, with 0 being low and 1 high.
- An R2 of 0.7, for example, tells us the IV explains around 70% of the variation in the DV.
→ The higher the R2, the better our model predicts the DV!
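R2 can be computed as one minus the ratio of residual variance to total variance. A sketch with made-up values (not the lecture's data):

```python
# R2: the proportion of the DV's variance explained by the model,
# computed as 1 - (residual sum of squares / total sum of squares).
def r_squared(observed, fitted):
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_resid = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted))
    return 1 - ss_resid / ss_total

observed = [2400, 2900, 3200, 3700, 4100]  # made-up observed incomes
fitted = [2420, 2840, 3260, 3680, 4100]    # made-up fitted values
print(r_squared(observed, fitted))         # close to 1: the fit is very good
```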
Bringing it all together
- Coefficient: our slope
- Constant: Our intercept
- R2
- P-value: Our significance. A p-value below 0.001 tells us it is extremely unlikely (less than 1 in 1,000) that our results are a statistical coincidence!
- The standard error
- Confidence interval: The true value lies within this range 95% of the time.
- Number of observations
Multiple Regression
So far, we looked at regressions where we used a single IV to predict the DV. However, in most cases it is clear that social phenomena cannot be reduced to a single IV. Potential IVs are often deeply interlinked, and using only a single IV runs the risk of poorly identifying effects. Multiple regression allows us to “control” for those effects and more precisely identify the unique relevance of each IV!
The equation is:
Y = a + b1X1 + b2X2 + e
We need a tool to understand the individual effect of these variables:
- How much does intelligence influence income for people of comparable education?
- How much does education influence income for people of comparable intelligence?
Here we see what happens when we add a second IV to the equation:
First, we see intelligence also strongly predicts income. Smarter people earn more… The result also seems statistically significant!
Second, our coefficient for education goes down. This tells us that part of the education effect we identified earlier was likely an intelligence effect!
Note how the coefficient tells us the estimated change in the DV (income) for every 1-unit change in the respective IV.
→ This means we have to interpret this in light of the measurement of the IV!
In this case:
Every extra year of education predicts a €127 increase.
Every extra IQ-point predicts a €92 increase.
Finally, our R2 goes up significantly. This is typically the case, as adding variables to the equation makes our model more accurate at explaining the DV. R2 goes up to the extent that the additional IV helps to explain previously unexplained variance in the DV.
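A multiple regression with two IVs can be sketched with NumPy's least-squares solver. The data below are constructed, not the lecture's: income is generated exactly as 500 + 100 × education + 50 × IQ, so the fit should recover those coefficients.

```python
import numpy as np

# Multiple regression sketch: education and IQ jointly predicting income.
education = np.array([8, 10, 12, 14, 16, 12])
iq = np.array([95, 100, 105, 110, 120, 98])
income = 500 + 100 * education + 50 * iq  # constructed without noise

# Design matrix with a column of 1s for the intercept (the constant).
X = np.column_stack([np.ones_like(education), education, iq])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
print(coef)  # [intercept, education coefficient, iq coefficient]
```

Because the two IVs are correlated but not identical, least squares can separate their unique contributions, which is exactly what “controlling” for a variable means here.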
Taken together, this all tells us that education matters for income, but so does intelligence!
Quiz
                  (1) GDP/Capita
Net migration     1215.84*** (5.36)
Constant          23708.39*** (26.48)
N                 85,528
*p<0.05, **p<0.01, ***p<0.001
- A one-unit change in the net migration rate is associated with a $1,215.84 increase in GDP per capita.
- The constant, or GDP per capita when net migration is 0, is about $23,700.
- The coefficients are highly statistically significant. If there were no statistical relation between migration and GDP, we would expect to wrongfully reject the null hypothesis less than 1 time in 1,000 by chance.
- We base these findings on 85,528 observations.
We cannot claim that migration CAUSES high GDP. (Maybe people just flock to prosperous places!)
There was only one IV. Hence, we do not know whether migration really accounts for all this prosperity OR whether it is correlated with other predictors of prosperity (stability, innovation, etc.).
Our estimate is only as good as our sample, so it could be that the real effect of migration would be different if we added more observations. (Though 85,000 towns probably represent a good part of the world.)
Quiz 2
                      (1) European integration has gone too far
Years of education    0.0777*** (0.00581)
Annual income         -0.00005*** (0.000001)
Number of children    -0.00950 (0.0212)
Constant              3.883*** (0.0912)
N                     15,613
*p<0.05, **p<0.01, ***p<0.001
Education significantly affects EU attitudes, and so does income. The number of children seems unimportant. If all IVs were 0, we would expect citizens to be relatively unlikely to think the EU has gone too far.
1-unit change of income and education means something substantively different!
Why is the effect size so different between years of education and income?
In this case, we expect roughly 13 additional years of education to shift the answer on the scale by 1 point on average (1/0.0777 ≈ 13). Conversely, because the income coefficient is negative, agreement that integration has gone too far is expected to drop by 1 point for every €20,000 one earns (1/0.00005 = 20,000).
Conclusion
- Regression analysis is a statistical tool to test theories about the relationship (not causation!) between two or more variables.
- Multiple regression allows us to disentangle complex interrelations between multiple IVs and a DV.
- Power to estimate and predict outcomes, gauge model-error and indicate explained variance.
- Results of regression produce estimates which are only as good as the underlying data and model!