Interpreting Bivariate Regression and Statistical Significance
The Concept and Interpretation of R2 (R-Squared)
- Definition and Function: R2 is a statistical measure that tells the researcher how much of the variation in the dependent variable (y) is explained by the regression model (the independent variables).
- Explaining Variation:
* If a model has an R2 of 0.25, it means the equation explains about a quarter (25%) of the variation in the outcome of interest.
* If the R2 is 0.5, then half of the variation in the dependent variable is explained by the independent variable. This also implies that the other half of the variation (50%) remains unexplained.
- Contextualizing Unexplained Variation:
* It is rare for one single thing to explain an entire outcome. Many factors usually affect the phenomena being studied.
* Unexplained variation suggests there are other factors that could be incorporated into the study to better explain the outcome. Examples of such "control variables" include gender, educational background, or income levels.
- Measurement Target: R2 specifically measures the effectiveness of a particular regression line in explaining the outcome. It represents how well the equation fits the data.
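As a rough illustration of "how well the equation fits the data," the sketch below fits a least-squares line and computes R2 as one minus the ratio of unexplained to total variation. The data points and the function names (`fit_line`, `r_squared`) are invented for illustration and do not come from any study discussed in these notes.

```python
def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys):
    """Share of the variation in ys that the fitted line accounts for."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Total variation: squared distances of each y from the mean of y.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Unexplained variation: squared distances of each y from the line.
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 2))  # 0.6
```

Here the line accounts for 60% of the variation in the made-up outcome, leaving 40% unexplained.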
Case Study: Foreign Aid and Public Opinion (Goldsmith et al.)
- Source Material: Ben Goldsmith and colleagues published an article titled "Doing Well by Doing Good: The Impact of Foreign Aid on Public Opinion" in 2010.
- Research Question: Does foreign aid extended by one country improve that country's image among the populations of the recipient countries?
- Methodology:
* The researchers used a multinational survey (data from many countries).
* They focused on the USAID program targeted to address HIV and AIDS, specifically the PEPFAR program.
- Analysis Approach ("First Cut"):
* The authors began with simple scatter plots of the outcome variable against the aid measure.
* Time Periods Observed:
* Entire period: 2007–2010.
* The Bush administration: 2007–2008.
* The early Obama administration: 2009–2010.
* They identified an OLS (Ordinary Least Squares) regression line, which is the "best fit" line that minimizes the squared distances between each data point (the dots) and the line.
- Results and Interpretation:
* Slope Trend: The regression lines have a positive slope, meaning that as targeted aid programs increase, positive perceptions of the U.S. also increase.
* Numerical Coefficients: For the period 2007–2010, the slope coefficient was reported as 0.26.
* Meaning of the Slope (b): A one-unit increase in the independent variable (HIV/AIDS programs) is associated with a 0.26-unit increase in the dependent variable (public perception, measured as a percentage).
* Substantive Significance: Although statistically significant, a 0.26 increase is only about a quarter of a percentage point in positive perception. This is a real effect, but not a "dramatic" one (it is not a 10 percentage point jump).
* Level of Confidence: The results were cited as "highly significant at the 99% level." This means the probability (p) of finding results at least this strong if the null hypothesis were true is less than 0.01 (p < 0.01).
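To make the gap between statistical and substantive significance concrete, the sketch below applies the reported slope of 0.26 to different changes in x. The function names are hypothetical, and the calculation assumes the linear relationship holds across the whole range, which the notes do not claim.

```python
SLOPE = 0.26  # slope coefficient reported in the notes for 2007-2010


def predicted_change(slope, delta_x):
    """Predicted change in y for a given change in x along the line."""
    return slope * delta_x


def units_needed(slope, target_change):
    """Change in x required to move y by target_change."""
    return target_change / slope


# One more unit of the aid measure moves perception by about a quarter point.
print(predicted_change(SLOPE, 1))        # 0.26
# A "dramatic" 10-point jump would need far more than one unit of x.
print(round(units_needed(SLOPE, 10), 1))  # 38.5
```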
Case Study: HIV/AIDS Treatment and Aid Lag (Margo Day's IA)
- Research Focus: Investigating whether aid given in 2003 affected the number of people being treated for HIV/AIDS in the recipient country in 2004.
- Strategy of Lagging Data: Data is staggered (lagged) by one year to guarantee that the cause (aid) preceded the effect (treatment), which is essential for determining causality.
- Model 1: Total Aid vs. Treatment:
* Relationship: Statistically significant (p < 0.05).
* Direction: Positive coefficient.
* Magnitude: The coefficient is approximately 0.3.
* Substantive Interpretation: If the independent variable (x) is measured in thousands of dollars, a one-unit increase ($1,000 in aid) is associated with 0.3 more people being treated (about a third of a person). Put differently, it takes roughly $3,300 in aid (1,000 / 0.3) to treat one additional person on average.
* Variation explained (R2): The total aid explains about 31.5% of the variation in treatment numbers. This is a substantial share for a single variable, though 68.5% remains unexplained, reflecting factors such as medical infrastructure or social norms.
- Model 2: Bilateral Aid:
* Significance: Statistically significant (p < 0.05).
* Magnitude: The coefficient is 189.4.
* Interpretation: A one-unit increase ($1,000 in aid) leads to approximately 189 additional people being treated. This suggests bilateral aid has a much larger substantive effect than the general aid category.
* Variation explained (R2): Only about 5.5% of the variation was explained by bilateral aid alone.
- Model 3: Multilateral Aid (UN, WHO, IMF):
* Significance: The P-value is 0.397.
* Interpretation: Because p > 0.05, the result is not statistically significant. The coefficient (approximately 79) should not be interpreted as an effect, because we cannot reject the null hypothesis that the true coefficient is zero.
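The arithmetic behind the three models can be sketched as follows. The sketch assumes, as the notes do conditionally, that aid is measured in thousands of dollars. The function names are hypothetical, and only the p-value for Model 3 is given exactly in the notes, so it is the only one checked here.

```python
ALPHA = 0.05  # conventional significance threshold used in these notes


def aid_per_additional_person(coefficient, unit_dollars=1_000):
    """Dollars of aid associated with one more person treated.

    Assumes one unit of x equals unit_dollars of aid.
    """
    return unit_dollars / coefficient


def is_significant(p_value, alpha=ALPHA):
    """True when the null hypothesis can be rejected at level alpha."""
    return p_value < alpha


print(round(aid_per_additional_person(0.3)))       # Model 1: 3333
print(round(aid_per_additional_person(189.4), 2))  # Model 2: 5.28
print(is_significant(0.397))                       # Model 3: False
```

Because Model 3 fails the significance check, no cost-per-person figure is computed from its coefficient.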
Case Study: Terrorism and Female Leaders (Holman et al.)
- Research Focus: Do female leaders get a "rally around the flag" effect (a public opinion boost) following a terrorist attack?
- Subject: Theresa May (UK Prime Minister) during the Manchester attack.
- Hypothesis: The authors predicted female leaders would be punished or see a decline in favorability, unlike the traditional rally effect seen for male leaders.
- Variables:
* Dependent Variable (y): Perceived favorability of Theresa May.
* Independent Variable (x): Being surveyed before the attack vs. being surveyed after the attack.
- Numerical Analysis of the Results Table:
* Coefficient: −0.332 (negative).
* Interpretation: Being surveyed after the attack (a one-unit change in the binary variable) leads to a 0.332 decrease (about a third of a percentage point) in favorability.
* Significance: Indicated by stars in the table legend (typically one star for p < 0.05).
* Variation explained (R2): The timing of the survey (before/after attack) explains about 14.3% of the variation in favorability.
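When the independent variable is binary (0 = surveyed before, 1 = surveyed after), the OLS slope equals the difference between the two group means, which is why the coefficient can be read as a before/after gap. The sketch below checks this on invented favorability scores. The numbers are not from the study and the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Invented favorability scores, for illustration only.
before = [5.0, 6.0, 4.0, 5.0]  # respondents surveyed before the attack (x = 0)
after = [4.0, 5.0, 5.0, 4.0]   # respondents surveyed after the attack (x = 1)

xs = [0] * len(before) + [1] * len(after)
ys = before + after

slope = ols_slope(xs, ys)
gap = mean(after) - mean(before)
print(slope, gap)  # -0.5 -0.5
```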
Practical Math in Regression Models
- Hypothetical Model: Unemployment and Approval Ratings:
* Variables: Both measured in percentages (%).
* Equation: y=65−2.5x
* Y-Intercept (a): 65. This is the predicted approval rating if unemployment (x) is zero.
* Slope/Coefficient (b): −2.5.
* Significance: p=0.03. This is statistically significant (p < 0.05), so we reject the null hypothesis.
* Calculating Expected Outcomes:
* Scenario 1 (5% unemployment): y=65−(2.5×5)=65−12.5=52.5. Expected approval is 52.5%.
* Scenario 2 (10% unemployment): y=65−(2.5×10)=65−25=40. Expected approval is 40%.
* Summary of Effect: A one-percentage-point increase in the unemployment rate is associated with a 2.5-percentage-point decrease in the leader's approval rating.
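The hypothetical equation above can be written as a small function to reproduce the two scenarios. The function and constant names are illustrative choices, not part of the original model.

```python
INTERCEPT = 65.0  # a: predicted approval when unemployment is zero
SLOPE = -2.5      # b: change in approval per one-point rise in unemployment


def predicted_approval(unemployment_pct):
    """Predicted approval (%) from the line y = 65 - 2.5x."""
    return INTERCEPT + SLOPE * unemployment_pct


for rate in (0, 5, 10):
    print(rate, predicted_approval(rate))
# 0 65.0
# 5 52.5
# 10 40.0
```

The difference between any two predictions one point of unemployment apart is always 2.5 points of approval, which is what the slope expresses.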
Questions & Discussion
- Question (Nora): Could you clarify if the hypothesis explains the phenomenon?
* Response: Technically, the regression model (specifically the equation/line) explains a certain percentage of the variation in the dependent variable. It is better to focus the statistics on how much the equation explains the outcome rather than the abstract hypothesis.
- Comment (Marissa): Is the coefficient the percentage of variation explained?
* Response: No, the percentage of variation explained is the R2. The coefficient (slope) tells you the magnitude of the effect that a one-unit change in x has on y.
- Question (Norm): Is a higher R2 better or does it make a study more valid?
* Response: Not necessarily. High R2 just means more variation is accounted for. Unethical researchers might "pack" a model with 30 variables just to inflate R2, but this obscures which variables are actually important. A lower R2 (like 14.3%) can still be very meaningful if it shows a strong substantive effect from a key variable.