Module 10-1 (2023)
Introduction to Bivariate Data
Observing the relationship between two numerical variables (e.g., height and weight).
Aim is to understand how one numerical variable responds to changes in another.
Variable Definitions
Response Variable (Dependent Variable)
A variable that changes in response to the independent variable.
Denoted as y.
Independent Variable (Explanatory Variable)
A variable used to explain changes in the dependent variable.
Denoted as x.
Its values are taken as given and are used to account for differences in the response variable y.
Examples of Variables
Example 1:
Effect of rainfall on crop yield.
X = amount of rainfall (independent variable).
Y = crop yield (dependent variable).
Example 2:
Effect of midterm score on final grade.
X = midterm score (independent variable).
Y = final grade (dependent variable).
Data Recording and Visualization
Data for two numerical variables should be recorded as pairs (X, Y).
Use scatter plots to visualize these bivariate observations:
X-axis: Independent variable (x).
Y-axis: Dependent variable (y).
Plot data points based on bivariate observations (e.g., (x1, y1), (x2, y2)).
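A minimal sketch of the recording and plotting steps above. The rainfall and crop yield numbers are invented for illustration and are not from the module, and the helper name save_scatter is hypothetical. The plot assumes matplotlib is installed; the pairing step needs only the standard library.

```python
# Hypothetical data (invented for illustration): rainfall is x, crop yield is y.
rainfall = [10, 20, 30, 40, 50]
crop_yield = [1.2, 1.9, 3.1, 3.8, 5.0]

# Record the bivariate observations as (x, y) pairs.
pairs = list(zip(rainfall, crop_yield))


def save_scatter(pairs, path="scatter.png"):
    """Draw the pairs as a scatter plot and save it to a file.

    Assumes matplotlib is installed; it is imported here so the
    pairing step above works without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display window needed
    import matplotlib.pyplot as plt

    xs, ys = zip(*pairs)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("x: rainfall (independent variable)")
    ax.set_ylabel("y: crop yield (dependent variable)")
    fig.savefig(path)
    plt.close(fig)
```

Calling save_scatter(pairs) writes the plot to scatter.png, with the independent variable on the X-axis and the dependent variable on the Y-axis.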
Evaluating Relationships in Scatter Plots
Example: Does schooling affect salary?
X = years of schooling.
Y = salary.
Scatter plots show relationships and can indicate differences among age groups by using different symbols for data points.
Examining Scatterplots
1. Direction of Relationship
Positive Association: as X increases, Y also increases.
Negative Association: as X increases, Y decreases.
2. Form of Relationship
Linear: points follow a straight line.
Curvilinear: points follow a curved line.
Clustered data: points fall into loose groups, so a trend is hard to identify.
3. Strength of Relationship
Strong linear relationship: data points closely align with a linear trend.
Moderate linear relationship: data points are somewhat clustered around a trend line.
Weak relationship: data points are scattered with no clear trend.
Outliers
Observations that deviate markedly from the overall pattern.
Can distort the interpretation of the relationship.
Correlation Coefficient
Measures strength and direction of linear relationships between two numerical variables.
Denoted by r (or R).
Ranges from -1 to 1:
r = 1: Perfect positive linear correlation.
r = -1: Perfect negative linear correlation.
r = 0: No linear correlation.
Calculated from the means and standard deviations of x and y, so it is sensitive to outliers.
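A sketch of the calculation, assuming the usual sample formula: r is the sum of the products of the standardized x and y values, divided by n - 1. The function name pearson_r and the data are hypothetical and chosen only to show how a single outlier changes r.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)  # 1, up to floating-point rounding

# Add one observation far from the pattern.
r_outlier = pearson_r(xs + [6], ys + [0])  # 1/7, about 0.14

print(round(r_clean, 4), round(r_outlier, 4))
```

In this toy data set, one outlying point takes r from a perfect 1 down to about 0.14, which is why a scatter plot should be checked alongside the number.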
Interpreting Correlation Coefficient
Positive Correlation (r > 0)
Indicates positive association, where increases in X tend to be accompanied by increases in Y.
Example: Years of schooling (X) and salary (Y) have r = 0.9941, indicating strong positive linear relationship.
Negative Correlation (r < 0)
Indicates negative association, where increases in X tend to be accompanied by decreases in Y.
Correlation Coefficient Characteristics
Symmetrical: Switching X and Y does not change the r value.
Dimensionless: Has no units, so changing the units of X or Y does not change r.
Assess strength using the absolute value of r:
Close to 1 indicates strong relation, close to 0 indicates weak relation.
Correlation does not imply causation:
Correlation can exist due to lurking variables.
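The symmetry and unit-free properties listed above can be checked numerically. This is a sketch: pearson_r is a hypothetical helper using the usual sample formula, and the height and weight values are invented for illustration.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical heights (cm) and weights (kg).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 74, 85]

r_xy = pearson_r(heights_cm, weights_kg)
r_yx = pearson_r(weights_kg, heights_cm)  # X and Y switched

# Same observations expressed in inches and pounds.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]
r_converted = pearson_r(heights_in, weights_lb)

# All three values agree up to floating-point rounding.
print(r_xy, r_yx, r_converted)
```

Switching the roles of the variables or rescaling their units leaves r unchanged, because each value is standardized before the products are summed.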
Lurking Variables and Causation
Lurking Variables: Hidden influences that affect both x and y.
Example: The association between years of schooling and salary does not imply causation, because factors such as experience and company size also affect salary.
Causal conclusions require controlling for lurking variables.
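A toy illustration of a lurking variable, under strong assumptions: the hidden variable z, the noise patterns, and the helper pearson_r are all invented for this sketch. Here x and y are each built as z plus unrelated noise, so neither influences the other. Subtracting z works as a way of controlling for it only because the data were constructed this way; real data would need a designed experiment or a statistical adjustment.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical lurking variable that drives both x and y.
z = [1, 2, 3, 4, 5, 6, 7, 8]

# Two noise patterns chosen so they are uncorrelated with each other.
noise_x = [0.5, -0.5] * 4
noise_y = [0.5, 0.5, -0.5, -0.5] * 2

x = [zi + e for zi, e in zip(z, noise_x)]
y = [zi + e for zi, e in zip(z, noise_y)]

# x and y are strongly correlated even though neither affects the other.
r_raw = pearson_r(x, y)  # about 0.95

# Remove the part of x and y explained by z, then correlate what is left.
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
r_adjusted = pearson_r(x_adj, y_adj)  # 0: the association disappears

print(round(r_raw, 4), round(r_adjusted, 4))
```

The strong raw correlation comes entirely from z; once z is accounted for, no association between x and y remains in this constructed example.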
Conclusion
A scatter plot is needed to confirm linearity before using the correlation coefficient.
Strong positive correlation noted between years of schooling and salary.
Important to refrain from assuming causation based solely on correlation.