Module 10-1 (2023)

Introduction to Bivariate Data

  • Observing the relationship between two numerical variables (e.g., height and weight).

  • Aim is to understand how one numerical variable responds to changes in another.

Variable Definitions

Response Variable (Dependent Variable)

  • A variable that changes in response to the independent variable.

  • Denoted as y.

Independent Variable (Explanatory Variable)

  • A variable used to explain changes in the dependent variable.

  • Denoted as x.

  • Treated as the input: its values are used to account for differences in the response variable y.

Examples of Variables

Example 1:

  • Effect of rainfall on crop yield.

    • X = amount of rainfall (independent variable).

    • Y = crop yield (dependent variable).

Example 2:

  • Effect of midterm score on final grade.

    • X = midterm score (independent variable).

    • Y = final grade (dependent variable).

Data Recording and Visualization

  • Data for two numerical variables should be recorded as pairs (X, Y).

  • Use scatter plots to visualize these bivariate observations:

    • X-axis: Independent variable (x).

    • Y-axis: Dependent variable (y).

    • Plot data points based on bivariate observations (e.g., (x1, y1), (x2, y2)).
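A minimal sketch of recording bivariate data as (x, y) pairs, using the rainfall and crop yield example. The numbers and variable names are hypothetical, invented only for illustration.

```python
# Hypothetical measurements, invented purely for illustration.
rainfall_mm = [300, 450, 520, 610, 700]    # x: independent variable
crop_yield_t = [2.1, 2.9, 3.4, 3.8, 4.5]   # y: dependent variable

# Each observation is one (x, y) pair; zip keeps values matched by position.
observations = list(zip(rainfall_mm, crop_yield_t))

# Print the pairs in the (x1, y1), (x2, y2), ... notation used above.
for i, (x, y) in enumerate(observations, start=1):
    print(f"(x{i}, y{i}) = ({x}, {y})")
```

Each pair becomes one point on the scatter plot, with x on the horizontal axis and y on the vertical axis. If matplotlib is available, the two lists could be passed to plt.scatter to draw it.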

Evaluating Relationships in Scatter Plots

  • Example: Does schooling affect salary?

    • X = years of schooling.

    • Y = salary.

  • Scatter plots show relationships and can indicate differences among age groups by using different symbols for data points.

Examining Scatterplots

1. Direction of Relationship

  • Positive Association: as X increases, Y also increases.

  • Negative Association: as X increases, Y decreases.

2. Form of Relationship

  • Linear: points follow a straight line.

  • Curvilinear: points follow a curved line.

  • Clustered data: points are loosely scattered, so a trend is hard to identify.

3. Strength of Relationship

  • Strong linear relationship: data points closely align with a linear trend.

  • Moderate linear relationship: data points are somewhat clustered around a trend line.

  • Weak relationship: data points are scattered with no clear trend.

Outliers

  • Observations that deviate significantly from the overall pattern.

  • Could mislead interpretations of the relationship.

Correlation Coefficient

  • Measures strength and direction of linear relationships between two numerical variables.

  • Denoted by r (or R).

  • Ranges from -1 to 1:

    • r = 1: Perfect positive linear correlation.

    • r = -1: Perfect negative linear correlation.

    • r = 0: No linear correlation.

  • Calculated from the means and standard deviations of x and y, so it is sensitive to outliers.
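The notes do not give the formula, so the sketch below assumes the usual sample (Pearson) form: r is the sum of the products of standardized x and y values, divided by n - 1. The function name pearson_r and the data are hypothetical. The second call illustrates how a single outlier can change r.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation from means and sample standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Sum of products of standardized deviations, divided by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data lying exactly on a line: r is 1 up to rounding.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_clean = pearson_r(x, y)

# The same data plus one outlying observation (6, 0).
r_outlier = pearson_r(x + [6], y + [0])

print(round(r_clean, 4), round(r_outlier, 4))
```

With these made-up numbers, the single added point pulls r from 1 down to about 0.14, which is why a scatter plot should be checked for outliers before r is interpreted.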

Interpreting Correlation Coefficient

Positive Correlation (r > 0)

  • Indicates positive association, where increases in X tend to be accompanied by increases in Y.

    • Example: Years of schooling (X) and salary (Y) have r = 0.9941, indicating a strong positive linear relationship.

Negative Correlation (r < 0)

  • Indicates negative association, where increases in X tend to be accompanied by decreases in Y.

Correlation Coefficient Characteristics

  • Symmetrical: Switching X and Y does not change the r value.

  • Dimensionless: Has no units; it is a pure number.

  • Assess strength using the absolute value of r:

    • |r| close to 1 indicates a strong linear relationship; |r| close to 0 indicates a weak one.

  • Correlation does not imply causation:

    • Correlation can exist due to lurking variables.
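The symmetry and unit-free properties listed above can be checked numerically. This is a minimal sketch with hypothetical height and weight values. The helper pearson_r is an illustrative name and uses a sums-of-squares form that is algebraically equivalent to the means and standard deviations definition.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient via sums of squared deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements, invented for illustration.
height_cm = [150, 160, 165, 172, 180]
weight_kg = [52, 58, 63, 70, 77]

r_xy = pearson_r(height_cm, weight_kg)
r_yx = pearson_r(weight_kg, height_cm)        # roles of X and Y switched

# Same data expressed in different units (inches and pounds).
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]
r_units = pearson_r(height_in, weight_lb)

print(r_xy, r_yx, r_units)
```

All three values agree up to floating-point rounding: switching X and Y or changing the measurement units leaves r unchanged.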

Lurking Variables and Causation

  • Lurking Variables: Hidden influences that affect both x and y.

  • Example: The association between years of schooling and salary does not imply causation (other factors, such as experience and company size, also affect salary).

  • Correct causal conclusions require controlling for all lurking variables.
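A small simulation can illustrate how a lurking variable produces correlation without causation. Everything here is a constructed assumption: z plays the role of the lurking variable, x and y each depend on z plus independent noise, and neither x nor y influences the other. The seed, sample size, and noise level are arbitrary choices.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient via sums of squared deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 2000

z = [rng.gauss(0, 1) for _ in range(n)]       # lurking variable
x = [zi + rng.gauss(0, 0.3) for zi in z]      # x depends only on z
y = [zi + rng.gauss(0, 0.3) for zi in z]      # y depends only on z

# x and y are strongly correlated even though neither affects the other.
r_xy = pearson_r(x, y)

# Controlling for z (here, subtracting its known contribution)
# leaves only independent noise, so the association disappears.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)

print(round(r_xy, 3), round(r_resid, 3))
```

In real data the contribution of a lurking variable is not known exactly as it is in this simulation, which is part of why controlling for lurking variables is difficult in practice.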

Conclusion

  • Need a scatter plot to confirm linearity before using the correlation coefficient.

  • Strong positive correlation noted between years of schooling and salary.

  • Important to refrain from assuming causation based solely on correlation.
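The first point above, confirming linearity before relying on r, can be illustrated with a constructed example: y is completely determined by x through a curve, yet r is 0 because the relationship is not linear. The data and the helper name pearson_r are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient via sums of squared deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect curvilinear relationship: y = x squared, symmetric about x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)  # 0.0: no linear correlation, despite an exact curved relationship
```

A scatter plot of these points would show the curve immediately, while r alone would wrongly suggest there is no relationship.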
