BM 6

ASSOCIATION BETWEEN VARIABLES

Outline

  • Introduction
  • Correlation analysis
  • Scatter plots
  • Correlation Coefficient
  • Coefficient of determination, r²
  • Regression analysis

Objectives

  • To be able to draw and interpret scatter diagrams.
  • To be able to calculate the correlation coefficient and coefficient of determination.
  • To understand and be able to use the least squares method to estimate the regression line.

Introduction

  • The lecture introduces two fundamental techniques in statistics:
    • Correlation: A method to measure the association between two variables.
    • Regression: A technique to derive the relationship between two variables, helping to predict dependent variables based on independent variables.
  • Example scenarios include:
    • The potential dependence of production costs on the quantity produced.
    • The relationship between product sales and pricing.

Correlation Analysis

  • Correlation measures the strength of the association between two variables.
  • Two variables are said to be associated when a change in one corresponds to a change in the other.
  • Illustrative examples of potential correlations include:
    • Relationship between production cost and price.
    • Association between advertising efforts and sales revenue.
    • Correlation between the number of deliveries and the time taken for those deliveries.

Class Activity: Determining Dependent and Independent Variables

  • Examples include:
    • Study Hours and Exam Grades: Study hours (independent) → Exam grades (dependent)
    • Temperature and Ice Cream Sales: Temperature (independent) → Ice cream sales (dependent)
    • Advertising Budget and Sales Revenue: Budget (independent) → Revenue (dependent)
    • Distance Traveled and Fuel Consumption: Distance (independent) → Fuel (dependent)
    • Employee Training Hours and Productivity: Training hours (independent) → Productivity (dependent)
    • Rainfall Amount and Crop Yield: Rainfall (independent) → Crop yield (dependent)
    • Social Media Engagement and Website Traffic: Engagement (independent) → Traffic (dependent)
    • Education Level and Income: Education (independent) → Income (dependent)

Scatter Diagrams

  • A scatter plot visually represents bivariate data, showing potential correlations.
  • The independent variable is plotted on the x-axis, while the dependent variable is plotted on the y-axis.
  • Analysis of scatter plots can reveal:
    • Patterns indicating the strength and direction of the association.
    • Suitability of data before conducting quantitative analysis.

Degrees of Association/Correlation

  • When examining scatter plots, consider:
    • Evidence of a pattern in the plotted points.
    • Types of correlation:
      • Perfect Correlation: All data points lie on a straight line, indicating a precise linear relationship (either positive or negative).
      • Partially Correlated: Some pattern exists, but the points do not all lie exactly on a straight line.
      • Uncorrelated: No discernible relationship between the variables.

Pearson’s Product Moment Correlation Coefficient

  • Used to measure the strength and direction of the linear association between two variables.
  • The correlation coefficient (denoted as r) ranges from -1 to +1:
    • A value close to +1 (e.g., 0.9) indicates a strong positive linear relationship.
    • A value close to –1 (e.g., –0.9) indicates a strong negative linear relationship.
    • A value of 0 indicates no linear relationship, though other types of relationships may exist.

Formula for r

  • The formula for calculating the Pearson product moment correlation coefficient (r) is given by:
    r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}
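The formula can be turned into a short calculation. The sketch below is one possible Python version, using only the standard library; the function name pearson_r and the error handling are illustrative choices, not part of the lecture.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from the raw sums n, Σx, Σy, Σxy, Σx², Σy²."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # Happens when every x (or every y) is the same value.
        raise ValueError("r is undefined when x or y has no variation")
    return numerator / denominator
```

Points that lie exactly on a rising line give r = +1, and points on a falling line give r = -1, which is a quick way to check the function.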

Example 1: Calculating r

  • Calculate Pearson's correlation coefficient for the data:
    • Units produced (x): 1, 2, 3, 4, 5, 6
    • Production cost (y): 5.0, 10.5, 15.5, 25.0, 16.0, 22.5
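
One way to set out the working for this example is sketched below. The sums are computed from the data above and the final figures are rounded, so treat it as a check on the arithmetic and not as a model answer.

```latex
\begin{align*}
n &= 6, \quad \sum x = 21, \quad \sum y = 94.5 \\
\sum xy &= 387.5, \quad \sum x^2 = 91, \quad \sum y^2 = 1762.75 \\
r &= \frac{6(387.5) - (21)(94.5)}{\sqrt{[6(91) - 21^2][6(1762.75) - 94.5^2]}} \\
  &= \frac{340.5}{\sqrt{(105)(1646.25)}} \approx \frac{340.5}{415.76} \approx 0.82
\end{align*}
```

A value of about 0.82 suggests a fairly strong positive linear association between units produced and production cost.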

Class Exercise

  • Task: Plot a scatter diagram and calculate the Pearson product moment correlation coefficient for the following data:
    • Policy (X) and Overtime hours (Y):
    • 150 → 10
    • 300 → 20
    • 100 → 10
    • 400 → 40
    • 350 → 30
    • 500 → 35

Coefficient of Determination r²

  • Before using regression for prediction, evaluate its fit to the data.
  • The coefficient of determination (r²) measures the proportion of the variation in the dependent variable that is explained by the independent variable.
  • Given by the formula:
    r^2 = (r)^2
  • Calculate r² for the previous examples.
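
As an illustration, the sketch below computes r² for the production cost data (units produced 1 to 6). It squares the numerator and skips the square root in the denominator, which gives the same result as squaring r. The function and variable names are illustrative.

```python
def r_squared(x, y):
    """Coefficient of determination, computed as the square of Pearson's r."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    x_term = n * sum_x2 - sum_x ** 2
    y_term = n * sum_y2 - sum_y ** 2
    # r = numerator / sqrt(x_term * y_term), so r² needs no square root.
    return numerator ** 2 / (x_term * y_term)


units = [1, 2, 3, 4, 5, 6]
cost = [5.0, 10.5, 15.5, 25.0, 16.0, 22.5]

share_explained = r_squared(units, cost)
print(f"r^2 = {share_explained:.2f}")  # prints "r^2 = 0.67"
```

Read this as: roughly 67% of the variation in production cost is explained by the number of units produced, and the remaining 33% is due to other factors.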

Linear Regression Analysis

  • Linear regression defines the relationship between dependent and independent variables using a linear equation.
  • The focus is on estimating the line of best fit between two variables using:
    • Least Squares Method: A calculation to minimize the sum of squared differences (errors) between observed and predicted values.
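
To make "sum of squared errors" concrete, the sketch below scores a few candidate lines against the production cost data. The candidate labelled "least squares" uses rounded estimates for this data (a ≈ 4.40, b ≈ 3.243), which are assumed here; the other candidates are arbitrary lines chosen for comparison.

```python
def sum_squared_errors(a, b, x, y):
    """Sum of (observed y - predicted y)² for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


units = [1, 2, 3, 4, 5, 6]
cost = [5.0, 10.5, 15.5, 25.0, 16.0, 22.5]

# Candidate lines as (a, b) pairs; only the first comes from least squares.
candidates = {
    "least squares": (4.40, 3.243),
    "flatter line": (5.0, 3.0),
    "steeper line": (2.0, 4.0),
    "through origin": (0.0, 4.5),
}

errors = {
    label: sum_squared_errors(a, b, units, cost)
    for label, (a, b) in candidates.items()
}
best_label = min(errors, key=errors.get)

for label, sse in errors.items():
    print(f"{label:>15}: SSE = {sse:.2f}")
```

The least squares line has the smallest total squared error of the four. No other straight line can do better on this data.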

The Linear Regression Model

  • The linear regression model is represented by: y = a + bx
    • Where:
    • y = dependent variable
    • x = independent variable
    • a = y-intercept (constant)
    • b = slope (gradient)

The Values of a and b

  • The formulas to determine the optimal values of a and b that minimize squared errors are:
    b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}
    a = \frac{\sum y}{n} - b \frac{\sum x}{n}
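
The two formulas translate directly into code. The following is a minimal sketch in standard-library Python; the name fit_line is an illustrative choice.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    slope_denominator = n * sum_x2 - sum_x ** 2
    if slope_denominator == 0:
        # Every x is identical, so no unique slope exists.
        raise ValueError("the slope is undefined when x has no variation")

    b = (n * sum_xy - sum_x * sum_y) / slope_denominator  # slope
    a = sum_y / n - b * sum_x / n                         # intercept
    return a, b
```

Note that b must be calculated first, because the formula for a uses it. As a check, data generated exactly from y = 2 + 3x should return a = 2 and b = 3.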

Example: Calculating the Regression Line

  • Given data:
    • Units produced (x): 1, 2, 3, 4, 5, 6
    • Production cost (y): 5.0, 10.5, 15.5, 25.0, 16.0, 22.5
  • Use the formulas for a and b to calculate the linear regression line.
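
A possible layout of the working is sketched below. The sums are computed from the data above and the results are rounded, so it is a check on the arithmetic and not an official solution.

```latex
\begin{align*}
n &= 6, \quad \sum x = 21, \quad \sum y = 94.5, \quad \sum xy = 387.5, \quad \sum x^2 = 91 \\
b &= \frac{6(387.5) - (21)(94.5)}{6(91) - 21^2} = \frac{340.5}{105} \approx 3.24 \\
a &= \frac{94.5}{6} - b\,\frac{21}{6} \approx 15.75 - 3.243(3.5) \approx 4.40 \\
y &\approx 4.40 + 3.24x
\end{align*}
```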

Class Exercise 2

  • Given data regarding output and cost:
    • Output (x) in ’000 units and Costs (y) in P’000:
    • 20 → 82
    • 16 → 70
    • 24 → 90
    • 22 → 85
    • 18 → 73
  • Task: Calculate the regression line for the data.

Interpretation of a and b

  • After calculating a and b, interpret the meaning of these coefficients in the context of the problem.
  • Write the regression equation and plot the regression line on a scatter diagram to visualize the fit in relation to the data points.
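
The meaning of a and b can be read straight off the fitted line. The sketch below uses rounded estimates for the production cost example (a ≈ 4.40, b ≈ 3.243), which are assumed values for illustration; the function name predict is also illustrative.

```python
def predict(x, a, b):
    """Predicted y on the regression line y = a + b*x."""
    return a + b * x


# Assumed rounded estimates for the production cost example.
a, b = 4.40, 3.243

# a is the predicted cost when no units are produced (the fixed part of cost).
cost_at_zero = predict(0, a, b)

# b is the change in predicted cost for each additional unit produced.
cost_per_extra_unit = predict(4, a, b) - predict(3, a, b)

print(f"Regression line: y = {a:.2f} + {b:.2f}x")
print(f"Predicted cost at zero output: {cost_at_zero:.2f}")
print(f"Extra cost per additional unit: {cost_per_extra_unit:.2f}")
```

Predictions are most trustworthy for x values inside the range of the observed data (1 to 6 units here). The intercept is a prediction at x = 0, which lies outside that range, so interpret it with care.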