Correlation

Chapter 6: Describing the relationship between two variables

Research on Relations Between Variables

Many research questions are relational in nature, such as:
- Whether the occurrence of depression is related to gender.
- How income relates to one’s psychological well-being.
- Whether the time spent playing violent video games relates to one’s tendency to be violent.

Data Analysis with Correlation

When investigating a relational (non-causal) question about two variables, each measured on an INTERVAL or RATIO scale, one can conduct a correlational analysis.
Correlational analysis reveals:
- The direction of the association between the two variables.
- The strength of the association (co-variation) between these variables.

Steps in Relational Research

Ask a Relational Question:
- Example: Is there a relation between (X) length of psychotherapy and (Y) psychological well-being?
Construct the Measures:
- Length of Psychotherapy (Variable X) = number of months.
- Psychological Well-Being (Variable Y) = Score on a 100-point scale.
Make Paired Observations:
- Assess X and Y for each client; two variables to measure for each subject.
Analyze the Relationship:
- Visually through scatterplots.
- Mathematically through correlation coefficient (r), coefficient of determination (r²), and regression analysis.

Co-Variation Explained

Analyzing the numerical relation between two variables (e.g., the length of therapy and psychological well-being) involves examining "co-variation", which means:
- One variable changes together (co) with changes in the other variable.
- For instance, when one variable increases as the other also increases, the relationship demonstrates positive co-variation.

Visualizing Relationships

Example of Scatterplot:
- X-axis: Months of therapy (X)
- Y-axis: Well-being (Y)
- Each point on the scatterplot represents an XY pair.

Types of Relationships

Perfect Positive:
- All points fall on a line that slopes upward.
- Each one-unit increase in X is matched by a fixed increase in Y (not necessarily the same size as the increase in X).
Imperfect Positive:
- Points cluster around a line that slopes upward.
- When X increases, Y tends to increase.
Imperfect Negative:
- Points cluster around a line that slopes downward.
- When X increases, Y tends to decrease.
Perfect Negative:
- All points fall on a line that slopes downward.
- Each one-unit increase in X is matched by a fixed decrease in Y (not necessarily the same size as the increase in X).
Zero Relationship:
- Points do not cluster around any line.

Non-Linear Relationships

A relationship is non-linear when the data are best summarized by a line that is not straight, that is, by a curve.

Goals of This Chapter

Express mathematically the direction and strength of linear relationships.
Analyze relationships as positive or negative.

Key Concepts

Pearson Correlation Coefficient (r)

An index that describes the strength and direction of the linear relationship between two interval and/or ratio variables.

Coefficient of Determination (r²)

Also known as the squared correlation coefficient.
An index that describes only the strength of the linear relationship between two interval and/or ratio variables.
A technique for describing the strength of co-variation.

Computation Approaches

Two approaches to compute correlation will be discussed:
1. Definitional formula
2. Computational formula

Definitional Formula of Pearson Correlation Coefficient

The definitional formula expresses correlation as standardized (Z-score based) co-variation between two variables.
Because Z-scores are unit-free, r is unaffected by the units in which X and Y are measured.
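
The formula itself is not reproduced in these notes. A common textbook form, given here as an assumption rather than as the chapter's exact notation, is the mean of the products of the paired Z-scores:

```latex
r = \frac{\sum Z_X Z_Y}{N}
```

Here N is the number of paired observations, and the form assumes the standard deviations used for the Z-scores divide by N. If they divide by N - 1 instead, the sum of products is divided by N - 1.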

Reminder for Z-scores

The Z-scores for variables X and Y can be computed as follows:
Z_X = \frac{X - \bar{X}}{SD_X}
Z_Y = \frac{Y - \bar{Y}}{SD_Y}
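
As a minimal sketch of the Z-score conversion, the Python below standardizes a small set of scores. The data are hypothetical, and the use of the population standard deviation (dividing by N) is an assumption; the notes do not say which version of SD is intended.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to Z-scores: (score - mean) / SD."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population SD (divides by N)
    return [(v - m) / sd for v in values]

# Hypothetical data: months of therapy for five clients
months = [2, 4, 6, 8, 10]
z_months = z_scores(months)
print([round(z, 2) for z in z_months])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

After conversion the scores have a mean of 0 and a standard deviation of 1, while their order and relative spacing are unchanged.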

Types of Relationships in Employment Context

Example:
- Hours Worked (X) and Salary Earned (Y)
- Each point represents one subject, plotted from that subject's hours worked and salary earned.

Computation of Correlation Example

The data in this example illustrate a perfect positive relationship (r = +1.00), which is interpreted as follows:
- The “+” indicates that the relationship is positive.
- A correlation of “1.00” indicates a perfect relationship.
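
The original data for this example are not included in these notes, so the sketch below uses hypothetical hours and salaries in which pay rises by a fixed amount per hour. It computes r as the mean of the products of paired Z-scores, assuming the population standard deviation (dividing by N); `pearson_r` is an illustrative helper name, not something defined in the chapter.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Mean of the products of paired Z-scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)  # assumption: population SD (divides by N)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / len(xs)

# Hypothetical data: each extra hour adds the same amount of salary
hours = [10, 20, 30, 40]
salary = [100, 200, 300, 400]

r = pearson_r(hours, salary)
print(round(r, 2))  # 1.0
```

Because every subject has the same Z-score on both variables, every product is positive and the mean of the products reaches its maximum of 1.00.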

Coefficient of Determination (r²)

r² shows the proportion (which can be expressed as a percentage) of variability in Y that is accounted for by variability in X, and vice versa.
For example, if r² = 1.00 between salary earned (Y) and hours worked (X), then all of the variability in salary is accounted for by the variability in hours worked.
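
One way to see what "variability accounted for" means is to compare r² with the share of Y's variability that a least-squares line predicts from X. The sketch below uses hypothetical data and the standard identity for simple linear regression; the helper and variable names are illustrative.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Mean of the products of paired Z-scores (population SD assumed)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: months of therapy (X) and well-being score (Y)
months = [1, 2, 3, 4, 5]
well_being = [52, 60, 58, 70, 75]

r = pearson_r(months, well_being)
r_squared = r ** 2

# Least-squares line predicting Y from X
slope = r * pstdev(well_being) / pstdev(months)
intercept = mean(well_being) - slope * mean(months)
predicted = [intercept + slope * x for x in months]

# Proportion of Y's variability captured by the line
ss_total = sum((y - mean(well_being)) ** 2 for y in well_being)
ss_residual = sum((y - p) ** 2 for y, p in zip(well_being, predicted))
explained = 1 - ss_residual / ss_total

print(round(r_squared, 3), round(explained, 3))  # the two values agree
```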

Definitional Formula (Second Example)

Sample question:
- What is the relationship between X and Y?
- Example measures for X and Y should be constructed and paired for analysis.

Summary of Results

r ranges from -1.00 to +1.00:
- Imperfect relationships fall strictly between these limits; perfect relationships fall exactly at -1.00 or +1.00.
- r cannot be less than -1.00 or greater than +1.00.

Understanding the Definition of Correlation

Why convert raw scores to Z-scores?
- To place X and Y on a common, unit-free scale (mean 0, standard deviation 1) without altering the shape of either distribution or the relative distances between scores.
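
A practical consequence of standardizing is that changing the unit of measurement leaves r unchanged. The sketch below, with hypothetical data and an illustrative `pearson_r` helper, measures therapy length in months and then in years and compares the two correlations.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Mean of the products of paired Z-scores (population SD assumed)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: length of therapy and well-being score
months = [3, 6, 9, 12, 18]
well_being = [55, 61, 60, 72, 80]

# Same lengths of therapy expressed in years instead of months
years = [m / 12 for m in months]

r_months = pearson_r(months, well_being)
r_years = pearson_r(years, well_being)
print(abs(r_months - r_years) < 1e-9)  # True: the unit does not matter
```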

Interpretation of the Correlation Coefficient (r)

r values indicate strength and direction:
- Absolute Value: Indicates the strength of the relationship.
- Sign: Indicates the direction (+ or -).
- Example Strength Ratings (one convention; the same labels apply to negative values of the same magnitude):
  - +0.60 to +0.80: Very Strong
  - +0.40 to +0.60: Strong
  - +0.20 to +0.40: Moderate
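
The reading rules above can be sketched as a small function: the sign gives the direction and the absolute value gives the strength. The function name `describe_r` is hypothetical, and assigning each boundary value to the higher band is an assumption, since the table does not say which band a value such as 0.60 belongs to. Values outside the three listed ranges are reported as unlabeled rather than guessed.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and +1.00")

    # The sign carries the direction
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # The absolute value carries the strength
    size = abs(r)
    if 0.60 <= size <= 0.80:
        strength = "very strong"
    elif 0.40 <= size < 0.60:
        strength = "strong"
    elif 0.20 <= size < 0.40:
        strength = "moderate"
    else:
        strength = "not labeled in the table above"
    return direction, strength

print(describe_r(-0.45))  # ('negative', 'strong')
```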

Conditions for Appropriate Use of r

Use r when:
1. Evaluating strength and direction of linear relationships between two variables.
2. Both X and Y are measured on an interval or ratio scale; nominal or ordinal variables cannot be used.
3. Scatterplot appears linear.
4. Scatterplot displays homoscedasticity.

Linear vs. Non-Linear Relationships

The correlation coefficient r underestimates the strength of a non-linear relationship, because it measures only how closely the points follow a straight line.
r accurately describes the strength of a linear relationship.
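
An extreme case makes the point concrete. In the hypothetical data below, Y is completely determined by X (Y is X squared), yet the U-shaped pattern has no linear trend, so r comes out at essentially zero. The `pearson_r` helper is illustrative and assumes the population standard deviation.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Mean of the products of paired Z-scores (population SD assumed)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a perfect but curved (U-shaped) relationship
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(abs(round(r, 2)))  # 0.0
```

The positive products on the right side of the curve cancel the negative products on the left, which is why a scatterplot should be inspected before r is interpreted.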

Homoscedasticity vs. Heteroscedasticity

Homoscedasticity: The variability of Y is roughly constant at all levels of X.
Heteroscedasticity: The variability of Y differs across levels of X.
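
A rough numerical check, offered as a sketch rather than a formal test, is to compare the spread of Y at each level of X. The two data sets below are hypothetical, and `spread_by_level` is an illustrative helper name.

```python
from statistics import pstdev

# Hypothetical data: Y scores observed at each level of X
homoscedastic = {1: [48, 50, 52], 2: [58, 60, 62], 3: [68, 70, 72]}
heteroscedastic = {1: [49, 50, 51], 2: [55, 60, 65], 3: [60, 70, 80]}

def spread_by_level(data):
    """Standard deviation of Y at each level of X (population SD assumed)."""
    return {x: round(pstdev(ys), 2) for x, ys in data.items()}

print(spread_by_level(homoscedastic))    # {1: 1.63, 2: 1.63, 3: 1.63}
print(spread_by_level(heteroscedastic))  # {1: 0.82, 2: 4.08, 3: 8.16}
```

In the first data set the spread is the same at every level of X; in the second it grows as X increases, producing the fan shape that signals heteroscedasticity on a scatterplot.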

Proper Uses of Correlation

Appropriate when:
- X and Y are measured on interval or ratio scales.
- Scatterplots are linear and exhibit homoscedasticity.
Questionable or improper use occurs when:
- X or Y is measured on a nominal or ordinal scale, or the scatterplot is non-linear or heteroscedastic.
Conclusions should describe the relationship between the variables; a correlation by itself does not establish causation.