Study Notes on Scatterplots, Associations, and Correlation

Introduction to Bivariate Analysis in Statistics

Overview of Bivariate Analysis

In statistics, bivariate analysis is the analysis of two variables considered together. The primary objectives when analyzing such data are:

  1. Plotting/Graphing: Visualize the relationship between the two variables.

  2. Characteristics Description: Identify the key characteristics of the association (form, direction, strength, and outliers).

  3. Measurement: Use measures that quantify these characteristics.

  4. Inference Method: Apply methods to draw conclusions about the relationship.

Example Inquiry

The relationship under investigation may be framed as follows:

  • Does wine consumption result in a decrease in heart disease?

    • Response Variable: Death rate from heart disease (measured per 100,000 population).

    • Explanatory Variable: Wine consumption – measured as alcohol consumption in liters/year/person.

Visualizing and Describing Relationships

When creating a scatterplot, consider the following characteristics:

  1. Form: The shape of the scatterplot can be linear, curved, etc.

  2. Direction:

    • Positive Direction indicates that as one variable increases, the other also increases.

    • Negative Direction indicates that as one variable increases, the other decreases.

  3. Strength: This assesses how closely the points cluster around a line.

  4. Presence of Outliers: Points that do not conform to the overall pattern of the scatterplot.
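As a rough way to look at these four characteristics without a plotting library, the sketch below draws a crude text scatterplot. The function name `ascii_scatter` and the data are made up for illustration; in practice a proper plotting tool would normally be used instead.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a crude text scatterplot: x runs left to right, y bottom to top.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher
    return "\n".join("".join(line).rstrip() for line in grid)

# Hypothetical data: a roughly linear, negative pattern plus one outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [9.0, 8.1, 7.2, 5.9, 5.1, 4.0, 3.2, 8.5]  # the last point breaks the pattern

plot = ascii_scatter(xs, ys)
print(plot)
```

In the printed grid, the stars should fall from upper left to lower right (negative direction, roughly linear form, fairly strong), with the final point sitting well above the rest as an outlier.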

Measuring Strength of Relationships

Strength describes how closely the data points follow the overall pattern of the relationship. For linear relationships, the main measure of strength is the correlation coefficient, denoted $r$.

  • Definition: The correlation coefficient measures both the direction (sign) and strength (magnitude) of a linear relationship between two quantitative variables.

  • Range: The value of $r$ falls between -1 and +1, where:

    • $r = 1$ indicates a perfect positive linear correlation.

    • $r = -1$ indicates a perfect negative linear correlation.

    • $r = 0$ indicates no linear correlation.
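The three boundary cases can be checked on small made-up datasets. In the sketch below, the data and the helper name `pearson_r` are illustrative assumptions. The last case is worth noting: the points lie exactly on a parabola, so the variables are clearly related, yet $r$ comes out as zero because the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]  # made-up x values

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [-3 * x + 4 for x in xs])  # points on a falling line
r_curve = pearson_r(xs, [x ** 2 for x in xs])     # points on a parabola

print(r_up, r_down, r_curve)
```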

Pearson Correlation Coefficient Formula

The Pearson correlation coefficient can be calculated using the following formula:
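$$r = \frac{\text{Cov}(x, y)}{s_x s_y} = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
Where:

  • $\text{Cov}(x, y)$ is the sample covariance of $x$ and $y$, which already includes the $\frac{1}{n - 1}$ factor.

  • $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$, respectively.

  • $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$.

  • $n$ is the total number of data points.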
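To make the calculation concrete, here is a minimal sketch on made-up data (the values and variable names are illustrative assumptions). It computes $r$ two ways: as the sample covariance divided by the product of the two sample standard deviations, and as the sum of products of z-scores divided by $n - 1$. The two routes should agree.

```python
from statistics import mean, stdev

# Made-up paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Route 1: covariance scaled by both standard deviations.
r_from_cov = cov_xy / (s_x * s_y)

# Route 2: sum of z-score products, divided by n - 1.
r_from_z = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

print(round(r_from_cov, 4), round(r_from_z, 4))
```

Note that the $\frac{1}{n - 1}$ factor is applied only once in each route: inside the covariance in the first, and outside the sum of z-score products in the second.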

Analyzing Correlation and Causation

Distinction Between Correlation and Causation

It is crucial to understand that correlation does not imply causation. One example pairs the money wagered at US racetracks with a second variable, giving a correlation coefficient of $r = 0.931$. Although the two variables tend to increase together, this does not confirm that one causes the other.

The Concept of Lurking Variables

A lurking variable, sometimes called a confounding variable, is a third variable that influences both the explanatory and the response variable, so that an association between the two can be mistaken for causation.

  • Example 1: A study found a positive correlation between teachers' salaries and liquor sales. This does not imply that higher salaries cause increased liquor sales; other economic factors may be involved.

  • Example 2: A study of life expectancy and the number of doctors per person in 40 countries found a correlation of $r = 0.705$. This positive association could reflect other factors, such as socioeconomic status, rather than a direct effect of doctors on life expectancy.
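A small simulation can show how a lurking variable produces correlation without causation. The sketch below is built on assumptions chosen for illustration: `z` plays the lurking variable, `x` and `y` each equal `z` plus independent random noise, and neither `x` nor `y` affects the other. The helper name `pearson_r` and all values are hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# z is the lurking variable; x and y each depend on z plus their own noise.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)

# Subtract z's contribution; what remains of x and y is independent noise.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print(f"r(x, y) = {r_xy:.2f}, after removing z: {r_rest:.2f}")
```

In this setup `x` and `y` should show a moderate positive correlation (about 0.5 in theory), which should drop to roughly zero once the contribution of `z` is removed. Subtracting `z` directly works only because the simulation adds it with a known coefficient of 1; with real data the effect of a lurking variable would have to be estimated.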

Correlation Tables

A correlation table displays the correlation for every pair of variables in a dataset, which makes it useful for spotting relationships at a glance.

Example Correlation Table
  • Table 6.1 presents correlations between variables collected for a sample of Amazon books. Notable entries include:

    • Number of Pages and Thickness have a strong positive correlation of $r = 0.813$.

    • Width has essentially no linear correlation with the number of pages ($r = 0.003$).

    • Year published correlates differently with each of the other variables; for instance, $r = 0.253$ with the number of pages, a weak positive correlation.
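A table of this kind can be built by computing $r$ for every pair of columns. The following is a minimal sketch using hypothetical measurements for five books; these are not the Table 6.1 data, and the names `books`, `table`, and `pearson_r` are assumptions made for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements for five books (not the Table 6.1 data).
books = {
    "pages":     [120, 250, 310, 480, 600],
    "thickness": [0.4, 0.7, 0.9, 1.3, 1.6],
    "width":     [5.5, 6.0, 5.2, 6.1, 5.4],
}

names = list(books)
# One entry per ordered pair of variables, including each variable with itself.
table = {a: {b: pearson_r(books[a], books[b]) for b in names} for a in names}

# Print the table with one row and one column per variable.
print(" " * 10 + "".join(f"{name:>11}" for name in names))
for a in names:
    print(f"{a:<10}" + "".join(f"{table[a][b]:>11.3f}" for b in names))
```

Two properties are useful when reading any correlation table: the diagonal is always 1, because each variable is perfectly correlated with itself, and the table is symmetric, because the correlation of $x$ with $y$ equals the correlation of $y$ with $x$.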

Correlation Analysis Summary Statistics

Correlation analysis usually starts from summary statistics for each variable, since $r$ is built from the means and standard deviations. For example, a dataset could include the following statistics for scores and time spent on tasks:

  • Scores:

    • Mean: 79.23

    • Variance: 152.62

    • Standard deviation: 12.35

    • Median: 81.25

    • Minimum: 45.93

  • Time Spent:

    • Mean: 0.73

    • Variance: 0.21

    • Standard deviation: 0.45
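Because the standard deviation is the square root of the variance, the reported figures can be checked against each other. The sketch below does that for the values listed above, then produces the same kind of summary for a made-up list of scores (the list `scores` is an assumption for illustration, not the course dataset).

```python
from math import sqrt
from statistics import mean, median, stdev, variance

# Consistency check on the reported figures: sd should equal sqrt(variance).
print(round(sqrt(152.62), 2))  # scores: compare with the reported 12.35
print(round(sqrt(0.21), 2))    # time spent: compare with the reported 0.45

# Made-up scores, only to show how such a summary is produced.
scores = [62.0, 71.5, 80.0, 82.5, 88.0, 93.5]
summary = {
    "mean": mean(scores),
    "variance": variance(scores),  # sample variance (n - 1 denominator)
    "sd": stdev(scores),           # sample standard deviation
    "median": median(scores),
    "min": min(scores),
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

A difference in the last digit is expected in a check like this, because the reported variance has itself been rounded to two decimal places.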

Straightening Scatterplots

Example of Transforming Data
  • The cost of generating solar power was observed to decline between 2009 and 2013. The original time series plot did not show a strong linear association.

  • By applying a logarithmic transformation to the data, the scatterplot appeared straighter, suggesting that the correlation of the transformed data is now a more reliable measure of association.

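  • Common Transformations: Logarithmic, square root, and reciprocal transformations may be used. Results must then be interpreted on the transformed scale: a correlation computed after a logarithmic transformation describes the relationship with the logarithm of the variable, not with the variable in its original units.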
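The effect of straightening can be seen on a small made-up series. In the sketch below, the costs are hypothetical values that halve each year; they are not the solar power data. The helper name `pearson_r` is also an assumption. Since a quantity that shrinks by a constant factor is exactly linear on a log scale, the correlation should move from strong to essentially perfect after the transformation.

```python
from math import log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up costs that halve each year (illustrative, not the solar data).
years = [2009, 2010, 2011, 2012, 2013]
cost = [8.0, 4.0, 2.0, 1.0, 0.5]

r_raw = pearson_r(years, cost)                    # curved pattern
r_log = pearson_r(years, [log(c) for c in cost])  # straightened pattern

print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

The second value describes the association between year and log cost, so any statement based on it should be phrased in terms of log cost rather than cost itself.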

Homework and Practice

Students are encouraged to engage in practice exercises related to Chapter 6 on scatterplots, association, and correlation, as well as complete the associated data report assignments.