Scatterplots and Correlation

Statistical Reasoning - Scatterplots and Correlation

Scatterplot Definition and Explanation

  • A scatterplot is a graphical representation that displays the relationship between two quantitative variables measured on the same individuals.

    • Each data point on the scatterplot corresponds to the values of both variables for that individual.

  • Axes Orientation:

    • The explanatory variable is plotted on the horizontal axis (x-axis) and the response variable on the vertical axis (y-axis).

    • If there is no distinction between explanatory and response variables, either variable can occupy the x-axis.

  • The goal in reading a scatterplot is to see whether the x variable helps explain the y variable; not every pair of variables shows such a relationship.

Case Study: Food Nutrition

  • The scatterplot illustrates the relationship between the percentage of nutritionists who consider a food item healthy and the percentage of American voters who believe the same about that food item.

    • The explanatory variable is displayed on the x-axis, while the response variable occupies the y-axis.

  • Scatterplots can be enhanced by incorporating additional features such as different shapes to highlight outliers or points of interest.

Case Study: Food Nutrition (Detailed Observation)

  • Each point on the scatterplot represents one specific food item.

  • Observational Trend: Generally, as the percentage of nutritionists deeming the food healthy increases, so does the percentage of American voters affirming the food item's healthiness.

Case Study: Health vs. Wealth

  • A scatterplot of World Bank data for 2016 shows the relationship between GDP per capita (explanatory variable) and life expectancy at birth (response variable).

  • Non-linear Relationships: Not all relationships are linear; in such cases, transformations of the x variable may be employed (e.g., using $x^2$ or $\ln(x)$) or alternative functional forms like logarithmic or exponential may be applied.
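The effect of transforming x can be sketched with a small example. The numbers below are synthetic and chosen only for illustration (they are not the World Bank data); `pearson_r`, `gdp`, and `life` are hypothetical names, and the curve $y = 40 + 4\ln(x)$ is an assumed shape that rises quickly and then levels off.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic, illustrative values only (NOT real World Bank figures):
# y climbs fast at low x and flattens at high x, following y = 40 + 4 * ln(x).
gdp = [500, 1000, 2000, 5000, 10000, 20000, 50000, 100000]
life = [40 + 4 * math.log(g) for g in gdp]

r_raw = pearson_r(gdp, life)                         # curved pattern against x
r_log = pearson_r([math.log(g) for g in gdp], life)  # straight line against ln(x)

print(f"r using x:     {r_raw:.3f}")
print(f"r using ln(x): {r_log:.3f}")
```

Because the synthetic y values are an exact function of $\ln(x)$, the transformed data fall on a straight line, while the untransformed data follow a curve that a straight line describes less well.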

Observational Insights from Health vs. Wealth

  • The data indicate that life expectancy increases rapidly with rising GDP until it reaches a saturation point.

  • Notably, life expectancy in extremely wealthy nations, such as the United States, does not significantly exceed that of less affluent, yet not impoverished, nations.

  • Countries like Costa Rica have a life expectancy comparable to that of the United States despite a lower GDP per capita.

  • Liechtenstein is identified as an outlier, attributed to its strong financial sector and status as a tax haven.

  • Scatterplots serve as a fundamental tool in exploratory data analysis (EDA) to discern patterns within datasets.

How to Use Scatterplots

  • Descriptions of scatterplots can be articulated through:

    • Direction: Whether the points trend upward (positive) or downward (negative) from left to right.

    • Form: The overall shape of the point cloud, such as linear or curved.

    • Strength: How closely the points follow that form.

  • Outliers, data points that deviate from the general trend of the scatterplot, also deserve attention.

  • Types of Correlation:

    • Positive Correlation: Characterized by an upward slope in the scatterplot.

    • Negative Correlation: Exhibited by a downward slope in the scatterplot.

  • Important Note: The visible correlation in scatterplots does not imply causation; establishing causation necessitates a more rigorous analysis and evidence.

Further Case Study: Food Nutrition

  • In the Food Nutrition case study, the scatterplot shows a roughly linear pattern. The relationship is not especially strong, but there is a positive association between nutritionists' and voters' perceptions of food healthiness.

    • Scatterplots effectively illustrate the relationship between two continuous variables, showing both its direction and its strength.

Case Study: MPG vs. Car Weight

  • Analysis of the scatterplot indicates a moderately strong linear relationship between gas mileage (MPG) and car weight.

    • A distinct negative association is noted: as car weight increases, gas mileage decreases.

Expanding Scatterplots with Multiple Variables

  • Scatterplots can incorporate more than two variables by utilizing various symbols to differentiate categories.

    • An example scatterplot accounts for the number of cylinders in cars by employing different shapes and colors to indicate this additional variable alongside others.
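One way to encode a third, categorical variable is to split the points by category and draw each group with its own marker. The sketch below assumes hypothetical records of (weight, MPG, cylinders); the values, the `MARKERS` mapping, and both function names are illustrative, and the plotting step assumes matplotlib is installed.

```python
from collections import defaultdict

# Hypothetical records: (weight in 1000 lb, MPG, cylinders). Illustrative only.
cars = [
    (2.2, 33.0, 4), (2.6, 29.5, 4), (3.1, 23.0, 6),
    (3.4, 21.0, 6), (4.1, 16.5, 8), (4.4, 14.0, 8),
]

# One marker shape per category: circle, square, triangle.
MARKERS = {4: "o", 6: "s", 8: "^"}

def group_by_category(records):
    """Split (x, y, category) records into {category: (xs, ys)}."""
    groups = defaultdict(lambda: ([], []))
    for x, y, category in records:
        xs, ys = groups[category]
        xs.append(x)
        ys.append(y)
    return dict(groups)

def plot_groups(groups):
    """Draw one scatter layer per category (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; optional dependency

    fig, ax = plt.subplots()
    for category, (xs, ys) in sorted(groups.items()):
        # Each scatter call gets its own marker and the next default color.
        ax.scatter(xs, ys, marker=MARKERS[category], label=str(category))
    ax.set_xlabel("Weight (1000 lb)")
    ax.set_ylabel("Gas mileage (MPG)")
    ax.legend(title="Cylinders")
    return fig

groups = group_by_category(cars)
```

Calling `plot_groups(groups)` then returns a figure in which the legend identifies each cylinder count by shape and color.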

Correlation

  • A scatterplot serves to reveal the direction, form, and strength of relationships between two variables.

  • Strength of Relationships:

    • A straight-line relationship is categorized as strong if data points closely align with the line of best fit and weak if points are dispersed widely.

  • The strength of the correlation is measured by the proximity of data points to the line of best fit.
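The idea of "proximity to the line of best fit" can be made concrete with a least-squares line and the typical vertical distance of the points from it. This is a minimal sketch: the function names and the two small data sets are hypothetical, and ordinary least squares is assumed as the meaning of "line of best fit".

```python
import math

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residual_spread(xs, ys):
    """Root-mean-square vertical distance of the points from the fitted line."""
    slope, intercept = least_squares_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Hypothetical data: both sets trend upward, but with different scatter.
xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # points hug the line
loose = [4.0, 1.5, 8.5, 5.0, 12.5, 9.0]   # points are widely dispersed

print(f"tight spread: {residual_spread(xs, tight):.2f}")
print(f"loose spread: {residual_spread(xs, loose):.2f}")
```

The smaller spread corresponds to the stronger straight-line relationship. Note that this spread is expressed in the units of y, so it is useful for comparing data sets on the same scale rather than as a unit-free measure.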

Correlation Coefficient

  • The correlation coefficient, denoted $r$, ranges from $-1$ to $1$ and summarizes the direction and strength of a linear relationship in a single number.

  • Interpretation of Values:

    • Strong positive correlation: $r > 0.8$

    • Weak, slightly negative correlation: $-0.4 < r < 0$

    • Moderate, negative correlation: $-0.8 < r < -0.4$

  • The degree of scattering reflects the strength of the correlation: closer data points suggest a stronger correlation, whereas wider dispersal indicates a weaker correlation.

  • Personal Preference: Individual interpretations may vary regarding what constitutes a strong, moderate, or weak correlation.
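Assuming $r$ here is the usual Pearson coefficient, it can be computed as $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The sketch below computes $r$ for a small hypothetical data set and labels it using the cutoffs listed above. Mirroring those cutoffs to the opposite sign, and the treatment of values exactly at $\pm 0.4$ or $\pm 0.8$, are assumptions; the function names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Label r with the 0.4 / 0.8 cutoffs, mirrored for both signs.

    Assumption: the notes do not say how to label values exactly at the
    cutoffs, so |r| == 0.8 is treated as moderate and |r| == 0.4 as weak.
    """
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size > 0.8:
        strength = "strong"
    elif size > 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(f"r = {r:.3f} ({describe(r)})")
```

As the last bullet notes, these labels are conventions rather than fixed rules, so the cutoffs in `describe` can reasonably be adjusted.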

Limitations of Correlation

  • Causation vs. Correlation: Correlation does not imply causation, but it does give valuable insight into how well an x-variable may predict a y-variable.

  • Explanatory vs. Response Variables: Correlation analysis does not differentiate between these two types of variables.

  • The value of $r$ is the same regardless of which variable is designated as x and which as y.

  • Correlation measures only linear association, so it does not adequately capture curved relationships between variables, even strong ones.

  • Correlation is particularly sensitive to outlying observations. Use the correlation coefficient $r$ with caution when the scatterplot contains outliers.
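Three of the limitations above can be checked numerically. The following is a minimal sketch with made-up data, assuming $r$ is the Pearson coefficient; all variable names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1. Symmetry: swapping the roles of x and y leaves r unchanged.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)

# 2. Curved relationship: y is completely determined by x (y = x^2),
#    yet r is 0 because there is no overall linear trend.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# 3. Outlier sensitivity: six points lie exactly on y = 2x (r = 1);
#    adding the single point (7, 0) pulls r down to 0.25.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2, 4, 6, 8, 10, 12]
r_clean = pearson_r(base_x, base_y)
r_outlier = pearson_r(base_x + [7], base_y + [0])

print(f"r(x, y) = {r_xy:.3f}, r(y, x) = {r_yx:.3f}")
print(f"r for y = x^2: {r_curve:.3f}")
print(f"r without outlier: {r_clean:.3f}, with outlier: {r_outlier:.3f}")
```

In each case the scatterplot would reveal what the single number hides, which is why $r$ is best read alongside the plot rather than in place of it.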