Scatterplots and Correlation
Statistical Reasoning - Scatterplots and Correlation
Scatterplot Definition and Explanation
A scatterplot is a graphical representation that displays the relationship between two quantitative variables measured on the same individuals.
Each data point on the scatterplot corresponds to the values of both variables for that individual.
Axes Orientation:
The explanatory variable is plotted on the horizontal axis (x-axis) and the response variable on the vertical axis (y-axis).
If there is no distinction between explanatory and response variables, either variable can occupy the x-axis.
When analyzing a scatterplot, the goal is to identify whether the x variable helps explain the y variable; such a relationship is not guaranteed to exist.
Case Study: Food Nutrition
The scatterplot illustrates the relationship between the percentage of nutritionists who consider a food item healthy and the percentage of American voters who believe the same about that food item.
The explanatory variable is displayed on the x-axis, while the response variable occupies the y-axis.
Scatterplots can be enhanced by incorporating additional features such as different shapes to highlight outliers or points of interest.
Case Study: Food Nutrition (Detailed Observation)
Each point on the scatterplot represents one specific food item.
Observational Trend: Generally, as the percentage of nutritionists deeming the food healthy increases, so does the percentage of American voters affirming the food item's healthiness.
Case Study: Health vs. Wealth
A scatterplot of World Bank data for the year 2016 shows the relationship between GDP per capita (explanatory variable) and life expectancy at birth (response variable).
Non-linear Relationships: Not all relationships are linear; in such cases, transformations of the x variable may be employed (e.g., using $x^2$ or $\ln(x)$) or alternative functional forms like logarithmic or exponential may be applied.
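To illustrate how a transformation of x can straighten a curved pattern, here is a minimal Python sketch. The numbers are invented for illustration and are not World Bank figures; the names `gdp`, `life`, and `slopes` are hypothetical. The response is built to rise by a fixed amount each time x doubles, so the slope between neighboring points shrinks on the raw scale but stays constant once x is replaced by $\ln(x)$.

```python
import math

# Invented data (NOT World Bank figures): y gains 4 units per doubling of x.
gdp = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000]
life = [50.0 + 4.0 * math.log2(g / 1_000) for g in gdp]

def slopes(xs, ys):
    """Slope of the segment joining each pair of consecutive points."""
    return [(y2 - y1) / (x2 - x1)
            for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:])]

# On the raw x scale the slope keeps shrinking: a curve that flattens out.
raw_slopes = slopes(gdp, life)

# After replacing x with ln(x) the slope is constant: a straight line.
log_slopes = slopes([math.log(g) for g in gdp], life)

print("raw x slopes:", [round(s, 5) for s in raw_slopes])
print("ln(x) slopes:", [round(s, 5) for s in log_slopes])
```

With real data the transformed slopes would only be roughly constant, so the transformed scatterplot should still be inspected by eye.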
Observational Insights from Health vs. Wealth
The data indicate that life expectancy increases rapidly with rising GDP until it reaches a saturation point.
Notably, life expectancy in extremely wealthy nations, such as the United States, does not significantly exceed that of less affluent, yet not impoverished, nations.
Countries such as Costa Rica achieve a life expectancy comparable to that of the United States despite having a lower GDP per capita.
Liechtenstein is identified as an outlier, which is attributed to its strong financial sector and status as a tax haven.
Scatterplots serve as a fundamental tool in exploratory data analysis (EDA) to discern patterns within datasets.
How to Use Scatterplots
A scatterplot is described in terms of:
Direction: The overall trend of the points (upward or downward).
Form: The shape of the cluster of data points (for example, linear or curved).
Strength: How closely the points follow that form.
Attention should also be paid to outliers: data points that deviate from the general trend of the scatterplot.
Types of Correlation:
Positive Correlation: Characterized by an upward slope in the scatterplot.
Negative Correlation: Exhibited by a downward slope in the scatterplot.
Important Note: The visible correlation in scatterplots does not imply causation; establishing causation necessitates a more rigorous analysis and evidence.
Further Case Study: Food Nutrition
In the Food Nutrition case study, the points form a roughly linear pattern. The relationship is not especially strong, but a positive association exists between nutritionists' and voters' perceptions of food healthiness.
Scatterplots effectively illustrate the relationship between two continuous variables, showing both its direction and its strength.
Case Study: MPG vs. Car Weight
Analysis of the scatterplot indicates a moderately strong linear relationship between gas mileage (MPG) and car weight.
A distinct negative association is noted: as car weight increases, gas mileage decreases.
Expanding Scatterplots with Multiple Variables
Scatterplots can incorporate more than two variables by utilizing various symbols to differentiate categories.
For example, a scatterplot of cars can show the number of cylinders as an additional variable by using a different shape and color for each cylinder count.
Correlation
A scatterplot serves to reveal the direction, form, and strength of relationships between two variables.
Strength of Relationships:
A straight-line relationship is strong if the data points lie close to the line of best fit and weak if they are widely scattered around it.
The closer the points lie to the line of best fit, the stronger the correlation.
Correlation Coefficient
The correlation coefficient, denoted $r$, takes values from $-1$ to $1$ and gives a numerical measure of the direction and strength of the linear relationship between two quantitative variables.
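These notes do not state a formula for $r$. For reference, the standard Pearson definition is given below, where $n$ is the number of individuals, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations (these symbols are introduced here and do not appear elsewhere in the notes):

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Because each term is a product of standardized values, $r$ has no units and does not change when either variable is rescaled by a positive factor or shifted by a constant.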
Interpretation of Values:
Strong positive correlation: $r > 0.8$
Weak, slightly negative correlation: $-0.4 < r < 0$
Moderate, negative correlation: $-0.8 < r < -0.4$
The degree of scatter reflects the strength of the correlation: points that lie close to the line suggest a stronger correlation, whereas widely dispersed points indicate a weaker one.
Personal Preference: These cutoffs are conventions rather than fixed rules; individual interpretations may vary regarding what constitutes a strong, moderate, or weak correlation.
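As a rough sketch of how $r$ could be computed and then labeled with the cutoffs listed above (0.8 and 0.4), consider the following Python code. The function names `pearson_r` and `describe` are hypothetical, the data are invented, and applying the same cutoffs symmetrically to positive and negative values, with a boundary value falling into the weaker category, is an assumption rather than something the notes specify.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def describe(r, strong=0.8, weak=0.4):
    """Label r using the cutoffs from the notes (assumed symmetric in sign)."""
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size > strong:
        strength = "strong"
    elif size > weak:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

# Invented values for illustration only: y falls steadily as x rises.
xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [34, 30, 27, 22, 21, 16]
r = pearson_r(xs, ys)
print(round(r, 3), describe(r))
```

The cutoffs are keyword arguments so that a reader who prefers different conventions can pass their own values.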
Limitations of Correlation
Causation vs. Correlation: Correlation does not imply causation, although it does provide valuable insight into how well an x variable may predict a y variable.
Explanatory vs. Response Variables: Correlation analysis does not differentiate between these two types of variables.
The value of $r$ is the same regardless of which variable is designated as x or y.
Correlation measures only linear association, so it does not adequately capture curved relationships between variables, even strong ones.
Correlation is particularly sensitive to outlying observations; use the correlation coefficient $r$ with caution when outliers are present in the scatterplot.
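The last three limitations can be checked numerically. The sketch below uses small invented data sets and a hypothetical helper `pearson_r` that implements the standard Pearson formula; it is meant as an illustration, not as a substitute for looking at the scatterplot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1. Swapping the roles of x and y leaves r unchanged.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]   # invented, nearly linear in x
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)

# 2. A perfect but curved relationship (v = u squared) can give r = 0.
u = [-3, -2, -1, 0, 1, 2, 3]
v = [t ** 2 for t in u]
r_curved = pearson_r(u, v)

# 3. A single outlier can change r drastically, here even flipping its sign.
x_out = x + [20]
y_out = y + [0.0]
r_outlier = pearson_r(x_out, y_out)

print("r(x, y) =", round(r_xy, 4), " r(y, x) =", round(r_yx, 4))
print("curved relationship: r =", round(r_curved, 4))
print("with one outlier: r =", round(r_outlier, 4))
```

In the second case the points lie exactly on a parabola, yet $r$ is zero because the upward and downward halves cancel; this is why a scatterplot should be examined before relying on $r$.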