Nonlinear Relationships and Outliers — Study Notes

Key Concepts

Pearson's r measures the strength and direction (positive or negative) of the linear relationship between two variables. It is most appropriately interpreted when a scatter plot shows that the relationship is roughly linear.
Not all relationships are linear. Some are nonlinear (e.g., curvilinear), and Pearson's r is inappropriate for such patterns.
Examples of nonlinear patterns include:
- Y increases with X up to a point, then decreases (curvilinear).
- Y increases with X but levels off over time (saturation).
Real-world nonlinear example: emotional arousal and performance. There is an optimal level of arousal for peak performance (often described as an inverted-U or Yerkes-Dodson-type relation): low arousal yields poor performance, high arousal also yields poor performance, and a middle range yields maximum performance.
Correlation is not a resistant (robust) statistic: it is sensitive to outliers, similar to the mean and standard deviation.
Outliers can dramatically change the correlation coefficient, especially with small sample sizes.
For example, starting with a data set where there is no clear relationship, r can be around $r = 0.075$ (slightly positive, very weak).
If you move one data point to be high on both X and Y (an outlier in the joint space), r can jump to $r = 0.50$ (a moderate-to-strong positive correlation).
Adding another outlier point can push r further to $r = 0.65$.
The magnitude of the outlier effect is amplified when the sample size is small; with fewer participants, an outlier has more influence on r.
Practical implication: always inspect both predictor and outcome variables for deviations/outliers and consider their influence before drawing conclusions from r.
In the stress and hazardous drinking example, there are a couple of data points that could be outliers.
There are steps to identify true outliers, and one should never remove data solely based on a scatter plot.
Illustrative data adjustment: removing two points in the stress/hazardous drinking example changes r from $r = 0.27$ to $r = 0.29$. This is a small change, partly because the dataset is fairly large (79 participants), so those two points carry less weight than they would in a smaller sample.
If we shift one data point to an extreme value on both variables, the correlation can rise to $r = 0.35$ .
Key takeaway: data handling requires care and principled judgment. Human error or deliberate manipulation can skew data and erode credibility and careers. As scientists, remove data points only for valid, principled reasons, not merely to achieve a desired result.

Nonlinear relationships and why Pearson's r fails

Pearson's r captures linear associations, not curvature or more complex patterns.
Curvilinear patterns can be strong and predictable yet produce an r near zero, so a low r does not rule out a relationship.
The example of arousal and performance illustrates that a predictable, non-linear relation can exist even when r suggests a weak or moderate association.
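To make this concrete, here is a minimal sketch using made-up inverted-U data (the arousal and performance values are hypothetical, not taken from the lecture). The helper `pearson_r` is written out from the definition of r so the example needs only the standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical inverted-U data: performance peaks at arousal = 5
# and falls off symmetrically on both sides.
arousal = list(range(11))                            # 0, 1, ..., 10
performance = [25 - (a - 5) ** 2 for a in arousal]

r_inverted_u = pearson_r(arousal, performance)
print(f"r = {r_inverted_u:.3f}")
```

Performance here is perfectly predictable from arousal, yet r comes out at essentially zero because the rising and falling halves cancel in the linear summary.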

A real example: emotional arousal and performance

Low arousal → poorer performance.
High arousal → poorer performance.
There is an optimal (middle) level of arousal for maximal performance.
This pattern is not captured well by Pearson's r because the relationship is not linear.

Outliers and correlation

Correlation is not resistant to outliers; it can be distorted by extreme values.
Outliers can occur due to measurement error, unusual cases, or genuine variation.
Small sample sizes are particularly sensitive to outliers.
Numerical examples from the transcript show the impact:
- Initial dataset: $r = 0.075$ (very weak positive).
- One point moved to be high on both axes: $r = 0.50$ (moderate to strong).
- A second outlier added: $r = 0.65$ (strong).
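The same kind of jump can be reproduced with a small made-up dataset. This is only a sketch: the lecture's actual data are not given in these notes, so the numbers below are hypothetical and will not match 0.075, 0.50, or 0.65.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical small sample (n = 8) with no linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 4]
r_before = pearson_r(x, y)

# Add one point that is extreme on both variables.
r_one_outlier = pearson_r(x + [20], y + [20])

# Add a second extreme point in the same region.
r_two_outliers = pearson_r(x + [20, 22], y + [20, 21])

print(f"no outliers:  r = {r_before:.2f}")
print(f"one outlier:  r = {r_one_outlier:.2f}")
print(f"two outliers: r = {r_two_outliers:.2f}")
```

With only eight original points, a single extreme case is enough to turn "no relationship" into an apparently strong positive correlation.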

Case study: stress and hazardous drinking data

Initial observation: there may be points that look like outliers on a scatter plot.
After removing two points (for illustration), the correlation changes from $r = 0.27$ to $r = 0.29$.
This small change is partly due to having 79 participants; the two points represent a smaller proportion of the data, so their impact is less than in smaller samples.
If one data point is shifted to an extreme on both variables, the correlation can rise to $r = 0.35$ .
This demonstrates how outliers and their placement can influence r, especially in smaller samples.
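The role of sample size can be sketched by adding the same extreme point to a small and a large synthetic sample. The data are hypothetical: the large sample simply repeats a no-trend pattern ten times (n = 80, close to the 79 participants mentioned above), which is unrealistic but keeps the example deterministic.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pattern with r = 0 (no linear trend).
pattern_x = [1, 2, 3, 4, 5, 6, 7, 8]
pattern_y = [5, 3, 6, 4, 5, 3, 6, 4]

# Small sample: the pattern once (n = 8). Large sample: ten copies (n = 80).
small_x, small_y = list(pattern_x), list(pattern_y)
large_x, large_y = pattern_x * 10, pattern_y * 10

# The same single extreme point is added to both samples.
outlier_x, outlier_y = 12, 10
r_small = pearson_r(small_x + [outlier_x], small_y + [outlier_y])
r_large = pearson_r(large_x + [outlier_x], large_y + [outlier_y])

print(f"n = 9:  r = {r_small:.2f}")
print(f"n = 81: r = {r_large:.2f}")
```

The identical point moves r far more in the small sample, which matches the pattern seen in the stress and hazardous drinking example.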

Detecting outliers and data handling best practices

There are multiple steps to determine whether a point is an outlier; the decision to remove data should not be based solely on a scatter plot.
In the provided example, points were discussed as potential outliers and their removal had limited impact due to the larger sample size.
Emphasize a principled approach to data analysis: avoid data manipulation, report how decisions affect results, and justify any data exclusions with valid criteria.
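These notes do not list the specific steps, so the following is only an assumed illustration of one common screening rule, Tukey's 1.5 x IQR fences. The function name `iqr_flags` and the data are hypothetical. The rule flags points for closer inspection; it does not by itself justify removing them.

```python
import statistics

def iqr_flags(values, k=1.5):
    """Return indices of values outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical scores on one variable; the last value sits far from the rest.
stress = [2, 3, 3, 4, 4, 5, 5, 6, 6, 30]

flagged = iqr_flags(stress)
print("indices flagged for inspection:", flagged)
```

A flagged case would then be checked against pre-specified criteria (for example, a recording error) before any decision about exclusion.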

Ethical, philosophical, and practical implications

Misreporting or tampering with data can severely damage credibility and careers.
Data integrity is foundational to scientific credibility.
Researchers should follow principled guidelines for data cleaning: remove data only under valid, pre-specified criteria, and report transparently how exclusions affect the results.

Practical takeaways

Pearson's r measures linear association; it does not capture nonlinear relationships.
Nonlinear patterns require other analyses or transformations to capture the relationship.
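As one possible illustration of a transformation, the sketch below correlates performance with the squared distance from an assumed optimum instead of with raw arousal. The data and the optimum of 5 are hypothetical; in practice the turning point would have to be estimated, for example by fitting a quadratic model.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, slightly noisy inverted-U data.
arousal = list(range(11))
performance = [2, 8, 17, 20, 25, 24, 23, 22, 15, 10, 1]

# Raw predictor: the linear correlation misses the pattern.
r_linear = pearson_r(arousal, performance)

# Transformed predictor: squared distance from an assumed optimum of 5.
distance_sq = [(a - 5) ** 2 for a in arousal]
r_transformed = pearson_r(distance_sq, performance)

print(f"raw arousal:           r = {r_linear:.2f}")
print(f"squared distance from 5: r = {r_transformed:.2f}")
```

The strongly negative r for the transformed predictor says that performance drops as arousal moves away from the optimum in either direction, which is exactly the pattern the raw correlation hides.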
Outliers can heavily influence r, especially with small samples; always assess potential outliers before interpreting r.
When data appear to have outliers, document your decision process and consider reporting analyses both with and without the suspected outliers.
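Reporting both analyses can be as simple as computing r twice. The sketch below uses hypothetical data and a hypothetical helper, `r_with_and_without`; it is not a prescribed procedure.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_with_and_without(xs, ys, suspect_indices):
    """Return r for the full data and for the data excluding suspected outliers."""
    suspects = set(suspect_indices)
    kept_x = [x for i, x in enumerate(xs) if i not in suspects]
    kept_y = [y for i, y in enumerate(ys) if i not in suspects]
    return {
        "r_all": pearson_r(xs, ys),
        "r_excluding_suspects": pearson_r(kept_x, kept_y),
        "n_excluded": len(xs) - len(kept_x),
    }

# Hypothetical data; the last case (index 8) is the suspected outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 12]
y = [5, 3, 6, 4, 5, 3, 6, 4, 10]

report = r_with_and_without(x, y, suspect_indices=[8])
for name, value in report.items():
    print(name, round(value, 2))
```

Presenting both values, along with the reason each case was flagged, lets readers judge for themselves how much the conclusion depends on the suspected outliers.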
Maintain ethical standards and avoid manipulating data to produce desired results; integrity is essential for credible science.

Mathematical recap

Definition of Pearson's r (for a sample of size $n$):
$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$
Alternative expression in terms of covariance and standard deviations:
$r = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$
Note: r always lies in the interval $[-1, 1]$. Interpreting it as a summary of the relationship assumes that the relationship is linear and not distorted by influential outliers.
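As a quick check that the two expressions agree, the sketch below computes r both ways on a small hypothetical dataset. The covariance form here uses the sample covariance and sample standard deviations (dividing by $n - 1$); the $n - 1$ factors cancel, so the result matches the definition.

```python
import math
import statistics

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Form 1: the definition (sum of cross-products over the root sums of squares).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r_definition = sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Form 2: sample covariance divided by the product of sample standard deviations.
cov_xy = sxy / (n - 1)
r_covariance = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(f"definition form: r = {r_definition:.3f}")
print(f"covariance form: r = {r_covariance:.3f}")
```

For these values the cross-product sum is 8 and both sums of squares are 10, so both forms give r = 0.8.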