Correlation/ Regression

0.0(0)

Studied by 0 people

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Card Sorting

1/52

There's no tags or description

Looks like no tags are added yet.

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

53 Terms

New cards

Putting Correlation / Regression into Context

DV = interval/ratio

New cards

Three Characteristics of a Correlation (r) Between X & Y

Direction: positive, negative, no correlation

Degree / magnitude: ranges from -1 to +1; extreme scores unlikely

Possible ranges of r values [-1 to +1] perfect negative correlation to perfect positive correlation… correlations closer to -1 or +1 = points are closer to the regression line… unlikely to get a correlation of ±1 because… measurement error, Y is predicted by more than just 1 X variable

Form: correlation useful only for linear relationships

<p><strong>Direction</strong>: positive, negative, no correlation</p><p><strong>Degree / magnitude</strong>: ranges from -1 to +1; extreme scores unlikely</p><ul><li><p>Possible ranges of r values [-1 to +1] perfect negative correlation to perfect positive correlation… correlations closer to -1 or +1 = <u>points are closer to the regression line</u>… unlikely to get a correlation of <span>±1 because… <em>measurement error, Y is predicted by more than just 1 X variable </em></span><em> </em></p></li></ul><p><strong>Form</strong>: correlation useful only for <em>linear</em> relationships </p><p></p>

New cards

Important note

Before conducting statistical analyses (e.g., calculating r), always look at your x and y data by producing a scatterplot!

New cards

Problems to Watch Out For: Restricted Range

Occurs when experimenters look at only one portion of the data (not good)
Can lead to underestimation of a relationship between X and Y
Can lead to overestimation of a relationship between X and Y

New cards

Restricted Range: Case 1

Underestimation of association
- Strong linear relationship when full range of x values is used, but no relationship when restricted range used

<ul><li><p>Underestimation of association</p><ul><li><p>Strong linear relationship when full range of x values is used, but no relationship when restricted range used</p></li></ul></li></ul><p></p>

New cards

Restricted Range: Case 2

Underestimation of association
- Strong linear relationship when full range of x values is used, but no relationship when restricted range used

New cards

Restricted Range: Case 3

Overestimation of association
- No relationship when full range of x values is used, but strong relationship when restricted range used

<ul><li><p>Overestimation of association</p><ul><li><p>No relationship when full range of x values is used, but strong relationship when restricted range used </p></li></ul></li></ul><p></p>

New cards

Non-linear relationships: Example 1

Curvilinear relationship:

If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.
To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X
SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.

New cards

Non-linear relationships: Example 2

These data are not linear (the curve levels off – is asymptotic)
Here, the change in the y-values become increasingly small as the x-values increase.

<ul><li><p>These data are not linear (the curve levels off – is asymptotic)</p></li><li><p>Here, the change in the y-values become increasingly small as the x-values increase.</p></li></ul><p></p>

New cards

Problems to watch out for: Outliers

Scores that are outside the range for X, Y, or both
Can cause either over or underestimation of the correlation

New cards

Outliers: Overestimation of association

Including the outlier (inaccurately) strengthens the correlation.

New cards

Outliers: Underestimation of association

In this case, including the outlier weakens the correlation

Bottom line: set criteria for outlier removal BEFORE plotting/ analyzing the data; do not tempt yourself into selective data trimming!

<p>In this case, including the outlier <em>weakens </em>the correlation</p><p>Bottom line: set criteria for outlier removal BEFORE plotting/ analyzing the data; do not tempt yourself into selective data trimming!</p>

New cards

The Pearson Product-Moment Correlation (r) - Conceptual Understanding

r is an indicator of location in X distribution (relative to mean of X)

& location in Y distribution (relative to mean of Y)

Regression line must cross through the mean of & the mean of Y

We can use z scores to precisely describe the location of an individual within a distribution.

For a positive correlation: positive Z scores on X correspond to positive Z scores on Y; negative Z scores on X correspond to negative Z scores on Y.
For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.
We cannot just use ∑ZXZY as a measure of r because of the same drawback we ran into with the SS…
It keeps getting larger and larger as the number of scores increases. But.. Use same solution (divide by N-1)

<p>r is an indicator of location in X distribution (relative to mean of X)</p><p>& location in Y distribution (relative to mean of Y)</p><p>Regression line must cross through the mean of & the mean of Y</p><p>We can use z scores to precisely describe the location of an individual within a distribution.</p><ul><li><p>For a positive correlation: positive Z scores on X correspond to positive Z scores on Y; negative Z scores on X correspond to negative Z scores on Y.</p></li><li><p>For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.</p></li><li><p>We cannot just use <strong>∑</strong>ZXZY as a measure of r because of the same drawback we ran into with the SS…</p></li><li><p>It keeps getting larger and larger as the number of scores increases. But.. Use same solution (divide by N-1)</p></li></ul><p></p>

New cards

Definitional formula for r

SSxy = SP = Sum of Products: How much X and Y covary

Remember: SS = how much a set of scores varies from its mean.

So we will use… SSX for variable X SSY for variable Y.

r assesses how the covariation of X and Y relates to the variation (variability) of X and Y separately

Unlike sum of squares, covariance CAN be negative! (i.e., as X increases, Y decreases).

When the covariation (numerator) is big, r is big.

New cards

Understanding r / r vs. slope

Value of r = .72→ A 1 unit increase in X is associated with, on average, a .72 unit increase in Y.

r and the slope of a line (b) are similar, but not the same.

r is in standard deviation units, whereas the slope of a line (b) is the rise over run (e.g., the change in X over the change in Y), and is in scale units.

New cards

Inferential Aspects of Correlation (i.e., “Is r significant?”)

BUT… we can get a big r in our sample even when there’s no relationship between X and Y in the population… BECAUSE… a large r could occur as a result of sampling error alone. So we need to check whether or not our r is statistically significant.

Statistics vs. Parameters

For a sample: correlation = r

For a population: correlation = ρ (rho)

New cards

Steps in Hypothesis Testing; Null & Alternative Hypotheses

H0: ρ = 0 (there is NO linear relationship in the population) H1: ρ ≠ 0 (there IS a linear relationship in the population)

New cards

Steps in Hypothesis Testing: Constructing a Sampling Distribution

Pull all possible random samples of size N (where a sample is a set of X,Y scores) from a single population, calculate the r values, and plot the r values in a histogram.

New cards

What are df?

What are df for a correlation?

We lose a df for each mean we estimate (i.e., one for X and one for Y)

So, df = N – 2

New cards

Our 3 Qs:

Is there a relationship?

Is the correlation significant? (no leading 0s!)

If so, what is the nature of that relationship?

Is the correlation positive or negative?
Describe in words using the context of the example, i.e. Disgust and conservatism were positively correlated; in other words, stronger feelings of disgust were associated with greater political conservatism

If so, what is the strength of that relationship?

New cards

Strength of the Relationship between X & Y

We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.

the same r can be big in one context and small in another, depending on N
r isn’t linear (.8 isn’t twice as big as .4)
Strength = Coefficient of determination = r²
For our example, r² = (.721)² = .52

Because it is expressed as a proportion of variability, r² can be compared across studies

<p>We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.</p><ul><li><p>the same r can be big in one context and small in another, depending on N</p></li><li><p>r isn’t linear (.8 isn’t twice as big as .4)</p></li><li><p>Strength = Coefficient of determination = r²</p></li><li><p>For our example, r² = (.721)² = .52</p></li></ul><p>Because it is expressed as a proportion of variability, r² can be compared across studies</p><p></p>

New cards

Coefficient of alienation

1 - r²; 1 - .52 = .48

proportion of variability in Y NOT explained by X; i.e., error

New cards

Assumptions of the Pearson Correlation

Sample is independently and randomly selected from the population
Bivariate normality:
- The distribution of Y scores at any value of X is normal in the population
- Robust to violations of this assumption if N > 15
Homoscedasticity:
- Y scores have equal variances at each value of X in the population
- Not robust to violations of this assumption

Data H→ They have a similar spread in Y across the range of X values.

No→ data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.

New cards

Bivariate Normality

Schematic of a bivariate normal dataset→ Picture this as a 3D graph, with the distributions coming out toward you.

An actual bivariate normal data set

Most of the scores fall close to the line of best fit.

<p>Schematic of a bivariate normal dataset→ Picture this as a 3D graph, with the distributions coming out toward you.</p><p>An actual bivariate normal data set</p><p>Most of the scores fall close to the line of best fit.</p><p></p>

New cards

Visualizing Correlations

Notice that as r approaches 1, the data are closer and closer to the regression line.

Also consider: When r = -1 there is a perfect negative correlation between x and y

and: when r = 0 there is no correlation between x and y

…and here is what a scatterplot with a negative r looks like.

New cards

Interpreting r & r²

r = .3, r² = .09— We can explain about 9% of the variability in depression with loneliness

New cards

Problems to Watch Out For: Restricted range

Occurs when experimenters look at only one portion of the data (not good)
Can lead to underestimation of a relationship between X and Y
Can lead to overestimation of a relationship between X and Y

New cards

Restricted range: Case 1: Underestimation of Association

Strong linear relationship when full range of x values is used, but no relationship when restricted range used

New cards

Restricted range: Case 2: Underestimation of Association

Strong linear relationship when full range of x values is used, but no relationship when restricted range used

New cards

Restricted range: Case 3: Overestimation of Association

No relationship when full range of x values is used, but strong relationship when restricted range used

New cards

Non-linear relationships: Example 1: curvilinear relationship

Point 1: If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.

To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X.

SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.

Point 2: These x and y data are not linear. Don’t conduct a correlation analysis on these data!

<p>Point 1: If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.</p><p>To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X.</p><p>SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.</p><p>Point 2: These x and y data are not linear. Don’t conduct a correlation analysis on these data!</p>

New cards

Non-linear relationships: Example 2

These data are not linear (the curve levels off – is asymptotic)

Here, the change in the y-values become increasingly small as the x-values increase.

Don’t conduct a correlation analysis on these data!

<p>These data are not linear (the curve levels off – is asymptotic)</p><p>Here, the change in the y-values become increasingly small as the x-values increase.</p><p>Don’t conduct a correlation analysis on these data!</p>

New cards

Problems to watch out for continued: Outliers

Scores that are outside the range for X, Y, or both

Can cause either over or underestimation of the correlation

New cards

Outliers: Overestimation of association

Including the outlier (inaccurately) strengthens the correlation.

New cards

Outliers: Underestimation of association

In this case, including the outlier weakens the correlation.

Bottom line:

Always look at a scatterplot of your x and y data!

Set your criteria for outlier removal BEFORE plotting or analyzing the data. Do not tempt yourself into selective data trimming!

<p>In this case, including the outlier weakens the correlation.</p><p>Bottom line: </p><ul><li><p>Always look at a scatterplot of your x and y data!</p></li></ul><ul><li><p>Set your criteria for outlier removal BEFORE plotting or analyzing the data. Do not tempt yourself into selective data trimming!</p></li></ul><p></p>

New cards

The Pearson Product-Moment Correlation (r) - Conceptual Understanding

r is an indicator of location in X distribution (relative to mean of X)

and location in Y distribution (relative to mean of Y)

Regression: line must cross through the mean of X and the mean of Y

We can use z scores to precisely describe the location of an individual within a distribution.

So, for a positive correlation: positive Z scores on X correspond to positive Z scores on Y;

negative Z scores on X correspond to negative Z scores on Y.

For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.

New cards

Statistics vs. Parameters

For a sample: correlation = r
For a population: correlation = p (rho)

New cards

What are the null & alternative hypotheses for a test of correlation?

H0: p = 0 (there is NO linear relationship in the population) H1: p ≠ 0 (there IS a linear relationship in the population)

New cards

Flashback: How did we construct the sampling distribution for the paired sample t test?

Pull all possible random samples of size N from a single population (where a sample is a set of difference scores), calculate the means, and plot the means in a histogram.

New cards

How do we construct the sampling distribution of r?

Pull all possible random samples of size N (where a sample is a set of X,Y scores) from a single population, calculate the r values, and plot the r values in a histogram.

New cards

3 Qs for every test:

Is there a relationship? (i.e., is the correlation significant?)
If so, what is the nature of that relationship?
- Is the correlation positive or negative? (r = .72)
- Describe in words using the context of the example:
- Disgust and conservatism were positively correlated; in other words, stronger feelings of disgust were associated with greater political conservatism
- If so, what is the strength of that relationship?

New cards

Strength of the Relationship Between X and Y

We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.

the same r can be big in one context and small in another, depending on N
r isn’t linear (.8 isn’t twice as big as .4)
Strength = Coefficient of determination = r2
For our example, r2 = (.721)² = .52 (no leading 0)
- shared variance; 52% of the variability (SS) in political conservatism can be explained/predicted by disgust sensitivity (like h2!)
- Because it is expressed as a proportion of variability, r² can be compared across studies.

New cards

Coefficient of alienation

1 - r²; proportion of variability in Y NOT explained by X; i.e., error

New cards

Assumptions of the Pearson Correlation

Sample is independently and randomly selected from the population
Bivariate normality:
- The distribution of Y scores at any value of X is normal in the population
- Robust to violations of this assumption if N > 15

New cards

Homoscedasticity

Y scores have equal variances at each value of X in the population
Not robust to violations of this assumption

New cards

Visualization of Homoscendasticity

1st— data are homoscendastic: they have a similar spread in Y across the range of X values

2nd— These data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.

New cards

What you need to conduct a regression analysis

You need the disgust scores (X = predictor variable)
X and Y need to be significantly correlated (r = sig).
You need the formula for the regression line
You need to do some simple math

THM: Regression allows you to be a bit more accurate than just relying on the mean of Y to predict the individual scores of Y.

New cards

Least-squares line

If you know someone’s disgust sensitivity score, and you have a least-squares regression line, you can predict any y value based on a given x value

Why is it called this?

Because it minimizes the squared vertical distances of the points to the line. — Should do a better job of predicting Y than just using the mean of Y

Regression involves finding the formula for the line that best predicts Y from X

New cards

Regression line vs. basic formula

Y = a + bX (where a = y intercept, b = slope)

Ŷ = a+bX

The equation yields a predicted Y value for every X value rather than an actual Y value.

The line is a prediction, so the points don’t all fall exactly on the line; there will be error associated with the line.

New cards

b vs. r

b→ Describes the relation between X & Y in X & Y units

r→ Describes the relation between X & Y in standardized units (Z scores). An r of .72 means that a 1 SD change in X is associated with, on average, a .72 SD change in Y.

If you transform X & Y to Z scores before correlating them, r = b.

New cards

Formula for the y-intercept

We know that the regression line must go through the mean of the X ratings (IV; = 2.083) and the mean of Y ratings (DV; = 4.500)

So we can plug in the means for X and Y into our formula to calculate the Y intercept:

New cards

Plotting the least-squares line

Step 1. Select the lowest (0) & the highest (4) values of X.

Step 2. Plug in 0 for X to get the first Y value.

Step 3. Plug in 4 for X to get the second Y value.

Other than to calculate the Y intercept, do not extend your line past the values of X! Can’t extrapolate!

New cards

Accuracy of the Regression Line

What is the relationship between r and the accuracy of the regression equation?

The bigger the r, the closer the points are to the line and the more accurate the regression equation

How do we evaluate how good our line is?

Compute the standard error of estimate, which tells us, on average, how far the points fall (vertically) from the regression line