Correlation/ Regression

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/52

encourage image

There's no tags or description

Looks like no tags are added yet.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

53 Terms

1
New cards

Putting Correlation / Regression into Context

DV = interval/ratio

<p>DV = interval/ratio</p>
2
New cards

Three Characteristics of a Correlation (r) Between X & Y

Direction: positive, negative, no correlation

Degree / magnitude: ranges from -1 to +1; extreme scores unlikely

  • Possible ranges of r values [-1 to +1] perfect negative correlation to perfect positive correlation… correlations closer to -1 or +1 = points are closer to the regression line… unlikely to get a correlation of ±1 because… measurement error, Y is predicted by more than just 1 X variable

Form: correlation useful only for linear relationships

<p><strong>Direction</strong>: positive, negative, no correlation</p><p><strong>Degree / magnitude</strong>: ranges from -1 to +1; extreme scores unlikely</p><ul><li><p>Possible ranges of r values [-1 to +1] perfect negative correlation to perfect positive correlation… correlations closer to -1 or +1 = <u>points are closer to the regression line</u>… unlikely to get a correlation of <span>±1 because… <em>measurement error, Y is predicted by more than just 1 X variable </em></span><em> </em></p></li></ul><p><strong>Form</strong>: correlation useful only for <em>linear</em> relationships </p><p></p>
3
New cards

Important note

Before conducting statistical analyses (e.g., calculating r), always look at your x and y data by producing a scatterplot!

4
New cards

Problems to Watch Out For: Restricted Range

  • Occurs when experimenters look at only one portion of the data (not good)

  • Can lead to underestimation of a relationship between X and Y

  • Can lead to overestimation of a relationship between X and Y

5
New cards

Restricted Range: Case 1

  • Underestimation of association

    • Strong linear relationship when full range of x values is used, but no relationship when restricted range used

<ul><li><p>Underestimation of association</p><ul><li><p>Strong linear relationship when full range of x values is used, but no relationship when restricted range used</p></li></ul></li></ul><p></p>
6
New cards

Restricted Range: Case 2

  • Underestimation of association

    • Strong linear relationship when full range of x values is used, but no relationship when restricted range used

<ul><li><p>Underestimation of association</p><ul><li><p>Strong linear relationship when full range of x values is used, but no relationship when restricted range used</p></li></ul></li></ul><p></p>
7
New cards

Restricted Range: Case 3

  • Overestimation of association

    • No relationship when full range of x values is used, but strong relationship when restricted range used

<ul><li><p>Overestimation of association</p><ul><li><p>No relationship when full range of x values is used, but strong relationship when restricted range used </p></li></ul></li></ul><p></p>
8
New cards

Non-linear relationships: Example 1

Curvilinear relationship:

  • If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.

  • To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X

  • SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.

<p>Curvilinear relationship: </p><ul><li><p>If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.</p></li><li><p>To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X</p></li><li><p>SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.</p></li></ul><p></p>
9
New cards

Non-linear relationships: Example 2

  • These data are not linear (the curve levels off – is asymptotic)

  • Here, the change in the y-values become increasingly small as the x-values increase.

<ul><li><p>These data are not linear (the curve levels off – is asymptotic)</p></li><li><p>Here, the change in the y-values become increasingly small as the x-values increase.</p></li></ul><p></p>
10
New cards

Problems to watch out for: Outliers

  • Scores that are outside the range for X, Y, or both

  • Can cause either over or underestimation of the correlation

11
New cards

Outliers: Overestimation of association

Including the outlier (inaccurately) strengthens the correlation.

<p>Including the outlier (inaccurately) strengthens the correlation.</p><p></p>
12
New cards

Outliers: Underestimation of association

In this case, including the outlier weakens the correlation

Bottom line: set criteria for outlier removal BEFORE plotting/ analyzing the data; do not tempt yourself into selective data trimming!

<p>In this case, including the outlier <em>weakens </em>the correlation</p><p>Bottom line: set criteria for outlier removal BEFORE plotting/ analyzing the data; do not tempt yourself into selective data trimming!</p>
13
New cards

The Pearson Product-Moment Correlation (r) - Conceptual Understanding

r is an indicator of location in X distribution (relative to mean of X)

& location in Y distribution (relative to mean of Y)

Regression line must cross through the mean of & the mean of Y

We can use z scores to precisely describe the location of an individual within a distribution.

  • For a positive correlation: positive Z scores on X correspond to positive Z scores on Y; negative Z scores on X correspond to negative Z scores on Y.

  • For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.

  • We cannot just use ZXZY as a measure of r because of the same drawback we ran into with the SS…

  • It keeps getting larger and larger as the number of scores increases. But.. Use same solution (divide by N-1)

<p>r is an indicator of location in X distribution (relative to mean of X)</p><p>&amp; location in Y distribution (relative to mean of Y)</p><p>Regression line must cross through the mean of &amp; the mean of Y</p><p>We can use z scores to precisely describe the location of an individual within a distribution.</p><ul><li><p>For a positive correlation: positive Z scores on X correspond to positive Z scores on Y; negative Z scores on X correspond to negative Z scores on Y.</p></li><li><p>For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.</p></li><li><p>We cannot just use <strong>∑</strong>ZXZY as a measure of r because of the same drawback we ran into with the SS…</p></li><li><p>It keeps getting larger and larger as the number of scores increases. But.. Use same solution (divide by N-1)</p></li></ul><p></p>
14
New cards

Definitional formula for r

SSxy = SP = Sum of Products: How much X and Y covary

Remember: SS = how much a set of scores varies from its mean.

So we will use… SSX for variable X SSY for variable Y.

r assesses how the covariation of X and Y relates to the variation (variability) of X and Y separately

Unlike sum of squares, covariance CAN be negative! (i.e., as X increases, Y decreases).

When the covariation (numerator) is big, r is big.

15
New cards

Understanding r / r vs. slope

Value of r = .72→ A 1 unit increase in X is associated with, on average, a .72 unit increase in Y.

r and the slope of a line (b) are similar, but not the same.

r is in standard deviation units, whereas the slope of a line (b) is the rise over run (e.g., the change in X over the change in Y), and is in scale units.

16
New cards

Inferential Aspects of Correlation (i.e., “Is r significant?”)

BUT… we can get a big r in our sample even when there’s no relationship between X and Y in the population… BECAUSE… a large r could occur as a result of sampling error alone. So we need to check whether or not our r is statistically significant.

Statistics vs. Parameters

For a sample: correlation = r

For a population: correlation = ρ (rho)

17
New cards

Steps in Hypothesis Testing; Null & Alternative Hypotheses

H0: ρ = 0 (there is NO linear relationship in the population) H1: ρ ≠ 0 (there IS a linear relationship in the population)

18
New cards

Steps in Hypothesis Testing: Constructing a Sampling Distribution

Pull all possible random samples of size N (where a sample is a set of X,Y scores) from a single population, calculate the r values, and plot the r values in a histogram.

19
New cards

What are df?

What are df for a correlation?

We lose a df for each mean we estimate (i.e., one for X and one for Y)

So, df = N – 2

20
New cards

Our 3 Qs:

Is there a relationship?

  • Is the correlation significant? (no leading 0s!)

If so, what is the nature of that relationship?

  • Is the correlation positive or negative?

  • Describe in words using the context of the example, i.e. Disgust and conservatism were positively correlated; in other words, stronger feelings of disgust were associated with greater political conservatism

If so, what is the strength of that relationship?

21
New cards

Strength of the Relationship between X & Y

We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.

  • the same r can be big in one context and small in another, depending on N

  • r isn’t linear (.8 isn’t twice as big as .4)

  • Strength = Coefficient of determination = r²

  • For our example, r² = (.721)² = .52

Because it is expressed as a proportion of variability, r² can be compared across studies

<p>We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.</p><ul><li><p>the same r can be big in one context and small in another, depending on N</p></li><li><p>r isn’t linear (.8 isn’t twice as big as .4)</p></li><li><p>Strength = Coefficient of determination = r²</p></li><li><p>For our example, r² = (.721)² = .52</p></li></ul><p>Because it is expressed as a proportion of variability, r² can be compared across studies</p><p></p>
22
New cards

Coefficient of alienation

1 - r²; 1 - .52 = .48

proportion of variability in Y NOT explained by X; i.e., error

23
New cards

Assumptions of the Pearson Correlation

  • Sample is independently and randomly selected from the population

  • Bivariate normality:

    • The distribution of Y scores at any value of X is normal in the population

    • Robust to violations of this assumption if N > 15

  • Homoscedasticity:

    • Y scores have equal variances at each value of X in the population

    • Not robust to violations of this assumption

Data H→ They have a similar spread in Y across the range of X values.

No→ data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.

<ul><li><p>Sample is independently and randomly selected from the population</p></li><li><p>Bivariate normality:</p><ul><li><p>The distribution of Y scores at any value of X is normal in the population</p></li><li><p>Robust to violations of this assumption if N &gt; 15</p></li></ul></li><li><p>Homoscedasticity:</p><ul><li><p>Y scores have equal variances at each value of X in the population</p></li><li><p>Not robust to violations of this assumption</p></li></ul></li></ul><p>Data H→ They have a similar spread in Y across the range of X values.</p><p>No→ data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.</p><p></p>
24
New cards

Bivariate Normality

Schematic of a bivariate normal dataset→ Picture this as a 3D graph, with the distributions coming out toward you.

An actual bivariate normal data set

Most of the scores fall close to the line of best fit.

<p>Schematic of a bivariate normal dataset→ Picture this as a 3D graph, with the distributions coming out toward you.</p><p>An actual bivariate normal data set</p><p>Most of the scores fall close to the line of best fit.</p><p></p>
25
New cards

Visualizing Correlations

Notice that as r approaches 1, the data are closer and closer to the regression line.

Also consider: When r = -1 there is a perfect negative correlation between x and y

and: when r = 0 there is no correlation between x and y

…and here is what a scatterplot with a negative r looks like.

26
New cards

Interpreting r &

r = .3, r² = .09— We can explain about 9% of the variability in depression with loneliness

27
New cards

Problems to Watch Out For: Restricted range

  • Occurs when experimenters look at only one portion of the data (not good)

  • Can lead to underestimation of a relationship between X and Y

  • Can lead to overestimation of a relationship between X and Y

28
New cards

Restricted range: Case 1: Underestimation of Association

Strong linear relationship when full range of x values is used, but no relationship when restricted range used

<p>Strong linear relationship when full range of x values is used, but no relationship when restricted range used</p>
29
New cards

Restricted range: Case 2: Underestimation of Association

Strong linear relationship when full range of x values is used, but no relationship when restricted range used

<p>Strong linear relationship when full range of x values is used, but no relationship when restricted range used</p>
30
New cards

Restricted range: Case 3: Overestimation of Association

No relationship when full range of x values is used, but strong relationship when restricted range used

<p>No relationship when full range of x values is used, but strong relationship when restricted range used</p>
31
New cards

Non-linear relationships: Example 1: curvilinear relationship

Point 1: If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.

To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X.

SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.

Point 2: These x and y data are not linear. Don’t conduct a correlation analysis on these data!

<p>Point 1: If we looked only at low to moderate levels of anxiety, we would come to an erroneous conclusion about the relationship between anxiety and performance.</p><p>To be safe, use a wide range of X and Y values, and do not make conclusions based on extrapolation beyond your observed values of X.</p><p>SIDE NOTE: I will ask you to extrapolate when you calculate the regression line, though.</p><p>Point 2: These x and y data are not linear. Don’t conduct a correlation analysis on these data!</p>
32
New cards

Non-linear relationships: Example 2

These data are not linear (the curve levels off – is asymptotic)

Here, the change in the y-values become increasingly small as the x-values increase.

Don’t conduct a correlation analysis on these data!

<p>These data are not linear (the curve levels off – is asymptotic)</p><p>Here, the change in the y-values become increasingly small as the x-values increase.</p><p>Don’t conduct a correlation analysis on these data!</p>
33
New cards

Problems to watch out for continued: Outliers

Scores that are outside the range for X, Y, or both

Can cause either over or underestimation of the correlation

34
New cards

Outliers: Overestimation of association

Including the outlier (inaccurately) strengthens the correlation.

<p>Including the outlier (inaccurately) strengthens the correlation.</p>
35
New cards

Outliers: Underestimation of association

In this case, including the outlier weakens the correlation.

Bottom line:

  • Always look at a scatterplot of your x and y data!

  • Set your criteria for outlier removal BEFORE plotting or analyzing the data. Do not tempt yourself into selective data trimming!

<p>In this case, including the outlier weakens the correlation.</p><p>Bottom line: </p><ul><li><p>Always look at a scatterplot of your x and y data!</p></li></ul><ul><li><p>Set your criteria for outlier removal BEFORE plotting or analyzing the data. Do not tempt yourself into selective data trimming!</p></li></ul><p></p>
36
New cards

The Pearson Product-Moment Correlation (r) - Conceptual Understanding

r is an indicator of location in X distribution (relative to mean of X)

and location in Y distribution (relative to mean of Y)

Regression: line must cross through the mean of X and the mean of Y

We can use z scores to precisely describe the location of an individual within a distribution.

So, for a positive correlation: positive Z scores on X correspond to positive Z scores on Y;

negative Z scores on X correspond to negative Z scores on Y.

For a negative correlation, positive Z scores on X correspond to negative Z scores on Y and vice-versa.

37
New cards

Statistics vs. Parameters

  • For a sample: correlation = r

  • For a population: correlation = p (rho)

38
New cards

What are the null & alternative hypotheses for a test of correlation?

H0: p = 0 (there is NO linear relationship in the population) H1: p ≠ 0 (there IS a linear relationship in the population)

39
New cards

Flashback: How did we construct the sampling distribution for the paired sample t test?

Pull all possible random samples of size N from a single population (where a sample is a set of difference scores), calculate the means, and plot the means in a histogram.

40
New cards

How do we construct the sampling distribution of r?

Pull all possible random samples of size N (where a sample is a set of X,Y scores) from a single population, calculate the r values, and plot the r values in a histogram.

41
New cards

3 Qs for every test:

  • Is there a relationship? (i.e., is the correlation significant?)

  • If so, what is the nature of that relationship?

    • Is the correlation positive or negative? (r = .72)

    • Describe in words using the context of the example:

    • Disgust and conservatism were positively correlated; in other words, stronger feelings of disgust were associated with greater political conservatism

    • If so, what is the strength of that relationship?

42
New cards

Strength of the Relationship Between X and Y

We can’t just use the magnitude of the correlation coefficient (r) as a measure of strength.

  • the same r can be big in one context and small in another, depending on N

  • r isn’t linear (.8 isn’t twice as big as .4)

  • Strength = Coefficient of determination = r2

  • For our example, r2 = (.721)² = .52 (no leading 0)

    • shared variance; 52% of the variability (SS) in political conservatism can be explained/predicted by disgust sensitivity (like h2!)

    • Because it is expressed as a proportion of variability, r² can be compared across studies.

43
New cards

Coefficient of alienation

1 - r²; proportion of variability in Y NOT explained by X; i.e., error

44
New cards

Assumptions of the Pearson Correlation

  • Sample is independently and randomly selected from the population

  • Bivariate normality:

    • The distribution of Y scores at any value of X is normal in the population

    • Robust to violations of this assumption if N > 15

45
New cards

Homoscedasticity

  • Y scores have equal variances at each value of X in the population

  • Not robust to violations of this assumption

46
New cards

Visualization of Homoscendasticity

1st— data are homoscendastic: they have a similar spread in Y across the range of X values

2nd— These data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.

<p>1st— data are homoscendastic: they have a similar spread in Y across the range of X values</p><p>2nd— These data violate the homoscedasticity assumption: They have a greater spread in Y at higher levels of X.</p>
47
New cards

What you need to conduct a regression analysis

  1. You need the disgust scores (X = predictor variable)

  2. X and Y need to be significantly correlated (r = sig).

  3. You need the formula for the regression line

  4. You need to do some simple math

THM: Regression allows you to be a bit more accurate than just relying on the mean of Y to predict the individual scores of Y.

48
New cards

Least-squares line

If you know someone’s disgust sensitivity score, and you have a least-squares regression line, you can predict any y value based on a given x value

Why is it called this?

Because it minimizes the squared vertical distances of the points to the line. — Should do a better job of predicting Y than just using the mean of Y

Regression involves finding the formula for the line that best predicts Y from X

49
New cards

Regression line vs. basic formula

Y = a + bX (where a = y intercept, b = slope)

vs

Ŷ = a+bX

The equation yields a predicted Y value for every X value rather than an actual Y value.

The line is a prediction, so the points don’t all fall exactly on the line; there will be error associated with the line.

50
New cards

b vs. r

b→ Describes the relation between X & Y in X & Y units

r→ Describes the relation between X & Y in standardized units (Z scores). An r of .72 means that a 1 SD change in X is associated with, on average, a .72 SD change in Y.

If you transform X & Y to Z scores before correlating them, r = b.

51
New cards

Formula for the y-intercept

We know that the regression line must go through the mean of the X ratings (IV; = 2.083) and the mean of Y ratings (DV; = 4.500)

So we can plug in the means for X and Y into our formula to calculate the Y intercept:

52
New cards

Plotting the least-squares line

Step 1. Select the lowest (0) & the highest (4) values of X.

Step 2. Plug in 0 for X to get the first Y value.

Step 3. Plug in 4 for X to get the second Y value.

Other than to calculate the Y intercept, do not extend your line past the values of X! Can’t extrapolate!

53
New cards

Accuracy of the Regression Line

What is the relationship between r and the accuracy of the regression equation?

  • The bigger the r, the closer the points are to the line and the more accurate the regression equation

How do we evaluate how good our line is?

  • Compute the standard error of estimate, which tells us, on average, how far the points fall (vertically) from the regression line