GW Blok 6 Seminar 1.2 Descriptive statistics for two continuous variables


23 Terms

New cards

What is the Pearson correlation and how do you calculate it?

The Pearson correlation coefficient (often denoted as r) measures the strength and direction of the linear relationship between two continuous variables.
r depends on the average distance of the observations from some straight line

Value range: -1 to +1
- +1: Perfect positive linear correlation (as one variable increases, the other increases)
- -1: Perfect negative linear correlation (as one variable increases, the other decreases)
- 0: No linear correlation
Pearson correlation measures linear relationships only
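The card asks how r is calculated but gives no formula. A minimal sketch, assuming the usual definition r = SPxy / sqrt(SSxx · SSyy); the function name and the age/length numbers are made up for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations and sums of squared deviations
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data: age in years and length in cm
ages = [1, 2, 3, 4, 5]
lengths = [75, 86, 95, 102, 109]

r = pearson_r(ages, lengths)
print(round(r, 3))  # close to +1: strong positive linear relationship
```

Points that lie exactly on a rising line give r = 1, and points exactly on a falling line give r = -1.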


New cards

When is it appropriate to calculate Pearson's correlation?

Both variables are continuous and measured on an interval or ratio scale.
(e.g., height, weight, temperature, test scores)
The relationship between the two variables is linear.
Pearson’s correlation measures linear associations — if the relationship is curved or non-linear, Pearson’s r might be misleading.
Data is approximately normally distributed (especially important for small samples).
Normality helps with hypothesis testing related to Pearson’s r, though for large samples this assumption is less strict.
No extreme outliers in either variable.
Outliers can heavily distort the correlation coefficient.
Figure: Pearson's correlation may be computed for plots 1 and 3; plot 2 shows no linear relationship, and plot 4 has no spread along the x-axis.


New cards

How do you interpret a positive and negative slope in a scatterplot in terms of correlation?

A positive slope indicates a positive correlation; as X increases, Y also increases.
A negative slope indicates a negative correlation; as X increases, Y decreases

New cards

Interpret the following graphs

Perfect positive relationship (Figure 1.7 a)

Large X-values are associated with large Y-values.
The relationship is perfect because if you know the value of X (or Y), then the value of Y (or X) is fully determined. All observations are on a straight line.
r = 1

Perfect negative relationship (Figure 1.7 b)

Large X-values are associated with small Y-values.
This relationship is also perfect: Y (or X) is fully determined by X (or Y).
In that case the correlation takes its most extreme negative value, i.e. r = -1

Non-linear relationship (Figure 1.7 c)

In Figure 1.7 c there is a bell-shaped relationship between X and Y, yet the Pearson correlation is equal to 0.
A correlation of 0 does not imply that there is no association; it only reflects the absence of a linear association. Here the association is quadratic, which is not a topic of this course.
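A small sketch of why r = 0 does not mean "no association". The data are invented for illustration: Y is completely determined by X through an inverted parabola, yet the linear correlation is zero because the rising and falling halves cancel out.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from sums of squares and products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data: x symmetric around 0, y = 9 - x^2 (inverted parabola)
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9 - xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)  # zero linear association despite a perfect quadratic one
```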

No association at all (Figure 1.7 d)

Figure 1.7 d shows a scatter plot where there is no association at all.
The way the Y-values vary around the horizontal line seems to be unaffected by the value of X, i.e. the amount of variation of the Y-values around the horizontal line is the same regardless of the X-values.

New cards

How is the "quality of fit" of the regression line related to correlation strength?

The better the fit (less scatter around the line), the stronger the correlation.

What?
- Coefficient of determination (R²): 0 ≤ R² ≤ 1
Why?
- How good is the fitted model? How precise is the predicted value?
How?
- R² = SSR / SST, i.e., the share of the total variability in the y values (SST) that is in the explained part (SSR)
Interpretation: it indicates the percentage of the variability of y explained by the variable x


New cards

When is it appropriate to use simple linear regression?

Y variable (dependent variable): quantitative (continuous)
X variable (independent variable): usually numeric/continuous, but can be categorical
Goal: determine the (approximate) average Y value for a given X value (i.e., conditional on X).
X: independent variable/cause
Y: dependent variable/effect

New cards

What is the difference between correlation and regression?

Correlation coefficient (r):
- Measures the strength and direction of a linear association between X and Y.
- Symmetric: correlation of X with Y equals correlation of Y with X.
- Does not imply causation or prediction.
Regression:
- Asymmetric: regression of Y on X is different from regression of X on Y.
- Focuses on predicting mean Y given X or estimating how Y changes as X changes.
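The symmetry of r and the asymmetry of regression can be checked numerically. A sketch with invented data and hypothetical helper names: swapping X and Y leaves r unchanged but gives a different slope, and the product of the two slopes equals r².

```python
import math

def deviations_summary(x, y):
    """Return SPxy, SSxx and SSyy for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return sp_xy, ss_xx, ss_yy

def pearson_r(x, y):
    sp_xy, ss_xx, ss_yy = deviations_summary(x, y)
    return sp_xy / math.sqrt(ss_xx * ss_yy)

def slope(x, y):
    """Least-squares slope when regressing y on x."""
    sp_xy, ss_xx, _ = deviations_summary(x, y)
    return sp_xy / ss_xx

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

r_xy = pearson_r(x, y)        # correlation of X with Y
r_yx = pearson_r(y, x)        # correlation of Y with X: same value
slope_y_on_x = slope(x, y)    # regression of Y on X
slope_x_on_y = slope(y, x)    # regression of X on Y: different slope
print(r_xy, r_yx, slope_y_on_x, slope_x_on_y)
```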

New cards

What is the simple linear regression model?

The general idea of performing a linear regression analysis in our example is to summarize the scatter plot by means of a straight line that can be interpreted as (approximate) average values of length for different values of Age.
Goals:
- Effect Size Model: Estimate and interpret effect of X on Y
- Prediction Model: Predict Y value given X value


New cards

How can you calculate B1 and B0?
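This card has no written answer. As a hedged reconstruction, the standard least-squares estimates can be written with the SSxx and SPxy notation from the least-squares card, with b_1 and b_0 standing for B1 (slope) and B0 (intercept); the course's own notation may differ:

```latex
b_1 = \frac{SP_{xy}}{SS_{xx}}
    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The second formula is just the statement that the regression line passes through the point of means.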

New cards

What is a regression line?

The best-fitting line minimizes the sum of squared residuals (Method of Least Squares).
The regression line passes through the point (X̄, Ȳ) (the means of X and Y).
The points on the regression line are only approximately equal to the average Y values; they can be assumed to be equal only for very large sample sizes (almost equal to the population size).
For a finite sample size (which will be the case in practice), these points are called predicted Y-values and denoted as Y-hat (Ŷ).


New cards

What are residuals?

The (vertical) deviations of the observations from the line are called the residuals. They reflect how well the straight regression line summarizes the scatter plot.
Residuals = observed Y - predicted Y
They represent unexplained variability due to biological variation, measurement error, or other factors.
Residuals can be positive (above line) or negative (below line).
Error: theoretical error
Residual: observed error for a set of data
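A minimal sketch of "residual = observed Y - predicted Y". The data and the line Ŷ = 0.5 + 2.2·X are invented for illustration, and the line is taken as already fitted:

```python
def residuals(x, y, b0, b1):
    """Observed Y minus predicted Y for the line y_hat = b0 + b1 * x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data and a line assumed to be already fitted
x = [1, 2, 3, 4]
y = [3, 4, 8, 9]
b0, b1 = 0.5, 2.2

res = residuals(x, y, b0, b1)
for xi, yi, e in zip(x, y, res):
    side = "above" if e > 0 else "below"
    print(f"x={xi}: observed={yi}, predicted={b0 + b1 * xi:.1f}, "
          f"residual={e:+.1f} ({side} the line)")
```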


New cards

What does the correlation coefficient describe?

It describes only the direction and strength of the association between two variables, not the effect of X on Y or the ability to estimate Y from X.

New cards

What is the "method of least squares" in regression?

It is a method that estimates the best-fitting line by minimizing the sum of squared errors (residuals).
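SSxx: sum of squares for X, SSxx = Σ(x - x̄)²
- s² = SSxx / (n - 1) is the variance of X
SSyy: sum of squares for Y, SSyy = Σ(y - ȳ)²
SPxy: sum of products, SPxy = Σ(x - x̄)(y - ȳ)
- cov = SPxy / (n - 1) is the covariance of X and Y
The residuals sum to 0
The line passes through (X̄, Ȳ)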
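The two properties listed above can be checked on a small example. This is a sketch with invented data and a hypothetical `fit_line` helper, using slope = SPxy / SSxx and intercept = Ȳ - slope · X̄:

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y_hat = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sp_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 4, 8]

b0, b1 = fit_line(x, y)
res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
print(b0, b1)                   # intercept and slope
print(sum(res))                 # residuals sum to 0
print(b0 + b1 * mean_x, mean_y) # line passes through the point of means
```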


New cards

How can you visually inspect a scatter plot?

Think about:
- Direction
- Functional form
- Strength
- Unusual features
WHY do we need to investigate these four elements?
- Because they justify whether Pearson correlation or linear regression model is appropriate!!!

New cards

Direction of a scatter plot

Positive association: the values of y tend to increase as the values of x increase
- e.g.: Earnings increase with age
Negative association: the values of y tend to decrease as the values of x increase
- e.g.: Max legibility distance of highway signs decreases with driver age


New cards

Functional form of a scatter plot

Non-linear vs linear

New cards

Strength of a scatter plot

How strong is the association?
It ranges from weak to strong; the more closely the points follow a straight line, the stronger the association.


New cards

Unusual features of a scatter plot

New cards

Example inspection scatter plot

New cards

What is R squared (R²) and how do you calculate it?

R² tells us how well the regression line fits the data.
It quantifies how much of the variation in the dependent variable (y) is explained by the independent variable (x) using the regression model.
- 0 ≤ R² ≤ 1
R² = 1 → perfect prediction (all actual values lie on the regression line)
R² = 0 → regression line explains none of the variability in y
SSR + SSE = SST
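The decomposition can be verified numerically. A sketch with invented data, assuming the usual definitions SST = Σ(y - ȳ)², SSR = Σ(ŷ - ȳ)², SSE = Σ(y - ŷ)² and R² = SSR / SST:

```python
# Hypothetical data
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 4, 8]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares line y_hat = b0 + b1 * x
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sp_xy / ss_xx
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)               # total variability
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained part
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained part

r_squared = ssr / sst
print(sst, ssr, sse)         # SSR + SSE equals SST
print(round(r_squared, 3))   # share of the variability in y explained by x
```

In simple linear regression this R² is also the square of the Pearson correlation r.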


New cards

NOT A TOPIC OF THIS COURSE: What is the standard error of the estimate (SEE) and how do you calculate it?

Measures the average distance between the actual y-values and the predicted y-values.
It's like a standard deviation of the residuals.
SEE gives us an absolute measure of prediction error (in units of y).
Lower SEE means the model predicts more accurately.
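The card does not give the formula. A sketch assuming the common definition for simple linear regression, SEE = sqrt(SSE / (n - 2)), with invented data; the n - 2 reflects the two estimated coefficients:

```python
import math

# Hypothetical data
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 4, 8]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares line y_hat = b0 + b1 * x
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Sum of squared residuals, then SEE in the units of y
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
see = math.sqrt(sse / (n - 2))
print(round(see, 3))
```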


New cards

NOT A TOPIC OF THIS COURSE: What is the difference between R² and SEE?

New cards

Sum of squares error