Flashcards covering key vocabulary and concepts from the lecture notes on Correlation and Regression.
Unit 2 Focus
Correlation and Regression
Unit 2 Summary Topics
Association, explanatory variable, response variable, scatterplots, correlation, least squares criterion and least squares regression line, prediction, slope, intercept, r^2, residuals, outliers, influential observations, association vs. causation, lurking variables, and extrapolation.
Unit 1 Statistics
Comparing two or more populations with respect to the same variable.
Unit 2 New question:
Examine relationships between two or more variables with respect to the same population?
Examine Relationships Focus
The nature of the relationship between the variables.
Examine Relationships Focus
One of the variables might be thought to explain/predict the other one.
Explanatory Variable
Variable thought to explain/predict the other one, denoted by X, values represented by x.
Response Variable
The variable that is being explained/predicted, denoted by Y, values represented by y.
Explanatory Variable Role
To explain or predict the response variable.
Explanatory Variable Example
Number of cups of coffee per day
Response Variable Example
Number of hours of sleep
Explanatory/Response Variable Example
Percentage grades in English and Math courses: neither variable is treated as explanatory, because we are only interested in the nature of the relationship.
Scatterplots Purpose
To visualize/display the relationship between two quantitative variables.
Scatterplots Definition
Displays the values of two different quantitative variables measured on the same individuals, on a Cartesian plane.
Scatterplots Variable Positioning
Explanatory variable on the x-axis, response variable on the y-axis. If there isn't an explanatory/response variable, the choice of axes is arbitrary.
Scatterplots Exam Score Example Question
What do you notice? (re: relationship between classes missed and exam score)
Scatterplots Examination
Form, direction, strength, outliers.
Linear Relationship
A straight line would do a fairly good job at approximating the relationship between the two variables.
Non-Linear Relationships
Quadratic, logarithmic, exponential, etc.
Negative Association
The pattern of points slopes downwards from left to right.
Positive Association
The pattern of points slopes upwards from left to right.
Strength of Relationship
Determined by how close the points lie to a simple form, such as a straight line.
Strong Relationship
Points fall quite close to the line.
Weak Relationship
Points appear to be randomly scattered and many fall far from the approximating line.
Outliers for Bivariate Data
An observation may be outlying in the x-direction, the y-direction, or both. An outlier could also simply fall outside the general pattern of points.
Linear Relationship Strength Assessment
Better to use a numerical measure, called correlation.
Correlation Coefficient
Denoted by r, measures both the direction and strength of a linear relationship between two quantitative variables.
Correlation Coefficient Formula
r = (1 / ((n - 1) sx sy)) * Σ((xi - x̄)(yi - ȳ))
Correlation Calculation Steps
Calculate x̄, ȳ, sx and sy; calculate deviations xi - x̄ and yi - ȳ; multiply corresponding deviations; add the n products; divide by (n - 1)sx sy.
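The steps on this card can be sketched in Python. This is only an illustration on a small made-up data set: the lists `xs` and `ys` and the helper name `correlation` are assumptions, not part of the lecture notes. It uses the standard library only.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r, computed step by step as on the card."""
    n = len(xs)
    # Step 1: means and sample standard deviations (n - 1 divisor).
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Steps 2-3: deviations from the means, multiplied pairwise.
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    # Steps 4-5: add the n products and divide by (n - 1) * sx * sy.
    return sum(products) / ((n - 1) * s_x * s_y)

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 4, 5, 4, 5]  # made-up response values
r = correlation(xs, ys)
print(round(r, 4))  # 0.7746
```

For these values the deviation products sum to 6, and (n - 1) sx sy is about 7.746, which gives a moderately strong positive r.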
Correlation Properties: Positive values
Indicate a positive association.
Correlation Properties: Negative values
Indicate a negative association.
Correlation Properties
r is always a number between -1 and 1 (inclusive).
Correlation Properties: r near 1
Indicates a strong positive linear association.
Correlation Properties: Positive r near 0
Indicates a weak positive linear association.
Correlation Properties: r near -1
Indicates a strong negative linear association.
Correlation Properties: Negative r near 0
Indicates a weak negative linear association.
Correlation Properties: r = 1
Perfect positive linear relationship.
Correlation Properties: r = -1
Perfect negative linear relationship.
Correlation Properties: r = 0
No linear association.
Correlation Properties: Units
r has no units; it's just a number.
Correlation Properties: Distinction
Makes no distinction between X and Y.
Correlation Properties: Units Change
Changing the units of X and Y does not affect the correlation.
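A quick check of the last two properties, as a sketch on made-up data. The heights and weights below are hypothetical, and the helper `correlation` is an illustrative name. The conversion factors are the usual inch-to-centimetre and pound-to-kilogram factors, rounded.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((len(xs) - 1) * stdev(xs) * stdev(ys))

heights_in = [60, 62, 65, 68, 71]       # hypothetical heights in inches
weights_lb = [110, 120, 140, 155, 170]  # hypothetical weights in pounds

# Same individuals, different units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)  # units changed
r_swapped = correlation(weights_lb, heights_in)    # roles of X and Y swapped

print(r_original, r_converted, r_swapped)
```

All three values agree up to floating-point rounding. Rescaling by a positive factor cancels between the numerator and the standard deviations, and the formula is symmetric in x and y.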
Correlation Limitations
Measures only the strength of a linear relationship. It is not a useful summary when the relationship has another form (for example, quadratic).
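A small constructed example of this limitation. The data are made up so that y is completely determined by x through y = x^2, yet the correlation comes out as zero.

```python
from statistics import mean, stdev

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # perfect quadratic relationship, not linear

x_bar, y_bar = mean(xs), mean(ys)
cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))

print(r)  # 0.0
```

The positive and negative deviation products cancel exactly, so r = 0 even though the relationship is perfectly predictable. A scatterplot would reveal the pattern that r misses.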
Correlation Caveats
Correlation does not imply causation!
Lurking Variable
A variable that helps explain the relationship between variables in a study but is not included in the study itself.
Correlation Affected by Outliers
Yes, because r depends on the sample means and standard deviations, which are themselves sensitive to outliers.
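To see how sensitive r can be, here is a sketch that adds one outlying point to a small made-up data set. The data and the helper name `correlation` are assumptions for illustration only.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((len(xs) - 1) * stdev(xs) * stdev(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_without = correlation(xs, ys)
# Add a single point (6, 0) that falls far below the pattern of the others.
r_with = correlation(xs + [6], ys + [0])

print(round(r_without, 3), round(r_with, 3))
```

In this example one added observation is enough to turn a clearly positive correlation into a negative one.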
Regression Line
A straight line that describes how a response variable Y changes as an explanatory variable X changes.
Regression Line Use
Used to predict values of Y for given values of X.
Least Squares Regression Line
Line that minimizes the sum of squared deviations in the vertical direction: Σ(yi - ŷi)^2.
Least Squares Regression Line Formula
ŷ = b0 + b1 x, where b1 = r (sy / sx) is the slope and b0 = ȳ - b1 x̄ is the intercept.
Slope of Least Squares Regression Line
Predicted change in y when x increases by one unit (an increase if the slope is positive, a decrease if it is negative).
Intercept of Least Squares Regression Line
Predicted value of y when x=0.
Slope Formula
b1 = r * (sy / sx)
Intercept Formula
b0 = ȳ - b1 * x̄
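The slope and intercept formulas can be applied directly. This is a minimal sketch on made-up data; the variable names and the `predict` helper are illustrative, not from the notes.

```python
from statistics import mean, stdev

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 4, 5, 4, 5]  # made-up response values

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b1 = r * (s_y / s_x)     # slope
b0 = y_bar - b1 * x_bar  # intercept

def predict(x):
    """Predicted value y-hat for a given x."""
    return b0 + b1 * x

print(round(b1, 3), round(b0, 3))  # 0.6 2.2
```

Here the fitted line is ŷ = 2.2 + 0.6x: each one-unit increase in x raises the predicted y by 0.6, and the predicted y at x = 0 is 2.2.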
r^2 Definition
Fraction of variation in Y that is accounted for by its regression on X.
r^2 = 1
Every data point lies on the regression line, so Y can be predicted exactly from X.
r^2 = 0
Regression on X tells us absolutely nothing about the value of Y.
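One way to see r^2 as a fraction of variation is to compare the variation left over after regression with the total variation in y. The sketch below uses made-up data. It computes the slope as Sxy / Sxx, which is algebraically the same as r (sy / sx). All names are illustrative.

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y

b1 = s_xy / s_xx         # equivalent to r * (sy / sx)
b0 = y_bar - b1 * x_bar

# Variation in y that the line does not account for.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / s_yy
r = s_xy / (s_xx * s_yy) ** 0.5

print(round(r_squared, 3), round(r ** 2, 3))  # 0.6 0.6
```

For these values, 60% of the variation in y is accounted for by the regression on x, and the same number is obtained by squaring r.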
Residual Definition
The value yi-ŷi (for i = 1, 2, 3, …, n): actual value of y - predicted value of y.
Residual
Reflects the error of our prediction.
Positive Residual
An observation falls above the least squares regression line.
Negative Residual
An observation falls below the least squares regression line.
Least Squares Regression Line Minimization
Minimizes the sum of squared residuals.
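The residual cards can be checked numerically. This is a sketch on made-up data: it computes each residual, labels the point as above or below the line, and compares the sum of squared residuals of the least squares line with that of two slightly different lines. The helper name `sum_squared_residuals` is an assumption.

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical deviations from the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Residual = actual y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = ["above" if e > 0 else "below" for e in residuals]

best = sum_squared_residuals(b0, b1)
shifted = sum_squared_residuals(b0 + 0.1, b1)  # nudge the intercept
tilted = sum_squared_residuals(b0, b1 + 0.1)   # nudge the slope

print(signs)
print(best < shifted, best < tilted)  # True True
```

The residuals of the least squares line also sum to zero (up to rounding), so the points above the line balance the points below it.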
Extrapolation
The process of predicting a value of Y for a value of X that's outside our range of data.
Extrapolation Prediction
Gives unreliable predictions and should be avoided.
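A simple guard against accidental extrapolation is to check whether the requested x lies inside the observed range before trusting a prediction. The sketch below assumes a line ŷ = 2.2 + 0.6x fitted to made-up x values from 1 to 5. The function name and the returned flag are illustrative choices.

```python
observed_xs = [1, 2, 3, 4, 5]  # made-up x values the line was fitted on
b0, b1 = 2.2, 0.6              # assumed fitted intercept and slope

def predict(x):
    """Return (prediction, is_extrapolation) for a given x."""
    outside = x < min(observed_xs) or x > max(observed_xs)
    return b0 + b1 * x, outside

inside_pred, inside_flag = predict(3)     # within the data range
outside_pred, outside_flag = predict(20)  # far beyond the largest observed x

print(inside_flag, outside_flag)  # False True
```

The line still returns a number for x = 20, but nothing in the data tells us whether the linear pattern continues that far, so the flagged prediction should not be relied on.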
Scatterplots Y-direction outlier
Generally has little effect on the regression line.
Scatterplots Bivariate outlier (not outlying in x or y alone)
Falls outside the general pattern of points without being extreme in the x-direction or the y-direction. Generally has little effect on the regression line.
Scatterplots X-direction outlier
Has a strong effect on the regression line.
Influential Observation
Removing it from the data set would dramatically alter the position of the least squares regression line (and thus the value of r^2 as well).
Influential Observation Outliers
Outliers in the x-direction are typically influential observations.
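The contrast between y-direction and x-direction outliers can be illustrated with a constructed example. The data are made up, and the y-direction outlier is deliberately placed at x = x̄, which is why the slope is left exactly unchanged here (only the intercept shifts). The helper name `slope` is an assumption.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope, computed as Sxy / Sxx."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_base = slope(xs, ys)
slope_y_outlier = slope(xs + [3], ys + [9])   # extreme y at a typical x
slope_x_outlier = slope(xs + [15], ys + [2])  # extreme x, far from the other points

print(round(slope_base, 3), round(slope_y_outlier, 3), round(slope_x_outlier, 3))
```

In this example the point with the extreme x value pulls the line toward itself and reverses the sign of the slope, so removing it would dramatically change the fitted line.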
Least Squares Regression Line Property
Always passes through the point (x̄, ȳ).
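This property follows in one line from the intercept formula b0 = ȳ - b1 x̄. Substituting x = x̄ into the fitted line gives:

```latex
\hat{y}(\bar{x}) = b_0 + b_1 \bar{x}
                 = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x}
                 = \bar{y}
```

So the predicted value at the mean of x is always the mean of y, whatever the slope.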
Observational Study
A study where individuals are simply observed. The observed relationship could be due to one or more lurking variables.
Experiment
The values of the explanatory variable are randomly "assigned" to the sample units, rather than simply being observed prior to the study.
Marijuana Example
Correlation between X and Y was calculated to be r = 0.85 among teens.
Drug Lurking Variable
The availability of drugs in different cities.
Realistic Observational Studies
Realistically, observational studies are often more feasible than experiments.
Scatterplots Categorical Variables
Sometimes a scatterplot may actually be displaying two or more distinct relationships, for example one for each category of a categorical variable.
Categorical Variables Conclusion
Be careful when examining a relationship to ensure that the data belong to only one population.
Association
Relationship between two variables.
Explanatory Variable
Variable used to predict or explain changes in the response variable.
Response Variable
Variable whose values are explained or predicted by the explanatory variable.
Scatterplot
Graphical representation of the relationship between two quantitative variables.
Correlation
Statistical measure that describes the strength and direction of a linear relationship between two variables.
R^2
Coefficient of determination, indicating the proportion of variance in the response variable that is predictable from the explanatory variable(s).
Residual
Difference between the actual value and the predicted value.
Outlier
Data point that differs significantly from other data points in the set.
Influential Observation
Observation that, if removed, would substantially change the fitted regression line.
Lurking Variable
Variable that is not measured in the study but affects the relationship between the explanatory and response variables.
Correlation Coefficient (r)
A number between -1 and +1 expressing the direction and strength of the linear relationship between two quantitative variables.
Extrapolation
Predicting values outside the range of the data; can lead to unreliable predictions.
Least Squares Criterion
Method of finding the regression line that minimizes the sum of the squares of the vertical distances between the data points and the line.
Strength
How closely the data follow the overall pattern.
Predicted Value
The value of ŷ that the regression line gives for a particular value of x.
Direction
Can be positive or negative, depending on the association.
Causation
One variable directly affects the other.
Least-squares Regression Line
The line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Slope
The amount by which y is predicted to change when x increases by one unit.
Intercept
The predicted value for y when x=0.
Linear Regression
A statistical method used to fit a linear model to a given data set.
Association vs Causation
Just because two variables are correlated does not imply that one causes the other.
Linear Relationship
When the data is somewhat close to forming a straight line.