Regression - overall idea
Statistical methods that fit a model to data in order to predict a response variable from one or more explanatory variables.
Regression line - definition
A straight line that describes how a response variable y changes as an explanatory variable x changes; used to predict y for a given x.
When to use a regression line
When a scatterplot shows an approximately straight‑line relationship and one variable is used to explain or predict the other.
Least-squares regression line - definition
The line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
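A minimal Python sketch (not from the source) of what "least squares" computes; the x and y values here are made up for illustration:

```python
# Least-squares fit: choose a (intercept) and b (slope) so that the
# sum of squared vertical distances from the points to the line
# y = a + b*x is as small as possible.

def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Standard closed-form solution:
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar  # the line always passes through (x_bar, y_bar)
    return a, b

x = [1, 2, 3, 4, 5]            # made-up explanatory values
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # made-up responses
a, b = least_squares(x, y)
print(f"predicted y = {a:.2f} + {b:.2f} * x")
```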
Equation of a regression line
Written in the form y = a + bx, where a is the intercept and b is the slope.
Slope of regression line - interpretation
The slope b is the amount by which the predicted y changes when x increases by 1 unit.
Intercept of regression line - interpretation
The intercept a is the predicted value of y when x = 0 (though x = 0 may be outside the meaningful data range).
Using the regression equation for prediction
Substitute a given x‑value into the equation y = a + bx to compute the predicted y‑value.
Example: fossil bones - pattern
Femur and humerus lengths of Archaeopteryx fossils follow a strong straight‑line pattern, making regression prediction accurate.
Example: fossil bones - equation
For the fossils, the least-squares line is: humerus length = −3.66 + (1.197 × femur length).
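A quick sketch of prediction with this line; the equation is from the text, while the femur lengths fed in are hypothetical (units assumed to be centimeters):

```python
# Predicted humerus length from the least-squares line in the text:
# humerus = -3.66 + 1.197 * femur
def predict_humerus(femur):
    return -3.66 + 1.197 * femur

# Hypothetical femur lengths (cm) to illustrate the substitution:
for femur in (50, 60, 70):
    print(f"femur {femur} -> predicted humerus {predict_humerus(femur):.1f}")
```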
Understanding prediction - model idea
Prediction is based on fitting some "model" (such as a straight line) to the data; better fit leads to more reliable predictions.
Prediction works best when
The model fits the data closely and there is a clear, strong pattern in the relationship.
Extrapolation - definition
Using a regression line to predict values of y for x‑values outside the range of the data; this is risky and often unreliable.
Warning about extrapolation
Patterns may change outside the observed x‑range, so predictions far beyond the data can be seriously misleading.
Example of extrapolation error
Using a child's linear growth from ages 3 to 8 to predict adult height at 25 would give an unrealistic height (like 8 feet).
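A numeric sketch of that failure; the growth line below is invented (roughly 6 cm of growth per year), not fitted to real data:

```python
# Hypothetical line fit to heights at ages 3-8: height = 80 + 6*age (cm).
a, b = 80.0, 6.0  # invented intercept and slope

def predicted_height(age):
    return a + b * age

print(predicted_height(8))   # 128 cm: inside the data range, plausible
print(predicted_height(25))  # 230 cm (about 7.5 ft): extrapolation, absurd
```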
Correlation vs regression - key difference
Correlation measures direction and strength of a linear relationship; regression fits a specific line and requires choosing an explanatory and response variable.
Effect of outliers on correlation and regression
Both correlation and regression are strongly affected by outliers; a single extreme point can change r and the regression line substantially.
Coefficient of determination r² - definition
The square of the correlation r; r² is the proportion of the variation in y explained by the least-squares regression of y on x.
Interpreting r² - example
If r = 0.994, then r² = 0.988, meaning 98.8% of the variation in y is explained by the straight‑line relationship with x.
Example: fossil bones - r²
For the fossil data, r = 0.994 and r² = 0.988, so femur length explains 98.8% of the variation in humerus length.
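A check of these numbers in Python, assuming the five femur/humerus measurements commonly quoted with this example (they also reproduce the slope 1.197 and intercept −3.66 given above):

```python
import math

def correlation(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

femur   = [38, 56, 59, 64, 74]  # cm (assumed dataset)
humerus = [41, 63, 70, 72, 84]  # cm (assumed dataset)
r = correlation(femur, humerus)
print(f"r = {r:.3f}, r^2 = {r*r:.3f}")  # r = 0.994, r^2 = 0.988
```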
Prediction vs causation - distinction
A relationship can be used to make predictions even when there is no evidence that changes in one variable cause changes in the other.
Statistics and causation - key warning
A strong relationship between two variables does not necessarily mean that changes in one variable cause changes in the other.
Lurking variable - definition (causation context)
A variable not included in the analysis that influences both x and y, potentially creating a misleading association.
Common response - definition
A situation in which a third (lurking) variable influences both x and y, producing a correlation even though x and y do not directly affect each other.
Confounding - definition
When the effects of two or more variables on a response are mixed together, making it difficult to distinguish their separate influences.
Best evidence for causation
Comes from randomized comparative experiments, which control for lurking variables by random assignment.
Using associations for prediction without causation
An observed relationship can still be used for prediction as long as the past pattern continues, even without knowing the causal mechanism.
Smoking and lung cancer - causation example
Nonexperimental evidence that smoking causes lung cancer is very strong, based on many consistent observational studies.
Criteria for causation without experiments - strength
The association is strong: smokers have much higher lung cancer rates than similar nonsmokers.
Criteria for causation without experiments - consistency
The association appears in many different studies, groups, and countries, reducing the chance a specific lurking variable explains it.
Criteria for causation - dose-response
Higher doses are associated with stronger responses: heavier and longer-term smoking leads to higher lung cancer risk; quitting reduces risk.
Criteria for causation - time order
The alleged cause precedes the effect: increases in smoking were followed about 30 years later by rises in lung cancer deaths.
Criteria for causation - plausibility
Experiments with animals show that tars from cigarette smoke cause cancer, making the causal mechanism biologically plausible.
Big data - definition/idea
Massive databases (often petabytes in size) of information from sources like web searches, social media, and credit card records used to find patterns and correlations.
Google Flu Trends - example
Google used correlations between flu‑related search terms and flu cases to track influenza spread faster than the CDC, until it later over‑predicted cases.
Limitations of big data - sampling
Big data often come from large but biased convenience samples (like Twitter users), not from representative samples of the whole population.
Limitations of big data - extrapolation risk
Without understanding why a correlation exists, predictions can fail badly when the situation changes or when extrapolating to new conditions.
Big data and theory
Claims that "the numbers speak for themselves" are misleading; statistical theory is still needed to avoid bias, misinterpretation, and extrapolation errors.
Statistics in summary - regression
Regression fits models (often straight lines) to data to predict y from x; least-squares is the standard method for fitting a line.
Statistics in summary - r² and extrapolation
r² tells what fraction of variation in y is explained by the linear model; extrapolation beyond the data range remains risky and must be treated with caution.
Statistics in summary - causation
Strong association does not prove causation; lurking variables, common response, and confounding can explain observed relationships, especially without experiments.