Stats Unit 3
Independent- There is no relationship between two variables
Association- A relationship between two variables
Correlation- A linear relationship between two variables
Causation- One variable causes the changes in the other variable (a VERY strong statement)
scatter plot descriptors- direction, form, strength, and unusual features
direction- positive or negative
form- linear or nonlinear (don’t let one or two points sway you)
strength- strong (|r| >= .8), moderate (.5 <= |r| < .8), or weak (|r| < .5)
unusual features- outliers/influential points, groupings, or gaps
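A small sketch of how the direction and strength labels could be read off a value of r, using the cutoffs listed above. The function name is made up for illustration, and treating |r| exactly equal to .5 as moderate is an assumption. Form and unusual features still have to be judged by looking at the plot.

```python
def describe_correlation(r):
    """Label the direction and strength of a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    size = abs(r)
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:  # assumption: |r| exactly .5 is treated as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return direction, strength


print(describe_correlation(-0.85))  # ('negative', 'strong')
print(describe_correlation(0.62))   # ('positive', 'moderate')
print(describe_correlation(0.3))    # ('positive', 'weak')
```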
y- actual value
y^ (read "y-hat")- predicted value
residual equation- y - y^
residual- the distance in the y direction between the actual data point and the line of best fit
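A quick sketch of the residual calculation, assuming the predictions come from a straight line y^ = a + bx. The data and the line y^ = 1 + 2x are made up for illustration.

```python
def residuals(xs, ys, a, b):
    """Return the residual y - y^ for each point, where y^ = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up data and a made-up line y^ = 1 + 2x (not from the notes)
xs = [1, 2, 3, 4]
ys = [3, 6, 6, 9]

print(residuals(xs, ys, a=1, b=2))  # [0, 1, -1, 0]
```

A positive residual means the actual point is above the line, and a negative residual means it is below the line.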
line of best fit/least squares regression line- straight line on a graph that shows the overall trend of a set of data points, chosen to minimize the sum of the squared residuals (the vertical distances between the points and the line)
line of best fit/regression line equation- y^ = a + bx, where a is the y-intercept and b is the slope
slope- for every increase of 1 in the (x-variable), the predicted (y-variable) increases or decreases by (slope)
y-intercept- the predicted value of y when x is zero. It shows the starting point or baseline for y before any changes in x are considered
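A minimal sketch of how the slope and y-intercept of the least squares line can be computed by hand, using the standard formulas slope = sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2, and intercept = y_bar - slope * x_bar. The data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y^ = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope = sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar  # the line always passes through (x_bar, y_bar)
    return a, b


# Made-up data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(f"y^ = {a:.2f} + {b:.2f}x")  # y^ = 2.20 + 0.60x
```

Reading the result with the slope and y-intercept templates: for every increase of 1 in x, the predicted y increases by 0.60, and the predicted y is 2.20 when x is zero.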
interpolation- estimation within the data you have
extrapolation- estimation outside the data you have
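One way to tell the two apart is to compare the new x-value with the smallest and largest x-values in the data. This is only a sketch with a made-up function name and made-up x-values.

```python
def prediction_type(x_new, xs):
    """Say whether predicting at x_new stays inside the observed x-values."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


xs = [1, 2, 3, 4, 5]  # made-up x-values (not from the notes)

print(prediction_type(3.5, xs))  # interpolation
print(prediction_type(12, xs))   # extrapolation
```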
outlier- point far from the regression line
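leverage point- point whose x-value is far from the mean of x; might strengthen or weaken the correlation
influential points- points that noticeably change the regression line (slope, y-intercept, or r) when removed; often leverage points that do not follow the trend of the other points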
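A sketch of one way to see whether a point is influential: fit the line with and without it and compare the slopes. The data and the extra point (20, 2) are made up; the extra point has an x-value far from the others and does not follow their trend.

```python
def slope(xs, ys):
    """Return the least squares slope for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Made-up data (not from the notes)
base_xs = [1, 2, 3, 4, 5]
base_ys = [2, 4, 5, 4, 5]
extra_x, extra_y = 20, 2  # far from the other x-values and off the trend

slope_without = slope(base_xs, base_ys)
slope_with = slope(base_xs + [extra_x], base_ys + [extra_y])

print(f"slope without the point: {slope_without:.2f}")  # 0.60
print(f"slope with the point:    {slope_with:.2f}")     # -0.09
```

Here one point is enough to flip the slope from positive to slightly negative, which is what makes it influential.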
coefficient of determination (r^2)- shows how well the data fit the linear model, or tells us what percent of the variation in y is explained by the variation in x: r^2 = 1 (perfect fit, 100% of the variation in y is explained by x) and r^2 = 0 (0% of the variation in y is explained by x)
correlation coefficient (r)- measures the strength and direction of a linear relationship between two variables: r > 0 (positive correlation), r < 0 (negative correlation), and r = 0 (no linear correlation)
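A minimal sketch of computing r by hand and squaring it to get r^2, using the standard formula r = sum of (x - x_bar)(y - y_bar) divided by the square root of [sum of (x - x_bar)^2 times sum of (y - y_bar)^2]. The data are made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the correlation coefficient r for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(f"r = {r:.3f}")       # r = 0.775
print(f"r^2 = {r**2:.3f}")  # r^2 = 0.600
```

For this made-up data, about 60% of the variation in y is explained by the linear relationship with x.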
Linear- The relationship between variables should be roughly a straight line. Check the residual plot—points should be randomly scattered around zero with no pattern.
Independent- Each observation must be independent of others to avoid bias in the model.
Normal- Residuals should be roughly normally distributed around zero, which can be checked with a histogram or normal probability plot.
Equal Variance- Residuals should have a consistent spread across all values of the explanatory variable. If residuals fan out, this indicates unequal variance, suggesting a poor fit in some areas.
categorical predictors- Variables that represent categories or groups rather than numerical values. They are used in models to compare the effect of different groups on the outcome. Examples include gender, region, or type of product.
conditions for categorical predictors- Linearity, independence of observations, normality of residuals, and equal variance
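A sketch of how a two-group categorical predictor can go into a regression, assuming the groups are coded as an indicator (0 for one group, 1 for the other). The coding choice, group labels, and data are all assumptions made for illustration. With this coding, the y-intercept is the mean of the group coded 0 and the slope is the difference between the two group means.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y^ = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Made-up data: group labels and an outcome for each observation
groups = ["A", "A", "A", "B", "B", "B"]
ys = [10, 12, 14, 15, 17, 19]

# Assumption: code group A as 0 and group B as 1
xs = [1 if g == "B" else 0 for g in groups]

a, b = least_squares_line(xs, ys)
print(f"intercept = {a:.1f}")  # 12.0, the mean of group A
print(f"slope = {b:.1f}")      # 5.0, mean of group B minus mean of group A
```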
Confounding Variable- A variable that affects both the predictor and the outcome, making their relationship unclear
Lurking Variable- A hidden variable that influences both the predictor and outcome, causing a misleading link between them.
Residual by X Plot- A plot of residuals against the predictor variable (X). It’s useful for checking patterns but not ideal for assessing the overall model when there are multiple variables.
Residual by Predicted Plot- A common plot that shows residuals versus predicted values. It helps check if the residuals are randomly scattered around zero and is good for assessing linearity and equal variance (a curved pattern suggests nonlinearity; a fan shape suggests unequal variance).
Residual by Row Plot- A plot showing residuals against the row number (often used when data is in a time sequence). It helps assess independence and detect trends, like seasonal patterns or increasing/decreasing trends over time.
Predicted Plot- A plot showing actual versus predicted values, especially useful in multivariate situations. It helps identify patterns, with the ideal being points close to the line where actual equals predicted (y = x).
Quantile Plot- A plot used to check if the residuals are normally distributed. It compares the quantiles of residuals to a normal distribution.
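A sketch of the numbers behind a quantile plot: sort the residuals and pair each one with the standard normal quantile for its rank. The residuals are made up, and the plotting position (i - 0.5) / n is one common convention that is assumed here; software may use a slightly different one. If the residuals are roughly normal, the pairs fall close to a straight line.

```python
from statistics import NormalDist


def normal_quantile_pairs(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, value in enumerate(ordered, start=1):
        # assumption: plotting position (i - 0.5) / n; software may differ
        p = (i - 0.5) / n
        pairs.append((standard_normal.inv_cdf(p), value))
    return pairs


# Made-up residuals (not from the notes)
residuals = [0.4, -1.1, 0.0, 1.2, -0.5]

for z, res in normal_quantile_pairs(residuals):
    print(f"normal quantile {z:6.2f}   residual {res:5.1f}")
```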