
Exam 1 Review Flashcards

Exam Policies and Exam Preparation

The reviewer emphasizes that Exam 1 is taken remotely from home, in a room by yourself. You may use paper-based resources (handwritten or printed notes from the course), but electronic resources are not allowed. Permitted paper materials include personal handwritten notes, printed lecture slides, activity solutions from D2L and Top Hat, and customized formula sheets. There is no page limit for your review notes, but a practical approach is to carry one or two neatly organized formula sheets plus backups. Having backups ensures that if one sheet is damaged or misplaced during the exam, you still have access to your critical information. Since the exam is timed, keep your notes highly organized so you can answer most questions from your formulas without scanning through everything. Organizing notes by module or topic, with a clear index or color-coding, can significantly reduce time spent searching. Keep frequently used formulas and interpretations concise. The course provides several resources in D2L and Top Hat that can be useful during the exam, including lecture notes for Modules 1–3 and completed Module 2 and 3 activities. Customize and supplement these notes as needed for exam readiness.

Modifying Lecture Notes for Exam Readiness

A key strategy is to tailor the Module 2 lecture notes to make them more useful during Exam 1. The “table of when to use what” (on page 2 of the Module 2 notes) is a quick decision-making guide for selecting appropriate statistical methods, but some items are missing: it does not explicitly list every formula or detailed interpretation guideline. For example:

  • The expression for the percentage of variability of y explained by the model is given by R^2 \times 100\%, and it helps to note that descriptions like strong or weak correlation relate to this measure. R^2 (R-squared) represents the proportion of the variance in the dependent variable ($y$) that is predictable from the independent variable(s). A higher R^2 value indicates that the model explains a larger proportion of the variability, suggesting a stronger fit. For example, an R^2 of 0.80 means 80% of the variability in $y$ is explained by the model.

  • The confidence interval for the slope is often omitted. The margin of error is ME = 1.96 \times SE_{\hat{\beta}_1}, so the interval is \hat{\beta}_1 \pm ME. This interval provides a range of plausible values for the true population slope from which the sample was drawn. If the interval does not contain zero, it suggests a statistically significant linear relationship between $x$ and $y$ at the 5% level.

  • The slope equation, when solving for the change in y given a change in x, is often not emphasized. Since you are given \Delta x and the slope $b$, the change in the predicted value is \Delta \hat{y} = b \cdot \Delta x. This form is crucial for practical interpretations, allowing you to quickly calculate the predicted impact of a specific change in the independent variable on the dependent variable, rather than just a one-unit change.

Other modifications to consider for your notes:

  • Write the slope equation so that you can solve for \Delta y when necessary: if \displaystyle \frac{\Delta y}{\Delta x} = b, then \Delta y = b \cdot \Delta x.

  • Create personal formula sheets; use resources in D2L and Top Hat for extra practice questions. In Module 1 you may not need everything (e.g., Posit Cloud setup notes are not essential for Exam 1).

Tools and Resources for the Exam

Desmos is one of the approved online calculators. To use it, ignore the graph section and use the input space to type calculations; e.g., typing 300 + 5\times 50 returns 550. To use Desmos effectively, become proficient with its basic arithmetic operations and how to input complex expressions. For exponents, use the caret (e.g., type 5^2). Parentheses are essential for order of operations. You can also copy and paste previous calculations to streamline the process, or define variables (e.g., a=5, b=10, then a*b) for multi-step calculations, which saves time. This tool is permitted during the exam, so it is worth getting comfortable performing quick arithmetic in Desmos before the test.

Basic Statistical Concepts and Variable Types

A recurring theme in the exam review is distinguishing between types of variables and types of statistics:

  • A categorical variable (also called a qualitative or group variable) assigns observations to discrete groups. It does not measure amount. Example: pet owner (values: Yes, No; could also be categories like 'Dog,' 'Cat,' 'Bird' or 'Excellent', 'Good', 'Fair', 'Poor' for a Rating variable). Categorical variables can be nominal (no inherent order) or ordinal (have a meaningful order).

  • A quantitative variable (also called numerical) measures magnitude. Example: sales (dollar amount per transaction). Quantitative variables can be discrete (countable, like number of items sold) or continuous (measurable, like height or temperature).

  • Descriptive statistics summarize a complete dataset, often using visuals or summary numbers. The goal is to describe and visualize the features of the data without making inferences beyond it. This includes calculating measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation). Predictive statistics use sample data to predict outcomes for a population or future observations. Also known as inferential statistics, these methods allow us to make generalizations, estimates, or predictions about a larger population based on a representative sample. Hypothesis testing and confidence intervals are key tools in predictive statistics. Example: Graphing total Lego sets sold historically is descriptive; using past data to forecast future demand is predictive.
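
To make the contrast concrete, here is a minimal R sketch; the lego data frame and its values are invented for illustration, not course data.

    # Hypothetical yearly Lego set sales
    lego <- data.frame(year = 2015:2024,
                       sets_sold = c(120, 135, 150, 149, 170, 168, 185, 200, 210, 225))

    # Descriptive: summarize the complete, known data
    summary(lego$sets_sold)
    mean(lego$sets_sold)
    sd(lego$sets_sold)

    # Predictive (inferential): fit a trend and forecast a future, unseen year
    fit <- lm(sets_sold ~ year, data = lego)
    predict(fit, newdata = data.frame(year = 2025))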

Key takeaway: If the data are complete, descriptive statistics are appropriate; if you are estimating or predicting for a population, use predictive statistics. It's vital for the exam to correctly identify whether a question requires a descriptive summary of known data or an inferential prediction for unknown data or a larger population.

Simple Linear Regression: Concept and Output

A simple linear regression model is defined as a straight-line relationship with exactly one independent variable:


\hat{y} = \hat{b}_0 + \hat{b}_1 x

where the slope $\hat{b}_1$ quantifies the average change in the predicted dependent variable ($\hat{y}$) for every one-unit increase in the independent variable ($x$), assuming a linear relationship. The line is determined by two coefficients, the intercept $\hat{b}_0$ and the slope $\hat{b}_1$. The intercept $\hat{b}_0$ represents the predicted value of $y$ when $x$ is zero. This interpretation is only meaningful if $x=0$ is within the range of observed data or makes logical sense in the context.

When interpreting the slope in practice, it is common to express the rate of change in the dependent variable for a specified change in the independent variable. For example, if the model is \hat{y} = 597 + 7x, the intercept is 597 and the slope is 7. This means that when $x$ increases by 1 unit, the predicted $y$ increases by 7 units. This interpretation assumes that no other relevant variables are changing, so the isolated effect of $x$ on $y$ is observed. Because the relationship is linear, the predicted change is the same everywhere: if $x$ goes from 10 to 11, $\hat{y}$ is predicted to increase by 7, and if $x$ goes from 20 to 21, $\hat{y}$ is also predicted to increase by 7.
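
As a hands-on check, here is a minimal R sketch using R's built-in mtcars data (the variables differ from the 597 + 7x example; the idea is the same):

    # Fit a simple linear regression and read off the two coefficients
    model <- lm(mpg ~ wt, data = mtcars)
    coef(model)                      # intercept (b0-hat) and slope (b1-hat)

    # Slope times a change in x gives the predicted change in y-hat
    b1 <- unname(coef(model)["wt"])
    b1 * 1                           # predicted change in mpg per 1-unit increase in wt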

Reducing a Model: Goals, Methods, and Cautions

Reducing a model means starting with a larger model and removing variables one at a time to retain similar predictive accuracy with fewer variables. Key ideas include:

  • Intercept is generally not forced to be zero in this course (special cases exist but are not the focus).

  • A successful reduced model maintains similar accuracy while offering simpler interpretation.

  • Two common methods to decide which variable to remove:

    • The p-value approach: remove the variable with the largest p-value (above the significance threshold), recompute the model, and repeat.

    • Automatic stepwise reduction using AIC (Akaike Information Criterion): remove the variable whose removal results in the lowest AIC, balancing fit and complexity. AIC is a measure of the relative quality of statistical models for a given set of data. It considers both the goodness of fit of the model and the number of parameters used. A lower AIC value indicates a better model, preferring simpler models that explain the data nearly as well as more complex ones. The goal is to find the model with the minimum AIC. (An R sketch of both approaches follows this list.)

  • Multicollinearity (high correlation among independent variables) may reduce interpretability and the apparent significance of coefficients, but it does not necessarily degrade the predictive accuracy of the model. While multicollinearity can inflate the standard errors of coefficients, making individual predictors appear less significant, it often does not impair the overall predictive power of the model as measured by metrics like R^2 or Adjusted R^2. The model can still make accurate predictions for the dependent variable, even if it's difficult to disentangle the individual contributions of correlated predictors. Therefore, reducing a model should consider both predictive performance and interpretability.
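
As referenced above, here is a minimal R sketch of both reduction approaches, using R's built-in mtcars data rather than course data:

    # Start from a fuller model
    full <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

    # p-value approach: inspect Pr(>|t|), drop the largest, refit, repeat
    summary(full)$coefficients

    # AIC-based stepwise reduction: step() drops terms while AIC keeps improving
    reduced <- step(full, direction = "backward")
    summary(reduced)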

Scatterplot Matrix and Visualization

A scatterplot matrix is a collection of scatterplots arranged in a grid. It contains no numerical values; it is a visual tool for exploring relationships between pairs of variables. Its purpose is to visually assess potential linear relationships and correlations, not to compute exact numbers (a one-line R sketch follows the list below). When examining a scatterplot matrix, look for:

  • Direction: Positive (upward slope) or negative (downward slope) relationships.

  • Form: Linear or non-linear patterns.

  • Strength: How closely the points cluster around a potential line (strong vs. weak correlation).

  • Outliers: Points that deviate significantly from the general pattern.

  • Heteroscedasticity: Varying spread of residuals across the range of predictor values.
    Such visual cues help in selecting appropriate independent variables for a regression model, identifying strong or weak associations and potential non-linear patterns, and flagging potential issues.
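
As referenced above, a one-line R sketch (built-in mtcars data): the base pairs() function draws a scatterplot matrix for the selected variables.

    pairs(mtcars[, c("mpg", "wt", "hp", "disp")])   # visual check of pairwise relationships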

Dependent Variable and Regression Formulas

In a regression model, the dependent variable (often denoted as $Y$) is the variable you are trying to predict. Commonly, the dependent variable is written on the left-hand side of the equation or described with the word "predicted". The independent variables (the $X$ variables) are the predictors you use to estimate $Y$. In a multiple regression model with $p$ predictors, we have

\hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \cdots + \hat{b}_p x_p

When preparing an interpretation, ensure your variables match exactly the ones described in the interpretation; the predicted variable must be $Y$, and any variable that is held constant or controlled for is treated as an $X$ variable. Precise language is critical. For instance, do not write "sales increases" when the model supports only "predicted sales increases." Always keep the distinction between the actual variable and its predicted value clear.

Reading Regression Output: Intercept, Slope, and p-values

From an R regression output (lm), the trend line is \hat{y} = \hat{b}_0 + \hat{b}_1 x. The intercept is the estimate in the row labeled (Intercept), and the slope is the estimate in the row labeled with the predictor variable (the $X$). For a sample output describing \hat{y} = 597 + 7x, the intercept is 597 and the slope is 7. The p-values for the coefficients appear in the column labeled "Pr(>|t|)" next to the corresponding coefficient. A large p-value (e.g., > 0.05) for a slope coefficient suggests that the corresponding predictor may not significantly predict $Y$ when other variables are in the model. Specifically, a large p-value implies that we cannot reject the null hypothesis ($H_0: \beta_1 = 0$), meaning there is insufficient statistical evidence to conclude that the independent variable has a linear relationship with $Y$, after accounting for other predictors in the model. Conversely, a small p-value (e.g., < 0.05) indicates the predictor is statistically significant.
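
A minimal R sketch of locating these pieces in the output (built-in mtcars data, illustrative only):

    model <- lm(mpg ~ wt, data = mtcars)
    out <- summary(model)

    out$coefficients                     # Estimate, Std. Error, t value, Pr(>|t|)
    out$coefficients["wt", "Pr(>|t|)"]   # slope p-value; compare against 0.05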

Summary Statistics in Regression Output

In a summary output, the abbreviation "1st Qu." denotes the first quartile (Q1), i.e., the 25th percentile. For a typical quantitative variable, you also see the minimum, median, mean, third quartile (Q3), and maximum. The presence of these statistics helps in understanding the distribution of each variable before modeling. These statistics provide a quick overview of the data's central tendency, spread, and potential skewness. The difference between the mean and median can indicate skewness, while the range and quartiles give insight into the data's variability and the presence of extreme values.
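
For instance, a one-line R sketch (built-in mtcars data) prints exactly these labels:

    summary(mtcars$mpg)
    #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   (labels as R prints them)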

How to Build and Interpret Regression Models in R (lm)

A common model specification is:

  • To predict $Y$ based on $X$: lm(Y ~ X, data = dataset). For multiple regression: lm(Y ~ X1 + X2, data = dataset).

  • The general form of the regression is \hat{y} = \hat{b}_0 + \hat{b}_1 x for simple regression, and \hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \cdots + \hat{b}_p x_p for multiple regression. After running model <- lm(Y ~ X, data = dataset), you typically use summary(model) to view the full regression output, which includes coefficients, standard errors, t-values, p-values, R^2, and adjusted R^2.

If you are asked to predict a variable $Y$ using a model, you identify the dependent variable (the one being predicted) as the left side of the equation (or the variable described as predicted), and the independent variables as the predictors on the right side of the tilde (~).
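
Putting it together, a minimal sketch of fitting and using a multiple regression in R (built-in mtcars data; the variable names are illustrative):

    model2 <- lm(mpg ~ wt + hp, data = mtcars)    # mpg is Y; wt and hp are the Xs
    summary(model2)                               # coefficients, p-values, R^2, adjusted R^2
    predict(model2, newdata = data.frame(wt = 3.0, hp = 120))   # predicted mpg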

Hypothesis Testing and Significance in Regression

In regression output, significance of the relationship between a predictor $X$ and $Y$ is assessed via the p-value associated with the slope coefficient. A p-value below 0.05 (the typical threshold) indicates a statistically significant relationship; a p-value above 0.05 suggests no statistically significant relationship at the 5% level. For a slope coefficient $\beta_1$, the null hypothesis is typically $H_0: \beta_1 = 0$ (no linear relationship between $X$ and $Y$ in the population after controlling for other variables), and the alternative hypothesis is $H_A: \beta_1 \neq 0$ (a linear relationship exists). If the p-value is less than the significance level \alpha (e.g., 0.05), we reject $H_0$.
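
To see where that p-value comes from, here is a hedged R sketch reproducing it from the t-statistic (built-in mtcars data):

    model <- lm(mpg ~ wt, data = mtcars)
    cf <- summary(model)$coefficients

    # Two-sided t-test of H0: beta1 = 0
    t_stat <- cf["wt", "Estimate"] / cf["wt", "Std. Error"]
    2 * pt(-abs(t_stat), df = model$df.residual)   # matches cf["wt", "Pr(>|t|)"]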

Multicollinearity: Implications for Interpretation and Prediction

Multicollinearity occurs when independent variables are highly correlated with one another. Key points:

  • It makes interpretation of individual coefficients difficult because changes in one predictor are associated with changes in another. Symptoms of multicollinearity include high pairwise correlations between independent variables, large standard errors for coefficients, and coefficients that change drastically when other variables are added or removed.

  • It does not necessarily harm the predictive accuracy of the model; however, it can inflate standard errors and reduce the reliability of coefficient estimates. While it complicates interpreting the unique effect of each predictor, the model's ability to predict new observations can remain strong. However, confidence intervals for the coefficients will be wider and less precise.

  • Multicollinearity requires careful interpretation and possibly techniques to reduce it (e.g., removing redundant predictors, combining predictors, or using regularization techniques in other contexts). Another diagnostic tool is the Variance Inflation Factor (VIF), where a VIF value typically above 5 or 10 indicates problematic multicollinearity for a given predictor.
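
A hedged sketch of the VIF diagnostic in R; note that vif() lives in the add-on car package, not base R, and mtcars is used purely for illustration:

    # install.packages("car")   # once, if not already installed
    library(car)

    model <- lm(mpg ~ wt + hp + disp, data = mtcars)
    vif(model)    # values above roughly 5-10 flag problematic multicollinearity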

Real-World Interpretations: Examples and Calculations

  • Example: Revenue vs. Temperature

    A model: \widehat{\text{Revenue}} = 51{,}000 + 4{,}000 \times \text{Temperature}. If the temperature increases by 4 degrees Fahrenheit, the predicted change in revenue is

    \Delta \text{Revenue} = (4{,}000) \times 4 = 16{,}000.

    Therefore, a 4-degree Fahrenheit increase in temperature is predicted to result in a $16,000 increase in Revenue. This interpretation assumes that the relationship is constant across the range of temperatures observed.

  • Example: Predicting Sales with Budget

    A model: \widehat{\text{Sales}} = 50{,}000 + 1.5 \times \text{Budget}. If the budget is 11{,}000, the predicted sales are

    \widehat{\text{Sales}} = 50{,}000 + 1.5 \times 11{,}000 = 66{,}500.

    Note: This calculation assumes the same model includes the predictors as in the interpretation; if additional predictors (e.g., a print budget) are present in another model, the interpretation and the coefficients may differ. Every interpretation is conditional on the specific model used and the other variables included (or excluded). Changes in model specification can alter coefficient estimates and their significance.
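
    As a hedged sketch, the same prediction in R; the ads data frame and its values are invented for illustration:

        set.seed(1)
        ads <- data.frame(Budget = runif(50, 5000, 20000))
        ads$Sales <- 50000 + 1.5 * ads$Budget + rnorm(50, sd = 2000)

        model <- lm(Sales ~ Budget, data = ads)
        predict(model, newdata = data.frame(Budget = 11000))   # close to 66,500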

  • Example: Interpreting a Model with Two Predictors (Rent)

    Suppose a model predicting rent uses Bedrooms and Distance:

    • The slope for Bedrooms is 380, so holding Distance constant, each additional bedroom increases predicted rent by $380.

    • The slope for Distance is −120, so holding Bedrooms constant, each additional mile farther from the city center decreases predicted rent by $120.

    • If Distance changes by 0.5 miles, the change in rent is \Delta \text{Rent} = (-120) \times 0.5 = -60.

      These interpretations rely on the model’s assumption that the two predictors are included and held as specified in the interpretation. The phrasing 'holding Distance constant' or 'holding Bedrooms constant' is crucial in multiple regression, as it isolates the effect of one predictor while accounting for the effects of others. This is also known as the ceteris paribus assumption.

  • Interpreting 95% Confidence Interval for the Slope

    Given a slope estimate ($\hat{\beta}_1$) with standard error ($SE_{\hat{\beta}_1}$), the 95% CI is \hat{\beta}_1 \pm 1.96 \cdot SE_{\hat{\beta}_1}. For example, if the slope is 4 and the standard error is 1.5, the margin of error is

    1.96 \times 1.5 = 2.94,

    so the 95% CI is 4 \pm 2.94.

    This means we are 95% confident that the true population slope lies between 4 - 2.94 = 1.06 and 4 + 2.94 = 6.94. If this interval does not contain zero, it reinforces the statistical significance of the predictor.
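
    In R, confint() computes these intervals directly; a sketch with built-in mtcars data (confint() uses the exact t critical value, so it differs slightly from the 1.96 z-approximation above):

        model <- lm(mpg ~ wt, data = mtcars)
        confint(model, level = 0.95)       # t-based 95% CIs for intercept and slope

        # Hand version matching the notes' z-approximation formula
        cf <- summary(model)$coefficients
        cf["wt", "Estimate"] + c(-1, 1) * 1.96 * cf["wt", "Std. Error"]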

  • Model Comparison and Variable Selection

    When comparing models with and without a variable (e.g., comparing a two-variable model to a four-variable model), use the p-value of the variable you would remove. If the p-value for that variable is greater than 0.05, removing it will not significantly reduce accuracy, suggesting the simpler model may be preferable. If the p-value is small, removing it would harm accuracy. When comparing models, also consider the Adjusted R^2, which accounts for the number of predictors. A model with a higher Adjusted R^2 is generally preferred, especially if it achieves similar predictive power with fewer variables.
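
    A hedged R sketch of such a comparison (built-in mtcars data, illustrative variables):

        full    <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
        reduced <- lm(mpg ~ wt + hp, data = mtcars)

        anova(reduced, full)            # F-test: a large p-value means the dropped terms add little
        summary(full)$adj.r.squared
        summary(reduced)$adj.r.squared  # prefer the simpler model if similar or higher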

  • R-squared and Variability Explained

    The proportion of variability in the dependent variable explained by the model is given by R^2. It is common to report the percent as R^2 \times 100\%. For example, if the output shows a multiple R-squared of 0.929, then the model explains

    0.929 \times 100\% = 92.9\% of the variability in the dependent variable. While a high R^2 is desirable, it doesn't necessarily imply that the model is 'good' in an absolute sense, or that all predictors are important, or that the model is correctly specified. It simply quantifies the proportion of explained variance within the sample.

  • Practical Note on Output Interpretation

    In practice, you may encounter a scenario where a variable's p-value is large in a multiple regression model, but in a simple regression model with only that variable, it could be significant. This is because the presence of other predictors can influence the perceived importance of a given predictor due to shared variance or multicollinearity. Always interpret coefficients in the context of the model as specified.

Quick Reference: Common Formulas to Memorize (LaTeX-Formatted)

  • Simple linear regression: Predicted value of y based on a single predictor x.

    \hat{y} = \hat{b}_0 + \hat{b}_1 x

  • Change in y for a change in x: Calculates the impact on the predicted dependent variable for a given change in the independent variable.

    \Delta y = b \cdot \Delta x

  • 95% CI for the slope: Estimates the range for the true population slope.

    \hat{b}_1 \pm 1.96 \cdot SE_{\hat{b}_1}

  • Margin of error (general): General formula for margin of error for any estimated coefficient, using the appropriate critical z-value or t-value.

    ME = z_{\alpha/2} \cdot SE_{\hat{\beta}}

  • Proportion of variance explained: Quantifies the proportion of the dependent variable's total variance accounted for by the model.

    R^2 and in percent form: R^2 \times 100\%

  • Desmos calculator example: Illustrates basic arithmetic input in Desmos.

    300 + 5 \times 50 = 550

  • Multivariate regression: Predicted value of y based on multiple predictors.

    \hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \cdots + \hat{b}_p x_p

Quick Recap of Exam 1 Takeaways

  • Be familiar with exam policies: remote from home, paper notes allowed, no electronics, organized formula sheets, use Desmos as an allowed calculator, and rely on provided lecture notes with careful augmentation.

  • Distinguish between variable types: categorical (nominal/ordinal) vs quantitative (discrete/continuous); descriptive (summarizing known data) vs predictive (inferring for a population or predicting future observations) statistics.

  • Understand simple linear regression and how to interpret intercept (when meaningful) and slope; recognize when multiple predictors are involved and how to interpret changes in the predicted variable given changes in predictors, emphasizing the 'ceteris paribus' condition.

  • Learn model reduction strategies (p-value approach and AIC-based stepwise reduction) and understand the effect of multicollinearity on interpretation versus predictive accuracy, including its symptoms like inflated standard errors and VIF.

  • Be able to read and interpret regression output (R), including the intercept, slope, p-values (related to hypothesis testing for coefficients), R^2 and Adjusted R^2 values, and to perform straightforward calculations for predictions and changes in the dependent variable given changes in the predictors.

  • Practice converting between numeric results and practical interpretations (e.g., dollars per degree, units per budget, rent per bedroom) while ensuring that the interpretation aligns with the exact variables included in the model and their units.

  • Know how to compute and interpret confidence intervals for the slope, understanding what it means if the interval contains zero, and how to compare models using p-values for the variable being removed and associated changes in predictive accuracy and Adjusted R^2.

These notes synthesize the major and minor points from the transcript to provide a comprehensive study resource that mirrors the content and emphasis of the exam review session.