What is the main purpose of categorical encoding in a data set?
B. To convert categories into a numerical form that a model can use
Which of the following is an example of a nominal variable?
C. Color: red, blue, green
One-hot encoding is especially appropriate for:
C. Categories with no natural ordering
A possible drawback of one-hot encoding is that it can:
B. Create many new columns when a variable has many categories
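As a minimal sketch of the column growth described above, the pure-Python helper below builds one 0/1 indicator column per category. The helper name `one_hot` and the `shirt_color_` prefix are illustrative assumptions, not part of the card set.

```python
def one_hot(values, prefix="shirt_color"):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

colors = ["red", "blue", "green", "black", "red"]
columns, rows = one_hot(colors)

# Four categories produce four new columns; a variable with hundreds of
# categories would produce hundreds.
for color, row in zip(colors, rows):
    print(color, row)
```

In a regression that includes an intercept, one indicator column is usually dropped so the remaining columns are not perfectly collinear with the intercept.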
Why can ordinal encoding be risky in some models?
B. It may incorrectly suggest that distances between categories are meaningful
A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?
B. The encoding may incorrectly impose an artificial order on the categories
A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?
A. When the categories have a meaningful order
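A small sketch of ordinal encoding for the education example, assuming an explicit, hand-written order. The mapping name `EDUCATION_ORDER` and the helper `encode_ordinal` are hypothetical.

```python
# Explicit order: the codes carry the ranking high school < college < graduate school.
EDUCATION_ORDER = {"high school": 1, "college": 2, "graduate school": 3}

def encode_ordinal(values, order):
    """Map each category to its rank; raises KeyError on an unknown category."""
    return [order[v] for v in values]

levels = ["college", "high school", "graduate school"]
codes = encode_ordinal(levels, EDUCATION_ORDER)
print(codes)
```

Codes 1, 2, 3 still treat each step as the same size, so a linear model will assume the gap from high school to college equals the gap from college to graduate school.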
Multicollinearity refers to a situation where:
B. Two or more predictors are strongly linearly related
Which of the following is a practical consequence of strong multicollinearity?
C. Standard errors of coefficients can become inflated
If two predictors are highly correlated, that means:
C. They may create redundancy, but not always
Which diagnostic is commonly used to assess multicollinearity?
B. VIF
Why is the assumption of no perfect multicollinearity important for OLS?
A. Otherwise the model cannot uniquely estimate the separate effect of some predictors
A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?
C. Strong multicollinearity
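To illustrate the two cards above, the sketch below takes the extreme case where `months_worked` is exactly 12 times `years_worked`. The data values are made up. The design matrix then has fewer independent columns than it has columns, which is why OLS cannot separate the two effects.

```python
import numpy as np

years_worked = np.array([1.0, 3.0, 5.0, 8.0, 12.0])
months_worked = 12.0 * years_worked  # exact multiple: perfect collinearity

# Design matrix with an intercept column and both predictors.
X = np.column_stack([np.ones_like(years_worked), years_worked, months_worked])

rank = np.linalg.matrix_rank(X)
n_columns = X.shape[1]
print(f"columns: {n_columns}, rank: {rank}")
```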
In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?
B. The predictors are highly related, so coefficient uncertainty may be large
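One way to compute VIF by hand is to regress each predictor on the others and take 1 / (1 - R²). The sketch below does this with NumPy on simulated data. The function name `vif`, the variable names, and the simulated values are all assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (X has no intercept column): 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
n = 200
years = rng.uniform(0, 30, n)
months = 12 * years + rng.normal(0, 6, n)  # nearly, but not exactly, 12 * years
unrelated = rng.normal(0, 1, n)            # predictor unrelated to the other two

vifs = vif(np.column_stack([years, months, unrelated]))
print(vifs)
```

The two related predictors get very large VIFs, while the unrelated one stays close to 1.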
In ordinary least squares (OLS), the estimated coefficients are chosen to:
B. Minimize the sum of squared residuals
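A quick numerical check of this idea, on simulated data (the true intercept 3 and slope 2 are assumptions of the simulation): any nudge away from the OLS coefficients increases the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])  # intercept column plus predictor

def ssr(beta):
    """Sum of squared residuals for a candidate coefficient vector."""
    resid = y - X @ beta
    return float(resid @ resid)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS solution
ssr_ols = ssr(beta_hat)

# Move the coefficients slightly in a few directions and recompute the SSR.
nudges = ([0.1, 0.0], [0.0, 0.1], [-0.1, 0.05])
ssr_nudged = [ssr(beta_hat + np.array(d)) for d in nudges]
print(ssr_ols, ssr_nudged)
```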
In a regression model, the standard error of a coefficient measures:
C. The uncertainty in the estimated coefficient
If a coefficient has a large standard error, this usually means:
B. The estimate is less precise
In the phrase “OLS is BLUE,” the word “Best” means:
C. Minimum variance among linear unbiased estimators
If OLS is BLUE, what does it guarantee?
A. The estimates are unbiased and have the smallest variance among linear unbiased estimators
Which of the following is NOT one of the core assumptions associated with OLS being BLUE?
D. Residuals must be perfectly normal
If the OLS assumptions for BLUE are violated, a practical consequence can be:
A. Standard errors and inference may become unreliable
A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?
C. There is high uncertainty around the coefficient estimates
Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?
B. OLS estimates are still centered correctly, but they are less efficient than they could be
A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?
C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction
Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?
C. MCAR
Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?
B. MAR
Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?
C. MNAR
A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?
B. MAR or possibly MNAR depending on the mechanism
A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?
C. The remaining data may become less representative and possibly biased
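The risk of dropping rows blindly can be seen in a small simulation. The sketch assumes a made-up mechanism in which above-median incomes go unreported 70% of the time and the rest 10% of the time. These rates are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Assumed mechanism: higher incomes are more likely to be missing.
p_missing = np.where(income > np.median(income), 0.7, 0.1)
reported = rng.random(income.size) >= p_missing

full_mean = income.mean()                    # what we would see with no missingness
complete_case_mean = income[reported].mean() # what remains after dropping rows
print(round(full_mean), round(complete_case_mean))
```

The mean of the remaining rows is lower than the mean of the full data, because the rows that were dropped were not a random subset.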
In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:
B. A random cloud around zero with no obvious pattern
If a residuals vs fitted plot shows a curved shape, this often suggests:
A. The model may be missing a nonlinear relationship
A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?
B. The linear model may not capture the true relationship well
If the residuals vs fitted plot fans out as fitted values increase, this suggests:
B. Heteroscedasticity, meaning the spread of residuals changes
A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:
B. Try transforming the response, such as using a log of the target when appropriate
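As a rough numerical stand-in for the plot, the sketch below compares the residual spread in the upper and lower halves of the fitted values, before and after a log transform. The data are simulated with multiplicative noise (an assumption that makes the log a natural fix), and `spread_ratio` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0, 10, n)
y = np.exp(1.0 + 0.3 * x + rng.normal(0, 0.5, n))  # multiplicative noise, y > 0

def spread_ratio(x, response):
    """Residual SD in the upper half of fitted values divided by the lower half."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    fitted = X @ beta
    resid = response - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

ratio_raw = spread_ratio(x, y)          # well above 1: residuals fan out
ratio_log = spread_ratio(x, np.log(y))  # close to 1: roughly constant spread
print(ratio_raw, ratio_log)
```

A log transform only applies when the response is strictly positive, and it changes the interpretation of the coefficients.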
If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:
A. Possible outliers or influential observations
In a P–P plot, the points should ideally:
B. Fall roughly along a straight diagonal line
If the points in a P–P plot deviate strongly from the diagonal line, this suggests:
A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution)
Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?
B. Small deviations may not be a major practical problem, especially in larger samples
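The points of a normal P–P plot can be computed directly: sort the residuals, standardize them, and pair the normal CDF value of each with its empirical cumulative probability. The sketch below does this for two simulated samples. The helper names `pp_points` and `max_gap` are assumptions.

```python
import math
import numpy as np

def pp_points(residuals):
    """Return (theoretical, empirical) cumulative probabilities for a normal P-P plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    z = (r - r.mean()) / r.std(ddof=1)
    # Standard normal CDF via the error function.
    theoretical = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    empirical = (np.arange(1, r.size + 1) - 0.5) / r.size
    return theoretical, empirical

def max_gap(residuals):
    """Largest vertical distance between the P-P points and the diagonal."""
    theoretical, empirical = pp_points(residuals)
    return float(np.max(np.abs(theoretical - empirical)))

rng = np.random.default_rng(4)
gap_normal = max_gap(rng.normal(0, 1, 500))       # points hug the diagonal
gap_skewed = max_gap(rng.exponential(1.0, 500))   # points bend away from it
print(gap_normal, gap_skewed)
```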
A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?
A. Check whether a transformation or an added nonlinear term could improve the model
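A sketch of the "added nonlinear term" option, using simulated data whose true relationship includes a squared term (an assumption of the simulation). The residuals from the straight-line fit track x², and adding x² as a predictor reduces the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 300)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, 300)

def fit(X, y):
    """Return OLS coefficients and residuals for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

ones = np.ones_like(x)
_, resid_linear = fit(np.column_stack([ones, x]), y)
_, resid_quad = fit(np.column_stack([ones, x, x**2]), y)

# Correlation between straight-line residuals and x^2: the U-shape in numbers.
curve_signal = np.corrcoef(resid_linear, x**2)[0, 1]
ssr_linear = float(resid_linear @ resid_linear)
ssr_quad = float(resid_quad @ resid_quad)
print(curve_signal, ssr_linear, ssr_quad)
```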
In practice, why do we look at residuals in our one data set?
B. Because residuals can give clues about whether OLS assumptions seem reasonable in the sample we have
We often say that in real life we only have one sample. What does that mean for checking OLS?
B. We must use plots and diagnostics from this one sample to look for warning signs
A student says, “If the residuals in my sample look random around zero, that is encouraging because it suggests the estimates may be reasonably centered.” What is the best response?
A. Correct, that is a helpful sign, though not absolute proof
In the dartboard analogy, what does it mean for an estimator to be centered?
C. On average, the darts land around the bullseye
In the dartboard analogy, what does it mean for an estimator to be precise?
A. The darts land very close together
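The dartboard picture can be simulated, which is something real data never allows: draw many samples from a known model and record the OLS slope each time. The true slope of 2, the sample size, and the noise level below are all assumptions of the simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
true_slope = 2.0
n, n_samples = 50, 2000

slopes = np.empty(n_samples)
for i in range(n_samples):
    # Each pass is one "dart": a fresh sample and a fresh OLS fit.
    x = rng.uniform(0, 10, n)
    y = 1.0 + true_slope * x + rng.normal(0, 2, n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes[i] = beta[1]

center = slopes.mean()  # centered: where the darts land on average
spread = slopes.std()   # precise: how tightly the darts cluster
print(center, spread)
```

The average slope sits very close to the bullseye (the true slope), and the spread of the slopes is what a standard error tries to estimate from a single sample.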
If residuals are scattered randomly around zero with roughly constant spread, what is that usually telling a beginner student?
A. This is more consistent with OLS working reasonably well
Suppose the residuals show a strong curve. Why is this a problem when thinking about whether estimates are centered?
A. It suggests the model may be systematically missing part of the relationship
If residuals fan out more and more as fitted values increase, what does that mainly suggest about the estimates?
B. The precision of the estimates and the reliability of standard errors may be affected
Which statement best matches the role of residuals for a student using only one sample?
B. Residuals are clues that help us judge whether OLS may be behaving well or poorly
A beginner student asks: “If residuals look messy, what is a simple way to think about it?” Which answer is best?
A. The darts on our dartboard may not be landing in a clean, tight pattern, so we should be cautious about centeredness and precision
Which of the following best describes a fitted linear regression model?
B. A line or equation estimated from the sample that gives the predicted value of the outcome based on the predictors
A fitted regression model for exam score has a coefficient of 4.2 on hours studied. What is the best interpretation of the coefficient 4.2?
A. For each additional hour studied, the student’s exam score is predicted to increase by 4.2 points on average
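A tiny check of this interpretation: with a slope of 4.2, the predicted score rises by 4.2 points for every extra hour, wherever you start. The intercept of 50 is a hypothetical value chosen only so the function can run; the source does not show it.

```python
ASSUMED_INTERCEPT = 50.0  # hypothetical: the intercept is not given in the card
SLOPE_PER_HOUR = 4.2      # coefficient from the card above

def predicted_score(hours_studied):
    """Predicted exam score from the fitted line."""
    return ASSUMED_INTERCEPT + SLOPE_PER_HOUR * hours_studied

gain_2_to_3 = predicted_score(3) - predicted_score(2)
gain_7_to_8 = predicted_score(8) - predicted_score(7)
print(gain_2_to_3, gain_7_to_8)
```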
A fitted regression model for house price has a coefficient of 18,500 on number of bedrooms. What is the best interpretation of the coefficient 18,500?
B. Holding the model form fixed, a one-bedroom increase is associated with a predicted increase of $18,500 in house price on average
A fitted regression model for yearly salary has a coefficient of 2,100 on years of experience. What is the best interpretation of the coefficient 2,100?
A. For each additional year of experience, predicted salary increases by $2,100 on average
A fitted regression model for electricity usage, with the monthly bill as the outcome, has a coefficient of 0.16 on kilowatt-hours used. What is the best interpretation of the coefficient 0.16?
B. For each additional kilowatt-hour used, the predicted monthly bill increases by $0.16 on average
In a linear regression model, what is OLS mainly used for?
B. To estimate the coefficients, including the intercept and slopes
In the regression model $y = \beta_0 + \beta_1 x + \varepsilon$, what does OLS try to estimate from the sample?
C. The unknown population coefficients, such as $\beta_0$ and $\beta_1$
Why do we call OLS coefficients “estimates” rather than the true parameters?
C. Because they are sample-based best guesses of population parameters that we do not directly know
Even after fitting a regression model with OLS, why do we still talk about uncertainty?
C. Because the estimated coefficients come from one sample and may differ from the true population values
Which statement best describes the role of the standard error (SE) in OLS?
C. It tells us how much uncertainty there is around our estimated coefficient
What does using robust standard errors such as HC3 mainly do in an OLS regression?
C. It keeps the OLS coefficients the same but adjusts the standard errors to be more reliable when error variance is not constant
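The sketch below computes classical and HC3 standard errors by hand with NumPy on simulated data whose error spread grows with x (an assumption of the simulation). The coefficients are computed once; only the standard errors differ between the two approaches.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * (0.2 + 0.1 * x**2)  # spread grows with x
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y  # OLS coefficients, unaffected by the choice of SE
resid = y - X @ beta

# Classical SE: assumes one common error variance for every observation.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3: weight each squared residual by 1 / (1 - leverage)^2.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
weights = resid**2 / (1.0 - leverage) ** 2
meat = X.T @ (X * weights[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(beta, se_classical, se_hc3)
```

In practice a statistics library would normally be used for this; the hand computation is only meant to show that the adjustment touches the standard errors and nothing else.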
Taking the log of the response variable is often a simple fix when which OLS issue appears in the residuals vs fitted plot?
C. Heteroscedasticity, where the spread of residuals changes as fitted values increase
What is the main purpose of categorical encoding in a data set?
A. To remove missing values B. To convert categories into a numerical form that a model can use C. To increase sample size D. To standardize continuous variables
Which of the following is an example of a nominal variable?
A. Education level: high school, college, graduate B. Satisfaction rating: low, medium, high C. Color: red, blue, green D. Class rank: first, second, third
One-hot encoding is especially appropriate for:
A. Variables with a natural order B. Continuous variables C. Categories with no natural ordering D. Variables with extreme outliers
A possible drawback of one-hot encoding is that it can:
A. Reduce all correlations to zero B. Create many new columns when a variable has many categories C. Remove all missing values D. Guarantee causal interpretation
Why can ordinal encoding be risky in some models?
A. It always causes multicollinearity B. It may incorrectly suggest that distances between categories are meaningful C. It removes too much information D. It can only be used for binary variables
A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?
A. The model will run too slowly B. The encoding may incorrectly impose an artificial order on the categories C. The variable will automatically become missing D. The regression will always fail
A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?
A. When the categories have a meaningful order B. When the categories are colors C. When the target variable is binary D. When there are missing values
Multicollinearity refers to a situation where:
A. The target variable has missing values B. Two or more predictors are strongly linearly related C. Residuals are not normally distributed D. The sample size is too large
Which of the following is a practical consequence of strong multicollinearity?
A. Coefficients become impossible to estimate at all B. The model automatically becomes biased C. Standard errors of coefficients can become inflated D. The residuals become exactly zero
If two predictors are highly correlated, that means:
A. One of them must always be removed B. They are definitely redundant for the model C. They may create redundancy, but not always D. The model is invalid
Which diagnostic is commonly used to assess multicollinearity?
A. RMSE B. VIF C. Confusion matrix D. Silhouette score
Why is the assumption of no perfect multicollinearity important for OLS?
A. Otherwise the model cannot uniquely estimate the separate effect of some predictors B. Otherwise the sample mean becomes biased C. Otherwise missing values disappear D. Otherwise the intercept becomes zero
A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?
A. Heteroscedasticity B. Missing completely at random C. Strong multicollinearity D. Underfitting
In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?
A. One must always be deleted immediately B. The predictors are highly related, so coefficient uncertainty may be large C. The model has no useful information D. The target variable should be removed
In ordinary least squares (OLS), the estimated coefficients are chosen to:
A. Maximize the number of predictors B. Minimize the sum of squared residuals C. Minimize the number of observations D. Maximize the variance of the residuals
In a regression model, the standard error of a coefficient measures:
A. The size of the coefficient itself B. The average value of the predictor C. The uncertainty in the estimated coefficient D. The correlation between predictors
If a coefficient has a large standard error, this usually means:
A. The estimate is more precise B. The estimate is less precise C. The predictor has no units D. The model is necessarily biased
In the phrase “OLS is BLUE,” the word “Best” means:
A. Largest coefficient values B. Smallest prediction errors on every future data set C. Minimum variance among linear unbiased estimators D. Perfect causal interpretation
If OLS is BLUE, what does it guarantee?
A. The estimates are unbiased and have the smallest variance among linear unbiased estimators B. The model predictions are always exactly correct C. The predictors are all statistically significant D. The residuals are zero
Which of the following is NOT one of the core assumptions associated with OLS being BLUE?
A. Linearity in parameters B. Random sampling C. No perfect multicollinearity D. Residuals must be perfectly normal
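If the OLS assumptions for BLUE are violated, a practical consequence can be:
A. Standard errors and inference may become unreliable B. The model becomes a classification model C. All predictors become significant D. The data automatically become normalized
A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?
A. The model estimates are very precise B. The model has perfect fit C. There is high uncertainty around the coefficient estimates D. The predictors must be categorical
Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?
A. OLS estimates are wrong on average B. OLS estimates are still centered correctly, but they are less efficient than they could be C. OLS can no longer be computed D. The sample size must be zero
A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?
A. Yes, BLUE means perfect prediction B. Yes, BLUE means no residuals C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction D. No, BLUE only applies to logistic regression
Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?
A. MAR B. MNAR C. MCAR D. NMAR-free
Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?
A. MCAR B. MAR C. MNAR D. Completely deterministic missingness
Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?
A. MCAR B. MAR C. MNAR D. Random omission
A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?
A. MCAR B. MAR or possibly MNAR depending on the mechanism C. There is no missingness problem D. One-hot missingness
A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?
A. The file size becomes too large B. The model becomes nonlinear C. The remaining data may become less representative and possibly biased D. Multicollinearity is guaranteed
In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:
A. A clear curved pattern B. A random cloud around zero with no obvious pattern C. A strong upward line D. Two separate groups only
If a residuals vs fitted plot shows a curved shape, this often suggests:
A. The model may be missing a nonlinear relationship B. The residuals are perfectly normal C. The predictors are all independent D. The model is guaranteed to be BLUE
A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?
A. The model fits perfectly B. The linear model may not capture the true relationship well C. There is definitely no multicollinearity D. The target variable has no variance
If the residuals vs fitted plot fans out as fitted values increase, this suggests:
A. Perfect linearity B. Heteroscedasticity, meaning the spread of residuals changes C. No outliers D. The predictors are categorical
A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:
A. Ignore it because it is always harmless B. Try transforming the response, such as using a log of the target when appropriate C. Delete all predictors D. Replace regression with clustering
If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:
A. Possible outliers or influential observations B. Perfect normality C. Whether the model has too many rows D. Whether one-hot encoding is required
In a P–P plot, the points should ideally:
A. Scatter randomly with no structure at all B. Fall roughly along a straight diagonal line C. Form a U-shape D. Form horizontal stripes
If the points in a P–P plot deviate strongly from the diagonal line, this suggests:
A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution) B. The predictors are highly correlated C. The intercept is missing D. The target variable is categorical
Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?
A. The model must be discarded immediately B. Small deviations may not be a major practical problem, especially in larger samples C. The coefficients are automatically biased D. OLS cannot be computed anymore
A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?
A. Check whether a transformation or an added nonlinear term could improve the model B. Declare the data useless C. Remove half the data randomly D. Conclude the model is perfect anyway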