Data and Society

Last updated 2:48 PM on 5/1/26

122 Terms

New cards

What is the main purpose of categorical encoding in a data set?

B. To convert categories into a numerical form that a model can use

New cards

Which of the following is an example of a nominal variable?

C. Color: red, blue, green

New cards

One-hot encoding is especially appropriate for:

C. Categories with no natural ordering

New cards

A possible drawback of one-hot encoding is that it can:

B. Create many new columns when a variable has many categories

New cards

Why can ordinal encoding be risky in some models?

B. It may incorrectly suggest that distances between categories are meaningful

New cards

A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?

B. The encoding may incorrectly impose an artificial order on the categories
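
A minimal sketch of the difference between integer coding and one-hot coding for shirt_color, using only the Python standard library. The category names come from the card; the five example rows and the column naming pattern are made up for illustration.

```python
# Hypothetical rows: the categories come from the card, the data are made up.
shirt_color = ["red", "blue", "green", "black", "blue"]

# Integer coding: implies black (4) is "more" than red (1), which colors do not support.
integer_code = {"red": 1, "blue": 2, "green": 3, "black": 4}
as_integers = [integer_code[c] for c in shirt_color]

# One-hot coding: one 0/1 column per category, with no implied order or spacing.
categories = sorted(set(shirt_color))
one_hot = [
    {f"shirt_color_{cat}": int(c == cat) for cat in categories}
    for c in shirt_color
]

print(as_integers)   # [1, 2, 3, 4, 2]
print(one_hot[0])    # the row for "red" has a 1 only in shirt_color_red
```

In a regression that includes an intercept, one of the one-hot columns is usually dropped, because the full set of columns always sums to 1 and would be perfectly collinear with the intercept.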

New cards

A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?

A. When the categories have a meaningful order

New cards

Multicollinearity refers to a situation where:

B. Two or more predictors are strongly linearly related

New cards

Which of the following is a practical consequence of strong multicollinearity?

C. Standard errors of coefficients can become inflated

New cards

If two predictors are highly correlated, that means:

C. They may create redundancy, but not always

New cards

Which diagnostic is commonly used to assess multicollinearity?

B. VIF

New cards

Why is the assumption of no perfect multicollinearity important for OLS?

A. Otherwise the model cannot uniquely estimate the separate effect of some predictors
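
One way to see why perfect multicollinearity blocks unique estimation is sketched below. It assumes a hypothetical pair of predictors where one is an exact multiple of the other; the variable names and coefficient values are illustrative only.

```python
# Hypothetical rows: months_worked is an exact multiple of years_worked.
years_worked = [1, 2, 3, 4, 5]
months_worked = [12 * y for y in years_worked]

def predict(coef_years, coef_months):
    """Contribution of the two tenure predictors to the predicted outcome."""
    return [coef_years * y + coef_months * m
            for y, m in zip(years_worked, months_worked)]

# Three different coefficient pairs give identical predictions for every row,
# so no amount of data of this kind can single out one pair.
fit_a = predict(1200, 0)
fit_b = predict(0, 100)
fit_c = predict(600, 50)

print(fit_a == fit_b == fit_c)  # True
```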

New cards

A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?

C. Strong multicollinearity
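
As a rough illustration, the VIF for this situation can be computed by hand. With exactly two predictors, the VIF of each equals 1 / (1 - r²), where r is their correlation. The data below are invented, with months_worked deliberately not an exact multiple of years_worked so that the VIF stays finite.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical employees: months are close to, but not exactly, 12 x years.
years_worked = [1, 2, 3, 4, 5, 6]
months_worked = [13, 25, 35, 49, 61, 70]

r = pearson_r(years_worked, months_worked)
vif = 1 / (1 - r ** 2)  # two-predictor shortcut for the variance inflation factor

print(f"correlation = {r:.4f}, VIF = {vif:.0f}")
```

A VIF this far above common rules of thumb (often quoted as 5 or 10) signals that the two columns carry nearly the same information.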

New cards

In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?

B. The predictors are highly related, so coefficient uncertainty may be large

New cards

In ordinary least squares (OLS), the estimated coefficients are chosen to:

B. Minimize the sum of squared residuals
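
The idea can be checked numerically in the one-predictor case, where the OLS intercept and slope have a closed form. The sketch below uses made-up data and shows that nudging either coefficient away from the OLS values increases the sum of squared residuals.

```python
def fit_ols(x, y):
    """Closed-form OLS intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def ssr(x, y, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_ols(x, y)
best = ssr(x, y, b0, b1)

# Any other line does worse on the squared-residual criterion.
print(best < ssr(x, y, b0, b1 + 0.1))
print(best < ssr(x, y, b0 + 0.1, b1))
```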

New cards

In a regression model, the standard error of a coefficient measures:

C. The uncertainty in the estimated coefficient

New cards

If a coefficient has a large standard error, this usually means:

B. The estimate is less precise
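
A small sketch of how precision shows up in the standard error, assuming the usual one-predictor formula SE(slope) = sqrt(s² / Sxx) with s² = SSR / (n - 2). Both data sets are invented; they share the same x values and roughly the same slope, but one is much noisier.

```python
from math import sqrt

def slope_and_se(x, y):
    """OLS slope and its classical standard error for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    ssr = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s2 = ssr / (n - 2)          # estimated error variance
    return slope, sqrt(s2 / sxx)

# Hypothetical data: same x, similar trend, different amounts of noise.
x = [1, 2, 3, 4, 5, 6]
y_tight = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]
y_noisy = [3.0, 2.5, 7.5, 6.5, 11.5, 11.0]

slope_tight, se_tight = slope_and_se(x, y_tight)
slope_noisy, se_noisy = slope_and_se(x, y_noisy)

print(f"tight: slope={slope_tight:.2f}, SE={se_tight:.3f}")
print(f"noisy: slope={slope_noisy:.2f}, SE={se_noisy:.3f}")
```

The noisier data set produces the larger standard error, which is what "less precise" means here.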

New cards

In the phrase “OLS is BLUE,” the word “Best” means:

C. Minimum variance among linear unbiased estimators

New cards

If OLS is BLUE, what does it guarantee?

A. The estimates are unbiased and have the smallest variance among linear unbiased estimators

New cards

Which of the following is NOT one of the core assumptions associated with OLS being BLUE?

D. Residuals must be perfectly normal

New cards

If the OLS assumptions for BLUE are violated, a practical consequence can be:

A. Standard errors and inference may become unreliable

New cards

A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?

C. There is high uncertainty around the coefficient estimates

New cards

Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?

B. OLS estimates are still centered correctly, but they are less efficient than they could be

New cards

A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?

C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction

New cards

Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?

C. MCAR

New cards

Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?

B. MAR

New cards

Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?

C. MNAR

New cards

A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?

B. MAR or possibly MNAR depending on the mechanism

New cards

A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?

C. The remaining data may become less representative and possibly biased
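
The risk can be illustrated with a small simulation. Everything below is assumed for the sake of the sketch: the income distribution, the reporting probabilities, and the rule that above-median earners report less often.

```python
import random

random.seed(0)

# Hypothetical incomes drawn from a right-skewed distribution.
incomes = [random.lognormvariate(10.8, 0.5) for _ in range(10_000)]
median_income = sorted(incomes)[len(incomes) // 2]

def is_reported(income):
    """Assumed mechanism: higher earners report income less often."""
    p_report = 0.5 if income > median_income else 0.9
    return random.random() < p_report

# Dropping rows with a missing value keeps only the reported incomes.
observed = [inc for inc in incomes if is_reported(inc)]

true_mean = sum(incomes) / len(incomes)
observed_mean = sum(observed) / len(observed)

print(f"true mean:      {true_mean:,.0f}")
print(f"after deletion: {observed_mean:,.0f}")
```

Because the deleted rows are disproportionately high incomes, the mean of the remaining rows understates the true mean.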

New cards

In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:

B. A random cloud around zero with no obvious pattern

New cards

If a residuals vs fitted plot shows a curved shape, this often suggests:

A. The model may be missing a nonlinear relationship

New cards

A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?

B. The linear model may not capture the true relationship well
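
A toy example of where a U-shape comes from, using invented data in which the outcome depends on x squared. Fitting a straight line in x leaves residuals that are positive at both ends and negative in the middle; using x squared as the predictor removes the pattern.

```python
def fit_line(x, y):
    """Closed-form OLS intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def residuals(x, y):
    b0, b1 = fit_line(x, y)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Hypothetical data with a purely quadratic relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [a ** 2 + 1 for a in x]

resid_linear = residuals(x, y)                      # straight line in x
resid_squared = residuals([a ** 2 for a in x], y)   # line in x squared

print(resid_linear)    # high at the ends, low in the middle: a U-shape
print(resid_squared)   # essentially zero everywhere
```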

New cards

If the residuals vs fitted plot fans out as fitted values increase, this suggests:

B. Heteroscedasticity, meaning the spread of residuals changes

New cards

A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:

B. Try transforming the response, such as using a log of the target when appropriate
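
The following simulation is one way to see why a log transform can help. It assumes a made-up process with multiplicative noise, so the spread of the outcome grows with its level. The helper spread_ratio is a hypothetical diagnostic, not a standard statistic: it compares residual spread in the upper half of fitted values to the lower half.

```python
import math
import random
from statistics import pstdev

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def spread_ratio(x, y):
    """Residual spread at high fitted values divided by spread at low ones."""
    b0, b1 = fit_line(x, y)
    pairs = sorted((b0 + b1 * a, b - (b0 + b1 * a)) for a, b in zip(x, y))
    half = len(pairs) // 2
    low = [res for _, res in pairs[:half]]
    high = [res for _, res in pairs[half:]]
    return pstdev(high) / pstdev(low)

rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(400)]
# Assumed process: noise multiplies the outcome, producing a fan shape.
y = [math.exp(1 + 0.3 * a + rng.gauss(0, 0.3)) for a in x]

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, [math.log(v) for v in y])

print(f"raw response: spread ratio = {ratio_raw:.2f}")
print(f"log response: spread ratio = {ratio_log:.2f}")
```

A ratio near 1 means roughly constant spread. The transform only helps when the noise really does scale with the level of the response, which is why the card says "when appropriate".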

New cards

If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:

A. Possible outliers or influential observations

New cards

In a P–P plot, the points should ideally:

B. Fall roughly along a straight diagonal line

New cards

If the points in a P–P plot deviate strongly from the diagonal line, this suggests:

A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution)
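
A sketch of what a normal P–P plot compares, without drawing it: for each sorted residual, the empirical cumulative probability is paired with the cumulative probability under a fitted normal distribution. The plotting position (i + 0.5) / n is one common convention, and both residual samples below are simulated for illustration.

```python
import random
from statistics import NormalDist, mean, pstdev

def pp_points(residuals):
    """(empirical, theoretical) cumulative probabilities for a normal P-P plot."""
    n = len(residuals)
    normal = NormalDist(mean(residuals), pstdev(residuals))
    ordered = sorted(residuals)
    return [((i + 0.5) / n, normal.cdf(r)) for i, r in enumerate(ordered)]

def max_gap(points):
    """Largest vertical distance of any point from the diagonal."""
    return max(abs(emp - theo) for emp, theo in points)

rng = random.Random(0)
bell_shaped = [rng.gauss(0, 1) for _ in range(500)]          # roughly normal
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(500)]    # right-skewed

gap_bell = max_gap(pp_points(bell_shaped))
gap_skewed = max_gap(pp_points(skewed))

print(f"bell-shaped residuals: max gap from diagonal = {gap_bell:.3f}")
print(f"skewed residuals:      max gap from diagonal = {gap_skewed:.3f}")
```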

New cards

Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?

B. Small deviations may not be a major practical problem, especially in larger samples

New cards

A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?

A. Check whether a transformation or an added nonlinear term could improve the model

New cards

In practice, why do we look at residuals in our one data set?

B. Because residuals can give clues about whether OLS assumptions seem reasonable in the sample we have

New cards

We often say that in real life we only have one sample. What does that mean for checking OLS?

B. We must use plots and diagnostics from this one sample to look for warning signs

New cards

A student says, “If the residuals in my sample look random around zero, that is encouraging because it suggests the estimates may be reasonably centered.” What is the best response?

A. Correct, that is a helpful sign, though not absolute proof

New cards

In the dartboard analogy, what does it mean for an estimator to be centered?

C. On average, the darts land around the bullseye

New cards

In the dartboard analogy, what does it mean for an estimator to be precise?

A. The darts land very close together
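
The dartboard picture can be simulated by drawing many samples from a known population and fitting OLS to each one. Each fitted slope is one dart. The population values, noise level, and sample sizes below are assumptions chosen only for the sketch.

```python
import random
from statistics import mean, stdev

TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0  # assumed "bullseye" values

def ols_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def throw_darts(sample_size, throws=1000, seed=0):
    """Fit OLS on many independent samples and collect the slope estimates."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(throws):
        x = [rng.uniform(0, 10) for _ in range(sample_size)]
        y = [TRUE_INTERCEPT + TRUE_SLOPE * a + rng.gauss(0, 3) for a in x]
        slopes.append(ols_slope(x, y))
    return slopes

small_samples = throw_darts(sample_size=20)
large_samples = throw_darts(sample_size=200)

# Centered: the average dart is near the bullseye for both sample sizes.
print(f"n=20:  mean slope = {mean(small_samples):.3f}, spread = {stdev(small_samples):.3f}")
# Precise: the darts cluster more tightly when each sample is larger.
print(f"n=200: mean slope = {mean(large_samples):.3f}, spread = {stdev(large_samples):.3f}")
```

In real work only one dart is thrown, which is why the later cards turn to residual plots as indirect evidence.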

New cards

If residuals are scattered randomly around zero with roughly constant spread, what is that usually telling a beginner student?

A. This is more consistent with OLS working reasonably well

New cards

Suppose the residuals show a strong curve. Why is this a problem when thinking about whether estimates are centered?

A. It suggests the model may be systematically missing part of the relationship

New cards

If residuals fan out more and more as fitted values increase, what does that mainly suggest about the estimates?

B. The precision of the estimates and the reliability of standard errors may be affected

New cards

Which statement best matches the role of residuals for a student using only one sample?

B. Residuals are clues that help us judge whether OLS may be behaving well or poorly

New cards

A beginner student asks: “If residuals look messy, what is a simple way to think about it?” Which answer is best?

A. The darts on our dartboard may not be landing in a clean, tight pattern, so we should be cautious about centeredness and precision

New cards

Which of the following best describes a fitted linear regression model?

B. A line or equation estimated from the sample that gives the predicted value of the outcome based on the predictors

New cards

A fitted regression model for exam score has a coefficient of 4.2 on hours studied. What is the best interpretation of the coefficient 4.2?

A. For each additional hour studied, the student’s exam score is predicted to increase by 4.2 points on average
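
The interpretation can be checked with simple arithmetic. The slope of 4.2 comes from the card; the intercept of 50.0 is a made-up placeholder, and the result does not depend on it.

```python
# The slope 4.2 comes from the card; the intercept is a hypothetical placeholder.
ASSUMED_INTERCEPT = 50.0
SLOPE_PER_HOUR = 4.2

def predicted_score(hours_studied):
    return ASSUMED_INTERCEPT + SLOPE_PER_HOUR * hours_studied

# One extra hour changes the prediction by exactly the slope,
# whatever the starting number of hours or the intercept.
difference = predicted_score(3) - predicted_score(2)
print(round(difference, 2))  # 4.2
```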

New cards

A fitted regression model for house price has a coefficient of 18,500 on number of bedrooms. What is the best interpretation of the coefficient 18,500?

B. Holding the model form fixed, a one-bedroom increase is associated with a predicted increase of $18,500 in house price on average

New cards

A fitted regression model for yearly salary has a coefficient of 2,100 on years of experience. What is the best interpretation of the coefficient 2,100?

A. For each additional year of experience, predicted salary increases by $2,100 on average

New cards

A fitted regression model for electricity usage predicts the monthly bill from kilowatt-hours used, with a coefficient of 0.16. What is the best interpretation of the coefficient 0.16?

B. For each additional kilowatt-hour used, the predicted monthly bill increases by $0.16 on average

New cards

In a linear regression model, what is OLS mainly used for?

B. To estimate the coefficients, including the intercept and slopes

New cards

In the regression model y = β0 + β1x + ε, what does OLS try to estimate from the sample?

C. The unknown population coefficients, such as β0 and β1

New cards

Why do we call OLS coefficients “estimates” rather than the true parameters?

C. Because they are sample-based best guesses of population parameters that we do not directly know

New cards

Even after fitting a regression model with OLS, why do we still talk about uncertainty?

C. Because the estimated coefficients come from one sample and may differ from the true population values

New cards

Which statement best describes the role of the standard error (SE) in OLS?

C. It tells us how much uncertainty there is around our estimated coefficient

New cards

What does using robust standard errors such as HC3 mainly do in an OLS regression?

C. It keeps the OLS coefficients the same but adjusts the standard errors to be more reliable when error variance is not constant
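
A hand-rolled sketch of the idea for the one-predictor case, assuming the standard HC3 form in which each squared residual is divided by (1 - leverage)². The data are simulated with noise that grows with x. In practice a statistics library would normally be used instead of this manual calculation.

```python
import random
from math import sqrt

def ols_with_ses(x, y):
    """Simple-regression slope with classical and HC3 standard errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]

    # Classical SE: assumes one common error variance for all observations.
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    classical_se = sqrt(s2 / sxx)

    # HC3 SE: each observation contributes its own squared residual,
    # inflated by 1 / (1 - leverage)^2.
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    meat = sum((a - mx) ** 2 * e ** 2 / (1 - h) ** 2
               for a, e, h in zip(x, resid, leverage))
    hc3_se = sqrt(meat) / sxx
    return slope, classical_se, hc3_se

rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(300)]
# Assumed data: the noise standard deviation grows with x (a fan shape).
y = [1 + 2 * a + rng.gauss(0, 0.5 * a) for a in x]

slope, classical_se, hc3_se = ols_with_ses(x, y)
print(f"slope = {slope:.3f} (the same under either standard error)")
print(f"classical SE = {classical_se:.4f}, HC3 SE = {hc3_se:.4f}")
```

The slope is computed once and never changes; only the uncertainty attached to it differs between the two formulas.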

New cards

Taking the log of the response variable is often a simple fix when which OLS issue appears in the residuals vs fitted plot?

C. Heteroscedasticity, where the spread of residuals changes as fitted values increase

New cards

What is the main purpose of categorical encoding in a data set?

A. To remove missing values
B. To convert categories into a numerical form that a model can use
C. To increase sample size
D. To standardize continuous variables

New cards

Which of the following is an example of a nominal variable?

A. Education level: high school, college, graduate
B. Satisfaction rating: low, medium, high
C. Color: red, blue, green
D. Class rank: first, second, third

New cards

One-hot encoding is especially appropriate for:

A. Variables with a natural order
B. Continuous variables
C. Categories with no natural ordering
D. Variables with extreme outliers

New cards

A possible drawback of one-hot encoding is that it can:

A. Reduce all correlations to zero
B. Create many new columns when a variable has many categories
C. Remove all missing values
D. Guarantee causal interpretation

New cards

Why can ordinal encoding be risky in some models?

A. It always causes multicollinearity
B. It may incorrectly suggest that distances between categories are meaningful
C. It removes too much information
D. It can only be used for binary variables

New cards

A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?

A. The model will run too slowly
B. The encoding may incorrectly impose an artificial order on the categories
C. The variable will automatically become missing
D. The regression will always fail

New cards

A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?

A. When the categories have a meaningful order
B. When the categories are colors
C. When the target variable is binary
D. When there are missing values

New cards

Multicollinearity refers to a situation where:

A. The target variable has missing values
B. Two or more predictors are strongly linearly related
C. Residuals are not normally distributed
D. The sample size is too large

New cards

Which of the following is a practical consequence of strong multicollinearity?

A. Coefficients become impossible to estimate at all
B. The model automatically becomes biased
C. Standard errors of coefficients can become inflated
D. The residuals become exactly zero

New cards

If two predictors are highly correlated, that means:

A. One of them must always be removed
B. They are definitely redundant for the model
C. They may create redundancy, but not always
D. The model is invalid

New cards

Which diagnostic is commonly used to assess multicollinearity?

A. RMSE
B. VIF
C. Confusion matrix
D. Silhouette score

New cards

Why is the assumption of no perfect multicollinearity important for OLS?

A. Otherwise the model cannot uniquely estimate the separate effect of some predictors
B. Otherwise the sample mean becomes biased
C. Otherwise missing values disappear
D. Otherwise the intercept becomes zero

New cards

A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?

A. Heteroscedasticity
B. Missing completely at random
C. Strong multicollinearity
D. Underfitting

New cards

In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?

A. One must always be deleted immediately
B. The predictors are highly related, so coefficient uncertainty may be large
C. The model has no useful information
D. The target variable should be removed

New cards

In ordinary least squares (OLS), the estimated coefficients are chosen to:

A. Maximize the number of predictors
B. Minimize the sum of squared residuals
C. Minimize the number of observations
D. Maximize the variance of the residuals

New cards

In a regression model, the standard error of a coefficient measures:

A. The size of the coefficient itself
B. The average value of the predictor
C. The uncertainty in the estimated coefficient
D. The correlation between predictors

New cards

If a coefficient has a large standard error, this usually means:

A. The estimate is more precise
B. The estimate is less precise
C. The predictor has no units
D. The model is necessarily biased

New cards

In the phrase “OLS is BLUE,” the word “Best” means:

A. Largest coefficient values
B. Smallest prediction errors on every future data set
C. Minimum variance among linear unbiased estimators
D. Perfect causal interpretation

New cards

If OLS is BLUE, what does it guarantee?

A. The estimates are unbiased and have the smallest variance among linear unbiased estimators
B. The model predictions are always exactly correct
C. The predictors are all statistically significant
D. The residuals are zero

New cards

Which of the following is NOT one of the core assumptions associated with OLS being BLUE?

A. Linearity in parameters
B. Random sampling
C. No perfect multicollinearity
D. Residuals must be perfectly normal

New cards

If the OLS assumptions for BLUE are violated, a practical consequence can be:

A. Standard errors and inference may become unreliable
B. The model becomes a classification model
C. All predictors become significant
D. The data automatically become normalized

New cards

A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?

A. The model estimates are very precise
B. The model has perfect fit
C. There is high uncertainty around the coefficient estimates
D. The predictors must be categorical

New cards

Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?

A. OLS estimates are wrong on average
B. OLS estimates are still centered correctly, but they are less efficient than they could be
C. OLS can no longer be computed
D. The sample size must be zero

New cards

A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?

A. Yes, BLUE means perfect prediction
B. Yes, BLUE means no residuals
C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction
D. No, BLUE only applies to logistic regression

New cards

Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?

A. MAR
B. MNAR
C. MCAR
D. NMAR-free

New cards

Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?

A. MCAR
B. MAR
C. MNAR
D. Completely deterministic missingness

New cards

Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?

A. MCAR
B. MAR
C. MNAR
D. Random omission

New cards

A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?

A. MCAR
B. MAR or possibly MNAR depending on the mechanism
C. There is no missingness problem
D. One-hot missingness

New cards

A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?

A. The file size becomes too large
B. The model becomes nonlinear
C. The remaining data may become less representative and possibly biased
D. Multicollinearity is guaranteed

New cards

In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:

A. A clear curved pattern
B. A random cloud around zero with no obvious pattern
C. A strong upward line
D. Two separate groups only

New cards

If a residuals vs fitted plot shows a curved shape, this often suggests:

A. The model may be missing a nonlinear relationship
B. The residuals are perfectly normal
C. The predictors are all independent
D. The model is guaranteed to be BLUE

New cards

A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?

A. The model fits perfectly
B. The linear model may not capture the true relationship well
C. There is definitely no multicollinearity
D. The target variable has no variance

New cards

If the residuals vs fitted plot fans out as fitted values increase, this suggests:

A. Perfect linearity
B. Heteroscedasticity, meaning the spread of residuals changes
C. No outliers
D. The predictors are categorical

New cards

A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:

A. Ignore it because it is always harmless
B. Try transforming the response, such as using a log of the target when appropriate
C. Delete all predictors
D. Replace regression with clustering

New cards

If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:

A. Possible outliers or influential observations
B. Perfect normality
C. Whether the model has too many rows
D. Whether one-hot encoding is required

New cards

In a P–P plot, the points should ideally:

A. Scatter randomly with no structure at all
B. Fall roughly along a straight diagonal line
C. Form a U-shape
D. Form horizontal stripes

New cards

If the points in a P–P plot deviate strongly from the diagonal line, this suggests:

A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution)
B. The predictors are highly correlated
C. The intercept is missing
D. The target variable is categorical

New cards

Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?

A. The model must be discarded immediately
B. Small deviations may not be a major practical problem, especially in larger samples
C. The coefficients are automatically biased
D. OLS cannot be computed anymore

New cards

A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?

A. Check whether a transformation or an added nonlinear term could improve the model
B. Declare the data useless
C. Remove half the data randomly
D. Conclude the model is perfect anyway