Data and Society

Last updated 2:48 PM on 5/1/26

122 Terms

1

What is the main purpose of categorical encoding in a data set?

B. To convert categories into a numerical form that a model can use

2

Which of the following is an example of a nominal variable?

C. Color: red, blue, green

3

One-hot encoding is especially appropriate for:

C. Categories with no natural ordering

4

A possible drawback of one-hot encoding is that it can:

B. Create many new columns when a variable has many categories
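
To make cards 3 and 4 concrete, here is a minimal one-hot encoding sketch in plain Python. The `one_hot` helper and the `shirt_color` values are assumptions made up for this illustration; a real project would more likely rely on a library routine.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    categories = sorted(set(values))  # one output column per distinct category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal variable with four unordered categories
shirt_color = ["red", "blue", "green", "black", "blue"]
categories, encoded = one_hot(shirt_color)

print(categories)
for color, row in zip(shirt_color, encoded):
    print(color, row)
```

Four categories produce four columns here, so a variable with hundreds of categories would add hundreds of columns, which is the drawback the card describes.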

5

Why can ordinal encoding be risky in some models?

B. It may incorrectly suggest that distances between categories are meaningful

6

A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?

B. The encoding may incorrectly impose an artificial order on the categories

7

A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?

A. When the categories have a meaningful order

8

Multicollinearity refers to a situation where:

B. Two or more predictors are strongly linearly related

9

Which of the following is a practical consequence of strong multicollinearity?

C. Standard errors of coefficients can become inflated

10

If two predictors are highly correlated, that means:

C. They may create redundancy, but not always

11

Which diagnostic is commonly used to assess multicollinearity?

B. VIF

12

Why is the assumption of no perfect multicollinearity important for OLS?

A. Otherwise the model cannot uniquely estimate the separate effect of some predictors

13

A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?

C. Strong multicollinearity

14

In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?

B. The predictors are highly related, so coefficient uncertainty may be large
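
The VIF idea in cards 11 to 14 can be sketched for the simplest case. With only two predictors, the R² from regressing one on the other equals their squared Pearson correlation, so VIF = 1 / (1 - r²). The data below are hypothetical, and `pearson_r` and `vif_two_predictors` are illustrative helper names, not a library API.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: months_worked is roughly 12 * years_worked
years_worked = [1, 2, 3, 4, 5, 6]
months_worked = [13, 23, 37, 47, 61, 72]
commute_km = [3, 1, 4, 1, 5, 2]  # made-up predictor, weakly related to years

vif_months = vif_two_predictors(years_worked, months_worked)
vif_commute = vif_two_predictors(years_worked, commute_km)
print("VIF with months_worked:", round(vif_months, 1))
print("VIF with commute_km:", round(vif_commute, 2))
```

A VIF near 1 indicates little overlap between predictors. Values above roughly 5 or 10 are a common rule-of-thumb warning sign, though the exact cutoff is a convention rather than a strict rule.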

15

In ordinary least squares (OLS), the estimated coefficients are chosen to:

B. Minimize the sum of squared residuals
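
A small sketch of what "minimize the sum of squared residuals" means for a one-predictor model, using the standard closed-form solution. The `hours` and `score` values are invented for illustration, and `fit_ols` and `ssr` are hypothetical helper names.

```python
def fit_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return intercept, slope


def ssr(x, y, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Hypothetical data
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 67, 70]

b0, b1 = fit_ols(hours, score)
best = ssr(hours, score, b0, b1)

# Nudge the coefficients in every direction: each nudged line fits worse
nudged = [
    ssr(hours, score, b0 + d0, b1 + d1)
    for d0 in (-1, 0, 1)
    for d1 in (-0.5, 0, 0.5)
    if (d0, d1) != (0, 0)
]
print("OLS line:", round(b0, 2), round(b1, 2), "SSR:", round(best, 2))
print("Smallest SSR among nudged lines:", round(min(nudged), 2))
```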

16

In a regression model, the standard error of a coefficient measures:

C. The uncertainty in the estimated coefficient

17

If a coefficient has a large standard error, this usually means:

B. The estimate is less precise

18

In the phrase “OLS is BLUE,” the word “Best” means:

C. Minimum variance among linear unbiased estimators

19

If OLS is BLUE, what does it guarantee?

A. The estimates are unbiased and have the smallest variance among linear unbiased estimators

20

Which of the following is NOT one of the core assumptions associated with OLS being BLUE?

D. Residuals must be perfectly normal

21

If the OLS assumptions for BLUE are violated, a practical consequence can be:

A. Standard errors and inference may become unreliable

22

A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?

C. There is high uncertainty around the coefficient estimates

23

Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?

B. OLS estimates are still centered correctly, but they are less efficient than they could be

24

A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?

C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction

25

Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?

C. MCAR

26

Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?

B. MAR

27

Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?

C. MNAR

28

A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?

B. MAR or possibly MNAR depending on the mechanism

29

A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?

C. The remaining data may become less representative and possibly biased
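
The risk in cards 28 and 29 can be shown with a deliberately exaggerated example. The incomes are invented, and the hard cutoff (everyone above 150 fails to report) is an assumption that caricatures "higher-income patients were less likely to report".

```python
# Hypothetical true incomes, in thousands
incomes = [30, 40, 50, 60, 70, 80, 90, 200, 250, 300]

# Assumed missingness rule: the highest incomes go unreported (MNAR-style)
reported = [v if v < 150 else None for v in incomes]

# Listwise deletion: keep only rows with a reported value
complete_cases = [v for v in reported if v is not None]

true_mean = sum(incomes) / len(incomes)
complete_case_mean = sum(complete_cases) / len(complete_cases)

print("True mean income:", true_mean)
print("Mean after dropping missing rows:", complete_case_mean)
```

Because missingness depends on income itself, dropping incomplete rows leaves a sample that understates the average, which is the bias the card warns about.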

30

In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:

B. A random cloud around zero with no obvious pattern

31

If a residuals vs fitted plot shows a curved shape, this often suggests:

A. The model may be missing a nonlinear relationship

32

A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?

B. The linear model may not capture the true relationship well
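
To see where the U-shape in cards 31 and 32 comes from, the sketch below fits a straight line to data that are actually quadratic. The data and the `fit_ols` helper are assumptions for illustration only.

```python
def fit_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Hypothetical data where the true relationship is y = x^2
x = [0, 1, 2, 3, 4, 5, 6]
y = [v ** 2 for v in x]

b0, b1 = fit_ols(x, y)
fitted = [b0 + b1 * v for v in x]
residuals = [round(obs - fit, 6) for obs, fit in zip(y, fitted)]

# Residuals listed in order of increasing fitted value
for fit, res in zip(fitted, residuals):
    print(round(fit, 2), res)
```

The residuals are positive at both ends and negative in the middle, which is the curved pattern a residuals vs fitted plot would show. Adding a squared term is one option worth checking in that situation.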

33

If the residuals vs fitted plot fans out as fitted values increase, this suggests:

B. Heteroscedasticity, meaning the spread of residuals changes

34

A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:

B. Try transforming the response, such as using a log of the target when appropriate
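
A rough sketch of why a log transform can help with a fan shape. The data are simulated under an assumption: the response grows exponentially and its noise is multiplicative (each value is pushed up or down by a fixed percentage). That is the setting where a log tends to work well; it is not guaranteed to help with other data. All names and numbers are made up.

```python
import math


def fit_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def residuals(x, y):
    b0, b1 = fit_ols(x, y)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]


def spread(values):
    return max(values) - min(values)


x = list(range(1, 13))
# Deterministic stand-in for multiplicative noise: alternately +30% and -23%
factor = [1.3 if i % 2 == 0 else 1 / 1.3 for i in range(len(x))]
y = [math.exp(3 + 0.1 * a) * f for a, f in zip(x, factor)]

raw = residuals(x, y)
logged = residuals(x, [math.log(v) for v in y])

# Compare residual spread for the larger fitted values vs the smaller ones
half = len(x) // 2
raw_ratio = spread(raw[half:]) / spread(raw[:half])
log_ratio = spread(logged[half:]) / spread(logged[:half])
print("Spread ratio, raw response:", round(raw_ratio, 2))
print("Spread ratio, log response:", round(log_ratio, 2))
```

A ratio well above 1 means the residuals fan out as fitted values increase; a ratio near 1 means the spread is roughly constant.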

35

If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:

A. Possible outliers or influential observations

36

In a P–P plot, the points should ideally:

B. Fall roughly along a straight diagonal line

37

If the points in a P–P plot deviate strongly from the diagonal line, this suggests:

A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution)

38

Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?

B. Small deviations may not be a major practical problem, especially in larger samples

39

A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?

A. Check whether a transformation or an added nonlinear term could improve the model

40

In practice, why do we look at residuals in our one data set?

B. Because residuals can give clues about whether OLS assumptions seem reasonable in the sample we have

41

We often say that in real life we only have one sample. What does that mean for checking OLS?

B. We must use plots and diagnostics from this one sample to look for warning signs

42

A student says, “If the residuals in my sample look random around zero, that is encouraging because it suggests the estimates may be reasonably centered.” What is the best response?

A. Correct, that is a helpful sign, though not absolute proof

43

In the dartboard analogy, what does it mean for an estimator to be centered?

C. On average, the darts land around the bullseye

44

In the dartboard analogy, what does it mean for an estimator to be precise?

A. The darts land very close together
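
The dartboard picture in cards 43 and 44 can be simulated: each "dart" is the OLS slope from one fresh sample. The true slope of 3.0, the noise levels, and the sample design are all assumptions chosen for illustration. In real work only one sample (one dart) is available.

```python
import random
import statistics

TRUE_SLOPE = 3.0  # hypothetical bullseye


def fit_slope(x, y):
    """OLS slope for a one-predictor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(noise_sd, n_samples=500, seed=0):
    """Draw many samples from the same population and fit a slope to each."""
    rng = random.Random(seed)
    x = list(range(1, 21))
    slopes = []
    for _ in range(n_samples):
        y = [2.0 + TRUE_SLOPE * a + rng.gauss(0, noise_sd) for a in x]
        slopes.append(fit_slope(x, y))
    return slopes


tight = simulate_slopes(noise_sd=1.0)
wide = simulate_slopes(noise_sd=5.0)

for name, slopes in (("low noise", tight), ("high noise", wide)):
    print(name, "mean:", round(statistics.mean(slopes), 3),
          "sd:", round(statistics.stdev(slopes), 3))
```

Both sets of slopes average close to the true value (centered), but the high-noise slopes are scattered much more widely (less precise).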

45

If residuals are scattered randomly around zero with roughly constant spread, what is that usually telling a beginner student?

A. This is more consistent with OLS working reasonably well

46

Suppose the residuals show a strong curve. Why is this a problem when thinking about whether estimates are centered?

A. It suggests the model may be systematically missing part of the relationship

47

If residuals fan out more and more as fitted values increase, what does that mainly suggest about the estimates?

B. The precision of the estimates and the reliability of standard errors may be affected

48

Which statement best matches the role of residuals for a student using only one sample?

B. Residuals are clues that help us judge whether OLS may be behaving well or poorly

49

A beginner student asks: “If residuals look messy, what is a simple way to think about it?” Which answer is best?

A. The darts on our dartboard may not be landing in a clean, tight pattern, so we should be cautious about centeredness and precision

50

Which of the following best describes a fitted linear regression model?

B. A line or equation estimated from the sample that gives the predicted value of the outcome based on the predictors

51

A fitted regression model for exam score is [equation not shown]. What is the best interpretation of the coefficient 4.2?

A. For each additional hour studied, the student’s exam score is predicted to increase by 4.2 points on average
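
The interpretation can be checked with simple arithmetic. Only the slope of 4.2 comes from the card; the intercept of 55.0 below is a made-up placeholder, since the card's equation is not available. The point is that the intercept cancels when comparing two predictions one hour apart.

```python
INTERCEPT = 55.0  # hypothetical value, not from the card
SLOPE = 4.2       # coefficient on hours studied, from the card


def predicted_score(hours):
    """Predicted exam score from the fitted line."""
    return INTERCEPT + SLOPE * hours


# One extra hour of study changes the prediction by the slope
difference = predicted_score(3) - predicted_score(2)
print("Change in predicted score for one more hour:", round(difference, 2))
```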

52

A fitted regression model for house price is [equation not shown]. What is the best interpretation of the coefficient 18,500?

B. Holding the model form fixed, a one-bedroom increase is associated with a predicted increase of $18,500 in house price on average

53

A fitted regression model for yearly salary is [equation not shown]. What is the best interpretation of the coefficient 2,100?

A. For each additional year of experience, predicted salary increases by $2,100 on average

54

A fitted regression model for electricity usage is [equation not shown]. What is the best interpretation of the coefficient 0.16?

B. For each additional kilowatt-hour used, the predicted monthly bill increases by $0.16 on average

55

In a linear regression model, what is OLS mainly used for?

B. To estimate the coefficients, including the intercept and slopes

56

In the regression model [equation not shown], what does OLS try to estimate from the sample?

C. The unknown population coefficients, such as β₀ and β₁

57

Why do we call OLS coefficients “estimates” rather than the true parameters?

C. Because they are sample-based best guesses of population parameters that we do not directly know

58

Even after fitting a regression model with OLS, why do we still talk about uncertainty?

C. Because the estimated coefficients come from one sample and may differ from the true population values

59

Which statement best describes the role of the standard error (SE) in OLS?

C. It tells us how much uncertainty there is around our estimated coefficient

60

What does using robust standard errors such as HC3 mainly do in an OLS regression?

C. It keeps the OLS coefficients the same but adjusts the standard errors to be more reliable when error variance is not constant
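
As a sketch of the idea in card 60, the code below fits one OLS line and then computes two standard errors for the same slope: the classical one and an HC3 one. HC3 is implemented here by hand for the one-predictor case using the usual sandwich form with squared residuals divided by (1 - leverage)². The data and helper names are hypothetical, and statistical libraries provide tested implementations that should be preferred in practice.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def ols_with_hc3(x, y):
    """Return ((b0, b1), classical slope SE, HC3 slope SE) for y = b0 + b1 * x."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]  # (X'X)^-1
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]

    # Classical SE assumes constant error variance
    s2 = sum(e * e for e in resid) / (n - 2)
    classical_se = (s2 * inv[1][1]) ** 0.5

    # HC3: weight each squared residual by 1 / (1 - leverage)^2
    leverage = [inv[0][0] + 2 * inv[0][1] * a + inv[1][1] * a * a for a in x]
    w = [(e / (1 - h)) ** 2 for e, h in zip(resid, leverage)]
    swx = sum(wi * a for wi, a in zip(w, x))
    meat = [[sum(w), swx], [swx, sum(wi * a * a for wi, a in zip(w, x))]]
    cov = matmul2(matmul2(inv, meat), inv)
    return (b0, b1), classical_se, cov[1][1] ** 0.5


# Hypothetical fan-shaped data: deviations from the line grow with x
x = list(range(1, 11))
y = [2 + 3 * a + (0.4 * a if a % 2 else -0.4 * a) for a in x]

(b0, b1), classical_se, hc3_se = ols_with_hc3(x, y)
print("Slope:", round(b1, 4))
print("Classical SE:", round(classical_se, 4))
print("HC3 SE:", round(hc3_se, 4))
```

The slope is computed once and is identical under both approaches; only the reported uncertainty changes. On this made-up data the HC3 standard error comes out larger than the classical one.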

61

Taking the log of the response variable is often a simple fix when which OLS issue appears in the residuals vs fitted plot?

C. Heteroscedasticity, where the spread of residuals changes as fitted values increase

62

What is the main purpose of categorical encoding in a data set?

A. To remove missing values
B. To convert categories into a numerical form that a model can use
C. To increase sample size
D. To standardize continuous variables

63

Which of the following is an example of a nominal variable?

A. Education level: high school, college, graduate
B. Satisfaction rating: low, medium, high
C. Color: red, blue, green
D. Class rank: first, second, third

64

One-hot encoding is especially appropriate for:

A. Variables with a natural order
B. Continuous variables
C. Categories with no natural ordering
D. Variables with extreme outliers

65

A possible drawback of one-hot encoding is that it can:

A. Reduce all correlations to zero
B. Create many new columns when a variable has many categories
C. Remove all missing values
D. Guarantee causal interpretation

66

Why can ordinal encoding be risky in some models?

A. It always causes multicollinearity
B. It may incorrectly suggest that distances between categories are meaningful
C. It removes too much information
D. It can only be used for binary variables

67

A data set contains a variable called shirt_color with categories red, blue, green, and black. A student decides to code them as 1, 2, 3, and 4 and use that directly in linear regression. What is the main concern?

A. The model will run too slowly
B. The encoding may incorrectly impose an artificial order on the categories
C. The variable will automatically become missing
D. The regression will always fail

68

A regression model includes a predictor for education_level with values: high school, college, graduate school. If these are encoded as 1, 2, and 3, when is this most defensible?

A. When the categories have a meaningful order
B. When the categories are colors
C. When the target variable is binary
D. When there are missing values

69

Multicollinearity refers to a situation where:

A. The target variable has missing values
B. Two or more predictors are strongly linearly related
C. Residuals are not normally distributed
D. The sample size is too large

70

Which of the following is a practical consequence of strong multicollinearity?

A. Coefficients become impossible to estimate at all
B. The model automatically becomes biased
C. Standard errors of coefficients can become inflated
D. The residuals become exactly zero

71

If two predictors are highly correlated, that means:

A. One of them must always be removed
B. They are definitely redundant for the model
C. They may create redundancy, but not always
D. The model is invalid

72

Which diagnostic is commonly used to assess multicollinearity?

A. RMSE
B. VIF
C. Confusion matrix
D. Silhouette score

73

Why is the assumption of no perfect multicollinearity important for OLS?

A. Otherwise the model cannot uniquely estimate the separate effect of some predictors
B. Otherwise the sample mean becomes biased
C. Otherwise missing values disappear
D. Otherwise the intercept becomes zero

74

A student is building a regression model to predict salary and includes both years_worked and months_worked. What is the most likely issue?

A. Heteroscedasticity
B. Missing completely at random
C. Strong multicollinearity
D. Underfitting

75

In a regression model, two predictors have very high VIF values, but both are conceptually important. What is the best interpretation?

A. One must always be deleted immediately
B. The predictors are highly related, so coefficient uncertainty may be large
C. The model has no useful information
D. The target variable should be removed

76

In ordinary least squares (OLS), the estimated coefficients are chosen to:

A. Maximize the number of predictors
B. Minimize the sum of squared residuals
C. Minimize the number of observations
D. Maximize the variance of the residuals

77

In a regression model, the standard error of a coefficient measures:

A. The size of the coefficient itself
B. The average value of the predictor
C. The uncertainty in the estimated coefficient
D. The correlation between predictors

78

If a coefficient has a large standard error, this usually means:

A. The estimate is more precise
B. The estimate is less precise
C. The predictor has no units
D. The model is necessarily biased

79

In the phrase “OLS is BLUE,” the word “Best” means:

A. Largest coefficient values
B. Smallest prediction errors on every future data set
C. Minimum variance among linear unbiased estimators
D. Perfect causal interpretation

80

If OLS is BLUE, what does it guarantee?

A. The estimates are unbiased and have the smallest variance among linear unbiased estimators
B. The model predictions are always exactly correct
C. The predictors are all statistically significant
D. The residuals are zero

81

Which of the following is NOT one of the core assumptions associated with OLS being BLUE?

A. Linearity in parameters
B. Random sampling
C. No perfect multicollinearity
D. Residuals must be perfectly normal

82

If the OLS assumptions for BLUE are violated, a practical consequence can be:

A. Standard errors and inference may become unreliable
B. The model becomes a classification model
C. All predictors become significant
D. The data automatically become normalized

83

A researcher estimates an OLS model and notices that the coefficients seem reasonable, but the standard errors are very large. What is the practical meaning of this?

A. The model estimates are very precise
B. The model has perfect fit
C. There is high uncertainty around the coefficient estimates
D. The predictors must be categorical

84

Suppose OLS coefficients are unbiased, but they are not BLUE because another linear unbiased estimator would have smaller variance. Practically, what does that mean?

A. OLS estimates are wrong on average
B. OLS estimates are still centered correctly, but they are less efficient than they could be
C. OLS can no longer be computed
D. The sample size must be zero

85

A professor says, “OLS is BLUE.” A student responds, “So that means the model predictions are always correct.” What is the best correction?

A. Yes, BLUE means perfect prediction
B. Yes, BLUE means no residuals
C. No, BLUE refers to coefficient estimation properties, not guaranteed perfect prediction
D. No, BLUE only applies to logistic regression

86

Which missing data type occurs when the probability of missingness is unrelated to both observed and unobserved data?

A. MAR
B. MNAR
C. MCAR
D. NMAR-free

87

Which missing data type means missingness may depend on observed variables, but not on the missing value itself after accounting for those observed variables?

A. MCAR
B. MAR
C. MNAR
D. Completely deterministic missingness

88

Which missing data type is often the most problematic because the missingness depends on the unobserved value itself or other unobserved factors?

A. MCAR
B. MAR
C. MNAR
D. Random omission

89

A hospital data set is missing some patient income values because higher-income patients were less likely to report them. Which missingness type is most plausible?

A. MCAR
B. MAR or possibly MNAR depending on the mechanism
C. There is no missingness problem
D. One-hot missingness

90

A data analyst removes every row with a missing value without checking the reason values are missing. What is the main risk?

A. The file size becomes too large
B. The model becomes nonlinear
C. The remaining data may become less representative and possibly biased
D. Multicollinearity is guaranteed

91

In a good residuals vs fitted plot for a linear regression, the residuals should usually look like:

A. A clear curved pattern
B. A random cloud around zero with no obvious pattern
C. A strong upward line
D. Two separate groups only

92

If a residuals vs fitted plot shows a curved shape, this often suggests:

A. The model may be missing a nonlinear relationship
B. The residuals are perfectly normal
C. The predictors are all independent
D. The model is guaranteed to be BLUE

93

A beginner student fits a linear regression model, and the residuals vs fitted plot looks like a U-shape. What is the most likely interpretation?

A. The model fits perfectly
B. The linear model may not capture the true relationship well
C. There is definitely no multicollinearity
D. The target variable has no variance

94

If the residuals vs fitted plot fans out as fitted values increase, this suggests:

A. Perfect linearity
B. Heteroscedasticity, meaning the spread of residuals changes
C. No outliers
D. The predictors are categorical

95

A simple beginner-level fix for a fan-shaped residuals vs fitted plot is often to:

A. Ignore it because it is always harmless
B. Try transforming the response, such as using a log of the target when appropriate
C. Delete all predictors
D. Replace regression with clustering

96

If a residuals vs fitted plot shows one or two points far away from the rest, a student should first think about:

A. Possible outliers or influential observations
B. Perfect normality
C. Whether the model has too many rows
D. Whether one-hot encoding is required

97

In a P–P plot, the points should ideally:

A. Scatter randomly with no structure at all
B. Fall roughly along a straight diagonal line
C. Form a U-shape
D. Form horizontal stripes

98

If the points in a P–P plot deviate strongly from the diagonal line, this suggests:

A. The residual distribution may not match normality well (i.e. residuals don't have a bell-shaped distribution)
B. The predictors are highly correlated
C. The intercept is missing
D. The target variable is categorical

99

Suppose a P–P plot shows mild deviations from the diagonal, but the sample size is fairly large. What is often a reasonable beginner interpretation?

A. The model must be discarded immediately
B. Small deviations may not be a major practical problem, especially in larger samples
C. The coefficients are automatically biased
D. OLS cannot be computed anymore

100

A student sees nonlinearity in the residuals vs fitted plot and non-normality in the P–P plot. Which is the most beginner-friendly first response?

A. Check whether a transformation or an added nonlinear term could improve the model
B. Declare the data useless
C. Remove half the data randomly
D. Conclude the model is perfect anyway