Advanced Data Analytics Midterm Review

Last updated 8:49 PM on 3/28/26

134 Terms

1
New cards

Statistical Learning

Refers to methods intended to help us understand data

2
New cards

Standard Convention

Each row represents an observation and each column a variable.

3
New cards

Input Variables

Predictors, features, independent variables

4
New cards

Output Variables

Responses, dependent variables

5
New cards

Supervised Learning

Building models that seek to predict the value of an output based on a set of input variables

6
New cards

Learning a rule that approximates the relationship between the predictors and the response

Supervised Learning

7
New cards

Unsupervised Learning

Detecting patterns and relationships in data

8
New cards

Learning a rule for categorizing observations

Unsupervised Learning

9
New cards

Inference

Understanding the relationship between the response and the predictors

10
New cards

Prediction

Entails estimating the value of a response based on observed predictors

11
New cards

Quantitative Response

Regression

12
New cards

Qualitative Response

Classification

13
New cards

Logistic regression is a _______ method.

Classification

14
New cards

We believe there is a relationship between a _______ __ and at least one of the predictors in _

response Y, X

15
New cards

Supervised learning is all about _______ (learning) f

Estimating

16
New cards

f is a _____ but _______ function

fixed, unknown

17
New cards

f represents the _______ information about Y provided by X.

Systematic

18
New cards

∊ is a _________ (noise) with mean zero, independent of X

Random Error

19
New cards

∊ represents…

The effect of unmeasured variables on Y, or unmeasurable variation. Means that the same X can lead to different Y
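Cards 14 through 19 describe the pieces of one additive-error model. A sketch in standard notation, using the same symbols as the cards:

```latex
Y = f(X) + \epsilon, \qquad \mathbb{E}[\epsilon] = 0, \qquad \epsilon \text{ independent of } X
```

Here f is the fixed but unknown systematic part and ε is the random error, which is why the same X can lead to different values of Y.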

20
New cards

We apply a learning method to the ___________ and obtain a ________

Training dataset, fitted model

21
New cards

In the testing stage, we have a _________ test dataset

Separate

22
New cards

Statistical Learning Approach

Use the training data and a statistical method to estimate f

Find a good-fitting function f and use it for prediction

23
New cards

Parametric Method (model)

Reduce the problem from one of estimating f to one of estimating a set of parameters, e.g., β0 and β1

24
New cards

Non-parametric Method (model)

Tend to be more flexible

25
New cards

Models that are more flexible tend to be ____________ and have the potential to _________ the training data.

Less interpretable, overfit

26
New cards

We can measure the quality of a model’s predictions by the _________

Mean Squared Error
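A minimal sketch of the mean squared error computation. The function name and the toy numbers are hypothetical and chosen only for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed responses and model predictions.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
mse = mean_squared_error(observed, predicted)  # (1 + 0 + 4) / 3
```

Computed on the training data this gives the training MSE; computed on a separate test dataset it gives the test MSE.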

27
New cards

The method with the lowest training MSE may not have the _________

Lowest Test MSE

28
New cards

Training MSE __________ as flexibility _________

Decreases, Increases

29
New cards

Test MSE has a U-shape because of the _____________

Bias-variance trade-off

30
New cards

Variance Term

Refers to the uncertainty due to randomness in the training data

31
New cards

Bias Term

Refers to our error in approximating a real-life problem

32
New cards

Var(∊) Term is called ____________

Irreducible Error
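The variance, bias, and Var(ε) terms on cards 29 through 32 are usually written as one decomposition of the expected test MSE at a point x0. A sketch in standard notation, where x0, y0, and f̂ are symbols assumed here rather than taken from the cards:

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon)
```

As flexibility increases, the variance term tends to grow and the bias term tends to shrink, while Var(ε) stays fixed; this is one way to read the U-shape of the test MSE curve.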

33
New cards

Horizontal line in Bias-Variance Trade-Off

Var(∊)

34
New cards

Vertical Line in Bias-Variance Trade-Off

Flexibility level with smallest test MSE

35
New cards

Simple Linear Regression setting

Predicting a quantitative response Y based on a single predictor X

36
New cards

The simple linear regression model is also called the _____________

Population Regression Line

37
New cards

Parameter β0

Intercept (the avg value of Y if X = 0)

38
New cards

Parameter β1

Slope (the avg increase in Y when X is increased by 1)

39
New cards

Error (∊)

Assumed to be normally distributed with mean 0 and variance σ² : ∊ ~ N(0,σ²)

40
New cards

The ___________________ β̂0 and β̂1 minimize RSS

Least squares coefficient estimates
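A minimal sketch of the closed-form least squares estimates for a single predictor. The function names and the toy data are hypothetical; the formulas are the standard ones for simple linear regression.

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) that minimize RSS for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def rss(x, y, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, roughly following a line with slope near 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
b0, b1 = least_squares_fit(x, y)
```

Moving either estimate away from the fitted values can only increase the RSS, which is what "least squares" means here.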

41
New cards

Red Line (Probabilistic interpretation of regression)

Population regression line

42
New cards

Blue Line (Probabilistic interpretation of regression)

Line of best fit

43
New cards

Residual Standard Error

Estimate of the standard deviation of ∊, i.e., σ

44
New cards

Residual standard error (RSE) measures the ________ of the model to the training data

Lack of fit
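For simple linear regression, the residual standard error is commonly computed from the RSS as follows. This is a sketch in standard notation; n is the number of training observations and ŷ_i is the fitted value, neither of which is named on the cards.

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}, \qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```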

45
New cards

R² statistic measures…

Proportion of variance explained by fitted linear model

46
New cards

R² takes values between ___ and ___, and ______ values indicate better fit.

0, 1, Larger
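A minimal sketch of the R² statistic as 1 - RSS/TSS. The function name and the toy values are hypothetical.

```python
def r_squared(y_true, y_fitted):
    """Proportion of variance in y_true explained by the fitted values."""
    y_bar = sum(y_true) / len(y_true)
    tss = sum((y - y_bar) ** 2 for y in y_true)  # total sum of squares
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_fitted))
    return 1 - rss / tss


y = [2.0, 4.0, 6.0, 8.0]
perfect = r_squared(y, y)            # fitted values equal the observations
mean_only = r_squared(y, [5.0] * 4)  # always predicting the sample mean
```

The two extremes illustrate the range on the card: a perfect fit gives 1, and a model no better than the sample mean gives 0.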

47
New cards

The Null Hypothesis means…

There is no relationship between X and Y

48
New cards

Alternative Hypothesis means…

There is some relationship between X and Y

49
New cards

We want to strongly _____ the null hypothesis, i.e., obtain a very low _______.

Reject, p-value

50
New cards

Polynomial Regression

More higher-order terms → more flexible model → more potential for overfitting
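A sketch of the degree-d polynomial regression model in standard notation, where the degree d is a symbol assumed here:

```latex
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_d X^d + \epsilon
```

The model is still linear in the coefficients, so it can be fit by least squares; raising d adds terms and therefore flexibility.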

51
New cards

Residual Plots

Can tell us a lot about the relationship between Y and X

52
New cards

In a residual plot, if the points are not vertically centered around 0 for all x, it suggests a ___________ relationship between Y and X.

Non-linear

53
New cards

In a residual plot, if the point cloud has different vertical spreads for different x values (e.g., a funnel shape), it suggests the variance σ² is ________ across x. This is called _________.

Non-constant, Heteroskedasticity

54
New cards

We sometimes want to report ___________ for the expected response given a particular predictor.

Confidence Intervals

55
New cards

We sometimes want to report ________ for the response given a particular predictor value, Y | x

Prediction Intervals

56
New cards

Confidence intervals give…

Plausible values for f(x) = E[Y|x] (the average output)

57
New cards

Prediction intervals give…

Plausible values for Y|x (an individual output)

58
New cards

Multiple Linear Regression Parameters

β0 : intercept (avg value of Y if all predictors are 0)

βj : slope of jth predictor

59
New cards

βj is the average increase in Y if Xj is increased by 1 and…

All other predictors are held constant

60
New cards

R performs a different ___________ (called the F test)

Model Utility Test

61
New cards

A small p-value indicates that _________ of the predictors has a statistically significant relationship with the response.

At least one

62
New cards

The t-test for the jth predictor measures the _________ of adding the jth predictor to the model when the other p-1 predictors are already in it.

Partial Effect

63
New cards

The p-value for the F-test tells us…

Whether the multiple linear regression model is reasonable

64
New cards

The p-values for the tests for each predictor can be helpful for…

Choosing which input variables to include in the final model

65
New cards

For multiple linear regression, we often consider the __________ value, which penalizes including superfluous predictors in the model.

Adjusted R²

66
New cards

_______ is guaranteed not to decrease when adding predictors. ______ is not.

R², Adjusted R²
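A minimal sketch of the usual adjusted R² formula, written in terms of R², the number of observations n, and the number of predictors p. The function name and the numbers are hypothetical and only illustrate the penalty.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical scenario: adding three weak predictors nudges R^2 up slightly.
small_model = adjusted_r_squared(0.900, n=21, p=2)
large_model = adjusted_r_squared(0.905, n=21, p=5)
```

In this made-up scenario R² rises from 0.900 to 0.905, yet the adjusted value falls, which is the penalty for superfluous predictors described on card 65.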

67
New cards

Cross-validation Methods are used when…

Trying out several learning methods and want to find the one with the best (lowest) test MSE.

68
New cards

How can we estimate test MSE/error rate using only the training dataset?

Validation Set Approach

Leave-One-Out Cross-Validation (LOOCV)

k-Fold Cross Validation

69
New cards

Validation set approach Setup

Randomly split the available data into two parts

70
New cards

Training Set

Observations that will be used to train the models

71
New cards

Validation Set

Observations that will be used for testing models

72
New cards

Choose the model with the _______ test MSE when applied to the validation data.

Lowest

73
New cards

For LOOCV, we validate __________ and aggregate results.

Multiple times

74
New cards

For k-Fold CV, like LOOCV, we validate multiple times, but only fit ________, not n.

k Models
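A minimal sketch of k-fold cross-validation for a simple linear regression, using only the standard library. The function names, the seed, and the toy data are hypothetical; setting k equal to n gives LOOCV.

```python
import random


def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_mse(x, y, k, seed=0):
    """Estimate test MSE by averaging the held-out MSE over k folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # One model is fit per fold, so k models in total.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Hypothetical data: an exact line plus an alternating offset of +/- 0.5.
x = [float(i) for i in range(20)]
y = [2.0 + 3.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
cv5 = k_fold_mse(x, y, k=5)          # fits 5 models
loocv = k_fold_mse(x, y, k=len(x))   # fits n = 20 models
```

The loop makes the cost difference visible: 5-fold CV fits 5 models, while LOOCV on the same data fits 20.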

75
New cards

Validation set approach is easy to implement, but statistical methods tend to do better when…

Trained on more data

76
New cards

LOOCV and k-Fold CV allow us to ______________ for a model fit on ________.

Estimate the test MSE, available data

77
New cards

LOOCV is computationally intensive because…

We apply the same method n times!

78
New cards

Classification deals with…

Predicting a qualitative response

79
New cards

Classifier

A rule for assigning a newly observed combination of input variables to an output category

80
New cards

A classifier can make two types of errors: __________ and ________.

False positives and False negatives
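A minimal sketch that counts the two error types for a two-category problem. The function name, the "yes"/"no" labels, and the toy predictions are hypothetical.

```python
def count_errors(actual, predicted, positive="yes"):
    """Return (false_positives, false_negatives) for a binary classifier."""
    pairs = list(zip(actual, predicted))
    # False positive: predicted positive, but the true category is not.
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    # False negative: predicted not positive, but the true category is.
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    return fp, fn


actual = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "yes", "no", "no", "no"]
false_pos, false_neg = count_errors(actual, predicted)
```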

81
New cards

_____________ is a classification method.

Logistic Regression

82
New cards

Comparative Boxplots

A useful way to visualize the distributions of input variables for each of the output categories.

83
New cards

Training Error Rate

The fraction of misclassified training observations
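A minimal sketch of the training error rate as a fraction of misclassified observations. The function name and the toy labels are hypothetical.

```python
def error_rate(actual, predicted):
    """Fraction of observations whose predicted category is wrong."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical training labels and the classifier's predictions for them.
train_labels = ["a", "b", "a", "a"]
train_predictions = ["a", "a", "a", "b"]
rate = error_rate(train_labels, train_predictions)  # 2 of 4 are wrong
```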

84
New cards

Suppose that Y can be in one of J categories, indexed j = 1,2,…,J. Conditional on X, Y is assumed to be _________ distributed with _____________.

Multinomially, Probability mass function (pmf)

85
New cards

For classification, P(Y = j | X = x) is fixed but ________

Unknown (seek to approximate it)

86
New cards

Bayes Classifier

A classifier that would minimize the average misclassification fraction

87
New cards

Bayes Classifier is an ________ ideal.

Unattainable

88
New cards

Bayes Error Rate

Represents the best we can do on a classification problem; analogous to the irreducible error

89
New cards

Bayes Classifier: Different categories are represented by _______ and ________

Blue, Orange

90
New cards

Bayes Decision Boundary

The purple dashed line

91
New cards

K-nearest Neighbors (KNN)

Nonparametric method that directly attempts to estimate P(Y = j | X = x0) by looking at the categories (outputs) of neighbors of x0

92
New cards

RHS features…

The empirical proportions of nearby observations in each class
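A minimal sketch of how KNN turns neighbor labels into estimated class probabilities. It assumes a single numeric predictor and absolute distance; the function names and the toy data are hypothetical.

```python
from collections import Counter


def knn_probabilities(train_x, train_y, x0, k):
    """Estimate P(Y = j | X = x0) as the share of the k nearest neighbors in class j."""
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    neighbors = by_distance[:k]
    counts = Counter(train_y[i] for i in neighbors)
    return {label: count / k for label, count in counts.items()}


def knn_classify(train_x, train_y, x0, k):
    """Assign x0 to the category with the largest estimated probability."""
    probs = knn_probabilities(train_x, train_y, x0, k)
    return max(probs, key=probs.get)


# Hypothetical one-dimensional training set with two categories.
train_x = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]
probs_k3 = knn_probabilities(train_x, train_y, x0=2.5, k=3)
probs_k5 = knn_probabilities(train_x, train_y, x0=2.5, k=5)
label = knn_classify(train_x, train_y, x0=2.5, k=3)
```

With k = 3 every neighbor of 2.5 is blue; with k = 5 two orange points enter the neighborhood, so the estimated probabilities become 0.6 and 0.4.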

93
New cards

KNN Classifier

Black solid Line

94
New cards

Parametric Methods

LDA, QDA, and Naive Bayes

95
New cards

Linear Discriminant Analysis (LDA)

Assumes that for observations with outputs in category j, the predictor is normally distributed with mean μj and variance σ²

96
New cards

LDA assumes ________ variances across categories.

Equal

97
New cards

Substituting the normal density into Bayes’ Theorem, taking a log of both sides, and removing extra terms gives a _______________ that is ________ in x.

Discriminant function, Linear
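For a single predictor, the discriminant function this card refers to is usually written as below. This is a sketch in standard notation: μ_j is the mean for category j, σ² is the shared variance, and π_j is the prior probability of category j, a symbol assumed here because it does not appear on the cards.

```latex
\delta_j(x) = x \cdot \frac{\mu_j}{\sigma^2} - \frac{\mu_j^2}{2\sigma^2} + \log \pi_j
```

An observation is assigned to the category with the largest δ_j(x); x appears only to the first power, which is why the function is linear in x.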

98
New cards

LDA assumes that for observations in category j, the predictors are _____________ with mean vector μj and a covariance matrix shared by all categories.

Multivariate normally distributed

99
New cards

Multivariate Normal Distribution

Extends the univariate normal distribution to higher dimensions

100
New cards

Quadratic Discriminant Analysis (QDA)

Extends the LDA framework to allow for different covariance matrices for different categories. The resulting discriminant function is quadratic in x.
