Statistical Learning
Refers to methods intended to help us understand data
Standard Convention
Each row represents an observation and each column a variable.
Input Variables
Predictors, features, independent variables
Output Variables
Responses, dependent variables
Supervised Learning
Building models that seek to predict the value of an output based on a set of input variables
Learning a rule that approximates the relationship between the predictors and the response
Supervised Learning
Unsupervised Learning
Detecting patterns and relationships in data
Learning a rule for categorizing observations
Unsupervised Learning
Inference
Understanding the relationship between the response and the predictors
Prediction
Entails estimating the value of a response based on observed predictors
Quantitative Response
Regression
Qualitative Response
Classification
Logistic regression is a _______ method.
Classification
We believe there is a relationship between a _______ __ and at least one of the predictors in _
response Y, X
Supervised learning is all about _______ (learning) f
Estimating
f is a _____ but _______ function
fixed, unknown
f represents the _______ information about Y provided by X.
Systematic
∊ is a _________ (noise) with mean zero, independent of X
Random Error
∊ represents…
The effect of unmeasured variables on Y, or unmeasurable variation. This means the same X can lead to different Y values.
We apply a learning method to the ___________ and obtain a ________
Training dataset, fitted model
In the testing stage, we have a _________ test dataset
Separate
Statistical Learning Approach
Use the training data and a statistical method to estimate f
Find a good-fitting function f and use it for prediction
Parametric Method (model)
Reduce the problem from one of estimating f to one of estimating a set of parameters, e.g., β0 and β1
Non-parametric Method (model)
Tend to be more flexible
Models that are more flexible tend to be ____________ and have the potential to _________ the training data.
Less interpretable, overfit
We can measure the quality of a model’s predictions by the _________
Mean Squared Error
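A minimal sketch of the MSE computation (the function name and data are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference between
    observed responses and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```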
The method with the lowest training MSE may not have the
Lowest Test MSE
Training MSE __________ as flexibility _________
Decreases, Increases
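This can be illustrated with nested polynomial fits on a toy dataset (the data here are synthetic and purely illustrative):

```python
import numpy as np

# Synthetic data: a sine curve plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Training MSE for increasingly flexible polynomial fits
train_mse = []
for degree in (1, 3, 6):
    coef = np.polyfit(x, y, degree)
    train_mse.append(float(np.mean((y - np.polyval(coef, x)) ** 2)))

# Each model class contains the previous one, so the training MSE
# can only decrease as the degree grows -- but the test MSE need not.
```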
Test MSE has a U-shape because of the _____________
Bias-variance trade-off
Variance Term
Refers to the uncertainty due to randomness in the training data
Bias Term
Refers to our error in approximating a real-life problem
Var(∊) Term is called ____________
Irreducible Error
Horizontal line in Bias-Variance Trade-Off
Var(∊)
Vertical Line in Bias-Variance Trade-Off
Flexibility level with smallest test MSE
Simple Linear Regression setting
Predicting a quantitative response Y based on a single predictor X
The simple linear regression model is also called the _____________
Population Regression Line
Parameter β0
Intercept (the avg value of Y if X = 0)
Parameter β1
Slope (the avg increase in Y when X is increased by 1)
Error (∊)
Assumed to be normally distributed with mean 0 and variance σ² : ∊ ~ N(0,σ²)
The ___________________ β̂0 and β̂1 minimize RSS
Least squares coefficient estimates
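The closed-form estimates for the simple linear regression case can be sketched directly (function name is illustrative):

```python
import numpy as np

def least_squares_fit(x, y):
    """Simple linear regression coefficient estimates that
    minimize the residual sum of squares (RSS)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```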
Red Line (Probabilistic interpretation of regression)
Population regression line
Blue Line (Probabilistic interpretation of regression)
Line of best fit
Residual Standard Error
Estimate of the standard deviation of ∊, i.e., σ
Residual standard error (RSE) measures the ________ of the model to the training data
Lack of fit
R² statistic measures…
Proportion of variance explained by fitted linear model
R² takes values between ___ and ___, and ______ values indicate better fit.
0, 1, Larger
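A sketch of the R² computation from its definition, 1 − RSS/TSS:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the proportion of the variance in y
    explained by the fitted model."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return float(1 - rss / tss)
```

A perfect fit gives 1; always predicting the mean of y gives 0.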
The Null Hypothesis means…
There is no relationship between X and Y
Alternative Hypothesis means…
There is some relationship between X and Y
We want to strongly _____ the null hypothesis, i.e., obtain a very low _______.
Reject, p-value
Polynomial Regression
More higher-order terms → more flexible model → more potential for overfitting
Residual Plots
Can tell us a lot about the relationship between Y and X
In a residual plot, if the points are not vertically centered around 0 for all x, it suggests a ___________ relationship between Y and X.
Non-linear
In a residual plot, if the point cloud has different vertical spreads for different x's (e.g., a funnel shape), it suggests the variance σ² is ________ across x. This is called _________.
Non-constant, Heteroskedasticity
We sometimes want to report ___________ for the expected response given a particular predictor.
Confidence Intervals
We sometimes want to report ________ for the response given a particular predictor, Y | x
Prediction Intervals
Confidence Intervals give….
Plausible values for f(x) = E[Y|x] (the average output)
Prediction intervals give…
Plausible values for Y|x (an individual output)
Multiple Linear Regression
β0 : intercept (avg value of Y if all predictors are 0)
βj : slope of jth predictor
βj is the average increase in Y if Xj is increased by 1 and…
All other predictors are held constant
R performs a different ___________ (called the F test)
Model Utility Test
A small p-value indicates that _________ of the predictors has a statistically significant relationship with the response.
At least one
The t-test for the jth coefficient measures the _________ of adding the jth predictor to the model when the other p-1 predictors are already in it.
Partial Effect
The p-value for the F-test tells us…
Whether the multiple linear regression model is reasonable
The p-values for the tests for each predictor can be helpful for…
Choosing which input variables to include in the final model
For multiple linear regression, we often consider the __________ value, which penalizes including superfluous predictors in the model.
Adjusted R²
_______ is guaranteed to increase when adding predictors. ______ is not.
R², Adjusted R²
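The penalty is visible in the formula itself (n observations, p predictors):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    For a fixed R^2, a larger p lowers the adjusted value,
    penalizing superfluous predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```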
Cross-validation Methods are used when…
Trying out several learning methods and want to find the one with the best (lowest) test MSE.
How can we estimate test MSE/error rate using only the training dataset?
Validation Set Approach
Leave-One-Out Cross-Validation (LOOCV)
k-Fold Cross Validation
Validation set approach Setup
Randomly split the available data into two parts
Training Set
Observations that will be used to train the models
Validation Set
Observations that will be used for testing models
Choose the model with the _______ test MSE when applied to the validation data.
Lowest
For LOOCV, we validate __________ and aggregate results.
Multiple times
For k-Fold CV, like LOOCV, we validate multiple times, but only fit ________, not n.
k Models
Validation set approach is easy to implement, but statistical methods tend to do better when…
Trained on more data
LOOCV and k-Fold CV allow us to ______________ for a model fit on ________.
Estimate the test MSE, available data
LOOCV is computationally intensive because…
We apply the same method n times!
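A sketch of k-fold CV for polynomial regression (synthetic data; the helper name is illustrative):

```python
import numpy as np

def kfold_cv_mse(x, y, k=5, degree=1, seed=0):
    """Estimate test MSE: hold out each of k folds in turn,
    fit on the remaining folds, average the held-out MSEs."""
    idx = np.random.default_rng(seed).permutation(len(x))
    fold_mses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])
        fold_mses.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(fold_mses))
```

Setting k = len(x) fits n models, one per left-out observation, which is exactly LOOCV.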
Classification deals with…
Predicting a qualitative response
Classifier
A rule for assigning a newly observed combination of input variables to an output category
A classifier can make two types of errors: __________ and ________.
False positives and False negatives
_____________ is a classification method.
Logistic Regression
Comparative Boxplots
A useful way to visualize the distributions of input variables for each of the output categories.
Training Error Rate
The fraction of misclassified training observations
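The error rate is a one-liner (names illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs
    from the true class."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```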
Suppose that Y can be in one of J categories, indexed j = 1,2,…,J. Conditional on X, Y is assumed to be _________ distributed with _____________.
Multinomially, Probability mass function (pmf)
For classification, P(Y = j | .) is fixed but ________
Unknown (seek to approximate it)
Bayes Classifier
A classifier that would minimize the average misclassification fraction
Bayes Classifier is an ________ ideal.
Unattainable
Bayes Error Rate
Represents the best we can do on a classification problem; analogous to the irreducible error
Bayes Classifier: Different categories are represented by _______ and ________
Blue, Orange
Bayes Decision Boundary
The purple dashed line
K-nearest Neighbors (KNN)
Nonparametric method that directly attempts to estimate P(Y = j | X = x0) by looking at the categories (outputs) of neighbors of x0
The RHS of the KNN estimate features…
The empirical proportions of nearby observations in each class
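A minimal KNN classifier sketch (function name and toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k=3):
    """Predict the class of x0 as the most common class among
    its k nearest training observations (majority vote on the
    empirical class proportions of the neighbors)."""
    X_train = np.asarray(X_train, float)
    dists = np.linalg.norm(X_train - np.asarray(x0, float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]
```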
KNN Classifier
Black solid Line
Parametric Methods
LDA, QDA, and Naive Bayes
Linear Discriminant Analysis (LDA)
Assumes that for observations with outputs in category j, the predictor is normally distributed with mean μj and variance σ²
LDA assumes ________ variances across categories.
Equal
Substituting the normal density into Bayes’ Theorem, taking the log of both sides, and removing extra terms gives a _______________ that is ________ in x.
Discriminant function, Linear
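In the one-predictor case this discriminant is commonly written as (with πj the prior probability of class j):

```latex
\delta_j(x) = x \cdot \frac{\mu_j}{\sigma^2} - \frac{\mu_j^2}{2\sigma^2} + \log(\pi_j)
```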
LDA assumes that for observations in category j, the predictors are _____________ with mean vector μj and covariance matrix Σ.
Multivariate normally distributed
Multivariate Normal Distribution
Extends the univariate normal distribution to higher dimensions
Quadratic Discriminant Analysis (QDA)
Extends the LDA framework to allow for different covariance matrices for different categories. The resulting discriminant function is quadratic in x.