Statistical Learning
Refers to methods intended to help us understand data
Standard Convention
Each row represents an observation and each column a variable.
Input Variables
Predictors, features, independent variables
Output Variables
Responses, dependent variables
Supervised Learning
Building models that seek to predict the value of an output based on a set of input variables
Learning a rule that approximates the relationship between the predictors and the response
Supervised Learning
Unsupervised Learning
Detecting patterns and relationships in data
Learning a rule for categorizing observations
Unsupervised Learning
Inference
Understanding the relationship between the response and the predictors
Prediction
Entails estimating the value of a response based on observed predictors
Quantitative Response
Regression
Qualitative Response
Classification
Logistic regression is a _______ method.
Classification
We believe there is a relationship between a _______ __ and at least one of the predictors in _
response Y, X
Supervised learning is all about _______ (learning) f
Estimating
f is a _____ but _______ function
fixed, unknown
f represents the _______ information about Y provided by X.
Systematic
∊ is a _________ (noise) with mean zero, independent of X
Random Error
∊ represents…
The effect of unmeasured variables on Y, or unmeasurable variation. This means the same X can lead to different Y values.
We apply a learning method to the ___________ and obtain a ________
Training dataset, fitted model
In the testing stage, we have a _________ test dataset
Separate
Statistical Learning Approach
Use the training data and a statistical method to estimate f
Find a good-fitting function f and use it for prediction
Parametric Method (model)
Reduce the problem from one of estimating f to one of estimating a set of parameters, e.g., β0 and β1
Non-parametric Method (model)
Tend to be more flexible
Models that are more flexible tend to be ____________ and have the potential to _________ the training data.
Less interpretable, overfit
We can measure the quality of a model’s predictions by the _________
Mean Squared Error
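A minimal sketch of the MSE computation (the function name and data are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference between
    observed responses and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```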
The method with the lowest training MSE may not have the
Lowest Test MSE
Training MSE __________ as flexibility _________
Decreases, Increases
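This can be illustrated with nested polynomial fits on a toy dataset (the data here are synthetic and purely illustrative):

```python
import numpy as np

# Synthetic data: a sine curve plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Training MSE for increasingly flexible polynomial fits
train_mse = []
for degree in (1, 3, 6):
    coef = np.polyfit(x, y, degree)
    train_mse.append(float(np.mean((y - np.polyval(coef, x)) ** 2)))

# Each model class contains the previous one, so the training MSE
# can only decrease as the degree grows -- but the test MSE need not.
```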
Test MSE has a U-shape because of the _____________
Bias-variance trade-off
Variance Term
Refers to the uncertainty due to randomness in the training data
Bias Term
Refers to our error in approximating a real-life problem
Var(∊) Term is called ____________
Irreducible Error
Horizontal line in Bias-Variance Trade-Off
Var(∊)
Vertical Line in Bias-Variance Trade-Off
Flexibility level with smallest test MSE
Simple Linear Regression setting
Predicting a quantitative response Y based on a single predictor X
The simple linear regression model is also called the _____________
Population Regression Line
Parameter β0
Intercept (the avg value of Y if X = 0)
Parameter β1
Slope (the avg increase in Y when X is increased by 1)
Error (∊)
Assumed to be normally distributed with mean 0 and variance σ² : ∊ ~ N(0,σ²)
The ___________________ β̂0 and β̂1 minimize RSS
Least squares coefficient estimates
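The closed-form estimates for the simple linear regression case can be sketched directly (function name is illustrative):

```python
import numpy as np

def least_squares_fit(x, y):
    """Simple linear regression coefficient estimates that
    minimize the residual sum of squares (RSS)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```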
Red Line (Probabilistic interpretation of regression)
Population regression line
Blue Line (Probabilistic interpretation of regression)
Line of best fit
Residual Standard Error
Estimate of the standard deviation of ∊, i.e., σ
Residual standard error (RSE) measures the ________ of the model to the training data
Lack of fit
R² statistic measures…
Proportion of variance explained by fitted linear model
R² takes values between ___ and ___, and ______ values indicate better fit.
0, 1, Larger
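A sketch of the R² computation from its definition, 1 − RSS/TSS:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the proportion of the variance in y
    explained by the fitted model."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return float(1 - rss / tss)
```

A perfect fit gives 1; always predicting the mean of y gives 0.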
The Null Hypothesis means…
There is no relationship between X and Y
Alternative Hypothesis means…
There is some relationship between X and Y
We want to strongly _____ the null hypothesis, i.e., obtain a very low _______.
Reject, p-value
Polynomial Regression
More higher-order terms → more flexible model → more potential for overfitting
Residual Plots
Can tell us a lot about the relationship between Y and X
In a residual plot, if the points are not vertically centered around 0 for all x, it suggests a ___________ relationship between Y and X.
Non-linear
In a residual plot, if the point cloud has different vertical spreads for different x's (e.g., a funnel shape), it suggests the variance σ² is ________ across x. This is called _________.
Non-constant, Heteroskedasticity
We sometimes want to report ___________ for the expected response given a particular predictor.
Confidence Intervals
We sometimes want to report ________ for the response given a particular predictor, Y | x
Prediction Intervals
Confidence Intervals give….
Plausible values for f(x) = E[Y|x] (the average output)
Prediction intervals give…
Plausible values for Y|x (an individual output)
Multiple Linear Regression
β0 : intercept (avg value of Y if all predictors are 0)
βj : slope of jth predictor
βj is the average increase in Y if Xj is increased by 1 and…
All other predictors are held constant
R performs a different ___________ (called the F test)
Model Utility Test
A small p-value indicates that _________ of the predictors has a statistically significant relationship with the response.
At least one
The t-test for the jth coefficient measures the _________ of adding the jth predictor to the model when the other p-1 predictors are already in it.
Partial Effect
The p-value for the F-test tells us…
Whether the multiple linear regression model is reasonable
The p-values for the tests for each predictor can be helpful for…
Choosing which input variables to include in the final model
For multiple linear regression, we often consider the __________ value, which penalizes including superfluous predictors in the model.
Adjusted R²
_______ is guaranteed to increase when adding predictors. ______ is not.
R², Adjusted R²
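The penalty is visible in the formula itself (n observations, p predictors):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    For a fixed R^2, a larger p lowers the adjusted value,
    penalizing superfluous predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```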
Cross-validation Methods are used when…
Trying out several learning methods and want to find the one with the best (lowest) test MSE.
How can we estimate test MSE/error rate using only the training dataset?
Validation Set Approach
Leave-One-Out Cross-Validation (LOOCV)
k-Fold Cross Validation
Validation set approach Setup
Randomly split the available data into two parts
Training Set
Observations that will be used to train the models
Validation Set
Observations that will be used for testing models
Choose the model with the _______ test MSE when applied to the validation data.
Lowest
For LOOCV, we validate __________ and aggregate results.
Multiple times
For k-Fold CV, like LOOCV, we validate multiple times, but only fit ________, not n.
k Models
Validation set approach is easy to implement, but statistical methods tend to do better when…
Trained on more data
LOOCV and k-Fold CV allow us to ______________ for a model fit on ________.
Estimate the test MSE, available data
LOOCV is computationally intensive because…
We apply the same method n times!
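A sketch of k-fold CV for polynomial regression (synthetic data; the helper name is illustrative):

```python
import numpy as np

def kfold_cv_mse(x, y, k=5, degree=1, seed=0):
    """Estimate test MSE: hold out each of k folds in turn,
    fit on the remaining folds, average the held-out MSEs."""
    idx = np.random.default_rng(seed).permutation(len(x))
    fold_mses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])
        fold_mses.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(fold_mses))
```

Setting k = len(x) fits n models, one per left-out observation, which is exactly LOOCV.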
Classification deals with…
Predicting a qualitative response
Classifier
A rule for assigning a newly observed combination of input variables to an output category
A classifier can make two types of errors: __________ and ________.
False positives and False negatives
_____________ is a classification method.
Logistic Regression
Comparative Boxplots
A useful way to visualize the distributions of input variables for each of the output categories.
Training Error Rate
The fraction of misclassified training observations
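The error rate is a one-liner (names illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs
    from the true class."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```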
Suppose that Y can be in one of J categories, indexed j = 1,2,…,J. Conditional on X, Y is assumed to be _________ distributed with _____________.
Multinomially, Probability mass function (pmf)
For classification, P(Y = j | .) is fixed but ________
Unknown (seek to approximate it)
Bayes Classifier
A classifier that would minimize the average misclassification fraction
Bayes Classifier is an ________ ideal.
Unattainable
Bayes Error Rate
Represents the best we can do on a classification problem; analogous to the irreducible error
Bayes Classifier: Different categories are represented by _______ and ________
Blue, Orange
Bayes Decision Boundary
The purple dashed line
K-nearest Neighbors (KNN)
Nonparametric method that directly attempts to estimate P(Y = j | X = x0) by looking at the categories (outputs) of neighbors of x0
The RHS of the KNN estimate features…
The empirical proportions of nearby observations in each class
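A minimal KNN classifier sketch (function name and toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k=3):
    """Predict the class of x0 as the most common class among
    its k nearest training observations (majority vote on the
    empirical class proportions of the neighbors)."""
    X_train = np.asarray(X_train, float)
    dists = np.linalg.norm(X_train - np.asarray(x0, float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]
```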
KNN Classifier
Black solid Line
Parametric Methods
LDA, QDA, and Naive Bayes
Linear Discriminant Analysis (LDA)
Assumes that for observations with outputs in category j, the predictor is normally distributed with mean μj and variance σ²
LDA assumes ________ variances across categories.
Equal
Substituting the normal density into Bayes’ Theorem, taking the log of both sides, and removing extra terms gives a _______________ that is ________ in x.
Discriminant function, Linear
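In the one-predictor case this discriminant is commonly written as (with πj the prior probability of class j):

```latex
\delta_j(x) = x \cdot \frac{\mu_j}{\sigma^2} - \frac{\mu_j^2}{2\sigma^2} + \log(\pi_j)
```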
LDA assumes that for observations in category j, the predictors are _____________ with mean vector μj and covariance matrix Σ.
Multivariate normally distributed
Multivariate Normal Distribution
Extends the univariate normal distribution to higher dimensions
Quadratic Discriminant Analysis (QDA)
Extends the LDA framework to allow for different covariance matrices for different categories. The resulting discriminant function is quadratic in x.