Clustering Problem
Grouping individuals according to observed characteristics
Feature Selection
How to select the best set of predictors
Trees:
split the predictor space into subsets and predict with the mean (regression) or mode (classification) of each subset
Response (Target)
Value we wish to predict
Generally represented as 'y'
R-squared is low when variance is ________:
high
Bias-Variance Tradeoff
As either bias or variance decreases, the other increases
How do we identify the tree?
Recursive Binary Splitting
Linear regression is for predicting a _______________ response:
quantitative
Features (Predictors)
Input values
Generally represented as 'X = (X1, X2, X3)'
Model Selection
How to select the best linear model
Tuning
How to modify coefficients to get a better bias-variance trade-off
Supervised Learning
Have both predictors and outcome measures for each observation in training data
estimate =
B0
How do we prevent overfitting a tree?
Bagging, random forests, boosting
An over-fitted model is one with low:
Bias
Parametric models assume:
That the data-generating process follows a probability distribution with a fixed set of parameters
Top-down:
The algorithm begins with all observations in a single region, then successively splits the predictor space
standard error:
a measure of the accuracy of a prediction under the logic of repeated sampling
Unsupervised Learning
Have predictors (x), but no responses (y)
Often look for "clusters" to relate the data
Logistic Regression Coefficient Explanation
Gives the change in log odds of an outcome for a one unit increase in the predictor variable
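A quick numeric illustration (the coefficient value 0.7 is invented for the example): exponentiating the coefficient turns the change in log odds into an odds multiplier.

```python
import math

# Hypothetical fitted logistic regression coefficient (illustrative value).
beta = 0.7

# A one-unit increase in the predictor adds beta to the log odds,
# which multiplies the odds by exp(beta).
odds_multiplier = math.exp(beta)
print(f"odds multiplied by {odds_multiplier:.2f} per unit increase")  # ~2.01
```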
Shrinkage
Fit a model involving all predictors, shrinking the estimated coefficients towards zero relative to the least squares estimates. Has the effect of reducing variance
Dimensionality Reduction
Project the p predictors into an M-dimensional space (M < p)
Precision
When the classifier predicts yes, how often is it correct
TP / (TP + FP)
σ:
The residual standard error
Greedy:
algorithm looks for a locally optimal choice at each split
Variance
Error in the predicted value across different training data samples
Recall
How often does the classifier predict yes, when the actual process is yes
TP / (TP + FN)
Best Subset Selection
Create a model for every possible combination of predictors and pick the best one. For each number of predictors, pick the model with the highest R^2, then cross-validate those candidates and pick the best. Too computationally expensive for large values of p.
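A minimal NumPy sketch of the idea (the r2 helper and the simulated data are invented for illustration):

```python
from itertools import combinations
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return 1 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # 2^4 - 1 = 15 candidate models
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)

# For each model size, keep the subset with the highest R^2; those
# winners would then be compared against each other by cross-validation.
for size in range(1, X.shape[1] + 1):
    best = max(combinations(range(X.shape[1]), size),
               key=lambda s: r2(X[:, list(s)], y))
    print(size, best, round(r2(X[:, list(best)], y), 3))
```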
Reducible Error
Error that stems from an inaccurate model and can be reduced
Inference
How Y is changing as a function of X
Want to learn about relationships between predictors and Y
recursive binary splitting:
repeatedly split the training observations into 2 regions, choosing the predictor j and cutpoint s that best reduce RSS, until a stopping criterion is reached
The standard errors of the coefficients are proportional to the standard error of the ____________ and inversely proportional to the ________________ of the sample size.
1) regression
2) square root
Accuracy
How often is the classifier correct
(TN + TP) / N
Forward Stepwise Model
Start with 0 predictors and repeatedly add the one predictor that most increases the model's R^2. Downside is that it doesn't always pick the best subset because earlier predictors can't be removed. p can be larger than n
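The same idea as a greedy search, sketched in NumPy (helper and data are invented for illustration):

```python
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return 1 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=100)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Greedily add the single predictor that most increases R^2;
    # predictors already chosen are never removed.
    best = max(remaining, key=lambda j: r2(X[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)
    print(selected, round(r2(X[:, selected], y), 3))
```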
hypothesis testing:
using the standard error to test whether there is a relationship between X and Y
Irreducible Error
Error that stems from the random error term (noise)
A decision tree may over fit the data because:
The tree is too complex
Backward Stepwise Model
Exact opposite of Forward Stepwise Model. N must be larger than P.
Misclassification Rate
How often is the classifier wrong
(FP + FN) / N
RSE(residual standard error):
- The average amount the response will deviate from the true regression line
- a measure of the lack of fit of the model to the data
A _________ tree with fewer splits may lead to lower variance and better interpretation at the cost of a little bias
smaller tree
Irreducible Error
epsilon (noise)
False Positive Rate
How often does the classifier predict yes when the actual value is no?
FP / True No = FP / (FP + TN)
Ridge Regression
Adds a "shrinkage penalty" to the RSS when fitting a linear model to bring the coefficients closer to zero. Lambda is used to tune it. Small increase in bias for large decrease in variance.
Tree pruning:
grow a large tree T0, then prune it to a subtree with less variance
tree pruning, choosing between 2 subtrees:
estimate test errors with cross-validation
Parametric Model
Assumes the data-generating process follows a probabilistic distribution with a fixed set of parameters
2-steps:
1.) Assume a functional form of f (e.g. f is linear with x):
f(x) = b0 + b1X1 + b2X2 + ... + bnXn
2.) Select a method to fit the model
True Negative Rate
How often does the classifier predict no when the actual value is no?
TN / True No = TN / (TN + FP)
Lasso Method
Also adds a shrinkage penalty like ridge regression, but if lambda is large enough, it will zero out some predictors. This makes it useful for feature selection.
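A short scikit-learn sketch contrasting the two penalties (library assumed available; alpha plays the role of lambda, and the data are simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + rng.normal(size=100)   # only the first predictor matters

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.round(ridge.coef_, 2))  # all coefficients shrunk toward zero
print(np.round(lasso.coef_, 2))  # irrelevant coefficients set exactly to zero
```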
False Negative Rate
How often does the classifier predict no when the actual value is yes?
FN / True Yes = FN / (FN + TP)
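All of the rate cards above derive from the same four confusion-matrix counts; a tiny sketch (counts invented for illustration):

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, FP, TN, FN = 40, 10, 45, 5
N = TP + FP + TN + FN

print("accuracy:", (TP + TN) / N)     # how often the classifier is correct
print("precision:", TP / (TP + FP))   # predicted yeses that are right
print("recall:", TP / (TP + FN))      # actual yeses that are found
print("FPR:", FP / (FP + TN))         # FP / True No
print("TNR:", TN / (TN + FP))         # TN / True No
print("FNR:", FN / (FN + TP))         # FN / True Yes
```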
Polynomial Regression
yi = b0 + b1xi + b2xi^2 + noise
More interested in fitted values than coefficients.
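A NumPy sketch of fitting that quadratic (simulated data; the true coefficients are chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=50)  # b0=1, b1=2, b2=0.5

# Fit y = b0 + b1*x + b2*x^2 by least squares;
# np.polyfit returns coefficients highest degree first.
b2, b1, b0 = np.polyfit(x, y, deg=2)
fitted = b0 + b1 * x + b2 * x**2   # the fitted values are the main interest
print(round(b0, 2), round(b1, 2), round(b2, 2))
```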
Non-parametric Model
Does not assume (or makes fewer assumptions) about the shape or parameters of the population distribution that generated data
Goal is to estimate f such that f is as close as possible to the data points without overfitting
The higher the flexibility, the less the inference
tree pruning, what do you do if there are too many possible subtrees?:
Use cost complexity pruning to select a small set of subtrees for consideration
Cost complexity pruning:
α = tuning parameter
as α increases there is a price for having a large tree, so the quantity will be minimized for a smaller tree
Prediction accuracy vs. interpretability
Linear models are easy to interpret; thin-plate splines are not
Training Error Rate
Average error that results from using a statistical learning method to predict the response of an observation in the training set
Step Functions
Break x into different parts and create a constant for each part
Cost complexity pruning, when α = 0:
the subtree T will simply equal T0 (full size tree)
Good fit vs. overfit or underfit
How do we know when the fit is just right?
Test Error Rate
Average error that results from using a statistical learning method to predict the response of an observation not in the training set
Knots
The points where coefficients change in a Regression Spline
How to build a regression tree:
1) use recursive binary splitting to make a large tree
2) apply cost complexity pruning to get a sequence of best subtrees as a function of α
3) use K-fold cross-validation to choose α: repeat steps 1 and 2 K times and pick the α that minimizes the average error (see the sketch below)
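A scikit-learn sketch of the three steps (library assumed available; ccp_alpha is sklearn's name for the cost-complexity parameter α, and the data are simulated):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# Steps 1-2: grow a large tree and get the candidate alpha sequence
# from cost complexity pruning.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Step 3: choose the alpha with the best K-fold cross-validated error.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
print("best alpha:", path.ccp_alphas[int(np.argmax(scores))])
```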
Parsimony vs. Black-Box
We often (especially for inference) prefer a simpler model involving fewer predictors over a black-box involving many predictors
Validation set approach
Split the data set into training and testing data at the start
Splines
Piecewise defined polynomial functions with a high degree of smoothness between knots
Natural Spline
A regression spline constrained to be linear beyond the boundary (last) knots
Leave One Out Cross Validation (LOOCV)
Train on n-1 observations over and over, average error rate
classification tree:
Like a regression tree, but for qualitative responses: predicts the mode of each region and uses the classification error rate (the fraction of observations in a region not belonging to the most common class) instead of RSS
p(hat)mk =
the proportion of training observations in the mth region that are from the kth class
K-Fold Cross Validation
Separate the dataset into k folds. Train on k - 1 folds, test on the remaining fold, and average the test errors
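The mechanics, sketched by hand in NumPy for an OLS model (function name and data are invented for illustration):

```python
import numpy as np

def kfold_mse_ols(X, y, k=5, seed=0):
    """K-fold CV for OLS: train on k-1 folds, test on the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A_tr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        A_te = np.column_stack([np.ones(len(fold)), X[fold]])
        errors.append(np.mean((y[fold] - A_te @ beta) ** 2))
    return np.mean(errors)   # LOOCV is the special case k = len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=60)
print(kfold_mse_ols(X, y, k=5))
```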
High Variance
Regression Spline
Combo of polynomial and step functions. Flexibility is increased by adding extra knots rather than higher-degree polynomials
John Snow
Showed that a contaminated water pump was spreading cholera in London using data visualization
Smoothing Spline
Applies a loss + penalty criterion to a regression spline, penalizing the integrated squared second derivative so that the fit is smoother. Uses lambda to tune it
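The criterion being minimized is the standard smoothing-spline form (g is the function being fit):

```latex
\min_{g} \; \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 \; + \; \lambda \int g''(t)^2 \, dt
```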
Gini index:
a measure of total variance across the K classes. small value = node mostly has values from a single class
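In symbols (the standard definition, with p(hat)mk as defined above):

```latex
G_m = \sum_{k=1}^{K} \hat{p}_{mk} \, (1 - \hat{p}_{mk})
```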
John Tukey
Co-created the Fast Fourier Transform (Cooley-Tukey) algorithm, coined the terms "bit" and "software", and invented the box plot
Generalized Additive Model (GAM)
Allows the functions discussed previously to be applied across multiple predictors
advantages of trees:
1) Trees are very easy to explain to people
2) (perhaps) decision trees more closely mirror human decision-making
3) Trees can be displayed graphically
4) trees can easily handle qualitative predictors
Purpose of EDA
To develop an intuition about your data set
disadvantages of trees:
1) Trees (generally) have less predictive accuracy than other approaches
2) trees can be non-robust, where a small change makes a big difference
Bagging (bootstrap aggregation):
a general-purpose procedure for reducing the variance of a statistical learning method
support vector machines (SVM):
High flexibility, low interpretability
intended for binary classification
supervised
extension of support vector classifier, which extends maximum margin classifier
Bagging steps:
reduce variance and increase accuracy by:
1) taking many training sets from the population
2) building a separate prediction model from each training set
3) averaging the resulting predictions
Maximum margin classifier (+ support vector classifier):
based on concept of separating hyperplane
hyperplane yi = 1
above the hyperplane
hyperplane yi = -1
below the hyperplane
maximum margin classifier chooses the hyperplane that is....:
farthest from the training observations (largest margin)
Support vectors:
The 3 observations that lie along the dashed line indicating the width of the hyperplane's margin
maximum margin classifier problems:
1) when no separating hyperplane exists, there is no maximum margin classifier
2) the maximum margin classifier is extremely sensitive to a change in a single observation and may overfit the training data
The support Vector Classifier:
used to construct a hyperplane that does not perfectly separate the two classes
Support vector classifier yields:
1) Greater robustness to individual observations
2) better classification of most of the training observations
the support vector classifier allows:
some observations to be on the incorrect side of the margin (or hyperplane)
support vector classifier is also called:
soft margin classifier
support vector classifier slack variables (εi):
if εi > 0, the ith observation is on the wrong side of the margin
if εi > 1, the ith observation is on the wrong side of the hyperplane
support vector classifier C:
C is the tuning parameter, with larger values allowing more errors in the classification (more bias)
when C is small:
low bias but high variance
when C is large:
more bias, lower variance
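A scikit-learn sketch of tuning C (library assumed available). Caution: sklearn's C is the inverse of the budget described above, so in sklearn a small C permits more margin violations (more bias, less variance):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Simulated binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for c in (0.01, 1.0, 100.0):
    score = cross_val_score(SVC(kernel="linear", C=c), X, y, cv=5).mean()
    print(c, round(score, 3))
```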
support vectors:
observations on the margin or wrong side that affect the hyperplane
support vector classifier performs poorly when....:
there are non-linear class boundaries
bootstrap:
taking random samples with replacement from the (single) training data set. Train the model on the bth bootstrapped training set to get the estimate f(hat)*b(x)
Bagging applied to trees, bias-variance:
with bootstrapped training sets, build B unpruned trees and average their predictions (majority vote for qualitative responses), as sketched below
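A sketch of bagging by hand for regression trees (scikit-learn assumed available; data simulated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)
X_new = rng.normal(size=(10, 3))

B = 100
preds = np.zeros((B, len(X_new)))
for b in range(B):
    boot = rng.integers(0, len(y), size=len(y))            # sample rows with replacement
    tree = DecisionTreeRegressor().fit(X[boot], y[boot])   # unpruned tree
    preds[b] = tree.predict(X_new)

bagged = preds.mean(axis=0)   # average the B predictions
print(np.round(bagged, 2))
```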