Clustering Problem
Grouping individuals according to observed characteristics
Feature Selection
How to select the best set of predictors
Trees:
split the predictor space into subsets and predict with the mean (regression) or mode (classification) of each subset
Response (Target)
Value we wish to predict
Generally represented as 'y'
R-squared is low when variance is ________:
high
Bias-Variance Tradeoff
As either bias or variance decreases, the other increases
How do we identify the tree?
Recursive Binary Splitting
Linear regression is for predicting a _______________ response:
quantitative
Features (Predictors)
Input values
Generally represented as 'X = (X1, X2, X3)'
Model Selection
How to select the best linear model
Tuning
How to modify coefficients to get a better bias-variance trade-off
Supervised Learning
Have both predictors and outcome measures for each observation in training data
estimate =
B0
How do we prevent overfitting a tree?
Bagging, random forests, boosting
An over-fitted model is one with low:
Bias
Parametric models assume:
That the data-generating process follows a probability distribution with a fixed set of parameters
Top-down:
The algorithm begins with all observations in a single region, then successively splits the predictor space
standard error:
a measure of the accuracy of a prediction under the logic of repeated sampling
Unsupervised Learning
Have predictors (x), but no responses (y)
Often look for "clusters" to relate the data
Logistic Regression Coefficient Explanation
Gives the change in log odds of an outcome for a one unit increase in the predictor variable
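A quick numeric illustration (the coefficient value 0.7 is invented for the example): exponentiating the coefficient turns the change in log odds into an odds multiplier.

```python
import math

# Hypothetical fitted logistic regression coefficient (illustrative value).
beta = 0.7

# A one-unit increase in the predictor adds beta to the log odds,
# which multiplies the odds by exp(beta).
odds_multiplier = math.exp(beta)
print(f"odds multiplied by {odds_multiplier:.2f} per unit increase")  # ~2.01
```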
Shrinkage
Fit a model involving all predictors, shrinking the estimated coefficients towards zero relative to the least squares estimates. Has the effect of reducing variance
Dimensionality Reduction
Project the p predictors into an M-dimensional space (M < p)
Precision
When the classifier predicts yes, how often is it correct
TP / (TP + FP)
σ:
The residual standard error
Greedy:
algorithm looks for a locally optimal choice at each split
Variance
Error in the predicted value across different training data samples
Recall
How often does the classifier predict yes, when the actual process is yes
TP / (TP + FN)
Best Subset Selection
Create a model for every possible combination of predictors and pick the best one. For each number of predictors, pick the model with the highest R^2, then cross-validate those candidates and pick the best. Too computationally expensive for large values of p.
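A minimal NumPy sketch of the idea (the r2 helper and the simulated data are invented for illustration):

```python
from itertools import combinations
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return 1 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # 2^4 - 1 = 15 candidate models
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)

# For each model size, keep the subset with the highest R^2; those
# winners would then be compared against each other by cross-validation.
for size in range(1, X.shape[1] + 1):
    best = max(combinations(range(X.shape[1]), size),
               key=lambda s: r2(X[:, list(s)], y))
    print(size, best, round(r2(X[:, list(best)], y), 3))
```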
Reducible Error
Error that stems from an inaccurate model and can be reduced
Inference
How Y is changing as a function of X
Want to learn about relationships between predictors and Y
recursive binary splitting:
repeatedly split the training observations into 2 regions, choosing the predictor j and cutpoint s that best reduce RSS, until a stopping criterion is reached
The standard errors of the coefficients are proportional to the standard error of the ____________ and inversely proportional to the ________________ of the sample size.
1) regression
2) square root
Accuracy
How often is the classifier correct
(TN + TP) / N
Forward Stepwise Model
Start with 0 predictors and repeatedly add the one predictor that most increases the model's R^2. Downside is that it doesn't always pick the best subset because earlier predictors can't be removed. p can be larger than n
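The same idea as a greedy search, sketched in NumPy (helper and data are invented for illustration):

```python
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return 1 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=100)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Greedily add the single predictor that most increases R^2;
    # predictors already chosen are never removed.
    best = max(remaining, key=lambda j: r2(X[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)
    print(selected, round(r2(X[:, selected], y), 3))
```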
hypothesis testing:
using the standard error to test whether there is a relationship between X and Y
Irreducible Error
Error that stems from the random error term (noise)
A decision tree may over fit the data because:
The tree is too complex
Backward Stepwise Model
Exact opposite of Forward Stepwise Model. N must be larger than P.
Misclassification Rate
How often is the classifier wrong
(FP + FN) / N
RSE(residual standard error):
- The average amount the response will deviate from the true regression line
- a measure of the lack of fit of the model to the data
A _________ tree with fewer splits may lead to lower variance and better interpretation at the cost of a little bias
smaller tree
Irreducible Error
epsilon (noise)
False Positive Rate
How often does the classifier predict yes when the actual value is no?
FP / True No = FP / (FP + TN)
Ridge Regression
Adds a "shrinkage penalty" to the RSS when fitting a linear model to bring the coefficients closer to zero. Lambda is used to tune it. Small increase in bias for large decrease in variance.
Tree pruning:
grow a large tree T0, then prune it to a subtree with less variance
tree pruning, choosing between 2 subtrees:
estimate test errors with cross-validation
Parametric Model
Assumes the data-generating process follows a probabilistic distribution with a fixed set of parameters
2-steps:
1.) Assume a functional form of f (e.g. f is linear with x):
f(x) = b0 + b1X1 + b2X2 + ... + bnXn
2.) Select a method to fit the model
True Negative Rate
How often does the classifier predict no when the actual value is no?
TN / True No = TN / (TN + FP)
Lasso Method
Also adds a shrinkage penalty like ridge regression, but if lambda is large enough, it will zero out some predictors. This makes it useful for feature selection.
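A short scikit-learn sketch contrasting the two penalties (library assumed available; alpha plays the role of lambda, and the data are simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + rng.normal(size=100)   # only the first predictor matters

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.round(ridge.coef_, 2))  # all coefficients shrunk toward zero
print(np.round(lasso.coef_, 2))  # irrelevant coefficients set exactly to zero
```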
False Negative Rate
How often does the classifier predict no when the actual value is yes?
FN / True Yes = FN / (FN + TP)
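All of the rate cards above derive from the same four confusion-matrix counts; a tiny sketch (counts invented for illustration):

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, FP, TN, FN = 40, 10, 45, 5
N = TP + FP + TN + FN

print("accuracy:", (TP + TN) / N)     # how often the classifier is correct
print("precision:", TP / (TP + FP))   # predicted yeses that are right
print("recall:", TP / (TP + FN))      # actual yeses that are found
print("FPR:", FP / (FP + TN))         # FP / True No
print("TNR:", TN / (TN + FP))         # TN / True No
print("FNR:", FN / (FN + TP))         # FN / True Yes
```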
Polynomial Regression
yi = b0 + b1xi + b2xi^2 + noise
More interested in fitted values than coefficients.
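A NumPy sketch of fitting that quadratic (simulated data; the true coefficients are chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=50)  # b0=1, b1=2, b2=0.5

# Fit y = b0 + b1*x + b2*x^2 by least squares;
# np.polyfit returns coefficients highest degree first.
b2, b1, b0 = np.polyfit(x, y, deg=2)
fitted = b0 + b1 * x + b2 * x**2   # the fitted values are the main interest
print(round(b0, 2), round(b1, 2), round(b2, 2))
```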
Non-parametric Model
Does not assume (or makes fewer assumptions) about the shape or parameters of the population distribution that generated data
Goal is to estimate f such that f is as close as possible to the data points without overfitting
The higher the flexibility, the less the inference
tree pruning, what do you do if there are too many possible subtrees?:
Use cost complexity pruning to select a small set of subtrees for consideration
Cost complexity pruning:
α = tuning parameter
as α increases there is a price for having a large tree, so the quantity will be minimized for a smaller tree
Prediction accuracy vs. interpretability
Linear models are easy to interpret; thin-plate splines are not
Training Error Rate
Average error that results from using a statistical learning method to predict the response of an observation in the training set
Step Functions
Break x into different parts and create a constant for each part
Cost complexity pruning, when α = 0:
the subtree T will simply equal T0 (full size tree)
Good fit vs. overfit or underfit
How do we know when the fit is just right?
Test Error Rate
Average error that results from using a statistical learning method to predict the response of an observation not in the training set
Knots
The points where coefficients change in a Regression Spline
How to build a regression tree:
1) use recursive binary splitting to make a large tree
2) apply cost complexity pruning to get a sequence of best subtrees as a function of α
3) use K-fold cross-validation to choose α: repeat steps 1 and 2 K times and pick the α that minimizes the average error (see the sketch below)
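A scikit-learn sketch of the three steps (library assumed available; ccp_alpha is sklearn's name for the cost-complexity parameter α, and the data are simulated):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# Steps 1-2: grow a large tree and get the candidate alpha sequence
# from cost complexity pruning.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Step 3: choose the alpha with the best K-fold cross-validated error.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
print("best alpha:", path.ccp_alphas[int(np.argmax(scores))])
```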
Parsimony vs. Black-Box
We often (especially for inference) prefer a simpler model involving fewer predictors over a black-box involving many predictors
Validation set approach
Split the data set into training and testing data at the start
Splines
Piecewise defined polynomial functions with a high degree of smoothness between knots
Natural Spline
A regression spline constrained to be linear beyond the boundary (last) knots
Leave One Out Cross Validation (LOOCV)
Train on n-1 observations over and over, average error rate
classification tree:
Like a regression tree, but for qualitative responses: predicts the mode of each region and uses the classification error rate (the fraction of observations in a region not belonging to the most common class) instead of RSS
p(hat)mk =
the proportion of training observations in the mth region that are from the kth class
K-Fold Cross Validation
Separate the dataset into k folds. Train on k - 1 folds, test on the remaining fold, and average the test errors
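The mechanics, sketched by hand in NumPy for an OLS model (function name and data are invented for illustration):

```python
import numpy as np

def kfold_mse_ols(X, y, k=5, seed=0):
    """K-fold CV for OLS: train on k-1 folds, test on the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A_tr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        A_te = np.column_stack([np.ones(len(fold)), X[fold]])
        errors.append(np.mean((y[fold] - A_te @ beta) ** 2))
    return np.mean(errors)   # LOOCV is the special case k = len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=60)
print(kfold_mse_ols(X, y, k=5))
```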
High Variance
Regression Spline
Combo of polynomial and step functions. Flexibility is increased by adding extra knots rather than higher-degree polynomials
John Snow
Showed that a contaminated water pump was spreading cholera in London using data visualization
Smoothing Spline
Applies a loss + penalty criterion to a regression spline, penalizing the integrated squared second derivative so that the fit is smoother. Uses lambda to tune it
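The criterion being minimized is the standard smoothing-spline form (g is the function being fit):

```latex
\min_{g} \; \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 \; + \; \lambda \int g''(t)^2 \, dt
```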
Gini index:
a measure of total variance across the K classes. small value = node mostly has values from a single class
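In symbols (the standard definition, with p(hat)mk as defined above):

```latex
G_m = \sum_{k=1}^{K} \hat{p}_{mk} \, (1 - \hat{p}_{mk})
```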
John Tukey
Co-created the Fast Fourier Transform (Cooley-Tukey) algorithm, coined the terms "bit" and "software", and invented the box plot
Generalized Additive Model (GAM)
Allows the functions discussed previously to be applied across multiple predictors
advantages of trees:
1) Trees are very easy to explain to people
2) (perhaps) decision trees more closely mirror human decision-making
3) Trees can be displayed graphically
4) trees can easily handle qualitative predictors
Purpose of EDA
To develop an intuition about your data set
disadvantages of trees:
1) Trees (generally) have less predictive accuracy than other approaches
2) trees can be non-robust, where a small change makes a big difference
Bagging (bootstrap aggregation):
a general-purpose procedure for reducing the variance of a statistical learning method
support vector machines (SVM):
High flexibility, low interpretability
intended for binary classification
supervised
extension of support vector classifier, which extends maximum margin classifier
Bagging steps:
reduce variance and increase accuracy by:
1) taking many training sets from the population
2) building a separate prediction model from each training set
3) averaging the resulting predictions
Maximum margin classifier (+ support vector classifier):
based on concept of separating hyperplane
hyperplane yi = 1
above the hyperplane
hyperplane yi = -1
below the hyperplane
maximum margin classifier chooses the hyperplane that is....:
farthest from the training observations (largest margin)
Support vectors:
The 3 observations that lie along the dashed line indicating the width of the hyperplane's margin
maximum margin classifier problems:
1) when no separating hyperplane exists, there is no maximum margin classifier
2) the maximum margin classifier is extremely sensitive to a change in a single observation and may overfit the training data
The support Vector Classifier:
used to construct a hyperplane that does not perfectly separate the two classes
Support vector classifier yields:
1) Greater robustness to individual observations
2) better classification of most of the training observations
the support vector classifier allows:
some observations to be on the incorrect side of the margin (or hyperplane)
support vector classifier is also called:
soft margin classifier
support vector classifier slack variables (εi):
if εi > 0, the ith observation is on the wrong side of the margin
if εi > 1, the ith observation is on the wrong side of the hyperplane
support vector classifier C:
C is the tuning parameter, with larger values allowing more errors in the classification (more bias)
when C is small:
low bias but high variance
when C is large:
more bias, lower variance
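A scikit-learn sketch of tuning C (library assumed available). Caution: sklearn's C is the inverse of the budget described above, so in sklearn a small C permits more margin violations (more bias, less variance):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Simulated binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for c in (0.01, 1.0, 100.0):
    score = cross_val_score(SVC(kernel="linear", C=c), X, y, cv=5).mean()
    print(c, round(score, 3))
```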
support vectors:
observations on the margin or wrong side that affect the hyperplane
support vector classifier performs poorly when....:
there are non-linear class boundaries
bootstrap:
taking random samples with replacement from the (single) training data set. Train the model on the bth bootstrapped training set to get the estimate f(hat)*b(x)
Bagging applied to trees, bias-variance:
with bootstrapped training sets, build B unpruned trees and average their predictions (majority vote for qualitative responses), as sketched below
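A sketch of bagging by hand for regression trees (scikit-learn assumed available; data simulated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)
X_new = rng.normal(size=(10, 3))

B = 100
preds = np.zeros((B, len(X_new)))
for b in range(B):
    boot = rng.integers(0, len(y), size=len(y))            # sample rows with replacement
    tree = DecisionTreeRegressor().fit(X[boot], y[boot])   # unpruned tree
    preds[b] = tree.predict(X_new)

bagged = preds.mean(axis=0)   # average the B predictions
print(np.round(bagged, 2))
```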