CPSC 4300 Final Exam


107 Terms

1

Clustering Problem

Grouping individuals according to observed characteristics

2

Feature Selection

How to select the best set of predictors

3

Trees:

split the predictor space into subsets and predict with the mean (regression) or mode (classification) of each subset

4

Response (Target)

Value we wish to predict

Generally represented as 'y'

5

R-squared is low when variance is ________:

high

6

Bias-Variance Tradeoff

As either bias or variance decreases, the other increases

7

How do we identify the tree?

Recursive Binary Splitting

8

Linear regression is for predicting a _______________ response:

quantitative

9

Features (Predictors)

Input values

Generally represented as 'X= (X1, X2, X3)'

10

Model Selection

How to select the best linear model

11

Tuning

How to adjust a model's coefficients (e.g., via a tuning parameter) to get a better bias-variance trade-off

12

Supervised Learning

Have both predictors and outcome measures for each observation in training data

13

estimate =

B0

14

How do we prevent overfitting a tree?

Bagging, random forests, boosting

15

An over-fitted model is one with low:

Bias

16

Parametric models assume:

That the data-generating process follows a probability distribution with a fixed set of parameters

17

Top-down:

The algorithm begins with all observations in a single region, then successively splits the predictor space

18

standard error:

a measure of the accuracy of a prediction under the logic of repeated sampling

19

Unsupervised Learning

Have predictors (x), but no responses (y)

Often look for "clusters" to relate the data

20

Logistic Regression Coefficient Explanation

Gives the change in log odds of an outcome for a one unit increase in the predictor variable
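
In symbols (the standard logistic regression model, consistent with the card above):

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p

A one-unit increase in X_j adds \beta_j to the log odds, i.e., multiplies the odds by e^{\beta_j}.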

21

Shrinkage

Fit a model involving all predictors, shrinking the estimated coefficients towards zero relative to the least squares estimates. Has the effect of reducing variance

22

Dimensionality Reduction

Project the P predictors into an M-dimensional space, where M < P

23

Precision

When the classifier predicts yes, how often is it correct

TP / (TP + FP)

24

σ:

The residual standard error

25

Greedy:

algorithm looks for a locally optimal choice at each split

26

Variance

Variation in the predicted value across different training data samples

27

Recall

How often does the classifier predict yes when the actual value is yes?

TP / (TP + FN)

28

Best Subset Selection

Create a model for every possible combination of predictors and pick the best one. Keep the model with the highest R^2 for each number of predictors, then cross-validate those finalists and pick the best. Too computationally expensive (2^P models) for high values of P.
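
A minimal sketch of best subset selection in Python, assuming scikit-learn and NumPy are available (the helper best_subsets is illustrative, not from the course):

from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subsets(X, y):
    # Fits a model for every combination of predictors -- 2^p fits, which
    # is why this is too expensive for large p -- and keeps, for each
    # subset size, the combination with the highest R^2.
    p = X.shape[1]
    best = {}  # subset size -> (R^2, predictor indices)
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xk = X[:, list(cols)]
            r2 = LinearRegression().fit(Xk, y).score(Xk, y)
            if k not in best or r2 > best[k][0]:
                best[k] = (r2, cols)
    return best  # cross-validate these finalists to pick the final model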

29

Reducible Error

Error that stems from an inaccurate model; it can be reduced by improving the model

30

Inference

How Y is changing as a function of X

Want to learn about relationships between predictors and Y

31

recursive binary splitting:

repeatedly split the training observations into 2 regions, at each step choosing the predictor j and cutpoint s that most reduce the RSS, until a stopping criterion is reached
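
In ISLR's notation, each split picks the predictor j and cutpoint s whose half-planes give the greatest reduction in RSS:

R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \geq s\}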

32

The standard errors of the coefficients are proportional to the standard error of the ____________ and inversely proportional to the ________________ of the sample size.

1) regression

2) square root

33

Accuracy

How often is the classifier correct

(TN + TP) / N

34

Forward Stepwise Model

Start with 0 predictors and repeatedly add the one predictor that yields the highest R^2 for the model. Downside: it doesn't always find the best subset, because predictors added earlier cannot be removed. Works even when P is larger than N

35

hypothesis testing:

using the standard error to test whether there is a relationship between X and Y

36

Irreducible Error

Error that stems from the random error term (noise)

37

A decision tree may overfit the data because:

The tree is too complex

38

Backward Stepwise Model

Exact opposite of Forward Stepwise Model. N must be larger than P.

39

Misclassification Rate

How often is the classifier wrong

40

RSE(residual standard error):

- The average amount the response will deviate from the true regression line

- a measure of the lack of fit of the model to the data

41

A _________ tree with fewer splits may lead to lower variance and better interpretation at the cost of a little bias

smaller

42

Irreducible Error

epsilon (noise)

43

False Positive Rate

How often does the classifier predict yes when the actual value is no?

FP / actual no = FP / (FP + TN)

44

Ridge Regression

Adds a "shrinkage penalty" to the RSS when fitting a linear model to bring the coefficients closer to zero. Lambda is used to tune it. Small increase in bias for large decrease in variance.

45

Tree pruning:

grow a large tree T0, then prune it to a subtree with less variance

46

tree pruning, choosing between 2 subtrees:

estimate test errors with cross-validation

47

Parametric Model

Assumes the data-generating process follows a probability distribution with a fixed set of parameters

2-steps:

1.) Assume a functional form of f (e.g. f is linear with x):

f(x) = b0 + b1X1 + b2X2 + ... + bnXn

2.) Select a method to fit the model

48

True Negative Rate

How often does the classifier predict no when the actual value is no?

TN / actual no = TN / (TN + FP)

49

Lasso Method

Also adds a shrinkage penalty like ridge regression, but if lambda is large enough, it will zero out some predictors. This makes it useful for feature selection.
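
A minimal sketch contrasting the two penalties, assuming scikit-learn (alpha plays the role of lambda, and the data are synthetic):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 real signals

ridge = Ridge(alpha=10.0).fit(X, y)  # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)   # zeroes out the weak predictors

print(np.round(ridge.coef_, 2))  # all nonzero, just smaller
print(np.round(lasso.coef_, 2))  # many exactly 0 -> feature selection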

50

False Negative Rate

How often does the classifier predict no when the actual value is yes?

FN / actual yes = FN / (FN + TP)
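
The rate cards above all come from a single confusion matrix; a small Python sketch with made-up counts:

TP, TN, FP, FN = 40, 45, 5, 10  # made-up example counts
N = TP + TN + FP + FN

accuracy  = (TN + TP) / N   # how often the classifier is correct
error     = (FP + FN) / N   # misclassification rate
precision = TP / (TP + FP)  # predicted yes that are actually yes
recall    = TP / (TP + FN)  # actual yes predicted as yes
fpr       = FP / (FP + TN)  # false positive rate
tnr       = TN / (TN + FP)  # true negative rate
fnr       = FN / (FN + TP)  # false negative rate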

51

Polynomial Regression

yi = b0 + b1xi + b2xi^2 + noise

More interested in fitted values than coefficients.

52

Non-parametric Model

Does not assume (or makes fewer assumptions) about the shape or parameters of the population distribution that generated data

Goal is to estimate f such that f is as close as possible to the data points without overfitting

The higher the flexibility, the less the inference

53

tree pruning, what do you do if there are too many possible subtrees?:

Use cost complexity pruning to select a small set of subtrees for consideration

54

Cost complexity pruning:

α = tuning parameter

as α increases there is a price for having a large tree, so the quantity will be minimized for a smaller tree
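
For a given α, cost complexity pruning minimizes the following quantity, where |T| is the number of terminal nodes of subtree T:

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T|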

55

Prediction accuracy vs. interpretability

Linear models are easy to interpret; thin-plate splines are not

56

Training Error Rate

Average error that results from using a statistical learning method to predict the response of an observation in the training set

57

Step Functions

Break the range of X into pieces and fit a constant in each piece

58

Cost complexity pruning, when α = 0:

the subtree T will simply equal T0 (full size tree)

59

Good fit vs. overfit or underfit

How do we know when the fit is just right?

60

Test Error Rate

Average error that results from using a statistical learning method to predict the response of an observation not in the training set

61

Knots

The points where coefficients change in a Regression Spline

62

How to build a regression tree:

1) use recursive binary splitting to make a large tree

2) apply cost complexity pruning to get a sequence of best subtrees as a function of α

3) use K-fold cross-validation to choose α: repeat steps 1 and 2 on each fold, then pick the α that minimizes the average error (see the sketch below)
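
A minimal sketch of this recipe, assuming scikit-learn (which exposes cost complexity pruning via ccp_alpha; the data are synthetic):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=200)

# Steps 1 + 2: grow a large tree, get the alpha sequence of best subtrees
path = DecisionTreeRegressor().cost_complexity_pruning_path(X, y)

# Step 3: K-fold cross-validation over alpha (K = 5 here)
search = GridSearchCV(
    DecisionTreeRegressor(),
    {"ccp_alpha": path.ccp_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
best_tree = search.fit(X, y).best_estimator_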

63

Parsimony vs. Black-Box

We often (especially for inference) prefer a simpler model involving fewer predictors over a black-box involving many predictors

64

Validation set approach

Split the data set into training and testing data at the start

65

Splines

Piecewise polynomial functions constrained to have a high degree of smoothness at the knots, where the pieces join

66

Natural Spline

A regression spline constrained to be linear beyond the boundary knots

67

Leave One Out Cross Validation (LOOCV)

Train on n-1 observations over and over, average error rate

68

classification tree:

Like a regression tree, but for qualitative responses: predicts the mode (most common class) in each region

uses the classification error rate (fraction of observations in a region not belonging to the most common class) instead of RSS

69

p(hat)mk =

the proportion of training observations in the mth region that are from the kth class

70

K-Fold Cross Validation

Separate the dataset into k folds. Train on k - 1 folds, test on the held-out fold, rotate through the folds, and average the test errors

High Variance
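
A minimal sketch of both flavors of cross-validation, assuming scikit-learn (synthetic data; LOOCV from card 67 is the k = n special case):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# 5-fold: train on 4 folds, test on the held-out fold, average the errors
kfold_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                             scoring="neg_mean_squared_error").mean()

# LOOCV: train on n - 1 observations n times, average the errors
loocv_mse = -cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").mean()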

71

Regression Spline

Combination of polynomial and step functions. Flexibility is increased by adding extra knots rather than higher-degree polynomials

72

John Snow

Showed, using data visualization, that a contaminated water pump was spreading cholera in London (the 1854 Broad Street pump map)

73

Smoothing Spline

Implements a loss + penalty criterion: fits a function that minimizes the RSS plus a penalty on the (integrated, squared) second derivative, so the fit is smoother. Uses lambda to tune the penalty
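
The loss + penalty criterion a smoothing spline minimizes, with λ tuning the penalty term:

\sum_{i=1}^{n} \left(y_i - g(x_i)\right)^2 + \lambda \int g''(t)^2 \, dt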

74

Gini index:

a measure of total variance across the K classes. small value = node mostly has values from a single class
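
In terms of p̂mk from card 69 above, the Gini index for region m is:

G_m = \sum_{k=1}^{K} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right)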

75

John Tukey

Co-developed the Cooley–Tukey Fast Fourier Transform algorithm, coined the terms "bit" and "software", and invented the BOX PLOT

76

Generalized Additive Model (GAM)

Allows the nonlinear functions discussed above (splines, step functions, etc.) to be applied to multiple predictors and added together

77

advantages of trees:

1) Trees are very easy to explain to people

2) (perhaps) decision trees more closely mirror human decision-making

3) Trees can be displayed graphically

4) trees can easily handle qualitative predictors

78

Purpose of EDA

To develop an intuition about your data set

79

disadvantages of trees:

1) Trees (generally) have less predictive accuracy than other approaches

2) trees can be non-robust, where a small change makes a big difference

80

Bagging (boostrap aggregation):

a general-purpose procedure for reducing the variance of a statistical learning method

81

support vector machines (SVM):

High flexibility, low interpretability

intended for binary classification

supervised

extension of support vector classifier, which extends maximum margin classifier

82

Bagging steps:

reduce variance and increase accuracy by:

1) taking many training sets from the population

2) building a separate prediction model from each training set

3) averaging the resulting predictions (see the sketch below)
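
A minimal sketch, assuming scikit-learn, which performs the bootstrap sampling and averaging internally (synthetic data; the default base estimator is a decision tree):

import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=200)

# n_estimators = B bootstrap training sets; one model per set, then average
bag = BaggingRegressor(n_estimators=100, bootstrap=True).fit(X, y)
print(bag.predict(X[:5]))  # averaged predictions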

83

Maximum margin classifier (+ support vector classifier):

based on concept of separating hyperplane

84

hyperplane yi=1

above the hyperplane

85

hyperplane yi = -1

below the hyperplane

86

maximum margin classifier chooses the hyperplane that is....:

farthest from the training observations (largest margin)

87

Support vectors:

The 3 observations (in the textbook figure) that lie along the dashed lines indicating the width of the hyperplane's margin

88

maximum margin classifier problems:

1) when no separating hyperplane exists, there is no maximum margin classifier

2) the maximum margin classifier is extremely sensitive to a change in a single observation, may overfit the training data

89

The support Vector Classifier:

used to construct a hyperplane that does not perfectly separate the two classes

90

Support vector classifier yields:

1) Greater robustness to individual observations

2) better classification of most of the training observations

91

the support vector classifier allows:

some observations to be on the incorrect side of the margin (or hyperplane)

92

support vector classifier is also called:

soft margin classifier

93

support vector classifier

the slack variables ei tell where the i-th observation lies:

ei = 0: on the correct side of the margin

ei > 0: on the wrong side of the margin

ei > 1: on the wrong side of the hyperplane

94

support vector classifier C:

C is the tuning parameter, with larger values allowing more errors in the classification (more bias)

95

when C is small:

low bias but high variance

96

when C is large:

more bias, lower variance

97

support vectors:

observations on the margin or wrong side that affect the hyperplane

98

support vector classifier performs poorly when....:

there are non-linear class boundaries
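
A minimal sketch of that failure mode, assuming scikit-learn (note that sklearn's C is a penalty, so it moves inversely to the budget C described in the cards above):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: a non-linear class boundary
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(linear_acc, rbf_acc)  # linear is near chance; the radial kernel
                            # SVM handles the curved boundary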

99

bootstrap:

taking random samples with replacement from the (single) training data set. Train the model on the bth bootstrapped training set to get f^*b(x)

100

Bagging applied to trees, bias-variance:

with bootstrapped training sets, grow B unpruned trees and average their predictions (majority vote for qualitative responses). Unpruned trees have low bias, and averaging across trees reduces the variance