Irreducible Error
inherent noise or variability in the data that cannot be reduced.
Reducible Error
Error that can be reduced by improving the model
Prediction
Using the estimated model to make accurate predictions for the response; the exact form of f can be treated as a black box.
Inference
Understanding how the output (response) changes as the inputs change; here the form of f matters.
Parametric Methods
Assume the functional form of f and a fixed number of parameters. Simpler but less flexible. Examples: linear regression, logistic regression
Non-parametric methods
Learn directly from the data and adapt to complex patterns. More flexible, but need more training data. Examples: decision trees, random forests, KNN
MSE (Mean Squared Error)
Measure used to assess quality of fit/accuracy for a regression model. We always want to minimize MSE.
MSE
Average squared difference between actual and predicted values:
MSE = (1/n) Σ (yi - ŷi)²
n - number of observations
yi - observed values
ŷi - predicted values

MSE example
True Y = [14, 7, 5, 9]
Predicted Y = [4, 8, 8, 10]
MSE = ( (14 - 4)^2 + (7 - 8)^2 + (5 - 8)^2 + (9 - 10)^2 ) / 4 = (100 + 1 + 9 + 1) / 4 = 27.75
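A minimal sketch of the same calculation in Python (NumPy assumed available):

```python
import numpy as np

# Values from the example card above
y_true = np.array([14, 7, 5, 9])
y_pred = np.array([4, 8, 8, 10])

# MSE: average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (100 + 1 + 9 + 1) / 4 = 27.75
```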
Error Rate
Measures quality of fit for classification. The indicator gives a 1 when a prediction is incorrect and a 0 otherwise; the error rate is the fraction of incorrect classifications.

Loss Function
Measures how far a prediction is from the true outcome and assigns a numerical penalty to incorrect predictions. We want to choose the weights that minimize the loss.
Examples: MSE, error rate
Bias
(Underfitting) Error from modeling a problem with too simple a model (e.g., using a linear model to fit a complex pattern)
Variance
(Overfitting) How much the estimated f would change if it were fit on a different training data set. The more flexible the model, the higher the variance.
Bias Variance Trade off
The more flexible the model, the higher the variance and the lower the bias; test error is minimized by balancing the two.
Bayes Error Rate
Irreducible error in classification. It is the lowest error rate that any classifier can achieve.
KNN (K Nearest Neighbors)
Non-parametric method for classification and regression; uses the K closest neighbors to make the prediction.
Small K → high variance and low bias (large K → low variance, high bias)
Performs worse in higher dimensions and fits to the noise
Sensitive to feature scaling
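A minimal scikit-learn sketch (illustrative dataset and K values only) showing KNN with scaling and the effect of K:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first so no single feature dominates the distance computation
for k in (1, 15):  # small K -> low bias / high variance; larger K -> the reverse
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```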
Scaling
Pre-processing technique that makes sure no one feature dominates the distance calculation or the outcome
Algorithms that use gradient descent (and distance-based methods) need scaling
Normalization
Rescales features to a fixed range (e.g., 0 to 1); used when the data does not follow a normal distribution

Standardization
Rescales features to zero mean and unit variance; used when the data follows a normal distribution

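A small sketch (made-up data) contrasting the two pre-processing options:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Normalization (min-max): rescales each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Standardization (z-score): zero mean and unit variance per feature
print(StandardScaler().fit_transform(X))
```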
Peaking Phenomenon
As the number of features increases, performance improves up to an optimal point and then degrades as more and more features are added.
Curse of Dimensionality
When the dimensionality is so high that all points are roughly the same distance apart and the center of the (hyper)cube is empty. Need to reduce the number of features.
Binary Classifier Evaluation

Accuracy
The fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)
Sensitivity
True Positive Rate
TP/ ( TP +FN )
Specificity
The proportion of actual negatives that are correctly predicted as negative
True Negative Rate
TN / ( TN + FP)
1 - Specificity
False Positive Rate
FP / ( FP + TN)
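A minimal sketch (made-up labels) computing these rates from a confusion matrix with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
fpr         = fp / (fp + tn)   # equals 1 - specificity
print(accuracy, sensitivity, specificity, fpr)
```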
Logit Output
The log-odds: log( P(y=1|x) / (1 - P(y=1|x)) ), which is linear in the inputs
Probability Output
P( y=1 | x ), obtained by passing the logit through the sigmoid function
OLS (Ordinary Least Squares)
Minimizes the squared differences between actual and predicted values
Uses MSE and the Residuals
Normal Equations
Closed form solution
Minimizes the MSE
Used for simple linear regression and smaller datasets to solve directly for the model parameters

Gradient Descent
Efficient for large datasets and high dimensions
scales well in machine learning
iterative optimization algorithm to minimize the loss function
Normal Equations
Closed form solution
efficient for smaller datasets with few features
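A NumPy sketch (simulated data, illustrative learning rate) comparing the closed-form normal equations with gradient descent on the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]            # design matrix with intercept column
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Normal equations: solve (X^T X) beta = X^T y directly (fine for few features)
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: repeatedly step against the gradient of the MSE
beta = np.zeros(2)
lr = 0.1
for _ in range(1000):
    grad = (2 / len(y)) * X.T @ (X @ beta - y)
    beta -= lr * grad

print(beta_closed, beta)   # both should be close to [3, 2]
```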
Precision
TP / ( TP + FP )
Information Theory
Study of the quantification, storage, and communication of digital information
Founded by Harry Nyquist, Ralph Hartley, Claude Shannon
Quantifying Information
Low probability event → high information (surprising)
High probability event → low information
Self Information
The measure of the information associated with the outcome of a random variable, (How much information is in a given variable)
1 unit of information = 1 bit (when using log base 2)
I(x) = -log₂(P(x))

Expected Value
The long-run average value of a random variable, weighted by the probabilities of its outcomes

Entropy
The measure of the average uncertainty associated with a given random variable
Self Information vs Entropy
Self Information → measures the information/uncertainty in an event
Entropy → the amount of uncertainty/information in a set of events
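A short sketch (coin-flip probabilities as the example) of both quantities in bits:

```python
import numpy as np

def self_information(p):
    # I(x) = -log2(P(x)): rarer events carry more information (more surprise)
    return -np.log2(p)

def entropy(probs):
    # H(X) = sum over outcomes of P(x) * I(x): average uncertainty of the variable
    probs = np.asarray(probs)
    return float(np.sum(probs * -np.log2(probs)))

print(self_information(0.5))   # 1.0 bit
print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))     # biased coin: ~0.47 bits (less uncertainty)
```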
Linear Regression
Models the relationship between the dependent and independent variables using a line of best fit

Gradient descent is efficient for
large datasets and high dimensions, scales well
least squares fit in terms of linear regression
is used to minimize the sum of squared differences between the predicted and actual values.
normal equations are good for
small datasets with few features; involves inverting a matrix
Regression sum of squares
SST = SSR + SSE
SSTotal (SST) - total squared distance of the observations from the mean of y
SSRegression (SSR) - squared distance from the regression line to the mean of y (variation explained by the model)
SSResidual (SSE) - variation around the regression line that is not explained by it (this is what we are trying to minimize)
Coefficient of Determination (R²)
values range from 0 - 1
0 - model does not explain any variability in the data
1 - the model perfectly explains the variability in the data
R² = SSR/SST
The higher the R², the better the model fits the data
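A quick NumPy check (made-up near-linear data) that SST = SSR + SSE and R² = SSR/SST:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least squares line of best fit
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)        # total variation around the mean of y
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression line
sse = np.sum((y - y_hat) ** 2)           # residual (unexplained) variation

print(np.isclose(sst, ssr + sse))        # True
print(ssr / sst)                         # R², close to 1 for this data
```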
Logistic Regression
for classification problems
instead of predicting y we predict P(Y=1) (aka yes or no)

Logistic regression gives an
S-shaped (sigmoid) curve
odds ratio
measure that gives the change in the odds of an outcome when one predictor changes and everything else is held constant
odds near 0 correspond to very low probabilities; very large odds correspond to probabilities near 1
Logit
The log-odds of the probability: logit(p) = log( p / (1 - p) ); logistic regression models the logit as a linear function of the predictors
ROC vs AUC
ROC curve - plots the true positive rate against the false positive rate over all classification thresholds
AUC (area under the ROC curve) - a single-number score that summarizes that performance
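A scikit-learn sketch (synthetic data) tying these cards together: fit a logistic regression, take P(y=1|x), and score it with ROC/AUC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability output: P(y = 1 | x), the S-shaped sigmoid of the logit
p = clf.predict_proba(X_test)[:, 1]

# ROC sweeps over all thresholds; AUC summarizes the curve as one number
fpr, tpr, thresholds = roc_curve(y_test, p)
print(roc_auc_score(y_test, p))
```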
Regression Trees
Regression problems
divide predictor space into distinct regions
Split regression trees by using
the split that gives the lowest MSE in the training data (Gini or entropy is the analogous criterion for classification trees)
Classification tree
predictions are categorical not continuous (regression)
split by minimizing the error rate
we know how far back to prune a tree by
using cross validation to see which tree has the lowest error rate
CP
Complexity parameter, used to control the trade-off between tree complexity and accuracy
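The CP notation comes from R's rpart; in scikit-learn the closest analog is the cost-complexity pruning parameter ccp_alpha. A sketch of choosing it by cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Larger ccp_alpha -> more pruning -> smaller tree; CV picks the best trade-off
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.05]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```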
Pros vs Cons of decision trees
Pros - Trees are easy to explain, can be displayed graphically, and can be used for both classification and regression
Cons - Don't have the same prediction accuracy as more complicated models
Bagging
bootstrap aggregating
resampling of the observed dataset by random sampling with replacement from the original dataset
averaging the resulting models reduces variance
gives lots of training datasets
less interpretable than a single tree
In random forests variable importance is computed
Through the mean decrease in Gini impurity
Mean Decrease Gini vs Entropy
Both used to evaluate quality of split
Gini - based on the probability of misclassification
Entropy - based on the amount of information needed to identify the class of an element
Random Forests
Same process as bagging but de-correlates the trees
Done by taking a random sample of predictors every time a split is considered
When random forests are built using m = p
This amounts to being a bagged tree
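A sketch (toy dataset, arbitrary settings) showing the de-correlation knob: limiting the predictors per split gives a random forest, while allowing all of them (m = p) reduces to bagged trees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random forest: a random subset of predictors (m = sqrt(p) here) at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# m = p: every predictor is available at every split, which is just bagging
bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())

# Variable importance via mean decrease in (Gini) impurity
print(rf.fit(X, y).feature_importances_[:5])
```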
Validation Set Approach
Split data into training and testing (validation)
build multiple models and pick the one with the lowest validation error rate
the error estimate can be highly variable depending on which observations end up in the validation set
Leave - one - out cross validation
split data of size n into:
training set - n - 1 observations
test (validation) set - 1 observation
validate and find the MSE; repeat n times and average the results
Has less bias and less variability than the validation-set approach, but is computationally intensive
K - fold cross validation
divide the dataset into k parts (folds)
hold out one part, train on the remaining k - 1 parts, and test on the held-out part
repeat with a different held-out part each time
average the k different MSEs
more accurate, balances bias/variance trade off for error estimates
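A sketch (toy regression dataset) comparing k-fold and leave-one-out error estimates for a linear model:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# 5-fold CV: hold out one fold at a time and average the k test MSEs
kfold_mse = -cross_val_score(
    model, X, y, cv=KFold(5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error").mean()

# LOOCV: n folds of size 1; less bias but much more computation
loo_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(),
    scoring="neg_mean_squared_error").mean()

print(kfold_mse, loo_mse)
```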