An ___ feature has values that are unaffected by other features.
Input
An ___ feature has values affected by other features.
Output
Residual Error
The difference between the observed and predicted value.
Extrapolation
A prediction that is far beyond the range of the original data.
Simple Linear Regression =
b_0 + b_1 x
Sum of Squared Errors (SSE)
The sum of the squares of all residuals.
Least Squares Regression Line
The simple linear regression formula that minimizes SSE.
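A minimal sketch of the least squares fit by hand; the data and variable names below are made up for illustration:

```python
# Fit y = b0 + b1*x by minimizing the sum of squared errors (SSE).
# Closed-form least squares: b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_mean, y_mean = x.mean(), y.mean()
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean

residuals = y - (b0 + b1 * x)   # observed minus predicted
sse = (residuals ** 2).sum()    # the quantity the least squares line minimizes
print(f"b0={b0:.3f}, b1={b1:.3f}, SSE={sse:.3f}")
```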
Correlation Coefficient
Measures the direction and strength of a linear relationship as a value between -1 and 1.
Fitted vs. Residuals Plots
Displays the predicted values against the residuals.
Normal Q-Q Plot
Displays the sample quantiles against the theoretical quantiles.
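A sketch of both diagnostic plots, assuming NumPy, matplotlib, and SciPy are available; the simulated data is made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Made-up data: a linear trend plus normal noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)

b1, b0 = np.polyfit(x, y, 1)        # slope, intercept of a least squares fit
fitted = b0 + b1 * x                # predicted values
residuals = y - fitted              # observed minus predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(fitted, residuals)      # fitted vs. residuals plot
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals")
stats.probplot(residuals, dist="norm", plot=ax2)   # normal Q-Q plot
plt.tight_layout()
plt.show()
```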
Multiple Linear Regression =
b_0 + b_1 x_1 + … + b_k x_k
Simple Polynomial Regression =
b_0 x^{0} + b_1 x^{1} + … + b_k x^{k}
Polynomial Regression Model
A regression model that displays a polynomial relationship between two features.
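A quadratic fit as a minimal instance of simple polynomial regression, using NumPy's polynomial fitting; the data is made up:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 9.1, 19.5, 33.8])   # roughly 2x^2 + 1

coeffs = np.polyfit(x, y, deg=2)   # returns [b2, b1, b0], highest power first
y_hat = np.polyval(coeffs, x)      # evaluates b0*x^0 + b1*x^1 + b2*x^2 at x
print("coefficients (high to low power):", coeffs)
```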
Interaction Term
A term in a regression model that contains multiple input features.
Logistic Regression =
\frac{e^{b_0 + b_1 x}}{1+e^{b_0 + b_1 x}}
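A minimal sketch of the logistic curve, together with the log-odds identity from the card below; b_0 and b_1 here are made-up coefficients, not fitted values:

```python
import math

def logistic(x, b0=-2.0, b1=0.8):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), a probability in (0, 1)."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

p = logistic(3.0)
log_odds = math.log(p / (1 - p))   # ln(p / (1-p)) recovers b0 + b1*x
print(f"p={p:.3f}, log-odds={log_odds:.3f}")   # log-odds == -2.0 + 0.8*3.0
```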
One-Hot Encoding
Transforming a categorical feature into a numeric feature.
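A minimal one-hot encoding by hand; libraries like pandas and scikit-learn provide this directly, but the idea is just indicator columns (the color data is made up):

```python
colors = ["red", "green", "blue", "green"]

categories = sorted(set(colors))   # fixed column order: one column per category
encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(categories)   # ['blue', 'green', 'red']
for row in encoded:
    print(row)      # e.g. 'red' -> [0, 0, 1]
```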
Log-Odds = ln(\frac{p}{1-p}) =
b_0 + b_1 x
Odds Ratio
Compares the relative odds of an outcome given a feature.
A model is ___ if it is too simple to fit the given data.
Underfit
A model is ___ if it is too complex to fit the given data.
Overfit
Ideally, a model ___ pass through every point on a graph.
Shouldn’t
The ___ complex model is preferred over the ___ complex model.
Least, More
Total Error
How much the observed values differ from predicted values.
Bias
How much a model’s prediction differs from the observed values.
Variance
How spread out a model’s predictions are.
Irreducible Error
Error inherent to the situation, unaffected by the model.
A complex model will have more ___ than ___.
Variance, Bias
A simple model will have more ___ than ___.
Bias, Variance
Machine Learning Algorithm
Uses data to build a model that makes predictions.
Regression
A machine learning model used to predict numerical values.
Classification
A machine learning model used to predict categorical values.
Model Training
The process of estimating model parameters used to make a prediction.
___ data is used to fit a model.
Training
___ data is used to evaluate a model’s performance while working on the model.
Validation
___ data is used to evaluate the final model’s performance compared to other models.
Test
Loss Function
Quantifies the difference between a model’s predictions and the observed values.
Regression Metric
The value returned by a loss function.
The lower the regression metric, the ___ the model is.
Better
Mean Squared Error =
\frac{1}{n} \sum (y_i - \hat{y}_{i})^{2}
Mean Squared Error
A direct measure of a model’s variance.
Mean Absolute Error =
\frac{1}{n} \sum |y_i - \hat{y}_{i}|
Mean Absolute Error
Like Mean Squared Error, but is less influenced by outliers.
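Both metrics computed directly from their formulas, as a sketch with made-up numbers:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed
y_hat = np.array([2.5, 5.5, 6.0, 11.0])  # predicted

mse = np.mean((y - y_hat) ** 2)   # squares each residual, so outliers dominate
mae = np.mean(np.abs(y - y_hat))  # no squaring, so less influenced by outliers
print(f"MSE={mse:.3f}, MAE={mae:.3f}")
```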
Absolute Loss
Quantifies how far a predicted probability is from the observed class.
L_{abs}(y,\hat{p})=|y-\hat{p}| where y is the ___ and \hat{p} is the ___.
Observed class, Predicted probability
An instance is ___ if the output feature’s value is known for that instance.
Labeled
Supervised Learning
Training a model to predict a labeled output feature.
A model is ___ if the relationship between input and output features in the model are easy to explain.
Interpretable
A model is ___ if the outputs produced by the model match the actual outputs with new data.
Predictive
K-Nearest Neighbors
A supervised learning algorithm that predicts the output of a new instance using instances with similar inputs.
Metric
A method of determining the distance between two instances.
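A bare-bones kNN sketch using Euclidean distance as the metric and a majority vote; the training points are made up:

```python
from collections import Counter
import math

# Tiny labeled training set: (input point, class label).
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def knn_predict(x, k=3):
    # Metric: Euclidean distance, but any instance-to-instance metric works.
    nearest = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0)))   # 'A': its nearest neighbors are mostly A
```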
Confusion Matrix
A table that summarizes the combinations of predicted and actual values.
Accuracy =
\frac{\text{TP} + \text{TN}}{\text{TP}+\text{FP}+\text{TN}+\text{FN}}
Precision =
\frac{\text{TP}}{\text{TP} + \text{FP}}
Recall =
\frac{\text{TP}}{\text{TP}+\text{FN}}
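The three metrics computed from made-up confusion-matrix counts, as a sketch:

```python
TP, FP, TN, FN = 40, 10, 35, 15   # made-up confusion-matrix counts

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)   # of predicted positives, how many were right
recall    = TP / (TP + FN)   # of actual positives, how many were found
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```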
Receiver Operating Characteristic Curve (ROC Curve)
Measures how well a classification model distinguishes between classes at various probability thresholds.
Area Under The ROC Curve (AUC)
A metric used to compare the performance between two classification models.
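A sketch of comparing two models by AUC, assuming scikit-learn is available; the labels and scores are made up:

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]
scores_a = [0.1, 0.3, 0.8, 0.7, 0.9, 0.4, 0.6, 0.2]   # model A's probabilities
scores_b = [0.5, 0.6, 0.7, 0.4, 0.8, 0.5, 0.5, 0.3]   # model B's probabilities

print("AUC A:", roc_auc_score(y_true, scores_a))   # closer to 1 is better
print("AUC B:", roc_auc_score(y_true, scores_b))
```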
Naive Bayes Classification
A supervised learning classifier that uses the number of times a category occurs in a class to estimate the likelihood of an instance being in that class.
P(\text{class}|\text{data}) indicates the probability that ___.
An instance is in \text{class}, given \text{data}.
Laplace Smoothing
Adds one fictional instance to a category count if none exist, so the estimated probability is never zero.
Naive Bayes Classification assumes all categories are ___.
Equally important
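A sketch of per-class category counts with Laplace smoothing; the tiny weather dataset is made up:

```python
from collections import Counter

# Made-up (category, class) pairs, e.g. weather -> play.
data = [("sunny", "yes"), ("sunny", "yes"), ("rainy", "no"), ("sunny", "no")]
categories = {"sunny", "rainy", "overcast"}   # 'overcast' never occurs

counts = {"yes": Counter(), "no": Counter()}
for category, label in data:
    counts[label][category] += 1

def likelihood(category, label):
    # Laplace smoothing: one fictional instance per category,
    # so unseen categories don't get probability zero.
    total = sum(counts[label].values()) + len(categories)
    return (counts[label][category] + 1) / total

print(likelihood("overcast", "yes"))   # > 0 thanks to smoothing
```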
Support Vector Machine
A supervised learning algorithm that uses hyperplanes to divide data into different classes.
Hyperplane
A flat surface that is one dimension lower than the input feature space.
A dataset is ___ if a hyperplane can divide the dataset so that all instances of one class fall on one side and everything else falls on the other.
Well-Separated
Margin
The space between a hyperplane and its supporting vectors.
Support Vectors
The closest instances to a hyperplane.
Vectors on the wrong side of a hyperplane are often given a ___.
Penalty
Hinge Function
Takes the distance from the margin as input; returns 0 if the vector is on the correct side and a linear penalty if it is on the wrong side.
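One common form of the hinge function as a sketch; here score stands in for the signed distance from the hyperplane, and the numbers are made up:

```python
def hinge(y, score):
    """y is +1 or -1; score is the signed distance from the hyperplane."""
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))   # 0.0 -> correct side, outside the margin
print(hinge(+1, -0.5))  # 1.5 -> wrong side, penalized linearly
```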
Sensitivity/Recall
The True-Positive rate.
Specificity
The True-Negative rate.
Accuracy
The ratio of the number of correct labels to the total labels.
Misclassification Rate
The ratio of the number of incorrect labels to the total labels.
Misclassification Rate =
1 - \text{Accuracy}
F1 Score
A number between 0 and 1 that represents the harmonic mean of precision and recall.
F1 Score =
2 \frac{\text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}
Sensitivity =
\frac{\text{TP}}{\text{TP}+\text{FN}}
Specificity =
\frac{\text{TN}}{\text{TN}+\text{FP}}
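These metrics computed from the same made-up confusion-matrix counts as above, as a sketch:

```python
TP, FP, TN, FN = 40, 10, 35, 15   # made-up confusion-matrix counts

sensitivity = TP / (TP + FN)   # true-positive rate (= recall)
specificity = TN / (TN + FP)   # true-negative rate
precision   = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
misclassification = 1 - (TP + TN) / (TP + FP + TN + FN)   # 1 - accuracy
print(f"sens={sensitivity:.2f}, spec={specificity:.2f}, "
      f"F1={f1:.2f}, misclass={misclassification:.2f}")
```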
Entropy
Measures the uncertainty in a dataset: the number of ways a situation could diverge.
Steps to make a decision tree:
Calculate the entropy of the decision, split on each candidate attribute into subtables and calculate their entropy, choose the attribute with the largest information gain (the biggest drop in entropy), then repeat the process on each branch.
Information Gain
Entropy before split compared to entropy after split.
Heuristic
A rule of thumb; in decision trees, choosing the attribute that produces the purest nodes.
Entropy / Expected Information needed to classify tuple D=
\text{Info}(D)=-\sum_{i=1}^{m}p_{i}\log_{2}(p_{i})
Information needed to classify D after using A to split D into v partitions=
\text{Info}_{A}(D)=\sum_{j=1}^{v}\frac{|D_{j}|}{|D|}*\text{Info}(D_{j})
Information gained by branching on attribute A=
\text{Gain}(A)=\text{Info}(D)-\text{Info}_{A}(D)
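A sketch of Info(D), Info_A(D), and Gain(A) with made-up class counts:

```python
import math

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                   # e.g. 9 "yes", 5 "no" before the split
partitions = [[6, 1], [3, 4]]     # class counts in each partition after splitting on A

info_d = entropy(parent)
info_a = sum(sum(p) / sum(parent) * entropy(p) for p in partitions)
print(f"Gain(A) = {info_d - info_a:.3f}")   # Info(D) - Info_A(D)
```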
When picking a distance metric for kNN, the metric doesn’t have to be the ___ on a graph.
Physical distance
The ___ set is used to train the model before testing it.
Training
The ___ set is used to test the model’s abilities after training it.
Testing
Picking an ___ is the 3rd step in creating a kNN model.
Evaluation Metric
The k in kNN represents the ___.
Number of nearest neighbors considered
Unsupervised Learning
Teaching a model to categorize data where no labels are available.
kMeans
An unsupervised learning technique that groups different tuples together based on known attributes.
Centroids
The center points in a cluster for kMeans.
Each cluster in kMeans represents an individual ___.
Class
Step 3 of kMeans is to ___.
Move the centroids to the average location of the data points
kMeans should repeat until ___.
The centroids move either very little or not at all.
kMeans has the potential to fall into an ___ or give a ___ answer.
Infinite loop, Useless
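A sketch of the kMeans loop on made-up 1-D points with k = 2, capped so it cannot loop forever:

```python
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]   # made-up data
centroids = [1.0, 9.0]                    # made-up starting centroids

for _ in range(100):                      # cap iterations: avoid an infinite loop
    # Step: assign each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Step 3: move each centroid to the average location of its points.
    new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    if max(abs(a - b) for a, b in zip(new, centroids)) < 1e-9:
        break                             # centroids moved very little: stop
    centroids = new

print(centroids)   # roughly [1.5, 8.5]
```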