Regression
A supervised learning technique used to predict continuous numerical values based on one or more input variables.
Dependent Variable (Response)
The value you want to predict in a regression model.
Independent Variable (Predictor/Features)
The variable(s) used to make predictions in regression.
Goal of Regression
To find the best-fitting line or curve describing the relationship between the predictors and the target variable.
Regression Applications
Finance, healthcare, marketing, manufacturing, retail.
Continuous Target Variable
Regression predicts a continuous value like sales price or height.
Mean Squared Error (MSE)
The average of the squares of the errors, a common regression metric.
Root Mean Squared Error (RMSE)
The square root of MSE, measures average prediction error in same units as target.
Overfitting
When a model is too complex and learns noise from training data (poor generalization).
Underfitting
When a model is too simple and misses key data patterns.
Interpretability
Regression coefficients show how much each predictor affects the target.
Predictor Variable (Feature)
Input used for prediction in regression.
Response Variable
Output to be predicted.
Coefficient
Represents the change in the response variable for a one-unit change in the predictor, holding the other predictors constant.
Residuals
The differences between observed and predicted values.
Multicollinearity
Situation where predictors are highly correlated, which may affect coefficient stability.
Outliers
Data points that deviate substantially and may distort the regression model.
Simple Regression
Regression with one predictor and one response variable.
Multiple Regression
Regression using two or more predictors for a single response variable.
Nonlinear Regression
Regression capturing nonlinear relationships, e.g. plant growth over time.
Simple Linear Regression Formula
y = β0 + β1x + ε
Multiple Linear Regression Formula
y = β0 + β1X1 + β2X2 + … + βpXp + ε
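A minimal sketch of fitting these formulas in Python, assuming scikit-learn is available (the data and numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two predictors (X1, X2) and a continuous target y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])

model = LinearRegression().fit(X, y)

print("β0 (intercept):", model.intercept_)
print("β1..βp (coefficients):", model.coef_)
print("prediction for [2.5, 2.5]:", model.predict([[2.5, 2.5]]))
```

With a single column in X this is simple linear regression; with several columns it is multiple linear regression.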
Linear Regression
Algorithm fitting a straight line to predict outcomes.
Polynomial Regression
Algorithm fitting a curve, capturing nonlinear relationships.
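A minimal polynomial-regression sketch, assuming scikit-learn (the degree and the data are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Noisy quadratic data: y ≈ 0.5x² − 2x + 1.
x = np.linspace(0, 10, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - 2 * x.ravel() + 1 + np.random.default_rng(0).normal(0, 1, 30)

# Expand x into [1, x, x²], then fit an ordinary linear model on the expanded features.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[4.0]]))  # should land near 0.5·16 − 8 + 1 = 1.0
```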
Ridge Regression
Regularization method preventing overfitting by shrinking coefficients.
Lasso Regression
Regularization method that can force some coefficients to exactly zero for feature selection.
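A sketch contrasting the two regularizers, assuming scikit-learn (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two of five features actually drive the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", ridge.coef_)  # all coefficients shrunk, none exactly zero
print("lasso:", lasso.coef_)  # irrelevant features driven to exactly 0.0
```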
Decision Tree Regression
Uses tree structures, can handle nonlinear relationships.
Random Forest Regression
Uses ensemble of trees for robust, less overfitted predictions.
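A sketch of both tree-based regressors on a nonlinear target, assuming scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)  # nonlinear relationship

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print("tree at x=2:  ", tree.predict([[2.0]]))    # single tree: coarser estimate
print("forest at x=2:", forest.predict([[2.0]]))  # ensemble average, near sin(2) ≈ 0.91
```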
Support Vector Regression (SVR)
Uses hyperplanes in high-dimensional space to predict continuous values within a margin of tolerance (ε).
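A minimal SVR sketch, assuming scikit-learn (the kernel, C, and epsilon values are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.2, 150)

# epsilon sets the margin of tolerance: errors inside it are ignored.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.predict([[1.5]]))  # should be near 1.5² = 2.25
```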
Advantages of Regression
Interpretable, good for forecasting, reveals feature importance, flexible.
Disadvantages of Regression
Often assumes linearity, can overfit, sensitive to outliers.
Mean Absolute Error (MAE)
Average of absolute errors between predicted and actual, lower is better.
Mean Squared Error (MSE)
Average squared error, penalizes large errors, lower is better.
Root Mean Squared Error (RMSE)
Sqrt of MSE, interpretable in target units, lower is better.
R-squared (R²)
Proportion of variance in the target explained by the model; closer to 1 is better.
Adjusted R-squared
R² adjusted for the number of predictors; penalizes adding features that do not improve the fit.
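A sketch computing all five metrics, assuming scikit-learn (the predictions and the p = 2 predictor count are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # back in the target's units
r2 = r2_score(y_true, y_pred)

# Adjusted R² for n samples and p predictors (hypothetical p = 2 here).
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R²={r2:.3f} adj-R²={adj_r2:.3f}")
```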
Classification
Categorizes data points into defined classes based on features.
Binary Classification
Classification with two possible outcomes (e.g., spam/not spam).
Multiclass Classification
Classification with more than two possible labels.
Classification Applications
Credit risk analysis, shopping prediction, medical diagnosis, sentiment analysis.
Supervised Classification
Trained using labeled data (target classes known).
Unsupervised Classification
Discovers classes from unlabeled data (e.g., clustering).
Training Phase (Classification)
Model learns from labeled data.
Testing Phase (Classification)
Model is validated on new/unseen data for accuracy.
Dataset Split (Classification)
Split into training and testing sets, commonly 60–80% for training and the remainder for testing.
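A typical split with scikit-learn's train_test_split (the 80/20 ratio here is one common choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% for testing; stratify keeps the class balance in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```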
Decision Tree Classifier
Uses tree structure; splits data on features to classify.
Decision Tree Algorithms
ID3 uses information gain and C4.5 uses gain ratio to choose splits; CART uses the Gini index.
Advantage of Decision Trees
Easy to interpret and use.
Disadvantage of Decision Trees
Can be sensitive to small changes, may be inaccurate or complex.
Overfitting in Trees
Complex trees may overfit to training data.
Pruning
Removes unnecessary branches to improve prediction on unseen data.
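A sketch of pruning a tree with scikit-learn (the max_depth and ccp_alpha values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can overfit; limiting depth and applying
# cost-complexity pruning (ccp_alpha) trade training fit for generalization.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    max_depth=3, ccp_alpha=0.01, random_state=0
).fit(X_train, y_train)

print("full tree test accuracy:  ", full_tree.score(X_test, y_test))
print("pruned tree test accuracy:", pruned_tree.score(X_test, y_test))
```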
Information Theory
Quantifies information and measures uncertainty, crucial in machine learning.
Entropy (H)
Measures randomness/uncertainty or impurity in a dataset.
High Entropy
More uncertainty; labels are mixed.
Low Entropy
More certainty; labels are pure.
Entropy Formula
H(S) = −Σi pi log2(pi), where pi is the proportion of class i in S.
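A direct translation of the formula into Python (numpy assumed):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes"] * 5 + ["no"] * 5))  # evenly mixed -> 1.0 (high entropy)
print(entropy(["yes"] * 10))              # pure -> 0.0 (low entropy)
```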
Information Gain (IG)
How much a feature reduces entropy when splitting data.
Purpose of Information Gain
Select the best attribute to split on at each node of a decision tree.
Information Gain Formula
IG(S, A) = H(S) − Σv∈Values(A) (|Sv| / |S|) · H(Sv)
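A sketch of the gain computation (it re-declares the entropy helper from the previous sketch so it runs on its own; the toy play/outlook data is made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """IG(S, A) = H(S) - sum(|Sv|/|S| * H(Sv)) over values v of feature A."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted = sum(
        (feature_values == v).mean() * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - weighted

# A feature that perfectly separates the classes removes all uncertainty.
play = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]
print(information_gain(play, outlook))  # 1.0
```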
Confusion Matrix
Shows count of actual vs. predicted classes in classification.
True Positive (TP)
Correctly predicted as positive.
True Negative (TN)
Correctly predicted as negative.
False Positive (FP)
Incorrectly predicted as positive.
False Negative (FN)
Incorrectly predicted as negative.
Precision
TP/(TP + FP): Fraction of predicted positives that are actual positives.
Recall
TP/(TP + FN): Fraction of actual positives correctly found.
F1 Score
Harmonic mean of Precision & Recall: 2 × (Precision × Recall) / (Precision + Recall)
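All four counts and the three scores in one sketch, assuming scikit-learn (the labels are made up):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```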