Data Mining Written


47 Terms

1
New cards

Continuous vs Class Response Variables

Continuous responses are numerical (like income or height), while class responses are categorical (like “yes/no” or “spam/not spam”).

2
New cards

Why Logistic Regression Improves Over Linear Regression:

Logistic regression models probabilities between 0 and 1 for classification instead of predicting continuous values.
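
A minimal sketch of this idea using scikit-learn (the library choice, variable names, and toy data are illustrative assumptions, not part of the card):

    # Logistic regression outputs class probabilities between 0 and 1.
    from sklearn.linear_model import LogisticRegression

    X = [[1.0], [2.0], [3.0], [4.0]]   # one numeric predictor
    y = [0, 0, 1, 1]                   # binary class response

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[2.5]]))  # P(class 0) and P(class 1) for a new point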

3
New cards

Linear Discriminant Analysis (LDA):

LDA assumes every class shares the same covariance (spread) and finds a straight-line boundary that best separates them.

4
New cards

Quadratic Discriminant Analysis (QDA):

QDA lets each class have its own covariance, creating curved (quadratic) decision boundaries.
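
A hedged sketch contrasting the two in scikit-learn (toy data and variable names assumed for illustration):

    # LDA: shared covariance -> linear boundary; QDA: per-class covariance -> quadratic boundary.
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis,
        QuadraticDiscriminantAnalysis,
    )

    X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
    y = [0, 0, 0, 1, 1, 1]

    lda = LinearDiscriminantAnalysis().fit(X, y)
    qda = QuadraticDiscriminantAnalysis().fit(X, y)
    print(lda.predict([[3, 3]]), qda.predict([[3, 3]]))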

5
New cards

Naive Bayes:

Naive Bayes uses probability rules assuming predictors are independent, making it simple and fast for text or categorical data.
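
A small text-classification sketch with scikit-learn (the spam/ham documents are made up for illustration):

    # Naive Bayes treats word counts as independent evidence for each class.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["win money now", "meeting at noon", "win a free prize", "lunch meeting today"]
    labels = ["spam", "ham", "spam", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)          # word-count features
    model = MultinomialNB().fit(X, labels)
    print(model.predict(vec.transform(["free money prize"])))   # likely "spam"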

6
New cards

K-Nearest Neighbors (KNN):

KNN classifies a new point based on the majority class of its closest neighbors in the data.
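
A minimal sketch, assuming scikit-learn and toy one-dimensional data:

    # KNN: predict the majority class among the k closest training points.
    from sklearn.neighbors import KNeighborsClassifier

    X = [[1], [2], [3], [10], [11], [12]]
    y = [0, 0, 0, 1, 1, 1]

    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[2.5], [10.5]]))   # expected: [0, 1]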

7
New cards

Confusion Matrix:

A table that shows how many predictions were correct or incorrect (True/False Positives and Negatives).
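
A quick sketch of computing one with scikit-learn (the labels below are invented examples):

    # Rows are actual classes, columns are predicted classes (scikit-learn's convention).
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1]
    print(confusion_matrix(y_true, y_pred))
    # [[TN FP]
    #  [FN TP]] for binary labels 0/1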

8
New cards

Area Under the Curve (AUC):

AUC measures how well a model separates classes — higher AUC means better overall classification performance.
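
A minimal sketch with scikit-learn; the scores below are illustrative predicted probabilities:

    # AUC is computed from predicted scores/probabilities, not hard class labels.
    from sklearn.metrics import roc_auc_score

    y_true   = [0, 0, 1, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8]        # e.g., predicted P(class = 1)
    print(roc_auc_score(y_true, y_scores))  # 0.75 here; 1.0 = perfect, 0.5 = random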

9
New cards

Linear Regression

Fits a straight line by minimizing the total prediction error (RSS) between predicted and actual values.
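
A minimal sketch, assuming scikit-learn and toy data:

    # Ordinary least squares: choose the line that minimizes the residual sum of squares (RSS).
    from sklearn.linear_model import LinearRegression

    X = [[1], [2], [3], [4]]
    y = [2.1, 3.9, 6.2, 7.8]

    lm = LinearRegression().fit(X, y)
    print(lm.intercept_, lm.coef_)   # fitted line: y ≈ intercept + coef * x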

10
New cards

Linear Regression - Relationship

Assumes a linear relationship between predictors and the response variable.

11
New cards

Linear Regression - Flexibility

Lowest flexibility; fits only straight lines, no curves.

12
New cards

Linear Regression - Coefficients Meaning

Coefficients show how much Y changes when a predictor increases by 1 unit.

13
New cards

Linear Regression - Key Feature

Assumes a straight-line relationship — no curves allowed.

14
New cards

Ridge Regression

Minimizes RSS plus a penalty on the squared coefficient values (L2 penalty).
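
A hedged sketch of the same idea in scikit-learn (alpha plays the role of the penalty weight; the data are illustrative):

    # Ridge objective: RSS + alpha * sum(beta_j^2); larger alpha = stronger shrinkage.
    from sklearn.linear_model import Ridge

    X = [[1, 1], [2, 2.1], [3, 2.9], [4, 4.2]]   # two correlated predictors
    y = [1, 2, 3, 4]

    ridge = Ridge(alpha=1.0).fit(X, y)
    print(ridge.coef_)   # shrunk toward zero, but not exactly zero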

15
New cards

Ridge Regression - Effect

Shrinks coefficients toward zero to reduce overfitting but never makes them exactly zero.

16
New cards

Ridge Regression - Key Feature

Simplifies the model by shrinking coefficients but keeps all predictors.

17
New cards

Lasso Regression

Minimizes RSS plus a penalty on the absolute values of coefficients (L1 penalty).
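
A hedged sketch in scikit-learn (the redundant second predictor is an illustrative assumption):

    # Lasso objective: RSS + alpha * sum(|beta_j|); it can drive coefficients to exactly zero.
    from sklearn.linear_model import Lasso

    X = [[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4]]   # second predictor is redundant
    y = [1, 2, 3, 4]

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)   # expect one coefficient at (or near) exactly 0.0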

18
New cards

Lasso Regression - Effect

Can shrink some coefficients all the way to zero, removing less important predictors.

19
New cards

Lasso Regression - Key Feature

Performs variable selection automatically by keeping only the most important predictors.

20
New cards

Polynomial Regression

Adds powers of X (like X², X³) to capture curved relationships while still fitting a linear model.
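
A minimal sketch, assuming scikit-learn and a toy quadratic response:

    # Polynomial regression: add X^2 (and higher powers) as extra columns, then fit linearly.
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    X = [[1], [2], [3], [4], [5]]
    y = [1, 4, 9, 16, 25]            # a perfectly quadratic response

    poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
    print(poly.predict([[6]]))       # close to 36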

21
New cards

Polynomial Regression - Key Feature

Fits one smooth curved line across all data instead of a straight line.

22
New cards

Splines

Combine polynomials and step functions to fit flexible curves that change shape at specific points called knots.
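
A hedged sketch using scikit-learn's SplineTransformer (available in recent versions; here the knot locations are chosen automatically and the data are illustrative):

    # Splines: a piecewise polynomial basis with knots, then an ordinary linear fit on top.
    import numpy as np
    from sklearn.preprocessing import SplineTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = np.sin(X).ravel()

    spline = make_pipeline(SplineTransformer(degree=3, n_knots=5),
                           LinearRegression()).fit(X, y)
    print(spline.predict([[5.0]]))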

23
New cards

Splines - Key Feature

Fits smooth, piecewise curves that join together smoothly at knots.

24
New cards

Mean Squared Error (MSE)

Measures average squared prediction error; smaller is better; units are the square of Y.

25
New cards

R² (R-squared)

Shows how much of Y’s variation is explained by the model; closer to 1 means better fit; unitless.
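
A small sketch computing both metrics with scikit-learn (the values are made-up predictions):

    # Both metrics compare predictions with actual values on the same data split.
    from sklearn.metrics import mean_squared_error, r2_score

    y_true = [3.0, 5.0, 7.0, 9.0]
    y_pred = [2.8, 5.3, 6.9, 9.4]

    print(mean_squared_error(y_true, y_pred))  # average squared error, in Y^2 units
    print(r2_score(y_true, y_pred))            # fraction of variance explained, unitless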

26
New cards

β (Beta) in Linear Regression

β shows how much Y changes when X increases by 1 (the slope).

27
New cards

β (Beta) in Ridge/Lasso Regression

β still means change in Y per unit of X, but it’s smaller because of the penalty that shrinks coefficients.

28
New cards

β (Beta) in Polynomial Regression

β controls how curved the line is — β₁ affects slope, β₂ and higher bend the curve up or down.

29
New cards

β (Beta) in Splines

β controls the curve’s shape in one section of X, describing local slope between knots.

30
New cards

How to Explain β in Linear Regression

If β = 3, when X goes up by 1, Y goes up by 3.

31
New cards

How to Explain β in Ridge or Lasso Regression

β is smaller than in linear regression (e.g., 2.5 instead of 3) to reduce overfitting.

32
New cards

How to Explain β in Polynomial Regression

β₁ sets the direction, and β₂ (on X²) bends the curve up if positive or down if negative.

33
New cards

How to Explain β in Splines

Each β shapes how Y changes with X in one range — different β’s for different sections.

34
New cards

Resampling

A method to estimate model accuracy by repeatedly training and testing the model on different splits of the data.

35
New cards

Leave-One-Out Cross Validation (LOOCV)

Trains on all data except one point, tests on that point, and repeats for every observation.
Pros: Uses almost all data for training (low bias).
Cons: Very slow to compute and can vary a lot (high variance).

36
New cards

K-Fold Cross Validation

Splits data into K parts, trains on K–1 parts, and tests on the remaining one, repeating K times.
Pros: Faster and gives more stable results (lower variance).
Cons: Uses less data each time, so slightly higher bias.

37
New cards

LOOCV vs K-Fold Summary

LOOCV = slow, low bias, high variance.
K-Fold = faster, slightly higher bias, lower variance.
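
A sketch comparing the two schemes in scikit-learn; the built-in diabetes dataset and the MSE scoring choice are illustrative assumptions, not from the cards:

    # Same model, two resampling schemes; scores average test error across splits.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

    X, y = load_diabetes(return_X_y=True)
    model = LinearRegression()

    loocv = cross_val_score(model, X, y, cv=LeaveOneOut(),
                            scoring="neg_mean_squared_error")       # n fits: slow
    kfold = cross_val_score(model, X, y, cv=KFold(n_splits=10),
                            scoring="neg_mean_squared_error")       # 10 fits: fast
    print(loocv.mean(), kfold.mean())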

38
New cards

Basics of Decision Trees

Decision Trees split data into regions using predictor values to make predictions or classifications. They are easy to interpret but can overfit and have high variance.
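
A minimal sketch, assuming scikit-learn and its built-in iris data:

    # A single tree: axis-aligned splits on predictor values, read top to bottom.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree))   # each line is one split, e.g. "feature_3 <= 0.80"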

39
New cards

Method of Splitting and Building Trees

The model chooses the predictor and cutoff that best separate the data at each step, splitting until no further improvement is made or a stopping rule is reached.

40
New cards

Improvements from Bagging and Random Forests

Bagging builds many trees on different data samples and averages them to reduce variance, while Random Forests also randomize predictor selection at each split to make trees less correlated and more accurate.
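
A hedged sketch of both ensembles in scikit-learn (dataset and settings are illustrative):

    # Bagging: many trees on bootstrap samples; Random Forest: also subsamples predictors per split.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=0).fit(X, y)
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
    print(bag.score(X, y), rf.score(X, y))   # training accuracy only, for illustration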

41
New cards

Tree Sketching

Each region in a plot represents a split in the tree; you can draw one from the other by matching splits to the boundaries of the regions.

42
New cards

Difference Between Parametric and Non-Parametric Methods

Parametric methods assume a specific equation or shape for the model (like linear regression), while non-parametric methods make fewer assumptions and adapt more flexibly to the data (like KNN or trees).

43
New cards

Issues with High Collinearity Between Predictors

When predictors are highly correlated, it becomes difficult to separate their individual effects, leading to unstable or unreliable coefficient estimates.
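
A quick diagnostic sketch with NumPy (the two nearly identical predictors are an invented example):

    # Near-1 correlations between predictors signal collinearity problems.
    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = 2 * x1 + np.array([0.01, -0.02, 0.00, 0.01, -0.01])   # almost a copy of x1

    print(np.corrcoef(x1, x2))   # off-diagonal entries near 1.0 indicate high collinearity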

44
New cards

Purpose of and Methods for Dimension Reduction or Feature Selection

Methods like PCA or Lasso simplify models by reducing the number of predictors while keeping the most relevant information.
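
A minimal PCA sketch with scikit-learn (the iris data and two-component choice are illustrative):

    # PCA compresses correlated predictors into a few uncorrelated components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)   # share of total variance kept by each component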

45
New cards

How Degrees of Freedom Relate to Model Complexity

More degrees of freedom mean a more flexible and complex model that can capture data patterns better but is more likely to overfit.

46
New cards

Major Assumptions Various Methods May Make About the Input Data

Different models assume certain properties (like linearity, independence, or normality); breaking these assumptions can reduce model accuracy.

47
New cards

Attributes of “Sound Models” (No Data Leakage Between Train and Test Data)

A sound model keeps training and testing data separate so no information from the test set influences the training process.
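
A sketch of one leakage-safe workflow in scikit-learn (dataset, split size, and model are illustrative assumptions):

    # Split first, then fit every preprocessing step on the training data only.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)          # the scaler sees only training data: no leakage
    print(model.score(X_test, y_test))   # evaluated on data the model never saw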
