Data Preprocessing, Machine Learning Models, and Ensemble Learning Techniques

Description and Tags

Flashcards covering key concepts in data preprocessing, machine learning models (linear models, tree-based ML models), and ensemble learning techniques. Good luck!

105 Terms

1
New cards

What is Feature Scaling?

A crucial preprocessing step in machine learning that ensures numerical features contribute equally to model performance.

2
New cards

Standardization Formula

x′ = (x - µ) / σ, where µ is the mean and σ is the standard deviation of the feature.

3
New cards

Normalization Formula (Min-Max Scaling)

x′ = (x - min(x)) / (max(x) - min(x))

4
New cards

Normalization Formula (Max-Abs Scaling)

x′ = x / max(|x|)

5
New cards

Robust Scaling Formula

x′ = (x - median(x)) / IQR, where IQR is the interquartile range.
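
A minimal Python sketch of these four scalers, assuming scikit-learn is installed; the toy array X is invented for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with an outlier

print(StandardScaler().fit_transform(X))  # (x - mean) / std
print(MinMaxScaler().fit_transform(X))    # (x - min) / (max - min)
print(MaxAbsScaler().fit_transform(X))    # x / max(|x|)
print(RobustScaler().fit_transform(X))    # (x - median) / IQR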

6
New cards

What is Ordinal Encoding?

A technique used when categorical features have an inherent order or ranking, assigning integer values reflecting relative position.

7
New cards

What is Nominal Encoding (One-Hot Encoding)?

A method applied when categories do not have any inherent order. Each category is transformed into a binary vector.
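
A short sketch of both encoders with scikit-learn; the category values and their ordering are invented for illustration:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

sizes = np.array([["small"], ["medium"], ["large"]])   # ordered categories
colors = np.array([["red"], ["green"], ["blue"]])      # unordered categories

# Ordinal encoding: explicit ranking small < medium < large -> 0, 1, 2
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ordinal.fit_transform(sizes))

# One-hot encoding: each color becomes its own binary column
# (sparse_output=False needs scikit-learn >= 1.2; older versions use sparse=False)
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(colors))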

8
New cards

Reciprocal Transformation Formula

y = 1 / x (Effective for reducing the impact of large values.)

9
New cards

Square Transformation Formula

y = x^2 (Useful for handling left-skewed data.)

10
New cards

Logarithmic Transformation Formula

y = log(x + 1) or y = log(x) (Appropriate for compressing right-skewed data.)

11
New cards

Square Root Transformation Formula

y = √x (Less aggressive than log transformation, often used for count data.)

12
New cards

Box-Cox Transformation Formula

x′ = (x^λ − 1) / λ for λ ≠ 0; log(x) for λ = 0 (Applicable only to strictly positive data.)

13
New cards

Yeo-Johnson Transformation Formula

Defined piecewise for positive, negative, and zero values; suitable for both positive and negative values. (See notes for full formula)

14
New cards

Quantile Transformation Overview

Maps the original data to a specified distribution (uniform or normal) using its empirical cumulative distribution function (CDF).
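
One way to apply the Box-Cox, Yeo-Johnson, and quantile transformations with scikit-learn; the strictly positive, right-skewed column X is invented for illustration:

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X = np.array([[1.0], [2.0], [5.0], [20.0], [100.0]])  # strictly positive, right-skewed

print(PowerTransformer(method="box-cox").fit_transform(X))      # positive data only
print(PowerTransformer(method="yeo-johnson").fit_transform(X))  # works for any sign
qt = QuantileTransformer(n_quantiles=5, output_distribution="normal")
print(qt.fit_transform(X))  # maps values through the empirical CDF to a normal shape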

15
New cards

What is Discretization / Binning?

Transforms continuous numerical features into categorical values by grouping them into discrete intervals or bins.

16
New cards

Unsupervised Binning Types

Uniform Binning (Equal Width), Quantile Binning (Equal Frequency), K-means Binning

17
New cards

Binarization Rule

yi = 0 if xi ≤ threshold, 1 if xi > threshold

18
New cards

Mean and Standard Deviation Method for Outlier Detection (Normally Distributed Data)

Lower Bound: µ - 3σ; Upper Bound: µ + 3σ

19
New cards

Interquartile Range (IQR) Method for Outlier Detection (Skewed Data)

Q1 = 25th percentile, Q3 = 75th percentile; IQR = Q3 - Q1; Lower Bound: Q1 - 1.5 × IQR; Upper Bound: Q3 + 1.5 × IQR
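
A small numpy sketch of both outlier-detection rules; the data vector is invented and contains one suspicious value:

import numpy as np

x = np.array([10, 12, 11, 13, 12, 95], dtype=float)

# Mean / standard deviation rule (roughly normal data)
mu, sigma = x.mean(), x.std()
print(x[(x < mu - 3 * sigma) | (x > mu + 3 * sigma)])

# IQR rule (more robust for skewed data)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])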

20
New cards

What is Trimming (Outlier Handling)?

Removing outliers from the dataset; useful when outliers are likely due to data entry errors or noise.

21
New cards

What is Capping (Winsorizing)?

Outliers are replaced with the nearest boundary value (e.g., 1st or 99th percentile).

22
New cards

Conditions for Removing Rows with Missing Values

Missing Completely at Random (MCAR), Low Missing Percentage (less than 5%), Distribution Preservation

23
New cards

KNN Imputer - Step 1: Calculate Squared Euclidean Distance

Distance(A, B) = Σ_i (feature_i(A) − feature_i(B))^2, using only available (non-missing) values.

24
New cards

KNN Imputer - Adjust Distance Using Missing Values Formula

Adjusted Distance = Squared Distance × (Total Columns / Columns Used)
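
A plain numpy sketch of this distance adjustment between two rows with missing values; rows a and b are invented, and scikit-learn's KNNImputer applies the same idea internally:

import numpy as np

a = np.array([1.0, np.nan, 3.0, 4.0])
b = np.array([2.0, 5.0, np.nan, 6.0])

mask = ~np.isnan(a) & ~np.isnan(b)           # columns present in both rows
squared = np.sum((a[mask] - b[mask]) ** 2)   # squared Euclidean on shared columns
adjusted = squared * (len(a) / mask.sum())   # scale by total columns / columns used
print(adjusted)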

25
New cards

Iterative Imputation Process

Predicts missing values based on other features using a model; iteratively updates imputed values until convergence.

26
New cards

SMOTE (Synthetic Minority Oversampling Technique)

A technique for handling imbalanced data by creating synthetic data points for the minority class.
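
A hedged sketch using the separate imbalanced-learn package (assumed to be installed); the imbalanced toy dataset is generated only for illustration:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(np.bincount(y))              # heavily imbalanced class counts

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))          # minority class topped up with synthetic samples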

27
New cards

Classification Metric - Accuracy

(TP + TN) / (TP + TN + FP + FN) - Proportion of total predictions that were correct.

28
New cards

Classification Metric - Precision

TP / (TP + FP) - Proportion of positive predictions that were truly positive.

29
New cards

Classification Metric - Recall (Sensitivity)

TP / (TP + FN) - Proportion of actual positive instances correctly identified.

30
New cards

Classification Metric - Specificity (True Negative Rate)

TN / (TN + FP) - Proportion of actual negative instances correctly identified.

31
New cards

Classification Metric - Balanced Accuracy

(Recall + Specificity) / 2 - Average of recall for each class; useful for imbalanced datasets.

32
New cards

Classification Metric - MCC

(TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) - Measures the correlation between observed and predicted binary classifications.

33
New cards

Classification Metric - Cohen’s Kappa (κ)

(Po - Pe) / (1 - Pe) - Measures agreement between predicted and actual labels, correcting for chance agreement.

34
New cards

Classification Metric - F1-score

2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall.

35
New cards

Classification Metric - Fβ-score

((1 + β^2) * Precision * Recall) / ((β^2 * Precision) + Recall) - Weighted harmonic mean of precision and recall.
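
Most of the count-based metrics above follow directly from the four confusion-matrix cells; a small sketch with invented counts:

import math

TP, FP, FN, TN = 40, 10, 5, 45  # invented confusion-matrix counts

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
beta = 2  # arbitrary choice weighting recall more heavily
fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(accuracy, precision, recall, specificity, balanced_accuracy, f1, fbeta, mcc)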

36
New cards

Classification Metric - Log Loss

−(1 / n) * Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)] - Penalizes confident incorrect probabilistic predictions.

37
New cards

Classification Metric - Area Under ROC (AUC)

Area under the ROC curve (TPR vs FPR) - Represents the model’s ability to distinguish between positive and negative classes.

38
New cards

True Positive Rate (TPR) Formula in ROC Curves

TPR = TP / (TP + FN) = TP / P

39
New cards

False Positive Rate (FPR) Formula in ROC Curves

FPR = FP / (FP + TN) = FP / N

40
New cards

How is the ROC Curve Generated?

By varying the decision threshold from very high to very low and calculating the TPR and FPR for each threshold.

41
New cards

What does the x-axis (FPR) represent in an ROC curve?

The rate of false alarms; also represents the percentage of negative classes incorrectly identified as positive.

42
New cards

What does the y-axis (TPR) represent in an ROC curve?

The rate of correct positive predictions; also represents the percentage of positive class points that are correctly identified.

43
New cards

Good agreement range for Cohen’s Kappa Score

κ value: agreement level
< 0: Less than chance
0.01–0.20: Slight
0.21–0.40: Fair
0.41–0.60: Moderate
0.61–0.80: Substantial
0.81–1.00: Almost perfect

44
New cards

Regression Metric - Mean Absolute Error (MAE)

(1 / n) * Σ |yi - ŷi| - Average absolute difference between predicted and actual values.

45
New cards

Regression Metric - Mean Squared Error (MSE)

(1 / n) * Σ (yi - ŷi)^2 - Average squared difference between predicted and actual values.

46
New cards

Regression Metric - Root Mean Squared Error (RMSE)

sqrt((1 / n) * Σ (yi - ŷi)^2) - Square root of MSE; in the same units as the response variable.

47
New cards

Regression Metric - R-squared (R²)

1 - (Σ(yi - ŷi)^2 / Σ(yi - ȳ)^2) - Proportion of the variance predictable from the independent variable(s).

48
New cards

Regression Metric - Adjusted R-squared

1 - ((1 - R²)(n - 1) / (n - p - 1)) - Modified R-squared that adjusts for the number of predictors (p) and observations (n).
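
A numpy sketch of these regression metrics; y_true, y_pred, and the predictor count p are invented for illustration:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])
n, p = len(y_true), 2  # n observations, p predictors (for adjusted R-squared)

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(mae, mse, rmse, r2, adj_r2)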

49
New cards

What is the goal of PCA?

To reduce the number of features while retaining the most significant information, making the dataset easier to analyze, visualize, and model.

50
New cards

What's the first step in PCA?

Mean Centering the Data: Subtracting the mean of each feature (column) from the data.

51
New cards

What is the purpose of computing the Covariance Matrix in PCA?

The covariance matrix C captures the relationships between the features.

52
New cards

In PCA, what do Eigenvalues represent?

Eigenvalues represent the magnitude of variance explained by each principal component.

53
New cards

In PCA, what do Eigenvectors represent?

Eigenvectors represent the directions (or axes) along which the data varies the most.
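
A minimal numpy sketch of those PCA steps (mean centering, covariance matrix, eigendecomposition, projection); the small data matrix is invented:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Xc = X - X.mean(axis=0)               # 1) mean-center each feature
C = np.cov(Xc, rowvar=False)          # 2) covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(C)  # 3) eigenvalues (variance) and eigenvectors (directions)
order = np.argsort(eigvals)[::-1]     # sort components by explained variance
components = eigvecs[:, order]
X_pca = Xc @ components[:, :1]        # 4) project onto the top principal component
print(eigvals[order], X_pca)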

54
New cards

What is the goal of LDA?

To find a projection that maximizes the separability between different classes while minimizing the variance within each class.

55
New cards

SW (LDA) - Within-Class Scatter Matrix - Description

Captures the spread of each class around its own mean

56
New cards

SB (LDA) - Between-Class Scatter Matrix - Description

Quantifies how far apart the class means are from the overall mean of the dataset

57
New cards

LDA Optimization Objective

Maximize the ratio of between-class variance to within-class variance.

58
New cards

Difference between PCA and LDA

PCA: A data-driven, unsupervised technique.
LDA: A supervised method that uses class labels to maximize the separability between classes.

59
New cards

SVD - Matrix Factorization

A = UΣV⊤ where:
U ∈ R^(m×m) is an orthogonal matrix (left singular vectors),
Σ ∈ R^(m×n) is a diagonal matrix (singular values),
V ∈ R^(n×n) is an orthogonal matrix (right singular vectors).

60
New cards

SVD - Truncating Top K Components - Description

To reduce matrix A to a lower-dimensional approximation using the top k singular values.
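
A numpy sketch of the factorization and a rank-k truncation; the matrix A and the choice k = 1 are invented for illustration:

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A)      # A = U @ diag(s) @ Vt
k = 1                            # keep only the largest singular value
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
print(s, A_k)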

61
New cards

OLS Closed Form Solution

β = (X⊤X)⁻¹X⊤Y

62
New cards

Gradient Descent Update Rule (Linear Regression)

β(n) = β(n−1) − α · ∇βL, where α is the learning rate.
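
A short numpy sketch of this update rule for linear regression with an MSE loss; the data, learning rate, and iteration count are arbitrary illustration choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(2)
alpha = 0.1  # learning rate
for _ in range(500):
    grad = -2 / len(y) * X.T @ (y - X @ beta)  # gradient of the MSE loss w.r.t. beta
    beta = beta - alpha * grad                 # update rule
print(beta)  # approaches the true coefficients [2, -1]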

63
New cards

What is Ridge Regression (L2 Regularization)?

A regularized version of linear regression that adds an L2 penalty to the loss function, thereby discouraging large coefficients and reducing model complexity.

64
New cards

Ridge Regression - Closed-Form Solution

β = (X⊤X + λI)⁻¹X⊤Y
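
Both closed-form solutions (OLS and ridge) side by side in numpy; X, Y, and the penalty lam are invented illustration values:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = np.linalg.inv(X.T @ X) @ X.T @ Y                      # (X'X)^-1 X'Y
lam = 1.0
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ Y  # (X'X + λI)^-1 X'Y
print(beta_ols, beta_ridge)  # ridge coefficients are shrunk toward zero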

65
New cards

As λ increases, what happens to the coefficients in Ridge Regression?

As λ increases, the magnitude of the coefficients decreases, but they never reach zero.

66
New cards

Lasso Regression (L1 Regularization) Description

Introduces an L1 penalty to the loss function, which encourages sparsity in the coefficient estimates, effectively performing feature selection

67
New cards

Why is Lasso Regression good for Feature Selection?

Lasso is often used for feature selection because, as λ increases, it forces some coefficients to become exactly zero, thereby removing those features from the model.

68
New cards

Elastic Net Regression Description

Combines both L1 (Lasso) and L2 (Ridge) regularization penalties, offering a balance between the two methods

69
New cards

Effect of increasing λ on coefficients (Ridge vs Lasso)

Ridge: Coefficients decrease toward zero but never exactly reach zero.
Lasso: Coefficients can shrink to exactly zero, leading to feature selection.

70
New cards

Limitation of the Perceptron Algorithm

The perceptron stops once it finds a separating hyperplane, not necessarily the best one.

71
New cards

Sigmoid Function

σ(z) = 1 / (1 + e^(−z))

72
New cards

Logistic Regression - Loss Function (Binary Cross-Entropy)

L = − (1 / m) * Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)]

73
New cards

Softmax Function

σ(z)_j = e^(z_j) / Σ(k=1 to K) e^(z_k), for j = 1, …, K (Converts raw scores (logits) into probabilities.)
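
A numpy sketch of the sigmoid, binary cross-entropy, and softmax formulas above; the logits and labels are invented:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

y = np.array([1, 0, 1])
y_hat = sigmoid(np.array([2.0, -1.0, 0.5]))
print(binary_cross_entropy(y, y_hat))
print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1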

74
New cards

KNN Step 1

Normalize the Data - Scale features appropriately using min-max normalization: x′ = (x − min(X)) / (max(X) − min(X))

75
New cards

KNN Prediction - Classification

Majority vote ŷ = mode({y_i}, i = 1 to K)

76
New cards

KNN Prediction - Regression

Weighted average ŷ = Σ(i=1 to K) w_i·y_i / Σ(i=1 to K) w_i, where w_i = 1 / d(x, x_i)
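
A small numpy sketch of KNN prediction with inverse-distance weighting as described above; the training points, query, and K are invented:

import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
x_query = np.array([2.5])
K = 3

d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances to the query
nearest = np.argsort(d)[:K]                          # indices of the K nearest neighbors
w = 1.0 / d[nearest]                                 # inverse-distance weights
y_hat = np.sum(w * y_train[nearest]) / np.sum(w)     # weighted-average regression prediction
print(y_hat)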

77
New cards

Naive Bayes - Bayes’ Theorem

P(A|B) = (P(B|A)P(A)) / P(B)

78
New cards

Naive Bayes - Classification Formulation

P(C_k | A, B, C) = (P(A|C_k) · P(B|C_k) · P(C|C_k) · P(C_k)) / P(A, B, C)

79
New cards

CART algorithm - step 2 - Classification Problems

Determine the best feature to split the data on using a criterion like Gini impurity or entropy

80
New cards

CART algorithm - step 2 - Numerical Problems

Determine the best feature to split on by computing the reduction in standard deviation of the target variable relative to the parent node.

81
New cards

Minimum entropy in a binary classification problem

0

82
New cards

Maximum entropy in a binary classification problem

1

83
New cards

Gini Impurity

G = 1 − Σ(i=1 to c) p_i^2

84
New cards

What split algorithm balances the tree (Gini vs Entropy)?

Gini Impurity

85
New cards

Information Gain Formula

Information Gain = Impurity(Parent) − Σ(i=1 to k) wi * Impurity(Childi)
= Entropy or Gini(Parent) − Weighted Average of Child Impurities
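
A numpy sketch of Gini impurity, entropy, and the information gain of one candidate split; the class labels and split are invented:

import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([1, 1])  # one candidate split

w_left, w_right = len(left) / len(parent), len(right) / len(parent)
info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
print(gini(parent), entropy(parent), info_gain)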

86
New cards

Formula for weighted standard deviation

Information Gain = Standard Deviation (Parent) − Weighted Standard Deviation (Children)

87
New cards

Feature Importance(i) in Decision Tree

Feature Importance(i) = Σ over nodes t that split on feature i of (N_t / N) · [Impurity_t − (N_t,left / N_t) · Impurity_t,left − (N_t,right / N_t) · Impurity_t,right]

88
New cards

Voting Ensemble - ensemble accuracy 3 independent models, each with an individual accuracy of 0.7

Ensemble accuracy = (0.7)^3 + 3 × (0.7)^2 × 0.3 = 0.343 + 0.441 = 0.784
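
The same majority-vote calculation in Python, generalized to any odd number of independent models with accuracy p; math.comb supplies the binomial coefficients:

from math import comb

def majority_vote_accuracy(p, n_models):
    # probability that more than half of the independent models are correct
    k_min = n_models // 2 + 1
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_min, n_models + 1))

print(majority_vote_accuracy(0.7, 3))  # 0.784, matching the worked example above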

89
New cards

Voting Ensemble - How is the final output determined in regression problems?

The mean of the individual model predictions is used as the final result.

90
New cards

Bagging (Bootstrap Aggregating)

Involves training multiple instances of the same model on different random subsets (with replacement) of the training data.

91
New cards

Boosting - key feature

Is a sequential ensemble technique where each model tries to correct the errors of its predecessor.

92
New cards

OOB in bagging - rate of non selected samples

approximately 37% of the samples are not selected for training any given model

93
New cards

What happens in Multi-Layered Stacking?

Models learn from the predictions of previous models, enhancing overall performance.

94
New cards

AdaBoost - Initial sample weight

wi = 1/n, for all i = 1, 2, . . . , n

95
New cards

Compute Learner Weight (α)

α = (1/2) · log((1 − ϵ) / ϵ)

96
New cards

Update Sample Weights - general formula

w_i^new = w_i · exp(−α · y_i · ŷ_i)

97
New cards

Normalize the updated weights formula

w_i^norm = w_i^new / Σ(j=1 to n) w_j^new
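
A numpy sketch of one AdaBoost round using these formulas; the labels and weak-learner predictions are invented, with labels in {-1, +1} as the exp(−α·y·ŷ) form assumes:

import numpy as np

y     = np.array([ 1, -1,  1,  1, -1])  # true labels in {-1, +1}
y_hat = np.array([ 1, -1, -1,  1, -1])  # weak learner's predictions
n = len(y)

w = np.full(n, 1 / n)                   # initial sample weights w_i = 1/n
eps = np.sum(w[y != y_hat])             # weighted error of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)   # learner weight
w_new = w * np.exp(-alpha * y * y_hat)  # misclassified samples get larger weights
w_norm = w_new / w_new.sum()            # normalize so the weights sum to 1
print(alpha, w_norm)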

98
New cards

Gradient Boosting result - key formula

Result = Model0(X) + η · Model1(X) + η · Model2(X) + · · · + η · Modeln(X)

99
New cards

Final log loss calculation - description

Total log loss = modelsList[0] + η · Σ(i=1 to n_estimators) modelsList[i]

100
New cards

Final probability calculation - description

Final prob = 1 / (1 + e^(−Total log loss))
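
A small Python sketch of how the final probability follows from the summed raw scores; models_list and eta are invented per-stage values matching the two cards above:

import numpy as np

models_list = [0.0, 1.2, -0.4, 0.8]  # invented outputs: base model + three boosting stages
eta = 0.1                            # learning rate

total = models_list[0] + eta * sum(models_list[1:])  # summed raw score across stages
final_prob = 1 / (1 + np.exp(-total))                # sigmoid of the total score
print(total, final_prob)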