Flashcards covering key concepts in data preprocessing, machine learning models (linear models, tree-based models), and ensemble learning techniques. Good luck!
What is Feature Scaling?
A crucial preprocessing step in machine learning that ensures numerical features contribute equally to model performance.
Standardization Formula
x′ = (x - µ) / σ, where µ is the mean and σ is the standard deviation of the feature.
Normalization Formula (Min-Max Scaling)
x′ = (x - min(x)) / (max(x) - min(x))
Normalization Formula (Max-Abs Scaling)
x′ = x / max(|x|)
Robust Scaling Formula
x′ = (x - median(x)) / IQR, where IQR is the interquartile range.
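A minimal NumPy sketch of the four scaling formulas above, applied to a hypothetical 1-D feature (the array `x` is illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])        # hypothetical feature with one large value

standardized = (x - x.mean()) / x.std()         # (x - mu) / sigma
min_max = (x - x.min()) / (x.max() - x.min())   # rescales to [0, 1]
max_abs = x / np.abs(x).max()                   # rescales to [-1, 1]
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)         # (x - median) / IQR
```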
What is Ordinal Encoding?
A technique used when categorical features have an inherent order or ranking, assigning integer values reflecting relative position.
What is Nominal Encoding (One-Hot Encoding)?
A method applied when categories do not have any inherent order. Each category is transformed into a binary vector.
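A sketch of both encodings with scikit-learn, assuming hypothetical `size` (ordered) and `color` (unordered) columns; `sparse_output` assumes scikit-learn ≥ 1.2:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"size": ["S", "M", "L", "M"], "color": ["red", "blue", "green", "red"]})

# Ordinal: the categories have an inherent order, so pass it explicitly.
size_ord = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(df[["size"]])  # S->0, M->1, L->2

# Nominal: no order, so each color becomes its own binary column.
color_ohe = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])
```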
Reciprocal Transformation Formula
y = 1 / x (Effective for reducing the impact of large values.)
Square Transformation Formula
y = x^2 (Useful for handling left-skewed data.)
Logarithmic Transformation Formula
y = log(x + 1) or y = log(x) (Appropriate for compressing right-skewed data.)
Square Root Transformation Formula
y = √x (Less aggressive than log transformation, often used for count data.)
Box-Cox Transformation Formula
x′ = (x^λ − 1) / λ for λ ≠ 0; x′ = log(x) for λ = 0 (Applicable only to strictly positive data.)
Yeo-Johnson Transformation Formula
Defined piecewise for positive, negative, and zero values; suitable for both positive and negative values. (See notes for full formula)
Quantile Transformation Overview
Maps the original data to a specified distribution (uniform or normal) using its empirical cumulative distribution function (CDF).
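A sketch of these transformations on a right-skewed, strictly positive feature, using NumPy for the simple formulas and scikit-learn's PowerTransformer / QuantileTransformer for Box-Cox, Yeo-Johnson, and quantile mapping (the generated data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

x = np.random.exponential(scale=2.0, size=(200, 1)) + 0.01   # right-skewed, strictly positive

log_x = np.log(x + 1)    # logarithmic
sqrt_x = np.sqrt(x)      # square root
recip_x = 1.0 / x        # reciprocal
sq_x = x ** 2            # square (would suit left-skewed data)

box_cox = PowerTransformer(method="box-cox").fit_transform(x)         # positive data only
yeo_johnson = PowerTransformer(method="yeo-johnson").fit_transform(x)  # handles zero/negative values too
quantile = QuantileTransformer(output_distribution="normal", n_quantiles=100).fit_transform(x)
```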
What is Discretization / Binning?
Transforms continuous numerical features into categorical values by grouping them into discrete intervals or bins.
Unsupervised Binning Types
Uniform Binning (Equal Width), Quantile Binning (Equal Frequency), K-means Binning
Binarization Rule
yi = 0 if xi ≤ threshold, 1 if xi > threshold
Mean and Standard Deviation Method for Outlier Detection (Normally Distributed Data)
Lower Bound: µ - 3σ; Upper Bound: µ + 3σ
Interquartile Range (IQR) Method for Outlier Detection (Skewed Data)
Q1 = 25th percentile, Q3 = 75th percentile; IQR = Q3 - Q1; Lower Bound: Q1 - 1.5 × IQR; Upper Bound: Q3 + 1.5 × IQR
What is Trimming (Outlier Handling)?
Removing outliers from the dataset; useful when outliers are likely due to data entry errors or noise.
What is Capping (Winsorizing)?
Outliers are replaced with the nearest boundary value (e.g., 1st or 99th percentile).
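A NumPy sketch of the IQR bounds followed by trimming versus capping, on a hypothetical 1-D array:

```python
import numpy as np

x = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 95.0])   # 95 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = x[(x >= lower) & (x <= upper)]    # trimming: drop the outliers
capped = np.clip(x, lower, upper)           # capping (winsorizing): clamp to the bounds
```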
Conditions for Removing Rows with Missing Values
Missing Completely at Random (MCAR), Low Missing Percentage (less than 5%), Distribution Preservation
KNN Imputer - Step 1: Calculate Squared Euclidean Distance
Distance(A, B) = Σ(feature(A)i - feature(B)i)^2, using only available (non-missing) values.
KNN Imputer - Adjust Distance Using Missing Values Formula
Adjusted Distance = Squared Distance × (Total Columns / Columns Used)
Iterative Imputation Process
Predicts missing values based on other features using a model; iteratively updates imputed values until convergence.
SMOTE (Synthetic Minority Oversampling Technique)
A technique for handling imbalanced data by creating synthetic data points for the minority class.
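A sketch of SMOTE on synthetic imbalanced data, assuming the imbalanced-learn package is installed (`pip install imbalanced-learn`):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset: roughly 90% / 10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))   # minority class is synthetically oversampled to parity
```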
Classification Metric - Accuracy
(TP + TN) / (TP + TN + FP + FN) - Proportion of total predictions that were correct.
Classification Metric - Precision
TP / (TP + FP) - Proportion of positive predictions that were truly positive.
Classification Metric - Recall (Sensitivity)
TP / (TP + FN) - Proportion of actual positive instances correctly identified.
Classification Metric - Specificity (True Negative Rate)
TN / (TN + FP) - Proportion of actual negative instances correctly identified.
Classification Metric - Balanced Accuracy
(Recall + Specificity) / 2 - Average of recall for each class; useful for imbalanced datasets.
Classification Metric - MCC
(TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) - Measures the correlation between observed and predicted binary classifications.
Classification Metric - Cohen’s Kappa (κ)
(Po - Pe) / (1 - Pe) - Measures agreement between predicted and actual labels, correcting for chance agreement.
Classification Metric - F1-score
2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall.
Classification Metric - Fβ-score
((1 + β^2) * Precision * Recall) / ((β^2 * Precision) + Recall) - Weighted harmonic mean of precision and recall.
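A small sketch computing the confusion-matrix metrics above directly from hypothetical TP/TN/FP/FN counts:

```python
import math

tp, tn, fp, fn = 40, 45, 5, 10   # hypothetical counts

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # sensitivity / TPR
specificity = tn / (tn + fp)                 # true negative rate
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```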
Classification Metric - Log Loss
− (1 / n) * Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)] - Penalizes confident wrong predictions heavily; lower values are better.
Classification Metric - Area Under ROC (AUC)
Area under the ROC curve (TPR vs FPR) - Represents the model’s ability to distinguish between positive and negative classes.
True Positive Rate (TPR) Formula in ROC Curves
TPR = TP / (TP + FN) = TP / P
False Positive Rate (FPR) Formula in ROC Curves
FPR = FP / (FP + TN) = FP / N
How is the ROC Curve Generated?
By varying the decision threshold from very high to very low and calculating the TPR and FPR for each threshold.
What does the x-axis (FPR) represent in an ROC curve?
The rate of false alarms; also represents the percentage of negative classes incorrectly identified as positive.
What does the y-axis (TPR) represent in an ROC curve?
The rate of correct positive predictions; also represents the percentage of positive class points that are correctly identified.
Good agreement range for Cohen’s Kappa Score
κ value → agreement level:
< 0: Less than chance
0.01–0.20: Slight
0.21–0.40: Fair
0.41–0.60: Moderate
0.61–0.80: Substantial
0.81–1.00: Almost perfect
Regression Metric - Mean Absolute Error (MAE)
(1 / n) * Σ |yi - ŷi| - Average absolute difference between predicted and actual values.
Regression Metric - Mean Squared Error (MSE)
(1 / n) * Σ (yi - ŷi)^2 - Average squared difference between predicted and actual values.
Regression Metric - Root Mean Squared Error (RMSE)
sqrt((1 / n) * Σ (yi - ŷi)^2) - Square root of MSE; in the same units as the response variable.
Regression Metric - R-squared (R2)
1 - (Σ(yi - ŷi)^2 / Σ(yi - ȳ)^2) - Proportion of the variance predictable from the independent variable(s).
Regression Metric - Adjusted R-squared
1 - ((1 - R2)(n - 1) / (n - p - 1)) - Modified R-squared that adjusts for the number of predictors (p) and observations (n).
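A NumPy sketch of the regression metrics above on hypothetical predictions (n observations, p predictors):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.0])
n, p = len(y_true), 1

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```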
What is the goal of PCA?
To reduce the number of features while retaining the most significant information, making the dataset easier to analyze, visualize, and model.
What's the first step in PCA?
Mean Centering the Data: Subtracting the mean of each feature (column) from the data.
What is the purpose of computing the Covariance Matrix in PCA?
The covariance matrix C captures the relationships between the features.
In PCA, what do Eigenvalues represent?
Eigenvalues represent the magnitude of variance explained by each principal component
In PCA, what do Eigenvectors represent?
Eigenvectors represent the directions (or axes) along which the data varies the most.
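A NumPy sketch of the PCA steps above (mean centering, covariance matrix, eigendecomposition, projection) on random illustrative data:

```python
import numpy as np

X = np.random.rand(100, 3)                  # hypothetical data: 100 samples, 3 features

X_centered = X - X.mean(axis=0)             # step 1: mean centering
C = np.cov(X_centered, rowvar=False)        # step 2: covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues = variance explained, eigenvectors = directions

order = np.argsort(eigvals)[::-1]           # sort components by explained variance (descending)
X_pca = X_centered @ eigvecs[:, order[:2]]  # project onto the top 2 principal components
```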
What is the goal of LDA?
To find a projection that maximizes the separability between different classes while minimizing the variance within each class.
SW (LDA) - Within-Class Scatter Matrix - Description
Captures the spread of each class around its own mean
SB (LDA) - Between-Class Scatter Matrix - Description
Quantifies how far apart the class means are from the overall mean of the dataset
LDA Optimization Objective
Maximize the ratio of between-class variance to within-class variance.
Difference between PCA and LDA
PCA: A data-driven, unsupervised technique.
LDA: A supervised method that uses class labels to maximize the separability between classes.
SVD - Matrix Factorization
A = UΣV^T, where:
U ∈ R^(m×m) is an orthogonal matrix (left singular vectors),
Σ ∈ R^(m×n) is a diagonal matrix (singular values),
V ∈ R^(n×n) is an orthogonal matrix (right singular vectors).
SVD - Truncating Top K Components - Description
To reduce matrix A to a lower-dimensional approximation using the top k singular values.
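A NumPy sketch of truncating the SVD to the top k singular values to get a rank-k approximation of A:

```python
import numpy as np

A = np.random.rand(6, 4)                             # hypothetical matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U @ diag(s) @ Vt

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation of A
```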
OLS Closed Form Solution
β = (X^T X)^(−1) X^T Y
Gradient Descent Update Rule (Linear Regression)
β(n) = β(n−1) − α · ∇βL, where α is the learning rate.
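A NumPy sketch of both cards: the OLS closed-form solution and a plain gradient-descent loop, assuming an MSE loss and illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])       # intercept column + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y                  # beta = (X^T X)^(-1) X^T Y

beta, alpha = np.zeros(2), 0.1
for _ in range(1000):
    grad = (2 / len(y)) * X.T @ (X @ beta - y)                  # gradient of the MSE loss w.r.t. beta
    beta = beta - alpha * grad                                  # beta(n) = beta(n-1) - alpha * grad
```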
What is Ridge Regression (L2 Regularization)?
A regularized version of linear regression that adds an L2 penalty to the loss function, thereby discouraging large coefficients and reducing model complexity.
Ridge Regression - Closed-Form Solution
β = (X^T X + λI)^(−1) X^T Y
As λ increases, what happens to the coefficients in Ridge Regression?
As λ increases, the magnitude of the coefficients decreases, but they never reach zero.
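A NumPy sketch of the ridge closed-form solution, showing how the coefficients shrink (but never hit zero) as λ grows; the simulated data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

for lam in [0.0, 1.0, 100.0]:
    beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y   # (X^T X + lambda*I)^(-1) X^T Y
    print(lam, np.round(beta_ridge, 3))   # magnitudes shrink toward zero as lambda increases
```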
Lasso Regression (L1 Regularization) Description
Introduces an L1 penalty to the loss function, which encourages sparsity in the coefficient estimates, effectively performing feature selection
Why is Lasso Regression good for Feature Selection?
Lasso is often used for feature selection because, as λ increases, it forces some coefficients to become exactly zero, thereby removing those features from the model.
Elastic Net Regression Description
Combines both L1 (Lasso) and L2 (Ridge) regularization penalties, offering a balance between the two methods
Effect of increasing λ on coefficients (Ridge vs Lasso)
Ridge: Coefficients decrease toward zero but never exactly reach zero.
Lasso: Coefficients can shrink to exactly zero, leading to feature selection.
Limitation of the Perceptron Algorithm
The perceptron stops once it finds a separating hyperplane, not necessarily the best one.
Sigmoid Function
σ(z) = 1 / (1 + e^(−z))
Logistic Regression - Loss Function (Binary Cross-Entropy)
L = − (1 / m) * Σ [yi log(ŷi) + (1 - yi) log(1 - ŷi)]
Softmax Function
σ(z)j = e^(zj) / Σ(k=1 to K) e^(zk) for j = 1, ..., K (Converts raw scores (logits) into probabilities.)
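A NumPy sketch of the sigmoid and softmax functions as defined above (the max subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract max for numerical stability; the result is unchanged
    return e / e.sum()

print(sigmoid(0.0))                          # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))    # probabilities that sum to 1
```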
KNN Step 1
Normalize the Data - Scale features appropriately using min-max normalization: x′ = (x − min(X)) / (max(X) − min(X))
KNN Prediction - Classification
Majority vote ŷ = mode({yi}(i=1 to K) )
KNN Prediction - Regression
Weighted average ŷ = Σ(i=1 to K) wiyi / Σ(i=1 to K) wi , where wi = 1 / d(x, xi)
Naive Bayes - Bayes’ Theorem
P(A|B) = (P(B|A)P(A)) / P(B)
Naive Bayes - Classification Formulation
P(Ck|A, B, C) = (P(A|Ck)P(B|Ck)P(C|Ck)P(Ck)) / P(A, B, C)
CART algorithm - step 2 - Classification Problems
Determine the best feature to split the data on using a criterion like Gini impurity or entropy
CART algorithm - step 2 - Numerical Problems
Determine the best feature to split on by computing the standard deviation reduction of the target variable (parent node versus weighted child nodes).
Minimum entropy in a binary classification problem
0
Maximum entropy in a binary classification problem
1
Gini Impurity
G = 1 − Σ(i=1 to c) pi^2
Which split criterion tends to produce a more balanced tree (Gini vs entropy)?
Gini impurity
Information Gain Formula
Information Gain = Impurity(Parent) − Σ(i=1 to k) wi * Impurity(Childi)
= Entropy or Gini(Parent) − Weighted Average of Child Impurities
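A NumPy sketch of entropy, Gini impurity, and information gain for a binary split (function names are illustrative):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))   # 0 for a pure node, 1 for a 50/50 binary node

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)        # G = 1 - sum(pi^2)

def information_gain(parent, left, right, impurity=entropy):
    w_left, w_right = len(left) / len(parent), len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))

parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, parent[:3], parent[3:]))   # perfect split -> gain of 1.0
```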
Information Gain Formula for Regression Trees (Standard Deviation Reduction)
Information Gain = Standard Deviation (Parent) − Weighted Standard Deviation (Children)
Feature Importance(i) in Decision Tree
Importance(i) = Σ over nodes t that split on feature i of (Nt / N) · [Impurity(t) − (Nt,left / Nt) · Impurity(t,left) − (Nt,right / Nt) · Impurity(t,right)], where Nt is the number of samples reaching node t and N is the total number of samples.
Voting Ensemble - ensemble accuracy for 3 independent models, each with an individual accuracy of 0.7
P(majority correct) = (0.7)^3 + 3 × (0.7)^2 × 0.3 = 0.343 + 0.441 = 0.784
Voting Ensemble - How is the final output determined in regression problems?
The mean of the individual model predictions is used as the final result.
Bagging (Bootstrap Aggregating)
Involves training multiple instances of the same model on different random subsets (with replacement) of the training data.
Boosting - key feature
Is a sequential ensemble technique where each model tries to correct the errors of its predecessor.
OOB in bagging - rate of non-selected samples
Approximately 37% of the samples are not selected for training any given model, since (1 − 1/n)^n → 1/e ≈ 0.37 as n grows.
What happens in Multi-Layered Stacking?
Each layer of models learns from the previous layer's predictions, enhancing overall performance.
AdaBoost - Initial sample weight
wi = 1/n, for all i = 1, 2, . . . , n
Compute Learner Weight (α)
α = (1/2) · log((1 − ϵ)/ϵ), where ϵ is the weighted error rate of the weak learner.
Update Sample Weights - general formula
wi(new) = wi · exp(−α · yi · ŷi)
Normalize the updated weights formula
wi(norm) = wi(new) / Σ(j=1 to n) wj(new)
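A NumPy sketch of one AdaBoost round, combining the three formulas above (labels in {−1, +1} and the toy predictions are assumptions):

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])          # the weak learner misclassifies sample 2
w = np.full(len(y_true), 1 / len(y_true))      # initial weights wi = 1/n

eps = np.sum(w[y_true != y_pred])              # weighted error rate
alpha = 0.5 * np.log((1 - eps) / eps)          # learner weight

w_new = w * np.exp(-alpha * y_true * y_pred)   # misclassified samples get larger weights
w_norm = w_new / w_new.sum()                   # normalize so the weights sum to 1
```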
Gradient Boosting result - key formula
Result = Model0(X) + η · Model1(X) + η · Model2(X) + · · · + η · Modeln(X)
Final log-odds calculation - description
Total log-odds = modelsList[0] + η · Σ(i=1 to n_estimators) modelsList[i]
Final probability calculation - description
Final prob = 1 / (1 + e^(−Total log-odds))
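A small sketch of the two cards above: summing the staged outputs into a total log-odds score and passing it through the sigmoid (the `modelsList` outputs and η value are illustrative):

```python
import numpy as np

initial_score = 0.2              # output of modelsList[0] for one sample (the base prediction)
tree_outputs = [0.5, -0.1, 0.3]  # outputs of the subsequent estimators for the same sample
eta = 0.1                        # learning rate

total_log_odds = initial_score + eta * np.sum(tree_outputs)
final_prob = 1.0 / (1.0 + np.exp(-total_log_odds))   # final probability via the sigmoid
```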