Data Preprocessing
Prepare raw data: clean, normalize, standardize, and encode features
Cleaning
Handle missing values, remove outliers, and remove duplicates
Normalization
Scale features to a fixed range, typically 0 to 1
Standardization
Center features at zero mean with unit variance
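A minimal sketch of both transforms using scikit-learn; the sample matrix is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (values made up for illustration)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Normalization: rescale each column to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)
```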
Feature Engineering
Transform raw data into informative features
Dimensionality Reduction
Reduce the number of features to lower complexity, reduce overfitting, and speed up training
PCA
Project data onto directions of maximum variance
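A minimal PCA sketch with scikit-learn; the random data and the choice of 2 components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 3-feature data with one redundant feature (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0]  # feature 2 duplicates feature 0

# Keep the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured per component
```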
Imputation
Fill missing values to make dataset usable
Mean Imputation
Replace missing values with mean
Median Imputation
Replace missing values with median
Mode Imputation
Replace missing values with mode
KNN Imputation
Use nearest neighbors to fill missing values
Regression Imputation
Train a regression model on the other features to predict missing values
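A minimal sketch of mean and KNN imputation with scikit-learn; the matrix and n_neighbors=2 are illustrative assumptions:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy matrix with missing entries (values made up for illustration)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation (strategy can also be "median" or "most_frequent")
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill each gap from the 2 most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)
```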
Supervised Learning
Learn mapping from inputs to labeled outputs
Classification
Predict class label
Regression
Predict continuous values
KNN
Predict from the k closest points: majority vote for classification, average for regression
KNN Sensitivity
Sensitive to the choice of k (number of neighbors) and to feature scaling
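A minimal KNN sketch with scikit-learn; the Iris dataset and k=5 are illustrative choices, and scaling is applied first because of the sensitivity noted above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first: KNN distances are sensitive to feature magnitude
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on held-out data
```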
SVM
Find the hyperplane that maximizes the margin between classes
SVM Pros
Effective with high-dimensional data; robust to overfitting
SVM Cons
Slow training; requires careful parameter tuning
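A minimal SVM sketch with scikit-learn; the synthetic dataset, RBF kernel, and C=1.0 are illustrative assumptions (C and the kernel are the parameters that usually need tuning):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 2-class data (parameters made up for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, then fit a maximum-margin classifier with an RBF kernel
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```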
Naive Bayes
Assume features are conditionally independent given the class
Bayes Prior
Probability of class before observing features
Bayes Likelihood
Probability of features given class
Bayes Posterior
Probability of class given features
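A minimal worked example of combining prior and likelihood into a posterior; the classes and probabilities are toy numbers made up for illustration:

```python
# Posterior is proportional to prior * likelihood (toy numbers)
priors = {"spam": 0.4, "ham": 0.6}       # P(class)
likelihoods = {"spam": 0.8, "ham": 0.1}  # P(feature | class)

# Unnormalized posteriors, then normalize by the total evidence
unnorm = {c: priors[c] * likelihoods[c] for c in priors}
evidence = sum(unnorm.values())
posteriors = {c: p / evidence for c, p in unnorm.items()}

print(posteriors)  # {'spam': ~0.842, 'ham': ~0.158}
```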
Neural Network MLP
Input, hidden, and output layers for classification
Activation Functions
Sigmoid, ReLU, Tanh
Backpropagation
Compute gradients by propagating errors backward through the network to train it
Gradient Descent
Optimization method to minimize error
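A minimal gradient descent sketch in NumPy, fitting a 1-D linear model by minimizing mean squared error; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

# Toy data from y = 3x + 1 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=50)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    y_hat = w * x + b
    grad_w = 2.0 * np.mean((y_hat - y) * x)  # dMSE/dw
    grad_b = 2.0 * np.mean(y_hat - y)        # dMSE/db
    w -= lr * grad_w                         # step against the gradient
    b -= lr * grad_b

print(w, b)  # should approach 3.0 and 1.0
```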
Ensemble Learning
Combine models to improve performance
Bagging
Train multiple models on different bootstrap subsets to reduce variance
Boosting
Sequentially train models that focus on previous errors, reducing bias
Random Forest
Collection of decision trees with bagging
AdaBoost
Boost weak learners sequentially to improve accuracy
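A minimal sketch contrasting both ensemble styles with scikit-learn; the dataset and n_estimators=100 are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging of decision trees: Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Boosting of weak learners: AdaBoost (decision stumps by default)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)

print(rf.score(X_test, y_test), ada.score(X_test, y_test))
```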
Decision Tree
Split data to reduce uncertainty and predict classes
Information Gain
Reduction in entropy after splitting on attribute
Entropy
Measure of uncertainty or disorder in data
Parent Entropy
Entropy of dataset before split
Subset Entropy
Entropy of subset after split
Info Gain Calculation
Parent entropy minus the weighted average entropy of the subsets
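A minimal sketch of the calculation in NumPy; the split below is a toy example made up for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy split: 3 positives and 1 negative go left, 2 negatives go right
parent = np.array([1, 1, 1, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([0, 0])
print(information_gain(parent, [left, right]))  # ~0.459 bits
```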
KNN Centroid
Assign a class by comparing distance to the mean (centroid) of each class's points
Centroid
Mean position of points in class
Euclidean Distance
Straight-line distance between points in multidimensional space
KNN Prediction
Assign to the class with the nearest centroid, or by majority vote of the k nearest neighbors
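A minimal nearest-centroid sketch in NumPy (scikit-learn also offers a NearestCentroid class); the 2-D points are made up for illustration:

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Assign x to the class whose centroid is closest (Euclidean)."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(dists)]

# Toy 2-D points: class 0 near the origin, class 1 near (5.5, 5)
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
print(nearest_centroid_predict(X, y, np.array([0.5, 0.2])))  # -> 0
```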
Cross Validation
Split data into folds, training and testing on different folds to estimate performance on unseen data
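A minimal cross-validation sketch with scikit-learn; the Iris dataset, KNN model, and 5 folds are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean(), scores.std())
```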
Overfitting
Model fits training data too closely and performs poorly on new data
Underfitting
Model is too simple and fails to capture patterns
Feature Scaling
Adjust range of features for algorithms sensitive to magnitude
Hyperplane
Decision boundary in SVM separating classes
Margin
Distance between hyperplane and closest data points
Use Case Selection
Choose an algorithm based on data size, feature type, and goal