Vocabulary flashcards covering ML applications, learning types, data splitting, cross-validation, and Scikit-Learn pipelines.
Applications of Machine Learning
Common applications include healthcare (disease prediction), finance (fraud detection), e-commerce (recommendations), autonomous vehicles, NLP (chatbots, translation), and computer vision (facial recognition).
Supervised Learning
Training with labeled data to predict outputs. Examples: Linear Regression, SVM, Decision Trees.
Unsupervised Learning
Training with unlabeled data to find hidden patterns. Examples: K-Means (clustering), PCA (dimensionality reduction).
Reinforcement Learning
Learning through trial and error with rewards/punishments. Examples: Q-Learning (an algorithm), AlphaGo (a system trained this way).
Batch Learning
Learns from the entire dataset at once; retraining needed for updates.
Online Learning
Learns incrementally from data streams; adapts continuously.
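A minimal sketch of incremental learning, assuming a one-parameter model `y_hat = w * x` and a hypothetical helper `online_update`; neither comes from the flashcards. Each sample updates the weight as it arrives, so no full retraining pass is needed.

```python
def online_update(w, x, y, lr=0.05):
    """One gradient step on 0.5 * (w*x - y)**2 for a single sample."""
    error = w * x - y
    return w - lr * error * x

w = 0.0
# Toy stream where y = 2x; samples are seen one at a time.
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50
for x, y in stream:
    w = online_update(w, x, y)

print(round(w, 3))  # 2.0
```

A batch learner would instead need the whole list up front and a fresh fit whenever new samples appear.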
Overfitting
Model fits training data too well but fails on new data.
Regularization
Technique to prevent overfitting by penalizing model complexity.
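As one illustration, an L2 (ridge) penalty can be sketched for a single-feature model with no intercept. The function name `ridge_slope` and the toy data are assumptions made for this example. The penalty strength `alpha` pulls the fitted slope toward zero.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

# Larger alpha means a stronger penalty and a smaller slope.
slopes = [ridge_slope(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)]
print(slopes)
```

With `alpha = 0` the result is the ordinary least-squares slope; increasing `alpha` trades some training fit for a simpler (smaller-weight) model.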
Underfitting
Model is too simple, fails to capture patterns.
Training Set
Data used to train the model.
Testing Set
Data used to evaluate model performance on unseen data.
Dataset Split (70–80% / 20–30%)
Typical split: 70–80% training data and 20–30% testing data.
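A rough sketch of an 80/20 split using only the standard library. The helper name `train_test_split_simple` is hypothetical; Scikit-Learn offers `train_test_split` for the same purpose.

```python
import random

def train_test_split_simple(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and cut it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split_simple(range(10), test_fraction=0.2)
print(len(train), len(test))  # 8 2
```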
K-Fold Cross Validation
Data is split into k folds; train on k−1 folds, test on the remaining fold; repeat k times.
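The fold rotation described above can be sketched with plain index lists. This is an illustrative helper (`k_fold_indices` is an assumed name) that does not shuffle; in practice the data is often shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test fold:", test_idx)
```

Every sample lands in the test fold exactly once, so the k scores together cover the whole dataset.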
Stratified Sampling
Ensures class proportions are preserved in train/test splits (important for imbalanced datasets).
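A small sketch of the idea, assuming a hypothetical helper `stratified_split` that splits each class separately so the test set keeps the original class ratio. The labels are made-up toy data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)  # same fraction per class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["neg"] * 90 + ["pos"] * 10  # imbalanced: 10% positive
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
pos_in_test = sum(labels[i] == "pos" for i in test_idx)
print(len(test_idx), pos_in_test)  # 20 2
```

A purely random split of the same data could by chance leave the test set with very few positives or none at all, which is why stratification matters for imbalanced datasets.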
Scikit-Learn Design
Main features: Consistent API, Estimators (fit, predict, transform), Transformers, Pipelines, Cross-validation tools, Metrics.
Estimator (in Scikit-Learn)
An object that learns from data through its fit method; predictors add predict, and transformers add transform.
Transformer
An estimator that transforms data through its transform method (often used as a preprocessing step inside Pipelines).
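As an example, `StandardScaler` is one Scikit-Learn transformer. The sketch below assumes NumPy and Scikit-Learn are installed and uses made-up numbers. The scaler is fitted on the training data only, then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns mean and std from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics

print(scaler.mean_)
```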
Pipeline (Scikit-Learn)
A sequence of preprocessing + model steps applied consistently.
Why Use Pipelines
Prevents data leakage, simplifies workflows, and ensures that transformations fitted on the training data are applied identically to the testing data.
Pipeline Example
StandardScaler → Logistic Regression model (a typical pipeline).
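One way this pipeline might look in code, assuming Scikit-Learn is installed. The Iris dataset, the 80/20 stratified split, and `max_iter=1000` are choices made for this sketch, not part of the flashcard.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and the model are fitted together, on the training data only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting.
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Because the scaler never sees the test rows during `fit`, the test score is not contaminated by test-set statistics.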
Generalization error
The error rate on new cases (data the model has not seen during training).
Overfitting the training data
If the training error is low but the generalization error is high, it means that your model is…
Fit
The ______ method is used to build (train) models.
Association Rule Learning
Discovers patterns and relationships between attributes in large datasets.
Semisupervised Learning
A few labeled instances and plenty of unlabeled instances; algorithms are usually combinations of unsupervised and supervised methods.