Machine Learning Definition (Mitchell)
Mitchell's definition of Machine Learning: a computer program is said to learn from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with experience E. The program is not explicitly programmed for each step of the task.
Supervised Learning: Training Data
In Supervised Learning, the training data consists of labeled examples: a set of pairs {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}. Each xᵢ is a feature vector (input), and each yᵢ is its corresponding label (output). The goal is to learn a function mapping features to labels.
Feature Vector Example
A feature vector is a structured representation of an input instance. Example for weather: x₁ = [sun, hot, high, weak]^T, where each element is a value for a feature (Outlook, Temperature, Humidity, Wind). This vector is the input for a classifier.
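A minimal sketch of how such labeled pairs could be represented in Python; the extra weather rows and the play/don't-play labels are invented for illustration and are not part of the deck.

```python
# Labeled training set as (feature vector, label) pairs: {(x1, y1), ..., (xn, yn)}.
training_data = [
    # x = [Outlook, Temperature, Humidity, Wind], y = class label (assumed toy data)
    (["sun", "hot", "high", "weak"], "no"),
    (["overcast", "hot", "high", "weak"], "yes"),
    (["rain", "mild", "high", "weak"], "yes"),
]

x1, y1 = training_data[0]
print(x1)  # ['sun', 'hot', 'high', 'weak']  -> the feature vector (input)
print(y1)  # 'no'                            -> its corresponding label (output)
```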
Unsupervised Learning
In Unsupervised Learning, the input data is unlabeled: {x₁, x₂, …, xₙ}, containing only feature vectors. The goal is to find inherent patterns, such as grouping similar data points (clustering), anomaly detection, or knowledge discovery, without predefined categories.
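A minimal clustering sketch, assuming scikit-learn and NumPy are available; the four 2-D points are made up and simply show that only feature vectors (no labels) are given.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: feature vectors only, no y values.
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment per point, e.g. [0 0 1 1]
```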
Example ML Problem: Handwriting Recognition
Task T: Recognize and classify handwritten words. Performance P: Percentage of words correctly classified. Experience E: A database of handwritten words with known classifications (labeled data). This is a supervised learning problem.
Example ML Problem: Self-Driving Car
Task T: Drive on a public four-lane highway using vision sensors. Performance P: Average distance traveled before an error (as judged by a human supervisor). Experience E: A sequence of images and corresponding steering commands recorded while observing a human driver. This is a supervised learning problem.
Decision Tree Learning
Decision Tree Learning approximates discrete-valued target functions; the learned hypothesis h: x → y is represented as a tree. Internal nodes test a feature/attribute, branches correspond to attribute values, and leaf nodes provide the final classification decision. Any function in Disjunctive Normal Form (DNF) can be expressed as a decision tree.
Building a Decision Tree: Key Principle
Build a decision tree using a divide-and-conquer, greedy strategy: Always test the most important attribute first (the one that provides the greatest information gain or reduction in entropy/impurity for classification). Then, recursively build subtrees for each resulting subset of data.
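A minimal recursive sketch of this greedy, divide-and-conquer strategy. The dict-per-row data layout, the nested-dict tree format, and the helper names (entropy, information_gain, majority_label) are illustrative choices, not the deck's prescribed implementation.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_c p_c * log2(p_c) over the class proportions in S.
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(rows, labels, attribute):
    # Reduction in entropy from splitting on this attribute.
    total = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[attribute] == value]
        remainder += (len(sub) / total) * entropy(sub)
    return entropy(labels) - remainder

def majority_label(labels):
    # Most common label in a subset (used when no attributes remain).
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:      # pure subset -> leaf with that label
        return labels[0]
    if not attributes:             # no attributes left -> majority-vote leaf
        return majority_label(labels)
    # Greedy step: test the attribute with the highest information gain first.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in keep],
                                       [labels[i] for i in keep],
                                       [a for a in attributes if a != best])
    return tree
```

Calling build_tree on a small labeled dataset returns a nested dict whose top-level key is the attribute chosen first, i.e. the one with the highest gain.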
Information Gain
Information Gain measures the effectiveness of an attribute for classifying data. It is defined as the reduction in entropy achieved by splitting the dataset on that attribute. Formally, Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ), where Sᵥ is the subset of S in which attribute A has value v, and Entropy(S) = −Σ_c p_c log₂ p_c with p_c the proportion of class c in S. The attribute with the highest gain is chosen for splitting.
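A short worked check of the formula on an assumed toy split: 14 examples (9 positive, 5 negative) split by a binary attribute into a "weak" branch (6+/2−) and a "strong" branch (3+/3−). The numbers are illustrative, not from the deck.

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

# Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
gain = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(entropy(9, 5), 3), round(gain, 3))  # ≈ 0.940 and ≈ 0.048
```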
K-Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors (KNN) predicts the label for a new instance xnew by: 1) Finding the k training examples closest to xnew (using a distance metric, e.g., Euclidean). 2) For classification: Taking a majority vote among their labels. For k=1, it simply uses the label of the single nearest neighbor.
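A minimal KNN sketch following the two steps above; the toy training points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    # 1) Euclidean distance from x_new to every training example.
    dists = [math.dist(x_new, x) for x in X_train]
    # 2) Indices of the k closest training examples.
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # 3) Majority vote among their labels (for k=1 this is just the nearest label).
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = [[1.0, 1.0], [1.2, 0.9], [6.0, 6.1], [5.8, 6.3]]
y_train = ["A", "A", "B", "B"]
print(knn_predict([1.1, 1.0], X_train, y_train, k=3))  # -> "A"
```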
KNN: Need for Normalization
KNN requires feature normalization (or standardization). Because it uses distance metrics, if features have different scales (e.g., age 0-100 vs. salary 0-100,000), the feature with the larger range will disproportionately dominate the distance calculation, skewing results. Normalization puts all features on a comparable scale.
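A min–max normalization sketch (one common choice; standardization to zero mean and unit variance is the usual alternative). The age/salary rows are invented to show the scale mismatch.

```python
def min_max_normalize(X):
    # Rescale each feature (column) to [0, 1] so no single feature dominates distances.
    n_features = len(X[0])
    mins = [min(row[j] for row in X) for j in range(n_features)]
    maxs = [max(row[j] for row in X) for j in range(n_features)]
    return [
        [(row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
         for j in range(n_features)]
        for row in X
    ]

X = [[25, 30_000], [60, 95_000], [40, 52_000]]  # age vs. salary: very different scales
print(min_max_normalize(X))
```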
Generalization in ML
Generalization refers to a model's ability to perform well on unseen data (the test set), not just the training data. Desired properties: 1) Stability: Small changes in training data cause minimal changes in predictions. 2) Consistent performance: Performance metric P remains similar between training and testing. KNN with k > 1 is generally stable due to its voting mechanism.
Bagging (Bootstrap Aggregating)
Bagging reduces variance in unstable models (like deep decision trees). Process: 1) Create K bootstrap samples (datasets of size n) by sampling with replacement from the original training set. 2) Train a separate model (e.g., decision tree) on each sample. 3) Aggregate predictions: average for regression, majority vote for classification. The final (averaged) model is h_bag(x) = (1/K) Σᵢ hᵢ(x).
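A bagging sketch assuming scikit-learn decision trees and NumPy; the synthetic dataset, K = 25, and the labeling rule are arbitrary illustrative choices.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels for illustration

K, n = 25, len(X)
models = []
for _ in range(K):
    idx = rng.integers(0, n, size=n)       # bootstrap sample: size n, with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def h_bag(x):
    # Classification: majority vote across the K trees.
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return Counter(votes).most_common(1)[0][0]

print(h_bag(np.array([1.0, 1.0, 0.0, 0.0])))  # -> 1 under this synthetic rule
```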
Problem with Bagging Standard Decision Trees
A problem arises when bagging standard decision trees: If certain features have consistently high information gain, most trees will greedily select those same features at the root. This makes the trees highly correlated, limiting the variance reduction benefit of aggregation.
Random Forests: Core Idea
Random Forests improve upon bagging by decorrelating the trees. They introduce two sources of randomness: 1) Row (data) sampling: bootstrap sampling, as in bagging. 2) Feature sampling: at each split in a tree, only a random subset of m features (typically m ≈ √p, where p is the total number of features) is considered for splitting. This forces the trees to differ from one another.
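A usage sketch assuming scikit-learn's RandomForestClassifier; the dataset and parameter values are placeholders chosen to highlight the two sources of randomness.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees (row sampling)
    max_features="sqrt",   # m ≈ √p features considered at each split (feature sampling)
    bootstrap=True,
    random_state=0,
).fit(X, y)
print(forest.feature_importances_[:4])  # built-in feature-importance estimates
```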
Random Forests: Advantages
Advantages of Random Forests: 1) Increased robustness and accuracy by averaging many decorrelated trees. 2) Very good stability without needing heavy pruning. 3) Computational efficiency and ease of parallelization (trees are independent). 4) Provides estimates of feature importance.
Cross-Validation (CV) Process
K-Fold Cross-Validation process: 1) Randomly split the training data into K non-overlapping folds (subsets). 2) For i = 1 to K: Train the model on the other K-1 folds and validate its performance on the i-th fold. 3) Average the performance (e.g., accuracy) across the K validation folds to get a more reliable estimate of model performance.
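A K-fold sketch assuming scikit-learn; the KNN model, K = 5, and the synthetic data are placeholder choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=8, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])                # train on the other K-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the held-out fold
print(np.mean(scores))  # averaged estimate of model performance
```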
Purpose of Cross-Validation
The primary purpose of cross-validation is model evaluation and selection without touching the final test set. It provides a low-bias estimate of a model's generalization performance by using all data for both training and validation in a structured way. It is commonly used for hyperparameter tuning.
Leave-One-Out Cross-Validation (LOOCV)
Leave-One-Out Cross-Validation (LOOCV) is an extreme case where K = n (number of training examples). Each iteration uses n-1 examples for training and the single remaining example for validation. It is computationally expensive but provides an almost unbiased performance estimate, as it maximizes the training data used each time.
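A LOOCV sketch assuming scikit-learn, where LeaveOneOut is simply K-fold with K = n; the model and data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=6, random_state=0)
# One score per example: train on the other n-1, validate on the one left out.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # n single-example validations, then averaged
```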
Two Layers of Model Evaluation with CV
Proper ML practice involves two layers of evaluation: 1) Model Selection/Validation: Use cross-validation on the training set to compare different models or tune hyperparameters. 2) Final Evaluation: After selecting the best model, evaluate its performance once on the held-out test set to estimate real-world performance. The test set is used only once.
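A sketch of this two-layer workflow assuming scikit-learn; the KNN model, the hyperparameter grid, and the 80/20 split are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Layer 1: cross-validated hyperparameter search on the training set only.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# Layer 2: a single evaluation of the selected model on the held-out test set.
print(search.best_params_, search.score(X_test, y_test))
```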