(9/24, DS, Lecture) Feature Selection and Introduction to Machine Learning Models
Understanding Overfitting and Feature Quality
Overfitting: This occurs when a model learns the training data too well, including its noise and idiosyncrasies, leading to a decision boundary that is overly complex or "wiggly." Such models typically fail to generalize to new, unseen data, performing poorly on test sets.
Better Quality Features: The goal of robust feature quality is to allow the model's decision boundary to adapt appropriately to new data, rather than being rigidly fixed by the training data. This improves the model's ability to generalize.
Discriminative Capacity: Refers to the ability of a model or a set of features to effectively distinguish between different classes or categories within the data. High discriminative capacity is essential for accurate classification.
Data Structure and Representation
Data Columns: In a typical dataset, x1 and x2 represent two distinct data columns or features. For example, in a tumor scan, x1 could be the "hue value" and x2 could be the "saturation value" of a tumor.
Target Variable (y): The y column contains the labels or the information we aim to predict. In the tumor example, y could indicate whether a tumor is "malignant" or "benign."
Data Matrix (X): The data, with multiple features (e.g., 18 observations by 2 features), is structured as a matrix, typically denoted as X. Each row represents an observation, and each column represents a feature.
Example Data Point: A data point might look like (x1 = 10, x2 = 2) with a corresponding label y = malignant.
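As a minimal sketch of this representation (with made-up numbers, not real tumor data), the X matrix and y vector might be built like this:

```python
import numpy as np

# Hypothetical tumor data: each row is one observation,
# each column is one feature (x1 = hue, x2 = saturation).
X = np.array([
    [10.0, 2.0],   # the example data point (x1=10, x2=2)
    [8.5, 3.1],
    [2.0, 9.0],
])

# y holds one label per row (here 1 = malignant, 0 = benign).
y = np.array([1, 1, 0])

print(X.shape)       # (3, 2): 3 observations, 2 features
print(X[0], y[0])    # first observation and its label
```

A full dataset would simply have more rows (e.g., 18 observations by 2 features), but the row-per-observation, column-per-feature layout is the same.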
The Data Science Pipeline
The data science pipeline involves a sequential set of steps to build and deploy a data-driven solution:
Data Acquisition: The process of collecting or scraping raw data.
Data Cleaning: Preprocessing raw data to handle missing values, correct errors, and ensure consistency.
Feature Engineering: Creating new features from existing ones to improve model performance or provide more relevant information.
Feature Selection: Identifying and selecting the most relevant and impactful features from the dataset to use in model training.
Model Building: The core process of selecting, training, and evaluating machine learning models.
Limitations of Linear Models and the Need for Feature Transformation
Linear Model Limitations: A simple linear classifier (e.g., a straight line) works well when data points of different classes are linearly separable. For instance, if malignant tumors (crosses) and benign tumors (circles) can be perfectly separated by a straight line, a linear model (e.g., logistic regression classifier) is effective.
Non-linear Patterns: However, real-world data is often more complex. New data points (e.g., new tumor types or blurry images) might introduce non-linear patterns, making a straight line insufficient for separation. The old linear model will not generalize to this unseen data.
Feature Transformation: To address non-linear patterns, one can apply intelligent transformations to the existing features. This involves mapping the original feature values to a new space where they become linearly separable, allowing a linear classifier to work effectively again. This process is part of feature engineering and feature selection.
Impact on Accuracy: Selecting the right features and engineering new, more discriminative ones significantly impacts the downstream classification accuracy of the model.
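The effect of such a transformation can be sketched on synthetic data (the circular pattern and the engineered feature r2 = x1^2 + x2^2 are illustrative assumptions, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic non-linear pattern: class 1 lies inside a circle, class 0
# outside, so no straight line in (x1, x2) separates the classes.
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 2.0).astype(int)

# A linear classifier on the raw features cannot capture the circle...
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# ...but adding the engineered feature r2 = x1^2 + x2^2 maps the data
# to a space where a single threshold on r2 separates the classes.
X_new = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
transformed_acc = LogisticRegression().fit(X_new, y).score(X_new, y)

print(linear_acc, transformed_acc)
```

The same linear model family succeeds once the features make the classes linearly separable, which is the point of the transformation.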
Supervised vs. Unsupervised Learning
Supervised Learning: This approach requires labeled data, meaning the
ycolumn (the target variable) is available for the training dataset. With labels (e.g., "malignant" or "benign"), the model can learn to map inputs to outputs. Havingylabels makes it simpler to build and evaluate models by comparing predictions to known labels. Example: Cancer tumor classification where medical records provide labels.Unsupervised Learning: This approach is used when
ylabels are not available. This is often the case with real-world data like social media posts, where manual labeling is expensive or impossible. Without training labels, the goal is often to find hidden patterns, structures, or relationships within the data. Example: Analyzing Reddit posts without predefined categories (e.g., no labels for 'TDA' posts).Semi-Supervised Learning: A scenario where a small number of labels are available, but acquiring more is expensive. This necessitates devising strategies to leverage the limited labels effectively.
Feature Selection Strategies
Feature selection aims to find a high-quality subset of features that reduces noise, prevents overfitting, and improves model performance. These methods can be broadly categorized:
1. Wrapper Methods (Adaptive Methods)
Concept: These methods involve repeatedly training and evaluating a machine learning model using different subsets of features. The performance of the model guides the selection process.
Process: Search for a subset of features; evaluate the selected subset by building a classifier; repeat until satisfactory performance is achieved.
Computational Cost: Can be computationally expensive, especially with many features, as it involves building and evaluating multiple models.
Greedy Search: Common types include:
Forward Search: Start with no features and incrementally add the feature that provides the greatest improvement in model accuracy.
Backward Search: Start with all features and incrementally remove the feature whose removal causes the least decrease (or greatest increase) in model accuracy.
Tool: Scikit-learn's SequentialFeatureSelector implements both forward and backward search.
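A short sketch of forward search with SequentialFeatureSelector (the iris dataset and the k-nearest-neighbors estimator are illustrative choices, not from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward search: start with no features and greedily add the one that
# most improves cross-validated accuracy, until 2 features are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",  # switch to "backward" to start from all features
)
sfs.fit(X, y)

print(sfs.get_support())          # boolean mask over the 4 iris features
X_selected = sfs.transform(X)
print(X_selected.shape)           # (150, 2)
```

Note the wrapper-method cost in action: each candidate subset triggers a full cross-validated model fit, which is why these methods get expensive with many features.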
2. Filter Methods
Concept: These methods evaluate the quality of features based on their inherent characteristics (e.g., statistical scores) independently of any specific machine learning algorithm.
Process: Score and rank features individually (e.g., based on variance, correlation, mutual information) and then select the top k features.
Example: SelectKBest in scikit-learn allows selecting the top k features based on a specified scoring criterion.
Benefit: Computationally less expensive than wrapper methods, as no model training is involved during feature evaluation.
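A minimal SelectKBest sketch (the iris dataset and the ANOVA F-test scorer are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently (ANOVA F-test here) and keep the
# top k=2 — no classifier is trained during this evaluation.
selector = SelectKBest(score_func=f_classif, k=2)
X_top = selector.fit_transform(X, y)

print(selector.scores_)  # one statistical score per original feature
print(X_top.shape)       # (150, 2)
```

Other scorers (e.g., mutual_info_classif) can be swapped in via score_func without changing the rest of the pipeline.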
3. Embedded Methods
Concept: Feature selection is performed during the process of training the machine learning model itself. The model's internal structure or optimization process naturally selects or discards features.
Example Models/Techniques:
Decision Trees (and Forests):
Mechanism: Decision trees inherently perform feature selection by choosing the best features to split the data at each node. They construct decision rules (e.g., "if x1 < 10 and x2 > 2") that partition the feature space into subspaces, effectively separating classes.
Tree Representation: A decision tree visualizes these rules, guiding decisions down branches based on feature values until a classification (e.g., 'malignant' or 'benign') is reached.
Ensemble Methods (Forests): These combine multiple decision trees (e.g., Random Forests). Each tree might be trained on different subsets of features or data. The final prediction is made by aggregating the votes of individual trees, often leading to improved accuracy and robustness compared to a single tree.
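Both ideas can be sketched briefly (the iris dataset is an illustrative stand-in for the tumor example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single tree learns axis-aligned rules of the form "if feature < t ...".
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A forest aggregates the votes of many trees, each trained on random
# subsets of rows and features, which usually improves robustness.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ exposes the embedded feature selection: features
# chosen for good splits receive higher importance scores.
print(tree.feature_importances_)
print(forest.feature_importances_)
```

Features with importance near zero were rarely or never used for splits, which is the embedded selection the section describes.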
Regularization (L1/L2 Penalties):
Equation of a Straight Line: Recall the equation y = mx + b, where m is the slope (or weight/coefficient) and b is the intercept. In machine learning, this is often generalized to f(x) = w^Tx + b, where w represents the weights (like m) and b is the bias (like the intercept).
Impact of Large Weights: If the weights w in a model become too large, the decision boundary becomes very aggressive or steep, leading to overfitting on the training data and poor generalization to test data.
Regularization Mechanism: Regularization introduces a penalty term to the model's loss function that discourages large weights. For example, a term like lambda * sum_i(w_i^2), i.e., lambda * ||w||^2 (for L2 regularization), is added. This effectively constrains the weights from growing excessively large.
Parameters (C/Alpha): Parameters like C (the inverse of regularization strength) or alpha control the intensity of this penalty. A smaller C (or larger alpha) means stronger regularization, forcing weights to be smaller or even zero, thereby performing feature selection indirectly and preventing overfitting.
Practical Use: Crucial for models with many predictors to save memory and computational costs, as it can effectively reduce the number of features contributing significantly to the model.
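The effect of C on L1-regularized weights can be sketched as follows (the synthetic dataset and the specific C values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features but only 5 informative: L1 regularization should drive
# many of the uninformative weights exactly to zero.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

strong = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
weak = LogisticRegression(penalty="l1", solver="liblinear", C=10.0).fit(X, y)

# Smaller C => stronger penalty => more weights driven exactly to zero,
# which is the indirect feature selection described above.
print(np.sum(strong.coef_ == 0), np.sum(weak.coef_ == 0))
```

The zeroed coefficients correspond to features the regularized model has effectively discarded.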
Dimensionality Reduction
Concept: A family of techniques used to reduce the number of features (dimensions) in a dataset while retaining as much important information as possible. This naturally serves as a form of feature selection, especially for high-dimensional data like images (e.g., reducing a 20,000-feature image to 10 features).
Methods Covered:
Principal Component Analysis (PCA):
Nature: Unsupervised learning method.
Goal: Find new, orthogonal (uncorrelated) axes, called Principal Components (PCs), that capture the maximum variance in the data. The first PC captures the most variance, the second PC captures the second most, and so on.
Benefit: Identifies underlying interdependencies between features (e.g., cunning and courage might be linked). It transforms features from arbitrary scales (e.g., temper, wisdom, and courage measured on different ranges) into a common scale for PC1 and PC2.
Evaluation: The quality of PCA is assessed by how well it separates entities that are known to be distinct in the original, full-dimensional space.
Explained Variance Ratio: An important metric that quantifies the percentage of total data variance explained by each principal component (e.g., PC1 explains 50%, PC2 explains 30%, totaling 80% with just two dimensions). This indicates how much information is retained.
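A minimal PCA sketch showing the explained variance ratio (the iris dataset is an illustrative choice; standardizing first is an assumption that fits the "arbitrary scales" point above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # PCA is unsupervised: labels unused

# Standardize first so features on different scales contribute fairly.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Fraction of total variance captured by each principal component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # info retained by 2 PCs
```

If the summed ratio is high (say, above 0.8), two dimensions retain most of the dataset's variance, mirroring the PC1 + PC2 example in the notes.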
Linear Discriminant Analysis (LDA):
Nature: Supervised learning method; it requires y labels.
Goal: To find a projection of the data that maximizes the separation between class means while minimizing the variance within each class. Instead of just variance, it actively uses class labels to find discriminative dimensions.
Output: Produces new dimensions (like PC1, PC2) that emphasize class separation.
Usage: Often used for classification or as a supervised dimensionality reduction technique. Scikit-learn's LinearDiscriminantAnalysis supports both, and can be used directly as a classifier.
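A short LDA sketch covering both uses (the iris dataset is again an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA is supervised: it uses y to find directions separating the classes.
# With 3 classes there are at most n_classes - 1 = 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)

# The same fitted object also works as a classifier.
print(lda.score(X, y))
```

Unlike PCA, the projected axes here are chosen for class separation rather than raw variance, which is exactly the supervised/unsupervised contrast drawn above.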
Modes of Machine Learning: Training and Testing
Model building broadly operates in two crucial modes:
1. Training Mode
Training Set: Comprises the available examples (x_1 through x_n) and their corresponding labels (y_1 through y_n) for supervised problems.
Process: The model learns patterns and relationships from the training data by iteratively adjusting its internal parameters (e.g., weights w and bias b) to minimize a loss function.
Decision Boundary: During training, the model develops a decision boundary that differentiates between classes. This boundary can be linear (e.g., logistic regression), polynomial, or curved (e.g., decision trees, support vector machines).
2. Testing Mode
Testing Set: A separate subset of the data, X_test, that the model has not seen during training. It also has corresponding known y_test labels (for supervised problems).
Process: After training, the model's performance is evaluated on the testing set to assess its ability to generalize to unseen data. This provides an unbiased estimate of the model's accuracy and robustness.
Data Splitting: It is critical to split the dataset into training and testing sets (e.g., using train_test_split in scikit-learn) to prevent evaluating the model on data it has already learned, which would lead to an overoptimistic performance estimate.
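A minimal train_test_split sketch (the iris dataset and the 25% test fraction are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; the model never sees them
# during training. stratify=y keeps class proportions balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # e.g. (112, 4) and (38, 4)
```

Fixing random_state makes the split reproducible, which matters when comparing models on the same held-out data.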
Basic Machine Learning Model: Logistic Regression
Simplest Classifier: Often the first model introduced in machine learning, categorized as a linear binary or logistic classifier.
Equation: Uses a linear function f(x) = w^Tx + b, where w is a vector of weights and b is the bias term. The output is then passed through a sigmoid function to produce probabilities, which are then thresholded (e.g., positive values of f(x) map to one class, negative values to the other) to make a binary classification decision.
Supervised: It's a supervised model because it requires y labels to learn the relationship between features and classes.
Linear: It's termed linear because the decision boundary it creates is a straight line or a hyperplane (in higher dimensions). This implies no polynomial or quadratic terms (no exponents higher than 1) on the predictor variables (x) in the underlying function that determines the decision boundary. If terms like x^2 or x^3 were added, it would become a non-linear model.
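An end-to-end sketch tying the pieces together (the breast cancer dataset is an illustrative stand-in for the malignant/benign tumor example; max_iter=5000 is an assumption to ensure convergence):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary tumor classification, in the spirit of the malignant/benign example.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)  # training mode: learn w and b from labeled data

# predict_proba returns the sigmoid outputs; predict applies the
# threshold to produce the binary class decision.
print(clf.predict_proba(X_test[:1]))
print(clf.score(X_test, y_test))  # testing mode: accuracy on unseen data
```

The two calls mirror the two modes described above: fit() is training mode on (X_train, y_train), and score() is testing mode on the held-out (X_test, y_test).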