What is Feature Subset Selection (FSS)?
A process to select a subset of features from a given feature set to optimize an objective function. Unlike feature extraction which transforms features, FSS selects existing features without transformation.
What is the main difference between feature extraction and feature selection?
Feature extraction (e.g., PCA) transforms existing features into a lower-dimensional space, creating new features. Feature selection chooses a subset of the existing features without any transformation, maintaining the original features and their interpretability.
Why is feature selection preferred over feature extraction for extracting meaningful rules?
When you transform original features into new ones (like in PCA), the measurement units and interpretability of features are lost. Feature selection keeps original features, making it easier to extract meaningful, interpretable rules from data mining models.
How many possible feature subsets exist for n features?
2^n - 1 possible subsets (excluding the empty set). For n = 100 features, this is 2^100 - 1 = 1,267,650,600,228,229,401,496,703,205,375 (about 1.27 × 10^30) possible subsets, making exhaustive search computationally impossible.
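As a quick check of the arithmetic, here is a minimal Python sketch that brute-force enumerates subsets for a small n and evaluates 2^n - 1 for larger n:

```python
from itertools import combinations

def count_nonempty_subsets(n):
    """Number of non-empty subsets of an n-feature set: 2^n - 1."""
    return 2 ** n - 1

# Cross-check the closed form by brute-force enumeration for a small n.
n_small = 5
enumerated = sum(1 for k in range(1, n_small + 1)
                 for _ in combinations(range(n_small), k))
assert enumerated == count_nonempty_subsets(n_small)   # 31 subsets

# The count explodes far beyond anything searchable exhaustively.
for n in (10, 20, 100):
    print(n, count_nonempty_subsets(n))
# n=100 -> 1267650600228229401496703205375 (about 1.27e30)
```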
What are the three main categories of Feature Selection methods?
What is the Variance Filter in unsupervised feature selection?
A method that removes features with a high percentage of identical values across all objects. Low variance indicates a (near-)constant feature that carries little useful information for prediction.
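A minimal sketch of a variance filter, assuming scikit-learn's VarianceThreshold and a made-up 4×4 matrix; the 0.05 cutoff is an arbitrary illustration, not a recommended value:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the 2nd column is constant, the 3rd is nearly constant.
X = np.array([
    [1.0, 5.0, 0.0, 2.3],
    [2.0, 5.0, 0.0, 1.1],
    [3.0, 5.0, 0.0, 4.8],
    [4.0, 5.0, 1.0, 0.7],
])

# Drop every feature whose variance is below the chosen threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(selector.variances_)      # per-feature variances
print(selector.get_support())   # [ True False  True  True] -> constant column dropped
print(X_reduced.shape)          # (4, 3)
```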
What is the Correlation Filter in unsupervised feature selection?
A method that removes highly correlated input numerical features. When features are highly correlated (e.g., Feature 2 = 2 × Feature 1), they contain redundant information. Note: correlation can only detect linear relationships between features.
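One possible pandas implementation of the correlation filter on synthetic data (the feature names f1-f3 and the 0.9 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
df = pd.DataFrame({
    "f1": f1,
    "f2": 2 * f1 + rng.normal(scale=0.01, size=200),  # ~perfectly correlated with f1
    "f3": rng.normal(size=200),                        # independent
})

threshold = 0.9                      # practitioner-chosen cutoff
corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)                       # ['f2'] -- redundant copy of f1
df_reduced = df.drop(columns=to_drop)
```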
What does Mutual Information measure in feature selection?
Mutual Information measures the amount of information that two categorical features share using entropies. It can detect any kind of relationship between features (not just linear like correlation). Formula: I(X;Y) = H(X) + H(Y) - H(X,Y), where H represents entropy.
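A small, self-contained sketch of the entropy-based formula on hypothetical categorical data (the column values are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum p * log2(p) over the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X, Y) for two categorical variables."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical categorical feature/target pairs.
color  = ["red", "red", "blue", "blue", "red", "red", "blue", "blue"]
bought = ["yes", "yes", "no",  "no",  "yes", "yes", "no",  "no"]   # fully determined by color
noise  = ["a",   "b",   "a",   "b",   "a",   "b",   "a",   "b"]    # independent of bought

print(mutual_information(color, bought))   # 1.0 bit: color fully predicts bought
print(mutual_information(noise, bought))   # 0.0 bits: no shared information
```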
What is entropy in the context of feature selection?
Entropy measures the level of uncertainty, randomness, or disorder in a dataset. Formula: H(S) = -Σ p(xi) × log₂(p(xi)), where p(xi) is the probability of element xi (base-2 logs give entropy in bits). Higher entropy means more uncertainty/impurity in the set.
Calculate the entropy of a set with 6 red balls and 6 green balls out of 12 total.
p(red) = 6/12 = 0.5, p(green) = 6/12 = 0.5. H(S) = -(0.5×log₂(0.5) + 0.5×log₂(0.5)) = -(-0.5 - 0.5) = 1. This represents maximum uncertainty (perfectly balanced).
Calculate the entropy of a set with 3 red balls and 9 green balls out of 12 total.
p(red) = 3/12 = 0.25, p(green) = 9/12 = 0.75. H(S) = -(0.25×log₂(0.25) + 0.75×log₂(0.75)) = 0.811. Lower than 1 because there's less uncertainty (more green balls).
Calculate the entropy of a set with 0 red balls and 12 green balls.
p(red) = 0, p(green) = 1. H(S) = -(0 + 1×log₂(1)) = 0, with 0×log₂(0) taken as 0. Zero entropy means no uncertainty - we know for certain every ball is green (pure set).
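The three ball examples above can be reproduced with a short entropy helper (base-2 logs, counts as in the cards):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts; 0*log(0) treated as 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

print(entropy([6, 6]))            # 1.0   -> maximum uncertainty (50/50)
print(round(entropy([3, 9]), 3))  # 0.811 -> moderate uncertainty (25/75)
print(entropy([0, 12]))           # 0.0   -> pure set, no uncertainty
```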
What is Information Gain in filter methods?
A measure of how much information is gained about the target categorical variable when a dataset is split on a given categorical feature. Formula: IG(Feature) = H(Root) - Σ p(value) × H(value). Higher information gain means the feature is more useful for prediction.
How do you calculate Information Gain? Give the formula.
IG(Feature) = H(Root node) - Σ [p(Feature = value) × H(value)], where H represents entropy. You subtract the weighted average entropy after the split from the entropy before the split.
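A sketch of the information-gain formula on an invented outlook/play toy dataset (names and values are hypothetical, not from the source):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """IG = H(target) - sum over feature values of p(value) * H(target | value)."""
    n = len(target)
    ig = entropy(target)
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        ig -= (len(subset) / n) * entropy(subset)
    return ig

# Hypothetical example: 'outlook' partially predicts 'play', 'day_parity' does not.
outlook    = ["sunny", "sunny", "rain", "rain", "sunny", "rain", "sunny", "rain"]
play       = ["no",    "no",    "yes",  "yes",  "no",    "yes",  "yes",   "no"]
day_parity = ["odd",   "even",  "odd",  "even", "odd",   "even", "odd",   "even"]

print(round(information_gain(outlook, play), 3))     # ≈ 0.189: informative split
print(round(information_gain(day_parity, play), 3))  # 0.0: useless split
```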
What does an Information Gain of 1 indicate?
An Information Gain of 1 indicates a perfect predictor - the feature completely determines the target variable, resulting in pure sets after splitting (entropy = 0 for all subsets).
What is the Chi-Square test in feature selection?
A statistical test that measures the association or dependency between a categorical feature and a categorical target variable by comparing observed (O) and expected frequencies (E). Formula: χ² = Σ [(O - E)² / E].
How do you calculate expected frequencies in Chi-Square test?
Expected frequency for row i and column j: E_ij = (row total i × column total j) / N, where N is the total number of observations. This represents what we'd expect if the feature and target were independent.
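A sketch with made-up counts, computing expected frequencies and the χ² statistic both by hand and with SciPy's chi2_contingency (correction=False matches the plain formula on a 2×2 table):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts
# (rows: feature values, columns: target classes).
observed = np.array([
    [30, 10],
    [20, 40],
])

# Expected counts under independence: E_ij = row_total_i * col_total_j / N.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_by_hand = ((observed - expected) ** 2 / expected).sum()
print(expected)       # [[20. 20.], [30. 30.]]
print(chi2_by_hand)   # ≈ 16.67

# Same computation via SciPy.
chi2_stat, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)
```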
In Chi-Square test, what does a large χ² value indicate?
A large χ² value indicates that observed counts deviate significantly from expected counts, meaning the feature is dependent on (related to) the target variable. Therefore, we should keep this feature for prediction.
What is the decision rule for Chi-Square test? Given χ² = 4.102, df = 1, α = 0.05, critical value = 3.84.
Since χ² (4.102) > critical value (3.84), we reject the null hypothesis H0 (that the two variables are independent). This means there IS a relationship between the feature and target, so we keep the feature.
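The same decision rule can be checked with SciPy, which recovers the 3.84 critical value quoted in the card:

```python
from scipy.stats import chi2

chi2_statistic = 4.102   # value from the card's gender/pet example
df, alpha = 1, 0.05

critical_value = chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))   # 3.841

if chi2_statistic > critical_value:
    print("Reject H0: feature and target are dependent -> keep the feature")
else:
    print("Fail to reject H0: no evidence of dependence -> candidate to drop")
```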
What are the advantages of Filter methods for feature selection?
What are the disadvantages of Filter methods for feature selection?
What is the recommended strategy for using Filter methods?
Use Filter methods as an initial step, especially for large datasets, then combine with Wrapper methods (hybridization). This leverages the computational efficiency of filters and the model-aware optimization of wrappers.
What is the goal of Wrapper methods for feature selection?
To evaluate features by using a data mining model as a black box. The model is trained on different subsets of features, and the subset that produces the best model performance is selected.
Describe the process of Wrapper methods for feature selection.
What is Forward Selection in wrapper methods?
Start with 0 features, then add features step by step, choosing at each iteration the feature that improves the model the most. Stop when a given number of features is reached, when performance hits a threshold, or when the elbow method indicates diminishing returns.
What is Backward Elimination in wrapper methods?
Start with all features, then remove features step by step, choosing at each iteration the feature whose removal does not decrease model performance (or decreases it least). The stopping criteria are the same as for forward selection.
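One way to run both greedy searches is scikit-learn's SequentialFeatureSelector; this sketch assumes the built-in breast-cancer dataset and a logistic-regression wrapper model, and fixes the subset size at 5 purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Forward selection: start empty, greedily add the feature that helps most.
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=5)
forward.fit(X, y)
print("forward keeps:", forward.get_support().nonzero()[0])

# Backward elimination: start with all features, greedily drop the least useful.
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=5)
backward.fit(X, y)
print("backward keeps:", backward.get_support().nonzero()[0])
```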
What are the three main stopping rules for Forward Selection and Backward Elimination?
Stop when a predefined number of features is reached, when model performance hits a chosen threshold (or stops improving meaningfully), or when the elbow method indicates diminishing returns from adding/removing further features.
What is the Elbow Method in feature selection?
A visual method to find the point where adding more features gives diminishing returns. Plot number of features vs. performance - the "elbow" is where the curve starts to flatten, representing the optimal trade-off between performance and complexity.
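A rough way to draw such a curve: the sketch below scores progressively larger feature sets (ranked here by a fast filter score rather than a full wrapper search, to keep it cheap) and plots accuracy against the number of features:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
ks = range(1, X.shape[1] + 1)
scores = []
for k in ks:
    # Keep the k best-ranked features, then score the model by cross-validation.
    pipe = make_pipeline(SelectKBest(f_classif, k=k),
                         StandardScaler(),
                         LogisticRegression(max_iter=5000))
    scores.append(cross_val_score(pipe, X, y, cv=5).mean())

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of features kept")
plt.ylabel("cross-validated accuracy")
plt.title("Look for the 'elbow' where the curve flattens")
plt.show()
```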
In Genetic Algorithm for feature selection, what does a chromosome represent?
A chromosome is a binary vector representing a feature subset. Each position corresponds to a feature: 1 means the feature is selected, 0 means it's not. Example: [1,0,1,1,0,0] means features 1, 3, and 4 are selected.
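Decoding a chromosome is just boolean masking; the feature names below are invented for illustration:

```python
import numpy as np

feature_names = np.array(["age", "income", "height", "weight", "city", "score"])
chromosome = np.array([1, 0, 1, 1, 0, 0])   # the card's example encoding

selected = feature_names[chromosome == 1]
print(selected)          # ['age' 'height' 'weight']  -> features 1, 3 and 4

# Applying the same mask to a data matrix keeps only the encoded subset of columns.
X = np.arange(24).reshape(4, 6)      # 4 objects x 6 features (dummy values)
X_subset = X[:, chromosome == 1]
print(X_subset.shape)                # (4, 3)
```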
What are the three main genetic operators in GA for feature selection?
Selection (keep the fittest chromosomes, i.e., the feature subsets yielding the best model performance), crossover (combine parts of two parent chromosomes to create new subsets), and mutation (randomly flip bits to add or drop individual features, preserving diversity in the search).
Describe the complete GA process for feature selection.
What are the advantages of Wrapper methods for feature selection?
What are the disadvantages of Wrapper methods for feature selection?
When should you use Filter methods vs Wrapper methods based on dataset size?
Filter methods work well for LARGE datasets because they're computationally efficient (no model training). Wrapper methods work well for SMALL datasets because they find optimal subsets but are computationally expensive.
When should you use Filter methods vs Wrapper methods based on computational budget?
Use Filter methods when computational resources are limited (they are FAST - no model training). Use Wrapper methods when you have sufficient computational budget (they are SLOW - train models repeatedly).
When should you use Filter methods vs Wrapper methods based on interpretability needs?
Use Filter methods when interpretability is important - they allow inspection of feature importance through statistical metrics. Use Wrapper methods when predictive performance is priority - they act as black box focused on accuracy.
When should you use Unsupervised vs Supervised feature selection methods?
Use Unsupervised methods when you don't have labels or want to reduce redundancy without considering target. Use Supervised methods when you have domain labels and want to select features relevant to predicting the target variable.
Compare Filter and Wrapper methods on algorithm fit.
Filter methods are independent of any algorithm - results are generalizable across different models. Wrapper methods are tailored to a specific data mining algorithm - optimized for that algorithm but may not transfer well to others.
Compare Filter and Wrapper methods on handling feature interactions.
Wrapper methods handle feature interactions well - they evaluate features together as subsets, capturing synergies. Filter methods assess features independently - may miss important interactions between features that work well together.
What is the key principle for choosing a feature selection method?
No one-size-fits-all method exists. Choose based on: dataset size/dimensionality, computational budget, interpretability vs performance needs, domain knowledge, chosen algorithm, and feature relationships. Hybrid approaches often work best.
Why might you discretize data for feature selection even though it can work with non-numerical data?
Discretizing or encoding data lets transformation-based methods such as PCA process it, but the transformation may lose important relationships between categories. Feature selection keeps the original categorical features and their natural meaning, so extracted rules stay interpretable without the information loss that comes with transformation.
What is the difference between "highly correlated" thresholds in practice?
"Highly" is practitioner-defined, typically Pearson correlation > 0.9 or 0.95. You stop removing features when remaining correlations fall below your chosen threshold. The exact value depends on how aggressively you want to reduce redundancy versus retain information.
How does Mutual Information differ from Correlation for feature selection?
Correlation (Pearson) only detects LINEAR relationships between features and is limited to numerical data. Mutual Information can capture ANY kind of relationship (including nonlinear) and works with categorical variables, making it more versatile.
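A small numpy demonstration of the difference, using y = x² (a deterministic but nonlinear relationship) and a simple binned MI estimate (the 20-bin histogram estimator is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=5000)
y = x ** 2                         # perfectly determined by x, but not linearly

# Pearson correlation misses the (nonlinear) relationship almost entirely.
print(round(np.corrcoef(x, y)[0, 1], 3))   # ≈ 0

# A discretized mutual-information estimate picks it up.
def mi_binned(a, b, bins=20):
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

print(round(mi_binned(x, y), 2))   # clearly > 0: MI detects the dependence
```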
In supervised correlation filtering, what is the complete process?
What does a contingency table show in Chi-Square test?
A contingency table shows the observed frequencies (counts) of combinations between a categorical feature and categorical target. Rows represent feature values, columns represent target classes, and cells contain counts of samples with each combination.
Why do we use absolute value when ranking features by correlation with target?
Because both strong positive correlation and strong negative correlation indicate relevance to the target. A correlation of -0.8 is just as informative as +0.8 for prediction purposes - the sign shows direction but magnitude shows strength of relationship.
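A pandas sketch of ranking by absolute correlation on synthetic features (names and noise levels are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
target = rng.normal(size=n)
df = pd.DataFrame({
    "pos_feature":   target + rng.normal(scale=0.5, size=n),   # positively correlated
    "neg_feature":  -target + rng.normal(scale=0.5, size=n),   # negatively correlated
    "noise_feature": rng.normal(size=n),                       # unrelated
})

# Rank by |correlation| so strong negative predictors are not discarded.
ranking = df.corrwith(pd.Series(target)).abs().sort_values(ascending=False)
print(ranking)   # pos_feature and neg_feature rank high, noise_feature last
```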
In the Chi-Square example with gender and pet preference, what was the conclusion?
With χ² = 4.102, df = 1, and critical value = 3.84 at α = 0.05, we reject H0 (independence). This means there IS a significant relationship between gender and pet preference, so gender should be kept as a predictive feature.
How do you interpret different entropy values: H(S) = 1, H(S) = 0.811, H(S) = 0?
H(S) = 1: Maximum uncertainty (perfectly balanced classes, like 50-50 split). H(S) = 0.811: Moderate uncertainty (unbalanced but mixed, like 25-75 split). H(S) = 0: No uncertainty (pure set, only one class present, like 0-100 split).
What makes feature selection challenging compared to other dimensionality reduction?
The search space grows exponentially - with n features, there are 2^n possible subsets. For 100 features, that's over 10^30 combinations. This makes exhaustive search impossible, requiring smart search strategies like forward/backward selection or genetic algorithms.
Why can't you always use PCA instead of feature selection for dimensionality reduction?
PCA creates transformed features that: 1) Are hard to interpret (lose original meaning), 2) Don't work directly with categorical features, 3) Make it impossible to extract meaningful rules using original variables, 4) Don't allow you to know which original features matter most.