Dimensionality Reduction 2: Feature Selection (Data Mining)

What is Feature Subset Selection (FSS)?

A process to select a subset of features from a given feature set to optimize an objective function. Unlike feature extraction which transforms features, FSS selects existing features without transformation.

What is the main difference between feature extraction and feature selection?

Feature extraction (e.g., PCA) transforms existing features into a lower dimensional space creating new features. Feature selection chooses a subset of existing features without any transformation, maintaining original features and their interpretability.

Why is feature selection preferred over feature extraction for extracting meaningful rules?

When you transform original features into new ones (like in PCA), the measurement units and interpretability of features are lost. Feature selection keeps original features, making it easier to extract meaningful, interpretable rules from data mining models.

How many possible feature subsets exist for n features?

2^n - 1 possible subsets (excluding the empty set). For n = 100 features, this is 2^100 - 1 ≈ 1.27 × 10^30 subsets, making exhaustive search computationally impossible.

What are the three main categories of Feature Selection methods?

1. Unsupervised methods (use statistical measures without a target variable).
2. Supervised Filter methods (use statistical measures with a target variable).
3. Supervised Wrapper methods (use model performance to evaluate subsets).

What is the Variance Filter in unsupervised feature selection?

A method that removes features with a high percentage of identical values across all objects. Low variance indicates a nearly constant feature, which carries little useful information for prediction.
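
A minimal sketch of this filter (assuming scikit-learn is available; the threshold value is illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the third column is constant and carries no information.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 12.0, 5.0],
              [3.0,  9.0, 5.0],
              [4.0, 11.0, 5.0]])

# Drop features whose variance falls below a chosen threshold.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # [ True  True False] -> constant column dropped
print(X_reduced.shape)         # (4, 2)
```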

What is the Correlation Filter in unsupervised feature selection?

A method that removes highly correlated input numerical features. When features are highly correlated (e.g., Feature 2 = 2 × Feature 1), they contain redundant information. Note: correlation can only detect linear relationships between features.
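
A minimal sketch of this idea with pandas (the feature names and the 0.9 cutoff are illustrative): compute the absolute correlation matrix, inspect its upper triangle so each pair is checked once, and drop one feature from every highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
df = pd.DataFrame({
    "f1": f1,
    "f2": 2 * f1,                # Feature 2 = 2 x Feature 1 -> redundant
    "f3": rng.normal(size=200),  # unrelated feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # practitioner-chosen cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)                      # ['f2']
print(df_reduced.columns.tolist())  # ['f1', 'f3']
```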

What does Mutual Information measure in feature selection?

Mutual Information measures the amount of information two categorical variables share, computed from their entropies. It can detect any kind of relationship between features (not just linear relationships, unlike correlation). Formula: I(X;Y) = H(X) + H(Y) - H(X,Y), where H denotes entropy.
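
A small NumPy sketch of this formula for two categorical variables (the helper names and toy values are illustrative); each entropy is estimated from observed frequencies:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy (base 2) of a sequence of categorical labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from observed frequencies."""
    joint = [f"{a}|{b}" for a, b in zip(x, y)]  # paired values give H(X,Y)
    return entropy(x) + entropy(y) - entropy(joint)

x = ["red", "red", "green", "green", "red", "green"]
y = ["yes", "yes", "no", "no", "yes", "no"]  # y is fully determined by x
print(mutual_information(x, y))              # 1.0 bit of shared information
```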

What is entropy in the context of feature selection?

Entropy measures the level of uncertainty, randomness, or disorder in a dataset. Formula: H(S) = -Σ p(xi) × log2(p(xi)), where p(xi) is the probability of element xi and the base-2 logarithm gives entropy in bits. Higher entropy means more uncertainty/impurity in the set.
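
The formula is easy to check in a few lines of Python (a sketch; the class counts are the ball examples from the next cards):

```python
import numpy as np

def entropy(counts):
    """H(S) = -sum p_i * log2(p_i), computed from class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]              # convention: 0 * log2(0) = 0
    if p.size == 1:           # pure set -> zero entropy
        return 0.0
    return -np.sum(p * np.log2(p))

print(entropy([6, 6]))    # 1.0     -> maximum uncertainty (50/50 split)
print(entropy([3, 9]))    # ~0.811  -> less uncertainty
print(entropy([0, 12]))   # 0.0     -> pure set, no uncertainty
```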

Calculate the entropy of a set with 6 red balls and 6 green balls out of 12 total.

p(red) = 6/12 = 0.5, p(green) = 6/12 = 0.5. H(S) = -(0.5×log(0.5) + 0.5×log(0.5)) = -(-0.5-0.5) = 1. This represents maximum uncertainty (perfectly balanced).

Calculate the entropy of a set with 3 red balls and 9 green balls out of 12 total.

p(red) = 3/12 = 0.25, p(green) = 9/12 = 0.75. H(S) = -(0.25×log(0.25) + 0.75×log(0.75)) = 0.811. Lower than 1 because there's less uncertainty (more green balls).

Calculate the entropy of a set with 0 red balls and 12 green balls.

p(red) = 0, p(green) = 1. H(S) = -(0 + 1×log(1)) = 0. Zero entropy means no uncertainty - we know for certain every ball is green (pure set).

What is Information Gain in filter methods?

A measure of how much information is gained about the target categorical variable when a dataset is split on a given categorical feature. Formula: IG(Feature) = H(Root) - Σ p(value) × H(value). Higher information gain means the feature is more useful for prediction.

How do you calculate Information Gain? Give the formula.

IG(Feature) = H(Root node) - Σ [p(Feature = value) × H(value)], where H represents entropy. You subtract the weighted average entropy after the split from the entropy before the split.
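
A minimal sketch of this calculation (the toy weather-style feature and target below are illustrative, not from the lecture):

```python
import numpy as np

def entropy(labels):
    """Base-2 entropy of a sequence of categorical labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    """IG = H(root) - sum over values v of p(v) * H(target | feature == v)."""
    feature, target = np.asarray(feature), np.asarray(target)
    weighted = sum(
        np.mean(feature == v) * entropy(target[feature == v])
        for v in np.unique(feature)
    )
    return entropy(target) - weighted

feature = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
target  = ["no",    "no",    "yes",  "yes",  "yes",  "no"]
print(information_gain(feature, target))   # 1.0 -> a perfect predictor
```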

What does an Information Gain of 1 indicate?

An Information Gain of 1 indicates a perfect predictor - the feature completely determines the target variable, resulting in pure sets after splitting (entropy = 0 for all subsets).

What is the Chi-Square test in feature selection?

A statistical test that measures the association or dependency between a categorical feature and a categorical target variable by comparing observed (O) and expected frequencies (E). Formula: χ² = Σ [(O - E)² / E].

How do you calculate expected frequencies in Chi-Square test?

Expected frequency for row i and column j: E_ij = (row total i × column total j) / N, where N is the total number of observations. This represents what we'd expect if the feature and target were independent.
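
A sketch with a hypothetical 2x2 contingency table (the counts are made up for illustration). scipy computes the statistic, p-value, degrees of freedom, and the expected-frequency table exactly as defined above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature values (e.g. male/female); columns: target classes (e.g. cat/dog).
observed = np.array([[20, 30],
                     [35, 15]])

# correction=False keeps the plain chi-square formula (no Yates continuity correction).
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Each expected cell is (row total * column total) / N.
print(expected)                 # [[27.5 22.5]
                                #  [27.5 22.5]]
print(chi2_stat, dof, p_value)
```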

In Chi-Square test, what does a large χ² value indicate?

A large χ² value indicates that observed counts deviate significantly from expected counts, meaning the feature is dependent on (related to) the target variable. Therefore, we should keep this feature for prediction.

What is the decision rule for Chi-Square test? Given χ² = 4.102, df = 1, α = 0.05, critical value = 3.84.

Since χ² (4.102) > critical value (3.84), we reject the null hypothesis H0 (that the two variables are independent). This means there IS a relationship between the feature and target, so we keep the feature.
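
The critical value quoted here can be reproduced with scipy (a small sketch using this card's numbers):

```python
from scipy.stats import chi2

alpha, df = 0.05, 1
critical_value = chi2.ppf(1 - alpha, df)   # about 3.841
chi2_statistic = 4.102                     # value from the example

if chi2_statistic > critical_value:
    print("Reject H0: feature and target are dependent -> keep the feature")
else:
    print("Fail to reject H0: no evidence of dependence")
```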

What are the advantages of Filter methods for feature selection?

1. Computational Efficiency - no model training required; scales well to large datasets.
2. Algorithm Independence - unbiased toward any particular algorithm; results are generalizable.

What are the disadvantages of Filter methods for feature selection?

1. Suboptimal Selection - may not identify the best subset because it doesn't consider model performance.
2. Complex Relationships - may not handle complex interactions between features, since features are evaluated independently.

What is the recommended strategy for using Filter methods?

Use Filter methods as an initial step, especially for large datasets, then combine with Wrapper methods (hybridization). This leverages the computational efficiency of filters and the model-aware optimization of wrappers.

What is the goal of Wrapper methods for feature selection?

To evaluate features by using a data mining model as a black box. The model is trained on different subsets of features, and the subset that produces the best model performance is selected.

Describe the process of Wrapper methods for feature selection.

1. Feature Subset Generation - a search algorithm generates various subsets.
2. Model Evaluation - the model is trained on each subset and its performance is evaluated.
3. Selection Criterion - a metric (e.g., accuracy) guides the choice of the best subset.
4. Iteration - repeat until the optimal set is found.

What is Forward Selection in wrapper methods?

Start with 0 features, then add features step by step, choosing the feature that improves the model the most at each iteration. Stop when: reaching a given number of features, hitting a performance threshold, or using elbow method.
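
A greedy sketch of forward selection (assuming scikit-learn; the dataset, classifier, and the fixed stopping rule of k = 3 features are all illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

selected, remaining = [], list(range(X.shape[1]))
k = 3  # stopping rule: a fixed number of features

while len(selected) < k:
    # Try adding each remaining feature; keep the one that improves CV accuracy most.
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy = {scores[best]:.3f}")

print("selected feature indices:", selected)
```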

What is Backward Elimination in wrapper methods?

Start with all features, then remove features step by step, choosing the feature whose removal does not decrease model performance (or decreases it least) at each iteration. Stop using same criteria as forward selection.
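
scikit-learn also ships a built-in sequential selector that supports both directions (available in newer versions); a sketch of backward elimination with it (the estimator and target feature count are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Start from all 30 features and greedily drop the ones whose removal
# hurts cross-validated performance the least, until 5 remain.
sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,
    direction="backward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the surviving features
```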

What are the three main stopping rules for Forward Selection and Backward Elimination?

1. Reach a given number of features (fixed target).
2. Performance threshold reached (desired accuracy achieved).
3. Elbow method (when performance improvement becomes minimal/plateaus).

What is the Elbow Method in feature selection?

A visual method to find the point where adding more features gives diminishing returns. Plot number of features vs. performance - the "elbow" is where the curve starts to flatten, representing the optimal trade-off between performance and complexity.
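
A sketch of how the elbow plot is typically drawn (the accuracy values below are hypothetical; in practice they would come from forward-selection runs like the one sketched above):

```python
import matplotlib.pyplot as plt

# Hypothetical CV accuracy after adding the 1st, 2nd, ... feature.
n_features = range(1, 11)
cv_accuracy = [0.72, 0.81, 0.87, 0.90, 0.915, 0.920, 0.922, 0.923, 0.923, 0.924]

plt.plot(n_features, cv_accuracy, marker="o")
plt.xlabel("Number of selected features")
plt.ylabel("Cross-validated accuracy")
plt.title("Elbow: gains flatten after about 4-5 features")
plt.show()
```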

In Genetic Algorithm for feature selection, what does a chromosome represent?

A chromosome is a binary vector representing a feature subset. Each position corresponds to a feature: 1 means the feature is selected, 0 means it's not. Example: [1,0,1,1,0,0] means features 1, 3, and 4 are selected.
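
In code, decoding a chromosome is just boolean column indexing (a tiny sketch with toy data):

```python
import numpy as np

chromosome = np.array([1, 0, 1, 1, 0, 0], dtype=bool)
X = np.arange(24).reshape(4, 6)  # toy data: 4 samples, 6 features
X_subset = X[:, chromosome]      # keeps features 1, 3 and 4 (columns 0, 2, 3)
print(np.where(chromosome)[0])   # [0 2 3]
print(X_subset.shape)            # (4, 3)
```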

What are the three main genetic operators in GA for feature selection?

1. Crossover - combine two parent chromosomes to produce offspring (new feature subsets).
2. Mutation - randomly flip bits to introduce variation and exploration.
3. Selection - select the fittest chromosomes (best performing subsets) for reproduction.

Describe the complete GA process for feature selection.

1. Initialization (random population of subsets).
2. Fitness Evaluation (train model, measure performance).
3. Selection (choose best performers).
4. Crossover (combine parents).
5. Mutation (random changes).
6. Replacement (new generation).
7. Repeat until stopping criteria met.
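
A compact sketch of this loop (assuming scikit-learn for the fitness model; the population size, rates, and number of generations are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]
model = DecisionTreeClassifier(random_state=0)

def fitness(chromosome):
    """Fitness = CV accuracy of the model trained on the selected columns."""
    if not chromosome.any():      # an empty subset is invalid
        return 0.0
    return cross_val_score(model, X[:, chromosome], y, cv=3).mean()

# 1. Initialization: a random population of feature subsets.
pop_size, n_generations, mutation_rate = 10, 5, 0.05
population = rng.random((pop_size, n_features)) < 0.5

for gen in range(n_generations):
    # 2. Fitness evaluation.
    scores = np.array([fitness(ind) for ind in population])
    # 3. Selection: keep the best half as parents.
    parents = population[np.argsort(scores)[-pop_size // 2:]]
    # 4./5. Crossover and mutation to build the offspring.
    children = []
    while len(children) < pop_size - len(parents):
        p1, p2 = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)               # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < mutation_rate   # random bit flips
        children.append(np.where(flip, ~child, child))
    # 6. Replacement: parents plus children form the next generation.
    population = np.vstack([parents, children])
    print(f"generation {gen}: best CV accuracy = {scores.max():.3f}")

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected feature indices:", np.where(best)[0])
```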

What are the advantages of Wrapper methods for feature selection?

1. Optimal Subset Selection - can detect the ideal feature subset for a specific algorithm.
2. Managing Complex Relationships - effective at handling feature interactions by evaluating subsets together, not individual features.

What are the disadvantages of Wrapper methods for feature selection?

1. Computational Intensity - computationally demanding, especially with large datasets (must train the model many times).
2. Algorithm Bias - may exhibit bias toward the specific algorithm used for feature evaluation.

When should you use Filter methods vs Wrapper methods based on dataset size?

Filter methods work well for LARGE datasets because they're computationally efficient (no model training). Wrapper methods work well for SMALL datasets because they find optimal subsets but are computationally expensive.

When should you use Filter methods vs Wrapper methods based on computational budget?

Use Filter methods when computational resources are limited (they are FAST - no model training). Use Wrapper methods when you have sufficient computational budget (they are SLOW - train models repeatedly).

When should you use Filter methods vs Wrapper methods based on interpretability needs?

Use Filter methods when interpretability is important - they allow inspection of feature importance through statistical metrics. Use Wrapper methods when predictive performance is priority - they act as black box focused on accuracy.

When should you use Unsupervised vs Supervised feature selection methods?

Use Unsupervised methods when you don't have labels or want to reduce redundancy without considering target. Use Supervised methods when you have domain labels and want to select features relevant to predicting the target variable.

Compare Filter and Wrapper methods on algorithm fit.

Filter methods are independent of any algorithm - results are generalizable across different models. Wrapper methods are tailored to a specific data mining algorithm - optimized for that algorithm but may not transfer well to others.

Compare Filter and Wrapper methods on handling feature interactions.

Wrapper methods handle feature interactions well - they evaluate features together as subsets, capturing synergies. Filter methods assess features independently - may miss important interactions between features that work well together.

What is the key principle for choosing a feature selection method?

No one-size-fits-all method exists. Choose based on: dataset size/dimensionality, computational budget, interpretability vs performance needs, domain knowledge, chosen algorithm, and feature relationships. Hybrid approaches often work best.

Why might you discretize data for feature selection even though it can work with non-numerical data?

While discretization allows PCA-like methods to work with categorical data, it may lose important relationships between categories. Feature selection preserves original categorical features and their natural meaning, making rules more interpretable without information loss from transformation.

What is the difference between "highly correlated" thresholds in practice?

"Highly" is practitioner-defined, typically Pearson correlation > 0.9 or 0.95. You stop removing features when remaining correlations fall below your chosen threshold. The exact value depends on how aggressively you want to reduce redundancy versus retain information.

How does Mutual Information differ from Correlation for feature selection?

Correlation (Pearson) only detects LINEAR relationships between features and is limited to numerical data. Mutual Information can capture ANY kind of relationship (including nonlinear) and works with categorical variables, making it more versatile.

In supervised correlation filtering, what is the complete process?

1. Compute the correlation of each feature with the target (Pearson for numerical, point biserial/ANOVA for mixed, Chi-square/MI for categorical).
2. Rank features by absolute correlation.
3. Set a threshold or select the top-k features.
4. Optionally remove highly correlated features among themselves to reduce redundancy.
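
A sketch of steps 1-3 for numerical features and a numerical target, using pandas (the column names, toy target, and value of k are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * df["f0"] - 2 * df["f3"] + rng.normal(scale=0.5, size=200)  # toy target

# 1-2. Correlation of each feature with the target, ranked by absolute value.
ranking = df.corrwith(y).abs().sort_values(ascending=False)
print(ranking)

# 3. Keep the top-k features (k is a practitioner choice).
k = 2
selected = ranking.head(k).index.tolist()
print(selected)   # ['f0', 'f3'] for this toy target
```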

What does a contingency table show in Chi-Square test?

A contingency table shows the observed frequencies (counts) of combinations between a categorical feature and categorical target. Rows represent feature values, columns represent target classes, and cells contain counts of samples with each combination.

Why do we use absolute value when ranking features by correlation with target?

Because both strong positive correlation and strong negative correlation indicate relevance to the target. A correlation of -0.8 is just as informative as +0.8 for prediction purposes - the sign shows direction but magnitude shows strength of relationship.

In the Chi-Square example with gender and pet preference, what was the conclusion?

With χ² = 4.102, df = 1, and critical value = 3.84 at α = 0.05, we reject H0 (independence). This means there IS a significant relationship between gender and pet preference, so gender should be kept as a predictive feature.

How do you interpret different entropy values: H(S) = 1, H(S) = 0.811, H(S) = 0?

H(S) = 1: Maximum uncertainty (perfectly balanced classes, like 50-50 split). H(S) = 0.811: Moderate uncertainty (unbalanced but mixed, like 25-75 split). H(S) = 0: No uncertainty (pure set, only one class present, like 0-100 split).

What makes feature selection challenging compared to other dimensionality reduction?

The search space grows exponentially - with n features, there are 2^n possible subsets. For 100 features, that's over 10^30 combinations. This makes exhaustive search impossible, requiring smart search strategies like forward/backward selection or genetic algorithms.

Why can't you always use PCA instead of feature selection for dimensionality reduction?

PCA creates transformed features that: 1) Are hard to interpret (lose original meaning), 2) Don't work directly with categorical features, 3) Make it impossible to extract meaningful rules using original variables, 4) Don't allow you to know which original features matter most.
