Why is traditional data warehousing insufficient for discovering patterns in massive datasets
Because data warehousing provides data storage and summary reports but does not implement algorithms to discover hidden patterns; the curse of dimensionality and volume make manual analysis impractical, so automated data mining techniques are required
How do classification and clustering differ in terms of supervision and output
Classification is supervised learning that maps objects to predefined classes based on labelled training data, whereas clustering is unsupervised and groups objects into clusters based on similarity without prior labels
Why is the curse of dimensionality problematic for distance based methods in data mining
As the number of attributes increases, data points become sparse and distances between points tend to become similar, making distance measures less meaningful; algorithms like nearest neighbour lose discriminatory power and computational cost increases dramatically
What is the difference between nominal and ordinal attributes, and why does this matter for measuring similarity
Nominal attributes have unordered categorical values where only equality comparisons make sense, whereas ordinal attributes have meaningful order among categories; this distinction affects the choice of similarity measures because ordinal values may be encoded to reflect order while nominal values are treated as distinct categories
How does analysing graph structured data differ from analysing numeric vectors
Graph data consists of nodes and edges capturing relationships, requiring algorithms that consider connectivity and topology, whereas numeric vectors are fixed length and can use metrics like Euclidean distance; therefore graph mining techniques such as community detection or path analysis are needed instead of conventional clustering or regression on vectors
In market basket analysis, why is support used as a threshold for selecting frequent itemsets
Support measures how often a set of items occurs in transactions; using a minimum support threshold filters out itemsets that appear rarely, reducing the search space and focusing on patterns that occur often enough to be reliable and actionable
Why does high confidence in an association rule not imply causation
Confidence measures conditional co-occurrence of items but does not account for external factors or temporal order; two items might co-occur frequently due to a common cause or random correlation, so rules indicate association rather than cause and effect
How can clustering be used as a preprocessing step for classification
Clustering can group similar data points to reveal underlying structure; these clusters may then be used to design specialized classifiers for each group, handle class imbalance by oversampling cluster centroids, or compress data into prototypes that reduce computational complexity
In predictive modelling, why is it important to separate training and test datasets
Training data is used to build the model, while test data, which is not seen during training, is used to evaluate generalization; using the same data for both leads to overly optimistic performance estimates and fails to detect overfitting
How does anomaly detection differ from classification, and why is it challenging
Anomaly detection seeks rare or unusual instances without labelled examples of anomalies; the majority class may be well characterised but anomalies can be diverse and lack training data, making it difficult to define precise boundaries and requiring unsupervised or semi-supervised methods
A retailer collects high dimensional transaction data; why might unsupervised clustering followed by association analysis reveal meaningful insights
Clustering can group similar customers or purchases without prior labels, reducing complexity and highlighting segments; within each cluster, frequent itemset mining can identify tailored association rules that reflect group specific purchasing patterns, leading to targeted marketing strategies
How are data mining tasks like clustering, association rule mining, and anomaly detection complementary when exploring a large dataset
Clustering reveals groups of similar objects, association rule mining discovers relationships among items, and anomaly detection identifies outliers; together they provide a comprehensive understanding of structure, patterns, and deviations within the data
Why might the same dataset be used for both classification and regression tasks, and how does the choice depend on the target variable
If the target variable is categorical, a classification model predicts class labels; if it is continuous, regression predicts numerical values; the same features can support both tasks but the modelling approach differs according to the nature of the target and evaluation criteria
Why is simple random sampling sometimes inadequate when sampling rare subgroups
In simple random sampling, each element has equal probability of selection; rare subgroups may have so few members that they are underrepresented in the sample, leading to biased estimates or missed rare events
How could hidden periodicity in the data cause bias in systematic sampling
Systematic sampling selects every k-th element; if the data exhibits periodic patterns aligned with the sampling interval, the sample may over-represent or under-represent certain patterns, leading to systematic bias
In cluster sampling, why is it beneficial to sample clusters rather than individual units, and what is a potential drawback
Sampling clusters reduces cost and logistical effort because entire groups are selected instead of many separate individuals; however, clusters may be internally homogeneous and thus provide less variability, introducing bias if clusters differ systematically
How does stratified sampling help detect rare events, and how should strata be chosen
Stratified sampling divides the population into homogeneous subgroups and samples from each; by allocating more samples to rare strata, rare events become more likely to appear in the sample; strata should be defined based on attributes related to the event of interest
Explain why each element in a streaming dataset has equal probability of being retained in a reservoir sample of size n without knowing the total length of the stream
The reservoir sampling algorithm initializes the reservoir with the first n items; for each subsequent item at position i > n, it replaces a uniformly chosen reservoir element with probability n/i; induction then shows that after processing all N items, each item remains in the reservoir with probability n/N
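A minimal Python sketch of this algorithm (reservoir sampling, often called Algorithm R); the function name and example stream are illustrative:

```python
import random

def reservoir_sample(stream, n):
    """Keep a uniform random sample of size n from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            reservoir.append(item)          # fill the reservoir with the first n items
        else:
            j = random.randint(1, i)        # uniform position in 1..i
            if j <= n:                      # with probability n/i ...
                reservoir[j - 1] = item     # ... replace a uniformly chosen slot
    return reservoir

# Example: sample 5 items from a stream of 1000 integers.
print(reservoir_sample(range(1000), 5))
```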
Why is the TF-IDF weighting scheme assign higher weight to terms that appear frequently in a document but infrequently across the corpus
Term frequency captures a term’s importance within a document, while inverse document frequency downweights terms common across many documents; thus, terms with high term frequency but low document frequency receive large TF-IDF values, highlighting words that are both frequent in a particular document and discriminative across the corpus
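A small illustrative sketch of one common TF-IDF variant (raw term frequency times log(N/df)); the toy documents and helper names are made up, and smoothed IDF variants also exist:

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)                                     # raw term frequency
    return {t: tf[t] * math.log(N / df[t]) for t in tf}   # weight = tf * log(N/df)

# Terms occurring in every document (like "the") get weight 0; terms frequent
# in one document but rare overall get the largest weights.
print(tf_idf(tokenized[0]))
```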
How can summary statistics such as mean and variance be misleading in skewed distributions, and what alternative measures could be used
Mean and variance are sensitive to outliers and may not reflect the central tendency or spread in skewed data; median and interquartile range provide more robust measures of location and spread, and boxplots or histograms can visualise distribution shape
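A tiny illustration with made-up income figures, showing how a single extreme value distorts the mean while the median and interquartile range stay stable:

```python
import numpy as np

incomes = np.array([30, 32, 35, 38, 40, 41, 45, 50, 60, 900])  # one extreme outlier

print("mean:", incomes.mean())            # pulled far to the right by the outlier
print("median:", np.median(incomes))      # robust centre of the bulk of the data
q1, q3 = np.percentile(incomes, [25, 75])
print("IQR:", q3 - q1)                    # robust measure of spread
```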
What is the Bonferroni principle, and why must it be considered when interpreting patterns discovered in data mining
The Bonferroni principle warns that testing many hypotheses increases the chance of false positives; when exploring many possible patterns, some may appear statistically significant by chance, so corrections or hold-out validation are needed to avoid reporting spurious results
When using text mining to analyse Yelp reviews, why is it important to remove stopwords and apply IDF weighting before ranking terms
Stopwords are common words that carry little content; removing them reduces noise; IDF weighting downweights words that are common across documents, so ranking terms by TF-IDF highlights distinctive words that reflect the subject and sentiment of the reviews
Why is it necessary to normalize or standardize variables before computing similarity or applying clustering algorithms
Different variables may have different scales; without normalization, variables with larger numeric ranges dominate distance calculations and skew results; standardization ensures each variable contributes proportionally to distance metrics and improves comparability across features
You need to estimate the average income of households in a country; why might you use stratified sampling rather than cluster sampling
Stratified sampling ensures representation from each socioeconomic group, capturing the variability of income across strata; cluster sampling may select entire neighbourhoods that are homogeneous within clusters, leading to biased estimates if selected clusters are not representative
How does the choice of sampling method impact the generalization of models trained on the sampled data
Sampling methods influence how well the sample represents the population; biased samples lead to models that generalize poorly, while stratified or random sampling improves representativeness and reduces sampling bias, resulting in more reliable models
How would failing to consider sampling bias during association rule mining affect the interpretation of discovered rules
If certain transactions or subgroups are underrepresented due to sampling bias, some frequent itemsets may be missed and support values may not reflect true population frequencies; this can lead to spurious or missing associations, diminishing the validity of the rules
What is the difference between support count and support fraction in frequent itemset mining, and why might one be more informative than the other
Support count is the number of transactions containing a given itemset, while support fraction divides the count by the total number of transactions; support fraction is normalized and allows comparison across datasets of different sizes, whereas support count alone depends on dataset size
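A minimal sketch of the two quantities on a toy transaction set (the items and transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
itemset = {"bread", "milk"}

support_count = sum(itemset <= t for t in transactions)   # 2 transactions contain the itemset
support_fraction = support_count / len(transactions)      # 0.5, comparable across datasets
print(support_count, support_fraction)
```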
Why does the naive algorithm for frequent itemset mining become intractable even for moderate numbers of items
The naive algorithm enumerates all possible itemsets, whose number grows exponentially with the number of items; as the number of possible itemsets becomes enormous, counting their support is computationally infeasible
How does the anti-monotonic property of support enable the Apriori algorithm to prune the search space
The property states that if an itemset is infrequent, all supersets of that itemset are also infrequent; conversely, only itemsets all of whose subsets are frequent need to be considered; Apriori uses this property to avoid generating and testing candidates whose subsets are not all frequent, greatly reducing the number of candidates
Describe the two main steps in the Apriori algorithm for generating candidate k-itemsets from frequent (k-1)-itemsets
In the join step, pairs of frequent (k-1)-itemsets that share k-2 items are combined to form candidate k-itemsets; in the prune step, any candidate with a (k-1)-item subset that is not frequent is eliminated using the anti-monotonic property
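A rough sketch of Apriori candidate generation; it uses a union-based join (equivalent in effect to the join step described above) and the names are illustrative:

```python
from itertools import combinations

def apriori_gen(frequent_km1, k):
    """Candidate k-itemsets from frequent (k-1)-itemsets: join, then prune."""
    frequent_km1 = {frozenset(s) for s in frequent_km1}
    candidates = set()
    for a in frequent_km1:
        for b in frequent_km1:
            union = a | b
            if len(union) == k:  # join: the two (k-1)-itemsets share k-2 items
                # prune: every (k-1)-subset of the candidate must itself be frequent
                if all(frozenset(sub) in frequent_km1
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

# Frequent 2-itemsets -> candidate 3-itemsets; {b, d} contributes nothing
# because {a, d} and {c, d} are not frequent.
f2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
print(apriori_gen(f2, 3))   # {frozenset({'a', 'b', 'c'})}
```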
Once frequent itemsets are generated, how are association rules derived and evaluated
For each frequent itemset, all non-empty proper subsets are considered as potential antecedents; each rule splits the itemset into antecedent and consequent; its support equals the support of the whole itemset and its confidence equals support(itemset) / support(antecedent); rules with confidence and support above the thresholds are retained
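A minimal sketch of this rule-generation step; the `support` dictionary (itemset to support fraction) and the threshold are assumed inputs:

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Enumerate rules A -> (itemset - A) and keep those meeting min_conf.

    `support` must contain the itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                       # all non-empty proper subsets
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            confidence = support[itemset] / support[antecedent]
            if confidence >= min_conf:
                rules.append((set(antecedent), set(consequent), confidence))
    return rules

support = {frozenset({"bread"}): 0.6,
           frozenset({"milk"}): 0.7,
           frozenset({"bread", "milk"}): 0.5}
print(rules_from_itemset({"bread", "milk"}, support, min_conf=0.7))
```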
Why is the confidence of rules generated from the same frequent itemset anti-monotonic with respect to the size of the consequent
As items are moved from the antecedent to the consequent, the support of the antecedent increases or stays the same while the support of the entire itemset is unchanged; thus the ratio support(itemset) / support(antecedent) can only decrease, so rules with larger consequents (and smaller antecedents) have lower or equal confidence, which allows rule generation to prune by consequent size
How does the FP-growth algorithm avoid generating large numbers of candidate itemsets
It compresses the database into an FP-tree that stores transactions as shared prefix paths in a trie-like structure; it then recursively mines frequent patterns using conditional pattern bases and conditional FP-trees, eliminating the need to generate and test candidate itemsets explicitly
In what scenarios might FP-growth outperform Apriori, and why
FP-growth is more efficient when the dataset contains many long frequent patterns with common prefixes, because the FP-tree compresses the data and reduces disk input/output; Apriori incurs high cost generating and counting candidates for such datasets; however, FP-growth may require more memory and can be less intuitive to implement
How does increasing the minimum support threshold affect the number of frequent itemsets and computation time
Raising the support threshold reduces the number of itemsets that qualify as frequent; this decreases candidate generation and support counting, reducing computation time and memory; conversely, a low threshold yields many frequent itemsets and increases computational overhead
In a dataset with 1000 transactions and 100 items, why would applying the Apriori algorithm with a very low support threshold be impractical, and how might sampling help
A low threshold would produce a huge number of frequent itemsets, many of which are spurious; the Apriori algorithm would generate and test an exponential number of candidates, leading to excessive runtime; sampling a subset of transactions can approximate frequent patterns more quickly at the cost of accuracy
Why might a rule with high confidence still be undesirable, and what additional measures can be used to assess rule interestingness
High confidence may occur simply because the consequent item has high support rather than because of a strong relationship; metrics such as lift or conviction adjust for the consequent’s frequency and better reflect the strength of the association
How might a business use frequent itemset mining in combination with classification to improve a recommendation system
Frequent itemset mining can identify co purchased items to generate association based recommendations; classification can predict user preferences or purchase intent based on features; combining the two allows recommendations to be tailored using both item associations and user characteristics, enhancing relevance
Why is the downward closure property of support not applicable to confidence, and how does this affect rule generation
Support is anti-monotonic because adding items cannot increase support count, but confidence depends on the ratio of supports, which can increase or decrease when items are added to the antecedent or consequent; thus confidence is not monotonic, requiring evaluation of all possible partitions of a frequent itemset when generating rules
Why might a model that performs extremely well on training data perform poorly on unseen data, and how can this issue be mitigated
Overfitting occurs when a model captures noise or peculiarities of the training data rather than general patterns; as a result, it fails to generalize; mitigation strategies include using cross validation, pruning decision trees, regularisation, and holding out a test set
How do decision trees handle non-linear decision boundaries, and why might a linear classifier struggle in such cases
Decision trees partition the feature space into axis-aligned regions using hierarchical splits, enabling them to approximate complex non-linear boundaries by combining multiple splits; linear classifiers such as logistic regression draw a single hyperplane and cannot easily represent intricate shapes without feature transformations
When splitting on a continuous attribute in a decision tree, why is dynamic discretisation necessary, and how is the best split point chosen
Continuous attributes have many possible split points; dynamic discretisation involves sorting values and evaluating candidate cut points using criteria like information gain or Gini impurity; the split that maximizes purity improvement is chosen
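A small sketch of this procedure using Gini impurity and midpoints between consecutive sorted values as candidate cut points; the data and names are invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Sort the attribute, try midpoints between distinct values, keep the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

values = [2.1, 3.5, 4.0, 5.2, 6.8, 7.1]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_split(values, labels))   # threshold 4.6 separates the two classes perfectly
```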
Compare entropy and Gini impurity as splitting criteria in decision trees; under what circumstances might one be preferred
Both measure node impurity; entropy originates from information theory and emphasises large differences in class probabilities; Gini is simpler to compute and often yields similar splits; some implementations prefer Gini for computational efficiency, though entropy can provide more discriminative splitting when class distributions are highly skewed
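A quick numerical comparison of the two impurity measures for a binary class distribution (a sketch, not tied to any particular dataset):

```python
import math

def entropy(p):
    """Entropy of a two-class node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    """Gini impurity of the same two-class node."""
    return 2 * p * (1 - p)

# Both are zero for pure nodes and maximal at p = 0.5;
# Gini avoids logarithms, which is why it is often cheaper to compute.
for p in (0.05, 0.25, 0.5):
    print(f"p={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
```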
Why does a split with high information gain indicate a good attribute for a decision tree, and what is the trade off when using many small leaves
High information gain means the split substantially reduces impurity, producing subsets with more homogeneous class labels; however, creating many small leaves can overfit training data and reduce generalization, so pruning or minimum leaf size constraints are used to balance complexity and accuracy
Explain how precision and recall differ, and why accuracy alone can be misleading for imbalanced classification tasks
Precision measures the proportion of predicted positives that are correct, whereas recall measures the proportion of actual positives detected; in imbalanced datasets where the negative class dominates, accuracy may be high even if the model misses most positives, so precision and recall provide a more informative assessment
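A toy confusion-matrix calculation (the counts are invented) showing how accuracy can look good while precision and recall are poor on an imbalanced problem:

```python
# 100 examples: 5 actual positives, 95 actual negatives.
tp, fp, fn, tn = 2, 3, 3, 92

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.94, dominated by the easy negatives
precision = tp / (tp + fp)                    # 0.40, correct fraction of predicted positives
recall    = tp / (tp + fn)                    # 0.40, fraction of actual positives found
print(accuracy, precision, recall)
```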
How does the independence assumption in Naive Bayes simplify computation of class probabilities, and why might it fail in real world datasets
The assumption allows the joint probability of features given a class to be computed as the product of individual feature probabilities, drastically reducing parameter estimation; in practice, features often exhibit dependencies, and ignoring them can lead to suboptimal models, although Naive Bayes can still perform well due to its simplicity and robust probability estimates
In the Naive Bayes classifier, what is the role of the prior probability, and how does the maximum a posteriori criterion determine the predicted class
The prior reflects how common each class is; the MAP classifier multiplies the prior by the likelihood of the features given the class and predicts the class with the highest posterior probability; this balances evidence from the data with prior expectations
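A minimal sketch of the MAP decision with two classes and two binary features; the priors and likelihood tables are invented for illustration:

```python
import math

priors = {"spam": 0.3, "ham": 0.7}
# P(feature = 1 | class), treated as independent given the class.
likelihoods = {
    "spam": {"has_link": 0.8, "has_greeting": 0.2},
    "ham":  {"has_link": 0.1, "has_greeting": 0.7},
}

def map_class(features):
    """Return the class maximising prior * product of feature likelihoods (in log space)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for f, value in features.items():
            p = likelihoods[c][f]
            score += math.log(p if value else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

print(map_class({"has_link": 1, "has_greeting": 0}))   # "spam" despite the smaller prior
```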
Compare decision trees and Naive Bayes in terms of interpretability and assumptions, and discuss when one might outperform the other
Decision trees provide explicit rules that are easy to interpret but can overfit unless pruned; Naive Bayes makes a strong independence assumption but is computationally efficient; Naive Bayes may perform well on high dimensional data such as text, while decision trees capture interactions better when features are dependent
Why is pruning important in decision tree learning, and what methods can be used to decide when to stop splitting
Without pruning, a decision tree can grow deep and fit noise; pruning reduces complexity by removing splits that do not improve generalization; methods include setting a maximum depth, minimum number of samples per leaf, cost-complexity pruning, or validating on a separate dataset
A classifier yields high recall but low precision; what practical consequence does this have, and how might you adjust the model
High recall but low precision means the model identifies most positives but also produces many false positives; in applications such as spam detection, many legitimate emails may be misclassified; adjusting classification threshold, changing cost functions, or using a different algorithm can improve precision
Why is it useful to use a confusion matrix rather than just the number of correct predictions when evaluating a multi-class classifier
The confusion matrix tabulates predicted versus actual classes, revealing which classes are confused; this helps diagnose systematic errors, imbalanced performance across classes, and guides model improvement beyond overall accuracy
How can feature selection methods influence decision tree structure, and why might embedded methods such as tree based models naturally perform feature selection
Feature selection removes irrelevant attributes before training, simplifying the tree and improving generalization; embedded methods like decision trees compute feature importance during training and inherently select informative features by splitting on them, reducing the need for separate selection
Why might a Naive Bayes classifier perform poorly on data generated from a clustering task, and how could one modify it
If the data’s clusters correspond to classes but features within a cluster are strongly correlated, the independence assumption fails and Naive Bayes may misestimate probabilities; using a model that accounts for feature dependencies, such as Bayesian networks or logistic regression with interaction terms, could improve performance
Why is linear regression unsuitable for binary classification, and how does logistic regression address this issue
Linear regression outputs unbounded real values, so mapping to class labels with a fixed threshold can lead to predictions outside the [0, 1] range and poorly calibrated probabilities; logistic regression applies the sigmoid function to map the linear combination of features into the (0, 1) interval, producing valid probabilities for binary classification
How does the sigmoid function shape the decision boundary in logistic regression, and what happens to the predicted probability as the linear predictor becomes large
The sigmoid function is S-shaped, approaching one as the linear predictor grows large and zero as it becomes highly negative; near zero it is approximately linear; thus, logistic regression yields a smooth probability transition across the decision boundary and saturates at extreme values
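A short numerical sketch of the sigmoid's shape:

```python
import math

def sigmoid(z):
    """Map the linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Roughly linear around z = 0, saturating towards 0 and 1 at the extremes.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```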
Why is cross-entropy loss preferred over mean squared error for training logistic regression models
Cross-entropy loss is derived from maximum likelihood estimation for Bernoulli outcomes and penalizes misclassified examples more appropriately; it ensures convexity and leads to larger gradients when predictions are far from true labels, whereas mean squared error may produce a non-convex cost and can lead to slow or unstable learning
Describe the update rule for gradient descent in logistic regression and explain the role of the learning rate
Parameters are updated by subtracting the product of the learning rate and the gradient of the loss with respect to the parameters; the learning rate controls step size: too high can cause divergence and oscillation, while too low results in slow convergence
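A compact sketch of batch gradient descent for logistic regression with cross-entropy loss; the toy data, learning rate, and epoch count are arbitrary choices:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on mean cross-entropy (bias folded into the weights)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)              # gradient of the loss w.r.t. w
        w -= lr * grad                              # step against the gradient, scaled by lr
    return w

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])                    # class 1 for larger x
print(train_logistic(X, y))
```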
How does L2 regularization mitigate overfitting in regression, and what is its effect on coefficients
L2 regularization adds a penalty proportional to the square of the coefficients to the loss function, discouraging large weights; it shrinks coefficients toward zero but does not drive them exactly to zero, reducing variance and preventing overfitting
Why does L1 regularization perform feature selection, and how does it differ from ridge regression
L1 regularization adds a penalty proportional to the absolute values of coefficients; this can lead to some coefficients becoming exactly zero, effectively selecting a subset of features; ridge shrinks coefficients but rarely zeroes them, so it does not perform feature selection
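A sketch contrasting the two penalties, assuming scikit-learn is available; the synthetic data and alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 2))   # noise coefficients typically driven to exactly zero
```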
Explain the bias variance trade off in the context of regularized regression models
Increasing regularization reduces model variance by simplifying the model, but increases bias by constraining flexibility; decreasing regularization yields lower bias but higher variance; optimal performance balances these to minimise expected prediction error
In logistic regression, what happens if you choose a threshold of 0.9 instead of 0.5 to decide class membership
A higher threshold requires stronger evidence to classify an instance as positive; this reduces false positives and increases precision but may increase false negatives and decrease recall; the choice depends on application needs
How can logistic regression be modified to account for differing costs of false positives and false negatives
The decision threshold can be adjusted to reflect cost ratios, or the loss function can be weighted to penalize mistakes on one class more than the other, effectively shifting the decision boundary
How is logistic regression extended to handle multiple classes, and what is the role of the softmax function
Multinomial logistic regression assigns each class its own parameter vector; the softmax function converts linear outputs into a probability distribution over classes, ensuring probabilities sum to one; the class with highest probability is chosen
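A minimal softmax sketch; the score vector is arbitrary:

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into a probability distribution."""
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])        # one linear score per class
probs = softmax(scores)
print(probs, probs.sum(), probs.argmax())  # probabilities sum to 1; class 0 is predicted
```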
Why might gradient descent converge to a local minimum in logistic regression, and how can this be mitigated
Standard logistic regression with the convex cross-entropy loss has a single global minimum, so gradient descent does not get trapped in local minima; non-convexity arises only in extensions such as learned feature transformations (for example neural networks) or non-convex penalties; in those settings, stochastic gradient descent with an appropriate learning rate and sensible initialization helps avoid poor local minima
How can text mining applications combine TF-IDF features with logistic regression, and why might regularization be crucial in this context
TF-IDF vectors are high dimensional and sparse; logistic regression can model probability of categories such as spam or sentiment; regularization prevents overfitting by shrinking coefficients of irrelevant terms and potentially selecting informative features, improving generalization
Compare logistic regression with Naive Bayes for text classification in terms of assumptions and performance
Naive Bayes assumes feature independence given the class and is fast and robust; logistic regression makes no independence assumption and can capture dependencies via weighting; logistic regression often yields higher accuracy but may require more data and regularization to avoid overfitting
Why do distance measures lose their discriminative power as the dimensionality of the data increases, and how does this affect nearest neighbour algorithms
In high dimensional spaces, most points become equidistant because the range of distances shrinks; the difference between the nearest and farthest neighbours diminishes, so nearest neighbour algorithms cannot reliably identify the closest points, leading to poor performance
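A quick simulation of this distance-concentration effect on uniform random data (dimensions and sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))            # 500 random points in d dimensions
    q = rng.uniform(size=d)                   # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative gap between farthest and nearest neighbour shrinks as d grows.
    print(d, round((dist.max() - dist.min()) / dist.min(), 3))
```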
What is the key difference between feature selection and dimensionality reduction, and how does each address high dimensional data
Feature selection retains a subset of the original variables deemed most informative, maintaining interpretability, whereas dimensionality reduction constructs new features as combinations of the original ones, often using techniques like principal components analysis; both reduce dimensionality but with different impacts on interpretability and model bias
Why are filter methods for feature selection considered model independent, and what is a potential disadvantage
Filter methods rank features based on statistical criteria such as correlation or information gain without referencing a specific learning algorithm, making them fast and general; however, they may ignore interactions among features and may select features that are individually good but redundant when combined
How do wrapper methods use the performance of a specific model to select features, and what are the trade offs compared to filter methods
Wrapper methods evaluate subsets of features by training and validating a chosen model on them, selecting the subset that yields the best performance; this captures interactions between features but is computationally intensive and risks overfitting due to repeated model training
Explain how embedded methods integrate feature selection into the model training process, and provide an example
Embedded methods incorporate feature selection within the learning algorithm’s objective function, such as decision trees selecting splits or lasso regularized regression imposing sparsity; they balance efficiency and interaction detection by selecting features while fitting the model
According to the comparative table of feature selection methods, why might embedded methods offer a good balance between capturing interactions and computational cost
Embedded methods leverage model training to evaluate feature importance, capturing interactions better than filter methods while avoiding exhaustive search typical of wrapper methods; they achieve moderate computational cost and often yield a sparse model
In a high dimensional gene expression dataset, why might using lasso regression be preferable to forward selection for feature selection
Lasso is an embedded method that imposes an L1 penalty, simultaneously performing regression and feature selection; it can handle thousands of correlated genes efficiently, whereas forward selection is a wrapper method that evaluates subsets incrementally and becomes computationally prohibitive
How does the curse of dimensionality influence the effectiveness of clustering algorithms, and how can dimensionality reduction help
In high dimensions, distances become less meaningful and clusters may be indistinguishable; dimensionality reduction techniques like principal components analysis can project data into lower dimensional spaces where structure is clearer, improving cluster separability and algorithm performance
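A sketch of this pipeline, assuming scikit-learn; the synthetic data (two groups separated in a few informative dimensions, padded with noise) is invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (100, 5)),    # group A
                         rng.normal(4, 1, (100, 5))])   # group B, shifted
noise = rng.normal(0, 1, (200, 95))                     # 95 uninformative dimensions
X = np.hstack([informative, noise])

X_reduced = PCA(n_components=2).fit_transform(X)        # project onto 2 principal components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:5], labels[-5:])                          # the two groups largely separate
```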
When using filter methods to select the top k features based on correlation with the target, why might important features still be omitted, and how can this be addressed
Highly correlated features might be redundant or may only have predictive power in combination with others; filter methods evaluate each feature individually and can miss interactions; combining filter methods with wrapper or embedded approaches, or using multivariate statistical tests, can capture such joint effects