Why is traditional data warehousing insufficient for discovering patterns in massive datasets
Because data warehousing provides data storage and summary reports but does not implement algorithms to discover hidden patterns; the curse of dimensionality and volume make manual analysis impractical, so automated data mining techniques are required
How do classification and clustering differ in terms of supervision and output
Classification is supervised learning that maps objects to predefined classes based on labelled training data, whereas clustering is unsupervised and groups objects into clusters based on similarity without prior labels
Why is the curse of dimensionality problematic for distance based methods in data mining
As the number of attributes increases, data points become sparse and distances between points tend to become similar, making distance measures less meaningful; algorithms like nearest neighbour lose discriminatory power and computational cost increases dramatically
What is the difference between nominal and ordinal attributes, and why does this matter for measuring similarity
Nominal attributes have unordered categorical values where only equality comparisons make sense, whereas ordinal attributes have meaningful order among categories; this distinction affects the choice of similarity measures because ordinal values may be encoded to reflect order while nominal values are treated as distinct categories
How does analysing graph structured data differ from analysing numeric vectors
Graph data consists of nodes and edges capturing relationships, requiring algorithms that consider connectivity and topology, whereas numeric vectors are fixed length and can use metrics like Euclidean distance; therefore graph mining techniques such as community detection or path analysis are needed instead of conventional clustering or regression on vectors
In market basket analysis, why is support used as a threshold for selecting frequent itemsets
Support measures how often a set of items occurs in transactions; using a minimum support threshold filters out itemsets that appear rarely, reducing the search space and focusing on patterns that occur often enough to be reliable and actionable
Why does high confidence in an association rule not imply causation
Confidence measures conditional co-occurrence of items but does not account for external factors or temporal order; two items might co-occur frequently due to a common cause or random correlation, so rules indicate association rather than cause and effect
How can clustering be used as a preprocessing step for classification
Clustering can group similar data points to reveal underlying structure; these clusters may then be used to design specialized classifiers for each group, handle class imbalance by oversampling cluster centroids, or compress data into prototypes that reduce computational complexity
In predictive modelling, why is it important to separate training and test datasets
Training data is used to build the model, while test data, which is not seen during training, is used to evaluate generalization; using the same data for both leads to overly optimistic performance estimates and fails to detect overfitting
How does anomaly detection differ from classification, and why is it challenging
Anomaly detection seeks rare or unusual instances without labelled examples of anomalies; the majority class may be well characterised but anomalies can be diverse and lack training data, making it difficult to define precise boundaries and requiring unsupervised or semi-supervised methods
A retailer collects high dimensional transaction data; why might unsupervised clustering followed by association analysis reveal meaningful insights
Clustering can group similar customers or purchases without prior labels, reducing complexity and highlighting segments; within each cluster, frequent itemset mining can identify tailored association rules that reflect group specific purchasing patterns, leading to targeted marketing strategies
How are data mining tasks like clustering, association rule mining, and anomaly detection complementary when exploring a large dataset
Clustering reveals groups of similar objects, association rule mining discovers relationships among items, and anomaly detection identifies outliers; together they provide a comprehensive understanding of structure, patterns, and deviations within the data
Why might the same dataset be used for both classification and regression tasks, and how does the choice depend on the target variable
If the target variable is categorical, a classification model predicts class labels; if it is continuous, regression predicts numerical values; the same features can support both tasks but the modelling approach differs according to the nature of the target and evaluation criteria
Why is simple random sampling sometimes inadequate when sampling rare subgroups
In simple random sampling, each element has equal probability of selection; rare subgroups may have so few members that they are underrepresented in the sample, leading to biased estimates or missed rare events
How could hidden periodicity in the data cause bias in systematic sampling
Systematic sampling selects every k-th element; if the data exhibits periodic patterns aligned with the sampling interval, the sample may over-represent or under-represent certain patterns, leading to systematic bias
In cluster sampling, why is it beneficial to sample clusters rather than individual units, and what is a potential drawback
Sampling clusters reduces cost and logistical effort because entire groups are selected instead of many separate individuals; however, clusters may be internally homogeneous and thus provide less variability, introducing bias if clusters differ systematically
How does stratified sampling help detect rare events, and how should strata be chosen
Stratified sampling divides the population into homogeneous subgroups and samples from each; by allocating more samples to rare strata, rare events become more likely to appear in the sample; strata should be defined based on attributes related to the event of interest
Explain why each element in a streaming dataset has equal probability of being retained in a reservoir sample of size n without knowing the total length of the stream
The reservoir sampling algorithm initializes the reservoir with the first n items; for each subsequent item at position i > n, it replaces a uniformly chosen reservoir element with probability n/i; induction then shows that after processing all N items, each item remains in the reservoir with probability n/N
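A minimal Python sketch of this algorithm (reservoir sampling, often called Algorithm R); the function name and example stream are illustrative:

```python
import random

def reservoir_sample(stream, n):
    """Keep a uniform random sample of size n from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            reservoir.append(item)          # fill the reservoir with the first n items
        else:
            j = random.randint(1, i)        # uniform position in 1..i
            if j <= n:                      # with probability n/i ...
                reservoir[j - 1] = item     # ... replace a uniformly chosen slot
    return reservoir

# Example: sample 5 items from a stream of 1000 integers.
print(reservoir_sample(range(1000), 5))
```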
Why is the TF-IDF weighting scheme assign higher weight to terms that appear frequently in a document but infrequently across the corpus
Term frequency captures a term’s importance within a document, while inverse document frequency downweights terms common across many documents; thus, terms with high term frequency but low document frequency receive large TF-IDF values, highlighting words that are both frequent in a particular document and discriminative across the corpus
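A small illustrative sketch of one common TF-IDF variant (raw term frequency times log(N/df)); the toy documents and helper names are made up, and smoothed IDF variants also exist:

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)                                     # raw term frequency
    return {t: tf[t] * math.log(N / df[t]) for t in tf}   # weight = tf * log(N/df)

# Terms occurring in every document (like "the") get weight 0; terms frequent
# in one document but rare overall get the largest weights.
print(tf_idf(tokenized[0]))
```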
How can summary statistics such as mean and variance be misleading in skewed distributions, and what alternative measures could be used
Mean and variance are sensitive to outliers and may not reflect the central tendency or spread in skewed data; median and interquartile range provide more robust measures of location and spread, and boxplots or histograms can visualise distribution shape
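A tiny illustration with made-up income figures, showing how a single extreme value distorts the mean while the median and interquartile range stay stable:

```python
import numpy as np

incomes = np.array([30, 32, 35, 38, 40, 41, 45, 50, 60, 900])  # one extreme outlier

print("mean:", incomes.mean())            # pulled far to the right by the outlier
print("median:", np.median(incomes))      # robust centre of the bulk of the data
q1, q3 = np.percentile(incomes, [25, 75])
print("IQR:", q3 - q1)                    # robust measure of spread
```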
What is the Bonferroni principle, and why must it be considered when interpreting patterns discovered in data mining
The Bonferroni principle warns that testing many hypotheses increases the chance of false positives; when exploring many possible patterns, some may appear statistically significant by chance, so corrections or hold-out validation are needed to avoid reporting spurious results
When using text mining to analyse Yelp reviews, why is it important to remove stopwords and apply IDF weighting before ranking terms
Stopwords are common words that carry little content; removing them reduces noise; IDF weighting downweights words that are common across documents, so ranking terms by TF-IDF highlights distinctive words that reflect the subject and sentiment of the reviews
Why is it necessary to normalize or standardize variables before computing similarity or applying clustering algorithms
Different variables may have different scales; without normalization, variables with larger numeric ranges dominate distance calculations and skew results; standardization ensures each variable contributes proportionally to distance metrics and improves comparability across features
You need to estimate the average income of households in a country; why might you use stratified sampling rather than cluster sampling
Stratified sampling ensures representation from each socioeconomic group, capturing the variability of income across strata; cluster sampling may select entire neighbourhoods that are homogeneous within clusters, leading to biased estimates if selected clusters are not representative
How does the choice of sampling method impact the generalization of models trained on the sampled data
Sampling methods influence how well the sample represents the population; biased samples lead to models that generalize poorly, while stratified or random sampling improves representativeness and reduces sampling bias, resulting in more reliable models
How would failing to consider sampling bias during association rule mining affect the interpretation of discovered rules
If certain transactions or subgroups are underrepresented due to sampling bias, some frequent itemsets may be missed and support values may not reflect true population frequencies; this can lead to spurious or missing associations, diminishing the validity of the rules
What is the difference between support count and support fraction in frequent itemset mining, and why might one be more informative than the other
Support count is the number of transactions containing a given itemset, while support fraction divides the count by the total number of transactions; support fraction is normalized and allows comparison across datasets of different sizes, whereas support count alone depends on dataset size
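A minimal sketch of the two quantities on a toy transaction set (the items and transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
itemset = {"bread", "milk"}

support_count = sum(itemset <= t for t in transactions)   # 2 transactions contain the itemset
support_fraction = support_count / len(transactions)      # 0.5, comparable across datasets
print(support_count, support_fraction)
```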
Why does the naive algorithm for frequent itemset mining become intractable even for moderate numbers of items
The naive algorithm enumerates all possible itemsets, whose number grows exponentially with the number of items; as the number of possible itemsets becomes enormous, counting their support is computationally infeasible
How does the anti-monotonic property of support enable the Apriori algorithm to prune the search space
The property states that if an itemset is infrequent, all supersets of that itemset are also infrequent; conversely, only itemsets all of whose subsets are frequent need to be considered; Apriori uses this property to avoid generating and testing candidates whose subsets are not all frequent, greatly reducing the number of candidates
Describe the two main steps in the Apriori algorithm for generating candidate k-itemsets from frequent (k-1)-itemsets
In the join step, pairs of frequent (k-1)-itemsets that share k-2 items are combined to form candidate k-itemsets; in the prune step, any candidate with a (k-1)-item subset that is not frequent is eliminated using the anti-monotonic property
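A rough sketch of Apriori candidate generation; it uses a union-based join (equivalent in effect to the join step described above) and the names are illustrative:

```python
from itertools import combinations

def apriori_gen(frequent_km1, k):
    """Candidate k-itemsets from frequent (k-1)-itemsets: join, then prune."""
    frequent_km1 = {frozenset(s) for s in frequent_km1}
    candidates = set()
    for a in frequent_km1:
        for b in frequent_km1:
            union = a | b
            if len(union) == k:  # join: the two (k-1)-itemsets share k-2 items
                # prune: every (k-1)-subset of the candidate must itself be frequent
                if all(frozenset(sub) in frequent_km1
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

# Frequent 2-itemsets -> candidate 3-itemsets; {b, d} contributes nothing
# because {a, d} and {c, d} are not frequent.
f2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
print(apriori_gen(f2, 3))   # {frozenset({'a', 'b', 'c'})}
```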
Once frequent itemsets are generated, how are association rules derived and evaluated
For each frequent itemset, all non-empty proper subsets are considered as potential antecedents; each rule splits the itemset into antecedent and consequent; its support equals the support of the whole itemset and its confidence equals support(itemset) / support(antecedent); rules with confidence and support above the thresholds are retained
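A minimal sketch of this rule-generation step; the `support` dictionary (itemset to support fraction) and the threshold are assumed inputs:

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Enumerate rules A -> (itemset - A) and keep those meeting min_conf.

    `support` must contain the itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                       # all non-empty proper subsets
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            confidence = support[itemset] / support[antecedent]
            if confidence >= min_conf:
                rules.append((set(antecedent), set(consequent), confidence))
    return rules

support = {frozenset({"bread"}): 0.6,
           frozenset({"milk"}): 0.7,
           frozenset({"bread", "milk"}): 0.5}
print(rules_from_itemset({"bread", "milk"}, support, min_conf=0.7))
```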
Why is the confidence of rules generated from the same frequent itemset anti-monotonic with respect to the size of the consequent
As items are moved from the antecedent to the consequent, the support of the antecedent increases or stays the same while the support of the entire itemset is unchanged; thus the ratio support(itemset) / support(antecedent) can only decrease, so rules with larger consequents (and smaller antecedents) have lower or equal confidence, which allows rule generation to prune by consequent size
How does the FP-growth algorithm avoid generating large numbers of candidate itemsets
It compresses the database into an FP-tree that stores transactions as shared prefix paths in a trie-like structure; it then recursively mines frequent patterns using conditional pattern bases and conditional FP-trees, eliminating the need to generate and test candidate itemsets explicitly
In what scenarios might FP-growth outperform Apriori, and why
FP-growth is more efficient when the dataset contains many long frequent patterns with common prefixes, because the FP-tree compresses the data and reduces disk input/output; Apriori incurs high cost generating and counting candidates for such datasets; however, FP-growth may require more memory and can be less intuitive to implement
How does increasing the minimum support threshold affect the number of frequent itemsets and computation time
Raising the support threshold reduces the number of itemsets that qualify as frequent; this decreases candidate generation and support counting, reducing computation time and memory; conversely, a low threshold yields many frequent itemsets and increases computational overhead
In a dataset with 1000 transactions and 100 items, why would applying the Apriori algorithm with a very low support threshold be impractical, and how might sampling help
A low threshold would produce a huge number of frequent itemsets, many of which are spurious; the Apriori algorithm would generate and test an exponential number of candidates, leading to excessive runtime; sampling a subset of transactions can approximate frequent patterns more quickly at the cost of accuracy
Why might a rule with high confidence still be undesirable, and what additional measures can be used to assess rule interestingness
High confidence may occur simply because the consequent item has high support rather than because of a strong relationship; metrics such as lift or conviction adjust for the consequent’s frequency and better reflect the strength of the association
How might a business use frequent itemset mining in combination with classification to improve a recommendation system
Frequent itemset mining can identify co purchased items to generate association based recommendations; classification can predict user preferences or purchase intent based on features; combining the two allows recommendations to be tailored using both item associations and user characteristics, enhancing relevance
Why is the downward closure property of support not applicable to confidence, and how does this affect rule generation
Support is anti-monotonic because adding items cannot increase support count, but confidence depends on the ratio of supports, which can increase or decrease when items are added to the antecedent or consequent; thus confidence is not monotonic, requiring evaluation of all possible partitions of a frequent itemset when generating rules
Why might a model that performs extremely well on training data perform poorly on unseen data, and how can this issue be mitigated
Overfitting occurs when a model captures noise or peculiarities of the training data rather than general patterns; as a result, it fails to generalize; mitigation strategies include using cross validation, pruning decision trees, regularisation, and holding out a test set
How do decision trees handle non-linear decision boundaries, and why might a linear classifier struggle in such cases
Decision trees partition the feature space into axis-aligned regions using hierarchical splits, enabling them to approximate complex non-linear boundaries by combining multiple splits; linear classifiers such as logistic regression draw a single hyperplane and cannot easily represent intricate shapes without feature transformations
When splitting on a continuous attribute in a decision tree, why is dynamic discretisation necessary, and how is the best split point chosen
Continuous attributes have many possible split points; dynamic discretisation involves sorting values and evaluating candidate cut points using criteria like information gain or Gini impurity; the split that maximizes purity improvement is chosen
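A small sketch of this procedure using Gini impurity and midpoints between consecutive sorted values as candidate cut points; the data and names are invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Sort the attribute, try midpoints between distinct values, keep the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

values = [2.1, 3.5, 4.0, 5.2, 6.8, 7.1]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_split(values, labels))   # threshold 4.6 separates the two classes perfectly
```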
Compare entropy and Gini impurity as splitting criteria in decision trees; under what circumstances might one be preferred
Both measure node impurity; entropy originates from information theory and emphasises large differences in class probabilities; Gini is simpler to compute and often yields similar splits; some implementations prefer Gini for computational efficiency, though entropy can provide more discriminative splitting when class distributions are highly skewed
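A quick numerical comparison of the two impurity measures for a binary class distribution (a sketch, not tied to any particular dataset):

```python
import math

def entropy(p):
    """Entropy of a two-class node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    """Gini impurity of the same two-class node."""
    return 2 * p * (1 - p)

# Both are zero for pure nodes and maximal at p = 0.5;
# Gini avoids logarithms, which is why it is often cheaper to compute.
for p in (0.05, 0.25, 0.5):
    print(f"p={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
```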
Why does a split with high information gain indicate a good attribute for a decision tree, and what is the trade off when using many small leaves
High information gain means the split substantially reduces impurity, producing subsets with more homogeneous class labels; however, creating many small leaves can overfit training data and reduce generalization, so pruning or minimum leaf size constraints are used to balance complexity and accuracy
Explain how precision and recall differ, and why accuracy alone can be misleading for imbalanced classification tasks
Precision measures the proportion of predicted positives that are correct, whereas recall measures the proportion of actual positives detected; in imbalanced datasets where the negative class dominates, accuracy may be high even if the model misses most positives, so precision and recall provide a more informative assessment
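A toy confusion-matrix calculation (the counts are invented) showing how accuracy can look good while precision and recall are poor on an imbalanced problem:

```python
# 100 examples: 5 actual positives, 95 actual negatives.
tp, fp, fn, tn = 2, 3, 3, 92

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.94, dominated by the easy negatives
precision = tp / (tp + fp)                    # 0.40, correct fraction of predicted positives
recall    = tp / (tp + fn)                    # 0.40, fraction of actual positives found
print(accuracy, precision, recall)
```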
How does the independence assumption in Naive Bayes simplify computation of class probabilities, and why might it fail in real world datasets
The assumption allows the joint probability of features given a class to be computed as the product of individual feature probabilities, drastically reducing parameter estimation; in practice, features often exhibit dependencies, and ignoring them can lead to suboptimal models, although Naive Bayes can still perform well due to its simplicity and robust probability estimates
In the Naive Bayes classifier, what is the role of the prior probability, and how does the maximum a posteriori criterion determine the predicted class
The prior reflects how common each class is; the MAP classifier multiplies the prior by the likelihood of the features given the class and predicts the class with the highest posterior probability; this balances evidence from the data with prior expectations
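A minimal sketch of the MAP decision with two classes and two binary features; the priors and likelihood tables are invented for illustration:

```python
import math

priors = {"spam": 0.3, "ham": 0.7}
# P(feature = 1 | class), treated as independent given the class.
likelihoods = {
    "spam": {"has_link": 0.8, "has_greeting": 0.2},
    "ham":  {"has_link": 0.1, "has_greeting": 0.7},
}

def map_class(features):
    """Return the class maximising prior * product of feature likelihoods (in log space)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for f, value in features.items():
            p = likelihoods[c][f]
            score += math.log(p if value else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

print(map_class({"has_link": 1, "has_greeting": 0}))   # "spam" despite the smaller prior
```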
Compare decision trees and Naive Bayes in terms of interpretability and assumptions, and discuss when one might outperform the other
Decision trees provide explicit rules that are easy to interpret but can overfit unless pruned; Naive Bayes makes a strong independence assumption but is computationally efficient; Naive Bayes may perform well on high dimensional data such as text, while decision trees capture interactions better when features are dependent
Why is pruning important in decision tree learning, and what methods can be used to decide when to stop splitting
Without pruning, a decision tree can grow deep and fit noise; pruning reduces complexity by removing splits that do not improve generalization; methods include setting a maximum depth, minimum number of samples per leaf, cost-complexity pruning, or validating on a separate dataset
A classifier yields high recall but low precision; what practical consequence does this have, and how might you adjust the model
High recall but low precision means the model identifies most positives but also produces many false positives; in applications such as spam detection, many legitimate emails may be misclassified; adjusting classification threshold, changing cost functions, or using a different algorithm can improve precision
Why is it useful to use a confusion matrix rather than just the number of correct predictions when evaluating a multi-class classifier
The confusion matrix tabulates predicted versus actual classes, revealing which classes are confused; this helps diagnose systematic errors, imbalanced performance across classes, and guides model improvement beyond overall accuracy
How can feature selection methods influence decision tree structure, and why might embedded methods such as tree based models naturally perform feature selection
Feature selection removes irrelevant attributes before training, simplifying the tree and improving generalization; embedded methods like decision trees compute feature importance during training and inherently select informative features by splitting on them, reducing the need for separate selection
Why might a Naive Bayes classifier perform poorly on data generated from a clustering task, and how could one modify it
If the data’s clusters correspond to classes but features within a cluster are strongly correlated, the independence assumption fails and Naive Bayes may misestimate probabilities; using a model that accounts for feature dependencies, such as Bayesian networks or logistic regression with interaction terms, could improve performance
Why is linear regression unsuitable for binary classification, and how does logistic regression address this issue
Linear regression outputs unbounded real values, so mapping to class labels with a fixed threshold can lead to predictions outside the [0, 1] range and poorly calibrated probabilities; logistic regression applies the sigmoid function to map the linear combination of features into the (0, 1) interval, producing valid probabilities for binary classification
How does the sigmoid function shape the decision boundary in logistic regression, and what happens to the predicted probability as the linear predictor becomes large
The sigmoid function is S-shaped, approaching one as the linear predictor grows large and zero as it becomes highly negative; near zero it is approximately linear; thus, logistic regression yields a smooth probability transition across the decision boundary and saturates at extreme values
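A short numerical sketch of the sigmoid's shape:

```python
import math

def sigmoid(z):
    """Map the linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Roughly linear around z = 0, saturating towards 0 and 1 at the extremes.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```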
Why is cross-entropy loss preferred over mean squared error for training logistic regression models
Cross-entropy loss is derived from maximum likelihood estimation for Bernoulli outcomes and penalizes misclassified examples more appropriately; it ensures convexity and leads to larger gradients when predictions are far from true labels, whereas mean squared error may produce a non-convex cost and can lead to slow or unstable learning
Describe the update rule for gradient descent in logistic regression and explain the role of the learning rate
Parameters are updated by subtracting the product of the learning rate and the gradient of the loss with respect to the parameters; the learning rate controls step size: too high can cause divergence and oscillation, while too low results in slow convergence
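A compact sketch of batch gradient descent for logistic regression with cross-entropy loss; the toy data, learning rate, and epoch count are arbitrary choices:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on mean cross-entropy (bias folded into the weights)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)              # gradient of the loss w.r.t. w
        w -= lr * grad                              # step against the gradient, scaled by lr
    return w

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])                    # class 1 for larger x
print(train_logistic(X, y))
```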
How does L2 regularization mitigate overfitting in regression, and what is its effect on coefficients
L2 regularization adds a penalty proportional to the square of the coefficients to the loss function, discouraging large weights; it shrinks coefficients toward zero but does not drive them exactly to zero, reducing variance and preventing overfitting
Why does L1 regularization perform feature selection, and how does it differ from ridge regression
L1 regularization adds a penalty proportional to the absolute values of coefficients; this can lead to some coefficients becoming exactly zero, effectively selecting a subset of features; ridge shrinks coefficients but rarely zeroes them, so it does not perform feature selection
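A sketch contrasting the two penalties, assuming scikit-learn is available; the synthetic data and alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 2))   # noise coefficients typically driven to exactly zero
```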
Explain the bias variance trade off in the context of regularized regression models
Increasing regularization reduces model variance by simplifying the model, but increases bias by constraining flexibility; decreasing regularization yields lower bias but higher variance; optimal performance balances these to minimise expected prediction error
In logistic regression, what happens if you choose a threshold of 0.9 instead of 0.5 to decide class membership
A higher threshold requires stronger evidence to classify an instance as positive; this reduces false positives and increases precision but may increase false negatives and decrease recall; the choice depends on application needs
How can logistic regression be modified to account for differing costs of false positives and false negatives
The decision threshold can be adjusted to reflect cost ratios, or the loss function can be weighted to penalize mistakes on one class more than the other, effectively shifting the decision boundary
How is logistic regression extended to handle multiple classes, and what is the role of the softmax function
Multinomial logistic regression assigns each class its own parameter vector; the softmax function converts linear outputs into a probability distribution over classes, ensuring probabilities sum to one; the class with highest probability is chosen
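A minimal softmax sketch; the score vector is arbitrary:

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into a probability distribution."""
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])        # one linear score per class
probs = softmax(scores)
print(probs, probs.sum(), probs.argmax())  # probabilities sum to 1; class 0 is predicted
```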
Why might gradient descent converge to a local minimum in logistic regression, and how can this be mitigated
Standard logistic regression with the convex cross-entropy loss has a single global minimum, so gradient descent does not get trapped in local minima; non-convexity arises only in extensions such as learned feature transformations (for example neural networks) or non-convex penalties; in those settings, stochastic gradient descent with an appropriate learning rate and sensible initialization helps avoid poor local minima
How can text mining applications combine TF-IDF features with logistic regression, and why might regularization be crucial in this context
TF-IDF vectors are high dimensional and sparse; logistic regression can model probability of categories such as spam or sentiment; regularization prevents overfitting by shrinking coefficients of irrelevant terms and potentially selecting informative features, improving generalization
Compare logistic regression with Naive Bayes for text classification in terms of assumptions and performance
Naive Bayes assumes feature independence given the class and is fast and robust; logistic regression makes no independence assumption and can capture dependencies via weighting; logistic regression often yields higher accuracy but may require more data and regularization to avoid overfitting
Why do distance measures lose their discriminative power as the dimensionality of the data increases, and how does this affect nearest neighbour algorithms
In high dimensional spaces, most points become equidistant because the range of distances shrinks; the difference between the nearest and farthest neighbours diminishes, so nearest neighbour algorithms cannot reliably identify the closest points, leading to poor performance
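A quick simulation of this distance-concentration effect on uniform random data (dimensions and sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))            # 500 random points in d dimensions
    q = rng.uniform(size=d)                   # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative gap between farthest and nearest neighbour shrinks as d grows.
    print(d, round((dist.max() - dist.min()) / dist.min(), 3))
```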
What is the key difference between feature selection and dimensionality reduction, and how does each address high dimensional data
Feature selection retains a subset of the original variables deemed most informative, maintaining interpretability, whereas dimensionality reduction constructs new features as combinations of the original ones, often using techniques like principal components analysis; both reduce dimensionality but with different impacts on interpretability and model bias
Why are filter methods for feature selection considered model independent, and what is a potential disadvantage
Filter methods rank features based on statistical criteria such as correlation or information gain without referencing a specific learning algorithm, making them fast and general; however, they may ignore interactions among features and may select features that are individually good but redundant when combined
How do wrapper methods use the performance of a specific model to select features, and what are the trade offs compared to filter methods
Wrapper methods evaluate subsets of features by training and validating a chosen model on them, selecting the subset that yields the best performance; this captures interactions between features but is computationally intensive and risks overfitting due to repeated model training
Explain how embedded methods integrate feature selection into the model training process, and provide an example
Embedded methods incorporate feature selection within the learning algorithm’s objective function, such as decision trees selecting splits or lasso regularized regression imposing sparsity; they balance efficiency and interaction detection by selecting features while fitting the model
According to the comparative table of feature selection methods, why might embedded methods offer a good balance between capturing interactions and computational cost
Embedded methods leverage model training to evaluate feature importance, capturing interactions better than filter methods while avoiding exhaustive search typical of wrapper methods; they achieve moderate computational cost and often yield a sparse model
In a high dimensional gene expression dataset, why might using lasso regression be preferable to forward selection for feature selection
Lasso is an embedded method that imposes an L1 penalty, simultaneously performing regression and feature selection; it can handle thousands of correlated genes efficiently, whereas forward selection is a wrapper method that evaluates subsets incrementally and becomes computationally prohibitive
How does the curse of dimensionality influence the effectiveness of clustering algorithms, and how can dimensionality reduction help
In high dimensions, distances become less meaningful and clusters may be indistinguishable; dimensionality reduction techniques like principal components analysis can project data into lower dimensional spaces where structure is clearer, improving cluster separability and algorithm performance
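A sketch of this pipeline, assuming scikit-learn; the synthetic data (two groups separated in a few informative dimensions, padded with noise) is invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (100, 5)),    # group A
                         rng.normal(4, 1, (100, 5))])   # group B, shifted
noise = rng.normal(0, 1, (200, 95))                     # 95 uninformative dimensions
X = np.hstack([informative, noise])

X_reduced = PCA(n_components=2).fit_transform(X)        # project onto 2 principal components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:5], labels[-5:])                          # the two groups largely separate
```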
When using filter methods to select the top k features based on correlation with the target, why might important features still be omitted, and how can this be addressed
Highly correlated features might be redundant or may only have predictive power in combination with others; filter methods evaluate each feature individually and can miss interactions; combining filter methods with wrapper or embedded approaches, or using multivariate statistical tests, can capture such joint effects