Flashcards created based on the lecture notes covering key concepts of unsupervised learning, clustering, outlier detection, and frequent pattern mining, including algorithms and applications.
What is the main difference between supervised and unsupervised learning?
In supervised learning, the training data is labeled with class information, whereas in unsupervised learning, class labels are unknown and the goal is to find structures or patterns in the data.
What is clustering in data mining?
Clustering is the process of grouping a set of data objects into clusters, where objects in the same cluster are similar to each other and dissimilar to objects in other clusters.
What are some common applications of clustering?
Clustering is used in preprocessing, image databases, pattern recognition, spatial data analysis, business intelligence, and biology, among others.
What is the k-means clustering algorithm?
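The k-means algorithm partitions data into k clusters, each represented by its centroid (mean), by minimizing the within-cluster variance, that is, the sum of squared distances from each point to the centroid of its cluster.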
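One common way to carry out this partitioning is Lloyd's iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a minimal pure-Python illustration, not taken from the lecture notes; the function name `kmeans`, the parameters `iters` and `seed`, and the sample points are assumptions made for the example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's iteration on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two returned clusters each hold three points, one group near (1, 1.5) and one near (8.5, 8.5).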
What is the Silhouette Coefficient?
The Silhouette Coefficient measures how similar an object is to its own cluster compared to other clusters, providing an indication of the quality of clustering.
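For a single object, the coefficient is usually computed as s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The following is a small sketch under that definition; the function name `silhouette`, the sample points, and the two labelings are hypothetical, and it assumes every cluster has at least two points.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes every cluster has at least two points."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each cluster.
        mean_dist = {}
        for label in set(labels):
            others = [q for j, q in enumerate(points) if labels[j] == label and j != i]
            mean_dist[label] = sum(math.dist(p, q) for q in others) / len(others)
        a = mean_dist[labels[i]]  # cohesion: own cluster
        b = min(d for lab, d in mean_dist.items() if lab != labels[i])  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical sample data with a sensible and a poor labeling.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 1, 0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, so `good` comes out much higher than `bad` here.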
Define DBSCAN in the context of density-based clustering.
DBSCAN is a density-based clustering algorithm that groups points lying in dense regions: a point with enough neighbors within a given radius is a core point, points reachable from core points join the same cluster, and points that lie alone in low-density regions are marked as noise.
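A compact sketch of the idea follows. It is a simplified illustration, not a tuned implementation: the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point, counting the point itself) follow common usage but do not appear in the notes, and the sample data is made up.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these settings the first three points form one cluster, the next three form another, and the isolated point is labeled -1 (noise). Unlike k-means, the number of clusters is not supplied in advance.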
What is a key strength of the k-means clustering algorithm?
K-means is relatively efficient and easy to implement.
What is the challenge of choosing the parameter k in k-means clustering?
Choosing the correct number of clusters k is crucial as it directly affects the results; too few clusters may lump distinct groups together, while too many may overfit the data.
What is the role of the Expectation-Maximization (EM) algorithm in clustering?
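The EM algorithm fits probabilistic cluster models (for example, Gaussian mixtures) by finding maximum likelihood estimates of their parameters. It alternates between an expectation step, which estimates each object's probability of belonging to each cluster under the current parameters, and a maximization step, which re-estimates the parameters to maximize the likelihood given those probabilities.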
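To make the two alternating steps concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. The choice of a Gaussian mixture, the function name `em_gmm_1d`, the initial means, the fixed iteration count, and the small variance floor are all assumptions made for illustration.

```python
import math

def em_gmm_1d(xs, mu, iters=50):
    """EM for a 1-D Gaussian mixture; mu holds the initial component means."""
    k = len(mu)
    mu = list(mu)
    var = [1.0] * k           # assumption: unit initial variances
    weight = [1.0 / k] * k    # assumption: equal initial mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                weight[c] * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
                / math.sqrt(2 * math.pi * var[c])
                for c in range(k)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weight[c] = n_c / len(xs)
            mu[c] = sum(r[c] * x for r, x in zip(resp, xs)) / n_c
            sq = sum(r[c] * (x - mu[c]) ** 2 for r, x in zip(resp, xs)) / n_c
            var[c] = max(sq, 1e-6)  # floor keeps the variance from collapsing to zero
    return weight, mu, var

# Hypothetical data drawn around two centers.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weight, mu, var = em_gmm_1d(xs, mu=[0.0, 6.0])
```

Unlike k-means, each point receives a soft (probabilistic) assignment to every component; on this data the estimated means settle near the two group averages with roughly equal weights.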
What does the term 'outlier' refer to in the context of data mining?
An outlier is an observation that deviates significantly from the majority of the data points, which may indicate abnormal behavior.
What are three types of outliers identified in data mining?
Clustering-based outliers, statistical outliers, and density-based outliers.
What is frequent pattern mining?
Frequent pattern mining involves discovering patterns, associations, and correlations among sets of items or objects in transaction databases.
What is the Apriori algorithm?
The Apriori algorithm is a classic algorithm for mining frequent itemsets (and, in adapted forms, sequential patterns). It relies on the property that every subset of a frequent itemset must also be frequent, so any candidate with an infrequent subset can be pruned without counting it.
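The level-wise search can be sketched as follows: count candidates of size k, keep the frequent ones, join them into candidates of size k + 1, and prune any candidate that has an infrequent subset. This is a simplified illustration; the function name `apriori`, the `min_support` threshold, and the market-basket transactions are made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        # Proportion of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates that meet the support threshold.
        current = {c: support(c) for c in level}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent subset of size k-1.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical transaction database.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]
frequent = apriori(transactions, min_support=0.6)
```

Here {bread, diapers, beer} is never counted at all, because its subset {bread, beer} is already known to be infrequent; that pruning is what the Apriori property buys.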
Define the term 'support' in the context of frequent itemset mining.
Support is the frequency with which an itemset appears in the database, calculated as the proportion of transactions containing the itemset.
What are the main limitations of the k-means algorithm?
K-means is sensitive to outliers and requires the number of clusters k to be specified in advance.
Describe the concept of hierarchical clustering.
Hierarchical clustering builds a tree of clusters (dendrogram) by either agglomerating smaller clusters into larger ones or dividing larger clusters into smaller ones.
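The agglomerative (bottom-up) direction can be sketched in a few lines: start with every object in its own cluster and repeatedly merge the two closest clusters. The example below assumes single linkage (distance between the closest pair of members) and stops at a requested number of clusters instead of building the full dendrogram; the function name `agglomerative` and the sample points are hypothetical.

```python
import math

def agglomerative(points, num_clusters):
    """Single-linkage agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge them.
        pairs = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(pairs)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters

# Hypothetical sample data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters = agglomerative(points, num_clusters=2)
```

Recording the sequence of merges and their distances, rather than stopping early, would give the dendrogram; cutting it at different heights yields different numbers of clusters.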
What is association rule mining?
Association rule mining aims to identify interesting relationships between variables in large datasets, typically revealed through rules of the form 'X implies Y'.
What does the term 'confidence' refer to in association rule mining?
Confidence is the measure of the likelihood of item Y being bought when item X is bought, calculated as the proportion of transactions containing X that also contain Y.
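As a worked example, confidence for a rule X implies Y can be computed as support(X and Y together) divided by support(X). The helper names `support` and `confidence` and the transactions below are illustrative assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X implies Y: support(X union Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
conf = confidence({"diapers"}, {"beer"}, transactions)
```

Four transactions contain diapers and three of those also contain beer, so the confidence of "diapers implies beer" is 3/4 = 0.75 (up to floating-point rounding), while the support of the pair is 3/5.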