Flashcards covering key concepts, definitions, and algorithms related to classification tasks in data mining, including evaluation metrics, cross-validation, K-NN, and Naïve Bayes.
Classification Task
The task of mapping an input attribute set (x) into its discrete class label (y).
Class Label
A discrete attribute that a classification model aims to predict.
Training Set
A collection of records with known class labels used to find or build a classification model.
Test Set
A set of previously unseen records used to determine the accuracy of a classification model, validating its performance.
Confusion Matrix
A table used to evaluate the performance of a classification model by summarizing correct and incorrect predictions for each class.
True Positive (TP)
Instances correctly predicted as positive (Class=Yes).
False Negative (FN)
Instances incorrectly predicted as negative when they are actually positive (Class=Yes, but predicted Class=No).
False Positive (FP)
Instances incorrectly predicted as positive when they are actually negative (Class=No, but predicted Class=Yes).
True Negative (TN)
Instances correctly predicted as negative (Class=No).
Accuracy
The proportion of correct predictions out of the total predictions, calculated as (TP+TN) / (TP+TN+FP+FN).
Error Rate
The proportion of wrong predictions out of the total predictions, calculated as (FP+FN) / (TP+TN+FP+FN).
Precision
The proportion of true positive predictions among all positive predictions, calculated as TP / (TP + FP).
Recall
The proportion of true positive predictions among all actual positive instances, calculated as TP / (TP + FN).
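The four confusion-matrix metrics above can be sketched with hypothetical counts (the numbers here are illustrative assumptions, not from the cards):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + TN + FP + FN

accuracy = (TP + TN) / total   # correct predictions over all predictions
error_rate = (FP + FN) / total  # wrong predictions over all predictions
precision = TP / (TP + FP)      # how many predicted positives are real
recall = TP / (TP + FN)         # how many real positives were found

print(accuracy, error_rate, precision, recall)
```

Note that accuracy and error rate always sum to 1, while precision and recall trade off against each other as the decision threshold moves.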
Training Error (Re-substitution error)
The error rate of a model when evaluated on the training data used to build it.
Generalization Error
The error rate of a model when evaluated on unseen testing data, indicating its ability to generalize to new, unseen records.
Holdout Method
A technique where the original dataset is split into a training set and a test set (e.g., 2/3 for training, 1/3 for testing) to evaluate model performance.
Cross Validation
A technique to evaluate model performance by partitioning data into k disjoint subsets, training on k-1 subsets, and testing on the remaining one, repeating k times.
k-fold Cross Validation
A specific type of cross-validation where data is partitioned into k disjoint subsets, and each subset is used as a test set once while the others form the training set.
Leave-one-out Cross Validation
A type of k-fold cross validation where k is equal to the number of data points, meaning each data point serves as the test set once.
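The k-fold partitioning described above can be sketched over index lists (a minimal illustration, not a library implementation):

```python
def k_fold_splits(n, k):
    """Partition indices 0..n-1 into k disjoint folds;
    yield (train_indices, test_indices) for each of the k rounds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

# Each index appears in exactly one test fold across the k rounds.
splits = list(k_fold_splits(10, 5))
```

Setting k equal to the number of records (here, `k_fold_splits(10, 10)`) gives leave-one-out cross-validation: every record is the test set exactly once.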
Instance-Based Classifiers
Classifiers that store the training records and use them directly to predict the class label of unseen cases, rather than building an explicit model.
K-Nearest Neighbor (K-NN)
An instance-based classification algorithm that classifies an unknown record based on the majority class of its 'k' closest training records.
Lazy Learner
A classification system (like K-NN) that does not build a model explicitly during training but delays generalization until a classification query is made, making classification relatively expensive.
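K-NN's lazy, majority-vote behavior can be sketched as follows (the 2-D toy points and labels are assumed for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x1, x2), label) pairs.
    Return the majority class among the k records closest to query."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "No"), ((1, 0), "No"),
         ((5, 5), "Yes"), ((6, 5), "Yes"), ((5, 6), "Yes")]
```

No model is built up front: all the work (distance computation and voting) happens at query time, which is exactly what makes classification relatively expensive for a lazy learner.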
Bayes Classifier
A probabilistic framework for solving classification problems based on conditional probability and Bayes' Theorem.
Conditional Probability
The probability of an event A occurring given that another event B has already occurred, denoted P(A|B).
Bayes Theorem
A statistical formula that calculates conditional probability: P(C|A) = (P(A|C) * P(C)) / P(A).
Naïve Bayes Classifier
A Bayesian classifier that assumes independence among attributes given the class, simplifying the calculation of P(A1, A2, …, An | C) as a product of individual conditional probabilities P(Ai | C).
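The conditional-independence assumption lets the classifier score each class as P(C) times a product of per-attribute conditionals, as in this unsmoothed sketch over categorical attributes (the toy weather data is an assumption for illustration):

```python
from collections import Counter

def naive_bayes_predict(records, query):
    """records: list of (attrs_tuple, label).
    Return the class maximizing P(C) * prod_i P(Ai | C)."""
    class_counts = Counter(label for _, label in records)
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(records)          # prior P(C)
        for i, value in enumerate(query):
            match = sum(1 for attrs, label in records
                        if label == c and attrs[i] == value)
            score *= match / n_c            # P(Ai | C), unsmoothed
        if score > best_score:
            best, best_score = c, score
    return best

data = [(("sunny", "hot"), "No"), (("sunny", "mild"), "No"),
        (("rain", "mild"), "Yes"), (("rain", "cool"), "Yes")]
```

Because the conditionals are unsmoothed, a single attribute value never seen with a class drives that class's whole product to zero, which motivates the m-estimate below.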
m-estimate
A smoothing technique used in Naïve Bayes to avoid zero conditional probability estimates P(Ai | C) when an attribute value never occurs with a given class in the training data.