Vocabulary flashcards covering key terms from the lecture notes on data mining, CRISP-DM, similarity measures, and related concepts.
Data mining
The process of applying analytical techniques to extract insights from data, uncover hidden patterns, and support decisions; a building block of machine learning and AI.
Artificial intelligence
Computer systems that exhibit human-like intelligence and cognitive abilities, including deduction, pattern recognition, and complex data interpretation.
Machine learning
Techniques that enable computers to learn automatically from data using self-learning algorithms, improving performance over time and revealing hidden patterns.
CRISP-DM
Cross-Industry Standard Process for Data Mining; a six-phase methodology: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment; emphasizes business goals first.
SEMMA
A data mining methodology: Sample, Explore, Modify, Model, Assess.
KDD
Knowledge Discovery in Databases; the overall process of extracting useful knowledge from large data sets, with data mining as its central step.
Supervised learning
Learning where the target variable is known; used to build predictive models; examples include regression and classification.
Unsupervised learning
Learning without a target variable; used for exploration, dimension reduction, and pattern recognition.
Classification
A supervised learning task where the target is categorical; assigns new cases to classes.
Regression
A supervised learning task where the target is numerical; predicts continuous values using a model trained on cases with known outcomes.
Dimension reduction
Reducing high-dimensional data to fewer dimensions while preserving important information; helps reduce redundancy and improve model stability.
Pattern recognition
Identifying recurring sequences, frequent itemsets, or recognizable features in data.
Similarity measures
Quantitative methods to assess how similar or dissimilar observations are, typically based on pairwise distances.
Euclidean distance
The straight-line distance between two points; widely used; sensitive to outliers.
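As a minimal sketch of this definition, assuming points are given as equal-length lists of numbers (the function name `euclidean_distance` is illustrative, not from the lecture notes):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square each per-dimension difference, sum them, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Because the differences are squared, a single large difference in one dimension dominates the result, which is why this distance is sensitive to outliers.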
Manhattan distance
The sum of absolute differences across dimensions; often called taxicab distance; less sensitive to outliers than Euclidean.
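A minimal sketch of this definition, assuming points are equal-length lists of numbers (the function name `manhattan_distance` is illustrative):

```python
def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Differences are not squared, so one extreme value has less influence
    # than it would under Euclidean distance.
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance([0, 0], [3, 4]))  # 7
```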
Standardization (z-score)
Transforming data to z-scores by subtracting the mean and dividing by the standard deviation; makes variables unit-free.
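A minimal sketch of the transformation, assuming the sample standard deviation is intended (the flashcard does not say whether the sample or population version is used); the function name `z_scores` is illustrative:

```python
import statistics

def z_scores(values):
    """Convert a list of numbers to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mean) / sd for x in values]

print(z_scores([2, 4, 6]))  # [-1.0, 0.0, 1.0]
```

Swapping in `statistics.pstdev` would give the population version instead.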
Min-max normalization
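Rescaling values to the 0–1 range by subtracting the minimum and dividing by the range (maximum minus minimum); preserves the relative ordering and spacing of values but eliminates units.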
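A minimal sketch using the usual formula (value - min) / (max - min); the function name `min_max_normalize` is illustrative:

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```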
Binary variable
A categorical variable with only two possible values (e.g., yes/no).
Matching coefficient
A similarity measure for categorical data: the proportion of variables on which two observations match; higher values indicate more similarity; equals 1 for a perfect match; does not differentiate between positive and negative outcomes.
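A minimal sketch, assuming each observation is an equal-length list of category values and the coefficient is computed as matches divided by the number of variables (the function name `matching_coefficient` is illustrative):

```python
def matching_coefficient(a, b):
    """Share of positions at which two observations have the same value."""
    if len(a) != len(b) or not a:
        raise ValueError("observations must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Positions 0, 1, and 3 match (including the shared 0 at position 1).
print(matching_coefficient([1, 0, 0, 1], [1, 0, 1, 1]))  # 0.75
```

Note that the shared 0 counts as a match just like the shared 1s do.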
Jaccard coefficient
A similarity measure for binary/categorical data that ignores joint negatives (cases where both observations are 0) and focuses on shared positives.
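A minimal sketch for binary (0/1) data, assuming the common form: shared positives divided by the number of positions where at least one observation is positive. The function name `jaccard_coefficient` and the choice to raise an error when neither observation has any positives are assumptions made for illustration:

```python
def jaccard_coefficient(a, b):
    """Shared positives divided by positions where at least one value is 1."""
    if len(a) != len(b):
        raise ValueError("observations must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        # Assumption: treat the all-negative case as undefined.
        raise ValueError("no positives in either observation")
    return both / either

# Two shared positives (positions 0 and 3) out of three positions with any
# positive (0, 2, 3), so the result is 2/3. The shared 0 at position 1 is ignored.
print(jaccard_coefficient([1, 0, 0, 1], [1, 0, 1, 1]))
```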
Business understanding (CRISP-DM phase)
CRISP-DM phase focusing on clarifying objectives, context, schedule, and deliverables.
Data understanding (CRISP-DM phase)
CRISP-DM phase involving collection and exploration of raw data, initial insights, and hypotheses.
Data preparation (CRISP-DM phase)
CRISP-DM phase consisting of record/variable selection, cleaning, wrangling, and transformation.
Modeling (CRISP-DM phase)
CRISP-DM phase where modeling techniques are selected and applied, data is transformed as needed, and cross-validation is documented.
Evaluation (CRISP-DM phase)
CRISP-DM phase to assess model performance, compare alternatives, interpret results, and develop recommendations.
Deployment (CRISP-DM phase)
CRISP-DM phase to translate insights into actionable deliverables and establish a plan for deployment, monitoring, and feedback.