Chapter 11: Data Mining Vocabulary

Description and Tags

Vocabulary flashcards covering key terms from the lecture notes on data mining, CRISP-DM, similarity measures, and related concepts.

26 Terms

Data mining

The process of applying analytical techniques to extract insights from data, uncover hidden patterns, and support decisions; a building block of machine learning and AI.

Artificial intelligence

Computer systems that exhibit human-like intelligence and cognitive abilities, including deduction, pattern recognition, and complex data interpretation.

Machine learning

Techniques that enable computers to learn automatically from data using self-learning algorithms, improving performance over time and revealing hidden patterns.

CRISP-DM

Cross-Industry Standard Process for Data Mining; a six-phase methodology: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment; emphasizes business goals first.

SEMMA

A data mining methodology: Sample, Explore, Modify, Model, Assess.

KDD

Knowledge Discovery in Databases; a data mining approach focused on extracting knowledge from large data sets.

Supervised learning

Learning where the target variable is known; used to build predictive models; examples include regression and classification.

Unsupervised learning

Learning without a target variable; used for exploration, dimension reduction, and pattern recognition.

Classification

A supervised learning task where the target is categorical; assigns new cases to classes.

Regression

A supervised learning task where the target is numerical; predicts continuous values; model trained with known outcomes.

Dimension reduction

Reducing high-dimensional data to fewer dimensions while preserving important information; helps reduce redundancy and improve model stability.

Pattern recognition

Identifying recurring sequences, frequent itemsets, or recognizable features in data.

Similarity measures

Quantitative methods to assess how similar or dissimilar observations are, typically based on pairwise distances.

Euclidean distance

The straight-line distance between two points; widely used; sensitive to outliers.
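
As a minimal sketch of the straight-line distance described above (the function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the lecture notes), the distance is the square root of the summed squared differences across dimensions:

```python
import math


def euclidean_distance(a, b):
    """Return the straight-line distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square each per-dimension difference, sum them, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative points only: a 3-4-5 right triangle.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Because each difference is squared before summing, one unusually large gap in a single dimension dominates the result, which is the sense in which the measure is sensitive to outliers.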

Manhattan distance

The sum of absolute differences across dimensions; often called taxicab distance; less sensitive to outliers than Euclidean.
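
A small sketch of the sum of absolute differences, assuming observations are stored as equal-length lists of numbers (the name `manhattan_distance` and the sample values are hypothetical):

```python
def manhattan_distance(a, b):
    """Return the sum of absolute per-dimension differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Each dimension contributes its absolute difference, with no squaring.
    return sum(abs(x - y) for x, y in zip(a, b))


# Illustrative points only.
print(manhattan_distance([0, 0], [3, 4]))  # 7
```

Since differences are added without being squared, a single large gap contributes only in proportion to its size.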

Standardization (z-score)

Transforming data to z-scores by subtracting the mean and dividing by the standard deviation; makes variables unit-free.
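
The transformation can be sketched as follows. This version assumes the sample standard deviation (dividing by n − 1), which the notes do not specify; the function name `standardize` and the data are illustrative.

```python
import statistics


def standardize(values):
    """Convert a list of numbers to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mean) / sd for v in values]


# Illustrative data: mean is 4 and sample standard deviation is 2.
print(standardize([2, 4, 6]))  # [-1.0, 0.0, 1.0]
```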

Min-max normalization

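Rescaling values to the 0–1 range by subtracting the minimum and dividing by the range (maximum minus minimum); preserves the relative ordering and spacing of values but eliminates units.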
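
A hedged sketch of the rescaling, using the usual (value − min) / (max − min) form; the function name `min_max_normalize` and the sample values are assumptions made for illustration:

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    # Subtract the minimum, then divide by the range.
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative data only.
print(min_max_normalize([10, 15, 20]))  # [0.0, 0.5, 1.0]
```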

Binary variable

A categorical variable with only two possible values (e.g., yes/no).

Matching coefficient

A similarity measure for categorical (typically binary) data: the proportion of variables on which two observations have the same value; higher values indicate more similarity; equals 1 for a perfect match; counts shared negatives (0–0) the same as shared positives (1–1).
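
One way to sketch this measure, assuming each observation is a list of 0/1 values of equal length (the name `matching_coefficient` and the two sample observations are hypothetical):

```python
def matching_coefficient(a, b):
    """Return the share of variables on which two observations have the same value."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("observations must be non-empty and of equal length")
    # A match is any position where the values agree, whether 1-1 or 0-0.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Illustrative observations: they agree on positions 0, 1, and 4 (3 of 5).
first = [1, 0, 0, 1, 0]
second = [1, 0, 1, 0, 0]
print(matching_coefficient(first, second))  # 0.6
```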

Jaccard coefficient

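A similarity measure for binary/categorical data that ignores shared negatives (0–0 matches): the number of shared positives divided by the number of variables on which at least one observation is positive.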
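
A minimal sketch for 0/1 data, where 1 is treated as the positive outcome (the name `jaccard_coefficient` and the sample observations are illustrative assumptions):

```python
def jaccard_coefficient(a, b):
    """Return shared positives divided by positions where either observation is positive."""
    if len(a) != len(b):
        raise ValueError("observations must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        # Assumption: treat the all-negative case as undefined rather than 0 or 1.
        raise ValueError("no positives in either observation; coefficient is undefined")
    return both / either


# Illustrative observations: 1 shared positive, 3 positions with any positive.
first = [1, 0, 0, 1, 0]
second = [1, 0, 1, 0, 0]
print(jaccard_coefficient(first, second))
```

Positions where both observations are 0 never enter the numerator or the denominator, so adding more shared negatives leaves the result unchanged.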

Business understanding (CRISP-DM phase)

CRISP-DM phase focusing on clarifying objectives, context, schedule, and deliverables.

Data understanding (CRISP-DM phase)

CRISP-DM phase involving collection and exploration of raw data, initial insights, and hypotheses.

Data preparation (CRISP-DM phase)

CRISP-DM phase consisting of record/variable selection, cleaning, wrangling, and transformation.