data 1 data mining

What is a Dataset?

A collection of objects and their attributes used for analysis

What is an Attribute in data mining?

A property or characteristic of an object. Also known as variable, field, characteristic, dimension, or feature

What is an Object in data mining?

A collection of attributes. Also known as record, point, case, sample, entity, or instance

What are the 5 important characteristics of datasets?

Size, Dimensionality, Sparsity, Distribution, Resolution

Why is Size an important dataset characteristic?

The type of analysis often depends on the size of the data

Why is Dimensionality an important dataset characteristic?

High-dimensional data presents unique challenges in analysis

Why is Sparsity an important dataset characteristic?

Sparse data is mostly zeros, so only the non-zero (present) values carry information; presence matters more than absence

What are the 4 main types of datasets?

Record Data, Graphs and Networks, Ordered (Sequence) Data, Spatial Data

What is Record Data?

Records with fixed attributes, including relational records, data matrix, and transaction data

Give 3 examples of Graphs and Networks datasets

Transportation network, Social or information networks, Molecular Structures

Give 3 examples of Ordered (Sequence) Data

Video (sequence of images), Genetic Sequence Data, Temporal sequence

Give 2 examples of Spatial Data

RGB Images, Satellite images

What are the 4 types of attributes?

Nominal, Ordinal, Interval, Ratio

What is a Nominal attribute?

Unordered categories (e.g., gender, eye color, types of fruit like apple, orange)

What is an Ordinal attribute?

Ordered categories (e.g., grades A/B/C, height tall/medium/short, swimming level beginner to advanced)

What is an Interval attribute?

Numerical with equal intervals but no true zero (e.g., calendar dates, temperatures in Celsius or Fahrenheit)

What is a Ratio attribute?

Numerical with equal intervals and a true zero (e.g., temperature in Kelvin, length, counts, elapsed time)

What operations can be performed on Nominal attributes?

Distinctness only (=, ≠)

What operations can be performed on Ordinal attributes?

Distinctness (=, ≠) and Order (<, >)

What operations can be performed on Interval attributes?

Distinctness (=, ≠), Order (<, >), and Differences (+, −)

What operations can be performed on Ratio attributes?

Distinctness (=, ≠), Order (<, >), Differences (+, −), and Ratios (×, ÷)

What is a Discrete Attribute?

An attribute that takes values from a finite or countable set (e.g., gender, eye color, swimming level). Typically represented as integers

What is a Continuous Attribute?

An attribute that takes values within a continuous range (e.g., height, length, temperature). Typically represented as floating-point variables

What are Binary attributes?

A special case of discrete attributes with only two possible values

What are Asymmetric Attributes?

Attributes where only the presence (non-zero value) matters, not the absence

Give 2 examples of asymmetric attributes

Words present in documents, Items present in customer transactions

Why do we focus on presence in asymmetric attributes?

In real scenarios (e.g., grocery shopping), we would not call two customers' purchases similar just because neither of them bought most of the products on offer. We focus on what was bought

What is a Similarity Measure?

Quantifies how alike two data objects are. Higher values indicate greater similarity. Typically within the range [0,1]

What is a Dissimilarity Measure?

Also called Distance Measure. Quantifies how different two data objects are. Lower values indicate greater similarity. The minimum is often 0, while the upper limit varies

What is Proximity in data mining?

Refers to either similarity or dissimilarity

What are the 2 properties of Similarity?

Identity: s(x,y) = 1 only if x = y. Symmetry: s(x,y) = s(y,x) for all x and y

What are the 3 properties of Distance?

Non-Negativity: d(x,y) ≥ 0, equals 0 only if x = y. Symmetry: d(x,y) = d(y,x). Triangle Inequality: d(x,z) ≤ d(x,y) + d(y,z)
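
The three properties can be checked numerically on a small set of points. The sketch below is illustrative only: the helper name `is_metric_on`, the sample points, and the tolerance are assumptions, and passing on a finite sample does not prove that a function is a metric in general.

```python
import math
from itertools import product

def is_metric_on(points, d, tol=1e-9):
    """Check non-negativity, symmetry and the triangle inequality on `points`."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # zero distance only for identical objects
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

def squared_euclidean(x, y):
    """Squared Euclidean distance: a dissimilarity that is not a true distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0, 0), (1, 0), (2, 0), (0, 3)]
euclidean_ok = is_metric_on(points, math.dist)
squared_ok = is_metric_on(points, squared_euclidean)
```

Euclidean distance passes all three checks on these points, while squared Euclidean distance fails the triangle inequality: from (0,0) to (2,0) it gives 4, which exceeds 1 + 1 via (1,0).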

What is a Distance Matrix?

Distances between all data objects, useful for clustering and nearest neighbor algorithms, symmetric with values reflecting dissimilarities
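
As a minimal sketch, a distance matrix can be built by applying a distance function to every pair of objects. Euclidean distance (via Python's `math.dist`) and the three sample points are assumptions chosen for illustration.

```python
import math

def distance_matrix(points, d=math.dist):
    """Return an n x n nested list where entry [i][j] is d(points[i], points[j])."""
    n = len(points)
    return [[d(points[i], points[j]) for j in range(n)] for i in range(n)]

# Hypothetical 2-D objects
points = [(0, 0), (3, 4), (6, 8)]
D = distance_matrix(points)
```

The diagonal is zero (each object is at distance 0 from itself) and the matrix is symmetric, so in practice only the upper or lower triangle needs to be stored.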

What is a Similarity Matrix?

Similarities between all data objects, useful for clustering and recommendation systems, often symmetric with higher values indicating stronger similarities

What are 4 measures for numerical vectors?

Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear correlation

What are 2 measures for binary vectors?

Simple Matching Coefficient (SMC), Jaccard Coefficient

What is the Euclidean Distance formula?

d(x,y) = √(Σ(xk - yk)²), summing over k = 1 to n, where n is the number of attributes and xk, yk are the k-th attribute values of x and y

When is standardization necessary for Euclidean Distance?

When scales of attributes differ
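
The effect of differing scales can be seen in a small sketch. The data (age in years, income in dollars) is hypothetical, and z-score standardization with the population standard deviation is one common choice, assumed here for illustration.

```python
import math
import statistics

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def standardize(rows):
    """z-score each column: (value - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical objects: (age in years, income in dollars)
people = [(25, 30000), (45, 32000), (30, 90000)]
scaled = standardize(people)

# Raw distances are dominated by income because its scale is much larger.
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# After standardization, age and income contribute on comparable scales.
z_01 = euclidean(scaled[0], scaled[1])
z_02 = euclidean(scaled[0], scaled[2])
```

On the raw values, the first person is far closer to the second than to the third, purely because of income. After standardization the 20-year age gap counts as much as the income gap, and the ordering of the two distances flips.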

What is the Minkowski Distance formula?

d(x,y) = (Σ|xk - yk|^r)^(1/r), summing over k = 1 to n, where n is the number of attributes and r ≥ 1 is a parameter

What is Minkowski Distance?

A generalization of Euclidean Distance where the hyperparameter r allows adaptation to data characteristics

What is Manhattan distance (L1 norm)?

Minkowski Distance with r = 1, ideal for measuring distances in grid-like paths (e.g., city blocks)

What is Euclidean distance (L2 norm)?

Minkowski Distance with r = 2, the most commonly used distance metric for straight-line distance in Euclidean space

What is Chebyshev distance (Lmax norm)?

Minkowski Distance with r → ∞; the maximum absolute difference over all components of the two vectors (e.g., the number of king moves between two squares in chess)

What is Hamming distance?

A special case of Manhattan distance for binary vectors that counts differing bits
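
The special cases above can be sketched with one function. The names `minkowski` and `hamming` and the sample vectors are illustrative assumptions; the r → ∞ case is handled explicitly as a maximum rather than by taking a limit.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance; r=1 Manhattan, r=2 Euclidean, r=math.inf Chebyshev."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if r == math.inf:
        return max(diffs)                    # limit of the formula as r grows
    return sum(d ** r for d in diffs) ** (1 / r)

def hamming(x, y):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(1 for xk, yk in zip(x, y) if xk != yk)

x, y = (1, 2), (4, 6)                        # component differences: 3 and 4
manhattan = minkowski(x, y, 1)               # 3 + 4
euclid = minkowski(x, y, 2)                  # sqrt(9 + 16)
chebyshev = minkowski(x, y, math.inf)        # max(3, 4)

bits_a, bits_b = (1, 0, 1, 1), (1, 1, 0, 1)
differing = hamming(bits_a, bits_b)          # positions 1 and 2 differ
```

For binary vectors, `hamming` and `minkowski(..., 1)` give the same number, which is why Hamming distance is described as a special case of Manhattan distance.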

What is the Cosine Similarity formula?

cos(x,y) = (x · y) / (‖x‖ ‖y‖), where x · y is the dot product and ‖x‖, ‖y‖ are the Euclidean lengths of x and y

What does Cosine Similarity measure?

The cosine of the angle between two vectors. It is insensitive to vector magnitude and depends only on orientation

What is the range of Cosine Similarity values?

Between -1 and 1: -1 (opposite directions, completely dissimilar), 0 (orthogonal, no similarity), 1 (same direction, perfectly similar)
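
A minimal sketch of cosine similarity, computed as the dot product divided by the product of the two vector lengths. The function name and sample vectors are illustrative; the function assumes neither vector is all zeros, since the measure is undefined in that case.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk * xk for xk in x))
    norm_y = math.sqrt(sum(yk * yk for yk in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity((1, 2, 3), (2, 4, 6))   # y is x scaled by 2
orthogonal = cosine_similarity((1, 0), (0, 1))
opposite = cosine_similarity((1, 0), (-1, 0))
```

Scaling a vector does not change the result, which is the sense in which the measure ignores magnitude: `(1, 2, 3)` and `(2, 4, 6)` score approximately 1 even though their lengths differ.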

What does Linear correlation measure?

The linear relationship between two variables, evaluating how well one variable predicts another

What is the range of Linear correlation values?

Between -1 and 1: 1 (perfect positive correlation), 0 (no linear relationship), -1 (perfect negative correlation)
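
The cards do not name a specific coefficient; the sketch below assumes Pearson's correlation coefficient, the usual choice for linear correlation. The study-hours data is hypothetical, and the function assumes neither variable is constant.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical data: study hours and exam marks for five students
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 65, 70, 80]
r_study = pearson(hours, marks)

perfect_positive = pearson([1, 2, 3], [2, 4, 6])
perfect_negative = pearson([1, 2, 3], [6, 4, 2])
```

Unlike cosine similarity, the values are centered on their means first, so adding a constant to either variable leaves the correlation unchanged.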

What is Simple Matching Coefficient (SMC)?

Number of matches divided by total number of attributes, designed for symmetric binary attributes

What is the SMC formula?

SMC = (f11 + f00) / (f01 + f10 + f00 + f11)

What does f01 represent in SMC?

The number of attributes where x was 0 and y was 1

What does f11 represent in SMC?

The number of attributes where x was 1 and y was 1

What is the Jaccard Coefficient?

The number of attributes where both objects are 1 divided by the number of attributes where at least one object is 1; designed for asymmetric binary attributes

What is the Jaccard Coefficient formula?

J = f11 / (f01 + f10 + f11)

Why doesn't Jaccard include f00?

Because it's designed for asymmetric attributes where absence (0-0 matches) doesn't indicate similarity
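
A small sketch contrasting the two coefficients on the same pair of binary vectors. The function names and the vectors are illustrative, and returning 0.0 when neither vector contains a 1 is an assumption made here to avoid division by zero.

```python
def match_counts(x, y):
    """Count the four match types: returns (f11, f10, f01, f00)."""
    f11 = f10 = f01 = f00 = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            f11 += 1
        elif a == 1 and b == 0:
            f10 += 1
        elif a == 0 and b == 1:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

def smc(x, y):
    """Simple Matching Coefficient: all matches over all attributes."""
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f01 + f10 + f00 + f11)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over attributes with at least one 1."""
    f11, f10, f01, _ = match_counts(x, y)
    denominator = f01 + f10 + f11
    return f11 / denominator if denominator else 0.0  # assumption: 0.0 if no 1s

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_xy = smc(x, y)          # 7 of 10 positions match, all of them 0-0
jaccard_xy = jaccard(x, y)  # no position has a 1 in both vectors
```

The two vectors share no 1s, yet SMC reports 0.7 because of the seven 0-0 matches, while Jaccard reports 0. This is the reason Jaccard is preferred for sparse, asymmetric data such as word presence.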

When comparing documents using word presence, which measure should you use?

Jaccard Coefficient (similarity based on sharing common words)

When comparing geographical locations of cities, which measure should you use?

Euclidean Distance (similarity based on closeness by distance)

When comparing time series of temperature patterns, which measure should you use?

Cosine Similarity (similarity based on pattern variation over time)

When measuring relationship between study hours and exam marks, which measure should you use?

Linear Correlation (indicates strength of relationship between variables)

For Nominal attributes with one object, which similarity/distance measures apply?

SMC, Jaccard Coefficient

For Ordinal attributes with one object, which similarity/distance measures apply?

Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation

For Interval attributes with one object, which similarity/distance measures apply?

Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation

For Ratio attributes with one object, which similarity/distance measures apply?

Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation

What are the 4 major tasks of Data Preprocessing?

Data integration, Data reduction, Data cleaning, Data transformation

What is Data cleaning?

Handling duplicates and missing values, identifying/removing outliers, smoothing noisy data

What is Data transformation?

Converting data into a format suitable for analysis (sampling, encoding, discretization, normalization)

Why is poor data quality a problem?

It can negatively impact modeling efforts, leading to incorrect decisions (e.g., denying loans to credit-worthy candidates or approving non-creditworthy ones)

What are 4 examples of data quality problems?

Duplicate data, Missing values, Outliers, Noise

What is Duplicate Data?

Occurrence of identical or nearly identical data objects, common when merging data from diverse sources

How do you handle duplicate data?

Remove duplicate data objects, or in some scenarios keep them (e.g., customers with multiple accounts accumulating points separately)

What are 2 reasons for missing values?

Information is not collected (e.g., people decline to give age/weight), or attributes may not be applicable to all cases (e.g., annual income not applicable to children)

When should you DELETE RECORDS with missing values?

When there is enough data and few missing values

When should you DELETE COLUMN with missing values?

When missing values are ≥ 60% of the observations

When should you keep missing values as NaN?

When the data mining algorithm can handle them

What are 5 imputation-based techniques for missing values?

Random value, Average (mean/median/mode), Nearest neighbor, Heuristic-Based, Interpolation
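
Two of these techniques, average (mean) imputation and interpolation, can be sketched as follows. Representing a missing value as `None`, the function names, and the sample readings are all assumptions for illustration; the interpolation shown is linear and fills only gaps that have an observed value on both sides.

```python
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def impute_linear(values):
    """Fill interior gaps on a straight line between the nearest observed values."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result  # leading or trailing gaps are left as None

# Hypothetical evenly spaced sensor readings with gaps
readings = [10.0, None, 14.0, None, None, 20.0]
by_mean = impute_mean(readings)
by_interpolation = impute_linear(readings)
```

Mean imputation fills every gap with the same value, which flattens trends; interpolation respects the ordering of the data, so it suits time series or other ordered sequences.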

What is an Outlier?

A data object with characteristics significantly different from the majority in the dataset

What are the 2 cases for handling outliers?

Case 1: Outliers as Noise (disrupt data analysis), Case 2: Outliers as the Focus (primary focus of analysis)

Give 2 examples where outliers are the focus

Credit card fraud detection, Intrusion detection

What is Noise in data?

Noise in Objects: irrelevant elements affecting data integrity. Noise in Attributes: modification of original attribute values

Give 3 examples of noise

Erroneous values from data entry errors, Distorted voice on poor phone line, "Snow" on television screen

What are 4 techniques to handle noise?

Binning, Clustering, Imputation techniques (average, nearest neighbor, heuristic, interpolation), Semi-supervised method (automated detection + human inspection)

Why might we incorporate noise into data?

To enhance robustness by preventing overfitting, improving generalization, and fostering adaptability to real-world variations

What are the 4 main data transformation techniques?

Sampling, Encoding, Normalization, Discretization

What is Sampling in data transformation?

Selecting a subset of the dataset to represent a larger population

Why do we use sampling?

Using the entire dataset is expensive (collecting, storing, processing) and time-consuming

What are 2 challenges in sampling?

Ensuring the sample is representative of the population, and addressing potential bias in the sampling process

What are the 4 sampling methods?

Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling

What is Simple Random Sampling?

Every item has an equal chance of being selected (could be with or without replacement)

What is Systematic Sampling?

Selecting individuals at regular intervals from a list or group

What is Stratified Sampling?

Divide the population into groups (strata) based on a characteristic, then take a random sample from each group

What is Cluster Sampling?

Divide the population into clusters (often geographically), then randomly select entire clusters to sample
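
The four sampling methods can be sketched with Python's standard `random` module. The function names, the toy population, and the fixed seed are assumptions for illustration; simple random sampling is shown without replacement (`random.choices` would sample with replacement).

```python
import random

def simple_random(items, k, rng):
    """Every item has an equal chance; sampled without replacement."""
    return rng.sample(items, k)

def systematic(items, step, rng):
    """Pick a random start, then take every `step`-th item."""
    start = rng.randrange(step)
    return items[start::step]

def stratified(items, key, k_per_stratum, rng):
    """Group items by `key`, then draw a random sample from each group."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(k_per_stratum, len(group))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every item inside them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(20))
people = [(i, "A" if i < 12 else "B") for i in range(20)]       # (id, stratum)
regions = {"north": [0, 1, 2], "south": [3, 4], "east": [5, 6, 7, 8]}

srs = simple_random(population, 5, rng)
sys_sample = systematic(population, 4, rng)
strat = stratified(people, key=lambda p: p[1], k_per_stratum=3, rng=rng)
clus = cluster_sample(regions, 2, rng)
```

Stratified sampling guarantees that both strata appear in the sample even though stratum "B" is smaller, whereas cluster sampling keeps or drops each region as a whole.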

What is Encoding in data transformation?

Converting categorical variables into numerical format for data mining algorithms

What are the 2 main encoding methods?

Label Encoding and One-Hot Encoding

What is Label Encoding?

Converts categories into numerical labels where each category gets a unique integer

Why is Label Encoding not suitable for nominal attributes?

It can create unintended ordinal relationships (e.g., France (0) < Spain (1))

When is Label Encoding suitable?

For ordinal attributes where order matters

What is One-Hot Encoding?

Creates a binary column for each category, with 1 in its column and 0 elsewhere

Why is One-Hot Encoding suitable for nominal attributes?

No ordinal relationships are implied between categories

What is a disadvantage of One-Hot Encoding?

Increases the dimensionality of the data, which can be a concern with many categories
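
Both encodings can be sketched in a few lines of plain Python. The function names and sample categories are illustrative assumptions; the label encoder takes the category order explicitly, since for ordinal attributes that order is what the integers are meant to preserve.

```python
def label_encode(values, order):
    """Map each category to its position in `order` (suits ordinal attributes)."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Return the sorted category list and one binary row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Ordinal attribute: the integers keep beginner < intermediate < advanced
levels = ["beginner", "advanced", "intermediate"]
level_codes = label_encode(levels, order=["beginner", "intermediate", "advanced"])

# Nominal attribute: one column per country, no order implied
countries = ["France", "Spain", "France", "Germany"]
columns, one_hot_rows = one_hot_encode(countries)
```

Each one-hot row has as many columns as there are distinct categories, which is the dimensionality cost noted above: three countries need three columns, and a thousand categories would need a thousand.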