data 1 data mining


120 Terms

1

What is a Dataset?

A collection of objects and their attributes used for analysis

2

What is an Attribute in data mining?

A property or characteristic of an object. Also known as variable, field, characteristic, dimension, or feature

3

What is an Object in data mining?

A collection of attributes. Also known as record, point, case, sample, entity, or instance

4

What are the 5 important characteristics of datasets?

Size, Dimensionality, Sparsity, Distribution, Resolution

5

Why is Size an important dataset characteristic?

The type of analysis often depends on the size of the data

6

Why is Dimensionality an important dataset characteristic?

High-dimensional data presents unique challenges in analysis

7

Why is Sparsity an important dataset characteristic?

In sparse data most attribute values are zero, so only presence (non-zero values) counts, not absence

8

What are the 4 main types of datasets?

Record Data, Graphs and Networks, Ordered (Sequence) Data, Spatial Data

9

What is Record Data?

Records with fixed attributes, including relational records, data matrix, and transaction data

10

Give 3 examples of Graphs and Networks datasets

Transportation network, Social or information networks, Molecular Structures

11

Give 3 examples of Ordered (Sequence) Data

Video (sequence of images), Genetic Sequence Data, Temporal sequence

12

Give 2 examples of Spatial Data

RGB Images, Satellite images

13

What are the 4 types of attributes?

Nominal, Ordinal, Interval, Ratio

14

What is a Nominal attribute?

Unordered categories (e.g., gender, eye color, types of fruit like apple, orange)

15

What is an Ordinal attribute?

Ordered categories (e.g., grades A/B/C, height tall/medium/short, swimming level beginner to advanced)

16

What is an Interval attribute?

Numerical with equal intervals but no true zero (e.g., calendar dates, temperatures in Celsius or Fahrenheit)

17

What is a Ratio attribute?

Numerical with equal intervals and a true zero (e.g., temperature in Kelvin, length, counts, elapsed time)
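The difference between interval and ratio attributes can be illustrated with temperature. The sketch below is only an illustration: the two readings are made-up values and `celsius_to_kelvin` is a helper introduced here. It shows that differences agree across both scales, while a ratio is only meaningful on the scale with a true zero.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

c1, c2 = 10.0, 20.0  # hypothetical temperature readings in Celsius

# Differences are meaningful on an interval scale and are unchanged in Kelvin.
diff_c = c2 - c1
diff_k = celsius_to_kelvin(c2) - celsius_to_kelvin(c1)

# Ratios need a true zero: 20 °C is not "twice as hot" as 10 °C.
ratio_c = c2 / c1  # 2.0, but physically meaningless
ratio_k = celsius_to_kelvin(c2) / celsius_to_kelvin(c1)  # only slightly above 1

print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 4))
```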

18

What operations can be performed on Nominal attributes?

Distinctness only (=, ≠)

19

What operations can be performed on Ordinal attributes?

Distinctness (=, ≠) and Order (<, >)

20

What operations can be performed on Interval attributes?

Distinctness (=, ≠), Order (<, >), and Addition/Subtraction (+, -)

21

What operations can be performed on Ratio attributes?

Distinctness (=, ≠), Order (<, >), Addition/Subtraction (+, -), and Multiplication/Division (×, ÷)

22

What is a Discrete Attribute?

An attribute that takes values from a finite or countable set (e.g., gender, eye color, swimming level). Typically represented as integers

23

What is a Continuous Attribute?

An attribute that takes values within a continuous range (e.g., height, length, temperature). Typically represented as floating-point variables

24

What are Binary attributes?

A special case of discrete attributes with only two possible values

25

What are Asymmetric Attributes?

Attributes where only the presence (non-zero value) matters, not the absence

26

Give 2 examples of asymmetric attributes

Words present in documents, Items present in customer transactions

27

Why do we focus on presence in asymmetric attributes?

In real scenarios (e.g., grocery shopping), we don't call two purchases similar just because both shoppers skipped most of the same products. We focus on what was bought.

28
What is a Similarity Measure?
Quantifies data object likeness. Higher values indicate greater similarity. Typically within the range [0,1]
29
What is a Dissimilarity Measure?
Also called Distance Measure. Quantifies data object differences. Lower values indicate greater similarity. Often starts at 0 with varying upper limit
30
What is Proximity in data mining?
Refers to either similarity or dissimilarity
31
What are the 2 properties of Similarity?
Identity: s(x,y) = 1 only if x = y. Symmetry: s(x,y) = s(y,x) for all x and y
32
What are the 3 properties of Distance?
Non-Negativity: d(x,y) ≥ 0, equals 0 only if x = y. Symmetry: d(x,y) = d(y,x). Triangle Inequality: d(x,z) ≤ d(x,y) + d(y,z)
33
What is a Distance Matrix?
Distances between all data objects, useful for clustering and nearest neighbor algorithms, symmetric with values reflecting dissimilarities
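As a rough illustration of a distance matrix, the sketch below computes all pairwise distances for three made-up 2-D points. It assumes straight-line (Euclidean) distance via the standard-library `math.dist`; the name `distance_matrix` and the sample points are hypothetical.

```python
import math

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # hypothetical 2-D data objects

def distance_matrix(objs):
    """Return the full matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in objs] for a in objs]

D = distance_matrix(points)
for row in D:
    print(row)
```

The result has zeros on the diagonal (each object compared with itself) and is symmetric, so in practice only the upper or lower triangle needs to be stored.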
34
What is a Similarity Matrix?
Similarities between all data objects, useful for clustering and recommendation systems, often symmetric with higher values indicating stronger similarities
35
What are 4 measures for numerical vectors?
Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear correlation
36
What are 2 measures for binary vectors?
Simple Matching Coefficient (SMC), Jaccard Coefficient
37
What is the Euclidean Distance formula?
d(x,y) = √( Σ_{k=1}^{n} (x_k - y_k)² ), where n is the number of attributes and x_k, y_k are the k-th attributes of x and y
38
When is standardization necessary for Euclidean Distance?
When scales of attributes differ
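To see why differing scales matter, here is a minimal sketch with made-up data (age in years, income in dollars). It assumes z-score standardization, which is one common choice and not necessarily the one intended by the card; `standardize` and `euclidean` are helper names introduced here.

```python
import math
import statistics

def standardize(column):
    """Z-score a column: subtract the mean, divide by the population std."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def euclidean(x, y):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Hypothetical objects described by (age in years, income in dollars).
ages = [25, 35, 45]
incomes = [30000, 90000, 40000]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# On the raw data the income attribute dominates the distance entirely.
raw_d01, raw_d02 = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
# After standardization both attributes contribute on a comparable scale.
scaled_d01, scaled_d02 = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```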
39
What is the Minkowski Distance formula?
d(x,y) = ( Σ_{k=1}^{n} |x_k - y_k|^r )^(1/r), where r is a parameter and n is the number of attributes
40
What is Minkowski Distance?
A generalization of Euclidean Distance where the hyperparameter r allows adaptation to data characteristics
41
What is Manhattan distance (L1 norm)?
Minkowski Distance with r = 1, ideal for measuring distances in grid-like paths (e.g., city blocks)
42
What is Euclidean distance (L2 norm)?
Minkowski Distance with r = 2, the most commonly used distance metric for straight-line distance in Euclidean space
43
What is Chebyshev distance (Lmax norm)?
Minkowski Distance with r → ∞, calculates maximum difference between any component of vectors (e.g., king movement in chess)
44
What is Hamming distance?
A special case of Manhattan distance for binary vectors that counts differing bits
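The Minkowski family described in the cards above can be sketched in a few lines. The function names and sample vectors are illustrative only; the Chebyshev case is written as a maximum because it is the limit of the formula as r grows.

```python
def minkowski(x, y, r):
    """Minkowski distance with parameter r (r=1 Manhattan, r=2 Euclidean)."""
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

def chebyshev(x, y):
    """Limit of the Minkowski distance as r grows: the largest difference."""
    return max(abs(xk - yk) for xk, yk in zip(x, y))

def hamming(x, y):
    """Number of positions at which two binary vectors differ."""
    return sum(xk != yk for xk, yk in zip(x, y))

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)  # 7.0
euclid = minkowski(x, y, 2)     # 5.0
cheb = chebyshev(x, y)          # 4

a, b = (1, 0, 1, 1), (1, 1, 0, 1)  # hypothetical binary vectors
print(manhattan, euclid, cheb, hamming(a, b))
```

For binary vectors, `hamming(a, b)` and `minkowski(a, b, 1)` give the same count, which is why Hamming distance is described as a special case of Manhattan distance.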
45
What is the Cosine Similarity formula?
cos(x,y) = (x · y) / (‖x‖ ‖y‖), where x · y is the dot product and ‖x‖ is the Euclidean length (L2 norm) of x
46
What does Cosine Similarity measure?
The cosine of the angle between two vectors; it is insensitive to magnitude and depends only on orientation
47
What is the range of Cosine Similarity values?
Between -1 and 1: -1 (opposite directions, maximally dissimilar), 0 (orthogonal, no similarity), 1 (same direction, perfectly similar)
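A minimal sketch of cosine similarity, using made-up vectors to hit the three landmark values. The function name is illustrative, and the sketch assumes neither vector is all zeros, since the measure is undefined in that case.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the vectors' Euclidean lengths."""
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk * xk for xk in x))
    norm_y = math.sqrt(sum(yk * yk for yk in y))
    return dot / (norm_x * norm_y)  # assumes neither vector is all zeros

same_direction = cosine_similarity((1, 2), (2, 4))   # close to 1; magnitude is ignored
orthogonal = cosine_similarity((1, 0), (0, 1))       # 0.0
opposite = cosine_similarity((1, 2), (-1, -2))       # close to -1

print(same_direction, orthogonal, opposite)
```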
48
What does Linear correlation measure?
The linear relationship between two variables, evaluating how well one variable predicts another
49
What is the range of Linear correlation values?
Between -1 and 1: 1 (perfect positive correlation), 0 (no linear relationship), -1 (perfect negative correlation)
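The sketch below assumes "linear correlation" means the Pearson correlation coefficient, which is the usual reading. The data are made up, and the third case shows that a value of 0 rules out only a linear relationship, not every relationship.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither variable is constant

positive = pearson([1, 2, 3], [2, 4, 6])   # perfect positive correlation
negative = pearson([1, 2, 3], [6, 4, 2])   # perfect negative correlation

xs = [-2, -1, 0, 1, 2]
parabola = pearson(xs, [v * v for v in xs])  # y = x^2: related, but not linearly

print(positive, negative, parabola)
```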
50
What is Simple Matching Coefficient (SMC)?
Number of matches divided by total number of attributes, designed for symmetric binary attributes
51
What is the SMC formula?
SMC = (f11 + f00) / (f01 + f10 + f00 + f11)
52
What does f01 represent in SMC?
The number of attributes where x was 0 and y was 1
53
What does f11 represent in SMC?
The number of attributes where x was 1 and y was 1
54
What is the Jaccard Coefficient?
The number of attributes where both objects have a 1, divided by the number of attributes where at least one of them has a 1; designed for asymmetric binary attributes
55
What is the Jaccard Coefficient formula?
J = f11 / (f01 + f10 + f11)
56
Why doesn't Jaccard include f00?
Because it's designed for asymmetric attributes where absence (0-0 matches) doesn't indicate similarity
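The SMC and Jaccard formulas above can be compared on a pair of sparse binary vectors. This is a sketch with made-up vectors and illustrative function names; it assumes the vectors contain only 0s and 1s and that at least one of them contains a 1 (otherwise Jaccard would divide by zero).

```python
def match_counts(x, y):
    """Count the four kinds of attribute pairs: f11, f10, f01, f00."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def smc(x, y):
    """Simple Matching Coefficient: all matches over all attributes."""
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f01 + f10 + f00 + f11)

def jaccard(x, y):
    """Jaccard Coefficient: 1-1 matches over attributes that are not 0-0."""
    f11, f10, f01, _ = match_counts(x, y)
    return f11 / (f01 + f10 + f11)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print(smc(x, y), jaccard(x, y))
```

Here the two objects share no 1s, yet SMC is high because of the seven 0-0 matches, while Jaccard is 0. That gap is the reason Jaccard is preferred for asymmetric attributes.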
57
When comparing documents using word presence, which measure should you use?
Jaccard Coefficient (similarity based on sharing common words)
58
When comparing geographical locations of cities, which measure should you use?
Euclidean Distance (similarity based on closeness by distance)
59
When comparing time series of temperature patterns, which measure should you use?
Cosine Similarity (similarity based on pattern variation over time)
60
When measuring relationship between study hours and exam marks, which measure should you use?
Linear Correlation (indicates strength of relationship between variables)
61
For Nominal attributes with one object, which similarity/distance measures apply?
SMC, Jaccard Coefficient
62
For Ordinal attributes with one object, which similarity/distance measures apply?
Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation
63
For Interval attributes with one object, which similarity/distance measures apply?
Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation
64
For Ratio attributes with one object, which similarity/distance measures apply?
Euclidean Distance, Minkowski Distance, Cosine Similarity, Linear Correlation
65
What are the 4 major tasks of Data Preprocessing?
Data integration, Data reduction, Data cleaning, Data transformation
66
What is Data cleaning?
Handling duplicates and missing values, identifying/removing outliers, smoothing noisy data
67
What is Data transformation?
Converting data into a format suitable for analysis (sampling, encoding, discretization, normalization)
68
Why is poor data quality a problem?
It can negatively impact modeling efforts, leading to incorrect decisions (e.g., denying loans to credit-worthy candidates or approving non-creditworthy ones)
69
What are 4 examples of data quality problems?
Duplicate data, Missing values, Outliers, Noise
70
What is Duplicate Data?
Occurrence of identical or nearly identical data objects, common when merging data from diverse sources
71
How do you handle duplicate data?
Remove duplicate data objects, or in some scenarios keep them (e.g., customers with multiple accounts accumulating points separately)
72
What are 2 reasons for missing values?
Information is not collected (e.g., people decline to give age/weight), or attributes may not be applicable to all cases (e.g., annual income not applicable to children)
73
When should you DELETE RECORDS with missing values?
When there is enough data and few missing values
74
When should you DELETE COLUMN with missing values?
When missing values are ≥ 60% of the observations
75
When should you keep missing values as NaN?
When the data mining algorithm can handle them
76
What are 5 imputation-based techniques for missing values?
Random value, Average (mean/median/mode), Nearest neighbor, Heuristic-Based, Interpolation
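Two of the listed techniques, average imputation and interpolation, can be sketched as follows. The sketch assumes missing values are marked with `None` and that interpolation is linear between the nearest observed neighbours; both function names are hypothetical.

```python
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def impute_linear(values):
    """Fill interior gaps by linear interpolation between observed neighbours.

    Leading and trailing gaps have only one neighbour and are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

print(impute_mean([1.0, None, 3.0]))
print(impute_linear([10.0, None, None, 40.0]))
```

Mean imputation ignores ordering, so it suits unordered records. Interpolation uses position and therefore only makes sense for ordered data such as time series.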
77
What is an Outlier?
A data object with characteristics significantly different from the majority in the dataset
78
What are the 2 cases for handling outliers?
Case 1: Outliers as Noise (disrupt data analysis), Case 2: Outliers as the Focus (primary focus of analysis)
79
Give 2 examples where outliers are the focus
Credit card fraud detection, Intrusion detection
80
What is Noise in data?
Noise in Objects: irrelevant elements affecting data integrity. Noise in Attributes: modification of original attribute values
81
Give 3 examples of noise
Erroneous values from data entry errors, Distorted voice on poor phone line, "Snow" on television screen
82
What are 4 techniques to handle noise?
Binning, Clustering, Imputation techniques (average, nearest neighbor, heuristic, interpolation), Semi-supervised method (automated detection + human inspection)
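The cards do not say which binning variant is meant. As one possibility, the sketch below assumes equal-frequency bins with smoothing by bin means: values are sorted, split into bins of a fixed size, and each value is replaced by the mean of its bin. The function name and data are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into bins of bin_size, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        mean = sum(current_bin) / len(current_bin)
        smoothed.extend([mean] * len(current_bin))
    return smoothed  # note: returned in sorted order, not original order

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical noisy measurements
print(smooth_by_bin_means(noisy, 3))
```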
83
Why might we incorporate noise into data?
To enhance robustness by preventing overfitting, improving generalization, and fostering adaptability to real-world variations
84
What are the 4 main data transformation techniques?
Sampling, Encoding, Normalization, Discretization
85
What is Sampling in data transformation?
Selecting a subset of the dataset to represent a larger population
86
Why do we use sampling?
Using the entire dataset is expensive (collecting, storing, processing) and time-consuming
87
What are 2 challenges in sampling?
Ensuring the sample is representative of the population, and addressing potential bias in the sampling process
88
What are the 4 sampling methods?
Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling
89
What is Simple Random Sampling?
Every item has an equal chance of being selected (could be with or without replacement)
90
What is Systematic Sampling?
Selecting individuals at regular intervals from a list or group
91
What is Stratified Sampling?
Divide the population into groups (strata) based on a characteristic, then take a random sample from each group
92
What is Cluster Sampling?
Divide the population into clusters (often geographically), then randomly select entire clusters and include every member of each selected cluster
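The four sampling methods can be contrasted in a short sketch. Everything here is illustrative: the population is the integers 0 to 19, the grouping key is a made-up stand-in for a real characteristic or region, and the function names are hypothetical. Simple random sampling is shown without replacement.

```python
import random

def simple_random(population, n, rng):
    """Every item has an equal chance; drawn without replacement."""
    return rng.sample(population, n)

def systematic(population, step, start=0):
    """Take every step-th item, beginning at index start."""
    return population[start::step]

def group_by(population, key):
    """Helper: collect items into groups according to key(item)."""
    groups = {}
    for item in population:
        groups.setdefault(key(item), []).append(item)
    return groups

def stratified(population, key, n_per_stratum, rng):
    """Draw a random sample from every stratum."""
    sample = []
    for members in group_by(population, key).values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster(population, key, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    clusters = group_by(population, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

rng = random.Random(0)                # fixed seed so runs are repeatable
population = list(range(20))
group_of = lambda i: i % 4            # hypothetical grouping characteristic

print(simple_random(population, 5, rng))
print(systematic(population, 5))
print(stratified(population, group_of, 2, rng))
print(cluster(population, group_of, 2, rng))
```

The key contrast is in the last two functions: stratified sampling takes some members from every group, while cluster sampling takes every member from some groups.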
93
What is Encoding in data transformation?
Converting categorical variables into numerical format for data mining algorithms
94
What are the 2 main encoding methods?
Label Encoding and One-Hot Encoding
95
What is Label Encoding?
Converts categories into numerical labels where each category gets a unique integer
96
Why is Label Encoding not suitable for nominal attributes?
It can create unintended ordinal relationships (e.g., France (0) < Spain (1))
97
When is Label Encoding suitable?
For ordinal attributes where order matters
98
What is One-Hot Encoding?
Creates a binary column for each category, with 1 in its column and 0 elsewhere
99
Why is One-Hot Encoding suitable for nominal attributes?
No ordinal relationships are implied between categories
100
What is a disadvantage of One-Hot Encoding?
Increases the dimensionality of the data, which can be a concern with many categories
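The two encoding methods can be sketched in plain Python. The function names and sample categories are illustrative, and in practice a library encoder would usually be used instead. Label encoding here takes an explicit category order, which is what makes it appropriate for ordinal attributes.

```python
def label_encode(values, order):
    """Map each category to its position in an explicit ordering."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Return the sorted category list and one binary row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Ordinal attribute: the integer codes preserve the real order.
levels = ["beginner", "advanced", "intermediate"]
level_codes = label_encode(levels, ["beginner", "intermediate", "advanced"])

# Nominal attribute: one column per category, no order implied.
countries = ["France", "Spain", "France", "Germany"]
columns, one_hot_rows = one_hot_encode(countries)

print(level_codes)
print(columns)
print(one_hot_rows)
```

Each new category adds a column to the one-hot output, which is the dimensionality cost noted in the last card.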