Vocabulary flashcards covering key terms from Lecture 2: data concepts, attribute types, data representations, data quality issues, preprocessing, and similarity measures.
Data
A collection of data objects and their attributes.
Object
An individual data item; also called a record, data point, case, sample, entity, or instance.
Attribute
A property or characteristic of an object; also known as a variable, field, characteristic, or feature.
Attribute values
Numbers or symbols assigned to an attribute. The same attribute can be mapped to different values (e.g., height in feet or in meters), and different attributes can share the same set of values (e.g., ID and age are both integers).
Categorical (qualitative)
Attributes whose values are categories or labels (e.g., eye color, gender, zip codes).
Quantitative
Attributes whose values are numerical measurements (e.g., age, income, length).
Discrete attribute
An attribute with a finite or countably infinite set of values; often represented as integers (binary attributes are a special case).
Continuous attribute
An attribute with real-number values; in practice represented with finite precision (floating-point).
Nominal
Attribute with only distinct names and no inherent order (e.g., ID numbers, eye color).
Ordinal
Attribute that provides a meaningful order of objects (e.g., rankings, grades).
Interval
Differences between values are meaningful; no true zero (e.g., calendar dates, Celsius).
Ratio
Differences and ratios are meaningful and there is a true zero (e.g., age, length, weight).
Attribute transformation
Process of mapping an attribute’s values to a new set of values to enable analysis (e.g., normalization, log, exponentiation).
Normalization
Transforming values to a standard scale, often [0,1] or with zero mean and unit variance.
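A minimal sketch of the two scalings named on this card, using only the Python standard library. The function names `min_max_scale` and `z_score` and the sample `ages` list are illustrative assumptions, not part of the lecture; the z-score uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 (a convention chosen here).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40]            # hypothetical sample data
scaled = min_max_scale(ages)   # [0.0, 0.5, 1.0]
zs = z_score(ages)             # mean 0, variance 1
```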
Data Matrix
A matrix with one row per object and one column per attribute, used when all attributes are numeric; m objects and n attributes give an m×n matrix.
Record Data
Data consisting of a collection of records, each with a fixed set of attributes.
Document Data
Each document is represented as a term vector; the value of a term component is its frequency in the document.
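As a rough illustration of the term-vector idea, the sketch below counts whitespace-separated words. The helper `term_vectors` and the two toy documents are assumptions made for this example; real pipelines usually add stemming, stop-word removal, and weighting.

```python
from collections import Counter

def term_vectors(docs):
    """Return (vocabulary, vectors), where each vector holds term frequencies."""
    tokenized = [doc.lower().split() for doc in docs]
    # Shared vocabulary, sorted so every vector uses the same component order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["team play play score", "score win"]  # hypothetical documents
vocab, vectors = term_vectors(docs)
# vocab   -> ['play', 'score', 'team', 'win']
# vectors -> [[2, 1, 1, 0], [0, 1, 0, 1]]
```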
Transaction Data
A special form of record data where each record is a transaction—a set of items bought together.
Graph Data
Data represented as graphs consisting of nodes (entities) and edges (relationships).
Ordered Data
Data where the order of values matters (e.g., genomic sequences).
Spatio-Temporal Data
Data with both spatial and temporal components (space and time).
Data Quality
The degree to which data is fit for analysis; common problems include noise, missing values, outliers, and duplicates.
Noise
Modification or distortion of original values (e.g., voice distortion, television snow).
Outliers
Data objects with characteristics that are markedly different from the rest of the data.
Missing Values
Absent information; causes include nonresponse or inapplicability; handling includes eliminating the affected objects or attributes, estimating (imputing) the missing values, or ignoring them during analysis.
Duplicate Data
Duplicate or near-duplicate objects; common issue when merging data from different sources.
Data Preprocessing
Preparing data for mining, including discretization, binarization, and attribute transformation.
Discretization
Transforming a continuous attribute into a discrete set of values or categories.
Binarization
Converting attribute values to binary (0/1) values.
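The two preprocessing steps above can be sketched as follows. Equal-width binning and one-hot encoding are only one possible choice for each step (the cards do not prescribe a method), and the names `equal_width_bins`, `one_hot`, and the sample data are hypothetical.

```python
def equal_width_bins(values, k):
    """Discretize: map each continuous value to a bin index in 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def one_hot(labels):
    """Binarize: one 0/1 attribute per distinct category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

ages = [18, 25, 40, 63, 78]                  # hypothetical continuous attribute
bins = equal_width_bins(ages, 3)             # [0, 0, 1, 2, 2]

eye_colors = ["blue", "brown", "blue"]       # hypothetical categorical attribute
categories, encoded = one_hot(eye_colors)    # ['blue', 'brown'], [[1, 0], [0, 1], [1, 0]]
```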
Similarity
Numerical measure of how alike two data objects are; higher means more alike.
Dissimilarity
Numerical measure of how different two data objects are; lower means more alike.
Euclidean Distance
Straight-line distance between two objects in continuous feature space, d(x, y) = sqrt(sum over k of (x_k - y_k)^2); attributes may need standardization first if their scales differ.
Standardization
Transforming features so they have zero mean and unit variance.
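To see why scale matters for Euclidean distance, here is a small sketch comparing distances before and after standardizing each attribute. The `people` rows (age, income) and the helper names are invented for illustration; the population standard deviation is assumed.

```python
import math
from statistics import mean, pstdev

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize_columns(rows):
    """Give every attribute (column) zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [
        [(v - mu) / s if s else 0.0 for v, mu, s in zip(row, mus, sigmas)]
        for row in rows
    ]

# Hypothetical objects: [age in years, income in dollars]
people = [[25, 40000], [45, 41000], [26, 60000]]

# On raw values, income dominates: object 1 looks far closer to object 0 than object 2 does.
raw_d01 = euclidean(people[0], people[1])
raw_d02 = euclidean(people[0], people[2])

# After standardization both attributes contribute on the same scale.
z = standardize_columns(people)
std_d01 = euclidean(z[0], z[1])
std_d02 = euclidean(z[0], z[2])
```

In this toy data the 20-year age gap and the 20,000-dollar income gap are equally large relative to each attribute's spread, so the two standardized distances come out equal, whereas the raw distances differ by a factor of about 20.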
Binary Vectors
Data objects with binary attributes; similarity can be computed from M01, M10, M00, M11 counts.
Simple Matching Coefficient (SMC)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00); the proportion of attributes on which two binary vectors match.
Jaccard Coefficient
J = M11 / (M01 + M10 + M11); the number of attributes where both objects have a 1, divided by the number of attributes where at least one object has a 1. It is a similarity measure for binary vectors that, unlike SMC, excludes attributes where both objects are 0 (M00).
Mij counts
M01, M10, M00, M11 are counts used to compute similarity/dissimilarity for binary data: M01 (number of attributes where object 1 is 0 and object 2 is 1), M10 (number of attributes where object 1 is 1 and object 2 is 0), M00 (number of attributes where both objects are 0), and M11 (number of attributes where both objects are 1).
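The counts and both coefficients can be sketched in a few lines. The function names and the two sample vectors are assumptions for illustration; returning 0.0 for Jaccard when both vectors are all zeros is a convention chosen here, since the ratio is undefined in that case.

```python
def match_counts(x, y):
    """Count M00, M01, M10, M11 for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    counts = {"M00": 0, "M01": 0, "M10": 0, "M11": 0}
    for a, b in zip(x, y):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("vectors must contain only 0 and 1")
        counts[f"M{a}{b}"] += 1  # a is object 1's value, b is object 2's value
    return counts

def smc(x, y):
    """Simple Matching Coefficient: matches (1-1 and 0-0) over all attributes."""
    m = match_counts(x, y)
    return (m["M11"] + m["M00"]) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over attributes that are not 0-0."""
    m = match_counts(x, y)
    denom = m["M01"] + m["M10"] + m["M11"]
    return m["M11"] / denom if denom else 0.0  # 0.0 by convention when undefined

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical object 1
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]  # hypothetical object 2

counts = match_counts(x, y)  # M01=2, M10=1, M00=7, M11=0
smc_xy = smc(x, y)           # 0.7
jaccard_xy = jaccard(x, y)   # 0.0
```

The pair looks fairly similar under SMC because of the seven shared zeros, yet has zero Jaccard similarity because the two objects never share a 1.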