CAP 4770 - Lecture 2: Examine data, attributes, types, data quality, and similarity measures

Vocabulary flashcards covering key terms from Lecture 2: data concepts, attribute types, data representations, data quality issues, preprocessing, and similarity measures.

37 Terms

Data

A collection of data objects and their attributes.

Object

An individual data item; also called a record, data point, case, sample, entity, or instance.

Attribute

A property or characteristic of an object; also known as a variable, field, characteristic, or feature.

Attribute values

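Numbers or symbols assigned to an attribute for a particular object. The same attribute can be mapped to different values (e.g., height in feet or in meters), and different attributes can share the same set of values (e.g., ID and age are both integers).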

Categorical (qualitative)

Attributes whose values are categories or labels (e.g., eye color, gender, zip codes).

Quantitative

Attributes whose values are numerical measurements (e.g., age, income, length).

Discrete attribute

An attribute with a finite or countably infinite set of values; often represented as integers (binary attributes are a special case).

Continuous attribute

An attribute with real-number values; in practice represented with finite precision (floating-point).

Nominal

Attribute with only distinct names and no inherent order (e.g., ID numbers, eye color).

Ordinal

Attribute that provides a meaningful order of objects (e.g., rankings, grades).

Interval

Differences between values are meaningful, but there is no true zero, so ratios are not (e.g., calendar dates, temperature in Celsius).

Ratio

Differences and ratios are meaningful and there is a true zero (e.g., age, length, weight).

Attribute transformation

Process of mapping an attribute’s values to a new set of values to enable analysis (e.g., normalization, log, exponentiation).

Normalization

Transforming values to a standard scale, often [0,1] or with zero mean and unit variance.
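
A minimal sketch of the two scalings named on this card, min-max rescaling onto [0, 1] and z-score scaling to zero mean and unit variance. The function names and the sample ages are hypothetical, and mapping a constant attribute to 0.0 is an assumption made here to avoid dividing by zero.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: same convention as above for a constant attribute.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40, 60]      # hypothetical sample values
scaled = min_max(ages)       # [0.0, 0.25, 0.5, 1.0]
standardized = z_score(ages)
```

Min-max keeps the original ordering and spacing but is sensitive to extreme values, while the z-score form expresses each value in standard deviations from the mean.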

Data Matrix

A matrix with m rows (one per object) and n columns (one per attribute), giving m×n dimensions; applicable when every attribute is numeric.

Record Data

Data consisting of a collection of records, each with a fixed set of attributes.

Document Data

Each document is represented as a term vector; the value of a term component is its frequency in the document.
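
As a rough illustration of a term vector, the sketch below counts how often each vocabulary term occurs in a document. The vocabulary, the sample sentence, and the whitespace tokenization are all assumptions chosen for brevity; real pipelines usually handle punctuation and stemming as well.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the frequency of each vocabulary term in the text."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["data", "mining", "graph"]      # hypothetical vocabulary
doc = "Data mining finds patterns in data"    # hypothetical document
vec = term_vector(doc, vocabulary)            # [2, 1, 0]
```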

Transaction Data

A special form of record data where each record is a transaction—a set of items bought together.

Graph Data

Data represented as graphs consisting of nodes (entities) and edges (relationships).

Ordered Data

Data where the order of values matters (e.g., genomic sequences).

Spatio-Temporal Data

Data with both spatial and temporal components (space and time).

Data Quality

The degree to which data is fit for analysis; common problems include noise, missing values, outliers, and duplicates.

Noise

Modification or distortion of original values (e.g., voice distortion, television snow).

Outliers

Data objects with characteristics that are markedly different from the rest of the data.

Missing Values

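Absent attribute values; causes include nonresponse (the information was not collected) or inapplicability (the attribute does not apply to every object). Handling options include eliminating the affected objects or attributes, estimating (imputing) the missing values, or ignoring them during analysis.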
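
Two of the handling strategies can be sketched as follows: eliminating objects that have a missing value, and imputing a missing value with the mean of the observed ones. Representing a missing value as `None`, the record layout, and the choice of the mean as the estimate are assumptions for this example only.

```python
from statistics import mean

def drop_missing(records):
    """Elimination: keep only records with no missing (None) values."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, attr):
    """Estimation: fill missing values of attr with the mean of observed values."""
    observed = [r[attr] for r in records if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in records]

# Hypothetical records; the second one is missing its age.
records = [
    {"age": 20, "income": 30000},
    {"age": None, "income": 50000},
    {"age": 40, "income": 70000},
]
complete_only = drop_missing(records)   # two records remain
filled = impute_mean(records, "age")    # missing age becomes 30
```

Elimination is simple but discards the income value of the dropped record, whereas imputation keeps every object at the cost of introducing an estimated value.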

Duplicate Data

Duplicate or near-duplicate objects; common issue when merging data from different sources.

Data Preprocessing

Preparing data for mining, including discretization, binarization, and attribute transformation.

Discretization

Transforming a continuous attribute into a discrete set of values or categories.
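
One simple way to discretize is equal-width binning, sketched below. The card does not name a particular method, so equal-width binning, the function name, and the sample ages are assumptions for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # Assumption: a constant attribute falls entirely into bin 0.
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [5, 18, 33, 47, 65]          # hypothetical continuous values
bins = equal_width_bins(ages, 3)    # [0, 0, 1, 2, 2]
```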

Binarization

Converting attribute values to binary (0/1) values.
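
A common binarization scheme for a categorical attribute is one-hot encoding, where each category becomes its own 0/1 attribute. The card does not specify a scheme, so this choice, the function name, and the sample values are assumptions.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

eye_colors = ["blue", "brown", "blue", "green"]   # hypothetical values
categories, rows = one_hot(eye_colors)
# categories: ["blue", "brown", "green"]
# rows: [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```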

Similarity

Numerical measure of how alike two data objects are; higher means more alike.

Dissimilarity

Numerical measure of how different two data objects are; lower means more alike.

Euclidean Distance

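Straight-line distance between two objects in a continuous feature space: the square root of the sum of squared differences across attributes. Attributes may need standardization first if their scales differ.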
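
A minimal sketch of Euclidean distance between two numeric objects of equal length. The function name and the sample points are hypothetical.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean([0, 0], [3, 4])   # 5.0
```

If one attribute is measured in much larger units than another (say, income in dollars next to age in years), its differences dominate the sum, which is why rescaling before computing the distance is often worthwhile.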

Standardization

Transforming features so they have zero mean and unit variance.

Binary Vectors

Data objects with binary attributes; similarity can be computed from M01, M10, M00, M11 counts.

Simple Matching Coefficient (SMC)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00); the proportion of attributes on which two binary vectors match, counting both 1-1 and 0-0 matches.

Jaccard Coefficient

J = M11 / (M01 + M10 + M11); a similarity measure for binary vectors equal to the number of 1-1 matches divided by the number of attributes where at least one object has a 1. Unlike SMC, it excludes attributes where both objects are 0 (M00).

Mij counts

M01, M10, M00, M11 are counts used to compute similarity/dissimilarity for binary data: M01 (number of attributes where object 1 is 0 and object 2 is 1), M10 (number of attributes where object 1 is 1 and object 2 is 0), M00 (number of attributes where both objects are 0), and M11 (number of attributes where both objects are 1).
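
The counts and the two coefficients built from them can be sketched as below. The function names and sample vectors are hypothetical, the inputs are assumed to contain only 0 and 1, and returning 0.0 for Jaccard when neither vector contains a 1 is an assumption, since the ratio is undefined in that case.

```python
def match_counts(x, y):
    """Count attribute pairs by (object 1 value, object 2 value)."""
    counts = {"M00": 0, "M01": 0, "M10": 0, "M11": 0}
    for a, b in zip(x, y):
        counts[f"M{a}{b}"] += 1   # e.g. a=0, b=1 increments M01
    return counts

def smc(x, y):
    """Simple Matching Coefficient: all matches over all attributes."""
    m = match_counts(x, y)
    return (m["M11"] + m["M00"]) / sum(m.values())

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over attributes that are not 0-0."""
    m = match_counts(x, y)
    denom = m["M01"] + m["M10"] + m["M11"]
    # Assumption: define the coefficient as 0.0 when both vectors are all zeros.
    return m["M11"] / denom if denom else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # hypothetical object 1
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]   # hypothetical object 2
counts = match_counts(x, y)   # M00=7, M01=2, M10=1, M11=0
s = smc(x, y)                 # 0.7
j = jaccard(x, y)             # 0.0
```

The two vectors share no 1s, so Jaccard reports no similarity, while SMC is high only because the many shared 0s count as matches.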