Data Preprocessing
A process of preparing and cleaning raw data to make it suitable for data mining and analysis.
Incomplete Data
Data that lacks values or attributes (e.g., empty occupation field).
Noisy Data
Data with errors or outliers (e.g., salary = -10).
Inconsistent Data
Data with conflicting values across records or systems (e.g., age = 42 but birth year = 1997).
Data Cleaning
Identifying and fixing incorrect, missing, or inconsistent data.
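As a rough illustration of the identification step, the sketch below flags the three problem types from the earlier cards (empty occupation, negative salary, age that disagrees with birth year). The record layout, the field names, and the `find_issues` helper are assumptions made for this example only.

```python
def find_issues(record, current_year):
    """Return a list of data-quality problems found in one record (hypothetical schema)."""
    issues = []

    # Incomplete data: a required field is empty or absent.
    if not record.get("occupation"):
        issues.append("incomplete: occupation missing")

    # Noisy data: a value that cannot be correct, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")

    # Inconsistent data: age and birth year contradict each other.
    age, birth_year = record.get("age"), record.get("birth_year")
    if age is not None and birth_year is not None:
        # The year difference exceeds the age by 1 if this year's birthday has not passed yet.
        if current_year - birth_year - age not in (0, 1):
            issues.append("inconsistent: age does not match birth year")

    return issues


bad = {"occupation": "", "salary": -10, "age": 42, "birth_year": 1997}
print(find_issues(bad, current_year=2024))
```

Fixing the flagged values (filling in, correcting, or dropping them) depends on the dataset and is left out of this sketch.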
Data Integration
Combining data from multiple sources into a coherent dataset.
Data Transformation
Converting data into a suitable format, including normalization, smoothing, and aggregation.
Data Reduction
Producing a smaller representation of the data that still yields the same (or nearly the same) analytical results.
Discretization
Transforming continuous data into categorical intervals.
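One common way to do this is equal-width binning, sketched below. The function name `equal_width_bins` and the sample ages are illustrative assumptions; other schemes (equal-frequency, entropy-based) would look different.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals, labeled 0..k-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would compute to bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [18, 22, 35, 47, 58, 60]
print(equal_width_bins(ages, 3))  # [0, 0, 1, 2, 2, 2]
```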
Concept Hierarchy
Grouping data values into higher-level categories (e.g., city → country).
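A concept hierarchy can be represented as a simple lookup table. The sketch below rolls city values up to the country level; the mapping and the `roll_up` helper are made up for illustration.

```python
# Hypothetical city -> country hierarchy used only for this example.
CITY_TO_COUNTRY = {
    "Paris": "France",
    "Lyon": "France",
    "Osaka": "Japan",
}


def roll_up(cities, hierarchy):
    """Replace each low-level value with its higher-level category."""
    return [hierarchy.get(city, "Unknown") for city in cities]


print(roll_up(["Paris", "Osaka", "Lyon"], CITY_TO_COUNTRY))  # ['France', 'Japan', 'France']
```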
Normalization
Scaling attribute values onto a common scale, such as a specified range (e.g., 0 to 1).
Min-Max Normalization
Linearly rescales values from their original [min, max] range to a new range such as [0, 1].
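A minimal sketch of the usual formula, v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The function name and default target range are assumptions for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # No spread in the data; map every value to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```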
Z-Score Normalization
Rescales each value by subtracting the mean and dividing by the standard deviation, so the result has mean 0 and standard deviation 1.
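A minimal sketch of z = (v - mean) / std using the standard library. Using the population standard deviation (`statistics.pstdev`) is an assumption here; some texts use the sample standard deviation instead.

```python
import statistics


def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std
    if std == 0:
        # All values are identical; every z-score is defined as 0 in this sketch.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Mean is 5 and population std is 2 for this list.
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```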
Decimal Scaling
Normalization by moving the decimal point: each value is divided by 10^j, where j is the smallest integer that makes the largest absolute value less than 1.
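A small sketch of decimal scaling, dividing every value by a power of ten chosen from the largest absolute value. The function name is hypothetical, and the input is assumed to be a non-empty list of numbers.

```python
def decimal_scale(values):
    """Divide values by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j


scaled, j = decimal_scale([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```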
Feature Selection
Selecting the most relevant attributes for analysis.
Sampling
Selecting a representative subset of the dataset for quicker analysis.
Simple Random Sampling
Each item has an equal chance of being selected.
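A short sketch of simple random sampling without replacement, using `random.sample` from the standard library. The wrapper function and its `seed` parameter are assumptions added so the example is reproducible.

```python
import random


def simple_random_sample(items, n, seed=None):
    """Draw n distinct items; every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(items, n)  # sampling without replacement


population = list(range(100))
print(simple_random_sample(population, 5, seed=42))
```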
Stratified Sampling
Divides data into non-overlapping groups (strata) and randomly samples from each group, so that every group is represented.
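The sketch below shows one way to do proportional stratified sampling: group records by a key, then draw the same fraction from each group. The function name, the `key` and `fraction` parameters, and the choice to take at least one record per group are all assumptions of this example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Sample the given fraction from each stratum defined by key(record)."""
    rng = random.Random(seed)

    # Partition the records into strata.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in strata.values():
        # Assumption: keep at least one record so small strata are not lost.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=0)
print(len(picked))  # 10 (8 from "young", 2 from "senior")
```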
Attribute Construction
Creating new attributes from existing ones to improve analysis.
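As a small illustration, the sketch below derives an `area` attribute from `width` and `height`. The field names and the `add_area` helper are hypothetical.

```python
def add_area(records):
    """Return copies of the records with a derived 'area' attribute added."""
    return [{**r, "area": r["width"] * r["height"]} for r in records]


rooms = [{"width": 2, "height": 3}, {"width": 4, "height": 5}]
print(add_area(rooms))
# [{'width': 2, 'height': 3, 'area': 6}, {'width': 4, 'height': 5, 'area': 20}]
```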
Aggregation
Summarizing data (e.g., daily sales → monthly sales).
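The daily-to-monthly example can be sketched as a group-and-sum over the month part of each date. The input format (ISO `YYYY-MM-DD` date strings paired with amounts) and the function name are assumptions.

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Sum (date, amount) pairs by month; dates are 'YYYY-MM-DD' strings."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        month = date[:7]  # 'YYYY-MM' prefix identifies the month
        totals[month] += amount
    return dict(totals)


daily = [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 30.0)]
print(monthly_totals(daily))  # {'2024-01': 150.0, '2024-02': 30.0}
```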
Pearson’s Correlation Coefficient
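Measures the strength and direction of the linear relationship between two numeric variables, on a scale from -1 to 1; values near -1 or 1 suggest that one attribute may be redundant.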
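A minimal sketch of the coefficient computed from its definition (covariance divided by the product of the standard deviations). The function name is an assumption, and both inputs are assumed to have equal length and non-zero variance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The second attribute is exactly twice the first, so r is 1 up to rounding.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```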
Chi-Square Test
A statistical test that compares observed and expected frequencies to evaluate whether two categorical variables are independent.
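The test statistic is the sum over all cells of (observed - expected)^2 / expected, where the expected count is row total × column total / grand total. The sketch below computes only the statistic; the function name and the counts are made up, and turning the statistic into a p-value (which needs the chi-square distribution) is not shown.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical 2x2 counts; a large statistic suggests the variables are related.
print(round(chi_square_statistic([[250, 200], [50, 1000]]), 2))  # 507.94
```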
Completeness (Data Quality)
Indicates whether all required data is present.
Consistency (Data Quality)
Data does not contradict itself across records or sources.
Conformity (Data Quality)
Data matches standard formats and types.
Accuracy (Data Quality)
Data reflects true real-world values.
Integrity (Data Quality)
Logical relationships between data are maintained.
Timeliness (Data Quality)
Data is up to date and available when needed.