DM: Data Preprocessing


29 Terms

1

Data Preprocessing

A process of preparing and cleaning raw data to make it suitable for data mining and analysis.

2
3

Incomplete Data

Data that lacks values or attributes (e.g., empty occupation field).

4

Noisy Data

Data with errors or outliers (e.g., salary = -10).

5

Inconsistent Data

Data with conflicting values within a record or across records and systems (e.g., age = 42 but birth year = 1997).

6

Data Cleaning

Identifying and fixing incorrect, missing, or inconsistent data.

7

Data Integration

Combining data from multiple sources into a coherent dataset.

8

Data Transformation

Converting data into a suitable format, including normalization, smoothing, and aggregation.

9

Data Reduction

Producing a smaller representation of the data that yields the same, or nearly the same, analytical results.

10

Discretization

Transforming continuous data into categorical intervals.
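
Example: a minimal sketch of one discretization method, equal-width binning. The function name `equal_width_bins` and the sample ages are hypothetical; other methods (equal-frequency binning, clustering, entropy-based splits) would give different intervals.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin index 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        if width == 0:
            # All values are identical, so everything falls in the first bin.
            idx = 0
        else:
            # The maximum value would land in bin k; clamp it into the last bin.
            idx = min(int((v - lo) / width), k - 1)
        labels.append(idx)
    return labels


ages = [5, 15, 25, 35, 45, 65]      # illustrative data
print(equal_width_bins(ages, 3))    # [0, 0, 1, 1, 2, 2]
```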

11

Concept Hierarchy

Grouping data values into higher-level categories (e.g., city → country).
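
Example: a small sketch of rolling values up one level of a concept hierarchy. The mapping `CITY_TO_COUNTRY` and the function `roll_up` are made up for illustration.

```python
# Hypothetical one-level hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
}


def roll_up(cities, hierarchy):
    """Replace each low-level value with its higher-level category."""
    # Values missing from the hierarchy are labeled "unknown" (a choice made for this sketch).
    return [hierarchy.get(city, "unknown") for city in cities]


print(roll_up(["Toronto", "Chicago", "Oslo"], CITY_TO_COUNTRY))
```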

12

Normalization

Scaling attribute values so they fall within a small, specified range (e.g., 0 to 1).

13

Min-Max Normalization

Rescales data using min and max values to a new range.
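
Example: a minimal sketch of the usual formula, v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The function name and the income values are illustrative assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula would divide by zero; mapping everything to new_min
        # is an arbitrary choice for this sketch.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [12000, 73600, 98000]     # illustrative data
print(min_max_normalize(incomes))
```

Because the result depends on the observed min and max, a single extreme outlier compresses all other values into a narrow part of the new range.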

14

Z-Score Normalization

Scales data based on its mean and standard deviation.
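
Example: a minimal sketch of the formula v' = (v - mean) / standard deviation. It assumes the population standard deviation (`statistics.pstdev`); the sample version (`statistics.stdev`) is an equally common choice.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Center values on the mean and scale by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are identical, so every z-score is defined as 0 here.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative data with mean 5 and population standard deviation 2.
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```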

15

Decimal Scaling

Normalization by moving the decimal point of the values; the number of places moved depends on the attribute's maximum absolute value.
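
Example: a minimal sketch of v' = v / 10^j, where j is the smallest integer such that the largest absolute scaled value is below 1. The function name and sample values are illustrative.

```python
def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j


scaled, j = decimal_scale([-986, 917])   # illustrative data
print(j, scaled)                         # j is 3 here, since 986 / 1000 < 1
```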

16

Feature Selection

Selecting the most relevant attributes for analysis.

17

Sampling

Selecting a representative subset of the dataset for quicker analysis.

18

Simple Random Sampling

Each item has an equal chance of being selected.

19

Stratified Sampling

Divides data into groups and randomly samples from each group.
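
Example: a sketch of proportional stratified sampling, assuming the same sampling fraction is applied to every group. The names `stratified_sample`, `key`, and `fraction` are hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Randomly sample the same fraction from each stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so small groups stay represented.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


# Illustrative data: 80 "young" and 20 "senior" customers.
customers = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
picked = stratified_sample(customers, key=lambda r: r[0], fraction=0.1)
print(len(picked))
```

Compared with simple random sampling, this keeps small groups from being missed by chance.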

20

Attribute Construction

Creating new attributes from existing ones to improve analysis.

21

Aggregation

Summarizing data (e.g., daily sales → monthly sales).
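
Example: a small sketch of the daily-to-monthly roll-up, assuming each record is a (date string in YYYY-MM-DD form, amount) pair. The function name and the sales figures are made up.

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Sum daily amounts into one total per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        month = day[:7]          # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month] += amount
    return dict(totals)


daily = [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 30.0)]
print(monthly_totals(daily))     # {'2024-01': 150.0, '2024-02': 30.0}
```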

22

Pearson’s Correlation Coefficient

Measures the linear relationship between two variables to detect redundancy.
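
Example: a minimal sketch of the coefficient, r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2)). Values near +1 or -1 suggest one attribute is largely redundant given the other. The function name and data are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Undefined (division by zero) if either sequence is constant.
    return cov / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear, so r is close to 1
```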

23

Chi-Square Test

A statistical test of whether two categorical variables are independent, used to detect correlation between nominal attributes.
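
Example: a sketch of the test statistic only, chi^2 = sum((observed - expected)^2 / expected), where each expected count is row total * column total / grand total. The contingency table is hypothetical, and turning the statistic into a p-value (using the chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom) is left out.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts: rows = reads fiction (yes/no), columns = gender (male/female).
counts = [[250, 200], [50, 1000]]
print(round(chi_square_statistic(counts), 2))
```

A large statistic relative to the critical value means the independence hypothesis is rejected, so the two attributes are likely related.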

24

Completeness (Data Quality)

Indicates if all required data is present.

25

Consistency (Data Quality)

Data does not contradict itself across sources.

26

Conformity (Data Quality)

Data matches standard formats and types.

27

Accuracy (Data Quality)

Data reflects true real-world values.

28

Integrity (Data Quality)

Logical relationships between data are maintained.

29

Timeliness (Data Quality)

Data is up to date and available when needed.