Data Mining Chapter Two

28 Terms

1
GIGO
Garbage In - Garbage Out
2
Replacement Value Method One
Replace the missing value with some constant, specified by the analyst.
3
Replacement Value Method Two
Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).
4
Replacement Value Method Three
Replace the missing values with a value generated at random from the observed distribution of the variable.
5
Replacement Value Method Four
Replace the missing values with imputed values based on the other characteristics of the record.
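A minimal sketch of the first three replacement methods, assuming missing values are stored as None. The function names and sample data are hypothetical. Method four (imputation from the other fields of the record) needs a predictive model and is not shown.

```python
import random
import statistics

def fill_constant(values, constant):
    """Method one: replace missing entries (None) with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]

def fill_mean(values):
    """Method two (numeric): replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def fill_mode(values):
    """Method two (categorical): replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]

def fill_random(values, rng):
    """Method three: replace missing entries with random draws from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [20, None, 30, 40, None]       # hypothetical numeric field
colors = ["red", "blue", None, "red"]  # hypothetical categorical field

filled_constant = fill_constant(ages, 0)
filled_mean = fill_mean(ages)
filled_mode = fill_mode(colors)
filled_random = fill_random(ages, random.Random(0))
```

Drawing from the observed values (method three) keeps the spread of the field closer to the original than filling every gap with the mean does.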
6
Measures of Center
The most common are mean, median, and mode.
7
Measures of Location
Measures of center are a special case of measures of location: numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples are percentiles and quantiles.
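As a rough illustration, the measures of center and a few quantiles can be computed with Python's standard statistics module. The sample data is made up, and the "inclusive" quartile method is only one of several conventions.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# Quartiles are measures of location: cut points that split the sorted data
# into four parts. The second quartile is the median.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
```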
8
Normalization

Refers to the process of scaling input data to fall within a specific range, typically [0, 1] or [-1, 1]. This ensures that all features contribute comparably to the analysis and that algorithms based on distance calculations (such as k-nearest neighbors or clustering) perform well.

9
Normalization is important
When features have different units or scales, as it prevents larger-scaled features from dominating the analysis.
10
Min-Max Normalization
Rescales each value according to how far it lies above the field minimum, relative to the range: X* = (X - min) / (max - min). The results fall between 0 and 1.
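A minimal sketch of min-max normalization to the range 0 to 1. The function name and data are hypothetical, and returning zeros for a constant field is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant field carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

weights = [50, 60, 80, 100]  # hypothetical field
scaled = min_max_normalize(weights)
```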
11
Z-score Standardization
Centers the data around zero by subtracting the field mean from each value and dividing by the standard deviation.
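A short sketch of Z-score standardization, assuming the sample standard deviation (n - 1 denominator) is the one wanted. The function name and data are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]

standardized = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

After standardization the field has mean 0 and standard deviation 1, whatever its original units were.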
12
Decimal Scaling
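Moves the decimal point so that every value falls between -1 and 1: each value is divided by 10^d, where d is the number of digits in the value with the largest absolute value.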
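A minimal sketch of decimal scaling, assuming d is taken as the number of digits in the integer part of the largest absolute value. The function name and data are hypothetical.

```python
def decimal_scale(values):
    """Divide every value by 10**d so that all results lie strictly between -1 and 1."""
    max_abs = max(abs(v) for v in values)
    # d = number of digits in the integer part of the largest absolute value.
    # Assumption: if that value is already below 1, no scaling is needed (d = 0).
    d = len(str(int(max_abs))) if max_abs >= 1 else 0
    return [v / 10 ** d for v in values]

scaled_values = decimal_scale([-455, 12, 980])  # d = 3, so divide by 1000
```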
13
Skewness
The bias of data toward one end or the other of a distribution. One common measure is Pearson's second skewness coefficient: skewness = 3(mean - median) / standard deviation.
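The card's own equation is not reproduced here, so the sketch below assumes one common measure, Pearson's second skewness coefficient, 3(mean - median) / standard deviation. The function name and data are hypothetical.

```python
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return 3 * (mean - median) / sd

right_skewed = [1, 2, 2, 3, 3, 4, 20]       # long tail toward large values
left_skewed = [-20, -4, -3, -3, -2, -2, -1]  # long tail toward small values
symmetric = [1, 2, 3, 4, 5]

skew_right = pearson_skewness(right_skewed)
skew_left = pearson_skewness(left_skewed)
skew_symmetric = pearson_skewness(symmetric)
```

A positive result indicates right skew, a negative result indicates left skew, and a value near zero indicates rough symmetry.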
14
Right Skew
The long tail of the distribution extends to the right, so the mean is typically greater than the median. This is the most common type of skew in practice.
15
Transformations

Used to reduce the skewness of the data and make its distribution more nearly normal. Common choices are the square root transformation and the inverse square root transformation.
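A small sketch of the two transformations named above, applied to made-up right-skewed data. Both require strictly positive values, and the skewness helper (Pearson's second coefficient) is an assumed choice of measure used only to compare before and after.

```python
import math
import statistics

def skewness(values):
    """Pearson's second skewness coefficient (an assumed choice of measure)."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)

incomes = [1, 4, 9, 16, 100]  # hypothetical right-skewed data, all positive

sqrt_values = [math.sqrt(v) for v in incomes]          # square root transformation
inv_sqrt_values = [1 / math.sqrt(v) for v in incomes]  # inverse square root transformation

skew_before = skewness(incomes)
skew_after = skewness(sqrt_values)
```

The square root pulls in the long right tail, so skew_after is smaller than skew_before for this data. The inverse square root also reverses the order of the values, which matters when interpreting results.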

16
Z-score method for identifying outliers
States that a data value is an outlier if it has a Z-score that is either less than -3 or greater than 3.
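A minimal sketch of the Z-score rule, assuming the sample standard deviation. The function name and data are hypothetical.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose Z-score is below -threshold or above +threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) > threshold]

readings = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
            10, 11, 9, 10, 12, 8, 10, 11, 9, 100]  # hypothetical data with one extreme value
z_outliers = z_score_outliers(readings)
```

Note that the extreme value itself inflates both the mean and the standard deviation used to judge it, which is why this rule can miss outliers in small samples.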
17
Interquartile Range (IQR)
A measure of variability that is much more robust than the standard deviation. It is calculated as IQR = Q3 - Q1 and may be interpreted as the spread of the middle 50% of the data. A robust method of outlier detection is therefore defined as follows: a data value is an outlier if it lies 1.5(IQR) or more below Q1, or 1.5(IQR) or more above Q3.
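A sketch of IQR-based outlier detection, assuming the usual fences at 1.5 times the IQR below Q1 and above Q3. Quartile conventions differ between tools; the "inclusive" method of Python's statistics module is used here as one option. The function name and data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying k * IQR or more below Q1, or k * IQR or more above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v <= lower or v >= upper]

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data with one extreme value
iqr_flagged = iqr_outliers(scores)
```

Because the quartiles barely move when one extreme value is added, the fences stay put and the extreme value is flagged.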
18
Flag Variables
Also known as a dummy variable or indicator variable. A flag variable takes only the values 0 or 1. Creating flag variables turns categories into numerical indicators so that they can be used in analyses that require numeric input and cannot be run on qualitative categories.
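A minimal sketch of building flag variables from a categorical field. It follows the common practice of using one fewer flag than there are categories by leaving out a reference category. The function name, the "is_" prefix, and the data are hypothetical.

```python
def make_flags(values, reference=None):
    """Build one 0/1 flag list per category, optionally omitting a reference category."""
    categories = sorted(set(values))
    if reference is not None:
        # Raises ValueError if the reference category does not occur in the data.
        categories.remove(reference)
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

regions = ["north", "south", "east", "north"]  # hypothetical categorical field
flags = make_flags(regions, reference="east")
```

A record with 0 in every flag belongs to the reference category, so no information is lost by dropping it.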
19
Binning
A data preprocessing technique used to reduce the effects of minor observation errors or noise and to group continuous values into discrete intervals or “bins”. This can simplify the data and make patterns more noticeable during analysis or modeling.
20
Equal Width Binning
Divides the numerical predictor into k categories of equal width, where k is chosen by the client or analyst.
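A minimal sketch of equal width binning, assuming the bins span the range from the field minimum to the field maximum. The function name and data are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1, where every bin covers (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        bins.append(min(index, k - 1))  # the maximum value belongs to the last bin
    return bins

x = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical predictor
width_bins = equal_width_bins(x, 3)
```

With this data almost every record lands in the first bin and the middle bin is empty, which illustrates how sensitive equal width binning is to outliers.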
21
Equal Frequency Binning
Divides the numerical predictor into k categories, each having n/k records, where n is the total number of records.
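A minimal sketch of equal frequency binning, in which records are ranked by value and each bin receives roughly the same number of records. The function name and data are hypothetical.

```python
def equal_frequency_bins(values, k):
    """Assign bin indices 0..k-1 so that each bin holds about the same number of records."""
    n = len(values)
    # Indices of the records, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

x_freq = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical predictor
freq_bins = equal_frequency_bins(x_freq, 3)
```

Each bin receives four records here, but the fifth value of 1 is pushed into the second bin, so equal values can end up in different categories.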
22
Binning by Clustering
Uses a clustering algorithm such as k-means clustering to automatically calculate the “optimal” partitioning.
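As a rough illustration, a one-dimensional k-means can be written in a few lines. This is a sketch, not a production implementation: it assumes k is at least 2 and no larger than the number of distinct values, and it starts from centers spread evenly across the sorted distinct values, which is an arbitrary choice. The function name and data are hypothetical.

```python
def kmeans_1d_bins(values, k, iterations=100):
    """Cluster one-dimensional values with k-means; return (bin labels, cluster centers)."""
    distinct = sorted(set(values))
    # Assumption: initial centers are spread evenly over the sorted distinct values.
    centers = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

x_cluster = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical predictor
cluster_bins, cluster_centers = kmeans_1d_bins(x_cluster, 3)
```

For this data the clusters follow the natural gaps: the values near 1, the values near 11, and the single value 44.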
23
Binning based on Predictive Value
Partitions the numerical predictor based on the effect each partition has on the value of the target variable.
24
Unary Variables
Variables that take on only a single value, so they are not so much variables as constants. They should usually be removed.
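A minimal sketch of removing unary variables, assuming the data set is held as a dictionary that maps column names to lists of values. The function name, column names, and data are hypothetical.

```python
def drop_unary_columns(table):
    """Keep only the columns that take on more than one distinct value."""
    return {name: column for name, column in table.items() if len(set(column)) > 1}

records = {
    "age": [25, 31, 47],
    "country": ["US", "US", "US"],  # unary: the same value in every record
    "member": [True, False, True],
}
cleaned = drop_unary_columns(records)
```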
25
Numerics

Decimal or Integer Values

26
Integers
Whole Numbers
27
Boolean Values
A true/false value dichotomy
28
Logical
Boolean Values