Data Mining Chapter Two

28 Terms

GIGO

Garbage In - Garbage Out

Replacement Value Method One

Replace the missing value with some constant, specified by the analyst.

Replacement Value Method Two

Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).

Replacement Value Method Three

Replace the missing values with a value generated at random from the observed distribution of the variable.

Replacement Value Method Four

Replace the missing values with imputed values based on the other characteristics of the record.
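
The first three replacement methods are simple enough to sketch in a few lines of Python. The following is a minimal illustration using only the standard library; the function names and the use of None to mark a missing value are assumptions made for the example. Method four (imputation from the record's other characteristics) needs a predictive model and is not shown.

```python
import random
from statistics import mean, mode

def fill_constant(values, constant):
    """Method one: replace each missing entry (None) with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]

def fill_center(values, numeric=True):
    """Method two: use the field mean for numeric data, the mode for categorical data."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

def fill_random(values, rng=None):
    """Method three: draw each replacement at random from the observed values."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [20, None, 30, 40, None]          # hypothetical numeric field
colors = ["red", "blue", None, "red"]    # hypothetical categorical field

print(fill_constant(ages, 0))
print(fill_center(ages))
print(fill_center(colors, numeric=False))
print(fill_random(ages, random.Random(0)))
```

Drawing from the observed values keeps the replacements on the same scale as the real data, whereas a constant or the mean can distort the spread of the field.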

Measures of Center

The most common are mean, median, and mode.

Measures of Location

Numerical summaries that indicate where on the number line a certain characteristic of the variable lies. Measures of center are a special case of measures of location. Other examples are percentiles and quantiles.
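
As a small illustration, the measures of center and a few quantiles can be computed with Python's statistics module. The data values are made up for the example, and the "inclusive" quartile convention is only one of several in common use.

```python
from statistics import mean, median, mode, quantiles

data = [2, 3, 3, 5, 7, 8, 14]  # hypothetical sample

# Measures of center
center = {"mean": mean(data), "median": median(data), "mode": mode(data)}

# Quartiles are measures of location: the 25th, 50th and 75th percentiles
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(center)
print(q1, q2, q3)
```

The second quartile coincides with the median, which shows how a measure of center is one particular measure of location.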

Normalization

Refers to the process of scaling input data to fall within a specific range, typically [0, 1] or [-1, 1]. This is done so that all features contribute equally to the analysis and so that algorithms, particularly those based on distance calculations (like k-nearest neighbors or clustering), perform well.

Normalization is important

When features have different units or scales, as it prevents larger-scaled features from dominating the analysis.

Min-Max Normalization

Transforms values to a range between a minimum and maximum (usually 0 and 1).
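
A minimal sketch of min-max normalization, assuming the usual formula (x - min) / (max - min), optionally rescaled to a chosen target range. The function and parameter names are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max_normalize([10, 20, 30]))
print(min_max_normalize([10, 20, 30], new_min=-1.0, new_max=1.0))
```

Because the result depends only on the minimum and maximum, a single extreme value can squeeze all other values into a narrow part of the range.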

Z-score Standardization

Center the data around zero by subtracting the mean and dividing by the standard deviation.
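
The following sketch applies z-score standardization with the standard library. It assumes the sample standard deviation (divisor n - 1); some texts use the population version instead.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample SD; use pstdev for the population SD
    return [(v - mu) / sigma for v in values]

scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data
print(scores)
```

After standardization the values have mean 0 and standard deviation 1, so a score of 2 reads as "two standard deviations above the mean".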

Decimal Scaling

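Moves the decimal point of every value by dividing by 10^d, where d is the number of digits in the value with the largest absolute value, so that the scaled values fall between -1 and 1.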
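
A minimal sketch of decimal scaling, assuming the common definition in which every value is divided by 10^d, with d the number of digits in the integer part of the largest absolute value. The function name is illustrative.

```python
def decimal_scale(values):
    """Divide every value by 10**d so that all results lie between -1 and 1."""
    largest = max(abs(v) for v in values)
    # d = number of digits in the integer part of the largest absolute value
    # (a value below 1 still counts as one digit here, a simplification)
    d = len(str(int(largest)))
    return [v / 10 ** d for v in values]

print(decimal_scale([-450, 120, 999]))
```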

Skewness

Bias of data towards one end or another of a distribution. One common measure is Pearson's second skewness coefficient: skewness = 3 × (mean - median) / standard deviation.
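
One widely used skewness statistic is Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. Whether this is the exact equation the card refers to is an assumption; the sketch below simply shows how that statistic behaves.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson's second skewness coefficient: positive for right skew, negative for left skew."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]       # hypothetical data, mean equals median
right_skewed = [1, 2, 3, 4, 20]   # hypothetical data with a long right tail

print(pearson_skewness(symmetric))
print(pearson_skewness(right_skewed))
```

A long right tail pulls the mean above the median, which makes the statistic positive.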

Right Skew

Data whose longer tail extends to the right, so that the mean is greater than the median. This is the more commonly encountered type of skew.

Transformations

Used to reduce the skewness of the data and bring it closer to a normal distribution. Common choices include the square root transformation and the inverse square root transformation.
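
The sketch below applies both transformations to a small right-skewed sample. The data are deliberately constructed (perfect squares) so that the square root comes out exactly symmetric; real data will rarely behave this cleanly. Skewness is measured here with Pearson's second coefficient, which is a choice made for the example.

```python
import math
from statistics import mean, median, stdev

def skew(values):
    """Pearson's second skewness coefficient, used here only as a yardstick."""
    return 3 * (mean(values) - median(values)) / stdev(values)

original = [1, 4, 9, 16, 25]                        # contrived right-skewed data
sqrt_t = [math.sqrt(v) for v in original]           # square root transformation
inv_sqrt_t = [1 / math.sqrt(v) for v in original]   # inverse square root transformation

print(skew(original), skew(sqrt_t))
print(inv_sqrt_t)
```

Both transformations require positive values, and the inverse square root reverses the ordering of the data.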

Z-score method for identifying outliers

States that a data value is an outlier if it has a Z-score that is either less than -3 or greater than 3.
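
A minimal sketch of this rule, assuming the sample standard deviation and a made-up data set. The function name and the threshold parameter are illustrative.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

data = [10] * 10 + [12] * 10 + [100]  # hypothetical data with one extreme value
print(z_score_outliers(data))
```

The mean and standard deviation are themselves inflated by the extreme value, so in small samples this method can fail to flag a value that is clearly unusual.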

Interquartile Range (IQR)

A measure of variability that is much more robust than the standard deviation. It is calculated as IQR = Q3 - Q1 and may be interpreted as the spread of the middle 50% of the data. A robust method of outlier detection is therefore defined as follows: a data value is an outlier if it lies 1.5 × IQR or more below Q1, or 1.5 × IQR or more above Q3.
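
The standard IQR-based rule flags a value that lies 1.5 × IQR or more below Q1 or 1.5 × IQR or more above Q3. The sketch below assumes that rule and the "inclusive" quartile convention of Python's statistics module; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values at least 1.5 * IQR below Q1 or at least 1.5 * IQR above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v <= lower or v >= upper]

data = [2, 3, 3, 5, 7, 8, 30]  # hypothetical data with one extreme value
print(iqr_outliers(data))
```

Because the quartiles barely move when one extreme value changes, the fences stay stable, which is what makes this method more robust than the Z-score method.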

Flag Variables

Also known as dummy variables or indicator variables. Categories are recoded as numerical indicators that take the value 0 or 1, which allows statistical analysis that cannot be done directly on qualitative categories.
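
A minimal sketch of the recoding. It follows the usual convention that a variable with k categories needs only k - 1 flags, because the remaining category is implied when all flags are 0. The function name, the "is_" prefix, and the choice of reference category are assumptions for the example.

```python
def flag_variables(values, reference=None):
    """Turn a categorical field into k - 1 flag (0/1) fields, one per non-reference category."""
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # assumption: first category alphabetically is the reference
    flagged = [c for c in categories if c != reference]
    return [{f"is_{c}": int(v == c) for c in flagged} for v in values]

regions = ["north", "south", "east"]  # hypothetical categorical field
for row in flag_variables(regions):
    print(row)
```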

Binning

A data preprocessing technique used to reduce the effects of minor observation errors or noise and to group continuous values into discrete intervals or “bins”. This can simplify the data and make patterns more noticeable during analysis or modeling.

Equal Width Binning

Divides the numerical predictor into k categories of equal width, where k is chosen by the client or analyst.
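
A minimal sketch of equal width binning on a small made-up data set. The function name is illustrative, and bins are numbered from 0.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form intervals")
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin (k - 1)
    return [min(int((v - lo) / width), k - 1) for v in values]

data = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical data
print(equal_width_bins(data, 3))
```

With this data, eleven of the twelve values fall in the first bin and the middle bin is empty, which shows how strongly a single large value affects equal width bins.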

Equal Frequency Binning

Divides the numerical predictor into k categories, each having n/k records, where n is the total number of records.
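
A minimal sketch of equal frequency binning, in which each of the k bins receives about n/k records. Records are ranked by value and the ranks are cut into k groups; the function name is illustrative.

```python
def equal_frequency_bins(values, k):
    """Assign each value to one of k bins so that every bin holds about n / k records."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # record indices, smallest value first
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

data = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical data
print(equal_frequency_bins(data, 3))
```

Each bin gets four records here, but the five equal values of 1 are split across two bins, a known drawback of forcing equal counts.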

Binning by Clustering

Uses a clustering algorithm such as k-means clustering to automatically calculate the “optimal” partitioning.
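
As an illustration, here is a small one-dimensional k-means sketch in plain Python. The starting centers are picked from the distinct sorted values, which is a simplification chosen to keep the example deterministic; library implementations typically use randomized initialization, and k-means in general can settle on different partitions depending on where it starts.

```python
def kmeans_1d_bins(values, k, max_iter=100):
    """Bin one-dimensional data by k-means; returns (bin label per value, final centers)."""
    distinct = sorted(set(values))
    m = len(distinct)
    if k > m:
        raise ValueError("k cannot exceed the number of distinct values")
    # Assumption: deterministic start, centers spread over the distinct sorted values
    centers = [distinct[(2 * j + 1) * m // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

data = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical data
labels, centers = kmeans_1d_bins(data, 3)
print(labels)
print(centers)
```

On this data the clusters follow the natural gaps: the small values, the values near 11 and 12, and the single large value each form their own bin.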

Binning based on Predictive Value

Partitions the numerical predictor based on the effect each partition has on the value of the target variable.

Unary Variables

Variables that take on only a single value, so they are not so much variables as constants. They should usually be removed.
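
A minimal sketch for spotting unary variables, assuming the records are held as a list of dictionaries with identical keys. The field names and data are made up for the example.

```python
def unary_columns(records):
    """Return the names of fields that take exactly one distinct value across all records."""
    columns = list(records[0].keys())
    return [c for c in columns if len({r[c] for r in records}) == 1]

records = [
    {"country": "US", "age": 34, "plan": "basic"},
    {"country": "US", "age": 51, "plan": "premium"},
    {"country": "US", "age": 27, "plan": "basic"},
]
print(unary_columns(records))
```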

Numerics

Decimal or Integer Values

Integers

Whole Numbers

Boolean Values

A true/false value dichotomy

Logical

Boolean Values