Course Information
Course Code: CS 422
Instructor: Vijay K. Gurbani, Ph.D.
Affiliation: Illinois Institute of Technology
Topics Covered:
Distance Measures
Data Transformation
Standardization and Scaling
Binarization and Discretization
Distance Measures
Importance of Distance Measures
Distance measures are essential for determining how similar or dissimilar two points are in an n-dimensional space.
They also underpin clustering, where points are grouped by proximity.
Types of Distance Measures
Euclidean Distance
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Specific Distance Measures
Manhattan Distance
Known as taxicab distance or the L1 norm.
It measures distance along axes at right angles:
\[ d(x, y) = \sum_{k=1}^{n} |x_k - y_k| \]
Euclidean Distance
Known as L2 norm.
Defined mathematically as:
\[ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]
where \( n \) is the number of dimensions.
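A minimal Python sketch of both norms (the example points are illustrative, not from the lecture):

```python
import math

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # L2 norm: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = (1, 2), (4, 6)
print(manhattan(x, y))  # 7
print(euclidean(x, y))  # 5.0
```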
Minkowski Distance
A generalization of the above distance measures.
Depends on the parameter \( r \):
If \( r = 1 \), it is the Manhattan distance.
If \( r = 2 \), it is the Euclidean distance.
If \( r = \infty \), it is the supremum distance (Lmax).
Important to note that \( r \) (the order of the norm) should not be confused with \( n \) (the number of dimensions).
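The effect of \( r \) can be demonstrated with a small sketch (the `minkowski` helper and points are illustrative):

```python
def minkowski(x, y, r):
    # L_r norm; r = float('inf') gives the supremum (Lmax) distance
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float('inf'):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 1))             # 7.0 -- Manhattan
print(minkowski(x, y, 2))             # 5.0 -- Euclidean
print(minkowski(x, y, float('inf')))  # 4   -- supremum
```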
Mahalanobis Distance
Measures the distance of a point from a distribution rather than from another point.
Useful because it accounts for the variance and correlation of the data: distance is measured relative to the data's spread, not in raw units.
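A sketch of this idea using NumPy (the sample data is invented for illustration):

```python
import numpy as np

def mahalanobis(point, data):
    # Distance of `point` from the distribution of `data` (rows = samples).
    # The inverse covariance matrix rescales each direction, so deviations
    # along high-variance directions count for less.
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = point - mu
    return float(np.sqrt(d @ cov_inv @ d))

data = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(mahalanobis(np.array([0.0, 0.0]), data))  # 0.0 -- the mean itself
print(mahalanobis(np.array([1.0, 0.0]), data))
```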
Data Transformation
Binarization (One-Hot Encoding)
Converts categorical attributes into binary attributes.
Example of conversion shown in tables:
Table illustrating transformation of categorical values into binary attributes (X1, X2, X3).
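A minimal one-hot encoding sketch (the category values are hypothetical, standing in for the table's categorical attribute):

```python
def one_hot(values):
    # Map each categorical value to a binary indicator vector,
    # one column per distinct category.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["red", "green", "red"]))
# [[0, 1], [1, 0], [0, 1]]  -- columns are (green, red)
```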
Discretization
Converts continuous data into discrete categories.
Process example: sort the data, choose split points, then map each value to the category defined by the interval it falls into.
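The split-point mapping step can be sketched as follows (the values and split points are illustrative):

```python
import bisect

def discretize(values, split_points):
    # Map each value to the index of the interval it falls into,
    # given ascending split points (e.g., chosen from sorted data).
    return [bisect.bisect_right(split_points, v) for v in values]

print(discretize([1, 4, 6, 9], [3, 7]))  # [0, 1, 1, 2]
```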
Standardization and Scaling
Definitions
Standardization: Transforms data to have mean = 0 and standard deviation = 1 (Z-score).
Normalization: Scales data to the range [0, 1] (min-max).
These are distinct operations, though the terms are often mistakenly used interchangeably.
Need for Standardization
Important to standardize when features' scales differ significantly.
Example features: Age, Height, Weight can have vastly different ranges affecting distance computations.
Impact of Not Scaling
Not scaling can slow convergence in gradient-based algorithms and distorts any method that relies on Euclidean distances.
Features with larger magnitudes dominate the distance calculation, leading to biased outcomes.
Feature Scaling Strategies
Z-score normalization
Mean normalization
Min-max scaling
Unit vector scaling
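Minimal sketches of the four strategies (the sample values are illustrative):

```python
import math

def zscore(xs):        # standardization: mean 0, std 1
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sd for x in xs]

def mean_norm(xs):     # mean normalization: centered, divided by the range
    mu = sum(xs) / len(xs)
    lo, hi = min(xs), max(xs)
    return [(x - mu) / (hi - lo) for x in xs]

def min_max(xs):       # min-max scaling: maps to [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def unit_vector(xs):   # unit vector scaling: divide by the L2 norm
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

print(min_max([10, 20, 30]))    # [0.0, 0.5, 1.0]
print(mean_norm([10, 20, 30]))  # [-0.5, 0.0, 0.5]
```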
When to Scale
Scale when attributes are numeric and the model is sensitive to magnitudes, such as neural networks.
Decision trees split on value orderings, so scaling is generally not needed for them, but it should still be considered during preprocessing.
Practical Example
Given points in feature space, compare distances before and after scaling to demonstrate importance.
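One way to run this comparison, with hypothetical age/income points: unscaled, the income feature dominates the distance, and min-max scaling changes which point is the nearest neighbor.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def min_max_cols(rows):
    # Min-max scale each feature (column) to [0, 1]
    cols = list(zip(*rows))
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in cols]
    return [tuple(col[i] for col in scaled) for i in range(len(rows))]

# Hypothetical points: (age in years, income in dollars)
rows = [(25, 50_000), (45, 52_000), (26, 80_000)]
a, b, c = rows

# Unscaled: income dominates, so a looks much closer to b than to c
print(euclidean(a, b) < euclidean(a, c))   # True

# Scaled: both features contribute comparably, and the ordering flips
sa, sb, sc = min_max_cols(rows)
print(euclidean(sa, sb) < euclidean(sa, sc))  # False
```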