Course Information
Course Code: CS 422
Instructor: Vijay K. Gurbani, Ph.D.
Affiliation: Illinois Institute of Technology
Topics Covered:
Distance Measures
Data Transformation
Standardization and Scaling
Binarization and Discretization
Distance Measures
Importance of Distance Measures
Distance measures are essential for determining how similar or dissimilar two points are in an n-dimensional space.
They also underpin clustering, where points are grouped by proximity.
Types of Distance Measures
Euclidean Distance
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Specific Distance Measures
Manhattan Distance
Known as taxicab distance or the L1 norm.
It measures distance along axes at right angles:
\[ d(x, y) = \sum_{k=1}^{n} |x_k - y_k| \]
Euclidean Distance
Known as L2 norm.
Defined mathematically as:
\[ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]
where \( n \) is the number of dimensions.
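A minimal Python sketch of both norms (the example points are illustrative, not from the lecture):

```python
import math

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # L2 norm: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = (1, 2), (4, 6)
print(manhattan(x, y))  # 7
print(euclidean(x, y))  # 5.0
```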
Minkowski Distance
A generalization of the above distance measures.
Depends on the parameter \( r \):
If \( r = 1 \), it is the Manhattan distance.
If \( r = 2 \), it is the Euclidean distance.
If \( r = \infty \), it is the supremum distance (Lmax).
Important to note that \( r \) (the order of the norm) should not be confused with \( n \) (the number of dimensions).
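The effect of \( r \) can be demonstrated with a small sketch (the `minkowski` helper and points are illustrative):

```python
def minkowski(x, y, r):
    # L_r norm; r = float('inf') gives the supremum (Lmax) distance
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float('inf'):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 1))             # 7.0 -- Manhattan
print(minkowski(x, y, 2))             # 5.0 -- Euclidean
print(minkowski(x, y, float('inf')))  # 4   -- supremum
```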
Mahalanobis Distance
Measures the distance of a point from a distribution rather than from another point.
Useful because it accounts for the variance and correlation of the data: distance is measured relative to the data's spread, not in raw units.
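A sketch of this idea using NumPy (the sample data is invented for illustration):

```python
import numpy as np

def mahalanobis(point, data):
    # Distance of `point` from the distribution of `data` (rows = samples).
    # The inverse covariance matrix rescales each direction, so deviations
    # along high-variance directions count for less.
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = point - mu
    return float(np.sqrt(d @ cov_inv @ d))

data = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(mahalanobis(np.array([0.0, 0.0]), data))  # 0.0 -- the mean itself
print(mahalanobis(np.array([1.0, 0.0]), data))
```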
Data Transformation
Binarization (One-Hot Encoding)
Converts categorical attributes into binary attributes.
Example of conversion shown in tables:
Table illustrating transformation of categorical values into binary attributes (X1, X2, X3).
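A minimal one-hot encoding sketch (the category values are hypothetical, standing in for the table's categorical attribute):

```python
def one_hot(values):
    # Map each categorical value to a binary indicator vector,
    # one column per distinct category.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["red", "green", "red"]))
# [[0, 1], [1, 0], [0, 1]]  -- columns are (green, red)
```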
Discretization
Converts continuous data into discrete categories.
Process example: sort the data, choose split points, then map each value to the category defined by the interval it falls into.
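The split-point mapping step can be sketched as follows (the values and split points are illustrative):

```python
import bisect

def discretize(values, split_points):
    # Map each value to the index of the interval it falls into,
    # given ascending split points (e.g., chosen from sorted data).
    return [bisect.bisect_right(split_points, v) for v in values]

print(discretize([1, 4, 6, 9], [3, 7]))  # [0, 1, 1, 2]
```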
Standardization and Scaling
Definitions
Standardization: Transforms data to have mean = 0 and standard deviation = 1 (Z-score).
Normalization: Scales data to the range [0, 1] (min-max).
These are distinct operations, though the terms are often mistakenly used interchangeably.
Need for Standardization
Important to standardize when features' scales differ significantly.
Example features: Age, Height, Weight can have vastly different ranges affecting distance computations.
Impact of Not Scaling
Not scaling can slow convergence in gradient-based algorithms and distorts any method that relies on Euclidean distances.
Features with larger magnitudes dominate the distance calculation, leading to biased outcomes.
Feature Scaling Strategies
Z-score normalization
Mean normalization
Min-max scaling
Unit vector scaling
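Minimal sketches of the four strategies (the sample values are illustrative):

```python
import math

def zscore(xs):        # standardization: mean 0, std 1
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sd for x in xs]

def mean_norm(xs):     # mean normalization: centered, divided by the range
    mu = sum(xs) / len(xs)
    lo, hi = min(xs), max(xs)
    return [(x - mu) / (hi - lo) for x in xs]

def min_max(xs):       # min-max scaling: maps to [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def unit_vector(xs):   # unit vector scaling: divide by the L2 norm
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

print(min_max([10, 20, 30]))    # [0.0, 0.5, 1.0]
print(mean_norm([10, 20, 30]))  # [-0.5, 0.0, 0.5]
```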
When to Scale
Scale when attributes are numeric and the model is sensitive to magnitudes, such as neural networks.
Decision trees split on value orderings, so scaling is generally not needed for them, but it should still be considered during preprocessing.
Practical Example
Given points in feature space, compare distances before and after scaling to demonstrate importance.
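One way to run this comparison, with hypothetical age/income points: unscaled, the income feature dominates the distance, and min-max scaling changes which point is the nearest neighbor.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def min_max_cols(rows):
    # Min-max scale each feature (column) to [0, 1]
    cols = list(zip(*rows))
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in cols]
    return [tuple(col[i] for col in scaled) for i in range(len(rows))]

# Hypothetical points: (age in years, income in dollars)
rows = [(25, 50_000), (45, 52_000), (26, 80_000)]
a, b, c = rows

# Unscaled: income dominates, so a looks much closer to b than to c
print(euclidean(a, b) < euclidean(a, c))   # True

# Scaled: both features contribute comparably, and the ordering flips
sa, sb, sc = min_max_cols(rows)
print(euclidean(sa, sb) < euclidean(sa, sc))  # False
```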