scale-notes

Course Information

  • Course Code: CS 422

  • Instructor: Vijay K. Gurbani, Ph.D.

  • Affiliation: Illinois Institute of Technology

  • Topics Covered:

    • Distance Measures

    • Data Transformation

    • Standardization and Scaling

    • Binarization and Discretization

Distance Measures

Importance of Distance Measures

  • Distance measures are essential for determining how similar or dissimilar two points are in an n-dimensional space.

  • They underpin clustering, where points that lie close together are grouped for analysis.

Types of Distance Measures

  • Euclidean Distance

  • Manhattan Distance

  • Minkowski Distance

  • Mahalanobis Distance

Specific Distance Measures

Manhattan Distance

  • Also known as taxicab (city block) distance or the L1 norm.

  • Sums the absolute differences between coordinates, i.e. the distance traveled along axes at right angles.
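In the same notation as the Euclidean formula in these notes, the Manhattan distance is commonly written as the sum of absolute coordinate differences, where n is the number of dimensions:

```latex
\[ d(x, y) = \sum_{k=1}^{n} |x_k - y_k| \]
```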

Euclidean Distance

  • Also known as the L2 norm.

  • Defined mathematically as:

    \[ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]

    • Where n is the number of dimensions and x_k, y_k are the k-th attributes of x and y.
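The formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the function name `euclidean_distance` and the sample points are assumptions chosen for the example.

```python
import math

def euclidean_distance(x, y):
    """Return the L2 distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Two hypothetical 2-dimensional points forming a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```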

Minkowski Distance

  • Generalizes the Manhattan and Euclidean distances into a single family of measures.

  • Depends on the parameter r:

    • If r = 1, it is the Manhattan distance.

    • If r = 2, it is the Euclidean distance.

    • As r → ∞, it becomes the supremum distance (Lmax, or L∞ norm).

  • Note that r is a parameter of the distance measure and should not be confused with n, the number of dimensions.
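The notes do not spell out the Minkowski formula; the standard definition is d(x, y) = (sum over k of |x_k - y_k|^r)^(1/r). The sketch below assumes that definition and treats r = infinity as the maximum absolute difference. The function name and sample points are illustrative only.

```python
import math

def minkowski_distance(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if r < 1:
        raise ValueError("r must be at least 1")
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        # Limit case: only the largest per-dimension difference matters.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = (0, 0), (3, 4)
print(minkowski_distance(x, y, 1))         # 7.0 (Manhattan)
print(minkowski_distance(x, y, 2))         # 5.0 (Euclidean)
print(minkowski_distance(x, y, math.inf))  # 4   (supremum)
```

Here n is 2 in every call (the points are 2-dimensional), while r changes from call to call, which shows that the two are independent.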

Mahalanobis Distance

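  • Measures the distance of a point from a distribution rather than from another single point.

  • Takes the variance of each attribute and the correlation between attributes into account, so the distance is expressed relative to the spread of the data.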
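As a rough illustration, the usual definition is the square root of (x - mean)^T S^-1 (x - mean), where S is the covariance matrix of the data. The sketch below assumes that definition, restricts itself to two dimensions so the 2x2 inverse can be written out by hand, and uses the sample covariance (dividing by n - 1). The function name and the data are hypothetical.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the distribution of 2-D `data`."""
    n = len(data)
    mean_x = sum(p[0] for p in data) / n
    mean_y = sum(p[1] for p in data) / n
    # Entries of the sample covariance matrix (divide by n - 1).
    sxx = sum((p[0] - mean_x) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]] expanded.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Hypothetical data that varies more along x than along y.
data = [(-2, 0), (2, 0), (0, -1), (0, 1)]
print(mahalanobis_2d((2, 0), data))  # about 1.22
print(mahalanobis_2d((0, 2), data))  # about 2.45
```

Both query points are at Euclidean distance 2 from the mean, yet the point along the low-variance y axis is farther in Mahalanobis terms.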

Data Transformation

Binarization (One-Hot Encoding)

  • Converts categorical attributes into binary attributes.

  • Example of conversion shown in tables:

    • Table illustrating transformation of categorical values into binary attributes (X1, X2, X3).
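The table from the original notes is not reproduced here. As a stand-in, the following sketch shows one common form of one-hot encoding, with one binary attribute per distinct category. The function name and the color values are hypothetical and are not taken from the course material.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per record
        encoded.append(row)
    return categories, encoded

colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
for value, row in zip(colors, rows):
    print(value, row)
```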

Discretization

  • Converts continuous data into discrete categories.

  • Process Example: Sort data, create split points, map values to categories.
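The three steps (sort, create split points, map values) can be sketched as below. The notes do not say how split points are chosen, so this example assumes equal-frequency binning; the function names, the age values, and the category labels are all hypothetical.

```python
from bisect import bisect_right

def equal_frequency_splits(values, n_bins):
    """Return n_bins - 1 split points giving roughly equal counts per bin."""
    ordered = sorted(values)  # step 1: sort the data
    splits = []
    for i in range(1, n_bins):
        cut = i * len(ordered) // n_bins
        # Step 2: place the split halfway between neighbouring sorted values.
        splits.append((ordered[cut - 1] + ordered[cut]) / 2)
    return splits

def discretize(value, splits, labels):
    """Step 3: map a continuous value to the label of the bin it falls in."""
    return labels[bisect_right(splits, value)]

ages = [22, 25, 31, 38, 45, 52, 60, 67, 71]
splits = equal_frequency_splits(ages, 3)
labels = ["young", "middle", "senior"]
print(splits)                              # [34.5, 56.0]
print(discretize(25, splits, labels))      # young
print(discretize(45, splits, labels))      # middle
print(discretize(67, splits, labels))      # senior
```

This sketch assumes there are at least as many values as bins.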

Standardization and Scaling

Definitions

  • Standardization: Transforms data to have mean = 0 and standard deviation = 1 (Z-score).

  • Normalization: Rescales data to the range between 0 and 1 (min-max scaling).

  • These terms are often mistakenly used interchangeably.
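The difference between the two definitions is easiest to see side by side. The sketch below uses only the Python standard library and assumes the population standard deviation for the Z-score; a sample standard deviation is an equally common choice. The function names and the height values are hypothetical.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the (population) std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

heights_cm = [150, 160, 170, 180, 190]
z_scores = standardize(heights_cm)
normalized = min_max_normalize(heights_cm)
print([round(z, 3) for z in z_scores])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
print(normalized)                       # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardized values are centered on 0 and are not confined to a fixed range, while normalized values always lie between 0 and 1. Both functions assume the input is not constant, since that would cause a division by zero.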

Need for Standardization

  • Important to standardize when features' scales differ significantly.

  • Example: Age, Height, and Weight can have vastly different ranges, which skews distance computations toward the feature with the largest range.

Impact of Not Scaling

  • Not scaling can slow down convergence in algorithms that utilize Euclidean distances.

  • Features with larger magnitudes may dominate distance calculations, leading to biased outcomes.

Feature Scaling Strategies

  • Z-score normalization (standardization)

  • Mean normalization

  • Min-max scaling

  • Unit vector scaling
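The notes list these strategies without formulas. Z-score and min-max scaling follow the definitions given earlier; for the other two, the sketch below assumes the commonly used forms: mean normalization as (x - mean) / (max - min), applied per feature, and unit vector scaling as x / ||x||, applied per record using the L2 norm. The function names and inputs are illustrative.

```python
import math

def mean_normalize(values):
    """Center one feature on its mean and divide by its range (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]

def unit_vector(vector):
    """Scale one record so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

print(mean_normalize([10, 20, 30, 40, 50]))  # [-0.5, -0.25, 0.0, 0.25, 0.5]
print(unit_vector([3, 4]))                   # [0.6, 0.8]
```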

When to Scale

  • Scale numeric attributes when using neural networks.

  • Decision trees split on one attribute at a time using thresholds, so scaling is usually not needed, but it should still be considered during preprocessing.

Practical Example

  • Given points in feature space, compare distances before and after scaling to demonstrate importance.
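The points used in the original example are not included in these notes, so the sketch below uses made-up data: four hypothetical people described by age (years) and income (dollars). It assumes min-max scaling and Euclidean distance, and checks which person is nearest to person "a" before and after scaling.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_max_scale(points):
    """Rescale every feature (column) of `points` to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

def nearest_to(name, table):
    """Return the key of the point closest to table[name]."""
    others = [k for k in table if k != name]
    return min(others, key=lambda k: euclidean(table[name], table[k]))

# Hypothetical (age, income) records; income has a far larger magnitude.
people = {"a": (25, 50_000), "b": (60, 51_000), "c": (27, 53_000), "d": (45, 90_000)}
scaled = dict(zip(people, min_max_scale(list(people.values()))))

print(nearest_to("a", people))  # b
print(nearest_to("a", scaled))  # c
```

Before scaling, income differences swamp the 35-year age gap, so "b" looks closest to "a". After scaling, both features contribute comparably and "c", who is similar in both age and income, becomes the nearest neighbor.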