Lecture 5 – Logistic Regression, K-Nearest Neighbors and Support Vector Machine

Lecture Overview

  • Topic: Logistic Regression, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM)

  • Course: SEIS 763, Fall 2025

  • Lecture Recording Status: Will be recorded

  • This lecture provides an in-depth look into three fundamental machine learning algorithms for classification, exploring their theoretical foundations, mathematical underpinnings, and practical applications.

Logistic Regression

Definition and Functionality
  • Logistic Regression:

    • Type of Binary Classifier: This means it's designed to distinguish between two classes, assigning an input to one of two categories (e.g., email spam or not spam, disease present or not present). It predicts the probability of the binary outcome (0 or 1), rather than a direct class prediction. These probabilities can then be converted into class labels using a threshold.

    • Used for linearly separable data: Works best when the two classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

Mathematical Representation
  • Logistic Regression outputs probabilities that lie within the range of 0 to 1 through an S-shaped curve (sigmoid function). The sigmoid (or logistic) function is central to logistic regression. It takes any real-valued number, which is typically a linear combination of input features, and maps it to a value between 0 and 1, making it suitable for representing probabilities.

    y = \frac{1}{1 + e^{-x}}

  • Where:

    • x represents the output of a linear model (w_0 + w_1x_1 + \dots + w_nx_n).

    • If x \to \infty, then y \to 1

    • If x \to -\infty, then y \to 0
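
  • A minimal NumPy sketch of this mapping (the function and test values here are illustrative, not from the lecture):

import numpy as np # NumPy is used only for exp
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x)) # Maps any real value into the interval (0, 1)
print(sigmoid(0)) # 0.5, the midpoint of the S-curve
print(sigmoid(10)) # ~1 for large positive inputs
print(sigmoid(-10)) # ~0 for large negative inputs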

Transition from Linear to Logistic Regression
  • The general form of the underlying linear equation:
    y_{\text{linear}} = w_0 + w_1x
    This linear combination can range from -\infty to +\infty.

  • The probability of the output given input is obtained by passing the linear output through the sigmoid function:
    p(x) = \frac{1}{1 + e^{-y_{\text{linear}}}}

  • For interpretation, the odds (ratio of probability of success to probability of failure) can be expressed as:
    \text{Odds} = \frac{p}{1-p}

  • Taking the natural logarithm of the odds (the log-odds or logit function) recovers the linear form, which is what logistic regression models directly:
    \ln\left(\frac{p}{1-p}\right) = w_0 + w_1x
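
  • A brief numeric check of this round trip (the probability value 0.8 is an arbitrary illustration):

import numpy as np # For log and exp
p = 0.8 # Example probability of the positive class
odds = p / (1 - p) # 0.8 / 0.2 = 4.0
log_odds = np.log(odds) # ln(4) ≈ 1.386 -- this is the linear quantity w_0 + w_1x
p_back = 1 / (1 + np.exp(-log_odds)) # The sigmoid recovers the original probability, 0.8
print(odds, log_odds, p_back)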

Classification Procedure
  • Classification of probabilities determined by threshold settings:

    • If p > \text{threshold}, then classify as 1 (positive class).

    • Otherwise, classify as 0 (negative class).

    • The threshold is typically set at 0.5 by default, but it can be adjusted based on the specific application's need to prioritize precision or recall. For example, a lower threshold might be used when minimizing false negatives is critical (e.g., in medical diagnosis).
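
  • A minimal scikit-learn sketch of this procedure (X_train, y_train, and X_test are assumed to exist, as in the later code examples; the lowered 0.3 threshold is just an illustration):

from sklearn.linear_model import LogisticRegression # Import the logistic regression classifier
model = LogisticRegression() # Initialize the model; predict() would apply the default 0.5 threshold
model.fit(X_train, y_train) # Train the model using training data and labels
probs = model.predict_proba(X_test)[:, 1] # Predicted probability of the positive class
labels = (probs > 0.3).astype(int) # Apply a custom, lower threshold (e.g., to reduce false negatives)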

Visualizations
  • Probability of Passing: Graphical depiction of a logistic regression fit showing the relationship between hours studied and the likelihood of passing. This visualization helps to understand how the sigmoid curve transforms the linear relationship (e.g., hours studied) into a probability, clearly showing that as hours studied increase, the probability of passing approaches 1.

  • Sigmoid Function acts as an activation function in this method, mapping arbitrary real values to probabilities.

Loss Function
  • The Logistic Regression Loss function, also known as Binary Cross-Entropy Loss or Log Loss, is designed to evaluate model performance and guide its optimization:

    J = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\tilde{y}_i) + (1 - y_i) \log(1 - \tilde{y}_i) \right)

  • This loss function is derived from the principle of maximum likelihood. It penalizes the model more severely when it predicts a high probability for the wrong class or a low probability for the correct class.

  • Each term in the loss function represents the contribution of an individual data point's prediction (\tilde{y}_i) against its actual outcome (y_i).

    • If y_i = 1 (actual positive class), the loss for that term simplifies to -\log(\tilde{y}_i). To minimize this, \tilde{y}_i must be close to 1.

    • If y_i = 0 (actual negative class), the loss for that term simplifies to -\log(1 - \tilde{y}_i). To minimize this, \tilde{y}_i must be close to 0.

  • The total loss J is the average loss across all n samples in the dataset.
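
  • A small sketch of this loss computed directly (the label and probability arrays below are made-up illustrative values); scikit-learn's log_loss returns the same quantity:

import numpy as np # For the manual computation
from sklearn.metrics import log_loss # scikit-learn's implementation of binary cross-entropy
y_true = np.array([1, 0, 1, 1]) # Actual labels (illustrative)
y_prob = np.array([0.9, 0.2, 0.7, 0.4]) # Predicted probabilities for class 1 (illustrative)
J = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)) # The formula above, averaged over n = 4 samples
print(J, log_loss(y_true, y_prob)) # The two values match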

K-Nearest Neighbors (KNN)

Fundamentals
  • KNN is a non-parametric model that makes no assumptions about the underlying data distribution, making it flexible but often computationally intensive for large datasets during prediction. It classifies instances based on the closest training examples in the feature space rather than creating an explicit decision boundary.

  • When a new instance arrives, KNN:

    • Looks at the 'k' neighboring points (instances) in the dataset, effectively searching for the 'k' closest data points in the training set.

    • Classifies based on majority vote among these neighbors: For classification, the new instance is assigned the class label most frequent among its 'k' nearest neighbors. For regression tasks, it typically averages the values of the neighbors.

Hyperparameter 'k'
  • The choice of 'k', or number of neighbors, significantly affects the classification:

    • Choosing a smaller 'k' (e.g., k=1) makes the model more sensitive to noise and outliers, potentially leading to overfitting to the training data and a more complex, jagged decision boundary.

    • Choosing a larger 'k' can lead to smoother decision boundaries, reducing the impact of noise but potentially blurring the distinctions between classes and increasing bias (underfitting), especially if samples from other classes are included in the 'k' nearest neighbors.

    • The optimal 'k' value is typically determined through methods like cross-validation to achieve the best balance between bias and variance.
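
  • One common way to pick 'k' by cross-validation, sketched with scikit-learn (the candidate range 1–20 and the 5 folds are illustrative choices; X_train and y_train are assumed to exist as in the code example below):

from sklearn.model_selection import GridSearchCV # Exhaustive search over a parameter grid with cross-validation
from sklearn.neighbors import KNeighborsClassifier # The KNN classifier being tuned
param_grid = {"n_neighbors": range(1, 21)} # Candidate values of k
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5) # 5-fold cross-validation for each k
search.fit(X_train, y_train) # Fit and evaluate every candidate
print(search.best_params_) # The k with the best average validation score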

Distance Metrics
  • Two primary distance computations are commonly used to identify neighbors:

    • Euclidean Distance (L2):
      d(x_1, x_2) = \sqrt{\sum_{j} (x_{1j} - x_{2j})^2}
      Represents the shortest straight-line distance between two points in Euclidean space. It's the most common distance metric, akin to measuring the distance between two points with a ruler.

    • Manhattan Distance (L1):
      d(x_1, x_2) = \sum_{j} |x_{1j} - x_{2j}|
      Also known as L1 norm or taxicab geometry. It calculates the sum of the absolute differences between the coordinates of the points. Imagine navigating a city grid, where you can only move horizontally or vertically.

  • The choice between these distance metrics can impact the model's performance, especially with high-dimensional data or data with different scales.
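
  • A quick NumPy sketch of both metrics on two illustrative 2-D points (in scikit-learn, KNeighborsClassifier selects between them via its metric and p parameters):

import numpy as np # For array arithmetic
x1 = np.array([1.0, 2.0]) # Illustrative point 1
x2 = np.array([4.0, 6.0]) # Illustrative point 2
euclidean = np.sqrt(np.sum((x1 - x2) ** 2)) # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(x1 - x2)) # 3 + 4 = 7.0
print(euclidean, manhattan)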

Code Implementation
  • Example code for fitting KNN using Scikit-learn:

from sklearn.neighbors import KNeighborsClassifier # Import the KNN classifier class
model = KNeighborsClassifier(n_neighbors=5) # Initialize the model with k=5 neighbors
model.fit(X_train, y_train) # Train the model using training data and labels
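
  • Once fitted, predictions follow the usual scikit-learn pattern (a held-out X_test is assumed here):

y_pred = model.predict(X_test) # Majority vote among the 5 nearest neighbors of each test point
y_prob = model.predict_proba(X_test) # Fraction of neighbors in each class, a rough probability estimate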

Support Vector Machine (SVM)

Overview
  • SVM Philosophy: A powerful supervised learning model primarily used for classification, but also for regression tasks. It aims to find the optimal hyperplane that best separates data points into different classes.

    • Utilizes a hyperplane to maximize the margin between two classes: A hyperplane is a decision boundary that separates data points. In 2D, it's a line; in 3D, it's a plane. The margin is the distance between the hyperplane and the nearest data point from each class.

    • Aim: Ensure data points (especially the support vectors) are as far from the separating hyperplane as possible. This 'max-margin' principle leads to better generalization and robustness to unseen data, making predictions more confident and reducing the risk of overfitting.

Mechanics and Optimization
  • Decision separation is based on:

    • Support Vectors: These are the critical data points from each class that lie closest to the decision hyperplane. They 'support' the hyperplane, and if they are removed, the position of the hyperplane might change significantly. They define the width of the margin and are crucial for the SVM's decision-making process.

  • For optimizing SVM, the objective for a linear SVM is to find the weights w and bias b that minimize \|w\|, subject to the constraint that all data points are correctly classified with a margin of at least 1. The constraint is expressed as:
    y_i(w \cdot x_i + b) \ge 1
    This ensures that each data point x_i of class y_i (where y_i \in \{-1, 1\}) is on the correct side of the margin-defining hyperplanes.

  • The geometric margin for a point x_0 to a hyperplane is: d = \frac{|w \cdot x_0 + b|}{\|w\|}

    The total margin of the classifier is 2/\|w\|, which is maximized by minimizing \|w\|.
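
  • Putting these pieces together, the standard hard-margin formulation (shown here in its usual textbook form; minimizing \frac{1}{2}\|w\|^2 has the same solution as minimizing \|w\|) is:
    \min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n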

Non-linearity of Data
  • SVM can utilize kernels (kernel functions) to implicitly map data into higher-dimensional feature spaces without actually computing the coordinates in that space. This is known as the Kernel Trick, enabling effective classification of non-linear data.

  • Kernel Trick: Instead of explicitly transforming the data to a higher dimension, which can be computationally expensive or even intractable, the kernel function computes the dot product of the transformed features directly in the higher-dimensional space. This allows SVMs to find a linear separation in the higher-dimensional space that corresponds to a non-linear separation in the original lower-dimensional space.

  • Popular kernels include:

    • Linear Kernel: Used when the data is already linearly separable.

    • Polynomial Kernel: Useful for non-linear separation, suitable when classes can be separated by curved boundaries.

    • Radial Basis Function (RBF) Kernel (also known as Gaussian kernel): A very versatile kernel that allows for highly non-linear decision boundaries by effectively mapping data to an infinite-dimensional space. It's often a good default choice due to its flexibility.
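
  • For reference, the RBF kernel measures the similarity between two points as shown below, where \gamma is a hyperparameter controlling how quickly similarity decays with distance:
    K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)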

Code Implementation
  • Example of SVM implementation using Scikit-learn:

from sklearn.svm import SVC # Import the Support Vector Classifier class
model = SVC(kernel="linear") # Initialize the SVM model, specifying the kernel (e.g., "linear" or "rbf")
model.fit(X_train, y_train) # Train the model using training data and labels
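
  • As with the other classifiers, prediction and a non-linear variant follow the same pattern (X_test is assumed to exist; "scale" is scikit-learn's default gamma for SVC):

y_pred = model.predict(X_test) # Class labels from the linear-kernel model
rbf_model = SVC(kernel="rbf", gamma="scale") # RBF kernel for non-linear boundaries; "scale" is the default gamma setting
rbf_model.fit(X_train, y_train) # Train on the same data
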
Visualizations
  • Examples of results shown illustrate linear and RBF kernel classifications across features. These visualizations clearly show the differences in separation and margin effectiveness. Linear kernels create straight-line boundaries, while RBF kernels can create complex, curved boundaries to separate non-linearly distributed data, highlighting the power of the kernel trick. The width of the margin and the placement of support vectors are also clearly visible, indicating the model's confidence in classification.