Object Categorization and Detection: Methods and Frameworks

Overview of Recognition Tasks

Image Classification / Object Categorization: This involves classifying an entire image by assigning it to a correct category label and recognizing any instances of that specific category.
Object Identification: Defined as a $1$-to-$n$ matching system, where $n$ represents all possible identities. It focuses on finding a specific, particular object (e.g., distinguishing "John's car" from any other car).
Object Detection / Localization: This task involves locating objects within an image (such as people or faces) and recognizing the correct category for each located object.
Object Segmentation: This task determines which specific pixels belong to a certain object category (e.g., "Which pixels are part of an aeroplane?").

Identification vs. Categorization

Identification: The goal is to find one particular, unique object (e.g., identifying a specific car owned by an individual).
Categorization: The goal is to recognize ANY instance belonging to a category (e.g., recognizing any object that fits the definition of a "car").

Potential Applications for Object Categorization

Consumer Electronics: Integration into everyday devices for smarter interaction.
Autonomous Robots: Required for navigation and driver safety (e.g., Google's autonomous projects).
Content-Based Retrieval and Analysis: Used for searching images and videos on platforms like Google Images and YouTube.
Medical Image Analysis: Analyzing scans for diagnostic purposes.

Scope and Scale of Categories

According to Biederman (1987), there are approximately $10,000$ to $30,000$ distinct object categories in existence.

Challenges in Robustness

Realistic scenes are often crowded, cluttered, and contain overlapping objects.
Variability Factors: Systems must learn to compensate for:
- Illumination: Changes in lighting conditions.
- Object Pose: The orientation of the object.
- Clutter: Distracting elements in the background.
- Occlusions: Objects being partially hidden by other objects.
- Intra-class Appearance: Variation within the same category (e.g., different breeds of dogs).
- Viewpoint: Changes in the camera's perspective.
- Scale: Variation in the size of the object within the frame.
- Articulation: Changes in the shape of flexible objects.

The Supervised Classification Framework

Goal: Given a collection of labeled examples, develop a function to predict labels for new, unseen examples.
Mathematical Model: $f(x) = y$
- $x$ : Image features.
- $f$ : Prediction function.
- $y$ : Output (label).
Training Phase: Given a training set of labeled examples $\{(x_1, y_1), \dots, (x_N, y_N)\}$ , the function $f$ is estimated by minimizing the prediction error on the training set.
Testing Phase: The function $f$ is applied to an unseen test example $x$ to output the predicted value $y = f(x)$ .

Classification Strategies

Generative Strategy: Training data is used to build a representative probability model. It separately models class-conditional densities and priors.
- Example: modeling $Pr(\text{image, car})$ .
- Formula: $Pr(\text{car}|\text{image}) = \frac{Pr(\text{image}|\text{car}) Pr(\text{car})}{Pr(\text{image})}$ .
Discriminative Strategy: Directly constructs a decision boundary or models the posterior probability $Pr(\text{car}|\text{image})$ . It avoids modeling the underlying distribution of the class itself, focusing instead on the boundary between classes.

Object Detection via Classification: Sliding Window Approach

Basic Component: A binary classifier (e.g., Car vs. Non-car).
Procedure: If an object exists in a cluttered scene, a window is "slid" across the image to check every location.
Spatial and Scale Scanning: To find objects of different sizes, the image is down-scaled repeatedly, and the scan is performed at each scale. This converts detection into a subwindow classification problem.
Training Phase Steps:
1. Obtain labeled training data.
2. Scan locations (if relevant to the specific algorithm).
3. Extract features.
4. Define/Train the classifier.
Testing Phase Steps:
1. Input test data.
2. Scan all possible locations and scales.
3. Extract features for each window.
4. Classify features using the trained model.
5. Post-processing (e.g., Non-maxima suppression).

Post-Processing: Non-Maxima Suppression (NMS)

Multiple overlapping windows may detect the same object.
Procedure:
1. Iterate over all detections.
2. Select the box with the highest confidence score.
3. Calculate the overlap with all other boxes.
4. Remove boxes that have an overlap higher than a specific threshold (usually $0.5$ ).
Intersection over Union (IoU): The standard metric for overlap.
- $IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$

Feature Extraction: Global Appearance vs. Gradients

Global Appearance: Simple holistic descriptions like grayscale or color histograms and vectors of pixel intensities.
- Weakness: Highly sensitive to small spatial shifts, illumination changes, and intra-class variation (e.g., an "albino koala" would fail a standard color-based pixel test).
Gradient-based Representations: Focuses on edges, contours, and oriented intensity gradients.
- Invariance: Summarizing local distributions of gradients (locally orderless) offers invariance to small shifts and rotations.
- Localized Histograms: Provide more spatial information than a single global histogram.
- Contrast-Normalization: Helps correct for variable illumination.

Histograms of Oriented Gradients (HOG) Descriptor

Developed by Dalal & Triggs (CVPR 2005), the processing chain involves:
Optional Gamma Compression: Goal is to reduce the effect of overly strong gradients. Each pixel intensity $x$ is replaced by its square root: $x = \sqrt{x}$ .
Gradient Computation: Simple finite difference filters $[-1, 0, 1]$ are applied. Gradients are computed on all color channels, and the strongest one is used. No Gaussian smoothing is applied.
Spatial/Orientation Binning: The image is divided into cells (typically $8 \times 8$ pixels). A histogram of orientations is created for each cell, typically using $9$ bins ( $0$ to $180$ degrees).
- Gradient Orientation Voting: Each pixel's contribution to the histogram is weighted by its gradient magnitude.
- Orientation Formula: $\theta = \tan^{-1}\left(\frac{\partial f / \partial y}{\partial f / \partial x}\right)$
- Gradient Magnitude Formula: $||\nabla f|| = \sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}$
Contrast Normalization: Cells are grouped into overlapping blocks (e.g., $2 \times 2$ cells). The histograms within the block are normalized (usually $L2$ normalization) to account for changes in lighting/contrast.
Feature Vector Construction: Normalized HOG blocks are collected into a single large vector.
SVM Classification: The vector is passed to a Linear Support Vector Machine (SVM) to decide if the window contains the object or not.

Support Vector Machines (SVMs)

SVM is a discriminative classifier based on finding an optimal separating hyperplane (a line in 2D).
Margin: The width the boundary can be increased before hitting a data point. SVM maximizes this margin between positive and negative training examples.
Linear SVM Math:
- Separating line: $w^T x + b = 0$
- For positive examples: $w^T x_n + b \geq 1$
- For negative examples: $w^T x_n + b \leq -1$
- The margin width $M$ is calculated as: $M = \frac{2}{||w||}$ .
Optimization: Maximizing the margin is equivalent to minimizing $\frac{1}{2} w^T w$ subject to $y_i (w^T x_i + b) \geq 1$ for all $i$ .
Classification Function: $f(x) = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i x_i^T x + b\right)$ . This relies on the inner product between the test point $x$ and the support vectors $x_i$ .

Non-Linear SVMs and the Kernel Trick

If data is not linearly separable in the original space, it can be mapped to a higher-dimensional feature space $\Phi(x)$ where it becomes separable.
Kernel Trick: Instead of explicitly computing the high-dimensional mapping, a kernel function $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ is defined.
Common Kernel Functions:
- Linear: $K(x_i, x_j) = x_i^T x_j$
- Polynomial of power p: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
- Gaussian (Radial-Basis Function): $K(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2 \sigma^2}\right)$

Multi-Class SVMs

Since SVM is inherently binary, multi-class classification is achieved by combining multiple classifiers:
One vs. All: Learn an SVM for each class against all other classes combined. The test example is assigned to the class of the SVM that returns the highest decision value.
One vs. One: Learn an SVM for every possible pair of classes. For a $K$ -way problem, this results in $\frac{K(K-1)}{2}$ binary classifiers. Each SVM "votes," and the class with the most votes is chosen.

Summary: Sliding-Windows

Pros:
- Simple detection protocol to implement.
- Critical success relies on good feature choices (e.g., HOG).
- Effective for specific classes with defined shapes (e.g., pedestrians, faces).
Cons/Limitations:
- High computational complexity due to the exhaustive search over space and scale.
- Rigid structure: not all objects are "box" shaped.