Object Categorization and Detection: Methods and Frameworks

Overview of Recognition Tasks

  • Image Classification / Object Categorization: This involves classifying an entire image by assigning it to a correct category label and recognizing any instances of that specific category.

  • Object Identification: Defined as a $1$-to-$n$ matching system, where $n$ represents all possible identities. It focuses on finding a specific, particular object (e.g., distinguishing "John's car" from any other car).

  • Object Detection / Localization: This task involves locating objects within an image (such as people or faces) and recognizing the correct category for each located object.

  • Object Segmentation: This task determines which specific pixels belong to a certain object category (e.g., "Which pixels are part of an aeroplane?").

Identification vs. Categorization

  • Identification: The goal is to find one particular, unique object (e.g., identifying a specific car owned by an individual).

  • Categorization: The goal is to recognize ANY instance belonging to a category (e.g., recognizing any object that fits the definition of a "car").

Potential Applications for Object Categorization

  • Consumer Electronics: Integration into everyday devices for smarter interaction.

  • Autonomous Robots: Required for navigation and driver safety (e.g., Google's autonomous projects).

  • Content-Based Retrieval and Analysis: Used for searching images and videos on platforms like Google Images and YouTube.

  • Medical Image Analysis: Analyzing scans for diagnostic purposes.

Scope and Scale of Categories

  • According to Biederman (1987), there are approximately 10,00010,000 to 30,00030,000 distinct object categories in existence.

Challenges in Robustness

  • Realistic scenes are often crowded, cluttered, and contain overlapping objects.

  • Variability Factors: Systems must learn to compensate for:

    • Illumination: Changes in lighting conditions.

    • Object Pose: The orientation of the object.

    • Clutter: Distracting elements in the background.

    • Occlusions: Objects being partially hidden by other objects.

    • Intra-class Appearance: Variation within the same category (e.g., different breeds of dogs).

    • Viewpoint: Changes in the camera's perspective.

    • Scale: Variation in the size of the object within the frame.

    • Articulation: Changes in the shape of flexible objects.

The Supervised Classification Framework

  • Goal: Given a collection of labeled examples, develop a function to predict labels for new, unseen examples.

  • Mathematical Model: f(x)=yf(x) = y

    • xx: Image features.

    • ff: Prediction function.

    • yy: Output (label).

  • Training Phase: Given a training set of labeled examples {(x1,y1),,(xN,yN)}\{(x_1, y_1), \dots, (x_N, y_N)\}, the function ff is estimated by minimizing the prediction error on the training set.

  • Testing Phase: The function ff is applied to an unseen test example xx to output the predicted value y=f(x)y = f(x).

Classification Strategies

  • Generative Strategy: Training data is used to build a representative probability model. It separately models class-conditional densities and priors.

    • Example: modeling Pr(image, car)Pr(\text{image, car}).

    • Formula: Pr(carimage)=Pr(imagecar)Pr(car)Pr(image)Pr(\text{car}|\text{image}) = \frac{Pr(\text{image}|\text{car}) Pr(\text{car})}{Pr(\text{image})}.

  • Discriminative Strategy: Directly constructs a decision boundary or models the posterior probability Pr(carimage)Pr(\text{car}|\text{image}). It avoids modeling the underlying distribution of the class itself, focusing instead on the boundary between classes.

Object Detection via Classification: Sliding Window Approach

  • Basic Component: A binary classifier (e.g., Car vs. Non-car).

  • Procedure: If an object exists in a cluttered scene, a window is "slid" across the image to check every location.

  • Spatial and Scale Scanning: To find objects of different sizes, the image is down-scaled repeatedly, and the scan is performed at each scale. This converts detection into a subwindow classification problem.

  • Training Phase Steps:

    1. Obtain labeled training data.

    2. Scan locations (if relevant to the specific algorithm).

    3. Extract features.

    4. Define/Train the classifier.

  • Testing Phase Steps:

    1. Input test data.

    2. Scan all possible locations and scales.

    3. Extract features for each window.

    4. Classify features using the trained model.

    5. Post-processing (e.g., Non-maxima suppression).

Post-Processing: Non-Maxima Suppression (NMS)

  • Multiple overlapping windows may detect the same object.

  • Procedure:

    1. Iterate over all detections.

    2. Select the box with the highest confidence score.

    3. Calculate the overlap with all other boxes.

    4. Remove boxes that have an overlap higher than a specific threshold (usually 0.50.5).

  • Intersection over Union (IoU): The standard metric for overlap.

    • IoU=Area of OverlapArea of UnionIoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}

Feature Extraction: Global Appearance vs. Gradients

  • Global Appearance: Simple holistic descriptions like grayscale or color histograms and vectors of pixel intensities.

    • Weakness: Highly sensitive to small spatial shifts, illumination changes, and intra-class variation (e.g., an "albino koala" would fail a standard color-based pixel test).

  • Gradient-based Representations: Focuses on edges, contours, and oriented intensity gradients.

    • Invariance: Summarizing local distributions of gradients (locally orderless) offers invariance to small shifts and rotations.

    • Localized Histograms: Provide more spatial information than a single global histogram.

    • Contrast-Normalization: Helps correct for variable illumination.

Histograms of Oriented Gradients (HOG) Descriptor

  • Developed by Dalal & Triggs (CVPR 2005), the processing chain involves:

  • Optional Gamma Compression: Goal is to reduce the effect of overly strong gradients. Each pixel intensity xx is replaced by its square root: x=xx = \sqrt{x}.

  • Gradient Computation: Simple finite difference filters [1,0,1][-1, 0, 1] are applied. Gradients are computed on all color channels, and the strongest one is used. No Gaussian smoothing is applied.

  • Spatial/Orientation Binning: The image is divided into cells (typically 8×88 \times 8 pixels). A histogram of orientations is created for each cell, typically using 99 bins (00 to 180180 degrees).

    • Gradient Orientation Voting: Each pixel's contribution to the histogram is weighted by its gradient magnitude.

    • Orientation Formula: θ=tan1(f/yf/x)\theta = \tan^{-1}\left(\frac{\partial f / \partial y}{\partial f / \partial x}\right)

    • Gradient Magnitude Formula: f=(fx)2+(fy)2||\nabla f|| = \sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}

  • Contrast Normalization: Cells are grouped into overlapping blocks (e.g., 2×22 \times 2 cells). The histograms within the block are normalized (usually L2L2 normalization) to account for changes in lighting/contrast.

  • Feature Vector Construction: Normalized HOG blocks are collected into a single large vector.

  • SVM Classification: The vector is passed to a Linear Support Vector Machine (SVM) to decide if the window contains the object or not.

Support Vector Machines (SVMs)

  • SVM is a discriminative classifier based on finding an optimal separating hyperplane (a line in 2D).

  • Margin: The width the boundary can be increased before hitting a data point. SVM maximizes this margin between positive and negative training examples.

  • Linear SVM Math:

    • Separating line: wTx+b=0w^T x + b = 0

    • For positive examples: wTxn+b1w^T x_n + b \geq 1

    • For negative examples: wTxn+b1w^T x_n + b \leq -1

    • The margin width MM is calculated as: M=2wM = \frac{2}{||w||}.

  • Optimization: Maximizing the margin is equivalent to minimizing 12wTw\frac{1}{2} w^T w subject to yi(wTxi+b)1y_i (w^T x_i + b) \geq 1 for all ii.

  • Classification Function: f(x)=sign(i=1nαiyixiTx+b)f(x) = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i x_i^T x + b\right). This relies on the inner product between the test point xx and the support vectors xix_i.

Non-Linear SVMs and the Kernel Trick

  • If data is not linearly separable in the original space, it can be mapped to a higher-dimensional feature space Φ(x)\Phi(x) where it becomes separable.

  • Kernel Trick: Instead of explicitly computing the high-dimensional mapping, a kernel function K(xi,xj)=Φ(xi)TΦ(xj)K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) is defined.

  • Common Kernel Functions:

    • Linear: K(xi,xj)=xiTxjK(x_i, x_j) = x_i^T x_j

    • Polynomial of power p: K(xi,xj)=(1+xiTxj)pK(x_i, x_j) = (1 + x_i^T x_j)^p

    • Gaussian (Radial-Basis Function): K(xi,xj)=exp(xixj22σ2)K(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2 \sigma^2}\right)

Multi-Class SVMs

  • Since SVM is inherently binary, multi-class classification is achieved by combining multiple classifiers:

  • One vs. All: Learn an SVM for each class against all other classes combined. The test example is assigned to the class of the SVM that returns the highest decision value.

  • One vs. One: Learn an SVM for every possible pair of classes. For a KK-way problem, this results in K(K1)2\frac{K(K-1)}{2} binary classifiers. Each SVM "votes," and the class with the most votes is chosen.

Summary: Sliding-Windows

  • Pros:

    • Simple detection protocol to implement.

    • Critical success relies on good feature choices (e.g., HOG).

    • Effective for specific classes with defined shapes (e.g., pedestrians, faces).

  • Cons/Limitations:

    • High computational complexity due to the exhaustive search over space and scale.

    • Rigid structure: not all objects are "box" shaped.