Detectors and Descriptors
Administrivia
- HW2 is out and students are encouraged to start early.
Last Time – Gradients
- Image gradients treat the image as a function of x and y: \nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}]. The special cases [\frac{\partial f}{\partial x}, 0] and [0, \frac{\partial f}{\partial y}] correspond to purely horizontal and purely vertical intensity change.
- Image gradients help find edges and corners.
Filters
- Smoothing Filters remove noise and typically sum to 1.
- Derivative Filters find edges and typically sum to 0.
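As a quick sanity check on the two rules above, here is a minimal sketch; the 3x3 box filter and Sobel-x kernel are standard examples chosen for illustration, not taken from the notes:

```python
# A smoothing kernel sums to 1 (preserves average brightness);
# a derivative kernel sums to 0 (zero response on flat regions).
box = [[1 / 9] * 3 for _ in range(3)]   # 3x3 box blur (smoothing)
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]                  # Sobel-x (derivative)

def kernel_sum(k):
    """Sum of all entries of a 2D kernel."""
    return sum(sum(row) for row in k)

print(kernel_sum(box))      # close to 1 (smoothing)
print(kernel_sum(sobel_x))  # 0 (derivative)
```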
Motivation
- The goal is to find corners and features in images.
Desirables for Descriptor
- Repeatable: Should find the same features even with distortion.
- Saliency: Each feature should be distinctive.
- Compactness: Shouldn’t just be all the pixels.
- Locality: Should only depend on local image data.
Basic Idea
- Edge: No change along the edge direction.
- Corner: Significant change in all directions.
- Flat Region: No change in all directions.
- For a corner, shifting a small window by any amount in any direction should produce a significant intensity change.
- Error (Sum of Squares) for u,v offset is given by the equation: E(u, v) = \sum_{(x,y) \in W} (I[x + u, y + v] - I[x, y])^2
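The error function above can be written directly as a double loop over the window. A minimal sketch, assuming a NumPy image I, a window of half-width `half` centered at (x0, y0), and an offset (u, v) that keeps the window inside the image:

```python
import numpy as np

def ssd_error(I, x0, y0, half, u, v):
    """E(u, v): sum of squared differences between a window centered
    at (x0, y0) and the same window shifted by (u, v)."""
    E = 0.0
    for y in range(y0 - half, y0 + half + 1):
        for x in range(x0 - half, x0 + half + 1):
            d = float(I[y + v, x + u]) - float(I[y, x])
            E += d * d
    return E
```

On a flat region E(u, v) is zero for any shift; near a vertical edge, a horizontal shift gives a large error.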
Taylor Series
- Taylor Series is used to linearize a function: f(x + d) \approx f(x) + \frac{\partial f}{\partial x} d
- Applying the Taylor series to images, treating them as functions of x and y: I(x + u, y + v) \approx I(x, y) + I_x u + I_y v, where I_x and I_y are the partial derivatives of I with respect to x and y, both evaluated at point (x, y).
- Using Taylor series expansion to approximate the error function:
- E(u, v) = \sum_{(x,y) \in W} (I[x + u, y + v] - I[x, y])^2 \approx \sum_{(x,y) \in W} (I[x, y] + I_x u + I_y v - I[x, y])^2
- = \sum_{(x,y) \in W} (I_x u + I_y v)^2
- = \sum_{(x,y) \in W} I_x^2 u^2 + 2 I_x I_y u v + I_y^2 v^2
- By linearizing the image, E(u, v) can be approximated with a quadratic function of u and v: E(u, v) \approx [u, v] M [u, v]^T
- Where M is the second moment matrix:
- M = \begin{bmatrix} \sum_{(x,y) \in W} I_x^2 & \sum_{(x,y) \in W} I_x I_y \\ \sum_{(x,y) \in W} I_x I_y & \sum_{(x,y) \in W} I_y^2 \end{bmatrix}
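The entries of M can be computed from image derivatives. A minimal sketch, assuming a NumPy image, central differences for I_x and I_y, and an unweighted square window (the Gaussian-weighted version appears in the Harris section of these notes):

```python
import numpy as np

def second_moment_matrix(I, x0, y0, half=1):
    """Entries of M summed over a (2*half+1)^2 window centered at
    (x0, y0), with I_x and I_y from central differences."""
    I = I.astype(float)
    # Central differences (np.roll wraps at the borders; fine for
    # interior pixels in this sketch).
    Ix = (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1)) / 2.0
    Iy = (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0)) / 2.0
    win = slice(y0 - half, y0 + half + 1), slice(x0 - half, x0 + half + 1)
    a = np.sum(Ix[win] ** 2)
    b = np.sum(Ix[win] * Iy[win])
    c = np.sum(Iy[win] ** 2)
    return np.array([[a, b], [b, c]])
```

At a vertical edge, only the I_x^2 entry is large, matching the "one big, other small: edge" case below.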
Intuitively What is M?
- Approximation:
- M \approx \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}
- This holds when the gradient at each pixel is either vertical or horizontal (so I_x I_y = 0).
- Cases:
- a, b both small: flat
- one big, other small: edge
- a, b both big: corner
- If the image is rotated by an angle \theta with rotation matrix V, M becomes V^{-1} \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} V
Eigenvalues and M Matrix
- Given M, we can decompose it into eigenvectors V and eigenvalues \lambda_1, \lambda_2 with
- M = V^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} V
- Compute quantity R from M
- R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2
- \alpha is an empirical constant, usually 0.04-0.06.
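Written in terms of the eigenvalues, R is easy to evaluate by hand. A minimal sketch; the example eigenvalues and \alpha = 0.05 are assumptions for illustration:

```python
def harris_R_from_eigs(l1, l2, alpha=0.05):
    """R = det(M) - alpha * trace(M)^2, written via the eigenvalues
    (det = l1 * l2, trace = l1 + l2)."""
    return l1 * l2 - alpha * (l1 + l2) ** 2

# Corner: both eigenvalues large -> R >> 0
# Edge:   one large, one small   -> R << 0
# Flat:   both small             -> |R| ~ 0
print(harris_R_from_eigs(100, 100))  # positive (corner)
print(harris_R_from_eigs(100, 1))    # negative (edge)
print(harris_R_from_eigs(1, 1))      # near zero (flat)
```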
R and Corner/Edge/Flat Detection
- R tells us whether we’re at a corner, edge, or flat region:
- R >> 0: corner (\lambda_1 and \lambda_2 both large, \lambda_1 \approx \lambda_2)
- R << 0: edge (\lambda_1 \gg \lambda_2 or \lambda_2 \gg \lambda_1)
- |R| \approx 0: flat (\lambda_1, \lambda_2 \approx 0)
What You Need To Know
- Need to be able to take derivatives of the image.
- Need to be able to compute the entries of M at every pixel.
- Should know that some properties of M indicate whether a pixel is a corner or not.
In Practice: Harris Corner Detection
- Compute partial derivatives I_x, I_y per pixel.
- Compute M at each pixel, using Gaussian weighting w:
- M = \sum_{(x,y) \in W} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
- Compute response function R:
- R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2
- Threshold R.
- Take only local maxima (non-maxima suppression).
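The pipeline above, minus thresholding and non-maxima suppression, can be sketched as follows. This assumes a NumPy image, central differences for the derivatives, and a separable Gaussian for the weighting w:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(A, sigma=1.0):
    """Separable Gaussian smoothing (zero-padded borders, via
    np.convolve in 'same' mode on each row, then each column)."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    A = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, A)
    A = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, A)
    return A

def harris(I, alpha=0.05, sigma=1.0):
    """Per-pixel Harris response R = det(M) - alpha * trace(M)^2,
    with the entries of M weighted by a Gaussian window."""
    I = I.astype(float)
    Ix = (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1)) / 2.0
    Iy = (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0)) / 2.0
    Sxx = blur(Ix * Ix, sigma)   # entries of M at every pixel
    Sxy = blur(Ix * Iy, sigma)
    Syy = blur(Iy * Iy, sigma)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - alpha * trace**2
```

Thresholding R and keeping only local maxima (the last two steps above) then yields the corner list.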
Desirable Properties
- If detectors are repeatable, they should be:
- Invariant to some things: image is transformed and corners remain the same.
- Equivariant with some things: image is transformed and corners transform with it.
Affine Intensity Change
- Harris is partially invariant to affine intensity changes I_{new} = a I_{old} + b:
- M depends only on derivatives, so the offset b is irrelevant.
- But a scales the derivatives, and R is thresholded, so invariance to the gain a is only partial.
Image Translation
- Equivariant with translation because convolution is translation equivariant.
Image Scaling
- Not equivariant with scaling: zoomed in far enough, a corner looks like a slowly curving edge, so detections do not transform with the image.
Key Idea: Scale Space
- Try different scales to handle image scaling.
Blob Detection
- Find maxima and minima of blob filter response in scale and space.
Gaussian Derivatives
- First Derivative
- \frac{\partial}{\partial x} g
- \frac{\partial}{\partial y} g
- Second Derivative
- \frac{\partial^2}{\partial x^2} g
- \frac{\partial^2}{\partial y^2} g
Laplacian of Gaussian (LoG)
- \nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}
- Scale the Laplacian of Gaussian if you want to compare responses across sigmas:
- \nabla_{norm}^2 g = \sigma^2 \left( \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \right)
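The scale-normalized LoG can be evaluated analytically on a grid. A minimal sketch, using the closed form \nabla^2 g = \frac{r^2 - 2\sigma^2}{\sigma^4} g(r) and an assumed truncation radius of 4\sigma:

```python
import numpy as np

def scale_normalized_log(sigma, radius=None):
    """Scale-normalized Laplacian of Gaussian, sigma^2 * (g_xx + g_yy),
    sampled on a (2*radius+1)^2 grid."""
    if radius is None:
        radius = int(4 * sigma)  # assumed truncation radius
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    r2 = x**2 + y**2
    g = np.exp(-r2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)  # 2D Gaussian
    log = (r2 / sigma**2 - 2) / sigma**2 * g                   # g_xx + g_yy
    return sigma**2 * log                                      # normalize
```

Like any derivative-style filter, the kernel sums to (approximately) zero, and it is negative at the center, which is why a bright blob of matching size gives a strong negative response.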
Edge Detection with LoG
- An edge corresponds to a zero-crossing of the response
- f * \frac{\partial^2 g}{\partial x^2}
Scale Selection
- Characteristic scale of a blob is the scale that produces the maximum response.
Scale-space blob detector
- Convolve image with scale-normalized Laplacian at several scales.
- Find maxima of squared Laplacian response in scale-space.
Finding Maxima
- Point (i, j) is a maximum (a minimum if you flip the sign) in image I if it is bigger than all 8 of its neighbors:
def is_local_max(I, i, j):
    for y in range(i - 1, i + 2):
        for x in range(j - 1, j + 2):
            if y == i and x == j:
                continue
            if I[y, x] >= I[i, j]:
                return False
    return True
Scale Space
- Blue lines are image-space neighbors.
- Red lines are the scale-space neighbors.
Finding Maxima in Scale Space
- Suppose I[:, :, k] is the image at scale k. Point (i, j, k) is a maximum (a minimum if you flip the sign) in scale space if it is bigger than all 26 of its neighbors:
def is_scale_space_max(I, i, j, k):
    for y in range(i - 1, i + 2):
        for x in range(j - 1, j + 2):
            for c in range(k - 1, k + 2):
                if y == i and x == j and c == k:
                    continue
                if I[y, x, c] >= I[i, j, k]:
                    return False
    return True
Approximating the Laplacian with a Difference of Gaussians
- (Laplacian of Gaussian) \approx (Difference of Gaussians)
- \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \approx \frac{1}{2\pi\sigma_1^2} e^{-\frac{x^2+y^2}{2\sigma_1^2}} - \frac{1}{2\pi\sigma_2^2} e^{-\frac{x^2+y^2}{2\sigma_2^2}}
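A quick numeric check of this approximation: a DoG kernel, like the Laplacian, sums to (approximately) zero and is negative at the center. The \sigma values and grid radius below are assumptions for illustration:

```python
import numpy as np

def gaussian2d(sigma, radius):
    """2D Gaussian sampled on a (2*radius+1)^2 grid."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# DoG with sigma_2 = k * sigma_1 approximates the Laplacian of Gaussian
# up to a constant factor (assumed sigma = 2.0, k = 1.2, radius = 10).
sigma, k, radius = 2.0, 1.2, 10
dog = gaussian2d(k * sigma, radius) - gaussian2d(sigma, radius)
```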
Problem 1: How do we deal with scales? Try them all!
- Why is this efficient? Each pyramid level has half the width and height of the previous one, so 1/4 as many pixels; the total work is 1 + 1/4 + 1/16 + 1/64 + \dots = \sum_i (1/4)^i = 4/3. The vast majority of the effort is in the first and second scales.
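The geometric series above can be checked numerically (each pyramid level has 1/4 the pixels of the previous one):

```python
# Total work for an image pyramid, relative to the base level:
# sum of (1/4)^i over all levels converges to 4/3.
total = sum((1 / 4) ** i for i in range(20))
print(total)  # approximately 4/3
```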
Problem 2 – Describing Features
- Once we've found a corner/blob, we can't just use the nearby pixels directly as a descriptor.
What about:
- Scale?
- Rotation?
- Additive light?
Handling Scale
- Given the characteristic scale (where the Laplacian response is maximal), we can simply rescale the image patch to a common size.
Handling Rotation
- Given the window, compute the "dominant orientation" of its gradients and rotate the patch to align with it.
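A minimal sketch of dominant-orientation estimation: histogram the gradient orientations in the window, weighted by gradient magnitude, and take the peak bin. The 36-bin histogram matches SIFT's usual choice; real SIFT also interpolates the peak and keeps strong secondary peaks, which this sketch omits:

```python
import numpy as np

def dominant_orientation(Ix, Iy, n_bins=36):
    """Return the angle (radians, in [0, 2*pi)) of the peak bin of a
    magnitude-weighted orientation histogram over the window."""
    mag = np.hypot(Ix, Iy)                     # gradient magnitude
    ang = np.arctan2(Iy, Ix) % (2 * np.pi)     # orientation in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return (np.argmax(hist) + 0.5) * 2 * np.pi / n_bins  # bin center
```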
SIFT Features
- SIFT detects features at characteristic scales and dominant orientations, then rotates each patch to its dominant orientation and rescales it to a common scale.
SIFT Descriptors
- Compute gradients.
- Build an orientation histogram per cell (a 4x4 grid of cells in practice).
- Gradients are invariant to additive illumination changes.
- Gaussian weighting: smooths the response.
- Normalization: reduces multiplicative illumination effects.
- Clamping: limits the influence of very large gradient magnitudes.
- Plus many more refinements.
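The descriptor steps above can be sketched end to end. This assumes a 16x16 gradient patch, 4x4 cells with 8 orientation bins each (giving the 128 dimensions), and a clamp threshold of 0.2; Gaussian weighting and soft binning are omitted:

```python
import numpy as np

def sift_descriptor(Ix, Iy):
    """128-D descriptor from a 16x16 gradient patch: 4x4 cells, each an
    8-bin orientation histogram weighted by gradient magnitude, then
    normalize, clamp, and renormalize."""
    assert Ix.shape == (16, 16) and Iy.shape == (16, 16)
    mag = np.hypot(Ix, Iy)
    ang = np.arctan2(Iy, Ix) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for cy in range(4):
        for cx in range(4):
            cell = slice(4 * cy, 4 * cy + 4), slice(4 * cx, 4 * cx + 4)
            desc[cy, cx] = np.bincount(bins[cell].ravel(),
                                       weights=mag[cell].ravel(),
                                       minlength=8)
    d = desc.ravel()                         # 4 * 4 * 8 = 128 dimensions
    d = d / max(np.linalg.norm(d), 1e-12)    # normalize (illumination)
    d = np.minimum(d, 0.2)                   # clamp large entries
    return d / max(np.linalg.norm(d), 1e-12) # renormalize
```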
Properties of SIFT
- Can handle up to ~60 degrees of out-of-plane rotation and changes of illumination.
- Fast, efficient, and code is available (though it was patented).
Feature Descriptors
- Each feature is a 128-dimensional vector x.
- Think of the feature extractor as a non-linear filter that maps pixels to a 128-D feature vector.
2nd Nearest Neighbor Trick
- Given a query feature x_q, its nearest neighbor x_{1NN} is a good candidate match, but raw distances can't be reliably thresholded.
- Instead, find the nearest neighbor x_{1NN} and the second nearest neighbor x_{2NN}. The ratio of their distances is a good test for matches (a small ratio means the best match is much closer than the runner-up):
- r = \frac{\| x_q - x_{1NN} \|}{\| x_q - x_{2NN} \|}
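A minimal sketch of the ratio test, with Euclidean distances and an assumed threshold of 0.8 (roughly the value suggested in Lowe's SIFT paper):

```python
import math

def ratio_test(xq, candidates, thresh=0.8):
    """Accept the nearest neighbor of xq only if it is much closer
    than the second nearest neighbor (distance ratio below thresh)."""
    dists = sorted(math.dist(xq, c) for c in candidates)
    d1, d2 = dists[0], dists[1]
    return d1 / max(d2, 1e-12) < thresh

# A distinctive match passes; an ambiguous one (two near-equal
# candidates) fails.
print(ratio_test((0, 0), [(0.1, 0), (5, 5), (6, 6)]))  # True
print(ratio_test((0, 0), [(1, 0), (1.1, 0), (5, 5)]))  # False
```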