Detectors and Descriptors
Administrivia
- HW2 is out and students are encouraged to start early.
Last Time – Gradients
- Image gradients treat the image as a function of x and y: \nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}]. The special cases [\frac{\partial f}{\partial x}, 0] and [0, \frac{\partial f}{\partial y}] correspond to purely horizontal and purely vertical intensity change.
- Image gradients help find edges and corners.
Filters
- Smoothing Filters remove noise and typically sum to 1.
- Derivative Filters find edges and typically sum to 0.
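As a quick sanity check on the two rules above, here is a minimal sketch; the 3x3 box filter and Sobel-x kernel are standard examples chosen for illustration, not taken from the notes:

```python
# A smoothing kernel sums to 1 (preserves average brightness);
# a derivative kernel sums to 0 (zero response on flat regions).
box = [[1 / 9] * 3 for _ in range(3)]   # 3x3 box blur (smoothing)
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]                  # Sobel-x (derivative)

def kernel_sum(k):
    """Sum of all entries of a 2D kernel."""
    return sum(sum(row) for row in k)

print(kernel_sum(box))      # close to 1 (smoothing)
print(kernel_sum(sobel_x))  # 0 (derivative)
```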
Motivation
- The goal is to find corners and features in images.
Desirables for Descriptor
- Repeatable: Should find the same features even with distortion.
- Saliency: Each feature should be distinctive.
- Compactness: Shouldn’t just be all the pixels.
- Locality: Should only depend on local image data.
Basic Idea
- Edge: No change along the edge direction.
- Corner: Significant change in all directions.
- Flat Region: No change in all directions.
- For a corner, shifting a small window by any amount in any direction should produce a significant intensity change.
- Error (Sum of Squares) for u,v offset is given by the equation: E(u, v) = \sum_{(x,y) \in W} (I[x + u, y + v] - I[x, y])^2
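The error function above can be written directly as a double loop over the window. A minimal sketch, assuming a NumPy image I, a window of half-width `half` centered at (x0, y0), and an offset (u, v) that keeps the window inside the image:

```python
import numpy as np

def ssd_error(I, x0, y0, half, u, v):
    """E(u, v): sum of squared differences between a window centered
    at (x0, y0) and the same window shifted by (u, v)."""
    E = 0.0
    for y in range(y0 - half, y0 + half + 1):
        for x in range(x0 - half, x0 + half + 1):
            d = float(I[y + v, x + u]) - float(I[y, x])
            E += d * d
    return E
```

On a flat region E(u, v) is zero for any shift; near a vertical edge, a horizontal shift gives a large error.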
Taylor Series
- Taylor Series is used to linearize a function: f(x + d) \approx f(x) + \frac{\partial f}{\partial x} d
- Applying the Taylor series to images, treating them as functions of x and y: I(x + u, y + v) \approx I(x, y) + I_x u + I_y v, where I_x and I_y are the partial derivatives of I with respect to x and y, both evaluated at point (x, y).
- Using Taylor series expansion to approximate the error function:
- E(u, v) = \sum_{(x,y) \in W} (I[x + u, y + v] - I[x, y])^2 \approx \sum_{(x,y) \in W} (I[x, y] + I_x u + I_y v - I[x, y])^2
- = \sum_{(x,y) \in W} (I_x u + I_y v)^2
- = \sum_{(x,y) \in W} I_x^2 u^2 + 2 I_x I_y u v + I_y^2 v^2
- By linearizing the image, E(u, v) can be approximated with a quadratic function of u and v: E(u, v) \approx [u, v] M [u, v]^T
- Where M is the second moment matrix:
- M = \begin{bmatrix} \sum_{(x,y) \in W} I_x^2 & \sum_{(x,y) \in W} I_x I_y \\ \sum_{(x,y) \in W} I_x I_y & \sum_{(x,y) \in W} I_y^2 \end{bmatrix}
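The entries of M can be computed from image derivatives. A minimal sketch, assuming a NumPy image, central differences for I_x and I_y, and an unweighted square window (the Gaussian-weighted version appears in the Harris section of these notes):

```python
import numpy as np

def second_moment_matrix(I, x0, y0, half=1):
    """Entries of M summed over a (2*half+1)^2 window centered at
    (x0, y0), with I_x and I_y from central differences."""
    I = I.astype(float)
    # Central differences (np.roll wraps at the borders; fine for
    # interior pixels in this sketch).
    Ix = (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1)) / 2.0
    Iy = (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0)) / 2.0
    win = slice(y0 - half, y0 + half + 1), slice(x0 - half, x0 + half + 1)
    a = np.sum(Ix[win] ** 2)
    b = np.sum(Ix[win] * Iy[win])
    c = np.sum(Iy[win] ** 2)
    return np.array([[a, b], [b, c]])
```

At a vertical edge, only the I_x^2 entry is large, matching the "one big, other small: edge" case below.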
Intuitively What is M?
- Approximation:
- M \approx \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}
- This holds when the gradient at each pixel is either vertical or horizontal (so I_x I_y = 0).
- Cases:
- a, b both small: flat
- one big, other small: edge
- a, b both big: corner
- If the image is rotated by an angle \theta with rotation matrix V, M becomes V^{-1} \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} V
Eigenvalues and M Matrix
- Given M, we can decompose it into eigenvectors V and eigenvalues \lambda_1, \lambda_2 with
- M = V^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} V
- Compute quantity R from M
- R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2
- \alpha is an empirical constant, usually 0.04-0.06.
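Written in terms of the eigenvalues, R is easy to evaluate by hand. A minimal sketch; the example eigenvalues and \alpha = 0.05 are assumptions for illustration:

```python
def harris_R_from_eigs(l1, l2, alpha=0.05):
    """R = det(M) - alpha * trace(M)^2, written via the eigenvalues
    (det = l1 * l2, trace = l1 + l2)."""
    return l1 * l2 - alpha * (l1 + l2) ** 2

# Corner: both eigenvalues large -> R >> 0
# Edge:   one large, one small   -> R << 0
# Flat:   both small             -> |R| ~ 0
print(harris_R_from_eigs(100, 100))  # positive (corner)
print(harris_R_from_eigs(100, 1))    # negative (edge)
print(harris_R_from_eigs(1, 1))      # near zero (flat)
```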
R and Corner/Edge/Flat Detection
- R tells us whether we’re at a corner, edge, or flat region:
- R >> 0: corner (\lambda_1 and \lambda_2 both large, \lambda_1 \approx \lambda_2)
- R << 0: edge (\lambda_1 \gg \lambda_2 or \lambda_2 \gg \lambda_1)
- |R| \approx 0: flat (\lambda_1, \lambda_2 \approx 0)
What You Need To Know
- Need to be able to take derivatives of the image.
- Need to be able to compute the entries of M at every pixel.
- Should know that some properties of M indicate whether a pixel is a corner or not.
In Practice: Harris Corner Detection
- Compute partial derivatives I_x, I_y per pixel.
- Compute M at each pixel, using Gaussian weighting w:
- M = \sum_{(x,y) \in W} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
- Compute response function R:
- R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2
- Threshold R.
- Take only local maxima (non-maxima suppression).
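The pipeline above, minus thresholding and non-maxima suppression, can be sketched as follows. This assumes a NumPy image, central differences for the derivatives, and a separable Gaussian for the weighting w:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(A, sigma=1.0):
    """Separable Gaussian smoothing (zero-padded borders, via
    np.convolve in 'same' mode on each row, then each column)."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    A = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, A)
    A = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, A)
    return A

def harris(I, alpha=0.05, sigma=1.0):
    """Per-pixel Harris response R = det(M) - alpha * trace(M)^2,
    with the entries of M weighted by a Gaussian window."""
    I = I.astype(float)
    Ix = (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1)) / 2.0
    Iy = (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0)) / 2.0
    Sxx = blur(Ix * Ix, sigma)   # entries of M at every pixel
    Sxy = blur(Ix * Iy, sigma)
    Syy = blur(Iy * Iy, sigma)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - alpha * trace**2
```

Thresholding R and keeping only local maxima (the last two steps above) then yields the corner list.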
Desirable Properties
- If detectors are repeatable, they should be:
- Invariant to some things: image is transformed and corners remain the same.
- Equivariant with some things: image is transformed and corners transform with it.
Affine Intensity Change
- Harris is partially invariant to affine intensity changes I_{new} = a I_{old} + b:
- M depends only on derivatives, so the offset b is irrelevant.
- But a scales the derivatives, and R is thresholded, so invariance to the gain a is only partial.
Image Translation
- Equivariant with translation because convolution is translation equivariant.
Image Scaling
- Not equivariant with scaling: zoomed in far enough, a corner looks like a slowly curving edge, so detections do not transform with the image.
Key Idea: Scale Space
- Try different scales to handle image scaling.
Blob Detection
- Find maxima and minima of blob filter response in scale and space.
Gaussian Derivatives
- First Derivative
- \frac{\partial}{\partial x} g
- \frac{\partial}{\partial y} g
- Second Derivative
- \frac{\partial^2}{\partial x^2} g
- \frac{\partial^2}{\partial y^2} g
Laplacian of Gaussian (LoG)
- \nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}
- Scale the Laplacian of Gaussian if you want to compare responses across sigmas:
- \nabla_{norm}^2 g = \sigma^2 \left( \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \right)
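The scale-normalized LoG can be evaluated analytically on a grid. A minimal sketch, using the closed form \nabla^2 g = \frac{r^2 - 2\sigma^2}{\sigma^4} g(r) and an assumed truncation radius of 4\sigma:

```python
import numpy as np

def scale_normalized_log(sigma, radius=None):
    """Scale-normalized Laplacian of Gaussian, sigma^2 * (g_xx + g_yy),
    sampled on a (2*radius+1)^2 grid."""
    if radius is None:
        radius = int(4 * sigma)  # assumed truncation radius
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    r2 = x**2 + y**2
    g = np.exp(-r2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)  # 2D Gaussian
    log = (r2 / sigma**2 - 2) / sigma**2 * g                   # g_xx + g_yy
    return sigma**2 * log                                      # normalize
```

Like any derivative-style filter, the kernel sums to (approximately) zero, and it is negative at the center, which is why a bright blob of matching size gives a strong negative response.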
Edge Detection with LoG
- An edge corresponds to a zero-crossing of the response
- f * \frac{\partial^2 g}{\partial x^2}
Scale Selection
- Characteristic scale of a blob is the scale that produces the maximum response.
Scale-space blob detector
- Convolve image with scale-normalized Laplacian at several scales.
- Find maxima of squared Laplacian response in scale-space.
Finding Maxima
- Point (i, j) is a maximum (a minimum if you flip the sign) in image I if it is bigger than all 8 of its neighbors:
def is_local_max(I, i, j):
    for y in range(i - 1, i + 2):
        for x in range(j - 1, j + 2):
            if y == i and x == j:
                continue
            if I[y, x] >= I[i, j]:
                return False
    return True
Scale Space
- Blue lines are image-space neighbors.
- Red lines are the scale-space neighbors.
Finding Maxima in Scale Space
- Suppose I[:, :, k] is the image at scale k. Point (i, j, k) is a maximum (a minimum if you flip the sign) in scale space if it is bigger than all 26 of its neighbors:
def is_scale_space_max(I, i, j, k):
    for y in range(i - 1, i + 2):
        for x in range(j - 1, j + 2):
            for c in range(k - 1, k + 2):
                if y == i and x == j and c == k:
                    continue
                if I[y, x, c] >= I[i, j, k]:
                    return False
    return True
Approximating the Laplacian with a Difference of Gaussians
- (Laplacian of Gaussian) \approx (Difference of Gaussians)
- \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \approx \frac{1}{2\pi\sigma_1^2} e^{-\frac{x^2+y^2}{2\sigma_1^2}} - \frac{1}{2\pi\sigma_2^2} e^{-\frac{x^2+y^2}{2\sigma_2^2}}
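A quick numeric check of this approximation: a DoG kernel, like the Laplacian, sums to (approximately) zero and is negative at the center. The \sigma values and grid radius below are assumptions for illustration:

```python
import numpy as np

def gaussian2d(sigma, radius):
    """2D Gaussian sampled on a (2*radius+1)^2 grid."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# DoG with sigma_2 = k * sigma_1 approximates the Laplacian of Gaussian
# up to a constant factor (assumed sigma = 2.0, k = 1.2, radius = 10).
sigma, k, radius = 2.0, 1.2, 10
dog = gaussian2d(k * sigma, radius) - gaussian2d(sigma, radius)
```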
Problem 1: How do we deal with scales? Try them all!
- Why is this efficient? Each pyramid level has half the width and height of the previous one, so 1/4 as many pixels; the total work is 1 + 1/4 + 1/16 + 1/64 + \dots = \sum_i (1/4)^i = 4/3. The vast majority of the effort is in the first and second scales.
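The geometric series above can be checked numerically (each pyramid level has 1/4 the pixels of the previous one):

```python
# Total work for an image pyramid, relative to the base level:
# sum of (1/4)^i over all levels converges to 4/3.
total = sum((1 / 4) ** i for i in range(20))
print(total)  # approximately 4/3
```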
Problem 2 – Describing Features
- Once we've found a corner/blob, we can't just use the nearby pixels directly as a descriptor.
What about:
- Scale?
- Rotation?
- Additive light?
Handling Scale
- Given the characteristic scale (where the Laplacian response is maximal), we can simply rescale the image patch to a common size.
Handling Rotation
- Given the window, compute the "dominant orientation" of its gradients and rotate the patch to align with it.
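A minimal sketch of dominant-orientation estimation: histogram the gradient orientations in the window, weighted by gradient magnitude, and take the peak bin. The 36-bin histogram matches SIFT's usual choice; real SIFT also interpolates the peak and keeps strong secondary peaks, which this sketch omits:

```python
import numpy as np

def dominant_orientation(Ix, Iy, n_bins=36):
    """Return the angle (radians, in [0, 2*pi)) of the peak bin of a
    magnitude-weighted orientation histogram over the window."""
    mag = np.hypot(Ix, Iy)                     # gradient magnitude
    ang = np.arctan2(Iy, Ix) % (2 * np.pi)     # orientation in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return (np.argmax(hist) + 0.5) * 2 * np.pi / n_bins  # bin center
```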
SIFT Features
- SIFT detects features at characteristic scales and dominant orientations, then rotates each patch to its dominant orientation and rescales it to a common scale.
SIFT Descriptors
- Compute gradients.
- Build an orientation histogram per cell (a 4x4 grid of cells in practice).
- Gradients are invariant to additive illumination changes.
- Gaussian weighting: smooths the response.
- Normalization: reduces multiplicative illumination effects.
- Clamping: limits the influence of very large gradient magnitudes.
- Plus many more refinements.
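The descriptor steps above can be sketched end to end. This assumes a 16x16 gradient patch, 4x4 cells with 8 orientation bins each (giving the 128 dimensions), and a clamp threshold of 0.2; Gaussian weighting and soft binning are omitted:

```python
import numpy as np

def sift_descriptor(Ix, Iy):
    """128-D descriptor from a 16x16 gradient patch: 4x4 cells, each an
    8-bin orientation histogram weighted by gradient magnitude, then
    normalize, clamp, and renormalize."""
    assert Ix.shape == (16, 16) and Iy.shape == (16, 16)
    mag = np.hypot(Ix, Iy)
    ang = np.arctan2(Iy, Ix) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for cy in range(4):
        for cx in range(4):
            cell = slice(4 * cy, 4 * cy + 4), slice(4 * cx, 4 * cx + 4)
            desc[cy, cx] = np.bincount(bins[cell].ravel(),
                                       weights=mag[cell].ravel(),
                                       minlength=8)
    d = desc.ravel()                         # 4 * 4 * 8 = 128 dimensions
    d = d / max(np.linalg.norm(d), 1e-12)    # normalize (illumination)
    d = np.minimum(d, 0.2)                   # clamp large entries
    return d / max(np.linalg.norm(d), 1e-12) # renormalize
```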
Properties of SIFT
- Can handle up to ~60 degrees of out-of-plane rotation and changes of illumination.
- Fast, efficient, and code is available (though it was patented).
Feature Descriptors
- Each feature is a 128-dimensional vector x.
- Think of the feature extractor as a non-linear filter that maps pixels to a 128-D feature vector.
2nd Nearest Neighbor Trick
- Given a query feature x_q, its nearest neighbor x_{1NN} is a good candidate match, but raw distances can't be reliably thresholded.
- Instead, find the nearest neighbor x_{1NN} and the second nearest neighbor x_{2NN}. The ratio of their distances is a good test for matches (a small ratio means the best match is much closer than the runner-up):
- r = \frac{\| x_q - x_{1NN} \|}{\| x_q - x_{2NN} \|}
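A minimal sketch of the ratio test, with Euclidean distances and an assumed threshold of 0.8 (roughly the value suggested in Lowe's SIFT paper):

```python
import math

def ratio_test(xq, candidates, thresh=0.8):
    """Accept the nearest neighbor of xq only if it is much closer
    than the second nearest neighbor (distance ratio below thresh)."""
    dists = sorted(math.dist(xq, c) for c in candidates)
    d1, d2 = dists[0], dists[1]
    return d1 / max(d2, 1e-12) < thresh

# A distinctive match passes; an ambiguous one (two near-equal
# candidates) fails.
print(ratio_test((0, 0), [(0.1, 0), (5, 5), (6, 6)]))  # True
print(ratio_test((0, 0), [(1, 0), (1.1, 0), (5, 5)]))  # False
```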