function mapping coordinates to intensity values
mean filter
all weights are 1/(2k + 1)², blurring the image
convolution
flip kernel horizontally and vertically (a 180° rotation), then cross-correlate
full convolution
compute if any kernel overlap exists (use zero-padding) —> m + k - 1 output
same convolution
compute if kernel center is on image —> m output
valid convolution
compute only if kernel entirely on image —> m - k + 1 output
What’s the idea behind the sharpening filter, and how does the kernel work?
blurring removes fine detail, so add the detail you lost (original - blurred) to the original
kernel: 2 * identity - mean_filter
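The kernel above can be sketched in numpy (assumes k = 1, i.e. a 3×3 mean filter):

```python
import numpy as np

# sharpen = 2 * identity - mean_filter, for k = 1 (3x3)
k = 1
size = 2 * k + 1
identity = np.zeros((size, size))
identity[k, k] = 1.0                                # passes the image through unchanged
mean_filter = np.full((size, size), 1.0 / size**2)  # blurs, i.e. removes fine detail
sharpen = 2 * identity - mean_filter

# weights sum to 1, so flat regions keep their intensity
print(round(sharpen.sum(), 6))  # -> 1.0
```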
shift equivariance
doesn’t matter if you shift input then filter or filter first then shift input
Gaussian filter
weight pixels by a Gaussian - neighbors closer to center get high weight
How does the sigma parameter control the Gaussian?
small sigma —> narrower —> less blurring (very nearby pixels weighted)
large sigma —> wider —> more blurring
Difference of Gaussians
subtract two Gaussian-blurred images: sharp smaller blur - broad larger blur leaves behind only edges/fine detail
separable filters
decomposing 2D kernels as outer product of two 1D vectors, then apply those in sequence
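A sketch of the decomposition, using the 3×3 mean filter as the example:

```python
import numpy as np

v = np.full(3, 1.0 / 3.0)    # 1D mean filter
kernel_2d = np.outer(v, v)   # outer product rebuilds the 3x3 mean filter (all weights 1/9)

# convolving with v along rows, then along columns, gives the same result as
# one pass with kernel_2d, at O(2k) cost per pixel instead of O(k^2)
print(np.allclose(kernel_2d, 1.0 / 9.0))  # -> True
```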
inverse mapping
for all output pixels, fetch where it came from in the input
interpolation
estimating the value of inverse mapping results if it’s between input pixels (non-int coords)
bilinear interpolation
weighted average of the 4 surrounding pixels, each weighted by the area of the rectangle opposite it
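The weighting can be sketched directly (no bounds checks; the 2×2 image is made up):

```python
import numpy as np

def bilinear(img, x, y):
    # sample img at non-integer (x, y); x = column, y = row
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    # each neighbor is weighted by the area of the rectangle opposite it,
    # so nearer neighbors contribute more
    return ((1 - dx) * (1 - dy) * img[y0, x0] +
            dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] +
            dx * dy * img[y0 + 1, x0 + 1])

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
print(bilinear(img, 0.5, 0.5))  # -> 1.5 (equal-weight average of all 4 pixels)
```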
aliasing
high-frequency detail (e.g. edges) getting misrepresented as a false low-frequency pattern
Nyquist Theorem
to avoid aliasing, your sampling rate must be >= 2x the highest frequency in the signal
sampling rate
how many samples you take per unit of space (e.g. 2 samples at peak/trough for sine wave)
Gaussian pre-filtering
blur first with a Gaussian to remove any high-frequency details, then safely subsample —> smooths out any rapid variation that would have been aliased
How do you determine where an edge exists?
pixel intensity changes rapidly at an edge, so take derivative of an image function along a row —> edges are extrema/peaks
What do the gradient magnitude and gradient direction tell you?
gradient magnitude (magnitude of the gradient vector [df/dx, df/dy]) tells you edge strength
gradient direction (arctan((df/dy)/(df/dx))) tells you which way intensity increases the most and is perpendicular to the edge
How to handle noise when differentiating?
differentiation amplifies high frequencies (includes noise), so smooth first then differentiate
Sobel filter
smooths along edge direction and differentiates across it, resulting in an image telling you how strong an edge is at that location
Canny Edge Detection
Smooth by convolving w/ Gaussian to separate noise
Compute gradient with x/y derivatives of Gaussian (precompute)
Non-max suppression —> keeps local max gradients, sharp/thin edges
Hysteresis thresholding with two thresholds (mark if either above high or above low + connected to high)
What do lines do that edges don’t?
organize edges into geometric structures
Hough Voting
letting Canny-decided edge pixels vote on what lines they could lie on, and lines with most votes become actual lines
How can a line be written in polar form?
x cos theta + y sin theta = p (perpendicular dist from origin to line)
What do lines correspond to in Hough Space? What do points correspond to?
each line corresponds to a point (p, theta): its perpendicular dist from origin + its angle
each image point corresponds to a sinusoidal curve (the (p, theta) pairs of every line through that point)
Line Detection algorithm
Initialize 2D Hough array ‘acc’ as 0’s (axes are p and theta)
For all edge pixels (x, y), loop over theta values, compute p, and increment acc[p][theta] (use gradient detection at an edge pixel to restrict which theta you vote for to perpendicular angles only)
Find local maxima in acc (lines w/ more votes than its neighbors)
Threshold lines with a minimum vote count
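The voting loop above (minus the gradient-based theta restriction) can be sketched as:

```python
import numpy as np

def hough_lines(edge_pixels, thetas, rho_max):
    # accumulator axes are rho (shifted by rho_max so negative values index) and theta
    acc = np.zeros((2 * rho_max + 1, len(thetas)), dtype=int)
    for x, y in edge_pixels:
        for j, theta in enumerate(thetas):
            rho = int(round(x * np.cos(theta) + y * np.sin(theta)))
            acc[rho + rho_max, j] += 1
    return acc

# points on the vertical line x = 2 all vote for (rho = 2, theta = 0)
pts = [(2, y) for y in range(5)]
thetas = np.linspace(0, np.pi, 180, endpoint=False)
acc = hough_lines(pts, thetas, rho_max=10)
print(acc[12, 0])  # -> 5 (all five edge pixels agree on that line)
```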
segmentation
grouping pixels into regions that belong together
K-Means
Randomly pick k centers
Assign each point to nearest center
Recompute means as avg of assigned points
Repeat until assignments stop changing (convergence is guaranteed)
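The steps above as a minimal numpy sketch (random init and a fixed iteration count; the sample points are made up):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # points: (n, d) array; returns a cluster label per point plus the centers
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # random init
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return labels, centers

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centers = kmeans(pts, k=2)
print(labels[0] == labels[1], labels[2] == labels[3])  # -> True True
```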
How to apply K-Means to images?
represent a pixel as a point (r, g, b), set k to k colors (color segmentation), and encode position so we don’t group same color but different position (r, g, b, x, y)
How can we fix issues of large regions getting split up in K-Means?
use a very large k so that each small region is a superpixel, then decide later how to merge superpixels
Flood Fill
Represent all pixels as nodes and add edges in between only if that edge doesn’t cross an actual detected edge (edges between nodes within a defined region)
Find connected components using DFS, representing regions
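A sketch of the graph + DFS idea, with a `blocked` set standing in for pixels that lie on a detected edge (a hypothetical stand-in, not the course's representation):

```python
def connected_components(h, w, blocked):
    # label 4-connected regions of an h x w grid; regions never cross `blocked` pixels
    labels, next_label = {}, 0
    for start in ((r, c) for r in range(h) for c in range(w)):
        if start in blocked or start in labels:
            continue
        stack = [start]              # iterative DFS, i.e. flood fill
        labels[start] = next_label
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                n = (nr, nc)
                if 0 <= nr < h and 0 <= nc < w and n not in blocked and n not in labels:
                    labels[n] = next_label
                    stack.append(n)
        next_label += 1
    return labels, next_label

# a vertical detected edge at column 1 splits a 3x3 grid into two regions
blocked = {(0, 1), (1, 1), (2, 1)}
labels, n = connected_components(3, 3, blocked)
print(n)  # -> 2
```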
What does weighted graph-based segmentation do differently?
assign weights representing how likely two pixels are to belong to the same object (high = similar color and no edge, low = different color and strong edge)
Min-Cut Based GBS
segmentation involves cutting weak edges to separate clusters, so find the min-cut (minimizing total weight of cut edges) with Ford-Fulkerson
GBS algorithms are a family of algorithms that vary these four things:
Graph connectivity (which pixels get edges)
Edge weight detection (colors, gradient, etc.)
Per-node cost (cost for belonging to a segment)
Objective function (min-cut)
correspondence estimation
finding matching pixels/regions across 2+ images of same scene
General pipeline for correspondence estimation
Feature detection ~ find sparse (few high-confidence) set of distinctive points worth matching
Feature description ~ for all detected points, find a compact representation of its local appearance
Feature matching ~ find best matches among descriptors
Downstream task ~ use matches for later task, e.g. pose estimation
What are two key elements of good feature points?
repeatability ~ same point detectable between images despite changes in light/perspective
discriminability ~ point looks different from neighbors, so the matching is unambiguous
corner
point where shifting a small window in any direction leads to a large change in appearance
Why is a corner better than a flat or an edge?
flat ~ no change in any direction
edge ~ no change along edge, only across
What do the eigenvalues of the structure tensor tell you about intensity change?
both eigenvalues near 0 —> flat region (no gradient)
one eigenvalue high, other near 0 —> edge (change in only one dir)
both eigenvalues high —> corner
Harris operator
avoids computing eigenvalues by approximating cornerness score R = det(M) - k * trace(M)²
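A sketch of the score on hand-built structure tensors (the matrices are made up to make the three eigenvalue cases obvious):

```python
import numpy as np

def harris_response(M, k=0.05):
    # R = det(M) - k * trace(M)**2, avoiding an explicit eigendecomposition
    return np.linalg.det(M) - k * np.trace(M) ** 2

flat   = np.array([[0.0, 0.0], [0.0, 0.0]])    # both eigenvalues ~ 0
edge   = np.array([[10.0, 0.0], [0.0, 0.0]])   # one large, one ~ 0
corner = np.array([[10.0, 0.0], [0.0, 10.0]])  # both large

# only the corner scores high
print(harris_response(flat) == 0.0, harris_response(edge) < 0, harris_response(corner) > 0)  # -> True True True
```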
Harris Corner Detection
Compute image gradients Ix, Iy (apply Sobel filter)
Compute Ix², Iy², Ix * Iy
Gaussian blur each of these new images
Compute Harris response
Threshold R (keep only strong response)
Non-max suppression (keep only local maxima)
invariant
output doesn’t change when image is transformed (e.g. cornerness score)
equivariant
output transforms in same way as input image (e.g. corner location)
What are the two photometric transformations?
additive ~ I’ = I + c (represents a brightness change, shifting histogram of intensity values)
multiplicative ~ I’ = c * I (contrast changes, scales/stretches histogram)
Laplacian of Gaussians
second derivative of a Gaussian, approximated in practice by Difference of Gaussians; local maxima/minima of the response mark blobs (dark/bright centers)
characteristic scale
sigma value where LoG response peaks and blob size approximately matches size of the feature
blobs
circular regions of interest in an image
How do Harris and LoG behave under photometric transforms?
Harris ~ invariant to additive, not to multiplicative
LoG ~ invariant to additive (constants vanish under differentiation), not to multiplicative
Scale-Invariant Feature Detection
Run detector at many sigma values and find points (x, y, sigma) that are local maxima in 3D space, allowing you to match features at different distances in images
Gaussian pyramid
repeatedly blurring and downsampling an image to build multiple scales, then applying a fixed-size filter at each level (equivalent to scaling the filter)
descriptor
compact representation (vector of numbers) of the local appearance around a feature point
Normalized Cross Correlation
use the normalized pixel patch as the descriptor to remove effect of photometric changes
d = (patch - patch_mean) / std
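The normalization above, sketched on a made-up 2×2 patch:

```python
import numpy as np

def ncc_descriptor(patch):
    # normalize so a photometric change I' = a*I + b cancels out of the descriptor
    d = patch - patch.mean()
    return d / d.std()

patch = np.array([[1.0, 2.0], [3.0, 4.0]])
brighter = 3.0 * patch + 10.0   # multiplicative + additive lighting change
print(np.allclose(ncc_descriptor(patch), ncc_descriptor(brighter)))  # -> True
```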
Rotation Invariance for descriptors
Compute second moment matrix M (apply Sobel filter and match pixel positions)
Find eigenvector x_max corresponding to biggest eigenvalue
Compute theta = arctan(x_max_y / x_max_x) and apply the rotation matrix for -theta to reach a standard orientation
Describe rotated pattern w/ descriptor
Multiscale Oriented Patches Descriptor
Choose the right scale (find level in Gaussian pyramid with max Harris cornerness score)
Apply transformation matrix
Normalize intensity
MOPS Transformation Matrix
Apply MT1 ~ translate so feature is at origin
MR ~ rotate to standard orientation
MS ~ scale down 40 × 40 px —> 8 × 8 px
MT2 ~ translate to output image center
What fix would resolve the MOPS issue of failing at sophisticated lighting changes and rotations?
describe edge orientation instead of raw pixel values
Quantized Orientation Histograms
instead of recording exact gradient orientation, bin them into coarse 45 degree buckets
Scale-Invariant Feature Transform (SIFT)
Scale + rotation normalization (same as MOPS)
Divide patch into grid of 4×4 cells
Build orientation histogram per cell
Concatenate all histograms (4×4 cells x 8 bins = 128-dim descriptor vector)
In other words, for all cells in patch, count how many strong edges point in each of 8 dirs + stack these counts
What are three key improvements to SIFT?
Threshold weak edges
Soft voting with bilinear interpolation (vote proportionally into bins)
Normalize descriptor to unit length, clamp any high values, and renormalize
Ratio Test
Find ||f1 - f2|| / ||f1 - f2’||, where f2 is best match and f2’ is second best match
Small ratio —> best is important, keep the feature point; vice versa for large ratio
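The test can be sketched as follows (descriptor values are made up; 0.8 is an assumed threshold):

```python
import numpy as np

def passes_ratio_test(f1, descriptors, thresh=0.8):
    # keep a match only if the best distance is clearly smaller than the second best
    dists = np.sort(np.linalg.norm(descriptors - f1, axis=1))
    return dists[0] / dists[1] < thresh

f1 = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.1],    # f2: very close
                       [5.0, 5.0],    # f2': far away, so the ratio is small
                       [6.0, 5.0]])
print(passes_ratio_test(f1, candidates))  # -> True
```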
What are the key aspects of the pinhole model?
origin at the pinhole, Z-axis points away from image plane (which sits at Z = -1), 3D point P = (X, Y, Z) —> 2D point p = (x, y)
What behavior do parallel lines have as a result of projection?
converge at a vanishing point as Z —> infinity; vanishing points of parallel lines lying in a common plane share a vanishing line
What extra thing is required for a camera when considering the real-world coordinate system?
needs 3×3 orthogonal rotation matrix R and 3D translation vector t
p_cam = R * p_world + t
homogeneous coordinates
adding an extra coordinate to represent points so that perspective projection becomes matrix multiplication (e.g. (X, Y, Z) —> (X, Y, Z, 1))
What is the key equation for camera reconstruction?
p_camera = K * [R | t] * p_world
K = intrinsic camera matrix
[R | t] = rotation and translation combined into 3×4 projection matrix
p_world = 4D homogeneous point
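The equation above, sketched with made-up intrinsics (focal length 500 and principal point (320, 240) are illustrative assumptions):

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],     # intrinsics: focal length + principal point
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])  # identity rotation, small translation
P_world = np.array([1.0, 0.0, 8.0, 1.0])     # homogeneous 3D point

Rt = np.hstack([R, t[:, None]])   # 3x4 [R | t] projection matrix
p = K @ Rt @ P_world              # homogeneous 2D point
x, y = p[:2] / p[2]               # divide by the last coordinate to get pixels
print(x, y)  # -> 370.0 240.0
```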
parallax
nearby things shift more than far things
calibration
finding camera parameters based on known 3D points + correspondences
Direct Linear Transform
you know world (X, Y, Z) —> img (x, y), i.e. x = PX
Find 6 non-coplanar correspondences, producing 12 equations (enough for the 11 degrees of freedom), and solve for p as the eigenvector of AᵀA with the smallest eigenvalue (unflatten p to yield P)
reprojection error
project known 3D points onto image using estimated P from direct linear transform, then calculate sum of squared dists from true observations
panorama effect
two images taken with pure rotation and no translation of 3D scene are still related by a homography (despite rotating, you still look down the same ray at a given point)
triangulation
given two calibrated cameras and a correspondence, find 3D point (X, Y, Z, 1)
rectified cameras
both cameras parallel and separated by horizontal translation t
stereo
2 cameras side by side, match same point in both images —> helps you solve for Z
disparity
x2 - x1, difference in x coordinate of the same point between the 2 images
How is disparity related to depth Z?
Z = tx/(x2 - x1) (assuming focal length 1), where tx is the baseline, i.e. the distance between the two cameras
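Worked with made-up numbers (focal length 1 assumed, as in the pinhole setup above):

```python
tx = 0.5               # baseline: cameras 0.5 units apart
x1, x2 = 0.30, 0.35    # matched x coordinates of the same point in the two images
disparity = x2 - x1    # nearby points shift more, i.e. larger disparity
Z = tx / disparity
print(round(Z, 6))  # -> 10.0
```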
Adjusted NCC for Stereo
Rectified cameras —> corr pts on same row, so only search along same horizontal line
Take patch around pixel a in image 1, pixel b in image 2
Subtract means of patches from each patch
NCC = (a · b) / (|a||b|)
NCC = 1 —> perfect match, opposite for -1
plane sweep
reduce cost of NCC by looping over candidate disparity values d, shifting the second image by d pixels, and computing NCC for every pixel at once
homography
3×3 transformation mapping pixels in one image to pixels in another (approximation unless the scene is flat)
How do you adjust for some rotation in real cameras?
apply homography to each image to get a rectified config, then run stereo
Structure from Motion
images from unknown cameras looking at unknown 3D scene —> recover both camera params and scene
epipolar line
the projection into image 2 of the viewing ray through a point x in image 1; x's match must lie somewhere on this line
epipole
where the other camera appears in a camera’s image
epipolar pencil
the family of epipolar planes rotating about the fixed baseline; every epipolar line in an image passes through its epipole
Essential Matrix
given a point on img 1, E tells you which line in img 2 to search (encodes the relative rotation and translation between the cameras)
Fundamental Matrix
same as essential matrix E, but works on raw pixel coordinates w/o needing to know K (can be approximated)