L9 Detectors


1
New cards

Machine vision tasks on images

  • classification

  • localization

  • object detection

  • segmentation

2
New cards

Image classification

  • Assigns a label to the entire image.

  • Single object task

  • Example datasets: ImageNet

3
New cards

Localization

  • Identifies the location of an object from a predefined list of classes using bounding boxes.

  • Example: Locate a car in an image.

  • The training data contains not just the one-hot class labels, but b_x, b_y, b_w and b_h for defining the bounding box and a p_c binary value, which signals whether there is an object in the image at all.

4
New cards

Object detection

  • Detects and localizes multiple objects from certain classes within an image.

  • Combines classification and localization.

  • Can either run a sliding window across the whole image and have a convolutional network predict class probabilities for each window, or pass the whole image through the network once, so the resulting output matrix contains the predictions for every sliding-window position

5
New cards

Segmentation tasks

  • Semantic Segmentation: Groups pixels by category (e.g., sky, road).

  • Instance Segmentation: Groups pixels by the individual object instance they belong to, for a given set of classes

  • Panoptic Segmentation: Combines semantic and instance segmentation for a comprehensive output. → assigns a class to each pixel, but differentiates instances too

6
New cards

Segmentation datasets

  • Pascal VOC (20 object categories for instance segmentation)

  • MS COCO (80 things categories, 91 stuff categories)

  • Cityscapes (semantic segmentation from 50 different cities)

7
New cards

Two main object types in CV tasks

Things: objects that have a well-defined geometry and are countable, like people, cars, and animals

Stuff: objects that don’t have a well-defined geometry and are identified mainly by texture and material, like the sky, road, or water bodies

8
New cards

Machine vision tasks on video

  • Tracking: Monitoring object movement across frames, using bounding boxes

  • MOT (Multi Object Tracking): Tracks multiple objects simultaneously.

  • MOTS (Multi Object Tracking and Segmentation): Combines MOT with instance segmentation

  • VOS (Video Object Segmentation): Separates moving objects from the background.

  • VIS (Video Instance Segmentation)

9
New cards

MOTS

  • Combines tracking (following objects through frames) and instance segmentation (differentiating individual objects in a category).

  • Outputs instance masks for each tracked object in each frame

  • Use case: autonomous driving, where each car or pedestrian is uniquely identified over time.

10
New cards

VOS

  • Segments moving objects in a video without focusing on identifying or tracking individual instances.

  • Emphasis on object-background separation.

  • Doesn’t do any tracking

  • Use case: video editing, where a moving object (e.g., a person or animal) needs to be isolated for further processing.

11
New cards

VIS

  • Combines tracking, segmenting, and classifying instances.

  • VIS focuses on classification (e.g., identifying a "cat" or "dog"), while MOTS may not always require classifying objects but focuses on tracking and segmentation.

  • Example datasets: YouTube VIS, OVIS (occluded)

12
New cards

IoU

Intersection over union

=> the ratio of the overlap area to the union area of two bounding boxes; measures how well a predicted box matches a ground-truth one
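
A minimal sketch in Python (the (x1, y1, x2, y2) box convention is an assumption, not stated on this card):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area_a + area_b - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# e.g. iou((0, 0, 10, 10), (5, 5, 15, 15)) ≈ 0.143
```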

13
New cards

R-CNN family

  • R-CNN

    • Uses ImageNet-pretrained backbone.

    • Employs Non-Max Suppression (NMS) and Hard Negative Mining.

  • Fast R-CNN

    • Introduces ROI Pooling for efficient region extraction.

    • Runs the CNN only once on the whole image to extract features

  • Faster R-CNN

    • End-to-end training with Region Proposal Networks (RPN).

    • Utilizes anchors for predefined bounding boxes.

  • Mask R-CNN

    • Extends Faster R-CNN to instance segmentation tasks.

    • Uses ROI Align instead of ROI Pooling

    • Adds FPN to get multi-scale features

14
New cards

Selective search

Purpose:

  • Generates region proposals for object detection by grouping similar pixels in an image.

How It Works:

  1. Initial Segmentation:

    • Divides the image into small regions using a segmentation algorithm

  2. Region Similarity Calculation:

    • Measures similarity between adjacent regions using color histograms, texture patterns, size similarity, shape compatibility

  3. Hierarchical Grouping:

    • Recursively merges similar regions to form larger regions.

    • Produces a hierarchy of region proposals at multiple scales.

  4. Region Proposal Output:

    • Outputs ~2000 candidate regions with bounding boxes.

Advantages:

  • Simple and effective for generating object proposals.

  • Class-agnostic and doesn’t require training.

Drawbacks:

  • Computationally expensive and slow.

  • Not optimized for specific datasets or tasks.

Applications:

  • Used in early object detection models like R-CNN and Fast R-CNN.
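
A short usage sketch with OpenCV's contrib implementation (assumes the opencv-contrib-python package is installed; the image path is a placeholder):

```python
import cv2

img = cv2.imread("street.jpg")  # placeholder input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # trade proposal quality for speed
rects = ss.process()               # array of (x, y, w, h) region proposals

print(len(rects), "region proposals")  # typically on the order of ~2000
```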

15
New cards

Non-Max Suppression (NMS)

Purpose: select the best bounding box from overlapping detections to avoid duplicate predictions.

Process:

  1. Calculate a confidence score for each detected object (e.g., from the object classification network).

  2. Sort all boxes by confidence scores in descending order.

  3. For the box with the highest score:

    • Retain it as a final detection.

    • Compare it to all other boxes. Remove those with an Intersection over Union (IoU) above a certain threshold.

  4. Repeat for the next highest-scoring box.

Key Parameters:

  • Confidence score threshold: Filters low-confidence boxes.

  • IoU threshold: Determines the overlap level allowed between boxes.

Applications:

  • Used in object detection models like R-CNN, YOLO, and Faster R-CNN.

Challenges:

  • Can suppress valid detections in cases of high overlap.
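
A from-scratch sketch of the greedy NMS loop described above (NumPy, boxes as (x1, y1, x2, y2); the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop heavily overlapping ones."""
    order = scores.argsort()[::-1]           # sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                        # retain the best remaining box
        # IoU of the retained box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        # drop boxes whose overlap with the retained box exceeds the threshold
        order = order[1:][iou < iou_thresh]
    return keep
```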

16
New cards

Hard Negative Mining

Purpose:

  • Improves model performance by focusing on difficult examples (false positives and false negatives).

Process:

  1. During training, identify incorrectly classified examples (e.g., background classified as an object or vice versa).

  2. Add these hard negatives to the training data with higher weights.

  3. Retrain the model to better differentiate between background and objects.

Applications:

  • Commonly used in object detection to reduce false detections.

Challenges:

  • Computationally expensive as it requires analyzing the entire dataset for hard negatives.
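
A minimal sketch of one mining round (PyTorch-style; `model` and `negatives_loader` are hypothetical names, and the top-k selection is one common way to pick hard negatives):

```python
import torch

def mine_hard_negatives(model, negatives_loader, keep_per_batch=16):
    """Collect the background examples the current model gets most wrong."""
    hard = []
    model.eval()
    with torch.no_grad():
        for imgs, labels in negatives_loader:      # all labels are "background"
            logits = model(imgs)
            # per-sample loss: high loss = confidently wrong negative
            losses = torch.nn.functional.cross_entropy(
                logits, labels, reduction="none")
            worst = losses.topk(min(keep_per_batch, len(losses))).indices
            hard.extend(imgs[worst])
    return hard  # add these (with higher weight) to the next training round
```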

17
New cards

R-CNN

How It Works:

  1. Region Proposals:

    • Uses an external algorithm (Selective Search) to generate ~2000 region proposals.

  2. Feature Extraction:

    • Applies an ImageNet-pretrained convolutional neural network (CNN) to each region proposal to extract features.

  3. Classification:

    • A Support Vector Machine (SVM) classifies each region as an object or background.

  4. Bounding Box Regression:

    • Refines the bounding box coordinates.

  5. NMS:

    • Selects the best bounding box from overlapping detections

  6. Hard Negative Mining:

    • After initial training, analyze the classifier’s errors on the dataset and apply hard negative mining to retrain the SVM for better performance.

Advantages:

  • Significant improvement over traditional object detection methods.

Drawbacks:

  • Computationally expensive due to:

    • Extracting features for every region.

    • Using separate pipelines for feature extraction, classification, and bounding box regression.

  • The ROIs get warped to the same size before being fed to the CNN → can distort information

18
New cards

Region of Interest (ROI) Pooling

Purpose:

  • Extracts fixed-size feature maps for regions of interest (ROIs) in an image.

Process:

  1. Input:

    • Feature map (from a convolutional layer).

    • ROIs (bounding boxes proposed by a region proposal network).

  2. Divide each ROI into a grid of fixed dimensions (e.g., 7x7).

  3. Apply max-pooling within each grid cell to reduce the region to a fixed size.

Advantages:

  • Enables downstream fully connected layers to process regions of different sizes.

  • Reduces computation while retaining important features.

Applications:

  • Integral to Fast R-CNN and Faster R-CNN architectures.
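
A usage sketch with torchvision's `roi_pool` operator (the feature-map size, ROI coordinates, and 1/16 downsampling factor are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 50, 50)          # feature map from the backbone
# ROIs in original image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 32., 32., 224., 224.]])

# spatial_scale maps image coordinates onto the feature map
# (1/16 here, assuming the backbone downsamples by 16)
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]) — fixed size regardless of ROI size
```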

19
New cards

Fast R-CNN

How It Works:

  1. Single Forward Pass:

    • Applies a CNN (e.g., VGG16) to the entire image to generate a feature map.

  2. ROI Projection:

    • Maps ROIs from the original image space onto the feature map

    • ROI coordinates are scaled to match the feature map’s dimensions

  3. ROI Pooling:

    • Extracts fixed-size feature maps for region proposals (from Selective Search) using ROI pooling.

  4. Unified Network:

    • A single fully connected network performs classification and bounding box regression simultaneously.

Advantages:

  • Faster than R-CNN:

    • Feature extraction is done once for the entire image.

    • Combines classification and bounding box regression into one step.

  • End-to-end trainable.

Drawbacks:

  • Still relies on Selective Search for region proposals, which is computationally expensive.

20
New cards

Anchor boxes

Purpose:

  • Predefined bounding boxes used in RPNs to detect objects of different sizes and aspect ratios.

Characteristics:

  • Anchors are centered at specific points on the feature map.

  • Each anchor has a predefined scale (e.g., small, medium, large) and aspect ratio (e.g., 1:1, 1:2, 2:1).

Process:

  • RPNs predict adjustments (deltas) to the anchors to refine the bounding boxes.

Advantages:

  • Enables detection of objects with varying scales and shapes.

Challenges:

  • Requires careful tuning of anchor sizes and aspect ratios to match the dataset.
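
A small sketch generating the k anchors for one feature-map location (the scales and aspect ratios below are illustrative, not tied to a specific paper):

```python
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered at (cx, cy)."""
    boxes = []
    for s in scales:                 # anchor size in pixels
        for r in ratios:             # aspect ratio = width / height
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # keeps the area at s**2
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)           # shape (9, 4) with the defaults

print(anchors_at(100, 100).shape)    # (9, 4)
```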

21
New cards

Region Proposal Network (RPN)

Purpose:

  • Generates candidate object bounding boxes (region proposals) for object detection.

Process:

  1. Generates anchor points which are placed at regular intervals over the feature map

  2. For each anchor point, we have k predefined anchor boxes of different sizes and aspect ratios

  3. For each achor box:

    • Predicts objectness scores (whether the region contains an object).

    • Regresses bounding box coordinates.

  4. Applies NMS to filter overlapping proposals.

Advantages:

  • End-to-end trainable with the detection network.

  • Faster and more efficient than traditional methods like Selective Search.

22
New cards

Faster R-CNN

How It Works:

  1. Feature Extraction:

    • A CNN generates a feature map for the entire image.

  2. Region Proposal Network (RPN):

    • Replaces Selective Search with an RPN to generate region proposals on the feature map

    • RPN is end-to-end trainable, making it faster and more efficient.

  3. ROI Pooling:

    • Extracts fixed-size feature maps for the proposed regions.

  4. Classification and Bounding Box Regression:

    • Performed on the ROI feature maps.

Advantages:

  • Faster than Fast R-CNN due to the RPN.

  • Fully end-to-end trainable.

Drawbacks:

  • Struggles with pixel-level segmentation tasks.
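
An inference sketch using torchvision's reference Faster R-CNN (assumes a recent torchvision; the input tensor is a placeholder for a real image):

```python
import torch
import torchvision

# Pretrained Faster R-CNN with a ResNet-50 + FPN backbone
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # placeholder RGB image scaled to [0, 1]
with torch.no_grad():
    out = model([image])[0]                # one output dict per input image

# Each detection: box (x1, y1, x2, y2), class label, confidence score
print(out["boxes"].shape, out["labels"].shape, out["scores"].shape)
```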

23
New cards

Feature Pyramid Network (FPN)

Purpose:

  • Enhance object detection by effectively utilizing multi-scale features for detecting objects of various sizes.

  • Make high-level semantic features “flow back down” to low level features, which have the most spatial information

How It Works:

  1. Feature Extraction:

    • Extracts feature maps from different levels of the CNN backbone (e.g., ResNet), corresponding to different spatial resolutions.

  2. Top-Down Pathway:

    • High-level semantic features (e.g., from deeper layers, with smaller spatial dimensions but richer in object context) are upsampled progressively.

  3. Lateral Connections:

    • Combines upsampled features with lower-level features (higher spatial resolution) through lateral connections to retain spatial details
      → merged by addition

  4. Pyramid Outputs:

    • Produces a pyramid of feature maps at multiple scales, each containing both spatial details and semantic context.

  5. Usage in Detection:

    • Each scale in the pyramid is used for detecting objects of corresponding sizes.

Advantages:

  • Improves detection of small objects (e.g., detecting small pedestrians or distant cars).

  • Efficient multi-scale feature representation.
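
A minimal sketch of the top-down pathway with lateral connections merged by addition (channel counts and feature-map sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convs for the lateral connections, 3x3 convs to smooth the merged maps
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                 # feats: low -> high level (e.g. C3, C4, C5)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample the coarser, more semantic map and merge by addition
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]   # pyramid outputs

feats = [torch.randn(1, 256, 64, 64),
         torch.randn(1, 512, 32, 32),
         torch.randn(1, 1024, 16, 16)]
print([p.shape for p in TinyFPN()(feats)])
```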

24
New cards

ROI Align

Purpose:

  • To map bounding boxes (Region Proposals) from the original image space to the feature map with high precision.

  • To avoid the spatial misalignment caused by the rounding operations in ROI Pooling.

  • Essential for tasks like instance segmentation, where pixel-level accuracy is required.

How it works:

  • Each ROI (bounding box) is mapped to the continuous coordinates on the feature map based on the downsampling ratio of the CNN backbone.

  • Instead of rounding ROI boundaries to the nearest grid point (as in ROI Pooling), ROI Align uses the exact floating-point coordinates.

  • At each grid cell, calculates the feature value using bilinear interpolation:

    • Interpolates the values of neighboring feature map points to compute the precise value for the grid cell.

Output:

  • Produces a fixed-size feature map for each ROI, preserving spatial details.
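
A usage sketch with torchvision's `roi_align` (shapes, the 1/16 scale, and the sampling ratio are illustrative; `aligned=True` assumes a torchvision version that supports the half-pixel correction):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0, 32.7, 41.3, 221.9, 230.4]])   # fractional coordinates are kept

# No rounding of the ROI: values are sampled with bilinear interpolation
out = roi_align(feat, rois, output_size=(14, 14),
                spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
print(out.shape)  # torch.Size([1, 256, 14, 14])
```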

25
New cards

Mask R-CNN

How It Works:

  1. Builds on Faster R-CNN:

    • Adds a branch to predict segmentation masks for each region.

    • Uses FPN to get multi-scale features

  2. Mask Branch:

    • For each ROI, predicts a binary mask (pixel-wise classification) for the object.

  3. ROIAlign:

    • Improves ROI pooling by avoiding quantization errors, leading to better mask accuracy.

  4. Multi-task Learning:

    • Simultaneously performs:

      • Object classification.

      • Bounding box regression.

      • Instance segmentation.

Advantages:

  • Handles pixel-level segmentation (instance segmentation).

  • ROIAlign improves spatial precision compared to ROI Pooling.

Drawbacks:

  • More computationally expensive than Faster R-CNN.

26
New cards

Spatial Transformer Networks (STN)

Purpose:

  • To make neural networks spatially invariant by dynamically transforming input feature maps, improving robustness to variations like rotation, scaling, and translation.

How It Works:

  1. Input:

    • A feature map or image that may have spatial distortions.

  2. Three Components:

    • Localization Network:

      • Predicts transformation parameters (e.g., rotation, scaling, translation).

      • Typically implemented as a small CNN.

    • Grid Generator:

      • Creates a sampling grid based on the transformation parameters.

    • Sampler:

      • Maps input pixels to the new locations defined by the grid.

  3. Output:

    • A transformed feature map that is easier for the network to process.

Advantages:

  • Reduces the need for extensive data augmentation.

  • Improves model performance on tasks involving spatial variations (e.g., rotated text recognition or detecting objects in tilted images).

  • Can be added to existing architectures without significant modifications.

Limitations:

  • Additional computational overhead due to transformation steps.

  • Requires careful tuning of the localization network
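
A compact STN sketch using `affine_grid` and `grid_sample` as the grid generator and sampler (layer sizes and the MNIST-like input shape are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self):
        super().__init__()
        # Localization network: predicts the 6 parameters of a 2D affine transform
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(8 * 4 * 4, 6))
        # Initialize to the identity transform so training starts undistorted
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                            # transform params
        grid = F.affine_grid(theta, x.size(), align_corners=False)    # grid generator
        return F.grid_sample(x, grid, align_corners=False)            # sampler

print(STN()(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 1, 28, 28])
```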

27
New cards

Deformable convolution

Main idea:

  • Adds learnable offsets Δpn to the standard fixed grid sampling locations in convolution

  • Sample the input feature map X at the modified positions, using bilinear interpolation if the offsets are fractional

  • The offsets are predicted by an auxiliary small network

Purpose:

  • Allow the kernel to adapt to the geometric structure of the input.

  • Improve the model's ability to handle spatial variations, such as object deformation, rotation, and scaling, which traditional convolution struggles with due to its rigid sampling grid.
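
A usage sketch with torchvision's deformable convolution (the offset-predicting conv and all sizes are illustrative; the offset layout of 2 values per kernel position follows torchvision.ops.DeformConv2d):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 64, 32, 32)

# Auxiliary conv predicts 2 offsets (Δx, Δy) per kernel position: 2 * 3 * 3 = 18 channels
offset_pred = nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
deform = DeformConv2d(64, 128, kernel_size=3, padding=1)

offsets = offset_pred(x)      # learnable offsets Δp_n, one pair per sampling location
out = deform(x, offsets)      # samples x at the shifted positions (bilinear interpolation)
print(out.shape)              # torch.Size([1, 128, 32, 32])
```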

28
New cards

Deformable Convolutional Networks

DETR (Detection Transformer)

  • End-to-end object detection using transformers.

  • Components:

    • Positional Encoding: Adds spatial information to the transformer.

    • Encoder-Decoder Architecture: Processes image features for object detection.

  • Drawbacks:

    • Computationally expensive for high-resolution images.

    • Slow convergence due to quadratic complexity.

Deformable DETR

  • Improves on DETR by:

    • Using sparse attention mechanisms for efficiency.

    • Handling complex scenes with better accuracy.

Mask2Former

  • Foundation model for segmentation tasks.

  • Builds on VIS concepts for enhanced video instance segmentation.

29
New cards

DEtection TRansformer (DETR)

Purpose:

  • Detect objects in an image end-to-end using a transformer architecture

  • Rethink object detection as a set prediction problem instead of using traditional methods like region proposals or anchors

How It Works:

  1. Feature Extraction:

    • Uses a CNN backbone (e.g., ResNet) to extract feature maps from the input image.

  2. Transformer Encoder-Decoder:

    • Encoder: Processes the feature map, capturing global context via multi-head self-attention
      → role: separate object instances

    • Decoder: Uses learnable generalized object queries which interact with the encoded image features to collect information about object classes and locations in the current image.

  3. Prediction Heads:

    • simple FC layers to convert the updated queries into concrete outputs

    • separate heads for classification and bounding box prediction

  4. Prediction Matching:

    • For each pair of a predicted query i and a ground truth object j, a cost matrix is computed using the weighted sum of the classification and bounding box loss

    • The Hungarian algorithm finds the optimal one-to-one assignment between predictions and ground truth objects that minimizes the total matching cost.

    • After matching, each ground truth object has one assigned prediction (bounding box + class probabilities); unmatched queries are trained to predict a “no object” class
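
A minimal sketch of the bipartite matching step with SciPy's Hungarian solver (the cost entries here are random placeholders; in DETR they are the weighted sum of classification and box losses):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 4
cost = np.random.rand(num_queries, num_gt)     # cost[i, j]: query i vs. ground truth j

# Optimal one-to-one assignment minimizing the total matching cost
query_idx, gt_idx = linear_sum_assignment(cost)
for q, g in zip(query_idx, gt_idx):
    print(f"query {q} <- ground-truth object {g}")
# All remaining queries are supervised with the "no object" class.
```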

30
New cards

Advantages and disadvantages of DETR

Advantages:

  • End-to-end training: directly predicts the bounding boxes and class labels without the need for complex post-processing steps like NMS

  • Unified architecture: uses a single architecture (a Transformer) for both the detection and classification tasks, no need for separate RPN and classifiers

  • Handles complex scenes with many objects.

  • Simple and general: fewer task-specific components (like anchor boxes)

Drawbacks:

  • Computationally expensive (transformer encoder has quadratic cost), especially for high-resolution images → issues with small object detection

  • Longer training time: slow convergence, because attention is initially spread almost uniformly, whereas CNNs have a built-in inductive bias toward local neighborhoods

  • Needs large datasets

31
New cards

Positional encoding in DETR

Purpose of Positional Encoding:

  • Since the Transformer architecture used in DETR does not inherently encode sequential or spatial information, positional encoding is introduced to give the model a sense of the position of objects in an image.

  • Helps the model differentiate between objects that are in the same class but in different spatial locations.

How Positional Encoding Works:

  • 2D Positional Encoding: The original DETR uses a 2D grid-based positional encoding generated using sine and cosine functions of different wavelengths to represent both horizontal and vertical positions of image patches.

  • These encodings are added to the input feature maps at the beginning, combining spatial position information with learned visual features.
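
A sketch of a 2D sine-cosine positional encoding in this spirit (the exact channel split and temperature are assumptions following the common convention, not taken from the card):

```python
import torch

def positional_encoding_2d(h, w, dim=256, temperature=10000.0):
    """Half of the channels encode the y position, the other half the x position."""
    d = dim // 2
    freqs = temperature ** (torch.arange(0, d, 2) / d)        # one wavelength per channel pair
    y = torch.arange(h).float()[:, None, None] / freqs        # (h, 1, d/2)
    x = torch.arange(w).float()[None, :, None] / freqs        # (1, w, d/2)
    pe_y = torch.cat([y.sin(), y.cos()], dim=-1).expand(h, w, d)
    pe_x = torch.cat([x.sin(), x.cos()], dim=-1).expand(h, w, d)
    return torch.cat([pe_y, pe_x], dim=-1)                    # (h, w, dim)

pe = positional_encoding_2d(32, 32, dim=256)
print(pe.shape)   # torch.Size([32, 32, 256]) — added to the flattened feature map
```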

32
New cards

Queries in DETR

Definition:
Queries in DETR are learnable embeddings (vectors) that serve as placeholders for potential objects in an image.

Purpose:
To interact with the encoded image features via the transformer decoder and produce predictions for object detection (bounding boxes and class labels).

Key Characteristics

  1. Learnable:

    • Queries are initialized as random embeddings and are optimized during training via backpropagation to generalize better across images.

  2. Content-Independent (Initially):

    • At the start of decoding, queries are not tied to any specific image or object. They are generalized placeholders.

  3. Content-Specific (During Decoding):

    • Queries interact with the encoded image features through cross-attention in the decoder, becoming specific to the objects in the image → the updated queries produce the current image’s predictions, but the updates are not kept; only the learned initial embeddings persist across images

Key Benefits

  • Eliminates the need for predefined anchors or region proposals.

  • Queries dynamically adapt to the image content during decoding.

  • Supports end-to-end training and simplifies the detection pipeline.

33
New cards

Deformable DETR

How It Works:

  • Builds on DETR but addresses its inefficiencies:

    1. Sparse Attention:

      • Focuses attention on a small, learnable set of key points instead of all locations.

    2. Deformable Attention Module:

      • Samples relevant regions dynamically for each object based on predicted offsets, adapting to their shape, position, and size.

    3. Multi-scale feature maps:

      • Combines feature maps from different resolutions (like Feature Pyramid Networks) to detect objects of various sizes.

Advantages:

  • Faster convergence compared to DETR.

  • Better performance for complex scenes with occlusions or small objects.

  • Efficient computation → linear

Drawbacks:

  • Slightly more complex than DETR due to deformable attention mechanisms.

34
New cards

Deformable attention in Deformable DETR

Instead of attending to all positions, each query samples k keys from the feature map for each attention head
→ the sampling positions are obtained by predicting offsets from the current query’s reference position

→ the sampled values get aggregated by weighted summation based on learnt attention weights

Number of queries:

  • in the encoder there is a query for each position of the input feature map produced by a CNN backbone

  • in the decoder the number of queries is the number of outputs, and there is a learnt reference point for each query

=> the cost is linear, O(N·k), instead of quadratic, O(N²)

35
New cards

Mask2Former

How It Works:

Builds on the architecture of DETR, but it replaces bounding box prediction with dynamic mask prediction and incorporates a pixel decoder for fine-grained feature processing

  1. Backbone:

    • A CNN (e.g., ResNet or Swin Transformer) extracts multi-scale feature maps from the input image.

  2. Pixel Decoder:

    • Aggregates the multi-scale features into a dense pixel-level high-resolution feature map.

    • These features retain fine-grained spatial information crucial for segmentation tasks.

  3. Transformer Decoder:

    • Uses query-based attention to predict object masks and their corresponding classes.

    • Queries represent potential objects or regions in the image.

  4. Mask Prediction Head:

    • Combines the output of the transformer decoder with the pixel decoder features to produce binary masks for each query.

Advantages:

  • Generalizes instance, semantic, and panoptic segmentation into a single framework.

  • Achieves state-of-the-art results in segmentation benchmarks.

Drawbacks:

  • Computationally demanding for high-resolution images.

36
New cards

Mask2Former for video segmentation

  1. Frame-Level Features:

    • Each frame is processed individually through the backbone and pixel decoder.

  2. Query Propagation:

    • Queries are carried across frames to maintain consistency for objects appearing in multiple frames.

    • This allows the model to track objects over time.

  3. Temporal Attention:

    • In addition to spatial cross-attention (within each frame), queries also attend to features from previous frames.

  4. Output:

    • For each frame, the model produces segmentation masks and tracks object instances over time.

37
New cards

Masked attention in Mask2Former

  • Restricts attention to a specific subset of features or regions, defined by a mask

  • During the computation of the attention scores, invalid positions are masked (set to a large negative value, effectively ignored).

Where Masked Attention Appears in Mask2Former

  • During Mask Refinement:

    • Mask2Former uses dynamic masks predicted by the transformer decoder to guide attention.

    • For each query, the model generates a dynamic mask that highlights the relevant regions in the pixel decoder's feature map.

  • In Cross-Attention:

    • The mask restricts the query’s attention to regions it is responsible for, improving segmentation accuracy and efficiency.
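
A toy sketch of the masking idea: attention scores at positions outside a query's predicted region are set to a large negative value before the softmax (shapes, names, and the random mask are purely illustrative):

```python
import torch
import torch.nn.functional as F

queries = torch.randn(10, 256)             # 10 object queries
pixels = torch.randn(4096, 256)            # flattened pixel-decoder features (64x64)
region_mask = torch.rand(10, 4096) > 0.5   # per-query foreground mask from the previous layer

scores = queries @ pixels.T / 256 ** 0.5                   # raw cross-attention scores (10, 4096)
scores = scores.masked_fill(~region_mask, float("-inf"))   # ignore positions outside the mask
attn = F.softmax(scores, dim=-1)
out = attn @ pixels                        # each query only aggregates its own region
print(out.shape)                           # torch.Size([10, 256])
```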

38
New cards

MaskTrackRCNN

A framework designed specifically for video instance segmentation (VIS). It extends Mask R-CNN by adding a mechanism to track instances across frames.

How MaskTrackRCNN Works

  1. Backbone and Mask R-CNN Framework:

    • Processes each video frame using a standard Mask R-CNN pipeline:

      • Extracts feature maps.

      • Generates region proposals (RPN).

      • Predicts bounding boxes, classes, and instance masks.

  2. Instance Association Across Frames:

    • Adds a tracking head to associate instances across frames:

      • Computes instance embeddings for detected objects.

      • Matches embeddings from one frame to the next using similarity metrics (e.g., cosine similarity).

  3. Output:

    • For each video, it produces per-frame instance masks and a tracking ID for each object.
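
A sketch of the cross-frame association step: cosine similarity between instance embeddings, then a one-to-one match (the embedding dimension and the use of the Hungarian solver here are illustrative assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

prev_emb = torch.randn(3, 128)   # embeddings of instances tracked so far
curr_emb = torch.randn(4, 128)   # embeddings of detections in the current frame

# Cosine similarity matrix between previous and current instances
sim = torch.nn.functional.normalize(prev_emb, dim=1) @ \
      torch.nn.functional.normalize(curr_emb, dim=1).T

# Maximize similarity = minimize negative similarity
prev_idx, curr_idx = linear_sum_assignment(-sim.numpy())
for p, c in zip(prev_idx, curr_idx):
    print(f"detection {c} keeps tracking ID {p}")
# Unmatched current detections start new tracking IDs.
```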

39
New cards

Segment Anything Model (SAM)

A universal segmentation model designed to segment any object in an image automatically or interactively through prompts, even without fine-tuning on specific datasets.

SAM Architecture

  1. Image Encoder Backbone (Vision Transformer):

    • A modified ViT extracts image features.

    • The backbone processes the entire image in a single forward pass, providing a high-resolution feature map.

  2. Prompt Encoder:

    • Encodes user-provided prompts such as:

      • Points: Positive or negative points to indicate object presence or absence.

      • Boxes: Bounding boxes around objects of interest.

      • Masks: Rough or partial masks for the object.

      • Text

    • Outputs embeddings that guide the segmentation process.

  3. Mask Decoder:

    • Combines features from the image encoder and prompt embeddings to predict segmentation masks.

    • Produces high-quality masks that align accurately with object boundaries.
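
A usage sketch with the official segment-anything package (assumes `pip install segment-anything`; the checkpoint path, image, and point prompt are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder for a real RGB image
predictor.set_image(image)                        # one forward pass through the image encoder

# One positive point prompt (label 1 = foreground)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True)                        # several candidate masks + quality scores
print(masks.shape, scores.shape)
```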

40
New cards

Iterative training of SAM

  • Training starts with simpler tasks or annotations, and these annotations are iteratively refined to improve mask quality and diversity.

  • The model’s predictions are progressively used to generate better training data for itself.

1. Base Model Training: Pretrain a simple segmentation model on existing datasets.

2. Mask Generation: Generate initial masks for images using the base model.

3. Error region: computed as the difference between the model’s predicted mask and the human-annotated mask

4. Expanded input: in the next training round, the model’s previous mask prediction and a prompt derived from the error region are included in the input as well
