Image Classification
assigns a single label to the whole image; no spatial extent

Semantic Segmentation
labels each pixel, outputs dense label map same spatial size as input

How is semantic segmentation different from instance segmentation?
Semantic segmentation only distinguishes between classes (all objects of the same class share one label)
Instance segmentation distinguishes between classes and also between separate objects of the same class
Object detection
predicts bounding boxes and class labels for each object; outputs a set of (x, y, w, h) boxes with class labels

Instance Segmentation
Combo of semantic segmentation and object detection
Most advanced of the four tasks
Analyzes every pixel → draws an exact border around each object → gives each object its own label
Outputs a per-object binary mask

Sliding Window Semantic Segmentation
Slide a window across the image and classify what is inside each window
Keep moving the window until every pixel has been covered at least once
Aggregate the per-window predictions into a dense label map of the image
Major limitation: heavy redundant computation, poor global context, and boundary artifacts
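A minimal PyTorch-style sketch of the sliding-window procedure above (the `classify_patch` model and window size are hypothetical); the nested loops make the redundant per-pixel computation obvious:
```python
import torch
import torch.nn.functional as F

def sliding_window_segmentation(image, classify_patch, window=65):
    """Label every pixel by classifying the window centered on it.

    `classify_patch` is a hypothetical per-patch classifier returning a class id.
    """
    C, H, W = image.shape
    pad = window // 2
    padded = F.pad(image, (pad, pad, pad, pad))        # zero-pad height and width
    labels = torch.zeros(H, W, dtype=torch.long)
    for y in range(H):
        for x in range(W):
            patch = padded[:, y:y + window, x:x + window]
            labels[y, x] = classify_patch(patch)       # one full forward pass per pixel -> redundant
    return labels                                      # dense label map, same H x W as the input
```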

List two major inefficiencies/bottlenecks of R-CNN and explain how Fast R-CNN addresses each one (be specific about shared feature computation and RoI processing). Mention any remaining bottleneck that Fast R-CNN does not remove.
R-CNN bottlenecks: (i) runs a full CNN forward pass per region proposal → repeated computation; (ii) multi-stage pipeline (separate SVM classifiers and bbox regressors, cached features) → slow training and storage overhead
Fast R-CNN fixes: (i) compute one convolutional feature map for the whole image and use RoI pooling to extract per-proposal features → shared computation; (ii) train a single end-to-end network with softmax classification and bbox regression → faster and simpler training/inference
Remaining bottleneck: region proposals still come from an external method (e.g., selective search), which Fast R-CNN does not remove (later addressed by Faster R-CNN's RPN)
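A sketch of the shared-computation idea, assuming PyTorch/torchvision are available and using a ResNet-18 backbone with made-up proposal boxes purely for illustration:
```python
import torch
import torchvision
from torchvision.ops import roi_pool

# One convolutional forward pass over the whole image, then per-proposal
# features are cut out of that shared feature map with RoI pooling.
backbone = torchvision.models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

image = torch.randn(1, 3, 512, 512)
feature_map = feature_extractor(image)            # (1, 512, 16, 16), overall stride 32

# Proposals as (batch_index, x1, y1, x2, y2) in image coordinates; in a real
# pipeline these come from selective search (or an RPN in Faster R-CNN).
proposals = torch.tensor([[0, 32.0, 48.0, 200.0, 260.0],
                          [0, 300.0, 100.0, 480.0, 400.0]])
roi_features = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 32)
print(roi_features.shape)                         # torch.Size([2, 512, 7, 7]) -> shared FC head
```
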
Write a squared-error loss suitable for training an autoencoder
L(θ, Φ) = ‖x − gΦ(fθ(x))‖², averaged over the training batch

State the core idea of an autoencoder and briefly explain one application (ex: denoising, dimensionality reduction, representation learning, anomaly detection, super-resolution)
Core Idea: learn an encoder fθ: x → z and a decoder gΦ: z → x̂ that reconstructs x; the latent code z is a compact representation of the input
Denoising example: train on (xnoisy, x) pairs; at test time, gΦ(fθ(xnoisy)) removes noise
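A minimal PyTorch sketch of this idea, with a toy encoder/decoder and the denoising setup described above (architecture sizes are arbitrary assumptions):
```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Toy autoencoder: encoder f_theta maps x -> z, decoder g_phi maps z -> x_hat."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)        # compact latent code
        return self.decoder(z)     # reconstruction x_hat

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                      # clean inputs (dummy data)
x_noisy = x + 0.1 * torch.randn_like(x)      # corrupted views for the denoising task

x_hat = model(x_noisy)
loss = ((x_hat - x) ** 2).mean()             # squared-error reconstruction loss
loss.backward()
optimizer.step()
```
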
Given a set of token vectors, list the steps to compute self attention: obtaining Q, K, V; computing scores; (optional) scaling/masking; softmax; and weighted sums
Compute Q = XWQ, K = XWK, V = XWV from input X
Scores S = QKᵀ; optionally scale by 1/√dk and apply a mask
Weights A = softmax(S) (row-wise)
Output Z = AV; for multi-head, concatenate head outputs then apply WO
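The same steps as a single-head PyTorch sketch (the projection matrices are plain tensors here; the multi-head split and W_O output projection are omitted):
```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V, mask=None):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # 1) project tokens to Q, K, V
    d_k = Q.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k ** 0.5         # 2) scaled dot-product scores
    if mask is not None:
        S = S.masked_fill(mask == 0, float("-inf"))  # 3) optional masking
    A = F.softmax(S, dim=-1)                         # 4) row-wise softmax weights
    return A @ V                                     # 5) weighted sum of values

X = torch.randn(5, 16)                               # 5 tokens, d_model = 16
W_Q, W_K, W_V = (torch.randn(16, 16) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)                 # (5, 16)
```
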
What scale factor is applied to the dot products and why is it needed?
The scale factor: 1/√dk
Reason: dot products grow in magnitude with dk, so scaling prevents softmax saturation and stabilizes gradients
Give two ways to feed images into a transformer (Ex: ViT patches or CNN features as tokens)
ViT: split images into patches, flatten/linearly project to tokens (add [CLS] and positional encodings)
CNN features: run a CNN backbone and treat each spatial cell of its feature map as a token (hybrid approach)
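A sketch of the ViT-style tokenization (patch size, model width, and image size are illustrative assumptions):
```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch, d_model = 16, 192

# Cut the image into non-overlapping 16x16 patches and flatten each one.
to_patches = nn.Unfold(kernel_size=patch, stride=patch)
tokens = to_patches(img).transpose(1, 2)                  # (1, 196, 3*16*16)

# Linear projection of flattened patches to token embeddings.
tokens = nn.Linear(3 * patch * patch, d_model)(tokens)    # (1, 196, 192)

# Prepend a learnable [CLS] token and add positional embeddings.
cls = nn.Parameter(torch.zeros(1, 1, d_model))
pos = nn.Parameter(torch.zeros(1, 196 + 1, d_model))
tokens = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) + pos   # (1, 197, 192)
```
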
What is the purpose of multi-head attention?
To allow the model to attend to different positions/relations and subspaces in parallel, thereby improving overall modeling capacity
Write the summary equation for scaled dot-product attention Attention(Q, K, V)
Attention(Q, K, V) = softmax(QKᵀ/√dk) V

Explain how positional encoding provides order information
Transformers process tokens in parallel → order is lost → generate a unique positional encoding (PE) for each position → add the PE to the token embedding → the model now has access to order
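A sketch of the classic sinusoidal positional encoding, which gives every position a unique pattern that is simply added to the embeddings (sizes are arbitrary):
```python
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)     # even dims get sine
    pe[:, 1::2] = torch.cos(angles)     # odd dims get cosine
    return pe

embeddings = torch.randn(10, 64)                    # 10 tokens, d_model = 64 (dummy)
embeddings = embeddings + sinusoidal_pe(10, 64)     # order information injected here
```
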
Why is the dot product commonly used as a similarity score in attention?
it is efficient, differentiable, and aligns with cosine similarity after normalization; larger values reflect higher alignment
Define a pretext task and a downstream task. Why can solving the former help the latter?
Pretext: self-supervised task on unlabeled data to learn generalizable features
Downstream: target task using learned features
Solving pretext tasks shapes representations useful for downstream tasks. It’s kinda like building the foundation before building the house.
Briefly describe three common pretext tasks: rotation prediction, relative patch location/jigsaw, inpainting/colorization
Rotation prediction: classify rotation angle
Relative patch location/jigsaw: predict spatial relations/permutations among patches
Inpainting/colorization: reconstruct missing regions or color channels from context
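As a concrete example, a sketch of how rotation prediction manufactures labels for free from unlabeled images (assuming PyTorch tensors in NCHW layout):
```python
import torch

def rotation_pretext_batch(images):                     # images: (N, C, H, W)
    views, labels = [], []
    for k in range(4):                                   # rotate by k * 90 degrees
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)           # train a 4-way classifier on these

x = torch.randn(8, 3, 32, 32)
views, labels = rotation_pretext_batch(x)                # (32, 3, 32, 32), (32,)
```
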
Explain the basic idea of contrastive representation learning and what the InfoNCE loss optimizes with 1 positive and N−1 negatives. Why do more negatives generally tighten the mutual-information lower bound and improve performance?
Idea: learn an encoder so embeddings of a positive pair (same instance, different views) are close while negatives are apart
InfoNCE: Li = −log( exp(sim(zi, zj)/τ) / Σk∈{j}∪Ni exp(sim(zi, zk)/τ) ), with one positive j for anchor i and negatives Ni
More negatives make the denominator a harder normalizer, encouraging discrimination and tightening the lower bound on mutual information (I ≥ log N − LInfoNCE)
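A sketch of the InfoNCE loss for a single anchor (cosine similarity and the temperature τ are assumptions consistent with the SimCLR-style setup below):
```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, z_negatives, tau=0.1):
    """1 positive, N-1 negatives: -log of the softmax probability given to the positive."""
    z_a = F.normalize(z_anchor, dim=-1)        # (d,)
    z_pos = F.normalize(z_positive, dim=-1)    # (d,)
    z_neg = F.normalize(z_negatives, dim=-1)   # (N-1, d)
    logits = torch.cat([(z_a * z_pos).sum(-1, keepdim=True),   # positive similarity
                        z_neg @ z_a]) / tau                    # negative similarities
    return -F.log_softmax(logits, dim=0)[0]

loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(255, 128))
```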

In SimCLR: define a positive pair, list two strong image augmentations used to create views, explain the role of the projection head, and in one sentence each state the role of the temperature parameter and why larger batch sizes help
Positive pair: two augmented views of the same image
Two strong augmentations: random resized crop; color jitter
Projection head: MLP mapping representation h to z for the contrastive loss; improves invariance without hurting h (used for downstream)
Temperature: scales the logits; lower τ sharpens the softmax, emphasizing hard negatives, while higher τ smooths it
Large batches help: provide many in-batch negatives per step, strengthening the contrastive signal
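For reference, a representative SimCLR-style augmentation pipeline built from torchvision transforms (the specific parameter values here are illustrative, not the paper's exact recipe):
```python
from torchvision import transforms

simclr_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
# view_1, view_2 = simclr_aug(img), simclr_aug(img)   # a positive pair from one image
```
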
Give two key differences between SimCLR and MoCo, focusing on how negatives are obtained and the encoder(s) and update rules. State one practical implication of these differences
Negatives: SimCLR uses in-batch negatives; MoCo uses a queue/memory bank of keys, decoupling negatives from batch size
Encoders/updates: SimCLR has a single encoder (two views share weights); MoCo uses a query encoder and key encoder updated by momentum (EMA) for consistency
Implication: MoCo attains many negatives with small batches (memory efficient); SimCLR typically benefits from very large batches (more GPU memory)
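A sketch of the MoCo-style momentum (EMA) update for the key encoder, assuming two PyTorch encoders of identical architecture:
```python
import copy
import torch
import torchvision

encoder_q = torchvision.models.resnet18(weights=None)   # query encoder (trained by backprop)
encoder_k = copy.deepcopy(encoder_q)                     # key encoder (EMA copy, no gradients)
for p in encoder_k.parameters():
    p.requires_grad = False

m = 0.999   # momentum coefficient

@torch.no_grad()
def momentum_update():
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data   # slow, consistent key encoder
```
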
Compare a standard autoencoder and VAE in terms of latent regularization and generative capability. What is the reparameterization trick and why does it enable backpropagation
AE: deterministic z; no explicit latent prior; good reconstructions but weak generative sampling from random z
VAE: encoder outputs (µ, σ) of qϕ(z|x); loss combines reconstruction and KL divergence to a prior (often N (0, I)); yields smooth, continuous latent spaces with generative capability
Reparameterization: sample ϵ ∼ N(0, I) and set z = µ + σ ⊙ ϵ to make sampling differentiable for backprop
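A minimal sketch of the reparameterization trick and the KL term (dimensions are arbitrary):
```python
import torch

mu = torch.randn(16, 8, requires_grad=True)       # encoder outputs (dummy values)
log_var = torch.randn(16, 8, requires_grad=True)

sigma = torch.exp(0.5 * log_var)
eps = torch.randn_like(sigma)                     # all randomness isolated in eps ~ N(0, I)
z = mu + sigma * eps                              # differentiable w.r.t. mu and sigma

kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp())   # KL( q(z|x) || N(0, I) )
```
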
Describe how a GAN is trained (generator vs discriminator updates; objective type). List two common drawbacks of GANs
Training: discriminator D learns to classify real vs generated; generator G learns to fool D via a non-saturating objective, alternating updates
Two drawbacks: mode collapse (generator covers only a few modes); training instability/sensitivity to hyperparameters (and no explicit likelihood)
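A single alternating training step as a sketch, with toy MLPs for G and D and the non-saturating generator loss (sizes and hyperparameters are placeholders):
```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)        # dummy "real" batch
z = torch.randn(32, 64)

# 1) Discriminator update: push real -> 1, generated -> 0 (generator frozen via detach).
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
opt_D.zero_grad()
d_loss.backward()
opt_D.step()

# 2) Generator update (non-saturating): make D assign "real" to generated samples.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_G.zero_grad()
g_loss.backward()
opt_G.step()
```
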
Explain the idea of fully convolutional networks for segmentation.
The idea: Replace fully connected layers with convolutional layers
Why: so that we keep spatial feature maps
A traditional CNN with FC layers takes a 2D feature map and flattens it into a 1D vector. This destroys spatial information
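A sketch of the contrast: a 1×1 convolution acts as the classifier at every spatial location, so the output stays a score map that can be upsampled back to input resolution (sizes assumed for illustration):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

features = torch.randn(1, 512, 16, 16)        # backbone feature map (dummy), input was 512x512
num_classes = 21

classifier = nn.Conv2d(512, num_classes, kernel_size=1)    # "fully convolutional" classifier
scores = classifier(features)                               # (1, 21, 16, 16) per-location class scores

# Flattening into nn.Linear(512*16*16, num_classes) would instead collapse the map
# to a single vector and throw away the spatial layout.

upsampled = F.interpolate(scores, size=(512, 512), mode="bilinear", align_corners=False)
pred = upsampled.argmax(dim=1)                              # (1, 512, 512) dense label map
```
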
What is the role of downsampling and upsampling?
Role of downsampling: increases receptive field and decreases memory/compute
Role of upsampling: restores resolution
Why is operation at full resolution so expensive?
Because convolutions must be applied to large feature maps with many channels at every pixel, which is expensive in both memory and compute
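A quick back-of-the-envelope calculation (layer sizes assumed for illustration) of why full-resolution feature maps are costly:
```python
# One float32 activation map at full resolution:
H, W, C = 512, 512, 256
bytes_per_map = H * W * C * 4          # 4 bytes per float32
print(bytes_per_map / 2**20)           # 256.0 MiB for a single layer of a single image
```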