Robot Vision Test 2


26 Terms

1

Image Classification

just detects one label, no spatial extent

2

Semantic Segmentation

labels each pixel, outputs dense label map same spatial size as input

3

How is semantic segmentation different from instance segmentation?

Semantic segmentation only calls out differences between classes

Instance segmentation calls out differences between classes and also between separate instances of the same class

4

Object detection

predict a bounding box and class label for each object; outputs a set of (x, y, w, h) boxes plus class labels

5

Instance Segmentation

  • Combo of semantic and object

  • Most advanced

  • Analyzes every pixel → creates an exact border around each object → gives each object instance its own mask/label

  • outputs per object binary mask

6

Sliding Window Semantic Segmentation

  • Slide a fixed-size window over the image and classify what is inside each window (e.g., the label of its center pixel)

  • Keep moving the window until every pixel has been covered

  • Aggregate the per-window predictions into a dense label map of what is in the image (see the sketch below)

  • Major limitation: heavy redundant computation, poor global context & boundary artifacts

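A minimal sketch of the sliding-window idea above, assuming a hypothetical `classify_patch` model that labels the center pixel of each crop; it makes the per-window redundancy the card lists as the major limitation easy to see.

```python
import numpy as np

def sliding_window_segmentation(image, classify_patch, win=15):
    # image: (H, W, 3); classify_patch: crop -> integer class label (assumed model)
    H, W = image.shape[:2]
    pad = win // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    label_map = np.zeros((H, W), dtype=np.int64)
    for y in range(H):                      # one forward pass per pixel:
        for x in range(W):                  # heavy redundant computation
            crop = padded[y:y + win, x:x + win]
            label_map[y, x] = classify_patch(crop)
    return label_map                        # aggregated dense label map
```
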
7

List two major inefficiencies/bottlenecks of R-CNN and explain how Fast R-CNN addresses each one (be specific about shared feature computation and RoI processing). Mention any remaining bottleneck that Fast R-CNN does not remove.

  • R-CNN bottlenecks: (i) runs a full CNN forward pass per region proposal → repeated computation; (ii) multi-stage pipeline → slow training and storage overhead

  • Fast R-CNN fixes: (i) compute one convolutional feature map for the whole image and use RoI pooling to extract per-proposal features → shared computation (see the sketch below); (ii) train a single end-to-end network with softmax classification and bbox regression → faster and simpler training/inference

  • Remaining bottleneck: region proposals still come from an external method (e.g., selective search), which Fast R-CNN does not remove
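
A rough sketch (not the official Fast R-CNN code) of the shared-feature idea: one backbone pass per image, then a simple RoI pooling per proposal via adaptive max pooling. `backbone` and `proposals` are assumed inputs, and proposal generation itself is the remaining bottleneck noted above.

```python
import torch
import torch.nn.functional as F

def fast_rcnn_style_features(image, backbone, proposals, output_size=7, stride=16):
    # image: (1, 3, H, W); proposals: list of (x1, y1, x2, y2) boxes in image coordinates
    feat = backbone(image)                      # ONE shared feature map for the whole image
    roi_feats = []
    for (x1, y1, x2, y2) in proposals:
        # map image coords to feature-map coords (assumes total backbone stride `stride`)
        fx1, fy1 = int(x1 // stride), int(y1 // stride)
        fx2, fy2 = max(fx1 + 1, int(x2 // stride)), max(fy1 + 1, int(y2 // stride))
        region = feat[:, :, fy1:fy2, fx1:fx2]
        # RoI pooling: fixed-size output regardless of proposal size
        roi_feats.append(F.adaptive_max_pool2d(region, output_size))
    return torch.cat(roi_feats, dim=0)          # fed to shared softmax + bbox-regression heads
```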

8

Write a squared-error loss suitable for training an autoencoder

L(x) = ‖x − gΦ(fθ(x))‖²; over a dataset, minimize (1/N) Σᵢ ‖xᵢ − gΦ(fθ(xᵢ))‖²
9

State the core idea of an autoencoder and briefly explain one application (ex: denoising, dimensionality reduction, representation learning, anomaly detection, super-resolution)

  • Core Idea: learn an encoder fθ: x → z and a decoder gΦ: z → x̂ that reconstructs x; the latent z is a compact representation of the input

  • Denoising example: train on (x_noisy, x) pairs; at test time, gΦ(fθ(x_noisy)) removes noise (see the sketch below)
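
A tiny denoising-autoencoder sketch in PyTorch, assuming flattened 784-dimensional inputs and an illustrative latent size of 32; the (x_noisy, x) pairs and the squared-error loss follow the two cards above.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d_in=784, d_z=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_z))  # f_theta
        self.decoder = nn.Sequential(nn.Linear(d_z, 128), nn.ReLU(), nn.Linear(128, d_in))  # g_phi
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                    # clean batch (placeholder data)
x_noisy = x + 0.1 * torch.randn_like(x)    # corrupted view
loss = ((model(x_noisy) - x) ** 2).mean()  # squared-error reconstruction loss
opt.zero_grad(); loss.backward(); opt.step()
```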

10

Given a set of token vectors, list the steps to compute self-attention: obtaining Q, K, V; computing scores; (optional) scaling/masking; softmax; and weighted sums

  1. Compute Q = XW_Q, K = XW_K, V = XW_V from input X

  2. Scores S = QKᵀ; optionally scale by 1/√d_k and apply a mask before the softmax

  3. Weights A = softmax(S) (row-wise)

  4. Output Z = AV; for multi-head, concatenate head outputs then apply W_O (see the sketch below)
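
A NumPy sketch of the steps above for a single head; the weight matrices are random placeholders and the mask is optional.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, mask=None):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 1: projections
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                   # step 2: scores, scaled by 1/sqrt(d_k)
    if mask is not None:
        S = np.where(mask, S, -1e9)              # optional masking before softmax
    A = softmax(S, axis=-1)                      # step 3: row-wise softmax
    return A @ V                                 # step 4: weighted sum of values

n, d, d_k = 5, 16, 8                             # 5 tokens, model dim 16, head dim 8
X = np.random.randn(n, d)
W_Q, W_K, W_V = (np.random.randn(d, d_k) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)             # (5, 8) output
```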

11

What scale factor is applied to the dot products and why is it needed?

The scale factor: 1/√d_k, where d_k is the query/key dimension

Reason: it keeps dot-product magnitudes from growing with d_k, preventing softmax saturation and therefore stabilizing gradients

12

Give two ways to feed images into a transformer (Ex: ViT patches or CNN features as tokens)

ViT: split the image into patches, flatten/linearly project each patch to a token (add [CLS] and positional encodings); see the sketch below

CNN features: run a CNN backbone and treat each spatial location of its feature map as a token (again with positional encodings)
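
A sketch of the ViT-style patchify-and-project path, assuming 224×224 inputs, 16×16 patches, and an illustrative token dimension of 192; the CNN-features alternative would flatten a backbone's H×W feature grid into tokens the same way.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)             # batch of one image
patch, d_model = 16, 192
# split into non-overlapping 16x16 patches -> (1, 3, 14, 14, 16, 16)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # (1, 196, 768)
proj = nn.Linear(3 * patch * patch, d_model)  # linear projection of flattened patches
tokens = proj(patches)                        # (1, 196, 192) patch tokens
cls = nn.Parameter(torch.zeros(1, 1, d_model))        # learnable [CLS] token
pos = nn.Parameter(torch.zeros(1, 197, d_model))      # learnable positional encodings
tokens = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) + pos      # (1, 197, 192)
```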

13

What is the purpose of multi-head attention?

To allow the model to attend to different positions/relations and representation subspaces in parallel, therefore improving overall modeling capacity

14

Write the summary equation for scaled dot-product attention Attention(Q, K, V)

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
15

Explain how positional encoding provides order information

Input tokens → parallel processing → lost order → generate unique positional encoding (PE) → add PE to embedding → provides order → transformer understands order
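
The card doesn't specify which encoding, so here is a sketch of the common sinusoidal scheme from "Attention Is All You Need": each position gets a unique pattern of sines and cosines that is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                       # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions
    return pe

embeddings = np.random.randn(10, 64)                        # 10 tokens, d_model = 64
x = embeddings + sinusoidal_positional_encoding(10, 64)     # add PE -> order information
```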

16

Why is the dot product commonly used as a similarity score in attention?

it is efficient, differentiable, and aligns with cosine similarity after normalization; larger values reflect higher alignment

17

Define a pretext task and a downstream task. Why can solving the former help the latter?

  • Pretext: self-supervised task on unlabeled data to learn generalizable features

  • Downstream: target task using learned features

  • Solving pretext tasks shapes representations useful for downstream tasks. It’s kinda like building the foundation before building the house.

18

Briefly describe three common pretext tasks: rotation prediction, relative patch location/jigsaw, inpainting/colorization

  • Rotation prediction: classify rotation angle

  • Relative patch location/jigsaw: predict spatial relations/permutations among patches

  • Inpainting/colorization: reconstruct missing regions or color channels from context

19

Explain the basic idea of contrastive representation learning and what the InfoNCE loss optimizes with 1 positive and N−1 negatives. Why do more negatives generally tighten the mutual-information lower bound and improve performance?

  • Idea: learn an encoder so embeddings of a positive pair (same instance, different views) are close while negatives are apart

  • InfoNCE, with one positive j for anchor i and negatives N_i: L_i = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k ∈ {j} ∪ N_i} exp(sim(z_i, z_k)/τ) ]

  • More negatives make the denominator harder, encouraging discrimination, and tighten the mutual-information lower bound I ≥ log N − L_InfoNCE (see the sketch below)

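A NumPy sketch of the InfoNCE loss for one anchor, assuming cosine similarity and a temperature τ; `z_pos` is the positive view and `z_negs` the N−1 negatives.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(z_anchor, z_pos, z_negs, tau=0.1):
    # similarities to the positive (index 0) and to each negative, scaled by temperature
    logits = np.array([cosine(z_anchor, z_pos)] +
                      [cosine(z_anchor, z_n) for z_n in z_negs]) / tau
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())   # log-sum-exp over {positive} ∪ negatives
    return log_denom - logits[0]                       # = -log softmax probability of the positive

z_a, z_p = np.random.randn(128), np.random.randn(128)
negs = [np.random.randn(128) for _ in range(255)]      # more negatives -> harder denominator
loss = info_nce(z_a, z_p, negs)
```
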
20

In SimCLR: define a positive pair, list two strong image augmentations used to create views, explain the role of the projection head, and in one sentence each state the role of the temperature parameter and why larger batch sizes help

  • Positive pair: two augmented views of the same image

  • Two strong augmentations: random resized crop; color jitter

  • Projection head: MLP mapping representation h to z for the contrastive loss; improves invariance without hurting h (used for downstream)

  • Temperature: scales the logits; lower τ sharpens the softmax, emphasizing hard negatives; higher τ smooths it

  • Large batches help: provide many in-batch negatives per step, strengthening the contrastive signal
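
A hedged sketch of a SimCLR-style augmentation pipeline using standard torchvision transforms (exact parameters vary by implementation); applying it twice to the same image yields the two views of a positive pair.

```python
from torchvision import transforms

simclr_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),                                             # strong aug 1
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),   # strong aug 2
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# two augmented views of the same PIL image form a positive pair:
# view1, view2 = simclr_aug(img), simclr_aug(img)
```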

21

Give two key differences between SimCLR and MoCo, focusing on how negatives are obtained and the encoder(s) and update rules. State one practical implication of these differences

  • Negatives: SimCLR uses in-batch negatives; MoCo uses a queue/memory bank of keys, decoupling negatives from batch size

  • Encoders/updates: SimCLR has a single encoder (two views share weights); MoCo uses a query encoder and key encoder updated by momentum (EMA) for consistency

  • Implication: MoCo attains many negatives with small batches (memory efficient); SimCLR typically benefits from very large batches (more GPU memory)
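
A sketch of the MoCo-style momentum (EMA) key-encoder update and negative queue, assuming query/key encoders with identical architectures; not the official implementation.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # key encoder parameters track an exponential moving average of the query encoder
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)

@torch.no_grad()
def enqueue_dequeue(queue, new_keys, max_size=65536):
    # negatives come from this queue of past keys, decoupled from the batch size
    queue = torch.cat([queue, new_keys], dim=0)
    return queue[-max_size:]
```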

22

Compare a standard autoencoder and VAE in terms of latent regularization and generative capability. What is the reparameterization trick and why does it enable backpropagation?

  • AE: deterministic z; no explicit latent prior; good reconstructions but weak generative sampling from random z

  • VAE: encoder outputs (µ, σ) of qϕ(z|x); loss combines reconstruction and KL divergence to a prior (often N (0, I)); yields smooth, continuous latent spaces with generative capability

  • Reparameterization: sample ϵ ∼ N(0, I) and set z = µ + σ ⊙ ϵ to make sampling differentiable for backprop (see the sketch below)
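
A sketch of the reparameterization trick, assuming an encoder that outputs (mu, log_var); the sample is a deterministic function of mu and sigma plus external noise, so gradients flow back to the encoder.

```python
import torch

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I); differentiable w.r.t. mu and log_var
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(8, 16, requires_grad=True)
log_var = torch.zeros(8, 16, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()               # gradients reach mu and log_var through the sample
```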

23

Describe how a GAN is trained (generator vs discriminator updates; objective type). List two common drawbacks of GANs

  • Training: discriminator D learns to classify real vs generated; generator G learns to fool D via a non-saturating objective, alternating updates

  • Two drawbacks: mode collapse (generator covers few modes); training instability/sensitivity to hyperparameters (and no explicit likelihood)
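
A compact sketch of the alternating GAN updates with a non-saturating generator loss, assuming simple MLPs for G and D and placeholder "real" data.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.rand(32, 784)        # placeholder real batch
    fake = G(torch.randn(32, 64))

    # 1) discriminator update: push real -> 1, fake -> 0
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) generator update (non-saturating): make D label fakes as real
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```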

24

Explain the idea of fully convolutional networks for segmentation.

  • The idea: Replace fully connected layers with convolutional layers

  • Why: so that we keep spatial feature maps

  • A traditional CNN with FC layers takes a 2D feature map and flattens it into a 1D vector. This destroys spatial information
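
A minimal fully convolutional sketch tying this card to the next one: a small conv encoder that downsamples, a 1×1 conv in place of the FC classifier, and bilinear upsampling back to the input resolution; layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, n_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(                  # downsampling: bigger receptive field, less compute
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, n_classes, 1)  # 1x1 conv instead of FC: keeps the spatial map

    def forward(self, x):
        h = self.encoder(x)                            # (B, 64, H/4, W/4)
        logits = self.classifier(h)                    # (B, n_classes, H/4, W/4)
        # upsampling restores full-resolution per-pixel predictions
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

out = TinyFCN()(torch.randn(1, 3, 64, 64))             # (1, 21, 64, 64) dense label logits
```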

25

What is the role of downsampling and upsampling?

  • Role of downsampling: increases receptive field and decreases memory/compute

  • Role of upsampling: restores resolution

26

Why is operating at full resolution so expensive?

Because memory and compute scale with H × W × C at every layer: keeping many channels at full input resolution produces huge feature maps (e.g., a 256-channel map at 1024×1024 is ~268 million activations per layer), so FCNs downsample inside the network and upsample only at the end