Week 7 - Transformers

16 Terms

1

Why do we need Transformers for language tasks?

Because they capture long‑range relationships between words and process the whole sentence in parallel, rather than reading one word at a time.

2

What is the main weakness of RNNs that Transformers fix?

RNNs process words one at a time (which is slow) and tend to forget long‑range information; Transformers attend to the whole sequence at once.

3

What is “attention” in a Transformer?

A way for the model to focus on the most important words when understanding a sentence.

4

Why is attention useful?

It helps the model figure out which words relate to each other, even if they’re far apart.

5

What are Query, Key, and Value vectors?

Three learned vector projections of each word: a word's Query is compared with every word's Key to score how strongly it should attend to each one, and those scores weight the Values that get combined into the output.
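A minimal NumPy sketch of how Query, Key, and Value vectors combine in scaled dot‑product attention; the function name, shapes, and random projection matrices are illustrative assumptions, not from the course material.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V have shape (seq_len, d_k); each row belongs to one word.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how relevant every word is to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted sum of the Value vectors

seq_len, d_k = 4, 8
x = np.random.randn(seq_len, d_k)                    # stand-in word embeddings
Wq, Wk, Wv = (np.random.randn(d_k, d_k) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                     # (4, 8): one refined vector per word
```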

6

What is multi‑head attention?

Multiple attention mechanisms running at the same time, each focusing on different patterns or relationships.
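A rough NumPy sketch of multi‑head attention; the head count, dimensions, and random weights are illustrative, and a real model also learns the projections and adds a final output projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):                        # each head gets its own Q/K/V projections
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))  # this head's own attention pattern
        head_outputs.append(weights @ V)
    return np.concatenate(head_outputs, axis=-1)      # concatenate heads back to d_model

out = multi_head_attention(np.random.randn(5, 16), num_heads=4)
print(out.shape)                                      # (5, 16)
```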

7

Why do Transformers need positional encoding?

Because they read all words at once, they need a way to know the order of the words.

8

What does positional encoding represent?

It gives each word a “position tag” so the model knows where it appears in the sentence.
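A sketch of the sinusoidal positional encoding from the original Transformer paper; the resulting vectors are added to the word embeddings so each word carries its position tag. The sizes here are just for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]                          # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                     # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16); added to the word embeddings before the first layer
```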

9

What do feed‑forward layers do in a Transformer?

They further transform each word's representation, position by position, after attention has mixed in information from the related words.
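A minimal sketch of the position‑wise feed‑forward block, assuming a ReLU activation and randomly initialized weights for illustration; the same two layers are applied to every word independently.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expand each word's vector, apply ReLU, then project back down.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 16, 64, 5
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (5, 16)
```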

10

Why do Transformers use residual connections?

To help information flow smoothly through the network: each layer's input is added back to its output, so nothing gets lost (and gradients don't vanish) as the network gets deep.
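A tiny sketch of a residual (skip) connection; here `sublayer` is a stand‑in for attention or the feed‑forward block, and the names are illustrative.

```python
import numpy as np

def with_residual(x, sublayer):
    # The sublayer's output is added to its input, so information can flow
    # straight through even if the sublayer changes the representation very little.
    return x + sublayer(x)

x = np.random.randn(5, 16)
out = with_residual(x, lambda h: 0.1 * h)   # stand-in for attention or feed-forward
print(out.shape)                            # (5, 16)
```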

11

What does layer normalization do?

It rescales each word's activations to a stable mean and variance, which keeps the numbers well‑behaved during training so the model learns better.
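A from‑scratch layer normalization sketch in NumPy; the epsilon value and parameter names follow the usual convention and are assumed here for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each word's feature vector to zero mean and unit variance,
    # then rescale and shift with the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(5, 16)
print(layer_norm(x, gamma=np.ones(16), beta=np.zeros(16)).shape)   # (5, 16)
```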

12

What does the final linear + softmax layer do?

It turns the model’s output into a probability distribution for the next word.
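A sketch of this final step under illustrative sizes: a linear layer maps the last hidden state to one score (logit) per vocabulary word, and softmax turns those scores into probabilities.

```python
import numpy as np

def next_word_distribution(h, W_vocab, b_vocab):
    logits = h @ W_vocab + b_vocab           # one score per word in the vocabulary
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax: probabilities that sum to 1

d_model, vocab_size = 16, 100
h = np.random.randn(d_model)                  # hidden state of the last position
probs = next_word_distribution(h, np.random.randn(d_model, vocab_size), np.zeros(vocab_size))
print(probs.sum())                            # 1.0
```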

13

What is “sampling” in language models?

Choosing the next word based on the probabilities the model predicts.
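A small sampling sketch with an assumed temperature parameter (an extra knob, not necessarily covered in the course): lower temperature makes the model pick high‑probability words more often, higher temperature adds variety.

```python
import numpy as np

def sample_next_word(probs, temperature=1.0, seed=None):
    rng = np.random.default_rng(seed)
    logits = np.log(probs) / temperature      # reshape the distribution
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)            # draw one word index at random

probs = np.array([0.6, 0.3, 0.1])             # model's probabilities for 3 candidate words
print(sample_next_word(probs, temperature=0.8))   # usually 0, sometimes 1 or 2
```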

14

What does GPT stand for?

Generative Pre‑trained Transformer.

15

What is a Vision Transformer (ViT)?

A Transformer that treats image patches like words so it can understand pictures.
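A sketch of the patch step that makes this possible: the image is cut into non‑overlapping patches and each patch is flattened into a vector, which the Transformer then treats like a word embedding. The image and patch sizes here are illustrative.

```python
import numpy as np

def image_to_patches(img, patch_size):
    # Split an image of shape (H, W, C) into non-overlapping patches
    # and flatten each patch into one vector.
    H, W, C = img.shape
    P = patch_size
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)      # (num_patches, patch_dim)

img = np.random.rand(32, 32, 3)
print(image_to_patches(img, patch_size=8).shape)   # (16, 192): 16 "words" of length 192
```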

16

What are multimodal Transformers like VATT used for?

They process video, audio, and text together in one unified model.