Transformers / Self-attention


22 Terms

1
New cards

What is the purpose of Self-Attention?

To incorporate information from the other inputs in order to produce contextualized input representations
2
New cards

What are the three projections of the input embedding called?

Key, query, and value
3
New cards

What are the weights in Self-Attention?

The matrices for the key, query and value projections (one for each projection)

4
New cards

A Transformer combines several self-attention outputs.

True

5
New cards

The Transformer architecture does not work on sequence inputs.

False

6
New cards

In a Transformer, the complexity for each layer is quadratic with respect to the input sequence length.

True

7
New cards

During training, Self-Attention is fast since all inputs can be processed in parallel.

True

8
New cards

What is the first step in the Transformer model?

Convert input tokens into dense vector representations (embeddings).

9
New cards

Why is positional encoding added to input embeddings?

To provide information about the order of tokens (since Transformers process tokens in parallel).
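
A minimal NumPy sketch of one common positional-encoding scheme, the sinusoidal encoding from the original Transformer paper (the card doesn't fix a particular scheme; learned position embeddings are another option). Assumes an even `d_model`:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```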

10
New cards

What does the self-attention mechanism do?

Computes attention scores to determine how much each token should focus on other tokens.

11
New cards

What are the three vectors in self-attention?

Query (Q), Key (K), and Value (V). Q and K compute attention scores; V carries the actual information.

12
New cards

How are attention scores computed?

Dot product of Query (Q) and Key (K), divided by √(d_k), followed by a softmax over the keys.
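
The computation on this card can be sketched in NumPy (an illustration, not a library API; subtracting the row maximum before the softmax is a standard numerical-stability step not mentioned on the card):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns the (n, d_v) contextualized
    outputs and the (n, n) attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) raw scores
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights
```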

13
New cards

What is the purpose of multi-head attention?

To capture different types of relationships (e.g., syntactic, semantic) between tokens.
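
A compact NumPy sketch of multi-head attention over one sequence: each projection is split into heads, attention runs per head, and the heads are concatenated and projected back. The weight names `Wq`, `Wk`, `Wv`, `Wo` are illustrative, not from the card:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq, d_model); each W*: (d_model, d_model). Returns (seq, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads

    def heads(w):  # project, then split into (num_heads, seq, d_head)
        return (x @ w).reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = heads(Wq), heads(Wk), heads(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    out = softmax(scores) @ V                             # (heads, seq, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq, d_model) # concatenate heads
    return concat @ Wo                                    # final projection
```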

14
New cards

What happens after self-attention?

The output is passed through a feed-forward neural network (FFN) for further processing.

15
New cards

Why are residual connections used?

To stabilize training and improve gradient flow by adding the input to the output of each sub-layer.

16
New cards

What is layer normalization used for?

To normalize the outputs of each sub-layer, improving training stability.
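
Cards 15 and 16 together describe the `LayerNorm(x + Sublayer(x))` pattern (the post-norm arrangement of the original Transformer). A minimal NumPy sketch, omitting the learnable gain and bias that real implementations add:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row of x to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, f):
    """Apply sub-layer f (e.g. self-attention or the FFN) with a residual
    connection, then layer-normalize: LayerNorm(x + f(x))."""
    return layer_norm(x + f(x))
```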

17
New cards

What are the two main components of a Transformer?

Encoder (processes input) and Decoder (generates output).

18
New cards

What does the encoder stack consist of?

Multiple layers of self-attention and FFN, with residual connections and layer normalization.

19
New cards

What does the decoder stack consist of?

Multiple layers of masked self-attention, encoder-decoder attention, and FFN, with residuals and layer normalization.

20
New cards

How does the Transformer generate output?

The final decoder output is passed through a linear layer and softmax to predict the next token.

21
New cards

How is the Transformer trained?

Using a loss function (e.g., cross-entropy) to minimize the difference between predicted and actual outputs.
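
The cross-entropy loss mentioned here, for a single next-token prediction, can be sketched as follows (illustrative; the log-sum-exp is written out with a max shift for numerical stability):

```python
import numpy as np

def cross_entropy(logits, target):
    """logits: (vocab,) unnormalized scores from the final linear layer;
    target: index of the true next token. Returns -log p(target)."""
    logits = logits - logits.max()                     # stability shift
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[target]
```

Training minimizes this quantity averaged over all predicted tokens.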

22
New cards

Why are Transformers faster than RNNs?

They process all tokens in parallel, unlike RNNs, which process tokens sequentially.