What is the purpose of Self-Attention?
To incorporate information from the other inputs in order to produce contextualized input representations.
What are the three projections applied to the input embedding called?
Query, key, and value.
What are the weights in Self-Attention?
The matrices for the key, query and value projections (one for each projection)
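A minimal sketch of these learned projection weights, assuming PyTorch and a hypothetical model size (d_model = 512):

```python
import torch
import torch.nn as nn

d_model = 512  # hypothetical embedding size

# The learnable weights of self-attention: one projection matrix per role
W_q = nn.Linear(d_model, d_model, bias=False)  # query projection
W_k = nn.Linear(d_model, d_model, bias=False)  # key projection
W_v = nn.Linear(d_model, d_model, bias=False)  # value projection

x = torch.randn(1, 10, d_model)    # (batch, seq_len, d_model) input embeddings
q, k, v = W_q(x), W_k(x), W_v(x)   # projected queries, keys, and values
```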
A Transformer combines several self-attention outputs.
True
The Transformer architecture does not work on sequence inputs.
False
In a Transformer, the complexity for each layer is quadratic with respect to the input sequence length.
True
During training, Self-Attention is fast since all inputs can be processed in parallel.
True
What is the first step in the Transformer model?
Convert input tokens into dense vector representations (embeddings).
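A minimal sketch of this embedding step, assuming PyTorch and hypothetical vocabulary/model sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 512  # hypothetical sizes

embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[5, 102, 7, 2044]])  # (batch, seq_len) integer token ids
x = embedding(token_ids)                       # (batch, seq_len, d_model) dense vectors
```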
Why is positional encoding added to input embeddings?
To provide information about the order of tokens (since Transformers process tokens in parallel).
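One common choice is the fixed sinusoidal encoding from the original Transformer paper; a sketch in PyTorch (the function name and sizes are illustrative, and d_model is assumed even):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos encodings, one row per position."""
    pos = torch.arange(seq_len).unsqueeze(1).float()  # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

# Added to the token embeddings so each position becomes distinguishable:
# x = x + sinusoidal_positional_encoding(x.size(1), x.size(2))
```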
What does the self-attention mechanism do?
Computes attention scores to determine how much each token should focus on other tokens.
What are the three vectors in self-attention?
Query (Q), Key (K), and Value (V). Q and K compute attention scores; V carries the actual information.
How are attention scores computed?
Dot product of Query (Q) and Key (K), scaled by √(d_k), followed by softmax.
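A minimal sketch of this scaled dot-product attention in PyTorch (the function name is illustrative; Q, K, V are assumed to have shape (batch, seq_len, d_k)):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted sum of value vectors
```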
What is the purpose of multi-head attention?
To capture different types of relationships (e.g., syntactic, semantic) between tokens.
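A sketch using PyTorch's built-in multi-head attention module, with hypothetical sizes (d_model = 512, 8 heads):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # hypothetical sizes; d_model must be divisible by n_heads

# Splits Q, K, V into n_heads subspaces, runs attention in each,
# then concatenates the heads and projects the result back to d_model
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x
```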
What happens after self-attention?
The output is passed through a feed-forward neural network (FFN) for further processing.
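A minimal sketch of this position-wise FFN, assuming PyTorch, a ReLU activation, and hypothetical sizes (d_model = 512, d_ff = 2048):

```python
import torch.nn as nn

d_model, d_ff = 512, 2048  # hypothetical sizes; the hidden size d_ff is usually larger

# Position-wise feed-forward network: applied to every token independently
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
```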
Why are residual connections used?
To stabilize training and improve gradient flow by adding the input to the output of each sub-layer.
What is layer normalization used for?
To normalize the outputs of each sub-layer, improving training stability.
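A minimal sketch of how a sub-layer is wrapped with a residual connection and layer normalization (post-norm style, as in the original paper; the helper name is illustrative):

```python
import torch
import torch.nn as nn

d_model = 512  # hypothetical size
norm = nn.LayerNorm(d_model)

def sublayer_block(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return norm(x + sublayer(x))

# e.g. x = sublayer_block(x, ffn)  where ffn is a sub-layer such as the FFN sketched above
```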
What are the two main components of a Transformer?
Encoder (processes input) and Decoder (generates output).
What does the encoder stack consist of?
Multiple layers of self-attention and FFN, with residual connections and layer normalization.
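A sketch of such an encoder stack using PyTorch's built-in modules, with hypothetical sizes:

```python
import torch.nn as nn

d_model, n_heads, d_ff, n_layers = 512, 8, 2048, 6  # hypothetical sizes

# Each encoder layer = self-attention + FFN, each wrapped with residual + LayerNorm
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
```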
What does the decoder stack consist of?
Multiple layers of masked self-attention, encoder-decoder attention, and FFN, with residuals and layer normalization.
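A sketch of a decoder stack plus the causal mask used by masked self-attention, assuming PyTorch and hypothetical sizes:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 512, 8, 6, 10  # hypothetical sizes

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

# Causal mask: position i may only attend to positions <= i (additive -inf above the diagonal)
causal_mask = torch.triu(torch.ones(seq_len, seq_len) * float("-inf"), diagonal=1)

# out = decoder(tgt, memory, tgt_mask=causal_mask)
#   tgt:    embedded target tokens (masked self-attention runs over these)
#   memory: encoder outputs (consumed by the encoder-decoder attention sub-layer)
```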
How does the Transformer generate output?
The final decoder output is passed through a linear layer and softmax to predict the next token.
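A minimal sketch of this output step, assuming PyTorch and hypothetical sizes (lm_head is an illustrative name for the final linear layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 30000, 512  # hypothetical sizes

lm_head = nn.Linear(d_model, vocab_size)   # final linear layer over the vocabulary

decoder_out = torch.randn(1, 10, d_model)  # stand-in for the last decoder layer's output
logits = lm_head(decoder_out)              # (batch, seq_len, vocab_size)
probs = F.softmax(logits, dim=-1)          # probability of each candidate next token
next_token = probs[:, -1].argmax(dim=-1)   # greedy choice at the last position
```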
How is the Transformer trained?
Using a loss function (e.g., cross-entropy) to minimize the difference between predicted and actual outputs.
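A minimal sketch of the cross-entropy objective, assuming PyTorch; the tensors here are random stand-ins for model predictions and gold labels:

```python
import torch
import torch.nn as nn

vocab_size = 30000                    # hypothetical size
criterion = nn.CrossEntropyLoss()     # takes raw logits, applies softmax internally

logits = torch.randn(2, 10, vocab_size, requires_grad=True)  # stand-in predictions
targets = torch.randint(0, vocab_size, (2, 10))               # ground-truth next tokens

# Flatten (batch, seq_len, vocab) -> (batch*seq_len, vocab) for the loss
loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()  # gradients would then drive an optimizer step
```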
Why are Transformers faster than RNNs?
They process all tokens in parallel, unlike RNNs, which process tokens sequentially.