transformer is a
sequence-to-sequence
encoder-decoder model

Transformer is based on
"Attention"

What’s so cool about "Attention"?

ways to model attention mathematically
• Cross Attention
• Self Attention
• Multi-head Attention
• Masked Attention
cross attention layer

In attention, instead of only looking at the last hidden state, the model can:
“Look at all relevant parts of another sequence and decide what matters most right now.”
what is cross attention
Cross-attention = attention between TWO DIFFERENT sequences.
Example:
Source sequence = English sentence
Target sequence = French sentence being generated
When generating each target word, the model:
looks at all source words and decides which ones are important.
That is cross-attention.
| Type | Attends to |
|---|---|
| Self-attention | Same sequence |
| Cross-attention | Another sequence |
intuition of cross attention using RNN language
Intuition Using RNN Language
Think of this like:
Instead of the decoder using only:
h_t = RNN(h_{t-1}, x_t)
It also gets a context vector:
h_t = RNN(h_{t-1}, x_t, context_t)
And that context_t is computed by attention over the encoder states.
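As a rough NumPy sketch of that decoder step (the weight names and the dot-product scoring here are illustrative assumptions, not the exact lecture setup):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step_with_attention(h_prev, x_t, enc_states, W_h, W_x, W_c, b):
    # Attention over the encoder states: score each one against the previous
    # decoder state, softmax the scores, and take a weighted sum.
    scores = enc_states @ h_prev            # (m,)
    weights = softmax(scores)               # (m,)
    context_t = weights @ enc_states        # (d,)

    # The RNN update now also consumes context_t:
    # h_t = tanh(W_h h_{t-1} + W_x x_t + W_c context_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + W_c @ context_t + b)

# Toy usage with random placeholder weights.
d, m = 8, 5
rng = np.random.default_rng(0)
W_h, W_x, W_c = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
h_t = decoder_step_with_attention(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=(m, d)),
    W_h, W_x, W_c, np.zeros(d))
print(h_t.shape)  # (8,)
```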
the core idea of cross attention in one sentence
Cross-attention computes a weighted sum of another sequence’s representations, where the weights are decided by similarity.
the math of cross attention simple and clean
We have:
Query (Q): comes from decoder
Keys (K): come from encoder
Values (V): come from encoder
Q ∈ R^{n × d}
K ∈ R^{m × d}
V ∈ R^{m × d}
n = target length
m = source length
d = vector size
Step 1: Compute similarity scores
Scores = Q K^T
This gives:
Scores ∈ R^{n × m}
Step 2: Normalize with softmax
Weights = softmax(Q K^T / sqrt(d))
Step 3: Weighted sum of values
Output = Weights × V
Result:
Output ∈ R^{n × d}
cross attention layer slides



Notice that it’s a scaled dot product.
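A minimal NumPy sketch of exactly this computation (shapes as above; the toy inputs at the end are just random placeholders):

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product cross-attention.

    Q: (n, d) queries from the decoder (target side)
    K: (m, d) keys from the encoder (source side)
    V: (m, d) values from the encoder (source side)
    Returns: (n, d) context vectors, one per query.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (n, m) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # (n, d) weighted sum of values

# Toy shapes: 4 target positions attending over 6 source positions.
n, m, d = 4, 6, 8
rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(n, d)),
                      rng.normal(size=(m, d)),
                      rng.normal(size=(m, d)))
print(out.shape)  # (4, 8)
```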
permutation equivariance
Self-attention doesn’t care about the order of the input vectors!
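A quick self-contained NumPy check of that claim (using unprojected self-attention, i.e. Q = K = V = X, as a stand-in): permuting the input rows just permutes the output rows.

```python
import numpy as np

def self_attention(X):
    # Self-attention with Q = K = V = X (no learned projections, for the demo).
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
perm = rng.permutation(5)

# Permuting the input rows just permutes the output rows: equivariance.
print(np.allclose(self_attention(X)[perm], self_attention(X[perm])))  # True
```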
positional encoding
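The slide isn’t reproduced here, but a common fix for the order-blindness above is the sinusoidal positional encoding from the Transformer paper, added elementwise to the token embeddings; a minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16) -- added to the token embeddings
```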

multi head self attention layer
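A rough sketch of the idea, assuming h heads that each get their own Q/K/V projections, with the head outputs concatenated and linearly projected back (random matrices stand in for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    n, d_model = X.shape
    d_head = d_model // num_heads
    # Random matrices stand in for the learned projection weights.
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]  # (n, d_head) each
        A = softmax(Q @ K.T / np.sqrt(d_head))                 # (n, n) attention weights
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo                 # (n, d_model)

rng = np.random.default_rng(0)
out = multi_head_self_attention(rng.normal(size=(5, 16)), num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```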

semantic segmentation idea: fully convolutional

U-Net
downsample, upsample, but also have skip connections
learnable upsampling - transposed convolution
pretty much like a convolution run in reverse lmao?
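A tiny PyTorch sketch of that up path (assuming PyTorch is available; shapes are illustrative): a stride-2 ConvTranspose2d doubles the spatial size, and the skip connection is just a channel-wise concatenation with the matching encoder feature map.

```python
import torch
import torch.nn as nn

# Learnable upsampling: a stride-2 transposed convolution doubles H and W.
up = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=2, stride=2)

decoder_feat = torch.randn(1, 128, 16, 16)   # coming up from the bottleneck
encoder_feat = torch.randn(1, 64, 32, 32)    # saved on the way down (the skip)

upsampled = up(decoder_feat)                           # (1, 64, 32, 32)
merged = torch.cat([upsampled, encoder_feat], dim=1)   # (1, 128, 32, 32) skip connection
print(upsampled.shape, merged.shape)
```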


taxonomy of generative models

general idea of rectified flow matching

autoencoders