Lecture 11 Transformers

18 Terms

1
New cards

What kind of data can transformers handle

All kinds of data: images, video, text, etc. Especially powerful for NLP-related problems.

Encoder - used for classification, feature extraction

Decoder - used for text generation (LLMs)

2
New cards

token

token - chunk of text

common/short words = one token (it just depends on the dictionary)

long, less common words = multiple tokens

vocabulary built with byte-pair encoding (BPE)

3
New cards

byte pair encoding

starts with a character-level vocabulary and iteratively merges the most frequent pairs into new tokens until the desired vocabulary size is reached

frequent pairings are found from frequency tables; typical vocabulary size is ~30-100k tokens
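
A minimal sketch of the merging idea (illustrative only; real BPE tokenizers work on bytes with precomputed merge tables, and the text here is just a toy string):

from collections import Counter

def most_frequent_pair(tokens):
    # count adjacent pairs in the current token sequence
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # replace every occurrence of the pair with a single merged token
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")     # start from a character-level vocabulary
for _ in range(5):                    # keep merging until the desired vocab size
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)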

4
New cards

token input

each token is a one-hot vector of length N; stacking the n input tokens gives a matrix [n x N]

  • the 1 is at the position of the token's index in the dictionary (the token ID)

5
New cards

token vector

vector of the token IDs of the input, in order

(the token ID is the token's index in the dictionary)
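
A small sketch of the token vector and its one-hot matrix, assuming a toy 5-token dictionary:

import numpy as np

N = 5                                   # toy dictionary size
token_ids = np.array([2, 0, 3])         # token vector: n = 3 token IDs, in input order
X = np.eye(N)[token_ids]                # one-hot input matrix, shape [n x N]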

6
New cards

Transformer

encoder + decoder + self attention

autoregressive model

  • n tokens in, one token out

    • output used as part of next iteration input

probabilistic

  • uses softmax to return a probability distribution over the possible tokens and outputs the most probable one
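
A schematic of the autoregressive loop; model here is a hypothetical function returning a softmax distribution over the N dictionary tokens:

import numpy as np

def generate(model, tokens, n_steps):
    # n tokens in, one token out; the output becomes part of the next input
    for _ in range(n_steps):
        probs = model(tokens)               # softmax distribution over the N dictionary tokens
        next_id = int(np.argmax(probs))     # take the most probable token
        tokens = tokens + [next_id]         # feed it back in on the next iteration
    return tokens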

7
New cards

Self attention - basic parts

Data tensor X [n x N] →

(in practice the data is 3D: [n x N x B], where B is the batch size)

The projection matrices are learned through backpropagation:

  • Query Q (what I'm looking for)

    • Q = X·W_Q [n x d_Q]

  • Key K (what I contain)

    • K = X·W_K [n x d_K]

  • Value V (actual content of X that will be passed forward)

    • V = X·W_V [n x d_V]

  • W_{Q,K,V} = linear projections [N x d_{Q,K,V}] (transformations), learned so that Q, K, V are each different
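
A minimal numpy sketch of the projections; the random matrices stand in for the learned W_Q, W_K, W_V:

import numpy as np

n, N, d = 4, 10, 8                           # tokens, dictionary size, embedding size
X = np.eye(N)[np.random.randint(0, N, n)]    # one-hot input, [n x N]

W_Q = np.random.randn(N, d)                  # learned projections (random stand-ins)
W_K = np.random.randn(N, d)
W_V = np.random.randn(N, d)

Q = X @ W_Q   # "what I'm looking for", [n x d]
K = X @ W_K   # "what I contain",       [n x d]
V = X @ W_V   # content passed forward, [n x d]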

8
New cards

notations

n = # of tokens

N = dictionary size

hyperparameters

d = embedding size (the dimension used for Q, K, V); larger d means more information captured per token

H = # of attention heads (similar to the # of filters in a CNN)

9
New cards

Self-attention operations

  1. Compute the scores between the different input vectors: S = Q·K^T [n x n]

    1. (in the smoothie example, this is how much you like each fruit)

  2. Add the mask

  3. Normalize the scores for gradient stability: S_n = S/sqrt(d_K) [n x n], where d_K = the key embedding size

  4. Apply softmax to turn the scores into probabilities (attention weights): P = softmax(S_n) [n x n]

    1. P is the correlation matrix of the input tokens, with values in [0, 1] because of the softmax

  5. Obtain the weighted value matrix: Z = P·V [n x d]

    1. important because it integrates only the most relevant parts of the context for each token, filtering out noise

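Putting the steps above together in a small numpy sketch (the padding mask is covered in the next card):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
    S = Q @ K.T                          # 1. scores between input vectors, [n x n]
    if mask is not None:
        S = S + mask                     # 2. add the mask (-inf on padded positions)
    S_n = S / np.sqrt(K.shape[-1])       # 3. normalize by sqrt(d_K) for gradient stability
    P = softmax(S_n)                     # 4. attention weights, each row sums to 1
    Z = P @ V                            # 5. weighted value matrix, [n x d]
    return Z, P
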
10
New cards

Self-attention masking

Padding is added to bring each input up to the maximum input length, and the padding must be masked so it is not considered in the calculations.

The mask is added before normalizing / applying softmax:

  • -infinity for padding tokens (becomes 0 after the softmax)

  • 0 for actual input tokens
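
A sketch of building such a mask for an input padded from 3 real tokens up to length 5:

import numpy as np

n, n_real = 5, 3                           # padded length, number of real tokens
is_pad = np.arange(n) >= n_real            # [False, False, False, True, True]
mask = np.where(is_pad, -np.inf, 0.0)      # -inf for padding tokens, 0 for real tokens
mask = np.broadcast_to(mask, (n, n))       # added to the [n x n] score matrix before softmax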

11
New cards

Multi-head attention (MHA)

allows for attention between multiple words at once

MHA output [n x d·H] - attention output for each token

  • basically H self-attention heads concatenated together (each with its own Q, K, V)

  • each head attends to a different dimension (feature), so the heads are not redundant and not analyzing the same data
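
A compact sketch of the concatenation; heads is a hypothetical list of H (W_Q, W_K, W_V) triples, one per head:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    # heads: list of H (W_Q, W_K, W_V) triples, one per attention head
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        P = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # attention weights, [n x n]
        outputs.append(P @ V)                         # one head's output, [n x d]
    return np.concatenate(outputs, axis=-1)           # concatenated heads, [n x d*H]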

12
New cards

Transformer architecture

input → embedding + position encoding → encoder stack → decoder → linear → softmax → output

13
New cards

How input / output works in transformers

all tokens in, one token out; the output then becomes part of the next input

training still minimizes a loss (cross-entropy), but the target output is not isolated as a separate label; it is included in the input, unlike in other models

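A toy illustration of the cross-entropy loss for a single predicted token (probabilities and vocabulary size are made up):

import numpy as np

probs = np.array([0.1, 0.7, 0.2])      # softmax output over a toy 3-token dictionary
target_id = 1                          # true next-token ID from the training data
loss = -np.log(probs[target_id])       # cross-entropy for this prediction step
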
14
New cards

Encoder

input →

embedding + position encoding →

self-attention (produces the weighted V) → residual connection (skip connection) →

layer norm →

feed forward →

  • 2 hidden layers with ReLU (a 2-hidden-layer MLP)

layer norm → decoder

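A compact sketch of one encoder block following this flow; self_attn is a placeholder for the self-attention sketched earlier, and the feed-forward weights are arbitrary stand-ins:

import numpy as np

def layer_norm(x, eps=1e-4):
    # normalize each row by its own mean and std (see the Layer normalization card)
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_block(X_emb, self_attn, W1, b1, W2, b2, W3, b3):
    # self-attention produces the weighted V; add the residual (skip) connection
    Z = self_attn(X_emb)
    h = layer_norm(X_emb + Z)                 # layer norm after attention + residual
    # feed forward: 2 hidden layers with ReLU (a 2-hidden-layer MLP)
    h1 = np.maximum(0, h @ W1 + b1)
    h2 = np.maximum(0, h1 @ W2 + b2)
    out = layer_norm(h2 @ W3 + b3)            # final layer norm, then on to the decoder
    return out
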
15
New cards

Layer normalization

variant of batch normalization

  • normalizes using the mean and std row by row instead of over the whole matrix

  • when dividing by the std, a small constant (e.g. 0.0001) is added to avoid dividing by zero
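
A tiny numeric sketch contrasting row-by-row normalization with normalizing over the whole matrix:

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
eps = 1e-4   # small constant so we never divide by (nearly) zero

# layer norm: each row uses its own mean and std
row_norm = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)

# contrast: using the whole matrix's mean/std would mix the two rows' scales
whole_norm = (X - X.mean()) / (X.std() + eps)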

16
New cards

Input embedding

converts the raw data (one-hot vectors) → embeddings

encodes the meaning of the words

implemented as a linear projection module
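
A sketch showing that multiplying a one-hot vector by a (here random) embedding matrix is just a row lookup, i.e. a linear projection:

import numpy as np

N, d = 10, 4                      # dictionary size, embedding size
W_E = np.random.randn(N, d)       # learned embedding matrix (random stand-in)
one_hot = np.eye(N)[7]            # one-hot vector for token ID 7
emb = one_hot @ W_E               # linear projection == row lookup W_E[7]
assert np.allclose(emb, W_E[7])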

17
New cards

Sinusoidal position encoding

d = encoding dimension

p_t = the positional encoding of position t, with components

  • p_t(2k) = sin(ω_k · t) for even dimensions

  • p_t(2k+1) = cos(ω_k · t) for odd dimensions

  • ω_k = 1 / 10000^(2k/d)

the base of the exponent (commonly 10000) is a hyperparameter controlling the frequency range: larger = smoother but less informative encodings
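
A sketch of the encoding under the assumption that the base is 10000:

import numpy as np

def positional_encoding(n_positions, d, base=10000.0):
    P = np.zeros((n_positions, d))
    for t in range(n_positions):                 # position index
        for k in range(d // 2):                  # dimension-pair index
            w_k = 1.0 / base ** (2 * k / d)      # frequency for pair k
            P[t, 2 * k] = np.sin(w_k * t)        # even dimension: sine
            P[t, 2 * k + 1] = np.cos(w_k * t)    # odd dimension: cosine
    return P                                     # [n_positions x d], added to the embeddings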

18
New cards

Decoder attention

utilizes cross-attention - not the same as the self-attention used in the encoder

helps combine the input and target sequences

  • target language provides Q

  • input language provides K and V

triangular (causal) masking enforces the n-tokens-in, one-token-out autoregressive behavior

greedy decoding from the softmax: just choose the highest probability, get the token ID (dictionary index) and from it the token itself
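
A sketch of the triangular (causal) mask and a greedy decoding step, with toy values:

import numpy as np

n = 4
# triangular mask: each token may attend only to itself and earlier tokens (-inf above the diagonal)
causal_mask = np.where(np.triu(np.ones((n, n)), k=1) == 1, -np.inf, 0.0)

# greedy decoding: take the most probable token from the softmax output
vocab = ["the", "cat", "sat", "down"]           # toy dictionary
probs = np.array([0.1, 0.2, 0.6, 0.1])          # softmax output for the next token
token_id = int(np.argmax(probs))                # dictionary index (token ID)
print(token_id, vocab[token_id])                # -> 2 sat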