sequence modeling
decoder-only, GPT-style
learn to predict the next token in a sequence
ex: language modeling, music generation, etc.
formula: P(x) = P(x1) * P(x2 | x1) * ... * P(xT | x<T)
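a minimal sketch of the chain-rule factorization above, using a toy next-token "model" with made-up probabilities (the table and numbers are illustrative, not from any real model):

```python
# toy next-token model: P(token | prefix) looked up from a tiny hypothetical table
def p_next(token, prefix):
    table = {
        ((), "the"): 0.4,
        (("the",), "cat"): 0.3,
        (("the", "cat"), "sat"): 0.5,
    }
    return table.get((tuple(prefix), token), 1e-6)

# P(x) = P(x1) * P(x2 | x1) * ... * P(xT | x<T)
def sequence_prob(tokens):
    prob = 1.0
    for t, tok in enumerate(tokens):
        prob *= p_next(tok, tokens[:t])  # condition on everything before position t
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.4 * 0.3 * 0.5 ≈ 0.06
```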
sequence-to-sequence (seq2seq)
encoder-decoder (original transformer), translation-style
input sequence (english sentence) → output sequence (german translation)
ex: translation, q&a, text-to-speech
formula: P(x|z) = P(x1|z) * P(x2|x1, z) * ... * P(xT|x<T, z)
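a minimal sketch of the conditional factorization: the decoder predicts each output token given the previous outputs and the encoder's representation z of the input. `encode` and `p_next_given_z` are hypothetical stand-ins, not a real API:

```python
def encode(source_tokens):
    # encoder: source sequence -> context representation z (placeholder here)
    return tuple(source_tokens)

def p_next_given_z(token, prefix, z):
    # decoder: P(x_t | x_<t, z); fixed toy value just for illustration
    return 0.5

def translation_prob(source_tokens, target_tokens):
    z = encode(source_tokens)
    prob = 1.0
    for t, tok in enumerate(target_tokens):
        prob *= p_next_given_z(tok, target_tokens[:t], z)  # P(x|z) = prod_t P(x_t | x_<t, z)
    return prob

print(translation_prob(["the", "cat"], ["die", "katze"]))  # 0.5 * 0.5 = 0.25
```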
classification
encoder-only, BERT-style
input = sequence of tokens → output = label/class
ex: sentiment analysis, spam detection
formula: learn P(c|x)
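a minimal numpy sketch of an encoder-only classification head, assuming the encoder has already produced one hidden vector per token (all shapes and weights are illustrative): pool over the sequence, project to class logits, softmax to get P(c|x).

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical encoder output: one hidden vector per input token (seq_len x d_model)
hidden = rng.normal(size=(6, 16))

# classification head: 2 classes, e.g. negative / positive sentiment
W = rng.normal(size=(16, 2))
b = np.zeros(2)

pooled = hidden.mean(axis=0)                   # simple mean pooling over tokens
logits = pooled @ W + b
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> P(c | x)

print(probs, probs.sum())  # probabilities over the 2 classes, sums to 1
```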
transformers
deep feed-forward neural networks that rely on attention mechanisms
general-purpose sequence models with 3 main use cases:
sequence modeling
seq2seq
classification
tokenization
process of representing text as tokens
subword tokenization is most common
each token converted to unique integer ID
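a toy sketch of text -> subword tokens -> integer IDs, using a tiny made-up vocabulary and greedy longest-match splitting (real subword tokenizers like BPE/WordPiece learn their vocab from data; this is just the idea):

```python
vocab = {"un": 0, "break": 1, "able": 2, "token": 3, "ization": 4, "<unk>": 5}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest matching piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")          # no piece matched this character
            i += 1
    return tokens

tokens = tokenize("unbreakable")
ids = [vocab[t] for t in tokens]
print(tokens, ids)  # ['un', 'break', 'able'] [0, 1, 2]
```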
token embedding
converts token ID → vector
like a dictionary lookup into a learned matrix
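a minimal numpy sketch of the lookup: the embedding matrix has one learned row per token ID, and indexing by ID just selects rows (vocab size and dimensions are made up):

```python
import numpy as np

vocab_size, d_model = 100, 8
rng = np.random.default_rng(0)

# learned embedding matrix: one row (vector) per token ID
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([0, 1, 2])                  # output of the tokenizer
token_embeddings = embedding_matrix[token_ids]   # "dictionary lookup" by row index

print(token_embeddings.shape)  # (3, 8): one d_model vector per token
```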
positional embedding
adds information about word order
without it, model sees text as ‘bag of words’
can be learned (fixed maximum length) or sinusoidal (extends to arbitrary length)
final/vector embedding
token embedding + positional embedding
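a minimal numpy sketch covering both of the cards above: sinusoidal positional embeddings (sin/cos at geometrically spaced frequencies, assuming an even d_model) added element-wise to the token embeddings to give the final embedding:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # even dimensions get sin, odd dimensions get cos, at decreasing frequencies
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
token_embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for the lookup above

# final embedding = token embedding + positional embedding
final_embeddings = token_embeddings + sinusoidal_positions(seq_len, d_model)
print(final_embeddings.shape)  # (3, 8)
```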
attention
decides which parts of the sequence to focus on
attention score
similarity between a query and a key (scaled dot product); determines how much weight each value gets
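a minimal numpy sketch of scaled dot-product attention: scores = query-key similarity, softmax turns scores into weights, output = weighted sum of values (Q, K, V and their shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ V                                     # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query position
```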
multi-head attention
multiple attention mechanisms run in parallel, each capturing different relationships
outputs are concatenated and linearly combined
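a minimal numpy sketch of multi-head attention under assumed shapes (d_model split evenly across heads, random weights for illustration): each head gets its own Q/K/V projections, the head outputs are concatenated, then mixed by a final linear layer W_o:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x, params):
    # each head has its own (W_q, W_k, W_v); outputs are concatenated and
    # linearly combined by W_o
    heads = [attention(x @ W_q, x @ W_k, x @ W_v)
             for W_q, W_k, W_v in params["heads"]]
    concat = np.concatenate(heads, axis=-1)
    return concat @ params["W_o"]

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
d_head = d_model // num_heads

params = {
    "heads": [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
              for _ in range(num_heads)],
    "W_o": rng.normal(size=(d_model, d_model)),
}

x = rng.normal(size=(seq_len, d_model))
print(multi_head_attention(x, params).shape)  # (4, 16)
```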