RNN Difficulties
Hard to train
Not easy to parallelize
Poor at handling long-range dependencies
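A minimal sketch (my own illustration, not from these notes) of the sequential bottleneck behind "not easy to parallelize": each hidden state depends on the previous one, so the time loop cannot be run across positions in parallel.

```python
import torch

def rnn_forward(x, W_xh, W_hh):
    """x: (seq_len, d_in) -> final hidden state; toy RNN, no biases."""
    h = torch.zeros(W_hh.shape[0])
    for x_t in x:                                # the loop is inherently sequential:
        h = torch.tanh(W_xh @ x_t + W_hh @ h)    # h_t cannot be computed before h_{t-1}
    return h

seq_len, d_in, d_h = 8, 4, 6
h_final = rnn_forward(torch.randn(seq_len, d_in),
                      torch.randn(d_h, d_in),
                      torch.randn(d_h, d_h))
print(h_final.shape)  # torch.Size([6])
```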
Transformer Types
Encoder-only (BERT)
Decoder-only (ChatGPT)
Encoder-only - Definition
Uses only the encoder part of the original transformer architecture
Encoder-only - Objective func
Predict masked context words (fill-in-the-blank)
Masked Language Modelling (MLM)
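A hedged sketch of the MLM objective (toy shapes and a stand-in encoder, not BERT's actual recipe): replace some tokens with a [MASK] id and compute the loss only at the masked positions.

```python
import torch
import torch.nn.functional as F

vocab_size, mask_id, d_model = 100, 0, 16                 # toy values
tokens = torch.randint(1, vocab_size, (1, 10))            # (batch, seq_len)
mask = torch.rand(tokens.shape) < 0.15                    # ~15% of positions masked
mask[0, 0] = True                                         # ensure at least one mask here

inputs = tokens.masked_fill(mask, mask_id)                # corrupt the input

# Stand-in for an encoder: anything that maps token ids to per-token vocab logits.
encoder = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
logits = encoder(inputs)                                   # (batch, seq_len, vocab_size)

# Loss only at masked positions: the model must fill in the blanks.
loss = F.cross_entropy(logits[mask], tokens[mask])
print(loss.item())
```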
Encoder-only LLMs Notable Models
BERT
RoBERTa
Encoder-only - Self-attention
Computed K times (multi-head attention with K heads)
Each head focuses on different relationships
The heads' outputs are then combined, i.e., concatenated and projected (sketched below)
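A sketch (random weights, no masking or dropout, assumed shapes) of what "K times, then combine" means: K attention heads computed in parallel, then concatenated and passed through an output projection.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, K):
    """x: (seq_len, d_model) with d_model divisible by K."""
    seq_len, d_model = x.shape
    d_head = d_model // K
    W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

    q = (x @ W_q).view(seq_len, K, d_head).transpose(0, 1)      # (K, seq_len, d_head)
    k = (x @ W_k).view(seq_len, K, d_head).transpose(0, 1)
    v = (x @ W_v).view(seq_len, K, d_head).transpose(0, 1)

    scores = q @ k.transpose(1, 2) / d_head ** 0.5              # (K, seq_len, seq_len)
    heads = F.softmax(scores, dim=-1) @ v                       # each head attends differently
    combined = heads.transpose(0, 1).reshape(seq_len, d_model)  # concatenate the K heads
    return combined @ W_o                                       # output projection mixes them

out = multi_head_self_attention(torch.randn(10, 64), K=8)
print(out.shape)  # torch.Size([10, 64])
```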
BERT
Encoder-only
Bidirectional transformer
Attention has access to both left and right context
Pretrained on:
masked LM
next sentence prediction (NSP)
Input:
WordPiece embeddings
[CLS] (special classification token; see the tokenizer example below)
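A hedged example, assuming the Hugging Face transformers package is installed and the bert-base-uncased vocabulary can be downloaded; it shows the WordPiece tokens and the [CLS]/[SEP] specials BERT actually sees.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("Transformers handle long-range dependencies")

# [CLS] is prepended and [SEP] appended; words missing from the vocabulary are
# split into '##'-prefixed WordPiece subwords.
print(tok.convert_ids_to_tokens(enc["input_ids"]))
```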
Decoder-only LLM
Uses the decoder part of the original transformer architecture.
Masked (causal) self-attention
Objective: next token prediction, framed as multiclass classification over the vocabulary
When predicting token n, attention is computed only over tokens up to position n-1 (see the causal-mask sketch below)
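A sketch with toy shapes of the causal mask used by masked self-attention: each position only attends to itself and earlier positions, so the model never sees the token it is about to predict.

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q = k = v = torch.randn(seq_len, d)                        # toy queries/keys/values

scores = q @ k.T / d ** 0.5                                # (seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))    # block attention to the future
attn = F.softmax(scores, dim=-1)                           # lower-triangular weights

print(attn)          # position t only attends to positions <= t
out = attn @ v       # the representation at t is used to predict token t+1
```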
Decoder-only LLMs Notable Models
GPT
Llama
DeepSeek
Scaling Laws of LLMs
Performance is determined by:
Model size (number of parameters; vocabulary size matters too)
Pretraining data size
Amount of compute used for training (GPU count and training time)
Scaling all three up → better performance (a rough power-law form is sketched below)
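Scaling-law papers typically fit a power law of roughly the shape below; this is only a sketch of that shape, and the constants are arbitrary placeholders, not fitted values from any paper.

```python
def predicted_loss(n_params, n_tokens,
                   E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Typical scaling-law shape L(N, D) = E + A/N^alpha + B/D^beta (placeholder constants)."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Growing either the model (N) or the data (D) lowers the predicted loss.
print(predicted_loss(1e9, 2e10))      # smaller model, less data
print(predicted_loss(7e10, 1.4e12))   # bigger model, more data -> lower predicted loss
```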
How are tokens represented in the input to Transformers?
Sum of token embeddings and position embeddings.
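A sketch with assumed toy sizes of that summed input representation: each token's embedding is added to a learned embedding of its position before entering the first layer.

```python
import torch

vocab_size, max_len, d_model = 100, 32, 16                  # toy sizes
token_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[5, 42, 7, 42]])                  # (batch, seq_len); 42 repeats
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # [[0, 1, 2, 3]]

x = token_emb(token_ids) + pos_emb(positions)               # (batch, seq_len, d_model)
# The two occurrences of token 42 get different vectors because their positions differ.
print(x.shape)  # torch.Size([1, 4, 16])
```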