RNN Difficulties
Hard to train
Not easy to parallelize
Poor at handling long-range dependencies
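A minimal sketch (my own illustration, not from these notes) of the sequential bottleneck behind "not easy to parallelize": each hidden state depends on the previous one, so the time loop cannot be run across positions in parallel.

```python
import torch

def rnn_forward(x, W_xh, W_hh):
    """x: (seq_len, d_in) -> final hidden state; toy RNN, no biases."""
    h = torch.zeros(W_hh.shape[0])
    for x_t in x:                                # the loop is inherently sequential:
        h = torch.tanh(W_xh @ x_t + W_hh @ h)    # h_t cannot be computed before h_{t-1}
    return h

seq_len, d_in, d_h = 8, 4, 6
h_final = rnn_forward(torch.randn(seq_len, d_in),
                      torch.randn(d_h, d_in),
                      torch.randn(d_h, d_h))
print(h_final.shape)  # torch.Size([6])
```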
Transformer Types
Encoder-only (BERT)
Decoder-only (ChatGPT)
Encoder-only - Definition
Uses only the encoder part of the original transformer architecture
Encoder-only - Objective func
Predict masked context words (fill-in-the-blank)
Masked Language Modelling (MLM)
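A hedged sketch of the MLM objective (toy shapes and a stand-in encoder, not BERT's actual recipe): replace some tokens with a [MASK] id and compute the loss only at the masked positions.

```python
import torch
import torch.nn.functional as F

vocab_size, mask_id, d_model = 100, 0, 16                 # toy values
tokens = torch.randint(1, vocab_size, (1, 10))            # (batch, seq_len)
mask = torch.rand(tokens.shape) < 0.15                    # ~15% of positions masked
mask[0, 0] = True                                         # ensure at least one mask here

inputs = tokens.masked_fill(mask, mask_id)                # corrupt the input

# Stand-in for an encoder: anything that maps token ids to per-token vocab logits.
encoder = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
logits = encoder(inputs)                                   # (batch, seq_len, vocab_size)

# Loss only at masked positions: the model must fill in the blanks.
loss = F.cross_entropy(logits[mask], tokens[mask])
print(loss.item())
```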
Encoder-only LLMs Notable Models
BERT
RoBERTa
Encoder-only - Self-attention
Computed K times (multi-head attention with K heads)
Each head focuses on different relationships
The heads' outputs are then combined, i.e., concatenated and projected (sketched below)
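A sketch (random weights, no masking or dropout, assumed shapes) of what "K times, then combine" means: K attention heads computed in parallel, then concatenated and passed through an output projection.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, K):
    """x: (seq_len, d_model) with d_model divisible by K."""
    seq_len, d_model = x.shape
    d_head = d_model // K
    W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

    q = (x @ W_q).view(seq_len, K, d_head).transpose(0, 1)      # (K, seq_len, d_head)
    k = (x @ W_k).view(seq_len, K, d_head).transpose(0, 1)
    v = (x @ W_v).view(seq_len, K, d_head).transpose(0, 1)

    scores = q @ k.transpose(1, 2) / d_head ** 0.5              # (K, seq_len, seq_len)
    heads = F.softmax(scores, dim=-1) @ v                       # each head attends differently
    combined = heads.transpose(0, 1).reshape(seq_len, d_model)  # concatenate the K heads
    return combined @ W_o                                       # output projection mixes them

out = multi_head_self_attention(torch.randn(10, 64), K=8)
print(out.shape)  # torch.Size([10, 64])
```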
BERT
Encoder-only
Bidirectional transformer
Attention has access to both left and right context
Pretrained on:
masked LM
next sentence prediction (NSP)
Input:
WordPiece embeddings
[CLS] (special classification token; see the tokenizer example below)
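A hedged example, assuming the Hugging Face transformers package is installed and the bert-base-uncased vocabulary can be downloaded; it shows the WordPiece tokens and the [CLS]/[SEP] specials BERT actually sees.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("Transformers handle long-range dependencies")

# [CLS] is prepended and [SEP] appended; words missing from the vocabulary are
# split into '##'-prefixed WordPiece subwords.
print(tok.convert_ids_to_tokens(enc["input_ids"]))
```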
Decoder-only LLM
Uses the decoder part of the original transformer architecture.
Masked (causal) self-attention
Objective: next token prediction, framed as multiclass classification over the vocabulary
When predicting token n, attention is computed only over tokens up to position n-1 (see the causal-mask sketch below)
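A sketch with toy shapes of the causal mask used by masked self-attention: each position only attends to itself and earlier positions, so the model never sees the token it is about to predict.

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q = k = v = torch.randn(seq_len, d)                        # toy queries/keys/values

scores = q @ k.T / d ** 0.5                                # (seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))    # block attention to the future
attn = F.softmax(scores, dim=-1)                           # lower-triangular weights

print(attn)          # position t only attends to positions <= t
out = attn @ v       # the representation at t is used to predict token t+1
```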
Decoder-only LLMs Notable Models
GPT
Llama
DeepSeek
Scaling Laws of LLMs
Performance is determined by:
Model size (number of parameters; vocabulary size matters too)
Pretraining data size
Amount of compute used for training (GPU count and training time)
Scaling all three up → better performance (a rough power-law form is sketched below)
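Scaling-law papers typically fit a power law of roughly the shape below; this is only a sketch of that shape, and the constants are arbitrary placeholders, not fitted values from any paper.

```python
def predicted_loss(n_params, n_tokens,
                   E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Typical scaling-law shape L(N, D) = E + A/N^alpha + B/D^beta (placeholder constants)."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Growing either the model (N) or the data (D) lowers the predicted loss.
print(predicted_loss(1e9, 2e10))      # smaller model, less data
print(predicted_loss(7e10, 1.4e12))   # bigger model, more data -> lower predicted loss
```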
How are tokens represented in the input to Transformers?
Sum of token embeddings and position embeddings.
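A sketch with assumed toy sizes of that summed input representation: each token's embedding is added to a learned embedding of its position before entering the first layer.

```python
import torch

vocab_size, max_len, d_model = 100, 32, 16                  # toy sizes
token_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[5, 42, 7, 42]])                  # (batch, seq_len); 42 repeats
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # [[0, 1, 2, 3]]

x = token_emb(token_ids) + pos_emb(positions)               # (batch, seq_len, d_model)
# The two occurrences of token 42 get different vectors because their positions differ.
print(x.shape)  # torch.Size([1, 4, 16])
```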