L8 - Transformers


11 Terms

1

RNNs Difficulties

  • Hard to train

  • Not easy to parallelize

  • Poor handling of long-range dependencies

2

Transformers Type

  • Encoder-only (e.g., BERT)

  • Decoder-only (e.g., ChatGPT)

3

Encoder-only - Definition

  • Uses only the encoder part of the original Transformer architecture

4

Encoder-only - Objective func

  • Predict masked context words (fill-in-the-blank)

  • Masked Language Modelling (MLM); see the sketch below
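
A minimal sketch of the fill-in-the-blank (MLM) setup, assuming a toy word-level example and the roughly 15% masking rate used by BERT (details not stated on the card):

```python
# Masked-LM objective sketch: hide some tokens and train the model to recover them.
# The toy sentence and the 15% masking rate are illustrative assumptions.
import random

sentence = ["the", "cat", "sat", "on", "the", "mat"]
mask_prob = 0.15  # fraction of tokens replaced with [MASK]

inputs, targets = [], []
for token in sentence:
    if random.random() < mask_prob:
        inputs.append("[MASK]")   # the model sees the mask...
        targets.append(token)     # ...and is trained to predict the original token
    else:
        inputs.append(token)
        targets.append(None)      # no loss on unmasked positions

print(inputs)   # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)  # e.g. [None, 'cat', None, None, None, None]
```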

5

Encoder-only LLMs Notable Models

  • BERT

  • RoBERTa

6

Encoder-only - Self-attention

  • Self-attention is computed K times in parallel (K heads)

  • Each head focuses on a different aspect of the input

  • The head outputs are then combined (concatenated and projected); see the sketch below
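
A minimal NumPy sketch of multi-head self-attention under toy assumptions (random weights and shapes, not the card's notation): K heads run in parallel, each with its own projections, and their outputs are concatenated and mixed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, K=4):
    n, d_model = X.shape
    d_head = d_model // K
    heads = []
    for _ in range(K):                           # each head has its own Q/K/V projections
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
        Q, Kmat, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ Kmat.T / np.sqrt(d_head)    # scaled dot-product attention
        heads.append(softmax(scores) @ V)        # (n, d_head) output per head
    Wo = np.random.randn(K * d_head, d_model)
    return np.concatenate(heads, axis=-1) @ Wo   # combine the K heads

X = np.random.randn(5, 16)                       # 5 tokens, model dimension 16
print(multi_head_self_attention(X).shape)        # (5, 16)
```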

7

BERT

  • Encoder-only

  • Bidirectional transformer

  • Attention has access to both the left and the right context

  • Pretrained on:

    • masked LM

    • next sentence prediction (NSP)

  • Input:

    • WordPiece embeddings

    • [CLS] (special classification token); see the tokenizer sketch below
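
A minimal sketch of BERT's input format using the Hugging Face `transformers` library (an assumption of this example, not named on the card; requires `pip install transformers` and downloading the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("the cat sat on the mat")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
# [CLS] is prepended to every input; rarer words are split into WordPiece subwords ("##...").
```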

8

Decoder-only LLM

  • Use the decoder part of the original transformer architecture.

  • Masked (causal) self-attention

  • Output: multiclass classification over the vocabulary

  • Objective: next-token prediction

  • Attention for position n is computed only over tokens 1..n-1, i.e. the tokens before the one being predicted (see the causal-mask sketch below)
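
A minimal NumPy sketch of the causal mask (toy shapes, with Q = K = X as an illustrative simplification, not the full decoder layer):

```python
import numpy as np

def causal_attention_weights(X):
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                    # simplified: Q = K = X
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)         # hide future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.randn(4, 8)                            # 4 tokens, dimension 8
print(np.round(causal_attention_weights(X), 2))
# Lower-triangular matrix: position i attends only to positions 0..i (its own past).
```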

9

Decoder-only LLMs Notable Models

  • GPT

  • Llama

  • DeepSeek

10

Scaling Laws of LLMs

Performance is determined by:

  1. Model size (vocabulary size is an important component)

  2. Pretraining data size

  3. Amount of compute available for training (GPU resources)

Bigger 1, 2 and 3 → better performance; a common parametric form is sketched below.
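
For reference, a commonly cited parametric form of such scaling laws comes from the Chinchilla paper (Hoffmann et al., 2022); this is background context, not stated on the card:

```latex
% Expected loss L as a function of parameter count N and training tokens D;
% E, A, B, \alpha, \beta are constants fitted to experiments.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```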

11

How are tokens represented in the input to a Transformer?

Sum of token embeddings and position embeddings (sketched below).
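
A minimal NumPy sketch of this input representation (vocabulary size, sequence length and dimensions are illustrative assumptions):

```python
import numpy as np

vocab_size, max_len, d_model = 100, 16, 8
token_emb = np.random.randn(vocab_size, d_model)   # learned token embedding table
pos_emb = np.random.randn(max_len, d_model)        # learned position embedding table

token_ids = np.array([5, 42, 7, 99])               # a toy input sequence of token ids
positions = np.arange(len(token_ids))

X = token_emb[token_ids] + pos_emb[positions]      # element-wise sum, fed to the first layer
print(X.shape)                                     # (4, 8)
```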