Transformer

ATA Week 13

18 Terms

1

Transformer - Motivation

  • RNNs are slow

    • Sequential nature prohibits parallelization within training samples

  • The attention mechanism removes the need for recurrence, instead drawing global dependencies between input and output.

2

Language Modeling

The task of predicting what word comes next in a sentence.
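
A toy sketch of what "predicting the next word" means in practice (PyTorch assumed; the vocabulary and scores are made up): the model assigns a score to every word in its vocabulary, and a softmax turns those scores into a probability distribution over possible next words.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores a language model might assign to candidate next words
# for the context "the cat sat on the ...".
vocab = ["mat", "dog", "sat", "moon"]
logits = torch.tensor([3.2, 0.1, -1.0, 0.7])

probs = F.softmax(logits, dim=-1)   # scores -> probability distribution
for word, p in zip(vocab, probs):
    print(f"P({word} | 'the cat sat on the') = {p:.3f}")
```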

3

Transfer Learning using Language Model

  • Language model pretraining is effective for many language understanding tasks.

  • Helps solve the small-data problem, similar to how transfer learning is used for computer vision tasks

    • Unsupervised training on an enormous text corpus

    • Fine-tuning with supervised training on a small set of labelled data for text classification, Q&A, etc…

4

Encoder

  • The stack consists of N = 6 identical layers

  • Each layer consists of 2 sub-layers

    • Multi-head attention layer

    • Position-wise fully connected feed-forward network

  • Residual connection around each of the two sub-layers, followed by layer normalization

    • All sub-layers produce outputs of the same dimension, 512
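
A minimal PyTorch sketch of the encoder layer described above, assuming the hyper-parameters of the original paper (8 heads, feed-forward inner size 2048); class and variable names are illustrative, not from the source.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention and a position-wise
    feed-forward network, each with a residual connection and layer norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))    # residual connection + layer norm
        return x

# Stack of N=6 identical layers; every sub-layer keeps dimension 512.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(1, 10, 512)               # (batch, sequence length, d_model)
print(encoder(x).shape)                   # torch.Size([1, 10, 512])
```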

5

Decoder

  • A stack of N=6 identical layers

  • Like the encoder, each layer contains a multi-head attention layer and a fully connected feed-forward network

    • Additional multi-head attention layer to attend to output from encoder stack

  • Each sub-layer adopts a residual connection and a layer normalization

  • The first multi-head attention sub-layer is modified to prevent positions from attending to subsequent positions

    • to prevent looking into the future of the target sequence when predicting the current position.
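
A rough PyTorch sketch of one decoder layer as described above: masked self-attention, a second attention sub-layer over the encoder output, and a feed-forward network, each wrapped in a residual connection and layer normalization. Names and hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, enc_out):
        T = y.size(1)
        # Mask so position i cannot attend to positions > i (no looking ahead).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.norms[0](y + a)
        a, _ = self.cross_attn(y, enc_out, enc_out)  # attend to encoder output
        y = self.norms[1](y + a)
        return self.norms[2](y + self.ffn(y))

layer = DecoderLayer()
enc_out = torch.randn(1, 12, 512)          # output of the encoder stack
tgt = torch.randn(1, 7, 512)               # (shifted) target embeddings
print(layer(tgt, enc_out).shape)           # torch.Size([1, 7, 512])
```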

6

Multi-head Attention

  • The Transformer runs scaled dot-product attention multiple times in parallel; each parallel run is one attention "head".

  • Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
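
A small sketch of scaled dot-product attention, run over several heads in parallel by giving the tensors a head dimension. The 8 × 64 split of the 512-dimensional model size matches the original paper; the function name is ours.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

# h = 8 heads computed in parallel, each in its own 64-dim subspace (512 / 8).
batch, heads, seq, d_head = 2, 8, 10, 64
q = torch.randn(batch, heads, seq, d_head)
k = torch.randn(batch, heads, seq, d_head)
v = torch.randn(batch, heads, seq, d_head)
out = scaled_dot_product_attention(q, k, v)          # (2, 8, 10, 64)

# The heads are then concatenated back to d_model = 8 * 64 = 512.
print(out.transpose(1, 2).reshape(batch, seq, heads * d_head).shape)
```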

7

Positional Encoding

  • In an RNN model, the position of words in a sequence is indicated by the sequential time steps at which we feed them to the RNN

  • In Transformers, the sequence of words is input into the model all at once, so the positional information is lost.

  • The word embeddings therefore need to be combined with a positional encoding that encodes the positions of words in the input sequence.
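
A sketch of the sinusoidal positional encoding from the original Transformer, added to the word embeddings so position information is injected before the first attention layer (function name illustrative).

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions
    return pe

embeddings = torch.randn(1, 10, 512)                   # embeddings of 10 tokens
x = embeddings + sinusoidal_positional_encoding(10)    # inject position info
print(x.shape)                                         # torch.Size([1, 10, 512])
```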

8

Transformer Variants

  • Autoregressive models

  • Autoencoder models

  • Seq2Seq models

9

Autoregressive Model

  • Uses only the decoder part of the original transformer

  • Uses an attention mask so that, at each position, the attention heads can only see the tokens that come before that position

  • e.g. GPT2/3, Transformer-XL, Reformer, XLNet, etc…
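
A small illustration of such a mask for a 5-token sequence (PyTorch assumed): True marks positions a token may not attend to, i.e. everything to its right.

```python
import torch

T = 5
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# Row i: position i can only look at tokens 0..i, never at later tokens.
```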

10

Autoencoder Model

  • Uses the encoder part of the transformer

  • For pretraining, the targets are the original sentences and the inputs are corrupted versions of them.

  • e.g. BERT, RoBERTa, DistilBERT, ELECTRA, etc…

11

Seq2Seq Model

  • Uses both the encoder and decoder of the original transformer

  • e.g. BART, T5, etc…

12

BERT

Bidirectional Encoder Representations from Transformers

13

Uni-directional Model

  • Most language models are ____

    • OpenAI GPT uses a left-to-right architecture, where every token can only attend to the previous tokens in the self-attention layers of the Transformer

    • May be sub-optimal for certain sentence-level tasks such as question answering, where it is important to incorporate context from both directions.

14

Bi-directional model disadvantages

  • Could allow each word to indirectly “see itself”, causing the model to trivially predict the target word in a multi-layered context.

  • Solution

    • Use a masked language model: a percentage of the input tokens is masked at random, and the model is trained to predict those masked tokens.
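
A toy sketch of this masking scheme (token ids and the [MASK] id below are made up; real implementations also avoid masking special tokens and sometimes substitute random tokens instead of [MASK]).

```python
import torch

MASK_ID = 103                                  # assumed id of the [MASK] token

input_ids = torch.tensor([7, 2023, 2003, 1037, 7953, 9])  # a toy "sentence"
labels = input_ids.clone()                     # targets are the original tokens

mask = torch.rand(input_ids.shape) < 0.15      # pick ~15% of positions at random
corrupted = input_ids.clone()
corrupted[mask] = MASK_ID                      # replace chosen tokens with [MASK]
labels[~mask] = -100                           # score only the masked positions

print(corrupted)
print(labels)
```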

15

GPT (Generative Pre-trained Transformer)

  • Uses the decoder part of the Transformer to train a massive language model (trained by predicting the next word)

  • Theorized that a large enough LLM could perform general NLP tasks without task-specific training.
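
A miniature version of that next-word objective (PyTorch assumed; the model output is replaced by random noise just to show the shapes): logits at position t are scored against the token that actually appears at position t + 1.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a toy token sequence
logits = torch.randn(1, seq_len, vocab_size)            # stand-in for model output

# Shift by one: predictions at positions 0..T-2 vs. the tokens at 1..T-1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),             # predicted distributions
    token_ids[:, 1:].reshape(-1),                       # the actual next words
)
print(loss)
```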

16

Multi-task Inference

  • Supports the use of zero-shot, one-shot, and few-shot prompts.

  • The model is presented with a task description, zero or more examples, and a prompt.
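
One illustrative way such a prompt can be assembled as plain text; the task, examples, and separator format below are made up, not prescribed by the source.

```python
task = "Translate English to French."
examples = [                         # zero examples -> zero-shot,
    ("sea otter", "loutre de mer"),  # one -> one-shot, several -> few-shot
    ("cheese", "fromage"),
]
query = "peppermint"

prompt = task + "\n"
for en, fr in examples:
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"              # the model is asked to complete this line

print(prompt)
```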

17

LLMs (Large Language Models)

  • Base ___ - Predicts the next word to complete the text.

  • Instruction-tuned ___ - Responds based on instructions
