ATA Week 13
Transformer - Motivation
RNNs are slow
Their sequential nature precludes parallelization within training examples
The attention mechanism removes the need for recurrence, instead modeling global dependencies between input and output directly.
Language Modeling
The task of predicting what word comes next in a sentence.
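A standard way to write this objective (this formula is not in the notes, but it is the usual formulation): the probability of a sentence factorizes into next-word predictions.

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```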
Transfer Learning using Language Model
Language model pretraining is effective for many language understanding tasks.
Mitigates the small-data problem, similar to using transfer learning for computer vision tasks
Unsupervised pretraining on an enormous text corpus
Fine-tuning for supervised training on a small set of labelled data for text classification, Q&A, etc…
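A minimal sketch of the fine-tuning step using the Hugging Face transformers library. The model name ("bert-base-uncased"), label count, and example sentences are illustrative assumptions, not from the notes.

```python
# Sketch: reuse a pretrained language model for supervised text classification.
# Model name and settings are illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pretrained encoder + freshly initialized classification head
)

# The pretrained weights come from unsupervised training on a large corpus;
# only a small labelled dataset is needed for this supervised step.
inputs = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**inputs)      # logits over the 2 labels
print(outputs.logits.shape)    # torch.Size([2, 2])
```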
Encoder
Stack consists of N=6 identical layers
Each layer consists of 2 sub-layers
Multi-head attention layer
Position-wise fully connected feed-forward network
Residual connection around each of the two sub-layers
All sub-layers produce outputs of the same dimension, d_model = 512
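A minimal PyTorch sketch of one encoder layer and the N=6 stack described above. Dimensions (d_model=512, 8 heads, d_ff=2048) follow the original paper; dropout and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise feed-forward,
# each wrapped in a residual connection followed by LayerNorm.
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: queries, keys, values are all x
        x = self.norm1(x + attn_out)       # residual + layer norm (sub-layer 1)
        x = self.norm2(x + self.ff(x))     # residual + layer norm (sub-layer 2)
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])  # stack of N=6 identical layers
out = encoder(torch.randn(1, 10, 512))                        # (batch, seq_len, d_model=512)
```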
Decoder
A stack of N=6 identical layers
Like the encoder, each layer consists of a multi-head attention sub-layer and a fully connected feed-forward network
Additional multi-head attention layer to attend to output from encoder stack
Each sub-layer adopts a residual connection and a layer normalization
The first multi-head attention sub-layer is masked so that positions cannot attend to subsequent positions, i.e. the decoder cannot look into the future of the target sequence when predicting the current position.
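A sketch of one decoder layer under the same assumptions as the encoder sketch above (d_model=512, 8 heads, dropout omitted): masked self-attention, cross-attention over the encoder output, then the feed-forward sub-layer.

```python
import torch
import torch.nn as nn

# One decoder layer: masked self-attention, cross-attention to the encoder
# output ("memory"), and a feed-forward sub-layer, each with residual + LayerNorm.
class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        T = tgt.size(1)
        # Causal mask: position i may only attend to positions <= i (no looking ahead).
        causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + a)
        a, _ = self.cross_attn(tgt, memory, memory)   # attend to the encoder stack output
        tgt = self.norm2(tgt + a)
        return self.norm3(tgt + self.ff(tgt))

layer = DecoderLayer()
out = layer(torch.randn(1, 7, 512), torch.randn(1, 10, 512))  # (target so far, encoder output)
```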
Multi-head Attention
The Transformer runs scaled dot-product attention multiple times in parallel, once per head.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
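The operation each head runs is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A small self-contained sketch (tensor shapes are illustrative):

```python
import math
import torch

# Scaled dot-product attention: compare each query to every key, normalize with
# softmax, and take the weighted sum of the values.
def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (.., seq_q, seq_k) similarities
    weights = scores.softmax(dim=-1)                   # attention distribution over positions
    return weights @ v

# Multi-head attention applies this h times with different learned projections of
# Q, K, V (different representation subspaces) and concatenates the results.
q = k = v = torch.randn(2, 8, 10, 64)   # (batch, heads=8, seq_len=10, d_k=64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                        # torch.Size([2, 8, 10, 64])
```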
Positional Encoding
In an RNN, the position of each word in a sequence is indicated by the time step at which it is fed to the network
In a Transformer, the whole sequence of words is input to the model at once, so this positional information is lost.
A positional encoding is therefore added to the word embeddings to encode the relative positions of words in the input sequence.
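A sketch of the sinusoidal positional encoding used in the original paper; each position gets a fixed vector that is added to the word embedding (sequence length and d_model here are illustrative).

```python
import math
import torch

# Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions,
# with wavelengths forming a geometric progression.
def positional_encoding(seq_len, d_model=512):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)            # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

embeddings = torch.randn(10, 512)              # word embeddings for a 10-token sequence
inputs = embeddings + positional_encoding(10)  # position info added before the encoder
```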
Transformer Variants
Autoregressive models
Autoencoder models
Seq2Seq models
Autoregressive Model
Uses only the decoder part of the original transformer
Uses an attention mask so that at each position, the attention heads can only see the tokens that come before it
e.g. GPT2/3, Transformer-XL, Reformer, XLNet, etc…
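An illustrative use of a decoder-only model via the Hugging Face transformers library: GPT-2 continuing a prompt one token at a time. The model name, prompt, and generation settings are assumptions for the example.

```python
# Autoregressive generation with GPT-2: each new token attends only to earlier tokens.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)   # greedy continuation, token by token
print(tokenizer.decode(out[0]))
```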
Autoencoder Model
Uses the encoder part of the transformer
For pretraining, the inputs are corrupted versions of the sentences and the targets are the originals.
e.g. BERT, RoBERTa, DistilBERT, ELECTRA, etc…
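A quick illustration of this reconstruction objective with a pretrained encoder-only model, using the Hugging Face fill-mask pipeline (model choice and sentence are illustrative): the input is a "corrupted" sentence and the model predicts the original token at the masked position.

```python
from transformers import pipeline

# BERT-style masked-token prediction: recover the original word behind [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The attention mechanism removes the need for [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```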
Seq2Seq Model
Uses both the encoder and decoder of the original transformer
e.g. BART, T5, etc…
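An illustrative encoder-decoder example with T5, which frames every task as text-to-text; the model size ("t5-small") and task prefix are assumptions for the sketch.

```python
# Seq2seq inference: the encoder reads the input text, the decoder generates the output text.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ids = tokenizer("translate English to German: The house is wonderful.",
                return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```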
BERT
Bidirectional Encoder Representations from Transformers
Uni-directional Model
Most language models are ____
OpenAI GPT uses a left-to-right architecture, where every token can only attend to the previous tokens in the Transformer's self-attention layers
This may be sub-optimal for sentence-level tasks such as question answering, where it is important to incorporate context from both directions.
Bi-directional model disadvantages
Could allow each word to indirectly “see itself”, causing the model to trivially predict the target word in a multi-layered context.
Solution
Use a masked language model, where a percentage of the input tokens are masked at random, then predict those masked tokens.
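A simplified sketch of the corruption step (the real BERT recipe also replaces some selected tokens with random words or leaves them unchanged; the 15% rate is from the paper, the sentence is made up):

```python
import random

# Mask roughly 15% of the input tokens at random; the training targets are the
# original tokens at the masked positions.
tokens = "the cat sat on the mat".split()
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
targets = [t for t, m in zip(tokens, masked) if m == "[MASK]"]
print(masked, targets)
```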
GPT (Generative Pre-trained Transformer)
Uses the decoder part of the Transformer to train a massive language model (trained by predicting the next word)
Theorized that a large enough LLM could perform general NLP tasks without task-specific training.
Multi-task Inference
Supports the use of zero-shot, one-shot, and few-shot prompts.
The model is presented with a task description, zero or more examples, and a prompt.
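An illustrative few-shot prompt layout (the translation examples follow the style of the GPT-3 paper; the exact wording here is only an example):

```python
# Task description, a few examples, then the prompt the model should complete.
prompt = (
    "Translate English to French.\n"   # task description
    "sea otter => loutre de mer\n"     # example 1
    "cheese => fromage\n"              # example 2
    "peppermint =>"                    # prompt: the model completes this line
)
```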
LLMs (Large Language Models)
Base ___ - Predicts the next word to complete the text.
Instruction-tuned ___ - Responds to the given instructions