ATA Week 13
Transformer - Motivation
RNNs are slow
Their sequential nature precludes parallelization within training examples
The attention mechanism removes the need for recurrence, instead modeling global dependencies between input and output directly.
Language Modeling
The task of predicting what word comes next in a sentence.
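A standard way to write this objective (this formula is not in the notes, but it is the usual formulation): the probability of a sentence factorizes into next-word predictions.

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```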
Transfer Learning using Language Model
Language model pretraining is effective for many language understanding tasks.
Mitigates the small-data problem, similar to using transfer learning for computer vision tasks
Unsupervised pretraining on an enormous text corpus
Fine-tuning for supervised training on a small set of labelled data for text classification, Q&A, etc…
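A minimal sketch of the fine-tuning step using the Hugging Face transformers library. The model name ("bert-base-uncased"), label count, and example sentences are illustrative assumptions, not from the notes.

```python
# Sketch: reuse a pretrained language model for supervised text classification.
# Model name and settings are illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pretrained encoder + freshly initialized classification head
)

# The pretrained weights come from unsupervised training on a large corpus;
# only a small labelled dataset is needed for this supervised step.
inputs = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**inputs)      # logits over the 2 labels
print(outputs.logits.shape)    # torch.Size([2, 2])
```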
Encoder
Stack consists of N=6 identical layers
Each layer consists of 2 sub-layers
Multi-head attention layer
Position-wise fully connected feed-forward network
Residual connection around each of the two sub-layers
All sub-layers produce outputs of the same dimension, d_model = 512
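A minimal PyTorch sketch of one encoder layer and the N=6 stack described above. Dimensions (d_model=512, 8 heads, d_ff=2048) follow the original paper; dropout and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise feed-forward,
# each wrapped in a residual connection followed by LayerNorm.
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: queries, keys, values are all x
        x = self.norm1(x + attn_out)       # residual + layer norm (sub-layer 1)
        x = self.norm2(x + self.ff(x))     # residual + layer norm (sub-layer 2)
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])  # stack of N=6 identical layers
out = encoder(torch.randn(1, 10, 512))                        # (batch, seq_len, d_model=512)
```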
Decoder
A stack of N=6 identical layers
Like the encoder, each layer consists of a multi-head attention sub-layer and a fully connected feed-forward network
Additional multi-head attention layer to attend to output from encoder stack
Each sub-layer adopts a residual connection and a layer normalization
The first multi-head attention sub-layer is masked so that positions cannot attend to subsequent positions, i.e. the decoder cannot look into the future of the target sequence when predicting the current position.
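A sketch of one decoder layer under the same assumptions as the encoder sketch above (d_model=512, 8 heads, dropout omitted): masked self-attention, cross-attention over the encoder output, then the feed-forward sub-layer.

```python
import torch
import torch.nn as nn

# One decoder layer: masked self-attention, cross-attention to the encoder
# output ("memory"), and a feed-forward sub-layer, each with residual + LayerNorm.
class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        T = tgt.size(1)
        # Causal mask: position i may only attend to positions <= i (no looking ahead).
        causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + a)
        a, _ = self.cross_attn(tgt, memory, memory)   # attend to the encoder stack output
        tgt = self.norm2(tgt + a)
        return self.norm3(tgt + self.ff(tgt))

layer = DecoderLayer()
out = layer(torch.randn(1, 7, 512), torch.randn(1, 10, 512))  # (target so far, encoder output)
```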
Multi-head Attention
The Transformer runs scaled dot-product attention multiple times in parallel, once per head.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
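The operation each head runs is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A small self-contained sketch (tensor shapes are illustrative):

```python
import math
import torch

# Scaled dot-product attention: compare each query to every key, normalize with
# softmax, and take the weighted sum of the values.
def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (.., seq_q, seq_k) similarities
    weights = scores.softmax(dim=-1)                   # attention distribution over positions
    return weights @ v

# Multi-head attention applies this h times with different learned projections of
# Q, K, V (different representation subspaces) and concatenates the results.
q = k = v = torch.randn(2, 8, 10, 64)   # (batch, heads=8, seq_len=10, d_k=64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                        # torch.Size([2, 8, 10, 64])
```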
Positional Encoding
In an RNN, the position of each word in a sequence is indicated by the time step at which it is fed to the network
In a Transformer, the whole sequence of words is input to the model at once, so this positional information is lost.
A positional encoding is therefore added to the word embeddings to encode the relative positions of words in the input sequence.
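A sketch of the sinusoidal positional encoding used in the original paper; each position gets a fixed vector that is added to the word embedding (sequence length and d_model here are illustrative).

```python
import math
import torch

# Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions,
# with wavelengths forming a geometric progression.
def positional_encoding(seq_len, d_model=512):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)            # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

embeddings = torch.randn(10, 512)              # word embeddings for a 10-token sequence
inputs = embeddings + positional_encoding(10)  # position info added before the encoder
```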
Transformer Variants
Autoregressive models
Autoencoder models
Seq2Seq models
Autoregressive Model
Uses only the decoder part of the original transformer
Uses an attention mask so that at each position, the attention heads can only see the tokens that come before it
e.g. GPT2/3, Transformer-XL, Reformer, XLNet, etc…
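An illustrative use of a decoder-only model via the Hugging Face transformers library: GPT-2 continuing a prompt one token at a time. The model name, prompt, and generation settings are assumptions for the example.

```python
# Autoregressive generation with GPT-2: each new token attends only to earlier tokens.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)   # greedy continuation, token by token
print(tokenizer.decode(out[0]))
```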
Autoencoder Model
Uses the encoder part of the transformer
For pretraining, the inputs are corrupted versions of the sentences and the targets are the originals.
e.g. BERT, RoBERTa, DistilBERT, ELECTRA, etc…
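A quick illustration of this reconstruction objective with a pretrained encoder-only model, using the Hugging Face fill-mask pipeline (model choice and sentence are illustrative): the input is a "corrupted" sentence and the model predicts the original token at the masked position.

```python
from transformers import pipeline

# BERT-style masked-token prediction: recover the original word behind [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The attention mechanism removes the need for [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```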
Seq2Seq Model
Uses both the encoder and decoder of the original transformer
e.g. BART, T5, etc…
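An illustrative encoder-decoder example with T5, which frames every task as text-to-text; the model size ("t5-small") and task prefix are assumptions for the sketch.

```python
# Seq2seq inference: the encoder reads the input text, the decoder generates the output text.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ids = tokenizer("translate English to German: The house is wonderful.",
                return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```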
BERT
Bidirectional Encoder Representations from Transformers
Uni-directional Model
Most language models are ____
OpenAI GPT uses a left-to-right architecture, where every token can only attend to the previous tokens in the Transformer's self-attention layers
This may be sub-optimal for sentence-level tasks such as question answering, where it is important to incorporate context from both directions.
Bi-directional model disadvantages
Could allow each word to indirectly “see itself”, causing the model to trivially predict the target word in a multi-layered context.
Solution
Use a masked language model, where a percentage of the input tokens are masked at random, then predict those masked tokens.
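A simplified sketch of the corruption step (the real BERT recipe also replaces some selected tokens with random words or leaves them unchanged; the 15% rate is from the paper, the sentence is made up):

```python
import random

# Mask roughly 15% of the input tokens at random; the training targets are the
# original tokens at the masked positions.
tokens = "the cat sat on the mat".split()
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
targets = [t for t, m in zip(tokens, masked) if m == "[MASK]"]
print(masked, targets)
```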
GPT (Generative Pre-trained Transformer)
Uses the decoder part of the Transformer to train a massive language model (trained by predicting the next word)
Theorized that a large enough LLM could perform general NLP tasks without task-specific training.
Multi-task Inference
Supports the use of zero-shot, one-shot, and few-shot prompts.
The model is presented with a task description, zero or more examples, and a prompt.
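An illustrative few-shot prompt layout (the translation examples follow the style of the GPT-3 paper; the exact wording here is only an example):

```python
# Task description, a few examples, then the prompt the model should complete.
prompt = (
    "Translate English to French.\n"   # task description
    "sea otter => loutre de mer\n"     # example 1
    "cheese => fromage\n"              # example 2
    "peppermint =>"                    # prompt: the model completes this line
)
```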
LLMs (Large Language Models)
Base ___ - Predicts the next word to complete the text.
Instruction-tuned ___ - Responds to the given instructions