From ANNs to Transformers
Latent Semantic Analysis (LSA) - Recap
LSA aims to discover the meaning behind words and the topics in documents.
Words are observable, but topics are latent.
LSA is based on singular value decomposition (SVD), specifically a truncated SVD that keeps only the top r singular values (r ≪ min(m,n)), used for topic modeling.
Context is provided through a term-document matrix.
In LSA, a term-document matrix is trained on a corpus, SVD is applied, and the first 300 dimensions are used as a vector embedding to represent words.
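A minimal sketch of the LSA pipeline above, using a tiny hypothetical term-document matrix (the words, counts, and r=2 are illustrative; a real setup would build the matrix from a corpus and keep ~300 dimensions):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
# (hypothetical counts for illustration).
X = np.array([
    [2, 1, 0, 0],   # "space"
    [1, 2, 0, 0],   # "rocket"
    [0, 0, 3, 1],   # "ballot"
    [0, 0, 1, 3],   # "vote"
], dtype=float)

# SVD, then truncate to the top r singular values (r << min(m, n)).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
term_vectors = U[:, :r] * s[:r]   # r-dimensional embedding per term

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Terms that co-occur in the same documents (same latent topic)
# end up close together in the truncated space.
print(cos(term_vectors[0], term_vectors[1]))  # "space" vs "rocket": high
print(cos(term_vectors[0], term_vectors[2]))  # "space" vs "ballot": low
```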
LDA – assumption & process - Recap
LDA Assumption: Documents are produced from a mixture of topics, which then generate words based on their probability distribution.
Given a dataset of documents, LDA backtracks to figure out what topics would create those documents.
Probabilistic Generative Process:
DOC1 draws from Topic 1 with probability 1.
DOC2 draws from Topic 1 with probability 0.5 and from Topic 2 with probability 0.5.
DOC3 draws from Topic 2 with probability 1.
Inferential Process: Learn the topics and topic distributions.
Topics: \beta_{1:K} (where K=2 in the example).
Topic distributions: \theta_1 = (1, 0), \theta_2 = (0.5, 0.5), \theta_3 = (0, 1)
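The generative process in the example can be sketched directly (the topic vocabularies and word probabilities below are hypothetical; only the per-document topic distributions come from the example):

```python
import random

random.seed(0)

# Two latent topics, each a distribution over its own vocabulary
# (hypothetical words/probabilities for illustration).
topics = {
    1: {"space": 0.5, "rocket": 0.3, "orbit": 0.2},
    2: {"vote": 0.5, "ballot": 0.3, "party": 0.2},
}

# Per-document topic distributions theta_d from the example:
theta = {"DOC1": [1.0, 0.0], "DOC2": [0.5, 0.5], "DOC3": [0.0, 1.0]}

def generate_doc(doc_id, n_words=6):
    """Draw a topic per word from theta_d, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = random.choices([1, 2], weights=theta[doc_id])[0]
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(random.choices(vocab, weights=probs)[0])
    return words

print(generate_doc("DOC1"))  # only topic-1 words
print(generate_doc("DOC2"))  # a mix of both topics
```

The inferential process runs this in reverse: given only the generated documents, recover `topics` and `theta`.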
Improving RNNs - Recap
Techniques:
Gradient clipping: Scales the gradient down if its norm exceeds a threshold, preventing exploding gradients from destabilizing training.
Bidirectional RNNs & Multi-Layer RNNs (or Stacked RNNs)
Long Short-Term Memory (LSTM) RNNs: An RNN with additional "memory" for long-term information, controlled by three dynamic gates for reading, writing, and erasing from that memory cell. LSTMs became the default for most NLP tasks.
Gated Recurrent Unit (GRU): Similar to LSTM but simpler (only two gates) and more efficient, although potentially less accurate than LSTM in some cases.
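Gradient clipping from the list above can be sketched in a few lines (a minimal norm-based version; deep learning frameworks ship their own utilities for this):

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm.
    The direction is preserved; only the magnitude is capped."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])          # norm 5
clipped = clip_gradient(g, 1.0)   # rescaled to norm 1, same direction
print(clipped, np.linalg.norm(clipped))
```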
Document Embeddings - Recap
Converting documents to numerical data.
Use of pre-trained language models (e.g., BERT, RoBERTa, ALBERT, S-BERT).
It's important to consider infrastructure and needs when selecting a variant.
Multilingual pre-trained models (e.g., XLM-R, mBERT, mBART, IndicBERT).
Ensuring that the data language is included in the multilingual language model is crucial.
Attention is all you need
The Transformer in NLP is a novel architecture designed to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017).
Example:
The paragraph about NASA's DART mission illustrates the self-attention mechanism by identifying key words in each sentence to define context and the main idea.
RNN-based Seq2Seq approaches became popular for various NLG tasks since 2014 and were enhanced with the attention mechanism in 2015.
Drawbacks of RNN-based Seq2Seq models:
Long-range dependencies were not captured effectively.
Training was difficult due to the nature of RNNs (like LSTM), which process tokens sequentially and therefore cannot be parallelized across time steps.
From Basic NLP Models…
Two distinct types of NLP models:
Sequence labeling models: Take a sequence of units (e.g., words) and assign a label (e.g., a POS tag) to each unit.
Language models: Take a sequence of units (e.g., words) and estimate the probability of a given sequence in the domain in which the model is trained.
Sequence to Sequence (Seq2Seq) Models
In a Seq2Seq model, an encoder converts a sequence of units (e.g., a sentence) into an internal representation, and a decoder generates a sequence of units (e.g., another sentence) from the internal representation.
Machine translation is the most popular application of Seq2Seq models.
The Seq2Seq architecture is applicable to numerous NLP tasks, such as summarization and dialogue flow prediction.
Encoder-Decoder Architectures
Encoder: The first half of an encoder-decoder model turns a sequence (e.g., natural language text) into a lower-dimensional representation.
Unlike sequence labeling models, only the final hidden state of the RNN is needed; it is passed to the decoder to generate the target sentence.
A bidirectional RNN can be used as an encoder, with the final sentence representation being a concatenation of the output of the forward and backward layers.
A multilayer RNN can also be used as an encoder, with the sentence representation being the concatenation of the output of each layer.
Encoder-Decoder Architectures
Decoder: The second half of an encoder-decoder architecture turns a vector back into human-readable text. A sequence decoder takes an input from the encoder and generates text from left to right.
An RNN can be used as a decoder, including a multilayer RNN.
However, a decoder cannot be bidirectional (i.e., you cannot generate a sentence from both sides).
Enhancements - Bucketing
Input sequences can have different lengths, which can add a large number of pad tokens to short sequences, making computation expensive and slowing down training.
Bucketing sorts the sequences by length so that each batch contains sequences of similar length, reducing the amount of padding needed per batch.
Some deep learning frameworks provide bucketing tools to suggest optimal buckets for input data.
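A minimal sketch of the bucketing idea (a hypothetical helper; real frameworks provide more sophisticated bucket-suggestion tools):

```python
# Group sequences of similar length into batches so each batch
# needs little padding.
def make_buckets(sequences, batch_size):
    ordered = sorted(sequences, key=len)          # sort by length
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

seqs = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
for batch in make_buckets(seqs, batch_size=2):
    # Pad only up to the longest sequence *within* the batch.
    max_len = max(len(s) for s in batch)
    padded = [s + ["<pad>"] * (max_len - len(s)) for s in batch]
    print(padded)
```

Here each batch pads at most one token, whereas padding everything to the global maximum length (4) would waste far more computation.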
Enhancements - Attention
Longer input sequences (e.g., documents) tend to produce less precise vector representations.
In Seq2Seq, the sentence representation vector is limited by the dimensionality of the LSTM layer.
The encoder tries to compress all information in the source sentence into a fixed-length vector, and the decoder tries to restore the target sentence from that vector.
The size of the vector is fixed, regardless of the length of the source sentence.
Instead of relying on a single, fixed-length vector, the decoder can refer back to specific parts of the encoder as it generates the target tokens.
Attention is a mechanism in ANNs that focuses on a specific part of the input and computes its context-dependent summary.
Attention
Attention is like having a key-value store containing all of the input’s information, which can be looked up with a query (the current context).
The stored values are typically a list of vectors (one for each token), associated with corresponding keys.
This increases the size of the "memory" the decoder can refer to when making a prediction.
The resulting summary is different at each time step and is passed to the decoder.
Two common types of attention:
Encoder-decoder attention (or cross-attention): Used in both RNN-based Seq2Seq models and the Transformer.
Self-attention: Used in the Transformer.
Self-Attention
For RNN-based Seq2Seq models, the attended input (keys and values) is the encoder hidden states, while the context (the query) is the decoder hidden state.
The core idea of the Transformer's self-attention is that it creates a summary of the input, but the context in which the summary is created is also the input itself.
Unlike self-attention, RNN-based models (even with encoder-decoder attention) process the input sequentially, making it progressively more difficult to capture long-range dependencies between tokens as the sentence gets longer.
Another key difference is that self-attention repeats this process for every single token in the input (random access), which produces a new set of embeddings for the input, one for each token.
Self-Attention
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (the embeddings):
Query vector (Q)
Key vector (K)
Value vector (V)
It views the encoded representation of the input as a set of key-value pairs (K, V), with keys of dimension d_k.
For each word, create a query vector q_i, key vector k_i, and value vector v_i. These vectors are created by multiplying the embedding of the word by three matrices learned during training.
Calculate the self-attention score for each word by taking the dot product of its query vector with the key vector of every word.
Divide the scores by the square root of the dimension of the Key vectors to stabilize gradients.
Pass the result through a SoftMax operation.
Multiply each value vector by the SoftMax score.
Sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
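The six steps above can be sketched in numpy (a minimal single-head version with random projection matrices standing in for the learned ones; shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Project to Q/K/V, score, scale by sqrt(d_k), softmax, weighted sum."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # one q/k/v vector per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # dot products, scaled
    weights = softmax(scores)               # attention distribution per token
    return weights @ V                      # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                 # 3 tokens, embedding size 4
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one new 2-d embedding per token: (3, 2)
```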
Self-Attention – Numerical Problem
Given a self-attention mechanism with: Q = [1,0], K1 = [1,0], K2 = [0,1], V1 = [2,2], V2 = [1,2], and dimension (d_k) = 2. Calculate the Attention output for Q, assuming \sqrt{2} \approx 1.4 and exp(0.7) \approx 2.
Step 1 - [Dot products of Q with each key]
Dot Product: Q . K1 = (1 * 1) + (0 * 0) = 1
Dot Product: Q . K2 = (1 * 0) + (0 * 1) = 0
Step 2 - [Scale dot products by sqrt(d_k)]
Scale factor: \sqrt{2} = 1.4. Therefore, scaled scores = [1/1.4, 0/1.4] = [0.7, 0]
Step 3 – [Apply Softmax]
exp(0.7) \approx 2 (given), and exp(0) \approx 1. Therefore, Softmax denominator = 3, and Attention weights = [2/3, 1/3]
Step 4 - Attention output for Q
(2/3 * V1) + (1/3 * V2) = [4/3, 4/3] + [1/3, 2/3] = [5/3, 6/3] = [5/3, 2]
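The worked example can be checked numerically. Without the rounding shortcuts (√2 ≈ 1.4, exp(0.7) ≈ 2), the exact result lands very close to [5/3, 2]:

```python
import numpy as np

Q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])   # K1, K2 as rows
V = np.array([[2.0, 2.0], [1.0, 2.0]])   # V1, V2 as rows

scores = K @ Q / np.sqrt(2)               # [1/sqrt(2), 0]
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
output = weights @ V
print(output)  # close to [5/3, 2] from the worked example
```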
Multi-Head Self-Attention
Multi-Head Attention is a refinement of the Self-attention mechanism that allows the model to focus on different positions or sub-spaces.
Multi-head attention is essentially attention repeated several times in parallel.
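A minimal sketch of that idea: each head runs the same attention computation on its own lower-dimensional subspace, and the outputs are concatenated (random projections stand in for learned ones; real implementations also apply a final output projection, omitted here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Run n_heads independent attentions on d_model/n_heads-sized
    subspaces, then concatenate the head outputs."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)          # each head: (tokens, d_head)
    return np.concatenate(heads, axis=-1)  # back to (tokens, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                # 3 tokens, d_model = 8
out = multi_head_attention(X, n_heads=2, rng=rng)
print(out.shape)  # (3, 8)
```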
Positional Encoding
Attention layers see their input as a set of vectors, with no sequential order.
Positional Encoding (PE) is needed because the Transformer model contains no recurrent or convolutional units.
PEs are used to account for the order of the words in the input sequence.
The PE vector is added to the embedding vector.
Embeddings represent a token in a d-dimensional space where tokens with similar meaning will be closer to each other.
Embeddings do not encode the relative position of tokens in a sentence.
After adding PE, tokens will be closer to each other based on the similarity of their meaning and their position in the sentence in the d-dimensional space.
The formula for Positional Encoding is given by:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
For example: for the first word in an input sentence (pos=0), and for d_model=4 (the size of the embedding), it would be:
PE(pos=0) = [sin(0/10000^(0/4)), cos(0/10000^(0/4)), sin(0/10000^(2/4)), cos(0/10000^(2/4))] = [0,1,0,1].
Then, add PE(pos=0) to the embedding E:
E’=PE(pos=0)+E.
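The sinusoidal formula can be implemented directly and checked against the pos=0 example:

```python
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal PE: even indices 2i use sin, odd indices 2i+1 use cos,
    both with angle pos / 10000^(2i / d_model)."""
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = np.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(angle)
    return pe

print(positional_encoding(0, 4))  # [0, 1, 0, 1] as in the example
```

Adding this vector to the embedding E gives E' = PE(pos) + E, so tokens carry both meaning and position.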
Positional Encoding – Numerical Problem
In a Transformer model, positional encoding (PE) is calculated with sine and cosine functions to give the model information about the position of words in the sequence. Imagine a word embedding dimension (d_model) of 5. For the third word in the input sentence (pos = 2, since positions are zero-indexed), what components must be calculated?
PE(2, 0) = sin(2 / (10000^(0 / 5)))
PE(2, 1) = cos(2 / (10000^(0 / 5)))
PE(2, 2) = sin(2 / (10000^(2 / 5)))
PE(2, 3) = cos(2 / (10000^(2 / 5)))
PE(2, 4) = sin(2 / (10000^(4 / 5)))
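The components for this problem can be computed numerically, assuming the standard sinusoidal formula (even index 2i uses sin, odd index 2i+1 uses cos, both with exponent 2i/d_model):

```python
import numpy as np

pos, d_model = 2, 5      # third word (pos = 2), embedding size 5
pe = np.zeros(d_model)
for i in range(0, d_model, 2):
    angle = pos / (10000 ** (i / d_model))
    pe[i] = np.sin(angle)
    if i + 1 < d_model:
        pe[i + 1] = np.cos(angle)
print(pe)  # the five components PE(2, 0) ... PE(2, 4)
```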
Transformer
The Transformer model applies self-attention repeatedly to the inputs to gradually transform them, similar to multilayer RNNs.
The Transformer model generates positional encoding embeddings and adds them to the word embeddings. These are either produced by a fixed mathematical function or learned during training, one per position.
The Transformer's cross-attention mechanism works much like the encoder-decoder attention of RNN-based Seq2Seq models.
If you look at the Transformer as a black box, the way it produces the target sequence is like the RNN Seq2Seq.
Popular Transformer Language Models
Recent Transformer models are complex, with hundreds of millions (even billions) of parameters, and are trained with huge datasets (tens of gigabytes of text), requiring significant GPU resources.
Implementation and pre-trained model parameters for these models are usually made publicly available.
These are also known as foundation models.
Some popular transformer language models:
Transformer-XL (eXtra-Long)
GPT (Generative Pre-Trained Transformers) [autoregressive decoder only]
XLM (cross-lingual Language Model)
variants: XLM-R, InfoXLM, XLM-R-XL, XLM-R-XXL
BERT (Bidirectional Encoder Representations from Transformers)
variants: DistilBERT, SentenceBERT, ModernBERT
LLaMA (Large Language Model Meta AI) [autoregressive decoder only]
Qwen and DeepSeek [autoregressive decoder only]
Transfer Learning
In traditional machine learning, NLP models are trained on a per-task basis and are only useful for the type of task they are trained for.
Word embeddings are trained on an independent, large textual corpus without any training signals and can be used as input to models for different tasks.
Transfer Learning
Transfer learning in ML is a process where different techniques are used to improve the performance of a model in a task using data and/or models trained in a different task.
Transfer learning always consists of two or more steps: a machine learning model is first trained for one task, which is then adjusted and used in another. The first step is called pretraining, and the second step is called adaptation.
If the same model is used for both tasks, the second step is called fine-tuning, because you are tuning the same model slightly but for a different task.
BERT
Word embeddings are powerful but cannot take context into account.
Contextualized embeddings transform the entire sentence into a series of vectors that take into account the context.
Notable attempts in contextualized embeddings include CoVe and ELMo, but the biggest breakthrough was achieved by BERT, a Transformer-based pretrained language model.
BERT contextualizes the input through a series of Transformer encoder layers, inheriting all the strengths of the Transformer.
Its self-attention mechanism enables it to "random access" over the input and capture long-term dependencies among input tokens.
Unlike traditional language models, the Transformer can take into account the context in both directions.
Pretraining BERT
In pretraining BERT, we are interested not in the prediction task itself but in the word embeddings derived as a by-product of learning the model's parameters.
This type of training paradigm where the data itself provides training signals is called self-supervised learning or simply self-supervision.
For bidirectional language models such as BERT, you cannot simply predict surrounding words based on the embeddings of the target word (i.e., unidirectional), because the input for the prediction (contextualized embeddings) also depends on what comes before and after the input.
BERT can be trained using:
Masked language model (MLM): Words are dropped (masked) randomly in a given sentence, and the model predicts the dropped word.
Next sentence prediction (NSP): Two sentences — A and B — are chosen for pre-training, with 50% of the time B is the actual next sentence that follows A, and 50% of the time B is a random sentence from the corpus.
BERT uses the Transformer to encode the input and then uses a feedforward and a softmax layer to derive a probability distribution over possible words that can fill in that blank, and uses regular cross-entropy to train the model.
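The MLM masking step can be sketched in plain Python (a simplified version: the original BERT also sometimes keeps the word or substitutes a random one; the sentence and masking rate here are illustrative):

```python
import random

random.seed(1)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK]; the model must predict
    the originals. The data itself provides the training signal."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)       # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)      # no loss at unmasked positions
    return masked, targets

tokens = "the rocket reached a stable orbit".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
```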
Adapting BERT
At the second stage of transfer learning, a pretrained model is adapted to the target task so that the latter can leverage signals learned by the former.
The model "inherits" the model weights learned through pretraining.
Two main ways to adapt BERT to individual downstream tasks:
Fine-tuning: The ANN architecture is slightly modified so that it can produce the type of predictions for the task in question, and the entire network is continuously trained on the training data for the task so that the loss function is minimized.
Feature extraction: Features are extracted as a sequence of contextualized embeddings produced by the final layer of BERT. These vectors can be fed to another machine learning model as features to make predictions.
The pretrained model's weights are not changed (unless you continue its pretraining for a specific purpose), and the second ML model doesn't necessarily have to be a neural network.
Bidirectional Transformers (BERT) and Generative Pre-trained Transformers (GPT)
Scaling laws show that Transformer performance improves predictably with scale, and sufficiently large models exhibit emergent properties.
Open-source models are quickly catching up with closed-source models.
Larger models train faster: BERT was trained for 3 epochs, while GPT-3 was trained for less than one epoch.
Transformers for Other Areas
Recently, the Transformers architecture has been adapted to other areas such as Computer Vision.
Some works are starting to work with Transformers for areas like Time Series and Recommender Systems.
Language Models for Other Languages
Code-mixed
Domain-specific/Diachronic
Dialects
Creole Languages
Large Scaled Models (XLM-R variants)
Indian Languages
Group Coursework Token Classification – Abbreviation and Long-form Detection
Biomedical domain data
Identify a label for each word like Named Entity Recognition.
Labels – AC, LF, O in the BIO schema. Therefore, B-AC, B-LF, I-LF, O.
No I-AC label is needed, in essence, as abbreviations are confined to a single token in most cases.
Classifying each input token into either of the above classes
Immediate Next Steps
Form a team/group of 4 members
No change to member number unless extenuating circumstances
Discuss implementations by week 6 and submit group declaration PDF by 13th March.
Caution
Please read the coursework document carefully and divide tasks upon discussion