Architecture of Transformers
Important Literature
Attention is All You Need paper: This foundational paper introduced the transformer architecture, which has become a cornerstone in modern NLP. It details the self-attention mechanism that allows the model to weigh the importance of different parts of the input sequence.
GPT paper: This paper established the transformer decoder model, demonstrating its effectiveness in language generation tasks. The model is pre-trained on a large corpus of text and then fine-tuned for specific tasks.
BERT paper: This paper established the transformer encoder model, which is particularly useful for tasks that require understanding the entire input sequence, such as text classification and question answering.
Transfer Learning
Used in practice; ties into architectural decisions: Transfer learning is a crucial technique that leverages pre-trained models to improve performance on downstream tasks. It significantly reduces training time and data requirements.
Involves two stages:
Pre-training: Training a model on a large, general dataset (source domain) to capture broad patterns and knowledge.
Fine-tuning: Further training the pre-trained model on a smaller dataset that covers a potentially different domain (target domain) to adapt the model to specific tasks.
Analogy to human learning:
Years of education provide general knowledge, which serves as a foundation for understanding new concepts.
Specialized job uses specific knowledge based on general knowledge, applying the learned principles to specific tasks.
The goal of pre-training is to have a model that generalizes well to specialized tasks, enabling it to perform effectively in various scenarios.
Before fine-tuning, the model is initialized with pre-trained parameters, inheriting knowledge from the source domain, providing a strong starting point for adaptation.
Advantage: Pre-training is computationally expensive but only needs to be done once. The pre-trained model can be copied and fine-tuned for various tasks, making it highly efficient for multiple applications.
Fine-tuning is computationally cheaper because of the smaller dataset and better starting point, allowing for rapid adaptation to new tasks.
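The two-stage idea above can be sketched in a few lines of Python. This is a deliberately tiny illustration, not a real training pipeline: the "pre-trained" feature extractor is a hypothetical stand-in function, and fine-tuning updates only a small linear head on top of it, mirroring the cheap-adaptation point above.

```python
# Hypothetical frozen "pre-trained" feature extractor: in a real system this
# would be a large network trained once on a big source-domain corpus.
def pretrained_features(x):
    return [x, x * x]  # frozen: never updated during fine-tuning

# Fine-tuning: train only a small linear head on top of the frozen features.
w, b = [0.0, 0.0], 0.0
lr = 0.05

# Tiny labeled target-domain dataset (toy): label 1 if x > 1, else 0.
data = [(0.5, 0), (2.0, 1), (0.2, 0), (1.5, 1)]

for _ in range(500):
    for x, y in data:
        f = pretrained_features(x)
        pred = w[0] * f[0] + w[1] * f[1] + b
        err = pred - y
        # Gradient step on the head only; the extractor stays fixed.
        w[0] -= lr * err * f[0]
        w[1] -= lr * err * f[1]
        b -= lr * err
```

Because only the tiny head is trained, adaptation is fast even though the (here pretend) extractor could have billions of parameters.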
Illustration of Transfer Learning
Source domains: ImageNet (natural images), Quick Draw (hand-drawn) are used to provide a diverse range of visual data for pre-training.
Target domains: MS COCO (natural images), hand-drawn digits, traffic signs (combination of natural images and concise shapes) represent specific tasks that the model is adapted to.
Transfer learning combines knowledge from source domains to apply to target domains, enabling the model to leverage common features and patterns.
In NLP: Models pre-trained on fiction, poetry, screenplay, and biographies gain a broad understanding of language nuances and styles.
Fine-Tuning in NLP
Large corpus (sequence of text) is easy to obtain, providing ample data for pre-training language models.
Annotations are expensive, making unsupervised pre-training a cost-effective approach.
In pre-training, avoid large-scale annotated data by leveraging the inherent structure of language.
The corpus itself contains information on syntax and semantics, which the model learns during pre-training.
Language model trained to predict upcoming or masked-out words in an incomplete sequence captures contextual relationships.
Example: "I ate an **** today" (apple) demonstrates how the model learns to predict missing words based on context.
After pre-training, fine-tune with annotated data for downstream tasks like sentiment analysis, allowing the model to adapt to specific applications.
Downstream Tasks
Generating additional text, such as completing a story or creating new content.
Answering questions based on provided context or general knowledge.
Analyzing sentiment to determine the emotional tone of a text.
Analyzing logic between sentences to understand relationships and coherence.
Question:
Fine-Tuning
For fine-tuning, is the same part of the model trained?
In most cases (over 90%), the set of model parameters being fine-tuned is different from the set of model parameters being pre-trained, allowing for targeted adaptation.
Pre-training aims to train all parameters, which can be hundreds of billions in LLMs, to capture a comprehensive understanding of the data.
Large parameter sets require sizable datasets to ensure effective training and generalization.
Grokking: A phenomenon where the dataset size must surpass a critical point before the model exhibits humanlike generalization, indicating a critical mass of knowledge acquisition.
After obtaining a grokked model, fine-tuning with a dataset below the critical size can "ungrok" the model, leading to a loss of generalization.
Researchers combat this by not training the same set of parameters as pre-training, using a smaller set of parameters, focusing on task-specific adaptations.
Constraining the model during fine-tuning to prevent destroying the knowledge gained during pre-training maintains the model's general capabilities.
Original Transformer Architecture
Designed for machine translation, enabling the conversion of text from one language to another.
Encoder interprets the source language, capturing its meaning and context.
Decoder produces the target language, generating the translated text.
Each sub-module is a capable model, contributing to the overall performance of the architecture.
BERT: Transformer encoder model for language understanding, only the left half of the architecture, excelling in tasks like sentiment analysis and question answering.
GPT: Transformer decoder model for language generation, only the right half of the architecture, ideal for tasks like text completion and creative writing.
GPT lacks cross-attention because it's unnecessary when only one language is involved, simplifying the architecture for language generation tasks.
Transformer Layers
Encoder self-attention layer: Input is output of the previous encoder layer; each output position attends to all input positions, allowing the model to capture relationships between different parts of the input.
Encoder-decoder attention layer: Input keys and values are from the encoder layer; input queries are from the previous decoder layer; each output position attends to all input positions, enabling the decoder to focus on relevant parts of the input.
Decoder self-attention layer: Input is output of the previous decoder layer; output position attends to input position up to its own position, preventing leftward information flow via masking, ensuring that the model only uses past information for prediction.
Attention Masking
Original attention score (encoder): Every item attends to every item, allowing the model to capture all possible relationships.
Decoder: Prevent information leakage from future to present by setting attention scores to negative infinity, ensuring that the model does not use future information for prediction.
Masking ensures values from later positions do not contribute to the information processing of earlier tokens, maintaining the integrity of the sequential processing.
Without masking, all attention scores are present, which is suitable for analyzing information, allowing the model to consider all relationships.
With masking, only look at what comes before for language generation, ensuring that the model generates text based on past information.
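The masking described above can be sketched in plain Python (the attention scores are made up for illustration): scores for future positions are set to negative infinity before the softmax, so those positions receive exactly zero weight.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy pre-softmax attention scores for a 4-token sequence:
# scores[i][j] is how much position i attends to position j.
scores = [[1.0, 0.5, 0.2, 0.8],
          [0.3, 1.0, 0.4, 0.1],
          [0.9, 0.2, 1.0, 0.6],
          [0.1, 0.7, 0.3, 1.0]]

# Causal mask: set scores for future positions (j > i) to -inf so that,
# after the softmax, those positions receive exactly zero attention weight.
masked = [[scores[i][j] if j <= i else float("-inf") for j in range(4)]
          for i in range(4)]

weights = [softmax(row) for row in masked]
# Row 0 can only attend to itself; row 3 attends to all four positions.
print(weights[0])  # [1.0, 0.0, 0.0, 0.0]
```

Dropping the mask (keeping all scores) recovers the encoder behavior, where every position attends to every other.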
Positional Encoding
Adds information about token order, enabling the model to understand the sequence of words.
Another set of vector representations added on top of the word embeddings, providing additional information about the position of each word.
Words have vector representations for semantic meaning; positions also have meaning, indicating their role in the sentence.
Create another embedding vector for position and do an element-wise summation with the word vector, combining the semantic and positional information.
Positional encodings can be fixed or learned, offering flexibility in how the model incorporates positional information.
Fixed positional encodings are manually engineered functions on word positional index and the values dimensional index in the embedding vector, using mathematical functions to represent position.
Learned positional encodings are trained jointly with the language model and stored in a matrix, allowing the model to learn the optimal representation of position.
Modern transformer networks are adaptable between fixed and learned positional encodings, providing flexibility in how positional information is incorporated.
*Example of using fixed positional encoding:
*Columns correspond to positions and rows to different dimensions of the embedding vector.
*Values are determined by sine wave functions.
*The encoding serves as a marker for each position; the model then learns to use that marker to extract positional information.
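The fixed sinusoidal scheme from the original transformer paper can be sketched as follows; the word vector and the 4-dimensional embedding size are toy values for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Fixed sinusoidal encoding from the original transformer paper:
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))"""
    pe = []
    for i in range(d_model):
        # Even/odd dimensions share a frequency; even use sin, odd use cos.
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Toy 4-dimensional word embedding; the positional vector is added
# element-wise, so position and meaning share the same vector space.
word_vec = [0.1, 0.2, 0.3, 0.4]
pos_vec = positional_encoding(5, 4)          # encoding for position 5
token_input = [w + p for w, p in zip(word_vec, pos_vec)]
```

Note that position 0 always encodes to [0, 1, 0, 1, …], and each dimension oscillates at a different frequency, which is what lets the model tell positions apart.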
GPT Model
Transformer decoder.
Attends to preceding context using attention masking, ensuring that the model only uses past information for prediction.
Pre-trained to predict the next word in a sequence (language modeling), enabling the model to generate coherent text.
The objective maximizes the likelihood of the correct word (the log-probability of each word given its preceding context), guiding the model to predict the most probable sequence.
The loss is essentially the sum of the log-probability of each word given the preceding context, formally L(w) = \sum_i \log P(w_i \mid w_{i-k}, \ldots, w_{i-1}). This is the log-probability of a Markov chain with memory, quantifying the model's ability to predict the next word.
Here w is the sequence of words, and w_i is a single word, an individual element of the sequence.
k is the context size and determines how much preceding text the transformer remembers, controlling the context window for prediction.
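As a toy illustration of this objective, the log-likelihood of a sentence is the sum of log-probabilities of each word given its context. The probability table below is made up, and it conditions on only the single previous word; a real model conditions on up to k previous tokens.

```python
import math

# Made-up next-word model: P(word | previous word). A real GPT conditions
# on up to k previous tokens, not just one.
probs = {
    ("i", "ate"): 0.4,
    ("ate", "an"): 0.5,
    ("an", "apple"): 0.3,
}

sentence = ["i", "ate", "an", "apple"]

# Log-likelihood: sum of log P(w_i | preceding context).
log_likelihood = sum(
    math.log(probs[(sentence[i - 1], sentence[i])])
    for i in range(1, len(sentence))
)
print(round(log_likelihood, 3))  # -2.813
```

Training maximizes this quantity (equivalently, minimizes its negation, the cross-entropy loss) over the whole corpus.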
Attach a fully connected layer to the end for fine-tuning to other tasks, adapting the model for specific applications.
Can keep doing language modeling while fine-tuning, combining general language knowledge with task-specific adaptations.
Diagram: Model is masked multi-head attention with residual links and feed-forward with residual links, illustrating the architecture of the model.
Text prediction for language modeling and classification for downstream tasks, highlighting the model's versatility.
Residual Links
Bypass links common in modern deep learning, originating from ResNet, enabling the model to train more effectively.
A copy of the activations bypasses the sub-layer and is added back to its output, allowing the model to reuse information from previous layers.
Reduce the effective depth of the model by allowing activations to skip layers, which makes optimization easier and improves training efficiency.
Regularize the loss landscape, making the optimization process more stable.
Stabilize the gradients for the parameters, preventing vanishing or exploding gradients.
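The residual link itself is a one-line idea; here is a minimal sketch, with a toy function standing in for the attention or feed-forward sub-layer.

```python
# Minimal sketch of a residual (bypass) link: the sub-layer's output is
# added back to its input, so the layer only has to learn a correction.
def sublayer(x):
    # Toy stand-in for an attention or feed-forward sub-layer.
    return [0.1 * v for v in x]

def residual_block(x):
    # A copy of the input bypasses the sub-layer and is added back.
    return [xi + si for xi, si in zip(x, sublayer(x))]

out = residual_block([1.0, -2.0, 3.0])
```

If the sub-layer outputs zeros, the block reduces to the identity, which is what keeps gradients flowing through very deep stacks.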
Decoder of the transformer model.
These diagrams show how the input text is formatted.
Examples of input formatting:
*Text, wrapped in special symbols/tokens, is fed to the transformer model and classified with the linear classifier.
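A sketch of this kind of input formatting; the special token names below are illustrative placeholders, not the exact symbols used in the GPT paper.

```python
# Hypothetical special tokens marking the start, a separator for paired
# text, and the position whose representation feeds the linear classifier.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def format_classification(text):
    # Single-text tasks: wrap the text in start/extract tokens.
    return f"{START} {text} {EXTRACT}"

def format_entailment(premise, hypothesis):
    # Paired-text tasks: join the two pieces with a delimiter token.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

print(format_classification("great movie"))  # <start> great movie <extract>
```

The transformer's output at the extract token is then passed to the linear classifier, so one pre-trained model serves many task formats.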
BERT
Transformer encoder.
Attends to both preceding and succeeding contexts, allowing the model to capture bidirectional relationships in the text.
Pre-trained with masked language modeling and next sentence prediction, enabling the model to understand and generate text.
Masked language modeling predicts words masked out anywhere in a sequence, encouraging the model to learn contextual relationships.
Next sentence prediction predicts if one given sentence follows another, enabling the model to understand discourse structure.
Construct positive examples by taking consecutive sentences from the corpus; negative examples are non-adjacent sentences, training the model to distinguish coherent text.
Segment embedding to distinguish the two sentences, providing additional information to the model.
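The two pre-training data constructions above can be sketched as follows; the corpus and the masking strategy are simplified for illustration (real BERT masks about 15% of tokens, with additional replacement rules).

```python
import random

random.seed(0)
corpus = [
    "the cat sat on the mat",
    "it purred quietly",
    "stocks fell sharply today",
    "investors were worried",
]

# Masked language modeling: hide a random token; the model must recover it.
def mask_sentence(sentence, mask_token="[MASK]"):
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = mask_token
    return " ".join(tokens), target

# Next sentence prediction: positive pairs are consecutive sentences,
# negative pairs are non-adjacent ones.
positive = (corpus[0], corpus[1], True)
negative = (corpus[0], corpus[3], False)

masked, answer = mask_sentence(corpus[0])
print(masked, "->", answer)
```

Both objectives need no human annotation: the labels (the hidden word, the true next sentence) come from the corpus itself.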
Fine-tuning is straightforward because it can model single and paired text downstream tasks, adapting the model for various applications.
If only a single sentence is given, the pair is degenerate: the second sentence is empty and has no effect, so the same input format still applies.
BERT diagram: Masked language modeling and classifier token for next sentence prediction, illustrating the architecture of the model.
The classifier token yields an embedding vector, which is then fed to a linear classifier that performs binary classification.
Embeddings in Transformer
Embeddings are a useful tool for adding information to the model's input.
*Types of embeddings:
*1) Token embeddings: store semantic meaning.
*2) Positional embeddings: encode word order.
*3) Segment embeddings: distinguish paired sentences.
Key Differences between BERT and GPT
BERT is bidirectional and used for text classification or understanding, excelling in tasks like sentiment analysis and question answering.
GPT is unidirectional and used for text generation, ideal for tasks like text completion and creative writing.