Architecture of Transformers

Important Literature
  • Attention is All You Need paper: This foundational paper introduced the transformer architecture, which has become a cornerstone in modern NLP. It details the self-attention mechanism that allows the model to weigh the importance of different parts of the input sequence.

  • GPT paper: This paper established the transformer decoder model, demonstrating its effectiveness in language generation tasks. The model is pre-trained on a large corpus of text and then fine-tuned for specific tasks.

  • BERT paper: This paper established the transformer encoder model, which is particularly useful for tasks that require understanding the entire input sequence, such as text classification and question answering.

Transfer Learning
  • Used in practice; ties into architectural decisions: Transfer learning is a crucial technique that leverages pre-trained models to improve performance on downstream tasks. It significantly reduces training time and data requirements.

  • Involves two stages:

    • Pre-training: Training a model on a large, general dataset (source domain) to capture broad patterns and knowledge.

    • Fine-tuning: Further training the pre-trained model on a smaller dataset that covers a potentially different domain (target domain) to adapt the model to specific tasks.

  • Analogy to human learning:

    • Years of education provide general knowledge, which serves as a foundation for understanding new concepts.

    • Specialized job uses specific knowledge based on general knowledge, applying the learned principles to specific tasks.

  • The goal of pre-training is to have a model that generalizes well to specialized tasks, enabling it to perform effectively in various scenarios.

  • Before fine-tuning, the model is initialized with pre-trained parameters, inheriting knowledge from the source domain, providing a strong starting point for adaptation.

  • Advantage: Pre-training is computationally expensive but only needs to be done once. The pre-trained model can be copied and fine-tuned for various tasks, making it highly efficient for multiple applications.

  • Fine-tuning is computationally cheaper because of the smaller dataset and better starting point, allowing for rapid adaptation to new tasks.

Illustration of Transfer Learning
  • Source domains: ImageNet (natural images), Quick Draw (hand-drawn) are used to provide a diverse range of visual data for pre-training.

  • Target domains: MS COCO (natural images), hand-drawn digits, traffic signs (combination of natural images and concise shapes) represent specific tasks that the model is adapted to.

  • Transfer learning combines knowledge from source domains to apply to target domains, enabling the model to leverage common features and patterns.

  • In NLP: Models pre-trained on fiction, poetry, screenplay, and biographies gain a broad understanding of language nuances and styles.

Fine-Tuning in NLP
  • Large corpus (sequence of text) is easy to obtain, providing ample data for pre-training language models.

  • Annotations are expensive, making unsupervised pre-training a cost-effective approach.

  • Pre-training avoids the need for large-scale annotated data by leveraging the inherent structure of language.

  • The corpus itself contains information on syntax and semantics, which the model learns during pre-training.

  • Language model trained to predict upcoming or masked-out words in an incomplete sequence captures contextual relationships.

  • Example: "I ate an **** today" (apple) demonstrates how the model learns to predict missing words based on context.

  • After pre-training, fine-tune with annotated data for downstream tasks like sentiment analysis, allowing the model to adapt to specific applications.
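As a toy illustration of the masked-word example above, a lookup over hypothetical co-occurrence counts (not a real language model, and the counts are invented) can "fill in the blank" from context:

```python
from collections import Counter

# Hypothetical co-occurrence counts gathered from an unannotated corpus.
# A real language model learns such statistics implicitly in its parameters.
context_counts = {
    ("ate", "an"): Counter({"apple": 50, "orange": 30, "egg": 5}),
}

def predict_masked(left_context, counts):
    """Return the most likely word for the blank given the two preceding words."""
    return counts[left_context].most_common(1)[0][0]

filled = predict_masked(("ate", "an"), context_counts)  # "I ate an ____ today"
```

The point is only that the corpus itself supplies the supervision: the surrounding words are the input and the missing word is the label, with no human annotation required.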

Downstream Tasks
  • Generating additional text, such as completing a story or creating new content.

  • Answering questions based on provided context or general knowledge.

  • Analyzing sentiment to determine the emotional tone of a text.

  • Analyzing logic between sentences to understand relationships and coherence.

  • Question:

Fine-Tuning
  • For fine-tuning, is the same part of the model trained?

  • In most cases (over 90%), the set of model parameters being fine-tuned is different from the set of model parameters being pre-trained, allowing for targeted adaptation.

  • Pre-training aims to train all parameters, which can be hundreds of billions in LLMs, to capture a comprehensive understanding of the data.

  • Large parameter sets require sizable datasets to ensure effective training and generalization.

  • Grokking: A phenomenon where the dataset size must surpass a critical point for the model to shift from memorization to genuine generalization, indicating a critical mass of knowledge acquisition.

  • After obtaining a grokked model, fine-tuning with a dataset below the critical size can "ungrok" the model, leading to a loss of generalization.

    • Researchers combat this by not training the same set of parameters as pre-training, using a smaller set of parameters, focusing on task-specific adaptations.

    • Constraining the model during fine-tuning to prevent destroying the knowledge gained during pre-training maintains the model's general capabilities.
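A minimal sketch of this idea in pure Python, with made-up numbers: the pre-trained weight is treated as a frozen feature extractor, and only a small task head is updated during fine-tuning.

```python
# Frozen pre-trained weight acting as a fixed feature extractor (made-up value).
pretrained_w = 2.0
# New task-specific head; the ONLY parameter updated below.
head_w = 0.0

# Tiny fine-tuning dataset for the target task y = 6x, so the ideal head is 3.0
# (head_w * pretrained_w * x = 6x when head_w = 3.0).
data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]

lr = 0.01
for _ in range(500):
    for x, y in data:
        feat = pretrained_w * x                  # frozen features, never updated
        grad = 2 * (head_w * feat - y) * feat    # d(squared error)/d(head_w)
        head_w -= lr * grad                      # gradient step on the head only
```

Because the pre-trained parameters are never touched, the knowledge they encode cannot be destroyed by the small fine-tuning dataset.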

Original Transformer Architecture
  • Designed for machine translation, enabling the conversion of text from one language to another.

  • Encoder interprets the source language, capturing its meaning and context.

  • Decoder produces the target language, generating the translated text.

  • Each sub-module is a capable model, contributing to the overall performance of the architecture.

  • BERT: Transformer encoder model for language understanding, only the left half of the architecture, excelling in tasks like sentiment analysis and question answering.

  • GPT: Transformer decoder model for language generation, only the right half of the architecture, ideal for tasks like text completion and creative writing.

  • GPT lacks cross-attention because it's unnecessary when only one language is involved, simplifying the architecture for language generation tasks.

Transformer Layers
  • Encoder self-attention layer: Input is output of the previous encoder layer; each output position attends to all input positions, allowing the model to capture relationships between different parts of the input.

  • Encoder-decoder attention layer: Input keys and values are from the encoder layer; input queries are from the previous decoder layer; each output position attends to all input positions, enabling the decoder to focus on relevant parts of the input.

  • Decoder self-attention layer: Input is output of the previous decoder layer; output position attends to input position up to its own position, preventing leftward information flow via masking, ensuring that the model only uses past information for prediction.

  • Attention Masking

    • Original attention score (encoder): Every item attends to every item, allowing the model to capture all possible relationships.

    • Decoder: Prevent information leakage from future to present by setting attention scores to negative infinity, ensuring that the model does not use future information for prediction.

    • Masking ensures values from later positions do not contribute to the information processing of earlier tokens, maintaining the integrity of the sequential processing.

      • Without masking, all attention scores are present, which is suitable for analyzing information, allowing the model to consider all relationships.

      • With masking, only look at what comes before for language generation, ensuring that the model generates text based on past information.
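The masking scheme above can be sketched in plain Python: attention scores for future positions are set to negative infinity, so the softmax assigns them exactly zero weight.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(scores):
    """Mask future positions (j > i) with -inf, then softmax each row."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]

# Uniform scores over 3 positions, just to show the effect of the mask.
attn = causal_attention([[0.0] * 3 for _ in range(3)])
# Position 0 attends only to itself; position 2 attends to all three positions.
```

Since exp(-inf) is 0, later positions contribute nothing to earlier tokens, exactly as the masking bullet describes.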

Positional Encoding
  • Adds information about token order, enabling the model to understand the sequence of words.

  • Another set of vector representations layered on top of each other, providing additional information about the position of each word.

  • Words have vector representations for semantic meaning; positions also have meaning, indicating their role in the sentence.

  • Create another embedding vector for position and do an element-wise summation with the word vector, combining the semantic and positional information.

  • Positional encodings can be fixed or learned, offering flexibility in how the model incorporates positional information.

  • Fixed positional encodings are manually engineered functions of the word's positional index and the dimension index within the embedding vector, using mathematical functions to represent position.

  • Learned positional encodings are trained jointly with the language model and stored in a matrix, allowing the model to learn the optimal representation of position.

  • Modern transformer networks are adaptable between fixed and learned positional encodings, providing flexibility in how positional information is incorporated.

  • Example of using fixed positional encoding:

    • Columns correspond to positions; rows correspond to embedding dimensions.

    • Values are determined by sine-wave functions of varying frequency.

    • The encoding serves as a marker for each position, and the model learns to use that marker to extract positional information.
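The fixed sinusoidal encoding from the original transformer paper can be implemented in a few lines; for simplicity, each row here is a position and each column a dimension (the transpose of the layout described above).

```python
import math

def positional_encoding(num_positions, d_model):
    """Fixed sinusoidal encodings:
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pe = []
    for pos in range(num_positions):
        row = []
        for d in range(d_model):
            angle = pos / (10000 ** (2 * (d // 2) / d_model))
            row.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(4, 8)
# pe[pos] is summed element-wise with the word embedding at position `pos`.
```

Each position gets a distinct wave pattern, which the model learns to interpret as a positional marker.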

GPT Model
  • Transformer decoder.

  • Attends to preceding context using attention masking, ensuring that the model only uses past information for prediction.

  • Pre-trained to predict the next word in a sequence (language modeling), enabling the model to generate coherent text.

  • The objective maximizes the likelihood of the correct word (the log-probability of each word given the pre-existing context), guiding the model to predict the most probable sequence.

  • The loss is the sum of the log-probabilities of the words given their preceding context. Formally, L(w) = \sum_i \log P(w_i \mid w_{i-k}, \ldots, w_{i-1}). This is essentially the log-likelihood of a Markov chain with memory, quantifying the model's ability to predict the next word.

  • w is the sequence of words, and w_i is a single word, representing an individual element of the sequence.

  • k is the context size and determines how much preceding text the transformer attends to, controlling the context window for prediction.

  • Attach a fully connected layer to the end for fine-tuning to other tasks, adapting the model for specific applications.

  • Can keep doing language modeling while fine-tuning, combining general language knowledge with task-specific adaptations.

  • Diagram: Model is masked multi-head attention with residual links and feed-forward with residual links, illustrating the architecture of the model.

  • Text prediction for language modeling and classification for downstream tasks, highlighting the model's versatility.
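The language-modeling objective above is just a sum of log-probabilities; the probability values below are hypothetical stand-ins for what a model would assign to each correct next word.

```python
import math

# Hypothetical probabilities the model assigned to each correct next word.
next_word_probs = [0.5, 0.25, 0.8]

def log_likelihood(probs):
    """L(w) = sum_i log P(w_i | w_{i-k} ... w_{i-1}); pre-training maximizes this."""
    return sum(math.log(p) for p in probs)

ll = log_likelihood(next_word_probs)
# Higher (less negative) values mean the model found the true words more probable.
```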

Residual Links
  • Bypass links common in modern deep learning, originating from ResNet, enabling the model to train more effectively.

  • Take a copy of the activations, let them bypass the sub-layer, and then add the copy back, allowing the model to reuse information from previous layers.

  • Reduce the effective depth of the model because they allow layers to be bypassed, which makes optimization easier and improves training efficiency.

  • Regularize the loss landscape, making the optimization process more stable.

  • Stabilize the gradients for the parameters, preventing vanishing or exploding gradients.
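A residual link is essentially a one-liner: copy the input, run the sub-layer, add the copy back. The `feed_forward` function below is a stand-in for a real transformer sub-layer, with an arbitrary transformation chosen for illustration.

```python
def feed_forward(x):
    """Stand-in for a sub-layer (hypothetical transformation, not a real FFN)."""
    return [v * 0.5 for v in x]

def residual_block(x, sublayer):
    """Residual link: output = x + sublayer(x), so the layer can be bypassed."""
    return [a + b for a, b in zip(x, sublayer(x))]

out = residual_block([1.0, 2.0], feed_forward)  # [1.0 + 0.5, 2.0 + 1.0]
```

If the sub-layer contributes nothing useful, the identity path still carries the input forward, which is what keeps gradients stable in deep stacks.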

  • Input formatting for the decoder of the transformer model: this is how the text fed to the model is arranged.

  • Example of input formatting: the text, followed by special delimiter symbols, is fed to the transformer model and classified with the linear classifier.

BERT
  • Transformer encoder.

  • Attends to both preceding and succeeding contexts, allowing the model to capture bidirectional relationships in the text.

  • Pre-trained with masked language modeling and next sentence prediction, enabling the model to understand and generate text.

  • Masked language modeling predicts words masked out anywhere in a sequence, encouraging the model to learn contextual relationships.

  • Next sentence prediction predicts if one given sentence follows another, enabling the model to understand discourse structure.

  • Construct positive examples by taking consecutive sentences from the corpus; negative examples are non-adjacent sentences, training the model to distinguish coherent text.

  • Segment embedding to distinguish the two sentences, providing additional information to the model.

  • Fine-tuning is straightforward because it can model single and paired text downstream tasks, adapting the model for various applications.

  • For single-sentence downstream tasks, the second sentence can be degenerate (e.g., empty), so that it has no effect on the prediction.

  • BERT diagram: Masked language modeling head and a classifier token for next sentence prediction, illustrating the architecture of the model.

  • The classifier token yields an embedding vector, which is fed to a linear classifier that performs binary classification.
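Constructing next-sentence-prediction training pairs as described above can be sketched as follows; the tiny corpus is invented purely for illustration.

```python
import random

# Tiny invented corpus of consecutive sentences.
corpus = ["The sky darkened.",
          "Rain began to fall.",
          "She opened her umbrella.",
          "The cat slept indoors."]

def make_nsp_pair(i, positive, rng):
    """Positive pair: sentence i with its true successor (label 1).
    Negative pair: sentence i with a random non-adjacent sentence (label 0)."""
    first = corpus[i]
    if positive:
        return first, corpus[i + 1], 1
    candidates = [s for j, s in enumerate(corpus) if j not in (i, i + 1)]
    return first, rng.choice(candidates), 0

rng = random.Random(0)
pos_pair = make_nsp_pair(0, True, rng)   # consecutive sentences, label 1
neg_pair = make_nsp_pair(0, False, rng)  # non-adjacent sentence, label 0
```

As with masked language modeling, the labels come for free from the corpus itself: adjacency supplies the positives, and random sampling supplies the negatives.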

Embeddings in Transformer
  • Embeddings are a useful tool that adds extra information to the model's input.

  • Types of embeddings:

    • Token embeddings: store the semantic meaning of each token.

    • Positional embeddings: encode the position of each token in the sequence.

    • Segment embeddings: distinguish the two sentences in paired inputs.
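The three embedding types are combined by element-wise summation to form the input representation; below is a toy sketch with hypothetical 4-dimensional vectors (real models use hundreds of dimensions).

```python
# Hypothetical 4-dimensional embeddings for one token of a sentence pair.
token_emb    = [0.1, 0.2, 0.3, 0.4]       # semantic meaning of the token
position_emb = [0.01, 0.02, 0.03, 0.04]   # where the token sits in the sequence
segment_emb  = [1.0, 1.0, 1.0, 1.0]       # which of the two sentences it belongs to

# Element-wise sum: the single vector the transformer layers actually receive.
input_repr = [t + p + s for t, p, s in zip(token_emb, position_emb, segment_emb)]
```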

Key Differences between BERT and GPT
  • BERT is bidirectional and used for text classification or understanding, excelling in tasks like sentiment analysis and question answering.

  • GPT is unidirectional and used for text generation, ideal for tasks like text completion and creative writing.