Unit 6 Lecture

Page 1

LLMs Cont.

Page 2

Recap of Fine-Tuning LLMs

  • Adjusts pretrained language models to specific downstream tasks.

  • Relies on large, generalized models as a base.

  • Refines model weights using task-specific labeled data.

  • Common methods include full fine-tuning, instruction fine-tuning, and PEFT.

  • Goal: Improve performance for domain-specific tasks.

Page 3

Overview of Fine-Tuning

  • What is Fine-Tuning? Importance of fine-tuning LLMs for better task relevance.

  • Instruction Fine-Tuning: differentiation between single-task vs. multi-task approaches.

  • Evaluation metrics and strategies specifically for LLMs.

  • Introduction to Parameter Efficient Fine-Tuning (PEFT).

  • Real-world examples and associated challenges in fine-tuning.

Page 4

Recap of Important Points

  • Bridges the gap between general-purpose models and task-specific requirements.

  • Reduces the need for training models from scratch.

  • Enhances performance by leveraging pretrained knowledge.

  • Enables domain adaptation (e.g., in medical and legal NLP contexts).

  • Supports resource-efficient NLP development.

Page 5

Instruction Fine-Tuning

Page 6

What is Instruction Fine-Tuning?

  • Focuses on teaching models to follow natural language instructions.

  • Examples include OpenAI InstructGPT and Google FLAN.

  • Aligns models with user expectations and ethical guidelines.

  • Requires diverse task-specific instruction datasets for effectiveness (see the data sketch after this list).

  • Enhances generalization to unseen tasks during inference.
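
To make the data requirement above concrete, here is a minimal sketch of what instruction fine-tuning examples typically look like. The records and the flattening template are illustrative assumptions, not a specific dataset's schema.

```python
# Illustrative instruction-tuning records: each pairs an instruction
# (plus optional input) with a target response. Not from a real dataset.
instruction_data = [
    {"instruction": "Summarize the following article in one sentence.",
     "input": "Researchers released a new open-source language model ...",
     "output": "Researchers have released a new open-source language model."},
    {"instruction": "Translate to French.",
     "input": "Good morning.",
     "output": "Bonjour."},
]

def format_example(ex):
    """Flatten one record into a single training string (this template is
    a common convention, not a fixed standard)."""
    return (f"Instruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\n"
            f"Response: {ex['output']}")

print(format_example(instruction_data[0]))
```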

Page 7

Aims of Instruction Fine-Tuning

  • Enable multi-task generalization from a single model.

  • Align model behavior with human language understanding effectively.

  • Improve user interaction through explicit instruction-following.

  • Address biases in LLMs through targeted fine-tuning.

  • Facilitate efficient adaptation for practical applications.

Page 8

Challenges in Fine-Tuning LLMs

  • High computational costs arise for training large models.

  • Risk of catastrophic forgetting concerning prior knowledge.

  • Difficulties in balancing task-specific and general knowledge.

  • Dependency on high-quality and diverse datasets.

  • Trade-offs between depth of fine-tuning and resource efficiency.

  • Fine-tuning helps address key gaps in LLM capabilities.

Page 9

Addressing Challenges

  • Single-task fine-tuning, discussed next, addresses several of these challenges.

Page 10

Single-Task Fine-Tuning: Definition

  • Adapts a pretrained model to excel at a specific task.

  • Common tasks include text classification, summarization, and sentiment analysis.

  • Requires task-specific labeled datasets for successful implementation.

  • Typically involves modifications to all model parameters.

  • Emphasizes task precision over general-purpose capability.

Page 11

Single Task Fine-Tuning Pipeline

  • Step 1: Load a pretrained model (e.g., BERT, GPT).

  • Step 2: Add task-specific layers (e.g., classifiers for specific tasks).

  • Step 3: Fine-tune on the specific task dataset (e.g., IMDB for sentiment analysis).

  • Step 4: Evaluate performance on validation/test data.

  • Step 5: Deploy the model for task-specific inference (a minimal end-to-end sketch follows).
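
The five steps above can be sketched end to end with the Hugging Face Transformers library. The checkpoint, hyperparameters, and subset sizes below are illustrative choices, not prescribed values.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Step 1: load a pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Step 2: num_labels=2 attaches a fresh binary classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Step 3: tokenize IMDB and fine-tune on a small subset (illustrative size).
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imdb-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=encoded["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
print(trainer.evaluate())           # Step 4: held-out evaluation
model.save_pretrained("imdb-bert")  # Step 5: persist for deployment
```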

Page 12

Benefits of Single Task Fine-Tuning

  • Tailored performance for specific applications, leading to high accuracy.

  • Simple to implement compared to multi-task fine-tuning.

  • Ideal for well-defined tasks in controlled environments.

  • Results are easy to explain and evaluate for specific uses.

Page 13

Example Task: Sentiment Analysis

  • Example: Sentiment Analysis using IMDB movie reviews.

  • Dataset: IMDB labeled as positive or negative.

  • Task: Predict sentiment from provided text data.

  • Model: Fine-tune BERT with a classification head.

  • Metric: Use accuracy or F1 Score.

  • Outcome: Performance improved compared to generic pretrained models.

Page 14

Limitations of Single Task Fine-Tuning

  • Poor generalization to other tasks or domains.

  • Computationally expensive when working with large models.

  • Requires labeled data for every new task being addressed.

  • Risks overfitting when dealing with small labeled datasets.

  • Can be inefficient for resource usage in multi-task scenarios.

Page 15

Case Study

  • Dataset: CNN/DailyMail used for news summarization.

  • Model: Fine-tuned BART (Bidirectional and Auto-Regressive Transformer).

  • Metrics: Performance evaluated using ROUGE (Recall-Oriented Understudy for Gisting Evaluation); a hand-rolled ROUGE-1 sketch follows this list.

  • Challenges: Balancing fluency and faithfulness in generated summaries.

  • Results: Fine-tuned BART demonstrates superior performance over traditional baselines.
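
ROUGE compares n-gram overlap between a generated summary and a reference. Below is a hand-rolled sketch of ROUGE-1 (unigram overlap); production evaluations typically use a vetted package such as rouge-score, and the example texts here are illustrative.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """F1 over unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

summary = "the model generates a short news summary"
reference = "the model produces a short summary of the news"
print(round(rouge1_f1(summary, reference), 3))  # 0.75: high but not exact
```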

Page 16

Best Practices in Single Task Fine-Tuning

  • Preprocess data thoroughly to limit noise during training.

  • Utilize hyperparameter tuning for optimizing learning rates and batch sizes.

  • Leverage transfer learning capabilities to handle smaller datasets.

  • Validate results using robust metrics for more reliable evaluations.

  • Incorporate regularization methods to prevent overfitting (see the sketch after this list).
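
Two of these practices, weight decay as regularization and early stopping on validation loss, fit in a few lines of PyTorch. The model stub, patience value, and loss trajectory below are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # stand-in for a fine-tuning classification head
# AdamW's weight_decay applies L2-style regularization to the weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

val_losses = [0.52, 0.41, 0.38, 0.39, 0.40, 0.42]  # illustrative per-epoch losses
best_val, patience, bad_epochs = float("inf"), 2, 0
for epoch, val_loss in enumerate(val_losses):
    # ... one training epoch with `optimizer` would run here ...
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0  # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # stop before overfitting sets in
            print(f"early stop at epoch {epoch}; best val loss {best_val}")
            break
```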

Page 17

Multi-Task Instruction Fine-Tuning

Page 18

What is Multi-Task Fine-Tuning?

  • Simultaneously fine-tunes an LLM for multiple tasks.

  • Example tasks include summarization, question answering (QA), and translation.

  • Unified datasets combine various instructions and task labels.

  • Prepares models for improved generalization across unseen tasks.

  • Reduces computational redundancy across multi-task applications.

Page 19

Advantages Over Single-Task Fine-Tuning

  • Efficient resource use leading to overall lower costs.

  • Captures shared knowledge across different tasks enhancing performance.

  • Improves performance on low-resource tasks by sharing information.

  • Reduces the need for retraining on similar or related tasks.

  • Generates generalized instruction-following capabilities.

Page 20

Architecture Adjustments for Multi-Task Training

  • Unified Encoder-Decoder: The architecture processes different tasks in one framework.

  • Task-Specific Heads: Different output layers handle different tasks seamlessly.

  • Shared Embeddings: Allows tasks to share a common vocabulary or encoder to streamline training.

  • Example: T5 (Text-to-Text Transfer Transformer) casts every task as text generation, simplifying the pipeline (see the sketch below).
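
The text-to-text idea can be seen directly in T5's conventional task prefixes: one encoder-decoder checkpoint handles translation, summarization, and acceptability judgments as plain string-in, string-out generation. The checkpoint choice below is illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Each task is selected purely by its text prefix; the model is unchanged.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Large language models are adapted to downstream tasks ...",
    "cola sentence: The books was on the table.",  # acceptability judgment
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```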

Page 21

Example: FLAN (Google)

  • Full Name: Fine-tuned LAnguage Net (FLAN).

  • Multi-task model trained using instruction datasets for various tasks.

  • Includes datasets for translation, summarization, and QA.

  • Exhibits strong generalization to unseen tasks during evaluation stages.

  • Use Case: Powers conversational AI systems with diverse built-in capabilities.

Page 22

Challenges of Multi-Task Fine-Tuning

  • Task Interference: Conflicts between objectives of different tasks lead to mixed results.

  • Imbalanced Datasets: Low-resource tasks may not perform optimally due to lack of data.

  • Overfitting to Dominant Tasks: Larger tasks can overshadow the performance of smaller tasks.

  • Optimization Complexity: Loss weights must be adjusted carefully across multiple tasks.

  • Scalability Issues: Managing many diverse tasks in one fine-tuning pipeline presents challenges.

Page 23

Strategies to Overcome Multi-Task Challenges

  • Weighted Loss Functions: Apply weightings to tasks during training to manage focus.

  • Task Sampling: Oversample smaller tasks so their updates are not swamped by larger ones (weighted losses and sampling are sketched after this list).

  • Multi-Task Adapters: Add small modular units tailored to each task.

  • Intermediate Fine-Tuning: Pretrain on similar tasks to set groundwork before multi-task tuning.

  • Curriculum Learning: Train on easier tasks then gradually advance to more complex ones.
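
A minimal sketch of the first two strategies follows: inverse-size loss weighting and temperature-scaled task sampling. Task names, sizes, and the exact weighting scheme are illustrative assumptions.

```python
import random

task_sizes = {"summarization": 100_000, "qa": 20_000, "translation": 5_000}

# Weighted loss: scale each task's loss inversely to its size so small
# tasks are not drowned out, then normalize the weights to sum to 1.
loss_weights = {t: 1.0 / n for t, n in task_sizes.items()}
norm = sum(loss_weights.values())
loss_weights = {t: w / norm for t, w in loss_weights.items()}

def combined_loss(per_task_losses):
    """per_task_losses: dict of task -> scalar loss for the current step."""
    return sum(loss_weights[t] * loss for t, loss in per_task_losses.items())

print(round(combined_loss({"summarization": 2.1, "qa": 1.7, "translation": 3.0}), 4))

# Task sampling: draw the next batch's task with probabilities proportional
# to size**temperature; temperature < 1 flattens the distribution so
# low-resource tasks are sampled more often than raw size would dictate.
def sample_task(temperature=0.5):
    probs = {t: n ** temperature for t, n in task_sizes.items()}
    total = sum(probs.values())
    tasks, weights = zip(*[(t, p / total) for t, p in probs.items()])
    return random.choices(tasks, weights=weights, k=1)[0]

print(sample_task())
```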

Page 24

Real-World Applications

  • Chatbots: Implement multi-intent recognition and dynamic task switching (e.g., combining summarization and QA).

  • AI Personal Assistants: Employ unified models for various tasks such as email composition and scheduling.

  • Healthcare NLP: Extract patient data, summarize medical records, and respond to diagnostic queries.

  • Multilingual Systems: Translate and summarize text across different languages effectively.

  • Content Moderation: Identify inappropriate text while providing contextual explanations in real-time.

Page 25

Case Study: OpenAI InstructGPT

  • Dataset: Combines various instruction datasets across multiple tasks including summarization and QA.

  • Methodology: Fine-tuned on mixed tasks focusing on instruction-following behavior.

  • Key Results:

    • Enhanced task generalization capabilities.

    • Improved alignment with user instructions.

    • Reduced generation of harmful or irrelevant outputs.

  • Application: Powers systems such as ChatGPT for conversational AI.

Page 26

Benefits of Multi-Task Fine-Tuning

  • Generalization: Performs competently on unseen or new tasks effectively.

  • Efficiency: Minimizes the need for separate models for different tasks, streamlining processes.

  • Knowledge Sharing: Leverages commonalities across tasks to enhance performance.

  • Data Efficiency: Combines datasets, aiding low-resource tasks significantly.

  • Unified Frameworks: Creates streamlined production pipelines suitable for real-world applications.

Page 27

Comparison of Single-Task and Multi-Task Fine-Tuning

Feature             Single-Task Fine-Tuning         Multi-Task Fine-Tuning
Focus               Precision on one task           Generalization across tasks
Data Requirements   Task-specific dataset           Aggregated multi-task data
Efficiency          Higher resource use per task    Lower overall resource use
Adaptability        Limited to one domain/task      Flexible, supporting diverse tasks
Use Cases           Specialized models              General-purpose assistants

Page 28

Model Evaluation

Page 29

Why Evaluate?

  • Ensures model performance aligns with intended tasks effectively.

  • Identifies robustness across diverse input scenarios.

  • Pinpoints areas requiring retraining or dataset improvement.

  • Evaluates fairness to mitigate biases within model outputs.

  • Supports meaningful comparisons among various models for strategic decisions.

Page 30

Common Evaluation Metrics

  • Text Classification: Metrics include accuracy, precision, recall, and F1 score.

  • Summarization: ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

  • Translation: BLEU (Bilingual Evaluation Understudy) serves as a key metric.

  • Question Answering: Evaluated through Exact Match (EM) and F1 score (both sketched after this list).

  • Open-ended Tasks: Assessed using perplexity and human evaluation techniques.
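
For question answering, Exact Match and token-overlap F1 (as popularized by SQuAD) are simple enough to hand-roll. The normalization below is simplified; reference implementations also strip punctuation and articles.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """F1 over token overlap: gives partial credit for near-miss answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                         # 1.0
print(round(token_f1("in the city of Paris", "Paris"), 3))   # 0.333
```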

Page 31

Human Evaluation

  • Involves human judges providing scores on output fluency, coherence, and relevance.

  • Essential for generative tasks such as summarization and chatbot responses, where human judgment matters most.

  • Delivers qualitative insights into user satisfaction regarding model responses.

  • Limitations: Human evaluation can be costly, subjective, and slow to implement.

  • Example: Rating chatbot responses on a scale of 1 to 5.

Page 32

Benchmarking LLMs

  • Datasets: Performance compared using GLUE (General Language Understanding Evaluation), SuperGLUE, and HELM (a loading sketch follows this list).

  • Tasks: Include text entailment, sentence similarity, and sentiment analysis.

  • Purpose: Establish performance standards for evaluation and comparison of models.

  • Emerging benchmarks include capabilities in multi-task and multilingual contexts.

  • HELM focuses on fairness and robustness evaluations.
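
Benchmarks like GLUE are distributed as ready-made datasets. A minimal loading sketch with the Hugging Face datasets library follows; the task choice (SST-2, sentiment) is illustrative.

```python
from datasets import load_dataset

# GLUE tasks are addressed by name; SST-2 is its sentiment-analysis task.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # e.g. {'sentence': ..., 'label': 0 or 1, 'idx': 0}
print(sst2)              # train/validation/test splits and their sizes
```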

Page 33

Robustness Evaluation

  • Assess model resilience to adversarial inputs and edge cases.

  • Evaluate performance on rare or domain-specific scenarios.

  • Techniques include noise injection and evaluation on out-of-distribution data (see the sketch after this list).

  • Example: Examining performance against typographical errors or slang in input data.

  • Robust models should minimize catastrophic failures in output generation.
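
Noise injection can be as simple as perturbing inputs with character-level typos and checking whether the model's predictions change. The perturbation rate and edit operations below are illustrative assumptions.

```python
import random
import string

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap, drop, or substitute letters at the given rate."""
    rng = random.Random(seed)
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "drop", "sub"])
            if op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            elif op == "drop":
                chars[i] = ""
            else:
                chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

clean = "The movie was absolutely wonderful."
noisy = inject_typos(clean, rate=0.15)
print(noisy)  # e.g. "The mvoie was absolutely wonderfl."
# A robust classifier should assign the same label to `clean` and `noisy`.
```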

Page 34

Ethical Considerations in Evaluation

  • Strategies for detecting and quantifying bias present in model outputs.

  • Measure fairness across various demographics or language contexts.

  • Evaluate unintended harmful responses generated by models.

  • Examples: Identifying gender biases in generated text completions.

  • Tools employed include bias detection frameworks and counterfactual fairness tests.

Page 35

Parameter-Efficient Fine-Tuning (PEFT)

Page 36

What is PEFT?

  • Adaptation method for large models focusing on training fewer parameters.

  • Major portions of pretrained weights remain frozen, fine-tuning only smaller modules.

  • Common techniques include Adapters, LoRA (Low-Rank Adaptation), and Prefix Tuning.

  • Designed for scalability across multiple tasks while minimizing resource demands.

  • Enables faster training and deployment for resource-constrained environments.

Page 37

Why Is PEFT Necessary?

  • Full LLM fine-tuning is computationally expensive and resource-intensive.

  • Addresses memory and storage constraints.

  • Works effectively when task-specific data is scarce.

  • Retains baseline general knowledge from pretrained models.

  • Ideal approach for scenarios involving on-device or cloud-based fine-tuning.

Page 38

Common Techniques in PEFT

  • Adapters:

    • Introduce small, trainable modules between transformer layers.

    • Updates are applied only to adapter parameters rather than full model weights.

    • Modular format allows for reuse across diverse tasks.

  • LoRA (Low-Rank Adaptation):

    • Integrates low-rank matrices within model weights.

    • Reduces the trainable parameters substantially, increasing efficiency.

  • Prefix Tuning:

    • Optimizes embeddings that act as task-specific prefixes prepended to inputs.

    • Particularly effective in generative tasks.

  • BitFit:

    • Involves fine-tuning only the bias terms within the model (LoRA and BitFit are both sketched after this list).
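
A hand-rolled sketch of LoRA and BitFit in plain PyTorch follows. Dimensions and initialization follow the LoRA paper's convention; these layers are standalone illustrations, not a library API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update:
    effective weight = W + (alpha / r) * B @ A."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        # B starts at zero, so training begins from the pretrained behavior.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

lora = LoRALinear(768, 768)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"LoRA trainable: {trainable} / {total}")  # only low-rank factors train

# BitFit: freeze everything except bias terms.
mlp = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
for name, param in mlp.named_parameters():
    param.requires_grad = name.endswith("bias")
bitfit = sum(p.numel() for p in mlp.parameters() if p.requires_grad)
print(f"BitFit trainable: {bitfit} / {sum(p.numel() for p in mlp.parameters())}")
```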

Page 39

Benefits of PEFT

  • Parameter Reduction: Significantly decreases the number of parameters required for tuning.

  • Lowers both memory and computational overhead associated with fine-tuning.

  • Retains generalization capabilities provided by pretrained models during adaptations.

  • Streamlines processes for multi-task training methodologies.

  • Enables quick adaptation to specific tasks or niche domains.

Page 40

PEFT in Practice

  • Example 1: LoRA applied to GPT models, updating less than 1% of weights while yielding comparable performance to full fine-tuning.

  • Example 2: Adapters employed in BERT allow insertion of trainable modules for adaptability without full model retraining.

  • Real-World Usage: Tasks include chatbot personalization and domain adaptation in NLP applications (a library-based sketch follows).
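
For reference, applying LoRA to a GPT-style model with the Hugging Face peft library looks roughly like this; the target module name ("c_attn" is GPT-2's fused attention projection) and the rank are illustrative choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a small fraction of GPT-2's weights
```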

Page 41

Challenges of PEFT

  • Difficulty balancing efficiency and accuracy for high-stakes applications.

  • Highly complex tasks may require more trainable parameters than PEFT provides.

  • Reliance on the quality of pretrained model for effectiveness.

  • Modular designs can introduce slight complexity during inference stages.

  • Does not always outperform full fine-tuning, even in low-resource contexts.

Page 42

Summary

  • Fine-tuning paradigms such as single-task and multi-task facilitate targeted LLM applications.

  • Evaluation strategies ensure reliability, robustness, and fairness in modeling deployments.

  • PEFT represents a transformative approach in adapting LLMs with an emphasis on efficiency and scalability.

  • Methods like LoRA and Adapters redefine training under constrained resources.

  • Future directions include combining PEFT with instruction fine-tuning methodologies for optimal outcomes.
