Adjusts pretrained language models to specific downstream tasks.
Relies on large, generalized models as a base.
Refines model weights using task-specific labeled data.
Common methods include full fine-tuning, instruction fine-tuning, and PEFT.
Goal: Improve performance for domain-specific tasks.
What is Fine-Tuning? Importance of fine-tuning LLMs for better task relevance.
Instruction Fine-Tuning: differentiation between single-task vs. multi-task approaches.
Evaluation metrics and strategies specifically for LLMs.
Introduction to Parameter Efficient Fine-Tuning (PEFT).
Real-world examples and associated challenges in fine-tuning.
Bridges the gap between general-purpose models and task-specific requirements.
Reduces the need for training models from scratch.
Enhances performance by leveraging pretrained knowledge.
Enables domain adaptation (e.g., in medical and legal NLP contexts).
Supports resource-efficient NLP development.
Focuses on teaching models to follow natural language instructions.
Examples include OpenAI InstructGPT and Google FLAN.
Aligns models with user expectations and ethical guidelines.
Requires diverse task-specific instruction datasets for effectiveness.
Enhances generalization to unseen tasks during inference.
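To make the data format concrete, here is a minimal, hypothetical instruction-tuning record; the instruction/input/output field names follow the common Alpaca-style convention and are illustrative rather than tied to any particular dataset.

```python
# A hypothetical instruction-tuning record (Alpaca-style field names, for illustration).
example = {
    "instruction": "Summarize the following review in one sentence.",
    "input": "The movie started slowly, but the final act was gripping and well acted.",
    "output": "A slow start gives way to a gripping, well-acted finale.",
}

# During training, instruction and input are typically concatenated into a prompt,
# and the model is optimized to generate the output text.
prompt = f"{example['instruction']}\n\n{example['input']}"
```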
Enable multi-task generalization from a single model.
Align model behavior with human intent and expectations.
Improve user interaction through explicit instruction-following.
Address biases in LLMs through targeted fine-tuning.
Facilitate efficient adaptation for practical applications.
High computational costs arise for training large models.
Risk of catastrophic forgetting of previously learned knowledge.
Difficulties in balancing task-specific and general knowledge.
Dependency on high-quality and diverse datasets.
Trade-offs between depth of fine-tuning and resource efficiency.
Fine-tuning helps address key gaps in LLM capabilities.
Single-task fine-tuning is one method of addressing these challenges.
Adapts a pretrained model to excel at a specific task.
Common tasks include text classification, summarization, and sentiment analysis.
Requires task-specific labeled datasets for successful implementation.
Typically involves modifications to all model parameters.
Emphasizes task-specific precision over general-purpose capability.
Step 1: Load a pretrained model (e.g., BERT, GPT).
Step 2: Add task-specific layers (e.g., classifiers for specific tasks).
Step 3: Fine-tune on the specific task dataset (e.g., IMDB for sentiment analysis).
Step 4: Evaluate performance on validation/test data.
Step 5: Deploy model for task-specific inference.
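The five steps above can be sketched with the Hugging Face transformers and datasets libraries; this is a minimal illustration, assuming both libraries are installed, and the hyperparameters and subsample sizes are placeholders rather than tuned values.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Steps 1-2: load a pretrained model with a fresh 2-label classification head.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Step 3: tokenize IMDB and fine-tune on it.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-imdb",
    num_train_epochs=2,              # illustrative, not tuned
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].shuffle(seed=42).select(range(5000)),  # subsampled for speed
    eval_dataset=dataset["test"].shuffle(seed=42).select(range(1000)),
)

trainer.train()            # Step 3: fine-tune
print(trainer.evaluate())  # Step 4: reports eval loss by default; a compute_metrics hook adds accuracy/F1

# Step 5: model.save_pretrained("bert-imdb-final"), then serve for inference.
```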
Tailored performance for specific applications, leading to high accuracy.
Simple to implement compared to multi-task fine-tuning.
Ideal for well-defined tasks in controlled environments.
Results are easy to explain and evaluate for specific uses.
Example: Sentiment Analysis using IMDB movie reviews.
Dataset: IMDB labeled as positive or negative.
Task: Predict sentiment from provided text data.
Model: Fine-tune BERT with a classification head.
Metric: Use accuracy or F1 Score.
Outcome: Performance improved compared to generic pretrained models.
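A sketch of the metric side of this example, using scikit-learn; the compute_metrics hook is the standard way to surface accuracy and F1 through the Hugging Face Trainer, and the function name itself is our own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Report accuracy and F1 for binary sentiment labels."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }

# Wire it in: Trainer(..., compute_metrics=compute_metrics)
```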
Poor generalization to other tasks or domains.
Computationally expensive when working with large models.
Requires labeled data for every new task being addressed.
Risks overfitting when dealing with small labeled datasets.
Can be inefficient for resource usage in multi-task scenarios.
Dataset: CNN/DailyMail used for news summarization.
Model: Fine-tuned BART (Bidirectional and Auto-Regressive Transformers).
Metrics: Evaluated performance using ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Challenges: Balancing fluency and faithfulness in generated summaries.
Results: Fine-tuned BART demonstrates superior performance over traditional baselines.
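To show how such a system is exercised and scored, the snippet below generates a summary with a publicly available BART checkpoint fine-tuned on CNN/DailyMail and computes ROUGE with the evaluate library; the article and reference texts are invented for illustration.

```python
import evaluate
from transformers import pipeline

# facebook/bart-large-cnn is a BART checkpoint fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("The city council approved a new transit plan on Tuesday, adding "
           "three bus routes and extending subway hours on weekends.")
reference = "The council approved a transit plan adding bus routes and longer subway hours."

summary = summarizer(article, max_length=40, min_length=10)[0]["summary_text"]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[summary], references=[reference]))  # rouge1/rouge2/rougeL
```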
Preprocess data thoroughly to limit noise during training.
Utilize hyperparameter tuning for optimizing learning rates and batch sizes.
Leverage transfer learning capabilities to handle smaller datasets.
Validate results using robust metrics for more reliable evaluations.
Incorporate regularization methods to prevent overfitting.
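Several of these practices map directly onto TrainingArguments fields in the Hugging Face ecosystem; the values below are illustrative starting points, not recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-run",
    learning_rate=2e-5,              # common starting point for BERT-style models
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,               # regularization against overfitting
    warmup_ratio=0.1,                # learning-rate warmup stabilizes early training
    seed=42,                         # fixed seed makes evaluations comparable
)
```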
Simultaneously fine-tunes an LLM for multiple tasks.
Example tasks include summarization, question answering (QA), and translation.
Unified datasets combine various instructions and task labels.
Prepares models for improved generalization across unseen tasks.
Reduces computational redundancy across multi-task applications.
Efficient resource use leading to overall lower costs.
Captures shared knowledge across different tasks enhancing performance.
Improves performance on low-resource tasks by sharing information.
Reduces the need for retraining on similar or related tasks.
Generates generalized instruction-following capabilities.
Unified Encoder-Decoder: The architecture processes different tasks in one framework.
Task-Specific Heads: Different output layers handle different tasks seamlessly.
Shared Embeddings: Allows tasks to share a common vocabulary or encoder to streamline training.
Example: T5 (Text-to-Text Transfer Transformer) employs text generation for all tasks, simplifying processes.
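A minimal sketch of T5's text-to-text convention, assuming the transformers library is installed; the task prefixes shown here ("summarize:", "translate English to German:") come from the original T5 training mixture.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is text in, text out - only the prefix changes.
prompts = [
    "summarize: The committee met for three hours and agreed to delay the vote.",
    "translate English to German: The house is small.",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```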
Full Name: Fine-tuned LAnguage Net (FLAN).
Multi-task model trained using instruction datasets for various tasks.
Includes datasets for translation, summarization, and QA.
Exhibits strong generalization to unseen tasks during evaluation stages.
Use Case: Powers conversational AI systems with diverse built-in capabilities.
Task Interference: Conflicts between objectives of different tasks lead to mixed results.
Imbalanced Datasets: Low-resource tasks may not perform optimally due to lack of data.
Overfitting to Dominant Tasks: Larger tasks can overshadow the performance of smaller tasks.
Optimization Complexity: Loss weights must be tuned carefully to balance multiple task objectives.
Scalability Issues: Managing many diverse tasks in one fine-tuning pipeline presents challenges.
Weighted Loss Functions: Apply weightings to tasks during training to manage focus.
Task Sampling: Balance updates by sampling smaller tasks more frequently (see the sketch after this list).
Multi-Task Adapters: Insert modular, task-specific adapter units into a shared backbone.
Intermediate Fine-Tuning: Pretrain on similar tasks to set groundwork before multi-task tuning.
Curriculum Learning: Train on easier tasks then gradually advance to more complex ones.
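Task sampling is often implemented with temperature-scaled probabilities, a recipe borrowed from multilingual pretraining; the sketch below is a generic version with invented dataset sizes.

```python
import numpy as np

# Hypothetical examples-per-task counts for a multi-task mixture.
task_sizes = {"summarization": 200_000, "qa": 50_000, "translation": 5_000}

def sampling_probs(sizes, temperature=2.0):
    """p_i proportional to n_i^(1/T); T > 1 upweights small tasks."""
    counts = np.array(list(sizes.values()), dtype=float)
    weights = counts ** (1.0 / temperature)
    return dict(zip(sizes, weights / weights.sum()))

print(sampling_probs(task_sizes))
# With T=2, translation gets roughly 10% of updates instead of its raw ~2%
# share, mitigating dominance by the largest task.
```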
Chatbots: Implement multi-intent recognition and dynamic task switching (e.g., combining summarization and QA).
AI Personal Assistants: Employ unified models for various tasks such as email composition and scheduling.
Healthcare NLP: Extract patient data, summarize medical records, and respond to diagnostic queries.
Multilingual Systems: Translate and summarize text across different languages effectively.
Content Moderation: Identify inappropriate text while providing contextual explanations in real-time.
Dataset: Combines various instruction datasets across multiple tasks including summarization and QA.
Methodology: Fine-tuned on mixed tasks focusing on instruction-following behavior.
Key Results:
Enhanced task generalization capabilities.
Improved alignment with user instructions.
Reduced generation of harmful or irrelevant outputs.
Application: Powers systems such as ChatGPT for conversational AI.
Generalization: Performs competently on unseen or new tasks.
Efficiency: Minimizes the need for separate models for different tasks, streamlining processes.
Knowledge Sharing: Leverages commonalities across tasks to enhance performance.
Data Efficiency: Combines datasets, aiding low-resource tasks significantly.
Unified Frameworks: Creates streamlined production pipelines suitable for real-world applications.
| Feature | Single-Task Fine-Tuning | Multi-Task Fine-Tuning |
|---|---|---|
| Focus | Precision on one task | Generalization across tasks |
| Data Requirements | Task-specific dataset | Aggregated multi-task data |
| Efficiency | Higher resource use per task | Lower overall resource use |
| Adaptability | Limited to one domain/task | Flexible, supporting diverse tasks |
| Use Cases | Specialized models | General-purpose assistants |
Ensures model performance aligns with intended tasks effectively.
Identifies robustness across diverse input scenarios.
Pinpoints areas requiring retraining or dataset improvement.
Evaluates fairness to mitigate biases within model outputs.
Supports meaningful comparisons among various models for strategic decisions.
Text Classification: Metrics include accuracy, precision, recall, and F1 score.
Summarization: ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Translation: BLEU (Bilingual Evaluation Understudy) serves as a key metric.
Question Answering: Evaluated through Exact Match (EM) and F1 score.
Open-ended Tasks: Assessed using perplexity and human evaluation techniques.
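For QA, Exact Match and token-level F1 are simple enough to implement directly; this sketch follows the SQuAD-style definitions, with simplified normalization (lowercasing and punctuation stripping only).

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and tokenize (simplified SQuAD normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)   # multiset overlap of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "the eiffel tower"))  # 1.0
print(token_f1("in Paris, France", "Paris"))                # 0.5 (partial credit)
```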
Involves human judges providing scores on output fluency, coherence, and relevance.
Essential for generative tasks such as summarization and chatbot responses, where human judgment adds significant value.
Delivers qualitative insights into user satisfaction regarding model responses.
Limitations: Human evaluation can be costly, subjective, and slow to implement.
Example: Rating chatbot responses on a scale of 1 to 5.
Benchmarks: GLUE (General Language Understanding Evaluation), SuperGLUE, and HELM (Holistic Evaluation of Language Models).
Tasks: Include text entailment, sentence similarity, and sentiment analysis.
Purpose: Establish performance standards for evaluation and comparison of models.
Emerging benchmarks include capabilities in multi-task and multilingual contexts.
HELM focuses on fairness and robustness evaluations.
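Benchmark tasks are easy to pull in through the datasets library; a minimal sketch of loading one GLUE task (SST-2, binary sentiment) for evaluation.

```python
from datasets import load_dataset

# SST-2 is GLUE's binary sentiment task; other configs include "mnli", "qqp", "rte".
sst2 = load_dataset("glue", "sst2")
print(sst2["validation"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```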
Assess model responsiveness to adversarial inputs or edge cases.
Performance on rare or domain-specific scenarios evaluated.
Techniques involve noise injection and evaluating out-of-distribution data.
Example: Examining performance against typographical errors or slang in input data.
Robust models should minimize catastrophic failures in output generation.
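A toy perturbation function for this kind of robustness check; the typo model here (random adjacent-character swaps) is our own simplification, and production test suites use much richer perturbation families.

```python
import random

def inject_typos(text, rate=0.05, seed=0):
    """Randomly swap adjacent characters to simulate typographical errors."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean = "The service at this restaurant was absolutely wonderful."
noisy = inject_typos(clean, rate=0.1)
print(noisy)
# Compare model predictions on `clean` vs. `noisy` to quantify robustness.
```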
Strategies for detecting and quantifying bias present in model outputs.
Measure fairness across various demographics or language contexts.
Evaluate unintended harmful responses generated by models.
Examples: Identifying gender biases in generated text completions.
Tools employed include bias detection frameworks and counterfactual fairness tests.
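One simple counterfactual test compares model outputs on sentence pairs that differ only in a demographic term; the sketch below uses a generic sentiment pipeline, and the sentence pairs are illustrative.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default English sentiment model

# Minimal counterfactual pairs: identical except for the gendered pronoun.
pairs = [
    ("He is a nurse.", "She is a nurse."),
    ("He is a brilliant engineer.", "She is a brilliant engineer."),
]
for a, b in pairs:
    score_a, score_b = classifier(a)[0], classifier(b)[0]
    # Divergent labels or large score gaps within a pair suggest gender-sensitive bias.
    print(f"{a!r}: {score_a} | {b!r}: {score_b}")
```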
Adaptation method for large models focusing on training fewer parameters.
Major portions of pretrained weights remain frozen, fine-tuning only smaller modules.
Common techniques include Adapters, LoRA (Low-Rank Adaptation), and Prefix Tuning.
Designed for scalability across multiple tasks while minimizing resource demands.
Enables faster training and deployment for resource-constrained environments.
Full LLM fine-tuning is computationally expensive and resource-intensive.
Addresses memory and storage constraints during training and deployment.
Works well when task-specific data is limited.
Retains baseline general knowledge from pretrained models.
Ideal approach for scenarios involving on-device or cloud-based fine-tuning.
Adapters:
Introduce small, trainable modules between transformer layers.
Updates are applied only to adapter parameters rather than full model weights.
Modular format allows for reuse across diverse tasks.
LoRA (Low-Rank Adaptation):
Integrates low-rank matrices within model weights.
Reduces the trainable parameters substantially, increasing efficiency.
Prefix Tuning:
Optimizes embeddings that act as task-specific prefixes prepended to inputs.
Particularly effective in generative tasks.
BitFit:
Involves fine-tuning only the bias terms within the model structure.
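A minimal LoRA sketch using the Hugging Face peft library, assuming it is installed; note that target_modules names vary by architecture, so the BERT query/value names below are an assumption to adjust per model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Low-rank adapters go on the attention query/value projections; the base
# weights stay frozen and only the small A/B matrices are trained.
config = LoraConfig(
    r=8,                                # rank of the update matrices
    lora_alpha=16,                      # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT module names; differ per architecture
    task_type="SEQ_CLS",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # typically well under 1% of all weights
```

BitFit is even simpler in spirit: freeze everything, then re-enable requires_grad only for parameters whose names end in "bias".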
Parameter Reduction: Significantly decreases the number of parameters required for tuning.
Lowers both memory and computational overhead associated with fine-tuning.
Retains generalization capabilities provided by pretrained models during adaptations.
Streamlines multi-task training workflows.
Enables quick adaptation to specific tasks or domain niches.
Example 1: LoRA applied to GPT models, updating less than 1% of weights while yielding comparable performance to full fine-tuning.
Example 2: Adapters employed in BERT allow insertion of trainable modules for adaptability without full model retraining.
Real-World Usage: Tasks include chatbot personalization and domain adaptation in NLP applications.
Difficulty balancing efficiency and accuracy for high-stakes applications.
Highly complex tasks may require more trainable parameters than PEFT typically allows.
Reliance on the quality of pretrained model for effectiveness.
Modular designs can introduce slight complexity during inference stages.
PEFT does not always outperform full fine-tuning, even in low-resource contexts.
Fine-tuning paradigms such as single-task and multi-task facilitate targeted LLM applications.
Evaluation strategies ensure reliability, robustness, and fairness in modeling deployments.
PEFT represents a transformative approach in adapting LLMs with an emphasis on efficiency and scalability.
Techniques like LoRA and Adapters redefine training processes for resource-constrained settings.
Future directions include combining PEFT with instruction fine-tuning methodologies for optimal outcomes.