Week 8: NLP Pipeline: In-context Learning/Instruction Fine-tuning
Causal LLMs and In-context Learning
Causal Attention
- Generative modeling of sentences: LLMs use causal attention for generative modeling.
- Pre-training: Uses masked self-attention (transformer decoder architecture).
- Inference: During generation, the model generates new tokens by sampling based on the probability distribution of the vocabulary given the preceding tokens.
Working with Pre-trained Models
- Scenarios:
- Zero-shot
- Few-shot
- Techniques:
- Instruction Fine-tuning
- Reinforcement Learning with Human Feedback (RLHF)
- Reward model with human feedback is used to further improve the model.
Pre-training Language Models
- LMs are pre-trained on large-scale unlabeled data via self-supervision.
- The model can be well-trained this way and efficiently adapted to new tasks through fine-tuning or prompting.
In-context Learning
- Give a few examples of the task that the model should solve.
- Craft prompts that reformulate tasks to resemble those solved during pre-training.
- GPT-3 and later LLMs can perform tasks with and without limited examples by predicting sequences or comparing probabilities.
Emergent Abilities of LLMs
- Unpredictable phenomena where LLMs exhibit abilities not present in smaller-scale models.
- Emergence: Quantitative changes in a system result in qualitative changes in behavior.
- Emergent abilities cannot be predicted by simply extrapolating performance improvements on smaller-scale models.
- Examples of emergence are seen in few-shot prompting settings (L. Wei et al., 2022).
Benchmarks for Language Models
BIG-Bench
- Over 200 tasks.
- 80% of benchmark tasks are in JSON format, containing a list of examples made up of inputs and targets.
- 20% are programmatic.
MMLU (Massive Multitask Language Understanding)
- New benchmarks for measuring LM performance on 57 diverse knowledge-intensive tasks.
- Uses a high-school questionnaire format for different subjects.
Prompting
- A pre-trained LM is given a prompt (e.g., a natural language instruction) of a task and completes the response without any further training or gradient updates to its parameters.
- Method of providing an LLM with a specific input or cue to generate a desired output or perform a task.
- A well-crafted prompt can guide an LLM to generate more accurate, relevant, and contextually appropriate responses.
- The process can be iteratively refined.
Prompt Engineering
- Design effective prompts to make better use of LLMs and enhance their practical utility in real-world applications.
Prompt Engineering Details
- Prompt: Input to the LLM:
- Generation process: text by maximizing
- Prompt acts as the condition on which predictions are made and can contain any information that helps describe and solve the problem.
Prompt Template
- Piece of text containing placeholders or variables, where each placeholder can be filled with specific information.
- Variable example: , where
Prompt Engineering (2)
- Prompt with multiple variables: Example is comparing two sentences in terms of their semantic similarity.
- Attribute-based prompts: Represent and store data in key-value pairs, which can describe more complex tasks.
Prompt Engineering (3)
- Prompt with system information: Assign a role to LLMs and provide sufficient context.
- System information helps the LLM understand the context or constraints of the task it is being asked to perform.
- Example: LLM as an assistant to correct English sentences.
In-context Learning
- Learning during inference.
- Prompts involve demonstrations of problem-solving.
- LLMs can learn from these demonstrations how to solve new problems.
- No model parameters are updated in the process.
- Way to efficiently activate and reorganize the knowledge learned in pre-training without additional training or fine-tuning.
- Enables quick adaptation of LLMs to new problems.
- Pushing the boundaries of what pre-trained LLMs can achieve without task-specific adjustments.
In-context Learning: Zero/One/Few-shot Learning
Zero-shot Learning
- Does not involve a traditional “learning” process.
- Directly applies LLMs to address new problems not observed during training.
- In practice, prompts are repetitively adjusted to guide the LLMs in generating better responses without demonstrating problem-solving steps or providing examples.
One-shot Learning
- Extends the 0-shot prompt by adding a demonstration of how to correct sentences.
- Allows the LLM to learn from this newly-added experience.
Few-shot Prompting
- Includes a few input-output examples in the model’s context (input) as a preamble before asking the model to perform the task for an unseen inference-time example.
- Essentially provides a pattern that maps some inputs to the corresponding outputs.
- The LLM attempts to follow this pattern in making predictions, provided that the prompt includes a sufficient number of demonstrations.
Effectiveness of In-context Learning
- Relies heavily on the quality of prompts.
- Is based on the fundamental abilities of pre-trained LLMs.
- Significant prompt engineering effort is needed to develop appropriate prompts that help LLMs learn more effectively from demonstrations.
- Stronger LLMs can make better use of in-context learning for performing new tasks.
- Example: Machine translation; weak understanding if not enough
language data during pre-training leads to poor performance regardless of prompting strategy.
Prompt Engineering Strategies
Clear Task Description
- Precise, specific, and clear description of the problem, instructing the LLM to perform as expected.
- Particularly important when we want the output of the LLM to meet certain expectations.
Guiding LLMs to Think
- For problems that require significant reasoning efforts.
- Add “Let’s think step by step” to the end of each prompt.
- Instruct LLMs to generate steps for reasoning about the problem before reaching the final answer.
- Use multiple rounds of interaction to evaluate the correctness of the answer and, if necessary, rework it to find a better solution.
Providing Reference Information
- Produce outputs that are confined to the relevant text.
- Avoid unconstrained predictions.
- Use Retrieval-Augmented Generation (RAG): Relevant text for the user query is provided by calling an Information Retrieval (IR) system, and LLMs are prompted to generate responses based on this provided relevant text.
Prompt Formats
- Define several fields for prompts and fill in different information in each field.
- Use code-style prompts for LLMs, which can understand and generate both natural language and code.
- Use control characters, XML tags, and specific formatting.
- Specify how the input and output should be formatted or structured.
Prompting for NLP Tasks: Text Classification
- Assigning pre-defined labels to a given text.
- LLM gives the answer not in labels but in text describing the result, which needs an additional label mapping step.
- Map the text output to the label space.
- Induce output labels from LLMs by reframing the problem as a cloze task.
- Constrain the prediction to the set of label words and select the one with the highest probability.
- Example:
- Constrain the output with prompts.
- Issues arise with a large number of categories.
Prompting for NLP Tasks: Text Generation
Text Completion
- Continual writing based on the input text.
Text Transformation
- Transformation of the input text into another text.
Prompting for NLP Tasks: Text Completion
- Generating text based on user requirements.
- Requirements can include style, tone, length, and any specific content that the text should contain.
- Examples: formal report, creative story, or a piece of programming code.
Code Generation
- Example: Write a Python function to calculate the average of a list of numbers.
Prompting for NLP Tasks: Text Transformation
- Typical tasks include machine translation, summarization, and text style transfer.
Other NLP Tasks
- Question-answering (MMLU benchmark, GSM8K dataset).
- Information extraction
- Named entity recognition
- Relation extraction
Multi-step Reasoning
Chain-of-Thought (CoT) Prompting
- Enables language models to solve reasoning tasks, especially those involving multiple steps.
- By guiding models to produce a sequence of intermediate steps before giving the final answer.
- Specialized prompting or finetuning methods can be emergent in that they do not have a positive effect until a certain model scale.
Chain-of-Thought Details
- Do not reach a conclusion directly, but generate reasoning steps or learn from demonstrations of detailed reasoning processes provided in the prompts.
- Addresses problems like algebraic calculation (e.g., calculating the average of the numbers 2, 4, and 6).
Chain-of-Thought Benefits
- Allows LLMs to decompose complex problems into smaller, sequential reasoning steps.
- Mirrors human problem-solving behaviors, making it particularly effective for tasks requiring detailed, multi-step reasoning.
- Makes the reasoning process more transparent and interpretable.
- Increases user trust as users can follow the logic behind the reasoning process.
- In-context learning approach that's applicable to most well-trained, off-the-shelf LLMs.
- Provides efficient ways to adapt LLMs to different types of problems.
Working with LLMs
- Zero-shot / Few-shot Scenarios (learning and evaluation):
- No finetuning, prompt engineering/CoT can improve performance.
- Limited by context length (input token length).
- Complex tasks will probably need gradient steps.
- Instruction Fine-tuning
- Reinforcement Learning with Human Feedback (RLHF)
Instruction Tuning
Language Modeling vs. Assisting Users
- Pre-finetuning: Language models are not aligned with user intent.
- After fine-tuning on a task that shows user intent, language models generate accordingly.
Instruction Tuning Details
- Fine-tune the pre-trained model to follow instructions.
- Enable LMs to perform new tasks simply by reading instructions describing the task (without few-shot exemplars).
- Involves fine-tuning on a mixture of tasks phrased as instructions, allowing LMs to respond to instructions describing an unseen task.
- Train models to follow natural language instructions using data that includes (Task/Instruction, Output) examples.
Instruction Finetuning Process
- Collect examples of (instruction, output) pairs across many tasks and finetune an LM.
- Evaluate on Unseen Pairs for predictions.
Scaling Up Finetuning
- Step 1: Pretrain (on language modeling) using lots of text to learn general things.
- Step 2: Finetune (on many tasks) using fewer labels to adapt to specific tasks.
- Pretraining can improve NLP applications by serving as parameter initialization.
Self-Instruct
Task Pool:
- Seed hand-crafted task pool that contains
- LLM-generated instructions and samples are added to this pool as the algorithm proceeds.
Sampling:
- Randomly select a few human-written instructions and a few LLM-generated instructions to ensure diversity.
Instruction Generation:
- Selected instructions are used as demonstration examples.
- LLM in-context learns from these examples to produce a new instruction.
Sample Generation:
- LLM is prompted to complete the sample by filling in the remaining input fields and generating the corresponding output.
Self-Instruct (2)
- Filtering:
- Newly-generated samples are examined by some heuristic rules.
- Filtering out samples or instructions that are similar to those already in the pool.
- If it passes, the sample and instruction are added to the pool.
- The generation process is repeated to obtain a sufficient number of fine tuning samples.
Instruction Tuning: Self-Instruct
- Input Inversion:
- LLM generates the input based on the specified output.
- Biased predictions for tasks such as classification, as most generated samples belong to a single class.
- Specify the output (i.e., the class) with some prior, and prompt the LLM to generate user input given both the instruction and the output.
Limitations of Instruction Finetuning
- It’s expensive to collect ground-truth data for tasks.
- Tasks like open-ended creative generation have no right answer.
- Language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
- There is a mismatch between the LM objective and the objective of “satisfy human preferences”!
Working with LLMs
- Zero-shot / Few-shot Scenarios (learning and evaluation):
- No finetuning, prompt engineering/CoT can improve performance.
- Limited by context length (input token length).
- Complex tasks will probably need gradient steps.
- Instruction Fine-tuning:
- Simple and straightforward, generalize on unseen tasks.
- Collecting demonstrations for so many tasks is expensive.
- Mismatch between LM objective and human preferences.
- Reinforcement Learning with Human Feedback (RLHF):
- Reward model with human feedback.
- Use these to further improve the model.