Week 8: NLP Pipeline: In-context Learning/Instruction Fine-tuning

Causal LLMs and In-context Learning

Causal Attention

Generative modeling of sentences: LLMs use causal attention for generative modeling.
Pre-training: Uses masked self-attention (transformer decoder architecture).
Inference: During generation, the model generates new tokens by sampling based on the probability distribution of the vocabulary given the preceding tokens.

Working with Pre-trained Models

Scenarios:
- Zero-shot
- Few-shot
Techniques:
- Instruction Fine-tuning
- Reinforcement Learning with Human Feedback (RLHF)
  - Reward model with human feedback is used to further improve the model.

Pre-training Language Models

LMs are pre-trained on large-scale unlabeled data via self-supervision.
The model can be well-trained this way and efficiently adapted to new tasks through fine-tuning or prompting.

In-context Learning

Give a few examples of the task that the model should solve.
Craft prompts that reformulate tasks to resemble those solved during pre-training.
GPT-3 and later LLMs can perform tasks with and without limited examples by predicting sequences or comparing probabilities.

Emergent Abilities of LLMs

Unpredictable phenomena where LLMs exhibit abilities not present in smaller-scale models.
Emergence: Quantitative changes in a system result in qualitative changes in behavior.
Emergent abilities cannot be predicted by simply extrapolating performance improvements on smaller-scale models.
Examples of emergence are seen in few-shot prompting settings (L. Wei et al., 2022).

Benchmarks for Language Models

BIG-Bench

Over 200 tasks.
80% of benchmark tasks are in JSON format, containing a list of examples made up of inputs and targets.
20% are programmatic.

MMLU (Massive Multitask Language Understanding)

New benchmarks for measuring LM performance on 57 diverse knowledge-intensive tasks.
Uses a high-school questionnaire format for different subjects.

Prompting

A pre-trained LM is given a prompt (e.g., a natural language instruction) of a task and completes the response without any further training or gradient updates to its parameters.
Method of providing an LLM with a specific input or cue to generate a desired output or perform a task.
A well-crafted prompt can guide an LLM to generate more accurate, relevant, and contextually appropriate responses.
The process can be iteratively refined.

Prompt Engineering

Design effective prompts to make better use of LLMs and enhance their practical utility in real-world applications.

Prompt Engineering Details

Prompt: Input to the LLM: $x$
Generation process: text $y$ by maximizing $P(y|x)$
Prompt acts as the condition on which predictions are made and can contain any information that helps describe and solve the problem.

Prompt Template

Piece of text containing placeholders or variables, where each placeholder can be filled with specific information.
Variable example: ${* premise *}$ , where $premise = the \ weather \ is \ nice \ this \ weekend$

Prompt Engineering (2)

Prompt with multiple variables: Example is comparing two sentences in terms of their semantic similarity.
Attribute-based prompts: Represent and store data in key-value pairs, which can describe more complex tasks.

Prompt Engineering (3)

Prompt with system information: Assign a role to LLMs and provide sufficient context.
System information helps the LLM understand the context or constraints of the task it is being asked to perform.
Example: LLM as an assistant to correct English sentences.

In-context Learning

Learning during inference.
Prompts involve demonstrations of problem-solving.
LLMs can learn from these demonstrations how to solve new problems.
No model parameters are updated in the process.
Way to efficiently activate and reorganize the knowledge learned in pre-training without additional training or fine-tuning.
Enables quick adaptation of LLMs to new problems.
Pushing the boundaries of what pre-trained LLMs can achieve without task-specific adjustments.

In-context Learning: Zero/One/Few-shot Learning

Zero-shot Learning

Does not involve a traditional “learning” process.
Directly applies LLMs to address new problems not observed during training.
In practice, prompts are repetitively adjusted to guide the LLMs in generating better responses without demonstrating problem-solving steps or providing examples.

One-shot Learning

Extends the 0-shot prompt by adding a demonstration of how to correct sentences.
Allows the LLM to learn from this newly-added experience.

Few-shot Prompting

Includes a few input-output examples in the model’s context (input) as a preamble before asking the model to perform the task for an unseen inference-time example.
Essentially provides a pattern that maps some inputs to the corresponding outputs.
The LLM attempts to follow this pattern in making predictions, provided that the prompt includes a sufficient number of demonstrations.

Effectiveness of In-context Learning

Relies heavily on the quality of prompts.
Is based on the fundamental abilities of pre-trained LLMs.
Significant prompt engineering effort is needed to develop appropriate prompts that help LLMs learn more effectively from demonstrations.
Stronger LLMs can make better use of in-context learning for performing new tasks.
Example: Machine translation; weak understanding if not enough language data during pre-training leads to poor performance regardless of prompting strategy.

Prompt Engineering Strategies

Clear Task Description

Precise, specific, and clear description of the problem, instructing the LLM to perform as expected.
Particularly important when we want the output of the LLM to meet certain expectations.

Guiding LLMs to Think

For problems that require significant reasoning efforts.
Add “Let’s think step by step” to the end of each prompt.
Instruct LLMs to generate steps for reasoning about the problem before reaching the final answer.
Use multiple rounds of interaction to evaluate the correctness of the answer and, if necessary, rework it to find a better solution.

Providing Reference Information

Produce outputs that are confined to the relevant text.
Avoid unconstrained predictions.
Use Retrieval-Augmented Generation (RAG): Relevant text for the user query is provided by calling an Information Retrieval (IR) system, and LLMs are prompted to generate responses based on this provided relevant text.

Prompt Formats

Define several fields for prompts and fill in different information in each field.
Use code-style prompts for LLMs, which can understand and generate both natural language and code.
Use control characters, XML tags, and specific formatting.
Specify how the input and output should be formatted or structured.

Prompting for NLP Tasks: Text Classification

Assigning pre-defined labels to a given text.
LLM gives the answer not in labels but in text describing the result, which needs an additional label mapping step.
Map the text output to the label space.
Induce output labels from LLMs by reframing the problem as a cloze task.
Constrain the prediction to the set of label words and select the one with the highest probability.
Example: $Y: {positive, negative, neutral}$
Constrain the output with prompts.
Issues arise with a large number of categories.

Prompting for NLP Tasks: Text Generation

Text Completion

Continual writing based on the input text.

Text Transformation

Transformation of the input text into another text.

Prompting for NLP Tasks: Text Completion

Generating text based on user requirements.
Requirements can include style, tone, length, and any specific content that the text should contain.
Examples: formal report, creative story, or a piece of programming code.

Code Generation

Example: Write a Python function to calculate the average of a list of numbers.

Prompting for NLP Tasks: Text Transformation

Typical tasks include machine translation, summarization, and text style transfer.

Other NLP Tasks

Question-answering (MMLU benchmark, GSM8K dataset).
Information extraction
Named entity recognition
Relation extraction

Multi-step Reasoning

Chain-of-Thought (CoT) Prompting

Enables language models to solve reasoning tasks, especially those involving multiple steps.
By guiding models to produce a sequence of intermediate steps before giving the final answer.
Specialized prompting or finetuning methods can be emergent in that they do not have a positive effect until a certain model scale.

Chain-of-Thought Details

Do not reach a conclusion directly, but generate reasoning steps or learn from demonstrations of detailed reasoning processes provided in the prompts.
Addresses problems like algebraic calculation (e.g., calculating the average of the numbers 2, 4, and 6).

Chain-of-Thought Benefits

Allows LLMs to decompose complex problems into smaller, sequential reasoning steps.
Mirrors human problem-solving behaviors, making it particularly effective for tasks requiring detailed, multi-step reasoning.
Makes the reasoning process more transparent and interpretable.
Increases user trust as users can follow the logic behind the reasoning process.
In-context learning approach that's applicable to most well-trained, off-the-shelf LLMs.
Provides efficient ways to adapt LLMs to different types of problems.

Working with LLMs

Zero-shot / Few-shot Scenarios (learning and evaluation):
- No finetuning, prompt engineering/CoT can improve performance.
- Limited by context length (input token length).
- Complex tasks will probably need gradient steps.
Instruction Fine-tuning
Reinforcement Learning with Human Feedback (RLHF)

Instruction Tuning

Language Modeling vs. Assisting Users

Pre-finetuning: Language models are not aligned with user intent.
After fine-tuning on a task that shows user intent, language models generate accordingly.

Instruction Tuning Details

Fine-tune the pre-trained model to follow instructions.
Enable LMs to perform new tasks simply by reading instructions describing the task (without few-shot exemplars).
Involves fine-tuning on a mixture of tasks phrased as instructions, allowing LMs to respond to instructions describing an unseen task.
Train models to follow natural language instructions using data that includes (Task/Instruction, Output) examples.

Instruction Finetuning Process

Collect examples of (instruction, output) pairs across many tasks and finetune an LM.
Evaluate on Unseen Pairs for predictions.

Scaling Up Finetuning

Step 1: Pretrain (on language modeling) using lots of text to learn general things.
Step 2: Finetune (on many tasks) using fewer labels to adapt to specific tasks.
Pretraining can improve NLP applications by serving as parameter initialization.

Self-Instruct

Task Pool:
- Seed hand-crafted task pool that contains
- LLM-generated instructions and samples are added to this pool as the algorithm proceeds.
Sampling:
- Randomly select a few human-written instructions and a few LLM-generated instructions to ensure diversity.
Instruction Generation:
- Selected instructions are used as demonstration examples.
- LLM in-context learns from these examples to produce a new instruction.
Sample Generation:
- LLM is prompted to complete the sample by filling in the remaining input fields and generating the corresponding output.

Self-Instruct (2)

Filtering:
- Newly-generated samples are examined by some heuristic rules.
- Filtering out samples or instructions that are similar to those already in the pool.
- If it passes, the sample and instruction are added to the pool.
The generation process is repeated to obtain a sufficient number of fine tuning samples.

Instruction Tuning: Self-Instruct

Input Inversion:
- LLM generates the input based on the specified output.
- Biased predictions for tasks such as classification, as most generated samples belong to a single class.
- Specify the output (i.e., the class) with some prior, and prompt the LLM to generate user input given both the instruction and the output.

Limitations of Instruction Finetuning

It’s expensive to collect ground-truth data for tasks.
Tasks like open-ended creative generation have no right answer.
Language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
There is a mismatch between the LM objective and the objective of “satisfy human preferences”!

Working with LLMs

Zero-shot / Few-shot Scenarios (learning and evaluation):
- No finetuning, prompt engineering/CoT can improve performance.
- Limited by context length (input token length).
- Complex tasks will probably need gradient steps.
Instruction Fine-tuning:
- Simple and straightforward, generalize on unseen tasks.
- Collecting demonstrations for so many tasks is expensive.
- Mismatch between LM objective and human preferences.
Reinforcement Learning with Human Feedback (RLHF):
- Reward model with human feedback.
- Use these to further improve the model.