Week 8: NLP Pipeline: In-context Learning/Instruction Fine-tuning

Causal LLMs and In-context Learning

Causal Attention

  • Generative modeling of sentences: LLMs use causal attention for generative modeling.
  • Pre-training: Uses masked self-attention (transformer decoder architecture).
  • Inference: During generation, the model generates new tokens by sampling based on the probability distribution of the vocabulary given the preceding tokens.

Working with Pre-trained Models

  • Scenarios:
    • Zero-shot
    • Few-shot
  • Techniques:
    • Instruction Fine-tuning
    • Reinforcement Learning with Human Feedback (RLHF)
      • Reward model with human feedback is used to further improve the model.

Pre-training Language Models

  • LMs are pre-trained on large-scale unlabeled data via self-supervision.
  • The model can be well-trained this way and efficiently adapted to new tasks through fine-tuning or prompting.

In-context Learning

  • Give a few examples of the task that the model should solve.
  • Craft prompts that reformulate tasks to resemble those solved during pre-training.
  • GPT-3 and later LLMs can perform tasks with and without limited examples by predicting sequences or comparing probabilities.
Emergent Abilities of LLMs
  • Unpredictable phenomena where LLMs exhibit abilities not present in smaller-scale models.
  • Emergence: Quantitative changes in a system result in qualitative changes in behavior.
  • Emergent abilities cannot be predicted by simply extrapolating performance improvements on smaller-scale models.
  • Examples of emergence are seen in few-shot prompting settings (L. Wei et al., 2022).

Benchmarks for Language Models

BIG-Bench
  • Over 200 tasks.
  • 80% of benchmark tasks are in JSON format, containing a list of examples made up of inputs and targets.
  • 20% are programmatic.
MMLU (Massive Multitask Language Understanding)
  • New benchmarks for measuring LM performance on 57 diverse knowledge-intensive tasks.
  • Uses a high-school questionnaire format for different subjects.

Prompting

  • A pre-trained LM is given a prompt (e.g., a natural language instruction) of a task and completes the response without any further training or gradient updates to its parameters.
  • Method of providing an LLM with a specific input or cue to generate a desired output or perform a task.
  • A well-crafted prompt can guide an LLM to generate more accurate, relevant, and contextually appropriate responses.
  • The process can be iteratively refined.
Prompt Engineering
  • Design effective prompts to make better use of LLMs and enhance their practical utility in real-world applications.

Prompt Engineering Details

  • Prompt: Input to the LLM: xx
  • Generation process: text yy by maximizing P(yx)P(y|x)
  • Prompt acts as the condition on which predictions are made and can contain any information that helps describe and solve the problem.
Prompt Template
  • Piece of text containing placeholders or variables, where each placeholder can be filled with specific information.
  • Variable example: premise{* premise *}, where premise=the weather is nice this weekendpremise = the \ weather \ is \ nice \ this \ weekend

Prompt Engineering (2)

  • Prompt with multiple variables: Example is comparing two sentences in terms of their semantic similarity.
  • Attribute-based prompts: Represent and store data in key-value pairs, which can describe more complex tasks.

Prompt Engineering (3)

  • Prompt with system information: Assign a role to LLMs and provide sufficient context.
  • System information helps the LLM understand the context or constraints of the task it is being asked to perform.
  • Example: LLM as an assistant to correct English sentences.

In-context Learning

  • Learning during inference.
  • Prompts involve demonstrations of problem-solving.
  • LLMs can learn from these demonstrations how to solve new problems.
  • No model parameters are updated in the process.
  • Way to efficiently activate and reorganize the knowledge learned in pre-training without additional training or fine-tuning.
  • Enables quick adaptation of LLMs to new problems.
  • Pushing the boundaries of what pre-trained LLMs can achieve without task-specific adjustments.

In-context Learning: Zero/One/Few-shot Learning

Zero-shot Learning
  • Does not involve a traditional “learning” process.
  • Directly applies LLMs to address new problems not observed during training.
  • In practice, prompts are repetitively adjusted to guide the LLMs in generating better responses without demonstrating problem-solving steps or providing examples.
One-shot Learning
  • Extends the 0-shot prompt by adding a demonstration of how to correct sentences.
  • Allows the LLM to learn from this newly-added experience.
Few-shot Prompting
  • Includes a few input-output examples in the model’s context (input) as a preamble before asking the model to perform the task for an unseen inference-time example.
  • Essentially provides a pattern that maps some inputs to the corresponding outputs.
  • The LLM attempts to follow this pattern in making predictions, provided that the prompt includes a sufficient number of demonstrations.

Effectiveness of In-context Learning

  • Relies heavily on the quality of prompts.
  • Is based on the fundamental abilities of pre-trained LLMs.
  • Significant prompt engineering effort is needed to develop appropriate prompts that help LLMs learn more effectively from demonstrations.
  • Stronger LLMs can make better use of in-context learning for performing new tasks.
  • Example: Machine translation; weak understanding if not enough language data during pre-training leads to poor performance regardless of prompting strategy.

Prompt Engineering Strategies

Clear Task Description
  • Precise, specific, and clear description of the problem, instructing the LLM to perform as expected.
  • Particularly important when we want the output of the LLM to meet certain expectations.
Guiding LLMs to Think
  • For problems that require significant reasoning efforts.
  • Add “Let’s think step by step” to the end of each prompt.
  • Instruct LLMs to generate steps for reasoning about the problem before reaching the final answer.
  • Use multiple rounds of interaction to evaluate the correctness of the answer and, if necessary, rework it to find a better solution.
Providing Reference Information
  • Produce outputs that are confined to the relevant text.
  • Avoid unconstrained predictions.
  • Use Retrieval-Augmented Generation (RAG): Relevant text for the user query is provided by calling an Information Retrieval (IR) system, and LLMs are prompted to generate responses based on this provided relevant text.
Prompt Formats
  • Define several fields for prompts and fill in different information in each field.
  • Use code-style prompts for LLMs, which can understand and generate both natural language and code.
  • Use control characters, XML tags, and specific formatting.
  • Specify how the input and output should be formatted or structured.

Prompting for NLP Tasks: Text Classification

  • Assigning pre-defined labels to a given text.
  • LLM gives the answer not in labels but in text describing the result, which needs an additional label mapping step.
  • Map the text output to the label space.
  • Induce output labels from LLMs by reframing the problem as a cloze task.
  • Constrain the prediction to the set of label words and select the one with the highest probability.
  • Example: Y:positive,negative,neutralY: {positive, negative, neutral}
  • Constrain the output with prompts.
  • Issues arise with a large number of categories.

Prompting for NLP Tasks: Text Generation

Text Completion
  • Continual writing based on the input text.
Text Transformation
  • Transformation of the input text into another text.

Prompting for NLP Tasks: Text Completion

  • Generating text based on user requirements.
  • Requirements can include style, tone, length, and any specific content that the text should contain.
  • Examples: formal report, creative story, or a piece of programming code.
Code Generation
  • Example: Write a Python function to calculate the average of a list of numbers.

Prompting for NLP Tasks: Text Transformation

  • Typical tasks include machine translation, summarization, and text style transfer.
Other NLP Tasks
  • Question-answering (MMLU benchmark, GSM8K dataset).
  • Information extraction
  • Named entity recognition
  • Relation extraction

Multi-step Reasoning

Chain-of-Thought (CoT) Prompting
  • Enables language models to solve reasoning tasks, especially those involving multiple steps.
  • By guiding models to produce a sequence of intermediate steps before giving the final answer.
  • Specialized prompting or finetuning methods can be emergent in that they do not have a positive effect until a certain model scale.

Chain-of-Thought Details

  • Do not reach a conclusion directly, but generate reasoning steps or learn from demonstrations of detailed reasoning processes provided in the prompts.
  • Addresses problems like algebraic calculation (e.g., calculating the average of the numbers 2, 4, and 6).

Chain-of-Thought Benefits

  • Allows LLMs to decompose complex problems into smaller, sequential reasoning steps.
  • Mirrors human problem-solving behaviors, making it particularly effective for tasks requiring detailed, multi-step reasoning.
  • Makes the reasoning process more transparent and interpretable.
  • Increases user trust as users can follow the logic behind the reasoning process.
  • In-context learning approach that's applicable to most well-trained, off-the-shelf LLMs.
  • Provides efficient ways to adapt LLMs to different types of problems.

Working with LLMs

  • Zero-shot / Few-shot Scenarios (learning and evaluation):
    • No finetuning, prompt engineering/CoT can improve performance.
    • Limited by context length (input token length).
    • Complex tasks will probably need gradient steps.
  • Instruction Fine-tuning
  • Reinforcement Learning with Human Feedback (RLHF)

Instruction Tuning

Language Modeling vs. Assisting Users
  • Pre-finetuning: Language models are not aligned with user intent.
  • After fine-tuning on a task that shows user intent, language models generate accordingly.

Instruction Tuning Details

  • Fine-tune the pre-trained model to follow instructions.
  • Enable LMs to perform new tasks simply by reading instructions describing the task (without few-shot exemplars).
  • Involves fine-tuning on a mixture of tasks phrased as instructions, allowing LMs to respond to instructions describing an unseen task.
  • Train models to follow natural language instructions using data that includes (Task/Instruction, Output) examples.

Instruction Finetuning Process

  • Collect examples of (instruction, output) pairs across many tasks and finetune an LM.
  • Evaluate on Unseen Pairs for predictions.

Scaling Up Finetuning

  • Step 1: Pretrain (on language modeling) using lots of text to learn general things.
  • Step 2: Finetune (on many tasks) using fewer labels to adapt to specific tasks.
  • Pretraining can improve NLP applications by serving as parameter initialization.

Self-Instruct

  • Task Pool:

    • Seed hand-crafted task pool that contains
    • LLM-generated instructions and samples are added to this pool as the algorithm proceeds.
  • Sampling:

    • Randomly select a few human-written instructions and a few LLM-generated instructions to ensure diversity.
  • Instruction Generation:

    • Selected instructions are used as demonstration examples.
    • LLM in-context learns from these examples to produce a new instruction.
  • Sample Generation:

    • LLM is prompted to complete the sample by filling in the remaining input fields and generating the corresponding output.

Self-Instruct (2)

  • Filtering:
    • Newly-generated samples are examined by some heuristic rules.
    • Filtering out samples or instructions that are similar to those already in the pool.
    • If it passes, the sample and instruction are added to the pool.
  • The generation process is repeated to obtain a sufficient number of fine tuning samples.

Instruction Tuning: Self-Instruct

  • Input Inversion:
    • LLM generates the input based on the specified output.
    • Biased predictions for tasks such as classification, as most generated samples belong to a single class.
    • Specify the output (i.e., the class) with some prior, and prompt the LLM to generate user input given both the instruction and the output.

Limitations of Instruction Finetuning

  • It’s expensive to collect ground-truth data for tasks.
  • Tasks like open-ended creative generation have no right answer.
  • Language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
  • There is a mismatch between the LM objective and the objective of “satisfy human preferences”!

Working with LLMs

  • Zero-shot / Few-shot Scenarios (learning and evaluation):
    • No finetuning, prompt engineering/CoT can improve performance.
    • Limited by context length (input token length).
    • Complex tasks will probably need gradient steps.
  • Instruction Fine-tuning:
    • Simple and straightforward, generalize on unseen tasks.
    • Collecting demonstrations for so many tasks is expensive.
    • Mismatch between LM objective and human preferences.
  • Reinforcement Learning with Human Feedback (RLHF):
    • Reward model with human feedback.
    • Use these to further improve the model.