LLM Reasoning and Prompting

Prompt Engineering Techniques for Large Language Models

Introduction

Improving large language models (LLMs) without updating weights using prompt engineering.
Techniques aim to enhance model reasoning and performance.

Comparing Human and Artificial Intelligence

Human Intelligence:
- Learns from few examples.
- Reasons and generalizes well.
- Explainable rationale.
Traditional Artificial Intelligence:
- Needs large, labeled datasets.
- Black box approach (hard to explain).
- Struggles with generalization and reasoning.
Reasoning is a difficult task and efforts are made to fill the gaps using Bayesian machine learning, transfer learning, and domain adaptation.

Types of Reasoning Problems

Mathematics and Symbolic Reasoning:
- Example: Solving math problems.
Common Sense Reasoning:
- Example: Combining character traits to understand a situation.
Logical Reasoning:
- Example: Deducing conclusions from a series of if-then statements.
  - If A, then B. If B, then C. Therefore, if A, then C.
  - Example from the transcript: "The more ifs, greater mind states, outbreak of war. Emily is a war. What is? And we have freedom."

Challenges with Large Language Models and Reasoning

LLMs show some reasoning abilities but struggle with true semantic understanding.
Mathematics remains a significant challenge, even for large models (e.g., 540 billion parameter models).

Train of Thought Prompting

Using train of thought prompting can significantly improve results.

Prompt Engineering Techniques

Using natural language to instruct LLMs.
Improving models without updating weights.
Five methods:
- Adding a magic word.
- Adding more context.
- Multi-round iterations.
- Moderating model responses.
- Combining multiple models.

Method 1: Magic Words

Adding specific words to encourage reasoning.
Example: "One last thing. Step by step."
LLMs are directional reasoners; magic words prompt deeper reasoning.
Zero-shot train of thought involves using these magic words to guide the model step by step without prior examples.
Experimented with other prompts such as:
- "Let's think about this water quality."
- "Let's solve this program in a few steps."

Magic Words for Multimodal Models

Asking models to analyze why something is funny.
- Example: "Tell why it is funny."
Using phrases to denote importance or invoke emotion.
- Example: "This is very important to my career."
Emotional prompts can improve performance.

Guidelines for Using Magic Words

Avoid politeness.
Use direct, actionable language.
Avoid negative phrasing.
Avoid relying on stereotypes.
Find magic words through reinforcement learning.
- Using two LLMs, a target and a generator, and evaluate the performance to find the best magic words.

Finding Magic Words

External Method (Reinforcement Learning):
- Using a generator model to create magic words and a target model to evaluate their effectiveness.
Internal Method (Prompting):
- Prompting the model to output a possible good prompt.
- Reverse engineering prompts from input-output pairs.
- Iteratively refining prompts based on previous responses.
- Example: Iteratively improve the prompt through multiple rounds, refining it until the desired response is achieved.
  - Baselines: "Let's think step by step," "Therefore, it is probably a step by step"

Iterative Prompt Optimization

Iteratively optimizing prompts can significantly improve model performance.
Example prompts:
- "Palm to L IDC had produced back home to take a deep breath and work on this problem step by step to get the most accuracy."
Magic words may not always help with newer, larger models.
- Newer models have improved reasoning capabilities and may not require magic words.

Method 2: Adding Context

Context significantly improves LLM performance.
Role-playing: Assigning a specific role to the LLM.
- The outputs of LLMs will have a consistent style and reasoning based on the role.
- Example: Asking the LLM to act like Shakespeare.
- Modern commercial services use role-playing by default.

In-Context Learning

Providing a few examples or demonstrations to the LLM.
The larger models learn from the context without any weight updating.
Examples of positive, negative, and neutral phrases for sentiment analysis.
Future prompting involves providing several question-answer pairs.

Few-Shot Learning

Learning from examples in the prompt (in-context learning).
Emergent capabilities: LLMs can learn from examples in the prompt without explicit training data.
Few-shot train of thought involves providing several example triplets (question, reasoning, and answer).

Understanding Examples

LLMs adapt to the formatting and patterns of the demonstrations.
Changing the labels of examples does not significantly affect performance.

Scaling Laws

Larger models understand examples to some extent.
In-context learning for translation tasks shows significant improvement with larger models.
Gemini model can reach near-human performance by reading sample textbook.

Providing More Context

Manually searching online and providing information to the LLM.
Using a system to automatically retrieve information from a database (Retrieval Augmented Generation - RAG).
Asking the LLM to generate knowledge first and then answer the question (Generated Knowledge Prompting).

Method 3: Multi-Round Iterations

Decomposing complex tasks into separate rounds.
Solving sub-problems and addressing trade-offs.
Breaking down the process of writing an academic paper into multiple steps, such as creating an outline, writing each section based on the outline, and revising the content.
Self-verification: LLMs check their own work.
- Using Python as a calculator.

Decomposed Question Answering

Breaking down questions into sub-questions.
Example: Asking the LLM to divide a complex task into several sub-questions and then answer each one.
Self-prompting: Decomposing the reasoning process into sub-questions and answers.

Generated Knowledge Prompting

Asking the LLM to generate knowledge and then answer the question based on that knowledge.
Combining sub-questions with generated knowledge.

Self-Reflection

LLMs evaluate their own answers and improve them.
Self-refinement: Feeding the initial output back to the model to improve it.
Self-consistency: Generating multiple responses and selecting the most common answer.

Tree of Thoughts

Similar to self-consistency but explores different reasoning paths in a tree-like structure.
Searching for counter passes and jumping between different reasoning paths.

Constitutional AI

Asking LLMs to think twice before responding.
Using a set of principles to refine the API and prevent harmful content.
Model does self-critique and re-answers the question based on feedback.
It checks the answers of every step and not only the final answer.

Homework and Research Opportunities

Design prompts to elicit specific types of output from LLMs.
Explore research opportunities in prompt engineering.