Conversational Search and Retrieval-Augmented Generation (RAG)
Learning Goals and Introduction to RAG
- Role of Retrieval-Augmented Generation (RAG) in Conversational Search:
* Definition: A hybrid approach combining information retrieval with generative models (e.g., Large Language Models) to enhance text quality and factual grounding.
* Location: RAG can be applied at various stages, including pre-processing, post-processing, or integrated within the model architecture.
* Methods: Retrieval can utilize plain text, vectors, Knowledge Graphs (KG), and other structured/unstructured formats.
- Course Context:
* The Advanced Natural Language Processing (ANLP) course focuses on pre-training and fine-tuning LLMs using supervised, self-supervised, and Reinforcement Learning (RL) methods.
* The Information Retrieval and Text Mining (IRTM) course focuses on integrating retrieval and Knowledge Graph-based methods.
- Practical Constraints:
* Most high-end LLMs (e.g., OpenAI's GPT, Google's Gemini, Anthropic's Claude) are closed-source and accessible only via APIs.
* Functionality to control factuality is often missing in public APIs, necessitating the manual construction of RAG pipelines to manage model behavior.
Technical Architecture of RAG Pipelines
- Definition and Core Benefits:
* RAG relies on an external knowledge base (e.g., Wikipedia, internal databases) instead of purely internal parametric knowledge.
* Up-to-date Output: Access to external sources ensures information is current.
* Model Efficiency: Retrieval offloads factual knowledge from the model's parameters to the external corpus, allowing smaller models to achieve high performance.
* Domain Adaptability: Corpus customization allows the system to specialize in specific fields.
- Standard Components:
* Data Ingestion: Loading data from documents, databases, or web pages.
* Chunking: Splitting large documents into manageable segments to preserve semantic context.
* Embedding Model: Converts text into numerical vectors (e.g., Sentence Transformers, OpenAI Embeddings).
* Index / Vector Store: Efficient storage for embeddings (e.g., FAISS) for rapid similarity searching.
* Retriever: Searches the index/vector store for relevant passages based on a query.
* Reader / Generator: An LLM (e.g., GPT, BERT2BERT, FLAN) that synthesizes the final response from retrieved segments.
- Essential Python Libraries:
* Haystack: Modular framework for RAG pipelines.
* RAGatouille: Simplifies RAG by leveraging pre-trained models for retrieval and generation.
* LangChain: Building blocks for chaining LLMs with retrieval tools.
* FAISS: Facebook AI Similarity Search for efficient vector querying.
* Transformers (HuggingFace): Repository for pre-trained language and embedding models.
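The standard components listed above (chunking, embedding, index, retriever, prompt for the generator) can be sketched end to end in plain Python. This is a minimal illustration under loud assumptions: the bag-of-words `embed` function stands in for a neural embedding model, the in-memory `VectorStore` stands in for FAISS, and all names (`chunk_text`, `VectorStore`, `build_prompt`) are hypothetical, not taken from Haystack, LangChain, or any other library named here. The final LLM call is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (one simple chunking strategy)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory index with brute-force search (stand-in for a vector database)."""

    def __init__(self):
        self.items = []  # list of (chunk, embedding) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, passages):
    """Concatenate retrieved passages into a grounded prompt for the generator."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Data ingestion: three toy documents written for this example.
docs = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Chunking splits large documents into smaller segments before embedding.",
    "A retriever searches the index for passages relevant to the query.",
]
store = VectorStore()
for doc in docs:
    store.add(chunk_text(doc))

query = "What does a retriever do with the index?"
top = store.search(query, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real pipeline, `embed` would be replaced by an embedding model and `VectorStore.search` by an approximate nearest-neighbor index; the control flow stays the same.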
Challenges in Conversational Search
- Contextual Tracking in Multi-turn Queries:
* Coreference Resolution: Identifying what pronouns (e.g., "he", "it") refer to within long-running conversations.
* Ellipsis Handling: Managing instances where the user omits words, assuming the AI has contextual awareness.
* Case Study:
* Query 1: "Tell me about the CEO of Tesla."
* Query 2: "What companies is he involved in?"
* Problem: The second query relies on identifying "he" as Elon Musk and "Tesla" as the specific electric vehicle company.
- Bias and Hallucinations:
* Stochastic Parrots: A term coined by Bender et al. (2021) to describe LLMs as having no actual knowledge, understanding, or memory—only next-word prediction based on probabilistic patterns.
* Mitigation Strategies:
1. Chain of Prompting: Decomposing complex problems into smaller sub-tasks.
2. Knowledge Injection: Feeding retrieved documents or Knowledge Graph data into the prompt.
3. Hallucination Detection: Using metrics such as RAGAS faithfulness, or benchmarks such as TruthfulQA, to flag low-confidence responses.
4. Agentic Architecture: Employing controller agents to steer the LLM.
- The Provenance Problem: Identifying the source of information. Perplexity AI was a pioneer in citing sources, a practice now followed by Google, Bing, and OpenAI.
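One common way to handle the multi-turn case study in this section is to have an LLM rewrite the follow-up into a standalone query before retrieval. The sketch below only builds the rewriting prompt; the function name `build_rewrite_prompt` and the prompt wording are assumptions made for illustration, and the LLM call itself is omitted.

```python
def build_rewrite_prompt(history, follow_up):
    """Build a prompt asking an LLM to rewrite a follow-up as a standalone query.

    history: list of (speaker, utterance) tuples, oldest first.
    """
    lines = [f"{speaker}: {utterance}" for speaker, utterance in history]
    return (
        "Rewrite the final user question so that it can be understood without "
        "the conversation. Resolve pronouns and fill in omitted words.\n\n"
        "Conversation:\n" + "\n".join(lines) + "\n\n"
        f"Final user question: {follow_up}\n"
        "Standalone question:"
    )


# The two queries from the case study.
history = [("User", "Tell me about the CEO of Tesla.")]
rewrite_prompt = build_rewrite_prompt(history, "What companies is he involved in?")
print(rewrite_prompt)
```

The rewritten query, not the raw follow-up, would then be sent to the retriever, so that "he" no longer depends on conversation state.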
Evolution and Economics of Large Language Models
- Model Scaling and Costs (2023–2026):
* GPT-4 (2023): Approximately 1.76 trillion parameters (Mixture of Experts, MoE); 13 trillion tokens; training cost ~$100M+; energy consumption < 50 GWh.
* GPT-4.5 (2025): Approximately 12.8 trillion parameters; 20+ trillion tokens.
* GPT-5 (2025): Approximately 52.5 trillion parameters (MoE); 50 trillion tokens; training cost ~$1B+; energy consumption 65 GWh.
* GPT-5.5 (April 2026): Optimized for agentic tasks with a 1M token context window.
- Inference vs. Training Trends: While training costs have surged from ~$100M (GPT-4) to over $1B (GPT-5), inference costs have dropped significantly. GPT-4 input tokens originally cost ~$30 per 1M, whereas current high-tier models cost roughly $1.25 per 1M.
- Data Center Limits: The industry is shifting toward "Energy Autonomy." Big Tech is commissioning Small Modular Reactors (SMRs) to power gigawatt-scale data centers. The projected training cost for GPT-6 (2027–2028) is ~$10B.
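The opposing training and inference trends can be made concrete with a short calculation. The figures are the ones quoted in this section and are taken as given here, not independently verified; the variable names are illustrative.

```python
# Figures quoted in this section (USD); treated as given, not verified.
gpt4_training_cost = 100e6
gpt5_training_cost = 1e9
gpt4_input_price_per_million = 30.0
current_input_price_per_million = 1.25

# Ratio of training costs and ratio of per-token input prices.
training_growth = gpt5_training_cost / gpt4_training_cost
inference_drop = gpt4_input_price_per_million / current_input_price_per_million

print(f"Training cost grew about {training_growth:.0f}x")
print(f"Input-token price fell about {inference_drop:.0f}x")
```

On these numbers, training cost grows by a factor of 10 while the input-token price falls by a factor of 24.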
Prompt Engineering and Knowledge Injection
- Background Prompts: Static factual context added to every query (e.g., "Medical diagnoses should be based on PubMed scientific evidence").
- Context Prompts: Maintaining continuity by injecting past interactions or specific roles (e.g., "Write in 17th century English").
- Chain of Thought (CoT): Providing intermediate reasoning steps in the prompt to allow the model to follow a logical path.
- Prompt Enrichment Methods:
* Focus: Restricting output to specific domains (e.g., "Limit to legal implications").
* Styles: Formal vs. informal or target audience (e.g., "Master Students of Computer Science").
* Structure: Specifying headings or word counts (e.g., "Maximum 200 words").
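The prompt types above can be combined mechanically. The following is a minimal sketch, assuming a simple "label: value" layout; the function `build_enriched_prompt`, its parameter names, and the example question are hypothetical, while the instruction strings are the examples given in this section.

```python
def build_enriched_prompt(question, background=None, context=None,
                          focus=None, style=None, structure=None):
    """Assemble a prompt from optional background, context, and enrichment parts."""
    parts = []
    if background:
        parts.append(f"Background: {background}")
    if context:
        parts.append(f"Context: {context}")
    # Enrichment instructions are added only when provided.
    for label, value in (("Focus", focus), ("Style", style), ("Structure", structure)):
        if value:
            parts.append(f"{label}: {value}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


enriched = build_enriched_prompt(
    question="How should a chatbot handle patient data?",  # hypothetical question
    background="Medical diagnoses should be based on PubMed scientific evidence",
    focus="Limit to legal implications",
    style="Master Students of Computer Science",
    structure="Maximum 200 words",
)
print(enriched)
```

Keeping each instruction on its own labeled line makes it easy to switch individual enrichments on and off when comparing outputs.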
Fusion Methods for RAG
- Prompt Level (Early Fusion):
* Concatenating top-k retrieved texts into the prompt.
* Methods: Selective Context Injection and Context Window Management.
* Reference: Lewis et al. (2020) original RAG paper; Izacard & Grave (2020) Fusion-in-Decoder (FiD).
- Vector Level (Embedding Fusion):
* Query expansion in embedding space.
* Weighted Averaging: Assigning higher weights to context vectors closer to the query.
* Learned Transformations: Using a small Multi-Layer Perceptron (MLP) to fuse query and context embeddings.
* Cross-Attention: Query vectors attend over retrieved context vectors via attention layers to compute a contextualized query embedding.
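The weighted-averaging variant of vector-level fusion can be sketched as follows. This is one possible formulation, not a reference implementation: the softmax over cosine similarities, the mixing parameter `alpha`, and the function name `fuse_query` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse_query(query_vec, context_vecs, alpha=0.5):
    """Mix the query with a similarity-weighted average of context vectors.

    Returns (fused_vector, weights). Contexts closer to the query get
    larger weights via a softmax over cosine similarities.
    """
    if not context_vecs:
        return list(query_vec), []
    sims = [cosine(query_vec, c) for c in context_vecs]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query_vec)
    context_mean = [sum(w * c[i] for w, c in zip(weights, context_vecs)) for i in range(dim)]
    fused = [alpha * q + (1 - alpha) * m for q, m in zip(query_vec, context_mean)]
    return fused, weights


# Toy 2-d vectors: the first context is aligned with the query, the second is orthogonal.
fused, weights = fuse_query([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(fused, weights)
```

The fused vector can then be used as an expanded query for a second retrieval pass; `alpha` controls how far the query is allowed to drift toward the retrieved context.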
- Late Fusion (Decoder Level):
* Fusing information during the token decoding process.
* Reference: RAG-token model (Lewis et al., 2020).
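Decoder-level fusion in the RAG-token style can be illustrated by a single decoding step in which the next-token distribution is marginalized over the retrieved documents, i.e. p(y_i | x, y_<i) = Σ_z p(z | x) · p(y_i | x, z, y_<i). The sketch below assumes the per-document distributions are already available; the function name and all probabilities are toy values invented for illustration.

```python
def rag_token_step(doc_probs, token_dists):
    """Marginalize next-token distributions over retrieved documents.

    doc_probs:   list of p(z | x), one per retrieved document.
    token_dists: list of dicts mapping token -> p(token | x, z, prefix),
                 one dict per document, in the same order.
    """
    mixed = {}
    for p_z, dist in zip(doc_probs, token_dists):
        for token, p in dist.items():
            mixed[token] = mixed.get(token, 0.0) + p_z * p
    return mixed


# Toy values: two documents with different retrieval scores and token preferences.
doc_probs = [0.7, 0.3]
token_dists = [
    {"Paris": 0.9, "Lyon": 0.1},
    {"Paris": 0.4, "Lyon": 0.6},
]
mixed = rag_token_step(doc_probs, token_dists)
next_token = max(mixed, key=mixed.get)
print(mixed, next_token)
```

Because the mixture is recomputed at every step, different tokens of the same answer can draw on different documents, which is what distinguishes this from concatenating passages into the prompt.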
Knowledge Graph (KG) Integration and SPARQL
- SPARQL (SPARQL Protocol and RDF Query Language): The SQL equivalent for Knowledge Graphs.
- RDF Data Model: Data is stored as "triples": (Subject) → (Predicate) → (Object).
* Example: Elon Musk → CEO of → Tesla.
- Benefits for AI:
* Prevents hallucinations by using verified facts.
* Enables Multi-hop Reasoning: Finding relationships like "Who influenced Alan Turing?" using Wikidata properties (e.g., wdt:P737).
* Enables Provenance Tracking: Including citations in AI-generated answers.
- KG Sources:
* Wikidata/DBpedia: General knowledge.
* DBLP/OpenCitations: Scientific research and publications.
* GDELT: Real-time news and global events.
* WordNet RDF: Linguistics and lexical relations.
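To make the triple model and multi-hop reasoning concrete, the sketch below shows (a) the text of a SPARQL query for the Alan Turing example and (b) a tiny in-memory triple matcher. Assumptions: `wd:Q7251` is taken to be the Wikidata item for Alan Turing and `wdt:P737` the "influenced by" property; the query is only stored as a string and not sent to any endpoint. The helper names `match` and `two_hop` and the second toy triple are illustrative.

```python
# SPARQL text for the Wikidata query service (not executed in this sketch).
# Assumption: wd:Q7251 = Alan Turing, wdt:P737 = "influenced by".
INFLUENCED_BY_QUERY = """
SELECT ?influencer ?influencerLabel WHERE {
  wd:Q7251 wdt:P737 ?influencer .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

# Toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("Elon Musk", "CEO of", "Tesla"),
    ("Tesla", "industry", "Electric vehicles"),  # illustrative second hop
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def two_hop(triples, start, p1, p2):
    """Follow predicate p1 from start, then p2 from each intermediate node."""
    paths = []
    for _, _, mid in match(triples, s=start, p=p1):
        for _, _, end in match(triples, s=mid, p=p2):
            paths.append((start, mid, end))
    return paths


paths = two_hop(TRIPLES, "Elon Musk", "CEO of", "industry")
print(paths)
```

Each returned path carries the intermediate node, so the chain of facts behind an answer can be cited, which is the basis for provenance tracking.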
Graph Embedding Models and Neural Networks
- TransE (Translating Embedding Model):
* Models relationships as translations in vector space: $h + r \approx t$, where $h$ is the head, $r$ is the relation, and $t$ is the tail.
* Loss function: $L = \sum_{(h, r, t) \in \text{positive triples}} \lVert h + r - t \rVert_2^2$.
* Strengths: Fast and simple; good for 1-to-1 relations.
* Weaknesses: Struggles with 1-to-many or many-to-many patterns.
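The TransE score and the summed loss over positive triples can be sketched as below. Note that this follows the simplified loss stated here; the original TransE formulation trains with a margin-based ranking loss against corrupted (negative) triples. The 2-d embeddings are toy values chosen by hand, and the second triple is added only to illustrate the 1-to-many weakness.

```python
def transe_score(h, r, t):
    """Squared L2 distance ||h + r - t||^2; lower means more plausible."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def transe_loss(triples, entity, relation):
    """Sum of scores over positive triples (no negative sampling, no margin)."""
    return sum(transe_score(entity[h], relation[r], entity[t]) for h, r, t in triples)


# Hand-picked toy embeddings (not trained).
entity = {
    "Elon Musk": [0.0, 0.0],
    "Tesla": [1.0, 0.0],
    "SpaceX": [0.0, 2.0],
}
relation = {"CEO of": [1.0, 0.0]}

fits = transe_score(entity["Elon Musk"], relation["CEO of"], entity["Tesla"])
misses = transe_score(entity["Elon Musk"], relation["CEO of"], entity["SpaceX"])
loss = transe_loss(
    [("Elon Musk", "CEO of", "Tesla"), ("Elon Musk", "CEO of", "SpaceX")],
    entity,
    relation,
)
print(fits, misses, loss)
```

With one head and one relation pointing to two different tails, h + r can match at most one of them exactly, so the loss reaches zero only if both tail embeddings coincide. This is the 1-to-many limitation in miniature.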
- GraphSAGE (Basic GNN):
* Aggregates information from a node's neighborhood to learn embeddings.
* Process for node v at layer k:
1. $h_{N(v)}^{(k)} = \text{Aggregate}(\{h_u^{(k-1)} : u \in N(v)\})$
2. $h_v^{(k)} = \text{MLP}(\text{Concat}(h_v^{(k-1)}, h_{N(v)}^{(k)}))$
3. Normalize node embeddings.
- The Alignment Problem:
* GNN embeddings (graph structure) and LLM embeddings (natural language) often live in different vector spaces with different distributions, scales, and directions.
* Solution: Projecting graph embeddings into LLM space using adapters or using length normalization (unit vectors) to make cosine similarity meaningful.
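The three GraphSAGE steps, ending in the unit-length normalization that the alignment discussion relies on, can be sketched for a single layer. Simplifying assumptions: the aggregator is a plain mean, the "MLP" is a single linear layer with ReLU, and the weights are fixed by hand instead of learned. All function names and the toy graph are illustrative.

```python
import math


def mean_aggregate(vectors, dim):
    """Step 1: element-wise mean of neighbor vectors (zero vector if none)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dense_relu(x, weights, bias):
    """One linear layer followed by ReLU, standing in for the MLP."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def l2_normalize(v):
    """Step 3: scale to unit length so cosine similarity equals the dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def graphsage_layer(features, neighbors, weights, bias):
    """Apply aggregate, concat + MLP, and normalize to every node."""
    updated = {}
    for v, h_v in features.items():
        h_n = mean_aggregate([features[u] for u in neighbors.get(v, [])], len(h_v))
        # Step 2: list concatenation h_v + h_n implements Concat(h_v, h_N(v)).
        updated[v] = l2_normalize(dense_relu(h_v + h_n, weights, bias))
    return updated


# Toy star graph: A is connected to B and C.
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # hand-set: adds self and neighborhood
bias = [0.0, 0.0]

embeddings = graphsage_layer(features, neighbors, weights, bias)
print(embeddings)
```

Normalizing removes scale differences only. Mapping graph embeddings into the LLM's space would still require a learned projection (the adapter mentioned above), which this sketch does not include.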
Memory-Augmented RAG
- kNN-LM (k-Nearest Neighbors Language Model):
* Instead of storing all knowledge in model weights, the model retrieves nearest neighbors during generation.
* Logit Fusion: Merging the retrieved information with the model's output probability distribution (logits).
* Retrieval in the Decoder: Encoding partial output, retrieving context vectors, and using attention to fuse them dynamically at each token step.
- Vector Databases: Essential for speed (sub-millisecond retrieval) and scalability (billions of vectors) in systems like kNN-LM.
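The fusion step in kNN-LM can be sketched as an interpolation of two next-token distributions, assuming the form p = λ · p_kNN + (1 − λ) · p_LM, where p_kNN is a softmax over negative neighbor distances. The retrieval itself is skipped: the neighbor list, the LM distribution, `lam`, and `temperature` are toy values, and the function names are illustrative.

```python
import math


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a token distribution.

    Closer neighbors get more mass via a softmax over negative distances;
    mass is summed over neighbors that share the same next token.
    """
    weights = [math.exp(-distance / temperature) for distance, _ in neighbors]
    total = sum(weights)
    dist = {}
    for w, (_, token) in zip(weights, neighbors):
        dist[token] = dist.get(token, 0.0) + w / total
    return dist


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the LM distribution with the kNN distribution."""
    vocab = set(p_lm) | set(p_knn)
    return {tok: lam * p_knn.get(tok, 0.0) + (1 - lam) * p_lm.get(tok, 0.0) for tok in vocab}


# Toy values: the LM is undecided, the datastore neighbors favor "Tesla".
p_lm = {"Tesla": 0.5, "Ford": 0.5}
neighbors = [(0.0, "Tesla"), (0.0, "Tesla"), (0.0, "Ford")]

p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn)
print(p_knn, p_final)
```

The nearest-neighbor lookup that produces `neighbors` is the part that depends on a fast vector index, since it runs once per generated token.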
Evaluation and Hallucination Detection
- Faithfulness Metric (RAGAS):
* Measures consistency between the generated answer and the retrieved context.
* Calculation Process:
1. Claim Generation: LLM breaks the answer into individual factual claims.
2. Context Verification: An NLI (Natural Language Inference) model (e.g., facebook/bart-large-mnli) checks if each claim is entailed by the context.
3. Calculation: $\text{Faithfulness Score} = \frac{\text{Number of supported claims}}{\text{Total number of claims}}$.
* Example: If a model claims green tea cures cancer but the source only mentions antioxidants, the faithfulness score drops (e.g., 0.5 or less).
- Performance Metrics:
* BM25 (Lexical Search) Latency: 50–200 ms.
* Dense Retrieval Latency: 50–300 ms.
* Re-Ranking (BERT Cross-Encoders) Latency: 200–800 ms.
* LLM Response Time: 1–3 s.
- Strategy for Low Scores: Re-rank documents, increase recall, or prompt the LLM to restrict its response more tightly to the provided context.
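The faithfulness calculation described in this section reduces to a ratio once claims have been extracted and checked. In the sketch below both LLM-based steps are replaced by stand-ins: the claims are written by hand, and `toy_entailment` is a substring check used as a placeholder for an NLI model. The 0.8 threshold for triggering the low-score strategy is a hypothetical value, not a RAGAS default.

```python
def faithfulness_score(claims, is_supported):
    """Fraction of claims judged as supported by the retrieved context.

    claims:       list of claim strings extracted from the answer.
    is_supported: callable(claim) -> bool, a stand-in for the NLI check.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)


# Green tea example: the context mentions antioxidants only.
context = "Green tea contains antioxidants."
claims = [
    "Green tea contains antioxidants.",
    "Green tea cures cancer.",
]


def toy_entailment(claim):
    """Placeholder for an NLI model: case-insensitive substring match only."""
    return claim.lower().rstrip(".") in context.lower()


score = faithfulness_score(claims, toy_entailment)
needs_review = score < 0.8  # hypothetical threshold
print(score, needs_review)
```

Here one of two claims is supported, giving a score of 0.5, which would flag the answer for re-ranking or a stricter prompt.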
Questions & Discussion
- Questions: The lecturer asks the audience for questions regarding the RAG pipeline and hallucination detection methods.
- Hands-on: The tutorial involves setting up an advanced integration between a search engine and a chatbot using RAG, measuring result quality, and identifying hallucinations.