Conversational Search and Retrieval-Augmented Generation (RAG)

Learning Goals and Introduction to RAG

  • Role of Retrieval-Augmented Generation (RAG) in Conversational Search:
      * Definition: A hybrid approach combining information retrieval with generative models (e.g., Large Language Models) to enhance text quality and factual grounding.
      * Location: RAG can be applied at various stages, including pre-processing, post-processing, or integrated within the model architecture.
      * Methods: Retrieval can utilize plain text, vectors, Knowledge Graphs (KG), and other structured/unstructured formats.
  • Course Context:
      * The Advanced Natural Language Processing (ANLP) course focuses on pre-training and fine-tuning LLMs using supervised, self-supervised, and Reinforcement Learning (RL) methods.
      * The Information Retrieval and Text Mining (IRTM) course focuses on integrating retrieval and Knowledge Graph-based methods.
  • Practical Constraints:
      * Most high-end LLMs (e.g., GPT, Gemini, Claude) are proprietary and accessible only via APIs.
      * Public APIs often lack controls for factuality, so RAG pipelines have to be built manually to manage model behavior.

Technical Architecture of RAG Pipelines

  • Definition and Core Benefits:
      * RAG relies on an external knowledge base (e.g., Wikipedia, internal databases) instead of purely internal parametric knowledge.
      * Up-to-date Output: Answers are as current as the external sources, which can be updated without retraining the model.
      * Model Efficiency: Retrieval offloads factual knowledge from the model's parameters, allowing smaller models to achieve high performance.
      * Domain Adaptability: Corpus customization allows the system to specialize in specific fields.
  • Standard Components:
      * Data Ingestion: Loading data from documents, databases, or web pages.
      * Chunking: Splitting large documents into manageable segments while preserving semantic context.
      * Embedding Model: Converts text into numerical vectors (e.g., Sentence Transformers, OpenAI Embeddings).
      * Index / Vector Store: Efficient storage for embeddings (e.g., FAISS) for rapid similarity searching.
      * Retriever: Searches the index/vector store for relevant passages based on a query.
      * Reader / Generator: An LLM (e.g., GPT, BERT2BERT, FLAN) that synthesizes the final response from retrieved segments.
  • Essential Python Libraries:
      * Haystack: Modular framework for RAG pipelines.
      * RAGatouille: Simplifies the use of pre-trained ColBERT-style late-interaction retrieval models in RAG pipelines.
      * LangChain: Building blocks for chaining LLMs with retrieval tools.
      * FAISS: Facebook AI Similarity Search for efficient vector querying.
      * Transformers (Hugging Face): Library for loading and running pre-trained language and embedding models.
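
The standard components listed above can be sketched end to end in a few lines. This is a toy illustration using only the Python standard library, not how Haystack, LangChain, or FAISS implement these steps: the bag-of-words `embed` function stands in for a neural embedding model, the plain list returned by `build_index` stands in for a vector store, and all function names and sample documents are hypothetical. The generator step is left out; the sketch stops at the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Chunking: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Index: list of (chunk_text, embedding) pairs (stand-in for a vector store)."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index, query, k=2):
    """Retriever: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, passages):
    """Prepare the generator input: retrieved passages followed by the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "SPARQL is a query language for RDF knowledge graphs.",
]
index = build_index(docs)
question = "Which library does similarity search over vectors?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

In a real pipeline each function would be replaced by the corresponding library component; the data flow (ingest, chunk, embed, index, retrieve, generate) stays the same.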

Challenges in Conversational Search

  • Contextual Tracking in Multi-turn Queries:
      * Coreference Resolution: Identifying what pronouns (e.g., "he", "it") refer to within long-running conversations.
      * Ellipsis Handling: Managing instances where the user omits words, assuming the AI has contextual awareness.
      * Case Study:
          * Query 1: "Tell me about the CEO of Tesla."
          * Query 2: "What companies is he involved in?"
          * Problem: The second query can only be answered by resolving "he" to Elon Musk, which in turn depends on carrying over "the CEO of Tesla" (the electric vehicle company) from the first turn.
  • Bias and Hallucinations:
      * Stochastic Parrots: A term coined by Bender et al. (2021) to describe LLMs as having no actual knowledge, understanding, or memory, only next-word prediction based on probabilistic patterns.
      * Mitigation Strategies:
          1. Chain of Prompting: Decomposing complex problems into smaller sub-tasks.
          2. Knowledge Injection: Feeding retrieved documents or Knowledge Graph data into the prompt.
          3. Hallucination Detection: Using evaluation metrics such as RAGAS, or benchmarks such as TruthfulQA, to flag unsupported responses.
          4. Agentic Architecture: Employing controller agents to steer the LLM.
  • The Provenance Problem: Identifying the source of information. Perplexity AI was a pioneer in citing sources, a practice now followed by Google, Bing, and OpenAI.
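
One common way to handle the coreference and ellipsis case from the case study is to have an LLM rewrite the follow-up into a standalone query before retrieval. The sketch below only assembles such a rewriting prompt; the function name, the prompt wording, and the sample assistant turn are illustrative assumptions, and no model is called.

```python
def build_rewrite_prompt(history, current_query):
    """Assemble a prompt asking an LLM to turn a follow-up into a standalone query.

    history: list of (role, text) tuples from earlier turns.
    """
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question so it can be understood without the "
        "conversation. Resolve pronouns and fill in omitted words.\n\n"
        f"Conversation:\n{turns}\n\n"
        f"Final question: {current_query}\n"
        "Standalone question:"
    )


history = [
    ("user", "Tell me about the CEO of Tesla."),
    ("assistant", "Elon Musk is the CEO of Tesla."),  # illustrative reply
]
rewrite_prompt = build_rewrite_prompt(history, "What companies is he involved in?")
```

The rewritten query returned by the model (for example, one that names Elon Musk explicitly) is then what gets sent to the retriever, so the index never sees an unresolved "he".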

Evolution and Economics of Large Language Models

  • Model Scaling and Costs (2023–2026):
      * GPT-4 (2023): Approximately 1.76 trillion parameters (Mixture of Experts, MoE); 13 trillion tokens; training cost ~$100M+; energy consumption under ~50 GWh.
      * GPT-4.5 (2025): Approximately 12.8 trillion parameters; 20+ trillion tokens.
      * GPT-5 (2025): Approximately 52.5 trillion parameters (MoE); 50 trillion tokens; training cost ~$1B+; energy consumption ~65 GWh.
      * GPT-5.5 (April 2026): Optimized for agentic tasks with a 1M-token context window.
  • Inference vs. Training Trends: While training costs have surged from ~$100M (GPT-4) to over $1B (GPT-5), inference costs have dropped significantly. GPT-4 input tokens originally cost ~$30 per 1M tokens, whereas current high-tier models cost roughly $1.25 per 1M tokens.
  • Data Center Limits: The trend is toward "Energy Autonomy." Big Tech is commissioning Small Modular Reactors (SMRs) to power gigawatt-scale data centers. The projected training cost for GPT-6 (2027–2028) is ~$10B.
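
To make the inference-versus-training trend concrete, the ratios implied by the figures quoted above can be worked out directly. The numbers are the approximate values from these notes; the 8,000-token prompt size and the helper name are arbitrary choices for illustration.

```python
# Approximate figures quoted in the notes (USD).
gpt4_input_price = 30.00      # per 1M input tokens at GPT-4 launch
current_input_price = 1.25    # per 1M input tokens, current high-tier models
gpt4_training_cost = 100_000_000
gpt5_training_cost = 1_000_000_000

price_drop = gpt4_input_price / current_input_price       # 24x cheaper inference
training_growth = gpt5_training_cost / gpt4_training_cost  # 10x costlier training


def input_cost(tokens, price_per_million):
    """Cost in USD of sending `tokens` input tokens at a given per-1M price."""
    return tokens / 1_000_000 * price_per_million


# Example: an 8,000-token RAG prompt (retrieved passages plus question).
cost_then = input_cost(8_000, gpt4_input_price)
cost_now = input_cost(8_000, current_input_price)
```

On these figures, a long RAG prompt that cost about 24 cents in input tokens now costs about one cent, which is part of why stuffing retrieved context into prompts has become economically practical.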

Prompt Engineering and Knowledge Injection

  • Background Prompts: Static factual context added to every query (e.g., "Medical diagnoses should be based on PubMed scientific evidence").
  • Context Prompts: Maintaining continuity by injecting past interactions or specific roles (e.g., "Write in 17th century English").
  • Chain of Thought (CoT): Providing intermediate reasoning steps in the prompt to allow the model to follow a logical path.
  • Prompt Enrichment Methods:
      * Focus: Restricting output to specific domains (e.g., "Limit to legal implications").
      * Styles: Formal vs. informal or target audience (e.g., "Master Students of Computer Science").
      * Structure: Specifying headings or word counts (e.g., "Maximum 200 words").
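
The prompt types above compose naturally into a single template. The following is a minimal sketch of one possible assembly order (background, conversation history, a worked reasoning example for Chain of Thought, enrichment constraints, then the question); the function name, parameter names, and the sample question are assumptions made for illustration.

```python
def assemble_prompt(question, background=None, history=(), worked_example=None,
                    focus=None, style=None, structure=None):
    """Combine background, context, CoT, and enrichment parts into one prompt."""
    parts = []
    if background:
        # Background prompt: static context added to every query.
        parts.append(f"Background: {background}")
    for role, text in history:
        # Context prompt: past interactions for continuity.
        parts.append(f"{role}: {text}")
    if worked_example:
        # Chain of Thought: an example with intermediate reasoning steps.
        parts.append(f"Example with reasoning:\n{worked_example}")
    constraints = [c for c in (focus, style, structure) if c]
    if constraints:
        # Prompt enrichment: focus, style, and structure restrictions.
        parts.append("Constraints: " + "; ".join(constraints))
    parts.append(f"Question: {question}")
    return "\n".join(parts)


enriched = assemble_prompt(
    "What does a faithfulness score measure?",
    background="Medical diagnoses should be based on PubMed scientific evidence",
    style="Audience: Master Students of Computer Science",
    structure="Maximum 200 words",
)
```

Keeping each part optional makes it easy to test which additions actually change the model's output.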

Fusion Methods for RAG

  • Prompt Level (Early Fusion):
      * Concatenating top-k retrieved texts into the prompt.
      * Methods: Selective Context Injection and Context Window Management.
      * Reference: Lewis et al. (2020) original RAG paper; Izacard & Grave (2020) Fusion-in-Decoder (FiD).
  • Vector Level (Embedding Fusion):
      * Query expansion in embedding space.
      * Weighted Averaging: Assigning higher weights to context vectors closer to the query.
      * Learned Transformations: Using a small Multi-Layer Perceptron (MLP) to fuse query and context embeddings.
      * Cross-Attention: Query vectors attend over retrieved context vectors via attention layers to compute a contextualized query embedding.
  • Late Fusion (Decoder Level):
      * Fusing information during the token decoding process.
      * Reference: RAG-Token model (Lewis et al., 2020).
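
As an illustration of the vector-level "Weighted Averaging" variant, the sketch below expands a query embedding with a similarity-weighted average of retrieved context embeddings. The softmax weighting, the mixing parameter `alpha`, and the `temperature` are assumptions chosen for this example; the notes do not prescribe a specific weighting scheme.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else list(v)


def fuse_query(query_vec, context_vecs, alpha=0.5, temperature=1.0):
    """Embedding fusion by weighted averaging.

    Context vectors closer to the query (higher cosine similarity) receive
    larger softmax weights. The fused vector mixes the query and the weighted
    context average, controlled by alpha.
    """
    q = normalize(query_vec)
    ctx = [normalize(c) for c in context_vecs]
    sims = [dot(q, c) for c in ctx]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * c[i] for w, c in zip(weights, ctx)) for i in range(len(q))]
    fused = [alpha * qi + (1 - alpha) * pi for qi, pi in zip(q, pooled)]
    return normalize(fused), weights


fused, weights = fuse_query([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Here the first context vector is identical to the query and therefore receives the larger weight; the fused vector is pulled slightly toward the second context vector while remaining closest to the original query.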

Knowledge Graph (KG) Integration and SPARQL

  • SPARQL (SPARQL Protocol and RDF Query Language): The SQL equivalent for Knowledge Graphs.
  • RDF Data Model: Data is stored as "triples": $(\text{Subject}) \rightarrow (\text{Predicate}) \rightarrow (\text{Object})$.
      * Example: $\text{Elon Musk} \rightarrow \text{CEO of} \rightarrow \text{Tesla}$.
  • Benefits for AI:
      * Reduces hallucinations by grounding answers in curated facts.
      * Enables Multi-hop Reasoning: Finding relationships like "Who influenced Alan Turing?" using Wikidata properties (e.g., wdt:P737, "influenced by").
      * Enables Provenance Tracking: Including citations in AI-generated answers.
  • KG Sources:
      * Wikidata/DBpedia: General knowledge.
      * DBLP/OpenCitations: Scientific research and publications.
      * GDELT: Real-time news and global events.
      * WordNet RDF: Linguistics and lexical relations.
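
The "Who influenced Alan Turing?" example can be written as a SPARQL query against the public Wikidata endpoint. The sketch below builds the query string in Python and chains wdt:P737 with a SPARQL 1.1 property path to express multiple hops. The entity identifier Q7251 for Alan Turing is an assumption that should be checked on Wikidata, the helper names and User-Agent string are hypothetical, and `run_query` needs network access, so it is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_influence_query(entity_qid, hops=1, limit=20):
    """Build a SPARQL query that follows wdt:P737 ("influenced by") `hops` times."""
    # "a/b" is a SPARQL 1.1 sequence path: follow a, then b.
    path = "/".join(["wdt:P737"] * hops)
    return (
        "SELECT DISTINCT ?person ?personLabel WHERE {\n"
        f"  wd:{entity_qid} {path} ?person .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def run_query(query):
    """Send the query to the public endpoint and return the result bindings."""
    url = WIKIDATA_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(url, headers={"User-Agent": "rag-notes-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Assumption: Q7251 identifies Alan Turing on Wikidata.
one_hop = build_influence_query("Q7251")
two_hop = build_influence_query("Q7251", hops=2)  # influences of his influences
```

Each returned binding carries the entity URI alongside its label, which is what makes provenance tracking possible: the URI can be cited next to the generated answer.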

Graph Embedding Models and Neural Networks

  • TransE (Translating Embedding Model):
      * Models relationships as translations in vector space: $h + r \approx t$, where $h$ is the head, $r$ the relation, and $t$ the tail embedding.
      * Loss function (simplified): $L = \sum_{\text{positive triples}} \|h + r - t\|_2^2$. In practice TransE is trained with a margin-based ranking loss that contrasts each positive triple with corrupted (negative) triples, because minimizing this sum alone admits the trivial all-zero solution.
      * Strengths: Fast and simple; good for 1-to-1 relations.
      * Weaknesses: Struggles with 1-to-many or many-to-many patterns.
  • GraphSAGE (Basic GNN):
      * Aggregates information from a node's neighborhood to learn embeddings.
      * Process for node $v$ at layer $k$:
          1. $h_{N(v)}^{(k)} = \text{Aggregate}(\{h_u^{(k-1)} : u \in N(v)\})$
          2. $h_v^{(k)} = \text{MLP}(\text{Concat}(h_v^{(k-1)}, h_{N(v)}^{(k)}))$
          3. Normalize node embeddings.
  • The Alignment Problem:
      * GNN embeddings (graph structure) and LLM embeddings (natural language) often live in different vector spaces with different distributions, scales, and directions.
      * Solution: Projecting graph embeddings into LLM space using adapters, or applying length normalization (unit vectors) to make cosine similarity meaningful.
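
The TransE score and the three GraphSAGE steps can be written out in plain Python. This is a sketch under simplifying assumptions: mean aggregation is used for Aggregate, a single linear map followed by ReLU stands in for the MLP, and the toy graph, weight matrix, and function names are invented for the example. No training loop is shown.

```python
import math


def transe_score(h, r, t):
    """Squared L2 distance ||h + r - t||^2; lower means a more plausible triple."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def margin_loss(pos, neg, margin=1.0):
    """Margin ranking loss for one (positive, corrupted) pair of (h, r, t) triples."""
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))


def graphsage_layer(features, neighbors, weight):
    """One GraphSAGE layer with mean aggregation.

    features:  {node: list of floats} embeddings from layer k-1
    neighbors: {node: [neighbor nodes]}
    weight:    matrix (list of rows) applied to Concat(self, aggregated)
    """
    dim = len(next(iter(features.values())))
    updated = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        # Step 1: aggregate neighbor embeddings (mean; zeros for isolated nodes).
        agg = [
            sum(features[u][i] for u in nbrs) / len(nbrs) if nbrs else 0.0
            for i in range(dim)
        ]
        # Step 2: concatenate self and neighborhood, then linear map + ReLU.
        concat = list(h_v) + agg
        out = [max(0.0, sum(w * x for w, x in zip(row, concat))) for row in weight]
        # Step 3: L2-normalize the new embedding.
        norm = math.sqrt(sum(x * x for x in out))
        updated[v] = [x / norm for x in out] if norm else out
    return updated


# TransE: a triple that satisfies h + r = t exactly scores 0.
pos = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
neg = ([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])  # corrupted tail

# GraphSAGE on a three-node toy graph.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # adds self and neighborhood
layer1 = graphsage_layer(features, neighbors, weight)
```

Because step 3 returns unit vectors, the resulting node embeddings can be compared by cosine similarity, which is also the normalization mentioned under the alignment problem.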

Memory-Augmented RAG

  • kNN-LM (k-Nearest Neighbors Language Model):
      * Instead of storing all knowledge in model weights, the model retrieves nearest neighbors during generation.
      * Logit Fusion: Merging the retrieved information with the model's output distribution. In kNN-LM this is an interpolation of probabilities, $p(y) = \lambda\, p_{\text{kNN}}(y) + (1 - \lambda)\, p_{\text{LM}}(y)$, rather than an addition of raw logits.
      * Retrieval in the Decoder: Encoding partial output, retrieving context vectors, and using attention to fuse them dynamically at each token step.
  • Vector Databases: Essential for speed (sub-millisecond retrieval) and scalability (billions of vectors) in systems like kNN-LM.
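
The fusion step in kNN-LM can be sketched as follows: retrieved neighbors are converted into a distribution over next tokens via a softmax over negative distances, and that distribution is interpolated with the language model's own. The value of `lam`, the temperature, the example tokens, and the function names are illustrative assumptions, and the nearest-neighbor search itself (the part a vector database would do) is represented only by its output.

```python
import math


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved neighbors into a next-token distribution.

    neighbors: non-empty list of (distance, next_token) pairs from the datastore.
    Closer neighbors get more mass; mass is summed per token.
    """
    weights = [math.exp(-d / temperature) for d, _ in neighbors]
    total = sum(weights)
    p_knn = {}
    for w, (_, token) in zip(weights, neighbors):
        p_knn[token] = p_knn.get(token, 0.0) + w / total
    return p_knn


def interpolate(p_lm, p_knn, lam=0.25):
    """p(y) = lam * p_kNN(y) + (1 - lam) * p_LM(y) over the union of tokens."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }


# The language model alone is undecided between two continuations.
p_lm = {"Paris": 0.5, "London": 0.5}
# Two close neighbors in the datastore continued with "Paris", one distant with "London".
neighbors = [(0.1, "Paris"), (0.2, "Paris"), (1.5, "London")]
p_final = interpolate(p_lm, knn_distribution(neighbors))
```

The retrieved neighbors break the tie in favor of "Paris" without any change to the model weights, which is the core idea of memory augmentation.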

Evaluation and Hallucination Detection

  • Faithfulness Metric (RAGAS):
      * Measures consistency between the generated answer and the retrieved context.
      * Calculation Process:
          1. Claim Generation: LLM breaks the answer into individual factual claims.
          2. Context Verification: An NLI (Natural Language Inference) model (e.g., facebook/bart-large-mnli) checks if each claim is entailed by the context.
          3. Calculation: $\text{Faithfulness Score} = \frac{\text{Number of supported claims}}{\text{Total number of claims}}$.
      * Example: If a model claims green tea cures cancer but the source only mentions antioxidants, the faithfulness score drops (e.g., 0.5 or less).
  • Performance Metrics:
      * BM25 (Lexical Search) Latency: 50–200 ms.
      * Dense Retrieval Latency: 50–300 ms.
      * Re-Ranking (BERT Cross-Encoders) Latency: 200–800 ms.
      * LLM Response Time: 1–3 s.
  • Strategy for Low Scores: Re-rank documents, increase recall, or prompt the LLM to restrict its response more tightly to the provided context.
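
The faithfulness calculation reduces to a ratio once the claims and an entailment check are available. In the sketch below the claims are supplied by hand (in RAGAS an LLM extracts them), and `keyword_overlap_check` is a deliberately crude stand-in for an NLI model such as facebook/bart-large-mnli; both the helper names and the 0.8 overlap threshold are assumptions made so the example runs without any model.

```python
import re


def faithfulness_score(claims, context, is_supported):
    """Fraction of claims that the checker judges to be supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def keyword_overlap_check(claim, context, threshold=0.8):
    """Toy entailment check: share of the claim's content words found in the context.

    A real pipeline would call an NLI model here instead.
    """
    stop_words = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}
    claim_terms = {w for w in re.findall(r"[a-z]+", claim.lower()) if w not in stop_words}
    context_terms = set(re.findall(r"[a-z]+", context.lower()))
    if not claim_terms:
        return False
    return len(claim_terms & context_terms) / len(claim_terms) >= threshold


context = "Green tea contains antioxidants."
claims = [
    "Green tea contains antioxidants",  # supported by the context
    "Green tea cures cancer",           # not supported
]
score = faithfulness_score(claims, context, keyword_overlap_check)  # 1 of 2 claims -> 0.5
```

Passing the checker as a function argument keeps the scoring logic unchanged when the keyword heuristic is swapped for a real NLI model.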

Questions & Discussion

  • Questions: The lecturer asks the audience for questions regarding the RAG pipeline and hallucination detection methods.
  • Hands-on: The tutorial involves setting up an advanced integration between a search engine and a chatbot using RAG, measuring result quality, and identifying hallucinations.