Conversational Search and Retrieval-Augmented Generation (RAG)

Learning Goals and Introduction to RAG

Role of Retrieval-Augmented Generation (RAG) in Conversational Search:
* Definition: A hybrid approach combining information retrieval with generative models (e.g., Large Language Models) to enhance text quality and factual grounding.
* Location: RAG can be applied at various stages, including pre-processing, post-processing, or integrated within the model architecture.
* Methods: Retrieval can utilize plain text, vectors, Knowledge Graphs (KG), and other structured/unstructured formats.
Course Context:
* The Advanced Natural Language Processing (ANLP) course focuses on pre-training and fine-tuning LLMs using supervised, self-supervised, and Reinforcement Learning (RL) methods.
* The Information Retrieval and Text Mining (IRTM) course focuses on integrating retrieval and Knowledge Graph-based methods.
Practical Constraints:
* Most high-end LLMs (GPT, Gemini, Claude) are closed-source and accessible only via APIs.
* Functionality to control factuality is often missing in public APIs, necessitating the manual construction of RAG pipelines to manage model behavior.

Technical Architecture of RAG Pipelines

Definition and Core Benefits:
* RAG relies on an external knowledge base (e.g., Wikipedia, internal databases) instead of purely internal parametric knowledge.
* Up-to-date Output: Access to external sources keeps answers current, provided the knowledge base itself is kept up to date.
* Model Efficiency: Retrieval reduces how much knowledge the LLM must store in its parameters, allowing smaller models to achieve high performance.
* Domain Adaptability: Corpus customization allows the system to specialize in specific fields.
Standard Components:
* Data Ingestion: Loading data from documents, databases, or web pages.
* Chunking: Splitting large documents into manageable segments to preserve semantic context.
* Embedding Model: Converts text into numerical vectors (e.g., Sentence Transformers, OpenAI Embeddings).
* Index / Vector Store: Efficient storage for embeddings (e.g., FAISS) for rapid similarity searching.
* Retriever: Searches the index/vector store for relevant passages based on a query.
* Reader / Generator: An LLM (e.g., GPT, BERT2BERT, FLAN) that synthesizes the final response from retrieved segments.
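
Sketch (illustrative, not from the lecture): the components above can be wired together in a few lines of plain Python. The bag-of-words `embed` function is only a stand-in for a neural embedding model, the list-based index is a stand-in for a vector store such as FAISS, and the final prompt would be passed to a generator LLM. All function names here are assumptions made for this example.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=12):
    """Chunking: split a document into segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Index: a list of (chunk_text, embedding) pairs (stand-in for a vector store)."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(index, query, k=2):
    """Retriever: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, passages):
    """Generator input: retrieved passages followed by the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "BM25 is a lexical ranking function based on term frequency.",
]
question = "Which library does similarity search over vectors?"
index = build_index(documents)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

In a real pipeline each function would be replaced by the corresponding library component; the control flow (ingest, chunk, embed, index, retrieve, generate) stays the same.
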
Essential Python Libraries:
* Haystack: Modular framework for RAG pipelines.
* RAGatouille: Simplifies the use of pre-trained late-interaction retrieval models (ColBERT) in RAG pipelines.
* LangChain: Building blocks for chaining LLMs with retrieval tools.
* FAISS: Facebook AI Similarity Search for efficient vector querying.
* Transformers (HuggingFace): Library for loading and running pre-trained language and embedding models.

Challenges in Conversational Search

Contextual Tracking in Multi-turn Queries:
* Coreference Resolution: Identifying what pronouns (e.g., "he", "it") refer to within long-running conversations.
* Ellipsis Handling: Managing instances where the user omits words, assuming the AI has contextual awareness.
* Case Study:
  * Query 1: "Tell me about the CEO of Tesla."
  * Query 2: "What companies is he involved in?"
  * Problem: The second query relies on identifying "he" as Elon Musk and "Tesla" as the specific electric vehicle company.
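
Sketch (illustrative, not from the lecture): one common way to handle the case study is to rewrite the follow-up query into a standalone query before retrieval. The toy rewriter below simply substitutes pronouns with the most recent entity; the pronoun list and the function name `rewrite_query` are assumptions, and a real system would use a coreference model or an LLM-based rewriter instead.

```python
import re

# Small, deliberately incomplete pronoun list for the example.
PRONOUNS = {"he", "she", "it", "they", "him", "them"}

def rewrite_query(query, last_entity):
    """Replace standalone pronouns with the most recent salient entity."""
    if last_entity is None:
        return query

    def substitute(match):
        word = match.group(0)
        return last_entity if word.lower() in PRONOUNS else word

    return re.sub(r"[A-Za-z]+", substitute, query)

# In a real dialogue the entity would be extracted from the answer to Query 1;
# it is hard-coded here.
history_entity = "Elon Musk"
rewritten = rewrite_query("What companies is he involved in?", history_entity)
```

The rewritten query ("What companies is Elon Musk involved in?") can be sent to the retriever without any conversation history.
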
Bias and Hallucinations:
* Stochastic Parrots: A term coined by Bender et al. (2021) to describe LLMs as having no actual knowledge, understanding, or memory, only next-word prediction based on probabilistic patterns.
* Mitigation Strategies:
  1. Chain of Prompting: Decomposing complex problems into smaller sub-tasks.
  2. Knowledge Injection: Feeding retrieved documents or Knowledge Graph data into the prompt.
  3. Hallucination Detection: Using evaluation frameworks and benchmarks such as RAGAS or TruthfulQA to flag low-confidence responses.
  4. Agentic Architecture: Employing controller agents to steer the LLM.
The Provenance Problem: Identifying the source of information. Perplexity AI was a pioneer in citing sources, a practice now followed by Google, Bing, and OpenAI.

Evolution and Economics of Large Language Models

Model Scaling and Costs (2023–2026):
* GPT-4 (2023): Approximately 1.76 trillion parameters (Mixture of Experts, MoE); 13 trillion tokens; training cost ~USD 100M+; energy consumption under ~50 GWh.
* GPT-4.5 (2025): Approximately 12.8 trillion parameters; 20+ trillion tokens.
* GPT-5 (2025): Approximately 52.5 trillion parameters (MoE); 50 trillion tokens; training cost ~USD 1B+; energy consumption ~65 GWh.
* GPT-5.5 (April 2026): Optimized for agentic tasks with a 1M-token context window.
Inference vs. Training Trends: While training costs have surged from ~USD 100M (GPT-4) to more than USD 1B (GPT-5), inference costs have dropped significantly. GPT-4 input tokens originally cost ~USD 30 per 1M tokens, whereas current high-tier models cost roughly USD 1.25 per 1M tokens, about a 24-fold reduction.
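
Worked arithmetic (illustrative): the two input-token prices quoted above can be turned into a per-request cost. The prompt size of 4,000 tokens is an assumption chosen for the example, not a figure from the lecture.

```python
def cost_usd(tokens, price_per_million_usd):
    """Cost of processing `tokens` input tokens at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million_usd

old_price, new_price = 30.0, 1.25   # USD per 1M input tokens, as quoted in the notes
reduction_factor = old_price / new_price

prompt_tokens = 4_000               # assumed size of one RAG prompt incl. retrieved context
old_cost = cost_usd(prompt_tokens, old_price)
new_cost = cost_usd(prompt_tokens, new_price)
```

Under these assumptions a single RAG request drops from about USD 0.12 to about USD 0.005 of input cost, which is what makes stuffing retrieved passages into the prompt economically viable.
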
Data Center Limits: The industry is shifting toward "Energy Autonomy": Big Tech is commissioning Small Modular Reactors (SMRs) to power gigawatt-scale data centers. The projected training cost for GPT-6 (2027–2028) is ~USD 10B.

Prompt Engineering and Knowledge Injection

Background Prompts: Static factual context added to every query (e.g., "Medical diagnoses should be based on PubMed scientific evidence").
Context Prompts: Maintaining continuity by injecting past interactions or specific roles (e.g., "Write in 17th century English").
Chain of Thought (CoT): Providing intermediate reasoning steps in the prompt to allow the model to follow a logical path.
Prompt Enrichment Methods:
* Focus: Restricting output to specific domains (e.g., "Limit to legal implications").
* Styles: Formal vs. informal or target audience (e.g., "Master Students of Computer Science").
* Structure: Specifying headings or word counts (e.g., "Maximum 200 words").
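
Sketch (illustrative, not from the lecture): the prompt types in this section can be combined by a small template function. The parameter names and the exact wording of each instruction are assumptions; the chain-of-thought flag here only adds a "reason step by step" instruction rather than worked reasoning examples.

```python
def build_prompt(question, background=None, history=None, focus=None,
                 style=None, max_words=None, chain_of_thought=False):
    """Assemble a prompt from optional enrichment parts (all names are illustrative)."""
    parts = []
    if background:
        # Background prompt: static factual context added to every query.
        parts.append(f"Background: {background}")
    for user_turn, reply in history or []:
        # Context prompt: past interactions injected for continuity.
        parts.append(f"User: {user_turn}\nAssistant: {reply}")
    constraints = []
    if focus:
        constraints.append(f"Limit the answer to {focus}.")
    if style:
        constraints.append(f"Write for this audience or style: {style}.")
    if max_words:
        constraints.append(f"Maximum {max_words} words.")
    if chain_of_thought:
        constraints.append("Reason step by step before giving the final answer.")
    if constraints:
        parts.append("Instructions: " + " ".join(constraints))
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Can an employer read employee e-mail?",
    background="Answers should be based on the provided legal sources.",
    focus="legal implications",
    style="Master Students of Computer Science",
    max_words=200,
    chain_of_thought=True,
)
```

Keeping each enrichment as a separate optional argument makes it easy to test which part of the prompt actually changes the model's behavior.
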

Fusion Methods for RAG

Prompt Level (Early Fusion):
* Concatenating top-k retrieved texts into the prompt.
* Methods: Selective Context Injection and Context Window Management.
* Reference: Lewis et al. (2020) original RAG paper; Izacard & Grave (2020) Fusion-in-Decoder (FiD).
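
Sketch (illustrative, not from the lecture): prompt-level fusion with a simple context budget. Word counts are used as a rough proxy for tokens, and the function name `fuse_into_prompt` and the numbering format are assumptions made for this example.

```python
def fuse_into_prompt(query, ranked_passages, k=3, max_context_words=50):
    """Concatenate the top-k passages into the prompt, skipping any that exceed the budget."""
    selected, used = [], 0
    for passage in ranked_passages[:k]:
        n_words = len(passage.split())
        if used + n_words > max_context_words:
            continue  # passage does not fit into the remaining context budget
        selected.append(passage)
        used += n_words
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(selected))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

passages = ["a b c", "d e f g h", "i j"]          # already ranked by the retriever
prompt = fuse_into_prompt("toy question", passages, k=3, max_context_words=5)
```

With a budget of five words, the first and third passages fit and the second is skipped; numbering the passages also gives the model handles for citing its sources.
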
Vector Level (Embedding Fusion):
* Query expansion in embedding space.
* Weighted Averaging: Assigning higher weights to context vectors closer to the query.
* Learned Transformations: Using a small Multi-Layer Perceptron (MLP) to fuse query and context embeddings.
* Cross-Attention: Query vectors attend over retrieved context vectors via attention layers to compute a contextualized query embedding.
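
Sketch (illustrative, not from the lecture): weighted averaging in embedding space. Context vectors are weighted by a softmax over their cosine similarity to the query and then mixed with the query vector. The mixing weight `alpha` and the `temperature` parameter are assumptions of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse_query(query_vec, context_vecs, alpha=0.5, temperature=1.0):
    """Return the expanded query vector and the weights given to each context vector."""
    sims = [cosine(query_vec, c) for c in context_vecs]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]          # closer contexts get larger weights
    dim = len(query_vec)
    context_mean = [sum(w * c[i] for w, c in zip(weights, context_vecs)) for i in range(dim)]
    fused = [alpha * q + (1 - alpha) * m for q, m in zip(query_vec, context_mean)]
    return fused, weights

fused, weights = fuse_query([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The learned-transformation and cross-attention variants replace the fixed softmax weighting with trainable parameters but follow the same pattern: query and context vectors in, one fused query vector out.
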
Late Fusion (Decoder Level):
* Fusing information during the token decoding process.
* Reference: RAG-token model (Lewis et al., 2020).
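
Sketch (illustrative, not from the lecture): in the RAG-token model the next-token distribution is marginalized over the retrieved documents at every decoding step, i.e. $p(y_i \mid x, y_{<i}) = \sum_z p(z \mid x)\, p(y_i \mid x, z, y_{<i})$ . The toy function below performs one such step; the document probabilities and per-document token distributions are made-up numbers.

```python
def rag_token_step(doc_probs, token_dists):
    """One decoding step: mix per-document next-token distributions by p(z | x)."""
    fused = {}
    for p_z, dist in zip(doc_probs, token_dists):
        for token, p_token in dist.items():
            fused[token] = fused.get(token, 0.0) + p_z * p_token
    return fused

doc_probs = [0.7, 0.3]                       # retriever scores p(z | x) for two documents
token_dists = [
    {"Paris": 0.9, "Lyon": 0.1},             # generator conditioned on document 1
    {"Paris": 0.2, "Lyon": 0.8},             # generator conditioned on document 2
]
next_token_probs = rag_token_step(doc_probs, token_dists)
best = max(next_token_probs, key=next_token_probs.get)
```

Because the mixing happens per token, different tokens of the same answer can draw on different documents, which is the key difference from prompt-level fusion.
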

Knowledge Graph (KG) Integration and SPARQL

SPARQL (SPARQL Protocol and RDF Query Language): The SQL equivalent for Knowledge Graphs.
RDF Data Model: Data is stored as "triples": $(Subject) \rightarrow (Predicate) \rightarrow (Object)$ .
* Example: $\text{Elon Musk} \rightarrow \text{CEO of} \rightarrow \text{Tesla}$ .
Benefits for AI:
* Reduces hallucinations by grounding answers in verified facts.
* Enables Multi-hop Reasoning: Finding relationships like "Who influenced Alan Turing?" using Wikidata properties (e.g., wdt:P737, "influenced by").
* Enables Provenance Tracking: Including citations in AI-generated answers.
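
Sketch (illustrative, not from the lecture): triples and multi-hop lookups can be tried out with a tiny in-memory store before moving to a real SPARQL endpoint. The triples, the `match` helper, and the query string are assumptions for this example; the SPARQL text is shown only as a string and is not executed. It assumes the prefixes `wdt:` and `rdfs:` are predefined, as they are in the Wikidata Query Service, and matching by label may return more than one entity.

```python
# Toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("Elon Musk", "CEO of", "Tesla"),
    ("Elon Musk", "CEO of", "SpaceX"),
    ("Tesla", "industry", "Electric vehicles"),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard (like a SPARQL variable)."""
    return [
        (ts, tp, to) for ts, tp, to in triples
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]

# Hop 1: who is the CEO of Tesla?  Hop 2: which other companies does that person lead?
ceos = [s for s, _, _ in match(TRIPLES, p="CEO of", o="Tesla")]
other_companies = sorted(
    {o for ceo in ceos for _, _, o in match(TRIPLES, s=ceo, p="CEO of")} - {"Tesla"}
)

# Corresponding style of query against Wikidata (not executed here).
SPARQL_INFLUENCED_TURING = """
SELECT ?influencerLabel WHERE {
  ?turing rdfs:label "Alan Turing"@en ;
          wdt:P737 ?influencer .
  ?influencer rdfs:label ?influencerLabel .
  FILTER(LANG(?influencerLabel) = "en")
}
"""
```

Each matched triple can be returned together with the answer, which is what makes provenance tracking straightforward in KG-based RAG.
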
KG Sources:
* Wikidata/DBpedia: General knowledge.
* DBLP/OpenCitations: Scientific research and publications.
* GDELT: Real-time news and global events.
* WordNet RDF: Linguistics and lexical relations.

Graph Embedding Models and Neural Networks

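TransE (Translating Embedding Model):
* Models relationships as translations in vector space: $h + r \approx t$ , where $h$ is the head entity, $r$ the relation, and $t$ the tail entity.
* Loss function (simplified, positive triples only): $L = \sum_{\text{positive triples}} \|h + r - t\|_2^2$ . In practice TransE is trained with a margin-based ranking loss against corrupted (negative) triples, because minimizing this term alone admits the trivial solution in which all embeddings are zero.
* Strengths: Fast and simple; good for 1-to-1 relations.
* Weaknesses: Struggles with 1-to-many or many-to-many patterns, because all tails of the same $(h, r)$ pair are pushed toward the same point $h + r$ .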
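
Sketch (illustrative, not from the lecture): TransE scoring with a margin-based ranking loss for one positive and one corrupted triple. The two-dimensional embeddings are made-up values and the margin of 1.0 is an assumption; training (gradient updates, negative sampling) is omitted.

```python
import math

def distance(h, r, t):
    """L2 distance ||h + r - t||_2; lower means the triple is more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(positive, negative, margin=1.0):
    """Hinge loss: the positive triple should score at least `margin` better than the negative."""
    return max(0.0, margin + distance(*positive) - distance(*negative))

h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]     # h + r equals t exactly
far_tail, near_tail = [3.0, 1.0], [1.5, 1.0]     # two corrupted tails

positive = (h, r, t)
loss_far = margin_loss(positive, (h, r, far_tail))    # negative is far away: no loss
loss_near = margin_loss(positive, (h, r, near_tail))  # negative is too close: positive loss
```

The corrupted tail that lies within the margin still incurs a loss, which is what pushes implausible triples away during training.
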
GraphSAGE (Basic GNN):
* Aggregates information from a node's neighborhood to learn embeddings.
* Process for node $v$ at layer $k$ :
  1. $h_{N(v)}^{(k)} = \text{Aggregate}(\{h_u^{(k-1)} : u \in N(v)\})$
  2. $h_v^{(k)} = \text{MLP}(\text{Concat}(h_v^{(k-1)}, h_{N(v)}^{(k)}))$
  3. Normalize node embeddings: $h_v^{(k)} \leftarrow h_v^{(k)} / \|h_v^{(k)}\|_2$ .
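
Sketch (illustrative, not from the lecture): one GraphSAGE layer in plain Python with a mean aggregator. The MLP is reduced to a single linear map followed by ReLU, and the weight matrix, graph, and features are made-up values chosen so the result is easy to check by hand.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(x):
    """Scale x to unit length (step 3)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm else x

def sage_layer(features, neighbors, W):
    """Apply one layer: aggregate neighbors, concatenate with self, transform, normalize."""
    updated = {}
    for v, h_v in features.items():
        neigh = [features[u] for u in neighbors[v]]
        if neigh:
            h_n = [sum(col) / len(neigh) for col in zip(*neigh)]   # step 1: mean aggregate
        else:
            h_n = [0.0] * len(h_v)
        z = matvec(W, h_v + h_n)                                   # step 2: concat + linear
        updated[v] = l2_normalize([max(0.0, a) for a in z])        # ReLU, then step 3
    return updated

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
W = [[1.0, 0.0, 1.0, 0.0],      # output 0 = self[0] + neighbor_mean[0]
     [0.0, 1.0, 0.0, 1.0]]      # output 1 = self[1] + neighbor_mean[1]
out = sage_layer(features, neighbors, W)
```

Stacking k such layers lets each node embedding depend on its k-hop neighborhood.
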
The Alignment Problem:
* GNN embeddings (graph structure) and LLM embeddings (natural language) often live in different vector spaces with different distributions, scales, and directions.
* Solution: Projecting graph embeddings into LLM space using adapters or using length normalization (unit vectors) to make cosine similarity meaningful.
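
Sketch (illustrative, not from the lecture): projecting a graph embedding into a text-embedding space with a linear adapter and then normalizing to unit length. The adapter matrix is hand-written here; in practice it would be learned, and the dimensions (2 and 3) are arbitrary choices for the example.

```python
import math

def project(W, v):
    """Linear adapter: map a graph embedding into the text-embedding dimension."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def l2_normalize(v):
    """Length normalization so that only the direction of the vector matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(u, v):
    """Cosine similarity; equals the dot product once both vectors are unit length."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

graph_vec = [3.0, 4.0]                               # 2-dim GNN embedding, large scale
adapter = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]       # hypothetical 2-to-3 projection
text_vec = [0.6, 0.8, 0.0]                           # 3-dim text embedding, unit scale

aligned = l2_normalize(project(adapter, graph_vec))
similarity = cosine(aligned, text_vec)
```

After projection and normalization the two vectors can be compared directly, even though the original graph embedding had a different dimension and scale.
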

Memory-Augmented RAG

kNN-LM (k-Nearest Neighbors Language Model):
* Instead of storing all knowledge in model weights, the model retrieves nearest neighbors during generation.
* Logit Fusion: Merging the retrieved information with the model's output distribution. kNN-LM interpolates probabilities: $p(w) = \lambda\, p_{\text{kNN}}(w) + (1 - \lambda)\, p_{\text{LM}}(w)$ .
* Retrieval in the Decoder: Encoding partial output, retrieving context vectors, and using attention to fuse them dynamically at each token step.
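
Sketch (illustrative, not from the lecture): the kNN-LM fusion step. Retrieved neighbors are turned into a distribution over next tokens (weights proportional to $e^{-d}$ for distance $d$) and interpolated with the language model's distribution. The neighbor list, the distributions, and the value of `lam` are made-up for the example.

```python
import math

def knn_distribution(neighbors, temperature=1.0):
    """neighbors: list of (distance, next_token); closer neighbors get more probability mass."""
    scores = {}
    for dist, token in neighbors:
        scores[token] = scores.get(token, 0.0) + math.exp(-dist / temperature)
    total = sum(scores.values())
    return {token: s / total for token, s in scores.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    """p(w) = lam * p_kNN(w) + (1 - lam) * p_LM(w) over the union of both vocabularies."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

p_lm = {"Paris": 0.5, "London": 0.5}                 # the LM alone is undecided
neighbors = [(0.0, "Paris"), (0.0, "Paris")]         # retrieved contexts both continue with "Paris"
p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

The retrieved neighbors break the tie in favor of "Paris" without any change to the model weights.
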
Vector Databases: Essential for speed (sub-millisecond retrieval) and scalability (billions of vectors) in systems like kNN-LM.

Evaluation and Hallucination Detection

Faithfulness Metric (RAGAS):
* Measures consistency between the generated answer and the retrieved context.
* Calculation Process:
  1. Claim Generation: An LLM breaks the answer into individual factual claims.
  2. Context Verification: An NLI (Natural Language Inference) model (e.g., facebook/bart-large-mnli) checks whether each claim is entailed by the context.
  3. Calculation: $\text{Faithfulness Score} = \frac{\text{Number of supported claims}}{\text{Total number of claims}}$ .
* Example: If a model claims green tea cures cancer but the source only mentions antioxidants, the faithfulness score drops (e.g., $0.5$ or less).
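
Sketch (illustrative, not from the lecture and not the RAGAS implementation): the faithfulness formula applied to the green tea example. Claim generation is skipped (the claims are given), and `toy_entailment` is a crude word-overlap stand-in for an NLI model; both are assumptions of this example.

```python
import re

def words(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def toy_entailment(context, claim):
    """Stand-in for an NLI model: a claim counts as supported if all its words occur in the context."""
    return words(claim).issubset(words(context))

def faithfulness(claims, context, is_entailed=toy_entailment):
    """Number of supported claims divided by the total number of claims."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_entailed(context, claim))
    return supported / len(claims)

context = "Green tea contains antioxidants."
claims = ["Green tea contains antioxidants.", "Green tea cures cancer."]
score = faithfulness(claims, context)
```

One of the two claims is supported, so the score is 0.5; swapping `toy_entailment` for a real NLI model changes only the `is_entailed` argument.
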
Performance Metrics:
* BM25 (Lexical Search) Latency: $50\text{--}200\,\text{ms}$ .
* Dense Retrieval Latency: $50\text{--}300\,\text{ms}$ .
* Re-Ranking (BERT Cross-Encoders) Latency: $200\text{--}800\,\text{ms}$ .
* LLM Response Time: $1\text{--}3\,\text{s}$ .
Strategy for Low Scores: Re-rank documents, increase recall, or prompt the LLM to restrict its response more tightly to the provided context.

Questions & Discussion

Questions: The lecturer asks the audience for questions regarding the RAG pipeline and hallucination detection methods.
Hands-on: The tutorial involves setting up an advanced integration between a search engine and a chatbot using RAG, measuring result quality, and identifying hallucinations.