Notes on Embeddings, Vector Stores, and Retrieval-Augmented Generation (Text Splitting, Embeddings, FAISS/Milvus/Pinecone, and Agent Architecture)

Conceptual Overview

Retrieval-Augmented Generation (RAG) workflow introduced: load textual data (text and PDF), split into manageable chunks, convert chunks to embeddings via embedding models, store embeddings in a vector store (e.g., FAISS, Milvus, Pinecone), and perform semantic/search-by-vector queries to retrieve relevant chunks for answering questions.
Two main embedding model families discussed: Hugging Face models (e.g., MPNet base) and OpenAI embeddings. Hugging Face model characteristics: maps text to a dense vector space; MPNet base provides a 768-dimensional embedding for each input, useful for clustering and semantic search. OpenAI embeddings provide comparable functionality but are served through a hosted API rather than run locally.
Vector stores (vector databases) store embeddings and support similarity search. Common options mentioned: FAISS (Facebook AI Similarity Search), Milvus, Pinecone. Milvus and Pinecone are vector databases designed to handle large-scale embeddings; FAISS is a local, efficient similarity search library.
Storage and tooling ecosystem: text loaders, document loaders, text splitters, and a lightweight SQLite database (via Python's sqlite3 module) for supporting storage-related tasks. The session emphasizes that the vector store holds embeddings, while a separate database may hold metadata or document mappings.
Practical hurdles showcased: version mismatches and environment issues (e.g., sentence-transformers import errors) and steps to resolve them (updating libraries, restarting kernels, ensuring paths are correct).
End-to-end demo elements mentioned: loading a state-of-the-union text and a resume PDF, chunking into 24-word groups, exploring the effects of chunk size and overlap, generating embeddings, and performing similarity searches to retrieve top results.
Conceptual takeaway: you can mix embedding sources (Hugging Face vs OpenAI) and vector stores; the architecture remains the same—split, embed, index, search, and answer.

Data Loading and Preprocessing

Text loading workflow:
- Use a text loader to read a plain text file (e.g., stateofunion.txt).
- Load text into memory via a loader (e.g., text_loader.load()) and inspect content (e.g., print first 100 characters or first 100 words).
- Common pitfall: naming mismatches (e.g., textloader vs. text_loader.load()) and file path issues leading to FileNotFoundError; ensure the file path is correct and matches the working directory.
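The text loading steps above can be sketched with the standard library alone. This is a minimal illustration, not the loader used in the session: the helper name load_text and the throwaway sample file are assumptions made so the example runs anywhere.

```python
from pathlib import Path
import tempfile

def load_text(path):
    """Read a UTF-8 text file, failing early with a helpful message."""
    p = Path(path).expanduser().resolve()
    if not p.is_file():
        # Report the resolved path and the working directory, the two
        # things that usually explain a "file not found" error.
        raise FileNotFoundError(f"No file at {p} (cwd: {Path.cwd()})")
    return p.read_text(encoding="utf-8")

# Demo with a throwaway file (hypothetical content, not the session's data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("sample text " * 25, encoding="utf-8")
    text = load_text(sample)

print(len(text), "characters")   # total length, including spaces
print(text[:100])                # first 100 characters
print(text.split()[:100])        # first 100 words
```

Printing the resolved path in the error message makes a wrong working directory visible immediately, which is the usual cause of the pitfall noted above.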
PDF loading workflow:
- PDF loader to read resumes or papers (e.g., a CV or an arXiv paper). PDFs may include images; for multimodal embedding, images would require multimodal embedding techniques.
- Demonstrates loading a resume with potential images and conceptual multimodal embedding considerations.
Example file handling details:
- A sample text file (stateofunion.txt) contains about 22,800 characters (including spaces).
- Basic verification step prints the first 100 tokens/words to confirm the file loaded correctly.
- For PDFs: a similar loader path is used, with additional considerations for multimodal content.
Dependency and environment notes:
- Some issues come from missing packages (e.g., text loader or PDF loader usage). Users may need to install packages (e.g., pypdf, pdfminer, sentence-transformers) and ensure compatibility across versions.
- When a library import fails (e.g., sentence-transformers), upgrade or reinstall to resolve compatibility issues; warnings may appear even if execution continues.

Text Splitting and Chunking

Recursive character text splitting:
- Text is recursively split into chunks to facilitate embedding and retrieval.
- A chunk is defined as a contiguous group of words (or characters) kept together for embedding and context preservation.
Chunk size and overlap:
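- Example settings discussed: chunk size = 24 words; the overlap was described as 32 words on each side. An overlap must be smaller than the chunk size for the splitter to advance, so the recorded overlap figure is likely a transcription error or refers to a different chunk-size setting.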
- Practical implication: larger chunk sizes preserve more context per chunk but yield fewer chunks; smaller chunk sizes increase the number of chunks and may improve granularity but can dilute context.
- Concrete demonstration: breaking a resume into 24-word chunks can yield a certain number of chunks (e.g., 15 chunks for the example resume). A simple relation exists between document length, chunk size, and overlap:
- Let N be total tokens/words, c be chunk size, o be overlap, s = c - o be the step between chunk starts. Then the number of chunks K can be approximated by
- $K = \left\lceil\dfrac{N - c}{s}\right\rceil + 1.$
- Doubling the chunk size (e.g., from 24 to 48) reduces the number of chunks roughly by half when the overlap is small relative to the chunk size (and the document length is unchanged), illustrating the trade-off between context per chunk and total chunks.
Practical guidance:
- Start with a moderate chunk size (e.g., 24 or 48) and a reasonable overlap to maintain context across adjacent chunks.
- Avoid too heavy chunking that causes excessive fragmentation or too little overlap that breaks context continuity.
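The relation between document length, chunk size, and overlap can be checked with a small word-based splitter. This is a simplified sketch, not the recursive character splitter from the session: the function names and the synthetic 1000-word document are assumptions for illustration.

```python
import math

def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words, with
    `overlap` words shared between consecutive chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap          # s = c - o
    chunks, start = [], 0
    while True:
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # this chunk reached the end
        start += step
    return chunks

def expected_chunks(n_words, chunk_size, overlap=0):
    """K = ceil((N - c) / s) + 1, valid once N exceeds c."""
    if n_words <= chunk_size:
        return 1
    return math.ceil((n_words - chunk_size) / (chunk_size - overlap)) + 1

# Synthetic document of 1000 distinct words (hypothetical test data).
doc = " ".join(f"w{i}" for i in range(1000))

small = chunk_words(doc, chunk_size=24, overlap=4)
large = chunk_words(doc, chunk_size=48, overlap=4)
print(len(small), expected_chunks(1000, 24, 4))   # 50 50
print(len(large), expected_chunks(1000, 48, 4))   # 23 23
```

With a 4-word overlap, doubling the chunk size takes the count from 50 to 23, so slightly fewer than half; with zero overlap the same document gives 42 and 21 chunks.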

Embeddings and Embedding Models

Two embedding model categories:
- Hugging Face embedding models (e.g., all-mpnet-base-v2 and other sentence-transformers models): benefits include local hosting without API calls; dimensionality example: 768 dimensions for MPNet base.
- OpenAI embedding models: hosted API; require API access; useful when leveraging powerful external models.
Embedding generation process:
- Text chunks from the splitter are fed into the embedding model to produce vector embeddings.
- The resulting embeddings are stored in the vector store for similarity search.
Embedding dimensionality and length:
- MPNet base (Hugging Face) typically maps text to a 768-dimensional vector space.
- OpenAI embeddings have a different dimensionality that depends on the model used (e.g., 1536 for text-embedding-ada-002); the vector store index must be created with the same dimensionality as the embedding model.
Embedded data characteristics:
- Embeddings convert text content into dense numeric vectors that preserve semantic relationships (e.g., similar texts map to nearby vectors in the embedding space).
- The quality and usefulness of search depend on the embedding model’s semantic alignment with the task (clustering vs. semantic search).

Vector Stores and Databases

Purpose of vector stores:
- Store embeddings and perform similarity search to retrieve relevant chunks when given a query embedding.
Popular vector stores mentioned:
- FAISS (Facebook AI Similarity Search): a library typically run locally, with in-memory indices that can be saved to disk; very fast for L2-distance and inner-product search over dense vectors (cosine similarity is obtained by normalizing the vectors first).
- Milvus: scalable, open-source vector database designed for large-scale embedding storage and retrieval.
- Pinecone: managed vector database service for scalable vector storage and fast similarity search.
Local vs. hosted options:
- FAISS is typically used locally; Milvus and Pinecone can be used as server-based services or cloud deployments for larger scale.
Integration with data sources:
- The embedding generation step feeds into the vector store; the vector store then supports retrieval by computing similarity between the query embedding and stored embeddings.
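Since the vector store holds only embeddings, a separate database can map each vector back to its source chunk, as the overview notes for SQLite. The following is a minimal sketch using Python's sqlite3 module; the table layout, column names, and placeholder rows are assumptions, not the schema used in the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # in-memory database for the demo
conn.execute(
    "CREATE TABLE chunks ("
    " vector_id INTEGER PRIMARY KEY,"   # position of the vector in the index
    " source TEXT NOT NULL,"            # originating file
    " chunk_index INTEGER NOT NULL,"    # position of the chunk in that file
    " content TEXT NOT NULL)"           # the chunk text itself
)
rows = [
    (0, "resume.pdf", 0, "Placeholder text of the first chunk"),
    (1, "resume.pdf", 1, "Placeholder text of the second chunk"),
    (2, "speech.txt", 0, "Placeholder text from another document"),
]
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", rows)
conn.commit()

def lookup(vector_ids):
    """Map vector ids returned by a similarity search back to their chunks."""
    marks = ",".join("?" for _ in vector_ids)
    cur = conn.execute(
        f"SELECT vector_id, source, content FROM chunks "
        f"WHERE vector_id IN ({marks})",
        list(vector_ids),
    )
    found = {row[0]: (row[1], row[2]) for row in cur}
    # Preserve the ranking order produced by the search.
    return [found[i] for i in vector_ids if i in found]

results = lookup([2, 0])
print(results)
```

Keeping the text and provenance outside the index means the index can be rebuilt or swapped for another vector store without touching the document mapping.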

Building and Querying the Vector Store

Storage and retrieval workflow:
- After embeddings are created for all chunks, they are stored in a vector store index (e.g., FAISS index).
- For retrieval, an input query is converted to an embedding and compared against stored embeddings using cosine similarity or inner product to obtain top-k results.
- The vector store supports operations like similarity_search and similarity_search_with_score to return the top matches and their relevance scores.
Example retrieval parameters:
- Top-k (k) = 2 or 3 in demonstrations to fetch the most relevant chunks.
- Each vector entry is often accompanied by metadata, such as document ID and page content (chunk content).
Output interpretation:
- Retrieved chunks provide the supporting evidence for answering questions; they are not direct final answers unless combined with an LLM.
- Final answer generation typically requires an LLM to compose a coherent reply from the retrieved chunks.
Saving and reloading indexes:
- The index can be saved to disk to persist embeddings and allow later retrieval without recomputing embeddings.
- Vector stores may also be configured to connect to persistent external databases (e.g., Milvus or Pinecone) for long-term storage and multi-user access.
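The store, query, and save/reload steps above can be illustrated with a brute-force toy store. This is a stand-in written for these notes, not the FAISS, Milvus, or Pinecone API: the class TinyVectorStore, its method names, the JSON file format, and the hand-made 3-dimensional vectors are all assumptions.

```python
import json
import math
import os
import tempfile

class TinyVectorStore:
    """Brute-force cosine-similarity store (toy stand-in for a real index)."""

    def __init__(self):
        self.vectors = []    # list of list[float]
        self.payloads = []   # parallel list of dicts (chunk text, metadata)

    def add(self, vector, payload):
        self.vectors.append([float(x) for x in vector])
        self.payloads.append(payload)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def search_with_score(self, query, k=2):
        """Return the k best (payload, score) pairs, highest score first."""
        scored = [(self._cosine(query, v), p)
                  for v, p in zip(self.vectors, self.payloads)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(p, s) for s, p in scored[:k]]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vectors": self.vectors, "payloads": self.payloads}, f)

    @classmethod
    def load(cls, path):
        store = cls()
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        store.vectors, store.payloads = data["vectors"], data["payloads"]
        return store

store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], {"id": "a", "text": "placeholder chunk A"})
store.add([0.0, 1.0, 0.0], {"id": "b", "text": "placeholder chunk B"})
store.add([0.9, 0.1, 0.0], {"id": "c", "text": "placeholder chunk C"})

query = [1.0, 0.0, 0.0]      # in practice: the embedding of the question
hits = store.search_with_score(query, k=2)
print([(p["id"], round(s, 3)) for p, s in hits])

# Persist, reload, and query again without re-adding any vectors.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    store.save(path)
    reloaded = TinyVectorStore.load(path)
reloaded_hits = reloaded.search_with_score(query, k=2)
```

A real index avoids the linear scan used here, but the contract is the same: vectors and payloads go in, and the k best payloads with scores come out.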

Example Walkthroughs from the Session

Text file loading example:
- stateofunion.txt was loaded into the text loader; first 100 characters were inspected to verify loading worked.
- The file contains approximately 22,800 characters (with spaces).
PDF demonstration:
- CV example (Michael Scott) loaded, showing how PDFs are read as text and prepared for embedding; discussion of multimodal embedding for documents with images.
Chunking demonstration:
- A resume was chunked into 24-word chunks; number of chunks produced = 15 (for that document) given the chunking settings.
Chunk size impact:
- Doubling chunk size reduces the number of chunks; smaller chunks increase the number of chunks but can provide finer granularity and potentially more precise local context.
Embedding and vector search demonstration:
- MPNet-based embedding for text chunks yielded 768-dimensional vectors.
- OpenAI embeddings were used for PDF/text content in some cases; both approaches produced embeddings suitable for indexing in FAISS, Milvus, or Pinecone.
- A FAISS index was created for the split text with OpenAI embeddings; the system demonstrated a similarity search by vector with scores (e.g., scores of roughly 45–48% in one example).
Similarity search outputs:
- For a given query like "What is the candidate's name?", the top vectors contained metadata including the candidate name (e.g., Michael Scott) and other attributes (skills, diploma, address, etc.).
- With k=3, multiple top results could be shown, including skill sets (Word, Excel, PowerPoint, etc.).
Using the vector store with an LLM:
- The retrieved chunks serve as evidence; a downstream LLM can compose a final answer from these chunks.
- It is highlighted that the current example used only embeddings and similarity search, not a full LLM-powered answer in all cases.
Save vs. retrieve workflow:
- A FAISS index is saved to disk; later the stored embeddings can be retrieved and queried again without re-embedding every chunk.

Tools, Agents, and the Four-Stage Agent Cycle

Agent concept:
- An agent uses tools to accomplish tasks: Wikipedia access, SERP API for Google/Bing/other search engines, LLM Math for calculations, etc.
- Tools are loaded and configured (e.g., Wikipedia, SERP API, and a calculator-like tool as LLM Math).
Agent initialization and tool loading:
- Load tools with a defined list of available tools; tools include search engines and data sources, plus computational helpers.
- A typical setup includes an OpenAI API-backed LLM (e.g., gpt-3.5-turbo or similar) to orchestrate tasks with the tools.
The four-stage agent loop:
- Action/Decide: The agent decides which tool to use given a prompt.
- Perform: The agent uses the selected tool to perform an action (e.g., perform a web search or compute a result).
- Observe: The agent observes the outcome of the action.
- Repeat (Thought): The agent uses the observed results to decide the next action and iterate until a satisfactory answer is produced.
Example: solving a math problem (square root of 16) using a calculator tool instead of direct arithmetic in the model.
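The four-stage loop and the square-root example can be sketched without any LLM. In this illustration a hand-written rule (the hypothetical function decide) stands in for the model's reasoning, and the calculator tool only understands one expression form; none of these names come from the session's code.

```python
import math
import re

def calculator(expression):
    """Toy calculator tool: only understands 'sqrt(<number>)'."""
    match = re.fullmatch(r"sqrt\((\d+(?:\.\d+)?)\)", expression.strip())
    if not match:
        return "error: unsupported expression"
    return str(math.sqrt(float(match.group(1))))

TOOLS = {"calculator": calculator}

def decide(question, observations):
    """Stand-in for the LLM: choose the next action from what is known."""
    if observations:                       # a tool result is already available
        return ("finish", observations[-1])
    match = re.search(r"square root of (\d+(?:\.\d+)?)", question.lower())
    if match:
        return ("calculator", f"sqrt({match.group(1)})")
    return ("finish", "I do not know.")

def run_agent(question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        tool, tool_input = decide(question, observations)   # 1. decide
        if tool == "finish":
            return tool_input
        result = TOOLS[tool](tool_input)                    # 2. perform
        observations.append(result)                         # 3. observe
        # 4. repeat: the next iteration sees the new observation
    return "Stopped: step limit reached."

answer = run_agent("What is the square root of 16?")
print(answer)   # 4.0
```

The step limit is a design choice worth keeping in any real agent: it bounds cost if the decision step never reaches a final answer.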
Temperature parameter:
- Temperature = 0 makes the agent's output as deterministic as the model allows (the most likely continuation is always chosen); higher temperatures introduce more creativity/randomness in tool selection and reasoning paths.
- A zero-shot setup (no explicit examples) can still produce stable results, but tweaking temperature can change the degree of determinism and exploration.
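The effect of temperature can be illustrated with a generic sampling sketch. This shows how temperature reshapes a choice among scored options in general; it is not how any particular hosted API is implemented, and the three scores are hypothetical values for three candidate tools.

```python
import math
import random

def sample_with_temperature(scores, temperature, rng):
    """Pick an index from raw scores; temperature 0 means plain argmax."""
    if temperature == 0:
        return max(range(len(scores)), key=lambda i: scores[i])
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    # Softmax weights; subtracting the max keeps exp() numerically stable.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

scores = [2.0, 1.5, 0.5]      # hypothetical preference scores for three tools
rng = random.Random(0)        # fixed seed so the demo is repeatable

greedy = {sample_with_temperature(scores, 0, rng) for _ in range(100)}
warm = {sample_with_temperature(scores, 1.0, rng) for _ in range(100)}
print("temperature 0 picks:", greedy)   # always the top-scoring option
print("temperature 1 picks:", warm)     # several options appear
```

At temperature 0 the highest-scoring option is chosen every time, while at temperature 1 lower-scoring options are also selected in proportion to their softmax weight.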
Prompt and tool orchestration details:
- The agent can use a combination of tools (OpenAI LLM, Wikipedia API, SERP API) to gather information and compute results.
- The example showed a scalable tool network that could include many search engines and data providers (e.g., Google, Bing, DuckDuckGo, Yahoo Finance) and more.
Prompt injection and transparency considerations:
- Prompt injection refers to crafted input that overrides or subverts the agent's instructions, for example to make the model reveal internal configurations, proprietary APIs, or other sensitive details; it poses security and IP concerns.
- Transparency vs. prompt injection: explain what data sources the agent used and what data sources are being accessed, while avoiding disclosure of proprietary architecture or internal tool configurations. Guardrails are needed to prevent leakage of internal model details.

Handling Errors, Troubleshooting, and Practical Hints

Common issues and fixes:
- Missing file paths or incorrect working directories leading to FileNotFound errors; ensure the notebook and the data files reside in the same folder or provide absolute paths.
- Version mismatches (e.g., sentence-transformers) causing ImportError or deprecated API warnings; upgrade/downgrade library versions, re-import modules, and restart the kernel.
- Memory constraints when embedding large documents; consider chunking strategy, batch embedding, or incremental indexing.
Dependency management tips:
- Install required libraries in the right order: text/pdf loaders, then sentence transformers or OpenAI embeddings, then vector store libraries.
- When updating libraries, re-check compatibility with existing code; deprecation warnings may appear but not block execution.
Practical deployment notes:
- In enterprise-scale usage (e.g., 500 PDFs), a crawling/parsing strategy plus orchestrated embedding pipeline is needed. Pure bulk embedding may be inefficient; use crawling and incremental indexing.
- For mixed data types (PDFs, text, JSON, code, images), specialized parsers and multipath pipelines may be required; this is beyond the basic demo and is an enterprise-scale concern.
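Incremental indexing with batched embedding, as suggested above, might look like the following sketch. It assumes documents are already parsed to text; the function names, the hash-based change detection, and the fake embedding call are illustrative choices rather than the session's pipeline.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def batched(items, batch_size):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def index_incrementally(documents, seen_hashes, embed_batch, batch_size=2):
    """Embed only documents whose content hash is not in seen_hashes.

    documents: dict of name -> text; embed_batch: callable taking a list
    of texts. Returns the names embedded on this run."""
    pending = [(name, text) for name, text in documents.items()
               if content_hash(text) not in seen_hashes]
    embedded = []
    for batch in batched(pending, batch_size):
        embed_batch([text for _, text in batch])   # one call per batch
        for name, text in batch:
            seen_hashes.add(content_hash(text))
            embedded.append(name)
    return embedded

# Hypothetical stand-in for a real embedding call; it only records batch sizes.
calls = []
def fake_embed_batch(texts):
    calls.append(len(texts))

docs = {"a.pdf": "alpha", "b.pdf": "beta", "c.pdf": "gamma"}
seen = set()
first = index_incrementally(docs, seen, fake_embed_batch)

docs["d.pdf"] = "delta"              # a new document arrives
docs["a.pdf"] = "alpha, revised"     # an existing document changes
second = index_incrementally(docs, seen, fake_embed_batch)
print(first, second, calls)
```

On the second run only the new and the changed document are embedded, which is the saving incremental indexing is meant to provide over re-embedding the whole collection.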
Observability and MLOps:
- Observability and monitoring are essential for production agents; plan for metrics, logging, and performance dashboards (Weights & Biases is proposed in the session as a tool for MLOps).

Ethical, Philosophical, and Practical Implications

Data privacy and usage:
- Embeddings and vector stores involve processing user data; organizations must clarify whether data is stored, used for training, or kept private according to policy.
- Transparency should cover how data is used and what data is stored or indexed; proprietary model details should be guarded to protect IP.
Prompt injection risk:
- Open-ended prompts can unintentionally reveal internal configurations or data sources; guardrails are needed to prevent disclosure of sensitive or proprietary information.
Real-world relevance:
- The described pipeline enables building question-answering assistants over large document collections, enabling fast retrieval and evidence-based responses, which is valuable in research, HR (resumes/candidates), and enterprise knowledge management.
Practical trade-offs:
- Accuracy vs. latency: larger chunks provide more context per retrieved result but add more text to each downstream LLM prompt, which increases latency; smaller chunks improve granularity but may require more sophisticated aggregation to produce coherent answers.
- Determinism vs creativity: zero-temperature prompts yield stable results, which is important for reliability in production.

Mathematical and Technical Summary (with LaTeX)

Chunking parameters and relations:
- Let N be the total number of tokens/words in a document, c be the chunk size, and o be the overlap, i.e., the number of tokens/words shared by two consecutive chunks, with 0 ≤ o < c.
- The step between chunk starts is s = c − o.
- The approximate number of chunks is:
- $K = \left\lceil\dfrac{N - c}{s}\right\rceil + 1.$
- If the chunk size is increased with all else equal, the number of chunks decreases roughly in proportion to 1/(c − o) for large N; when the overlap is small relative to c, doubling c roughly halves K.
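A short worked derivation of the large-N behavior, using the symbols defined above (the numeric values are hypothetical and chosen only for illustration):
- For $N \gg c$ the chunk count is approximately $K \approx \dfrac{N}{c - o}.$
- Doubling the chunk size at fixed overlap therefore gives $\dfrac{K_{2c}}{K_{c}} \approx \dfrac{c - o}{2c - o}.$
- With $o = 0$ the ratio is exactly $\tfrac{1}{2}$; with $c = 24$ and $o = 8$ it is $\dfrac{16}{40} = 0.4$, so the count falls by slightly more than half.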
Embedding dimensionality (example):
- MPNet base (Hugging Face) yields embeddings in a 768-dimensional space:
- $\text{dim}({\bf e}) = 768.$
Similarity search (cosine similarity):
- Given a query embedding q and a stored embedding e_i, the cosine similarity is:
- $\text{cos-sim}(q, e_i) = \dfrac{q \cdot e_i}{\|q\| \, \|e_i\|}.$
- The vector store returns the top-k embeddings with highest cosine similarity (or inner product, depending on configuration).
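The cosine formula, and its relation to inner-product search, can be verified on small vectors. This is a minimal sketch with hand-picked 2-dimensional vectors (assumed values, not embeddings from the session); it shows that the inner product matches cosine similarity only after normalizing the vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(q, e):
    return dot(q, e) / (norm(q) * norm(e))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

q = [3.0, 4.0]
same_direction = [6.0, 8.0]    # parallel to q, twice as long
orthogonal = [4.0, -3.0]       # perpendicular to q
long_vector = [40.0, 30.0]     # large magnitude, different direction

print(cosine_similarity(q, same_direction))   # 1.0
print(cosine_similarity(q, orthogonal))       # 0.0
print(cosine_similarity(q, long_vector))      # approximately 0.96

# The raw inner product favors the long vector (240 vs 50) ...
print(dot(q, long_vector), dot(q, same_direction))
# ... but after normalization it agrees with cosine similarity.
print(dot(normalize(q), normalize(long_vector)))
```

This is why an index configured for inner-product search is usually fed unit-length vectors when cosine ranking is wanted.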
Vector store concept: there is a mapping from a document to a set of chunks, each chunk mapped to a vector embedding; the index stores these vectors for fast retrieval. The output of a similarity search is a set of chunks (with associated metadata) ordered by relevance.
Agent four-stage loop (conceptual):
- Action/Decide → Perform → Observe → (Thought) Repeat until a satisfactory result is obtained.
- Tools may include: LLM, Wikipedia access, SERP API for web search, and a calculator tool (LLM Math).

Quick Practical Checklist for Students

Understand the end-to-end pipeline: load data → split into chunks → embed chunks → index in vector store → query with embeddings → retrieve top chunks → synthesize answer with an LLM.
Be able to explain chunking trade-offs: chunk size, overlap, and their effect on the number of chunks and context preservation.
Distinguish embedding sources (Hugging Face vs OpenAI) and their implications for cost, latency, and access requirements.
Explain what a vector store does and why it’s essential for semantic search.
Describe the agent architecture and its four-stage loop; understand how tools are used in practice to answer questions.
Recognize common pitfalls (path issues, dependency mismatches) and basic debugging steps.
Articulate the ethical considerations around transparency and prompt injection when using external tools and LLMs.
Be able to compute basic formulas related to chunking and vector similarity, and explain how these influence system design.

Summary Takeaways

The session demonstrated building a retrieval-augmented system that converts raw documents into embeddings, stores them in a vector store, and retrieves context for answering questions, with practical hands-on considerations like chunking, model choices, and tool integration.
It highlighted the modular nature of such systems: data loaders, chunkers, embedders, vector stores, and LLMs are pluggable components that can be swapped independently.
It also emphasized the realities of production work: version issues, dependency management, and the need for guardrails around prompt injections and data privacy.