In-Depth Notes on Information Access and Retrieval
Introduction to Information Retrieval
- Definition of Information Retrieval (IR):
- Finding unstructured material (usually text) satisfying information needs.
- Originated with professionals (librarians, paralegals); now widely used by general public via search engines.
- IR is becoming the dominant form of information access, overtaking traditional database-style searching.
Strategies for Information Access
Push vs. Pull:
Pull:
Users initiate access (e.g., web search engines).
Suitable for ad hoc (short-lived) information needs, e.g., looking up product reviews.
Users type queries and browse results for relevant information.
Push:
Systems take the initiative and deliver information for stable, long-term needs (e.g., ongoing research topics).
Systems can personalize based on user preferences and past behavior.
Modes of Information Access
Querying vs. Browsing:
Querying:
Users input specific keyword queries.
Effective when users know exact keywords.
Browsing:
Users navigate through structures of documents to explore.
More useful when users are unsure which keywords to use or would rather explore than type a query.
Analogy: Querying is like taking a taxi to a specific address; browsing is like walking to discover attractions.
Text Retrieval
Basic Concept:
- Text retrieval involves responding to user queries with relevant documents.
- Core of information retrieval; supports collections of web pages, articles, and text files.
Distinction between Text Retrieval and Database Retrieval:
- Text Retrieval: Works with unstructured, ambiguous free text; queries are often loosely specified.
- Database Retrieval: Structured data with defined schemas makes queries precise (e.g., SQL).
Evolution of Search Technologies
- Shift from traditional keyword search to:
- Natural Language Query (NLQ) and Question-and-Answer (Q&A) systems powered by Large Language Models (LLMs).
- LLMs can interpret user intent and return contextual answers rather than only a list of matching documents.
Information Retrieval Process
Corpus: Set of documents (e.g., web pages) that the system searches.
Indexing: Preprocessing step that builds the data structures needed for efficient search.
Querying: Users formulate queries based on their information needs, and a retrieval model uses these queries to find relevant documents.
Ranked Retrieval: Results are ordered by estimated relevance to the user's query.
Concept of a Document
- Definition: A document is any unit of information (e.g., web page, text file, email).
- Documents consist of multiple fields (title, body, metadata).
Document Retrieval Models
- Boolean Retrieval Model:
- Users create queries with Boolean logic (AND, OR, NOT).
- Each document is viewed as a set of terms (a term is either present or absent), which can be recorded in a term-document incidence matrix.
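
To make the incidence matrix concrete, here is a minimal sketch in Python. The three-document corpus and the helper names (boolean_and, boolean_or, boolean_not) are made up for illustration, and tokenization is assumed to be plain whitespace splitting.

```python
# Toy corpus (illustrative only); list position serves as the document ID.
docs = [
    "search engines rank web pages",
    "libraries index books and articles",
    "search engines index web pages",
]

# Vocabulary: every distinct term in the corpus.
vocab = sorted({term for text in docs for term in text.split()})

# incidence[term] is one matrix row: a 0/1 flag per document.
incidence = {
    term: [1 if term in text.split() else 0 for text in docs]
    for term in vocab
}

def boolean_and(a, b):
    return [x & y for x, y in zip(a, b)]

def boolean_or(a, b):
    return [x | y for x, y in zip(a, b)]

def boolean_not(a):
    return [1 - x for x in a]

# Query: search AND index AND NOT books
row = boolean_and(
    boolean_and(incidence["search"], incidence["index"]),
    boolean_not(incidence["books"]),
)
matching_ids = [doc_id for doc_id, flag in enumerate(row) if flag]
print(matching_ids)  # [2]
```

Each query operator becomes an element-wise operation on matrix rows, so a document either matches or it does not; there is no ordering among the matches.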
Relevance in Information Retrieval
- Relevance is subjective; it is judged by users against their information needs.
- Boolean retrieval reduces relevance to a yes/no decision: a document either satisfies the keyword expression or it does not.
Inverted Indexes
- Definition and Function:
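- An inverted index maps each term to the documents that contain it. A term-document matrix is very sparse (most terms are absent from most documents), so storing only the actual occurrences is far more efficient than storing the full matrix.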
- Components of Inverted Index:
- Collection (Corpus): All documents to be indexed.
- Index Terms: The distinct terms extracted from the documents.
- Postings: For each term, the IDs of the documents in which it appears.
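
The three components can be sketched as follows. This is a simplified illustration, not a production design: the corpus, the function names build_inverted_index and intersect, and the lowercase whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to a sorted list of document IDs (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Collection (corpus): toy documents keyed by document ID.
docs = {
    1: "web search engines index pages",
    2: "librarians search catalogs",
    3: "engines index text files",
}

inverted = build_inverted_index(docs)
print(inverted["search"])                                # [1, 2]
print(intersect(inverted["search"], inverted["index"]))  # [1]
```

Keeping postings sorted lets an AND query be answered with a single linear merge of the two lists instead of a scan over every document.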
Advanced Indexing Techniques
- Positional Indexing: Records term positions within documents to support phrase searches.
- Enables precise matching of terms in the required order, and also supports proximity queries.
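
A positional index and a phrase query over it might look like the sketch below. The names build_positional_index and phrase_search, the corpus, and the whitespace tokenization are assumptions for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, positions in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted IDs of documents containing the phrase terms consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # A start position p matches if the k-th term occurs at position p + k.
        for p in index[terms[0]][doc_id]:
            if all(p + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "information retrieval systems rank documents",
    2: "retrieval of information from text",
    3: "modern information retrieval",
}

pos_index = build_positional_index(docs)
print(phrase_search(pos_index, "information retrieval"))  # [1, 3]
```

Document 2 contains both terms, so a plain Boolean AND would return it, but the stored positions show the terms are not adjacent and in order, so the phrase query excludes it.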
Retrieval Models Comparison
- Extended Boolean Model vs. Ranked Retrieval:
- Extended Boolean models go beyond basic AND/OR/NOT by adding operators such as term proximity.
- Ranked retrieval models accept free text queries and order documents by estimated relevance to the query.
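
As a rough illustration of ranking for a free text query, the sketch below scores documents with a simple TF-IDF sum. These notes do not name a scoring function, so TF-IDF is an assumption chosen for the example, as are the function name rank and the toy corpus.

```python
import math
from collections import Counter

def rank(query, docs):
    """Score documents by summed TF-IDF weight of the query terms, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in tf
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    1: "boolean queries use and or not",
    2: "ranked retrieval orders results by relevance",
    3: "ranked results put relevant results first",
}

ranking = rank("ranked results", docs)
print([doc_id for doc_id, _ in ranking])  # [3, 2]
```

Unlike a Boolean match, the output is an ordered list: document 3 mentions "results" twice and so scores above document 2, while document 1 shares no query terms and is left out.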
Example Library
- Whoosh: A pure-Python library for adding custom text indexing and search features to applications such as blogging platforms.