In-Depth Notes on Information Access and Retrieval

Introduction to Information Retrieval

Definition of Information Retrieval (IR):
- Finding unstructured material (usually text) satisfying information needs.
- Originated with professionals (librarians, paralegals); now widely used by the general public via search engines.
- Now the dominant form of information access, overtaking traditional database-style searching.

Strategies for Information Access

Push vs. Pull:
- Pull:
  - Users initiate access (e.g., web search engines).
  - Suitable for ad hoc information needs (short-lived, e.g., looking up product reviews).
  - Users type queries and browse results for relevant information.
- Push:
  - Systems anticipate stable, long-term information needs (e.g., ongoing research topics) and deliver content without an explicit request.
  - Systems can personalize based on user preferences and past behavior.

Modes of Information Access

Querying vs. Browsing:
- Querying:
  - Users input specific keyword queries.
  - Effective when users know exact keywords.
- Browsing:
  - Users navigate through structures of documents to explore.
  - More useful when users are unsure of keywords or prefer convenience.
- Analogy: Querying is like taking a taxi to a specific address; browsing is like walking to discover attractions.

Text Retrieval

Basic Concept:
- Text retrieval involves responding to user queries with relevant documents.
- Core of information retrieval; supports collections of web pages, articles, and text files.
Distinction between Text Retrieval and Database Retrieval:
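- Text Retrieval: Works with unstructured, often ambiguous free text; queries are usually loosely specified, and results are judged by relevance rather than exact correctness.
- Database Retrieval: Works with structured data under a defined schema, so queries are precise (e.g., SQL) and have well-defined answers.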
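
The contrast above can be illustrated with a minimal sketch. The table, the review texts, and the helper keyword_match are hypothetical and exist only for this example; the keyword matcher is deliberately naive.

```python
import sqlite3

# Database retrieval: the schema gives the query an exact meaning.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("laptop", 899.0), ("phone", 499.0), ("tablet", 299.0)],
)
rows = conn.execute(
    "SELECT name FROM products WHERE price < 500 ORDER BY name"
).fetchall()
cheap_products = [name for (name,) in rows]

# Text retrieval: free text has no schema, so matching is approximate.
REVIEWS = {
    "r1": "this phone is affordable and fast",
    "r2": "cheap laptop but the screen is dim",
    "r3": "the tablet feels premium",
}


def keyword_match(docs, query):
    """Return IDs of documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if terms & set(text.lower().split())
    )


matches = keyword_match(REVIEWS, "cheap phone")
```

The SQL query has one correct answer. The keyword query returns r1 and r2, but r2 is about a laptop, and r1 matches only on "phone" because "affordable" is not recognized as a synonym of "cheap". Whether either result is relevant depends on what the user actually wanted.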

Evolution of Search Technologies

Shift from traditional keyword search to:
- Natural Language Query (NLQ) and Question-and-Answer (Q&A) systems powered by Large Language Models (LLMs).
- LLMs aim to interpret user intent and return contextual answers rather than only a list of matching documents, which can improve the user experience.

Information Retrieval Process

Corpus: The set of documents (e.g., web pages) that the system searches over.
Indexing: Preprocessing step that builds data structures to enable efficient searches.
Querying: Users formulate queries based on their information needs, and retrieval models use these queries to find relevant documents.
Ranked Retrieval: Results are ranked based on relevance to user queries.
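
The steps above (corpus, indexing, query, ranked results) can be sketched end to end. This is a toy illustration under stated assumptions: the corpus is made up, tokenization is a plain whitespace split, and the score is simply the summed count of query terms in each document, which is far cruder than the scoring used by real engines.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: document ID -> text.
CORPUS = {
    "page1": "laptop review best budget laptop",
    "page2": "phone review",
    "page3": "budget travel tips",
}


def build_index(corpus):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def ranked_search(index, query):
    """Score each document by summed query-term frequency, best first."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Sort by descending score, breaking ties by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(CORPUS)
results = ranked_search(index, "budget laptop review")
```

Here page1 ranks first because it matches all three query terms (and "laptop" twice), while page2 and page3 each match a single term.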

Concept of a Document

Definition: A document is any unit of information (e.g., web page, text file, email).
- A document may consist of multiple fields (e.g., title, body, metadata).

Document Retrieval Models

Boolean Retrieval Model:
- Users create queries with Boolean logic (AND, OR, NOT).
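- Each document is viewed as a set of terms: a term is either present or absent. This is recorded in a term-document incidence matrix, with one row per term, one column per document, and a 1 wherever the term occurs in the document.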
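
A minimal sketch of Boolean retrieval over a term-document incidence matrix follows. The three documents and all function names are assumptions made for illustration; each matrix row is a list of 0/1 flags, and the Boolean operators are applied element by element.

```python
# Hypothetical documents: ID -> text.
DOCS = {
    "d1": "search engines rank web pages",
    "d2": "librarians search library catalogs",
    "d3": "databases store structured records",
}


def build_incidence_matrix(docs):
    """Return (doc_ids, matrix) where matrix[term] is a 0/1 row over doc_ids."""
    doc_ids = sorted(docs)
    tokens = {doc_id: set(docs[doc_id].lower().split()) for doc_id in doc_ids}
    vocabulary = sorted(set().union(*tokens.values()))
    matrix = {
        term: [int(term in tokens[doc_id]) for doc_id in doc_ids]
        for term in vocabulary
    }
    return doc_ids, matrix


def bool_and(a, b):
    return [x & y for x, y in zip(a, b)]


def bool_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bool_not(a):
    return [1 - x for x in a]


def matching_docs(doc_ids, vector):
    """Translate a 0/1 result vector back into document IDs."""
    return [doc_id for doc_id, bit in zip(doc_ids, vector) if bit]


doc_ids, matrix = build_incidence_matrix(DOCS)
# Query: search AND NOT web
hits = matching_docs(doc_ids, bool_and(matrix["search"], bool_not(matrix["web"])))
```

The row for "search" is [1, 1, 0] and the negated row for "web" is [0, 1, 1], so the query matches only d2.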

Relevance in Information Retrieval

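Relevance is subjective; it is judged by users against their information needs, not against the query text alone.
- Boolean retrieval reduces relevance to a binary match on the query terms: a document either satisfies the expression or it does not, with no notion of degree.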

Inverted Indexes

Definition and Function:
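- An inverted index maps each term to the list of documents in which it appears. Because a term-document matrix is extremely sparse (most terms occur in very few documents), storing only the occurrences is far more space- and time-efficient than storing the full matrix.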
Components of Inverted Index:
1. Collection (Corpus): All documents to be indexed.
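2. Index Terms (Dictionary): The distinct words extracted from the collection.
3. Postings: For each term, the list of IDs of the documents in which the term appears, usually kept sorted.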
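
The three components above can be sketched as follows. The four short documents are made up for illustration, and tokenization is a plain lowercase whitespace split. The intersect function shows the usual merge of two sorted postings lists to answer an AND query.

```python
from collections import defaultdict

# Collection: hypothetical documents keyed by integer ID.
DOCS = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}


def build_inverted_index(docs):
    """Map each index term to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


index = build_inverted_index(DOCS)
july_and_rise = intersect(index["july"], index["rise"])
```

Keeping postings sorted is what lets the merge run in time proportional to the combined length of the two lists, instead of comparing every pair of IDs.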

Advanced Indexing Techniques

Positional Indexing: Records term positions within documents to support phrase searches.
- Enables precise matching of terms in proper order.
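
A minimal sketch of a positional index and a phrase search built on it is shown below. The documents and function names are illustrative assumptions, and the phrase check is a simple scan over candidate start positions rather than an optimized merge.

```python
from collections import defaultdict

# Hypothetical documents keyed by integer ID.
DOCS = {
    1: "information retrieval systems rank documents",
    2: "retrieval of information from documents",
    3: "modern information retrieval",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents containing the terms consecutively, in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(DOCS)
phrase_hits = phrase_search(index, "information retrieval")
```

All three documents contain both words, so a plain AND query would return all of them. The phrase query drops document 2, where "retrieval" comes before "information".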

Retrieval Models Comparison

Extended Boolean Model vs. Ranked Retrieval:
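- Extended Boolean models go beyond basic AND, OR, and NOT by adding further operators, such as term proximity operators.
- Ranked retrieval models accept free-text queries and order results by estimated relevance to the query, rather than returning an unordered set of exact matches.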

Example Library

Whoosh: A pure-Python library for adding custom text indexing and search features to applications such as blogging platforms.