Information Retrieval (IR)
IR is the process of finding unstructured material (usually text) that satisfies an information need from large document collections.
Structured vs Unstructured Data
Structured data fits into predefined schemas (e.g., SQL tables); unstructured data includes text, images, audio, and video with no fixed format.
Inverted Index
A key data structure in IR mapping each term to the list of documents containing it. Enables efficient query retrieval.
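A minimal sketch of an inverted index as a dictionary mapping each term to a sorted postings list; the document IDs and texts below are made up for illustration:

```python
# Minimal inverted index: term -> sorted list of document IDs.
from collections import defaultdict

docs = {
    0: "new home sales top forecasts",
    1: "home sales rise in july",
    2: "increase in home sales in july",
}

index = defaultdict(list)
for doc_id, text in docs.items():  # docs iterated in ID order,
    for term in set(text.split()):  # so postings stay sorted
        index[term].append(doc_id)

def query_and(*terms):
    """Boolean AND query: intersect the terms' posting lists."""
    postings = [set(index[t]) for t in terms]
    return sorted(set.intersection(*postings))

print(query_and("home", "sales", "july"))  # -> [1, 2]
```

Real systems store term frequencies and positions alongside each posting and intersect the lists in order of increasing length.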
Incidence Matrix
A binary matrix marking whether each document contains each term; impractical for large collections because the matrix is enormous and almost entirely zeros (extremely sparse).
Information Need vs Query
The information need is what the user wants; the query is how they express it. Queries are often incomplete.
Vector Space Model
Represents documents and queries as vectors in term space; similarity measured via cosine similarity.
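A small sketch of cosine similarity between a query vector and a document vector (the vectors here are arbitrary toy weights, not derived from real documents):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = [1.0, 1.0, 0.0]   # query weights over a 3-term vocabulary
d = [2.0, 2.0, 0.0]   # same direction as q, different length

print(cosine(q, d))            # -> 1.0 (identical direction)
print(cosine(q, [0.0, 0.0, 1.0]))  # -> 0.0 (no shared terms)
```

Because cosine normalizes by vector length, a long document is not favored merely for repeating terms.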
Term Frequency (TF)
The number of times a term appears in a document; higher frequency suggests greater importance within that document (often dampened, e.g., with a logarithm).
Inverse Document Frequency (IDF)
Measures how rare a term is across all documents; reduces weight of common words.
TF-IDF Weighting
Combines TF and IDF to highlight terms frequent in a document but rare across the corpus.
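A sketch of TF-IDF using raw term frequency and unsmoothed IDF (many systems use log-scaled TF and smoothed IDF variants instead); the toy corpus is invented for illustration:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
N = len(docs)
tokenized = [d.split() for d in docs]
# Document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenized for t in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)           # raw term frequency
    idf = math.log(N / df[term])          # assumes term occurs in the corpus
    return tf * idf

# "the" is frequent in doc 0 but common across the corpus;
# "cat" occurs once but only in doc 0, so it scores higher.
print(tf_idf("the", tokenized[0]))
print(tf_idf("cat", tokenized[0]))
```

The rare term outweighs the frequent-but-common one, which is exactly the behavior TF-IDF is designed to produce.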
Precision
Fraction of retrieved documents that are relevant.
Recall
Fraction of relevant documents that are retrieved.
F-measure
Harmonic mean of precision and recall; balances both metrics.
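The three definitions above can be sketched directly; the retrieved/relevant ID sets below are arbitrary examples:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

retrieved = [1, 2, 3, 4]   # what the system returned
relevant = [2, 4, 5]       # ground-truth relevant documents

p = precision(retrieved, relevant)   # 2 of 4 retrieved are relevant
r = recall(retrieved, relevant)      # 2 of 3 relevant were found
print(round(p, 3), round(r, 3), round(f1(p, r), 3))  # -> 0.5 0.667 0.571
```

The harmonic mean punishes imbalance: a system with perfect recall but near-zero precision still gets a near-zero F-score.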
Mean Average Precision (MAP)
Mean over a set of queries of each query's average precision (precision averaged at the rank of each relevant document); summarizes retrieval effectiveness in a single number.
Precision@k (P@k)
Proportion of relevant documents among the top k results.
R-Precision
Precision when retrieving R documents, where R = total relevant docs for a query.
Discounted Cumulative Gain (DCG)
Evaluation metric that accounts for document rank and graded relevance.
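A sketch of one common DCG formulation, which discounts each graded gain by log2(rank + 1); another widespread variant uses 2^rel - 1 as the gain. The relevance grades below are invented:

```python
import math

def dcg(gains):
    """DCG over graded relevance scores, ranks starting at 1.

    Discount for rank i (1-based) is log2(i + 1), so the top
    result is undiscounted and lower ranks count progressively less.
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

# Graded relevance (0-3) of the top-4 results, in ranked order.
print(round(dcg([3, 2, 3, 0]), 3))  # -> 5.762
```

Dividing by the DCG of the ideal (relevance-sorted) ranking yields normalized DCG (nDCG), which is comparable across queries.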
Relevance Feedback
User labels retrieved documents as relevant or not; the system updates the query vector to improve ranking.
Rocchio Algorithm
Updates query vector by moving it closer to relevant documents and away from non-relevant ones.
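A sketch of the Rocchio update with the commonly cited default weights (alpha=1.0, beta=0.75, gamma=0.15); the vectors below are toy term-weight vectors, and negative components are clipped to zero as is usual in practice:

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Move query vector toward relevant docs, away from non-relevant ones."""
    dim = len(q)

    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    c_rel, c_non = centroid(rel), centroid(nonrel)
    # Clip negative weights to 0: a query cannot "anti-contain" a term.
    return [max(0.0, alpha * q[i] + beta * c_rel[i] - gamma * c_non[i])
            for i in range(dim)]

q = [1.0, 0.0, 0.0]                       # original query
rel = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]  # user-marked relevant docs
nonrel = [[0.0, 0.0, 1.0]]                # user-marked non-relevant doc

expanded = rocchio(q, rel, nonrel)
print(expanded)  # term 2 gains weight; term 3 is pushed to zero
```

The updated query now weights a term it never contained, purely because relevant documents contain it.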
Query Expansion
Adds new, related terms (e.g., synonyms) to improve recall.
Relevance (Mathematical Definition)
A binary relation R ⊆ Q × D, where (q, d) ∈ R means document d is relevant to query q.
Index Construction
Process of parsing documents, tokenizing terms, and creating inverted lists for efficient lookup.
BSBI Algorithm
Blocked Sort-Based Indexing: builds partial indexes in memory, sorts them, and merges to create the full index.
SPIMI Algorithm
Single-Pass In-Memory Indexing: builds separate dictionaries per block without global term IDs; merges at the end.
Distributed Indexing
Uses multiple machines (parsers and inverters) coordinated by a master to scale indexing across huge collections.
Web Crawling
Automated process of discovering and fetching web pages for indexing.
Crawler Components
Include seed URLs, scheduler/queue, downloader, parser, and frontier update cycle.
Robots.txt
A file specifying which parts of a site crawlers may access.
Coverage vs Quality
Coverage = how much of the web is indexed; Quality = relevance and authority of indexed pages.
Breadth-First Crawling
Visits pages level by level from seeds to ensure early wide coverage.
PageRank
Algorithm ranking pages based on link structure; pages with many high-quality inbound links rank higher.
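A sketch of PageRank via power iteration, assuming for simplicity that every node has at least one outbound link (real implementations must also handle dangling nodes); the link graph is made up:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration over a link graph: dict node -> outbound nodes."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}          # start uniform
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}  # teleportation mass
        for u, outs in links.items():
            share = pr[u] / len(outs)          # split rank over out-links
            for v in outs:
                new[v] += d * share
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# "c" is linked from both "a" and "b", so it ends up ranked highest.
print(ranks)
```

Note that "b" ranks lowest despite being linked from "a": it receives only half of "a"'s rank, while "c" collects mass from two sources.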
Early Crawlers
WWWW (1993), WebCrawler (1994), Lycos (1994), and AltaVista (1995) laid foundations for search engines.
Relevance Problem in IR
Users express needs poorly; queries often mismatch user intent.
Feature Extraction (Multimedia IR)
Measures aspects like color, boundaries, or texture to represent multimedia data.
Deep Learning in IR
Automatically extracts high-level features (e.g., with CNNs) from multimedia content.
Black Box Problem
Deep models lack interpretability in how they represent or weigh features.
IID Assumption (Machine Learning)
Assumes training examples are independent and identically distributed—often violated in temporal or sequential data.
Tokenization
Breaking text into individual units (tokens) such as words or phrases.
Stop Words
Common words (e.g., “the”, “is”) often removed to reduce noise in retrieval.
Normalization
Standardizing text (e.g., lowercasing, removing punctuation, merging variants like “U.S.A.” and “USA”).
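The tokenization, stop-word, and normalization steps above can be sketched as one small pipeline; the stop-word list is a toy subset, not a standard one:

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "in"}  # toy list for illustration

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # "U.S.A." and "USA" both -> "usa"
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. is a country."))  # -> ['usa', 'country']
```

Punctuation stripping is what merges variants like "U.S.A." and "USA", but it is lossy: "C.A.T." and "cat" collapse too, so real systems apply normalization rules more selectively.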
Compound Word Splitting
Used in languages like German to separate long compound terms into meaningful parts.
Precision-Recall Tradeoff
Improving recall can reduce precision and vice versa; balanced depending on use case.
User-Oriented Evaluation
Considers user satisfaction and perceived relevance beyond system metrics.
Multimedia IR Challenge
Non-textual data (images, audio, video) is ambiguous and high-dimensional, making retrieval complex.
Feature Fingerprint
Compact numerical summary of multimedia data capturing key distinguishing features.
Temporal Data in IR
Data with a time dimension (e.g., video, audio) breaks independence assumptions of traditional models.
Crawling Politeness
Crawlers must avoid overloading servers by respecting delays and limits.
Index Compression
Reduces index size by encoding gaps or frequent patterns in posting lists.
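A sketch of gap (delta) encoding for a postings list: storing differences between consecutive document IDs yields small integers that a variable-length code (e.g., variable-byte or gamma coding) can then pack tightly. The ID list is invented:

```python
def to_gaps(postings):
    """Sorted doc IDs -> first ID followed by successive differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Inverse: cumulative sum restores the original doc IDs."""
    out = [gaps[0]]
    for g in gaps[1:]:
        out.append(out[-1] + g)
    return out

ids = [3, 7, 12, 40, 41]
print(to_gaps(ids))  # -> [3, 4, 5, 28, 1]
assert from_gaps(to_gaps(ids)) == ids  # lossless round trip
```

Frequent terms benefit most: their postings are dense, so the gaps are mostly 1s and 2s, which compress to a few bits each.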
Hybrid IR-LLM Systems
Combine retrieval for grounding with large language models for answer generation.
Authority and Freshness
Page quality metrics that influence ranking—authoritative sources and recent updates rank higher.
Query-Document Mismatch
The fundamental gap between user language and document representation that IR methods aim to bridge.
Vector Space Similarity
Measured by the cosine of the angle between query and document vectors; smaller angles imply higher similarity.
Document Collection
The set of all documents available for indexing and retrieval.
Information Retrieval Evaluation Goal
Quantitatively compare algorithms and justify system improvements.