Information retrieval

54 Terms

Information Retrieval (IR)

IR is the process of finding unstructured material (usually text) that satisfies an information need from large document collections.

Structured vs Unstructured Data

Structured data fits into predefined schemas (e.g., SQL tables); unstructured data includes text, images, audio, and video with no fixed format.

Inverted Index

A key data structure in IR mapping each term to the list of documents containing it. Enables efficient query retrieval.
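A minimal sketch of the idea, assuming whitespace tokenization and a toy three-document collection (the function names `build_inverted_index` and `intersect` are illustrative, not from the card set):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # assumption: whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Answer a Boolean AND query by merging two sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

docs = {1: "new home sales", 2: "home prices rise", 3: "new car sales"}
index = build_inverted_index(docs)
print(intersect(index["new"], index["sales"]))  # [1, 3]
```

Because postings are kept sorted, the AND query touches each list once instead of scanning every document.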

Incidence Matrix

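A binary term-document matrix marking whether each document contains each term; impractical for large collections because the matrix is enormous and almost entirely zeros (sparse), so storing every cell wastes space.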

Information Need vs Query

The information need is what the user wants; the query is how they express it. Queries are often incomplete.

Vector Space Model

Represents documents and queries as vectors in term space; similarity measured via cosine similarity.
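As a rough illustration, cosine similarity between two term-weight vectors could be computed as follows (a sketch assuming both vectors use the same term order; the function name is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_u * norm_v)

query = [1, 1, 0]        # hypothetical weights over the terms (car, sales, home)
document = [2, 1, 0]
print(cosine_similarity(query, document))
```

Dividing by the vector lengths means a long document is not favored merely for containing more words.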

Term Frequency (TF)

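The number of times a term appears in a document; a higher count suggests the term is more important to that document, though raw counts are often dampened (e.g., logarithmically) so that importance does not grow linearly with frequency.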

Inverse Document Frequency (IDF)

Measures how rare a term is across the collection, commonly computed as log(N / df) for N documents and a document frequency of df; reduces the weight of common words.

TF-IDF Weighting

Combines TF and IDF to highlight terms frequent in a document but rare across the corpus.
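One common variant, sketched here under the assumption of raw term counts and a base-10 logarithm for IDF (other weighting schemes exist; `tf_idf` is an illustrative name):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "cat"], ["the", "dog"]]
for w in tf_idf(docs):
    print(w)
```

In this toy collection "the" appears in every document, so its IDF is log10(2/2) = 0 and it contributes nothing, while "cat" is both frequent in its document and rare overall.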

Precision

Fraction of retrieved documents that are relevant.

Recall

Fraction of relevant documents that are retrieved.

F-measure

Harmonic mean of precision and recall; balances both metrics.
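Precision, recall, and the balanced F-measure (F1) can be sketched for a single query as set operations; the document IDs below are made up for illustration:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 4 documents retrieved, 3 relevant overall, 2 in common
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r, f1)
```

Here precision is 2/4, recall is 2/3, and F1 is 4/7, which sits closer to the lower of the two values, as a harmonic mean does.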

Mean Average Precision (MAP)

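The mean, over a set of queries, of each query's average precision (the average of the precision values at the ranks where relevant documents are retrieved); summarizes retrieval effectiveness in a single number.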
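A small sketch of the computation, assuming binary relevance and that relevant documents never retrieved contribute a precision of zero (function names are illustrative):

```python
def average_precision(ranking, relevant):
    """Average of precision values at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2 = 5/6
    (["x", "y"], {"y"}),                 # AP = 1/2
]
print(mean_average_precision(runs))
```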

Precision@k (P@k)

Proportion of relevant documents among the top k results.

R-Precision

Precision when retrieving R documents, where R = total relevant docs for a query.
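Both P@k and R-Precision look only at the top of a ranked list. A sketch with a made-up ranking, assuming binary relevance:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
print(precision_at_k(ranking, relevant, 2))  # 0.5
print(r_precision(ranking, relevant))        # P@3 = 2/3
```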

Discounted Cumulative Gain (DCG)

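Evaluation metric that accounts for document rank and graded relevance: each result's relevance grade (gain) is discounted, typically logarithmically, by its position in the ranking, and the discounted gains are summed.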
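One widely used formulation divides the gain at rank i by log2(i + 1). The sketch below assumes that formulation and a made-up list of relevance grades; other discount functions are also in use:

```python
import math

def dcg(gains, k=None):
    """Sum of graded relevance values, each discounted by log2(rank + 1)."""
    if k is not None:
        gains = gains[:k]
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

# Relevance grades of the results in ranked order (3 = highly relevant, 0 = not)
print(dcg([3, 2, 0, 1]))
print(dcg([3, 2, 0, 1], k=2))
```

The same grades in a worse order give a lower score, which is how the metric rewards placing highly relevant documents early.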

Relevance Feedback

User labels retrieved documents as relevant or not; the system updates the query vector to improve ranking.

Rocchio Algorithm

Updates query vector by moving it closer to relevant documents and away from non-relevant ones.
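A sketch of the standard Rocchio update, q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant). The default weights below are commonly quoted values and are an assumption here, as is the clipping of negative weights to zero:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant docs and away from non-relevant ones."""
    dims = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_c, non_c)]
    return [max(0.0, w) for w in updated]  # assumption: clip negative weights

new_query = rocchio([1, 0], relevant=[[0, 2]], nonrelevant=[[2, 0]],
                    alpha=1.0, beta=0.5, gamma=0.25)
print(new_query)  # [0.5, 1.0]
```

The second term gains weight because it appears in the relevant document, and the first loses weight because it appears in the non-relevant one.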

Query Expansion

Adds new, related terms (e.g., synonyms) to improve recall.
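A toy sketch of synonym-based expansion. The `SYNONYMS` table is hypothetical; real systems typically draw on a thesaurus, query logs, or co-occurrence statistics:

```python
# Hypothetical hand-written synonym table, for illustration only
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Append known synonyms after each query term, skipping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["buy", "car"]))
# ['buy', 'purchase', 'car', 'automobile', 'vehicle']
```

The expanded query can match documents that say "automobile" but never "car", which raises recall at some risk to precision.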

Relevance (Mathematical Definition)

A binary relation R ⊆ Q × D, where (q, d) ∈ R means document d is relevant to query q.

Index Construction

Process of parsing documents, tokenizing terms, and creating inverted lists for efficient lookup.

BSBI Algorithm

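Blocked Sort-Based Indexing: splits the collection into blocks that fit in memory, collects and sorts the (term ID, document ID) pairs of each block, writes each sorted block to disk, and merges the blocks to create the full index.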

SPIMI Algorithm

Single-Pass In-Memory Indexing: builds a separate dictionary for each block, using the terms themselves instead of global term IDs, writes each block to disk, and merges the blocks at the end.
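The per-block structure of SPIMI can be sketched as follows. This is a simplification: in-memory dictionaries stand in for the block files a real implementation would write to disk, and the function names are illustrative:

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Index one block: terms map directly to postings, with no global term IDs."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted once, when the block is "written out"
    return dict(sorted(dictionary.items()))

def merge_blocks(blocks):
    """Merge the per-block indexes into one final index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}

block1 = spimi_invert([("a", 1), ("b", 1), ("a", 2)])
block2 = spimi_invert([("b", 3), ("c", 3)])
print(merge_blocks([block1, block2]))
# {'a': [1, 2], 'b': [1, 3], 'c': [3]}
```

Postings are appended as tokens arrive, so no sorting of (term, document) pairs is needed inside a block; only the block's terms are sorted before it is written.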

Distributed Indexing

Uses multiple machines (parsers and inverters) coordinated by a master to scale indexing across huge collections.

Web Crawling

Automated process of discovering and fetching web pages for indexing.

Crawler Components

Include seed URLs, a scheduler with a URL queue (the frontier), a downloader, a parser, and an update cycle that feeds newly discovered links back into the frontier.

Robots.txt

A file specifying which parts of a site crawlers may access.
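Python's standard library includes a parser for this file. The sketch below feeds it a made-up rule set directly, without any network access; the user-agent string and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

blocked = parser.can_fetch("MyCrawler", "https://example.com/private/page.html")
allowed = parser.can_fetch("MyCrawler", "https://example.com/public/page.html")
print(blocked, allowed)  # False True
```

A crawler would normally point the parser at a site's live file with `set_url` and `read` before fetching any page from that site.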

Coverage vs Quality

Coverage = how much of the web is indexed; Quality = relevance and authority of indexed pages.

Breadth-First Crawling

Visits pages level by level from seeds to ensure early wide coverage.
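A sketch of the breadth-first frontier, using a small in-memory link graph in place of real downloads (`get_links` is a stand-in for the fetch-and-parse step, and the graph is made up):

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages=100):
    """Visit pages level by level, starting from the seed URLs."""
    frontier = deque(seeds)  # FIFO queue gives breadth-first order
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_crawl(["a"], lambda url: graph.get(url, [])))  # ['a', 'b', 'c', 'd']
```

Replacing the FIFO queue with a priority queue is one way a crawler could favor quality over pure breadth.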

PageRank

Algorithm ranking pages based on link structure; pages with many high-quality inbound links rank higher.
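A minimal power-iteration sketch, assuming a damping factor of 0.85, a fixed number of iterations, and that every link target is itself a key in the graph. Pages without outlinks spread their rank evenly, which is one common convention:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: both "a" and "b" link to "c", and "c" links back to "a"
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "c" ends up highest because it has two inbound links, and "a" outranks "b" because its single inbound link comes from the highly ranked "c".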

Early Crawlers

WWWW (1993), WebCrawler (1994), Lycos (1994), and AltaVista (1995) laid foundations for search engines.

Relevance Problem in IR

Users express needs poorly; queries often mismatch user intent.

Feature Extraction (Multimedia IR)

Measures aspects like color, boundaries, or texture to represent multimedia data.

Deep Learning in IR

Automatically extracts high-level features (e.g., with CNNs) from multimedia content.

Black Box Problem

Deep models lack interpretability in how they represent or weigh features.

IID Assumption (Machine Learning)

Assumes training examples are independent and identically distributed; this assumption is often violated in temporal or sequential data.

Tokenization

Breaking text into individual units (tokens) such as words or phrases.

Stop Words

Common words (e.g., “the”, “is”) often removed to reduce noise in retrieval.

Normalization

Standardizing text (e.g., lowercasing, removing punctuation, merging variants like “U.S.A.” and “USA”).
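Tokenization, normalization, and stop-word removal are often chained into one preprocessing step. A sketch, assuming whitespace tokenization and a tiny illustrative stop list (real stop lists are much longer):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "in", "and"}  # illustrative only

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize on whitespace, lowercase, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.split():
        token = re.sub(r"[^\w]", "", raw.lower())  # "U.S.A." becomes "usa"
        if token and token not in stop_words:
            tokens.append(token)
    return tokens

print(preprocess("The U.S.A. is in North America."))
# ['usa', 'north', 'america']
```

The same function should be applied to queries and documents, so that "USA" in a query matches "U.S.A." in a document.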

Compound Word Splitting

Used in languages like German to separate long compound terms into meaningful parts.

Precision-Recall Tradeoff

Improving recall tends to reduce precision and vice versa; the right balance depends on the use case.

User-Oriented Evaluation

Considers user satisfaction and perceived relevance beyond system metrics.

Multimedia IR Challenge

Non-textual data (images, audio, video) is ambiguous and high-dimensional, making retrieval complex.

Feature Fingerprint

Compact numerical summary of multimedia data capturing key distinguishing features.

Temporal Data in IR

Data with a time dimension (e.g., video, audio) breaks independence assumptions of traditional models.

Crawling Politeness

Crawlers must avoid overloading servers by respecting delays and limits.
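One simple way to enforce such a delay is to track, per host, the earliest time the next request may be sent. The class below is a hypothetical sketch; the fixed `min_delay` is an assumption, and time is passed in explicitly so the logic can be followed without actually waiting:

```python
from urllib.parse import urlsplit

class PolitenessScheduler:
    """Hypothetical helper enforcing a minimum gap between requests to one host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # seconds between requests to the same host
        self.next_allowed = {}      # host -> earliest permitted fetch time

    def earliest_fetch_time(self, url, now):
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start

scheduler = PolitenessScheduler(min_delay=2.0)
print(scheduler.earliest_fetch_time("https://a.example/page1", now=0.0))  # 0.0
print(scheduler.earliest_fetch_time("https://a.example/page2", now=0.5))  # 2.0
print(scheduler.earliest_fetch_time("https://b.example/page1", now=0.5))  # 0.5
```

Requests to different hosts are not delayed by each other, so the crawler stays busy while remaining polite to each individual server.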

Index Compression

Reduces index size by encoding gaps or frequent patterns in posting lists.
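Gap encoding can be combined with variable byte (VB) codes, one of several possible encodings. The sketch below assumes sorted postings with positive document IDs and marks the last byte of each number by setting its high bit:

```python
from itertools import accumulate

def to_gaps(postings):
    """Store the first doc ID, then the differences between neighbors."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Rebuild the doc IDs as running sums of the gaps."""
    return list(accumulate(gaps))

def vb_encode_number(n):
    """Encode n in 7-bit groups; the final byte has its high bit set."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))  # gaps are [824, 5, 214577]
print(len(encoded))                     # 6 bytes
print(from_gaps(vb_decode(encoded)))    # [824, 829, 215406]
```

Small gaps fit in a single byte, so frequent terms, whose postings are dense, compress especially well.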

Hybrid IR-LLM Systems

Combine retrieval for grounding with large language models for answer generation.

Authority and Freshness

Page quality signals that influence ranking: authoritative sources and recently updated pages tend to rank higher.

Query-Document Mismatch

The fundamental gap between user language and document representation that IR methods aim to bridge.

Vector Space Similarity

Measured by the cosine of the angle between query and document vectors; smaller angles imply higher similarity.

Document Collection

The set of all documents available for indexing and retrieval.

Information Retrieval Evaluation Goal

Quantitatively compare algorithms and justify system improvements.
