Flashcards based on Information Retrieval Lecture Notes focusing on vector space model, term-document matrices, and cosine similarity.
Term-document incidence
A binary matrix whose entry for a term and a document is 1 if the term occurs in that document and 0 otherwise; the count and weight matrices refine it.
Term-document count matrix
Records the frequency of each term's occurrence within each document.
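As a rough illustration of the incidence and count matrices, the sketch below builds both from a toy collection. The documents, the whitespace tokenization, and the helper names `count_matrix` and `incidence_matrix` are assumptions made for this example; rows here are documents and columns are terms.

```python
from collections import Counter

# Toy collection; the texts are illustrative assumptions, not from the notes.
DOCS = {
    "d1": "caesar praised brutus and caesar",
    "d2": "brutus killed caesar",
    "d3": "mercy is not strained",
}

def count_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {doc_id: [c[term] for term in vocab] for doc_id, c in counts.items()}
    return vocab, rows

def incidence_matrix(rows):
    """Collapse counts to 1 (term present) or 0 (term absent)."""
    return {doc_id: [1 if n > 0 else 0 for n in row] for doc_id, row in rows.items()}

vocab, counts = count_matrix(DOCS)
incidence = incidence_matrix(counts)
```

In this toy data "caesar" has count 2 in d1 but incidence 1, which is exactly the information the incidence matrix discards.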
Term-document weight matrix
Assigns a weight to each term in each document, often using TF-IDF, to reflect the term's importance.
Vector Space Model
A model where documents and queries are represented as vectors in a high-dimensional term space.
Sparse Vectors
Vectors with mostly zero entries, common in vector space models due to the large vocabulary size.
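One common way to exploit sparsity is to store only the nonzero entries. The sketch below assumes a plain dictionary from term to weight; `to_sparse` and `sparse_dot` are hypothetical helper names.

```python
def to_sparse(dense, vocab):
    """Keep only the nonzero components, keyed by term."""
    return {term: w for term, w in zip(vocab, dense) if w != 0}

def sparse_dot(u, v):
    """Dot product that only visits terms stored in the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[term] for term, w in u.items() if term in v)
```

The cost of `sparse_dot` depends on the number of stored terms rather than on the vocabulary size.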
Queries as Vectors
Representing search queries as vectors in the same term space as documents to enable similarity-based ranking.
Proximity in Vector Space
The similarity of two vectors, roughly the inverse of the distance between them; documents are ranked by their proximity to the query vector.
Euclidean Distance (in vector space)
The straight-line distance between two points (vectors); often a poor measure of document similarity because it is large for documents of different lengths even when their term distributions are nearly identical.
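A small sketch of the length problem, assuming a made-up three-term vector: appending a document to itself doubles every count, so the content is unchanged, yet the Euclidean distance between the two versions is far from zero.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc = [3.0, 4.0, 0.0]            # illustrative term counts
doubled = [2 * x for x in doc]   # the same document appended to itself
gap = euclidean(doc, doubled)    # 5.0, despite identical term distribution
```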
Cosine Similarity
The cosine of the angle between two vectors, used to rank documents by their similarity to a query regardless of document length.
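A minimal sketch of the computation, using plain Python lists as vectors. Returning 0.0 for a zero vector is a convention assumed here, not something the notes specify.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_u * norm_v)
```

For example, `cosine([3, 4, 0], [6, 8, 0])` is 1 up to floating-point rounding, because scaling a vector does not change its direction.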
Length Normalization
The process of dividing a vector by its length (L2 norm) to create a unit vector, ensuring fair comparison between documents of different lengths.
L2 Norm
The square root of the sum of the squares of the components of a vector, used for length normalization.
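The L2 norm and length normalization can be sketched in a few lines. The function names are illustrative, and leaving a zero vector unchanged is an assumption made to avoid dividing by zero.

```python
import math

def l2_norm(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = l2_norm(v)
    return [x / n for x in v] if n else list(v)
```

After normalization, a long document and a short one with the same term proportions map to the same unit vector.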
Dot Product (for length-normalized vectors)
For length-normalized (unit) vectors, the dot product equals the cosine similarity, so similarity reduces to a sum of component-wise products.
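A sketch of how this is typically used for ranking: normalize the query and every document, then sort documents by dot product. The vectors and the `rank` helper below are made up for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    q = normalize(query)
    scores = {doc_id: dot(q, normalize(vec)) for doc_id, vec in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative vectors over a three-term vocabulary.
docs = {"d1": [2, 0, 1], "d2": [0, 3, 0], "d3": [1, 1, 0]}
ranking = rank([1, 0, 0], docs)
```

In practice the document vectors would be normalized once at indexing time rather than on every query.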
Term Frequency (TF)
The number of times a term appears in a document. Usually dampened with log frequency weighting rather than used raw, since relevance does not grow in proportion to the count.
Log Frequency Weighting
Replacing a raw term frequency tf with 1 + log10(tf) when tf > 0 (and 0 otherwise), so that repeated occurrences of a term add progressively less weight.
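A small sketch of the usual form of this weighting, 1 + log10(tf) for tf > 0 and 0 otherwise. The base-10 logarithm is the common textbook choice and is assumed here.

```python
import math

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Raw counts 0, 1, 2, 10, 1000 map to roughly 0, 1, 1.3, 2, 4.
weights = [log_tf(tf) for tf in (0, 1, 2, 10, 1000)]
```

A term occurring 1000 times therefore counts about four times as much as a term occurring once, not a thousand times as much.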
SMART Notation
A notation (ddd.qqq) for describing the weighting scheme used for documents and queries in an information retrieval system.
lnc.ltc
A common weighting scheme where documents use logarithmic TF, no IDF, and cosine normalization, while queries use logarithmic TF, IDF, and cosine normalization.
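The scheme can be sketched end to end on a tiny example. The collection size, the document frequencies, the query, and the document below are all assumed values chosen for illustration, and the helper names `lnc`, `ltc`, and `score` are hypothetical.

```python
import math

N_DOCS = 1_000_000  # assumed collection size
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}  # assumed

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    """Divide every weight by the L2 norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

def lnc(term_counts):
    """Document side: log tf, no idf, cosine normalization."""
    return cosine_normalize({t: log_tf(tf) for t, tf in term_counts.items()})

def ltc(term_counts):
    """Query side: log tf, idf, cosine normalization."""
    return cosine_normalize(
        {t: log_tf(tf) * math.log10(N_DOCS / DF[t]) for t, tf in term_counts.items()}
    )

def score(query_counts, doc_counts):
    q, d = ltc(query_counts), lnc(doc_counts)
    return sum(w * d[t] for t, w in q.items() if t in d)

doc = {"car": 1, "insurance": 2, "auto": 1}
query = {"best": 1, "car": 1, "insurance": 1}
result = score(query, doc)
```

With these assumed numbers the score comes out at about 0.80. Leaving idf off the document side means document vectors do not need reweighting when collection statistics change.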
TF-IDF Weighting
A weighting scheme that combines Term Frequency (TF) and Inverse Document Frequency (IDF) to assign higher weights to terms that are frequent in a document but rare in the collection.
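A minimal sketch of one common variant, log-weighted tf multiplied by log10(N / df). Other tf and idf variants exist; this particular combination is an assumption of the example, as are the function names.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: log10(N / df); 0 if the term is unseen."""
    return math.log10(n_docs / df) if df > 0 else 0.0

def tf_idf(tf, df, n_docs):
    """Weight of a term with count tf in a document, given its collection df."""
    if tf <= 0:
        return 0.0
    return (1 + math.log10(tf)) * idf(df, n_docs)
```

A term that appears in every document gets idf 0 and therefore weight 0, however often it occurs in a single document.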