Information Retrieval - Vector Space Model and Cosine Similarity


Flashcards based on Information Retrieval Lecture Notes focusing on vector space model, term-document matrices, and cosine similarity.


17 Terms

1. Term-document incidence

A binary (0/1) record of whether each term occurs in each document; the basis for the count and weight matrices that follow.
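As a quick sketch (toy corpus and term names, not from the lecture notes), an incidence matrix can be built directly from the documents:

```python
# Toy sketch: term-document incidence for a tiny, invented corpus.
docs = {
    "d1": "new home sales top forecasts",
    "d2": "home sales rise in july",
    "d3": "increase in home sales in july",
}

vocab = sorted({t for text in docs.values() for t in text.split()})

# incidence[term][doc] = 1 if the term occurs in the doc, else 0
incidence = {
    term: {d: int(term in text.split()) for d, text in docs.items()}
    for term in vocab
}

print(incidence["sales"])  # "sales" occurs in every document here
```

Replacing `int(term in ...)` with a count of occurrences would give the term-document count matrix of the next card.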

2. Term-document count matrix

Records the frequency of each term's occurrence within each document.

3. Term-document weight matrix

Assigns a weight to each term in each document, often using TF-IDF, to reflect the term's importance.

4. Vector Space Model

A model where documents and queries are represented as vectors in a high-dimensional term space.

5. Sparse Vectors

Vectors with mostly zero entries, common in vector space models due to the large vocabulary size.

6. Queries as Vectors

Representing search queries as vectors in the same term space as documents to enable similarity-based ranking.

7. Proximity in Vector Space

Measured by the similarity of vectors, often the inverse of the distance between them, to rank documents by relevance.

8. Euclidean Distance (in vector space)

The straight-line distance between two points (vectors); often a poor measure of document similarity due to differing document lengths.
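A small sketch (invented vectors) of why Euclidean distance misleads: if d2 is d1 with every count doubled, the two documents have identical word distributions, yet the straight-line distance between them is large while the angle between them is zero.

```python
import math

# Toy vectors: d2 is d1 repeated twice (same distribution, doubled counts).
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]

euclidean = math.dist(d1, d2)

dot = sum(a * b for a, b in zip(d1, d2))
cosine = dot / (math.hypot(*d1) * math.hypot(*d2))

print(euclidean)  # clearly nonzero: the vectors look "far apart"
print(cosine)     # 1.0: the angle between them is zero
```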

9. Cosine Similarity

The cosine of the angle between two vectors; used to rank documents by their similarity to a query independently of document length.
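A minimal ranking sketch, assuming toy query and document vectors in the same three-term space:

```python
import math

# Rank documents by cosine similarity to a query (invented vectors).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = [1.0, 1.0, 0.0]
docs = {"d1": [3.0, 1.0, 0.0], "d2": [0.0, 1.0, 4.0], "d3": [1.0, 1.0, 1.0]}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most similar document first
```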

10. Length Normalization

The process of dividing a vector by its length (L2 norm) to create a unit vector, ensuring fair comparison between documents of different lengths.
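A sketch with toy values: dividing a vector by its L2 norm (next card) yields a unit vector.

```python
import math

# L2-normalize an invented vector to unit length.
v = [3.0, 4.0]
l2 = math.sqrt(sum(x * x for x in v))   # L2 norm: sqrt(9 + 16) = 5.0
unit = [x / l2 for x in v]              # [0.6, 0.8]

# The normalized vector has length 1.
assert abs(math.sqrt(sum(x * x for x in unit)) - 1.0) < 1e-12
```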

11. L2 Norm

The square root of the sum of the squares of the components of a vector, used for length normalization.

12. Dot Product (for length-normalized vectors)

Equivalent to cosine similarity when vectors are length-normalized; a simple way to compute similarity.
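The equivalence can be checked numerically on toy vectors: the full cosine formula and the plain dot product of the two unit vectors agree.

```python
import math

# Cosine via the full formula vs. dot product of pre-normalized vectors.
def normalize(v):
    n = math.hypot(*v)
    return [x / n for x in v]

u, v = [1.0, 3.0], [2.0, 1.0]

full_cosine = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
dot_of_units = sum(a * b for a, b in zip(normalize(u), normalize(v)))

assert abs(full_cosine - dot_of_units) < 1e-12
```

This is why search systems often normalize document vectors once at index time and then score with a bare dot product.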

13. Term Frequency (TF)

The number of times a term appears in a document. Often weighted using log frequency weighting.

14. Log Frequency Weighting

Replacing raw term frequency with a dampened weight, typically w = 1 + log10(tf) for tf > 0 and 0 otherwise, to reduce the impact of very frequent terms.
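A sketch of the standard log-frequency weight: a term occurring 1000 times gets weight 4, not 1000 times the weight of a single occurrence.

```python
import math

# Log-frequency weight: w = 1 + log10(tf) if tf > 0, else 0.
def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(log_tf(1))     # 1.0
print(log_tf(1000))  # 4.0
```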

15. SMART Notation

A notation (ddd.qqq) for describing the weighting scheme used for documents and queries in an information retrieval system.

16. lnc.ltc

A common SMART scheme in which documents use logarithmic TF (l), no IDF (n), and cosine normalization (c), while queries use logarithmic TF (l), IDF (t), and cosine normalization (c).
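A hedged sketch of lnc.ltc scoring for one query against one document; the collection size N, document frequencies, and term counts below are invented for illustration:

```python
import math

N = 1_000_000                      # collection size (assumed)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q_tf = {"best": 1, "car": 1, "insurance": 1}   # toy query counts
d_tf = {"auto": 1, "car": 1, "insurance": 2}   # toy document counts

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Document side -- l (log tf), n (no idf), c (cosine normalization):
d_w = {t: log_tf(tf) for t, tf in d_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_w.values()))
d_w = {t: w / d_len for t, w in d_w.items()}

# Query side -- l (log tf), t (idf), c (cosine normalization):
q_w = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}
q_len = math.sqrt(sum(w * w for w in q_w.values()))
q_w = {t: w / q_len for t, w in q_w.items()}

# Score = dot product of the two normalized weight vectors.
score = sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)
print(round(score, 3))
```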

17. TF-IDF Weighting

A weighting scheme that combines Term Frequency (TF) and Inverse Document Frequency (IDF) to assign higher weights to terms that are frequent in a document but rare in the collection.
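A sketch with invented numbers, combining the log-TF weight above with an IDF factor: two terms with the same in-document frequency get very different weights depending on how rare they are in the collection.

```python
import math

N = 10_000                         # documents in the collection (assumed)

# tf-idf = log-frequency weight x inverse document frequency.
def tf_idf(tf, df):
    w_tf = 1 + math.log10(tf) if tf > 0 else 0.0
    idf = math.log10(N / df)
    return w_tf * idf

common = tf_idf(tf=10, df=5000)    # frequent term, common in collection
rare = tf_idf(tf=10, df=10)        # same tf, rare in collection
print(common, rare)                # the rare term dominates
```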