Information Retrieval - Vector Space Model and Cosine Similarity


Flashcards based on Information Retrieval Lecture Notes focusing on vector space model, term-document matrices, and cosine similarity.


17 Terms

1. Term-document incidence

A binary (0/1) record of whether each term occurs in each document; the basis for the count and weight matrices that follow.
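As a quick sketch (toy corpus and term names, not from the lecture notes), an incidence matrix can be built directly from the documents:

```python
# Toy sketch: term-document incidence for a tiny, invented corpus.
docs = {
    "d1": "new home sales top forecasts",
    "d2": "home sales rise in july",
    "d3": "increase in home sales in july",
}

vocab = sorted({t for text in docs.values() for t in text.split()})

# incidence[term][doc] = 1 if the term occurs in the doc, else 0
incidence = {
    term: {d: int(term in text.split()) for d, text in docs.items()}
    for term in vocab
}

print(incidence["sales"])  # "sales" occurs in every document here
```

Replacing `int(term in ...)` with a count of occurrences would give the term-document count matrix of the next card.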

2. Term-document count matrix

Records the frequency of each term's occurrence within each document.

3. Term-document weight matrix

Assigns a weight to each term in each document, often using TF-IDF, to reflect the term's importance.

4. Vector Space Model

A model where documents and queries are represented as vectors in a high-dimensional term space.

5. Sparse Vectors

Vectors with mostly zero entries, common in vector space models due to the large vocabulary size.

6. Queries as Vectors

Representing search queries as vectors in the same term space as documents to enable similarity-based ranking.

7. Proximity in Vector Space

Measured by the similarity of vectors, often the inverse of the distance between them, to rank documents by relevance.

8. Euclidean Distance (in vector space)

The straight-line distance between two points (vectors); often a poor measure of document similarity due to differing document lengths.
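A small sketch (invented vectors) of why Euclidean distance misleads: if d2 is d1 with every count doubled, the two documents have identical word distributions, yet the straight-line distance between them is large while the angle between them is zero.

```python
import math

# Toy vectors: d2 is d1 repeated twice (same distribution, doubled counts).
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]

euclidean = math.dist(d1, d2)

dot = sum(a * b for a, b in zip(d1, d2))
cosine = dot / (math.hypot(*d1) * math.hypot(*d2))

print(euclidean)  # clearly nonzero: the vectors look "far apart"
print(cosine)     # 1.0: the angle between them is zero
```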

9. Cosine Similarity

The cosine of the angle between two vectors; used to rank documents by their similarity to a query independently of document length.
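A minimal ranking sketch, assuming toy query and document vectors in the same three-term space:

```python
import math

# Rank documents by cosine similarity to a query (invented vectors).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = [1.0, 1.0, 0.0]
docs = {"d1": [3.0, 1.0, 0.0], "d2": [0.0, 1.0, 4.0], "d3": [1.0, 1.0, 1.0]}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most similar document first
```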

10. Length Normalization

The process of dividing a vector by its length (L2 norm) to create a unit vector, ensuring fair comparison between documents of different lengths.
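A sketch with toy values: dividing a vector by its L2 norm (next card) yields a unit vector.

```python
import math

# L2-normalize an invented vector to unit length.
v = [3.0, 4.0]
l2 = math.sqrt(sum(x * x for x in v))   # L2 norm: sqrt(9 + 16) = 5.0
unit = [x / l2 for x in v]              # [0.6, 0.8]

# The normalized vector has length 1.
assert abs(math.sqrt(sum(x * x for x in unit)) - 1.0) < 1e-12
```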

11. L2 Norm

The square root of the sum of the squares of the components of a vector, used for length normalization.

12. Dot Product (for length-normalized vectors)

Equivalent to cosine similarity when vectors are length-normalized; a simple way to compute similarity.
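The equivalence can be checked numerically on toy vectors: the full cosine formula and the plain dot product of the two unit vectors agree.

```python
import math

# Cosine via the full formula vs. dot product of pre-normalized vectors.
def normalize(v):
    n = math.hypot(*v)
    return [x / n for x in v]

u, v = [1.0, 3.0], [2.0, 1.0]

full_cosine = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
dot_of_units = sum(a * b for a, b in zip(normalize(u), normalize(v)))

assert abs(full_cosine - dot_of_units) < 1e-12
```

This is why search systems often normalize document vectors once at index time and then score with a bare dot product.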

13. Term Frequency (TF)

The number of times a term appears in a document. Often weighted using log frequency weighting.

14. Log Frequency Weighting

Replacing raw term frequency with a dampened weight, typically w = 1 + log10(tf) for tf > 0 and 0 otherwise, to reduce the impact of very frequent terms.
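A sketch of the standard log-frequency weight: a term occurring 1000 times gets weight 4, not 1000 times the weight of a single occurrence.

```python
import math

# Log-frequency weight: w = 1 + log10(tf) if tf > 0, else 0.
def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(log_tf(1))     # 1.0
print(log_tf(1000))  # 4.0
```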

15. SMART Notation

A notation (ddd.qqq) for describing the weighting scheme used for documents and queries in an information retrieval system.

16. lnc.ltc

A common SMART scheme in which documents use logarithmic TF (l), no IDF (n), and cosine normalization (c), while queries use logarithmic TF (l), IDF (t), and cosine normalization (c).
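A hedged sketch of lnc.ltc scoring for one query against one document; the collection size N, document frequencies, and term counts below are invented for illustration:

```python
import math

N = 1_000_000                      # collection size (assumed)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q_tf = {"best": 1, "car": 1, "insurance": 1}   # toy query counts
d_tf = {"auto": 1, "car": 1, "insurance": 2}   # toy document counts

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Document side -- l (log tf), n (no idf), c (cosine normalization):
d_w = {t: log_tf(tf) for t, tf in d_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_w.values()))
d_w = {t: w / d_len for t, w in d_w.items()}

# Query side -- l (log tf), t (idf), c (cosine normalization):
q_w = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}
q_len = math.sqrt(sum(w * w for w in q_w.values()))
q_w = {t: w / q_len for t, w in q_w.items()}

# Score = dot product of the two normalized weight vectors.
score = sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)
print(round(score, 3))
```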

17. TF-IDF Weighting

A weighting scheme that combines Term Frequency (TF) and Inverse Document Frequency (IDF) to assign higher weights to terms that are frequent in a document but rare in the collection.
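A sketch with invented numbers, combining the log-TF weight above with an IDF factor: two terms with the same in-document frequency get very different weights depending on how rare they are in the collection.

```python
import math

N = 10_000                         # documents in the collection (assumed)

# tf-idf = log-frequency weight x inverse document frequency.
def tf_idf(tf, df):
    w_tf = 1 + math.log10(tf) if tf > 0 else 0.0
    idf = math.log10(N / df)
    return w_tf * idf

common = tf_idf(tf=10, df=5000)    # frequent term, common in collection
rare = tf_idf(tf=10, df=10)        # same tf, rare in collection
print(common, rare)                # the rare term dominates
```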