
IN2110: Språkteknologiske metoder - Vector Spaces for Language Technology

Representing Data Mathematically

Vector space models represent data mathematically, enabling classification and clustering. Data is described using features, each with a numerical value, forming a feature vector x = ⟨x_1, x_2, \ldots, x_d⟩.

Vector Space Models

Feature vectors are used within vector space models, where features correspond to dimensions in space. Objects become points in this space, and their similarity is measured by geometric distance. Each example x is a point or vector in a space of d dimensions: x ∈ R^d and x = ⟨x_1, x_2, \ldots, x_d⟩.

Dimensionality

Examples in vector spaces range from points on a line (d = 1) to points in a plane (d = 2) or three-dimensional space (d = 3). High-dimensional spaces, with d in the thousands or millions, are common in ML/NLP.

Vector Space Models for Information Retrieval (IR)

This approach quantifies the similarity between documents, representing each document by the frequency of words it contains, known as bag-of-words (BoW) features. Each word corresponds to a dimension.
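
To make this concrete, here is a minimal sketch of building BoW vectors in Python (the example sentences and variable names are illustrative, not from the course):

    from collections import Counter

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Vocabulary: one dimension per word type, in a fixed order.
    vocab = sorted({word for doc in docs for word in doc.split()})

    # Each document becomes a vector of word frequencies.
    def bow_vector(doc):
        counts = Counter(doc.split())
        return [counts[word] for word in vocab]

    vectors = [bow_vector(doc) for doc in docs]
    print(vocab)     # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
    print(vectors)   # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]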

Tokens vs. Types

When analyzing text, the number of tokens refers to the total count of word occurrences, while the number of types refers to the count of unique words.
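
A short illustration in Python (the example sentence is my own):

    text = "the cat sat on the mat"
    tokens = text.split()   # every word occurrence: 6 tokens
    types = set(tokens)     # unique words only: 5 types ("the" occurs twice)
    print(len(tokens), len(types))   # 6 5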

Matrix View

A vector space can be viewed as a matrix (a term-document or co-occurrence matrix), where rows represent words and columns represent documents. Comparing columns, documents that share many words are considered semantically similar; comparing rows, words that co-occur in many of the same documents are semantically related, according to the distributional hypothesis.
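
A sketch of such a matrix in NumPy (the toy documents are illustrative):

    import numpy as np

    docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
    vocab = sorted({w for d in docs for w in d.split()})

    # Rows = words, columns = documents; cell (i, j) counts word i in document j.
    M = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, doc in enumerate(docs):
        for word in doc.split():
            M[vocab.index(word), j] += 1

    # Comparing columns compares documents; comparing rows compares words.
    print(vocab)
    print(M)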

Vector Similarity

Similarity between vectors a and b can be computed using:

  • Dot product: a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

  • Euclidean distance: d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}, which is sensitive to extreme values and to vector length.

  • Normalization: dividing each element by the vector's length: \frac{x}{||x||}

  • Cosine similarity: \cos(a, b) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \sqrt{\sum_{i} b_i^2}} = \frac{a \cdot b}{||a|| \, ||b||}, which measures the angle between vectors and, for non-negative frequency vectors, ranges from 0 to 1. A NumPy sketch of all four measures follows this list.
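
A minimal NumPy sketch of all four measures (the vectors a and b are arbitrary toy examples):

    import numpy as np

    a = np.array([1.0, 2.0, 0.0, 3.0])
    b = np.array([2.0, 1.0, 1.0, 1.0])

    dot = a @ b                         # dot product
    euclid = np.linalg.norm(a - b)      # Euclidean distance
    a_unit = a / np.linalg.norm(a)      # length normalization
    cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(dot, euclid, cosine)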

Queries and Relevance Ranking

In information retrieval, queries are treated as short documents: they are represented as vectors in the same space, and documents are ranked by their similarity to the query vector.
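
A sketch of such ranking with cosine similarity (the term-document matrix and query are toy examples):

    import numpy as np

    # Columns are documents, rows are words; vocab = ["cat", "dog", "mat", "sat"].
    D = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

    # The query "cat sat" represented as a short document in the same space.
    q = np.array([1.0, 0.0, 0.0, 1.0])

    # Cosine similarity between the query and every document column.
    sims = (q @ D) / (np.linalg.norm(q) * np.linalg.norm(D, axis=0))
    print(np.argsort(-sims))   # document indices, best match first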

TF-IDF

tf(t_i, d_j) denotes the number of times the term t_i occurs in document d_j. df(t_i) denotes the number of documents in the collection that the term occurs in. The inverse document frequency is defined as idf(t_i) = \log(\frac{N}{df(t_i)}), where N is the total number of documents in the collection. The weight given to term t_i in document d_j is then computed as tf-idf(t_i, d_j) = tf(t_i, d_j) \times idf(t_i).
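
Following these definitions literally, a minimal tf-idf sketch (toy documents; the notes do not fix a log base, so the natural log is assumed here):

    import math

    docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "ran"]]
    N = len(docs)

    def tf(term, doc):
        return doc.count(term)      # raw count of the term in the document

    def idf(term):
        df = sum(1 for doc in docs if term in doc)
        return math.log(N / df)     # log(N / df), as defined above

    def tf_idf(term, doc):
        return tf(term, doc) * idf(term)

    print(tf_idf("cat", docs[1]))   # 2 * log(3/2)
    print(tf_idf("mat", docs[0]))   # 1 * log(3/1)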

Text Pre-processing

Raw text undergoes tokenization, abstraction (e.g., mapping numbers or URLs to generic tokens), morphological normalization (e.g., stemming or lemmatization), and stop-list filtering. Tokenization splits text into sentences and words. Stop-list filtering removes closed-class function words, retaining content words.
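
An illustrative pipeline (the regex tokenizer and tiny stop list are simplifications, not the course's actual resources):

    import re

    STOP_WORDS = {"the", "a", "on", "is", "and"}   # tiny illustrative stop list

    def preprocess(text):
        # Tokenization: lowercase, then split on non-alphanumeric characters.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Stop-list filtering: drop closed-class words, keep content words.
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("The cat is on the mat."))   # ['cat', 'mat']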

Practical Comments: Sparsity

BoW feature vectors are high-dimensional and sparse, containing few non-zero elements, which affects the choice of data structures used to store them.
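
For example, SciPy's compressed sparse row (CSR) format stores only the non-zero entries (a sketch assuming scipy is installed):

    import numpy as np
    from scipy.sparse import csr_matrix

    # A dense BoW matrix is mostly zeros ...
    dense = np.array([[0, 2, 0, 0, 1],
                      [1, 0, 0, 0, 0]])

    # ... so CSR stores only the 3 non-zero elements plus their positions.
    sparse = csr_matrix(dense)
    print(sparse.nnz)         # 3
    print(sparse.toarray())   # recovers the dense form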

Classification

Classification is supervised learning requiring labeled training data: given a set of training examples, we train a classifier to automatically assign new instances to predefined classes.

Clustering

Clustering is unsupervised learning from unlabeled data: similar objects are automatically grouped together. There are no predefined classes or structure; we only specify the similarity measure.

Vector Space Classification

Objects are points, and classes are regions. This is based on the contiguity hypothesis: objects in the same class form a contiguous region, and regions of different classes do not overlap. Algorithms include K-Nearest Neighbor (KNN) and Rocchio classification.
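
A minimal KNN classifier in NumPy (the toy data and names are my own, not the course's reference implementation):

    import numpy as np

    def knn_predict(train_X, train_y, x, k=3):
        """Label x by majority vote among its k nearest training points."""
        dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of the k closest
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    train_y = np.array([0, 0, 1, 1])
    print(knn_predict(train_X, train_y, np.array([0.8, 0.9])))   # -> 1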