Vocabulary flashcards based on the lecture notes.
Feature
An observable and relevant property of the data, represented by a numerical value.
Feature Vector
A tuple of d feature values, x = ⟨x1, x2, ..., xd⟩, representing an object x.
Vector Space Model
A model where data is represented as feature vectors, with features as dimensions in a space.
Bag-of-Words (BoW)
A document representation where features are frequency counts of words in the text.
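As a rough illustration, a minimal pure-Python sketch of building a bag-of-words vector over a fixed vocabulary; the sentence and vocabulary are invented for the example, not taken from the lecture notes.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["the", "cat", "dog", "sat"]
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 0, 1]
```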
Token
An instance of a word in a text.
Type
A unique word in a text.
Co-occurrence Matrix
A matrix whose rows represent words and whose columns represent documents, with each cell recording how often the word occurs in that document.
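A small sketch of such a word-by-document matrix, following the row/column convention above; the three toy documents are invented for the example.

```python
from collections import Counter

def term_document_matrix(documents):
    """Rows = words (types), columns = documents; each cell counts occurrences."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(word for c in counts for word in c))
    return {word: [c[word] for c in counts] for word in vocabulary}

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
matrix = term_document_matrix(docs)
print(matrix["cat"])  # [1, 0, 1]
print(matrix["the"])  # [1, 1, 2]
```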
Distributional Hypothesis
The idea that words which occur in similar contexts are semantically related.
Euclidean Distance
The straight-line distance between two points (vectors) in a vector space.
Normalization
The process of scaling vectors to have a unit length (∥x∥ = 1).
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them.
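The three vector operations above can be sketched in a few lines of plain Python; this toy version assumes dense lists of equal length rather than any particular library.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def normalize(x):
    """Scale a vector to unit length, so that ||x|| = 1."""
    length = math.sqrt(sum(a * a for a in x))
    return [a / length for a in x]

def cosine(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(euclidean([1, 0], [0, 1]))     # ~1.414
print(cosine([1, 2, 0], [2, 4, 0]))  # 1.0, same direction
```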
TF-IDF
A term-weighting function that combines term frequency (tf) and inverse document frequency (idf), typically as the product tf(t, d) · idf(t).
Term Frequency (TF)
The number of times a term occurs in a document.
Document Frequency (DF)
The number of documents in a collection that contain a term.
Inverse Document Frequency (IDF)
A measure of how rare a term is in a document collection, calculated as idf(ti) = log(N / df(ti)), where N is the total number of documents in the collection.
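Putting the four cards above together, a small sketch of tf-idf weighting over a toy document collection, using the same idf(t) = log(N / df(t)) formula; the documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Weight each term t in each document d by tf(t, d) * idf(t)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df(t): number of documents that contain term t
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
print(tf_idf(docs)[0])  # 'the' weighs 0.0; 'cat' and 'sat' weigh log(3/2)
```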
Tokenization
Splitting a text into sentences, words, or other units (tokens).
Lemmatization
Reducing words to their base or dictionary form (lemma).
Stemming
Reducing words to their stem or root form, often by removing suffixes.
Stop-list
A list of common words (function words) to be filtered out during text pre-processing.
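A toy pre-processing pipeline combining tokenization, a stop-list, and crude suffix stripping; the stop-list and suffix rules here are invented stand-ins for a real stemmer such as Porter's.

```python
import re

STOP_LIST = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip a common suffix, a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_LIST]

print(preprocess("The cats were sitting on the mats"))
# ['cat', 'were', 'sitt', 'mat']
```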
Sparsity
A characteristic of high-dimensional vectors with very few non-zero elements.
Classification
A supervised learning task that involves assigning new instances to predefined classes.
Clustering
An unsupervised learning task that involves grouping similar objects together.
Contiguity Hypothesis
Objects in the same class form a contiguous region, and regions of different classes do not overlap.
KNN (K-Nearest Neighbor)
A classification method that assigns an instance the majority class among its k nearest neighbors in the training data.
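A minimal k-nearest-neighbor sketch using Euclidean distance and a majority vote; the training points and the value of k are arbitrary examples.

```python
import math
from collections import Counter

def knn_classify(x, training_data, k=3):
    """Assign x the majority class among its k nearest training examples."""
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbours = sorted(training_data, key=lambda item: distance(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.0], "B"), ([5.1, 4.8], "B")]
print(knn_classify([1.1, 1.1], train, k=3))  # "A"
```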
Rocchio Classification
A classification method that computes a centroid (mean vector) for each class and assigns an instance to the class with the nearest centroid.
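And the corresponding Rocchio sketch, which averages each class into a centroid and picks the nearest one; it reuses the same toy data as the KNN example.

```python
import math
from collections import defaultdict

def rocchio_classify(x, training_data):
    """Assign x to the class whose centroid (mean vector) is nearest."""
    grouped = defaultdict(list)
    for vector, label in training_data:
        grouped[label].append(vector)
    centroids = {label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
                 for label, vecs in grouped.items()}
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(centroids, key=lambda label: distance(x, centroids[label]))

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.0], "B"), ([5.1, 4.8], "B")]
print(rocchio_classify([1.1, 1.1], train))  # "A"
```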