Vocabulary flashcards based on the lecture notes.
Feature
An observable and relevant property of the data, represented by a numerical value.
Feature Vector
A tuple of d feature values, x = ⟨x1, x2, ..., xd⟩, representing an object x.
Vector Space Model
A model where data is represented as feature vectors, with features as dimensions in a space.
Bag-of-Words (BoW)
A document representation where features are frequency counts of words in the text.
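As a rough illustration, a minimal pure-Python sketch of building a bag-of-words vector over a fixed vocabulary; the sentence and vocabulary are invented for the example, not taken from the lecture notes.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["the", "cat", "dog", "sat"]
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 0, 1]
```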
Token
An instance of a word in a text.
Type
A unique word in a text.
Co-occurrence Matrix
A matrix whose rows represent words and whose columns represent documents, with each cell recording how often the word occurs in that document.
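A small sketch of such a word-by-document matrix, following the row/column convention above; the three toy documents are invented for the example.

```python
from collections import Counter

def term_document_matrix(documents):
    """Rows = words (types), columns = documents; each cell counts occurrences."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(word for c in counts for word in c))
    return {word: [c[word] for c in counts] for word in vocabulary}

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
matrix = term_document_matrix(docs)
print(matrix["cat"])  # [1, 0, 1]
print(matrix["the"])  # [1, 1, 2]
```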
Distributional Hypothesis
The idea that words which occur in similar contexts are semantically related.
Euclidean Distance
The straight-line distance between two points (vectors) in a vector space.
Normalization
The process of scaling vectors to have a unit length (∥x∥ = 1).
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them.
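The three vector operations above can be sketched in a few lines of plain Python; this toy version assumes dense lists of equal length rather than any particular library.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def normalize(x):
    """Scale a vector to unit length, so that ||x|| = 1."""
    length = math.sqrt(sum(a * a for a in x))
    return [a / length for a in x]

def cosine(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(euclidean([1, 0], [0, 1]))     # ~1.414
print(cosine([1, 2, 0], [2, 4, 0]))  # 1.0, same direction
```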
TF-IDF
A term-weighting function that combines term frequency (tf) and inverse document frequency (idf), typically as the product tf(t, d) · idf(t).
Term Frequency (TF)
The number of times a term occurs in a document.
Document Frequency (DF)
The number of documents in a collection that contain a term.
Inverse Document Frequency (IDF)
A measure of how rare a term is in a document collection, calculated as idf(ti) = log(N / df(ti)), where N is the total number of documents in the collection.
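Putting the four cards above together, a small sketch of tf-idf weighting over a toy document collection, using the same idf(t) = log(N / df(t)) formula; the documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Weight each term t in each document d by tf(t, d) * idf(t)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df(t): number of documents that contain term t
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
print(tf_idf(docs)[0])  # 'the' weighs 0.0; 'cat' and 'sat' weigh log(3/2)
```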
Tokenization
Splitting a text into sentences, words, or other units (tokens).
Lemmatization
Reducing words to their base or dictionary form (lemma).
Stemming
Reducing words to their stem or root form, often by removing suffixes.
Stop-list
A list of common words (function words) to be filtered out during text pre-processing.
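A toy pre-processing pipeline combining tokenization, a stop-list, and crude suffix stripping; the stop-list and suffix rules here are invented stand-ins for a real stemmer such as Porter's.

```python
import re

STOP_LIST = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip a common suffix, a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_LIST]

print(preprocess("The cats were sitting on the mats"))
# ['cat', 'were', 'sitt', 'mat']
```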
Sparsity
A characteristic of high-dimensional vectors with very few non-zero elements.
Classification
A supervised learning task that involves assigning new instances to predefined classes.
Clustering
An unsupervised learning task that involves grouping similar objects together.
Contiguity Hypothesis
Objects in the same class form a contiguous region, and regions of different classes do not overlap.
KNN (K-Nearest Neighbor)
A classification method that assigns an instance the majority class among its k nearest neighbors in the training data.
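A minimal k-nearest-neighbor sketch using Euclidean distance and a majority vote; the training points and the value of k are arbitrary examples.

```python
import math
from collections import Counter

def knn_classify(x, training_data, k=3):
    """Assign x the majority class among its k nearest training examples."""
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbours = sorted(training_data, key=lambda item: distance(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.0], "B"), ([5.1, 4.8], "B")]
print(knn_classify([1.1, 1.1], train, k=3))  # "A"
```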
Rocchio Classification
A classification method that computes a centroid (mean vector) for each class and assigns an instance to the class with the nearest centroid.
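And the corresponding Rocchio sketch, which averages each class into a centroid and picks the nearest one; it reuses the same toy data as the KNN example.

```python
import math
from collections import defaultdict

def rocchio_classify(x, training_data):
    """Assign x to the class whose centroid (mean vector) is nearest."""
    grouped = defaultdict(list)
    for vector, label in training_data:
        grouped[label].append(vector)
    centroids = {label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
                 for label, vecs in grouped.items()}
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(centroids, key=lambda label: distance(x, centroids[label]))

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.0], "B"), ([5.1, 4.8], "B")]
print(rocchio_classify([1.1, 1.1], train))  # "A"
```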