Concepts covering multimedia indexing, text retrieval, dimensionality reduction techniques like SVD, and core data mining tasks including classification and clustering.
Content-based Image Retrieval
A field of research aiming at indexing and retrieving images based on their visual contents rather than manual text annotation.
Color Histogram
A compact representation of the color of an image where colors are partitioned into k groups and the percentage of each group in the image is measured.
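A minimal sketch of the idea, assuming RGB pixels and an equal-width split of each channel (the partitioning scheme, the function name `color_histogram`, and the toy image are illustrative assumptions, not part of the definition):

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=2):
    """Return the fraction of pixels falling into each color group.

    pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    Each channel is split into equal-width ranges, so the number of
    groups is k = bins_per_channel ** 3 (an assumed partitioning).
    """
    pixels = list(pixels)
    width = 256 / bins_per_channel
    counts = Counter(
        (int(r // width), int(g // width), int(b // width))
        for r, g, b in pixels
    )
    total = len(pixels)
    return {group: n / total for group, n in counts.items()}

# Toy 4-pixel "image": two reddish pixels, one blue, one near-black.
image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (10, 10, 10)]
hist = color_histogram(image)
```

Here the two reddish pixels land in the same group, so that group holds half of the image.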
Stemming
A text processing technique where only the root of each word is kept (e.g., converting 'inverted' and 'inversion' into 'invert').
Inverted Files
A text indexing structure consisting of a dictionary and postings lists, known for high speed despite space overhead.
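The dictionary-plus-postings structure can be sketched as follows. This is a toy version under simple assumptions (whitespace tokenization, document ids only, no positions or compression); the helper names are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term (dictionary entry) to its postings list.

    docs: dict of doc_id -> text. Each postings list is the sorted list
    of ids of the documents that contain the term.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, terms):
    """Return ids of documents containing every query term."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

docs = {
    1: "data mining and text retrieval",
    2: "text indexing",
    3: "image retrieval",
}
index = build_inverted_index(docs)
hits_for_query = search_all(index, ["text", "retrieval"])
```

A query only touches the postings lists of its own terms, which is where the speed comes from.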
Zipf Distribution
A distribution observed in text collections where the frequency of a word is approximately inversely proportional to its rank (freq∼1/rank).
Postings Lists
The components of an inverted file that list occurrences of terms in documents; they are identified as the main source of space overhead.
Vector Space Model
A model where each document is represented as a vector of size d, where d is the number of different terms in the database (vocabulary size).
Binary Weights
A term weighting scheme where only the presence (1) or absence (0) of a term is included in the document vector.
tf x idf
A weighting measure defined as $w = \mathit{tf} \times \log(N / n_k)$, where $\mathit{tf}$ is the term frequency, $N$ is the total number of documents, and $n_k$ is the number of documents containing the term.
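A small sketch of this weighting, assuming raw counts for tf, whitespace tokenization, and the natural logarithm (the definition does not fix the log base, so that choice is an assumption):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, w = tf * log(N / n_k)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # n_k: number of documents that contain term k
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["data mining data", "text mining", "text retrieval"]
weights = tf_idf(docs)
```

In this toy collection, "data" appears twice in one document and nowhere else, so it gets a higher weight there than "mining", which is shared by two documents.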
Cosine Coefficient
A similarity measure for document vectors, also known as the normalized inner product, calculated as $\mathrm{sim}(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \times w_{jk}$ for normalized vectors.
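For vectors that are not already normalized, the same measure divides the inner product by both vector lengths. A minimal sketch (returning 0.0 for an all-zero vector is a convention chosen here, not part of the definition):

```python
import math

def cosine(u, v):
    """Cosine coefficient of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1, different length
d3 = [0.0, 0.0, 5.0]  # shares no terms with d1
```

`cosine(d1, d2)` is 1 up to rounding because the vectors point the same way, while `cosine(d1, d3)` is 0 because they share no terms.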
Latent Semantic Indexing (LSI)
A method that maps documents and terms into latent (hidden) concepts to improve filtering and retrieval.
Singular Value Decomposition (SVD)
The decomposition of a matrix into $A = U \Lambda V^T$, where $U$ is a document-to-concept matrix, $\Lambda$ is a diagonal matrix of concept strengths, and $V$ is a term-to-concept matrix.
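A hedged illustration using NumPy's `numpy.linalg.svd`, assuming rows of the matrix are documents and columns are terms; the toy counts below are made up for illustration:

```python
import numpy as np

# Rows are documents, columns are terms (toy counts, made up).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 1, 1],
], dtype=float)

# NumPy returns the diagonal of the strength matrix as a 1-D array,
# sorted from strongest to weakest, and returns V already transposed.
U, strengths, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest concepts to get a rank-k approximation of A.
k = 2
A_k = U[:, :k] @ np.diag(strengths[:k]) @ Vt[:k, :]

# Documents expressed in concept space (one row per document).
docs_in_concept_space = U[:, :k] * strengths[:k]
```

This toy matrix has two blocks of terms that never co-occur, so two concepts are enough to reconstruct it.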
Frobenius Norm
The norm of an n×m matrix M calculated as the square root of the sum of the squares of its elements: $\sqrt{\sum_{i,j} M[i,j]^2}$.
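A direct translation of that formula, with the matrix given as a list of rows (the function name is a hypothetical helper):

```python
import math

def frobenius_norm(M):
    """Square root of the sum of squared entries of a matrix."""
    return math.sqrt(sum(x * x for row in M for x in row))

M = [[1, 2], [2, 4]]
norm = frobenius_norm(M)  # sqrt(1 + 4 + 4 + 16)
```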
Authorities
In Kleinberg's algorithm, these are nodes that receive links from many important hub nodes.
Hubs
In Kleinberg's algorithm, these are nodes that point to many high-quality authority nodes.
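The mutual reinforcement between hubs and authorities can be sketched as an iterative update. This is a simplified version under assumptions (fixed iteration count, L2 normalization after each round, a made-up four-link graph), not a full implementation of Kleinberg's algorithm:

```python
import math

def hits(edges, iterations=50):
    """Iteratively compute hub and authority scores for a directed graph.

    edges: list of (source, target) links. Scores are L2-normalized
    after every round so they stay bounded.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph `a1` ends up as the strongest authority because every hub points to it, and `h1` is the strongest hub because it points to both authorities.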
PageRank
An algorithm that determines the importance of a page by computing its steady-state probability in a Markov Chain model of a random web surfer.
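One way to approximate that steady-state probability is power iteration. The sketch below assumes a damping factor of 0.85 and treats pages without out-links as linking to every page; both are common conventions but are assumptions here, as is the three-page toy graph:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for the random-surfer Markov chain.

    links: dict of page -> list of pages it links to. With probability
    `damping` the surfer follows a random out-link; otherwise it jumps
    to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p) or pages  # dangling page: spread evenly
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
```

The ranks sum to 1, and page C comes out on top in this graph because it receives links from both A and B.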
Isometric Mapping
An embedding where the mapping F ensures the exact preservation of distance between objects.
FastMap
A metric analogue to the KL-transform (PCA) that uses pivot points and the law of cosines to compute pseudo-projections.
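The pseudo-projection onto the line through two pivots needs only pairwise distances. A minimal sketch of that single step (pivot selection and the recursion onto further dimensions are omitted, and the function name is hypothetical):

```python
import math

def fastmap_coordinate(dist, obj, pivot_a, pivot_b):
    """Project obj onto the line through two pivots using only distances.

    From the law of cosines:
        x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 * d(a,b))
    """
    d_ab = dist(pivot_a, pivot_b)
    d_ao = dist(pivot_a, obj)
    d_bo = dist(pivot_b, obj)
    return (d_ao ** 2 + d_ab ** 2 - d_bo ** 2) / (2 * d_ab)

# Toy check with Euclidean points, where the true projection is known.
a, b = (0.0, 0.0), (4.0, 0.0)
o = (1.0, 3.0)
x = fastmap_coordinate(math.dist, o, a, b)
```

Since the pivots lie on the horizontal axis, `x` recovers the first coordinate of `o` (1.0, up to rounding) without ever reading the coordinates directly.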
Johnson-Lindenstrauss Lemma
A mathematical basis for random projections, stating that a set of points in high-dimensional space can be mapped to much lower dimensions while approximately preserving distances.
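A rough sketch of a random projection, assuming a Gaussian random matrix scaled by 1/sqrt(k). This is one common construction; the dimensions, seeds, and function name are arbitrary illustrative choices:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Map d-dimensional points to k dimensions with a Gaussian matrix.

    Entries are drawn from N(0, 1) and the result is scaled by
    1/sqrt(k) so squared distances are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [
        [scale * sum(r * x for r, x in zip(row, p)) for row in R]
        for p in points
    ]

rng = random.Random(1)
points = [[rng.random() for _ in range(500)] for _ in range(3)]
low = random_projection(points, k=200)

# Ratio of projected to original distance for one pair of points.
ratio = math.dist(low[0], low[1]) / math.dist(points[0], points[1])
```

The ratio should be close to 1, and it tends to tighten as `k` grows.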
Classification
A data mining task involving learning a function that maps an item into one of a set of predefined classes using a training set.
Regression
A data mining task where a function is learned to map an item to a continuous real value.
Clustering
The process of identifying groups of similar items such that intracluster distances are minimized and intercluster distances are maximized.
Association Rule Discovery
The production of dependency rules that predict the occurrence of an item based on the occurrences of other items (e.g., {Milk}→{Coke}).
Stratified Sampling
A sampling method that approximates the percentage of each subpopulation of interest in the overall database, often used with skewed data.
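A minimal sketch with proportional allocation, assuming each record carries a label identifying its subpopulation. Keeping at least one record per stratum is a choice made for this sketch, and the fraud/normal data is made up:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every subpopulation (stratum).

    key: function mapping a record to its stratum label. At least one
    record is kept per stratum so rare groups are not lost.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample

# Skewed toy data: 1% "fraud" records, 99% "normal" records.
records = [("fraud", i) for i in range(10)] + [("normal", i) for i in range(990)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
```

A plain 10% random sample could miss the rare class entirely, whereas this one keeps it at its original 1% share.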