These flashcards cover key vocabulary and definitions related to machine learning techniques in dimensionality reduction and text data analysis.
Dimensionality Reduction
The process of reducing the number of dimensions (attributes) of a dataset to improve analysis and visualization.
Principal Component Analysis (PCA)
A dimensionality reduction technique that projects data onto a lower-dimensional space while maximizing the variance of the projected data.
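A minimal PCA sketch using scikit-learn (the tiny 2-D data array is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-D dataset (made up for illustration)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pca = PCA(n_components=1)             # keep the single direction of maximum variance
X_reduced = pca.fit_transform(X)      # projected data, shape (5, 1)

print(pca.explained_variance_ratio_)  # fraction of total variance retained
```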
Eigenvalues
Values that represent the variance captured by each principal component in PCA.
Eigenvectors
Vectors that define the directions of the new axes in the PCA-transformed space.
Covariance Matrix
A matrix whose entries indicate the extent to which pairs of variables change together; used in PCA to find eigenvalues and eigenvectors.
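A sketch tying the three preceding cards together, using a small invented dataset: compute the covariance matrix, then its eigenvalues (variance captured along each direction) and eigenvectors (the directions themselves):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)           # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh handles symmetric matrices

# Sort from largest to smallest eigenvalue; the leading eigenvector
# points along the first principal component.
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])      # variance captured along each principal direction
print(eigenvectors[:, order])  # columns are the principal directions
```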
Singular Value Decomposition (SVD)
A method of factorizing a matrix into the product of three matrices, used as an alternative to eigendecomposition for dimensionality reduction.
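A minimal SVD sketch with NumPy; keeping only the top k singular values and vectors gives a low-rank approximation, which is how SVD is used for dimensionality reduction:

```python
import numpy as np

A = np.random.rand(6, 4)                     # toy matrix for illustration
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
print(A_k.shape)                             # (6, 4), but only rank 2
```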
Bag-of-Words Model
A method of transforming text into numerical form by counting word occurrences, ignoring word order.
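A bag-of-words sketch using scikit-learn's CountVectorizer (the two documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # sparse matrix of word counts

print(vectorizer.get_feature_names_out())  # vocabulary (column order)
print(counts.toarray())                    # one row of counts per document
```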
Tokenization
The process of breaking down text into individual words or tokens.
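A minimal tokenization sketch using a simple regular expression (one of many possible tokenization schemes):

```python
import re

text = "Tokenization breaks text into individual words, or tokens."
tokens = re.findall(r"\b\w+\b", text.lower())  # split on word boundaries, drop punctuation
print(tokens)
```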
Stopwords
Commonly used words in a language that carry little semantic meaning and are often removed in text preprocessing.
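A stopword-removal sketch using scikit-learn's built-in English stopword list (one of several common lists):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = ["the", "cat", "sat", "on", "the", "mat"]
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(filtered)  # ['cat', 'sat', 'mat']
```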
Latent Semantic Analysis (LSA)
A technique that uses SVD to reduce dimensions in text data and uncover semantic structures.
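A minimal LSA sketch, assuming the common scikit-learn pipeline of tf-idf features followed by truncated SVD (the toy corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply",
        "investors sold shares as markets dropped"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2)      # number of latent semantic dimensions
doc_vectors = lsa.fit_transform(tfidf)  # each document as a point in the latent space
print(doc_vectors.shape)                # (4, 2)
```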
Polysemy
The phenomenon where a word has multiple meanings depending on context.
Distributional Semantics
The theory that words that appear in similar contexts tend to have similar meanings.
Topic Models
Algorithms that cluster words and documents into groups (or topics) based on the distribution of words across documents.
Latent Dirichlet Allocation (LDA)
A generative probabilistic model for collections of discrete data such as text, used for topic modeling.
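A minimal LDA sketch with scikit-learn; a real topic model needs far more documents than this invented toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply",
        "investors sold shares as markets dropped"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic proportions, shape (4, 2)
print(doc_topic)
```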
Jensen-Shannon Divergence
A method of measuring the similarity between two probability distributions over the same variable.
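A sketch of the Jensen-Shannon divergence as the average of two KL divergences to the midpoint distribution; note that SciPy's jensenshannon returns the square root of the divergence (the JS distance):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.1, 0.5, 0.4])
m = 0.5 * (p + q)                                # midpoint distribution

jsd = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)  # average KL divergence to the midpoint
print(jsd)
print(jensenshannon(p, q) ** 2)                  # same value via SciPy's JS distance
```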
Term Frequency-Inverse Document Frequency (tf-idf)
A numerical statistic that reflects how important a word is to a document in a collection, increasing with the word's frequency in that document and decreasing with its frequency across the collection.
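A minimal tf-idf sketch with scikit-learn; terms that are frequent in one document but rare across the collection receive the largest weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)   # rows: documents, columns: terms, values: tf-idf

print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))
```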
Document-Term Matrix
A matrix representation of document data, where rows represent documents and columns represent terms; entries denote term occurrences.
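A sketch of a document-term matrix with labeled rows and columns, built from the same kind of counts a CountVectorizer produces (pandas is used here only for display):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

dtm = pd.DataFrame(counts.toarray(),
                   index=["doc1", "doc2"],
                   columns=vectorizer.get_feature_names_out())
print(dtm)  # rows = documents, columns = terms, entries = occurrence counts
```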