Machine Learning - Dimensionality Reduction and Text as Data

Description and Tags

These flashcards cover key vocabulary and definitions related to machine learning techniques in dimensionality reduction and text data analysis.

17 Terms

1. Dimensionality Reduction

The process of reducing the number of dimensions (attributes) of a dataset to improve analysis and visualization.

2. Principal Component Analysis (PCA)

A dimensionality reduction technique that projects data onto a lower-dimensional space while maximizing the variance of the projected data.
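
A minimal NumPy sketch of the PCA pipeline this card and the next three describe (center the data, compute the covariance matrix, eigendecompose it, project onto the top eigenvector); the toy dataset is made up for illustration:

```python
import numpy as np

# Toy 2-D dataset: points lying roughly along the line y = x.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])

# Step 1: center the data (subtract the column means).
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered data.
C = np.cov(Xc, rowvar=False)

# Step 3: eigendecomposition; each eigenvalue is the variance
# captured along its eigenvector's direction.
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: project onto the top eigenvector (largest eigenvalue),
# reducing the data from 2 dimensions to 1.
top = eigvecs[:, np.argmax(eigvals)]
X_reduced = Xc @ top
```

Because the points nearly fall on a line, the top component captures almost all of the variance.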

3. Eigenvalues

Values that represent the variance captured by each principal component in PCA.

4. Eigenvectors

Vectors that define the directions of the principal component axes in the PCA-transformed space.

5. Covariance Matrix

A matrix that indicates the extent to which two variables change together, used in PCA to find eigenvalues and eigenvectors.

6. Singular Value Decomposition (SVD)

A method of decomposing a matrix into three other matrices, used as an alternative to eigendecomposition for dimensionality reduction.
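
A short NumPy sketch of SVD on a made-up matrix: decompose, verify the reconstruction, then keep only the largest singular value to get a low-rank approximation:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])

# Decompose: A = U @ diag(S) @ Vt, with singular values in S
# sorted in descending order.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstructing from the factors recovers A exactly.
A_rebuilt = U @ np.diag(S) @ Vt

# Rank-1 approximation: keep only the largest singular value.
A_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])
```

Truncating the singular values like this is the operation LSA applies to a document-term matrix.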

7. Bag-of-Words Model

A method of transforming text into numerical form by counting word occurrences, ignoring grammar and word order.

8. Tokenization

The process of breaking down text into individual words or tokens.
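
The bag-of-words and tokenization cards above can be sketched together in a few lines of pure Python; the naive whitespace tokenizer is for illustration only (real tokenizers also handle punctuation and casing rules):

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

doc = "the cat sat on the mat"

# Bag-of-words: word order is discarded, only counts remain.
bag = Counter(tokenize(doc))
```

Here `bag["the"]` is 2, and the total of all counts equals the number of tokens.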

9. Stopwords

Commonly used words in a language that carry little semantic meaning and are often removed in text preprocessing.
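
Stopword removal is a simple filter; the stopword list below is a tiny hand-made example (real pipelines use curated lists, e.g. the ones shipped with NLTK or spaCy):

```python
# Tiny illustrative stopword list -- not a real curated one.
STOPWORDS = {"the", "a", "an", "on", "in", "of"}

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Keep only the content-bearing tokens.
filtered = [t for t in tokens if t not in STOPWORDS]
```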

10. Latent Semantic Analysis (LSA)

A technique that uses SVD to reduce dimensions in text data and uncover semantic structures.
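
A NumPy sketch of LSA on a made-up document-term count matrix: truncating the SVD to k "concepts" places documents about the same subject near each other in the reduced space:

```python
import numpy as np

# Toy document-term count matrix: 4 documents x 5 terms.
# Docs 0-1 use "pet" terms, docs 2-3 use "car" terms.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent "concepts" to keep
docs_2d = U[:, :k] * S[:k]  # document coordinates in concept space
```

In the 2-D concept space, documents 0 and 1 land close together while documents on the other topic are far away.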

11. Polysemy

The phenomenon where a word has multiple meanings depending on context.

12. Distributional Semantics

The hypothesis that words appearing in similar contexts tend to have similar meanings.

13. Topic Models

Algorithms that cluster words and documents into groups (or topics) based on their distributions.

14. Latent Dirichlet Allocation (LDA)

A generative probabilistic model for collections of discrete data such as text, used for topic modeling.
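
LDA's generative story (per document: draw a topic mixture from a Dirichlet prior; per word: draw a topic, then draw the word from that topic's distribution) can be sketched in NumPy; the topic-word distributions below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size = 2, 6
# Assumed topic-word distributions (each row sums to 1) --
# illustrative numbers, not learned from data.
topics = np.array([
    [0.40, 0.40, 0.10, 0.05, 0.03, 0.02],  # "pet" topic
    [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],  # "car" topic
])

def generate_document(n_words, alpha=0.5):
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # 2. Draw a topic for this word position...
        z = rng.choice(n_topics, p=theta)
        # 3. ...then draw the word id from that topic's distribution.
        words.append(rng.choice(vocab_size, p=topics[z]))
    return words

doc = generate_document(20)
```

Fitting LDA inverts this process: given only the documents, it infers the topic-word distributions and each document's topic mixture.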

15. Jensen-Shannon Divergence

A symmetric method of measuring the similarity between two probability distributions over the same variable, based on the Kullback-Leibler divergence.
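
A small NumPy implementation of the Jensen-Shannon divergence as the symmetrized KL divergence against the mixture distribution; with base-2 logarithms the value is bounded in [0, 1]:

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence in bits; only sums over the
    # support of p (q > 0 wherever p > 0 is guaranteed by jsd).
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    # Jensen-Shannon divergence: average KL of p and q
    # against their mixture m.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
```

Unlike raw KL divergence, this is symmetric and stays finite even when the two distributions have different supports.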

16. Term Frequency-Inverse Document Frequency (tf-idf)

A numerical statistic that reflects how important a word is to a document in a collection: it increases with the word's frequency in that document and decreases with the word's frequency across the collection.
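
A pure-Python sketch of one common tf-idf weighting (tf as a token share, idf as log of N over document frequency); note that libraries such as scikit-learn use smoothed variants of this formula:

```python
import math

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document's tokens that are `term`.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the
    # corpus score higher.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf
```

"the" appears in every document, so its idf (and hence tf-idf) is 0, while "dog" appears in only one document and scores highest.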

17. Document-Term Matrix

A matrix representation of document data, where rows represent documents and columns represent terms; entries denote term occurrences.
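
A pure-Python sketch of building a document-term matrix from three toy documents (rows = documents, columns = vocabulary terms, entries = counts):

```python
docs = ["the cat sat", "the dog sat", "a cat ran"]

# Vocabulary: the sorted set of all tokens (the matrix columns).
vocab = sorted({tok for d in docs for tok in d.split()})

# One row per document; each entry counts a term's occurrences.
dtm = [[d.split().count(term) for term in vocab] for d in docs]
```

This matrix is exactly the input that LSA (via SVD) and topic models operate on.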