This set of flashcards covers vocabulary terms, concepts, and techniques related to text mining, natural language processing, and text analytics based on the provided lecture notes.
Text mining
The process of quantifying large amounts of string/text data to extract knowledge and help inform decisions.
Natural Language Processing (NLP)
A field of artificial intelligence established in the 1950s that enables computers to understand and process human language.
Lexical ambiguity
A type of ambiguity where a single word form maps to multiple meanings, such as the identical-sounding 'rows,' 'rose,' and 'roes.'
Syntactic ambiguity
A type of ambiguity in which a sentence's structure permits more than one reading, such as 'We saw her duck' (either the bird she owns or the act of ducking).
Scope ambiguity
Ambiguity concerning how far a quantifier or negation extends, illustrated by the sentence 'Every student did not pass the exam,' which can mean either that no student passed or that not every student passed.
Document
A piece of text that often serves as the level of analysis in text analytics.
Corpus
A collection of documents used for text analysis.
Token
A single word within a piece of text.
Vocabulary
The collection of all unique word tokens within a corpus.
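As a rough illustration of how document, corpus, token, and vocabulary relate, the sketch below builds a vocabulary from a tiny made-up corpus. The corpus contents and the whitespace-based splitting are assumptions for illustration and do not come from the lecture notes.

```python
# Hypothetical three-document corpus (illustrative data only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

# Naive tokenization: lowercase each document and split on whitespace.
tokenized_docs = [doc.lower().split() for doc in corpus]

# The vocabulary is the set of unique tokens across all documents.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})

print(len(corpus), "documents")
print(len(vocabulary), "unique tokens:", vocabulary)
```

Each string is one document, the list is the corpus, and the vocabulary is smaller than the total token count because repeated words are counted once.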
Stemming
The process of reducing inflected words to a stem or base form by stripping affixes with heuristic rules; the resulting stem need not be a real word (e.g., reducing 'changing' to 'chang').
Lemmatization
A more advanced form of word reduction that maps inflected words to their dictionary form, or lemma (e.g., reducing 'changing' to 'change').
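To make the contrast between stemming and lemmatization concrete, here is a deliberately crude sketch. The suffix list, the `naive_stem` and `naive_lemmatize` functions, and the `LEMMAS` lookup table are all hypothetical stand-ins; real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Tiny illustrative suffix list; longer suffixes are checked first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a dictionary-backed lemmatizer.
LEMMAS = {"changing": "change", "changed": "change", "changes": "change"}

def naive_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["changing", "changed", "changes"]:
    print(w, "->", naive_stem(w), "(stem),", naive_lemmatize(w), "(lemma)")
```

All three inflected forms collapse to the non-word stem 'chang' under the suffix rules, while the lookup returns the real word 'change'.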
Stop words
Words that are filtered out prior to processing natural language data.
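A minimal sketch of stop word filtering follows. The short `STOP_WORDS` set is a hand-picked assumption for illustration; lists used in practice are considerably longer and vary by library and task.

```python
# A small, hand-picked stop word list (an assumption; real lists are longer).
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "cat", "is", "on", "the", "mat"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Only the content-bearing tokens 'cat' and 'mat' survive the filter in this example.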
Tokenization
The process of splitting text into tokens while accounting for variations like punctuation and contractions.
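One possible way to handle punctuation and contractions is a regular expression, sketched below. The pattern and the `tokenize` function are illustrative assumptions, not the method from the lecture notes; production tokenizers handle many more edge cases.

```python
import re

# Illustrative pattern: a run of word characters, optionally followed by an
# apostrophe and more word characters, so a contraction stays one token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?")

def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("We saw her duck, but she didn't see us!")
print(tokens)
```

The comma and exclamation mark are discarded, and "didn't" is kept as a single token rather than being split at the apostrophe.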
Vectorization
The process of converting words into numeric representations.
Word representation
Also known as word vectors, these encode word tokens into a vector in a 'word space' with enough dimensions to represent semantic relationships.
One hot vector
An older method of word representation where a vector the size of the vocabulary contains a single 1 for the dimension associated with the word and 0 for all others.
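The one-hot scheme can be sketched in a few lines. The four-word vocabulary and the `one_hot` helper are assumptions chosen for illustration.

```python
def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Hypothetical vocabulary; the vector length equals the vocabulary size.
vocabulary = ["cat", "dog", "mat", "sat"]
cat_vec = one_hot("cat", vocabulary)
mat_vec = one_hot("mat", vocabulary)

print(cat_vec)
print(mat_vec)

# The dot product of two different one-hot vectors is always 0.
dot = sum(a * b for a, b in zip(cat_vec, mat_vec))
print(dot)
```

Because any two distinct one-hot vectors have a dot product of zero, this encoding carries no information about how similar two words are.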
Statistical language models
Probability distributions over sequences of words, used to predict the next word in a sequence; the predecessors to neural language models.
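A bigram model is one simple example of a statistical language model: it estimates the probability of the next word from counts of adjacent word pairs. The sketch below is a minimal version under that assumption; the training sentence and the `train_bigram_model` function are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Estimate P(next | current) from raw bigram counts (no smoothing)."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {w: c / total for w, c in followers.items()}
    return model

# Hypothetical training text (illustrative data only).
tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)

# Predict the most probable word to follow "the".
next_word = max(model["the"], key=model["the"].get)
print(model["the"])
print(next_word)
```

In this toy text 'the' is followed by 'cat' twice and 'mat' once, so the model assigns 'cat' the higher probability and predicts it as the next word.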
Text clustering
The application of cluster analysis to text documents, commonly using the k-means clustering technique for document organization.
N-gram
A contiguous sequence of n tokens used as a unit of analysis in text mining; a 1-gram (unigram) is one token, and a 2-gram (bigram) is two adjacent tokens.
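The sketch below extracts n-grams from a token list by sliding a window of n tokens across it. The `ngrams` function name and the sample tokens are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["text", "mining", "is", "fun"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A list of four tokens yields four 1-grams and three 2-grams, since each window must fit entirely inside the list.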
Latent Dirichlet Allocation (LDA)
A widely used topic modeling technique that represents each document as a mixture of topics and each topic as a probability distribution over words.
Descriptive analytics
In text analytics, this includes techniques such as visualization and frequency analysis.
Predictive analytics
In text analytics, this includes techniques such as text classification and sentiment analysis.