Text Mining and Natural Language Processing (NLP) Fundamentals

Description and Tags

This set of flashcards covers vocabulary terms, concepts, and techniques related to text mining, natural language processing, and text analytics based on the provided lecture notes.

Last updated 10:43 PM on 4/29/26

22 Terms

Text mining

The process of quantifying large amounts of string/text data to extract knowledge and help inform decisions.

Natural Language Processing (NLP)

A field of artificial intelligence established in the 1950s that enables computers to understand and process human language.

Lexical ambiguity

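A type of ambiguity where a single word form has more than one meaning; for example, 'rose' can be a flower or the past tense of 'rise.' Homophones such as 'rows,' 'rose,' and 'roes' create the same problem in spoken language.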

Syntactic ambiguity

A type of ambiguity where a sentence can be parsed in more than one way, such as 'We saw her duck' ('duck' as a noun, the animal she owns, or as a verb, the action she performed).

Scope ambiguity

Ambiguity about how far a quantifier or negation extends over a sentence, illustrated by 'Every student did not pass the exam,' which can mean either that no student passed or that not every student passed.

Document

A piece of text that often serves as the level of analysis in text analytics.

Corpus

A collection of documents used for text analysis.

Token

A single unit of text, most often a word, within a piece of text.

Vocabulary

The collection of all unique word tokens within a corpus.
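
The relationship between documents, a corpus, tokens, and the vocabulary can be sketched in a few lines of Python. The two example documents are made up for illustration, and splitting on whitespace is a simplifying assumption rather than a full tokenizer.

```python
# A corpus is a collection of documents (here, two made-up sentences).
corpus = [
    "every student passed the exam",
    "the student saw the duck",
]

# Simplification: split on whitespace to get the tokens of each document.
tokenized_docs = [doc.split() for doc in corpus]

# The vocabulary is the set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})

total_tokens = sum(len(doc) for doc in tokenized_docs)
print(total_tokens)     # 10
print(vocabulary)       # ['duck', 'every', 'exam', 'passed', 'saw', 'student', 'the']
```

The corpus holds 10 tokens but only 7 vocabulary entries, because 'the' and 'student' repeat.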

Stemming

The process of reducing inflected words to their stem or base form by stripping affixes until related words share a common form (e.g., reducing 'changing' to 'chang'); the resulting stem need not be a real word.
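
As a rough illustration of the idea, the sketch below strips a handful of suffixes. The suffix list and the `naive_stem` helper are hypothetical and far cruder than established stemmers such as the Porter stemmer; they only show how related words can collapse to one stem.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["changing", "changed", "changes", "change"]:
    print(w, "->", naive_stem(w))  # each prints "... -> chang"
```

All four forms map to 'chang', which is not a dictionary word but is enough to group them together.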

Lemmatization

A more advanced form of word reduction that maps inflected words to their dictionary form, or lemma (e.g., reducing 'changing' to 'change').
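
A minimal sketch of the idea, assuming a tiny hand-written lookup table: `LEMMA_TABLE` and `lemmatize` are hypothetical names, and real lemmatizers rely on a full dictionary and usually on part-of-speech information.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "changing": "change",
    "changed": "change",
    "changes": "change",
    "mice": "mouse",
}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

print(lemmatize("changing"))  # change
print(lemmatize("Mice"))      # mouse
```

Unlike a stem, the result is always a real word, and irregular forms such as 'mice' can be handled.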

Stop words

Very common words, such as 'the' and 'of,' that are filtered out prior to processing natural language data because they carry little meaning on their own.
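
Filtering can be sketched as a simple membership test. The `STOP_WORDS` set below is a tiny illustrative list chosen for this example, not a standard one; NLP libraries ship much longer lists.

```python
# Tiny illustrative stop word list (an assumption for this example).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the process of converting words to tokens".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['process', 'converting', 'words', 'tokens']
```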

Tokenization

The process of splitting text into tokens while accounting for variations like punctuation and contractions.
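
One possible way to handle punctuation and contractions is a regular expression, sketched below. The pattern and the choice to lowercase the text are assumptions made for this example; production tokenizers cover many more cases.

```python
import re

# Hypothetical pattern: a word (optionally with one internal apostrophe,
# as in "didn't"), or any single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]")

def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("We saw her duck, didn't we?")
print(tokens)  # ['we', 'saw', 'her', 'duck', ',', "didn't", 'we', '?']
```

Here the contraction stays as one token while the comma and question mark become tokens of their own; other tokenizers make different choices.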

Vectorization

The process of converting words into numeric representations.
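
One common way to do this is a bag-of-words count vector, where each position counts how often a vocabulary term appears in a document. The card does not name a specific method, so this choice and the example documents are assumptions.

```python
from collections import Counter

docs = ["the duck saw the student", "the student passed"]
tokenized = [d.split() for d in docs]

# Fixed, sorted vocabulary so every vector uses the same column order.
vocabulary = sorted({t for doc in tokenized for t in doc})

def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary term occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in tokenized]
print(vocabulary)  # ['duck', 'passed', 'saw', 'student', 'the']
print(vectors)     # [[1, 0, 1, 1, 2], [0, 1, 0, 1, 1]]
```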

Word representation

Also known as word vectors, these encode word tokens into a vector in a 'word space' with enough dimensions to represent semantic relationships.

One-hot vector

An older method of word representation where a vector the size of the vocabulary contains a single 1 for the dimension associated with the word and 0 for all others.
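
The definition translates directly into code. The three-word vocabulary and the `one_hot` helper below are made up for illustration.

```python
vocabulary = ["duck", "exam", "student"]

# Map each word to its dimension in the vector.
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vocabulary-sized vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(one_hot("exam"))  # [0, 1, 0]
```

Because every pair of distinct words differs in exactly the same way, these vectors carry no information about which words are similar in meaning.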

Statistical language models

Probability distributions over sequences of words that can be used to predict the next word in a sequence; the predecessors of neural language models.
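
A bigram model is one simple example: it estimates the probability of the next word from counts of adjacent word pairs. The sketch below uses a made-up sentence as training text and applies no smoothing, so unseen pairs get no probability at all.

```python
from collections import Counter, defaultdict

tokens = "the student passed the exam and the student left".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | prev) as relative bigram frequencies."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("the")
print(probs)  # 'student' gets 2/3 and 'exam' gets 1/3
```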

Text clustering

The application of cluster analysis to text documents, commonly using the k-means clustering technique for document organization.
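
A minimal pure-Python sketch of k-means on document count vectors is shown below. The four vectors, the vocabulary they are assumed to count, and the fixed starting centroids are all made up so that the result is deterministic; library implementations choose initial centroids more carefully.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels, centroids

# Count vectors over a hypothetical vocabulary ["duck", "pond", "exam", "student"].
doc_vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
labels, centroids = kmeans(doc_vectors, [doc_vectors[0], doc_vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

The two documents about ducks and ponds end up in one cluster and the two about exams and students in the other.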

N-gram

A unit of analysis in text mining made up of n consecutive tokens: a 1-gram (unigram) is one token, a 2-gram (bigram) is two adjacent tokens, and so on.
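
Extracting n-grams amounts to sliding a window of n tokens across the text, as in this short sketch. The `ngrams` helper is a hypothetical name written for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "we saw her duck".split()
print(ngrams(tokens, 1))  # [('we',), ('saw',), ('her',), ('duck',)]
print(ngrams(tokens, 2))  # [('we', 'saw'), ('saw', 'her'), ('her', 'duck')]
```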

Latent Dirichlet Allocation (LDA)

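A probabilistic topic modeling technique, among the most widely used, that treats each document as a mixture of topics and each topic as a distribution over words.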

Descriptive analytics

In text analytics, this includes techniques such as visualization and frequency analysis.
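
Frequency analysis, for instance, can be as simple as counting tokens and listing the most common ones. The sentence below is made up, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

tokens = "the student saw the duck and the duck saw the student".split()

# Count every token and look at the most frequent ones.
frequencies = Counter(tokens)
top = frequencies.most_common(3)
print(top)
```

In this example 'the' appears 4 times and tops the list, which is one reason stop words are often removed before counting.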

Predictive analytics

In text analytics, this includes techniques such as text classification and sentiment analysis.