Ch10: Text Mining

What is text mining?

A: The process of discovering, deriving, or extracting high-quality information or previously unknown patterns from large amounts of unstructured textual data.

Why is text mining important?

A: Because a large share of data (commonly estimated at around 80%) is unstructured, much of it text, and traditional data mining techniques expect structured input and cannot process raw text directly.

Give three example applications of text mining.

Spam filtering
Customer care service
Document summarization

What are the main tasks in text mining?

Text Categorization (supervised)
Text Clustering (unsupervised)
Sentiment Analysis (supervised)
Document Summarization (supervised/unsupervised)
Named Entity Recognition (supervised)

Why can't we represent text as raw strings or a list of sentences for mining?

A raw string has no semantic meaning for the machine
→ it sees only a sequence of characters, and naive representations ignore grammar and word order.
A list of sentences is just a list of smaller documents
→ each sentence still needs a representation, so the problem is recursive.

What is the Bag of Words (BoW) model?

A: A common document representation where a document is treated as an unordered collection (multiset) of its words: word counts are kept, but grammar and word order are ignored.
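
As a minimal sketch of this representation, the snippet below counts the words of one short, made-up sentence. The function name `bag_of_words` and the lowercase-and-split-on-whitespace tokenizer are assumptions for illustration, not part of the card set.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase, whitespace-separated word in the text."""
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")

# The bag keeps how often each word occurs, but not where it occurs.
print(bow["the"])  # 2
print(bow["cat"])  # 1
print(bow["dog"])  # 0 (a Counter returns 0 for words it has not seen)
```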

What is tokenization?

A: The process of breaking text into smaller units (tokens), such as words or sentences.
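
A rough sketch of word-level and sentence-level tokenization using only regular expressions. Both splitting rules are simplifying assumptions (real tokenizers handle abbreviations, hyphens, and similar cases), and the function names are hypothetical.

```python
import re

def word_tokens(text):
    """Split text into word tokens: runs of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokens(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Text mining is fun. It isn't hard!"

words = word_tokens(text)
# ['Text', 'mining', 'is', 'fun', 'It', "isn't", 'hard']

sentences = sentence_tokens(text)
# ['Text mining is fun.', "It isn't hard!"]
```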

What are the main assumptions and limitations of Bag of Words?

Assumes words occur independently of each other
Pro: Simple, and keeps every word of the text along with its count
Con: Loses grammar and word order, and cannot detect synonyms or semantic similarity
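
The two limitations can be seen in a small sketch. The example sentences are invented for illustration, and the whitespace tokenizer is an assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase, whitespace-separated word in the text."""
    return Counter(text.lower().split())

# Limitation 1: word order is lost, so opposite meanings get the same bag.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same_bag)  # True

# Limitation 2: synonyms are different tokens, so they never match.
review_a = bag_of_words("the film was great")
review_b = bag_of_words("the movie was excellent")
shared = set(review_a) & set(review_b)
print(sorted(shared))  # ['the', 'was']
```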

What is Zipf’s Law in text mining?

A: A word's frequency is roughly inversely proportional to its rank in the frequency table, so a small number of words occur very frequently while most words appear rarely, creating a long tail in word frequency.

What are the common pre-processing steps in text mining?

Tokenization
Normalization (e.g., lowercase, remove punctuation)
Stop-word removal
Stemming (reduce words to root form)
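
The steps above can be sketched as one small pipeline. Everything here is a simplifying assumption: the stop-word list is a tiny illustrative sample, and `toy_stem` is a crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

# Suffixes tried in order by the toy stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Normalization and tokenization: lowercase, then keep alphanumeric runs
    # (which also drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming.
    return [toy_stem(t) for t in tokens]

result = preprocess("The cats are chasing the mice in the garden.")
print(result)  # ['cat', 'chas', 'mice', 'garden']
```

Note that a stem such as "chas" need not be a dictionary word; it only has to be the same for related word forms.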

What is Term Frequency (TF)?

A: The number of times a term appears in a document, used as a rough indicator of how important the term is in that document.

Why is raw Term Frequency sometimes misleading?

A: It doesn't account for document length or the diminishing importance of repeated terms.
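
A sketch of two common ways to adjust raw counts, assuming pre-tokenized documents: dividing by document length, and log scaling with `1 + log(count)`. The card set does not say which variant the course uses, so both are shown as assumptions.

```python
import math
from collections import Counter

def raw_tf(tokens):
    """Raw term frequency: plain occurrence counts."""
    return Counter(tokens)

def relative_tf(tokens):
    """Length-normalized TF: count divided by the number of tokens."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

def log_tf(tokens):
    """Log-scaled TF: each extra repetition adds less weight."""
    return {term: 1 + math.log(count) for term, count in Counter(tokens).items()}

short_doc = ["data", "mining"]
long_doc = ["data", "mining"] * 10  # the same content repeated ten times

print(raw_tf(short_doc)["data"], raw_tf(long_doc)["data"])            # 1 10
print(relative_tf(short_doc)["data"], relative_tf(long_doc)["data"])  # 0.5 0.5
print(round(log_tf(long_doc)["data"], 2))                             # 3.3
```

With raw counts the long document looks ten times more "about" data; the normalized variant treats both documents the same, and the log variant dampens the repetition.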

What is Inverse Document Frequency (IDF)?

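A: A measure of how rare a term is across the documents of a corpus: terms that appear in few documents get higher weights, and terms that appear in almost every document get low weights.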
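
A minimal sketch, assuming the common definition IDF(t) = log(N / DF(t)), where N is the number of documents and DF(t) is the number of documents containing t. The card set does not give a formula, so this variant (natural log, no smoothing) and the toy corpus are assumptions.

```python
import math

def inverse_document_frequency(docs):
    """Return {term: log(N / DF(term))} for every term in the corpus."""
    n_docs = len(docs)
    vocabulary = set().union(*docs)
    return {
        term: math.log(n_docs / sum(term in doc for doc in docs))
        for term in vocabulary
    }

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
idf = inverse_document_frequency(docs)

print(idf["the"])            # 0.0 (appears in every document)
print(round(idf["cat"], 3))  # 0.405 (appears in 2 of 3 documents)
print(round(idf["dog"], 3))  # 1.099 (appears in 1 of 3 documents)
```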

What is TF-IDF?

A: A score that combines Term Frequency and Inverse Document Frequency to balance importance within a document and across the corpus.

TF-IDF(t,d) = TF(t,d)×IDF(t)
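
The formula can be sketched directly, assuming raw counts for TF and IDF(t) = log(N / DF(t)); both choices, the function name `tf_idf`, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: TF(t, d) * IDF(t)} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
weights = tf_idf(docs)

print(weights[0]["the"])            # 0.0 (frequent everywhere, so uninformative)
print(round(weights[1]["dog"], 3))  # 1.099 (unique to the second document)
```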

How are documents compared after vectorization?

A: By computing similarity or distance between their vector representations.

Euclidean distance is a common measure.
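
A small sketch of the comparison step, assuming count vectors built over a shared, sorted vocabulary. The helper names `vectorize` and `euclidean_distance` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(docs):
    """Map tokenized documents to count vectors over a shared vocabulary."""
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [["cat", "sat"], ["cat", "ran"]]
vocabulary, vectors = vectorize(docs)

print(vocabulary)  # ['cat', 'ran', 'sat']
print(vectors)     # [[1, 0, 1], [1, 1, 0]]
print(round(euclidean_distance(vectors[0], vectors[1]), 3))  # 1.414
```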

What is cosine similarity?

A: A measure of similarity between two vectors based on the cosine of the angle between them. For non-negative term-weight vectors it ranges from 0 (no shared terms) to 1 (identical direction).
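
A minimal sketch of the computation: the dot product of the two vectors divided by the product of their lengths. Returning 0.0 for an all-zero vector is a convention chosen here, not something the card set specifies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)

no_overlap = cosine_similarity([1, 0], [0, 1])              # no shared terms
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])    # same proportions

print(no_overlap)                # 0.0
print(round(same_direction, 6))  # 1.0
```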

Why is cosine similarity preferred over Euclidean distance for comparing documents?

A: Because it depends on the direction of the vectors (which terms they share and in what proportions) rather than on their length, making it better for comparing documents of different sizes.
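
The difference can be illustrated with three made-up count vectors: a short document, the same content repeated ten times, and an unrelated short document. The vectors are invented for illustration.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms  # assumes neither vector is all zeros

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]  # same term proportions, ten times longer
other_doc = [2, 1, 1]   # different mix of terms, similar length

# Euclidean distance puts the longer copy far away from the short document.
print(round(euclidean_distance(short_doc, long_doc), 2))   # 20.12
print(round(euclidean_distance(short_doc, other_doc), 2))  # 1.73

# Cosine similarity ranks the longer copy as the closer match.
print(round(cosine_similarity(short_doc, long_doc), 2))    # 1.0
print(round(cosine_similarity(short_doc, other_doc), 2))   # 0.73
```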

How can TF-IDF and similarity be used in document classification?