Ch10: Text Mining

19 Terms

1

What is text mining?

A: The process of discovering/deriving/extracting high-quality information or unknown patterns from large amounts of unstructured textual data.

2

Why is text mining important?

A: Because around 80% of data is unstructured text, and traditional data mining techniques cannot process it directly.

3

Give three example applications of text mining.

A:

  • Spam filtering

  • Customer care service

  • Document summarization

4

What are the main tasks in text mining?

A:

  • Text Categorization (supervised)

  • Text Clustering (unsupervised)

  • Sentiment Analysis (supervised)

  • Document Summarization (supervised/unsupervised)

  • Named Entity Recognition (supervised)

5

Why can't we represent text as raw strings or lists of sentences for mining?

A:

  • Raw strings have no semantic meaning for the machine

    → grammar/order is ignored in naive representations.

  • A list of sentences is just another (smaller) document

    → the representation problem recurs at the sentence level.

6

What is the Bag of Words (BoW) model?

A: A common document representation where a document is treated as a set of words, ignoring grammar and word order.
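
A rough sketch of the idea, assuming whitespace-separated English text; the example document and the helper name bag_of_words are made up for illustration.

```python
# Minimal Bag of Words sketch using only the standard library.
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; grammar and word order are discarded.
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```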

7

What is tokenization?

A: The process of breaking text into smaller units (tokens), such as words or sentences.
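
A small sketch of word- and sentence-level tokenization with a regular expression; real tokenizers (e.g., in NLTK or spaCy) handle many more edge cases, so treat this as illustrative only.

```python
import re

text = "Text mining is fun. It turns raw text into data!"

# Split into sentences after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)
# Extract lowercase word tokens.
words = re.findall(r"\b\w+\b", text.lower())

print(sentences)  # ['Text mining is fun.', 'It turns raw text into data!']
print(words)      # ['text', 'mining', 'is', 'fun', 'it', 'turns', 'raw', ...]
```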

8

What are the main assumptions and limitations of Bag of Words?

A:

  • Assumes word independence

  • Pro: Simple & preserves all the words (term counts) of the text

  • Con: Loses grammar and word order & Cannot detect synonyms or semantic similarity

9

What is Zipf’s Law in text mining?

A: A small number of words occur very frequently, while most words appear rarely — creating a long tail in word frequency.
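
A toy illustration of that skew: count word frequencies in a made-up corpus and list them by rank.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(corpus)

# A few words ('the', 'sat', 'on') dominate; most words appear only once.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```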

10

What are the common pre-processing steps in text mining?

A:

  • Tokenization

  • Normalization (e.g., lowercase, remove punctuation)

  • Stop-word removal

  • Stemming (reduce words to root form)
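
A rough pipeline covering these four steps, assuming English text; the tiny stop-word list and the crude suffix-stripping stemmer below are simplified stand-ins for real resources such as NLTK's stop-word corpus and the Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def crude_stem(word):
    # Very naive suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # normalization: lowercase
    text = re.sub(r"[^\w\s]", " ", text)                  # normalization: remove punctuation
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(preprocess("The cats are sitting on the mats, happily!"))
# ['cat', 'sitt', 'mat', 'happily']
```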

11

What is Term Frequency (TF)?

A: The number of times a term appears in a document — indicates how important the word is in that document.

12

Why is raw Term Frequency sometimes misleading?

A: It doesn't account for document length or the diminishing importance of repeated terms.

13

What is Inverse Document Frequency (IDF)?

A: A measure of how rare a term is across documents — rare terms get higher weights.

14

What is TF-IDF?

A: A score that combines Term Frequency and Inverse Document Frequency to balance importance within a document and across the corpus.

TF-IDF(t,d) = TF(t,d)×IDF(t)
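
A minimal numeric sketch of the formula above. IDF is computed here as log(N / df(t)); the exact variant (smoothing, +1 terms, log base) differs between courses and libraries, so treat those details as an assumption. The three documents are toy examples.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "the bird sang in the garden".split(),
]
N = len(docs)

def tf(term, doc):
    # Raw count of the term in one document.
    return Counter(doc)[term]

def idf(term):
    # Rarer terms (smaller document frequency) get a larger IDF weight.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))  # ~1.10: 'cat' appears in only one document
print(tf_idf("the", docs[0]))  # 0.0: 'the' appears in every document
```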

15

How are documents compared after vectorization?

A: By computing similarity or distance between their vector representations.

Euclidean distance is a common measure.

16

What is cosine similarity?

A: A measure of similarity between two vectors based on the cosine of the angle between them - ranges from 0 (no overlap) to 1 (identical direction).

17

Why is cosine similarity preferred over Euclidean distance for comparing documents?

A: Because it accounts for direction (overlap) rather than just length, making it better for comparing documents of different sizes.
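
A small sketch comparing the two measures on term-count vectors, assuming a fixed shared vocabulary so the vectors line up; the vectors themselves are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same word proportions, but the second "document" is three times longer.
short_doc = [1, 2, 0, 1]
long_doc  = [3, 6, 0, 3]

print(cosine_similarity(short_doc, long_doc))   # ~1.0: identical direction
print(euclidean_distance(short_doc, long_doc))  # ~4.9: grows with document length
```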

18

How can TF-IDF and similarity be used in document classification?

A:

  1. Preprocess documents

  2. Compute TF-IDF

  3. Measure similarity between documents

  4. Assign the new document to the most similar category
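
A sketch of this workflow on a toy corpus: the labelled texts, the labels, and the simplified vectorizer (raw counts over a shared vocabulary, with preprocessing and IDF weighting omitted) are all assumptions made for illustration.

```python
import math
from collections import Counter

labelled_docs = {
    "sports":  "the team won the football match",
    "finance": "the bank raised interest rates",
}
new_doc = "interest rates fell at the bank"

# Shared vocabulary so all vectors have the same dimensions.
vocab = sorted({w for text in list(labelled_docs.values()) + [new_doc] for w in text.split()})

def to_vector(text):
    # Simplified stand-in for a TF-IDF vector: raw term counts only.
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

new_vec = to_vector(new_doc)
best = max(labelled_docs, key=lambda label: cosine(to_vector(labelled_docs[label]), new_vec))
print(best)  # prints 'finance': the category of the most similar labelled document
```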

19

What are some advanced models used in text classification?

A:

  • Probabilistic models (e.g., Naive Bayes)

  • Decision trees

  • Support Vector Machines

  • k-Nearest Neighbors

  • Neural networks
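
As a hedged sketch of how such models are often wired up in practice (scikit-learn is one common choice, not necessarily the one used in this course): TF-IDF features feeding a Naive Bayes classifier, trained on toy placeholder texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "cheap pills buy now",
    "limited offer click here",
    "meeting rescheduled to friday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Vectorize with TF-IDF, then fit a Naive Bayes classifier on the vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["click now for a cheap offer"]))  # likely ['spam']
```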