Ch10: Text Mining

Last updated 8:36 AM on 6/10/25
19 Terms

1
New cards

What is text mining?

A: The process of discovering, deriving, or extracting high-quality information or previously unknown patterns from large amounts of unstructured textual data.

2
New cards

Why is text mining important?

A: Because around 80% of data is unstructured text, and traditional data mining techniques cannot process it directly.

3
New cards

Give three example applications of text mining.

A:

  • Spam filtering

  • Customer care service

  • Document summarization

4
New cards

What are the main tasks in text mining?

A:

  • Text Categorization (supervised)

  • Text Clustering (unsupervised)

  • Sentiment Analysis (supervised)

  • Document Summarization (supervised/unsupervised)

  • Named Entity Recognition (supervised)

5
New cards

Why can't we represent text as raw strings or lists of sentences for mining?

A:

  • Raw strings have no semantic meaning for the machine

    → grammar and word order are ignored in naive representations.

  • A list of sentences is just another (smaller) document

    → the representation problem recurs.

6
New cards

What is the Bag of Words (BoW) model?

A: A common document representation where a document is treated as a set of words, ignoring grammar and word order.
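A minimal sketch of a bag-of-words representation using Python's `collections.Counter` (the tokenizer here is a simple lowercase whitespace split, chosen only for illustration):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; grammar and word order are discarded,
    # leaving only the multiset of words and their counts.
    tokens = text.lower().split()
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```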

7
New cards

What is tokenization?

A: The process of breaking text into smaller units (tokens), such as words or sentences.
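A rough sketch of both kinds of tokenization using regular expressions (these rules are simplistic assumptions for illustration, not a production tokenizer):

```python
import re

def word_tokenize(text):
    # Extract runs of letters/apostrophes as word tokens (a simplistic rule).
    return re.findall(r"[a-z']+", text.lower())

def sent_tokenize(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

words = word_tokenize("Text mining is fun!")   # ['text', 'mining', 'is', 'fun']
sents = sent_tokenize("A b. C d? E.")          # ['A b.', 'C d?', 'E.']
```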

8
New cards

What are the main assumptions and limitations of Bag of Words?

A:

  • Assumption: words are independent of one another

  • Pro: simple, and every word of the text is retained (as counts)

  • Cons: loses grammar and word order; cannot detect synonyms or semantic similarity

9
New cards

What is Zipf’s Law in text mining?

A: A small number of words occur very frequently, while most words appear rarely — creating a long tail in word frequency.
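Zipf's law predicts that a word's frequency is roughly inversely proportional to its rank (f(r) ≈ f(1)/r). A quick sketch of inspecting the rank-frequency pattern on any token list:

```python
from collections import Counter

def rank_frequencies(tokens):
    # Frequencies sorted from most to least common (rank 1, 2, ...);
    # under Zipf's law these drop off sharply into a long tail.
    return [count for _, count in Counter(tokens).most_common()]

freqs = rank_frequencies("the the the the cat cat sat on on the mat".split())
# [5, 2, 2, 1, 1] — one very frequent word, then a tail of rare ones
```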

10
New cards

What are the common pre-processing steps in text mining?

A:

  • Tokenization

  • Normalization (e.g., lowercase, remove punctuation)

  • Stop-word removal

  • Stemming (reduce words to root form)
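The four steps above can be sketched as a single pipeline. The stop-word list is a tiny illustrative sample, and `stem` is a crude suffix stripper (not the Porter algorithm), both assumptions made here for brevity:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}  # tiny sample list

def stem(word):
    # Crude suffix stripping for illustration only (not Porter's algorithm).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                  # normalization: lowercase
    tokens = re.findall(r"[a-z]+", text)                 # tokenization + punctuation removal
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

out = preprocess("The miners are mining texts.")
# ['miner', 'are', 'min', 'text'] — note how crude stemming over-trims 'mining'
```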

11
New cards

What is Term Frequency (TF)?

A: The number of times a term appears in a document — indicates how important the word is in that document.

12
New cards

Why is raw Term Frequency sometimes misleading?

A: It doesn't account for document length or the diminishing importance of repeated terms.

13
New cards

What is Inverse Document Frequency (IDF)?

A: A measure of how rare a term is across documents — rare terms get higher weights.

14
New cards

What is TF-IDF?

A: A score that combines Term Frequency and Inverse Document Frequency to balance importance within a document and across the corpus.

TF-IDF(t,d) = TF(t,d)×IDF(t)
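The formula can be sketched directly over tokenized documents. This uses raw counts for TF and IDF(t) = log(N / df(t)); textbooks vary in the exact weighting scheme, so treat this as one common variant:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of tokenized documents (lists of terms).
    # Returns one {term: weight} vector per document.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: docs containing the term
    vectors = []
    for doc in docs:
        tf = Counter(doc)            # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tf_idf([["cat", "sat"], ["cat", "ran"], ["dog", "ran"]])
# "cat" appears in 2 of 3 docs, so its weight log(3/2) is lower than "sat"'s log(3).
```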

15
New cards

How are documents compared after vectorization?

A: By computing similarity or distance between their vector representations.

Euclidean distance is a common measure.

16
New cards

What is cosine similarity?

A: A measure of similarity between two vectors based on the cosine of the angle between them; for non-negative vectors such as TF-IDF it ranges from 0 (no overlap) to 1 (identical direction).
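A minimal sketch of cosine similarity over sparse term vectors (dicts mapping term to weight, e.g. TF-IDF):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 0.0 for a zero vector by convention.
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

cosine_similarity({"cat": 1, "sat": 1}, {"cat": 1, "ran": 1})  # 0.5
```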

17
New cards

Why is cosine similarity preferred over Euclidean distance for comparing documents?

A: Because it accounts for direction (overlap) rather than just length, making it better for comparing documents of different sizes.

18
New cards

How can TF-IDF and similarity be used in document classification?

A:

  1. Preprocess documents

  2. Compute TF-IDF

  3. Measure similarity between documents

  4. Assign the new document to the most similar category
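The four steps above can be combined into a nearest-neighbor sketch. The tokenizer, the IDF(t) = log(N / df(t)) weighting, and the tiny labeled corpus are all illustrative assumptions:

```python
import math
from collections import Counter

def tokenize(text):
    # Step 1 (preprocess), reduced to a lowercase whitespace split for brevity.
    return text.lower().split()

def tf_idf_vectors(docs):
    # Step 2: TF-IDF with raw counts and IDF(t) = log(N / df(t)).
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    # Step 3: cosine similarity between sparse term vectors.
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(new_doc, labeled_docs):
    # Step 4: assign the label of the most similar labeled document.
    texts = [t for t, _ in labeled_docs] + [new_doc]
    vecs = tf_idf_vectors([tokenize(t) for t in texts])
    new_vec = vecs[-1]
    best = max(range(len(labeled_docs)), key=lambda i: cosine(new_vec, vecs[i]))
    return labeled_docs[best][1]

labeled = [("cheap pills buy now", "spam"),
           ("meeting agenda attached", "ham")]
classify("buy cheap pills", labeled)  # → "spam"
```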

19
New cards

What are some advanced models used in text classification?

  • Probabilistic models (e.g., Naive Bayes)

  • Decision trees

  • Support Vector Machines

  • k-Nearest Neighbors

  • Neural networks
