CHRISTINE DATA MINING

30 Terms

1

What is the main purpose of lemmatization?

To reduce words to their base or dictionary form

2

Why are stop words commonly removed in text classification tasks?

They often carry little semantic information
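
For illustration, a minimal sketch of stop-word removal with NLTK's English stop-word list (the token list is an invented example; the stopwords corpus must be downloaded once):

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once to fetch the stop-word list
stop_words = set(stopwords.words('english'))

tokens = ['the', 'model', 'is', 'trained', 'on', 'the', 'corpus']
content = [t for t in tokens if t not in stop_words]
print(content)

#> ['model', 'trained', 'corpus']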

3

What does a TF-IDF representation primarily capture?

The importance of words relative to a document and the corpus

4

TF-IDF

term frequency-inverse document frequency
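
A minimal sketch of building TF-IDF features with scikit-learn's TfidfVectorizer (the two documents are invented; get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["profit rose this quarter",
        "the quarter saw a gain in profit"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.shape)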

5

In a Decision Tree trained on TF-IDF features, what do feature importances represent?

How strongly a term contributes to classification decisions
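
A hedged sketch of reading feature importances from a Decision Tree trained on TF-IDF features (the documents and labels are toy examples, not the practical session's data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["profit rose sharply", "loss widened again",
        "record profit reported", "heavy loss posted"]
labels = [1, 0, 1, 0]  # 1 = positive report, 0 = negative report

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Each importance reflects how much that term's splits reduce impurity
for term, imp in zip(vectorizer.get_feature_names_out(),
                     clf.feature_importances_):
    if imp > 0:
        print(term, round(imp, 3))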

6

What is the main limitation of TF-IDF compared to word embeddings?

TF-IDF ignores semantic similarity between words

7

What does a word embedding represent?

A dense vector encoding semantic relationships

8

In the naive Doc2Vec approach used in the practical session, a document vector is obtained by:

Averaging the embeddings of words in the document
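
A minimal sketch of that averaging step, using an invented three-dimensional embedding table in place of real pretrained vectors:

import numpy as np

embeddings = {  # toy vectors, for illustration only
    "profit":  np.array([0.9, 0.1, 0.0]),
    "rose":    np.array([0.2, 0.8, 0.1]),
    "sharply": np.array([0.1, 0.3, 0.7]),
}

tokens = ["profit", "rose", "sharply"]
doc_vector = np.mean([embeddings[t] for t in tokens], axis=0)
print(doc_vector)  # element-wise mean of the word vectors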

9

Which statement best describes the trade-off between TF-IDF and word-embedding approaches?

TF-IDF models are often more interpretable but less semantic

10

Which statement best describes the difference between TF-IDF and word embeddings in text classification?

TF-IDF is generally more interpretable, while word embeddings better capture semantic similarity

11

TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF is a statistical representation that measures how important a word is to a document relative to the entire corpus.

  • Words that appear frequently in a document but rarely across documents receive higher weights (a hand computation follows after this list).

  • The representation is sparse and frequency-based.

  • It does not capture semantic similarity between words (e.g., “profit” and “gain” are treated as unrelated).
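
A hand computation of one common TF-IDF variant, tfidf(t, d) = tf(t, d) * log(N / df(t)); note that libraries such as scikit-learn use smoothed formulas, so their exact numbers differ:

import math

docs = [["profit", "rose"], ["profit", "fell"], ["rain", "fell"]]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # documents containing the term
    return tf * math.log(N / df)

print(tfidf("profit", docs[0]))  # ~0.405: in 2 of 3 docs, lower weight
print(tfidf("rose", docs[0]))    # ~1.099: in 1 of 3 docs, higher weight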

12

Word Embeddings

Word embeddings represent words as dense numerical vectors learned from large corpora, such that semantically similar words have similar vectors.

  • They capture semantic and contextual relationships between words.

  • Similar words (e.g., “profit” and “revenue”) are close in vector space.

  • They are generally less interpretable than TF-IDF features (a cosine-similarity sketch follows after this list).
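
A minimal sketch of measuring semantic closeness with cosine similarity; the vectors here are invented stand-ins for pretrained embeddings:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

profit  = np.array([0.8, 0.5, 0.1])
revenue = np.array([0.7, 0.6, 0.2])  # intended to be close to "profit"
banana  = np.array([0.0, 0.2, 0.9])  # intended to be unrelated

print(cosine(profit, revenue))  # ~0.98: similar direction, similar meaning
print(cosine(profit, banana))   # ~0.22: dissimilar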

13

Why can two documents using different words but expressing the same idea be far apart in TF-IDF space?

TF-IDF does not model semantic similarity between words

14

Which property allows word embeddings to capture that “profit” and “gain” are related?

Co-occurrence patterns learned from large corpora

15

Why are TF-IDF vectors typically high-dimensional and sparse?

Each word corresponds to a feature, and most words do not appear in each document
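
A quick way to see that sparsity, assuming scikit-learn (the documents are invented):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["profit rose this quarter",
        "heavy rain flooded the valley",
        "the senate passed the budget bill"]
X = TfidfVectorizer().fit_transform(docs)

print(X.shape)  # (3 documents, one dimension per vocabulary word)
print(X.nnz, "non-zero entries out of", X.shape[0] * X.shape[1])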

16

What is a major advantage of word embeddings over TF-IDF?

They encode semantic similarity between words

17

Why might TF-IDF perform competitively with word embeddings in some classification tasks?

Discriminative keywords often carry most of the classification signal, and TF-IDF weights exactly those terms

18

Which statement about generalization is correct?

Word embeddings can generalize better to semantically similar words

19

Why are word embeddings often described as dense representations?

Most dimensions contain meaningful values

20

Which statement is correct?

TF-IDF assigns higher weights to rare words

21

What is tokenization in natural language processing?

Splitting text into smaller units such as words or symbols

22

Why is tokenization a necessary first step in most text-processing pipelines?

It allows text to be processed as individual units

23

Which example best illustrates lemmatization?

Mapping "mice" to "mouse", "running" to "run", and "better" to "good" (i.e., all of the above)

24

How does lemmatization differ from simple stemming?

Lemmatization uses linguistic knowledge and vocabulary
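
A minimal NLTK sketch contrasting the two (requires the 'wordnet' data, downloaded once):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')  # run once
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  #> studi  (crude suffix stripping)
print(lemmatizer.lemmatize("studies"))          #> study  (real dictionary form)
print(lemmatizer.lemmatize("better", pos="a"))  #> good   (uses vocabulary and POS)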

25

Why is lemmatization often preferred over stemming in text classification?

It preserves semantic meaning better

26

Example: tokenizing a sentence with NLTK (the sentence is reconstructed from the printed output below)

import nltk
# nltk.download('punkt')  # tokenizer models; download once

sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
print(word_list)

#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

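A hedged continuation of the card above: lemmatizing the same tokens with NLTK's WordNetLemmatizer (default noun POS; the 'wordnet' data must be downloaded once):

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # run once
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in word_list])

#> ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']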

27

What is a potential drawback of lemmatization?

It may require more linguistic resources and computation

28

Why can incorrect tokenization negatively affect lemmatization?

Lemmatization requires correct word boundaries

29

In the context of document classification, why are tokenization and lemmatization useful?

They reduce noise and vocabulary variation

30

Which statement is most accurate?

Tokenization must be performed before lemmatization
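
A minimal end-to-end sketch tying the cards together: tokenize first, lemmatize second, then vectorize (toy documents; the NLTK downloads are assumed done):

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('punkt'); nltk.download('wordnet')  # run once
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())                  # 1) tokenization
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)   # 2) lemmatization

docs = ["The profits were rising", "Losses keep widening"]
X = TfidfVectorizer().fit_transform(preprocess(d) for d in docs)  # 3) TF-IDF
print(X.shape)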