What is the main purpose of lemmatization?
To reduce words to their base or dictionary form
Why are stop words commonly removed in text classification tasks?
They often carry little semantic information
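A minimal sketch of the stop-word step, assuming NLTK's English stop-word list has been downloaded (nltk.download('stopwords')); the token list is a made-up example:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "company", "reported", "a", "strong", "profit", "in", "march"]
content_tokens = [t for t in tokens if t not in stop_words]  # drop low-information words
print(content_tokens)
#> ['company', 'reported', 'strong', 'profit', 'march']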
What does a TF-IDF representation primarily capture?
The importance of words relative to a document and the corpus
TF-IDF
term frequency-inverse document frequency
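A minimal sketch of building a TF-IDF representation with scikit-learn's TfidfVectorizer; the three toy documents are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "profit rose sharply this quarter",
    "profit fell this quarter",
    "the match ended in a draw",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: one row per document, one column per term
terms = vectorizer.get_feature_names_out()

# In the first document, "rose" and "sharply" (rare across the corpus) receive
# higher weights than "profit", "this" and "quarter", which also occur elsewhere.
print(sorted(zip(terms, X.toarray()[0]), key=lambda p: -p[1]))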
In a Decision Tree trained on TF-IDF features, what do feature importances represent?
How strongly a term contributes to classification decisions
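A hedged sketch of reading feature importances from a Decision Tree trained on TF-IDF features; the documents and labels below are placeholders, not the practical-session data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["profit and revenue increased", "net loss widened again",
        "record profit this year", "heavy loss reported"]
labels = [1, 0, 1, 0]                       # 1 = positive report, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# A higher importance means the term contributes more to the tree's splits.
for term, importance in zip(vec.get_feature_names_out(), clf.feature_importances_):
    if importance > 0:
        print(term, round(importance, 3))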
What is the main limitation of TF-IDF compared to word embeddings?
TF-IDF ignores semantic similarity between words
What does a word embedding represent?
A dense vector encoding semantic relationships
In the naive Doc2Vec approach used in the practical session, a document vector is obtained by:
Averaging the embeddings of words in the document
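A minimal sketch of this averaging step; the 3-dimensional vectors below are made up for illustration, whereas in practice they would come from a pretrained embedding model:

import numpy as np

embeddings = {
    "profit":  np.array([0.9, 0.1, 0.0]),
    "revenue": np.array([0.8, 0.2, 0.1]),
    "rose":    np.array([0.1, 0.7, 0.3]),
}

def document_vector(tokens, embeddings, dim=3):
    vectors = [embeddings[t] for t in tokens if t in embeddings]  # skip out-of-vocabulary words
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(document_vector(["profit", "rose"], embeddings))  # mean of the two word vectors: [0.5, 0.4, 0.15]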
Which statement best describes the trade-off between TF-IDF and word-embedding approaches?
TF-IDF models are often more interpretable but less semantic
Which statement best describes the difference between TF-IDF and word embeddings in text classification?
TF-IDF is generally more interpretable, while word embeddings better capture semantic similarity
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF is a statistical representation that measures how important a word is to a document relative to the entire corpus.
Words that appear frequently in a document but rarely across documents receive higher weights.
The representation is sparse and frequency-based.
It does not capture semantic similarity between words (e.g., “profit” and “gain” are treated as unrelated).
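The weighting can be sketched directly with the basic tf × idf formulation (library implementations such as scikit-learn use smoothed, normalized variants):

import math

corpus = [
    ["profit", "rose", "profit", "this", "quarter"],   # "profit" is frequent in this document
    ["the", "match", "ended", "this", "quarter"],
    ["the", "weather", "this", "quarter", "was", "mild"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                    # term frequency within the document
    df = sum(1 for d in corpus if term in d)           # number of documents containing the term
    return tf * math.log(len(corpus) / df)             # tf * idf

print(tf_idf("profit", corpus[0], corpus))  # frequent in the document, rare in the corpus -> high weight
print(tf_idf("this", corpus[0], corpus))    # appears in every document -> idf = log(1) = 0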
Word Embeddings
Word embeddings represent words as dense numerical vectors learned from large corpora, such that semantically similar words have similar vectors.
They capture semantic and contextual relationships between words.
Similar words (e.g., “profit” and “revenue”) are close in vector space.
They are generally less interpretable than TF-IDF features.
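A short sketch of that closeness using pretrained GloVe vectors through gensim's downloader; it assumes gensim is installed and that the one-off download of the "glove-wiki-gigaword-50" model is acceptable:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # pretrained KeyedVectors (downloaded once)
print(vectors.similarity("profit", "revenue"))  # related terms -> high cosine similarity
print(vectors.similarity("profit", "banana"))   # unrelated terms -> much lower similarity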
Why can two documents using different words but expressing the same idea be far apart in TF-IDF space?
TF-IDF does not model semantic similarity between words
Which property allows word embeddings to capture that “profit” and “gain” are related?
Co-occurrence patterns learned from large corpora
Why are TF-IDF vectors typically high-dimensional and sparse?
Each word corresponds to a feature, and most words do not appear in each document
What is a major advantage of word embeddings over TF-IDF?
They encode semantic similarity between words
Why might TF-IDF perform competitively with word embeddings in some classification tasks?
Class-specific keywords are often enough to separate the documents, so richer semantic features add little
Which statement about generalization is correct?
Word embeddings can generalize better to semantically similar words
Why are word embeddings often described as dense representations?
Most dimensions contain meaningful values
Which statement is correct?
TF-IDF assigns higher weights to words that are rare across the corpus
What is tokenization in natural language processing?
Splitting text into smaller units such as words or symbols
Why is tokenization a necessary first step in most text-processing pipelines?
It allows text to be processed as individual units
Which example best illustrates lemmatization?
All of the above
How does lemmatization differ from simple stemming?
Lemmatization uses linguistic knowledge and vocabulary
Why is lemmatization often preferred over stemming in text classification?
It preserves semantic meaning better
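A quick contrast between the two approaches with NLTK, assuming the WordNet data has been fetched (nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), lemmatizer.lemmatize("studies"))  # crude suffix chop vs. dictionary form
#> studi study
print(stemmer.stem("feet"), lemmatizer.lemmatize("feet"))        # only the lemmatizer knows the irregular plural
#> feet foot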
import nltk  # requires the 'punkt' tokenizer data (nltk.download('punkt'))

sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
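Continuing the example, the tokens can then be lemmatized; this assumes the WordNet data is available (nltk.download('wordnet')). Passing pos='v' treats every token as a verb for illustration; without a POS hint the lemmatizer defaults to nouns.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos='v') for w in word_list])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'feet', 'for', 'best']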
What is a potential drawback of lemmatization?
It may require more linguistic resources and computation
Why can incorrect tokenization negatively affect lemmatization?
Lemmatization requires correct word boundaries
In the context of document classification, why are tokenization and lemmatization useful?
They reduce noise and vocabulary variation
Which statement is most accurate?
Tokenization must be performed before lemmatization