What is the main purpose of lemmatization?
To reduce words to their base or dictionary form
Why are stop words commonly removed in text classification tasks?
They often carry little semantic information
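A minimal sketch of the stop-word step, assuming NLTK's English stop-word list has been downloaded (nltk.download('stopwords')); the token list is a made-up example:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "company", "reported", "a", "strong", "profit", "in", "march"]
content_tokens = [t for t in tokens if t not in stop_words]  # drop low-information words
print(content_tokens)
#> ['company', 'reported', 'strong', 'profit', 'march']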
What does a TF-IDF representation primarily capture?
The importance of words relative to a document and the corpus
TF-IDF
term frequency-inverse document frequency
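A minimal sketch of building a TF-IDF representation with scikit-learn's TfidfVectorizer; the three toy documents are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "profit rose sharply this quarter",
    "profit fell this quarter",
    "the match ended in a draw",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: one row per document, one column per term
terms = vectorizer.get_feature_names_out()

# In the first document, "rose" and "sharply" (rare across the corpus) receive
# higher weights than "profit", "this" and "quarter", which also occur elsewhere.
print(sorted(zip(terms, X.toarray()[0]), key=lambda p: -p[1]))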
In a Decision Tree trained on TF-IDF features, what do feature importances represent?
How strongly a term contributes to classification decisions
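A hedged sketch of reading feature importances from a Decision Tree trained on TF-IDF features; the documents and labels below are placeholders, not the practical-session data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["profit and revenue increased", "net loss widened again",
        "record profit this year", "heavy loss reported"]
labels = [1, 0, 1, 0]                       # 1 = positive report, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# A higher importance means the term contributes more to the tree's splits.
for term, importance in zip(vec.get_feature_names_out(), clf.feature_importances_):
    if importance > 0:
        print(term, round(importance, 3))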
What is the main limitation of TF-IDF compared to word embeddings?
TF-IDF ignores semantic similarity between words
What does a word embedding represent?
A dense vector encoding semantic relationships
In the naive Doc2Vec approach used in the practical session, a document vector is obtained by:
Averaging the embeddings of words in the document
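A minimal sketch of this averaging step; the 3-dimensional vectors below are made up for illustration, whereas in practice they would come from a pretrained embedding model:

import numpy as np

embeddings = {
    "profit":  np.array([0.9, 0.1, 0.0]),
    "revenue": np.array([0.8, 0.2, 0.1]),
    "rose":    np.array([0.1, 0.7, 0.3]),
}

def document_vector(tokens, embeddings, dim=3):
    vectors = [embeddings[t] for t in tokens if t in embeddings]  # skip out-of-vocabulary words
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(document_vector(["profit", "rose"], embeddings))  # mean of the two word vectors: [0.5, 0.4, 0.15]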
Which statement best describes the trade-off between TF-IDF and word-embedding approaches?
TF-IDF models are often more interpretable but less semantic
Which statement best describes the difference between TF-IDF and word embeddings in text classification?
TF-IDF is generally more interpretable, while word embeddings better capture semantic similarity
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF is a statistical representation that measures how important a word is to a document relative to the entire corpus.
Words that appear frequently in a document but rarely across documents receive higher weights.
The representation is sparse and frequency-based.
It does not capture semantic similarity between words (e.g., “profit” and “gain” are treated as unrelated).
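The weighting can be sketched directly with the basic tf × idf formulation (library implementations such as scikit-learn use smoothed, normalized variants):

import math

corpus = [
    ["profit", "rose", "profit", "this", "quarter"],   # "profit" is frequent in this document
    ["the", "match", "ended", "this", "quarter"],
    ["the", "weather", "this", "quarter", "was", "mild"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                    # term frequency within the document
    df = sum(1 for d in corpus if term in d)           # number of documents containing the term
    return tf * math.log(len(corpus) / df)             # tf * idf

print(tf_idf("profit", corpus[0], corpus))  # frequent in the document, rare in the corpus -> high weight
print(tf_idf("this", corpus[0], corpus))    # appears in every document -> idf = log(1) = 0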
Word Embeddings
Word embeddings represent words as dense numerical vectors learned from large corpora, such that semantically similar words have similar vectors.
They capture semantic and contextual relationships between words.
Similar words (e.g., “profit” and “revenue”) are close in vector space.
They are generally less interpretable than TF-IDF features.
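A short sketch of that closeness using pretrained GloVe vectors through gensim's downloader; it assumes gensim is installed and that the one-off download of the "glove-wiki-gigaword-50" model is acceptable:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # pretrained KeyedVectors (downloaded once)
print(vectors.similarity("profit", "revenue"))  # related terms -> high cosine similarity
print(vectors.similarity("profit", "banana"))   # unrelated terms -> much lower similarity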
Why can two documents using different words but expressing the same idea be far apart in TF-IDF space?
TF-IDF does not model semantic similarity between words
Which property allows word embeddings to capture that “profit” and “gain” are related?
Co-occurrence patterns learned from large corpora
Why are TF-IDF vectors typically high-dimensional and sparse?
Each word corresponds to a feature, and most words do not appear in each document
What is a major advantage of word embeddings over TF-IDF?
They encode semantic similarity between words
Why might TF-IDF perform competitively with word embeddings in some classification tasks?
Class-specific keywords are often enough to separate the documents, so richer semantic features add little
Which statement about generalization is correct?
Word embeddings can generalize better to semantically similar words
Why are word embeddings often described as dense representations?
Most dimensions contain meaningful values
Which statement is correct?
TF-IDF assigns higher weights to words that are rare across the corpus
What is tokenization in natural language processing?
Splitting text into smaller units such as words or symbols
Why is tokenization a necessary first step in most text-processing pipelines?
It allows text to be processed as individual units
Which example best illustrates lemmatization?
All of the above
How does lemmatization differ from simple stemming?
Lemmatization uses linguistic knowledge and vocabulary
Why is lemmatization often preferred over stemming in text classification?
It preserves semantic meaning better
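A quick contrast between the two approaches with NLTK, assuming the WordNet data has been fetched (nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), lemmatizer.lemmatize("studies"))  # crude suffix chop vs. dictionary form
#> studi study
print(stemmer.stem("feet"), lemmatizer.lemmatize("feet"))        # only the lemmatizer knows the irregular plural
#> feet foot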
import nltk  # requires the 'punkt' tokenizer data (nltk.download('punkt'))

sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
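Continuing the example, the tokens can then be lemmatized; this assumes the WordNet data is available (nltk.download('wordnet')). Passing pos='v' treats every token as a verb for illustration; without a POS hint the lemmatizer defaults to nouns.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos='v') for w in word_list])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'feet', 'for', 'best']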
What is a potential drawback of lemmatization?
It may require more linguistic resources and computation
Why can incorrect tokenization negatively affect lemmatization?
Lemmatization requires correct word boundaries
In the context of document classification, why are tokenization and lemmatization useful?
They reduce noise and vocabulary variation
Which statement is most accurate?
Tokenization must be performed before lemmatization