What is Natural Language Processing?
The study of analyzing and modeling text data.
Why is text data challenging?
It is unstructured and requires significant preprocessing.
What is unstructured data?
Data that is not organized in a predefined format.
What is alternative data?
Non-traditional data sources used for insights, such as text from social media or reports.
Why is NLP important?
It is widely used in industry for extracting insights from text.
What are common NLP applications?
Sentiment analysis, translation, chatbots, speech recognition, and information organization.
What is sentiment analysis?
The process of determining whether text is positive, negative, or neutral.
Why is sentiment analysis useful?
It can extract opinions and predict behavior from text data.
What is preprocessing in NLP?
The process of cleaning and transforming raw text into usable features.
What is tokenization?
Breaking text into individual words or tokens.
Why remove punctuation and numbers?
They usually add little meaningful information.
Why convert text to lowercase?
To treat words consistently regardless of capitalization.
What are stop words?
Common words that carry little meaning, such as “the” or “and”.
Why remove stop words?
To reduce noise and dimensionality.
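The preprocessing steps above (lowercasing, removing punctuation and numbers, tokenization, stop-word removal) can be sketched in plain Python; the stop-word list here is a tiny illustrative sample, not a standard one:

```python
import re

# Tiny illustrative stop-word list (real lists are much longer)
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation/numbers, tokenize, drop stop words."""
    text = text.lower()                   # consistent casing
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat sat on the mat, and the dog barked 3 times!"))
# → ['cat', 'sat', 'on', 'mat', 'dog', 'barked', 'times']
```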
What is stemming?
Reducing words to a root form by removing endings.
What is a limitation of stemming?
The resulting root may not be a valid word.
What is lemmatization?
Reducing words to their dictionary form (lemma) using vocabulary and grammatical rules.
How does lemmatization differ from stemming?
It is more accurate but computationally slower.
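A toy suffix-stripper shows why a stem may not be a valid word, while a lemmatizer consults a vocabulary; the suffix rules and the tiny lemma lookup below are illustrative assumptions, not a real stemmer or dictionary:

```python
def crude_stem(word):
    """Toy suffix-stripping stemmer (real stemmers apply many ordered rules)."""
    if word.endswith("ies"):
        return word[:-3] + "i"   # "studies" -> "studi"
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a vocabulary; a tiny hand-made lookup stands in here
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

print(crude_stem("studies"))             # → "studi" (not a valid word)
print(LEMMAS.get("studies", "studies"))  # → "study" (a real dictionary form)
```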
Why remove rare words?
They are difficult for models to learn from.
Why remove very common words?
They provide little discriminatory information.
What is an n-gram?
A sequence of n consecutive words.
What is a unigram?
A single word.
What is a bigram?
A pair of consecutive words.
What is a trigram?
A sequence of three consecutive words.
Why use n-grams?
To capture context and word order.
What is a drawback of n-grams?
They increase feature dimensionality.
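Generating unigrams, bigrams, and trigrams from a token list is a one-liner with slicing; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["natural", "language", "processing", "is", "fun"]
print(ngrams(words, 1))  # unigrams: single words
print(ngrams(words, 2))  # bigrams: [('natural', 'language'), ('language', 'processing'), ...]
print(ngrams(words, 3))  # trigrams: three consecutive words
```

Note how the number of distinct n-grams grows with n, which is the dimensionality drawback mentioned above.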
What is a document?
A single piece of text.
What is a corpus?
A collection of documents.
What is a vocabulary?
The set of all unique words in the corpus.
What is vectorization?
Converting text into numerical representations.
What is the bag-of-words approach?
A method that represents text using word counts without considering order.
What is a document-term matrix?
A matrix where rows are documents and columns are words.
Why is the document-term matrix sparse?
Most words do not appear in most documents.
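Building a bag-of-words document-term matrix from a toy three-document corpus makes the sparsity visible; the corpus below is made up for illustration:

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat saw the dog"]
docs = [doc.split() for doc in corpus]

# Vocabulary: all unique words across the corpus, in sorted order
vocab = sorted({w for doc in docs for w in doc})

# Document-term matrix: one row per document, one column per vocabulary word
dtm = [[Counter(doc)[w] for w in vocab] for doc in docs]

print(vocab)          # ['cat', 'dog', 'sat', 'saw', 'the']
for row in dtm:
    print(row)        # most entries are 0 even in this tiny example
```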
What is term count?
The number of times a word appears in a document.
What is term frequency?
The normalized count of a word in a document.
Why normalize term frequency?
To account for differences in document length.
What problem arises with common words in search?
They do not help distinguish documents.
What is inverse document frequency?
A measure that downweights words appearing in many documents.
What is TF-IDF?
A weighting scheme combining term frequency and inverse document frequency.
When is TF-IDF high?
When a word is frequent in a document but rare across documents.
Why is TF-IDF useful?
It emphasizes important and distinctive words.
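The TF and IDF definitions above can be computed by hand (this uses the plain log formulation; many libraries add smoothing variants):

```python
import math

def tf(word, doc):
    """Term frequency: count normalized by document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: downweights words found in many documents."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# "the" appears in every document, so its IDF (and TF-IDF) is zero
print(tf("the", docs[0]) * idf("the", docs))  # → 0.0
# "dog" is frequent in doc 1 but rare across the corpus, so TF-IDF is high
print(tf("dog", docs[1]) * idf("dog", docs))
```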
What is a dictionary-based method?
A sentiment approach using predefined word lists.
What is a limitation of dictionary methods?
They do not learn from data and can be inaccurate.
Why are domain-specific dictionaries needed?
General dictionaries may misclassify domain-specific terms.
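A dictionary-based sentiment scorer is just word-list lookups; the positive/negative lists below are tiny illustrative samples, which is exactly why real (and domain-specific) dictionaries matter:

```python
# Tiny illustrative word lists; real dictionaries are far larger
POSITIVE = {"good", "great", "excellent", "gain"}
NEGATIVE = {"bad", "poor", "terrible", "loss"}

def dictionary_sentiment(tokens):
    """Score = (# positive words) - (# negative words); the sign gives the label."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(dictionary_sentiment("the product is great but the support is poor and terrible".split()))
# → "negative" (1 positive word, 2 negative words)
```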
What are machine learning methods in NLP?
Approaches that use labeled data to learn patterns.
What are common feature representations in ML NLP?
Boolean presence, counts, frequencies, and TF-IDF.
Why is feature standardization difficult in NLP?
Text features are sparse and high-dimensional, and centering them would destroy the sparsity.
What is Naïve Bayes?
A probabilistic classifier based on conditional independence assumptions.
What is the key assumption of Naïve Bayes?
Features are independent given the class.
Why is Naïve Bayes useful in NLP?
It performs well with high-dimensional sparse data.
What is the zero-probability problem?
When a word never appears in training data for a class, leading to zero probability.
Why is zero probability problematic?
It can eliminate an entire class prediction.
What is Laplace smoothing?
Adding a small value to counts to avoid zero probabilities.
Why is Laplace smoothing important?
It stabilizes probability estimates and prevents extreme outcomes.
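A minimal Naïve Bayes text classifier with Laplace (add-alpha) smoothing ties the last few cards together; the four training examples are made up for illustration:

```python
import math
from collections import Counter

# Toy labeled training data: (tokens, class)
train = [
    (["good", "great", "fun"], "pos"),
    (["great", "happy"], "pos"),
    (["bad", "boring"], "neg"),
    (["bad", "awful", "boring"], "neg"),
]

vocab = {w for tokens, _ in train for w in tokens}
classes = ["pos", "neg"]
counts = {c: Counter() for c in classes}
priors = {c: 0 for c in classes}
for tokens, c in train:
    counts[c].update(tokens)
    priors[c] += 1

def predict(tokens, alpha=1.0):
    """Pick the class with the highest smoothed log-probability."""
    best_class, best_logp = None, -math.inf
    for c in classes:
        total = sum(counts[c].values())
        logp = math.log(priors[c] / len(train))
        for w in tokens:
            # Laplace smoothing: alpha keeps unseen words from zeroing the product
            logp += math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

print(predict(["great", "fun"]))     # → "pos"
print(predict(["awful", "unseen"]))  # → "neg"; smoothing handles the unseen word
```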
What is the role of training data in NLP models?
It is used to learn patterns between text and outcomes.
What is the role of validation data?
It helps tune model parameters.
What is the role of test data?
It evaluates final model performance.
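A common way to produce the three sets is a shuffled split; the 70/15/15 fractions below are a conventional assumption, not a rule from the cards:

```python
import random

def split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and partition data into train / validation / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train_set, val_set, test_set = split(range(100))
print(len(train_set), len(val_set), len(test_set))  # → 70 15 15
```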