What is Natural Language Processing?
The study of analyzing and modeling text data.
Why is text data challenging?
It is unstructured and requires significant preprocessing.
What is unstructured data?
Data that is not organized in a predefined format.
What is alternative data?
Non-traditional data sources used for insights, such as text from social media or reports.
Why is NLP important?
It is widely used in industry for extracting insights from text.
What are common NLP applications?
Sentiment analysis, translation, chatbots, speech recognition, and information organization.
What is sentiment analysis?
The process of determining whether text is positive, negative, or neutral.
Why is sentiment analysis useful?
It can extract opinions and predict behavior from text data.
What is preprocessing in NLP?
The process of cleaning and transforming raw text into usable features.
What is tokenization?
Breaking text into individual words or tokens.
Why remove punctuation and numbers?
They usually add little meaningful information.
Why convert text to lowercase?
To treat words consistently regardless of capitalization.
What are stop words?
Common words that carry little meaning, such as “the” or “and”.
Why remove stop words?
To reduce noise and dimensionality.
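The preprocessing steps above (lowercasing, removing punctuation and numbers, tokenization, stop-word removal) can be sketched in plain Python; the stop-word list here is a tiny illustrative sample, not a standard one:

```python
import re

# Tiny illustrative stop-word list (real lists are much longer)
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation/numbers, tokenize, drop stop words."""
    text = text.lower()                   # consistent casing
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat sat on the mat, and the dog barked 3 times!"))
# → ['cat', 'sat', 'on', 'mat', 'dog', 'barked', 'times']
```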
What is stemming?
Reducing words to a root form by removing endings.
What is a limitation of stemming?
The resulting root may not be a valid word.
What is lemmatization?
Reducing words to their dictionary form (lemma) using vocabulary and grammatical rules.
How does lemmatization differ from stemming?
It is more accurate but computationally slower.
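A toy suffix-stripper shows why a stem may not be a valid word, while a lemmatizer consults a vocabulary; the suffix rules and the tiny lemma lookup below are illustrative assumptions, not a real stemmer or dictionary:

```python
def crude_stem(word):
    """Toy suffix-stripping stemmer (real stemmers apply many ordered rules)."""
    if word.endswith("ies"):
        return word[:-3] + "i"   # "studies" -> "studi"
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a vocabulary; a tiny hand-made lookup stands in here
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

print(crude_stem("studies"))             # → "studi" (not a valid word)
print(LEMMAS.get("studies", "studies"))  # → "study" (a real dictionary form)
```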
Why remove rare words?
They are difficult for models to learn from.
Why remove very common words?
They provide little discriminatory information.
What is an n-gram?
A sequence of n consecutive words.
What is a unigram?
A single word.
What is a bigram?
A pair of consecutive words.
What is a trigram?
A sequence of three consecutive words.
Why use n-grams?
To capture context and word order.
What is a drawback of n-grams?
They increase feature dimensionality.
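Generating unigrams, bigrams, and trigrams from a token list is a one-liner with slicing; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["natural", "language", "processing", "is", "fun"]
print(ngrams(words, 1))  # unigrams: single words
print(ngrams(words, 2))  # bigrams: [('natural', 'language'), ('language', 'processing'), ...]
print(ngrams(words, 3))  # trigrams: three consecutive words
```

Note how the number of distinct n-grams grows with n, which is the dimensionality drawback mentioned above.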
What is a document?
A single piece of text.
What is a corpus?
A collection of documents.
What is a vocabulary?
The set of all unique words in the corpus.
What is vectorization?
Converting text into numerical representations.
What is the bag-of-words approach?
A method that represents text using word counts without considering order.
What is a document-term matrix?
A matrix where rows are documents and columns are words.
Why is the document-term matrix sparse?
Most words do not appear in most documents.
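Building a bag-of-words document-term matrix from a toy three-document corpus makes the sparsity visible; the corpus below is made up for illustration:

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat saw the dog"]
docs = [doc.split() for doc in corpus]

# Vocabulary: all unique words across the corpus, in sorted order
vocab = sorted({w for doc in docs for w in doc})

# Document-term matrix: one row per document, one column per vocabulary word
dtm = [[Counter(doc)[w] for w in vocab] for doc in docs]

print(vocab)          # ['cat', 'dog', 'sat', 'saw', 'the']
for row in dtm:
    print(row)        # most entries are 0 even in this tiny example
```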
What is term count?
The number of times a word appears in a document.
What is term frequency?
The normalized count of a word in a document.
Why normalize term frequency?
To account for differences in document length.
What problem arises with common words in search?
They do not help distinguish documents.
What is inverse document frequency?
A measure that downweights words appearing in many documents.
What is TF-IDF?
A weighting scheme combining term frequency and inverse document frequency.
When is TF-IDF high?
When a word is frequent in a document but rare across documents.
Why is TF-IDF useful?
It emphasizes important and distinctive words.
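The TF and IDF definitions above can be computed by hand (this uses the plain log formulation; many libraries add smoothing variants):

```python
import math

def tf(word, doc):
    """Term frequency: count normalized by document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: downweights words found in many documents."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# "the" appears in every document, so its IDF (and TF-IDF) is zero
print(tf("the", docs[0]) * idf("the", docs))  # → 0.0
# "dog" is frequent in doc 1 but rare across the corpus, so TF-IDF is high
print(tf("dog", docs[1]) * idf("dog", docs))
```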
What is a dictionary-based method?
A sentiment approach using predefined word lists.
What is a limitation of dictionary methods?
They do not learn from data and can be inaccurate.
Why are domain-specific dictionaries needed?
General dictionaries may misclassify domain-specific terms.
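A dictionary-based sentiment scorer is just word-list lookups; the positive/negative lists below are tiny illustrative samples, which is exactly why real (and domain-specific) dictionaries matter:

```python
# Tiny illustrative word lists; real dictionaries are far larger
POSITIVE = {"good", "great", "excellent", "gain"}
NEGATIVE = {"bad", "poor", "terrible", "loss"}

def dictionary_sentiment(tokens):
    """Score = (# positive words) - (# negative words); the sign gives the label."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(dictionary_sentiment("the product is great but the support is poor and terrible".split()))
# → "negative" (1 positive word, 2 negative words)
```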
What are machine learning methods in NLP?
Approaches that use labeled data to learn patterns.
What are common feature representations in ML NLP?
Boolean presence, counts, frequencies, and TF-IDF.
Why is feature standardization difficult in NLP?
Text features are sparse and high-dimensional, and centering them would destroy the sparsity.
What is Naïve Bayes?
A probabilistic classifier based on conditional independence assumptions.
What is the key assumption of Naïve Bayes?
Features are independent given the class.
Why is Naïve Bayes useful in NLP?
It performs well with high-dimensional sparse data.
What is the zero-probability problem?
When a word never appears in training data for a class, leading to zero probability.
Why is zero probability problematic?
It can eliminate an entire class prediction.
What is Laplace smoothing?
Adding a small value to counts to avoid zero probabilities.
Why is Laplace smoothing important?
It stabilizes probability estimates and prevents extreme outcomes.
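A minimal Naïve Bayes text classifier with Laplace (add-alpha) smoothing ties the last few cards together; the four training examples are made up for illustration:

```python
import math
from collections import Counter

# Toy labeled training data: (tokens, class)
train = [
    (["good", "great", "fun"], "pos"),
    (["great", "happy"], "pos"),
    (["bad", "boring"], "neg"),
    (["bad", "awful", "boring"], "neg"),
]

vocab = {w for tokens, _ in train for w in tokens}
classes = ["pos", "neg"]
counts = {c: Counter() for c in classes}
priors = {c: 0 for c in classes}
for tokens, c in train:
    counts[c].update(tokens)
    priors[c] += 1

def predict(tokens, alpha=1.0):
    """Pick the class with the highest smoothed log-probability."""
    best_class, best_logp = None, -math.inf
    for c in classes:
        total = sum(counts[c].values())
        logp = math.log(priors[c] / len(train))
        for w in tokens:
            # Laplace smoothing: alpha keeps unseen words from zeroing the product
            logp += math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

print(predict(["great", "fun"]))     # → "pos"
print(predict(["awful", "unseen"]))  # → "neg"; smoothing handles the unseen word
```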
What is the role of training data in NLP models?
It is used to learn patterns between text and outcomes.
What is the role of validation data?
It helps tune model parameters.
What is the role of test data?
It evaluates final model performance.
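A common way to produce the three sets is a shuffled split; the 70/15/15 fractions below are a conventional assumption, not a rule from the cards:

```python
import random

def split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and partition data into train / validation / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train_set, val_set, test_set = split(range(100))
print(len(train_set), len(val_set), len(test_set))  # → 70 15 15
```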