Lecture Notes on Natural Language Processing (NLP) Fundamentals
Definition: NLP, sometimes called computational text analysis, involves applying computational techniques to analyze and model natural-language text.
Applications:
Similarity Comparison: Assessing the similarity between texts (e.g., plagiarism detection).
Sentiment Analysis: Gauging sentiment in text, often used on social media.
Topic Extraction: Identifying trending topics or summarizing text.
Text Summarization: Condensing large amounts of text.
Relationship Extraction: Identifying relationships between entities in text.
Question Answering and Language Generation: Foundations for technologies like ChatGPT (though ChatGPT itself won't be covered).
Example: Spam Detection
Technique: Naive Bayes method.
Process:
Splitting Data: Divide emails into training and test sets.
Labeling: Emails are labeled as spam or not spam (supervised learning).
Probability Calculation: Estimate, from counts, the probability of each word appearing in spam and in non-spam messages.
Composition: Multiply the individual word probabilities (assuming words occur independently, hence "naive") to decide whether a message is likely spam.
Simplifications:
Word Order: The sequence of words is ignored.
Effectiveness: Despite simplifications, works reasonably well.
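The word-probability composition above can be sketched as a toy Naive Bayes classifier. The training messages below are invented for illustration, and Laplace smoothing is added so unseen words don't zero out the product:

```python
import math
from collections import Counter

# Toy labeled training data (invented examples for illustration).
spam = ["win money now", "free money offer", "claim your free prize"]
ham = ["meeting at noon", "see you at the office", "lunch meeting today"]

def word_counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(message, counts, class_docs, total_docs):
    # Prior: fraction of training messages in this class.
    logp = math.log(class_docs / total_docs)
    n = sum(counts.values())
    for w in message.split():
        # Laplace smoothing: unseen words get a small nonzero probability.
        logp += math.log((counts[w] + 1) / (n + len(vocab)))
    return logp

def is_spam(message):
    total = len(spam) + len(ham)
    return log_prob(message, spam_counts, len(spam), total) > \
           log_prob(message, ham_counts, len(ham), total)
```

Working in log space replaces the multiplication of many small probabilities with a sum, which avoids numerical underflow on long messages.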
Text to Numbers: NLP algorithms operate on numbers (vectors), so text must first be transformed into a numerical representation.
Transforming Words into Numbers
Process: Assign a number to each word in the vocabulary to represent sentences as vectors of counts.
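A minimal sketch of this numbering step, using a made-up pair of sentences:

```python
# Build a vocabulary index, then represent each sentence as a count vector.
sentences = ["the cat sat on the mat", "the dog sat"]

vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}  # word -> column number

def to_vector(sentence):
    vec = [0] * len(vocab)
    for w in sentence.split():
        vec[index[w]] += 1
    return vec

vectors = [to_vector(s) for s in sentences]
```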
Example: Heavy Metal Lyrics Analysis
Question: Can heavy metal lyrics be distinguished from other genres, and which band is most representative?
Data Collection: Gather lyrics data.
Importance of Data: Obtain the data early and verify it can actually be processed before building the analysis.
Steps:
Break lyrics into lines.
Break lines into words.
Count word frequencies.
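The three steps above can be sketched as follows (the lyrics string is a made-up placeholder for real corpus data):

```python
from collections import Counter

# Hypothetical lyrics; a real analysis would load a lyrics corpus here.
lyrics = """We ride the storm tonight
We ride until the dawn"""

lines = lyrics.splitlines()                                   # step 1: lines
words = [w.lower() for line in lines for w in line.split()]   # step 2: words
freqs = Counter(words)                                        # step 3: counts
```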
Handling Word Variations
Challenge: Different forms of a word (e.g., speak, spoke, speaking) should be grouped together.
Techniques:
Tokenization: Breaking sentences into individual words (tokens).
Stemming: Reducing words to their root form.
Lemmatization: Converting words to their lemma (dictionary form).
Stemming Details:
Process: Removing unnecessary parts to get the root.
Example: "changing" becomes "chang".
Issue: The root may not be a valid word.
Lemmatization Details:
Process: Converting words to their dictionary form (lemma).
Example: "better" becomes "good".
Advantage: More meaningful and dictionary-valid.
Disadvantage: More complex and slower than stemming.
Choice Between Stemming and Lemmatization:
Dependency: Depends on the specific problem and data set.
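To illustrate the difference, here is a deliberately naive suffix-stripping stemmer alongside a tiny hand-written lemma table; a real project would use a library such as NLTK (PorterStemmer, WordNetLemmatizer) instead:

```python
# Naive stemmer: strip a few common suffixes. Note the result need not be
# a real word ("changing" -> "chang"), and irregular forms ("spoke") are missed.
def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs dictionary knowledge; this lookup table is a stand-in
# for what a real lemmatizer derives from a dictionary like WordNet.
LEMMAS = {"better": "good", "spoke": "speak", "speaking": "speak"}

def lemmatize(word):
    return LEMMAS.get(word, word)
```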
Language-Specific Challenges
English-Centric Methods: Most NLP techniques are developed for English.
Tokenization Issues:
Vietnamese Example: Vietnamese writes each syllable separately, so splitting on spaces breaks multi-syllable words apart and loses meaning.
Phrasal Verbs: Splitting a phrasal verb such as "set up" into separate tokens loses its combined meaning.
Key Message: Whitespace is not always a reliable word boundary.
Other Challenges: Languages without spaces require different approaches to separate words.
Agglutination: Agglutinative languages pack into a single word what other languages express with several, so the same idea can have very different word counts across languages.
Applying NLP to Metal Lyrics
Goal: Determine if metal lyrics are different and identify the most representative band.
Steps:
Tokenization.
Stemming or lemmatization.
Counting word frequencies.
Corpus: The dataset of text.
Bag of Words: An approach where word order is ignored, and only counts are considered.
Vector Representation: Each lyric becomes a vector of word frequencies.
Sparse Matrix: The resulting matrix is sparse due to many words not appearing in each lyric.
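Both representations can be sketched with a few invented one-line "songs"; storing only the non-zero entries (as libraries like scipy.sparse do) is what makes large vocabularies tractable:

```python
from collections import Counter

songs = ["fire and steel", "steel against steel", "rain and sun"]

vocab = sorted({w for s in songs for w in s.split()})

# Dense rows: one count per vocabulary word (mostly zeros in real corpora).
dense = [[Counter(s.split())[w] for w in vocab] for s in songs]

# Sparse rows: keep only the non-zero counts.
sparse = [dict(Counter(s.split())) for s in songs]
```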
Zipf's Law
Word Frequency Distribution: Word frequencies follow a power law: a word's frequency is roughly inversely proportional to its frequency rank.
Anomalies: Deviations from this distribution can indicate non-human generation.
Characteristics: This yields a long-tailed, scale-free distribution.
Frequency Sorting: Sorting words by frequency reveals articles and prepositions as most common.
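Frequency sorting can be checked on a made-up snippet of text; on a real corpus the ranked frequencies fall off roughly as 1/rank:

```python
from collections import Counter

# Invented snippet; any real corpus shows the same pattern more strongly.
text = ("the storm came in the night and the wind tore at the walls "
        "of the old house and the rain fell on the roof")

counts = Counter(text.split())

# Sorting by frequency puts articles and conjunctions ("the", "and") on top.
ranked = counts.most_common()
```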
Vector Representation
Matrix Structure: Each row represents a song and each column a word.
Values: Numbers represent word counts or presence.
Sparsity: Matrix is sparse due to the nature of text representation.
Applying Techniques
PCA and Clustering: Techniques can be applied following the conversion to vectors.
Comparative Analysis
Metal vs. Non-Metal: Compare word distributions to identify differences.
Frequency Ratios: Divide word frequencies in metal lyrics by those in other genres.
Diagonal Plot:
Representation: Each dot represents a word.
Diagonal Significance: Words on the diagonal occur equally often in both corpora and are therefore uninformative.
Above Diagonal: More frequent in metal.
Below Diagonal: More frequent in other genres.
Informative Words: Both frequent and infrequent words can be informative.
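The frequency-ratio idea can be sketched with invented relative frequencies; words whose ratio is well above 1 would sit above the diagonal:

```python
# Invented relative frequencies, for illustration only.
metal = {"darkness": 0.004, "fire": 0.003, "love": 0.001, "the": 0.060}
other = {"darkness": 0.0005, "fire": 0.001, "love": 0.004, "the": 0.058}

ratios = {w: metal[w] / other[w] for w in metal}

# ratio >> 1: characteristic of metal (above the diagonal);
# ratio << 1: characteristic of other genres (below the diagonal);
# ratio ~ 1: words like "the" sit on the diagonal and carry no signal.
metal_words = [w for w, r in ratios.items() if r > 1.5]
```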
Term Frequency-Inverse Document Frequency (TF-IDF)
Definition: A word is important if it is frequent in a document but not frequent everywhere.
TF (Term Frequency): How often a word occurs within a single document.
IDF (Inverse Document Frequency): How rare a word is across all documents in the corpus.
Calculation: Multiply a word's term frequency by its inverse document frequency.
Refinement: For the band question, switch the denominator to compare one metal band's lyrics against all metal lyrics.
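A minimal TF-IDF implementation over a toy corpus (the documents below are invented), following the definitions above:

```python
import math
from collections import Counter

docs = [
    "fire in the night",
    "fire and steel",
    "love in the rain",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(word, doc):
    # Term frequency: occurrences of the word within one document.
    return Counter(doc)[word] / len(doc)

def idf(word):
    # Inverse document frequency; the word is assumed to occur
    # in at least one document.
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(N / df)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)
```

A word that appears everywhere (like "the" here) gets a low IDF, so its TF-IDF score is small even when its raw frequency is high.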
Limitations
Context: This approach can’t handle different meanings of the same word in different contexts.
Sequence: Word order is not taken into account.
Additional Resources
Podcast: "The Meanings of Your Words," which discusses the meanings and etymology of different words and uses NLP behind the scenes.