NLP Notes
GPU Usage Tips
- To conserve resources, don't keep a GPU Colab runtime open when you are not actively using the GPU.
- Start with a tiny dataset when building code, potentially using the CPU initially.
- HuggingFace transformers can be complex, so refer to HuggingFace tutorials for examples, as API documentation lacks them.
- Restart the notebook after installing packages and place installation code at the top.
- Disable GPUs for better error messages by running the following at the very start of the notebook (before importing any deep learning framework) and restarting:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
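- A quick sanity check (a sketch assuming PyTorch is the framework in use; other frameworks have an equivalent):
import torch  # import only after CUDA_VISIBLE_DEVICES has been set above
print(torch.cuda.is_available())  # expected: False once the GPU is hidden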
- Alternative GPU options if Colab time runs out:
- Personal machine's GPU.
- Kaggle.
- Avoid paying for Colab Pro if possible.
Part-of-Speech Tagging & Parsing
- Introduction to Natural Language Processing (NLP).
- Sequential labeling: Parts-of-Speech (POS), Named Entity Recognition.
- Classification techniques: Hidden Markov Model (HMM), Viterbi Algorithm.
- Tree Structures: Constituency and Dependency Syntax Trees.
- Parsing techniques: Transition-Based Parsing.
Understanding Text
- Previously covered:
- Bag-of-words representations (e.g., {john, drank, some, wine, on, the, new, sofa, it, was, red}) with weighting schemes like TF-IDF.
- N-gram language models (a counting sketch follows this list):
P("John drank some wine on the new sofa. It was red.")
P("red" | "John drank some wine on the new sofa. It was") ≈ P("red" | "It was")
- Limitations of previous approaches: They lack true understanding of the text.
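- The n-gram approximation above can be estimated by counting; a minimal sketch (the toy corpus is invented purely for illustration):
from collections import Counter, defaultdict
# toy corpus; in practice this would be a much larger text collection
tokens = "john drank some wine . it was red . it was good .".split()
# count (previous two words -> next word), i.e. a trigram language model
counts = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    counts[(w1, w2)][w3] += 1
def trigram_prob(word, context):
    # maximum-likelihood estimate: count(context + word) / count(context)
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0
print(trigram_prob("red", ("it", "was")))  # ~ P("red" | "it was"); 0.5 on this toy corpus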
Examples of Textual Understanding
- Identifying self-contained phrases (Noun Phrases).
- Recognizing reference (anaphora) and its potential ambiguity.
- Subject and Direct Object identification.
- Understanding the modification provided by prepositional phrases.
- Grasping implications through pragmatics (e.g., John spilled the wine on the new sofa).
Relationships in Text
- Modeling relationships between words/phrases is crucial for understanding text beyond sequential words.
- Example: "I saw an elephant with a telescope. It was enormous!"
- "an" and "elephant" form a Noun Phrase.
NLP Components
- Syntax: Identifying grammatical roles of words to form valid sentences.
- Example: "Bank" as a direct object in "Rachel ran to the bank."
- Semantics: Understanding the meanings of words/phrases and their relationships.
- Example: "Bank" referring to the riverbank; Rachel performing the running action.
- Pragmatics: Understanding how communicative context affects meaning.
- Example: "John drank some wine on the new sofa. It was red." and its implications.
Context Dependence
- Context and background knowledge affect meaning.
- "Rachel ran to the bank" vs. "Rachel swam to the bank".
- "John drank some wine at the table. It was red." vs. "John drank some wine at the table. It was wobbly."
Complexity of NLP
- Real newspaper headlines demonstrate the complexity:
- "Boy paralyzed after tumor fights back to gain black belt"
- "Scientists study whales from space"
- "Juvenile Court to Try Shooting Defendant"
- Garden Path Sentences: grammatically correct sentences that lead the reader to an incorrect initial parse, so they seem wrong until re-read.
- "The old man the boat."
- "The complex houses married and single soldiers and their families."
- "The horse raced past the barn fell."
Typical NLP Pipeline
- Tokenization and lemmatization.
- Sentence boundary detection: Most NLP operates on individual sentences.
- Part-of-speech tagging: Identifying nouns, verbs, adjectives, etc.
- Parsing (dependency): Diagramming sentences.
- Named Entity Recognition: Detecting and classifying entities.
- Coreference resolution: Resolving pronouns to named entities.
- Example:
- Tokens: A tall boy ran home .
- POS tags: DT JJ NN VBD NN .
- Coreference: "Mr. Mckintosh went shopping . He ..." ("He" resolves to "Mr. Mckintosh").
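- A sketch of such a pipeline using spaCy (an assumption; the course may use a different toolkit), assuming the small English model en_core_web_sm is installed:
import spacy
nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, parser and NER in one pipeline
doc = nlp("Mr. Mckintosh went shopping in Glasgow. He bought some wine.")
for token in doc:
    # surface form, lemma, universal POS tag, dependency label and head word
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # named entities and their types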
NLP Approaches
- Statistical methods: Identifying patterns through supervised machine learning.
- Learning from human interactions for training data.
- Rule-based approaches: Using linguistic and domain knowledge.
- Example: tokenization by splitting on runs of non-word characters, e.g. re.split(r'\W+', raw).
- Challenges: manual maintenance and updates are difficult.
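- A quick illustration of that rule-based tokenizer and its brittleness (note how the contraction is split), as a sketch:
import re
raw = "John didn't drink the wine; it was red."
print(re.split(r'\W+', raw))    # ['John', 'didn', 't', 'drink', 'the', 'wine', 'it', 'was', 'red', '']
print(re.findall(r'\w+', raw))  # same tokens without the trailing empty string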
Sequential NLP Structures
- NLP information as a sequence of labels.
- Examples include Part-of-Speech Tagging, Named Entity Recognition, and Word/Structure Segmentation.
- Units (token, character, sentence) are assigned labels that depend on surrounding labels.
Parts-of-Speech (POS) Tagging
- Basic syntactic analysis assigning tags to words based on grammatical role.
- Example:
- John drank some wine on the new sofa . It was red .
- PROPN VERB DET NOUN ADP DET ADJ NOUN PUNCT PRON AUX ADJ PUNCT
- Uses a set of "Universal" POS tags.
Examples of Universal POS tags
- ADJ (adjective): big, old, green
- ADP (adposition/preposition): in, to, during
- ADV (adverb): very, well, exactly
- AUX (auxiliary verb): is, has, must
- CCONJ (coordinating conjunction): and, or, but
- DET (determiner): a, the, some
- INTJ (interjection): psst, ouch, hello
- NOUN (noun): girl, cat, tree
- NUM (numeral): five, 70, IV
- PART (particle): 's, not
- PRON (pronoun): he, her, myself, everybody
- PROPN (proper noun): Mary, Glasgow, HBO
- PUNCT (punctuation): ., ( )
- SCONJ (subordinating conjunction): that, if, while
- SYM (symbol): $, ♥‿♥, 1-800-COMPANY
- VERB (verb): drank, running, sits
POS Tagging Accuracy
- State-of-the-art techniques achieve >98% accuracy on news text.
- Average POS tagging disagreement among expert human judges for the Penn Treebank was 3.5%.
- POS tagging in low-resource languages is harder (~87%).
- POS tagging for Tweets is harder (state-of-the-art ~88%).
Other Sequence Tagging Problems
- Word Segmentation
- Structure Segmentation
Sequence Labeling Approaches
- Rule-based: Error-prone; building & maintaining rules is challenging.
- Supervised Learning: Most common/effective approach; learns from labeled treebanks.
- Hidden Markov Models (HMM) are covered.
- More recent approaches exist based on neural networks, but many of the same ideas apply.
Naive Bayes for POS Tagging
- Problem: Ignores sequential information.
- Words will always be assigned the same POS tag, regardless of context.
- Example: "That was a bad call" vs. "Please call your mum after school."
Hidden Markov Model (HMM)
- Generalization of Naïve Bayes to sequences.
- Assumes an underlying set of hidden (unobserved) states (labels), y, in which the model can be.
- Assumes probabilistic transitions between states over time as sequence is generated.
- Assumes a probabilistic generation of tokens from states (e.g. words generated for each POS).
- Markov Assumption: Current state depends only on the previous state (may vary, e.g., 2nd order = previous 2 states).
HMM components
- x_i: observed nodes (i.e., words).
- y_i: hidden nodes (i.e., POS tags); the goal is to predict these.
- Transitions: P(y_i | y_{i-1}) (horizontal arrows).
- Emissions: P(x_i | y_i) (vertical arrows).
HMM Tasks
- Training: learn P(y_i | y_{i-1}) and P(x_i | y_i).
- Inference (test): given x_i, infer y_i or P(y_i).
POS Tagging with HMMs
- Hidden States: POS tags.
- Observed States: Words in the sequence.
- Goal: Predict the POS tags.
Example
- James ate the tasty food
- PROPN VERB DET ADJ NOUN
- Calculations (building up the joint probability left to right, with START as the initial state):
- P(PROPN | START)
- P("James" | PROPN) * P(PROPN | START)
- P("James" | PROPN) * P(PROPN | START) * P("ate" | VERB) * P(VERB | PROPN)
- P("James" | PROPN) * P(PROPN | START) * P("ate" | VERB) * P(VERB | PROPN) * P("the" | DET) * P(DET | VERB) * P("tasty" | ADJ) * P(ADJ | DET) * P("food" | NOUN) * P(NOUN | ADJ)
HMM Training
- Count and normalize over the training data to learn transition and emission probabilities.
- Transition probabilities (for a toy model with hidden states A and B): P(A | START), P(B | START), P(A | A), P(B | A), P(A | B), P(B | B), P(END | A), P(END | B).
- Emission probabilities (for observations Red and Blue): P(Red | A), P(Blue | A), P(Red | B), P(Blue | B).
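- A minimal sketch of count-and-normalize training from a tiny hand-tagged corpus (invented purely for illustration), followed by scoring a tag sequence as in the "James ate the tasty food" example:
from collections import Counter, defaultdict
# toy tagged corpus: sentences of (word, tag) pairs, invented for illustration
corpus = [
    [("james", "PROPN"), ("ate", "VERB"), ("the", "DET"), ("tasty", "ADJ"), ("food", "NOUN")],
    [("mary", "PROPN"), ("drank", "VERB"), ("some", "DET"), ("wine", "NOUN")],
]
trans_counts = defaultdict(Counter)  # counts for P(tag_i | tag_{i-1})
emit_counts = defaultdict(Counter)   # counts for P(word_i | tag_i)
for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        trans_counts[prev][tag] += 1
        emit_counts[tag][word] += 1
        prev = tag
    trans_counts[prev]["END"] += 1
def normalize(counts):
    # turn raw counts into conditional probabilities by dividing by the row totals
    return {prev: {k: v / sum(c.values()) for k, v in c.items()} for prev, c in counts.items()}
trans = normalize(trans_counts)  # transition probabilities
emit = normalize(emit_counts)    # emission probabilities
def score(words, tags):
    # joint probability of a word sequence and a tag sequence under the HMM (END omitted for brevity)
    p, prev = 1.0, "START"
    for word, tag in zip(words, tags):
        p *= trans.get(prev, {}).get(tag, 0.0) * emit.get(tag, {}).get(word, 0.0)
        prev = tag
    return p
print(score("james ate the tasty food".split(), ["PROPN", "VERB", "DET", "ADJ", "NOUN"]))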
HMM Inference
- Goal: Determine the hidden states for a given sequence.
- Use transition and emission probabilities.
Greedy Approach
- Step through the sequence one output at a time.
- Decide the most likely hidden state based on the product of:
- Transition probability.
- Emission probability.
- "Greedy" because it picks the most likely choice at each outcome without looking ahead.
HMMs: Part-of-speech (POS) Training
- Transitions: P(y_i | y_{i-1}).
- Emissions: P(x_i | y_i).
HMMs: POS Inference
- Finding the hidden states gives parts-of-speech tags for each token.
- Example: James ate the food
Exercise: HMM Inference
- Calculate the probabilities and find the optimal sequence for the sentence using a greedy search.
- "Sue drew Mark"
- Transitions: P(y_i | y_{i-1})
- Emissions: P(x_i | y_i)
Viterbi Algorithm
- Superior to the greedy method because it takes the whole sequence into account.
- Calculates the maximum possible probability of the entire sequence given transition and emission probabilities.
- Builds probabilities up one token at a time using dynamic programming.
- After the forward pass, backtracks to recover the optimal path through the hidden states (POS tags).
Viterbi and Dynamic Programming
- Viterbi is an example of dynamic programming.
- Dynamic programming solves problems with overlapping recursive subproblems more efficiently.
- It builds up the solution from the solutions to smaller subproblems.
- Example problem: calculate the 100th Fibonacci number multiplied by the 101st (see the sketch below).
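- A sketch of the dynamic-programming idea on the Fibonacci example: each value is computed once and reused, so the 100th and 101st numbers share almost all of their work:
from functools import lru_cache
@lru_cache(maxsize=None)
def fib(n):
    # each fib(n) is computed once and cached, turning exponential recursion into linear time
    return n if n < 2 else fib(n - 1) + fib(n - 2)
print(fib(100) * fib(101))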
Viterbi Algorithm Steps
- Work through the sequence one output/emission at a time.
- Use the probabilities of reaching each of the previous hidden states.
- Find the most likely combination of transition and emission for this output/emission.
- For each current state, record the most likely previous state so that the best path can be recovered at the end (see the sketch below).
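- A minimal Viterbi sketch over the same trans and emit dictionaries used above (START handled explicitly, the END transition omitted for brevity):
def viterbi(words, trans, emit):
    tags = list(emit)
    # best[i][tag] = (probability of the best path ending in tag at position i, previous tag)
    best = [{tag: (trans.get("START", {}).get(tag, 0.0) * emit[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            # choose the previous tag that maximizes path probability * transition, then multiply in the emission
            prev_tag = max(tags, key=lambda p: best[-1][p][0] * trans.get(p, {}).get(tag, 0.0))
            prob = best[-1][prev_tag][0] * trans.get(prev_tag, {}).get(tag, 0.0) * emit[tag].get(word, 0.0)
            column[tag] = (prob, prev_tag)
        best.append(column)
    # backtrack from the most probable final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
print(viterbi("james ate the tasty food".split(), trans, emit))  # PROPN VERB DET ADJ NOUN on the toy model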
Viterbi Demo Example
Viterbi vs. Greedy
- Viterbi allows multiple possible paths through the hidden states and keeps track of their probabilities.
- Calculates the highest probability of getting to a hidden state given all the previous inputs.
Constituency Parsing
- Breaking a sentence into noun phrases, verb phrases, etc.
- Noun Phrase (NP): noun with determiners/adjectives (e.g., The small dog, five cars).
- Verb phrase (VP): verb plus arguments (excluding its subject) (e.g., passed David the salt, walked away).
- Prepositional phrase (PP): preposition plus object (noun phrase) (e.g., To the trains, On top of the Empire State Building).
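- An illustrative bracketed constituency structure in Penn-Treebank-style notation (a sketch assuming the nltk package is available):
from nltk import Tree
# "The small dog walked away" as a bracketed constituency tree
tree = Tree.fromstring("(S (NP (DT The) (JJ small) (NN dog)) (VP (VBD walked) (RB away)))")
tree.pretty_print()  # draws the tree as ASCII art
print(tree.label(), [subtree.label() for subtree in tree])  # S ['NP', 'VP']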
Ambiguity in Syntactic Analysis
- Sentences can have multiple possible syntactic structures.
- Example: "I saw an elephant in my pajamas."
- Moderate length sentences (20-30 words) can have thousands of possible syntactic structures.
Significance of Syntactic Structure
- Meaning can be learnt using syntactic structure.
- The NP preceding the VP is often the subject; the NP inside the VP, following the verb, is often the direct object.
- Basic units are helpful in modeling language.
- Useful for:
- Predicting/completing sentences (language modeling).
- Re-organizing sentences or simplifying them (summarization).
- Relation extraction, question answering, machine translation, semantic role labeling.
Dependency Parsing
- Given a sentence, draw edges between pairs of words and label them.
- Result should be a tree.
- Edge-labels between word pairs should convey the correct syntactic relation.
- A dependency structure for a given sentence is a directed graph originating out of a unique and artificially inserted root node.
- Properties of a valid dependency graph:
- Each word has exactly one incoming edge (except the root).
- Weakly connected (replacing edges with undirected edges results in a connected graph).
- Acyclic.
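- These properties can be checked directly; a minimal sketch, representing a parse as (head, dependent) edges with 0 standing for the artificial root:
def is_valid_tree(n_words, edges):
    # edges: list of (head, dependent) pairs; words are numbered 1..n_words, 0 is the root
    heads = {}
    for head, dep in edges:
        if dep in heads:
            return False  # a word with two incoming edges is invalid
        heads[dep] = head
    if set(heads) != set(range(1, n_words + 1)):
        return False      # every word needs exactly one incoming edge
    for word in range(1, n_words + 1):
        seen, current = set(), word
        while current != 0:  # walk up the head chain; a cycle never reaches the root
            if current in seen:
                return False
            seen.add(current)
            current = heads[current]
    return True
# "John drank wine": root -> drank, drank -> John, drank -> wine
print(is_valid_tree(3, [(0, 2), (2, 1), (2, 3)]))  # True
print(is_valid_tree(2, [(1, 2), (2, 1)]))          # False: contains a cycle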
Linguistic Phenomena
- Capture who did what to whom:
- Subject, direct object, indirect object.
Transition Based Parsing
- Build a parse tree with a sequence of actions (transitions).
Set-up (state)
- A buffer, initialized with the words of the sentence.
- A stack holding the words currently under consideration for more edges.
- A set of dependency arcs (edges): the partially constructed tree.
Arc-Standard parsing
- Shift: pop the next word from the buffer and push it onto the stack.
- Left-arc: add an edge from the word on top of the stack to the word below it (head on top), and remove the dependent from the stack.
- Right-arc: add an edge from the second word on the stack to the word on top (head below), and remove the dependent (the top) from the stack.
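- A minimal arc-standard sketch of the state and the three transitions (word indices stand in for word forms; choosing which transition to apply is the classifier's job, discussed below):
class ArcStandardState:
    def __init__(self, n_words):
        self.stack = [0]  # 0 is the artificial root
        self.buffer = list(range(1, n_words + 1))
        self.arcs = []    # (head, dependent) edges built so far
    def shift(self):
        self.stack.append(self.buffer.pop(0))
    def left_arc(self):
        # the top of the stack becomes the head of the word below it
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], dep))
    def right_arc(self):
        # the word below the top becomes the head of the top
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], dep))
# "John drank wine" (1=John, 2=drank, 3=wine): shift, shift, left-arc, shift, right-arc, right-arc
state = ArcStandardState(3)
for action in [state.shift, state.shift, state.left_arc, state.shift, state.right_arc, state.right_arc]:
    action()
print(state.arcs)  # [(2, 1), (2, 3), (0, 2)]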
Analysis of Transition-Based Parsing
- "Shift-reduce" parsing:
- Shift: move from buffer to stack.
- Reduce: (left-arc, right-arc) add edges, and remove elements from stack.
- For N tokens:
- O(N) shift operations (one for each token).
- O(N) reduce operations (one for each edge).
- O(N) = linear time parsing (if you can choose the right transitions).
- Actions are irreversible, so must correctly choose action every time.
Transition Selection
- Multi-class prediction:
- Inputs: state = {stack, buffer, other features}.
- Targets: transitions = {shift, left-arc, right-arc}.
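- A sketch of transition selection as multi-class classification using scikit-learn (an assumption; any multi-class classifier works), with deliberately simple features and a toy training set invented for illustration:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# toy training examples: features of the parser state -> correct transition (invented for illustration)
states = [
    {"stack_top": "ROOT", "buffer_front": "John"},
    {"stack_top": "John", "buffer_front": "drank"},
    {"stack_top": "drank", "buffer_front": "wine"},
    {"stack_top": "wine", "buffer_front": "<none>"},
]
transitions = ["shift", "shift", "left-arc", "right-arc"]
vec = DictVectorizer()
X = vec.fit_transform(states)  # one-hot feature vectors
clf = LogisticRegression(max_iter=1000).fit(X, transitions)
# predict the next transition for a parser state; real systems use many richer features
print(clf.predict(vec.transform([{"stack_top": "drank", "buffer_front": "wine"}])))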
Arc Labels
- Use a separate classifier that picks the arc label (given the stack and buffer features), i.e. subject, object, etc.
Parsing Summary
- Syntactic parsing aims to find phrase-structure in sentences and word-word relations.
Dependency parsing
- Word-word relations e.g., nsubj(ate, Elephant).
- Generates dependency trees by predicting edges or pruning edges.
Transition-based parsing
- Formulates edge prediction as a classification task to predict next action
NLP Summary
- NLP provides tools to model language beyond simple lexical (word) features.
- Sequence models like HMMs allow us to label and break apart patterns in language, such as:
- part-of-speech tags
- sentence or word breaks
- Tree-structured prediction: useful for constructing structures such as dependency parses.
- Helpful for advanced applications (QA, Dialogue Systems, etc.).