Note
0.0(0)
MD

NLP Notes

GPU Usage Tips

  • Don't use GPU Colab when not actively using GPUs to conserve resources.
  • Start with a tiny dataset when building code, potentially using the CPU initially.
  • HuggingFace transformers can be complex, so refer to HuggingFace tutorials for examples, as API documentation lacks them.
  • Restart the notebook after installing packages and place installation code at the top.
  • Disable GPUs for better error messages by using the following code at the start and restarting:

    import os


    os.environ["CUDAVISIBLEDEVICES"] = ""
  • Alternative GPU options if Colab time runs out:
    • Personal machine's GPU.
    • Kaggle.
    • Avoid paying for Colab Pro if possible.

Part-of-Speech Tagging & Parsing

  • Introduction to Natural Language Processing (NLP).
  • Sequential labeling: Parts-of-Speech (POS), Named Entity Recognition.
  • Classification techniques: Hidden Markov Model (HMM), Viterbi Algorithm.
  • Tree Structures: Constituency and Dependency Syntax Trees.
  • Parsing techniques: Transition-Based Parsing.

Understanding Text

  • Previously covered:
    • Bag-of-words representations (e.g., {john, drank, some, wine, on, the, new, sofa, it, was, red}) with weighting schemes like TF-IDF.
    • N-gram language models: P("John drank some wine on the new sofa. It was red.") P("red" | "John drank some wine on the new sofa. It was") ≈ P("red" | "It was")
      • Markov Assumption
  • Limitations of previous approaches: They lack true understanding of the text.

Examples of Textual Understanding

  • Identifying self-contained phrases (Noun Phrases).
  • Recognizing reference (anaphora) and its potential ambiguity.
  • Subject and Direct Object identification.
  • Understanding the modification provided by prepositional phrases.
  • Grasping implications through pragmatics (e.g., John spilled the wine on the new sofa).

Relationships in Text

  • Modeling relationships between words/phrases is crucial for understanding text beyond sequential words.
  • Example: "I saw an elephant with a telescope. It was enormous!"
    • "an" and "elephant" form a Noun Phrase.

NLP Components

  • Syntax: Identifying grammatical roles of words to form valid sentences.
    • Example: "Bank" as a direct object in "Rachel ran to the bank."
  • Semantics: Understanding the meanings of words/phrases and their relationships.
    • Example: "Bank" referring to the riverbank; Rachel performing the running action.
  • Pragmatics: Understanding how communicative context affects meaning.
    • Example: "John drank some wine on the new sofa. It was red." and its implications.

Context Dependence

  • Context and background knowledge affect meaning.
    • "Rachel ran to the bank" vs. "Rachel swam to the bank".
    • "John drank some wine at the table. It was red." vs. "John drank some wine at the table. It was wobbly."

Complexity of NLP

  • Real newspaper headlines demonstrate the complexity:
    • "Boy paralyzed after tumor fights back to gain black belt"
    • "Scientists study whales from space"
    • "Juvenile Court to Try Shooting Defendant"
  • Garden Path Sentences: Sentences that initially appear syntactically incorrect.
    • "The old man the boat."
    • "The complex houses married and single soldiers and their families."
    • "The horse raced past the barn fell."

Typical NLP Pipeline

  • Tokenization and lemmatization.
  • Sentence boundary detection: Most NLP operates on individual sentences.
  • Part-of-speech tagging: Identifying nouns, verbs, adjectives, etc.
  • Parsing (dependency): Diagramming sentences.
  • Named Entity Recognition: Detecting and classifying entities.
  • Coreference resolution: Resolving pronouns to named entities.
  • Example:
    • A tall boy ran home .
    • DT JJ NN VBD NN .
    • Mr. Mckintosh went shopping . He

NLP Approaches

  • Statistical methods: Identifying patterns through supervised machine learning.
    • Learning from human interactions for training data.
  • Rule-based approaches: Using linguistic and domain knowledge.
    • Example: tokenization using re.split(r'\w+', raw).
    • Challenges: Manual maintenance and updates are difficult.

Sequential NLP Structures

  • NLP information as a sequence of labels.
  • Examples include Part-of-Speech Tagging, Named Entity Recognition, and Word/Structure Segmentation.
  • Units (token, character, sentence) are assigned labels that depend on surrounding labels.

Parts-of-Speech (POS) Tagging

  • Basic syntactic analysis assigning tags to words based on grammatical role.
  • Example:
    • John drank some wine on the new sofa . It was red .
    • PROPN VERB DET NOUN ADP DET ADJ NOUN PUNCT PRN AUX ADJ PUNCT
  • Uses a set of "Universal" POS tags.

Examples of Universal POS tags

  • ADJ (adjective): big, old, green
  • ADP (adposition/preposition): in, to, during
  • ADV (adverb): very, well, exactly
  • AUX (auxiliary verb): is, has, must
  • CCONJ (coordinating conjunction): and, or, but
  • DET (determiner): a, the, some
  • INTJ (interjection): psst, ouch, hello
  • NOUN (noun): girl, cat, tree
  • NUM (numeral): five, 70, IV
  • PART (particle): 's, not
  • PRON (pronoun): he, her, myself, everybody
  • PROPN (proper noun): Mary, Glasgow, HBO
  • PUNCT (punctuation): ., ( )
  • SCONJ (subordinating conjunction): that, if, while
  • SYM (symbol): $, ♥‿♥, 1-800-COMPANY
  • VERB (verb): drank, running, sits

POS Tagging Accuracy

  • State-of-the-art techniques achieve >98% accuracy on news text.
  • Average POS tagging disagreement among expert human judges for the Penn Treebank was 3.5%.
  • POS tagging in low-resource languages is harder (~87%).
  • POS tagging for Tweets is harder (state-of-the-art ~88%).

Other Sequence Tagging Problems

  • Word Segmentation
  • Structure Segmentation

Sequence Labeling Approaches

  • Rule-based: Error-prone; building & maintaining rules is challenging.
  • Supervised Learning: Most common/effective approach; learns from labeled treebanks.
    • Hidden Markov Models (HMM) are covered.
    • More recent approaches exist based on neural networks, but many of the same ideas apply.

Naive Bayes for POS Tagging

  • Problem: Ignores sequential information.
  • Words will always be assigned the same POS tag, regardless of context.
  • Example: "That was a bad call" vs. "Please call your mum after school."

Hidden Markov Model (HMM)

  • Generalization of Naïve Bayes to sequences.
  • Assumes an underlying set of hidden (unobserved) states (labels), y, in which the model can be.
  • Assumes probabilistic transitions between states over time as sequence is generated.
  • Assumes a probabilistic generation of tokens from states (e.g. words generated for each POS).
  • Markov Assumption: Current state depends only on the previous state (may vary, e.g., 2nd order = previous 2 states).

HMM components

  • x_i: observed nodes (i.e., words).
  • y_i: hidden nodes (i.e., POS tags) - goal is to predict these.
  • Transitions: p(yi | y{i-1}) (horizontal arrows).
  • Emissions: p(xi | yi) (vertical arrows).

HMM Tasks

  • Training: Learn p(yi | y{i-1}) and p(xi | yi).
  • Inference (test): Given xi, infer yi or p(y_i).

POS Tagging with HMMs

  • Hidden States: POS tags.
  • Observed States: Words in the sequence.
  • Goal: Predict the POS tags.

Example

  • James ate the tasty food
  • PROPN VERB DET ADJ NOUN
  • Calculations:
    • P(PROPN | )
    • P(“James” | PROPN) * P(PROPN | )
    • P(“James” | PROPN) * P(PROPN | ) * P(“ate” | VERB) * P(VERB | PROPN)
    • P(“James” | PROPN) * P(PROPN | ) * P(“ate” | VERB) * P(VERB | PROPN) * P(DET | VERB) P(ADJ | DET) P(NOUN | DET) * P(“the” | DET) P(“tasty” | ADJ) P(“food” | NOUN)

HMM Training

  • Count and normalize training data to learn transition and emission probabilities.
  • Transition probabilities involve learning P(A|s), P(B|s), P(A|A), P(B|A), P(A|B), P(B|B), P(END| B), P(END| A).
  • Emission probabilities involve learning P(Red|A), P(Blue|A), P(Red |B), P(Blue|B).

HMM Inference

  • Goal: Determine the hidden states for a given sequence.
  • Use transition and emission probabilities.

Greedy Approach

  • Step through the sequence one output at a time.
  • Decide the most likely hidden state based on the product of:
    • Transition probability.
    • Emission probability.
  • "Greedy" because it picks the most likely choice at each outcome without looking ahead.

HMMs: Part-of-speech (POS) Training

  • Transitions: p(yi | y{i-1}).
  • Emissions: p(xi | yi).

HMMs: POS Inference

  • Finding the hidden states gives parts-of-speech tags for each token.
  • Example: James ate the food

Exercise: HMM Inference

  • Calculate the probabilities and find the optimal sequence for the sentence using a greedy search.
  • "Sue drew Mark"
  • Transitions: P(yi | y{i-1})
  • Emissions: P(xi | yi)$$

Viterbi Algorithm

  • Superior to the greedy method because it takes the whole sequence into account.
  • Calculates the maximum possible probability of the entire sequence given transition and emission probabilities.
  • Builds probabilities up one token at a time using dynamic programming.
  • Works backwards to find the optimal path through the hidden states (POS tags).

Viterbi and Dynamic Programming

  • Viterbi is an example of Dynamic Programming
  • Dynamic programming solves problems that rely on recursion more efficiently
  • Efficiently builds up the solution Example Problem: Calculate the 100th Fibonacci multiplied by the 101st Fibonacci

Viterbi Algorithm Steps

  • Work through the sequence one output/emission at a time.
  • Use probabilities of reaching each of the previous hidden states.
  • Find the most likely transition and emission for this output/emission.
  • Decide which is the most likely previous state at each step and use that.

Viterbi Demo Example

  • black bears go home

Viterbi vs. Greedy

  • Viterbi allows multiple possible paths through the hidden states and keeps track of their probabilities.
  • Calculates the highest probability of getting to a hidden state given all the previous inputs.

Constituency Parsing

  • Breaking a sentence into noun phrases, verb phrases, etc.
  • Noun Phrase (NP): noun with determinants/adjectives (e.g., The small dog, five cars).
  • Verb phrase (VP): verb plus arguments (excluding its subject) (e.g., passed David the salt, walked away).
  • Prepositional phrase (PP): preposition plus object (noun phrase) (e.g., To the trains, On top of the Empire State Building).

Ambiguity in Syntactic Analysis

  • Sentences can have multiple possible syntactic structures.
  • Example: "I saw an elephant in my pajamas."
  • Moderate length sentences (20-30 words) can have thousands of possible syntactic structures.

Significance of Syntactic Structure

  • Meaning can be learnt using syntactic structure.
  • The NP preceding VP is often the subject; the NP following VP is often the object.
  • Basic units are helpful in modeling language.
  • Useful for:
    • Predicting/completing sentences (language modeling).
    • Re-organizing sentences or simplifying them (summarization).
    • Relation extraction, question answering, machine translation, semantic role labeling.

Dependency Parsing

  • Given a sentence, draw edges between pairs of words and label them.
  • Result should be a tree.
  • Edge-labels between word pairs should convey the correct syntactic relation.

Dependency Parsing Formal Definition

  • A dependency structure for a given sentence is a directed graph originating out of a unique and artificially inserted root node.
  • Properties of a valid dependency graph:
    • Each word has exactly one incoming edge (except the root).
    • Weakly connected (replacing edges with undirected edges results in a connected graph).
    • Acyclic.

Linguistic Phenomena

  • Capture who did what to whom:
    • Subject, direct object, indirect object.

Transition Based Parsing

  • Build a parse tree with a sequence of actions (transitions).

Set-up (state)

  • A buffer initialized with the words
  • A stack to store state:
    • Words under consideration for more edges
    • Dependency arcs (edges)
    • The partially constructed tree

Arc-Standard parsing

  • Shift: pop from buffer, push to stack.
  • Left-arc: add left edge.
  • Right-arc: add right edge.

Analysis of Transition-Based Parsing

  • "Shift-reduce" parsing:
    • Shift: move from buffer to stack.
    • Reduce: (left-arc, right-arc) add edges, and remove elements from stack.
  • For N tokens:
    • O(N) shift operations (one for each token).
    • O(N) reduce operations (one for each edge).
  • O(N) = linear time parsing (if you can choose the right transitions).
  • Actions are irreversible, so must correctly choose action every time.

Transition Selection

  • Multi-class prediction Inputs: state= {stack, buffer, other features}
  • Targets: transitions= {shift, left-arc, right-arc}

Arc Labels

  • Use a separate classifier that picks arc label (given the stack features & buffer features) ie subject, object, etc..

Parsing Summary

  • Syntactic parsing aims to find phrase-structure in sentences and word-word relations.

Dependency parsing

  • Word-word relations e.g., nsubj(ate, Elephant).
  • Generates dependency trees by predicting edges or pruning edges.

Transition-based parsing

  • Formulates edge prediction as a classification task to predict next action

NLP Summary

  • NLP provides tools to model language beyond simple lexical (word) features.
  • Sequence models like HMMs allow us to label break apart patterns in language such as:
    • part-of-speech tags
    • sentence or word breaks
  • Tree structured prediction: Useful for constructing structures such as Dependency parsing.
  • Helpful for advanced applications (QA, Dialogue Systems, etc.).
Note
0.0(0)