NLP Notes
GPU Usage Tips
- To conserve resources, don't keep a GPU Colab runtime open when you are not actively using the GPU.
- Start with a tiny dataset when building code, potentially using the CPU initially.
- HuggingFace transformers can be complex, so refer to HuggingFace tutorials for examples, as API documentation lacks them.
- Restart the notebook after installing packages and place installation code at the top.
- Disable GPUs for better error messages by running the following at the very start of the notebook (before importing any deep learning framework) and restarting:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
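- A quick sanity check (a sketch assuming PyTorch is the framework in use; other frameworks have an equivalent):
import torch  # import only after CUDA_VISIBLE_DEVICES has been set above
print(torch.cuda.is_available())  # expected: False once the GPU is hidden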
- Alternative GPU options if Colab time runs out:
- Personal machine's GPU.
- Kaggle.
- Avoid paying for Colab Pro if possible.
Part-of-Speech Tagging & Parsing
- Introduction to Natural Language Processing (NLP).
- Sequential labeling: Parts-of-Speech (POS), Named Entity Recognition.
- Classification techniques: Hidden Markov Model (HMM), Viterbi Algorithm.
- Tree Structures: Constituency and Dependency Syntax Trees.
- Parsing techniques: Transition-Based Parsing.
Understanding Text
- Previously covered:
- Bag-of-words representations (e.g., {john, drank, some, wine, on, the, new, sofa, it, was, red}) with weighting schemes like TF-IDF.
- N-gram language models (a counting sketch follows this list):
P("John drank some wine on the new sofa. It was red.")
P("red" | "John drank some wine on the new sofa. It was") ≈ P("red" | "It was")
- Limitations of previous approaches: They lack true understanding of the text.
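- The n-gram approximation above can be estimated by counting; a minimal sketch (the toy corpus is invented purely for illustration):
from collections import Counter, defaultdict
# toy corpus; in practice this would be a much larger text collection
tokens = "john drank some wine . it was red . it was good .".split()
# count (previous two words -> next word), i.e. a trigram language model
counts = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    counts[(w1, w2)][w3] += 1
def trigram_prob(word, context):
    # maximum-likelihood estimate: count(context + word) / count(context)
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0
print(trigram_prob("red", ("it", "was")))  # ~ P("red" | "it was"); 0.5 on this toy corpus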
Examples of Textual Understanding
- Identifying self-contained phrases (Noun Phrases).
- Recognizing reference (anaphora) and its potential ambiguity.
- Subject and Direct Object identification.
- Understanding the modification provided by prepositional phrases.
- Grasping implications through pragmatics (e.g., John spilled the wine on the new sofa).
Relationships in Text
- Modeling relationships between words/phrases is crucial for understanding text beyond sequential words.
- Example: "I saw an elephant with a telescope. It was enormous!"
- "an" and "elephant" form a Noun Phrase.
NLP Components
- Syntax: Identifying grammatical roles of words to form valid sentences.
- Example: "Bank" as a direct object in "Rachel ran to the bank."
- Semantics: Understanding the meanings of words/phrases and their relationships.
- Example: "Bank" referring to the riverbank; Rachel performing the running action.
- Pragmatics: Understanding how communicative context affects meaning.
- Example: "John drank some wine on the new sofa. It was red." and its implications.
Context Dependence
- Context and background knowledge affect meaning.
- "Rachel ran to the bank" vs. "Rachel swam to the bank".
- "John drank some wine at the table. It was red." vs. "John drank some wine at the table. It was wobbly."
Complexity of NLP
- Real newspaper headlines demonstrate the complexity:
- "Boy paralyzed after tumor fights back to gain black belt"
- "Scientists study whales from space"
- "Juvenile Court to Try Shooting Defendant"
- Garden Path Sentences: grammatically correct sentences that lead the reader to an incorrect initial parse, so they seem wrong until re-read.
- "The old man the boat."
- "The complex houses married and single soldiers and their families."
- "The horse raced past the barn fell."
Typical NLP Pipeline
- Tokenization and lemmatization.
- Sentence boundary detection: Most NLP operates on individual sentences.
- Part-of-speech tagging: Identifying nouns, verbs, adjectives, etc.
- Parsing (dependency): Diagramming sentences.
- Named Entity Recognition: Detecting and classifying entities.
- Coreference resolution: Resolving pronouns to named entities.
- Example:
- Tokens: A tall boy ran home .
- POS tags: DT JJ NN VBD NN .
- Coreference: "Mr. Mckintosh went shopping . He ..." ("He" resolves to "Mr. Mckintosh").
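- A sketch of such a pipeline using spaCy (an assumption; the course may use a different toolkit), assuming the small English model en_core_web_sm is installed:
import spacy
nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, parser and NER in one pipeline
doc = nlp("Mr. Mckintosh went shopping in Glasgow. He bought some wine.")
for token in doc:
    # surface form, lemma, universal POS tag, dependency label and head word
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # named entities and their types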
NLP Approaches
- Statistical methods: Identifying patterns through supervised machine learning.
- Learning from human interactions for training data.
- Rule-based approaches: Using linguistic and domain knowledge.
- Example: tokenization by splitting on runs of non-word characters, e.g. re.split(r'\W+', raw).
- Challenges: manual maintenance and updates are difficult.
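- A quick illustration of that rule-based tokenizer and its brittleness (note how the contraction is split), as a sketch:
import re
raw = "John didn't drink the wine; it was red."
print(re.split(r'\W+', raw))    # ['John', 'didn', 't', 'drink', 'the', 'wine', 'it', 'was', 'red', '']
print(re.findall(r'\w+', raw))  # same tokens without the trailing empty string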
Sequential NLP Structures
- NLP information as a sequence of labels.
- Examples include Part-of-Speech Tagging, Named Entity Recognition, and Word/Structure Segmentation.
- Units (token, character, sentence) are assigned labels that depend on surrounding labels.
Parts-of-Speech (POS) Tagging
- Basic syntactic analysis assigning tags to words based on grammatical role.
- Example:
- John drank some wine on the new sofa . It was red .
- PROPN VERB DET NOUN ADP DET ADJ NOUN PUNCT PRON AUX ADJ PUNCT
- Uses a set of "Universal" POS tags.
Examples of Universal POS tags
- ADJ (adjective): big, old, green
- ADP (adposition/preposition): in, to, during
- ADV (adverb): very, well, exactly
- AUX (auxiliary verb): is, has, must
- CCONJ (coordinating conjunction): and, or, but
- DET (determiner): a, the, some
- INTJ (interjection): psst, ouch, hello
- NOUN (noun): girl, cat, tree
- NUM (numeral): five, 70, IV
- PART (particle): 's, not
- PRON (pronoun): he, her, myself, everybody
- PROPN (proper noun): Mary, Glasgow, HBO
- PUNCT (punctuation): ., ( )
- SCONJ (subordinating conjunction): that, if, while
- SYM (symbol): $, ♥‿♥, 1-800-COMPANY
- VERB (verb): drank, running, sits
POS Tagging Accuracy
- State-of-the-art techniques achieve >98% accuracy on news text.
- Average POS tagging disagreement among expert human judges for the Penn Treebank was 3.5%.
- POS tagging in low-resource languages is harder (~87%).
- POS tagging for Tweets is harder (state-of-the-art ~88%).
Other Sequence Tagging Problems
- Word Segmentation
- Structure Segmentation
Sequence Labeling Approaches
- Rule-based: Error-prone; building & maintaining rules is challenging.
- Supervised Learning: Most common/effective approach; learns from labeled treebanks.
- Hidden Markov Models (HMM) are covered.
- More recent approaches exist based on neural networks, but many of the same ideas apply.
Naive Bayes for POS Tagging
- Problem: Ignores sequential information.
- Words will always be assigned the same POS tag, regardless of context.
- Example: "That was a bad call" vs. "Please call your mum after school."
Hidden Markov Model (HMM)
- Generalization of Naïve Bayes to sequences.
- Assumes an underlying set of hidden (unobserved) states (labels), y, in which the model can be.
- Assumes probabilistic transitions between states over time as sequence is generated.
- Assumes a probabilistic generation of tokens from states (e.g. words generated for each POS).
- Markov Assumption: Current state depends only on the previous state (may vary, e.g., 2nd order = previous 2 states).
HMM components
- x_i: observed nodes (i.e., words).
- y_i: hidden nodes (i.e., POS tags); the goal is to predict these.
- Transitions: P(y_i | y_{i-1}) (horizontal arrows).
- Emissions: P(x_i | y_i) (vertical arrows).
HMM Tasks
- Training: learn P(y_i | y_{i-1}) and P(x_i | y_i).
- Inference (test): given x_i, infer y_i or P(y_i).
POS Tagging with HMMs
- Hidden States: POS tags.
- Observed States: Words in the sequence.
- Goal: Predict the POS tags.
Example
- James ate the tasty food
- PROPN VERB DET ADJ NOUN
- Calculations (building up the joint probability left to right, with START as the initial state):
- P(PROPN | START)
- P("James" | PROPN) * P(PROPN | START)
- P("James" | PROPN) * P(PROPN | START) * P("ate" | VERB) * P(VERB | PROPN)
- P("James" | PROPN) * P(PROPN | START) * P("ate" | VERB) * P(VERB | PROPN) * P("the" | DET) * P(DET | VERB) * P("tasty" | ADJ) * P(ADJ | DET) * P("food" | NOUN) * P(NOUN | ADJ)
HMM Training
- Count and normalize over the training data to learn transition and emission probabilities.
- Transition probabilities (for a toy model with hidden states A and B): P(A | START), P(B | START), P(A | A), P(B | A), P(A | B), P(B | B), P(END | A), P(END | B).
- Emission probabilities (for observations Red and Blue): P(Red | A), P(Blue | A), P(Red | B), P(Blue | B).
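- A minimal sketch of count-and-normalize training from a tiny hand-tagged corpus (invented purely for illustration), followed by scoring a tag sequence as in the "James ate the tasty food" example:
from collections import Counter, defaultdict
# toy tagged corpus: sentences of (word, tag) pairs, invented for illustration
corpus = [
    [("james", "PROPN"), ("ate", "VERB"), ("the", "DET"), ("tasty", "ADJ"), ("food", "NOUN")],
    [("mary", "PROPN"), ("drank", "VERB"), ("some", "DET"), ("wine", "NOUN")],
]
trans_counts = defaultdict(Counter)  # counts for P(tag_i | tag_{i-1})
emit_counts = defaultdict(Counter)   # counts for P(word_i | tag_i)
for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        trans_counts[prev][tag] += 1
        emit_counts[tag][word] += 1
        prev = tag
    trans_counts[prev]["END"] += 1
def normalize(counts):
    # turn raw counts into conditional probabilities by dividing by the row totals
    return {prev: {k: v / sum(c.values()) for k, v in c.items()} for prev, c in counts.items()}
trans = normalize(trans_counts)  # transition probabilities
emit = normalize(emit_counts)    # emission probabilities
def score(words, tags):
    # joint probability of a word sequence and a tag sequence under the HMM (END omitted for brevity)
    p, prev = 1.0, "START"
    for word, tag in zip(words, tags):
        p *= trans.get(prev, {}).get(tag, 0.0) * emit.get(tag, {}).get(word, 0.0)
        prev = tag
    return p
print(score("james ate the tasty food".split(), ["PROPN", "VERB", "DET", "ADJ", "NOUN"]))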
HMM Inference
- Goal: Determine the hidden states for a given sequence.
- Use transition and emission probabilities.
Greedy Approach
- Step through the sequence one output at a time.
- Decide the most likely hidden state based on the product of:
- Transition probability.
- Emission probability.
- "Greedy" because it picks the most likely choice at each outcome without looking ahead.
HMMs: Part-of-speech (POS) Training
- Transitions: P(y_i | y_{i-1}).
- Emissions: P(x_i | y_i).
HMMs: POS Inference
- Finding the hidden states gives parts-of-speech tags for each token.
- Example: James ate the food
Exercise: HMM Inference
- Calculate the probabilities and find the optimal sequence for the sentence using a greedy search.
- "Sue drew Mark"
- Transitions: P(y_i | y_{i-1})
- Emissions: P(x_i | y_i)
Viterbi Algorithm
- Superior to the greedy method because it takes the whole sequence into account.
- Calculates the maximum possible probability of the entire sequence given transition and emission probabilities.
- Builds probabilities up one token at a time using dynamic programming.
- After the forward pass, backtracks to recover the optimal path through the hidden states (POS tags).
Viterbi and Dynamic Programming
- Viterbi is an example of dynamic programming.
- Dynamic programming solves problems with overlapping recursive subproblems more efficiently.
- It builds up the solution from the solutions to smaller subproblems.
- Example problem: calculate the 100th Fibonacci number multiplied by the 101st (see the sketch below).
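- A sketch of the dynamic-programming idea on the Fibonacci example: each value is computed once and reused, so the 100th and 101st numbers share almost all of their work:
from functools import lru_cache
@lru_cache(maxsize=None)
def fib(n):
    # each fib(n) is computed once and cached, turning exponential recursion into linear time
    return n if n < 2 else fib(n - 1) + fib(n - 2)
print(fib(100) * fib(101))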
Viterbi Algorithm Steps
- Work through the sequence one output/emission at a time.
- Use the probabilities of reaching each of the previous hidden states.
- Find the most likely combination of transition and emission for this output/emission.
- For each current state, record the most likely previous state so that the best path can be recovered at the end (see the sketch below).
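- A minimal Viterbi sketch over the same trans and emit dictionaries used above (START handled explicitly, the END transition omitted for brevity):
def viterbi(words, trans, emit):
    tags = list(emit)
    # best[i][tag] = (probability of the best path ending in tag at position i, previous tag)
    best = [{tag: (trans.get("START", {}).get(tag, 0.0) * emit[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            # choose the previous tag that maximizes path probability * transition, then multiply in the emission
            prev_tag = max(tags, key=lambda p: best[-1][p][0] * trans.get(p, {}).get(tag, 0.0))
            prob = best[-1][prev_tag][0] * trans.get(prev_tag, {}).get(tag, 0.0) * emit[tag].get(word, 0.0)
            column[tag] = (prob, prev_tag)
        best.append(column)
    # backtrack from the most probable final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
print(viterbi("james ate the tasty food".split(), trans, emit))  # PROPN VERB DET ADJ NOUN on the toy model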
Viterbi Demo Example
Viterbi vs. Greedy
- Viterbi allows multiple possible paths through the hidden states and keeps track of their probabilities.
- Calculates the highest probability of getting to a hidden state given all the previous inputs.
Constituency Parsing
- Breaking a sentence into noun phrases, verb phrases, etc.
- Noun Phrase (NP): noun with determiners/adjectives (e.g., The small dog, five cars).
- Verb phrase (VP): verb plus arguments (excluding its subject) (e.g., passed David the salt, walked away).
- Prepositional phrase (PP): preposition plus object (noun phrase) (e.g., To the trains, On top of the Empire State Building).
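- An illustrative bracketed constituency structure in Penn-Treebank-style notation (a sketch assuming the nltk package is available):
from nltk import Tree
# "The small dog walked away" as a bracketed constituency tree
tree = Tree.fromstring("(S (NP (DT The) (JJ small) (NN dog)) (VP (VBD walked) (RB away)))")
tree.pretty_print()  # draws the tree as ASCII art
print(tree.label(), [subtree.label() for subtree in tree])  # S ['NP', 'VP']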
Ambiguity in Syntactic Analysis
- Sentences can have multiple possible syntactic structures.
- Example: "I saw an elephant in my pajamas."
- Moderate length sentences (20-30 words) can have thousands of possible syntactic structures.
Significance of Syntactic Structure
- Meaning can be learnt using syntactic structure.
- The NP preceding the VP is often the subject; the NP inside the VP, following the verb, is often the direct object.
- Basic units are helpful in modeling language.
- Useful for:
- Predicting/completing sentences (language modeling).
- Re-organizing sentences or simplifying them (summarization).
- Relation extraction, question answering, machine translation, semantic role labeling.
Dependency Parsing
- Given a sentence, draw edges between pairs of words and label them.
- Result should be a tree.
- Edge-labels between word pairs should convey the correct syntactic relation.
- A dependency structure for a given sentence is a directed graph originating out of a unique and artificially inserted root node.
- Properties of a valid dependency graph:
- Each word has exactly one incoming edge (except the root).
- Weakly connected (replacing edges with undirected edges results in a connected graph).
- Acyclic.
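- These properties can be checked directly; a minimal sketch, representing a parse as (head, dependent) edges with 0 standing for the artificial root:
def is_valid_tree(n_words, edges):
    # edges: list of (head, dependent) pairs; words are numbered 1..n_words, 0 is the root
    heads = {}
    for head, dep in edges:
        if dep in heads:
            return False  # a word with two incoming edges is invalid
        heads[dep] = head
    if set(heads) != set(range(1, n_words + 1)):
        return False      # every word needs exactly one incoming edge
    for word in range(1, n_words + 1):
        seen, current = set(), word
        while current != 0:  # walk up the head chain; a cycle never reaches the root
            if current in seen:
                return False
            seen.add(current)
            current = heads[current]
    return True
# "John drank wine": root -> drank, drank -> John, drank -> wine
print(is_valid_tree(3, [(0, 2), (2, 1), (2, 3)]))  # True
print(is_valid_tree(2, [(1, 2), (2, 1)]))          # False: contains a cycle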
Linguistic Phenomena
- Capture who did what to whom:
- Subject, direct object, indirect object.
Transition Based Parsing
- Build a parse tree with a sequence of actions (transitions).
Set-up (state)
- A buffer, initialized with the words of the sentence.
- A stack holding the words currently under consideration for more edges.
- A set of dependency arcs (edges): the partially constructed tree.
Arc-Standard parsing
- Shift: pop the next word from the buffer and push it onto the stack.
- Left-arc: add an edge from the word on top of the stack to the word below it (head on top), and remove the dependent from the stack.
- Right-arc: add an edge from the second word on the stack to the word on top (head below), and remove the dependent (the top) from the stack.
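- A minimal arc-standard sketch of the state and the three transitions (word indices stand in for word forms; choosing which transition to apply is the classifier's job, discussed below):
class ArcStandardState:
    def __init__(self, n_words):
        self.stack = [0]  # 0 is the artificial root
        self.buffer = list(range(1, n_words + 1))
        self.arcs = []    # (head, dependent) edges built so far
    def shift(self):
        self.stack.append(self.buffer.pop(0))
    def left_arc(self):
        # the top of the stack becomes the head of the word below it
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], dep))
    def right_arc(self):
        # the word below the top becomes the head of the top
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], dep))
# "John drank wine" (1=John, 2=drank, 3=wine): shift, shift, left-arc, shift, right-arc, right-arc
state = ArcStandardState(3)
for action in [state.shift, state.shift, state.left_arc, state.shift, state.right_arc, state.right_arc]:
    action()
print(state.arcs)  # [(2, 1), (2, 3), (0, 2)]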
Analysis of Transition-Based Parsing
- "Shift-reduce" parsing:
- Shift: move from buffer to stack.
- Reduce: (left-arc, right-arc) add edges, and remove elements from stack.
- For N tokens:
- O(N) shift operations (one for each token).
- O(N) reduce operations (one for each edge).
- O(N) = linear time parsing (if you can choose the right transitions).
- Actions are irreversible, so must correctly choose action every time.
Transition Selection
- Multi-class prediction:
- Inputs: state = {stack, buffer, other features}.
- Targets: transitions = {shift, left-arc, right-arc}.
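- A sketch of transition selection as multi-class classification using scikit-learn (an assumption; any multi-class classifier works), with deliberately simple features and a toy training set invented for illustration:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# toy training examples: features of the parser state -> correct transition (invented for illustration)
states = [
    {"stack_top": "ROOT", "buffer_front": "John"},
    {"stack_top": "John", "buffer_front": "drank"},
    {"stack_top": "drank", "buffer_front": "wine"},
    {"stack_top": "wine", "buffer_front": "<none>"},
]
transitions = ["shift", "shift", "left-arc", "right-arc"]
vec = DictVectorizer()
X = vec.fit_transform(states)  # one-hot feature vectors
clf = LogisticRegression(max_iter=1000).fit(X, transitions)
# predict the next transition for a parser state; real systems use many richer features
print(clf.predict(vec.transform([{"stack_top": "drank", "buffer_front": "wine"}])))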
Arc Labels
- Use a separate classifier that picks the arc label (given the stack and buffer features), i.e. subject, object, etc.
Parsing Summary
- Syntactic parsing aims to find phrase-structure in sentences and word-word relations.
Dependency parsing
- Word-word relations e.g., nsubj(ate, Elephant).
- Generates dependency trees by predicting edges or pruning edges.
Transition-based parsing
- Formulates edge prediction as a classification task to predict next action
NLP Summary
- NLP provides tools to model language beyond simple lexical (word) features.
- Sequence models like HMMs allow us to label and break apart patterns in language, such as:
- part-of-speech tags
- sentence or word breaks
- Tree-structured prediction: useful for constructing structures such as dependency parses.
- Helpful for advanced applications (QA, Dialogue Systems, etc.).