Dependency & Constituency Parsing: Detailed Lecture Notes

Informal start: participants greet each other, wait ~5 min for late-comers.
Goal of the class: deepen understanding of parsing in Natural Language Processing (NLP), starting with dependency parsing and moving to statistical constituency parsing.

Focus sentence family: “Vineet will join the board as a non-executive director on 29 June.”
- will → modal auxiliary; encodes futurity; attaches to main verb as aux/aux-mod.
- join → lexical (main) verb; becomes root of dependency tree.
- Dependency parser asks: “What arguments does join need?”
1. Subject (nsubj) → “Vineet/Lincoln” (NP on the left).
2. Object (dobj/obj) → “the board” (NP on the right).
3. Optional modifiers (adjuncts):
  • Role: “as a non-executive director” (prep as → obl:arg).
  • Time: “on 29 June” (obl:tmod).
  • Place: “at the head office” (obl:lmod).
Key property: a verb like join minimally requires two noun phrases; time/place are optional.

Parser output visualised as arcs from head → dependent.
Projective tree: when the arcs are drawn above the sentence, none cross.
Non-projective structures (crossing arcs) appear in free-word-order languages (and occasionally English via topicalisation). Important for algorithm choice.
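The no-crossing condition can be checked mechanically. The sketch below is illustrative only: the head indices for the focus sentence are one plausible analysis (an assumption, not taken from a treebank), and the second sentence is a commonly cited English non-projective example.

```python
def is_projective(heads):
    """heads[i-1] is the head of word i (1-based); 0 marks the artificial ROOT.

    Two arcs cross when exactly one endpoint of the second lies strictly
    inside the span of the first.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:
                return False
    return True


# "Vineet will join the board as a non-executive director on 29 June"
# Assumed analysis: join is the root; June heads "on" and "29".
focus = [3, 3, 0, 5, 3, 9, 9, 9, 3, 12, 12, 3]

# "A hearing is scheduled on the issue today"
# Assumed analysis: "on the issue" attaches to "hearing", "today" to "scheduled".
hearing = [2, 4, 4, 0, 7, 7, 2, 4]

print(is_projective(focus))    # no crossing arcs
print(is_projective(hearing))  # hearing -> issue crosses scheduled -> today
```

The quadratic pairwise check is chosen for readability; it is fine for sentence-length inputs.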

Subject & direct object = core arguments (valency of verb).
Time & place = adjuncts (can be absent without ungrammaticality).
Parsing task: detect which dependents are obligatory to satisfy subcategorisation frame.
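A minimal sketch of the core-argument versus adjunct distinction, assuming a hand-written frame lexicon (`FRAMES` is hypothetical, as is the rule that every `obl*` label counts as an adjunct, which follows the grouping in these notes):

```python
# Hypothetical subcategorisation lexicon: verb -> labels it requires.
FRAMES = {"join": {"nsubj", "obj"}}


def check_frame(verb, dependents):
    """dependents: list of (label, phrase) pairs attached to the verb.

    Returns (missing, adjuncts): the required labels that are absent,
    and the optional oblique modifiers that were found.
    """
    labels = {label for label, _ in dependents}
    missing = FRAMES[verb] - labels
    adjuncts = [(l, p) for l, p in dependents if l.startswith("obl")]
    return missing, adjuncts


full = [
    ("nsubj", "Vineet"),
    ("aux", "will"),
    ("obj", "the board"),
    ("obl:arg", "as a non-executive director"),
    ("obl:tmod", "on 29 June"),
]
print(check_frame("join", full))        # frame satisfied, two adjuncts
print(check_frame("join", full[:3]))    # still satisfied without adjuncts
print(check_frame("join", full[:2]))    # object missing
```

Dropping the adjuncts leaves the frame satisfied; dropping the object does not.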

Transition-based parsing: inspired by LR (shift-reduce) parsing in compilers.
Data structures:
1. Stack $S=(s_1,\dots,s_k)$ – partially processed words (top = $s_k$).
2. Buffer $B=(b_1,\dots,b_m)$ – remaining input (front = $b_1$).
3. Arc set $(A)$ – constructed dependency edges.
Oracle decides next action by consulting current configuration $\langle S,B,A \rangle$.
Core actions (variants exist):
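• $\text{SHIFT}$ – move $b_1$ from the buffer onto the stack.
• $\text{LEFT dep}$ – add arc $b_1 \rightarrow s_k$ (head $b_1$, dependent $s_k$), then pop $s_k$.
• $\text{RIGHT dep}$ – add arc $s_k \rightarrow b_1$ (head $s_k$, dependent $b_1$), then move $b_1$ onto the stack.
• $\text{REDUCE}$ – pop $s_k$ when its head is already found.
These are the arc-eager definitions. The arc-standard variant instead pops the two topmost stack items, adds an arc between them, pushes the head back, and has no $\text{REDUCE}$.
Greedy: at each step applies the single best-scoring legal transition, with no backtracking; the basic systems build only projective trees.
Complexity: $O(n)$ transitions for a sentence of length $n$.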
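The transition loop can be sketched as follows, assuming the arc-eager variant (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE). A real parser would ask a trained classifier for the next action; here a static oracle that reads the gold tree stands in for it, and the gold heads for the focus sentence are an assumed analysis.

```python
def arc_eager_parse(gold):
    """gold[i] is the gold head of word i (1-based); gold[0] is a placeholder.

    Returns (arcs, transitions), where arcs maps dependent -> head.
    """
    n = len(gold) - 1
    stack, buffer, arcs, transitions = [0], list(range(1, n + 1)), {}, []
    while buffer:
        s, b = stack[-1], buffer[0]
        if s != 0 and s not in arcs and gold[s] == b:
            arcs[s] = b                      # LEFT-ARC: b -> s, pop s
            stack.pop()
            transitions.append("LEFT-ARC")
        elif gold[b] == s:
            arcs[b] = s                      # RIGHT-ARC: s -> b, push b
            stack.append(buffer.pop(0))
            transitions.append("RIGHT-ARC")
        elif s in arcs and any(gold[k] == b or gold[b] == k for k in stack[:-1]):
            stack.pop()                      # REDUCE: s has its head, b needs a deeper item
            transitions.append("REDUCE")
        else:
            stack.append(buffer.pop(0))      # SHIFT
            transitions.append("SHIFT")
    return arcs, transitions


# "Vineet will join the board as a non-executive director on 29 June"
gold = [-1, 3, 3, 0, 5, 3, 9, 9, 9, 3, 12, 12, 3]
arcs, transitions = arc_eager_parse(gold)
print(arcs)
print(len(transitions), "transitions for", len(gold) - 1, "words")
```

Each word is pushed once and popped at most once, so the number of transitions is at most $2n$, which is where the linear bound comes from.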

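Graph-based parsing: builds a fully connected directed graph (each word is a candidate head of every other word).
Assigns scores $w(i \rightarrow j)$ to candidate arcs using a scoring model (e.g., a feature-based linear model or a neural network).
Chooses the maximum spanning tree (MST) subject to the single-head constraint, typically with the Chu-Liu/Edmonds algorithm.
Handles non-projective structures; cost: $O(n^2)$ for edge scoring, plus MST inference in $O(n^3)$ (naive Chu-Liu/Edmonds) or $O(n^2)$ (Tarjan's implementation).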
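To make the objective concrete, the sketch below finds the best tree by brute force: it enumerates every head assignment, keeps those that form a tree rooted at ROOT, and returns the highest-scoring one. This is deliberately not Chu-Liu/Edmonds (it is exponential in sentence length) and the arc scores are made-up numbers for a three-word sentence.

```python
from itertools import product


def reaches_root(heads):
    """True if following heads from every word ends at ROOT (0) without a cycle."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def best_tree(score):
    """score[h][d] is the score of arc h -> d; node 0 is ROOT."""
    n = len(score) - 1
    best_heads, best_total = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):   # one head per word
        if not reaches_root(heads):
            continue
        total = sum(score[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_heads, best_total = heads, total
    return best_heads, best_total


# Hypothetical scores for ROOT(0) John(1) saw(2) Mary(3); unused cells are 0.
score = [
    [0, 9, 10, 9],
    [0, 0, 20, 3],
    [0, 30, 0, 30],
    [0, 11, 0, 0],
]
print(best_tree(score))
```

Picking each word's best incoming arc independently would pair John and saw in a cycle (30 + 20), which is why a tree-constrained search is needed rather than a per-word argmax.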

Transition-based: fast, local, suitable for real-time; struggles with long-distance non-projective arcs; susceptible to error-propagation.
Graph-based: finds the highest-scoring tree under its model (exact for arc-factored scoring), handles crossings; heavier computation & memory.
Choice depends on language typology, latency requirements, and hardware.

Probabilistic context-free grammars (PCFGs): bridge from the dependency to the constituency world.
A CFG rule $A \rightarrow \alpha$ is augmented with a probability $P(A \rightarrow \alpha)$ such that, for each non-terminal $A$, $\sum_{\alpha} P(A \rightarrow \alpha)=1$.
Example:
$NP \rightarrow Det\;Nominal \quad P=0.7$
$NP \rightarrow ProperNoun \quad P=0.3$
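The probability of a parse tree is the product of the probabilities of the rules used in it. This allows ranking of multiple parse trees; the highest-probability tree is chosen (analogous to the most likely hidden path in HMMs).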
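A small sketch of both ideas: checking that rule probabilities sum to 1 per left-hand side, and scoring a tree as a product of rule probabilities. Only the two NP probabilities come from the notes; every other rule and number in `RULES` is a hypothetical value chosen so the example is complete.

```python
from collections import defaultdict
from math import isclose

RULES = {
    ("NP", ("Det", "Nominal")): 0.7,      # from the notes
    ("NP", ("ProperNoun",)): 0.3,         # from the notes
    ("Nominal", ("N",)): 0.6,             # hypothetical
    ("Nominal", ("N", "Nominal")): 0.4,   # hypothetical
    ("Det", ("a",)): 0.5,                 # hypothetical
    ("Det", ("the",)): 0.5,               # hypothetical
    ("N", ("flight",)): 1.0,              # hypothetical
}


def is_normalised(rules):
    """True if, for every left-hand side, the rule probabilities sum to 1."""
    totals = defaultdict(float)
    for (lhs, _), p in rules.items():
        totals[lhs] += p
    return all(isclose(t, 1.0) for t in totals.values())


def tree_prob(tree, rules):
    """tree = (label, child, ...); each child is a subtree tuple or a word."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child, rules)
    return p


tree = ("NP", ("Det", "a"), ("Nominal", ("N", "flight")))
print(is_normalised(RULES))
print(tree_prob(tree, RULES))   # 0.7 * 0.5 * 0.6 * 1.0
```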

Hypothesis: strings that occur in the same syntactic environments form constituents.
Distribution test: slot “____ (verb)”
• “Harry the Horse” attracts/loves/sits – valid.
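• “Three parties from Brooklyn” attract/love/sit – valid (plural subject, so the verb takes its plural form).
• “High-class spots such as Mindy’s” attracts? – ungrammatical, but only because of number agreement (plural subject, singular verb); with “attract” the string fills the subject slot, so it is still an NP.
Constituency parser learns which noun-phrase structures license which environments, often via PCFG probabilities.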

Grammar $G=(N,\Sigma,R,S)$
• $N$ – non-terminals (e.g., $S, NP, VP$ ).
• $\Sigma$ – terminals (lexical words: the, flight, join …).
• $R$ – production rules, one non-terminal on LHS.
• $S$ – start symbol (usually Sentence).
Sample airline fragment:
1. $S \rightarrow NP\;VP$
2. $VP \rightarrow V\;NP \mid V\;NP\;PP$
3. $NP \rightarrow Det\;Nominal \mid ProperNoun$
4. $Nominal \rightarrow N \mid N\;Nominal$
5. $PP \rightarrow Prep\;NP$
Parse tree for “a flight” (top-down derivation):
$NP \Rightarrow Det\;Nominal \Rightarrow a\;Nominal \Rightarrow a\;N \Rightarrow a\;flight$

Top-Down (recursive-descent): expand from $S$ , predict categories, match words.
• Needs backtracking when wrong rule chosen.
Bottom-Up (shift-reduce / CKY, also written CYK): begin with words, repeatedly combine them into larger phrases until $S$ is formed.
• Avoids prediction but may build constituents that never appear in final tree; dynamic programming remedies this.
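The top-down strategy with backtracking can be sketched as a recogniser over the airline fragment. The grammar rules are the ones listed in these notes; the `LEXICON` entries and test sentences are hypothetical additions. Python generators provide the backtracking: when one rule fails, iteration simply moves on to the next alternative.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["N"], ["N", "Nominal"]],
    "PP": [["Prep", "NP"]],
}

# Hypothetical lexicon mapping POS categories to words.
LEXICON = {
    "Det": {"the", "a"},
    "N": {"board", "flight", "morning"},
    "V": {"joins", "books"},
    "ProperNoun": {"Vineet"},
    "Prep": {"in", "on"},
}


def expand(symbols, words, pos):
    """Yield every position reachable by matching `symbols` against words[pos:]."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in LEXICON:                      # POS category: match one word
        if pos < len(words) and words[pos] in LEXICON[first]:
            yield from expand(rest, words, pos + 1)
    else:                                     # phrase category: predict each rule
        for rhs in GRAMMAR[first]:
            yield from expand(rhs + rest, words, pos)


def recognise(sentence):
    words = sentence.split()
    return any(end == len(words) for end in expand(["S"], words, 0))


print(recognise("Vineet books a flight in the morning"))
print(recognise("joins Vineet the"))
```

This works here because the fragment has no left-recursive rules; a rule such as $NP \rightarrow NP\;PP$ would make naive top-down expansion loop forever.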

The CKY algorithm works on a CFG in Chomsky Normal Form (rules $A \rightarrow BC$ or $A \rightarrow a$).
Builds triangular table $T[i,j]$ storing the non-terminals that span the words between positions $i$ and $j$ (words $i+1 \dots j$, with $0 \le i < j \le n$) and their best probabilities.
Recurrence (probabilistic):
$P(A,i,j) = \max_{i<k<j,\;B,C} \big[ P(A \rightarrow BC) \times P(B,i,k) \times P(C,k,j) \big]$, with base case $P(A,i-1,i) = P(A \rightarrow w_i)$.
Analogy to Viterbi: each cell encodes the highest-probability derivation of that span; yields the most likely parse in $O(n^3|G|)$.
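A compact sketch of the probabilistic table fill, using fencepost positions so that cell `[i][j]` covers `words[i:j]`. The toy CNF grammar and its probabilities are hypothetical (the unit rule for proper nouns is folded into a lexical NP rule, as CNF requires), and backpointers are omitted, so this returns probabilities rather than the tree itself.

```python
def cky(words, lexical, binary):
    """lexical: {(A, word): p}; binary: {(A, B, C): p} for rules A -> B C.

    Returns table where table[i][j] maps each non-terminal spanning
    words[i:j] to the probability of its best derivation.
    """
    n = len(words)
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                    # base case: A -> word
        for (A, word), p in lexical.items():
            if word == w and p > table[i][i + 1].get(A, 0.0):
                table[i][i + 1][A] = p
    for length in range(2, n + 1):                   # span length
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                # split point
                for (A, B, C), p in binary.items():
                    cand = p * table[i][k].get(B, 0.0) * table[k][j].get(C, 0.0)
                    if cand > table[i][j].get(A, 0.0):
                        table[i][j][A] = cand
    return table


# Hypothetical CNF grammar.
lexical = {("NP", "Vineet"): 0.3, ("V", "joins"): 1.0,
           ("Det", "the"): 1.0, ("N", "board"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0, ("NP", "Det", "N"): 0.7}

table = cky("Vineet joins the board".split(), lexical, binary)
print(table[2][4])   # NP over "the board"
print(table[0][4])   # S over the whole sentence: 1.0 * 0.3 * (1.0 * 1.0 * 0.7)
```

The three nested loops over `length`, `i`, and `k` give the $n^3$ factor; the inner loop over rules gives the $|G|$ factor.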

Accurate POS tags critical because grammar rules reference POS categories.
Earlier lecture: HMM and Viterbi for sequence labelling.
Ambiguity example: leave (N or V) – syntactic context + probability decides.

Augmented Transition Networks (ATN): finite-state chassis with registers & tests; probabilities can augment arcs.
Conceptual Dependency (CD) graphs: semantic roles (actor, object, instrument) → an early semantic representation, related in spirit to dependency graphs.
Finite-State Models & Logic Programming also mentioned as complementary parsing paradigms.

Computational trade-offs influence parser deployed in real systems (voice assistants vs offline analysis).
Over-reliance on greedy decisions can encode bias present in training corpora; probabilistic and neural scorers must be audited.
Parsers power downstream tasks: information extraction, question answering, code generation – erroneous parses propagate.

PCFG constraint: for each non-terminal $A$, $\sum_{\alpha} P(A \rightarrow \alpha)=1$.
Transition parser complexity: $\Theta(n)$ operations; graph-based: $\Theta(n^2)$ edge scoring plus MST inference in $O(n^2)$ to $O(n^3)$, depending on the Chu-Liu/Edmonds implementation.
CKY dynamic-programming recurrence given above; total complexity $O(n^3|G|)$.

Understand difference between dependency (head–dependent) and constituency (phrase structure) representations.
Memorise transition actions and when non-projectivity forces shift from transition-based to graph-based.
Be able to convert simple English sentences into CFG derivations and dependency trees.
Practise CKY table filling on 5-6 word sentences; track probabilities.
Relate compiler LR-parsing knowledge to NLP shift-reduce strategies.
Keep POS tagging accuracy in mind; errors at tag level cascade to parse level.

End of lecture – 5-minute break announced (03:04 → 03:09).