Dependency Parsing – Transition-Based vs. Graph-Based

Dependency Parsing Approaches

• Two mainstream algorithms convert a sentence into a dependency tree:
– Transition-based parsing (stack/LR style, greedy, efficient, only guaranteed for projective trees = trees whose arcs never cross).
– Graph-based parsing (global search for the highest-scoring tree; can handle non-projectivity but is costlier to implement/run).
• Choice affects coverage (projective vs. non-projective languages), speed, memory & code complexity.

Transition-Based Parsing (Stack, Greedy)

• Conceptually similar to LR shift–reduce parsing in compilers.
• Parser state («configuration») = ⟨Stack, Input buffer, Set of built dependencies⟩.
• Two data alphabets:
– Input buffer alphabet = surface words.
– Stack alphabet = symbolic summaries; top-of-stack symbol represents entire stack to avoid repeated popping (stack is destructive memory).
• Parsing loop (left-to-right, single pass):
1. Inspect top two stack items & next buffer word.
2. Use an oracle (rule classifier) to decide one of three transitions.
3. Execute transition, update configuration.
4. Repeat until stack = {ROOT} and buffer empty.
• Greedy aspect: the transition chosen by the oracle is applied at once and the parser never backtracks → it fails on non-projective sentences, whose correct tree would require postponing a reduction.

Transition Operations: Shift, Left-Arc, Right-Arc

• Shift (SH): move first word from buffer to top of stack; no dependency edge added.
• Left-Arc (LA):
– Precondition: stack holds at least two items (and the second item is not ROOT).
– Action: assert dependency «stack[0] ──▶ stack[1]» (the word beneath the top depends on the top; arrows point head → dependent).
– Pop the LOWER of the two (L removes the Lower).
• Right-Arc (RA):
– Precondition: stack holds at least two items.
– Action: assert dependency «stack[1] ──▶ stack[0]» (the top depends on the second-top).
– Pop the TOP item (R removes the Top).
• Oracle learns when each transition is correct.

Worked Example: "Book me the morning flight"

(Edges are written head ▶ dependent.)

| Step | Stack (top first)            | Buffer                     | Action | New Edge                |
|------|------------------------------|----------------------------|--------|-------------------------|
| 0    | ROOT                         | book me the morning flight | SH     |                         |
| 1    | book ROOT                    | me the morning flight      | SH     |                         |
| 2    | me book ROOT                 | the morning flight         | RA     | book ▶ me (OBJ)         |
| 3    | book ROOT                    | the morning flight         | SH     |                         |
| 4    | the book ROOT                | morning flight             | SH     |                         |
| 5    | morning the book ROOT        | flight                     | SH     |                         |
| 6    | flight morning the book ROOT | (empty)                    | LA     | flight ▶ morning (NMOD) |
| 7    | flight the book ROOT         | (empty)                    | LA     | flight ▶ the (DET)      |
| 8    | flight book ROOT             | (empty)                    | RA     | book ▶ flight (DOBJ)    |
| 9    | book ROOT                    | (empty)                    | RA     | ROOT ▶ book (ROOT)      |
| 10   | ROOT                         | (empty)                    | parsing complete |               |

Dependency graph: ROOT→book; book→{me, flight}; flight→{the, morning}.
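The transition sequence above can be replayed mechanically. A minimal sketch follows; the function names and the (stack, buffer, arcs) representation are illustrative choices, not taken from any particular parser implementation:

```python
def shift(stack, buf, arcs):
    # SH: move the first buffer word onto the stack; no edge added.
    return stack + [buf[0]], buf[1:], arcs

def left_arc(stack, buf, arcs, label):
    # LA: the top of the stack heads the word beneath it; pop the lower word.
    head, dep = stack[-1], stack[-2]
    return stack[:-2] + [head], buf, arcs | {(head, dep, label)}

def right_arc(stack, buf, arcs, label):
    # RA: the word beneath the top heads the top; pop the top word.
    head, dep = stack[-2], stack[-1]
    return stack[:-1], buf, arcs | {(head, dep, label)}

# The ten actions from the worked example, in order.
actions = [(shift, None), (shift, None), (right_arc, "OBJ"),
           (shift, None), (shift, None), (shift, None),
           (left_arc, "NMOD"), (left_arc, "DET"),
           (right_arc, "DOBJ"), (right_arc, "ROOT")]

stack, buf, arcs = ["ROOT"], "book me the morning flight".split(), set()
for op, label in actions:
    args = (stack, buf, arcs) if label is None else (stack, buf, arcs, label)
    stack, buf, arcs = op(*args)

print(stack, buf)    # ['ROOT'] []  -> parsing complete
print(sorted(arcs))  # the five edges of the dependency graph
```

Running it ends with the stack reduced to ROOT, an empty buffer, and exactly the five edges listed above.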

Training a Transition Parser (Oracle & Supervision)

• Treebanks give only final correct trees, not intermediate (config, transition) pairs.
• Solution: simulate the parser on each gold tree; at every configuration consult an oracle that outputs the transition(s) guaranteed to still reach the gold tree (a static oracle suffices on gold configurations; a dynamic oracle also covers off-path ones); collect (configuration, transition) pairs.
• Learn classifier (Perceptron, SVM, Neural net) that predicts transition from features (stack top/below, buffer head, POS tags, etc.).
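The simulation step can be sketched as follows, assuming arc-standard transitions and a static oracle consulted only on gold configurations; the toy feature tuple and function names are illustrative assumptions:

```python
def oracle(stack, buf, arcs, heads):
    # heads[d] = gold head index of word d; index 0 is ROOT
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]          # top, second-top
        deps = lambda w: [d for d, h in heads.items() if h == w]
        done = lambda w: all((w, d) in arcs for d in deps(w))
        if s2 != 0 and heads[s2] == s1 and done(s2):
            return "LA"                        # s2 is a finished dependent of s1
        if heads[s1] == s2 and done(s1):
            return "RA"                        # s1 is a finished dependent of s2
    return "SH"

def training_pairs(words, heads):
    stack, buf, arcs, pairs = [0], list(range(1, len(words))), set(), []
    while buf or len(stack) > 1:
        act = oracle(stack, buf, arcs, heads)
        feats = (words[stack[-1]],                            # stack top
                 words[stack[-2]] if len(stack) > 1 else None,  # below top
                 words[buf[0]] if buf else None)              # buffer head
        pairs.append((feats, act))
        if act == "SH":
            stack.append(buf.pop(0))
        elif act == "LA":
            arcs.add((stack[-1], stack[-2])); del stack[-2]
        else:
            arcs.add((stack[-2], stack[-1])); stack.pop()
    return pairs

words = ["ROOT", "book", "me", "the", "morning", "flight"]
heads = {1: 0, 2: 1, 3: 5, 4: 5, 5: 1}   # gold tree from the worked example
seq = [act for _, act in training_pairs(words, heads)]
print(seq)  # ['SH', 'SH', 'RA', 'SH', 'SH', 'SH', 'LA', 'LA', 'RA', 'RA']
```

On the example sentence the oracle reproduces exactly the ten transitions from the table, and each (feature, transition) pair becomes one training instance for the classifier.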

Graph-Based Parsing (MST Search)

• Parser treats every possible ordered pair (h, d) as a directed edge candidate with score s(h, d).
• The best tree T* maximises the total edge score:
  T* = arg max_{T ∈ 𝒯(S)} Σ_{(h,d) ∈ T} s(h, d)
• Finding T* is equivalent to a Maximum Spanning Tree (MST) problem on a directed graph whose nodes are the words + ROOT.
• Non-projectivity naturally allowed because MST need not obey planar ordering.
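For toy sentence lengths the arg-max can be checked by brute force: enumerate every head assignment, keep only valid trees, and take the highest total score. A sketch, with a made-up score table for the demo:

```python
import itertools

def reaches_root(heads):
    # Valid tree: following head pointers from every word reaches ROOT (0)
    # without revisiting a node (i.e. no cycles).
    for d in heads:
        seen, cur = set(), d
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def best_tree(score, n):
    # score: dict (head, dependent) -> weight; words are 1..n, ROOT is 0.
    best, best_heads = float("-inf"), None
    for choice in itertools.product(range(n + 1), repeat=n):
        heads = {d: choice[d - 1] for d in range(1, n + 1)}
        if any(h == d for d, h in heads.items()) or not reaches_root(heads):
            continue
        total = sum(score.get((h, d), float("-inf")) for d, h in heads.items())
        if total > best:
            best, best_heads = total, heads
    return best_heads, best

score = {(0, 1): 5, (0, 2): 1, (1, 2): 10, (2, 1): 9}
print(best_tree(score, 2))  # ({1: 0, 2: 1}, 15)
```

This is exponential in sentence length, which is exactly why the MST reduction below matters: it finds the same maximiser in polynomial time.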

Chu-Liu / Edmonds Algorithm (Directed MST)

  1. For each vertex v ≠ ROOT, choose the highest-scoring incoming edge; put all such edges in set F.

  2. If F is acyclic ⇒ it is the MST; return.

  3. Else contract each cycle into a super-node and adjust the weights of edges entering the cycle: w′(u,v) = w(u,v) − w(bestIn(v)).

  4. Recurse on the contracted graph to obtain the MST of super-nodes.

  5. Expand cycles, keeping each cycle's best-in edges except the one displaced by the chosen entering edge.

• Runs in O(EV) time in this naive form; practical for sentence lengths (< 100 words).
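The five steps can be sketched as a compact recursive implementation, written for clarity rather than speed; the graph encoding (a dict of (head, dependent) → score) is an assumption for the demo:

```python
def find_cycle(best_in, root):
    # Follow best-in pointers from each node; a revisited node lies on a cycle.
    for start in best_in:
        seen, cur = set(), start
        while cur != root and cur not in seen:
            seen.add(cur)
            cur = best_in[cur]
        if cur != root:
            cycle, v = {cur}, best_in[cur]
            while v != cur:
                cycle.add(v)
                v = best_in[v]
            return cycle
    return None

def cle(nodes, score, root=0):
    # Step 1: greedily pick the best incoming edge for every non-root node.
    best_in = {d: max((h for h in nodes if h != d and (h, d) in score),
                      key=lambda h: score[(h, d)])
               for d in nodes if d != root}
    cycle = find_cycle(best_in, root)
    if cycle is None:                         # Step 2: acyclic -> done
        return best_in
    # Step 3: contract the cycle into super-node c, reweighting entering edges.
    c = max(nodes) + 1
    new_nodes = [n for n in nodes if n not in cycle] + [c]
    new_score, entry, leave = {}, {}, {}
    for (h, d), w in score.items():
        if h in cycle and d in cycle:
            continue
        if d in cycle:                        # edge entering the cycle
            adj = w - score[(best_in[d], d)]
            if new_score.get((h, c), float("-inf")) < adj:
                new_score[(h, c)], entry[h] = adj, d
        elif h in cycle:                      # edge leaving the cycle
            if new_score.get((c, d), float("-inf")) < w:
                new_score[(c, d)], leave[d] = w, h
        else:
            new_score[(h, d)] = w
    # Step 4: recurse on the contracted graph.
    sub = cle(new_nodes, new_score, root)
    # Step 5: expand; keep cycle edges except the one the entry edge displaces.
    heads = {d: (leave[d] if h == c else h) for d, h in sub.items() if d != c}
    u = sub[c]                                # external head chosen for c
    for v in cycle:
        heads[v] = best_in[v] if v != entry[u] else u
    return heads

# Greedy best-in picks the 1<->2 cycle here; contraction resolves it.
score = {(0, 1): 5, (0, 2): 1, (1, 2): 10, (2, 1): 9}
print(cle([0, 1, 2], score))  # {1: 0, 2: 1}
```

In the demo, greedy selection picks edges (2,1) and (1,2), forming a cycle; after contraction the recursion chooses to enter the cycle at word 1, yielding the tree ROOT → 1 → 2 with total score 15.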

Edge Scoring & Feature Design

• Linear model: s(h, d) = wᵀ φ(h, d).
• Typical feature templates:
– Word forms, lemmas, POS of head & dependent.
– Direction & distance (tokens) between head/dependent.
– Surrounding context words/POS tags (±1, ±2).
– Dependency label under consideration.
– Pre-trained word embeddings (concatenate or average) ⇒ captures semantics.
• Weights w trained with structured perceptron, MIRA, or CRF-like objectives using gold trees.
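A sparse-feature version of the templates above might look like this; the template names and distance bucketing are illustrative assumptions, not a standard feature set:

```python
def phi(words, tags, h, d):
    # words/tags are indexed so that position 0 is ROOT.
    return {
        f"hw={words[h]}": 1.0, f"dw={words[d]}": 1.0,  # word forms
        f"hp={tags[h]}": 1.0, f"dp={tags[d]}": 1.0,    # POS tags
        f"hp|dp={tags[h]}|{tags[d]}": 1.0,             # head/dependent POS pair
        f"dir={'R' if h < d else 'L'}": 1.0,           # attachment direction
        f"dist={min(abs(h - d), 5)}": 1.0,             # bucketed token distance
    }

def s(w, feats):
    # Linear model: s(h, d) = w . phi(h, d), as a sparse dot product.
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

words = ["<ROOT>", "book", "flight"]
tags = ["<ROOT>", "VB", "NN"]
w = {"hp=VB": 1.0, "dir=R": 0.5}   # toy weights
print(s(w, phi(words, tags, 1, 2)))  # 1.5
```

Dense embedding features would replace the indicator dict with real-valued vectors, but the scoring interface stays the same.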

Evaluation Metrics (UAS & LAS)

• Unlabeled Attachment Score (UAS): % of words whose predicted head = gold head (label ignored).
• Labeled Attachment Score (LAS): % of words whose predicted head AND dependency label both correct.
• Example «Book the flight through Houston»
– Gold: book(ROOT)▶flight(DOBJ); flight▶Houston(NMOD).
– System: book▶flight(xcomp); flight▶Houston(NMOD).
– UAS = 3/3 = 1.0 (all heads right); LAS = 2/3 ≈ 0.67 (label xcomp wrong).
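The two metrics reduce to a few lines of counting; the dict layout below is an assumption for the demo:

```python
def attachment_scores(gold, pred):
    # gold/pred map each word to its (head, label) pair.
    n = len(gold)
    uas = sum(pred[w][0] == hl[0] for w, hl in gold.items()) / n  # head only
    las = sum(pred[w] == hl for w, hl in gold.items()) / n        # head + label
    return uas, las

gold = {"book": ("ROOT", "root"), "flight": ("book", "dobj"),
        "Houston": ("flight", "nmod")}
pred = {"book": ("ROOT", "root"), "flight": ("book", "xcomp"),
        "Houston": ("flight", "nmod")}
print(attachment_scores(gold, pred))  # UAS = 1.0, LAS = 2/3
```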

Building Dependency Trees in Practice

• Three practical routes:
1. Transition parser with greedy or beam search; fastest (≈ linear time) but limited to projective trees.
2. Graph-based/MST parser; O(n²) or O(n³) depending on features; handles non-projective trees.
3. Convert a phrase-structure (PCFG) parse to a dependency tree via head-percolation rules.
• Hybrid systems combine neural encoders (Bi-LSTM, Transformer) to produce edge scores fed to MST or transition classifier.
• Real-world relevance: used in machine translation, information extraction, conversational agents, and even code-generation where NL utterances resemble compiler input.