Syntax is the study of sentence structure; it answers the question "Who does what to whom?"
Several theories of syntax exist, and they share many commonalities: Government and Binding (GB), the Minimalist Program (MP), Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Categorial Grammar, and Dependency Grammar.
Why Syntax Matters
Theoretical syntacticians focus on grammaticality.
Grammaticality is also relevant for NLP applications such as text generation and grammar checking.
Parsing provides scaffolding for semantic analysis, aiding opinion mining, information extraction, and machine translation.
Basic Principles of Syntax
Form vs. Function
Syntactic form is described in terms of parts of speech and phrase types (NP, VP).
Syntactic function describes roles in a sentence (Subject, Object, Adverbial).
Constituents
Words are organized into groupings that function as a whole.
Constituent status is established through linguistic constituency tests, such as substitution, movement, and coordination.
Phrase Structure Grammar (PSG)
Captures constituent status and ordering using a context-free grammar.
Example rules: S → NP VP, NP → D N, VP → V NP
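The three example rules can be tried out on a sentence. The following is a minimal sketch, assuming a tiny hypothetical lexicon (the word lists are chosen for illustration and are not part of the rules) and a naive top-down strategy. It is one simple way to apply a context-free grammar, not a practical parser.

```python
# Phrase structure rules from the example, plus a hypothetical toy lexicon.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {  # assumed for illustration only
    "D": {"the", "my"},
    "N": {"dog", "homework"},
    "V": {"ate"},
}


def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` can span words[start:end]."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield (symbol, words[start]), start + 1
        return
    for rhs in RULES.get(symbol, []):
        # Expand the right-hand side left to right, threading the position.
        partial = [([], start)]
        for child in rhs:
            partial = [
                (trees + [tree], end)
                for trees, pos in partial
                for tree, end in parse(child, words, pos)
            ]
        for trees, end in partial:
            yield (symbol, *trees), end


def parse_sentence(sentence):
    """Return the first complete parse tree rooted in S, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


tree = parse_sentence("The dog ate my homework")
print(tree)                             # nested tuples mirroring the constituents
print(parse_sentence("dog the ate"))    # None: no derivation from S
```

The nested tuples make the constituents explicit: "the dog" and "my homework" each form an NP, and "ate my homework" forms a VP. Top-down expansion of this kind only terminates because the toy grammar has no left-recursive rules.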
Dependency Grammar (DG)
An alternative to phrase structure.
Syntactic functions are central.
Syntactic structure consists of lexical items linked by binary asymmetric dependencies.
Increasing interest in dependency-based parsing for NLP.
Useful in relation extraction, question answering, and sentiment analysis.
Constituency vs. Relations
DG is based on relationships between words (dependency relations).
PSG is based on groupings or constituents.
Simple Relation Example
In the sentence "The dog ate my homework", relations include:
ate →subj The dog
ate →obj my homework
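At the level of individual words, these relations hold between "ate" and the nouns "dog" and "homework", with the determiners attached to their nouns. A minimal sketch of that word-level view, assuming a "det" label for the determiner arcs (only subj and obj come from the example):

```python
# Word-level dependencies for "The dog ate my homework" as
# (head, label, dependent) triples. "subj" and "obj" come from the example;
# the "det" label is an assumption added to complete the analysis.
relations = [
    ("ate", "subj", "dog"),
    ("ate", "obj", "homework"),
    ("dog", "det", "The"),
    ("homework", "det", "my"),
]


def dependents(head, rels):
    """Return {label: dependent} for all arcs leaving `head`."""
    return {label: dep for h, label, dep in rels if h == head}


def subtree(head, rels, order):
    """Collect `head` and everything below it, in sentence order."""
    words = {head}
    frontier = [head]
    while frontier:
        current = frontier.pop()
        for h, _, dep in rels:
            if h == current and dep not in words:
                words.add(dep)
                frontier.append(dep)
    return [w for w in order if w in words]


sentence = ["The", "dog", "ate", "my", "homework"]
print(dependents("ate", relations))               # {'subj': 'dog', 'obj': 'homework'}
print(subtree("dog", relations, sentence))        # ['The', 'dog']
print(subtree("homework", relations, sentence))   # ['my', 'homework']
```

The subtrees under "dog" and "homework" recover the word groups "The dog" and "my homework". Words serve as identifiers here only because no word occurs twice in the sentence; token positions would be needed in general.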
Comparison
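Dependency structures explicitly represent head-dependent relations (directed arcs), functional categories (arc labels), and possibly some structural categories (parts of speech).
Phrase structures explicitly represent phrases (nonterminal nodes), structural categories (nonterminal labels), and possibly some functional categories (grammatical functions).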
Criteria for Heads and Dependents
The criteria below concern a construction C that consists of a head H and a dependent D.
H determines the syntactic category of C; H can replace C.
H determines the semantic category of C; D specifies H.
H is obligatory; D may be optional.
The form of D depends on H (agreement or government).
The linear position of D is specified with reference to H.
Some Tricky Cases
Complex verb groups
Subordinate clauses
Coordination
Prepositional phrases
Punctuation
Dependency Graphs
Defined as a directed graph G with:
A set V of nodes.
A set E of arcs (edges).
Nodes are labeled with word forms and arcs with dependency types.
Notation: i → j ≡ (i, j) ∈ E; i →∗ j means that j can be reached from i by following zero or more arcs.
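One way to hold such a graph in memory is sketched below. It assumes that nodes are token positions numbered from 0 and that arcs are stored as (head, dependent) pairs; the class name, method names, and the "det" label are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DependencyGraph:
    """A labeled directed graph over the token positions of a sentence."""
    words: list                                  # word forms; position = node id
    arcs: set = field(default_factory=set)       # E as (head, dependent) pairs
    labels: dict = field(default_factory=dict)   # (head, dependent) -> type

    @property
    def nodes(self):
        """V: one node per token, numbered from 0."""
        return set(range(len(self.words)))

    def add_arc(self, head, dependent, label):
        """Add the labeled arc head -> dependent."""
        if head not in self.nodes or dependent not in self.nodes:
            raise ValueError("arc endpoints must be nodes of the graph")
        self.arcs.add((head, dependent))
        self.labels[(head, dependent)] = label

    def has_arc(self, i, j):
        """i -> j holds exactly when (i, j) is in E."""
        return (i, j) in self.arcs


g = DependencyGraph(["The", "dog", "ate", "my", "homework"])
g.add_arc(2, 1, "subj")   # ate -> dog
g.add_arc(2, 4, "obj")    # ate -> homework
g.add_arc(1, 0, "det")    # dog -> The (label assumed for illustration)
g.add_arc(4, 3, "det")    # homework -> my (label assumed for illustration)

print(g.has_arc(2, 1), g.has_arc(1, 2))   # True False
```

Keeping E as a set of pairs makes the membership test for i → j a direct lookup, which matches the notation used in the definition.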
Formal Properties of Dependency Graphs
antisymmetric: if A → B, then B ↛ A
antireflexive: if A → B, then B ≠ A
antitransitive: if A → B and B → C, then A ↛ C
labeled: every arc → carries a label r (its dependency type)
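These properties can be checked mechanically on a set of arcs. The sketch below assumes arcs are (head, dependent) pairs and labels are kept in a dictionary keyed by arc; the function names and the placeholder labels r1 and r2 are illustrative.

```python
def is_antisymmetric(arcs):
    """If A -> B, then not B -> A."""
    return all((b, a) not in arcs for a, b in arcs)


def is_antireflexive(arcs):
    """If A -> B, then B != A."""
    return all(a != b for a, b in arcs)


def is_antitransitive(arcs):
    """If A -> B and B -> C, then not A -> C."""
    return all(
        (a, c) not in arcs
        for a, b in arcs
        for b2, c in arcs
        if b == b2
    )


def is_labeled(arcs, labels):
    """Every arc has an entry in the label dictionary."""
    return all(arc in labels for arc in arcs)


arcs = {("A", "B"), ("B", "C")}
labels = {("A", "B"): "r1", ("B", "C"): "r2"}   # placeholder labels

print(is_antisymmetric(arcs), is_antireflexive(arcs),
      is_antitransitive(arcs), is_labeled(arcs, labels))   # True True True True
print(is_antitransitive(arcs | {("A", "C")}))              # False
```

Adding the arc A → C breaks antitransitivity because A already reaches C through B.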
Formal Conditions on Dependency Graphs
G is (weakly) connected: ignoring arc direction, there is a path between every pair of nodes; in particular, for every node i there is a node j such that i → j or j → i.
G is acyclic: If i → j then not j →∗ i.
G obeys the single-head constraint: If i → j, then not k → j, for any k ≠ i.
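The three conditions can be tested on a concrete graph. The sketch below assumes nodes are token positions and arcs are (head, dependent) pairs; the function names are illustrative. Connectedness is tested in the usual graph-theoretic sense, by ignoring arc direction and checking that every node can be reached from a starting node.

```python
from collections import defaultdict


def is_weakly_connected(nodes, arcs):
    """True if all nodes are linked when arc direction is ignored."""
    if not nodes:
        return True
    neighbours = defaultdict(set)
    for head, dep in arcs:
        neighbours[head].add(dep)
        neighbours[dep].add(head)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == set(nodes)


def is_acyclic(nodes, arcs):
    """True if no node can reach itself by following arcs."""
    children = defaultdict(set)
    for head, dep in arcs:
        children[head].add(dep)
    state = {}  # node -> "active" (on the current path) or "done"

    def visit(node):
        state[node] = "active"
        for child in children[node]:
            if state.get(child) == "active":
                return False          # found an arc back into the current path
            if child not in state and not visit(child):
                return False
        state[node] = "done"
        return True

    return all(visit(n) for n in nodes if n not in state)


def has_single_heads(arcs):
    """True if no node is the dependent of more than one arc."""
    deps = [dep for _, dep in arcs]
    return len(deps) == len(set(deps))


# "The dog ate my homework", positions 0..4; attaching the determiners
# to their nouns is an assumed analysis.
nodes = {0, 1, 2, 3, 4}
arcs = {(2, 1), (2, 4), (1, 0), (4, 3)}

print(is_weakly_connected(nodes, arcs), is_acyclic(nodes, arcs),
      has_single_heads(arcs))                  # True True True
print(is_acyclic(nodes, arcs | {(0, 2)}))      # False: 2 -> 1 -> 0 -> 2
print(has_single_heads(arcs | {(4, 1)}))       # False: node 1 has two heads
```

A graph that passes all three checks is a tree over the tokens, with the one node that has no head acting as the root.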
Projectivity
G is projective: if i → j, then i →∗ k for every k such that i < k < j or j < k < i.
Non-projective structures are needed for long-distance dependencies and free word order.
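The projectivity condition translates almost directly into code. The sketch below assumes arcs are (head, dependent) pairs over integer token positions; the function name and both example analyses are illustrative assumptions.

```python
def is_projective(arcs):
    """Check: for every arc i -> j, each k strictly between i and j is
    reachable from i."""
    children = {}
    for head, dep in arcs:
        children.setdefault(head, set()).add(dep)

    def descendants(node):
        """All k reachable from node, including node itself (zero steps)."""
        seen, stack = {node}, [node]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    for head, dep in arcs:
        reachable = descendants(head)
        lo, hi = sorted((head, dep))
        if any(k not in reachable for k in range(lo + 1, hi)):
            return False
    return True


# "The dog ate my homework" (positions 0..4), assumed analysis.
projective = {(2, 1), (2, 4), (1, 0), (4, 3)}

# "A hearing is scheduled on the issue today" (positions 0..7), under one
# assumed analysis in which "on the issue" modifies "hearing".
non_projective = {(3, 1), (1, 0), (3, 2), (1, 4), (4, 6), (6, 5), (3, 7)}

print(is_projective(projective))       # True
print(is_projective(non_projective))   # False
```

In the second sentence the arc from "hearing" (1) to "on" (4) spans "is" and "scheduled", neither of which is reachable from "hearing", so the arc crosses other arcs when the graph is drawn above the sentence.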
Treebanks
Collections of sentences manually annotated with syntactic analysis.
Used to train data-driven NLP tools.
Examples: Penn Treebank, Prague Dependency Treebank, NEGRA and TüBa-D/Z (German), Penn Chinese Treebank, Norwegian Dependency Treebank, Universal Dependencies.
Norwegian Dependency Treebank (NDT)
Completed in 2014 by Språkbanken at the National Library of Norway.
Approximately 600,000 tokens of Bokmål and Nynorsk text.
Enables training of taggers and parsers for Norwegian.
Converted to Universal Dependencies.
Universal Dependencies
Harmonized dependency treebanks for more than 100 languages.
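Universal Dependencies treebanks are distributed in the tab-separated CoNLL-U format, with one token per line and ten columns. The sketch below reads the head and relation columns from a hand-written fragment; the annotation is an illustration written for this example and is not taken from any released treebank, and the function name is hypothetical.

```python
# Hand-written CoNLL-U fragment (illustrative, not from a released treebank).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = "\n".join([
    "# text = The dog ate my homework",
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tate\teat\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tmy\tmy\tPRON\t_\t_\t5\tnmod:poss\t_\t_",
    "5\thomework\thomework\tNOUN\t_\t_\t3\tobj\t_\t_",
])


def read_arcs(block):
    """Return (head_form, deprel, dependent_form) triples for one sentence."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    forms = {0: "ROOT"}                   # HEAD 0 marks the root of the sentence
    forms.update({idx: form for idx, form, _, _ in rows})
    return [(forms[head], rel, form) for _, form, head, rel in rows]


for head, rel, dep in read_arcs(CONLLU):
    print(f"{head} -{rel}-> {dep}")
```

Each token names exactly one head, so the format encodes the single-head constraint by construction.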