Chapter 3 focuses on describing syntax and semantics in programming languages.
Two complementary facets of a language definition:
Syntax – the formal structure of expressions, statements, program units.
Semantics – the meaning associated with those structures.
Accurate definitions are vital for:
Language designers (to reuse or extend ideas).
Implementers (compiler/interpreter writers).
Programmers (end-users who must understand what programs do).
Syntax ≠ semantics; form versus meaning.
Static semantics: compile-time rules not captured purely by context-free grammars (CFGs).
Dynamic semantics: run-time meaning of constructs.
Sentence: a string of characters over an alphabet.
Language: a set of sentences.
Lexeme: the lowest-level syntactic unit (e.g. * , sum, begin).
Token: a category/class of lexemes (e.g. identifier, integer-literal).
Two complementary approaches:
Recognizers: algorithms/devices that accept valid sentences and reject invalid ones (e.g. the parser in a compiler).
Generators: mechanisms that can systematically produce every valid sentence; comparing a candidate sentence against the generator’s patterns shows whether it is syntactically valid.
Invented by Noam Chomsky (1950s) to model natural languages; adopted for programming languages.
Backus–Naur Form (BNF) (1959, John Backus) is notation equivalent to CFGs.
Uses abstractions (nonterminals) to represent classes of syntactic structures.
Terminals = lexemes/tokens.
Rules (productions) have the form \text{LHS} \rightarrow \text{RHS} where LHS is a nonterminal.
Nonterminals are often shown in ⟨angle brackets⟩.
A grammar is a finite, non-empty set of rules plus a distinguished start symbol.
Nonterminals may have multiple alternative RHSs (use |).
Repetitive constructs are expressed recursively.
Example: \langle ident_list \rangle \rightarrow ident \; | \; ident , \langle ident_list \rangle
Derivation: repeated rule application beginning with the start symbol and terminating in a sentence (all terminals).
Grammar:
\langle program \rangle \rightarrow \langle stmts \rangle
\langle stmts \rangle \rightarrow \langle stmt \rangle \; | \; \langle stmt \rangle ; \langle stmts \rangle
\langle stmt \rangle \rightarrow \langle var \rangle = \langle expr \rangle
\langle var \rangle \rightarrow a \; | \; b \; | \; c \; | \; d
\langle expr \rangle \rightarrow \langle term \rangle + \langle term \rangle \; | \; \langle term \rangle - \langle term \rangle
\langle term \rangle \rightarrow \langle var \rangle \; | \; const
Leftmost derivation for “a = b + const” shown step-by-step.
Sentential form: any intermediate string in a derivation.
Parse tree: hierarchical visualization of a derivation.
A grammar is ambiguous iff it generates at least one sentential form with ≥2 distinct parse trees.
Classic ambiguous arithmetic grammar:
\langle expr \rangle \rightarrow \langle expr \rangle \langle op \rangle \langle expr \rangle \; | \; const
\langle op \rangle \rightarrow / \; | \; -
Ambiguity removed by encoding precedence/associativity into separate nonterminals, e.g.
\langle expr \rangle \rightarrow \langle expr \rangle - \langle term \rangle \; | \; \langle term \rangle
\langle term \rangle \rightarrow \langle term \rangle / const \; | \; const
Ambiguous: \langle expr \rangle \rightarrow \langle expr \rangle + \langle expr \rangle \; | \; const
Left-associative, unambiguous version: \langle expr \rangle \rightarrow \langle expr \rangle + const \; | \; const
Naïve Java if-then-else grammar is ambiguous.
Solution partitions statements:
\langle stmt \rangle \rightarrow \langle matched \rangle \; | \; \langle unmatched \rangle
\langle matched \rangle \rightarrow if(\langle logic_expr \rangle)\; \langle stmt \rangle \; else \; \langle matched \rangle \; | \; \text{non-if stmt}
\langle unmatched \rangle \rightarrow if(\langle logic_expr \rangle)\; \langle stmt \rangle \; | \; if(\langle logic_expr \rangle)\; \langle matched \rangle \; else \; \langle unmatched \rangle
\langle assign \rangle \rightarrow \langle id \rangle = \langle expr \rangle
\langle id \rangle \rightarrow A \; | \; B \; | \; C
\langle expr \rangle \rightarrow \langle expr \rangle + \langle term \rangle \; | \; \langle term \rangle
\langle term \rangle \rightarrow \langle term \rangle * \langle factor \rangle \; | \; \langle factor \rangle
\langle factor \rangle \rightarrow ( \langle expr \rangle ) \; | \; \langle id \rangle
Adds metasymbols for conciseness:
Optional: [ … ]
Repetition (0 +): { … }
Grouped alternatives: (A | B).
Example conversion of above arithmetic grammar:
\langle expr \rangle \rightarrow \langle term \rangle \; {\; (+ | -) \; \langle term \rangle\;}
\langle term \rangle \rightarrow \langle factor \rangle \; {\; (* | /) \; \langle factor \rangle\;}
Recent stylistic innovations: separate lines for alternatives, ':' instead of $\Rightarrow$, keywords like “opt” and “oneof”.
Write BNF for Java Boolean expressions (operators &&, ||, !).
Convert Example 3.4 to EBNF.
Pure CFGs can’t express all static constraints (e.g., type compatibility, identifier declaration before use).
Attribute Grammar (AG) enriches CFG with attributes, functions, predicates.
A grammar G = (S,N,T,P) where:
N: nonterminals, T: terminals, P: productions, S: start symbol.
For each symbol x, attribute set A(x).
Two kinds of attributes:
Synthesized: computed from children up to parent, e.g. S(X0)=f(A(X1),\dots,A(X_n)).
Inherited: passed from parent/siblings downwards, e.g. I(Xj)=f(A(X0),\dots,A(X_n)).
Rules may contain predicates to ensure consistency (type checks, etc.).
Syntax (simplified):
\langle assign \rangle \rightarrow \langle var \rangle = \langle expr \rangle
\langle expr \rangle \rightarrow \langle var \rangle + \langle var \rangle \; | \; \langle var \rangle
\langle var \rangle \rightarrow A \; | \; B \; | \; C
Attributes:
actual_type (synthesized) for \<var>, \<expr>.
expected_type (inherited) for \<expr>.
Sample semantic rules:
Production: \langle expr \rangle \rightarrow \langle var \rangle1 + \langle var \rangle2
\langle expr \rangle.actual_type \leftarrow \langle var \rangle_1.actual_type
Predicate: \langle var \rangle1.actual_type == \langle var \rangle2.actual_type
\langle expr \rangle.expected_type == \langle expr \rangle.actual_type
Production: \langle var \rangle \rightarrow id
\langle var \rangle.actual_type \leftarrow lookup(\langle var \rangle.string)
Attribute evaluation strategies:
All inherited ⇒ top-down decoration.
All synthesized ⇒ bottom-up decoration.
Mixed ⇒ combination order needed.
Programmers need precise definitions.
Compiler writers must implement correctly.
Facilitates correctness proofs, compiler generators, ambiguity detection.
No single universally adopted formalism; three main families: operational, denotational, axiomatic.
Describes meaning by executing statements on an abstract (often virtual) machine; state changes => meaning.
Pure hardware or software interpreters are impractical/too concrete; instead, build:
Translator from source to idealized machine code.
Simulator of the idealized machine.
Two granularities:
Natural operational semantics (big-step) – directly maps whole constructs to final states.
Structural operational semantics (small-step) – models computation as sequences of elementary steps.
Pros: intuitive for manuals/teaching. Cons: cumbersome for rigorous formal proofs (e.g., IBM VDL for PL/I).
Developed by Scott & Strachey (1970); uses mathematical (often recursive) functions.
For each syntactic entity, define a corresponding mathematical denotation.
Program meaning ≡ mathematical function on program state.
s = {\langle i1,v1 \rangle,\dots,\langle in,vn \rangle} (mapping identifiers to values).
VARMAP(ij, s) = vj retrieves a variable’s value.
Grammar:
\langle dec_num \rangle \rightarrow '0'|\dots|'9' \; | \; \langle dec_num \rangle \; ('0'|\dots|'9')
Mapping function M_{dec}:
M{dec}('0') = 0, \; M{dec}('1') = 1, \dots
M{dec}(\langle dec_num \rangle 'k') = 10 \cdot M{dec}(\langle dec_num \rangle) + k where k\in{0..9}.
Domain: \mathbb Z \cup {error}.
Recursive definition M_e(\langle expr \rangle, s):
Constant ⇒ M_{dec}.
Variable ⇒ error if undefined else VARMAP.
Binary + or * expressions ⇒ evaluate operands then apply operator; propagate error if any.
Ma(x := E, s) = \begin{cases} error & \text{if } Me(E,s)=error\
s' & \text{otherwise}
\end{cases}
s' identical to s except x now maps to M_e(E,s).
Ml(while\ B\ do\ L, s) = \begin{cases} error & \text{if } Mb(B,s)=undef\
s & \text{if } Mb(B,s)=false\ error & \text{if } M{sl}(L,s)=error\
Ml(while\ B\ do\ L, M{sl}(L,s)) & \text{otherwise}
\end{cases}
Converts iteration into recursion for mathematical clarity.
Enables formal proofs of correctness and compiler generation.
Highly rigorous, but complex for everyday language users.
Based on predicate calculus; created for program verification.
Uses assertions about program states.
Notation: {P}\; statement \;{Q} where P precondition, Q postcondition.
Weakest precondition (WP): least restrictive P guaranteeing Q.
Example: statement a = b + 1; desired postcondition {a > 1}.
One valid precondition: b > 10.
WP: b > 0.
Assignment Axiom: {Q[x \leftarrow E]}\; x = E; \; {Q}
Consequence Rule: from {P'}S{Q'}, P \implies P', Q' \implies Q infer {P}S{Q}.
Sequence: if {P1}S1{P2} and {P2}S2{P3} then {P1}S1;S2{P3}.
Selection (if-then-else):
From {P \land B}S1{Q} and {P \land \lnot B}S2{Q} infer {P}if\ B\ then\ S1\ else\ S2{Q}.
While Loop (pretest): choose loop invariant I.
Requirements:
P \Rightarrow I (initially true).
{I \land B} S {I} (body preserves invariant).
I \land \lnot B \Rightarrow Q (termination yields postcondition).
Loop terminates (sometimes proven via a variant function).
Must be weak enough to hold before first iteration, yet strong enough (with exit condition) to ensure postcondition.
Powerful for proofs & reasoning; challenging to define full rule sets; less directly useful to casual programmers.
Operational: state changes given by algorithms (executable descriptions).
Denotational: state changes given by mathematical functions (high abstraction).
Axiomatic: focuses on logical assertions and correctness reasoning, independent of execution details.
BNF/CFG are standard meta-languages for syntax.
EBNF adds syntactic sugar for clarity.
Ambiguity must be resolved to guarantee unique parse trees (precise meaning).
Attribute Grammars extend CFGs to specify some static semantics (types, declarations).
Dynamic semantics can be described via three major formal methods:
Operational
Denotational
Axiomatic
Each method trades off between rigor, abstraction, and practicality for designers, implementers, and users.