Syntax
Form and structure of the program code
Semantics
Meaning of the program code
meaning of expressions, statements, control structures
Syntax and semantics provide a language’s definition
Users of a language definition
Implementers
Programmers (the users of the language)
A sentence is a string of characters over some alphabet
A language is a set of sentences
A lexeme is the lowest-level syntactic unit of a language
*, sum, begin
A token is a category of lexemes
identifiers
Generators (Grammars)
A device that generates the sentences of a language
One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator
Recognizers (Parsers)
A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language
syntax analysis part of a compiler
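As an illustration (not from the original notes), a minimal Python sketch of a recognizer for the toy context-free language { a^n b^n | n >= 1 }: it reads a string over the alphabet {a, b} and decides whether the string belongs to the language.

    def recognize(s):
        """Decide whether s belongs to the toy language { a^n b^n | n >= 1 }."""
        n = len(s)
        if n == 0 or n % 2 != 0:
            return False
        half = n // 2
        return s[:half] == "a" * half and s[half:] == "b" * half

    assert recognize("aabb") is True
    assert recognize("aab") is False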
Context-Free Grammars (CFG)
Developed by Noam Chomsky in the mid-1950s
Language generators, meant to describe the syntax of natural languages
Define a class of languages called context-free languages
Backus-Naur Form (1959)
Invented by John Backus to describe Algol 58
BNF and CFGs are equivalent concepts
In BNF, abstractions are used to represent classes of syntactic structures; they act like syntactic variables (also called nonterminal symbols, or just nonterminals)
Terminals are lexemes or tokens
A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals
An abstraction (or nonterminal symbol) can have more than one RHS
<stmt> -> <single_stmt> | begin <stmt_list> end
Nonterminals are often enclosed in angle brackets
Example of BNF rules:
<ident_list> → identifier | identifier , <ident_list>
<if_stmt> → if <logic_expr> then <stmt>
Grammar: a finite non-empty set of rules
A start symbol is a special element of the nonterminals of a grammar
A context-free grammar G is defined by the 4-tuple G = (V, Σ, R, S) where
V is a finite set; each element v in V is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence
Σ is a finite set of terminals, disjoint from V, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G
R is a finite relation from V to (V ∪ Σ)*, where the asterisk represents the Kleene star operation. The members of R are called the (rewrite) rules or productions of the grammar (R is also commonly symbolized by P)
S is the start variable (or the start symbol), used to represent the whole sentence (or program). It must be an element of V
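A hedged sketch of how the 4-tuple could be written down as plain Python data, using the <ident_list> rule shown earlier (the names are illustrative, not a standard API):

    # G = (V, Sigma, R, S) for the identifier-list grammar above
    V = {"<ident_list>"}                 # nonterminals (variables)
    Sigma = {"identifier", ","}          # terminals (the alphabet of the language)
    R = {                                # productions: LHS -> list of alternative RHSs
        "<ident_list>": [
            ["identifier"],
            ["identifier", ",", "<ident_list>"],
        ],
    }
    S = "<ident_list>"                   # start symbol, an element of V
    G = (V, Sigma, R, S)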
Syntactic lists are described using recursion
<ident_list> → ident
| ident ',' <ident_list>
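The recursion in the rule maps directly onto a recursive parsing function. A minimal sketch, assuming the input is already a list of tokens with "identifier" standing for any identifier token:

    def parse_ident_list(tokens, pos=0):
        """Match <ident_list> -> ident | ident ',' <ident_list>.
        Returns the position after the list, or raises SyntaxError."""
        if pos >= len(tokens) or tokens[pos] != "identifier":
            raise SyntaxError("identifier expected")
        pos += 1
        if pos < len(tokens) and tokens[pos] == ",":
            return parse_ident_list(tokens, pos + 1)   # recurse on the rest of the list
        return pos

    # e.g. parse_ident_list(["identifier", ",", "identifier"]) == 3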
A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols)
<program> → <stmts>
<stmts> → <stmt> | <stmt> ';' <stmts>
<stmt> → <var> '=' <expr>
<var> → 'a' | 'b' | 'c' | 'd'
<expr> → <term> '+' <term> | <term> '-' <term>
<term> → <var> | 'const'
<program> => <stmts> => <stmt>
=> <var> = <expr>
=> a = <expr>
=> a = <term> + <term>
=> a = <var> + <term>
=> a = b + <term>
=> a = b + const
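A small sketch that replays this leftmost derivation mechanically: the grammar is stored as data, and at each step the leftmost nonterminal in the sentential form is replaced by a chosen alternative (the choice indices are picked by hand to reproduce a = b + const).

    rules = {
        "<program>": [["<stmts>"]],
        "<stmts>":   [["<stmt>"], ["<stmt>", ";", "<stmts>"]],
        "<stmt>":    [["<var>", "=", "<expr>"]],
        "<var>":     [["a"], ["b"], ["c"], ["d"]],
        "<expr>":    [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
        "<term>":    [["<var>"], ["const"]],
    }

    def leftmost_step(form, choice):
        """Replace the leftmost nonterminal in the sentential form."""
        for i, sym in enumerate(form):
            if sym in rules:
                return form[:i] + rules[sym][choice] + form[i + 1:]
        return form  # already a sentence (terminals only)

    form = ["<program>"]
    print(" ".join(form))
    for choice in [0, 0, 0, 0, 0, 0, 1, 1]:   # hand-picked alternatives
        form = leftmost_step(form, choice)
        print("=>", " ".join(form))
    # final sentential form: a = b + const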
Every string of symbols in a derivation is a sentential form
A sentence is a sentential form that has only terminal symbols
A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded; accordingly, in a rightmost derivation the rightmost nonterminal is expanded
A derivation may be neither leftmost nor rightmost
A parse tree is a hierarchical representation of a derivation in the form of a tree
Example: parse tree for the derivation above (figure omitted)
A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees
<expr> → <expr> <op> <expr> | const
<op> → '+' | '-'
word: const - const + const
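A brute-force sketch (an illustrative helper, not from the original notes) that enumerates every parse tree the ambiguous grammar admits for this word; it finds two distinct trees, one grouping around '-' at the top and one grouping around '+':

    def expr_trees(tokens):
        """All parse trees for <expr> -> <expr> <op> <expr> | const."""
        if tokens == ["const"]:
            return ["const"]
        trees = []
        for i, tok in enumerate(tokens):
            if tok in ("+", "-"):                       # try every operator as the top <op>
                for left in expr_trees(tokens[:i]):
                    for right in expr_trees(tokens[i + 1:]):
                        trees.append((left, tok, right))
        return trees

    for t in expr_trees(["const", "-", "const", "+", "const"]):
        print(t)
    # ('const', '-', ('const', '+', 'const'))
    # (('const', '-', 'const'), '+', 'const')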
The grammar from the previous example, rewritten in unambiguous form
We get some form of operator precedence
<expr> → <expr> - <term> | <term>
<term> → <term> + const | const
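A recursive-descent sketch of this unambiguous grammar (function names are illustrative); the left recursion is implemented with loops, and 'const - const + const' now yields exactly one tree, with '+' grouped below '-' as the precedence rules dictate:

    def parse_term(tokens, pos):
        """<term> -> <term> + const | const (left recursion done as a loop)."""
        assert tokens[pos] == "const"
        tree, pos = "const", pos + 1
        while pos < len(tokens) and tokens[pos] == "+":
            assert tokens[pos + 1] == "const"
            tree, pos = (tree, "+", "const"), pos + 2
        return tree, pos

    def parse_expr(tokens, pos=0):
        """<expr> -> <expr> - <term> | <term>."""
        tree, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] == "-":
            right, pos = parse_term(tokens, pos + 1)
            tree = (tree, "-", right)
        return tree, pos

    print(parse_expr(["const", "-", "const", "+", "const"])[0])
    # ('const', '-', ('const', '+', 'const'))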
Operator associativity can also be indicated by a grammar
Example:
<expr> -> <expr> + <expr> | const
(ambiguous)
<expr> -> <expr> + const | const
(unambiguous)
Extended BNF (EBNF)
Optional parts are placed in brackets []
<proc_call> -> ident [(<expr_list>)]
Alternative parts of RHS are placed inside parentheses and separated via vertical bars
<term> -> <term> (+|-) const
Repetitions (0 or more) are placed inside braces {}
<ident> -> letter {letter|digit}
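The braces translate directly into a loop in a hand-written scanner. A minimal sketch for <ident> -> letter {letter|digit}:

    def scan_ident(text, pos=0):
        """Match letter {letter | digit}; return (lexeme, next position)."""
        if pos >= len(text) or not text[pos].isalpha():
            raise SyntaxError("letter expected")
        start = pos
        pos += 1
        while pos < len(text) and (text[pos].isalpha() or text[pos].isdigit()):
            pos += 1                       # zero or more repetitions
        return text[start:pos], pos

    # scan_ident("sum2 = 0") == ("sum2", 4)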
BNF
<expr> → <expr> + <term>
| <expr> - <term>
| <term>
<term> → <term> * <factor>
| <term> / <factor>
| <factor>
EBNF
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
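The EBNF form maps naturally onto a recursive-descent parser with one loop per pair of braces. A hedged sketch that builds left-associative trees for + - * / over 'const' operands only (a simplifying assumption):

    def parse_factor(tokens, pos):
        assert tokens[pos] == "const", "only 'const' factors in this sketch"
        return "const", pos + 1

    def parse_term(tokens, pos):
        """<term> -> <factor> {(* | /) <factor>}"""
        tree, pos = parse_factor(tokens, pos)
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            right, pos = parse_factor(tokens, pos + 1)
            tree = (tree, op, right)
        return tree, pos

    def parse_expr(tokens, pos=0):
        """<expr> -> <term> {(+ | -) <term>}"""
        tree, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            right, pos = parse_term(tokens, pos + 1)
            tree = (tree, op, right)
        return tree, pos

    print(parse_expr(["const", "+", "const", "*", "const"])[0])
    # ('const', '+', ('const', '*', 'const'))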
Recent variations in EBNF:
Alternative RHSs are put on separate lines
Use of colon instead of →
Use of opt for optional parts
Use of oneof for choices
Static semantics
Nothing to do with meaning (despite the name)
Context-free grammars (CFGs) cannot describe all of the syntax of programming languages
Categories of constructs that are trouble-makers:
Still context-free, but cumbersome (e.g., types of operands in expressions)
Non-context-free (e.g., variables must be declared before they are used)
Attribute grammars (AGs) have additions to CFGs to carry some semantic info on parse tree nodes
Primary value of AGs:
Static semantics specification and static semantics checking
Def: An attribute grammar is a context-free grammar with the following additions:
For each grammar symbol x there is a set A(x) of attribute values
Each rule has a set of functions that define certain attributes of the nonterminals in the rule
Each rule has a (possibly empty) set of predicates to check for attribute consistency
Let $X_0 \rightarrow X_1 \dots X_n$ be a rule
Functions of the form $S(X_0) = f(A(X_1), \dots, A(X_n))$ define synthesized attributes
Functions of the form $I(X_j) = f(A(X_0), \dots, A(X_n))$, for $1 \le j \le n$, define inherited attributes
Initially, there are intrinsic attributes on the leaves
Syntax
<assign> -> <var> = <expr>
<expr> -> <var> + <var> | <var>
<var> -> id
Attributes
actual_type: a synthesized attribute for the nonterminals <var> and <expr>
expected_type: an inherited attribute for the nonterminal <expr>
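A minimal sketch (the names and the two-type universe are illustrative assumptions) of how these attributes could be evaluated for <assign> -> <var> = <expr>: actual_type is synthesized bottom-up, expected_type is inherited from the left-hand side, and a predicate checks their consistency.

    # Hypothetical symbol table: declared type of each identifier
    symtab = {"A": "int", "B": "int", "C": "real"}

    def var_actual_type(name):
        """Synthesized: <var>.actual_type comes from the declaration."""
        return symtab[name]

    def expr_actual_type(left, right=None):
        """Synthesized: int + int -> int, otherwise real (illustrative rule)."""
        if right is None:
            return var_actual_type(left)
        t1, t2 = var_actual_type(left), var_actual_type(right)
        return "int" if t1 == t2 == "int" else "real"

    def check_assign(target, left, right=None):
        """<assign> -> <var> = <expr>:
        <expr>.expected_type is inherited from <var>.actual_type;
        the predicate requires actual_type == expected_type."""
        expected = var_actual_type(target)           # inherited attribute
        actual = expr_actual_type(left, right)       # synthesized attribute
        if actual != expected:
            raise TypeError(f"type mismatch: {actual} vs {expected}")
        return True

    check_assign("A", "A", "B")     # ok: int = int + int
    # check_assign("A", "A", "C")   # would raise: real vs int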