Programming Languages - Syntax
Objectives
- Understand the lexical structure of programming languages
- Understand context-free grammars and Backus-Naur Forms (BNFs)
- Become familiar with parse trees and abstract syntax trees
- Understand ambiguity, associativity, and precedence
- Learn to use Extended Backus-Naur Forms (EBNFs) and syntax diagrams
- Become familiar with parsing techniques and tools
- Understand lexics vs. syntax vs. semantics
- Build a syntax analyzer for TinyAda
Introduction
- Syntax: The structure of a programming language.
- The formal study of syntax grew out of Noam Chomsky's work in the 1950s, which introduced context-free grammars.
- Backus-Naur Forms (BNFs): A notation developed by John Backus and Peter Naur for describing grammars, first used to describe Algol 60.
- Understanding BNF is crucial for modern computer scientists when interpreting language syntax.
Variations of BNF
- BNF: Basic form with simple notation.
- EBNF: Extended BNF which introduces additional notation for clarity.
- Syntax Diagrams: Visual representation of grammatical rules.
Lexical Structure of Programming Languages
- Lexical Structure: Refers to the structure of tokens or words in a programming language.
- Tokens Classification:
- Reserved words (keywords)
- Literals or constants
- Special symbols
- Identifiers (e.g., variable names)
- Scanning Phase: Converts character sequences into tokens.
- Parsing Phase: Processes tokens into a structured syntactic representation.
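The scanning phase can be sketched as a small hand-written tokenizer. The token categories and reserved words below are illustrative assumptions, not tied to any particular language:

```python
import re

# Ordered token specification: NUMBER is tried before IDENT so digits
# are classified as literals; reserved words are matched as identifiers
# first and then reclassified (a common scanner technique).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),            # literals/constants
    ("IDENT",  r"[A-Za-z_]\w*"),   # identifiers
    ("SYMBOL", r"[+\-*/():=;]"),   # special symbols
    ("SKIP",   r"\s+"),            # whitespace is not a token
]
RESERVED = {"if", "then", "else", "while"}  # illustrative keyword set

def scan(source):
    """Convert a character stream into a list of (kind, lexeme) tokens."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    tokens = []
    for match in re.finditer(pattern, source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "IDENT" and lexeme in RESERVED:
            kind = "RESERVED"
        tokens.append((kind, lexeme))
    return tokens

scan("if x then y := 42")
# [('RESERVED', 'if'), ('IDENT', 'x'), ('RESERVED', 'then'),
#  ('IDENT', 'y'), ('SYMBOL', ':'), ('SYMBOL', '='), ('NUMBER', '42')]
```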
Lexics vs. Syntax vs. Semantics
- Lexics: Deals with the symbols (alphabet), words (combinations of symbols), and means of token recognition.
- Syntax: Concerns valid sequences of tokens based on language rules.
- Semantics: Defines the expected behavior of syntactically correct strings.
- Example highlighting the contrast: Chomsky's sentence "Colorless green ideas sleep furiously" is syntactically well-formed but has no meaning.
Phases of Compilation
- Character stream: Input from source code.
- Token stream: Collection of tokens.
- Parse tree & Abstract syntax tree (AST): Internal representations used by the compiler.
- Assembly or machine language generation: Final output for execution.
Context-Free Grammars and BNFs
- Context-Free Grammar: Set of rules defining a language's syntax.
- Derivation Process: Constructing sentences by replacing symbols according to grammar rules.
- Ambiguity: Occurs when a grammar allows multiple parse trees for a single string.
- Nonterminals vs. Terminals: Nonterminals are symbols that can be replaced, terminals are actual tokens.
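The derivation process can be made concrete with a toy grammar. The grammar below is an illustrative assumption, not taken from any particular language; nonterminals are the dictionary keys, everything else is a terminal:

```python
# A tiny context-free grammar: each nonterminal maps to a list of
# alternatives, and each alternative is a sequence of symbols.
GRAMMAR = {
    "expr": [["expr", "+", "term"], ["term"]],
    "term": [["(", "expr", ")"], ["number"]],
}

def leftmost_derivation(start, choices):
    """Replace the leftmost nonterminal at each step, recording every
    sentential form. `choices` selects which alternative to apply."""
    steps = [list(start)]
    for choice in choices:
        current = steps[-1]
        i = next(k for k, sym in enumerate(current) if sym in GRAMMAR)
        replaced = current[:i] + GRAMMAR[current[i]][choice] + current[i + 1:]
        steps.append(replaced)
    return steps

leftmost_derivation(["expr"], [0, 1, 1, 1])
# expr => expr + term => term + term => number + term => number + number
```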
Parse Trees and Abstract Syntax Trees
- Parse Trees: Graphical representations showing how a string is derived from the grammar rules.
- Abstract Syntax Trees: Simplified trees that remove redundant parts of the parse tree, focusing on necessary structural elements.
- Nodes and Leaves: Nonterminals are nodes with children, while terminals are leaf nodes.
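The relationship between the two trees can be sketched for the string "1 + 2 * 3". The tuple encoding and the single-child-collapsing rule below are illustrative assumptions about how an AST might be derived from a parse tree:

```python
# Trees as nested tuples: (label, child, child, ...); strings are leaves.
# The parse tree records every grammar step (expr -> term -> factor),
# while the AST keeps only operators and operands.
parse_tree = (
    "expr",
    ("expr", ("term", ("factor", "1"))),
    "+",
    ("term",
        ("term", ("factor", "2")),
        "*",
        ("factor", "3")),
)

def to_ast(node):
    """Collapse chain nodes with a single child; keep operators and leaves."""
    if isinstance(node, str):      # terminal: a leaf
        return node
    children = [to_ast(c) for c in node[1:]]
    if len(children) == 1:         # chain node such as expr -> term
        return children[0]
    left, op, right = children     # binary operator node
    return (op, left, right)

to_ast(parse_tree)  # ("+", "1", ("*", "2", "3"))
```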
Ambiguity, Associativity, and Precedence
- Associativity: Rule that determines how operators of the same precedence are grouped.
- Ambiguous Grammars: Need to be modified to eliminate ambiguity – typically through establishing precedence and associativity rules.
- Leftmost Derivation: A derivation that always replaces the leftmost nonterminal first; a grammar is ambiguous exactly when some string has more than one leftmost derivation (equivalently, more than one parse tree).
- Example: Left-recursive vs right-recursive definitions impact associativity.
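The associativity difference matters for non-commutative operators. The sketch below simulates how a left-recursive versus a right-recursive grammar would group the same token sequence (the evaluation functions are illustrative stand-ins for the two grammars):

```python
def eval_left(operands):
    """Grouping produced by a left-recursive rule: expr -> expr "-" num."""
    result = operands[0]
    for n in operands[1:]:
        result = result - n        # ((8 - 3) - 2)
    return result

def eval_right(operands):
    """Grouping produced by a right-recursive rule: expr -> num "-" expr."""
    result = operands[-1]
    for n in reversed(operands[:-1]):
        result = n - result        # (8 - (3 - 2))
    return result

eval_left([8, 3, 2])   # 3
eval_right([8, 3, 2])  # 7
```

Because the two groupings give different answers, most languages make subtraction left-associative, which corresponds to the left-recursive grammar rule.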
EBNFs and Syntax Diagrams
- EBNF: Introduces useful notations for repetition and optional elements, written {} and [] respectively.
- Syntax Diagrams: Visual tools that represent grammatical rules through shapes for terminals and nonterminals, illustrating sequence and repetition.
Parsing Techniques and Tools
- Parsing: The process of analyzing a string based on a grammar.
- Bottom-up parsers: Construct parse trees from leaves upward.
- Top-down parsers: Start with the grammar's start symbol and work downward toward the tokens.
- Parser Generators: Software that automates the creation of parsers based on given grammar specifications.
- Predictive Parsers: Make parsing decisions based on lookahead tokens to ensure correct derivation.
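A predictive parser can be written directly from an EBNF grammar: each rule becomes a method and each {} repetition becomes a loop. The expression grammar below is a common textbook example used here as an assumption:

```python
class Parser:
    """Predictive (one-token-lookahead) recursive-descent parser for:
        expr   = term { ("+" | "-") term }
        term   = factor { ("*" | "/") factor }
        factor = NUMBER | "(" expr ")"
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """The lookahead token that drives every parsing decision."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):   # EBNF {} becomes a loop
            op = self.tokens[self.pos]
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.tokens[self.pos]
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.pos += 1
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token: {tok!r}")
        self.pos += 1
        return int(tok)

Parser(["2", "+", "3", "*", "4"]).expr()
# Precedence falls out of the rule nesting: ("+", 2, ("*", 3, 4))
```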
Lexics vs. Syntax vs. Semantics Discussion
- Formatting aspects, such as whitespace and specific language rules, dictate how tokens are recognized and are distinct from semantics.
- Some semantic properties are context-sensitive and cannot be represented by context-free rules alone, indicating the need for careful language design.
Case Study: Building a Syntax Analyzer for TinyAda
- TinyAda: A simplified language to demonstrate syntax features of larger languages.
- Includes declarations, statements, and expressions, allowing for nested constructs.
- The parsing shell ensures token types adhere to expected language constructs, serving as a foundation for further semantic analysis.
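The check-and-advance discipline of such a parsing shell can be sketched as follows. The token kinds, symbols, and the assignment-statement rule below are assumptions for illustration, not the actual TinyAda grammar:

```python
class ParsingShell:
    """Minimal parsing shell in the accept/expect style: each grammar
    rule becomes a method that checks token types and advances."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of (kind, lexeme) pairs
        self.pos = 0

    def current(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind, lexeme=None):
        """Consume the current token if it matches; otherwise raise a
        syntax error. This check-and-advance step is the core of the shell."""
        tok_kind, tok_lex = self.current()
        if tok_kind != kind or (lexeme is not None and tok_lex != lexeme):
            raise SyntaxError(f"expected {lexeme or kind}, found {tok_lex!r}")
        self.pos += 1

    def assignment_statement(self):
        """Hypothetical rule: assignmentStatement = ID ":=" expression ";" """
        self.expect("ID")
        self.expect("SYMBOL", ":=")
        self.expression()
        self.expect("SYMBOL", ";")

    def expression(self):
        self.expect("NUM")  # placeholder: a full shell would recurse here
```

Because the shell only verifies token types against the grammar, it accepts any syntactically valid program; semantic checks (declared identifiers, type compatibility) are layered on afterward.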