CS2001 Grammars and Parsing
Presented by: Ruth Hoffmann, University of St Andrews
Page 1
Title: CS2001 Grammars and Parsing
Page 2
Overview of Grammars and Parsing
Key Topics:
Grammars
Parsing
Recursive-descent parser
Finite State Automata (FSAs)
Importance of Grammars:
Recognize a variety of strings (numbers, keywords).
Useful for finite and finitely described inputs.
Higher-level treatment for complex or infinite alternatives.
Page 3
Definitions:
Languages: Sets of sequences (strings) of symbols.
Grammars: Formal descriptions of languages.
Parsers: Machines that determine if sentences conform to the grammar, and perform actions accordingly.
Syntax: Relates to how sentences are structured.
Page 4
Importance of Parsing:
Representing complex structured data is essential for:
Programs, web pages, data from databases.
Intelligent search queries and speech recognition.
Standard Formats:
Examples include XML, but not all data adheres to structured formats.
Page 5
Limitations of FSAs:
FSAs recognise input one symbol at a time and cannot match nested or recursive structure, which makes them inadequate for intricate languages.
Important to maintain language integrity while parsing.
Some languages cannot be captured by FSAs alone.
Page 6
Understanding Grammars:
Grammar Definition: A set of formal rules defining valid sentences including:
What symbols can appear.
Their order and combination.
Context-Free Grammars (CFGs): Focus point of the discussion.
Page 7
Complex Patterns:
Sequences and interleavings of strings can be complicated (e.g., all Java programs, all legal XML files).
The goal is to define simple patterns for syntactically valid programs.
Focus primarily on syntax, not semantics like typing.
Page 8
Context-Free Grammar Structure:
A CFG is defined as a 4-tuple:
V: finite set of variables.
Σ: finite set of terminals (disjoint from V).
R: finite set of rules (each with a variable and corresponding strings of variables/terminals).
S: start variable within V.
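The 4-tuple above can be encoded directly as data. The sketch below is an assumed illustration (the grammar S → a S b | ε is a standard example, not one from the slides):

```python
# A toy CFG encoded as the 4-tuple (V, Sigma, R, S).
V = {"S"}                         # variables
SIGMA = {"a", "b"}                # terminals, disjoint from V
R = {                             # rules: variable -> list of alternatives,
    "S": [["a", "S", "b"], []],   # S -> a S b | epsilon
}
START = "S"                       # start variable, a member of V

# This grammar generates {a^n b^n : n >= 0}, a language no FSA can recognise.
```

Representing rules as a mapping from a single variable to a list of alternatives mirrors the CFG restriction that every left-hand side is one variable.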
Page 9
Generating Words from Grammars:
Start at the start variable and apply the rules recursively to generate legal words.
Valid words should only consist of terminal symbols.
Page 10
Infinite Language Possibilities:
Many grammars generate infinitely many sentences.
This recursive process continues until there are only terminal symbols.
Page 11
CFG Limitations:
In CFGs, all rules must have a single variable on the left-hand side.
Other grammatical formalisms relax this restriction, describing potentially larger language classes that are harder to parse.
Page 12
Grammar Rewriting Process:
Start with the first variable, choose any variable from the pattern, and find a rule with it on the LHS.
Rewrite the string until it consists of only terminal symbols.
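The rewriting process above can be sketched as a leftmost derivation. This is an assumed illustration using the toy grammar S → a S b | ε, where `choices` selects which alternative to apply at each step:

```python
# Leftmost derivation sketch: repeatedly rewrite the leftmost variable
# until only terminal symbols remain.
RULES = {"S": [["a", "S", "b"], []]}   # S -> a S b | epsilon

def derive(choices):
    """Rewrite the leftmost variable at each step, picking the rule
    alternative indexed by `choices`; return the final terminal string."""
    string = ["S"]
    for choice in choices:
        # find the leftmost variable in the current string
        i = next(k for k, sym in enumerate(string) if sym in RULES)
        string[i:i + 1] = RULES[string[i]][choice]
    return "".join(string)

# S => aSb => aaSbb => aabb
print(derive([0, 0, 1]))  # -> aabb
```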
Page 13
Parse Trees:
Each derivation generates a tree structure, where leaves are terminal symbols.
A tree is defined as connected and acyclic (graph without cycles).
Page 14
Extensions and Shorthands:
Merging rules can increase readability.
Epsilon (ε) can represent a silent/zero-length string.
Page 15
Abstract Syntax for Statements:
Variables represent statement types, expressions, and terminals include symbols and keywords.
Example grammar for a for-statement:
for-statement → for ( expression ; expression ; expression ) statements.
Page 16
CFGs vs FSAs:
CFGs can be represented by pushdown automata, which combine FSA functionality with a stack.
CFGs can accept a broader range of languages compared to FSAs.
Page 17
Ambiguity in Grammars:
A grammar is ambiguous if at least one string has more than one valid parse tree.
Whether a CFG is ambiguous is undecidable in general; it is better to design grammars that avoid common prefixes across productions.
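Ambiguity matters because different parse trees can mean different things. A standard example (assumed here, not taken from the slides) is the grammar E → E + E | E * E | digit, under which the string 1+2*3 has two parse trees; representing each tree as nested tuples and evaluating them shows they genuinely differ:

```python
# Two parse trees for 1+2*3 under the ambiguous grammar
# E -> E + E | E * E | digit, represented as nested tuples.
def eval_tree(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return eval_tree(left) + eval_tree(right)
    return eval_tree(left) * eval_tree(right)

tree_a = ("+", 1, ("*", 2, 3))   # 1 + (2 * 3)
tree_b = ("*", ("+", 1, 2), 3)   # (1 + 2) * 3

print(eval_tree(tree_a), eval_tree(tree_b))  # -> 7 9
```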
Page 18
Summary of CFGs:
CFGs are based on a rule system and allow iterative expansion of variables.
The strings accepted consist of terminals generated during this expansion.
Page 19
Recap of CFGs:
CFGs utilize a rule system to iteratively expand their variables, generating language from terminals along various paths.
Page 20
Upcoming Content:
Recursive patterns, parsing methods, and a further look at handling ambiguity.
Page 21
Recursion in Grammars:
Allows alternatives and repetition within structures.
Two approaches to recursion:
Repeating the same structure multiple times.
Using different forms to allow repetition.
Page 22
Handling Repeats:
Rules may include forms like A → B∗, indicating zero or more occurrences of B (Kleene star).
Recursive structures can be built using the same variables across different rules.
Page 23
General Structure of Recursion:
Recursive structure can emulate the Kleene star, and "one or more" occurrences can be written A+.
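The correspondence can be made concrete with a pair of recognisers. This is an assumed sketch over a single terminal b: the recursive rule A → b A | ε recognises b∗ (zero or more), while A → b A | b recognises b+ (one or more):

```python
def matches_star(s):
    """Recognise b* via the recursive rule A -> b A | epsilon."""
    if s == "":
        return True                              # A -> epsilon
    return s[0] == "b" and matches_star(s[1:])   # A -> b A

def matches_plus(s):
    """Recognise b+ via A -> b A | b: at least one b is required."""
    if s == "b":
        return True                              # A -> b
    return s[:1] == "b" and matches_plus(s[1:])  # A -> b A

print(matches_star(""), matches_plus(""))  # -> True False
```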
Page 24
Synopsis of Example CFG:
Variables: A, B, C
Terminals: 0, 1, for
Example rules and structure using alternatives and recursion (expressing zero or more, one or more).
Page 25
Dynamics of a Grammar:
Start with a variable, expand using patterns, leading to strings in the grammar language.
This generates parse trees or derivations showing rule applications.
Page 26
Practical Example - Numeric Expressions:
Define a grammar for numeric expressions which includes numbers and simple operators without extra complexity.
Page 27
Modular Approach to Grammar Design:
Example Numeric Grammar:
int → -unsigned | unsigned
real → int.unsigned | int
Digits:
digit → 0 | 1 | 2 | ... | 9
Page 28
Building Expressions:
Construct expressions through defined rules, allowing operations such as addition and multiplication while ruling out meaningless expressions.
Page 29
Parsing Process for Expressions:
Generate patterns, ensuring each stage yields terminals and reflects valid expression parsing.
Page 30
Defining Grammar's Purpose and Clarity:
Grammar constructed to generate legal expressions and simplify parsing through structured rules.
Page 31
Managing Ambiguity:
Important to determine which rule applies in ambiguous situations, leading to potential variations in parse trees.
Page 32
Operator Precedence:
Syntax plays a key role in dealing with operator precedence (e.g., performing multiplications before additions).
Page 33
Disambiguation Steps:
Aim for clarity in parsing by explicitly defining rule functions and maintaining structure.
Page 34
Evaluating Expressions:
Build trees for local calculations, facilitating overall language representation in compilers and interpreters.
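The precedence and evaluation ideas above can be sketched in one place. The following is an assumed illustration using the standard layered grammar expr → term { + term }, term → factor { * factor }, factor → digit, which evaluates as it parses rather than building an explicit tree:

```python
def evaluate(src):
    """Evaluate a digit/+/* expression with correct precedence."""
    tokens = list(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():              # factor -> digit
        return int(eat())

    def term():                # term -> factor { * factor }
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def expr():                # expr -> term { + term }
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    return expr()

print(evaluate("2+3*4"))   # multiplication binds tighter -> 14
```

Because multiplication lives one layer deeper in the grammar than addition, the parse structure itself enforces precedence; no separate precedence table is needed.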
Page 35
Summary of Expression Parsing:
Parse trees represent the structure of languages dynamically, with clear roles defined for processing.
Page 36
Overview of Parsing Conversion:
Parsing transforms complex texts into structured parse trees through systematic conversion methods.
Page 37
Goals of Parsing:
Create concise grammar definitions while maintaining simplicity for human writing and computer execution.
Page 38
Simplified Program Structure:
Create grammar rules for programming language projects, covering assignment statements and expressions.
Page 39
From Human to Machine Representation:
Example transformation of expressions into machine code needing minimum complexity in translation.
Page 40
Focus on the Essential Elements:
Grammar should emphasize key components like tokens for clarity and focus on essential structures.
Page 41
Tokenisation Process:
Conversion of character strings into symbols streamlines parsing.
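A tokeniser of this kind is commonly sketched with regular expressions (each pattern playing the role of a small FSA). The token categories below are assumed for illustration:

```python
import re

# Each (name, regex) pair defines one token category.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=;()]"),
    ("SKIP",   r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenise(text):
    """Convert a character string into a stream of (kind, value) tokens."""
    tokens = []
    for match in PATTERN.finditer(text):
        kind = match.lastgroup
        if kind != "SKIP":               # drop whitespace
            tokens.append((kind, match.group()))
    return tokens

print(tokenise("x = 10 + y;"))
# -> [('IDENT', 'x'), ('OP', '='), ('NUMBER', '10'),
#     ('OP', '+'), ('IDENT', 'y'), ('OP', ';')]
```

Note that this sketch silently skips characters that match no pattern; a production tokeniser would add a catch-all category and raise an error instead.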
Page 42
Tokenisation Complexity:
Define sub-grammars for simple tokens; however, consider the entire input for comprehensive tokenization.
Page 43
Using FSAs for Tokenisation:
FSAs can define tokens compactly, providing advantages such as efficiency with low overhead.
Page 44
Tokenisation Approach:
Convert source input into a token stream while managing error generation for unrecognized strings.
Page 45
FSA Characteristics:
The FSA emits symbols corresponding to grammar terminals, while recording the associated values in the token stream.
Page 46
Next Steps in Parsing:
Develop algorithms for parsing input token streams using two primary approaches: Top-down and Bottom-up.
Page 47
Recursive-Descent Parsing:
Each rule acts as a recognizer within the token stream, incorporating recursive functions for complex pattern matching.
Page 48
Methodology Noted:
A “try it and see” approach allows for attempts at matches against the token stream with potential backtracking as needed.
Page 49
Table-Driven Parsing Approach:
Use the next input token to select among candidate rules, building the parse structure efficiently within the table's constraints.
Page 50
Comparison of Techniques:
Table-driven methods may seem less transparent, but they are efficient: tokens are evaluated in a single pass without backtracking.
Page 51
Ambiguity Issues:
Some ambiguous grammars cause difficulties for table-driven parsers, though deterministic rule selection can resolve others.
Page 52
Support for Parsing Implementations:
Tools exist to generate both recursive-descent and table-driven parsers, making them easier to build and maintain.
Page 53
Layered Toolchain for Parsing and Code Generation:
Streamlined processing from text to a machine code binary through parsing layers: Tokeniser -> Parser -> Interpreter/Compiler.
Page 54
Key Takeaways on Parsers:
Built using FSAs and grammars, essential for handling text and program manipulations with established support tools.
Page 55
Future Considerations in Parser Development:
Recursive-descent parsers are engaging to write by hand, but complex constructions and full languages often rely on parser-compilers.
Page 56
Consideration for Practicality:
Engage with pre-existing parsers unless new applications necessitate experimentation into unique grammar forms.
Page 57
Pseudo-Code for Parser Development:
Create parser skeletons focusing on abstract algorithm shapes without delving into syntax details too early.
Page 58
Mechanics of Recursive Descent:
Each rule transitions into a function within the parser, employing functions for both variable and terminal recognizers.
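The rule-to-function mapping can be sketched as follows. This is an assumed illustration for the toy grammar S → a S b | ε: one function per variable, plus a shared terminal recogniser that consumes a token on a match:

```python
def parse(tokens):
    """Recognise the language of S -> a S b | epsilon."""
    pos = 0

    def terminal(expected):            # terminal recogniser: consume on match
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
            return True
        return False

    def S():                           # variable recogniser for S
        nonlocal pos
        saved = pos                    # remember position for backtracking
        if terminal("a") and S() and terminal("b"):
            return True                # alternative 1: S -> a S b
        pos = saved                    # alternative failed: backtrack
        return True                    # alternative 2: S -> epsilon

    return S() and pos == len(tokens)  # succeed only if all input is consumed

print(parse(list("aabb")), parse(list("aab")))  # -> True False
```

Each alternative is tried in order ("try it and see"); when one fails, the position is restored before the next is attempted.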
Page 59
Detailed Rule Functionality:
Each rule's function must be capable of recognizing respective fragments amid recursive call structures.
Page 60
Conventional Recognition Logic:
Terminal recognizers progressively consume recognized tokens, whereas variable recognizers control flow with backtracking and pattern validation.
Page 61
Token Stream Management:
Stream position awareness allows for reverting and continued processing when patterns fail to match.
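Position awareness is often packaged as a small token-stream class. The class and method names below are assumed, not taken from the slides: `mark()` records the current position before a trial match, and `reset()` reverts to it when the pattern fails:

```python
class TokenStream:
    """Token stream with position marking for backtracking parsers."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):                    # look at the current token
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):                    # consume and return the current token
        tok = self.peek()
        if tok is not None:
            self.pos += 1
        return tok

    def mark(self):                    # remember the position before a trial
        return self.pos

    def reset(self, marker):           # revert when a pattern fails to match
        self.pos = marker

stream = TokenStream(["x", "=", "1"])
marker = stream.mark()
stream.next(); stream.next()           # trial match consumes two tokens
stream.reset(marker)                   # the attempt failed: rewind
print(stream.peek())                   # -> x
```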
Page 62
Implementing Token Stream Visibility:
Develop methods for reading the current token, advancing, and tracking progress through the stream, making parsing operations straightforward to express.
Page 63
Variable Handling in Recursive Rules:
Variable recognisers keep track of the token position, combining the successes and failures of sub-rules and re-attempting alternatives as needed.
Page 64
Example Variable Parsing Logic:
Use utilities for managing token stream positions and encoding logic for success based on variable evaluations.
Page 65
Handling Repetition in Parsing:
Repeated structures can be recognised with simple loops, reserving recursion for the grammar rules that genuinely require it.
Page 66
Managing Structuring for Repetitions:
Utilize established repetition rules to simplify implementations while keeping standard coding practices in context.
Page 67
Repetitions and Backtracking Techniques:
Code segments to detail parsing behaviors for repetitions while adhering to structured expansion logic.
Page 68
Structural Notes on Efficiency:
Reflecting on how rules are recognised can reveal opportunities to optimise the parser.
Page 69
Addressing Efficiency Crossover:
Terminal recognisers can filter on the next token before committing to an alternative, streamlining variable evaluations.
Page 70
Overall Summary of Grammar Parsing Process:
Parsing bridges grammars and practical implementations through clear, structured stages that make languages easier to understand and process.