RHWeek4.slides

CS2001 Grammars and Parsing

  • Presented by: Ruth Hoffmann, University of St Andrews

Page 1

  • Title: CS2001 Grammars and Parsing

Page 2

  • Overview of Grammars and Parsing

    • Key Topics:

      • Grammars

      • Parsing

      • Recursive-descent parser

      • Finite State Automata (FSAs)

    • Importance of Grammars:

      • Recognize a variety of strings (numbers, keywords).

      • Useful for finite and finitely described inputs.

      • Higher-level treatment for complex or infinite alternatives.

Page 3

  • Definitions:

    • Languages: Sets of sequences (strings) of symbols.

    • Grammars: Formal descriptions of languages.

    • Parsers: Machines that determine if sentences conform to the grammar, and perform actions accordingly.

    • Syntax: Relates to how sentences are structured.

Page 4

  • Importance of Parsing:

    • Representing complex structured data is essential for:

      • Programs, web pages, data from databases.

      • Intelligent search queries and speech recognition.

    • Standard Formats:

      • Examples include XML, but not all data adheres to structured formats.

Page 5

  • Limitations of FSAs:

    • FSAs process input one letter at a time with only finite memory, which isn't enough for intricate (e.g. nested) languages.

    • Important to maintain language integrity while parsing.

    • Some languages cannot be captured by FSAs alone.

Page 6

  • Understanding Grammars:

    • Grammar Definition: A set of formal rules defining valid sentences including:

      • What symbols can appear.

      • Their order and combination.

    • Context-Free Grammars (CFGs): Focus point of the discussion.

Page 7

  • Complex Patterns:

    • Sequences and interleavings of strings can be complicated (e.g., all Java programs, all legal XML files).

    • The goal is to define simple patterns for syntactically valid programs.

    • Focus primarily on syntax, not semantics like typing.

Page 8

  • Context-Free Grammar Structure:

    • A CFG is defined as a 4-tuple:

      • V: finite set of variables.

      • Σ: finite set of terminals (disjoint from V).

      • R: finite set of rules (each with a variable and corresponding strings of variables/terminals).

      • S: start variable within V.
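
As a concrete sketch, the 4-tuple can be written down directly in code. The grammar used here (S → aA, A → b | aA) is an invented example, not one from the slides:

```python
# A CFG as a 4-tuple (V, Sigma, R, S): variables, terminals, rules, start variable.
# R maps each variable to the alternative right-hand sides it can expand to.
V = {"S", "A"}
Sigma = {"a", "b"}
R = {
    "S": [["a", "A"]],          # S -> aA
    "A": [["b"], ["a", "A"]],   # A -> b | aA
}
S = "S"

# Sanity checks straight from the definition:
assert V.isdisjoint(Sigma)      # terminals are disjoint from variables
assert S in V                   # the start variable belongs to V
assert all(v in V for v in R)   # every rule has a variable on its left-hand side
```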

Page 9

  • Generating Words from Grammars:

    • Start at the start variable and apply the rules recursively to generate legal words.

    • Valid words should only consist of terminal symbols.

Page 10

  • Infinite Language Possibilities:

    • Many grammars can generate infinitely many sentences.

    • This recursive process continues until there are only terminal symbols.
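
A tiny illustration of why this recursion yields an infinite language; the grammar S → aSb | ε is assumed for this sketch:

```python
def generate(n):
    """Derive a word from S -> aSb | epsilon, applying S -> aSb n times."""
    if n == 0:
        return ""                        # choose S -> epsilon: only terminals remain
    return "a" + generate(n - 1) + "b"   # choose S -> aSb and recurse

# Every n gives a different legal word a^n b^n, so the language is infinite.
print(generate(3))  # -> aaabbb
```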

Page 11

  • CFG Limitations:

    • In CFGs, all rules must have a single variable on the left-hand side.

    • Other grammar formalisms allow more general left-hand sides, giving potentially larger language classes that are harder to parse.

Page 12

  • Grammar Rewriting Process:

    • Start with the first variable, choose any variable from the pattern, and find a rule with it on the LHS.

    • Rewrite the string until it consists of only terminal symbols.

Page 13

  • Parse Trees:

    • Each derivation yields a parse tree, where internal nodes are variables and the leaves are terminal symbols.

    • A tree is defined as connected and acyclic (graph without cycles).

Page 14

  • Extensions and Shorthands:

    • Merging rules can increase readability.

    • Epsilon (ε) can represent a silent/zero-length string.

Page 15

  • Abstract Syntax for Statements:

    • Variables represent statement types and expressions; terminals include symbols and keywords.

    • Example grammar for a for-statement:

      • for-statement → for ( expression ; expression ; expression ) statements

Page 16

  • CFGs vs FSAs:

    • CFGs can be represented by pushdown automata, which combine FSA functionality with a stack.

    • CFGs can accept a broader range of languages compared to FSAs.
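
To sketch what the stack adds, here is a recogniser for balanced brackets, a language no FSA can accept, written in the style of a pushdown automaton (an illustrative example, not from the slides):

```python
def balanced(s):
    """Recognise balanced brackets with an explicit stack (a pushdown-automaton sketch)."""
    stack = []
    for c in s:
        if c == "(":
            stack.append(c)          # push on an opening bracket
        elif c == ")":
            if not stack:
                return False         # closing bracket with empty stack: reject
            stack.pop()
        else:
            return False             # symbol outside the alphabet: reject
    return not stack                 # accept only if the stack is empty again

print(balanced("(()())"))  # -> True
```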

Page 17

  • Ambiguity in Grammars:

    • A grammar is ambiguous if at least one string has more than one valid parse tree.

    • Whether a CFG is ambiguous is undecidable in general; it is better to avoid productions whose alternatives share a common prefix.
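
Ambiguity can be made concrete by counting parse trees. Assuming the classic ambiguous grammar E → E op E | num, the string 1 + 2 * 3 has two trees, (1+2)*3 and 1+(2*3):

```python
def count_parse_trees(tokens):
    """Count parse trees for the ambiguous grammar E -> E op E | num,
    by trying every operator position as the root of the tree."""
    if len(tokens) == 1:
        return 1                     # a lone number has exactly one tree
    total = 0
    for i in range(1, len(tokens), 2):   # operators sit at odd positions
        total += count_parse_trees(tokens[:i]) * count_parse_trees(tokens[i + 1:])
    return total

# Two trees for "1 + 2 * 3" -- the grammar is ambiguous.
print(count_parse_trees(["1", "+", "2", "*", "3"]))  # -> 2
```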

Page 18

  • Summary of CFGs:

    • CFGs are based on a rule system and allow iterative expansion of variables.

    • The strings accepted consist of terminals generated during this expansion.

Page 19

  • Recap of CFGs:

    • CFGs utilize a rule system to iteratively expand their variables, generating language from terminals along various paths.

Page 20

  • Upcoming Content:

    • Recursive patterns, parsing methods, and handling ambiguity again.

Page 21

  • Recursion in Grammars:

    • Allows alternatives and repetition within structures.

    • Two approaches to recursion:

      1. Repeating the same structure multiple times.

      2. Using different forms to allow repetition.

Page 22

  • Handling Repeats:

    • Rules may include forms like A → B∗ indicating zero or more occurrences (Kleene star).

    • Recursive structures can be built using the same variables across different rules.

Page 23

  • General Structure of Recursion:

    • Recursive structure can emulate the Kleene star and represent "one or more" occurrences with A+.
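
As a sketch of how recursion emulates repetition, these recognisers for the single terminal b (an assumed example) implement b* via A → bA | ε and b+ via A → bA | b:

```python
def star_b(s):
    """A -> bA | epsilon : recognises b* (zero or more b's)."""
    return s == "" or (s[0] == "b" and star_b(s[1:]))

def plus_b(s):
    """A -> bA | b : recognises b+ (one or more b's)."""
    return s != "" and s[0] == "b" and star_b(s[1:])

print(star_b(""), plus_b("bbb"), plus_b(""))  # -> True True False
```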

Page 24

  • Synopsis of Example CFG:

    • Variables: A, B, C

    • Terminals: 0, 1, for

    • Example rules and structure using alternatives and recursion (expressing zero or more, one or more).

Page 25

  • Dynamics of a Grammar:

    • Start with a variable, expand using patterns, leading to strings in the grammar language.

    • This generates parse trees or derivations showing rule applications.

Page 26

  • Practical Example - Numeric Expressions:

    • Define a grammar for numeric expressions which includes numbers and simple operators without extra complexity.

Page 27

  • Modular Approach to Grammar Design:

    • Example Numeric Grammar:

      • int → - unsigned | unsigned

      • real → int . unsigned | int

      • Digits: digit → 0 | 1 | 2 | ... | 9
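
A recogniser for this numeric grammar might look as follows. Since the outline does not spell out the unsigned rule, it is assumed here to be unsigned → digit unsigned | digit (one or more digits):

```python
def is_unsigned(s):
    # assumed rule: unsigned -> digit unsigned | digit, i.e. one or more digits
    return s != "" and all(c in "0123456789" for c in s)

def is_int(s):
    # int -> - unsigned | unsigned
    return is_unsigned(s[1:]) if s.startswith("-") else is_unsigned(s)

def is_real(s):
    # real -> int . unsigned | int
    if "." in s:
        left, _, right = s.partition(".")
        return is_int(left) and is_unsigned(right)
    return is_int(s)

print(is_real("-12.5"), is_real("3"), is_real("3."))  # -> True True False
```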

Page 28

  • Building Expressions:

    • Construct expressions through defined rules, allowing operations such as addition and multiplication while ruling out meaningless expressions.

Page 29

  • Parsing Process for Expressions:

    • Generate patterns, ensuring each stage yields terminals and reflects valid expression parsing.

Page 30

  • Defining Grammar's Purpose and Clarity:

    • Grammar constructed to generate legal expressions and simplify parsing through structured rules.

Page 31

  • Managing Ambiguity:

    • Important to determine which rule applies in ambiguous situations, leading to potential variations in parse trees.

Page 32

  • Operator Precedence:

    • Syntax plays a key role in dealing with operator precedence (e.g., performing multiplications before additions).

Page 33

  • Disambiguation Steps:

    • Aim for clarity in parsing by explicitly defining rule functions and maintaining structure.

Page 34

  • Evaluating Expressions:

    • Build trees for local calculations, facilitating overall language representation in compilers and interpreters.
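
Rather than building an explicit tree, a recursive-descent evaluator can compute values during the descent. This sketch assumes the layered grammar expr → term (+ term)*, term → factor (* factor)*, factor → number, which bakes in operator precedence:

```python
def evaluate(tokens):
    """Evaluate + and * with correct precedence via a layered grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        value = int(tokens[pos])     # factor -> number
        pos += 1
        return value

    def term():
        value = factor()
        nonlocal pos
        while peek() == "*":         # term -> factor (* factor)*
            pos += 1
            value *= factor()
        return value

    def expr():
        value = term()
        nonlocal pos
        while peek() == "+":         # expr -> term (+ term)*
            pos += 1
            value += term()
        return value

    return expr()

print(evaluate(["1", "+", "2", "*", "3"]))  # -> 7
```

Because multiplication lives one layer below addition, 2 * 3 is grouped (and computed) before the + is applied.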

Page 35

  • Summary of Expression Parsing:

    • Parse trees represent the structure of languages dynamically, with clear roles defined for processing.

Page 36

  • Overview of Parsing Conversion:

    • Parsing transforms complex texts into structured parse trees through systematic conversion methods.

Page 37

  • Goals of Parsing:

    • Create concise grammar definitions while maintaining simplicity for human writing and computer execution.

Page 38

  • Simplified Program Structure:

    • Create grammar rules for programming language projects, covering assignment statements and expressions.

Page 39

  • From Human to Machine Representation:

    • Example transformation of expressions into machine code, keeping the translation as simple as possible.

Page 40

  • Focus on the Essential Elements:

    • Grammar should emphasize key components like tokens for clarity and focus on essential structures.

Page 41

  • Tokenisation Process:

    • Conversion of character strings into symbols streamlines parsing.

Page 42

  • Tokenisation Complexity:

    • Define sub-grammars for simple tokens; however, consider the entire input for comprehensive tokenization.

Page 43

  • Using FSAs for Tokenisation:

    • FSAs can define tokens compactly, providing advantages such as efficiency with low overhead.

Page 44

  • Tokenisation Approach:

    • Convert source input into a token stream while managing error generation for unrecognized strings.
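
One minimal way to sketch this step in code; the token classes (NUM, ID, OP) and patterns here are assumptions for illustration, not the course's actual token set:

```python
import re

TOKEN_SPEC = [
    ("NUM", r"\d+"),            # runs of digits
    ("ID", r"[A-Za-z_]\w*"),    # identifiers
    ("OP", r"[+*=;()]"),        # single-character operators
    ("SKIP", r"\s+"),           # whitespace, dropped from the stream
]

def tokenise(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "SKIP":
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            # no token class matches: signal an error for unrecognised input
            raise ValueError(f"unrecognised input at position {pos}")
    return tokens

print(tokenise("x = 12 + y;"))
# -> [('ID', 'x'), ('OP', '='), ('NUM', '12'), ('OP', '+'), ('ID', 'y'), ('OP', ';')]
```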

Page 45

  • FSA Characteristics:

    • Generates symbols correlating with terminals while tracking values presented in the token stream.

Page 46

  • Next Steps in Parsing:

    • Develop algorithms for parsing input token streams using two primary approaches: Top-down and Bottom-up.

Page 47

  • Recursive-Descent Parsing:

    • Each rule acts as a recognizer within the token stream, incorporating recursive functions for complex pattern matching.

Page 48

  • Methodology Noted:

    • A “try it and see” approach allows for attempts at matches against the token stream with potential backtracking as needed.

Page 49

  • Table-Driven Parsing Approach:

    • Use the input tokens and the candidate rules to build the parse structure efficiently, with each token constraining which rules can apply next.

Page 50

  • Comparison of Techniques:

    • Table-driven methods may seem less transparent, but they are efficient: tokens are consumed in a single pass with no backtracking.

Page 51

  • Ambiguity Issues:

    • Some ambiguous grammars cause difficulties for table-driven parsers; others can still be handled when a rule can be selected deterministically.

Page 52

  • Support for Parsing Implementations:

    • Tools are available for building both recursive-descent and table-driven parsers, making them easier to construct and maintain.

Page 53

  • Layered Toolchain for Parsing and Code Generation:

    • Streamlined processing from text to a machine code binary through parsing layers: Tokeniser → Parser → Interpreter/Compiler.

Page 54

  • Key Takeaways on Parsers:

    • Built using FSAs and grammars, essential for handling text and program manipulations with established support tools.

Page 55

  • Future Considerations in Parser Development:

    • Recursive-descent parsers are engaging to write by hand, but complex languages are often handled with parser generators (compiler-compilers).

Page 56

  • Consideration for Practicality:

    • Use pre-existing parsers where possible; build your own only when a new application genuinely requires an unusual grammar.

Page 57

  • Pseudo-Code for Parser Development:

    • Create parser skeletons focusing on abstract algorithm shapes without delving into syntax details too early.

Page 58

  • Mechanics of Recursive Descent:

    • Each rule transitions into a function within the parser, employing functions for both variable and terminal recognizers.

Page 59

  • Detailed Rule Functionality:

    • Each rule's function must be capable of recognizing respective fragments amid recursive call structures.
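
A sketch of one rule as one recognising function, here for the assumed rule S → aSb | ε; the function returns the position it reached in the input, recursing for the nested S:

```python
def parse_s(s, pos=0):
    """Rule S -> a S b | epsilon as a single function: returns the input
    position reached, falling back to the epsilon alternative on failure."""
    if pos < len(s) and s[pos] == "a":           # try S -> a S b
        inner = parse_s(s, pos + 1)              # recursive call for the nested S
        if inner < len(s) and s[inner] == "b":
            return inner + 1                     # consumed the closing b
    return pos                                   # fall back to S -> epsilon

def accepts(s):
    return parse_s(s) == len(s)                  # accept only if all input consumed

print(accepts("aabb"), accepts("aab"))  # -> True False
```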

Page 60

  • Conventional Recognition Logic:

    • Terminal recognizers progressively consume recognized tokens, whereas variable recognizers control flow with backtracking and pattern validation.

Page 61

  • Token Stream Management:

    • Stream position awareness allows for reverting and continued processing when patterns fail to match.
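
A minimal sketch of such position management (the class and method names are invented here): save the position before trying an alternative, and restore it if the match fails:

```python
class TokenStream:
    """Token stream with save/restore of position, so failed matches can back out."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def mark(self):
        return self.pos                  # remember the current position

    def reset(self, mark):
        self.pos = mark                  # revert after a failed alternative

    def match(self, expected):
        if self.pos < len(self.tokens) and self.tokens[self.pos] == expected:
            self.pos += 1                # consume the token on success
            return True
        return False

# Try the pattern "a b"; on failure, back out and try "a c" instead.
ts = TokenStream(["a", "c"])
start = ts.mark()
ok = ts.match("a") and ts.match("b")
if not ok:
    ts.reset(start)                      # undo the partial match of "a"
    ok = ts.match("a") and ts.match("c")
print(ok)  # -> True
```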

Page 62

  • Implementing Token Stream Visibility:

    • Provide methods for reading the current token, advancing, and recording progress, so parsing operations stay straightforward.

Page 63

  • Variable Handling in Recursive Rules:

    • Variable recognisers keep track of the token position, combine the successes and failures of their sub-recognisers, and retry alternatives as needed.

Page 64

  • Example Variable Parsing Logic:

    • Use helper routines to manage token-stream positions, and report success or failure based on the outcomes of the sub-rules.

Page 65

  • Handling Repetition in Parsing:

    • Handle repetition with straightforward loops where possible, keeping recursion for genuinely recursive rules.

Page 66

  • Managing Structuring for Repetitions:

    • Use the standard repetition patterns to simplify the implementation while keeping the code conventional.

Page 67

  • Repetitions and Backtracking Techniques:

    • Code segments to detail parsing behaviors for repetitions while adhering to structured expansion logic.
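
In code, a repetition such as item (',' item)* is usually written as a loop rather than literal recursion; a sketch with invented names:

```python
def parse_list(tokens):
    """Parse item (',' item)* with a loop -- the usual way a repetition
    rule is coded in a recursive-descent parser."""
    items, pos = [], 0
    if pos < len(tokens):                        # first item, if any
        items.append(tokens[pos])
        pos += 1
        while pos + 1 < len(tokens) and tokens[pos] == ",":
            items.append(tokens[pos + 1])        # each ',' must be followed by an item
            pos += 2
    return items

print(parse_list(["a", ",", "b", ",", "c"]))  # -> ['a', 'b', 'c']
```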

Page 68

  • Structural Notes on Efficiency:

    • Reflecting on how the rules are parsed can reveal opportunities for optimisation, such as restructuring rules to reduce backtracking.

Page 69

  • Addressing Efficiency Crossover:

    • Terminal recognisers can check the next token before committing to an alternative, filtering out doomed attempts and making variable evaluations more efficient.

Page 70

  • Overall Summary of Grammar Parsing Process:

    • Parsing links formal grammars to practical implementations: clear, structured rules make languages both easier to understand and easier to process.