CS2001 Grammars and Parsing
Presented by: Ruth Hoffmann, University of St Andrews
Page 1
Title: CS2001 Grammars and Parsing
Page 2
Overview of Grammars and Parsing
Key Topics:
Grammars
Parsing
Recursive-descent parser
Finite State Automata (FSAs)
Importance of Grammars:
Recognize a variety of strings (numbers, keywords).
Useful for finite and finitely described inputs.
Higher-level treatment for complex or infinite alternatives.
Page 3
Definitions:
Languages: Sets of sequences (strings) of symbols.
Grammars: Formal descriptions of languages.
Parsers: Machines that determine if sentences conform to the grammar, and perform actions accordingly.
Syntax: Relates to how sentences are structured.
Page 4
Importance of Parsing:
Representing complex structured data is essential for:
Programs, web pages, data from databases.
Intelligent search queries and speech recognition.
Standard Formats:
Examples include XML, but not all data adheres to structured formats.
Page 5
Limitations of FSAs:
FSAs recognise input one symbol at a time and cannot match nested or recursive structure, which makes them inadequate for intricate languages.
Important to maintain language integrity while parsing.
Some languages cannot be captured by FSAs alone.
Page 6
Understanding Grammars:
Grammar Definition: A set of formal rules defining valid sentences including:
What symbols can appear.
Their order and combination.
Context-Free Grammars (CFGs): Focus point of the discussion.
Page 7
Complex Patterns:
Sequences and interleavings of strings can be complicated (e.g., all Java programs, all legal XML files).
The goal is to define simple patterns for syntactically valid programs.
Focus primarily on syntax, not semantics like typing.
Page 8
Context-Free Grammar Structure:
A CFG is defined as a 4-tuple:
V: finite set of variables.
Σ: finite set of terminals (disjoint from V).
R: finite set of rules (each with a variable and corresponding strings of variables/terminals).
S: start variable within V.
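The 4-tuple above can be encoded directly as data. The sketch below is an assumed illustration (the grammar S → a S b | ε is a standard example, not one from the slides):

```python
# A toy CFG encoded as the 4-tuple (V, Sigma, R, S).
V = {"S"}                         # variables
SIGMA = {"a", "b"}                # terminals, disjoint from V
R = {                             # rules: variable -> list of alternatives,
    "S": [["a", "S", "b"], []],   # S -> a S b | epsilon
}
START = "S"                       # start variable, a member of V

# This grammar generates {a^n b^n : n >= 0}, a language no FSA can recognise.
```

Representing rules as a mapping from a single variable to a list of alternatives mirrors the CFG restriction that every left-hand side is one variable.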
Page 9
Generating Words from Grammars:
Start at the start variable and apply the rules recursively to generate legal words.
Valid words should only consist of terminal symbols.
Page 10
Infinite Language Possibilities:
Many grammars generate infinitely many sentences.
This recursive process continues until there are only terminal symbols.
Page 11
CFG Limitations:
In CFGs, all rules must have a single variable on the left-hand side.
Other grammatical formalisms relax this restriction, describing potentially larger language classes that are harder to parse.
Page 12
Grammar Rewriting Process:
Start with the first variable, choose any variable from the pattern, and find a rule with it on the LHS.
Rewrite the string until it consists of only terminal symbols.
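The rewriting process above can be sketched as a leftmost derivation. This is an assumed illustration using the toy grammar S → a S b | ε, where `choices` selects which alternative to apply at each step:

```python
# Leftmost derivation sketch: repeatedly rewrite the leftmost variable
# until only terminal symbols remain.
RULES = {"S": [["a", "S", "b"], []]}   # S -> a S b | epsilon

def derive(choices):
    """Rewrite the leftmost variable at each step, picking the rule
    alternative indexed by `choices`; return the final terminal string."""
    string = ["S"]
    for choice in choices:
        # find the leftmost variable in the current string
        i = next(k for k, sym in enumerate(string) if sym in RULES)
        string[i:i + 1] = RULES[string[i]][choice]
    return "".join(string)

# S => aSb => aaSbb => aabb
print(derive([0, 0, 1]))  # -> aabb
```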
Page 13
Parse Trees:
Each derivation generates a tree structure, where leaves are terminal symbols.
A tree is defined as connected and acyclic (graph without cycles).
Page 14
Extensions and Shorthands:
Merging rules can increase readability.
Epsilon (ε) can represent a silent/zero-length string.
Page 15
Abstract Syntax for Statements:
Variables represent statement types, expressions, and terminals include symbols and keywords.
Example grammar for a for-statement:
for-statement → for ( expression ; expression ; expression ) statements.
Page 16
CFGs vs FSAs:
CFGs can be represented by pushdown automata, which combine FSA functionality with a stack.
CFGs can accept a broader range of languages compared to FSAs.
Page 17
Ambiguity in Grammars:
A grammar is ambiguous if at least one string has more than one valid parse tree.
Whether a CFG is ambiguous is undecidable in general; it is better to design grammars that avoid common prefixes across productions.
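Ambiguity matters because different parse trees can mean different things. A standard example (assumed here, not taken from the slides) is the grammar E → E + E | E * E | digit, under which the string 1+2*3 has two parse trees; representing each tree as nested tuples and evaluating them shows they genuinely differ:

```python
# Two parse trees for 1+2*3 under the ambiguous grammar
# E -> E + E | E * E | digit, represented as nested tuples.
def eval_tree(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return eval_tree(left) + eval_tree(right)
    return eval_tree(left) * eval_tree(right)

tree_a = ("+", 1, ("*", 2, 3))   # 1 + (2 * 3)
tree_b = ("*", ("+", 1, 2), 3)   # (1 + 2) * 3

print(eval_tree(tree_a), eval_tree(tree_b))  # -> 7 9
```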
Page 18
Summary of CFGs:
CFGs are based on a rule system and allow iterative expansion of variables.
The strings accepted consist of terminals generated during this expansion.
Page 19
Recap of CFGs:
CFGs utilize a rule system to iteratively expand their variables, generating language from terminals along various paths.
Page 20
Upcoming Content:
Recursive patterns, parsing methods, and a further look at handling ambiguity.
Page 21
Recursion in Grammars:
Allows alternatives and repetition within structures.
Two approaches to recursion:
Repeating the same structure multiple times.
Using different forms to allow repetition.
Page 22
Handling Repeats:
Rules may include forms like A → B∗, indicating zero or more occurrences of B (Kleene star).
Recursive structures can be built using the same variables across different rules.
Page 23
General Structure of Recursion:
Recursive structure can emulate the Kleene star, and "one or more" occurrences can be written A+.
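The correspondence can be made concrete with a pair of recognisers. This is an assumed sketch over a single terminal b: the recursive rule A → b A | ε recognises b∗ (zero or more), while A → b A | b recognises b+ (one or more):

```python
def matches_star(s):
    """Recognise b* via the recursive rule A -> b A | epsilon."""
    if s == "":
        return True                              # A -> epsilon
    return s[0] == "b" and matches_star(s[1:])   # A -> b A

def matches_plus(s):
    """Recognise b+ via A -> b A | b: at least one b is required."""
    if s == "b":
        return True                              # A -> b
    return s[:1] == "b" and matches_plus(s[1:])  # A -> b A

print(matches_star(""), matches_plus(""))  # -> True False
```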
Page 24
Synopsis of Example CFG:
Variables: A, B, C
Terminals: 0, 1, for
Example rules and structure using alternatives and recursion (expressing zero or more, one or more).
Page 25
Dynamics of a Grammar:
Start with a variable, expand using patterns, leading to strings in the grammar language.
This generates parse trees or derivations showing rule applications.
Page 26
Practical Example - Numeric Expressions:
Define a grammar for numeric expressions which includes numbers and simple operators without extra complexity.
Page 27
Modular Approach to Grammar Design:
Example Numeric Grammar:
int → -unsigned | unsigned
real → int.unsigned | int
Digits:
digit → 0 | 1 | 2 | ... | 9
Page 28
Building Expressions:
Construct expressions through defined rules, allowing operations such as addition and multiplication while ruling out meaningless expressions.
Page 29
Parsing Process for Expressions:
Generate patterns, ensuring each stage yields terminals and reflects valid expression parsing.
Page 30
Defining Grammar's Purpose and Clarity:
Grammar constructed to generate legal expressions and simplify parsing through structured rules.
Page 31
Managing Ambiguity:
Important to determine which rule applies in ambiguous situations, leading to potential variations in parse trees.
Page 32
Operator Precedence:
Syntax plays a key role in dealing with operator precedence (e.g., performing multiplications before additions).
Page 33
Disambiguation Steps:
Aim for clarity in parsing by explicitly defining rule functions and maintaining structure.
Page 34
Evaluating Expressions:
Build trees for local calculations, facilitating overall language representation in compilers and interpreters.
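The precedence and evaluation ideas above can be sketched in one place. The following is an assumed illustration using the standard layered grammar expr → term { + term }, term → factor { * factor }, factor → digit, which evaluates as it parses rather than building an explicit tree:

```python
def evaluate(src):
    """Evaluate a digit/+/* expression with correct precedence."""
    tokens = list(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():              # factor -> digit
        return int(eat())

    def term():                # term -> factor { * factor }
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def expr():                # expr -> term { + term }
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    return expr()

print(evaluate("2+3*4"))   # multiplication binds tighter -> 14
```

Because multiplication lives one layer deeper in the grammar than addition, the parse structure itself enforces precedence; no separate precedence table is needed.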
Page 35
Summary of Expression Parsing:
Parse trees represent the structure of languages dynamically, with clear roles defined for processing.
Page 36
Overview of Parsing Conversion:
Parsing transforms complex texts into structured parse trees through systematic conversion methods.
Page 37
Goals of Parsing:
Create concise grammar definitions while maintaining simplicity for human writing and computer execution.
Page 38
Simplified Program Structure:
Create grammar rules for programming language projects, covering assignment statements and expressions.
Page 39
From Human to Machine Representation:
Example transformation of expressions into machine code needing minimum complexity in translation.
Page 40
Focus on the Essential Elements:
Grammar should emphasize key components like tokens for clarity and focus on essential structures.
Page 41
Tokenisation Process:
Conversion of character strings into symbols streamlines parsing.
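A tokeniser of this kind is commonly sketched with regular expressions (each pattern playing the role of a small FSA). The token categories below are assumed for illustration:

```python
import re

# Each (name, regex) pair defines one token category.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=;()]"),
    ("SKIP",   r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenise(text):
    """Convert a character string into a stream of (kind, value) tokens."""
    tokens = []
    for match in PATTERN.finditer(text):
        kind = match.lastgroup
        if kind != "SKIP":               # drop whitespace
            tokens.append((kind, match.group()))
    return tokens

print(tokenise("x = 10 + y;"))
# -> [('IDENT', 'x'), ('OP', '='), ('NUMBER', '10'),
#     ('OP', '+'), ('IDENT', 'y'), ('OP', ';')]
```

Note that this sketch silently skips characters that match no pattern; a production tokeniser would add a catch-all category and raise an error instead.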
Page 42
Tokenisation Complexity:
Define sub-grammars for simple tokens; however, consider the entire input for comprehensive tokenization.
Page 43
Using FSAs for Tokenisation:
FSAs can define tokens compactly, providing advantages such as efficiency with low overhead.
Page 44
Tokenisation Approach:
Convert source input into a token stream while managing error generation for unrecognized strings.
Page 45
FSA Characteristics:
The FSA emits symbols corresponding to grammar terminals, while recording the associated values in the token stream.
Page 46
Next Steps in Parsing:
Develop algorithms for parsing input token streams using two primary approaches: Top-down and Bottom-up.
Page 47
Recursive-Descent Parsing:
Each rule acts as a recognizer within the token stream, incorporating recursive functions for complex pattern matching.
Page 48
Methodology Noted:
A “try it and see” approach allows for attempts at matches against the token stream with potential backtracking as needed.
Page 49
Table-Driven Parsing Approach:
Use the next input token to select among candidate rules, building the parse structure efficiently within the table's constraints.
Page 50
Comparison of Techniques:
Table-driven methods may seem less transparent, but they are efficient: tokens are evaluated in a single pass without backtracking.
Page 51
Ambiguity Issues:
Some ambiguous grammars cause difficulties for table-driven parsers, though deterministic rule selection can resolve others.
Page 52
Support for Parsing Implementations:
Tools exist to generate both recursive-descent and table-driven parsers, making them easier to build and maintain.
Page 53
Layered Toolchain for Parsing and Code Generation:
Streamlined processing from text to a machine code binary through parsing layers: Tokeniser -> Parser -> Interpreter/Compiler.
Page 54
Key Takeaways on Parsers:
Built using FSAs and grammars, essential for handling text and program manipulations with established support tools.
Page 55
Future Considerations in Parser Development:
Recursive-descent parsers are engaging to write by hand, but complex constructions and full languages often rely on parser-compilers.
Page 56
Consideration for Practicality:
Engage with pre-existing parsers unless new applications necessitate experimentation into unique grammar forms.
Page 57
Pseudo-Code for Parser Development:
Create parser skeletons focusing on abstract algorithm shapes without delving into syntax details too early.
Page 58
Mechanics of Recursive Descent:
Each rule transitions into a function within the parser, employing functions for both variable and terminal recognizers.
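The rule-to-function mapping can be sketched as follows. This is an assumed illustration for the toy grammar S → a S b | ε: one function per variable, plus a shared terminal recogniser that consumes a token on a match:

```python
def parse(tokens):
    """Recognise the language of S -> a S b | epsilon."""
    pos = 0

    def terminal(expected):            # terminal recogniser: consume on match
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
            return True
        return False

    def S():                           # variable recogniser for S
        nonlocal pos
        saved = pos                    # remember position for backtracking
        if terminal("a") and S() and terminal("b"):
            return True                # alternative 1: S -> a S b
        pos = saved                    # alternative failed: backtrack
        return True                    # alternative 2: S -> epsilon

    return S() and pos == len(tokens)  # succeed only if all input is consumed

print(parse(list("aabb")), parse(list("aab")))  # -> True False
```

Each alternative is tried in order ("try it and see"); when one fails, the position is restored before the next is attempted.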
Page 59
Detailed Rule Functionality:
Each rule's function must be capable of recognizing respective fragments amid recursive call structures.
Page 60
Conventional Recognition Logic:
Terminal recognizers progressively consume recognized tokens, whereas variable recognizers control flow with backtracking and pattern validation.
Page 61
Token Stream Management:
Stream position awareness allows for reverting and continued processing when patterns fail to match.
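Position awareness is often packaged as a small token-stream class. The class and method names below are assumed, not taken from the slides: `mark()` records the current position before a trial match, and `reset()` reverts to it when the pattern fails:

```python
class TokenStream:
    """Token stream with position marking for backtracking parsers."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):                    # look at the current token
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):                    # consume and return the current token
        tok = self.peek()
        if tok is not None:
            self.pos += 1
        return tok

    def mark(self):                    # remember the position before a trial
        return self.pos

    def reset(self, marker):           # revert when a pattern fails to match
        self.pos = marker

stream = TokenStream(["x", "=", "1"])
marker = stream.mark()
stream.next(); stream.next()           # trial match consumes two tokens
stream.reset(marker)                   # the attempt failed: rewind
print(stream.peek())                   # -> x
```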
Page 62
Implementing Token Stream Visibility:
Develop methods for reading the current token, advancing, and tracking progress through the stream, making parsing operations straightforward to express.
Page 63
Variable Handling in Recursive Rules:
Variable recognisers keep track of the token position, combining the successes and failures of sub-rules and re-attempting alternatives as needed.
Page 64
Example Variable Parsing Logic:
Use utilities for managing token stream positions and encoding logic for success based on variable evaluations.
Page 65
Handling Repetition in Parsing:
Repeated structures can be recognised with simple loops, reserving recursion for the grammar rules that genuinely require it.
Page 66
Managing Structuring for Repetitions:
Utilize established repetition rules to simplify implementations while keeping standard coding practices in context.
Page 67
Repetitions and Backtracking Techniques:
Code segments to detail parsing behaviors for repetitions while adhering to structured expansion logic.
Page 68
Structural Notes on Efficiency:
Reflecting on how rules are recognised can reveal opportunities to optimise the parser.
Page 69
Addressing Efficiency Crossover:
Terminal recognisers can filter on the next token before committing to an alternative, streamlining variable evaluations.
Page 70
Overall Summary of Grammar Parsing Process:
Parsing bridges grammars and practical implementations through clear, structured stages that make languages easier to understand and process.