Lecture 2 - Scanner
What is a scanner
Splits a stream of characters into a sequence of tokens

What is a token
A minimal sequence of characters that represents a unit of information
Each language specifies a finite set of token types
Examples:
Type      Examples
ID        foo  n14  last
INTEGER   73  0  515  082
REAL      66.1  .5  10.  1e67
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )
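The table above can be turned into a small scanner. The sketch below is only an illustration: the patterns, the names TOKEN_SPEC, MASTER and scan, and the extra SKIP rule for whitespace are assumptions, not part of the lecture. Python's alternation picks the first alternative that matches, so REAL is listed before INTEGER and IF before ID.

```python
import re

# Hypothetical token specification mirroring the table above.
# Order matters: Python tries the alternatives from left to right.
TOKEN_SPEC = [
    ("REAL",    r"\d+\.\d*(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+"),
    ("INTEGER", r"\d+"),
    ("IF",      r"if\b"),                    # \b keeps "iffy" from starting with IF
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("COMMA",   r","),
    ("NOTEQ",   r"!="),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),                     # whitespace is matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Split text into (token type, matched characters) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, scan("if (foo != 73)") yields IF, LPAREN, ID, NOTEQ, INTEGER, RPAREN, each paired with the characters it matched.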
Non-token Examples
Some character sequences are not tokens:
comment                  /* try again */
preprocessor directive   #include <stdio.h>
preprocessor directive   #define NUMS 5 , 6
macro                    NUMS
blanks, tabs, newlines
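A scanner typically consumes comments and whitespace without emitting anything for them. The sketch below shows one way to do that, assuming C-style /* ... */ comments; the names COMMENT, WHITESPACE and skip_non_tokens are made up for the example. Preprocessor directives and macros are left out, on the assumption that a separate preprocessing pass handles them before scanning.

```python
import re

# Assumed definitions of the two non-tokens handled here.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)   # /* ... */, possibly spanning lines
WHITESPACE = re.compile(r"[ \t\r\n]+")          # blanks, tabs, newlines

def skip_non_tokens(text, pos):
    """Return the first position at or after pos that is not inside
    whitespace or a comment (len(text) if only non-tokens remain)."""
    while True:
        m = WHITESPACE.match(text, pos) or COMMENT.match(text, pos)
        if m is None:
            return pos
        pos = m.end()
```

Calling skip_non_tokens before each token match keeps the token rules themselves free of whitespace and comment handling.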
Specifying Token Types
Token structures can be complex, and English descriptions can be tedious, imprecise, and incomplete
E.g., identifiers and real numbers
Need a formal system to specify them without ambiguity
A formal specification also allows the design, the validation, and the implementation to be reviewed
Regular Expressions:
succinct, precise, complete
capable of representing infinite sets of strings
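To make this concrete, here is how identifiers and real numbers might be written as regular expressions in Python's re syntax. These definitions are assumptions chosen to accept the ID and REAL examples from the token table; the lecture does not fix the exact patterns, and the helper names is_id and is_real are hypothetical.

```python
import re

# Assumed building blocks.
DIGITS = r"[0-9]+"
EXPONENT = rf"[eE][+-]?{DIGITS}"

ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
REAL = re.compile(
    rf"{DIGITS}\.[0-9]*(?:{EXPONENT})?"   # digits, point, optional fraction: 66.1, 10.
    rf"|\.{DIGITS}(?:{EXPONENT})?"        # leading point: .5
    rf"|{DIGITS}{EXPONENT}"               # no point, exponent required: 1e67
)

def is_id(s):
    """True if the whole string s is an identifier under the assumed pattern."""
    return ID.fullmatch(s) is not None

def is_real(s):
    """True if the whole string s is a real number under the assumed pattern."""
    return REAL.fullmatch(s) is not None
```

Each pattern is a few characters long yet describes an infinite set of strings, which is the point of using regular expressions instead of English rules.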
English rules for identifiers
An identifier is a sequence of letters or digits; the first character must be a letter
The underscore _ counts as a letter
Upper- and lower-case letters are different
If the input stream has been parsed into tokens up to a given character, the next token is taken to include the longest string of characters that could possibly constitute a token
Blanks, tabs, newlines, and comments are ignored except as they serve to separate tokens
Some whitespace is required to separate otherwise adjacent identifiers, keywords, or constants
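The longest-match rule can be sketched as follows: at each position, try every rule and keep the longest match, breaking ties in favour of the rule listed first. The rule set and the names RULES, longest_match and scan are assumptions for illustration. The sketch also relies on each individual pattern being greedy enough to return its own longest match, which holds for these simple patterns but not for arbitrary Python regexes.

```python
import re

# Hypothetical rules in priority order: on a tie in length the earlier
# rule wins, which is how the keyword "if" beats ID.
RULES = [
    ("IF",      re.compile(r"if")),
    ("ID",      re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("REAL",    re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("NOTEQ",   re.compile(r"!=")),
]
WHITESPACE = re.compile(r"\s+")

def longest_match(text, pos):
    """Return (type, lexeme) for the longest rule match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def scan(text):
    """Tokenize text; whitespace only separates tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        best = longest_match(text, pos)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens
```

Under these rules "iffy" is one identifier, while "if fy" is the keyword followed by an identifier: the blank is what separates the two tokens.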
Regular Expressions
Basic Regular Expressions:
Given any character a from an alphabet Σ, a itself is a regular expression, and the language it denotes is L(a) = {a}.
Empty string: ε (also λ)
L(ε) = {ε}
Empty set: Φ
L(Φ) = {}
What is the difference between ε and Φ?
Φ denotes the empty language: no string matches it, not even the empty string. ε denotes a language with exactly one member, the empty string, so L(ε) is not empty.
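One way to see the difference is to model languages as Python sets of strings. This is only an illustration; the names L_EPSILON, L_EMPTY, L_A and concat are made up for the example.

```python
# Languages modelled as sets of strings (an assumption made for illustration).
L_EPSILON = {""}     # L(ε): one member, the empty string
L_EMPTY = set()      # L(Φ): no members at all
L_A = {"a"}          # L(a)

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

print(len(L_EPSILON), len(L_EMPTY))   # sizes of the two languages
print(concat(L_A, L_EPSILON))         # ε leaves the other language unchanged
print(concat(L_A, L_EMPTY))           # Φ wipes the result out
```

Concatenating with L(ε) changes nothing, whereas concatenating with L(Φ) leaves no strings at all, because there is nothing to pair with.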
Basic Operations
Concatenation: ab (also written a·b)
Choice/alternation: a|b
L(a|b) = L(a) ∪ L(b) = {a, b}
Kleene closure: a*
Represents an infinite set of strings: zero or more repetitions of a, i.e. L(a*) = {ε, a, aa, aaa, ...}
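The three operations can be sketched as operations on sets of strings. The function names are assumptions. Since the Kleene closure is infinite, star takes a hypothetical max_len parameter and returns only the strings up to that length.

```python
def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def union(l1, l2):
    """Choice/alternation: strings in either language."""
    return l1 | l2

def star(lang, max_len):
    """Kleene closure, truncated to strings of length at most max_len."""
    result = {""}            # zero repetitions always gives the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more repetition, within the bound.
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result
```

For instance, star({"a"}, 3) gives the empty string together with a, aa and aaa, and concat(union({"a"}, {"b"}), {"a"}) gives aa and ba.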
Combinations:
L((a|b)a) = {aa, ba}
L(((a|b)a)*) = {ε, aa, ba, aaaa, aaba, baaa, baba, aaaaaa, ...}
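These sets can be checked mechanically with Python's re module by enumerating every string over {a, b} up to some length and keeping those that the expression matches in full. The helper name language_up_to and its parameters are assumptions for this sketch.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """All strings over alphabet, of length <= max_len, fully matched by pattern."""
    regex = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    }

print(sorted(language_up_to(r"(a|b)a", "ab", 4)))
print(sorted(language_up_to(r"((a|b)a)*", "ab", 4)))
```

With max_len set to 4, the second call returns the empty string plus the six strings of length 2 and 4 listed above; raising the bound adds the longer members.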