Lecture 2 - Scanner

What is a scanner

  • Splits a stream of characters into a sequence of tokens

What is a token

  • A minimal sequence of characters that represents a unit of information

  • Each language specifies a finite set of token types

  • Examples:

    • Type        Examples

    • ID          foo n14 last

    • INTEGER     73 0 515 082

    • REAL        66.1 .5 10. 1e67

    • IF          if

    • COMMA       ,

    • NOTEQ       !=

    • LPAREN      (

    • RPAREN      )
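The table above can be turned into a small scanner. The following is a minimal sketch, assuming Python's `re` module and hand-written patterns for each token type; the names `TOKEN_SPEC`, `MASTER`, and `scan` are hypothetical, and the patterns are one plausible reading of the examples in the table, not a definition taken from any particular language.

```python
import re

# Hypothetical token specification mirroring the table above.
# Alternatives are tried left to right, so REAL comes before INTEGER
# and IF comes before ID.
TOKEN_SPEC = [
    ("REAL",    r"(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+"),
    ("INTEGER", r"\d+"),
    ("IF",      r"if\b"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NOTEQ",   r"!="),
    ("COMMA",   r","),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),  # blanks, tabs, newlines: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Split text into a list of (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("if (foo != 10., .5)"))
```

Each pair holds the token type and the matched characters, so `if` comes out as `("IF", "if")` while `foo` comes out as `("ID", "foo")`.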

Non-token Examples

  • Some character sequences are not tokens:

  • comment                   /* try again */

  • preprocessor directive    #include <stdio.h>

  • preprocessor directive    #define NUMS 5 , 6

  • macro                     NUMS

  • blanks, tabs, newlines

Specifying Token Types

  • Token structures can be complex, and English descriptions can be tedious, imprecise, and incomplete

    • E.g., identifiers and real numbers

  • Need a formal system to specify them without ambiguity

    • Allows review of the design, of the validation, and of the implementation

  • Regular Expressions:

    • succinct, precise, complete

    • capable of representing infinite sets of strings

English rules for identifiers

  • An identifier is a sequence of letters or digits; the first character must be a letter

  • The underscore _ counts as a letter

  • Upper- and lower-case letters are different

  • If the input stream has been parsed into tokens up to a given character, the next token is taken to include the longest string of characters that could possibly constitute a token

  • Blanks, tabs, newlines, and comments are ignored except as they serve to separate tokens

  • Some whitespace is required to separate otherwise adjacent identifiers, keywords, or constants
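The longest-match rule and the role of whitespace can be sketched in code. The example below is a simplified illustration, assuming only three token types (IF, ID, INTEGER) and no comment handling; `RULES`, `next_token`, and `scan_longest` are hypothetical names. At each position every rule is tried, and the longest match wins; on a tie the earlier rule is kept, which is how the keyword `if` beats the identifier pattern.

```python
import re

# Hypothetical rules. Order only breaks ties between equally long matches.
RULES = [
    ("IF",      re.compile(r"if")),
    ("ID",      re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("INTEGER", re.compile(r"[0-9]+")),
]
WHITESPACE = re.compile(r"[ \t\n]+")

def next_token(text, pos):
    """Return (type, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def scan_longest(text):
    """Tokenize text, skipping whitespace between tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()  # whitespace only separates tokens
            continue
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

print(scan_longest("if x"))  # [('IF', 'if'), ('ID', 'x')]
print(scan_longest("ifx"))   # [('ID', 'ifx')]
```

Without the blank, `ifx` is a single identifier, because the identifier pattern yields a longer match than the keyword. This is why some whitespace is required between adjacent keywords and identifiers.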

Regular Expressions

  • Basic Regular Expressions:

    • Any character a from an alphabet Σ is itself a regular expression; the language it denotes is L(a) = {a}.

    • Empty string: ε (also λ)

      • L(ε) = {ε}

    • Empty set: Φ

      • L(Φ) = {}

    • What is the difference between ε and Φ?

      • Φ matches nothing: its language contains no strings at all. ε matches exactly one string, the empty string, so its language is not empty.
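The difference shows up directly if a language is modelled as a Python set of strings. This encoding, and the helper `concat`, are illustrative assumptions rather than part of the lecture.

```python
# Languages modelled as Python sets of strings (an illustrative encoding).
L_epsilon = {""}   # L(ε): exactly one member, the empty string
L_empty = set()    # L(Φ): no members at all

def concat(A, B):
    """All strings formed by a member of A followed by a member of B."""
    return {x + y for x in A for y in B}

L_a = {"a"}

print(len(L_epsilon), len(L_empty))  # 1 0
print(concat(L_a, L_epsilon))        # {'a'}
print(concat(L_a, L_empty))          # set()
```

Concatenating with L(ε) leaves a language unchanged, while concatenating with L(Φ) leaves nothing to match.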

Basic Operations

  • Concatenation: ab (also a·b)

  • Choice/alternation: a|b

    • L(a|b) = L(a) ∪ L(b) = {a, b}

  • Kleene closure: a*

    • L(a*) = L(ε) ∪ L(a) ∪ L(aa) ∪ L(aaa) ∪ ... = {ε, a, aa, aaa, ...}

    • Denotes an infinite set of strings, each of finite length

  • Combinations:

    • L((a|b)a) = {aa, ba}

    • L(((a|b)a)*) = {ε, aa, ba, aaaa, aaba, baaa, baba, aaaaaa, ...}
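The combinations above can be checked by computing the languages as sets. This is a minimal sketch, assuming languages are modelled as Python sets of strings. The helpers `concat`, `union`, and `star` are hypothetical names, and `star` takes a repetition bound because the full Kleene closure is infinite.

```python
def concat(A, B):
    """Concatenation: every member of A followed by every member of B."""
    return {x + y for x in A for y in B}

def union(A, B):
    """Choice/alternation: members of either language."""
    return A | B

def star(A, max_reps):
    """Kleene closure truncated to at most max_reps repetitions of A."""
    result = {""}   # zero repetitions gives the empty string
    power = {""}
    for _ in range(max_reps):
        power = concat(power, A)
        result |= power
    return result

a, b = {"a"}, {"b"}
ab_a = concat(union(a, b), a)  # (a|b)a

print(sorted(ab_a))
# ['aa', 'ba']
print(sorted(star(ab_a, 2), key=lambda s: (len(s), s)))
# ['', 'aa', 'ba', 'aaaa', 'aaba', 'baaa', 'baba']
```

Raising the bound passed to `star` adds the six-character strings, then the eight-character strings, and so on, which is the sense in which the closure denotes an infinite set.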