AR

CMPSC 311 — Regular Expressions Lecture Notes

Definition and Core Idea of Regular Expressions

  • Regular Expressions (common abbreviations: regex, occasionally regexp)
    • Formal language designed to describe patterns within strings.
    • Think of it as a “super-powered” search-and-replace syntax that can be embedded almost anywhere:
    • Command-line tools (e.g., grep, egrep)
    • Shell scripts
    • Text-editor search bars (VS Code, Vim, Sublime, etc.)
    • Programming-language APIs (Python, Perl, Java, .NET, Ruby, JavaScript, PCRE, …)
  • Conceptual importance in Systems Programming (course context CMPSC 311)
    • Automates validation, parsing, and transformation tasks that normally require many lines of code.
    • Serves as gateway to understanding lexical analyzers, finite state machines, and compiler front-ends (future lectures).

Typical Use-Cases & Motivation

  • Pattern-based searching
    • Find email addresses, phone numbers, dates, account numbers, log timestamps, etc.
  • Validation of user input
    • Ensures a string conforms to a pattern before a system accepts or stores it (security + data integrity).
  • Complex find-and-replace (rewriting)
    • Example discussed: swap order of two words—change “X and Y” into “Y and X” for any words X and Y.
  • Productivity boost
    • One-liner solutions replace dozens of manual edits or lines of C/Python code.

Landscape of Regex “Flavors” (Engines)

  • Each implementation adds or omits features—collectively called a flavor.
  • Major flavors mentioned (source list on slide):
    • JGsoft (Just Great Software)
    • .NET (C#, VB, F#) versions 1-3
    • Java (java.util.regex; JDK 1.4+ with extra features in 1.5 & 1.6)
    • Perl 5.6/5.8 (Unicode after 5.6)
    • PCRE 5.x/6.x (C library, PHP, many tools)
    • ECMA-262 / JavaScript
    • Python re module
    • Ruby built-in engine
    • Tcl ARE (Advanced Regular Expressions)
    • POSIX BRE/ERE, GNU BRE/ERE (classic UNIX utilities)
    • XML Schema, XPath/XQuery regex specs
  • Practical takeaway
    • Syntax is ~90 % common, but always check edge-case behavior (look-around, back-reference limits, Unicode handling).

Command-Line Workhorse: grep / egrep

  • Name origin: global regular expression print.
  • Variants
    • grep -F – Fixed (literal) strings, no metacharacters.
    • grep -E or egrep – Extended regex (close to scripting-language flavor, used in lecture).
  • Basic syntax patterns
    • Search files: grep -E "regex" file1 file2 …
    • Filter stdin: ... | grep -E "regex"
  • Useful flags
    • -i – ignore case
    • -v – invert match (select non-matching lines)
    • -r – recursive directory search
    • -l – list only filenames that contain a match (not the matching lines)

Hands-On File for Class Demos

  • File: grep-text.txt (available on course Canvas)
  • Students instructed to copy to VM and test commands.
    • Example E1 starter: grep -E this grep-text.txt ⇒ prints lines containing literal “this”.

Literal Characters & Basic Matching (Lesson 1)

  • Alphanumeric characters and many symbols match themselves.
  • Important mental model: a regex can match anywhere in a line unless you anchor it.
  • Exercise references
    • E2: Find lines containing “fgh” ⇒ grep -E fgh grep-text.txt
    • E3: Find lines containing “lmn”

Anchors (Beginning/End of Line)

  • ^ – matches start of line
  • \ – matches end of line (escaped as \$ in shell quotes)
  • Typical shell-safe invocation: grep -E "^pattern$" file
  • Example Exercises
    • E4: Words ending in “gry” ⇒ grep -E "gry$" ...
    • E5: Words starting with “ah” ⇒ grep -E "^ah" ...
    • Using both anchors simultaneously requires pattern to match entire line.
    • E.g., grep -E "^foo$" finds lines that are exactly “foo”.

Single-Character Wildcard

  • Dot . – matches exactly one character (including digits, punctuation; excludes newline).
  • Exercises
    • E6: 6-letter word where 2nd, 4th, 6th letters are “o”: pattern ^.o.o.o$ ( anchored + six dots/characters )
    • E7: Word starts with “gr”, ends with “at”, exactly one char in middle: gr.at

Multi-Character Wildcard & Quantifiers

  • Dot-star .* – two tokens:
    • . any char
    • * quantifier “0 or more”
  • Quantifier cheat-sheet (applies to preceding token or group):
    • * – 0 or more
    • + – 1 or more
    • ? – 0 or 1 (optional)
    • {n} – exactly n repetitions (many flavors also support {n,m} ranges)
  • Exercises
    • E8: Words “gr…at” any length ⇒ gr.*at
    • E9: Spell-check necessary/necessary variants ⇒ necc?ess?ary
    • E10: uk/us spelling colour/color ⇒ colou?r
    • E11: Sequence u→o→i→e→a with ≥1 char between each ⇒ u.+o.+i.+e.+a

Character Classes & Whitespace

  • Square brackets specify one char from a set: [abc].
  • Combine with quantifiers for length control.
  • Pre-defined shorthand (POSIX & PCRE):
    • \s – any whitespace (space, tab, newline)
    • Lecture explicitly notes space can be expressed via \s when needed.
  • Exercise
    • E12: Word starts with b, remaining letters only c/h/s^b[chs]+$

Ranges Inside Classes

  • Syntax: [a-j] gives a OR b OR … OR j.
  • Examples mentioned
    • One hexadecimal digit: [0-9a-f]
    • English consonants: [b-df-hj-np-tv-z]
  • Exercise
    • E13: Lines containing words formed exclusively from letters a–e ⇒ regex hint [a-e]+

Negative Character Classes

  • [^ … ] – matches any char not in brackets.
  • Combine with ranges: [^0-9] = “non-digit”.
  • Exercise
    • E14: Words with "aq" not followed by "u" ⇒ aq[^u]

Grouping with Parentheses

  • Parentheses create sub-expressions.
    • You can attach quantifiers to the whole group.
  • Also enable capture/back-reference later (see next section).
  • Exercises
    • E15: “m” followed by "ach" repeated ≥1 times, then "e": m(ach)+e
    • E16: Every other char (starting first) is "e" ⇒ pattern using back-references or character groups like ^(e.)+$.

Alternation (Branching)

  • Vertical bar | works like logical OR inside regex engine.
  • Always evaluated left→right; parentheses limit scope.
  • Exercise
    • E17: Lines containing word "this" OR "fish" ⇒ \b(this|fish)\b

Special / Metacharacters Recap & Escaping

  • Primary metacharacters seen so far: ^ $ . * + ? [ ] ( ) | \
  • Precede with backslash to treat literally.
    • Example: literal asterisk ⇒ \*
  • Exercise
    • E18: Search for lines containing *grep -E "\*" file

Back-References (Re-using Captured Text)

  • After you place parentheses, the engine numbers them 1,2,3… in order of opening parenthesis.
  • Syntax to reference: \1, \2, …
  • Extremely powerful for detecting repetition or palindromic patterns.
  • Exercise
    • E19: Word containing immediately repeated 4-char sequence ⇒ \b(....)\1\b
    • Explanation: (....) captures any 4 characters; \1 demands the exact same 4 chars right after.

Learning, Ethics & Real-World Impact

  • Resources
    • regexcrossword.com – gamified practice (progressive difficulty).
    • Classic reference book: “Mastering Regular Expressions” (Jeffrey Friedl, O’Reilly) – covers Perl, .NET, Java, Ruby, more.
  • Ethical & practical considerations
    • Security: poorly crafted regex can lead to ReDoS (Regular-Expression Denial of Service) due to catastrophic backtracking.
    • Maintainability: opaque one-liner may hinder future code readability; always comment or use verbose mode when available ((?x) in many flavors).
    • Portability: scripts relying on GNU extensions may break on BSD or POSIX-only environments—test on target system.

Quick Reference Cheat-Sheet (Course Edition)

  • Anchors: ^ (start) / \ (end)
  • Wildcards: . (one), .* (0+ any)
  • Quantifiers: *, +, ?, \{n\}, \{n,m\}
  • Classes: [abc], [a-z], [^0-9], \s, \d (if flavor supports), etc.
  • Grouping & Alternation: ( … ), |
  • Back-references: \1, \2, …
  • Grep flags to remember: -E, -i, -v, -r, -l

Connection to Future Lectures

  • Finite-State Machines (FSMs): every regular expression corresponds to a deterministic or non-deterministic FSM – foundational to compiler lexical analysis.
  • Systems Programming uses: log parsing, protocol filtering, packet inspection, configuration validation.
  • Introductory gateway to more powerful pattern systems (context-free grammars, parsing expression grammars, etc.).