CMPSC 311 — Regular Expressions Lecture Notes

Definition and Core Idea of Regular Expressions

Regular Expressions (common abbreviations: regex, occasionally regexp)
- Formal language designed to describe patterns within strings.
- Think of it as a “super-powered” search-and-replace syntax that can be embedded almost anywhere:
- Command-line tools (e.g., grep, egrep)
- Shell scripts
- Text-editor search bars (VS Code, Vim, Sublime, etc.)
- Programming-language APIs (Python, Perl, Java, .NET, Ruby, JavaScript, PCRE, …)
Conceptual importance in Systems Programming (course context CMPSC 311)
- Automates validation, parsing, and transformation tasks that normally require many lines of code.
- Serves as gateway to understanding lexical analyzers, finite state machines, and compiler front-ends (future lectures).

Typical Use-Cases & Motivation

Pattern-based searching
- Find email addresses, phone numbers, dates, account numbers, log timestamps, etc.
Validation of user input
- Ensures a string conforms to a pattern before a system accepts or stores it (security + data integrity).
Complex find-and-replace (rewriting)
- Example discussed: swap order of two words—change “X and Y” into “Y and X” for any words X and Y.
Productivity boost
- One-liner solutions replace dozens of manual edits or lines of C/Python code.

Landscape of Regex “Flavors” (Engines)

Each implementation adds or omits features—collectively called a flavor.
Major flavors mentioned (source list on slide):
- JGsoft (Just Great Software)
- .NET (C#, VB, F#) versions 1-3
- Java (java.util.regex; JDK 1.4+ with extra features in 1.5 & 1.6)
- Perl 5.6/5.8 (Unicode after 5.6)
- PCRE 5.x/6.x (C library, PHP, many tools)
- ECMA-262 / JavaScript
- Python re module
- Ruby built-in engine
- Tcl ARE (Advanced Regular Expressions)
- POSIX BRE/ERE, GNU BRE/ERE (classic UNIX utilities)
- XML Schema, XPath/XQuery regex specs
Practical takeaway
- Syntax is ~90 % common, but always check edge-case behavior (look-around, back-reference limits, Unicode handling).

Command-Line Workhorse: grep / egrep

Name origin: global regular expression print.
Variants
- grep -F – Fixed (literal) strings, no metacharacters.
- grep -E or egrep – Extended regex (close to scripting-language flavor, used in lecture).
Basic syntax patterns
- Search files: grep -E "regex" file1 file2 …
- Filter stdin: ... | grep -E "regex"
Useful flags
- -i – ignore case
- -v – invert match (select non-matching lines)
- -r – recursive directory search
- -l – list only filenames that contain a match (not the matching lines)

Hands-On File for Class Demos

File: grep-text.txt (available on course Canvas)
Students instructed to copy to VM and test commands.
- Example E1 starter: grep -E this grep-text.txt ⇒ prints lines containing literal “this”.

Literal Characters & Basic Matching (Lesson 1)

Alphanumeric characters and many symbols match themselves.
Important mental model: a regex can match anywhere in a line unless you anchor it.
Exercise references
- E2: Find lines containing “fgh” ⇒ grep -E fgh grep-text.txt
- E3: Find lines containing “lmn”

Anchors (Beginning/End of Line)

^ – matches start of line
\ – matches end of line (escaped as \$ in shell quotes)
Typical shell-safe invocation: grep -E "^pattern$" file
Example Exercises
- E4: Words ending in “gry” ⇒ grep -E "gry$" ...
- E5: Words starting with “ah” ⇒ grep -E "^ah" ...
- Using both anchors simultaneously requires pattern to match entire line.
- E.g., grep -E "^foo$" finds lines that are exactly “foo”.

Single-Character Wildcard

Dot . – matches exactly one character (including digits, punctuation; excludes newline).
Exercises
- E6: 6-letter word where 2nd, 4th, 6th letters are “o”: pattern ^.o.o.o$ ( anchored + six dots/characters )
- E7: Word starts with “gr”, ends with “at”, exactly one char in middle: gr.at

Multi-Character Wildcard & Quantifiers

Dot-star .* – two tokens:
- . any char
- * quantifier “0 or more”
Quantifier cheat-sheet (applies to preceding token or group):
- * – 0 or more
- + – 1 or more
- ? – 0 or 1 (optional)
- {n} – exactly n repetitions (many flavors also support {n,m} ranges)
Exercises
- E8: Words “gr…at” any length ⇒ gr.*at
- E9: Spell-check necessary/necessary variants ⇒ necc?ess?ary
- E10: uk/us spelling colour/color ⇒ colou?r
- E11: Sequence u→o→i→e→a with ≥1 char between each ⇒ u.+o.+i.+e.+a

Character Classes & Whitespace

Square brackets specify one char from a set: [abc].
Combine with quantifiers for length control.
Pre-defined shorthand (POSIX & PCRE):
- \s – any whitespace (space, tab, newline)
- Lecture explicitly notes space can be expressed via \s when needed.
Exercise
- E12: Word starts with b, remaining letters only c/h/s ⇒ ^b[chs]+$

Ranges Inside Classes

Syntax: [a-j] gives a OR b OR … OR j.
Examples mentioned
- One hexadecimal digit: [0-9a-f]
- English consonants: [b-df-hj-np-tv-z]
Exercise
- E13: Lines containing words formed exclusively from letters a–e ⇒ regex hint [a-e]+

Negative Character Classes

[^ … ] – matches any char not in brackets.
Combine with ranges: [^0-9] = “non-digit”.
Exercise
- E14: Words with "aq" not followed by "u" ⇒ aq[^u]

Grouping with Parentheses

Parentheses create sub-expressions.
- You can attach quantifiers to the whole group.
Also enable capture/back-reference later (see next section).
Exercises
- E15: “m” followed by "ach" repeated ≥1 times, then "e": m(ach)+e
- E16: Every other char (starting first) is "e" ⇒ pattern using back-references or character groups like ^(e.)+$.

Alternation (Branching)

Vertical bar | works like logical OR inside regex engine.
Always evaluated left→right; parentheses limit scope.
Exercise
- E17: Lines containing word "this" OR "fish" ⇒ \b(this|fish)\b

Special / Metacharacters Recap & Escaping

Primary metacharacters seen so far: ^ $ . * + ? [ ] ( ) | \
Precede with backslash to treat literally.
- Example: literal asterisk ⇒ \*
Exercise
- E18: Search for lines containing * ⇒ grep -E "\*" file

Back-References (Re-using Captured Text)

After you place parentheses, the engine numbers them 1,2,3… in order of opening parenthesis.
Syntax to reference: \1, \2, …
Extremely powerful for detecting repetition or palindromic patterns.
Exercise
- E19: Word containing immediately repeated 4-char sequence ⇒ \b(....)\1\b
- Explanation: (....) captures any 4 characters; \1 demands the exact same 4 chars right after.

Learning, Ethics & Real-World Impact

Resources
- regexcrossword.com – gamified practice (progressive difficulty).
- Classic reference book: “Mastering Regular Expressions” (Jeffrey Friedl, O’Reilly) – covers Perl, .NET, Java, Ruby, more.
Ethical & practical considerations
- Security: poorly crafted regex can lead to ReDoS (Regular-Expression Denial of Service) due to catastrophic backtracking.
- Maintainability: opaque one-liner may hinder future code readability; always comment or use verbose mode when available ((?x) in many flavors).
- Portability: scripts relying on GNU extensions may break on BSD or POSIX-only environments—test on target system.

Quick Reference Cheat-Sheet (Course Edition)

Anchors: ^ (start) / \ (end)
Wildcards: . (one), .* (0+ any)
Quantifiers: *, +, ?, \{n\}, \{n,m\}
Classes: [abc], [a-z], [^0-9], \s, \d (if flavor supports), etc.
Grouping & Alternation: ( … ), |
Back-references: \1, \2, …
Grep flags to remember: -E, -i, -v, -r, -l

Connection to Future Lectures

Finite-State Machines (FSMs): every regular expression corresponds to a deterministic or non-deterministic FSM – foundational to compiler lexical analysis.
Systems Programming uses: log parsing, protocol filtering, packet inspection, configuration validation.
Introductory gateway to more powerful pattern systems (context-free grammars, parsing expression grammars, etc.).