CMPSC 311 — Regular Expressions Lecture Notes
Definition and Core Idea of Regular Expressions
- Regular Expressions (common abbreviations: regex, occasionally regexp)
- Formal language designed to describe patterns within strings.
- Think of it as a “super-powered” search-and-replace syntax that can be embedded almost anywhere:
- Command-line tools (e.g., grep, egrep)
- Shell scripts
- Text-editor search bars (VS Code, Vim, Sublime, etc.)
- Programming-language APIs (Python, Perl, Java, .NET, Ruby, JavaScript, PCRE, …)
- Conceptual importance in Systems Programming (course context CMPSC 311)
- Automates validation, parsing, and transformation tasks that normally require many lines of code.
- Serves as gateway to understanding lexical analyzers, finite state machines, and compiler front-ends (future lectures).
Typical Use-Cases & Motivation
- Pattern-based searching
- Find email addresses, phone numbers, dates, account numbers, log timestamps, etc.
- Validation of user input
- Ensures a string conforms to a pattern before a system accepts or stores it (security + data integrity).
- Complex find-and-replace (rewriting)
- Example discussed: swap order of two words—change “X and Y” into “Y and X” for any words X and Y.
- Productivity boost
- One-liner solutions replace dozens of manual edits or lines of C/Python code.
Landscape of Regex “Flavors” (Engines)
- Each implementation adds or omits features—collectively called a flavor.
- Major flavors mentioned (source list on slide):
- JGsoft (Just Great Software)
- .NET (C#, VB, F#) versions 1-3
- Java (java.util.regex; JDK 1.4+ with extra features in 1.5 & 1.6)
- Perl 5.6/5.8 (Unicode after 5.6)
- PCRE 5.x/6.x (C library, PHP, many tools)
- ECMA-262 / JavaScript
- Python re module
- Ruby built-in engine
- Tcl ARE (Advanced Regular Expressions)
- POSIX BRE/ERE, GNU BRE/ERE (classic UNIX utilities)
- XML Schema, XPath/XQuery regex specs
- Practical takeaway
- Syntax is ~90 % common, but always check edge-case behavior (look-around, back-reference limits, Unicode handling).
Command-Line Workhorse: grep / egrep
- Name origin: global regular expression print.
- Variants
- grep -F – Fixed (literal) strings, no metacharacters.
- grep -E or egrep – Extended regex (close to scripting-language flavor, used in lecture).
- Basic syntax patterns
- Search files:
grep -E "regex" file1 file2 …
- Filter stdin:
... | grep -E "regex"
- Useful flags
-i
– ignore case-v
– invert match (select non-matching lines)-r
– recursive directory search-l
– list only filenames that contain a match (not the matching lines)
Hands-On File for Class Demos
- File:
grep-text.txt
(available on course Canvas) - Students instructed to copy to VM and test commands.
- Example E1 starter:
grep -E this grep-text.txt
⇒ prints lines containing literal “this”.
Literal Characters & Basic Matching (Lesson 1)
- Alphanumeric characters and many symbols match themselves.
- Important mental model: a regex can match anywhere in a line unless you anchor it.
- Exercise references
- E2: Find lines containing “fgh” ⇒
grep -E fgh grep-text.txt
- E3: Find lines containing “lmn”
Anchors (Beginning/End of Line)
- ^ – matches start of line
- \ – matches end of line (escaped as
\$
in shell quotes) - Typical shell-safe invocation:
grep -E "^pattern$" file
- Example Exercises
- E4: Words ending in “gry” ⇒
grep -E "gry$" ...
- E5: Words starting with “ah” ⇒
grep -E "^ah" ...
- Using both anchors simultaneously requires pattern to match entire line.
- E.g.,
grep -E "^foo$"
finds lines that are exactly “foo”.
Single-Character Wildcard
- Dot . – matches exactly one character (including digits, punctuation; excludes newline).
- Exercises
- E6: 6-letter word where 2nd, 4th, 6th letters are “o”: pattern
^.o.o.o$
( anchored + six dots/characters ) - E7: Word starts with “gr”, ends with “at”, exactly one char in middle:
gr.at
Multi-Character Wildcard & Quantifiers
- Dot-star .* – two tokens:
- . any char
- * quantifier “0 or more”
- Quantifier cheat-sheet (applies to preceding token or group):
- * – 0 or more
- + – 1 or more
- ? – 0 or 1 (optional)
- {n} – exactly n repetitions (many flavors also support {n,m} ranges)
- Exercises
- E8: Words “gr…at” any length ⇒
gr.*at
- E9: Spell-check necessary/necessary variants ⇒
necc?ess?ary
- E10: uk/us spelling colour/color ⇒
colou?r
- E11: Sequence u→o→i→e→a with ≥1 char between each ⇒
u.+o.+i.+e.+a
Character Classes & Whitespace
- Square brackets specify one char from a set:
[abc]
. - Combine with quantifiers for length control.
- Pre-defined shorthand (POSIX & PCRE):
\s
– any whitespace (space, tab, newline)- Lecture explicitly notes space can be expressed via
\s
when needed.
- Exercise
- E12: Word starts with b, remaining letters only c/h/s ⇒
^b[chs]+$
Ranges Inside Classes
- Syntax:
[a-j]
gives a OR b OR … OR j. - Examples mentioned
- One hexadecimal digit:
[0-9a-f]
- English consonants:
[b-df-hj-np-tv-z]
- Exercise
- E13: Lines containing words formed exclusively from letters a–e ⇒ regex hint
[a-e]+
Negative Character Classes
[^ … ]
– matches any char not in brackets.- Combine with ranges:
[^0-9]
= “non-digit”. - Exercise
- E14: Words with "aq" not followed by "u" ⇒
aq[^u]
Grouping with Parentheses
- Parentheses create sub-expressions.
- You can attach quantifiers to the whole group.
- Also enable capture/back-reference later (see next section).
- Exercises
- E15: “m” followed by "ach" repeated ≥1 times, then "e":
m(ach)+e
- E16: Every other char (starting first) is "e" ⇒ pattern using back-references or character groups like
^(e.)+$
.
Alternation (Branching)
- Vertical bar | works like logical OR inside regex engine.
- Always evaluated left→right; parentheses limit scope.
- Exercise
- E17: Lines containing word "this" OR "fish" ⇒
\b(this|fish)\b
- Primary metacharacters seen so far:
^ $ . * + ? [ ] ( ) | \
- Precede with backslash to treat literally.
- Example: literal asterisk ⇒
\*
- Exercise
- E18: Search for lines containing
*
⇒ grep -E "\*" file
Back-References (Re-using Captured Text)
- After you place parentheses, the engine numbers them 1,2,3… in order of opening parenthesis.
- Syntax to reference:
\1
, \2
, … - Extremely powerful for detecting repetition or palindromic patterns.
- Exercise
- E19: Word containing immediately repeated 4-char sequence ⇒
\b(....)\1\b
- Explanation:
(....)
captures any 4 characters; \1
demands the exact same 4 chars right after.
Learning, Ethics & Real-World Impact
- Resources
- regexcrossword.com – gamified practice (progressive difficulty).
- Classic reference book: “Mastering Regular Expressions” (Jeffrey Friedl, O’Reilly) – covers Perl, .NET, Java, Ruby, more.
- Ethical & practical considerations
- Security: poorly crafted regex can lead to ReDoS (Regular-Expression Denial of Service) due to catastrophic backtracking.
- Maintainability: opaque one-liner may hinder future code readability; always comment or use verbose mode when available (
(?x)
in many flavors). - Portability: scripts relying on GNU extensions may break on BSD or POSIX-only environments—test on target system.
Quick Reference Cheat-Sheet (Course Edition)
- Anchors: ^ (start) / \ (end)
- Wildcards:
.
(one), .*
(0+ any) - Quantifiers:
*
, +
, ?
, \{n\}
, \{n,m\}
- Classes:
[abc]
, [a-z]
, [^0-9]
, \s
, \d
(if flavor supports), etc. - Grouping & Alternation:
( … )
, |
- Back-references:
\1
, \2
, … - Grep flags to remember:
-E
, -i
, -v
, -r
, -l
Connection to Future Lectures
- Finite-State Machines (FSMs): every regular expression corresponds to a deterministic or non-deterministic FSM – foundational to compiler lexical analysis.
- Systems Programming uses: log parsing, protocol filtering, packet inspection, configuration validation.
- Introductory gateway to more powerful pattern systems (context-free grammars, parsing expression grammars, etc.).