Regular Expressions: sed

Overview of the Stream Editor (sed)

  • Definition: The sed utility is a non-interactive, line-oriented stream editor. It is designed to process text input one line at a time.

  • Primary Functions:

    • Performing text processing tasks.

    • Executing in-place substitutions within files.

    • Making global substitutions where matched regular expression (regex) patterns are replaced with specific text strings.

  • Core Characteristics:

    • Non-interactive: Unlike a visual editor, sed follows a predetermined script of instructions.

    • Line-oriented: It reads input line by line, processes each line, and then moves to the next.

    • Versatility: It serves as a powerful tool for writing conversion programs and performing "in-pipe" editing.

Basic Syntax and Command Structure

  • General Syntax: sed [-n] [-e] ['command'] [file…] or sed [-n] [-f script] [file…].

  • Standard Usage for Substitution: sed -r "s/REGEX/TEXT/g" filename.

    • REGEX: The search pattern to match.

    • TEXT: The replacement text.

    • g: A global flag indicating that all occurrences of the pattern in the line should be replaced.

  • Input/Output:

    • If no filename is specified, sed reads from the standard input (stdin).

    • Resulting output is sent to the terminal (stdout) by default.

    • To make permanent changes, the output must be redirected to a new file, or the -i option must be used for in-place editing.
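The substitution form and the input/output rules above can be tried in a short shell session. This is a minimal sketch, assuming GNU sed (for -r and -i) and a POSIX shell; the file names and the cat/dog sample text are made up for illustration.

```shell
# Work in a throwaway directory so nothing outside it is touched.
tmp=$(mktemp -d)
printf 'cat hat cat\nno match here\n' > "$tmp/input.txt"

# Read from a named file; the result goes to stdout.
from_file=$(sed -r 's/cat/dog/g' "$tmp/input.txt")

# Read from stdin when no file is named.
from_stdin=$(printf 'cat hat cat\n' | sed -r 's/cat/dog/g')

# Keep the result by redirecting stdout to a new file.
sed -r 's/cat/dog/g' "$tmp/input.txt" > "$tmp/output.txt"

# Or edit a copy in place; the attached suffix keeps a backup.
cp "$tmp/input.txt" "$tmp/inplace.txt"
sed -i.bak 's/cat/dog/g' "$tmp/inplace.txt"

inplace=$(cat "$tmp/inplace.txt")
backup=$(cat "$tmp/inplace.txt.bak")
printf '%s\n' "$from_file"
rm -rf "$tmp"
```

The -i option is not part of POSIX sed, so redirecting to a new file is the more portable way to keep the result.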

Command Line Options

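  • -r: Enables extended regular expressions in GNU sed (-E is an equivalent spelling accepted by both GNU and BSD sed). In extended mode, ( ), { }, +, ?, and | act as operators without a preceding backslash; [ ] and * are already special in basic mode.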

  • -n: Suppresses the automatic output of the pattern space. Only lines explicitly specified with the p (print) command or the p flag of the substitute command will be displayed.

  • -e script: Allows the specification of an editing command directly on the command line. This is particularly useful when passing multiple commands.

  • -f script-file: Directs sed to read editing commands from an external file. If the first line of this script file is #n, sed behaves as if the -n option was passed globally.

  • -i: Used to perform in-place editing, modifying the original file directly rather than just outputting the changes to the stream.
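The effect of -n, -e, and -f can be compared side by side. The sketch below assumes a POSIX shell and a sed that honors #n on the first line of a script file; words.txt and script.sed are hypothetical names used only for this example.

```shell
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/words.txt"

# Without -n, p prints the matching line a second time.
doubled=$(sed '/beta/p' "$tmp/words.txt")

# With -n, only the explicitly printed line appears.
only_match=$(sed -n '/beta/p' "$tmp/words.txt")

# Several -e scripts are applied in order to each line.
chained=$(sed -e 's/alpha/ALPHA/' -e 's/gamma/GAMMA/' "$tmp/words.txt")

# A script file whose first line is #n behaves as if -n were given.
printf '#n\n/gamma/p\n' > "$tmp/script.sed"
from_script=$(sed -f "$tmp/script.sed" "$tmp/words.txt")

printf '%s\n' "$doubled"
rm -rf "$tmp"
```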

Operational Mechanism: The Pattern Space

  • Step 1: Read: sed reads a single line from the input file or standard input.

  • Step 2: Pattern Space: The line is copied into a temporary buffer called the pattern space.

  • Step 3: Execute: Editing commands are applied to the text currently in the pattern space. Subsequent commands in a script operate on the modified line produced by previous commands, not the original input line.

  • Step 4: Output: Once all commands in the script have been applied, the contents of the pattern space are sent to the output (unless the -n option is active).

  • Step 5: Repeat: The line is removed from the pattern space, and sed reads the next line of input, repeating the cycle until the end of the file is reached.

  • Integrity: The original input file remains unchanged during this process unless specific options (like -i) or redirections are employed.
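The cycle described above can be observed with two chained substitutions. This is a small sketch under the assumption of a POSIX shell; fruit.txt and its contents are invented for the demonstration.

```shell
tmp=$(mktemp -d)
printf 'red apple\ngreen pear\n' > "$tmp/fruit.txt"

# The second command sees the result of the first: "red" becomes
# "blue" in the pattern space, and only then is "blue" replaced.
sequential=$(sed -e 's/red/blue/' -e 's/blue/navy/' "$tmp/fruit.txt")

# With -n and no p command, the output step is skipped entirely.
silent=$(sed -n -e 's/red/blue/' "$tmp/fruit.txt")

# The input file itself is untouched by the runs above.
original=$(cat "$tmp/fruit.txt")

printf '%s\n' "$sequential"
rm -rf "$tmp"
```

If the second command had operated on the original input line, the first line would have come out as "blue apple" rather than "navy apple".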

Addressing Mechanisms

Addresses determine which specific lines in the input file are processed by subsequent commands. If no address is specified, the command applies to every line.

  • Single Address: Specifies exactly one line.

    • Line Number: e.g., 6d deletes line 6.

    • Pattern: e.g., /REGEX/ applies the command to any line matching that pattern.

  • Address Ranges: Two addresses separated by a comma (inclusive range).

    • Line Number Range: e.g., 1,10d deletes lines 1 through 10.

    • Mixed/Pattern Ranges: e.g., 1,/^$/d deletes from line 1 through the first blank line.

  • The Negation Operator (!): When an address is followed by !, the command applies only to lines that do not match the address.

    • Example: /black/!s/cow/horse/ substitutes "horse" for "cow" on every line that does not contain the word "black."
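Each address form above can be exercised against one small file. The following is a sketch assuming a POSIX shell; farm.txt and its five lines are hypothetical sample data.

```shell
tmp=$(mktemp -d)
printf 'header\n\nblack cow\nbrown cow\nlast\n' > "$tmp/farm.txt"

# Line-number address: delete line 1 only.
no_first=$(sed '1d' "$tmp/farm.txt")

# Pattern address: delete every line containing "brown".
no_brown=$(sed '/brown/d' "$tmp/farm.txt")

# Mixed range: delete from line 1 through the first blank line.
body=$(sed '1,/^$/d' "$tmp/farm.txt")

# Negation: substitute only on lines that do NOT contain "black".
swapped=$(sed '/black/!s/cow/horse/' "$tmp/farm.txt")

printf '%s\n' "$swapped"
rm -rf "$tmp"
```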

Fundamental Commands: Deletion, Print, and Line Numbering

  • Deletion (d): Deletes the addressed line(s) from the pattern space so they are not passed to the output. A new line is then read, and processing restarts from the top of the script.

    • d: Deletes all lines.

    • /^$/d: Deletes all blank lines.

    • /^$/,$d: Deletes from the first blank line to the end of the file ( $ represents the last line).

    • /^ya*y/,/[0-9]$/d: Deletes from the first line starting with "yy", "yay", "yaay", etc. (a* matches zero or more "a" characters), through the next line ending with a digit.

  • Print (p): Forces the current pattern space to be output. If -n is not used, the line will appear twice.

    • 1,5p: Displays lines 1 through 5.

    • /^$/,$p: Displays from the first blank line to the end of the file.

  • Line Numbering (=): Writes the current line number on a separate line before the matched or output line.

    • sed -e '/Two-thirds-time/=' tuition.data: Displays the line number for the line containing "Two-thirds-time."
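The d, p, and = commands can be checked with the sketch below. It assumes a POSIX shell; notes.txt is a made-up file, and tuition.data reuses the three lines given in the case study at the end of these notes.

```shell
tmp=$(mktemp -d)
printf 'yay one\n\nmiddle\nend 9\ntail\n' > "$tmp/notes.txt"
printf 'Part-time 1003.99\nTwo-thirds-time 1506.49\nFull-time 2012.29\n' > "$tmp/tuition.data"

# d: drop blank lines; every other line passes through unchanged.
no_blank=$(sed '/^$/d' "$tmp/notes.txt")

# d with a pattern range: from the first line matching ^ya*y
# through the next line that ends in a digit.
after_range=$(sed '/^ya*y/,/[0-9]$/d' "$tmp/notes.txt")

# p without -n: lines 1-2 appear twice (once from p, once by default).
twice=$(sed '1,2p' "$tmp/tuition.data")

# p with -n: only the addressed lines are printed.
first_two=$(sed -n '1,2p' "$tmp/tuition.data")

# =: the line number is written on its own line before the match.
numbered=$(sed -e '/Two-thirds-time/=' "$tmp/tuition.data")

printf '%s\n' "$numbered"
rm -rf "$tmp"
```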

The Substitution Command (s)

  • Syntax: [address]s/pattern/replacement/[flags]

  • Flags:

    • n: A number from 1 to 512 indicating which specific occurrence of the pattern should be replaced (e.g., s/Four/Five/2 replaces only the second "Four" on each line).

    • g (Global): Replaces every occurrence of the pattern in the pattern space.

    • p (Print): Prints the pattern space if a successful substitution occurred.

  • Replacement Patterns:

    • &: Represents the entire string matched by the regex. For example, s/.NI./wonderful &/ applied to "UNIX" produces "wonderful UNIX".

    • \n: Replaced by the nth substring captured with \( \) (or ( ) with -r), where n is a digit from 1 to 9.

    • \: Used to escape the ampersand (\&) or the backslash character itself (\\).

  • Custom Delimiters: While / is standard, sed allows other delimiters to make commands more readable, especially for paths. Example: sed -r "s#http://#https://#g" urls.txt.
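The flags and replacement patterns above can be tried one at a time. This is a hedged sketch assuming a POSIX shell and GNU sed for the -r examples; the input strings other than "UNIX", "Four", and the http:// prefix are invented for illustration.

```shell
# Numeric flag: replace only the second occurrence on the line.
second=$(printf 'Four Four Four\n' | sed 's/Four/Five/2')

# g flag: replace every occurrence on the line.
all=$(printf 'Four Four Four\n' | sed 's/Four/Five/g')

# p flag with -n: print only lines where a substitution happened.
changed=$(printf 'Four\nSix\n' | sed -n 's/Four/Five/p')

# &: reuse the whole matched string in the replacement.
amp=$(printf 'UNIX\n' | sed 's/.NI./wonderful &/')

# \1 and \2: reuse captured groups (basic syntax, then -r syntax).
bre=$(printf 'key=value\n' | sed 's/\(.*\)=\(.*\)/\2=\1/')
ere=$(printf 'key=value\n' | sed -r 's/(.*)=(.*)/\2=\1/')

# \&: a literal ampersand in the replacement.
literal=$(printf 'R and D\n' | sed 's/ and / \& /')

# Custom delimiter: # avoids escaping the slashes in a URL.
https=$(printf 'http://example.com\n' | sed -r 's#http://#https://#g')

printf '%s\n' "$second" "$all" "$changed" "$amp" "$bre" "$literal" "$https"
```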

Text Manipulation: Append, Insert, and Change

These commands typically require multiple lines in a script file or specific escaping in an inline command.

  • Append (a): Adds text after the addressed line.

    • Syntax: [address]a\ followed by the text on the next line.

  • Insert (i): Adds text before the addressed line.

    • Syntax: [address]i\ followed by the text on the next line.

  • Change (c): Replaces the entire addressed line or range of lines with the specified text.

    • Syntax: [address(es)]c\ followed by the text on the next line.

  • Constraint: In traditional (POSIX) sed, Append and Insert accept only a single address, while Change can operate on a range. GNU sed also accepts a range for a and i as an extension.
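The three commands can be compared on the same input. The sketch below assumes a POSIX shell and uses the multi-line form, with the text on the line after the backslash; lines.txt and the inserted strings are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/lines.txt"

# a: the text is written after line 2.
appended=$(sed '2a\
after two' "$tmp/lines.txt")

# i: the text is written before line 2.
inserted=$(sed '2i\
before two' "$tmp/lines.txt")

# c with a range: lines 2-3 are replaced by a single copy of the text.
changed=$(sed '2,3c\
replaced' "$tmp/lines.txt")

printf '%s\n' "$appended"
rm -rf "$tmp"
```

Inside single quotes the shell passes the backslash and the newline through unchanged, which is why the command can be written inline without a script file.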

File Operations, Transformations, and Control

  • Read (r filename): Queues the contents of a file to be read and inserted into the output stream at the end of the current cycle. If the file cannot be read, it is treated as empty without an error message.

  • Write (w filename): Writes the current pattern space to a file. The file is created or truncated before the first input line is processed. Multiple w commands to the same file use the same stream.

  • Transform (y): Acts like the tr command, performing a character-to-character replacement across the whole pattern space. The search and replacement strings must contain the same number of characters.

    • Example: y/abc/xyz/ transforms all 'a's to 'x's, 'b's to 'y's, and 'c's to 'z's.

  • Quit (q): Terminates the sed script immediately when the specified address is reached. It accepts at most a single-line address.

    • Example: sed '100q' filename prints the first 100 lines of a file and then exits, mimicking the head command.
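The r, w, y, and q commands can be sketched as follows, assuming a POSIX shell. The files in.txt, extra.txt, and matches.txt are hypothetical, and a small line count stands in for the 100 used above.

```shell
tmp=$(mktemp -d)
printf 'abc\nkeep\ncab\n' > "$tmp/in.txt"
printf 'inserted from file\n' > "$tmp/extra.txt"

# y: map a->x, b->y, c->z wherever those characters appear.
transformed=$(sed 'y/abc/xyz/' "$tmp/in.txt")

# q: print the first 2 lines, then stop reading input.
head2=$(sed '2q' "$tmp/in.txt")

# r: queue extra.txt for output after each line matching "keep".
with_file=$(sed "/keep/r $tmp/extra.txt" "$tmp/in.txt")

# w: copy lines containing "ab" to matches.txt; -n silences stdout.
sed -n "/ab/w $tmp/matches.txt" "$tmp/in.txt"
written=$(cat "$tmp/matches.txt")

printf '%s\n' "$with_file"
rm -rf "$tmp"
```

The file name for r and w runs to the end of the command line, so these two commands are written last in their scripts.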

Scripting and Multiple Commands

  • Script Files: A file containing a sequence of sed commands, each consisting of an address and an action.

  • Braces ({ }): Used to group multiple commands and apply them to a single address or range.

    • Format Requirements: The opening brace { must be the last character on its line. The closing brace } must be on its own line. No spaces should follow the braces.

    • Alternative Format: Commands can be separated by semicolons: [address]{command1; command2; } or passed as a string 'command1; command2'.

  • Conversion Example: Swapping names in a list (e.g., from "Last, First" to "First Last"): sed -r "s/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g" names.txt.
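Both brace formats and the name-swapping conversion can be run as below. This is a sketch assuming a POSIX shell and GNU sed for -r; fix.sed, items.txt, names.txt, and their contents are made up for the example.

```shell
tmp=$(mktemp -d)
printf 'item one\nother\nitem two\n' > "$tmp/items.txt"
printf 'Doe, Jane\nSmith, John\n' > "$tmp/names.txt"

# Script file: { ends its line and } sits on a line of its own.
cat > "$tmp/fix.sed" <<'EOF'
/^item/{
s/item/ITEM/
s/$/ !/
}
EOF
from_file=$(sed -f "$tmp/fix.sed" "$tmp/items.txt")

# The same grouping written inline with semicolons.
inline=$(sed '/^item/{s/item/ITEM/;s/$/ !/;}' "$tmp/items.txt")

# Conversion: "Last, First" becomes "First Last".
swapped=$(sed -r 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g' "$tmp/names.txt")

printf '%s\n' "$from_file" "$swapped"
rm -rf "$tmp"
```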

Case Study: Tuition Data Processing

  • Scenario: A file tuition.data contains:

    • Part-time 1003.99

    • Two-thirds-time 1506.49

    • Full-time 2012.29

  • Deletion: sed -e '/^Part-time/d' tuition.data removes the first line.

  • Append: Appending a dashed line after every entry: a\ followed by -------------------------- on the next line.

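  • Insert: Adding a title before line 1: 1 i\ followed by Tuition List\ on the next line; the trailing backslash continues the inserted text onto a further line.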

  • Change: Replacing the first line to update a value from 1003.99 to 1100.00: 1 c\ followed by Part-time 1100.00 on the next line.
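The four case-study operations can be reproduced end to end. The sketch below assumes a POSIX shell and recreates tuition.data from the three lines listed in the scenario. The title is inserted as a single line, without the continuation backslash, which is a simplification made for this example.

```shell
tmp=$(mktemp -d)
printf 'Part-time 1003.99\nTwo-thirds-time 1506.49\nFull-time 2012.29\n' > "$tmp/tuition.data"

# Deletion: remove the line that starts with "Part-time".
deleted=$(sed -e '/^Part-time/d' "$tmp/tuition.data")

# Append: no address, so the dashed line follows every entry.
ruled=$(sed 'a\
--------------------------' "$tmp/tuition.data")

# Insert: put a title before line 1.
titled=$(sed '1i\
Tuition List' "$tmp/tuition.data")

# Change: replace line 1 with the updated value.
updated=$(sed '1c\
Part-time 1100.00' "$tmp/tuition.data")

printf '%s\n' "$ruled"
rm -rf "$tmp"
```

None of these runs modifies tuition.data itself; each result exists only in the captured output.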