Regular Expressions: sed
Overview of the Stream Editor (sed)
Definition: The sed utility is a non-interactive, line-oriented stream editor. It processes text input one line at a time.
Primary Functions:
Performing text processing tasks.
Executing in-place substitutions within files.
Making global substitutions where matched regular expression (regex) patterns are replaced with specific text strings.
Core Characteristics:
Non-interactive: Unlike a visual editor, sed follows a predetermined script of instructions.
Line-oriented: It reads input line by line, processes each line, and then moves to the next.
Versatility: It serves as a powerful tool for writing conversion programs and performing "in-pipe" editing.
Basic Syntax and Command Structure
General Syntax:
sed [-n] [-e] ['command'] [file…] or sed [-n] [-f script] [file…]
Standard Usage for Substitution:
sed -r "s/REGEX/TEXT/g" filename
REGEX: The search pattern to match.
TEXT: The replacement text.
g: A global flag indicating that all occurrences of the pattern in the line should be replaced.
Input/Output:
If no filename is specified, sed reads from standard input (stdin).
Resulting output is sent to the terminal (stdout) by default.
To make permanent changes, the output must be redirected to a new file, or the -i option must be used for in-place editing.
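The input and output behavior described above can be tried with a short sketch. The sample text and the file names pets.txt and pets.new are made up for illustration.

```shell
# Hypothetical sample input; any text stream would do.
input='the cat sat on the cat mat'

# With no filename, sed reads standard input and writes to standard output.
first_only=$(printf '%s\n' "$input" | sed 's/cat/dog/')    # no g flag
all_matches=$(printf '%s\n' "$input" | sed 's/cat/dog/g')  # g flag

printf '%s\n' "$first_only"    # the dog sat on the cat mat
printf '%s\n' "$all_matches"   # the dog sat on the dog mat

# To keep the result, redirect stdout to a new file; the source file is untouched.
tmpdir=$(mktemp -d)
printf '%s\n' "$input" > "$tmpdir/pets.txt"
sed 's/cat/dog/g' "$tmpdir/pets.txt" > "$tmpdir/pets.new"
original=$(cat "$tmpdir/pets.txt")
rewritten=$(cat "$tmpdir/pets.new")
rm -r "$tmpdir"
```

The -i option is left out of the sketch because its argument handling differs between sed implementations.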
Command Line Options
-r: Enables extended regular expressions, so that ( ), { }, +, ?, and | act as metacharacters without backslash escaping. ([ ] and * behave the same way in basic and extended syntax.)
-n: Suppresses the automatic output of the pattern space. Only lines explicitly printed with the p (print) command or the p flag of the substitute command are displayed.
-e script: Specifies an editing command directly on the command line. This is particularly useful when passing multiple commands.
-f script-file: Directs sed to read editing commands from an external file. If the first line of this script file is #n, sed behaves as if the -n option had been given.
-i: Performs in-place editing, modifying the original file directly rather than writing the changes to the output stream.
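A minimal sketch of the -n, -e, and -r options, assuming GNU sed (the -E spelling is accepted by both GNU and BSD sed in place of -r). The three sample lines are invented for the example.

```shell
data=$(printf '%s\n' 'alpha 1' 'beta 22' 'gamma 333')

# -n suppresses automatic printing; the p flag prints only lines where s succeeded.
only_changed=$(printf '%s\n' "$data" | sed -n 's/beta/BETA/p')

# Several -e expressions are applied in order to each line.
two_steps=$(printf '%s\n' "$data" | sed -e 's/alpha/A/' -e 's/gamma/G/')

# -r lets + and ( ) act as metacharacters without backslashes.
digits_wrapped=$(printf '%s\n' "$data" | sed -r 's/([0-9]+)/<\1>/')

printf '%s\n' "$only_changed"    # BETA 22
printf '%s\n' "$two_steps"       # A 1 / beta 22 / G 333 (one per line)
printf '%s\n' "$digits_wrapped"  # alpha <1> / beta <22> / gamma <333>
```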
Operational Mechanism: The Pattern Space
Step 1: Read: sed reads a single line from the input file or standard input.
Step 2: Pattern Space: The line is copied into a temporary buffer called the pattern space.
Step 3: Execute: Editing commands are applied to the text currently in the pattern space. Subsequent commands in a script operate on the modified line produced by previous commands, not the original input line.
Step 4: Output: Once all commands in the script have been applied, the contents of the pattern space are sent to the output (unless the -n option is active).
Step 5: Repeat: The line is removed from the pattern space, and sed reads the next line of input, repeating the cycle until the end of the input is reached.
Integrity: The original input file remains unchanged during this process unless specific options (like -i) or redirections are employed.
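The point that each command sees the line as modified by the previous command can be checked with a small sketch. The one-word input is arbitrary.

```shell
# Each command works on the pattern space as the previous command left it:
# the second expression matches the "dog" produced by the first.
chained=$(printf 'cat\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# Reversed, the first expression finds no "dog" yet, so only the second one fires.
reversed=$(printf 'cat\n' | sed -e 's/dog/bird/' -e 's/cat/dog/')

# With -n, the end-of-cycle output step is skipped, so nothing is printed.
silent=$(printf 'cat\n' | sed -n 's/cat/dog/')

printf '%s\n' "$chained"    # bird
printf '%s\n' "$reversed"   # dog
```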
Addressing Mechanisms
Addresses determine which specific lines in the input file are processed by subsequent commands. If no address is specified, the command applies to every line.
Single Address: Selects lines by one line number or one pattern.
Line Number: e.g., 6d deletes line 6.
Pattern: e.g., /REGEX/ applies the command to any line matching that pattern.
Address Ranges: Two addresses separated by a comma (inclusive range).
Line Number Range: e.g., 1,10d deletes lines 1 through 10.
Mixed/Pattern Ranges: e.g., 1,/^$/d deletes from line 1 through the first blank line that follows it.
The Negation Operator (!): When an address is followed by !, the command applies only to lines that do not match the address.
Example: /black/!s/cow/horse/ substitutes "horse" for the first "cow" on every line that does not contain the word "black."
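A short sketch of the addressing forms, run against five made-up lines (the third is blank).

```shell
lines=$(printf '%s\n' 'one' 'two' '' 'black cow' 'brown cow')

# Line-number address: delete line 2 only.
line2_gone=$(printf '%s\n' "$lines" | sed '2d')

# Mixed range: delete from line 1 through the first blank line.
through_blank=$(printf '%s\n' "$lines" | sed '1,/^$/d')

# Negated pattern address: substitute only on lines that do NOT contain "black".
negated=$(printf '%s\n' "$lines" | sed '/black/!s/cow/horse/')

printf '%s\n' "$through_blank"   # black cow / brown cow (one per line)
printf '%s\n' "$negated"         # last line becomes "brown horse"
```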
Fundamental Commands: Deletion, Print, and Line Numbering
Deletion (d): Deletes the addressed line(s) from the pattern space so they are not passed to the output. A new line is then read, and processing restarts from the top of the script.
d: Deletes all lines.
/^$/d: Deletes all blank lines.
/^$/,$d: Deletes from the first blank line to the end of the file ($ represents the last line).
/^ya*y/,/[0-9]$/d: Deletes from the first line starting with "yy", "yay", "yaay", etc., through the next line ending with a digit.
Print (p): Forces the current pattern space to be output. If -n is not used, the line will appear twice.
1,5p: Displays lines 1 through 5.
/^$/,$p: Displays from the first blank line to the end of the file.
Line Numbering (=): Writes the current line number on a separate line before the matched or output line.
sed -e '/Two-thirds-time/=' tuition.data: Displays the line number for the line containing "Two-thirds-time."
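The d, p, and = commands can be compared side by side in a brief sketch. The four-line input (with a blank second line) is hypothetical.

```shell
text=$(printf '%s\n' 'header' '' 'body one' 'body two')

# d: drop blank lines.
no_blanks=$(printf '%s\n' "$text" | sed '/^$/d')

# p without -n: line 1 is printed by p and again by the automatic output.
doubled=$(printf '%s\n' "$text" | sed '1p')

# p with -n: print only from the first blank line to the end of input.
from_blank=$(printf '%s\n' "$text" | sed -n '/^$/,$p')

# =: print the line number of the matching line (-n hides the line itself).
line_no=$(printf '%s\n' "$text" | sed -n '/body one/=')

printf '%s\n' "$line_no"   # 3
```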
The Substitution Command (s)
Syntax: [address]s/pattern/replacement/[flags]
Flags:
n: A positive integer indicating which occurrence of the pattern on each line should be replaced (e.g., s/Four/Five/2 replaces only the second "Four" on each line).
g (Global): Replaces every occurrence of the pattern in the pattern space.
p (Print): Prints the pattern space if a successful substitution occurred.
Replacement Patterns:
&: Represents the entire string matched by the regex. For example, s/.NI./wonderful &/ applied to "UNIX" produces "wonderful UNIX".
\n: A backslash followed by a digit n (1 to 9) is replaced by the nth substring captured with \( \) (or ( ) with -r).
\: Escapes the ampersand (&) or the backslash character itself so that it is inserted literally.
Custom Delimiters: While / is standard, sed allows other delimiters to make commands more readable, especially for paths and URLs.
Example: sed -r "s#http://#https://#g" urls.txt
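The substitution flags and replacement patterns can be exercised with one-line inputs. This is a sketch only; the sample strings other than "UNIX" and "Four" are invented.

```shell
# Numeric flag: replace only the second occurrence on the line.
second=$(printf 'Four Four Four\n' | sed 's/Four/Five/2')

# &: reuse the whole match in the replacement.
amp=$(printf 'UNIX\n' | sed 's/.NI./wonderful &/')

# \&: a literal ampersand in the replacement.
literal_amp=$(printf 'salt and pepper\n' | sed 's/and/\&/')

# \1 and \2: backreferences to groups captured with \( \) in basic syntax.
swapped=$(printf 'key=value\n' | sed 's/\(.*\)=\(.*\)/\2=\1/')

# Custom delimiter: # avoids escaping the slashes in a URL.
https=$(printf 'http://example.com\n' | sed 's#http://#https://#')

printf '%s\n' "$second"       # Four Five Four
printf '%s\n' "$amp"          # wonderful UNIX
printf '%s\n' "$literal_amp"  # salt & pepper
printf '%s\n' "$swapped"      # value=key
printf '%s\n' "$https"        # https://example.com
```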
Text Manipulation: Append, Insert, and Change
These commands typically require multiple lines in a script file or specific escaping in an inline command.
Append (a): Adds text after the addressed line.
Syntax: [address]a\ with the text on the following line.
Insert (i): Adds text before the addressed line.
Syntax: [address]i\ with the text on the following line.
Change (c): Replaces the entire addressed line or range of lines with the specified text.
Syntax: [address(es)]c\ with the text on the following line.
Constraint: In POSIX sed, Append and Insert accept only a single address, while Change can operate on a range.
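A minimal sketch of the three commands in their multi-line form, written inline by placing a real newline after the backslash inside single quotes. The three input lines and the inserted text are placeholders.

```shell
items=$(printf '%s\n' 'first' 'second' 'third')

# a: add a line after line 2.
appended=$(printf '%s\n' "$items" | sed '2a\
--- after second ---')

# i: add a line before line 1.
inserted=$(printf '%s\n' "$items" | sed '1i\
TITLE')

# c: replace the whole range 2,3 with a single line of text.
changed=$(printf '%s\n' "$items" | sed '2,3c\
replaced')

printf '%s\n' "$changed"   # first / replaced (one per line)
```

When c is given a range, the replacement text is emitted once for the whole range, not once per line.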
File Operations, Transformations, and Control
Read (r filename): Queues the contents of a file to be read and inserted into the output stream at the end of the current cycle. If the file cannot be read, it is treated as empty without an error message.
Write (w filename): Writes the current pattern space to a file. The file is created or truncated before the first input line is processed. Multiple w commands that name the same file share one output stream.
Transform (y): Acts like the tr command, performing a character-to-character replacement. The source and replacement strings must contain the same number of characters.
Example: y/abc/xyz/ transforms every 'a' to 'x', every 'b' to 'y', and every 'c' to 'z'.
Quit (q): Terminates sed when the addressed line is reached, after printing the current pattern space (unless -n is in effect). It accepts at most a single address.
Example: sed '100q' filename prints the first 100 lines of a file and then exits, mimicking head -n 100.
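The r, w, y, and q commands can be sketched together using a temporary directory. The file names in.txt, extra.txt, and matches.txt are assumptions made for the example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'line A' 'line B' 'line C' > "$tmpdir/in.txt"
printf '%s\n' '[inserted]' > "$tmpdir/extra.txt"

# r: the contents of extra.txt are emitted after line 2.
with_file=$(sed "2r $tmpdir/extra.txt" "$tmpdir/in.txt")

# w: lines matching /B/ are written to matches.txt; -n keeps stdout quiet.
sed -n "/B/w $tmpdir/matches.txt" "$tmpdir/in.txt"
matches=$(cat "$tmpdir/matches.txt")

# y: character-by-character mapping a->x, b->y, c->z.
mapped=$(printf 'abcabc\n' | sed 'y/abc/xyz/')

# q: print up to and including line 2, then stop reading input.
first_two=$(sed '2q' "$tmpdir/in.txt")

rm -r "$tmpdir"

printf '%s\n' "$with_file"   # line A / line B / [inserted] / line C
printf '%s\n' "$mapped"      # xyzxyz
```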
Scripting and Multiple Commands
Script Files: A file containing a sequence of sed commands, each consisting of an optional address and an action.
Braces ({ }): Used to group multiple commands and apply them to a single address or range.
Format Requirements: The opening brace { must be the last character on its line. The closing brace } must be on its own line. No spaces should follow the braces.
Alternative Format: Commands can be separated by semicolons: [address]{command1; command2; } or passed as a string 'command1; command2'.
Conversion Example: Swapping names in a list (e.g., from "Last, First" to "First Last"):
sed -r "s/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g" names.txt.
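Both brace layouts and the name swap can be tried in a short sketch, assuming GNU sed for the -r option and the one-line brace form. The key=value lines and the two names are illustrative inputs only.

```shell
pairs=$(printf '%s\n' 'x=1' 'y=2' 'z=3')

# Multi-line form: { ends its line, } sits on its own line.
grouped=$(printf '%s\n' "$pairs" | sed '2,3{
s/=/ = /
s/$/;/
}')

# One-line form: commands separated by semicolons inside the braces.
one_line=$(printf '%s\n' "$pairs" | sed '2,3{s/=/ = /;s/$/;/;}')

# "Last, First" -> "First Last" using two capture groups.
swapped=$(printf '%s\n' 'Lovelace, Ada' 'Hopper, Grace' |
  sed -r 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/')

printf '%s\n' "$grouped"   # x=1 / y = 2; / z = 3;
printf '%s\n' "$swapped"   # Ada Lovelace / Grace Hopper
```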
Case Study: Tuition Data Processing
Scenario: A file tuition.data contains:
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29
Deletion: sed -e '/^Part-time/d' tuition.data removes the first line.
Append: Appending a dashed line after every entry:
a\
--------------------------
Insert: Adding a title before line 1:
1 i\
Tuition List
Change: Replacing the first line to update a value from 1003.99 to 1100.00:
1 c\
Part-time 1100.00
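The case study can be reproduced end to end with the sketch below. It recreates tuition.data in a temporary directory (an assumption, so that the example is self-contained) and runs each command separately against the unmodified file.

```shell
tmpdir=$(mktemp -d)
data="$tmpdir/tuition.data"
printf '%s\n' 'Part-time 1003.99' 'Two-thirds-time 1506.49' 'Full-time 2012.29' > "$data"

# Deletion: remove the Part-time entry.
deleted=$(sed -e '/^Part-time/d' "$data")

# Append: no address, so the dashed line follows every entry.
ruled=$(sed 'a\
--------------------------' "$data")

# Insert: a title before line 1.
titled=$(sed '1i\
Tuition List' "$data")

# Change: replace line 1 with the updated amount.
changed=$(sed '1c\
Part-time 1100.00' "$data")

rm -r "$tmpdir"

printf '%s\n' "$titled"    # Tuition List, then the three entries
printf '%s\n' "$changed"   # Part-time 1100.00, then the other two entries
```

Because none of the commands uses -i, each one starts from the same three-line file.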