UNIX Sorting & Pipe System
Introduction
This lecture covers fundamental UNIX concepts related to stream management, file redirection, and piping to create powerful command-line tools.
Fundamental UNIX Utilities
wc (word count)
Purpose: Reports the number of lines, words, and characters within a file.
Applicability: Works with all file types, but is most meaningful for text files.
Input: Reads the files named on the command line; when no file is named, it reads standard input, which can come from file redirection, a pipe, or the keyboard.
Usage Example with Text File: If TextFile contains the following text, spread over four lines:
    This file contains a lot of words, but not that many different lines. Hopefully, we'll be able to see just how things. work with this one example.
Running wc TextFile would output:
    4 27 150 TextFile
4: Number of lines.
27: Number of words.
150: Number of characters.
Usage Example with Binary File: Running wc /bin/kill (a binary executable) can produce error messages due to invalid or incomplete multibyte characters, but wc will still report counts based on its interpretation of the raw bytes, e.g.:
    21 136 7996 /bin/kill
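The three counts can be checked on a small file. The sketch below assumes a POSIX-style shell; sample.txt and its contents are hypothetical and are created in a scratch directory so that nothing existing is touched.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: 3 lines, 6 words, 28 bytes (newlines included).
printf 'one two\nthree four five\nsix\n' > sample.txt

# With a file argument, wc prints lines, words, bytes, then the file name.
wc sample.txt

# -l, -w and -c restrict the report to lines, words or bytes.
wc -l sample.txt
wc -w sample.txt
wc -c sample.txt
```

The third column is a byte count; for plain ASCII text such as this sample it equals the number of characters.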
sort
Purpose: Arranges the lines of a file into alphabetical order.
Utility: Highly useful for organizing data files or standard output from other programs.
Usage: sort [OPTIONS] [FILE]
Example: If dataFile contains:
    Dhalsim
    Zangeif
    Dan
    Guile
    Athena
    Ken
Running sort dataFile would output:
    Athena
    Dan
    Dhalsim
    Guile
    Ken
    Zangeif
uniq
Purpose: Processes a file and removes duplicate adjacent lines.
Limitation: Only removes duplicates if they are immediately next to each other.
Utility: Effective for cleaning files that may contain redundant, consecutive data entries.
Usage: uniq [OPTIONS] [FILE]
Example: If dataFile2 contains:
    Dan
    Dan
    Dan
    Athena
    Ryu
    Ryu
    Athena
Running uniq dataFile2 would output:
    Dan
    Athena
    Ryu
    Athena
Notice that the Athena at the end is not removed because it is not adjacent to the previous Athena. To remove all duplicates, sort should be piped into uniq.
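The adjacency rule can be made visible with the -c option of uniq, which prints how long each collapsed run was. This is a small sketch assuming a POSIX-style shell; it recreates dataFile2 in a scratch directory.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same contents as the dataFile2 example, one name per line.
printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile2

# Plain uniq collapses each run of identical adjacent lines into one line.
uniq dataFile2

# -c prefixes every output line with the length of the run it replaced,
# which shows that Athena appears as two separate runs of length 1.
uniq -c dataFile2
```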
Understanding Input and Output Streams
Input: Refers to all data a UNIX program reads. Typically sourced from the keyboard (what you type).
Output: Refers to all data a UNIX program writes. Typically directed to the screen (what you see).
Standard Streams in UNIX
UNIX programs primarily interact with three standard streams:
stdin (Standard Input): The primary source of input for a program.
By default, it receives input from the keyboard (everything typed into the terminal).
Programs generally expect user input via stdin.
Assigned file descriptor number: 0.
stdout (Standard Output): The default destination for a program's output, usually the screen.
Displays the results of a running program in the terminal from which it was launched.
Assigned file descriptor number: 1.
stderr (Standard Error): The default destination for error messages generated by a program, usually the screen.
By default, it appears in the same location as stdout.
Crucially, stderr can be separated from normal stdout for independent handling.
Assigned file descriptor number: 2.
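A short sketch of one program touching all three streams, assuming a POSIX-style shell. The function name greet and its messages are hypothetical; the >&2 notation, which sends a line to file descriptor 2, is standard shell syntax.

```shell
# Hypothetical helper: reads one line from stdin (fd 0), writes a greeting
# to stdout (fd 1) and a diagnostic to stderr (fd 2).
greet() {
    read -r name
    echo "hello, $name"
    echo "greet: finished reading input" >&2   # >&2 sends this line to stderr
}

# The pipe supplies stdin. Command substitution captures stdout only,
# so the diagnostic still goes to the terminal through stderr.
reply=$(printf 'world\n' | greet)

echo "captured from stdout: $reply"
```

Only the greeting ends up in the variable, which shows that the two output streams are separate even though both normally appear on the same screen.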
File Redirection
Redirection allows altering the default sources and destinations of these standard streams.
Redirecting stdout (>)
Operator: The > symbol.
Function: Sends all output that would normally go to stdout into a specified file.
Behavior: If the target file already exists, its contents are completely overwritten. If the file does not exist, a new one is created.
Example: echo 'hi' > helloFile will create helloFile containing hi.
Example: If you run ls > dirListing, the output of the ls command will be saved into a file named dirListing. Subsequently, more dirListing will display the directory listing from that file.
Crucial Caution: Always be mindful of the > operator, as it will irrevocably overwrite existing files. For instance, executing ls > dataFile1 on an existing dataFile1 will replace its original content with the directory listing.
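The overwrite behavior is easy to confirm in a scratch directory. The sketch below assumes a POSIX-style shell. Its second half uses the standard noclobber option (set -C), which is not part of the lecture material and is shown only as one possible safeguard.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo 'first' > helloFile     # helloFile does not exist yet, so it is created
echo 'second' > helloFile    # helloFile exists, so its old contents are replaced
cat helloFile                # prints only: second

# Optional safeguard: with noclobber enabled via `set -C`, > refuses to
# overwrite an existing file. It is enabled inside a subshell here so the
# setting does not persist for the rest of the session.
if ( set -C; echo 'third' > helloFile ); then
    echo "overwritten"
else
    echo "overwrite refused"
fi
cat helloFile                # still prints: second
```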
Redirecting stdin (<)
Operator: The < symbol.
Function: Directs a program to take its input from a specified file rather than from the keyboard.
Utility: Extremely useful for providing large volumes of data to a program without manual typing.
Example: wc < inputFile would read its input from inputFile.
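One visible difference between naming a file and redirecting it can be seen with wc. This is a sketch assuming a POSIX-style shell; the contents of inputFile are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: 3 lines, 3 words, 17 bytes.
printf 'alpha\nbeta\ngamma\n' > inputFile

# File given as an argument: wc opens it itself and prints its name.
wc inputFile

# File supplied through <: the shell opens it and connects it to stdin.
# wc never learns the name, so only the three counts are printed.
wc < inputFile
```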
Appending stdout (>>)
Operator: The >> symbol.
Function: Sends all output from a program to the end of a specified file.
Behavior: If the file does not exist, it will be created. If it already exists, new output is appended to its current contents; no information is overwritten.
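A minimal sketch of appending, assuming a POSIX-style shell; logFile and the messages written to it are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo 'run 1' >> logFile    # logFile does not exist yet, so it is created
echo 'run 2' >> logFile    # appended after the existing line
echo 'run 3' >> logFile    # appended again; nothing is overwritten

cat logFile                # shows all three lines in the order written
```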
Redirecting stderr (2>)
Operator: The 2> symbol.
Function: Specifically redirects stderr (error messages) to a designated file.
Notation: The 2 explicitly refers to the stderr stream.
Example: To see the difference between stdout and stderr:
ls /bin/doesntExist will print an error message directly to the screen.
ls /bin/doesntExist 2> errorOutput will redirect that error message into errorOutput.
cat errorOutput would then display the message ls: /bin/doesntExist: No such file or directory.
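The two output streams can be sent to different files in one command. The sketch below assumes a POSIX-style shell and that /bin/doesntExist is missing, as in the example above; normalOutput is a hypothetical file name.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# stdout goes to normalOutput, stderr goes to errorOutput.
ls /bin/doesntExist > normalOutput 2> errorOutput

# The error message landed in errorOutput; its exact wording depends on
# the ls implementation.
cat errorOutput

# normalOutput was created but is empty, since ls had nothing to list.
wc -c < normalOutput
```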
Is Redirection Useful?
Yes, definitely!
Save Output: Whenever you need to archive the output of a program for future reference or processing.
Automate Input: When a program requires repetitive or lengthy input, redirecting from a file saves significant typing effort.
Isolate Errors: Separating stderr allows for easier logging and analysis of program errors without mixing them with normal output.
Inter-Program Data Transfer: Data generated by one program can be saved to a file, then easily used as input for another program.
Intuition for Remembering Redirection Operators
Think of the arrow as indicating the direction of data flow:
cat < inputFile: The contents of inputFile are flowing into the cat program.
cat > outputFile: The results generated by cat are flowing into outputFile.
Command Flags and Redirection Order
The input or output file must immediately follow its respective redirection operator.
Correct Usage: ls -l * > listingOutput (all arguments come before the redirection).
Incorrect Usage: ls > -l * listingOutput
This is interpreted incorrectly because > takes the word immediately after it as the filename. Here, -l is treated as the target file of the redirection, so a file named -l is created, ls runs without the -l flag, and listingOutput is passed to ls as just another name to list.
General Rule: As a best practice, redirection operators and their target files should typically come at the end of the command line, after all program arguments, to prevent misinterpretations.
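The misplaced redirection can be reproduced safely in a scratch directory. This sketch assumes a POSIX-style shell; a.txt and b.txt are hypothetical files, named explicitly instead of using * so that the result does not depend on what else the directory holds.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

# Redirection last: ls receives -l and the file names, and the long
# listing is written to listingOutput.
ls -l a.txt b.txt > listingOutput

# Redirection in the middle: the word right after > is taken as the target,
# so a file literally named "-l" is created and ls runs without the -l flag.
ls > -l a.txt b.txt

ls          # the directory now contains a file called -l
cat ./-l    # the ./ prefix stops cat from reading -l as an option
```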
Combining Redirection Flags
It is entirely possible to use both input (<) and output (>) redirection on the same command line.
Example: sort < inputFile > outputFile would sort inputFile and save the result to outputFile.
Each redirection operator assumes the subsequent argument is its target (file name).
Combining > and >>: While technically allowed on the same line, e.g., ls >>appendOutput >output, this can lead to unexpected behavior. In bash and other POSIX-style shells, redirections are processed from left to right and each one re-points the stream, so the last redirection of a given stream wins: output receives the ls listing, while appendOutput is opened (and created if it does not exist) but receives nothing. Other shells may behave differently; zsh, for example, can send the output to both files. This demonstrates that redirections are processed sequentially and can affect other redirections of the same stream.
Example with stderr: ls /bin/doesntExist 2>> errorFile will append any error messages to errorFile without overwriting prior content.
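Both combinations can be tried in a scratch directory. The sketch below assumes a POSIX-style shell and that /bin/doesntExist is missing; the contents of inputFile are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Ken\nDan\nGuile\n' > inputFile

# stdin comes from inputFile, stdout goes to outputFile.
sort < inputFile > outputFile
cat outputFile

# 2>> appends, so running the failing command twice leaves both error
# messages in errorFile.
ls /bin/doesntExist 2>> errorFile
ls /bin/doesntExist 2>> errorFile
cat errorFile
```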
Piping: Tying Programs Together
Operator: The vertical bar | is used to create a pipe.
Function: Pipes feed the stdout of one program directly into the stdin of another program without needing an intermediate file.
Efficiency: This is a powerful feature for chaining commands seamlessly.
sort and uniq with Pipes
Problem: To sort a file and then remove all duplicate lines (not just adjacent ones).
Prior Approach (with Redirection):
Sort the file into a temporary output file: sort dataFile > tempOutput.
Use uniq to process the sorted temporary file: uniq < tempOutput.
This requires creating and potentially deleting a temporary file.
Piped Approach:
sort dataFile | uniq
The stdout of sort dataFile (which is the fully sorted list) is directly fed as stdin to the uniq program.
Since uniq receives sorted input, all duplicate lines will be adjacent and thus correctly removed.
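The two approaches can be compared side by side. This is a sketch assuming a POSIX-style shell; dataFile is filled with hypothetical names that include non-adjacent duplicates, and the variable names via_file and via_pipe are illustrative.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile

# Redirection approach: an intermediate file carries the sorted lines
# and has to be cleaned up afterwards.
sort dataFile > tempOutput
via_file=$(uniq < tempOutput)
rm tempOutput

# Piped approach: the sorted lines go straight from sort to uniq.
via_pipe=$(sort dataFile | uniq)

echo "$via_pipe"
[ "$via_file" = "$via_pipe" ] && echo "both approaches agree"
```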
How to Read Pipes
Commands chained with pipes are always read from left to right.
The flow of data (output) also proceeds from left to right.
Multiple pipes can be used on a single line, forming complex data processing chains.
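As an illustration of a longer chain, the sketch below counts the distinct lines in a file with three stages. It assumes a POSIX-style shell; dataFile and the variable distinct are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile

# Read left to right: sort the lines, drop the duplicates that are now
# adjacent, then count the lines that remain.
distinct=$(sort dataFile | uniq | wc -l)

# $distinct is left unquoted so any padding some wc versions add is dropped.
echo "distinct lines:" $distinct
```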
Combining Redirection and Piping
It is entirely feasible to combine both file redirection and piping on the same command line, adding another layer of flexibility and power to command-line operations.
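A minimal sketch of redirection and piping on one line, assuming a POSIX-style shell; dataFile and uniqueNames are hypothetical file names.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile

# < feeds dataFile to sort, the pipe feeds sort's output to uniq,
# and > saves uniq's output in uniqueNames.
sort < dataFile | uniq > uniqueNames

cat uniqueNames
```

Each redirection applies only to the command it is written next to: the input redirection belongs to sort and the output redirection belongs to uniq.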
Is All This Stuff Worth It?
Short Answer: Absolutely!
Long Answer: UNIX's command-line tools and piping system offer significant advantages over GUI-based environments for certain tasks.
Efficiency: With piping, multiple complex tasks can be combined and executed in a single, efficient step.
Comparison (e.g., Sorting and Uniquing a Word File):
In a GUI like Microsoft Word: The process involves opening the file, highlighting text, navigating through menus (e.g., "Table -> Sort"), configuring sorting options, and then realizing there's no native