UNIX Sorting & Pipe System

Introduction

This lecture covers fundamental UNIX concepts related to stream management, file redirection, and piping to create powerful command-line tools.

Fundamental UNIX Utilities

wc (word count)

  • Purpose: Reports the number of lines, words, and characters within a file.

  • Applicability: Works with all file types, but is most meaningful for text files.

  • Input: Reads the files named as arguments. With no file argument, it reads stdin, which can be supplied by file redirection or piping.

  • Usage Example with Text File: If TextFile contains (spread across four lines): This file contains a lot of words, but not that many different lines. Hopefully, we'll be able to see just how things work with this one example. Running wc TextFile would output: 4 27 150 TextFile.

    • 4: Number of lines.

    • 27: Number of words.

    • 150: Number of characters.

  • Usage Example with Binary File: Running wc /bin/kill (a binary executable) can produce error messages due to invalid or incomplete multibyte characters, but wc will still report counts based on its interpretation of the raw bytes, e.g., 21 136 7996 /bin/kill.
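As a quick check of the three counts, the sketch below builds a small file and runs wc on it. The file name sample.txt, its contents, and the temporary directory are made up for illustration. Note that plain wc reports bytes in its third column, which equals the character count for ASCII text like this.

```shell
# Hypothetical sample file: 3 lines, 6 words, 28 bytes (all ASCII).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one two\nthree four five\nsix\n' > sample.txt

# wc accepts the file name as an argument and prints it after the counts.
wc sample.txt

# Split wc's output into positional parameters so that column padding,
# which differs between systems, does not matter.
set -- $(wc sample.txt)
lines=$1 words=$2 bytes=$3
echo "lines=$lines words=$words bytes=$bytes"   # prints: lines=3 words=6 bytes=28

cd / && rm -rf "$workdir"
```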

sort

  • Purpose: Arranges the lines of a file into alphabetical order.

  • Utility: Highly useful for organizing data files or standard output from other programs.

  • Usage: sort [OPTIONS] [FILE]

  • Example: If dataFile contains:
    Dhalsim
    Zangeif
    Dan
    Guile
    Athena
    Ken
    Running sort dataFile would output:
    Athena
    Dan
    Dhalsim
    Guile
    Ken
    Zangeif
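The example can be reproduced with the short sketch below. It assumes dataFile holds one name per line (sort orders lines, not words) and uses a throwaway temporary directory, which is not part of the lecture.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One name per line, matching the dataFile example.
printf 'Dhalsim\nZangeif\nDan\nGuile\nAthena\nKen\n' > dataFile

sorted=$(sort dataFile)   # sort reads the file named as its argument
printf '%s\n' "$sorted"

cd / && rm -rf "$workdir"
```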

uniq

  • Purpose: Processes a file and removes duplicate adjacent lines.

  • Limitation: Only removes duplicates if they are immediately next to each other.

  • Utility: Effective for cleaning files that may contain redundant, consecutive data entries.

  • Usage: uniq [OPTIONS] [FILE]

  • Example: If dataFile2 contains:
    Dan
    Dan
    Dan
    Athena
    Ryu
    Ryu
    Athena
    Running uniq dataFile2 would output:
    Dan
    Athena
    Ryu
    Athena
    Notice Athena at the end is not removed because it is not adjacent to the previous Athena. To remove all duplicates, sort should be piped into uniq.
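A minimal sketch of the adjacency rule, assuming dataFile2 holds one name per line. The -c flag is not covered in the lecture; it is a standard uniq option that shows how long each run of duplicates was.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile2

# Only runs of identical neighbouring lines collapse, so Athena appears twice.
adjacent_only=$(uniq dataFile2)
printf '%s\n' "$adjacent_only"

# -c prefixes each surviving line with the length of the run it replaced.
uniq -c dataFile2

cd / && rm -rf "$workdir"
```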

Understanding Input and Output Streams

  • Input: Refers to all data a UNIX program reads. Typically sourced from the keyboard (what you type).

  • Output: Refers to all data a UNIX program writes. Typically directed to the screen (what you see).

Standard Streams in UNIX

UNIX programs primarily interact with three standard streams:

  • stdin (Standard Input):

    • The primary source of input for a program.

    • By default, it receives input from the keyboard (everything typed into the terminal).

    • Programs generally expect user input via stdin.

    • Assigned file descriptor number: 0.

  • stdout (Standard Output):

    • The default destination for a program's output, usually the screen.

    • Displays the results of a running program in the terminal from which it was launched.

    • Assigned file descriptor number: 1.

  • stderr (Standard Error):

    • The default destination for error messages generated by a program, usually the screen.

    • By default, appears in the same place as stdout (the terminal).

    • Crucially, stderr can be separated from normal stdout for independent handling.

    • Assigned file descriptor number: 2.

File Redirection

Redirection allows altering the default sources and destinations of these standard streams.

Redirecting stdout (>)

  • Operator: The > symbol.

  • Function: Sends all output that would normally go to stdout into a specified file.

  • Behavior: If the target file already exists, its contents are completely overwritten. If the file does not exist, a new one is created.

  • Example: echo 'hi' > helloFile will create helloFile containing hi.

  • Example: ls > dirListing saves the output of the ls command in a file named dirListing. Running more dirListing afterwards displays the saved directory listing.

  • Crucial Caution: Always be mindful of the > operator as it will irrevocably overwrite existing files. For instance, executing ls > dataFile1 on an existing dataFile1 will replace its original content with the directory listing.
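The create-then-overwrite behavior can be seen in a few lines. This is a sketch in a throwaway temporary directory; the second message is an arbitrary placeholder.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo 'hi' > helloFile               # helloFile does not exist yet, so it is created
before=$(cat helloFile)

echo 'something else' > helloFile   # helloFile exists, so its old contents are discarded
after=$(cat helloFile)

echo "before: $before"
echo "after:  $after"

cd / && rm -rf "$workdir"
```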

Redirecting stdin (<)

  • Operator: The < symbol.

  • Function: Directs a program to take its input from a specified file rather than from the keyboard.

  • Utility: Extremely useful for providing large volumes of data to a program without manual typing.

  • Example: wc < inputFile would read its input from inputFile.
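One visible difference between naming a file as an argument and redirecting it into stdin is that wc cannot print a file name it never saw. The sketch below assumes a made-up three-line inputFile and uses the standard -l flag (line count only), which the lecture does not cover.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'alpha\nbeta\ngamma\n' > inputFile

with_argument=$(wc -l inputFile)     # wc opens the file itself and prints its name
with_redirect=$(wc -l < inputFile)   # the shell opens the file; wc only sees stdin

echo "argument: $with_argument"
echo "redirect: $with_redirect"

cd / && rm -rf "$workdir"
```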

Appending stdout (>>)

  • Operator: The >> symbol.

  • Function: Sends all output from a program to the end of a specified file.

  • Behavior: If the file does not exist, it will be created. If it already exists, new output is appended to its current contents; no information is overwritten.
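A short sketch of appending, using a hypothetical logFile in a temporary directory:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo 'first entry' >> logFile    # logFile does not exist yet, so it is created
echo 'second entry' >> logFile   # appended after the existing line

contents=$(cat logFile)
printf '%s\n' "$contents"

cd / && rm -rf "$workdir"
```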

Redirecting stderr (2>)

  • Operator: The 2> symbol.

  • Function: Specifically redirects stderr (error messages) to a designated file.

  • Notation: The 2 explicitly refers to the stderr stream (file descriptor 2).

  • Example: To see the difference between stdout and stderr:

    • ls /bin/doesntExist will print an error message directly to the screen.

    • ls /bin/doesntExist 2> errorOutput will redirect that error message into errorOutput. cat errorOutput would then display a message such as ls: /bin/doesntExist: No such file or directory (the exact wording varies between systems).
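To see both streams from a single command, the sketch below asks ls for one path that exists and one that does not, and sends each stream to its own file. The file names listing and helloFile are placeholders chosen for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo 'hi' > helloFile

# One existing path and one missing path: the listing goes to stdout,
# the complaint about the missing path goes to stderr.
ls helloFile /bin/doesntExist > listing 2> errorOutput

listing=$(cat listing)
errors=$(cat errorOutput)
echo "stdout captured: $listing"
echo "stderr captured: $errors"

cd / && rm -rf "$workdir"
```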

Is Redirection Useful?

  • Yes, definitely!

  • Save Output: Whenever you need to archive the output of a program for future reference or processing.

  • Automate Input: When a program requires repetitive or lengthy input, redirecting from a file saves significant typing effort.

  • Isolate Errors: Separating stderr allows for easier logging and analysis of program errors without mixing them with normal output.

  • Inter-Program Data Transfer: Data generated by one program can be saved to a file, then easily used as input for another program.

Remembering Redirection Operators: An Intuition

Think of the arrow as indicating the direction of data flow:

  • cat < inputFile: The contents of inputFile are flowing into the cat program.

  • cat > outputFile: The results generated by cat are flowing into outputFile.

Command Flags and Redirection Order

  • The input or output file must immediately follow its respective redirection operator.

  • Correct Usage: ls -l * > listingOutput (all arguments come before redirection).

  • Incorrect Usage: ls > -l * listingOutput

    • This is interpreted incorrectly because > expects a filename immediately after it. Here, -l is treated as the redirection target, so the shell creates a file named -l, and ls runs without the -l flag on * and listingOutput.

  • General Rule: As a best practice, redirection operators and their target files should typically come at the end of the command line, after all program arguments, to prevent misinterpretations.
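The difference between the two orderings can be checked directly. This sketch uses two made-up files, a.txt and b.txt, in place of the * wildcard so the result is predictable.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

# Arguments first, redirection last: listingOutput receives the long listing.
ls -l a.txt b.txt > listingOutput
long_lines=$(wc -l < listingOutput)

# Operator followed by the wrong word: the shell creates a file literally
# named "-l", and ls runs without the -l flag.
ls > -l a.txt b.txt
misdirected=$(cat ./-l)

echo "long listing lines: $long_lines"
printf 'contents of the file named -l:\n%s\n' "$misdirected"

cd / && rm -rf "$workdir"
```

The ./ prefix in cat ./-l keeps cat from reading -l as an option.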

Combining Redirection Flags

  • It is entirely possible to use both input (<) and output (>) redirection on the same command line.

    • Example: sort < inputFile > outputFile would sort inputFile and save the result to outputFile.

  • Each redirection operator assumes the subsequent argument is its target (file name).

  • Combining > and >>: While technically allowed on the same line, e.g., ls >>appendOutput >output, this rarely does what you want. In a POSIX shell such as bash, redirections are processed left to right and each one replaces any earlier redirection of the same stream: stdout is first pointed at appendOutput, then re-pointed at output. The ls output therefore ends up in output, while appendOutput is created if missing but receives nothing. Some shells behave differently (zsh, for example, writes to both files by default), so avoid redirecting the same stream twice.

  • Example with stderr: ls /bin/doesntExist 2>> errorFile will append any error messages to errorFile without overwriting prior content.
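A sketch of input and output redirection on one command line, followed by appending stderr across two runs. The three names in inputFile are placeholder data.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Ken\nAthena\nGuile\n' > inputFile

# stdin comes from inputFile, stdout goes to outputFile.
sort < inputFile > outputFile
sorted=$(cat outputFile)

# Two failing runs; 2>> keeps the message from the first run.
ls /bin/doesntExist 2>> errorFile
ls /bin/doesntExist 2>> errorFile
error_lines=$(wc -l < errorFile)

printf '%s\n' "$sorted"
echo "lines in errorFile: $error_lines"

cd / && rm -rf "$workdir"
```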

Piping: Tying Programs Together

  • Operator: The vertical bar | is used to create a pipe.

  • Function: Pipes feed the stdout of one program directly into the stdin of another program without needing an intermediate file.

  • Efficiency: Both programs run at the same time and the data passes between them in memory, so commands can be chained without writing anything to disk.

sort and uniq with Pipes

  • Problem: To sort a file and then remove all duplicate lines (not just adjacent ones).

  • Prior Approach (with Redirection):

    1. Sort the file into a temporary output file: sort dataFile > tempOutput.

    2. Use uniq to process the sorted temporary file: uniq < tempOutput.
      This requires creating and potentially deleting a temporary file.

  • Piped Approach: sort dataFile | uniq

    • The stdout of sort dataFile (which is the fully sorted list) is directly fed as stdin to the uniq program.

    • Since uniq receives sorted input, all duplicate lines will be adjacent and thus correctly removed.
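Both approaches can be run side by side to confirm they give the same result. The sketch assumes dataFile2 holds the names from the uniq example, one per line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile2

# Redirection approach: two commands and a temporary file to clean up.
sort dataFile2 > tempOutput
via_tempfile=$(uniq < tempOutput)
rm tempOutput

# Piped approach: one command line, no intermediate file.
via_pipe=$(sort dataFile2 | uniq)

printf '%s\n' "$via_pipe"

cd / && rm -rf "$workdir"
```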

How to Read Pipes

  • Commands chained with pipes are always read from left to right.

  • The flow of data (output) also proceeds from left to right.

  • Multiple pipes can be used on a single line, forming complex data processing chains.
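As an illustration of a longer chain, the sketch below counts the distinct names in a file with four stages. The final tr stage is an addition not covered in the lecture; it strips the padding spaces that some versions of wc print.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile2

# Read left to right: sort the lines, drop duplicates, count what is left,
# then remove any padding spaces from the count.
distinct=$(sort dataFile2 | uniq | wc -l | tr -d ' ')
echo "distinct names: $distinct"   # prints: distinct names: 3

cd / && rm -rf "$workdir"
```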

Combining Redirection and Piping

  • It is entirely feasible to combine both file redirection and piping on the same command line, adding another layer of flexibility and power to command-line operations.
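A minimal sketch of redirection and piping together: the first program reads from a file, the last one writes to a file, and a pipe connects them. The output name uniqueNames is a placeholder.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Dan\nDan\nDan\nAthena\nRyu\nRyu\nAthena\n' > dataFile2

# sort takes stdin from dataFile2, its stdout is piped into uniq,
# and uniq's stdout is saved in uniqueNames.
sort < dataFile2 | uniq > uniqueNames
unique=$(cat uniqueNames)
printf '%s\n' "$unique"

cd / && rm -rf "$workdir"
```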

Is All This Stuff Worth It?

  • Short Answer: Absolutely!

  • Long Answer: UNIX's command-line tools and piping system offer significant advantages over GUI-based environments for certain tasks.

    • Efficiency: With piping, multiple complex tasks can be combined and executed in a single, efficient step.

    • Comparison (e.g., Sorting and Uniquing a Word File):

      • In a GUI like Microsoft Word: The process involves opening the file, highlighting text, navigating through menus (e.g., "Table -> Sort"), configuring sorting options, and then realizing there's no native