Computer Architecture and Cryptography: message_finder, substitution, and caesar Study Guide

FILENAME FORMATTING AND HEXADECIMAL (message_finder.py)

  • The hex(n) Function in Python:     * The hex(n) function is used to convert integer numbers into their base hexadecimal representation.     * For an input such as 1515, the function returns the string 0xf.     * The prefix 0x is a standard indicator in Python and other programming languages that the subsequent value is in hexadecimal format.
  • String Slicing with [2:]:     * When hex(n) is called, the output includes the 0x prefix.     * The slice [2:] is applied to the result of hex(n) to strip away these first two characters (0 and x), leaving only the actual hexadecimal digits.
  • The .upper() Method:     * By default, Python's hexadecimal output uses lowercase letters (e.g., a, b, c).     * The .upper() method is applied to convert these to uppercase (e.g., A, B, C).     * This is intended to ensure consistency with file naming conventions and to prevent errors when the file system or program expects uppercase hexadecimal strings.
  • The Padding Mechanism (While Loop):     * The code uses a while loop to ensure filenames follow a specific length requirement.     * The loop check is while len(n) < 4: n = "0" + n.     * This serves as a padding mechanism; if the hexadecimal string has fewer than 44 digits, zeros are prepended to the front until the string length reaches exactly 44.     * Examples of padding:         * If the result is 44 digits (e.g., 4E20), the loop runs 00 times and no zeros are added.         * If the result is 11, 22, or 33 digits, the loop runs repeatedly to reach the 44-digit requirement.
  • Required Skills for Decimal to Hexadecimal Conversion:     * Students must be able to manually convert decimal values ranging from 00 to 256256 into hexadecimal without the aid of a calculator.     * Example: Decimal 256256 corresponds to Hexadecimal 100100.

FILE HANDLING AND CHARACTER COUNTING

  • The read_file() Function:     * This specific function is responsible for opening a file path and extracting its content.     * The function returns the file content as a String data type.
  • Array Initialization for Character Counts:     * The line counts = [0] * 128 initializes a list containing exactly 128128 elements, with every element starting at a value of 00.
  • Significance of the Size 128128:     * The size 128128 is chosen because it represents the maximum number of standard ASCII values.     * This includes the range of characters before the list is eventually sliced starting at index 3232.
  • Mapping Characters with ord(c):     * The ord() function provides the numerical Unicode/ASCII representation of a character.     * For example, passing a space character ' ' to ord(' ') returns the value 3232.
  • Incrementing Count Values:     * The expression counts[i] += 1 identifies the index i (derived from the ASCII value of character c) and increments the stored value by 11.     * Example: If the character is 'A', ord('A') returns 6565. The code then executes counts[65] += 1 to track that an 'A' has been found.
  • Input Impact on the counts List:     * Tracing the string "! bbb !~":         * ! has an ASCII value of 3333.         * (space) has an ASCII value of 3232.         * b has an ASCII value of 9898.         * ~ has an ASCII value of 126126.         * The list will reflect the frequency of these specific ASCII indices based on their occurrences in the string.
  • Slicing the counts Array (counts[32:-1]):     * The slice [32:-1] is used to extract elements starting from index 3232 up to, but not including, the very last index (127127).     * This effectively grabs the 32extnd32 ext{nd} through the 127extth127 ext{th} character representations.     * The first 3232 characters (indices 00 through 3131) are removed because they represent computer control commands (such as copy, paste, null, etc.) rather than printable characters.

CHI-SQUARE STATISTICAL ANALYSIS (message_finder.py)

  • Calculation of the Expected Value:     * The expected value is derived by summing all the numbers in the counts list and dividing by the total count of numbers in that list (extsum(nums)/extlen(nums)ext{sum}(nums) / ext{len}(nums)).     * This represents the mean distribution of characters for a single file.
  • The Chi-Square (extX2ext{X}^2) Calculation Formula:     * The chi-square value is computed by iterating through every observed value in the list.     * For each observed value, the expected value is subtracted, and the result is squared.     * This squared difference is then divided by the expected value:         * extX2=(observedexpected)2expectedext{X}^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}     * The code adds up these individual results for every value in the list to produce the final chi-square statistic.
  • Interpreting Chi-Square Values:     * Higher Chi-Square Values: Indicate significant variations in character frequencies. This often suggests a specific pattern or non-random structure that differs from other text files.     * Lower Chi-Square Values: Indicate very little variation. The observed character counts fit the expected values (the mean) very closely, suggesting a more uniform distribution.

FILE CHECKING LOGIC AND PATH CONSTRUCTION

  • Filename and Path Structure:     * Files are located in a folder named text_files.     * Each filename is formatted as file_XXXX.txt, where XXXX is a hexadecimal value ranging from 0000 to 4E20 (which corresponds to decimal 2000020000).     * The full path is constructed as "text_files/" + filename.
  • Threshold Parameters:     * The threshold variable determines which files are flagged and displayed in the output.     * Only files with a calculated chi-square value greater than the threshold will be printed.     * Adjusting the threshold allows the user to filter for files with specific levels of character frequency variance.
  • Range of Processing:     * In the provided implementation check_files(0, 20000, 145), the program processes 2000020000 files, starting from decimal index 00 up to 1999919999.

SUBSTITUTION CIPHER MECHANICS (substitution.py)

  • The Alphabet and The Key:     * Alphabet String: Serves as the master reference for the standard characters used in the translation process. It is the basis for encryption and decryption.     * Key String: This is a scrambled version of the alphabet or a unique set of characters used to map against the alphabet. It is essential for the substitution process; without a key, there is no method for decryption.     * Relationship: Both the alphabet and the key strings must be the exact same length to ensure every character in the alphabet has a corresponding replacement in the key.
  • Character Escaping:     * Certain characters, specifically the double quote ("), must be escaped using a backslash (\") within the string definition.     * This informs the Python interpreter that the quote is a literal character in the string rather than the closing delimiter of the string itself.     * Failure to escape these characters will result in Python assuming the string has ended prematurely, causing syntax errors.
  • Encryption Process:     * The function uses .find(character) on the alphabet string to locate the integer index of each character in the original message.     * This index is then used to retrieve the character at the same position in the key string.     * Example: If the second character of the alphabet is found, it is replaced by the second character of the key.
  • Decryption Process:     * The process is reversed. The function uses .find(character) on the key string to find the index of the character from the encrypted message.     * This index is then used to retrieve the corresponding character from the alphabet string.
  • Security and Vulnerabilities:     * Substitution ciphers are highly vulnerable to Frequency Analysis.     * Cryptanalysts can look at the frequency of symbols to determine which letters are most common (e.g., vowels or common consonants in the English language). Spaces and common patterns are often easy to identify.     * Keyspace: This refers to the total number of possible combinations available to break a code. This ranges from a small number of combinations for short/simple keys to massive values (128128-bit codes) that would take billions of years to brute-force.     * Actual Security: This is defined as the minimum amount of time required to successfully find or crack the code.

CAESAR CIPHER AND ASCII SHIFTING (caesar.py)

  • ASCII Conversions:     * ord(char): Translates a character into its specific numerical ASCII value.     * chr(ascii_val): The reverse of ord(); it converts a numerical ASCII value back into its character representation.
  • The Shifting Mechanism:     * A character is shifted by adding an integer value n to its original ASCII value.     * Example: A space character (ASCII 3232) shifted by 33 results in ASCII 3535, which corresponds to the # character.
  • Handling Character Wrap-Around:     * The printable ASCII range used in this implementation is consistently between 3232 and 126126.     * Above ASCII 126: If a shift results in a value higher than 126126, the program uses a while loop to repeatedly subtract 9595 until the value falls back into the valid range.     * Below ASCII 32: If a shift (typically during decryption) results in a value lower than 3232, a while loop repeatedly adds 9595 until the value is at least 3232.     * Why 95? The number 9595 is used because it represents the total count of printable characters in the cycle (12632+1=95126 - 32 + 1 = 95, logically representing the spanning range).     * Why while instead of if? A while statement can handle large shifts (shifts much greater than 9595) by repeatedly checking and adjusting the value, whereas an if statement would only check and adjust once.
  • Encryption and Decryption Functions:     * Encryption: Iterates through every character in a message and applies the shift function using the argument n.     * Decryption: Reuses the encryption function but multiplies the shift value by 1-1 (i.e., encrypt(message, -1 * n)). This reverses the direction of the shift, effectively restoring the original text.
  • Special Shift Cases:     * Edge Cases: Shifting a character near the end of the range (like ~) results in it looping back to the beginning (e.g., ASCII 127127 shifted by 33 would become 3535).     * Shift of 95: If the shift value is exactly 9595, it completes a full wrap-around, and the resulting characters will be identical to the original input.
  • Security Vulnerabilities:     * Like substitution ciphers, the Caesar cipher is vulnerable to frequency distribution analysis.     * By identifying the most common character in an encrypted message and comparing it to the known frequency of letters in English (like 'e' or space), an attacker can easily determine the original shift value.