Computer Architecture and Cryptography: message_finder, substitution, and caesar Study Guide

FILENAME FORMATTING AND HEXADECIMAL (message_finder.py)

The hex(n) Function in Python: * The hex(n) function is used to convert integer numbers into their base hexadecimal representation. * For an input such as $15$ , the function returns the string 0xf. * The prefix 0x is a standard indicator in Python and other programming languages that the subsequent value is in hexadecimal format.
String Slicing with [2:]: * When hex(n) is called, the output includes the 0x prefix. * The slice [2:] is applied to the result of hex(n) to strip away these first two characters (0 and x), leaving only the actual hexadecimal digits.
The .upper() Method: * By default, Python's hexadecimal output uses lowercase letters (e.g., a, b, c). * The .upper() method is applied to convert these to uppercase (e.g., A, B, C). * This is intended to ensure consistency with file naming conventions and to prevent errors when the file system or program expects uppercase hexadecimal strings.
The Padding Mechanism (While Loop): * The code uses a while loop to ensure filenames follow a specific length requirement. * The loop check is while len(n) < 4: n = "0" + n. * This serves as a padding mechanism; if the hexadecimal string has fewer than $4$ digits, zeros are prepended to the front until the string length reaches exactly $4$ . * Examples of padding: * If the result is $4$ digits (e.g., 4E20), the loop runs $0$ times and no zeros are added. * If the result is $1$ , $2$ , or $3$ digits, the loop runs repeatedly to reach the $4$ -digit requirement.
Required Skills for Decimal to Hexadecimal Conversion: * Students must be able to manually convert decimal values ranging from $0$ to $256$ into hexadecimal without the aid of a calculator. * Example: Decimal $256$ corresponds to Hexadecimal $100$ .

FILE HANDLING AND CHARACTER COUNTING

The read_file() Function: * This specific function is responsible for opening a file path and extracting its content. * The function returns the file content as a String data type.
Array Initialization for Character Counts: * The line counts = [0] * 128 initializes a list containing exactly $128$ elements, with every element starting at a value of $0$ .
Significance of the Size $128$ : * The size $128$ is chosen because it represents the maximum number of standard ASCII values. * This includes the range of characters before the list is eventually sliced starting at index $32$ .
Mapping Characters with ord(c): * The ord() function provides the numerical Unicode/ASCII representation of a character. * For example, passing a space character ' ' to ord(' ') returns the value $32$ .
Incrementing Count Values: * The expression counts[i] += 1 identifies the index i (derived from the ASCII value of character c) and increments the stored value by $1$ . * Example: If the character is 'A', ord('A') returns $65$ . The code then executes counts[65] += 1 to track that an 'A' has been found.
Input Impact on the counts List: * Tracing the string "! bbb !~": * ! has an ASCII value of $33$ . * (space) has an ASCII value of $32$ . * b has an ASCII value of $98$ . * ~ has an ASCII value of $126$ . * The list will reflect the frequency of these specific ASCII indices based on their occurrences in the string.
Slicing the counts Array (counts[32:-1]): * The slice [32:-1] is used to extract elements starting from index $32$ up to, but not including, the very last index ( $127$ ). * This effectively grabs the $32 ext{nd}$ through the $127 ext{th}$ character representations. * The first $32$ characters (indices $0$ through $31$ ) are removed because they represent computer control commands (such as copy, paste, null, etc.) rather than printable characters.

CHI-SQUARE STATISTICAL ANALYSIS (message_finder.py)

Calculation of the Expected Value: * The expected value is derived by summing all the numbers in the counts list and dividing by the total count of numbers in that list ( $ext{sum}(nums) / ext{len}(nums)$ ). * This represents the mean distribution of characters for a single file.
The Chi-Square ( $ext{X}^2$ ) Calculation Formula: * The chi-square value is computed by iterating through every observed value in the list. * For each observed value, the expected value is subtracted, and the result is squared. * This squared difference is then divided by the expected value: * $ext{X}^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ * The code adds up these individual results for every value in the list to produce the final chi-square statistic.
Interpreting Chi-Square Values: * Higher Chi-Square Values: Indicate significant variations in character frequencies. This often suggests a specific pattern or non-random structure that differs from other text files. * Lower Chi-Square Values: Indicate very little variation. The observed character counts fit the expected values (the mean) very closely, suggesting a more uniform distribution.

FILE CHECKING LOGIC AND PATH CONSTRUCTION

Filename and Path Structure: * Files are located in a folder named text_files. * Each filename is formatted as file_XXXX.txt, where XXXX is a hexadecimal value ranging from 0000 to 4E20 (which corresponds to decimal $20000$ ). * The full path is constructed as "text_files/" + filename.
Threshold Parameters: * The threshold variable determines which files are flagged and displayed in the output. * Only files with a calculated chi-square value greater than the threshold will be printed. * Adjusting the threshold allows the user to filter for files with specific levels of character frequency variance.
Range of Processing: * In the provided implementation check_files(0, 20000, 145), the program processes $20000$ files, starting from decimal index $0$ up to $19999$ .

SUBSTITUTION CIPHER MECHANICS (substitution.py)

The Alphabet and The Key: * Alphabet String: Serves as the master reference for the standard characters used in the translation process. It is the basis for encryption and decryption. * Key String: This is a scrambled version of the alphabet or a unique set of characters used to map against the alphabet. It is essential for the substitution process; without a key, there is no method for decryption. * Relationship: Both the alphabet and the key strings must be the exact same length to ensure every character in the alphabet has a corresponding replacement in the key.
Character Escaping: * Certain characters, specifically the double quote ("), must be escaped using a backslash (\") within the string definition. * This informs the Python interpreter that the quote is a literal character in the string rather than the closing delimiter of the string itself. * Failure to escape these characters will result in Python assuming the string has ended prematurely, causing syntax errors.
Encryption Process: * The function uses .find(character) on the alphabet string to locate the integer index of each character in the original message. * This index is then used to retrieve the character at the same position in the key string. * Example: If the second character of the alphabet is found, it is replaced by the second character of the key.
Decryption Process: * The process is reversed. The function uses .find(character) on the key string to find the index of the character from the encrypted message. * This index is then used to retrieve the corresponding character from the alphabet string.
Security and Vulnerabilities: * Substitution ciphers are highly vulnerable to Frequency Analysis. * Cryptanalysts can look at the frequency of symbols to determine which letters are most common (e.g., vowels or common consonants in the English language). Spaces and common patterns are often easy to identify. * Keyspace: This refers to the total number of possible combinations available to break a code. This ranges from a small number of combinations for short/simple keys to massive values ( $128$ -bit codes) that would take billions of years to brute-force. * Actual Security: This is defined as the minimum amount of time required to successfully find or crack the code.

CAESAR CIPHER AND ASCII SHIFTING (caesar.py)

ASCII Conversions: * ord(char): Translates a character into its specific numerical ASCII value. * chr(ascii_val): The reverse of ord(); it converts a numerical ASCII value back into its character representation.
The Shifting Mechanism: * A character is shifted by adding an integer value n to its original ASCII value. * Example: A space character (ASCII $32$ ) shifted by $3$ results in ASCII $35$ , which corresponds to the # character.
Handling Character Wrap-Around: * The printable ASCII range used in this implementation is consistently between $32$ and $126$ . * Above ASCII 126: If a shift results in a value higher than $126$ , the program uses a while loop to repeatedly subtract $95$ until the value falls back into the valid range. * Below ASCII 32: If a shift (typically during decryption) results in a value lower than $32$ , a while loop repeatedly adds $95$ until the value is at least $32$ . * Why 95? The number $95$ is used because it represents the total count of printable characters in the cycle ( $126 - 32 + 1 = 95$ , logically representing the spanning range). * Why while instead of if? A while statement can handle large shifts (shifts much greater than $95$ ) by repeatedly checking and adjusting the value, whereas an if statement would only check and adjust once.
Encryption and Decryption Functions: * Encryption: Iterates through every character in a message and applies the shift function using the argument n. * Decryption: Reuses the encryption function but multiplies the shift value by $-1$ (i.e., encrypt(message, -1 * n)). This reverses the direction of the shift, effectively restoring the original text.
Special Shift Cases: * Edge Cases: Shifting a character near the end of the range (like ~) results in it looping back to the beginning (e.g., ASCII $127$ shifted by $3$ would become $35$ ). * Shift of 95: If the shift value is exactly $95$ , it completes a full wrap-around, and the resulting characters will be identical to the original input.
Security Vulnerabilities: * Like substitution ciphers, the Caesar cipher is vulnerable to frequency distribution analysis. * By identifying the most common character in an encrypted message and comparing it to the known frequency of letters in English (like 'e' or space), an attacker can easily determine the original shift value.