Week 4: Hashing Concepts

What is a Hash Function?

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Cryptographic Hash Functions

A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value. The data to be encoded is often called the “message”, and the hash value is sometimes called the message digest or simply digest.

The ideal cryptographic hash function has three main properties: it is infeasible to find a message that has a given hash; it is infeasible to modify a message without changing its hash; and it is infeasible to find two different messages with the same hash.

MD5 and SHA-1 are the most commonly used cryptographic hash functions (a.k.a. algorithms) in the field of Computer Forensics.

MD5

MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value.

The 128-bit MD5 hashes (also termed message digests) are typically represented as a sequence of 32 hexadecimal digits (16 bytes). The following demonstrates a 40-byte ASCII input and the corresponding MD5 hash: MD5 of “This is an example of an MD5 Hash Value.” = 3413EE4F01F2A0AA17664088E79CF5C2

Even a small change in the message will result in a completely different hash. For example, changing the period at the end of the sentence to an exclamation mark: MD5 of “This is an example of an MD5 Hash Value!” = B872D23A7D14B6EE3B390A58C17F21A8
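The effect described above can be reproduced with a few lines of Python. This is a minimal sketch using the standard hashlib module; the helper names md5_hex and differing_bits are made up for illustration. It prints both digests and counts how many of the 128 bits differ between them.

```python
import hashlib


def md5_hex(message: str) -> str:
    """Return the MD5 digest of an ASCII message as uppercase hex."""
    return hashlib.md5(message.encode("ascii")).hexdigest().upper()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


first = md5_hex("This is an example of an MD5 Hash Value.")
second = md5_hex("This is an example of an MD5 Hash Value!")

print(first)
print(second)
print(len(first), "hex digits =", len(first) * 4, "bits")
print("bits that differ:", differing_bits(first, second))
```

For a well-behaved hash, roughly half of the bits should differ even though only one character of the input changed.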

SHA-1

SHA stands for Secure Hash Algorithm.

SHA-1 produces a 160-bit digest from a message, typically represented as a sequence of 40 hexadecimal digits (20 bytes). Just like MD5, even a small change in a message will result in a completely different hash. For example:

SHA1 of “This is a test.” = AFA6C8B3A2FAE95785DC7D9685A57835D703AC88

SHA1 of “This is a pest.” = FE43FFB3C844CC93093922D1AAC44A39298CAE11
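The relationship between digest size in bytes, bits, and hexadecimal digits can be checked directly. The following is a small sketch using Python's standard hashlib module; digest_sizes is a hypothetical helper name, not part of any forensic tool.

```python
import hashlib


def digest_sizes(algorithm: str) -> tuple:
    """Return (bytes, bits, hex digits) of the digest for a hashlib algorithm."""
    h = hashlib.new(algorithm, b"This is a test.")
    raw = h.digest()
    return len(raw), len(raw) * 8, len(h.hexdigest())


for name in ("md5", "sha1"):
    size_bytes, size_bits, hex_digits = digest_sizes(name)
    print(f"{name}: {size_bytes} bytes = {size_bits} bits = {hex_digits} hex digits")

# The two SHA-1 example messages from the notes, one letter apart.
print(hashlib.sha1(b"This is a test.").hexdigest().upper())
print(hashlib.sha1(b"This is a pest.").hexdigest().upper())
```

Each byte of the digest is written as two hexadecimal digits, which is why a 20-byte SHA-1 digest prints as 40 characters.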

The MD5 hash algorithm - there are 2 to the 128th power = 3.4028236692093846346337460743177e+38 possible MD5 hash values, so the chance of 2 different files having the same MD5 hash value by accident is about 1 in 340 billion billion billion billion.

The SHA-1 hash algorithm - there are 2 to the 160th power = 1.4615016373309029182036848327163e+48 possible SHA-1 hash values, so the chance of 2 different files having the same SHA-1 hash value by accident is 1 in a REALLY big number!
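The figures above are simple powers of two and can be confirmed with Python's arbitrary-precision integers. This sketch only checks the arithmetic; it assumes an ideal hash in which every output value is equally likely, and the variable names are illustrative.

```python
# Number of distinct values a 128-bit and a 160-bit digest can take.
md5_outputs = 2 ** 128
sha1_outputs = 2 ** 160

print(md5_outputs)              # exact integer value of 2**128
print(f"{md5_outputs:.4e}")     # same value in scientific notation
print(f"{sha1_outputs:.4e}")

# SHA-1 has 32 more bits, so 2**32 times as many possible values as MD5.
print(sha1_outputs // md5_outputs == 2 ** 32)
```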

What do CF Examiners use Hashes for?

Data Authentication: To prove two things are the same.

Data Reduction: To exclude many “known” files from the hundreds of thousands of files you have to look at.

File Identification: To find a needle in a haystack.

Data Authentication

One of the most important issues a computer forensic examiner faces is ensuring the ability to “authenticate” digital evidence.

This is done via Chain of Custody, Documentation, and Hash values.

Using MD5 or SHA-1 hashing tools, an examiner should be able to verify that data has not changed. A hash of the acquired data must be identical to a hash of the original evidence.

Calculating a “hash value” for any block of data (e.g. a file, an entire disk, a partition, etc.) can be accomplished as a stand-alone task or, with most tools, simultaneously with the acquisition process.

Calculating the “hash value” of an entire disk is done by reading all data on the disk, running it through the desired algorithm, and generating a hash of all data read. The examiner then typically documents the resulting hash value.

The resulting “hash value” is a hash of the data READ from the disk, not necessarily a hash of the data WRITTEN to your target disk during the acquisition process.

Input/Output errors and bad sector errors encountered during the acquisition process will affect the resulting hash value.

An examiner should run a verification process after the acquisition to ensure that the original hash value calculated while reading the original data matches the hash value of the data written out to your target disk.
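The acquire-then-verify sequence described above can be sketched in Python. This is an illustration only, not how any particular imaging tool works: it treats the source as an ordinary readable file, does no handling of I/O errors or bad sectors, and the names CHUNK_SIZE, hash_file, acquire_with_hash, and verify_image are assumptions made for this example.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time (an arbitrary choice)


def hash_file(path: str, algorithm: str = "md5") -> str:
    """Hash everything readable at `path`, one chunk at a time."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as source:
        for chunk in iter(lambda: source.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def acquire_with_hash(source_path: str, image_path: str,
                      algorithm: str = "md5") -> str:
    """Copy source to image and return the hash of the data as it was READ."""
    digest = hashlib.new(algorithm)
    with open(source_path, "rb") as source, open(image_path, "wb") as image:
        for chunk in iter(lambda: source.read(CHUNK_SIZE), b""):
            digest.update(chunk)   # hash of what was read from the source
            image.write(chunk)     # what gets written is NOT hashed here
    return digest.hexdigest().upper()


def verify_image(acquisition_hash: str, image_path: str,
                 algorithm: str = "md5") -> bool:
    """Re-read the written image and compare its hash to the acquisition hash."""
    return hash_file(image_path, algorithm) == acquisition_hash
```

Note that acquire_with_hash reads the source only once, so the hash comes at no extra read cost. The separate verify_image step is what confirms that the data written to the target matches the data that was read.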

Considerations:

Drives will start to fail as they get older, resulting in “bad sectors”. Bad sectors = inability to obtain matching hash values when comparing a hash of the original disk to the hash of a forensic image of the data read from the disk.

The longer a disk spins, the greater the chance of disk failure(s). To calculate a hash value of a drive, you must read all data on the disk. To acquire a forensic image, you must read all data on the disk.

If your imaging tool does not simultaneously capture a hash value as part of the data acquisition process, consider whether the risk of doubling the disk's running time to obtain a pre-acquisition hash value is appropriate given that your primary objective is to obtain the data.

Using hashes, an examiner can also verify that a specific file or any block of data has not changed.

Hash individual file(s) with FTK Imager, WinHex, md5summer, and many other hashing tools.

A single modified byte will result in hash values that do not match.

When hashing individual files:

Changing filename or extension does NOT change hash value.

Changing Modified, Accessed, Created dates does NOT change hash value.

Changing file system attributes (read-only, hidden, system, etc.) does NOT change hash value.

Changing ANYTHING within the file contents DOES change the hash value of the file. (For files like MS Word documents that contain “Metadata”, changes within the Metadata DO change the contents of the file and therefore change the hash value of the file. For example, if you open an MS Word document, make no changes to the contents of the file, and just re-save the file, MS Word would update the dates saved within the Metadata, so the actual raw content of the overall Word document would change and therefore generate a different hash value.)

Cropping a graphic, changing the resolution, saving as another graphic format (BMP to JPEG), or any other change that may not necessarily change the visual depiction of the picture, WILL change the raw contents of the file and therefore will change the hash value of the file.

NOTE: Although changing a filename or other “non-content” of a file does not change the hash value of the file, such a “non-content” change DOES make a change to the FAT directory entry, MFT entry, or other file system component that holds the filename, MAC dates, attributes, etc. and therefore DOES change the data on the file system that holds the file in question. Therefore a change of a filename, MAC date, file attribute, etc. DOES NOT change the hash value of the file, but it DOES change the hash value of the disk on which the file is stored.
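The file-level rules above can be demonstrated on a scratch file. The sketch below uses only the Python standard library; the file names and contents are made up, and os.utime changes only the accessed and modified times (not the created date), so it is a partial illustration of the date rule.

```python
import hashlib
import os
import stat
import tempfile


def file_md5(path: str) -> str:
    """Return the MD5 of a file's contents as uppercase hex."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest().upper()


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "report.txt")
    with open(original, "wb") as handle:
        handle.write(b"quarterly figures")
    baseline = file_md5(original)

    # Change the filename and extension.
    renamed = os.path.join(folder, "holiday.jpg")
    os.rename(original, renamed)
    after_rename = file_md5(renamed)

    # Change the accessed and modified times.
    os.utime(renamed, (1_000_000_000, 1_000_000_000))
    after_dates = file_md5(renamed)

    # Change a file system attribute (read-only), then restore it.
    os.chmod(renamed, stat.S_IREAD)
    after_attribute = file_md5(renamed)
    os.chmod(renamed, stat.S_IREAD | stat.S_IWRITE)

    # Change the contents by a single byte.
    with open(renamed, "ab") as handle:
        handle.write(b"!")
    after_edit = file_md5(renamed)

print("rename changed hash:   ", after_rename != baseline)
print("dates changed hash:    ", after_dates != baseline)
print("attribute changed hash:", after_attribute != baseline)
print("content changed hash:  ", after_edit != baseline)
```

Only the last line should report a changed hash, because only the last step altered the bytes inside the file.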

Data Reduction

As the storage capacity of disks grows, so does the number of files a computer forensic examiner must examine.

A typical hard drive containing a Windows installation, software applications, user files, temporary Internet files, music downloads, etc. will contain well over a hundred thousand files.

Large databases containing hash values of “known” files can be used by a forensic examiner to reduce the number of files he or she must analyze.

Files that are known to be part of the operating system and/or installed software applications are unlikely to contain evidence.

By excluding all known operating system files and files from known software applications, an examiner is left with only user-created files to review for potential evidence.

Using forensic software tools, an examiner calculates the hash value of all files on a disk.

Then the examiner uses the software tool to compare the calculated hash values against all of the hash values within a known hash database to identify any matching hash values.

The examiner can then exclude from view any files with hash values matching those in the database.

The examiner can also exclude from view any files that are duplicates of each other according to their hash values, further reducing the number of files in view.

This process, called “Data Reduction”, can save the examiner from analyzing many thousands of unnecessary files.
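The steps above can be sketched as a short Python routine. This is a simplified illustration, not the behavior of any specific forensic tool: the "known hash database" is assumed to be a plain Python set of uppercase MD5 strings, and reduce_files and file_md5 are hypothetical names.

```python
import hashlib
import os


def file_md5(path: str) -> str:
    """Return the MD5 of a file's contents as uppercase hex."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest().upper()


def reduce_files(root: str, known_hashes: set) -> list:
    """List files under `root` that are neither known nor duplicates.

    A file is dropped if its hash is in `known_hashes`, or if a file with
    the same hash has already been kept.
    """
    seen = set()
    remaining = []
    for folder, _subfolders, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            value = file_md5(path)
            if value in known_hashes:
                continue  # matches the known-file database
            if value in seen:
                continue  # duplicate of a file already in view
            seen.add(value)
            remaining.append(path)
    return remaining
```

Calling reduce_files on an evidence folder with a set of known operating system hashes returns only the files an examiner would still need to review. A real hash set such as the NSRL RDS would first have to be loaded into that set from its own file format.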

Hash Databases:

National Software Reference Library (NSRL) - Reference Data Sets (RDS) - NIST

HashKeeper (LE, Military and Government only) - NDIC

Known File Filter (KFF) - AccessData, Inc.

Self-generated or shared databases

File Identification

Quickly identifying a specific “notable” file or files amongst the hundreds of thousands of files on a disk can also be accomplished by use of hash databases… finding the needle in the haystack!

Instead of using a database of known “ignorable” files such as OS files, databases containing hash values of known “notable” files can be utilized.

Examples of common “Notable” files are: child pornography and other contraband images, hacker tools, viruses, trojans, and other malware.

The examiner can search by hash value and flag any files with hash values matching those in the “notable database.”
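Flagging is the mirror image of data reduction: matches are kept rather than hidden. A minimal Python sketch follows, assuming the file hashes have already been calculated and stored in a dictionary of path to hash; the paths, contents, and the flag_notable name are invented for illustration.

```python
import hashlib


def flag_notable(file_hashes: dict, notable_hashes: set) -> list:
    """Return the paths whose hash appears in the notable set.

    Hex digests are compared case-insensitively, since tools differ in
    whether they print hashes in upper or lower case.
    """
    wanted = {value.upper() for value in notable_hashes}
    return sorted(path for path, value in file_hashes.items()
                  if value.upper() in wanted)


# Hypothetical stand-ins for file contents found on a disk.
contents = {
    "C:/Users/a/tool.exe": b"hypothetical hacker tool",
    "C:/Users/a/letter.doc": b"harmless letter",
}
file_hashes = {path: hashlib.md5(data).hexdigest()
               for path, data in contents.items()}

# Hypothetical "notable" database holding one uppercase MD5 value.
notable = {hashlib.md5(b"hypothetical hacker tool").hexdigest().upper()}

flagged = flag_notable(file_hashes, notable)
print(flagged)
```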

Limitations

A mismatched hash value only tells you something changed, not what changed.

When using MD5, SHA-1, or other standard cryptographic hashes to identify known files, only EXACT matches will result in success. When files are slightly modified, standard hashing will not identify similar files. “Fuzzy Hashing”, as implemented in the tool ssdeep, uses a concept called context triggered piecewise hashing to identify files that have similar pieces but may not be entirely identical.
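The idea of comparing files piece by piece can be illustrated with a deliberately simplified sketch. This is NOT the ssdeep algorithm and does not produce ssdeep signatures: it cuts the data wherever the sum of the last few bytes hits a trigger value, hashes each piece with MD5, and reports the fraction of piece hashes two inputs share. WINDOW, TRIGGER, piece_hashes, and similarity are all assumptions chosen for this example.

```python
import hashlib
import random

WINDOW = 7    # bytes of context inspected at each position
TRIGGER = 64  # a piece boundary is expected roughly every 64 bytes


def piece_hashes(data: bytes) -> set:
    """Split data at content-defined trigger points and hash each piece."""
    pieces = set()
    start = 0
    for end in range(WINDOW, len(data) + 1):
        # The boundary decision depends only on the last WINDOW bytes,
        # so an edit elsewhere does not move this boundary.
        if sum(data[end - WINDOW:end]) % TRIGGER == TRIGGER - 1:
            pieces.add(hashlib.md5(data[start:end]).hexdigest())
            start = end
    if start < len(data):
        pieces.add(hashlib.md5(data[start:]).hexdigest())
    return pieces


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of piece hashes shared by both inputs (0.0 to 1.0)."""
    pieces_a, pieces_b = piece_hashes(a), piece_hashes(b)
    if not pieces_a and not pieces_b:
        return 1.0
    return len(pieces_a & pieces_b) / len(pieces_a | pieces_b)


# Pseudo-random test data with a single byte altered in the middle.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = bytearray(original)
modified[2048] ^= 0xFF
modified = bytes(modified)

print("MD5 match: ", hashlib.md5(original).digest() == hashlib.md5(modified).digest())
print("similarity:", similarity(original, modified))
```

The whole-file MD5 values do not match, while the piecewise comparison still finds most pieces in common, which is the behavior fuzzy hashing is designed to provide.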

Hash “collisions” have been discovered for both MD5 and SHA-1, and some argue that stronger (more collision-resistant) hash algorithms should be used in computer forensics.