Lecture 14 Hashing in Digital Forensics

Cryptographic vs. Forensic Hashes

  • Hashing algorithms produce

    1. the same hash value for identical input and

    2. a different hash value for different input (in practice; see Hash Collision below).

  • Hashing is a one-way function.

  • Most commonly used hashing algorithms:

    • MD5: 128-bit hash value

    • SHA-1: 160-bit hash value
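
As a minimal sketch of the properties above, the following Python snippet uses the standard-library hashlib module to compute MD5 and SHA-1 digests and report their length in bits. The helper name digest_info and the sample input are illustrative assumptions, not part of the lecture material.

```python
import hashlib


def digest_info(data: bytes) -> dict:
    """Return the hex digest and digest length in bits for MD5 and SHA-1."""
    result = {}
    for name in ("md5", "sha1"):
        h = hashlib.new(name, data)
        result[name] = {"hex": h.hexdigest(), "bits": h.digest_size * 8}
    return result


# Identical input always yields the identical hash value.
first = digest_info(b"evidence")
second = digest_info(b"evidence")
print(first == second)

# A different input yields a different hash value.
other = digest_info(b"Evidence")
print(first["md5"]["hex"] != other["md5"]["hex"])

for name, entry in first.items():
    print(name, entry["bits"], entry["hex"])
```

The hex strings are 32 characters long for MD5 and 40 for SHA-1, because each hex character encodes 4 bits.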

Bit-by-bit Copy

  • Tools can generate an exact bit-by-bit copy of the digital data on a storage device, producing an “evidence file” or “image file.” Examples include EnCase, FTK Imager, and the Linux dd command.

  • Proof of integrity: the hash value of the acquired data shows that the copy has not changed.

  • Some tools, like dd, produce an image file but do not automatically compute a hash value

    • Outcome: the hash must be computed separately, so the image file and the hash value are saved in separate files.

  • Evidence files using the E0x format include

    1. header information (e.g., examiner name, case number),

    2. the original data split into blocks,

    3. a Cyclic Redundancy Check (CRC) for each block,

    4. the MD5 and/or SHA-1 hash value of the entire data (i.e., all data blocks).
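
The two ideas above (a separate hash file for a raw image, and per-block checks plus one overall hash) can be sketched as follows. This is only an illustration under stated assumptions: the block size, the sidecar naming scheme (image name plus ".md5") and the function names are hypothetical, zlib.crc32 stands in for the per-block check, and nothing here reads or writes the real E0x container format.

```python
import hashlib
import zlib
from pathlib import Path

BLOCK_SIZE = 32 * 1024  # illustrative value, not taken from any format specification


def hash_image(path, algorithm="md5", block_size=BLOCK_SIZE):
    """Stream the image once; return (overall hex digest, list of per-block CRC32 values)."""
    overall = hashlib.new(algorithm)
    block_crcs = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            overall.update(block)                 # hash over all data blocks
            block_crcs.append(zlib.crc32(block))  # one checksum per block
    return overall.hexdigest(), block_crcs


def write_sidecar(image_path, algorithm="md5"):
    """Save the hash value in a separate file next to the image (hypothetical naming)."""
    digest, _ = hash_image(image_path, algorithm)
    sidecar = Path(str(image_path) + "." + algorithm)
    sidecar.write_text(f"{digest}  {Path(image_path).name}\n")
    return sidecar


def verify_sidecar(image_path, algorithm="md5"):
    """Re-hash the image and compare with the value recorded at acquisition."""
    sidecar = Path(str(image_path) + "." + algorithm)
    recorded = sidecar.read_text().split()[0]
    digest, _ = hash_image(image_path, algorithm)
    return digest == recorded
```

Calling write_sidecar once at acquisition and verify_sidecar at any later point mirrors the idea that the hash should match at any point in time. The per-block checksums make it possible to say which block changed, whereas the overall hash only says that something changed.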

Use Cases for Hashing

  • Proof of integrity: Hashing acts like a digital fingerprint to ensure data hasn't changed.

  • Compliance with ACPO principle 1: No action should change data which may subsequently be relied upon in court; the hash should match at any point in time.

  • The same concept of “proof of integrity” is valid for any standalone file relevant to an investigation.

  • Proof of tampering: hashing can reveal that images which look visually identical actually differ, for example because steganography has been used to hide data in one of them.

  • Hash values can be used to build Hash Libraries

    – Libraries can be used in two main ways in an investigation

    ✓ To exclude files whose content and presence on an image file are known → they are not of forensic interest.

    Example: exclude files from a clean installation of Windows

    ✓ To locate files whose content and presence on an image file are notable → they are of forensic interest.

Hash Libraries

  • A hash library can contain one or more hash sets.

  • A hash set is a list of hash values with a name attached

    – Two hash libraries (primary and secondary) may be used within a case at the same time

  • A hash library exists outside of any particular investigation case.
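
A hash library of named hash sets, and its use to exclude known files and locate notable ones, might be modelled roughly as below. The class, the category labels "known" and "notable", and the triage function are hypothetical names chosen for illustration; forensic tools store their libraries in their own formats.

```python
import hashlib


class HashLibrary:
    """A collection of named hash sets, kept independently of any one case."""

    def __init__(self):
        # set name -> {"category": str, "hashes": set of hex digests}
        self.hash_sets = {}

    def add_hash_set(self, name, category, hashes):
        self.hash_sets[name] = {"category": category, "hashes": set(hashes)}

    def lookup(self, digest):
        """Return (set name, category) for every hash set containing `digest`."""
        return [
            (name, entry["category"])
            for name, entry in self.hash_sets.items()
            if digest in entry["hashes"]
        ]


def triage(files, library):
    """Sort {file name: content bytes} into known / notable / unknown by MD5."""
    result = {"known": [], "notable": [], "unknown": []}
    for name, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        categories = {category for _, category in library.lookup(digest)}
        if "notable" in categories:
            result["notable"].append(name)   # of forensic interest
        elif "known" in categories:
            result["known"].append(name)     # can be excluded
        else:
            result["unknown"].append(name)   # needs manual review
    return result
```

Because the library object holds no case data, the same instance can be reused across investigations, and a second instance can play the role of the secondary library.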

Hash Collision

  • A collision occurs when two different inputs (e.g., documents) produce the same hash value.

  • MD5 and SHA-1 hash collisions have been demonstrated, raising concerns about their reliability.
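
To make the idea of a collision concrete, the sketch below deliberately weakens SHA-1 by keeping only the first 2 bytes (16 bits) of the digest, so that a collision can be found by simple search. This is an assumption made purely for demonstration: it does not reproduce the published MD5 or SHA-1 collision attacks, and the function names and message pattern are hypothetical.

```python
import hashlib
from itertools import count


def truncated_sha1(data: bytes, n_bytes: int = 2) -> bytes:
    """A deliberately weak hash: only the first n_bytes of the SHA-1 digest."""
    return hashlib.sha1(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Return two different messages that share the same truncated hash."""
    seen = {}  # truncated hash -> first message that produced it
    for i in count():
        message = f"document-{i}".encode()
        tag = truncated_sha1(message, n_bytes)
        if tag in seen:
            return seen[tag], message
        seen[tag] = message


a, b = find_collision()
print(a, b, truncated_sha1(a).hex())
```

With only 16 bits there are 65,536 possible values, so a repeat is guaranteed after at most 65,537 messages and typically appears after a few hundred. The full 128-bit and 160-bit outputs of MD5 and SHA-1 make this kind of blind search infeasible; the demonstrated collisions rely on weaknesses in the algorithms themselves.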

So far…

Cryptographic hashing is fundamental to DF investigations

– as proof of integrity,

– proof of tampering,

– to find identical copies of known files of interest, and

– to exclude known files (from our investigation) that are NOT of interest

Cryptographic hashing is useful for searching for identical files or verifying whether two files are identical

Cryptographic hash matching:

What is the main drawback of cryptographic hash matching?

  Any manipulation or editing of an image (changing a single bit, resizing, changing the background colour, saving in another format…) is enough to evade a match.
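
This sensitivity can be observed directly. The sketch below flips a single bit in some sample data and counts how many bits of the MD5 digest change; the sample data and the helper name bit_difference are illustrative assumptions.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = bytes(range(256)) * 4   # stand-in for the bytes of an image file
modified = bytearray(original)
modified[0] ^= 0x01                # flip one single bit
modified = bytes(modified)

digest_original = hashlib.md5(original).digest()
digest_modified = hashlib.md5(modified).digest()

print("input bits changed: ", bit_difference(original, modified))
print("digest bits changed:", bit_difference(digest_original, digest_modified), "of 128")
```

For a well-behaved cryptographic hash, roughly half of the digest bits are expected to change, so the two hash values carry no hint that the inputs were nearly identical.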

  • Solution direction

    Semantic approximate matching is a way forward

    – These techniques produce hashes that can detect visually similar images

    – Several terms (e.g., perceptual hashing, robust hashing and forensic hashing) refer to semantic approximate matching techniques; we will adopt the term “forensic hashing”

PhotoDNA

  • PhotoDNA allows detection of copies of the same image

    Rationale:

    → images with similar DNA are variations of the same original

  • Images are converted to black and white, resized and divided into a grid of cells

  • A histogram of intensity is computed for each grid cell, and a 144-byte (12x12) hash value of the image is generated

  • CAID (the UK Child Abuse Image Database) stores PhotoDNA hashes (and cryptographic hashes) of known IIOC (indecent images of children) as metadata.

  • PhotoDNA allows detection of copies of IIOC even after they have been manipulated to some extent.

  • It allows zooming in on grid cells to see details

Open Source Forensic Hashing

  • Average hash:

    • Reduces an input image to an 8x8 pixel representation.

    • Converts to grayscale.

    • Calculates the average of the 64 grayscale values to obtain a mean value.

    • Compares each of the 64 grayscale values with the mean: one bit per value is set to indicate whether it is above or below the mean, giving a 64-bit hash.

  • Perceptual hash:

    • Resizes the input image to a 32x32 pixel representation.

    • Converts to grayscale.

    • Applies the Discrete Cosine Transform (DCT).

    • Keeps only the low-frequency DCT coefficients (the high frequencies are cropped) and compares each of them to their median value to set the hash bits.

  • Wavelet hash:

    • Similar to perceptual hash but instead of DCT, it uses the Discrete Wavelet Transform (DWT).

    • It analyses the frequency content of an image in a localised and multi-resolution manner
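
The average hash and perceptual hash steps listed above can be sketched in plain Python as follows. Several simplifying assumptions apply: the input is taken to be an already decoded grayscale image given as a list of rows of pixel values, the resize is a simple box average, the DCT is an unnormalised DCT-II, and the low-frequency region kept is the top-left 8x8 block. All function names are hypothetical, the wavelet hash is not covered, and real libraries use different resize filters, so their hash values will not match these.

```python
import math
from statistics import median


def resize_gray(pixels, size):
    """Box-average a 2D grayscale grid (list of rows) down to size x size."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def bits_to_int(bits):
    """Pack a sequence of booleans into one integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value


def average_hash(pixels, size=8):
    """64-bit hash: one bit per reduced pixel, set if the pixel is above the mean."""
    small = [v for row in resize_gray(pixels, size) for v in row]
    mean = sum(small) / len(small)
    return bits_to_int(v > mean for v in small)


def dct_1d(vec):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i, v in enumerate(vec))
        for k in range(n)
    ]


def dct_2d(grid):
    """Apply the 1D DCT to every row, then to every column."""
    rows = [dct_1d(row) for row in grid]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def perceptual_hash(pixels, size=32, keep=8):
    """64-bit hash: low-frequency DCT coefficients compared with their median."""
    freq = dct_2d(resize_gray(pixels, size))
    low = [freq[r][c] for r in range(keep) for c in range(keep)]
    med = median(low)
    return bits_to_int(v > med for v in low)


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests visually similar images."""
    return bin(hash_a ^ hash_b).count("1")


# Synthetic 64x64 gradient and a uniformly brightened copy of it.
image = [[2 * x + y for x in range(64)] for y in range(64)]
brighter = [[v + 10 for v in row] for row in image]
print(hamming_distance(average_hash(image), average_hash(brighter)))
```

Unlike cryptographic hashes, these values are compared by distance rather than equality: an investigator would choose a threshold on the Hamming distance below which two images are treated as likely variations of the same original.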

Conclusion

  • Cryptographic hashing is fundamental to DF investigations: as proof of integrity, proof of tampering, to find identical copies of known files of interest, and to exclude known files from our investigation that are NOT of interest

  • Forensic hashing can identify notable media that are similar to known items, even after they have been edited or manipulated to evade detection.