Exhaustive Study Notes on File Format and Artifact Analysis

Fundamental Principles of File Format Analysis

  • Definition of File Formats: These are the rules governing what specific bytes represent inside a file, how applications produce and consume those bytes, and how forensic examiners extract metadata.

  • Learning Objectives:
     - Identify files from raw bytes.
     - Parse internal structures of common CASE formats.
     - Recognize metadata fields that survive copies and conversions.

  • The Visual Encyclopedia Reference:
     - Source: github.com/corkami/pics.
     - Maintained by: Ange Albertini.
     - Utility: Contains annotated reference figures for nearly every common binary format; excellent for refreshing byte-level layout knowledge.

  • Identification: Extension vs. Bytes:
     - Extensions: Advisory metadata for the OS UI. They are trivially renamed and often wrong. To the forensic investigator, an extension shows only what someone wanted the file to be perceived as.
     - File System Records: Store name, size, timestamps, and layout map, but contain no actual type information.
     - Authoritative Source: The only authoritative source of type information is the byte content of the file.
     - Forensic Policy: Never trust a file extension. Use the format signature (magic number) to verify the file type. A disagreement between bytes and extension is itself evidence, known as an Extension Mismatch Indicator.

The Forensic Storage Stack and Artifact Layers

  • Physical Layer: The actual data blocks on the device.

  • Partitioning Layer: Carves the device into volumes.

  • File System Layer: Manages files, names, allocation, and MAC (Modified, Accessed, Created) times.

  • Format Layer: Governs the internal organization of bytes within a file.
     - Artifacts at this layer: Authorship, software producer info, GPS coordinates, tracked changes/unaccepted edits.

  • Application Layer: Sits on top, consuming format-conformant data.

  • Antiforensics: The format layer is the primary site for stripping/falsifying metadata or smuggling objects/hidden content.

Structural Patterns in File Formats

  • Header: Contains the Magic Number (signature), version, dimensions, and offset to the first record.

  • Body: Divided into self-describing units.
     - Nomenclature by Format: PNG calls them "chunks," MP4 calls them "boxes," PDF calls them "objects."
     - Type-Length-Value (TLV) Pattern: A common pattern consisting of a tag (type), a length field, and a payload.

  • Trailer: Acts as either an end marker or a directory.
     - JPEG: Ends with the EOI marker FF D9.
     - ZIP: Ends with an End of Central Directory (EOCD) record pointing to the central directory.
     - PDF: Ends with a startxref keyword giving the byte offset of the cross-reference table, followed by %%EOF.

  • Forensic Parsing: Allows examiners to walk the structure by reading lengths/types without knowing every field.
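
The structure walk described above can be sketched in a few lines. This is a minimal illustration only: the record layout (4-byte ASCII tag, 4-byte big-endian length, payload) and the function name walk_tlv are assumptions made for the example, not the layout of any particular real format.

```python
import struct

def walk_tlv(data, offset=0):
    """Yield (tag, payload) pairs from a hypothetical TLV stream.

    Assumed layout per record: 4-byte ASCII tag, 4-byte big-endian
    length, then `length` bytes of payload.
    """
    while offset + 8 <= len(data):
        tag, length = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        end = start + length
        if end > len(data):
            break  # truncated record: stop rather than read past the end
        yield tag.decode("ascii", "replace"), data[start:end]
        offset = end

# Two records; the walker never needs to understand the payloads.
sample = (
    b"NAME" + struct.pack(">I", 5) + b"alice"
    + b"DATA" + struct.pack(">I", 3) + b"\x01\x02\x03"
)
records = list(walk_tlv(sample))
```

The walker skips over payloads it does not understand, which is why an examiner can still inventory an unfamiliar file as long as the length fields are intact.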

Endianness and Byte Ordering

  • Definition: The order in which multibyte integer fields are stored.

  • Big Endian (Network Byte Order): Most significant byte stored first.
     - Examples: JPEG, PNG, MP4, and most Java artifacts.

  • Little Endian: Least significant byte stored first.
     - Examples: PE and ELF (on x86), BMP, ZIP, and the OOXML container.

  • Declared Byte Order (TIFF and EXIF):
     - Signified by the first two bytes of the file.
     - 4D 4D ("MM"): "Motorola" (Big Endian).
     - 49 49 ("II"): "Intel" (Little Endian).

  • PDF Exception: Sidesteps endianness entirely as all numeric fields are text-encoded.

  • Manual Analysis Risk: Reading a Little Endian field as Big Endian (or the reverse) produces wildly wrong values.
     - Example: Bytes E8 03 00 00 read as Little Endian are 1,000. Read as Big Endian, they are 3,892,510,720.
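
The byte-order example can be checked directly with Python's struct module. The helper tiff_byte_order is a hypothetical name used here to show how the declared byte order of a TIFF/EXIF block might be turned into a struct prefix.

```python
import struct

raw = bytes.fromhex("E8030000")

# "<" selects little endian, ">" selects big endian; "I" is an unsigned 32-bit integer.
little = struct.unpack("<I", raw)[0]
big = struct.unpack(">I", raw)[0]

def tiff_byte_order(header):
    """Map the first two bytes of a TIFF/EXIF block to a struct prefix."""
    if header[:2] == b"II":   # 49 49, "Intel"
        return "<"
    if header[:2] == b"MM":   # 4D 4D, "Motorola"
        return ">"
    return None               # not a TIFF-style header

print(little, big)
```

Selecting the prefix once from the header and reusing it for every later field avoids mixing byte orders partway through a manual parse.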

File Signatures and Magic Numbers

  • Definition: A short, fixed sequence of bytes at a known offset used to identify a format.

  • Signature Locations:
     - Offset Zero: Most common (e.g., JPEG, ZIP, PE).
     - Fixed Non-Zero Offset: MP4's "ftyp" signature sits at byte 4 because bytes 0-3 encode the size of the enclosing box.
     - Trailing Signatures: Found near the end (e.g., JPEG's FF D9, PDF's %%EOF).

  • Self-Checking Signatures (PNG Example):
     - The 8-byte signature is 89 50 4E 47 0D 0A 1A 0A.
     - High-bit byte (89): Detects 7-bit channel stripping.
     - CRLF (0D 0A, Carriage Return/Line Feed): Detects transfers that convert CRLF to LF (DOS-to-UNIX conversion).
     - Substitute character (1A): Traps DOS end-of-file behavior.
     - Trailing newline (0A): Detects the reverse conversion, LF to CRLF (UNIX-to-DOS).

  • Common Signature Table:
     - JPEG: Offset 0, Bytes FF D8 FF.
     - PDF: Offset 0 (a preamble of up to 1,023 bytes is allowed), Bytes 25 50 44 46 (ASCII "%PDF").
     - PNG: Offset 0, Bytes 89 50 4E 47 0D 0A 1A 0A.
     - ZIP: Offset 0, Bytes 50 4B 03 04 (ASCII "PK" followed by 03 04).
     - PE (Windows Executable): Offset 0, Bytes 4D 5A (ASCII "MZ").
     - ELF (Linux Executable): Offset 0, Bytes 7F 45 4C 46 (7F followed by ASCII "ELF").
     - GIF: Offset 0, Bytes 47 49 46 38 (ASCII "GIF8").
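
A minimal sketch of signature-based identification, using only the signatures listed in this section. The function names (identify, extension_mismatch, diagnose_png_signature) and the small extension map are hypothetical and exist only for illustration; production tools such as file rely on far larger rule sets.

```python
PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

# (magic bytes at offset 0, format name), longest signatures first.
SIGNATURES = [
    (PNG_SIGNATURE, "png"),
    (b"PK\x03\x04", "zip"),
    (b"\x7fELF", "elf"),
    (b"GIF8", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"MZ", "pe"),
]

# Hypothetical extension map, for illustration only.
EXTENSION_TO_FORMAT = {
    "jpg": "jpeg", "jpeg": "jpeg", "png": "png", "gif": "gif",
    "pdf": "pdf", "zip": "zip", "docx": "zip", "exe": "pe", "mp4": "mp4",
}

def identify(data):
    """Return a format name based on byte content, or None if unknown."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    if data[4:8] == b"ftyp":        # fixed non-zero offset
        return "mp4"
    if b"%PDF" in data[:1024]:      # tolerate a short preamble
        return "pdf"
    return None

def extension_mismatch(filename, data):
    """True when the extension and the bytes both map to known, different formats."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    expected = EXTENSION_TO_FORMAT.get(ext)
    actual = identify(data)
    return expected is not None and actual is not None and expected != actual

def diagnose_png_signature(head):
    """Guess which transfer damage, if any, a PNG signature has suffered."""
    if head.startswith(PNG_SIGNATURE):
        return "intact"
    if head.startswith(b"\x09PNG"):
        return "high bit stripped (7-bit channel)"
    if head.startswith(b"\x89PNG\n\x1a"):
        return "CRLF converted to LF"
    if head.startswith(b"\x89PNG\r\r\n"):
        return "LF converted to CRLF"
    return "not a PNG signature"
```

For example, extension_mismatch("holiday.jpg", data) returns True when data begins with 4D 5A, which is the kind of disagreement worth recording as evidence.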

Forensic Identification Toolkit

  • file utility: Backed by LibMagic. Universal, fast, MIME-aware. Use -i for MIME types and -b to drop filename prefix.

  • TRID: Probabilistic identification. Ranks candidates by confidence percentage based on a community database. Useful when file returns generic "data."

  • Siegfried: Uses the PRONOM registry from the UK National Archives. Returns a PUID (PRONOM Persistent Unique Identifier).
     - Example PUIDs: fmt/18 (PDF 1.4); fmt/412 (Microsoft Word 2007 onwards, .docx). JPEG variants carry their own PUIDs, such as fmt/43 (JFIF 1.01).

  • ExifTool: The primary metadata engine for images and documents.

  • Binwalk: Specialized for finding embedded files (e.g., a JPEG inside a ZIP or firmware analysis).

  • Bulk Extractor: Corpus-level scanner for large evidence sets.

  • YARA: Rule-based pattern matching (e.g., finding a PDF with embedded JavaScript using logical conditions).

  • Python Libraries:
     - struct: For raw binary header parsing.
     - python-magic: Binding for LibMagic.
     - olefile: For legacy Office formats.
     - pefile: For executables.

Image Format Analysis: JPEG (Joint Photographic Experts Group)

  • Compression: Primarily lossy (high-frequency components discarded).

  • Key Structural Markers:
     - SOI (Start of Image): FF D8.
     - APP0 (FF E0): JFIF metadata.
     - APP1 (FF E1): Primary EXIF carrier. Contains the EXIF header ("Exif" + 00 00) and TIFF-style IFDs.
     - DQT (Define Quantization Table): FF DB. Used as a device fingerprint.
     - SOF (Start of Frame): Defines dimensions and layout.
     - DHT (Define Huffman Table): Entropy coding tables.
     - SOS (Start of Scan): Beginning of compressed pixel data.
     - EOI (End of Image): FF D9.

  • DQT Fingerprinting: Quantization tables are consistent for specific cameras and quality presets. A mismatch between the EXIF model info and the DQT tables suggests the image was re-saved or edited by other software.

  • EXIF Metadata Groups:
     - IFD0: Camera make, model, orientation.
     - Exif SubIFD: Capture parameters (ISO, shutter speed, DateTimeOriginal).
     - GPS IFD: Latitude, Longitude, Altitude, Timestamp.

  • Thumbnails: Embedded JPEGs generated at capture. If the main image is redacted/edited, the thumbnail often remains in its original state, providing critical evidence.
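
The marker sequence listed in this section can be walked with a short loop. The sketch below assumes a simplified model: every segment between SOI and SOS is a two-byte marker followed by a two-byte big-endian length that counts itself. It ignores fill bytes and standalone markers, and the names jpeg_segments and has_exif are hypothetical.

```python
import struct

# Marker names from the list above, plus SOF0 (baseline frame).
MARKER_NAMES = {
    0xE0: "APP0", 0xE1: "APP1", 0xDB: "DQT",
    0xC0: "SOF0", 0xC4: "DHT", 0xDA: "SOS",
}

def jpeg_segments(data):
    """Yield (marker name, payload) for each segment up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    offset = 2
    while offset + 4 <= len(data):
        if data[offset] != 0xFF:
            raise ValueError(f"expected a marker at offset {offset}")
        marker = data[offset + 1]
        # The length field includes its own two bytes but not the marker.
        (length,) = struct.unpack_from(">H", data, offset + 2)
        payload = data[offset + 4 : offset + 2 + length]
        yield MARKER_NAMES.get(marker, f"0x{marker:02X}"), payload
        if marker == 0xDA:
            return  # entropy-coded data follows; stop walking
        offset += 2 + length

def has_exif(data):
    """True when an APP1 segment starts with the EXIF header."""
    return any(
        name == "APP1" and payload.startswith(b"Exif\x00\x00")
        for name, payload in jpeg_segments(data)
    )

# Synthetic byte string with the right framing; it is not a viewable image.
sample = (
    b"\xff\xd8"
    + b"\xff\xe1" + struct.pack(">H", 2 + 10) + b"Exif\x00\x00II*\x00"
    + b"\xff\xdb" + struct.pack(">H", 2 + 3) + b"\x00\x01\x02"
    + b"\xff\xda" + struct.pack(">H", 2 + 1) + b"\x00"
    + b"\x12\x34\xff\xd9"
)
names = [name for name, _ in jpeg_segments(sample)]
```

The DQT payloads yielded by such a walk are the bytes an examiner would compare against known camera or editor tables.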

Image Format Analysis: PNG, GIF, and BMP

  • PNG (Portable Network Graphics):
     - Lossless compression.
     - Critical Chunks: IHDR (Header, must be first), IDAT (Image Data), IEND (End marker).
     - Ancillary Chunks: tEXt, iTXt, zTXt (metadata). These are not rendered as part of the image but can be used for covert data embedding.
     - Every chunk has a 4-byte length, a 4-byte type, the payload, and a 4-byte CRC for integrity.

  • GIF (Graphics Interchange Format):
     - Older format. Signature GIF87a or GIF89a.
     - Uses LZW compression. Supports animation via Extension Blocks (e.g., NETSCAPE2.0).

  • BMP (BitMap):
     - Simple, uncompressed (usually). Signature 42 4D (ASCII "BM").
     - Forensic value is primarily contextual (e.g., manual screenshots or legacy app outputs).
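
The PNG chunk layout described above (length, type, payload, CRC) lends itself to a simple walker. This is a sketch under the assumption that the CRC is the standard CRC-32 over the type and payload bytes, computed here with zlib.crc32; png_chunks and make_chunk are hypothetical helper names.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunks(data):
    """Yield (type, payload, crc_ok) for each chunk after the signature."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    offset = len(PNG_SIGNATURE)
    while offset + 12 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, offset)
        end = offset + 12 + length
        if end > len(data):
            break  # truncated chunk
        payload = data[offset + 8 : offset + 8 + length]
        (stored_crc,) = struct.unpack_from(">I", data, offset + 8 + length)
        crc_ok = zlib.crc32(ctype + payload) == stored_crc
        yield ctype.decode("ascii", "replace"), payload, crc_ok
        if ctype == b"IEND":
            break
        offset = end

def make_chunk(ctype, payload):
    """Build one chunk with a valid CRC (used only to create test data)."""
    crc = zlib.crc32(ctype + payload)
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)

# Synthetic chunk stream; it has the right framing but is not a viewable image.
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", bytes(13))
    + make_chunk(b"tEXt", b"Comment\x00hidden note")
    + make_chunk(b"IEND", b"")
)
chunks = list(png_chunks(sample))
tampered = list(png_chunks(sample.replace(b"hidden", b"HIDDEN")))
```

A failed CRC on an ancillary chunk, as in the tampered example, is a hint that the payload was altered by a tool that did not recompute the checksum.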

Document Format Analysis: PDF (Portable Document Format)

  • Four Parts: Header, Body (Objects), Cross-Reference Table (xref), and Trailer.

  • Random Access Logic: To parse, find startxref at the end, jump to the xref table, then locate objects.

  • Streams and Filters: Streams carry heavy data (images, code). Filters like FlateDecode (Zlib) or DCTDecode (JPEG) decode the data.

  • Incremental Updates: New content is appended without removing old data. This allows examiners to recover previous versions of a document from a single file.

  • Exfiltration/Malware Vectors: JavaScript actions (/JS, /JavaScript), Launch actions (/Launch), and Open actions (/OpenAction).
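
Two of the PDF ideas above, locating the cross-reference table through startxref and separating incremental updates, can be sketched with plain byte searches. The functions find_startxref and revision_ends are hypothetical names, and the sample is a synthetic byte string with PDF-like framing rather than a renderable document. Counting %%EOF markers is a lead, not proof: some legitimately saved files (for example linearized PDFs) contain more than one.

```python
import re

def find_startxref(pdf_bytes, tail_size=1024):
    """Return the xref offset named by the last startxref, or None."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(matches[-1]) if matches else None

def revision_ends(pdf_bytes):
    """Return the end offset of every %%EOF marker in the file."""
    return [m.end() for m in re.finditer(rb"%%EOF", pdf_bytes)]

XREF = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"

# Original revision.
body = b"%PDF-1.4\n1 0 obj\n<< /Title (draft) >>\nendobj\n"
rev1 = body + XREF + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"

# Incremental update appended after the first %%EOF.
update = b"1 0 obj\n<< /Title (final) >>\nendobj\n"
xref2_offset = len(rev1) + len(update)
rev2 = rev1 + update + XREF + b"startxref\n" + str(xref2_offset).encode() + b"\n%%EOF\n"

ends = revision_ends(rev2)
original = rev2[: ends[0]]          # bytes of the first saved revision
latest_xref = find_startxref(rev2)  # offset of the newest xref section
```

Truncating the file at an earlier %%EOF, as done for original, is the basic move behind recovering a previous version from a single PDF.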

Document Format Analysis: Microsoft Office (OLE and OOXML)

  • OLE (Object Linking and Embedding): Legacy .doc, .xls, .ppt.
     - Acts as a "mini-filesystem" inside a file.
     - Contains Storages (directories) and Streams (files).
     - Property Streams: SummaryInformation and DocumentSummaryInformation contain author, revision count, and total editing time.

  • OOXML (Office Open XML): Modern .docx, .xlsx, .pptx.
     - A ZIP archive containing XML files.
     - Important Parts:
         - docProps/core.xml: Author, created/modified dates.
         - docProps/app.xml: Application name and version.
         - word/document.xml: The actual text body.
     - Track Changes: Recorded inline as w:ins or w:del elements with author/date attributes.
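
Because OOXML is a ZIP of XML parts, the core properties can be read with the standard library alone. The sketch below assumes the usual core-properties, Dublin Core, and DC terms namespaces; core_properties is a hypothetical helper, and the in-memory archive stands in for a real .docx so the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def core_properties(docx_file):
    """Read author and date fields from docProps/core.xml."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "creator": text("dc:creator"),
        "last_modified_by": text("cp:lastModifiedBy"),
        "created": text("dcterms:created"),
        "modified": text("dcterms:modified"),
    }

# Build a minimal stand-in archive in memory (not a complete Word document).
CORE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:creator>Alice</dc:creator>"
    "<cp:lastModifiedBy>Bob</cp:lastModifiedBy>"
    "<dcterms:created>2024-01-02T03:04:05Z</dcterms:created>"
    "</cp:coreProperties>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/core.xml", CORE_XML)
buf.seek(0)
props = core_properties(buf)
```

Missing fields come back as None rather than raising, which is useful when metadata has been partially stripped.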

Archive Format Analysis: ZIP

  • Structure: Local File Header (PK\x03\x04), Central Directory (PK\x01\x02), and EOCD (PK\x05\x06).

  • Extra Fields: Used for high-resolution timestamps.
     - 0x5455: Extended Timestamp (Unix epoch).
     - 0x000A: NTFS Timestamps (100-nanosecond resolution).

  • Encryption:
     - ZipCrypto: Legacy, weak, susceptible to known-plaintext attacks.
     - AES-256: Stronger, introduced in Spec 5.2.
     - Forensic Leak: In standard ZIPs, filenames and sizes remain in plaintext even if content is encrypted.

  • Covert Channels: Archive and entry comments can each store up to 65,535 bytes of hidden data.
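
The EOCD record and the Extended Timestamp extra field can both be decoded with struct. This is a sketch under simplifying assumptions: it ignores ZIP64, it trusts the last PK\x05\x06 occurrence (which could be fooled by a comment containing those bytes), and it reads only the modification time from the 0x5455 field. The names read_eocd and extended_timestamp are hypothetical.

```python
import io
import struct
import zipfile
from datetime import datetime, timezone

EOCD_SIGNATURE = b"PK\x05\x06"

def read_eocd(data):
    """Locate the EOCD record and return entry count, directory offset, comment."""
    idx = data.rfind(EOCD_SIGNATURE)
    if idx == -1 or idx + 22 > len(data):
        return None
    # signature, disk no., CD start disk, entries on disk, total entries,
    # CD size, CD offset, comment length (all little endian, 22 bytes)
    (_, _, _, _, total, cd_size, cd_offset, comment_len) = struct.unpack_from(
        "<4sHHHHIIH", data, idx
    )
    comment = data[idx + 22 : idx + 22 + comment_len]
    return {"entries": total, "cd_size": cd_size, "cd_offset": cd_offset, "comment": comment}

def extended_timestamp(extra):
    """Return the mtime from a 0x5455 extra field as a UTC datetime, or None."""
    offset = 0
    while offset + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, offset)
        payload = extra[offset + 4 : offset + 4 + size]
        # First payload byte is a flag set; bit 0 means mtime is present.
        if header_id == 0x5455 and len(payload) >= 5 and payload[0] & 1:
            (mtime,) = struct.unpack_from("<i", payload, 1)
            return datetime.fromtimestamp(mtime, tz=timezone.utc)
        offset += 4 + size
    return None

# Build a small archive in memory with an archive comment.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.comment = b"hidden note"
data = buf.getvalue()
eocd = read_eocd(data)
```

The archive comment returned in eocd["comment"] is exactly the region that can carry covert data without affecting extraction.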

Miscellaneous Archives and Embedded Analysis

  • TAR (Tape Archive): Simple linear structure. No central directory. Each file has a 512-byte POSIX header. Easy to carve from disk.

  • 7Z: High compression (LZMA2). Signature 37 7A BC AF 27 1C. Supports header encryption (hiding filenames).

  • Binwalk Operations:
     - --signature: Scans for embedded files.
     - --entropy: Plots Shannon entropy to find compressed/encrypted regions.
     - -e: Extracts identified files.
     - -M: Recursive extraction (e.g., pulling a filesystem out of a firmware image).
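
The entropy idea behind the --entropy option can be illustrated independently of Binwalk. The sketch below computes Shannon entropy in bits per byte, H = -Σ p(b) · log2 p(b), over fixed windows. It is not Binwalk's implementation; shannon_entropy, entropy_profile, and the 1,024-byte window are assumptions chosen for the example.

```python
import math
import os
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(data, window=1024):
    """Return (offset, entropy) for each fixed-size window of the input."""
    return [
        (i, shannon_entropy(data[i : i + window]))
        for i in range(0, len(data), window)
    ]

# Low-entropy padding followed by random bytes standing in for encrypted data.
blob = b"\x00" * 2048 + os.urandom(2048)
profile = entropy_profile(blob)
```

In a profile like this, windows near 0 bits per byte suggest padding or sparse data, while windows close to 8 suggest compressed or encrypted content worth carving.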

Questions & Discussion

  • Q: How can we identify a file if the extension is deleted?

  • A: The examiner looks at the byte content, specifically the magic number at the start of the file. Tools like file, TRID, or a hex viewer allow the identification of the format based on internal byte patterns (e.g., FF D8 FF for JPEG) regardless of the filename.

  • Q: Why does the DQT table matter for forensic analysis?

  • A: The quantization tables (DQT) act as a fingerprint for the software or camera that created the image. If the EXIF data says the image was taken by an iPhone, but the DQT matches Photoshop settings, that strongly suggests the image was re-saved or edited in Photoshop after capture.