Study Notes on NIST Special Publication 800-168: Approximate Matching

NIST Special Publication 800-168: Publication Details and Authority

Title: Approximate Matching: Definition and Terminology.
Authors: * Frank Breitinger: University of New Haven, West Haven, CT. * Barbara Guttman: Software and Systems Division, Information Technology Laboratory (ITL), NIST. * Michael McCarrin: Naval Postgraduate School, Monterey, CA. * Vassil Roussev: University of New Orleans, New Orleans, LA. * Douglas White: Software and Systems Division, ITL, NIST.
Publication Date: May 2014.
Sponsoring Agency: U.S. Department of Commerce (Penny Pritzker, Secretary) and the National Institute of Standards and Technology (Patrick D. Gallagher, Under Secretary of Commerce for Standards and Technology and Director).
Authority Statement: Developed under the Federal Information Security Management Act (FISMA), Public Law (P.L.) 107-347. It is consistent with Office of Management and Budget (OMB) Circular A-130 requirements, specifically Section 8b(3) regarding "Securing Agency Information Systems."
ITL Responsibilities: The Information Technology Laboratory promotes the U.S. economy and public welfare through technical leadership in measurement and standards. This involves developing tests, reference data, and technical analyses for Federal information systems (excluding national security systems unless specifically approved).

Introduction to Approximate Matching

Core Definition: Approximate matching is a generic term for any technique designed to identify similarities between two digital artifacts.
Primary Functions: * Identifying objects that resemble each other. * Identifying objects contained within another object.
Applications: Crucial for filtering data in security monitoring, digital forensics, and network analysis.
Comparison to Cryptographic Hashing: While cryptographic hashing provides a binary outcome ( $0$ or $1$ ) for an identity match, approximate matching provides a range of outcomes in the interval $[0, 1]$ , interpreted as a relative measure of similarity.
Digital Artifact/Object: Defined in this context as an arbitrary byte sequence (such as a file) with a meaningful interpretation.
Utility in Forensics: Used to filter data into or out of an investigation based on a known reference set (blacklisting). * Note: It is less effective for whitelisting because malicious variants (e.g., a backdoored SSH server) will match the original version closely.

Similarity Query Use Cases

There are two broad categories of similarity queries, as defined by Broder (1997):

Resemblance (R): * Definition: Comparing two similarly sized objects to measure commonality. * Example: Comparing two successive versions of source code. * Problem Type 1: Object Similarity Detection (R): Identifying related artifacts, such as different versions of a single document. * Problem Type 2: Cross Correlation (R): Identifying different artifacts that share a common object (e.g., a Microsoft Word file and a PDF both containing the same image).
Containment (C): * Definition: Comparing objects of significantly different sizes (e.g., a file vs. a whole-disk image) to see if the larger contains the smaller. * Problem Type 3: Embedded Object Detection (C): Finding a specific object inside another (e.g., an executable file inside a memory capture). * Problem Type 4: Fragment Detection (C): Identifying traces or fragments of a known artifact (e.g., identifying a file within individual network packets).

Core Terminology and Abstraction Levels

Features: The basic elements through which artifacts are compared. A comparison of two features always yields a binary ${0, 1}$ outcome; partial feature matches are not allowed. * Example: A feature could be a $(byte, offset)$ pair.
Feature Set: The collection of all features associated with a single artifact. Criteria for inclusion in a set (e.g., selecting every $16^{th}$ byte) must be defined by the algorithm.
Similarity Definition: In approximate matching, similarity is an increasing monotonic function of the number of matching features between two feature sets.

Levels of Abstraction

Bytewise Matching: Operates solely on byte sequences without assuming any data structure. It assumes that human-perceived similarity translates to byte-level similarity, which is not always valid. Terms used: "fuzzy hashing" or "similarity hashing."
Syntactic Matching: Utilizes internal structures (e.g., TCP packet headers) defined by standards. It is specific to specific classes of objects sharing an encoding but does not interpret the "meaning" of content.
Semantic Matching: Uses contextual attributes to interpret artifacts according to human perceptual categories (e.g., visual similarity of images). Terms used: "perceptual hashing" or "robust hashing." This is computationally expensive but provides the most specific results.

Bytewise Approximate Matching Algorithm Phases

Bytewise algorithms typically operate in two distinct phases:

Phase One: Generation: A similarity digest (also called a signature or fingerprint) is created. This is a compressed representation of the artifact's feature set. Usually, the original object cannot be recovered from the digest.
Phase Two: Comparison: Comparison of digests to produce a similarity score ( $s$ ) where $0 \leq s \leq 1$ .

Required Core Functions

Feature Extraction Function: Identifies and extracts features to create the similarity digest.
Similarity Function: Compares digests and outputs a score.
Normalization Strategy: * For Resemblance, matching features are weighed against total features in both objects. * For Containment, unmatched features in the larger object may be disregarded.
Threshold Value: An empirically determined value used to correlate scores with human-recognizable traits. Scores above the threshold are treated with higher confidence.

Essential Algorithmic Requirements

Similarity Preservation: Comparison results between two digests (A' and B') must be uniquely determined by the similarity of the original artifacts (A and B).
Self-evaluation: Algorithms should provide a measure of accuracy, such as a margin of error or confidence level, and clarify if a score of $1$ indicates a bit-for-bit exact match.
Compression: Digests should be compact for fast storage and comparison. Fixed-length digests are preferred.
Ease of Computation: Efficiency should be reported for both extraction and similarity functions. * Computation Complexity (Big O): * $O(1)$ : Fixed-length digests in hash tables/dictionaries. * $O(\log_2 n)$ : Fixed-length digests in binary trees or sorted lists. * $O(n)$ : Fixed-length digests in unsorted lists (no indexing).

Reliability and Security Factors

Sensitivity & Robustness: * Sensitivity: The ability to find correlations based on fine-grained commonality. * Robustness (Performance Envelope): Effective function despite noise or routine transformations (e.g., internal fragmentation).
Precision & Recall: Use of precision, recall, and F-score to determine reliability. Data should be clear on whether it was culled from existing collections or developed for testing.
Security of Results: Resistance to attacks where data is manipulated to appear either similar or dissimilar intentionally.

Testing Methodologies

Motivations for Testing

Understand which objects the algorithm identifies as similar (semantic interpretation).
Regression testing during development (tracking improvements in speed/score).
Algorithm comparison using standard tasks or confusion matrices to translate $[0, 1]$ scores into binary outcomes.

Types of Test Data

Controlled (Synthetic) Data: Randomly generated blocks where ground truth is precisely known. * Disadvantage: High-entropy, unique data may not represent real files like text documents.
Real Data: Data originally created for other purposes (e.g., the govdoc-corpus). * Disadvantage: Challenges include establishing ground truth and scaling experiments without humans.

Properties of Interest in Testing

Space Efficiency: The ratio between digest length and input length ( $\text{space efficiency} = \frac{\text{digest length}}{\text{input length}}$ ).
Generation and Comparison Efficiency: Measured in throughput, often compared against SHA-1 as a reference.
Scalability Analysis: Speedup relative to available CPU cores or GPUs (concurrency).
Sensitivity Tests: * Fragment Detection: Minimum length needed to correlate a sample to the whole object. * Single-common-block Correlation: Minimum block size shared by two otherwise dissimilar files to trigger a match.
Robustness Tests: * Alignment Robustness: Sensitivity to alignment shifts (e.g., prefixing bytes). * White Noise Resistance: Amount of random noise acceptable before correlation fails (e.g., ssdeep is sensitive to small changes distributed over input).

Reference Methodology (Appendix 1)

Controlled Data Testing

Reference File Sizes (KiB): $1$ , $4$ , $16$ , $64$ , $256$ , and $1024$ .
Govdoc-corpus distribution statistics: * $\leq 4\,KiB$ : $5.4\%$ cumulative probability. * $\leq 16\,KiB$ : $20.71\%$ . * $\leq 64\,KiB$ : $52.54\%$ . * $\leq 256\,KiB$ : $75.82\%$ . * $\leq 1024\,KiB$ : $90.60\%$ .
FRASH Mutation Methods ( $X$ refers to object size/fraction, $s_1$ is size of file $f_1$ ): * Fragment Detection: $X = {0.01s_1, 0.02s_1, 0.03s_1, 0.04s_1, 0.05s_1, 0.1s_1, 0.15s_1, 0.2s_1, 0.3s_1, 0.5s_1}$ . * Single-common-block: $X = {0.01s_1, 0.02s_1, 0.03s_1, 0.04s_1, 0.05s_1, 0.1s_1, 0.15s_1, 0.2s_1, 0.3s_1, 0.5s_1}$ . * Alignment Robustness: $X = {0.01s_1, 0.02s_1, 0.03s_1, 0.04s_1, 0.05s_1, 0.1s_1, 0.2s_1}$ . * Random-noise Resistance: $X = {0.005s_1, 0.01s_1, 0.015s_1, 0.02s_1, 0.025s_1}$ .

Ground Truth establishment with Longest Common Substring (LCS)

Absolute Outcome ( $L_a$ ): $L_a = \text{alcsa}(f_1, f_2)$ , where $0 \leq L_a \leq \min(|f_1|, |f_2|)$ .
Relative Outcome ( $L_r$ ): $L_r = \frac{100 \times L_a}{\min(|f_1|, |f_2|)}$ , where $0 \leq L_r \leq 1$ .
True Positive Threshold: $T P_{lcs}(f_1, f_2) \equiv L_a \geq 100 \wedge L_r \geq 0.01$ . (Note: Standard rounding applies, so $0.5$ relative result rounds up to $1$ ).
Computational Burden: LCS has a complexity of $O(mn)$ , making it difficult for large files.

Approximate Longest Common Substring (aLCS)

Mechanism: Operates block-by-block (~80 bytes) using the ssdeep rolling hash.
Hashing: Each block is hashed using 64-bit FNV-1a.
Storage: The "alcs-digest" stores hash value, entropy, and length.
Implementation: Developed in C using a command-line tool. Configurable settings in header/config.h include: * MIN_LCS: Minimum length for output (default is $0$ ). * THREAD_POOL_QUEUE_SIZE: Set to $\frac{\text{fileamount} \times (\text{fileamount} - 1)}{2}$ . * NUMTHREADS: Recommended to match the number of CPU cores.
Verification: Verified against a parallelized LCS tool using $201$ randomly selected files (yielding $20,100$ comparisons).
Accuracy: The relative score difference ( $d_r$ ) distribution showed that over $95\%$ of differences between LCS and aLCS do not exceed $3\%$ of the smaller file's size.
Empirical Distribution Logic for $d_r$ : $d_r = \left\lfloor \frac{lcs(f_1, f_2) - alcs(f_1, f_2)}{\min(|f_1|, |f_2|)} \right\rfloor$ .

Table: Empirical Probability for $d_r$

X	Pr{dr = X}	Pr{dr \leq X}
0	0.8869	0.8869
1	0.0449	0.9318
2	0.0155	0.9473
3	0.0040	0.9513
4	0.0047	0.9561
5	0.0116	0.9677
10	0.0062	0.9834
15	0.0001	0.9992
20	0.0000	0.9999

Study Notes on NIST Special Publication 800-168: Approximate Matching

NIST Special Publication 800-168: Publication Details and Authority

Introduction to Approximate Matching

Similarity Query Use Cases

Core Terminology and Abstraction Levels

Levels of Abstraction

Bytewise Approximate Matching Algorithm Phases

Required Core Functions

Essential Algorithmic Requirements

Reliability and Security Factors

Testing Methodologies

Motivations for Testing

Types of Test Data

Properties of Interest in Testing

Reference Methodology (Appendix 1)

Controlled Data Testing

Ground Truth establishment with Longest Common Substring (LCS)

Approximate Longest Common Substring (aLCS)

Table: Empirical Probability for drd_rdr​

Table: Empirical Probability for $d_r$