LSH and Nearest Neighbor Search

Nearest Neighbor Search (NNS) and Similarity Search Overview

  • Goal: Given a query item, find the top-k most similar items from a database using a chosen similarity metric.
  • Baseline approach: linear (exhaustive) search
    • For a query vector x, compute similarity with every sample in the database.
    • Return the top-k most similar samples.
    • Pros: simple and exact; Cons: computationally expensive for large databases (e.g., billions of images).
  • Approximate Nearest Neighbor (ANN) methods to scale up:
    • Local Sensitive Hashing (LSH)
    • Product Quantization (PQ)
    • Inverted File Index (IVF)
    • Others (e.g., as listed in ANN benchmarks)
  • Typical pipeline in practice:
    • Extract a feature vector for images (often via deep neural networks).
    • Compute similarity between feature vectors (query vs. database) using a chosen metric.
    • Retrieve top-k results (exact or approximate depending on method).

Similarity Search Applications

  • Visually similar results: e.g., clothing/shoes search with query image.
  • Large-scale image databases (billions of images) require scalable indexing and retrieval.
  • Practical relevance: e-commerce, fashion search, content-based image retrieval.

Feature Vectors and Similarity Computation

  • Feature extraction: Deep neural networks produce high-dimensional feature vectors for images.
  • Similarity between two images is computed by comparing their feature vectors (e.g., query vector vs. database image vector).
  • Key takeaway: Quality of retrieval depends on the quality of the feature representations and the chosen similarity metric.

Distance Metrics (Fundamental Building Blocks)

  • Euclidean distance (L2 distance):
    • d{L2}(x,y) = iggl( iggl\'\u2211{i=1}^n (xi - yi)^2 \biggr)^{1/2}
    • Note: Some slides use the squared form; see below.
  • Squared Euclidean distance (often used for efficiency):
    • d<em>L22(x,y)=</em>i=1n(x<em>iy</em>i)2d<em>{L2}^2(x,y) = \sum</em>{i=1}^n (x<em>i - y</em>i)^2
  • Inner product (dot product):
    • ab=<em>i=1na</em>ibia \cdot b = \sum<em>{i=1}^n a</em>i b_i
    • Geometric relation: ab=abcosθa \cdot b = |a| |b| \cos \theta where \theta is the angle between a and b.
  • Cosine similarity (for vectors):
    • cosθ=ABAB\cos \theta = \frac{A \cdot B}{|A| |B|}
    • Cosine distance: CosineDist(A,B)=1ABAB\text{CosineDist}(A,B) = 1 - \frac{A \cdot B}{|A| |B|}
  • Hamming distance (for binary data):
    • Definition: the number of positions with different values.
    • For binary vectors x, y: D<em>H(x,y)=</em>i=1n1[x<em>iy</em>i]D<em>H(x,y) = \sum</em>{i=1}^n \mathbb{1}[x<em>i \neq y</em>i]
    • Efficient computation via bitwise XOR and popcount.
  • Other common metrics (mentioned): Manhattan distance (L1): d<em>L1(x,y)=</em>i=1nx<em>iy</em>id<em>{L1}(x,y) = \sum</em>{i=1}^n |x<em>i - y</em>i|, inner product, etc.

Linear (Exhaustive) Search: Example

  • Given query vector and database samples (in 2D):
    • Query: A = [0, -1]
    • Database: B = [-2,0], C = [1,2], D = [2,1], E = [1,-1], F = [-1,2]
  • Distances via Euclidean (L2):
    • d(A,B) = \sqrt{(-2-0)^2 + (0+1)^2} = \sqrt{5}
    • d(A,C) = \sqrt{(1-0)^2 + (2+1)^2} = \sqrt{10}
    • d(A,D) = \sqrt{(2-0)^2 + (1+1)^2} = \sqrt{8}
    • d(A,E) = \sqrt{(1-0)^2 + (-1+1)^2} = 1
    • d(A,F) = \sqrt{(-1-0)^2 + (2+1)^2} = \sqrt{10}
  • Top-2 nearest neighbors by L2: E and B
  • Note on distance variant: sometimes squared L2 is used in ranking (avoids a sqrt), which yields the same order.

Local Sensitive Hashing (LSH): Core Idea

  • LSH is a family of hashing methods designed so that similar items have a higher probability of colliding (same hash) than dissimilar items.
  • Random Projection (one of the LSH families) is often used to preserve cosine similarity.
  • Basic workflow:
    • Generate K hashing functions (K bits per hash code) using random hyperplanes (random vectors w).
    • Each hashing function maps an input vector x to a binary bit: h_w(x) = 1 if w^T x >= 0, else 0.
    • A collection of K hashing functions yields a K-bit hash code (hash vector) for x.
    • Compare hash codes via Hamming distance to find candidates, then refine with a more accurate distance (e.g., L2).
  • Key concepts:
    • Hash function: h_w(x) = [w^T x >= 0] (indicator that x lies on one side of a hyperplane through the origin).
    • Each bit corresponds to one hyperplane; nb bits = nb hashing functions.
    • Offline preprocessing: database items are transformed into binary hash vectors once.
    • Online processing: each query is transformed to a binary hash vector on the fly.
  • Geometry intuition: a random hyperplane splits space into two half-spaces; the sign of w^T x indicates on which side x lies. As nb of hyperplanes increases, the binary representation becomes a finer partition of space, preserving similarity structure.

LSH Random Projection: Example (2D coordinates, 4 hashing functions)

  • Setup: DB points B, C, D, E, F and query A; hashing functions w1..w4 (random vectors)
    • w1 = [-1, 1], w2 = [-1, 0], w3 = [0, 1], w4 = [1, -1]
  • Step: compute h_w(x) for each point and the query by the sign of w^T x
  • Computed binary hashes (example results):
    • Query A: [0, 0, 0, 1]
    • B: [1, 1, 0, 0]
    • C: [1, 0, 1, 0]
    • D: [0, 0, 1, 1]
    • E: [0, 0, 0, 1]
    • F: [1, 1, 1, 0]
  • How to use: compare hashes using Hamming distance between the query hash and database hashes.
  • Example results after hashing and comparison (Hamming distances with A):
    • DH(A,B) = 3; DH(A,C) = 3; DH(A,D) = 1; DH(A,E) = 0; DH(A,F) = 4
    • Top-2 via LSH + linear search: E and D (E has distance 0; D has distance 1)
  • Practical takeaway: binary hashing enables fast candidate generation and reduces expensive distance computations.

Linear Search on Binary Hashes (LSH + Binary Vector Space)

  • After hashing, we perform linear search in the binary space using Hamming distance:
    • For query A (binary hash) vs. database hashes, compute D_H(A, hash(B)) etc.
    • Example results (for A vs {B, C, D, E, F}):
    • DH(A,B) = 3, DH(A,C) = 3, DH(A,D) = 1, DH(A,E) = 0, D_H(A,F) = 4
    • Top-2 results: E and D
  • Key insight: Hamming distance is very efficient to compute using XOR operations on binary vectors.

Efficiency and Trade-offs of LSH + Binary Hashing

  • Benefits over L2-based linear search:
    • Fewer distance computations due to candidate pruning by hash buckets.
    • Hamming distance is fast (bitwise XOR and popcount).
    • Particularly advantageous in high-dimensional spaces.
  • Offline vs online work:
    • Offline: compute and store binary hash codes for all database items.
    • Online: transform the query into a binary hash code and search efficiently.
  • Storage efficiency:
    • A realistic binary vector example: 128-bit binary vector occupies 16 bytes.
    • A 128-dimensional float vector typically uses 512 bytes (128 × 4 bytes).
    • Binary hashing reduces storage and speeds up similarity computations.
  • Impact of more hashing bits (nbits):
    • More bits yield longer binary vectors and more buckets, increasing retrieval accuracy.
    • Relationship shown in practice: increasing nbbits improves top-k cosine/L2 retrieval similarity to the exact top-k.
  • Practical note: there is a trade-off between offline preprocessing time, online query time, and the accuracy of top-k results.
  • Example capacity: number of buckets grows as 2^{nbits} (e.g., 2 bits → 4 buckets, 4 bits → 16, 8 bits → 256, 16 bits → 65,536).
  • Illustration of accuracy vs. nbBits: increasing nbBits typically increases average cosine similarity of the top-100 retrieved results (Sift1M, 1M samples, 128 dimensions, Faiss baseline).

Hash Tables: One vs Multiple Tables (Two Main Approaches)

  • Method 1: One hash table
    • Build a single hash table by inserting each database item under its hash vector as a key.
    • For a query, compute its hash vector, locate the corresponding bucket, and examine items in that bucket.
    • If the bucket does not contain enough candidates, you may examine adjacent buckets or use a larger nbBits.
    • Workflow (exhaustive within candidate buckets):
    • Step 1: Construct the hash table offline.
    • Step 2: For a query x, compute its hash h(x) and collect all items in the bucket(s) with that hash.
    • Step 3: Rank candidates by a distance measure (often L2) and return top-k.
  • Example structure:
    • Hash table stores entries: Key = hash vector (binary), Value = list of sample indices.
    • Buckets group similar data because items with similar hashes often come from similar regions of the space.
  • Method 2: Multiple hash tables
    • Construct M independent hash tables, each with its own independent set of hash functions.
    • For a query, compute M hash vectors (one per table) and retrieve the corresponding bucket in each table.
    • Combine retrieved items across tables into a candidate set.
    • Run exact linear search (e.g., L2) on the candidate set to produce final top-k.
  • Example with 3 hash tables:
    • Each table has its own key-value buckets; combining results from all tables yields a candidate set such as {13, 2, 5, 40} for a given query.
    • Then perform linear search on the candidate set to get the final top-k.
  • Practical benefit: multiple tables reduce the chance of missing true near neighbors due to hash collisions and can improve recall.

Hash Table Concepts and Notation

  • Hash table: a key-value data structure that maps a binary hash vector to a bucket (list of sample indices).
  • Bucket: the set of samples sharing the same hash key (binary vector).
  • Key: a binary vector (hash vector) produced by applying the LSH hash functions to a data item.
  • Value: the list of indices of data samples that map to that hash key.
  • Rationale: similar data are likely to share the same or nearby hash keys, enabling efficient candidate retrieval.

Offline Preprocessing vs Online Querying in LSH

  • Offline (pre-processing): transform the entire database into binary hash vectors and build hash tables.
  • Online (query time): transform the query into a binary hash vector and retrieve candidate items from the hash tables.
  • Benefit: query-time work is reduced to a small set of candidate items, followed by a precise distance computation on those candidates.

Practical Notes on Hyperplanes and Bit-length (Geometry View)

  • Each hashing function corresponds to a hyperplane through the origin defined by a random normal vector w.
  • A single bit encodes which side of the hyperplane the point lies on:
    • If w^T x >= 0, bit = 1; otherwise bit = 0.
  • Increasing nbits (number of hyperplanes) creates more partitions of the space and yields longer binary codes, which improves discrimination at the cost of more offline storage and computation.
  • A single bit provides only coarse information about similarity; multiple bits capture a richer structure of the angle between vectors via the sign of projections onto multiple random directions.

Online Resources and Tools

  • FAISS: A library for efficient similarity search and clustering of dense vectors by Facebook AI Research.
    • URL: https://github.com/facebookresearch/faiss/wiki
  • Locality-Sensitive Hashing (Random Projection) resources:
    • Pinecone article and tutorials on LSH with random projection
    • URL reference: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/

Summary of Key Concepts to Remember

  • NNS vs ANN: NNS seeks exact top-k; ANN trades exactness for speed via indexing structures like LSH, PQ, IVF.
  • LSH with random projections preserves certain similarities (e.g., cosine similarity) via sign-based hashing using random hyperplanes.
  • Binary hash vectors enable fast similarity computation via Hamming distance, offering significant speedups in high-dimensional spaces.
  • Offline preprocessing (hashing the database) plus online querying (hashing the query) are essential workflow steps in LSH.
  • Using multiple hash tables generally improves recall by diversifying hash functions and bucket coverage.
  • More hash bits (nbits) increase potential buckets and accuracy but require more storage and preprocessing.
  • Practical deployment involves balancing accuracy, speed, and resources; FAISS provides a real-world toolchain for dense vector search.

Key Formulas to Remember

  • Euclidean distance (L2):
    • d<em>L2(x,y)=</em>i=1n(x<em>iy</em>i)2d<em>{L2}(x,y) = \sqrt{\sum</em>{i=1}^n (x<em>i - y</em>i)^2}
    • Squared version (often used for speed): d<em>L22(x,y)=</em>i=1n(x<em>iy</em>i)2d<em>{L2}^2(x,y) = \sum</em>{i=1}^n (x<em>i - y</em>i)^2
  • Inner product:
    • ab=<em>i=1na</em>ibia \cdot b = \sum<em>{i=1}^n a</em>i b_i
  • Cosine similarity and distance:
    • cosθ=abab\cos\theta = \frac{a \cdot b}{|a| |b|}
    • CosineDist(a,b)=1abab\text{CosineDist}(a,b) = 1 - \frac{a \cdot b}{|a| |b|}
  • Hamming distance for binary vectors:
    • D<em>H(x,y)=</em>i=1n1[x<em>iy</em>i]D<em>H(x,y) = \sum</em>{i=1}^n \mathbb{1}[x<em>i \neq y</em>i]
  • LSH hash function bit (one hashing function):
    • hw(x)={1,amp;if wTx0 0,amp;otherwiseh_w(x) = \begin{cases} 1, &amp; \text{if } w^T x \ge 0 \ 0, &amp; \text{otherwise} \end{cases}
  • Number of buckets given nb bits:
    • Maximum buckets = 2nbits2^{nbits}
  • Example counts for nb bits (maximum buckets):
    • 2 bits: 4 buckets
    • 4 bits: 16 buckets
    • 8 bits: 256 buckets
    • 16 bits: 65,536 buckets