LSH and Nearest Neighbor Search

Nearest Neighbor Search (NNS) and Similarity Search Overview

Goal: Given a query item, find the top-k most similar items from a database using a chosen similarity metric.
Baseline approach: linear (exhaustive) search
- For a query vector x, compute similarity with every sample in the database.
- Return the top-k most similar samples.
- Pros: simple and exact; Cons: computationally expensive for large databases (e.g., billions of images).
Approximate Nearest Neighbor (ANN) methods to scale up:
- Local Sensitive Hashing (LSH)
- Product Quantization (PQ)
- Inverted File Index (IVF)
- Others (e.g., as listed in ANN benchmarks)
Typical pipeline in practice:
- Extract a feature vector for images (often via deep neural networks).
- Compute similarity between feature vectors (query vs. database) using a chosen metric.
- Retrieve top-k results (exact or approximate depending on method).

Similarity Search Applications

Visually similar results: e.g., clothing/shoes search with query image.
Large-scale image databases (billions of images) require scalable indexing and retrieval.
Practical relevance: e-commerce, fashion search, content-based image retrieval.

Feature Vectors and Similarity Computation

Feature extraction: Deep neural networks produce high-dimensional feature vectors for images.
Similarity between two images is computed by comparing their feature vectors (e.g., query vector vs. database image vector).
Key takeaway: Quality of retrieval depends on the quality of the feature representations and the chosen similarity metric.

Distance Metrics (Fundamental Building Blocks)

Euclidean distance (L2 distance):
- d{L2}(x,y) = iggl( iggl\'\u2211{i=1}^n (xi - yi)^2 \biggr)^{1/2}
- Note: Some slides use the squared form; see below.
Squared Euclidean distance (often used for efficiency):
- $d{L2}^2(x,y) = \sum{i=1}^n (xi - yi)^2$
Inner product (dot product):
- $a \cdot b = \sum{i=1}^n ai b_i$
- Geometric relation: $a \cdot b = |a| |b| \cos \theta$ where \theta is the angle between a and b.
Cosine similarity (for vectors):
- $\cos \theta = \frac{A \cdot B}{|A| |B|}$
- Cosine distance: $\text{CosineDist}(A,B) = 1 - \frac{A \cdot B}{|A| |B|}$
Hamming distance (for binary data):
- Definition: the number of positions with different values.
- For binary vectors x, y: $DH(x,y) = \sum{i=1}^n \mathbb{1}[xi \neq yi]$
- Efficient computation via bitwise XOR and popcount.
Other common metrics (mentioned): Manhattan distance (L1): $d{L1}(x,y) = \sum{i=1}^n |xi - yi|$ , inner product, etc.

Linear (Exhaustive) Search: Example

Given query vector and database samples (in 2D):
- Query: A = [0, -1]
- Database: B = [-2,0], C = [1,2], D = [2,1], E = [1,-1], F = [-1,2]
Distances via Euclidean (L2):
- d(A,B) = \sqrt{(-2-0)^2 + (0+1)^2} = \sqrt{5}
- d(A,C) = \sqrt{(1-0)^2 + (2+1)^2} = \sqrt{10}
- d(A,D) = \sqrt{(2-0)^2 + (1+1)^2} = \sqrt{8}
- d(A,E) = \sqrt{(1-0)^2 + (-1+1)^2} = 1
- d(A,F) = \sqrt{(-1-0)^2 + (2+1)^2} = \sqrt{10}
Top-2 nearest neighbors by L2: E and B
Note on distance variant: sometimes squared L2 is used in ranking (avoids a sqrt), which yields the same order.

Local Sensitive Hashing (LSH): Core Idea

LSH is a family of hashing methods designed so that similar items have a higher probability of colliding (same hash) than dissimilar items.
Random Projection (one of the LSH families) is often used to preserve cosine similarity.
Basic workflow:
- Generate K hashing functions (K bits per hash code) using random hyperplanes (random vectors w).
- Each hashing function maps an input vector x to a binary bit: h_w(x) = 1 if w^T x >= 0, else 0.
- A collection of K hashing functions yields a K-bit hash code (hash vector) for x.
- Compare hash codes via Hamming distance to find candidates, then refine with a more accurate distance (e.g., L2).
Key concepts:
- Hash function: h_w(x) = [w^T x >= 0] (indicator that x lies on one side of a hyperplane through the origin).
- Each bit corresponds to one hyperplane; nb bits = nb hashing functions.
- Offline preprocessing: database items are transformed into binary hash vectors once.
- Online processing: each query is transformed to a binary hash vector on the fly.
Geometry intuition: a random hyperplane splits space into two half-spaces; the sign of w^T x indicates on which side x lies. As nb of hyperplanes increases, the binary representation becomes a finer partition of space, preserving similarity structure.

LSH Random Projection: Example (2D coordinates, 4 hashing functions)

Setup: DB points B, C, D, E, F and query A; hashing functions w1..w4 (random vectors)
- w1 = [-1, 1], w2 = [-1, 0], w3 = [0, 1], w4 = [1, -1]
Step: compute h_w(x) for each point and the query by the sign of w^T x
Computed binary hashes (example results):
- Query A: [0, 0, 0, 1]
- B: [1, 1, 0, 0]
- C: [1, 0, 1, 0]
- D: [0, 0, 1, 1]
- E: [0, 0, 0, 1]
- F: [1, 1, 1, 0]
How to use: compare hashes using Hamming distance between the query hash and database hashes.
Example results after hashing and comparison (Hamming distances with A):
- DH(A,B) = 3; DH(A,C) = 3; DH(A,D) = 1; DH(A,E) = 0; DH(A,F) = 4
- Top-2 via LSH + linear search: E and D (E has distance 0; D has distance 1)
Practical takeaway: binary hashing enables fast candidate generation and reduces expensive distance computations.

Linear Search on Binary Hashes (LSH + Binary Vector Space)

After hashing, we perform linear search in the binary space using Hamming distance:
- For query A (binary hash) vs. database hashes, compute D_H(A, hash(B)) etc.
- Example results (for A vs {B, C, D, E, F}):
- DH(A,B) = 3, DH(A,C) = 3, DH(A,D) = 1, DH(A,E) = 0, D_H(A,F) = 4
- Top-2 results: E and D
Key insight: Hamming distance is very efficient to compute using XOR operations on binary vectors.

Efficiency and Trade-offs of LSH + Binary Hashing

Benefits over L2-based linear search:
- Fewer distance computations due to candidate pruning by hash buckets.
- Hamming distance is fast (bitwise XOR and popcount).
- Particularly advantageous in high-dimensional spaces.
Offline vs online work:
- Offline: compute and store binary hash codes for all database items.
- Online: transform the query into a binary hash code and search efficiently.
Storage efficiency:
- A realistic binary vector example: 128-bit binary vector occupies 16 bytes.
- A 128-dimensional float vector typically uses 512 bytes (128 × 4 bytes).
- Binary hashing reduces storage and speeds up similarity computations.
Impact of more hashing bits (nbits):
- More bits yield longer binary vectors and more buckets, increasing retrieval accuracy.
- Relationship shown in practice: increasing nbbits improves top-k cosine/L2 retrieval similarity to the exact top-k.
Practical note: there is a trade-off between offline preprocessing time, online query time, and the accuracy of top-k results.
Example capacity: number of buckets grows as 2^{nbits} (e.g., 2 bits → 4 buckets, 4 bits → 16, 8 bits → 256, 16 bits → 65,536).
Illustration of accuracy vs. nbBits: increasing nbBits typically increases average cosine similarity of the top-100 retrieved results (Sift1M, 1M samples, 128 dimensions, Faiss baseline).

Hash Tables: One vs Multiple Tables (Two Main Approaches)

Method 1: One hash table
- Build a single hash table by inserting each database item under its hash vector as a key.
- For a query, compute its hash vector, locate the corresponding bucket, and examine items in that bucket.
- If the bucket does not contain enough candidates, you may examine adjacent buckets or use a larger nbBits.
- Workflow (exhaustive within candidate buckets):
- Step 1: Construct the hash table offline.
- Step 2: For a query x, compute its hash h(x) and collect all items in the bucket(s) with that hash.
- Step 3: Rank candidates by a distance measure (often L2) and return top-k.
Example structure:
- Hash table stores entries: Key = hash vector (binary), Value = list of sample indices.
- Buckets group similar data because items with similar hashes often come from similar regions of the space.
Method 2: Multiple hash tables
- Construct M independent hash tables, each with its own independent set of hash functions.
- For a query, compute M hash vectors (one per table) and retrieve the corresponding bucket in each table.
- Combine retrieved items across tables into a candidate set.
- Run exact linear search (e.g., L2) on the candidate set to produce final top-k.
Example with 3 hash tables:
- Each table has its own key-value buckets; combining results from all tables yields a candidate set such as {13, 2, 5, 40} for a given query.
- Then perform linear search on the candidate set to get the final top-k.
Practical benefit: multiple tables reduce the chance of missing true near neighbors due to hash collisions and can improve recall.

Hash Table Concepts and Notation

Hash table: a key-value data structure that maps a binary hash vector to a bucket (list of sample indices).
Bucket: the set of samples sharing the same hash key (binary vector).
Key: a binary vector (hash vector) produced by applying the LSH hash functions to a data item.
Value: the list of indices of data samples that map to that hash key.
Rationale: similar data are likely to share the same or nearby hash keys, enabling efficient candidate retrieval.

Offline Preprocessing vs Online Querying in LSH

Offline (pre-processing): transform the entire database into binary hash vectors and build hash tables.
Online (query time): transform the query into a binary hash vector and retrieve candidate items from the hash tables.
Benefit: query-time work is reduced to a small set of candidate items, followed by a precise distance computation on those candidates.

Practical Notes on Hyperplanes and Bit-length (Geometry View)

Each hashing function corresponds to a hyperplane through the origin defined by a random normal vector w.
A single bit encodes which side of the hyperplane the point lies on:
- If w^T x >= 0, bit = 1; otherwise bit = 0.
Increasing nbits (number of hyperplanes) creates more partitions of the space and yields longer binary codes, which improves discrimination at the cost of more offline storage and computation.
A single bit provides only coarse information about similarity; multiple bits capture a richer structure of the angle between vectors via the sign of projections onto multiple random directions.

Online Resources and Tools

FAISS: A library for efficient similarity search and clustering of dense vectors by Facebook AI Research.
- URL: https://github.com/facebookresearch/faiss/wiki
Locality-Sensitive Hashing (Random Projection) resources:
- Pinecone article and tutorials on LSH with random projection
- URL reference: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/

Summary of Key Concepts to Remember

NNS vs ANN: NNS seeks exact top-k; ANN trades exactness for speed via indexing structures like LSH, PQ, IVF.
LSH with random projections preserves certain similarities (e.g., cosine similarity) via sign-based hashing using random hyperplanes.
Binary hash vectors enable fast similarity computation via Hamming distance, offering significant speedups in high-dimensional spaces.
Offline preprocessing (hashing the database) plus online querying (hashing the query) are essential workflow steps in LSH.
Using multiple hash tables generally improves recall by diversifying hash functions and bucket coverage.
More hash bits (nbits) increase potential buckets and accuracy but require more storage and preprocessing.
Practical deployment involves balancing accuracy, speed, and resources; FAISS provides a real-world toolchain for dense vector search.

Key Formulas to Remember

Euclidean distance (L2):
- $d{L2}(x,y) = \sqrt{\sum{i=1}^n (xi - yi)^2}$
- Squared version (often used for speed): $d{L2}^2(x,y) = \sum{i=1}^n (xi - yi)^2$
Inner product:
- $a \cdot b = \sum{i=1}^n ai b_i$
Cosine similarity and distance:
- $\cos\theta = \frac{a \cdot b}{|a| |b|}$
- $\text{CosineDist}(a,b) = 1 - \frac{a \cdot b}{|a| |b|}$
Hamming distance for binary vectors:
- $DH(x,y) = \sum{i=1}^n \mathbb{1}[xi \neq yi]$
LSH hash function bit (one hashing function):
- $h_w(x) = \begin{cases} 1, & \text{if } w^T x \ge 0 \ 0, & \text{otherwise} \end{cases}$
Number of buckets given nb bits:
- Maximum buckets = $2^{nbits}$
Example counts for nb bits (maximum buckets):
- 2 bits: 4 buckets
- 4 bits: 16 buckets
- 8 bits: 256 buckets
- 16 bits: 65,536 buckets