Note: Some slides use the squared form; see below.
Squared Euclidean distance (often used for efficiency):
d<em>L22(x,y)=∑</em>i=1n(x<em>i−y</em>i)2
Inner product (dot product):
a⋅b=∑<em>i=1na</em>ibi
Geometric relation: a⋅b=∣a∣∣b∣cosθ where \theta is the angle between a and b.
Cosine similarity (for vectors):
cosθ=∣A∣∣B∣A⋅B
Cosine distance: CosineDist(A,B)=1−∣A∣∣B∣A⋅B
Hamming distance (for binary data):
Definition: the number of positions with different values.
For binary vectors x, y: D<em>H(x,y)=∑</em>i=1n1[x<em>i=y</em>i]
Efficient computation via bitwise XOR and popcount.
Other common metrics (mentioned): Manhattan distance (L1): d<em>L1(x,y)=∑</em>i=1n∣x<em>i−y</em>i∣, inner product, etc.
Linear (Exhaustive) Search: Example
Given query vector and database samples (in 2D):
Query: A = [0, -1]
Database: B = [-2,0], C = [1,2], D = [2,1], E = [1,-1], F = [-1,2]
Distances via Euclidean (L2):
d(A,B) = \sqrt{(-2-0)^2 + (0+1)^2} = \sqrt{5}
d(A,C) = \sqrt{(1-0)^2 + (2+1)^2} = \sqrt{10}
d(A,D) = \sqrt{(2-0)^2 + (1+1)^2} = \sqrt{8}
d(A,E) = \sqrt{(1-0)^2 + (-1+1)^2} = 1
d(A,F) = \sqrt{(-1-0)^2 + (2+1)^2} = \sqrt{10}
Top-2 nearest neighbors by L2: E and B
Note on distance variant: sometimes squared L2 is used in ranking (avoids a sqrt), which yields the same order.
Local Sensitive Hashing (LSH): Core Idea
LSH is a family of hashing methods designed so that similar items have a higher probability of colliding (same hash) than dissimilar items.
Random Projection (one of the LSH families) is often used to preserve cosine similarity.
Basic workflow:
Generate K hashing functions (K bits per hash code) using random hyperplanes (random vectors w).
Each hashing function maps an input vector x to a binary bit: h_w(x) = 1 if w^T x >= 0, else 0.
A collection of K hashing functions yields a K-bit hash code (hash vector) for x.
Compare hash codes via Hamming distance to find candidates, then refine with a more accurate distance (e.g., L2).
Key concepts:
Hash function: h_w(x) = [w^T x >= 0] (indicator that x lies on one side of a hyperplane through the origin).
Each bit corresponds to one hyperplane; nb bits = nb hashing functions.
Offline preprocessing: database items are transformed into binary hash vectors once.
Online processing: each query is transformed to a binary hash vector on the fly.
Geometry intuition: a random hyperplane splits space into two half-spaces; the sign of w^T x indicates on which side x lies. As nb of hyperplanes increases, the binary representation becomes a finer partition of space, preserving similarity structure.
LSH Random Projection: Example (2D coordinates, 4 hashing functions)
Setup: DB points B, C, D, E, F and query A; hashing functions w1..w4 (random vectors)
A 128-dimensional float vector typically uses 512 bytes (128 × 4 bytes).
Binary hashing reduces storage and speeds up similarity computations.
Impact of more hashing bits (nbits):
More bits yield longer binary vectors and more buckets, increasing retrieval accuracy.
Relationship shown in practice: increasing nbbits improves top-k cosine/L2 retrieval similarity to the exact top-k.
Practical note: there is a trade-off between offline preprocessing time, online query time, and the accuracy of top-k results.
Example capacity: number of buckets grows as 2^{nbits} (e.g., 2 bits → 4 buckets, 4 bits → 16, 8 bits → 256, 16 bits → 65,536).
Illustration of accuracy vs. nbBits: increasing nbBits typically increases average cosine similarity of the top-100 retrieved results (Sift1M, 1M samples, 128 dimensions, Faiss baseline).
Hash Tables: One vs Multiple Tables (Two Main Approaches)
Method 1: One hash table
Build a single hash table by inserting each database item under its hash vector as a key.
For a query, compute its hash vector, locate the corresponding bucket, and examine items in that bucket.
If the bucket does not contain enough candidates, you may examine adjacent buckets or use a larger nbBits.
Workflow (exhaustive within candidate buckets):
Step 1: Construct the hash table offline.
Step 2: For a query x, compute its hash h(x) and collect all items in the bucket(s) with that hash.
Step 3: Rank candidates by a distance measure (often L2) and return top-k.
Example structure:
Hash table stores entries: Key = hash vector (binary), Value = list of sample indices.
Buckets group similar data because items with similar hashes often come from similar regions of the space.
Method 2: Multiple hash tables
Construct M independent hash tables, each with its own independent set of hash functions.
For a query, compute M hash vectors (one per table) and retrieve the corresponding bucket in each table.
Combine retrieved items across tables into a candidate set.
Run exact linear search (e.g., L2) on the candidate set to produce final top-k.
Example with 3 hash tables:
Each table has its own key-value buckets; combining results from all tables yields a candidate set such as {13, 2, 5, 40} for a given query.
Then perform linear search on the candidate set to get the final top-k.
Practical benefit: multiple tables reduce the chance of missing true near neighbors due to hash collisions and can improve recall.
Hash Table Concepts and Notation
Hash table: a key-value data structure that maps a binary hash vector to a bucket (list of sample indices).
Bucket: the set of samples sharing the same hash key (binary vector).
Key: a binary vector (hash vector) produced by applying the LSH hash functions to a data item.
Value: the list of indices of data samples that map to that hash key.
Rationale: similar data are likely to share the same or nearby hash keys, enabling efficient candidate retrieval.
Offline Preprocessing vs Online Querying in LSH
Offline (pre-processing): transform the entire database into binary hash vectors and build hash tables.
Online (query time): transform the query into a binary hash vector and retrieve candidate items from the hash tables.
Benefit: query-time work is reduced to a small set of candidate items, followed by a precise distance computation on those candidates.
Practical Notes on Hyperplanes and Bit-length (Geometry View)
Each hashing function corresponds to a hyperplane through the origin defined by a random normal vector w.
A single bit encodes which side of the hyperplane the point lies on:
If w^T x >= 0, bit = 1; otherwise bit = 0.
Increasing nbits (number of hyperplanes) creates more partitions of the space and yields longer binary codes, which improves discrimination at the cost of more offline storage and computation.
A single bit provides only coarse information about similarity; multiple bits capture a richer structure of the angle between vectors via the sign of projections onto multiple random directions.
Online Resources and Tools
FAISS: A library for efficient similarity search and clustering of dense vectors by Facebook AI Research.