RNALD B1 Course: FaceNet
Introduction
FaceNet: a unified system for face verification, recognition, and clustering by learning a direct embedding of face images into a compact Euclidean space.
Goal: map images to a space where distances correspond to face similarity; once embedded, standard methods (thresholding, k-NN, clustering) can be applied directly.
Key innovations:
End-to-end training of the embedding f(x) ∈ ℝ^d with a triplet-based loss, avoiding reliance on a separate bottleneck layer or PCA post-processing.
A novel online triplet mining strategy to select informative triplets during training.
A 128-dimensional embedding per face, enabling efficient large-scale recognition and clustering (compact representation).
Results (highlights):
LFW: 99.63% accuracy (with alignment) and 98.87% with a fixed center crop.
YouTube Faces DB: 95.12% accuracy.
Relative error reduction of ~30% over the prior best published results on both LFW and YouTube Faces DB.
Concepts introduced:
Harmonic embeddings and harmonic triplet loss, enabling compatibility between embeddings produced by different networks.
A lightweight, scalable approach that minimizes training alignment requirements (tight crops, minimal 2D/3D alignment).
Core Idea: Embedding and its Use
Objective: learn an embedding f: x → ℝ^d such that squared L2 distances reflect identity similarity:
Faces of the same person have small distances; faces of different people have large distances.
Embedding properties:
Embedding vector is constrained to lie on the unit hypersphere: ||f(x)||_2 = 1.
After embedding, face verification is thresholding the distance between embeddings; recognition is a k-NN classification; clustering can use standard clustering algorithms (k-means, agglomerative clustering).
Context: prior deep-learning approaches used a classification network plus a bottleneck representation; FaceNet bypasses the bottleneck and trains directly for the embedding with a triplet loss.
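Once faces are embedded, verification, recognition, and clustering reduce to simple operations on distances. A minimal numpy sketch of the three uses described above (function names and the 1.1 threshold from Figure 1 are illustrative, not from a released API):

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit hypersphere, as FaceNet constrains them.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def verify(emb_a, emb_b, threshold=1.1):
    # Verification: same identity iff squared L2 distance is below a threshold.
    return np.sum((emb_a - emb_b) ** 2) <= threshold

def knn_identify(query, gallery, labels, k=3):
    # Recognition: k-NN vote over squared distances to a labelled gallery.
    d = np.sum((gallery - query) ** 2, axis=1)
    nearest = np.argsort(d)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

Clustering follows the same pattern: feed the pairwise distances into any off-the-shelf algorithm such as k-means or agglomerative clustering.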
Triplet Loss (Section 3.1)
Embedding vector: f(x) ∈ ℝ^d, constrained so that ||f(x)||_2 = 1.
Notation:
Anchor: x_i^a; Positive: x_i^p (same identity as anchor); Negative: x_i^n (different identity).
Embedding constraint: ||f(x_i^a) - f(x_i^p)||_2^2 + α < ||f(x_i^a) - f(x_i^n)||_2^2, for all triplets (x_i^a, x_i^p, x_i^n) ∈ T, where α is a margin.
This ensures a margin α between positive and negative pairs in the embedding space.
Loss objective:
L = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+,
where (·)_+ denotes the hinge function max(0, ·).
Design choice: the embedding is learned directly, not via a separate classifier; the goal is a discriminative manifold where same-identity faces cluster together while different identities are separated.
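The triplet loss can be written in a few lines of numpy; a minimal sketch of the per-triplet hinge (not the actual training code, which runs inside the network's backprop):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared L2 distances anchor->positive and anchor->negative.
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    # Hinge on (d_ap - d_an + alpha): zero once the margin is satisfied.
    return np.maximum(d_ap - d_an + alpha, 0.0)
```

Note that a well-separated triplet (negative beyond the margin) contributes exactly zero loss, which is why triplet selection matters so much: random triplets are mostly inactive.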
Triplet Selection (Section 3.2)
Challenge: generating all possible triplets is infeasible; many triplets are easily satisfied and do not contribute to learning.
Online vs offline triplet generation:
Offline: periodically generate triplets on a subset of the data using the most recent network checkpoint.
Online: generate triplets within a mini-batch; this work focuses on online generation with large mini-batches (thousands of exemplars).
Mini-batch construction:
Around 40 faces per identity per mini-batch; randomly sampled negatives added to each mini-batch.
Use all anchor-positive pairs in a mini-batch, but select hard negatives within the batch.
Hardness and stability:
Do not always pick the hardest negatives to avoid bad local minima or collapse (f(x) → 0).
Semi-hard negatives: choose negatives that satisfy
||f(x_i^a) - f(x_i^p)||_2^2 < ||f(x_i^a) - f(x_i^n)||_2^2,
i.e., negatives farther from the anchor than the positive but still within the margin α, lying inside the margin region.
Practical batch considerations:
Batch size: approximately 1,800 exemplars in many experiments.
To ensure meaningful anchor-positive distances, ensure roughly 40 faces per identity per batch.
Curriculum learning and mining strategy:
Inspired by curriculum learning, triplets are chosen to gradually increase challenge as training progresses.
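Semi-hard mining within a batch can be sketched as follows; a simplified numpy illustration (the function name and the fallback behavior are assumptions, and the paper performs this selection per mini-batch inside training):

```python
import numpy as np

def semi_hard_negative(anchor, positive, batch, batch_labels, anchor_label, alpha=0.2):
    # Anchor-positive squared distance.
    d_ap = np.sum((anchor - positive) ** 2)
    # Squared distances from the anchor to every exemplar in the batch.
    d_an = np.sum((batch - anchor) ** 2, axis=1)
    # Candidates: different identity AND farther than the positive.
    mask = (batch_labels != anchor_label) & (d_an > d_ap)
    if not mask.any():
        return None  # in practice, fall back to the hardest admissible negative
    candidates = np.where(mask)[0]
    # Pick the closest such negative; when one exists inside the margin,
    # this is a semi-hard negative in the sense of the paper.
    return candidates[np.argmin(d_an[candidates])]
```

Always taking the globally hardest negative (smallest d_an overall) risks selecting mislabeled or degenerate examples, which is exactly the collapse the semi-hard constraint avoids.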
Deep Convolutional Networks Used (Section 3.3)
Training setup:
SGD with standard backprop and AdaGrad; learning rate starts at 0.05 and decays.
Random initialization; training on CPU cluster for 1,000–2,000 hours.
Margin α set to 0.2.
Architectures explored:
Zeiler & Fergus style networks (with 1×1×d convolutions between layers): deep CNN variant with 22 layers; substantial parameter count.
Inception-based models (Szegedy et al.): multi-branch modules that reduce parameters and FLOPS; suitable for mobile deployment.
Parameter and compute trade-offs:
Zeiler & Fergus variant (with 1×1 convolutions): ~140M parameters; ~1.6×10^9 FLOPS per image.
Inception-based models (NNS1, NNS2): up to ~20× fewer parameters; up to ~5× fewer FLOPS; some models suitable for mobile.
Notable networks:
NN1: Zeiler&Fergus based, input 220×220, about 140M parameters, ~1.6B FLOPS.
NN2: Inception based, input 224×224, ~7.5M parameters, ~1.6B FLOPS.
NN3: Inception based, input 160×160.
NN4: Inception based, input 96×96, greatly reduced CPU requirements.
Table 1 (NN1) and Table 2 (NN2) summarize architectures and complexity; Figure 4 (FLOPS vs accuracy) compares models.
Observations:
Inception models can achieve comparable or better accuracy with far fewer parameters and FLOPS.
Some smaller models (e.g., NN3, NN4) provide favorable accuracy-cost trade-offs, enabling mobile/edge deployment.
Datasets & Evaluation (Section 4)
Evaluation criteria:
Face verification: given a pair, decide if same or different identity using a threshold on the L2 distance between embeddings.
Metrics used: VAL(d) (true accepts at threshold d) and FAR(d) (false accepts at threshold d).
Definitions:
TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }
FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }
VAL(d) = |TA(d)| / |Psame|, FAR(d) = |FA(d)| / |Pdiff|.
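The VAL/FAR definitions above translate directly into code. A minimal numpy sketch, assuming the pairwise squared distances for same-identity and different-identity pairs have already been computed:

```python
import numpy as np

def val_far(dist_same, dist_diff, d):
    # VAL(d): fraction of same-identity pairs accepted at threshold d.
    val = np.mean(dist_same <= d)
    # FAR(d): fraction of different-identity pairs falsely accepted at d.
    far = np.mean(dist_diff <= d)
    return val, far
```

Sweeping d and plotting VAL against FAR yields the ROC-style curves used throughout the paper (e.g., Figure 5).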
Datasets:
Hold-out test set: ~1,000,000 images, disjoint identities; split into five disjoint sets of 200k images each; FAR and VAL computed on 100k × 100k image pairs per split.
Personal photos: ~12k images; evaluation via FAR and VAL across all ~12k² image pairs.
Academic datasets:
Labeled Faces in the Wild (LFW): standard unrestricted protocol; mean accuracy and standard error reported.
YouTube Faces DB: video-based face verification on 5,000 video pairs; the protocol is similar in setup to LFW.
Hold-out subsets and evaluation protocol ensure distribution consistency with training data, but identities are disjoint.
Experiments and Key Results (Section 5)
5.1 Computation-Accuracy Trade-off
Figure 4 shows FLOPS vs accuracy (VAL at a fixed FAR on a user-labeled test set).
Observations:
Strong correlation between computational cost and accuracy across models.
Inception-based NN2 achieves competitive accuracy with far fewer parameters than NN1.
5.2 Effect of CNN Model
Comparison across six models (NN1, NN2, NN3, NN4, NNS1, NNS2):
NN2 (Inception, 224×224) achieves highest VAL at 10^-3 FAR among larger models.
NN1 (Zeiler & Fergus) remains competitive; NNS1/NNS2 are smaller but still useful for mobile/clustering; NN3 provides a good balance.
Table 3 summarizes hold-out results (VAL at FAR = 10^-3) for each model:
NN1: about 87.9%
NN2: about 89.4%
NN3: about 88.3%
NN4: about 82.0%
NNS1: about 82.4%
NNS2: about 51.9%
5.3 Sensitivity to Image Quality
Table 4 explores robustness to JPEG quality and image size.
Findings: robust to JPEG compression down to quality 20; performance holds well down to 120×120 pixels and remains acceptable at 80×80, even though the network was trained on 220×220 inputs; this suggests potential gains from explicitly training with lower-resolution faces.
5.4 Embedding Dimensionality
Common embedding dimension: 128-D, used for most experiments.
Dimensions of 64, 256, and 512 were also tested; 128 performs well, larger embeddings do not guarantee higher accuracy, and the 128-D embedding can be quantized to 128 bytes with no loss in accuracy.
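One byte per dimension is enough to store a 128-D embedding in 128 bytes. The paper does not specify the quantizer, so the linear scheme below is purely a hypothetical sketch of how such a quantization could work:

```python
import numpy as np

def quantize(embedding):
    # Map each float in [-1, 1] to one byte; a 128-D embedding -> 128 bytes.
    # Hypothetical linear quantizer; the paper does not describe its scheme.
    q = np.clip(np.round((embedding + 1.0) * 127.5), 0, 255)
    return q.astype(np.uint8)

def dequantize(q):
    # Inverse mapping back to floats in [-1, 1].
    return q.astype(np.float32) / 127.5 - 1.0
```

Since unit-normalized embedding coordinates lie in [-1, 1], the per-coordinate round-trip error of such a scheme stays below ~0.01, far smaller than the verification threshold.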
5.5 Amount of Training Data
Training set sizes compared: 2.6M, 26M, 52M, and 260M face images, with the model architecture held roughly fixed across comparisons.
Results (VAL on personal photo set):
2.6M: ~76.3%
26M: ~85.1%
52M: ~85.1%
260M: ~86.2%
Conclusion: more data yields improved accuracy; diminishing returns at very large scales for the studied setups; tens of millions of exemplars boost performance substantially.
5.6 Performance on LFW
Two evaluation modes:
1) Fixed center crop of LFW thumbnails.
2) Extra alignment using an additional face detector before cropping.
Results on LFW (mean accuracy): 98.87% ± 0.15% with the fixed center crop; 99.63% ± 0.09% with extra alignment.
This surpasses DeepFace and the previous state of the art, cutting the error rate of the prior best result by roughly 30%.
5.7 Performance on YouTube Faces DB
Averaging the similarity over the first 100 frames of each video yields 95.12% ± 0.39%; using the first 1,000 frames gives 95.18%.
Outperforms earlier methods (e.g., DeepID2+ at 93.2%) with a ~30% relative error reduction.
5.8 Face Clustering
Embeddings enable effective clustering of a user’s personal photo collection into groups corresponding to individuals.
Demonstrated with agglomerative clustering: a representative exemplar cluster shows invariance to occlusion, lighting, pose, and even age.
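Because same-identity distances are small and different-identity distances are large, even a naive single-link agglomerative scheme groups a photo collection by person. A self-contained sketch (the threshold and union-find approach are illustrative assumptions, not the paper's exact clustering setup):

```python
import numpy as np

def cluster_by_threshold(embeddings, threshold=1.1):
    # Single-link clustering sketch: connect any two faces whose squared
    # distance is under the threshold, then return connected components.
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.sum((embeddings[i] - embeddings[j]) ** 2) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

In practice a standard agglomerative implementation over precomputed distances serves the same purpose; the point is that no identity labels are needed at clustering time.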
Summary of FaceNet Approach (Section 6)
Direct end-to-end learning of a compact embedding that supports verification, recognition, and clustering.
Advantages over bottleneck/classification-based approaches:
End-to-end optimization of the embedding for the target tasks.
Minimal alignment requirements (tight face crops; scale/translation only).
Bidirectional compatibility: embeddings can be compared across different networks via harmonic embeddings.
Final remarks: future work on error analysis, model size reduction, faster training (curriculum learning with different batch sizes), and improved positive/negative mining strategies.
Harmonic Embedding (Appendix, Section 7)
Concept: Harmonic embeddings are a set of embeddings generated by different models v1, v2 that remain mutually compatible for comparison.
Purpose: enable smooth upgrade paths when deploying newer embedding models without breaking compatibility with existing embeddings.
Visualization (Figure 8): demonstrates compatibility between NN2 embeddings and NN1 embeddings; mixed-mode performance (NN1 with NN2 embeddings) can outperform NN1 alone.
7.1 Harmonic Triplet Loss
To learn a harmonic embedding, triplets mix embeddings from v1 and v2 during training.
Training mix: use semi-hard negatives drawn from the combined set of v1 and v2 embeddings.
Process: initialize v2 embedding from independently trained NN2 and retrain the embedding layer; then retrain the whole v2 network with the harmonic loss to encourage compatibility.
Intuition: most v2 embeddings cluster near the corresponding v1 embeddings, while slight perturbations can improve verification accuracy for mislocated v1 embeddings.
7.2 Summary and Future Work
The harmonic embedding concept appears effective and robust; it could be extended further, for example to mobile-friendly, mutually compatible networks.
7.3 Additional Notes
The approach emphasizes compatibility and upgradeability in production systems where embeddings are deployed across devices and servers.
Additional Figures and Tables Referenced
Figure 1: Illumination and pose invariance; embedding distances for same vs different identities under pose/illumination changes; threshold around 1.1 classifies correctly.
Figure 2: Model structure: batch input → deep CNN → L2 normalization → embedding → triplet loss.
Figure 3: Triplet loss illustration: anchor and positive vs. anchor and negative with margin.
Figure 4: FLOPS vs. accuracy trade-off for model families (NN1, NN2, NN3, NNS1, NNS2).
Figure 5: ROC curves for different architectures on the personal photos hold-out set; order of performance NN2 > NN1 > NNS1 > NNS2.
Figure 6: LFW errors (false accepts/rejects) illustrating failure modes.
Figure 7: Example face clustering exemplar.
Figure 8–10: Harmonic embedding compatibility visualizations (ROC space, embedding space).
Table 1: NN1 Zeiler&Fergus-based model with 1×1 convolutions; parameters and FLOPS.
Table 2: NN2 Inception-based model details and complexity.
Table 3: Hold-out validation rates (VAL) at FAR = 10^-3 for different models.
Table 4: Image quality effects (JPEG quality) and image size effects on VAL at 10^-3.
Table 5: Embedding dimensionality effects on VAL.
Table 6: Training data size effects on VAL.
Key Formulas (recap)
Triplet constraint:
||f(x^a) - f(x^p)||_2^2 + α < ||f(x^a) - f(x^n)||_2^2.
Triplet loss:
L = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+.
Embedding normalization:
||f(x)||_2 = 1.
Semi-hard negative condition (as defined in text):
||f(x^a) - f(x^p)||_2^2 < ||f(x^a) - f(x^n)||_2^2.
VAL and FAR definitions:
TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }
FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }
VAL(d) = |TA(d)| / |Psame|
FAR(d) = |FA(d)| / |Pdiff|
Datasets and thresholds are used to report accuracy and false positive rates across splits.