RNALD B1 Course: FaceNet
Introduction
FaceNet: a unified system for face verification, recognition, and clustering by learning a direct embedding of face images into a compact Euclidean space.
Goal: map images to a space where distances correspond to face similarity; once embedded, standard methods (thresholding, k-NN, clustering) can be applied directly.
Key innovations:
End-to-end training of the embedding f(x) ∈ ℝ^d with a triplet-based loss, avoiding reliance on a separate bottleneck layer or PCA post-processing.
A novel online triplet mining strategy to select informative triplets during training.
A 128-dimensional embedding per face, enabling efficient large-scale recognition and clustering (compact representation).
Results (highlights):
LFW: 99.63% accuracy (with alignment) and 98.87% with a fixed center crop.
YouTube Faces DB: 95.12% accuracy.
Relative error reduction of ~30% over the prior best published results on both LFW and YouTube Faces DB.
Concepts introduced:
Harmonic embeddings and harmonic triplet loss, enabling compatibility between embeddings produced by different networks.
A lightweight, scalable approach that minimizes training alignment requirements (tight crops, minimal 2D/3D alignment).
Core Idea: Embedding and its Use
Objective: learn an embedding f: x → ℝ^d such that squared L2 distances reflect identity similarity:
Faces of the same person have small distances; faces of different people have large distances.
Embedding properties:
Embedding vector is constrained to lie on the unit hypersphere: ||f(x)||_2 = 1.
After embedding, face verification is thresholding the distance between embeddings; recognition is a k-NN classification; clustering can use standard clustering algorithms (k-means, agglomerative clustering).
Context: prior deep-learning approaches used a classification network plus a bottleneck representation; FaceNet bypasses the bottleneck and trains directly for the embedding with a triplet loss.
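Once faces are embedded, verification, recognition, and clustering reduce to simple operations on distances. A minimal numpy sketch of the three uses described above (function names and the 1.1 threshold from Figure 1 are illustrative, not from a released API):

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit hypersphere, as FaceNet constrains them.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def verify(emb_a, emb_b, threshold=1.1):
    # Verification: same identity iff squared L2 distance is below a threshold.
    return np.sum((emb_a - emb_b) ** 2) <= threshold

def knn_identify(query, gallery, labels, k=3):
    # Recognition: k-NN vote over squared distances to a labelled gallery.
    d = np.sum((gallery - query) ** 2, axis=1)
    nearest = np.argsort(d)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

Clustering follows the same pattern: feed the pairwise distances into any off-the-shelf algorithm such as k-means or agglomerative clustering.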
Triplet Loss (Section 3.1)
Embedding vector: f(x) ∈ ℝ^d, constrained so that ||f(x)||_2 = 1.
Notation:
Anchor: x_i^a; Positive: x_i^p (same identity as anchor); Negative: x_i^n (different identity).
Embedding constraint: ||f(x_i^a) - f(x_i^p)||_2^2 + α < ||f(x_i^a) - f(x_i^n)||_2^2, for all triplets (x_i^a, x_i^p, x_i^n) ∈ T, where α is a margin.
This ensures a margin α between positive and negative pairs in the embedding space.
Loss objective:
L = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+,
where (·)_+ denotes the hinge function max(0, ·).
Design choice: the embedding is learned directly, not via a separate classifier; the goal is a discriminative manifold where same-identity faces cluster together while different identities are separated.
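The triplet loss can be written in a few lines of numpy; a minimal sketch of the per-triplet hinge (not the actual training code, which runs inside the network's backprop):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared L2 distances anchor->positive and anchor->negative.
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    # Hinge on (d_ap - d_an + alpha): zero once the margin is satisfied.
    return np.maximum(d_ap - d_an + alpha, 0.0)
```

Note that a well-separated triplet (negative beyond the margin) contributes exactly zero loss, which is why triplet selection matters so much: random triplets are mostly inactive.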
Triplet Selection (Section 3.2)
Challenge: generating all possible triplets is infeasible; many triplets are easily satisfied and do not contribute to learning.
Online vs offline triplet generation:
Offline: periodically generate triplets on a subset of the data using the most recent network checkpoint.
Online: generate triplets within a mini-batch; this work focuses on online generation with large mini-batches (thousands of exemplars).
Mini-batch construction:
Around 40 faces per identity per mini-batch; randomly sampled negatives added to each mini-batch.
Use all anchor-positive pairs in a mini-batch, but select hard negatives within the batch.
Hardness and stability:
Do not always pick the hardest negatives to avoid bad local minima or collapse (f(x) → 0).
Semi-hard negatives: choose negatives that satisfy
||f(x_i^a) - f(x_i^p)||_2^2 < ||f(x_i^a) - f(x_i^n)||_2^2,
i.e., negatives farther from the anchor than the positive but still within the margin α, lying inside the margin region.
Practical batch considerations:
Batch size: approximately 1,800 exemplars in many experiments.
To ensure meaningful anchor-positive distances, ensure roughly 40 faces per identity per batch.
Curriculum learning and mining strategy:
Inspired by curriculum learning, triplets are chosen to gradually increase challenge as training progresses.
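Semi-hard mining within a batch can be sketched as follows; a simplified numpy illustration (the function name and the fallback behavior are assumptions, and the paper performs this selection per mini-batch inside training):

```python
import numpy as np

def semi_hard_negative(anchor, positive, batch, batch_labels, anchor_label, alpha=0.2):
    # Anchor-positive squared distance.
    d_ap = np.sum((anchor - positive) ** 2)
    # Squared distances from the anchor to every exemplar in the batch.
    d_an = np.sum((batch - anchor) ** 2, axis=1)
    # Candidates: different identity AND farther than the positive.
    mask = (batch_labels != anchor_label) & (d_an > d_ap)
    if not mask.any():
        return None  # in practice, fall back to the hardest admissible negative
    candidates = np.where(mask)[0]
    # Pick the closest such negative; when one exists inside the margin,
    # this is a semi-hard negative in the sense of the paper.
    return candidates[np.argmin(d_an[candidates])]
```

Always taking the globally hardest negative (smallest d_an overall) risks selecting mislabeled or degenerate examples, which is exactly the collapse the semi-hard constraint avoids.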
Deep Convolutional Networks Used (Section 3.3)
Training setup:
SGD with standard backprop and AdaGrad; learning rate starts at 0.05 and decays.
Random initialization; training on CPU cluster for 1,000–2,000 hours.
Margin α set to 0.2.
Architectures explored:
Zeiler & Fergus style networks (with 1×1×d convolutions between layers): deep CNN variant with 22 layers; substantial parameter count.
Inception-based models (Szegedy et al.): multi-branch modules that reduce parameters and FLOPS; suitable for mobile deployment.
Parameter and compute trade-offs:
Zeiler & Fergus variant (with 1×1 convolutions): ~140M parameters; ~1.6×10^9 FLOPS per image.
Inception-based models (NNS1, NNS2): up to ~20× fewer parameters; up to ~5× fewer FLOPS; some models suitable for mobile.
Notable networks:
NN1: Zeiler&Fergus based, input 220×220, about 140M parameters, ~1.6B FLOPS.
NN2: Inception based, input 224×224, ~7.5M parameters, ~1.6B FLOPS.
NN3: Inception based, input 160×160.
NN4: Inception based, input 96×96, greatly reduced CPU requirements.
Table 1 (NN1) and Table 2 (NN2) summarize architectures and complexity; Figure 4 (FLOPS vs accuracy) compares models.
Observations:
Inception models can achieve comparable or better accuracy with far fewer parameters and FLOPS.
Some smaller models (e.g., NN3, NN4) provide favorable accuracy-cost trade-offs, enabling mobile/edge deployment.
Datasets & Evaluation (Section 4)
Evaluation criteria:
Face verification: given a pair, decide if same or different identity using a threshold on the L2 distance between embeddings.
Metrics used: VAL(d) (true accepts at threshold d) and FAR(d) (false accepts at threshold d).
Definitions:
TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }
FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }
VAL(d) = |TA(d)| / |Psame|, FAR(d) = |FA(d)| / |Pdiff|.
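The VAL/FAR definitions above translate directly into code. A minimal numpy sketch, assuming the pairwise squared distances for same-identity and different-identity pairs have already been computed:

```python
import numpy as np

def val_far(dist_same, dist_diff, d):
    # VAL(d): fraction of same-identity pairs accepted at threshold d.
    val = np.mean(dist_same <= d)
    # FAR(d): fraction of different-identity pairs falsely accepted at d.
    far = np.mean(dist_diff <= d)
    return val, far
```

Sweeping d and plotting VAL against FAR yields the ROC-style curves used throughout the paper (e.g., Figure 5).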
Datasets:
Hold-out test set: ~1,000,000 images, disjoint identities; split into five disjoint sets of 200k images each; FAR and VAL computed on 100k × 100k image pairs per split.
Personal photos: ~12k images; evaluation via FAR and VAL across all ~12k² image pairs.
Academic datasets:
Labeled Faces in the Wild (LFW): standard unrestricted protocol; mean accuracy and standard error reported.
YouTube Faces DB: video-based face verification on 5,000 video pairs; the protocol is similar in setup to LFW.
Hold-out subsets and evaluation protocol ensure distribution consistency with training data, but identities are disjoint.
Experiments and Key Results (Section 5)
5.1 Computation-Accuracy Trade-off
Figure 4 shows FLOPS vs accuracy (VAL at a fixed FAR on a user-labeled test set).
Observations:
Strong correlation between computational cost and accuracy across models.
Inception-based NN2 achieves competitive accuracy with far fewer parameters than NN1.
5.2 Effect of CNN Model
Comparison across six models (NN1, NN2, NN3, NN4, NNS1, NNS2):
NN2 (Inception, 224×224) achieves highest VAL at 10^-3 FAR among larger models.
NN1 (Zeiler & Fergus) remains competitive; NNS1/NNS2 are smaller but still useful for mobile/clustering; NN3 provides a good balance.
Table 3 summarizes hold-out results (VAL at FAR = 10^-3) for each model:
NN1: about 87.9%
NN2: about 89.4%
NN3: about 88.3%
NN4: about 82.0%
NNS1: about 82.4%
NNS2: about 51.9%
5.3 Sensitivity to Image Quality
Table 4 explores robustness to JPEG quality and image size.
Findings: robust to JPEG compression down to quality 20; performance holds well down to 120×120 pixels and remains acceptable at 80×80, even though the network was trained on 220×220 inputs; this suggests potential gains from explicitly training with lower-resolution faces.
5.4 Embedding Dimensionality
Common embedding dimension: 128-D, used for most experiments.
Dimensions of 64, 256, and 512 were also tested; 128 performs well, larger embeddings do not guarantee higher accuracy, and the 128-D embedding can be quantized to 128 bytes with no loss in accuracy.
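One byte per dimension is enough to store a 128-D embedding in 128 bytes. The paper does not specify the quantizer, so the linear scheme below is purely a hypothetical sketch of how such a quantization could work:

```python
import numpy as np

def quantize(embedding):
    # Map each float in [-1, 1] to one byte; a 128-D embedding -> 128 bytes.
    # Hypothetical linear quantizer; the paper does not describe its scheme.
    q = np.clip(np.round((embedding + 1.0) * 127.5), 0, 255)
    return q.astype(np.uint8)

def dequantize(q):
    # Inverse mapping back to floats in [-1, 1].
    return q.astype(np.float32) / 127.5 - 1.0
```

Since unit-normalized embedding coordinates lie in [-1, 1], the per-coordinate round-trip error of such a scheme stays below ~0.01, far smaller than the verification threshold.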
5.5 Amount of Training Data
Training set sizes compared: 2.6M, 26M, 52M, and 260M face images, with the model architecture held roughly fixed across comparisons.
Results (VAL on personal photo set):
2.6M: ~76.3%
26M: ~85.1%
52M: ~85.1%
260M: ~86.2%
Conclusion: more data yields improved accuracy; diminishing returns at very large scales for the studied setups; tens of millions of exemplars boost performance substantially.
5.6 Performance on LFW
Two evaluation modes:
1) Fixed center crop of LFW thumbnails.
2) Extra alignment using an additional face detector before cropping.
Results on LFW (mean accuracy): 98.87% ± 0.15% with the fixed center crop; 99.63% ± 0.09% with extra alignment.
This surpasses DeepFace and the previous state of the art, cutting the error rate of the prior best result by roughly 30%.
5.7 Performance on YouTube Faces DB
Averaging the similarity over the first 100 frames of each video yields 95.12% ± 0.39%; using the first 1,000 frames gives 95.18%.
Outperforms earlier methods (e.g., DeepID2+ at 93.2%) with a ~30% relative error reduction.
5.8 Face Clustering
Embeddings enable effective clustering of a user’s personal photo collection into groups corresponding to individuals.
Demonstrated with agglomerative clustering: a representative exemplar cluster shows invariance to occlusion, lighting, pose, and even age.
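Because same-identity distances are small and different-identity distances are large, even a naive single-link agglomerative scheme groups a photo collection by person. A self-contained sketch (the threshold and union-find approach are illustrative assumptions, not the paper's exact clustering setup):

```python
import numpy as np

def cluster_by_threshold(embeddings, threshold=1.1):
    # Single-link clustering sketch: connect any two faces whose squared
    # distance is under the threshold, then return connected components.
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.sum((embeddings[i] - embeddings[j]) ** 2) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

In practice a standard agglomerative implementation over precomputed distances serves the same purpose; the point is that no identity labels are needed at clustering time.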
Summary of FaceNet Approach (Section 6)
Direct end-to-end learning of a compact embedding that supports verification, recognition, and clustering.
Advantages over bottleneck/classification-based approaches:
End-to-end optimization of the embedding for the target tasks.
Minimal alignment requirements (tight face crops; scale/translation only).
Bidirectional compatibility: embeddings can be compared across different networks via harmonic embeddings.
Final remarks: future work on error analysis, model size reduction, faster training (curriculum learning with different batch sizes), and improved positive/negative mining strategies.
Harmonic Embedding (Appendix, Section 7)
Concept: Harmonic embeddings are a set of embeddings generated by different models v1, v2 that remain mutually compatible for comparison.
Purpose: enable smooth upgrade paths when deploying newer embedding models without breaking compatibility with existing embeddings.
Visualization (Figure 8): demonstrates compatibility between NN2 embeddings and NN1 embeddings; mixed-mode performance (NN1 with NN2 embeddings) can outperform NN1 alone.
7.1 Harmonic Triplet Loss
To learn a harmonic embedding, triplets mix embeddings from v1 and v2 during training.
Training mix: use semi-hard negatives drawn from the combined set of v1 and v2 embeddings.
Process: initialize v2 embedding from independently trained NN2 and retrain the embedding layer; then retrain the whole v2 network with the harmonic loss to encourage compatibility.
Intuition: most v2 embeddings cluster near the corresponding v1 embeddings, while slight perturbations can improve verification accuracy for mislocated v1 embeddings.
7.2 Summary and Future Work
The harmonic embedding concept appears effective and robust; it could be extended further, for example to mobile-friendly, mutually compatible networks.
7.3 Additional Notes
The approach emphasizes compatibility and upgradeability in production systems where embeddings are deployed across devices and servers.
Additional Figures and Tables Referenced
Figure 1: Illumination and pose invariance; embedding distances for same vs different identities under pose/illumination changes; threshold around 1.1 classifies correctly.
Figure 2: Model structure: batch input → deep CNN → L2 normalization → embedding → triplet loss.
Figure 3: Triplet loss illustration: anchor and positive vs. anchor and negative with margin.
Figure 4: FLOPS vs. accuracy trade-off for model families (NN1, NN2, NN3, NNS1, NNS2).
Figure 5: ROC curves for different architectures on the personal photos hold-out set; order of performance NN2 > NN1 > NNS1 > NNS2.
Figure 6: LFW errors (false accepts/rejects) illustrating failure modes.
Figure 7: Example face clustering exemplar.
Figure 8–10: Harmonic embedding compatibility visualizations (ROC space, embedding space).
Table 1: NN1 Zeiler&Fergus-based model with 1×1 convolutions; parameters and FLOPS.
Table 2: NN2 Inception-based model details and complexity.
Table 3: Hold-out validation rates (VAL) at FAR = 10^-3 for different models.
Table 4: Image quality effects (JPEG quality) and image size effects on VAL at 10^-3.
Table 5: Embedding dimensionality effects on VAL.
Table 6: Training data size effects on VAL.
Key Formulas (recap)
Triplet constraint:
||f(x^a) - f(x^p)||_2^2 + α < ||f(x^a) - f(x^n)||_2^2.
Triplet loss:
L = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+.
Embedding normalization:
||f(x)||_2 = 1.
Semi-hard negative condition (as defined in text):
||f(x^a) - f(x^p)||_2^2 < ||f(x^a) - f(x^n)||_2^2.
VAL and FAR definitions:
TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }
FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }
VAL(d) = |TA(d)| / |Psame|
FAR(d) = |FA(d)| / |Pdiff|
Datasets and thresholds are used to report accuracy and false positive rates across splits.