Course RNALD B1: FaceNet

Introduction

  • FaceNet: a unified system for face verification, recognition, and clustering by learning a direct embedding of face images into a compact Euclidean space.

  • Goal: map images to a space where distances correspond to face similarity; once embedded, standard methods (thresholding, k-NN, clustering) can be applied directly.

  • Key innovations:

    • End-to-end training of the embedding f(x) ∈ ℝ^d with a triplet-based loss, avoiding reliance on a separate bottleneck layer or PCA post-processing.

    • A novel online triplet mining strategy to select informative triplets during training.

    • A 128-dimensional embedding per face, enabling efficient large-scale recognition and clustering (compact representation).

  • Results (highlights):

    • LFW: 99.63% accuracy (with alignment) and 98.87% with a fixed center crop.

    • YouTube Faces DB: 95.12% accuracy.

    • Relative error reduction over prior best results: ~30% on both LFW and YouTube Faces DB.

  • Concepts introduced:

    • Harmonic embeddings and harmonic triplet loss, enabling compatibility between embeddings produced by different networks.

    • A lightweight, scalable approach that minimizes training alignment requirements (tight crops, minimal 2D/3D alignment).

Core Idea: Embedding and its Use

  • Objective: learn an embedding f: x → ℝ^d such that squared L2 distances reflect identity similarity:

    • Faces of the same person have small distances; faces of different people have large distances.

  • Embedding properties:

    • Embedding vector is constrained to lie on the unit hypersphere:
      ||f(x)||_2^2 = 1.

  • After embedding, face verification reduces to thresholding the distance between two embeddings; recognition becomes k-NN classification; clustering can use standard algorithms (k-means, agglomerative clustering).

  • Context: prior deep-learning approaches used a classification network plus a bottleneck representation; FaceNet bypasses the bottleneck and trains directly for the embedding with a triplet loss.
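The verification-by-thresholding step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's pipeline: the 128-D random vectors stand in for network outputs, and 1.1 is the example threshold mentioned for Figure 1.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Constrain the embedding to the unit hypersphere, as FaceNet does.
    return x / max(np.linalg.norm(x), eps)

def same_identity(emb_a, emb_b, threshold=1.1):
    # Verification: accept the pair iff the squared L2 distance is small.
    return float(np.sum((emb_a - emb_b) ** 2)) < threshold

rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=128))
b = l2_normalize(a + 0.05 * rng.normal(size=128))  # near-duplicate face
c = l2_normalize(rng.normal(size=128))             # unrelated face
print(same_identity(a, b), same_identity(a, c))    # expect: True False
```

Once embeddings are normalized, the only free parameter of verification is the distance threshold, which is tuned on held-out pairs.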

Triplet Loss (Section 3.1)

  • Embedding vector: f(x) ∈ ℝ^d.

  • Notation:

    • Anchor: x^a_i; Positive: x^p_i (same identity as the anchor); Negative: x^n_i (a different identity).

  • Triplet constraint: ||f(x^a_i) - f(x^p_i)||_2^2 + α < ||f(x^a_i) - f(x^n_i)||_2^2 for all triplets (x^a_i, x^p_i, x^n_i) ∈ T, where α is the margin.

    • This ensures a margin α between positive and negative pairs in the embedding space.

  • Loss objective:
    L = Σ_i [ ||f(x^a_i) - f(x^p_i)||_2^2 - ||f(x^a_i) - f(x^n_i)||_2^2 + α ]_+,
    where (·)_+ denotes the hinge function max(0, ·).

  • Design choice: the embedding is learned directly, not via a separate classifier; the goal is to create a discriminative manifold where same-identity faces cluster together while different identities are separated.
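The triplet loss above translates directly into numpy. A minimal batch sketch, assuming `f_a`, `f_p`, `f_n` are precomputed embedding rows (one triplet per row):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # Sum over the batch of hinged violations of the triplet constraint.
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)  # anchor-positive squared dist
    d_an = np.sum((f_a - f_n) ** 2, axis=1)  # anchor-negative squared dist
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

# One satisfied triplet (negative far away) and one violating triplet
# (negative coincides with the anchor).
f_a = np.array([[1.0, 0.0], [1.0, 0.0]])
f_p = np.array([[1.0, 0.0], [1.0, 0.0]])
f_n = np.array([[0.0, 1.0], [1.0, 0.0]])
print(triplet_loss(f_a, f_p, f_n))  # -> 0.2 (only the second triplet is active)
```

Note how the first triplet contributes zero loss: it already satisfies the margin, which is exactly why triplet selection (next section) matters.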

Triplet Selection (Section 3.2)

  • Challenge: generating all possible triplets is infeasible; many triplets are easily satisfied and do not contribute to learning.

  • Online vs offline triplet generation:

    • Offline: periodically sample triplets using the latest network checkpoint from a subset of data.

    • Online: generate triplets within a mini-batch; this work focuses on online generation with large mini-batches (thousands of exemplars).

  • Mini-batch construction:

    • Around 40 faces per identity per mini-batch; randomly sampled negatives added to each mini-batch.

    • Use all anchor-positive pairs in a mini-batch, but select hard negatives within the batch.

  • Hardness and stability:

    • Do not always pick the hardest negatives to avoid bad local minima or collapse (f(x) → 0).

    • Semi-hard negatives: choose negatives that satisfy
      ||f(x^a_i) - f(x^p_i)||_2^2 < ||f(x^a_i) - f(x^n_i)||_2^2,
      i.e., negatives farther from the anchor than the positive, but still within the margin α (inside the margin region).

  • Practical batch considerations:

    • Batch size: approximately 1,800 exemplars in many experiments.

    • To ensure meaningful anchor-positive distances, ensure roughly 40 faces per identity per batch.

  • Curriculum learning and mining strategy:

    • Inspired by curriculum learning, triplets are chosen to gradually increase challenge as training progresses.
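The semi-hard selection rule can be sketched for a single anchor-positive pair. This is a simplified per-anchor illustration (not the paper's batched implementation): among in-batch negatives farther from the anchor than the positive, pick the closest one.

```python
import numpy as np

def semi_hard_negative(f_a, f_p, negatives):
    # negatives: (n, d) embeddings whose identity differs from the anchor.
    # Keep negatives farther from the anchor than the positive is, then
    # take the closest of those (the "hardest" semi-hard negative).
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - negatives) ** 2, axis=1)
    candidates = np.flatnonzero(d_an > d_ap)
    if candidates.size == 0:
        return None  # no semi-hard negative in this mini-batch
    return int(candidates[np.argmin(d_an[candidates])])

f_a = np.array([1.0, 0.0])
f_p = np.array([0.9, 0.1])
negs = np.array([[0.0, 1.0],    # semi-hard (farther than the positive)
                 [0.95, 0.05],  # too hard (closer than the positive)
                 [-1.0, 0.0]])  # easy (much farther)
print(semi_hard_negative(f_a, f_p, negs))  # -> 0
```

Skipping the "too hard" negative (index 1) is the point of the rule: always training on such negatives risks the collapse discussed above.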

Deep Convolutional Networks Used (Section 3.3)

  • Training setup:

    • SGD with standard backprop and AdaGrad; learning rate starts at 0.05 and decays.

    • Random initialization; training on CPU cluster for 1,000–2,000 hours.

    • Margin α set to 0.2.

  • Architectures explored:

    • Zeiler & Fergus style networks (with 1×1×d convolutions between layers): deep CNN variant with 22 layers; substantial parameter count.

    • Inception-based models (Szegedy et al.): multi-branch modules that reduce parameters and FLOPS; suitable for mobile deployment.

  • Parameter and compute trade-offs:

    • Zeiler & Fergus variant (with 1×1 convolutions): ~140M parameters; ~1.6×10^9 FLOPS per image.

    • Inception-based models (NNS1, NNS2): up to ~20× fewer parameters; up to ~5× fewer FLOPS; some models suitable for mobile.

  • Notable networks:

    • NN1: Zeiler&Fergus based, input 220×220, about 140M parameters, ~1.6B FLOPS.

    • NN2: Inception based, input 224×224, ~7.5M parameters, ~1.6B FLOPS.

    • NN3: Inception based, input 160×160.

    • NN4: Inception based, input 96×96, greatly reduced CPU requirements.

  • Table 1 (NN1) and Table 2 (NN2) summarize architectures and complexity; Figure 4 (FLOPS vs accuracy) compares models.

  • Observations:

    • Inception models can achieve comparable or better accuracy with far fewer parameters and FLOPS.

    • Some smaller models (e.g., NN3, NN4) provide favorable accuracy-cost trade-offs, enabling mobile/edge deployment.

Datasets & Evaluation (Section 4)

  • Evaluation criteria:

    • Face verification: given a pair, decide if same or different identity using a threshold on the L2 distance between embeddings.

    • Metrics used: VAL(d) (true accepts at threshold d) and FAR(d) (false accepts at threshold d).

    • Definitions:

    • TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }

    • FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }

    • VAL(d) = |TA(d)| / |Psame|, FAR(d) = |FA(d)| / |Pdiff|.

  • Datasets:

    • Hold-out test set: ~1,000,000 images, disjoint identities; split into five disjoint sets of 200k images each; FAR and VAL computed on 100k × 100k image pairs per split.

    • Personal photos: ~12k images; evaluation via FAR and VAL across all ~12k² image pairs.

    • Academic datasets:

    • Labeled Faces in the Wild (LFW): standard unrestricted protocol; mean accuracy and standard error reported.

    • YouTube Faces DB: pairwise face verification over videos (5,000 video pairs); described as similar in setup to LFW.

  • Hold-out subsets and evaluation protocol ensure distribution consistency with training data, but identities are disjoint.
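The VAL/FAR definitions above reduce to counting accepted pairs on each side of the threshold. A minimal sketch with toy distances (stand-ins for real embedding distances):

```python
import numpy as np

def val_far(dist_same, dist_diff, d):
    # VAL(d) = |TA(d)| / |Psame|; FAR(d) = |FA(d)| / |Pdiff|,
    # with a pair accepted when its distance is <= d.
    val = float(np.mean(dist_same <= d))
    far = float(np.mean(dist_diff <= d))
    return val, far

# Toy pairwise distances for same-identity and different-identity pairs.
dist_same = np.array([0.4, 0.8, 1.0, 1.3])
dist_diff = np.array([1.2, 1.6, 1.9, 2.1])
print(val_far(dist_same, dist_diff, 1.1))  # -> (0.75, 0.0)
```

Sweeping `d` and plotting the resulting (FAR, VAL) points is how ROC-style curves like Figure 5 are produced.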

Experiments and Key Results (Section 5)

  • 5.1 Computation-Accuracy Trade-off

    • Figure 4 shows FLOPS vs accuracy (VAL at a fixed FAR on a user-labeled test set).

    • Observations:

    • Strong correlation between computational cost and accuracy across models.

    • Inception-based NN2 achieves competitive accuracy with far fewer parameters than NN1.

  • 5.2 Effect of CNN Model

    • Comparison across six models (NN1, NN2, NN3, NN4, NNS1, NNS2):

    • NN2 (Inception, 224×224) achieves highest VAL at 10^-3 FAR among larger models.

    • NN1 (Zeiler & Fergus-based) remains competitive; NNS1/NNS2 are smaller but still useful for mobile/clustering; NN3 provides a good balance.

    • Table 3 summarizes hold-out results (VAL at FAR = 10^-3) for each model:

    • NN1: about 87.9%

    • NN2: about 89.4%

    • NN3: about 88.3%

    • NN4: about 82.0%

    • NNS1: about 82.4%

    • NNS2: about 51.9%

  • 5.3 Sensitivity to Image Quality

    • Table 4 explores robustness to JPEG quality and image size.

    • Findings: performance is robust down to JPEG quality 20; images down to 120×120 still perform strongly, and 80×80 remains acceptable (training inputs are up to 220×220); this suggests potential gains from training with lower-resolution faces.

  • 5.4 Embedding Dimensionality

    • Common embedding dimension chosen: 128 (128-D) for most experiments;

    • 64-D, 256-D, and 512-D embeddings were also tested; 128-D performs well, and larger embeddings do not guarantee higher accuracy. The 128-D float vector can be quantized to 128 bytes with no loss in accuracy.
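The notes report that the 128-D float embedding can be quantized to 128 bytes without losing accuracy, but do not specify the scheme. One plausible one-byte-per-dimension int8 sketch (an assumption, not the paper's method):

```python
import numpy as np

def quantize(emb):
    # One byte per dimension: scale unit-norm components into int8 range.
    return np.clip(np.round(emb * 127.0), -128, 127).astype(np.int8)

def dequantize(q):
    return q.astype(np.float32) / 127.0

rng = np.random.default_rng(0)
emb = rng.normal(size=128)
emb /= np.linalg.norm(emb)           # unit hypersphere, as in FaceNet
q = quantize(emb)                    # 128 bytes total
err = np.max(np.abs(dequantize(q) - emb))
print(q.nbytes, err < 0.005)         # -> 128 True
```

Since each component of a unit-norm 128-D vector is well inside [-1, 1], per-component rounding error stays below 0.5/127 ≈ 0.004, which is why such a coarse code can leave verification accuracy essentially unchanged.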

  • 5.5 Amount of Training Data

    • Training data sizes compared: 2.6M, 26M, 52M, and 260M face images, with roughly similar model sizes across these comparisons.

    • Results (VAL on personal photo set):

    • 2.6M: ~76.3%

    • 26M: ~85.1%

    • 52M: ~85.1%

    • 260M: ~86.2%

    • Conclusion: more data yields improved accuracy; diminishing returns at very large scales for the studied setups; tens of millions of exemplars boost performance substantially.

  • 5.6 Performance on LFW

    • Two evaluation modes:
      1) Fixed center crop of LFW thumbnails.
      2) Extra alignment using a face detector (more aligned faces).

    • Results on LFW mean accuracy: 98.87% ± 0.15% with fixed crop; 99.63% ± 0.09% with extra alignment.

    • This surpasses DeepFace and the previous state of the art by a significant margin (roughly a 30% reduction in error relative to the prior best).

  • 5.7 Performance on YouTube Faces DB

    • Averaging similarity over the first 100 frames per video yields 95.12% ± 0.39%; using the first 1,000 frames gives 95.18%.

    • Outperforms earlier methods (e.g., DeepId2+ at 93.2%) by ~30% relative error reduction.

  • 5.8 Face Clustering

    • Embeddings enable effective clustering of a user’s personal photo collection into groups corresponding to individuals.

    • Demonstrated with agglomerative clustering: a representative exemplar cluster shows invariance to occlusion, lighting, pose, and even age.
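Because same-identity embeddings sit close together, even a very simple grouping procedure works on top of them. A minimal single-linkage-style sketch via union-find (a simplified stand-in for the agglomerative clustering used in the paper; the threshold and toy vectors are illustrative):

```python
import numpy as np

def cluster_embeddings(emb, threshold=1.1):
    # Link any two faces whose squared embedding distance falls below
    # the threshold; connected components become identity clusters.
    n = len(emb)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.sum((emb[i] - emb[j]) ** 2) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])  # unit-ish vectors
labels = cluster_embeddings(emb)
print(labels[0] == labels[1], labels[0] == labels[2])  # -> True False
```

Real photo collections would use a proper agglomerative algorithm with a tuned linkage criterion, but the principle is the same: clustering quality comes almost entirely from the embedding, not the clustering algorithm.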

Summary of FaceNet Approach (Section 6)

  • Direct end-to-end learning of a compact embedding that supports verification, recognition, and clustering.

  • Advantages over bottleneck/classification-based approaches:

    • End-to-end optimization of the embedding for the target tasks.

    • Minimal alignment requirements (tight face crops; scale/translation only).

    • Bidirectional compatibility: embeddings can be compared across different networks via harmonic embeddings.

  • Final remarks: future work on error analysis, model size reduction, faster training (curriculum learning with different batch sizes), and improved positive/negative mining strategies.

Harmonic Embedding (Appendix, Section 7)

  • Concept: Harmonic embeddings are a set of embeddings generated by different models v1, v2 that remain mutually compatible for comparison.

  • Purpose: enable smooth upgrade paths when deploying newer embedding models without breaking compatibility with existing embeddings.

  • Visualization (Figure 8): demonstrates compatibility between NN2 embeddings and NN1 embeddings; mixed-mode performance (NN1 with NN2 embeddings) can outperform NN1 alone.

  • 7.1 Harmonic Triplet Loss

    • To learn a harmonic embedding, triplets mix embeddings from v1 and v2 during training.

    • Training mix: use semi-hard negatives drawn from the combined set of v1 and v2 embeddings.

    • Process: initialize v2 embedding from independently trained NN2 and retrain the embedding layer; then retrain the whole v2 network with the harmonic loss to encourage compatibility.

    • Intuition: most v2 embeddings cluster near the corresponding v1 embeddings, while slight perturbations can improve verification accuracy for mislocated v1 embeddings.

  • 7.2 Summary and Future Work

    • The harmonic embedding concept appears effective and robust, with potential for further extensions, including mobile-friendly compatible networks.

  • 7.3 Additional Notes

    • The approach emphasizes compatibility and upgradeability in production systems where embeddings are deployed across devices and servers.

Additional Figures and Tables Referenced

  • Figure 1: Illumination and pose invariance; embedding distances for same vs different identities under pose/illumination changes; threshold around 1.1 classifies correctly.

  • Figure 2: Model structure: batch input → deep CNN → L2 normalization → embedding → triplet loss.

  • Figure 3: Triplet loss illustration: anchor and positive vs. anchor and negative with margin.

  • Figure 4: FLOPS vs. accuracy trade-off for model families (NN1, NN2, NN3, NNS1, NNS2).

  • Figure 5: ROC curves for different architectures on the personal photos hold-out set; order of performance NN2 > NN1 > NNS1 > NNS2.

  • Figure 6: LFW errors (false accepts/rejects) illustrating failure modes.

  • Figure 7: Example face clustering exemplar.

  • Figure 8–10: Harmonic embedding compatibility visualizations (ROC space, embedding space).

  • Table 1: NN1 Zeiler&Fergus-based model with 1×1 convolutions; parameters and FLOPS.

  • Table 2: NN2 Inception-based model details and complexity.

  • Table 3: Hold-out validation rates (VAL) at FAR = 10^-3 for different models.

  • Table 4: Image quality effects (JPEG quality) and image size effects on VAL at 10^-3.

  • Table 5: Embedding dimensionality effects on VAL.

  • Table 6: Training data size effects on VAL.

Key Formulas (recap)

  • Triplet constraint:
    ||f(x^a) - f(x^p)||_2^2 + α < ||f(x^a) - f(x^n)||_2^2.

  • Triplet loss:
    L = Σ_i [ ||f(x^a_i) - f(x^p_i)||_2^2 - ||f(x^a_i) - f(x^n_i)||_2^2 + α ]_+.

  • Embedding normalization:
    ||f(x)||_2^2 = 1.

  • Semi-hard negative condition (as defined in text):
    ||f(x^a) - f(x^p)||_2^2 < ||f(x^a) - f(x^n)||_2^2.

  • VAL and FAR definitions:

    • TA(d) = { (i, j) ∈ Psame | D(xi, xj) ≤ d }

    • FA(d) = { (i, j) ∈ Pdiff | D(xi, xj) ≤ d }

    • VAL(d) = |TA(d)| / |Psame|

    • FAR(d) = |FA(d)| / |Pdiff|

  • Datasets and thresholds are used to report accuracy and false positive rates across splits.
