SEMA-YOLO Paper Study Notes

Page 1

  • Bibliographic Information
    • Academic Editor: Javier Marcello
    • Received 11 April 2025 → Revised 22 May 2025 → Accepted 30 May 2025 → Published 31 May 2025
    • Citation: Z. Wu et al., “SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation,” Remote Sens. 2025, 17, 1917. doi:10.3390/rs17111917
    • Open-access under CC-BY 4.0.
  • Author & Affiliation Highlights
    • Zhenchuan Wu, Hang Zhen († equal), Xiaoxinxi Zhang, Xinghua Li*, Xuechen Bai
    • School of Remote Sensing & Information Engineering, Wuhan University (China).
    • State Key Laboratory of Information Engineering in Surveying, Mapping & Remote Sensing.
  • Abstract—Key Messages
    • Small object detection in remote sensing is hindered by feature loss during down-sampling and complex backgrounds.
    • Introduces SEMA-YOLO, an enhanced YOLOv11 framework with three pillars:
    1. Shallow-Layer Enhancement (SLE) – shallower backbone + extra tiny detection head (larger feature map).
    2. GCP-ASFF – Global Context Pooling + Adaptively Spatial Feature Fusion over 4 heads.
    3. RFA-C3k2 – Receptive-Field-Attention infused C3k2 for refined extraction.
    • Performance: mAP_{50}=72.5\% (RS-STOD) & 61.5\% (AI-TOD), surpassing prior SOTA.
  • Keywords
    small-object detection, remote sensing, YOLO, feature fusion

Page 2

  • Importance of Object Detection in RS: military [1], agriculture [2], urban planning [3], environmental monitoring [4], traffic [5].
  • Definition: per MS-COCO, small ≤32\times32 px.
  • High-Resolution (HR) Data Boom:
    • WorldView-3: 0.31\,m (pan) / 1.24\,m (MS) [8].
    • DJI Phantom 4 RTK UAV: cm-level [9].
    • SAR: Gaofen-3 03 1\,m (C-band) [10]; TerraSAR-X 1\,m Spotlight [11].
  • Three Major Challenges (Fig. 1):
    1. Complex backgrounds (shadows, noise, geometric/radiometric distortions) [12].
    2. Limited pixels per object [13].
    3. Dense distribution causing occlusion/overlap [14].
  • Contributions Recap
    • Lightweight, accurate YOLOv11 derivative.
    • SLE: add tiny head (P2), backbone output P5→P4.
    • GCP-ASFF: inject global context into ASFF.
    • RFA: dynamic receptive-field attention > C3k2.

Page 3

  • Paper Organization:
    Sec 2 – Related Work; Sec 3 – Method; Sec 4 – Experiments; Sec 5 – Discussion; Sec 6 – Conclusion.
  • Related-Work Axes: (i) task-specific strategies; (ii) mainstream frameworks.
  • 2.1 Task-Specific
    • 2.1.1 Multi-Scale Feature
      • FPN [18] origin → PANet [19], NAS-FPN [20], BiFPN [21], Recursive-FPN [22], AFPN [23].
      • YOLO lineage: v3 [24] multi-scale heads; v4 [25] PAN.
      • Transformers: FPT [26], DNTR [27].
      • Trade-off: ↑params, limited small-object tailoring.
    • 2.1.2 Super-Resolution
      GAN [28], RCAN [29]; SuperYOLO [30], SRCGAN-RFA-YOLO [32], etc.
      • Gains but ↑complexity and compute.
    • 2.1.3 Context-Based
      Context cues (spatial/semantic relations) e.g., Lim et al. [34], PyramidBox [35], SCDNet [36], CAB Net [38], etc.

Page 4

  • Continues context-based review plus underscores that combining context separation and task-specific branches boosts detection in clutter.

Page 5

  • 2.2 Mainstream Frameworks
    • 2.2.1 YOLO Evolution: v1→v12; key additions: BatchNorm, FPN, SPP, PAN, RepVGG, E-ELAN, NMS-free, C3k2/C2PSA, Attention-centric R-ELAN.
    • Multiple RS adaptations: ASFF-YOLOv5 [49], YOLO-DA [50], RSI-YOLO [51], CSPPartial-YOLO [52].
    • Motivation: choose YOLOv11 as base for HR small object detection.

Page 6

  • Other Frameworks
    • Two-stage (Faster / Cascade R-CNN) vs. one-stage (RetinaNet, CenterNet).
    • Anchor-free: FCOS [56]; label assignment RFLA [57]; aligned matching [58].
    • Transformer family: DETR [60] drawbacks → Deformable [61], DN-DETR [62], DINO [63], DNTR [27], DQ-DETR [64], RT-DETR [65].
    • Yet none balances real-time & small-object equally; Transformers heavier.

Page 7

  • 3 Proposed Method – Overview
    Workflow (Fig.2):
    Input → Backbone (C3k2/CBS/SPPF/C2PSA, truncated at P4) → Neck (PAN + RFA-C3k2) → 4-level Heads (P2–P5) with GCP-ASFF fusion.

Page 8

  • 3.2 Shallow-Layer Enhancement (SLE)
    • Original YOLOv11 heads: P3 (1/8), P4 (1/16), P5 (1/32).
    • Modifications:
      • Add P2 head (1/4) via double-upsample from P4.
      • Stop backbone at P4 (remove final down-sampling) → fewer params & preserved low-level detail.
      • P5 retained by downsampling enhanced P4, then Concat with original downsampled P4.
    • Stats: params ↓19.7\% (2.583 M → 2.075 M); mAP_{50:95} +0.052 on RS-STOD.
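The SLE head layout above can be sketched numerically. This is an illustrative sketch only (names and the 512-px input are assumptions, not from the paper's code): each detection head sees a feature map of size input / stride, so the new P2 head works on a 1/4-resolution grid.

```python
# Hypothetical sketch of the SLE head layout. For a 512x512 input,
# each detection head operates on a grid of input_size // stride cells.
INPUT = 512
heads = {
    "P2": 4,    # new tiny-object head added by SLE (1/4 resolution)
    "P3": 8,
    "P4": 16,   # backbone now stops here (final down-sampling removed)
    "P5": 32,   # rebuilt by down-sampling the enhanced P4
}
feature_sizes = {name: INPUT // stride for name, stride in heads.items()}
# The 1/4-stride P2 map has 128x128 cells -- 4x the linear resolution of
# P3, which is why objects under 32x32 px retain enough cells to detect.
```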

Page 9

  • 3.3 GCP-ASFF Module
    • Problem: naive FPN fusion causes gradient conflict; ASFF [16] learns spatial weights \alpha,\beta,\gamma:
      z^{k}_{ij}=\alpha^{k}_{ij}\,x^{1\rightarrow k}_{ij}+\beta^{k}_{ij}\,x^{2\rightarrow k}_{ij}+\gamma^{k}_{ij}\,x^{3\rightarrow k}_{ij} (Eq. 1)
    • Weights normalized via:
      \hat\phi^{k}_{ij}=\frac{e^{\lambda^{k}_{\phi,ij}}}{\sum_{\psi\in\{\alpha,\beta,\gamma\}}e^{\lambda^{k}_{\psi,ij}}},\quad \phi\in\{\alpha,\beta,\gamma\} (Eq. 2)
      Normalization ensures \alpha^{k}_{ij}+\beta^{k}_{ij}+\gamma^{k}_{ij}=1 (Eq. 3).
    • Limitation: lacks global context.
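The ASFF fusion in Eqs. 1–3 can be sketched in a few lines. A minimal NumPy sketch, assuming the three levels are already resized to a common (C, H, W) shape; the lambda maps would normally come from learned 1×1 convs, here they are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
x1, x2, x3 = (rng.standard_normal((C, H, W)) for _ in range(3))
lam = rng.standard_normal((3, H, W))       # stand-ins for lambda_{alpha,beta,gamma}

w = np.exp(lam) / np.exp(lam).sum(axis=0)  # Eq. 2: per-pixel softmax over levels
alpha, beta, gamma = w                     # Eq. 3: weights sum to 1 per pixel
z = alpha * x1 + beta * x2 + gamma * x3    # Eq. 1: spatially adaptive fusion
```

Because the softmax is taken per spatial position, each pixel of the fused map can draw mostly from whichever level is most informative there.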

Page 10

  • GCP Enhancement (Figs 3-4)
    1. Global Avg Pooling each level → channel descriptor g_i (Eq. 4).
    2. Concatenate g_i back, compress via 1\times1 Conv (Eq. 5).
    3. Soft-max weights \alpha_i, fuse (Eq. 6).
    • GCP acts as a global semantic filter → strengthens target cues, suppresses background.
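The three GCP steps (Eqs. 4–6) can be sketched as follows. A hedged sketch under simplifying assumptions: the random matrix `W` stands in for the learned 1×1 compression conv, and the softmax here yields one scalar weight per level rather than the full spatial weighting of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
levels = [rng.standard_normal((8, 16, 16)) for _ in range(3)]

g = np.stack([x.mean(axis=(1, 2)) for x in levels])    # Eq. 4: GAP -> (3, C)
W = rng.standard_normal((3, g.size))                   # stand-in for 1x1 conv
logits = W @ g.reshape(-1)                             # Eq. 5: compress descriptors
a = np.exp(logits) / np.exp(logits).sum()              # Eq. 6: softmax weights
fused = sum(ai * x for ai, x in zip(a, levels))        # context-weighted fusion
```

The global descriptors let the fusion weights depend on scene-level statistics, which is how GCP suppresses background-dominated levels.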

Page 11

  • 3.4 RFA-C3k2 Module (Fig. 5)
    • RFAConv: multi-branch kernels produce an attention map A_{rf} and features F_{rf};
      F = \text{Softmax}(g^{1\times1}(\text{AvgPool}(X)))\times\text{ReLU}(\text{Norm}(g^{k\times k}(X))) (Eq. 7)
      Final fusion: F=\sum_{i=1}^{k^{2}} A_{rf}(i)\odot F_{rf}(i) (Eq. 8).
    • Benefits: dynamic receptive-field selection → better tiny-detail & context.
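The fusion in Eq. 8 can be illustrated with shapes alone. An illustrative sketch, not the paper's implementation: the attention map `A_rf` over the k×k receptive-field positions reweights per-position features `F_rf`; in RFAConv both would come from the conv branches of Eq. 7, here they are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
k, C, H, W = 3, 4, 8, 8
logits = rng.standard_normal((k * k, H, W))
A_rf = np.exp(logits) / np.exp(logits).sum(axis=0)  # softmax over k^2 positions
F_rf = rng.standard_normal((k * k, C, H, W))        # per-position features
F = (A_rf[:, None] * F_rf).sum(axis=0)              # Eq. 8: attention-weighted sum
```

Each output pixel is a convex combination over the k² receptive-field positions, so the effective receptive field adapts per location instead of being fixed by the kernel.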

Page 12

  • 4 Experiments – Datasets
    • RS-STOD: 50 854 instances / 2354 images, 5 classes; 93\% small. Res 0.4–2 m.
    • AI-TOD: 28 036 images, 700 621 instances, 8 classes. For compute, authors sample 2700 train / 300 test.

Page 13

  • Implementation
    • Hardware: single NVIDIA RTX 4090; PyTorch; SGD.
    • Input sizes: RS-STOD 512², batch 16; AI-TOD 640², batch 4.
  • Metrics
    \text{Precision}=\frac{TP}{TP+FP} (9)
    \text{Recall}=\frac{TP}{TP+FN} (10)
    F1=\frac{2PR}{P+R} (11)
    mAP=\frac{\sum_{i=1}^{k}\sum_{j}(R_{j+1}-R_{j})\,P_{inter}(R_{j+1})_{i}}{k} (12)
    FPS estimation: \text{FPS}=\frac{\text{frame count}}{\text{elapsed time}} (13).
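Eqs. 9–12 can be computed on toy counts. A minimal sketch: `f1` chains Eqs. 9–11, and `average_precision` applies the core of Eq. 12 for a single class (interpolated precision = max precision at recall ≥ R_{j+1}); averaging the result over k classes gives mAP.

```python
import numpy as np

def f1(tp, fp, fn):
    p = tp / (tp + fp)            # Eq. 9: precision
    r = tp / (tp + fn)            # Eq. 10: recall
    return 2 * p * r / (p + r)    # Eq. 11: F1

def average_precision(recalls, precisions):
    # Eq. 12 core: sum of (R_{j+1} - R_j) * P_inter(R_{j+1}), where
    # interpolated precision takes the max precision at recall >= R_{j+1}.
    r = np.concatenate(([0.0], recalls))
    p_inter = [max(precisions[j:]) for j in range(len(precisions))]
    return float(np.sum((r[1:] - r[:-1]) * np.array(p_inter)))

print(round(f1(8, 2, 4), 3))                       # prints 0.727
print(average_precision([0.5, 1.0], [1.0, 0.5]))   # prints 0.75
```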

Page 14

  • SOTA Comparison on RS-STOD (Table 1)
    • SEMA-YOLO: mAP_{50}=0.725, mAP_{50:95}=0.468
      • +0.225 / +0.197 vs RT-DETR-R50
      • +0.054 / +0.053 vs best YOLOv8n.
  • Per-class AP50 (Table 2)
    • Small Vehicle 0.722 (+20.3 % vs YOLOv11n)
    • Large Vehicle 0.196 (+39.0 %).
    • Ship 0.793; Airplane 0.987; Oil Tank 0.926.
  • Visualization (Fig. 6): clearer boxes, fewer FP.

Page 15

  • AI-TOD Results (Table 3)
    • SEMA-YOLO: Precision 0.740, Recall 0.557, mAP_{50}=0.615, mAP_{50:95}=0.284
      • Large margin over RT-DETR-L (+0.482 mAP_{50}).
  • Visualization (Fig. 7): robust on tiny aircraft & vehicles.

Page 16

  • 4.5 Ablation (Table 4)
    • Baseline YOLOv11n: P 0.709 / R 0.643 / mAP_{50}=0.671.
    • +SLE: biggest single jump (+0.046 mAP_{50}, −19.7 % params).
    • +RFA alone: minor dip (needs ASFF).
    • +SLE+ASFF: mAP_{50}=0.722.
    • Full SEMA-YOLO (+GCP, RFA): mAP_{50}=0.725 & best mAP_{50:95}=0.468.
  • Grad-CAM (Fig. 8): progressive focus; GCP suppresses BG; RFA sharpens small objects.

Page 17

  • 5 Discussion – Model Potential (Table 5)
    • Scaling: n → s (13.4 M) → m (29.7 M) improves mAP_{50} 72.5→75.3→76.8 %.
    • m-size rivals RT-DETR in FLOPs but +26.8 % mAP.
    • n-size (14.2 GFLOPs) viable for Jetson Orin NX.

Page 18

  • Computational Efficiency (Table 6)
    • Params 3.6 M; 14.2 GFLOPs; 7.43 MB; 185 FPS.
    • Slightly heavier than YOLOv10n (2.7 M) but still real-time; lighter/faster than RT-DETR.
    • Balanced accuracy vs cost.

Page 19

  • 6 Conclusions
    • SEMA-YOLO incorporates SLE, GCP-ASFF, RFA on YOLOv11n → superior small-object detection in HR RS.
    • Accurate, real-time, compact.
    • Future: Transformer hybrids, advanced fusion, deployment strategies.

Page 20

  • Author Contributions & Funding
    • Method, software & drafting: Z.W., H.Z., X.Z.
    • Review/editing: X.B., X.L.
    • Funded by Wuhan University Undergraduate Innovation Grant (S202510486303).

Page 21

  • Data Availability
    RS-STOD: https://github.com/lixinghua5540/STOD
    AI-TOD: https://github.com/jwwangchn/AI-TOD
  • No Conflict of Interest declared.

Page 22

  • References Snapshot
    • [18] Lin et al., FPN.
    • [40] Redmon et al., YOLOv1.
    • [60] Carion et al., DETR.
    • [66] Selvaraju et al., Grad-CAM. (Full list spans [1]–[66]).

Ethical & Practical Implications: Improved small-object detection benefits surveillance, urban management, and disaster response, but it also raises privacy concerns and military sensitivities; the open-access code and data encourage reproducibility.