SEMA-YOLO Paper Study Notes
Page 1
- Bibliographic Information
- Academic Editor: Javier Marcello
- Received 11 April 2025 → Revised 22 May 2025 → Accepted 30 May 2025 → Published 31 May 2025
- Citation: Z. Wu et al., “SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation,” Remote Sens. 2025, 17, 1917. doi:10.3390/rs17111917
- Open-access under CC-BY 4.0.
- Author & Affiliation Highlights
- Zhenchuan Wu, Hang Zhen († equal), Xiaoxinxi Zhang, Xinghua Li*, Xuechen Bai
- School of Remote Sensing & Information Engineering, Wuhan University (China).
- State Key Laboratory of Information Engineering in Surveying, Mapping & Remote Sensing.
- Abstract—Key Messages
- Small object detection in remote sensing is hindered by feature loss during down-sampling and complex backgrounds.
- Introduces SEMA-YOLO, an enhanced YOLOv11 framework with three pillars:
- Shallow-Layer Enhancement (SLE) – shallower backbone + extra tiny detection head (larger feature map).
- GCP-ASFF – Global Context Pooling + Adaptively Spatial Feature Fusion over 4 heads.
- RFA-C3k2 – Receptive-Field-Attention infused C3k2 for refined extraction.
- Performance: mAP_{50} = 72.5% (RS-STOD) & 61.5% (AI-TOD), surpassing prior SOTA.
- Keywords
small-object detection, remote sensing, YOLO, feature fusion
Page 2
- Importance of Object Detection in RS: military [1], agriculture [2], urban planning [3], environmental monitoring [4], traffic [5].
- Definition: per MS-COCO, small objects are ≤ 32×32 px.
- High-Resolution (HR) Data Boom:
- WorldView-3: 0.31 m (pan) / 1.24 m (MS) [8].
- DJI Phantom 4 RTK UAV: cm-level [9].
- SAR: Gaofen-3 03: 1 m (C-band) [10]; TerraSAR-X: 1 m Spotlight [11].
- Three Major Challenges (Fig. 1):
- Complex backgrounds (shadows, noise, geometric/radiometric distortions) [12].
- Limited pixels per object [13].
- Dense distribution causing occlusion/overlap [14].
- Contributions Recap
- Lightweight, accurate YOLOv11 derivative.
- SLE: add tiny head (P2), backbone output P5→P4.
- GCP-ASFF: inject global context into ASFF.
- RFA: dynamic receptive-field attention > C3k2.
Page 3
- Paper Organization:
Sec 2 – Related Work; Sec 3 – Method; Sec 4 – Experiments; Sec 5 – Discussion; Sec 6 – Conclusion.
- Related-Work Axes: (i) task-specific strategies; (ii) mainstream frameworks.
- 2.1 Task-Specific
- 2.1.1 Multi-Scale Feature
• FPN [18] origin → PANet [19], NAS-FPN [20], BiFPN [21], Recursive-FPN [22], AFPN [23].
• YOLO lineage: v3 [24] multi-scale heads; v4 [25] PAN.
• Transformers: FPT [26], DNTR [27].
• Trade-off: ↑params, limited small-object tailoring.
- 2.1.2 Super-Resolution
GAN [28], RCAN [29]; SuperYOLO [30], SRCGAN-RFA-YOLO [32], etc.
• Gains but ↑complexity and compute.
- 2.1.3 Context-Based
Context cues (spatial/semantic relations) e.g., Lim et al. [34], PyramidBox [35], SCDNet [36], CAB Net [38], etc.
Page 4
- Continues context-based review plus underscores that combining context separation and task-specific branches boosts detection in clutter.
Page 5
- 2.2 Mainstream Frameworks
- 2.2.1 YOLO Evolution: v1→v12; key additions: BatchNorm, FPN, SPP, PAN, RepVGG, E-ELAN, NMS-free, C3k2/C2PSA, Attention-centric R-ELAN.
- Multiple RS adaptations: ASFF-YOLOv5 [49], YOLO-DA [50], RSI-YOLO [51], CSPPartial-YOLO [52].
- Motivation: choose YOLOv11 as base for HR small object detection.
Page 6
- Other Frameworks
- Two-stage (Faster/ Cascade R-CNN) vs one-stage (RetinaNet, CenterNet).
- Anchor-free: FCOS [56]; label assignment RFLA [57]; aligned matching [58].
- Transformer family: DETR [60] drawbacks → Deformable [61], DN-DETR [62], DINO [63], DNTR [27], DQ-DETR [64], RT-DETR [65].
- Yet none balances real-time & small-object equally; Transformers heavier.
Page 7
- 3 Proposed Method – Overview
Workflow (Fig.2):
Input → Backbone (C3k2/CBS/SPPF/C2PSA, truncated at P4) → Neck (PAN + RFA-C3k2) → 4-level heads (P2–P5) with GCP-ASFF fusion.
Page 8
- 3.2 Shallow-Layer Enhancement (SLE)
- Original YOLOv11 heads: P3 (1/8), P4 (1/16), P5 (1/32).
- Modifications:
• Add P2 head (1/4) via double-upsample from P4.
• Stop backbone at P4 (remove final down-sampling) → fewer params & preserved low-level detail.
• P5 retained by downsampling enhanced P4, then Concat with original downsampled P4.
- Stats: params ↓19.7% (2.583 M → 2.075 M); mAP_{50:95} +0.052 on RS-STOD.
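To make the stride arithmetic behind SLE concrete: for a 512×512 input (the RS-STOD training size noted later), the four heads see these feature-map sizes. A trivial Python check; only the strides P2 (1/4)–P5 (1/32) come from the notes, the rest is illustration:

```python
def feature_map_size(input_size, stride):
    """Spatial side length of a detection head's feature map at a given stride."""
    return input_size // stride

# Head strides after SLE: P2 (1/4), P3 (1/8), P4 (1/16), P5 (1/32).
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
sizes = {name: feature_map_size(512, s) for name, s in strides.items()}
print(sizes)  # the added P2 head works on a 128x128 map, 4x the area of P3
```

The 128×128 P2 map is what gives tiny objects (≤ 32×32 px) enough pixels to survive into the head.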
Page 9
- 3.3 GCP-ASFF Module
- Problem: naive FPN fusion causes gradient conflict; ASFF [16] learns spatial weights \alpha, \beta, \gamma:
z^{k}_{ij} = \alpha^{k}_{ij}\,x^{1\to k}_{ij} + \beta^{k}_{ij}\,x^{2\to k}_{ij} + \gamma^{k}_{ij}\,x^{3\to k}_{ij} (Eq. 1)
- Weights normalized via softmax:
\hat\phi^{k}_{ij} = \frac{e^{\lambda^{k}_{\phi,ij}}}{\sum_{\psi} e^{\lambda^{k}_{\psi,ij}}}, \quad \phi \in \{\alpha, \beta, \gamma\} (Eq. 2)
- Final normalization ensures \sum_{\phi} \phi^{k}_{ij} = 1 (Eq. 3).
- Limitation: lacks global context.
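A minimal NumPy sketch of the ASFF fusion in Eqs. (1)–(3). The arrays and the random control maps are illustrative stand-ins; in the real module the \lambda maps come from learned 1×1 convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16

# Three levels already rescaled to target level k: x^{1->k}, x^{2->k}, x^{3->k}.
x1, x2, x3 = (rng.standard_normal((H, W, C)) for _ in range(3))

# Control maps lambda^k_{phi,ij} (learned by 1x1 convs in ASFF; random here).
lam = rng.standard_normal((3, H, W))

# Eq. (2): softmax across the three levels -> alpha, beta, gamma per pixel.
w = np.exp(lam) / np.exp(lam).sum(axis=0, keepdims=True)
alpha, beta, gamma = w

# Eq. (3) holds by construction: the weights sum to 1 at every (i, j).
assert np.allclose(alpha + beta + gamma, 1.0)

# Eq. (1): spatially adaptive weighted fusion.
z = alpha[..., None] * x1 + beta[..., None] * x2 + gamma[..., None] * x3
```

Because the softmax runs per pixel, each location can favor a different pyramid level, which is exactly what resolves the gradient-conflict problem the notes mention.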
Page 10
- GCP Enhancement (Figs 3-4)
- Global Avg Pooling each level → channel descriptor g_i (Eq. 4).
- Concatenate g_i back, compress via 1\times1 Conv (Eq. 5).
- Soft-max spatial weights α_i, fuse (Eq. 6).
- GCP acts as a global semantic filter → strengthens target cues, suppresses background.
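A deliberately simplified NumPy sketch of the GCP steps in Eqs. (4)–(6). It collapses the spatial weights to one scalar per level and uses a random matrix in place of the learned 1×1 conv, so it shows the data flow, not the trained module:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 16, 8, 8
levels = [rng.standard_normal((C, H, W)) for _ in range(4)]  # four head levels

# Eq. (4): global average pooling -> one channel descriptor g_i per level.
g = [f.mean(axis=(1, 2)) for f in levels]  # each of shape (C,)

# Eq. (5): concatenate descriptors, compress with a 1x1 conv
# (over pooled channels a 1x1 conv is just a linear map; weights are stand-ins).
W_compress = 0.1 * rng.standard_normal((4, 4 * C))
logits = W_compress @ np.concatenate(g)  # one logit per level

# Eq. (6): softmax -> fusion weights alpha_i, then weighted fusion.
alpha = np.exp(logits) / np.exp(logits).sum()
fused = sum(a * f for a, f in zip(alpha, levels))
```

The point of the pooling step is that each weight now depends on the whole image's statistics, which is the "global semantic filter" behavior described above.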
Page 11
- 3.4 RFA-C3k2 Module (Fig. 5)
- RFAConv: multi-branch kernels produce an attention map A_{rf} and features F_{rf}:
F = \text{Softmax}(g_{1\times1}(\text{AvgPool}(X))) \times \text{ReLU}(\text{Norm}(g_{k\times k}(X))) (Eq. 7)
- Final fusion: F = \sum_{i=1}^{k^{2}} A_{rf}(i) \odot F_{rf}(i) (Eq. 8)
- Benefits: dynamic receptive-field selection → better tiny-detail & context.
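The fusion in Eq. (8) is an attention-weighted sum over the k² receptive-field branches. A NumPy sketch with random stand-ins for the branch outputs (the real A_{rf} and F_{rf} come from the convolutions in Eq. 7):

```python
import numpy as np

rng = np.random.default_rng(2)
k, H, W, C = 3, 8, 8, 16

# k^2 receptive-field branches: attention maps A_rf(i) and feature maps F_rf(i).
logits = rng.standard_normal((k * k, H, W))
A_rf = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over branches
F_rf = rng.standard_normal((k * k, H, W, C))

# Eq. (8): elementwise (Hadamard) product, summed over the k^2 positions.
F = (A_rf[..., None] * F_rf).sum(axis=0)
```

Since the softmax is per pixel, each location picks its own mix of receptive-field positions, which is what "dynamic receptive-field selection" amounts to.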
Page 12
- 4 Experiments – Datasets
- RS-STOD: 50 854 instances / 2354 images, 5 classes; 93\% small. Res 0.4–2 m.
- AI-TOD: 28 036 images, 700 621 instances, 8 classes. For compute, authors sample 2700 train / 300 test.
Page 13
- Implementation
- Hardware: single NVIDIA RTX 4090; PyTorch; SGD.
- Input sizes: RS-STOD 512², batch 16; AI-TOD 640², batch 4.
- Metrics
\text{Precision}=\frac{TP}{TP+FP} (9)
\text{Recall}=\frac{TP}{TP+FN} (10)
F1=\frac{2PR}{P+R} (11)
mAP = \frac{\sum_{i}\sum_{j}(R_{j+1}-R_{j})\,P_{\text{interp}}(R_{j+1})_{i}}{k} (12)
FPS estimation: FPS=\frac{frameNum}{elapsedTime} (13).
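The metric definitions in Eqs. (9)–(11) and (13) are one-liners; a self-contained sketch with illustrative counts (the TP/FP/FN numbers are made up, not the paper's):

```python
def precision(tp, fp):             # Eq. (9)
    return tp / (tp + fp)

def recall(tp, fn):                # Eq. (10)
    return tp / (tp + fn)

def f1(p, r):                      # Eq. (11)
    return 2 * p * r / (p + r)

def fps(frame_num, elapsed_time):  # Eq. (13)
    return frame_num / elapsed_time

# Illustrative counts only:
p, r = precision(74, 26), recall(74, 59)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))
```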
Page 14
- SOTA Comparison on RS-STOD (Table 1)
- SEMA-YOLO: mAP_{50} = 0.725, mAP_{50:95} = 0.468
• +0.225 / +0.197 vs RT-DETR-R50
• +0.054 / +0.053 vs best YOLOv8n.
- Per-class AP50 (Table 2)
- Small Vehicle 0.722 (+20.3 % vs YOLOv11n)
- Large Vehicle 0.196 (+39.0 %).
- Ship 0.793; Airplane 0.987; Oil Tank 0.926.
- Visualization (Fig. 6): clearer boxes, fewer FP.
Page 15
- AI-TOD Results (Table 3)
- SEMA-YOLO: Precision 0.740, Recall 0.557, mAP_{50} = 0.615, mAP_{50:95} = 0.284
• Large margin over RT-DETR-L (+0.482 mAP_{50}).
- Visualization (Fig. 7): robust on tiny aircraft & vehicles.
Page 16
- 4.5 Ablation (Table 4)
- Baseline YOLOv11n: P 0.709 / R 0.643 / mAP_{50}=0.671.
- +SLE: biggest single jump (+0.046 mAP_{50}, −19.7% params).
- +RFA alone: minor dip (needs ASFF).
- +SLE+ASFF: mAP_{50}=0.722.
- Full SEMA-YOLO (+GCP, RFA): mAP_{50} = 0.725 & best mAP_{50:95} = 0.468.
- Grad-CAM (Fig. 8): progressive focus; GCP suppresses BG; RFA sharpens small objects.
Page 17
- 5 Discussion – Model Potential (Table 5)
- Scaling: n → s (13.4 M) → m (29.7 M) improves mAP_{50} 72.5→75.3→76.8 %.
- m-size rivals RT-DETR in FLOPs but +26.8 % mAP.
- n-size (14.2 GFLOPs) viable for Jetson Orin NX.
Page 18
- Computational Efficiency (Table 6)
- Params 3.6 M; 14.2 GFLOPs; 7.43 MB; 185 FPS.
- Slightly heavier than YOLOv10n (2.7 M) but still real-time; lighter/faster than RT-DETR.
- Balanced accuracy vs cost.
Page 19
- 6 Conclusions
- SEMA-YOLO incorporates SLE, GCP-ASFF, RFA on YOLOv11n → superior small-object detection in HR RS.
- Accurate, real-time, compact.
- Future: Transformer hybrids, advanced fusion, deployment strategies.
Page 20
- Author Contributions & Funding
- Method, software & drafting: Z.W., H.Z., X.Z.
- Review/editing: X.B., X.L.
- Funded by Wuhan University Undergraduate Innovation Grant (S202510486303).
Page 21
- Data Availability
RS-STOD: https://github.com/lixinghua5540/STOD
AI-TOD: https://github.com/jwwangchn/AI-TOD
- No conflict of interest declared.
Page 22
- References Snapshot
- [18] Lin et al., FPN.
- [40] Redmon et al., YOLOv1.
- [60] Carion et al., DETR.
- [66] Selvaraju et al., Grad-CAM. (Full list spans [1]–[66]).
Ethical & Practical Implications: better surveillance, urban management, and disaster response, but also privacy concerns and military sensitivities; open-access code and data encourage reproducibility.