SEMA-YOLO Paper Study Notes
Page 1
- Bibliographic Information
- Academic Editor: Javier Marcello
- Received 11 April 2025 → Revised 22 May 2025 → Accepted 30 May 2025 → Published 31 May 2025
- Citation: Z. Wu et al., “SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation,” Remote Sens. 2025, 17, 1917. doi:10.3390/rs17111917
- Open-access under CC-BY 4.0.
- Author & Affiliation Highlights
- Zhenchuan Wu, Hang Zhen († equal), Xiaoxinxi Zhang, Xinghua Li*, Xuechen Bai
- School of Remote Sensing & Information Engineering, Wuhan University (China).
- State Key Laboratory of Information Engineering in Surveying, Mapping & Remote Sensing.
- Abstract—Key Messages
- Small object detection in remote sensing is hindered by feature loss during down-sampling and complex backgrounds.
- Introduces SEMA-YOLO, an enhanced YOLOv11 framework with three pillars:
- Shallow-Layer Enhancement (SLE) – shallower backbone + extra tiny detection head (larger feature map).
- GCP-ASFF – Global Context Pooling + Adaptively Spatial Feature Fusion over 4 heads.
- RFA-C3k2 – Receptive-Field-Attention infused C3k2 for refined extraction.
- Performance: mAP_{50} = 72.5% (RS-STOD) & 61.5% (AI-TOD), surpassing prior SOTA.
- Keywords
small-object detection, remote sensing, YOLO, feature fusion
Page 2
- Importance of Object Detection in RS: military [1], agriculture [2], urban planning [3], environmental monitoring [4], traffic [5].
- Definition: per MS-COCO, small objects are ≤ 32×32 px.
- High-Resolution (HR) Data Boom:
- WorldView-3: 0.31 m (pan) / 1.24 m (MS) [8].
- DJI Phantom 4 RTK UAV: cm-level [9].
- SAR: Gaofen-3 03: 1 m (C-band) [10]; TerraSAR-X: 1 m Spotlight [11].
- Three Major Challenges (Fig. 1):
- Complex backgrounds (shadows, noise, geometric/radiometric distortions) [12].
- Limited pixels per object [13].
- Dense distribution causing occlusion/overlap [14].
- Contributions Recap
- Lightweight, accurate YOLOv11 derivative.
- SLE: add tiny head (P2), backbone output P5→P4.
- GCP-ASFF: inject global context into ASFF.
- RFA: dynamic receptive-field attention > C3k2.
Page 3
- Paper Organization:
Sec 2 – Related Work; Sec 3 – Method; Sec 4 – Experiments; Sec 5 – Discussion; Sec 6 – Conclusion.
- Related-Work Axes: (i) task-specific strategies; (ii) mainstream frameworks.
- 2.1 Task-Specific
- 2.1.1 Multi-Scale Feature
• FPN [18] origin → PANet [19], NAS-FPN [20], BiFPN [21], Recursive-FPN [22], AFPN [23].
• YOLO lineage: v3 [24] multi-scale heads; v4 [25] PAN.
• Transformers: FPT [26], DNTR [27].
• Trade-off: ↑params, limited small-object tailoring.
- 2.1.2 Super-Resolution
GAN [28], RCAN [29]; SuperYOLO [30], SRCGAN-RFA-YOLO [32], etc.
• Gains but ↑complexity and compute.
- 2.1.3 Context-Based
Context cues (spatial/semantic relations) e.g., Lim et al. [34], PyramidBox [35], SCDNet [36], CAB Net [38], etc.
Page 4
- Continues context-based review plus underscores that combining context separation and task-specific branches boosts detection in clutter.
Page 5
- 2.2 Mainstream Frameworks
- 2.2.1 YOLO Evolution: v1→v12; key additions: BatchNorm, FPN, SPP, PAN, RepVGG, E-ELAN, NMS-free, C3k2/C2PSA, Attention-centric R-ELAN.
- Multiple RS adaptations: ASFF-YOLOv5 [49], YOLO-DA [50], RSI-YOLO [51], CSPPartial-YOLO [52].
- Motivation: choose YOLOv11 as base for HR small object detection.
Page 6
- Other Frameworks
- Two-stage (Faster/ Cascade R-CNN) vs one-stage (RetinaNet, CenterNet).
- Anchor-free: FCOS [56]; label assignment RFLA [57]; aligned matching [58].
- Transformer family: DETR [60] drawbacks → Deformable [61], DN-DETR [62], DINO [63], DNTR [27], DQ-DETR [64], RT-DETR [65].
- Yet none balances real-time & small-object equally; Transformers heavier.
Page 7
- 3 Proposed Method – Overview
Workflow (Fig.2):
Input → Backbone (C3k2/CBS/SPPF/C2PSA, truncated at P4) → Neck (PAN + RFA-C3k2) → 4-level heads (P2–P5) with GCP-ASFF fusion.
Page 8
- 3.2 Shallow-Layer Enhancement (SLE)
- Original YOLOv11 heads: P3 (1/8), P4 (1/16), P5 (1/32).
- Modifications:
• Add P2 head (1/4) via double-upsample from P4.
• Stop backbone at P4 (remove final down-sampling) → fewer params & preserved low-level detail.
• P5 retained by downsampling enhanced P4, then Concat with original downsampled P4.
- Stats: params ↓19.7% (2.583 M → 2.075 M); mAP_{50:95} +0.052 on RS-STOD.
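To make the stride arithmetic behind SLE concrete: for a 512×512 input (the RS-STOD training size noted later), the four heads see these feature-map sizes. A trivial Python check; only the strides P2 (1/4)–P5 (1/32) come from the notes, the rest is illustration:

```python
def feature_map_size(input_size, stride):
    """Spatial side length of a detection head's feature map at a given stride."""
    return input_size // stride

# Head strides after SLE: P2 (1/4), P3 (1/8), P4 (1/16), P5 (1/32).
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
sizes = {name: feature_map_size(512, s) for name, s in strides.items()}
print(sizes)  # the added P2 head works on a 128x128 map, 4x the area of P3
```

The 128×128 P2 map is what gives tiny objects (≤ 32×32 px) enough pixels to survive into the head.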
Page 9
- 3.3 GCP-ASFF Module
- Problem: naive FPN fusion causes gradient conflict; ASFF [16] learns spatial weights \alpha, \beta, \gamma:
z^{k}_{ij} = \alpha^{k}_{ij}\,x^{1\to k}_{ij} + \beta^{k}_{ij}\,x^{2\to k}_{ij} + \gamma^{k}_{ij}\,x^{3\to k}_{ij} (Eq. 1)
- Weights normalized via softmax:
\hat\phi^{k}_{ij} = \frac{e^{\lambda^{k}_{\phi,ij}}}{\sum_{\psi} e^{\lambda^{k}_{\psi,ij}}}, \quad \phi \in \{\alpha, \beta, \gamma\} (Eq. 2)
- Final normalization ensures \sum_{\phi} \phi^{k}_{ij} = 1 (Eq. 3).
- Limitation: lacks global context.
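A minimal NumPy sketch of the ASFF fusion in Eqs. (1)–(3). The arrays and the random control maps are illustrative stand-ins; in the real module the \lambda maps come from learned 1×1 convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16

# Three levels already rescaled to target level k: x^{1->k}, x^{2->k}, x^{3->k}.
x1, x2, x3 = (rng.standard_normal((H, W, C)) for _ in range(3))

# Control maps lambda^k_{phi,ij} (learned by 1x1 convs in ASFF; random here).
lam = rng.standard_normal((3, H, W))

# Eq. (2): softmax across the three levels -> alpha, beta, gamma per pixel.
w = np.exp(lam) / np.exp(lam).sum(axis=0, keepdims=True)
alpha, beta, gamma = w

# Eq. (3) holds by construction: the weights sum to 1 at every (i, j).
assert np.allclose(alpha + beta + gamma, 1.0)

# Eq. (1): spatially adaptive weighted fusion.
z = alpha[..., None] * x1 + beta[..., None] * x2 + gamma[..., None] * x3
```

Because the softmax runs per pixel, each location can favor a different pyramid level, which is exactly what resolves the gradient-conflict problem the notes mention.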
Page 10
- GCP Enhancement (Figs 3-4)
- Global Avg Pooling each level → channel descriptor g_i (Eq. 4).
- Concatenate g_i back, compress via 1\times1 Conv (Eq. 5).
- Soft-max spatial weights α_i, fuse (Eq. 6).
- GCP acts as a global semantic filter → strengthens target cues, suppresses background.
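A deliberately simplified NumPy sketch of the GCP steps in Eqs. (4)–(6). It collapses the spatial weights to one scalar per level and uses a random matrix in place of the learned 1×1 conv, so it shows the data flow, not the trained module:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 16, 8, 8
levels = [rng.standard_normal((C, H, W)) for _ in range(4)]  # four head levels

# Eq. (4): global average pooling -> one channel descriptor g_i per level.
g = [f.mean(axis=(1, 2)) for f in levels]  # each of shape (C,)

# Eq. (5): concatenate descriptors, compress with a 1x1 conv
# (over pooled channels a 1x1 conv is just a linear map; weights are stand-ins).
W_compress = 0.1 * rng.standard_normal((4, 4 * C))
logits = W_compress @ np.concatenate(g)  # one logit per level

# Eq. (6): softmax -> fusion weights alpha_i, then weighted fusion.
alpha = np.exp(logits) / np.exp(logits).sum()
fused = sum(a * f for a, f in zip(alpha, levels))
```

The point of the pooling step is that each weight now depends on the whole image's statistics, which is the "global semantic filter" behavior described above.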
Page 11
- 3.4 RFA-C3k2 Module (Fig. 5)
- RFAConv: multi-branch kernels produce an attention map A_{rf} and features F_{rf}:
F = \text{Softmax}(g_{1\times1}(\text{AvgPool}(X))) \times \text{ReLU}(\text{Norm}(g_{k\times k}(X))) (Eq. 7)
- Final fusion: F = \sum_{i=1}^{k^{2}} A_{rf}(i) \odot F_{rf}(i) (Eq. 8)
- Benefits: dynamic receptive-field selection → better tiny-detail & context.
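The fusion in Eq. (8) is an attention-weighted sum over the k² receptive-field branches. A NumPy sketch with random stand-ins for the branch outputs (the real A_{rf} and F_{rf} come from the convolutions in Eq. 7):

```python
import numpy as np

rng = np.random.default_rng(2)
k, H, W, C = 3, 8, 8, 16

# k^2 receptive-field branches: attention maps A_rf(i) and feature maps F_rf(i).
logits = rng.standard_normal((k * k, H, W))
A_rf = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over branches
F_rf = rng.standard_normal((k * k, H, W, C))

# Eq. (8): elementwise (Hadamard) product, summed over the k^2 positions.
F = (A_rf[..., None] * F_rf).sum(axis=0)
```

Since the softmax is per pixel, each location picks its own mix of receptive-field positions, which is what "dynamic receptive-field selection" amounts to.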
Page 12
- 4 Experiments – Datasets
- RS-STOD: 50 854 instances / 2354 images, 5 classes; 93\% small. Res 0.4–2 m.
- AI-TOD: 28 036 images, 700 621 instances, 8 classes. For compute, authors sample 2700 train / 300 test.
Page 13
- Implementation
- Hardware: single NVIDIA RTX 4090; PyTorch; SGD.
- Input sizes: RS-STOD 512², batch 16; AI-TOD 640², batch 4.
- Metrics
\text{Precision}=\frac{TP}{TP+FP} (9)
\text{Recall}=\frac{TP}{TP+FN} (10)
F1=\frac{2PR}{P+R} (11)
mAP = \frac{\sum_{i}\sum_{j}(R_{j+1}-R_{j})\,P_{\text{interp}}(R_{j+1})_{i}}{k} (12)
FPS estimation: FPS=\frac{frameNum}{elapsedTime} (13).
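The metric definitions in Eqs. (9)–(11) and (13) are one-liners; a self-contained sketch with illustrative counts (the TP/FP/FN numbers are made up, not the paper's):

```python
def precision(tp, fp):             # Eq. (9)
    return tp / (tp + fp)

def recall(tp, fn):                # Eq. (10)
    return tp / (tp + fn)

def f1(p, r):                      # Eq. (11)
    return 2 * p * r / (p + r)

def fps(frame_num, elapsed_time):  # Eq. (13)
    return frame_num / elapsed_time

# Illustrative counts only:
p, r = precision(74, 26), recall(74, 59)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))
```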
Page 14
- SOTA Comparison on RS-STOD (Table 1)
- SEMA-YOLO: mAP_{50} = 0.725, mAP_{50:95} = 0.468
• +0.225 / +0.197 vs RT-DETR-R50
• +0.054 / +0.053 vs best YOLOv8n.
- Per-class AP50 (Table 2)
- Small Vehicle 0.722 (+20.3 % vs YOLOv11n)
- Large Vehicle 0.196 (+39.0 %).
- Ship 0.793; Airplane 0.987; Oil Tank 0.926.
- Visualization (Fig. 6): clearer boxes, fewer FP.
Page 15
- AI-TOD Results (Table 3)
- SEMA-YOLO: Precision 0.740, Recall 0.557, mAP_{50} = 0.615, mAP_{50:95} = 0.284
• Large margin over RT-DETR-L (+0.482 mAP_{50}).
- Visualization (Fig. 7): robust on tiny aircraft & vehicles.
Page 16
- 4.5 Ablation (Table 4)
- Baseline YOLOv11n: P 0.709 / R 0.643 / mAP_{50}=0.671.
- +SLE: biggest single jump (+0.046 mAP_{50}, −19.7% params).
- +RFA alone: minor dip (needs ASFF).
- +SLE+ASFF: mAP_{50}=0.722.
- Full SEMA-YOLO (+GCP, RFA): mAP_{50} = 0.725 & best mAP_{50:95} = 0.468.
- Grad-CAM (Fig. 8): progressive focus; GCP suppresses BG; RFA sharpens small objects.
Page 17
- 5 Discussion – Model Potential (Table 5)
- Scaling: n → s (13.4 M) → m (29.7 M) improves mAP_{50} 72.5→75.3→76.8 %.
- m-size rivals RT-DETR in FLOPs but +26.8 % mAP.
- n-size (14.2 GFLOPs) viable for Jetson Orin NX.
Page 18
- Computational Efficiency (Table 6)
- Params 3.6 M; 14.2 GFLOPs; 7.43 MB; 185 FPS.
- Slightly heavier than YOLOv10n (2.7 M) but still real-time; lighter/faster than RT-DETR.
- Balanced accuracy vs cost.
Page 19
- 6 Conclusions
- SEMA-YOLO incorporates SLE, GCP-ASFF, RFA on YOLOv11n → superior small-object detection in HR RS.
- Accurate, real-time, compact.
- Future: Transformer hybrids, advanced fusion, deployment strategies.
Page 20
- Author Contributions & Funding
- Method, software & drafting: Z.W., H.Z., X.Z.
- Review/editing: X.B., X.L.
- Funded by Wuhan University Undergraduate Innovation Grant (S202510486303).
Page 21
- Data Availability
RS-STOD: https://github.com/lixinghua5540/STOD
AI-TOD: https://github.com/jwwangchn/AI-TOD
- No conflict of interest declared.
Page 22
- References Snapshot
- [18] Lin et al., FPN.
- [40] Redmon et al., YOLOv1.
- [60] Carion et al., DETR.
- [66] Selvaraju et al., Grad-CAM. (Full list spans [1]–[66]).
Ethical & Practical Implications: better surveillance, urban management, and disaster response, but also privacy concerns and military sensitivities; open-access code and data encourage reproducibility.