FocusU2Net: Dual-Attention Gated U-Net for Automated Polyp Segmentation
Background and Motivation
Colorectal cancer (CRC) is the third most prevalent malignancy worldwide and a leading cause of cancer death in the US.
Early removal of adenomatous polyps dramatically improves survival; 5-year survival falls sharply once the disease is late-stage.
Colonoscopy is the gold standard yet suffers from:
Adenoma Detection Rate (ADR) varies widely between endoscopists, with a substantial average miss rate.
Miss rates rise further for small and flat polyps.
Automated, accurate, real-time segmentation can curb inter-observer variability, reduce workload, and improve diagnostic precision.
Research Problem
Existing deep learning (DL) models struggle to fuse local + global context, handle multi-scale variation, and remain computationally efficient.
Pre-trained backbones add heavy compute; training from scratch is resource-intensive.
Medical images (GI tract) have heterogeneous artefacts, small datasets, and severe class imbalance.
Key Contributions
FocusU2Net: bi-level nested U-structure (derived from U2Net) + dual attention Focus Gate (FG).
Re-designed Residual U-blocks (RSU) with variable receptive fields for rich context.
Integrated spatial & channel attention inside FG; focal scaling suppresses background.
Pixel standardization during preprocessing to stabilize gradients.
Exhaustive evaluation on 5 datasets with Dice gains of 3.14–43.59 % over SOTA; 85 % cross-dataset success, well above prior models.
Lightweight enough for real-time inference (low parameter count and FLOP budget).
Explainable AI (feature maps & heatmaps) for transparency.
Related Work (Condensed)
Classical Encoder–Decoder
UNet, UNet++, ResUNet, MultiResUNet: efficient skip connections but limited global context, fine-detail loss.
Modified Encoder–Decoder
DilatedRes-FCN, PolypSegNet, SR-AttNet, U2Net, I2U-Net, PSPNet, DC-UNet, MSRFNet, BA-Net: address multi-scale and edges yet still heavy or brittle.
Transfer-Learning
DoubleUNet (VGG-19 + ASPP), HarDNet-MSEG, Inception-UNet: better semantics but high compute / interpretability issues.
Transformer Hybrids
SwinE-Net, TransUNet, SwinPA-Net: capture long-range dependencies but are memory-hungry.
Attention-centric
FocusU-Net, FANet, ResGANet, PraNet, CSAP-UNet: explicit attention; issues with imbalance & overfitting.
Foundation Models
SAM, SAM2, MedSAM: zero-shot adaptable but poor boundary refinement, prompt dependency, high cost.
FocusU2Net Architecture
1. Design Elements
Pixel Standardization: per-image Gaussian scaling to zero mean and unit variance stabilizes training (see Fig. 1 in paper).
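A minimal NumPy sketch of per-image standardization (the epsilon guard is an illustrative assumption, not from the paper):

```python
import numpy as np

def standardize(img: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-image pixel standardization: subtract the mean and divide by
    the standard deviation, giving zero mean and unit variance."""
    return (img - img.mean()) / (img.std() + eps)

img = np.array([[0.0, 128.0], [192.0, 64.0]])  # toy 2x2 grayscale patch
out = standardize(img)
```

After scaling, `out.mean()` is ~0 and `out.std()` is ~1 regardless of the input intensity range, which is what keeps gradient magnitudes comparable across images.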
Additive Attention Gates (AGs): combine a gating signal (deep features) with skip features; ReLU → global pooling → sigmoid yields the attention coefficients.
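A simplified per-position sketch of an additive attention gate (it omits the global-pooling step, and the weight names `Wx`, `Wg`, `psi` and feature sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_attention_gate(x, g, Wx, Wg, psi):
    """Additive attention: project skip (x) and gating (g) features,
    add, apply ReLU, collapse to one coefficient per position via
    `psi` + sigmoid, then rescale the skip features."""
    a = sigmoid(relu(x @ Wx + g @ Wg) @ psi)   # (N, 1) coefficients in (0, 1)
    return x * a                               # gated skip features

# toy features: 4 spatial positions, 8 channels (hypothetical sizes)
x = rng.standard_normal((4, 8))
g = rng.standard_normal((4, 8))
Wx, Wg = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
psi = rng.standard_normal((8, 1))
gated = additive_attention_gate(x, g, Wx, Wg, psi)
```

Because each coefficient lies in (0, 1), the gate can only attenuate skip features, never amplify them, which is how it suppresses irrelevant background.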
Focus Gate (FG) (Fig.2):
Parallel spatial & channel attention.
A focal parameter sharpens the FG output before the sigmoid.
Learnable transposed conv for size matching.
Channel Attention (Eq. 3): channel-wise reweighting with an adaptively sized kernel.
Spatial Attention (Eq. 7): per-pixel attention whose kernel size adapts via Eq. 6.
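A rough NumPy sketch of the two parallel branches, under several stated assumptions: the adaptive-kernel rule is ECA-style with illustrative `gamma`/`b` (the paper's Eq. 6 may differ), a uniform kernel stands in for the learned 1-D conv, and summing the branches stands in for the FG's actual fusion:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_kernel(channels, gamma=2, b=1):
    """ECA-style adaptive kernel size: grows with log2(C), forced odd."""
    k = int(abs((np.log2(channels) + b) / gamma))
    return k if k % 2 else k + 1

def channel_attention(feat):
    """Squeeze each channel by global average pooling, smooth with a 1-D
    conv of adaptive width, and rescale channels by sigmoid weights."""
    c = feat.shape[0]
    squeeze = feat.mean(axis=(1, 2))                 # (C,) channel descriptors
    kernel = np.ones(adaptive_kernel(c)) / adaptive_kernel(c)
    weights = sigmoid(np.convolve(squeeze, kernel, mode="same"))
    return feat * weights[:, None, None]

def spatial_attention(feat):
    """Pool across channels (avg + max), fuse, sigmoid -> per-pixel mask."""
    avg, mx = feat.mean(axis=0), feat.max(axis=0)
    mask = sigmoid(avg + mx)                         # stand-in for learned conv fusion
    return feat * mask[None, :, :]

feat = np.random.default_rng(1).standard_normal((16, 8, 8))  # C x H x W toy tensor
out = channel_attention(feat) + spatial_attention(feat)      # parallel branches, summed
```

The key point of the sketch is the division of labor: the channel branch decides *which* feature maps matter, the spatial branch decides *where* in the image to look.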
2. Residual-U Blocks (RSU)
A local convolution F1(x) followed by a mini U-Net encoder–decoder U(·); the output is the residual sum F1(x) + U(F1(x)).
RSU-4F variant uses dilated conv (no pooling) for low-res stages.
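The residual-sum structure can be sketched in a few lines (the two lambdas are crude stand-ins for the learned sub-modules, chosen only to make the dataflow concrete):

```python
import numpy as np

def rsu_block(x, local_conv, mini_unet):
    """Residual U-block sketch: a local convolution extracts an
    intermediate feature F1(x); a nested mini U-Net U(F1(x)) adds
    multi-scale context; the two are fused by a residual sum."""
    f1 = local_conv(x)
    return f1 + mini_unet(f1)

# hypothetical stand-ins for the learned sub-modules
local_conv = lambda x: x * 0.5
mini_unet = lambda f: f - f.mean()   # crude "context" transform

x = np.arange(8, dtype=float)
y = rsu_block(x, local_conv, mini_unet)
```

The residual path means the nested U-Net only has to learn a multi-scale correction on top of the local features, which is what makes deep nesting trainable.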
3. Overall Pipeline (Fig.4)
Encoder: 6 stages (RSU-7 → RSU-4F) + max-pool.
Gating Signal: deepest stage via conv & upsampling.
Skip paths: each encoder output + FG.
Decoder: 5 stages mirrored; transposed conv upsampling.
Side outputs: each decoder level → conv + sigmoid.
Fusion: side-output maps are concatenated and combined by a 1×1 convolution + sigmoid into the final prediction.
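The fusion step can be sketched as a learned weighted sum of the side-output logits (a 1×1 conv across the level dimension); the weights below are illustrative, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_side_outputs(side_logits, weights, bias=0.0):
    """U2Net-style fusion: combine per-level side logits with a 1x1
    convolution (here, a weighted sum across levels) and squash with a
    sigmoid into the final probability map."""
    stacked = np.stack(side_logits)                    # (levels, H, W)
    fused = np.tensordot(weights, stacked, axes=1) + bias
    return sigmoid(fused)

sides = [np.full((4, 4), v) for v in (2.0, -1.0, 0.5)]  # toy side logits
prob = fuse_side_outputs(sides, weights=np.array([0.5, 0.3, 0.2]))
```

Supervising every side output and the fused map jointly is what gives each decoder level a direct training signal.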
4. Formal Equations
Encoder layer (Eq. 11): each stage applies its RSU block to the max-pooled output of the previous stage.
Skip refinement (Eq. 19): each skip feature is modulated by its Focus Gate before concatenation with the upsampled decoder feature.
5. Pseudocode (Algorithm 1)
33 steps: encoder forward pass, gating, FG integration, decoder, side outputs, final fusion.
Training Configuration
Optimizer: Adam.
Adaptive scheduler lowers the learning rate when the loss plateaus for a set number of epochs, down to a fixed floor.
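The plateau schedule can be sketched in plain Python (the factor, patience, and floor values here are illustrative defaults, not the paper's settings):

```python
class ReduceLROnPlateau:
    """Minimal plateau scheduler: halve the LR whenever the monitored
    loss fails to improve for more than `patience` epochs, but never
    drop below `min_lr`."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:                 # improvement: reset the counter
            self.best, self.bad_epochs = loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

sched = ReduceLROnPlateau(lr=1e-3, patience=2)
losses = [1.0, 0.9, 0.9, 0.9, 0.9]   # loss plateaus after epoch 2
lrs = [sched.step(l) for l in losses]
```

This mirrors the standard `ReduceLROnPlateau` behavior found in common DL frameworks: the rate only decays in response to stalled validation loss, not on a fixed timetable.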
Loss-function study (Table 3): Dice loss performed best overall across datasets.
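A minimal soft Dice loss, as commonly used in segmentation (the smoothing constant is a conventional choice, not necessarily the paper's):

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2*|P∩G| + s) / (|P| + |G| + s).
    `smooth` avoids division by zero on empty masks."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth)

pred = np.array([1.0, 1.0, 0.0, 0.0])   # predicted mask probabilities
gt   = np.array([1.0, 0.0, 0.0, 0.0])   # ground-truth mask
loss = dice_loss(pred, gt)
```

Because Dice is computed over the overlap ratio rather than per-pixel error, it is far less sensitive to the foreground/background imbalance typical of polyp masks, which is the usual reason it wins such studies.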
Datasets (80/20 train/test split; images resized to a fixed resolution)
Kvasir-SEG (1000 images)
CVC-ClinicDB (612 images)
CVC-ColonDB (300 images)
ETIS-Larib (196 images)
EndoScene (912 images, variable resolution)
Evaluation Metrics
Dice = 2TP / (2TP + FP + FN); IoU = TP / (TP + FP + FN)
Precision = TP / (TP + FP); Recall = TP / (TP + FN)
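All four metrics fall out of the TP/FP/FN counts of a binary mask comparison, as in this short NumPy sketch:

```python
import numpy as np

def seg_metrics(pred, gt):
    """Dice, IoU, precision, and recall from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # predicted polyp,真 polyp? no: true positives
    fp = np.sum(pred & ~gt)   # predicted polyp where there is none
    fn = np.sum(~pred & gt)   # missed polyp pixels
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dice, iou, precision, recall

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [1, 0]])
d, i, p, r = seg_metrics(pred, gt)   # tp=1, fp=1, fn=1
```

Note that Dice always sits between IoU and 1, so the two rank models consistently; precision and recall separate over- from under-segmentation.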
Results
Learning Curves (Fig.5)
Training and validation curves track closely on 3 datasets; slight variance on ColonDB & ETIS (data scarcity).
Quantitative Highlights (Table 5)
ClinicDB: best Dice (↑1.28 % over I2U-Net; ↑4.99 % over CSAP-UNet).
Kvasir-SEG: best Dice (↑1.8 % over SR-AttNet; ↑5.52 % over CSAP-UNet).
ColonDB: best Dice (↑4.12 % over U2Net; ↑12.32 % over CSAP-UNet).
ETIS-Larib: best Dice and the highest precision among compared models.
EndoScene: strong Dice and precision (Table 5).
Cross-Dataset Validation (Table 6)
Train on ClinicDB, test on the other datasets → FocusU2Net wins 17/20 cases (85 % overall success) vs markedly lower rates for rivals.
Ablation (Table 7)
U2Net+FG vs U2Net: +2.5–6 % Dice across datasets.
FG alone (UNet+FG) cannot match nested RSU; combination critical.
Qualitative (Figs.7–11)
Lower FP (red) & FN (blue) vs SAM2/MedSAM, especially on small/flat lesions.
Interpretability
Feature maps (Figs.12–13): progressive abstraction; decoder restores fine boundaries.
Heatmaps (Fig.14): strong activation on polyps; FG suppresses noisy background.
Computational Footprint (Table 8)
Roughly 6× fewer parameters than UNet++ and 3× fewer than Attention UNet (Table 8), yet higher accuracy.
Overall Insights
Dual attention + multiscale RSU enables balanced local–global fusion.
Pixel standardization simplifies training convergence.
Model scales to real-time colonoscopy and offers transparent decision maps.
Limitations & Future Work
Slight over-segmentation on tiny datasets (ETIS, ColonDB).
Does not yet address 3-D volumetric or multimodal fusion.
Future aims: 3-D extension, a lighter mobile variant, advanced losses (focal Tversky), and unsupervised domain adaptation.
Ethical & Practical Notes
Complies with Declaration of Helsinki; public datasets used.
No competing interests; no external funding.
Data available per original dataset licenses.