FocusU2Net: Dual-Attention Gated U-Net for Automated Polyp Segmentation

Background and Motivation

  • Colorectal cancer (CRC) is the 3rd most prevalent malignancy worldwide; >104,000 US cases in 2020 with 53,200 deaths.

  • Early removal of adenomatous polyps raises survival to ≈90%; late-stage 5-yr survival ≈14%.

  • Colonoscopy is the gold standard yet suffers from:

    • Adenoma Detection Rate (ADR) varies from 7% to 53% (average miss rate ≈25%).

    • Miss rate rises to 26% for polyps <5 mm, vs 2% for polyps >10 mm.

  • Automated, accurate, real-time segmentation can curb inter-observer variability, reduce workload, and improve diagnostic precision.

Research Problem

  • Existing deep learning (DL) models struggle to fuse local + global context, handle multi-scale variation, and remain computationally efficient.

  • Pre-trained backbones add heavy compute; training from scratch is resource-intensive.

  • Medical images (GI tract) have heterogeneous artefacts, small datasets, and severe class imbalance.

Key Contributions

  • FocusU2Net: bi-level nested U-structure (derived from U2Net) + dual attention Focus Gate (FG).

  • Re-designed Residual U-blocks (RSU) with variable receptive fields for rich context.

  • Integrated spatial & channel attention inside FG; focal scaling suppresses background.

  • Pixel standardization during preprocessing to stabilize gradients: $\tilde{X} = \frac{X - \mu_{pixel}}{\sigma_{pixel}}$.

  • Exhaustive evaluation on 5 datasets: Dice gains of 3.14–43.59% over SOTA; 85% cross-dataset success vs <55% for prior models.

  • Lightweight enough for real-time use (46.64 M params, 78.09 GFLOPs).

  • Explainable AI (feature maps & heatmaps) for transparency.
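The pixel standardization above is a simple per-image normalization. A minimal sketch (function name and the small `eps` stabilizer are my own additions, not from the paper):

```python
import numpy as np

def standardize(image: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pixel standardization: subtract the per-image mean and divide by the
    per-image standard deviation, giving roughly zero-mean, unit-variance
    inputs that stabilize gradient magnitudes during training."""
    mu = image.mean()
    sigma = image.std()
    return (image - mu) / (sigma + eps)
```

In practice this is applied per channel or per image after resizing, before the tensor is fed to the encoder.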

Related Work (Condensed)

Classical Encoder–Decoder
  • UNet, UNet++, ResUNet, MultiResUNet: efficient skip connections but limited global context, fine-detail loss.

Modified Encoder–Decoder
  • DilatedRes-FCN, PolypSegNet, SR-AttNet, U2Net, I2U-Net, PSPNet, DC-UNet, MSRFNet, BA-Net: address multi-scale and edges yet still heavy or brittle.

Transfer-Learning
  • DoubleUNet (VGG-19 + ASPP), HarDNet-MSEG, Inception-UNet: better semantics but high compute / interpretability issues.

Transformer Hybrids
  • SwinE-Net, TransUNet, SwinPA-Net: capture long range but memory-hungry.

Attention-centric
  • FocusU-Net, FANet, ResGANet, PraNet, CSAP-UNet: explicit attention; issues with imbalance & overfitting.

Foundation Models
  • SAM, SAM2, MedSAM: zero-shot adaptable but poor boundary refinement, prompt dependency, high cost.

FocusU2Net Architecture

1. Design Elements
  • Pixel Standardization: Gaussian (zero-mean, unit-variance) scaling gives inputs a uniform distribution across images—see Fig. 1 in paper.

  • Additive Attention Gates (AGs): combine gating signal (deep features) + skip features; ReLU → global pooling → sigmoid.

  • Focus Gate (FG) (Fig.2):

    • Parallel spatial & channel attention.

    • Focal parameter $\gamma$ boosts the FG output: $S_i^{Trans} = \exp\big(S_i^{Ch} \odot S_i^{Sp}\big)$, followed by a sigmoid.

    • Learnable transposed conv for size matching.

  • Channel Attention (Eq. 3): $M_c(F) = \sigma\big(\text{Conv}_{k_c}(F_{avg}^c) \otimes F_{max}^c\big)$ with adaptive kernel size $k_c = \big\lvert \frac{\log_2 C}{\gamma} + b \big\rvert_{odd}$ ($b=2$, $\gamma=1$).

  • Spatial Attention (Eq. 7): $M_s(F) = \sigma\big(\text{Conv}_{k_s}(F_{avg}^s \otimes F_{max}^s)\big)$, where $k_s$ adapts via Eq. (6).
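The adaptive-kernel channel attention can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the learned 1-D convolution is replaced by a simple averaging kernel, and function names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_kernel(C: int, gamma: float = 1.0, b: int = 2) -> int:
    """k = |log2(C)/gamma + b|_odd : nearest odd integer, so the kernel
    size grows with the channel count (Eq. 6 style)."""
    k = int(abs(np.log2(C) / gamma + b))
    return k if k % 2 == 1 else k + 1

def channel_attention(F: np.ndarray, gamma: float = 1.0, b: int = 2) -> np.ndarray:
    """F: (C, H, W). Global-average-pool to a channel descriptor, mix
    neighbouring channels with a size-k 1-D conv, sigmoid-gate, and
    broadcast the per-channel weights back over the feature map."""
    C = F.shape[0]
    k = adaptive_kernel(C, gamma, b)
    desc = F.mean(axis=(1, 2))                      # (C,) channel descriptor
    kernel = np.ones(k) / k                         # stand-in for the learned 1-D conv
    mixed = np.convolve(desc, kernel, mode="same")  # local cross-channel interaction
    gate = sigmoid(mixed)                           # (C,) attention weights in (0, 1)
    return F * gate[:, None, None]
```

For $C = 64$, $\gamma = 1$, $b = 2$ this gives $k = 9$; the spatial branch is analogous but pools over channels and convolves over the $H \times W$ plane instead.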

2. Residual-U Blocks (RSU)
  • Local conv $F_1(x)$, then mini U-Net encoder-decoder $U(F_1)$, summed residually: $H_{RSU}(x) = F_1(x) + U(F_1(x))$.

  • RSU-4F variant uses dilated conv (no pooling) for low-res stages.
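The residual sum $H_{RSU}(x) = F_1(x) + U(F_1(x))$ can be sketched with toy operators (a smoothing filter stands in for learned convs; names and depths are illustrative, not the paper's):

```python
import numpy as np

def conv3x3_like(x: np.ndarray) -> np.ndarray:
    """Stand-in for a learned 3x3 conv: simple 4-neighbour smoothing."""
    out = x.copy()
    out[1:-1, 1:-1] = 0.25 * (x[:-2, 1:-1] + x[2:, 1:-1]
                              + x[1:-1, :-2] + x[1:-1, 2:])
    return out

def mini_unet(f1: np.ndarray, depth: int = 2) -> np.ndarray:
    """Toy encoder-decoder U(F1): downsample by striding, then upsample
    by repetition, smoothing at each level (multi-scale context)."""
    feats = f1
    for _ in range(depth):
        feats = conv3x3_like(feats)[::2, ::2]                      # encoder step
    for _ in range(depth):
        feats = conv3x3_like(feats.repeat(2, 0).repeat(2, 1))      # decoder step
    return feats[: f1.shape[0], : f1.shape[1]]

def rsu_block(x: np.ndarray) -> np.ndarray:
    """H_RSU(x) = F1(x) + U(F1(x)): local features plus a multi-scale
    residual, so fine detail and context are fused in one block."""
    f1 = conv3x3_like(x)
    return f1 + mini_unet(f1)
```

The RSU-4F variant would replace the strided downsampling with dilated convolutions so resolution is preserved at the deepest stages.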

3. Overall Pipeline (Fig.4)
  1. Encoder: 6 stages (RSU-7 → RSU-4F) + max-pool.

  2. Gating Signal: deepest stage via 1×1 conv & upsampling.

  3. Skip paths: each encoder output + FG.

  4. Decoder: 5 stages mirrored; transposed conv upsampling.

  5. Side outputs: each decoder level → 1×1 conv + sigmoid.

  6. Fusion: $S_{fuse} = \sigma\big(\sum_{i=1}^{5} \alpha_i S_{side}^{(i)}\big)$.
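The final fusion step is a weighted sum of the side-output logits pushed through a sigmoid. A minimal sketch, assuming the side maps are already upsampled to full resolution and using uniform weights $\alpha_i$ as a placeholder for the learned ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_side_outputs(side_logits, alphas=None):
    """S_fuse = sigmoid(sum_i alpha_i * S_side^(i)).
    side_logits: list of (H, W) logit maps from the 5 decoder levels,
    assumed already upsampled to the input resolution."""
    n = len(side_logits)
    if alphas is None:
        alphas = np.ones(n) / n          # uniform weights for the sketch
    weighted = sum(a * s for a, s in zip(alphas, side_logits))
    return sigmoid(weighted)             # fused probability map in (0, 1)
```

Because each level also produces its own supervised side output, deep supervision reaches every decoder stage, not just the fused map.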

4. Formal Equations
  • Encoder layer (Eq. 11): $F_{En_i}^{(l)} = \begin{cases} RSU\big(F_{En_{i-1}}^{(l-1)}\big), & i \le 4 \\ RSU_{4F}\big(F_{En_{i-1}}^{(l-1)}\big), & i > 4 \end{cases}$

  • Skip refinement (Eq. 19): $S_i^{Ref} = S_i^{Trans} \odot S_i^{skip}$.

5. Pseudocode (Algorithm 1)
  • 33 steps: encoder forward pass, gating, FG integration, decoder, side outputs, final fusion.

Training Configuration

  • Epochs: 200; Batch size: 8; Optimizer: Adam.

  • Initial LR $=10^{-4}$; adaptive scheduler reduces LR when loss plateaus (patience 5 epochs); floor $10^{-8}$.

  • Loss-function study (Table 3): Dice loss best overall; e.g. EndoScene Dice 93.6%.
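The Dice loss selected in the study directly optimizes region overlap, which suits the heavy foreground/background imbalance of polyp masks. A minimal soft-Dice sketch (the `smooth` term is a common stabilizer, assumed rather than taken from the paper):

```python
import numpy as np

def dice_loss(pred: np.ndarray, target: np.ndarray, smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), smoothed for stability.
    pred: sigmoid probabilities in [0, 1]; target: binary mask."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth)
```

A perfect prediction drives the loss to 0; an empty prediction against a non-empty mask pushes it toward 1, regardless of how few foreground pixels the image contains.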

Datasets (80/20 split; resized to 256×256)

  • Kvasir-SEG (1000 images, resolutions 332×482 to 1920×1072)

  • CVC-ClinicDB (612 images, 384×288)

  • CVC-ColonDB (300 images, 574×500)

  • ETIS-Larib (196 images, 1225×966)

  • EndoScene (912 images, variable resolution)
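The 80/20 split and the 256×256 resize can be sketched as follows; nearest-neighbour indexing stands in for whatever interpolation the authors used, and the seed is an arbitrary choice for reproducibility:

```python
import numpy as np

def train_test_split(n_items: int, test_frac: float = 0.2, seed: int = 42):
    """Reproducible 80/20 index split over a dataset of n_items samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_test = int(round(n_items * test_frac))
    return idx[n_test:], idx[:n_test]          # (train indices, test indices)

def resize_nearest(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Nearest-neighbour resize to size x size (stand-in for a library call)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```

Applying the same resize to images and masks keeps pixel-level correspondence intact, which Dice-based training depends on.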

Evaluation Metrics

  • Dice, IoU, Precision, Recall:
    $Dice = \frac{2TP}{2TP+FP+FN}$
    $IoU = \frac{TP}{TP+FP+FN}$
    $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$
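All four metrics follow directly from the pixel-level confusion counts; a small helper makes the relationships explicit (function name is my own):

```python
def seg_metrics(tp: int, fp: int, fn: int):
    """Dice, IoU, Precision, Recall from pixel-level confusion counts.
    Dice always dominates IoU for the same prediction, since
    2TP / (2TP + FP + FN) >= TP / (TP + FP + FN)."""
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dice, iou, precision, recall
```

For example, 80 true-positive, 10 false-positive, and 10 false-negative pixels give IoU 0.80 but Dice ≈ 0.89, which is why papers report both.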

Results

Learning Curves (Fig.5)
  • Near-identical training and validation curves on 3 datasets; slight variance on ColonDB & ETIS (data scarcity).

Quantitative Highlights (Table 5)
  • ClinicDB: Dice 93.6% (↑1.28 over I2U-Net; ↑4.99 over CSAP-UNet).

  • Kvasir-SEG: Dice 89.8% (↑1.8 over SR-AttNet; ↑5.52 over CSAP-UNet).

  • ColonDB: Dice 86.4% (↑4.12 over U2Net; ↑12.32 over CSAP-UNet).

  • ETIS-Larib: Dice 78.7%; highest precision 96.1%.

  • EndoScene: Dice 93.6%; Precision 94.8%.

Cross-Dataset Validation (Table 6)
  • Train on ClinicDB, test on others → FocusU2Net wins 17/20 cases; overall success 85% vs <55% for rivals.

Ablation (Table 7)
  • U2Net+FG vs U2Net: +2.5–6% Dice across datasets.

  • FG alone (UNet+FG) cannot match nested RSU; combination critical.

Qualitative (Figs.7–11)
  • Lower FP (red) & FN (blue) vs SAM2/MedSAM, especially on small/flat lesions.

Interpretability
  • Feature maps (Figs.12–13): progressive abstraction; decoder restores fine boundaries.

  • Heatmaps (Fig.14): strong activation on polyps; FG suppresses noisy background.

Computational Footprint (Table 8)
  • Params 46.64 M; GFLOPs 78.09 → ≈6× lighter than UNet++ and ≈3× lighter than Attention UNet, yet more accurate.

Overall Insights

  • Dual attention + multiscale RSU enables balanced local–global fusion.

  • Pixel standardization simplifies training convergence.

  • Model scales to real-time colonoscopy and offers transparent decision maps.

Limitations & Future Work

  • Slight over-segmentation on tiny datasets (ETIS, ColonDB).

  • Does not yet address 3-D volumetric or multimodal fusion.

  • Future aims: 3-D extension, lighter mobile variant, advanced losses (focal Tversky), unsupervised domain adaptation.

Ethical & Practical Notes

  • Complies with Declaration of Helsinki; public datasets used.

  • No competing interests; no external funding.

  • Data available per original dataset licenses.