FocusU2Net: Dual-Attention Gated U-Net for Automated Polyp Segmentation

Background and Motivation

  • Colorectal cancer (CRC) is the 3rd most prevalent malignancy worldwide; >104,000 US cases in 2020 with 53,200 deaths.

  • Early removal of adenomatous polyps raises survival to ≈90%; late-stage 5-yr survival ≈14%.

  • Colonoscopy is the gold standard yet suffers from:

    • Adenoma Detection Rate (ADR) varies from 7% to 53% (average miss rate ≈25%).

    • Miss rate rises to 26% for polyps <5 mm, vs 2% for polyps >10 mm.

  • Automated, accurate, real-time segmentation can curb inter-observer variability, reduce workload, and improve diagnostic precision.

Research Problem

  • Existing deep learning (DL) models struggle to fuse local + global context, handle multi-scale variation, and remain computationally efficient.

  • Pre-trained backbones add heavy compute; training from scratch is resource-intensive.

  • Medical images (GI tract) have heterogeneous artefacts, small datasets, and severe class imbalance.

Key Contributions

  • FocusU2Net: bi-level nested U-structure (derived from U2Net) + dual attention Focus Gate (FG).

  • Re-designed Residual U-blocks (RSU) with variable receptive fields for rich context.

  • Integrated spatial & channel attention inside FG; focal scaling suppresses background.

  • Pixel standardization during preprocessing to stabilize gradients: $\tilde{X} = \frac{X - \mu_{pixel}}{\sigma_{pixel}}$.

  • Exhaustive evaluation on 5 datasets: Dice gains of 3.14–43.59% over SOTA; 85% cross-dataset success vs <55% for prior models.

  • Lightweight enough for real-time use (46.64 M params, 78.09 GFLOPs).

  • Explainable AI (feature maps & heatmaps) for transparency.
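The pixel standardization above is a simple per-image normalization. A minimal sketch (function name and the small `eps` stabilizer are my own additions, not from the paper):

```python
import numpy as np

def standardize(image: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pixel standardization: subtract the per-image mean and divide by the
    per-image standard deviation, giving roughly zero-mean, unit-variance
    inputs that stabilize gradient magnitudes during training."""
    mu = image.mean()
    sigma = image.std()
    return (image - mu) / (sigma + eps)
```

In practice this is applied per channel or per image after resizing, before the tensor is fed to the encoder.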

Related Work (Condensed)

Classical Encoder–Decoder
  • UNet, UNet++, ResUNet, MultiResUNet: efficient skip connections but limited global context, fine-detail loss.

Modified Encoder–Decoder
  • DilatedRes-FCN, PolypSegNet, SR-AttNet, U2Net, I2U-Net, PSPNet, DC-UNet, MSRFNet, BA-Net: address multi-scale and edges yet still heavy or brittle.

Transfer-Learning
  • DoubleUNet (VGG-19 + ASPP), HarDNet-MSEG, Inception-UNet: better semantics but high compute / interpretability issues.

Transformer Hybrids
  • SwinE-Net, TransUNet, SwinPA-Net: capture long range but memory-hungry.

Attention-centric
  • FocusU-Net, FANet, ResGANet, PraNet, CSAP-UNet: explicit attention; issues with imbalance & overfitting.

Foundation Models
  • SAM, SAM2, MedSAM: zero-shot adaptable but poor boundary refinement, prompt dependency, high cost.

FocusU2Net Architecture

1. Design Elements
  • Pixel Standardization: Gaussian (zero-mean, unit-variance) scaling gives inputs a uniform distribution across images—see Fig. 1 in paper.

  • Additive Attention Gates (AGs): combine gating signal (deep features) + skip features; ReLU → global pooling → sigmoid.

  • Focus Gate (FG) (Fig.2):

    • Parallel spatial & channel attention.

    • Focal parameter $\gamma$ boosts the FG output: $S_i^{Trans} = \exp\big(S_i^{Ch} \odot S_i^{Sp}\big)$, followed by a sigmoid.

    • Learnable transposed conv for size matching.

  • Channel Attention (Eq. 3): $M_c(F) = \sigma\big(\text{Conv}_{k_c}(F_{avg}^c) \otimes F_{max}^c\big)$ with adaptive kernel size $k_c = \big\lvert \frac{\log_2 C}{\gamma} + b \big\rvert_{odd}$ ($b=2$, $\gamma=1$).

  • Spatial Attention (Eq. 7): $M_s(F) = \sigma\big(\text{Conv}_{k_s}(F_{avg}^s \otimes F_{max}^s)\big)$, where $k_s$ adapts via Eq. (6).
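The adaptive-kernel channel attention can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the learned 1-D convolution is replaced by a simple averaging kernel, and function names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_kernel(C: int, gamma: float = 1.0, b: int = 2) -> int:
    """k = |log2(C)/gamma + b|_odd : nearest odd integer, so the kernel
    size grows with the channel count (Eq. 6 style)."""
    k = int(abs(np.log2(C) / gamma + b))
    return k if k % 2 == 1 else k + 1

def channel_attention(F: np.ndarray, gamma: float = 1.0, b: int = 2) -> np.ndarray:
    """F: (C, H, W). Global-average-pool to a channel descriptor, mix
    neighbouring channels with a size-k 1-D conv, sigmoid-gate, and
    broadcast the per-channel weights back over the feature map."""
    C = F.shape[0]
    k = adaptive_kernel(C, gamma, b)
    desc = F.mean(axis=(1, 2))                      # (C,) channel descriptor
    kernel = np.ones(k) / k                         # stand-in for the learned 1-D conv
    mixed = np.convolve(desc, kernel, mode="same")  # local cross-channel interaction
    gate = sigmoid(mixed)                           # (C,) attention weights in (0, 1)
    return F * gate[:, None, None]
```

For $C = 64$, $\gamma = 1$, $b = 2$ this gives $k = 9$; the spatial branch is analogous but pools over channels and convolves over the $H \times W$ plane instead.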

2. Residual-U Blocks (RSU)
  • Local conv $F_1(x)$, then mini U-Net encoder-decoder $U(F_1)$, summed residually: $H_{RSU}(x) = F_1(x) + U(F_1(x))$.

  • RSU-4F variant uses dilated conv (no pooling) for low-res stages.
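The residual sum $H_{RSU}(x) = F_1(x) + U(F_1(x))$ can be sketched with toy operators (a smoothing filter stands in for learned convs; names and depths are illustrative, not the paper's):

```python
import numpy as np

def conv3x3_like(x: np.ndarray) -> np.ndarray:
    """Stand-in for a learned 3x3 conv: simple 4-neighbour smoothing."""
    out = x.copy()
    out[1:-1, 1:-1] = 0.25 * (x[:-2, 1:-1] + x[2:, 1:-1]
                              + x[1:-1, :-2] + x[1:-1, 2:])
    return out

def mini_unet(f1: np.ndarray, depth: int = 2) -> np.ndarray:
    """Toy encoder-decoder U(F1): downsample by striding, then upsample
    by repetition, smoothing at each level (multi-scale context)."""
    feats = f1
    for _ in range(depth):
        feats = conv3x3_like(feats)[::2, ::2]                      # encoder step
    for _ in range(depth):
        feats = conv3x3_like(feats.repeat(2, 0).repeat(2, 1))      # decoder step
    return feats[: f1.shape[0], : f1.shape[1]]

def rsu_block(x: np.ndarray) -> np.ndarray:
    """H_RSU(x) = F1(x) + U(F1(x)): local features plus a multi-scale
    residual, so fine detail and context are fused in one block."""
    f1 = conv3x3_like(x)
    return f1 + mini_unet(f1)
```

The RSU-4F variant would replace the strided downsampling with dilated convolutions so resolution is preserved at the deepest stages.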

3. Overall Pipeline (Fig.4)
  1. Encoder: 6 stages (RSU-7 → RSU-4F) + max-pool.

  2. Gating Signal: deepest stage via 1×1 conv & upsampling.

  3. Skip paths: each encoder output + FG.

  4. Decoder: 5 stages mirrored; transposed conv upsampling.

  5. Side outputs: each decoder level → 1×1 conv + sigmoid.

  6. Fusion: $S_{fuse} = \sigma\big(\sum_{i=1}^{5} \alpha_i S_{side}^{(i)}\big)$.
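The final fusion step is a weighted sum of the side-output logits pushed through a sigmoid. A minimal sketch, assuming the side maps are already upsampled to full resolution and using uniform weights $\alpha_i$ as a placeholder for the learned ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_side_outputs(side_logits, alphas=None):
    """S_fuse = sigmoid(sum_i alpha_i * S_side^(i)).
    side_logits: list of (H, W) logit maps from the 5 decoder levels,
    assumed already upsampled to the input resolution."""
    n = len(side_logits)
    if alphas is None:
        alphas = np.ones(n) / n          # uniform weights for the sketch
    weighted = sum(a * s for a, s in zip(alphas, side_logits))
    return sigmoid(weighted)             # fused probability map in (0, 1)
```

Because each level also produces its own supervised side output, deep supervision reaches every decoder stage, not just the fused map.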

4. Formal Equations
  • Encoder layer (Eq. 11): $F_{En_i}^{(l)} = \begin{cases} RSU\big(F_{En_{i-1}}^{(l-1)}\big), & i \le 4 \\ RSU_{4F}\big(F_{En_{i-1}}^{(l-1)}\big), & i > 4 \end{cases}$

  • Skip refinement (Eq. 19): $S_i^{Ref} = S_i^{Trans} \odot S_i^{skip}$.

5. Pseudocode (Algorithm 1)
  • 33 steps: encoder forward pass, gating, FG integration, decoder, side outputs, final fusion.

Training Configuration

  • Epochs: 200; Batch size: 8; Optimizer: Adam.

  • Initial LR $=10^{-4}$; adaptive scheduler reduces LR when loss plateaus (patience 5 epochs); floor $10^{-8}$.

  • Loss-function study (Table 3): Dice loss best overall; e.g. EndoScene Dice 93.6%.
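The Dice loss selected in the study directly optimizes region overlap, which suits the heavy foreground/background imbalance of polyp masks. A minimal soft-Dice sketch (the `smooth` term is a common stabilizer, assumed rather than taken from the paper):

```python
import numpy as np

def dice_loss(pred: np.ndarray, target: np.ndarray, smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), smoothed for stability.
    pred: sigmoid probabilities in [0, 1]; target: binary mask."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth)
```

A perfect prediction drives the loss to 0; an empty prediction against a non-empty mask pushes it toward 1, regardless of how few foreground pixels the image contains.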

Datasets (80/20 split; resized to 256×256)

  • Kvasir-SEG (1000 images, resolutions 332×482 to 1920×1072)

  • CVC-ClinicDB (612 images, 384×288)

  • CVC-ColonDB (300 images, 574×500)

  • ETIS-Larib (196 images, 1225×966)

  • EndoScene (912 images, variable resolution)
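The 80/20 split and the 256×256 resize can be sketched as follows; nearest-neighbour indexing stands in for whatever interpolation the authors used, and the seed is an arbitrary choice for reproducibility:

```python
import numpy as np

def train_test_split(n_items: int, test_frac: float = 0.2, seed: int = 42):
    """Reproducible 80/20 index split over a dataset of n_items samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_test = int(round(n_items * test_frac))
    return idx[n_test:], idx[:n_test]          # (train indices, test indices)

def resize_nearest(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Nearest-neighbour resize to size x size (stand-in for a library call)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```

Applying the same resize to images and masks keeps pixel-level correspondence intact, which Dice-based training depends on.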

Evaluation Metrics

  • Dice, IoU, Precision, Recall:
    $Dice = \frac{2TP}{2TP+FP+FN}$
    $IoU = \frac{TP}{TP+FP+FN}$
    $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$
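All four metrics follow directly from the pixel-level confusion counts; a small helper makes the relationships explicit (function name is my own):

```python
def seg_metrics(tp: int, fp: int, fn: int):
    """Dice, IoU, Precision, Recall from pixel-level confusion counts.
    Dice always dominates IoU for the same prediction, since
    2TP / (2TP + FP + FN) >= TP / (TP + FP + FN)."""
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dice, iou, precision, recall
```

For example, 80 true-positive, 10 false-positive, and 10 false-negative pixels give IoU 0.80 but Dice ≈ 0.89, which is why papers report both.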

Results

Learning Curves (Fig.5)
  • Near-identical training and validation curves on 3 datasets; slight variance on ColonDB & ETIS (data scarcity).

Quantitative Highlights (Table 5)
  • ClinicDB: Dice 93.6% (↑1.28 over I2U-Net; ↑4.99 over CSAP-UNet).

  • Kvasir-SEG: Dice 89.8% (↑1.8 over SR-AttNet; ↑5.52 over CSAP-UNet).

  • ColonDB: Dice 86.4% (↑4.12 over U2Net; ↑12.32 over CSAP-UNet).

  • ETIS-Larib: Dice 78.7%; highest precision 96.1%.

  • EndoScene: Dice 93.6%; Precision 94.8%.

Cross-Dataset Validation (Table 6)
  • Train on ClinicDB, test on others → FocusU2Net wins 17/20 cases; overall success 85% vs <55% for rivals.

Ablation (Table 7)
  • U2Net+FG vs U2Net: +2.5–6% Dice across datasets.

  • FG alone (UNet+FG) cannot match nested RSU; combination critical.

Qualitative (Figs.7–11)
  • Lower FP (red) & FN (blue) vs SAM2/MedSAM, especially on small/flat lesions.

Interpretability
  • Feature maps (Figs.12–13): progressive abstraction; decoder restores fine boundaries.

  • Heatmaps (Fig.14): strong activation on polyps; FG suppresses noisy background.

Computational Footprint (Table 8)
  • Params 46.64 M; GFLOPs 78.09 → ≈6× lighter than UNet++ and ≈3× lighter than Attention UNet, yet more accurate.

Overall Insights

  • Dual attention + multiscale RSU enables balanced local–global fusion.

  • Pixel standardization simplifies training convergence.

  • Model scales to real-time colonoscopy and offers transparent decision maps.

Limitations & Future Work

  • Slight over-segmentation on tiny datasets (ETIS, ColonDB).

  • Does not yet address 3-D volumetric or multimodal fusion.

  • Future aims: 3-D extension, lighter mobile variant, advanced losses (focal Tversky), unsupervised domain adaptation.

Ethical & Practical Notes

  • Complies with Declaration of Helsinki; public datasets used.

  • No competing interests; no external funding.

  • Data available per original dataset licenses.