Abstract
- Semantic image segmentation involves labeling each pixel of an image with its corresponding class.
- Encoder-decoder approaches like U-Net are popular for medical image segmentation.
- DoubleU-Net: A novel architecture that stacks two U-Net architectures.
- First U-Net: Uses pre-trained VGG-19 as the encoder, transferring features from ImageNet.
- Second U-Net: Added to capture more semantic information efficiently.
- Atrous Spatial Pyramid Pooling (ASPP): Adopted to capture contextual information.
- Evaluated on four medical segmentation datasets (colonoscopy, dermoscopy, microscopy).
- Datasets: 2015 MICCAI sub-challenge (automatic polyp detection), CVC-ClinicDB, 2018 Data Science Bowl, Lesion boundary segmentation.
- Results: DoubleU-Net outperforms U-Net and baseline models.
- Accurate segmentation masks, especially for challenging images (smaller, flat polyps).
- DoubleU-Net: A strong baseline for medical image segmentation and cross-dataset evaluation.
- Index Terms: semantic segmentation, convolutional neural network, U-Net, DoubleU-Net, CVC-ClinicDB, ETIS-Larib, ASPP, 2015 MICCAI sub-challenge, 2018 Data Science Bowl, Lesion Boundary Segmentation challenge.
Introduction
- Medical image segmentation: Labeling each pixel of an object of interest.
- Key task for clinical applications: Computer-Aided Diagnosis (CADx), therapy planning and guidance.
- Helps clinicians focus on disease areas and extract detailed information.
- Challenges:
- Unavailability of annotated data.
- Lack of high-quality labeled images.
- Low image quality.
- Lack of a standard segmentation protocol.
- Large variations of images among patients.
- Quantification of segmentation accuracy is essential.
- Requirement: Automatic, generalizable, and efficient semantic image segmentation approach.
- Convolutional Neural Networks (CNNs): State-of-the-art performance for automated medical image segmentation.
- Fully Convolutional Network (FCN): Earlier Deep Learning (DL) architecture for pixel-wise prediction.
- U-Net: Popular image segmentation architecture.
- Analysis path: Learns deep features.
- Synthesis path: Performs segmentation.
- Skip connections: Propagate dense feature maps from the analysis path to the synthesis path.
- Spatial information applied to deeper layers for accurate output.
- Adding more layers to U-Net improves feature learning and segmentation masks.
- Generalization and robustness are keys for Artificial Intelligence (AI) in clinical trials.
- Pre-trained ImageNet models (e.g., VGG19) improve CNN performance.
- DoubleU-Net: Uses modified U-Net and VGG-19 in the encoder part.
- Reasons for using VGG-19:
- Lightweight model.
- Similar architecture to U-Net.
- Allows deeper networks.
- Aim: Improve segmentation performance with architectural changes.
Contributions
- Novel architecture, DoubleU-Net, for semantic image segmentation.
- Two U-Net architectures in sequence.
- Two encoders and two decoders.
- First encoder: pre-trained VGG-19 on ImageNet.
- Atrous Spatial Pyramid Pooling (ASPP).
- Rest of the architecture built from scratch.
- Experiments on multiple datasets.
- Four different medical imaging datasets:
- Two colonoscopy datasets.
- One dermoscopy dataset.
- One microscopy dataset.
- DoubleU-Net shows better segmentation performance.
- 2015 MICCAI sub-challenge (automatic polyp detection).
- CVC-ClinicDB dataset.
- Lesion Boundary Segmentation challenge (ISIC-2018).
- 2018 Data Science Bowl challenge dataset.
- Extensive evaluation shows significant improvement over U-Net.
- DoubleU-Net: A new baseline for medical image segmentation.
- Paper organization: seven sections
- Section II: Related work.
- Section III: Proposed architecture.
- Section IV: Experiments.
- Section V: Results.
- Section VI: Discussion.
- Section VII: Summary, future work, and limitations.
Related Work
- Encoder-decoder networks (FCN, U-Net) are popular for semantic segmentation.
- Badrinarayanan et al. proposed a deep fully convolutional network for semantic pixel-wise segmentation with fewer parameters.
- Yu et al. proposed a new convolutional network module that used dilated convolutions for systematically aggregating multi-scale contextual information.
- Chen et al. proposed DeepLab to solve segmentation problems.
- DeepLabV3 improved over previous DeepLab versions without DenseCRF post-processing.
- DeepLabV3 uses skip connection between analysis path and synthesis path similar to U-Net architecture.
- Zhao et al. proposed an effective scene parsing network for complex scene understanding, where global pyramidal features capture additional contextual information.
- Zhang et al. proposed Deep Residual U-Net, which uses residual connections for better output segmentation maps.
- Chen et al. proposed Dense-Res-Inception Net (DRINET) for medical image segmentation and compared their results with FCN, U-Net, and ResUNet.
- Ibtehaz et al. modified U-Net and proposed MultiResUNet, an improved architecture for medical image segmentation; compared with U-Net on various medical image segmentation datasets, it showed superior accuracy.
- Jha et al. proposed ResUNet++, an enhanced version of ResUNet that integrates squeeze-and-excite, ASPP, and attention blocks.
- Zhou et al. proposed UNet++, a neural network architecture for semantic and instance segmentation that alleviates the unknown network depth, redesigns the skip connections, and devises a pruning scheme for the architecture.
- Efforts toward developing deep CNN architectures for natural and medical image segmentation.
- Focus on developing generalizable models tested on different datasets.
- High accuracy achieved for both natural and medical imaging.
- AI in medicine is still an emerging field.
- Challenges in the medical domain: lack of test datasets and imbalanced datasets.
- Need for a more accurate medical image segmentation approach to handle challenging images: many of them (for example, flat polyps in colonoscopy) are easily missed during examination and can develop into cancer if not detected early.
- DoubleU-Net is proposed as an architecture that produces accurate segmentation masks even for challenging images.
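As background for the dilated (atrous) convolutions used by Yu et al. and inside ASPP: a dilated kernel samples the input at a stride equal to its rate, enlarging the receptive field without adding parameters. A minimal, illustrative 1-D sketch in NumPy (not the paper's implementation):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Valid 1-D convolution with a dilation rate.

    A kernel of size k with dilation rate r covers a receptive
    field of (k - 1) * r + 1 input samples.
    """
    k = len(kernel)
    span = (k - 1) * rate + 1          # effective receptive field
    out_len = len(x) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        # sample the input every `rate` steps instead of densely
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

signal = np.arange(10, dtype=float)
dense = dilated_conv1d(signal, [1.0, 1.0, 1.0], rate=1)   # receptive field 3
atrous = dilated_conv1d(signal, [1.0, 1.0, 1.0], rate=2)  # receptive field 5
```

ASPP applies several such convolutions in parallel at different rates and fuses the results, so one block sees context at multiple scales.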
The DoubleU-Net Architecture
- Starts with a VGG-19 as encoder sub-network, followed by decoder sub-network.
- Distinguishes from U-Net: VGG-19, ASPP, and decoder block in the first network (NETWORK 1).
- Squeeze-and-excite block used in the encoder of NETWORK 1 and decoder blocks of NETWORK 1 and NETWORK 2.
- Element-wise multiplication between the output of NETWORK 1 and the input of the same network.
- Difference between DoubleU-Net and U-Net in the second network (NETWORK 2): only the use of ASPP and squeeze-and-excite block.
- In NETWORK 1, the input image is fed to the modified U-Net, which generates a predicted mask (Output1).
- The input image is multiplied by the produced mask (Output1); the product acts as input to the second modified U-Net, which produces another mask (Output2).
- Both masks (Output1 and Output2) are concatenated to show the qualitative difference between the intermediate mask (Output1) and the final predicted mask (Output2).
- The squeeze-and-excite block in the proposed networks reduces redundant information and passes the most relevant information.
- ASPP helps to extract high-resolution feature maps.
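The Output1-gating and mask concatenation described above can be sketched as pure data flow. The stub sub-network below is a placeholder invented for illustration; the real NETWORK 1 and NETWORK 2 are full modified U-Nets:

```python
import numpy as np

def stub_unet(x):
    """Stand-in for a modified U-Net: any function mapping an
    (H, W, C) image to an (H, W, 1) mask with values in (0, 1)."""
    mask = x.mean(axis=-1, keepdims=True)   # hypothetical "prediction"
    return 1.0 / (1.0 + np.exp(-mask))      # sigmoid

def double_unet_forward(image, network1, network2):
    output1 = network1(image)               # intermediate mask
    gated = image * output1                 # element-wise multiply, broadcast over channels
    output2 = network2(gated)               # refined mask from the second U-Net
    # both masks concatenated for qualitative comparison
    return np.concatenate([output1, output2], axis=-1)

image = np.random.rand(256, 256, 3)
masks = double_unet_forward(image, stub_unet, stub_unet)
```

The multiplication lets the second network focus on regions the first network already considers salient, while the concatenation preserves both predictions for inspection.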
Encoder Explanation
- First encoder (encoder1): Uses pre-trained VGG-19.
- Second encoder (encoder2): Built from scratch.
- Each encoder encodes information in the input image.
- Encoder block in encoder2: Two 3 × 3 convolution operations, each followed by batch normalization.
- Batch normalization reduces internal covariate shift and regularizes the model.
- Rectified Linear Unit (ReLU) activation function introduces non-linearity.
- Followed by squeeze-and-excitation block for enhancing feature maps.
- Max-pooling with a 2 × 2 window and stride 2 to reduce spatial dimension.
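A squeeze-and-excite block can be sketched in NumPy as global average pooling followed by a two-layer bottleneck that produces per-channel gates; the weights `w1` and `w2` here are random stand-ins, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(feature_map, w1, w2):
    """Channel-wise recalibration of an (H, W, C) feature map.

    w1: (C, C // r) weights of the reduction layer (ReLU)
    w2: (C // r, C) weights of the expansion layer (sigmoid)
    """
    squeezed = feature_map.mean(axis=(0, 1))     # global average pool -> (C,)
    excited = np.maximum(squeezed @ w1, 0.0)     # ReLU bottleneck
    scale = sigmoid(excited @ w2)                # per-channel gates in (0, 1)
    return feature_map * scale                   # reweight each channel

rng = np.random.default_rng(0)
fmap = rng.random((8, 8, 16))
r = 4                                            # reduction ratio (assumed)
w1 = rng.standard_normal((16, 16 // r))
w2 = rng.standard_normal((16 // r, 16))
out = squeeze_excite(fmap, w1, w2)
```

Because every gate lies in (0, 1), the block can only attenuate channels, which is how it suppresses redundant feature maps while passing the relevant ones through nearly unchanged.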
Decoder Explanation
- Two decoders in the entire network with small modifications.
- Each block in the decoder performs a 2 × 2 bilinear up-sampling.
- Concatenate appropriate skip connections feature maps from the encoder to the output feature maps.
- First decoder: Skip connection from the first encoder.
- Second decoder: Skip connection from both encoders.
- Two 3 × 3 convolution operations, each followed by batch normalization and then a ReLU activation function.
- A squeeze and excitation block.
- Convolution layer with a sigmoid activation function generates the mask for the corresponding modified U-Net.
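One decoder step (upsample, then concatenate the encoder skip maps) can be sketched as follows; nearest-neighbour repetition stands in for the true bilinear up-sampling, and the convolutions and squeeze-and-excite block are omitted:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour stand-in for the 2 x 2 bilinear up-sampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(features, skips):
    """One decoder block: upsample, then concatenate skip feature maps.

    features: (H, W, C) decoder feature map
    skips: list of (2H, 2W, C_i) encoder feature maps; the second
           decoder receives skips from both encoders.
    """
    up = upsample2x(features)
    return np.concatenate([up] + skips, axis=-1)

dec = np.random.rand(16, 16, 64)
skip1 = np.random.rand(32, 32, 32)   # from encoder1
skip2 = np.random.rand(32, 32, 32)   # from encoder2
merged = decoder_step(dec, [skip1, skip2])
```

Concatenating along the channel axis is what reintroduces high-resolution spatial detail from the encoders before the block's convolutions refine it.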
Experiments
- Datasets, evaluation metrics, experiment setup and configuration, and data augmentation techniques used.
Datasets
- Four publicly available datasets from the medical domain.
- The 2015 MICCAI sub-challenge on automatic polyp detection used the CVC-ClinicDB for training and ETIS-Larib for testing.
- CVC-ClinicDB is a common choice for polyp segmentation.
- The third dataset used in our experiment is the Lesion Boundary Segmentation dataset from the ISIC-2018 challenge.
- The fourth dataset used in this study is the nuclei segmentation dataset from the 2018 Data Science Bowl challenge.
- All of the datasets are clinically relevant during diagnosis.
Table I: Summary of biomedical segmentation datasets used in our experiments
Evaluation metrics
- Sørensen–Dice coefficient (DSC), mean Intersection over Union (mIoU), Precision, and Recall.
- The official evaluation metrics of each challenge were used.
- mIoU is the official evaluation metric for the Lesion Boundary Segmentation challenge.
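For binary masks, DSC and IoU reduce to simple set-overlap ratios; a NumPy sketch (the `eps` smoothing term is an illustrative convention, not necessarily what the paper used):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-7):
    """IoU = |A ∩ B| / |A ∪ B| for binary masks."""
    intersection = np.sum(pred * target)
    union = pred.sum() + target.sum() - intersection
    return (intersection + eps) / (union + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
```

mIoU is then the mean of these per-image (or per-class) IoU values over the test set.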
Experiment setup and configuration
- Models implemented using Keras framework [26] with Tensorflow 2.1.0 [27] as backend.
- Experiments on a Volta 100 GPU and an Nvidia DGX-2 AI system.
- 80% of dataset for training, 10% for validation, and 10% for testing.
- Original image size for smaller dataset.
- Resized the images to 384 × 512 for the Lesion Boundary segmentation challenge dataset.
- The size of ETIS-Larib images was adjusted to match that of CVC-ClinicDB.
- Binary cross-entropy as the loss function for all the networks, with the Nadam optimizer and its default parameters.
- When training with dice loss, a learning rate of 1e−5 and the Adam optimizer.
- Models trained for 300 epochs.
- Early stopping and ReduceLROnPlateau are also used.
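The 80/10/10 split can be sketched with a seeded shuffle; the seed value is an illustrative assumption, not taken from the paper:

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```

Fixing the seed keeps the three sets disjoint and reproducible across runs, which matters when comparing models on the same held-out test images.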
Data augmentation techniques
- Medical datasets are challenging to obtain and annotate.
- Most existing datasets have only a few samples.
- Data augmentation techniques increase the number of samples during training.
- Split dataset into training, validation, and testing sets.
- Apply different data augmentation methods to each set, including center crop, random rotation, transpose, elastic transform, etc.
- A single image was converted into 25 different images.
- The same augmentation techniques were applied to all four datasets.
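A few of the simple geometric augmentations can be sketched in NumPy; the paper's full pipeline (center crop, random rotation, elastic transform, etc.) yields 25 variants per image, which this sketch does not reproduce:

```python
import numpy as np

def augment(image):
    """Lossless geometric variants of one 2-D image."""
    variants = [image, np.fliplr(image), np.flipud(image), image.T]
    variants += [np.rot90(image, k) for k in (1, 2, 3)]
    return variants

aug = augment(np.arange(16).reshape(4, 4))
```

For segmentation, the same transform must be applied to the image and its mask together so the pixel-level labels stay aligned.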
Results
- Compared with the baselines on the respective datasets.
- U-Net is still considered as the baseline for various medical image segmentation tasks.
- Compared the proposed model with U-Net.
- Reported results on four datasets.
- Sequence of input, ground truth, Output1, and Output2 are followed.
Comparison on 2015 MICCAI sub-challenge on automatic polyp detection dataset
- Experimental results show that DoubleU-Net achieved a DSC of 0.7649 and a mIoU of 0.6255.
- DoubleU-Net outperforms the baseline [29] by 6.07% in terms of DSC and 1.31% in mIoU.
- The model that uses a pre-trained ImageNet network as a backbone achieves a higher score on cross- dataset evaluation.
- The segmentation mask produced by Output2 is better than that of Output1.
Table II: Experimental results using the 2015 MICCAI sub-challenge on automatic polyp detection dataset
Comparison on CVC-ClinicDB
- DoubleU-Net is compared with U-Net and the recent works that used the same dataset for evaluation.
- DoubleU-Net achieves a DSC of 0.9239, which is 3.91% higher than [34], and a mIoU of 0.8611, which is 1.14% higher than [17].
- DoubleU-Net produces better segmentation masks as compared to the intermediate network.
- The model performs reasonably well on the challenging images such as flat and small polyps.
Table III: Result comparison on CVC-ClinicDB
Comparison on Lesion Boundary segmentation challenge dataset
- The official evaluation metric for the challenge was mIoU.
- DoubleU-Net achieves a DSC of 0.8962 and a mIoU of 0.8212 on this challenge dataset.
- DoubleU-Net outperforms U-Net [17] by an approximate margin of 5.7%, and MultiResUNet [17] by an approximate margin of 1.83%, in terms of mIoU.
- Both intermediate output and the final output produced by the network perform well on all types of lesions ranging from small to medium to large lesions.
- The final output produced by the network is better than the intermediate one.
Table IV: Result on Lesion Boundary Segmentation dataset from ISIC-2018
Comparison on 2018 Data Science Bowl challenge dataset
- Compared our work with U-Net++ [20].
- Our method produced a DSC of 0.9133, which is 1.59% higher than the method proposed by Zhou et al. [20].
- Comparable mIoU with U-Net and UNet++ variants that use ResNet-101 as the backbone model.
- UNet++ has been used as a strong baseline for result comparison over various image segmentation tasks.
- The DoubleU-Net set a new baseline for semantic image segmentation task.
Table V: Result on nuclei segmentation from the 2018 Data Science Bowl challenge
Discussion
- DoubleU-Net performs reasonably well compared to U-Net on all the presented datasets.
- For the CVC-ClinicDB dataset, the performance of U-Net is competitive.
- For the 2015 MICCAI sub-challenge on automatic polyp detection dataset and the 2018 Data Science Bowl, DoubleU-Net has significant DSC improvements of 0.4729% and 15.60%, respectively.
- The 2015 MICCAI sub-challenge on automatic polyp detection dataset provides us the opportunity to study the cross-data generalizability.
- DoubleU-Net outperforms its competitors.
- The model trained on pre-trained ImageNet performs much better on the cross-dataset test than that of the model trained from scratch.
- DoubleU-Net is more generalizable and can be used for the cross-dataset test across the different domains.
- DoubleU-Net is capable of producing better segmentation mask even for the challenging images.
- The model produces high-quality segmentation masks for the Lesion Boundary Segmentation challenge dataset and the 2018 Data Science Bowl challenge dataset.
- The model performs well for different multi-organ and multi-centered medical image segmentation datasets.
- Transfer learning from a pre-trained ImageNet network significantly improves the results on every dataset.
- DoubleU-Net as a baseline for result comparisons over four medical image segmentation datasets.
Table VI: Relative improvement of DoubleU-Net over U-Net
Conclusion
- Proposed a novel CNN architecture called DoubleU-Net.
- DoubleU-Net has five main components: two U-Net networks, VGG-19, a squeeze-and-excite block, and ASPP.
- The performance of DoubleU-Net is significantly better when compared with the baselines and U-Net on all four datasets.
- The proposed architecture is flexible.
- The segmentation results can be improved by further integrating different CNN blocks.
- A limitation of the DoubleU-Net is that it uses more parameters as compared to U-Net, which leads to an increase in the training time.
- Future research should focus more on designing simplified architectures with fewer parameters while maintaining segmentation ability.