Project Introduction — First-Prize Solution for MindSpore-based Kidney-Tumor Segmentation

The 10th CCF Big Data & Computing Intelligence Contest (2022 CCF BDCI) has recently come to a close with great success. The winning teams' solutions will be shared on the official competition platform, DataFountain (DF), for further communication and discussion. This blog presents one of the solutions for MindSpore-based Kidney Tumor Segmentation, which won first prize in the contest. For details about the contest, visit http://go.datafountain.cn/3056.

Team Introduction

Team Name: Xian Yu

Team Members: Two postgraduate students from the Institute of Computing Technology, Chinese Academy of Sciences, who were encouraged to enter this contest by the instructor of their AI course.

Award: First Prize

Abstract

Under the theme of medical image segmentation, we employ methods such as data augmentation, propose a Res-U-Net model, and add the Lovász-Softmax loss function during training. With these methods, our model achieves strong performance on the test dataset.

Keywords

Semantic segmentation, data augmentation, ResNet, Lovász-Softmax

1. Introduction

The contest focuses on the semantic segmentation task, a well-established and extensively researched problem in deep learning. Building on past techniques, we develop our own methods, including:

● Data augmentation: Based on data characteristics of this contest, we use multiple data augmentation methods to improve the generalization capability of our model.

● Res-U-Net model: By incorporating the U-Net structure and leveraging advantages of ResNet blocks, we design the Res-U-Net model, which ensures stable model training and inference.

● Combined loss functions: To train the model, we combine the Lovász-Softmax loss[4] with a class-weighted cross-entropy loss.
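To make the loss combination concrete, here is a minimal NumPy sketch of the Lovász-Softmax loss (following Berman et al.[4]) added to a class-weighted cross-entropy. The class weights and the mixing coefficient `alpha` are illustrative placeholders, not the exact values we used in the contest.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss,
    for ground-truth indicators sorted by decreasing error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels):
    """probs: (N, C) softmax outputs; labels: (N,) int class ids.
    Returns the Lovasz-Softmax loss averaged over present classes."""
    n, c = probs.shape
    losses = []
    for cls in range(c):
        fg = (labels == cls).astype(np.float64)
        if fg.sum() == 0:          # skip classes absent from the batch
            continue
        errors = np.abs(fg - probs[:, cls])
        order = np.argsort(-errors)  # sort errors in decreasing order
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return float(np.mean(losses))

def combined_loss(probs, labels, class_weights, alpha=1.0):
    """Class-weighted cross-entropy plus Lovasz-Softmax."""
    ce = -np.mean(class_weights[labels] *
                  np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return ce + alpha * lovasz_softmax(probs, labels)
```

A perfect one-hot prediction drives both terms to zero, while the Lovász term directly penalizes the IoU gap that plain cross-entropy only optimizes indirectly.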

Results obtained on the verification and test datasets demonstrate that our methods are effective.

2. Dataset

The KiTS19 dataset[5] serves as the training dataset, and the test dataset contains private data provided by the contest organizer. Both datasets consist of kidney and tumor CT data. The training dataset contains the scanning results of 210 patients with kidney tumors, and each result contains about 300 scanned images. The training dataset is labeled with three categories: background, kidney, and tumor. We randomly select the data records of 30 patients as the verification dataset and use the remaining records of 180 patients as the training dataset.

3. Data Processing

The scanned CT images are spatially continuous: each slice is closely correlated with the slices immediately before and after it in the sequence. Therefore, when we select an image for training, we also select its two adjacent images. That is, the input size used during training is 512 x 512 x 3.
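A minimal NumPy sketch of this slice stacking is shown below; the function name and the boundary policy (edge slices reuse the nearest valid neighbor) are our illustrative assumptions, not part of the original pipeline description.

```python
import numpy as np

def make_training_input(volume, i):
    """Stack slice i with its two neighbours along the channel axis.

    volume: (num_slices, H, W) CT volume (H = W = 512 in the contest).
    Edge slices reuse the nearest valid neighbour.
    """
    n = volume.shape[0]
    lo = max(i - 1, 0)
    hi = min(i + 1, n - 1)
    return np.stack([volume[lo], volume[i], volume[hi]], axis=-1)

# small demo volume; real slices are 512 x 512
vol = np.random.rand(20, 64, 64).astype(np.float32)
x = make_training_input(vol, 10)   # x.shape == (64, 64, 3)
```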

To improve the generalization capability of our model and reduce its dependency on particular samples and attributes, we employ image augmentation to generate more training samples as the training data is read. Before samples are added to the training dataset, augmentation is performed through sequential geometric transformations, including random rotation, flipping, and cropping.

3.1 Random Rotation and Flipping

Random rotation and flipping are the most commonly used image augmentation methods. We rotate samples by a random angle drawn uniformly from (-9°, 9°) and flip them horizontally with a probability of 0.5.
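These two transforms can be sketched as follows, using `scipy.ndimage.rotate` for the interpolated rotation; the function name and RNG handling are our own illustrative choices.

```python
import numpy as np
from scipy.ndimage import rotate

def random_rotate_flip(img, rng):
    """Rotate by an angle drawn uniformly from (-9, 9) degrees,
    then flip horizontally with probability 0.5."""
    angle = rng.uniform(-9.0, 9.0)
    out = rotate(img, angle, axes=(0, 1), reshape=False,
                 order=1, mode="nearest")
    if rng.random() < 0.5:
        out = out[:, ::-1]      # horizontal flip
    return out

rng = np.random.default_rng(0)
img = np.random.rand(64, 64, 3).astype(np.float32)  # 512 x 512 in practice
aug = random_rotate_flip(img, rng)
```

`reshape=False` keeps the output the same size as the input, which matters because the network expects a fixed 512 x 512 input.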

3.2 Random Cropping

We first pad the four edges of a sample image with 12 pixels, expanding its size from 512 x 512 to 536 x 536. Then, we crop the image with a random area and a random height-width ratio. The cropping area ratio and height-width ratio are within the ranges (0.92, 0.99) and (0.96, 1.04) respectively, that is, Scrop/Spad ∈ (0.92, 0.99) and H/W ∈ (0.96, 1.04), where Scrop indicates the cropped image area, Spad indicates the padded image area, H indicates the crop height, and W indicates the crop width. Finally, we resize the cropped image back to 512 x 512.
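A NumPy sketch of the pad-then-crop step is given below. The final resize back to 512 x 512 is deliberately left out to avoid an image-library dependency, and the sampling strategy (draw area and ratio, then derive height and width) is our assumption about how the constraints are realized.

```python
import numpy as np

def random_crop(img, rng, pad=12,
                area_range=(0.92, 0.99), ratio_range=(0.96, 1.04)):
    """Pad each edge by `pad` pixels (512 -> 536), then crop a random
    region with S_crop/S_pad in area_range and H/W in ratio_range.
    The crop is capped at the padded size if rounding overshoots."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)),
                    mode="constant")
    ph, pw = padded.shape[:2]
    area = rng.uniform(*area_range) * ph * pw
    ratio = rng.uniform(*ratio_range)          # H / W
    h = min(int(round(np.sqrt(area * ratio))), ph)
    w = min(int(round(np.sqrt(area / ratio))), pw)
    top = rng.integers(0, ph - h + 1)
    left = rng.integers(0, pw - w + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(1)
img = np.random.rand(512, 512, 3).astype(np.float32)
crop = random_crop(img, rng)   # resize to 512 x 512 would follow
```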

4. Model Structure

The main structure of our Res-U-Net model, with the classic encoder-decoder structure as the backbone, follows the design of Cortinhal et al.[1]. The input of Res-U-Net is a 512 x 512 x 3 image, where 512 indicates the image width and height and 3 indicates the number of channels. The input image first passes through the Conv0 layer, which preliminarily learns the basic features of the image and extends the number of channels to 32. It then enters a five-layer encoder, represented as ResDownSample layers. Each ResDownSample layer downsamples the feature map, halving its width and height, while learning features and doubling the number of channels.

Symmetrically, the feature map then passes through a five-layer decoder, represented as UpSample and CatResBlock layers. UpSample performs upsampling on the feature map: its width and height are doubled, and the number of channels is reduced to 1/4 of the original number. The upsampled feature map is stitched with the output of the corresponding ResDownSample layer, following Ronneberger et al.[2].

Before stitching, the number of channels of the output feature map at the ResDownSample layer is reduced through a 1 x 1 convolutional layer. This ensures that the input feature maps of CatResBlock have a proper channel number. CatResBlock stitches the two input feature maps and learns their features. After being upsampled five times, the output feature map has size 512 x 512 x 16. It is then stitched with the output of the Conv0 layer. Finally, the feature map is convolved once more, producing a 512 x 512 x 3 output.
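The encoder shape bookkeeping described above can be verified with a small helper; the channel counts follow the text (Conv0 outputs 32 channels; each ResDownSample halves H and W and doubles C), and the function name is ours.

```python
def encoder_shapes(h=512, w=512, c=32, layers=5):
    """Track (H, W, C) through the five ResDownSample layers:
    each halves the spatial size and doubles the channel count."""
    shapes = [(h, w, c)]                 # output of Conv0
    for _ in range(layers):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

shapes = encoder_shapes()
# [(512, 512, 32), (256, 256, 64), (128, 128, 128),
#  (64, 64, 256), (32, 32, 512), (16, 16, 1024)]
```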

4.1 Conv0

The Conv0 layer consists of 3 x 3 Conv, Batch Norm, and Leaky ReLU layers. The stride of the 3 x 3 Conv is 1, presented as s = 1. The convolution operation does not change the width and height of the feature map. The Conv0 layer is used to preliminarily extract the feature information of the image, which lays a foundation for the subsequent feature extraction.

4.2 ResDownSample

The structure of the ResDownSample layer is designed based on He et al.[3]. The first convolutional layer is a 3 x 3 Conv with a stride of 2, which halves the width and height of the feature map for downsampling; it is followed by Batch Norm and Leaky ReLU layers. The second convolutional layer has a stride of 1 and does not change the width and height of the feature map. The input feature map also passes through another stride-2 convolutional layer, called a "shortcut" in He's work[3]. The shortcut feature map is added to the output of the second convolutional layer, and the sum passes through a Leaky ReLU before being output.
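The block can be sketched with a naive NumPy convolution. Batch Norm is omitted for brevity, and a 1 x 1 shortcut kernel is assumed, since the text does not state the shortcut's kernel size.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def conv2d(x, w, stride=1):
    """Naive 'same'-padded 2-D convolution.
    x: (H, W, Cin); w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    h_out = (x.shape[0] - 1) // stride + 1
    w_out = (x.shape[1] - 1) // stride + 1
    out = np.zeros((h_out, w_out, w.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[i * stride:i * stride + k,
                       j * stride:j * stride + k]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def res_down_sample(x, w1, w2, w_sc):
    """He-style residual downsampling: stride-2 3x3 conv -> LeakyReLU
    -> stride-1 3x3 conv, plus a stride-2 shortcut conv; the two
    branches are summed and activated."""
    y = leaky_relu(conv2d(x, w1, stride=2))
    y = conv2d(y, w2, stride=1)
    sc = conv2d(x, w_sc, stride=2)
    return leaky_relu(y + sc)

# tiny demo: 8 x 8 x 2 input, 4 output channels
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 2))
out = res_down_sample(x,
                      rng.standard_normal((3, 3, 2, 4)),
                      rng.standard_normal((3, 3, 4, 4)),
                      rng.standard_normal((1, 1, 2, 4)))
# out.shape == (4, 4, 4): spatial size halved, channels grown
```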

4.3 Connector

The structure of the Connector layer is basically the same as that of the Conv0 layer. The only difference is that the stride of the convolutional layer in Connector is 2, so it downsamples the feature map.

4.4 UpSample

The UpSample layer uses the PixelShuffle method, which performs upsampling by exchanging the pixel sequence of the feature map, without using any parameter that needs to be learned. For example, an H x W x C feature map becomes 2H x 2W x C/4 after pixel shuffling, indicating that the width and height of the feature map are doubled, and the number of channels is reduced to one-fourth of the original number.
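For channel-last feature maps, the parameter-free rearrangement can be sketched in NumPy as follows (the function name is ours):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange an (H, W, C) feature map into (r*H, r*W, C/r^2)
    by moving channel groups into spatial positions; no learned
    parameters are involved."""
    h, w, c = x.shape
    assert c % (r * r) == 0
    out = x.reshape(h, w, r, r, c // (r * r))
    out = out.transpose(0, 2, 1, 3, 4)     # interleave rows and cols
    return out.reshape(h * r, w * r, c // (r * r))

x = np.arange(4 * 4 * 8, dtype=np.float64).reshape(4, 4, 8)
y = pixel_shuffle(x)   # y.shape == (8, 8, 2)
```

Because the operation only permutes values, every input element appears exactly once in the output.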

4.5 CatResBlock

The first layer of CatResBlock is the stitching layer, which is used to stitch the output of the corresponding ResDownSample layer. The following layers are 3 × 3 Conv, Batch Norm, Leaky ReLU, 3 × 3 Conv, and Batch Norm layers in sequence. The output feature map is added to the stitched feature map and then output again through the Leaky ReLU layer.

4.6 CatConv

The first layer of CatConv is a stitching layer, and the following layers are 3 x 3 Conv, Batch Norm, and Leaky ReLU layers.

Acknowledgments

We truly appreciate our AI instructor for encouraging us to participate in this contest, and we credit our success to his guidance.

References

[1] Cortinhal, Tiago, George Tzelepis, and Eren Erdal Aksoy. "SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds." International Symposium on Visual Computing. Springer, Cham, 2020.

[2] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.

[3] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[4] Berman, Maxim, Amal Rannen Triki, and Matthew B. Blaschko. "The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[5] Heller, Nicholas, et al. "The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge." Medical Image Analysis 67 (2021): 101821.