# U-Net Training with Instance-Layer Normalization

###### Abstract

Normalization layers are essential in a Deep Convolutional Neural Network (DCNN). Various normalization methods have been proposed. The statistics used to normalize the feature maps can be computed at the batch, channel, or instance level. However, in most existing methods, the normalization for each layer is fixed. Batch-Instance Normalization (BIN) is one of the first methods to combine two different normalization methods and achieve diverse normalization for different layers. However, two potential issues exist in BIN: first, the Clip function is not differentiable at input values of 0 and 1; second, the combined feature map does not have a normalized distribution, which is harmful for signal propagation in a DCNN. In this paper, an Instance-Layer Normalization (ILN) layer is proposed, which uses the Sigmoid function for the feature map combination and cascades group normalization. The performance of ILN is validated on image segmentation of the Right Ventricle (RV) and Left Ventricle (LV) using U-Net as the network architecture. The results show that the proposed ILN outperforms traditional and popular normalization methods with noticeable accuracy improvements for most validations, supporting the effectiveness of the proposed ILN.

###### Keywords:

Instance-Layer Normalization, Deep Convolutional Neural Network, U-Net, Biomedical Image Segmentation.

## 1 Introduction

Biomedical image segmentation is a fundamental step in medical image analysis, e.g., 3D shape instantiation for organs [16] and prostheses [15, 14]. Most currently popular methods are based on Deep Convolutional Neural Networks (DCNNs), which train multiple non-linear modules for feature extraction and pixel classification with both higher automation and higher performance. One fundamental component in a DCNN is the normalization layer. Initially, one of the main motivations for normalization was to alleviate the internal covariate shift, where the layers' input distribution changes [3]. However, recent work argues that the normalization layer is beneficial because it increases the robustness of the networks to fluctuations associated with random initialization [2], or because it achieves a smoother optimization landscape [9]. In this paper, we keep this motivation question open and focus on normalization strategies.

For a feature map with dimension of $(N, H, W, C)$, where $N$ is the batch size, $H$ is the feature height, $W$ is the feature width, and $C$ is the feature channel, Batch Normalization (BN) [3][4] was the first proposed normalization method; it calculated the mean and variance of a feature map along the $(N, H, W)$ dimension, then re-scaled and re-translated the normalized feature map with additional trainable parameters to preserve the DCNN representation ability. Instance Normalization (IN) [10], which calculated the mean and variance along the $(H, W)$ dimension, was proposed for fast stylization. Layer Normalization (LN) [1], which calculated the mean and variance along the $(H, W, C)$ dimension, was proposed for recurrent networks. Group Normalization (GN) [12] calculated the mean and variance along the $(H, W)$ and multiple-channels dimension and was validated on image classification and instance segmentation. A review of these four normalization methods for training U-Net for medical image segmentation can be found in [17]. Weight normalization [8][13], based on a re-parameterization of the weights, was used in recurrent models and reinforcement learning. Batch Kalman normalization estimated the mean and variance considering all preceding layers [11].
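The reduction axes of the four methods can be made concrete with a minimal NumPy sketch. It assumes a channels-last feature map of shape `(N, H, W, C)` and training-time statistics only; the helper name `norm_stats` and the toy tensor sizes are illustrative and do not come from the paper.

```python
import numpy as np

def norm_stats(x, method, groups=None):
    """Return (mean, var) of a channels-last feature map x of shape (N, H, W, C)."""
    n, h, w, c = x.shape
    if method == "BN":      # one statistic per channel
        axes = (0, 1, 2)
    elif method == "IN":    # one statistic per sample and channel
        axes = (1, 2)
    elif method == "LN":    # one statistic per sample
        axes = (1, 2, 3)
    elif method == "GN":    # one statistic per sample and channel group
        x = x.reshape(n, h, w, groups, c // groups)
        axes = (1, 2, 4)
    else:
        raise ValueError(f"unknown method: {method}")
    return x.mean(axis=axes), x.var(axis=axes)

# Toy input (hypothetical sizes): batch of 2, 8x8 features, 32 channels.
x = np.random.default_rng(0).normal(size=(2, 8, 8, 32))
shapes = {m: norm_stats(x, m, groups=16)[0].shape for m in ("BN", "IN", "LN", "GN")}
print(shapes)
```

The shape of each returned mean shows how many independent statistics a method keeps: per channel for BN, per sample and channel for IN, per sample for LN, and per sample and group for GN.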

Recently, Nam et al. proposed Batch-Instance Normalization (BIN) [5], which combined BN and IN with a trainable parameter. However, two potential risks exist: 1) the trainable parameter was restricted to the range of [0, 1] with a Clip function, which is not differentiable at input values of 0 and 1; 2) the combined feature map no longer had a normalized distribution, which is harmful for signal propagation in a DCNN. In this paper, Instance-Layer Normalization (ILN) is proposed to combine IN and LN: 1) Sigmoid is used to avoid the non-differentiability of the Clip function at input values of 0 and 1; 2) an additional GN16 (GN with a group number of 16) is added after the combined feature map to ensure a normalized distribution. A widely applied and popular network architecture, U-Net [7], is used to validate the proposed ILN on Right Ventricle (RV) and Left Ventricle (LV) image segmentation. The proposed ILN outperforms existing normalization methods with noticeable accuracy improvements in most validations in terms of the Dice Similarity Coefficient (DSC).

## 2 Methodology

### 2.1 Instance-Layer Normalization

#### 2.1.1 Instance Normalization

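With a feature map $F$ of dimension $(N, H, W, C)$, IN calculates the mean and variance of $F$ as:

$$\mu_{n,c} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} F_{n,h,w,c}, \qquad \sigma_{n,c}^{2} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(F_{n,h,w,c}-\mu_{n,c}\right)^{2} \tag{1}$$

Then, the feature map is normalized as $F^{IN}$:

$$F^{IN}_{n,h,w,c} = \frac{F_{n,h,w,c}-\mu_{n,c}}{\sqrt{\sigma_{n,c}^{2}+\epsilon}} \tag{2}$$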

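where $\epsilon$ is a small value added for division stability. For the same feature map $F$, LN calculates the mean and variance as:

$$\mu_{n} = \frac{1}{HWC}\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C} F_{n,h,w,c}, \qquad \sigma_{n}^{2} = \frac{1}{HWC}\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C}\left(F_{n,h,w,c}-\mu_{n}\right)^{2} \tag{3}$$

where $F$ is normalized in a similar way to Equ. (2) to give $F^{LN}$. A trainable parameter $\rho$ is added to combine $F^{IN}$ and $F^{LN}$. In the original BIN [5], $\rho$ was clipped to be in the range of $[0, 1]$ with a Clip function, as shown in Figure 1.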

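However, the Clip function is not differentiable at input values of $0$ and $1$. In this paper, the Sigmoid function, which is differentiable everywhere, is applied to solve this potential issue:

$$F^{ILN} = \mathrm{Sigmoid}(\rho)\cdot F^{IN} + \left(1-\mathrm{Sigmoid}(\rho)\right)\cdot F^{LN} \tag{4}$$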
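The difference between the two gating functions can be illustrated with a small NumPy sketch that evaluates each gate and its derivative with respect to the trainable parameter. The function names are illustrative. Reporting the Clip derivative as 0 at exactly 0 and 1 is a convention chosen here, since the left and right derivatives differ at those points.

```python
import numpy as np

def clip_gate(rho):
    return np.clip(rho, 0.0, 1.0)

def clip_grad(rho):
    # 1 inside (0, 1), 0 outside. At exactly 0 and 1 the derivative is undefined;
    # this sketch reports 0 there (an assumed convention).
    return ((rho > 0.0) & (rho < 1.0)).astype(float)

def sigmoid_gate(rho):
    return 1.0 / (1.0 + np.exp(-rho))

def sigmoid_grad(rho):
    s = sigmoid_gate(rho)
    return s * (1.0 - s)

rho = np.array([-0.5, 0.0, 0.5, 1.0, 1.5])
print(clip_grad(rho))     # zero wherever rho is not strictly inside (0, 1)
print(sigmoid_grad(rho))  # strictly positive for every rho
```

Once a clipped parameter leaves $(0, 1)$ it receives no gradient from the gate, whereas the Sigmoid gate always passes a non-zero gradient. Note also that a parameter value of 0.5 maps to a Sigmoid weight of about 0.62 rather than 0.5.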

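An additional potential issue in the original BIN is that the combined feature map no longer has a mean of $0$ and a variance of $1$; this non-normalized distribution may be harmful for signal propagation in a DCNN. In this paper, we solve this issue by applying an additional GN16 to the combined feature map $F^{ILN}$:

$$\mu_{n,g} = \frac{1}{HWM}\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=gM+1}^{(g+1)M} F^{ILN}_{n,h,w,c} \tag{5}$$

$$\sigma_{n,g}^{2} = \frac{1}{HWM}\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=gM+1}^{(g+1)M}\left(F^{ILN}_{n,h,w,c}-\mu_{n,g}\right)^{2} \tag{6}$$

where $M = C // G$ is the channel number in each feature group, $//$ is exact division, $G = 16$ is the group number, and $g \in \{0, \dots, G-1\}$ indexes the groups. The feature map is normalized in a similar way to Equ. (2) as $F^{GN}$. Following BN [3], additional parameters $\gamma$ and $\beta$ are added to preserve the DCNN representation ability: $\gamma \cdot F^{GN} + \beta$.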
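The full layer described in this section can be sketched in NumPy as follows. This is a minimal forward-pass illustration under stated assumptions, not the paper's TensorFlow implementation: the feature map is channels-last, the combination parameter `rho` is a single scalar per layer, `gamma` and `beta` are per-channel vectors, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normalize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def iln_forward(x, rho, gamma, beta, groups=16, eps=1e-5):
    """Instance-Layer Normalization sketch for x of shape (N, H, W, C)."""
    n, h, w, c = x.shape
    f_in = normalize(x, (1, 2), eps)        # IN: per sample, per channel
    f_ln = normalize(x, (1, 2, 3), eps)     # LN: per sample
    gate = sigmoid(rho)                     # weight in (0, 1)
    f_mix = gate * f_in + (1.0 - gate) * f_ln
    # GN16: re-normalize the mixture per sample and per group of C // groups channels.
    grouped = f_mix.reshape(n, h, w, groups, c // groups)
    f_gn = normalize(grouped, (1, 2, 4), eps).reshape(n, h, w, c)
    return gamma * f_gn + beta              # per-channel re-scale and re-translate

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 16, 16, 32))   # toy input
y = iln_forward(x, rho=0.5, gamma=np.ones(32), beta=np.zeros(32))

# With gamma = 1 and beta = 0, every (sample, group) slice should be normalized.
g = y.reshape(2, 16, 16, 16, 2)
print(np.abs(g.mean(axis=(1, 2, 4))).max(), g.var(axis=(1, 2, 4)).mean())
```

The final check reflects the motivation for the cascaded GN16: whatever value the gate takes, the output of the layer has approximately zero mean and unit variance within each group before the re-scale and re-translate step.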

### 2.2 Experimental Setup

##### Network Architecture

A network architecture widely adopted in medical image segmentation, U-Net [7], was used as the fundamental network framework with four max-pooling layers. The starting feature channel number was 16. The normalization layer was added between the convolutional and ReLU layers. Cross-entropy was used as the loss function. Momentum Stochastic Gradient Descent (SGD) was used as the optimizer with the momentum set as 0.9. Weights were initialized with a truncated normal distribution whose standard deviation was set as a function of the channel number. Biases were initialized as 0.1. The trainable combination parameter $\rho$ was initialized as 0.5.
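For reference, one common form of the momentum SGD update with a momentum of 0.9 is sketched below. The learning rate and the toy objective are hypothetical, since this section does not state the learning rate; the accumulate-then-step form shown is one of several equivalent conventions.

```python
def momentum_sgd_step(w, grad, velocity, lr, momentum=0.9):
    """One update: accumulate the gradient, then step along the accumulated direction."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

# Toy problem (not from the paper): minimize f(w) = w^2, whose gradient is 2w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_sgd_step(w, 2.0 * w, v, lr=0.1)  # lr=0.1 is a hypothetical value
print(abs(w))
```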

##### Data Collections

6082 RV images [16], scanned with a 1.5T Magnetic Resonance Imaging (MRI) machine (Sonata, Siemens, Erlangen, Germany), with slice gap of 10mm, pixel spacing of 1.52mm, image size of , from 37 subjects mixed with Hypertrophic Cardiomyopathy (HCM) patients and asymptomatic subjects, from the atrioventricular ring to the apex were used for the validation. The ground truth was labeled by one expert with Analyze (AnalyzeDirect, Inc, Overland Park, KS, USA). rotations were applied to augment the training images. 12, 12, 13 subjects for each group were split randomly for three-fold cross validation. 805 LV images [6], from SunnyBrook MRI dataset, with subject number of 45, image size of , were used for the validation as well. rotations were applied to augment the training images. 15 subjects for each group were split randomly for three-fold cross validation.
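The subject-level three-fold split described above can be sketched as follows. The function name, the integer subject identifiers, and the random seed are illustrative; the only details taken from the text are the subject counts (37 and 45) and the group sizes.

```python
import random

def three_fold_subject_split(subject_ids, fold_sizes, seed=0):
    """Randomly partition subjects into folds; return one (train, test) pair per fold."""
    ids = list(subject_ids)
    assert sum(fold_sizes) == len(ids)
    random.Random(seed).shuffle(ids)
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(ids[start:start + size])
        start += size
    splits = []
    for k, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        splits.append((train, test))
    return splits

rv_splits = three_fold_subject_split(range(37), fold_sizes=(12, 12, 13))
lv_splits = three_fold_subject_split(range(45), fold_sizes=(15, 15, 15))
print([(len(train), len(test)) for train, test in rv_splits])
```

Splitting by subject rather than by image keeps all slices of one subject on the same side of the train/test boundary.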

##### Implementation

As the proposed ILN needs to manipulate intermediate feature maps, the U-Net framework was implemented with low-level TensorFlow functions (tf.nn). To ensure a fair comparison, all normalization methods were re-implemented in the same framework as the ILN implementation instead of using the high-level TensorFlow Application Programming Interface (API) that exists for some normalization methods in the TensorFlow library, such as those used in [17].

##### Experiments

To assess the advantage of using the Sigmoid function over the Clip function (in the original BIN [5]), three comparison experiments were set up: 1) using the Clip function with one trainable parameter for the IN feature map while the parameter for the LN feature map is $1-\mathrm{Clip}(\rho)$; 2) using the Sigmoid function with one trainable parameter for the IN feature map while the parameter for the LN feature map is $1-\mathrm{Sigmoid}(\rho)$; 3) using the Softmax function with two trainable parameters for the IN and LN feature maps respectively. Comparison results are shown in Section 3.1.

To assess the advantage of adding GN16 after the combined feature map, two comparison experiments, with and without GN16, were conducted. Results are shown in Section 3.2. Eight randomly selected segmentation examples are shown in Section 3.3 for intuitive illustration. As GN16 performed similarly to IN [17], no normalization, IN, LN, and GN4 were chosen as the baselines to validate the performance of the proposed ILN, as presented in detail in Section 3.4. The training curves of $\rho$ at eight randomly selected layers are shown in Section 3.5. In this paper, RV-1 refers to the cross validation that uses the first group of RV data for testing and the second and third groups of RV data for training. The same notation applies to the RV-2, RV-3, LV-1, LV-2, and LV-3 experiments.

## 3 Results

### 3.1 Sigmoid vs. Clip vs. Softmax Function

The mean ± std segmentation DSCs of using the Clip, Sigmoid, and Softmax functions to combine the IN and LN feature maps are shown in Table 1. We can see that the Sigmoid function achieves the highest DSC for all cross validations except the RV-1 experiment, which supports the effectiveness of the method proposed in this paper: replacing the Clip function in the original BIN [5] with the Sigmoid function.

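Table 1: Mean ± std segmentation DSCs of using the Clip, Sigmoid, and Softmax functions to combine the IN and LN feature maps.

Method | RV-1 | RV-2 | RV-3 | LV-1 | LV-2 | LV-3 |
---|---|---|---|---|---|---|
Clip | 0.702 ± 0.295 | 0.707 ± 0.299 | 0.666 ± 0.319 | 0.900 ± 0.099 | 0.864 ± 0.184 | 0.804 ± 0.246 |
Sigmoid | 0.692 ± 0.304 | 0.724 ± 0.284 | 0.675 ± 0.301 | 0.903 ± 0.118 | 0.888 ± 0.135 | 0.828 ± 0.189 |
Softmax | 0.688 ± 0.290 | 0.720 ± 0.279 | 0.664 ± 0.323 | 0.895 ± 0.151 | 0.866 ± 0.153 | 0.827 ± 0.228 |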
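For readers unfamiliar with the metric, the DSC of a binary segmentation and its mean and standard deviation over a test set can be computed as in the sketch below. The paper does not state how empty masks are handled or at which level the scores are aggregated; scoring two empty masks as 1.0 and aggregating per image are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def mean_std_dice(preds, truths):
    """Mean and standard deviation of per-image DSCs (assumed aggregation level)."""
    scores = np.array([dice(p, t) for p, t in zip(preds, truths)])
    return scores.mean(), scores.std()

truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True                 # 4 foreground pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True                  # 6 predicted pixels, 4 of them overlapping
print(dice(pred, truth))               # 2 * 4 / (6 + 4) = 0.8
```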

### 3.2 With or Without GN16

The mean ± std segmentation DSCs of adding or not adding GN16 after the combined feature map of IN and LN are shown in Table 2. We can see that the method with GN16 achieves the highest DSC for all cross validations except the LV-3 experiment. This result supports the effectiveness of adding GN16 after the combined feature map and indicates the importance of maintaining the normalized distribution of feature maps.

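Table 2: Mean ± std segmentation DSCs with and without GN16 after the combined feature map.

GN16 | RV-1 | RV-2 | RV-3 | LV-1 | LV-2 | LV-3 |
---|---|---|---|---|---|---|
No | 0.692 ± 0.304 | 0.724 ± 0.284 | 0.675 ± 0.301 | 0.903 ± 0.118 | 0.888 ± 0.135 | 0.828 ± 0.189 |
Yes | 0.714 ± 0.290 | 0.737 ± 0.267 | 0.680 ± 0.305 | 0.919 ± 0.098 | 0.893 ± 0.127 | 0.827 ± 0.211 |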

### 3.3 Segmentation Examples

Eight segmentation examples were selected randomly from the RV and LV data to show the segmentation details in Figure 2. For most cases, both the RV and LV are segmented properly. However, for cases near the RV apex, e.g., the fourth image in the first row, the segmentation quality is worse. This might be due to tissue adhesion and the small size of the RV.

### 3.4 Comparison to Other Methods

The mean ± std segmentation DSCs of using no normalization, IN, LN, GN4, and the proposed ILN with the U-Net framework are shown in Table 3. We can see that, except for the LV-3 experiment, the proposed ILN outperforms all other traditional methods with considerable accuracy improvements. This result supports the effectiveness of the proposed ILN in medical image segmentation.

Table 3: Mean ± std segmentation DSCs of no normalization, IN, LN, GN4, and the proposed ILN with the U-Net framework.

Method | RV-1 | RV-2 | RV-3 | LV-1 | LV-2 | LV-3 |
---|---|---|---|---|---|---|
None | 0.688 ± 0.296 | 0.678 ± 0.318 | 0.661 ± 0.323 | 0.899 ± 0.134 | 0.872 ± 0.167 | 0.784 ± 0.280 |
IN | 0.709 ± 0.266 | 0.715 ± 0.278 | 0.655 ± 0.327 | 0.905 ± 0.114 | 0.876 ± 0.131 | 0.836 ± 0.207 |
LN | 0.702 ± 0.287 | 0.718 ± 0.270 | 0.662 ± 0.309 | 0.898 ± 0.120 | 0.858 ± 0.187 | 0.793 ± 0.262 |
GN4 | 0.679 ± 0.303 | 0.701 ± 0.291 | 0.671 ± 0.309 | 0.908 ± 0.113 | 0.841 ± 0.196 | 0.800 ± 0.255 |
ILN | 0.714 ± 0.290 | 0.737 ± 0.267 | 0.680 ± 0.305 | 0.919 ± 0.098 | 0.893 ± 0.127 | 0.827 ± 0.211 |

### 3.5 Training Curves of $\rho$

The training curves of $\rho$ at eight layers, selected randomly from the LV-1 experiment, are shown in Figure 3. We can see that $\rho$ was trained to different values, so the proposed ILN achieved diverse normalization at different layers. As the ground truth of $\rho$ is not known and it is impossible to judge the correctness of the curves, a comparison between the training curves of ILN and BIN is not illustrated.

The CPU used is an Intel Xeon(R) E5-1650 v4 @ 3.60GHz × 12. The GPU used is an Nvidia Titan XP. Comparing ILN to IN, the parameter number increases by 22, as one parameter is added to each layer. The training time for 200 iterations increases from 34.8s to 36.5s due to the additional GN16 calculation.

## 4 Discussion

The proposed ILN strategy is generic and flexible. The three components, IN, LN, and GN16, could be replaced with other normalization methods. The proposed ILN framework is validated on medical image segmentation with a U-Net framework. We believe that it could also be useful for other tasks, although this needs further validation and exploration. The proposed ILN failed to achieve the highest DSC for the LV-3 experiment. This may be because the combination of IN, LN, and GN16 is not suitable for this experiment. In the future, the proposed ILN framework will be extended to combine more normalization methods.

## 5 Conclusion

To improve the accuracy of biomedical image segmentation based on U-Net, ILN was proposed to combine the feature maps of IN and LN with an additional trainable parameter and the Sigmoid function, and then to apply GN16 to the combined feature map. Although various normalization methods have been proposed, the noticeable DSC improvements of the proposed ILN support the importance of carefully tuning the normalization strategy when training DCNNs.

## References

- [1] (2016) Layer normalization. Stat 1050, pp. 21.
- [2] (2018) Understanding batch normalization. In NeurIPS, pp. 7705–7716.
- [3] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.
- [4] (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In NeurIPS, pp. 1945–1953.
- [5] (2018) Batch-instance normalization for adaptively style-invariant neural networks. In NeurIPS, pp. 2563–2572.
- [6] (2009) Evaluation framework for algorithms segmenting short axis cardiac MRI. The MIDAS Journal-Cardiac MR Left Ventricle Segmentation Challenge 49.
- [7] (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
- [8] (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NeurIPS, pp. 901–909.
- [9] (2018) How does batch normalization help optimization?. In NeurIPS, pp. 2488–2498.
- [10] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- [11] (2018) Batch kalman normalization: towards training deep neural networks with micro-batches. arXiv preprint arXiv:1802.03133.
- [12] (2018) Group normalization. In ECCV, pp. 3–19.
- [13] (2018) Understanding weight normalized deep neural networks with rectified linear units. In NeurIPS, pp. 130–139.
- [14] (2018) Real-time 3D shape instantiation from single fluoroscopy projection for fenestrated stent graft deployment. IEEE RAL 3 (2), pp. 1314–1321.
- [15] (2018) Towards automatic 3D shape instantiation for deployed stent grafts: 2D multiple-class and class-imbalance marker segmentation with equally-weighted focal U-Net. In 2018 IEEE/RSJ IROS, pp. 1261–1267.
- [16] (2018) A real-time and registration-free framework for dynamic shape instantiation. MedIA 44, pp. 86–97.
- [17] (2019) Normalization in training U-Net for 2D biomedical semantic segmentation. IEEE RAL.
- [18] (2019) Atrous convolutional neural network (ACNN) for biomedical semantic segmentation with dimensionally lossless feature maps. arXiv preprint arXiv:1901.09203.