Residual Attention based Network for Hand Bone Age Assessment
Computerized automatic methods have been employed to improve the productivity and objectivity of hand bone age assessment. These approaches make predictions from the whole X-ray image, which includes other objects that may introduce distractions. Instead, our framework is inspired by the clinical workflow (Tanner-Whitehouse) of hand bone age assessment, which focuses on the key components of the hand. The proposed framework is composed of two components: a Mask R-CNN subnet for pixelwise hand segmentation and a residual attention network for hand bone age assessment. The Mask R-CNN subnet segments the hands from X-ray images to avoid the distractions of other objects (e.g., X-ray tags). The hierarchical attention components of the residual attention subnet force our network to focus on the key components of the X-ray images and generate the final predictions as well as the associated visual supports, which is similar to the assessment procedure of clinicians. We evaluate the performance of the proposed pipeline on the RSNA pediatric bone age dataset (http://rsnachallenges.cloudapp.net/competitions/4), and the results demonstrate its superiority over previous methods.
|E. Wu, B. Kong, X. Wang, J. Bai, Y. Lu, F. Gao, S. Zhang, K. Cao, Q. Song, S. Lyu, Y. Yin|
|Cornell University, Ithaca, NY, USA|
|CuraCloud Corporation, Seattle, WA, USA|
|Department of Computer Science, UNC Charlotte, Charlotte, NC, USA|
|Department of Computer Science, University at Albany, State University at New York, NY, USA|
Index Terms— hand bone age assessment, computer-aided diagnosis (CAD), deep learning
Although bone age assessment is crucial for evaluating the status of many diseases, the clinical procedure has remained almost unchanged since the seminal work of Greulich and Pyle (G&P) [1], in which doctors compare an X-ray of the hand, including the fingers and wrist, with an atlas of X-rays to determine the bone age. Since this process requires a significant amount of time and expertise to scrutinize X-rays, it is extremely tedious and error-prone. Additionally, it introduces intra- and inter-observer variability.
Computer-aided diagnosis (CAD) approaches to bone age assessment have been developed to address the above issues. For instance, BoneXpert [2], which relies on hand-engineered image processing, has been developed and approved for use in various countries. However, X-ray images vary significantly in size, shape, and mineralization, and these variations often lead to inaccurate age predictions.
Recently, deep learning with hierarchical structures has been adopted as the methodology of choice for many medical image analysis problems [3, 4], where it has demonstrated superior performance over other machine learning techniques. Several works have adopted deep learning to determine bone age from the whole X-ray image. For instance, Spampinato et al. [5] employed BoNet, with multiple convolution, deformation, and fully connected layers, to automatically learn to predict bone age. Nevertheless, other objects (e.g., X-ray tags and annotation markers) besides hands also exist in X-ray images. For instance, the right of Fig. 1 shows a hand X-ray image and its corresponding attention map (a larger value indicates more importance to the final prediction) generated from the deep learning model trained the same way as in . These objects act as noise that distracts the network toward unimportant regions of the image, thereby yielding a suboptimal result. Additionally, some images contain more than one hand (left of Fig. 1), which makes the prediction more challenging.
To address these issues, we present a unified deep learning network for simultaneous hand segmentation and bone age assessment. Our method consists of two subnets: a Mask R-CNN subnet for pixelwise hand segmentation and a residual attention network for hand bone age assessment. The Mask R-CNN subnet is responsible for segmenting the hands from X-ray images, thereby avoiding the distractions of other objects. The subsequent residual attention network is dedicated to the task of bone age assessment and leverages attention to force our network to focus on important bone regions. While a similar pipeline is adopted in [6], its segmentation and bone age assessment networks are isolated from each other. By contrast, in our method, these two subnets form two steps in a unified framework, which can be trained end-to-end.
In summary, we design a unified deep learning based framework for bone age assessment that is able to focus on important bone regions. This is achieved in two steps. First, the Mask R-CNN subnet extracts hand regions from the X-ray images to remove other objects and thus avoid their distractions. Second, a residual attention subnet is employed to force the network to automatically attend to important regions. We also evaluate our method on a large bone age assessment dataset, which demonstrates that our design is indeed essential for accurate prediction.
In this section, we present our approach for bone age estimation. As illustrated in Fig. 2, our model first uses Mask R-CNN [7] to mask out non-relevant pixels, such as those belonging to extra objects. It then uses a residual attention based network [8] to perform estimation on the masked image, yielding the predicted bone age in months as a single number.
2.1 Network Architecture
Mask R-CNN for hand segmentation: As mentioned in Sec. 1, the noisy background may draw undue attention to other parts of X-ray images. Therefore, we employ a segmentation network to remove the distractions. The current state-of-the-art image instance segmentation system is Mask R-CNN [7]. Mask R-CNN is based on Faster R-CNN [9], but adds a parallel branch for predicting object masks. Mask R-CNN's mask branch corrects for misalignment in RoI proposals using a RoIAlign layer, which significantly increases accuracy.
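To make the role of the segmentation step concrete, the following minimal sketch shows how a binary hand mask removes a tag-like region from a toy image. It does not reproduce Mask R-CNN itself: the function name `apply_hand_mask`, the toy pixel values, and the hand-written mask are all hypothetical and only illustrate the masking operation applied after segmentation.

```python
def apply_hand_mask(image, mask, background=0):
    """Keep pixels where the mask is truthy; set all others to `background`."""
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if keep else background for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 3x4 "X-ray": the bright last column stands in for an X-ray tag.
image = [
    [10, 200, 210, 255],
    [12, 190, 205, 255],
    [11, 15, 14, 255],
]
# Hypothetical binary mask, such as a segmentation subnet might produce.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
masked = apply_hand_mask(image, mask)
```

Only the pixels inside the mask survive, so a downstream regressor never sees the tag-like column.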
Residual attention network for bone age assessment: We then use the residual attention network for bone age assessment. The attention mechanism, which mirrors the way domain experts concentrate on the most informative regions, has recently gained popularity as a way to guide neural networks toward salient features. Attention weighs parts of the input differently to strengthen or diminish features in the main network layers. This is commonly accomplished by a separate branch that computes attention and is later merged back into the main branch through a weighting function.
As illustrated in Fig. 2, our residual attention subnet (residual units are omitted for simplicity) is composed of a convolution layer followed by 6 residual attention modules [8]. Each attention module contains a trunk branch and a soft mask branch. Given the input feature map $x$, the attention module generates the trunk branch output $T(x)$ and the soft mask map $M(x)$. The trunk branch contains only a few residual blocks and acts as a shortcut for data flow. Afterward, the attention mask is applied to the trunk branch as a multiplier, generating the attended feature map $H(x)$:
$$H(x) = M(x) \odot T(x) \qquad (1)$$
where $\odot$ denotes the Hadamard product.
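The multiplicative attention described above can be sketched in a few lines. This is an illustration only, under stated assumptions: the names `trunk`, `mask_logits`, and `attend` are hypothetical, the soft mask is assumed to be a sigmoid of the mask-branch output, and the feature maps are tiny 2x2 lists rather than real network activations. Note that Wang et al. [8] also describe a residual variant that multiplies by one plus the mask; the sketch follows the plain multiplier described in the text.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attend(trunk, mask_logits):
    """Hadamard (element-wise) product of trunk features and the soft mask."""
    return [
        [t * sigmoid(m) for t, m in zip(trunk_row, mask_row)]
        for trunk_row, mask_row in zip(trunk, mask_logits)
    ]


trunk = [[1.0, 2.0], [3.0, 4.0]]            # hypothetical trunk output
mask_logits = [[0.0, 20.0], [-20.0, 0.0]]   # hypothetical pre-sigmoid mask
attended = attend(trunk, mask_logits)
```

A large positive logit leaves the trunk feature nearly unchanged, a large negative logit suppresses it, and a logit of zero halves it.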
While the attention mechanism is also used in , it is only used there to visualize the parts of the image that are significant for bone age prediction. In this work, by contrast, attention is integrated into the network. Thus, our network is guided by attention to focus on important bone regions during training, so that more precise estimations can be made. We also add a gender input, which is concatenated to the input of the final fully connected layer.
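One way to read the gender input is as an extra scalar appended to the image features before the final regression layer. The sketch below illustrates that reading only; the encoding (1 for male, 0 for female), the feature values, the weights, and the function names are assumptions made for the example and are not taken from the trained network.

```python
def linear(inputs, weights, bias):
    """Single output unit of a fully connected layer: dot product plus bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def predict_age(image_features, gender, weights, bias):
    """Concatenate the gender flag to the image features, then regress."""
    fc_input = list(image_features) + [float(gender)]
    return linear(fc_input, weights, bias)


features = [0.5, 1.5, 2.0]           # hypothetical pooled image features
weights = [10.0, 20.0, 30.0, 6.0]    # last weight multiplies the gender flag
bias = 12.0
age_male = predict_age(features, 1, weights, bias)
age_female = predict_age(features, 0, weights, bias)
```

With identical image features, the two predictions differ only by the weight attached to the gender flag, which is how the final layer can learn a gender-dependent offset.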
2.2 Loss function
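Formally, the loss function during training is a combination of multiple tasks:
$$L = L_{cls} + L_{box} + L_{mask} + L_{MAE} \qquad (2)$$
where $L_{cls}$, $L_{box}$, and $L_{mask}$ are the classification loss, bounding-box loss, and per-pixel hand segmentation loss, respectively, defined as in [7]. $L_{MAE}$ is the standard L1 loss, i.e., the mean absolute error (MAE) between the predicted age $\hat{y}_i$ for the $i$-th X-ray image and the ground truth $y_i$:
$$L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \qquad (3)$$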
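As a small worked example of the multi-task objective, the sketch below computes the MAE over a toy batch and adds it to three placeholder segmentation-related loss values. The unweighted sum is an assumption, since the text does not state per-term weights, and all numeric values and function names are hypothetical.

```python
def mean_absolute_error(predicted, target):
    """Mean absolute error between predicted and ground-truth ages (months)."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def total_loss(cls_loss, box_loss, mask_loss, age_loss):
    """Unweighted sum of the four task losses (the weighting is an assumption)."""
    return cls_loss + box_loss + mask_loss + age_loss


predicted_ages = [120.0, 96.0, 150.0]   # hypothetical predictions, in months
true_ages = [126.0, 90.0, 150.0]        # hypothetical ground truth, in months
age_loss = mean_absolute_error(predicted_ages, true_ages)
loss = total_loss(0.25, 0.5, 0.25, age_loss)
```

Here the absolute errors are 6, 6, and 0 months, so the MAE term is 4 months and dominates the placeholder segmentation terms.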
Dataset, preprocessing, and evaluation metrics. Our framework was evaluated on the RSNA pediatric bone age dataset, which includes approximately 12,500 labeled images. All images have a gender associated with them. We used a 90%/10% training/validation split. The hand masks in the training set were segmented using the Canny edge detector on histogram-normalized images. The MAE (Eq. 3) was also used for evaluation. The network was trained using Nesterov SGD with a weight decay of 0.0001, a momentum of 0.9, and an initial learning rate of 0.01. The learning rate was automatically divided by 10 whenever the validation loss plateaued for 5 epochs. Standard data augmentation transformations (random cropping, resizing, rotation, and mirroring) were used.
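The learning-rate rule above can be sketched as a small plateau scheduler. This is a hedged illustration rather than the training code: the class name `PlateauSchedule` is hypothetical, and "plateaued for 5 epochs" is interpreted here as five consecutive epochs without a new best validation loss, which is one of several reasonable readings.

```python
class PlateauSchedule:
    """Divide the learning rate by `divisor` after `patience` stale epochs."""

    def __init__(self, lr=0.01, divisor=10.0, patience=5):
        self.lr = lr
        self.divisor = divisor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.divisor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
# Three improving epochs, five stale epochs, then one improving epoch.
val_losses = [12.0, 10.0, 9.0] + [9.5] * 5 + [8.0]
history = [schedule.step(v) for v in val_losses]
```

In this toy run the rate stays at 0.01 until the fifth stale epoch, where it drops to 0.001 and then remains there once the loss improves again.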
Comparison with baseline models. We compare our model with a VGG-16 baseline and with the state of the art [6], which reports results in two different settings: the first is obtained on whole hands with a combination of U-Net and a VGG-style neural network, and the second is obtained by heavy ensembling. In the second setting, the X-ray images were first registered so that the hands face the same direction, and three networks were trained on different regions of the images. The final predictions were generated by ensembling the results. The results are shown in Table 1. The MAE of our network was 7.38 months. This compares favorably with their MAE of 8.08 months for the whole hand, especially considering that our model was trained without any registration of images. It is also slightly lower than their MAE of 7.52 months for a multi-model ensemble, while using only a single network for regression.
|Method|MAE (months)|
|Iglovikov et al. [6] on whole hand|8.08|
|Iglovikov et al. [6] with ensemble|7.52|
Evaluation of Mask R-CNN subnet. We then evaluate the Mask R-CNN subnet. Running the final network without the Mask R-CNN subnet on images with tags results in an MAE of 12 months, which is much worse than that of the presented framework. We also show the attention maps from different attention modules. We take the masks from before the sigmoid in each attention module's soft mask branch and render the values as a heatmap. These individual attention maps are shown in Fig. 3. When there is no Mask R-CNN subnet (top two rows of Fig. 3), high attention is generated on X-ray tags in all the attention modules. By contrast, our full model performs well on these images. Figure 4 shows noisy images with background borders and tags (left), the segmented hands (middle), and the attention maps with the highest responses (right) in all 5 attention modules. Mask R-CNN isolates the hand relatively well, and the attention mechanism of the residual attention network then correctly focuses on the regions that are important for bone age.
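The step of turning pre-sigmoid mask values into a heatmap can be illustrated with a simple min-max rescaling to the range [0, 1], after which any colormap can be applied. The rescaling choice is an assumption, since the text only states that the values are mapped to a heatmap, and the function name and toy values are hypothetical.

```python
def normalize_for_heatmap(mask_logits):
    """Rescale pre-sigmoid mask values to [0, 1] by min-max normalization."""
    flat = [value for row in mask_logits for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant map carries no contrast; render it as all zeros.
        return [[0.0 for _ in row] for row in mask_logits]
    return [[(value - low) / (high - low) for value in row] for row in mask_logits]


logits = [[-2.0, 0.0], [2.0, 6.0]]   # hypothetical pre-sigmoid mask values
heat = normalize_for_heatmap(logits)
```

The smallest logit maps to 0 and the largest to 1, so the brightest heatmap regions correspond to the locations the module weights most strongly.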
Evaluation of the residual attention subnet. We can also determine what the network marks as important from the generated attention maps. From Fig. 3 (bottom two rows), it can be seen that earlier attention modules focus on the entire hand, while later modules focus on the parts of the hand more pertinent to deciding bone age. The highest average attention was on the carpals in Att3_1, Att3_2, and Att3_3. This is consistent with previous work indicating that the carpals are important in skeletal maturity assessment for infants and toddlers . In the other attention modules, there is more attention on the metacarpals and phalanges than on the carpals. We also quantitatively evaluate the attention in our residual attention network. The masks of attention modules Att1_1, Att2_1, Att2_2, and Att3_1 are sufficiently distinct that their removal would likely be highly detrimental. Instead, we remove Att3_2 and Att3_3, since their masks are visually similar to that of Att3_1. The results are shown in Table 2. Removing the attention from Att3_2 and Att3_3 increases the MAE by 1.69 months, demonstrating that attention is indeed essential for bone age assessment.
|Model|MAE (months)|
|Without Att3_2 and Att3_3|9.07|
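The ablation figures can be cross-checked with a short calculation. The two MAE values are those reported in the text; the variable names are illustrative, and the relative increase is a derived quantity that the text does not report.

```python
full_model_mae = 7.38   # months, reported for the full network
ablated_mae = 9.07      # months, reported without Att3_2 and Att3_3

# Absolute increase in MAE caused by removing the two attention modules.
increase = round(ablated_mae - full_model_mae, 2)

# The same increase expressed as a percentage of the full model's MAE.
relative_increase = round(100 * increase / full_model_mae, 1)
```

The absolute difference matches the 1.69-month increase stated in the text, which corresponds to an error roughly 23% higher than that of the full model.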
Evaluation of gender input. Finally, we evaluate the importance of gender for the prediction. To do this, we trained a separate network identical to the one in Fig. 2 except without the gender input. This network attained a mean absolute error of 10 months, which is more than two months higher than that of the network with a gender input. Therefore, we conclude that gender is an important factor for the network to determine bone age accurately.
In this study, we investigate building a deep learning based pipeline for automatic bone age assessment. Inspired by the clinical workflow (Tanner-Whitehouse) of bone age assessment, we build a unified network for age assessment. More specifically, it consists of two subnets: the Mask R-CNN subnet segments hands from the image to remove distractions from the background, and the residual attention subnet then generates the final prediction and the visual supports from the segmented hands. In this way, our network is more robust to noise.
Acknowledgement. This work was supported by the Shenzhen Municipal Government under grant KQTD 2016112809330877.
- [1] William Walter Greulich, Sarah Idell Pyle, and Thomas Wingate Todd, Radiographic Atlas of Skeletal Development of the Hand and Wrist, vol. 2, Stanford University Press, Stanford, 1959.
- [2] Hans Henrik Thodberg, Sven Kreiborg, Anders Juul, and Karen Damgaard Pedersen, “The BoneXpert method for automated determination of skeletal maturity,” IEEE TMI, vol. 28, no. 1, pp. 52–66, 2009.
- [3] Bin Kong, Xin Wang, Zhongyu Li, Qi Song, and Shaoting Zhang, “Cancer metastasis detection via spatially structured deep network,” in IPMI. Springer, 2017, pp. 236–248.
- [4] Bin Kong, Shanhui Sun, Xin Wang, Qi Song, and Shaoting Zhang, “Invasive cancer detection utilizing compressed convolutional neural network and transfer learning,” in MICCAI. Springer, 2018, pp. 156–164.
- [5] Concetto Spampinato, Simone Palazzo, Daniela Giordano, Marco Aldinucci, and Rosalia Leonardi, “Deep learning for automated skeletal bone age assessment in X-ray images,” Medical Image Analysis, vol. 36, pp. 41–51, 2017.
- [6] Vladimir Iglovikov, Alexander Rakhlin, Alexandr Kalinin, and Alexey Shvets, “Pediatric bone age assessment using deep convolutional neural networks,” arXiv preprint arXiv:1712.05053, 2017.
- [7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask R-CNN,” in ICCV. IEEE, 2017, pp. 2980–2988.
- [8] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang, “Residual attention network for image classification,” in CVPR. IEEE, 2017, pp. 6450–6458.
- [9] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015, pp. 91–99.
- [10] Hyunkwang Lee, Shahein Tajmir, Jenny Lee, Maurice Zissen, Bethel Ayele Yeshiwas, Tarik K Alkasab, Garry Choy, and Synho Do, “Fully automated deep learning system for bone age assessment,” Journal of Digital Imaging, vol. 30, no. 4, pp. 427–441, 2017.