A Deep Framework for Bone Age Assessment based on Finger Joint Localization
Bone age assessment is an important clinical procedure for measuring the skeletal maturity of children and diagnosing growth disorders. Conventional approaches such as the Tanner-Whitehouse (TW) and Greulich and Pyle (GP) methods may not perform well because of their large inter-observer and intra-observer variations. In this paper, we propose a finger joint localization strategy that filters out most non-informative parts of the images. When it is combined with a conventional full image-based deep network, we observe a much-improved performance.
Skeletal hand bone age prediction is an essential problem in medicine. A gap between the skeletal maturity and the chronological age of a child can indicate a growth problem. Traditionally, the assessment is performed by a medical specialist who manually compares an X-ray image of the hand and wrist bones with reference standards. The two general streams of medical skeletal maturity prediction are Tanner-Whitehouse (TW) and Greulich and Pyle (GP). TW-based methods analyze certain parts of the hand bone, while the GP method usually utilizes the information from the whole hand. As shown in Fig 1, the specific regions of interest (ROIs) analyzed by TW are divided into epiphysis/metaphysis ROIs (EMROIs) and carpal ROIs (CROIs). Analyzing these ROIs manually is time-consuming, and the resulting diagnosis varies between doctors.
To solve the aforementioned problems of traditional methods, it is necessary to develop an automatic system that supports diagnosis based on radiographs. In recent years, a large number of deep learning-based studies have assessed bone age automatically [4, 5, 3]. Most of them treat the deep learning architecture as a black box, which cannot transfer the domain knowledge of traditional medical diagnosis to feature learning. Inspired by the ROIs of the TW method, we try to “mimic” the behavior of doctors in the sense that the discriminative regions, such as the finger joints (detailed in Fig 2) in our case, should be carefully addressed. Specifically, we build on the segmentation results of a deep U-Net and propose a localization strategy to extract finger joints. A deep convolutional neural network is trained to process all these finger joints and regress the age values. With finger joints as the only input, we show more competitive results than conventional whole image-based approaches. Moreover, when fusing the results of both approaches, we observe even better performance.
II Related Work
Recently, deep learning has been applied to a wide range of fields and tasks, including segmentation [6], edge detection [7], crowd counting [8], and the Internet of Things [9]. The unifying idea behind all of the above is the use of neural networks with many hidden layers to learn complex feature representations from raw data, rather than relying on handcrafted feature extraction. Under this umbrella, some deep learning approaches have also been introduced in the medical field. One recent instance is the combination of a VGG16 network [10] and Gaussian process regression to reduce the variance of the GP and TW methods [11]. A comparison of the prediction results of three ROIs showed that combining the whole hand with the three ROIs could improve the accuracy of bone age prediction. A novel deep network was also introduced to assess skeletal maturity on pediatric hand radiographs [12].
As illustrated in Fig 2 and Fig LABEL:fig:part, one can observe that the gap between different joints of the finger varies with age. Motivated by this, we study the feasibility of improving bone age prediction with finger joints. To this end, we propose a novel finger joint localization strategy to extract finger joints. When it is combined with conventional full image-based methods, we observe a significant performance boost.
The system overview of the joint extraction is shown in Fig 5. Firstly, the hand bones are segmented from the original images, and then, the joints can be localized and extracted accordingly.
III-A Joint Localization
III-A1 Bone Segmentation
We adopt a U-Net [13] (shown in Fig 4) to segment the bones from the radiographs. In U-Net, the “encoder-decoder” structure reduces the resolution of the input in the encoding stage and upsamples the feature maps in the decoding stage. Through skip connections, the resolution is recovered by concatenating multi-scale feature maps with the upsampled feature maps, which helps segmentation when very few labels are available. In addition, U-Net is one of the most common segmentation architectures in medical imaging.
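As done in other vision tasks [14, 15], we also adopt a dense CRF to improve the spatial coherence and quality of the segmentation. More specifically, every pixel $i$, which is regarded as a node, has a label $x_i$ and an observation value $I_i$, and the relationships among pixels are regarded as edges. The labels $\mathbf{x}$ behind the pixels can be inferred from the observations $\mathbf{I}$, and the CRF is characterized by a Gibbs distribution,
$$P(\mathbf{x} \mid \mathbf{I}) = \frac{1}{Z(\mathbf{I})} \exp\bigl(-E(\mathbf{x} \mid \mathbf{I})\bigr),$$
where $E(\mathbf{x} \mid \mathbf{I})$ is the Gibbs energy of a labeling $\mathbf{x}$ and $Z(\mathbf{I})$ is the normalizing partition function. The energy is formulated as follows,
$$E(\mathbf{x} \mid \mathbf{I}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),$$
in which the unary potential $\psi_u(x_i)$ is given by the output of U-Net, and the pairwise potential in our model is given by
$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j),$$
where each $k^{(m)}$ is a Gaussian kernel, the vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ are feature vectors for pixels $i$ and $j$, $w^{(m)}$ are linear combination weights, and $\mu$ is a label compatibility function.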
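To make the dense CRF energy concrete, the following is a minimal sketch that evaluates a unary-plus-pairwise Gibbs energy on a toy example. Several choices here are assumptions for illustration and are not taken from the paper: a single Gaussian kernel of the form exp(-||f_i - f_j||^2 / (2 theta^2)), a Potts label compatibility (1 when the labels differ, 0 otherwise), a unary term equal to the negative log of the network's foreground probability, and the hypothetical names `gaussian_kernel` and `gibbs_energy`.

```python
import math
from typing import Sequence


def gaussian_kernel(f_i: Sequence[float], f_j: Sequence[float], theta: float = 1.0) -> float:
    """Assumed kernel form: exp(-||f_i - f_j||^2 / (2 * theta^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    return math.exp(-sq_dist / (2.0 * theta ** 2))


def gibbs_energy(labels: Sequence[int],
                 fg_prob: Sequence[float],
                 features: Sequence[Sequence[float]],
                 weight: float = 1.0,
                 theta: float = 1.0) -> float:
    """Energy = sum of unary potentials + sum of pairwise potentials over i < j."""
    # Unary term (assumption): negative log-probability of the chosen label,
    # where fg_prob[i] plays the role of the segmentation network output.
    unary = 0.0
    for x, p in zip(labels, fg_prob):
        unary += -math.log(p if x == 1 else 1.0 - p)

    # Pairwise term with a Potts compatibility (assumption): only pairs with
    # different labels are penalized, weighted by feature similarity.
    pairwise = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                pairwise += weight * gaussian_kernel(features[i], features[j], theta)
    return unary + pairwise


# Toy data: three pixels on a line, the middle one with an uncertain prediction.
fg_prob = [0.9, 0.6, 0.9]
features = [(0.0,), (1.0,), (2.0,)]
smooth = [1, 1, 1]   # spatially coherent labeling
noisy = [1, 0, 1]    # isolated label flip in the middle
print(gibbs_energy(smooth, fg_prob, features), gibbs_energy(noisy, fg_prob, features))
```

In this toy setting the coherent labeling has the lower energy, which is the effect the CRF refinement is meant to encourage. A practical dense CRF uses efficient approximate inference rather than this quadratic loop.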
III-A2 Joint Localization
After segmenting the finger into distinct parts, as illustrated in Fig. 5, the minimum-area rotated rectangle enclosing each part is drawn. The centroid of each rotated rectangle is then computed. For each pair of adjacent centroids, their midpoint is taken to represent the location of the finger joint between them. Finally, we crop a fixed-size rotated rectangle around the location of each finger joint. The whole process is illustrated in Fig. 5.
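The step from bone-part centroids to joint locations can be sketched as follows. This is only an illustration under stated assumptions: the centers of the enclosing rectangles are taken as already computed and ordered along the finger, the helper name `joint_locations` is hypothetical, and the coordinates are made-up pixel values rather than data from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def joint_locations(part_centers: Sequence[Point]) -> List[Point]:
    """Return the midpoint between each pair of adjacent bone-part centers.

    `part_centers` is assumed to be ordered along one finger, for example
    from the base of the finger to the fingertip.
    """
    joints: List[Point] = []
    for (x0, y0), (x1, y1) in zip(part_centers, part_centers[1:]):
        joints.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    return joints


# Three bone parts of one finger give two joints between them.
centers = [(10.0, 100.0), (12.0, 60.0), (13.0, 30.0)]
print(joint_locations(centers))  # [(11.0, 80.0), (12.5, 45.0)]
```

A finger segmented into k parts yields k - 1 joint locations, each of which would then serve as the center of a cropped patch.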
III-B Regression Model
The proposed joint localization strategy filters out most non-informative areas and keeps only the finger joints. These ROIs are assembled into a single image (as illustrated in Fig 5), which is then processed by a deep CNN. Following prior work, we project the gender information into 32 dimensions with a dense layer (also known as a fully connected layer) and concatenate it with the last layer of the deep network. A dense layer with 1000 neurons, followed by a dropout layer, is then appended. Finally, we regress the age value from these 1000 features.
Although the finger joint-based network achieves good results, as shown in Table I, some information is inevitably lost because of inaccurate segmentation. To compensate for this, our system also includes a network that receives the whole image as input. The final result is obtained by averaging the outputs of the two networks. A graphical demonstration of our system can be found in Fig 6. The loss function is the Mean Absolute Difference (MAD), also known as the Mean Absolute Error, defined as:
$$\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert,$$
where $N$ represents the number of samples, $y_i$ represents the physician-estimated bone age of sample $i$ in months, and $\hat{y}_i$ represents the model-estimated bone age for sample $i$. A lower MAD score represents a closer match with the annotation of a trained physician.
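The MAD metric and the fusion by averaging described above can be sketched in a few lines. The function names `mean_absolute_difference` and `fuse` are hypothetical, and the ages below are invented for illustration; they are not results from the paper.

```python
from typing import List, Sequence


def mean_absolute_difference(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |reference age - predicted age| over all samples (in months)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fuse(pred_a: Sequence[float], pred_b: Sequence[float]) -> List[float]:
    """Average the per-sample outputs of two networks."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have equal length")
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Made-up bone ages in months for three samples.
reference = [120.0, 96.0, 150.0]
whole_hand = [126.0, 90.0, 150.0]
finger_joints = [118.0, 100.0, 144.0]

fused = fuse(whole_hand, finger_joints)
print(mean_absolute_difference(reference, whole_hand))     # 4.0
print(mean_absolute_difference(reference, finger_joints))  # 4.0
print(mean_absolute_difference(reference, fused))          # 2.0
```

In this toy case the two networks err in different directions, so the average is closer to the reference than either input. Averaging does not guarantee an improvement in general.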
The batch size is set to 16, and the Adam optimization algorithm is used to train the model. The learning rate starts at 0.0008 and is reduced whenever no improvement is observed for 30 epochs (the patience). As done in other computer vision tasks, we augment the data by horizontal and vertical shifting and by zooming; the augmentation range is varied over [0.2, 0.4] in steps of 0.05 across experiments to improve the performance of the deep networks.
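The learning-rate policy described above can be sketched as a small reduce-on-plateau scheduler. The initial rate of 0.0008 and the patience of 30 epochs come from the text; the reduction factor is not stated there, so the value 0.5 below is an assumption, as is the class name `PlateauScheduler`.

```python
class PlateauScheduler:
    """Reduce the learning rate when the monitored loss stops improving."""

    def __init__(self, lr: float = 0.0008, patience: int = 30, factor: float = 0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor        # assumed value, not given in the text
        self.best = float("inf")    # best validation loss seen so far
        self.wait = 0               # epochs since the last improvement

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr


# Short demonstration with a patience of 3 instead of 30.
scheduler = PlateauScheduler(lr=0.0008, patience=3, factor=0.5)
for loss in [10.0, 10.0, 10.0, 10.0]:
    current = scheduler.step(loss)
print(current)  # the rate has been halved once after three stalled epochs
```

Deep learning frameworks ship their own versions of this policy; the sketch only shows the bookkeeping involved.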
IV Experiments and Results
The hand bone images come from the Kaggle competition held by the Radiological Society of North America (RSNA). The dataset contains 12611 training images of the hand, of which 6833 are male and 5778 are female. The testing set has 200 images. The age distribution in months differs between genders in the training dataset, as shown in Fig 7. We use 10089 of the 12611 training images for training and the remaining 2522 for validation. The 200 testing images provided by RSNA are used to evaluate the proposed method.
To show the advantage of the proposed method, we first compare it with the results of the two base networks. We build on the Inception-V3 model [16] because of its superior performance in practice. “Whole Hand” and “Finger Joints” in Table I stand for the networks with the whole image and the finger joints as inputs, respectively. From Table I, we observe that the proposed finger joint localization yields better performance than the whole image. In addition, when the two are fused, the performance is further boosted. This may be partly because the whole-image network compensates for the information lost in the finger joint localization procedure. We also observe better results when training with gender information.
|Without Gender||Whole Hand||8.99||9.05||9.33|
|With Gender||Whole Hand||7.31||6.78||7.28|
After verifying the effectiveness of the proposed finger joint localization strategy, we now compare the proposed method with the state-of-the-art method in [11]. This method combines deep learning with Gaussian process regression (GPR), which reduces the sensitivity of deep learning to changes in the input images such as flipping and rotation. Although remarkable results were reported, this method is impractical because of the very high computational cost of the 32 VGG networks it uses. We use the same training/evaluation protocol as in [11]. For a fairer comparison, we replace the Inception-V3 model with the same VGG19 network; the results are summarized in Table II. The results show that the proposed method outperforms the recently published method in [11]. Note that our method is also much simpler than [11], because we use only two VGG models while [11] uses 32.
The gaps between finger joints vary with age. Motivated by this, we propose a finger joint localization strategy to extract more informative ROIs for deep networks. The finger joints are assembled into a single image and processed by a deep convolutional neural network. We show that this strategy outperforms conventional whole image-based approaches. In addition, when fused together, the two networks yield an improved result that outperforms a recently proposed method with a much more complicated system architecture.
The work was supported by Singapore-China NRF-NSFC Grant [No. NRF2016NRF-NSFC001-111].
[1] Preedy, Victor R., ed. Handbook of growth and growth monitoring in health and disease. Vol. 1. Springer Science & Business Media, 2011.
[2] Tanner J M, Whitehouse R H, Cameron N, et al. Assessment of skeletal maturity and prediction of adult height (TW2 method)[M]. London: Academic Press, 1975.
[3] Spampinato C, Palazzo S, Giordano D, et al. Deep learning for automated skeletal bone age assessment in X-ray images[J]. Medical image analysis, 2017, 36: 41-51.
[4] Ebner T, Stern D, Donner R, et al. Towards automatic bone age estimation from MRI: localization of 3D anatomical landmarks[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2014: 421-428.
[5] Iglovikov V I, Rakhlin A, Kalinin A A, et al. Paediatric Bone age assessment using deep convolutional neural networks[M]//Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, Cham, 2018: 300-308.
[6] Liu Y, Jiang P T, Petrosyan V, et al. DEL: Deep Embedding Learning for Efficient Image Segmentation[C]//IJCAI. 2018: 864-870.
[7] Liu Y, Cheng M M, Hu X, et al. Richer convolutional features for edge detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 3000-3009.
[8] Shi Z, Zhang L, Liu Y, et al. Crowd counting with deep negative correlation learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 5382-5390.
[9] Chen Z, Zhang L, Jiang C, et al. WiFi CSI Based Passive Human Activity Recognition Using Attention Based BLSTM[J]. IEEE Transactions on Mobile Computing, 2018.
[10] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[11] Van Steenkiste T, Ruyssinck J, Janssens O, et al. Automated assessment of bone age using deep learning and Gaussian process regression[C]//2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2018: 674-677.
[12] Larson D B, Chen M C, Lungren M P, et al. Performance of a deep-learning neural network model in assessing skeletal maturity on pediatric hand radiographs[J]. Radiology, 2017, 287(1): 313-322.
[13] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015: 234-241.
[14] Hou Q, Cheng M M, Hu X, et al. Deeply supervised salient object detection with short connections[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3203-3212.
[15] Zhao Z, Zhang X, Chen C, et al. Semi-Supervised Self-Taught Deep Learning for Finger Bones Segmentation[J]. arXiv preprint arXiv:1903.04778, 2019.
[16] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2818-2826.