Relational Reasoning Network (RRN) for Anatomical Landmarking


Neslisah Torosdagli, Mary McIntosh, Denise K. Liberton, Payal Verma, Murat Sincan, Wade W. Han, Janice S. Lee, and Ulas Bagci

Corresponding author: ulasbagci@gmail.com

D. Liberton and J. Lee are with the Craniofacial Anomalies and Regeneration section, National Institute of Dental and Craniofacial Research (NIDCR), National Institutes of Health (NIH), Bethesda, MD. W. Han is with the Department of Otolaryngology - Head and Neck Surgery, Boston Children’s Hospital, Harvard Medical School, Boston, MA. N. Torosdagli, M. McIntosh, and U. Bagci are with the Center for Research in Computer Vision at the University of Central Florida, Orlando, FL.

Accurately identifying anatomical landmarks is a crucial step in deformation analysis and surgical planning for craniomaxillofacial (CMF) bones. Available methods require segmentation of the object of interest for precise landmarking. In contrast, our purpose in this study is to perform anatomical landmarking using the inherent relations of CMF bones without explicitly segmenting them. We propose a new deep network architecture, called relational reasoning network (RRN), to accurately learn the local and global relations of the landmarks. Specifically, we are interested in learning landmarks in the CMF region: the mandible, maxilla, and nasal bones. The proposed RRN works in an end-to-end manner, utilizing learned relations of the landmarks based on dense-block units and without the need for segmentation. Given a few landmarks as input, the proposed system accurately and efficiently localizes the remaining landmarks on the aforementioned bones. For a comprehensive evaluation of RRN, we used cone-beam computed tomography (CBCT) scans of 250 patients. The proposed system identifies the landmark locations very accurately even when there are severe pathologies or deformations in the bones. The proposed RRN has also revealed unique relationships among the landmarks that help us reason about the informativeness of the landmark points. RRN is invariant to the order of the landmarks, and it allowed us to discover the optimal configurations (number and location) of landmarks to be localized within the object of interest (mandible) or nearby objects (maxilla and nasal bones). To the best of our knowledge, this is the first algorithm of its kind to find anatomical relations of objects using deep learning.

Relational Reasoning, Anatomical Landmarking, Surgical Modeling, Deep Relational Learning, craniomaxillofacial bones


I Introduction

In the United States, more than million patients suffer from congenital or developmental deformities of the jaws, face, and skull due to trauma, deformities from tumor ablation, and congenital birth defects [1]. The number of patients who require orthodontic treatment is far larger. Accurate and fast segmentation and anatomical landmarking are crucial steps in the deformation analysis and surgical planning of the craniomaxillofacial (CMF) bones. However, manual landmarking is a tedious process and is prone to inter-operator variability. Considerable effort has gone into developing fully automated and accurate software for segmentation and anatomical landmarking [2, 3]. Despite this need, little progress has been made, especially for bones with high deformations (approximately 5 of the CMF deformities) arising from congenital and developmental deformities.

(a) Patient 1
(b) Patient 2
Fig. 1: Segmentation results (rendered in fuchsia) that were scored as “unacceptable segmentation” in [4]: a) patient with surgical intervention, b) patient with high variability in the bone.

Deep learning based approaches have become the standard choice for pixel-wise medical image segmentation due to their high efficacy [2, 4, 5]. However, segmentation is difficult to generalize, especially when there is a high degree of deformation or pathology. Figure 1 shows two challenging mandible cases in which the patients have surgical intervention (left) and high variability in the bone (right), causing a deep learning based segmentation algorithm to fail (leakage or under-segmentation). Current state-of-the-art landmarking algorithms mostly depend on the segmentation results, since locating landmarks is easier once their parent anatomy (the bones they belong to) is precisely known. However, if the underlying segmentation is poor, landmark localization errors are likely to be high, directly affecting the quantification process (severity measurement, surgical modeling, treatment planning).

We hypothesize that if explicit segmentation can be avoided for extremely challenging cases, landmark localization errors can be minimized. This would also lead to more widespread use of the landmarking procedure. CMF bones reside in the same anatomical space even when there is deformity or pathology. Thus, the overall global relationships of the landmark points should still be preserved despite severe localized changes. Based on this rationale, we claim that utilizing the local and global relations of the landmarks can help automatic landmarking without a strict need for segmentation.

(a) Mandible bone [4].
(b) Nasal bones and Maxilla [6, 7].
Fig. 2: Mandible and Maxilla/Nasal bone anatomies. a) Mandibular Landmarks: Menton, Condylar Left, Condylar Right, Coronoid Left, Coronoid Right, Infradentale, B point, Pogonion, and Gnathion. b) Maxillary Landmarks: Anterior Nasal Spine, Posterior Nasal Spine, A-Point, and Prosthion; Nasal Bones Landmark: Nasion.

I-A Background and Related Work

I-A1 Landmarking

Anatomical landmark localization approaches can broadly be grouped into three main categories: registration-based (atlas-based), knowledge-based, and learning-based [8]. Integrating shape and appearance increases the accuracy of the registration-based approaches. However, image registration is still an ill-posed problem, and when there are variations such as age (pediatrics vs. adults), missing teeth (very common in certain age groups), missing bone or bone parts, severe pathology (congenital or trauma), and imaging artifacts, the performance can be quite poor [3, 9, 10]. The same concerns apply to segmentation-based approaches.

Gupta et al. [11] developed a knowledge-based algorithm to identify 20 anatomical landmarks on CBCT scans. Despite the promising results, the seed is selected by template registration on the inferior-anterior region, where fractures are the most common. An error in the seed localization may easily lead to a sub-optimal outcome in such approaches. Zhang et al. [12] developed a regression forest-based landmark detector to localize CMF landmarks on CBCT scans. To address the spatial coherence of landmarks, image segmentation was used as a helper. The authors obtained a mean digitization error less than 2 for CMF landmarks. The following year, to reduce the mean digitization error further, Zhang et al. [2] proposed a deep learning based joint CMF bone segmentation and landmarking strategy. The authors used a context-guided multi-task fully convolutional neural network and employed 3D displacement maps to perceive the spatial locations of the landmarks. They obtained segmentation accuracy of and a mean digitization error of less than mm for identifying CMF landmarks. The major disadvantage of this state-of-the-art method was the memory constraint introduced by the redundant information in the displacement maps, such that only a limited number of landmarks can be learned using this approach. Since the strategy is based on joint segmentation and landmarking, it naturally shares the other disadvantage of segmentation-based methods: frequent failures for very challenging cases.

More recently, we integrated manifold (geodesic) information into a deep learning architecture to improve the robustness of segmentation-based strategies for landmarking [4], and obtained promising results, significantly better than the state-of-the-art methods. We also noticed that there is still room to improve the landmarking process, especially when pathology or bone deformation is severe. To fill this research gap, in this study we take a radically different approach by learning landmark relationships without segmenting bones. We hypothesize that the inherent relations of the landmarks in the CMF region can be learned by a relational reasoning algorithm based on deep learning. Although our proposed algorithm stems from this specific need in anatomical landmarking, the core idea of this work is inspired by recent studies in artificial intelligence (AI), particularly in robotics and the physical interactions of humans and robots with their environments, as described in further detail below.

I-A2 Relational Reasoning

The ability to learn relationships and infer reasons between entities and their properties is a central component of the AI field, but until recently it had proven very difficult to learn through neural networks [13]. In 2009, Scarselli et al. [14] introduced the graph neural network (GNN) by extending neural network models to process graph data that encode relationship information of the objects under investigation. Li et al. [15] proposed a machine learning model based on gated recurrent units (GRUs) to learn distributed vector representations from heap graphs. Despite the increasing use and promising nature of GNN architectures [16], there is limited understanding of their representational properties, which is often a necessity for the adoption of medical AI applications in clinics.

Fig. 3: Overview of the proposed RRN architecture: for a few given input landmarks, RRN utilizes both pairwise and combination of all pairwise relations to predict the remaining landmarks.

Recently, DeepMind team(s) published four important studies on relational reasoning and explored how objects in complex systems can interact with each other [17, 13, 18, 19]. Battaglia et al. [17] introduced interaction networks to reason about objects and relations in complex environments. The authors proposed a simple yet accurate system to reason about n-body problems, rigid-body collision, and non-rigid dynamics. The proposed system can predict the dynamics in the next step with an order of magnitude lower error and higher accuracy. Raposo and Santoro et al. [13] introduced a Relational Network (RN) to learn object relations from the scene description, hypothesizing that a typical scene contains salient objects which are typically related to each other by their underlying causes and semantics. Following this study, Santoro and Raposo et al. [18] presented another relational reasoning architecture for tasks such as visual question-answering, text-based question-answering, and dynamic physical systems. The proposed model answered most questions correctly. Lastly, Battaglia et al. [19] studied relational inductive biases to learn the relations of entities and presented graph networks. These four studies show promising approaches to the challenge of relational reasoning. To the best of our knowledge, such advanced reasoning algorithms have neither been developed for nor applied to medical imaging applications yet. Of note, medical AI applications require fundamentally different reasoning paradigms than conventional computer vision and robotics (e.g., the definition of salient objects). In this study, we focus on the anatomy-anatomy and anatomy-pathology relationships in an implicit manner.

I-B Summary of our contributions

  • To the best of our knowledge, the proposed method is the first in the literature to successfully apply spatial reasoning over anatomical landmarks for accurate and robust landmarking using deep learning.

  • Many anatomical landmarking methods, including our previous work [4, 11, 20], use bone segmentation as guidance for finding the location of the landmarks on the surface of a bone. The major limitation of such approaches is that an accurate segmentation may not always be available. Our proposed RRN system addresses this problem by enabling accurate prediction of anatomical landmarks without employing explicit object segmentation.

  • Since efficiency is a significant barrier for many medical AI applications, we explore new deep learning architecture designs for better efficiency in system performance. For this purpose, we utilize variational dropout [21] and targeted dropout [22] in our implementation for faster and more robust convergence of the landmarking procedure ( times faster than baselines).

  • Our data set includes highly variable bone deformities along with other challenges of CBCT scans. The proposed algorithm identifies anatomical landmarks accurately under these varying conditions, which indicates that it is robust (Table III). In our experiments, we find landmarks pertaining to the mandible, maxilla, and nasal bones (Figure 2).

The rest of this paper is organized as follows: we introduce our novel methodology and its details in Section II. In Section III, we present experiments and results and then we discuss strengths and limitations of our study in Section IV.

(a) Menton-Condylar Left Relation
(b) Menton-Coronoid Left Relation
(c) Menton-Condylar Right Relation
(d) Menton-Coronoid Right Relation
(e) Menton Relations
Fig. 4: For the input domain : (a)-(d) pairwise relations of landmark Menton: a) Menton-Condylar Left, b) Menton-Coronoid Left, c) Menton-Condylar Right, d) Menton-Coronoid Right; e) combined relations of Menton.

II Methods

II-A Overview and preliminaries

The most frequently deformed or injured CMF bone is the lower jaw bone (mandible), which is the only mobile CMF bone [23]. In our previous study [4], we developed a framework to segment the mandible from CBCT scans and identify the mandibular landmarks in a fully automated way. Herein, we focus on anatomical landmarking without the need for explicit segmentation, and extend the learned landmarks to other bones (maxilla and nasal). Overall, we seek answers to the following important questions:

  • Q1: Can we automatically identify all anatomical landmarks of a bone if only a subset of the landmarks are given as input? If so, what is the least effort for performing this procedure? In other words, how many landmarks are necessary and which landmarks are more informative to perform this whole procedure?

  • Q2: Can we identify anatomical landmarks of nasal and maxilla bones if we only know locations of a few landmarks in the mandible? In other words, do relations of landmarks hold true even when they belong to different anatomical structures (manifold)?

Although modern AI algorithms have made tremendous progress in solving problems in biomedical imaging, relations between objects within the data are often not modeled as separate tasks. In this study, we explore the inherent relations among anatomical landmarks at the local and global levels in order to examine whether such structured information can help anatomical landmark localization. Based on the morphological integration of the CMF bones, we claim that landmarks of the same bone should carry common properties of the bone, so that one landmark should give clues about the positions of the other landmarks with respect to a common reference. This reference is often chosen as the segmentation of the bone to enhance information flow, but in our study we reduce this reference from the whole segmented bone to a single reference landmark point. Throughout the text, we use the following definitions:

Definition 1: A landmark is an anatomically distinct point that helps clinicians make reliable measurements for assessing a condition, making a diagnosis, building a surgical model, or creating a treatment plan.

Definition 2: A relation is defined as a geometric property between landmarks. Relations between two landmarks might include the following geometric features: size, distance, shape, and other implicit structural information. In this study, we focus on pairwise relations between landmarks as a starting point.

Definition 3: A reason is defined as an inference about the relationships of the landmarks. For instance, when given as input, a few sparsely localized landmarks can help predict the remaining landmarks better than closely localized landmarks. The reason is that a sparsely localized input landmark configuration captures the anatomy of the region of interest and yields better global relationships of the landmarks.

Once the relationships of landmarks are learned effectively, we can use them to identify landmarks on the same or different CMF bones without the need for a precise segmentation. Towards this goal, we propose to learn the relationships of anatomical landmarks in two stages (illustrated in Figure 3). In the first stage, pairwise (local) relations of landmarks are learned (shown as function ) with a simple neural network algorithm based on dense-blocks (DBs). Figures 4(a)-4(d) show example pairwise relations for different pairs of mandible landmarks. There are five sparsely localized landmarks, and Figure 3 shows how we assess the relationship per landmark. In this example, the basis/reference is chosen as Menton; hence, four pairwise relations are illustrated. Figure 4(e) illustrates all four relationships (Figures 4(a)-4(d)) of the landmark Menton (reference) with respect to the other landmarks on the mandible.

In the second stage of the proposed algorithm (shown as function in Figure 3), we combine the pairwise relations of landmarks () with another neural network setting based on Relational Units (RUs).

II-B Relational Reasoning Architecture

Anatomical landmarking has been an active research topic for several years in the medical imaging field. However, how to build a reliable/universal relationship between landmarks for a given clinical problem remains an open question. While anatomical similarities at the local and global levels are agreed to serve towards viable solutions, thus far, features that can represent anatomical landmarks from medical images have not achieved the desired efficacy and interpretation [2, 24, 25, 26].

We propose a new network framework called RRN to learn pairwise and global relations of anatomical landmarks () through its units, called relational units (RUs). The relation of two landmarks encodes the major spatial properties of the landmarks. We explored two architectures for the RU: the first is a simple multi-layer perceptron (MLP) (Figure 5(b)) (similar to [13]), and the second is a more advanced architecture composed of Dense-Blocks (DBs) (Figure 5(c)). Both architectures are relatively simple compared to very dense, complex deep learning architectures. Our objective is to locate all anatomical landmarks by inputting a few landmarks to RRN, which reasons from the learned relationships of the landmarks and locates all other landmarks automatically.

(a) Overview of Relational Reasoning Network (RRN)
(b) MLP Relational Unit (RU) of RRN
(c) Dense Relational Unit (RU) of RRN
(d) Dense Block (DB) [4]
Fig. 5: Network Architecture; a) Relational Reasoning Network for -input landmarks : =, and is the average operator. b) Relation Unit (RU) composed of 2 DBs, convolution and concatenation (C) units. c) Dense Block (DB) architecture composed of layers and concatenation layers.

Figures 5(a), 5(b), and 5(c) summarize the proposed RRN architecture and its RU sub-architectures, respectively. In the pairwise learning/reasoning stage (stage 1), a 5-landmarks based system is assumed as an example network (other configurations are possible too; see the experiments and results section). Sparsely-spaced landmarks (Figure 4(e)) and their pairwise relationships are learned in this stage (). These pairwise relationships are later combined in a separate DB setting in (). It should be noted that this combination is employed through a joint loss function and an RU to infer average relation information. In other words, for each individual landmark, the combined relationship vector is assigned a secondary learning function through a single RU.

The RU is the core component of the RRN architecture, and it is designed as a unit with DB. Each RU is designed in an end-to-end fashion; hence, it is differentiable. For landmarks in the input domain, the proposed RRN architecture learns pairwise and combined relations (global) with a total of RUs. Therefore, depending on the number of input domain landmarks, RRN can be either shallow or dense.

Let and indicate vectors of input and output anatomical landmarks, respectively. Then, two stages of the RRN of the input domain landmarks can be defined as:


where is the mean pairwise relation vector of the landmark to every other landmark . The functions and are the functions with the free parameters and , and indicates a global relation (in other words, combined pairwise relations) of landmarks.
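Under assumed notation, the two stages described here can be sketched as a worked equation. The symbols below are illustrative choices and not necessarily those of the authors: $L_i$ is the $i$-th of $n$ input landmarks, $g_\theta$ is the pairwise RU with parameters $\theta$, $f_\phi$ is the global RU with parameters $\phi$, $\bar{g}_i$ is the mean pairwise relation of $L_i$ to every other input landmark, and $\hat{L}_{\mathrm{out}}$ is the predicted set of target landmarks.

```latex
\bar{g}_i = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} g_\theta(L_i, L_j),
\qquad
\hat{L}_{\mathrm{out}} = \frac{1}{n} \sum_{i=1}^{n} f_\phi\!\left(\bar{g}_i\right)
```

The first expression is the stage-1 mean over the $n-1$ pairwise relations of a reference landmark; the second is the stage-2 average of the per-landmark predictions.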

II-C (pairwise relation)

Given a few input landmarks (), our objective is to predict the spatial locations of the target domain landmarks () by using the spatial locations of the input domain landmarks (). We reason about the locations of the target domain landmarks with respect to the relative locations of the input domain landmarks. The RU function represents the relation of two input domain landmarks and where (Figures 4(a)-4(d)). The output of describes the relative spatial context of two landmarks, defined for each pair of input domain landmarks (pairwise relation in Figure 5(a)). For each input domain landmark , the structure of the manifold is captured through the mean of all pairwise relations (represented as in Equation 1).

II-D (global relation)

The mean pairwise relation is calculated with respect to each input domain landmark , and it is given as input to the second stage, where the global (combined) relation is learned. is an RU function, and the output of is the predicted coordinates of the target domain landmarks (). In other words, each input domain landmark learns and predicts the target domain landmarks through the RU function . The final prediction of the target domain landmarks is the average of the individual predictions of each input domain landmark, represented by in Equation 1. There are a total of RUs in the architecture. Note that the number of trainable parameters used for each experimental configuration is directly proportional to (Table II). Since all pairwise relations are leveraged under and with an averaging operation, we can conclude that RRN is invariant to the order of input landmarks (i.e., permutation-invariant).
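The two-stage flow and its order invariance can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `g_pair` and `f_global` are toy, hand-written stand-ins for the learned pairwise and global RUs (the real units are trained networks), and `count_rus` assumes one RU per ordered pair plus one global RU per input landmark, which matches the 25 RUs reported for the configuration with five sparsely-spaced inputs.

```python
def g_pair(a, b):
    """Toy stand-in for the pairwise RU: offset from a to b, followed by a."""
    return [bk - ak for ak, bk in zip(a, b)] + list(a)


def f_global(rel):
    """Toy stand-in for the global RU: maps a relation vector to one 3D point."""
    offset, anchor = rel[:3], rel[3:]
    return [p + 0.5 * o for p, o in zip(anchor, offset)]


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rrn_forward(landmarks):
    """Stage 1: mean pairwise relation per landmark. Stage 2: average the predictions."""
    preds = []
    for i, li in enumerate(landmarks):
        pair_rel = [g_pair(li, lj) for j, lj in enumerate(landmarks) if j != i]
        preds.append(f_global(mean_vec(pair_rel)))
    return mean_vec(preds)


def count_rus(n):
    """Assumed RU count: n*(n-1) pairwise units plus n global units."""
    return n * (n - 1) + n


# Hypothetical pixel-space coordinates for five input landmarks.
points = [[0.0, 0.0, 0.0], [10.0, 2.0, 1.0], [-10.0, 2.0, 1.0],
          [6.0, 8.0, 3.0], [-6.0, 8.0, 3.0]]
prediction = rrn_forward(points)
reordered = rrn_forward(points[::-1])
```

Because both stages reduce with a mean, reordering the inputs only reorders the terms being averaged, so `prediction` and `reordered` agree up to floating-point rounding.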

II-E Loss Function

The natural choice for the loss function is the mean squared error (MSE) because it is a differentiable distance metric measuring how well landmarks are localized/matched, and it allows output of the proposed network to be real-valued functions of the input landmarks. For input landmarks and target landmarks, MSE simply penalizes large distances between the landmarks as follows:


where are target domain landmarks (.
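As a rough illustration of this loss, the sketch below computes the mean, over the target landmarks, of the squared Euclidean distance between predicted and ground-truth coordinates. The function name `mse_loss` and the choice of normalization (dividing by the number of target landmarks only) are assumptions; the authors' exact formulation may differ, for example by also averaging over the per-input-landmark predictions.

```python
def mse_loss(pred, target):
    """Mean squared Euclidean distance between matched 3D landmark lists."""
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and of equal length")
    total = 0.0
    for p, t in zip(pred, target):
        # Squared Euclidean distance for one landmark.
        total += sum((pk - tk) ** 2 for pk, tk in zip(p, t))
    return total / len(target)


# Two hypothetical target landmarks; the second prediction is off by 2 along z.
loss = mse_loss([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                [[0.0, 0.0, 0.0], [1.0, 1.0, 3.0]])
```

Squaring the distance is what makes large localization errors dominate the loss, as the text notes.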

II-F Variational Dropout

Dropout is an important regularizer employed to prevent overfitting, at the cost of times increased training time on average [27]. For efficiency reasons, speeding up dropout is critical, and this can be achieved by variational Bayesian inference on the model parameters [21]. Given a training input dataset and the corresponding output dataset , the goal in RRN is to learn the parameters such that . In the Bayesian approach, given the input and output datasets , we seek the posterior distribution , with which we can predict the output for a new input point by solving the integral [28]:


In practice, this computation involves intractable integrals [21]. To obtain the posterior distributions, a Gaussian prior distribution is placed over the network weights [28] which leads to a much faster convergence [21].
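The predictive integral referred to in this subsection has a standard form in the Bayesian neural network literature. The notation below is assumed for illustration: $X$ and $Y$ are the training inputs and outputs, $\omega$ denotes the network parameters, and $x^{*}$, $y^{*}$ are a new input and its predicted output.

```latex
p\left(y^{*} \mid x^{*}, X, Y\right)
  = \int p\left(y^{*} \mid x^{*}, \omega\right)\, p\left(\omega \mid X, Y\right)\, d\omega
```

The posterior $p(\omega \mid X, Y)$ is the intractable term; variational approaches replace it with a tractable approximating distribution over the weights.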

II-G Targeted Dropout

As an alternative to conventional dropout, we also propose to use targeted dropout for better convergence. Given a neural network parameterized by , the goal is to find the optimal parameters such that the loss is minimized. For efficiency and generalization reasons, only the weights of highest magnitude in the network are employed. In this regard, the deterministic approach is to drop the lowest weights. In targeted dropout, using a target rate and a dropout rate , first a target set is generated from the lowest weights according to the target rate . Next, weights are stochastically dropped out from the target set with the dropout rate  [22].
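The two-step selection described above can be sketched on a flat list of weights. This is a simplified illustration under stated assumptions: the function name `targeted_dropout`, the parameter names `target_rate` and `drop_rate`, and the flat (rather than per-layer or per-unit) ranking are all hypothetical choices, not the authors' implementation.

```python
import random


def targeted_dropout(weights, target_rate, drop_rate, rng=None):
    """Zero out low-magnitude weights: pick candidates, then drop them at random."""
    rng = rng or random.Random()
    n_target = int(target_rate * len(weights))
    # Step 1: the target set holds the indices of the lowest-magnitude weights.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    target_set = set(by_magnitude[:n_target])
    # Step 2: each weight in the target set is dropped with probability drop_rate.
    return [0.0 if (i in target_set and rng.random() < drop_rate) else w
            for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
always_dropped = targeted_dropout(weights, target_rate=0.5, drop_rate=1.0)
never_dropped = targeted_dropout(weights, target_rate=0.5, drop_rate=0.0)
```

With `drop_rate=1.0` the procedure reduces to the deterministic magnitude pruning mentioned in the text; intermediate rates keep some low-magnitude weights active during training.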

II-H Landmark Features

Pairwise relations are learned through RU functions. Each RU accepts input features to be modelled as a pairwise relation. It is desirable for these features to characterize the landmarks and their interactions with other landmarks. Such input features can be learned through a more complicated network design or obtained through feature engineering. In this study, for simplicity, we define a set of simple yet explainable geometric features. Since RUs model relations between two landmarks ( and ), we use the coordinates of these landmarks (both in pixel and spherical space), their relative positions with respect to a well-defined landmark point (reference), and the approximate size of the mandible. The mandible size is estimated as the distance between the maximum and the minimum coordinates of the input domain mandibular landmarks (Table I). Finally, a -dimensional feature vector is used as input to the local relationship function . As the well-defined reference landmark, we use Menton (Me) as the origin of the Mandible (see Figure 2(a)).

Pairwise Feature (, )
3D pixel-space position of the
Spherical coordinate of the vector from landmark Menton () to   (, , )
3D pixel-space position of the
Spherical coordinate of the vector from landmark Menton to   (, , )
3D pixel-space position of the landmark Menton
Spherical coordinate of the vector from to   (, , )
Diagonal length of the bounding box capturing Mandible roughly, computed as the distance between the minimum and the maximum spatial locations of the input domain mandibular landmarks () in the pixel space.
TABLE I: Features of the input landmarks, used only in stage I. The feature vector includes only structural information.
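The features listed in Table I can be assembled with a short sketch. Several details are assumptions: the function names, the spherical convention (radius, polar angle from the z-axis, azimuth in the x-y plane), and the ordering of the entries are illustrative. Concatenating the seven listed entries as read from Table I (six 3D quantities plus one scalar) gives 19 values in this sketch.

```python
import math


def spherical(v):
    """Convert a 3D vector to (radius, polar angle, azimuth); assumed convention."""
    x, y, z = v
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return [r, theta, phi]


def sub(a, b):
    """Vector from point b to point a."""
    return [ak - bk for ak, bk in zip(a, b)]


def bbox_diagonal(points):
    """Diagonal of the axis-aligned bounding box of a set of 3D points."""
    mins = [min(c) for c in zip(*points)]
    maxs = [max(c) for c in zip(*points)]
    return math.dist(mins, maxs)


def pairwise_features(li, lj, menton, mandible_inputs):
    """Concatenate the Table I entries for the landmark pair (li, lj)."""
    return (list(li) + spherical(sub(li, menton))      # position of li, Menton -> li
            + list(lj) + spherical(sub(lj, menton))    # position of lj, Menton -> lj
            + list(menton) + spherical(sub(lj, li))    # position of Menton, li -> lj
            + [bbox_diagonal(mandible_inputs)])        # approximate mandible size


# Hypothetical pixel-space coordinates.
menton, li, lj = [0.0, 0.0, 0.0], [3.0, 0.0, 4.0], [0.0, 2.0, 0.0]
features = pairwise_features(li, lj, menton, [menton, li, lj])
```

Keeping both pixel-space and spherical coordinates gives the RU absolute positions as well as direction and distance relative to the reference landmark.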

III Experiments and Results

III-A Data Description

Anonymized CBCT scans of 250 patients (142 female and 108 male, mean age = 23.6 years, standard deviation = 9.34 years) were included in our analysis through an IRB-approved protocol. The data set includes both pediatric and adult patients with craniofacial congenital birth defects, developmental growth anomalies, trauma to the CMF, and surgical interventions. A CB MercuRay CBCT system (Hitachi Medical Corporation, Tokyo, Japan) was used to acquire the scans at mA and kVp. The radiation dosage for each scan was around mSv. To handle the computational cost, each patient’s scan was re-sampled from to . The in-plane resolution of the scans was noted (in mm) as either or . In addition, the following image-based variations exist in the data set: aliasing artifacts due to braces, metal alloy surgical implants (screws and plates), dental fillings, and missing bones or teeth [4].

The data was annotated independently by three expert interpreters, one from the NIH team and two from the UCF team. Among them, inter-observer agreement values were computed as approximately pixels. The experts used the freely available Slicer software for the annotations [4].

Experiment #   Configuration
1   -landmarks
2   3-Landmarks Regular
3   3-Landmarks Cross
4   -landmarks
5   -landmarks
TABLE II: Five experimental landmark configurations. : input landmarks, : output landmarks; RUs indicates the number of relational units.

III-B Data Augmentation

Our data set includes fully annotated landmarks of the mandibular, maxillary, and nasal bones. Because the number of samples was insufficient for training a deep learning algorithm, we applied data augmentation. In our study, the commonly used random scaling or rotations were not found to be useful for generating new landmark data, because such transformations would not generate new relations different from the original ones. Instead, we used random interpolation, similar to the landmarks of the active shape model [29]. Briefly, we interpolated (or ) randomly selected scans with a randomly computed weight per interpolation, merging the relation information of different scans into a new relation. We also added random noise to each landmark with a maximum in the range of pixels, defined empirically based on the resolution of the images as well as the observed high deformity of the bones. We generated approximately landmark sets.
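The interpolation-based augmentation can be sketched as a random convex combination of corresponding landmarks followed by bounded noise. This is an illustration under assumptions: the function name `augment`, the parameters `k` (number of scans mixed) and `max_noise` (noise bound in pixels), and the use of uniform noise are hypothetical, since the values and noise distribution used in the study are not given here.

```python
import random


def augment(landmark_sets, k=2, max_noise=0.0, rng=None):
    """Create one new landmark set by mixing k randomly chosen annotated sets."""
    rng = rng or random.Random()
    chosen = rng.sample(landmark_sets, k)
    # Random positive weights, normalized so that they sum to one.
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    weights = [w / total for w in raw]
    new_set = []
    for points in zip(*chosen):  # the same landmark across the chosen scans
        mixed = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(3)]
        # Perturb each coordinate by bounded uniform noise.
        new_set.append([c + rng.uniform(-max_noise, max_noise) for c in mixed])
    return new_set


# Two hypothetical scans, each with two landmarks in pixel space.
scan_a = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
scan_b = [[2.0, 2.0, 2.0], [12.0, 2.0, 2.0]]
sample = augment([scan_a, scan_b], k=2, max_noise=0.0, rng=random.Random(0))
```

Unlike a global rotation or scaling, mixing two scans changes the relative geometry between landmarks, which is the property the text identifies as necessary for generating new relations.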

III-C Evaluation Methods

We used the root-mean-squared error (RMSE) in the anatomical space (in mm) to evaluate landmarking accuracy. A lower RMSE indicates more accurate landmarking. For statistical significance comparisons of different methods and their variants, we used a P-value of as the cut-off threshold to define significance and applied t-tests where applicable.
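A minimal sketch of this metric is given below. It assumes that the RMSE for one landmark is taken over the per-scan Euclidean errors and that pixel coordinates are converted to millimetres with a per-axis voxel spacing; the function name `rmse_mm` and the `spacing_mm` parameter are hypothetical.

```python
import math


def rmse_mm(pred_px, true_px, spacing_mm):
    """RMSE (mm) of one landmark over several scans, from pixel coordinates."""
    squared_errors = []
    for p, t in zip(pred_px, true_px):
        # Scale each axis difference to mm before squaring.
        squared_errors.append(
            sum(((pk - tk) * s) ** 2 for pk, tk, s in zip(p, t, spacing_mm)))
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions and ground truth for one landmark in two scans.
predicted = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
annotated = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
error_unit = rmse_mm(predicted, annotated, (1.0, 1.0, 1.0))
error_half = rmse_mm(predicted, annotated, (0.5, 0.5, 0.5))
```

Reporting the error in millimetres rather than pixels makes results comparable across scans with different resolutions.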

III-D Input Landmark Configurations

In our experiments, there were three groups of landmarks (see Figure 2), defined by the bones on which they reside: Mandibular , Maxillary , and Nasal , where subscripts in denote the specific landmark in that bone:

  • ,

  • ,

  • .

In each experiment, as detailed in Table II, we designed a specific input set where , and . The target domain landmarks for each experiment were and such that . With carefully designed input domain configurations , and pairwise relationships of the landmarks in the input set, we seek the answers to the following questions previously defined as Q1 and Q2 in Section II:

  • What configuration of the input landmarks can capture the manifold of bones so that other landmarks can be localized successfully?

  • What is the minimum number and configuration of the input landmarks for successful identification of other landmarks?

Overall, we designed five different input landmark configurations called 3-landmarks regular, 3-landmarks cross, -landmarks, -landmarks and -landmarks (Table II). Each configuration is explained in the following section.

III-E Training

The MLP RU was composed of fully-connected layers, batch normalizations and ReLUs, as shown in Figure 5(b). The DB RU architecture contained DB, which was composed of layers with a growth-rate of . We used a batch size of for all experiments. For the -landmarks configuration, there were and trainable parameters for the MLP and the DB architectures, respectively. Using the MLP architecture, we trained the network for epochs on an Nvidia Titan-XP GPU with memory with regular dropout, compared to epochs with the variational and targeted dropout implementations. The DB architecture converged in around epochs, independent of the dropout implementation employed.

III-F Experiments and Results

We ran a set of experiments to evaluate the performance of the proposed system using -fold cross-validation. The experimental configurations are summarized in Table II, the error rates in Table III, and the corresponding renderings in Figure 6. In Table III, the method achieving the minimum error for a landmark is shown in the same color as that landmark.

(a) Patient-
(b) Patient-
(c) Patient-
Fig. 6: Landmark annotations using the -landmarks configuration: Ground truth in blue and computed landmarks in pink. a) Genioplasty/chin advancement (male 43 yo), b) Malocclusion (mandibular hyperplasia, maxillary hypoplasia) surgery (male 19 yo), c) Malocclusion (mandibular hyperplasia, maxillary hypoplasia) surgery (female 14 yo).
Method   Mandibular Landmarks
-Landmarks Regular (Dense)   -
-Landmarks Cross (Dense)   -
-landmarks Var. Dropout (MLP)   - - -
-landmarks (Dense)   - - -
-landmarks Var. Dropout (Dense)   - - -
-landmarks Targeted Dropout (Dense)   - - -
-landmarks (Dense)   - - -
-landmarks Var. Dropout (Dense)   - - -
-landmarks Targeted Dropout (Dense)   - - -
-landmarks (Dense)   - - - - - - -
Torosdagli et al. [4]  
Gupta et al. [11]   - - -
Method   Maxillary-Nasal Bone Landmarks
-Landmarks Regular (Dense)  
-Landmarks Cross (Dense)  
-landmarks Var. Dropout (MLP)  
-landmarks (Dense)  
-landmarks Var. Dropout (Dense)  
-landmarks Targeted Dropout (Dense)  
-landmarks (Dense)   -
-landmarks Var. Dropout (Dense)   -
-landmarks Targeted Dropout (Dense)   -
-landmarks (Dense)  
Torosdagli et al. [4]   - - - - -
Gupta et al. [11]   -
TABLE III: Landmark Localization Errors (mm). The symbol ’-’ means not applicable (N/A).
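For reference, a localization error in millimetres of the kind reported in Table III can be computed as the Euclidean distance between a predicted and a ground-truth landmark. The sketch below assumes that landmark coordinates are given as voxel indices and that the voxel spacing (in mm) is known; both the function name and the spacing values are assumptions made for this example.

```python
import math


def landmark_errors_mm(predicted, ground_truth, spacing=(1.0, 1.0, 1.0)):
    """Per-landmark Euclidean distance in mm.

    predicted, ground_truth: sequences of (x, y, z) voxel coordinates.
    spacing: physical size of one voxel along each axis, in mm (assumed known).
    """
    errors = []
    for p, g in zip(predicted, ground_truth):
        squared = sum(((pi - gi) * s) ** 2 for pi, gi, s in zip(p, g, spacing))
        errors.append(math.sqrt(squared))
    return errors


# Hypothetical coordinates, for illustration only.
pred = [(10.0, 12.0, 7.0), (40.0, 41.0, 20.0)]
truth = [(13.0, 16.0, 7.0), (40.0, 41.0, 22.0)]
errors = landmark_errors_mm(pred, truth, spacing=(0.5, 0.5, 0.5))
mean_error = sum(errors) / len(errors)
```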

Of the two RU architectures, the DB architecture proved more robust and faster to converge than the MLP architecture. For completeness, we report the performance of the MLP configuration only for the -landmark experiment (see Table III).

In the first experiment (Table II-Experiment 1), to gain an understanding of the performance of the RRN, we used the grouping of landmarks into sparsely-spaced and closely-spaced, as proposed in Torosdagli et al. [4]. We named our first configuration “-landmarks”, where the closely-spaced, maxillary, and nasal bone landmarks are predicted based on the relation of the sparsely-spaced landmarks (Table II). In the -landmarks RRN architecture, there are 25 RUs in total.

In the second experiment (Table II-Experiment 2), we explored the impact of a configuration with fewer input mandibular landmarks on the learning performance. Compared to the sparsely-spaced input landmarks used in the first experimental configuration, herein we learned the relation of landmarks, , and , and predicted the closely-spaced landmark locations (as in the -landmarks experiment) plus the superior-anterior landmarks and and the maxillary and nasal bones’ landmark locations. The network was composed of RUs. Training was relatively fast compared to the -landmarks configuration due to the low number of RUs. We named this method “3-Landmarks Regular”. After observing statistically similar accuracy compared to the -landmarks method for the closely-spaced landmarks (), and high error rates at the superior-anterior landmarks and , we set up a new experiment, which we named “3-Landmarks Cross”.

For the third experiment, we designed the 3-Landmarks Cross configuration (Table II-Experiment 3), in which we used superior-posterior and superior-anterior landmarks on the right and left sides, respectively. This network was similar to the 3-Landmarks Regular one in terms of the number of RUs used.

In the fourth experiment (Table II-Experiment 4), we evaluated the performance of the system in learning the closely-spaced mandibular landmarks and the maxillary landmarks using the relation information of the sparsely-spaced and the nasal bone landmarks; this configuration is named “-landmarks”. There are RUs in total in this configuration.

In the last experiment (Table II-Experiment 5), we aimed to learn the maxillary and nasal bone landmarks using the relation of the mandibular landmarks; hence, this network configuration is called “-landmarks”. The architecture was composed of RUs. Owing to the high number of RUs in the architecture, the training of this network was the slowest among all the experiments performed.
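The idea of reasoning over pairwise landmark relations, shared by all five configurations, can be illustrated with a relation-network style aggregation in the spirit of [18]. The sketch below is only an illustration under stated assumptions: the pairwise feature (Euclidean distance plus absolute coordinate differences) is hand-written and hypothetical, whereas the RUs in this work are learned, and summation over ordered pairs is an assumed aggregation that is not confirmed by the text. It does show why such an aggregation does not depend on the order in which the input landmarks are listed.

```python
import itertools
import math


def relation_feature(a, b):
    """Hypothetical pairwise feature for two 3D landmarks a and b."""
    diffs = [bj - aj for aj, bj in zip(a, b)]
    distance = math.sqrt(sum(d * d for d in diffs))
    return [distance] + [abs(d) for d in diffs]


def aggregate_relations(landmarks):
    """Sum the pairwise features over all ordered pairs of distinct landmarks."""
    total = [0.0] * 4
    for a, b in itertools.permutations(landmarks, 2):
        total = [t + f for t, f in zip(total, relation_feature(a, b))]
    return total


# Hypothetical input landmarks, for illustration only.
inputs = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (4.0, 0.0, 3.0)]
original = aggregate_relations(inputs)
shuffled = aggregate_relations([inputs[2], inputs[0], inputs[1]])
# The two aggregates agree up to floating-point rounding.
same = all(math.isclose(x, y) for x, y in zip(original, shuffled))
```

Because the sum runs over every ordered pair, reordering the input list changes only the order of the terms and not the aggregated result.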

For three challenging CBCT scans, Figure 6 presents the ground-truth and the predicted landmarks for the -landmarks configuration with the DB architecture, annotated in blue and pink, respectively. We evaluated the -landmarks configuration for both the MLP and the DB architectures using variational dropout as the regularizer (Table III). For all folds, we observed that the DB architecture was robust and fast to converge. Although the performances were statistically similar for the mandibular landmarks, this was not the case for the maxillary and the nasal bone landmarks, where the performance of the MLP architecture degraded notably more than that of the DB architecture.

-landmarks and -landmarks configurations (Table III) performed statistically similarly for the mandibular landmarks. Interestingly, both -landmarks configurations performed slightly better for the neighbouring bone landmarks. This reveals the importance of choosing the optimum number of landmarks in the configuration.

Comparing the -landmarks and -landmarks configurations (Table III), we observed that the -landmarks configuration is good at capturing the relations on the same bone. In contrast, the -landmarks configuration was good at capturing the relations on the neighbouring bones. Although the error rates were less than , the potentially redundant information induced by the landmark in the -landmarks configuration caused the performance to decrease notably for the mandibular landmarks compared to the -landmarks configuration.

The -landmarks configuration performed statistically similarly to the -landmarks configuration; however, due to the RUs employed for the -landmarks, its training was slower.

Although a direct comparison was not possible, we also compared our results with those of Gupta et al. [11]. Our judgment was based on the landmark distances. We found that our results were significantly better for all landmarks except the landmark. The framework proposed in [11] uses an initial seed point obtained by 3D template registration at the inferior-anterior region, where fractures are the most common. Consequently, any anatomical deformity that alters the anterior mandible may cause an error in the seed localization, which can lead to a sub-optimal outcome.

We evaluated the performance of the proposed system when variational [21] and targeted [22] dropouts were employed. Although there was no statistically significant difference in accuracy relative to regular dropout, both dropout implementations converged relatively fast, in around epochs compared to for regular dropout, with the MLP architecture. Hence, for the MLP architecture, the variational and targeted dropout implementations were far more efficient in terms of computational resources for our proposed system. This is particularly important because, when there is a large number of RUs, one may focus more on efficiency than on accuracy. When the DB architecture was employed, we did not observe any performance improvement among the different dropout implementations.
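To make the idea behind targeted dropout [22] concrete, the following simplified sketch restricts dropout to the lowest-magnitude weights. It is an illustration under assumptions, not the implementation used in our experiments: it operates on a flat list of weights rather than per output unit, applies no rescaling, and the parameter names `target_fraction` and `drop_rate` are hypothetical.

```python
import random


def targeted_dropout(weights, target_fraction, drop_rate, rng):
    """Zero low-magnitude weights at random.

    Only the `target_fraction` of weights with the smallest absolute value
    are candidates; each candidate is dropped with probability `drop_rate`.
    """
    n_target = int(target_fraction * len(weights))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    candidates = set(by_magnitude[:n_target])
    return [0.0 if i in candidates and rng.random() < drop_rate else w
            for i, w in enumerate(weights)]


rng = random.Random(0)
weights = [0.1, -2.0, 0.05, 1.5, -0.2, 3.0]  # hypothetical values
dropped = targeted_dropout(weights, target_fraction=0.5, drop_rate=0.5, rng=rng)
```

In this sketch the largest-magnitude weights are never dropped, which contrasts with regular dropout, where every unit is equally likely to be removed.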

IV Discussions and Conclusion

We proposed the RRN framework, which learns the spatial dependencies between the CMF region landmarks in an end-to-end fashion. We hypothesized that there is an inherent relation among the CMF landmarks that can be learned using the relational reasoning architecture, without the need for explicit segmentation.

In our experiments, we first evaluated this claim using a dataset with a high number of bone deformities in addition to other CBCT challenges. We observed that (1) despite the large number of deformities that may exist in the CMF anatomy, there is a functional relation between the CMF landmarks, and (2) RRN frameworks are strong enough to reveal this latent relation information. Next, we evaluated the detection performance of five different configurations of the input landmarks to find the optimum configuration. We observed that not all landmarks are equally informative for the detection performance. Some landmark configurations are good at capturing the local information, while others have both good local and good global prediction performance. Overall, the per-landmark error for the -landmarks configuration is less than , which is considered a clinically acceptable level of success.

In our implementation, we showed that networks can be integrated well into our platform as long as features are encoded via RUs. Hence, we do not have any restrictions on the choice of networks. One may ask whether specific parameters can be changed to improve the predictions. Such incremental explorations are kept outside the main paper but are worth exploring in a future study from an optimization point of view.

Our study has a number of limitations. For instance, we confined ourselves to manifold data only (the positions of the landmarks and their geometric relations) without the use of appearance information, because one of our aims was to avoid explicit segmentation in our system so that simple reasoning networks could be used. As an extension study, we will design a separate deep network to learn pairwise features instead of designing them ourselves. In parallel, we will incorporate appearance features from medical images to explore whether these features are superior to purely geometric features, or whether combined (hybrid) features can add value to the current research. One alternative way to pursue the research that we initiated herein will be to explore deeper and more efficient networks that can scale the problem up to a much wider platform, which is useful especially when a large number of landmarks is being explored. Also, landmark localization is inherently a regression problem; therefore, MSE suits our problem well. However, other loss functions can still be explored in a separate study for possible improvements.
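For concreteness, the MSE objective mentioned above can be written as a few lines of Python. This is a minimal sketch that assumes each landmark is a 3D coordinate and that the loss is averaged over all coordinates of all predicted landmarks; the function name and the averaging convention are assumptions, not details taken from our implementation.

```python
def mse_loss(predicted, target):
    """Mean squared error over all coordinates of all landmarks."""
    total = 0.0
    count = 0
    for p, t in zip(predicted, target):
        for pi, ti in zip(p, t):
            total += (pi - ti) ** 2
            count += 1
    return total / count


# Hypothetical predicted and ground-truth landmarks, for illustration only.
pred = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
truth = [(1.0, 2.0, 5.0), (4.0, 4.0, 6.0)]
loss = mse_loss(pred, truth)
```

Because squared differences grow quadratically, this loss penalizes a single badly misplaced landmark more heavily than several small deviations, which is one reason alternative losses may be worth exploring.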


  • [1] J. J. Xia, J. Gateno, and J. F. Teichgraeber, “A paradigm shift in orthognathic surgery: A special series part i,” Journal of Oral and Maxillofacial Surgery, vol. 67, no. 10, pp. 2093–2106, 2009.
  • [2] J. Zhang, M. Liu, L. Wang, S. Chen, P. Yuan, J. Li, S. G.-F. Shen, Z. Tang, K.-C. Chen, J. J. Xia, and D. Shen, Joint Craniomaxillofacial Bone Segmentation and Landmark Digitization by Context-Guided Fully Convolutional Networks.   Cham: Springer International Publishing, 2017, pp. 720–728.
  • [3] N. Anuwongnukroh, S. Dechkunakorn, S. Damrongsri, C. Nilwarat, N. Pudpong, W. Radomsutthisarn, and S. Kangern, “Accuracy of automatic cephalometric software on landmark identification,” IOP Conference Series: Materials Science and Engineering, vol. 265, no. 1, p. 012028, 2017.
  • [4] N. Torosdagli, D. K. Liberton, P. Verma, M. Sincan, J. S. Lee, and U. Bagci, “Deep geodesic learning for segmentation and anatomical landmarking,” IEEE Transactions on Medical Imaging, pp. 1–1, 2018.
  • [5] N. Torosdagli, D. K. Liberton, P. Verma, M. Sincan, J. Lee, S. Pattanaik, and U. Bagci, “Robust and fully automated segmentation of mandible from ct scans,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), April 2017, pp. 1209–1212.
  • [6] “The Nasal Bone, by Kenhub,” accessed: 2010-09-30.
  • [7] “The Maxilla, by Kenhub,” accessed: 2010-09-30.
  • [8] Dinggang Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual Review of Biomedical Engineering, vol. 19, no. 1, pp. 221–248, 2017.
  • [9] S., E. Bahrampour, E. Soltanimehr, A. Zamani, M. Oshagh, M. Moattari, and A. Mehdizadeh, “The accuracy of a designed software for automated localization of craniofacial landmarks on cbct images,” BMC Medical Imaging, vol. 14, no. 1, p. 32, Sep 2014.
  • [10] X. Li, Y. Zhang, Y. Shi, S. Wu, Y. Xiao, X. Gu, X. Zhen, and L. Zhou, “Comprehensive evaluation of ten deformable image registration algorithms for contour propagation between ct and cone-beam ct images in adaptive head and neck radiotherapy,” PLOS ONE, vol. 12, no. 4, pp. 1–17, 04 2017.
  • [11] A. Gupta, O. Kharbanda, V. Sardana, R. Balachandran, and H. Sardana, “A knowledge-based algorithm for automatic detection of cephalometric landmarks on cbct images,” International Journal of Computer Assisted Radiology and Surgery, vol. 10, no. 11, pp. 1737–1752, Nov 2015.
  • [12] J. Zhang, Y. Gao, L. Wang, Z. Tang, J. J. Xia, and D. Shen, “Automatic craniomaxillofacial landmark digitization via segmentation-guided partially-joint regression forest model and multiscale statistical features,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 9, pp. 1820–1829, Sept 2016.
  • [13] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia, “Discovering objects and their relations from entangled scene representations,” 2017.
  • [14] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, Jan 2009.
  • [15] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” 2015.
  • [16] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
  • [17] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu, “Interaction networks for learning about objects, relations and physics,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   USA: Curran Associates Inc., 2016, pp. 4509–4517. [Online]. Available:
  • [18] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. W. Battaglia, and T. P. Lillicrap, “A simple neural network module for relational reasoning,” in NIPS, 2017.
  • [19] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
  • [20] F. Lalys, S. Esneault, M. Castro, L. Royer, P. Haigron, V. Auffret, and J. Tomasi, “Automatic aortic root segmentation and anatomical landmarks detection for tavi procedure planning,” Minimally Invasive Therapy & Allied Technologies, vol. 0, no. 0, pp. 1–8, 2018, pMID: 30039720. [Online]. Available:
  • [21] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’15.   Cambridge, MA, USA: MIT Press, 2015, pp. 2575–2583. [Online]. Available:
  • [22] K. S. Y. G. G. H. A.N. Gomez, I. Zhang, “Targeted dropout,” December 2018. [Online]. Available:
  • [23] A.C.V. Armond, C. Martins, J. Glória, E. Galvão, C. dos Santos, and S. Falci, “Influence of third molars in mandibular fractures. part 1: mandibular angle—a meta-analysis,” International Journal of Oral and Maxillofacial Surgery, vol. 46, no. 6, pp. 716 – 729, 2017.
  • [24] U. Bagci, X. Chen, and J. K Udupa, “Hierarchical scale-based multiobject recognition of 3-d anatomical structures,” IEEE transactions on medical imaging, vol. 31, pp. 777–89, 12 2011.
  • [25] X. Chen and U. Bagci, “3d automatic anatomy segmentation based on iterative graph-cut-asm,” Medical Physics, vol. 38, no. 8, pp. 4610–4622. [Online]. Available:
  • [26] S. Rueda, J. K. Udupa, and L. Bai, “Shape modeling via local curvature scale,” Pattern Recognition Letters, vol. 31, no. 4, pp. 324 – 336, 2010, 20th SIBGRAPI: Advances in Image Processing and Computer Vision. [Online]. Available:
  • [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available:
  • [28] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   USA: Curran Associates Inc., 2016, pp. 1027–1035. [Online]. Available:
  • [29] X. Chen and U. Bagci, “3d automatic anatomy segmentation based on iterative graph-cut-asm,” Medical Physics, vol. 38, no. 8, pp. 4610–4622. [Online]. Available: