i3PosNet: Instrument Pose Estimation from X-Ray

i3PosNet: Instrument Pose Estimation from X-Ray


Performing delicate Minimally Invasive Surgeries (MIS) forces surgeons to accurately assess the position and orientation (pose) of surgical instruments. In current practice, this pose information is provided by conventional tracking systems (optical and electro-magnetic). Two challenges render these systems inadequate for minimally invasive bone surgery: the need for instrument positioning with high precision and occluding tissue blocking the line of sight. Fluoroscopic tracking is limited by the radiation exposure to patient and surgeon. A possible solution is constraining the acquisition of x-ray images. The distinct acquisitions at irregular intervals require a pose estimation solution instead of a tracking technique. We develop i3PosNet (Iterative Image Instrument Pose estimation Network), a patch-based modular Deep Learning method enhanced by geometric considerations, which estimates the pose of surgical instruments from single x-rays. For the evaluation of i3PosNet, we consider the scenario of drilling in the otobasis. i3PosNet generalizes well to different instruments, which we show by applying it to a screw, a drill and a robot. i3PosNet consistently estimates the pose of surgical instruments better than conventional image registration techniques by a factor of 5 and more achieving in-plane position errors of and angle errors of . Additional factors, such as depth are evaluated to from single radiographs.


instrument pose estimation, modular deep learning, fluoroscopic tracking, cochlear implant, vestibular schwannoma removal


1 Introduction

MIS lead to shorter hospital stays due to smaller incisions and less operation trauma [1]. Recent years show a surge of MIS for bone surgery, e.g. Lateral bone surgery, where clinical instrument positioning needs to be more accurate than \SI0.5\milli\meter [2]. To achieve this positioning accuracy, measured positions of surgical instruments and tools are required to be ten times more accurate (errors less than \SI0.05\milli\meter). In combination with the orientation, this would enable methods from Computer-Aided Intervention or robotic surgery to be exploited for similar applications. Optical or electro-magnetic tracking systems work well for soft tissue interventions, but fail, when the line-of-sight (LoS) is limited and sub-millimeter accuracies are required [3].

i3PosNet is a generalized, Iterative Deep Learning framework to determine the pose of Surgical Instruments from a single x-ray. The pose has five degrees of freedom ( for position and for orientation). We apply i3PosNet to lateral skull base surgery, e.g. cochlear implantation or vestibular schwannoma removal. The current clinical practice in lateral base surgery is to remove a large part of the otobasis in order to reveal all risk structures to the surgeon. Current research on navigation for cochlear implantation [4, 5, 6, 7] assumes the drill to be rigid and relies on tracking the drill at the tool’s base or measures the closeness to a risk structure, e.g. the facial nerve [8]. No image-based method has been proposed yet that captures the pose of surgical instruments in the otobasis.

We propose a novel method based on a modular Deep Learning approach with geometric considerations. Unlike end-to-end this modular approach predicts the positions of multiple landmark points and derives the pose (position, forward angle, projection angle and depth – all defined w.r.t. the projection geometry) from these positions in a follow-up step using their geometric relationship. The term “modular” is motivated from this divide-and-conquer-approach. i3PosNet consistently beats competing state-of-art instrument pose estimation techniques [9, 10] by a factor of .

Our proposed method finds the pose of an instrument on a x-ray image given an initial estimate of said pose. Initial poses are constrained to clinically plausible differences. This paper introduces three core concepts in its design: 1) the geometric conversion between instrument landmarks and the pose, 2) a statistically driven training dataset generation scheme and 3) an iterative patch-based pose prediction scheme.

In this work, we estimate the pose from a single image and evaluate poses w.r.t. the arrangement of the x-ray source and detector. The five dimensions of the pose are the in-plane position ( and ) and the depth as well as two rotations: 1) around the projection normal and 2) the rotation of the instrument’s main axis out of the image plane (projection angle). Since the instruments are rotationally symmetric, we ignore the rotation of the instrument around its own axis. This separation ensures the independence of components that demonstrate different degrees of estimation accuracy.

We identify three challenges for instrument pose estimation using x-ray images: 1) the unavailability of ground truth poses, 2) the sensitivity to local and patient-specific anatomy [11] and 3) the poor generalization of hand-crafted instrument features.

A major challenge for all pose estimation techniques is the generation of images that are annotated with ground truth poses to learn from and compare with. To determine the projection parameters w.r.t. an instrument in a real-world c-arm setup, the instrument, detector and source position have to be measured. Due to the perspective projection nature of the c-arm, the required ground truth precision of the source w.r.t. the instrument embedded in the anatomy is not achievable to assert the desired pose estimation accuracy. The use of simulated images allows us to additionally control the distribution of the instrument pose and projection parameters.

The Deep Learning approach derives abstract feature representations of the instrument, that are independent of the anatomy. We show this by 3-fold cross-validation on three patient anatomies and three different instruments, a screw, a conventional drill (where the tip is tracked) and a custom drill robot (which has additional degrees of freedom, that are present in the images, but not determined by i3PosNet).

Additionally, we perform an extensive analysis of the design parameters of the convolutional neural network (CNN) including its layout and the optimizer parameters. We investigate optimal properties for the data set including the distributions for the image generation parameters and the chosen size of the training data set. The evaluation incorporates the analysis of method parameters such as iteration count, modular vs. end-to-end comparison and the dependence on initial pose estimates. Finally, we compare our results with a state-of-art registration-based pose estimation approach.

In this paper, we present three key contributions:

  • The first Deep Learning method* for instrument pose estimation (including depth) from single image fluoroscopy.

  • Generalization to multiple instruments (rigid and non-rigid) while patient-independent and no requirement of individual patient CT scans.

  • A large dataset* of x-ray images with exact reference poses and a method to generate these from statistical distributions.

* The code and the dataset will be made publicly available upon acceptance.

2 Related Work

2.1 Pose Estimation in Medical images

In Medical Imaging, Pose Estimation has been covered intensively with regards to two related research questions: C-Arm pose estimation (CBCT) and estimation of surgical instruments in endoscopic images.

While Registration [12, 13] is the dominant method for the estimation of the c-arm source and detector arrangement [14, 15, 16], recent direct regression approaches such as Bui et.al. [17] using a CNN-based PoseNet architecture show potential.

Deep Learning techniques are prevalent for the pose estimation of surgical instruments on endoscopic images [18, 19], but sub-pixel accuracy is not achieved - in part because the manual ground truth annotation do not allow it.

Instrument pose estimation and tracking on monochrome images (x-ray, fluoroscopy and cell tracking) typically rely on registration [9, 20, 10], segmentation [21, 11, 16] or matching of local features [22]. The latter two often leads to a feature-based registration.

Several specialized methods [22, 16, 11] are fine-tuned to specific instruments and cannot be applied to other instruments.

According to the classification by Markelj et.al. [12] 3D/2D-registration methods rely on an optimization loop to either minimize distances of feature points, maximize the similarity of images or match similar image gradients. The loop is built around a dimensional correspondence strategy (e.g. computational projection of 3D volume data to 2D) and evaluated after intra-operative images are available. A metric is used to compare the acquired data to a hypothesis (e.g. moving image) in order to increase the accuracy of said hypothesis. In contrast to this methodology, our method performs the 3D/2D correspondence a priori in the data generation so our model “learns” the geometry of the instrument. Additionally, we do not compare to a hypothesis but infer directly.

Litjens et.al. [23] provide an overview of approaches to boost registration performance by Deep Learning.

Miao et.al. [24] develop a registration approach based on convolutional neural networks, which we consider most related to i3PosNet. For three different clinical applications featuring objects without rotational symmetry, they show that they outperform conventional optimization and image metric-based registration approaches by a factor of up to 100. However our work differs significantly from Miao et.al. in five aspects: They use registration to determine the ground truth poses for training and evaluation; the size of their instruments (between \SI37\milli\meter to \SI110\milli\meter) is significantly larger than ours; they use multiple image patches and directly regress the rotation angles; they employ 974 specialized CNNs (non-Deep) and they operate on image differences between captured and generated images.

2.2 Key point Estimation

The usage of facial key points has been explored to estimate facial expressions [25] or for biometric applications. Sun et.al. [26] presented a key contribution introducing a Deep Neural networks to predict the position of facial key points. Similar techniques have been developed for human pose estimation [27] and robots [28].

Litjens et.al. [23] observe several deep learning approaches for landmark detection, which is complex for the direct regression of these landmarks in 3D data.

3 Materials

We generate Digitally Rendered Radiographs (DRR) from CT volumes and meshes of different surgical instruments.

3.1 Anatomies

To account for the variation of patient-specific anatomy, we consider three different conserved human cadaver heads captured by a SIEMENS SOMATOM Definition AS+. The slices of the transverse plane are centered around the otobasis and include the full cross-section of the skull.

Due to the conservation procedure, some tissue is bloated or shrunk and screws fix the skullcap to the skull. Additionally, calibration sticks are present in the exterior auditory channels.

3.2 Surgical Tools and Fiducials

We consider three surgical objects – referred to as surgical instruments: A medical screw, a conventional medical drill and a prototype drilling robot. We define the origin as the point of the instrument, whose position we inherently want to identify (c.f. rays in Fig. 1). The geometry of these instruments is defined by meshes exported from CAD models.

The non-rigid drilling robot consists of a spherical drilling head and two cylinders connected by a flexible joint. By flexing and expanding the joint in coordination with cushions on the cylinders, it creates non-linear access paths. We implement this additional degree of freedom at the joint by generating the corresponding mesh on the fly from a generative model.

The dimensions of the instruments are in line with typical MIS and bone surgery applications (drill diameter \SI3\milli\meter). This leads to bounding box diagonals of \SI6.5\milli\meter and \SI13.15\milli\meter for the screw and the rigid front part of the robot respectively. Despite a drill’s length, for the estimation of the tip’s pose we should only consider the tip to limit the influence of drill bending [5, 29].

Figure 1: Instrument Pose Estimation from single x-ray of screw, drill and robot
Parameter Training Evaluation
Orientation (Rotations)
  Object Offset
Table 1: Parameters of the DRR Generation with for Normal and for Uniform Distributions

3.3 Generation of Radiographs

Our DRR Generation pipeline is fully parameterizable and tailored to the instrument pose estimation use-case. We use the Insight Segmentation and Reconstruction Toolkit [30] and the Registration Toolkit [31] to modify and project the CT anatomy into 2D images. The pipeline generates an unrestricted number of projections and corresponding ground truth poses from a CT anatomy, an instrument mesh (or generative model) and a parameter definition. While we expect an explicit definition for some parameters, most parameters accept a statistical definition. This allows us to define the high-dimensional parameter space of the projections statistically.

The parameter space of our radiographs consists of:

  • the 3D pose of the instrument in the anatomy (6 DoF)

    • position ()

    • orientation (as a vector or rotations) ()

  • the projection parameters () (6 DoF)

    • Source-Object-Distance

    • Displacement orthogonal to the projection direction

    • Rotations around the object

We derive the c-arm-parameters from a Ziehm Vario RFD, which has a -detector at pixels and a Source-Detector-Distance of \SI1064\milli\meter.

An additional challenge arises, since the CT data only include a limited traversal height. To cover different projection directions, the projection geometry can be rotated leading to projection rays intersecting regions, where the CT volume data is missing. We consider projections invalid, if any projection ray within \SI5\milli\meter of the surgical instrument passes through a missing region of the skull. The pipeline implements this by projecting polygons onto the detector and checking, whether the instrument lies within them.

The pipeline follows Algorithm 1 to generate images and annotations:

Algorithm 1: Generation of DRRs for training and testing  

1:Distributions , and ,
2:Polygons for , CTVolumeData,
4:Image, pose
8:      repeat
10:             Output pose
12:            valid
13:      until valid Fail after a defined number of
14:until valid unsuccessful iterations.
15:Anatomy interpolate(CTVolumeData)
16:Mesh transform(, InstrumentMesh)
17:Instrument rasterize_mesh(Mesh)
18:Volume combine(Anatomy, Instrument)
19:Image project(, Volume)


For the Generation of the DRRs, we sample poses (positions from and orientations from ) and the projection parameters from , until we find a projection, that is valid. denotes probability distributions, where and represent normal and uniform distributions. The sampling of these parameters is summarized in Table 1. To determine, whether a projection is valid, we check, if rays close to the instrument travel through a part of the skull cut off by the availability of data in the CT scan. These regions are identified by a lower and an upper Polygon per anatomy. Projections with rays passing through regions of missing data are rejected and the corresponding parametersets resampled.

We interpolate the CT volume data to increase the sharpness of the instrument outlines providing finer voxels to render the mesh into. The anatomy and the surgical instrument volumes are combined and volume-projected to the image. We export the pose (c.f. Equation 1) of the instrument to provide annotations for later use in training and evaluation.

3.4 Cases and Scenarios

Let a case be an unique combination of an instrument and a subject’s anatomy on a DRR. This leads to three cases for each of the three instruments.

A scenario assigns the three cases associated with an instrument to the training and the validation set leading to three scenarios per instrument.

In our 3-fold cross-validation-scheme we use projections from two anatomies to assemble the training data while the third anatomy is reserved for evaluation. We define the benchmark scenario for design parameter search to use the screw and anatomies 1 and 2 for training.

We provide 20 plausible poses (position and orientation) for every subject, 10 for left and right side each. These poses are chosen to be clinically plausible, for the screw at the skull surface around the ear and for the drill/robot in the mastoid bone. By individually sampling deviations from these nominal poses we create a 6-dimensional manifold of configurations. We use different distributions for training and testing better resemble the clinical use-case (c.f. Table 1). This process yields us radiographs per scenario for training.

4 Methods

Our approach marginalizes the parameter space of the Deep Neural Network by introducing a patchification strategy and a standard pose w.r.t. the image patch. In standard pose the instrument is positioned at a central location and oriented in the direction of the x-axis of the patch. Since we require an initial estimate of the pose, the assumption of a standard pose reduces the possible range of angles from any angle () to that present in the initial estimate.

Figure 2: Definition of pose; to illustrate only an excerpt of the DRR is shown

We define the pose to be the set of pixel coordinates, forward angle, the projection angle and the depth of the instrument. These variables are defined w.r.t. the detector and the projection geometry of the digitally generated radiograph (see Fig. 2).


The forward angle indicates the angle between the instrument’s main rotational axis projected onto the image plane and the horizontal axis of the image. The projection angle quantifies the tilt of the instrument w.r.t. the detector plane. The depth represents the distance on the projection normal from the source (focal point) to the instrument (c.f. Source-Object-Distance).

We design a geometry-based angle-estimation scheme to investigate a direct (end-to-end regression) and an indirect (modular design) from a single image (Section 4.1). Following a discussion of the general estimation strategy, we discuss our CNN design, the implementation of a patchification strategy and perform an iterative evaluation, which takes advantage of the properties of the standard patch pose.

4.1 Regression of Orientation Angles from Images

Initially, we analyze the generalized problem of predicting the orientation (forward angle) of an object in an image. Since many medical instruments display varying degrees of rotational invariance, we cannot use quaternions like PoseNet [32] and its derivatives.

For this purpose we reduce the 3d x-ray scenario to a 2d rectangle scenario. The images show a black rectangle on white background. Two corners of the rectangle are rounded to eliminate the rectangle’s rotational periodicity at 180 degrees. This rectangle represents a simplification of the instrument outline (c.f. Fig. 3).

We evaluate two methods to predict the forward angle:

  1. direct (i.e. the network has 1 output node) or

  2. indirect by regressing on x- and y-coordinates of two points placed at both ends of the shape (i.e. the network has 4 output nodes).

For this comparison we generate an artificially training and testing data set (/ images) with image size  pixel and rectangle dimensions  pixel. We draw both the center position and the forward angle from an uniform distribution.

Both approaches use the same simplified network (4 convolutional and 3 fully connected layers) with the exception of the output layer.

4.2 Pose Estimation Algorithm

Design Parameter Option 1 Option 2 Option 3 Option 4 Option 5
Convolutional Layers
# of Blocks 2 3
Layers per Block 2 3
Regularization None Dropout Batch Normalization after
every Block every Layer
Pooling Max Pooling Average Pooling Last Layer uses Stride
Fully Connected (FC) Layers
# of FC Layers /
Factor for # of FC Nodes
Regularization None Dropout of Batch Normalization
No Dropout and Dropout of and Dropout of
Table 2: CNN Design Parameters

Simplifying i3PosNet to i2PosNet by dropping the iterative component (), the approach consists of the three steps: patch generation, point prediction and the geometric reconstruction of the pose from the predicted points as shown in Algorithm 2.

Algorithm 2: i3PosNet Pose Estimation  

1:initial pose estimate , Image
2:predicted pose
3:for  do
4:      Patch


In patch generation, we use the initial estimate as an approximation of the pose. The radiograph is rotated around the initial position by the initial forward angle. Cutting the image to its patch size of results in the estimated pose being placed in standard pose. Since the estimate is only an approximation, the instrument on the radiograph will be slightly offset (by position and rotation) from this standard pose. Finding this offset is the task we train the CNN to perform by training it with deviations from the standard pose.

From an image patch, our deep CNN predicts 6 key points placed on the main axis of the instrument and the plane orthogonal to the projection direction. This CNN is designed after a VGG-fashion [33] with 13 weight layers. Input and output of the CNN are image patches and 12 normalized values representing the x- and y- coordinates of the key points.

We define the placement of six key points (see Fig. 3) locally based at the instrument’s position (origin) in terms of two normalized support vectors (instruments rotational axis and its cross product with the projection direction). Key point coordinates are transformed to the image plane () by Equation 2 dependent on the Source-Detector-Distance and the Detector-Pixel-Spacing and normalized to the maximum range.


Using a cross-shape, or enables us to invert Equation 2 geometrically by fitting lines through two subsets of key points (see also Fig. 3). The intersection yields the position and the slope the forward angle . The depth and projection angle are determined by using Equation 3 and 4 on the same key point subsets.

Figure 3: Prediction scheme for i2PosNet (no iteration); blue: initial key points (KP), green: ground truth KP, yellow: predicted KP; red: predicted position

4.3 i3PosNet Architecture

The input of the network is a -normalized greyscale image, as provided by the patch generation.

We benchmark the CNN (for the benchmarking scenario, see Section 3.4) on multiple design dimensions including the number of convolutional layers and blocks, the pooling layer type, the number of fully connected layers and the regularization strategy. In this context, we assume a block consists of multiple convolutional layers and ends in a pooling layer shrinking the layer size by a factor of . Adjusting the last layer to use a Stride of is an option for pooling. All layers use ReLU activation. We double the number of channels after every block, starting with 32 for the first block. We use the Mean Squared Error as loss function. The design dimensions of our analysis are summarized in Table 2.

For optimizers, we evaluated both Stochastic Gradient Descent with Nesterov Momentum update and Adam including different parameter combinations.1

4.4 Data augmentation and training setup

From DRRs (as generated by the generation pipeline described in Section 3.3) we create training sets for all scenarios. We create 10 image patches for each image of the cases used for training. In the creation of these image patches, two considerations to train the CNN on similar samples compared to the use-case are taken:

Figure 4: Quantative Comparison of i3PosNet, i2PosNet (no iteration), Registration using Covariance Matrix Adaptation Evolution Strategy (CMA) and Gradient Correlation or Mutual Information
  1. Deviations from the standard pose are covered by adding noise to the image coordinates and forward angle components of the pose.

  2. Greater model accuracy around the standard pose is achieved by sampling the noise such that the CNN trains on more samples with poses similar to the standard pose.

We implement these considerations by adding deviations equal to the expected clinical initial estimates: for the position () the noise is sampled in polar coordinates and for the forward angle we draw from a normal distribution () as described by Equation 5:


5 Experiments

We conducted four evaluations2 to assess the performance of i3PosNet:

  1. Quantative Comparison of i3PosNet with Registration

  2. Analysis of direct and indirect prediction of angles, i.e. end-to-end and modular training

  3. Generalization to instruments and anatomies

  4. Analysis of the number of training X-ray images

Unless explicitly stated otherwise, we used the same sampling strategy for the initial pose as for data augmentation (c.f. Section 4.4). i3PosNet does not need an initial estimate for the projection angle or the depth. The upper bound of \SI2.5\milli\meter for the initial position estimation error is drawn from two considerations: a) initial pose estimates from electromagnetic tracking [3] and b) position errors larger than \SI1\milli\meter from the surgery plan are assumed to be a failure states in any case.

5.1 Metrics

Figure 5: Comparison of direct and indirect prediction for the generalized artificial rectangle scenario across 5 independent trainings after 200 epochs

We evaluated the components of the predicted pose independently using 5 error measures:

  • Position (Millimeter): Euclidean Distance between prediction and ground truth at the instrument projected on a plane orthogonal to the projection normal, also called reprojection distance (RPD) by von de Kraats et.al. [34].

  • Position (Pixel): Euclidean Pixel-Distance in the image

  • Forward Angle (Degrees): Angle between estimated and ground truth orientation in the image plane

  • Projection Angle (Degrees): Tilt out of the image plane

    We are restricted to differences in absolute angle values, since i3PosNet cannot determine the sign of projection angle ( )

  • Depth Error (Millimeter): Error of the Depth estimation, which von de Kraats [34] refers to as the target registration error in the projection direction.

5.2 Comparison with Current State-of-art

Registration is the accepted current state-of-art method for pose estimation in medical applications [10, 20, 9, 35, 36], while Deep Learning-based methods are still new to the field [23]. We evaluated registration for pose estimation for the screw and anatomy 1 in an earlier work [36]. There we identified the configuration and components for the registration to achieve the best results: Covariance Matrix Adaptation Evolution Strategy (CMA) for the Optimizer and Gradient Correlation (GC) as Metric. This configuration is consistent with findings from Uneri et.al. [9] and Miao et.al. [24].

Experimental Setup: We generated 25 DRRs for the screw and anatomy 1 (c.f. Evaluation in Table 1) and performed two pose estimations from randomly sampled deviations from the initial estimate. The projection matrices were available to the registration method, so new DRRs (moving images) were generated on the fly depending on the pose of the instrument. We limited the number of DRRs generated to 400. While the registration operated on positions w.r.t. the patient, all error calculations were performed in terms of the pose (c.f. Equation 1). Four i3PosNet-models were independently trained for 80 epochs and evaluated for 1 iteration (i2PosNet) and for 3 iterations (i3PosNet). The results of these 4 models were merged into one box.

Results: i3PosNet outperformed the state-of-the-art registration method with the best configuration (CMA and GC) by a factor of 5 and more (see Fig. 4). For i3PosNet and i2PosNet all results are below Pixel (\SI0.1\milli\meter).

Figure 6: Comparing modular (using geometric considerations, on left) with end-to-end (on right) estimation strategy; Forward Angle Errors (top) and Projection Angle Errors (bottom) grouped by the error of the initial estimate

5.3 Direct vs. indirect prediction

To emphasize our reliance on geometric considerations, we evaluated the prediction of forward angles (orientation) on a simplified artificial case (see Section 4.1) and confirmed the results on our benchmark case (see Section 3.4).

General orientation angle regression: We trained 5 models independently to regress the forward angle for the rectangle scenario for the direct and the indirect prediction approach. For Fig. 5 we merged the evaluation results for all five models. Our proposed indirect method (using geometric considerations) outperformed the direct method. The results showed a better overall accuracy (the third quartile of indirect errors roughly matches the first quartile of direct errors) and significantly less dominant outliers. Especially the errors at the jump from \ang360 to \ang0 became apparent.

Figure 7: Evaluation of the forward angle dependent on the projection angle; examples showing different instruments for different projection angles

Comparison of end-to-end and modular schemes: Comparing i3PosNet with an end-to-end setup, we found the direct regression for the forward angle to display about larger errors than i3PosNet (indirect regression). We observed similar error levels for situations close to the standard pose (), but with increasing difference of the pose to this mean pose, errors got significantly larger. Fig. b illustrates the consistent results of the modular and the dependency on both analyzed angles of the end-to-end approach.

5.4 Analysis of projection parameters

In order to determine the limits imposed on i3PosNet by the properties of the projection, we evaluated i3PosNet’s dependence on the projection angle. We expected limitations, since instruments with a dominant axis and strong rotational symmetry are nearly indiscriminate w.r.t. the orientation when the dominant axis and the projection direction coincide.

i3PosNet’s forward angle predictions gradually started to loose quality as shown in Fig. 7 for absolute projection angles greater than \ang60, which corresponds to a projection length of compared to nice projections. These effects became significant for sometimes even leading to forward angle estimates for the next iteration outside the parameter space specified in training and thereby escalating errors. Therefore we limit the experiments to .

Fig. 7 illustrates these observations by displaying projections in addition to the forward angle errors for all three instruments. The drill was especially prone to error increases at large absolute projection angles.

Figure 8: Performance of i3PosNet over multiple iterations; Average Errors and Quartiles for valid projection angles (); averages of 3 scenarios and 3 training runs for each instrument

5.5 Considering number of images and iterations

To determine the optimal number or iterations for i3PosNet, we analyzed the improvements of predictions for different numbers of iterations. Fig. 8 shows the large increase of mean and quartile errors in the second iteration followed by a negligible increase in the third. Single-image predictions using one GTX 1080 at 6 % utilization took for 3 iterations making i3PosNet feasible for realtime-applications ().

Figure 9: i2PosNet (no iteration) Evaluation for number of images used for training; 20 patches were generated per DRR

We evaluated the number of unique images (constant 20 patches per DRR) used for training (see Fig. 9). By increasing the number of training epochs we kept the number of model updates constant to distinguish between model convergence and dataset size. We observed a trend of decreasing errors with saturation.

5.6 Generalization to Instruments and Anatomies

The experiments for instruments and anatomies differed in the chosen set of DRRs for the datasets as well as on the placement of the standard pose and the fourth key point. Since most of the instrument was “in front” of its origin for the screw and “behind” its origin for the other two instruments, the position of the standard pose in the patch was adapted to include a large part of the instrument in the patch and the forth key point was placed accordingly. These adaptations translated to mirroring the instrument on the Y-Z plane.

Figure 10: i3PosNet evaluation errors of (a) Position (Millimeter); (b) Depth (Millimeter); (c) Forward Angle (Degrees); (d) Projection Angle (Degrees)

In the evaluation of i3PosNet we ran experiments for any individual trained neural network leading to experiments after dropping experiments with projection angles greater than \ang80. We normalized the whiskers in Fig. d to an outlier percentage of approximately .

From the Evaluation of the Position Error (see Fig. a) of the evaluations resulted in errors less than \SI0.3mm. Most of these fail-cases were attributed to five DRRs and concerned the drill. 95 % (99%) of the Millimeter Errors were smaller than \SI0.071\milli\meter (\SI0.107\milli\meter).

6 Discussion & Conclusion

We estimate the pose of three surgical instruments using a Deep-Learning based approach. By including geometric considerations into our method, we are able approximate the non-linear properties of rotation and projection. In previous works, this was done by training neural networks to specialized sections of the parameter space [24], in effect providing the chance for local linearizations. i3PosNet outperforms registration based methods by a factor of 5 and more.

i3PosNet performs well independent of the instrument, with the only instrument dependent parameters being the relation of the instrument origin with respect to the instruments center of mass.

Our instruments share the property of a dominant axis with most surgical instruments (screws, nails, rotational tools, catheters, drills, etc.) and is evaluated for minimally invasive surgery, where tools are very small. A dominant axis leads to i3PosNet’s limitation on feasible projection angles, which requires there to be an angle of at least \ang10 between said axis and the projection direction. The non-existence of the non-continuous jump between \ang359 and \ang0 is another advantages of the indirect determination of the orientation.

In the future, we want to embed i3PosNet in a multi-tool localization scheme, where fiducials, instruments etc. are localized and their pose estimated without the knowledge of the projection matrix. To increase the 3D accuracy, multiple orthogonal x-rays and a proposal scheme for the projection direction may be used. We want to verify i3PosNet on real x-ray images by exploring the dependence on clean annotations and investigating methods to cope with noisy annotations. One limitation of i3PosNet is the associated radiation exposure, which could be decreased by low-energy x-rays, possibly at multiple settings [37].

With the accuracy shown in this paper, i3PosNet enables surgeons to accurately determine the pose of instruments, even when the line of sight is obstructed. Through this novel navigation method, surgeries previously barred from minimally invasive approaches are opened to new possibilities with an outlook of higher precision and reduced patient surgery trauma.


The authors would like to thank the German Research Foundation for funding this research.


  1. We enclose detailed analysis and comparisons of the design parameters for the network architecture and the optimizer in the Supplementary Material.
  2. See the Supplementary Material for the analysis of the design parameters.


  1. A. J. Koffron, G. Auffenberg, R. Kung, and M. Abecassis, “Evaluation of 300 minimally invasive liver resections at a single institution: Less is more,” Annals of surgery, vol. 246, no. 3, pp. 385–92; discussion 392–4, 2007.
  2. J. Schipper, A. Aschendorff, I. Arapakis, T. Klenzner, C. B. Teszler, G. J. Ridder, and R. Laszig, “Navigation as a quality management tool in cochlear implant surgery,” The Journal of laryngology and otology, vol. 118, no. 10, pp. 764–770, 2004.
  3. A. M. Franz, T. Haidegger, W. Birkfellner, K. Cleary, T. M. Peters, and L. Maier-Hein, “Electromagnetic tracking in medicine–a review of technology, validation, and applications,” IEEE transactions on medical imaging, vol. 33, no. 8, pp. 1702–1725, 2014.
  4. M. Caversaccio, K. Gavaghan, W. Wimmer, T. Williamson, J. Ansò, G. Mantokoudis, N. Gerber, C. Rathgeb, A. Feldmann, F. Wagner, O. Scheidegger, M. Kompis, C. Weisstanner, M. Zoka-Assadi, K. Roesler, L. Anschuetz, M. Huth, and S. Weber, “Robotic cochlear implantation: Surgical procedure and first clinical experience,” Acta oto-laryngologica, vol. 137, no. 4, pp. 447–454, 2017.
  5. I. Stenin, S. Hansen, M. Becker, G. Sakas, D. Fellner, T. Klenzner, and J. Schipper, “Minimally invasive multiport surgery of the lateral skull base,” BioMed research international, vol. 2014, p. 379295, 2014.
  6. R. F. Labadie, R. Balachandran, J. H. Noble, G. S. Blachon, J. E. Mitchell, F. A. Reda, B. M. Dawant, and J. M. Fitzpatrick, “Minimally invasive image-guided cochlear implantation surgery: first report of clinical implementation,” The Laryngoscope, vol. 124, no. 8, pp. 1915–1922, 2014.
  7. J.-P. Kobler, M. Schoppe, G. J. Lexow, T. S. Rau, O. Majdani, L. A. Kahrs, and T. Ortmaier, “Temporal bone borehole accuracy for cochlear implantation influenced by drilling strategy: An in vitro study,” International journal of computer assisted radiology and surgery, vol. 9, no. 6, pp. 1033–1043, 2014.
  8. J. Ansó, C. Dür, K. Gavaghan, H. Rohrbach, N. Gerber, T. Williamson, E. M. Calvo, T. W. Balmer, C. Precht, D. Ferrario, M. S. Dettmer, K. M. Rösler, M. D. Caversaccio, B. Bell, and S. Weber, “A neuromonitoring approach to facial nerve preservation during image-guided robotic cochlear implantation,” Otology & neurotology : official publication of the American Otological Society, American Neurotology Society [and] European Academy of Otology and Neurotology, vol. 37, no. 1, pp. 89–98, 2016.
  9. A. Uneri, J. W. Stayman, T. de Silva, A. S. Wang, G. Kleinszig, S. Vogt, A. J. Khanna, J.-P. Wolinsky, Z. L. Gokaslan, and J. H. Siewerdsen, “Known-component 3d-2d registration for image guidance and quality assurance in spine surgery pedicle screw placement,” Proceedings of SPIE–the International Society for Optical Engineering, vol. 9415, 2015.
  10. H. Esfandiari, S. Amiri, D. D. Lichti, and C. Anglin, “A fast, accurate and closed-form method for pose recognition of an intramedullary nail using a tracked c-arm,” International journal of computer assisted radiology and surgery, vol. 11, no. 4, pp. 621–633, 2016.
  11. T. Steger and S. Wesarg, “Quantitative analysis of marker segmentation for c-arm pose based navigation,” in XIII Mediterranean Conference on Medical and Biological Engineering and Computing 2013, ser. IFMBE Proceedings, L. M. Roa Romero, Ed.   Cham: Springer, 2014, vol. 41, pp. 487–490.
  12. P. Markelj, D. Tomaževič, B. Likar, and F. Pernuš, “A review of 3d/2d registration methods for image-guided interventions,” Medical image analysis, vol. 16, no. 3, pp. 642–661, 2012.
  13. M. A. Viergever, J. B. A. Maintz, S. Klein, K. Murphy, M. Staring, and J. P. W. Pluim, “A survey of medical image registration - under review,” Medical image analysis, vol. 33, pp. 140–144, 2016.
  14. T. Steger, M. Hoßbach, and S. Wesarg, “Marker detection evaluation by phantom and cadaver experiments for c-arm pose estimation pattern,” ser. SPIE Proceedings, D. R. Holmes and Z. R. Yaniv, Eds.   SPIE, 2013, p. 86711V.
  15. A. K. Jain, T. Mustafa, Y. Zhou, C. Burdette, G. S. Chirikjian, and G. Fichtinger, “Ftrac—a robust fluoroscope tracking fiducial,” Medical Physics, vol. 32, no. 10, p. 3185, 2005.
  16. W. El Hakimi, J. Beutel, and G. Sakas, “Particle path segmentation: a fast, accurate, and robust method for localization of spherical markers in cone-beam ct projections,” in Track f: devices and systems for surgical interventions, K. Lange, W. Lauer, M. Nowak, D. B. Ellebrecht, and M. P. E. Gebhard, Eds., vol. 51, 2014, pp. 405–408.
  17. M. Bui, S. Albarqouni, M. Schrapp, N. Navab, and S. Ilic, “X-ray posenet: 6 dof pose estimation for mobile x-ray devices,” in WACV 2017.   Piscataway, NJ: IEEE, 2017, pp. 1036–1044.
  18. L. Maier-Hein, S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou, M. Hashizume, D. Katic, H. Kenngott, M. Kranzfelder, A. Malpani, K. März, T. Neumuth, N. Padoy, C. Pugh, N. Schoch, D. Stoyanov, R. Taylor, M. Wagner, G. D. Hager, and P. Jannin, “Surgical data science: Enabling next-generation surgery,” 2017. [Online]. Available: http://arxiv.org/pdf/1701.06482
  19. M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, S. Duchesne, T. Kurmann, P. Marquez Neila, X. Du, P. Fua, D. Stoyanov, S. Wolf, and R. Sznitman, Eds., Simultaneous Recognition and Pose Estimation of Instruments in Minimally Invasive Surgery: Medical Image Computing and Computer-Assisted Intervention MICCAI 2017.   Springer International Publishing, 2017.
  20. C. R. Hatt, M. A. Speidel, and A. N. Raval, “Real-time pose estimation of devices from x-ray images: Application to x-ray/echo registration for cardiac interventions,” Medical image analysis, vol. 34, pp. 101–108, 2016.
  21. V. Ulman, M. Maška, K. E. G. Magnusson, O. Ronneberger, C. Haubold, N. Harder, P. Matula, P. Matula, D. Svoboda, M. Radojevic, I. Smal, K. Rohr, J. Jaldén, H. M. Blau, O. Dzyubachyk, B. Lelieveldt, P. Xiao, Y. Li, S.-Y. Cho, A. C. Dufour, J.-C. Olivo-Marin, C. C. Reyes-Aldasoro, J. A. Solis-Lemus, R. Bensch, T. Brox, J. Stegmaier, R. Mikut, S. Wolf, F. A. Hamprecht, T. Esteves, P. Quelhas, Ö. Demirel, L. Malmström, F. Jug, P. Tomancak, E. Meijering, A. Muñoz-Barrutia, M. Kozubek, and C. Ortiz-de Solorzano, “An objective comparison of cell-tracking algorithms,” Nature Methods, vol. 14, no. 12, p. 1141, 2017.
  22. A. Vandini, B. Glocker, M. Hamady, and G.-Z. Yang, “Robust guidewire tracking under large deformations combining segment-like features (seglets),” Medical image analysis, vol. 38, pp. 150–164, 2017.
  23. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  24. S. Miao, Z. J. Wang, and R. Liao, “A cnn regression approach for real-time 2d/3d registration,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1352–1363, 2016.
  25. M. F. Valstar, E. Sánchez-Lozano, J. F. Cohn, L. A. Jeni, J. M. Girard, Z. Zhang, L. Yin, and M. Pantic, “Fera 2017 - addressing head pose in the third facial expression recognition and analysis challenge,” 2017. [Online]. Available: http://arxiv.org/pdf/1702.04174
  26. Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.   Piscataway, NJ: IEEE, 2013, pp. 3476–3483.
  27. E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres, “Joint graph decomposition & node labeling: Problem, algorithms, applications,” in 30th IEEE Conference on Computer Vision and Pattern Recognition.   Piscataway, NJ: IEEE, 2017, pp. 1904–1912.
  28. L. Pérez, Í. Rodríguez, N. Rodríguez, R. Usamentiaga, and D. F. García, “Robot guidance using machine vision techniques in industrial environments: A comparative review,” Sensors, vol. 16, no. 3, p. 335, 2016.
  29. I. Stenin, S. Hansen, M. Nau-Hermes, W. E. Hakimi, M. Becker, J. Bevermann, T. Klenzner, and J. Schipper, “Evaluation von minimal invasiven multi-port zugängen der otobasis am humanen schädelpräparat,” in 13. Jahrestagung der Deutschen Gesellschaft für Computer- und Roboterassistierte Chirurgie, September 11-13, 2014, Munich, Germany, H. Feußner, Ed., 2014, pp. 127–130.
  30. T. S. Yoo, M. J. Ackerman, W. E. Lorensen, W. Schroeder, V. Chalana, S. Aylward, D. Metaxas, and R. Whitaker, “Engineering and algorithm design for an image processing api: A technical report on itk–the insight toolkit,” Studies in health technology and informatics, vol. 85, pp. 586–592, 2002.
  31. S. Rit, M. Vila Oliva, S. Brousmiche, R. Labarbe, D. Sarrut, and G. C. Sharp, “The reconstruction toolkit (rtk), an open-source cone-beam ct reconstruction toolkit based on the insight toolkit (itk),” Journal of Physics: Conference Series, vol. 489, p. 012079, 2014.
  32. A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” 2015. [Online]. Available: http://arxiv.org/pdf/1505.07427
  33. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014. [Online]. Available: http://arxiv.org/pdf/1409.1556
  34. E. B. van de Kraats, G. P. Penney, D. Tomazevic, T. van Walsum, and W. J. Niessen, “Standardized evaluation methodology for 2-d-3-d registration,” IEEE transactions on medical imaging, vol. 24, no. 9, pp. 1177–1189, 2005.
  35. Y. Otake, M. Armand, R. S. Armiger, M. D. Kutzer, E. Basafa, P. Kazanzides, and R. H. Taylor, “Intraoperative image-based multiview 2d/3d registration for image-guided orthopaedic surgery: incorporation of fiducial-based c-arm tracking and gpu-acceleration,” IEEE transactions on medical imaging, vol. 31, no. 4, pp. 948–962, 2012.
  36. D. Kügler, M. Jastrzebski, and A. Mukhopadhyay, “Anonymous submission,” 2018.
  37. M. N. Wernick, O. Wirjadi, D. Chapman, Z. Zhong, N. P. Galatsanos, Y. Yang, J. G. Brankov, O. Oltulu, M. A. Anastasio, and C. Muehleman, “Multiple-image radiography,” Physics in Medicine & Biology, vol. 48, no. 23, p. 3875, 2003.
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Comments 0
Request answer
The feedback must be of minumum 40 characters
Add comment
Loading ...