i3PosNet: Instrument Pose Estimation from X-Ray
Abstract
Performing delicate Minimally Invasive Surgeries (MIS) forces surgeons to accurately assess the position and orientation (pose) of surgical instruments. In current practice, this pose information is provided by conventional tracking systems (optical and electromagnetic). Two challenges render these systems inadequate for minimally invasive bone surgery: the need for instrument positioning with high precision and occluding tissue blocking the line of sight. Fluoroscopic tracking is limited by the radiation exposure to patient and surgeon. A possible solution is constraining the acquisition of x-ray images. The distinct acquisitions at irregular intervals require a pose estimation solution instead of a tracking technique. We develop i3PosNet (Iterative Image Instrument Pose estimation Network), a patch-based modular Deep Learning method enhanced by geometric considerations, which estimates the pose of surgical instruments from single x-rays. For the evaluation of i3PosNet, we consider the scenario of drilling in the otobasis. i3PosNet generalizes well to different instruments, which we show by applying it to a screw, a drill and a robot. i3PosNet consistently estimates the pose of surgical instruments better than conventional image registration techniques by a factor of 5 and more, achieving small in-plane position and angle errors. Additional factors, such as depth, are also estimated from single radiographs.
instrument pose estimation, modular deep learning, fluoroscopic tracking, cochlear implant, vestibular schwannoma removal
1 Introduction
MIS lead to shorter hospital stays due to smaller incisions and less operation trauma [1]. Recent years show a surge of MIS for bone surgery, e.g. lateral bone surgery, where clinical instrument positioning needs to be more accurate than 0.5 mm [2]. To achieve this positioning accuracy, measured positions of surgical instruments and tools are required to be ten times more accurate (errors less than 0.05 mm). In combination with the orientation, this would enable methods from Computer-Aided Intervention or robotic surgery to be exploited for similar applications. Optical or electromagnetic tracking systems work well for soft tissue interventions, but fail when the line of sight (LoS) is limited and sub-millimeter accuracies are required [3].
i3PosNet is a generalized, iterative Deep Learning framework to determine the pose of surgical instruments from a single x-ray. The pose has five degrees of freedom (three for position and two for orientation). We apply i3PosNet to lateral skull base surgery, e.g. cochlear implantation or vestibular schwannoma removal. The current clinical practice in lateral skull base surgery is to remove a large part of the otobasis in order to reveal all risk structures to the surgeon. Current research on navigation for cochlear implantation [4, 5, 6, 7] assumes the drill to be rigid and relies on tracking the drill at the tool’s base, or measures the closeness to a risk structure, e.g. the facial nerve [8]. No image-based method has been proposed yet that captures the pose of surgical instruments in the otobasis.
We propose a novel method based on a modular Deep Learning approach with geometric considerations. Unlike end-to-end approaches, this modular approach predicts the positions of multiple landmark points and derives the pose (position, forward angle, projection angle and depth – all defined w.r.t. the projection geometry) from these positions in a follow-up step using their geometric relationship. The term “modular” is motivated by this divide-and-conquer approach. i3PosNet consistently beats competing state-of-the-art instrument pose estimation techniques [9, 10] by a factor of 5 and more.
Our proposed method finds the pose of an instrument on an x-ray image given an initial estimate of said pose. Initial poses are constrained to clinically plausible deviations. This paper introduces three core concepts in its design: 1) the geometric conversion between instrument landmarks and the pose, 2) a statistically driven training dataset generation scheme and 3) an iterative patch-based pose prediction scheme.
In this work, we estimate the pose from a single image and evaluate poses w.r.t. the arrangement of the x-ray source and detector. The five dimensions of the pose are the in-plane position (x and y) and the depth as well as two rotations: 1) around the projection normal and 2) the rotation of the instrument’s main axis out of the image plane (projection angle). Since the instruments are rotationally symmetric, we ignore the rotation of the instrument around its own axis. This separation ensures the independence of components that demonstrate different degrees of estimation accuracy.
We identify three challenges for instrument pose estimation using x-ray images: 1) the unavailability of ground truth poses, 2) the sensitivity to local and patient-specific anatomy [11] and 3) the poor generalization of hand-crafted instrument features.
A major challenge for all pose estimation techniques is the generation of images annotated with ground truth poses to learn from and compare with. To determine the projection parameters w.r.t. an instrument in a real-world C-arm setup, the instrument, detector and source positions have to be measured. Due to the perspective projection nature of the C-arm, the required ground truth precision of the source w.r.t. the instrument embedded in the anatomy is not achievable, which prevents verifying the desired pose estimation accuracy. The use of simulated images additionally allows us to control the distribution of the instrument pose and projection parameters.
The Deep Learning approach derives abstract feature representations of the instrument that are independent of the anatomy. We show this by 3-fold cross-validation on three patient anatomies and three different instruments: a screw, a conventional drill (where the tip is tracked) and a custom drill robot (which has additional degrees of freedom that are present in the images, but not determined by i3PosNet).
Additionally, we perform an extensive analysis of the design parameters of the convolutional neural network (CNN), including its layout and the optimizer parameters. We investigate optimal properties for the data set, including the distributions of the image generation parameters and the chosen size of the training data set. The evaluation incorporates the analysis of method parameters such as the iteration count, a modular vs. end-to-end comparison and the dependence on initial pose estimates. Finally, we compare our results with a state-of-the-art registration-based pose estimation approach.
In this paper, we present three key contributions:

- The first Deep Learning method* for instrument pose estimation (including depth) from single-image fluoroscopy.
- Generalization to multiple instruments (rigid and non-rigid) that is patient-independent and requires no individual patient CT scans.
- A large dataset* of x-ray images with exact reference poses and a method to generate these from statistical distributions.

* The code and the dataset will be made publicly available upon acceptance.
2 Related Work
2.1 Pose Estimation in Medical Images
In Medical Imaging, pose estimation has been covered intensively with regard to two related research questions: C-arm pose estimation (CBCT) and pose estimation of surgical instruments in endoscopic images.
While registration [12, 13] is the dominant method for the estimation of the C-arm source and detector arrangement [14, 15, 16], recent direct regression approaches such as Bui et al. [17], using a CNN-based PoseNet architecture, show potential.
Deep Learning techniques are prevalent for the pose estimation of surgical instruments in endoscopic images [18, 19], but sub-pixel accuracy is not achieved – in part because the manual ground truth annotations do not allow it.
Instrument pose estimation and tracking on monochrome images (x-ray, fluoroscopy and cell tracking) typically rely on registration [9, 20, 10], segmentation [21, 11, 16] or matching of local features [22]. The latter two often lead to a feature-based registration.
Several specialized methods [22, 16, 11] are fine-tuned to specific instruments and cannot be applied to other instruments.
According to the classification by Markelj et al. [12], 3D/2D registration methods rely on an optimization loop to either minimize distances of feature points, maximize the similarity of images or match similar image gradients. The loop is built around a dimensional correspondence strategy (e.g. computational projection of 3D volume data to 2D) and is evaluated after intraoperative images become available. A metric compares the acquired data to a hypothesis (e.g. a moving image) in order to increase the accuracy of said hypothesis. In contrast to this methodology, our method performs the 3D/2D correspondence a priori in the data generation, so our model “learns” the geometry of the instrument. Additionally, we do not compare to a hypothesis but infer directly.
Litjens et al. [23] provide an overview of approaches that boost registration performance with Deep Learning.
Miao et al. [24] develop a registration approach based on convolutional neural networks, which we consider most related to i3PosNet. For three different clinical applications featuring objects without rotational symmetry, they show that they outperform conventional optimization- and image-metric-based registration approaches by a factor of up to 100. However, our work differs significantly from Miao et al. in five aspects: they use registration to determine the ground truth poses for training and evaluation; the size of their instruments (between 37 mm and 110 mm) is significantly larger than ours; they use multiple image patches and directly regress the rotation angles; they employ 974 specialized (non-deep) CNNs; and they operate on image differences between captured and generated images.
2.2 Key Point Estimation
The usage of facial key points has been explored to estimate facial expressions [25] and for biometric applications. Sun et al. [26] presented a key contribution, introducing deep neural networks to predict the positions of facial key points. Similar techniques have been developed for human pose estimation [27] and robots [28].
Litjens et al. [23] review several deep learning approaches for landmark detection and note that the direct regression of landmark positions in 3D data is complex.
3 Materials
We generate Digitally Reconstructed Radiographs (DRRs) from CT volumes and meshes of different surgical instruments.
3.1 Anatomies
To account for the variation of patientspecific anatomy, we consider three different conserved human cadaver heads captured by a SIEMENS SOMATOM Definition AS+. The slices of the transverse plane are centered around the otobasis and include the full crosssection of the skull.
Due to the conservation procedure, some tissue is bloated or shrunk, and screws fix the skullcap to the skull. Additionally, calibration sticks are present in the exterior auditory channels.
3.2 Surgical Tools and Fiducials
We consider three surgical objects – referred to as surgical instruments: a medical screw, a conventional medical drill and a prototype drilling robot. We define the origin as the point of the instrument whose position we inherently want to identify (cf. rays in Fig. 1). The geometry of these instruments is defined by meshes exported from CAD models.
The non-rigid drilling robot consists of a spherical drilling head and two cylinders connected by a flexible joint. By flexing and expanding the joint in coordination with cushions on the cylinders, it creates non-linear access paths. We implement this additional degree of freedom at the joint by generating the corresponding mesh on the fly from a generative model.
The dimensions of the instruments are in line with typical MIS and bone surgery applications (drill diameter 3 mm). This leads to bounding box diagonals of 6.5 mm and 13.15 mm for the screw and the rigid front part of the robot, respectively. Despite a drill’s length, only the tip should be considered for the estimation of the tip’s pose, to limit the influence of drill bending [5, 29].
Table 1: Sampling of the generation parameters.

Parameter | Training | Evaluation
Position
Orientation (Rotations)
Projection:
  Source-Object-Distance
  Object Offset
  Rotations
3.3 Generation of Radiographs
Our DRR generation pipeline is fully parameterizable and tailored to the instrument pose estimation use case. We use the Insight Segmentation and Registration Toolkit [30] and the Registration Toolkit [31] to modify and project the CT anatomy into 2D images. The pipeline generates an unrestricted number of projections and corresponding ground truth poses from a CT anatomy, an instrument mesh (or generative model) and a parameter definition. While we expect an explicit definition for some parameters, most parameters accept a statistical definition. This allows us to define the high-dimensional parameter space of the projections statistically.
The parameter space of our radiographs consists of:

- the 3D pose of the instrument in the anatomy (6 DoF)
  - position
  - orientation (as a vector of rotations)
- the projection parameters (6 DoF)
  - Source-Object-Distance
  - displacement orthogonal to the projection direction
  - rotations around the object
We derive the C-arm parameters from a Ziehm Vario RFD, adopting its detector resolution and its Source-Detector-Distance of 1064 mm.
An additional challenge arises since the CT data only covers a limited traversal height. To cover different projection directions, the projection geometry can be rotated, leading to projection rays intersecting regions where the CT volume data is missing. We consider projections invalid if any projection ray within 5 mm of the surgical instrument passes through a missing region of the skull. The pipeline implements this by projecting polygons onto the detector and checking whether the instrument lies within them.
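The polygon check described above can be sketched with a standard ray-casting point-in-polygon test. This is only an illustration with hypothetical coordinates; the paper does not give its actual implementation.

```python
import numpy as np

def point_in_polygon(point, polygon):
    """Ray-casting test: count crossings of a horizontal ray from `point`.

    `polygon` is an (N, 2) array of vertices in detector coordinates.
    Returns True if `point` lies inside the (non-self-intersecting) polygon.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through `point`?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A projected data-boundary polygon (hypothetical values) and two instrument
# positions: one safely inside the imaged region, one outside it.
boundary = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 80.0], [0.0, 80.0]])
print(point_in_polygon((50.0, 40.0), boundary))   # inside
print(point_in_polygon((150.0, 40.0), boundary))  # outside
```

In the pipeline, such a test would be applied to the projected instrument position against the lower and upper data-boundary polygons of each anatomy.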
The pipeline follows Algorithm 1 to generate images and annotations:
For the generation of the DRRs, we sample poses (positions and orientations) and the projection parameters until we find a projection that is valid, where N and U denote normal and uniform distributions. The sampling of these parameters is summarized in Table 1. To determine whether a projection is valid, we check if rays close to the instrument travel through a part of the skull cut off by the availability of data in the CT scan. These regions are identified by a lower and an upper polygon per anatomy. Projections with rays passing through regions of missing data are rejected and the corresponding parameter sets are resampled.
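The sampling loop of Algorithm 1 amounts to rejection sampling. In the sketch below, all distribution parameters are illustrative placeholders (the paper's concrete values are in Table 1), and `is_valid_projection` stands in for the polygon-based missing-data check described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose_and_projection():
    """One draw of instrument pose and projection parameters.

    Concrete distribution parameters below are illustrative placeholders;
    the paper samples positions/orientations from normal (N) and projection
    parameters partly from uniform (U) distributions (cf. Table 1).
    """
    position = rng.normal(loc=0.0, scale=1.0, size=3)       # mm offsets, N
    orientation = rng.normal(loc=0.0, scale=10.0, size=3)   # degrees, N
    source_object_distance = rng.uniform(400.0, 700.0)      # mm, U
    detector_rotation = rng.uniform(-30.0, 30.0, size=3)    # degrees, U
    return position, orientation, source_object_distance, detector_rotation

def is_valid_projection(params):
    """Placeholder for the polygon-based missing-data check (see text)."""
    return True

def generate_annotation(max_tries=100):
    """Rejection sampling: redraw parameters until the projection is valid."""
    for _ in range(max_tries):
        params = sample_pose_and_projection()
        if is_valid_projection(params):
            return params
    raise RuntimeError("no valid projection found")
```

Each accepted parameter set would then be passed to the DRR renderer together with the instrument mesh, and exported as a ground-truth annotation.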
We interpolate the CT volume data to increase the sharpness of the instrument outlines, providing finer voxels to render the mesh into. The anatomy and the surgical instrument volumes are combined and volume-projected to the image. We export the pose (cf. Equation 1) of the instrument to provide annotations for later use in training and evaluation.
3.4 Cases and Scenarios
Let a case be a unique combination of an instrument and a subject’s anatomy on a DRR. This leads to three cases for each of the three instruments.
A scenario assigns the three cases associated with an instrument to the training and the validation set leading to three scenarios per instrument.
In our 3-fold cross-validation scheme we use projections from two anatomies to assemble the training data, while the third anatomy is reserved for evaluation. We define the benchmark scenario for the design parameter search to use the screw and anatomies 1 and 2 for training.
We provide 20 plausible poses (position and orientation) for every subject, 10 for the left and right side each. These poses are chosen to be clinically plausible: for the screw at the skull surface around the ear, and for the drill/robot in the mastoid bone. By individually sampling deviations from these nominal poses, we create a 6-dimensional manifold of configurations. We use different distributions for training and testing to better resemble the clinical use case (cf. Table 1). This process yields the training radiographs for each scenario.
4 Methods
Our approach marginalizes the parameter space of the Deep Neural Network by introducing a patchification strategy and a standard pose w.r.t. the image patch. In standard pose, the instrument is positioned at a central location and oriented in the direction of the x-axis of the patch. Since we require an initial estimate of the pose, the assumption of a standard pose reduces the possible range of angles from any angle to the deviation present in the initial estimate.
We define the pose to be the set of pixel coordinates, forward angle, the projection angle and the depth of the instrument. These variables are defined w.r.t. the detector and the projection geometry of the digitally generated radiograph (see Fig. 2).
(1) 
The forward angle indicates the angle between the instrument’s main rotational axis projected onto the image plane and the horizontal axis of the image. The projection angle quantifies the tilt of the instrument w.r.t. the detector plane. The depth represents the distance on the projection normal from the source (focal point) to the instrument (c.f. SourceObjectDistance).
We design a geometry-based angle-estimation scheme to investigate a direct (end-to-end regression) and an indirect (modular design) estimation of the pose from a single image (Section 4.1). Following a discussion of the general estimation strategy, we discuss our CNN design and the implementation of a patchification strategy, and perform an iterative evaluation, which takes advantage of the properties of the standard patch pose.
4.1 Regression of Orientation Angles from Images
Initially, we analyze the generalized problem of predicting the orientation (forward angle) of an object in an image. Since many medical instruments display varying degrees of rotational invariance, we cannot use quaternions like PoseNet [32] and its derivatives.
For this purpose we reduce the 3D x-ray scenario to a 2D rectangle scenario. The images show a black rectangle on a white background. Two corners of the rectangle are rounded to eliminate the rectangle’s rotational periodicity at 180 degrees. This rectangle represents a simplification of the instrument outline (cf. Fig. 3).
We evaluate two methods to predict the forward angle:

- direct (i.e. the network has 1 output node) or
- indirect, by regressing the x- and y-coordinates of two points placed at both ends of the shape (i.e. the network has 4 output nodes).
For this comparison we generate an artificial training and testing data set of fixed image size and rectangle dimensions. We draw both the center position and the forward angle from a uniform distribution.
Both approaches use the same simplified network (4 convolutional and 3 fully connected layers) with the exception of the output layer.
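A minimal generator for this rectangle scenario might look as follows. The image size and rectangle dimensions below are illustrative (the paper's exact values are not reproduced here), and corner rounding is omitted; the function returns one image together with both label variants (direct angle, indirect endpoint coordinates).

```python
import numpy as np

def make_rectangle_sample(size=64, rect_w=24, rect_h=8, rng=None):
    """Render one black-rectangle-on-white image plus both label variants:
    the forward angle itself (direct) and the coordinates of two points
    at the ends of the shape (indirect).
    """
    if rng is None:
        rng = np.random.default_rng()
    cx, cy = rng.uniform(rect_w, size - rect_w, size=2)  # center, uniform
    theta = rng.uniform(0.0, 2 * np.pi)                  # forward angle

    # Rotate the pixel grid into the rectangle's frame and rasterize.
    ys, xs = np.mgrid[0:size, 0:size]
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(theta) + dy * np.sin(theta)    # along the main axis
    v = -dx * np.sin(theta) + dy * np.cos(theta)   # across the main axis
    image = np.where((np.abs(u) <= rect_w / 2) & (np.abs(v) <= rect_h / 2),
                     0.0, 1.0)

    # Direct label: the angle. Indirect label: endpoints on the main axis.
    direction = np.array([np.cos(theta), np.sin(theta)])
    p1 = np.array([cx, cy]) + (rect_w / 2) * direction
    p2 = np.array([cx, cy]) - (rect_w / 2) * direction
    return image, theta, np.concatenate([p1, p2])
```

The direct network would regress `theta` from `image`; the indirect network would regress the four endpoint coordinates, from which the angle follows geometrically.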
4.2 Pose Estimation Algorithm
Table 2: Design dimensions of the CNN benchmark.

Convolutional Layers:
- # of Blocks: 2 or 3
- Layers per Block: 2 or 3
- Regularization: None, Dropout, or Batch Normalization (after every Block or after every Layer)
- Pooling: Max Pooling, Average Pooling, or Last Layer uses Stride

Fully Connected (FC) Layers:
- # of FC Layers / Factor for # of FC Nodes
- Regularization: None, Dropout, or Batch Normalization (without or with Dropout)
Simplifying i3PosNet to i2PosNet by dropping the iterative component, the approach consists of three steps: patch generation, point prediction and the geometric reconstruction of the pose from the predicted points, as shown in Algorithm 2.
In patch generation, we use the initial estimate as an approximation of the pose. The radiograph is rotated around the initial position by the initial forward angle. Cutting the image to the patch size results in the estimated pose being placed in standard pose. Since the estimate is only an approximation, the instrument on the radiograph will be slightly offset (by position and rotation) from this standard pose. Finding this offset is the task we train the CNN to perform, by training it with deviations from the standard pose.
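This patch generation step can be sketched as a single affine resampling. The patch size and the rotation sign convention below are assumptions (the paper's values did not survive extraction); coordinates follow NumPy's (row, col) order.

```python
import numpy as np
from scipy.ndimage import affine_transform

def extract_patch(image, x0, y0, theta0_deg, patch_size=(64, 96)):
    """Cut a patch so the initial pose estimate lands in standard pose:
    the estimated position at the patch center, the estimated forward
    angle aligned with the patch's x-axis.
    """
    theta = np.deg2rad(theta0_deg)
    # Rotation matrix mapping output patch coords to input image coords.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = np.array([patch_size[0] / 2, patch_size[1] / 2])
    # Choose the offset so the patch center samples the image at (y0, x0).
    offset = np.array([y0, x0]) - rot @ center
    return affine_transform(image, rot, offset=offset,
                            output_shape=patch_size, order=1, cval=0.0)
```

Because only the offset from standard pose remains to be found, the CNN never has to cover the full range of image positions and angles.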
From an image patch, our deep CNN predicts 6 key points placed on the main axis of the instrument and the plane orthogonal to the projection direction. This CNN is designed in VGG fashion [33] with 13 weight layers. The input of the CNN is an image patch; the output consists of 12 normalized values representing the x- and y-coordinates of the key points.
We define the placement of the six key points (see Fig. 3) locally based at the instrument’s position (origin) in terms of two normalized support vectors (the instrument’s rotational axis and its cross product with the projection direction). Key point coordinates are transformed to the image plane by Equation 2, dependent on the Source-Detector-Distance and the Detector-Pixel-Spacing, and normalized to the maximum range.
(2) 
Using a cross shape enables us to invert Equation 2 geometrically by fitting lines through two subsets of key points (see also Fig. 3). The intersection yields the position, and the slope yields the forward angle. The depth and projection angle are determined by applying Equations 3 and 4 to the same key point subsets.
(3) 
(4) 
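The reconstruction of position and forward angle from the two key-point subsets can be sketched as follows. This is a sketch assuming an idealized cross of key points (as in Fig. 3); the depth and projection angle computations of Equations 3 and 4 are omitted, since they additionally require the projection geometry.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line fit: centroid plus principal direction."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]  # point on line, unit direction

def reconstruct_pose(axis_points, cross_points):
    """Recover position and forward angle from the two key-point subsets.

    `axis_points` lie on the projected instrument axis, `cross_points` on
    the orthogonal line through the instrument origin (the cross shape).
    """
    p1, d1 = fit_line(axis_points)
    p2, d2 = fit_line(cross_points)
    # Intersect p1 + t*d1 = p2 + s*d2 for the instrument position.
    t, _ = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    position = p1 + t * d1
    # Forward angle modulo 180 deg (the fitted direction has no sign).
    forward_angle = np.degrees(np.arctan2(d1[1], d1[0])) % 180.0
    return position, forward_angle
```

Fitting lines through several predicted points averages out individual key-point errors, which is one motivation for predicting points instead of the pose directly.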
4.3 i3PosNet Architecture
The input of the network is a normalized greyscale image, as provided by the patch generation.
We benchmark the CNN (for the benchmarking scenario, see Section 3.4) along multiple design dimensions, including the number of convolutional layers and blocks, the pooling layer type, the number of fully connected layers and the regularization strategy. In this context, we assume a block consists of multiple convolutional layers and ends in a pooling layer shrinking the layer size. Adjusting the last layer of a block to use a stride instead of pooling is one option. All layers use ReLU activation. We double the number of channels after every block, starting with 32 for the first block. We use the Mean Squared Error as loss function. The design dimensions of our analysis are summarized in Table 2.
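One plausible instantiation of such a VGG-style network with 13 weight layers (10 convolutional plus 3 fully connected) is sketched below in PyTorch. The exact block layout, patch size and fully connected widths are assumptions, since the paper benchmarks several variants (Table 2).

```python
import torch
import torch.nn as nn

def build_i3posnet_cnn(in_shape=(1, 64, 96), n_outputs=12):
    """VGG-style CNN: ReLU activations, channels doubling per block
    starting at 32, max pooling after each block, 12 regression outputs
    (the normalized x/y coordinates of 6 key points).
    """
    layers, channels_in = [], in_shape[0]
    for n_convs, channels in zip([2, 2, 3, 3], [32, 64, 128, 256]):
        for _ in range(n_convs):
            layers += [nn.Conv2d(channels_in, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
            channels_in = channels
        layers.append(nn.MaxPool2d(2))  # halve the spatial size per block
    features = nn.Sequential(*layers)

    # Infer the flattened feature size from a dummy forward pass.
    with torch.no_grad():
        flat = features(torch.zeros(1, *in_shape)).numel()
    head = nn.Sequential(nn.Flatten(),
                         nn.Linear(flat, 512), nn.ReLU(inplace=True),
                         nn.Linear(512, 512), nn.ReLU(inplace=True),
                         nn.Linear(512, n_outputs))
    return nn.Sequential(features, head)
```

Training such a network with a Mean Squared Error loss on the normalized key-point coordinates matches the regression setup described in the text.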
For optimizers, we evaluated both Stochastic Gradient Descent with Nesterov momentum update and Adam, including different parameter combinations.
4.4 Data augmentation and training setup
From the DRRs (as generated by the pipeline described in Section 3.3) we create training sets for all scenarios. We create 10 image patches for each image of the cases used for training. In the creation of these image patches, we take two considerations into account to train the CNN on samples similar to the use case:

- Deviations from the standard pose are covered by adding noise to the image coordinates and forward angle components of the pose.
- Greater model accuracy around the standard pose is achieved by sampling the noise such that the CNN trains on more samples with poses similar to the standard pose.
We implement these considerations by adding deviations on the order of the expected clinical initial estimates: for the position, the noise is sampled in polar coordinates, and for the forward angle we draw from a normal distribution, as described by Equation 5:
(5) 
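The augmentation noise of Equation 5 could be sampled roughly as follows; the scale parameters below are illustrative, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose_deviation(sigma_r=1.0, sigma_angle=10.0):
    """Draw one deviation from the standard pose for augmentation.

    The in-plane positional noise is sampled in polar coordinates (a
    radius concentrated near zero and a uniform direction) and the
    forward-angle noise from a normal distribution.
    """
    radius = abs(rng.normal(0.0, sigma_r))   # mm, concentrated near 0
    phi = rng.uniform(0.0, 2 * np.pi)        # direction, uniform
    dx, dy = radius * np.cos(phi), radius * np.sin(phi)
    d_theta = rng.normal(0.0, sigma_angle)   # forward-angle noise, degrees
    return dx, dy, d_theta
```

Sampling the radius from a zero-centered distribution, rather than positions uniformly in a disk, places more training samples near the standard pose, matching the second consideration above.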
5 Experiments
We conducted four evaluations:

- Quantitative comparison of i3PosNet with registration
- Analysis of direct and indirect prediction of angles, i.e. end-to-end and modular training
- Generalization to instruments and anatomies
- Analysis of the number of training x-ray images
Unless explicitly stated otherwise, we used the same sampling strategy for the initial pose as for data augmentation (cf. Section 4.4). i3PosNet does not need an initial estimate for the projection angle or the depth. The upper bound of 2.5 mm for the initial position estimation error is drawn from two considerations: a) initial pose estimates from electromagnetic tracking [3] and b) position errors larger than 1 mm from the surgery plan are assumed to be failure states in any case.
5.1 Metrics
We evaluated the components of the predicted pose independently using 5 error measures:

- Position (millimeter): Euclidean distance between prediction and ground truth at the instrument, projected on a plane orthogonal to the projection normal; also called reprojection distance (RPD) by van de Kraats et al. [34].
- Position (pixel): Euclidean pixel distance in the image.
- Forward angle (degrees): angle between estimated and ground truth orientation in the image plane.
- Projection angle (degrees): tilt out of the image plane. We are restricted to differences of absolute angle values, since i3PosNet cannot determine the sign of the projection angle.
- Depth error (millimeter): error of the depth estimation, which van de Kraats et al. [34] refer to as the target registration error in the projection direction.
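Under the definitions above, the five error measures for a single prediction could be computed as follows. This is a sketch; the key names and pose representation are our own, not the paper's code.

```python
import numpy as np

def pose_errors(pred, truth):
    """Compute the five evaluation measures for one prediction.

    Poses are dicts with keys: 'xy_mm' (in-plane position, mm),
    'xy_px' (pixels), 'forward_deg', 'projection_deg', 'depth_mm'.
    """
    return {
        # Reprojection distance: Euclidean, in the plane orthogonal
        # to the projection normal.
        "position_mm": float(np.linalg.norm(
            np.subtract(pred["xy_mm"], truth["xy_mm"]))),
        "position_px": float(np.linalg.norm(
            np.subtract(pred["xy_px"], truth["xy_px"]))),
        # Smallest in-plane orientation difference, in degrees.
        "forward_deg": abs(
            (pred["forward_deg"] - truth["forward_deg"] + 180) % 360 - 180),
        # The sign of the projection angle is unobservable, so compare
        # absolute values only.
        "projection_deg": abs(abs(pred["projection_deg"])
                              - abs(truth["projection_deg"])),
        # Depth error along the projection normal.
        "depth_mm": abs(pred["depth_mm"] - truth["depth_mm"]),
    }
```

The wrap-around in the forward-angle term avoids the spurious 360°-vs-0° errors discussed for direct angle regression in Section 5.3.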
5.2 Comparison with the Current State of the Art
Registration is the accepted state-of-the-art method for pose estimation in medical applications [10, 20, 9, 35, 36], while Deep Learning-based methods are still new to the field [23]. We evaluated registration for pose estimation for the screw and anatomy 1 in an earlier work [36]. There we identified the configuration and components for the registration to achieve the best results: Covariance Matrix Adaptation Evolution Strategy (CMA) as the optimizer and Gradient Correlation (GC) as the metric. This configuration is consistent with findings from Uneri et al. [9] and Miao et al. [24].
Experimental Setup: We generated 25 DRRs for the screw and anatomy 1 (cf. Evaluation in Table 1) and performed two pose estimations from randomly sampled deviations of the initial estimate. The projection matrices were available to the registration method, so new DRRs (moving images) were generated on the fly depending on the pose of the instrument. We limited the number of DRRs generated to 400. While the registration operated on positions w.r.t. the patient, all error calculations were performed in terms of the pose (cf. Equation 1). Four i3PosNet models were independently trained for 80 epochs and evaluated for 1 iteration (i2PosNet) and for 3 iterations (i3PosNet). The results of these 4 models were merged into one box.
Results: i3PosNet outperformed the state-of-the-art registration method with the best configuration (CMA and GC) by a factor of 5 and more (see Fig. 4). For i3PosNet and i2PosNet, all results are sub-pixel (below 0.1 mm).
5.3 Direct vs. indirect prediction
To emphasize our reliance on geometric considerations, we evaluated the prediction of forward angles (orientation) on a simplified artificial case (see Section 4.1) and confirmed the results on our benchmark case (see Section 3.4).
General orientation angle regression: We trained 5 models independently to regress the forward angle in the rectangle scenario, for both the direct and the indirect prediction approach. For Fig. 5 we merged the evaluation results of all five models. Our proposed indirect method (using geometric considerations) outperformed the direct method. The results showed a better overall accuracy (the third quartile of indirect errors roughly matches the first quartile of direct errors) and significantly fewer dominant outliers. Especially the errors of the direct method at the jump from 360° to 0° became apparent.
Comparison of end-to-end and modular schemes: Comparing i3PosNet with an end-to-end setup, we found the direct regression of the forward angle to display considerably larger errors than i3PosNet (indirect regression). We observed similar error levels for situations close to the standard pose, but with increasing difference of the pose from this mean pose, errors got significantly larger. Fig. 6b illustrates the consistent results of the modular approach and the dependency of the end-to-end approach on both analyzed angles.
5.4 Analysis of projection parameters
In order to determine the limits imposed on i3PosNet by the properties of the projection, we evaluated i3PosNet’s dependence on the projection angle. We expected limitations, since instruments with a dominant axis and strong rotational symmetry are nearly indistinguishable w.r.t. their orientation when the dominant axis and the projection direction coincide.
i3PosNet’s forward angle predictions gradually started to lose quality, as shown in Fig. 7, for absolute projection angles greater than 60°, where the projected length of the instrument shrinks to half of that of an in-plane instrument. These effects became significant for larger absolute projection angles, sometimes even leading to forward angle estimates for the next iteration outside the parameter space specified in training and thereby escalating errors. Therefore we limit the experiments to absolute projection angles of at most 80°.
Fig. 7 illustrates these observations by displaying projections in addition to the forward angle errors for all three instruments. The drill was especially prone to error increases at large absolute projection angles.
5.5 Considering number of images and iterations
To determine the optimal number of iterations for i3PosNet, we analyzed the improvements of predictions for different numbers of iterations. Fig. 8 shows a large improvement of mean and quartile errors in the second iteration, followed by a negligible improvement in the third. Single-image predictions using one GTX 1080 at 6 % utilization were fast enough for 3 iterations, making i3PosNet feasible for real-time applications.
We evaluated the effect of the number of unique images (at a constant 20 patches per DRR) used for training (see Fig. 9). By increasing the number of training epochs, we kept the number of model updates constant in order to distinguish between model convergence and dataset size. We observed a trend of decreasing errors with saturation.
5.6 Generalization to Instruments and Anatomies
The experiments for instruments and anatomies differed in the chosen set of DRRs for the datasets as well as in the placement of the standard pose and the fourth key point. Since most of the instrument was “in front of” its origin for the screw and “behind” its origin for the other two instruments, the position of the standard pose in the patch was adapted to include a large part of the instrument in the patch, and the fourth key point was placed accordingly. These adaptations translated to mirroring the instrument on the YZ plane.
In the evaluation of i3PosNet we ran experiments for every individually trained neural network, dropping experiments with projection angles greater than 80°. We normalized the whiskers in Fig. d to a fixed approximate outlier percentage.
In the evaluation of the position error (see Fig. a), the vast majority of evaluations resulted in errors of less than 0.3 mm. Most of the fail cases were attributed to five DRRs and concerned the drill. 95 % (99 %) of the millimeter errors were smaller than 0.071 mm (0.107 mm).
6 Discussion & Conclusion
We estimate the pose of three surgical instruments using a Deep Learning-based approach. By including geometric considerations in our method, we are able to approximate the non-linear properties of rotation and projection. In previous works, this was done by training neural networks on specialized sections of the parameter space [24], in effect providing the chance for local linearizations. i3PosNet outperforms registration-based methods by a factor of 5 and more.
i3PosNet performs well independent of the instrument, with the only instrument-dependent parameter being the relation of the instrument origin to the instrument’s center of mass.
Our instruments share the property of a dominant axis with most surgical instruments (screws, nails, rotational tools, catheters, drills, etc.), and we evaluate i3PosNet for minimally invasive surgery, where tools are very small. The dominant axis leads to i3PosNet’s limitation on feasible projection angles, which requires an angle of at least 10° between said axis and the projection direction. The absence of the discontinuous jump between 359° and 0° is another advantage of the indirect determination of the orientation.
In the future, we want to embed i3PosNet in a multi-tool localization scheme, where fiducials, instruments, etc. are localized and their poses estimated without knowledge of the projection matrix. To increase the 3D accuracy, multiple orthogonal X-rays and a proposal scheme for the projection direction may be used. We want to verify i3PosNet on real X-ray images by exploring the dependence on clean annotations and investigating methods to cope with noisy annotations. One limitation of i3PosNet is the associated radiation exposure, which could be decreased by low-energy X-rays, possibly at multiple settings [37].
With the accuracy shown in this paper, i3PosNet enables surgeons to accurately determine the pose of instruments even when the line of sight is obstructed. Through this novel navigation method, surgeries previously barred from minimally invasive approaches gain new possibilities, with an outlook of higher precision and reduced surgical trauma for the patient.
Acknowledgment
The authors would like to thank the German Research Foundation for funding this research.
Footnotes
 We enclose detailed analysis and comparisons of the design parameters for the network architecture and the optimizer in the Supplementary Material.
 See the Supplementary Material for the analysis of the design parameters.
References
 A. J. Koffron, G. Auffenberg, R. Kung, and M. Abecassis, “Evaluation of 300 minimally invasive liver resections at a single institution: Less is more,” Annals of surgery, vol. 246, no. 3, pp. 385–92; discussion 392–4, 2007.
 J. Schipper, A. Aschendorff, I. Arapakis, T. Klenzner, C. B. Teszler, G. J. Ridder, and R. Laszig, “Navigation as a quality management tool in cochlear implant surgery,” The Journal of laryngology and otology, vol. 118, no. 10, pp. 764–770, 2004.
 A. M. Franz, T. Haidegger, W. Birkfellner, K. Cleary, T. M. Peters, and L. Maier-Hein, “Electromagnetic tracking in medicine–a review of technology, validation, and applications,” IEEE transactions on medical imaging, vol. 33, no. 8, pp. 1702–1725, 2014.
 M. Caversaccio, K. Gavaghan, W. Wimmer, T. Williamson, J. Ansò, G. Mantokoudis, N. Gerber, C. Rathgeb, A. Feldmann, F. Wagner, O. Scheidegger, M. Kompis, C. Weisstanner, M. Zoka-Assadi, K. Roesler, L. Anschuetz, M. Huth, and S. Weber, “Robotic cochlear implantation: Surgical procedure and first clinical experience,” Acta oto-laryngologica, vol. 137, no. 4, pp. 447–454, 2017.
 I. Stenin, S. Hansen, M. Becker, G. Sakas, D. Fellner, T. Klenzner, and J. Schipper, “Minimally invasive multiport surgery of the lateral skull base,” BioMed research international, vol. 2014, p. 379295, 2014.
 R. F. Labadie, R. Balachandran, J. H. Noble, G. S. Blachon, J. E. Mitchell, F. A. Reda, B. M. Dawant, and J. M. Fitzpatrick, “Minimally invasive imageguided cochlear implantation surgery: first report of clinical implementation,” The Laryngoscope, vol. 124, no. 8, pp. 1915–1922, 2014.
 J.-P. Kobler, M. Schoppe, G. J. Lexow, T. S. Rau, O. Majdani, L. A. Kahrs, and T. Ortmaier, “Temporal bone borehole accuracy for cochlear implantation influenced by drilling strategy: An in vitro study,” International journal of computer assisted radiology and surgery, vol. 9, no. 6, pp. 1033–1043, 2014.
 J. Ansó, C. Dür, K. Gavaghan, H. Rohrbach, N. Gerber, T. Williamson, E. M. Calvo, T. W. Balmer, C. Precht, D. Ferrario, M. S. Dettmer, K. M. Rösler, M. D. Caversaccio, B. Bell, and S. Weber, “A neuromonitoring approach to facial nerve preservation during image-guided robotic cochlear implantation,” Otology & neurotology: official publication of the American Otological Society, American Neurotology Society [and] European Academy of Otology and Neurotology, vol. 37, no. 1, pp. 89–98, 2016.
 A. Uneri, J. W. Stayman, T. de Silva, A. S. Wang, G. Kleinszig, S. Vogt, A. J. Khanna, J.-P. Wolinsky, Z. L. Gokaslan, and J. H. Siewerdsen, “Known-component 3D-2D registration for image guidance and quality assurance in spine surgery pedicle screw placement,” Proceedings of SPIE–the International Society for Optical Engineering, vol. 9415, 2015.
 H. Esfandiari, S. Amiri, D. D. Lichti, and C. Anglin, “A fast, accurate and closed-form method for pose recognition of an intramedullary nail using a tracked C-arm,” International journal of computer assisted radiology and surgery, vol. 11, no. 4, pp. 621–633, 2016.
 T. Steger and S. Wesarg, “Quantitative analysis of marker segmentation for C-arm pose based navigation,” in XIII Mediterranean Conference on Medical and Biological Engineering and Computing 2013, ser. IFMBE Proceedings, L. M. Roa Romero, Ed. Cham: Springer, 2014, vol. 41, pp. 487–490.
 P. Markelj, D. Tomaževič, B. Likar, and F. Pernuš, “A review of 3d/2d registration methods for imageguided interventions,” Medical image analysis, vol. 16, no. 3, pp. 642–661, 2012.
 M. A. Viergever, J. B. A. Maintz, S. Klein, K. Murphy, M. Staring, and J. P. W. Pluim, “A survey of medical image registration – under review,” Medical image analysis, vol. 33, pp. 140–144, 2016.
 T. Steger, M. Hoßbach, and S. Wesarg, “Marker detection evaluation by phantom and cadaver experiments for C-arm pose estimation pattern,” ser. SPIE Proceedings, D. R. Holmes and Z. R. Yaniv, Eds. SPIE, 2013, p. 86711V.
 A. K. Jain, T. Mustafa, Y. Zhou, C. Burdette, G. S. Chirikjian, and G. Fichtinger, “FTRAC—a robust fluoroscope tracking fiducial,” Medical Physics, vol. 32, no. 10, p. 3185, 2005.
 W. El Hakimi, J. Beutel, and G. Sakas, “Particle path segmentation: a fast, accurate, and robust method for localization of spherical markers in cone-beam CT projections,” in Track f: devices and systems for surgical interventions, K. Lange, W. Lauer, M. Nowak, D. B. Ellebrecht, and M. P. E. Gebhard, Eds., vol. 51, 2014, pp. 405–408.
 M. Bui, S. Albarqouni, M. Schrapp, N. Navab, and S. Ilic, “X-ray PoseNet: 6 DoF pose estimation for mobile X-ray devices,” in WACV 2017. Piscataway, NJ: IEEE, 2017, pp. 1036–1044.
 L. Maier-Hein, S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou, M. Hashizume, D. Katic, H. Kenngott, M. Kranzfelder, A. Malpani, K. März, T. Neumuth, N. Padoy, C. Pugh, N. Schoch, D. Stoyanov, R. Taylor, M. Wagner, G. D. Hager, and P. Jannin, “Surgical data science: Enabling next-generation surgery,” 2017. [Online]. Available: http://arxiv.org/pdf/1701.06482
 T. Kurmann, P. Marquez Neila, X. Du, P. Fua, D. Stoyanov, S. Wolf, and R. Sznitman, “Simultaneous recognition and pose estimation of instruments in minimally invasive surgery,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Springer International Publishing, 2017.
 C. R. Hatt, M. A. Speidel, and A. N. Raval, “Realtime pose estimation of devices from xray images: Application to xray/echo registration for cardiac interventions,” Medical image analysis, vol. 34, pp. 101–108, 2016.
 V. Ulman, M. Maška, K. E. G. Magnusson, O. Ronneberger, C. Haubold, N. Harder, P. Matula, P. Matula, D. Svoboda, M. Radojevic, I. Smal, K. Rohr, J. Jaldén, H. M. Blau, O. Dzyubachyk, B. Lelieveldt, P. Xiao, Y. Li, S.-Y. Cho, A. C. Dufour, J.-C. Olivo-Marin, C. C. Reyes-Aldasoro, J. A. Solis-Lemus, R. Bensch, T. Brox, J. Stegmaier, R. Mikut, S. Wolf, F. A. Hamprecht, T. Esteves, P. Quelhas, Ö. Demirel, L. Malmström, F. Jug, P. Tomancak, E. Meijering, A. Muñoz-Barrutia, M. Kozubek, and C. Ortiz-de-Solorzano, “An objective comparison of cell-tracking algorithms,” Nature Methods, vol. 14, no. 12, p. 1141, 2017.
 A. Vandini, B. Glocker, M. Hamady, and G.-Z. Yang, “Robust guidewire tracking under large deformations combining segment-like features (SEGlets),” Medical image analysis, vol. 38, pp. 150–164, 2017.
 G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
 S. Miao, Z. J. Wang, and R. Liao, “A cnn regression approach for realtime 2d/3d registration,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1352–1363, 2016.
 M. F. Valstar, E. Sánchez-Lozano, J. F. Cohn, L. A. Jeni, J. M. Girard, Z. Zhang, L. Yin, and M. Pantic, “FERA 2017 – addressing head pose in the third facial expression recognition and analysis challenge,” 2017. [Online]. Available: http://arxiv.org/pdf/1702.04174
 Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. Piscataway, NJ: IEEE, 2013, pp. 3476–3483.
 E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres, “Joint graph decomposition & node labeling: Problem, algorithms, applications,” in 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017, pp. 1904–1912.
 L. Pérez, Í. Rodríguez, N. Rodríguez, R. Usamentiaga, and D. F. García, “Robot guidance using machine vision techniques in industrial environments: A comparative review,” Sensors, vol. 16, no. 3, p. 335, 2016.
 I. Stenin, S. Hansen, M. Nau-Hermes, W. E. Hakimi, M. Becker, J. Bevermann, T. Klenzner, and J. Schipper, “Evaluation von minimal-invasiven Multiport-Zugängen der Otobasis am humanen Schädelpräparat,” in 13. Jahrestagung der Deutschen Gesellschaft für Computer- und Roboterassistierte Chirurgie, September 11–13, 2014, Munich, Germany, H. Feußner, Ed., 2014, pp. 127–130.
 T. S. Yoo, M. J. Ackerman, W. E. Lorensen, W. Schroeder, V. Chalana, S. Aylward, D. Metaxas, and R. Whitaker, “Engineering and algorithm design for an image processing API: A technical report on ITK–the Insight Toolkit,” Studies in health technology and informatics, vol. 85, pp. 586–592, 2002.
 S. Rit, M. Vila Oliva, S. Brousmiche, R. Labarbe, D. Sarrut, and G. C. Sharp, “The Reconstruction Toolkit (RTK), an open-source cone-beam CT reconstruction toolkit based on the Insight Toolkit (ITK),” Journal of Physics: Conference Series, vol. 489, p. 012079, 2014.
 A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” 2015. [Online]. Available: http://arxiv.org/pdf/1505.07427
 K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014. [Online]. Available: http://arxiv.org/pdf/1409.1556
 E. B. van de Kraats, G. P. Penney, D. Tomazevic, T. van Walsum, and W. J. Niessen, “Standardized evaluation methodology for 2D-3D registration,” IEEE transactions on medical imaging, vol. 24, no. 9, pp. 1177–1189, 2005.
 Y. Otake, M. Armand, R. S. Armiger, M. D. Kutzer, E. Basafa, P. Kazanzides, and R. H. Taylor, “Intraoperative image-based multiview 2D/3D registration for image-guided orthopaedic surgery: incorporation of fiducial-based C-arm tracking and GPU-acceleration,” IEEE transactions on medical imaging, vol. 31, no. 4, pp. 948–962, 2012.
 D. Kügler, M. Jastrzebski, and A. Mukhopadhyay, “Anonymous submission,” 2018.
 M. N. Wernick, O. Wirjadi, D. Chapman, Z. Zhong, N. P. Galatsanos, Y. Yang, J. G. Brankov, O. Oltulu, M. A. Anastasio, and C. Muehleman, “Multiple-image radiography,” Physics in Medicine & Biology, vol. 48, no. 23, p. 3875, 2003.