A Deep Learning Approach for Pose Estimation from Volumetric OCT Data
Tracking the pose of instruments is a central problem in image-guided surgery. For microscopic scenarios, optical coherence tomography (OCT) is increasingly used as an imaging modality. OCT is suitable for accurate pose estimation due to its micrometer range resolution and volumetric field of view. However, OCT image processing is challenging due to speckle noise and reflection artifacts in addition to the images’ 3D nature. We address pose estimation from OCT volume data with a new deep learning-based tracking framework. For this purpose, we design a new 3D convolutional neural network (CNN) architecture to directly predict the 6D pose of a small marker geometry from OCT volumes. We use a hexapod robot to automatically acquire labeled data points which we use to train 3D CNN architectures for multi-output regression. We use this setup to provide an in-depth analysis on deep learning-based pose estimation from volumes. Specifically, we demonstrate that exploiting volume information for pose estimation yields higher accuracy than relying on 2D representations with depth information. Supporting this observation, we provide quantitative and qualitative results that 3D CNNs effectively exploit the depth structure of marker objects. Regarding the deep learning aspect, we present efficient design principles for 3D CNNs, making use of insights from the 2D deep learning community. In particular, we present Inception3D as a new architecture which performs best for our application. We show that our deep learning approach reaches errors at our ground-truth label’s resolution. We achieve a mean average error of and for position and orientation learning, respectively.
Keywords: 3D Convolutional Neural Networks, 3D Deep Learning, Pose Estimation, Optical Coherence Tomography
Tracking the pose of instruments and patients is a typical problem in many clinical scenarios, e.g., minimally invasive surgery (MIS) (Bouget.2017) or transcranial magnetic stimulation (Richter.2013). Common commercially available optical and electromagnetic (EM) tracking systems reach an accuracy of to (Kral.2013). For optical tracking, a mean tracking error of has been achieved for clinical setups (Elfring.2010). EM tracking operates without a line of sight but generally reaches lower accuracy with a typical root mean square error (RMSE) of (Franz.2014). Some application scenarios in MIS require better accuracy, such as ophthalmic surgery, cochleostomy or neurosurgery. Moreover, the markers for optical tracking systems have a size of several centimeters which hinders application for these micro-scale scenarios.
OCT represents a high-resolution image modality that is suitable for guiding microscale medical interventions. For example, OCT systems have been integrated into operating microscopes (Lankenau.2007), e.g., for ophthalmic surgery (Tao.2014) and neurosurgery (Finke.2012). Moreover, OCT has been studied as a tracking system for cochleostomy by using artificial markers created with a laser (Zhang.2014b). The approach reached tracking accuracy in the micrometer range. These results motivate the use of OCT as a precise pose estimation and tracking system.
Recently, deep learning-based frameworks have been applied for pose estimation problems. This includes methods to learn descriptors for 3D pose estimation from 2D images (Wohlhart.2015) and full 6D pose estimation from RGB-D images (Krull.2015). Similarly, CNNs are considered a promising approach for surgical tool segmentation and pose estimation with recent successful applications (Sahu.2016). Taking a learning-based approach for pose estimation allows for independence from large markers which often comes at the cost of lower accuracy (Bouget.2017).
For OCT, tracking approaches have been proposed (Laves.2017; Camino.2016). However, these methods are limited to specific application scenarios such as skin or eye motion tracking using handcrafted features. Similar to pose estimation from time-of-flight camera images (Krull.2015), these approaches rely on 2D depth representations despite full volume data being available. In general, there are no deep learning approaches for OCT-based pose estimation so far.
For other medical image analysis tasks, such as segmentation of magnetic resonance imaging (MRI) data, 3D CNNs have been widely used (Dou.2017; Havaei.2017; Kamnitsas.2017). However, early 3D CNN architectures have been found to be limited by simple architecture choices (Yu.2017b), which leaves 3D CNN design as an open question. Moreover, to the best of our knowledge, 3D CNNs have not been applied to volumetric OCT data.
These considerations motivate a novel deep learning-based pose estimation approach for OCT. We take arbitrary small objects and turn them into a marker for pose estimation or tracking. To generate a training set, we acquire high-resolution volumetric OCT images of the object in different poses. We use a 3D CNN to learn highly accurate regression between volumetric images and object poses. Then, the 3D CNN can be used to estimate the object pose based on newly acquired volumetric images only. The object now acts as a marker that can be attached to surgical tools or patients to track their movement by inferring their pose changes from the marker. Figure 1 shows the data generation and tracking procedure in detail.
Our approach offers several advantages compared to the methods presented above. The marker’s shape and size can be chosen arbitrarily, and it is easy to manufacture, e.g., with a 3D printer. A 3D CNN can be trained for any marker shape. This allows for adaptation of our framework to different clinical tracking scenarios with varying requirements. Moreover, compared to tool segmentation, our approach does not require sophisticated manual labeling. Also, while offering flexibility similar to a markerless approach, we benefit from the high accuracy of marker-based systems, as our 3D CNN is fitted to one specific geometry at a time.
In this paper we provide an in-depth analysis of our proposed method concerning its accuracy, the use of volumetric OCT data and 3D CNN architectures for pose learning with OCT volumes.
First of all, we address the fundamental question of tracking accuracy. We compare our novel deep learning-based pose estimation approach to a classic feature detection and registration-based method with a similar setup (Zhang.2014b).
Next, we motivate the use of volumetric data for deep learning-based pose estimation. We investigate how directly leveraging volume information with 3D CNNs compares to the typical use of 2D depth representations.
Regarding the choice of volume data as our image representation, we also analyze how 3D CNNs make use of the additional depth information. OCT is a modality that can provide deep, subsurface information. However, this depends on materials and whether they can be penetrated by infrared light. We investigate how subsurface information benefits 3D CNN learning by comparing markers with and without an identifiable inner structure. We provide quantitative accuracy results and qualitative saliency maps to show how 3D CNNs exploit volume information for pose estimation.
In order to show our method’s robustness we also test our marker’s performance when the OCT image is occluded. These results illustrate the performance of our method in practical scenarios where many new objects are likely to appear that have not been present during training.
Another aspect of our proposed framework is the deep learning model itself. As a part of our method, we extend 3D CNN usage to OCT volume data. Building 3D CNNs is not trivial since the models have larger numbers of parameters and high computational and memory requirements compared to 2D CNNs. We consider efficient CNN design principles such as Inception (Szegedy.2016), ResNet (He.2016) and long-range feature transfer (Ronneberger.2015; Yu.2017b) in order to build a new 3D CNN architecture called Inception3D. We compare it to several 3D CNN architectures for our pose estimation method and highlight how different design principles affect performance.
Summarized, the main contributions of this paper are as follows:
We propose a novel deep learning method for direct pose estimation from volumes to track miniature markers with high accuracy.
We show the advantages of a volume-based learning approach for pose estimation by comparing it to typical 2D depth-based tracking approaches.
We provide quantitative and qualitative evidence that 3D CNNs exploit the additional volume information well when using markers with internal features.
Our work extends 3D CNNs to OCT volume data, and we introduce Inception3D as a new architecture for pose estimation and compare it to different CNN design principles.
2 Related Work
Our approach is linked to CNNs, pose estimation, and OCT imaging.
CNNs have been widely used in various fields of computer vision such as classification (Krizhevsky.2012), object detection (Girshick.2014), pose estimation (Toshev.2014) and semantic segmentation (Long.2015). Since their initial success in the ImageNet large scale visual recognition competition (ILSVRC2012), various new architectures and additions for CNNs have been introduced. The Inception architecture (Szegedy.2015) showed success by utilizing different filter sizes on the same intermediate features in a network. This resembles the extraction of features at different scales. Residual connections were introduced to deal with the degradation problem in very deep networks (He.2016). These were also incorporated into a new iteration of the Inception architecture (Szegedy.2017) that we use as a basis. Xie.2017 introduced ResNeXt, an architecture based on the ideas of Inception and residual learning. Their key contribution is the reduced number of hyperparameters that need to be chosen, which makes the architecture easier to extend to new problems. Xie.2017 argue that sophisticated hyperparameter tuning hindered the application of successful architectures such as Inception to new domains. Li.2017 employed the Inception architecture on 3D data for 3D neuron reconstruction. However, the architecture was used with 2D kernels, so the CNN’s kernels have 2D fields of view (FOVs) and cannot learn features that exploit volumetric context.
Recently, the usage of 3D CNNs for volumetric MRI data was proposed. Dou.2016b used 3D CNNs on MRI data for the detection of cerebral microbleeds. Brosch.2016 performed multiple sclerosis lesion segmentation on 3D MRI data. These approaches relied on simple CNN architectures and were therefore limited in their representation capability (Yu.2017b). Other approaches relied on custom 3D CNN designs, e.g., Havaei.2017 built a cascaded architecture and Dou.2017 relied on deep supervision with auxiliary classifiers (Lee.2015) and dense output predictions. The U-Net design principle (Ronneberger.2015) has been extended to 3D (Cicek.2016) and is often found in 3D CNN architectures for segmentation tasks (Litjens.2017). The architecture is similar to an encoder-decoder scheme with feature propagation between similar resolution stages in the encoder and decoder part. Chen.2017b built a related architecture with multi-scale feature aggregation at higher network levels. Moreover, Chen.2017 improved 3D CNN architectures by utilizing residual connections in a CNN for volumetric brain segmentation. Yu.2017b refined this further by utilizing both short and long residual connections in a network. The latter is inspired by the feature propagation of U-Net. We also build on this idea, but we propagate information between different resolution stages instead of similar ones. So far, the efficient design principles found in Inception and ResNet architectures have received little attention for 3D medical image data despite their success in the 2D domain. Since these architectures are specifically designed for efficiency, we employ their design principles in the 3D domain where resources are often critical.
Pose estimation is a key problem in computer vision and has been widely studied and used in medicine. While typical approaches solve the task explicitly with known rigid body markers, machine learning-based approaches have gained popularity in clinical applications (Bouget.2017). In MIS environments, pose estimation is used for tracking of surgical tools or patients from endoscopic RGB videos. Allan.2014 performed tracking and 3D pose estimation of surgical tools from videos using linear Kalman filters. Recently, CNNs have been applied for the localization of tools in robot-assisted MIS surgery (Sarikaya.2017). Moreover, GarciaPerazaHerrera.2016 employ fully convolutional networks (FCN) for real-time segmentation and tracking of tools. Still, the application of CNNs in medical tracking tasks is rare, also due to the difficulty of obtaining large training sets (Bouget.2017).
In other fields, CNNs have been applied to pose estimation, in particular from RGB-D images. Wohlhart.2015 learned a semantic descriptor that separates image patches by object type and pose. Object recognition and pose estimation are performed by a nearest neighbor search which matches an image patch to a training sample based on their descriptors. The pose estimation is coarse and highly dependent on the density of training samples in the pose space. Krull.2015 took an analysis-by-synthesis approach for 6D pose estimation in RGB-D images. Rendered and observed image representations are fed as channels into a 2D CNN to predict an energy function value that is related to the target pose. Kehl.2016 employed CNNs in an unsupervised fashion on RGB-D patches for feature learning and subsequent 6D pose estimation. While images with a depth channel are frequently used, volumetric medical image data has not yet been used for 6D pose estimation. We address this gap and show that directly using volumetric data is advantageous over the typical approach of relying on 2D depth representations.
OCT is an interferometric imaging modality with micrometer resolution and a typical field of view (FOV) of several millimeters. OCT has been applied in surgical tasks through microscope integration, e.g., for ophthalmic surgery (Ehlers.2014) and laser cochleostomy (Zhang.2014b). Also, OCT-based tracking setups fused with an RGB-D camera have been investigated (Rajput.2016). For laser cochleostomy, an OCT-based pose estimation framework has been proposed (Zhang.2014b). Artificial landmarks are applied to the patient’s cochlea with a laser and are then used for relative movement tracking. The high accuracy results imply the usability of OCT data for pose estimation and tracking. Moreover, tracking of a region of interest (ROI) has been performed with maximum intensity projections (MIPs) and handcrafted feature registration (Laves.2017). Again, this approach leverages 2D depth representations instead of full volumetric information.
Additionally, OCT image data has recently been used in conjunction with machine learning approaches for tasks not related to pose estimation. Segmentation of retinal fluids has been performed using CNNs with 2D OCT slices (Schlegl.2015). Moreover, tissue classification tasks have been addressed using recurrent neural networks (Otte.2014) and CNN-based approaches (Abdolmanafi.2017). Also, detection of macular diseases has been addressed using CNNs (Karri.2017; Lee.2017).
To the best of our knowledge, exploitation of volumetric OCT data with 3D CNNs has not been employed and is an open question for this imaging modality. We address this problem and compare different architectures that are new for the 3D CNN domain with our pose estimation method. Moreover, we address volumetric data exploitation of 3D CNNs and show its advantages over depth image-based pose estimation approaches found in the literature.
First, we introduce the setup for generating OCT and pose data. Second, the nature of our pose estimation framework is explained in detail. Third, the 3D CNN architectures we employ are introduced.
3.1 Data Generation and General Setup
We employ a setup to automatically generate a set of image and pose data for learning. The setup consists of a hexapod robot, a spectral domain OCT (SD-OCT) device with a stand and a phantom to be used as a marker, see Figure 1. The hexapod moves the marker inside the OCT’s FOV and stops at predefined poses. The position part of the 6D poses is generated by randomly sampling positions in a 3D bounding box that covers the OCT’s FOV size. Orientations are created by randomly generating rotation angles within an interval. All components are uniformly sampled from their respective space. The hexapod moves to a pose, stops, and an OCT volume is acquired. The volume is combined with the current pose to form a labeled data sample. This procedure is repeated several thousand times to create a dataset for training. As a result, our 3D CNNs receive an OCT volume containing the marker as their input and are trained in order to predict the pose with respect to the hexapod’s reference point.
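The uniform pose sampling described above can be illustrated with a minimal sketch. The function name `sample_poses`, the bounding box limits, and the angle interval are hypothetical placeholders chosen for illustration; the actual ranges depend on the OCT’s FOV and the hexapod configuration.

```python
import numpy as np

def sample_poses(n, pos_low, pos_high, max_angle_deg, seed=0):
    """Draw n poses (3 translations, 3 rotation angles), each component uniform."""
    rng = np.random.default_rng(seed)
    # Positions are drawn uniformly inside an axis-aligned 3D bounding box.
    positions = rng.uniform(pos_low, pos_high, size=(n, 3))
    # Each rotation angle is drawn uniformly from a symmetric interval.
    angles = rng.uniform(-max_angle_deg, max_angle_deg, size=(n, 3))
    return np.hstack([positions, angles])

# Hypothetical ranges: a 3 mm cube and +/- 10 degrees per axis.
poses = sample_poses(5000,
                     pos_low=[-1.5, -1.5, -1.5],
                     pos_high=[1.5, 1.5, 1.5],
                     max_angle_deg=10.0)
```

Each row of `poses` would correspond to one hexapod target pose and, after acquisition, to the label of one OCT volume.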
It should be noted that these labels require the models to implicitly learn the transformation between the hexapod reference frame and a marker coordinate frame, since all poses are defined with respect to the hexapod. According to the universal approximation theorem, neural networks of sufficient capacity can approximate such a mapping, so the model is in principle able to learn this transformation. Moreover, this labeling approach allows fast, automatic data acquisition for large training sets. Also, the labeling strategy does not require pose estimation from images with a checkerboard, as typically used for learning-based pose estimation (Brachmann.2014).
Tracking is achieved by letting the CNN predict the marker’s pose in two different volumes. Then, the relative transformation can be easily obtained by a matrix multiplication. This is depicted in the right part of Figure 1.
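The relative transformation between two predicted poses can be sketched as follows. This is an illustration under assumptions, not the exact convention of our setup: angles are taken in radians, rotations are assumed to be applied about the moving axes in the order x, y, z, the shifted center of rotation is ignored, and the relative transformation is assumed to map the first pose onto the second. The helper names are hypothetical.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def pose_to_matrix(pose):
    """6D pose (x, y, z, alpha, beta, gamma) -> 4x4 homogeneous matrix."""
    x, y, z, alpha, beta, gamma = pose
    T = np.eye(4)
    # Assumed convention: rotations about the moving axes x, y', z''.
    T[:3, :3] = rot_x(alpha) @ rot_y(beta) @ rot_z(gamma)
    T[:3, 3] = [x, y, z]
    return T

def relative_transform(pose_a, pose_b):
    """Matrix T_rel with T_rel @ T_a = T_b (assumed direction)."""
    return pose_to_matrix(pose_b) @ np.linalg.inv(pose_to_matrix(pose_a))
```

With two CNN predictions `pose_a` and `pose_b`, `relative_transform(pose_a, pose_b)` would give the marker motion between the two volumes under the stated assumptions.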
3.1.1 OCT Imaging
The imaging device is an SD-OCT system which is based on interferometry. The technique’s advantage is its high spatial resolution in micrometer range which makes it suitable for high accuracy tracking tasks. A broadband light source with a common center wavelength at emits a beam that is split such that one part of it is directed at a reference mirror and the other part penetrates the object of interest. Light is scattered and reflected back and interferes with the reference signal part. A spectrometer captures the resulting interference spectrum that represents a 1D depth profile (A-scan) of the region of interest and is limited by the coherence length of the laser. Repeated scanning at different lateral positions results in a complete volume scan (C-scan) of the object of interest. The visibility of the object’s interior structure largely depends on the object’s reflective properties. If it reflects near infrared radiation very well, only the object’s surface will be visible in an OCT volume. This is a very relevant property when considering the pose estimation task. Typical 6D pose estimation frameworks (Krull.2015) also rely on surface information obtained with time-of-flight depth cameras. Therefore, it appears natural to employ a similar framework for OCT images if mostly surfaces are visible without internal features. We investigate this assumption by training both on volume data and 2D surface extractions. Also, we train both on an opaque marker, whose surfaces are hardly penetrated and a marker with a distinct inner structure, visible in OCT volumes. Both approaches provide insight on the importance of volume data usage. Figure 2 shows the different markers with the different properties. We refer to the opaque marker as marker A and the marker with an inner structure as marker B.
3.1.2 Robot for Ground-Truth Annotation
The hexapod robot shown in Figure 1 is used to move the marker within the OCT’s FOV as well as for obtaining ground-truth 6D pose labels. Its pose is expressed with respect to a reference point slightly below its top plate. Translations relative to that point are denoted as , and . The rotations are expressed by rotation angles , , around each axis of a coordinate frame shifted by , and from the reference point. Note, that rotations related to that point would lead to a translation of the phantom. Therefore, the center of rotation is shifted in -direction to place it inside the OCT volume and minimize marker translations caused by rotations. A rotation matrix is expressed by consecutively rotating with , and around the moving axes , , , such that the rotation matrix can be expressed as . The rotation matrix and the translations are used to form a homogeneous transformation matrix that is used to obtain the relative transformation matrix as shown in the right part of Figure 1. The target pose labels for learning take the form .
3.2 3D CNN Architectures and Training Procedure
Having obtained labeled data samples, the 3D CNN model can be set up, trained, optimized and used to predict poses. First, preprocessing steps are outlined where we set up datasets with 3D and 2D representations. Then, we describe the novel 3D CNN architectures for 3D OCT images and explain design choices.
For volume data, the volume size needs to be adjusted first due to computational requirements. We downsample the volumes from the acquisition size of to . The depth dimension is reduced by a larger factor than the lateral dimensions because its original pixel spacing is much smaller. As a result, the pixel spacing in each dimension of the volume represents the same Cartesian distance. The target volume size is a trade-off between computational effort and information potentially lost during the downsampling process. The selected size leads to satisfactory results while keeping training times within feasible bounds. Note that our pose estimation task does not allow us to perform subvolume sampling, which is typically applied for large 3D input volumes (Liefers.2017). The pose is a global image property that would be lost when sampling subvolumes. As a final preprocessing step, we subtract the training data set mean from each image to help gradient-based optimization (Simonyan.2014).
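The preprocessing steps can be sketched as follows. The sketch rests on assumptions: block averaging stands in for the unspecified downsampling method, the volume sizes and factors are hypothetical, and the training set mean is interpreted as a voxel-wise mean volume.

```python
import numpy as np

def downsample(volume, factors):
    """Block-average a 3D volume by integer factors (one per axis)."""
    fx, fy, fz = factors
    nx, ny, nz = (s // f for s, f in zip(volume.shape, factors))
    v = volume[:nx * fx, :ny * fy, :nz * fz]  # crop to a multiple of the factors
    return v.reshape(nx, fx, ny, fy, nz, fz).mean(axis=(1, 3, 5))

# Hypothetical sizes: lateral axes reduced by 2, depth axis by 8,
# so that the voxel spacing becomes similar in all directions.
rng = np.random.default_rng(0)
train = np.stack([downsample(rng.random((64, 64, 256)), (2, 2, 8))
                  for _ in range(4)])
train_mean = train.mean(axis=0)       # voxel-wise mean over the training set
train_centered = train - train_mean   # the same mean would be reused at test time
```

The whole volume is kept after downsampling, which preserves the global context that the pose depends on.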
For 2D depth data representations we extract surface information from the OCT volumes to obtain a 2D depth representation that is similar to other RGB-D based 6D pose estimation frameworks (Brachmann.2014). This allows for comparison to other OCT-based tracking approaches where 2D depth representations were used for tracking a volume of interest with handcrafted feature matching (Laves.2017).
We perform the extraction using MIPs from different views. This provides us with two different types of depth representations. The image index at which the maximum intensity was found represents the most intuitive notion of depth.
However, the maximum intensity itself also provides depth information. Considering a curved Gaussian beam model of the OCT’s infrared light, the intensity at the top of the volume (closer to the light source) will differ from that at the bottom. Moreover, the MIPs can also carry rotation information as the back-scattering from surfaces changes with the angle. Therefore, both the normalized depth index and the maximum intensities themselves are considered as 2D depth representations for learning. The extraction process is illustrated in Figure 3. Since our data is volumetric, there are several options for which coordinate direction (x, y, z) should be chosen for extraction. Here, x and y are the lateral coordinate directions and z is the depth direction along the OCT beam. Using several 2D projections from different angles is typically referred to as 2.5D and has been used for CNN training as a trade-off between less costly 2D and potentially richer 3D representations (Roth.2016).
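The two depth representations can be illustrated with a short sketch. It assumes that the volume is stored as an array with axis order (x, y, z), so that the depth direction is the last axis; the function name is hypothetical.

```python
import numpy as np

def mip_representations(volume, axis):
    """Return the maximum intensity and the normalized index of the maximum."""
    intensity = volume.max(axis=axis)
    # Index of the maximum along the projection axis, scaled to [0, 1].
    depth_index = volume.argmax(axis=axis) / (volume.shape[axis] - 1)
    return intensity, depth_index

# Toy volume with a single bright voxel at depth index 5 of 8.
volume = np.zeros((4, 4, 8))
volume[1, 2, 5] = 1.0
intensity, depth_index = mip_representations(volume, axis=2)
```

Calling the function with `axis=0` or `axis=1` would give the projections along the two lateral directions.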
The straightforward choice is the use of the MIP along the z-direction as this is the actual travelling direction of the OCT light beam. Taking the maximum value along the z-direction results in a projection on the x-y plane. Although this is the primary, relevant direction for OCT, some information is likely lost through the projection. To illustrate this, consider Figure 3. Potentially useful information below the surface is lost entirely through projection. Therefore, we also include x-z and y-z projections in our datasets. To maintain spatial alignment, we perform the MIP extraction from a volume size of . This results in five different 2D datasets that we compare to the volumetric dataset:
intensity values from the x-y projection
normalized depth index values from the x-y projection
normalized depth index values and intensity values from the x-y projection
intensity values from the x-y, x-z and y-z projections
normalized depth index values from the x-y, x-z and y-z projections
The third dimension refers to the channel.
In order to draw a connection between 2D and 3D data processing, we also consider the case of using 3D volume data with 2D kernels. Prior approaches handled OCT volume data by using 2D slices in the input data’s channel dimension with 2D CNNs (Schlegl.2015). By default, a 2D kernel that is swept over a volume performs processing slice by slice without taking context between slices into account. For a meaningful comparison to 3D CNNs, we extend the 3D volumes by a channel dimension for 2D kernel processing. Each channel contains a shifted version of the volume along the -direction. Therefore, when processing each slice with a 2D kernel, the neighboring slices are also taken into account.
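A minimal sketch of this channel extension is given below. The shift direction (here the last array axis), the number of shifts, and the zero padding at the borders are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def add_shifted_channels(volume, shifts=(-1, 0, 1), axis=2):
    """Stack shifted copies of a volume into a trailing channel dimension."""
    channels = []
    for s in shifts:
        shifted = np.roll(volume, s, axis=axis)  # returns a new array
        # np.roll wraps around, so the wrapped slices are set to zero.
        index = [slice(None)] * volume.ndim
        if s > 0:
            index[axis] = slice(0, s)
        elif s < 0:
            index[axis] = slice(s, None)
        if s != 0:
            shifted[tuple(index)] = 0
        channels.append(shifted)
    return np.stack(channels, axis=-1)

v = np.arange(16, dtype=float).reshape(2, 2, 4)
out = add_shifted_channels(v)  # shape (2, 2, 4, 3)
```

At every slice position, the channels now hold the slice itself and its two neighbors, so a 2D kernel that mixes channels sees limited context across slices.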
Summarized, we use five datasets with 2D depth representations for comparison to a volumetric dataset. This provides a comparison on how computationally cheaper 2D representations perform against more costly 3D data when being trained with a 2D CNN and 3D CNN, respectively. The baseline dataset for our evaluation is the volumetric dataset.
3.2.2 3D CNN Architectures
First, we motivate our general 3D CNN approach for the 6D pose estimation task at hand. Then, we describe the different architectures we employ with the respective design principles we followed for their construction.
Although CNNs have been popular for several years, application to volumetric input data in medical imaging is still rare (Greenspan.2016) and, to our knowledge, not available at all for OCT volume data. Therefore, our architecture follows popular design choices from the deep learning community for 2D applications and also considers successful approaches on MRI volume data.
The complete 3D CNN consists of several convolutional layers which represent a feature extraction stage and an output layer for the regression itself. The convolutional layers consist of a set of 3D kernels that are swept over the input and create several output feature volumes. The 3D property of the kernels leads to volumetric receptive fields which enable volume information exploitation.
Our principal network design is shown in Figure 4. After the volumetric input, some initial layers follow, which are identical for all architectures we build. Immediately after the first layer, we halve the input’s spatial dimension. We employ convolutional layers with stride two instead of the typical max pooling layer, following the idea of simplistic design (Springenberg.2014). Then, groups of architecture-specific layers follow, which we refer to as modules. At the module input, the first layer always reduces the input size by half in all spatial dimensions. Every architecture comes with two modules, representing our main feature extraction stage with the most model parameters and the largest influence on performance. After two modules, we apply global average pooling to reduce the current feature volume to a feature vector. This approach acts as a regularization as the following fully-connected layer has significantly fewer parameters (Lin.2013). The feature vector is fed into the output layer that predicts the pose as continuous regression. We chose to train separate networks for position and orientation. Therefore, the CNN output is always a vector with three elements. We motivate this choice when describing the target vectors in detail in Section 3.2.3. We compare this approach to direct prediction of the entire pose vector.
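The effect of global average pooling on the output layer can be sketched as follows. The number of feature maps (256) and the spatial size of the final feature volume (4 x 4 x 4) are hypothetical values chosen only to make the parameter comparison concrete.

```python
import numpy as np

def regression_head(features, weights, bias):
    """features: (C, D, H, W) feature volume -> vector with three elements."""
    pooled = features.mean(axis=(1, 2, 3))  # global average pooling -> (C,)
    return weights @ pooled + bias          # linear output for regression

C, D, H, W = 256, 4, 4, 4                   # hypothetical feature volume size
features = np.ones((C, D, H, W))
weights, bias = np.zeros((3, C)), np.array([0.1, 0.2, 0.3])
output = regression_head(features, weights, bias)

# Output layer parameters with and without global average pooling.
params_with_gap = 3 * C + 3
params_flattened = 3 * C * D * H * W + 3
```

Under these assumed sizes, pooling before the output layer reduces its parameter count by a factor of roughly D * H * W compared to a fully-connected layer on the flattened feature volume.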
The general architecture focuses on feature extraction at intermediate volume sizes of and . Note, that the volumes are padded to retain the desired volume sizes after convolutions. Considering the spatial dimension of the -axis, moving these main extraction stages to smaller volumes is not reasonable. Shifting the main extraction towards larger volumes is suboptimal as well since computational effort would increase tremendously.
For the modules in Figure 4 we employ different types of architectures to highlight the advantage of our network design. Each model introduces a different additional property that leads to our design of Inception3D, the main architecture we introduce in this paper. To maintain a fair comparison, we try to keep the architectures similar with respect to the number of parameters (4 million) and features learned.
To keep architecture design straightforward, we follow previous design principles for the 2D domain. Simonyan.2014 showed that smaller kernel sizes are preferable for CNNs, which is why we only employ filters for feature learning and filters for changing feature map sizes. Moreover, we increase the number of feature maps in our modules each time the spatial feature dimensions are halved.
Additionally, we employ batch normalization before every activation to reduce internal covariate shift (Ioffe.2015). The activation functions are of type ReLU (Glorot.2011).
ResNetA3D is an architecture that we base on current state-of-the-art 3D segmentation CNNs (Chen.2017; Yu.2017b) to provide a meaningful comparison to our other models. Several blocks of this architecture are joined to modules as shown in Figure 5. The key feature of this architecture compared to plain convolutional blocks is the use of residual connections (He.2016). The idea of this concept is to learn a residual F(x) = H(x) - x instead of the desired mapping H(x), where x is the block’s input. Residual connections are frequently used in the 2D image domain with numerous variations (Szegedy.2017; Zagoruyko.2016) and recently the concept was employed for 3D prostate segmentation (Chen.2017). Therefore, we see this model as a baseline architecture reflecting the application of 2D design principles in the 3D image domain. Note that this model is expensive regarding its number of parameters as it does not employ downsampling in the number of feature maps, which is introduced next. Therefore, the network comes with a smaller depth to maintain a similar number of parameters.
ResNetB3D is a model that extends the concept of residual blocks from ResNetA3D by adding convolutions for downsampling and upsampling in the feature map dimension, as shown in Figure 6. Often, this idea is described as a bottleneck. Furthermore, the method should be distinguished from spatial downsampling which acts on the images’ width, height and depth and helps to increase the implicit receptive fields. Reducing the feature map dimension follows the idea of dimensionality reduction which assumes that most of the input’s information can be preserved in a lower dimensional embedding. This concept was also used in the original 2D ResNet architecture (He.2016). However, to our knowledge, it has not been employed for 3D CNN learning tasks. This concept is particularly important for costly 3D CNNs as this method reduces the number of parameters and computational effort for the model. Note, that this design principle allows for a deeper model with more layers than ResNetA3D.
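The parameter savings of a bottleneck can be made concrete with a small calculation. The values below are assumptions for illustration only: kernel sizes of 3 x 3 x 3 for feature learning and 1 x 1 x 1 for changing the number of feature maps, 256 feature maps, and a reduction factor of four. Batch normalization parameters are ignored.

```python
def conv3d_params(c_in, c_out, k):
    """Weights plus biases of a 3D convolution with a k x k x k kernel."""
    return k ** 3 * c_in * c_out + c_out

def plain_block_params(c):
    # Two 3x3x3 convolutions at full width (assumed ResNetA3D-like block).
    return 2 * conv3d_params(c, c, 3)

def bottleneck_block_params(c, reduction=4):
    r = c // reduction
    return (conv3d_params(c, r, 1)     # 1x1x1: reduce feature maps
            + conv3d_params(r, r, 3)   # 3x3x3: feature learning at low width
            + conv3d_params(r, c, 1))  # 1x1x1: restore feature maps

plain = plain_block_params(256)
bottleneck = bottleneck_block_params(256)
```

Under these assumptions, one plain block costs more than twenty bottleneck blocks, which is why a bottleneck design permits a deeper model at a similar parameter budget.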
[Table rows, cell values not recoverable: Module 1 Res. Block /2; Module 1 Res. Block; Module 2 Res. Block /2; Module 2 Res. Block]
We propose Inception3D as a new 3D CNN architecture which is inspired by Inception-ResNet (Szegedy.2017). We make use of the previous models’ properties and additionally introduce the concept of multi-path convolutional blocks, as shown in Figure 7. The individual parameter choices for the convolutional layer sizes are shown in Table 1. The multi-path approach is motivated by the idea of feature extraction at different scales which is expected to yield more representative features (Szegedy.2015). Note that this architecture is more difficult to design because it involves more design choices. We address this problem by simplifying Inception3D without taking away its core concepts. Compared to Szegedy.2017, we employ a single type of Inception module with the same number of feature maps (width) for all filters in each path. Compared to our other models, we individually choose each block’s width, and we augment the architecture with long-range residual connections.
The idea of long-range residual connections is inspired by Yu.2017b where connections between the same feature map stages are applied in a U-net-like (Ronneberger.2015) encoder-decoder network. We extend this idea by transferring features between different feature map scales. For comparison, we also use the original idea of U-net for feature transfer (Ronneberger.2015). While residual connections perform an addition operation when features are fused, U-net concatenates the features to a larger feature map. For the latter, we perform a subsequent convolution that reduces the feature map size back to the original size after concatenation. In this way, the network can learn which combination of high- and low-level features is needed. The idea behind this approach is that pose estimation requires both local and global features. The latter are necessary for the object’s general position in the image while the former allow for fine-grained distinction of similar poses. Both skip connection approaches are shown in Figure 8.
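The two fusion variants described above can be sketched with NumPy as follows. This is an illustration under simplifying assumptions: both feature tensors are taken to have already been brought to the same spatial size and channel count, the 1x1x1 convolution is written as a per-voxel linear map, and all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def fuse_residual(x, skip):
    # Residual-style fusion: element-wise addition keeps the channel count.
    return x + skip


def fuse_concat(x, skip, weights):
    # U-net-style fusion: concatenate along the channel axis, then apply a
    # 1x1x1 convolution (a per-voxel linear map) to restore the channel count.
    stacked = np.concatenate([x, skip], axis=-1)  # shape (D, H, W, 2C)
    return stacked @ weights                      # shape (D, H, W, C)


channels = 8
x = rng.normal(size=(4, 4, 4, channels))     # current feature maps
skip = rng.normal(size=(4, 4, 4, channels))  # features from an earlier stage
weights = rng.normal(size=(2 * channels, channels))

added = fuse_residual(x, skip)
merged = fuse_concat(x, skip, weights)
print(added.shape, merged.shape)
```

In the concatenation variant the learned weights decide how strongly each high-level and low-level channel contributes, whereas the addition variant mixes both sources with fixed, equal weight.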
ResNeXt3D is a multi-path architecture similar to Inception and is inspired by Xie.2017, see Figure 9. The key idea is to combine the ideas of all of the above models with simplified design principles. The multi-path idea from Inception is adopted by splitting up the single convolution path from ResNetB3D. The number of paths is referred to as cardinality, which is considered the key hyperparameter for this type of architecture (Xie.2017). The resulting architecture is easy to tune because all paths are identical, whereas each path in Inception is carefully tuned individually. Therefore, the key difference between ResNeXt3D and Inception3D is the simpler architecture design of the former.
All in all, we propose four different architectures for the 3D image domain. Inception3D is our main architecture which we compare to the different design principles of our other models. ResNetA3D is a baseline with residual blocks that are found in typical 3D CNNs (Yu.2017b). For ResNetB3D we introduce the use of downsampling in the feature map dimension for more effective feature representation with the same number of parameters. We augment Inception3D, our main architecture, further with multi-path blocks and long-range residual connections for optimal performance. Lastly, ResNeXt3D shows how a network with little design effort compares to our similar but carefully tuned Inception3D architecture. These architectures highlight how different design principles affect performance for our pose estimation method. A summary of all architectures is shown in Table 2. Also note that all our architectures are very efficient in terms of the number of parameters. For comparison, the standard ResNet50 architecture (He.2016) with 16 residual blocks and 2D convolutions comes with 21 million parameters. Inception-ResNet (Szegedy.2017) contains 22 blocks and 56 million parameters.
|Individual Path Design||No||No||Yes||No|
|# of Parameters|
|# of Blocks||4||9||9||9|
3.2.3 Training the 3D CNNs
The learning task is formulated as a regression problem. We therefore minimize the mean squared error (MSE) between network outputs and ground-truth labels. We define the MSE as
$$\mathrm{MSE} = \frac{1}{N_o N_b} \sum_{i=1}^{N_b} \sum_{j=1}^{N_o} \left( y_{ij} - \hat{y}_{ij} \right)^2$$
where $N_o$ is the number of outputs, $N_b$ the batch size, $y_{ij}$ the ground-truth label and $\hat{y}_{ij}$ the network’s predictions. The CNNs are trained with mini-batch gradient descent. We use the Adam algorithm (Kingma.2014) as a state-of-the-art optimizer with an initial learning rate of . When the validation error saturates, the learning rate is reduced by a factor of until we observe no further improvement. The decay rates for the first and second order statistical moment estimates are chosen according to Kingma.2014 with and . Similarly, the decay rate for the moving average in batch normalization layers is chosen to be . Following Ioffe.2015, we do not apply other regularization methods.
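The loss and the learning rate schedule described above can be sketched as follows. This is a minimal illustration and not the training code used for the experiments: the plateau criterion, the `patience` window and the example values for the learning rate and reduction factor are assumptions.

```python
import numpy as np


def mse(y_true, y_pred):
    # Mean over both the batch axis and the output axis.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def reduce_on_plateau(lr, val_errors, factor, patience):
    # Hypothetical plateau rule: shrink the learning rate when the best
    # validation error of the last `patience` epochs does not improve on the
    # best error seen before that window.
    if len(val_errors) <= patience:
        return lr
    best_before = min(val_errors[:-patience])
    best_recent = min(val_errors[-patience:])
    return lr * factor if best_recent >= best_before else lr


loss = mse([[0, 0], [0, 0]], [[1, 1], [1, 3]])
stalled_lr = reduce_on_plateau(1e-3, [1.0, 0.5, 0.5, 0.6], factor=0.5, patience=2)
improving_lr = reduce_on_plateau(1e-3, [1.0, 0.5, 0.4, 0.3], factor=0.5, patience=2)
print(loss, stalled_lr, improving_lr)
```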
We split the data set into training, validation and test sets. The validation set is used for fine-tuning hyperparameters, the test set is used for evaluating the final performance. During training, we use a batch size of .
The labels used for training are provided by the hexapod robot. Due to the OCT’s limited FOV, the positions are limited to and . Similarly, we limit rotations to . For training, we rescale the regression outputs to a range of . In particular, we rescale every output component individually to a range based on the training set. The scaled outputs are defined as
where and are the minimum and maximum value of output in the training set. For evaluation we transform the network’s predictions back to the original scale and calculate error metrics on those values.
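A minimal sketch of this per-component rescaling and its inverse is given below. The target range is exposed through the hypothetical parameters `low` and `high`, and the default of 0 to 1 is only an assumption for illustration.

```python
import numpy as np


def fit_minmax(train_labels):
    # Minimum and maximum of every output component over the training set.
    train_labels = np.asarray(train_labels, dtype=float)
    return train_labels.min(axis=0), train_labels.max(axis=0)


def scale(labels, y_min, y_max, low=0.0, high=1.0):
    # Map each component from [y_min, y_max] to [low, high].
    unit = (np.asarray(labels, dtype=float) - y_min) / (y_max - y_min)
    return low + unit * (high - low)


def unscale(scaled, y_min, y_max, low=0.0, high=1.0):
    # Inverse mapping, applied to predictions before computing error metrics.
    unit = (np.asarray(scaled, dtype=float) - low) / (high - low)
    return y_min + unit * (y_max - y_min)


train = np.array([[0.0, -10.0], [2.0, 10.0], [1.0, 0.0]])
y_min, y_max = fit_minmax(train)
scaled = scale(train, y_min, y_max)
restored = unscale(scaled, y_min, y_max)
print(scaled)
```

Fitting the minimum and maximum on the training set only, and reusing them for validation and test data, keeps the evaluation free of information from the held-out sets.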
Another question that we address is whether training a single CNN for the entire pose label is the optimal choice. Multi-output regression has been addressed both by training a single model for the entire output and by training individual models for each output (Borchani.2015). We study three different approaches. First, we train a single CNN to predict the complete 6D pose. Second, we train one CNN each for position and orientation prediction. Third, we train one CNN each to predict a single component of the pose vector. We choose the best performing approach for all other experiments.
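The three splitting strategies can be expressed as groups of label indices, with one CNN trained per group. The component names and their ordering in the pose vector are hypothetical and serve only to make the sketch concrete.

```python
# Hypothetical ordering of the 6D pose vector: three positions, three rotations.
POSE_COMPONENTS = ("x", "y", "z", "alpha", "beta", "gamma")


def label_groups(strategy):
    """Return one tuple of label indices per model to be trained."""
    n = len(POSE_COMPONENTS)
    if strategy == "6d":   # a single model for the complete pose
        return [tuple(range(n))]
    if strategy == "3d":   # one model for position, one for orientation
        return [(0, 1, 2), (3, 4, 5)]
    if strategy == "1d":   # one model per pose component
        return [(i,) for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")


def select(pose, group):
    # Extract the regression targets that a given model is trained on.
    return [pose[i] for i in group]


pose = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for strategy in ("6d", "3d", "1d"):
    print(strategy, [select(pose, g) for g in label_groups(strategy)])
```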
3.2.4 Visualizing What CNNs Learn
Understanding and visualizing what CNNs learned after training is an important issue in the field of deep learning (Simonyan.2013). In particular, for the problem at hand, it is crucial to understand what kind of image properties the CNNs leverage for pose estimation. In general, CNNs for classification are visualized either by image generation through maximization of neuron activations or with saliency maps (Zeiler.2014). We utilize the latter since activation maximization is not immediately applicable to regression with continuous output values. Saliency maps visualize which region in a particular input image has the largest influence on a certain activation in the network. This is achieved by computing the partial derivative of the activation with respect to the current input image, leading to a gradient image
$$S = \sum_{i} \frac{\partial a_i}{\partial I}$$
where $S$ is the saliency map, $I$ is the input image, and $a$ is a vector of activations. The partial derivatives for each vector element are summed up to form the saliency map. We set $a$ to be the output of our network, and thus, a saliency map tells us which region of an image leads to the largest change in the output. This allows us to visualize what our CNN focuses on when trained on 2D data, when trained on the marker with a surface structure and when trained on a marker with inner features.
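The definition of the saliency map as a summed gradient image can be illustrated on a toy model. In the sketch below, a small nonlinear function stands in for the trained network and the gradient is approximated with central finite differences; in practice the gradient would be obtained by backpropagation. The model, shapes and step size are assumptions for illustration.

```python
import numpy as np


def model(volume, weights):
    # Stand-in for a trained network: tanh of two linear read-outs of the volume.
    return np.tanh(np.tensordot(weights, volume, axes=volume.ndim))


def saliency_map(f, volume, eps=1e-5):
    # Sum over all outputs of the partial derivative with respect to each voxel,
    # approximated with central finite differences.
    grad = np.zeros_like(volume)
    for idx in np.ndindex(volume.shape):
        step = np.zeros_like(volume)
        step[idx] = eps
        grad[idx] = np.sum(f(volume + step) - f(volume - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
volume = rng.normal(size=(2, 3, 3))      # tiny stand-in for an OCT volume
weights = rng.normal(size=(2, 2, 3, 3))  # two outputs
saliency = saliency_map(lambda v: model(v, weights), volume)
print(saliency.shape)
```

The resulting array has the same shape as the input volume, so it can be overlaid on the volume slice by slice.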
To enhance the saliency maps, we utilize guided backpropagation (Springenberg.2014). The key idea of this approach is to combine normal backpropagation with the deconvolution idea of Zeiler.2014. Effectively, guided backpropagation changes the backward pass of the ReLU activation function such that negative gradients, and thus components that reduce the target activation, are suppressed. The method has been shown to perform better than normal backpropagation and deconvolutional visualization; for details, see Springenberg.2014.
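The modified ReLU backward pass can be written in a few lines. The sketch below contrasts the standard rule with the guided rule on a small hand-picked example; the function names are illustrative.

```python
import numpy as np


def relu_backward(grad_out, x):
    # Standard backpropagation: pass gradients where the forward input was positive.
    return grad_out * (x > 0)


def relu_backward_guided(grad_out, x):
    # Guided backpropagation: additionally suppress negative incoming gradients.
    return grad_out * (x > 0) * (grad_out > 0)


x = np.array([-1.0, 2.0, 3.0, -4.0])         # inputs seen in the forward pass
grad_out = np.array([0.5, -0.7, 0.9, -0.2])  # gradients from the layer above

standard = relu_backward(grad_out, x)
guided = relu_backward_guided(grad_out, x)
print(standard, guided)
```

Only the third entry survives the guided rule, because it is the only position with both a positive forward input and a positive incoming gradient.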
All in all, we support our investigation of depth exploitation in volume data for our 6D pose estimation technique by providing an intuitive visualization of what the CNNs learn.
3.2.5 Online Pose Estimation and Robustness Towards Occlusion
In order to show our method’s potential for clinical application scenarios, we also investigate the CNNs’ inference runtime and their robustness towards occlusion in the OCT volumes.
We compare inference runtimes for three different approaches. First, we use Inception3D which employs 3D convolutions and processes volume data. Second, we use a 2D variant of Inception3D with 2D convolutions for the 2D depth representations. Third, we use the 2D variant to process volume data as slices. We investigate whether the different mathematical operations and input data lead to differences in processing time.
We measure the time that passes between feeding a single input to the model and receiving the respective output. We provide mean and standard deviation for 100 single input passes to the model.
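A measurement of this kind could look as follows. The sketch times repeated single-input calls of an arbitrary `predict` callable and reports mean and standard deviation. The warm-up calls before timing are an assumption added here to exclude one-time initialization costs, and the dummy model is a placeholder.

```python
import statistics
import time


def time_single_passes(predict, sample, runs=100, warmup=5):
    # Warm-up calls are not timed (an assumption, not part of the described protocol).
    for _ in range(warmup):
        predict(sample)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def dummy_model(sample):
    # Placeholder for a CNN forward pass.
    return sum(v * v for v in sample)


mean_s, std_s = time_single_passes(dummy_model, list(range(1000)))
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms")
```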
Furthermore, we investigate how our models react to occlusion in OCT volume data. For this purpose, we acquired an additional dataset where we added random objects around the marker. The occluding objects were repositioned and changed during training. We used a variety of objects with different reflective properties such as a scalpel, parts of a syringe, needles, cloths, different plastic and metal parts, surgical scissors, printed geometries that could be used as markers and water droplets on top of and next to the marker. An example occlusion scenario is shown in Figure 10. Our marker is the only object constantly appearing in all volumes, and we investigate whether this helps the model to learn robustness towards all other objects.
For testing, we split off a dataset that contains objects that are not present anywhere in the training dataset. Therefore, performance on this test set indicates how well the model deals with objects that it has never seen before. This provides a realistic impression of how the model will perform in practice, where new objects are likely to appear in the OCT volumes.
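A split with disjoint occluding objects can be sketched as below. The sample records with an `objects` field, the object names and the rule of dropping samples that mix seen and unseen objects are hypothetical; they only illustrate how unseen objects can be kept out of the training data.

```python
def split_by_objects(samples, held_out_objects):
    """Split samples so that held-out objects never appear in the training set."""
    held_out = set(held_out_objects)
    train, test, dropped = [], [], []
    for sample in samples:
        objects = set(sample["objects"])
        if objects and objects <= held_out:
            test.append(sample)      # only unseen occluders
        elif objects.isdisjoint(held_out):
            train.append(sample)     # only training occluders (or none)
        else:
            dropped.append(sample)   # mixture of seen and unseen occluders
    return train, test, dropped


samples = [
    {"id": 0, "objects": ["scalpel"]},
    {"id": 1, "objects": ["needle", "cloth"]},
    {"id": 2, "objects": ["syringe"]},
    {"id": 3, "objects": ["scalpel", "syringe"]},
    {"id": 4, "objects": []},
]
train, test, dropped = split_by_objects(samples, held_out_objects=["syringe"])
print([s["id"] for s in train], [s["id"] for s in test], [s["id"] for s in dropped])
```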
In this section, we present our results. First, we introduce our acquired datasets and the experimental setup. Second, we provide a description of our evaluation strategy. Third, we provide the results themselves.
4.1 Experimental Setup and Data
Marker A was milled from a block of polyoxymethylene (POM) with an asymmetric prism shape, see Figure 2. The material reflects the infrared light very well, which is why mostly its surface is visible in an OCT volume, not its interior. The second marker, marker B, was 3D printed with Formlabs Resin to obtain an inner structure. For both markers we acquired several thousand data samples each, using roughly for training and for validation and for testing. Additionally, we acquired a dataset that contains occlusions as described in Section 3.2.5. Note that there is no validation set for the occlusion dataset as we directly use it with our models that were fine-tuned on the other two datasets. An overview of the datasets is shown in Table 3. All results we present refer to the test sets.
|Marker A||Marker B||Occlusion|
The OCT device is a Thorlabs Telesto I SD-OCT. Its lateral resolution is and its depth resolution is . Its FOV covers a volume of . Volume images are acquired with a size of voxels. In the setup shown in Figure 1 only the OCT’s scan head is visible.
The robot is a 6-axis H-820.D1 hexapod distributed by Physik Instrumente GmbH. It allows travel ranges of for translations and for rotations, covering the OCT’s FOV. Regarding accuracy, the robot is limited by a translational repeatability of and a rotational repeatability of . The range of positions covered by the hexapod robot in the experiment corresponds to the OCT’s FOV. The rotations are limited to a range of for each axis.
The 3D CNN implementation leverages the TensorFlow environment (Abadi.2016), and training is performed on NVIDIA GTX 1080 Ti graphics cards with 11 GB of VRAM.
4.2 Evaluation Strategy
We provide the results of the analysis of our pose estimation method in several steps:
We show general accuracy results and motivate the use of deep learning by comparing our framework to a more classic approach. For this comparison we use our best performing model Inception3D and the best performing marker B. Moreover, we show results for our choice of splitting position and orientation learning.
We show pose estimation accuracy for 2D depth representations for 2D CNN training and 3D volumes for both 2D and 3D CNN training. Again, we employ Inception3D with a 2D counterpart for this comparison. We use marker A for this comparison. The marker is best suited for comparison with 2D depth representations as it largely shows surface information in OCT volumes.
We show how marker A compares to marker B in order to highlight the effects of inner marker structure for 3D CNN learning. We use Inception3D for this comparison.
We visualize what our 3D CNN learns using saliency maps as described in Section 3.2.4. This adds qualitative results and a better understanding of the previous quantitative results.
We show the suitability of our method for online pose estimation by providing inference times for 2D and 3D CNN data processing.
We show our method’s robustness by using our Inception3D model for a dataset with heavy occlusion.
We compare the 3D CNN models introduced in Section 3.2.2 with respect to their performance for our pose estimation method. We use both markers for this comparison.
We evaluate pose estimation accuracy using the mean absolute error (MAE), relative MAE (rMAE) and average correlation coefficient (aCC) which are typical measures for regression tasks (Borchani.2015). The relative MAE is obtained by dividing the MAE by the ground-truth label’s standard deviation. All reported accuracy values are derived from the independent test sets.
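The three metrics can be computed per output component as sketched below. Two details are assumptions: the standard deviation is the population standard deviation (NumPy's default), and the aCC is taken as the mean Pearson correlation coefficient over all output components.

```python
import numpy as np


def pose_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MAE per output component, averaged over all test samples.
    mae = np.mean(np.abs(y_true - y_pred), axis=0)
    # Relative MAE: MAE divided by the standard deviation of the ground truth.
    rmae = mae / np.std(y_true, axis=0)
    # Average Pearson correlation coefficient over all output components.
    cc = [np.corrcoef(y_true[:, j], y_pred[:, j])[0, 1]
          for j in range(y_true.shape[1])]
    return mae, rmae, float(np.mean(cc))


y_true = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
y_pred = y_true + 0.5  # constant offset: nonzero MAE, perfect correlation
mae, rmae, acc = pose_metrics(y_true, y_pred)
print(mae, rmae, acc)
```

The example shows why the metrics complement each other: a constant offset leaves the correlation untouched while the MAE and rMAE register the error.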
4.3 Pose Estimation Accuracy
First, we show how the use of a deep learning technique for 6D pose estimation from volume data compares to a classic feature-based method. For the comparison, we use the related framework of Zhang.2014b. Their method is similar to ours in terms of the experimental setup as they use OCT as an imaging modality and a hexapod for movement.
The comparison is shown in Figure 11. Our approach outperforms the other framework with an MAE of for our method compared to for the method of Zhang.2014b.
Furthermore, we investigated the effect of training different models for different parts of the target pose vector. The results for three approaches with different label splitting are shown in Table 4. For position prediction, splitting up the training improves performance. However, training on a single position output does not lead to improvement. For orientation prediction, removing the position part does not have a substantial effect. Splitting the labels up further even deteriorates performance. Based on these observations, we choose to train position and orientation separately.
|6D Label||3D Label||1D Label||6D Label||3D Label||1D Label|
4.4 2D Depth Information vs. 3D Volume Information
|Vol.||M1||M3||D1||D3||MD||V. 2D||Vol.||M1||M3||D1||D3||MD||V. 2D|
As a second step, we compare the accuracy when using 2D depth representations or full volumetric data for learning. The results are shown in Table 5. We used our Inception3D architecture for training. For the 2D representations, we removed the filters’ third dimension, resulting in Inception2D. We conducted the experiment with marker A. This marker largely shows surface structures in OCT volumes. Therefore, 2D depth maps could be expected to contain a similar amount of information for learning.
Considering the comparison between 2D and 3D, the volumetric data representation that is used for training Inception3D clearly outperforms all 2D approaches. Note that the 2D CNN version has a smaller capacity since filters only cover two dimensions. However, the 2D CNN was always able to reach a similar training error. This indicates that the performance difference is caused by the representations used for learning rather than by insufficient capacity.
Out of all models with 2D filters, the model with volume inputs performs best. Here, volume data is processed in a z-slice-wise fashion with kernels while also taking neighboring slices into account.
Considering the difference between 2D representations, it is notable that a combination of depth and intensity information from a single MIP in -direction performs best. Moreover, the single channel representations that only leverage information from the direction perform better than representations with additional - and - projections.
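A combined intensity and depth representation of the kind compared here can be derived from a volume as sketched below. This is an illustration under assumptions: the depth value is taken as the index of the brightest voxel along the projection axis, the projection axis is the first array axis, and the two maps are stacked as channels. The exact construction used for the experiments may differ.

```python
import numpy as np


def mip_and_depth(volume, axis=0):
    # Maximum intensity projection (MIP) along the chosen axis.
    mip = volume.max(axis=axis)
    # Depth map: index of the brightest voxel along the same axis.
    depth = volume.argmax(axis=axis)
    return mip, depth


volume = np.zeros((4, 2, 2))
volume[2, 0, 0] = 5.0  # bright reflection at depth index 2
volume[1, 1, 1] = 3.0  # weaker reflection at depth index 1

mip, depth = mip_and_depth(volume, axis=0)
two_channel = np.stack([mip, depth], axis=-1)  # intensity and depth as channels
print(two_channel.shape)
```

Each lateral position keeps only a single intensity and a single depth value, which makes explicit how much of the volume is discarded by such a projection.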
4.5 Surface vs. Subsurface Structure
|Marker A||Marker B||Marker A||Marker B|
The last section compared a volumetric representation to 2D projections which are typically employed for 6D pose estimation frameworks. Next, we show how a recognizable inner structure affects learning for 3D CNNs. The two markers we compare are described in Section 4.1. Their key difference is that one marker has an opaque surface under infrared light (A), while the second marker has a visible inner structure in OCT images (B), see Figure 2. The results are shown in Figure 12. Detailed values are shown in Table 6. Marker B clearly outperforms marker A. It is notable that the position error goes beyond the assumed ground-truth label accuracy, induced by the robot’s specified repeatability of .
As a result, we show that a marker with a depth profile outperforms an opaque marker, which adds to the observation that volumetric representations outperform their 2D counterparts.
4.6 Visualizing What was Learned
Next, we aim for a deeper understanding of what was learned by the 3D CNN. In particular, we investigate whether the 3D CNN leveraged the depth information given in the second marker. We employ guided backpropagation to generate saliency maps for a test set image, see Section 3. The saliency maps are generated by differentiating the output with respect to the input image. Thus, the final saliency maps we use can be interpreted as a gradient image which has the same size as the test image. Saliency maps indicate which region in the image is largely responsible for the output, i.e., a change in that region leads to the largest change in the output.
To emphasize the importance of depth exploitation, we compare the 3D saliency maps from the two markers with 2D saliency maps from the approach of leveraging depth information from MIPs. The results are shown in Figure 13. The saliency maps for the 2D CNN show high intensities at characteristic surface features on the markers. The 3D saliency maps for the 3D CNN, which are represented by 2D MIPs, focus on a region on the marker without sticking to specific surface features such as the pyramid tip. Note that the same original test image was used for the 2D saliency maps and the 2D MIPs of the 3D saliency maps.
Furthermore, we present the saliency maps of two test images for the two markers in Figure 14. The saliency maps are shown in red as slices overlaid on top of slices of the test images. The cross-sectional view specifically shows which regions on and inside the marker have a large influence on the output. For the marker with a surface structure, the saliency map mostly lights up around the marker’s surface. Note that the high-intensity saliency area spans above and below the surface, covering 3D space. For the marker with a depth structure, higher values in the saliency maps can be observed inside the marker. Furthermore, it should be noted that the 3D CNN’s center of attention is indeed the marker itself. There appears to be no fitting to the ground surface or to artifacts within the volume.
All in all, the visualization with saliency maps adds qualitative indications for depth exploitation of our 3D CNNs. This adds further insights to the quantitative results presented above.
4.7 Inference Time and Robustness Towards Occlusion
In this section, we show the applicability of our approach for practical problems. We provide results for the processing times of our CNNs to show that online pose estimation is feasible. Furthermore, we show results for our model when foreign objects appear in the OCT volume which is likely to happen in practice.
The results for inference time measurement are shown in Table 7. We can observe that both CNNs allow sample processing at with the 2D CNNs being slightly faster. Note that the convolution operations only have a small influence, with a total number of 68 out of 1734 operations and an average processing time of for Inception3D and for Inception2D. Also note that these values depend strongly on hardware and software, see Section 3.
Furthermore, we investigate how well our model performs when the OCT volume is occluded with foreign objects, see Figure 10. For this purpose, we use our third dataset where different objects are placed around the marker during data acquisition. The results are shown in Table 8. The model’s performance is still close to our other datasets where mostly the marker itself was visible. For rotations, the performance deteriorates more.
4.8 Architectures for Volumetric Data
Next, we provide results on how different architecture designs behave for our pose estimation method. First, we present results for the four architectures introduced in Section 3. Second, we show how long range feature propagation behaves for our Inception3D architecture.
4.8.1 Comparison of 3D CNN Architectures
For our deep learning framework, we propose four different models that come with different improved architectural ideas, see Section 3 for details. The results for position training are shown in Table 9. With the most structural adjustments, Inception3D outperforms the other models. Furthermore, ResNetA3D, which uses the type of residual connections often employed for 3D CNNs (Milletari.2016; Yu.2017b), lags behind more significantly.
Additionally, Figure 15 shows the training behavior over time for all four models. In terms of convergence behavior, all models perform similarly, as all models have approximately the same number of parameters.
All in all, our results show improved performance for models that exploit more efficient architecture design principles.
4.8.2 Long Range Residual Connections for Inception
In the last section, we showed that our custom design of Inception3D outperforms other architectures. Next, we present results on how long range residual connections that span over modules affect performance.
In Section 3 we presented two types of long range connections which are frequently used for feature transfer between similar sized stages in 3D CNNs for segmentation. We extend this approach by drawing connections between different stages of the network and introduce the concept to Inception3D by creating long range connections between modules. Table 10 shows the results for the use of residual connections, feature connections and no connections at all. Note that the use of long- and short-range residual connections is also referred to as mixed residual connections (Yu.2017b) and feature connections are also called dense connections (Huang.2016). Residual connections perform best, closely followed by feature connections. The model with no connections at all shows worse results. It should be noted that performance changes are small compared to using an entirely different architecture.
In Figure 16 the training behavior of the three model variations is shown. There is a clear difference in errors for the model without any connections while the two models with connections are very close. The convergence behavior of the models is very similar once again. It should be noted that introducing the long range connections leads to a negligible increase in parameters.
In summary, we showed various results highlighting the advantages of our novel deep learning-based pose estimation method. First, we showed that our method outperforms a comparable classic approach. Second, we showed that volumetric data leads to higher accuracy for pose learning, compared to depth-based approaches. Third, we provided qualitative saliency maps that demonstrate how 3D CNNs exploit inner features for pose estimation. Lastly, we showed results for our different architectures, highlighting the importance of efficient design principles with our proposed network Inception3D performing best.
We provided extensive results for our method of 6D pose estimation from volumetric OCT data which lead to valuable insights for deep learning-based pose estimation and 3D CNN application to OCT in general.
6D pose estimation from OCT volumes with deep learning models is a novel approach. We motivate this idea by showing that we outperform other frameworks that rely on classical feature-based approaches (Zhang.2014b). This insight is in line with the general trend of deep learning methods replacing handcrafted features in many computer vision tasks (Liefers.2017).
Also note that position prediction accuracy is within the magnitude of the robot’s repeatability and thus the ground-truth labels. Therefore, our deep learning approach is likely limited by the labels’ accuracy and not a lack of representational power. In addition, our framework is general enough to be employed for various pose estimation problems as the source of labels can be any robot or motor.
Furthermore, we investigated how splitting up training for different parts of the pose affects performance. A significant improvement is observed when training only on positions, as shown in Table 4. Often, multi-output regression is addressed by training a single model with multiple outputs instead of using multiple models with single outputs (Borchani.2015). This approach promises better performance by introducing regularization through additional supervision. The model’s feature maps have to learn to represent features for all outputs simultaneously. However, we observe performance improvement for position learning when splitting the pose label. This effect can be explained by regularization through learned invariance. When training on positions only, the input data contains examples with the marker being in the same position with different orientations. Thus, the CNN’s weights are forced to learn invariance towards orientation. This is linked to OCT’s properties as light scattering and surface visibility are highly dependent on the light beam’s angle of impact. Therefore, invariance towards orientations also implicitly enforces invariance towards different light scattering properties in the data. Our results indicate that the effect of learned invariance significantly improves position learning. At the same time, there are no significant performance differences for orientation learning. Shifting positions within the volume does not change the OCT’s light beam angle of impact. Therefore, in contrast to position learning, invariance towards positions for rotation learning does not implicitly enforce invariance towards different light scattering conditions. All in all, our training strategy with split labels improves position learning by taking advantage of domain knowledge on OCT’s light scattering properties.
2D depth information and volume data were investigated to draw a connection to OCT based tracking which has been performed on 2D projections (Laves.2017). The use of 2D depth representations can be motivated by the imaging property that many surfaces appear opaque under OCT as they cannot be penetrated by infrared light. Therefore, pure surface information extracted from the OCT volume could be deemed sufficient for most tasks.
However, our results in Table 5 show that moving towards volumetric data and 3D CNNs significantly increases performance. The use of volume data with flat 2D kernels already improves performance which indicates that a significant amount of information is lost when creating 2D projections. The novel approach of employing 3D CNNs for OCT volume data improves performance even further. The volumetric receptive fields of stacked 3D convolutional layers appear to be able to capture relevant features for pose estimation more effectively.
With these findings we motivate the use of full volumetric information for OCT-based tracking and pose estimation frameworks that have so far relied on 2D representations (Laves.2017; Camino.2016). Other OCT-based deep learning methods that have also relied on 2D representations (Roth.2016; Wang.2016; Venhuizen.2015) could benefit from our insights as well.
We highlight the improved feature learning further with use of saliency maps for 2D and 3D data, see Figure 13. For 2D data, the CNN appears to fit to distinct features on the marker surface that are visible in the 2D representation. The 3D CNN, however, appears to take advantage of other, deeper features that cannot be recognized on the surface. This leads to our investigation of deep subsurface feature learning.
Markers with surface and subsurface structure were compared to gain further insight into how 3D CNNs take advantage of inner features. Our results in Table 6 show that the marker with an inner structure performs significantly better than the marker that largely contains surface information in OCT images. This shows that the exploitation of OCT’s 3D nature can be advantageous for volumetric feature learning with 3D CNNs. We support these quantitative results with additional saliency maps, see Figure 14. They highlight that the 3D CNNs indeed learned to exploit subsurface information when it was present in the volume data.
This finding shows that we can improve pose estimation performance without using a larger, more sophisticated marker. Ultimately, markers for surgery should be small and non-disruptive. Creating subsurface structures is an elegant solution to increase the learnable feature space without increasing the marker size. Thus, we combine the advantage of OCT’s depth imaging with 3D CNN powered volumetric feature learning for pose estimation.
All in all, these insights emphasize once more that OCT’s capability of producing volumetric information can be readily exploited by 3D CNNs. We provide strong evidence that OCT-based 2D slicing and projection methods (Roth.2016; Wang.2016; Venhuizen.2015) could significantly benefit from 3D data usage and volumetric feature exploitation.
Moving towards clinical application scenarios is a next step for our method. We highlight its suitability for future clinical use by showing its real-time processing capability and its robustness towards occlusion.
Regarding the processing times shown in Table 7, it is notable that the change between 2D and 3D convolutions does not lead to a significant difference. The largest processing overhead is caused by other operations that are always present in the network and neither the input size nor the different operations are a bottleneck. Therefore, our 3D CNNs are capable of online pose estimation. This is linked to our efficient 3D CNN architecture design with comparatively small numbers of parameters, as shown in Table 2.
For future application in clinical scenarios, our marker system should be capable of being integrated into existing OCT setups for MIS without requiring special operating conditions. Thus, it is crucial that our models deal well with unknown objects. Our occlusion dataset results in Table 8 show that our Inception3D model was able to learn robustness towards new occluding objects by achieving a performance close to the initial dataset.
The application of deep learning architectures for 3D OCT data is a novel approach. When entering new problem domains with the use of deep learning, it is largely unclear how existing models should be adopted (Xie.2017). Therefore, we created four different 3D CNN architectures with different design principles and showed how they affect performance for our novel learning problem.
In particular, the idea of downsampling intermediate network outputs with respect to their number of feature maps, i.e., creating a bottleneck, appears to improve representational power greatly. The only model without this property, ResNetA3D, performs significantly worse than the others, see Table 9. The bottleneck idea has been successful for 2D CNNs (He.2015) and we show that it is even more valuable for 3D CNNs. Bottlenecks address the key problems of model complexity and computational cost, which are particularly severe for 3D CNNs (Yu.2017b). The increased efficiency in terms of the number of parameters allows for much deeper models. This insight relates to Yu.2017, who built very deep 2D CNNs for medical image analysis by relying on downsampling in the feature map dimension.
In addition to the bottleneck principle, we use Inception3D and ResNeXt3D to address 3D CNN architecture design for our problem by showing the pay-off of extensive design and fine-tuning. Both architectures employ the successful principle of multiple paths at each scale (Szegedy.2017). However, for Inception3D, we carefully tuned each path individually, while for ResNeXt3D, all paths are designed identically. Although there is a performance difference, it is notable that the simple design principles we followed for ResNeXt3D lead to a similar performance, see Table 10. As a result, we argue that high-effort custom designs such as our Inception3D might not be strictly necessary in practice, as simpler design choices can already reach good performance. Still, if the goal is the best possible performance, extensive fine-tuning will be necessary when entering new problem domains such as ours with 3D CNNs.
Additionally, we introduced long-range feature transfer between different scales for our architecture. This extends the idea of Ronneberger.2015 and Yu.2017b who employed feature transfer between similar scales for segmentation tasks. As shown in Table 10, these connections do lead to an improved performance. This supports the idea that two kinds of features are needed: detecting our marker in the full image requires high-level, coarse features with a large implicit FOV, while accurate pose distinction requires detecting fine-grained differences. The combination of fine, local and coarse, global features appears to lead to better pose estimation performance. This insight is in line with related ideas for object detection where features are also transferred for a combination of local and global properties (Shrivastava.2016).
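The idea of transferring features across scales can be sketched in a few lines. This is an illustrative sketch only and makes several assumptions that are not taken from our implementation: the fine features are brought to the coarse resolution by average pooling, the two feature sets are merged by channel concatenation, and the function names and tensor shapes are hypothetical.

```python
import numpy as np


def avg_pool3d(x, factor):
    """Average-pool a (C, D, H, W) feature volume by an integer factor per axis."""
    c, d, h, w = x.shape
    assert d % factor == 0 and h % factor == 0 and w % factor == 0
    x = x.reshape(c, d // factor, factor, h // factor, factor, w // factor, factor)
    return x.mean(axis=(2, 4, 6))


def long_range_transfer(fine, coarse):
    """Merge fine, local features from an early layer with coarse, global features.

    Assumes both volumes are downsampled by the same factor along every axis.
    """
    factor = fine.shape[1] // coarse.shape[1]
    pooled = avg_pool3d(fine, factor)
    assert pooled.shape[1:] == coarse.shape[1:], "spatial sizes must match after pooling"
    # Concatenate along the feature map (channel) axis.
    return np.concatenate([pooled, coarse], axis=0)


# Hypothetical shapes: 8 early feature maps at 16^3, 16 late feature maps at 4^3.
rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16, 16, 16))
coarse = rng.standard_normal((16, 4, 4, 4))
print(long_range_transfer(fine, coarse).shape)  # (24, 4, 4, 4)
```

Subsequent layers then receive both feature types at the same spatial resolution. Other merge operations, such as summation after a channel-matching convolution, would serve the same purpose.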
Since the 3D CNN architectures we use are all very generic, our results have broader implications. In particular, it should be noted that the design principles of downsampling in the number of feature maps and multi-scale feature extraction are still rarely found in 3D medical image analysis. Early 3D CNN architectures have already been criticized for a lack of representational capabilities (Yu.2017b). We expand on this point and argue that the design principles that we brought to the 3D domain with Inception3D and our other models are insufficiently applied to 3D medical learning problems. Several 3D CNN architectures with effective designs have been successfully introduced to the 3D image domain (Chen.2017; Dou.2017; Kamnitsas.2017; Yu.2017b). However, we argue that these well-designed architectures could benefit further from the efficiency-focused design principles we introduced to 3D. Based on our results, we see significant potential in current 2D CNN architectures for the 3D imaging domain.
We address the problem of high-accuracy pose estimation for microscopic tracking tasks with OCT volume data. To this end, we introduce a novel deep learning-based pose estimation method that directly predicts a marker's pose from volumetric OCT data. We thoroughly analyze our method and compare it to typical depth-based approaches, which we clearly outperform. Furthermore, 3D CNNs appear to exploit depth structures in volumetric data, which we show both quantitatively with improved results and qualitatively with 3D saliency map visualizations. Our models are able to learn robustness towards occlusion, which shows the markers' usability even when foreign objects appear in the OCT image, as is likely to happen in a surgical scenario. Additionally, we show that efficient deep learning design principles can be effectively extended to the 3D image domain. Lastly, we show that combining low- and high-level features through long-range connections benefits pose learning.
For future work, OCT tracking frameworks could build on our insights and move towards deep learning-based approaches that exploit volume data. Furthermore, prior 2D-based OCT learning approaches could be extended to volume-based approaches. Regarding network architectures, future deep learning models for medical image analysis could incorporate more efficient architecture designs or directly adopt Inception3D for other problems.