Self-Supervised Model Adaptation for Multimodal Semantic Segmentation
Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed self-supervised model adaptation fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. In addition, we propose a computationally efficient unimodal segmentation architecture termed AdapNet++ that incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling that has a larger effective receptive field with over 87% fewer parameters, complemented by a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details.
Comprehensive empirical evaluations on Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest benchmarks demonstrate that both our unimodal and multimodal architectures achieve state-of-the-art performance while simultaneously being efficient in terms of parameters and inference time as well as demonstrating substantial robustness in adverse perceptual conditions.
Keywords: Semantic Segmentation · Multimodal Fusion · Scene Understanding · Model Adaptation · Deep Learning
Humans have the remarkable ability to instantaneously recognize and understand a complex visual scene, which has piqued the interest of computer vision researchers to model this ability since the 1960s (Fei-Fei et al, 2004). There are numerous ever-expanding applications of this capability, ranging from robotics (Xiang and Fox, 2017) and remote sensing (Audebert et al, 2018) to medical diagnostics (Ronneberger et al, 2015) and content-based image retrieval (Noh et al, 2017). However, there are several challenges imposed by the multifaceted nature of this problem, including the large variation in types and scales of objects, clutter and occlusions in the scene, as well as outdoor appearance changes that take place throughout the day and across seasons.
Deep Convolutional Neural Network (DCNN) based methods (Long et al, 2015; Chen et al, 2016; Yu and Koltun, 2016) modelled as Fully Convolutional Networks (FCNs) have dramatically increased the performance on several semantic segmentation benchmarks. Nevertheless, they still face challenges due to the diversity of scenes in the real world that cause mismatched relationships and inconspicuous object classes. Figure 1 shows two example scenes: in the first, the decal on the train is falsely predicted as a person and a sign, whereas in the second, misclassifications are produced due to overexposure of the camera caused by the vehicle exiting a tunnel. In order to accurately predict the elements of the scene in such situations, features from complementary modalities such as depth and infrared can be leveraged to exploit object properties such as geometry and reflectance, respectively. Moreover, the network can exploit complex intra-modal dependencies more effectively by directly learning to fuse visual appearance information from RGB images with learned features from complementary modalities in an end-to-end fashion. This not only enables the network to resolve inherent ambiguities and improve reliability, but also to obtain a more holistic scene segmentation.
Figure 1: (a) Input Image. (b) Segmented Output.
While most existing work focuses on where to fuse modality-specific streams topologically (Hazirbas et al, 2016; Schneider et al, 2017; Valada et al, 2016b) and what transformations can be applied to the depth modality to enable better fusion with visual RGB features (Gupta et al, 2014; Eitel et al, 2015), it still remains an open question how to enable the network to dynamically adapt its fusion strategy based on the nature of the scene, such as the types of objects, their spatial location in the world and the present scene context. This is a crucial requirement in applications such as robotics and autonomous driving where these systems run in continually changing environmental contexts. For example, an autonomous car navigating in ideal weather conditions will primarily use visual RGB information, but when it enters a dark tunnel or exits an underpass, the cameras might experience under- or overexposure, whereas the depth modality will be more informative. Furthermore, the strategy to be employed for fusion also varies with the objects in the scene; for instance, infrared might be more useful to detect categories such as people, vehicles, vegetation and boundaries of structures, but it does not provide much information on object categories such as the sky. Additionally, the spatial location of objects in the scene also has an influence; for example, the depth modality provides rich information on objects at nearby distances but degrades very quickly for objects that are several meters away. More importantly, the approach employed should be robust to sensor failure and noise, as constraining the network to always depend on both modalities and use noisy information can worsen the actual performance and lead to disastrous situations.
Due to these complex interdependencies, naively concatenating or adding modality-specific features does not allow the network to adapt to the aforementioned situations dynamically. Moreover, due to the nature of this dynamicity, the fusion mechanism has to be trained in a self-supervised manner in order to make the adaptivity emergent and to generalize effectively to different real-world scenarios. As a solution to this problem, we present the Self-Supervised Model Adaptation (SSMA) fusion block that adaptively recalibrates and fuses modality-specific feature maps based on the scene condition. The SSMA block takes intermediate representations of modality-specific streams as input and fuses them probabilistically based on the activations of individual modality streams. As we model the SSMA block in a fully convolutional fashion, it yields a probability for each activation in the feature maps which represents the optimal combination to exploit complementary properties. These probabilities are then used to amplify or suppress the representations of the individual modality streams, followed by the fusion. As we base the fusion on modality-specific activations, the fusion is intrinsically tolerant to sensor failure and noise such as missing depth values.
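The recalibration idea can be illustrated with a minimal NumPy sketch. This is not the actual SSMA block (which is detailed in Section 4): the 1×1 convolutions are modeled as matrix multiplications over the channel axis, and all weight names and shapes are hypothetical.

```python
import numpy as np

def ssma_fuse(x_rgb, x_d, W_bottleneck, W_expand, W_fuse):
    """Sketch of SSMA-style fusion on (C, H, W) feature maps.

    The concatenated modality features are squeezed through a bottleneck,
    re-expanded, and passed through a sigmoid to obtain one probability per
    activation. These probabilities amplify or suppress each modality's
    features before the final fusion. All 'convolutions' here are 1x1,
    i.e. matrix multiplications over the channel axis.
    """
    x = np.concatenate([x_rgb, x_d], axis=0)            # (2C, H, W)
    c2, h, w = x.shape
    flat = x.reshape(c2, -1)                            # (2C, H*W)
    z = np.maximum(W_bottleneck @ flat, 0.0)            # bottleneck + ReLU
    g = 1.0 / (1.0 + np.exp(-(W_expand @ z)))           # sigmoid gate in (0, 1)
    recalibrated = g * flat                             # amplify / suppress
    fused = W_fuse @ recalibrated                       # fuse down to C channels
    return fused.reshape(-1, h, w)

# Toy usage with random weights (shapes are illustrative only).
rng = np.random.default_rng(0)
C, H, W = 4, 2, 2
x_rgb = rng.normal(size=(C, H, W))
x_d = rng.normal(size=(C, H, W))
Wb = rng.normal(size=(2, 2 * C))      # assumed channel reduction in the gate
We = rng.normal(size=(2 * C, 2))
Wf = rng.normal(size=(C, 2 * C))
out = ssma_fuse(x_rgb, x_d, Wb, We, Wf)
```

Because the gate is computed from the modality activations themselves, a degraded input (e.g. all-zero depth values) influences its own gating rather than silently corrupting the fused output, which mirrors the tolerance to sensor failure described above.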
Our proposed architecture for multimodal segmentation consists of individual modality-specific encoder streams which are fused both at mid-level stages and at the end of the encoder streams using our SSMA blocks. The fused representations are input to the decoder at different stages for upsampling and refining the predictions. We employ a combination of mid and late fusion as several experiments have demonstrated that fusing semantically meaningful representations yields better performance in comparison to early fusion (Eitel et al, 2015; Valada et al, 2016b; Hazirbas et al, 2016; Xiang and Fox, 2017). Moreover, studies of the neural dynamics of the human brain have also shown evidence of late fusion of modalities for recognition tasks (Cichy et al, 2016). However, intermediate network representations are not aligned across modality-specific streams, hence integrating fused mid-level features into high-level features requires explicit prior alignment. Therefore, we propose an attention mechanism that weighs the fused mid-level skip features with spatially aggregated statistics of the high-level decoder features for better correlation, followed by channel-wise concatenation.
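One plausible form of such an attention scheme, sketched here in NumPy under assumed shapes (the weight matrix W_att is hypothetical; the actual mechanism is specified in Section 4), squeezes the high-level decoder features into channel-wise statistics by global average pooling and uses them to gate the skip features before concatenation:

```python
import numpy as np

def attend_and_concat(skip, decoder, W_att):
    """Weight mid-level skip features (C_skip, H, W) with spatially
    aggregated statistics of high-level decoder features (C_dec, H, W),
    then concatenate channel-wise. W_att (hypothetical) maps decoder
    channels to skip channels."""
    stats = decoder.mean(axis=(1, 2))            # (C_dec,) global average pool
    a = 1.0 / (1.0 + np.exp(-(W_att @ stats)))   # (C_skip,) gates in (0, 1)
    weighted = a[:, None, None] * skip           # channel-wise reweighting
    return np.concatenate([weighted, decoder], axis=0)

skip = np.ones((3, 4, 4))
decoder = np.zeros((5, 4, 4))
W_att = np.zeros((3, 5))
out = attend_and_concat(skip, decoder, W_att)    # (3 + 5, 4, 4)
```

The design choice here is that the decoder features, being semantically higher-level, decide how much each fused skip channel should contribute, which correlates the two representations before they are concatenated.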
As our fusion framework necessitates individual modality-specific encoders, the architecture that we employ for the encoder and decoder should be efficient in terms of the number of parameters and computational operations, as well as be able to learn highly discriminative deep features. State-of-the-art semantic segmentation architectures such as DeepLab v3 (Chen et al, 2017) and PSPNet (Zhao et al, 2017) employ the ResNet-101 (He et al, 2015a) architecture as the encoder backbone, which consumes a substantial number of parameters and FLOPs. Training such architectures requires a large amount of memory and synchronized training across multiple GPUs. Moreover, they have slow run-times rendering them impractical for resource constrained applications such as robotics and augmented reality. More importantly, it is infeasible to employ them in multimodal frameworks that require multiple modality-specific streams as we do in this work.
With the goal of achieving the right trade-off between performance and computational complexity, we propose the AdapNet++ architecture for unimodal segmentation. We build the encoder of our model based on the full pre-activation ResNet-50 (He et al, 2016) architecture and incorporate our previously proposed multiscale residual units (Valada et al, 2017) to aggregate multiscale features throughout the network without increasing the number of parameters. The proposed units are more effective in learning multiscale features than the commonly employed multigrid approach introduced in DeepLab v3 (Chen et al, 2017). In addition, we propose an efficient variant of the Atrous Spatial Pyramid Pooling (ASPP) (Chen et al, 2017) called eASPP that employs cascaded and parallel atrous convolutions to capture long range context with a larger effective receptive field, while simultaneously reducing the number of parameters by 87% in comparison to the originally proposed ASPP. We also propose a new decoder that integrates mid-level features from the encoder using multiple skip refinement stages for high resolution segmentation along the object boundaries. In order to aid the optimization and to accelerate training, we propose a multiresolution supervision strategy that introduces weighted auxiliary losses after each upsampling stage in the decoder. This enables faster convergence, in addition to improving the performance of the model along the object boundaries. Our proposed architecture is compact and trainable with a large mini-batch size on a single consumer grade GPU.
Motivated by the recent success of compressing DCNNs by pruning unimportant neurons (Molchanov et al, 2017; Liu et al, 2017; Anwar et al, 2017), we explore pruning entire convolutional feature maps of our model to further reduce the number of parameters. Network pruning approaches utilize a cost function to first rank the importance of neurons, followed by removing the least important neurons and fine-tuning the network to recover any loss in accuracy. Thus far, these approaches have only been employed for pruning convolutional layers that do not have an identity or a projection shortcut connection. Pruning residual feature maps (third convolutional layer of a residual unit) also necessitates pruning the projected feature maps in the same configuration in order to maintain the shortcut connection. This leads to a significant drop in accuracy, therefore current approaches omit pruning convolutional filters with shortcut connections. As a solution to this problem, we propose a network-wide holistic pruning approach that employs a simple and yet effective strategy for pruning convolutional filters invariant to the presence of shortcut connections. This enables our network to further reduce the number of parameters and computing operations, making our model efficiently deployable even in resource constrained applications.
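The following NumPy sketch illustrates the coupled-pruning idea on a single residual unit. The ℓ1-norm ranking used here is a simple stand-in for the cost-function-based importance measure, and all shapes are purely illustrative:

```python
import numpy as np

def rank_filters_l1(conv_weights):
    """Rank convolutional filters by their l1-norm, a simple proxy
    importance measure. conv_weights: (num_filters, in_channels, k, k).
    Returns filter indices sorted ascending (least important first)."""
    scores = np.abs(conv_weights).sum(axis=(1, 2, 3))
    return np.argsort(scores)

def prune_residual_pair(res_w, proj_w, num_to_prune):
    """Prune the same output channels from the residual convolution and
    its projection shortcut, so the element-wise addition of the two
    paths remains dimensionally valid after pruning."""
    order = rank_filters_l1(res_w)
    keep = np.sort(order[num_to_prune:])         # drop the weakest filters
    return res_w[keep], proj_w[keep]

# Toy usage: 8 filters, prune the 3 weakest from both paths together.
res_w = np.zeros((8, 4, 3, 3))
proj_w = np.zeros((8, 4, 1, 1))
for f in range(8):
    res_w[f, 0, 0, 0] = f                        # filter f has l1-norm f
    proj_w[f, 0, 0, 0] = 10 + f
r_pruned, p_pruned = prune_residual_pair(res_w, proj_w, 3)
```

Pruning both weight tensors with one shared index set is what keeps the shortcut connection intact, which is the key difference from approaches that skip layers with shortcuts altogether.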
Finally, we present extensive experimental evaluations of our proposed unimodal and multimodal architectures on benchmark scene understanding datasets including Cityscapes (Cordts et al, 2016), Synthia (Ros et al, 2016), SUN RGB-D (Song et al, 2015), ScanNet (Dai et al, 2017) and Freiburg Forest (Valada et al, 2016b). The results demonstrate that our model sets the new state-of-the-art on all these benchmarks considering the computational efficiency and the fast inference time on a consumer grade GPU. More importantly, our dynamically adapting multimodal architecture demonstrates exceptional robustness in adverse perceptual conditions such as fog, snow, rain and night-time, thus enabling it to be employed in critical resource constrained applications such as robotics where not only accuracy but also robustness, computational efficiency and run-time are equally important. To the best of our knowledge, this is the first multimodal segmentation work to benchmark on such a wide range of datasets containing several modalities and diverse environments, ranging from urban city driving scenes to indoor environments and unstructured forested scenes.
In summary, the following are the main contributions of this work:
A multimodal fusion framework incorporating our proposed SSMA fusion blocks that adapts the fusion of modality-specific features dynamically according to the object category, its spatial location as well as the scene context and learns in a self-supervised manner.
The novel AdapNet++ semantic segmentation architecture that incorporates our multiscale residual units, a new efficient ASPP, a new decoder with skip refinement stages and a multiresolution supervision strategy.
The eASPP for efficiently aggregating multiscale features and capturing long range context, while having a larger effective receptive field and an over 87% reduction in parameters compared to the standard ASPP.
An attention mechanism for effectively correlating fused multimodal mid-level and high-level features for better object boundary refinement.
A holistic network-wide pruning approach that enables pruning of convolutional filters invariant to the presence of identity or projection shortcuts.
Extensive benchmarking of existing approaches with the same input image size and evaluation setting along with quantitative and qualitative evaluations of our unimodal and multimodal architectures on five different benchmark datasets consisting of multiple modalities.
Implementations of our proposed models are made publicly available at http://deepscene.cs.uni-freiburg.de.
The remainder of this paper is organized as follows. In Section 2, we discuss recent related work on semantic segmentation and multimodal fusion. We detail our AdapNet++ architecture and our pruning approach in Section 3, followed by the multimodal fusion architecture and the SSMA block in Section 4. In Section 5, we present extensive empirical evaluations with ablation studies and we conclude with a discussion in Section 6.
2 Related Work
In the last decade, there has been a sharp transition in semantic segmentation approaches from employing hand engineered features with flat classifiers such as Support Vector Machines (Fulkerson et al, 2009), Boosting (Sturgess et al, 2009) or Random Forests (Shotton et al, 2008; Brostow et al, 2008), to end-to-end DCNN-based approaches (Long et al, 2015; Badrinarayanan et al, 2015). We first briefly review some of the classical methods before delving into the state-of-the-art techniques.
Semantic Segmentation: Semantic segmentation is one of the fundamental problems in computer vision. Some of the earlier approaches for semantic segmentation use small patches to classify the center pixel using flat classifiers (Shotton et al, 2008; Sturgess et al, 2009), followed by smoothing the predictions using Conditional Random Fields (CRFs) (Sturgess et al, 2009). Rather than only relying on appearance based features, structure from motion features have also been used with randomized decision forests (Brostow et al, 2008; Sturgess et al, 2009). View independent 3D features from dense depth maps have been shown to outperform appearance based features, which also enabled classification of all the pixels in an image, as opposed to only the center pixel of a patch (Zhang et al, 2010). Plath et al (2009) propose an approach to combine local and global features using a CRF and an image classification method. However, the performance of these approaches is largely bounded by the expressiveness of the handcrafted features, which is highly scenario-specific.
The remarkable performance achieved by CNNs in classification tasks led to their application for dense prediction problems such as semantic segmentation, depth estimation and optical flow prediction. Initial approaches that employed neural networks for semantic segmentation still relied on patch-wise training (Grangier et al, 2009; Farabet et al, 2012; Pinheiro and Collobert, 2014). Pinheiro and Collobert (2014) use a recurrent CNN to aggregate several low resolution predictions for scene labeling. Farabet et al (2012) transform the input image through a Laplacian pyramid and feed each scale to a CNN for hierarchical feature extraction and classification. Although these approaches demonstrated improved performance over handcrafted features, they often yield a grid-like output that does not capture the true object boundaries. One of the first end-to-end approaches that learns to directly map the low resolution representations from a classification network to a dense prediction output was the Fully Convolutional Network (FCN) model (Long et al, 2015). FCN proposed an encoder-decoder architecture in which the encoder is built upon the VGG-16 (Simonyan and Zisserman, 2014) architecture with the inner-product layers replaced with convolutional layers, while the decoder consists of successive deconvolution and convolution layers that upsample and refine the low resolution feature maps by combining them with the encoder feature maps. The last decoder then yields a segmented output with the same resolution as the input image.
DeconvNet (Noh et al, 2015) proposes an improved architecture containing stacked deconvolution and unpooling layers that perform non-linear upsampling and outperforms FCNs, but at the cost of a more complex training procedure. The SegNet (Badrinarayanan et al, 2015) architecture eliminates the need for learning to upsample by reusing pooling indices from the encoder layers to perform upsampling. Oliveira et al (2016) propose an architecture that builds upon FCNs, introduces more refinement stages and incorporates spatial dropout to prevent overfitting. The ParseNet (Liu et al, 2015) architecture models global context directly instead of only relying on the largest receptive field of the network. Recently, there has been more focus on learning multiscale features, which was initially achieved by providing the network with multiple rescaled versions of the image (Farabet et al, 2012) or by fusing features from multiple parallel branches that take different image resolutions (Long et al, 2015). However, these networks still use pooling layers to increase the receptive field, thereby decreasing the spatial resolution, which is not ideal for a segmentation network.
In order to alleviate this problem, Yu and Koltun (2016) propose dilated convolutions that allow for an exponential increase in the receptive field without a decrease in resolution or an increase in parameters. DeepLab (Chen et al, 2016) and PSPNet (Zhao et al, 2017) build upon this idea and propose pyramid pooling modules that utilize dilated convolutions of different rates to aggregate multiscale global context. DeepLab in addition uses fully connected CRFs in a post-processing step for structured prediction. However, a drawback in employing these approaches is the computational complexity and substantially large inference time even on modern GPUs, which hinders them from being deployed in robots that often have limited resources. In our previous work (Valada et al, 2017), we proposed an architecture that introduces dilated convolutions parallel to the conventional convolution layers as well as multiscale residual blocks that incorporate them, which enables the model to achieve competitive performance at interactive frame rates. Our proposed multiscale residual blocks are more effective at learning multiscale features compared to the widely employed multigrid approach from DeepLab v3 (Chen et al, 2017). In this work, we propose several new improvements for learning multiscale features, capturing long range context and improving the upsampling in the decoder, while simultaneously reducing the number of parameters and maintaining a fast inference time.
Multimodal Fusion: The availability of low-cost sensors has encouraged novel approaches to exploit features from alternate modalities in an effort to improve robustness as well as the granularity of segmentation. Silberman et al (2012) propose an approach based on SIFT features and MRFs for indoor scene segmentation using RGB-D images. Subsequently, Ren et al (2012) propose improvements to the feature set by using kernel descriptors and by combining MRF with segmentation trees. Munoz et al (2012) employ modality-specific classifier cascades that hierarchically propagate information and do not require one-to-one correspondence between data across modalities. In addition to incorporating features based on depth images, Hermans et al (2014) propose an approach that performs joint 3D mapping and semantic segmentation using Randomized Decision Forests. There has also been work on extracting combined RGB and depth features using CNNs (Couprie et al, 2013; Gupta et al, 2014) for object detection and semantic segmentation. In most of these approaches, hand engineered or learned features are extracted from individual modalities and combined together in a joint feature set which is then used for classification.
More recently, there has been a series of DCNN-based fusion techniques (Eitel et al, 2015; Kim et al, 2017; Li et al, 2016) that have been proposed for end-to-end learning of fused representations from multiple modalities. These fusion approaches can be categorized into early, hierarchical and late fusion methods. An intuitive early fusion technique is to stack data from multiple modalities channel-wise and feed it to the network as a four or six channel input. However, experiments have shown that this often does not enable the network to learn complementary features and cross-modal interdependencies (Valada et al, 2016b; Hazirbas et al, 2016). Hierarchical fusion approaches combine feature maps from multiple modality-specific encoders at various levels (often at each downsampling stage) and upsample the fused features using a single decoder (Hazirbas et al, 2016; Kim et al, 2017). Alternatively, Schneider et al (2017) propose a mid-level fusion approach in which NiN layers (Lin et al, 2013) with depth as input are used to fuse feature maps into the RGB encoder in the middle of the network. Li et al (2016) propose a Long-Short Term Memory (LSTM) context fusion model that captures and fuses contextual information from multiple modalities, accounting for the complex interdependencies between them. Qi et al (2017) propose an interesting approach that employs 3D graph neural networks for RGB-D semantic segmentation, which accounts for both 2D appearance and 3D geometric relations while capturing long range dependencies within images.
In the late fusion approach, identical network streams are first trained individually on a specific modality and the feature maps are fused towards the end of network using concatenation (Eitel et al, 2015) or element-wise summation (Valada et al, 2016b), followed by learning deeper fused representations. However, this does not enable the network to adapt the fusion to changing scene context. In our previous work (Valada et al, 2016a), we proposed a mixture-of-experts CMoDE fusion scheme for combining feature maps from late fusion based architectures. Subsequently, in (Valada et al, 2017) we extended the CMoDE framework for probabilistic fusion accounting for the types of object categories in the dataset which enables more flexibility in learning the optimal combination. Nevertheless, there are several real-world scenarios in which class-wise fusion is not sufficient, especially in outdoor scenes where different modalities perform well in different conditions. Moreover, the CMoDE module employs multiple softmax loss layers for each class to compute the probabilities for fusion which does not scale for datasets such as SUN RGB-D which has 37 object categories. Motivated by this observation, in this work, we propose a multimodal semantic segmentation architecture incorporating our SSMA fusion module that dynamically adapts the fusion of intermediate network representations from multiple modality-specific streams according to the object class, its spatial location and the scene context while learning the fusion in a self-supervised fashion.
3 AdapNet++ Architecture
In this section, we first briefly describe the overall topology of the proposed AdapNet++ architecture and our main contributions motivated by our design criteria. We then detail each of the constituting architectural components and the training schedule that we employ.
Our network follows the general fully convolutional encoder-decoder design principle as shown in Figure 2. The encoder (depicted in blue) is based on the ResNet-50 (He et al, 2015a) model as it offers a good trade-off between learning highly discriminative deep features and the required computational complexity. In order to effectively compute high resolution feature responses at different spatial densities, we incorporate our recently proposed multiscale residual units (Valada et al, 2017) at varying dilation rates in the last two blocks of the encoder. In addition, to enable our model to capture long-range context and to further learn multiscale representations, we propose an efficient variant of the atrous spatial pyramid pooling module known as eASPP which has a larger effective receptive field and reduces the number of parameters required by over 87% compared to the originally proposed ASPP in DeepLab v3 (Chen et al, 2017). We append the proposed eASPP after the last residual block of the encoder, shown as green blocks in Figure 2. In order to recover the segmentation details from the low spatial resolution output of the encoder section, we propose a new deep decoder consisting of multiple deconvolution and convolution layers, in addition to skip refinement stages that fuse mid-level features from the encoder with the upsampled decoder feature maps for object boundary refinement. Furthermore, we add two auxiliary supervision branches after the upsampling stages to accelerate training and improve the gradient propagation in the network. We depict the decoder as orange blocks and the skip refinement stages as gray blocks in the network architecture shown in Figure 2. In the following sections, we discuss each of the aforementioned network components in detail and elaborate on the design choices.
Encoders are the foundation of fully convolutional neural network architectures. Therefore, it is essential to build upon a good baseline that has high representational ability while conforming to the computational budget. Our critical requirement is to achieve the right trade-off between the accuracy of segmentation and the inference time on a consumer grade GPU, while keeping the number of parameters low. As we also employ the proposed architecture for multimodal fusion, our objective is to design a topology with a reasonable model size so that two individual modality-specific networks can be trained in a fusion framework and deployed on a single GPU. Therefore, we build upon the ResNet-50 architecture that has four computational blocks with a varying number of residual units. We use the bottleneck residual units in our encoder as they are computationally more efficient than the baseline residual units and they enable us to build more complex models that are easily trainable.
Figure 4: (a) Original. (b) Pre-activation.
Pre-activation Residual Units: The standard residual unit shown in Figure 4(a) can be expressed as

$\mathbf{y}_l = h(\mathbf{x}_l) + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l), \qquad \mathbf{x}_{l+1} = f(\mathbf{y}_l),$

where $\mathcal{F}$ is the residual function (convolutional layers in the residual unit), $\mathbf{x}_l$ is the input feature to the $l$-th unit, while $\mathbf{x}_{l+1}$ is the output, $\mathcal{W}_l = \{\mathbf{W}_{l,k} \,|\, 1 \leq k \leq K\}$ are the set of weights and biases of the $l$-th residual unit and $K$ is the number of layers in the unit. The function $f$ is a ReLU in our network and the function $h$ is set to the identity mapping $h(\mathbf{x}_l) = \mathbf{x}_l$. However, the activation $f$ in the original residual unit affects both paths in the next unit. Therefore, He et al (2016) proposed an improved residual unit where the activation only affects the $\mathcal{F}$ path, which can be defined as

$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(f(\mathbf{x}_l), \mathcal{W}_l).$
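The difference between the two formulations can be sketched as follows, abstracting the residual function $\mathcal{F}$ into a callable:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def original_unit(x, residual_fn):
    # y_l = x_l + F(x_l); x_{l+1} = f(y_l): the ReLU acts on the sum,
    # so it affects both the identity and the residual path of the next unit.
    return relu(x + residual_fn(x))

def preactivation_unit(x, residual_fn):
    # x_{l+1} = x_l + F(f(x_l)): the activation is moved into the
    # residual branch, leaving the identity path unobstructed.
    return x + residual_fn(relu(x))
```

In the pre-activation form the output always contains the untouched additive term $\mathbf{x}_l$, which is what lets gradients flow through the shortcut connection without obstruction.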
This pre-activation residual unit shown in Figure 4(b) enables the gradient to flow through the shortcut connection to any unit without any obstruction. Therefore, we adopt these units instead of the originally proposed residual units as they reduce overfitting, improve convergence and also yield better performance. The output of the last block of the ResNet-50 architecture is 32-times downsampled with respect to the input image resolution. In order to increase the spatial density of the feature responses and to prevent signal decimation, we reduce the stride of the convolution layer in the last block (res4a) from two to one, which makes the resolution of the output feature maps 1/16th of the input image resolution. We then replace the residual blocks that follow this last downsampling stage with our proposed multiscale residual units that incorporate parallel atrous convolutions at varying dilation rates. In the following, we first briefly review the principle behind atrous convolutions before describing our proposed multiscale residual units.
Atrous Convolution: Pooling and striding in convolutional neural networks decrease the spatial resolution and decimate details that cannot be completely recovered even using deconvolution layers. To alleviate this problem, atrous convolutions, also known as dilated convolutions (Yu and Koltun, 2016), can be used to enlarge the field of view of the filter, thereby effectively capturing larger context. Moreover, by using atrous convolutions of different dilation rates, we can aggregate multiscale context. Atrous convolution is equivalent to convolving with a filter that has $r-1$ zeros inserted between two consecutive filter values across the spatial dimensions. Let $\mathbf{x}$ be a discrete input signal, $\mathbf{w}$ be a discrete filter of size $K$ and $r$ be a dilation rate. The atrous convolution can be defined as

$\mathbf{y}[i] = \sum_{k=1}^{K} \mathbf{x}[i + r \cdot k] \, \mathbf{w}[k],$

where the dilation rate $r$ denotes the stride with which we sample the input signal. Therefore, a larger dilation rate indicates a larger receptive field. The standard convolution can be considered as a special case with $r = 1$.
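A direct 1-D NumPy implementation of this definition (valid positions only, no padding) illustrates how the rate $r$ enlarges the receptive field without adding parameters:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """1-D atrous (dilated) convolution: y[i] = sum_k x[i + r*k] * w[k],
    computed only at positions where the dilated filter fits entirely."""
    K = len(w)
    span = r * (K - 1)                 # receptive field minus one
    out_len = len(x) - span
    return np.array([sum(x[i + r * k] * w[k] for k in range(K))
                     for i in range(out_len)])

x = np.arange(8, dtype=float)          # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])          # box filter of size K = 3
y1 = atrous_conv1d(x, w, 1)            # sums of 3 consecutive samples
y2 = atrous_conv1d(x, w, 2)            # sums of 3 samples spaced 2 apart
```

With a filter of size $K$, the receptive field grows from $K$ at $r = 1$ to $r(K-1)+1$ in general, while the number of weights stays fixed at $K$.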
Multiscale Residual Units: A naive approach to compute the feature responses at the full image resolution would be to remove all downsampling and replace all the convolutions with atrous convolutions with correspondingly enlarged dilation rates, but this would be both computation and memory intensive. Therefore, we propose a novel multiscale residual unit (Valada et al, 2017) to efficiently enlarge the receptive field and aggregate multiscale features without increasing the number of parameters and the computational burden. Specifically, we replace the $3 \times 3$ convolution in the full pre-activation residual unit with two parallel $3 \times 3$ atrous convolutions with different dilation rates, each with half the number of feature maps. We then concatenate their outputs before the following convolution.
Figure 5: (a) Atrous Spatial Pyramid Pooling (ASPP) and (b) efficient Atrous Spatial Pyramid Pooling (eASPP).
By concatenating their outputs, the network additionally learns to combine the feature maps of different scales. By setting the dilation rate in one of the parallel convolutions to one and the other to a rate $r > 1$, we can preserve the original scale of the features within the block and simultaneously add larger context. By varying the dilation rates in each of the parallel convolutions, we can enable the network to effectively learn multiscale representations at different stages of the network. The topology of the proposed multiscale residual units and the corresponding original residual units are shown in the legend in Figure 3. The lower left two units show the original configuration, while the lower right two units show the proposed configuration. Figure 3 shows our entire encoder structure with the full pre-activation residual units and the multiscale residual units.
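A minimal NumPy sketch of the parallel-branch computation in the multiscale residual unit, using 1-D feature maps and 'valid' padding for brevity; the actual unit uses 3×3 2-D convolutions with 'same' padding so both branches keep equal spatial size, and all names and the random weights here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv1d(x, w, r):
    # x: (C_in, L), w: (C_out, C_in, K) -> (C_out, L_out) at 'valid' positions
    C_out, C_in, K = w.shape
    span = r * (K - 1) + 1
    L_out = x.shape[1] - span + 1
    y = np.zeros((C_out, L_out))
    for o in range(C_out):
        for i in range(L_out):
            y[o, i] = sum(x[c, i + r * k] * w[o, c, k]
                          for c in range(C_in) for k in range(K))
    return y

def multiscale_unit_middle(x, channels, r1, r2):
    """Middle conv of the multiscale residual unit: two parallel dilated
    convolutions with channels//2 filters each, outputs concatenated."""
    C_in = x.shape[0]
    half = channels // 2
    wa = rng.standard_normal((half, C_in, 3)) * 0.1
    wb = rng.standard_normal((half, C_in, 3)) * 0.1
    ya = dilated_conv1d(x, wa, r1)
    yb = dilated_conv1d(x, wb, r2)
    # crop to a common length ('valid' padding shrinks the branches differently;
    # 'same' padding in the real unit keeps them aligned)
    L = min(ya.shape[1], yb.shape[1])
    return np.concatenate([ya[:, :L], yb[:, :L]], axis=0)

x = rng.standard_normal((8, 32))
y = multiscale_unit_middle(x, channels=8, r1=1, r2=4)
```

The concatenated output has the same channel depth as a single full-width convolution, so the unit adds multiscale context at no extra parameter cost.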
We incorporate the first multiscale residual unit before the third block at res3d (the unit before the block where we remove the downsampling as mentioned earlier). Subsequently, we replace the units res4c, res4d, res4e and res4f with our proposed multiscale units, keeping the dilation rate of one parallel convolution at one and progressively increasing the rate of the other. In addition, we replace the last three units of block four, res5a, res5b and res5c, with multiscale units with increasing dilation rates in both parallel convolutions. We evaluate our proposed configuration in comparison to the multigrid method of DeepLab v3 (Chen et al, 2017) in the ablation study presented in Section 5.5.
3.2 Efficient Atrous Spatial Pyramid Pooling
In this section, we first describe the topology of the Atrous Spatial Pyramid Pooling (ASPP) module, followed by the structure of our proposed efficient Atrous Spatial Pyramid Pooling (eASPP). ASPP has become prevalent in most state-of-the-art architectures due to its ability to capture long-range context and multiscale information. Inspired by spatial pyramid pooling (He et al, 2015c), the initially proposed ASPP in DeepLab v2 (Chen et al, 2015) employs four parallel atrous convolutions with different dilation rates. Concatenating the outputs of multiple parallel atrous convolutions aggregates multiscale context with different receptive field resolutions. However, as illustrated in the subsequent DeepLab v3 (Chen et al, 2017), applying extremely large dilation rates inhibits capturing long-range context due to image boundary effects. Therefore, an improved version of ASPP was proposed (Chen et al, 2017) that adds global context information by incorporating image-level features.
The resulting ASPP shown in Figure 5(a) consists of five parallel branches: one 1×1 convolution and three 3×3 atrous convolutions with different dilation rates. Additionally, image-level features are introduced by applying global average pooling on the input feature map, followed by a 1×1 convolution and bilinear upsampling to yield an output with the same dimensions as the input feature map. All the convolutions have 256 filters and batch normalization layers to improve training. Finally, the resulting feature maps from each of the parallel branches are concatenated and passed through another 1×1 convolution with batch normalization to yield 256 output filters. In the DeepLab v3 architecture (Chen et al, 2017), the ASPP module is appended after the last residual block of the encoder and dilation rates of 6, 12 and 18 are used in the parallel atrous convolution layers. However, as we use a smaller input image, the input feature map to the ASPP has correspondingly smaller spatial dimensions; therefore, we reduce the dilation rates to 3, 6 and 12 in the atrous convolution layers respectively.
The biggest caveat of employing the ASPP is the extremely large number of parameters and floating point operations per second (FLOPS) that it consumes. Each of the 3×3 atrous convolutions has 256 filters, which makes the ASPP as a whole prohibitively expensive in both parameters and FLOPS. To address this problem, we propose an equivalent structure called eASPP that substantially reduces the computational complexity. Our proposed topology is based on two principles: cascading atrous convolutions and the bottleneck structure. Cascading atrous convolutions effectively enlarges the receptive field, as the latter atrous convolution takes the output of the former atrous convolution as input. The receptive field size of an atrous convolution can be computed as
$$RF = r \cdot (F - 1) + 1,$$
where $r$ is the dilation rate of the atrous convolution and $F$ is the filter size. When two atrous convolutions with receptive field sizes $RF_1$ and $RF_2$ are cascaded, the effective receptive field size is computed as
$$RF_{eff} = RF_1 + RF_2 - 1.$$
For example, if two atrous convolutions with filter size 3×3 and dilation rate $r = 3$ are cascaded, then each of the convolutions individually has a receptive field size of 7×7, while the effective receptive field size of the second atrous convolution is 13×13. Moreover, cascading atrous convolutions enables denser sampling of pixels in comparison to a single parallel atrous convolution with an equally large receptive field. Therefore, by using both parallel and cascaded atrous convolutions in the ASPP, we can efficiently aggregate dense multiscale features with very large receptive fields.
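The receptive field arithmetic above can be verified with a few lines; the helper names are ours:

```python
def rf_atrous(F, r):
    """Receptive field of one atrous convolution: r * (F - 1) + 1."""
    return r * (F - 1) + 1

def rf_cascade(rf1, rf2):
    """Effective receptive field of two cascaded convolutions."""
    return rf1 + rf2 - 1

single = rf_atrous(3, 3)                 # 3x3 filter, dilation 3 -> 7
effective = rf_cascade(single, single)   # two cascaded convolutions -> 13
```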
In order to reduce the number of parameters in the ASPP topology, we employ a bottleneck structure in the cascaded atrous convolution branches. The topology of our proposed eASPP shown in Figure 5(b) consists of five parallel branches similar to the ASPP, but the branches with the atrous convolutions are replaced with our cascaded bottleneck branches. If $C$ is the number of channels in the atrous convolution, we add a 1×1 convolution with a reduced number of filters before the atrous convolution to squeeze only the most relevant information through the bottleneck. We then replace the single 3×3 atrous convolution with two cascaded 3×3 atrous convolutions operating on the reduced number of filters, followed by another 1×1 convolution to restore the number of filters to $C$. The proposed eASPP thereby achieves a substantial reduction in both parameters and FLOPS in comparison to the ASPP. We evaluate our proposed eASPP in comparison to the ASPP in the ablation study presented in Section 5.5.2 and show that it achieves improved performance while being significantly more efficient in the number of parameters.
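The parameter savings of the bottleneck can be sketched by counting the weights of a single branch; the bottleneck width of 64 channels below is an assumption for illustration, not a value taken from the paper:

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a k x k convolution (bias terms omitted)."""
    return c_in * c_out * k * k

C = 256   # channels in each ASPP branch
b = 64    # assumed bottleneck width for this sketch

# One ASPP branch: a single 3x3 atrous convolution with 256 filters.
aspp_branch = conv_params(C, C, 3)

# One eASPP branch: 1x1 squeeze -> two cascaded 3x3 atrous convs -> 1x1 expand.
easpp_branch = (conv_params(C, b, 1)
                + 2 * conv_params(b, b, 3)
                + conv_params(b, C, 1))
```

Under this assumption, a bottleneck branch needs roughly a fifth of the weights of the plain branch, even though it contains two cascaded atrous convolutions and therefore a larger effective receptive field.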
The output of the eASPP in our network is 16-times downsampled with respect to the input image and therefore it has to be upsampled back to the full input resolution. In our previous work (Valada et al, 2017), we employed a simple decoder with two deconvolution layers and one skip refinement connection. Although the decoder was more effective in recovering the segmentation details in comparison to direct bilinear upsampling, it often produced disconnected segments while recovering the structure of thin objects such as poles and fences. In order to overcome this impediment, we propose a more effective decoder in this work.
Our decoder shown in Figure 6 consists of three stages. In the first stage, the output of the eASPP is upsampled by a factor of two using a deconvolution layer to obtain a coarse segmentation mask. The upsampled coarse mask is then passed through the second stage, where the feature maps are concatenated with the first skip refinement from Res3d. The skip refinement consists of a 1×1 convolution layer to reduce the feature depth in order to not outweigh the encoder features. We experiment with a varying number of feature channels in the skip refinement in the ablation study presented in Section 5.5.3. The concatenated feature maps are then passed through two 3×3 convolutions to improve the resolution of the refinement, followed by a deconvolution layer that again upsamples the feature maps by a factor of two. This upsampled output is fed to the last decoder stage, which resembles the previous stage, consisting of concatenation with the feature maps from the second skip refinement from Res2c, followed by two 3×3 convolution layers. All the convolutional and deconvolutional layers until this stage have 256 feature channels, therefore the output from the two convolutions in the last stage is fed to a 1×1 convolution layer to reduce the number of feature channels to the number of object categories. This output is finally fed to the last deconvolution layer, which upsamples the feature maps by a factor of four to recover the original input resolution.
3.4 Multiresolution Supervision
Deep networks are often difficult to train due to the intrinsic instability associated with learning using gradient descent, which leads to exploding or vanishing gradient problems. As our encoder is based on the residual learning framework, shortcut connections in each unit help propagate the gradient more effectively. Another technique that can be used to mitigate this problem to a certain extent is initializing the layers with pretrained weights; however, our proposed eASPP and decoder layers still have to be trained from scratch, which could lead to optimization difficulties. Recent deep architectures have proposed employing an auxiliary loss in the middle of the encoder network (Lee et al, 2015; Zhao et al, 2017), in addition to the main loss towards the end of the network. However, as shown in the ablation study presented in Section 5.5.1, this does not improve the performance of our network, although it helps the optimization to converge faster.
Unlike previous approaches, in this work we propose a multiresolution supervision strategy to both accelerate the training and improve the resolution of the segmentation. As described in the previous section, our decoder consists of three upsampling stages. We add two auxiliary loss branches at the end of the first and second stage after the deconvolution layer, in addition to the main loss at the end of the decoder, as shown in Figure 7. Each auxiliary loss branch decreases the feature channels to the number of category labels using a 1×1 convolution with batch normalization and upsamples the feature maps to the input resolution using bilinear upsampling. We use simple bilinear upsampling, which does not contain any weights, instead of a deconvolution layer in the auxiliary loss branches, as our aim is to force the main decoder stream to improve its discriminativeness at each upsampling resolution so that it embeds multiresolution information while learning to upsample. We weigh the two auxiliary losses to balance the gradient flow through all the previous layers. During testing, the auxiliary loss branches are discarded and only the main decoder stream is used. We experiment with different loss weightings in the ablation study presented in Section 5.5.3, and in Section 5.5.1 we show that each of the auxiliary loss branches improves the segmentation performance in addition to speeding up the training.
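The combined objective can be sketched as a weighted sum of the main decoder loss and the two down-weighted auxiliary losses; the weights and loss values below are placeholders, not the paper's settings:

```python
def total_loss(main, aux1, aux2, lam1, lam2):
    """Multiresolution objective: main decoder loss plus the two auxiliary
    losses, down-weighted by the hyperparameters lam1 and lam2."""
    return main + lam1 * aux1 + lam2 * aux2

# Placeholder per-branch loss values and illustrative weights.
total = total_loss(main=1.0, aux1=0.8, aux2=0.9, lam1=0.6, lam2=0.5)
```

At test time only `main` is computed, since the auxiliary branches are discarded.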
3.5 Network Compression
As we strive to design an efficient and compact semantic segmentation architecture that can be employed in resource constrained applications, we must ensure that the utilization of convolutional filters in our network is thoroughly optimized. Often, even the most compact networks have abundant neurons in deeper layers that do not significantly contribute to the overall performance of the model. Excessive convolutional filters not only increase the model size but also the inference time and the number of computing operations. These factors critically hinder the deployment of models in resource constrained real-world applications. Pruning of neural networks can be traced back to the 80s when LeCun et al (1990) introduced a technique called Optimal Brain Damage for selectively pruning weights with a theoretically justified measure. Recently, several new techniques have been proposed for pruning weight matrices (Wen et al, 2016; Anwar et al, 2017; Liu et al, 2017; Li et al, 2017) of convolutional layers as most of the computation during inference is consumed by them.
These approaches rank neurons based on their contribution and remove the low-ranking neurons from the network, followed by fine-tuning of the pruned network. While the simplest neuron ranking method computes the $\ell_1$-norm of each convolutional filter (Li et al, 2017), more sophisticated techniques have recently been proposed (Anwar et al, 2017; Liu et al, 2017; Molchanov et al, 2017). Some of these approaches are based on sparsity-based regularization of network parameters, which additionally increases the computational overhead during training (Liu et al, 2017; Wen et al, 2016). Techniques have also been proposed for structured pruning of entire kernels with strided sparsity (Anwar et al, 2017) that demonstrate impressive results for pruning small networks. However, their applicability to complex networks that are to be evaluated on large validation sets has not been explored due to the heavy computational processing involved. Moreover, until a year ago these techniques were only applied to simpler architectures such as VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al, 2012), as pruning complex deep architectures such as ResNets requires a holistic approach. Thus far, pruning of residual units has only been performed on convolutional layers that do not have an identity or shortcut connection, as pruning them additionally requires pruning the added residual maps in the exact same configuration. Attempts to prune them in the same configuration have resulted in a significant drop in performance (Li et al, 2017). Therefore, only the first and the second convolutional layers of a residual unit are often pruned.
Our proposed AdapNet++ architecture has shortcut and skip connections both in the encoder as well as in the decoder. Therefore, in order to efficiently maximize the pruning of our network, we propose a holistic network-wide pruning technique that is invariant to the presence of skip or shortcut connections. Our proposed technique first prunes all the convolutional layers of a residual unit and then masks out the pruned indices of the last convolutional layer of the unit with zeros before the addition of the residual maps from the shortcut connection. As the masking is performed after the pruning, we efficiently reduce the parameters and computing operations in a holistic fashion, while optimally pruning all the convolutional layers and preserving the shortcut or skip connections. After each pruning iteration, we fine-tune the network to recover any loss in accuracy. We illustrate this strategy adopting a recently proposed greedy criteria-based oracle pruning technique that incorporates a novel ranking method based on a first-order Taylor expansion of the network cost function (Molchanov et al, 2017). The pruning problem is framed as a combinatorial optimization problem such that when the weights of the network are pruned, the change in the cost value is minimal:
$$\min_{\mathcal{W}'} \left| \mathcal{C}(\mathcal{D} \mid \mathcal{W}') - \mathcal{C}(\mathcal{D} \mid \mathcal{W}) \right|,$$
where $\mathcal{D}$ is the training set, $\mathcal{W}$ denotes the network parameters and $\mathcal{C}(\cdot)$ is the negative log-likelihood cost function. Based on a Taylor expansion, the change in the loss function from removing a specific parameter can be approximated. Let $h_i$ be the output feature map produced by parameter $i$ and $\mathcal{C}(\mathcal{D}, h_i)$ the corresponding cost. The output can be pruned by setting it to zero and the ranking can be given by
$$\left| \Delta \mathcal{C}(h_i) \right| = \left| \mathcal{C}(\mathcal{D}, h_i = 0) - \mathcal{C}(\mathcal{D}, h_i) \right|.$$
Approximating $\mathcal{C}(\mathcal{D}, h_i = 0)$ with a first-order Taylor expansion, we can write the ranking as
$$\Theta_{TE}(h_i) = \left| \frac{1}{M} \sum_{m} \frac{\partial \mathcal{C}}{\partial h_i^{(m)}}\, h_i^{(m)} \right|,$$
where $M$ is the length of the vectorized feature map. This ranking can be easily computed using the standard back-propagation computation, as it only requires the gradient of the cost function with respect to the activation multiplied with the activation itself. Furthermore, in order to achieve adequate rescaling across layers, a layer-wise $\ell_2$-norm of the rankings is computed as
$$\hat{\Theta}(h_i) = \frac{\Theta(h_i)}{\sqrt{\sum_j \Theta(h_j)^2}}.$$
The entire pruning procedure can be summarized as follows: first the AdapNet++ network is trained until convergence using the training protocol described in Section 5.1. Then the importance of the feature maps is evaluated using the aforementioned ranking method and subsequently the unimportant feature maps are removed. The pruned convolution layers that have shortcut connections are then masked at the indices where the unimportant feature maps are removed to maintain the shortcut connections. The network is then fine-tuned and the pruning process is reiterated until the desired trade-off between accuracy and the number of parameters has been achieved. We present results from pruning our AdapNet++ architecture in Section 5.4, where we perform pruning of both the convolutional and deconvolutional layers of our network in five stages by varying the threshold for the rankings. For each of these stages, we quantitatively evaluate the performance versus number of parameters trade-off obtained using our proposed pruning strategy in comparison to the standard approach.
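The ranking and shortcut-preserving masking steps can be sketched in NumPy as follows; the shapes and data are synthetic, and `taylor_rank` and `mask_pruned` are our illustrative names:

```python
import numpy as np

def taylor_rank(activations, gradients):
    """Taylor-expansion ranking per feature map: the absolute mean of
    gradient * activation over the vectorized map, followed by a
    layer-wise L2 re-scaling of the rankings."""
    C = activations.shape[0]
    M = activations[0].size
    theta = np.abs((activations.reshape(C, -1)
                    * gradients.reshape(C, -1)).sum(axis=1) / M)
    return theta / np.sqrt((theta ** 2).sum())

def mask_pruned(feature_maps, pruned_idx):
    """Zero out pruned channels of the last conv in a residual unit so the
    shortcut addition still sees a tensor of the original shape."""
    out = feature_maps.copy()
    out[list(pruned_idx)] = 0.0
    return out

rng = np.random.default_rng(1)
acts = rng.standard_normal((4, 8, 8))    # 4 feature maps of a layer
grads = rng.standard_normal((4, 8, 8))   # gradients of the cost w.r.t. them
scores = taylor_rank(acts, grads)
masked = mask_pruned(acts, pruned_idx=[int(scores.argmin())])
```

Because the masked tensor keeps its original shape, the element-wise addition with the shortcut branch remains valid after pruning.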
4 Self-Supervised Model Adaptation
In this section, we describe our approach to multimodal fusion using our proposed self-supervised model adaptation (SSMA) framework. Our framework consists of three components: a modality-specific encoder as described in Section 3.1, a decoder built upon the topology described in Section 3.3 and our proposed SSMA block for adaptively recalibrating and fusing modality-specific feature maps. In the following, we first formulate the problem of semantic segmentation from multimodal data, followed by a detailed description of our proposed SSMA units and finally we describe the overall topology of our fusion architecture.
We represent the training set for multimodal semantic segmentation as $\mathcal{T} = \{(I_n^a, I_n^b, M_n) \mid n = 1, \dots, N\}$, where $I_n^a$ denotes the input frame from modality $a$, $I_n^b$ denotes the corresponding input frame from modality $b$ and the groundtruth label is given by $M_n$, whose pixel-wise labels take values in the set of semantic classes $\{1, \dots, C\}$. The image $I_n^a$ is only shown to the modality-specific encoder for modality $a$ and, similarly, the corresponding image $I_n^b$ from the complementary modality is only shown to the modality-specific encoder for modality $b$. This enables each modality-specific encoder to specialize in a particular sub-space, learning its own hierarchical representations individually. We assume that the input images $I_n^a$ and $I_n^b$, as well as the label $M_n$, have the same dimensions and that the pixels are drawn as samples following a categorical distribution. Let $\theta$ be the network parameters consisting of weights and biases. Using the classification scores $s_j$ at each pixel $(u, v)$, we obtain the probabilities with the softmax function such that
$$p_j(u, v \mid \theta) = \frac{\exp\left(s_j(u, v)\right)}{\sum_{k=1}^{C} \exp\left(s_k(u, v)\right)}$$
denotes the probability of pixel $(u, v)$ being classified with label $j$. The optimal network parameters are then estimated by minimizing the cross-entropy loss
$$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \sum_{(u, v)} \sum_{j=1}^{C} \delta\left(M_n(u, v), j\right) \log p_j(u, v \mid \theta),$$
where $\delta(\cdot, \cdot)$ is the Kronecker delta.
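A NumPy sketch of the per-pixel softmax and the cross-entropy objective defined above, on synthetic scores and labels; a mean over pixels is taken here for convenience:

```python
import numpy as np

def softmax(scores):
    # scores: (C, H, W) classification scores -> per-pixel class probabilities
    e = np.exp(scores - scores.max(axis=0, keepdims=True))  # stabilized
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(scores, labels):
    """Mean per-pixel negative log-likelihood:
    -sum_j delta(y_uv, j) * log p_j(u, v), averaged over pixels."""
    p = softmax(scores)
    C, H, W = scores.shape
    total = 0.0
    for u in range(H):
        for v in range(W):
            # the Kronecker delta selects the groundtruth class probability
            total -= np.log(p[labels[u, v], u, v])
    return total / (H * W)

rng = np.random.default_rng(2)
scores = rng.standard_normal((3, 4, 4))       # 3 classes, 4x4 pixels
labels = rng.integers(0, 3, size=(4, 4))
loss = cross_entropy(scores, labels)
```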
4.1 SSMA Block
In order to adaptively recalibrate and fuse feature maps from modality-specific networks, we propose a novel architectural unit called the SSMA block. The goal of the SSMA block is to explicitly model the correlation between the two modality-specific feature maps before fusion so that the network can exploit the complementary features by learning to selectively emphasize more informative features from one modality, while suppressing the less informative features from the other. We construct the topology of the SSMA block in a fully-convolutional fashion which empowers the network with the ability to emphasize features from a modality-specific network for only certain spatial locations or object categories, while emphasizing features from the complementary modality for other locations or object categories. Moreover, the SSMA block dynamically recalibrates the feature maps based on the input scene context.
The structure of the SSMA block is shown in Figure 8. Let $\mathbf{X}^a$ and $\mathbf{X}^b$ denote the modality-specific feature maps from modality $a$ and modality $b$ respectively, each with $C$ feature channels and spatial dimensions $H \times W$. First, we concatenate the modality-specific feature maps $\mathbf{X}^a$ and $\mathbf{X}^b$ to yield $\mathbf{X}^{ab}$ with $2C$ channels. We then employ a recalibration technique to adapt the concatenated feature maps before fusion. In order to achieve this, we first pass the concatenated feature map through a bottleneck consisting of two convolutional layers for dimensionality reduction and to improve the representational capacity of the concatenated features. The first convolution has weights $\mathbf{W}_1$ with a channel reduction ratio $\eta$ and a non-linearity function $\delta(\cdot)$. We use ReLU for the non-linearity, similar to the other activations in the encoders, and experiment with different reduction ratios in Section 5.9.1. Note that we omit the bias term to simplify the notation. The subsequent convolutional layer with weights $\mathbf{W}_2$ increases the dimensionality of the feature channels back to the concatenation dimension $2C$, and a sigmoid function $\sigma(\cdot)$ scales the dynamic range of the activations to the $(0, 1)$ interval. This can be represented as
$$\mathbf{z} = \sigma\left(\mathbf{W}_2 * \delta\left(\mathbf{W}_1 * \mathbf{X}^{ab}\right)\right).$$
The resulting output $\mathbf{z}$ is used to recalibrate or emphasize/de-emphasize regions in $\mathbf{X}^{ab}$ as
$$\hat{\mathbf{X}}^{ab} = \mathbf{z} \circ \mathbf{X}^{ab},$$
where $\circ$ denotes the Hadamard product of the feature maps and the matrix of scalars, such that each element of $\mathbf{X}^{ab}$ is multiplied with the corresponding activation in $\mathbf{z}$. The activations $\mathbf{z}$ adapt to the concatenated input feature map $\mathbf{X}^{ab}$, enabling the network to weigh features element-wise, spatially and across the channel depth, based on the multimodal inputs $\mathbf{X}^a$ and $\mathbf{X}^b$. With new multimodal inputs, the network dynamically weighs and reweighs the feature maps in order to optimally combine complementary features. Finally, the recalibrated feature maps are passed through a convolution with weights $\mathbf{W}_3$ and a batch normalization layer to reduce the feature channel depth back to $C$ and yield the fused output as
$$\mathbf{X}^{fused} = \mathrm{BN}\left(\mathbf{W}_3 * \hat{\mathbf{X}}^{ab}\right).$$
As described in the following section, we employ our proposed SSMA block to fuse modality-specific feature maps both at intermediate stages of the network and towards the end of the encoder. Although we utilize a bottleneck structure to conserve the number of parameters consumed, a further reduction in parameters can be achieved by replacing the 3×3 convolution layers with 1×1 convolutions, which yields comparable performance. We also remark that the SSMA blocks can be used for multimodal fusion in other tasks such as image classification or object detection, as well as for fusion of feature maps across tasks in multitask learning.
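A shape-level NumPy sketch of the SSMA recalibration, with random matrices standing in for the learned convolutions and 1×1 convolutions replacing the 3×3 bottleneck convolutions for simplicity; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ssma_fuse(xa, xb, eta, rng):
    """SSMA sketch: concatenate the modality-specific maps, squeeze/excite
    them through a bottleneck with reduction ratio eta, recalibrate
    element-wise with sigmoid gates, then fuse back to C channels."""
    C, H, W = xa.shape
    x = np.concatenate([xa, xb], axis=0)               # (2C, H, W)
    mid = max(1, (2 * C) // eta)
    W1 = rng.standard_normal((mid, 2 * C)) * 0.1       # squeeze weights
    W2 = rng.standard_normal((2 * C, mid)) * 0.1       # excite weights
    s = np.maximum(0.0, np.einsum('oc,chw->ohw', W1, x))   # ReLU bottleneck
    g = sigmoid(np.einsum('oc,chw->ohw', W2, s))       # gates in (0, 1)
    x_hat = g * x                                      # element-wise recalibration
    W3 = rng.standard_normal((C, 2 * C)) * 0.1         # 1x1 fusion conv
    return np.einsum('oc,chw->ohw', W3, x_hat)

rng = np.random.default_rng(3)
xa = rng.standard_normal((8, 6, 6))   # modality a feature map
xb = rng.standard_normal((8, 6, 6))   # modality b feature map
fused = ssma_fuse(xa, xb, eta=4, rng=rng)
```

The fused output has the same channel depth as each input stream, so it can replace a unimodal feature map anywhere downstream.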
4.2 Fusion Architecture
We propose a framework for multimodal semantic segmentation using a modified version of our AdapNet++ architecture and the proposed SSMA blocks. For simplicity, we consider the fusion of two modalities, but the framework can be easily extended to an arbitrary number of modalities. The encoder of our framework shown in Figure 9 contains two streams, where each stream is based on the encoder topology described in Section 3.1. Each encoder stream is modality-specific and specializes in a particular sub-space. In order to fuse the feature maps from both streams, we adopt a combination of mid-level and late fusion strategies in which we fuse the latent representations of both encoders using the SSMA block and pass the fused feature map to the first decoder stage. We denote this as latent SSMA fusion, as it takes the output of the eASPP from each modality-specific encoder as input, and we set its channel reduction ratio as identified in our ablation experiments. As the AdapNet++ architecture contains skip connections for high-resolution refinement, we employ an SSMA block at each skip refinement stage after the 1×1 convolution, as shown in Figure 9. As these 1×1 convolutions reduce the feature channel depth to 24, we use a smaller reduction ratio in the two skip SSMAs, as identified from the ablation experiments presented in Section 5.9.1.
In order to upsample the fused predictions, we build upon our decoder described in Section 3.3. The main stream of our decoder resembles the topology of the decoder in our AdapNet++ architecture consisting of three upsampling stages. The output of the latent SSMA block is fed to the first upsampling stage of the decoder. Following the AdapNet++ topology, the outputs of the skip SSMA blocks would be concatenated into the decoder at the second and third upsampling stages (skip1 after the first deconvolution and skip2 after the second deconvolution). However, we find that concatenating the fused mid-level features into the decoder does not substantially improve the resolution of the segmentation, as much as in the unimodal AdapNet++ architecture. We hypothesise that directly concatenating the fused mid-level features and fused high-level features causes a feature localization mismatch as each SSMA block adaptively recalibrates at different stages of the network where the resolution of the feature maps and channel depth differ by one half of their dimensions. Moreover, training the fusion network end-to-end from scratch also contributes to this problem as without initializing the encoders with modality-specific pre-trained weights, concatenating the uninitialized mid-level fused encoder feature maps into the decoder does not yield any performance gains, rather it hampers the convergence.
With the goal of mitigating this problem, we propose two strategies. First, in order to facilitate better fusion, we adopt a multi-stage training protocol where we initialize each encoder in the fusion architecture with pre-trained weights from the unimodal AdapNet++ model. We describe this procedure in Section 5.1.2. Secondly, we propose a mechanism to better correlate the mid-level and high-level fused features before concatenation in the decoder: we weigh the fused mid-level skip features with the spatially aggregated statistics of the high-level decoder features before the concatenation. Following the notation convention, we define $\mathbf{X}^{dec}$ as the high-level decoder feature map before the skip concatenation stage. A feature statistic $\mathbf{z}$ is produced by projecting $\mathbf{X}^{dec}$ along its spatial dimensions using a global average pooling layer as
$$z_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c^{dec}(i, j),$$
where $z_c$ represents a statistic or a local descriptor of the $c$-th channel of $\mathbf{X}^{dec}$. We then reduce the number of feature channels in $\mathbf{z}$ using a 1×1 convolution layer with weights $\mathbf{W}_s$, batch normalization and a ReLU activation function $\delta(\cdot)$ to match the channels of the fused mid-level feature map, which is computed as shown in Equation (15). We can represent the resulting output as
$$\hat{\mathbf{z}} = \delta\left(\mathrm{BN}\left(\mathbf{W}_s * \mathbf{z}\right)\right).$$
Finally, we weigh the fused mid-level feature map with the reduced aggregated descriptors using channel-wise multiplication as
$$\hat{\mathbf{X}}^{fused} = \hat{\mathbf{z}} \circ \mathbf{X}^{fused}.$$
As shown in Figure 10, we apply the aforementioned mechanism to the fused feature maps from the skip1 SSMA as well as the skip2 SSMA and concatenate their outputs with the decoder feature maps at the second and third upsampling stages respectively. We find that this mechanism guides the fusion of mid-level skip refinement features with the high-level decoder features more effectively than direct concatenation and yields a notable improvement in the resolution of the segmentation output.
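The channel-wise weighting mechanism can be sketched in NumPy as follows; the dimensions and names are illustrative:

```python
import numpy as np

def weight_skip_features(x_skip, x_dec, W):
    """Weigh fused mid-level skip features by spatially pooled statistics of
    the high-level decoder features: z = GAP(x_dec); z_hat = ReLU(W z);
    output = x_skip * z_hat, broadcast channel-wise."""
    z = x_dec.mean(axis=(1, 2))            # global average pooling -> (C_dec,)
    z_hat = np.maximum(0.0, W @ z)         # reduce channels to match x_skip
    return x_skip * z_hat[:, None, None]   # channel-wise multiplication

rng = np.random.default_rng(4)
x_dec = rng.standard_normal((256, 8, 8))    # high-level decoder features
x_skip = rng.standard_normal((24, 16, 16))  # fused mid-level skip features
W = rng.standard_normal((24, 256)) * 0.05   # stand-in for the 1x1 conv weights
out = weight_skip_features(x_skip, x_dec, W)
```

Each skip channel is scaled by a single scalar derived from the decoder statistics, so the spatial layout of the skip features is preserved while their relative importance is modulated.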
5 Experimental Results
In this section, we first describe the datasets that we benchmark on, followed by comprehensive quantitative results for unimodal segmentation using our proposed AdapNet++ architecture in Section 5.3 and the results for model compression in Section 5.4. We then present detailed ablation studies that describe our architectural decisions in Section 5.5, followed by the qualitative unimodal segmentation results in Section 5.6. We present the multimodal fusion benchmarking experiments with the various modalities contained in the datasets in Section 5.7 and the ablation study on our multimodal fusion architecture in Section 5.9. We finally present the qualitative multimodal segmentation results in Section 5.10 and in challenging perceptual conditions in Section 5.11.
All our models were implemented using the TensorFlow (Abadi et al, 2015) deep learning library and the experiments were carried out on a system with an Intel Xeon E5 CPU and an NVIDIA TITAN X GPU. We primarily use the standard Jaccard Index, also known as the intersection-over-union (IoU) metric, to quantify the performance. We report the mean intersection-over-union (mIoU) metric for all the models, and also the pixel-wise accuracy (Acc), average precision (AP), global intersection-over-union (gIoU) metric, false positive rate (FPR) and false negative rate (FNR) in the detailed analysis. We make all our models publicly available at http://deepscene.cs.uni-freiburg.de/.
5.1 Training Protocol
In this subsection, we first describe the procedure that we employ for training our proposed AdapNet++ architecture, followed by the protocol for training the SSMA fusion scheme. We then detail the various data augmentations that we perform on the training set.
5.1.1 AdapNet++ Training
We train our network on fixed-resolution input images; we employ bilinear interpolation for resizing the RGB images and nearest-neighbor interpolation for the other modalities as well as the groundtruth labels. We initialize the encoder section of the network with weights pre-trained on the ImageNet dataset (Deng et al, 2009), while we use the He initialization (He et al, 2015b) for the other convolutional and deconvolutional layers. We use the Adam solver for optimization with its standard momentum parameters $\beta_1$ and $\beta_2$. We train our model for 150K iterations with a fixed initial learning rate, mini-batch size and dropout probability. We use the cross-entropy loss function and weight the two auxiliary losses with factors $\lambda_1$ and $\lambda_2$ to balance them, such that the final loss function is given as $\mathcal{L} = \mathcal{L}_{main} + \lambda_1 \mathcal{L}_{aux1} + \lambda_2 \mathcal{L}_{aux2}$.
5.1.2 SSMA Training
We employ a multi-stage procedure for training the multimodal models using our proposed SSMA fusion scheme. We first train each modality-specific AdapNet++ model individually using the training procedure described in Section 5.1.1. In the second stage, we leverage transfer learning to train the joint fusion model in the SSMA framework by initializing only the encoders with the weights of the individual modality-specific encoders trained in the previous stage. We then set a lower learning rate for the encoder layers than for the decoder layers and train the fusion model with a mini-batch of 7 for a maximum of 100K iterations. This enables the SSMA blocks to learn the optimal combination of multimodal feature maps from the well-trained encoders, while slowly adapting the encoder weights to improve the fusion. In the final stage, we freeze the encoder layers while only training the decoder and the SSMA blocks, with a mini-batch size of 12 for 50K iterations. This enables us to train the network with a larger batch size, while focusing more on the upsampling stages to yield the high-resolution segmentation output.
5.1.3 Data Augmentation
The training of deep networks can be significantly improved by expanding the dataset to introduce more variability. In order to achieve this, we apply a series of augmentation strategies randomly on the input data while training. The augmentations that we apply include rotation, skewing, scaling, vignetting, cropping, brightness modulation, contrast modulation and horizontal flipping, each sampled within a fixed range.
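A minimal sketch of such a randomly applied augmentation pipeline, shown here for only two of the listed transforms with illustrative ranges; note that geometric transforms must be applied identically to image and label, while photometric ones affect the image only:

```python
import numpy as np

def augment(image, label, rng):
    """Randomly applied augmentation sketch: horizontal flip (geometric,
    applied to image and label) and brightness modulation (photometric,
    image only). Probabilities and ranges are illustrative."""
    if rng.random() < 0.5:                 # random horizontal flip
        image = image[:, ::-1]
        label = label[:, ::-1]
    factor = rng.uniform(0.8, 1.2)         # brightness modulation
    image = np.clip(image * factor, 0.0, 1.0)
    return image, label

rng = np.random.default_rng(5)
img = rng.random((32, 32, 3))              # image in [0, 1]
lbl = rng.integers(0, 5, size=(32, 32))    # per-pixel class labels
aug_img, aug_lbl = augment(img, lbl, rng)
```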
We evaluate our proposed AdapNet++ architecture on five publicly available diverse scene understanding benchmarks ranging from urban driving scenarios to unstructured forested scenes and cluttered indoor environments. The datasets were particularly chosen based on the criterion of containing scenes with challenging perceptual conditions including rain, snow, fog, night-time, glare, motion blur and other seasonal appearance changes. Each of the datasets contains multiple modalities that we utilize for benchmarking our fusion approach. We briefly describe the datasets and their constituent semantic categories in this section.
Cityscapes: The Cityscapes dataset (Cordts et al, 2016) is one of the largest labeled RGB-D datasets for urban scene understanding. Being one of the standard benchmarks, it is highly challenging as it contains images of complex urban scenes, collected from over 50 cities during varying seasons, lighting and weather conditions. The images were captured using an automotive-grade stereo camera at a resolution of 2048×1024 pixels. The dataset contains 5000 finely annotated images with 30 categories, of which 2875 are provided for training, 500 for validation and 1525 for testing. Additionally, 20,000 coarse annotations are provided. The testing images are not publicly released; they are rather used by the evaluation server for benchmarking on a subset of the object classes, excluding the rarely appearing categories. In order to facilitate comparison with previous fusion approaches, we benchmark on the label set consisting of: sky, building, road, sidewalk, fence, vegetation, pole, car/truck/bus, traffic sign, person, rider/bicycle/motorbike and background.
Fig. 11: (a) RGB. (b) HHA. (c) Depth. (d) Depth Filled.
In our previous work (Valada et al, 2017), we directly used the colorized depth image as input to our network. We converted the stereo disparity map to a 3-channel colorized depth image by normalizing it and applying the standard jet color map. Figure 11(a) and Figure 11(c) show an example image and the corresponding colorized depth map from the dataset. However, as seen in the figure, the depth maps have a considerable amount of noise and missing depth values due to occlusion, which are undesirable especially when utilizing depth maps as an input modality for pixel-wise segmentation. Therefore, in this work, we employ a recently proposed state-of-the-art fast depth completion technique (Ku et al, 2018) to fill any holes that may be present. The resulting filled depth map is shown in Figure 11(d). The depth completion algorithm can easily be incorporated into our pipeline as a preprocessing step, as it incurs only a small runtime overhead while running on the CPU and can be further parallelized using a GPU implementation. Additionally, Gupta et al (2014) proposed an alternate representation of the depth map known as the HHA encoding to enable DCNNs to learn more effectively. The authors demonstrate that the HHA representation encodes properties of geocentric pose that emphasize complementary discontinuities in the image which are extremely hard for the network to learn, especially from limited training data. This representation also yields a 3-channel image consisting of: horizontal disparity, height above ground, and the angle between the pixel's local surface normal and the inferred gravity direction. The resulting channels are then linearly scaled and mapped to the 0 to 255 range. However, it is still unclear if this representation enables the network to learn features complementary to those learned from visual RGB images, as different works show contradicting results (Hazirbas et al, 2016; Gupta et al, 2014; Eitel et al, 2015).
In this paper, we perform in-depth experiments with both the jet colorized and the HHA encoded depth map on a larger and more challenging dataset than previous works to investigate the utility of these encodings.
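For illustration, the following minimal sketch shows how a depth map can be normalized and mapped to a 3-channel image using a simple piecewise-linear approximation of the jet colormap; the exact colormap implementation used in our pipeline may differ:

```python
def jet_color(v):
    """Map a normalized value v in [0, 1] to an (R, G, B) triple in
    0..255 using a piecewise-linear approximation of the jet colormap
    (dark blue at 0, green near 0.5, dark red at 1)."""
    r = min(max(1.5 - abs(4.0 * v - 3.0), 0.0), 1.0)
    g = min(max(1.5 - abs(4.0 * v - 2.0), 0.0), 1.0)
    b = min(max(1.5 - abs(4.0 * v - 1.0), 0.0), 1.0)
    return tuple(int(round(255 * c)) for c in (r, g, b))

def colorize_depth(depth, d_min, d_max):
    """Normalize a 2-D depth map (list of rows) to [0, 1] and colorize
    each pixel. Invalid depths are assumed to have been filled
    beforehand, e.g. by a depth-completion preprocessing step."""
    span = max(d_max - d_min, 1e-6)
    return [[jet_color((d - d_min) / span) for d in row] for row in depth]
```

The colorized map can then be fed to the network in place of the raw single-channel depth image.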
Fig. 12: (a) RGB. (b) Depth. (c) HHA.
Synthia: The Synthia dataset (Ros et al, 2016) is a large-scale urban outdoor dataset that contains photo-realistic images and depth data rendered from a virtual city built using the Unity engine. It consists of several annotated label sets. In this work, we use the Synthia-Rand-Cityscapes set and the video sequences. This dataset is of particular interest for benchmarking the fusion approaches as it contains diverse traffic situations under different weather conditions. Synthia-Rand-Cityscapes consists of 9000 images and the sequences contain 8000 images with groundtruth labels for 12 classes. The object label categories are the same as the aforementioned Cityscapes label set.
Fig. 13: (a) RGB. (b) Depth. (c) HHA.
SUN RGB-D: The SUN RGB-D dataset (Song et al, 2015) is one of the most challenging indoor scene understanding benchmarks to date. It contains 10,335 RGB-D images that were captured with four different types of RGB-D cameras (Kinect V1, Kinect V2, Xtion and RealSense) with different resolutions and fields of view. This benchmark also combines several other datasets including 1449 images from the NYU Depth v2 (Silberman et al, 2012), 554 images from the Berkeley B3DO (Janoch et al, 2013) and 3389 images from the SUN3D (Xiao et al, 2013). We use the original train-val split consisting of 5285 images for training and 5050 images for testing. We use the refined in-painted depth images from the dataset that were processed using a multi-view fusion technique. However, some refined depth images still have missing depth values at distances larger than a few meters. Therefore, as mentioned in previous works (Hazirbas et al, 2016), we exclude the 587 training images that were captured using the RealSense RGB-D camera as they contain a significant amount of invalid depth measurements that are further intensified due to the in-painting process.
This dataset provides pixel-level semantic annotations for 37 categories, namely: wall, floor, cabinet, bed, chair, sofa, table, door, window, bookshelf, picture, counter, blinds, desk, shelves, curtain, dresser, pillow, mirror, floor mat, clothes, ceiling, books, fridge, tv, paper, towel, shower curtain, box, whiteboard, person, night stand, toilet, sink, lamp, bathtub and bag. However, 16 out of the 37 classes are rarely present in the images and a considerable fraction of the pixels is not assigned to any of the classes, making the dataset extremely unbalanced. Moreover, as each scene contains many different types of objects, they are often partially occluded and may appear completely different in the test images.
Fig. 14: (a) RGB. (b) Depth. (c) Depth Filled. (d) HHA. (e) Groundtruth.
ScanNet: The ScanNet RGB-D video dataset (Dai et al, 2017) is a recently introduced large-scale indoor scene understanding benchmark. It contains RGB-D images amounting to 1512 scans acquired in 707 distinct spaces. The data was collected using an iPad Air2 mounted with a depth camera similar to the Microsoft Kinect v1, with the iPad camera and the depth camera hardware synchronized. The semantic segmentation benchmark contains 16,506 labeled training images and 2537 testing images. From the example depth image shown in Figure 14(b), we can see that there are a number of missing depth values at the object boundaries and at large distances. Therefore, similar to the preprocessing that we perform on the Cityscapes dataset, we use a fast depth completion technique (Ku et al, 2018) to fill the holes. The corresponding filled depth image is shown in Figure 14(c). We also compute the HHA encoding for the depth maps and use them as an additional modality in our experiments.
The dataset provides pixel-level semantic annotations for 21 object categories, namely: wall, floor, chair, table, desk, bed, bookshelf, sofa, sink, bathtub, toilet, curtain, counter, door, window, shower curtain, refrigerator, picture, cabinet, other furniture and void. Similar to the SUN RGB-D dataset, many object classes are rarely present making the dataset unbalanced. Moreover, the annotations at the object boundaries are often irregular and parts of objects at large distances are unlabelled as shown in Figure 14(e). These factors make the task even more challenging on this dataset.
Fig. 15: (a) RGB. (b) NIR. (c) NDVI. (d) NRG. (e) EVI. (f) Depth.
Freiburg Forest: In our previous work (Valada et al, 2016b), we introduced the Freiburg Multispectral Segmentation benchmark, a first-of-its-kind dataset of unstructured forested environments. Unlike urban and indoor scenes, which are highly structured with rigid objects that have distinct geometric properties, objects in unstructured forested environments are extremely diverse and moreover, their appearance changes completely from month to month due to seasonal variations. The primary motivation for the introduction of this dataset is to enable robots to discern obstacles that can be driven over, such as tall grass and bushes, from obstacles that should be avoided, such as tall trees and boulders. Therefore, we proposed to exploit the presence of chlorophyll in these objects, which can be detected in the Near-InfraRed (NIR) wavelength. NIR images provide a high-fidelity description of the presence of vegetation in the scene and, as demonstrated in our previous work (Valada et al, 2017), they enhance border accuracy for segmentation.
The dataset was collected over an extended period of time using our Viona autonomous robot equipped with a Bumblebee2 camera to capture stereo images and a modified camera with the NIR-cut filter replaced with a Wratten 25A filter for capturing the NIR wavelength in the blue and green channels. The dataset contains over 15,000 sub-sampled images collected over multiple traversals on different days. In order to extract consistent spatial and global vegetation information, we computed vegetation indices such as the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI) using the approach presented by Huete et al (1999). NDVI is resistant to noise caused by changing sun angles, topography and shadows, but is susceptible to error due to variable atmospheric and canopy background conditions (Huete et al, 1999). EVI was proposed to compensate for these defects with improved sensitivity to high biomass regions and improved detection through decoupling of the canopy background signal and a reduction in atmospheric influences. Figure 15 shows an example image from the dataset and the corresponding modalities. The dataset contains hand-annotated segmentation groundtruth for six classes: sky, trail, grass, vegetation, obstacle and void. We use the original train and test splits provided in the initial release in order to compare with the previously benchmarked fusion approaches.
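Both indices can be computed per pixel from the NIR, red and blue reflectances. The sketch below uses the standard NDVI definition and the commonly used EVI coefficients (gain 2.5, aerosol resistance terms 6.0 and 7.5, canopy background adjustment 1.0); these coefficient values follow the usual parameterization rather than being taken from our implementation:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index:
    (NIR - Red) / (NIR + Red), with a small epsilon for stability."""
    return (nir - red) / (nir + red + 1e-12)

def evi(nir, red, blue, G=2.5, C1=6.0, C2=7.5, L=1.0):
    """Enhanced Vegetation Index: G * (NIR - Red) /
    (NIR + C1*Red - C2*Blue + L). The blue band and the C1/C2 terms
    reduce atmospheric influences; L adjusts for canopy background."""
    return G * (nir - red) / (nir + C1 * red - C2 * blue + L)
```

Applying these functions channel-wise to the registered NIR and RGB images yields the NDVI and EVI maps shown in Figure 15.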
5.3 AdapNet++ Benchmarking
In this subsection, we report results comparing the performance of our proposed AdapNet++ architecture for an input RGB image. We benchmark against several widely adopted state-of-the-art models including DeepLab v3 (Chen et al, 2017), ParseNet (Liu et al, 2015), FCN-8s (Long et al, 2015), SegNet (Badrinarayanan et al, 2015), FastNet (Oliveira et al, 2016), DeepLab v2 (Chen et al, 2016), DeconvNet (Noh et al, 2015) and AdapNet (Valada et al, 2017). For each of the datasets, we report the mIoU score as well as the per-class IoU score. Note that we report the performance of each model for the same input image resolution and using the same evaluation setting in order to have a fair comparison. We do not apply multiscale inputs or left-right flips during testing, as these techniques require each crop of each image to be evaluated several times, which significantly increases the computational complexity and runtime (note that we do not use crops for testing; we evaluate the full image in a single pass). Moreover, such techniques are not suitable for real-time applications. However, we show the potential gains that can be obtained in the evaluation metric by utilizing these techniques and a higher-resolution input image in the ablation study presented in Section 5.5.5.
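For reference, the per-class IoU and mIoU scores that we report can be computed from a confusion matrix accumulated over the test set, as in the following sketch:

```python
def iou_scores(conf):
    """Per-class IoU and mIoU from a confusion matrix, where
    conf[i][j] counts pixels of groundtruth class i predicted as
    class j. For each class c: IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                      # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp  # pixels wrongly assigned to c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious, sum(ious) / n
```

The confusion matrix is accumulated over all test images before the scores are computed, so infrequent classes contribute on equal footing with frequent ones.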
Table 1 shows the comparison on the Cityscapes dataset. AdapNet++ outperforms all the baselines in each individual object category as well as in the mIoU score, exceeding the best baseline by a clear margin. Analyzing the individual class IoU scores, we can see that AdapNet++ yields the highest improvement in object categories that contain thin structures: it gives a large improvement for Poles, a similar improvement for Fences and the highest improvement for Signs. Most architectures struggle to recover the structure of thin objects due to downsampling by pooling and striding in the network, which causes such information to be lost. However, these results show that AdapNet++ efficiently recovers the structure of such objects by learning multiscale features at several stages of the encoder using the proposed multiscale residual units and the eASPP. In driving scenarios, information on objects such as Pedestrians and Cyclists can also be lost when they appear at far-away distances. A large improvement can also be seen in categories such as Person. The improvement in larger object categories such as Cars and Vegetation can be attributed to the new decoder which improves the segmentation performance near object boundaries. This is more evident in the qualitative results presented in Section 5.10. Note that the colors shown below the object category names serve as a legend for the qualitative results.
We benchmark on the Synthia dataset largely due to its variety of seasons and adverse perceptual conditions, where the improvement due to multimodal fusion can be seen. However, even for the baseline comparison shown in Table 2, it can be seen that AdapNet++ outperforms all the baselines, both in the overall mIoU score as well as in the scores of the individual object categories. A similar improvement in the scores for thin structures can be observed, reinforcing the utility of our proposed multiscale feature learning configuration: the largest improvement was obtained for the Sign class, followed by the Pole class. In addition, a significant improvement can also be seen for the Cyclist class.
Compared to outdoor driving datasets, indoor benchmarks such as SUN RGB-D and ScanNet pose a different challenge. Indoor datasets contain a vast number of object categories in many different configurations, with images captured from many different viewpoints, compared to driving scenarios where the camera is always parallel to the ground with similar viewpoints from the perspective of the vehicle driving on the road. Moreover, indoor scenes are often extremely cluttered, which causes occlusions, in addition to the irregular frequency distribution of the object classes that makes the problem even harder. Due to these factors, SUN RGB-D is considered one of the hardest datasets to benchmark on. Despite this, as shown in Table 3, AdapNet++ outperforms all the baseline networks overall, including the highest performing DeepLab v3 baseline, which required more training iterations to reach its score. Unlike the performance on the Cityscapes and Synthia datasets, where our previously proposed AdapNet architecture yields the second highest performance, AdapNet is outperformed by DeepLab v3 on the SUN RGB-D dataset. AdapNet++, on the other hand, outperforms the baselines in most categories by a large margin, while it is outperformed in 13 of the 37 classes by a small margin. It can also be observed that the classes in which AdapNet++ is outperformed are the most infrequent classes. This can be alleviated by adding supplementary training images containing the low-frequency classes from other datasets or by employing class balancing techniques. However, in our initial experiments, employing techniques such as median frequency class balancing, inverse median frequency class balancing and normalized inverse frequency balancing all severely affected the performance of our model.
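To make the class balancing variants concrete, the sketch below implements median frequency balancing, one of the techniques we experimented with; the exact frequency bookkeeping in our experiments may differ:

```python
def median_frequency_weights(pixel_counts, image_counts):
    """Median frequency balancing: for each class c,
    freq(c) = pixel_counts[c] / image_counts[c], where image_counts[c]
    is the total number of pixels in images containing class c.
    The loss weight for class c is median(freqs) / freq(c), so rare
    classes receive weights above 1 and frequent ones below 1."""
    freqs = [p / max(t, 1) for p, t in zip(pixel_counts, image_counts)]
    s = sorted(freqs)
    n = len(s)
    med = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return [med / max(f, 1e-12) for f in freqs]
```

The resulting per-class weights would multiply the cross-entropy terms during training; as noted above, this hurt rather than helped on these benchmarks.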
We also show results on the recently introduced ScanNet dataset in Table 4, which is currently the largest labeled indoor RGB-D dataset. AdapNet++ outperforms the state-of-the-art overall by a clear margin. The large improvement can be attributed to the proposed eASPP which efficiently captures long-range context. Context aggregation plays an important role in such cluttered indoor datasets, as different parts of an object are occluded from different viewpoints and across scenes. As objects such as the legs of a chair have thin structures, multiscale learning contributes to recovering such structures. We see a similar trend in the performance as on the SUN RGB-D dataset, where our network significantly outperforms the baselines in most of the object categories (11 of the 20 classes), while yielding a comparable performance for the other categories. The largest improvement is obtained for the Counter class, followed by the Curtain class, which appears in many different variations in the dataset. An interesting observation is that DeconvNet, the most heavily parameterized network, has the lowest performance on both the SUN RGB-D and ScanNet datasets, while AdapNet++, which has almost 1/9th of the parameters, outperforms it by more than twice the margin. However, this is only observed on the indoor datasets, while on the outdoor datasets DeconvNet performs comparably to the other networks. This is primarily due to the fact that indoor datasets contain more small object classes and the predictions of DeconvNet do not retain them.
Finally, we also benchmark on the Freiburg Forest dataset, as it contains several modalities and is the largest dataset to provide labeled training data for unstructured forested environments. We show the results in Table 5, where our proposed AdapNet++ outperforms the state-of-the-art. Note that this dataset contains large objects such as trees and does not contain thin structures or objects at multiple scales. Therefore, the improvement produced by AdapNet++ is mostly due to the proposed decoder, which yields an improved resolution of the segmentation along the object boundaries. The actual utility of this dataset is seen in the qualitative multimodal fusion results, where the fusion helps to improve the segmentation in the presence of disturbances such as glare on the optics and snow. Nevertheless, we see the highest improvement in the Obstacle class, which is the hardest to segment in this dataset as it groups many different types of objects into one category and has comparatively fewer examples in the dataset.
Moreover, we also compare the number of parameters and the inference time with the baseline networks in Table 5. Our proposed AdapNet++ architecture performs inference on an NVIDIA TITAN X substantially faster than the top performing architectures in all the benchmarks. Most of them consume more than twice the amount of time and number of parameters, making them unsuitable for real-world resource-constrained applications. Our critical design choices enable AdapNet++ to consume only marginally more resources than our previously proposed AdapNet, while exceeding its performance on each of the benchmarks by a large margin. This shows that AdapNet++ achieves the right performance vs. compactness trade-off, which enables it to be employed not only in resource-critical applications, but also in applications that demand efficiency and fast inference times.
5.4 AdapNet++ Compression
Table 6 columns: Technique | mIoU | Param. | FLOPS | Reduction %.
We present empirical evaluations of our proposed pruning strategy, which is invariant to shortcut connections, in Table 6. We experiment with pruning entire convolutional filters, which results in the removal of the corresponding feature map and the related kernels in the following layer. Most existing approaches only prune the first and the second convolution layer of each residual block, or in addition, equally prune the third convolution layer similar to the shortcut connection. However, this equal pruning strategy always leads to a significant drop in the accuracy of the model that is not recoverable (Li et al, 2017). Therefore, recent approaches have resorted to omitting pruning of these connections. Contrarily, our proposed technique is invariant to the presence of identity or projection shortcut connections, thereby making the pruning more effective and flexible. We employ a greedy pruning approach, but rather than pruning layer by layer and fine-tuning the model after each step, we prune entire residual blocks at once and then perform the fine-tuning. As our network has a total of 75 convolutional and deconvolutional layers, pruning and fine-tuning each layer individually would be extremely cumbersome. Nevertheless, we expect a higher performance from employing a fully greedy approach.
We compare our strategy with a baseline approach (Li et al, 2017) that uses the L1-norm of the convolutional filters to compute their importance, as well as with the approach that we build upon, which uses the Taylor expansion criterion (Molchanov et al, 2017) for the ranking as described in Section 3.5. We denote the approach of Molchanov et al (2017) as Oracle in our results. In the first stage, we start by pruning only the Res5 block of our model, as it contains the largest number of filters and therefore a substantial amount of parameters can be reduced without any loss in accuracy. As shown in Table 6, our approach enables a reduction in parameters and FLOPs with a slight increase in the mIoU metric. Similar to our approach, the original Oracle approach does not cause a drop in the mIoU metric, but achieves a lower reduction in parameters. The baseline approach achieves a still smaller reduction in the parameters and simultaneously causes a drop in the mIoU score.
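As a simplified sketch of the ranking, the Taylor criterion scores each feature map by the absolute value of the spatially averaged product of its activations and the gradient of the loss with respect to them; maps with the lowest scores are pruned first. The layer-wise normalization of scores that Molchanov et al (2017) also apply is omitted here for brevity:

```python
def taylor_rank(activations, gradients):
    """Rank feature maps by the Taylor expansion criterion:
    score(map) = | mean over spatial positions of (activation * dLoss/dActivation) |.
    activations and gradients are parallel lists, one flattened spatial
    vector per feature map. Returns map indices sorted so that the
    first entries are the best pruning candidates (lowest scores)."""
    scores = []
    for act_map, grad_map in zip(activations, gradients):
        prod = [a * g for a, g in zip(act_map, grad_map)]
        scores.append(abs(sum(prod) / len(prod)))
    return sorted(range(len(scores)), key=lambda i: scores[i])
```

In the block-wise scheme described above, this ranking would be computed once per residual block, the lowest-ranked filters removed, and the model fine-tuned before moving to the next block.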
Our aim for pruning in the first stage was to compress the model without causing a drop in segmentation performance, while in the following stages, we aggressively prune the model to achieve the best parameter-to-performance ratio. Results from this experiment are shown as the percentage reduction in parameters in comparison to the change in mIoU in Figure 16. In the second stage, we prune the convolutional feature maps of the Res2, Res3, Res4 and Res5 layers. Using our proposed method, we achieve a substantial reduction in parameters with a minor drop in the mIoU score, whereas the Oracle approach yields both a lower reduction in parameters and a larger drop in performance. A similar trend can be seen for the other pruning stages, where our proposed approach yields a higher reduction in parameters and FLOPs with a minor reduction in the mIoU score. This shows that pruning convolutional feature maps with regularity leads to a better compression ratio than selectively pruning layers at different stages of the network. In the third stage, we prune the deconvolutional feature maps, while in the fourth and fifth stages we further prune all the layers of the network by varying the threshold for the rankings. In the final stage, we obtain a large reduction in parameters and FLOPs with only a minor drop in the mIoU score. Considering the compression that can be achieved, this minor drop in the mIoU score is acceptable to enable efficient deployment in resource-constrained applications.
5.5 AdapNet++ Ablation Studies
In order to evaluate the various components of our AdapNet++ architecture, we performed several experiments in different settings. In this section, we study the improvement obtained due to the proposed encoder with multiscale residual units, present a detailed analysis of the proposed eASPP, compare different base encoder network topologies, and examine the improvement that can be obtained by using higher-resolution input images and multiscale testing. For each of these components, we also study the effect of different parameter configurations. All the ablation studies presented in this section were performed on models trained on the Cityscapes dataset.
5.5.1 Detailed study on the AdapNet++ Architecture
We first study the major contributions made to the encoder as well as the decoder in our proposed AdapNet++ architecture. Table 7 shows the results from this experiment and the subsequent improvement due to each of the configurations. The simple base model M1, consisting of the standard ResNet-50 encoder and a single deconvolution layer for upsampling, serves as the reference. The model M2, which incorporates our multiscale residual units, achieves an improvement without any increase in memory consumption, whereas the multigrid approach from DeepLab v3 (Chen et al, 2017) in the same configuration achieves only a smaller improvement in the mIoU score. This demonstrates the benefit of employing our multiscale residual units for efficiently learning multiscale features throughout the network. In the M3 model, we study the effect of incorporating skip connections for refinement. Skip connections, initially introduced in the FCN architecture, are still widely used for improving the resolution of the segmentation by incorporating low- or mid-level features from the encoder into the decoder while upsampling. The ResNet-50 architecture contains the most discriminative features in the middle of the network. In our M3 model, we first upsample the encoder output by a factor of two, then fuse the features from the Res3d block of the encoder for refinement, and subsequently use another deconvolution layer to upsample back to the input resolution. This model achieves a further improvement.
In the M4 model, we replace the standard residual units with full pre-activation residual units, which yields a further improvement. As mentioned in the work by He et al (2016), the results corroborate that pre-activation residual units yield a lower error than standard residual units due to the ease of training and improved generalization capability. Aggregating multiscale context using ASPP has become standard practice in most classification and segmentation networks. In the M5 model, we add the ASPP module to the end of the encoder segment. This model demonstrates an improved mIoU due to the ability of the ASPP to capture long-range context. In the subsequent M6 model, we study whether adding another skip refinement connection from the encoder yields a better performance. This was challenging, as most combinations along with the Res3d skip connection did not demonstrate any improvement. However, adding a skip connection from Res2c showed a slight improvement.
In all the models up to this stage, we fused the mid-level encoder features into the decoder using element-wise addition. In order to make the decoder stronger, we experiment with improving the learned decoder representations using additional convolutions after concatenation of the mid-level features. Specifically, the M7 model has three upsampling stages; the first two stages each consist of a deconvolution layer that upsamples by a factor of two, followed by concatenation of the mid-level features and two subsequent convolutions that learn highly discriminative fused features. This model shows an improvement which is primarily due to better segmentation along the object boundaries, as demonstrated in the qualitative results in Section 5.10. Our M7 model contains a total of 75 convolutional and deconvolutional layers, making the optimization challenging. In order to accelerate the training and to further improve the segmentation along object boundaries, we propose a multiresolution supervision scheme in which we add a weighted auxiliary loss to each of the first two upsampling stages. This model, denoted as M8, achieves an improved mIoU. In comparison to the aforementioned scheme, we also experimented with adding a weighted auxiliary loss at the end of the encoder of the M7 model; however, it did not improve the performance, although it accelerated the training. Finally, we also experimented with initializing the layers with the He initialization (He et al, 2015b) scheme (also known as MSRA), which further boosts the mIoU.
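The multiresolution supervision scheme can be summarized as a weighted sum of the full-resolution loss and the two auxiliary losses attached to the intermediate upsampling stages; the auxiliary weights below are illustrative placeholders, not the values used in our training:

```python
def multiresolution_loss(losses, aux_weights=(0.6, 0.5)):
    """Combine the main segmentation loss (first element, computed at
    full resolution) with weighted auxiliary losses from the two
    intermediate decoder stages: L = L_main + sum_i w_i * L_aux_i.
    The auxiliary weights are illustrative assumptions."""
    main, *aux = losses
    return main + sum(w * l for w, l in zip(aux_weights, aux))
```

During training, each auxiliary loss is computed against a downsampled groundtruth map at the resolution of its decoder stage; at test time only the full-resolution output is used.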
5.5.2 Detailed study on the eASPP
In this subsection, we quantitatively and qualitatively evaluate the performance of our proposed eASPP configuration and the new decoder topology. We perform all the experiments in this section using the best performing M9 model described in Section 5.5.1. In the first configuration of the M91 model, we employ a single