RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation
RGB-D semantic segmentation methods conventionally use two independent encoders to extract features from the RGB and depth data. However, they lack an effective fusion mechanism to bridge the encoders and fully exploit the complementary information of the two modalities. This paper proposes a novel bottom-up interactive fusion structure to model the interdependencies between the encoders. The structure introduces an interaction stream to interconnect the encoders. The interaction stream not only progressively aggregates modality-specific features from the encoders but also computes complementary features for them. To instantiate this structure, the paper proposes a residual fusion block (RFB) to formulate the interdependencies of the encoders. The RFB consists of two residual units and one fusion unit with a gate mechanism. It learns complementary features for the modality-specific encoders and extracts modality-specific features as well as cross-modal features. Based on the RFB, the paper presents a deep multimodal network for RGB-D semantic segmentation called RFBNet. Experiments on two datasets demonstrate the effectiveness of modeling the interdependencies and show that RFBNet achieves state-of-the-art performance.
Semantic scene understanding is one of the fundamental tasks for robotics applications, such as precision agriculture, autonomous driving, semantic mapping and modeling[3, 4], and localization[5, 6]. In recent years, this field has made huge progress thanks to semantic segmentation based on convolutional neural networks (CNNs)[7, 8, 9]. As depth images provide information complementary to the RGB images, a growing body of research exploits deep multimodal networks to fuse the two modalities [10, 11, 12]. This paper investigates the fusion structures of multimodal networks for RGB-D semantic segmentation.
Nowadays, RGB-D data can be easily obtained by active sensors, e.g., the Microsoft Kinect, or passive sensors, e.g., stereo cameras. The RGB data contain rich appearance information and textural details. Much work has been done on semantic segmentation with fully convolutional encoder-decoder networks using RGB-only data[13, 8, 9, 14, 15, 16]. The depth data provide useful geometric cues, which may reduce the uncertainty in segmenting objects with ambiguous appearance. It is therefore crucial to develop effective models to fuse the two complementary modalities.
Many works have shown improvement in semantic segmentation by fusing depth data with RGB data[10, 17, 18, 11, 19, 12]. Early fusion approaches (see Fig. 1(a)) simply feed the concatenated RGB and depth channels into a conventional unimodal network. Such methods may not fully exploit the complementary nature of the modalities. Many works therefore turn to the two-stream fusion architecture, which processes each modality with a separate but identical encoder and fuses the modality-specific features in a single decoder[17, 12, 18, 19, 11]. The late fusion approaches[17, 18] (see Fig. 1(b)) combine the modality-specific features at the end of the two independent encoders with a combination method, e.g., concatenation or element-wise summation. Instead of fusing at early or late stages, hierarchical fusion approaches fuse the features at multiple levels. These approaches usually fuse multi-level features from one modality into the other in the bottom-up path (see Fig. 1(c)) or fuse multi-level features in the top-down path[12, 19] (see Fig. 1(d)). Although these approaches have achieved encouraging results, they do not fully exploit the interdependencies of the modality-specific encoders. It is essential for the encoders to interact and inform each other to reduce the ambiguity in segmentation. How to construct an effective fusion mechanism for bidirectional interaction remains an open problem.
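To make the difference between early and late fusion concrete, the following is a minimal NumPy sketch. The `toy_encoder` function (a random 1x1 convolution with a ReLU), the image size, and the channel counts are hypothetical stand-ins chosen for illustration; they are not the encoders used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3))    # H x W x 3 RGB image (toy size, assumed)
depth = rng.random((4, 4, 1))  # H x W x 1 depth map (toy size, assumed)

def toy_encoder(x, out_channels, seed):
    """Hypothetical stand-in for a CNN encoder: random 1x1 conv + ReLU."""
    w = np.random.default_rng(seed).standard_normal((x.shape[-1], out_channels))
    return np.maximum(x @ w, 0.0)

# Early fusion: concatenate the raw channels, then run ONE unimodal encoder.
early = toy_encoder(np.concatenate([rgb, depth], axis=-1), 8, seed=1)

# Late fusion: run TWO independent encoders, then combine their outputs.
f_rgb = toy_encoder(rgb, 8, seed=2)
f_depth = toy_encoder(depth, 8, seed=3)
late_concat = np.concatenate([f_rgb, f_depth], axis=-1)  # channel count doubles
late_sum = f_rgb + f_depth                               # channel count unchanged

print(early.shape, late_concat.shape, late_sum.shape)
```

In both cases the two encoders never exchange information before the combination step, which is the limitation the interactive fusion structure is meant to address.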
In this paper, we propose a bottom-up interactive fusion structure to bridge the modality-specific streams with an interaction stream. The structure should address two aspects. First, it should progressively aggregate the information from the modality-specific streams into the interaction stream and extract the cross-modal features. Second, it should compute complementary features and feed them to the modality-specific streams without destroying the encoders’ ability to extract modality-specific features. The proposed structure is illustrated in Fig. 1(e).
To instantiate this structure, we propose a residual fusion block (RFB) to formulate the interdependencies of the two encoders. The RFB consists of two modality-specific residual units (RUs) and one gated fusion unit (GFU). The GFU adaptively aggregates features from the RUs and generates complementary features for the RUs. The RFB formulates the complementary feature learning as residual learning, and it can extract modality-specific and cross-modal features. With the RFBs, the modality-specific encoders can interact with each other. Based on the RFB, we build a deep multimodal network for RGB-D semantic segmentation, called RFBNet. We conduct experiments on two datasets to verify the effectiveness of modeling the interdependencies for RGB-D semantic segmentation.
The main contributions of this paper are summarized as follows:
We propose the bottom-up interactive fusion structure, which bridges the modality-specific streams, i.e., RGB stream and depth stream, with an interaction stream.
We propose the residual fusion block (RFB) to formulate the interdependencies of the modality-specific streams and build the RFBNet for RGB-D semantic segmentation.
II Related Works
II-A Semantic Segmentation
Early semantic segmentation methods largely rely on handcrafted features and use shallow classifiers such as Random Forest and Boosting to predict the class probabilities; they then usually use probabilistic models known as conditional random fields (CRFs) to refine the results[23, 24].
In recent years, great progress has been made in this field along with the advance of deep neural networks, owing to the emergence of large-scale datasets and high-performance graphics processing units (GPUs). FCNs successfully improved the accuracy of image semantic segmentation by adapting classification networks into fully convolutional networks. Subsequent works[8, 25, 9] follow this line, including ERFNet and AdapNet++. To increase the receptive field and reduce memory and computational consumption, these works commonly adopt the encoder-decoder architecture, in which the encoder gradually reduces the size of the feature maps and captures high-level semantic information, and the decoder recovers the spatial information.
II-B RGB-D Data Fusion for Semantic Segmentation
Multimodal data fusion has long received attention as a way to exploit the complementary nature of data from different sources[26, 27]. Early shallow learning methods mainly consider feature-level (early) fusion and decision-level (late) fusion, which fuse low-level features and prediction-level features, respectively. Deep multimodal networks further involve hierarchical or intermediate fusion[27, 12], owing to the ability of CNNs to learn hierarchical features of the data.
Early fusion can intuitively reuse conventional unimodal semantic segmentation networks[10, 7]. For example, Couprie et al. adapted the multi-scale RGB network of Farabet et al. for RGB-D semantic segmentation by concatenating the input RGB and depth channels. Late fusion aims to aggregate the high-level features of the two modalities using independent networks[29, 18, 30, 31]. Gupta et al. concatenated the features extracted by two CNN models from RGB and depth data and classified them with an SVM classifier; they also employed a new representation of depth data, termed HHA, that encodes horizontal disparity, height above ground, and angle with gravity for each pixel. Cheng et al. devised a gated fusion layer to automatically learn the contributions of high-level modality-specific features for an effective combination. Hierarchical fusion makes it possible to combine multimodal features at different layers[11, 32, 19, 12]. FuseNet fused multi-level depth features into the RGB encoder in the bottom-up path. RedNet extended FuseNet by additionally fusing multi-level features in the top-down path. RDFNet proposed a multi-modal feature block and a multi-level feature refinement block to fuse multi-level features in the top-down path. SSMA proposed a self-supervised model adaptation (SSMA) fusion mechanism to combine modality-specific streams and also fused the multi-level features in the top-down path. It achieved state-of-the-art performance on various indoor and outdoor datasets.
The proposed RFBNet also belongs to hierarchical fusion. A significant difference from existing methods is that our approach explicitly formulates the interdependencies of the modality-specific networks instead of merely aggregating multi-level features.
II-C Gate Mechanism
Gates are commonly used to regulate the flow of information[33, 18, 25, 34]. Hochreiter et al. used four gates to control how information propagates into and out of the memory cell, and Cheng et al. used weighted gates to combine features from different modalities automatically. Highway networks used a learned gate mechanism to enable the optimization of very deep networks. We also use four gates in the gated fusion unit to regulate the interaction of useful information between the modality-specific streams.
III Bottom-up Interactive Fusion with Residual Fusion Blocks
The architecture of the proposed RFBNet is illustrated in Fig. 2. Besides the RGB stream and the depth stream, the architecture introduces an additional interaction stream. The three streams are merged by a combination method such as concatenation, summation, or an SSMA block. Finally, a decoder is appended to compute the predictions. The RFBs are employed at the high layers to manage the interaction of the three streams. Specifically, the RFBs are employed at the layers after three downsampling operations, where the spatial size of the feature map is one eighth of that of the input data. Moreover, the spatial size of the interaction stream is the same as that of the depth stream, and its channel dimension is half of that of the depth stream.
III-B Shrinking the Depth Image
Architectures with two encoders commonly suffer from large computational and memory consumption. RGB data contain rich appearance and textural details that depict the objects, while depth data contain relatively sparse geometric information that depicts the shape of objects. We ease the consumption by shrinking the spatial size of the depth stream. We shrink the depth data by a factor of 2 before feeding it into the network, which removes roughly three-quarters of the computation and memory consumption of the depth stream. The depth stream and the interaction stream are upsampled to the same spatial size as the RGB stream before combining. This strategy makes the proposed network slightly faster than the baseline.
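The "three-quarters" figure follows from simple arithmetic: the cost of a convolutional stream scales with the number of spatial positions, and shrinking both height and width by 2 leaves one quarter of them. A short worked check, under the simplifying assumption that cost is exactly proportional to height times width (the function name is hypothetical):

```python
def relative_stream_cost(shrink_factor):
    """Cost of a convolutional stream relative to full resolution.

    Assumes cost is proportional to the number of spatial positions (H * W),
    so shrinking each side by `shrink_factor` divides it by shrink_factor ** 2.
    """
    return 1.0 / (shrink_factor ** 2)

cost = relative_stream_cost(2)  # fraction of the original cost that remains
saving = 1.0 - cost             # fraction of the original cost that is removed
print(cost, saving)             # prints: 0.25 0.75
```

The estimate is rough because fixed overheads and the extra upsampling before combination are ignored.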
III-C Residual Fusion Block
The RFB is the basic module to achieve the idea of bottom-up interactive fusion. The RFB consists of two modality-specific residual units (RUs) and one gated fusion unit (GFU). The RU, as the basic unit of ResNet, is widely used in unimodal networks to learn unimodal features. The RFB learns the modality-specific features based on the RU. We design the GFU to aggregate features from the modality-specific RUs and compute complementary features for the RUs. The framework of the RFB is illustrated in Fig. 3.
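Given the input RGB, depth, and cross-modal features $x_i^{rgb}$, $x_i^{d}$, $x_i^{c}$ and the output features $x_{i+1}^{rgb}$, $x_{i+1}^{d}$, $x_{i+1}^{c}$, the RFB is formulated as:
$$c_i^{rgb},\; c_i^{d},\; x_{i+1}^{c} = \mathcal{G}(x_i^{rgb}, x_i^{d}, x_i^{c})$$
$$x_{i+1}^{rgb} = x_i^{rgb} + \mathcal{F}(x_i^{rgb} + c_i^{rgb};\; W_i^{rgb})$$
$$x_{i+1}^{d} = x_i^{d} + \mathcal{F}(x_i^{d} + c_i^{d};\; W_i^{d})$$
where $c_i^{rgb}$ and $c_i^{d}$ are the complementary features computed by the GFU, denoted as $\mathcal{G}$; $\mathcal{F}$ denotes the residual functions of the modality-specific RUs; and $W_i^{rgb}$ and $W_i^{d}$ are the parameters of the RUs.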
The RFB formulates the complementary feature learning as residual learning. The GFU acts as a residual function with respect to an identity mapping, as illustrated in Fig. 4(b). Note that we add the complementary features to the inputs of the modality-specific residual functions (denoted as Point “R”) instead of to the trunks of the unimodal streams (denoted as Point “T”). The different adding points imply different identity mappings, as illustrated in Fig. 4. The complementary feature directly impacts the modality-specific stream when it is added at Point “T” (see Fig. 4(a)), while it directly impacts only the residual function of the modality-specific RU when it is added at Point “R” (see Fig. 4(b)).
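The two adding points can be contrasted in a few lines of NumPy. This is a sketch under assumptions: `residual_fn` is a toy two-layer 1x1 convolution standing in for the real residual function, and the Point “T” branch reflects one plausible reading of Fig. 4(a), namely that the complementary feature is added to the trunk before the unit, so it also passes through the identity path.

```python
import numpy as np

def residual_fn(x, w1, w2):
    """Toy residual function: two 1x1 convolutions with a ReLU in between."""
    return np.maximum(x @ w1, 0.0) @ w2

def residual_unit(x, c, w1, w2, point="R"):
    """One modality-specific RU that receives complementary features c.

    point="R": c enters only the input of the residual function.
    point="T": c is added to the trunk, so it also enters the identity path.
    """
    if point == "R":
        return x + residual_fn(x + c, w1, w2)
    if point == "T":
        trunk = x + c
        return trunk + residual_fn(trunk, w1, w2)
    raise ValueError("point must be 'R' or 'T'")

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 2, 4))   # unimodal features, H x W x C (toy size)
c = rng.standard_normal((2, 2, 4))   # complementary features from the GFU
w1 = rng.standard_normal((4, 4))
w2 = rng.standard_normal((4, 4))

out_r = residual_unit(x, c, w1, w2, point="R")
out_t = residual_unit(x, c, w1, w2, point="T")
# The two variants differ exactly by c on the identity path.
print(np.allclose(out_t - out_r, c))
```

With Point “R” the identity path carries the unimodal features through unchanged, which is consistent with the goal of not destroying the encoder's ability to extract modality-specific features.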
Redundancy, noise, and complementary information exist among different modalities. The GFU explores the underlying complementary relationships in a soft-attention manner via the gate mechanism. The GFU contains two input gates and two output gates. The input gates control how the unimodal features flow into the interaction stream, and the output gates regulate the complementary features. The gates are learned by the same network structure, shown at the bottom of Fig. 3, which consists of two convolutional layers with a ReLU layer in between and a Sigmoid function that squashes the values into the (0, 1) range. Note that the two input gates share the first convolutional layer to reduce the computational cost.
The useful information regulated by the input gates is concatenated and passed through a convolutional layer before being added to the incoming cross-modal features (which are zero for the first RFB). Then we adopt a light-weight depthwise separable convolution (denoted as “Sconv” in Fig. 3) to process the cross-modal features in the interaction stream. Finally, the GFU computes the complementary features for the modality-specific RUs, regulated by the two output gates.
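The data flow of the GFU described above can be sketched in NumPy as follows. Several details are assumptions made only for illustration and are not confirmed by the text: every convolution except the depthwise one is a 1x1 convolution, the input gates are computed from the concatenated RGB and depth features, the output gates are computed from the updated cross-modal features, the complementary features are 1x1 projections of the cross-modal features, and all three streams share one spatial size. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """1x1 convolution on an (H, W, C_in) map with weights (C_in, C_out)."""
    return x @ w

def depthwise3x3(x, k):
    """Zero-padded 3x3 depthwise convolution; x is (H, W, C), k is (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w, :] * k[i, j, :]
    return out

def gate(x, w_first, w_second):
    """Gate network: conv -> ReLU -> conv -> sigmoid, values in (0, 1)."""
    return sigmoid(conv1x1(np.maximum(conv1x1(x, w_first), 0.0), w_second))

def gfu(x_rgb, x_d, x_c, p):
    both = np.concatenate([x_rgb, x_d], axis=-1)
    # Input gates: the first conv weights are shared, the second are per modality.
    g_in_rgb = gate(both, p["in_shared"], p["in_rgb"])
    g_in_d = gate(both, p["in_shared"], p["in_d"])
    gated = np.concatenate([g_in_rgb * x_rgb, g_in_d * x_d], axis=-1)
    # Aggregate into the interaction stream (x_c is all zeros for the first RFB).
    x_c = x_c + conv1x1(gated, p["merge"])
    # "Sconv": depthwise 3x3 convolution followed by a pointwise 1x1 convolution.
    x_c = conv1x1(depthwise3x3(x_c, p["dw"]), p["pw"])
    # Output gates regulate the complementary features sent back to the RUs.
    c_rgb = gate(x_c, p["out_rgb_a"], p["out_rgb_b"]) * conv1x1(x_c, p["to_rgb"])
    c_d = gate(x_c, p["out_d_a"], p["out_d_b"]) * conv1x1(x_c, p["to_d"])
    return c_rgb, c_d, x_c

def init_params(c, seed=0):
    """Random weights; the interaction stream uses half the channels (c // 2)."""
    rng = np.random.default_rng(seed)
    cc = c // 2
    shapes = {
        "in_shared": (2 * c, c), "in_rgb": (c, c), "in_d": (c, c),
        "merge": (2 * c, cc), "dw": (3, 3, cc), "pw": (cc, cc),
        "out_rgb_a": (cc, cc), "out_rgb_b": (cc, c),
        "out_d_a": (cc, cc), "out_d_b": (cc, c),
        "to_rgb": (cc, c), "to_d": (cc, c),
    }
    return {name: 0.1 * rng.standard_normal(s) for name, s in shapes.items()}

rng = np.random.default_rng(1)
x_rgb = rng.standard_normal((4, 4, 8))
x_d = rng.standard_normal((4, 4, 8))
x_c0 = np.zeros((4, 4, 4))           # first RFB: no cross-modal features yet
c_rgb, c_d, x_c1 = gfu(x_rgb, x_d, x_c0, init_params(8))
print(c_rgb.shape, c_d.shape, x_c1.shape)
```

The returned `c_rgb` and `c_d` would then be added to the inputs of the two residual functions, and `x_c1` would be passed on as the cross-modal input of the next RFB.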
III-D Incorporating with Top-Down Multi-Level Fusion
The proposed bottom-up interactive fusion structure models the interdependencies of the modality-specific encoders. It is orthogonal to the top-down multi-level fusion structure, which fuses the encoder features in the top-down path at the decoder stage. The two structures can be incorporated into a unified network. We illustrate the combined structure in Fig. 5 to give an intuitive understanding. In the experiments, we employ the proposed bottom-up interactive fusion in the SSMA, which adopts the top-down multi-level fusion.
IV Experimental Results
Datasets. We choose an indoor dataset, i.e., ScanNet, and an outdoor dataset, i.e., Cityscapes, to evaluate the performance. Each of them provides publicly available training and validation sets as well as an online evaluation server for benchmarking on the test set.
ScanNet is a large-scale indoor scene understanding dataset. It contains samples for training, for validation, and for testing. The RGB images are captured at a resolution of and depth at . Cityscapes is a large-scale outdoor RGB-D dataset for urban scene understanding. It contains totally finely annotated samples with a resolution of , of which for training, for validation, and for testing.
Backbones. We adopt two unimodal backbones, i.e., AdapNet++ and ERFNet. AdapNet++ is based on the ResNet-50 model with full pre-activation bottleneck residual units, while ERFNet is a real-time semantic segmentation model based on non-bottleneck factorized residual units. We use the encoder model of the ERFNet (denoted as ERFNetenc) for fast testing and ablation study. A simple bilinear interpolation upsampling layer acts as the decoder in ERFNetenc.
Criteria. We quantify the performance according to the PASCAL VOC intersection-over-union metric (IoU).
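For reference, the PASCAL VOC IoU of a class is TP / (TP + FP + FN), computed over all evaluated pixels, and the mean over classes gives the mIoU. A minimal NumPy sketch follows; the function name, the toy label map, and the `ignore_index` value of 255 are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def iou_per_class(pred, label, num_classes, ignore_index=255):
    """Per-class IoU = TP / (TP + FP + FN); NaN for classes that never occur."""
    valid = label != ignore_index
    pred, label = pred[valid], label[valid]
    # confusion[i, j] counts pixels of true class i predicted as class j.
    confusion = np.bincount(
        label * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    denom = tp + fp + fn
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

label = np.array([[0, 0, 1], [1, 2, 255]])  # 255 marks an unlabeled pixel
pred = np.array([[0, 1, 1], [1, 2, 0]])
ious = iou_per_class(pred, label, num_classes=3)
miou = np.nanmean(ious)
print(ious, miou)  # per-class IoU is 1/2, 2/3 and 1
```

In this toy case, class 0 has one true positive and one false negative, class 1 has two true positives and one false positive, and class 2 is predicted perfectly.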
IV-B Implementation Details
We first built up two-stream fusion networks with the unimodal backbones. We adopt SSMA, a state-of-the-art method, as the base framework which uses SSMA blocks to combine the two modality-specific streams. The SSMA is the same as SSMA model proposed in . For the SSMA, we use the ERFNetenc to extract two modality-specific features, combine the features with the SSMA block, and use a convolutional layer and a bilinear interpolation upsampling layer by a factor of 8 to get the final predictions. To employ our approach, we just replace the corresponding paired RUs with the RFBs for RFBNet and RFBNet. When employing the RFB for the bottleneck residual units, we feed the output of the first layer of the bottleneck RU to the GFU, and add the complementary features to the input of the layer of the RU.
The models were implemented in TensorFlow 1.13.1 and trained on a single 1080Ti GPU. Adam is used for optimization, and the “poly” learning rate scheduling policy is adopted to adjust the learning rate. The weight decay is set to for the AdapNet++ based models and for the ERFNetenc based models. The images are resized to a smaller scale for training so that the models can be trained on our 1080Ti GPU. For ScanNet, the images are resized to ; for Cityscapes, the images are resized to . We resize the predictions to the full resolution when benchmarking. When training on Cityscapes, we employ a crop of .
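The “poly” policy is commonly defined as the base learning rate multiplied by (1 - step / max_steps) raised to a power. The sketch below uses that common definition; the power of 0.9, the base learning rate, and the step counts are assumed values for illustration, since the text does not state them.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial ("poly") decay from base_lr at step 0 to 0 at max_steps.

    power=0.9 is a widely used default, assumed here rather than taken
    from the paper.
    """
    return base_lr * (1.0 - step / max_steps) ** power

# Hypothetical schedule: base learning rate 1e-3 over 1000 steps.
schedule = [poly_lr(1e-3, s, 1000) for s in (0, 250, 500, 750, 1000)]
print(schedule)
```

The rate starts at the base value, decreases monotonically, and reaches zero at the final step.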
We first train the unimodal models, then use the trained weights to initialize the encoders of the multimodal models. For the AdapNet++ based models, we follow the training procedure of. We set a mini-batch of 7 for unimodal models, 6 for multimodal models, and 12 for finetuning. For the ERFNetenc based models, we use an initial learning rate of . We train K iterations with a mini-batch of 12 for the unimodal models, and 25K iterations with a mini-batch of for the multimodal models.
The raw depth data are usually imperfect, containing considerable noise and missing depth values. We therefore perform depth completion on the depth images. Moreover, we employ the three-channel HHA encoding for the depth data. We employ extensive data augmentation for training, including flipping, scaling, rotation, cropping, color jittering, and Gaussian noise.
IV-C Results and Analysis
Benchmarking. We report the performance benchmarking results on ScanNet and Cityscapes in Table I and Table II. Note that the test images of the two datasets are not publicly released, and they are used by the evaluation server for benchmarking. From the tables, we can see that the multimodal models have better performance than the unimodal models as expected.
In Table I, we compare against the top-performing models on the ScanNet test set. The results are taken from the leaderboard. The proposed RFBNet outperforms the other methods, e.g., SSMA, FuseNet, and 3DMV. Note that the RFBNet and SSMA adopt the same backbone, i.e., AdapNet++, while the SSMA is trained with a batch size of 16 on multiple GPUs with synchronized batch normalization (see https://github.com/DeepSceneSeg/AdapNet-pp/issues/11). Still, the RFBNet achieves an improvement over the SSMA.
In Table II, we compare RFBNet with base models of different backbones on the Cityscapes test set, which shows that the proposed RFBNet consistently outperforms SSMA across backbones. Note that the accuracy of AdapNet++ and SSMA reported in the table is lower than the official accuracy. This is reasonable because the official models are trained with crops of full-resolution images and a larger batch size on multiple GPUs.
We found that the multimodal models improve less on Cityscapes than on ScanNet. We infer that the reason is that the depth values of the outdoor data are much noisier and less accurate than those of the indoor data.
Ablation Study. We perform the ablation study on the ScanNet validation set with the ERFNetenc backbone. In Table III, we compare the performance of unimodal and multimodal models and show how the resolution of the depth data impacts the performance. From Table III, we can see that the multimodal models show a large improvement over the unimodal models by more than . When shrinking the depth input, the SSMA shows a performance decrease by . Our analysis is that shrinking the depth relatively increases the receptive field of the depth encoder, which is beneficial for capturing broader context information, but it may lose some geometric details and reduce the spatial representation accuracy. Thus, the performance of SSMA, which adopts independent encoders, decreases. As the RFBNet bridges the encoders with an interaction stream, both unimodal encoders benefit from the broader context information. Despite losing some geometric details, RFBNet still shows a performance improvement by .
Table III columns: Method, Input data, Shrink depth, mIoU.
We show how the gates and the adding points of the complementary features in the RFB impact the performance in Table IV. “G” means employing the gate mechanism to regulate the features. “T” and “R” mean that the complementary features are added to the trunk and to the input of the residual function of the RU, respectively. When both “T” and “R” are disabled, the interaction stream only aggregates features from the unimodal encoders but does not compute complementary features for the encoders. From the table, we can see that the performance improves by when employing gates. Moreover, enabling “R” further improves by and outperforms enabling “T” by , which indicates that it is beneficial for the encoders to interact and inform each other.
This paper addresses RGB-D semantic segmentation by explicitly modeling the interdependencies of the RGB stream and the depth stream. We proposed a bottom-up interactive fusion structure to bridge the modality-specific encoders with an interaction stream. Specifically, we proposed the residual fusion block to explicitly formulate the interdependencies of the two encoders. Experiments demonstrate that the proposed approach achieves considerable improvements by effectively modeling the interdependencies.
-  A. Milioto, P. Lottes, and C. Stachniss, “Real-time semantic segmentation of crop and weed for precision agriculture robots leveraging background knowledge in cnns,” in IEEE Int. Conf. Robot. Autom. IEEE, 2018, pp. 2229–2235.
-  D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, “Fast scene understanding for autonomous driving,” arXiv preprint arXiv:1708.02550, 2017.
-  A. Hermans, G. Floros, and B. Leibe, “Dense 3d semantic mapping of indoor scenes from rgb-d images,” in IEEE Int. Conf. Robot. Autom. IEEE, 2014, pp. 2631–2638.
-  Y. Chen, M. Yang, C. Wang, and B. Wang, “3d semantic modelling with label correction for extensive outdoor scene,” in IEEE Intell. Veh. Symp. IEEE, 2019, pp. 1262–1267.
-  E. Stenborg, C. Toft, and L. Hammarstrand, “Long-term visual localization using semantically segmented images,” in IEEE Int. Conf. Robot. Autom. IEEE, 2018, pp. 6484–6490.
-  L. Deng, M. Yang, B. Hu, T. Li, H. Li, and C. Wang, “Semantic segmentation-based lane-level localization using around view monitoring system,” IEEE Sensors J., 2019.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, April 2017.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Eur. Conf. Comput. Vis., 2018, pp. 801–818.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. Comput. Vis. Pattern Recog., July 2017, pp. 6230–6239.
-  C. Couprie, C. Farabet, L. Najman, and Y. Lecun, “Indoor semantic segmentation using depth information,” in Int. Conf. Learn. Represent., 2013.
-  C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Asian Conf. Comput. Vis. Springer, 2016, pp. 213–228.
-  A. Valada, R. Mohan, and W. Burgard, “Self-supervised model adaptation for multimodal semantic segmentation,” Int. J. Comput. Vis., Jul. 2019.
-  E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Trans. Intell. Transp. Syst., 2017.
-  L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, “Restricted deformable convolution based road scene semantic segmentation using surround view cameras,” IEEE Trans. Intell. Transp. Syst., 2019.
-  L. Deng, M. Yang, Y. Qian, C. Wang, and B. Wang, “CNN based semantic segmentation for urban traffic scenes using fisheye camera,” in IEEE Intell. Veh. Symp., 2017, pp. 231–236.
-  K. Yang, X. Hu, L. M. Bergasa, E. Romera, X. Huang, D. Sun, and K. Wang, “Can we pass beyond the field of view? panoramic annular semantic segmentation for real-world surrounding perception,” in IEEE Intell. Veh. Symp. IEEE, 2019, pp. 446–453.
-  A. Valada, G. L. Oliveira, T. Brox, and W. Burgard, “Deep multispectral semantic scene understanding of forested environments using multimodal fusion,” in Int. Symp. Exp. Robot. Springer, 2016, pp. 465–477.
-  Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 3029–3037.
-  S.-J. Park, K.-S. Hong, and S. Lee, “Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation,” in IEEE Int. Conf. Comput. Vis., 2017, pp. 4980–4989.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Int. Conf. Mach. Learn., 2011, pp. 689–696.
-  A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 5828–5839.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 3213–3223.
-  A. C. Müller and S. Behnke, “Learning depth-sensitive conditional random fields for semantic segmentation of rgb-d images,” in IEEE Int. Conf. Robot. Autom. IEEE, 2014, pp. 6232–6237.
-  J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” Int. J. Comput. Vis., vol. 81, no. 1, pp. 2–23, 2009.
-  X. Li, H. Zhao, L. Han, Y. Tong, and K. Yang, “Gff: Gated fully fusion for semantic segmentation,” arXiv preprint arXiv:1904.01803, 2019.
-  P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, “Multimodal fusion for multimedia analysis: a survey,” Multimed. Syst., vol. 16, no. 6, pp. 345–379, 2010.
-  D. Ramachandram and G. W. Taylor, “Deep multimodal learning: A survey on recent advances and trends,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 96–108, 2017.
-  C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2012.
-  S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in Eur. Conf. Comput. Vis. Springer, 2014, pp. 345–360.
-  J. Wang, Z. Wang, D. Tao, S. See, and G. Wang, “Learning common and specific features for rgb-d semantic segmentation with deconvolutional networks,” in Eur. Conf. Comput. Vis. Springer, 2016, pp. 664–679.
-  X. Song, L. Herranz, and S. Jiang, “Depth cnns for rgb-d scene recognition: learning from scratch better than transferring from rgb-cnns,” in AAAI Conf. Artif. Intell., 2017.
-  J. Jiang, L. Zheng, F. Luo, and Z. Zhang, “Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation,” arXiv preprint arXiv:1806.01054, 2018.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Adv. Neural Inf. Process. Syst., 2015, pp. 2377–2385.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, 2015.
-  J. Ku, A. Harakeh, and S. L. Waslander, “In defense of classical image processing: Fast depth completion on the cpu,” in Conf. Comput. Robot Vis. IEEE, 2018, pp. 16–22.
-  A. Dai and M. Nießner, “3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation,” in Eur. Conf. Comput. Vis., 2018, pp. 452–468.