On Single Source Robustness in Deep Fusion Models
Abstract
Algorithms that fuse multiple input sources benefit from both complementary and shared information. Shared information may provide robustness to faulty or noisy inputs, which is indispensable for safety-critical applications like self-driving cars. We investigate learning fusion algorithms that are robust against noise added to a single source. We first demonstrate that robustness against single source noise is not guaranteed in a linear fusion model. Motivated by this discovery, two possible approaches are proposed to increase robustness: a carefully designed loss with corresponding training algorithms for deep fusion models, and a simple convolutional fusion layer that has a structural advantage in dealing with noise. Experimental results show that both training algorithms and our fusion layer make a deep fusion-based 3D object detector robust against noise applied to a single source, while preserving the original performance on clean data.
Taewan Kim The University of Texas at Austin Austin, TX 78712 twankim@utexas.edu Joydeep Ghosh The University of Texas at Austin Austin, TX 78712 jghosh@utexas.edu
Preprint. Under review.
1 Introduction
Deep learning models have accomplished superior performance in several machine learning problems (LeCun et al., 2015) including object recognition (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017), object detection (Ren et al., 2015; He et al., 2017; Dai et al., 2016; Redmon et al., 2016; Liu et al., 2016; Redmon and Farhadi, 2017) and speech recognition (Hinton et al., 2012; Graves et al., 2013; Sainath et al., 2013; Chorowski et al., 2015; Chan et al., 2016; Chiu et al., 2018), which use either visual or audio sources. One natural way of improving a model’s performance is to make use of multiple input sources relevant to a given task so that enough information can be extracted to build strong features. Therefore, deep fusion models have recently attracted considerable attention for autonomous driving (Kim and Ghosh, 2016; Chen et al., 2017; Qi et al., 2018; Ku et al., 2018), medical imaging (Kiros et al., 2014; Wu et al., 2013; Simonovsky et al., 2016; Liu et al., 2015), and audiovisual speech recognition (Huang and Kingsbury, 2013; Mroueh et al., 2015; Sui et al., 2015; Chung et al., 2017).
Two benefits are expected when fusion-based learning models are selected for a given problem. First, given adequate data, more information from multiple sources can enrich the model’s feature space to achieve higher prediction performance, especially when different input sources provide complementary information to the model. This expectation coincides with a simple information-theoretic fact: if we have multiple input sources $X_1, \ldots, X_n$ and a target variable $Y$, mutual information obeys $I(X_1, \ldots, X_n; Y) \ge I(X_i; Y)$ for every $i$.
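The inequality follows in one step from the chain rule for mutual information. A short derivation for two sources, using illustrative symbols that are assumed here ($X_1$, $X_2$ for the sources and $Y$ for the target):

```latex
\begin{align*}
I(X_1, X_2; Y) &= I(X_1; Y) + I(X_2; Y \mid X_1) \\
               &\ge I(X_1; Y), \qquad \text{since } I(X_2; Y \mid X_1) \ge 0 .
\end{align*}
```

Equality holds exactly when $X_2$ is conditionally independent of $Y$ given $X_1$, that is, when the second source carries no complementary information about the target.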
The second expected advantage is increased robustness against single source faults, which is the primary concern of our work. An underlying intuition comes from the fact that different sources may have shared information so one sensor can partially compensate for others. This type of robustness is critical in real-world fusion models, because each source may be exposed to different types of corruption but not at the same time. For example, LIDARs used in autonomous vehicles work fine at night whereas RGB cameras do not. Also, each source used in the model may have its own sensing device, and hence not necessarily be corrupted by some physical attack simultaneously with others. It would be ideal if the structure of machine learning based fusion models and shared information could compensate for the corruption and automatically guarantee robustness without additional steps.
This paper shows that a fusion model needs a supplementary strategy and a specialized structure to avoid vulnerability to noise or corruption on a single source. Our contributions are as follows:

We show that a fusion model trained with a standard robust learning approach is not guaranteed to provide robustness against noise on a single source. Inspired by this analysis, a novel loss is proposed to achieve the desired robustness (Section 3).

Two efficient training algorithms for minimizing our loss in deep fusion models are devised to ensure robustness without impacting performance on clean data (Section 4.1).

We introduce a simple but effective fusion layer which naturally reduces error by applying ensembling to latent convolutional features (Section 4.2).
We apply our loss and the fusion layer to a complex deep fusion-based 3D object detector used in autonomous driving for further investigation in practice. Note that our findings can be easily generalized to other applications exhibiting intermittent defects in a subset of input sources.
2 Related Works
Deep fusion models have been actively studied in object detection for autonomous vehicles. There exist two major streams classified according to their algorithmic structures: two-stage detectors with the R-CNN (Region-based Convolutional Neural Networks) technique (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; Dai et al., 2016; He et al., 2017), and single-stage detectors for faster inference speed (Redmon et al., 2016; Redmon and Farhadi, 2017; Liu et al., 2016).
Earlier deep fusion models extended Fast R-CNN (Girshick, 2015) to provide better quality of region proposals from multiple sources (Kim and Ghosh, 2016; Braun et al., 2016). With a high-resolution LIDAR, the point cloud was used as a major source of the region proposal stage before the fusion step (Du et al., 2017), whereas F-PointNet (Qi et al., 2018) used it for validating 2D proposals from RGB images and predicting 3D shape and location within the visual frustum. MV3D (Chen et al., 2017) extended the idea of the region proposal network (RPN) (Ren et al., 2015) by generating proposals from the RGB image and from LIDAR’s front view and BEV (bird’s eye view) maps. Recent works tried to remove region proposal stages for faster inference and directly fused LIDAR’s front view depth image (Kim et al., 2018b) or BEV image (Wang et al., 2018) with RGB images. ContFuse (Liang et al., 2018) utilizes both RGB and LIDAR’s BEV images with a new continuous fusion scheme, which is further improved in MMF (Liang et al., 2019) by handling multiple tasks at once. Our experimental results are based on AVOD (Ku et al., 2018), a recent open-sourced 3D object detector that generates region proposals from an RPN using RGB and LIDAR’s BEV images.
Compared to the active efforts in accomplishing higher performance on clean data, very few works have focused on robust learning methods in multi-source settings to the best of our knowledge. Adaptive fusion methods using gating networks weight the importance of each source automatically (Mees et al., 2016; Valada et al., 2017), but these works lack in-depth studies of the robustness against single source faults. A recent work proposed a gated fusion at the feature level and applied data augmentation techniques with randomly chosen corruption methods (Kim et al., 2018a). In contrast, our training algorithms are surrogate minimization schemes for the proposed loss function, which is grounded in analyses of the underlying weakness of fusion methods. Also, the fusion layer proposed in this paper focuses more on how to mix convolutional feature maps channel-wise with simple trainable procedures. For extensive literature reviews, please refer to the recent survey papers about deep multimodal learning methods in general (Ramachandram and Taylor, 2017) and for autonomous driving (Feng et al., 2019).
3 Single Source Robustness of Fusion Models
3.1 Regression on linear fusion data
To show the vulnerability of naive fusion models, we introduce a simple data model and a fusion algorithm. Suppose is a linear function consisting of three different inherent (latent) components . There are two input sources, and . Here ’s are unknown functions.
(1) 
Our simple data model simulates a target variable relevant to two different sources, where each source has its own special information and and a shared one . For example, if two sources are obtained from an RGB camera and a LIDAR sensor, one can imagine that any features related to objectness are captured in whereas colors and depth information may be located in and , respectively. Our objective is to build a regression model by effectively incorporating information from the sources to predict the target variable .
Now, consider a fairly simple setting and , where can be defined accordingly to satisfy (1). A straightforward fusion approach is to stack the sources, i.e. , and learn a linear model. Then, it is easy to show that there exists a feasible error-free model for noise-free data:
(2) 
where . Parameter vectors responsible for the shared information are denoted by and .
Footnote 1: In practice, has to be solved for and with a sufficient number of data samples. Then a standard least squares solution using a pseudoinverse gives . This is equivalent to the solution robust against random noise added to all the sources at once, which is vulnerable to single source faults (Section 3.2).
Suppose the true parameters of data satisfy and . Assume that the obtained solution’s parameters for are unbalanced, i.e. and with some weight vector having a small norm. Then adding noise to the source will give significant corruption to the prediction while is relatively robust because for any noise affecting . This simple example illustrates that additional training strategies or components are indispensable for obtaining a robust fusion model that works even if one of the sources is disturbed. The next section introduces a novel loss for balanced robustness against a fault in a single source.
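The effect described in this section can be checked numerically. The following is a minimal sketch under assumed notation that is not taken from the text: the target is `y = beta1.z1 + beta2.z2 + beta3.z3`, the two sources are `x1 = [z1; z3]` and `x2 = [z2; z3]`, and all dimensions, seeds, and names are hypothetical. It fits a minimum-norm least-squares model (which is what a pseudoinverse solution returns) and compares it with a second error-free model that places all shared weight on the first source.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, n = 3, 3, 2, 200  # hypothetical latent dimensions and sample count

# Assumed true parameters of the linear data model.
beta1, beta2, beta3 = rng.normal(size=d1), rng.normal(size=d2), rng.normal(size=d3)

# Latent components; z3 is shared, so it appears in both sources.
z1, z2, z3 = (rng.normal(size=(n, d)) for d in (d1, d2, d3))
y = z1 @ beta1 + z2 @ beta2 + z3 @ beta3
x = np.hstack([z1, z3, z2, z3])  # stacked sources [x1; x2]

# Minimum-norm least-squares fit on clean data.
h = np.linalg.lstsq(x, y, rcond=None)[0]
g1_ls, g2_ls = h[d1:d1 + d3], h[d1 + d3 + d2:]

# A second error-free model that puts all shared weight on source 1.
h_unbalanced = np.concatenate([beta1, beta3, beta2, np.zeros(d3)])

def single_source_mse(model, source, sigma=1.0, trials=2000):
    """Monte Carlo squared error when Gaussian noise hits one source only."""
    lo, hi = (0, d1 + d3) if source == 1 else (d1 + d3, x.shape[1])
    noise = sigma * rng.normal(size=(trials, hi - lo))
    # The clean prediction is exact, so the error is the noise passed through the weights.
    return np.mean((noise @ model[lo:hi]) ** 2)

print("least-squares split of beta3:", g1_ls, g2_ls)
for name, model in [("least squares", h), ("unbalanced", h_unbalanced)]:
    print(name, single_source_mse(model, 1), single_source_mse(model, 2))
```

Both models reproduce the clean targets exactly, yet their sensitivity to noise on each source differs, because the error under single source noise depends on how the shared weight is divided between the two sources.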
3.2 Robust learning for single source noise
Fusion methods are not guaranteed to provide robustness against faults in a single source without additional supervision. We also demonstrate later in this section that naive regularization or robust learning methods are not sufficient for this robustness. Therefore, a supplementary constraint or strategy needs to be considered during training to guide the learned parameters toward the desired robustness.
One essential requirement of fusion models is showing balanced performance regardless of corruption added to any source. If the model is significantly vulnerable to corruption in one source, this model becomes untrustworthy and we need to balance the degradation levels of different input sources’ faults. For example, suppose there is a model robust against noise in RGB channels but shows huge degradation in performance for any fault of LIDAR. Then the overall system should be considered untrustworthy, because there exist certain corruptions or environments that can consistently fool the model. Our loss, MaxSSN (Maximum Single Source Noise), is introduced to handle this issue, and further analyses are provided under the linear fusion data model explained in Section 3.1. This loss makes the model focus more on corruption of a single source (SSN) than on noise added to all the sources at once (ASN).
Definition 1.
For multiple sources and a target variable , denote a predefined loss function by . If each source is perturbed with some additive noise for , MaxSSN loss for a model is defined as follows:
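Written out in illustrative notation (assumed here, since the symbols are not fixed by the surrounding text: $n_s$ sources $x_1, \dots, x_{n_s}$, target $y$, predefined loss $\mathcal{L}$, additive noise $\epsilon_i$ on source $i$, and model $f$), a maximum over single-source perturbations takes the form:

```latex
\[
\mathcal{L}_{\mathrm{MaxSSN}}(f, \epsilon)
  \;=\; \max_{i \in \{1, \dots, n_s\}}
  \mathcal{L}\bigl(y,\; f(x_1, \dots, x_{i-1},\; x_i + \epsilon_i,\; x_{i+1}, \dots, x_{n_s})\bigr)
\]
```

Each term perturbs exactly one source and leaves the others clean, so the loss is driven by the source whose corruption hurts the model most.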
Another key principle in our robust training is to retain the model’s performance on clean data. Although techniques like data augmentation help improve a model’s generalization error in general, learning a model robust against certain types of perturbation, including adversarial attacks, may harm the model’s accuracy on non-corrupted data (Tsipras et al., 2019). Deterioration in the model’s ability on normal data is an unwanted side effect, and hence our approach aims to avoid this.
Random noise
To investigate the importance of our MaxSSN loss, we revisit the linear fusion data model with the optimal direct fusion model of the regression problem introduced in Section 3.1. Suppose the objective is to find a model with robustness against single source noise, while preserving error-free performance, i.e., unchanged loss under clean data. For the noise model, consider where and , which satisfy , , and for . Note that noises added to the shared information, and , are not identical, which resembles direct perturbation to the input sources in practice. For example, noise directly affecting a camera lens does not need to perturb other sources.
Optimal fusion model for MaxSSN
The robust linear fusion model is found by minimizing over parameters and . As shown in the previous section, any satisfying and should achieve zero error. Therefore, the overall optimization problem can be reduced to the following one:
(3) 
If we use a standard expected squared loss and solve the optimization problem, the following solution with corresponding parameters can be obtained, and there exist three cases based on the relative sizes of ’s.
(4) 
The three cases reflect the relative influence of each weight vector for . For instance, if has larger importance compared to the rest in generating , the optimal way of balancing the effect of noise over is to remove all the influence of in by setting . When neither of nor dominates the importance, i.e. , the optimal solution tries to make .
Comparison with the standard robust fusion model
Minimizing loss with noise added to a model’s input is a standard process in robust learning. The same strategy can be applied to learn fusion models by considering all sources as a single combined source and then adding noise to all the sources at once. However, this simple strategy cannot achieve low error in terms of the single source robustness. The optimal solution to , a least squares solution, is achieved when . The corresponding MaxSSN loss can be evaluated as . A non-trivial gap exists between and , which is directly proportional to the data model’s inherent characteristics:
(5) 
If either or has more influence on the target value than the other components, the single source robustness of the model trained with the MaxSSN loss is better than that of the fusion model trained for general noise robustness, by an amount proportional to the influence of the shared feature . Otherwise, the gap’s lower bound is proportional to the difference in complementary information, .
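The three cases and the resulting gap can be reproduced with a few lines of arithmetic. The sketch below is derived under stated assumptions rather than copied from the equations of this section: the model is error-free with source-specific weights fixed to `beta1` and `beta2`, the shared weights satisfy `g1 + g2 = beta3`, and zero-mean noise with covariance `sigma^2 * I` hits one source at a time, so the expected squared error for source `i` is `sigma^2 * (||beta_i||^2 + ||g_i||^2)`. All names are hypothetical.

```python
import numpy as np

def maxssn_linear(beta1, beta2, g1, g2, sigma=1.0):
    """Expected squared-error MaxSSN loss of an error-free linear fusion model."""
    loss1 = sigma**2 * (beta1 @ beta1 + g1 @ g1)  # noise on source 1 only
    loss2 = sigma**2 * (beta2 @ beta2 + g2 @ g2)  # noise on source 2 only
    return max(loss1, loss2)

def optimal_split(beta1, beta2, beta3):
    """Return (g1, g2) with g1 + g2 = beta3 that minimizes the MaxSSN loss.

    Writing g1 = t * beta3, the two per-source losses cross at
    t = (1 - (a - b) / c) / 2; clipping t to [0, 1] yields the two
    "one source dominates" cases and the balanced case in between.
    """
    a, b, c = beta1 @ beta1, beta2 @ beta2, beta3 @ beta3
    if c == 0.0:
        return np.zeros_like(beta3), np.zeros_like(beta3)
    t = np.clip(0.5 * (1.0 - (a - b) / c), 0.0, 1.0)
    return t * beta3, (1.0 - t) * beta3

# Hypothetical example in which the first source-specific component dominates.
beta1, beta2, beta3 = np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])
g1, g2 = optimal_split(beta1, beta2, beta3)
robust = maxssn_linear(beta1, beta2, g1, g2)
even = maxssn_linear(beta1, beta2, beta3 / 2, beta3 / 2)  # even split of the shared weight
print(g1, g2, robust, even, even - robust)
```

In this example the dominant source receives no shared weight at all, and the even split (which a minimum-norm least-squares fit produces in this setting) is worse by a quarter of the squared norm of `beta3`.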
Remark 1.
In linear systems such as the one studied above, having redundant information in the feature space is similar to multicollinearity in statistics. In this case, feature selection methods usually try to remove such redundancy. However, this redundant or shared information helps prevent degradation of the fusion model when a subset of the input sources is corrupted.
4 Robust Deep Fusion Models
In simple linear settings, our analyses illustrate that using the MaxSSN loss can effectively minimize the degradation of a fusion model’s performance. This suggests a training strategy for equipping complex deep fusion models with robustness against single source faults. A principal factor considered in designing a common framework for our algorithms is the preservation of the model’s performance on clean data while minimizing a loss that defends against corruption. Therefore, our training algorithms use data augmentation so that the model encounters both clean and corrupted data. The second way of achieving robustness is to take advantage of the fusion method’s structure. A simple but effective method of mixing convolutional features coming from different input sources is introduced later in this section.
4.1 Robust training algorithms for single source noise
Our common training framework alternately provides clean samples and corrupted samples per iteration to preserve the original performance of the model on uncontaminated data. On top of this strategy, one standard robust training scheme and two algorithms for minimizing the MaxSSN loss are introduced for handling robustness against noise in different sources.
Footnote 2: We also try fine-tuning only a subset of the model’s parameters, , to preserve essential parts for extracting features from normal data. However, training the whole network from the beginning shows better performance in practice. See Appendix B for a detailed comparison.
Standard robust training method
A standard robust training algorithm can be developed by considering all sources as a single combined source. Given noise generating functions (), the algorithm generates and adds corruption to all the sensors at once. Then the corresponding loss can be computed to update parameters using backpropagation. This algorithm is denoted by TrainASN and tested in experiments to investigate whether the procedure is also able to cover robustness against single source noise.
Minimization of MaxSSN loss
Minimization of the MaxSSN loss requires (number of input sources) forward propagations within one iteration. Each propagation needs a different set of corrupted samples generated by adding single source noise to the fixed clean mini-batch of data. There are two possible approaches to compute gradients properly from these multiple passes. First, we can run backpropagation times to save the gradients temporarily without updating any parameters, and then use the saved gradients corresponding to the maximum loss to update the parameters. However, this process requires not only forward and backward passes but also large memory usage proportional to for saving the gradients. Another reasonable approach is to run forward passes to find the maximum loss and compute gradients by going back to the corresponding set of corrupted samples. Algorithm 1 adopts this idea for its efficiency: forward passes and one backpropagation. A faster version of the algorithm, TrainSSNAlt, is also considered since multiple forward passes may take longer as the number of sources increases. This algorithm ignores the maximum loss and alternately augments corrupted data. By a slight abuse of notation, symbols used in our algorithms also represent the iteration steps with the size of mini-batches greater than one. Also, is shortened to in the algorithms.
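The two training schemes described above can be sketched on a toy problem. This is an illustrative reading of the text and not a transcription of Algorithm 1: it assumes a linear model with squared loss, Gaussian single source noise, and full-batch gradient steps in place of mini-batches. The function names, the `mode` flag, and all hyperparameters are hypothetical.

```python
import numpy as np

def mse(theta, xs, y):
    """Forward pass: mean squared error of a linear model on stacked sources."""
    return np.mean((np.hstack(xs) @ theta - y) ** 2)

def mse_grad(theta, xs, y):
    """Backward pass: gradient of the mean squared error with respect to theta."""
    x = np.hstack(xs)
    return 2.0 * x.T @ (x @ theta - y) / len(y)

def corrupt(xs, i, sigma, rng):
    """Add Gaussian noise to source i only, leaving the other sources clean."""
    return [x + sigma * rng.normal(size=x.shape) if j == i else x
            for j, x in enumerate(xs)]

def train(xs, y, steps=400, lr=0.05, sigma=1.0, mode="ssn", seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(sum(x.shape[1] for x in xs))
    n_s = len(xs)
    for t in range(steps):
        if t % 2 == 0:
            batch = xs  # clean iteration, to preserve clean-data performance
        elif mode == "ssn":
            # One forward pass per source; keep the corrupted batch with the largest loss.
            candidates = [corrupt(xs, i, sigma, rng) for i in range(n_s)]
            batch = max(candidates, key=lambda c: mse(theta, c, y))
        else:
            # "alt": cycle through the sources without comparing losses.
            batch = corrupt(xs, (t // 2) % n_s, sigma, rng)
        theta -= lr * mse_grad(theta, batch, y)  # a single backward pass per iteration
    return theta

# Hypothetical toy data: two sources that share the latent component z3.
data_rng = np.random.default_rng(1)
z1, z2, z3 = (data_rng.normal(size=(200, d)) for d in (3, 3, 2))
xs = [np.hstack([z1, z3]), np.hstack([z2, z3])]
y = z1.sum(axis=1) + z2.sum(axis=1) + z3.sum(axis=1)
theta_ssn = train(xs, y, mode="ssn")
theta_alt = train(xs, y, mode="alt")
print(mse(theta_ssn, xs, y), mse(theta_alt, xs, y))
```

The `"ssn"` branch costs one forward pass per source plus one backward pass on each corrupted iteration, while the `"alt"` branch needs only a single forward and backward pass, which is the trade-off the text describes.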
4.2 Feature fusion methods
Fusion of features extracted from multiple input sources can be done in various ways (Chen et al., 2017). One of the popular methods is to fuse via an element-wise mean operation (Ku et al., 2018), but this assumes that each feature must have the same shape, i.e., width, height, and number of channels for a 3D feature. An element-wise mean can also be viewed as averaging channels from different 3D features, and it has an underlying assumption that the channels of each feature should share the same information regardless of the input source origin. Therefore, the risk of becoming vulnerable to single source corruption may increase with this simple mean fusion method.
Our fusion method, latent ensemble layer (LEL), is devised for three objectives: (i) maintaining the known advantage of ensemble methods, namely error reduction (Tumer and Ghosh, 1996b, a), (ii) admitting source-specific features to survive even after the fusion procedure, and (iii) allowing each source to provide a different number of channels. The proposed layer learns parameters so that channels of the 3D features from the different sources can be selectively mixed. Sparse constraints are introduced to let the training procedure find good subsets of channels to be fused across the feature maps. For example, mixing the channel of the convolutional feature from an RGB image with the and channels of the LIDAR’s latent feature is possible in our LEL, whereas in an element-wise mean layer the latent channel from RGB is only mixed with the other sources’ channels.
We also apply an activation function to supplement a semi-adaptive behavior to the fusion procedure. Definition 2 explains the details of our LEL, and Figure 1 visualizes the overall process. In practice, this layer can be easily constructed by using convolutions with the ReLU activation and constraints. The output channel depth is set to in the experiments.
Definition 2 (Latent ensemble layer).
Suppose we have convolutional features from different input sources , which can be stacked as . The channel of the stacked feature is denoted by . Let be a dimensional weight vector to mix ’s in channelwise fashion. Then LEL outputs where each channel is computed as , with some activation function and sparse constraints for all .
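A layer of this kind can be sketched in a few lines. The version below is a minimal NumPy illustration and not the authors' TensorFlow implementation: it assumes a per-pixel (1x1) channel mixing followed by ReLU, takes the weight matrix as given instead of training it, and does not enforce the sparse constraints. The shapes and names are hypothetical.

```python
import numpy as np

def latent_ensemble_layer(features, weights):
    """Mix channels of stacked feature maps channel-wise, then apply ReLU.

    features: list of arrays shaped (H, W, C_i), one per source; C_i may differ.
    weights:  array shaped (sum of C_i, d_out); column j mixes all stacked
              channels into output channel j.
    """
    stacked = np.concatenate(features, axis=-1)          # (H, W, K)
    mixed = np.einsum("hwk,kj->hwj", stacked, weights)   # per-pixel channel mixing
    return np.maximum(mixed, 0.0)                        # ReLU activation

rng = np.random.default_rng(0)
rgb_feat = rng.random((4, 5, 3))    # hypothetical RGB feature map, 3 channels
lidar_feat = rng.random((4, 5, 3))  # hypothetical LIDAR feature map, 3 channels

# Element-wise mean is the special case in which output channel j averages
# channel j of each source and ignores every other channel.
mean_weights = np.vstack([np.eye(3), np.eye(3)]) / 2.0
fused_mean = latent_ensemble_layer([rgb_feat, lidar_feat], mean_weights)

# Sources with different channel counts are also accepted.
wide_lidar = rng.random((4, 5, 5))
free_weights = rng.normal(size=(8, 4))
fused_free = latent_ensemble_layer([rgb_feat, wide_lidar], free_weights)
print(fused_mean.shape, fused_free.shape)
```

Viewing the element-wise mean as one fixed choice of the weight matrix makes the contrast concrete: the mean ties each output channel to matching input channels, while a learned matrix can combine any subset of channels from any source.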
5 Experimental Results
We test our algorithms and the LEL fusion method on 3D and BEV object detection tasks using the car class of the KITTI dataset (Geiger et al., 2012). As our experiments include random generation of corruption, each task is evaluated 5 times to compare average scores (reported with 95% confidence intervals), and thus a validation set is used for ease of manipulating data and repetitive evaluation. We follow the split of Ku et al. (2018), with 3712 and 3769 frames for training and validation sets, respectively. Results are reported based on three difficulty levels defined by KITTI (easy, moderate, hard), and the standard object detection metric, Average Precision (AP), is used. A recent open-sourced 3D object detector, AVOD (Ku et al., 2018), with a feature pyramid network is selected as the baseline algorithm. Four different algorithms are compared: AVOD trained on (i) clean data, (ii) data augmented with ASN samples (TrainASN), (iii) SSN augmented data with direct MaxSSN loss minimization (TrainSSN), and (iv) SSN augmented data (TrainSSNAlt). The AVOD architecture is varied to use either element-wise mean fusion layers or our LELs. We follow the original training setups of AVOD, e.g., 120k iterations using an ADAM optimizer with an initial learning rate of 0.0001.
Footnote 3: Our methods are implemented with TensorFlow on top of the official AVOD code. The computing machine has an Intel Xeon E5-1660 v3 CPU with Nvidia Titan X Pascal GPUs. The source code is available at: https://github.com/twankim/avod_ssn
Corruption methods
Gaussian noise generated i.i.d. with is directly added to the pixel value of an image () and the coordinate value of a LIDAR’s point (). is set to experimentally with and . The second method, downsampling, selects only 16 out of 64 lasers of LIDAR data. To match this effect, 3 out of 4 horizontal lines of an RGB image are deleted. Effects of corruption on each input source are visualized in Figure 2, where the color of a 2D LIDAR image represents the distance from the sensor. Although our analyses in Section 3.2 assume the noise variances to be identical, it is non-trivial to set equal noise levels for different modalities in practice, e.g., RGB pixels vs. points in a 3D space. Nevertheless, an underlying objective of our MaxSSN loss, balancing the degradation rates of different input sources’ faults, does not depend on the choice of noise types or levels.
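The two corruption types can be sketched as simple array operations. Several details are assumptions made for illustration and are not stated in the text: which lasers and which image rows are kept (every fourth one here), that deleted rows are set to zero rather than removed, and that a per-point laser index is available. The function and argument names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(values, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise to pixel values or point coordinates."""
    return values + rng.normal(0.0, sigma, size=values.shape)

def downsample_lidar(points, laser_ids, keep_every=4):
    """Keep points from one of every `keep_every` lasers (16 of 64 when keep_every=4)."""
    return points[laser_ids % keep_every == 0]

def downsample_image_rows(image, keep_every=4):
    """Blank all but one of every `keep_every` horizontal lines of an image."""
    out = np.zeros_like(image)
    out[::keep_every] = image[::keep_every]
    return out

rng = np.random.default_rng(0)
points = rng.random((640, 3))                 # hypothetical point cloud, 10 points per laser
laser_ids = np.repeat(np.arange(64), 10)      # assumed laser index of each point
image = np.ones((8, 6, 3))                    # hypothetical RGB image

sparse_points = downsample_lidar(points, laser_ids)
sparse_image = downsample_image_rows(image)
noisy_image = add_gaussian_noise(image, sigma=0.1, rng=rng)
print(sparse_points.shape, int(sparse_image.any(axis=(1, 2)).sum()), noisy_image.shape)
```

For single source corruption, only one of the inputs would be passed through these functions at a time while the other is left clean.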
Evaluation metrics for single source robustness
To assess the robustness against single source noise, a new metric, minAP, is introduced. The AP score is evaluated on the dataset with a single corrupted input source; after going over all sources, minAP reports the lowest of these AP scores. Our second metric, maxDiffAP, computes the maximum absolute difference among the scores, which measures the balance of different input sources’ single source robustness; a low maxDiffAP value indicates well-balanced robustness.
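Both metrics reduce to simple statistics over the per-source AP scores. A minimal sketch follows; the function names and the example scores are hypothetical and are not results from the experiments.

```python
def min_ap(ap_per_source):
    """Lowest AP among evaluations that each corrupt a single source."""
    return min(ap_per_source)

def max_diff_ap(ap_per_source):
    """Largest absolute difference between any two single-source AP scores."""
    return max(ap_per_source) - min(ap_per_source)

# Hypothetical AP scores: one evaluation with a corrupted camera, one with corrupted LIDAR.
scores = [70.0, 55.0]
print(min_ap(scores), max_diff_ap(scores))
```

With two sources, maxDiffAP is simply the absolute difference of the two scores; with more sources, the largest pairwise difference equals the maximum score minus the minimum score.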
(Data) Train Algo.  3D AP Easy  3D AP Moderate  3D AP Hard  BEV AP Easy  BEV AP Moderate  BEV AP Hard 

Fusion method: Mean  
(Clean Data)  
AVOD (Ku et al., 2018)  76.41  72.74  66.86  89.33  86.49  79.44 
+TrainASN  75.96  66.68  65.97  88.63  79.45  78.79 
+TrainSSN  76.28  67.10  66.51  88.86  79.60  79.11 
+TrainSSNAlt  77.46  67.61  66.06  89.68  86.71  79.41 
(Gaussian SSN)  
AVOD (Ku et al., 2018)  
+TrainASN  
+TrainSSN  
+TrainSSNAlt  
(Gaussian SSN)  
AVOD (Ku et al., 2018)  
+TrainASN  
+TrainSSN  
+TrainSSNAlt  
Fusion method: Latent Ensemble Layer  
(Clean Data)  
AVOD (Ku et al., 2018)  77.79  67.69  66.31  88.90  85.64  78.86 
+TrainASN  75.00  64.75  58.28  88.30  78.60  77.23 
+TrainSSN  74.25  65.00  63.83  87.88  78.84  77.66 
+TrainSSNAlt  76.04  66.42  64.41  88.80  79.53  78.53 
(Gaussian SSN)  
AVOD (Ku et al., 2018)  
+TrainASN  
+TrainSSN  
+TrainSSNAlt 
(Data) Train Algo.  3D AP Easy  3D AP Moderate  3D AP Hard  BEV AP Easy  BEV AP Moderate  BEV AP Hard 

(Clean Data)  
AVOD (Ku et al., 2018)  77.79  67.69  66.31  88.90  85.64  78.86 
+TrainASN  71.74  61.78  60.26  87.29  77.08  75.89 
+TrainSSN  75.54  66.26  63.72  88.07  79.18  78.03 
+TrainSSNAlt  76.22  66.05  63.87  89.00  79.65  78.03 
(Downsample SSN)  
AVOD (Ku et al., 2018)  61.70  51.66  46.17  86.08  69.99  61.55 
+TrainASN  65.74  53.49  51.35  82.27  67.88  65.79 
+TrainSSN  73.33  57.85  54.91  86.61  76.07  68.59 
+TrainSSNAlt  64.77  53.34  48.29  85.27  69.87  67.77 
Results
When the fusion model uses the element-wise mean fusion (Table 1), the TrainSSN algorithm shows the best single source robustness against Gaussian SSN while preserving the original performance on clean data (with only a small decrease in the moderate BEV detection). Also, the imbalance between the two input sources’ performance is dramatically reduced compared to the model trained without robust learning and the model trained with the naive TrainASN method.
Encouragingly, the AVOD model constructed with our LEL method already achieves relatively high robustness without any robust learning strategies compared to the mean fusion layers. For all the tasks, minAP scores are dramatically increased, e.g., 61.97 vs. 47.41 minAP for the easy 3D detection task, and the maxDiffAP scores are decreased (maxDiffAP scores for AVOD with LEL are reported in Appendix B). The robustness is then further improved by minimizing our MaxSSN loss. As our LEL’s structure inherently handles corruption on a single source well, even the TrainASN algorithm can successfully guide the model to be equipped with the desired robustness. A corruption method with a different style, downsampling, is also tested with our LEL. Table 2 shows that the model achieves the best performance among the four algorithms when trained with our TrainSSN.
Remark 3.
A simple TrainSSNAlt achieves fairly robust models in both fusion methods against Gaussian noise, and two reasons may explain this phenomenon. First, all parameters are updated instead of fine-tuning only fusion-related parts. Therefore, unlike our analyses on the linear model, the latent representation can be transformed to meet the objective function. In fact, TrainSSNAlt performs poorly when we fine-tune the model with concatenation fusion layers, as shown in the supplement. Second, the loss function inside our is usually non-convex, so it may be enough to use an indirect approach for a small number of sources, .
6 Conclusion
We study two strategies to improve the robustness of fusion models against single source corruption. Motivated by analyses on linear fusion models, a loss function is introduced to balance performance degradation of deep fusion models caused by corruption in different sources. We also demonstrate the importance of a fusion method’s structure by proposing a simple ensemble layer achieving such robustness inherently. Our experimental results show that deep fusion models can effectively use complementary and shared information of different input sources by training with our loss and fusion layer to obtain both robustness and high accuracy. We hope our results motivate further work to improve the single source robustness of more complex fusion models with either a large number of input sources or adaptive networks. Another interesting direction is to investigate the single source robustness against adversarial attacks in deep fusion models, which can be compared with our analyses in the supplementary material.
References
 Braun et al. [2016] Markus Braun, Qing Rao, Yikang Wang, and Fabian Flohr. Pose-RCNN: Joint object detection and pose estimation using 3d object proposals. In IEEE 19th international conference on intelligent transportation systems (ITSC), pages 1546–1551, 2016.
 Chan et al. [2016] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4960–4964, 2016.
 Chen et al. [2017] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR), pages 1907–1915, 2017.
 Chiu et al. [2018] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4774–4778, 2018.
 Chorowski et al. [2015] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems (NeurIPS), pages 577–585, 2015.
 Chung et al. [2017] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In IEEE conference on computer vision and pattern recognition (CVPR), pages 3444–3453, 2017.
 Dai et al. [2016] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (NeurIPS), pages 379–387, 2016.
 Du et al. [2017] Xinxin Du, Marcelo H Ang, and Daniela Rus. Car detection for autonomous vehicle: Lidar and vision fusion approach through deep learning framework. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 749–754, 2017.
 Feng et al. [2019] Di Feng, Christian Haase-Schuetz, Lars Rosenbaum, Heinz Hertlein, Fabian Duffhauss, Claudius Glaeser, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. arXiv preprint arXiv:1902.07830, 2019.
 Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE conference on computer vision and pattern recognition (CVPR), pages 3354–3361, 2012.
 Girshick [2015] Ross Girshick. Fast R-CNN. In IEEE international conference on computer vision (ICCV), pages 1440–1448, 2015.
 Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
 Goodfellow et al. [2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International conference on learning representations (ICLR), 2015.
 Graves et al. [2013] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6645–6649, 2013.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
 He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE international conference on computer vision (ICCV), pages 2961–2969, 2017.
 Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE signal processing magazine, 29, 2012.
 Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE conference on computer vision and pattern recognition (CVPR), pages 4700–4708, 2017.
 Huang and Kingsbury [2013] Jing Huang and Brian Kingsbury. Audiovisual deep learning for noise robust speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 7596–7599, 2013.
 Kim et al. [2018a] Jaekyum Kim, Junho Koh, Yecheol Kim, Jaehyung Choi, Youngbae Hwang, and Jun Won Choi. Robust deep multimodal learning based on gated information fusion network. In Asian conference on computer vision (ACCV), 2018a.
 Kim and Ghosh [2016] Taewan Kim and Joydeep Ghosh. Robust detection of nonmotorized road users using deep learning on optical and lidar data. In IEEE 19th international conference on intelligent transportation systems (ITSC), pages 271–276, 2016.
 Kim et al. [2018b] Taewan Kim, Michael Motro, Patrícia Lavieri, Saharsh Samir Oza, Joydeep Ghosh, and Chandra Bhat. Pedestrian detection with simplified depth prediction. In IEEE 21st international conference on intelligent transportation systems (ITSC), pages 2712–2717, 2018b.
 Kiros et al. [2014] Ryan Kiros, Karteek Popuri, Dana Cobzas, and Martin Jagersand. Stacked multiscale feature learning for domain independent medical image segmentation. In International workshop on machine learning in medical imaging, pages 25–32. Springer, 2014.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pages 1097–1105, 2012.
 Ku et al. [2018] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 1–8, 2018.
LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 Liang et al. [2018] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multisensor 3d object detection. In European conference on computer vision (ECCV), pages 641–656, 2018.
 Liang et al. [2019] Ming Liang, Bin Yang, Yun Chen, Rui Hui, and Raquel Urtasun. Multitask multisensor fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR), 2019.
 Liu et al. [2015] Siqi Liu, Sidong Liu, Weidong Cai, Hangyu Che, Sonia Pujol, Ron Kikinis, Dagan Feng, Michael J Fulham, et al. Multimodal neuroimaging feature learning for multiclass diagnosis of alzheimer’s disease. IEEE transactions on biomedical engineering, 62(4):1132–1140, 2015.
Liu et al. [2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European conference on computer vision (ECCV), pages 21–37. Springer, 2016.
 Mees et al. [2016] Oier Mees, Andreas Eitel, and Wolfram Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 151–156, 2016.
 Mroueh et al. [2015] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning for audiovisual speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2130–2134, 2015.
 Qi et al. [2018] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgbd data. In IEEE conference on computer vision and pattern recognition (CVPR), pages 918–927, 2018.
 Ramachandram and Taylor [2017] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE signal processing magazine, 34(6):96–108, 2017.
Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In IEEE conference on computer vision and pattern recognition (CVPR), pages 7263–7271, 2017.
Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE conference on computer vision and pattern recognition (CVPR), pages 779–788, 2016.
Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NeurIPS), pages 91–99, 2015.
 Sainath et al. [2013] Tara N Sainath, Abdelrahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for lvcsr. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 8614–8618, 2013.
Simonovsky et al. [2016] Martin Simonovsky, Benjamín Gutiérrez-Becker, Diana Mateus, Nassir Navab, and Nikos Komodakis. A deep metric for multimodal registration. In International conference on medical image computing and computer-assisted intervention, pages 10–18. Springer, 2016.
Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR), 2015.
 Sui et al. [2015] Chao Sui, Mohammed Bennamoun, and Roberto Togneri. Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines. In IEEE international conference on computer vision (ICCV), pages 154–162, 2015.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition (CVPR), pages 1–9, 2015.
 Tsipras et al. [2019] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International conference on learning representations (ICLR), 2019.
 Tumer and Ghosh [1996a] Kagan Tumer and Joydeep Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341–348, 1996a.
Tumer and Ghosh [1996b] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385–404, 1996b.
 Valada et al. [2017] Abhinav Valada, Johan Vertens, Ankit Dhall, and Wolfram Burgard. Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In IEEE international conference on robotics and automation (ICRA), pages 4644–4651, 2017.
 Wang et al. [2018] Zining Wang, Wei Zhan, and Masayoshi Tomizuka. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In IEEE intelligent vehicles symposium (IV), pages 1–6, 2018.
 Wu et al. [2013] Pengcheng Wu, Steven CH Hoi, Hao Xia, Peilin Zhao, Dayong Wang, and Chunyan Miao. Online multimodal deep similarity learning with application to image retrieval. In 21st ACM international conference on multimedia, pages 153–162. ACM, 2013.
Appendix A Proofs and Supplementary Analyses
A.1 Proofs and analyses for Section 3.2
Proof.
The original loss minimization problem with an additional constraint of preserving loss under clean data can be transformed to the problem stated in (3) due to the flexibility of and under the constraint :
Under the expected squared loss with function, the loss can be evaluated,
Hence the equivalent problem (6) is achieved.
(6) 
To simplify notation, substitute variables as , and solve the following convex optimization problem.
This problem can be solved by introducing a variable for the upper bound of the inner maximum value:
The KKT conditions give:
(Primal feasibility)  
(Dual feasibility)  
(Complementary slackness)  
(Stationarity) 
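As a generic illustration of the step above (not the paper's exact problem, whose objective and clean-data constraint are not reproduced here), the epigraph reformulation of a two-term min-max problem and its KKT conditions can be sketched as follows. The symbols $g_1$, $g_2$, $v$, $t$, $\lambda_1$, and $\lambda_2$ are placeholders introduced for this sketch only, and any additional constraint on $v$ would contribute an extra multiplier term to the stationarity condition in $v$.

```latex
% Generic sketch: g_1, g_2 are convex placeholder functions, v the decision
% variable, t the upper bound on the inner maximum. None of these symbols
% are taken from the paper.
\begin{align*}
  \min_{v}\ \max\{g_1(v),\, g_2(v)\}
  \quad &\Longleftrightarrow \quad
  \min_{v,\,t}\ t
  \quad \text{s.t.} \quad g_1(v) \le t,\ \ g_2(v) \le t, \\[4pt]
  \mathcal{L}(v, t, \lambda_1, \lambda_2)
  &= t + \lambda_1 \bigl(g_1(v) - t\bigr) + \lambda_2 \bigl(g_2(v) - t\bigr).
\end{align*}
% KKT conditions of the epigraph form:
\begin{align*}
  &\text{(Primal feasibility)}       && g_1(v) \le t, \quad g_2(v) \le t, \\
  &\text{(Dual feasibility)}         && \lambda_1 \ge 0, \quad \lambda_2 \ge 0, \\
  &\text{(Complementary slackness)}  && \lambda_i \bigl(g_i(v) - t\bigr) = 0, \quad i = 1, 2, \\
  &\text{(Stationarity)}             && 1 - \lambda_1 - \lambda_2 = 0, \quad
      \lambda_1 \nabla g_1(v) + \lambda_2 \nabla g_2(v) = 0.
\end{align*}
```

Since stationarity in $t$ forces $\lambda_1 + \lambda_2 = 1$, at least one multiplier is positive. If both are positive, complementary slackness gives $g_1(v) = g_2(v) = t$, which is the balanced situation; otherwise only one of the two terms is active at the optimum.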
Considering and , we first need to analyze the case . This gives and the complementary slackness condition to find . can be analyzed with similar steps. If both and are positive, the complementary slackness condition gives , which ensures the balance of the original problem’s maximum value . This case gives with . Therefore, we obtain the result (4), which yields a fusion model that is robust against single-source corruption by random noise. ∎
Comparison to the model not considering MaxSSN loss
If random noise is added to and simultaneously, the objective of the problem becomes instead of considering the MaxSSN loss. This is equivalent to minimizing subject to , and the solution can be directly found as it is a simple convex problem, which is . If we denote this model as , then its MaxSSN loss is:
Now, let’s compute the difference .
Proof.
As both terms include , let’s assume for ease of notation. Among the three cases in (4), consider the first case .
The second case can be shown similarly. Now assume that holds, and let without loss of generality. Then we can show that,
∎
Therefore, we conclude that, in our linear fusion model, simply optimizing under noise added to all input sources at the same time cannot do better than directly minimizing the MaxSSN loss: the resulting model's MaxSSN loss is larger by a nonnegative gap.
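The gap can be checked numerically on a deliberately simplified scalar instance. The sketch below is an assumption-laden toy, not the linear fusion data model of Section 3.2: both sources are taken to equal a single shared scalar feature, the target equals that feature, the fusion weights sum to one so that the clean prediction is preserved, and zero-mean noise with standard deviation `sigma1` or `sigma2` is added to one source at a time. All function and variable names are hypothetical.

```python
def single_source_losses(a1, sigma1, sigma2):
    """Expected squared losses when noise hits only source 1 or only source 2.

    Toy model: f = a1 * x1 + a2 * x2 with x1 = x2 = y on clean data and
    a1 + a2 = 1, so noise on source i contributes (a_i * sigma_i) ** 2.
    """
    a2 = 1.0 - a1
    return (a1 * sigma1) ** 2, (a2 * sigma2) ** 2


def max_ssn(a1, sigma1, sigma2):
    """Maximum single-source noise loss for the toy model."""
    return max(single_source_losses(a1, sigma1, sigma2))


def weight_minimizing_max_ssn(sigma1, sigma2):
    """Balanced solution: a1 * sigma1 == a2 * sigma2."""
    return sigma2 / (sigma1 + sigma2)


def weight_minimizing_joint_noise(sigma1, sigma2):
    """Minimizer of (a1 * sigma1) ** 2 + (a2 * sigma2) ** 2 (noise on both sources)."""
    return sigma2 ** 2 / (sigma1 ** 2 + sigma2 ** 2)


if __name__ == "__main__":
    sigma1, sigma2 = 1.0, 3.0  # illustrative noise levels
    a_ssn = weight_minimizing_max_ssn(sigma1, sigma2)
    a_joint = weight_minimizing_joint_noise(sigma1, sigma2)
    print("MaxSSN of balanced weights:", max_ssn(a_ssn, sigma1, sigma2))
    print("MaxSSN of joint-noise weights:", max_ssn(a_joint, sigma1, sigma2))
```

In this toy setting the joint-noise solution leans heavily on the less noisy source, so its single-source loss is unbalanced and its maximum is larger. The two solutions coincide only when the noise levels are equal.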
A.2 Single Source Robustness against Adversarial Attacks
Another important type of perturbation is an adversarial attack. Unlike the random noise studied previously, the perturbation to the input sources is itself optimized to maximize the loss, so that the worst case is considered. The adversarial version of the MaxSSN loss is defined as follows:
Definition 3.
For multiple sources and a target variable , denote a predefined loss function by . If each input source is maximally perturbed with some additive noise for , the AdvMaxSSN loss for a model is defined as follows:
As a simple model analysis, let’s consider a binary classification problem using logistic regression. Again, two input sources and have a common feature vector as in the linear fusion data model. A binary classifier is trained to predict label , where and the training loss is with the logistic function . Here, we apply one of the most popular attacks, the fast gradient sign (FGS) method, which was also motivated by linear models without a fusion framework (Goodfellow et al., 2015). The adversarial attack on each source under norm constraint can be similarly derived as follows:
(7) 
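To make the single-source attack concrete, the following minimal sketch applies an FGS perturbation to one source at a time for a linear logistic fusion classifier and takes the worst of the resulting losses. Several details are assumptions made for this sketch rather than the paper's notation: labels are taken in {-1, +1}, the loss is written as log(1 + exp(-y * score)), the perturbation is bounded in the max norm by `eps`, and the names `ws`, `xs`, and `adv_max_ssn` are hypothetical.

```python
import numpy as np


def logistic_loss(y, score):
    """log(1 + exp(-y * score)), evaluated stably; y is assumed to be -1 or +1."""
    return np.logaddexp(0.0, -y * score)


def fgs_perturbation(y, w_i, eps):
    """FGS step for one source of a linear model.

    The gradient of the loss w.r.t. x_i is a negative multiple of y * w_i,
    so its sign is -y * sign(w_i).
    """
    return -eps * y * np.sign(w_i)


def adv_max_ssn(y, xs, ws, eps):
    """Worst loss over attacks that perturb exactly one source at a time."""
    losses = []
    for i in range(len(xs)):
        perturbed = [x.copy() for x in xs]
        perturbed[i] = perturbed[i] + fgs_perturbation(y, ws[i], eps)
        score = sum(float(w @ x) for w, x in zip(ws, perturbed))
        losses.append(logistic_loss(y, score))
    return max(losses)


if __name__ == "__main__":
    # Illustrative weights and inputs for two sources (made-up values).
    ws = [np.array([1.0, -2.0]), np.array([0.5, 0.5])]
    xs = [np.array([0.3, 0.1]), np.array([1.0, -1.0])]
    print("clean loss:", logistic_loss(1, sum(float(w @ x) for w, x in zip(ws, xs))))
    print("single-source adversarial loss:", adv_max_ssn(1, xs, ws, eps=0.1))
```

For a linear score, attacking source i lowers the margin y * score by eps times the 1-norm of that source's weights, so under these assumptions the worst single-source attack targets the source whose weight block has the largest 1-norm.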
As a substitute for the linear fusion data model, let’s assume the true classes are generated by the hidden relationship . Then the optimal fusion binary classifier becomes . Similar to the previous section, suppose the objective is to find a model with robustness against single source adversarial attacks, while preserving the performance on clean data. Then the overall optimization problem can be reduced to the following one:
(8) 
As is a decreasing function, the optimal and of the original problem are equivalent to the minimizers of the following one: