What makes training multi-modal networks hard?
Abstract
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks.
This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to their increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on a variety of tasks, including fine-grained sport classification, human action recognition, and acoustic event detection.
Weiyao Wang, Du Tran, Matt Feiszli Facebook AI {weiyaowang,trandu,mdf}@fb.com
Preprint. Under review.
1 Introduction
Consider a late-fusion multi-modal network, trained end-to-end to solve a task. In this setting, the single-modal solutions are a strict subset of the solutions available to the multi-modal network; a well-optimized multi-modal model should therefore outperform the best single-modal model. However, we show here that current techniques do not generally achieve this. In fact, what we observe is contrary to common sense: the best single-modal model always outperforms the jointly trained late-fusion model, across different modalities and benchmarks (Table 2); details are given in Section 4.1. Anecdotally, this appears to be common: in personal communications we have heard of similar phenomena occurring in other tasks when fusing RGB+geometry, audio+video, and others.
1.1 Lack of a known solution to the problem
There are two direct ways to approach this problem. First, one can consider solutions such as dropout [30], pre-training, or early stopping to reduce overfitting. Alternatively, one may speculate that this is an architectural deficiency. We experiment with mid-level fusion by concatenation [26] and fusion by gating [22], trying both Squeeze-and-Excitation (SE) [18] and Non-Local (NL) [35] gates. We refer to the supplementary materials for details of these architectures.
Remarkably, none of these provide an effective solution. For each method, we record the best audio-visual results on Kinetics in Table 2. Pre-training fails to offer improvements, and early stopping tends to underfit. Gating adds interactions between the modalities but fails to improve performance. Mid-concat and dropout provide only modest improvements over the RGB model. We note that mid-concat (with 37% fewer parameters than late-concat) and dropout give 1.4% and 1.5% improvements over late-concat, which indicates an overfitting problem with late-concat.
How do we reconcile these experiments with previous multi-modal successes? Multi-modal networks have successfully been trained jointly on tasks including sound localization [41], image-audio alignment [2], and audio-visual synchronization [26, 25]. However, these tasks cannot be performed with a single-modal network alone, so the performance drop found in this paper does not apply to them. In other work, joint training is avoided entirely by fusing features from independently pre-trained single-modal networks (either on the same task or on different tasks). Good examples include two-stream networks for video classification [28, 34, 11, 8] and image+text classification [3, 22]. However, these methods do not train jointly, so they are again not comparable, and their accuracy is most likely sub-optimal due to independent training.
1.2 The contributions of this paper
Our contributions include:

We empirically demonstrate the significance of overfitting in joint training of multi-modal networks, and we identify two causes of the problem. We note that the problem is architecture-agnostic: different fusion techniques suffer from the same overfitting problem.

We propose a metric to quantify the overfitting problem: the overfitting-to-generalization ratio (OGR). We provide both theoretical and empirical justification for it.

We propose a new training scheme based on OGR which constructs an optimal blend (in a sense we make precise below) of multiple supervision signals. This Gradient-Blend method gives significant gains in ablations and achieves state-of-the-art accuracy on benchmarks including Kinetics, Sports-1M, and AudioSet. It applies broadly to end-to-end training of ensemble models.
2 Background: joint multi-modal training
Single-modal network. Given a training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th training example and $y_i$ is its true label, training on a single modality $m$ (e.g. RGB frames, audio, or optical flow) means minimizing an empirical loss:

$$\mathcal{L}(\mathcal{C}, \varphi_m) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\mathcal{C}(\varphi_m(x_i)),\, y_i\big) \qquad (1)$$

where $\varphi_m$ is normally a deep network with parameters $\Theta_m$, and $\mathcal{C}$ is a classifier, typically one or more fully-connected (FC) layers with parameters $\theta_c$. For classification problems, $\ell$ is normally the cross-entropy loss. Minimizing Eq. 1 gives a solution $(\Theta_m^*, \theta_c^*)$. Figure 1a shows independent training of two modalities $m_1$ and $m_2$.
Multi-modal network. We train a late-fusion ensemble model on $M$ different modalities $(m_1, \dots, m_M)$. Each modality $m_k$ is processed by a different deep network $\varphi_{m_k}$, and their features are concatenated and passed to a classifier $\mathcal{C}$. Formally, training is done by minimizing the loss:

$$\mathcal{L}(\mathcal{C}, \varphi_{m_1}, \dots, \varphi_{m_M}) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\mathcal{C}(\varphi_{m_1}(x_i) \oplus \cdots \oplus \varphi_{m_M}(x_i)),\, y_i\big) \qquad (2)$$

where $\oplus$ denotes a concatenation operation. Figure 1b shows an example of joint training of two modalities $m_1$ and $m_2$. Note that the multi-modal network in Eq. 2 is a superset of the single-modal network in Eq. 1, for any modality $m_k$. In fact, for any solution to Eq. 1 on any modality $m_k$, one can construct an equally-good solution to Eq. 2 by choosing parameters of $\mathcal{C}$ that mute all modalities other than $m_k$. In practice, this solution is not found, and we next explain why.
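As a concrete illustration of the muting construction, a linear late-fusion classifier can reproduce any single-modal classifier exactly by zeroing the weight block of the other modality. The shapes and values below are illustrative only:

```python
import numpy as np

# Sketch of the "muting" construction: a linear late-fusion classifier can
# reproduce any single-modal classifier by zeroing the weight block of the
# other modality. Shapes and values are illustrative only.

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal(4)            # RGB features
f_audio = rng.standard_normal(4)          # audio features
w_rgb = rng.standard_normal((3, 4))       # a single-modal classifier, 3 classes

w_fused = np.hstack([w_rgb, np.zeros((3, 4))])   # mute the audio block
fused_scores = w_fused @ np.concatenate([f_rgb, f_audio])
```

The fused scores match the single-modal scores exactly, since the audio block contributes nothing.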
3 Multi-modal joint training via gradient blending
3.1 Generalizing vs. Overfitting
Overfitting is, by definition, learning patterns in a training set that do not generalize to the target distribution. We quantify this as follows. Given model parameters $\theta^{(t)}$, where $t$ indicates the training epoch, let $\mathcal{L}^T(\theta^{(t)})$ be the model's average loss over the fixed training set, and $\mathcal{L}^*(\theta^{(t)})$ be the "true" loss w.r.t. the hypothetical target distribution. (In practice, $\mathcal{L}^*$ is approximated by the test and validation losses.) For either loss, the quantity $\mathcal{L}(\theta^{(0)}) - \mathcal{L}(\theta^{(t)})$ is a measure of the information gained during training. We define overfitting as the gap between the gain on the training set and the gain on the target distribution:

$$O_t = \big(\mathcal{L}^T(\theta^{(0)}) - \mathcal{L}^T(\theta^{(t)})\big) - \big(\mathcal{L}^*(\theta^{(0)}) - \mathcal{L}^*(\theta^{(t)})\big)$$

and generalization to be the amount we learn (from training) about the target distribution:

$$G_t = \mathcal{L}^*(\theta^{(0)}) - \mathcal{L}^*(\theta^{(t)})$$

The overfitting-to-generalization ratio is then a measure of information quality:

$$\mathrm{OGR} = \left|\frac{O_t}{G_t}\right| \qquad (3)$$
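To make the definitions concrete, the following sketch computes $O_t$, $G_t$, and OGR from a pair of loss curves, using the validation loss as a stand-in for the unavailable true loss. The loss values are made up for illustration:

```python
# Sketch of Eq. 3: compute O_t, G_t, and OGR from a pair of loss curves,
# using the validation loss as a stand-in for the unavailable true loss.
# The loss values are made up for illustration.

def overfitting_generalization(train_losses, val_losses, t):
    """Return (O_t, G_t, OGR) between epoch 0 and epoch t."""
    train_gain = train_losses[0] - train_losses[t]  # gain on the training set
    true_gain = val_losses[0] - val_losses[t]       # gain on the target proxy
    O = train_gain - true_gain                      # overfitting
    G = true_gain                                   # generalization
    return O, G, abs(O / G)

train = [2.30, 1.10, 0.60, 0.35]   # training loss per epoch
val = [2.30, 1.40, 1.20, 1.15]     # validation loss per epoch
O, G, ogr = overfitting_generalization(train, val, 3)
```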
However, it does not make sense to optimize this quantity as-is: very underfit models, for example, may still score quite well. What does make sense is to solve an infinitesimal problem: given several estimates of the gradient, blend them to minimize an infinitesimal $\mathrm{OGR}^2$, ensuring each gradient step now produces a gain no worse than that of the single best modality.
Given parameters $\theta$, the full-batch gradient with respect to the training set is $\nabla\mathcal{L}^T(\theta)$, and the ground-truth gradient is $\nabla\mathcal{L}^*(\theta)$. We decompose $\nabla\mathcal{L}^T$ into the true gradient and a remainder:

$$\nabla\mathcal{L}^T = \nabla\mathcal{L}^* + \epsilon \qquad (4)$$

In particular, $\epsilon$ is exactly the infinitesimal overfitting. Given an estimate $g$ of the gradient, we can measure its contribution to the losses via Taylor's theorem: $\mathcal{L}(\theta + \eta g) \approx \mathcal{L}(\theta) + \eta \langle g, \nabla\mathcal{L} \rangle$, which implies $g$'s contribution to overfitting is given by $\eta \langle g, \epsilon \rangle$. If we train for $N$ steps with gradients $\{g^{(t)}\}$, and $\eta_t$ is the learning rate at the $t$-th step, the final $\mathrm{OGR}$ can be aggregated as:

$$\mathrm{OGR} = \left| \frac{\sum_t \eta_t \langle g^{(t)}, \epsilon^{(t)} \rangle}{\sum_t \eta_t \langle g^{(t)}, \nabla\mathcal{L}^{*}(\theta^{(t)}) \rangle} \right| \qquad (5)$$

and $\mathrm{OGR}^2$ for a single vector $g$ is

$$\mathrm{OGR}^2 = \left( \frac{\langle g, \epsilon \rangle}{\langle g, \nabla\mathcal{L}^* \rangle} \right)^2 \qquad (6)$$
Next we compute the optimal blend to minimize the single-step $\mathrm{OGR}^2$.
3.2 Blending of Multiple Supervision Signals by OGR Minimization
We can obtain multiple approximate gradients by attaching classifiers to each modality's features and to the fused features (see Figure 1c). Gradients are obtained by back-propagating through each loss separately (so per-modality gradients contain many zeros, in the parts of the network belonging to other modalities). Our next result allows us to blend them all into a single vector with better overfitting behavior.
Proposition 1 (Gradient-Blend).
Let $\{v_k\}_{k=1}^{M}$ be a set of gradient estimators whose overfitting components $\epsilon_k = v_k - \nabla\mathcal{L}^*$ satisfy $\mathbb{E}[\langle \epsilon_i, \epsilon_j \rangle] = 0$ for $i \neq j$. Let $\{w_k\}$ denote weights with $\sum_k w_k = 1$. The optimal weights

$$w^* = \arg\min_{w:\, \sum_k w_k = 1} \mathbb{E}\Big[\mathrm{OGR}^2\Big(\sum_k w_k v_k\Big)\Big] \qquad (7)$$

are given by

$$w_k^* = \frac{1}{Z} \cdot \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{2\sigma_k^2} \qquad (8)$$

where $\sigma_k^2 \equiv \mathbb{E}\big[\|\epsilon_k\|^2\big]$ and $Z = \sum_k \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{2\sigma_k^2}$ enforces that the weights sum to unity.
The assumption $\mathbb{E}[\langle \epsilon_i, \epsilon_j \rangle] = 0$ will be false when two models' overfitting is strongly correlated. However, if this is the case then very little can be gained by blending. In informal experiments we have indeed observed that these cross terms are often small relative to the $\sigma_k^2$. This is likely due to complementary information across modalities, and we speculate that, additionally, this happens naturally as joint training tries to learn complementary features across neurons. Please see the supplementary materials for the proof of Proposition 1, including formulas for the correlated case.
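A toy numerical sketch of Proposition 1 (the vectors below are invented for illustration, not taken from the paper): each estimator is the true gradient plus a remainder, and Eq. 8 rewards alignment with the true gradient while penalizing the remainder's magnitude:

```python
import numpy as np

# Toy illustration of Proposition 1: each estimator v_k is the true gradient
# plus a fixed "overfitting" remainder eps_k. All vectors are made up.

true_grad = np.array([1.0, 0.5, -0.25])
eps = [np.array([0.1, -0.05, 0.05]),    # small overfitting
       np.array([0.5, 0.3, -0.4]),      # medium
       np.array([-1.0, 0.8, 0.6])]      # large
v = [true_grad + e for e in eps]

gains = np.array([np.dot(true_grad, vk) for vk in v])   # <grad L*, v_k>
sigma2 = np.array([np.dot(e, e) for e in eps])          # sigma_k^2

w = gains / (2.0 * sigma2)   # Eq. 8 before normalization
w = w / w.sum()              # Z enforces sum to one
```

As expected, the least-overfit, best-aligned estimator receives by far the largest weight.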
3.3 Use of OGR and Gradient-Blend in practice
We adapt a multi-task architecture to construct an approximate solution to the optimization above.
Gradient-Blend in practice. Proposition 1 suggests recalculating the optimal weights at every update step to minimize $\mathrm{OGR}^2$. This would be noisy and computationally demanding. Instead, we find it works remarkably well to assign a single fixed weight per modality, obtained using the per-modality generalization ($G_k$) and overfitting ($O_k$) measured after an initial training run of each model separately. We demonstrate the gains from this simplified training scheme and look forward to developing robust per-step or per-epoch estimation in future work.
Optimal blending by loss re-weighting. Figure 1c shows our joint training setup for two modalities with weighted losses. At each back-propagation step, the per-modality gradient for modality $m_k$ is $\nabla\mathcal{L}_k$ ($k = 1, \dots, M$), and the gradient from the fused loss is given by Eq. 2 (we denote it $\nabla\mathcal{L}_{M+1}$). Taking the gradient of the blended loss

$$\mathcal{L}_{\mathrm{blend}} = \sum_{k=1}^{M+1} w_k \mathcal{L}_k \qquad (9)$$

thus produces the blended gradient $\sum_{k=1}^{M+1} w_k \nabla\mathcal{L}_k$. For appropriate choices of $w_k$ this yields a convenient way to implement gradient blending with fixed weights. Intuitively, loss re-weighting recalibrates the learning schedule to balance the generalization/overfitting rates of the different modalities.
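Because gradients are linear in the loss, re-weighting losses and blending gradients coincide. The following toy check (quadratic losses and placeholder weights, not values from the paper) verifies that the gradient of the weighted loss sum equals the weighted sum of the per-loss gradients:

```python
import numpy as np

# Toy check that loss re-weighting implements gradient blending: for losses
# sharing the same parameters, the gradient of the weighted sum of losses is
# the weighted sum of the per-loss gradients. The quadratic losses and the
# weights below are placeholders, not values from the paper.

def grad_quadratic(theta, center):
    # gradient of the loss 0.5 * ||theta - center||^2
    return theta - center

theta = np.array([0.2, -0.1])
centers = {"audio": np.array([1.0, 0.0]),
           "rgb": np.array([0.0, 1.0]),
           "fused": np.array([0.5, 0.5])}
weights = {"audio": 0.2, "rgb": 0.3, "fused": 0.5}

# Gradient of L_blend = sum_k w_k * L_k, assembled from per-loss gradients.
blended_grad = sum(w * grad_quadratic(theta, centers[k]) for k, w in weights.items())
```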
Measuring OGR in practice. In practice, $\mathcal{L}^*$ is not available. To measure OGR, we hold out a subset $V$ of the training set to approximate the true distribution (i.e. $\mathcal{L}^V \approx \mathcal{L}^*$), measure each model's overfitting $O_k$ and generalization $G_k$ with $\mathcal{L}^V$ in place of $\mathcal{L}^*$, and compute the weights

$$w_k = \frac{1}{Z} \cdot \frac{G_k}{O_k^2} \qquad (10)$$

where $Z$ normalizes so that $\sum_k w_k = 1$.
In summary, we train as follows:

Train single-modal models for each modality $m_k$, as well as the naive joint (fused) model.

For each model, compute its weight $w_k$ as per Eq. 10.

Train the multi-modal model, as per Figure 1c, with loss weights given by $\{w_k\}$.
In practice, we find it equally effective to replace the loss measure with an accuracy metric.
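The weight estimation in step 2 can be sketched as follows. The $O_k$, $G_k$ values below are made up for illustration (audio overfits heavily, video much less):

```python
# Sketch of the weight estimation from held-out measurements: w_k is set
# proportional to G_k / O_k^2 and normalized. The O_k, G_k values below are
# made up for illustration (audio overfits heavily, video much less).

def gblend_weights(measures):
    """measures: {head: (O_k, G_k)} -> normalized loss weights."""
    raw = {k: G / (O * O) for k, (O, G) in measures.items()}
    Z = sum(raw.values())
    return {k: r / Z for k, r in raw.items()}

weights = gblend_weights({"audio": (1.2, 0.3),
                          "rgb": (0.4, 0.9),
                          "fused": (0.6, 1.0)})
```

Heavily overfitting heads are down-weighted even when their generalization is non-trivial.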
4 Ablation Experiments
4.1 Experimental setup
Datasets. We use three datasets for our ablation experiments: Kinetics, mini-Sports, and mini-AudioSet. Kinetics is a standard benchmark for action recognition with 260k videos [20] of 400 human action classes. We use the train split (240k) for training and the validation split (20k) for testing. Mini-Sports is a subset of Sports-1M [19], a large-scale video classification dataset with 1.1M videos of 487 different fine-grained sports. We uniformly sampled 240k videos from the train split and 20k videos from the test split. Mini-AudioSet is a subset of AudioSet [14], a multi-label dataset consisting of 2M videos labeled by 527 acoustic events. AudioSet is very class-unbalanced, so we remove classes with fewer than 500 samples and subsample such that each class has about 1,100 samples (see supplementary). The balanced mini-AudioSet has 418 classes with 243k videos.
Backbone architectures. We use ResNet3D [32] as our visual backbone and ResNet [16] as our audio backbone, both with 50 layers. For fusion, we use a two-FC-layer network applied to the concatenated features from the visual and audio backbones, followed by one prediction layer.
Input pre-processing & augmentation. We use three modalities in the ablations: RGB frames, optical flow, and audio. For RGB and optical flow, we use the same visual backbone, ResNet3D-50, which takes a clip of size 16×224×224 as input. We follow the same data pre-processing and augmentation as [35] for the visual modalities, except that we use 16-frame clips (instead of 32) to reduce memory. For audio, our ResNet-50 takes a 40×100 spectrogram image as input, i.e. a MEL-spectrogram extracted from the audio with 100 temporal frames, each with 40 MEL filters.
Training and testing. We train our models with synchronous distributed SGD on GPU clusters using Caffe2 [7], with the same training setup as [32]. We hold out a small portion of the training data for estimating the optimal weights (8% for Kinetics and mini-Sports, 13% for mini-AudioSet). For evaluation, we report clip top-1 accuracy and video top-1 and top-5 accuracy. For video accuracy, we use the center crops of 10 clips uniformly sampled from the video and average these 10 clip predictions to obtain the final video prediction.
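The video-level evaluation above can be sketched as follows. The clip scores are made up, and only 3 clips are shown for brevity where the paper averages 10:

```python
import numpy as np

# Sketch of the video-level evaluation: average the per-clip predictions of
# uniformly sampled clips, then take the arg-max class. Scores are made up;
# only 3 clips are shown where the paper averages 10.

def video_prediction(clip_scores):
    """clip_scores: (num_clips, num_classes) array -> predicted class id."""
    return int(np.argmax(clip_scores.mean(axis=0)))

clip_scores = np.array([[0.7, 0.2, 0.1],
                        [0.2, 0.6, 0.2],
                        [0.6, 0.3, 0.1]])
```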
4.2 Overfitting Problems in Naive Joint Training
In this ablation, we compare the performance of naive audio-RGB joint training with single-modal training of audio-only and RGB-only networks. Fig. 2 plots the training curves of these models on Kinetics (left) and mini-Sports (right). On both datasets, the audio model overfits the most and the video model overfits the least. We note that the naive joint audio-RGB model has lower training error and higher validation error than the video-only model. This is evidence that naive joint training of the audio-RGB model increases overfitting, explaining its accuracy drop relative to the video-only model.
We extend the analysis and confirm severe overfitting on other multi-modal problems. We consider all 4 possible combinations of the three modalities (audio, RGB, and optical flow). In every case, the validation accuracy of naive joint training is significantly worse than that of the best single-stream model (see Table 2), and the training accuracy is almost always higher (see supplementary materials).
4.3 Gradient-Blend is an effective regularizer
In this ablation, we show the merit of Gradient-Blend in multi-modal training. We first show that our method helps to regularize and improve performance on different multi-modal problems on Kinetics. We then compare our method with other regularization methods on the three datasets.
On Kinetics, we study all combinations of the three modalities: RGB, optical flow, and audio. Table 3 compares our method with naive joint training and the best single-stream model. We observe significant gains from our Gradient-Blend strategy over both baselines on all multi-modal problems. It is worth noting that Gradient-Blend is generic enough to work with more than two modalities.
Table 3: Gradient-Blend vs. naive joint training and the best single-stream model on Kinetics (clip top-1, video top-1, and video top-5 accuracy).

Modal     RGB + A             RGB + OF            OF + A              RGB + OF + A
          Clip  V@1   V@5     Clip  V@1   V@5     Clip  V@1   V@5     Clip  V@1   V@5
Single    63.5  72.6  90.1    63.5  72.6  90.1    49.2  62.1  82.6    63.5  72.6  90.1
Naive     61.8  71.4  89.3    62.2  71.3  89.6    46.2  58.3  79.9    61.0  70.0  88.7
G-Blend   65.8  74.7  91.5    64.3  73.1  90.8    55.0  66.5  86.3    65.7  74.7  91.6
Furthermore, we pick the problem of joint audio-RGB training and compare Gradient-Blend in more depth with other regularization methods on different tasks and benchmarks: action recognition (Kinetics), sport classification (mini-Sports), and acoustic event detection (mini-AudioSet). We include three baselines: adding dropout at the concatenation layer [30], pre-training single-stream backbones and then fine-tuning the fusion model, and blending the supervision signals with equal weights (which is equivalent to naive training with two auxiliary losses). Auxiliary losses are popular in multi-task learning, and we extend them as a baseline for multi-modal training.
As presented in Table 4, Gradient-Blend outperforms all baselines by significant margins on both Kinetics and mini-Sports. On mini-AudioSet, Gradient-Blend improves over all baselines on mAP, and is slightly worse on mAUC than the auxiliary-loss baseline. The reason is that the gradient weights learned by Gradient-Blend for audio, RGB, and audio-RGB are very similar to equal weights. The failure of the auxiliary loss on Kinetics and mini-Sports demonstrates that the weights used in Gradient-Blend are indeed important. We also experiment with other, less obvious multi-task techniques, such as treating the weights as learnable parameters during back-propagation [21]. However, this approach converges to a result similar to naive joint training. This happens because it lacks an overfitting prior, and thus the learnable weights are biased towards the modality with the lowest training loss, which is audio-RGB.
Table 4: Gradient-Blend compared with regularization baselines for audio-RGB joint training.

Dataset         Kinetics            mini-Sports         mini-AudioSet
Method          Clip  V@1   V@5     Clip  V@1   V@5     mAP   mAUC
Audio only      13.9  19.7  33.6    14.7  22.1  35.6    29.1  90.4
RGB only        63.5  72.6  90.1    48.5  62.7  84.8    22.1  86.1
Pre-Training    61.9  71.7  89.6    48.3  61.3  84.9    37.4  91.7
Naive           61.8  71.7  89.3    47.1  60.2  83.3    36.5  92.2
Dropout         63.8  72.9  90.6    47.4  61.4  84.3    36.7  92.3
Auxiliary Loss  60.5  70.8  88.6    48.9  62.1  84.0    37.7  92.3
G-Blend         65.8  74.7  91.5    49.7  62.8  85.5    37.8  92.2
Fig. 3 presents the top and bottom 20 classes on Kinetics where Gradient-Blend makes the most and least improvement over the RGB network. We observe that the improved classes usually have a strong audio correlation, such as beatboxing and whistling. For classes like moving furniture and cleaning floor, although audio alone has nearly 0 accuracy, combining it with RGB still gives significant improvements. These classes also tend to have high accuracy with naive joint training, which indicates the value of the joint supervision signal. On the bottom-20 classes, where Gradient-Blend does worse than the RGB model, we indeed find that audio does not seem to be semantically relevant.
5 Comparison with the State of the Art
In this section, we train our multi-modal networks with deeper backbone architectures using Gradient-Blend and compare them with state-of-the-art methods on Kinetics, Sports-1M, and AudioSet. Our G-Blend is trained with RGB and audio input. We use R(2+1)D [32] as the visual backbone and R2D [16] as the audio backbone, both with 101 layers. We use the same pre-processing and data augmentation as described in Section 4, and the same 10-crop evaluation setup as Section 4 for Sports-1M and AudioSet. For Kinetics, we follow the same 30-crop evaluation setup as [35]. Our main purposes in these experiments are: 1) to confirm the benefit of Gradient-Blend on high-capacity models; and 2) to compare G-Blend with state-of-the-art methods on different large-scale benchmarks.
Results. Table 5 presents the results of G-Blend and compares them with current state-of-the-art methods on Kinetics. First, we observe that G-Blend provides a 1.3% improvement over the RGB model (the best single-modal network) with the same backbone architecture, R(2+1)D-101, when both models are trained from scratch. This confirms that the benefits of G-Blend still hold with high-capacity models. Second, G-Blend, when fine-tuned from Sports-1M, outperforms the Shift-Attention Network [6] and the Non-local Network [35] by 1.2% and achieves state-of-the-art accuracy on Kinetics. We note that this is not a direct, fair comparison: the Shift-Attention network uses 3 different modalities (RGB, optical flow, and audio) while G-Blend uses only RGB and audio, and the Non-local network uses 128-frame clip inputs while G-Blend uses only 16-frame clips. We also note that many competitive methods report results on Kinetics; due to space limits, we select only a few representative methods for comparison: the Shift-Attention network [6], the Non-local network [35], and R(2+1)D [32]. Shift-Attention and Non-local networks are the methods with the best published accuracy using multi-modal and single-modal input, respectively. R(2+1)D is used as the visual backbone of G-Blend and thus serves as a direct baseline.
Tables 6 and 7 present our G-Blend results against the current best methods on Sports-1M and AudioSet, respectively. On Sports-1M, G-Blend outperforms previously published results by good margins. It outperforms the current state-of-the-art R(2+1)D model by 1.8% while using shorter clip inputs (16 frames instead of 32, due to memory constraints). On AudioSet, G-Blend outperforms the Google benchmark [17] and Softmax Attention [24] by 4.1% and 2.8%, respectively, both of which use a feature extractor pre-trained on YouTube-100M [17]. G-Blend is comparable with the Multi-level Attention Network [38] and TALNet [36], although the former uses strong features (pre-trained on YouTube-100M) and the latter uses 100 clips per video, while G-Blend uses only 10 clips.
6 Related Work
Our work is related to a previous line of research on multi-modal networks [4] for classification [28, 34, 11, 13, 8, 3, 6, 22], which primarily uses pre-training, in contrast to our joint training. Our work is also related to cross-modal tasks [37, 12, 29, 1, 40, 15, 5] and cross-modal self-supervised learning [41, 2, 26, 25]. These tasks either take one modality as input and make predictions on the other modality (e.g. visual Q&A [1, 40, 15], image captioning [5], sound localization in videos [26, 41]) or use cross-modality correspondence as self-supervision (e.g. image-audio correspondence [2], video-audio synchronization [25]). Different from these, our Gradient-Blend addresses the problem of joint multi-modal training for classification. Our Gradient-Blend training scheme is also related to work on auxiliary losses, which is widely adopted in multi-task learning [23, 10, 21, 9]. These methods either use uniform or manually-tuned weights, or learn the weights as parameters during training, while our work recalibrates the supervision signals using OGR as a prior.
7 Discussion
In single-modal networks, diagnosing and correcting overfitting typically involves manual inspection of learning curves. Here we have shown that for multi-modal networks it is essential to measure and correct overfitting in a principled way, and we have put forth a useful and practical measure of overfitting. Our proposed method, Gradient-Blend, uses this measure to obtain significant improvements over baselines, and either outperforms or is comparable with state-of-the-art methods on multiple tasks and benchmarks. We look forward to extending Gradient-Blend to a single-pass online algorithm, in which OGR estimates are made during training and learning parameters are dynamically adjusted.
References
 (1) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
 (2) R. Arandjelović and A. Zisserman. Look, listen and learn. In ICCV, 2017.
 (3) J. Arevalo, T. Solorio, M. M. y Gómez, and F. A. González. Gated multimodal units for information fusion. In ICLR Workshop, 2017.
 (4) T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:423–443, 2018.
 (5) R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. IkizlerCinbis, F. Keller, A. Muscat, and B. Plank. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Int. Res., 55(1):409–442, Jan. 2016.
 (6) Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of offtheshelf temporal modeling approaches for largescale video classification. CoRR, abs/1708.03805, 2017.
 (7) Caffe2Team. Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/.
 (8) J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
 (9) Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
 (10) D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
 (11) C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional twostream network fusion for video action recognition. In CVPR, 2016.
 (12) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visualsemantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc., 2013.
 (13) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
 (14) J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and humanlabeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
 (15) Y. Goyal, T. Khot, D. SummersStay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 (16) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 (17) S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. Cnn architectures for largescale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, March 2017.
 (18) J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In CVPR, 2018.
 (19) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. FeiFei. Largescale video classification with convolutional neural networks. In CVPR, 2014.
 (20) W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
 (21) A. Kendall, Y. Gal, and R. Cipolla. Multitask learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
 (22) D. Kiela, E. Grave, A. Joulin, and T. Mikolov. Efficient largescale multimodal classification. In AAAI, 2018.
 (23) I. Kokkinos. Ubernet: Training a ‘universal’ convolutional neural network for low, mid, and highlevel vision using diverse datasets and limited memory. arXiv preprint arXiv:1609.02132, 2016.
 (24) Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley. Audio set classification with attention model: A probabilistic perspective. In ICASSP, 2018.
 (25) B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from selfsupervised synchronization. In NeurIPS, 2018.
 (26) A. Owens and A. A. Efros. Audiovisual scene analysis with selfsupervised multisensory features. In The European Conference on Computer Vision (ECCV), September 2018.
 (27) Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
 (28) K. Simonyan and A. Zisserman. Twostream convolutional networks for action recognition in videos. In NIPS, 2014.
 (29) R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zeroshot learning through crossmodal transfer. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1, NIPS’13, pages 935–943, USA, 2013. Curran Associates Inc.
 (30) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.
 (31) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
 (32) D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
 (33) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
 (34) L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
 (35) X. Wang, R. Girshick, A. Gupta, and K. He. Nonlocal neural networks. In CVPR, 2018.
 (36) Y. Wang, J. Li, and F. Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. arXiv preprint arXiv:1810.09050, 2018.
 (37) J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2764–2770. AAAI Press, 2011.
 (38) C. Yu, K. S. Barsim, Q. Kong, and B. Yang. Multilevel attention model for weakly supervised audio classification. arXiv preprint arXiv:1803.02353, 2018.
 (39) J. YueHei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
 (40) P. Zhang, Y. Goyal, D. SummersStay, D. Batra, and D. Parikh. Yin and Yang: Balancing and answering binary visual questions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 (41) H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In ECCV, 2018.
Appendix A Proof of Proposition 1
We first introduce the following lemma to assist with the proof:
Lemma 1 (Scaling Invariance of the Minimization).
Given the assumptions of Proposition 1, transform the vectors by positive rescaling: $\hat{v}_k = c_k v_k$ with $c_k > 0$. Let $\{\hat{w}_k\}$ be a transformed set of weights with $\sum_k \hat{w}_k = 1$. The weights $\hat{w}^*$ that minimize $\mathbb{E}[\mathrm{OGR}^2]$ for $\{\hat{v}_k\}$ satisfy

$$\mathbb{E}\Big[\mathrm{OGR}^2\Big(\sum_k \hat{w}_k^* \hat{v}_k\Big)\Big] = \mathbb{E}\Big[\mathrm{OGR}^2\Big(\sum_k w_k^* v_k\Big)\Big] \qquad (11)$$

In other words, the optimum value of $\mathbb{E}[\mathrm{OGR}^2]$ is invariant to rescaling of the input vectors $\{v_k\}$.
Proof of Lemma 1.
Let $J^*$ be the optimal value of $\mathbb{E}[\mathrm{OGR}^2]$ given $\{v_k\}$ and $\hat{J}^*$ be the optimal value given $\{\hat{v}_k\}$. We only need to show $\hat{J}^* \le J^*$, because by symmetry we then also have $J^* \le \hat{J}^*$.

Assume a contradiction: $\hat{J}^* > J^*$, and let $w^*$ be the solution for $\{v_k\}$. Setting $\hat{w}_k = \frac{w_k^*/c_k}{\sum_j w_j^*/c_j}$, we have

$$\sum_k \hat{w}_k \hat{v}_k = \frac{1}{\sum_j w_j^*/c_j} \sum_k w_k^* v_k \qquad (12)$$

However, $\hat{w}$ in equation 12 is a feasible solution to the minimization given $\{\hat{v}_k\}$, and since $\mathrm{OGR}^2$ is invariant to positive rescaling of its argument, its objective value satisfies $\mathbb{E}[\mathrm{OGR}^2(\sum_k \hat{w}_k \hat{v}_k)] = J^*$. Therefore, we have

$$\hat{J}^* \le \mathbb{E}\Big[\mathrm{OGR}^2\Big(\sum_k \hat{w}_k \hat{v}_k\Big)\Big] = J^* < \hat{J}^* \qquad (13)$$

Therefore, the contradiction assumption is incorrect; thus $\hat{J}^* \le J^*$. By symmetry, $J^* \le \hat{J}^*$; thus $J^* = \hat{J}^*$. ∎
Proof of Proposition 1.
We create the normalized set

$$\hat{v}_k = \frac{v_k}{\langle \nabla\mathcal{L}^*, v_k \rangle}, \qquad \text{so that } \langle \nabla\mathcal{L}^*, \hat{v}_k \rangle = 1 \qquad (14)$$

and solve for coefficients $\hat{w}_k$ given $\{\hat{v}_k\}$. By Lemma 1, minimizing over $\{\hat{v}_k\}$ is equivalent to the original problem (minimizing over $\{v_k\}$). From the constraint $\sum_k \hat{w}_k = 1$, we have

$$\Big\langle \nabla\mathcal{L}^*, \sum_k \hat{w}_k \hat{v}_k \Big\rangle = \sum_k \hat{w}_k = 1 \qquad (15)$$

so the denominator of $\mathrm{OGR}^2$ is fixed. Writing $\hat{\epsilon}_k = \epsilon_k / \langle \nabla\mathcal{L}^*, v_k \rangle$ for the rescaled overfitting components, the problem simplifies to:

$$\hat{w}^* = \arg\min_{\sum_k \hat{w}_k = 1} \mathbb{E}\Big[\Big\| \sum_k \hat{w}_k \hat{\epsilon}_k \Big\|^2\Big] \qquad (16)$$

We first compute the expectation:

$$\mathbb{E}\Big[\Big\| \sum_k \hat{w}_k \hat{\epsilon}_k \Big\|^2\Big] = \sum_k \hat{w}_k^2 \hat{\sigma}_k^2 \qquad (17)$$

where $\hat{\sigma}_k^2 \equiv \mathbb{E}[\|\hat{\epsilon}_k\|^2]$, and the cross terms vanish by the assumption $\mathbb{E}[\langle \epsilon_i, \epsilon_j \rangle] = 0$ for $i \neq j$.

We apply Lagrange multipliers to our objective function (equation 17) and constraint (equation 15):

$$\Lambda(\hat{w}, \lambda) = \sum_k \hat{w}_k^2 \hat{\sigma}_k^2 - \lambda\Big(\sum_k \hat{w}_k - 1\Big) \qquad (18)$$

The partial gradient of $\Lambda$ is given by

$$\frac{\partial \Lambda}{\partial \hat{w}_k} = 2\hat{w}_k \hat{\sigma}_k^2 - \lambda \qquad (19)$$

Setting the partial gradient to zero gives:

$$\hat{w}_k = \frac{\lambda}{2\hat{\sigma}_k^2} \qquad (20)$$

Applying the constraint gives:

$$\lambda \sum_k \frac{1}{2\hat{\sigma}_k^2} = 1 \qquad (21)$$

In other words,

$$\hat{w}_k = \frac{1}{\hat{Z}} \cdot \frac{1}{2\hat{\sigma}_k^2}, \qquad \hat{Z} = \sum_k \frac{1}{2\hat{\sigma}_k^2} \qquad (22)$$

The normalized variance and original variance are related by

$$\hat{\sigma}_k^2 = \frac{\sigma_k^2}{\langle \nabla\mathcal{L}^*, v_k \rangle^2} \qquad (23)$$

And the original problem is related to the normalized problem by:

$$\sum_k \hat{w}_k \hat{v}_k = \sum_k \frac{\hat{w}_k}{\langle \nabla\mathcal{L}^*, v_k \rangle} v_k, \qquad \text{so } w_k \propto \frac{\hat{w}_k}{\langle \nabla\mathcal{L}^*, v_k \rangle} = \frac{1}{\hat{Z}} \cdot \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{2\sigma_k^2} \qquad (24)$$

Using $Z = \sum_k \frac{\langle \nabla\mathcal{L}^*, v_k \rangle}{2\sigma_k^2}$ to normalize the weights, we get $\sum_k w_k^* = 1$. ∎
Note: if we relax the assumption that for , the proof proceeds similarly, although from (17) it becomes more convenient to proceed in matrix notation. Define a matrix with entries given by
Then one finds that
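As a standalone numeric sanity check of the inverse-variance weighting derived in this appendix (illustrative only, with made-up noise levels of our choosing), the following compares the analytic optimum against random feasible weights:

```python
import random

def objective(w, sigmas):
    # Under the uncorrelated-noise assumption and sum_k w_k = 1, the
    # objective reduces to sum_k w_k^2 * sigma_k^2 (equation 17).
    return sum(wk * wk * s * s for wk, s in zip(w, sigmas))

def optimal_weights(sigmas):
    # w_k proportional to 1 / sigma_k^2, normalized to sum to 1.
    inv = [1.0 / (s * s) for s in sigmas]
    z = sum(inv)
    return [x / z for x in inv]

sigmas = [1.0, 2.0, 0.5]          # made-up per-estimator noise levels
w_star = optimal_weights(sigmas)
best = objective(w_star, sigmas)

random.seed(0)
for _ in range(1000):             # no random feasible w beats the optimum
    raw = [random.random() + 1e-9 for _ in sigmas]
    z = sum(raw)
    w = [x / z for x in raw]
    assert objective(w, sigmas) >= best - 1e-12
```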
Appendix B Subsampling and Balancing a Multi-label Dataset
For a single-label dataset, one can subsample and balance at a per-class level so that each class has the same volume of data. Unlike in a single-label dataset, the classes in a multi-label dataset can be correlated, so sampling a single example may add volume to more than one class. This makes the naive per-class subsampling approach difficult.
To uniformly subsample and balance AudioSet to obtain mini-AudioSet, we propose the following algorithm:
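As an illustrative sketch only (a greedy variant with hypothetical names that we chose, not necessarily the exact mini-AudioSet procedure): visit the examples in random order and keep an example only if at least one of its labels is still below the per-class target.

```python
import random

def subsample_multilabel(samples, target_per_class, seed=0):
    """samples: list of (sample_id, set_of_labels) pairs."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    counts = {}   # label -> number of kept examples containing it
    kept = []
    for sid, labels in order:
        # keep the example if it helps any still-underfilled class
        if any(counts.get(l, 0) < target_per_class for l in labels):
            kept.append(sid)
            for l in labels:
                counts[l] = counts.get(l, 0) + 1
    return kept, counts

samples = [(0, {"dog"}), (1, {"dog", "bark"}), (2, {"bark"}), (3, {"dog"})]
kept, counts = subsample_multilabel(samples, target_per_class=1)
```

Because labels are correlated, kept counts can overshoot the target: a kept example adds volume to every class it contains, which is exactly the difficulty noted above.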
Appendix C Details on Model Architectures
C.1 Late Fusion by Concatenation
In the late-fusion-by-concatenation strategy, we concatenate the output features from the individual networks (i.e., each modality's 1D feature vector). If needed, we add dropout after the feature concatenation.
The fusion network is composed of two FC layers, each followed by a ReLU, and a linear classifier. The first FC layer maps the concatenated features to a hidden dimension, the second maps that to a second hidden dimension, and the classifier maps to the number of classes.
As a sanity check, we experimented with fewer and more layers on Kinetics:

0 FC. We add only a linear classifier that maps the concatenated features directly to class scores.

1 FC. We add one FC layer, followed by a ReLU and a classifier that maps to class scores.

4 FC. We add one FC layer followed by a ReLU, then three more FC-ReLU pairs that preserve the dimension, and finally a classifier.
We noticed that the results of all these variants are suboptimal. We speculate that fewer layers may fail to fully learn the relations between the features, while a deeper fusion network overfits more.
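The two-FC fusion head described above can be sketched in numpy as follows; the dimensions and initialization are placeholders of our choosing, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# placeholder dimensions: audio / visual feature sizes, hidden size, classes
d_a, d_v, d_hid, n_cls = 128, 512, 256, 10

W1 = rng.standard_normal((d_a + d_v, d_hid)) * 0.01  # first FC layer
W2 = rng.standard_normal((d_hid, d_hid)) * 0.01      # second FC layer
Wc = rng.standard_normal((d_hid, n_cls)) * 0.01      # linear classifier

def late_fusion_head(f_audio, f_visual):
    # late fusion: concatenate the per-modality 1D feature vectors
    x = np.concatenate([f_audio, f_visual], axis=-1)
    h = relu(x @ W1)           # FC + ReLU
    h = relu(h @ W2)           # FC + ReLU
    return h @ Wc              # class logits

logits = late_fusion_head(rng.standard_normal(d_a), rng.standard_normal(d_v))
assert logits.shape == (n_cls,)
```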
C.2 Mid Fusion by Concatenation
Inspired by Owens et al. (ECCV 2018), we also concatenate the features from each stream at an intermediate stage rather than fusing late. The difficulty with mid fusion is that features from the individual streams can have different dimensionalities. For example, audio features are 2D (time-frequency) while visual features are 3D (time-height-width).
We propose three ways to match the dimension, depending on the output dimension of the concatenated features:

1D Concat. We downsample the audio features to 1D by average pooling on the frequency dimension. We downsample the visual features to 1D by average pooling over the two spatial dimensions.

2D Concat. We keep the audio features unchanged and match the visual features to them. We downsample the visual features to 1D by average pooling over the two spatial dimensions, then tile the 1D visual features along the frequency dimension to make 2D visual features.

3D Concat. We keep the visual features fixed and match the audio features to them. We downsample the audio features to 1D by average pooling over the frequency dimension, then tile the 1D audio features along the two spatial dimensions to make 3D features.
The temporal dimension may also be mismatched between the streams: the audio stream is usually longer than the visual stream. We add convolution layers with a stride of 2 to downsample the audio stream when performing a 2D concat; otherwise, we upsample the visual stream by replicating features along the temporal dimension.
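The 2D-concat matching above can be sketched in numpy as follows; all shapes are illustrative assumptions, and the temporal dimensions are taken to match already:

```python
import numpy as np

def match_visual_to_audio_2d(visual, freq_bins):
    # visual: (C, T, H, W) -> average-pool the two spatial dims -> (C, T)
    pooled = visual.mean(axis=(2, 3))
    # tile along a new frequency axis -> (C, T, F), matching the audio layout
    return np.repeat(pooled[:, :, None], freq_bins, axis=2)

audio = np.zeros((64, 16, 40))    # (channels, time, frequency), illustrative
visual = np.ones((64, 16, 7, 7))  # (channels, time, height, width)

visual_2d = match_visual_to_audio_2d(visual, freq_bins=audio.shape[2])
fused = np.concatenate([audio, visual_2d], axis=0)  # concat on channels
assert fused.shape == (128, 16, 40)
```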
There are five blocks in the backbones of our ablation experiments (section 4), and we fuse the features using all three strategies after block 2, block 3, and block 4. Due to memory constraints, fusion using 3D concat after block 2 is infeasible. On Kinetics, we found that 3D concat after block 3 works best, and this variant is reported in Table 2. In addition, we found that 2D concat works best on AudioSet and uses fewer GFLOPs than 3D concat. We speculate that the best method for dimension matching is task-dependent.
C.3 SE Gate
The Squeeze-and-Excitation network (SENet, Hu et al.) applies a self-gating mechanism to produce a collection of per-channel weights. A similar strategy can be applied in a multimodal network to take inputs from one stream and produce channel weights for the other stream.
Specifically, we perform global average pooling on one stream and use the same architecture as SENet to produce a set of weights for the channels of the other stream, which we then scale by the learned weights. We either use a ResNet-style skip connection to add the gated features back or directly replace the features with the scaled ones. The gate can be applied from one stream to the other, or in both directions, and it can be added multiple times at different levels. We found that on Kinetics it works best when applied after block 3 and in both directions.
We note that we can also first concatenate the features and use both streams to learn the per-channel weights. The results are similar to learning the weights from a single stream.
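A minimal numpy sketch of such a cross-modal SE-style gate (the channel count and reduction ratio are assumed values), gating the visual stream with weights squeezed from the audio stream:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C, r = 64, 16  # channels and SENet-style reduction ratio (assumed values)
W1 = rng.standard_normal((C, C // r)) * 0.1
W2 = rng.standard_normal((C // r, C)) * 0.1

def se_gate(src, dst, residual=True):
    # src, dst: (C, T, H, W); squeeze src into per-channel weights for dst
    z = src.mean(axis=(1, 2, 3))               # global average pool
    s = sigmoid(relu(z @ W1) @ W2)             # FC -> ReLU -> FC -> sigmoid
    gated = dst * s[:, None, None, None]       # scale dst's channels
    return dst + gated if residual else gated  # skip connection or replace

audio = rng.standard_normal((C, 4, 1, 1))
visual = rng.standard_normal((C, 4, 7, 7))
out = se_gate(audio, visual)
assert out.shape == visual.shape
```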
C.4 NL Gate
Although lightweight, the SE gate offers no spatial-temporal or frequency-temporal attention. An alternative is an attention-based gate. We are inspired by the Query-Key-Value formulation of attention in Vaswani et al. (2017). For example, if we are gating from the audio stream to the visual stream, then the visual stream provides the Query and the audio stream provides the Key and Value; the output has the same spatial-temporal dimensions as the Query.
Specifically, we use the Non-Local block of Wang et al. (2018) as the implementation of the Query-Key-Value attention mechanism. Details of the design are illustrated in fig. 4. Similar to the SE gate, the NL gate can be added in multiple directions and at multiple positions. We found that it works best when added after block 4, with a 2D concat of audio and RGB features as Key and Value and the visual features as Query to gate the visual stream.
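The Query-Key-Value gating direction described above can be sketched in numpy as follows; the feature sizes and projection matrices are our own assumptions, with flattened visual positions as Query and audio positions as Key/Value:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C, d = 64, 32  # channel width and attention dimension (assumed values)
Wq = rng.standard_normal((C, d)) * 0.1
Wk = rng.standard_normal((C, d)) * 0.1
Wv = rng.standard_normal((C, C)) * 0.1

def nl_gate(query_feats, kv_feats):
    # query_feats: (Nq, C) flattened visual positions (the Query stream)
    # kv_feats:    (Nk, C) flattened audio positions (Key and Value)
    Q, K, V = query_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (Nq, Nk)
    return query_feats + attn @ V  # residual; output keeps the Query's shape

visual = rng.standard_normal((49, C))  # e.g. a flattened 7x7 spatial grid
audio = rng.standard_normal((20, C))
out = nl_gate(visual, audio)
assert out.shape == visual.shape
```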
Appendix D Additional Ablation Results
D.1 Training Accuracy
In section 4.2, we introduced the overfitting problem in joint training of multimodal networks. Here we include both validation accuracy and train accuracy for the multimodal problems (Table 8). In all cases, the multimodal networks perform worse than their best single-stream counterparts, while in almost all cases their train accuracy is higher (the sole exception being OF+A, whose train accuracy is similar to that of the audio network).
Dataset     Modality      Validation Accuracy   Train Accuracy
Kinetics    A             19.7                  85.9
            RGB           72.6                  90.0
            OF            62.1                  75.1
            A + RGB       71.4                  95.6
            RGB + OF      71.3                  91.9
            A + OF        58.3                  83.2
            A + RGB + OF  70.0                  96.5
mini-Sport  A             22.1                  56.1
            RGB           62.7                  77.6
            A + RGB       60.2                  84.2
D.2 Early Stopping
For early stopping, we experimented with three stopping schedules: using 25%, 50%, and 75% of the iterations per epoch. We found that although overfitting becomes less of a problem, the model tends to underfit. In practice, the 75% schedule works best among the three, though its performance is still worse than that of the full training schedule, which suffers from overfitting. We summarize the learning curves in fig. 5.