Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers
Abstract
Many machine learning problems concern discovering or associating common patterns in data of multiple views or modalities, and multi-view learning is one family of methods to achieve such goals. Recent methods propose deep multi-view networks via adaptation of generic Deep Neural Networks (DNNs), which concatenate features of individual views at intermediate network layers (i.e., fusion layers). In this work, we study the problem of multi-view learning in such end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected or convolutional fusion layers, simply by replacing them with their CorrReg counterparts. By partitioning the neurons of a hidden layer in generic DNNs into multiple subsets, we also consider a multi-view feature learning perspective of generic DNNs. Such a perspective enables us to study deep multi-view learning in the context of regularized network training, for which we present control experiments of benchmark image classification to show the efficacy of our proposed CorrReg. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments of RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to the fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D object and RGB-D scene datasets. We make our implementation of CorrReg publicly available.
I. Introduction
Many machine learning problems concern discovering or associating common patterns in data of multiple views or modalities. Typical applications include retrieving images from texts or vice versa, combining visual and audio signals for content understanding, and object recognition from visual observations of multiple modalities. Data of different views usually contain complementary information, whose statistical distributions in the high-dimensional measurements of individual views may also differ. Multi-view learning methods aim to exploit the information contained in multiple views to better accomplish specified learning tasks. In this work, we take image classification, in particular multi-view or multi-modal object recognition (e.g., recognizing objects from RGB and depth images), as the primary example to study the problem of multi-view learning.
Given feature observations of different views, existing multi-view learning approaches learn latent space representations in either deterministic [1, 2, 3, 4] or probabilistic manners [5, 6, 7, 8]. The learning objective is to make the resulting features of different views more closely related to each other at each dimension of the latent space, where relations may be measured by different metrics/criteria [9, 10, 4]. Among various techniques, Canonical Correlation Analysis (CCA) [1] and its extensions [11, 12, 2] are the most representative. For example, given two-view data, CCA learns pairs of linear projections such that, in the projected space, features of the two views are maximally correlated at corresponding dimensions.
Following the success of deep learning [13, 14], deep multi-view learning methods [2, 15] have also been proposed recently for learning deep features from multi-view data. These methods apply multi-view learning criteria (e.g., CCA) on top of multiple single-view deep networks (cf. Figure 1(a) for an illustration). A two-stage scheme of iterative learning is usually adopted to train the network parameters: view-specific features are learned up to the very top layers, to which one applies either a sequential step of multi-view criteria followed by the objectives of the specified learning tasks, or regularized learning objectives that seek a balance between multi-view criteria and the final tasks of interest. Alternatively, one may design deep architectures that concatenate, at intermediate network layers (fusion layers), output features of lower, parallel layer streams for individual views [3, 8, 16], followed by upper network layers for specified learning tasks (e.g., image classification; cf. Figure 1(b) for an illustration). Such end-to-end networks have the advantage that the final tasks of interest are achieved directly at the network outputs. However, output features of the lower, parallel streams in such networks capture view-specific patterns, which may not be aligned in a common space for a ready fusion in the subsequent layers.
To enjoy the advantage of end-to-end learning while collaboratively benefiting from different views, multi-view learning criteria could be exploited to improve the relations between the resulting features of the lower, parallel streams. Under the framework of regularized function learning, this amounts to training network parameters by penalizing objectives of the main learning tasks with correlation-maximizing regularization at fusion layers (cf. Figure 1(b)). In this work, we are interested in CCA criteria, since they have long been the main workhorse for multi-view learning [11, 17, 10]. Empirical results also show that CCA based approaches outperform alternative ones in the context of deep multi-view representation learning [15]. Directly using CCA as the regularizer makes network training very expensive, and is also incompatible with mini-batch based stochastic gradient descent (SGD), the commonly used algorithm in deep network training (cf. Section II for a discussion). Inspired by batch normalization [18], we propose in this paper a novel neuron-wise correlation-maximizing regularizer, and implement the proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected (FC) or convolutional (conv) fusion layers, simply by replacing these layers with their CorrReg versions (cf. Figure 3 for an illustration). A CorrReg fusion layer has the same computational complexity as a plain one does, which is significantly lower than that of CCA regularization.
We note that the fusion layer in a multi-view network of Figure 1(b) is computationally equivalent to any hidden layer in a generic deep neural network (DNN), by partitioning the neurons of the hidden layer into multiple subsets (cf. Figure 1(c) for an illustration). This equivalence suggests a multi-view feature learning perspective of generic DNNs: when considering hidden neurons of generic DNNs as pattern detectors that characterize different patterns of the input data [19], the features learned at each neuron subset of a hidden layer can be considered as a specific view of the input data. Ideally, each such feature view should learn salient or discriminative patterns of the input data that also generalize to unseen data. However, there exists an issue of overfitting, a phenomenon in which specific subsets of layer neurons are trained to be co-adapted to certain patterns in the training data, but cannot generalize well to held-out test data [20]. As suggested by traditional learning theory [21], the risk of overfitting could be severe for generic DNNs, since modern DNNs have large model capacities and are usually over-parameterized, in the sense that they can be trained to fit randomly labelled datasets [22]. Such a risk can be addressed implicitly by SGD training [23, 24, 25], and explicitly by additional regularization [20, 18]. Our proposed CorrReg takes the second, regularization approach: improving correlations between neuron subsets reduces co-adaptation to possibly noisy or irrelevant, subset-specific patterns, and thus alleviates the problem of overfitting. Such a connection with generic DNNs enables us to study CorrReg in the context of regularized network training, and to compare it with modern regularization techniques (e.g., Dropout [20]) for deep multi-view representation learning.
We finally summarize our contributions as follows.

We study in this work the problem of learning deep representations from multi-view data in end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg), which will be made publicly available. CorrReg can be applied to either FC or conv based fusion layers, simply by replacing these layers with their CorrReg versions. A CorrReg fusion layer has the same computational complexity as a plain one does, which is significantly lower than that of CCA regularization.

We consider a multi-view feature learning perspective of generic DNNs, by partitioning the neurons of a hidden layer in a generic DNN into multiple subsets. Such a connection with generic DNNs enables us to study deep multi-view representation learning in the context of regularized network training, and to compare CorrReg with modern regularization techniques (e.g., Dropout [20]). We note that such a comparison is largely ignored in existing deep multi-view learning methods.

To investigate the efficacy of CorrReg for regularization of network training, we conduct control experiments on benchmark image classification [26] using generic DNNs [27, 28, 29, 30]. CorrReg consistently improves the performance of these networks. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments of RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to the fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D Object [31] and RGB-D Scene [32] datasets.
II. The Proposed Neuron-Wise Correlation-Maximizing Regularizers
In this section, we first use generic DNNs to present our proposed regularization method, and explain how it is applied to their hidden layers. Our method readily applies to fusion layers of deep networks that take practical multi-view data, as will be introduced in Section III.
We start with a DNN composed of FC layers. Denote its network parameters as $\mathcal{W} = \{W^l, b^l\}_{l=1}^{L}$, where $W^l$ and $b^l$ are respectively the weight matrix and bias vector associated with the $l$-th network layer. In the setting of supervised learning, given training samples $\{(x_i, y_i)\}_{i=1}^{N}$ of categorical data, the network parameters in $\mathcal{W}$ are optimized by minimizing the empirical risk $\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}(f(x_i;\mathcal{W}), y_i)$, where $\mathcal{L}$ is a properly chosen loss function, e.g., the cross-entropy loss for image classification, and optimization is typically based on SGD or its variants [33]. As discussed in Section I, DNNs of high model capacities are able to learn complex functions but are susceptible to overfitting. As a remedy, one may apply regularization to reduce their model capacities. Adding regularization to the network training objective results in the following optimization problem
$$\min_{\mathcal{W}} \; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(f(x_i;\mathcal{W}),\, y_i\big) \;+\; \lambda\,\Omega(\mathcal{W}), \qquad (1)$$
where $\Omega(\cdot)$ is the regularizer to be specified, and $\lambda$ is a trade-off parameter.
In this work, we are interested in regularizing network training using CCA based multi-view learning criteria [2, 15]. More specifically, for a specified network layer $l$ and a training sample $x$, denote by $z \in \mathbb{R}^{d}$ the feature vector of $x$ computed all the way up from the input layer to the $(l-1)$-th layer. The $l$-th layer computes $a = \sigma(W^{\top} z + b)$, where $\sigma(\cdot)$ is an element-wise nonlinear activation function such as ReLU [34], and $W \in \mathbb{R}^{d \times h}$ and $b \in \mathbb{R}^{h}$ are the weight matrix and bias vector respectively; we omit the layer superscript $l$ to keep the following notations clear. By randomly partitioning the input neurons/dimensions of the $l$-th layer into two subsets (for simplicity, we only consider in this work the case of partitioning the neurons of a hidden layer in a generic DNN into two subsets), we get feature sub-vectors $z_1$ and $z_2$ from $z$, and the corresponding weight sub-matrices $W_1$ and $W_2$ from $W$. We simply have $W^{\top} z = W_1^{\top} z_1 + W_2^{\top} z_2$. Discussions in Section I suggest that $z_1$ and $z_2$ can be analogously considered as two feature views of the input data. The co-adaptation of features in each view is likely to cause network training to pay more attention to the extraction of view-specific patterns, rather than the category-related patterns that network training is desired to learn. To address this issue and benefit more from both views of the features, one may use CCA criteria to regularize network training, which aim to increase the feature correlations between the two views. Given the two-view features $Z_1 = [z_1^1, \dots, z_1^N]$ and $Z_2 = [z_2^1, \dots, z_2^N]$ of the training set, applying CCA to the $l$-th network layer amounts to optimizing $W_1$ and $W_2$ by
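As a concrete illustration, the two-way partition and the resulting decomposition of the layer's pre-activation output can be sketched in NumPy; this is a minimal sketch, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 8, 4                       # input/output dimensions of the layer
z = rng.standard_normal(d)        # features z computed from the layers below
W = rng.standard_normal((d, h))   # layer weight matrix (columns = output neurons)

# Random two-way partition of the input neurons/dimensions, fixed once drawn.
perm = rng.permutation(d)
idx1, idx2 = perm[: d // 2], perm[d // 2:]

z1, z2 = z[idx1], z[idx2]         # the two "feature views" z1 and z2
W1, W2 = W[idx1, :], W[idx2, :]   # corresponding weight sub-matrices

# The pre-activation output decomposes over the two views:
# W^T z = W1^T z1 + W2^T z2.
assert np.allclose(W.T @ z, W1.T @ z1 + W2.T @ z2)
```

The decomposition holds exactly because matrix-vector multiplication is a sum over input dimensions, which can be split by any partition of those dimensions.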
$$\max_{W_1, W_2} \; \operatorname{tr}\big(W_1^{\top} Z_1 Z_2^{\top} W_2\big) \quad \text{s.t.} \quad W_1^{\top} Z_1 Z_1^{\top} W_1 = I, \;\; W_2^{\top} Z_2 Z_2^{\top} W_2 = I, \qquad (2)$$
where $I$ is an identity matrix of compatible size. The data matrices $Z_1$ and $Z_2$ have been assumed centered for simplicity. Applying the above problem to the $l$-th network layer also assumes implicitly that the number of projection directions in each of $W_1$ and $W_2$ equals the number $h$ of output neurons of the layer.
When directly using (2) as the regularizer $\Omega$ in (1), network training requires solving (2) with stochastic optimization methods. Unfortunately, as pointed out in [9], the objective (2) does not easily admit stochastic optimization due to the involvement of data covariance matrices in the constraints. One may use batch gradient descent to solve (2): it computes gradients of the correlation objective w.r.t. the CCA-projected features $W_1^{\top} Z_1$ and $W_2^{\top} Z_2$, which in turn are used through backpropagation to compute the gradients w.r.t. $W_1$ and $W_2$, and w.r.t. all the network parameters in the layers below [2]. This is expensive, as it involves computing covariance matrices (of $Z_1$ and $Z_2$), their inverse square roots, and also performing matrix singular value decomposition (SVD). One may nevertheless try mini-batch based gradient descent to solve (2), which, however, may produce singular data covariance matrices; [2] also points out that solving (2) by mini-batch based stochastic optimization empirically gives unsatisfactory results. Given these challenges of directly using CCA as the regularizer $\Omega$, we are motivated to find an alternative way to improve the correlations between the two feature views $z_1$ and $z_2$.
Inspired by batch normalization [18], we propose to simplify the full CCA regularization in DNNs by considering the following two aspects. Firstly, instead of learning $W_1$ and $W_2$ to jointly increase correlations at all dimensions of the resulting features, we propose to learn $w_1^k$ and $w_2^k$, $k = 1, \dots, h$, independently for each output neuron of the layer, where $w_1^k$ and $w_2^k$ are the $k$-th columns of $W_1$ and $W_2$ respectively. Note that such a decoupling suggests that the resulting features at different dimensions could be correlated, but at the same time the whitening constraints of (2) are also relaxed, enabling its flexible use in DNNs as tasks demand. Secondly, we use mini-batches of training samples, rather than all of them, to approximate the statistics (i.e., mean and variance) necessary for computing correlations. This second simplification is enabled by the independent neuron-wise learning of $w_1^k$ and $w_2^k$: in the joint case, the mini-batch size would be required to be big enough to avoid singularity of the covariance matrices.
For a specified output neuron of the $l$-th layer, we first introduce the (scalar) random variables $y_1$ and $y_2$, whose samples are respectively computed as $y_1 = w_1^{\top} z_1$ and $y_2 = w_2^{\top} z_2$, as illustrated in Figure 2, where we omit the neuron index $k$ for notational clarity. Given a mini-batch of $m$ training samples, we propose in this work the following neuron-wise correlation-maximizing regularizer
$$\Omega(w_1, w_2) = -\, \frac{\frac{1}{m}\sum_{i=1}^{m} (y_1^i - \bar{y}_1)(y_2^i - \bar{y}_2)}{\sqrt{\big(\frac{1}{m}\sum_{i=1}^{m} (y_1^i - \bar{y}_1)^2 + \epsilon\big)\big(\frac{1}{m}\sum_{i=1}^{m} (y_2^i - \bar{y}_2)^2 + \epsilon\big)}}, \qquad (3)$$
where the scalar $\epsilon > 0$ is introduced for numerical stability, and $\bar{y}_1 = \frac{1}{m}\sum_{i=1}^{m} y_1^i$ and $\bar{y}_2 = \frac{1}{m}\sum_{i=1}^{m} y_2^i$, with $y_1^i = w_1^{\top} z_1^i$ and $y_2^i = w_2^{\top} z_2^i$ computed from the $i$-th sample of the mini-batch.
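A minimal NumPy implementation of this neuron-wise regularizer might look as follows; the function name is illustrative, and the placement of the stabilizing constant follows the reconstruction of (3) above:

```python
import numpy as np

def corr_reg(w1, w2, Z1, Z2, eps=1e-5):
    """Neuron-wise correlation-maximizing regularizer: the negative Pearson
    correlation, estimated on a mini-batch, between the two internal features
    y1 = Z1 w1 and y2 = Z2 w2 of one output neuron."""
    y1, y2 = Z1 @ w1, Z2 @ w2                    # internal features, shape (m,)
    c1, c2 = y1 - y1.mean(), y2 - y2.mean()      # centered by the batch means
    cov = (c1 * c2).mean()                       # batch covariance
    v1, v2 = (c1 ** 2).mean(), (c2 ** 2).mean()  # batch variances
    return -cov / np.sqrt((v1 + eps) * (v2 + eps))
```

When the two views carry identical features (e.g., `corr_reg(w, w, Z, Z)`), the value approaches the regularizer's minimum of -1; it is bounded in [-1, 1] for any input.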
Based on the neuron-wise regularizer (3), we specialize the general objective (1) as the following regularized problem to improve network training
$$\min_{\mathcal{W}} \; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(f(x_i;\mathcal{W}),\, y_i\big) \;+\; \lambda \sum_{k \in \mathcal{K}} \Omega(\mathcal{W}_k), \qquad (4)$$
where $k \in \mathcal{K}$ indexes the group of neurons in the network layers that are specified to apply regularization, $\mathcal{W}_k$ denotes the subset of network parameters involved in the computation of the incoming features of neuron $k$, and we have slightly abused the notation $\Omega$ with that in (3). Note that $\mathcal{W}_k$ and $\mathcal{W}_{k'}$ may contain overlapping parameters, i.e., those parameters associated with the common layers below neurons $k$ and $k'$. The main objective (4) can be optimized using SGD or its variants [33], by sampling mini-batches of training samples in iterative steps. Details are presented shortly. Complexity analysis presented in Section II-A1 shows that, compared with standard SGD training, our regularizer (3) increases computation cost only by a constant factor.
II-A. Correlation-Regularized Network Layer
We still use the illustration in Figure 2 as the running example. In the forward pass of a mini-batch of size $m$, any output neuron of the $l$-th layer that is specified to apply the regularization (3) computes $y_1^i = w_1^{\top} z_1^i$ and $y_2^i = w_2^{\top} z_2^i$, $i = 1, \dots, m$, which give output features of the neuron, before nonlinear activation $\sigma$, as $y^i = y_1^i + y_2^i + b$, $i = 1, \dots, m$, where $b$ is the bias associated with this neuron. (For each neuron to which the regularization is applied, we also experimented with introducing trainable scalar parameters $\alpha_1$ and $\alpha_2$ to rescale the internal features $y_1$ and $y_2$, i.e., $y = \alpha_1 y_1 + \alpha_2 y_2 + b$, similar to the scheme introduced in [35]. In our experiments this alternative scheme does not necessarily improve the performance.) In the backward pass, we need to compute the gradients of the neuron-wise regularizer w.r.t. the weight vectors $w_1$ and $w_2$, and also those w.r.t. the input features $z_1^i$ and $z_2^i$, $i = 1, \dots, m$. These gradients can be derived via the multivariable chain rule, and we give their explicit forms in Appendix A. Note that the latter are used to compute, through backpropagation, gradients of the regularizer w.r.t. network parameters in the lower layers that are also involved in the computation of $z_1$ and $z_2$.
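The analytic gradient forms are given in Appendix A and are not reproduced here. As a sanity-check sketch, assuming the regularizer is implemented as the mini-batch negative Pearson correlation (all names illustrative), one can verify numerically that descending the regularizer drives the two internal features toward higher correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_corr(w1, w2, Z1, Z2, eps=1e-5):
    # Negative Pearson correlation of y1 = Z1 w1 and y2 = Z2 w2 over a mini-batch.
    y1, y2 = Z1 @ w1, Z2 @ w2
    c1, c2 = y1 - y1.mean(), y2 - y2.mean()
    return -(c1 * c2).mean() / np.sqrt(
        ((c1 ** 2).mean() + eps) * ((c2 ** 2).mean() + eps))

def num_grad(f, w, h=1e-6):
    # Central-difference gradient; a stand-in for the analytic forms.
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e) - f(w - e)) / (2 * h)
    return g

m, d1, d2 = 32, 3, 3
Z1, Z2 = rng.standard_normal((m, d1)), rng.standard_normal((m, d2))
w1, w2 = rng.standard_normal(d1), rng.standard_normal(d2)

before = neg_corr(w1, w2, Z1, Z2)
for _ in range(200):  # gradient descent on the regularizer alone
    w1 = w1 - 0.05 * num_grad(lambda w: neg_corr(w, w2, Z1, Z2), w1)
    w2 = w2 - 0.05 * num_grad(lambda w: neg_corr(w1, w, Z1, Z2), w2)
after = neg_corr(w1, w2, Z1, Z2)  # lower than `before`: correlation increased
```

In full training, these gradient steps are of course taken jointly with those of the main loss, as in (4).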
We usually apply (3) to all the output neurons of the specified layer. To make regularized network training efficient, we note that for these neurons, their respective internal features and statistics, i.e., $y_1^i$, $y_2^i$, $\bar{y}_1$, $\bar{y}_2$, and the batch variances and covariance in (3), and also the corresponding gradients, can be computed independently and in parallel. Given the mini-batch of layer inputs of the two-view features $\{(z_1^i, z_2^i)\}_{i=1}^{m}$, we write the gradients of the neuron-wise regularizers in the compact forms $\partial\Omega/\partial Z_1$, $\partial\Omega/\partial Z_2$, $\partial\Omega/\partial W_1$, and $\partial\Omega/\partial W_2$, where $\Omega$ denotes the correlation objectives compactly. More specifically, $\partial\Omega/\partial Z_1$ sums the gradients of the neuron-wise regularizers in the layer w.r.t. the input $Z_1$, and $\partial\Omega/\partial W_1$ independently computes the gradient of each neuron-wise regularizer w.r.t. its associated weight vector $w_1^k$; the same operations apply to $\partial\Omega/\partial Z_2$ and $\partial\Omega/\partial W_2$.
In other words, we implement our proposed scheme (3) as a correlation-regularized network layer (CorrReg): in the forward pass, the computation is the same as in a standard network layer, i.e., we do not explicitly compute the internal features $y_1$ and $y_2$, and instead directly compute $y = W_1^{\top} z_1 + W_2^{\top} z_2 + b$, followed by element-wise nonlinear activation; in the backward pass, we compute the gradients of the correlation-regularized loss w.r.t. the layer weights and layer inputs, which we spell out as

$$\frac{\partial (\mathcal{L} + \lambda\Omega)}{\partial W_v} = \frac{\partial \mathcal{L}}{\partial W_v} + \lambda \frac{\partial \Omega}{\partial W_v}, \qquad \frac{\partial (\mathcal{L} + \lambda\Omega)}{\partial Z_v} = \frac{\partial \mathcal{L}}{\partial Z_v} + \lambda \frac{\partial \Omega}{\partial Z_v}, \qquad v \in \{1, 2\},$$

where $\partial\mathcal{L}/\partial W_v$ and $\partial\mathcal{L}/\partial Z_v$ can be computed via standard backpropagation. The gradient of the main loss $\mathcal{L}$ w.r.t. the layer bias vector $b$ can be obtained in the same way.
II-A1. Analysis of Computational Complexity
We analyze the additional computation cost incurred by imposing CorrReg on a network layer. Consider an FC layer that computes $Y = W^{\top} Z$ for a mini-batch of $m$ samples $Z \in \mathbb{R}^{d \times m}$, with layer parameters $W \in \mathbb{R}^{d \times h}$ and $b \in \mathbb{R}^{h}$. Assume arithmetic with individual elements has complexity $O(1)$. In the forward pass, the computation for the layer with or without CorrReg is the same, and its complexity is $O(mdh)$. In the backward pass, without CorrReg the complexity of backpropagating gradients through this layer is $O(mdh)$. By simplifying the gradient formulas given in Appendix A and also writing them in matrix forms, we obtain the same complexity of $O(mdh)$ when using CorrReg. In summary, imposing CorrReg on a network layer increases complexity only by a constant factor. In contrast, using the CCA objective (2) as the regularizer involves computing inverse square roots of covariance matrices of size $\frac{d}{2} \times \frac{d}{2}$, and also performing SVD on a matrix of the same size (one may refer to [2] for gradient formulas of the CCA objective); this has an overall complexity on the order of $O(md^2 + d^3)$ in the backward pass, which is significantly worse than that of CorrReg.
II-B. Correlation-Regularized Convolutional Networks
Our proposed CorrReg can regularize both FC and conv layers. As discussed above, for an FC layer computing $a = \sigma(W^{\top} z + b)$, we apply regularization to the layer input by performing a random two-way partition on the feature dimensions of $z$, producing two internal features $z_1$ and $z_2$, where the partition is fixed once determined. CorrReg indeed improves, for each output neuron, the correlation between its pair of internal features $y_1$ and $y_2$. It is straightforward to extend the above scheme of a single two-way partition to a version with multiple two-way partitions. For a specified FC layer, one may simply perform multiple random two-way partitions on the feature dimensions of $z$; each of them produces its respective internal features, and also its respective gradients w.r.t. the layer weights and layer inputs. The overall regularization imposed on this layer is obtained by averaging the gradients from these multiple two-way partitions. Experiments investigating the efficacy of this scheme of multiple two-way partitions are reported in Section V-A.
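A sketch of the multiple-partition scheme, here averaging the regularizer values themselves rather than their gradients (equivalent under differentiation; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_reg(y1, y2, eps=1e-5):
    # Negative mini-batch Pearson correlation of two internal features.
    c1, c2 = y1 - y1.mean(), y2 - y2.mean()
    return -(c1 * c2).mean() / np.sqrt(
        ((c1 ** 2).mean() + eps) * ((c2 ** 2).mean() + eps))

m, d, P = 32, 8, 4                  # mini-batch size, input dim, #partitions
Z = rng.standard_normal((m, d))     # mini-batch of layer inputs
w = rng.standard_normal(d)          # weight vector of one output neuron

# P random two-way partitions, each fixed once drawn.
partitions = [rng.permutation(d) for _ in range(P)]

# Average the neuron-wise regularizer over the P partitions.
omega = np.mean([
    corr_reg(Z[:, p[: d // 2]] @ w[p[: d // 2]],
             Z[:, p[d // 2:]] @ w[p[d // 2:]])
    for p in partitions
])
```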
For a conv layer, we perform single or multiple random two-way partitions on its input feature maps in the same way as for an FC layer. Each two-way partition produces two internal feature maps (corresponding to $y_1$ and $y_2$ in Figure 2) for each output feature map of the layer. Although features/observations at nearby locations of an image are generally correlated, we do not explicitly exploit such correlations. Instead, we independently apply regularization at each spatial location/pixel of each output feature map of the layer, so that correlations between the corresponding spatial locations in each pair of internal feature maps are improved; regularization is again applied before nonlinear activation. Applying our proposed CorrReg to modern DNN architectures (e.g., ConvNets [36], variants of ResNets [28, 29], or DenseNets [30]) is very simple: one simply replaces FC or conv layers with their CorrReg versions.
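The per-location scheme for conv layers can be sketched as follows, assuming the two internal feature maps of one output map have already been computed from the channel partition; names and sizes are illustrative, and each correlation is estimated over the mini-batch dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_reg(y1, y2, eps=1e-5):
    # Negative mini-batch Pearson correlation (as in the FC case).
    c1, c2 = y1 - y1.mean(), y2 - y2.mean()
    return -(c1 * c2).mean() / np.sqrt(
        ((c1 ** 2).mean() + eps) * ((c2 ** 2).mean() + eps))

# Internal feature maps of one output map of a conv layer, obtained from a
# two-way partition of the input channels: shape (mini-batch, height, width).
m, H, W = 16, 5, 5
Y1 = rng.standard_normal((m, H, W))
Y2 = rng.standard_normal((m, H, W))

# Regularization applied independently at each spatial location/pixel; the
# overall penalty for this output map averages the per-location terms.
omega = np.mean([[corr_reg(Y1[:, i, j], Y2[:, i, j]) for j in range(W)]
                 for i in range(H)])
```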
III. Use of CorrReg Fusion in Deep Representation Learning from Multi-View Data
We present in this section how CorrReg can be readily applied to deep networks that take as inputs data of multiple views/modalities and learn deep representations from them. Different from generic DNNs, these networks by design have lower, parallel streams for data of individual views, and it is natural to apply CorrReg to the fusion layers where features of different views are concatenated for use in the subsequent layers. The fusion layer in such a network is usually based on FC or conv layers. To use CorrReg, one may simply treat the output features of the two parallel streams as the input two-view features of CorrReg (corresponding to $z_1$ and $z_2$ in Section II), and replace the fusion layer with its CorrReg version, resulting in a CorrReg fusion layer. Figure 3 gives an illustration. In the following, we take the network architectures used in our RGB-D recognition experiments to instantiate the use of CorrReg fusion.
Our networks for RGB-D object recognition and scene recognition are based on ConvNets [13] and ResNets [28]. Take an 8-layer ConvNet as an example. We modify it to take inputs of both RGB and depth channels: the modified ConvNet is composed of two lower, parallel streams followed by an upper, single stream. Outputs of the two lower streams are concatenated as inputs of the upper stream, and CorrReg is applied to the first layer of the upper stream, which thus becomes a CorrReg fusion layer. In this work, we also investigate the effects of placing the CorrReg fusion layer at different "heights" (lower, middle, or upper layers) of the network. Figure 5 gives the illustration. Adaptation of ResNets is similar. We use such adapted network architectures for the experiments of RGB-D object recognition and scene recognition in Section V-B.
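For a fusion layer, the two-way partition need not be random: the boundary between the concatenated streams provides it naturally. A minimal sketch, with illustrative dimensions and names:

```python
import numpy as np

rng = np.random.default_rng(0)

m, d_rgb, d_depth, h = 8, 16, 16, 10
h_rgb = rng.standard_normal((m, d_rgb))      # RGB-stream output features
h_depth = rng.standard_normal((m, d_depth))  # depth-stream output features

# Plain FC fusion: concatenate the two streams and apply one FC layer.
Wf = rng.standard_normal((d_rgb + d_depth, h))
fused = np.concatenate([h_rgb, h_depth], axis=1) @ Wf

# A CorrReg fusion layer uses the stream boundary as its two-way partition,
# so the two "views" are exactly the two modalities:
W1, W2 = Wf[:d_rgb], Wf[d_rgb:]
assert np.allclose(fused, h_rgb @ W1 + h_depth @ W2)
```

The forward computation is identical to plain fusion; only the backward pass adds the correlation gradients between the two modality-specific terms.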
IV. Relations with Existing Works
Our proposed CorrReg method is related to four categories of existing research: regularization of generic DNNs, deep learning research that specifically focuses on multi-view representation learning, RGB-D object/scene recognition, and multi-view recognition of 3D object shapes. We discuss these relations in turn.
Network regularization. In the literature of DNNs, various regularization techniques have been proposed to address the issue of overfitting in network training, including the traditional early stopping, weight decay, and data augmentation [37], as well as the more recent Dropout [20], DropConnect [38], batch normalization [18], and all-conv-layer based networks [39, 40]. Among them, Dropout and DropConnect are most related to our proposed method. In the original proposal of Dropout [20], each hidden neuron is randomly dropped (usually with a probability of 0.5) at each training iteration, and the network is then updated on the weights connected to the remaining neurons. During inference, all network weights are used after halving their values. Baldi and Sadowski [41] quantitatively show that the random operations of dropout training and the associated inference can be understood as a good approximation to the expectation of outputs of a sub-network ensemble, by introducing a bridging quantity, the normalized weighted geometric mean. They further show that the expectation of dropout gradients w.r.t. a network weight is approximately the gradient of the sub-network ensemble regularized by adaptive weight decay. Wager et al. [42] present an alternative interpretation of dropout training as adaptive weight decay, by treating dropout as feature noising in generalized linear models. Analysis similar to [41] applies to DropConnect [38], which randomly drops weight connections rather than network neurons.
Dropout and DropConnect achieve regularization by first sampling features/sub-networks (of shared weights), and then averaging over the outputs of the sub-network ensemble. Different from them, our CorrReg scheme explicitly increases the correlations between (internal) features of different views, and regularization is achieved by suppressing view-specific noisy patterns. We empirically show in this work the usefulness of CorrReg in improving network training, and leave its theoretical connections with classical regularization to future research. Note that a few recent methods of network regularization explicitly reduce correlations across dimensions of the output features of a layer [43], or achieve similar effects by enforcing orthogonality of the layer weight matrix [44, 45]. These methods impose regularization complementary to our proposed CorrReg, and we are interested in investigating their combined use in future research.
Deep multi-view representation learning. Recent deep multi-view representation learning methods include those based on CCA [2, 15, 46] and those based on autoencoders (AEs) [3]. AE based methods typically learn a shared bottleneck layer on top of lower view-specific layers, and the joint representation learned at the bottleneck layer is used for reconstruction of the multiple views. Deep CCA [2] directly applies CCA to the output layers of two deep networks, so that the learned networks produce maximally correlated features at their output layers. Wang et al. [15] extend deep CCA as Deep Canonically Correlated Autoencoders, by balancing the correlation objective between the two views with their respective reconstruction objectives.
Most existing deep multi-view learning works take a two-stage strategy for the final tasks of interest: they first learn, from data of multiple views/modalities, deep features in a common space, and then use the learned features either to train classifiers for multi-view or across-view classification, or to reconstruct data of missing views. In contrast, our use of correlation based multi-view learning is to regularize the training of end-to-end networks, where the parameters that project multi-view data into a common space are exactly those of an intermediate network layer.
RGB-D object and scene recognition. RGB-D object recognition [47, 48, 8, 49] has drawn research attention recently as a typical application of multi-view learning. Lai et al. [31] collected the first large-scale, hierarchical RGB-D object dataset using a Kinect camera; they show that depth information substantially helps object recognition, by concatenating hand-crafted depth features (e.g., spin images [50]) with RGB ones, and using the concatenated features for classification. In [16], a hierarchical learning model of Convolutional and Recursive Neural Networks (CNNs and RNNs) is proposed, where CNNs are used for learning low-level features and RNNs with random weights for efficiently extracting higher-order features; this combined model is applied to RGB and depth images separately, and the resulting features are concatenated for classification.
Eitel et al. [48] propose a multi-modal deep learning architecture for RGB-D object recognition, which fuses, via feature concatenation, the outputs of two parallel streams of modality-specific sub-networks (composed of conv and FC layers), and uses two additional FC layers for feature transformation and softmax classification; the whole network is trained via standard backpropagation, with no consideration of multi-view learning/regularization criteria. Built on top of two parallel streams of ResNets (after removing their respective last classifier layers) [28], a Correlated and Individual Multimodal (CIM) learning layer is proposed in [51]; CIM aims to learn, in a discriminative and complementary manner, both correlated and modality-specific features from the output vectors of the two ResNets, where "correlation" is measured by the Euclidean distance between projected features of the two ResNets' outputs; parameters of the whole network in [51] are updated in an alternating manner: those of the two lower ResNet streams are updated after the updating of the CIM parameters (projection matrices) converges. In [52], a deep learning framework termed MDSI-CNN is proposed to learn highly discriminative and spatially invariant multi-modal feature representations at different hierarchical levels, technically achieved by introducing spatial transformer networks [53] and Fisher encoding into CNN architectures. The problem of RGB-D image classification with limited training samples is addressed in [54]: it takes a domain adaptation approach and enforces prediction consistency between two classifiers that are respectively learned from either the combined RGB and depth features or the RGB features alone. Results on RGB-D object recognition show the efficacy of that approach.
Methods for RGB-D scene recognition [55, 56, 57, 58] largely follow those of RGB-D object recognition. In particular, multi-modal deep architectures similar to that of [48] are still the main workhorse for good recognition performance. We also use this type of network for RGB-D recognition, but with our proposed CorrReg fusion layers that have built-in neuron-wise correlation regularization. Training of CorrReg fusion based networks is no different from standard backpropagation.
Multi-view recognition of 3D object shapes. Recent research shows that multi-view images are a promising representation for recognition of 3D object shapes. Given a 3D object model (mesh), multiple 2D images can be rendered by placing virtual cameras around the object, and recognition is based on the rendered 2D images of multiple views. Among recent methods, MVCNN [59] is a representative one, which uses parallel streams of conv layers to extract features from individual views, and then aggregates these features into a global signature simply via max pooling across different views. Subsequent works improve over MVCNN by strengthening the interaction among feature learning of individual views. For example, MHBN [60] uses harmonized bilinear pooling to aggregate local features, and GVCNN [61] proposes a group-view framework to model correlations among different views at a hierarchy of multiple levels. In this work, we adapt the architectural design of MVCNN by incorporating CorrReg fusion layers into it. We use such an adapted architecture to verify the usefulness of CorrReg for multi-view based 3D object recognition.
V Experiments
No CorrReg  CorrReg Conv2  CorrReg Conv3  CorrReg FC4  CorrReg FC5  CorrReg FC6
In this section, we first present control experiments of image classification to investigate the effectiveness of our proposed CorrReg for regularization of network training. We use generic DNNs including ConvNet (LeNet) [27], and modern deep architectures of ResNet [28], Wide ResNet [29], DenseNet [30], and ResNeXt [62]. These experiments are conducted on the benchmark datasets of CIFAR10, CIFAR100 [26], and ImageNet [63]. We then present experiments of RGBD object/scene recognition and multiview 3D object recognition to evaluate the usefulness of CorrReg for practical multiview learning problems. We use the benchmark datasets of RGBD object [31], RGBD scene [32], and ModelNet40 [64] for these experiments, and compare with the stateoftheart results.
We use the crossentropy loss to train all these networks. Training is based on SGD with momentum. Unless mentioned otherwise, we use minibatches of size , momentum of , and weight decay of ; network parameters are initialized using Gaussian random weights; batch normalization is applied, before the ReLU nonlinearity, in all networks to accelerate their training. In each experiment, the initial learning rate, the value of in (4), and the dropping rate of Dropout (when Dropout regularization is used) are determined by using of the training samples as the validation set. As (4) suggests, we use constant values for all neurons that are specified to apply CorrReg. Learning rates are decayed at the rate of when learning curves plateau. Our implementation and experiments are based on the Torch library [65].
VA Control Experiments of Image Classification
We use the CIFAR10 dataset for our controlled studies on a plain ConvNet (a variant of LeNet [27]). The CIFAR10 dataset consists of 10 object categories of 32×32 color images (50,000 training and 10,000 testing ones). We follow [66] and preprocess the data using global contrast normalization and ZCA whitening. Our LeNet variant consists of conv layers (the first one is the input layer), followed by FC layers (the last one is the output layer). Max or average pooling layers are applied after each conv layer. More layer specifics are given in Appendix B.
We first investigate the regularization effects of applying CorrReg to different network layers. To this end, we replace each of the network layers (except the input one), namely Conv2, Conv3, FC4, FC5, and FC6, with their CorrReg versions respectively, and compare the recognition performance. As indicated in Section IIB, the scheme of multiple (random) twoway partitions can be used when applying CorrReg to any network layer. We also investigate how different numbers nReg of twoway partitions in CorrReg achieve regularization, for which we set , or . Note that when , we use the first half of the neurons/feature maps of the layer as one subset, and the other half as the second subset; when , regularization is achieved by averaging over those of the multiple twoway partitions. We run each setting (a layer/nReg pair) times, and report results in the format of mean (standard deviation). Error rates reported in Table I show that applying CorrReg, with any number nReg of twoway partitions, to these layers consistently boosts performance over the LeNet variant baseline. In general, CorrReg is more effective for (upper) FC layers; this is reasonable since a densely connected FC layer contains many more trainable parameters than a conv layer does, and is thus more susceptible to overfitting. Setting sometimes helps in getting even better results, but at the cost of slightly increased computation. In the subsequent experiments, we simply set for computational efficiency.
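For concreteness, the neuronwise correlation penalty over a minibatch can be sketched as follows. This is a NumPy illustration of the idea only, not the paper's Torch implementation; the function name `corrreg_penalty` and the `n_reg` argument are ours, and the deterministic first-half/second-half split for a single partition follows the description above.

```python
import numpy as np

def corrreg_penalty(H, n_reg=1, rng=None):
    """Neuronwise correlation-maximizing penalty (illustrative sketch).

    H: (batch, d) activations of the layer to regularize, d even.
    Pairs neuron i of the first subset with neuron i of the second
    subset, computes the Pearson correlation of each pair over the
    minibatch, and returns the negative mean correlation (so that
    minimizing the penalty maximizes correlation), averaged over
    n_reg twoway partitions when n_reg > 1.
    """
    rng = np.random.default_rng(rng)
    n, d = H.shape
    half = d // 2
    penalties = []
    for k in range(n_reg):
        # Single partition: deterministic first-half / second-half split,
        # as in the control experiments; otherwise random partitions.
        idx = np.arange(d) if n_reg == 1 else rng.permutation(d)
        A, B = H[:, idx[:half]], H[:, idx[half:2 * half]]
        Ac = A - A.mean(0)
        Bc = B - B.mean(0)
        eps = 1e-8  # numerical guard for near-constant neurons
        corr = (Ac * Bc).sum(0) / (
            np.sqrt((Ac ** 2).sum(0)) * np.sqrt((Bc ** 2).sum(0)) + eps)
        penalties.append(-corr.mean())
    return float(np.mean(penalties))
```

In a real network this penalty would be added to the classification loss scaled by the parameter of (4) and backpropagated through the layer activations.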
As a technique for regularization of network training, CorrReg is related to the methods [20, 38, 18] discussed in Section IV, in particular Dropout [20]. To compare CorrReg with Dropout, we apply them to the FC5 layer of the LeNet variant. Since their working mechanisms are different, one might also be interested in using them together. Table II reports the comparative results. CorrReg achieves improvement comparable to that of Dropout, whose dropping rate is optimally set as by tuning over the range of on the validation set. Using CorrReg together with Dropout further improves the performance, showing the complementary regularization benefit of CorrReg to that of Dropout.
Methods  Error rates
Plain LeNet variant
Dropout [20]
CorrReg
CorrReg + Dropout
CorrReg achieves regularization by improving correlations between the internal features produced by a twoway partition of a layer, which suggests a natural alternative that halves the number of layer neurons. Halving the number of layer neurons reduces the model capacity and creates a “bottlenecking” of information flow, thus implicitly imposing regularization. Conversely, one may also be interested in alternatives that increase the number of layer neurons by varying factors. To investigate the efficacy of CorrReg for models with different capacities, we apply these alternatives again to the FC5 layer of the LeNet variant. Results in Table III show that larger models perform better than smaller ones do, and that applying our proposed CorrReg to larger models further improves the performance. We also compute in Table III the averaged (pairwise) correlation among features of training samples learned at different layer neurons, in order to understand how network capacities relate to the behavior of feature correlations across layer neurons, and how CorrReg plays a regularizing role here. Results show that as the number of layer neurons increases, feature correlations between layer neurons increase, and applying CorrReg further enhances this effect.
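The diagnostic of averaged pairwise correlation among layer neurons can be sketched as follows (an illustrative NumPy helper; the function name is ours, and we take absolute values so that positive and negative correlations do not cancel):

```python
import numpy as np

def mean_pairwise_neuron_correlation(H):
    """Average absolute pairwise Pearson correlation between features
    learned at different neurons of one layer (illustrative sketch of
    the diagnostic reported alongside Table III).

    H: (num_samples, num_neurons) activations over training samples.
    """
    C = np.corrcoef(H, rowvar=False)    # (d, d) neuron-by-neuron correlations
    d = C.shape[0]
    off = C[~np.eye(d, dtype=bool)]     # drop the diagonal (self-correlation)
    return float(np.abs(off).mean())
```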
Neuron No.  128  256  512  1024
W/O CorrReg
With CorrReg
CorrReg is a neuronwise scheme of CCA regularization. One might be interested in the performance when directly using the CCA objective as the regularizer. To this end, we apply the CCA objective (2) as a regularizer to the FC5 layer of the LeNet variant, where the regularization parameter is set as by optimally tuning on the validation set. The computational complexity of CCA regularization is significantly worse than that of CorrReg, and Table IV shows that in practice it consumes more time per iteration of SGD training (measured on an M40 GPU and an Intel Xeon CPU running at 2.2 GHz). Although CCA regularization improves performance over that of the plain LeNet variant, its results with different sizes of minibatches are worse than those of CorrReg; we hypothesize that this is because optimization of the CCA objective (particularly the constraints in (2)) is incompatible with SGD based network training.
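To make the complexity contrast concrete, a minibatch estimate of the CCA objective can be sketched as follows. This is an illustrative NumPy version under our own simplifying assumptions (a small ridge term regularizes the covariances), not the experimental code; the point is that whitening the two d×d view covariances requires eigendecompositions costing roughly O(d³) per minibatch, versus the O(nd) neuronwise correlations of CorrReg.

```python
import numpy as np

def batch_cca_correlation(X, Y, eps=1e-4):
    """Sum of canonical correlations between two views on a minibatch
    (illustrative sketch of using a CCA objective as a regularizer).

    X, Y: (batch, d) features of the two views.
    """
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    n = X.shape[0]
    # Ridge-regularized within-view covariances and cross-covariance.
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition: O(d^3).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # Canonical correlations are the singular values of the whitened
    # cross-covariance matrix T.
    return float(np.linalg.svd(T, compute_uv=False).sum())
```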
Methods  Plain LeNet variant  CorrReg  CCA regularization
Computational complexity
Wallclock time per iter. (sec.)
Error rates
The penalty parameter in (4) controls the amount of regularization that CorrReg imposes on network training. To investigate how the performance of CorrReg depends on values, we use of the training samples as validation, and apply CorrReg to different layers of the LeNet variant (i.e., the settings of Table I with ) using a range of values. Results are plotted in Figure 4. Figure 4 shows that smaller values of are usually better when applying CorrReg to (lower) conv layers, and larger values of are usually better for (upper) FC layers. This is reasonable for two compounding reasons: (1) when applying CorrReg to an upper network layer, larger values of are needed in order to backpropagate the regularization for better learning of features/parameters of all the layers below; (2) conv layers already have intrinsic regularization via weight sharing. This inconsistency of optimal values across different network layers makes the use of CorrReg less convenient. Fortunately, results in this section suggest that to get the most effective regularization, one may simply apply CorrReg to an upper (FC) network layer, and set the optimal values accordingly. Setting typically gives good results. Experiments in the subsequent sections follow this empirical rule.
VA1 Results on Modern Deep Architectures
Architecture  CIFAR10  CIFAR100
(per dataset: W/O Regu.  DropoutS1  DropoutS1S2  CorrReg  CorrReg + DropoutS2)
ResNet [68]
Wide ResNet [29]
DenseNet [30]
We further investigate the regularization effects of CorrReg on modern deep architectures. For the datasets of CIFAR10 and CIFAR100, we use the representative architectures of ResNet [68], Wide ResNet [29], and DenseNet [30]. The CIFAR100 dataset is an adaptation of CIFAR10, consisting of 100 object categories of color images. We use simple data augmentation following [67]: during training, we zeropad pixels along each image side, and sample a region crop from the padded image or its horizontal flip; during testing, we simply use the original nonpadded image. Our use of ResNet, Wide ResNet, and DenseNet for the CIFAR datasets is as follows: we use a preactivation ResNet [68] of weight layers, whose layer specifics are given in Appendix B; we use exactly the same topperforming architecture of “WRN2810” as in [29]; we also use exactly the same topperforming architecture of “DenseNetBC” (with growth rate ) as in [30]. These architectures commonly aggregate features of lower layers via a top global average pooling layer, followed by a final FC layer for classification. For each network, we follow the empirical rule established in Section VA and apply CorrReg to the final FC layer. We fix of CorrReg as for all three networks. We train ResNet and Wide ResNet for a total of epochs; learning rates are initialized as , and decay after and epochs of training. For DenseNet, we follow [30] and train for an extended duration of epochs, using minibatches of size . All other training hyperparameters are the same as described at the beginning of Section V (not necessarily the same as those used in [68, 29, 30]).
To compare with Dropout regularization, we use two schemes (denoted as DropoutS1 and DropoutS1S2 respectively): scheme 1 applies Dropout to (inputs of) the final FC layer of each network, the same as our use of CorrReg; scheme 2 follows [29] and additionally applies Dropout to (inputs of) the second of the two conv layers in each residual block of these networks. Dropping rates are tuned on the validation set, with the optimal one set as . We also try CorrReg together with the above scheme 2 (denoted as CorrRegDropoutS2), to compare fairly with Dropout regularization. Results in Table V show that DropoutS1S2 improves over DropoutS1 by providing additional regularization, especially on the CIFAR100 dataset, which contains far fewer training samples per category than CIFAR10 does. When applied to the top FC layer alone, our proposed CorrReg consistently outperforms DropoutS1. Moreover, the best results are obtained by CorrRegDropoutS2, which has the combined benefit of CorrReg and Dropout regularization.
For the ImageNet dataset, we use the representative architectures of ResNet [28], Wide ResNet [29], and ResNeXt [62] (more specifically, the “ResNet101”, and the topperforming “WRN502bottleneck” and “ResNeXt101 (644d)” variants of these architectures). For data augmentation, we adopt the same scheme as in [28]: during training, we randomly sample a region crop from an image or its horizontal flip; during testing, we use a single crop of size . Learning rates are initialized as , and decay by a factor of at and of the total training epochs, using minibatches of size . For each network, we again apply CorrReg to the top FC layer, using a fixed value of . We also apply Dropout to (inputs of) the same FC layers of these networks, where the dropping rate is again set as . Table VI shows the comparative results. While Dropout regularization has little effect on these architectures, our proposed CorrReg steadily achieves performance improvements.
Architecture  W/O Regu.  With Dropout  With CorrReg
ResNet [28]
Wide ResNet [29]
ResNeXt [62]
Remarks We note that the experiments in this section are not intended to compete with the best results on the benchmark image classification datasets. They are meant to show the efficacy of our proposed CorrReg for regularization of network training: even though input data are from the same source, intermediate features may overfit to viewspecific patterns. CorrReg effectively regularizes network training so that the final task of classification can collaboratively benefit from all feature views.
VB RGBD Object and Scene Recognition
In this section, we use the RGBD object dataset [31] and the SUN RGBD scene dataset [32] to investigate the efficacy of CorrReg for practical problems of multiview learning. The RGBD object dataset contains RGBD video frames of classes of object instances captured from different views, with roughly frames per object instance. We subsample these videos by taking every frame of each video. We use the random dataset splits provided by [31], with each split containing different object instances of all the classes. For these splits, there are on average about images for training and images for testing. This dataset is used extensively for comparative studies of alternative baselines, investigation of our proposed method under different settings, and also for robustness tests against contamination of input data. The SUN RGBD scene dataset is a benchmark suite for indoor scene understanding, including RGBD images. For scene recognition, we follow [32] and select the most common categories, each of which has at least RGBD images in the dataset; we then divide the data into training and test sets, giving images for training and ones for testing. For both the RGBD object and SUN RGBD datasets, we compute surface normals (SN) [69] from each depth image as input depth features.
VB1 Comparative Studies of Alternative Baselines
Our comparative studies of alternative baselines for RGBD object recognition are based on adaptations of an 8layer ConvNet. The adaptations consist of two lower, parallel streams followed by one upper, single stream, whose model architectures are shown in Figure 5 and whose layer parameters (numbers of feature maps, filter sizes, etc.) are given in Appendix B. RGB and depth/SN images are respectively taken as inputs of the two lower streams, whose outputs are concatenated as inputs of the upper stream. We term such adapted networks RGBDepth ConvNets. In this work, we investigate the effects of concatenating features of the two lower streams at different “heights” (lower, middle, or upper layers) of RGBDepth ConvNets, as illustrated in Figure 5. Alternatives to RGBDepth ConvNets are plain ConvNets that consist of one of the two lower streams of RGBDepth ConvNets plus the upper stream, as shown in Figure 5. Such plain ConvNets can be used for both RGB and depth images, and we term them RGB ConvNet and Depth ConvNet respectively.
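The fusion-by-concatenation design can be sketched with toy affine+ReLU stand-ins for the conv stacks (illustrative only; Figure 5 and Appendix B give the real architectures, and the names and shapes below are ours):

```python
import numpy as np

def stream(x, W1, W2):
    """One modalityspecific lower stream: two affine+ReLU stages
    standing in for the conv stack (toy sketch)."""
    h = np.maximum(x @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def rgb_depth_forward(x_rgb, x_depth, params):
    """Upper-height fusion: run each modality through its own stream,
    concatenate the two feature vectors (the fusion-layer input),
    then classify with a single affine stage."""
    f_rgb = stream(x_rgb, *params["rgb"])
    f_depth = stream(x_depth, *params["depth"])
    fused = np.concatenate([f_rgb, f_depth], axis=1)
    return fused @ params["cls"]    # class logits
```

Moving the concatenation earlier (a lower or middle “height”) simply shifts where the two streams merge; CorrReg or Dropout would be applied to the layer consuming `fused`.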
The above networks suggest several baseline methods for RGBD object recognition. In particular, given the existence of both RGB and depth images during the training and test phases, one may separately train an RGB ConvNet or a Depth ConvNet using singlemodal images, and use the trained models for singlemodal inference. Alternatively, one may use the above RGBDepth ConvNets that concatenate features of individual modalities at different heights for multimodal training and inference. To train these baseline networks, we use common data augmentation practices on the RGBD object dataset: we first rescale each training image to the size of , from which, or its horizontal flip, we randomly crop a region of size . The learning rate is initialized as and decays by a factor of when learning curves plateau. Table VII shows that while concatenating features at lower or middle layers of RGBDepth ConvNets is not effective, feature concatenation at an upper layer of the RGBDepth ConvNet achieves improved performance over the singlemodal networks.
Methods  Accuracy
RGB ConvNet
Depth ConvNet
RGBDepth ConvNet (concat. at a lower height)
RGBDepth ConvNet (concat. at a middle height)
RGBDepth ConvNet (concat. at an upper height)
RGBDepth ConvNet with Dropout
RGBDepth ConvNet with L2Regu
RGBDepth ConvNet with CorrReg
RGBDepth ConvNet with L2Regu & Dropout
RGBDepth ConvNet with CorrReg & Dropout

Occlusion size
RGB ConvNet
Depth ConvNet
RGBDepth ConvNet
RGBDepth ConvNet with Dropout
RGBDepth ConvNet with CorrReg
RGBDepth ConvNet with CorrReg & Dropout
The above baselines fuse multimodal features via direct concatenation. The discussions in Section I suggest that one may apply network regularization at fusion to help learn collaboratively from multiview features; candidate regularizers include existing methods such as Dropout [20], and also our proposed CorrReg that explicitly leverages multiview learning criteria. More specifically, we apply CorrReg to the first layer of the upper stream of the RGBDepth ConvNet that concatenates multiview features at the upper “height”, making it a CorrReg fusion layer, or alternatively apply Dropout to inputs of the first layer of the upper stream. As noted in Section IV, an approximate measure of correlation is used in [51] that simply computes the (squared) Euclidean distance between features of individual views. We also consider this simple correlation measure as a baseline regularizer of (1), and term such a method L2Regu. We compare with L2Regu by applying it to the same first layer of the upper stream of the RGBDepth ConvNet. We set the value of CorrReg as , the penalty parameter of L2Regu as , and the dropping rate of Dropout as , all determined by tuning on the validation set. Results in Table VII show that each of Dropout, L2Regu, and CorrReg improves performance over that of direct concatenation, and that CorrReg outperforms Dropout and L2Regu, showing the advantage of CorrReg in practical multiview learning problems. Table VIII also gives results of CorrReg when using different values. To investigate whether the effect of CorrReg (or L2Regu) is complementary to that of Dropout, we also use both CorrReg (or L2Regu) and Dropout in the RGBDepth ConvNet. Using L2Regu together with Dropout may not improve over L2Regu itself; in contrast, using CorrReg together with Dropout further improves the performance, showing the advantage of CorrReg for complementary regularization.
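The L2Regu baseline reduces to the following penalty (an illustrative sketch with our own naming; note that, unlike the Pearson correlations used by CorrReg, a squared Euclidean distance is not invariant to the scale of the two views' features):

```python
import numpy as np

def l2regu_penalty(F_rgb, F_depth):
    """L2Regu baseline: squared Euclidean distance between the two
    views' fusion-layer features, averaged over the minibatch
    (illustrative sketch).

    F_rgb, F_depth: (batch, d) same-dimensional per-view features.
    """
    return float(((F_rgb - F_depth) ** 2).sum(axis=1).mean())
```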
VB2 Robustness against Contamination of Input Data
An important property of multiview learning is that inference should be less influenced when input data are contaminated [70]. In this section, we simulate such testing scenarios by adding random occlusion blocks to input RGB and depth images. Occlusion blocks are obtained by setting pixel values of the occluded regions as . We use the trained networks of Section VB1 for these investigations. Table IX reports comparative results under different sizes of random occlusion. Compared with the RGB ConvNet and Depth ConvNet, direct feature concatenation using the RGBDepth ConvNet may not provide better robustness against contamination of input data. The RGBDepth ConvNet with our proposed CorrReg improves the robustness, and performs consistently better than both the plain RGBDepth ConvNet and the one with Dropout.
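The occlusion protocol can be sketched as follows (illustrative; the occluded pixel value is assumed to be 0 here, and the occlusion sizes used in Table IX are not reproduced):

```python
import numpy as np

def add_random_occlusion(img, block, rng=None):
    """Simulate a contaminated test input: zero out one randomly
    placed square block of side `block` pixels (sketch; the fill
    value 0 is our assumption).

    img: (H, W) or (H, W, C) image array.
    """
    rng = np.random.default_rng(rng)
    out = img.copy()
    h, w = img.shape[:2]
    top = rng.integers(0, h - block + 1)    # high bound is exclusive
    left = rng.integers(0, w - block + 1)
    out[top:top + block, left:left + block] = 0
    return out
```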
VB3 Comparisons with the State of the Art
Stateoftheart results on RGBD object recognition are obtained by using either advanced base models (e.g., ResNets [28]) with parameters pretrained on ImageNet [51], or advanced feature encoding schemes [52]. We follow [51] and use two ResNet50s [28] (after removing their last FC classifier layers) as the lower, parallel streams, whose dimensional output feature vectors are concatenated as the input vector of an FC based CorrReg fusion layer, followed by the last layer of a 51way classifier. The two lower streams respectively take RGB and SN images as inputs. We term such a constructed network RGBDepth ResNet. For data augmentation, we follow [49] by first rescaling training images to the size of , and then randomly cropping regions of size from them or their horizontal flips. RGBDepth ResNet is finetuned with initial learning rates of for the lower streams and for the upper stream. We use minibatches of size 64, and set the value of CorrReg as and the dropping rate of Dropout as . Other training hyperparameters are the same as described at the beginning of Section V.
Methods  Accuracy
Nonlinear SVM [31]
CKM [71]
CNNRNN [16]
Upgraded HMP [69]
MMSS [49]
FusCNN [48]
CIMDLResNet [51]
MDSICNN [52]
RGB ResNet
Depth ResNet
RGBDepth ResNet
RGBDepth ResNet with Dropout
RGBDepth ResNet with CorrReg
RGBDepth ResNet with CorrReg & Dropout
Note that the method of [51] uses the same ImageNet pretrained base models (i.e., ResNet50) as we do. It is interesting to observe in Table X that our result with RGBDepth ResNet, which concatenates RGB and depth features directly, is better than those of most existing methods. Regularizing RGBDepth ResNet with either Dropout or CorrReg further improves the result, with CorrReg achieving the larger improvement. Using CorrReg together with Dropout achieves the new state of the art.
VB4 Results on RGBD Scene Recognition
In this section, we report experiments of RGBD scene recognition on the SUN RGBD dataset [32]. We use the same network architectures and training manners as in Section VB3, with the only difference that the 51way softmax classifiers are replaced with 19way ones. Results in Table XI show that RGBDepth ResNet using simple feature concatenation outperforms existing methods that have complicated feature fusion schemes and/or training criteria. Regularizing RGBDepth ResNet with CorrReg and/or Dropout further improves the result to the new state of the art.
Methods  Accuracy
GIST RBF Kernel SVM [32]
PlaceCNN Linear SVM [32]
PlaceCNN RBF Kernel SVM [32]
SSCNN [55]
DMFF [72]
MDSICNN [52]
FVCNN [56]
RGBDCNN (wSVM) [57]
BilinearCNN [58]
RGB ResNet
Depth ResNet
RGBDepth ResNet
RGBDepth ResNet with Dropout
RGBDepth ResNet with CorrReg
RGBDepth ResNet with CorrReg & Dropout
VC Multiview Recognition of 3D Object Shapes
We conduct experiments of multiview 3D object recognition on the ModelNet40 dataset [64] to investigate the efficacy of CorrReg for practical problems with data of more than two views. The ModelNet40 dataset contains CAD models (meshes) of object categories, with models for training and ones for testing. To prepare images of multiple views from each object model, we follow the camera setup in [59] and assume that each model is upright oriented; virtual cameras, pointing towards the model centroid, are evenly distributed (with intervals of degrees) around a horizontal circle that is elevated degrees from the ground plane; 2D images are rendered from these camera views.
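The camera setup can be sketched as follows (an illustrative helper; the view count and elevation angle are left as parameters rather than fixed to the values used in our experiments, and the object centroid is assumed to sit at the origin):

```python
import numpy as np

def camera_positions(n_views, elevation_deg, radius=1.0):
    """Virtual camera centers for rendering a 3D mesh: evenly spaced
    around a horizontal circle elevated above the ground plane, all
    pointing towards the centroid (MVCNN-style setup; sketch).

    Returns an (n_views, 3) array of camera positions.
    """
    az = np.deg2rad(np.arange(n_views) * 360.0 / n_views)  # azimuths
    el = np.deg2rad(elevation_deg)
    x = radius * np.cos(el) * np.cos(az)
    y = radius * np.cos(el) * np.sin(az)
    z = np.full(n_views, radius * np.sin(el))              # constant height
    return np.stack([x, y, z], axis=1)
```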
Based on the very simple architecture of MVCNN [59] for multiview based 3D object recognition, where features of individual views extracted from lower, parallel layer streams are aggregated via featurewise max pooling, we design a Multiview Fusion Network (MvFusionNet), as shown in Figure 6. By pairing neighboring views, MvFusionNet reorganizes the feature vectors of individual views from the lower streams into an equal number of pairs of feature vectors. Each such pair is then fed into a fusion layer, to which CorrReg can also be applied to form a CorrReg fusion layer. Featurewise max pooling is subsequently applied to the outputs of these fusion layers, and MvFusionNet ends with an FC classifier layer. We investigate here whether CorrReg is helpful for feature aggregation of different views by regularizing such a constructed MvFusionNet.
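The pairing-and-pooling aggregation of MvFusionNet can be sketched as follows (an illustrative NumPy forward pass with a single shared fusion layer; cyclic pairing of each view with its next neighbor is one plausible reading of “pairing neighboring views”, and all names and shapes are ours):

```python
import numpy as np

def mvfusion_aggregate(view_feats, W_fuse):
    """Sketch of MvFusionNet aggregation: pair each view's feature
    vector with that of its neighboring view, push each concatenated
    pair through a shared fusion layer (where CorrReg would be
    applied), then max-pool featurewise across the pairs.

    view_feats: (n_views, d) per-view features from the lower streams.
    W_fuse:     (2*d, d_out) shared fusion-layer weights.
    Returns the (d_out,) aggregated shape signature.
    """
    # Pair view i with view (i + 1) mod n_views -> n_views 2*d vectors.
    pairs = np.concatenate([view_feats, np.roll(view_feats, -1, axis=0)],
                           axis=1)
    fused = np.maximum(pairs @ W_fuse, 0.0)  # fusion layers with ReLU
    return fused.max(axis=0)                 # featurewise max pooling
```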
Lower streams of MvFusionNet are adapted from ResNet101 [28] pretrained on ImageNet [51]. To train MvFusionNet, we use a minibatch of (i.e., images); the learning rates start at and decay at the rate of when learning curves plateau. The penalty of CorrReg is set as . We report in Table XII results of MvFusionNet with and without CorrReg regularization, where we also compare with recent stateoftheart results [73, 60, 61] on ModelNet40 whose multiview images are prepared following the same style of camera setup as in [59] (i.e., camera views pointing towards the upright orientation of object models). Due to varying architectural designs, network optimizers, and feature aggregation schemes, the results of different methods in Table XII may not be directly comparable; nevertheless, the comparison confirms the efficacy of CorrReg for better feature learning and aggregation from multiple views of 3D object shapes. We note that results of multiview based methods on ModelNet40 depend heavily on how multiview images are prepared by positioning virtual cameras on a sphere enclosing the object model. For example, the current best result on ModelNet40 is obtained in [74] by selecting camera setups from a much richer set of camera positionings and viewpoints. We expect that our results can also be boosted by using multiview images rendered from these optimal camera setups.
Methods  Accuracy
MVCNN [59] (80 views)
Pairwise [75]
Dominant Set Clustering [73] (24 views)
MHBN [60] (6 views)
GVCNN [61] (8 views)
MvFusionNet
MvFusionNet with CorrReg
VI Conclusion
We study in this paper deep multiview learning in the context of regularized network training. We take a regularization approach via multiview learning criteria, and propose a novel, effective, and efficient neuronwise correlationmaximizing regularizer. We also implement such regularizers collectively as a correlationregularized network layer (CorrReg). CorrReg can be applied to either FC or conv based fusion layers that concatenate intermediate features of individual views. Controlled experiments of benchmark image classification show that CorrReg consistently improves performance of various modern deep architectures. Applying CorrReg to multimodal deep networks achieves the new state of the art on the benchmark RGBD object and scene recognition datasets. In future research, we are interested in applying CorrReg to other multiview learning problems of practical interest.
References
 [1] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
 [2] G. Andrew, R. Arora, K. Livescu, and J. Bilmes, “Deep canonical correlation analysis,” in Proceedings of the International Conference on Machine Learning, 2013.
 [3] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the International Conference on Machine Learning, June 2011.
 [4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visualsemantic embedding model,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2121–2129.
 [5] F. R. Bach and M. I. Jordan, “A probabilistic interpretation of canonical correlation analysis,” Department of Statistics, University of California, Berkeley, Tech. Rep, Tech. Rep., 2005.
 [6] D. A. Cohn and T. Hofmann, “The missing link  a probabilistic model of document content and hypertext connectivity,” in Advances in Neural Information Processing Systems 13, 2001, pp. 430–436.
 [7] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” J. Mach. Learn. Res., vol. 3, pp. 1107–1135, 2003.
 [8] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in Neural Information Processing Systems 25, 2012, pp. 2222–2230.
 [9] R. Arora, A. Cotter, K. Livescu, and N. Srebro, “Stochastic optimization for PCA and PLS,” in 50th Annual Allerton Conference on Communication, Control, and Computing, Allerton, Allerton Park & Retreat Center, Monticello, IL, USA, October 15, 2012, 2012, pp. 861–868.
 [10] D. R. Hardoon, S. R. Szedmak, and J. R. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, Dec. 2004.
 [11] P. L. Lai and C. Fyfe, “Kernel and nonlinear canonical correlation analysis,” Int. J. Neural Syst., vol. 10, pp. 365–377, 2000.
 [12] D. R. Hardoon and J. ShaweTaylor, “Sparse canonical correlation analysis,” Machine Learning, vol. 83, no. 3, pp. 331–353, 2011.
 [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
 [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Regionbased convolutional networks for accurate object detection and segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, Jan. 2016.
 [15] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multiview representation learning,” in Proceedings of the International Conference on Machine Learning, 2015.
 [16] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, “Convolutionalrecursive deep learning for 3d object classification,” in Advances in Neural Information Processing Systems 25, 2012, pp. 656–664.
 [17] S. Akaho, “A kernel method for canonical correlation analysis,” in Proceedings of the International Meeting of the Psychometric Society, 2001.
 [18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
 [19] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the 13th European Conference on Computer Vision, 2014, pp. 818–833.
 [20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing coadaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
 [21] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. The MIT Press, 2012.
 [22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations (ICLR), 2017.
 [23] M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: Stability of stochastic gradient descent,” in Proceedings of The 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 1225–1234.
 [24] I. Kuzborskij and C. H. Lampert, “Datadependent stability of stochastic gradient descent,” CoRR, vol. abs/1703.01678, 2017.
 [25] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. A. Poggio, “Theory of deep learning iib: Optimization properties of SGD,” CoRR, vol. abs/1801.02254, 2018.
 [26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Report, 2009.
 [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [29] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference, 2016.
 [30] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016.
 [31] K. Lai, L. Bo, X. Ren, and D. Fox, “A largescale hierarchical multiview rgbd object dataset,” in Proceedings of the IEEE International Conference on on Robotics and Automation, 2011.
 [32] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgbd: A rgbd scene understanding benchmark suite,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.
 [33] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3, May 2013, pp. 1139–1147.
 [34] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 807–814.
 [35] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016.
 [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” CoRR, vol. abs/1409.1556, 2014.
 [37] D. Ciresan, U. Meier, and J. Schmidhuber, “Multicolumn deep neural networks for image classification,” in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’12, 2012, pp. 3642–3649.
 [38] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1058–1066.
 [39] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of the International Conference on Learning Representations, 2013.
 [40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.
 [41] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2814–2822.
 [42] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems 26, 2013, pp. 351–359.
 [43] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra, “Reducing overfitting in deep networks by decorrelating representations,” in International Conference on Learning Representations, 2016.
 [44] P. Rodríguez, J. Gonzàlez, G. Cucurull, J. M. Gonfaus, and X. Roca, “Regularizing cnns with locally constrained decorrelations,” in International Conference on Learning Representations, 2017.
 [45] K. Jia, D. Tao, S. Gao, and X. Xu, “Improving training of deep neural networks via singular value bounding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [46] S. Chandar, M. M. Khapra, H. Larochelle, and B. Ravindran, “Correlational neural networks,” CoRR, vol. abs/1504.07225, 2015.
 [47] L. Bo, X. Ren, and D. Fox, “Hierarchical matching pursuit for image classification: Architecture and fast algorithms,” in Advances in Neural Information Processing Systems 24, 2011, pp. 2115–2123.
 [48] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgbd object recognition,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, pp. 681–687.
 [49] A. Wang, J. Cai, J. Lu, and T.J. Cham, “Mmss: Multimodal sharable and specific feature learning for rgbd object recognition,” in Proceedings of the IEEE International Conference on Computer Vision, December 2015.
 [50] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3d scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, May 1999.
 [51] Z. Wang, J. Lu, R. Lin, J. Feng, and J. Zhou, “Correlated and individual multimodal deep learning for rgbd object recognition,” CoRR, vol. abs/1604.01655, 2016.
 [52] U. Asif, M. Bennamoun, and F. A. Sohel, “A multimodal, discriminative and spatially invariant cnn for rgbd object labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2051–2065, Sept. 2018.
 [53] M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 2017–2025.
 [54] X. Li, M. Fang, J.J. Zhang, and J. Wu, “Learning coupled classifiers with rgb images for rgbd object recognition,” Pattern Recognition, vol. 61, no. C, pp. 433–446, Jan. 2017.
 [55] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, “Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2016, pp. 2318–2325.
 [56] A. Wang, J. Cai, J. Lu, and T.J. Cham, “Modality and component aware feature fusion for rgbd scene classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5995–6004.
 [57] X. Song, L. Herranz, and S. Jiang, “Depth cnns for rgbd scene recognition: Learning from scratch better than transferring from rgbcnns.” in AAAI, 2017, pp. 4271–4277.
 [58] H. F. Zaki, F. Shafait, and A. Mian, “Learning a deeply supervised multimodal rgbd embedding for semantic scene and object category recognition,” Robotics and Autonomous Systems, vol. 92, pp. 41–52, 2017.
 [59] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multiview convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 945–953.
 [60] T. Yu, J. Meng, and J. Yuan, “Multiview harmonized bilinear network for 3d object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 186–194.
 [61] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Groupview convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 264–272.
 [62] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 5987–5995.
 [63] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
 [64] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in CVPR, 2015, pp. 1912–1920.
 [65] R. Collobert, S. Bengio, and J. Mariethoz, “Torch: a modular machine learning software library,” Tech. Rep., 2002.
 [66] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio, “Maxout networks,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1319–1327.
 [67] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015.
 [68] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, 2016.
 [69] L. Bo, X. Ren, and D. Fox, “Unsupervised feature learning for rgbd based object recognition,” in Experimental Robotics. Springer, 2013, pp. 387–402.
 [70] C. Xu, D. Tao, and C. Xu, “Multiview intact space learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2531–2544, 2015.
 [71] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller, “A learned feature descriptor for object recognition in rgbd data,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 1298–1303.
 [72] H. Zhu, J.B. Weibel, and S. Lu, “Discriminative multimodal feature fusion for rgbd indoor scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2969–2976.
 [73] C. Wang, M. Pelillo, and K. Siddiqi, “Dominant set clustering and pooling for multiview 3d object recognition,” in British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4–7, 2017.
 [74] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 5010–5019.
 [75] E. Johns, S. Leutenegger, and A. J. Davison, “Pairwise decomposition of image sequences for active multiview recognition,” in CVPR. IEEE Computer Society, 2016, pp. 3813–3822.
Appendix A Gradients of the Proposed NeuronWise CorrelationMaximizing Regularizer
We use the multivariable chain rule to derive the gradients of the neuronwise regularizer w.r.t. the weight vectors of the two views, and also w.r.t. their input features. The explicit forms of these gradients are presented as follows (without simplification).
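As a hedged illustration (a sketch of the underlying quantity, not the paper's exact derivation), the following NumPy snippet computes the neuronwise sample correlation between the two views' contributions z1 = X1·w1 and z2 = X2·w2 to a single fused neuron over a mini-batch, and a regularizer loss as its negative (maximizing correlation = minimizing the negative). The array shapes, function names, and the eps constant are assumptions of this sketch.

```python
import numpy as np

def neuronwise_correlation(x1, x2, w1, w2, eps=1e-8):
    """Sample (Pearson) correlation between the two views' contributions
    z1 = X1 @ w1 and z2 = X2 @ w2 to one fused neuron, over a mini-batch.
    x1: (batch, d1), x2: (batch, d2); w1, w2: the neuron's per-view weights."""
    z1 = x1 @ w1
    z2 = x2 @ w2
    z1c = z1 - z1.mean()          # center over the batch
    z2c = z2 - z2.mean()
    num = z1c @ z2c
    den = np.sqrt((z1c @ z1c) * (z2c @ z2c)) + eps
    return float(num / den)

def corr_reg_loss(x1, x2, w1, w2):
    # Correlation-maximizing regularizer: minimize the negative correlation.
    return -neuronwise_correlation(x1, x2, w1, w2)
```

In an end-to-end framework with automatic differentiation, the gradients w.r.t. w1, w2 and the input features would be obtained by backpropagating through this scalar.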
Appendix B Network architectures
Our LeNet variant in Section V-A consists of three conv layers (the first one being the input layer), followed by three FC layers (the last one being the output layer). The first two conv layers have filters, and the third one has filters. All three conv layers have filters of size and stride . Max pooling of size is applied after the first conv layer, and average pooling of size is applied after both the second and third conv layers. The numbers of neurons for the three FC layers are respectively , , and .
The ResNet used in Section V-A1 follows [28, 68]. In particular, we first build a ConvNet that starts with a conv layer of filters, and then sequentially stacks three types of conv layers of filters, which have feature map sizes of , , and , and filter numbers , , and , respectively. Spatial subsampling of feature maps is achieved by conv layers of stride . The ConvNet ends with a global average pooling and FC layers, with weight layers in total. Based on this ConvNet, we (1) use an “identity shortcut” to connect every two conv layers of filters, and a “projection shortcut” when subsampling of feature maps is needed; and (2) change it to the preactivation version according to [68]. We set that gives weight layers.
Layer specifics of the RGB ConvNet, Depth ConvNet, and RGBDepth ConvNets used in Section V-B1 (Figure 5) are presented in Table XIII.
Layer                       Filter size / Filter no. / Stride / Padding
conv1                       7×7 / 96 / 2 / 3
conv2                       5×5 / 96 / 1 / 2
conv3                       3×3 / 112 / 1 / 1
conv4                       3×3 / 128 / 1 / 1
conv5                       3×3 / 128 / 1 / 1
fc6                         – / 1024 / – / –
fc7                         – / 512 / – / –
fc8                         – / 51 / – / –
(max) pool1, pool2, pool3   2×2 / – / 2 / 1
lowerconv                   5×5 / 192 / 1 / 2
middleconv                  3×3 / 256 / 1 / 1
upperfc                     – / 1024 / – / –
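As a quick sanity check on such layer tables, the standard output-size formula floor((n + 2p − k)/s) + 1 can be traced layer by layer. The helper below is a minimal sketch; the 224×224 input resolution used in the trace is an assumption for illustration, not a value stated in the table.

```python
def conv_out(n, k, s, p):
    """Output spatial size of a conv/pool layer: input size n,
    kernel k, stride s, padding p (floor convention)."""
    return (n + 2 * p - k) // s + 1

# Tracing the first rows of the table with an assumed 224x224 input:
n = conv_out(224, k=7, s=2, p=3)   # conv1 (7x7/96/2/3) -> 112
n = conv_out(n, k=2, s=2, p=1)     # pool1 (2x2, stride 2, pad 1) -> 57
n = conv_out(n, k=5, s=1, p=2)     # conv2 (5x5/96/1/2) -> 57
```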