Deep Multi-View Learning using Neuron-Wise Correlation-Maximizing Regularizers


Kui Jia, Jiehong Lin, Mingkui Tan, and Dacheng Tao

K. Jia and J. Lin are with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. Email: kuijia@scut.edu.cn, lin.jiehong@mail.scut.edu.cn. M. Tan is with the School of Software Engineering, South China University of Technology, Guangzhou, China. Email: mingkuitan@scut.edu.cn. D. Tao is with the School of Information Technologies, The University of Sydney, New South Wales, Australia. Email: dacheng.tao@sydney.edu.au.
Abstract

Many machine learning problems concern discovering or associating common patterns in data of multiple views or modalities. Multi-view learning is one of the methods for achieving such goals. Recent methods propose deep multi-view networks via adaptation of generic Deep Neural Networks (DNNs), which concatenate features of individual views at intermediate network layers (i.e., fusion layers). In this work, we study the problem of multi-view learning in such end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected or convolutional fusion layers, simply by replacing them with their CorrReg counterparts. By partitioning neurons of a hidden layer in generic DNNs into multiple subsets, we also consider a multi-view feature learning perspective of generic DNNs. Such a perspective enables us to study deep multi-view learning in the context of regularized network training, for which we present control experiments of benchmark image classification to show the efficacy of our proposed CorrReg. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments of RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D object and RGB-D scene datasets. We make the implementation of CorrReg publicly available.

I Introduction

Many machine learning problems concern discovering or associating common patterns in data of multiple views or modalities. Typical applications include retrieving images from texts or vice versa, combining visual and audio signals for content understanding, and object recognition from visual observations of multiple modalities. Data of different views usually contain complementary information, and their statistical distributions in the high-dimensional measurements of individual views may also differ. Multi-view learning methods aim to exploit the information contained in multiple views to better accomplish specified learning tasks. In this work, we take image classification, in particular multi-view or multi-modal object recognition (e.g., recognizing objects from RGB and depth images), as the primary example for studying the problem of multi-view learning.

Given feature observations of different views, existing multi-view learning approaches learn latent space representations in either deterministic [1, 2, 3, 4] or probabilistic manners [5, 6, 7, 8]. The learning objective is to make the resulting features of different views more closely related to each other at each dimension of the latent space, where relations may be measured by different metrics/criteria [9, 10, 4]. Among various techniques, Canonical Correlation Analysis (CCA) [1] and its extensions [11, 12, 2] are the most representative ones. For example, given two-view data, CCA learns pairs of linear projections so that in the projected space, features of both views are maximally correlated at corresponding dimensions.

Following the success of deep learning [13, 14], deep multi-view learning methods [2, 15] have also been proposed recently for learning deep features from multi-view data. These methods apply multi-view learning criteria (e.g., CCA) on top of multiple single-view deep networks (cf. Figure 1-(a) for an illustration). A two-stage scheme of iterative learning is usually adopted to train the network parameters: view-specific features are learned all the way up to the top layers, to which one applies either a sequential step of multi-view criteria followed by the objectives of the specified learning tasks, or regularized learning objectives that seek a balance between multi-view criteria and the final tasks of interest. Alternatively, one may design deep architectures that concatenate, at intermediate network layers (fusion layers), output features of lower, parallel layer streams for individual views [3, 8, 16], followed by upper network layers for specified learning tasks (e.g., image classification, cf. Figure 1-(b) for an illustration). Such end-to-end networks have the advantage that the final tasks of interest are achieved directly at the network outputs. However, output features of the lower, parallel streams in such networks capture view-specific patterns, which may not be aligned in a common space for a ready fusion in the subsequent layers.

To enjoy the advantage of end-to-end learning while collaboratively benefiting from different views, multi-view learning criteria could be exploited to improve the relations between the resulting features of the lower, parallel streams. Under the framework of regularized function learning, this amounts to training network parameters by penalizing objectives of the main learning tasks with correlation-maximizing regularization at fusion layers (cf. Figure 1-(b)). In this work, we are interested in CCA criteria since they have long been the main workhorse for multi-view learning [11, 17, 10]. Empirical results also show that CCA based approaches outperform alternative ones in the context of deep multi-view representation learning [15]. Directly using CCA as the regularizer makes network training very expensive, and it is also incompatible with the mini-batch based stochastic gradient descent (SGD) commonly used in deep network training (cf. Section II for a discussion). Inspired by batch normalization [18], we propose in this paper a novel neuron-wise correlation-maximizing regularizer, and implement the proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected (FC) or convolutional (conv) fusion layers, simply by replacing these layers with their CorrReg versions (cf. Figure 3 for an illustration). CorrReg fusion has the same computational complexity as plain fusion, which is significantly lower than that of CCA regularization.

We note that the fusion layer in a multi-view network of Figure 1-(b) is computationally equivalent to any hidden layer in a generic deep neural network (DNN), by partitioning neurons of the hidden layer into multiple subsets (cf. Figure 1-(c) for an illustration). This equivalence suggests a multi-view feature learning perspective of generic DNNs: when considering hidden neurons of generic DNNs as pattern detectors that characterize different patterns of the input data [19], features learned at each neuron subset of a hidden layer can be considered as a specific view of the input data. Ideally, each of such feature views is to learn salient or discriminative patterns of the input data, which should also be generalizable to unseen data. However, there exists an issue of overfitting, a phenomenon in which specific subsets of layer neurons are trained to be co-adapted to certain patterns in the training data, but cannot generalize well to held-out test data [20]. As suggested by traditional learning theory [21], the risk of overfitting could be severe for generic DNNs, since modern DNNs have large model capacities and are usually over-parameterized, in the sense that they can be trained to fit randomly labelled datasets [22]. Such a risk for generic DNNs can be addressed implicitly by SGD training [23, 24, 25], and explicitly by additional regularization [20, 18]. Our proposed CorrReg takes the second, regularization approach: improving correlations between neuron subsets reduces co-adaptation to possibly noisy or irrelevant, subset-specific patterns, and thus alleviates the problem of overfitting. Such a connection with generic DNNs enables us to study CorrReg in the context of regularized network training, and to compare with modern regularization techniques (e.g., Dropout [20]) for deep multi-view representation learning.

We finally summarize our contributions as follows.

  • We study in this work the problem of learning deep representations from multi-view data in end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg), which will be made publicly available. CorrReg can be applied to either FC or conv based fusion layers, simply by replacing these layers with their CorrReg versions. CorrReg fusion layer has the same computational complexity as a plain one does, which is significantly lower than that of CCA regularization.

  • We consider a multi-view feature learning perspective of generic DNNs, by partitioning neurons of a hidden layer in a generic DNN into multiple subsets. Such a connection with generic DNNs enables us to study deep multi-view representation learning in the context of regularized network training, and to compare CorrReg with modern regularization techniques (e.g., Dropout [20]). We note that such a comparison is largely ignored in existing deep multi-view learning methods.

  • To investigate the efficacy of CorrReg for regularization of network training, we conduct control experiments on benchmark image classification [26] using generic DNNs [27, 28, 29, 30]. CorrReg consistently improves performance of these networks. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments of RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D object [31] and RGB-D scene [32] datasets.



Fig. 1: Two-view illustrations of deep networks for multi-view learning, where fusion layers are inside shaded regions that concatenate features of individual views. (a) Deep multi-view features are learned by applying multi-view learning criteria, possibly together with the main learning objective, on top of multiple single-view deep networks; (b) a deep network takes as inputs data of multiple views/modalities for a specified learning task (e.g., multi-modal image classification), where features of individual views are concatenated at an intermediate layer (i.e., the fusion layer), and multi-view learning criteria can be imposed as a regularization on the fusion layer; (c) in generic DNNs, input neurons of a hidden layer can be partitioned into multiple subsets (represented as black circles grouped in different dashed boxes), and features learned at different subsets could be considered as multi-view features of the same input data.

II The Proposed Neuron-Wise Correlation-Maximizing Regularizers

Fig. 2: Two sub-layers are formed by randomly partitioning the input neurons and the corresponding weights of a network layer into two subsets (Left). For a specified output neuron of the layer, two internal features $f_1$ and $f_2$ can be computed from the partition (Right). Blank circles and dashed lines represent one sub-layer, and filled circles and solid lines represent the other.

In this section, we first use generic DNNs to present our proposed regularization method, and explain how our method is applied to their hidden layers. Our method is readily applied to fusion layers of deep networks that take practical multi-view data, which will be introduced in Section III.

We start with a DNN composed of FC layers. Denote its network parameters as $\Theta = \{W^l, b^l\}_{l=1}^{L}$, where $W^l$ and $b^l$ are respectively the weight matrix and bias vector associated with the $l^{th}$ network layer. In the setting of supervised learning, given $N$ training samples $\{(x_i, y_i)\}_{i=1}^{N}$ of categorical data, the network parameters in $\Theta$ are optimized by minimizing the empirical risk $\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(x_i; \Theta), y_i)$, where $\mathcal{L}$ is a properly chosen loss function, e.g., cross-entropy loss for image classification, and optimization is typically based on SGD or its variants [33]. As discussed in Section I, DNNs of high model capacities are able to learn complex functions but are susceptible to overfitting. As a remedy, one may apply regularization to reduce their model capacities. Adding regularization to the network training objective results in the following optimization problem

$$ \min_{\Theta} \ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \Theta), y_i\big) + \lambda \, \Omega(\Theta) , \qquad (1) $$

where $\Omega$ is the regularizer to be specified, and $\lambda$ is a trade-off parameter.

In this work, we are interested in regularizing network training using CCA based multi-view learning criteria [2, 15]. More specifically, for a specified network layer $l$ and a training sample $(x, y)$, denote by $z \in \mathbb{R}^{d}$ the feature vector of $x$ computed all the way up from the input layer to the $(l-1)^{th}$ layer. The $l^{th}$ layer computes $\sigma(W^{\top} z + b)$, where $\sigma(\cdot)$ is an element-wise nonlinear activation function such as ReLU [34], and $W \in \mathbb{R}^{d \times m}$ and $b \in \mathbb{R}^{m}$ are the weight matrix and bias vector respectively; we omit the layer superscript $l$ to keep the following notations clear. By randomly partitioning the input neurons/dimensions of the layer into two subsets (for simplicity, we only consider in this work the case of partitioning neurons of a hidden layer in a generic DNN into two subsets), we get feature subvectors $z_1 \in \mathbb{R}^{d_1}$ and $z_2 \in \mathbb{R}^{d_2}$ from $z$, and the corresponding weight submatrices $W_1 \in \mathbb{R}^{d_1 \times m}$ and $W_2 \in \mathbb{R}^{d_2 \times m}$ from $W$. We simply have $W^{\top} z = W_1^{\top} z_1 + W_2^{\top} z_2$. Discussions in Section I suggest that $z_1$ and $z_2$ can be analogously considered as two feature views of the input data. The co-adaptation of features in each view is likely to cause network training to pay more attention to the extraction of view-specific patterns, rather than the category related patterns that are desired to be learned by network training. To address this issue and benefit more from both views of the features, one may use CCA criteria to regularize network training, which aim to increase the feature correlations between the two views. Given the two-view features $\{z_1^i\}_{i=1}^{N}$ and $\{z_2^i\}_{i=1}^{N}$ of the training set, applying CCA to the network layer amounts to optimizing $W_1$ and $W_2$ by

$$ \max_{W_1, W_2} \ \mathrm{tr}\big( W_1^{\top} Z_1 Z_2^{\top} W_2 \big) \quad \mathrm{s.t.} \quad W_1^{\top} Z_1 Z_1^{\top} W_1 = W_2^{\top} Z_2 Z_2^{\top} W_2 = I , \qquad (2) $$

where $I$ is an identity matrix of compatible size, and $Z_1 = [z_1^1, \dots, z_1^N]$ and $Z_2 = [z_2^1, \dots, z_2^N]$ are the data matrices of the two views, which have been assumed centered for simplicity. Applying the above problem to the network layer also assumes implicitly that $m \leq \min(d_1, d_2)$.

When directly using (2) as the regularizer $\Omega$ in (1), network training requires solving (2) with stochastic optimization methods. Unfortunately, as pointed out in [9], the objective (2) does not easily admit stochastic optimization due to the involvement of data covariance matrices in the constraints. One may use batch gradient descent to solve (2). It computes gradients of the correlation objective w.r.t. the CCA projected features $W_1^{\top} Z_1$ and $W_2^{\top} Z_2$, which in turn are used through back-propagation to compute the gradients w.r.t. $W_1$ and $W_2$, and w.r.t. all the network parameters in the layers below [2]. This is expensive as it involves computing covariance matrices (of $z_1$ and $z_2$), their inverse square roots, and also performing matrix singular value decomposition (SVD). One may nevertheless try mini-batch based gradient descent to solve (2), which, however, may produce singular data covariance matrices; [2] also points out that solving (2) by mini-batch based stochastic optimization empirically gives unsatisfactory results. Given these challenges of directly using CCA as the regularizer $\Omega$, we are motivated to find an alternative way to improve the correlations between the two feature views $z_1$ and $z_2$.

Inspired by batch normalization [18], we propose to simplify the full CCA regularization in DNNs by considering the following two aspects. Firstly, instead of learning $W_1$ and $W_2$ to increase correlations at all the $m$ dimensions of the resulting features jointly, we propose to learn $w_1^k$ and $w_2^k$, $k = 1, \dots, m$, independently for each output neuron of the layer, where $w_1^k$ and $w_2^k$ are the $k^{th}$ columns of $W_1$ and $W_2$ respectively. Note that such a decoupling suggests that the resulting features at the $m$ dimensions could be correlated, but at the same time the orthonormality constraint in (2) is also relaxed, enabling its flexible use in DNNs as tasks demand. Secondly, we use mini-batches of training samples, rather than all of them, to approximate the statistics (i.e., mean and variance) necessary for computing correlations. This second simplification is enabled by the independent neuron-wise learning of $w_1^k$ and $w_2^k$, since in the joint case, the size of mini-batches would be required to be big enough to avoid singularity of covariance matrices.

For a specified output neuron of the layer, we first introduce the (scalar) random variables $f_1$ and $f_2$ whose samples are respectively computed as $f_1 = w_1^{\top} z_1$ and $f_2 = w_2^{\top} z_2$, as illustrated in Figure 2, where we omit the neuron index $k$ for notational clarity. Given a mini-batch of $n$ training samples, we propose in this work the following neuron-wise correlation-maximizing regularizer

$$ \Omega(w_1, w_2) = - \frac{ \sum_{i=1}^{n} (f_1^i - \bar{f}_1)(f_2^i - \bar{f}_2) }{ \sqrt{ \sum_{i=1}^{n} (f_1^i - \bar{f}_1)^2 + \epsilon } \, \sqrt{ \sum_{i=1}^{n} (f_2^i - \bar{f}_2)^2 + \epsilon } } , \qquad (3) $$

where the scalar $\epsilon$ is introduced for numerical stability, and $\bar{f}_1 = \frac{1}{n} \sum_{i=1}^{n} f_1^i$ and $\bar{f}_2 = \frac{1}{n} \sum_{i=1}^{n} f_2^i$, with $f_1^i = w_1^{\top} z_1^i$ and $f_2^i = w_2^{\top} z_2^i$ computed from the $i^{th}$ sample of the mini-batch.
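To make the computation of (3) concrete, the following is a minimal PyTorch sketch of the regularizer for one mini-batch, applied to all $m$ output neurons of a layer at once (our released implementation is in Torch; the function and variable names here are ours, not the paper's):

```python
# A minimal sketch of the neuron-wise regularizer (3); names are ours.
import torch

def corr_reg(z1, z2, w1, w2, eps=1e-5):
    """Negative per-neuron correlation between the two internal features.

    z1: (n, d1) mini-batch of view-1 inputs;  z2: (n, d2) view-2 inputs.
    w1: (d1, m) view-1 weight submatrix;      w2: (d2, m) view-2 submatrix.
    Returns a scalar: the regularizers of all m output neurons, summed.
    """
    f1 = z1 @ w1                      # (n, m) internal features f_1
    f2 = z2 @ w2                      # (n, m) internal features f_2
    f1c = f1 - f1.mean(dim=0)         # center over the mini-batch
    f2c = f2 - f2.mean(dim=0)
    cov = (f1c * f2c).sum(dim=0)      # per-neuron cross term
    denom = torch.sqrt((f1c ** 2).sum(dim=0) + eps) * \
            torch.sqrt((f2c ** 2).sum(dim=0) + eps)
    return -(cov / denom).sum()       # maximizing correlation = minimizing this
```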

Based on the neuron-wise regularizer (3), we specify the general objective function (1) as the following regularized problem to improve network training

$$ \min_{\Theta} \ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \Theta), y_i\big) + \lambda \sum_{k \in \mathcal{K}} \Omega(\Theta_k) , \qquad (4) $$

where $k$ indexes the group $\mathcal{K}$ of neurons in the network layers that are specified to apply regularization, $\Theta_k$ denotes the subset of network parameters that are involved in the computation of the incoming features of the neuron $k$, and we have slightly abused the notation $\Omega$ with that in (3). Note that $\Theta_k$ and $\Theta_{k'}$ may contain overlapped parameters, i.e., those parameters associated with the common layers below the neurons $k$ and $k'$. The main objective (4) can be optimized using SGD or its variants [33], by sampling mini-batches of training samples in iterative steps. Details are presented shortly. Complexity analysis presented in Section II-A1 shows that compared with standard SGD training, our regularizer (3) increases computation cost only by a constant factor.
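As a hedged illustration of how (4) can be optimized, the sketch below adds the regularizer to the main loss inside one SGD step and lets automatic differentiation produce the gradients. The attributes model.z1, model.z2, model.w1, and model.w2 are hypothetical handles to the partitioned inputs and weights of the regularized layer, and corr_reg is the function sketched above:

```python
# Hypothetical training step for objective (4): main loss plus lambda times
# the regularizers of the CorrReg-specified layer; corr_reg as sketched above.
import torch

def train_step(model, x, y, optimizer, lam):
    optimizer.zero_grad()
    logits = model(x)                           # forward pass
    loss = torch.nn.functional.cross_entropy(logits, y)
    # model is assumed to expose the regularized layer's inputs/weight halves.
    loss = loss + lam * corr_reg(model.z1, model.z2, model.w1, model.w2)
    loss.backward()                             # gradients of (4) via autograd
    optimizer.step()
    return loss.item()
```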

II-A Correlation-Regularized Network Layer

We still use the illustration in Figure 2 as the running example. In the forward pass of a mini-batch of size $n$, any output neuron of the layer that is specified to apply the regularization (3) computes $f_1^i = w_1^{\top} z_1^i$ and $f_2^i = w_2^{\top} z_2^i$, $i = 1, \dots, n$, which give output features of the neuron, before nonlinear activation $\sigma$, as $f^i = f_1^i + f_2^i + b$, $i = 1, \dots, n$, where $b$ is the bias associated with this neuron. (For each neuron that is applied the regularization, we have also tried introducing trainable scalar parameters $\gamma_1$ and $\gamma_2$ to re-scale the internal features $f_1$ and $f_2$, i.e., $f = \gamma_1 f_1 + \gamma_2 f_2 + b$, which is similar to the scheme introduced in [35]. In our experiments this alternative scheme does not necessarily improve the performance.) In the backward pass, we need to compute the gradients of the neuron-wise regularizer w.r.t. the weight vectors $w_1$ and $w_2$, and also those w.r.t. the input features $z_1^i$ and $z_2^i$, $i = 1, \dots, n$. These gradients can be derived via the multi-variable chain rule, and we give their explicit forms in Appendix A. Note that the latter ones are used to compute, through back-propagation, gradients of the regularizer w.r.t. network parameters in the lower layers that are also involved in the computation of $z$.

We usually apply (3) to all the output neurons of the specified layer. To make regularized network training efficient, we note that for these neurons, their respective internal features and statistics, i.e., $f_1^i$, $f_2^i$, $\bar{f}_1$, $\bar{f}_2$, and the mini-batch variances of $f_1$ and $f_2$, and also the corresponding gradients, can be computed independently and in parallel. Given the mini-batch of layer inputs of the two-view features $\{z_1^i, z_2^i\}_{i=1}^{n}$, we write gradients of the neuron-wise regularizers in the compact forms $\frac{\partial \Omega}{\partial Z_1}$, $\frac{\partial \Omega}{\partial Z_2}$, $\frac{\partial \Omega}{\partial W_1}$, and $\frac{\partial \Omega}{\partial W_2}$, where $\Omega$ denotes the correlation objectives compactly. More specifically, $\frac{\partial \Omega}{\partial Z_1}$ sums the gradients of the neuron-wise regularizers in the layer w.r.t. the inputs $\{z_1^i\}$, and $\frac{\partial \Omega}{\partial W_1}$ independently collects the gradient of each neuron-wise regularizer w.r.t. its associated weight vector $w_1^k$; the same operations apply to $\frac{\partial \Omega}{\partial Z_2}$ and $\frac{\partial \Omega}{\partial W_2}$.

In other words, we implement our proposed scheme (3) as a correlation-regularized network layer (CorrReg): in the forward pass, the computation is the same as that of a standard network layer, i.e., we do not explicitly compute the internal features $f_1$ and $f_2$, and instead directly compute $W^{\top} z + b$, followed by element-wise nonlinear activation; in the backward pass, we compute the gradients of the correlation-regularized loss w.r.t. layer weights and layer inputs, which we spell out as

$$ \frac{\partial (\mathcal{L} + \lambda \Omega)}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial W_1} + \lambda \frac{\partial \Omega}{\partial W_1} , \quad \frac{\partial (\mathcal{L} + \lambda \Omega)}{\partial Z_1} = \frac{\partial \mathcal{L}}{\partial Z_1} + \lambda \frac{\partial \Omega}{\partial Z_1} , $$

and analogously for $W_2$ and $Z_2$, where $\frac{\partial \mathcal{L}}{\partial W_1}$ and $\frac{\partial \mathcal{L}}{\partial Z_1}$ can be computed via standard back-propagation. The gradient of the main loss w.r.t. the layer bias vector $b$ can be obtained in the same way.
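The following is a minimal sketch of such a CorrReg FC layer in PyTorch, under the assumption that one relies on automatic differentiation rather than the explicit gradient formulas of Appendix A; the class and attribute names are ours:

```python
# A sketch of a CorrReg FC layer. The forward pass equals a standard linear
# layer (W^T z + b); the regularizer is computed from the two weight halves
# and exposed via .reg so it can be added to the loss, after which autograd
# yields the gradients spelled out above. corr_reg is as sketched earlier.
import torch.nn as nn

class CorrRegLinear(nn.Module):
    def __init__(self, in_features, out_features, eps=1e-5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.split = in_features // 2     # fixed two-way partition
        self.eps = eps
        self.reg = None

    def forward(self, z):
        z1, z2 = z[:, :self.split], z[:, self.split:]
        w = self.linear.weight            # (m, d): rows are output neurons
        w1, w2 = w[:, :self.split].t(), w[:, self.split:].t()
        self.reg = corr_reg(z1, z2, w1, w2, self.eps)
        return self.linear(z)             # same computation as a plain layer
```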

II-A1 Analysis of Computational Complexity

We analyze the additional computation cost incurred by imposing CorrReg on a network layer. Consider an FC layer that computes $W^{\top} z + b$ for a mini-batch of $n$ samples, and that has layer parameters $W \in \mathbb{R}^{d \times m}$ and $b \in \mathbb{R}^{m}$. Assume arithmetic with individual elements has complexity $\mathcal{O}(1)$. In the forward pass, the computation for the layer with or without CorrReg is the same, and its complexity is $\mathcal{O}(ndm)$. In the backward pass, without using CorrReg the complexity for back-propagating gradients through this layer is $\mathcal{O}(ndm)$. By simplifying the gradient formulas given in Appendix A and also writing them in matrix forms, we have the same complexity of $\mathcal{O}(ndm)$ when using CorrReg. In summary, the complexity incurred by imposing CorrReg on a network layer increases only by a constant factor. In contrast, using the CCA objective (2) as the regularizer involves computing inverse square roots of covariance matrices of size $\frac{d}{2} \times \frac{d}{2}$, and also performing SVD for a matrix of the same size (one may refer to [2] for gradient formulas of the CCA objective); it has an overall complexity of $\mathcal{O}(nd^2 + d^3)$ in the backward pass, which is significantly worse than that of CorrReg.
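To make the gap concrete under assumed sizes (an illustration of ours, not figures from the paper): for a layer with $d = 512$ inputs, $m = 256$ outputs, and mini-batches of $n = 128$, the CorrReg backward pass stays on the order of $ndm \approx 1.7 \times 10^{7}$ elementary operations, whereas CCA regularization additionally needs the $\frac{d}{2} \times \frac{d}{2}$ covariance matrices, their inverse square roots, and an SVD, on the order of $nd^2 + d^3 \approx 3.4 \times 10^{7} + 1.3 \times 10^{8}$ operations per iteration, roughly an order of magnitude more.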

II-B Correlation-Regularized Convolutional Networks

Our proposed CorrReg can regularize both FC and conv layers. As discussed above, for an FC layer computing $\sigma(W^{\top} z + b)$, we apply regularization by performing a random two-way partition on the feature dimensions of the input $z$, producing the two internal features $f_1$ and $f_2$ for each output neuron, where the partition is fixed once determined. CorrReg indeed improves correlations between each such pair of $f_1$ and $f_2$. It is straightforward to extend the above scheme of a single two-way partition in CorrReg to a version with multiple two-way partitions. For a specified FC layer, one may simply perform multiple random two-way partitions on the feature dimensions of $z$; each of them produces its respective internal features, and also its respective gradients w.r.t. layer weights and layer inputs. The overall regularization imposed on this layer can be obtained by averaging the gradients from these multiple two-way partitions, as sketched below. Experiments investigating the efficacy of this scheme of multiple two-way partitions are reported in Section V-A.
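A minimal sketch of this multiple-partitions scheme, assuming the partitions are realized as fixed random permutations generated once before training (names are ours; corr_reg is as sketched in Section II):

```python
# Each fixed two-way partition contributes its own regularizer; the overall
# penalty averages over the nReg partitions.
import torch

def multi_partition_reg(z, w, perms, eps=1e-5):
    """z: (n, d) layer inputs; w: (d, m) layer weights;
    perms: list of nReg fixed permutations of range(d)."""
    total = 0.0
    for p in perms:                      # p: LongTensor of length d
        zp, wp = z[:, p], w[p, :]        # permute input dims and weight rows
        half = p.numel() // 2
        total = total + corr_reg(zp[:, :half], zp[:, half:],
                                 wp[:half, :], wp[half:, :], eps)
    return total / len(perms)
```

The permutations would typically be drawn once, e.g., perms = [torch.randperm(d) for _ in range(nReg)], and kept fixed thereafter, matching the fixed-once-determined partitions described above.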

For a conv layer, we perform single or multiple random two-way partitions on its input feature maps in the same way as for an FC layer. Each two-way partition produces two internal feature maps (corresponding to $f_1$ and $f_2$ in Figure 2) for each output feature map of the layer. Although features/observations at nearby locations of an image are generally correlated, we do not explicitly exploit such correlations. Instead, we independently apply regularization at each spatial location/pixel of each output feature map of the layer, so that correlations between the corresponding spatial locations in each pair of internal feature maps are improved, where regularization is again applied before nonlinear activation (see the sketch after this paragraph). Applying our proposed CorrReg to modern architectures of DNNs (e.g., ConvNets [36], variants of ResNets [28, 29], or DenseNets [30]) is very simple: one simply replaces FC or conv layers with their CorrReg versions.
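A minimal sketch of the conv-layer case (ours; the shapes and names are assumptions), where every spatial location of every output map contributes its own neuron-wise regularizer computed over the mini-batch:

```python
# Channel-wise split of input maps and kernels into two halves; each half is
# convolved separately to give the internal maps f1 and f2, and the per-
# location correlations over the mini-batch are penalized as in (3).
import torch
import torch.nn.functional as F

def conv_corr_reg(x, weight, eps=1e-5):
    """x: (n, c, h, w) input maps; weight: (m, c, kh, kw) conv kernels."""
    c = x.size(1) // 2
    f1 = F.conv2d(x[:, :c], weight[:, :c])    # (n, m, h', w') internal maps
    f2 = F.conv2d(x[:, c:], weight[:, c:])
    f1c = f1 - f1.mean(dim=0, keepdim=True)   # center over the mini-batch
    f2c = f2 - f2.mean(dim=0, keepdim=True)
    cov = (f1c * f2c).sum(dim=0)              # (m, h', w') per-location terms
    denom = torch.sqrt((f1c ** 2).sum(dim=0) + eps) * \
            torch.sqrt((f2c ** 2).sum(dim=0) + eps)
    return -(cov / denom).sum()               # sum over neurons and locations
```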

III Use of CorrReg Fusion in Deep Representation Learning from Multi-View Data


Fig. 3: Illustration of a CorrReg fusion layer (inside the dashed box).

We present in this section how CorrReg can be readily applied to deep networks that take as inputs data of multiple views/modalities and learn deep representations from them. Different from generic DNNs, these networks by design have lower, parallel streams for data of individual views, and it is natural to apply CorrReg to the fusion layers where features of different views are concatenated for use in the subsequent layers. The fusion layer in such a network is usually based on FC or conv layers. To use CorrReg, one may simply treat output features of the two parallel streams as the input two-view features of CorrReg (corresponding to $z_1$ and $z_2$ in Section II), and replace the fusion layer with its CorrReg version, resulting in a CorrReg fusion layer. Figure 3 gives the illustration. In the following, we take network architectures used in our RGB-D recognition experiments to instantiate the use of CorrReg fusion.

Our networks for RGB-D object recognition and scene recognition are based on ConvNets [13] and ResNets [28]. Take an 8-layer ConvNet as the example. We modify it so that it takes inputs of both RGB and depth channels. The modified 8-layer ConvNet is composed of two lower, parallel streams followed by an upper, single stream. Outputs of the two lower streams are concatenated as inputs of the upper stream, and CorrReg is applied to the first layer of the upper stream, which thus becomes a CorrReg fusion layer; a code sketch of this pattern follows below. In this work, we also investigate the effects when the CorrReg fusion layer is at different “heights” (lower, middle, or upper layers) of the network. Figure 5 gives the illustration. The adaptation of ResNets is similar to the above. We use such adapted network architectures for the experiments of RGB-D object recognition and scene recognition in Section V-B.
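A minimal sketch (ours) of this two-stream pattern, reusing the CorrRegLinear layer sketched in Section II-A; here the fixed two-way partition coincides with the boundary between the concatenated RGB and depth features:

```python
# Two modality-specific streams, a CorrReg fusion layer, and a classifier.
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    def __init__(self, rgb_stream, depth_stream, feat_dim, num_classes):
        super().__init__()
        self.rgb_stream, self.depth_stream = rgb_stream, depth_stream
        self.fusion = CorrRegLinear(2 * feat_dim, feat_dim)  # CorrReg fusion
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, depth):
        z = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        h = torch.relu(self.fusion(z))       # regularizer cached in .reg
        return self.classifier(h)
```

During training, the fusion regularizer would simply be added to the main loss, e.g., loss = cross_entropy + lam * model.fusion.reg.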

IV Relations with Existing Works

Our proposed CorrReg method is related to four categories of existing research: regularization of generic DNNs, deep learning research that specially focuses on multi-view representation learning, RGB-D object/scene recognition, and multi-view recognition of 3D object shapes. We respectively discuss the relations as follows.

Network regularization In the literature of DNNs, various regularization techniques have been proposed to address the issue of overfitting in network training, including the traditional early stopping, weight decay, and data augmentation [37], and also the more recent Dropout [20], Dropconnect [38], batch normalization [18], and all-conv-layer based networks [39, 40]. Among them, Dropout and Dropconnect are most related to our proposed method. In the original proposal of Dropout [20], each hidden neuron is randomly dropped (usually with a probability of 0.5) at each training iteration, and the network is then updated on weights that are connected to the remaining neurons. During inference, all network weights are used after halving their values. Baldi and Sadowski [41] quantitatively show that the random operations of dropout training and the associated inference can be understood as a good approximation to the expectation of outputs of a subnetwork ensemble, by introducing a bridging quantity termed the Normalized Weighted Geometric Mean. They further show that the expectation of dropout gradients w.r.t. a network weight is approximately the gradient of the subnetwork ensemble regularized by adaptive weight decay. Wager et al. [42] present an alternative interpretation of dropout training as adaptive weight decay, by treating dropout as feature noising in generalized linear models. Analysis similar to [41] can be applied to Dropconnect [38], which randomly drops weight connections rather than network neurons.

Dropout and Dropconnect achieve regularization by first sampling features/subnetworks (of shared weights), and then averaging over outputs of the subnetwork ensemble. Different from them, our CorrReg scheme explicitly increases the correlations between (internal) features of different views, and regularization is achieved by suppressing view-specific noisy patterns. We empirically show in this work the usefulness of CorrReg in improving network training, and leave its theoretical connections with classical regularization to future research. Note that a few recent methods of network regularization explicitly reduce correlations across dimensions of the output features of a layer [43], or achieve similar effects by enforcing orthogonality of the layer weight matrix [44, 45]. These methods impose regularization complementary to our proposed CorrReg, and we are interested in the investigation of their combined use in future research.

Deep multi-view representation learning Recent deep multi-view representation learning methods include those based on CCA [2, 15, 46] and those based on auto-encoders (AE) [3]. AE based methods typically learn a shared bottleneck layer on top of lower view-specific layers, and the learned joint representation at the bottleneck layer is used for reconstruction of multiple views. Deep CCA [2] directly applies CCA to the output layers of two deep networks, so that the learned networks can produce maximally correlated features at the output layers. Wang et al. [15] extend deep CCA as Deep Canonically Correlated Auto-Encoders, by balancing the correlation objective between the two views with their respective reconstruction objectives.

Most existing deep multi-view learning works take a two-stage strategy for the final tasks of interest: they first learn deep features in a common space from data of multiple views/modalities, and then use the learned features in the common space either to train classifiers for multi-view or cross-view classification, or to reconstruct data of missing views. In contrast, our use of correlation based multi-view learning is to regularize the training of end-to-end networks, where the parameters that project multi-view data into a common space are exactly those of an intermediate network layer.

RGB-D object and scene recognition RGB-D object recognition [47, 48, 8, 49] has drawn research attention recently as a typical application of multi-view learning. Lai et al. [31] collect the first large-scale, hierarchical RGB-D object dataset using a Kinect camera; they show that depth information substantially helps object recognition, by concatenating hand-crafted depth features (e.g., spin images [50]) with RGB ones, and using the concatenated features for classification. In [16], a hierarchical learning model of Convolutional and Recursive Neural Networks (CNNs and RNNs) is proposed, where CNNs are used for learning low-level features and RNNs with random weights for efficiently extracting higher-order features; this combined model is applied to RGB and depth images separately, and the resulting features are concatenated for classification.

Eitel et al. [48] propose a multi-modal deep learning architecture for RGB-D object recognition, which fuses, via feature concatenation, outputs of two parallel streams of modality-specific subnetworks (composed of conv and FC layers), and uses two additional FC layers for feature transformation and softmax classification; the whole network is trained via standard back-propagation with no consideration of multi-view learning/regularization criteria. Built on top of two parallel streams of ResNets (after removing their respective last layers of classifier) [28], a Correlated and Individual Multi-modal (CIM) learning layer is proposed in [51]; CIM aims to learn, in a discriminative and complementary manner, both correlated and modality-specific features from output vectors of the two ResNets, where “correlation” is measured by the Euclidean distance between projected features of the two ResNets’ outputs; parameters of the whole network in [51] are updated in an alternating manner: those of the two lower streams of ResNets are updated after updating of the CIM parameters (projection matrices) converges. In [52], a deep learning framework termed MDSI-CNN is proposed to learn highly discriminative and spatially invariant multi-modal feature representations at different hierarchical levels, which is technically achieved by introducing spatial transformer network [53] and Fisher encoding into CNN architectures. The problem of RGB-D image classification with limited training samples is addressed in [54]. It takes a domain adaptation approach and enforces the prediction consistency between two classifiers that are respectively learned either from the combined RGB and depth features or from the RGB features alone. Results on RGB-D object recognition show the efficacy of the proposed approach.

Methods for RGB-D scene recognition [55, 56, 57, 58] largely follow those of RGB-D object recognition. In particular, multi-modal deep architectures similar to that of [48] are still the main workhorse to get good recognition performance. We also use such a type of networks for RGB-D recognition, but with our proposed CorrReg fusion layers that have in-built neuron-wise correlation regularization. Training of CorrReg fusion based networks has no difference from standard back-propagation.

Multi-view recognition of 3D object shapes Recent research shows that multi-view images are a promising representation for recognition of 3D object shapes. Given a 3D object model (mesh), multiple 2D images can be rendered by placing virtual cameras around the object, and recognition is based on the rendered 2D images of multiple views. Among recent methods, MVCNN [59] is a representative one that uses parallel streams of conv layers to extract features from individual views, and then aggregates these features as a global signature simply via max pooling across different views. Subsequent works improve over MVCNN by strengthening interaction among feature learning of individual views. For example, MHBN [60] uses harmonized bilinear pooling to aggregate local features, and GVCNN [61] proposes a group-view framework to model correlations among different views at a hierarchy of multiple levels. In this work, we adapt the architectural design of MVCNN by incorporating into it CorrReg fusion layers. We use such an adapted architecture to verify the usefulness of CorrReg for multi-view based 3D object recognition.

V Experiments

No CorrReg | CorrReg Conv2 | CorrReg Conv3 | CorrReg FC4 | CorrReg FC5 | CorrReg FC6

TABLE I: Error rates (%) on CIFAR-10 [26] when applying CorrReg, with varying numbers nReg of random two-way partitions, to different layers of a variant of LeNet [27]; the “No CorrReg” column applies CorrReg to no network layer. Experiments of each setting are run multiple times, and results are in the format of mean (standard deviation).

In this section, we first present control experiments of image classification to investigate the effectiveness of our proposed CorrReg for regularization of network training. We use generic DNNs including ConvNet (LeNet) [27], and modern deep architectures of ResNet [28], Wide ResNet [29], DenseNet [30], and ResNeXt [62]. These experiments are conducted on the benchmark datasets of CIFAR-10, CIFAR-100 [26], and ImageNet [63]. We then present experiments of RGB-D object/scene recognition and multi-view 3D object recognition to evaluate the usefulness of CorrReg for practical multi-view learning problems. We use the benchmark datasets of RGB-D object [31], RGB-D scene [32], and ModelNet40 [64] for these experiments, and compare with the state-of-the-art results.

We use the cross-entropy loss to train all these networks. Training is based on SGD with momentum; unless otherwise mentioned, we use fixed settings of mini-batch size, momentum, and weight decay; network parameters are initialized using Gaussian random weights; batch normalization is applied, before ReLU nonlinearity, in all networks to accelerate their training. In each experiment, the initial learning rate, the value of $\lambda$ in (4), and also the dropping rate of Dropout (when using Dropout regularization) are determined by holding out a portion of training samples as the validation set. As (4) suggests, we use a constant $\lambda$ value for all neurons that are specified to apply CorrReg. Learning rates are decayed when learning curves plateau. Our implementation and experiments are based on the Torch library [65].

V-A Control Experiments of Image Classification

We use the CIFAR-10 dataset for our controlled studies on a plain ConvNet (a variant of LeNet [27]). The CIFAR-10 dataset consists of 10 object categories of 32×32 color images (50,000 training and 10,000 testing ones). We follow [66] and preprocess the data using global contrast normalization and ZCA whitening. Our LeNet variant consists of 3 conv layers (the first one is the input layer), followed by 3 FC layers (the last one is the output layer). Max or average pooling layers are applied after each conv layer. More layer specifics are given in Appendix B.

We first investigate the regularization effects when applying CorrReg to different network layers. To this end, we replace each of the network layers (except the input one), namely Conv2, Conv3, FC4, FC5, and FC6, with their CorrReg versions respectively, and compare the recognition performance. As indicated in Section II-B, the scheme of multiple (random) two-way partitions can be used when applying CorrReg to any network layer. We also investigate how different numbers nReg of two-way partitions in CorrReg achieve regularization, for which we set nReg to a range of values. Note that when nReg = 1, we use the first half of the neurons/feature maps of the layer as one subset, and the other half as the second subset; when nReg > 1, regularization is achieved by averaging over those of the multiple two-way partitions. We run experiments of each setting (the layer/nReg pair) multiple times, and report results in the format of mean (standard deviation). Error rates reported in Table I tell that applying CorrReg, with any number nReg of two-way partitions, to these layers consistently achieves a performance boost over the LeNet variant baseline. In general, CorrReg is more effective for (upper) FC layers; this is reasonable since a densely connected FC layer contains many more trainable parameters than a conv layer does, and is thus more susceptible to overfitting. Setting nReg > 1 sometimes helps in getting even better results, but at the cost of slightly increased computation. In the subsequent experiments, we simply set nReg = 1 for computational efficiency.

As a technique for regularization of network training, CorrReg is related to the methods [20, 38, 18] discussed in Section IV, in particular Dropout [20]. To compare CorrReg with Dropout, we apply them to the FC5 layer of the LeNet variant. Since their working mechanisms are different, one might also be interested in using them together. Table II reports the comparative results. CorrReg achieves improvement comparable to that of Dropout, where the dropping rate is optimally set by tuning on the validation set. Using CorrReg together with Dropout further improves the performance, showing the complementary regularization benefit of CorrReg to that of Dropout.

Methods | Error rates (%)
Plain LeNet variant
Dropout [20]
CorrReg
CorrReg + Dropout
TABLE II: Comparison of image classification on CIFAR-10 [26] when applying CorrReg and/or Dropout to an upper FC layer of a variant of LeNet [27]. Experiments are run multiple times, and results are in the format of mean (standard deviation).

CorrReg achieves regularization via improving correlations between the internal features produced by a two-way partition of a layer, which suggests a natural alternative that halves the number of layer neurons. Halving the number of layer neurons reduces the model capacity and creates a “bottlenecking” of information flow, thus implicitly imposing regularization. Conversely, one may also be interested in alternatives that increase the number of layer neurons by varying factors. To investigate the efficacy of CorrReg for models with different capacities, we apply these alternatives again to the FC5 layer of the LeNet variant. Results in Table III show that larger models perform better than smaller ones, and applying our proposed CorrReg to larger models further improves the performance. We also compute in Table III the averaged (pair-wise) correlation among features of training samples learned at different layer neurons, in order to understand how network capacities relate to the behavior of feature correlations across layer neurons and how CorrReg plays a role here as a regularization. Results show that as the number of layer neurons increases, feature correlations between layer neurons increase, and applying CorrReg further enhances this effect.

Neuron No. | 128 | 256 | 512 | 1024
W/O CorrReg
With CorrReg
TABLE III: Experiments on CIFAR-10 [26] using a variant of LeNet [27]. CorrReg is optionally applied to an upper FC layer with varying numbers of layer neurons. Results are in the format of error rate (%) / correlation coefficient. Refer to the main text for how correlation is computed.
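As a hedged illustration of the correlation statistic referenced in Table III, the sketch below computes the mean pair-wise correlation among layer neurons from their responses over training samples; averaging absolute off-diagonal values is our reading, not a detail stated in the paper:

```python
# Mean pair-wise correlation among m layer neurons over N training samples.
import torch

def mean_pairwise_corr(features):
    """features: (N, m) responses of m layer neurons over N samples."""
    corr = torch.corrcoef(features.t())   # (m, m) correlation matrix
    m = corr.size(0)
    # diagonal entries are 1; average the absolute off-diagonal entries
    return (corr.abs().sum() - m) / (m * (m - 1))
```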

CorrReg is a neuron-wise scheme of CCA regularization. One might be interested in the performance when directly using CCA as the regularizer. To this end, we apply the CCA objective (2) as a regularizer to the FC5 layer of the LeNet variant, where the regularization parameter is optimally tuned on the validation set. The computational complexity of CCA regularization is significantly worse than that of CorrReg, and Table IV shows that it practically consumes more time per iteration of SGD training (measured on an M40 GPU and an Intel Xeon CPU running at 2.2 GHz). Although CCA regularization improves performance over that of the plain LeNet variant, its results with different sizes of mini-batches are worse than those of CorrReg; we hypothesize that this is because optimization of the CCA objective (particularly the constraints in (2)) is incompatible with SGD based network training.

Methods: Plain LeNet variant | CorrReg | CCA regularization
Rows: computational complexity; wall-clock time per iter. (sec.); error rates (%).

TABLE IV: Computation and recognition performance on CIFAR-10 [26] when applying CorrReg or CCA regularization, with different sizes of mini-batches, to an upper FC layer of a variant of LeNet [27]. Wall-clock time is measured on an M40 GPU and an Intel Xeon CPU running at 2.2 GHz. Experiments are run multiple times, and accuracies are in the format of mean (standard deviation). $d$ and $m$ denote the numbers of input and output neurons of the layer respectively.

The penalty parameter $\lambda$ in (4) controls the amount of regularization that CorrReg imposes on network training. To investigate how the performance of CorrReg depends on $\lambda$, we hold out a portion of training samples as validation, and apply CorrReg to different layers of the LeNet variant (i.e., the settings of Table I with nReg = 1) using a range of $\lambda$ values. Results are plotted in Figure 4. Figure 4 shows that smaller values of $\lambda$ are usually better when applying CorrReg to (lower) conv layers, and larger values of $\lambda$ are usually better for (upper) FC layers. This is reasonable due to two compound reasons: (1) when applying CorrReg to an upper network layer, larger values of $\lambda$ are needed in order to back-propagate the regularization for better learning of features/parameters of all the layers below; (2) conv layers already have intrinsic regularization via weight sharing. This inconsistency of optimal $\lambda$ values across different network layers makes the use of CorrReg less convenient. Fortunately, results in this section suggest that to get the most effective regularization, one may simply apply CorrReg to an upper (FC) network layer, and set the optimal $\lambda$ value accordingly, which typically gives good results. Experiments in the subsequent sections follow this empirical rule.


Fig. 4: Investigation of the penalty parameter $\lambda$ on CorrReg’s performance. Each line represents the validation errors on CIFAR-10 [26] when applying CorrReg to a layer of a variant of LeNet [27] using a range of $\lambda$ values. Each error rate is a mean over multiple runs. These results are obtained by holding out a portion of CIFAR-10 training samples as validation.

V-A1 Results on Modern Deep Architectures

Architecture | CIFAR-10 | CIFAR-100
(for each dataset, columns: W/O Regu., Dropout-S1, Dropout-S1S2, CorrReg, CorrReg-DropoutS2)
ResNet [68]
Wide ResNet [29]
DenseNet [30]

TABLE V: Results (error rates, %) on CIFAR-10 and CIFAR-100 [26] of several deep architectures with or without regularization. Standard data augmentation is used as in [67]. Refer to the main text for how Dropout-S1, Dropout-S1S2, CorrReg, and CorrReg-DropoutS2 are applied to these architectures.

We further investigate the regularization effects of CorrReg on modern deep architectures. For the datasets of CIFAR-10 and CIFAR-100, we use the representative architectures of ResNet [68], Wide ResNet [29], and DenseNet [30]. The CIFAR-100 dataset is an adaptation of CIFAR-10, consisting of 100 object categories of 32×32 color images. We use simple data augmentation following [67]: during training, we zero-pad 4 pixels along each image side, and sample a 32×32 region crop from the padded image or its horizontal flip; during testing, we simply use the original non-padded image. Our use of ResNet, Wide ResNet, and DenseNet for the CIFAR datasets is as follows: we use a pre-activation ResNet [68] whose layer specifics are given in Appendix B; we use the exact top-performing architecture of “WRN-28-10” as in [29]; we also use the exact top-performing architecture of “DenseNet-BC” as in [30]. These architectures commonly aggregate features of lower layers via a top global average pooling layer, followed by a final FC layer of classification. For each network, we follow the empirical rule established in Section V-A and apply CorrReg to the final FC layer. We fix the $\lambda$ of CorrReg to the same value for all the three networks. We train ResNet and Wide ResNet for a fixed budget of epochs, with learning rates decayed twice during training. For DenseNet, we follow [30] and train with a longer schedule and a different mini-batch size. All other training hyperparameters are the same as described in the beginning of Section V (not necessarily the same as those used in [68, 29, 30]). To compare with Dropout regularization, we use two schemes (denoted as Dropout-S1 and Dropout-S1S2 respectively): scheme 1 applies Dropout to (inputs of) the final FC layer of each network, which is the same as our use of CorrReg; scheme 2 follows [29] and additionally applies Dropout to (inputs of) the second of the two conv layers in each residual block of these networks. Dropping rates are tuned on the validation set. We also try CorrReg together with the above scheme 2 (denoted as CorrReg-DropoutS2), to compare fairly with Dropout regularization. Results in Table V show that Dropout-S1S2 improves over Dropout-S1 by providing additional regularization, especially for the CIFAR-100 dataset, which contains much fewer training samples per category than CIFAR-10 does. When applied to the top FC layer alone, our proposed CorrReg consistently outperforms Dropout-S1. Moreover, the best results are obtained by CorrReg-DropoutS2, which has the combined benefit of CorrReg and Dropout regularization.

For the ImageNet dataset, we use the representative architectures of ResNet [28], Wide ResNet [29], and ResNeXt [62] (more specifically, the “ResNet-101”, and the top-performing “WRN-50-2-bottleneck” and “ResNeXt-101 (64×4d)” of these architectures). For data augmentation, we adopt the same scheme as in [28]: during training, we randomly sample a 224×224 region crop from an image or its horizontal flip; during testing, we use a single center crop of the same size. Learning rates decay in stages over the training epochs. For each network, we again apply CorrReg to the top FC layer, using a fixed $\lambda$ value. We also apply Dropout to (inputs of) the same FC layers of these networks, with the dropping rate set as before. Table VI shows the comparative results. While Dropout regularization may have no effect on these architectures, our proposed CorrReg steadily achieves performance improvement.

Architecture | W/O Regu. | With Dropout | With CorrReg
ResNet [28]
Wide ResNet [29]
ResNeXt [62]

TABLE VI: Results of single-crop testing on the ImageNet validation set [63] of several deep architectures with or without regularization. Results are in the format of top-1/top-5 error rates (%).

Remarks We note that experiments in this section are not intended to compete with the best results on the benchmark image classification datasets. They are to show the efficacy of our proposed CorrReg for regularization of network training: even though the input data are from the same source, intermediate features may be learned to overfit to view-specific patterns. CorrReg effectively regularizes network training so that the final task of classification collaboratively benefits from all feature views.

V-B RGB-D Object and Scene Recognition

In this section, we use the RGB-D object dataset [31] and the SUN RGB-D scene dataset [32] to investigate the efficacy of CorrReg for practical problems of multi-view learning. The RGB-D object dataset contains RGB-D video frames of 300 object instances in 51 classes, captured from different views. We subsample these videos by taking every few frames of each video. We use the random dataset splits provided by [31], with each split containing different object instances of all the 51 classes. This dataset is intensively used for comparative studies of alternative baselines, investigation of our proposed method under different settings, and also for a robustness test against contamination of input data. The SUN RGB-D scene dataset is a benchmark suite for indoor scene understanding, including 10,335 RGB-D images. For scene recognition, we follow [32] and select the 19 most common categories, each of which has a sufficient number of RGB-D images in the dataset; we then divide the data into training and test sets following [32]. For both the RGB-D object and SUN RGB-D datasets, we compute surface normals (SN) [69] from each depth image as input depth features.
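As a hedged illustration of the SN input encoding, the sketch below computes per-pixel surface normals from a depth map via local depth gradients; [69] uses a more elaborate scheme, so treat this purely as an illustration of the idea:

```python
# Approximate per-pixel surface normals from a depth map (illustrative only).
import numpy as np

def depth_to_normals(depth):
    """depth: (h, w) float array; returns (h, w, 3) unit normals."""
    dzdx = np.gradient(depth, axis=1)           # tangent along x
    dzdy = np.gradient(depth, axis=0)           # tangent along y
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / np.maximum(norm, 1e-8)     # normalize to unit length
```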

V-B1 Comparative Studies of Alternative Baselines

Fig. 5: Network architectures used for experiments of RGB-D object recognition. The corresponding layer parameters (numbers of feature maps, filter sizes, etc) are given in Appendix B. (a) The plain (RGB or depth) ConvNets. (b)-(d) The RGB-Depth ConvNets concatenating features of the two lower streams respectively at lower, middle, or upper “heights” of network layers.

Our comparative studies of alternative baselines for RGB-D object recognition are based on adaptations of an 8-layer ConvNet. The adaptations consist of two lower, parallel streams followed by one upper, single stream, whose model architectures are shown in Figure 5 and whose layer parameters (numbers of feature maps, filter sizes, etc) are given in Appendix B. RGB and depth/SN images are respectively taken as inputs of the two lower streams, whose outputs are concatenated as inputs of the upper stream. We term such adapted networks as RGB-Depth ConvNets. In this work, we investigate the effects of concatenating features of the two lower streams at different “heights” (lower, middle, or upper layers) of RGB-Depth ConvNets, as illustrated in Figure 5. Alternative to RGB-Depth ConvNets are plain ConvNets that consist of one of the two lower streams of RGB-Depth ConvNets and the upper stream, as shown in Figure 5. Such plain ConvNets can be used for both RGB and depth images, and we term them as RGB ConvNet and Depth ConvNet respectively.

The above networks suggest several baseline methods for RGB-D object recognition. In particular, given the existence of both RGB and depth images during training and test phases, one may separately train the RGB ConvNet or Depth ConvNet using single-modal images, and use the trained models for single-modal inference. Alternatively, one may use the above RGB-Depth ConvNets that concatenate features of individual modalities at different heights for multi-modal training and inference. To train these baseline networks, we use common data augmentation practices on the RGB-D object dataset: we first re-scale each training image to a fixed size, and then randomly crop a slightly smaller region from it or its horizontal flip. The learning rate decays by a fixed factor when learning curves plateau. Table VII shows that while concatenating features at lower or middle layers of RGB-Depth ConvNets is not effective, feature concatenation at an upper layer of the RGB-Depth ConvNet achieves improved performance over single-modal networks.

Methods | Accuracy
RGB ConvNet
Depth ConvNet
RGB-Depth ConvNet (concat. at a lower height)
RGB-Depth ConvNet (concat. at a middle height)
RGB-Depth ConvNet (concat. at an upper height)
RGB-Depth ConvNet with Dropout
RGB-Depth ConvNet with L2Regu
RGB-Depth ConvNet with CorrReg
RGB-Depth ConvNet with L2Regu & Dropout
RGB-Depth ConvNet with CorrReg & Dropout

TABLE VII: Recognition accuracies (%) on the RGB-D object dataset [31] using architectures in Figure 5 and various regularization methods. Results are in the format of mean (standard deviation).
TABLE VIII: Accuracies (%) for different $\lambda$ values when applying CorrReg to a RGB-Depth ConvNet (Figure 5-(d)) for RGB-D object recognition [31]; the last column gives the mean.
Occlusion size
RGB ConvNet
Depth ConvNet
RGB-Depth ConvNet
RGB-Depth ConvNet with Dropout
RGB-Depth ConvNet with CorrReg
RGB-Depth ConvNet with CorrReg & Dropout

TABLE IX: Robustness test by adding random occlusion blocks of varying sizes to test RGB and depth (SN) images of the RGB-D object dataset [31]. The trained networks of Section V-B1 are used for these experiments. Results are in terms of recognition accuracy (%).

The above baselines fuse multi-modal features via direct concatenation. Discussions in Section I suggest that one may apply network regularization at fusion to help collaboratively learn from multi-view features, which includes existing methods such as Dropout [20], and also our proposed CorrReg that explicitly leverages multi-view learning criteria. More specifically, we apply CorrReg to the first layer of the upper stream of the RGB-Depth ConvNet that concatenates multi-view features at the upper “height”, making it a CorrReg fusion layer; alternatively, we apply Dropout to inputs of the first layer of the upper stream. As noted in Section IV, an approximate measure of correlation is used in [51] that simply computes the (squared) Euclidean distance between features of individual views. We also consider this simple correlation measure as a baseline regularizer of (1), and term such a method L2Regu. We compare with L2Regu by applying it to the same first layer of the upper stream of the RGB-Depth ConvNet. We set the $\lambda$ value of CorrReg, the penalty parameter of L2Regu, and the dropping rate of Dropout by tuning on the validation set. Results in Table VII show that each of Dropout, L2Regu, and CorrReg improves performance over that of direct concatenation, and CorrReg outperforms Dropout and L2Regu, showing the advantage of CorrReg in practical multi-view learning problems. Table VIII also gives results of CorrReg when using different $\lambda$ values. To investigate whether the effect of CorrReg (or L2Regu) is complementary to that of Dropout, we also use both CorrReg (or L2Regu) and Dropout in the RGB-Depth ConvNet. Using L2Regu together with Dropout does not necessarily improve over L2Regu itself; in contrast, using CorrReg together with Dropout further improves the performance, showing the advantage of CorrReg for complementary regularization.

V-B2 Robustness against Contamination of Input Data

An important property of multi-view learning is that inference should be less influenced when input data are contaminated [70]. In this section, we simulate such testing scenarios by adding random occlusion blocks to input RGB and depth images. Occlusion blocks are obtained by setting pixel values of the occluded regions to a fixed value. We use the trained networks of Section V-B1 for these investigations. Table IX reports comparative results under different sizes of random occlusion. Compared with the RGB ConvNet and Depth ConvNet, direct feature concatenation using the RGB-Depth ConvNet may not provide better robustness against contamination of input data. The RGB-Depth ConvNet with our proposed CorrReg improves the robustness, and performs consistently better than the plain RGB-Depth ConvNet and the one with Dropout.

V-B3 Comparisons with the State of the Art

State-of-the-art results on RGB-D object recognition are obtained by using either advanced base models (e.g., ResNets [28]) with parameters pre-trained on ImageNet [51], or advanced feature encoding schemes [52]. We follow [51] and use two ResNet-50 networks [28] (after removing their last FC classifier layers) as the lower, parallel streams, whose 2048-dimensional output feature vectors are concatenated as the input of an FC based CorrReg fusion layer, followed by the last layer of a 51-way classifier. The two lower streams respectively take RGB and SN images as inputs. We term the constructed network RGB-Depth ResNet. For data augmentation, we follow [49] by first re-scaling training images and then randomly cropping regions from them or their horizontal flips. RGB-Depth ResNet is fine-tuned with separate initial learning rates for the lower streams and for the upper stream. We use mini-batches of size 64, with the penalty parameter of CorrReg and the dropping rate of Dropout tuned on the validation set. Other training hyperparameters are the same as described at the beginning of Section V.
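The two-stream construction can be sketched as follows with torchvision's ResNet-50. The class name and fusion width are ours for illustration; the CorrReg penalty itself (sketched earlier) would be added to the training loss on the fusion layer's per-view partial responses.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class RGBDepthResNet(nn.Module):
    def __init__(self, num_classes=51, fusion_dim=1024):
        super().__init__()
        # Two ImageNet pre-trained streams with their FC classifiers removed;
        # each ends in a 2048-dimensional globally pooled feature.
        self.rgb_stream = nn.Sequential(*list(resnet50(pretrained=True).children())[:-1])
        self.depth_stream = nn.Sequential(*list(resnet50(pretrained=True).children())[:-1])
        # FC fusion layer on the concatenated features (fusion_dim is an
        # illustrative choice), followed by the classifier.
        self.fusion = nn.Linear(2 * 2048, fusion_dim)
        self.classifier = nn.Linear(fusion_dim, num_classes)

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb).flatten(1)        # (B, 2048)
        f_depth = self.depth_stream(depth).flatten(1)  # (B, 2048)
        fused = torch.relu(self.fusion(torch.cat([f_rgb, f_depth], dim=1)))
        return self.classifier(fused)
```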

Methods Accuracy

Nonlinear SVM [31]
CKM [71]
CNN-RNN [16]
upgraded HMP [69]
MMSS [49]
Fus-CNN [48]
CIMDL-ResNet [51]
MDSI-CNN [52]

RGB ResNet
Depth ResNet
RGB-Depth ResNet
RGB-Depth ResNet with Dropout
RGB-Depth ResNet with CorrReg
RGB-Depth ResNet with CorrReg & Dropout

TABLE X: Recognition accuracies (%) of different methods on the RGB-D object dataset [31]. Results are in the format of mean±std.

Note that the method of [51] uses the same ImageNet pre-trained base models (i.e., ResNet-50) as we do. It is interesting to observe in Table X that our result of RGB-Depth ResNet, which concatenates RGB and depth features directly, is better than those of most existing methods. Regularizing RGB-Depth ResNet with either Dropout or CorrReg further improves the result, with CorrReg achieving the larger improvement. Using CorrReg together with Dropout achieves the new state of the art.

V-B4 Results on RGB-D Scene Recognition

In this section, we report experiments of RGB-D scene recognition on the SUN RGB-D dataset [32]. We use the same network architectures and training manners as in Section V-B3, with the only difference being that the 51-way softmax classifiers are replaced with 19-way ones. Results in Table XI show that RGB-Depth ResNet using simple feature concatenation outperforms existing methods that rely on more complicated feature fusion schemes and/or training criteria. Regularizing RGB-Depth ResNet with CorrReg and/or Dropout further improves the result to the new state of the art.

Methods Accuracy

GIST RBF Kernel SVM [32]
Place-CNN Linear SVM [32]
Place-CNN RBF Kernel SVM [32]
SSCNN [55]
DMFF [72]
MDSI-CNN [52]
FV-CNN [56]
RGB-D-CNN (wSVM) [57]
BilinearCNN [58]

RGB ResNet
Depth ResNet
RGB-Depth ResNet
RGB-Depth ResNet with Dropout
RGB-Depth ResNet with CorrReg
RGB-Depth ResNet with CorrReg & Dropout

TABLE XI: Recognition accuracies (%) of different methods on the SUN RGB-D scene dataset [32].

V-C Multi-view Recognition of 3D Object Shapes

We conduct experiments of multi-view 3D object recognition on the ModelNet40 dataset [64] to investigate the efficacy of CorrReg for practical problems with data of more than two views. The ModelNet40 dataset contains 12,311 CAD models (meshes) of 40 object categories, with 9,843 models for training and 2,468 for testing. To prepare images of multiple views from each object model, we follow the camera set-up in [59] and assume that each model is upright oriented; 12 virtual cameras, pointing towards the model centroid, are evenly distributed (with intervals of 30 degrees) around a horizontal circle that is elevated 30 degrees from the ground plane; 2D images are rendered from these camera views.
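For concreteness, the camera placement can be computed as below; the unit circle radius and the centroid at the origin are illustrative assumptions.

```python
import math

def camera_positions(num_views=12, elevation_deg=30.0, radius=1.0):
    # Cameras are spaced evenly around a horizontal circle elevated above
    # the ground plane, all pointing at the model centroid (the origin).
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azim = 2.0 * math.pi * i / num_views  # 30-degree intervals for 12 views
        positions.append((radius * math.cos(elev) * math.cos(azim),
                          radius * math.cos(elev) * math.sin(azim),
                          radius * math.sin(elev)))
    return positions
```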

Based on the very simple architecture of MVCNN [59] for multi-view based 3D object recognition, where features of individual views extracted from lower, parallel layer streams are aggregated via feature-wise max pooling, we design a Multi-view Fusion Network (MvFusionNet) as shown in Figure 6. By pairing neighboring views, MvFusionNet re-organizes feature vectors of individual views from the lower streams into an equal number of pairs of feature vectors. Each such pair is then fed into a fusion layer, to which CorrReg can be applied to form a CorrReg fusion layer. Feature-wise max pooling is subsequently applied to the outputs of these fusion layers, and MvFusionNet ends with an FC classifier layer. We investigate here whether CorrReg helps feature aggregation across views by regularizing MvFusionNet constructed in this way.


Fig. 6: Illustration of the multi-view fusion network used for experiments of multi-view 3D object recognition. Thick right arrows represent data re-organization by pairing feature vectors of individual views.
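The pairing and aggregation steps can be sketched as follows; the pairing of each view i with its neighbor view i+1 (wrapping around) and the fusion width are our illustrative reading of Figure 6, not a verbatim copy of our implementation.

```python
import torch
import torch.nn as nn

class PairwiseFusion(nn.Module):
    # Re-organizes V per-view feature vectors into V pairs of neighboring
    # views, fuses each pair with a shared FC layer (where CorrReg would be
    # applied), then aggregates by feature-wise max pooling.
    def __init__(self, feat_dim, fusion_dim):
        super().__init__()
        self.fusion = nn.Linear(2 * feat_dim, fusion_dim)

    def forward(self, view_feats):  # view_feats: (B, V, D)
        neighbors = torch.roll(view_feats, shifts=-1, dims=1)  # pair view i with view i+1
        pairs = torch.cat([view_feats, neighbors], dim=2)      # (B, V, 2D)
        fused = torch.relu(self.fusion(pairs))                 # (B, V, fusion_dim)
        return fused.max(dim=1).values                         # feature-wise max pooling
```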

Lower streams of MvFusionNet are adapted from ResNet-101 [28] that is pre-trained on ImageNet [51]. To train MvFusionNet, we use mini-batches of multi-view image sets of 12 views each; the learning rates decay when learning curves plateau, and the penalty parameter of CorrReg is tuned on the validation set. We report in Table XII results of MvFusionNet without or with CorrReg regularization, where we also compare with recent state-of-the-art results [73, 60, 61] on ModelNet40 whose multi-view images are prepared following the same style of camera set-up in [59] (i.e., camera views pointing towards the upright orientation of object models). Due to varying architectural designs, network optimizers, and feature aggregation schemes, results of different methods in Table XII may not be directly comparable; nevertheless, the comparison confirms the efficacy of CorrReg for better feature learning and aggregation from multiple views of 3D object shapes. We note that results of multi-view based methods on ModelNet40 depend heavily on how the multi-view images are prepared, i.e., on how virtual cameras are positioned on a sphere enclosing the object model. For example, the current best result on ModelNet40 is obtained in [74] by selecting camera set-ups from a much richer set of camera positions and viewpoints. We expect that our results can also be boosted by using multi-view images rendered from such optimized camera set-ups.

Methods Accuracy

MVCNN [59] (80 views)
Pairwise [75]
Dominant Set Clustering [73] (24 views)
MHBN [60] (6 views)
GVCNN [61] (8 views)

MvFusionNet
MvFusionNet with CorrReg

TABLE XII: Accuracies (%) of multi-view 3D object recognition on the ModelNet40 dataset [64]. All methods use the camera set-up in [59] (12 camera views pointing towards the upright oriented model). Best results of some methods and the corresponding view numbers are also quoted in parentheses.

VI Conclusion

We study in this paper deep multi-view learning in the context of regularized network training. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We also implement such regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either FC or conv based fusion layers that concatenate intermediate features of individual views. Controlled experiments of benchmark image classification show that CorrReg consistently improves performance of various modern deep architectures. Applying CorrReg to multi-modal deep networks achieves the new state of the art on the benchmark RGB-D object and scene recognition datasets. In future research, we are interested in applying CorrReg to other multi-view learning problems of practical interest.

References

  • [1] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
  • [2] G. Andrew, R. Arora, K. Livescu, and J. Bilmes, “Deep canonical correlation analysis,” in Proceedings of the International Conference on Machine Learning, 2013.
  • [3] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the International Conference on Machine Learning, June 2011.
  • [4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2121–2129.
  • [5] F. R. Bach and M. I. Jordan, “A probabilistic interpretation of canonical correlation analysis,” Department of Statistics, University of California, Berkeley, Tech. Rep, Tech. Rep., 2005.
  • [6] D. A. Cohn and T. Hofmann, “The missing link - a probabilistic model of document content and hypertext connectivity,” in Advances in Neural Information Processing Systems 13, 2001, pp. 430–436.
  • [7] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” J. Mach. Learn. Res., vol. 3, pp. 1107–1135, 2003.
  • [8] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in Neural Information Processing Systems 25, 2012, pp. 2222–2230.
  • [9] R. Arora, A. Cotter, K. Livescu, and N. Srebro, “Stochastic optimization for PCA and PLS,” in 50th Annual Allerton Conference on Communication, Control, and Computing, Allerton, Allerton Park & Retreat Center, Monticello, IL, USA, October 1-5, 2012, 2012, pp. 861–868.
  • [10] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, Dec. 2004.
  • [11] P. L. Lai and C. Fyfe, “Kernel and nonlinear canonical correlation analysis,” Int. J. Neural Syst., vol. 10, pp. 365–377, 2000.
  • [12] D. R. Hardoon and J. Shawe-Taylor, “Sparse canonical correlation analysis,” Machine Learning, vol. 83, no. 3, pp. 331–353, 2011.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
  • [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, Jan. 2016.
  • [15] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in Proceedings of the International Conference on Machine Learning, 2015.
  • [16] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, “Convolutional-recursive deep learning for 3d object classification,” in Advances in Neural Information Processing Systems 25, 2012, pp. 656–664.
  • [17] S. Akaho, “A kernel method for canonical correlation analysis,” in Proceedings of the International Meeting of the Psychometric Society, 2001.
  • [18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
  • [19] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the 13th European Conference on Computer Vision, 2014, pp. 818–833.
  • [20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
  • [21] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning.    The MIT Press, 2012.
  • [22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations (ICLR), 2017.
  • [23] M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: Stability of stochastic gradient descent,” in Proceedings of The 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 1225–1234.
  • [24] I. Kuzborskij and C. H. Lampert, “Data-dependent stability of stochastic gradient descent,” CoRR, vol. abs/1703.01678, 2017.
  • [25] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. A. Poggio, “Theory of deep learning iib: Optimization properties of SGD,” CoRR, vol. abs/1801.02254, 2018.
  • [26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Report, 2009.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [29] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference, 2016.
  • [30] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016.
  • [31] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2011.
  • [32] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.
  • [33] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3, May 2013, pp. 1139–1147.
  • [34] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 807–814.
  • [35] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016.
  • [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [37] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’12, 2012, pp. 3642–3649.
  • [38] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1058–1066.
  • [39] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of the International Conference on Learning Representations, 2013.
  • [40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.
  • [41] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2814–2822.
  • [42] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems 26, 2013, pp. 351–359.
  • [43] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra, “Reducing overfitting in deep networks by decorrelating representations,” in International Conference on Learning Representations, 2016.
  • [44] P. Rodríguez, J. Gonzàlez, G. Cucurull, J. M. Gonfaus, and X. Roca, “Regularizing cnns with locally constrained decorrelations,” in International Conference on Learning Representations, 2017.
  • [45] K. Jia, D. Tao, S. Gao, and X. Xu, “Improving training of deep neural networks via singular value bounding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [46] S. Chandar, M. M. Khapra, H. Larochelle, and B. Ravindran, “Correlational neural networks,” CoRR, vol. abs/1504.07225, 2015.
  • [47] L. Bo, X. Ren, and D. Fox, “Hierarchical matching pursuit for image classification: Architecture and fast algorithms,” in Advances in Neural Information Processing Systems 24, 2011, pp. 2115–2123.
  • [48] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, pp. 681–687.
  • [49] A. Wang, J. Cai, J. Lu, and T.-J. Cham, “Mmss: Multi-modal sharable and specific feature learning for rgb-d object recognition,” in Proceedings of the IEEE International Conference on Computer Vision, December 2015.
  • [50] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3d scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, May 1999.
  • [51] Z. Wang, J. Lu, R. Lin, J. Feng, and J. Zhou, “Correlated and individual multi-modal deep learning for rgb-d object recognition,” CoRR, vol. abs/1604.01655, 2016.
  • [52] U. Asif, M. Bennamoun, and F. A. Sohel, “A multi-modal, discriminative and spatially invariant cnn for rgb-d object labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2051–2065, Sept. 2018.
  • [53] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 2017–2025.
  • [54] X. Li, M. Fang, J.-J. Zhang, and J. Wu, “Learning coupled classifiers with rgb images for rgb-d object recognition,” Pattern Recognition, vol. 61, no. C, pp. 433–446, Jan. 2017.
  • [55] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, “Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2016, pp. 2318–2325.
  • [56] A. Wang, J. Cai, J. Lu, and T.-J. Cham, “Modality and component aware feature fusion for rgb-d scene classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5995–6004.
  • [57] X. Song, L. Herranz, and S. Jiang, “Depth cnns for rgb-d scene recognition: Learning from scratch better than transferring from rgb-cnns.” in AAAI, 2017, pp. 4271–4277.
  • [58] H. F. Zaki, F. Shafait, and A. Mian, “Learning a deeply supervised multi-modal rgb-d embedding for semantic scene and object category recognition,” Robotics and Autonomous Systems, vol. 92, pp. 41–52, 2017.
  • [59] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 945–953.
  • [60] T. Yu, J. Meng, and J. Yuan, “Multi-view harmonized bilinear network for 3d object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 186–194.
  • [61] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 264–272.
  • [62] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 5987–5995.
  • [63] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [64] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in CVPR, 2015, pp. 1912–1920.
  • [65] R. Collobert, S. Bengio, and J. Mariethoz, “Torch: a modular machine learning software library,” Tech. Rep., 2002.
  • [66] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio, “Maxout networks,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1319–1327.
  • [67] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015.
  • [68] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, 2016.
  • [69] L. Bo, X. Ren, and D. Fox, “Unsupervised feature learning for rgb-d based object recognition,” in Experimental Robotics.    Springer, 2013, pp. 387–402.
  • [70] C. Xu, D. Tao, and C. Xu, “Multi-view intact space learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2531–2544, 2015.
  • [71] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller, “A learned feature descriptor for object recognition in rgb-d data,” in Proceedings of the IEEE International Conference on Robotics and Automation.    IEEE, 2012, pp. 1298–1303.
  • [72] H. Zhu, J.-B. Weibel, and S. Lu, “Discriminative multi-modal feature fusion for rgbd indoor scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2969–2976.
  • [73] C. Wang, M. Pelillo, and K. Siddiqi, “Dominant set clustering and pooling for multi-view 3d object recognition,” in British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017, 2017.
  • [74] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 5010–5019.
  • [75] E. Johns, S. Leutenegger, and A. J. Davison, “Pairwise decomposition of image sequences for active multi-view recognition,” in CVPR.    IEEE Computer Society, 2016, pp. 3813–3822.

Appendix A Gradients of the Proposed Neuron-Wise Correlation-Maximizing Regularizer

We use the multi-variable chain rule to derive the gradients of the neuron-wise regularizer w.r.t. the weight vectors of the individual views, and also w.r.t. their input features.
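As a hedged illustration of the central derivative, suppose (as in our illustrative reading in Section V) that one fusion neuron's regularizer is the batch Pearson correlation between the partial responses $u = Xw_x$ and $v = Yw_y$ of the two views, with centered vectors $\tilde{u}$ and $\tilde{v}$; then its gradient takes the form sketched below. The exact expressions of the paper may differ, e.g., in sign convention or in added numerical-stability terms.

```latex
% Sketch: gradient of the batch correlation for one fusion neuron, assuming
% u = X w_x, v = Y w_y, \tilde{u} = u - \bar{u}\mathbf{1}, \tilde{v} = v - \bar{v}\mathbf{1}.
\begin{align}
\rho(u, v) &= \frac{\tilde{u}^\top \tilde{v}}{\|\tilde{u}\|_2 \, \|\tilde{v}\|_2}, \\
\frac{\partial \rho}{\partial u}
  &= \frac{1}{\|\tilde{u}\|_2 \, \|\tilde{v}\|_2}
     \left( \tilde{v} - \frac{\tilde{u}^\top \tilde{v}}{\|\tilde{u}\|_2^2}\, \tilde{u} \right),
\qquad
\frac{\partial \rho}{\partial w_x} = X^\top \frac{\partial \rho}{\partial u},
\end{align}
```

with the symmetric expressions holding for $v$ and $w_y$.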

Appendix B Network Architectures

Our LeNet variant in Section V-A consists of three conv layers (the first one serving as the input layer), followed by three FC layers (the last one serving as the output layer). The first two conv layers have the same number of filters, while the third has a different number; all three conv layers use the same filter size and stride. Max pooling is applied after the first conv layer, and average pooling is applied after both the second and third conv layers.

Our ResNet used in Section V-A1 follows [28, 68]. In particular, we first build a ConvNet that starts with a 3×3 conv layer of 16 filters, and then sequentially stacks three types of 3×3 conv layers, which respectively have feature map sizes of 32×32, 16×16, and 8×8, and filter numbers of 16, 32, and 64. Spatial sub-sampling of feature maps is achieved by conv layers of stride 2. The ConvNet ends with a global average pooling layer and an FC classifier layer, giving 6n+2 weight layers in total, where n is the number of stacked units per stage. Based on this ConvNet, we (1) use an “identity shortcut” to connect every two conv layers of 3×3 filters, and a “projection shortcut” when sub-sampling of feature maps is needed; and (2) change it to the pre-activation version according to [68]. We set the value of n to obtain the reported depth.
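For reference, a minimal PyTorch sketch of the pre-activation basic block [68] used in such a ConvNet follows; the channel counts and the projection-shortcut trigger are the standard choices for this design rather than copied from our code.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    # Pre-activation basic block [68]: BN -> ReLU -> conv, twice, with an
    # identity shortcut, or a 1x1 projection shortcut when sub-sampling.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = None
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        out = self.relu(self.bn1(x))
        residual = x if self.shortcut is None else self.shortcut(out)
        out = self.conv2(self.relu(self.bn2(self.conv1(out))))
        return out + residual
```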

Layer specifics of the RGB ConvNet, Depth ConvNet, and RGB-Depth ConvNets used in Section V-B1 (Figure 5) are presented in Table XIII.

Layer Filter size/Filter no./Stride/Padding
conv1 7×7/96/2/3
conv2 5×5/96/1/2
conv3 3×3/112/1/1
conv4 3×3/128/1/1
conv5 3×3/128/1/1
fc6 -/1024/-/-
fc7 -/512/-/-
fc8 -/51/-/-
(max) pool1, pool2, pool3 2×2/-/2/1
lower-conv 5×5/192/1/2
middle-conv 3×3/256/1/1
upper-fc -/1024/-/-
TABLE XIII: Layer specifics of ConvNets in Figure 5.