A Deep Learning Approach to Multiple Kernel Fusion
Abstract
Kernel fusion is a popular and effective approach for combining multiple features that characterize different aspects of data. Traditional approaches for Multiple Kernel Learning (MKL) attempt to learn the parameters for combining the kernels through sophisticated optimization procedures. In this paper, we propose an alternative approach that creates dense embeddings for data using the kernel similarities and adopts a deep neural network architecture for fusing the embeddings. In order to improve the effectiveness of this network, we introduce the kernel dropout regularization strategy coupled with the use of an expanded set of composition kernels. Experiment results on a real-world activity recognition dataset show that the proposed architecture is effective in fusing kernels and achieves state-of-the-art performance.
Huan Song, Jayaraman J. Thiagarajan, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy and Andreas Spanias
SenSIP Center, ECEE, Arizona State University, Tempe, AZ
Lawrence Livermore National Labs, 7000 East Avenue, Livermore, CA
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY
This research was supported in part by the SenSIP center.
Index Terms— Kernel fusion, Deep learning, Dropout regularization, Activity recognition
1 Introduction
Kernel methods provide a powerful framework to extend several machine learning formulations since they enable the design of effective non-linear models. For example, in Support Vector Machines (SVM), the problem of building binary classifiers with non-linear decision boundaries can be recast as a dual problem in terms of the kernel similarity matrix. Referred to as the kernel trick, this approach has been successfully applied to a wide range of supervised and unsupervised learning problems [1, 2]. A valid positive semidefinite kernel inherently defines a lifting (transformation) to a Reproducing Kernel Hilbert Space (RKHS), thereby enabling efficient approximation of any function of interest in the transformed space. Another important property of kernel methods is that fusing kernels from multiple sources (e.g. different feature descriptors or sensing modalities) is straightforward. A commonly adopted strategy is to consider a convex combination of the kernels. The process of simultaneously inferring the weights for the convex combination and minimizing the structural risk (SVM objective) is referred to as Multiple Kernel Learning (MKL) [3, 4]. The idea is to effectively exploit the complementary nature of the different features and the representation power of different kernel functions. Despite their widespread use, as pointed out by [5], MKL algorithms may suffer when solving for global weights and the most critical support vectors, since the weight for a kernel is restricted to be the same over the whole input space. This challenge is alleviated by Localized MKL (LMKL) [5], which introduces a gating function for each kernel. By treating the input data sample as a variable, the gating function is able to characterize the underlying localities in the data and promotes a reduced number of support vectors.
Several existing approaches for feature fusion begin by building compact and effective representations from raw features, since fusing such compact representations can be robust to noise and outliers. In particular, sophisticated representation learning paradigms such as deep learning have shown exceptional power when dealing with complex, high-dimensional data. In [6], the authors focused on feature learning for different multimodal settings and showed that in the multimodal fusion case, the fused feature exploits complementary information from each modality. In [7], Zhao et al. built sub-networks for each heterogeneous feature and relied on Stacked Denoising Autoencoders to learn high-level homogeneous representations for feature integration. Note that both methods start with the raw features directly and do not exploit the expressive power of similarity kernels. Existing works on incorporating the advantages of deep learning into kernel methods either develop novel kernel constructions to mimic the large neural computation [8] or apply a similar neural network structure to combine kernels and optimize at each layer [9].
In this paper, we propose to exploit the advantages of deep architectures in feature learning to build a new approach to multiple kernel fusion. First, we adopt a novel viewpoint on kernels by treating the similarities encoded in the kernel matrix as a valid embedding of the data. This is similar to approaches in the natural language processing literature, wherein relevance measures such as the Pointwise Mutual Information (PMI) of a word with respect to other words in the vocabulary are treated as a word embedding [10]. Since the kernel matrix can be inherently sparse and low-rank, we propose to apply an additional dense embedding layer (e.g. Singular Value Decomposition) to the columns of the kernel matrix. Consequently, the problem of kernel fusion is transformed into that of fusing their dense embeddings. To this end, we build a deep architecture for kernel fusion, coupled with novel training strategies: (a) to emulate the convex combination approach in MKL, we expand the set of input kernels by considering combinations of different subsets of the base kernels, and (b) we perform kernel dropout in the fusion layer for improved regularization. The proposed architecture replaces the complex optimization procedure in MKL with efficient representation learning and straightforward feature merging. This makes our fusion approach easily scalable to a large number of kernels.
2 Proposed Approach
2.1 Architecture
The proposed approach considers the similarity information encoded in a kernel as an embedding of the data, and poses the problem of MKL as fusing these embeddings in a deep learning architecture. In this section, we start by presenting the general architecture and then describe strategies for improving the performance.
As shown in Figure 1, the proposed architecture consists of three components: (a) dense embedding of the data using kernel similarities, (b) representation learning using a deep architecture, and (c) feature fusion. Let us denote the kernel Gram matrix as , where . Each column of the Gram matrix encodes the relevance between sample and all other samples and it can be treated as an embedding for . This viewpoint is very similar to the construction of dense word embeddings using the PMI in text processing [11]. In the ideal case, has large values for the samples that come from the same class as and zeros for other samples. The sparsity in these embeddings makes them unsuitable for inference tasks [10]. To alleviate this, we obtain a dense embedding of the kernel similarities using Principal Component Analysis (PCA), which projects the original kernel feature to a low-dimensional space. Note that this step can be easily replaced by other dense embedding techniques, including manifold learning [12], Word2Vec [13] or random projection [14]. Besides providing dense embeddings, this step also helps to significantly improve the network training speed.
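The dense embedding step described above can be illustrated with a minimal NumPy sketch. The function name `dense_kernel_embedding`, the toy Gaussian kernel and the target dimension are assumptions made for illustration only; the paper does not specify its PCA implementation.

```python
import numpy as np

def dense_kernel_embedding(K, dim):
    """Project the columns of an n x n kernel matrix to `dim` dimensions via PCA."""
    X = K.T                              # row i holds column i of K, i.e. the embedding of sample i
    Xc = X - X.mean(axis=0)              # center each feature before PCA
    # SVD of the centered data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T               # n x dim dense embedding

# Toy data (hypothetical): 50 samples, Gaussian kernel on random features
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 4))
sq_dist = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dist)

Z = dense_kernel_embedding(K, dim=10)    # one 10-dimensional vector per sample
```

In a train/test setting one would presumably fit the projection on the training kernel only and apply it to the kernel columns of test samples; the sketch fits and transforms the same matrix for brevity.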
On top of each dense feature set obtained by PCA, we build a fully connected neural network. The goal is to use backpropagation in a large network to learn a concise representation that is more effective for inference tasks. To achieve this, the network needs to be adequately large. In our application, we build a layer network separately for each embedding. At each hidden layer, dropout regularization [15] is used to prevent overfitting and batch normalization [16] is used to accelerate training. After the representations are learned, we stack another layer which is responsible for fusing the features and obtaining the classification result with a softmax activation. The most straightforward approach to feature merging is to simply concatenate all the inputs to the layer. However, various other merge modes can also be applied, including summation, averaging and multiplication. The flexibility of the merge layer facilitates a wide range of kernel combination forms.
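The merge modes mentioned above can be sketched in a few lines of NumPy. The function `merge` and its mode names are hypothetical and do not correspond to a specific library API; the element-wise modes assume that all learned representations have the same shape.

```python
import numpy as np

def merge(representations, mode="concat"):
    """Combine per-kernel representations, each of shape (batch, features)."""
    reps = [np.asarray(r, dtype=float) for r in representations]
    if mode == "concat":
        return np.concatenate(reps, axis=-1)   # widths may differ across kernels
    stacked = np.stack(reps, axis=0)           # element-wise modes need equal shapes
    if mode == "sum":
        return stacked.sum(axis=0)
    if mode == "ave":
        return stacked.mean(axis=0)
    if mode == "mul":
        return stacked.prod(axis=0)
    raise ValueError(f"unknown merge mode: {mode}")

# Two toy representations for a batch of one sample
r1 = np.array([[1.0, 2.0]])
r2 = np.array([[3.0, 4.0]])
fused = merge([r1, r2], mode="concat")         # shape (1, 4)
```

Concatenation lets the fusion layer learn a separate weight for every unit of every kernel, whereas the element-wise modes fix the combination rule and keep the fusion layer input small.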
2.2 Using Composition Kernels
An important property of MKL is the variety of parameterization forms for mixing kernels, such as convex combination, Hadamard product or mixtures of polynomials [3]. We emulate this property by including all possible combinations of base kernels (namely the composition kernels in Figure 1) in the architecture input. Given the base kernel set , the whole input kernel set to our architecture will have size :
Simple kernel summation proves to be highly effective in practical recognition problems and the derived representation corresponding to the summed kernel is often very different from either alone. Note that other formulations with base kernels can also be used. Paired with the flexible merge and deep feature representation, our architecture covers a large number of kernel combination scenarios without explicitly formulating them.
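As a rough illustration of the expanded kernel set, the sketch below enumerates every non-empty subset of the base kernels and sums the kernels in each subset. Using plain summation for every subset is an assumption based on the remark above; the helper name `composition_kernels` and the toy matrices are hypothetical.

```python
from itertools import combinations

import numpy as np

def composition_kernels(base_kernels):
    """Return the sum of the kernels in every non-empty subset of `base_kernels`."""
    names = list(base_kernels)
    expanded = {}
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # A sum of positive semidefinite kernels is again positive semidefinite
            expanded["+".join(subset)] = sum(base_kernels[n] for n in subset)
    return expanded

# Toy 2 x 2 kernels standing in for the real Gram matrices
base = {
    "stat": np.eye(2),
    "shape": np.ones((2, 2)),
    "corr": 2.0 * np.eye(2),
}
all_kernels = composition_kernels(base)
print(sorted(all_kernels))
```

Under this enumeration, M base kernels yield 2^M - 1 inputs, so the input set grows quickly with M; the three toy kernels above produce seven entries (three base kernels plus four sums).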
2.3 Kernel Dropout Regularization
In dropout regularization [15] for training large neural networks, neurons are randomly chosen to be removed from the network along with their incoming and outgoing connections. The process can be viewed as sampling a large set of possible network architectures with shared weights. Given the large kernel set , a more effective regularization mechanism is needed to prevent the network training from overfitting certain kernels. More specifically, we propose to regularize the fusion layer by dropping the entire representations learned from some randomly chosen kernels. Denoting the representations learned for all kernels as and a vector associated with independent Bernoulli trials, the representation is dropped from the fusion layer if is 0. The feed-forward operation can be expressed as:
where are the weights for hidden unit , denotes vector concatenation and is the softmax activation function.
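A minimal NumPy sketch of the kernel dropout forward pass is given below. It assumes one Bernoulli draw per kernel shared across the mini-batch and uses inverted-dropout rescaling by the keep probability; both choices, as well as the names `fusion_forward` and `keep_prob`, are assumptions rather than details confirmed by the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fusion_forward(reps, W, b, keep_prob, rng=None, training=True):
    """reps: list of (batch, d_m) arrays, one per kernel; W: (sum of d_m, classes)."""
    if training:
        # One Bernoulli trial per kernel: 0 drops that kernel's whole representation
        mask = rng.binomial(1, keep_prob, size=len(reps))
        # Rescale kept blocks so expected activations match inference (assumption)
        reps = [m * r / keep_prob for m, r in zip(mask, reps)]
    h = np.concatenate(reps, axis=-1)           # fusion by concatenation
    return softmax(h @ W + b)

rng = np.random.default_rng(0)
reps = [rng.normal(size=(4, 3)) for _ in range(5)]   # 5 kernels, 3 units each
W = rng.normal(size=(15, 6))
b = np.zeros(6)
p_train = fusion_forward(reps, W, b, keep_prob=0.5, rng=rng, training=True)
p_test = fusion_forward(reps, W, b, keep_prob=0.5, training=False)
```

At inference time all representations are kept, which mirrors how standard dropout is disabled at test time.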
3 System Setup
In this section, we apply the proposed architecture to the important problem of sensor-based activity recognition. Recent advances in activity recognition have shown promising results in fitness monitoring and assistive living applications [17]. However, effectively dealing with measurement inaccuracy and noise remains an open problem. One popular approach is to use various features and kernels that characterize salient aspects of the data and to develop efficient fusion mechanisms to combine them. In this paper, we construct kernels which describe the statistical properties, periodic structure and inter-sample relations of the accelerometer signals.
3.1 Feature Extraction and Kernel Construction
3.1.1 Statistics Kernel
Statistical features have been known to be useful for activity recognition [17]. The features we use include mean, median, standard deviation, kurtosis, skewness and total acceleration. In addition, we extract the mean-crossing rate and dominant frequency to capture the frequency-domain information. We construct a Gaussian kernel and the best parameter is determined by cross validation on the training set.
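For reference, a Gaussian kernel on feature vectors can be computed as in the following sketch. The `gamma` parameterization and the random stand-in features are assumptions; the paper only states that the kernel parameter is selected by cross validation.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all rows of X against all rows of Y."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # clip tiny negative round-off

# Hypothetical statistical feature vectors: 20 frames, 8 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
K_stat = gaussian_kernel(X, X, gamma=0.1)
```

In practice the features would typically be standardized first so that no single statistic dominates the distance.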
3.1.2 Shape Kernel
In lieu of building conventional state-space models, Time Delay Embeddings (TDE) provide an effective way to reconstruct the underlying dynamical system from the observed data. Given time-series data, the phase space is the set of states which contain all the necessary information to predict the future of the system [18]. The TDE of a time series can be defined in matrix form whose th column is .
The time-delayed observation samples can be considered as points in , which is referred to as the delay embedding space. In our application, the delay parameter is fixed to and embedding dimension to . Following the approach in [18], we use PCA to project the embedding to 3-D for noise reduction. We extract a simple shape function based on the geometric distances, and use it to derive our feature. The shape function we consider measures the pair-wise distances between samples in the TDE space, calculated as [19]. A histogram is calculated using these distances with a pre-specified bin size to build the feature. Following this, we construct an intersection kernel [20], where , are the computed histograms.
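The shape kernel pipeline can be sketched as follows: delay embedding, histogram of pairwise distances, then histogram intersection. The delay, embedding dimension, bin count and distance range below are placeholder values rather than the settings used in the paper, and the PCA projection to 3-D is omitted by embedding directly in three dimensions.

```python
import numpy as np

def delay_embedding(x, dim, tau):
    """Stack `dim` delayed copies of x (delay `tau`) as columns: one point per row."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def shape_histogram(points, bins, max_dist):
    """Normalized histogram of pairwise Euclidean distances between points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)          # each pair counted once
    hist, _ = np.histogram(dist[upper], bins=bins, range=(0.0, max_dist))
    return hist / hist.sum()

def intersection_kernel(H1, H2):
    """k(h, h') = sum_b min(h_b, h'_b) for all pairs of histogram rows."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=-1)

# Three toy signals standing in for accelerometer frames
t = np.linspace(0.0, 4.0 * np.pi, 200)
signals = [np.sin(t), np.sin(2.0 * t), 0.5 * np.cos(t)]
H = np.stack([
    shape_histogram(delay_embedding(s, dim=3, tau=5), bins=16, max_dist=4.0)
    for s in signals
])
K_shape = intersection_kernel(H, H)
```

Because each histogram is normalized to sum to one, every diagonal entry of `K_shape` equals one and off-diagonal entries lie between zero and one.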
3.1.3 Correlation Kernel
Correlation measures the dependence between two time-series signals and has been widely used in electroencephalogram (EEG) signal analysis. We calculate the absolute value of the Pearson correlation coefficient. To account for a shift between the two signals, the maximum absolute coefficient over a range of shift values is identified. The correlation matrix defined in this way is not guaranteed to satisfy the positive semi-definite condition required of a kernel. To correct this, we remove the negative eigenvalues from the matrix. Given the eigen-decomposition of the correlation matrix , where and , the correlation kernel is constructed as , where .
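The two steps above, shift-tolerant absolute correlation and eigenvalue correction, can be sketched as below. Interpreting "remove the negative eigenvalues" as clipping them to zero is an assumption, as are the function names, the shift range and the toy matrix.

```python
import numpy as np

def max_abs_corr(a, b, max_shift):
    """Largest |Pearson correlation| between equal-length a and b over integer shifts."""
    best = 0.0
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            x, y = a[s:], b[: len(b) - s]
        else:
            x, y = a[: len(a) + s], b[-s:]
        best = max(best, abs(np.corrcoef(x, y)[0, 1]))
    return best

def psd_correct(C):
    """Clip negative eigenvalues of a symmetric matrix to zero and rebuild it."""
    C = (C + C.T) / 2.0                      # enforce exact symmetry
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 0.0, None)                # negative eigenvalues become zero
    return (V * w) @ V.T                     # V diag(w) V^T

# A symmetric similarity matrix that is not positive semidefinite
C = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
K_corr = psd_correct(C)
```

Clipping yields the nearest positive semidefinite matrix in the Frobenius norm, so the corrected kernel stays close to the original similarities.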
3.2 Dataset
The dataset used in our experiments is obtained from [21] and corresponds to different daily activities for subjects. Each activity is repeated in trials for each subject. The -axis accelerometer measurements were obtained at a sampling rate of Hz. We consider seconds of non-overlapping frames and as a result there are frames. In our experiment, randomly chosen samples were used for training and the rest for testing. Combining the base kernels described in Section 3.1 with the composition kernels (, etc.) brings the total number of input kernels for our architecture to .
4 Performance Evaluation and Conclusion
The proposed approach is tested on the described activity recognition dataset. In the dense embedding stage, the dimension of the kernel feature is reduced to . The -layer neural network for representation learning has size ---. At each hidden layer the dropout rate is fixed at . In the fusion layer, the kernel dropout rate is set to . We use the Keras library with the TensorFlow backend to build and optimize the architecture.
The training convergence curve shown in Figure 2 demonstrates that the architecture is able to reduce the loss value and converge quickly. We report the classification performance in Table 1. In our case, the accuracy is defined as the averaged fraction of correctly predicted labels for all classes. We compare the proposed architecture to other setups: (1) single kernel performance, which is obtained without the fusion layer, (2) different deep architecture settings, and (3) existing MKL methods.
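One reading of the accuracy definition above is the mean of the per-class fractions of correctly predicted samples. The sketch below implements that reading; the interpretation, the function name and the toy labels are assumptions.

```python
import numpy as np

def class_averaged_accuracy(y_true, y_pred):
    """Mean over classes of the fraction of that class's samples predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Class 0: 2 of 3 correct, class 1: 1 of 1 correct
acc = class_averaged_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

Unlike the overall fraction of correct predictions, this measure weights every activity class equally regardless of how many frames it contains.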
Input Kernel | Accuracy (%) |
---|---|
Statistics | 79.3 |
Shape | 73.3 |
Correlation | 75.3 |

Architecture | Accuracy (%) |
---|---|
(a): Standard Feature Fusion | 82.3 |
(b): (a) + Dense Embedding | 86.6 |
(c): (b) + Composition Kernels | 88.1 |

MKL Method | Accuracy (%) |
---|---|
UNIFORM | 88.5 |
SMO-MKL [4] | N/A |
SwMKL [22] | 88.5 |
Proposed: (c) + Kernel Dropout | 90.2 |
First, we observe that all kernels achieve accuracies in a similar range, while fusing them provides a significant improvement. In the best case, the proposed architecture reaches 90.2 against 79.3 for the best single kernel (Statistics), a gain of 10.9 points. Second, we compare each of the proposed training strategies to the standard deep architecture feature fusion (simple concatenation of learned representations). Dense embedding raises the accuracy from 82.3 to 86.6, a gain of 4.3 points. This demonstrates the necessity of this preprocessing stage when treating kernel values as embeddings. The inclusion of composition kernels and kernel dropout regularization provides further gains of 1.5 and 2.1 points, respectively. Although these improvements are not large in this case, they are significant. We argue that each step is beneficial and expect them to be more useful in more complex problems where a large number of descriptors and kernels are needed. From the visualization of the confusion matrix in Figure 3, we can see that the overall classification model is highly effective for this problem and that most of the confusion occurs between closely related activities (e.g. elevator up versus elevator down).
We compare our approach to MKL methods including combination with uniform weights (denoted as UNIFORM), a popular MKL algorithm SMO-MKL [4] and a recent LMKL approach SwMKL [22]. UNIFORM provides decent performance, which justifies our use of the composition kernels. SMO-MKL natively applies only to binary classification and, as pointed out by [23], the extension to multi-class classification is not trivial. We find this to be true for many existing MKL formulations. SwMKL relies on a regression method to learn the gating function which characterizes the discriminative capabilities of kernels on local data regions. However, in our case each base kernel classifies the training data fairly well, causing a highly imbalanced regression problem. This prevents the Support Vector Regressor from obtaining a meaningful gating function, thereby resulting in a performance similar to that of UNIFORM fusion. The proposed approach achieves the best performance and, more importantly, provides a reliable way to fuse a large number of kernels in a multi-class setting using the powerful numerical and computational backends that are available for generic neural networks. The architecture is also general, so more advanced techniques can be easily incorporated at certain stages.
References
- [1] Shuicheng Yan, Dong Xu, Benyu Zhang, and Hong-Jiang Zhang, “Graph embedding: A general framework for dimensionality reduction,” in 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2005, vol. 2, pp. 830–837.
- [2] André Elisseeff and Jason Weston, “A kernel method for multi-labelled classification,” in Advances in neural information processing systems, 2001, pp. 681–687.
- [3] Ashesh Jain, Swaminathan VN Vishwanathan, and Manik Varma, “Spf-gmkl: generalized multiple kernel learning with a million kernels,” in 18th SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 750–758.
- [4] Zhaonan S., Nawanol A., Manik V., and Svn V., “Multiple kernel learning and the smo algorithm,” in Advances in neural information processing systems, 2010, pp. 2361–2369.
- [5] M. Gönen and E. Alpaydin, “Localized multiple kernel learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 352–359.
- [6] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng, “Multimodal deep learning,” in Proceedings of the 28th ICML-11, 2011, pp. 689–696.
- [7] Lei Zhao, Qinghua Hu, and Yucan Zhou, “Heterogeneous features integration via semi-supervised multi-modal deep networks,” in International Conference on Neural Information Processing. Springer, 2015, pp. 11–19.
- [8] Youngmin Cho and Lawrence K Saul, “Kernel methods for deep learning,” in Advances in neural information processing systems, 2009, pp. 342–350.
- [9] Eric V Strobl and Shyam Visweswaran, “Deep multiple kernel learning,” in 2013 12th ICMLA. IEEE, 2013, vol. 1, pp. 414–417.
- [10] Omer Levy and Yoav Goldberg, “Neural word embedding as implicit matrix factorization,” in Advances in neural information processing systems, 2014, pp. 2177–2185.
- [11] Gerlof Bouma, “Normalized (pointwise) mutual information in collocation extraction,” Proceedings of GSCL, pp. 31–40, 2009.
- [12] Ahmed Elgammal and Chan-Su Lee, “Inferring 3d body pose from silhouettes using activity manifold learning,” in 2004 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2004, vol. 2, pp. II–681.
- [13] T Mikolov and J Dean, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, 2013.
- [14] Ella Bingham and Heikki Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in 7th SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001, pp. 245–250.
- [15] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [16] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- [17] Mi Zhang and Alexander A Sawchuk, “Human daily activity recognition with sparse representation using wearable sensors,” IEEE journal of Biomedical and Health Informatics, vol. 17, no. 3, pp. 553–560, 2013.
- [18] J. Frank, S. Mannor, and D. Precup, “Activity and gait recognition with time-delay embeddings,” in AAAI. Citeseer, 2010.
- [19] V. Venkataraman and P. Turaga, “Shape descriptions of nonlinear dynamical systems for video-based inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2016.
- [20] Subhransu Maji, Alexander C Berg, and Jitendra Malik, “Classification using intersection kernel support vector machines is efficient,” in 2008 IEEE Conference on CVPR. IEEE, 2008, pp. 1–8.
- [21] Mi Zhang and Alexander A Sawchuk, “Usc-had: a daily activity dataset for ubiquitous activity recognition using wearable sensors,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012, pp. 1036–1043.
- [22] Raghvendra Kannao and Prithwijit Guha, “Tv news commercials detection using success based locally weighted kernel combination,” arXiv preprint arXiv:1507.01209, 2015.
- [23] Alexander Zien and Cheng Soon Ong, “Multiclass multiple kernel learning,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 1191–1198.