Adaptive Feature Ranking for Unsupervised Transfer Learning
Transfer Learning is concerned with the application of knowledge gained from solving a problem to a different but related problem domain. In this paper, we propose a method and efficient algorithm for ranking and selecting representations from a Restricted Boltzmann Machine trained on a source domain to be transferred onto a target domain. Experiments carried out using the MNIST, ICDAR and TiCC image datasets show that the proposed adaptive feature ranking and transfer learning method offers statistically significant improvements on the training of RBMs. Our method is general in that the knowledge chosen by the ranking function does not depend on its relation to any specific target domain, and it works with unsupervised learning and knowledge-based transfer.
Keywords: Feature Selection, Transfer Learning, Restricted Boltzmann Machines
1 Introduction
Transfer Learning is concerned with the application of knowledge gained from solving a problem to a different but related problem domain. A number of researchers in Machine Learning have argued that the provision of supplementary knowledge should help improve learning performance [15, 2, 12, 1, 11, 14, 9, 3]. In Transfer Learning [10, 1, 11, 14], knowledge from a source domain can be used to improve performance in a target domain by assuming that related domains have some knowledge in common. In connectionist transfer learning, many approaches select the knowledge to be transferred specifically for the target and, in some cases, rely on labels in the source domain [8, 16]. In contrast, we are interested in selecting representations that can be transferred to the target in general, without the provision of labels in the source domain, which is similar to self-taught learning [11, 6]. In addition, we study how much knowledge should be transferred to the target domain, a question that has not yet been studied in the self-taught setting.
This paper introduces a method and efficient algorithm for ranking and selecting representation knowledge from a Restricted Boltzmann Machine (RBM) [13, 4] trained on a source domain to be transferred onto a target domain. A ranking function is defined that is shown to minimize information loss in the source RBM. High-ranking features are then transferred onto a target RBM by setting some of the network’s parameters. The target RBM is then trained to adapt a set of additional parameters on a target dataset, while keeping the parameters set by transfer learning fixed.
Experiments carried out using the MNIST, ICDAR and TiCC image datasets show that the proposed adaptive feature ranking and transfer learning method offers statistically significant improvements on the training of RBMs in three out of four cases tested, where the MNIST dataset is used as source and either the ICDAR or TiCC dataset is used as target. As expected, our method improves on the predictive accuracy of the standard self-taught learning method for unsupervised transfer learning [11].
Our method is general in that the knowledge chosen by the ranking function does not depend on its relation to any specific target domain. In transfer learning, it is normal for the choice of the knowledge to be transferred from the source domain to rely on the nature of the target domain. For example, in [7], knowledge such as data samples in the source domain is transformed to a target domain to enrich supervised learning. In our approach, the transfer learning is unsupervised, as done in . We are concerned with inductive transfer learning, in that representations learned from one domain can be useful in analogous domains. For example, in [1] a common orthonormal matrix is learned to construct linear features among multiple tasks. Self-taught learning [11], on the other hand, applies sparse coding to transfer the representations learned from source data onto the target domain. Like self-taught learning, we are interested in unsupervised transfer learning using cross-domain features. The system has been implemented in MATLAB; all learning parameters and data folds are available upon request.
The outline of the paper is as follows. In Section 2, we define feature selection by ranking, in which high-scoring features are associated with significant parts of the network. In Section 3, we introduce the adaptive feature learning method and algorithm, which combine selected features from the source domain with features learned in the target domain. In Section 4, we present and discuss the experimental results. Section 5 concludes and discusses directions for future work.
2 Feature Selection by Ranking
In this section, we present a method for selecting representations from an RBM by ranking their features. We define a ranking function and show that it can capture the most significant information in the RBM, according to a measure of information loss.
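In a trained RBM, we define $c_j$ to be a score for each unit $j$ in the hidden layer. The score represents the weight-strength of the sub-network consisting of all the connections from unit $j$ to the visible layer. A score $c_j$ is a positive real number that can be seen as capturing the uncertainty in the sub-network. We expect to be able to replace all the weights of a network by the appropriate $c_j$ or $-c_j$ with minimum information loss. We define the information loss as:

$$I_{loss} = \sum_j \| \mathbf{w}_j - c_j \mathbf{s}_j \|^2 \qquad (1)$$

where $\mathbf{w}_j$ is a vector of connection weights from unit $j$ in the hidden layer to all the units in the visible layer, $\mathbf{s}_j$ is a vector of the same size as $\mathbf{w}_j$, and $s_{ij} \in \{-1, 1\}$.
Since equation (1) is a quadratic function of $c_j$, the value of $c_j$ that minimizes information loss can be found by setting the derivatives to zero, as follows:

$$\frac{\partial I_{loss}}{\partial c_j} = -2 \sum_i (w_{ij} - c_j s_{ij})\, s_{ij} = 0 \quad \Longrightarrow \quad c_j = \frac{\sum_i w_{ij} s_{ij}}{\sum_i s_{ij}^2} \qquad (2)$$

Since $c_j > 0$ and $s_{ij} \in \{-1, 1\}$ from equation (1), we can see that:

$$\| \mathbf{w}_j - c_j \mathbf{s}_j \|^2 = \sum_i (w_{ij} - c_j s_{ij})^2 \ \geq\ \sum_i (|w_{ij}| - c_j)^2 \qquad (3)$$

where equality holds if and only if $s_{ij} = \mathrm{sign}(w_{ij})$, which will also minimize $I_{loss}$.
Applying equation (3) to (2), so that $w_{ij} s_{ij} = |w_{ij}|$ and $s_{ij}^2 = 1$, we obtain:

$$c_j = \frac{1}{n} \sum_i |w_{ij}| \qquad (4)$$

where $n$ is the number of visible units.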
We may notice that using the scores, with the appropriate signs, instead of the weights would result in a compression of the network; however, in this paper we use the scores only for selecting features for transfer learning. In what follows, we give a practical example which shows that low-scoring features are semantically less meaningful than high-scoring ones.
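As a rough illustration of the ranking, the sketch below assumes that the score of a hidden unit is the mean absolute value of its weights to the visible layer, which is the value that minimizes a squared loss between each weight and a signed copy of the score. The function names and the toy weight matrix are illustrative assumptions and are not taken from the authors' MATLAB implementation.

```python
def subnetwork_scores(weights):
    """Score each hidden unit by the mean absolute weight to the visible layer.

    weights[i][j] is the weight between visible unit i and hidden unit j.
    """
    n_visible = len(weights)
    n_hidden = len(weights[0])
    return [
        sum(abs(weights[i][j]) for i in range(n_visible)) / n_visible
        for j in range(n_hidden)
    ]


def rank_hidden_units(scores):
    """Return hidden-unit indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def information_loss(weights, scores):
    """Squared error when every weight w_ij is replaced by score_j * sign(w_ij)."""
    total = 0.0
    for row in weights:
        for j, w in enumerate(row):
            sign = 1.0 if w >= 0 else -1.0
            total += (w - scores[j] * sign) ** 2
    return total


# Toy weight matrix (assumed values): 3 visible units (rows) x 3 hidden units (columns).
W = [
    [2.0, 0.1, -1.0],
    [-2.0, 0.2, 1.0],
    [2.0, -0.3, 1.0],
]
scores = subnetwork_scores(W)
ranking = rank_hidden_units(scores)
```

In this toy matrix the first hidden unit has the largest weights and is ranked first, while the second, with weights close to zero, is ranked last. Scaling the scores up or down increases `information_loss`, which is a quick numerical check of the minimization argument.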
We have trained an RBM with 10 hidden nodes to model the XOR function from its truth-table such that . The logical values are represented by integers , respectively. After training the network, a score for each sub-network can be obtained. The score allows us to replace each real-valued weight by its sign, and to interpret those signs logically, where a negative sign ($-$) represents logical negation ($\neg$), as exemplified in Table 1.
Table 2 shows all the sub-networks with associated scores. As one may recognize, each sub-network represents a logical rule learned from the RBM. However, not all of the rules are correct w.r.t. the XOR function. In particular, the sub-network scoring encodes a rule , which is inconsistent with . By ranking the sub-networks according to their scores, this can be identified: high-scoring sub-networks are consistent with the data, and low-scoring ones are not. We have repeated the training several times with different numbers of hidden units, obtaining similar intuitive results.
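The reading of a sub-network as a signed rule can be sketched as follows. This is only an illustration: the variable names `x`, `y`, `z` and the weight values are hypothetical, and the score is assumed to be the mean absolute weight of the sub-network.

```python
def describe_subnetwork(weights, names):
    """Summarise one hidden unit's weights as (score, signed literals).

    A negative weight is read as the negation of the corresponding variable.
    """
    score = sum(abs(w) for w in weights) / len(weights)
    literals = [
        name if w >= 0 else "not " + name
        for w, name in zip(weights, names)
    ]
    return score, literals


# Hypothetical weights from one hidden unit to three visible units.
score, literals = describe_subnetwork([4.0, -5.0, 3.0], ["x", "y", "z"])
```

Sorting the sub-networks by `score` and then checking each list of literals against the truth-table is one way to see which rules the high-scoring and low-scoring units encode.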
In what follows, we will study the effect of pruning low-scoring sub-networks from an RBM trained on complex image domain data. So far, we have seen that our scoring can be useful for identifying relevant representations. In particular, if the score of a sub-network is small in relation to the average score of the network, the sub-network seems more likely to represent noise.
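One simple way to act on this observation is to compare each score with a fraction of the average score. The sketch below is an assumption made for illustration: the threshold `ratio` is a hypothetical knob, whereas the experiments in Section 4 remove units gradually by rank.

```python
def split_by_relative_score(scores, ratio=0.5):
    """Split hidden-unit indices into (kept, pruned).

    A unit is pruned when its score is below `ratio` times the average score.
    """
    threshold = ratio * sum(scores) / len(scores)
    kept = [j for j, c in enumerate(scores) if c >= threshold]
    pruned = [j for j, c in enumerate(scores) if c < threshold]
    return kept, pruned


def prune_columns(weights, kept):
    """Keep only the columns (hidden units) listed in `kept`."""
    return [[row[j] for j in kept] for row in weights]


# Assumed scores for four hidden units and a matching 2 x 4 weight matrix.
scores = [2.0, 0.2, 1.0, 0.9]
weights = [
    [2.0, 0.2, -1.0, 0.9],
    [-2.0, -0.2, 1.0, 0.9],
]
kept, pruned = split_by_relative_score(scores, ratio=0.5)
pruned_weights = prune_columns(weights, kept)
```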
3 Adaptive Feature Learning
In this section, we introduce adaptive feature learning. Given an RBM trained on a dataset, we are interested in investigating whether the score ranking introduced in Section 2 can be useful for improving the predictive accuracy of another RBM trained on a target domain. From a transfer learning perspective, we intend to produce a general transfer learning algorithm that ranks knowledge from a source domain to guide the learning on a target domain, using RBMs as the transfer medium. In particular, we propose a transfer learning model as shown in Figure 1. Knowledge learned from a source domain is selected by the ranking function for transfer onto a target domain, as explained in what follows. The selection of features in the source domain is general in that it is independent of the data in the target domain.
In the target domain, an RBM is trained given a number of transferred parameters (): a fixed set of weights associated with high-ranking sub-networks from the source domain. The output of the transferred hidden nodes in the target domain is considered as self-taught features . The connections between the visible layer and the hidden units transferred onto the target RBM can be seen as a set of up-weights and down-weights. How much the down-weights affect the learning in the target RBM depends on an influence factor ; in this paper . If then the features are combined but the transferred knowledge will not influence learning in the target domain. In this case, the result is a combination of self-taught features and target RBM features. Otherwise, if then the transferred knowledge will influence learning in the target domain. We refer to this case as adaptive feature learning, because the knowledge transferred from the source domain is used to guide the learning of new features in the target domain. The outputs of the additional hidden nodes (associated with parameter ) in the target RBM are called adaptive features.
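The combination of the two kinds of features can be sketched as below. The sketch assumes binary hidden units with logistic activations, and all names (`hidden_activations`, `combined_features`, the example weights) are hypothetical rather than taken from the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activations(visible, weights, hidden_bias):
    """Activation probability of each hidden unit given a visible vector.

    weights[i][j] links visible unit i to hidden unit j.
    """
    return [
        sigmoid(hidden_bias[j]
                + sum(visible[i] * weights[i][j] for i in range(len(visible))))
        for j in range(len(hidden_bias))
    ]


def combined_features(visible, transferred, added):
    """Concatenate features from transferred units with features from added units.

    `transferred` and `added` are (weights, hidden_bias) pairs.
    """
    w_t, b_t = transferred
    w_a, b_a = added
    return (hidden_activations(visible, w_t, b_t)
            + hidden_activations(visible, w_a, b_a))


# One transferred hidden unit and two added hidden units over two visible units.
transferred = ([[0.0], [1.0]], [0.0])
added = ([[0.5, -0.5], [0.0, 0.0]], [0.0, 0.0])
features = combined_features([0.0, 1.0], transferred, added)
```

The resulting vector, with one entry per hidden unit, is what would be passed to a downstream classifier.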
As usual, we train the target RBM to maximize the log-likelihood:
using Contrastive Divergence [4].
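A training step with fixed transferred parameters can be sketched roughly as follows. This is a generic single-example CD-1 update for a binary RBM, not the authors' code. The way `influence` scales the contribution of transferred units to the reconstruction is only one possible reading of the influence factor and is an assumption, as are all function and variable names.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cd1_step(v0, W, b_vis, b_hid, frozen, influence, lr, rng):
    """One CD-1 update, in place, on a single binary example.

    W[i][j] links visible unit i to hidden unit j. Hidden units whose index
    is in `frozen` (the transferred ones) keep their weights and biases.
    """
    n_vis, n_hid = len(v0), len(b_hid)
    # Up pass: hidden probabilities and a binary sample.
    p_h0 = [sigmoid(b_hid[j] + sum(v0[i] * W[i][j] for i in range(n_vis)))
            for j in range(n_hid)]
    h0 = [1.0 if rng.random() < p else 0.0 for p in p_h0]
    # Down pass: transferred units contribute in proportion to `influence`.
    scale = [influence if j in frozen else 1.0 for j in range(n_hid)]
    p_v1 = [sigmoid(b_vis[i]
                    + sum(scale[j] * h0[j] * W[i][j] for j in range(n_hid)))
            for i in range(n_vis)]
    p_h1 = [sigmoid(b_hid[j] + sum(p_v1[i] * W[i][j] for i in range(n_vis)))
            for j in range(n_hid)]
    # Parameter update, skipping the transferred hidden units.
    for j in range(n_hid):
        if j in frozen:
            continue
        for i in range(n_vis):
            W[i][j] += lr * (v0[i] * p_h0[j] - p_v1[i] * p_h1[j])
        b_hid[j] += lr * (p_h0[j] - p_h1[j])
    for i in range(n_vis):
        b_vis[i] += lr * (v0[i] - p_v1[i])


rng = random.Random(0)
# 3 visible units, 2 hidden units; hidden unit 0 plays the transferred role.
W = [[0.5, 0.1], [-0.5, 0.1], [0.5, -0.1]]
b_vis = [0.0, 0.0, 0.0]
b_hid = [0.0, 0.0]
frozen = {0}
before = [row[0] for row in W]
for _ in range(10):
    cd1_step([1.0, 0.0, 1.0], W, b_vis, b_hid, frozen,
             influence=1.0, lr=0.1, rng=rng)
after = [row[0] for row in W]
```

Only the column of the added hidden unit and the visible biases move during training; the transferred column is identical before and after.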
4 Experimental Results
We start by evaluating the approach on the MNIST handwritten dataset. First, we want to check if hidden units with low scores are indeed less significant in the case of image domains. Subsequently, we show that knowledge from one image domain can be used to improve predictive accuracy in another image domain.
4.1 Feature Selection
We have trained an RBM with 500 hidden nodes on 20,000 samples from the MNIST dataset in order to visualize the filter bases of the highest scoring sub-networks and the lowest scoring sub-networks (each takes of the network’s capacity). Figure 2 shows the result of using a standard RBM, and Figure 3 shows the result of using a sparse RBM [6]. As can be seen in Figure 2, high scores are mostly associated with more concrete visualizations of the expected MNIST patterns, while low scores are mostly associated with fading or noisy patterns. In Figure 3, high scores are associated with sparse representations, while low scores produce less meaningful representations according to the way in which the RBM was trained. For the sparse RBM, we use PCA to reduce the dimensionality of the images to and train the network with Gaussian visible units.
We also examined visually the impact of the scores on an RBM with 500 hidden units trained on 10,000 examples from the MNIST dataset. Sub-networks with the highest scores were gradually removed, and the reconstruction of images by the pruned RBM was compared with that of the original RBM, as illustrated in Figure 4. In Figure 5 we show the reconstruction of test images from RBMs in which low-scored features were gradually removed.
Finally, in order to obtain accuracy measures, we have provided the features obtained from the pruned RBMs as input to an SVM classifier. Figure 6 shows the drop in accuracy with the gradual pruning of the RBM. When pruning low-scored features, the removal of some units at first produces a slight increase in accuracy. The results then indicate that it is possible to remove nodes and maintain performance almost unchanged until relevant units start to be removed; accuracy deteriorates once more than half of the units are removed. When pruning high-scored features, the accuracy decreases significantly when more than of hidden units are removed.
4.2 Unsupervised Transfer Learning
In order to evaluate transfer learning in the context of images, we
have transferred sub-networks from an RBM trained on the
Below, we use MNIST and MNIST to denote datasets with 30,000 and 5,000 samples from the MNIST data collection. For the TiCC collection, we denote TiCC and TiCC as the datasets of digits and letters, respectively. TiCC and TiCC are character samples from two different groups of writers. TiCC and TiCC are from the same group but the latter has a much smaller training set. Each column in Table 3 indicates a transfer experiment, e.g. MNIST : ICDAR uses the MNIST dataset as source domain and ICDAR as target domain. The percentages show the predictive accuracy on the target domain, as detailed in the sequel.
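Table 3: Predictive accuracy (%) on the target domain for each transfer experiment (source : target).
| Method | MNIST : ICDAR | MNIST : TiCC | MNIST : TiCC | MNIST : TiCC |
|---|---|---|---|---|
| RBM STL | 30.47 ± 0.054 | 72.88 ± 0.098 | 58.13 ± 0.205 | 62.08 ± 0.321 |
| RBM | 37.63 ± 0.505 | 75.20 ± 0.745 | 62.85 ± 0.079 | 63.42 ± 0.090 |
| ASTL () | 36.66 ± 0.495 | 76.49 ± 0.361 | 63.21 ± 0.134 | 65.04 ± 0.330 |
| ASTL () | 40.43 ± 0.328 | 77.56 ± 0.564 | 63.00 ± 0.160 | 65.82 ± 0.262 |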
In Table 3, we have measured classification performance when using the knowledge from the source domain to guide the learning in the target domain. For each domain, the data is divided into training, validation and test sets. Each experiment is repeated 50 times and we use the validation set to select the best model (number of transferred sub-networks, number of added units, learning rates and SVM hyper-parameters). The table reports the average results on the test sets. The results show that the proposed adaptive feature ranking and transfer learning method offers statistically significant improvements on the training of RBMs in three out of four cases tested, where the MNIST dataset is used as source and either the ICDAR or TiCC dataset is used as target. As expected, our method improves on the predictive accuracy of the standard self-taught learning method for unsupervised transfer learning. In order to compare with transferring low-scored features, we performed experiments following the same procedure as for high-scored features, except that in this case the features are ranked from low scores to high scores. We observed that the highest accuracies were achieved only when a large number of high-scored features were among those transferred.
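The evaluation protocol described above can be sketched as follows: within each repeat, the configuration with the best validation accuracy is selected, and its test accuracy is then averaged over repeats. The data structure, the field names and the accuracy values are hypothetical, and reporting the sample standard deviation alongside the mean is an assumption about how the spread is summarized.

```python
import statistics


def best_test_accuracy(candidates):
    """Test accuracy of the configuration with the highest validation accuracy."""
    best = max(candidates, key=lambda c: c["val_acc"])
    return best["test_acc"]


def summarise_repeats(repeats):
    """Mean and sample standard deviation of the selected test accuracies."""
    test_accs = [best_test_accuracy(candidates) for candidates in repeats]
    return statistics.mean(test_accs), statistics.stdev(test_accs)


# Two repeats, each with two candidate configurations (made-up numbers).
repeats = [
    [{"n_transferred": 100, "val_acc": 0.60, "test_acc": 0.58},
     {"n_transferred": 200, "val_acc": 0.64, "test_acc": 0.61}],
    [{"n_transferred": 100, "val_acc": 0.66, "test_acc": 0.63},
     {"n_transferred": 200, "val_acc": 0.62, "test_acc": 0.65}],
]
mean_acc, std_acc = summarise_repeats(repeats)
```

Selecting on the validation set and reporting on the test set keeps the reported accuracy independent of the model selection step.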
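Table 4: Predictive accuracy (%) on the target domain for transfer experiments within the TiCC collection (source : target).
| Method | TiCC : TiCC | TiCC : TiCC | TiCC : TiCC |
|---|---|---|---|
| RBM STL | 60.65 ± 0.075 | 64.85 ± 0.227 | 47.46 ± 0.260 |
| RBM | 62.85 ± 0.079 | 63.42 ± 0.090 | 44.28 ± 0.323 |
| ASTL () | 62.41 ± 0.166 | 66.10 ± 0.137 | 51.07 ± 0.684 |
| ASTL () | 63.16 ± 0.120 | 66.25 ± 0.175 | 43.10 ± 0.332 |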
We also carried out experiments on the TiCC collection, transferring from characters to digits, from digits to characters, and from one group of writers to another. The results in Table 4 show that in two out of three experiments adaptive learning and combining features did not gain advantages over the combination of selective features from the source domain and features from the target domain. This may suggest that a selective combination of features is better and more efficient when the source and target domains are considerably similar (i.e. TiCC : TiCC).
To compare our transfer learning approach with self-taught learning [11],
we train an RBM in the source domain and use it to
extract common features in the target domain for classification. In
the experiments where the domains have a close relation such as the
same type of data (digits in MNIST:TiCC) or in the same
self-taught learning works very well, especially when the training
dataset in the target domain is small (TiCC:TiCC). We also
use sparse-coder  provided by
With transfer, it is generally accepted that the performance of the model in a target domain will depend on the quality of the knowledge it received and the structure of the model. We then evaluated performance of the model using different sizes of transferred knowledge and number of units added to the hidden layer. Figure 8 shows that if the size of transferred knowledge is too small, it will be dominated by the data from the target domain. However, if the size of transferred knowledge is too large it can cause a drop in performance since the model will try to learn new knowledge mainly based on the transferred knowledge with little knowledge from the target domain.
5 Conclusion and Future Work
We have presented a method and efficient algorithm for ranking and selecting representations from a Restricted Boltzmann Machine trained on a source domain to be transferred onto a target domain. The method is general in that the knowledge chosen by the ranking function does not depend on its relation to any specific target domain, and it works with unsupervised learning and knowledge-based transfer. Experiments carried out using the MNIST, ICDAR and TiCC image datasets show that the proposed adaptive feature ranking and transfer learning method offers statistically significant improvements on the training of RBMs.
In this paper we focus on selecting features from a shallow network (RBM) for general transfer learning. In future work, we are interested in selectively learning and transferring high-level features from deep networks for a specific domain.
References
[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[2] Artur S. Avila Garcez and Gerson Zaverucha. The connectionist inductive learning and logic programming system. Applied Intelligence, 11(1):59–77, July 1999.
[3] Jesse Davis and Pedro Domingos. Deep transfer via second-order Markov logic. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 217–224, New York, NY, USA, 2009. ACM.
[4] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput., 14(8):1771–1800, August 2002.
[5] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 801–808. MIT Press, Cambridge, MA, 2007.
[6] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems. MIT Press, 2008.
[7] Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. Transfer learning by borrowing examples for multiclass object detection. In NIPS, pages 118–126, 2011.
[8] Grégoire Mesnil, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian J. Goodfellow, Erick Lavoie, Xavier Muller, Guillaume Desjardins, David Warde-Farley, Pascal Vincent, Aaron C. Courville, and James Bergstra. Unsupervised and transfer learning challenge: a deep learning approach. In ICML Unsupervised and Transfer Learning, pages 97–110, 2012.
[9] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, AAAI’07, pages 608–614. AAAI Press, 2007.
[10] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010.
[11] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 759–766, New York, NY, USA, 2007. ACM.
[12] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107–136, February 2006.
[13] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Volume 1: Foundations, pages 194–281. MIT Press, Cambridge, 1986.
[14] Lisa Torrey, Jude W. Shavlik, Trevor Walker, and Richard Maclin. Transfer learning via advice taking. In Advances in Machine Learning I, pages 147–170. Springer, 2010.
[15] Geoffrey G. Towell and Jude W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70(1-2):119–165, 1994.
[16] Bin Wei and Christopher Pal. Heterogeneous transfer learning with RBMs. In Wolfram Burgard and Dan Roth, editors, AAAI. AAAI Press, 2011.