Learning to Select Pre-trained Deep Representations with
Bayesian Evidence Framework
We propose a Bayesian evidence framework to facilitate transfer learning from pre-trained deep convolutional neural networks (CNNs). Our framework is formulated on top of a least squares SVM (LS-SVM) classifier, which is simple and fast in both training and testing, and achieves competitive performance in practice. The regularization parameters in LS-SVM are estimated automatically, without grid search or cross-validation, by maximizing the evidence, which is also a useful measure for selecting the best performing CNN out of multiple candidates for transfer learning. The evidence is optimized efficiently by employing Aitken’s delta-squared process, which accelerates the convergence of the fixed point update. The proposed Bayesian evidence framework also provides a good solution for identifying the best ensemble of heterogeneous CNNs through a greedy algorithm. Our Bayesian evidence framework for transfer learning is tested on 12 visual recognition datasets and consistently achieves state-of-the-art performance in terms of prediction accuracy and modeling efficiency.
1 Introduction
Image representations from deep CNN models trained for specific image classification tasks turn out to be powerful even for general purposes [2, 6, 7, 21, 23] and useful for transfer learning or domain adaptation. Therefore, CNNs trained on specific problems or datasets are often fine-tuned to facilitate training for new tasks or domains [2, 6, 13, 15, 16, 36]. An even simpler approach, the application of off-the-shelf classification algorithms such as SVM to the representations from deep CNNs, is becoming increasingly attractive in many computer vision problems. However, fine-tuning an entire deep network still requires considerable effort and resources, and SVM-based methods also involve time-consuming grid search and cross-validation to identify good regularization parameters. In addition, when multiple pre-trained deep CNN models are available, it is unclear which pre-trained models are appropriate for the target tasks and which classifiers would maximize accuracy and efficiency. Unfortunately, most existing techniques for transfer learning or domain adaptation are limited to empirical analysis or ad-hoc, application-specific approaches.
We propose a simple but effective algorithm for transfer learning from pre-trained deep CNNs based on Bayesian least squares SVM (LS-SVM), which combines the Bayesian evidence framework [18, 29] with LS-SVM. This approach automatically determines regularization parameters in a principled way, and shows performance comparable to that of standard SVMs based on hinge loss or squared hinge loss. More importantly, Bayesian LS-SVM provides an effective solution for selecting the best CNN out of multiple candidates and for identifying a good ensemble of heterogeneous CNNs for performance improvement. Figure 1 illustrates our approach. We also propose a fast Bayesian LS-SVM, which maximizes the evidence more efficiently based on Aitken’s delta-squared process.
One may argue against the use of LS-SVM for classification because the least squares loss function in LS-SVM tends to penalize well-classified examples. However, least squares loss is often used for training multilayer perceptrons and shows performance comparable to that of SVMs [28, 37]. In addition, Bayesian LS-SVM provides a technically sound formulation with outstanding performance in terms of speed and accuracy for transfer learning with deep representations. Considering simplicity and accuracy, we claim that our fast Bayesian LS-SVM is a reasonable choice for transfer learning with deep learning representations in visual recognition problems. Based on this approach, we achieved promising results compared to the state-of-the-art techniques on 12 visual recognition tasks.
The rest of this paper is organized as follows. Section 2 describes examples of transfer learning or domain adaptation based on pre-trained CNNs for visual recognition problems. Then, we discuss the Bayesian evidence framework applicable to the same problem in Section 3 and its acceleration technique using Aitken’s delta-squared process in Section 4. The performance of our algorithm in various applications is demonstrated in Section 5.
2 Related Work
Since AlexNet demonstrated impressive performance in the ImageNet large scale visual recognition challenge (LSVRC) 2012, a few deep CNNs with different architectures, e.g., VGG and GoogLeNet, have been proposed in the subsequent events. Instead of training deep CNNs from scratch, researchers have attempted to refine pre-trained networks for new tasks or datasets by updating the weights of all neurons, or have adopted the intermediate outputs of existing deep networks as generic visual feature descriptors. These strategies can be interpreted as transfer learning or domain adaptation.
Refining a pre-trained CNN is called fine-tuning, where the architecture of the network may be preserved while the weights are updated based on new training data. Fine-tuning is generally useful for improving performance [2, 6, 13, 36] but requires careful implementation to avoid overfitting. The second approach regards the pre-trained CNNs as feature extraction machines and combines the deep representations with off-the-shelf classifiers such as linear SVM [7, 34], logistic regression [7, 34], and multi-layer neural networks. The techniques in this category have been successful in many visual recognition tasks [2, 23, 24].
When combining a classification algorithm with image representations from pre-trained deep CNNs, we often face a critical issue. Although several deep CNN models trained on large scale image repositories are publicly available, there is no principled way to select a CNN out of multiple candidates or to find the best ensemble of multiple CNNs for performance optimization. Existing algorithms typically rely on ad-hoc methods for model selection and fail to provide clear evidence for superior performance.
3 Bayesian LS-SVM for Model Selection
This section discusses a Bayesian evidence framework to select the best CNN model(s) among multiple transferable candidates and to identify a reasonable regularization parameter for the LS-SVM classifier automatically.
3.1 Problem Definition and Formulation
Suppose that we have a set of pre-trained deep CNN models. Our goal is to identify the best performing deep CNN model among these networks for transfer learning. A naïve approach is to fine-tune each network for the target task, which requires substantial effort for training. Another option is to replace some of the fully connected layers in a CNN with an off-the-shelf classifier such as SVM and check the performance on the target task through parameter tuning for each network, which would also be computationally expensive.
We adopt a Bayesian evidence framework based on LS-SVM to achieve the goal in a principled way, where the evidence of each network is maximized iteratively and the maximum evidences are used to select a reasonable model. During the evidence maximization procedure, the regularization parameter of LS-SVM is identified automatically without time-consuming grid search and cross-validation. In addition, the Bayesian evidence framework is also applied to the construction of an ensemble of multiple CNNs to accomplish further performance improvement.
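We deal with a multi-label or multi-class classification problem, where the number of categories is $K$. Let $\mathcal{D} = \{(\mathbf{x}_n, y_n^{(k)})\}$, with $n = 1, \dots, N$ and $k = 1, \dots, K$, be a training set, where $\mathbf{x}_n \in \mathbb{R}^D$ is a feature vector and $y_n^{(k)}$ is a binary variable that is set to 1 if label $k$ is given to $\mathbf{x}_n$ and 0 otherwise. Then, for each class $k$, we minimize a least squares loss with an $\ell_2$ regularization penalty as follows:
$$\min_{\mathbf{w}^{(k)}} \; \|\mathbf{y}^{(k)} - \mathbf{X}^\top \mathbf{w}^{(k)}\|^2 + \lambda^{(k)} \|\mathbf{w}^{(k)}\|^2, \qquad (1)$$
where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ and $\mathbf{y}^{(k)} = [y_1^{(k)}, \dots, y_N^{(k)}]^\top$. The optimal solution of the problem in (1) is given by
$$\mathbf{w}^{(k)} = (\mathbf{X}\mathbf{X}^\top + \lambda^{(k)} \mathbf{I})^{-1} \mathbf{X} \mathbf{y}^{(k)} = \mathbf{U} (\mathbf{S} + \lambda^{(k)} \mathbf{I})^{-1} \mathbf{U}^\top \mathbf{X} \mathbf{y}^{(k)}, \qquad (2)$$
where $\mathbf{X}\mathbf{X}^\top = \mathbf{U}\mathbf{S}\mathbf{U}^\top$ is the eigen-decomposition of $\mathbf{X}\mathbf{X}^\top$ and $\mathbf{I}$ is an identity matrix. This regularized least squares approach has the clear benefit that it requires only one eigen-decomposition of $\mathbf{X}\mathbf{X}^\top$ to obtain the solution in (2) for all combinations of $\lambda^{(k)}$ and $\mathbf{y}^{(k)}$.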
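The reuse of a single eigen-decomposition across classes and regularization values can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes that features are stored as the columns of a D×N matrix `X` and labels as an N×K 0/1 matrix `Y`, and the function name `ridge_solutions` and the random data are hypothetical.

```python
import numpy as np

def ridge_solutions(X, Y, lambdas):
    """Solve K regularized least squares problems with one eigen-decomposition.

    X: (D, N) features, one column per example (assumed layout).
    Y: (N, K) binary label matrix, one column per class.
    lambdas: (K,) regularization parameter for each class.
    Returns W: (D, K), one weight vector per class.
    """
    s, U = np.linalg.eigh(X @ X.T)            # X X^T = U diag(s) U^T, computed once
    H = U.T @ (X @ Y)                         # rotate X y into the eigenbasis, all classes at once
    return U @ (H / (s[:, None] + lambdas[None, :]))

rng = np.random.default_rng(0)
X = rng.random((5, 40))
Y = (rng.random((40, 3)) > 0.5).astype(float)
lambdas = np.array([0.1, 1.0, 10.0])
W = ridge_solutions(X, Y, lambdas)

# Reference: solve each class directly, which needs one linear solve per (class, lambda) pair.
W_direct = np.column_stack([
    np.linalg.solve(X @ X.T + lam * np.eye(5), X @ Y[:, k])
    for k, lam in enumerate(lambdas)
])
max_abs_diff = float(np.max(np.abs(W - W_direct)))
```

Changing any entry of `lambdas` only changes the elementwise division, so a new regularization value costs O(D) per class after the decomposition.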
3.3 Bayesian Evidence Framework
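The optimization of the regularized least squares formulation presented in (1) is equivalent to the maximization of the posterior with fixed hyperparameters $\alpha$ and $\beta$, denoted by $p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \alpha, \beta)$, where $\lambda = \alpha / \beta$. The posterior can be decomposed into two terms by Bayes' theorem as
$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \alpha, \beta) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha), \qquad (3)$$
where $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta)$ corresponds to a Gaussian observation noise model given by
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{x}_n^\top \mathbf{w}, \beta^{-1}), \qquad (4)$$
and $p(\mathbf{w} \mid \alpha)$ denotes a zero-mean isotropic Gaussian prior as
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}). \qquad (5)$$
Note that we dropped the superscript $(k)$ for notational simplicity from the equations in this subsection.
The evidence, or marginal likelihood, is obtained by integrating the product in (3) over $\mathbf{w}$, and its logarithm is
$$F(\alpha, \beta) \equiv \log p(\mathbf{y} \mid \mathbf{X}, \alpha, \beta) = \frac{D}{2} \log \alpha + \frac{N}{2} \log \beta - \frac{\beta}{2} \|\mathbf{y} - \mathbf{X}^\top \mathbf{m}\|^2 - \frac{\alpha}{2} \mathbf{m}^\top \mathbf{m} - \frac{1}{2} \log |\mathbf{A}| - \frac{N}{2} \log 2\pi, \qquad (6)$$
where the precision matrix and mean vector of the posterior are given respectively by
$$\mathbf{A} = \alpha \mathbf{I} + \beta \mathbf{X}\mathbf{X}^\top, \qquad \mathbf{m} = \beta \mathbf{A}^{-1} \mathbf{X} \mathbf{y}. \qquad (7)$$
The log evidence is maximized by repeatedly alternating the following fixed point update rules
$$\alpha = \frac{\gamma}{\mathbf{m}^\top \mathbf{m}}, \qquad \beta = \frac{N - \gamma}{\|\mathbf{y} - \mathbf{X}^\top \mathbf{m}\|^2}, \qquad (8)$$
which involves the derivation of $\gamma$ as
$$\gamma = \sum_{d=1}^{D} \frac{\beta s_d}{\alpha + \beta s_d}, \qquad (9)$$
where $s_d$ are eigenvalues of $\mathbf{X}\mathbf{X}^\top$. Note that $\mathbf{m}$ and $\gamma$ should be re-estimated after each update of $\alpha$ and $\beta$.
Another pair of update rules for $\alpha$ and $\beta$ is derived by an expectation-maximization (EM) technique as
$$\alpha = \frac{D}{\mathbf{m}^\top \mathbf{m} + \mathrm{tr}(\mathbf{A}^{-1})}, \qquad \beta = \frac{N}{\|\mathbf{y} - \mathbf{X}^\top \mathbf{m}\|^2 + \mathrm{tr}(\mathbf{A}^{-1} \mathbf{X}\mathbf{X}^\top)}, \qquad (10)$$
but these procedures are substantially slower than the fixed point update rules in (8).
Through the optimization procedures described above, we determine the regularization parameter $\lambda = \alpha / \beta$. Although the estimated parameters are not optimal, they may still be reasonable solutions since they are obtained by maximizing the marginal likelihood in (6).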
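The alternating fixed point updates for the two hyperparameters can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: it uses the standard evidence-framework updates for a linear Gaussian model (alpha is the prior precision, beta the noise precision), stores features as columns of a D×N matrix, and checks itself on synthetic real-valued targets with a known noise level. The function name `maximize_evidence` and all data are hypothetical.

```python
import numpy as np

def maximize_evidence(X, y, alpha=1.0, beta=1.0, max_iter=500, tol=1e-10):
    """Alternate fixed point updates of alpha and beta; return (alpha, beta, log evidence).

    X: (D, N) features, one column per example (assumed layout); y: (N,) targets.
    """
    D, N = X.shape
    s, U = np.linalg.eigh(X @ X.T)                 # eigenvalues s_d of X X^T
    s = np.clip(s, 0.0, None)                      # guard against tiny negative round-off
    h = U.T @ (X @ y)
    yty = float(y @ y)

    def stats(alpha, beta):
        m_rot = beta * h / (alpha + beta * s)      # posterior mean expressed in the eigenbasis
        gamma = float(np.sum(beta * s / (alpha + beta * s)))
        mtm = float(m_rot @ m_rot)                 # m^T m
        resid = yty - 2.0 * float(m_rot @ h) + float(np.sum(s * m_rot**2))  # ||y - X^T m||^2
        return gamma, mtm, resid

    for _ in range(max_iter):
        gamma, mtm, resid = stats(alpha, beta)
        new_alpha, new_beta = gamma / mtm, (N - gamma) / resid
        converged = (abs(new_alpha - alpha) <= tol * alpha
                     and abs(new_beta - beta) <= tol * beta)
        alpha, beta = new_alpha, new_beta
        if converged:
            break

    gamma, mtm, resid = stats(alpha, beta)         # re-estimate at the final hyperparameters
    log_ev = 0.5 * (D * np.log(alpha) + N * np.log(beta)
                    - beta * resid - alpha * mtm
                    - np.sum(np.log(alpha + beta * s))
                    - N * np.log(2.0 * np.pi))
    return alpha, beta, float(log_ev)

# Synthetic sanity check: the true noise precision is 1 / 0.5**2 = 4.
rng = np.random.default_rng(0)
D, N, noise_std = 5, 200, 0.5
X = rng.standard_normal((D, N))
w_true = rng.standard_normal(D)
y = X.T @ w_true + noise_std * rng.standard_normal(N)
alpha, beta, log_ev = maximize_evidence(X, y)
lam = alpha / beta                                 # implied regularization parameter
```

All quantities inside the loop are computed from the one eigen-decomposition, so each iteration costs O(D) rather than a new matrix inversion.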
3.4 Model Selection using Evidence
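The evidence computed in the previous subsection is for a single class, and the overall evidence for all classes, denoted by $\mathcal{F}$, is obtained by summing the log evidences of the individual classes, which is given by
$$\mathcal{F} = \sum_{k=1}^{K} F(\alpha^{(k)}, \beta^{(k)}), \qquad (11)$$
where $F(\alpha^{(k)}, \beta^{(k)})$ is the maximized log evidence of class $k$.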
We compute the overall evidence corresponding to each deep CNN model, and choose the model with the maximum evidence for transfer learning. We expect that the selected model performs best among all candidates, which will be verified in our experiment.
In addition, when an ensemble of deep CNNs needs to be constructed for a target task, our approach selects a subset of good pre-trained CNNs in a greedy manner. Specifically, we add a network with the largest evidence in each stage and test whether the augmented network improves the evidence or not. The network is accepted if the evidence increases, or rejected otherwise. After the last candidate is tested, we obtain the final network combination and its associated model learned with the concatenated feature descriptors from accepted networks.
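The greedy procedure can be sketched as below. This is one reading of the description, given as an assumption: candidates are ranked once by their individual evidence and then tested in that order. The callable `evidence_fn` stands in for "concatenate the features of these networks and maximize the overall evidence", and the network names and evidence values in the toy table are made up purely for illustration.

```python
def greedy_ensemble(names, evidence_fn):
    """Greedily grow an ensemble, accepting a candidate only if the evidence increases.

    names: iterable of candidate network identifiers.
    evidence_fn: callable mapping a tuple of names to the overall evidence of
                 the model trained on their concatenated features (assumed interface).
    """
    ranked = sorted(names, key=lambda n: evidence_fn((n,)), reverse=True)
    selected = [ranked[0]]                       # start from the single best network
    best = evidence_fn(tuple(selected))
    for name in ranked[1:]:
        trial = evidence_fn(tuple(selected + [name]))
        if trial > best:                         # accept only if the evidence increases
            selected.append(name)
            best = trial
    return selected, best

# Toy evidence table with hypothetical values (not results from the paper).
toy_evidence = {
    frozenset({"net_a"}): -100.0,
    frozenset({"net_b"}): -110.0,
    frozenset({"net_c"}): -130.0,
    frozenset({"net_a", "net_b"}): -95.0,
    frozenset({"net_a", "net_c"}): -105.0,
    frozenset({"net_b", "net_c"}): -108.0,
    frozenset({"net_a", "net_b", "net_c"}): -97.0,
}

def toy_evidence_fn(subset):
    return toy_evidence[frozenset(subset)]

selected, best = greedy_ensemble(["net_a", "net_b", "net_c"], toy_evidence_fn)
```

With M candidates this evaluates the evidence at most 2M − 1 times, compared with 2^M − 1 subsets for exhaustive evidence maximization.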
4 Fast Bayesian LS-SVM
The Bayesian evidence framework discussed in Section 3 is useful for identifying a good CNN for transfer learning and a reasonable regularization parameter. To make this framework even more practical, we present a faster algorithm to accomplish the same goal and a new theory that guarantees the convergence of the algorithm.
4.1 Reformulation of Evidence
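We are going to reduce $F(\alpha, \beta)$ to a function with only one parameter that directly corresponds to the regularization parameter $\lambda$. To this end, we re-write $F(\alpha, \beta)$ by using the eigen-decomposition $\mathbf{X}\mathbf{X}^\top = \mathbf{U}\mathbf{S}\mathbf{U}^\top$ as
$$F(\alpha, \beta) = \frac{D}{2} \log \alpha + \frac{N}{2} \log \beta - \frac{\beta}{2} \mathbf{y}^\top \mathbf{y} + \frac{\beta^2}{2} \sum_{d=1}^{D} \frac{h_d^2}{\alpha + \beta s_d} - \frac{1}{2} \sum_{d=1}^{D} \log(\alpha + \beta s_d) - \frac{N}{2} \log 2\pi, \qquad (12)$$
where $s_d$ is the $d$-th diagonal element in $\mathbf{S}$ and $h_d$ denotes the $d$-th element in $\mathbf{h} = \mathbf{U}^\top \mathbf{X} \mathbf{y}$. Then, we re-parameterize $F(\alpha, \beta)$ into $F(\lambda, \beta)$ with $\alpha = \lambda \beta$ as
$$F(\lambda, \beta) = \frac{D}{2} \log \lambda + \frac{N}{2} \log \beta - \frac{\beta}{2} \Big( \mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \Big) - \frac{1}{2} \sum_{d=1}^{D} \log(\lambda + s_d) - \frac{N}{2} \log 2\pi. \qquad (13)$$
The derivative of $F(\lambda, \beta)$ with respect to $\beta$ is given by
$$\frac{\partial F}{\partial \beta} = \frac{N}{2\beta} - \frac{1}{2} \Big( \mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \Big), \qquad (14)$$
and we obtain the following equation by setting this derivative to zero,
$$\beta = \frac{N}{\mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} h_d^2 / (\lambda + s_d)}. \qquad (15)$$
Substituting (15) into (13) leaves a log evidence that depends on $\lambda$ only,
$$F(\lambda) = \frac{D}{2} \log \lambda - \frac{1}{2} \sum_{d=1}^{D} \log(\lambda + s_d) - \frac{N}{2} \log \Big( \mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \Big) + \text{const}. \qquad (16)$$
Figure 2 illustrates the curvature of this log evidence function with respect to $\lambda$.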
4.2 New Fixed-point Update Rule
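We now derive a new fixed point update rule and present a sufficient condition for the existence of a fixed point. The stationary points of the one-parameter log evidence $F(\lambda)$ with respect to $\lambda$ satisfy
$$\frac{D}{\lambda} - \sum_{d=1}^{D} \frac{1}{\lambda + s_d} = \frac{N \sum_{d=1}^{D} h_d^2 / (\lambda + s_d)^2}{\mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} h_d^2 / (\lambda + s_d)}, \qquad (17)$$
and we update the fixed point by maximizing $F(\lambda)$ as
$$\lambda \leftarrow \frac{\Big( \sum_{d=1}^{D} \frac{s_d}{\lambda + s_d} \Big) \Big( \mathbf{y}^\top \mathbf{y} - \sum_{d=1}^{D} \frac{h_d^2}{\lambda + s_d} \Big)}{N \sum_{d=1}^{D} \frac{h_d^2}{(\lambda + s_d)^2}}. \qquad (18)$$
As illustrated in Figure 2 and in the supplementary file, $F(\lambda)$ is neither convex nor concave. However, we can show a sufficient condition for the existence of the fixed point using the following theorem.
Theorem 1. Denote the update rule in (18) by $g(\lambda)$. If $y_n$ is a binary variable and $\mathbf{x}_n$ is an $\ell_2$ normalized nonnegative vector, then $g(\lambda)$ has a fixed point.
Proof. We first show that $g(\lambda)$ is asymptotically linear in $\lambda$ as
$$\lim_{\lambda \to \infty} \frac{g(\lambda)}{\lambda} = \frac{\big( \sum_{d=1}^{D} s_d \big) \, \mathbf{y}^\top \mathbf{y}}{N \, \mathbf{h}^\top \mathbf{h}}. \qquad (19)$$
Since $y_n$ is binary and $\mathbf{x}_n$ is $\ell_2$ normalized and nonnegative, we can derive the following two relations,
$$\sum_{d=1}^{D} s_d = \mathrm{tr}(\mathbf{X}\mathbf{X}^\top) = \sum_{n=1}^{N} \|\mathbf{x}_n\|^2 = N, \qquad \mathbf{h}^\top \mathbf{h} = \|\mathbf{X}\mathbf{y}\|^2 = \sum_{n, m} y_n y_m \mathbf{x}_n^\top \mathbf{x}_m \ge \sum_{n} y_n = \mathbf{y}^\top \mathbf{y}, \qquad (20)$$
so the asymptotic slope in (19) does not exceed 1. Obviously, $g(0) \ge 0$ and there exists a $\lambda' > 0$ such that $g(\lambda') \le \lambda'$. The intermediate value theorem implies the existence of $\lambda^*$ such that $g(\lambda^*) = \lambda^*$, where $0 \le \lambda^* \le \lambda'$, as illustrated in Figure 3. ∎
The fixed point is unique if $g(\lambda)$ is concave. Although it is always concave according to our observation, we have no proof yet and leave it as future work.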
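The two conditions of the theorem can be checked numerically on random data. The sketch below is only a sanity check under stated assumptions: features are columns of a D×N matrix, and `g` is one fixed point form obtained by setting the derivative of the one-parameter log evidence to zero, which may differ in arrangement from the update rule numbered (18). All data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 300
X = rng.random((D, N))                            # nonnegative features
X /= np.linalg.norm(X, axis=0, keepdims=True)     # l2-normalize every column
y = (rng.random(N) < 0.3).astype(float)           # binary labels

s, U = np.linalg.eigh(X @ X.T)
s = np.clip(s, 0.0, None)                         # guard against tiny negative round-off
h = U.T @ (X @ y)
yty = float(y @ y)

def g(lam):
    """Assumed one-parameter update for the regularization parameter."""
    gamma = np.sum(s / (lam + s))
    fit = yty - np.sum(h**2 / (lam + s))
    return float(gamma * fit / (N * np.sum(h**2 / (lam + s) ** 2)))

trace = float(np.sum(s))                          # equals N for l2-normalized columns
hth = float(h @ h)                                # equals ||X y||^2, at least y^T y here
slope = trace * yty / (N * hth)                   # asymptotic slope of g
big = 1e8
ratio_at_big = g(big) / big                       # should approach the asymptotic slope
g_near_zero = g(1e-3)
```

For data like this the slope is far below 1, because the nonnegative feature vectors of the positive examples have large pairwise inner products.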
4.3 Speed Up Algorithm
We accelerate the fixed point update rule in (18) by using Aitken’s delta-squared process. Figure 3 illustrates Aitken’s delta-squared process. Let us focus on the two points $(\lambda_t, g(\lambda_t))$ and $(\lambda_{t+1}, g(\lambda_{t+1}))$, and the line going through these two points. The equation of this line is
$$y = \lambda_{t+1} + \frac{\lambda_{t+2} - \lambda_{t+1}}{\lambda_{t+1} - \lambda_t} (x - \lambda_t), \qquad (21)$$
where $g(\lambda_t)$ and $g(\lambda_{t+1})$ are replaced by $\lambda_{t+1}$ and $\lambda_{t+2}$, respectively. The idea behind Aitken’s method is to approximate the fixed point $\lambda^*$ using the intersection of the line in (21) with the line $y = x$, which is given by
$$x = \lambda_t - \frac{(\lambda_{t+1} - \lambda_t)^2}{(\lambda_{t+2} - \lambda_{t+1}) - (\lambda_{t+1} - \lambda_t)}. \qquad (22)$$
Our fast Bayesian learning algorithm for the regularized least squares problem in (1) is summarized in Algorithm 1. In our algorithm, we first compute the eigen-decomposition of $\mathbf{X}\mathbf{X}^\top$. This is the most time-consuming part but needs to be performed only once since the result can be reused for every label $k \in \{1, \dots, K\}$. After that, we obtain the regularization parameter $\lambda^{(k)}$ through an iterative procedure.
When we apply Aitken’s delta-squared process, we have two potential failure cases as in Figure 4(a) and 4(b). The first case often arises if the initial $\lambda$ is far from the fixed point $\lambda^*$, and the second case occurs when the approximating line in (21) is parallel to $y = x$. Fortunately, these failures rarely happen in practice and can be handled easily by skipping the procedure in (22) and updating $\lambda$ with the plain fixed point iterate $g(\lambda)$.
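The accelerated iteration with its fallback can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 1: `make_update` builds an assumed one-parameter update `g` (one fixed point form of the stationarity condition of the log evidence), and the acceptance test on the secant slope is a safeguard chosen here to cover both failure cases. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def make_update(X, y):
    """Build the update g(lambda) from a single eigen-decomposition (assumed form)."""
    D, N = X.shape
    s, U = np.linalg.eigh(X @ X.T)
    s = np.clip(s, 0.0, None)                     # guard against tiny negative round-off
    h = U.T @ (X @ y)
    yty = float(y @ y)

    def g(lam):
        gamma = np.sum(s / (lam + s))
        fit = yty - np.sum(h**2 / (lam + s))
        return float(gamma * fit / (N * np.sum(h**2 / (lam + s) ** 2)))
    return g

def aitken_fixed_point(g, lam0=1.0, tol=1e-10, max_iter=100):
    """Fixed point iteration accelerated by Aitken's delta-squared process."""
    lam = lam0
    for n_iter in range(1, max_iter + 1):
        lam1 = g(lam)
        lam2 = g(lam1)
        d1, d2 = lam1 - lam, lam2 - lam1
        if d1 == 0.0:                             # already exactly at a fixed point
            return lam, n_iter
        ratio = d2 / d1                           # slope of the line through the two points
        if 0.0 <= ratio < 1.0:
            cand = lam + d1 / (1.0 - ratio)       # same value as lam - d1**2 / (d2 - d1)
        else:
            cand = lam2                           # failure case: keep the plain updates
        if not np.isfinite(cand) or cand <= 0.0:
            cand = lam2
        if abs(cand - lam) <= tol * max(1.0, abs(lam)):
            return cand, n_iter
        lam = cand
    return lam, max_iter

# Synthetic data matching the theorem's assumptions: binary labels and
# l2-normalized nonnegative features stored as columns.
rng = np.random.default_rng(0)
D, N = 8, 300
X = rng.random((D, N))
X /= np.linalg.norm(X, axis=0, keepdims=True)
scores = rng.standard_normal(D) @ X
y = (scores > np.median(scores)).astype(float)

g = make_update(X, y)
lam_star, n_iter = aitken_fixed_point(g)
residual = abs(g(lam_star) - lam_star)
```

Each accelerated step costs two evaluations of `g`, and each evaluation is O(D) because the eigen-decomposition is computed once in `make_update`.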
5 Experiments
We present the details of our experimental setting and the performance of our algorithm compared to the state-of-the-art techniques on 12 visual recognition benchmark datasets.
5.1 Datasets and Image Representation
The benchmark datasets involve various visual recognition tasks such as object recognition, photo annotation, scene recognition, fine-grained recognition, visual attribute detection, and action recognition. Table 1 presents the characteristics of the datasets. In our experiment, we followed the given train and test split and the evaluation measure of each dataset. For the datasets with bounding box annotations, such as CUB200-2011, UIUC object attribute, Human attribute, and Stanford 40 actions, we enlarged the bounding boxes by 150% to consider neighborhood context as suggested in [2, 23].
Table 1: Characteristics of the 12 benchmark datasets.
|Dataset||Task||Training images||Test images||Classes||Labels per image||Evaluation measure|
|PASCAL VOC 2007||object recognition||5011||4952||20||1.5||mean AP|
|PASCAL VOC 2012||object recognition||5717||5823||20||1.5||mean AP|
|Caltech 101||object recognition||3060||6086||102||1||mean Acc.|
|Caltech 256||object recognition||15420||15187||257||1||mean Acc.|
|ImageCLEF 2011||photo annotation||8000||10000||99||11.9||mean AP|
|MIT Indoor Scene||scene recognition||5360||1340||67||1||mean Acc.|
|SUN 397 Scene||scene recognition||19850||19850||397||1||mean Acc.|
|CUB 200-2011||fine-grained recognition||5994||5794||200||1||mean Acc.|
|Oxford Flowers||fine-grained recognition||2040||6149||102||1||mean Acc.|
|UIUC object attributes||attribute detection||6340||8999||64||7.1||mean AUC|
|Human attributes||attribute detection||4013||4022||9||1.8||mean AP|
|Stanford 40 actions||action recognition||4000||5532||40||1||mean AP|
For deep learning representations, we selected 4 pre-trained CNNs from the Caffe Model Zoo: GoogLeNet, VGG19, and AlexNet trained on ImageNet, and GoogLeNet trained on Places. As generic image representations, we used the 4096-dimensional activations of the first fully connected layer in VGG19 and AlexNet, and the 1024-dimensional vector obtained from the global average pooling layer located right before the final softmax layer in GoogLeNet.
Our implementation is in MATLAB 2011a, and all experiments were conducted on a quad-core Intel Core i7-3820 processor running at 3.60GHz.
5.2 Bayesian LS-SVM vs. SVM
We first compare the performance of our Bayesian LS-SVM with the standard SVM when they are applied to deep CNN features for visual recognition problems. We used only a single image scale in this experiment. The LIBLINEAR package is used for SVM training, and the regularization parameters are selected by grid search with cross-validation.
Table 2 presents the complete results of our experiment. Bayesian LS-SVM is competitive with SVM in terms of prediction accuracy even with significantly reduced training time. SVM training becomes increasingly slower than Bayesian LS-SVM training as the number of classes grows, so it is particularly slow on the Caltech 256 and SUN 397 datasets.
[Table 2: per-dataset comparison of Bayesian LS-SVM and SVM, with panels for PASCAL VOC 2007, PASCAL VOC 2012, Caltech 101, Caltech 256, ImageCLEF, MIT Indoor, SUN-397, CUB-200, Oxford Flowers, UIUC Attributes, Human Attributes, and Stanford 40 Action.]
Another notable observation in Table 2 is that the order of prediction accuracy is highly correlated with the evidence. This means that the model selected by Bayesian LS-SVM produces reliable testing accuracy, and that a proper deep learning image representation is obtained without time-consuming grid search and cross-validation. Note that cross-validation in LS-SVM and SVM plays the same role, but is less reliable and slower than our Bayesian evidence framework. The capability to select the appropriate CNN model and the corresponding regularization parameter is one of the most important properties of our algorithm.
5.3 Comparison with Other Methods
We now show that our Bayesian LS-SVM identifies a combination of multiple CNNs to improve accuracy without grid search and cross-validation. For each task, we select a subset of the 4 pre-trained CNNs in a greedy manner; we add CNNs to our selection, one by one, until the evidence does not increase. Our algorithm is compared with DeCAF, Zeiler, INRIA, KTH-S, KTH-FT, VGG, Zhang [35, 36], and TUBFI. In addition, our ensembles identified by greedy evidence maximization are compared with the oracle combinations (the ones with the highest accuracy on the test set found by exhaustive search) and the best combinations found by exhaustive evidence maximization.
Table 3 shows that our ensemble approach achieves the best performance in most of the 12 tasks. The ensembles identified by the greedy approach are consistent with the selections by exhaustive evidence maximization and even with the oracle selections made by testing accuracy maximization. (The oracle option is practically impossible since it requires evaluation with the test dataset using all available models for model selection.) Note that our network selections are natural and reasonable; GoogLeNet-ImageNet and VGG19 are selected frequently, while GoogLeNet-Place is preferred to GoogLeNet-ImageNet in MIT Indoor and SUN-397 since these datasets are constructed for scene recognition. It turns out that the proposed algorithm tends to choose the networks with higher accuracies in the target task even though it makes selections based only on the evidence in a greedy manner. An interesting observation is that our result is less consistent with the selections by oracle and exhaustive evidence maximization on the Stanford 40 Actions dataset, where GoogLeNet-Place seems to provide complementary information even with its low accuracy and is helpful for improving recognition performance. This is probably because actions are frequently performed at typical places, e.g., a fair portion of the images in the brushing teeth class are taken in bathrooms.
6 Conclusion
We described a simple and efficient technique to transfer deep CNN models pre-trained on specific image classification tasks to other tasks. Our approach is based on Bayesian LS-SVM, which combines the Bayesian evidence framework and SVM with a least squares loss. In addition, we presented a faster fixed point update rule for evidence maximization through Aitken’s delta-squared process. Our fast Bayesian LS-SVM demonstrated competitive results compared to the standard SVM by selecting a deep CNN model in 12 popular visual recognition problems. We also achieved state-of-the-art performance by identifying a good ensemble of the candidate models through our Bayesian LS-SVM framework.
This work was partly supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) [B0101-16-0307; Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center), B0101-16-0552; Development of Predictive Visual Intelligence Technology (DeepView)], and National Research Foundation (NRF) of Korea [NRF-2013R1A2A2A01067464].
- [1] A. C. Aitken. On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.
- [2] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In CVPR Workshops, 2015.
- [3] A. Binder, W. Samek, M. Kloft, C. Müller, K.-R. Müller, and M. Kawanabe. The joint submission of the TU Berlin and Fraunhofer FIRST (TUBFI) to the ImageCLEF2011 photo annotation task. 2011.
- [4] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
- [5] L. D. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, 2011.
- [6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
- [7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- [8] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results, 2007.
- [9] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC 2012) Results, 2012.
- [10] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
- [11] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their attributes. In CVPR, 2009.
- [12] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVIU, 106(1):59–70, 2007.
- [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587. IEEE, 2014.
- [14] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
- [15] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
- [16] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
- [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
- [18] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
- [19] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
- [20] S. Nowak, K. Nagel, and J. Liebetrau. The CLEF 2011 photo annotation and concept-based retrieval tasks. In CLEF Workshop Notebook Paper, 2011.
- [21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
- [22] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.
- [23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
- [24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [26] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
- [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [28] T. Van Gestel, J. A. K. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle. Benchmarking least squares support vector machine classifiers. Machine Learning, 54(1):5–32, 2004.
- [29] T. Van Gestel, J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle. Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis. Neural Computation, 14(5):1115–1147, 2002.
- [30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
- [31] Z. Wu, Y. Zhang, F. Yu, and J. Xiao. A GPU implementation of GoogLeNet. Technical report, Princeton University, 2014.
- [32] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
- [33] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and L. Fei-Fei. Action recognition by learning bases of action attributes and parts. In ICCV, 2011.
- [34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
- [35] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
- [36] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, 2014.
- [37] P. Zhang and J. Peng. SVM vs regularized least squares classification. In ICPR, 2004.