A Feature Embedding Strategy for High-level CNN Representations from Multiple ConvNets
Abstract
With the rapid growth of digital image usage, automatic image categorization has become a prominent research area. The field has broadened and adopted many algorithms over time, among which multi-feature (generally hand-engineered) image characterization is a convenient way to improve accuracy. Recently, pretrained deep convolutional neural networks (DCNNs or ConvNets) have shown that the features they extract can improve classification accuracy. In this paper, we therefore investigate a feature embedding strategy that exploits cues from multiple DCNNs. We derive a generalized feature space by embedding the bottleneck features of three different DCNNs, weighted with respect to their softmax cross-entropy loss. Test results on six object classification datasets and one action classification dataset show that, regardless of variation in image statistics and tasks, the proposed multi-DCNN bottleneck feature fusion is well suited to image classification and is an effective complement to DCNNs. Comparisons with existing fusion-based image classification approaches show that the proposed method surpasses state-of-the-art methods and produces results competitive with fully trained DCNNs.
Index Terms— Transfer learning, CNN, Image classification
1 Introduction
Traditional classification models that use a single feature representation cannot cope well with intra-class variations and global variations such as color, lighting, and orientation of the images. It is therefore intuitive to fuse multiple features to improve classification accuracy, because multiple features can plausibly create a well-generalized feature space. Researchers in the computer vision community have also shown interest in multiple-feature fusion.
For example, Li et al. [1] used the Riemannian manifold to combine features from the covariance matrix of multiple features and concatenated multiple features to represent the object appearance. Meanwhile, Park [2] used the multi-partitioned feature-based classifier (MPFC) to fuse features such as hue-saturation-value (HSV), discrete cosine transform (DCT) coefficients, wavelet packet transform (WPT), and Hough transform (HT) with a decision-characteristic expertise table of local classifiers. Similarly, Kwon et al. [3] took advantage of multiple features for efficient object tracking: they decomposed the task into multiple constituents and combined multiple features through sparse principal component analysis (PCA) to select the most important features, by which the appearance variations were captured.
Researchers in [4], [5], [6], [7] also found different ways to merge multiple hand-engineered features to improve classification accuracy. Fernando et al. [4] merged hue histograms, color name (CN) descriptors, scale-invariant feature transform (SIFT), and Color-SIFT, while Gehler and Nowozin [5] improved classification accuracy by combining the basic SIFT feature with eight other features, such as histogram of oriented gradients (HOG), local binary pattern (LBP), and Color-SIFT, using multiple kernel learning (MKL) to combine 49 different kernel matrices. Khan et al. [6] employed multiple cues by processing shape and color cues individually and then combining them by modulating the SIFT shape features with category-specific color attention. They used a standard multi-scale grid detector with a Harris-Laplace point detector and a blob detector to create the feature description, then normalized all patches to a predefined size and computed descriptors for all regions. Dixit et al. [7] embedded features from a CNN with the semantic Fisher vector (SFV), where the SFV is encoded as parameters of a multinomial Gaussian mixture FV.
In the aforementioned literature, however, the fused features are mainly hand-engineered features, or such features combined with bottleneck features (the high-level feature representations of a ConvNet that are fed into the final classification layer) from a single CNN. Moreover, bottleneck features extracted through an off-the-shelf pretrained CNN significantly outperform a majority of the baseline state-of-the-art methods [8]. Thus, one may ponder the following questions: (i) if multiple CNN features are extracted from different networks, can such features be complementary? and, if so, (ii) what is an acceptable approach to fuse them so that classification accuracy improves? We address these questions by carrying out experiments on various datasets with three different pretrained CNNs as feature extractors, weights based on the cross-entropy loss function as the feature embedding scheme, and softmax as the classifier. The experimental results strengthen our idea of fusing multiple CNN features to improve image classification accuracy.
1.1 CNN as Feature Extractor
A DCNN pretrained on a large image dataset can be exploited as a generic feature extractor through transfer learning [9]. Generally, in transfer learning, the parameters (weights and biases) of the first layers of the source (pretrained) DCNN are transferred to the first layers of the target (new task) network and left without updates during training on the new dataset, while the remaining layers of the target network, known as adaptation layers, are randomly initialized and updated during training. If a fine-tuning strategy is adopted, backpropagation is carried out through the entire network (copied and randomly initialized layers alike) to calibrate the parameters of the copied layers so that the DCNN responds well to the new task.
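The difference between frozen (transferred) layers and adaptation layers can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the layer names, the `frozen` flag, the toy weights and gradients, and the `sgd_update` helper are all hypothetical and do not come from any particular framework or from the paper's implementation.

```python
def sgd_update(layers, grads, lr):
    """Apply one gradient step, skipping layers marked as frozen."""
    for layer, grad in zip(layers, grads):
        if layer["frozen"]:
            continue  # transferred layer: parameters stay exactly as copied
        layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]

# Hypothetical network: two transferred layers and one adaptation layer.
layers = [
    {"name": "conv1", "weights": [0.5, -0.2], "frozen": True},
    {"name": "conv2", "weights": [0.1, 0.3], "frozen": True},
    {"name": "adapt", "weights": [0.0, 0.0], "frozen": False},
]
# Toy gradients, one list per layer (assumed values for illustration).
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

sgd_update(layers, grads, lr=0.1)
for layer in layers:
    print(layer["name"], layer["weights"])
```

In this sketch, fine-tuning would correspond to setting `frozen` to `False` for every layer, so that the copied parameters are also updated.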
In this experiment, we take three pretrained networks, AlexNet, VGG16, and Inception-v3, and extract features from their respective penultimate layers. These networks have been trained on ImageNet (which contains more than 14 million images hand-labeled with the presence or absence of 21,000+ categories), where the final logits layer of each network has 1000 output neurons. That final layer is removed, and the rest of the DCNN is employed as a fixed feature extractor on the new datasets, where the number of classes per dataset may differ. The following paragraph highlights the properties of the DCNNs.
AlexNet [10] is the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with 37.5% and 17.0% top-1 and top-5 object classification error rates, respectively. It comprises 5 convolutional (Conv) layers, some of which are followed by max-pooling layers, and 3 fully-connected (FC) layers ending in a softmax classifier with 1000 output neurons, trained on 1.2 million images of the ImageNet-2010 dataset. The penultimate layer, referred to as FC7, has 4096 output channels. VGG16 [11] is the winner of the 2014 ILSVRC localization task with 25.3% error and the runner-up in the classification task, with 24.8% and 7.5% top-1 and top-5 error rates, respectively. It has 13 Conv layers, with a max-pooling layer after each set of two or more Conv layers, 2 FC layers, and a final softmax output layer, for 16 weight layers in total. The penultimate layer, FC2, has 4096 output channels. Inception-v3 [12] is an improved version of GoogLeNet, the winner of the 2014 ILSVRC classification task. It achieved 21.2% top-1 and 5.6% top-5 error rates on the ILSVRC 2012 classification challenge validation set. We extract the features of the target datasets from a max-pooling layer named pool_3:0 in the network, which has 2048 output channels.
The rest of this paper is organized as follows. Section 2 elaborates on the main ideas, namely feature extraction, feature embedding, and classification, via block diagrams and mathematical derivations. Section 3 details the experimental results through quantitative and qualitative analysis. Finally, Section 4 concludes the work with remarks on future directions.
2 System Overview
As described in Section 1.1, using the selected CNN models and their associated learned parameters, a forward pass (without backpropagation) is carried out on the images of the new datasets to extract bottleneck features. Depending on the size of the dataset, feature extraction may take several hours; however, it takes considerably less time than training or fine-tuning the CNN completely. For instance, on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz machine with 16.0GB RAM, it takes about 56 hours to extract the features of the CIFAR-10 dataset through Inception-v3.
2.1 Feature Embedding
Since we exploit three different CNNs for feature extraction, as shown in Figure 1, the system must be devised with an appropriate strategy for merging the extracted features to gain classification accuracy. The basic approach is to concatenate all the different features into a single vector per sample as $\mathbf{x} = [\mathbf{x}_{A}, \mathbf{x}_{V}, \mathbf{x}_{I}]$, where the subscripts denote AlexNet, VGG16, and Inception-v3, so the final feature space has a dimension of $4096 + 4096 + 2048 = 10240$. Although such straightforward concatenation often yields better classification accuracy than a single feature, it treats all features equally, so a weak feature may degrade the contribution of the good features. We circumvent this by introducing a weighted feature embedding layer, shown in Figure 2, in which we calculate the cross-entropy loss for each feature individually and update the assigned parameters using the softmax function and a gradient-descent-based optimizer to minimize that loss. This layer also compensates for differences between the source and target data in image statistics such as imaging conditions, viewpoints, and object types. The mathematical background of the technique is described next.
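The softmax function produces a categorical probability distribution when the input is a set of multi-class logits:
$$ y_c = \varsigma(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}}, \quad c = 1, \dots, C \tag{1} $$
where the input $\mathbf{z}$ is a $C$-dimensional vector and the output $\mathbf{y}$ is also a $C$-dimensional vector of real values in the range $(0, 1)$ that add up to 1, since the sum of exponentials in the denominator normalizes each exponential term. The cost function for the softmax function of the model can be written in terms of likelihood maximization with a given set of parameters $\theta$ as:
$$ \underset{\theta}{\arg\max}\; \mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z}) \tag{2} $$
where $\mathbf{t}$ is the one-hot target vector and the likelihood can be reduced to a conditional distribution of $\mathbf{t}$ and $\mathbf{z}$ for the same $\theta$ as:
$$ P(\mathbf{t}, \mathbf{z} \mid \theta) = P(\mathbf{t} \mid \mathbf{z}, \theta)\, P(\mathbf{z} \mid \theta) \tag{3} $$
Note that the probability that the class $t = c$ for a given input $\mathbf{z}$, with $c = 1, \dots, C$, can be written in matrix form as:
$$ \begin{bmatrix} P(t=1 \mid \mathbf{z}) \\ \vdots \\ P(t=C \mid \mathbf{z}) \end{bmatrix} = \begin{bmatrix} \varsigma(\mathbf{z})_1 \\ \vdots \\ \varsigma(\mathbf{z})_C \end{bmatrix} = \frac{1}{\sum_{d=1}^{C} e^{z_d}} \begin{bmatrix} e^{z_1} \\ \vdots \\ e^{z_C} \end{bmatrix} \tag{4} $$
where $P(t=c \mid \mathbf{z})$ is the probability that the class is $c$ given that the input is $\mathbf{z}$. Finally, maximizing the likelihood is equivalent to minimizing the negative log-likelihood:
$$ -\log \mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z}) = \xi(\mathbf{t}, \mathbf{z}) = -\log \prod_{c=1}^{C} y_c^{t_c} = -\sum_{c=1}^{C} t_c \log(y_c) \tag{5} $$
where $\xi$ denotes the cross-entropy error function and $t_c$ is the $c$-th entry of $\mathbf{t}$. Then, the derivative of the cost function with respect to the softmax input $\mathbf{z}$ can be used to update the weights $\mathbf{W}$ as:
$$ \mathbf{W}(k+1) = \mathbf{W}(k) - \lambda \frac{\partial \xi}{\partial \mathbf{W}(k)} \tag{6} $$
where $k$ indexes the update step and the learning rate $\lambda$ controls how quickly the cost changes the weights. In the same way, the biases can also be updated, with the goal of bringing the error function to a local minimum. In this work, we use backpropagation (backprop) with a gradient-descent optimization algorithm to update the weights and biases. Gradient descent is the workhorse of learning in neural networks these days. A detailed description of backprop can be found in [13]. Thus, we obtain the dimension-reduced logits $\hat{\mathbf{z}}_{A}$, $\hat{\mathbf{z}}_{V}$, and $\hat{\mathbf{z}}_{I}$ of the AlexNet, VGG16, and Inception-v3 bottleneck features, respectively, as shown in Figure 2.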
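To make the update rule concrete, the following minimal Python sketch performs gradient-descent steps on a single linear softmax layer with a cross-entropy loss. It is an illustration under assumptions, not the authors' implementation: the function names (`softmax`, `cross_entropy`, `sgd_step`), the toy 4-dimensional feature, the zero initialization, and the learning rate of 0.1 are all hypothetical. It relies on the standard result that the gradient of the cross-entropy with respect to the softmax input is the predicted probability vector minus the one-hot target.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(t, y):
    """Cross-entropy between a one-hot target t and probabilities y."""
    return -sum(tc * math.log(yc) for tc, yc in zip(t, y) if tc > 0)

def sgd_step(W, b, x, t, lr):
    """One gradient-descent update of a linear softmax layer z = W x + b."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bc for row, bc in zip(W, b)]
    y = softmax(z)
    loss = cross_entropy(t, y)
    # Gradient of the loss w.r.t. the softmax input: y - t.
    delta = [yc - tc for yc, tc in zip(y, t)]
    # Chain rule: dLoss/dW[c][i] = delta[c] * x[i], dLoss/db[c] = delta[c].
    new_W = [[w - lr * d * xi for w, xi in zip(row, x)] for row, d in zip(W, delta)]
    new_b = [bc - lr * d for bc, d in zip(b, delta)]
    return new_W, new_b, loss

# Toy setup (assumed values): 3 classes, 4-dimensional feature, target class 1.
W = [[0.0] * 4 for _ in range(3)]
b = [0.0] * 3
x = [0.5, -1.0, 2.0, 0.1]
t = [0.0, 1.0, 0.0]

losses = []
for _ in range(20):
    W, b, loss = sgd_step(W, b, x, t, lr=0.1)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

With all-zero initial weights the first prediction is uniform over the three classes, so the first recorded loss equals log 3, and the loss then decreases as the layer fits the single training sample.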
Subsequently, the estimated logits are combined by a product and fed into the final classification layer.
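The two fusion routes discussed in this section, plain concatenation of bottleneck features and a product of per-network logits, can be contrasted in a short Python sketch. The text does not spell out the exact form of the product, so the element-wise product used here is an assumption, as are the helper names (`concatenate`, `fuse_by_product`) and the 3-class logit values, which are invented purely for illustration.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def concatenate(*vectors):
    """Baseline fusion: join the feature vectors end to end."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused

def fuse_by_product(*logit_vectors):
    """Assumed reading of 'product': element-wise product of the logits."""
    n = len(logit_vectors[0])
    if any(len(v) != n for v in logit_vectors):
        raise ValueError("all logit vectors must have the same number of classes")
    fused = [1.0] * n
    for v in logit_vectors:
        fused = [f * x for f, x in zip(fused, v)]
    return fused

# Baseline: dummy bottleneck features with the dimensions given in Section 1.1.
x_alex, x_vgg, x_incep = [0.0] * 4096, [0.0] * 4096, [0.0] * 2048
concat_dim = len(concatenate(x_alex, x_vgg, x_incep))
print("concatenated dimension:", concat_dim)

# Product fusion: hypothetical 3-class logits from the three embedding branches.
z_alex = [1.0, 2.0, 0.5]
z_vgg = [0.8, 2.5, 0.4]
z_incep = [1.2, 3.0, 0.3]
fused = fuse_by_product(z_alex, z_vgg, z_incep)
probs = softmax(fused)
predicted = probs.index(max(probs))
print("predicted class index:", predicted)
```

Note the difference in size: the concatenated vector grows with every added network, whereas the fused logits keep one entry per class regardless of how many networks contribute.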
3 Experimental Results
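Table 1: Classification accuracy (%) of the proposed feature embedding, single-CNN bottleneck features, and other methods.

| Type | Dataset | Proposed | AlexNet | VGG16 | Inception-v3 | Other methods |
|---|---|---|---|---|---|---|
| Object classification | CIFAR-10 | 92.00 | 81.60 | 85.35 | 89.57 | 91.87 [14], 85.02 [15], 74.5 [16] |
| Object classification | CIFAR-100 | 74.60 | 56.30 | 67.26 | 69.86 | 72.60 [17], 66.64 [14] |
| Object classification | Caltech-101 | 95.65 | 90.15 | 91.31 | 93.57 | 83.60 [2], 82.10 [5], 76.1 [6] |
| Object classification | Caltech-256 | 87.30 | 69.22 | 79.30 | 83.75 | 60.97 [7], 50.80 [5] |
| Object classification | MIT-67 | 77.38 | 53.88 | 66.41 | 76.04 | 70.72 [18], 65.10 [7] |
| Object classification | Sun397 | 55.22 | 45.18 | 47.87 | 49.41 | 54.30 [18], 38.00 [19] |
| Action classification | Pascal VOC 2012 | 82.50 | 63.39 | 71.13 | 79.98 | 70.20 [9], 69.60 (OXFORD) [20] |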
Experiments were carried out on six object classification datasets, CIFAR-10, CIFAR-100 [21], MIT-67 [22], Caltech-101, Caltech-256 (http://www.vision.caltech.edu/Image_Datasets/Caltech101/), and Sun397 (http://groups.csail.mit.edu/vision/SUN/), and on one action classification dataset, Pascal VOC 2012 [20]. Three sample images from each dataset are shown in Figure 3, while Table 2 summarizes all the datasets. In Pascal VOC 2012, since the action boundaries are given, we extracted the image regions within the boundaries, zero-padded them to make them spatially square, and resized them to meet the input requirements of the employed CNN architectures. For the other datasets, whole images were taken and only resized to meet the networks’ input layer requirements.
The results of the proposed bottleneck feature embedding are compared with existing algorithms in Table 1. The table also lists the performance of each single-CNN bottleneck feature without any feature fusion for quantitative analysis, while Figure 4 shows an overall performance comparison, as a boxplot, of the fused feature against the best results of the other methods chosen from Table 1. These comparisons show that the proposed feature embedding improves classification accuracy by 1% to 2% in most cases without any data augmentation.
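The zero-padding step used for the Pascal VOC 2012 crops can be illustrated with a small pure-Python sketch. It is written under assumptions: the paper does not say where the padding is placed, so centering the crop is a guess, and the `zero_pad_to_square` helper and the 2x4 grayscale crop are hypothetical. Resizing to the network input size would follow as a separate step and is not shown.

```python
def zero_pad_to_square(image):
    """Pad a 2-D image (list of rows) with zeros so that height == width.

    The original content is centered in the padded canvas (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)
    top = (side - height) // 2    # rows of zeros above the content
    left = (side - width) // 2    # columns of zeros left of the content
    padded = [[0] * side for _ in range(side)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[top + r][left + c] = value
    return padded

# Hypothetical 2x4 grayscale crop taken from inside an action bounding box.
crop = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
square = zero_pad_to_square(crop)
for row in square:
    print(row)
```

Padding before resizing keeps the aspect ratio of the cropped region, which a direct resize to a square input would distort.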
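As a small worked example of reading Table 1, the following Python sketch computes, for each dataset, the margin of the proposed fusion over the best single-CNN bottleneck feature. The accuracy values are copied from Table 1; the dictionary layout and the `gain_over_best_single` helper are hypothetical names introduced only for this illustration.

```python
# Accuracies (%) copied from Table 1: (proposed, AlexNet, VGG16, Inception-v3).
TABLE_1 = {
    "CIFAR-10": (92.00, 81.60, 85.35, 89.57),
    "CIFAR-100": (74.60, 56.30, 67.26, 69.86),
    "Caltech-101": (95.65, 90.15, 91.31, 93.57),
    "Caltech-256": (87.30, 69.22, 79.30, 83.75),
    "MIT-67": (77.38, 53.88, 66.41, 76.04),
    "Sun397": (55.22, 45.18, 47.87, 49.41),
    "Pascal VOC 2012": (82.50, 63.39, 71.13, 79.98),
}

def gain_over_best_single(row):
    """Margin (percentage points) of the fused feature over the best single CNN."""
    proposed, *singles = row
    return round(proposed - max(singles), 2)

gains = {name: gain_over_best_single(row) for name, row in TABLE_1.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The same pattern can be applied to the "Other methods" column to obtain the margin over the best previously reported result per dataset.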
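Table 2: Summary of the datasets.

| Dataset | No. of classes | Training samples | Test samples | Ref. |
|---|---|---|---|---|
| CIFAR-10 | 10 | 50,000 | 10,000 | [21] |
| CIFAR-100 | 100 | 50,000 | 10,000 | [21] |
| Caltech-101 | 101 | 6,076 | 2,601 | [23] |
| Caltech-256 | 256 | 21,363 | 9,146 | [24] |
| MIT-67 | 67 | 5,360 | 1,340 | [22] |
| Sun397 | 397 | 59,550 | 10,919 | [19] |
| Pascal VOC | 10 | 4,588 | 4,569 | [20] |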
Note that in Table 1, [14] uses data augmentation plus a latent model ensemble with a single CNN feature; [15], [16], and [17] do not use any feature fusion; [2], [5], [6], [7], and [19] use feature fusion of multiple hand-crafted features or of hand-crafted feature(s) with a single CNN feature; [18] uses CNN features extracted through a pretrained AlexNet on Places205/365; and [9] similarly uses CNN features extracted with an AlexNet pretrained on 1512 classes of ImageNet (in our case, the AlexNet used is pretrained on 1000 classes of ImageNet).
4 Conclusion
An approach to fusing the bottleneck features of multiple CNNs through weighted cross-entropy is presented, in which a set of three different pretrained CNNs is exploited as feature extractors. The test results on various datasets show that it outperforms state-of-the-art hand-crafted feature fusion methods and produces results very competitive with fully trained (dataset-specific) DCNNs. This accords with our hypothesis that features from multiple CNNs can be complementary to each other and that their fusion can be a generalized, appearance-invariant representation of images.
Although the proposed feature embedding enhances classification accuracy, how best to fuse multiple features is still an open problem. In this work, our goal was to analyze whether accuracy improves when multiple CNN bottleneck features are fused as proposed. As future work, metric learning approaches can be exploited to capture facets of the CNN features that help to differentiate between classes. This work can also be extended to dynamic texture and video activity detection and classification.
References
 [1] X. Li, W. Hu, Z. Zhang, and X. Zhang, “Robust visual tracking based on an effective appearance model,” Computer Vision  ECCV 2008: 10th European Conference on Computer Vision, pp. 396–408, 2008.
 [2] D.C. Park, “Multiple feature-based classifier and its application to image classification,” IEEE International Conference on Data Mining Workshops, pp. 65–71.
 [3] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in CVPR, pp. 1269–1276, 2010.
 [4] T.B. Fernando, E. Fromont, D. Muselet, and M. Sebban, “Discriminative feature fusion for image classification,” International Conference on Pattern Recognition (ICPR), pp. 3434–3441, 2012.
 [5] P.V. Gehler and S. Nowozin, “On feature combination for multi-class object classification,” in ICCV, 2009.
 [6] F.S. Khan, J. van de Weijer, and M. Vanrell, “Modulating shape features by color attention for object recognition,” International Journal of Computer Vision (IJCV), vol. 98, pp. 49–64, 2012.
 [7] M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos, “Scene classification with semantic fisher vectors,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2974–2983, June 2015.
 [8] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPR Workshops, June 2014.
 [9] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, pp. 1717–1724, IEEE Computer Society, 2014.
 [10] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations, vol. abs/1409.1556, 2014.
 [12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
 [13] M. A. Nielsen, “Improving the way neural networks learn,” in Neural Networks and Deep Learning, Determination Press, 2015.
 [14] M. Sun, T.X. Han, L. M.C. Xu, X., and K. Ahmad KhodayariRostamabad, “Latent model ensemble with auto-localization,” in Proceedings of the 23rd International Conference on Pattern Recognition (ICPR16), 2016.
 [15] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” Advances in Neural Information Processing Systems, vol. 25, pp. 2951–2959, 2012.
 [16] K. Yu and T. Zhang, “Improved local coordinate coding using local tangents,” ICML, 2010.
 [17] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, “Scalable Bayesian optimization using deep neural networks,” in JMLR Workshop and Conference Proceedings, pp. 2171–2180, 2015.
 [18] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in Neural Information Processing Systems, vol. 27, pp. 487–495, Curran Associates, Inc., 2014.
 [19] J. Xiao, K. Hays, A. Ehinger, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, pp. 3485–3492.
 [20] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International Journal of Computer Vision (IJCV), vol. 111, no. 1, pp. 98–136, 2015.
 [21] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
 [22] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR, 2009.
 [23] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories,” in CVPR, 2004.
 [24] G. Griffin, A. Holub, and P. Perona, “The Caltech-256: Caltech technical report,” vol. 7694, 2007.