Ensembles of feedforward-designed convolutional neural networks
An ensemble method that fuses the output decision vectors of multiple feedforward-designed convolutional neural networks (FF-CNNs) to solve the image classification problem is proposed in this work. To enhance the performance of the ensemble system, it is critical to increase the diversity of FF-CNN models. To achieve this objective, we introduce diversity by adopting three strategies: 1) different parameter settings in convolutional layers, 2) flexible feature subsets fed into the fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. Furthermore, we partition input samples into easy and hard ones based on their decision confidence scores. As a result, we can develop a new ensemble system tailored to hard samples to further boost classification accuracy. Experiments are conducted on the MNIST and CIFAR-10 datasets to demonstrate the effectiveness of the ensemble method.
Yueru Chen, Yijing Yang, Wei Wang and C.-C. Jay Kuo
University of Southern California, Los Angeles, California, USA
Index Terms— Ensemble, Image classification, Interpretable CNN, Dimension reduction
1 Introduction
We have seen rapid developments in the convolutional neural network (CNN) literature in the last six years [1, 2, 3, 4]. The CNN technology provides state-of-the-art solutions to many image processing and computer vision problems. Given a CNN architecture, all of its parameters are determined by the stochastic gradient descent (SGD) algorithm through backpropagation (BP). BP training demands a high computational cost. Furthermore, most CNN publications are application-oriented, and there has been limited theoretical progress since the classical result in [5]. Examples include explainable CNNs [6, 7, 8] and feedforward designs without backpropagation [9, 10, 11].
The determination of CNN model parameters in a one-pass feedforward (FF) manner was recently proposed by Kuo et al. in [11]. It derives the network parameters of a target layer from the statistics of the output data of its previous layer. No BP is used at all. This feedforward design provides valuable insights into the CNN operational mechanism. Besides, under the same network architecture, its training complexity is significantly lower than that of the BP-designed CNN. FF-designed and BP-designed CNNs are denoted by FF-CNNs and BP-CNNs, respectively.
The FF-CNN and the BP-CNN were applied to the MNIST [12] and CIFAR-10 [13] datasets for performance benchmarking in [11]. The BP-CNN outperforms the FF-CNN by a small margin in terms of classification accuracy. To improve the performance of the FF-CNN, we use multiple FF-CNNs as base classifiers in an ensemble system in this work and show that the ensemble idea offers a promising way to reach higher classification accuracy. Although the ensemble idea can be applied to both BP-CNNs and FF-CNNs, it is more suitable for FF-CNNs since FF-CNNs are weaker classifiers of extremely low complexity. We conduct an extensive performance study of the BP-CNN and the ensemble of FF-CNNs on the MNIST and CIFAR-10 datasets. Besides, we report results obtained by splitting easy and hard examples and treating them separately.
This work has several novel contributions. First, we make one modification to the FF-CNN design to achieve higher classification performance. That is, we apply the channel-wise PCA to the spatial outputs of the conv layers to remove spatial-dimension redundancy. This further reduces the dimension of the feature vectors and is elaborated in Sec. 3.1. Second, our major contribution is to develop various ensemble systems using multiple FF-CNNs as base classifiers. The idea is shown in Fig. 1. To boost the performance of the ensemble solutions, we introduce three sources of diversity: 1) flexible parameter settings in conv layers, 2) subsets of derived features, and 3) flexible image input forms. Third, we define a confidence score based on the final decision vector of the ensemble classifier and use it to separate easy examples from hard ones. Then, we can handle easy and hard examples separately.
2 Background
FF-CNN. An FF-CNN consists of two modules in cascade: 1) the module of convolutional (conv) layers and 2) the module of fully-connected (FC) layers. They are designed using completely different strategies.
The construction of conv layers is totally unsupervised since no labels are needed in the construction process. It is designed in [11] as subspace approximation via spectral decomposition followed by maximum pooling in the spatial domain. The subspace approximation is obtained using a new signal transform called the Saab (Subspace approximation with adjusted bias) transform. The Saab transform is a variant of the principal component analysis (PCA). It has a default constant-element bias vector used to annihilate nonlinear activation. Each conv layer contains one Saab transform followed by a maximum spatial pooling operation. The maximum pooling operation, which plays the “winner-takes-all” role, is a nonlinear one. The model parameters of a target layer are derived from the statistics of the output of the previous layer. The discriminant power of the features increases gradually from layer to layer due to a larger receptive field.
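The Saab description above can be illustrated with a minimal NumPy sketch. It assumes flattened patches as input, a constant unit-norm DC kernel, PCA on the DC-removed patches for the remaining (AC) kernels, and a shared bias equal to the largest patch norm so that every response is non-negative. The function names `fit_saab` and `saab_transform` and the specific bias rule are illustrative assumptions, not the reference implementation of [11].

```python
import numpy as np

def fit_saab(patches, num_ac_kernels):
    """Fit Saab-style kernels from flattened patches of shape (n, d).

    Hypothetical helper: returns (kernels, bias) with kernels of shape
    (1 + num_ac_kernels, d). Row 0 is the DC kernel.
    """
    n, d = patches.shape
    # DC kernel: constant-element unit vector.
    dc_kernel = np.ones((1, d)) / np.sqrt(d)
    # Remove the DC projection; what remains lies in the AC subspace.
    ac_part = patches - (patches @ dc_kernel.T) @ dc_kernel
    # AC kernels: leading principal directions of the centred AC part.
    centred = ac_part - ac_part.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    kernels = np.vstack([dc_kernel, vt[:num_ac_kernels]])
    # Assumed bias rule: a constant no smaller than the largest patch norm,
    # so that kernel responses never become negative.
    bias = np.linalg.norm(patches, axis=1).max()
    return kernels, bias

def saab_transform(patches, kernels, bias):
    """Project patches onto the kernels and add the constant bias."""
    return patches @ kernels.T + bias
```

Because every kernel has unit norm, each projection is bounded in magnitude by the patch norm, which is why the constant bias is enough to keep the responses non-negative on the training patches.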
The design of FC layers is cast as a multi-stage linear least-squares regression (LSR) problem in [11]. Suppose that the input and output dimensions of an FC layer are $N_{in}$ and $N_{out}$ ($N_{in} > N_{out}$), respectively. We can cluster training samples of dimension $N_{in}$ into $N_{out}$ clusters and map all samples in a cluster into the unit vector of a vector space of dimension $N_{out}$. Such a unit vector is also known as the one-hot vector. The index of the output space dimension defines a pseudo-label.
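A single LSR stage can be sketched as follows, assuming the pseudo-labels have already been produced by some clustering step (for example k-means, which is omitted here). The helper names `fit_lsr_layer` and `apply_lsr_layer` are hypothetical, and the appended column of ones is an assumed way to model the layer bias.

```python
import numpy as np

def fit_lsr_layer(features, pseudo_labels, num_outputs):
    """Solve a least-squares map from features (n, d_in) to one-hot pseudo-labels.

    pseudo_labels: integer array of shape (n,), values in [0, num_outputs).
    Returns weights of shape (d_in + 1, num_outputs); the last row is the bias.
    """
    n = features.shape[0]
    targets = np.zeros((n, num_outputs))
    targets[np.arange(n), pseudo_labels] = 1.0          # one-hot targets
    augmented = np.hstack([features, np.ones((n, 1))])  # append bias column
    weights, *_ = np.linalg.lstsq(augmented, targets, rcond=None)
    return weights

def apply_lsr_layer(features, weights):
    """Apply the fitted linear map (without any activation)."""
    augmented = np.hstack([features, np.ones((features.shape[0], 1))])
    return augmented @ weights
```

In a multi-stage design, the output of one such layer would be passed on as the input of the next, with a new set of pseudo-labels per stage and the true class labels at the output layer.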
Ensemble Methods. Ensembles are often used to combine multiple weak classifiers into a stronger one [14]. Examples include bagging [15], the random forest (an ensemble of decision trees) [16], stacked generalization [17], etc. Ensemble methods do not necessarily yield better classification performance than individual classifiers. Diversity is critical to the success of an ensemble system [18, 19].
3 Proposed Method
We use multiple FF-CNNs to serve as baseline classifiers to construct an ensemble system. Several novel ideas are proposed to make the ensemble system more effective.
3.1 Channel-wise PCA (C-PCA) for Spatial Dimension Reduction
Although the Saab transform can reduce redundancy in the spectral domain, there still exists correlation among the spatial dimensions of the same spectral component. This is illustrated in Fig. 2. We see that the correlation is stronger in low frequency components. Also, comparing the MNIST and CIFAR-10 datasets, we see that the correlation is stronger in the CIFAR-10 dataset. To further reduce the feature space dimension, we apply the PCA to the spatial dimensions of each filter output. This is called the channel-wise PCA (C-PCA).
We use indices $i$ and $j$ to denote the $i$th conv layer and the $j$th spectral component. The feature dimension after the Saab transform is $K_i \times W_i \times H_i$, where $K_i$, $W_i$ and $H_i$ are the spectral, width and height dimensions of the $i$th conv layer. Then, we apply C-PCA to features of the same filter index, $j$, and reduce the original dimension, $W_i \times H_i$, to a space of smaller dimension $N_i^j$, where $N_i^j \le W_i \times H_i$. Thus, the feature dimension of a certain conv layer is equal to $\sum_{j=1}^{K_i'} N_i^j$, where $K_i'$ is the selected filter number.
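As a concrete illustration of C-PCA, the sketch below fits one PCA per channel on the flattened spatial responses and keeps the same number of components for every channel. The array layout (samples, channels, height, width) and the helper names are assumptions made for this example.

```python
import numpy as np

def fit_channelwise_pca(features, num_components):
    """Fit one PCA per channel on features of shape (n, K, H, W).

    Returns per-channel means of shape (K, H*W) and bases of shape
    (K, num_components, H*W).
    """
    n, k, h, w = features.shape
    flat = features.reshape(n, k, h * w)
    means = flat.mean(axis=0)
    bases = np.empty((k, num_components, h * w))
    for j in range(k):
        # Principal directions of the spatial responses of channel j.
        _, _, vt = np.linalg.svd(flat[:, j] - means[j], full_matrices=False)
        bases[j] = vt[:num_components]
    return means, bases

def apply_channelwise_pca(features, means, bases):
    """Project each channel onto its own basis and concatenate the results."""
    n, k, h, w = features.shape
    flat = features.reshape(n, k, h * w) - means
    # reduced[n, k, c] = sum over d of flat[n, k, d] * bases[k, c, d]
    reduced = np.einsum('nkd,kcd->nkc', flat, bases)
    return reduced.reshape(n, -1)
```

For example, 16 channels of 5x5 responses reduced to 20 components per channel give a 320-dimensional feature vector instead of 400.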
3.2 Ensemble of FF-CNNs
To improve the performance of an FF-CNN baseline, we develop a simple yet effective ensemble method as illustrated in Fig. 1. Here, we consider ensembles of LeNet-like CNNs, which contain two convolutional layers, two FC layers and one output layer. We adopt multiple FF-CNNs as the first-stage base classifiers in an ensemble system and concatenate their output decision vectors, each of which has a dimension equal to the class number. Then, we apply PCA to reduce the dimension of the concatenated vector before feeding it into the second-stage ensemble classifier. The success of ensemble systems highly depends on the diversity of base classifiers. We propose three ways to increase the diversity of the baseline FF-CNNs as elaborated below.
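The fusion step described above (concatenate decision vectors, then reduce with PCA) can be sketched in a few lines. This is only an outline under the assumption that each base classifier returns an array of shape (samples, classes). The second-stage classifier itself is left out, and the function names are hypothetical.

```python
import numpy as np

def stack_decision_vectors(decision_vectors):
    """Concatenate per-classifier decision vectors, each of shape (n, num_classes)."""
    return np.concatenate(decision_vectors, axis=1)

def fit_pca(data, num_components):
    """Return the mean and the leading principal directions of the data."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:num_components]

def project(data, mean, basis):
    """Project data onto a previously fitted PCA basis."""
    return (data - mean) @ basis.T
```

The mean and basis would be fitted on the training decision vectors only and then reused unchanged on the test decision vectors before the second-stage classifier is applied.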
Scheme 1) Flexible parameter settings in conv layers. We choose different filter sizes in conv layers. The filter spatial dimension is the same for the two conv layers of the LeNet-5 (i.e., 5x5). We consider four combinations of spatial dimensions. They are: (5x5,5x5), (3x3,5x5), (5x5,3x3) and (3x3,3x3). Different filter sizes result in different receptive field sizes of the FF-CNN. In turn, they yield different features at the output of the conv layer.
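To see how the filter size changes the feature maps, the short calculation below lists the spatial size after each conv layer for the four combinations. It assumes a 32x32 input, stride-1 filtering without padding, and non-overlapping 2x2 max pooling with flooring. These settings are assumptions for illustration and the function name is hypothetical.

```python
def feature_map_sizes(input_size, filter_sizes, pool=2):
    """Spatial size after each (conv, pool) stage for square inputs and filters."""
    sizes = []
    size = input_size
    for f in filter_sizes:
        # Valid convolution shrinks the map by f - 1, pooling divides it by `pool`.
        size = (size - f + 1) // pool
        sizes.append(size)
    return sizes

for combo in [(5, 5), (3, 5), (5, 3), (3, 3)]:
    print(combo, feature_map_sizes(32, combo))
```

Under these assumptions the (5x5,5x5) setting yields a 5x5 map (25 values per channel) after the second conv layer, while the (3x3,3x3) setting yields a 6x6 map.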
Scheme 2) Subsets of derived features. For each FF-CNN, we have the following feature set for each sample:
$$F = \{F_{conv1}, F_{conv2}\},$$
where $F_{conv1}$ and $F_{conv2}$ represent the features obtained from the first and the second conv layers, respectively. We select a subset from $F$. There are many possible selection choices. We test the following three selection rules in the experiment.
1) For each channel in $F_{conv1}$, select $r_1 W_1 H_1$ features randomly, where $W_1 \times H_1$ are the spatial dimensions of the first conv layer and $0 < r_1 < 1$. Then, apply C-PCA to reduce the feature dimension. Finally, randomly select a fraction $r_2$ of the reduced features, where $0 < r_2 < 1$.
2) Apply C-PCA to $F_{conv2}$ and select a fraction $r_3$ of the resulting features randomly, where $0 < r_3 < 1$.
3) Conduct checkerboard partitioning of $F_{conv1}$ in the spatial dimension. Then, apply the C-PCA to each part and generate two feature subsets.
We generate one decision vector using each feature subset.
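The two selection primitives used by these rules, random subset selection and checkerboard partitioning, can be sketched as below. The array layout (samples, channels, height, width), the helper names and the use of a NumPy random generator are assumptions for illustration. The C-PCA step that the rules apply before or after selection is not repeated here.

```python
import numpy as np

def checkerboard_split(features):
    """Split features (n, K, H, W) into two spatially interleaved subsets.

    Returns two arrays of shape (n, K, n_even) and (n, K, n_odd), holding the
    positions where (row + column) is even or odd, respectively.
    """
    n, k, h, w = features.shape
    rows, cols = np.indices((h, w))
    even_mask = ((rows + cols) % 2 == 0).ravel()
    flat = features.reshape(n, k, h * w)
    return flat[:, :, even_mask], flat[:, :, ~even_mask]

def random_subset(features, ratio, rng):
    """Keep a random fraction `ratio` of the entries along the last axis."""
    d = features.shape[-1]
    keep = rng.choice(d, size=int(ratio * d), replace=False)
    return features[..., np.sort(keep)]
```

The same index set has to be reused for training and test samples, so in practice the selected indices would be stored rather than redrawn.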
Scheme 3) Flexible input image forms. We adopt different image input forms to increase diversity. For example, we use various color models to represent color images [20]. Here, we use the RGB, YCbCr and Lab color spaces as different input forms to an FF-CNN. We also apply the Laws filter bank of size 3x3 [21, 22] to input images to capture their different spectral characteristics. The final decision vector is obtained by combining FF-CNNs using different input representations.
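A 3x3 Laws filter bank can be generated from the three standard length-3 Laws vectors (level, edge, spot) by taking all nine outer products. The sketch below builds the bank and applies one filter to a gray-scale image. The valid-mode correlation without padding or normalization is an assumption, since the source does not state how borders are handled.

```python
import numpy as np

L3 = np.array([1, 2, 1])    # level (local average)
E3 = np.array([-1, 0, 1])   # edge
S3 = np.array([-1, 2, -1])  # spot

def laws_filter_bank():
    """Return the nine 3x3 Laws filters as an array of shape (9, 3, 3)."""
    vectors = [L3, E3, S3]
    return np.stack([np.outer(v, h) for v in vectors for h in vectors])

def filter_image(image, kernel):
    """Valid-mode 2-D correlation of a gray-scale image with a 3x3 kernel."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * image[dy:dy + h - 2, dx:dx + w - 2]
    return out
```

Each of the nine filtered images can then be fed to its own FF-CNN, giving nine base classifiers from a single gray-scale input.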
3.3 Separation of Easy and Hard Examples
It is desirable to separate hard examples from easy ones in the decision-making process. This is accomplished by computing the confidence score of each decision. It is determined by two factors: 1) the final decision vector of the ensemble classifier and 2) the prediction results of all base classifiers. Intuitively speaking, a decision is more confident if the maximum probability in the ensemble decision vector is larger or more base classifiers in an ensemble agree with each other. We define two confidence scores for an input image $x_i$, where $i$ is the image index, as
$$CS_1(x_i) = \max_c \, y_i(c), \qquad CS_2(x_i) = \frac{M_i}{M},$$
where $x_i$, $CS_1$ and $CS_2$ denote the input data and two confidence scores, respectively, $y_i$ is the decision vector of the ensemble, $M_i$ is the number of base classifiers producing the majority class label for input $x_i$ and $M$ is the total number of the base classifiers. We call an input image a hard sample if $CS_1(x_i) < T_1$ and $CS_2(x_i) < T_2$, where $T_1$ and $T_2$ are two threshold values. After the separation of easy and hard examples, a new FF-CNN ensemble targeting the set of hard samples can be trained to boost the classification performance on hard samples.
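The two confidence scores and the easy/hard split can be computed as in the following sketch. It assumes the ensemble outputs a probability-like decision vector per sample and that the base classifiers' predicted labels are available as an integer matrix. The function names and the threshold values in the usage note are placeholders, not the values used in the experiments.

```python
import numpy as np

def confidence_scores(ensemble_probs, base_predictions):
    """Compute the two confidence scores per sample.

    ensemble_probs: (n, C) decision vectors of the ensemble classifier.
    base_predictions: (n, M) integer class labels from the M base classifiers.
    Returns (max-probability score, majority-agreement ratio), each of shape (n,).
    """
    score_max = ensemble_probs.max(axis=1)
    num_classes = ensemble_probs.shape[1]
    # votes[n, c] = number of base classifiers predicting class c for sample n
    votes = np.stack(
        [(base_predictions == c).sum(axis=1) for c in range(num_classes)], axis=1
    )
    score_vote = votes.max(axis=1) / base_predictions.shape[1]
    return score_max, score_vote

def hard_sample_mask(score_max, score_vote, thr_max, thr_vote):
    """A sample is hard only when both scores fall below their thresholds."""
    return (score_max < thr_max) & (score_vote < thr_vote)
```

For instance, with placeholder thresholds of 0.5 and 0.75, a sample whose ensemble maximum is 0.4 and on which only two of four base classifiers agree would be routed to the hard set.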
4 Experiments
We conducted experiments on two popular datasets: MNIST [12] and CIFAR-10 [13]. The MNIST dataset contains gray-scale images of handwritten digits 0-9. The CIFAR-10 dataset has 10 classes of tiny images of size 32x32. We adopted the LeNet-5 architecture [1] for the MNIST dataset. Since CIFAR-10 is a color image dataset, we set the filter numbers of the first and the second conv layers and the first and the second FC layers to 32, 64, 200 and 100, respectively, by following [11].
We applied C-PCA to the output of the second conv layer and reduced its feature dimension per channel from 25 to 20 (for MNIST) or 12 (for CIFAR-10). We sometimes fed the responses from the first conv layer to the FC layers directly to increase feature diversity. In that case, we reduced the feature dimension per channel of the first conv layer to 30 (MNIST) and 20 (CIFAR-10), while the original dimension per channel is 14x14.
We adopted the Radial Basis Function (RBF) SVM classifier as the ensemble classifier in all experiments. We applied PCA to cascaded decision vectors of base classifiers before the SVM classifier training. The reduced feature dimension was determined by the correlation of decision vectors of base classifiers in an ensemble.
4.1 Performance of Ensemble Systems
To show the power of ensembles, we conducted experiments by taking diversity schemes discussed in Sec. 3.2 into account.
Scheme 1. We compare the performance of BP-CNN, four FF-CNNs and the ensemble of four FF-CNNs in Table 1. The four FF-CNNs differ in their filter sizes in the two conv layers: 1) (5x5,5x5), 2) (3x3,5x5), 3) (5x5,3x3), and 4) (3x3,3x3). For MNIST, their filter numbers are the same in all settings; namely, (6,16). For CIFAR-10, their filter numbers for RGB images are: 1) (32,64), 2) (24,64), 3) (32,64), and 4) (24,48). Their filter numbers for a single channel of color images are: 1) (16,32), 2) (8,32), 3) (16,32), and 4) (8,24). The classification accuracies of BP-CNN, four FF-CNNs and the ensemble are listed from columns 1 to 6. We see that the ensemble of four FF models provides a 4% improvement over the best single FF model. Different filter sizes directly affect the receptive field size of each conv layer and induce different statistics of the input data. In this way, we introduce diverse features into the ensemble system. While the performance gap between BP-CNN and the ensemble narrows for MNIST, the ensemble outperforms BP-CNN for CIFAR-10.
Scheme 2. We evaluate the FF-1 design with feature subset diversity and set the three random-selection ratios to 75%. We show the performance in Table 2, where the first to the fifth columns correspond to the following feature subsets: the entire set of second-conv-layer features (denoted by Conv2), two chosen by the third rule (denoted by Conv1-1 and Conv1-2), one by the first rule (denoted by Conv1-RD), and one by the second rule (denoted by Conv2-RD), where “RD” denotes reduced dimension. Then, we study the performance of four ensemble methods: 1) the ensemble of Conv2, Conv1-1 and Conv1-2 (ED-1); 2) the ensemble of six Conv1-RD results (ED-2); 3) the ensemble of twelve Conv2-RD results (ED-3); and 4) the ensemble of six Conv1-RD and twelve Conv2-RD results (ED-4). Compared with the performance of FF-1 for CIFAR-10, which is 63.7%, ensembles using the feature subset diversity boost its performance by a margin ranging from 2.3% to 5.6%. It is worthwhile to point out that the ED-1 ensemble combines only three classifiers (one trained on the second-conv-layer feature set and two trained on the first-conv-layer feature set). It yields 68.7% and 97.7% accuracy on CIFAR-10 and MNIST, respectively. This choice offers a simple and effective ensemble system. We will adopt ED-1 to build a larger ensemble system by adding other sources of diversity later.
Scheme 3. We conduct experiments by adopting different inputs to the FF-1 architecture in Table 3. We apply nine Laws filters of size 3x3 to gray-scale images and generate nine images that contain frequency components in different subbands. For color images in CIFAR-10, we represent the color information in three color spaces: RGB, YCbCr, and Lab, where we treat the three channels individually in the last two color spaces. We observe 1.1% and 5.9% performance improvements on MNIST and CIFAR-10, respectively, by combining various input representations. This demonstrates the effectiveness of utilizing various input representations as the diversity source in an ensemble.
We can fuse the three diversity types in one ensemble to boost the performance. The relation between test accuracy and ensemble complexity is shown in Fig. 3. In general, an ensemble of more classifiers gives better performance. So far, the best test accuracies achieved on MNIST and CIFAR-10 are 98.7% and 74.2%, respectively. As compared with the single BP-CNN reported in Table 1, the best ensemble result is 5.5% higher on CIFAR-10 but 0.4% lower on MNIST. We can push the performance higher by separating easy and hard examples based on the scheme described in Sec. 3.3.
4.2 Separation of Easy and Hard Examples
By following the discussion in Sec. 3.3, we set the two threshold values separately for the MNIST and the CIFAR-10 datasets. The results are reported in Table 4. For the set of hard examples, the new ensemble system trained on this set provides 5.6% and 2.6% improvements in test accuracy for MNIST and CIFAR-10, respectively. More hard samples are classified correctly in this setting. Overall, the ensemble method with easy/hard example separation achieves test accuracies of 99.3% and 76.2% on the entire MNIST and CIFAR-10 datasets, respectively. It outperforms the best results obtained earlier as shown in Table 1.
4.3 Discussion
To better understand the diversity among different FF-designed CNNs, we evaluate the correlation among the outputs of different classifiers using two diversity measures: Yule's Q-statistic and the entropy measure [23]. These measures are built on the correct/incorrect decisions of the base classifiers. A lower Q-statistic (or a higher entropy measure) indicates a higher degree of diversity among base classifiers. The average measures among different diversity sources are reported in Table 5. The best diversity measures are achieved by combining all base classifiers, leading to a large performance improvement. This is consistent with the classification accuracy assessment shown in Fig. 3.
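Both measures operate on correct/incorrect indicators, which the sketch below makes explicit. The Q-statistic follows the standard pairwise definition. For the entropy measure the code assumes the form given by Kuncheva and Whitaker [19], which may differ in detail from the variant used for Table 5. Function names are hypothetical.

```python
import numpy as np

def q_statistic(correct_a, correct_b):
    """Yule's Q-statistic for two classifiers from boolean correctness vectors.

    Undefined (division by zero) when both products in the denominator vanish.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = np.sum(a & b)    # both correct
    n00 = np.sum(~a & ~b)  # both wrong
    n10 = np.sum(a & ~b)   # only the first correct
    n01 = np.sum(~a & b)   # only the second correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def entropy_measure(correct):
    """Entropy-style diversity for a (n_samples, n_classifiers) boolean matrix.

    Assumed Kuncheva-Whitaker form: 0 means all classifiers always agree,
    1 means maximal disagreement on every sample.
    """
    correct = np.asarray(correct, dtype=bool)
    num_classifiers = correct.shape[1]
    num_right = correct.sum(axis=1)
    disagreement = np.minimum(num_right, num_classifiers - num_right)
    return disagreement.mean() / (num_classifiers - np.ceil(num_classifiers / 2))
```

An ensemble-level Q value would typically be obtained by averaging `q_statistic` over all pairs of base classifiers.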
5 Conclusion
We proposed an ensemble method built on multiple diverse FF-CNNs. We see a significant improvement in test accuracy for the MNIST and the CIFAR-10 datasets. As future extensions, we would like to apply the ensemble method to more challenging datasets with more object classes and/or larger image sizes, such as CIFAR-100 and ImageNet. Also, it will be interesting to develop a weakly supervised system based on the ensemble of multiple FF-CNNs.
[1] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, 1998, pp. 2278–2324.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
[5] George Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[6] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu, “Interpretable convolutional neural networks,” arXiv preprint arXiv:1710.00935, 2017.
[7] C.-C. Jay Kuo, “Understanding convolutional neural networks with a mathematical model,” Journal of Visual Communication and Image Representation, vol. 41, pp. 406–413, 2016.
[8] C.-C. Jay Kuo, “The CNN as a guided multilayer RECOS transform [lecture notes],” IEEE Signal Processing Magazine, vol. 34, no. 3, pp. 81–89, 2017.
[9] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C.-C. Jay Kuo, “A saak transform approach to efficient, scalable and robust handwritten digits recognition,” arXiv preprint arXiv:1710.10714, 2017.
[10] C.-C. Jay Kuo and Yueru Chen, “On data-driven Saak transform,” Journal of Visual Communication and Image Representation, vol. 50, pp. 237–246, 2018.
[11] C.-C. Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen, “Interpretable convolutional neural networks via feedforward design,” arXiv preprint arXiv:1810.02786, 2018.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
[14] Cha Zhang and Yunqian Ma, Ensemble machine learning: methods and applications, Springer, 2012.
[15] Leo Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
[16] Leo Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[17] David H Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2, pp. 241–259, 1992.
[18] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao, “Diversity creation methods: a survey and categorisation,” Information Fusion, vol. 6, no. 1, pp. 5–20, 2005.
[19] Ludmila I Kuncheva and Christopher J Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine learning, vol. 51, no. 2, pp. 181–207, 2003.
[20] Noor A Ibraheem, Mokhtar M Hasan, Rafiqul Z Khan, and Pramod K Mishra, “Understanding color models: a review,” ARPN Journal of science and technology, vol. 2, no. 3, pp. 265–275, 2012.
[21] Kenneth I Laws, “Rapid texture identification,” in Image processing for missile guidance. International Society for Optics and Photonics, 1980, vol. 238, pp. 376–382.
[22] William K Pratt, Digital image processing: PIKS Scientific inside, vol. 4, Wiley-interscience Hoboken, New Jersey, 2007.
[23] Padraig Cunningham and John Carney, “Diversity versus quality in classification ensembles based on feature selection,” in European Conference on Machine Learning. Springer, 2000, pp. 109–116.