Ensembles of Feedforward-Designed Convolutional Neural Networks
Abstract
An ensemble method that fuses the output decision vectors of multiple feedforward-designed convolutional neural networks (FF-CNNs) to solve the image classification problem is proposed in this work. To enhance the performance of the ensemble system, it is critical to increase the diversity of FF-CNN models. To achieve this objective, we introduce diversity by adopting three strategies: 1) different parameter settings in convolutional layers, 2) flexible feature subsets fed into the fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. Furthermore, we partition input samples into easy and hard ones based on their decision confidence scores. As a result, we can develop a new ensemble system tailored to hard samples to further boost classification accuracy. Experiments are conducted on the MNIST and CIFAR-10 datasets to demonstrate the effectiveness of the ensemble method.
Yueru Chen, Yijing Yang, Wei Wang and C.-C. Jay Kuo
University of Southern California, Los Angeles, California, USA 
Index Terms— Ensemble, Image classification, Interpretable CNN, Dimension reduction
1 Introduction
We have seen rapid developments in the convolutional neural network (CNN) literature in the last six years [1, 2, 3, 4]. The CNN technology provides state-of-the-art solutions to many image processing and computer vision problems. Given a CNN architecture, all of its parameters are determined by the stochastic gradient descent (SGD) algorithm through backpropagation (BP). The BP training demands a high computational cost. Furthermore, most CNN publications are application-oriented, and there has been limited theoretical progress after the classical result in [5]. Recent exceptions include explainable CNNs [6, 7, 8] and feedforward designs without backpropagation [9, 10, 11].
The determination of CNN model parameters in a one-pass feedforward (FF) manner was recently proposed by Kuo et al. in [11]. It derives the network parameters of a target layer from the statistics of the output data of its previous layer; no BP is used at all. This feedforward design provides valuable insights into the CNN operational mechanism. Besides, under the same network architecture, its training complexity is significantly lower than that of the BP-designed CNN. FF-designed and BP-designed CNNs are denoted by FF-CNNs and BP-CNNs, respectively.
The FF-CNN and the BP-CNN were applied to the MNIST [12] and CIFAR-10 [13] datasets for performance benchmarking in [11]. The BP-CNN outperforms the FF-CNN by a small margin in terms of classification accuracy. To improve the performance of the FF-CNN, we use multiple FF-CNNs as base classifiers in an ensemble system and show that the ensemble idea offers a promising way to reach higher classification accuracy. Although the ensemble idea can be applied to both BP-CNNs and FF-CNNs, it is more suitable for FF-CNNs since FF-CNNs are weaker classifiers of extremely low complexity. We conduct an extensive performance study of the BP-CNN and the ensemble of FF-CNNs on the MNIST and CIFAR-10 datasets. Besides, we report results obtained by splitting easy and hard examples and treating them separately.
This work has several novel contributions. First, we make one modification to the FF-CNN design to achieve higher classification performance: we apply the channel-wise PCA to the spatial outputs of the conv layers to remove spatial-dimension redundancy. This further reduces the dimension of the feature vectors, as elaborated in Sec. 3.1. Second, our major contribution is to develop various ensemble systems using multiple FF-CNNs as base classifiers. The idea is shown in Fig. 1. To boost the performance of the ensemble solutions, we introduce three diversities: 1) flexible parameter settings in conv layers, 2) subsets of derived features, and 3) flexible image input forms. Third, we define a confidence score based on the final decision vector of the ensemble classifier and use it to separate easy examples from hard ones. Then, we can handle easy and hard examples separately.
2 Background
FF-CNN. An FF-CNN consists of two modules in cascade: 1) the module of convolutional (conv) layers and 2) the module of fully-connected (FC) layers. They are designed using completely different strategies.
The construction of the conv layers is totally unsupervised since no labels are needed in the construction process. It is designed in [11] as subspace approximation via spectral decomposition followed by maximum pooling in the spatial domain. The subspace approximation is obtained using a new signal transform called the Saab (Subspace approximation with adjusted bias) transform. The Saab transform is a variant of the principal component analysis (PCA). It has a default constant-element bias vector used to annihilate nonlinear activation. Each conv layer contains one Saab transform followed by a maximum spatial pooling operation. The maximum pooling operation, which plays the "winner-takes-all" role, is a nonlinear one. The model parameters of a target layer are derived from the statistics of the output of the previous layer. The feature discriminant power increases gradually due to a growing receptive field.
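The one-stage Saab construction described above can be sketched in a few lines of numpy. This is a simplified illustration, not the full design of [11]: it derives one DC (mean) kernel and PCA-based AC kernels from flattened patches, then adds a bias large enough to keep every response non-negative, so that a subsequent ReLU would act as the identity. The function name and interface are our own.

```python
import numpy as np

def saab_transform(patches, num_kernels):
    """Simplified single-stage Saab transform sketch.

    patches: (num_patches, patch_dim) flattened local patches.
    Returns (num_patches, num_kernels) non-negative responses.
    """
    n, d = patches.shape
    # DC kernel: the normalized all-ones vector.
    dc_kernel = np.ones(d) / np.sqrt(d)
    dc_resp = patches @ dc_kernel
    # AC part: PCA on the residual after removing the DC component.
    residual = patches - np.outer(dc_resp, dc_kernel)
    residual -= residual.mean(axis=0)  # center before PCA
    cov = residual.T @ residual / n
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort by descending energy
    ac_kernels = eigvecs[:, order[:num_kernels - 1]].T
    ac_resp = residual @ ac_kernels.T
    resp = np.concatenate([dc_resp[:, None], ac_resp], axis=1)
    # Constant bias chosen to shift every response into the non-negative range,
    # annihilating the effect of a following ReLU nonlinearity.
    bias = max(0.0, -resp.min())
    return resp + bias
```

In the full design, one such transform is derived per conv layer from the statistics of the previous layer's output, with maximum spatial pooling applied afterwards.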
The design of the FC layers is cast as a multi-stage linear least-squares regression (LSR) problem in [11]. Suppose that the input and output dimensions of a stage are n and c (c < n), respectively. We can cluster the training samples of dimension n into c clusters and map all samples in a cluster into the unit vector of a vector space of dimension c. Such a unit vector is also known as a one-hot vector. The index of the output space dimension defines a pseudo-label.
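One such LSR stage can be sketched as follows. This is a minimal illustration assuming one cluster per class (the full design in [11] splits each class into several clusters to form more pseudo-labels); the function names and interfaces are ours.

```python
import numpy as np

def lsr_fc_stage(x, labels, num_classes):
    """Solve one FF-designed FC stage as linear least-squares regression.

    Each sample is mapped to the one-hot vector of its (pseudo-)label,
    and the stage weight matrix is the least-squares solution.
    x: (num_samples, dim); labels: integer array of length num_samples.
    Returns weights of shape (dim + 1, num_classes); the extra row
    absorbs the bias term.
    """
    n, d = x.shape
    x_aug = np.hstack([x, np.ones((n, 1))])  # append constant for bias
    y = np.zeros((n, num_classes))
    y[np.arange(n), labels] = 1.0            # one-hot targets
    # w = argmin ||x_aug @ w - y||^2
    w, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return w

def apply_fc_stage(x, w):
    """Apply a trained stage to new samples."""
    x_aug = np.hstack([x, np.ones((len(x), 1))])
    return x_aug @ w
```

Cascading several such stages, each regressing toward its own pseudo-labels, yields the FC module without any backpropagation.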
Ensemble Methods. Ensembles are often used to integrate multiple weak classifiers into a stronger one [14]. Examples include bagging [15], the random forest (an ensemble of decision trees) [16], stacked generalization [17], etc. Ensemble methods do not necessarily yield better classification performance than the best individual classifier. Diversity is critical to the success of an ensemble system [18, 19].
3 Proposed Method
We use multiple FF-CNNs as base classifiers to construct an ensemble system. Several novel ideas are proposed to make the ensemble system more effective.
3.1 Channel-wise PCA (C-PCA) for Spatial Dimension Reduction
Although the Saab transform can reduce redundancy in the spectral domain, correlation still exists among the spatial dimensions of the same spectral component. This is illustrated in Fig. 2. We see that the correlation is stronger in low-frequency components. Also, by comparing the MNIST and CIFAR-10 datasets, the correlation is stronger in the CIFAR-10 dataset. To further reduce the feature space dimension, we apply the PCA to the spatial dimensions of each filter. This is called the channel-wise PCA (C-PCA).
We use indices i and k to denote the i-th conv layer and the k-th spectral component. The feature dimension after the Saab transform is K_i x W_i x H_i, where K_i, W_i and H_i are the spectral, width and height dimensions of the i-th conv layer, respectively. Then, we apply C-PCA to features of the same filter index k and reduce the original dimension, W_i x H_i, to a space of smaller dimension D_i, where D_i < W_i x H_i. Thus, the feature dimension of a certain conv layer is equal to K'_i x D_i, where K'_i is the selected filter number.
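The C-PCA step can be sketched with a per-channel PCA in numpy. This is a minimal illustration of the idea, assuming the conv output is stored as a (samples, channels, height, width) array; the function name is ours.

```python
import numpy as np

def channelwise_pca(conv_out, reduced_dim):
    """Channel-wise PCA sketch: for each spectral channel, flatten the
    H x W spatial responses and project them onto their top principal
    components, reducing H*W spatial values per channel to reduced_dim.

    conv_out: (num_samples, num_channels, height, width).
    Returns (num_samples, num_channels * reduced_dim).
    """
    n, k, h, w = conv_out.shape
    out = np.empty((n, k, reduced_dim))
    for ch in range(k):
        flat = conv_out[:, ch].reshape(n, h * w)
        centered = flat - flat.mean(axis=0)
        # PCA via SVD of the centered data matrix; rows of vt are the
        # principal directions in descending order of variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        out[:, ch] = centered @ vt[:reduced_dim].T
    return out.reshape(n, k * reduced_dim)
```

In a real pipeline the principal directions would be estimated on the training set and reused at test time; here they are fit on the given batch for brevity.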
3.2 Diversity
To improve the performance of an FF-CNN baseline, we develop a simple yet effective ensemble method as illustrated in Fig. 1. Here, we consider ensembles of LeNet-like CNNs, which contain two conv layers, two FC layers and one output layer. We adopt multiple FF-CNNs as the first-stage base classifiers in an ensemble system and concatenate their output decision vectors, whose dimension is the same as the class number. Then, we apply PCA to reduce the feature dimension before feeding them into the second-stage ensemble classifier. The success of ensemble systems highly depends on the diversity of the base classifiers. We propose three ways to increase the diversity of the baseline FF-CNNs, as elaborated below.
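The decision-level fusion step described above can be sketched as follows: concatenate the per-classifier decision vectors and reduce the concatenated dimension by PCA before the second-stage classifier. This is a schematic numpy version with names of our own choosing.

```python
import numpy as np

def fuse_decisions(decision_vectors, reduced_dim):
    """Concatenate decision vectors of the base classifiers and reduce
    their dimension with PCA before the second-stage classifier.

    decision_vectors: list of arrays, each (num_samples, num_classes),
    one per base classifier.
    Returns (num_samples, reduced_dim) fused features.
    """
    stacked = np.concatenate(decision_vectors, axis=1)
    centered = stacked - stacked.mean(axis=0)
    # PCA via SVD; keep the leading reduced_dim components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:reduced_dim].T
```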
Scheme 1) Flexible parameter settings in conv layers. We choose different filter sizes in the conv layers. The filter spatial dimension is the same for the two conv layers of the LeNet-5 (i.e., 5x5). We consider four combinations of spatial dimensions for the first and the second conv layers: (5x5, 5x5), (3x3, 5x5), (5x5, 3x3) and (3x3, 3x3). Different filter sizes result in different receptive field sizes of the FF-CNN. In turn, they yield different features at the output of the conv layers.
Scheme 2) Subsets of derived features. For each FF-CNN, we have the following feature set for each sample:

S = {S_1, S_2},

where S_1 and S_2 represent the features obtained from the first and the second conv layers, respectively. We select a subset from S. There are many possible selection choices. We test the following three selection rules in the experiment.

1) For each channel in S_1, select α * W_1 * H_1 features randomly, where W_1 and H_1 are the spatial dimensions of the first conv layer and 0 < α < 1. Then, apply C-PCA to reduce the feature dimension to D_1. Finally, randomly select β * K_1 * D_1 features from the K_1 * D_1 features, where 0 < β < 1.

2) Apply C-PCA to S_2 to generate K_2 * D_2 features and select γ * K_2 * D_2 features randomly, where 0 < γ < 1.

3) Conduct checkerboard partitioning of S_1 in the spatial dimension. Then, apply the C-PCA to each part and generate two feature subsets.

We generate one decision vector using each feature subset.
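The selection rules above can be sketched in numpy. The snippet below illustrates the checkerboard partition of the third rule and the random column selection used in the first two rules; the layout assumptions (a (samples, channels, height, width) array) and function names are ours.

```python
import numpy as np

def checkerboard_split(conv_out):
    """Split the spatial positions of a conv layer's output into the two
    parts of a checkerboard pattern, giving two feature subsets.

    conv_out: (num_samples, num_channels, height, width).
    Returns two arrays, each (num_samples, num_channels * ~H*W/2).
    """
    n, k, h, w = conv_out.shape
    rows, cols = np.indices((h, w))
    mask = (rows + cols) % 2 == 0  # True on the "black" squares
    flat = conv_out.reshape(n, k, h * w)
    part_a = flat[:, :, mask.ravel()].reshape(n, -1)
    part_b = flat[:, :, ~mask.ravel()].reshape(n, -1)
    return part_a, part_b

def random_subset(features, keep_ratio, rng):
    """Randomly keep a keep_ratio fraction of the feature columns,
    as in the random selection steps of rules 1 and 2."""
    d = features.shape[1]
    idx = rng.choice(d, size=int(keep_ratio * d), replace=False)
    return features[:, idx]
```

Each resulting subset is fed into its own FC module, producing one decision vector per subset.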
Scheme 3) Flexible input image forms. We adopt different image input forms to increase diversity. For example, we use various color models to represent color images [20]. Here, we use the RGB, YCbCr and Lab color spaces as different input forms to an FF-CNN. We also apply the Laws filter bank of size 3x3 [21] to input images to capture their different spectral characteristics. The final decision vector is obtained by combining FF-CNNs using different input representations.
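The nine 3x3 Laws filters are formed as outer products of the classical 1-D kernels L3 (level), E3 (edge) and S3 (spot). A small sketch of the filter bank:

```python
import numpy as np

def laws_filter_bank():
    """Return the nine 3x3 Laws filters as outer products of the 1-D
    kernels L3 (level), E3 (edge) and S3 (spot)."""
    l3 = np.array([1.0, 2.0, 1.0])   # local average (level)
    e3 = np.array([-1.0, 0.0, 1.0])  # edge detector
    s3 = np.array([-1.0, 2.0, -1.0]) # spot detector
    kernels_1d = [l3, e3, s3]
    # All 3 x 3 = 9 pairwise outer products give the 2-D masks L1..L9.
    return [np.outer(a, b) for a in kernels_1d for b in kernels_1d]
```

Convolving an input image with each of the nine masks yields the nine filtered images (L1-L9 in Table 3), each emphasizing a different spectral subband.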
3.3 Separation of Easy and Hard Examples
It is desired to separate hard examples from easy ones in the decision-making process. This is accomplished by computing the confidence score of each decision. It is determined by two factors: 1) the final decision vector of the ensemble classifier and 2) the prediction results of all base classifiers. Intuitively speaking, a decision is more confident if the maximum probability in the ensemble decision vector is larger or if more base classifiers in the ensemble agree with each other. We define two confidence scores for an input image x_i, where i is the image index, as

C_1(x_i) = max_j p_j(x_i),    C_2(x_i) = m_i / M,    (1)

where x_i, C_1 and C_2 denote the input data and the two confidence scores, respectively, p_j(x_i) is the j-th element of the decision vector of the ensemble, m_i is the number of base classifiers producing the majority class label for input x_i, and M is the total number of base classifiers. We call an input image a hard sample if C_1(x_i) < t_1 and C_2(x_i) < t_2, where t_1 and t_2 are two threshold values. After the separation of easy and hard examples, a new FF-CNN ensemble targeting the hard sample set can be trained to further boost the classification performance on hard samples.
4 Experiments
We conducted experiments on two popular datasets: MNIST [12] and CIFAR-10 [13]. The MNIST dataset contains grayscale images of handwritten digits 0-9. The CIFAR-10 dataset has 10 classes of tiny images of size 32x32. We adopted the LeNet-5 architecture [1] for the MNIST dataset. Since CIFAR-10 is a color image dataset, we set the filter numbers of the first and the second conv layers and the first and the second FC layers to 32, 64, 200 and 100, respectively, by following [11].
We applied C-PCA to the output of the second conv layer and reduced its per-channel feature dimension from 25 to 20 (MNIST) or 12 (CIFAR-10). To increase feature diversity, we sometimes fed the responses of the first conv layer into the FC layers directly. In that case, we reduced the per-channel feature dimension of the first conv layer to 30 (MNIST) or 20 (CIFAR-10), a substantial reduction from its original spatial dimension.
We adopted the radial basis function (RBF) SVM classifier as the ensemble classifier in all experiments. We applied PCA to the concatenated decision vectors of the base classifiers before the SVM training. The reduced feature dimension was determined by the correlation of the decision vectors of the base classifiers in an ensemble.
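The second-stage pipeline (PCA on concatenated decision vectors followed by an RBF SVM) can be sketched with scikit-learn. This is a schematic version with our own function names; in the experiments the PCA dimension is tuned from the decision-vector correlations rather than fixed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def train_second_stage(train_decisions, train_labels, reduced_dim):
    """Fit PCA on concatenated decision vectors, then an RBF SVM
    on the reduced features."""
    pca = PCA(n_components=reduced_dim).fit(train_decisions)
    svm = SVC(kernel="rbf").fit(pca.transform(train_decisions), train_labels)
    return pca, svm

def predict_second_stage(pca, svm, decisions):
    """Apply the trained PCA + RBF SVM to new decision vectors."""
    return svm.predict(pca.transform(decisions))
```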
4.1 Performance of Ensemble Systems
To show the power of ensembles, we conducted experiments by taking diversity schemes discussed in Sec. 3.2 into account.
Scheme 1. We compare the performance of the BP-CNN, four FF-CNNs and the ensemble of the four FF-CNNs in Table 1. The four FF-CNNs differ in their filter sizes in the two conv layers: 1) (5x5, 5x5), 2) (3x3, 5x5), 3) (5x5, 3x3), and 4) (3x3, 3x3). For MNIST, their filter numbers are the same in all settings; namely, (6, 16). For CIFAR-10, their filter numbers for RGB images are: 1) (32, 64), 2) (24, 64), 3) (32, 64), and 4) (24, 48). Their filter numbers for a single channel of color images are: 1) (16, 32), 2) (8, 32), 3) (16, 32), and 4) (8, 24). The classification accuracies of the BP-CNN, the four FF-CNNs and the ensemble are listed from columns 1 to 6. We see that the ensemble of the four FF models provides up to 4% improvement over the best single FF model. Different filter sizes directly affect the receptive field size of each conv layer and induce different statistics of the input data. In this way, we introduce diverse features into the ensemble system. While the performance gap between the BP-CNN and the ensemble narrows down for MNIST, the ensemble outperforms the BP-CNN for CIFAR-10.
             BP     FF-1   FF-2   FF-3   FF-4   Ens.
Filter size  (5,5)  (5,5)  (5,3)  (3,5)  (3,3)  --
MNIST        99.1   97.1   97.0   97.2   97.3   98.2
CIFAR-10     68.7   63.7   65.3   64.2   65.9   69.9
Scheme 2. We evaluate the FF-1 design with the feature subset diversity and set α, β and γ to 75%. We show the performance in Table 2, where the first to the fifth columns correspond to feature subsets selected from the entire S_2 (denoted by Conv2), two chosen by the third rule (denoted by Conv1-1 and Conv1-2), one by the first rule (denoted by Conv1-RD), and one by the second rule (denoted by Conv2-RD), respectively, where "RD" denotes reduced dimension. Then, we study the performance of four ensemble methods: 1) the ensemble of Conv2, Conv1-1 and Conv1-2 (ED-1); 2) the ensemble of six Conv1-RD results (ED-2); 3) the ensemble of twelve Conv2-RD results (ED-3); and 4) the ensemble of six Conv1-RD and twelve Conv2-RD results (ED-4). As compared with the 63.7% accuracy of FF-1 on CIFAR-10, ensembles using the feature subset diversity boost the performance by a margin ranging from 2.3% to 5.6%. It is worthwhile to point out that one can combine three classifiers (one trained on feature set S_2 and two trained on feature set S_1) in the ED-1 ensemble. It yields 68.7% and 97.7% accuracy on CIFAR-10 and MNIST, respectively. This choice offers a simple and effective ensemble system. We will adopt ED-1 to build a larger ensemble system by adding other sources of diversity later.
          Conv2  Conv1-1  Conv1-2  Conv1-RD  Conv2-RD  ED-1  ED-2  ED-3  ED-4
MNIST     97.1   95.4     95.3     96.8      95.2      97.7  97.6  97.2  98.0
CIFAR-10  63.7   64.3     64.4     62.3      64.2      68.7  66.0  68.4  69.3
Scheme 3. We conduct experiments by adopting different inputs to the FF-1 architecture in Table 3. We apply the nine Laws filters of size 3x3 [22] to grayscale images and generate nine images that contain frequency components in different subbands. For color images in CIFAR-10, we represent the color information in three color spaces: RGB, YCbCr, and Lab, where we treat the three channels individually in the last two color spaces. We observe 1.1% and 5.9% performance improvements on MNIST and CIFAR-10, respectively, by ensembling the various input representations. This demonstrates the effectiveness of utilizing various input representations as a diversity source in an ensemble.
          RGB   Grey  YCbCr           Lab             L1    L2    L3    L4    L5    L6    L7    L8    L9    ED
MNIST     --    97.1  --              --              97.0  95.1  87.8  92.6  93.7  94.9  95.6  93.1  92.6  98.2
CIFAR-10  63.7  --    54.0/41.4/41.1  53.2/40.0/41.0  50.6  44.8  44.5  46.3  48.3  44.9  47.6  43.0  45.8  69.6
We can fuse the three diversity types in one ensemble to boost the performance further. The relation between test accuracy and ensemble complexity is shown in Fig. 3. In general, an ensemble of more classifiers gives better performance. So far, the best test accuracies achieved on MNIST and CIFAR-10 are 98.7% and 74.2%, respectively. As compared with the single BP-CNN reported in Table 1, the best ensemble result is 5.5% higher on CIFAR-10 but 0.4% lower on MNIST. We can push the performance higher by separating easy and hard examples based on the scheme described in Sec. 3.3.
4.2 Separation of Easy and Hard Examples
By following the discussion in Sec. 3.3, we chose the two confidence-score thresholds separately for the MNIST and CIFAR-10 datasets. The results are reported in Table 4. For the set of hard examples, the new ensemble system trained on this set provides 5.6% and 2.6% improvements in test accuracy for MNIST and CIFAR-10, respectively. That is, more hard samples are classified correctly in this setting. Overall, the ensemble method with easy/hard example separation achieves test accuracies of 99.3% and 76.2% on the entire MNIST and CIFAR-10 datasets, respectively. It outperforms the best results obtained earlier, as shown in Table 1.
                  Easy  Hard  Hard+  FF    FF+
MNIST     Train   99.9  90.0  98.2   98.9  99.7
          Test    99.9  88.0  93.6   98.7  99.3
CIFAR-10  Train   99.9  73.5  82.3   80.1  87.2
          Test    98.2  66.2  68.8   74.2  76.2
4.3 Discussion
To better understand the diversity among different FF-designed CNNs, we evaluate the correlation among the outputs of different classifiers using two diversity measures: Yule's Q-statistic and the entropy measure [23]. These measures are built on correct/incorrect decisions. A lower Q-statistic (or a higher entropy measure) indicates a higher degree of diversity among base classifiers. The average measures among different diversity sources are reported in Table 5. The best diversity measures are achieved by combining all base classifiers, leading to a large performance improvement. This is consistent with the classification accuracy assessment shown in Fig. 3.
                 S1    S2    S3    All
Q-statistic      0.88  0.89  0.66  0.61
Entropy measure  0.21  0.24  0.47  0.49
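The two diversity measures can be sketched as follows, using one common formulation of the pairwise Q-statistic and of the entropy measure from the classifier-diversity literature [23]; the exact averaging used in Table 5 may differ.

```python
import numpy as np

def q_statistic(correct_i, correct_k):
    """Pairwise Yule Q-statistic for two classifiers, from boolean
    vectors marking correct (True) / incorrect (False) decisions.
    Q near 1 means the classifiers err together; lower Q means
    higher diversity."""
    n11 = np.sum(correct_i & correct_k)   # both correct
    n00 = np.sum(~correct_i & ~correct_k) # both wrong
    n10 = np.sum(correct_i & ~correct_k)
    n01 = np.sum(~correct_i & correct_k)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def entropy_measure(correct):
    """Entropy diversity measure over an ensemble.

    correct: (num_classifiers, num_samples) boolean matrix of correct
    decisions. Higher values indicate higher diversity."""
    num_clf, _ = correct.shape
    num_correct = correct.sum(axis=0)  # correct votes per sample
    denom = num_clf - np.ceil(num_clf / 2)
    return np.mean(np.minimum(num_correct, num_clf - num_correct) / denom)
```

Note the Q-statistic is undefined when the denominator vanishes (e.g., one classifier is always correct); the sketch assumes that case does not occur.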
5 Conclusion
We proposed an ensemble method that is built on multiple diverse FF-CNNs. We see a significant improvement in test accuracy on the MNIST and CIFAR-10 datasets. As future extensions, we would like to apply the ensemble method to more challenging datasets with more object classes and/or larger image sizes, such as CIFAR-100 and ImageNet. Also, it will be interesting to develop a weakly supervised system based on the ensemble of multiple FF-CNNs.
References
 [1] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [4] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
 [5] George Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
 [6] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu, “Interpretable convolutional neural networks,” arXiv preprint arXiv:1710.00935, 2017.
 [7] C.-C. Jay Kuo, “Understanding convolutional neural networks with a mathematical model,” Journal of Visual Communication and Image Representation, vol. 41, pp. 406–413, 2016.
 [8] C.-C. Jay Kuo, “The CNN as a guided multilayer RECOS transform [lecture notes],” IEEE Signal Processing Magazine, vol. 34, no. 3, pp. 81–89, 2017.
 [9] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C.-C. Jay Kuo, “A Saak transform approach to efficient, scalable and robust handwritten digits recognition,” arXiv preprint arXiv:1710.10714, 2017.
 [10] C.-C. Jay Kuo and Yueru Chen, “On data-driven Saak transform,” Journal of Visual Communication and Image Representation, vol. 50, pp. 237–246, 2018.
 [11] C.-C. Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen, “Interpretable convolutional neural networks via feedforward design,” arXiv preprint arXiv:1810.02786, 2018.
 [12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [13] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
 [14] Cha Zhang and Yunqian Ma, Ensemble machine learning: methods and applications, Springer, 2012.
 [15] Leo Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
 [16] Leo Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
 [17] David H Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2, pp. 241–259, 1992.
 [18] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao, “Diversity creation methods: a survey and categorisation,” Information Fusion, vol. 6, no. 1, pp. 5–20, 2005.
 [19] Ludmila I Kuncheva and Christopher J Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine learning, vol. 51, no. 2, pp. 181–207, 2003.
 [20] Noor A Ibraheem, Mokhtar M Hasan, Rafiqul Z Khan, and Pramod K Mishra, “Understanding color models: a review,” ARPN Journal of science and technology, vol. 2, no. 3, pp. 265–275, 2012.
 [21] Kenneth I Laws, “Rapid texture identification,” in Image processing for missile guidance. International Society for Optics and Photonics, 1980, vol. 238, pp. 376–382.
 [22] William K Pratt, Digital Image Processing: PIKS Scientific Inside, vol. 4, Wiley-Interscience, Hoboken, New Jersey, 2007.
 [23] Padraig Cunningham and John Carney, “Diversity versus quality in classification ensembles based on feature selection,” in European Conference on Machine Learning. Springer, 2000, pp. 109–116.