Heterogeneous Multilayer Generalized Operational Perceptron
Abstract
The traditional Multilayer Perceptron (MLP) using McCullochPitts neuron model is inherently limited to a set of neuronal activities, i.e., linear weighted sum followed by nonlinear thresholding step. Previously, Generalized Operational Perceptron (GOP) was proposed to extend conventional perceptron model by defining a diverse set of neuronal activities to imitate a generalized model of biological neurons. Together with GOP, Progressive Operational Perceptron (POP) algorithm was proposed to optimize a predefined template of multiple homogeneous layers in a layerwise manner. In this paper, we propose an efficient algorithm to learn a compact, fully heterogeneous multilayer network that allows each individual neuron, regardless of the layer, to have distinct characteristics. Based on the complexity of the problem, the proposed algorithm operates in a progressive manner on a neuronal level, searching for a compact topology, not only in terms of depth but also width, i.e., the number of neurons in each layer. The proposed algorithm is shown to outperform other related learning methods in extensive experiments on several classification problems.
I Introduction
In recent years, learning systems based on neural networks have gained tremendous popularity in a variety of application domains such as machine vision, natural language processing, biomedical analysis or financial data analysis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The recent resurgence of neural networks, especially deep neural networks, can be attributed to the developments of specialized computing hardware such as Graphical Processing Units (GPU) and improved training techniques such as Batch Normalization [11], Dropout [12], Residual Connection [13] as well as stochastic optimization algorithms such as Nesterov SGD [14] or Adam [15], to name a few. In order to improve the performance and generalization capacity of neural networks, a large amount of efforts has been made to learn deeper and deeper network topologies with larger, heavily annotated datasets. While the network architectures and training heuristics have evolved over the past few years, the core components of a neural network, i.e., the neuron model has remained relatively unchanged. The most typical artificial neuron model is based on McCullochPitts perceptron [16], thereupon simply referred to as perceptron, which performs a linear summation with learnable synaptic weights followed by an elementwise nonlinear activation function. This design principle was aimed to loosely simulate biological neurons in mammalian nervous system, which is also used in the current stateoftheart architectures such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN).
The authors in [17] proposed to replace the crude model of biological neuron based on McCullochPitts design with a more general neuron model called Generalized Operational Perceptron (GOP), which also includes the conventional perceptron as a special case. Each GOP is characterized by learnable synaptic weights and an operator set comprising of three types of operations: nodal operation, pooling operation, and activation operation, as illustrated in Figure 1. The form of each operation is selected from a library of predefined operations. By allowing different neurons to have different nodal, pool and activation operators, GOP can encapsulate a diversity of both linear and nonlinear operations.
The diversity introduced by GOP poses a much more challenging problem as compared to the traditional perceptron: optimizing both the synaptic weights and the choice of the operator set. In order to build a neural network using GOP, a progressive learning algorithm called Progressive Operational Perceptron (POP) was proposed in [17], which optimizes a predefined network template in a layerwise manner. To avoid an intractable search in the combinatorial space of operator sets, POP constrains all GOPs within the same layer to share the same operator set. While limiting the functional form of neurons in the same layer, training POP is still painstakingly slow since the evaluation of an operator set involves training a single hidden layer network and the combinatorial search space of the optimal operator set in the output layer in conjunction with the hidden layer is enormous. POP achieves a partial degree of selforganization by stopping adding new hidden layers when the objective is met on the data forming the training set. The width of each layer is, however, predefined beforehand, leading to a suboptimal final topology in terms of compactness.
When building a learning system based on neural networks, the architectural choice of the networks’ topology plays an important role in the generalization capacity [18]. The size of the neural network, i.e., the number of layers and the number of neurons in each layer, are usually selected based on some standard widely used structures or by manual tuning through experimentation. While some designs favor network depth such as Residual Networks with hundred of layers, empirical experiments in [19] have shown that shallower but wider topologies can achieve better generalization performance and computation efficiency. In case of densely connected topologies, an MLP with large layers can easily lead to overfitting while randomized neural networks [20, 21, 22, 23] typically formed by a large hidden layer with random neurons are robust to overfitting. There have been several attempts [24, 25, 17, 26, 27, 28, 29, 30, 31] to systematically search for the optimized network architectures with a given objective criterion. Regarding densely connected networks, existing literature employs different learning strategies such as progressively adding neurons and solve for the synaptic weights by randomization or convex optimization, or both. While randomization and convex optimization are characterized with fast training time and usually come with some forms of theoretical guarantee, the resulting architectures are often large with hundreds or thousands neurons. Another popular approach is the application of evolutionary strategies in the architectural search procedure. For example, by encoding the network configurations and parameters into particles, multidimensional particle swam optimization was used to evolve both the network configurations and synaptic weights in [30]. While evolutionary algorithms work well in practice, their fitness evaluation step often requires heavy computation, rendering their application in large datasets intractable. Recently, by employing powerful commodity hardware with GPUs, it was able to evolve LSTM architectures on Penn Treebank language modeling dataset [31]. The common drawback of all the aforementioned learning systems is the use of perceptron model, which limits the learning capability of each neuron.
Since the development in the silicon layer that enables costefficient computing devices that are affordable to a wider range of practitioners, as well as lowcost embedded, mobile devices that are affordable to many customer classes, more and more research efforts have been focused on efficient inference systems on mobile devices that require small memory footprint and computation power. While the stateoftheart machine learning models based on deep neural networks with millions of parameters can be easily deployed to a powerful workstation, they are yet ready for the deployment on mobile devices having limited memory, computing power, and battery life. To reduce the storage and computation required for deployment on such devices, existing approaches include compressing a pretrained network by weight quantization, lowrank approximations and parameter pruning [32] or designing a network topology with fewer parameters and computations [33, 34]. It should be noted that most of these works focus on the convolutional architectures which are the core component of many visual learning systems. While visual inference tasks are actively investigated by convolutional neural networks, the potential of neural networks encompasses a much wider range of applications, ranging from healthcare monitoring to smart sensors, which are traditionally solved with densely connected topologies [35, 36, 37]. In this regard, learning a problemdependent, compact network configuration makes a step towards the realization of inference systems on lowcost devices.
In this work, we focus on the problem of learning efficient and compact network structures by learning fully heterogeneous multilayer networks using GOPs. We propose an algorithm to progressively grow the network structures, both in width and depth, while searching for the optimal operator set, the synaptic weights of the newly added neurons and the corresponding decision function. The contributions of our work can be summarized as follows:

We analyze the current drawbacks of the related algorithms and propose an efficient algorithm to overcome the existing shortcomings by learning datadependent, fully heterogeneous multilayer architectures employing GOPs. The resulting networks can achieve a high degree of selforganization with respect to the operator set of each neuron, the number of neurons per layer and the number of layers.

In addition to the proposed algorithm, we also present three other variants that can be seen as simplified versions of it. This is followed by the discussion focusing on the advantages and possible limitations of each variant in comparison with our proposed algorithm.

To validate the efficiency of the proposed algorithm in comparison with other variants and existing progressive learning systems, we have conducted extensive experimental benchmarks on a variety of realworld classification problems.

We publish our implementation of all evaluated methods in this paper to facilitate the future research in this area ^{1}^{1}1https://github.com/viebboy/HeMLGOP.
The remainder of the paper is organized as follows: In Section 2, we review the GOP model with POP, the progressive learning algorithm proposed in [17]. Section 3 starts by discussing the shortcomings of POP and proceeds to present our approach towards the design of fully heterogeneous networks using GOPs. The discussion of other variants of our approach is also presented in Section 3. In Section 4, we provide details of our experiment protocols and quantitative analysis of the experiment results. Section 5 concludes our work.
Ii Related Work
This Section describes the GOP model and the corresponding algorithm POP, proposed in [17] to learn GOPbased networks.
Iia Generalized Operational Perceptron
The neuronal activities of the th GOP neuron at layer is illustrated in Figure 1. As mentioned in the previous section, each GOP is characterized by the adjustable synaptic weights (), the bias term and an operator set (nodal operator , pooling operator , activation operator ). The form of each operator is selected from a predefined library of operators: , , . The task of learning a network using GOP is, therefore, the search for the optimal operator set for each GOP and the corresponding synaptic weights. Given (, , ), and (), the activities of the th neuron can be described by the following equations:
(1)  
(2)  
(3) 
In Eq. (1), the nodal operator takes as input the th output of the previous layer and the synaptic weight which connects the th neuron at layer to the th neuron at layer . After applying function , the nodal operation produces a scalar .
In Eq. (2), the outputs of the nodal operation over neurons from the previous layer are then combined through the pooling operation, which applies the pooling function over (). The pooled result is then shifted by the bias term to produce .
In Eq. (3), the output of the pooling operation then goes through the activation function to produce the output of the th GOP in layer .
An example of a library of operators, which is also used in our experiments, is shown in Table I. It is clear that when the operator set of a GOP is selected as (multiplication, summation, sigmoid) then it operates as a conventional perceptron.
Nodal () 


Multiplication  
Exponential  
Harmonic  
Quadratic  
Gaussian  
DoG  
Pool () 

Summation  
1Correlation  
2Correlation  
Maximum  
Activation () 

Sigmoid  
Tanh  
ReLU  
Softplus  
Inverse Absolute  
ELU  

IiB Progressive Operational Perceptron
Let , , be the number of elements in , and respectively, then the total number of possible combinations for a GOP is . It is clear that given a multilayer topology of GOPs and a large value for , the combinatorial search space when optimizing all neurons simultaneously in such network is enormous. For example, consider the case where , as used in [17], and using a twohiddenlayer network with neurons in each layer and output neurons. Such a small network architecture corresponds to different configurations. To narrow the search space, POP was proposed in [17] to learn the network topology in a layerwise manner. In order to operate, a template network structure specifying the number of neurons for each hidden layer and maximum number of hidden layers is predefined. In addition, a target objective is specified to determine the convergence of the algorithm. For example, [, , , , ; ] defines a template with input units, output neurons, 3 hidden layers with , , neurons respectively, and specifies the target mean squared error. Starting from the first hidden layer, POP constructs a Single Hidden Layer Network (SHLN) [, , ] and learns the operator sets and weights in the hidden and output layer with a constraint: neurons in the same layer share the same operator set. Finding the operator sets is done by a greedy iterative search procedure called twopass GIS. In the first pass, a random operator set is fixed to the hidden layer and POP iterates through all operator sets in the library: at each iteration, the output layer is assigned the iterated operator set ; the synaptic weights of SHLN with (, ) are found by BP for epochs, and the performance is recorded. After this procedure, the current best operator set in the output layer is found. With fixed in the output layer, POP performs similar loop to find the best set for the hidden layer . The second pass of GIS is similar to the first one, except the operator set in the hidden layer is initialized with from the first pass instead of random assignment. After applying twopass GIS, the performance of the current SHLN is compared with the target objective . If the target is not achieved, the output layer of SHLN is discarded and the current hidden layer is fixed and used to generate fixed inputs to learn the next hidden layer with neurons in the similar manner as the first hidden layer. On the other hand, if the target objective is met, POP stops the progressive learning procedure and finetunes the whole network structure that has been learned.
Iii Proposed Method
In this section, we describe a new approach to overcome the limitations of current algorithms in building heterogeneous network architectures directly from data. At the end of this section, we also discuss other possible variants of our approach and present our view on the pros and cons of each of them.
Iiia Limitations in POP
It is clear from Section IIB that POP optimizes the operator set and weights in each hidden layer by running through loops over the library of the operator sets with each iteration running BP with epochs. Therefore, the computational complexity to optimize an SHLN is BP epochs. Such a search scheme is not only computationally demanding, but also redundant due to the fact that when the target objective cannot be achieved with the current network configuration, POP simply discards what has been learned for the output layer and reiterates the searching procedure for the new hidden and output layer. Let us consider the case where the operator set in the output layer is already known a priori, the cost of the search in POP is reduced from to , which is a significant factor of reduction. In fact, we argue that if the hidden neurons can extract highly discriminant representations, which is the design target of GOP, then only a simple linear decision function is needed in the output layer.
There are two constraints imposed by POP on the learned architecture: a predefined width of each hidden layer and the sharing of the operator set within the same layer. Both constraints limit the representational power of the learned hidden layer. While it is computationally infeasible to search for the operator set of each individual GOP following the searching approach in POP, [17] argued that in a classification problem an optimal operator set of a neuron should also be optimal to others in the same layer, i.e., on the same level of the hierarchical abstract. This is, however, a strong assumption. As an illustrational example, let us assume that at some arbitrary level of abstraction in the network appear patterns of both straight lines and curves, and we assume that there exist two operator sets that allow the neurons to detect straight lines and curves respectively. By limiting the neurons to either being able to detect a line or a curve, whichever yields better results, one of the patterns will not be captured in the internal representation of the network. Such an approach will lead to the confusion on the objects which are composed of both patterns. One might argue that with enough neurons that can detect lines, a curvature can also be detected in a limiting sense. This comes to the question: how many neurons will be enough?. By imposing a predefined width of each hidden layer, POP already incurs an additional hyperparameter choice, leading to either insufficient or redundant representation, both of which require a huge amount of hyperparameter tuning efforts to achieve an efficient and compact network.
IiiB Heterogeneous Multilayer Generalized Operational Perceptron (HeMLGOP)
The aforementioned limitations in POP are, in fact, interrelated. The computational complexity of the search procedure can be reduced by avoiding the search of the operator sets in the hidden layer in conjunction with the output layer. This can be done by simply assuming a linear decision function, which requires highly discriminative hidden representations. In other words, heterogeneous hidden layers of GOPs with adaptive size. It should be noted that, in order to search for the optimal operator set of a neuron, it is necessary to evaluate all operator sets in the library. Instead of evaluating each operator set by hundreds BP iterations as in POP, we propose to use ideas originated from Randomized Networks (RN) [20, 21, 22, 23] for the evaluation of an operator set. Given a single hidden layer topology with linear transformation in the output layer, we assign random synaptic weights connecting the input layer to the hidden layer while giving a closedform global solution of the output layer, i.e., the decision function. Particularly, let and be the hidden representation, and the target representation of training samples respectively. The optimal synaptic weights connecting the hidden layer and the output layer is given as:
(4) 
where is the MoorePenrose generalized inverse of .
There are several methods to calculate the MoorePenrose generalized inverse of a matrix [38, 39]. For example, in the orthogonal projection method, when or when , with being a positive scalar added to the diagonal of or to ensure stability and improve generalization performance according to ridge regression theory [40].
Given a single hidden layer network with GOP neurons in the hidden layer and linear output layer, and since each operator set represents a distinct type of neuronal activity or functional form, we argue that the suitable functional form of a GOP, i.e., the operator set, can be evaluated with random synaptic weights drawn from a uniform distribution [41]. After finding the optimal operator set of a neuron with respect to the objective function, the corresponding weights can be finetuned by BP.
To learn a problemdependent network topology, we adopt a progressive learning approach in terms of width and depth in our algorithm. Given a learning problem, the proposed algorithm starts with a single hidden layer network of GOPs in the hidden layer and linear neurons in the output layer. By starting with a small , e.g., , we assume these GOPs share the same operator set. The algorithm proceeds to select the optimal operator set of neurons by iterating through all operator sets in the library: at iteration , random synaptic weights in the hidden layer are drawn from a uniform distribution, the decision boundary is calculated as follows:
(5) 
where denotes the standardized hidden output of GOPs with operator set .
At each iteration, the performance of the network is recorded. After evaluating all operator sets, the best performing one is selected for the current neurons and the corresponding synaptic weights as well as output layer weights are updated with BP for epochs. During BP with minibatch training, the normalization step is replaced by Batch Normalization [11], which is initialized with mean and standard deviation. Since the hidden layer will be incrementally grown with heterogeneous GOPs, the normalization step is necessary to ensure the hidden representations in the network having similar range. Once the operator set is found and the synaptic weights of a GOP are finetuned with BP, they are fixed.
The algorithm continues by progressively adding GOPs sharing the same operator set to the hidden layer. It is worth noting that when , the algorithm allows the growth of fully heterogeneous layers. The operator set of newly added GOPs is found similarly as in case of the first GOPs, i.e., by iterating through all operator sets and solving for the output layer weights. Particularly, at iteration , let be the normalized hidden representation of the existing GOPs that have been learned and be the normalized hidden representation produced by the newly added GOPs with the th operator set in the library and random weights, the optimal linear transformation in the output layer is given as:
(6) 
In addition to Eq. (6), the new decision boundary can also be updated efficiently in an incremental manner based on , the decision boundary with respect to [42]. After the best performing operator set of newly added GOPs is found, their synaptic weights , the normalization statistics and the linear transformation in the output layer are updated through BP. Here we should note that the existing GOPs with representation are not updated since the inclusion of neurons is expected to complement the existing features. The progressive learning in the current hidden layer stops when the inclusion of new neurons stops improving the performance of the network. This is measured by a relative criterion based on the rate of improvement rather than an absolute threshold value on the performance as in POP:
(7) 
where denotes the rate of improvement when adding neurons to the current hidden layer and , denote the loss values before and after adding neurons respectively. For large positive , the inclusion of new neurons indicates a large improvement of the network with respect to the existing structure. On the contrary, a small or negative indicates a minimal improvement or a degradation in the performance. It should be noted that depending on the learning problem, other quantities can be chosen in place of the loss value, with appropriate signs in the nominator of Eq. (7). For example, in classification problem, can be defined as with denotes the classification accuracy. Given and a threshold , the proposed algorithm stops adding neurons to the current hidden layer if .
When the progression in the current hidden layer stops improving performance with respect to , the proposed algorithm forms a new hidden layer with GOPs between the current hidden layer and the output layer. All the existing hidden layers in the network are then fixed and act as feature extractor, producing inputs to the newly formed hidden layer. The progression in the new hidden layer repeats in a similar manner as in the previous hidden layers with an initial GOPs and incrementally adding GOPs until the criterion is met. After the new hidden layer is fully learned, the proposed algorithm evaluates the necessity to include the newly learned hidden layer by evaluating the relative improvement of the network before and after adding the new hidden layer:
(8) 
where denotes the rate of improvement when adding the new hidden layer and , denote the loss values before and after adding the new hidden layer respectively. Similar to the progression of neurons in a hidden layer, the progression of hidden layers is controlled through hyperparameter . The newly learned hidden layer is included in the network topology and the progression continues if . Otherwise, the progressive learning terminates, and the whole network structure that has been learned during progressive learning is finetuned for some BP epochs. Since a new hidden layer is initially formed with a small number of neurons, our proposed algorithm only evaluates the inclusion of a new hidden layer when it is fully learned. The summary of our proposed algorithm is presented in Algorithm 1.
IiiC Variants
One can also identify three variants of the proposed algorithm. They are the following:

Homogeneous Multilayer Randomized Network (HoMLRN): instead of a heterogeneous layer of GOPs, in this variant, all neurons in the same layer share the same operator set. A hidden layer starts with neurons whose operator set is found by using RN in the operator set search procedure. Once the operator set of the starting neurons is found, it is fixed for all other neurons in the same layer. At each progressive step, neurons with random weights are added to the current hidden layer, and the decision boundary is adjusted through linear regression. After the progression finishes, the final network structure is finetuned through BP.

Heterogeneous Multilayer Randomized Network (HeMLRN): in this variant, the progressive learning procedure is similar to our algorithm, i.e., for newly added neurons, the algorithm searches for the best performing operator set by using RN. The only difference between this variant and our algorithm is that the synaptic weights are not finetuned during progressive learning but only after the final topology is found.

Homogeneous Multilayer Generalized Operational Perceptron (HoMLGOP): similar to HoMLRN and POP, this variant enforces the sharing of operator set within the same layer. The progressive learning in a hidden layer starts with GOPs whose operator set is found via Randomized Network in the operator set search procedure. GOPs with the operator set similar to the starting neurons are incrementally added to the current hidden layer. At each increment, the synaptic weights of newly added neurons are updated through BP instead of Randomized Network as in HoMLRN. After the progression of network structure, the final topology is finetuned as a whole.
It is clear that the aforementioned variants can be seen as simplified versions of our approach in certain aspects. Particularly, both HoMLRN and HeMLRN depend solely on Randomized Networks during progressive learning, which reduces a portion of computational cost induced by weights finetuning through BP. While Randomized Networks can be suitable to search for the functional form of newly added GOPs, we argue that it is necessary to further adjust the weights of the newly added neurons through BP to effectively exploit their representation power. Without the weight adjustment step interleaved with Randomized Network, both HoMLRN and HeMLRN are expected to progress towards having large hidden layers. Moreover, since the hidden layers rely only on random weights during the progression, the outputs of a hidden layer are not expected to be highly discriminative as an input to the next hidden layer, which might also lead to ineffective progression in depth. While HoMLGOP incorporates the weight finetuning step in the progressive learning procedure, this variant avoids the cost of searching for the optimal operator set when incrementally adding neurons. As a result, the homogeneity constraint in hidden layers might prevent HoMLGOP from learning compact hidden layers.
Regarding the complexity of the proposed algorithm, the aforementioned variants, and POP, it is not straightforward to compare the computational complexity between them since the complexity of all progressive learning algorithms depends on the speed of convergence given a criterion. However, it is worth noting that since the proposed algorithm allows the growth of heterogeneous hidden layers, and at each incremental step, the synaptic weights of newly added neurons are strengthened through BP to fully adapt to the problem, it is expected to observe fast convergence of the rate of improvement, producing both compact and efficient network structures. In our empirical analysis, the proposed algorithm converges after a few incremental steps in most of the learning problems, and the number of network parameters in the learned architectures is much lower compared to the competing approaches. Another point worth mentioning is that the search procedure in the proposed method and the mentioned variants relies on random weights, which might produce different operator sets at different runs. This can lead to high variance in the final topologies between different runs, especially in case of HoMLRN and HoMLGOP in which the operator set of a layer is found only once. The effect of randomness is, however, reduced in the proposed algorithm because if in a hidden layer, an optimal operator set was not found in the previous incremental steps due to randomness, it can still be chosen in the next steps.
Iv Experiments
In this section, we provide empirical analysis of the proposed algorithm, the aforementioned variants, and other related algorithms. We start by describing the experimental protocols and datasets, followed by the discussion of the empirical results.
Iva Competing methods
In order to compare the effectiveness of the proposed algorithm with related approaches, the following additional methods were included in our empirical analysis:

Progressive Operational Perceptron (POP) [17]: the only existing GOPbased algorithm.

HoMLRN, HeMLRN, HoMLGOP: the variants of the proposed method mentioned in the previous section.

Progressive Learning Network (PLN) [24]: by using nonlinear activation functions that satisfy the progression property, the authors in [24] proposed a progressive learning algorithm that increments the width of a hidden layer by random perceptrons and solving a convex optimization problem. When the performance improvement in a hidden layer saturates, PLN forms a new hidden layer by incrementally adding random neurons to the current output layer to form a new hidden layer and adding new output layer.

Broad Learning System (BLS) [25]: based on the idea of Random Vector Functional Link Neural Network [43], the authors proposed a progressive learning algorithm that increments the width of a two hidden layer network. Neurons in the first hidden layer are called feature nodes, which synthesize hidden features by random linear transformation and sigmoid activation. Neurons in the second hidden layer are called enhancement nodes, which again linearly transform the outputs of feature nodes with random weights, followed by the sigmoid activation. The outputs of feature nodes and enhancement nodes are concatenated as an input to a linear output layer. Before progressively adding new feature nodes and enhancement nodes, BLS finetunes the feature nodes by Alternating Direction Method of Multiplier (ADMM). During the progression, only random nodes are added.

Progressive Multilayer Perceptron (PMLP): this is a variant of POP that uses McCullochPitts perceptron instead of GOP. The progressive learning step is similar to POP with a predefined maximal template structure as an input, PMLP incrementally adds a new hidden layer if a target objective cannot be achieved.
IvB Datasets
We have conducted experiments on classification problems in different application domains with different sizes, ranging from few hundred samples up to k samples. With respect to the problem size, the datasets can be divided into groups: smallscale problems ( datasets) formed by less than samples and medium/large scale problems ( datasets) formed by more than samples.
Information about all datasets used in our experiments is presented in Table II. For PIMA [44], CMC [45] and YEAST [46], we used the original data representations provided by the database. Olympic Sports [45] and Holywood3d [47] are human action video datasets. We used the stateoftheart action video representation in [48] and combined the five action descriptions following the suggested multichannel kernel approach followed by KPCA to obtain vectorbased representation for each action video. All medium/large scale problems are classification problem based on visual inputs. Particularly, 15 scenes and MIT indoor are scene recognition datasets, Caltech101 and Caltech256 are related to the problem of object recognition while CFW and PubFig concern face recognition problem. Regarding the input representation of scene recognition and object recognition datasets, we employed the deep features generated by average pooling over spatial dimension of the last convolution layer of VGG network [49] trained on ILSVRC2012 database. Similarly, deep features generated by VGGface network [50] were used in CFW and PubFig.
Database 
Samples  Input dimension  Target dimension 

PIMA [44] 
768  8  2 
Olympic Sports [45]  774  100  16 
Holywood3d [47]  945  100  14 
CMC [51]  1473  9  3 
YEAST [46]  1484  8  10 
15 scenes [52] 
4485  512  15 
MIT indoor [53]  15620  512  67 
Caltech101 [54]  9145  512  102 
Caltech256 [55]  30607  512  257 
PubFig [56]  13002  512  83 
CFW60k [57]  60000  512  500 

Since POP is the most computationally demanding algorithm, we could only afford to perform experiments with POP in smallscale problems. Although empirical results in medium/large scale problems are not available, the efficiency of POP in comparison with other algorithms can be observed in five smallscale datasets.
HeMLGOP  HoMLGOP  HeMLRN  HoMLRN  POP  PMLP  PLN  BLS  
Holywood3d  
Olympic Sports  
CMC  
PIMA  
YEAST  
15 scenes    86.36  
MIT indoor    
Caltech101    
Caltech256    
PubFig    
CFW60k    

HeMLGOP  HoMLGOP  HeMLRN  HoMLRN  POP  PMLP  PLN  BLS  
Holywood3d  k  k  k  k  k  k  k  k 
Olympic Sports  k  k  k  k  k  k  k  k 
CMC  k  k  k  k  k  k  k  k 
PIMA  k  k  k  k  k  k  k  k 
YEAST  k  k  k  k  k  k  k  k 
15 scenes  k  k  k  k    k  k  k 
MIT indoor  k  k  k  k    k  k  k 
Caltech101  k  k  k  k    k  k  k 
Caltech256  k  k  k  k    k  k  k 
PubFig  k  k  k  k    k  k  k 
CFW60k  k  k  k  k    k  k  k 

IvC Experiment Protocol
In smallscale problems, since the number of samples is small, we only partitioned the datasets into a train () and a test () set, except for Holywood3d and Olympic Sports in which we used the partition given by the databases. In medium/large scale problems, of the data was randomly chosen for training while was selected as validation and test set each. To deal with the effect of randomness, each algorithm was run times on each problem, and the medium performance on the test set and the corresponding architectural information are reported.
Since other progressive learning methods (PLN, BLS) are largely affected by hyperparameter settings, we have conducted extensive hyperparameter search for each algorithm. Particularly, in PLN, we tested the values for the leastsquare regularization parameter, and , for the regularization parameters when solving the output layer; in BLS, the regularization parameter used in the calculation of pseudoinverse is in the set of and regularization parameter used in ADMM algorithm is in the set of . The number of iterations in ADMM algorithm was set to for both PLN and BLS.
HeMLGOP  HoMLGOP  HeMLRN  HoMLRN  POP  PMLP  PLN  BLS  
Holywood3d  
Olympic Sports  
CMC  
PIMA  
YEAST  

HeMLGOP  HoMLGOP  HeMLRN  HoMLRN  POP  PMLP  PLN  BLS  
Holywood3d  k  k  k  k  k  k  k  k 
Olympic Sports  k  k  k  k  k  k  k  k 
CMC  k  k  k  k  k  k  k  k 
PIMA  k  k  k  k  k  k  k  k 
YEAST  k  k  k  k  k  k  k  k 
15 scenes  k  k  k  k    k  k  k 
MIT indoor  k  k  k  k    k  k  k 
Caltech101  k  k  k  k    k  k  k 
Caltech256  k  k  k  k    k  k  k 
PubFig  k  k  k  k    k  k  k 
CFW60k  k  k  k  k    k  k  k 

Regarding the hyperparameter settings of the proposed method (HeMLGOP) and other variants (HoMLRN, HeMLRN, HoMLGOP), was used in the Randomized Network step. For all the methods that employ BP, it is important to properly regularize the network structure to avoid overfitting. Regularization setting in BP includes weight regularization and Dropout [12]. We experimented with types of weight regularization: weight decay with scale of and norm constraint with maximum value in . The dropout step was applied to the output of the hidden layer with the percentage selected from . In addition, dropout was applied to the deep feature input during BP. For smallscale problems, during progressive learning, the following learning rate schedule and the corresponding number of epochs were applied to all methods while in medium/large scale datasets, the learning rate schedule and the number of epochs were set to , , respectively.
For POP and PMLP, we defined a network template of maximum layers and layers in smallscale and medium/large scale problems, respectively, with each layer having neurons. In all competing algorithms that have incremental steps within a layer, the layer starts with neurons and increments neurons at each step. To avoid the problem of growing arbitrarily large hidden layers, and to make the learned architectures comparable between all methods, we limit the maximum number of added neurons in a hidden layer in our proposed method and other variants to , and for PLN and BLS. Moreover, we applied a universal convergence criterion based on the rate of improvement during network progression for all methods, i.e., an algorithm stops adding neurons to the current hidden layer when and stops adding hidden layers when with and calculated according to Eq. (7) and (8).
IvD Results
Table III shows the classification accuracy of all competing methods on the datasets and Table IV shows the corresponding model sizes, i.e., the number of parameters in the network.
Regarding the performances on smallscale datasets, among all competing algorithms, it is clear that the proposed algorithm (HeMLGOP) and its heterogeneous variant (HeMLRN) are the leading performers. The differences, in terms of classification accuracy, between the two algorithms are statistically insignificant. However, the models learned by HeMLGOP are significantly smaller (approximately in most cases) as compared to those learned by HeMLRN. Between two homogeneous variants of the proposed algorithm, it is clear that HoMLGOP consistently outperforms HoMLRN in terms of both classification accuracy and compactness.
Between the two algorithms that employ BP during progressive learning, the results of HeMLGOP are similar or better than its homogeneous counterpart (HoMLGOP) in both classification accuracy and memory footprint, with the only exception in Holywood3d in which HeMLGOP is more accurate, requiring more parameters. The empirical results of HeMLGOP and its variants are consistent with our discussion in Section IIIC: allowing heterogeneous neurons within a hidden layer can lead to better performing networks, and weights adjustment through BP is necessary to fully harness the representation power of newly added neurons during progressive learning.
Regarding the performance of POP and PMLP on smallscale datasets, it is obvious that the final network topologies learned by the two algorithms are enormous, compared to the proposed algorithm and its variants. Particularly, in CMC, PIMA, and YEAST, HeMLGOP needs approximately only parameters, while POP and PMLP require a vast amount of more than parameters, i.e., memory saving achieved by HeMLGOP with similar or better accuracies. The differences between POP, PMLP and the proposed algorithm are expected since our proposed algorithm addresses the two limitations in POP as discussed in Section IIIA: fixed hidden layer sizes and the homogeneity constraint of a layer. Regarding PLN, while the algorithm requires slightly fewer parameters in Holywood3d and Olympic Sports datasets as compared with HeMLGOP, the classification performances of PLN are similar or much worse. In other smallscale problems, PLN is inferior to HeMLGOP in both accuracy and compactness.
Similar phenomena between the competing algorithms can be observed in medium/large scale datasets: the proposed HeMLGOP remains as the best performing algorithm to learn the most compact network topologies while being similarly or more accurate than other benchmarked algorithms. The classification accuracies of HeMLRN and HoMLGOP, are competitive with the proposed algorithm, however, achieved by larger network configurations. HoMLRN performs worst among its GOPbased counterparts. While being as accurate as the proposed HeMLGOP in most medium/large scale problems, the models learned by PMLP require to number of parameters. Moving to a medium/large scale setting with more challenging problems, PLN and BLS are consistently inferior to other algorithms in both measures. In addition, the networks grown by PLN in some datasets such as Caltech256 ( classes) or CFW ( classes) are enormous with the number of parameters reaching the order of millions. This is due to the limitations in PLN that the size of a hidden layer is always equal or larger than twice the number of target outputs since a new hidden layer is constructed based on the current output layer. By having only hidden layers and resorting entirely to random weights during progression, BLS tends to grow large but inefficient networks as compared to other algorithms.
As mentioned in Section III, one of the motivations in our work is to speed up the operator set searching procedure in GOPbased system like POP. Thus, Table V presents the training time (in seconds) of all algorithms on smallscale problems. It should be noted that algorithms based on BP can take huge advantage of GPUs to speed up the training process. However, to give comparable results in terms of training time of all competing methods, we conducted all smallscale experiments based on CPUs with the same machine configuration. It is clear that the proposed algorithm is much faster than POP by approximately in most cases. While HoMLGOP, HeMLRN, HoMLRN can be seen as simplified versions of the proposed algorithm, there is no clear winner among the four algorithms in the context of training time. Depending on the difficulty of the given problem, the training time of HeMLGOP is relatively short as compared to its variants since the proposed algorithm tends to converge after only a few progressive steps with small network topologies, e.g., in CMC, PIMA, YEAST. Among all competing algorithms, it is clear that PLN and BLS are the fastest algorithms to train since both algorithms rely only on random weights during the network progression. As shown in Table III, this advantage of fast training time results in the cost of inferior performances and very large model sizes for deployment as compared to the proposed algorithm. While Table V can give an intuition on the speed of each algorithm during the training stage, it is worth noting that the benchmark can only give strict comparisons between the proposed algorithm, its variants, and POP, all of which are based on our unoptimized implementation of GOP. The exact relative comparison on the training time between perceptronbased networks (PMLP, PLN, BLS) and GOPbased networks (POP, the proposed algorithm and, its variants) can change drastically when an optimized implementation of GOP is available.
Nowadays, with the development of commodity hardware, training an algorithm in some orders of magnitude longer than another might not prevent its application. However, deploying a large pretrained model to constrained computing environments such as those in mobile, embedded devices poses a big challenge in practical application. Not only the storage requirement plays an important factor during the deployment stage in mobile devices but also the amount of energy consumed during inference. While the actual inference time of each algorithm depends on the details of the implementation such as hardwarespecific optimizations or concurrency supports, the computational complexity of an algorithm is directly related to the energy consumptions. Under this perspective, Table VI shows the number of floatingpoint operations (FLOPs) required by each network in Table III to predict a new sample. With compact network configurations, it is clear that out of datasets, the proposed algorithm requires the smallest number of operations during inference. In other cases, the number of FLOPs in HeMLGOP remains relatively low as compared to other algorithms. For example, in CMC, Caltech101 or PubFig, HeMLGOP is the second best in terms of computational complexity. Although being very fast to train, making inference with the learned models produced by PLN and BLS is costly in most cases. For example, in YEAST, Caltech256 and CFW, the numbers of FLOPs in PLN and BLS are more than compared to HeMLGOP. Regarding POP, not only does the algorithm take a very long time to train but also heavy costs to make an inference. The computational complexity of PMLP during testing is on the same order as POP, however, with much shorter training time. It is worth noting that in case of GOPbased networks, the number of parameters in the model and the inference cost are not directly related, i.e., two networks having the same topology could have different inference complexities. This is due to the fact that different operator sets in GOP possess different computational complexities.
Figure 2 shows the distribution of operators used by our proposed algorithm and its three variants in all datasets. It is clear that while the proposed algorithm and its heterogeneous variant (HeMLRN) used a diverse set of operators, the types of operators selected by HoMLGOP and HoMLRN are more limited. Within the library of nodal operators, “Multiplication” was popular among all four algorithms. Similar observations can be seen in activation operators: “ReLU” and “ELU” were favored as activation functions while “Summation”, “1Correlation” and “2Correlation” were the most popular pooling operators to all algorithms.
V Conclusions
In this paper, we proposed an efficient algorithm to learn fully heterogeneous multilayer networks that utilize Generalized Operational Perceptron. The proposed algorithm (HeMLGOP) is capable of learning very compact network structures with efficient performances. Together with the proposed algorithm, we also presented related variants which can be seen as simplified versions of HeMLGOP. Extensive experimental benchmarks in realworld classification problems have shown that, under different perspectives, the proposed algorithm and its variants outperform other related approaches, including POP, an algorithm that also employs Generalized Operational Perceptron.
References
 [1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, realtime object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
 [2] A. Iosifidis, A. Tefas, and I. Pitas, “Viewinvariant action recognition based on artificial neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 412–424, 2012.
 [3] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, “Forecasting stock prices from the limit order book using convolutional neural networks,” in Business Informatics (CBI), 2017 IEEE 19th Conference on, vol. 1, pp. 7–12, IEEE, 2017.
 [4] A. Graves, A.r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649, IEEE, 2013.
 [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
 [6] M. Zabihi, A. B. Rad, S. Kiranyaz, M. Gabbouj, and A. K. Katsaggelos, “Heart sound anomaly and quality detection using ensemble of neural networks without segmentation,” in Computing in Cardiology Conference (CinC), 2016, pp. 613–616, IEEE, 2016.
 [7] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, “Using deep learning to detect price change indications in financial markets,” in European Signal Processing Conference (EUSIPCO), Kos, Greece, 2017.
 [8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [9] M. A. Waris, A. Iosifidis, and M. Gabbouj, “Cnnbased edge filtering for object proposals,” Neurocomputing, 2017.
 [10] X. An, D. Kuang, X. Guo, Y. Zhao, and L. He, “A deep learning method for classification of eeg data based on motor imagery,” in International Conference on Intelligent Computing, pp. 203–210, Springer, 2014.
 [11] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 [14] Y. Bengio, N. BoulangerLewandowski, and R. Pascanu, “Advances in optimizing recurrent networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8624–8628, IEEE, 2013.
 [15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [16] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–133, 1943.
 [17] S. Kiranyaz, T. Ince, A. Iosifidis, and M. Gabbouj, “Progressive operational perceptrons,” Neurocomputing, vol. 224, pp. 142–154, 2017.
 [18] D. Ellis and N. Morgan, “Size matters: An empirical study of neural network training for large vocabulary continuous speech recognition,” in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 2, pp. 1013–1016, IEEE, 1999.
 [19] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
 [20] Y.H. Pao and Y. Takefuji, “Functionallink net computing: theory, system architecture, and functionalities,” Computer, vol. 25, no. 5, pp. 76–79, 1992.
 [21] W. F. Schmidt, M. A. Kraaijveld, and R. P. Duin, “Feedforward neural networks with random weights,” in Pattern Recognition, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference on, pp. 1–4, IEEE, 1992.
 [22] G.B. Huang, Q.Y. Zhu, and C.K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 13, pp. 489–501, 2006.
 [23] A. Rahimi and B. Recht, “Random features for largescale kernel machines,” in Advances in neural information processing systems, pp. 1177–1184, 2008.
 [24] S. Chatterjee, A. M. Javid, M. Sadeghi, P. P. Mitra, and M. Skoglund, “Progressive learning for systematic design of large neural networks,” arXiv preprint arXiv:1710.08177, 2017.
 [25] C. P. Chen and Z. Liu, “Broad learning system: An effective and efficient incremental learning system without the need for deep architecture,” IEEE transactions on neural networks and learning systems, vol. 29, no. 1, pp. 10–24, 2018.
 [26] A. G. Ivakhnenko, “Polynomial theory of complex systems,” IEEE transactions on Systems, Man, and Cybernetics, no. 4, pp. 364–378, 1971.
 [27] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Advances in neural information processing systems, pp. 153–160, 2007.
 [28] M. Kulkarni and S. Karande, “Layerwise training of deep networks using kernel similarity,” arXiv preprint arXiv:1703.07115, 2017.
 [29] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evolutionary computation, vol. 10, no. 2, pp. 99–127, 2002.
 [30] S. Kiranyaz, T. Ince, A. Yildirim, and M. Gabbouj, “Evolutionary artificial neural networks by multidimensional particle swarm optimization,” Neural networks, vol. 22, no. 10, pp. 1448–1462, 2009.
 [31] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
 [32] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
 [33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
 [34] D. T. Tran, A. Iosifidis, and M. Gabbouj, “Improving efficiency in convolutional neural network with multilinear filters,” arXiv preprint arXiv:1709.09902, 2017.
 [35] E. Haselsteiner and G. Pfurtscheller, “Using timedependent neural networks for eeg classification,” IEEE transactions on rehabilitation engineering, vol. 8, no. 4, pp. 457–463, 2000.
 [36] S. J. Qin, “Neural networks for intelligent sensors and controlâpractical issues and some solutions,” in Neural Systems for Control, pp. 213–234, Elsevier, 1997.
 [37] P. J. Antsaklis, “Neural networks for control systems,” IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 242–244, 1990.
 [38] D. Serre, “Matrices, volume 216 of graduate texts in mathematics,” 2002.
 [39] K. Banerjee, “Generalized inverse of matrices and its applications,” 1973.
 [40] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
 [41] X. Liu, S. Lin, J. Fang, and Z. Xu, “Is extreme learning machine feasible? a theoretical assessment (part i),” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 7–20, 2015.
 [42] P. Courrieu, “Fast computation of moorepenrose inverse matrices,” arXiv preprint arXiv:0804.4809, 2008.
 [43] Y.H. Pao, G.H. Park, and D. J. Sobajic, “Learning and generalization characteristics of the random vector functionallink net,” Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
 [44] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and R. Johannes, “Using the adap learning algorithm to forecast the onset of diabetes mellitus,” in Proceedings of the Annual Symposium on Computer Application in Medical Care, p. 261, American Medical Informatics Association, 1988.
 [45] J. C. Niebles, C.W. Chen, and L. FeiFei, “Modeling temporal structure of decomposable motion segments for activity classification,” in European conference on computer vision, pp. 392–405, Springer, 2010.
 [46] P. Horton and K. Nakai, “A probabilistic classification system for predicting the cellular localization sites of proteins.,” in Ismb, vol. 4, pp. 109–115, 1996.
 [47] S. Hadfield and R. Bowden, “Hollywood 3d: Recognizing actions in 3d natural scenes,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3398–3405, IEEE, 2013.
 [48] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 3551–3558, IEEE, 2013.
 [49] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [50] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., “Deep face recognition.,” in BMVC, vol. 1, p. 6, 2015.
 [51] T.S. Lim, W.Y. Loh, and Y.S. Shih, “A comparison of prediction accuracy, complexity, and training time of thirtythree old and new classification algorithms,” Machine learning, vol. 40, no. 3, pp. 203–228, 2000.
 [52] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Computer vision and pattern recognition, 2006 IEEE computer society conference on, vol. 2, pp. 2169–2178, IEEE, 2006.
 [53] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 413–420, IEEE, 2009.
 [54] L. FeiFei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Computer vision and Image understanding, vol. 106, no. 1, pp. 59–70, 2007.
 [55] G. Griffin, A. Holub, and P. Perona, “Caltech256 object category dataset,” 2007.
 [56] N. Pinto, Z. Stone, T. Zickler, and D. Cox, “Scaling up biologicallyinspired computer vision: A case study in unconstrained face recognition on facebook,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 35–42, IEEE, 2011.
 [57] X. Zhang, L. Zhang, X.J. Wang, and H.Y. Shum, “Finding celebrities in billions of web images,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 995–1007, 2012.