A Low Complexity Decentralized Neural Net with Centralized Equivalence using Layerwise Learning
Abstract
We design a low complexity decentralized learning algorithm to train a recently proposed large neural network in distributed processing nodes (workers). We assume the communication network between the workers is synchronized and can be modeled as a doubly-stochastic mixing matrix without having any master node. In our setup, the training data is distributed among the workers but is not shared in the training process due to privacy and security concerns. Using the alternating direction method of multipliers (ADMM) along with a layerwise convex optimization approach, we propose a decentralized learning algorithm which enjoys low computational complexity and low communication cost among the workers. We show that it is possible to achieve equivalent learning performance as if the data is available in a single place. Finally, we experimentally illustrate the time complexity and convergence behavior of the algorithm.
I Introduction
Decentralized machine learning has received high interest in signal processing, machine learning, and data analysis. In a decentralized setup, the training dataset is not in one place but distributed among several workers (or processing nodes). Due to physical limitations, the workers are connected through a communication network, which is often represented as a graph in the machine learning and signal processing fields. In such a communication network, data privacy and security among the workers are the main concerns in developing a decentralized learning algorithm. To this end, the following three aspects are of particular interest for a decentralized machine learning setup:

Workers are not allowed to share data, and there exists no master node that has access to all workers.

The objective is to achieve the same performance as that of a centralized setup.

The learning algorithm should have a low computational complexity and communication overhead to efficiently handle large scale data.
In this article, we develop a decentralized neural network for a classification problem to address these three aspects. The decentralized neural network is based on a recently proposed neural network called self-size estimating feedforward neural network (SSFN) [31]. The SSFN is a multilayer feedforward neural network that can estimate its size, meaning that the network automatically finds the necessary number of neurons and layers to achieve a certain performance. SSFN uses a rectified linear unit (ReLU) activation function and a special structure on the weight matrices. The weight matrices have two parts: one part is learned during the optimization process and the other part is predetermined as a random matrix instance. The weight matrices are learned using a series of convex optimization problems in a layerwise fashion. The combination of layerwise learning and the use of random matrices enables SSFN to be trained with a low computational requirement. Besides, the layerwise nature of the training process leads to a significant reduction of communication overhead in decentralized learning compared to gradient-based methods. Note that the SSFN does not use gradient-based methods, such as backpropagation, and hence does not require high computational resources. It is shown in [31] that further optimization of the weight matrices in SSFN using backpropagation does not lead to significant performance improvement.
Our contribution is to develop a decentralized neural network, using the architecture and learning approach of SSFN, that provides low computation and communication costs. We refer to this as decentralized SSFN (dSSFN) throughout the article. We use the alternating direction method of multipliers (ADMM) [6] for finding a decentralized solution of the layerwise convex optimization in dSSFN. Note that, similar to [31], a decentralized estimation of the size of SSFN is possible in our framework as well, at the expense of higher complexity. In this article, we focus on training a fixed-size SSFN over a synchronous communication network. To seek consensus among the workers, we assume the communication network can be modeled by a doubly-stochastic mixing matrix. We conduct experiments for a circular network topology, while our approach remains valid for other sparse and connected communication networks as well. By systematically increasing the network connections between the workers, we investigate the trade-off between training time and the number of network connections. Besides, we experimentally show the convergence behavior of dSSFN throughout the layers and compare its classification performance against centralized SSFN for several well-known datasets.
I-A Literature Review
An enormous literature on distributed learning for large-scale data using huge computational resources has emerged in recent years [9, 12, 17, 2]. The most prominent work in this area is the DistBelief framework, which employs model parallelism techniques to use thousands of computing clusters to train a large neural network [9]. However, there is a growing need to develop algorithms that require less computational and communication resources. The use cases of such algorithms are internet-of-things, vehicular communication, sensor networks, etc. [22, 15].
One popular approach to developing cost-efficient algorithms is to use variants of gradient descent for distributed training of large neural networks. Stochastic gradient descent (SGD) and its variants, e.g., stochastic variance reduced gradient (SVRG), are designed to reduce the computational complexity of each iteration compared to vanilla gradient descent [18]. Although these schemes are computationally efficient, they may significantly increase the communication complexity of the training process [4]. In particular, these approaches require a much larger number of iterations to ensure convergence to the true solution, and therefore, the number of information exchanges between the master node and each worker is potentially high.
This challenge has attracted wide attention in recent years. The approaches that try to address this issue can be seen as two different classes of algorithms. In the first class, a lossy quantization of the parameters and their gradients is employed to mitigate the huge communication burden, at the cost of more iterations compared to the unquantized scheme. Some recent studies show that by carefully designing the quantizer at every step, it is possible to maintain the convergence speed of vanilla gradient descent [?, 33]. The second class of algorithms removes the requirement for master nodes to communicate with all workers at some iterations. In this way, the communication burden can be reduced at the cost of an increased local computational complexity [8]. All of the above works investigate developing a cost-efficient algorithm in a master-slave topology and require the communication to be synchronized.
Another widely studied algorithm for distributed optimization is the alternating direction method of multipliers (ADMM) and its variants. This class of algorithms has been studied via augmented Lagrangian methods or via operator-theoretical frameworks [6, 32, 29, 3]. It gives more flexibility regarding the underlying topology and the required assumptions on the communication links, e.g., synchronous and lossless communication. For example, [29] provides a framework for asynchronous updates of multiple workers under the assumption of having reliable communication links. [3] extends this result and proposes a relaxed ADMM algorithm for asynchronous updates over lossy peer-to-peer networks and provides linear convergence to a neighborhood of the true solution. In contrast, the only gradient-based method that can deal with packet loss and partially-asynchronous updates is [1], which implicitly requires the workers to use a synchronized step-size [3]. Thus, we choose ADMM as a different optimization approach to develop a cost-efficient distributed learning algorithm that gives us more flexibility regarding the underlying topology.
There are several works for training artificial neural networks based on non-gradient algorithms [35, 14, 7, 24]. [35] provides an ADMM-based method for joint training of all layers of a neural network. A fast yet effective architecture is the random vector functional link (RVFL) network, which chooses some of its parameters randomly between the input layer and the hidden layer while keeping direct links from the input layer to the output layer [27]. From the evaluations of RVFL networks, it is observed that the non-iterative nature of RVFL leads to faster learning algorithms and low computational complexity in the distributed scenario [36, 30]. A variant of RVFL is the extreme learning machine (ELM), which removes the direct link between the input layer and the output layer while providing competitive performance with low complexity in different applications [23, 37, 28]. There have been several efforts to learn an ELM in a distributed manner. For example, He et al. [13] employ the advantages of MapReduce [10] to propose a distributed extreme learning machine scenario. We find a recent work [26] where ADMM is used to achieve the equivalent solution of the centralized ELM. In most of these works, it is assumed that every node in the network is fully connected to all other nodes. In this article, we investigate the network model in which every node has access to a limited number of neighbors.
There exist works to develop deep randomized neural networks based on RVFL and its variants [19, 34, 31]. They are shown to be capable of providing high-quality performance while keeping the computational complexity low. Katuwal et al. [19] use stacked autoencoders to construct a multilayer RVFL network that obtains favorable performance with low computational complexity. Tang et al. [34] propose the hierarchical ELM (H-ELM), which contains a multilayer forward encoding part followed by the original ELM-based regression. The recent work by Chatterjee et al. [31] introduces a multilayer ELM-based architecture called self size-estimating feedforward neural network (SSFN). SSFN can estimate its size and guarantees the training error of the network to be decreasing as the number of layers increases. This is achieved using the lossless flow property [31] and solving a constrained least-squares problem using ADMM at each layer.
In this article, we investigate the prospect of SSFN in a decentralized scenario over synchronous communication networks. The layerwise nature of SSFN and the use of random weights make SSFN an appealing option for low complexity design in distributed and online learning frameworks. Besides, the use of ADMM allows us to implement a decentralized SSFN with centralized equivalence [6], while paving the way for extending this result to asynchronous and lossy communication networks [29, 3] in our future studies.
II Decentralized SSFN
We begin this section with a decentralized problem formulation for a feedforward neural network. Then, we briefly explain the architecture and learning of (centralized) SSFN followed by decentralization in synchronous communication networks. Finally, we show a comparison with a decentralized gradient descent algorithm.
II-A Problem formulation
In a supervised learning problem, let $(x, t)$ be a pairwise form of data vector $x$ that we observe and target vector $t$ that we wish to infer. Let $x \in \mathbb{R}^{P}$ and $t \in \mathbb{R}^{Q}$. The target vector can be a categorical variable for a classification problem with $Q$ classes. Let us construct a feedforward neural network with $L$ layers, and $n_l$ hidden neurons in the $l$'th layer. We denote the weight matrix for the $l$'th layer by $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$, with $n_0 = P$. For an input vector $x$, a feedforward neural network produces a mapping from input data to the feature vector $y \in \mathbb{R}^{n_L}$ in its last layer. The feature vector depends on the parameters $\theta \triangleq \{W_l\}_{l=1}^{L}$ as $y = f(x; \theta)$. Then we use a linear transformation to generate the target prediction as $\tilde{t} = O y$, where $O \in \mathbb{R}^{Q \times n_L}$ is the output matrix. We assume that there exists no parameter to optimize in the activation functions as they are predefined and fixed. A feedforward neural network has the following form
$$ y = f(x; \theta) = g( W_L \, g( W_{L-1} \cdots g( W_1 x ) ) ), $$
where $g(\cdot)$ denotes the nonlinear transform function that uses a scalar-wise activation function, for example ReLU. The feedforward neural network signal flow follows sequential use of linear transform (LT) and nonlinear transform (NLT).
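As a minimal illustration of this LT/NLT signal flow, consider the following Python/NumPy sketch; the dimensions and random weights below are arbitrary assumptions for illustration, not values from the paper:

```python
import numpy as np

def relu(z):
    # scalar-wise ReLU activation g(.)
    return np.maximum(z, 0.0)

def forward(x, weights):
    """Feedforward mapping y = g(W_L g(W_{L-1} ... g(W_1 x)))."""
    y = x
    for W in weights:
        y = relu(W @ y)  # linear transform (LT) followed by nonlinear transform (NLT)
    return y

# hypothetical sizes: P = 4 inputs, two layers with n = 6 hidden neurons each
rng = np.random.default_rng(0)
weights = [rng.standard_normal((6, 4)), rng.standard_normal((6, 6))]
y = forward(rng.standard_normal(4), weights)
```

The final feature vector $y$ is nonnegative because the last operation applied is the ReLU.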
Suppose that we have a training dataset $\mathcal{D} = \{(x^{(i)}, t^{(i)})\}_{i=1}^{N}$ of $N$ samples. The training dataset is distributed over $J$ nodes in a decentralized setup as $\mathcal{D} = \cup_{j=1}^{J} \mathcal{D}_j$, where $\mathcal{D}_j$ denotes the dataset for the $j$'th node. We assume that $\mathcal{D}_j \cap \mathcal{D}_{j'} = \emptyset$ for $j \neq j'$. The dataset $\mathcal{D}_j$ is comprised of $N_j$ samples such that $\sum_{j=1}^{J} N_j = N$.
The output of the feedforward neural network for the $j$'th node has the form $\tilde{t}_j = O_j f(x; \theta_j)$. The training cost for the $j$'th node is defined as
$$ C_j(\theta_j, O_j) = \sum_{(x, t) \in \mathcal{D}_j} \| t - O_j f(x; \theta_j) \|^2, \qquad (1) $$
where $\|\cdot\|$ denotes the $\ell_2$-norm of a vector. The total cost for the training dataset over all nodes is $C = \sum_{j=1}^{J} C_j(\theta_j, O_j)$. The decentralized learning problem is
$$ \min_{\{\theta_j, O_j\}} \; \sum_{j=1}^{J} C_j(\theta_j, O_j) \quad \text{s.t.} \quad \theta_j = \theta, \; O_j = O \;\; \forall j, \quad \| W_l \|_F \le \epsilon_l \;\; \forall l, \quad \| O \|_F \le \epsilon_o, \qquad (2) $$
where the constraints $\theta_j = \theta$ and $O_j = O$ ensure that we have the same parameters for the set of neural networks across all nodes. The constraints $\| W_l \|_F \le \epsilon_l$ and $\| O \|_F \le \epsilon_o$ are for regularization of parameters to avoid overfitting to the training dataset. Note that the constraints $\theta_j = \theta$ and $O_j = O$ lead to the case that the decentralized problem (2) is exactly equivalent to the following centralized problem
$$ \min_{\theta, O} \; \sum_{(x, t) \in \mathcal{D}} \| t - O f(x; \theta) \|^2 \quad \text{s.t.} \quad \| W_l \|_F \le \epsilon_l \;\; \forall l, \quad \| O \|_F \le \epsilon_o, \qquad (3) $$
if the problem (3) has a unique solution. It is well known that the above optimization problem is nonconvex with respect to its parameters, and a learning algorithm will generally provide a suboptimal solution as a local minimum.
II-B Centralized SSFN
To design decentralized SSFN, we briefly discuss SSFN in this section for completeness. Details can be found in [31]. SSFN is a feedforward neural network whose design requires low computational complexity. The architecture of SSFN with its signal flow diagram is shown in Figure 1.
While the work of [31] developed the SSFN architecture that learns its parameters and the size of the network automatically, we work with a fixed-size SSFN and learn its parameters. Note that our proposed method remains valid for estimating the size at the cost of higher complexity. The number of layers $L$ and the hidden neurons $n_l$ for the $l$'th layer are fixed a priori. For simplicity, we assume that all layers have the same number of hidden neurons, which means $n_l = n \;\; \forall l$.
The SSFN addresses the optimization problem (3) in a suboptimal manner. The SSFN parameters $\theta$ and $O$ are learned layer-by-layer in a sequential forward learning approach. The feature vector of the $l$'th layer is constructed as follows
$$ y_l = g( W_l y_{l-1} ), \quad y_0 = x. \qquad (4) $$
The layer-by-layer sequential learning approach starts by optimizing layer number $l = 1$, and then the new layers are added and optimized one-by-one until we reach $l = L$. Let us first assume that we have an $l$-layer network. The $(l+1)$-layer network will be built on an optimized $l$-layer network. We define $\theta_l \triangleq \{W_m\}_{m=1}^{l}$. For designing the $(l+1)$-layer network given the $l$-layer network, the steps of finding the parameter $W_{l+1}$ are as follows:

For all the samples in the training dataset $\mathcal{D}$, we compute $y_l^{(i)} = g( W_l y_{l-1}^{(i)} ), \; i = 1, 2, \ldots, N$.

Using the samples $\{ (y_l^{(i)}, t^{(i)}) \}_{i=1}^{N}$ we define a training cost
$$ C_l(O_l) = \sum_{i=1}^{N} \| t^{(i)} - O_l y_l^{(i)} \|^2. \qquad (5) $$
We compute the optimal output matrix $O_l^{\star}$ by solving the convex optimization problem
$$ O_l^{\star} = \arg\min_{O_l} \; C_l(O_l) \quad \text{s.t.} \quad \| O_l \|_F \le \epsilon_l. \qquad (6) $$
It is shown in [31] how the regularization parameters $\epsilon_l$ can be chosen. Note that $O_L$ is a $Q \times n_L$ dimensional matrix, and every $O_l$ for $l < L$ is a $Q \times n_l$ dimensional matrix.

We form the weight matrix for the $(l+1)$'th layer as
$$ W_{l+1} = \begin{bmatrix} V_Q O_l^{\star} \\ R_{l+1} \end{bmatrix}, \qquad (7) $$
where $V_Q$ is a fixed matrix of dimension $2Q \times Q$, $O_l^{\star}$ is learned by the convex optimization (6), and $R_{l+1}$ is an instance of a random matrix. The matrix $V_Q O_l^{\star}$ is $2Q \times n_l$ dimensional, and $R_{l+1}$ is $(n_{l+1} - 2Q) \times n_l$ dimensional. Note that we only learn $O_l^{\star}$ to form $W_{l+1}$. We do not learn $R_{l+1}$ as it is prefixed before the training of SSFN. After constructing the weight matrix according to (7), the $(l+1)$-layer network is
$$ y_{l+1} = g( W_{l+1} y_l ). \qquad (8) $$
It is shown in [31] that the three steps mentioned above guarantee a monotonically decreasing cost with increasing layer number $l$. The monotonically decreasing cost is the key to addressing the optimization problem (3) as we continue to add new layers one-by-one and set the weight matrix of every layer using (7). It was experimentally shown (see Table 5 of [31]) that the use of gradient search (backpropagation) for further optimization of parameters in SSFN could not provide any noticeable performance improvement. Note that backpropagation-based optimization requires a significant computational complexity compared to the proposed layerwise approach.
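The three steps above can be sketched in Python/NumPy as follows. This is a simplified centralized sketch: the constrained problem (6) is approximated here by a ridge-regularized least squares followed by projection onto the Frobenius-norm ball, whereas [31] solves (6) with ADMM; the ridge constant, the scaling of the random part, and all dimensions are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def project_fro(A, eps):
    # projection onto the set {A : ||A||_F <= eps}
    norm = np.linalg.norm(A)
    return A if norm <= eps else (eps / norm) * A

def solve_output_matrix(T, Y, eps, ridge=1e-2):
    # surrogate for problem (6): ridge least squares, then projection
    m = Y.shape[0]
    O = T @ Y.T @ np.linalg.inv(Y @ Y.T + ridge * np.eye(m))
    return project_fro(O, eps)

def train_ssfn(X, T, L, n, eps, rng):
    """Layer-wise construction of a fixed-size SSFN (centralized sketch).
    X: P x N inputs, T: Q x N targets; requires n > 2Q."""
    Q = T.shape[0]
    V_Q = np.vstack([np.eye(Q), -np.eye(Q)])        # fixed 2Q x Q part of (7)
    Y = X                                           # layer-0 features are the inputs
    weights = []
    for l in range(L):
        O = solve_output_matrix(T, Y, eps)          # step 2: output matrix O_l
        R = rng.standard_normal((n - 2 * Q, Y.shape[0])) / np.sqrt(Y.shape[0])
        W = np.vstack([V_Q @ O, R])                 # step 3: structured weight matrix (7)
        Y = relu(W @ Y)                             # step 1 for the next stage
        weights.append(W)
    return weights, solve_output_matrix(T, Y, eps)  # final output matrix
```

Only the top $2Q$ rows of each weight matrix depend on learned quantities; the random part is drawn once and never updated.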
II-C Decentralized SSFN for Synchronous Communication
We now focus on developing decentralized SSFN (dSSFN) where information exchange between nodes follows synchronous communications. The main task is finding a decentralized solution of the convex optimization problem (6). We recast the optimization problem (6) in the following form
$$ \min_{O_l, Z_l} \; \sum_{i=1}^{N} \| t^{(i)} - O_l y_l^{(i)} \|^2 \quad \text{s.t.} \quad O_l = Z_l, \; \| Z_l \|_F \le \epsilon_l, \qquad (9) $$
where $Z_l$ is an auxiliary variable. We use matrix notation henceforth for simplicity. For the $j$'th node on the graph, we define the following matrices: $T_j$ is a $Q \times N_j$ dimensional matrix comprising of the column vectors $t^{(i)}$, $X_j$ is a $P \times N_j$ dimensional matrix comprising of the column vectors $x^{(i)}$, and $Y_{l,j}$ is an $n \times N_j$ dimensional matrix comprising of the column vectors $y_l^{(i)}$ in the $l$'th layer. The matrices $T_j$, $X_j$, and $Y_{l,j}$ correspond to the dataset $\mathcal{D}_j$. Using the matrix notation, the optimization problem (9) can be written as
$$ \min_{\{O_{l,j}\}, Z_l} \; \sum_{j=1}^{J} \| T_j - O_{l,j} Y_{l,j} \|_F^2 \quad \text{s.t.} \quad O_{l,j} = Z_l \;\; \forall j, \; \| Z_l \|_F \le \epsilon_l, \qquad (10) $$
where $Z_l$ is an auxiliary variable shared by all nodes. By using the alternating direction method of multipliers (ADMM) [6], we break it into three subproblems: a local update of $O_{l,j}$ at every node, a global update of $Z_l$, and an update of the dual variables. Here, $\mu_l$ is the Lagrangian parameter of ADMM in the $l$'th layer, and $\Lambda_j$ is the scaled dual variable at node $j$. The ADMM iterations are:
$$ O_{l,j}^{k+1} = \arg\min_{O} \; \| T_j - O Y_{l,j} \|_F^2 + \frac{\mu_l}{2} \| O - Z_l^{k} + \Lambda_j^{k} \|_F^2, $$
$$ Z_l^{k+1} = \mathcal{P}_{\epsilon_l} \Big( \frac{1}{J} \sum_{j=1}^{J} \big( O_{l,j}^{k+1} + \Lambda_j^{k} \big) \Big), \qquad (11) $$
$$ \Lambda_j^{k+1} = \Lambda_j^{k} + O_{l,j}^{k+1} - Z_l^{k+1}, $$
where $k$ denotes the iteration for ADMM, and $\mathcal{P}_{\epsilon_l}$ performs projection onto the space of matrices with Frobenius norm less than or equal to $\epsilon_l$. The operation $\mathcal{P}_{\epsilon}$ is defined as
$$ \mathcal{P}_{\epsilon}(A) = \begin{cases} A, & \| A \|_F \le \epsilon, \\ \epsilon A / \| A \|_F, & \text{otherwise}. \end{cases} $$
For the $k$'th iteration of ADMM, it is required that the average quantity $\frac{1}{J} \sum_{j=1}^{J} ( O_{l,j}^{k+1} + \Lambda_j^{k} )$ be available to every node. This average can be found by seeking consensus over the graph. It can be easily seen that if the graph topology is modeled as a doubly-stochastic matrix, it is possible to achieve consensus across all nodes by a sufficiently large number of exchanges throughout the network [5]. The main steps of decentralized SSFN are shown in Algorithm 1.
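A compact simulation of these iterations, assuming perfect synchronization, might look as follows in Python/NumPy. The consensus averaging is idealized by repeated multiplication with the mixing matrix; the penalty parameter, iteration counts, and gossip rounds below are hypothetical choices, not the values used in the paper's experiments.

```python
import numpy as np

def consensus_average(mats, A, rounds):
    """Approximate the network-wide average by gossiping with mixing matrix A."""
    M = np.stack(mats)                     # J x Q x n stack of local copies
    for _ in range(rounds):
        M = np.tensordot(A, M, axes=1)     # each node mixes its neighbors' copies
    return M                               # every slice approaches the true average

def admm_layer(Ts, Ys, A, eps, mu=1.0, iters=50, gossip=20):
    """Decentralized ADMM iterations (11) for the layer-wise problem (10).
    Ts[j]: Q x N_j local targets, Ys[j]: n x N_j local features at node j."""
    J, Q, n = len(Ts), Ts[0].shape[0], Ys[0].shape[0]
    O = [np.zeros((Q, n)) for _ in range(J)]    # local O_{l,j}
    Lam = [np.zeros((Q, n)) for _ in range(J)]  # scaled dual variables
    Z = np.zeros((Q, n))
    for _ in range(iters):
        for j in range(J):                      # local least-squares update
            O[j] = (Ts[j] @ Ys[j].T + (mu / 2) * (Z - Lam[j])) @ \
                   np.linalg.inv(Ys[j] @ Ys[j].T + (mu / 2) * np.eye(n))
        # Z-update: each node obtains the average of O_j + Lambda_j by consensus
        avg = consensus_average([O[j] + Lam[j] for j in range(J)], A, gossip)[0]
        norm = np.linalg.norm(avg)              # projection onto the Frobenius ball
        Z = avg if norm <= eps else (eps / norm) * avg
        for j in range(J):                      # dual update
            Lam[j] = Lam[j] + O[j] - Z
    return Z                                    # consensus estimate of O_l
```

When the local datasets are consistent and the constraint is inactive, $Z$ converges to the same output matrix a centralized least-squares solver would produce, which is the centralized equivalence exploited in dSSFN.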
II-D Synchronous communication
To guarantee that every node learns the same SSFN structure with centralized equivalence, it is required to have synchronous communication and computation over the graph. This synchronized manner is also used for exchanging the local quantities $O_{l,j}^{k+1} + \Lambda_j^{k}$ in the $Z_l$ update in equation (11). After ADMM converges for all the nodes on the graph, we construct one more layer of SSFN and repeat until we learn the parameters for all the layers.
II-E Comparison with decentralized gradient search
We now present a comparison with distributed gradient descent for neural networks. While generally a powerful method, gradient descent has practical limitations due to a high computational complexity and communication overhead. Let us assume for simplicity that there are no regularization constraints on $\theta$ and $O$. Consider the weight matrix $W_l$ at the $l$'th layer of the neural network. The centralized gradient descent is
$$ W_l^{k+1} = W_l^{k} - \eta \sum_{j=1}^{J} \nabla_{W_l} C_j ( W_l^{k} ), \qquad (12) $$
where $k$ denotes the iteration for gradient search and $\eta$ is the step size of the algorithm. The centralized gradient descent can be done in the following decentralized manner:
$$ W_l^{k+1} = W_l^{k} - \eta J \, \Big( \frac{1}{J} \sum_{j=1}^{J} \nabla_{W_l} C_j ( W_l^{k} ) \Big). \qquad (13) $$
For the $k$'th iteration of gradient search, it is required that the average quantity $\frac{1}{J} \sum_{j=1}^{J} \nabla_{W_l} C_j ( W_l^{k} )$ be available to every node. This average can be found by seeking consensus over a communication graph. The communication property of such graphs can be modeled as a doubly-stochastic mixing matrix. Therefore, under the technical condition of consensus seeking, it is possible to realize a decentralized gradient search which is exactly the same as the centralized setup. Assume that we require $K_c$ iterations of information exchange to calculate an average quantity. Then, assuming that the gradient descent requires $K_g$ iterations to converge, we need $K_c K_g$ times of information exchange. In practice, $K_c$ is in the order of hundreds and $K_g$ is in the order of thousands. Since the matrix $\nabla_{W_l} C_j$ contains $n_l n_{l-1}$ scalars, the total information exchange for learning $\{ W_l \}_{l=1}^{L}$ is
$$ T_{\mathrm{GD}} = K_c K_g \sum_{l=1}^{L} n_l n_{l-1} \approx L K_c K_g n^2. \qquad (14) $$
In practice, this total information exchange may be very large and lead to a high communication load. Further, as the sparsity level of the graph increases, the required number of information exchanges also increases, and that leads to a longer training time for gradient descent.
With this limitation of gradient descent, we take a different approach. We use a structured neural network where parameters are learned using ADMM to solve a convex optimization problem. The use of ADMM allows fast and efficient optimization in the decentralized scenario.
We now quantify the communication load for decentralized SSFN. Let us assume that we require $K_c$ iterations of information exchange across the nodes to calculate an average quantity. Assuming that the ADMM requires $K_a$ iterations, we need $K_c K_a$ times of information exchange for learning $O_l^{\star}$ and forming $W_{l+1}$ according to equation (7). The submatrix $R_{l+1}$ in $W_{l+1}$ is an instance of a random matrix, and it is predefined across all nodes. In practice, $K_c$ and $K_a$ are both in the order of hundreds. The matrix $O_l$ has $Q n$ scalars. Hence, the total information exchange for learning $\{ O_l \}_{l=1}^{L}$ is
$$ T_{\mathrm{dSSFN}} = L K_c K_a Q n. \qquad (15) $$
The ratio of communication load between gradient descent and decentralized SSFN is
$$ \frac{ T_{\mathrm{GD}} }{ T_{\mathrm{dSSFN}} } = \frac{ K_g n^2 }{ K_a Q n } = \frac{ K_g }{ K_a } \cdot \frac{ n }{ Q } \gg 1, \qquad (16) $$
since in practice, we have $K_g \gg K_a$ and $n \gg Q$.
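As a back-of-the-envelope check of this ratio, the following snippet evaluates equations (14)-(16) with hypothetical orders of magnitude for the constants; the concrete values below are illustrative assumptions, not measured quantities:

```python
# hypothetical orders of magnitude, consistent with the discussion above
K_c = 200     # information exchanges needed per consensus average
K_g = 5000    # gradient-descent iterations (order of thousands)
K_a = 100     # ADMM iterations per layer (order of hundreds)
n, Q, L = 1000, 10, 20

total_gd = L * K_c * K_g * n * n      # communication load of gradient descent, eq. (14)
total_dssfn = L * K_c * K_a * Q * n   # communication load of dSSFN, eq. (15)
ratio = total_gd / total_dssfn        # eq. (16): (K_g / K_a) * (n / Q)
print(ratio)
```

For these example values the ratio is 5000, i.e., gradient search would exchange over three orders of magnitude more scalars than dSSFN.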
III Experimental Evaluation
In this section, we perform numerical experiments to evaluate the performance of the decentralized SSFN and compare it with centralized SSFN. We investigate how the training time varies versus the connectivity of the underlying network. We learn SSFN on a decentralized network with the following topology and properties.
Network topology
The decentralized learning approach we propose in this manuscript can be performed on any network topology whose mixing matrix can be modeled as a doubly-stochastic matrix. There are several common types of network topology representations, such as connected cycles, random geometric structures, and regular expander structures [11]. To perform a systematic study, we use a circular topology as the communication network with a doubly-stochastic mixing matrix in the experiments.
A circular network topology with $J$ nodes has a degree $d$ to represent its connectivity. We show examples of circular topology in Figure 2. A network with a degree $d$ has $d$ levels of connected cycles between the neighbors. This implies that each node in the network has connections with $d$ neighbors on the left and right sides, respectively. A network with a low degree is considered sparse in the sense of having much fewer connections. A low degree in a network limits the number of information exchanges and subsequently affects the convergence speed of a decentralized learning algorithm. It is expected that network consensus can be achieved faster if the degree of the graph increases.
The communication property over the network can be modeled as a doubly-stochastic matrix $A = [a_{uv}]$ in which $a_{uv}$ is the weight of importance that the $u$'th node assigns to the $v$'th node during parameter exchange. The doubly-stochastic matrix has the following property:
$$ \sum_{u=1}^{J} a_{uv} = 1, \quad \sum_{v=1}^{J} a_{uv} = 1, \quad a_{uv} \ge 0. $$
Here $\mathcal{N}_u$ refers to the set of neighbors with whom the $u$'th node is connected. Note that $a_{uv} = 0$ if $v \notin \mathcal{N}_u \cup \{u\}$. In this setup, we assume that there is no master node and no node is isolated either. For the sake of simplicity, in the following experiments, the doubly-stochastic mixing matrix is chosen in such a way that every node is connected to its neighbors with equal weights. That means we have $a_{uv} = \frac{1}{|\mathcal{N}_u| + 1}$, where $|\cdot|$ denotes the cardinality of the set. As we use circular network topology for the experiments, we have the relation
$$ |\mathcal{N}_u| = 2 d $$
for a graph with degree $d$.
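Under these assumptions, the equal-weight mixing matrix for a circular topology can be constructed and checked as follows (a Python/NumPy sketch; the values of $J$ and $d$ below are arbitrary examples):

```python
import numpy as np

def circular_mixing_matrix(J, d):
    """Equal-weight doubly-stochastic mixing matrix for a circular topology:
    node u is connected to d neighbors on each side, so a_{uv} = 1 / (2d + 1)
    for v in N_u union {u}, and zero otherwise."""
    A = np.zeros((J, J))
    for u in range(J):
        for k in range(-d, d + 1):           # self plus d neighbors on each side
            A[u, (u + k) % J] = 1.0 / (2 * d + 1)
    return A

A = circular_mixing_matrix(J=8, d=2)
# rows and columns both sum to one, so repeated mixing drives every node's
# local copy toward the network-wide average, as required for consensus
```

Because the matrix is symmetric, row-stochasticity immediately gives column-stochasticity, and the self-loop weight makes the induced Markov chain aperiodic, so powers of $A$ converge to the averaging matrix.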
III-A Classification tasks and datasets
We evaluate the decentralized SSFN for different classification tasks. The datasets that we use are briefly described in Table I. We use the $Q$-dimensional target vector in a classification task represented as a one-hot vector. A target vector has only one scalar component that is 1, and the other scalar components are zero.
Dataset | Training samples | Test samples | Input dimension | Classes
Vowel | 528 | 462 | 10 | 11
Satimage | 4435 | 2000 | 36 | 6
Caltech101 | 6000 | 3000 | 3000 | 102
Letter | 13333 | 6667 | 16 | 26
NORB | 24300 | 24300 | 2048 | 5
MNIST | 60000 | 10000 | 784 | 10
III-B Experimental setup
In all experiments, we fix the number of layers $L$ and the number of hidden neurons $n$ for each layer. We fix the number of nodes $J$ and uniformly divide the training dataset between the nodes. We also fix the number of ADMM iterations $K_a$ for each layer.
Dataset | Centralized SSFN: train acc. (%) / time (s) / test acc. (%) | Decentralized SSFN: train acc. (%) / time (s) / test acc. (%)
Vowel | 100±0.00 / 53.8 / 58.3±1.70 | 100±0.00 / 51.67 / 59.2±1.10
Satimage | 94.2±0.21 / 10.6 / 86.9±0.37 | 92.1±0.10 / 9.37 / 88.8±0.08
Caltech101 | 99.9±0.01 / 38.9 / 73.2±0.91 | 99.9±0.01 / 34.94 / 75.4±0.29
Letter | 99.4±0.02 / 19.5 / 91.8±0.23 | 98.9±0.03 / 17.64 / 92.5±0.22
NORB | 96.7±0.04 / 13.9 / 82.5±0.22 | 96.7±0.02 / 13.93 / 82.6±0.16
MNIST | 96.8±0.06 / 12.9 / 94.8±0.16 | 97.0±0.04 / 13.24 / 95.1±0.16
III-C Experimental results
We first show the performance of decentralized SSFN for a graph with a given degree $d$ compared with the centralized SSFN. The performances are shown in Table II. It can be seen that dSSFN provides similar performance to centralized SSFN for a proper choice of hyperparameters. The practical performance of the decentralized SSFN is affected by the choice of the hyperparameter $\mu_l$ and the number of ADMM iterations $K_a$. Choosing them properly guarantees that ADMM converges within $K_a$ iterations.
The convergence behavior of dSSFN throughout the layers is shown in Figure 3. The decentralized objective cost versus the total number of ADMM iterations across all layers is shown for the Satimage, Letter, and MNIST datasets. For each layer (every $K_a$ ADMM iterations), ADMM converges to a global solution of the optimization problem (10). Overall, it can be observed that the curves show a power-law behavior. Similar to SSFN, the objective cost converges as we increase the number of layers. Therefore, we can decide to stop the addition of new layers when we see that the cost has a convergence trend.
Figure 4 shows the training time for learning decentralized SSFN versus the network degree $d$ for the Satimage, Letter, and MNIST datasets. It is interesting to observe that the training time shows a transition jump in the middle range of $d$. There exists a threshold degree after which the learning mechanism in decentralized SSFN converges noticeably faster. The degree represents sparsity in the graph and, in turn, relates to privacy, security, and physical communication links. Our results imply that a suitable network degree helps to achieve a trade-off between graph sparsity and training time.
III-D Reproducible codes
Matlab codes of all the experiments described in this paper are available at https://sites.google.com/site/saikatchatt/. The datasets used for the experiments can be found at [25, 16, 21, 20].
IV Conclusion
We develop a decentralized multilayer neural network and show that it is possible to achieve centralized equivalence under some technical assumptions. While being suboptimal because of its layerwise nature, the proposed method is cost-efficient compared to general gradient-based methods in the sense of computation and communication complexities. We experimentally show the convergence behavior of dSSFN throughout the layers and observe a monotonically decreasing training cost as more layers are added. Besides, we inspect the time complexity of the algorithm under different network connectivity degrees. Our experiments show that dSSFN can provide centralized performance for a network with a high sparsity level in its connections. The proposed method is limited to network topologies with a doubly-stochastic mixing matrix and synchronized connections. Extending this result to asynchronous and lossy peer-to-peer networks by using relaxed ADMM approaches is a potential future direction.
References
 (2019) Distributed multiagent convex optimization over random digraphs. IEEE Transactions on Automatic Control. External Links: Document, ISSN 23343303 Cited by: §IA.
 (2018) Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235. External Links: 1804.03235 Cited by: §IA.
 (2019) Asynchronous distributed optimization over lossy networks via relaxed ADMM: stability and linear convergence. arXiv preprint arXiv:1901.09252. External Links: 1901.09252 Cited by: §IA, §IA.
 (2018) Optimization methods for largescale machine learning. SIAM Review 60 (2), pp. 223–311. External Links: Document Cited by: §IA.
 (2005) Gossip algorithms: design, analysis and applications. Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies 3, pp. 1653–1664. External Links: ISSN 0743166X Cited by: §IIC.
 (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), pp. 1–122. External Links: ISSN 19358237 Cited by: §IA, §IA, §I, §IIC.
 (2017) Progressive learning for systematic design of large neural networks. arXiv preprint arXiv:1710.08177. External Links: 1710.08177 Cited by: §IA.
 (2018) LAG: lazily aggregated gradient for communicationefficient distributed learning. Advances in Neural Information Processing Systems 31, pp. 5050–5060. Cited by: §IA.
 (2012) Large scale distributed deep networks. pp. 1223–1231. Cited by: §IA.
 (2008) MapReduce: simplified data processing on large clusters. Commun. ACM 51 (1), pp. 107–113. External Links: ISSN 00010782 Cited by: §IA.
 (2012) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606. External Links: Document, ISSN 00189286 Cited by: §III1.
 (2018) Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 116, pp. 1 – 8. Cited by: §IA.
 (2013) Parallel extreme learning machine for regression based on mapreduce. Neurocomputing 102, pp. 52 – 58. Cited by: §IA.
 (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489 – 501. Cited by: §IA.
 (2017) Machine learning paradigms for next-generation wireless networks. IEEE Wireless Communications 24 (2), pp. 98–105. External Links: Document, ISSN 15580687 Cited by: §IA.
 (2011) Learning a discriminative dictionary for sparse coding via label consistent K-SVD. CVPR 2011, pp. 1697–1704. External Links: Document, ISSN 10636919, Link Cited by: §IIID.
 (2016) How to scale distributed deep learning?. arXiv preprint arXiv:1611.04581. External Links: 1611.04581 Cited by: §IA.
 (2013) Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26, pp. 315–323. Cited by: §IA.
 (2019) Stacked autoencoder based deep random vector functional link neural network for classification. Applied Soft Computing 85, pp. 105854. External Links: ISSN 15684946, Document, Link Cited by: §IA.
 (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document, ISSN 00189219, Link Cited by: §IIID.
 (2004) Learning methods for generic object recognition with invariance to pose and lighting. CVPR 2004 2, pp. 97–104. External Links: Document, ISSN 10636919, Link Cited by: §IIID.
 (2018) Learning IoT in edge: deep learning for the internet of things with edge computing. IEEE Network 32 (1), pp. 96–101. External Links: Document, ISSN 1558156X Cited by: §IA.
 (2016) Extreme learning machine based transfer learning for data classification. Neurocomputing 174, pp. 203 – 210. Cited by: §IA.
 (2018) Distributed large neural network with centralized equivalence. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2976–2980. External Links: ISSN 2379190X Cited by: §IA.
 (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §IIID.
 (2017) Distributed extreme learning machine with alternating direction method of multiplier. Neurocomputing 261 (Supplement C), pp. 164 – 170. Cited by: §IA.
 (1992) Neuralnet computing and the intelligent control of systems. International Journal of Control 56 (2), pp. 263–289. External Links: Document, Link, https://doi.org/10.1080/00207179208934315 Cited by: §IA.
 (2016) An unsupervised discriminative extreme learning machine and its applications to data clustering. Neurocomputing 174, pp. 250 – 264. Cited by: §IA.
 (2016) ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing 38 (5), pp. A2851–A2879. Cited by: §IA, §IA.
 (2015) Distributed learning for random vector functionallink networks. Information Sciences 301, pp. 271 – 284. External Links: ISSN 00200255, Document, Link Cited by: §IA.
 (2019) SSFN: self sizeestimating feedforward network and low complexity design. arXiv preprint arXiv:1905.07111. External Links: 1905.07111 Cited by: §IA, §I, §I, item 2, §IIB, §IIB, §IIB.
 (2018) CoCoA: a general framework for communicationefficient distributed optimization. Journal of Machine Learning Research 18 (230), pp. 1–49. Cited by: §IA.
 (2019) Sparsified SGD with memory. arXiv preprint arXiv:1809.07599. External Links: 1809.07599 Cited by: §IA.
 (2016) Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems 27 (4), pp. 809–821. External Links: Document, ISSN 2162237X Cited by: §IA.
 (2016) Training neural networks without gradients: a scalable ADMM approach. Proceedings of The 33rd International Conference on Machine Learning 48, pp. 2722–2731. Cited by: §IA.
 (2016) A comprehensive evaluation of random vector functional link networks. Information Sciences 367–368, pp. 1094–1105. External Links: ISSN 00200255, Document, Link Cited by: §IA.
 (2016) Uncertain xml documents classification using extreme learning machine. Neurocomputing 174, pp. 375 – 382. Cited by: §IA.