A Low Complexity Decentralized Neural Net with Centralized Equivalence using Layer-wise Learning

Abstract

We design a low complexity decentralized learning algorithm to train a recently proposed large neural network in distributed processing nodes (workers). We assume that the communication network between the workers is synchronized and can be modeled as a doubly-stochastic mixing matrix without any master node. In our setup, the training data is distributed among the workers but is not shared in the training process due to privacy and security concerns. Using alternating-direction-method-of-multipliers (ADMM) along with a layer-wise convex optimization approach, we propose a decentralized learning algorithm that enjoys low computational complexity and low communication cost among the workers. We show that it is possible to achieve learning performance equivalent to the case where all data is available at a single place. Finally, we experimentally illustrate the time complexity and convergence behavior of the algorithm.

decentralized learning, neural network, ADMM, communication network

I Introduction

Decentralized machine learning has received significant interest in signal processing, machine learning, and data analysis. In a decentralized setup, the training dataset is not held in one place but is distributed among several workers (or processing nodes). Due to physical limitations, the workers are connected through a communication network, which is often represented as a graph in the machine learning and signal processing fields. In such a communication network, data privacy and security among the workers are the main concerns in developing a decentralized learning algorithm. To this end, the following three aspects are of particular interest for a decentralized machine learning setup:

  1. Workers are not allowed to share data, and there exists no master node that has access to all workers.

  2. The objective is to achieve the same performance as that of a centralized setup.

  3. The learning algorithm should have a low computational complexity and communication overhead to efficiently handle large scale data.

In this article, we develop a decentralized neural network for a classification problem to address these three aspects. The decentralized neural network is based on a recently proposed neural network called the self-size estimating feedforward neural network (SSFN) [31]. The SSFN is a multi-layer feedforward neural network that can estimate its own size, meaning that the network automatically finds the necessary number of neurons and layers to achieve a certain performance. SSFN uses a rectified-linear-unit (ReLU) activation function and a special structure on the weight matrices. The weight matrices have two parts: one part is learned during the optimization process and the other part is predetermined as a random matrix instance. The weight matrices are learned using a series of convex optimization problems in a layer-wise fashion. The combination of layer-wise learning and the use of random matrices enables SSFN to be trained with low computational requirements. Besides, the layer-wise nature of the training process leads to a significant reduction of communication overhead in decentralized learning compared to gradient-based methods. Note that the SSFN does not use gradient-based methods, such as backpropagation, and hence does not require high computational resources. It is shown in [31] that further optimization of the SSFN weight matrices using backpropagation does not lead to significant performance improvement.

Our contribution is to develop a decentralized neural network, using the architecture and learning approach of SSFN, that provides low computation and communication costs. We refer to it as decentralized SSFN (dSSFN) throughout the article. We use the alternating-direction-method-of-multipliers (ADMM) [6] to find a decentralized solution of the layer-wise convex optimization problems in dSSFN. Note that, similar to [31], a decentralized estimation of the size of SSFN is possible in our framework as well, at the expense of higher complexity. In this article, we focus on training a fixed-size SSFN over a synchronous communication network. To seek consensus among the workers, we assume that the communication network can be modeled by a doubly-stochastic mixing matrix. We conduct experiments for a circular network topology, while our approach remains valid for other sparse and connected communication networks as well. By systematically increasing the number of network connections between the workers, we investigate the trade-off between training time and the number of network connections. Besides, we experimentally show the convergence behavior of dSSFN throughout the layers and compare its classification performance against centralized SSFN for several well-known datasets.

I-a Literature Review

In recent years, an enormous literature has emerged on distributed learning for large-scale data using substantial computational resources [9, 12, 17, 2]. The most prominent work in this area is the DistBelief framework, which employs model parallelism techniques to use thousands of computing clusters to train a large neural network [9]. However, there is a growing need to develop algorithms that require less computational and communication resources. Use cases of such algorithms include internet-of-things, vehicular communication, and sensor networks [22, 15].

One popular approach to developing cost-efficient algorithms is to use variants of gradient descent for distributed training of large neural networks. Stochastic gradient descent (SGD) and its variants, e.g., stochastic variance reduced gradient (SVRG), are designed to reduce the computational complexity of each iteration compared to vanilla gradient descent [18]. Although these schemes are computationally efficient, they may significantly increase the communication complexity of the training process [4]. In particular, these approaches require a much larger number of iterations to ensure convergence to the true solution, and therefore the number of information exchanges between the master node and each worker is potentially high.

This challenge has attracted wide attention in recent years. The approaches that try to address this issue can be seen as two different classes of algorithms. In the first class, a lossy quantization of the parameters and their gradients is employed to mitigate the huge communication burden, at the cost of more iterations compared to the unquantized scheme. Some recent studies show that, by carefully designing the quantizer at every step, it is possible to maintain the convergence speed of vanilla gradient descent [33]. The second class of algorithms relaxes the requirement that the master node communicates with all workers at every iteration. In this way, the communication burden can be reduced at the cost of increased local computational complexity [8]. All of the above works investigate cost-efficient algorithms in a master-slave topology and require synchronized communication.

Another widely studied algorithm for distributed optimization is the alternating direction method of multipliers (ADMM) and its variants. This class of algorithms has been studied via augmented Lagrangian methods or via operator theoretical frameworks [6, 32, 29, 3]. It gives more flexibility regarding the underlying topology and the required assumptions on the communication links, e.g., synchronous and lossless communication. For example, [29] provides a framework for asynchronous updates of multiple workers under the assumption of reliable communication links. [3] extends this result and proposes a relaxed ADMM algorithm for asynchronous updates over lossy peer-to-peer networks, providing linear convergence to a neighborhood of the true solution. In contrast, the only gradient-based method that can deal with packet loss and partially-asynchronous updates is [1], which implicitly requires the workers to use a synchronized step-size [3]. Thus, we choose ADMM as a different optimization approach to develop a cost-efficient distributed learning algorithm that gives us more flexibility regarding the underlying topology.

There are several works on training artificial neural networks with non-gradient algorithms [35, 14, 7, 24]. [35] provides an ADMM-based method for joint training of all layers of a neural network. A fast yet effective architecture is the random vector functional link (RVFL) network, in which the parameters between the input layer and the hidden layer are chosen randomly while direct links from the input layer to the output layer are kept [27]. Evaluations of RVFL networks show that their non-iterative nature leads to fast learning algorithms and low computational complexity in the distributed scenario [36, 30]. A variant of RVFL is the extreme learning machine (ELM), which removes the direct link between the input layer and the output layer while providing competitive performance with low complexity in different applications [23, 37, 28]. There have been several efforts to learn an ELM in a distributed manner. For example, He et al. [13] employ MapReduce [10] to propose a distributed extreme learning machine. A recent work [26] uses ADMM to achieve a solution equivalent to the centralized ELM. Most of these works assume that every node in the network is connected to all other nodes. In this article, we investigate the network model in which every node has access to a limited number of neighbors.

There exist works that develop deep randomized neural networks based on RVFL and its variants [19, 34, 31]. They are shown to be capable of providing high-quality performance while keeping the computational complexity low. Katuwal et al. [19] use stacked autoencoders to construct a multi-layer RVFL network that obtains favorable performance at low computational complexity. Tang et al. [34] propose the hierarchical ELM (H-ELM), which contains a multi-layer forward encoding part followed by the original ELM-based regression. The recent work by Chatterjee et al. [31] introduces a multi-layer ELM-based architecture called the self size-estimating feed-forward neural network (SSFN). SSFN can estimate its size and guarantees that the training error of the network decreases as the number of layers increases. This is achieved using the lossless flow property [31] and solving a constrained least-squares problem using ADMM at each layer.

In this article, we investigate the prospect of SSFN in a decentralized scenario over synchronous communication networks. The layer-wise nature of SSFN and the use of random weights make SSFN an appealing option for low complexity design in distributed and online learning frameworks. Besides, the use of ADMM allows us to implement a decentralized SSFN with centralized equivalence [6], while paving the way for extending this result to asynchronous and lossy communication networks [29, 3] in our future studies.

II Decentralized SSFN

We begin this section with a decentralized problem formulation for a feedforward neural network. Then, we briefly explain the architecture and learning of (centralized) SSFN, followed by its decentralization over synchronous communication networks. Finally, we show a comparison with a decentralized gradient descent algorithm.

II-A Problem formulation

In a supervised learning problem, let $(\mathbf{x}, \mathbf{t})$ be a pair of a data vector $\mathbf{x}$ that we observe and a target vector $\mathbf{t}$ that we wish to infer. Let $\mathbf{x} \in \mathbb{R}^{P}$ and $\mathbf{t} \in \mathbb{R}^{Q}$. The target vector can be a categorical variable for a classification problem with $Q$ classes. Let us construct a feed-forward neural network with $L$ layers and $n_l$ hidden neurons in the $l$'th layer. We denote the weight matrix for the $l$'th layer by $\mathbf{W}_l$. For an input vector $\mathbf{x}$, a feed-forward neural network produces a mapping from the input data to the feature vector $\mathbf{y}$ in its last layer. The feature vector depends on the parameters $\{\mathbf{W}_l\}$ as $\mathbf{y} = \mathbf{f}(\mathbf{x}; \{\mathbf{W}_l\})$. Then we use a linear transformation to generate the target prediction as $\hat{\mathbf{t}} = \mathbf{O} \mathbf{y}$, where $\mathbf{O}$ is the output matrix. We assume that there exists no parameter to optimize in the activation functions, as they are predefined and fixed. A feed-forward neural network has the following form

$\mathbf{y} = \mathbf{f}(\mathbf{x}; \{\mathbf{W}_l\}) = \mathbf{g}\big(\mathbf{W}_L \, \mathbf{g}(\mathbf{W}_{L-1} \cdots \mathbf{g}(\mathbf{W}_1 \mathbf{x}))\big),$

where $\mathbf{g}(\cdot)$ denotes the non-linear transform function that uses a scalar-wise activation function, for example ReLU. The feedforward neural network signal flow follows sequential use of linear transform (LT) and non-linear transform (NLT).

Suppose that we have a $J$-sample training dataset $\mathcal{D} = \{(\mathbf{x}^{(j)}, \mathbf{t}^{(j)})\}_{j=1}^{J}$. The training dataset is distributed over $M$ nodes in a decentralized setup as $\mathcal{D} = \cup_{m=1}^{M} \mathcal{D}_m$, where $\mathcal{D}_m$ denotes the dataset for the $m$'th node. We assume that the local datasets are disjoint, that is, $\mathcal{D}_m \cap \mathcal{D}_{m'} = \emptyset$ for $m \neq m'$. The dataset $\mathcal{D}_m$ is comprised of $J_m$ samples such that $\sum_{m=1}^{M} J_m = J$.
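As a concrete illustration, the following minimal Python/NumPy sketch shows one way to split a $J$-sample dataset over $M$ workers so that the local shards are disjoint and their sizes sum to $J$. The function name and the uniform random split are our own illustrative assumptions, not part of the paper.

```python
# A minimal sketch of splitting a dataset across M workers (illustrative only).
import numpy as np

def partition_dataset(X, T, num_workers, seed=0):
    """X: P x J data matrix, T: Q x J target matrix. Returns a list of per-worker shards."""
    rng = np.random.default_rng(seed)
    J = X.shape[1]
    perm = rng.permutation(J)                   # shuffle sample indices
    shards = np.array_split(perm, num_workers)  # J_m samples per worker, sum of J_m equals J
    return [(X[:, idx], T[:, idx]) for idx in shards]
```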

The output of the feed-forward neural network for the $m$'th node has the form $\hat{\mathbf{t}}_m = \mathbf{O}_m \, \mathbf{f}(\mathbf{x}; \{\mathbf{W}_{l,m}\})$, where $\{\mathbf{W}_{l,m}\}$ and $\mathbf{O}_m$ denote the local parameters at node $m$. The training cost for the $m$'th node is defined as

$C_m\big(\{\mathbf{W}_{l,m}\}, \mathbf{O}_m\big) = \sum_{(\mathbf{x}, \mathbf{t}) \in \mathcal{D}_m} \big\| \mathbf{t} - \mathbf{O}_m \, \mathbf{f}(\mathbf{x}; \{\mathbf{W}_{l,m}\}) \big\|_2^2, \qquad (1)$

where $\|\cdot\|_2$ denotes the $\ell_2$-norm of a vector. The total cost for the training dataset over all nodes is $C = \sum_{m=1}^{M} C_m$. The decentralized learning problem is

$\min_{\{\mathbf{W}_{l,m}\}, \{\mathbf{O}_m\}} \ \sum_{m=1}^{M} C_m\big(\{\mathbf{W}_{l,m}\}, \mathbf{O}_m\big) \quad \text{s.t.} \quad \mathbf{W}_{l,m} = \mathbf{W}_l, \ \mathbf{O}_m = \mathbf{O} \ \forall l, m, \quad \|\mathbf{W}_l\|_F \le \epsilon_l, \ \|\mathbf{O}\|_F \le \epsilon, \qquad (2)$

where the constraints $\mathbf{W}_{l,m} = \mathbf{W}_l$ and $\mathbf{O}_m = \mathbf{O}$ ensure that we have the same parameters for the set of neural networks across all nodes. The constraints $\|\mathbf{W}_l\|_F \le \epsilon_l$ and $\|\mathbf{O}\|_F \le \epsilon$ are for regularization of the parameters to avoid overfitting to the training dataset. Note that the constraints $\mathbf{W}_{l,m} = \mathbf{W}_l$ and $\mathbf{O}_m = \mathbf{O}$ make the decentralized problem (2) exactly equivalent to the following centralized problem

$\min_{\{\mathbf{W}_l\}, \mathbf{O}} \ \sum_{(\mathbf{x}, \mathbf{t}) \in \mathcal{D}} \big\| \mathbf{t} - \mathbf{O} \, \mathbf{f}(\mathbf{x}; \{\mathbf{W}_l\}) \big\|_2^2 \quad \text{s.t.} \quad \|\mathbf{W}_l\|_F \le \epsilon_l, \ \|\mathbf{O}\|_F \le \epsilon, \qquad (3)$

if the problem (3) has a unique solution. It is well known that the above optimization problem is non-convex with respect to its parameters, and a learning algorithm will generally provide a suboptimal solution, i.e., a local minimum.

II-B Centralized SSFN

To design decentralized SSFN, we briefly discuss SSFN in this section for completeness; details can be found in [31]. SSFN is a feedforward neural network whose design requires low computational complexity. The architecture of SSFN with its signal flow diagram is shown in Figure 1.

Fig. 1: The architecture of a multi-layer SSFN with $L$ layers and its signal flow diagram. LT stands for linear transform (weight matrix) and NLT stands for non-linear transform (activation function). We use the ReLU activation function.

While the work of [31] developed the SSFN architecture that learns its parameters and the size of the network automatically, we work with a fixed-size SSFN and learn its parameters. Note that our proposed method remains valid for estimating the size at the cost of higher complexity. The number of layers $L$ and the number of hidden neurons $n_l$ for the $l$'th layer are fixed a-priori. For simplicity, we assume that all layers have the same number of hidden neurons, that is, $n_l = n$.

The SSFN addresses the optimization problem (3) in a suboptimal manner. The SSFN parameters $\{\mathbf{W}_l\}$ and $\mathbf{O}$ are learned layer-by-layer in a sequential forward learning approach. The feature vector of the $l$'th layer is constructed as follows

$\mathbf{y}_l = \mathbf{g}(\mathbf{W}_l \, \mathbf{y}_{l-1}), \quad \mathbf{y}_0 = \mathbf{x}. \qquad (4)$

The layer-by-layer sequential learning approach starts by optimizing layer number $l = 1$, and then new layers are added and optimized one-by-one until we reach $l = L$. Let us first assume that we have an $l$-layer network. The $(l+1)$-layer network will be built on the optimized $l$-layer network. We define $\mathbf{y}_0 = \mathbf{x}$. For designing the $(l+1)$-layer network given the $l$-layer network, the steps of finding the parameter $\mathbf{W}_{l+1}$ are as follows:

  1. For all the samples in the training dataset $\mathcal{D}$, we compute the features $\mathbf{y}_l^{(j)} = \mathbf{g}(\mathbf{W}_l \, \mathbf{y}_{l-1}^{(j)})$.

  2. Using the samples $\{(\mathbf{y}_l^{(j)}, \mathbf{t}^{(j)})\}_{j=1}^{J}$ we define a training cost

    $C_l(\mathbf{O}_l) = \sum_{j=1}^{J} \big\| \mathbf{t}^{(j)} - \mathbf{O}_l \, \mathbf{y}_l^{(j)} \big\|_2^2. \qquad (5)$

    We compute the optimal output matrix $\mathbf{O}_l^\star$ by solving the convex optimization problem

    $\mathbf{O}_l^\star = \arg\min_{\mathbf{O}_l} \ C_l(\mathbf{O}_l) \quad \text{s.t.} \quad \|\mathbf{O}_l\|_F \le \epsilon. \qquad (6)$

    It is shown in [31] how the regularization parameters can be chosen. Note that $\mathbf{O}_0$ is a $Q \times P$-dimensional matrix, and every $\mathbf{O}_l$ for $l \ge 1$ is a $Q \times n$-dimensional matrix.

  3. We form the weight matrix for the $(l+1)$'th layer

    $\mathbf{W}_{l+1} = \begin{bmatrix} \mathbf{V}_Q \, \mathbf{O}_l^\star \\ \mathbf{R}_{l+1} \end{bmatrix}, \qquad (7)$

    where $\mathbf{V}_Q = [\mathbf{I}_Q \ -\mathbf{I}_Q]^{T}$ is a fixed matrix of dimension $2Q \times Q$, $\mathbf{O}_l^\star$ is learned by the convex optimization (6), and $\mathbf{R}_{l+1}$ is an instance of a random matrix. The matrix $\mathbf{R}_1$ is $(n - 2Q) \times P$-dimensional, and every $\mathbf{R}_{l+1}$ for $l \ge 1$ is $(n - 2Q) \times n$-dimensional. Note that we only learn $\mathbf{O}_l^\star$ to form $\mathbf{W}_{l+1}$. We do not learn $\mathbf{R}_{l+1}$ as it is pre-fixed before the training of SSFN. After constructing the weight matrix according to (7), the $(l+1)$-layer network is

    $\mathbf{y}_{l+1} = \mathbf{g}(\mathbf{W}_{l+1} \, \mathbf{y}_l). \qquad (8)$

It is shown in [31] that the three steps mentioned above guarantee a monotonically decreasing cost as the layer number $l$ increases. The monotonically decreasing cost is key to addressing the optimization problem (3) as we continue to add new layers one-by-one and set the weight matrix of every layer using (7). It was experimentally shown (see Table 5 of [31]) that the use of gradient search (backpropagation) for further optimization of the SSFN parameters does not provide any noticeable performance improvement. Note that backpropagation-based optimization requires significant computational complexity compared to the proposed layer-wise approach.
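To make the three steps concrete, the following Python sketch builds a fixed-size SSFN in the centralized setting. It is our own illustrative reconstruction: the function names are hypothetical, and the norm-constrained problem (6) is replaced by a closed-form ridge-regularized least squares (with an assumed regularizer `lam`) as a simple stand-in for the ADMM solver used in [31].

```python
# A sketch of the centralized layer-wise SSFN construction (steps 1-3 above).
# Assumptions: ridge regression replaces the constrained problem (6); n_hidden > 2Q.
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def ridge_ls(Y, T, lam):
    """Stand-in for (6): argmin_O ||T - O Y||_F^2 + lam ||O||_F^2 (closed form)."""
    n = Y.shape[0]
    return T @ Y.T @ np.linalg.inv(Y @ Y.T + lam * np.eye(n))

def build_ssfn(X, T, num_layers, n_hidden, lam=1e-2, seed=0):
    """X: P x J data, T: Q x J targets. Returns the weight matrices and output matrix."""
    rng = np.random.default_rng(seed)
    Q = T.shape[0]
    VQ = np.vstack([np.eye(Q), -np.eye(Q)])     # fixed 2Q x Q part of (7)
    Y = X                                       # layer-0 features are the input, as in (4)
    O = ridge_ls(Y, T, lam)                     # output matrix for the current depth
    weights = []
    for _ in range(num_layers):
        R = rng.standard_normal((n_hidden - 2 * Q, Y.shape[0]))  # random part of (7)
        W = np.vstack([VQ @ O, R])              # weight matrix structured as in (7)
        Y = relu(W @ Y)                         # add one layer, as in (8)
        O = ridge_ls(Y, T, lam)                 # re-learn the output matrix, as in (5)-(6)
        weights.append(W)
    return weights, O
```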

II-C Decentralized SSFN for Synchronous Communication

We now focus on developing decentralized SSFN (dSSFN), where information exchange between nodes follows synchronous communication. The main task is finding a decentralized solution of the convex optimization problem (6). We recast the optimization problem (6) in the following form

$\min_{\{\mathbf{O}_{l,m}\}, \mathbf{Z}_l} \ \sum_{m=1}^{M} \sum_{(\mathbf{x}, \mathbf{t}) \in \mathcal{D}_m} \big\| \mathbf{t} - \mathbf{O}_{l,m} \, \mathbf{y}_l \big\|_2^2 \quad \text{s.t.} \quad \mathbf{O}_{l,m} = \mathbf{Z}_l \ \forall m, \quad \|\mathbf{Z}_l\|_F \le \epsilon, \qquad (9)$

where $\mathbf{Z}_l$ is an auxiliary variable. We use matrix notation henceforth for simplicity. For the $m$'th node on the graph, we define the following matrices: $\mathbf{T}_m$ is a $Q \times J_m$-dimensional matrix comprising the column vectors $\mathbf{t}^{(j)}$, $\mathbf{X}_m$ is a $P \times J_m$-dimensional matrix comprising the column vectors $\mathbf{x}^{(j)}$, and $\mathbf{Y}_{l,m}$ is an $n \times J_m$-dimensional matrix comprising the column vectors $\mathbf{y}_l^{(j)}$ in the $l$'th layer. The matrices $\mathbf{T}_m$, $\mathbf{X}_m$, and $\mathbf{Y}_{l,m}$ correspond to the dataset $\mathcal{D}_m$. Using the matrix notation, the optimization problem (9) can be written as

$\min_{\{\mathbf{O}_{l,m}\}, \mathbf{Z}_l} \ \sum_{m=1}^{M} \big\| \mathbf{T}_m - \mathbf{O}_{l,m} \, \mathbf{Y}_{l,m} \big\|_F^2 \quad \text{s.t.} \quad \mathbf{O}_{l,m} = \mathbf{Z}_l \ \forall m, \quad \|\mathbf{Z}_l\|_F \le \epsilon, \qquad (10)$

where $\mathbf{Z}_l$ is an auxiliary (consensus) variable. By using the alternating direction method of multipliers (ADMM) [6], we break it into three subproblems, given in (11) below.

Here, $\mu_l$ is the Lagrangian parameter of ADMM in the $l$'th layer, and $\boldsymbol{\Lambda}_m$ is the scaled dual variable at node $m$. The ADMM iterations are:

$\mathbf{O}_{l,m}^{k+1} = \arg\min_{\mathbf{O}} \ \big\| \mathbf{T}_m - \mathbf{O} \mathbf{Y}_{l,m} \big\|_F^2 + \mu_l \big\| \mathbf{O} - \mathbf{Z}_l^{k} + \boldsymbol{\Lambda}_m^{k} \big\|_F^2,$
$\mathbf{Z}_l^{k+1} = \Pi_{\epsilon}\Big( \tfrac{1}{M} \sum_{m=1}^{M} \big( \mathbf{O}_{l,m}^{k+1} + \boldsymbol{\Lambda}_m^{k} \big) \Big),$
$\boldsymbol{\Lambda}_m^{k+1} = \boldsymbol{\Lambda}_m^{k} + \mathbf{O}_{l,m}^{k+1} - \mathbf{Z}_l^{k+1}, \qquad (11)$

where $k$ denotes the iteration index of ADMM, and $\Pi_{\epsilon}(\cdot)$ performs projection onto the space of matrices with Frobenius norm less than or equal to $\epsilon$. The operation is defined as

$\Pi_{\epsilon}(\mathbf{Z}) = \begin{cases} \mathbf{Z}, & \|\mathbf{Z}\|_F \le \epsilon, \\ \epsilon \, \mathbf{Z} / \|\mathbf{Z}\|_F, & \text{otherwise}. \end{cases}$

For the $k$'th iteration of ADMM, it is required that the average quantity $\frac{1}{M} \sum_{m=1}^{M} (\mathbf{O}_{l,m}^{k+1} + \boldsymbol{\Lambda}_m^{k})$ be available at every node. This average can be found by seeking consensus over the graph. It can be easily seen that, if the graph topology is modeled by a doubly-stochastic matrix, it is possible to achieve consensus across all nodes with a sufficiently large number of exchanges throughout the network [5]. The main steps of decentralized SSFN are shown in Algorithm 1.

Input:

1:  Training dataset $\mathcal{D}_m$ for the $m$'th node
2:  Parameters to set: $L$, $n$, $\epsilon$, $\mu_l$, and the number of ADMM iterations $K_a$
3:  Set of random matrices $\{\mathbf{R}_l\}$ are generated and shared between all nodes

Initialization:

1:  $l = 0$ \hfill(Index for the $l$'th layer)

Progressive growth of layers:

1:  repeat
2:      $l \leftarrow l + 1$ \hfill(Increase layer number)
3:      $k = 0$ \hfill(Iteration index of ADMM)
4:     Compute $\mathbf{Y}_{l,m}$. Solve (6) in the decentralized form (10) to find $\mathbf{O}_l^\star$:
5:     repeat
6:        $k \leftarrow k + 1$
7:        Solve for $\mathbf{O}_{l,m}^{k}$ using (11)
8:        Find $\frac{1}{M} \sum_{m=1}^{M} (\mathbf{O}_{l,m}^{k} + \boldsymbol{\Lambda}_m^{k-1})$ using consensus over the graph
9:        Find $\mathbf{Z}_l^{k}$ and $\boldsymbol{\Lambda}_m^{k}$ by (11)
10:     until $k = K_a$
11:     Form weight matrix $\mathbf{W}_{l+1}$ using (7)
12:  until $l = L$
Algorithm 1 : Algorithm for learning decentralized SSFN
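The following Python sketch mirrors the inner ADMM loop of Algorithm 1 for one layer. It is an illustrative reconstruction of the scaled-form consensus ADMM behind (11): the closed-form local update, the hyperparameters `mu` and `eps`, and the fact that the network-wide average is computed exactly here (rather than by gossiping over the mixing matrix) are all simplifying assumptions of ours.

```python
# A per-layer sketch of the consensus ADMM iterations (11), simulated at a single machine.
import numpy as np

def project_frobenius_ball(Z, eps):
    """Projection onto {Z : ||Z||_F <= eps}."""
    norm = np.linalg.norm(Z, 'fro')
    return Z if norm <= eps else (eps / norm) * Z

def decentralized_output_matrix(Y_list, T_list, mu, eps, num_iters):
    """Y_list[m]: n x J_m local features, T_list[m]: Q x J_m local targets."""
    M = len(Y_list)
    Q, n = T_list[0].shape[0], Y_list[0].shape[0]
    Z = np.zeros((Q, n))                        # shared (consensus) variable
    Lam = [np.zeros((Q, n)) for _ in range(M)]  # scaled dual variable at each node
    for _ in range(num_iters):
        # local update at every node: regularized least squares in closed form
        O_loc = [(T_list[m] @ Y_list[m].T + mu * (Z - Lam[m]))
                 @ np.linalg.inv(Y_list[m] @ Y_list[m].T + mu * np.eye(n))
                 for m in range(M)]
        # Z-update: the average of (O_m + Lam_m) would be obtained by consensus in practice
        avg = sum(O_loc[m] + Lam[m] for m in range(M)) / M
        Z = project_frobenius_ball(avg, eps)
        # dual update at every node
        Lam = [Lam[m] + O_loc[m] - Z for m in range(M)]
    return Z
```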

II-D Synchronous communication

To guarantee that every node learns the same SSFN structure with centralized equivalence, synchronous communication and computation over the graph are required. This synchronized manner is also used for exchanging the local quantities $(\mathbf{O}_{l,m}^{k+1} + \boldsymbol{\Lambda}_m^{k})$ in the $\mathbf{Z}_l$-update of equation (11). After ADMM converges at all the nodes on the graph, we construct one more layer of SSFN and repeat until we learn the parameters of all the layers; a sketch of this progressive layer growth is given below.
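The sketch below shows the outer loop of this progressive layer growth, reusing `partition_dataset` and `decentralized_output_matrix` from the earlier sketches; the hyperparameter values and the shared random seed for generating the random submatrices are illustrative assumptions.

```python
# A sketch of the outer loop of Algorithm 1: grow the network one layer at a time.
# Assumes decentralized_output_matrix from the previous sketch is in scope.
import numpy as np

def train_dssfn(shards, num_layers, n_hidden, mu=1.0, eps=1e3, K_admm=100, seed=0):
    """shards: list of (X_m, T_m) per node, as returned by partition_dataset."""
    rng = np.random.default_rng(seed)           # shared seed so all nodes agree on R_l
    Q = shards[0][1].shape[0]
    VQ = np.vstack([np.eye(Q), -np.eye(Q)])     # fixed part of (7)
    Y_list = [X_m for X_m, _ in shards]         # layer-0 features at each node
    T_list = [T_m for _, T_m in shards]
    O = decentralized_output_matrix(Y_list, T_list, mu, eps, K_admm)
    weights = []
    for _ in range(num_layers):
        R = rng.standard_normal((n_hidden - 2 * Q, Y_list[0].shape[0]))
        W = np.vstack([VQ @ O, R])              # weight matrix as in (7)
        Y_list = [np.maximum(W @ Y, 0.0) for Y in Y_list]   # local ReLU forward pass (8)
        O = decentralized_output_matrix(Y_list, T_list, mu, eps, K_admm)
        weights.append(W)
    return weights, O
```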

II-E Comparison with decentralized gradient search

We now present a comparison with distributed gradient descent for neural networks. While gradient descent is generally a powerful method, it has practical limitations due to high computational complexity and communication overhead. Let us assume for simplicity that there are no regularization constraints on $\{\mathbf{W}_l\}$ and $\mathbf{O}$. Consider the weight matrix $\mathbf{W}_l$ at the $l$'th layer of the neural network. The centralized gradient descent update is

$\mathbf{W}_l^{(i+1)} = \mathbf{W}_l^{(i)} - \eta \, \nabla_{\mathbf{W}_l} C, \qquad (12)$

where $i$ denotes the iteration index of the gradient search and $\eta$ is the step size of the algorithm. Since $C = \sum_{m=1}^{M} C_m$, the centralized gradient descent can be carried out in the following decentralized manner:

$\mathbf{W}_l^{(i+1)} = \mathbf{W}_l^{(i)} - \eta \sum_{m=1}^{M} \nabla_{\mathbf{W}_l} C_m. \qquad (13)$

For the $i$'th iteration of gradient search, it is required that the average quantity $\frac{1}{M} \sum_{m=1}^{M} \nabla_{\mathbf{W}_l} C_m$ be available at every node. The average can be found by seeking consensus over a communication graph. The communication property of such a graph can be modeled as a doubly-stochastic mixing matrix. Therefore, under the technical condition of consensus seeking, it is possible to realize a decentralized gradient search that is exactly the same as the centralized setup. Assume that we require $K_c$ iterations of information exchange to calculate an average quantity. Then, assuming that the gradient descent requires $K_g$ iterations to converge, we need $K_g K_c$ rounds of information exchange. In practice, $K_c$ is in the order of hundreds and $K_g$ is in the order of thousands. Since the matrix $\mathbf{W}_l$ contains on the order of $n^2$ scalars, the total information exchange for learning the $L$ weight matrices is

$\text{(total exchange for gradient search)} \ \approx \ K_g \, K_c \, L \, n^2. \qquad (14)$

In practice, this total information exchange may be very large and lead to a high communication load. Further, as the sparsity level of the graph increases, the required number of information exchanges also increases, which leads to a longer training time for gradient descent.

Given this limitation of gradient descent, we take a different approach. We use a structured neural network whose parameters are learned using ADMM to solve a convex optimization problem at each layer. The use of ADMM allows fast and efficient optimization in the decentralized scenario.

We now quantify the communication load for decentralized SSFN. Let us assume that we require $K_c$ iterations of information exchange across the nodes to calculate an average quantity. Assuming that the ADMM requires $K_a$ iterations per layer, we need $K_a K_c$ rounds of information exchange for learning $\mathbf{O}_l^\star$ and forming $\mathbf{W}_{l+1}$ according to equation (7). The submatrix $\mathbf{R}_{l+1}$ in $\mathbf{W}_{l+1}$ is an instance of a random matrix, and it is pre-defined across all nodes. In practice, $K_a$ and $K_c$ are both in the order of hundreds. The matrix $\mathbf{O}_l$ has $Q n$ scalars. Hence, the total information exchange for learning the $L$ output matrices is

$\text{(total exchange for dSSFN)} \ \approx \ K_a \, K_c \, L \, Q \, n. \qquad (15)$

The ratio of the communication loads of decentralized gradient descent and decentralized SSFN is therefore

$\frac{K_g \, K_c \, L \, n^2}{K_a \, K_c \, L \, Q \, n} \ = \ \frac{K_g \, n}{K_a \, Q} \ \gg \ 1, \qquad (16)$

since in practice we have $K_g \gg K_a$ and $n \gg Q$.
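As a back-of-the-envelope illustration of (14)-(16), the short computation below plugs in example values for $K_c$, $K_g$, $K_a$, $n$, $Q$, and $L$; these numbers are our own assumptions, chosen only to match the orders of magnitude mentioned above.

```python
# Illustrative comparison of the communication loads (14) and (15).
K_c, K_g, K_a = 200, 5000, 100    # consensus rounds, gradient iterations, ADMM iterations
n, Q, L = 1000, 10, 10            # hidden neurons per layer, classes, layers

grad_exchange  = K_g * K_c * L * n * n   # roughly (14): an n x n matrix per layer
dssfn_exchange = K_a * K_c * L * Q * n   # roughly (15): a Q x n matrix per layer
print(grad_exchange / dssfn_exchange)    # ratio (16) = (K_g * n) / (K_a * Q) = 5000.0
```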

III Experimental Evaluation

In this section, we conduct numerical experiments to evaluate the performance of decentralized SSFN and compare it with centralized SSFN. We investigate how the training time varies with the connectivity of the underlying network. We learn SSFN on a decentralized communication network with the following topology and properties.

Fig. 2: Examples of circular communication network topology with $M$ nodes and increasing degree $d$.

Network topology

The decentralized learning approach we propose in this manuscript can be performed on any network topology whose mixing matrix can be modeled as a doubly-stochastic matrix. There are several common types of network topology, such as $d$-connected cycles, random geometric graphs, and $d$-regular expander graphs [11]. To perform a systematic study, we use a circular topology with a doubly-stochastic mixing matrix as the communication network in the experiments.

A circular network topology with $M$ nodes has a degree $d$ that represents its connectivity. We show examples of circular topology in Figure 2. A network with degree $d$ has $d$ levels of connected cycles among the neighbors. This implies that each node in the network has connections with $d$ neighbors on its left and right sides, respectively. A network with a low degree is considered sparse in the sense of having much fewer connections. A low degree limits the number of information exchanges and subsequently affects the convergence speed of a decentralized learning algorithm. It is expected that network consensus can be achieved faster if the degree of the graph increases.

The communication over the network can be modeled by a doubly-stochastic matrix $\mathbf{A} = [a_{mn}] \in \mathbb{R}^{M \times M}$, in which $a_{mn}$ is the weight of importance that the $m$'th node assigns to the $n$'th node during parameter exchange. The doubly-stochastic matrix has the following property:

$a_{mn} \ge 0, \quad \sum_{n=1}^{M} a_{mn} = 1, \quad \sum_{m=1}^{M} a_{mn} = 1, \quad a_{mn} = 0 \ \text{if} \ n \notin \mathcal{N}_m \cup \{m\}.$

Here $\mathcal{N}_m$ refers to the set of neighbors with whom the $m$'th node is connected. Note that $m \notin \mathcal{N}_m$. In this setup, we assume that there is no master node and that no node is isolated. For the sake of simplicity, in the following experiments the doubly-stochastic mixing matrix is chosen in such a way that every node is connected to its neighbors with equal weights. That means we have $a_{mn} = \frac{1}{|\mathcal{N}_m| + 1}$ for $n \in \mathcal{N}_m \cup \{m\}$, where $|\mathcal{N}_m|$ denotes the cardinality of the set $\mathcal{N}_m$. As we use a circular network topology for the experiments, we have the relation

$|\mathcal{N}_m| = 2d, \qquad a_{mn} = \frac{1}{2d + 1} \ \text{for} \ n \in \mathcal{N}_m \cup \{m\},$

for a graph with degree $d$.
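The sketch below constructs this equal-weight doubly-stochastic mixing matrix for a circular topology with $M$ nodes and degree $d$, and illustrates consensus averaging by repeated mixing; the function names and the number of mixing rounds are illustrative choices of ours.

```python
# A sketch of the circular-topology mixing matrix and consensus averaging (assumes 2d + 1 <= M).
import numpy as np

def circular_mixing_matrix(M, d):
    """Each node is linked to d neighbors on each side; equal weights 1/(2d+1) including itself."""
    A = np.zeros((M, M))
    for m in range(M):
        for k in range(-d, d + 1):              # self plus d neighbors on each side
            A[m, (m + k) % M] = 1.0 / (2 * d + 1)
    return A                                    # rows and columns each sum to 1

def consensus_average(A, local_values, num_rounds):
    """Approximate the network-wide average of per-node values by repeated mixing."""
    v = np.array(local_values, dtype=float)
    for _ in range(num_rounds):
        v = A @ v                               # one synchronous exchange round
    return v                                    # every entry approaches the true mean

A = circular_mixing_matrix(M=20, d=2)
print(consensus_average(A, np.arange(20.0), num_rounds=200))  # all entries close to 9.5
```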

III-A Classification tasks and datasets

We evaluate decentralized SSFN on different classification tasks. The datasets that we use are briefly described in Table I. In a classification task, the $Q$-dimensional target vector is represented as a one-hot vector: a target vector has only one scalar component equal to 1, and the other scalar components are zero.
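For concreteness, a small sketch of the one-hot target construction is given below (the zero-based class-index convention is our own assumption).

```python
# A sketch of building one-hot target vectors for Q-class classification.
import numpy as np

def one_hot_targets(labels, num_classes):
    """labels: length-J array of class indices in {0, ..., Q-1}; returns a Q x J matrix."""
    T = np.zeros((num_classes, len(labels)))
    T[labels, np.arange(len(labels))] = 1.0     # exactly one 1 per column, the rest zeros
    return T

print(one_hot_targets(np.array([2, 0, 1]), num_classes=3))
```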

Dataset      # of train data   # of test data   Input dimension (P)   # of classes (Q)
Vowel        528               462              10                    11
Satimage     4435              2000             36                    6
Caltech101   6000              3000             3000                  102
Letter       13333             6667             16                    26
NORB         24300             24300            2048                  5
MNIST        60000             10000            784                   10
TABLE I: Datasets for multi-class classification.

III-B Experimental setup

In all experiments, we fix the number of layers $L$ and the number of hidden neurons $n$ for each layer. We fix the number of nodes $M$ and uniformly divide the training dataset between the nodes. We also fix the number of ADMM iterations $K_a$ for each layer.

Dataset      |          Centralized SSFN              |         Decentralized SSFN
             | Train Acc.   Train Error   Test Acc.   | Train Acc.   Train Error   Test Acc.
Vowel        | 100±0.00     -53.8         58.3±1.70   | 100±0.00     -51.67        59.2±1.10
Satimage     | 94.2±0.21    -10.6         86.9±0.37   | 92.1±0.10    -9.37         88.8±0.08
Caltech101   | 99.9±0.01    -38.9         73.2±0.91   | 99.9±0.01    -34.94        75.4±0.29
Letter       | 99.4±0.02    -19.5         91.8±0.23   | 98.9±0.03    -17.64        92.5±0.22
NORB         | 96.7±0.04    -13.9         82.5±0.22   | 96.7±0.02    -13.93        82.6±0.16
MNIST        | 96.8±0.06    -12.9         94.8±0.16   | 97.0±0.04    -13.24        95.1±0.16
TABLE II: Classification performance comparison between centralized SSFN and decentralized SSFN on a circular communication network with a fixed degree.

III-C Experimental results

Fig. 3: Objective cost versus total number of ADMM iterations throughout all layers. (a) Satimage, (b) Letter, (c) MNIST.

Fig. 4: Training time as the network degree increases on the $M$-node circular communication network. (a) Satimage, (b) Letter, (c) MNIST.

We first show the performance of decentralized SSFN on a circular graph with a fixed degree, compared with centralized SSFN. The performances are shown in Table II. It can be seen that dSSFN provides performance similar to centralized SSFN for a proper choice of hyperparameters. The practical performance of decentralized SSFN is affected by the choice of the hyperparameters $\mu_l$ and $\epsilon$, and the number of ADMM iterations $K_a$. Choosing proper $\mu_l$ and $\epsilon$ ensures that ADMM converges within $K_a$ iterations.

The convergence behavior of dSSFN throughout the layers is shown in Figure 3. The decentralized objective cost versus the total number of ADMM iterations over all layers is shown for the Satimage, Letter, and MNIST datasets. For each layer (every $K_a$ ADMM iterations), ADMM converges to the global solution of the optimization problem (10). Overall, it can be observed that the curves show a power-law behavior. Similar to SSFN, the objective cost converges as we increase the number of layers. Therefore, we can decide to stop adding new layers when we see that the cost has a convergence trend.

Figure 4 shows the training time for learning decentralized SSFN versus the network degree $d$ for the Satimage, Letter, and MNIST datasets. It is interesting to observe that the training time shows a transition jump in the middle range of $d$. There exists a threshold degree after which the learning mechanism in decentralized SSFN converges noticeably faster. The degree represents the sparsity of the graph and, in turn, relates to privacy, security, and physical communication links. Our results imply that a suitable network degree helps to achieve a trade-off between the graph degree and training time.

III-D Reproducible codes

Matlab codes for all the experiments described in this paper are available at https://sites.google.com/site/saikatchatt/. The datasets used for the experiments can be found at [25, 16, 21, 20].

IV Conclusion

We develop a decentralized multilayer neural network and show that it is possible to achieve centralized equivalence under some technical assumptions. While being sub-optimal because of its layer-wise nature, the proposed method is cost-efficient compared to general gradient-based methods in terms of computation and communication complexity. We experimentally show the convergence behavior of dSSFN throughout the layers and observe a monotonically decreasing training cost as more layers are added. Besides, we inspect the time complexity of the algorithm under different network connectivity degrees. Our experiments show that dSSFN can provide centralized performance for a network with a high sparsity level in its connections. The proposed method is limited to network topologies with a doubly-stochastic mixing matrix and synchronized connections. Extending this result to asynchronous and lossy peer-to-peer networks by using relaxed ADMM approaches is a potential future direction.

References

  1. S. S. Alaviani and N. Elia (2019) Distributed multi-agent convex optimization over random digraphs. IEEE Transactions on Automatic Control.
  2. R. Anil, G. Pereyra, A. T. Passos, R. Ormandi, G. Dahl and G. Hinton (2018) Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
  3. N. Bastianello, R. Carli, L. Schenato and M. Todescato (2019) Asynchronous distributed optimization over lossy networks via relaxed ADMM: stability and linear convergence. arXiv preprint arXiv:1901.09252.
  4. L. Bottou, F. E. Curtis and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311.
  5. S. Boyd, A. Ghosh, B. Prabhakar and D. Shah (2005) Gossip algorithms: design, analysis and applications. Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies 3, pp. 1653–1664.
  6. S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), pp. 1–122.
  7. S. Chatterjee, A. M. Javid, M. Sadeghi, P. P. Mitra and M. Skoglund (2017) Progressive learning for systematic design of large neural networks. arXiv preprint arXiv:1710.08177.
  8. T. Chen, G. Giannakis, T. Sun and W. Yin (2018) LAG: lazily aggregated gradient for communication-efficient distributed learning. Advances in Neural Information Processing Systems 31, pp. 5050–5060.
  9. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le and A. Y. Ng (2012) Large scale distributed deep networks. pp. 1223–1231.
  10. J. Dean and S. Ghemawat (2008) MapReduce: simplified data processing on large clusters. Commun. ACM 51 (1), pp. 107–113.
  11. J. C. Duchi, A. Agarwal and M. J. Wainwright (2012) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
  12. O. Gupta and R. Raskar (2018) Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 116, pp. 1–8.
  13. Q. He, T. Shang, F. Zhuang and Z. Shi (2013) Parallel extreme learning machine for regression based on mapreduce. Neurocomputing 102, pp. 52–58.
  14. G. Huang, Q. Zhu and C. Siew (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489–501.
  15. C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen and L. Hanzo (2017) Machine learning paradigms for next-generation wireless networks. IEEE Wireless Communications 24 (2), pp. 98–105.
  16. Z. Jiang, Z. Lin and L. S. Davis (2011) Learning a discriminative dictionary for sparse coding via label consistent K-SVD. CVPR 2011, pp. 1697–1704.
  17. P. H. Jin, Q. Yuan, F. Iandola and K. Keutzer (2016) How to scale distributed deep learning?. arXiv preprint arXiv:1611.04581.
  18. R. Johnson and T. Zhang (2013) Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26, pp. 315–323.
  19. R. Katuwal and P. N. Suganthan (2019) Stacked autoencoder based deep random vector functional link neural network for classification. Applied Soft Computing 85, pp. 105854.
  20. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  21. Y. LeCun, F. J. Huang and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. CVPR 2004, 2, pp. 97–104.
  22. H. Li, K. Ota and M. Dong (2018) Learning IoT in edge: deep learning for the internet of things with edge computing. IEEE Network 32 (1), pp. 96–101.
  23. X. Li, W. Mao and W. Jiang (2016) Extreme learning machine based transfer learning for data classification. Neurocomputing 174, pp. 203–210.
  24. X. Liang, A. M. Javid, M. Skoglund and S. Chatterjee (2018) Distributed large neural network with centralized equivalence. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2976–2980.
  25. M. Lichman (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  26. M. Luo, L. Zhang, J. Liu, J. Guo and Q. Zheng (2017) Distributed extreme learning machine with alternating direction method of multiplier. Neurocomputing 261, pp. 164–170.
  27. Y. Pao, S. M. Phillips and D. J. Sobajic (1992) Neural-net computing and the intelligent control of systems. International Journal of Control 56 (2), pp. 263–289.
  28. Y. Peng, W. Zheng and B. Lu (2016) An unsupervised discriminative extreme learning machine and its applications to data clustering. Neurocomputing 174, pp. 250–264.
  29. Z. Peng, Y. Xu, M. Yan and W. Yin (2016) ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing 38 (5), pp. A2851–A2879.
  30. S. Scardapane, D. Wang, M. Panella and A. Uncini (2015) Distributed learning for random vector functional-link networks. Information Sciences 301, pp. 271–284.
  31. M. Skoglund (2019) SSFN: self size-estimating feed-forward network and low complexity design. arXiv preprint arXiv:1905.07111.
  32. V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan and M. Jaggi (2018) CoCoA: a general framework for communication-efficient distributed optimization. Journal of Machine Learning Research 18 (230), pp. 1–49.
  33. S. U. Stich, J. Cordonnier and M. Jaggi (2019) Sparsified SGD with memory. arXiv preprint arXiv:1809.07599.
  34. J. Tang, C. Deng and G. Huang (2016) Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems 27 (4), pp. 809–821.
  35. G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel and T. Goldstein (2016) Training neural networks without gradients: a scalable ADMM approach. Proceedings of The 33rd International Conference on Machine Learning 48, pp. 2722–2731.
  36. L. Zhang and P. N. Suganthan (2016) A comprehensive evaluation of random vector functional link networks. Information Sciences 367-368, pp. 1094–1105.
  37. X. Zhao, X. Bi, G. Wang, Z. Zhang and H. Yang (2016) Uncertain XML documents classification using extreme learning machine. Neurocomputing 174, pp. 375–382.