Distributed Averaging CNNELM for Big Data
Arif Budiman, Mohamad Ivan Fanany, Chan Basaruddin
1 Machine Learning and Computer Vision Laboratory
Faculty of Computer Science, Universitas Indonesia
* intanurma@gmail.com
Abstract
Increasing the scalability of machine learning to handle big volumes of data is a challenging task. The scale up approach has limitations. In this paper, we propose a scale out approach for CNNELM based on MapReduce at the classifier level. The map process is the training of a CNNELM on a particular partition of the data; many CNNELM models can thus be trained asynchronously. The reduce process is the averaging of all CNNELM weights to produce the final training result. This approach saves considerable training time compared with training a single CNNELM model alone, and it increases the scalability of machine learning by combining the scale out and scale up approaches. We verified our method in experiments on an extended MNIST data set and the notMNIST data set. However, the approach has some drawbacks: the additional iterative learning parameters need to be chosen carefully, and the distribution of the training data across partitions needs to be selected carefully. Further research using more complex image data sets is required.
Keywords— deep learning, extreme learning machine, convolutional, neural network, big data, map reduce
1 Introduction
Nowadays, we are seeing massive growth of data at a faster rate than ever before. However, the benefits of big data become meaningless if no processing machine can digest and adapt to the data quickly enough. Big data mining needs special machine learning approaches that can learn from huge volumes of data in an acceptable time. Volume and velocity issues are critical in overcoming big data challenges [13]: the data are so massive that they are very difficult to handle with a single computation task in a timely fashion. As with many new hardware and software technologies, a special approach is required to make hardware and software work effectively in terms of the speed, scalability, and simplicity demanded by real big data knowledge mining.
Scalability is the ability of a data processing system to adapt to increased demands. It can be categorized into the following two types [23]:

Vertical Scaling: Known as scale up. It involves adding more and larger computation components within a single system, usually under a single instance of an operating system, i.e., adding more power and capacity (CPU, GPU, RAM, storage) to an existing machine. However, scale up is limited by the maximum hardware specification of a single machine.

Horizontal Scaling: Known as scale out. The system distributes the workload across many independent computation resources, which may be low end commodity machines or high end machines. All resources added together speed up the processing capability; we add more machines into one pool of resources. Scale out offers easier and more dynamic scalability by adding machines of various sizes to the existing pool.
To increase the scalability of big data processing, the common approach is to distribute the data and run the processing in parallel. Parallel computing is the simultaneous use of multiple computing resources to solve complex computational problems: the process is broken down into simpler series of instructions that can be executed simultaneously on different processing units, under an overall control management [1].
To overcome the overhead complexities of parallel programming, Google introduced a programming model named MapReduce [3]. MapReduce is a framework for processing large data in a parallel and distributed way on many computers in a cluster, including low end computers. MapReduce provides two essential functions: 1) the Map function, which dispatches each sub problem to a node within the cluster; and 2) the Reduce function, which organizes the results from the nodes into a cohesive solution [25].
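As an illustrative sketch (not tied to any particular MapReduce implementation), the two functions can be modeled as a per-partition map and a pairwise combiner; the data partitions and the partial-sum task here are hypothetical stand-ins for real sub problems:

```python
from functools import reduce

# Hypothetical sketch of the MapReduce pattern: map handles each
# sub-problem (one data partition), reduce combines the partial results
# into one cohesive solution.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # illustrative data split

def map_fn(partition):
    # per-partition work; a partial sum stands in for the real sub-problem
    return sum(partition)

def reduce_fn(a, b):
    # combine partial results coming back from the nodes
    return a + b

partials = [map_fn(p) for p in partitions]       # runs in parallel in practice
total = reduce(reduce_fn, partials)
print(total)  # 45, same as processing the whole data set at once
```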
Developing a MapReduce program is simple: first expose the similarity in structure and processing of the tasks, and then define the aggregation process [25]. All similar tasks are easily parallelized, distributed to the processors, and load balanced between them. The MapReduce framework is not tied to specific hardware technologies; it can be deployed on multiple, heterogeneous machines independently.
Subsequent research applied the MapReduce paradigm to speed up various machine learning algorithms, e.g., locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), support vector machine (SVM), Gaussian discriminant analysis (GDA), expectation maximization (EM), and backpropagation neural networks (NN) [24], as well as stochastic gradient descent (SGD) [29], convolutional neural network (CNN) [26], and extreme learning machine (ELM) [27].
CNN [14] is a popular machine learning model that benefits from parallel computation. CNN uses many convolution operations that need many processing cores, and computation time is reduced using graphics processing unit (GPU) parallelization. However, this scale up approach still has limitations, mainly caused by the amount of memory available on GPUs [12, 21].
Learning from the limited capability of scale up, we propose a scale out approach based on the MapReduce model to distribute the big data computation across several CNN models. We integrate the CNN architecture [14, 28, 4] with ELM [10, 9, 7, 27]: the CNN works as an unsupervised convolutional feature learner, and the ELM works as a supervised classifier. We employ the parallel stochastic gradient descent (SGD) algorithm [29] to fine tune the weights of the CNNELM and to average the final weights of the CNNELMs.
Our main contributions in this paper are as follows.

We study the CNNELM integration using the MapReduce model;

We employ map processes as multiple CNNELM classifiers learning independently (asynchronously) on different partitions of the training data. The reduce process is the averaging of all weights (kernel weights in the CNN and output weights in the ELM) of all CNNELM classifiers. Our method enables a scale out combination of highly scaled up CNNELM members to handle very large training data. The idea is to apply the MapReduce model not at the level of CNN matrix operations but at the classifier level: many asynchronous CNN models are trained together to solve a very large, complex problem, rather than a single model trained on one very powerful machine.

Against the ELM tenet of non iterative training, we study the weights after fine tuning with stochastic gradient descent iterations during CNNELM training, to check the averaging performance after a number of iterations.
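The reduce step of our method, element-wise averaging of corresponding weight arrays across classifiers, can be sketched as follows; the two toy "models" with two weight arrays each are purely illustrative:

```python
# Sketch of the reduce step: average corresponding weight arrays
# (standing in for CNN kernel weights and ELM output weights) across
# K independently trained models.
def average_weights(models):
    k = len(models)
    return [[sum(m[i][j] for m in models) / k
             for j in range(len(models[0][i]))]
            for i in range(len(models[0]))]

# two toy models, each with two weight arrays ("layers")
models = [[[1.0, 2.0], [0.0]],
          [[3.0, 4.0], [2.0]]]
print(average_weights(models))  # [[2.0, 3.0], [1.0]]
```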
The rest of this paper is organized as follows. Section 1 gives the introduction and research objectives. Section 2 reviews previous MapReduce framework implementations. Section 3 describes our proposed method. Our empirical results are presented in Section 4. Finally, conclusions are drawn in Section 5.
2 Literature Reviews
2.1 Parallel SGD and weight averaging
SGD is a very popular training algorithm for various machine learning models, e.g., regression, SVM, and NN. Zinkevich et al. [29] proposed a parallel model of SGD based on MapReduce that is highly suitable for parallel and large scale machine learning. In parallel SGD, the training data is accessed locally by each model, and communication happens only when training has finished. The parallel SGD procedure is described in Algorithm 1.
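A minimal sketch of this scheme on a toy 1-D least squares problem (the data, learning rate, and epoch count are illustrative): each worker runs SGD on its own data shard, and the final weight is the average of the workers' weights.

```python
import random

# Parallel SGD sketch in the spirit of Zinkevich et al.: train on shards
# locally, communicate only once at the end, then average the weights.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(400)]]

def sgd_worker(shard, lr=0.1, epochs=20):
    w = 0.0
    for _ in range(epochs):
        for x, y in shard:
            grad = 2.0 * (w * x - y) * x     # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

k = 4                                        # number of shards / workers
shards = [data[i::k] for i in range(k)]      # runs independently in practice
w_avg = sum(sgd_worker(s) for s in shards) / k
print(round(w_avg, 2))  # 3.0, the true slope
```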
The idea of averaging was developed by Polyak et al. [20]. Averaged SGD is ordinary SGD that averages its weights over time; when the optimization is finished, the averaged weight replaces the final weight from ordinary SGD. It is based on the idea of averaging the trajectories; however, applying it requires a large amount of a priori information.
Suppose we have unlimited training data $(x_1, t_1), (x_2, t_2), \dots$ drawn from the same distribution. The learning objective is to construct the mapping function from randomly drawn observation data $x$ to its related class $t$. When the number of training data $N \to \infty$, we need to address the expected value $E[f(x, t; w)]$, where $f$ denotes the learning model and $w$ the learning parameters. According to the law of large numbers, the expected value of the learning model is consistently approximated by the sample average: $\frac{1}{N}\sum_{i=1}^{N} f(x_i, t_i; w) \to E[f(x, t; w)]$ almost surely (with probability 1) as $N \to \infty$.

If the training data is partitioned into $K$ partitions of size $N_k$, and each partition is trained independently, the expected value is still approximated by $\frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k}\sum_{i=1}^{N_k} f(x_i^{(k)}, t_i^{(k)}; w)$, where $N = \sum_{k=1}^{K} N_k$.
2.2 MapReduce in ELM
Extreme Learning Machine (ELM) is a well known machine learning algorithm first proposed by Huang et al. [10, 9, 7]. It uses a single hidden layer feedforward neural network (SLFN) architecture and a generalized pseudoinverse for the learning process. Similar to neural networks (NN), ELM uses random values for the hidden node parameters. The uniqueness of ELM is its non iterative generalized pseudoinverse optimization process: the hidden node parameters are set randomly and remain fixed after training. As a result, ELM training is fast and avoids local minima.
The ELM learning result is the output weight matrix $\beta$, which can be computed by:

\beta = H^{\dagger} T \qquad (1)

where $H^{\dagger}$ is the pseudoinverse (Moore-Penrose generalized inverse) of $H$. The ELM learning objective is to find the smallest least squares solution of the linear system $H\beta = T$, which is obtained when $\beta = H^{\dagger} T$.

The hidden layer matrix $H$ is computed by an activation function $g$ over the summation of the hidden node parameters (input weight $a$ and bias $b$) and the training input $x$, i.e., $H = g(a \cdot x + b)$, with size $N \times L$, where $N$ is the number of training examples and $L$ the number of hidden nodes (this is called random feature mapping).

The performance of ELM hinges on the generalized inverse solution. The solution of $H^{\dagger}$ uses the ridge regression orthogonal projection method, adding a positive regularization value $1/C$ to the diagonal of the auto correlation matrix $H^T H$ or $H H^T$. Thus, we can solve Eq. 1 as follows.

\beta = \left( \frac{I}{C} + H^T H \right)^{-1} H^T T \qquad (2)
Further, Eq. 2 can be solved as a sequential series using block matrix inversion (the online sequential extreme learning machine, OSELM [17]) or by a MapReduce approach (the Elastic Extreme Learning Machine [27] or Parallel ELM [6]).
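A minimal sketch of the ridge solution in Eq. 2, with fixed illustrative hidden node parameters (normally chosen randomly) and only $L = 2$ hidden nodes so the $2 \times 2$ inverse can be written out by hand; targets are built from known output weights so the solve visibly recovers them:

```python
import math

# ELM sketch: hidden layer H = tanh(W*x + b), then the ridge solution
# beta = (I/C + H^T H)^{-1} H^T T for L = 2 hidden nodes.
L, C = 2, 1e6                     # hidden nodes, regularisation constant
W = [2.0, -0.5]                   # illustrative fixed hidden parameters
b = [0.3, 1.0]                    # (in ELM these are chosen randomly)

X = [x / 10.0 for x in range(-10, 11)]
H = [[math.tanh(W[j] * x + b[j]) for j in range(L)] for x in X]

beta_true = [1.5, -0.7]           # targets built from known output weights
T = [sum(h[j] * beta_true[j] for j in range(L)) for h in H]

# A = I/C + H^T H (2x2) and v = H^T T (2-vector)
A = [[(1.0 / C if i == j else 0.0) + sum(h[i] * h[j] for h in H)
      for j in range(L)] for i in range(L)]
v = [sum(h[i] * t for h, t in zip(H, T)) for i in range(L)]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
beta = [(A[1][1] * v[0] - A[0][1] * v[1]) / det,
        (A[0][0] * v[1] - A[1][0] * v[0]) / det]
print([round(bj, 3) for bj in beta])  # recovers beta_true up to tiny shrinkage
```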
The parallelization process using the MapReduce approach can be divided as follows:

Map. The map step computes the intermediate matrix multiplications for each portion of the training data and targets.
Let $U = H^T H$ and $V = H^T T$. Since these matrix products are decomposable, they can be written as sums over the $K$ data portions:

U = H^T H = \sum_{k=1}^{K} H_k^T H_k = \sum_{k=1}^{K} U_k \qquad (3)

V = H^T T = \sum_{k=1}^{K} H_k^T T_k = \sum_{k=1}^{K} V_k \qquad (4)
Reduce. The reduce step aggregates (sums) the map results. The output weights can then be computed easily from the reduced matrices:

\beta = \left( \frac{I}{C} + U \right)^{-1} V \qquad (5)
Therefore, MapReduce based ELM is more efficient for massive training data sets, can be solved easily by parallel computation, and has better performance [27].
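The decomposition behind this scheme can be checked with a small sketch: per-partition partial sums $U_k = H_k^T H_k$ and $V_k = H_k^T T_k$, added up in the reduce step, equal the full-data products (the tiny $H$ and $T$ here are illustrative):

```python
# Map/reduce decomposition behind MapReduce ELM: H^T H and H^T T are
# sums over rows, so each data partition contributes a partial U_k, V_k
# and reduce simply adds them.
H = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]]  # toy hidden matrix
T = [3.0, -0.5, 2.0, 0.0]                               # toy targets

def map_partial(H_k, T_k):
    U = [[sum(h[i] * h[j] for h in H_k) for j in range(2)] for i in range(2)]
    V = [sum(h[i] * t for h, t in zip(H_k, T_k)) for i in range(2)]
    return U, V

def reduce_sum(parts):
    U = [[sum(p[0][i][j] for p in parts) for j in range(2)] for i in range(2)]
    V = [sum(p[1][i] for p in parts) for i in range(2)]
    return U, V

parts = [map_partial(H[:2], T[:2]), map_partial(H[2:], T[2:])]  # 2 partitions
U, V = reduce_sum(parts)
U_full, V_full = map_partial(H, T)                              # single pass
print(U == U_full and V == V_full)  # True
```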
Regarding iteration, Lee et al. [11] explained, for BP trained weight based ELM, that input weights optimized with BP training are more feasible than randomly assigned weights. Lee et al. also implemented an averaged ELM; however, its classification accuracy was lower than that of basic ELM because the number of training data was small and the network architecture was not large.
2.3 MapReduce in CNN
CNN is biologically inspired [14] by the visual cortex, which has a convolutional arrangement of cells (receptive fields that are sensitive to small sub regions of the visual field and act as local filters) followed by simple cells that respond maximally only to specific triggers within their receptive fields. A simple CNN architecture consists of several convolution layers, each followed by a pooling layer, in a feed forward architecture. CNN has excellent performance for spatial visual classification [22].
The input layer exposes the 2D structure of an $r \times c$ image, where $d$ is the number of input channels. The convolution layer has $k$ filters (or kernels) of size $m \times m \times q$, where $m$ is smaller than the image dimension and $q$ can be the same as or smaller than the number of input channels $d$. The filters have a locally connected structure, and each is convolved with the image to produce $k$ feature maps of size $(r - m + 1) \times (c - m + 1)$. If, at a given layer, we have the feature map $h_k$, whose filters are determined by the weights $W_k$ and bias $b_k$, then the feature map is obtained as:

h_k = g\left( W_k * x + b_k \right) \qquad (6)
Each feature map is then pooled by down sampling, using mean or max pooling over contiguous regions (with scales ranging from 2 for small inputs up to 5 for larger inputs). An additive bias and an activation function (e.g., sigmoid, tanh, or ReLU) can be applied to each feature map either before or after the pooling layer. At the end of the CNN layers, there may be densely connected NN layers for supervised learning (see Fig. 1) [28]. Many variants of CNN architectures exist in the literature, but the common basic building blocks are the convolutional layer, pooling layer, and fully connected layer [4].
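A minimal sketch of one convolution plus mean pooling stage (valid cross-correlation, as convolution layers are commonly implemented, followed by non-overlapping 2x2 mean pooling); the 4x4 input and 2x2 kernel are illustrative:

```python
# One conv + pooling stage: a 2x2 filter slid over a 4x4 image gives a
# 3x3 feature map ((r - m + 1) x (c - m + 1)), then 2x2 mean pooling.
img = [[1, 2, 0, 1],
       [0, 1, 3, 1],
       [2, 1, 0, 0],
       [1, 0, 1, 2]]
kernel = [[1, 0],
          [0, -1]]

def conv2d_valid(x, k):
    # sliding-window cross-correlation (no kernel flip)
    n, m = len(x), len(k)
    out = n - m + 1
    return [[sum(k[a][b] * x[i + a][j + b]
                 for a in range(m) for b in range(m))
             for j in range(out)] for i in range(out)]

def mean_pool(fm, s=2):
    # non-overlapping s x s mean pooling (edges cropped by floor division)
    n = len(fm) // s
    return [[sum(fm[i * s + a][j * s + b]
                 for a in range(s) for b in range(s)) / (s * s)
             for j in range(n)] for i in range(n)]

fmap = conv2d_valid(img, kernel)
print(fmap)             # [[0, -1, -1], [-1, 1, 3], [2, 0, -2]]
print(mean_pool(fmap))  # [[-0.25]]
```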
The convolution operations need to be computed in parallel for faster computation time, which can be obtained from multi processor hardware, i.e., GPUs [21]. Krizhevsky et al. [12] demonstrated that a large CNN is capable of achieving record breaking results on 1.2 million high resolution images in 1000 different classes. However, GPUs have a memory size limitation that restricts the CNN network size and thus the achievable accuracy [12].
CNN uses the back propagation algorithm, which needs iterations to reach the optimum solution. One iteration consists of an error back propagation step followed by a parameter update step. The learning errors are propagated back to the previous layers using SGD optimization, and the updates are then applied to the kernel weight and bias parameters.
Let $\delta^{(l)}$ be the error on layer $l$ from a cost function $J(W, b; x, t)$, where $W$ is the weight, $b$ the bias parameters, and $x$ and $t$ the training data and target. For a single example, the cost can be written as:

J(W, b; x, t) = \frac{1}{2} \left\| t - a^{(n_l)} \right\|^2 \qquad (7)

If the layer is densely connected and $n_l$ is the output layer, then the error for the layer is computed as:

\delta^{(n_l)} = -\left( t - a^{(n_l)} \right) \odot g'\left( z^{(n_l)} \right) \qquad (8)

where $g'$ is the derivative of the activation function, $z^{(l)}$ the weighted input and $a^{(l)} = g(z^{(l)})$ the activation of layer $l$, and $\odot$ denotes element-wise multiplication. The gradients for the layer are then:

\nabla_{W^{(l)}} J = \delta^{(l+1)} \left( a^{(l)} \right)^T \qquad (9)

\nabla_{b^{(l)}} J = \delta^{(l+1)} \qquad (10)
But if layer $l$ is a convolutional and subsampling layer, then the error is computed as:

\delta_k^{(l)} = \text{upsample}\left( \left( W_k^{(l)} \right)^T \delta_k^{(l+1)} \right) \odot g'\left( z_k^{(l)} \right) \qquad (11)

where $k$ indexes the filter and upsample is the operation that propagates the error back through the pooling layer (its form depends on the pooling type).
To calculate the gradient with respect to the filter maps, we use the convolution operation and flip (rotate) the error matrix:

\nabla_{W_k^{(l)}} J = \sum_{i=1}^{m} \left( a_i^{(l)} \right) * \text{rot90}\left( \delta_k^{(l+1)}, 2 \right) \qquad (12)

\nabla_{b_k} J = \sum_{a,b} \left( \delta_k^{(l+1)} \right)_{a,b} \qquad (13)

where $a^{(l)}$ is the input of layer $l$, and $a^{(1)}$ is the input image.
Finally, one iteration updates the parameters $W$ and $b$ with learning rate $\alpha$ as follows:

W^{(l)} = W^{(l)} - \alpha \nabla_{W^{(l)}} J \qquad (14)

b^{(l)} = b^{(l)} - \alpha \nabla_{b^{(l)}} J \qquad (15)
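This update is a single element-wise step; a small sketch with illustrative weights, gradients, and learning rate:

```python
# One SGD parameter update: W <- W - alpha * gradW, b <- b - alpha * gradb.
def sgd_step(W, b, gradW, gradb, lr=0.1):
    W = [[w - lr * g for w, g in zip(row_w, row_g)]
         for row_w, row_g in zip(W, gradW)]
    b = [bi - lr * gi for bi, gi in zip(b, gradb)]
    return W, b

W, b = [[1.0, 2.0]], [0.5]            # toy 1x2 weight matrix and bias
gradW, gradb = [[0.2, -0.4]], [1.0]   # illustrative gradients
W, b = sgd_step(W, b, gradW, gradb, lr=0.1)
print(W, b)  # approximately [[0.98, 2.04]] [0.4]
```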
Most CNN implementations use GPUs [21] to speed up the convolution operations, which require hundreds of processor cores. Wang et al. [26] used MapReduce on the Hadoop platform to exploit the computing power of multi core CPUs for parallel matrix computation. However, the number of cores in a multi core CPU is far less than a GPU can provide.
A GPU has limited shared memory compared with CPU global memory. Scherer et al. [21] explained that because shared memory is very limited, loaded data must be reused as often as possible. In comparison, CPU global memory can be extended further at a lower price than additional GPU cards.
3 Proposed Method
We use the common CNNELM integration architecture [19, 8, 5], in which the output of the last convolution layer is fed as the hidden layer matrix $H$ of the ELM (see Fig. 1). For better generalization accuracy, we use the nonlinear optimal tanh activation function [15]. We use the ELM as a parallel supervised classifier to replace the fully connected NN. Compared with regular ELM, we do not need an input weight as a hidden node parameter (see Fig. 2).
The backward step is similar to the densely connected NN back propagation error method, with cost function:

J = \frac{1}{2} \left\| t - H\beta \right\|^2 \qquad (16)
The error is then propagated back with SGD to optimize the weight kernels of the convolution layers (see Fig. 3).
The detailed algorithm is given in Algorithm 2.
4 Experiment and Performance Results
4.1 Data set
MNIST is a common data set for big data machine learning; in fact, it is accepted as a standard benchmark and gives excellent results. The MNIST data set is a balanced data set containing the digits 0-9 (10 target classes) as 28 x 28 pixel gray scale images. The data set is divided into 60,000 training examples and a separate 10,000 testing examples [16]. We extended the MNIST data set by adding 3 types of image noise (see Fig. 4), obtaining 240,000 training examples and 40,000 testing examples.
For additional experiments, we used the large notMNIST data set [2], which contains many foolish images (see Figs. 5 and 6). notMNIST has 28 x 28 gray scale images as attributes. We divided the set into a numeric (0-9) part (360,000 examples) and an alphabet (A-J) symbol part (540,000 examples), both including many foolish images. The challenge with notMNIST numeric and notMNIST alphabet is the many similarities between classes, e.g., class 1 with class I and class 4 with class A, as well as other look alike foolish images.
4.2 Experiment Methods
We defined the scope of work as follows:

We enhanced the DeepLearn Toolbox [18] with the Matlab Parallel Computing Toolbox.

We used single precision for all computations in this paper.

We focused on a simple CNN architecture consisting of convolution layers (c), each followed by a ReLU activation layer and then a pooling layer (s) with down sampling.

We compared the testing accuracy with a non partitioned, sequentially trained CNNELM classifier of similar structure and size.



To verify our method, we formulated the following research questions:

How does the performance develop with the number of iterations?

How effective is the weight averaging CNNELM model for various numbers of training partitions?

How consistent is the performance of the weight averaging CNNELM model over the number of iterations?
4.3 Performance Results
In this section, we address the research questions as follows.

The performance of CNNELM can be improved by using the back propagation algorithm. However, we need to select appropriate values for the learning rate, the number of batches, and the number of iterations, which all impact the final performance (see Fig. 7). A wrong parameter selection, especially of the learning rate, can trap the optimization in a local minimum; a dynamic learning rate can therefore be used rather than a static rate.
Figure 7: Testing Accuracy on extended MNIST data set using 6c2s12c2s CNNELM Model. 
In this experiment, we partitioned the notMNIST training data into 2 partitions and into 5 partitions. We compared the testing accuracy of the non partitioned CNNELM model with the averaged 2 partition model and the averaged 5 partition model (see Tables 2 and 3). Unfortunately, the performance of the averaged CNNELM decreases with more partitions and more iterations, relative to the non partitioned CNNELM model. However, a different result was found for extended MNIST (see Tables 4 and 5), because extended MNIST was built from the same distribution in each partition of 60,000 examples, while notMNIST was not.
Model              Testing Accuracy (%)
CNNELM 1           72.85 ± 1.23
CNNELM 1/2         40.51 ± 0.87
CNNELM 2/2         40.35 ± 0.86
CNNELM Average 2   67.91 ± 2.77
CNNELM 1/5         20.56 ± 0.22
CNNELM 2/5         20.21 ± 0.94
CNNELM 3/5         20.50 ± 0.91
CNNELM 4/5         31.48 ± 0.54
CNNELM 5/5         31.47 ± 0.53
CNNELM Average 5   60.83 ± 0.20
Table 2: Testing accuracy for 3c2s9c2s, kernel size = 5, at iteration = 0, batch = 75,000 on notMNIST

Model              Testing Accuracy (%)
CNNELM 1           73.72 ± 1.32
CNNELM 1/2         41.45 ± 1.25
CNNELM 2/2         41.19 ± 0.73
CNNELM Average 2   66.85 ± 2.43
CNNELM 1/5         20.56 ± 0.24
CNNELM 2/5         20.09 ± 0.96
CNNELM 3/5         21.22 ± 0.86
CNNELM 4/5         31.71 ± 0.52
CNNELM 5/5         31.70 ± 0.52
CNNELM Average 5   59.59 ± 0.24
Table 3: Testing accuracy for 3c2s9c2s, kernel size = 5, at iteration = 5, batch = 75,000 on notMNIST

Model              Testing Accuracy (%)
CNNELM 1           92.23 ± 0.44
CNNELM 1/4         92.13 ± 0.87
CNNELM 2/4         92.22 ± 0.43
CNNELM 3/4         92.16 ± 0.23
CNNELM 4/4         92.11 ± 0.13
CNNELM Average 4   92.24 ± 0.23
Table 4: Testing accuracy for 6c2s12c2s, kernel size = 5, at iteration = 0, batch = 60,000 on MNIST

Model              Testing Accuracy (%)
CNNELM 1           92.41 ± 0.36
CNNELM 1/4         92.26 ± 0.13
CNNELM 2/4         92.37 ± 0.56
CNNELM 3/4         92.20 ± 0.31
CNNELM 4/4         92.28 ± 0.17
CNNELM Average 4   92.40 ± 0.26
Table 5: Testing accuracy for 6c2s12c2s, kernel size = 5, at iteration = 5, batch = 60,000 on MNIST
5 Conclusion
The proposed CNNELM method gives better scale out capability for processing large data sets in parallel: we can partition a large data set, assign a CNNELM classifier to each partition, and then aggregate the result by averaging the weight parameters of all CNNELM classifiers. This can save a lot of training time compared with sequential training. However, using more CNNELM classifiers (smaller partitions) worsens the performance of the averaged CNNELM, as do more iterations and training data distribution effects.
We consider the following ideas for future research:

We will develop the method on other CNN frameworks with GPU computing, for larger and more complex data sets.

We need to investigate further optimal learning parameters and more complex CNN architectures, e.g., dropout and dropconnect regularization and decay parameters.
6 Acknowledgment
This work is supported by the Higher Education Center of Excellence Research Grant funded by the Indonesia Ministry of Research and Higher Education, Contract No. 1068/UN2.R12/HKP.05.00/2016.
7 Conflict of Interests
The authors declare that there is no conflict of interest regarding the publication of this paper.
References
 B. Barney.
 Y. Bulatov. notmnist dataset, September 2011.
 J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
 J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang. Recent advances in convolutional neural networks. CoRR, abs/1512.07108, 2015.
 L. Guo and S. Ding. A hybrid deep learning cnnelm model and its application in handwritten numeral recognition. page 2673–2680, 7 2015.
 Q. He, T. Shang, F. Zhuang, and Z. Shi. Parallel extreme learning machine for regression based on mapreduce. Neurocomput., 102:52–58, Feb. 2013.
 G. Huang, G.B. Huang, S. Song, and K. You. Trends in extreme learning machines: A review. Neural Networks, 61(0):32 – 48, 2015.
 G.B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong. Local receptive fields based extreme learning machine. IEEE Computational Intelligence Magazine (accepted), 10, 2015.
 G.B. Huang, D. Wang, and Y. Lan. Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2):107–122, 2011.
 G.B. Huang, Q. Y. Zhu, and C. K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(13):489–501, 2006.
 K.-H. Lee, M. Jang, et al. An efficient learning scheme for extreme learning machine and its application. 2015.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, February 2001.
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
 Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer-Verlag, London, UK, 1998.
 Y. LeCun and C. Cortes. Mnist handwritten digit database, 2010.
 N.Y. Liang, G.B. Huang, P. Saratchandran, and N. Sundararajan. A fast and accurate online sequential learning algorithm for feedforward networks. Neural Networks, IEEE Transactions on, 17(6):1411–1423, Nov 2006.
 R. B. Palm. Deep learning toolbox.
 S. Pang and X. Yang. Deep convolutional extreme learning machine and its application in handwritten digit classification. Hindawi, 2016.
 B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 D. Scherer, H. Schulz, and S. Behnke. Accelerating LargeScale Convolutional Neural Networks with Parallel Graphics Multiprocessors, pages 82–91. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
 P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition  Volume 2, ICDAR ’03, pages 958–, Washington, DC, USA, 2003. IEEE Computer Society.
 D. Singh and C. K. Reddy. A survey on platforms for big data analytics. Journal of Big Data, 2(1):1–20, 2014.
 C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Map-reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288. MIT Press, 2007.
 C. Tsang, K. Tsoi, J. H. Yeung, B. S. Kwan, A. P. Chan, C. C. Cheung, and P. H. Leong. Mapreduce as a programming model for custom computing machines. In Field-Programmable Custom Computing Machines, Annual IEEE Symposium, pages 149–159, 2008.
 Q. Wang, J. Zhao, D. Gong, Y. Shen, M. Li, and Y. Lei. Parallelizing convolutional neural networks for action event recognition in surveillance videos. International Journal of Parallel Programming, pages 1–26, 2016.
 J. Xin, Z. Wang, L. Qu, and G. Wang. Elastic extreme learning machine for big data classification. Neurocomputing, 149, Part A:464 – 471, 2015.
 M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham, 2014.
 M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.