A Novel Family of Boosted Online Regression Algorithms with Strong Theoretical Bounds
We investigate boosted online regression and propose a novel family of regression algorithms with strong theoretical bounds. In addition, we implement several variants of the proposed generic algorithm. We provide theoretical bounds for the performance of our proposed algorithms that hold in a strong mathematical sense. We achieve guaranteed performance improvement over the conventional online regression methods without any statistical assumptions on the desired data or feature vectors. We demonstrate an intrinsic relationship, in terms of boosting, between the adaptive mixture-of-experts and data reuse algorithms. Furthermore, we introduce a boosting algorithm based on random updates that is significantly faster than the conventional boosting methods and other variants of our proposed algorithms while achieving an enhanced performance gain. Hence, the random updates method is particularly suitable for fast, high dimensional streaming data. Specifically, we investigate Newton Method-based and Stochastic Gradient Descent-based linear regression algorithms in a mixture-of-experts setting, and provide several variants of these well known adaptation methods. However, the proposed algorithms can be extended to other base learners, e.g., nonlinear or tree-based piecewise linear learners. Furthermore, we provide theoretical bounds for the computational complexity of our proposed algorithms. We demonstrate substantial performance gains in terms of mean square error over the base learners through an extensive set of benchmark real data sets and simulated examples.
Keywords: Online boosting, online regression, boosted regression, ensemble learning, smooth boost, mixture methods
1 Introduction
Boosting is considered one of the most important ensemble learning methods in the machine learning literature and it is extensively used in several different real life applications from classification to regression (Bauer and Kohavi (1999); Dietterich (2000); Schapire and Singer (1999); Schapire and Freund (2012); Freund and E.Schapire (1997); Shrestha and Solomatine (2006); Shalev-Shwartz and Singer (2010); Saigo et al. (2009); Demiriz et al. (2002)). As an ensemble learning method (Fern and Givan (2003); Soltanmohammadi et al. (2016); Duda et al. (2001)), boosting combines several parallel running “weakly” performing algorithms to build a final “strongly” performing algorithm (Soltanmohammadi et al. (2016); Freund (2001); Schapire and Freund (2012); Mannor and Meir (2002)). This is accomplished by finding a linear combination of weak learning algorithms in order to minimize the total loss over a set of training data, commonly using a functional gradient descent (Duffy and Helmbold (2002); Freund and E.Schapire (1997)). Boosting has been successfully applied to several different problems in the machine learning literature including classification (Jin and Zhang (2007); Chapelle et al. (2011); Freund and E.Schapire (1997)), regression (Duffy and Helmbold (2002); Shrestha and Solomatine (2006)), and prediction (Taieb and Hyndman (2014, 2013)). However, significantly less attention has been given to the idea of boosting in the online regression framework. To this end, our goal is (a) to introduce a new boosting approach for online regression, (b) to derive several different online regression algorithms based on the boosting approach, (c) to provide mathematical guarantees for the performance improvements of our algorithms, and (d) to demonstrate the intrinsic connections of boosting with the adaptive mixture-of-experts algorithms (Arenas-Garcia et al. (2016); Kozat et al. (2010)) and data reuse algorithms (Shaffer and Williams (1983)).
Although boosting was initially introduced in the batch setting (Freund and E.Schapire (1997)), where algorithms boost themselves over a fixed set of training data, it was later extended to the online setting (Oza and Russell (2001)). In the online setting, however, we neither need nor have access to a fixed set of training data, since the data samples arrive one by one as a stream (Ben-David et al. (1997); Fern and Givan (2003); Lu et al. (2016)). Each newly arriving data sample is processed and then discarded without being stored. The online setting is naturally motivated by many real life applications, especially those involving big data, where there may not be enough storage space available or the constraints of the problem require instant processing (Bottou and Bousquet (2008)). Therefore, we concentrate on the online boosting framework and propose several algorithms for online regression tasks. In addition, since our algorithms are online, they can be directly used in adaptive filtering applications to improve the performance of conventional mixture-of-experts methods (Arenas-Garcia et al. (2016)). For adaptive filtering purposes, the online setting is especially important, where the sequentially arriving data is used to adjust the internal parameters of the filter, either to dynamically learn the underlying model or to track the nonstationary data statistics (Arenas-Garcia et al. (2016); Sayed (2003)).
Specifically, we have parallel running weak learners (WL) (Schapire and Freund (2012)) that receive the input vectors sequentially. Each WL uses an update method, such as the second order Newton’s Method (NM) or Stochastic Gradient Descent (SGD), depending on the target of the applications or problem constraints (Sayed (2003)). After receiving the input vector, each algorithm produces its output and then calculates its instantaneous error after the observation is revealed. In the most generic setting, this estimation/prediction error and the corresponding input vector are then used to update the internal parameters of the algorithm to minimize an a priori defined loss function, e.g., instantaneous error for the SGD algorithm. These updates are performed for all of the WLs in the mixture. However, in the online boosting approaches, these adaptations at each time proceed in rounds from top to bottom, starting from the first WL to the last one to achieve the “boosting” effect (Chen et al. (2012)). Furthermore, unlike the usual mixture approaches (Arenas-Garcia et al. (2016); Kozat et al. (2010)), the update of each WL depends on the previous WLs in the mixture. In particular, at each time $t$, after the $k$th WL calculates its error over the $(\mathbf{x}_t, d_t)$ pair, it passes a certain weight to the next WL, the $(k+1)$th WL, quantifying how much error the constituent WLs from the $1$st to the $k$th made on the current $(\mathbf{x}_t, d_t)$ pair. Based on the performance of the WLs from $1$ to $k$ on the current $(\mathbf{x}_t, d_t)$ pair, the $(k+1)$th WL may give a different emphasis (importance weight) to the $(\mathbf{x}_t, d_t)$ pair in its adaptation in order to rectify the mistake of the previous WLs.
The proposed idea for online boosting is clearly related to the adaptive mixture-of-experts algorithms widely used in the machine learning literature, where several parallel running adaptive algorithms are combined to improve the performance. In the mixture methods, the performance improvement is achieved due to the diversity provided by using several different adaptive algorithms each having a different view or advantage (Kozat et al. (2010)). This diversity is exploited to yield a final combined algorithm, which achieves a performance better than any of the algorithms in the mixture. Although the online boosting approach is similar to mixture approaches (Kozat et al. (2010)), there are significant differences. In the online boosting notion, the parallel running algorithms are not independent, i.e., one deliberately introduces the diversity by updating the WLs one by one from the first WL to the last WL for each new sample based on the performance of all the previous WLs on this sample. In this sense, each adaptive algorithm, say the $(k+1)$th WL, receives feedback from the previous WLs, i.e., the $1$st to the $k$th, and updates its inner parameters accordingly. As an example, if the current $(\mathbf{x}_t, d_t)$ is well modeled by the previous WLs, then the $(k+1)$th WL performs a minor update using $(\mathbf{x}_t, d_t)$ and may give more emphasis (importance weight) to the later arriving samples that may be worse modeled by the previous WLs. Thus, by boosting, each adaptive algorithm in the mixture can concentrate on different parts of the input and output pairs, achieving diversity and significantly improving the gain.
The linear online learning algorithms, such as SGD or NM, are among the simplest as well as the most widely used regression algorithms in real-life applications (Sayed (2003)). Therefore, we use such algorithms as base WLs in our boosting algorithms. To this end, we first apply the boosting notion to several parallel running linear NM-based WLs and introduce three different approaches to use the importance weights (Chen et al. (2012)), namely “weighted updates”, “data reuse”, and “random updates”. In the first approach, we use the importance weights directly to produce certain weighted NM algorithms. In the second approach, we use the importance weights to construct data reuse adaptive algorithms (Oza and Russell (2001)). However, data reuse in boosting, such as (Oza and Russell (2001)), is significantly different from the usual data reusing approaches in adaptive filtering (Shaffer and Williams (1983)). As an example, in boosting, the importance weight coming from the $k$th WL determines the data reuse amount in the $(k+1)$th WL, i.e., it is not used for the $k$th filter, hence achieving the diversity. The third approach uses the importance weights to decide whether to update the constituent WLs or not, based on a random number generated from a Bernoulli distribution with the parameter equal to the weight. The latter method can be effectively used for big data processing (Malik (2013)) due to the reduced complexity. The outputs of the constituent WLs are also combined using a linear mixture algorithm to construct the final output. We then update the final combination algorithm using the SGD algorithm (Kozat et al. (2010)). Furthermore, we extend the boosting idea to parallel running linear SGD-based algorithms similar to the NM case.
We start our discussions by investigating the related works in Section 2. We then introduce the problem setup and background in Section 3, where we provide individual sequence as well as MSE convergence results for the NM and SGD algorithms. We introduce our generic boosted online regression algorithm in Section 4 and provide the mathematical justifications for its performance. Then, in Sections 5 and 6, three different variants of the proposed boosting algorithm are derived, using the NM and SGD, respectively. Then, in Section 7 we provide the mathematical analysis for the computational complexity of the proposed algorithms. The paper concludes with extensive sets of experiments over the well known benchmark data sets and simulation models widely used in the machine learning literature to demonstrate the significant gains achieved by the boosting notion.
2 Related Works
AdaBoost is one of the earliest and most popular boosting methods, which has been used for binary and multiclass classifications as well as regression (Freund and E.Schapire (1997)). This algorithm has been well studied and has clear theoretical guarantees, and its excellent performance is explained rigorously (Breiman (1997)). However, AdaBoost does not perform well on noisy data sets (Servedio (2003)); therefore, other boosting methods that are more robust against noise have been suggested.
In order to reduce the effect of noise, SmoothBoost was introduced in (Servedio (2003)) in a batch setting. Moreover, in (Servedio (2003)) the author proves the termination time of the SmoothBoost algorithm by simultaneously obtaining upper and lower bounds on the weighted advantage of all samples over all of the weak learners. We note that the SmoothBoost algorithm avoids overemphasizing the noisy samples and hence provides robustness against noise. In (Oza and Russell (2001)), the authors extend bagging and boosting methods to an online setting, where they use a Poisson sampling process to approximate the reweighting algorithm. However, the online boosting method in (Oza and Russell (2001)) corresponds to AdaBoost, which is susceptible to noise. In (Babenko et al. (2009)), the authors use a greedy optimization approach to extend the boosting notion to the online setting and introduce stochastic boosting. Nevertheless, while most of the online boosting algorithms in the literature seek to approximate AdaBoost, the authors of (Chen et al. (2012)) investigate the inherent difference between batch and online learning, extend the SmoothBoost algorithm to an online setting, and provide the mathematical guarantees for their algorithm. They point out that the online weak learners do not need to perform well on all possible distributions of data; instead, they have to perform well only with respect to smoother distributions. Recently, in (Beygelzimer et al. (2015b)) the authors have developed two online boosting algorithms for classification: an optimal algorithm in terms of the number of weak learners, and an adaptive algorithm using the potential functions and boost-by-majority (Freund (1995)).
In addition to the classification task, the boosting approach has also been developed for regression (Duffy and Helmbold (2002)). In (Bertoni et al. (1997)), a boosting algorithm for regression is proposed, which is an extension of AdaBoost.R (Bertoni et al. (1997)). Moreover, in (Duffy and Helmbold (2002)), several gradient descent algorithms are presented, and some bounds on their performances are provided. In (Babenko et al. (2009)) the authors present a family of boosting algorithms for online regression through greedy minimization of a loss function. Also, in (Beygelzimer et al. (2015a)) the authors propose an online gradient boosting algorithm for regression.
In this paper we propose a novel family of boosted online algorithms for the regression task using the “online boosting” notion introduced in (Chen et al. (2012)), and investigate three different variants of the introduced algorithm. Furthermore, we show that our algorithm can achieve a desired mean squared error (MSE), given a sufficient amount of data and a sufficient number of weak learners. In addition, we use similar techniques to (Servedio (2003)) to prove the correctness of our algorithm. We emphasize that our algorithm has a guaranteed performance in an individual sequence manner, i.e., without any statistical assumptions on the data. In establishing our algorithm and its justifications, we refrain from changing the regression problem to the classification problem, unlike the AdaBoost.R (Freund and E.Schapire (1997)). Furthermore, unlike the online SmoothBoost (Chen et al. (2012)), our algorithm can learn the guaranteed MSE of the weak learners, which in turn improves its adaptivity.
3 Problem Description and Background
All vectors are column vectors and represented by bold lower case letters. Matrices are represented by bold upper case letters. For a vector $\mathbf{x}$ (or a matrix $\mathbf{A}$), $\mathbf{x}^T$ (or $\mathbf{A}^T$) is the transpose and Tr($\mathbf{A}$) is the trace of the matrix $\mathbf{A}$. Here, $\mathbf{I}_m$ and $\mathbf{0}_m$ represent the identity matrix of dimension $m \times m$ and the all zeros vector of length $m$, respectively. Except $\mathbf{I}_m$ and $\mathbf{0}_m$, the time index is given in the subscript, i.e., $\mathbf{x}_t$ is the sample at time $t$. We work with real data for notational simplicity. We denote the mean of a random variable $x$ as $E[x]$. Also, we show the cardinality of a set $S$ by $|S|$.
We sequentially receive $r$-dimensional input (regressor) vectors $\{\mathbf{x}_t\}_{t \geq 1}$, $\mathbf{x}_t \in \mathbb{R}^r$, and desired data $\{d_t\}_{t \geq 1}$, and estimate $d_t$ by $\hat{d}_t = f_t(\mathbf{x}_t)$, where $f_t(\cdot)$ is an online regression algorithm. At each time $t$ the estimation error is given by $e_t = d_t - \hat{d}_t$ and is used to update the parameters of the WL. For presentation purposes, we assume that $d_t \in [-1, 1]$; however, our derivations hold for any bounded but arbitrary desired data sequences. In our framework, we do not use any statistical assumptions on the input feature vectors or on the desired data, so that our results are guaranteed to hold in an individual sequence manner (Kozat and Singer (Jan. 2008)).
The linear methods are considered as the simplest online modeling or learning algorithms, which estimate the desired data by a linear model as , where is the linear algorithm’s coefficients at time . Note that the previous expression also covers the affine model if one includes a constant term in , hence we use the purely linear form for notational simplicity. When the true is revealed, the algorithm updates its coefficients based on the error . As an example, in the basic implementation of the NM algorithm, the coefficients are selected to minimize the accumulated squared regression error up to time as
where is a fixed vector of coefficients. The NM algorithm is shown to enjoy several optimality properties under different statistical settings (Sayed (2003)). Apart from these results and more related to the framework of this paper, the NM algorithm is also shown to be rate optimal in an individual sequence manner (Merhav and Feder (1993)). As shown in (Merhav and Feder (1993)) (Section V), when applied to any sequence and , the accumulated squared error of the NM algorithm is as small as the accumulated squared error of the best batch least squares (LS) method that is directly optimized for these realizations of the sequences, i.e., for all , and , the NM achieves
The NM algorithm is a member of the Follow-the-Leader type algorithms (Cesa-Bianchi and Lugosi (2006)) (Section 3), where one uses the best performing linear model up to time to predict . Hence, (2) follows by direct application of the online convex optimization results (Shalev-Shwartz (2012)) after regularization. The convergence rate (or the rate of the regret) of the NM algorithm is also shown to be optimal so that in the upper bound cannot be improved (Singer et al. (2002)). It is also shown in (Singer et al. (2002)) that one can reach the optimal upper bound (with exact scaling terms) by using a slightly modified version of (1)
Note that the extension (3) of (1) is a forward algorithm (Section 5 of Azoury and Warmuth (2001)) and one can show that, in the scalar case, the predictions of (3) are always bounded (which is not the case for (1)) (Singer et al. (2002)).
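To make the Follow-the-Leader view concrete, the following minimal sketch runs a scalar regularized least squares predictor online and compares its accumulated squared error with that of the best fixed coefficient chosen in hindsight. The function names, the regularizer `eps`, and the toy data are illustrative assumptions and are not taken from the derivation above.

```python
import random


def online_ls_scalar(xs, ds, eps=1e-3):
    """Predict each d_t with the least squares fit to all earlier pairs."""
    sxx, sxd = eps, 0.0          # regularized running sums of x*x and x*d
    total_sq_err = 0.0
    for x, d in zip(xs, ds):
        w = sxd / sxx            # best coefficient on the data seen so far
        total_sq_err += (d - w * x) ** 2
        sxx += x * x             # fold the new pair in only after predicting
        sxd += x * d
    return total_sq_err


def batch_ls_scalar(xs, ds):
    """Accumulated squared error of the best fixed coefficient in hindsight."""
    sxx = sum(x * x for x in xs)
    sxd = sum(x * d for x, d in zip(xs, ds))
    w = sxd / sxx
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Desired data clipped to [-1, 1], matching the bounded-data assumption.
ds = [max(-1.0, min(1.0, 0.5 * x + rng.gauss(0.0, 0.05))) for x in xs]
regret = online_ls_scalar(xs, ds) - batch_ls_scalar(xs, ds)
print(f"regret after {len(xs)} samples: {regret:.4f}")
```

On such a sequence the gap between the online and the batch error should stay small relative to the number of samples, which is the qualitative behavior the regret bound describes.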
We emphasize that in the basic application of the NM algorithm, all data pairs , , receive the same “importance” or weight in (1). Although there exist exponentially weighted or windowed versions of the basic NM algorithm (Sayed (2003)), these methods weight (or concentrate on) the most recent samples for better modeling of the nonstationarity (Sayed (2003)). However, in the boosting framework (Freund and E.Schapire (1997)), each sample pair receives a different weight based on not only those weighting schemes, but also the performance of the boosted algorithms on this pair. As an example, if a WL performs worse on a sample, the next WL concentrates more on this example to better rectify this mistake. In the following sections, we use this notion to derive different boosted online regression algorithms.
Although in this paper we use linear WLs for the sake of notational simplicity, one can readily extend our approach to nonlinear and piecewise linear regression methods. For example, one can use tree based online regression methods (Khan et al. (2016); Vanli and Kozat (2014); Kozat et al. (2007)) as the weak learners, and boost them with the proposed approach.
4 New Boosted Online Regression Algorithm
In this section we present the generic form of our proposed algorithms and provide guaranteed performance bounds for it. Regarding the notion of “online boosting” introduced in (Chen et al. (2012)), the online weak learners need to perform well only over smooth distributions of data points. We first present the generic algorithm in Algorithm 1 and provide its theoretical justifications, then discuss its structure and the intuition behind it.
In Algorithm 1, we have copies of an online WL, each of which is guaranteed to have a weighted MSE of at most . We prove that Algorithm 1 can reach a desired MSE, , through Lemma 1, Lemma 2, and Theorem 1. Note that since we assume , the trivial solution incurs an MSE of at most . Therefore, we define a weak learner as an algorithm which has an MSE less than .
Lemma 2. If the weak learners are guaranteed to have a weighted MSE less than , i.e.,
there is an integer that satisfies the conditions in Lemma 1.
Proof. The proof of Lemma 2 is given in Appendix B.
Theorem 1. If the weak learners in line 11 of Algorithm 1 achieve a weighted MSE of at most , there exists an upper bound for such that the algorithm reaches the desired MSE.
Proof. This theorem is a direct consequence of combining Lemma 1 and Lemma 2.
Note that although we are using copies of a base learner as the weak learners and seek to improve its performance, the constituent WLs can be different. However, by using the boosting approach, we can improve the MSE performance of the overall system as long as the WLs can provide a weighted MSE of at most . For example, we can improve the performance of mixture-of-experts algorithms (Arenas-Garcia et al. (2016)) by leveraging the boosting approach introduced in this paper.
As shown in Fig. 1, at each iteration , we have parallel running WLs with estimating functions , producing estimates of , . As an example, if we use “linear” algorithms, is the estimate generated by the WL. The outputs of these WLs are then combined using the linear weights to produce the final estimate as (Kozat et al. (2010)), where is the vector of outputs. After the desired output is revealed, the parallel running WLs will be updated for the next iteration. Moreover, the linear combination coefficients are also updated using the normalized SGD (Sayed (2003)), as detailed later in Section 4.1.
After is revealed, the constituent WLs, , , are consecutively updated, as shown in Fig. 1, from top to bottom, i.e., first is updated, then, and finally is updated. However, to enhance the performance, we use a boosted updating approach (Freund and E.Schapire (1997)), such that the WL receives a “total loss” parameter, , from the WL, as
to compute a weight . The total loss parameter , indicates the sum of the differences between the modified desired MSE () and the squared error of the first WLs at time . Then, we add the difference to , to generate , and pass to the next WL, as shown in Fig. 1. Here, measures how much the WL is off with respect to the final MSE performance goal. For example, in a stationary environment, if , where is a deterministic function and is the observation noise, one can select the desired MSE as an upper bound on the variance of the noise process , and define a modified desired MSE as . In this sense, measures how the WLs are cumulatively performing on pair with respect to the final performance goal.
We then use the weight to update the WL with the “weighted updates”, “data reuse”, or “random updates” method, which we explain later in Sections 5 and 6. Our aim is to make large if the first WLs made large errors on , so that the WL gives more importance to in order to rectify the performance of the overall system. We now explain how to construct these weights, such that . To this end, we set , for all , and introduce a weighting similar to (Servedio (2003); Chen et al. (2012)). We define the weights as
where is the guaranteed upper bound on the weighted MSE of the weak learners. However, since there is no prior information about the exact MSE performance of the weak learners, we use the following weighting scheme
where indicates an estimate of the weak learner’s MSE, and is a design parameter, which determines the “dependence” of each WL update on the performance of the previous WLs, i.e., corresponds to “independent” updates, like the ordinary combination of the WLs in adaptive filtering (Kozat et al. (2010); Arenas-Garcia et al. (2016)), while a greater indicates the greater effect of the previous WLs performance on the weight of the current WL. Note that including the parameter does not change the validity of our proofs, since one can take as the new guaranteed weighted MSE. Here, is an estimate of the “Weighted Mean Squared Error” (WMSE) of the WL over and . In the basic implementation of the online boosting (Servedio (2003); Chen et al. (2012)), is set to the classification advantage of the weak learners (Servedio (2003)), where this advantage is assumed to be the same for all weak learners. In this paper, to avoid using any a priori knowledge and to be completely adaptive, we choose as the weighted and thresholded MSE of the WL up to time as
where , and thresholds into the range . This thresholding is necessary to assure that , which guarantees for all and . We point out that (7) can be recursively calculated.
Regarding the definition of , if the first WLs are “good”, we will pass less weight to the next WLs, such that those WLs can concentrate more on the other samples. Hence, the WLs can increase the diversity by concentrating on different parts of the data (Kozat et al. (2010)). Furthermore, following this idea, in (6), the weight is larger, i.e., close to 1, if most of the WLs, , have errors larger than on , and smaller, i.e., close to 0, if the pair is easily modeled by the previous WLs such that the WLs do not need to concentrate more on this pair.
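The weight passing described above can be sketched for a single sample as follows. The sketch assumes a SmoothBoost-style weight of the form min(1, delta^(c * l)), where l is the running total loss, delta in (0, 1] is the WMSE estimate of a WL, and c is the dependence parameter. This functional form, the function name `importance_weights`, and all numbers in the usage lines are illustrative assumptions rather than a transcription of (4)-(6).

```python
def importance_weights(sq_errors, deltas, sigma2, c=1.0):
    """Weights passed down a chain of weak learners for one sample pair.

    sq_errors[k]: squared error of the k-th WL on the current pair
    deltas[k]:    WMSE estimate of the k-th WL, assumed to lie in (0, 1]
    sigma2:       (modified) desired MSE
    c:            dependence parameter (assumed >= 0); c = 0 gives
                  independent updates
    """
    weights = []
    total_loss = 0.0                     # the first WL sees zero total loss
    for sq_err, delta in zip(sq_errors, deltas):
        if total_loss <= 0.0:
            # delta <= 1 raised to a non-positive power is >= 1,
            # so the minimum with 1 is exactly 1.
            weights.append(1.0)
        else:
            weights.append(delta ** (c * total_loss))
        # Add how far this WL is below (positive) or above (negative)
        # the desired MSE, then pass the total on to the next WL.
        total_loss += sigma2 - sq_err
    return weights


easy = importance_weights([0.01, 0.01, 0.01], [0.5, 0.5, 0.5], sigma2=0.1)
hard = importance_weights([0.90, 0.90, 0.90], [0.5, 0.5, 0.5], sigma2=0.1)
print(easy)   # weights decay down the chain for a well-modeled pair
print(hard)   # weights stay at 1 for a poorly modeled pair
```

Under these assumptions, a pair that the earlier WLs already model well receives progressively smaller weights, while a pair with errors above the desired MSE keeps full weight throughout the chain.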
4.1 The Combination Algorithm
Although in the proof of our algorithm, we assume a constant combination vector over time, we use a time varying combination vector in practice, since there is no knowledge about the exact number of the required weak learners for each problem. Hence, after is revealed, we also update the final combination weights based on the final output , where , . To update the final combination weights, we use the normalized SGD algorithm (Sayed (2003)) yielding
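As a rough illustration of the combination step, the sketch below forms the final estimate as a linear combination of the WL outputs and applies one normalized SGD step to the combination weights. The names `combine` and `normalized_sgd_step`, the step size `mu`, the small regularizer `eps`, and the uniform initialization are assumptions made for this example.

```python
def combine(z, y):
    """Final estimate: linear combination of the weak learner outputs."""
    return sum(zk * yk for zk, yk in zip(z, y))


def normalized_sgd_step(z, y, d, mu=0.1, eps=1e-8):
    """One normalized SGD step on the combination weights."""
    err = d - combine(z, y)
    norm = eps + sum(yk * yk for yk in y)   # eps guards against y = 0
    return [zk + mu * err * yk / norm for zk, yk in zip(z, y)], err


m = 3
z = [1.0 / m] * m                  # uniform start (an assumed initialization)
y = [0.2, 0.5, 0.8]                # outputs of the three WLs at one time step
d = 0.9                            # desired value revealed after predicting
z_next, err = normalized_sgd_step(z, y, d)
print(err, d - combine(z_next, y))  # the error on the same pair shrinks
```

Normalizing by the squared norm of the output vector makes the correction on the current pair a fixed fraction `mu` of the error, independent of the scale of the WL outputs.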
4.2 Choice of Parameter Values
The choice of is a crucial task, i.e., we cannot reach any desired MSE for any data sequence unconditionally. As an example, suppose that the data are generated randomly according to a known distribution, while they are contaminated with a white noise process. It is clear that we cannot obtain an MSE level below the noise power. However, if the WLs are guaranteed to satisfy the conditions of Theorem 1, this would not happen. Intuitively, there is a guaranteed upper bound (i.e., ) on the worst case performance, since in the weighted MSE, the samples with a higher error have a more important effect. On the other hand, if one chooses a smaller than the noise power, will be negative for almost every , turning most of the weights into 1, and as a result the weak learners fail to reach a weighted MSE smaller than . Nevertheless, in practice we have to choose the parameter reasonably and precisely such that the conditions of Theorem 1 are satisfied. For instance, we set to be an upper bound on the noise power.
In addition, the number of weak learners, , is chosen according to the computational complexity constraints. However, in our experiments we choose a moderate number of weak learners, , which successfully improves the performance. Moreover, according to the results in Section 8.3, the optimum value for is around 1, hence, we set the parameter in our simulations.
5 Boosted NM Algorithms
At each time , all of the WLs (shown in Fig. 1) estimate the desired data in parallel, and the final estimate is a linear combination of the results generated by the WLs. When the WL receives the weight , it updates the linear coefficients using one of the following methods.
5.1 Directly Using ’s as Sample Weights
Here, we consider as the weight for the observation pair and apply a weighted NM update to . For this particular weighted NM algorithm, we define the Hessian matrix and the gradient vector as
where is the forgetting factor (Sayed (2003)) and can be calculated in a recursive manner as
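A weighted NM step of this kind can be sketched with the matrix inversion lemma. The sketch assumes that the Hessian is the exponentially discounted sum of outer products with each term scaled by its importance weight, i.e., R_t = beta * R_(t-1) + lam * x x^T. That form, the function name, and the toy target are assumptions for illustration only.

```python
import random


def weighted_nm_update(w, P, x, d, lam, beta=1.0):
    """One weighted recursive least squares step.

    w:    coefficient list
    P:    inverse of the weighted Hessian (list of rows, kept symmetric)
    lam:  importance weight in [0, 1]
    beta: forgetting factor in (0, 1]
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    q = sum(x[i] * Px[i] for i in range(n))            # x^T P x
    denom = beta + lam * q
    err = d - sum(w[i] * x[i] for i in range(n))       # a priori error
    w_new = [w[i] + lam * err * Px[i] / denom for i in range(n)]
    # Matrix inversion lemma for beta * R + lam * x x^T; P is symmetric,
    # so P x x^T P equals the outer product of Px with itself.
    P_new = [[(P[i][j] - lam * Px[i] * Px[j] / denom) / beta
              for j in range(n)] for i in range(n)]
    return w_new, P_new


rng = random.Random(0)
w, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]        # P = I / 0.01
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    d = 0.3 * x[0] - 0.2 * x[1]                        # noise-free toy target
    w, P = weighted_nm_update(w, P, x, d, lam=0.5)
print(w)
```

With `lam = 0` the coefficients are left untouched, and with `lam = 1` and `beta = 1` the step reduces to the unweighted recursive least squares update.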
5.2 Data Reuse Approaches Based on The Weights
Another approach follows Ozaboost (Oza and Russell (2001)). In this approach, from , we generate an integer, say , where is a design parameter that takes on positive integer values. We then apply the NM update on the pair repeatedly times, i.e., run the NM update on the same pair times consecutively. Note that should be determined according to the computational complexity constraints. However, increasing does not necessarily result in a better performance, therefore, we use moderate values for , e.g., we use in our simulations. The final is calculated after NM updates. As a major advantage, clearly, this reusing approach can be readily generalized to other adaptive algorithms in a straightforward manner.
We point out that Ozaboost (Oza and Russell (2001)) uses a different data reuse strategy. In this approach, is used as the parameter of a Poisson distribution and an integer is randomly generated from this Poisson distribution. One then applies the NM update times.
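The two ways of turning a weight into a number of repeated updates can be sketched as follows. The deterministic rule assumes that the integer is obtained by rounding K times the weight up to the next integer; the Poisson rule draws the count with the weight as the Poisson parameter, using Knuth's multiplication method. The function names and the scalar SGD step used as a stand-in update are hypothetical.

```python
import math
import random


def reuse_count(lam, K):
    """Deterministic number of repeats (assumed rule: ceil(K * lam))."""
    return math.ceil(K * lam)


def poisson_count(lam, rng):
    """Random number of repeats drawn from Poisson(lam), Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def repeat_update(update, w, x, d, n):
    """Apply the same single-pair update n times in a row."""
    for _ in range(n):
        w = update(w, x, d)
    return w


def scalar_sgd(w, x, d, mu=0.1):
    """Stand-in single-pair update; any adaptive update could be used."""
    return w + mu * (d - w * x) * x


rng = random.Random(0)
n_det = reuse_count(0.3, K=5)
n_rand = poisson_count(0.3, rng)
print(n_det, n_rand)
print(repeat_update(scalar_sgd, 0.0, 1.0, 1.0, n_det))
```

Because `repeat_update` only needs a function that maps the current coefficients and a pair to new coefficients, the same wrapper applies to NM-style and SGD-style updates alike.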
5.3 Random Updates Approach Based on The Weights
In this approach, we simply use the weight as a probability of updating the WL at time . To this end, we generate a Bernoulli random variable, which is with probability and is with probability . Then, we update the WL, only if the Bernoulli random variable equals . With this method, we significantly reduce the computational complexity of the algorithm. Moreover, due to the dependence of this Bernoulli random variable on the performance of the previous constituent WLs, this method does not degrade the MSE performance, while offering a considerably lower complexity, i.e., when the MSE is low, there is no need for further updates, hence, the probability of an update is low, while this probability is larger when the MSE is high.
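The update decision itself is a single Bernoulli draw per WL. The sketch below, with a hypothetical helper `bernoulli_update_mask` and made-up weights, draws the decisions many times to show that the empirical update frequency tracks the weight.

```python
import random


def bernoulli_update_mask(weights, rng):
    """Decide, per weak learner, whether to update on the current pair."""
    # rng.random() is uniform on [0, 1), so a weight of 1 always updates
    # and a weight of 0 never does.
    return [rng.random() < lam for lam in weights]


rng = random.Random(0)
weights = [1.0, 0.6, 0.2, 0.0]          # illustrative importance weights
counts = [0] * len(weights)
trials = 10000
for _ in range(trials):
    for k, flag in enumerate(bernoulli_update_mask(weights, rng)):
        counts[k] += flag
print([c / trials for c in counts])     # empirical update frequencies
```

Only the WLs whose flag is set pay the cost of an update, which is where the complexity saving of this variant comes from.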
6 Boosted SGD Algorithms
In this case, as shown in Fig. 1, we have parallel running WLs, each of which is updated using the SGD algorithm. Based on the weights given in (6) and the total loss and MSE parameters in (4) and (7), we next introduce three SGD based boosting algorithms, similar to those introduced in Section 5.
6.1 Directly Using ’s to Scale The Learning Rates
We note that, by the construction in (6), , thus, these weights can be directly used to scale the learning rates for the SGD updates. When the WL receives the weight , it updates its coefficients , as
where . Note that we can choose for all , since the online algorithms work consecutively from top to bottom, and the WL will have a different learning rate .
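Scaling the learning rate by the weight can be sketched in a few lines. The function name, the default step size `mu`, and the example values are assumptions for illustration.

```python
def weighted_sgd_update(w, x, d, lam, mu=0.1):
    """SGD step whose learning rate mu is scaled by the weight lam."""
    err = d - sum(wi * xi for wi, xi in zip(w, x))
    step = mu * lam                     # effective learning rate in [0, mu]
    return [wi + step * err * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x, d = [1.0, -1.0], 0.5
print(weighted_sgd_update(w, x, d, lam=1.0))   # full step
print(weighted_sgd_update(w, x, d, lam=0.2))   # damped step for an easy pair
```

Since the weight lies between 0 and 1, a single shared `mu` still yields a different effective learning rate for each WL at each time.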
6.2 A Data Reuse Approach Based on The Weights
In this scenario, for updating , we use the SGD update times to obtain the as
where is a constant design parameter.
Similar to the NM case, if we follow the Ozaboost (Oza and Russell (2001)), we use the weights to generate a random number from a Poisson distribution with parameter , and perform the SGD update times on as explained above.
6.3 Random Updates Based on The Weights
Again, in this scenario, similar to the NM case, we use the weight to generate a random number from a Bernoulli distribution, which equals with probability , and equals with probability . Then we update using SGD only if the generated number is .
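Putting the pieces together, the following is a simplified, self-contained sketch of a boosted SGD loop with random updates. It is not a transcription of the paper's algorithm: the weight form min(1, delta^(c * l)), the running WMSE estimate with errors clipped to [-2, 2] and scaled by 1/4, the use of the previous estimate of delta when forming the weight, the regularized normalized SGD combination, and every parameter value are assumptions chosen for illustration.

```python
import random


def boosted_sgd_random_updates(xs, ds, m=5, mu=0.1, mu_z=0.1,
                               sigma2=0.01, c=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(xs[0])
    W = [[0.0] * dim for _ in range(m)]   # coefficients of the m WLs
    z = [1.0 / m] * m                     # combination weights
    num = [1e-3] * m                      # WMSE numerators
    den = [1e-3] * m                      # WMSE denominators (delta starts at 1)
    sq_err_sum, updates = 0.0, 0
    for x, d in zip(xs, ds):
        y = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
        d_hat = sum(zk * yk for zk, yk in zip(z, y))
        sq_err_sum += (d - d_hat) ** 2
        total_loss = 0.0
        for k in range(m):                # top-to-bottom boosting round
            delta = num[k] / den[k]       # stays in (0, 1] by construction
            lam = 1.0 if total_loss <= 0.0 else delta ** (c * total_loss)
            err = d - y[k]
            if rng.random() < lam:        # Bernoulli(lam) update decision
                W[k] = [wi + mu * err * xi for wi, xi in zip(W[k], x)]
                updates += 1
            clipped = max(-2.0, min(2.0, err))
            num[k] += lam * clipped * clipped / 4.0
            den[k] += lam
            total_loss += sigma2 - err * err
        norm = 1e-2 + sum(yk * yk for yk in y)
        z = [zk + mu_z * (d - d_hat) * yk / norm for zk, yk in zip(z, y)]
    return sq_err_sum / len(xs), updates


rng = random.Random(1)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(3000)]
ds = [max(-1.0, min(1.0, 0.5 * x[0] - 0.3 * x[1] + rng.gauss(0, 0.05)))
      for x in xs]
mse, updates = boosted_sgd_random_updates(xs, ds)
print(f"MSE {mse:.4f}, WL updates {updates} of {5 * len(xs)} possible")
```

The first WL always updates because it sees zero total loss, so the update count lies between the number of samples and m times that number; how far below the upper end it falls depends on how often the earlier WLs already meet the desired MSE.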
7 Analysis Of The Proposed Algorithms
In this section we provide the complexity analysis for the proposed algorithms. We prove an upper bound for the weights , which is significantly less than 1. This bound shows that the complexity of the “random updates” algorithm is significantly less than the other proposed algorithms, and slightly greater than that of a single WL. Hence, it shows the considerable advantage of “boosting with random updates” in processing of high dimensional data.
7.1 Complexity Analysis
Here we compare the complexity of the proposed algorithms and find an upper bound for the computational complexity of the random updates scenario (introduced in Section 5.3 for NM, and in Section 6.3 for SGD updates), which shows its significantly lower computational burden with respect to the two other approaches. For , each WL performs computations to generate its estimate, and if updated using the NM algorithm, requires computations due to updating the matrix , while it needs computations when updated using the SGD method (in their most basic implementation).
We first derive the computational complexity of using the NM updates in different boosting scenarios. Since there are a total of WLs, all of which are updated in the “weighted updates” method, this method has a computational cost of order per iteration . However, in the “random updates”, at iteration , the WL may or may not be updated with probabilities and respectively, yielding
where indicates the complexity of running the WL at iteration . Therefore, the total computational complexity at iteration will be , which yields
Hence, if is upper bounded by , the average computational complexity of the random updates method, will be
In Theorem 2, we provide sufficient constraints to have such an upper bound.
Furthermore, we can use such a bound for the “data reuse” mode as well. In this case, for each WL , we perform the NM update times, resulting in a computational complexity of order . For the SGD updates, we similarly obtain the computational complexities , , and , for the “weighted updates”, “random updates”, and “data reuse” scenarios respectively.
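The order-level comparison above can be checked with a small operation count. The sketch assumes an input dimension `r`, an estimate cost of `r` operations, and an update cost of `r * r` for NM or `r` for SGD, with constants ignored; the function name and the numbers are hypothetical.

```python
def per_iteration_cost(weights, r, update_cost):
    """Expected operation count for one iteration under random updates.

    weights:     update probabilities of the weak learners
    r:           input dimension (cost of producing one estimate)
    update_cost: cost of one coefficient update, e.g. r * r for NM
    """
    # Each WL costs update_cost with probability lam and r otherwise.
    return sum(lam * update_cost + (1.0 - lam) * r for lam in weights)


r, m = 50, 20
always = per_iteration_cost([1.0] * m, r, r * r)   # every WL updates
sparse = per_iteration_cost([0.3] * m, r, r * r)   # updates with prob. 0.3
print(always, sparse)
```

Setting all weights to 1 recovers the cost of the “weighted updates” scenario, so the ratio of the two numbers gives a rough sense of the saving when the weights are bounded away from 1.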
The following theorem determines the upper bound for .
Theorem 2. If the WLs converge and achieve a sufficiently small MSE (according to the proof following this Theorem), the following upper bound is obtained for , given that is chosen properly,
where and .
It can be straightforwardly shown that this bound is less than for appropriate choices of , and reasonable values for the MSE according to the proof. This theorem states that if we adjust such that it is achievable, i.e., the WLs can provide a slightly lower MSE than , the probability of updating the WLs in the random updates scenario will decrease. This is of course our desired result, since if the WLs are performing sufficiently well, there is no need for additional updates. Moreover, if is chosen such that the WLs cannot achieve an MSE equal to , the WLs have to be updated at each iteration, which increases the complexity.
Proof: For simplicity, in this proof, we have assumed that , however, the results are readily extended to the general values of . We construct our proof based on the following assumption:
Assumption: assume that ’s are independent and identically distributed (i.i.d) zero-mean Gaussian random variables with variance .
Now, we show that under certain conditions, will be less than 1, hence, we obtain an upper bound for . We define , yielding
where is the moment generating function of the random variable . From Algorithm 2, . According to the Assumption, is a standard normal random variable. Therefore, has a Gamma distribution as (Papoulis and Pillai (2002)), which results in the following moment generating function for
In the above equality is a random variable, the mean of which is denoted by . We point out that will approach in convergence. We define a function such that , and seek to find a condition for to be a concave function. Then, by using Jensen’s inequality for concave functions, we have
Inspired by (20), we define and . By these definitions we obtain
Considering that , in order for to be concave, it suffices to have
which reduces to the following necessary and sufficient conditions: