Zeroth-Order Online Alternating Direction Method of Multipliers: Convergence Analysis and Applications
In this paper, we design and analyze a new zeroth-order online algorithm, namely, the zeroth-order online alternating direction method of multipliers (ZOO-ADMM), which enjoys the dual advantages of gradient-free operation and the ability of ADMM to accommodate complex structured regularizers. Compared to the first-order gradient-based online algorithm, we show that ZOO-ADMM requires $m$ times more iterations, leading to a convergence rate of $O(\sqrt{m}/\sqrt{T})$, where $m$ is the number of optimization variables, and $T$ is the number of iterations. To accelerate ZOO-ADMM, we propose two minibatch strategies: gradient sample averaging and observation averaging, resulting in an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size. In addition to convergence analysis, we also demonstrate applications of ZOO-ADMM in signal processing, statistics, and machine learning.
Online convex optimization (OCO) performs sequential inference in a data-driven adaptive fashion, and has found a wide range of applications [1, 2, 3]. In this paper, we focus on regularized convex optimization in the OCO setting, where a cumulative empirical loss is minimized together with a fixed regularization term. Regularized loss minimization is a common learning paradigm, which has been very effective in promoting sparsity through $\ell_1$ or mixed $\ell_1/\ell_2$ regularization, low-rank matrix completion via nuclear norm regularization, graph signal recovery via graph Laplacian regularization, and constrained optimization by imposing indicator functions of constraint sets.
Several OCO algorithms have been proposed for regularized optimization, e.g., composite mirror descent, namely, proximal stochastic gradient descent [8], regularized dual averaging [9], and adaptive gradient descent [10]. However, the complexity of the aforementioned algorithms is dominated by the computation of the proximal operation with respect to the regularizers. An alternative is to use the online alternating direction method of multipliers (O-ADMM) [11, 12, 13]. Different from the algorithms in [8, 9, 10], the ADMM framework offers the possibility of splitting the optimization problem into a sequence of easily solved subproblems. It was shown in [11, 12, 13] that the online variant of ADMM has a convergence rate of $O(1/\sqrt{T})$ for convex loss functions and $O(\log T / T)$ for strongly convex loss functions, where $T$ is the number of iterations.
One limitation of existing O-ADMM algorithms is the need to compute and repeatedly evaluate the gradient of the loss function over the iterations. In many practical scenarios, an explicit expression for the gradient is difficult to obtain. For example, in bandit optimization, a player receives partial feedback in terms of loss function values revealed by her adversary, making it impossible to compute the gradient of the full loss function. Similarly, in simulation-based optimization problems, a black-box computation model does not provide explicit function expressions or gradients [15, 16]. In adversarial black-box machine learning models, only the function values (e.g., prediction results) are provided. Moreover, in some high-dimensional settings, acquiring the gradient information may be difficult, e.g., when it involves matrix inversion. This motivates the development of gradient-free (zeroth-order) optimization algorithms.
Zeroth-order optimization approximates the full gradient via a randomized gradient estimate [19, 20, 21, 14, 22, 23]. For example, in [14, 22], zeroth-order algorithms were developed for bandit convex optimization with multi-point bandit feedback. A zeroth-order gradient descent algorithm has also been proposed with an $O(m/\sqrt{T})$ convergence rate, where $m$ is the number of variables in the objective function. This slowdown in convergence rate was subsequently improved to $O(\sqrt{m}/\sqrt{T})$, and the optimality of the improved rate was further proved under the framework of mirror descent algorithms.
Different from the aforementioned zeroth-order algorithms, in this paper we propose a zeroth-order online ADMM (called ZOO-ADMM) algorithm, which enjoys advantages of gradient-free computation as well as ADMM. We analyze the convergence of ZOO-ADMM under different settings, including stochastic optimization, learning with strongly convex loss functions, and minibatch strategies for convergence acceleration. To the best of our knowledge, this is the first work to study zeroth-order ADMM-type algorithms with convergence guarantees. We summarize our contributions as follows.
We integrate the idea of zeroth-order optimization with online ADMM, leading to a new gradient-free OCO algorithm, ZOO-ADMM.
We prove that ZOO-ADMM yields an $O(\sqrt{m}/\sqrt{T})$ convergence rate for smooth+nonsmooth composite objective functions.
We introduce two minibatch strategies for acceleration of ZOO-ADMM, leading to an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size.
We illustrate ZOO-ADMM in black-box optimization for music recommendation systems, sensor selection for signal processing, and sparse Cox regression for biomarker feature selection.
II ADMM: From First Order to Zeroth Order
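In this paper, we consider the regularized loss minimization problem over a time horizon of length $T$:
$$\underset{x \in \mathcal{X},\, y \in \mathcal{Y}}{\text{minimize}} \quad \frac{1}{T}\sum_{t=1}^{T} f(x; w_t) + \phi(y) \qquad \text{subject to} \quad Ax + By = c, \tag{1}$$
where $x \in \mathbb{R}^m$ and $y$ are optimization variables, $\mathcal{X}$ and $\mathcal{Y}$ are closed convex sets, $f(\cdot\,; w_t)$ is a convex and smooth cost/loss function parameterized by $w_t$ at time $t$, $\phi(\cdot)$ is a convex regularization function (possibly nonsmooth), and $A$, $B$, and $c$ are appropriate coefficients associated with a system of linear constraints.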
In problem (1), the use of time-varying cost functions captures possibly time-varying environmental uncertainties that may exist in the online setting [1, 24]. We can also write the online cost as $f_t(x)$ when it cannot be explicitly parameterized by $w_t$. One interpretation of $\frac{1}{T}\sum_{t=1}^{T} f(x; w_t)$ is the empirical approximation to the stochastic objective function $\mathbb{E}_{w \sim P}[f(x; w)]$. Here $P$ is an empirical distribution with density $p(w) = \frac{1}{T}\sum_{t=1}^{T}\delta(w - w_t)$, where $\{w_t\}_{t=1}^{T}$ is a set of i.i.d. samples, and $\delta(\cdot - w_t)$ is the Dirac delta function at $w_t$. We also note that when $\mathcal{Y} = \mathcal{X}$, $A = I_m$, $B = -I_m$, and $c = 0_m$, the variable $y$ and the linear constraint in (1) can be eliminated, leading to a standard OCO formulation. Here $I_m$ denotes the $m \times m$ identity matrix, and $0_m$ is the $m \times 1$ vector of all zeros. (Footnote 1: In the sequel we will omit the dimension index $m$, which can be inferred from the context.)
II-A Background on O-ADMM
O-ADMM [11, 13, 12] was originally proposed to extend batch-type ADMM methods to the OCO setting. For solving (1), a widely-used algorithm was developed by , which combines online proximal gradient descent and ADMM in the following form:
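$$x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ g_t^T x - \lambda_t^T (Ax + By_t - c) + \frac{\rho}{2}\|Ax + By_t - c\|_2^2 + \frac{1}{2\eta_t}\|x - x_t\|_{G_t}^2 \right\}, \tag{2}$$
$$y_{t+1} = \operatorname*{arg\,min}_{y \in \mathcal{Y}} \left\{ \phi(y) - \lambda_t^T (Ax_{t+1} + By - c) + \frac{\rho}{2}\|Ax_{t+1} + By - c\|_2^2 \right\}, \tag{3}$$
$$\lambda_{t+1} = \lambda_t - \rho\,(Ax_{t+1} + By_{t+1} - c), \tag{4}$$
where $t$ is the iteration number (possibly the same as the time step), $g_t$ is the gradient of the cost function $f(x; w_t)$ at $x_t$, namely, $g_t = \nabla_x f(x; w_t)|_{x = x_t}$, $\lambda_t$ is a Lagrange multiplier (also known as the dual variable), $\rho$ is a positive weight to penalize the augmented term associated with the equality constraint of (1), $\|\cdot\|_2$ denotes the $\ell_2$ norm, $\eta_t$ is a non-increasing sequence of positive step sizes, and $\|x - x_t\|_{G_t}^2 = (x - x_t)^T G_t (x - x_t)$ is a Bregman divergence generated by the strongly convex function $\frac{1}{2}x^T G_t x$ with a known symmetric positive definite coefficient matrix $G_t$.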
Similar to batch-type ADMM algorithms, the subproblem in (3) is often easily solved via the proximal operator with respect to $\phi$. However, one limitation of O-ADMM is that it requires the gradient $g_t$ in (2). Below, we develop a gradient-free (zeroth-order) O-ADMM algorithm that relaxes this requirement.
II-B Motivation of ZOO-ADMM
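To avoid explicit gradient computation, we consider the randomized gradient estimate
$$\hat{g}_t = \frac{f(x_t + \beta_t z_t; w_t) - f(x_t; w_t)}{\beta_t}\, z_t, \tag{5}$$
where $z_t \in \mathbb{R}^m$ is a random vector drawn independently at each iteration from a distribution $\mu$ with $\mathbb{E}_{z \sim \mu}[z z^T] = I$, and $\{\beta_t\}$ is a non-increasing sequence of small positive smoothing constants. Here for notational simplicity we replace $f(\cdot\,; w_t)$ with $f_t(\cdot)$. The rationale behind the estimator (5) is that $\hat{g}_t$ becomes an unbiased estimator of $g_t$ when the smoothing parameter $\beta_t$ approaches zero.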
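After replacing $g_t$ with $\hat{g}_t$ in (5), the resulting algorithm (2)-(4) can be implemented without explicit gradient computation. This extension is called zeroth-order O-ADMM (ZOO-ADMM), and it involves a modification of step (2):
$$x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ \hat{g}_t^T x - \lambda_t^T (Ax + By_t - c) + \frac{\rho}{2}\|Ax + By_t - c\|_2^2 + \frac{1}{2\eta_t}\|x - x_t\|_{G_t}^2 \right\}. \tag{6}$$
In (6), we can specify the matrix $G_t$ in such a way as to cancel the term $\|Ax\|_2^2$. This technique has been used in the linearized ADMM algorithms [7, 26] to avoid matrix inversions. Defining $G_t = \alpha I - \rho \eta_t A^T A$, the update rule (6) simplifies to a projection operator
$$x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \|x - \omega\|_2^2, \qquad \omega := \frac{\eta_t}{\alpha}\left(-\hat{g}_t + A^T\big(\lambda_t - \rho\,(Ax_t + By_t - c)\big)\right) + x_t, \tag{7}$$
where $\alpha > 0$ is a parameter selected to ensure $G_t \succeq I$. Here $P \succeq Q$ signifies that $P - Q$ is positive semidefinite.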
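The two ingredients described above, a two-point randomized gradient estimate and a projected update of the primal variable, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the choice of a box as the feasible set, the test function, and the argument `At_dual` (standing in for the constraint-dependent term that multiplies the dual variable and residual) are all hypothetical.

```python
import math
import random


def sample_sphere(m, rng):
    """Draw z uniformly from the sphere of radius sqrt(m), so that E[z z^T] = I."""
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(v * v for v in g))
    return [math.sqrt(m) * v / norm for v in g]


def two_point_gradient(f, x, beta, z):
    """Forward-difference estimate ((f(x + beta*z) - f(x)) / beta) * z."""
    shifted = [xi + beta * zi for xi, zi in zip(x, z)]
    slope = (f(shifted) - f(x)) / beta
    return [slope * zi for zi in z]


def project_box(v, lo, hi):
    """Euclidean projection onto the box [lo, hi]^m (an assumed feasible set)."""
    return [min(max(vi, lo), hi) for vi in v]


def zoo_x_update(x, g_hat, At_dual, eta, alpha, lo, hi):
    """Gradient-like step followed by projection.

    At_dual is a hypothetical placeholder for the dual/residual term of the
    linearized update; pass zeros to obtain plain projected descent.
    """
    omega = [xi + (eta / alpha) * (-gi + di)
             for xi, gi, di in zip(x, g_hat, At_dual)]
    return project_box(omega, lo, hi)


def mean_estimate(n_samples, seed=0):
    """Average many estimates for f(x) = 0.5 * ||x - 1||^2 at x = 0 (true gradient: -1)."""
    rng = random.Random(seed)
    m = 5
    f = lambda x: 0.5 * sum((xi - 1.0) ** 2 for xi in x)
    x = [0.0] * m
    acc = [0.0] * m
    for _ in range(n_samples):
        g_hat = two_point_gradient(f, x, 1e-3, sample_sphere(m, rng))
        acc = [a + g / n_samples for a, g in zip(acc, g_hat)]
    return acc
```

Calling `mean_estimate(20000)` averages many single-sample estimates; each individual estimate is noisy, but the average should lie close to the true gradient, which is the sense in which the estimator is (nearly) unbiased for a small smoothing constant.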
III Algorithm and Convergence Analysis of ZOO-ADMM
In this section, we begin by stating assumptions used in our analysis. We then formally define the ZOO-ADMM algorithm and derive its convergence rate.
We assume the following conditions in our analysis.
Assumption A: In problem (1), $\mathcal{X}$ and $\mathcal{Y}$ are bounded with finite diameter $R$, and at least one of $A$ and $B$ in $Ax + By = c$ is invertible.
Assumption B: is convex and Lipschitz continuous with for all and .
Assumption C: is -smooth with .
Assumption D: is convex and -Lipschitz continuous with for all , where denotes the subgradient of .
Assumption E: In (5), given , the quantity is finite, and there is a function satisfying for all , where denotes the inner product of two vectors.
We remark that Assumptions A-D are standard for stochastic gradient-based and ADMM-type methods [1, 24, 25, 11]. Assumption A implies that and for all and for all . Based on Jensen’s inequality, Assumption B implies that . Assumption C implies a Lipschitz condition over the gradient with constant [27, 1]. Also based on Jensen’s inequality, we have . Assumption E places moment constraints on the distribution $\mu$ that will allow us to derive the necessary concentration bounds for our convergence analysis. If $\mu$ is uniform on the surface of the Euclidean ball of radius $\sqrt{m}$, we have and . And if , we have and . For ease of representation, we restrict attention to the case that in the rest of the paper. It is also worth mentioning that the convex and strongly convex conditions of $f_t$ can be described as
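$$f_t(x) \ge f_t(\tilde{x}) + (x - \tilde{x})^T \nabla_x f_t(\tilde{x}) + \frac{\sigma}{2}\|x - \tilde{x}\|_2^2 \quad \text{for all } x, \tilde{x} \in \mathcal{X}, \tag{9}$$
where $\sigma \ge 0$ is a parameter controlling convexity. If $\sigma > 0$, then $f_t$ is strongly convex with parameter $\sigma$. Otherwise ($\sigma = 0$), (9) implies convexity of $f_t$.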
The ZOO-ADMM iterations are given as Algorithm 1. Compared to O-ADMM in , we only require querying two function values for the generation of gradient estimate at step 3. Also, steps 7-11 of Algorithm 1 imply that the equality constraint of problem (1) is always satisfied at or . The average regret of ZOO-ADMM is bounded in Theorem 1.
Theorem 1: Suppose is invertible in problem (1). For generated by ZOO-ADMM, the expected average regret is bounded as
where $\alpha$ is introduced in (7), , , , and are defined in Assumptions A-E, and denotes a constant term that depends on , , , , , , and . Suppose is invertible in problem (1). For , the regret obeys the same bounds as (10).
Proof: See Appendix A.
In Theorem 1, if the step size and the smoothing parameter are chosen as
for some constants and , then the regret bound (10) simplifies to
The above simplification is derived in Appendix B.
It is clear from (12) that ZOO-ADMM converges at least as fast as $O(\sqrt{m}/\sqrt{T})$, which is similar to the convergence rate of O-ADMM but involves an additional factor $\sqrt{m}$. Such a dimension-dependent effect on the convergence rate has also been reported for other zeroth-order optimization algorithms [20, 21, 22], leading to the same convergence rate as ours. In (12), even if we let the smoothing parameter vanish (namely, $\beta_t \to 0$) for an unbiased gradient estimate (5), the dimension-dependent factor $\sqrt{m}$ is not eliminated. That is because the second moment of the gradient estimate also depends on the number of optimization variables. In the next section, we will propose two minibatch strategies that can be used to reduce the variance of the gradient estimate and to improve the convergence speed of ZOO-ADMM.
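The dependence of the second moment on the dimension can be checked by a short calculation. The following worked derivation is a sketch under two assumptions that go beyond the text here: the random direction $z$ is uniform on the sphere of radius $\sqrt{m}$ (so $\|z\|_2^2 = m$ and $\mathbb{E}[zz^T] = I$), and the smoothing parameter is small enough that the estimate behaves like the directional derivative $(a^T z)\,z$ for a fixed gradient vector $a$.

```latex
\begin{align*}
\mathbb{E}\big[\|(a^T z)\, z\|_2^2\big]
  &= \mathbb{E}\big[(a^T z)^2 \,\|z\|_2^2\big]
   = m\, \mathbb{E}\big[(a^T z)^2\big] \\
  &= m\, a^T \mathbb{E}[z z^T]\, a
   = m\, \|a\|_2^2 .
\end{align*}
```

Under these assumptions the estimate is unbiased, since $\mathbb{E}[(a^T z) z] = \mathbb{E}[z z^T] a = a$, yet its second moment is $m$ times larger than $\|a\|_2^2$. This is consistent with the remark that removing the bias alone does not remove the dimension-dependent factor.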
IV Convergence Analysis: Special Cases
In this section, we specialize ZOO-ADMM to three cases: a) stochastic optimization, b) strongly convex cost function in (1), and c) the use of minibatch strategies for evaluation of gradient estimates. Without loss of generality, we restrict analysis to the case that is invertible in (1).
The stochastic optimization problem is a special case of the OCO problem (1). If the objective function becomes then we can link the regret with the optimization error at the running average and under the condition that is convex. We state our results as Corollary 1.
Proof: See Appendix C.
Corollary 2: Suppose $f_t$ is strongly convex, and the step size and the smoothing parameter are chosen as and for . Given generated by ZOO-ADMM, the expected average regret can be bounded as
Proof: See Appendix D.
Corollary 2 implies that when the cost function is strongly convex, the regret bound of ZOO-ADMM could achieve $O(m/T)$ up to a logarithmic factor $\log T$. Compared to the regret bound in the general case (12), the condition of strong convexity improves the regret bound in terms of the number of iterations $T$, but the dimension-dependent factor now becomes linear in the dimension $m$ due to the effect of the second moment of the gradient estimate.
The use of a gradient estimator makes the convergence rate of ZOO-ADMM dependent on the dimension $m$, i.e., the number of optimization variables. Thus, it is important to study the impact of minibatch strategies on the acceleration of the convergence speed [28, 29, 11, 21]. Here we present two minibatch strategies: gradient sample averaging and observation averaging. In the first strategy, instead of using a single sample as in (5), the average of $q$ sub-samples $\{z_{t,i}\}_{i=1}^{q}$ is used for gradient estimation,
$$\hat{g}_t = \frac{1}{q}\sum_{i=1}^{q} \frac{f(x_t + \beta_t z_{t,i}; w_t) - f(x_t; w_t)}{\beta_t}\, z_{t,i}, \tag{14}$$
where $q$ is called the batch size. The use of (14) is analogous to the use of an average gradient in incremental gradient and stochastic gradient methods. In the second strategy, we use a subset of observations $\{w_{t,i}\}_{i=1}^{q}$ to reduce the gradient variance,
$$\hat{g}_t = \frac{1}{q}\sum_{i=1}^{q} \frac{f(x_t + \beta_t z_t; w_{t,i}) - f(x_t; w_{t,i})}{\beta_t}\, z_t. \tag{15}$$
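The effect of gradient sample averaging can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the authors' experiment: it uses a simple quadratic test function, draws directions uniformly from the sphere of radius $\sqrt{m}$, and compares the mean squared error of the estimate for two batch sizes. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_sphere(m, rng):
    """Draw z uniformly from the sphere of radius sqrt(m)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(v * v for v in g))
    return [math.sqrt(m) * v / norm for v in g]


def averaged_estimate(f, x, beta, q, rng):
    """Average q two-point estimates that share the same base value f(x)."""
    m = len(x)
    fx = f(x)
    acc = [0.0] * m
    for _ in range(q):
        z = sample_sphere(m, rng)
        shifted = [xi + beta * zi for xi, zi in zip(x, z)]
        slope = (f(shifted) - fx) / beta
        for j in range(m):
            acc[j] += slope * z[j] / q
    return acc


def estimation_error(q, trials=500, seed=0):
    """Mean squared error of the averaged estimate for f(x) = 0.5 * ||x||^2 at x = 1."""
    rng = random.Random(seed)
    m = 10
    f = lambda x: 0.5 * sum(xi * xi for xi in x)
    x = [1.0] * m          # the true gradient of f at x is x itself
    total = 0.0
    for _ in range(trials):
        g_hat = averaged_estimate(f, x, 1e-4, q, rng)
        total += sum((gh - xi) ** 2 for gh, xi in zip(g_hat, x))
    return total / trials
```

Comparing `estimation_error(1)` with `estimation_error(10)` should show a markedly smaller error for the larger batch, since averaging independent directions reduces the variance of the estimate at the cost of more function queries per iteration.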
In Corollary 3, we demonstrate the convergence behavior of the general hybrid ZOO-ADMM.
Corollary 3: Consider the hybrid minibatch strategy (16) in ZOO-ADMM, and set and . The expected average regret is bounded as
where $q_1$ and $q_2$ are the numbers of sub-samples $\{z_{t,i}\}$ and $\{w_{t,i}\}$, respectively.
Proof: See Appendix E.
It is clear from Corollary 3 that the use of minibatch strategies can alleviate the dimension dependency, leading to the regret bound $O(\sqrt{1+q^{-1}m}/\sqrt{T})$. The regret bound in (17) also implies that the convergence behavior of ZOO-ADMM is similar using either the gradient sample averaging minibatch (14) or the observation averaging minibatch (15). If $q_1 = 1$ and $q_2 = 1$, the regret bound (17) reduces to $O(\sqrt{m}/\sqrt{T})$, which is the general case in (12). Interestingly, if $q_1 = O(m)$ or $q_2 = O(m)$, we obtain the regret error $O(1/\sqrt{T})$, the same as in the case where an explicit expression for the gradient is used in the OCO algorithms.
V Applications of ZOO-ADMM
In this section, we demonstrate several applications of ZOO-ADMM in signal processing, statistics and machine learning.
V-A Black-box optimization
In some OCO problems, explicit gradient calculation is impossible due to the lack of a mathematical expression for the loss function. For example, commercial recommender systems try to build a representation of a customer’s buying preference function based on a discrete number of queries or purchasing history, and the system never has access to the gradient of the user’s preference function over their product line, which may even be unknown to the user. Gradient-free methods are therefore necessary. A specific example is the Yahoo! music recommendation system, which will be further discussed in Sec. VI. In these examples, one can consider each user as a black-box model that provides feedback on the value of an objective function, e.g., relative preferences over all products, based on an online evaluation of the objective function at discrete points of its domain. Such a system can benefit from ZOO-ADMM.
V-B Sensor selection
Sensor selection for parameter estimation is a fundamental problem in smart grids, communication systems, and wireless sensor networks . The goal is to seek the optimal tradeoff between sensor activations and the estimation accuracy. The sensor selection problem is also closely related to leader selection  and experimental design .
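For sensor selection, we often solve a (relaxed) convex program of the form
$$\underset{x \in \mathbb{R}^m}{\text{minimize}} \quad -\frac{1}{T}\sum_{t=1}^{T} \log\det\Big(\sum_{i=1}^{m} x_i\, a_{i,t} a_{i,t}^T\Big) \qquad \text{subject to} \quad \mathbf{1}^T x = m_0, \quad \mathbf{0} \le x \le \mathbf{1}, \tag{18}$$
where $x \in \mathbb{R}^m$ is the optimization variable, $m$ is the number of sensors, $a_{i,t}$ is the observation coefficient of sensor $i$ at time $t$, and $m_0$ is the number of selected sensors. The objective function of (18) can be interpreted as the log determinant of the error covariance associated with the maximum likelihood estimator for parameter estimation. The constraint $\mathbf{0} \le x \le \mathbf{1}$ is a relaxed convex hull of the Boolean constraint $x \in \{0, 1\}^m$, which encodes whether or not a sensor is selected.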
Conventional methods such as projected gradient (first-order) and interior-point (second-order) algorithms can be used to solve problem (18). However, both of them involve calculation of inverse matrices necessary to evaluate the gradient of the cost function. By contrast, we can rewrite (18) in a form amenable to ZOO-ADMM that avoids matrix inversion,
where is an auxiliary variable, with , and are indicator functions
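To make the gradient-free viewpoint concrete for sensor selection, the sketch below evaluates a log-determinant loss purely through function values and forms a finite-difference slope along a chosen direction. It is a minimal sketch under assumptions: the loss is taken to be the negative log determinant of a weighted sum of rank-one sensor matrices at a single time step, the log determinant is computed by a plain Cholesky factorization (which still costs cubic time, but never forms an explicit inverse), and every name here is hypothetical.

```python
import math


def information_matrix(x, sensors):
    """Form sum_i x_i * a_i a_i^T, where a_i is the i-th row of `sensors`."""
    n = len(sensors[0])
    M = [[0.0] * n for _ in range(n)]
    for xi, a in zip(x, sensors):
        for r in range(n):
            for c in range(n):
                M[r][c] += xi * a[r] * a[c]
    return M


def logdet_spd(M):
    """Log determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    total = 0.0
    for j in range(n):
        d = M[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        total += 2.0 * math.log(L[j][j])
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (M[i][j] - s) / L[j][j]
    return total


def sensor_loss(x, sensors):
    """Negative log determinant of the weighted information matrix."""
    return -logdet_spd(information_matrix(x, sensors))


def directional_slope(x, sensors, z, beta):
    """Two function queries give the slope of the loss along direction z."""
    shifted = [xi + beta * zi for xi, zi in zip(x, z)]
    return (sensor_loss(shifted, sensors) - sensor_loss(x, sensors)) / beta
```

For sensors aligned with the coordinate axes the loss reduces to a sum of negative logarithms of the weights, so the slope along a direction can be checked by hand; multiplying the slope by the direction gives a gradient estimate of the two-point form.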
V-C Sparse Cox regression
In survival analysis, Cox regression (also known as proportional hazards regression) is a method to investigate the effects of variables of interest upon the amount of time that elapses before a specified event occurs, e.g., relating gene expression profiles to survival times such as time to cancer recurrence or death. Let be triples of covariates, where is a vector of covariates or factors for subject , is a censoring indicator variable taking the value 1 if an event (e.g., death) is observed and 0 otherwise, and denotes the censoring time.
where is the vector of covariate coefficients to be designed, is the set of subjects at risk at time , namely, , and is a regularization parameter. In the objective function of (20), the first term corresponds to the (negative) log partial likelihood for the Cox proportional hazards model, and the second term encourages sparsity of the covariate coefficients.
By introducing a new variable $y$ together with the constraint $x - y = 0$, problem (20) can be cast in the canonical form (1), which is amenable to the ZOO-ADMM algorithm. This helps us to avoid the gradient calculation for the objective function involved in Cox regression. We specify the ZOO-ADMM algorithm for solving (20) in Appendix G.
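The two pieces that such a splitting needs can be sketched as follows: a routine that returns only the value of the negative log partial likelihood (the quantity a zeroth-order method would query), and the soft-thresholding operator, which is the standard proximal operator of the $\ell_1$ norm. This is a sketch under assumptions, not the specification in Appendix G: the normalization by the number of subjects, the risk-set convention (subjects whose time is at least that of the current subject), and all names are hypothetical choices.

```python
import math


def neg_log_partial_likelihood(x, covariates, events, times):
    """Value of the negative log partial likelihood of a Cox model.

    covariates: one row of features per subject
    events:     1 if the event was observed for the subject, 0 if censored
    times:      observed time per subject
    """
    n = len(times)
    scores = [sum(c * xi for c, xi in zip(row, x)) for row in covariates]
    total = 0.0
    for i in range(n):
        if events[i] == 0:
            continue  # censored subjects contribute only through risk sets
        risk = [scores[j] for j in range(n) if times[j] >= times[i]]
        peak = max(risk)  # subtract the maximum for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in risk))
        total += -scores[i] + log_sum
    return total / n


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]
```

With all coefficients set to zero, each observed event contributes the logarithm of the size of its risk set, which gives a quick sanity check of the likelihood routine. How the threshold `tau` relates to the regularization and penalty parameters depends on the exact form of the subproblem and is left unspecified here.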
VI Experiments
In this section, we demonstrate the effectiveness of ZOO-ADMM, and validate its convergence behavior for the applications introduced in Sec. V. In Algorithm 1, we set , , , , , , , and the distribution $\mu$ is chosen to be uniform on the surface of the Euclidean ball of radius $\sqrt{m}$. Unless specified otherwise, we use the gradient sample averaging minibatch of size in ZOO-ADMM. Throughout this section, we compare ZOO-ADMM with the conventional O-ADMM algorithm under the same parameter settings. Our experiments are performed on a synthetic dataset for sensor selection, and on real datasets for black-box optimization and Cox regression. Experiments were conducted in MATLAB R2016 on a machine with a 3.20 GHz CPU and 8 GB RAM.
Black-box optimization: We consider prediction of users’ ratings in the Yahoo! music system. Our dataset includes true music ratings, and the predicted ratings of individual models created by the NTU KDD-Cup team. Let represent a matrix of each model’s predicted ratings on a Yahoo! music data sample. We split the dataset into two equal parts, leading to the training dataset and the test dataset , where .
Our goal is to find the optimal coefficients to blend individual models such that the mean squared error is minimized, where , is the th row vector of , and is the th entry of . Since we cannot access the information directly , explicit gradient calculation for is impossible. We can treat the loss function as a black box, where it is evaluated at individual points in its domain but not over any open region of its domain. As discussed in Sec. V-A, we can apply ZOO-ADMM to solve the proposed linear blending problem, and the prediction accuracy can be measured by the root mean squared error (RMSE) of the test data , where an update of is obtained at each iteration.
In Fig. 1, we compare the performance of ZOO-ADMM with O-ADMM and the optimal solution provided by . In Fig. 1-(a), we present RMSE as a function of iteration number under different minibatch schemes. As we can see, both gradient sample averaging (over ) and observation averaging (over ) significantly accelerate the convergence speed of ZOO-ADMM. In particular, when the minibatch size is large enough ( in our example), the dimension-dependent slowdown factor of ZOO-ADMM can be mitigated. We also observe that ZOO-ADMM reaches the best RMSE in  after iterations. In Fig. 1-(b), we show the convergence error versus iteration number using gradient sample averaging minibatch of size . Compared to O-ADMM, ZOO-ADMM has a larger performance gap in its first few iterations, but it thereafter converges quickly, resulting in performance comparable to O-ADMM.
Sensor selection: We consider an example of estimating a spatial random field based on measurements of the field at a discrete set of sensor locations. Assume that sensors are randomly deployed over a square region to monitor a vector of field intensities (e.g., temperature values). The objective is to estimate the field intensity at locations over a time period of secs. In (18), the observation vectors are chosen randomly, and independently, from a distribution . Here is generated by an exponential model , , where is the -th spatial location at which the field intensity is to be estimated and is the spatial location of the sensor.
In Fig. 2, we present the performance of ZOO-ADMM for sensor selection. In Fig. 2-(a), we show the mean squared error (MSE) averaged over random trials for different numbers of selected sensors in (18). We compare our approach with O-ADMM and the method in . The figure shows that ZOO-ADMM yields almost the same MSE as O-ADMM. The method in  yields slightly better estimation performance, since it uses a second-order optimization method for sensor selection. In Fig. 2-(b), we present the computation time of ZOO-ADMM versus the number of optimization variables . The figure shows that ZOO-ADMM becomes much more computationally efficient as increases, since no matrix inversion is required.
Sparse Cox regression: We next employ ZOO-ADMM to solve problem (20) for building a sparse predictor of patient survival using the Kidney renal clear cell carcinoma dataset (available at http://gdac.broadinstitute.org/). The aforementioned dataset includes clinical data (survival time and censoring information) and gene expression data for patients ( with tumor and without tumor). Our goal is to seek the best subset of genes (in terms of optimal sparse covariate coefficients) that make the most significant impact on the survival time.
Table I (partially recovered): number of selected genes for the three tested values of the regularization parameter: 19, 56, 93.
In Fig. 3, we show the partial likelihood and number of selected genes as functions of the regularization parameter . The figure shows that ZOO-ADMM nearly attains the accuracy of O-ADMM. Furthermore, the likelihood increases as the number of selected genes increases. There is thus a tradeoff between the (negative) log partial likelihood and the sparsity of covariate coefficients in problem (20). To test the significance of our selected genes, we compare our approach with the significance analysis based on univariate Cox scores used in . The percentage of overlap between the genes identified by each method is shown in Table I under different values of . Despite its use of a zeroth-order approximation to the gradient, ZOO-ADMM selects at least of the genes selected by the gradient-based Cox scores of .
VII Conclusion
In this paper, we proposed and analyzed a gradient-free (zeroth-order) online optimization algorithm, ZOO-ADMM. We showed that the regret bound of ZOO-ADMM suffers an additional dimension-dependent factor in convergence rate over gradient-based online variants of ADMM, leading to an $O(\sqrt{m}/\sqrt{T})$ convergence rate, where $m$ is the number of optimization variables. To alleviate the dimension dependence, we presented two minibatch strategies that yield an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size. We illustrated the effectiveness of ZOO-ADMM via multiple applications using both synthetic and real-world datasets. In the future, we would like to relax the assumptions on smoothness and convexity of the cost function in ZOO-ADMM.
Appendix A Proof of Theorem 1
We first introduce key notations used in our analysis. Given the primal-dual variables , and of problem (1), we define , and a primal-dual mapping
where is skew symmetric, namely, . An important property of the affine mapping is that for every and . Supposing the sequence is generated by an algorithm, we introduce the auxiliary sequence
Here for notational simplicity we have used, and henceforth will continue to use, instead of .
In (23), based on , we have
We also note that the terms , , , , , and are independent of time . In particular, we have
where denotes the Frobenius norm of a matrix, and we have used the facts that and .
Based on the optimality condition of in (3), we have
which is equivalent to . And thus, we obtain
where we have used the fact that .