Online Alternating Direction Method
Online optimization has emerged as a powerful tool in large-scale optimization. In this paper, we introduce efficient online optimization algorithms based on the alternating direction method (ADM), which can solve online convex optimization under linear constraints where the objective could be non-smooth. We introduce new proof techniques for ADM in the batch setting, which yield an $O(1/T)$ convergence rate for ADM and form the basis for regret analysis in the online setting. We consider two scenarios in the online setting, based on whether an additional Bregman divergence is needed or not. In both settings, we establish regret bounds for both the objective function and constraint violation, for general as well as strongly convex functions. We also consider inexact ADM updates where certain terms are linearized to yield efficient updates, and show the stochastic convergence rates. In addition, we briefly discuss that online ADM can be used as a projection-free online learning algorithm in some scenarios. Preliminary results are presented to illustrate the performance of the proposed algorithms.
In recent years, online optimization [celu06, Zinkevich03, haak07] and its batch counterpart, stochastic gradient descent [Robi51:SP, Judi09:SP], have contributed substantially to advances in large scale optimization techniques for machine learning. Online convex optimization has been generalized to handle time-varying and non-smooth convex functions [Duchi10_comid, duchi09, xiao10]. Distributed optimization, where the problem is divided into parts on which progress can be made in parallel, has also contributed to advances in large scale optimization [boyd10, Bertsekas89, ceze98].
Important advances have been made based on the above ideas in the recent literature. Composite objective mirror descent (COMID) [Duchi10_comid] generalizes mirror descent [Beck03] to the online setting. COMID also includes certain other proximal splitting methods such as FOBOS [duchi09] as special cases. Regularized dual averaging (RDA) [xiao10] generalizes dual averaging [nesterov09] to online and composite optimization, and can be used for distributed optimization [Duchi11_dv]. The three methods consider the following composite objective optimization [nest07:composite]:
$$\min_{x \in \mathcal{X}} \; f(x) + g(x), \qquad (1)$$
where $f$ and $g$ are convex functions and $\mathcal{X}$ is a convex set. Solving (1) usually involves the projection onto $\mathcal{X}$. In some cases, e.g., when $g$ is the $\ell_1$ norm or $\mathcal{X}$ is the unit simplex, the projection can be done efficiently. In general, the full projection requires an inner loop algorithm, leading to a double loop algorithm for solving (1) [hazan12:free].
In this paper, we propose single loop online optimization algorithms for composite objective optimization subject to linear constraints. In particular, we consider optimization problems of the following form:
$$\min_{x, z} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c, \; x \in \mathcal{X}, \; z \in \mathcal{Z}, \qquad (2)$$
where $A \in \mathbb{R}^{m \times n_1}$, $B \in \mathbb{R}^{m \times n_2}$, $c \in \mathbb{R}^{m}$, and $\mathcal{X}$ and $\mathcal{Z}$ are convex sets. The linear equality constraint introduces splitting variables and thus splits functions and feasible sets into simpler constraint sets $\mathcal{X}$ and $\mathcal{Z}$. (2) can easily accommodate linear inequality constraints by introducing a slack variable, which will be discussed in Section LABEL:sec:linie. In the sequel, we drop the convex sets $\mathcal{X}$ and $\mathcal{Z}$ for ease of exposition, noting that one can consider $f$, $g$ and other additive functions to be the indicators of suitable convex feasible sets. $f$ and $g$ can be non-smooth, including piecewise linear and indicator functions. In the context of machine learning, $f$ is usually a loss function, such as the squared, hinge or logistic loss, while $g$ is a regularizer, e.g., the $\ell_1$, $\ell_2$, nuclear norm, mixed norm or total variation.
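To make the splitting concrete, here is a small numpy sketch, entirely our own illustration (the names `D`, `b`, `F`, `lam` are not from the paper), of how two common problems fit the linearly constrained form (2):

```python
import numpy as np

# Two illustrative instances of the constrained form (2):
# min f(x) + g(z)  s.t.  A x + B z = c.

n = 5

# (i) Lasso via consensus splitting: f(x) = 0.5*||D x - b||^2, g(z) = lam*||z||_1,
#     with A = I, B = -I, c = 0, i.e., the constraint x - z = 0 separates
#     the smooth loss from the non-smooth regularizer.
A1, B1, c1 = np.eye(n), -np.eye(n), np.zeros(n)

# (ii) Total variation: g penalizes differences of consecutive coordinates,
#      so z = F x where F is the (n-1) x n forward-difference operator,
#      giving A = F, B = -I, c = 0.
F = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
A2, B2, c2 = F, -np.eye(n - 1), np.zeros(n - 1)

# Any feasible pair satisfies the linear constraint exactly:
x = np.arange(n, dtype=float)
z1 = x.copy()   # feasible for (i)
z2 = F @ x      # feasible for (ii): z holds the differences of x
print(np.linalg.norm(A1 @ x + B1 @ z1 - c1))  # 0.0
print(np.linalg.norm(A2 @ x + B2 @ z2 - c2))  # 0.0
```

In both cases the constraint sets and functions that enter (2) are simpler than the original coupled problem, which is exactly what ADM exploits.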
In the batch setting, where $f$ is fixed, (2) can be solved by the well known alternating direction method of multipliers (ADMM or ADM) [boyd10]. First introduced in [Gabay76], ADM has since been extensively explored in recent years due to its ease of applicability and empirical performance in a wide variety of problems, including composite objectives [boyd10, Eckstein92, Lin_Ma09]. It has been shown to be a special case of the Douglas-Rachford splitting method [comb09:prox, Douglas56, Eckstein92], which in turn is a special case of the proximal point method [Rockafellar76]. Recent literature has illustrated the empirical efficiency of ADM in a broad spectrum of applications ranging from image processing [Ng10, Figueiredo10, Afonso10:tv, Chan11] to applied statistics and machine learning [Scheinberg10, Afonso10:tv, Yuan09, Yuan09b, yang09, Lin_Ma09, Barman11, Meshi10, Martins11]. ADM has been shown to outperform state-of-the-art methods for sparse problems, including LASSO [Tibshirani96, Hastie09, Afonso10:tv, boyd10], total variation [Goldstein10:tv], sparse inverse covariance selection [Dempster72, Banerjee08, Friedman08, Meinshausen06, Scheinberg10, Yuan09], and sparse and low rank approximations [Yuan09b, Lin_Ma09, cand09:rpca]. ADM has also been used to solve linear programs (LPs) [Eckstein90], LP decoding [Barman11] and MAP inference problems in graphical models [Martins11, Meshi10, wang12:kladm]. In addition, an advantage of ADM is that it can handle linear equality constraints of the form $Ax + Bz = c$, which makes distributed optimization by variable splitting in a batch setting straightforward [Bertsekas89, Nedic10, boyd10, boyd12:gpbs, Giannakis07]. For further understanding of ADM, we refer the readers to the comprehensive review by [boyd10] and references therein.
Although the proof of global convergence of ADM can be found in [Gabay83, Eckstein92, boyd10], the literature does not have the convergence rate for ADM, or even the convergence rate for the objective, which is fundamentally important to regret analysis in the online setting. (During/after the publication of our preliminary version [wang12:oadm], the convergence rate for ADM was shown in [he12:vi, he12:cst, luo12:admm, deng12:admm, dan12:admm, Goldstein12:fadmm], but our proof is different and self-contained. In particular, the other approaches do not prove the convergence rate for the objective, which is the key for the regret analysis in the online setting or stochastic setting.) We introduce new proof techniques for the rate of convergence of ADM in the batch setting, which establish an $O(1/T)$ convergence rate for the objective, the optimality conditions (constraints) and ADM based on variational inequalities [fapa03]. The $O(1/T)$ convergence rate for ADM is in line with gradient methods for composite objectives [Nest04:bkcov, nest07:composite, duchi09] (gradient methods can be accelerated to achieve an $O(1/T^2)$ convergence rate [Nest04:bkcov, nest07:composite]). Our proof requires rather weak assumptions compared to the Lipschitz continuous gradient required in general by gradient methods [Nest04:bkcov, nest07:composite, duchi09]. Further, the convergence analysis for the batch setting forms the basis of regret analysis in the online setting.
In an online or stochastic gradient descent setting, where $f_t$ is a time-varying function, (2) amounts to solving a sequence of equality-constrained subproblems, which in general leads to a double-loop algorithm where the inner loop ADM iterations have to be run till convergence after every new data point or function is revealed. As a result, ADM has not yet been generalized to the online setting.
We consider two scenarios in the online setting, based on whether an additional Bregman divergence is needed or not for a proximal function in each step. We propose efficient online ADM (OADM) algorithms for both scenarios which make a single pass through the update equations and avoid a double loop algorithm. In the online setting, while a single pass through the ADM update equations is not guaranteed to satisfy the linear constraint in each iteration, we consider two types of regret: regret in the objective as well as regret in constraint violation. We establish both types of regret bounds for general and strongly convex functions. In Table 1, we summarize the main results of OADM and also compare with OGD [Zinkevich03], FOBOS [duchi09], COMID [Duchi10_comid] and RDA [xiao10]. While OADM aims to solve linearly-constrained composite objective optimization problems, OGD, FOBOS and RDA are for such problems without explicit constraints. Our methods achieve the optimal regret bounds, $O(\sqrt{T})$ for general convex functions and $O(\log T)$ for strongly convex functions, for the objective as well as the constraint violation, while state-of-the-art methods achieve the optimal regret bounds only for the objective. We also present preliminary experimental results illustrating the performance of the proposed OADM algorithms in comparison with FOBOS and RDA [duchi09, xiao10].
The key advantages of the OADM algorithms can be summarized as follows. Like COMID and RDA, OADM can solve online composite optimization problems, matching the regret bounds of existing methods. The ability to additionally handle linear equality constraints of the form $Ax + Bz = c$ makes non-trivial variable splitting possible, yielding efficient distributed online optimization algorithms [Dekel12_minbatch] and projection-free online learning [hazan12:free] based on OADM. Further, the notion of regret in both the objective and the constraint may contribute towards the development of suitable analysis tools for online constrained optimization problems [Mannor06, Majin11].
Table 1: Summary of regret bounds.
|Methods|OADM|OGD, FOBOS, COMID, RDA|
|Objective regret (general convex)|$O(\sqrt{T})$|$O(\sqrt{T})$|
|Objective regret (strongly convex)|$O(\log T)$|$O(\log T)$|
|Constraint violation regret (general convex)|$O(\sqrt{T})$|N/A|
|Constraint violation regret (strongly convex)|$O(\log T)$|N/A|
We summarize our main contributions as follows:
We establish an $O(1/T)$ convergence rate for the objective, optimality conditions (constraints) and variational inequality for ADM.
We propose online ADM (OADM), which is the first single-loop online algorithm that explicitly solves the linearly-constrained problem (2) in a single pass over the examples.
In OADM, we establish the optimal regret bounds for both objective and constraint violation for general as well as strongly convex functions. We introduce the notion of regret for constraint violation, which allows constraints to be violated at each round while guaranteeing that they are satisfied on average in the long run.
We show that certain inexact updates in OADM, obtained through the use of an additional Bregman divergence, include OGD and COMID as special cases. For OADM with inexact updates, we also show the stochastic convergence rates.
For an intersection of simple constraint sets, e.g., linear constraints and the simplex, OADM yields a projection-free online learning algorithm achieving the optimal regret bounds for both general and strongly convex functions.
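The two notions of regret in the contributions above can be formalized as follows; this is our own sketch in the problem's notation ($f_t$, $g$, $A$, $B$, $c$), and the squared-norm form of the constraint term is an assumption that may differ from the exact definition used later:

```latex
% Regret in the objective, against the best fixed pair satisfying the constraint:
R_T^{\mathrm{obj}} = \sum_{t=1}^{T} \left[ f_t(x_t) + g(z_t) \right]
    - \min_{Ax + Bz = c} \sum_{t=1}^{T} \left[ f_t(x) + g(z) \right].
% Regret in constraint violation: it may be nonzero at every round, but if it
% grows sublinearly in T, the constraint is satisfied on average in the long run:
R_T^{\mathrm{cons}} = \sum_{t=1}^{T} \left\| A x_t + B z_t - c \right\|^2 .
```

Sublinear bounds on both quantities mean the averaged iterates approach both optimality and feasibility.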
The rest of the paper is organized as follows. In Section 2, we analyze batch ADM and establish its convergence rate. In Section 3, we propose OADM to solve the online optimization problem with linear constraints. In Sections 4 and 5, we present the regret analysis in two different scenarios based on whether an additional Bregman divergence is added or not. In Section 6, we discuss inexact ADM updates and show the stochastic convergence rates, discuss connections to related work, and present projection-free online learning based on OADM. We present preliminary experimental results in Section 7, and conclude in Section 8.
2 Analysis for Batch Alternating Direction Method
We consider the batch ADM problem where $f$ is fixed in (2), i.e.,
$$\min_{x, z} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c. \qquad (3)$$
The Lagrangian [boyd04:convex, Bertsekas99] for the equality-constrained optimization problem (3) is
$$\mathcal{L}(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle,$$
where $x, z$ are the primal variables and $y$ is the dual variable. To penalize the violation of the equality constraint, augmented Lagrangian methods use an additional quadratic penalty term. In particular, the augmented Lagrangian [Bertsekas99] for (3) is
$$\mathcal{L}_\rho(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - c \rangle + \frac{\rho}{2} \| Ax + Bz - c \|^2,$$
where $\rho > 0$ is a penalty parameter. Batch ADM updates the three variables by alternatingly minimizing the augmented Lagrangian. It executes the following three steps iteratively till convergence [boyd10]:
$$x_{t+1} = \operatorname*{argmin}_{x} \; \mathcal{L}_\rho(x, z_t, y_t),$$
$$z_{t+1} = \operatorname*{argmin}_{z} \; \mathcal{L}_\rho(x_{t+1}, z, y_t),$$
$$y_{t+1} = y_t + \rho (A x_{t+1} + B z_{t+1} - c).$$
At step $t$, the equality constraint in (3) is not necessarily satisfied in ADM. However, one can show that the equality constraint is satisfied in the long run, so that $A x_t + B z_t - c \rightarrow 0$.
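As a concrete illustration, the following minimal numpy sketch, our own example rather than code from the paper, runs the three alternating steps on the lasso instance of (3), with $f(x) = \tfrac{1}{2}\|Dx - b\|^2$, $g(z) = \lambda \|z\|_1$ and constraint $x - z = 0$; the printed residual shows $\|x_t - z_t\|$ shrinking toward zero, as claimed above.

```python
import numpy as np

def admm_lasso(D, b, lam, rho=1.0, iters=1000):
    """Batch ADM for the lasso written in the split form (3) with A=I, B=-I, c=0."""
    n = D.shape[1]
    z = np.zeros(n)
    y = np.zeros(n)  # dual variable
    M = D.T @ D + rho * np.eye(n)  # factor for the ridge-type x-update
    Dtb = D.T @ b
    for _ in range(iters):
        # Step 1: x-update, argmin_x 0.5||Dx-b||^2 + <y, x-z> + (rho/2)||x-z||^2
        x = np.linalg.solve(M, Dtb + rho * z - y)
        # Step 2: z-update, the prox of lam*||.||_1 (soft-thresholding)
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # Step 3: dual ascent on the constraint residual x - z
        y = y + rho * (x - z)
    return x, z, y

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x, z, y = admm_lasso(D, b, lam=0.5)
# The constraint holds only asymptotically: ||x_t - z_t|| -> 0.
print(np.linalg.norm(x - z))
```

Note that each step is cheap: a linear solve for $x$, an elementwise shrinkage for $z$, and a vector update for $y$; no projection or inner loop is needed.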
While global convergence of ADM has been established under appropriate conditions, we are interested in the rate of convergence of ADM in terms of iteration complexity, i.e., the number of iterations needed to obtain an $\epsilon$-optimal solution. Most first-order methods require the functions to be smooth or have Lipschitz continuous gradients to establish the convergence rate [Nest04:bkcov, nest07:composite, duchi09]. The assumptions in establishing the convergence rate of ADM are relatively simple [boyd10], and are stated below for the sake of completeness:
(a) $f$ and $g$ are closed, proper and convex.
(b) An optimal solution to (3) exists. Let $(x^*, z^*, y^*)$ be an optimal solution, and denote the optimal objective value $f(x^*) + g(z^*)$.
(c) Without loss of generality, assume ADM is initialized at $z_0 = 0$ and $y_0 = 0$. Let $\lambda_{\max}$ be the largest eigenvalue of $B^\top B$.
We first analyze the convergence rate for the objective and optimality conditions (constraints) separately using new proof techniques, which play an important role for the regret analysis in the online setting. Then, a joint analysis of the objective and constraints using a variational inequality [fapa03] establishes the convergence rate for ADM.
2.1 Convergence Rate for the Objective
The ADM updates implicitly generate the (sub)gradients of $f$ and $g$, as given in the following lemma.
Let $f'(x_{t+1})$ be the subgradient of $f$ at $x_{t+1}$; we have $f'(x_{t+1}) = -A^\top \big( y_t + \rho (A x_{t+1} + B z_t - c) \big)$.
Let $g'(z_{t+1})$ be the subgradient of $g$ at $z_{t+1}$; we have $g'(z_{t+1}) = -B^\top y_{t+1}$.
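Both identities in the lemma follow from the first-order optimality conditions of the $x$- and $z$-updates; a short derivation, assuming the standard ADM updates for $f(x) + g(z)$ subject to $Ax + Bz = c$, is:

```latex
% x-update: x_{t+1} minimizes L_rho(x, z_t, y_t) over x, so
0 \in \partial f(x_{t+1}) + A^{\top} y_t
      + \rho A^{\top} \left( A x_{t+1} + B z_t - c \right)
\;\Longrightarrow\;
-A^{\top} \left( y_t + \rho \left( A x_{t+1} + B z_t - c \right) \right)
  \in \partial f(x_{t+1}).
% z-update: z_{t+1} minimizes L_rho(x_{t+1}, z, y_t) over z; combining its
% optimality condition with the dual update
% y_{t+1} = y_t + rho (A x_{t+1} + B z_{t+1} - c) gives
0 \in \partial g(z_{t+1}) + B^{\top} y_t
      + \rho B^{\top} \left( A x_{t+1} + B z_{t+1} - c \right)
    = \partial g(z_{t+1}) + B^{\top} y_{t+1}
\;\Longrightarrow\;
- B^{\top} y_{t+1} \in \partial g(z_{t+1}).
```

Thus the dual iterates carry the subgradient information without ever evaluating $\partial f$ or $\partial g$ explicitly.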
The following lemma shows that the inaccuracy of the objective with respect to the optimum is bounded by the step differences of the iterates $z_t$ and $y_t$.
Let the sequences $\{x_t, z_t, y_t\}$ be generated by ADM. Then for any $(x^*, z^*)$ satisfying $Ax^* + Bz^* = c$, we have