Online Alternating Direction Method

Huahua Wang
Dept of Computer Science & Engg
University of Minnesota, Twin Cities
huwang@cs.umn.edu
   Arindam Banerjee
Dept of Computer Science & Engg
University of Minnesota, Twin Cities
banerjee@cs.umn.edu
Abstract

Online optimization has emerged as a powerful tool in large scale optimization. In this paper, we introduce efficient online optimization algorithms based on the alternating direction method (ADM), which can solve online convex optimization under linear constraints where the objective could be non-smooth. We introduce new proof techniques for ADM in the batch setting, which yield an $O(1/T)$ convergence rate for ADM and form the basis for regret analysis in the online setting. We consider two scenarios in the online setting, based on whether an additional Bregman divergence is needed or not. In both settings, we establish regret bounds for both the objective function and constraint violation, for general as well as strongly convex functions. We also consider inexact ADM updates where certain terms are linearized to yield efficient updates, and show the stochastic convergence rates. In addition, we briefly discuss that online ADM can be used as a projection-free online learning algorithm in some scenarios. Preliminary results are presented to illustrate the performance of the proposed algorithms.

1 Introduction

In recent years, online optimization [celu06, Zinkevich03, haak07] and its batch counterpart, stochastic gradient descent [Robi51:SP, Judi09:SP], have contributed substantially to advances in large scale optimization techniques for machine learning. Online convex optimization has been generalized to handle time-varying and non-smooth convex functions [Duchi10_comid, duchi09, xiao10]. Distributed optimization, in which the problem is divided into parts on which progress can be made in parallel, has also contributed to advances in large scale optimization [boyd10, Bertsekas89, ceze98].

Important advances have been made based on the above ideas in the recent literature. Composite objective mirror descent (COMID) [Duchi10_comid] generalizes mirror descent [Beck03] to the online setting. COMID also includes certain other proximal splitting methods such as FOBOS [duchi09] as special cases. Regularized dual averaging (RDA) [xiao10] generalizes dual averaging [nesterov09] to online and composite optimization, and can be used for distributed optimization [Duchi11_dv]. The three methods consider the following composite objective optimization [nest07:composite]:

$\min_{\mathbf{x} \in \mathcal{X}} \ f(\mathbf{x}) + g(\mathbf{x}),$   (1)

where the functions $f$ and $g$ are convex and $\mathcal{X}$ is a convex set. Solving (1) usually involves the projection onto $\mathcal{X}$. In some cases, e.g., when $\mathcal{X}$ is a norm ball or the unit simplex, the projection can be done efficiently. In general, the full projection requires an inner loop algorithm, leading to a double loop algorithm for solving (1) [hazan12:free].
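As an illustration of the efficient-projection case, Euclidean projection onto the unit simplex can be computed in $O(n \log n)$ time by a standard sort-and-threshold method; the sketch below is ours, not code from this paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex
    {w : w >= 0, sum(w) = 1}, via the standard sort-and-threshold
    method (O(n log n)); an illustrative sketch, not the paper's code."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    cssv = np.cumsum(u) - 1.0                 # cumulative sums minus target mass
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > cssv)[0][-1]     # last index with a positive gap
    theta = cssv[rho] / (rho + 1.0)           # optimal shift
    return np.maximum(v - theta, 0.0)
```

A point already on the simplex is returned unchanged, and any input is mapped to a nonnegative vector summing to one, so no inner-loop solver is needed.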

In this paper, we propose single loop online optimization algorithms for composite objective optimization subject to linear constraints. In particular, we consider optimization problems of the following form:

$\min_{\mathbf{x} \in \mathcal{X}, \mathbf{z} \in \mathcal{Z}} \ f(\mathbf{x}) + g(\mathbf{z}) \quad \text{s.t.} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c},$   (2)

where $\mathbf{A} \in \mathbb{R}^{m \times n_1}$, $\mathbf{B} \in \mathbb{R}^{m \times n_2}$, $\mathbf{c} \in \mathbb{R}^m$, and $\mathcal{X}$ and $\mathcal{Z}$ are convex sets. The linear equality constraint introduces splitting variables and thus splits functions and feasible sets into the simpler constraint sets $\mathcal{X}$ and $\mathcal{Z}$. (2) can easily accommodate linear inequality constraints by introducing a slack variable, which will be discussed in Section LABEL:sec:linie. In the sequel, we drop the convex sets $\mathcal{X}$ and $\mathcal{Z}$ for ease of exposition, noting that one can consider $f$, $g$ and other additive functions to be the indicators of suitable convex feasible sets. $f$ and $g$ can be non-smooth, including piecewise linear and indicator functions. In the context of machine learning, $f$ is usually a loss function, such as the squared, hinge or logistic loss, while $g$ is a regularizer, e.g., the $\ell_1$, $\ell_2$, nuclear norm, mixed-norm or total variation penalty.
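As a concrete (illustrative) instance of (2), the lasso can be split by introducing $\mathbf{z}$ for the regularizer; the data matrix $\mathbf{D}$ and vector $\mathbf{b}$ below are our notation, not from the text:

```latex
\min_{\mathbf{x},\mathbf{z}} \ \tfrac{1}{2}\|\mathbf{D}\mathbf{x}-\mathbf{b}\|_2^2
  + \lambda\|\mathbf{z}\|_1
\quad \text{s.t.} \quad \mathbf{x} - \mathbf{z} = \mathbf{0},
```

which is (2) with $f(\mathbf{x})=\tfrac{1}{2}\|\mathbf{D}\mathbf{x}-\mathbf{b}\|_2^2$, $g(\mathbf{z})=\lambda\|\mathbf{z}\|_1$, $\mathbf{A}=\mathbf{I}$, $\mathbf{B}=-\mathbf{I}$ and $\mathbf{c}=\mathbf{0}$; the non-smooth $\ell_1$ term is isolated in $g$.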

In the batch setting, where $f$ is fixed, (2) can be solved by the well known alternating direction method of multipliers (ADMM or ADM) [boyd10]. First introduced in [Gabay76], ADM has since been extensively explored in recent years due to its ease of applicability and empirical performance in a wide variety of problems, including composite objectives [boyd10, Eckstein92, Lin_Ma09]. It has been shown to be a special case of the Douglas-Rachford splitting method [comb09:prox, Douglas56, Eckstein92], which in turn is a special case of the proximal point method [Rockafellar76]. Recent literature has illustrated the empirical efficiency of ADM in a broad spectrum of applications, ranging from image processing [Ng10, Figueiredo10, Afonso10:tv, Chan11] to applied statistics and machine learning [Scheinberg10, Afonso10:tv, Yuan09, Yuan09b, yang09, Lin_Ma09, Barman11, Meshi10, Martins11]. ADM has been shown to outperform state-of-the-art methods for sparse problems, including LASSO [Tibshirani96, Hastie09, Afonso10:tv, boyd10], total variation [Goldstein10:tv], sparse inverse covariance selection [Dempster72, Banerjee08, Friedman08, Meinshausen06, Scheinberg10, Yuan09], and sparse and low rank approximations [Yuan09b, Lin_Ma09, cand09:rpca]. ADM has also been used to solve linear programs (LPs) [Eckstein90], LP decoding [Barman11] and MAP inference problems in graphical models [Martins11, Meshi10, wang12:kladm]. In addition, an advantage of ADM is that it can handle linear equality constraints of the form $\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}$, which makes distributed optimization by variable splitting in a batch setting straightforward [Bertsekas89, Nedic10, boyd10, boyd12:gpbs, Giannakis07]. For further understanding of ADM, we refer the reader to the comprehensive review by [boyd10] and references therein.

Although the proof of global convergence of ADM can be found in [Gabay83, Eckstein92, boyd10], the literature does not provide a convergence rate for ADM, or even a convergence rate for the objective, which is fundamentally important to regret analysis in the online setting. (During/after the publication of our preliminary version [wang12:oadm], the convergence rate for ADM was shown in [he12:vi, he12:cst, luo12:admm, deng12:admm, dan12:admm, Goldstein12:fadmm], but our proof is different and self-contained. In particular, the other approaches do not prove the convergence rate for the objective, which is the key for the regret analysis in the online or stochastic setting.) We introduce new proof techniques for the rate of convergence of ADM in the batch setting, which establish an $O(1/T)$ convergence rate for the objective, the optimality conditions (constraints), and ADM based on variational inequalities [fapa03]. The $O(1/T)$ convergence rate for ADM is in line with gradient methods for composite objectives [Nest04:bkcov, nest07:composite, duchi09], which can be accelerated to achieve an $O(1/T^2)$ rate [Nest04:bkcov, nest07:composite]. Our proof requires rather weak assumptions compared to the Lipschitz continuous gradient required in general by gradient methods [Nest04:bkcov, nest07:composite, duchi09]. Further, the convergence analysis for the batch setting forms the basis of regret analysis in the online setting.

In an online or stochastic gradient descent setting, where $f_t$ is a time-varying function, (2) amounts to solving a sequence of equality-constrained subproblems, which in general leads to a double-loop algorithm in which the inner loop ADM iterations have to be run until convergence after every new data point or function is revealed. As a result, ADM has not yet been generalized to the online setting.

We consider two scenarios in the online setting, based on whether an additional Bregman divergence is needed or not for a proximal function in each step. We propose efficient online ADM (OADM) algorithms for both scenarios, which make a single pass through the update equations and avoid a double loop algorithm. In the online setting, a single pass through the ADM update equations is not guaranteed to satisfy the linear constraint in each iteration, so we consider two types of regret: regret in the objective as well as regret in constraint violation. We establish both types of regret bounds for general and strongly convex functions. In Table 1, we summarize the main results for OADM and compare with OGD [Zinkevich03], FOBOS [duchi09], COMID [Duchi10_comid] and RDA [xiao10]. While OADM aims to solve linearly-constrained composite objective optimization problems, OGD, FOBOS and RDA address such problems without explicit constraints. In both the general and strongly convex cases, our methods achieve the optimal regret bounds for the objective as well as for constraint violation, while state-of-the-art methods achieve the optimal regret bounds only for the objective. We also present preliminary experimental results illustrating the performance of the proposed OADM algorithms in comparison with FOBOS and RDA [duchi09, xiao10].

The key advantages of the OADM algorithms can be summarized as follows. Like COMID and RDA, OADM can solve online composite optimization problems, matching the regret bounds of existing methods. The ability to additionally handle linear equality constraints of the form $\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}$ makes non-trivial variable splitting possible, yielding efficient distributed online optimization algorithms [Dekel12_minbatch] and projection-free online learning [hazan12:free] based on OADM. Further, the notion of regret in both the objective and the constraint may contribute towards the development of suitable analysis tools for online constrained optimization problems [Mannor06, Majin11].

Problem:            min $f(\mathbf{x}) + g(\mathbf{z})$ s.t. $\mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{z}=\mathbf{c}$   |   min $f(\mathbf{x}) + g(\mathbf{x})$
Methods:            OADM                                            |   OGD, FOBOS, COMID, RDA
Regret Bounds:      Objective        Constraint                     |   Objective
General Convex:     $O(\sqrt{T})$    $O(\sqrt{T})$                  |   $O(\sqrt{T})$
Strongly Convex:    $O(\log T)$      $O(\log T)$                    |   $O(\log T)$
Table 1: Main results for regret bounds of OADM in solving linearly-constrained composite objective optimization, in comparison with OGD, FOBOS, COMID and RDA in solving composite objective optimization. In both the general and strongly convex cases, OADM achieves the optimal regret bounds for the objective, matching the results of the state-of-the-art methods. In addition, OADM also achieves the optimal regret bounds for constraint violation, showing the equality constraint will be satisfied on average in the long run.

We summarize our main contributions as follows:

  • We establish an $O(1/T)$ convergence rate for the objective, optimality conditions (constraints) and variational inequality for ADM.

  • We propose online ADM (OADM), the first single loop online algorithm to explicitly solve the linearly-constrained problem (2) while making just a single pass over examples.

  • In OADM, we establish the optimal regret bounds for both the objective and constraint violation, for general as well as strongly convex functions. We introduce regret for constraint violation, which allows constraints to be violated at each round while guaranteeing that they are satisfied on average in the long run.

  • We present inexact updates for OADM through the use of an additional Bregman divergence, which include OGD and COMID as special cases. For OADM with inexact updates, we also establish stochastic convergence rates.

  • For an intersection of simple constraints, e.g., linear constraints defining a simplex, OADM is a projection-free online learning algorithm achieving the optimal regret bounds for both general and strongly convex functions.

The rest of the paper is organized as follows. In Section 2, we analyze batch ADM and establish its convergence rate. In Section 3, we propose OADM to solve the online optimization problem with linear constraints. In Sections 4 and 5, we present the regret analysis in two different scenarios, based on whether an additional Bregman divergence is added or not. In Section 6, we discuss inexact ADM updates and their stochastic convergence rates, the connection to related work, and projection-free online learning based on OADM. We present preliminary experimental results in Section 7, and conclude in Section 8.

2 Analysis for Batch Alternating Direction Method

We consider the batch ADM problem where the function $f$ is fixed in (2), i.e.,

$\min_{\mathbf{x},\mathbf{z}} \ f(\mathbf{x}) + g(\mathbf{z}) \quad \text{s.t.} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}.$   (3)

The Lagrangian [boyd04:convex, Bertsekas99] for the equality-constrained optimization problem (3) is

$L(\mathbf{x},\mathbf{z},\mathbf{y}) = f(\mathbf{x}) + g(\mathbf{z}) + \langle \mathbf{y}, \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c} \rangle,$   (4)

where $\mathbf{x}, \mathbf{z}$ are the primal variables and $\mathbf{y}$ is the dual variable. To penalize the violation of the equality constraint, augmented Lagrangian methods use an additional quadratic penalty term. In particular, the augmented Lagrangian [Bertsekas99] for (2) is

$L_\rho(\mathbf{x},\mathbf{z},\mathbf{y}) = f(\mathbf{x}) + g(\mathbf{z}) + \langle \mathbf{y}, \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c} \rangle + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c}\|_2^2,$   (5)

where $\rho > 0$ is a penalty parameter. Batch ADM updates the three variables $\mathbf{x}, \mathbf{z}, \mathbf{y}$ by alternatingly minimizing the augmented Lagrangian over $\mathbf{x}$ and $\mathbf{z}$ and then updating the dual variable $\mathbf{y}$. It executes the following three steps iteratively till convergence [boyd10]:

$\mathbf{x}_{t+1} = \operatorname{argmin}_{\mathbf{x}} \ L_\rho(\mathbf{x}, \mathbf{z}_t, \mathbf{y}_t)$   (6)
$\mathbf{z}_{t+1} = \operatorname{argmin}_{\mathbf{z}} \ L_\rho(\mathbf{x}_{t+1}, \mathbf{z}, \mathbf{y}_t)$   (7)
$\mathbf{y}_{t+1} = \mathbf{y}_t + \rho(\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_{t+1} - \mathbf{c})$   (8)

At step $t$, the equality constraint in (3) is not necessarily satisfied by ADM. However, one can show that the equality constraint is satisfied in the long run, so that $\mathbf{A}\mathbf{x}_t + \mathbf{B}\mathbf{z}_t - \mathbf{c} \rightarrow \mathbf{0}$.
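To make the updates (6)-(8) concrete, the sketch below instantiates them for the lasso split $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{D}\mathbf{x}-\mathbf{b}\|_2^2$, $g(\mathbf{z}) = \lambda\|\mathbf{z}\|_1$ with constraint $\mathbf{x} - \mathbf{z} = \mathbf{0}$ (so $\mathbf{A} = \mathbf{I}$, $\mathbf{B} = -\mathbf{I}$, $\mathbf{c} = \mathbf{0}$); the data $\mathbf{D}$, $\mathbf{b}$ and all parameter choices are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_threshold(v, kappa):
    # proximal operator of kappa * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(D, b, lam=0.1, rho=1.0, iters=500):
    """Batch ADM for 0.5||Dx - b||^2 + lam||z||_1 s.t. x - z = 0."""
    n = D.shape[1]
    z = np.zeros(n); y = np.zeros(n)
    # the x-update (6) is a linear solve; precompute the inverse once
    M = np.linalg.inv(D.T @ D + rho * np.eye(n))
    Dtb = D.T @ b
    for _ in range(iters):
        x = M @ (Dtb - y + rho * z)                  # x-update (6)
        z = soft_threshold(x + y / rho, lam / rho)   # z-update (7)
        y = y + rho * (x - z)                        # dual update (8)
    return x, z, y
```

Although $\mathbf{x}_t - \mathbf{z}_t \neq \mathbf{0}$ at any finite step, the residual shrinks as the dual variable accumulates, matching the "satisfied in the long run" behavior described above.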

While global convergence of ADM has been established under appropriate conditions, we are interested in the rate of convergence of ADM in terms of iteration complexity, i.e., the number of iterations needed to obtain an $\epsilon$-optimal solution. Most first-order methods require the functions to be smooth or to have a Lipschitz continuous gradient in order to establish a convergence rate [Nest04:bkcov, nest07:composite, duchi09]. The assumptions needed to establish the convergence rate of ADM are relatively simple [boyd10], and are stated below for the sake of completeness:

Assumption 1

(a) $f$ and $g$ are closed, proper and convex.

(b) An optimal solution to (3) exists. Let $(\mathbf{x}^*, \mathbf{z}^*)$ be an optimal solution. Denote $F^* = f(\mathbf{x}^*) + g(\mathbf{z}^*)$.

(c) Without loss of generality, the iterates are initialized at $\mathbf{z}_0 = \mathbf{0}$ and $\mathbf{y}_0 = \mathbf{0}$. Let $\lambda_{\max}$ be the largest eigenvalue of $\mathbf{B}^T\mathbf{B}$.

We first analyze the convergence rates for the objective and the optimality conditions (constraints) separately using new proof techniques, which play an important role in the regret analysis for the online setting. Then, a joint analysis of the objective and constraints using a variational inequality [fapa03] establishes the $O(1/T)$ convergence rate for ADM.

2.1 Convergence Rate for the Objective

The updates of $\mathbf{x}$ and $\mathbf{z}$ implicitly generate the (sub)gradients of $f$ and $g$, as given in the following lemma.

Lemma 1

Let $f'(\mathbf{x}_{t+1})$ be the subgradient of $f$ at $\mathbf{x}_{t+1}$. We have

$f'(\mathbf{x}_{t+1}) = -\mathbf{A}^T\left(\mathbf{y}_t + \rho(\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_t - \mathbf{c})\right)$   (9)
$f'(\mathbf{x}_{t+1}) = -\mathbf{A}^T\left(\mathbf{y}_{t+1} + \rho\mathbf{B}(\mathbf{z}_t - \mathbf{z}_{t+1})\right)$   (10)

Let $g'(\mathbf{z}_{t+1})$ be the subgradient of $g$ at $\mathbf{z}_{t+1}$. We have

$g'(\mathbf{z}_{t+1}) = -\mathbf{B}^T\mathbf{y}_{t+1}$   (11)
Proof.

Since $\mathbf{x}_{t+1}$ minimizes (6), we have

$\mathbf{0} \in \partial f(\mathbf{x}_{t+1}) + \mathbf{A}^T\mathbf{y}_t + \rho\mathbf{A}^T(\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_t - \mathbf{c}).$

Rearranging the terms gives (9), and using (8) yields (10).

Similarly, $\mathbf{z}_{t+1}$ minimizes (7), so

$\mathbf{0} \in \partial g(\mathbf{z}_{t+1}) + \mathbf{B}^T\mathbf{y}_t + \rho\mathbf{B}^T(\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_{t+1} - \mathbf{c}).$

Rearranging the terms and using (8) yields (11). ∎

The following lemma shows that the suboptimality of the objective with respect to the optimum at $(\mathbf{x}^*, \mathbf{z}^*)$ is bounded by successive step differences of $\mathbf{z}$ and $\mathbf{y}$.

Lemma 2

Let the sequences $\{\mathbf{x}_t, \mathbf{z}_t, \mathbf{y}_t\}$ be generated by ADM. Then for any $(\mathbf{x}^*, \mathbf{z}^*)$ satisfying $\mathbf{A}\mathbf{x}^* + \mathbf{B}\mathbf{z}^* = \mathbf{c}$, we have

$f(\mathbf{x}_{t+1}) + g(\mathbf{z}_{t+1}) - f(\mathbf{x}^*) - g(\mathbf{z}^*) \le \frac{\rho}{2}\left(\|\mathbf{B}(\mathbf{z}_t - \mathbf{z}^*)\|_2^2 - \|\mathbf{B}(\mathbf{z}_{t+1} - \mathbf{z}^*)\|_2^2\right) + \frac{1}{2\rho}\left(\|\mathbf{y}_t\|_2^2 - \|\mathbf{y}_{t+1}\|_2^2\right).$   (12)
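The identities of Lemma 1 and this per-step bound can be sanity-checked numerically. The toy problem below ($f(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}-\mathbf{a}\|^2$, $g(\mathbf{z}) = \frac{1}{2}\|\mathbf{z}-\mathbf{d}\|^2$, constraint $\mathbf{x}-\mathbf{z}=\mathbf{0}$, with our own choice of $\mathbf{a}$, $\mathbf{d}$ and $\rho$) admits closed-form ADM updates. We check (11), i.e. $\nabla g(\mathbf{z}_{t+1}) = -\mathbf{B}^T\mathbf{y}_{t+1}$, and that the objective suboptimality at step $t+1$ is at most $\frac{\rho}{2}(\|\mathbf{z}_t-\mathbf{z}^*\|^2 - \|\mathbf{z}_{t+1}-\mathbf{z}^*\|^2) + \frac{1}{2\rho}(\|\mathbf{y}_t\|^2 - \|\mathbf{y}_{t+1}\|^2)$ (here $\mathbf{B} = -\mathbf{I}$):

```python
import numpy as np

# Toy instance of (3): f(x) = 0.5||x - a||^2, g(z) = 0.5||z - d||^2,
# constraint x - z = 0 (A = I, B = -I, c = 0); a, d, rho chosen arbitrarily.
a = np.array([1.0, -2.0]); d = np.array([3.0, 0.5]); rho = 1.0
x_star = z_star = (a + d) / 2.0                      # minimizer of f(x) + g(x)
F_star = 0.5 * np.sum((x_star - a) ** 2) + 0.5 * np.sum((z_star - d) ** 2)

def check(T=20):
    z = np.zeros(2); y = np.zeros(2)
    ok = True
    for _ in range(T):
        z_old, y_old = z.copy(), y.copy()
        x = (a - y + rho * z) / (1 + rho)            # x-update (6), closed form
        z = (d + y + rho * x) / (1 + rho)            # z-update (7), closed form
        y = y + rho * (x - z)                        # dual update (8)
        ok = ok and np.allclose(z - d, y)            # Lemma 1, eq. (11)
        lhs = 0.5 * np.sum((x - a) ** 2) + 0.5 * np.sum((z - d) ** 2) - F_star
        rhs = (rho / 2) * (np.sum((z_old - z_star) ** 2) - np.sum((z - z_star) ** 2)) \
            + (1 / (2 * rho)) * (np.sum(y_old ** 2) - np.sum(y ** 2))
        ok = ok and lhs <= rhs + 1e-9                # per-step objective bound
    return ok
```

Summing the right-hand side of the bound over $t$ telescopes, which is what drives the convergence rate for the objective.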