A General Convergence Result for Mirror Descent with Armijo Line Search

Yen-Huan Li (École polytechnique fédérale de Lausanne, Switzerland) · Carlos A. Riofrío · Volkan Cevher (École polytechnique fédérale de Lausanne, Switzerland)
Abstract

Existing convergence guarantees for the mirror descent algorithm require the objective function to have a bounded gradient, or to be smooth relative to a Legendre function. The bounded gradient and relative smoothness conditions, however, may not hold in important applications, such as quantum state tomography and portfolio selection. In this paper, we propose a local version of the relative smoothness condition as a generalization of its existing global version, and prove that under this local relative smoothness condition, the mirror descent algorithm with Armijo line search always converges. Numerical experiments on real data-sets show that, as a consequence, the mirror descent algorithm with Armijo line search is the fastest guaranteed-to-converge algorithm for quantum state tomography.

1 Introduction

Consider a constrained convex optimization problem:

 f⋆ = min { f(x) : x ∈ X }, (P)

where f is a convex differentiable function, and X is a closed convex set in ℝ^d. We assume that f⋆ > −∞.

The mirror descent algorithm is standard for solving such a constrained convex optimization problem [6, 32]. Given an initial iterate x₀ ∈ X, the mirror descent algorithm iterates as

 x_k = argmin_{x∈X} { α_k⟨∇f(x_{k−1}), x − x_{k−1}⟩ + D_h(x, x_{k−1}) }, ∀k ∈ ℕ, (1)

for some convex differentiable function h and a properly chosen sequence of step sizes (α_k)_{k∈ℕ}, where D_h denotes the Bregman divergence induced by h:

 D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.
With a proper choice of the function h, the mirror descent algorithm can have an almost dimension-independent convergence rate guarantee, or a lower per-iteration computational complexity. A famous example is the exponentiated gradient method, which enjoys both benefits [25, 26]. The exponentiated gradient method corresponds to the mirror descent algorithm with h being the negative Shannon entropy.
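In the vector case, one exponentiated gradient iteration can be sketched in a few lines. This is a hypothetical toy illustration; the function name `eg_step`, the cost vector `c`, and the step size below are ours, not the paper's:

```python
import numpy as np

def eg_step(x, grad, alpha):
    """One exponentiated gradient (entropic mirror descent) step on the
    probability simplex: a multiplicative update followed by normalization."""
    y = x * np.exp(-alpha * grad)
    return y / y.sum()

# Toy usage: minimize the linear function f(x) = <c, x> over the simplex.
# The iterates should concentrate on the coordinate with the smallest c_i.
c = np.array([3.0, 1.0, 2.0])
x = np.ones(3) / 3
for _ in range(200):
    x = eg_step(x, c, alpha=0.5)
```

Note that the update stays inside the simplex by construction, which is one of the computational benefits alluded to above.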

Convergence of the mirror descent algorithm has been established under the following two conditions on the objective function.

1. Bounded gradient: There exists some L > 0, such that

 ∥∇f(x)∥≤L,∀x∈X,

for some norm ∥·∥ [6, 32].¹ (¹ To be precise, results in this direction assume that there exists a function g satisfying g(x) ∈ ∂f(x) and ∥g(x)∥ ≤ L for all x ∈ X, where ∂f(x) denotes the sub-differential of f at the point x.)

2. Relative smoothness: There exist some L > 0 and a convex differentiable function h, such that

 f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + L D_h(y, x), ∀x, y ∈ X,

where D_h denotes the Bregman divergence induced by h [2, 3, 31].

These conditions may not hold, or introduce undesirable computational burdens for some applications. Quantum state tomography is one such instance.

Example 1

Quantum state tomography (QST) is the task of estimating the state of qubits (quantum bits) given measurement outcomes [36]; this task is essential to calibrating quantum computation devices. Numerically, it corresponds to minimizing the function

 f_QST(x) := −∑_{i=1}^n log Tr(M_i x),

for given positive semi-definite matrices M₁, …, M_n, on the set of quantum density matrices

 D := { x ∈ ℂ^{d×d} : x ⪰ 0, Tr(x) = 1 }. (2)

The dimension d equals 2^q, where q is the number of qubits.
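The objective and its gradient are straightforward to evaluate; the following is a minimal sketch under our own naming (`f_qst`, `grad_f_qst`), checked on a single-qubit toy instance with diagonal measurement matrices of our choosing:

```python
import numpy as np

def f_qst(x, Ms):
    """Evaluate f_QST(x) = -sum_i log Tr(M_i x) for a density matrix x
    and a list of positive semi-definite matrices Ms."""
    return -sum(np.log(np.trace(M @ x).real) for M in Ms)

def grad_f_qst(x, Ms):
    """Gradient of f_QST: -sum_i M_i / Tr(M_i x)."""
    return -sum(M / np.trace(M @ x).real for M in Ms)

# Toy check on a single qubit (d = 2): two diagonal measurement matrices
# and the maximally mixed state x = I/2.
Ms = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
x = np.eye(2) / 2
val = f_qst(x, Ms)  # -log(1/2) - log(1/2) = 2 log 2
```

As Tr(M_i x) approaches zero, the gradient blows up, which is the source of the difficulties discussed next.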

Notice that the diagonal of a density matrix in D must belong to the probability simplex in ℝ^d; therefore, a density matrix can be viewed as a matrix analogue of a probability distribution. In view of this observation, it is natural to consider the matrix version of the exponentiated gradient method, for which the Shannon entropy is replaced by its matrix analogue, the von Neumann entropy [11, 40]. Unfortunately, the following is easily checked.

Proposition 1

The gradient of the function f_QST is not bounded. The function f_QST is not smooth relative to the von Neumann entropy.

A proof is given in Section A.

Another popular choice of the function h is Burg's entropy. The resulting mirror descent algorithm iterates as

 x_{k+1} = (x_k⁻¹ + α_k∇f(x_k))⁻¹, ∀k ∈ ℕ,

where α_k is chosen such that Tr(x_{k+1}) = 1 [28]. The numerical search for α_k yields a high per-iteration computational complexity for the mirror descent algorithm.

We note that in terms of the objective functions and constraint sets, positron emission tomography, optimal portfolio selection, and non-negative linear inverse problems are essentially vector analogues of QST [12, 14, 41]. The same issues we have discussed above remain in these applications, though the computational burden due to the Burg entropy may be relatively minor in these vector analogues.

To address “non-standard” applications like QST, we relax the condition on the objective function. Specifically, we propose a novel localized version of the relative smoothness condition. The local relative smoothness condition involves no global parameter, in contrast to the bounded gradient and (global) relative smoothness conditions. Therefore, we do not seek a closed-form expression for the step sizes; instead, we select the step sizes adaptively by Armijo line search.

1.1 Related work

The mirror descent algorithm was introduced in [32]. The formulation (1) was proposed in [6], which is equivalent to the original one under standard assumptions. The interior gradient method studied in [2] is also of the form (1); the difference lies in the technical conditions. Standard convergence analyses of the mirror descent, as discussed above, assume either bounded gradient or relative smoothness [2, 3, 6, 31, 32]. The exponentiated gradient method was proposed in [26]; it is also known as the entropic mirror descent [6].

For quantum state tomography, there are few guaranteed-to-converge optimization algorithms. The RρR algorithm was proposed as an analogue of the expectation maximization (EM) algorithm [23], but does not always converge [42]. The diluted RρR algorithm is a variant of the RρR algorithm; it guarantees convergence by exact line search [42]. The Frank-Wolfe algorithm converges with a step size selection rule slightly different from the standard one [35]. The SCOPT algorithm proposed in [39], a proximal gradient method for composite self-concordant minimization, also converges, as the logarithmic function is a standard instance of a self-concordant function. The numerical results in Section 6, unfortunately, show that the convergence speeds of the diluted RρR, Frank-Wolfe, and SCOPT algorithms are not satisfactory on real data-sets.

For the vector analogues of QST mentioned above, the standard approach is the EM algorithm [14, 18, 41]. The EM algorithm is also known as the Richardson-Lucy (RL) algorithm in astronomy and microscopy (see, e.g., [7]). The numerical results in Section 6 show that the EM algorithm is slow on real data-sets for portfolio selection. There are faster accelerated versions of the EM algorithm based on line search, but they lack convergence guarantees [7]. Guaranteed-to-converge variable metric methods with line search were proposed in [9, 10], but they involve an infinite number of parameters to be properly tuned.

Armijo line search was proposed in [1], for minimizing functions with Lipschitz gradients. The formulation of Armijo line search studied in this paper is the generalized version proposed in [8].

1.2 Contributions

We propose a novel local relative smoothness condition, and show that the condition is satisfied by a large class of objective functions. The main result is Theorem 1, which establishes convergence of the mirror descent algorithm with Armijo line search under the local relative smoothness condition. Owing to Theorem 1, the numerical results show that the exponentiated gradient method with Armijo line search is the fastest guaranteed-to-converge algorithm for QST on real data-sets. To the best of our knowledge, even for globally relatively smooth objective functions, convergence of mirror descent with Armijo line search had not been proven; Theorem 1 provides the first convergence guarantee for this setup.

2 Mirror Descent with Armijo Line Search

Let h be a convex differentiable function that is strictly convex on X. The corresponding Bregman divergence is given by

 D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.

Because of the strict convexity of h, it holds that D_h(x, y) ≥ 0, and D_h(x, y) = 0 if and only if x = y.

Define, for every x ∈ X and α > 0,

 x(α) := argmin_{y∈X} { α⟨∇f(x), y − x⟩ + D_h(y, x) }.

The corresponding mirror descent algorithm starts with some x₀ ∈ X, and iterates as

 x_k = x_{k−1}(α_k), ∀k ∈ ℕ,

where α_k denotes the step size. To ensure that the mirror descent algorithm is well-defined, we will assume the following throughout this paper.

Assumption

For every x ∈ X and α > 0, the point x(α) is uniquely defined and lies in X.

There are several sufficient conditions that guarantee Assumption 2, but in practice, it is typically easier to directly check Assumption 2. The interested reader is referred to, e.g., [3, 4] for the details.

We consider choosing the step sizes by the Armijo rule. Let ᾱ > 0 and r, τ ∈ (0, 1). The Armijo rule outputs α_k := ᾱ r^{j_k} for every k ∈ ℕ, where j_k is the least non-negative integer such that

 f(x_{k−1}(ᾱ r^{j_k})) ≤ f(x_{k−1}) + τ⟨∇f(x_{k−1}), x_{k−1}(ᾱ r^{j_k}) − x_{k−1}⟩.
The Armijo rule can be easily implemented by a while-loop, as shown in Algorithm 1.
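The while-loop can be sketched as follows, assuming the acceptance test is the generalized Armijo condition of Bertsekas [8]; the helper names and the toy linear objective are ours, not the paper's:

```python
import numpy as np

def armijo_mirror_step(x, f, grad_f, md_step, alpha_bar=1.0, r=0.5, tau=0.5):
    """Backtracking (Armijo) step-size search for one mirror descent step.

    `md_step(x, g, alpha)` returns the mirror descent update x(alpha) from x
    with gradient g.  Statement 1 of Theorem 1 is what guarantees that this
    loop terminates in finitely many trials."""
    g = grad_f(x)
    alpha = alpha_bar
    while True:
        x_new = md_step(x, g, alpha)
        if f(x_new) <= f(x) + tau * g @ (x_new - x):  # sufficient decrease
            return x_new, alpha
        alpha *= r  # shrink the step size and retry

# Toy usage on the simplex with the entropic (exponentiated gradient) update.
def eg_update(x, g, alpha):
    y = x * np.exp(-alpha * g)
    return y / y.sum()

c = np.array([3.0, 1.0, 2.0])
x0 = np.ones(3) / 3
x1, alpha1 = armijo_mirror_step(x0, lambda x: c @ x, lambda x: c, eg_update)
```

For a linear objective the first trial step already satisfies the sufficient-decrease test, so no backtracking occurs in this toy run.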

3 Local Relative Smoothness

In this section, we introduce the local relative smoothness condition, and provide a detailed discussion. In particular, we provide some practical approaches to checking the local relative smoothness condition, along with concrete examples illustrating when the practical approaches can and cannot be applied.

Roughly speaking, the local relative smoothness condition asks that around every point, there exists a neighborhood on which f is relatively smooth.

Definition 1

We say that f is locally smooth relative to h on X̃, if for every x ∈ X̃, there exist some ε_x > 0 and L_x > 0, such that

 f(z₂) ≤ f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + L_x D_h(z₂, z₁), ∀z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃, (3)

where B_{ε_x}(x) denotes the ball centered at x of radius ε_x with respect to a norm.

If we set h = (1/2)∥·∥₂², then (3) becomes

 f(z₂) ≤ f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + (L_x/2)∥z₂ − z₁∥₂², ∀z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃.

This is indeed the locally Lipschitz gradient condition in the literature.

Lemma 1

The following two statements are equivalent.

1. The function f is locally smooth relative to (1/2)∥·∥₂² on X̃.

2. Its gradient ∇f is locally Lipschitz on X̃; that is, for every x ∈ X̃, there exist some ε_x > 0 and L_x > 0, such that

 ∥∇f(z₂) − ∇f(z₁)∥₂ ≤ L_x∥z₂ − z₁∥₂, ∀z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃.

The proof of Lemma 1 is standard; we give it in Appendix B.

It is already known that the local Lipschitz gradient condition lies strictly between the following two conditions.

1. The function f is differentiable.

2. The gradient of f is (globally) Lipschitz.

See [22, 24] for the details.

The following result provides a practical approach to checking the local Lipschitz gradient condition.

Proposition 2

Suppose that X̃ is relatively open in X, and f is twice continuously differentiable on X̃. Then f is locally smooth relative to (1/2)∥·∥₂² on X̃.

Proof

Recall the definition of relative openness: for every x in X̃, there exists some ε_x > 0 such that B_{ε_x}(x) ∩ X ⊆ X̃. Notice that the largest eigenvalue of ∇²f is a continuous function on X̃; by the extreme value theorem, there exists some L_x > 0 such that λ_max(∇²f(z)) ≤ L_x for every z ∈ B_{ε_x}(x) ∩ X̃, shrinking ε_x if necessary. For every z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃, we use Taylor's formula with the integral remainder and write

 f(z₂) = f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + ∫₀¹ (1 − t)⟨∇²f(z₁ + t(z₂ − z₁))(z₂ − z₁), z₂ − z₁⟩ dt ≤ f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + (L_x/2)∥z₂ − z₁∥₂²,

which proves the proposition.

Corollary 1

If f is twice continuously differentiable on X, then it is locally smooth relative to (1/2)∥·∥₂² on X.

Indeed, under the setting of Corollary 1, the function f has a bounded Hessian by the extreme value theorem, and hence is smooth relative to (1/2)∥·∥₂², i.e., the function f satisfies the standard smoothness assumption in the literature [33]; then most existing convergence results for first-order optimization algorithms apply. Deriving an upper bound on the Lipschitz parameter, however, may be non-trivial. Moreover, there are cases where Corollary 1 does not apply, while Proposition 2 is applicable. Below is an example.

Example 2

Set for every . Set to be the positive orthant. Then is not twice continuously differentiable on ; for example, does not exist. However, Proposition 2 is applicable— is relatively open in as is open, and it is easily checked that is twice continuously differentiable on .

Note that the local Lipschitz gradient condition is not always applicable.

Example 3

Set f(x) := ∑_{i=1}^d xⁱ log xⁱ for every x, where we adopt the convention that 0 log 0 = 0. Set X to be the probability simplex P in ℝ^d. Then f is not locally smooth relative to (1/2)∥·∥₂². For example, every vertex of P lies in P, while the Hessian of f is unbounded around it. However, it is obvious that f is locally smooth relative to the negative Shannon entropy—indeed, f itself is the negative Shannon entropy function.

A standard setting for the mirror descent algorithm requires the following [2, 6, 25].

Assumption

The function h is strongly convex with respect to a norm ∥·∥ on X; that is, there exists some ρ > 0, such that

 h(z₂) ≥ h(z₁) + ⟨∇h(z₁), z₂ − z₁⟩ + (ρ/2)∥z₂ − z₁∥², ∀z₁, z₂ ∈ X.
If f is locally smooth relative to (1/2)∥·∥₂², it is also locally smooth relative to any function h strongly convex on X with respect to a norm ∥·∥—if for some ε_x > 0 and L_x > 0, it holds that

 f(z₂) ≤ f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + (L_x/2)∥z₂ − z₁∥₂², ∀z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃,

then we have

 f(z₂) ≤ f(z₁) + ⟨∇f(z₁), z₂ − z₁⟩ + (C²L_x/ρ) D_h(z₂, z₁), ∀z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃,

for some C > 0 such that ∥·∥₂ ≤ C∥·∥, which exists because all norms on a finite-dimensional space are equivalent. Therefore, with Assumption 3, it suffices to check for local smoothness relative to (1/2)∥·∥₂².

Example 4

Suppose that the constraint set X is the probability simplex P. By Pinsker's inequality, the negative Shannon entropy is strongly convex on P with respect to the ℓ₁-norm [17]. By the discussion above and Corollary 1, any convex objective function that is twice continuously differentiable on P is locally smooth relative to the negative Shannon entropy.

It is possible that Assumption 3 does not hold, while we have local relative smoothness.

Example 5

Consider the function f as defined in Example 2. Set h(x) := −∑_{i=1}^d log xⁱ, the Burg entropy. Then obviously, f is smooth—and hence locally smooth—relative to h. However, if we set X̃ to be the positive orthant, h is not strongly convex on X̃.

4 Main Result

The main result of this paper, the following theorem, says that the mirror descent algorithm with Armijo line search is well-defined, and guaranteed to converge, given assumptions discussed above.

Theorem 1

Suppose that Assumption 2 holds. Suppose that f⋆ > −∞, and f is locally smooth relative to h. Then the following hold.

1. The Armijo line search procedure terminates in finite steps.

2. The sequence (f(x_k))_{k∈ℕ} is non-increasing.

3. The sequence (f(x_k))_{k∈ℕ} converges to f⋆, if the sequence (x_k)_{k∈ℕ} is bounded.

Boundedness of the sequence (x_k)_{k∈ℕ} holds, for example, when the constraint set X or the level set {x ∈ X : f(x) ≤ f(x₀)} is bounded. A sufficient condition for the latter is coercivity—a function f is called coercive, if for every sequence (y_k) in X such that ∥y_k∥ → +∞, we have f(y_k) → +∞ (see, e.g., [5]).

5 Proof of Theorem 1

The proof of Theorem 1 stems from standard arguments (see, e.g., [2]), showing that the mirror descent algorithm converges, as long as the step sizes are bounded away from zero. However, without any global parameter of the objective function, we are not able to provide an explicit lower bound for all step sizes as in [2]. We overcome this difficulty by proving the existence of a strictly positive lower bound for all but a finite number of the step sizes.

The following result shows that for every x ∈ X, x(α) can be arbitrarily close to x by setting α small enough. This result is so fundamental in our analysis that we will use it without explicitly mentioning it.

Lemma 2

The function α ↦ x(α) is continuous for every x ∈ X.

Proof

Apply Theorem 7.41 in [38].

For ease of presentation, we put the proofs of some technical lemmas in Section C.

5.1 Proof of Statement 1

Statement 1 follows from the following lemma.

Lemma 3

For every x ∈ X, there exists some α_x > 0, such that

 f(x(α)) ≤ f(x) + τ⟨∇f(x), x(α) − x⟩, ∀α ∈ (0, α_x]. (4)

Proof

We write (4) equivalently as

 f(x(α)) − f(x) − ⟨∇f(x), x(α) − x⟩ ≤ −(1 − τ)⟨∇f(x), x(α) − x⟩, ∀α ∈ (0, α_x].

By the local relative smoothness condition, it suffices to check

 L_x D_h(x(α), x) ≤ −(1 − τ)⟨∇f(x), x(α) − x⟩, ∀α ∈ (0, α_x].

By Lemma 7, it suffices to check

 α L_x D_h(x(α), x) ≤ (1 − τ) D_h(x(α), x), ∀α ∈ (0, α_x].

If D_h(x(α), x) > 0 for all α > 0, it suffices to set α_x := (1 − τ)/L_x. Otherwise, we have x(α) = x for some α > 0; then Lemma 6 implies that x is a minimizer, and Lemma 3 follows with any α_x > 0.

5.2 Proof of Statements 2 and 3

Theorem 2

Let (x_k)_{k∈ℕ} be a sequence in X with x_k = x_{k−1}(α_k). Suppose that the assumptions in Theorem 1 hold. Then the sequence (f(x_k))_{k∈ℕ} monotonically converges to f⋆, if the following hold.

1. There exists some τ ∈ (0, 1), such that

 f(x_k) ≤ f(x_{k−1}) + τ⟨∇f(x_{k−1}), x_k − x_{k−1}⟩, ∀k ∈ ℕ.

2. The sum of step sizes diverges, i.e., ∑_{k=1}^∞ α_k = +∞.

Theorem 2 is essentially a restatement of Theorem 4.1 in [2]. We give a proof in Appendix D for completeness.

The first condition in Theorem 2 is automatically satisfied by the definition of Armijo line search. The second condition is verified by the following lemma.

Lemma 4

Suppose that the assumptions in Theorem 1 hold. If none of the iterates x_k is a solution to (P), it holds that ∑_{k=1}^∞ α_k = +∞.

Proof

We prove by contradiction. Suppose that lim inf_{k→∞} α_k = 0. Then there exists a sub-sequence of (α_k)_{k∈ℕ} converging to zero. By the boundedness of (x_k)_{k∈ℕ}, there exists a corresponding sub-sequence of iterates, indexed by K ⊆ ℕ, converging to a limit point x_∞ ∈ X. Notice that (α_k)_{k∈K} converges to zero. For large enough k ∈ K, we have α_k < ᾱ, which implies that the Armijo condition is violated at the step size r⁻¹α_k:

 f(x_{k−1}(r⁻¹α_k)) > f(x_{k−1}) + τ⟨∇f(x_{k−1}), x_{k−1}(r⁻¹α_k) − x_{k−1}⟩.

By the local relative smoothness condition and Lemma 7, we write

 r⁻¹α_k L_{x_∞} D_h(x_{k−1}(r⁻¹α_k), x_{k−1}) > (1 − τ) D_h(x_{k−1}(r⁻¹α_k), x_{k−1}).

If D_h(x_{k−1}(r⁻¹α_k), x_{k−1}) > 0, we get

 α_k > r(1 − τ)/L_{x_∞},

a contradiction. Therefore, lim inf_{k→∞} α_k is strictly positive, and the lemma follows.

Proof (Proof of Statements 2 and 3 of Theorem 1)

If none of the iterates is a solution to (P), Theorem 2 and Lemma 4 imply that the sequence (f(x_k))_{k∈ℕ} converges to f⋆. Otherwise, if some iterate x_K is a solution, Lemma 6 implies that x_k = x_K for every k ≥ K. Monotonicity of the sequence (f(x_k))_{k∈ℕ} follows from Corollary 2 in Section C.

6 Numerical Results

We illustrate applications of Theorem 1 in this section.

6.1 Portfolio Selection

Consider long-term investment in a market of d stocks under the discrete-time setting. At the beginning of the t-th day, t ∈ ℕ, the investor distributes his total wealth over the d stocks following a vector x_t in the probability simplex P ⊆ ℝ^d. Denote the price relatives—the (possibly negative) returns the investor would receive at the end of the day with a one-dollar investment—of the d stocks by a vector v_t. Then, if the investor has one dollar at the beginning of the first day, the wealth at the end of the n-th day is ∏_{t=1}^n ⟨v_t, x_t⟩. For every n ∈ ℕ, the best constant rebalanced portfolio up to the n-th day is defined as a solution of the optimization problem [15]

 min { −(1/n) ∑_{t=1}^n log⟨v_t, x⟩ : x ∈ P }. (BCRP)

The wealth incurred by the best constant rebalanced portfolio is a benchmark for on-line portfolio selection algorithms [15, 16, 20].

Denote the objective function in (BCRP) by f_BCRP. As f_BCRP is simply a vector analogue of f_QST, most existing convergence guarantees in convex optimization do not hold. The optimization problem (BCRP) was addressed by an expectation-maximization (EM)-type method developed by Cover [14]. Given an initial iterate x₀ ∈ P, Cover's algorithm iterates as

 x_k = −x_{k−1} ⊙ ∇f_BCRP(x_{k−1}), ∀k ∈ ℕ,

where the symbol “⊙” denotes element-wise multiplication. The algorithm possesses a guarantee of convergence, but no convergence rate guarantee [14, 18].
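Cover's update can be sketched as follows. We assume here the 1/n scaling of the objective (an assumption on our part, chosen so that the multiplicative update maps the simplex to itself); the function names and the toy data are ours:

```python
import numpy as np

def grad_f_bcrp(x, V):
    """Gradient of f_BCRP(x) = -(1/n) sum_t log <v_t, x>, where the rows of
    V are the daily price-relative vectors v_t."""
    daily = V @ x  # <v_t, x> for each day t
    return -(V / daily[:, None]).mean(axis=0)

def cover_step(x, V):
    """One iteration of Cover's EM-type algorithm: x <- x .* (-grad f(x)).
    With the 1/n scaling, the output sums to one whenever x does."""
    return x * (-grad_f_bcrp(x, V))

# Toy usage: two assets, two days of price relatives.
V = np.array([[1.5, 0.5],
              [0.8, 1.2]])
x = np.array([0.5, 0.5])
x = cover_step(x, V)
```

One can verify by hand that the update preserves the simplex: the entries of x ⊙ (−∇f_BCRP(x)) sum to (1/n)∑_t ⟨v_t, x⟩/⟨v_t, x⟩ = 1.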

Now we show that the optimization problem (BCRP) can be also solved by the exponentiated gradient method with Armijo line search.

Proposition 3

The function f_BCRP is locally smooth relative to the (negative) Shannon entropy on the constraint set P.

Proof

Note that the positive orthant is open, and hence P̃ := P ∩ ℝ₊₊^d is relatively open in P. It is easily checked that f_BCRP is twice continuously differentiable on the positive orthant, and hence on P̃. By Proposition 2, the function f_BCRP is locally smooth relative to (1/2)∥·∥₂². By Pinsker's inequality [17], the Shannon entropy is strongly convex on P with respect to the ℓ₁-norm. As all norms on a finite-dimensional space are equivalent, the proposition follows.

Therefore, the exponentiated gradient method—mirror descent with the Shannon entropy—is guaranteed to converge for solving (BCRP). The iteration rule has a closed-form:

 x(α) = c⁻¹ x ⊙ exp(−α∇f_BCRP(x)), ∀x ∈ P, α ≥ 0,

where the exponential acts entry-wise, and we set c := ∥x ⊙ exp(−α∇f_BCRP(x))∥₁ for any x and α, so that x(α) lies in P.

We compare the convergence speeds of Cover’s algorithm and the exponentiated gradient method with Armijo line search, for the New York Stock Exchange (NYSE) data during January 1st, 1985–June 30th, 2010 [30]. The corresponding dimensions are and . We set , , and for the Armijo line search procedure. The numerical experiment was done in MATLAB R2018a, on a MacBook Pro with an Intel Core i7 2.8GHz processor and 16GB DDR3 memory.

The numerical result is presented in Figure 1, where we plot the total wealth yielded by the algorithm iterates, with an initial wealth of one dollar. The proposed approach—the exponentiated gradient method with Armijo line search—was obviously faster than Cover's algorithm. For example, fixing the budget of computation time to one second, the proposed approach yields more than twice the wealth yielded by Cover's algorithm.

6.2 Quantum State Tomography

Quantum state tomography (QST) is the task of estimating the state of qubits (quantum bits), given measurement outcomes. Numerically, QST corresponds to solving the convex optimization problem specified in Example 1. Recall that in the introduction, we showed that the corresponding objective function, f_QST, does not satisfy the bounded gradient condition and is not smooth relative to the von Neumann entropy, while mirror descent with the Burg entropy has a high per-iteration computational complexity.

The following proposition is a matrix analogue to Proposition 3. A proof is provided in Section E.

Proposition 4

The function f_QST is locally smooth relative to the von Neumann entropy on the constraint set D.

Therefore, the (matrix) exponentiated gradient method—mirror descent with the von Neumann entropy—with Armijo line search is guaranteed to converge, by Theorem 1. The corresponding iteration rule has a closed-form expression [11, 40]:

 x(α) = c⁻¹ exp(log(x) − α∇f(x)),

for every x ∈ D and α ≥ 0, where c is a positive real normalizing the trace of x(α), and exp and log denote the matrix exponential and logarithm, respectively.

We test the empirical performance of the exponentiated gradient method with Armijo line search on real experimental data, generated following the setting in [19]. We compare it with the performances of the diluted RρR algorithm [42], SCOPT [39], and the modified Frank-Wolfe algorithm studied in [35]. We also consider the RρR algorithm [23]; it does not always converge [42], but is typically much faster than the diluted RρR algorithm in practice.

We compare the convergence speeds for the -qubit () and -qubit () cases, in Fig. 2 and 3, respectively. The corresponding “sample sizes” (numbers of summands in f_QST) are and , respectively. The numerical experiments were done in MATLAB R2015b, on a MacBook Pro with an Intel Core i7 2.8GHz processor and 16GB DDR3 memory. We set , and in Algorithm 1 for both cases. In both figures, the baseline denotes the minimum value of f_QST found by the five algorithms in 120 iterations.

One can observe that the exponentiated gradient method with Armijo line search is the fastest, in terms of the actual elapsed time. The slowness of the other algorithms is explainable.

1. The diluted RρR algorithm, using the notation of this paper, iterates as

 x_{k+1} = c_k⁻¹ [I + β_k f′(x_k)]^H x_k [I + β_k f′(x_k)],

where c_k normalizes the trace of x_{k+1}. To guarantee convergence, the step sizes β_k are computed by exact line search. The exact line search procedure renders the algorithm slow.

2. SCOPT is a projected gradient method for minimizing self-concordant functions [33, 34]. Notice that projection onto D typically results in a low-rank output; hence, it is possible that Tr(M_i x_k) = 0 for some i and some low-rank iterate x_k, but then x_k is not a feasible solution because f_QST(x_k) is not defined (in a standard setup of quantum state tomography, the matrices M_i are single-rank [19]). This is called the stalling problem in [27]. Luckily, self-concordance of f_QST ensures that if an iterate lies in the domain of f_QST, and the next iterate lies in a small enough Dikin ellipsoid centered at the current one, then the next iterate also lies in the domain. It is easily checked that f_QST is a self-concordant function. Following the theory in [33, 34], the radius of the Dikin ellipsoid shrinks as the number of summands n grows, so SCOPT becomes slow when n is large.

3. The Frank-Wolfe algorithm suffers from a sub-linear convergence rate when the solution is near an extreme point of the constraint set (see, e.g., [29] for an illustration in the vector case). Notice that the set of extreme points of D is the set of single-rank positive semi-definite matrices of unit trace. In the experimental data we have, the density matrix to be estimated is indeed close to a single-rank matrix (called a pure state in quantum mechanics). Therefore, the ML estimate—the minimizer of f_QST on D—is expected to also be close to a single-rank matrix.

Notice that the empirical convergence rate of the exponentiated gradient method with Armijo line search is linear.

Acknowledgements

We thank David Gross for valuable discussions, and Ya-Ping Hsieh for checking previous versions of this paper. YHL and VC were supported by SNF 200021-146750 and ERC project time-data 725594. CAR was supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation, DFG (SPP 1798 CoSIP), and the Templeton Foundation.

Appendix A Proof of Proposition 1

Consider the two-dimensional case, where d = 2. Define M₁ := diag(1, 0) and M₂ := diag(0, 1). Suppose that there are only two summands, with measurement matrices M₁ and M₂. Then we have f_QST(diag(x, y)) = −log x − log y. It suffices to disprove all properties on the set of diagonal density matrices. Hence, we will focus on the function g(x, y) := −log x − log y, defined for any (x, y) in the probability simplex P.

As either x or y can be arbitrarily close to zero, it is easily checked that the gradient of g is unbounded. Now we check the relative smoothness condition. As we only consider diagonal matrices, it suffices to check smoothness with respect to the (negative) Shannon entropy, where

 h(x, y) := −x log x − y log y + x + y, ∀(x, y) ∈ P,

for which the convention 0 log 0 = 0 is adopted.

Lemma 5 ([31])

The function g is L-smooth relative to the negative Shannon entropy −h for some L > 0, if and only if −Lh − g is convex.

Therefore, we check the positive semi-definiteness of the Hessian of −Lh − g. A necessary condition for the Hessian to be positive semi-definite is that

 −L ∂²h/∂x²(x, y) − ∂²g/∂x²(x, y) = L/x − 1/x² ≥ 0,

for all (x, y) ∈ P, but the inequality cannot hold for x < 1/L, for any fixed L > 0.

Appendix B Proof of Lemma 1

(Statement 2 ⇒ Statement 1) Let x ∈ X̃, and z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃. Define, for every τ ∈ [0, 1], z_τ := z₁ + τ(z₂ − z₁). We write

 f(z₂) − f(z₁) − ⟨∇f(z₁), z₂ − z₁⟩ ≤ ∫₀¹ ∥∇f(z_τ) − ∇f(z₁)∥₂ ∥z₂ − z₁∥₂ dτ ≤ ∫₀¹ L_x τ ∥z₂ − z₁∥₂² dτ = (L_x/2) ∥z₂ − z₁∥₂²,

where we have applied the Cauchy-Schwarz inequality for the first inequality, and the local Lipschitz gradient condition for the second inequality. Note that B_{ε_x}(x) ∩ X̃ is the intersection of convex sets, and hence is convex; therefore, z_τ ∈ B_{ε_x}(x) ∩ X̃ for every τ ∈ [0, 1].

(Statement 1 ⇒ Statement 2) Let x ∈ X̃, and z₁, z₂ ∈ B_{ε_x}(x) ∩ X̃. Define φ(z) := f(z) − ⟨∇f(z₁), z⟩. Then φ is locally smooth relative to (1/2)∥·∥₂² with the same parameter L_x; moreover, since ∇φ(z₁) = 0, the point z₁ is a global minimizer of φ. Therefore, we obtain

 φ(z₁) ≤ φ(z₂ − L_x⁻¹∇φ(z₂)) ≤ φ(z₂) − (2L_x)⁻¹∥∇φ(z₂)∥₂²;

that is,

 f(z₁) − ⟨∇f(z₁), z₁⟩ ≤ f(z₂) − ⟨∇f(z₁), z₂⟩ − (2L_x)⁻¹∥∇f(z₂) − ∇f(z₁)∥₂².

Similarly, we get

 f(z₂) − ⟨∇f(z₂), z₂⟩ ≤ f(z₁) − ⟨∇f(z₂), z₁⟩ − (2L_x)⁻¹∥∇f(z₁) − ∇f(z₂)∥₂².

Summing up the two inequalities, we obtain

 L_x⁻¹∥∇f(z₂) − ∇f(z₁)∥₂² ≤ ⟨∇f(z₂) − ∇f(z₁), z₂ − z₁⟩.

This implies, by the Cauchy-Schwarz inequality,

 ∥∇f(z₂) − ∇f(z₁)∥₂ ≤ L_x ∥z₂ − z₁∥₂.

Appendix C Auxiliary Technical Lemmas for Proving Theorem 1

Lemma 6

If x(α) = x for some α > 0, then x is a solution to (P). If a point x is a solution to (P), then x(α) = x for all α > 0.

Proof

That x is a solution to (P) is equivalent to the optimality condition

 ⟨∇f(x), u − x⟩ ≥ 0, ∀u ∈ X.

We can equivalently write

 ⟨α∇f(x) + ∇h(x) − ∇h(x), u − x⟩ ≥ 0, ∀u ∈ X,

which is the optimality condition of x = x(α).

Lemma 7

For every x ∈ X and α > 0, it holds that

 ⟨∇f(x), x(α) − x⟩ ≤ −α⁻¹ D_h(x(α), x).
Proof

By definition, x(α) minimizes the function y ↦ α⟨∇f(x), y − x⟩ + D_h(y, x) over X; comparing the objective values at y = x(α) and y = x, we obtain α⟨∇f(x), x(α) − x⟩ + D_h(x(α), x) ≤ 0, and the lemma follows.

Corollary 2

The sequence (f(x_k))_{k∈ℕ} is non-increasing.

Proof

The Armijo rule and Lemma 7 guarantee that

 f(x_k) ≤ f(x_{k−1}) + τ⟨∇f(x_{k−1}), x_k − x_{k−1}⟩ ≤ f(x_{k−1}) − τα_k⁻¹ D_h(x_k, x_{k−1}) ≤ f(x_{k−1}), ∀k ∈ ℕ.

Appendix D Proof of Theorem 2

For every u ∈ X, we write

 f(x_{k−1}) − f(u) ≤ ⟨∇f(x_{k−1}), x_{k−1} − u⟩ = ⟨∇f(x_{k−1}), x_{k−1} − x_k⟩ − ⟨∇f(x_{k−1}), u − x_k⟩.

The optimality condition for x_k = x_{k−1}(α_k) implies

 ⟨α_k∇f(x_{k−1}) + ∇h(x_k) − ∇h(x_{k−1}), u − x_k⟩ ≥ 0, ∀u ∈ X.

Applying the three-point identity [13], we obtain

 ⟨∇f(x_{k−1}), u − x_k⟩ ≥ −α_k⁻¹⟨∇h(x_k) − ∇h(x_{k−1}), u − x_k⟩ = −α_k⁻¹[D_h(u, x_{k−1}) − D_h(u, x_k) − D_h(x_k, x_{k−1})] ≥ −α_k⁻¹[D_h(u, x_{k−1}) − D_h(u, x_k)].

Then we can write

 α_k[f(x_{k−1}) − f(u)] ≤ α_k⟨∇f(x_{k−1}), x_{k−1} − x_k⟩ + D_h(u, x_{k−1}) − D_h(u, x_k).

Summing up the inequality for all k ∈ {1, …, n}, we get

 ∑_{k=1}^n α_k[f(x_{k−1}) − f(u)] ≤ ∑_{k=1}^n α_k⟨∇f(x_{k−1}), x_{k−1} − x_k⟩ + D_h(u, x₀),

where S_n := ∑_{k=1}^n α_k. Corollary 2 says that the sequence (f(x_k))_{k∈ℕ} is non-increasing; then we have

 ∑_{k=1}^n α_k f(x_{k−1}) ≥ ∑_{k=1}^n α_k f(x_n) = S_n f(x_n).

Therefore, we obtain

 f(x_n) − f(u) ≤ S_n⁻¹ [D_h(u, x₀) + ∑_{k=1}^n α_k⟨∇f(x_{k−1}), x_{k−1} − x_k⟩].
Note that by the Armijo rule, we have

 f(x₀) − f⋆ ≥ lim_{k→∞} [f(x₀) − f(x_k)] = ∑_{j=1}^∞ [f(x_{j−1}) − f(x_j)] ≥ τ ∑_{j=1}^∞ ⟨∇f(x_{j−1}), x_{j−1} − x_j⟩.

Therefore, the terms ⟨∇f(x_{j−1}), x_{j−1} − x_j⟩, which are non-negative by Lemma 7, must converge to zero. Theorem 2 then follows from the following lemma.

Lemma 8 ([37])

Let (a_k)_{k∈ℕ} be a sequence of real numbers, and (λ_k)_{k∈ℕ} be a sequence of positive real numbers. Define b_n := S_n⁻¹ ∑_{k=1}^n λ_k a_k for every n ∈ ℕ, where S_n := ∑_{k=1}^n λ_k. If S_n → +∞ and a_k → a for some a ∈ ℝ, then b_n → a.
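A quick numeric sanity check of this averaging lemma, with hypothetical sequences of our own choosing:

```python
import numpy as np

# Numeric illustration of Lemma 8: if a_k -> a and the positive weights
# lambda_k have a divergent sum, then the weighted averages
# b_n = S_n^{-1} sum_{k<=n} lambda_k a_k also converge to a.
k = np.arange(1, 200001)
a = 1.0 + 1.0 / k        # a_k -> 1
lam = 1.0 / np.sqrt(k)   # positive, with divergent sum
b = np.cumsum(lam * a) / np.cumsum(lam)
```

The tail of `b` approaches the limit 1, and the error at n = 200000 is smaller than at n = 100, as the lemma predicts.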

Appendix E Proof of Proposition 4

Note that the set of positive definite matrices is open, and hence D̃ := {x ∈ D : x ≻ 0} is relatively open in D. It is easily checked that f_QST is twice continuously differentiable on the set of positive definite matrices, and hence on D̃. By Proposition 2, the function f_QST is locally smooth relative to (1/2)∥·∥_F², where ∥·∥_F denotes the Frobenius norm. By the quantum version of Pinsker's inequality [21], the von Neumann entropy is strongly convex on D with respect to the trace norm. As all norms on a finite-dimensional space are equivalent, the proposition follows.

References

• [1] Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16, 1 (1966), 1–3.
• [2] Auslender, A., and Teboulle, M. Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16, 3 (2006), 697–725.
• [3] Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42, 2 (2017), 330–348.
• [4] Bauschke, H. H., Borwein, J. M., and Combettes, P. L. Essential smoothness, essential strict convexity, and Legendre functions in Banach spaces. Commun. Contemp. Math. 3, 4 (2001), 615–647.
• [5] Bauschke, H. H., and Combettes, P. L. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, NY, 2011.
• [6] Beck, A., and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31 (2003), 167–175.
• [7] Bertero, M., Boccacci, P., Desiderà, G., and Vicidomini, G. Image deblurring with Poisson data: from cells to galaxies. Inverse Probl. 25 (2009).
• [8] Bertsekas, D. P. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Contr. AC-21, 2 (1976), 174–184.
• [9] Bonettini, S., Loris, I., Porta, F., and Prato, M. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM J. Optim. 26, 2 (2016), 891–921.
• [10] Bonettini, S., Zanella, R., and Zanni, L. A scaled gradient projection method for constrained image deblurring. Inverse Probl. 25 (2009).
• [11] Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8, 3–4 (2015), 231–358.
• [12] Byrne, C., and Censor, Y. Proximity function minimization using multiple Bregman projections, with application to split feasibility and Kullback-Leibler distance minimization. Ann. Oper. Res. 105 (2001), 77–98.
• [13] Chen, G., and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3, 3 (Aug. 1993), 538–543.
• [14] Cover, T. M. An algorithm for maximizing expected log investment return. IEEE Trans. Inf. Theory IT-30, 2 (1984), 369–373.
• [15] Cover, T. M. Universal portfolios. Math. Finance 1, 1 (1991), 1–29.
• [16] Cover, T. M., and Ordentlich, E. Universal portfolios with side information. IEEE Trans. Inf. Theory 42, 2 (1996), 348–363.
• [17] Csiszár, I., and Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, second ed. Cambridge Univ. Press, Cambridge, UK, 2011.
• [18] Csiszár, I., and Tusnády, G. Information geometry and alternating minimization procedures. Stat. Decis., Supplement 1 (1984), 205–237.
• [19] Häffner, H., Hänsel, W., Roos, C. F., Benhelm, J., Check-al-kar, D., Chwalla, M., Körber, T., Rapol, U. D., Riebe, M., Schmidt, P. O., Becher, C., Gühne, O., Dür, W., and Blatt, R. Scalable multiparticle entanglement of trapped ions. Nature 438 (2005), 643–646.
• [20] Hazan, E., and Kale, S. An online portfolio selection algorithm with regret logarithmic in price variation. Math. Finance 25, 2 (2015), 288–310.
• [21] Hiai, F., Ohya, M., and Tsukada, M. Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96, 1 (1981), 99–109.
• [22] Hiriart-Urruty, J.-B., Strodiot, J.-J., and Nguyen, V. H. Generalized Hessian matrix and second-order optimality conditions for problem with data. Appl. Math. Optim. 11 (1984), 43–56.
• [23] Hradil, Z. Quantum-state estimation. Phys. Rev. A 55, 3 (1997).
• [24] Ioffe, A., and Milosz, T. On a characterization of functions. Cybern. Syst. Anal. 38, 3 (2002), 313–322.
• [25] Juditsky, A., and Nemirovski, A. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In Optimization for Machine Learning, S. Sra, S. Nowozin, and S. J. Wright, Eds. MIT Press, Cambridge, MA, 2012, ch. 5.
• [26] Kivinen, J., and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132 (1997), 1–63.
• [27] Knee, G. C., Bolduc, E., Leach, J., and Gauger, E. M. Maximum-likelihood quantum process tomography via projected gradient descent. arXiv:1803.10062v1.
• [28] Kulis, B., Sustik, M. A., and Dhillon, I. S. Low-rank kernel learning with Bregman matrix divergences. J. Mach. Learn. Res. 10 (2009), 341–376.
• [29] Lacoste-Julien, S., and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Adv. Neural Information Processing Systems 28 (2015).
• [30] Li, B., Sahoo, D., and Hoi, S. C. H. OLPS: A toolbox for on-line portfolio selection. J. Mach. Learn. Res. 17 (2016), 1–5.
• [31] Lu, H., Freund, R. M., and Nesterov, Y. Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28, 1 (2018), 333–354.
• [32] Nemirovsky, A. S., and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, Chichester, 1983.
• [33] Nesterov, Y. Introductory Lectures on Convex Optimization. Kluwer, Boston, MA, 2004.
• [34] Nesterov, Y., and Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, PA, 1994.
• [35] Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., El Halabi, M., Tran-Dinh, Q., and Cevher, V. Frank-Wolfe works for non-Lipschitz continuous gradient objectives: Scalable Poisson phase retrieval. In IEEE Int. Conf. Acoustics, Speech and Signal Processing (2016), pp. 6230–6234.
• [36] Paris, M., and Řeháček, J., Eds. Quantum State Estimation. Springer, Berlin, 2004.
• [37] Polyak, B. T. Introduction to Optimization. Optimization Softw., Inc., New York, NY, 1987.
• [38] Rockafellar, R. T., and Wets, R. J. Variational Analysis. Springer, Berlin, 2009.
• [39] Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. Composite self-concordant minimization. J. Mach. Learn. Res. 16 (2015), 371–416.
• [40] Tsuda, K., Rätsch, G., and Warmuth, M. K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6 (2005), 995–1018.
• [41] Vardi, Y., Shepp, L. A., and Kaufman, L. A statistical model for positron emission tomography. J. Am. Stat. Assoc. 80, 389 (1985), 8–20.
• [42] Řeháček, J., Hradil, Z., Knill, E., and Lvovsky, A. I. Diluted maximum-likelihood algorithm for quantum tomography. Phys. Rev. A 75 (2007).