A General Convergence Result for
Mirror Descent with Armijo Line Search
Abstract
Existing convergence guarantees for the mirror descent algorithm require the objective function to have a bounded gradient or be smooth relative to a Legendre function. The bounded gradient and relative smoothness conditions, however, may not hold in important applications, such as quantum state tomography and portfolio selection. In this paper, we propose a local version of the relative smoothness condition as a generalization of its existing global version, and prove that under this local relative smoothness condition, the mirror descent algorithm with Armijo line search always converges. Numerical results show that, on real datasets, the mirror descent algorithm with Armijo line search is empirically the fastest guaranteed-to-converge algorithm for quantum state tomography.
1 Introduction
Consider a constrained convex optimization problem:
(P) 
where is a convex differentiable function, and is a closed convex set in . We assume that .
The mirror descent algorithm is standard for solving such a constrained convex optimization problem [6, 32]. Given an initial iterate , the mirror descent algorithm iterates as
(1) 
for some convex differentiable function and a properly chosen sequence of step sizes , where denotes the Bregman divergence induced by :
With a proper choice of the function , the mirror descent algorithm can have an almost dimension-independent convergence rate guarantee, or lower per-iteration computational complexity. A famous example is the exponentiated gradient method, which enjoys both benefits [25, 26]. The exponentiated gradient method corresponds to the mirror descent algorithm with being the negative Shannon entropy.
Convergence of the mirror descent algorithm has been established under the following two conditions on the objective function.
These conditions may fail to hold, or may introduce undesirable computational burdens, in some applications. Quantum state tomography is one such instance.
Example 1
Quantum state tomography (QST) is the task of estimating the state of qubits (quantum bits) given measurement outcomes [36]; this task is essential to calibrating quantum computation devices. Numerically, it corresponds to minimizing the function
for given positive semidefinite matrices , on the set of quantum density matrices
(2) 
The dimension equals , where is the number of qubits.
Notice that the diagonal of a density matrix in must belong to the probability simplex in ; therefore, a density matrix can be viewed as a matrix analogue of a probability distribution. In view of this observation, it is natural to consider the matrix version of the exponentiated gradient method, in which the Shannon entropy is replaced by its matrix analogue, the von Neumann entropy [11, 40]. Unfortunately, the following is easily checked.
Proposition 1
The gradient of the function is not bounded. The function is not smooth relative to the von Neumann entropy.
A proof is given in Appendix A.
Another popular choice of the function is Burg’s entropy. The resulting mirror descent algorithm iterates as
where is chosen such that [28]. The numerical search for yields a high per-iteration computational complexity of the mirror descent algorithm.
We note that in terms of the objective functions and constraint sets, positron emission tomography, optimal portfolio selection, and nonnegative linear inverse problems are essentially vector analogues of QST [12, 14, 41]. The same issues we have discussed above remain in these applications, though the computational burden due to the Burg entropy may be relatively minor in these vector analogues.
To address “nonstandard” applications like QST, we relax the condition on the objective function. Specifically, we propose a novel localized version of the relative smoothness condition. Unlike the bounded gradient and (global) relative smoothness conditions, the local relative smoothness condition does not involve any global parameter. Therefore, we do not seek a closed-form expression for the step sizes; instead, we select the step sizes adaptively by Armijo line search.
1.1 Related work
The mirror descent algorithm was introduced in [32]. The formulation (1) was proposed in [6], which is equivalent to the original one under standard assumptions. The interior gradient method studied in [2] is also of the form (1); the difference lies in the technical conditions. Standard convergence analyses of the mirror descent, as discussed above, assume either bounded gradient or relative smoothness [2, 3, 6, 31, 32]. The exponentiated gradient method was proposed in [26]; it is also known as the entropic mirror descent [6].
For quantum state tomography, there are few guaranteed-to-converge optimization algorithms. The algorithm was proposed as an analogue of the expectation-maximization (EM) algorithm [23], but does not always converge [42]. The diluted algorithm is a variant of the algorithm; it guarantees convergence by exact line search [42]. The Frank-Wolfe algorithm converges with a step size selection rule slightly different from the standard one [35]. The SCOPT algorithm proposed in [39], a proximal gradient method for composite self-concordant minimization, also converges, as the logarithmic function is a standard instance of a self-concordant function. The numerical results in Section 6, unfortunately, show that the convergence speeds of the diluted , Frank-Wolfe, and SCOPT algorithms are not satisfactory on real datasets.
For the vector analogues of QST mentioned above, the standard approach is the EM algorithm [14, 18, 41]. The algorithm is also known as the Richardson-Lucy (RL) algorithm in astronomy and microscopy (see, e.g., [7]). The numerical results in Section 6 show that the EM algorithm is slow on real datasets for portfolio selection. There are faster accelerated versions of the EM algorithm based on line search, but they lack convergence guarantees [7]. Guaranteed-to-converge variable metric methods with line search were proposed in [9, 10], but they involve an infinite number of parameters to be properly tuned.
1.2 Contributions
We propose a novel local relative smoothness condition, and show that the condition is satisfied by a large class of objective functions. The main result is Theorem 1, which establishes convergence of the mirror descent algorithm with Armijo line search under the local relative smoothness condition. Numerical results show that the exponentiated gradient method with Armijo line search, whose convergence is guaranteed by Theorem 1, is empirically the fastest guaranteed-to-converge algorithm for QST on real datasets. To the best of our knowledge, even for globally relatively smooth objective functions, convergence of mirror descent with Armijo line search has not been proven before; Theorem 1 provides the first convergence guarantee for this setup.
2 Mirror Descent with Armijo Line Search
Let be a convex differentiable function strictly convex on . The corresponding Bregman divergence is given by
Because of the strict convexity of , it holds that , and if and only if .
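As a quick illustration (our own sketch, not part of the paper; the helper names are ours), the definition above can be coded directly. With the negative Shannon entropy as the reference function, the resulting Bregman divergence is the Kullback-Leibler divergence, which is nonnegative and vanishes exactly when the two arguments coincide, matching the properties just stated.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman divergence D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y).dot(x - y)

# Negative Shannon entropy and its gradient, defined on the positive orthant.
def neg_entropy(x):
    return float(np.sum(x * np.log(x)))

def neg_entropy_grad(x):
    return np.log(x) + 1.0
```

For points on the probability simplex, `bregman(neg_entropy, neg_entropy_grad, x, y)` equals the Kullback-Leibler divergence `sum(x * log(x / y))`.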
Define . The corresponding mirror descent algorithm starts with some , and iterates as
where denotes the step size. To ensure that the mirror descent algorithm is well-defined, we will assume the following throughout this paper.
Assumption
For every and , is uniquely defined and lies in .
There are several sufficient conditions that guarantee Assumption 2, but in practice, it is typically easier to check Assumption 2 directly. The interested reader is referred to, e.g., [3, 4] for the details.
We consider choosing the step sizes by the Armijo rule. Let and . The Armijo rule outputs for every , where is the least nonnegative integer such that
The Armijo rule can be easily implemented by a while loop, as shown in Algorithm 1.
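A minimal Python rendering of this while loop follows (our own sketch, not the paper's implementation; the entropic mirror step on the simplex is used as a concrete instance, and the names `alpha_bar`, `r`, `c` are ours). The loop shrinks the step size by the factor `r` until the sufficient-decrease condition f(x⁺) ≤ f(x) + c⟨∇f(x), x⁺ − x⟩ holds.

```python
import numpy as np

def mirror_step(x, grad, alpha):
    """Entropic mirror descent (exponentiated gradient) step on the simplex."""
    y = x * np.exp(-alpha * grad)
    return y / y.sum()

def armijo_step(f, grad_f, x, alpha_bar=1.0, r=0.5, c=0.5, max_backtracks=50):
    """One mirror descent step with Armijo backtracking line search.

    Shrinks alpha = alpha_bar * r**j until the sufficient-decrease condition
    f(x_new) <= f(x) + c * <grad f(x), x_new - x> holds.
    """
    g = grad_f(x)
    alpha = alpha_bar
    for _ in range(max_backtracks):
        x_new = mirror_step(x, g, alpha)
        if f(x_new) <= f(x) + c * g.dot(x_new - x):
            return x_new, alpha
        alpha *= r
    return x, 0.0  # not reached under the assumptions of this paper
```

The function returns the next iterate together with the accepted step size; termination of the backtracking loop after finitely many trials is exactly what Statement 1 of Theorem 1 guarantees.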
3 Local Relative Smoothness
In this section, we introduce the local relative smoothness condition and provide a detailed discussion. In particular, we provide some practical approaches to checking the local relative smoothness condition, along with concrete examples illustrating when these approaches can and cannot be applied.
Roughly speaking, the local relative smoothness condition asks that for every point, there exists a neighborhood on which is relatively smooth.
Definition 1
We say that is locally smooth relative to on , if for every , there exist some and , such that
(3) 
where denotes the ball centered at of radius with respect to a norm.
If we set , then (3) becomes
This is indeed the locally Lipschitz gradient condition in the literature.
Lemma 1
The following two statements are equivalent.

The function is locally smooth relative to on .

Its gradient is locally Lipschitz on ; that is, for every , there exists some and , such that
It is already known that the local Lipschitz gradient condition lies strictly between the following two conditions.

The function is differentiable.

The gradient of is (globally) Lipschitz.
The following result provides a practical approach to checking the local Lipschitz gradient condition.
Proposition 2
Suppose that is relatively open in , and is twice continuously differentiable on . Then is locally smooth relative to on .
Proof
Recall the definition of relative openness: For every in , there exists some such that . Notice that the largest eigenvalue of is a continuous function on ; by the extreme value theorem, there exists some such that for every . For every , we use Taylor’s formula with the integral remainder and write
which proves the proposition.
Corollary 1
If is twice continuously differentiable on , then it is locally smooth relative to on .
Indeed, under the setting of Corollary 1, the function has a bounded Hessian by the extreme value theorem, and hence is smooth relative to , i.e., the function satisfies the standard smoothness assumption in the literature [33]; then most existing convergence results for first-order optimization algorithms apply. Deriving an upper bound on the Lipschitz parameter, however, may be nontrivial. Moreover, there are cases where Corollary 1 does not apply, while Proposition 2 is applicable. Below is an example.
Example 2
Set for every . Set to be the positive orthant. Then is not twice continuously differentiable on ; for example, does not exist. However, Proposition 2 is applicable— is relatively open in as is open, and it is easily checked that is twice continuously differentiable on .
Note that the local Lipschitz gradient condition is not always applicable.
Example 3
Set for every , where we adopt the convention that . Set to be the probability simplex in . Then is not locally smooth relative to . For example, the point lies in , while is unbounded around . However, it is obvious that is locally smooth relative to the negative Shannon entropy—indeed, itself is the negative Shannon entropy function.
Assumption
The function is strongly convex with respect to a norm on ; that is, there exists some , such that
If is locally smooth relative to , it is also locally smooth relative to any function strongly convex on with respect to a norm —if for some and , it holds that
then we have
for some such that , which exists because all norms on a finite-dimensional space are equivalent. Therefore, with Assumption 3, it suffices to check for local smoothness relative to .
Example 4
Suppose that the constraint set is the probability simplex. By Pinsker’s inequality, the negative Shannon entropy is strongly convex on with respect to the norm [17]. By the discussion above and Corollary 1, any convex objective function that is twice continuously differentiable on is locally smooth relative to the negative Shannon entropy.
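The strong convexity claim rests on Pinsker's inequality, D_KL(p‖q) ≥ ½‖p − q‖₁² (with natural logarithms). The snippet below (a numerical sanity check of ours, not from the paper) verifies the inequality on random points of the simplex.

```python
import numpy as np

def kl(p, q):
    """KL divergence, the Bregman divergence of the negative Shannon entropy."""
    return float(np.sum(p * np.log(p / q)))

def pinsker_gap(p, q):
    """D_KL(p||q) - (1/2) * ||p - q||_1^2; Pinsker's inequality says this is >= 0."""
    return kl(p, q) - 0.5 * float(np.sum(np.abs(p - q))) ** 2
```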
It is possible that Assumption 3 does not hold, while we have local relative smoothness.
Example 5
Consider the function as defined in Example 2. Set , the Burg entropy. Then obviously, is smooth—and hence locally smooth—relative to . However, if we set to be the positive orthant, is not strongly convex on .
4 Main Result
The main result of this paper, the following theorem, says that the mirror descent algorithm with Armijo line search is well-defined and guaranteed to converge under the assumptions discussed above.
Theorem 1
Suppose that Assumption 2 holds. Suppose that , and is locally smooth relative to . Then the following hold.

The Armijo line search procedure terminates within finitely many steps.

The sequence is nonincreasing.

The sequence converges to , if is bounded.
Boundedness of the sequence holds, for example, when the constraint set or the level set is bounded. A sufficient condition for the latter is coercivity: a function is called coercive if for every sequence such that , we have (see, e.g., [5]).
5 Proof of Theorem 1
The proof of Theorem 1 stems from standard arguments (see, e.g., [2]), which show that the mirror descent algorithm converges as long as the step sizes are bounded away from zero. However, without any global parameter of the objective function, we cannot provide an explicit lower bound for all step sizes as in [2]. We resolve this difficulty by proving the existence of a strictly positive lower bound for all but a finite number of the step sizes.
The following result shows that for every , can be arbitrarily close to by setting very small. This result is so fundamental in our analysis that we will use it without explicitly mentioning it.
Lemma 2
The function is continuous in for every .
Proof
Apply Theorem 7.41 in [38].
For ease of presentation, we defer the proofs of some technical lemmas to Appendix C.
5.1 Proof of Statement 1
Statement 1 follows from the following lemma.
Lemma 3
For every , there exists some , such that
(4) 
5.2 Proof of Statements 2 and 3
We start with the following known result.
Theorem 2
Let be a sequence in . Suppose that the assumptions in Theorem 1 hold. Then the sequence monotonically converges to , if the following hold.

There exists some , such that

The sum of step sizes diverges, i.e., .
Theorem 2 is essentially a restatement of Theorem 4.1 in [2]. We give a proof in Appendix D for completeness.
The first condition in Theorem 2 is automatically satisfied by the definition of the Armijo line search. The second condition is verified by the following lemma.
Lemma 4
Proof
We argue by contradiction. Suppose that . Then there exists a subsequence converging to zero. By the boundedness of , there exists a subsequence converging to a limit point , for some . Notice that converges to zero. For large enough , we have
which implies
By the local relative smoothness condition and Lemma 7, we write
If , we get
a contradiction. Therefore, is strictly positive, and the lemma follows.
6 Numerical Results
We illustrate applications of Theorem 1 in this section.
6.1 Portfolio Selection
Consider long-term investment in a market of stocks under the discrete-time setting. At the beginning of the th day, , the investor distributes his total wealth over the stocks following a vector in the probability simplex . Denote the price relatives—(possibly negative) returns the investor would receive at the end of the day with a one-dollar investment—of the stocks by a vector . Then, if the investor has one dollar at the beginning of the first day, the wealth at the end of the th day is . For every , the best constant rebalanced portfolio up to the th day is defined as a solution of the optimization problem [15]
(BCRP) 
The wealth incurred by the best constant rebalanced portfolio is a benchmark for online portfolio selection algorithms [15, 16, 20].
Denote the objective function in (BCRP) by . As is simply a vector analogue of , most existing convergence guarantees in convex optimization do not apply. The optimization problem (BCRP) was addressed by an expectation-maximization (EM)-type method developed by Cover [14]. Given an initial iterate , Cover’s algorithm iterates as
where the symbol “” denotes elementwise multiplication. The algorithm possesses a convergence guarantee, but no convergence rate guarantee is known [14, 18].
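As an illustration (our own sketch, assuming the objective is the average log-wealth over T days stored as a T × d matrix `A` of price relatives; the names are ours), Cover's update multiplies the portfolio elementwise by the gradient of the average log-wealth. Since the inner product of the portfolio with this gradient equals one, the iterate automatically stays on the simplex.

```python
import numpy as np

def cover_update(x, A):
    """One iteration of Cover's EM-type algorithm for (BCRP).

    A is a T x d array of price relatives; x is the current portfolio.
    """
    g = (A / (A @ x)[:, None]).mean(axis=0)  # gradient of the average log-wealth
    return x * g  # <x, g> = 1, so the new iterate remains on the simplex

def log_wealth(x, A):
    """Average log-wealth attained by the constant rebalanced portfolio x."""
    return float(np.mean(np.log(A @ x)))
```

Each update grows the coordinates with above-average performance and shrinks the rest, and the average log-wealth is nondecreasing along the iterates.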
Now we show that the optimization problem (BCRP) can also be solved by the exponentiated gradient method with Armijo line search.
Proposition 3
The function is locally smooth relative to the (negative) Shannon entropy on the constraint set .
Proof
Note that is open, and hence is relatively open in . It is easily checked that is twice continuously differentiable on , and hence on . By Proposition 2, the function is locally smooth relative to . By Pinsker’s inequality [17], the Shannon entropy is strongly convex on with respect to the norm. As all norms on a finite-dimensional space are equivalent, the proposition follows.
Therefore, the exponentiated gradient method—mirror descent with the Shannon entropy—is guaranteed to converge for solving (BCRP). The iteration rule has a closed form:
where we set for any .
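Putting the closed-form update together with Armijo backtracking, a minimal end-to-end sketch for (BCRP) might look as follows (our own illustration under the average log-wealth objective; the function names, the step-size floor, and the Armijo constants are placeholders, not the paper's implementation).

```python
import numpy as np

def eg_update(x, grad, alpha):
    """Closed-form exponentiated gradient update on the probability simplex."""
    w = x * np.exp(-alpha * grad)
    return w / w.sum()

def neg_log_wealth(x, A):
    """(BCRP) in minimization form: the negative average log-wealth."""
    return -float(np.mean(np.log(A @ x)))

def neg_log_wealth_grad(x, A):
    return -(A / (A @ x)[:, None]).mean(axis=0)

def solve_bcrp(A, iters=50, alpha_bar=1.0, r=0.5, c=0.5):
    """Exponentiated gradient with Armijo backtracking for (BCRP)."""
    x = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(iters):
        g = neg_log_wealth_grad(x, A)
        alpha = alpha_bar
        while alpha > 1e-12:
            x_new = eg_update(x, g, alpha)
            # Armijo sufficient-decrease condition
            if neg_log_wealth(x_new, A) <= neg_log_wealth(x, A) + c * g.dot(x_new - x):
                x = x_new
                break
            alpha *= r
    return x
```

Because a step is accepted only when the sufficient-decrease condition holds, the objective value is nonincreasing along the iterates, in line with Statement 2 of Theorem 1.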
We compare the convergence speeds of Cover’s algorithm and the exponentiated gradient method with Armijo line search, for the New York Stock Exchange (NYSE) data during January 1st, 1985–June 30th, 2010 [30]. The corresponding dimensions are and . We set , , and for the Armijo line search procedure. The numerical experiment was done in MATLAB R2018a, on a MacBook Pro with an Intel Core i7 2.8GHz processor and 16GB DDR3 memory.
The numerical result is presented in Figure 1, where we plot the total wealth yielded by the algorithm iterates, with an initial wealth of one dollar. The proposed approach—the exponentiated gradient method with Armijo line search—is clearly faster than Cover’s algorithm. For example, with a computation time budget of one second, the proposed approach yields more than twice the wealth yielded by Cover’s algorithm.
6.2 Quantum State Tomography
Quantum state tomography (QST) is the task of estimating the state of qubits (quantum bits), given measurement outcomes. Numerically, QST corresponds to solving a convex optimization problem specified in Example 1. Recall that in the introduction, we showed that the corresponding objective function, , does not satisfy the bounded gradient condition and is not smooth relative to the von Neumann entropy, while mirror descent with the Burg entropy has a high per-iteration computational complexity.
Proposition 4
The function is locally smooth relative to the von Neumann entropy on the constraint set .
Therefore, the (matrix) exponentiated gradient method—mirror descent with the von Neumann entropy—with Armijo line search is guaranteed to converge, by Theorem 1. The corresponding iteration rule has a closed-form expression [11, 40]:
for every and , where is a positive real normalizing the trace of . The functions and denote matrix exponential and logarithm, respectively.
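Since all iterates are Hermitian and positive definite, the matrix exponential and logarithm can be computed by eigendecomposition. The following sketch (ours; the measurement matrices and step size in the usage are placeholders, and the line search is omitted) implements one matrix exponentiated gradient update together with the QST objective.

```python
import numpy as np

def matrix_log(X):
    """Logarithm of a Hermitian positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.conj().T

def matrix_exp(X):
    """Exponential of a Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.exp(w)) @ V.conj().T

def matrix_eg_update(rho, grad, alpha):
    """Matrix exponentiated gradient step, normalized back to unit trace."""
    R = matrix_exp(matrix_log(rho) - alpha * grad)
    return R / np.trace(R).real

def qst_objective(rho, As):
    """QST negative log-likelihood: -sum_i log tr(A_i rho)."""
    return -sum(np.log(np.trace(A @ rho).real) for A in As)

def qst_gradient(rho, As):
    return -sum(A / np.trace(A @ rho).real for A in As)
```

Starting from the maximally mixed state, one update with a small step size keeps the iterate Hermitian with unit trace and decreases the objective.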
We test the empirical performance of the exponentiated gradient method with Armijo line search on real experimental data generated following the setting in [19]. We compare it with the performances of the diluted algorithm [42], SCOPT [39], and the modified Frank-Wolfe algorithm studied in [35]. We also consider the algorithm [23]; it does not always converge [42], but is typically much faster than the diluted algorithm in practice.
We compare the convergence speeds for the qubit () and qubit () cases in Figures 2 and 3, respectively. The corresponding “sample sizes” (numbers of summands in ) are and , respectively. The numerical experiments were done in MATLAB R2015b, on a MacBook Pro with an Intel Core i7 2.8GHz processor and 16GB DDR3 memory. We set , and in Algorithm 1 for both cases. In both figures, denotes the minimum value of found by the five algorithms in 120 iterations.
One can observe that the exponentiated gradient method with Armijo line search is the fastest in terms of actual elapsed time. The slowness of the other algorithms can be explained as follows.

The diluted algorithm, using the notation of this paper, iterates as
where normalizes the trace of . To guarantee convergence, the step sizes are computed by exact line search. The exact line search procedure renders the algorithm slow.

SCOPT is a projected gradient method for minimizing self-concordant functions [33, 34]. Notice that projection onto typically results in a low-rank output; hence, it is possible that for some low-rank and iterate , but then is not a feasible solution because is not defined (in a standard setup of quantum state tomography, the matrices are rank-one [19]). This is called the stalling problem in [27]. Luckily, self-concordance of ensures that if an iterate lies in , and the next iterate lies in a small enough Dikin ellipsoid centered at , then also lies in . It is easily checked that is a self-concordant function of parameter . Following the theory in [33, 34], the radius of the Dikin ellipsoid shrinks at the rate , so SCOPT becomes slow when is large.

The Frank-Wolfe algorithm suffers from a sublinear convergence rate when the solution is near an extreme point of the constraint set (see, e.g., [29] for an illustration in the vector case). Notice that the set of extreme points of is the set of rank-one positive semidefinite matrices of unit trace. In the experimental data we have, the density matrix to be estimated is indeed close to a rank-one matrix (called a pure state in quantum mechanics). Therefore, the ML estimate—the minimizer of on —is expected to also be close to a rank-one matrix.
Notice that the empirical convergence rate of the exponentiated gradient method with Armijo line search is linear.
Acknowledgements
We thank David Gross for valuable discussions, and Ya-Ping Hsieh for checking previous versions of this paper. YHL and VC were supported by SNF 200021-146750 and ERC project time-data 725594. CAR was supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation, DFG (SPP 1798 CoSIP), and the Templeton Foundation.
Appendix A Proof of Proposition 1
Consider the two-dimensional case, where . Define and . Suppose that there are only two summands, with and . Then we have . It suffices to disprove all properties on the set of diagonal density matrices. Hence, we will focus on the function , defined for any in the probability simplex .
As either or can be arbitrarily close to zero, it is easily checked that the gradient of is unbounded. Now we check the relative smoothness condition. As we only consider diagonal matrices, it suffices to check with respect to the (negative) Shannon entropy:
for which the convention is adopted.
Lemma 5 ([31])
The function is smooth relative to the Shannon entropy for some , if and only if is convex.
Therefore, we check the positive semidefiniteness of the Hessian of . A necessary condition for the Hessian to be positive semidefinite is that
for all , but the inequality cannot hold for , for any fixed .
Appendix B Proof of Lemma 1
(Statement 2 Statement 1) Let , and . Define, for every , . We write
where we have applied the Cauchy-Schwarz inequality for the first inequality, and the local smoothness condition for the second inequality. Note that is the intersection of convex sets, and hence is convex; therefore, for every .
(Statement 1 Statement 2) Let , and . Define . Then is locally Lipschitz on ; moreover, since , the point is a global minimizer of . Therefore, we obtain
that is,
Similarly, we get
Summing up the two inequalities, we obtain
This implies, by the Cauchy-Schwarz inequality,
Appendix C Auxiliary Technical Lemmas for Proving Theorem 1
Proof
That is a solution to (P) is equivalent to the optimality condition
We can equivalently write
which is the optimality condition of
Lemma 7
For every and , it holds that
Proof
By definition, we have
Corollary 2
The sequence is nonincreasing.
Proof
Appendix D Proof of Theorem 2
For every , we write
The optimality condition for implies
Applying the three-point identity [13], we obtain
Then we can write
Summing up the inequality for all , we get
where . Corollary 2 says that the sequence is nonincreasing; then we have
Therefore, we obtain
Note that by the Armijo rule, we have
Therefore, , which are nonnegative by Lemma 4, must converge to zero. Theorem 2 then follows from the following lemma.
Lemma 8 ([37])
Let be a sequence of real numbers, and be a sequence of positive real numbers. Define for every , where . If and , then .
Appendix E Proof of Proposition 4
Note that is open, and hence is relatively open in . It is easily checked that is twice continuously differentiable on , and hence on . By Proposition 2, the function is locally smooth relative to , where denotes the Frobenius norm. By the quantum version of Pinsker’s inequality [21], the von Neumann entropy is strongly convex on with respect to the trace norm. As all norms on a finite-dimensional space are equivalent, the proposition follows.
References
 [1] Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16, 1 (1966), 1–3.
 [2] Auslender, A., and Teboulle, M. Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16, 3 (2006), 697–725.
 [3] Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42, 2 (2017), 330–348.
 [4] Bauschke, H. H., Borwein, J. M., and Combettes, P. L. Essential smoothness, essential strict convexity, and Legendre functions in Banach spaces. Commun. Contemp. Math. 3, 4 (2001), 615–647.
 [5] Bauschke, H. H., and Combettes, P. L. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, NY, 2011.
 [6] Beck, A., and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31 (2003), 167–175.
 [7] Bertero, M., Boccacci, P., Desiderà, G., and Vicidomini, G. Image deblurring with Poisson data: from cells to galaxies. Inverse Probl. 25 (2009).
 [8] Bertsekas, D. P. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Contr. AC-21, 2 (1976), 174–184.
 [9] Bonettini, S., Loris, I., Porta, F., and Prato, M. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM J. Optim. 26, 2 (2016), 891–921.
 [10] Bonettini, S., Zanella, R., and Zanni, L. A scaled gradient projection method for constrained image deblurring. Inverse Probl. 25 (2009).
 [11] Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8, 3–4 (2015), 231–358.
 [12] Byrne, C., and Censor, Y. Proximity function minimization using multiple Bregman projections, with application to split feasibility and Kullback-Leibler distance minimization. Ann. Oper. Res. 105 (2001), 77–98.
 [13] Chen, G., and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3, 3 (Aug. 1993), 538–543.
 [14] Cover, T. M. An algorithm for maximizing expected log investment return. IEEE Trans. Inf. Theory IT30, 2 (1984), 369–373.
 [15] Cover, T. M. Universal portfolios. Math. Finance 1, 1 (1991), 1–29.
 [16] Cover, T. M., and Ordentlich, E. Universal portfolios with side information. IEEE Trans. Inf. Theory 42, 2 (1996), 348–363.
 [17] Csiszár, I., and Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, second ed. Cambridge Univ. Press, Cambridge, UK, 2011.
 [18] Csiszár, I., and Tusnády, G. Information geometry and alternating minimization procedures. Stat. Decis., Supplement 1 (1984), 205–237.
 [19] Häffner, H., Hänsel, W., Roos, C. F., Benhelm, J., Checkalkar, D., Chwalla, M., Körber, T., Rapol, U. D., Riebe, M., Schmidt, P. O., Becher, C., Gühne, O., Dür, W., and Blatt, R. Scalable multiparticle entanglement of trapped ions. Nature 438 (2005), 643–646.
 [20] Hazan, E., and Kale, S. An online portfolio selection algorithm with regret logarithmic in price variation. Math. Finance 25, 2 (2015), 288–310.
 [21] Hiai, F., Ohya, M., and Tsukada, M. Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96, 1 (1981), 99–109.
 [22] Hiriart-Urruty, J.-B., Strodiot, J.-J., and Nguyen, V. H. Generalized Hessian matrix and second-order optimality conditions for problems with data. Appl. Math. Optim. 11 (1984), 43–56.
 [23] Hradil, Z. Quantum-state estimation. Phys. Rev. A 55, 3 (1997).
 [24] Ioffe, A., and Milosz, T. On a characterization of functions. Cybern. Syst. Anal. 38, 3 (2002), 313–322.
 [25] Juditsky, A., and Nemirovski, A. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In Optimization for Machine Learning, S. Sra, S. Nowozin, and S. J. Wright, Eds. MIT Press, Cambridge, MA, 2012, ch. 5.
 [26] Kivinen, J., and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132 (1997), 1–63.
 [27] Knee, G. C., Bolduc, E., Leach, J., and Gauger, E. M. Maximum-likelihood quantum process tomography via projected gradient descent. arXiv:1803.10062v1.
 [28] Kulis, B., Sustik, M. A., and Dhillon, I. S. Low-rank kernel learning with Bregman matrix divergences. J. Mach. Learn. Res. 10 (2009), 341–376.
 [29] Lacoste-Julien, S., and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Adv. Neural Information Processing Systems 28 (2015).
 [30] Li, B., Sahoo, D., and Hoi, S. C. H. OLPS: A toolbox for online portfolio selection. J. Mach. Learn. Res. 17 (2016), 1–5.
 [31] Lu, H., Freund, R. M., and Nesterov, Y. Relatively smooth convex optimization by firstorder methods, and applications. SIAM J. Optim. 28, 1 (2018), 333–354.
 [32] Nemirovsky, A. S., and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, Chichester, 1983.
 [33] Nesterov, Y. Introductory Lectures on Convex Optimization. Kluwer, Boston, MA, 2004.
 [34] Nesterov, Y., and Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, PA, 1994.
 [35] Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., El Halabi, M., Tran-Dinh, Q., and Cevher, V. Frank-Wolfe works for non-Lipschitz continuous gradient objectives: Scalable Poisson phase retrieval. In IEEE Int. Conf. Acoustics, Speech and Signal Processing (2016), pp. 6230–6234.
 [36] Paris, M., and Řeháček, J., Eds. Quantum State Estimation. Springer, Berlin, 2004.
 [37] Polyak, B. T. Introduction to Optimization. Optimization Softw., Inc., New York, NY, 1987.
 [38] Rockafellar, R. T., and Wets, R. J. Variational Analysis. Springer, Berlin, 2009.
 [39] Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. Composite self-concordant minimization. J. Mach. Learn. Res. 16 (2015), 371–416.
 [40] Tsuda, K., Rätsch, G., and Warmuth, M. K. Matrix exponentiated gradient updates for online learning and Bregman projection. J. Mach. Learn. Res. 6 (2005), 995–1018.
 [41] Vardi, Y., Shepp, L. A., and Kaufman, L. A statistical model for positron emission tomography. J. Am. Stat. Assoc. 80, 389 (1985), 8–20.
 [42] Řeháček, J., Hradil, Z., Knill, E., and Lvovsky, A. I. Diluted maximumlikelihood algorithm for quantum tomography. Phys. Rev. A 75 (2007).