A near-optimal algorithm for approximating the John Ellipsoid

We develop a simple and efficient algorithm for approximating the John Ellipsoid of a symmetric polytope. Our algorithm is near-optimal in the sense that its time complexity matches that of the current best verification algorithm. We also provide MATLAB code for further research.

## 1 Introduction

Let $P \subseteq \mathbb{R}^n$ be a polytope with nonzero, finite Euclidean volume. The classical theorem of [Joh48] states that if $E$ is the ellipsoid of maximal volume contained in $P$, then $E \subseteq P \subseteq n \cdot E$, where $n \cdot E$ represents a dilation of the ellipsoid $E$ by a factor of $n$ about its center. Moreover, if $P$ is symmetric, then $E \subseteq P \subseteq \sqrt{n} \cdot E$. The maximal volume inscribed ellipsoid (MVIE) is called the John Ellipsoid, and we are interested in the problem of approximating it when the polytope $P$ is centrally symmetric, i.e. $P$ can be expressed as $P = \{x \in \mathbb{R}^n : \|Ax\|_\infty \le 1\}$, where $A \in \mathbb{R}^{m \times n}$ has rank $n$.

The problem of computing the ellipsoid of maximal volume inside a polytope given by a set of inequalities has a wealth of different applications, including sampling and integration [Vem05, CDWY18], linear bandits [BCBK12, HK16], linear programming [LS14], cutting plane methods [KTE88] and differential privacy [NTZ13].

Computing the John Ellipsoid additionally has applications in the field of experimental design, a classical problem in statistics [Atw69]. Specifically, the D-optimal design problem wants to maximize the determinant of the Fisher information matrix [KW60, Atw69], which turns out to be equivalent to finding the John Ellipsoid of a symmetric polytope. While this equivalence is known, e.g. [Tod16], we include it in Section 2 for completeness. The problem of D-optimal design has received recent attention in the machine learning community, e.g. [AZLSW17, WYS17, LFN18].

### 1.1 Our Contribution

Our main contribution is an approximation algorithm for computing the John Ellipsoid of a centrally symmetric polytope given by a set of inequalities. Previously, for solving the MVIE problem or its dual, the D-optimal design problem, researchers have developed various algorithms, such as first-order methods [Kha96, KY05, DAST08] and second-order interior-point methods [NN94, SF04]. Instead of using traditional optimization methods, we apply a very simple fixed point iteration. The analysis is also simple and clean, yet the convergence rate is very fast. We state our main result as follows.

###### Theorem 1.1 (Informal).

Given $A \in \mathbb{R}^{m \times n}$ of rank $n$, let $P$ be the centrally symmetric polytope defined as $P = \{x \in \mathbb{R}^n : \|Ax\|_\infty \le 1\}$. For $\epsilon \in (0,1)$, there is an algorithm (Algorithm 1) that runs in time $O(\epsilon^{-1} m n^2 \log(m/n))$, returning an ellipsoid $Q$ so that $\frac{1}{\sqrt{1+\epsilon}} \cdot Q \subseteq P \subseteq \sqrt{n} \cdot Q$.

In Lemma 2.3, we show that our ellipsoid is $(1+\epsilon)$-close to the John Ellipsoid in a certain sense. However, if we want a $(1+\epsilon)$-approximation to the maximal volume, we shall replace $\epsilon$ by $\epsilon/n$ (for details, see Lemma 2.3); then Algorithm 1 runs in time $O(\epsilon^{-1} m n^3 \log(m/n))$, and when $\epsilon$ is constant, this is comparable with the best known results [KY05, TY07].

Furthermore, we use sketching ideas from randomized linear algebra to speed up the algorithm so that the running time no longer depends on the dense $mn^2$ cost explicitly. This is most significant when $A$ is a sparse matrix. Our result is stated as follows.

###### Theorem 1.2 (Informal).

Given $A \in \mathbb{R}^{m \times n}$ of rank $n$, let $P$ be the centrally symmetric polytope defined as $P = \{x \in \mathbb{R}^n : \|Ax\|_\infty \le 1\}$. For $\epsilon \in (0,1)$ and $\delta \in (0,1)$, there is an algorithm (Algorithm 2) that runs within $O(\epsilon^{-1}\log(m/n))$ iterations, returning an ellipsoid $Q$ so that with probability at least $1-\delta$, $\frac{1}{\sqrt{1+\epsilon}} \cdot Q \subseteq P \subseteq \sqrt{n} \cdot Q$. Moreover, each iteration involves solving a small number of linear systems of the form $A^\top D A x = b$, where $D$ is some diagonal matrix.

Algorithm 2 is near-optimal, because in order to verify the correctness of the result, we need to compute the leverage scores of some weighted version of $A$, and the best known algorithms for approximating leverage scores must themselves solve multiple linear systems [SS11, DMIMW12, CW13, NN13]. One key advantage of our algorithm is that it reduces the problem of computing the John Ellipsoid to a relatively small number of linear systems. Therefore, it allows the user to apply a linear system solver tailored to the given matrix $A$. For example, if $A$ is tall, one can apply sketching techniques to solve the linear systems in nearly linear time [Woo14]; if each row of $A$ has only two non-zeros, one can apply Laplacian solvers [DS08, KOSZ13, CKM14, KS16, KLP16]. In the code we provide, we use the Cholesky decomposition, which is very fast for many sparse matrices in practice.

### 1.2 Related Work

There is a long line of research on computing the maximal volume ellipsoid inside polytopes given by a list of linear inequalities. We note that [KT93] presented a linear time reduction from the problem of computing a minimum volume enclosing ellipsoid (MVEE) of a set of points to the maximal volume inscribed ellipsoid problem; therefore, algorithms for the MVEE also apply to approximating the John Ellipsoid.

Using an interior-point algorithm, [NN94] showed that a $(1+\epsilon)$-approximation of the MVEE can be computed in polynomial time. [KT93] subsequently improved the runtime, and later on, [Nem99] and [Ans02] independently obtained further improvements. To the best of the authors' knowledge, the current best algorithms are those of [KY05, TY07]. We refer readers to [Tod16] for a comprehensive introduction and overview.

Computing the minimum volume enclosing ellipsoid of a set of points is the dual problem of D-optimal design. By generalizing the smoothness condition for first-order methods, [LFN18] managed to solve the D-optimal design problem within a bounded number of iterations. However, in the dense case, their per-iteration cost leads to a larger overall running time compared with [KY05, TY07]. [GP18] applied the Bregman proximal method to the D-optimal design problem and observed an accelerated convergence rate in their numerical experiments; however, they did not prove that their experimental parameter settings satisfy the assumptions of their algorithm. (We are grateful to Gutman and Peña for providing us their code for testing, of which D-optimal design was only one application.)

A natural variant of the D-optimal design problem is to require an integral solution. The integral variant is known to be NP-hard [ČH12], although approximation algorithms have recently been developed [AZLSW17, SX18]. In our context, this means the weight vector $w$ is the integral optimal solution to (2), where the sum of the weights is some specified integral parameter.

Several Markov chains for sampling convex bodies have well-understood performance guarantees based upon the roundedness of the convex body. If $B_2 \subseteq K \subseteq R \cdot B_2$, then the mixing times of hit-and-run and the ball walk are both $O(n^2 R^2)$ steps [LV06, KLS97]. Thus, placing a convex body in John position guarantees the walks mix in $O(n^4)$ steps, and $O(n^3)$ steps if the body is symmetric; this transformation is used in practice when the convex body is a polytope [HCT17]. The John Ellipsoid with a fixed center point has also been employed as a proposal distribution for a Markov chain [CDWY18, GN18].

We build our even faster algorithm via sketching techniques. Sketching has been successfully applied to speed up many different problems, such as linear programs [LSZ19], clustering [CEM15, SYZ18], low rank approximation [CW13, NN13, BW14, CW15a, RSW16, SWZ17], linear regression [CW13, NN13, CLM15, CW15b, PSW17, ALS18], total least squares regression [DSWY19], tensor regression [LHW17, DSSW18] and tensor decomposition [WTSA15, SWZ16, SWZ19]. Readers may refer to [Woo14] for a comprehensive survey on sketching techniques. We use sketching techniques to speed up the computation of leverage scores; this idea was first used in [SS11].

Previous research on the MVEE problem did not take advantage of the sparsity of the input matrix $A$, and to the best of our knowledge, our algorithm is the first one able to handle large sparse inputs. It would be interesting to see whether sketching techniques can further speed up existing algorithms.

#### 1.2.1 Relation with [CP15]

We shall mention that our work is greatly inspired by [CP15]. The $\ell_p$ Lewis weights of a matrix $A \in \mathbb{R}^{m \times n}$ are defined as the unique vector $\bar{w} \in \mathbb{R}^m_{\ge 0}$ so that for $i \in [m]$,

$$a_i^\top \Big(A^\top \operatorname{diag}(\bar{w})^{1-2/p} A\Big)^{-1} a_i = \bar{w}_i^{2/p}.$$

It is known that computing the $\ell_\infty$ Lewis weights is equivalent to computing the maximal volume inscribed ellipsoid. [CP15] propose algorithms for approximating the Lewis weights for all $p$. Their algorithm is an iterative algorithm that is very similar to our Algorithm 1, and its convergence is proved by arguing that the iteration mapping is contractive. The main difference is that [CP15] output the weights from the last round, while our Algorithm 1 averages over all rounds and outputs the averaged weights, which allows us to conduct a convexity analysis and deal with the $p = \infty$ case.
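For $p < 4$, where the [CP15] iteration mapping is contractive, the fixed point above admits a particularly simple implementation. The following is a minimal NumPy sketch under that assumption (the function name and defaults are ours for illustration, not from the paper's code):

```python
import numpy as np

def lewis_weights(A, p=2.0, iters=100):
    """Sketch of a Cohen-Peng style fixed-point iteration for the ell_p
    Lewis weights of A (the map is contractive for p < 4):
        w_i <- (a_i^T (A^T diag(w)^{1-2/p} A)^{-1} a_i)^{p/2}.
    """
    m, n = A.shape
    w = np.full(m, n / m)                      # any positive start works
    for _ in range(iters):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)
        scores = np.einsum('ij,ij->i', A @ np.linalg.inv(M), A)
        w = scores ** (p / 2.0)
    return w
```

For $p = 2$ the exponent $1-2/p$ vanishes and the weights are exactly the leverage scores of $A$, which sum to $n$.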

## 2 Problem Formulation

In this section, we formally define the problem of computing the John Ellipsoid of a symmetric polytope. Let $P = \{x \in \mathbb{R}^n : |\langle a_i, x\rangle| \le 1, \forall i \in [m]\}$ be a symmetric convex polytope, where $[m]$ denotes the set $\{1, 2, \ldots, m\}$. We assume $A = [a_1, \ldots, a_m]^\top \in \mathbb{R}^{m \times n}$ has full rank $n$. By symmetry, we know that the maximal volume ellipsoid inside the polytope should be centered at the origin. Any ellipsoid $E$ centered at the origin can be expressed as $E = \{Gy : \|y\|_2 \le 1\}$, where $G$ is a positive definite matrix. Note that the volume of $E$ is proportional to $\det(G)$, and an ellipsoid $E$ is contained in polytope $P$ if and only if $\max_{x \in E} |a_i^\top x| \le 1$ for $i \in [m]$. For any $x \in E$, we can write $x = Gy$ where $\|y\|_2 \le 1$. Hence

$$\max_{x \in E} |a_i^\top x| = \max_{\|y\|_2 \le 1} |a_i^\top G y| = \max_{\|y\|_2 \le 1} |\langle G a_i, y\rangle| = \|G a_i\|_2.$$

Therefore, we can compute the John Ellipsoid of $P$ by solving the following optimization program:

$$\begin{aligned} \text{Maximize} \quad & \log\det G^2, \qquad (1)\\ \text{subject to:} \quad & G \succeq 0, \\ & \|G a_i\|_2 \le 1, \quad \forall i \in [m]. \end{aligned}$$

It turns out that the optimal ellipsoid satisfies $G^2 = \big(\sum_{i=1}^m w_i a_i a_i^\top\big)^{-1}$, where $w$ is the optimal solution of the program

$$\begin{aligned} \text{Minimize} \quad & \sum_{i=1}^m w_i - \log\det\Big(\sum_{i=1}^m w_i a_i a_i^\top\Big) - n, \qquad (2)\\ \text{subject to:} \quad & w_i \ge 0, \quad \forall i \in [m]. \end{aligned}$$

Actually, program (2) is the Lagrange dual of program (1). Moreover, we have the following optimality criteria for $w$ in the above program.

###### Lemma 2.1 (Optimality criteria, Proposition 2.5 in [Tod16]).

A weight $w \in \mathbb{R}^m_{\ge 0}$ is optimal for program (2) if and only if

$$\sum_{i=1}^m w_i = n; \qquad a_i^\top\Big(\sum_{j=1}^m w_j a_j a_j^\top\Big)^{-1} a_i = 1, \ \text{ if } w_i \neq 0; \qquad a_i^\top\Big(\sum_{j=1}^m w_j a_j a_j^\top\Big)^{-1} a_i < 1, \ \text{ if } w_i = 0.$$

Computing the John Ellipsoid is closely related to the D-optimal design problem [Atw69, BV04, Tod16, GP18]. For the D-optimal design problem, we are given input $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ where $m \ge n$, and we want to solve the program

$$\begin{aligned} \text{Maximize} \quad & \log\det\big(X \operatorname{diag}(v) X^\top\big), \qquad (3)\\ \text{subject to:} \quad & v_i \ge 0, \quad \forall i \in [m], \\ & \sum_{i=1}^m v_i = 1. \end{aligned}$$

We emphasize that program (3) and program (2) are equivalent in the following sense. By Lemma 2.1, we can rewrite program (2) as maximizing $\log\det\big(\sum_{i=1}^m w_i a_i a_i^\top\big)$, subject to $w_i \ge 0$ for $i \in [m]$ and $\sum_{i=1}^m w_i = n$. By setting $v = w/n$, we obtain program (3). Thus, optimal solutions to programs (2) and (3) are equivalent up to a multiplicative factor $n$.

We can also talk about an approximate John Ellipsoid.

###### Definition 2.2.

For $\epsilon \ge 0$, we say $w \in \mathbb{R}^m_{\ge 0}$ is a $(1+\epsilon)$-approximation of program (2) if $w$ satisfies

$$\sum_{i=1}^m w_i = n, \qquad a_i^\top\Big(\sum_{j=1}^m w_j a_j a_j^\top\Big)^{-1} a_i \le 1+\epsilon, \quad \forall i \in [m].$$
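Definition 2.2 is straightforward to verify numerically: check that the weights sum to $n$ and that every $\sigma_i$ is at most $1+\epsilon$. A small, hypothetical NumPy helper (ours for illustration, not from the paper's code) might look like:

```python
import numpy as np

def is_one_plus_eps_approx(A, w, eps, tol=1e-9):
    """Check the two conditions of Definition 2.2 numerically:
    sum_i w_i = n, and a_i^T (A^T diag(w) A)^{-1} a_i <= 1 + eps for all i."""
    m, n = A.shape
    if not np.isclose(w.sum(), n):
        return False
    M = A.T @ (w[:, None] * A)                 # A^T diag(w) A
    sigma = np.einsum('ij,ij->i', A @ np.linalg.inv(M), A)
    return bool(np.all(sigma <= 1.0 + eps + tol))
```

For example, for $A = I_3$ the uniform weights $w = (1,1,1)$ satisfy both conditions exactly, while $w = (2, 1/2, 1/2)$ sums to $n$ but violates the second condition.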

Lemma 2.3 gives a geometric interpretation of the approximation factor in Definition 2.2. Recall that the exact John Ellipsoid $E^*$ of $P$ satisfies $E^* \subseteq P \subseteq \sqrt{n} \cdot E^*$.

###### Lemma 2.3 ((1+ϵ)-approximation is good rounding).

Let $w$ be a $(1+\epsilon)$-approximation of (2). Define $Q$ as $Q = \{x \in \mathbb{R}^n : x^\top A^\top \operatorname{diag}(w) A\, x \le 1\}$. Then

$$\frac{1}{\sqrt{1+\epsilon}} \cdot Q \subseteq P \subseteq \sqrt{n} \cdot Q.$$

Moreover, $\operatorname{vol}\big(\tfrac{1}{\sqrt{1+\epsilon}} \cdot Q\big) \ge e^{-n\epsilon/2} \cdot \operatorname{vol}(E^*)$, where $E^*$ is the exact John Ellipsoid of $P$.

###### Proof.

Let $G = (A^\top \operatorname{diag}(w) A)^{-1/2}$, so that $Q = \{x : \|G^{-1}x\|_2 \le 1\}$, and suppose $x \in \frac{1}{\sqrt{1+\epsilon}} \cdot Q$. Then, we have that $\|G^{-1}x\|_2 \le \frac{1}{\sqrt{1+\epsilon}}$. So,

$$|Ax|_i = \langle a_i, x\rangle = \langle G a_i, G^{-1} x\rangle \le \|G a_i\|_2 \, \|G^{-1}x\|_2 \le \frac{\|G a_i\|_2}{\sqrt{1+\epsilon}}.$$

Since $\|G a_i\|_2^2 = a_i^\top (A^\top \operatorname{diag}(w) A)^{-1} a_i \le 1+\epsilon$, then $|Ax|_i \le 1$ and $x \in P$.

On the other hand, for $x \in P$, we have that $|Ax|_i \le 1$ for all $i \in [m]$. Hence

$$x^\top G^{-2} x = x^\top A^\top \operatorname{diag}(w) A\, x = \sum_{i=1}^m w_i |Ax|_i^2 \le \sum_{i=1}^m w_i = n.$$

So $x \in \sqrt{n} \cdot Q$.

Finally, since $\frac{1}{\sqrt{1+\epsilon}} \cdot Q$ is contained in $P$, it is a feasible solution to program (1). Moreover, $w$ is a feasible solution to program (2). So by duality of programs (1) and (2), we have that the duality gap is at most

$$\Big(n - \log\det\Big(\sum_{i=1}^m w_i a_i a_i^\top\Big) - n\Big) - \log\det\Big((1+\epsilon)\sum_{i=1}^m w_i a_i a_i^\top\Big)^{-1} = n\log(1+\epsilon) \le n\epsilon.$$

Let the matrix representation of $\frac{1}{\sqrt{1+\epsilon}} \cdot Q$ be $G'$, i.e. $\frac{1}{\sqrt{1+\epsilon}} \cdot Q = \{x : \|(G')^{-1}x\|_2 \le 1\}$; then by the optimality of $E^* = \{x : \|G_*^{-1}x\|_2 \le 1\}$ and the duality gap, we have

$$\log\det\big((G')^{2}\big) \ge \log\det\big(G_*^{2}\big) - n\epsilon.$$

Since $\operatorname{vol}(\{x : \|G^{-1}x\|_2 \le 1\})$ is proportional to $\det(G)$, we conclude that $\operatorname{vol}\big(\tfrac{1}{\sqrt{1+\epsilon}} \cdot Q\big) \ge e^{-n\epsilon/2}\operatorname{vol}(E^*)$. ∎

## 3 Main Algorithm

In this section, we present Algorithm 1 for approximating program (2) and analyze its performance.

Let $\sigma : \mathbb{R}^m_{>0} \to \mathbb{R}^m$ be the function defined as $\sigma(v) = (\sigma_1(v), \ldots, \sigma_m(v))$, where for $i \in [m]$,

$$\sigma_i(v) = a_i^\top\Big(\sum_{j=1}^m v_j a_j a_j^\top\Big)^{-1} a_i = a_i^\top\big(A^\top \operatorname{diag}(v) A\big)^{-1} a_i. \qquad (4)$$

Let $w^*$ be the optimal solution to program (2). By Lemma 2.1, $w^*$ satisfies $\sigma_i(w^*) = 1$ whenever $w^*_i \neq 0$, or equivalently

$$w^*_i = w^*_i \cdot \sigma_i(w^*). \qquad (5)$$

Inspired by (5), we use the fixed point iteration $w^{(k+1)}_i = w^{(k)}_i \cdot \sigma_i(w^{(k)})$ for $k \in [T]$ and $i \in [m]$. The function $\sigma$ has very nice properties. Actually, by setting $B = \operatorname{diag}(v)^{1/2} A$, we can rewrite $v_i \sigma_i(v)$ as $b_i^\top (B^\top B)^{-1} b_i$, hence $v_i \sigma_i(v)$ is actually the leverage score of the $i$-th row of the matrix $B = \operatorname{diag}(v)^{1/2} A$ [CLM15]. From the well-known properties of leverage scores, we have

###### Lemma 3.1 (Properties of leverage scores, e.g. see Section 3.3 of [CLM15]).

For $v \in \mathbb{R}^m_{>0}$ and $i \in [m]$, we have $0 \le v_i \sigma_i(v) \le 1$. Moreover, $\sum_{i=1}^m v_i \sigma_i(v) = n$.
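Lemma 3.1 can be checked numerically by comparing $v_i\sigma_i(v)$ against the leverage scores of $B = \operatorname{diag}(v)^{1/2}A$ computed via a QR factorization. The sketch below uses synthetic data of our own choosing:

```python
import numpy as np

# Synthetic illustration of Lemma 3.1: v_i * sigma_i(v) is the leverage
# score of row i of B = diag(v)^{1/2} A, so it lies in [0, 1] and the
# products sum to n.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
v = rng.uniform(0.5, 2.0, size=50)

M = A.T @ (v[:, None] * A)                     # A^T diag(v) A
sigma = np.einsum('ij,ij->i', A @ np.linalg.inv(M), A)

B = np.sqrt(v)[:, None] * A                    # B = diag(v)^{1/2} A
Q, _ = np.linalg.qr(B)                         # leverage scores = squared row norms of Q
leverage = (Q ** 2).sum(axis=1)
```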

In order to show that Algorithm 1 provides a good approximation of the John Ellipsoid, in the sense of Definition 2.2, we need to argue that for the output $w$ of Algorithm 1, $\sigma_i(w) \le 1+\epsilon$ for all $i \in [m]$. Our main result is the following theorem.

###### Theorem 3.2 (Main Result).

Let $w$ be the output of Algorithm 1. For all $\epsilon \in (0,1)$, when $T \ge \frac{2}{\epsilon}\log\frac{m}{n}$, we have for all $i \in [m]$,

$$\sigma_i(w) \le 1+\epsilon.$$

Moreover,

$$\sum_{i=1}^m w_i = n.$$

Therefore, Algorithm 1 provides a $(1+\epsilon)$-approximation to program (2).

We now analyze the running time of Algorithm 1.

###### Theorem 3.3 (Performance of Algorithm 1).

For all $\epsilon \in (0,1)$, we can find a $(1+\epsilon)$-approximation of the John Ellipsoid inside a symmetric convex polytope in time $O(\epsilon^{-1} m n^2 \log(m/n))$.

###### Proof.

The main loop is executed $T = O(\epsilon^{-1}\log(m/n))$ times, and inside each loop, we can first use $O(\operatorname{nnz}(A))$ time to compute $B^{(k)} = \operatorname{diag}(w^{(k)})^{1/2} A$, then compute $(B^{(k)})^\top B^{(k)}$ in $O(mn^2)$ time. To see why we introduce $B^{(k)}$, observe that $(B^{(k)})^\top B^{(k)} = A^\top \operatorname{diag}(w^{(k)}) A$. Now we can compute the Cholesky decomposition $(B^{(k)})^\top B^{(k)} = LL^\top$ in $O(n^3)$ time, and use the Cholesky factor to compute $\|L^{-1} a_i\|_2^2$ in $O(n^2)$ time for each $i \in [m]$. Finally, we can compute $w^{(k+1)}_i = w^{(k)}_i \cdot \|L^{-1} a_i\|_2^2$ in $O(1)$ time. This is valid since $\sigma_i(w^{(k)}) = a_i^\top (LL^\top)^{-1} a_i = \|L^{-1} a_i\|_2^2$. To summarize, each iteration takes $O(mn^2)$ time, hence the overall running time is as stated. ∎
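The iteration and the Cholesky trick in the proof above can be sketched in a few lines of NumPy. This is a minimal illustration of Algorithm 1 under our own naming, not the authors' MATLAB code:

```python
import numpy as np

def approx_john_weights(A, eps):
    """Sketch of Algorithm 1: initialize w_i = n/m, repeat
    w_i <- w_i * sigma_i(w) for T = ceil((2/eps) * log(m/n)) rounds,
    and output the average of the iterates w^{(1)}, ..., w^{(T)}."""
    m, n = A.shape
    T = int(np.ceil(2.0 / eps * np.log(m / n)))
    w = np.full(m, n / m)
    avg = np.zeros(m)
    for _ in range(T):
        avg += w / T
        M = A.T @ (w[:, None] * A)             # A^T diag(w) A = L L^T
        L = np.linalg.cholesky(M)
        X = np.linalg.solve(L, A.T)            # column i is L^{-1} a_i
        w = w * (X ** 2).sum(axis=0)           # sigma_i(w) = ||L^{-1} a_i||^2
    return avg
```

By Theorem 3.2, the returned weights sum to $n$ and satisfy $\sigma_i(w) \le 1+\epsilon$.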

Now we turn to proving Theorem 3.2. The proof of Theorem 3.2 relies on the following important observation, whose proof can be found in Appendix B.

###### Lemma 3.4 (Convexity).

For $i \in [m]$, let $\phi_i : \mathbb{R}^m_{>0} \to \mathbb{R}$ be the function defined as

$$\phi_i(v) = \log \sigma_i(v) = \log\Big(a_i^\top\Big(\sum_{j=1}^m v_j a_j a_j^\top\Big)^{-1} a_i\Big).$$

Then $\phi_i$ is convex.

Now that $\phi_i$ is convex, we can apply Jensen's inequality to get Lemma 3.5.

###### Lemma 3.5 (Telescoping).

Fix $T$ as the number of main loop iterations executed in Algorithm 1. Let $w = \frac{1}{T}\sum_{k=1}^T w^{(k)}$ be the output of Algorithm 1. Then for $i \in [m]$,

$$\phi_i(w) \le \frac{1}{T}\log\frac{m}{n}.$$
###### Proof.

Recall that $w = \frac{1}{T}\sum_{k=1}^T w^{(k)}$. By Lemma 3.4, $\phi_i$ is convex, and so

$$\begin{aligned} \phi_i(w) &= \phi_i\Big(\frac{1}{T}\sum_{k=1}^T w^{(k)}\Big) \\ &\le \frac{1}{T}\sum_{k=1}^T \phi_i(w^{(k)}) && \text{by Jensen's inequality} \\ &= \frac{1}{T}\sum_{k=1}^T \log \sigma_i(w^{(k)}) && \text{by definition of } \phi_i \\ &= \frac{1}{T}\sum_{k=1}^T \log\frac{w^{(k+1)}_i}{w^{(k)}_i} \\ &= \frac{1}{T}\log\frac{w^{(T+1)}_i}{w^{(1)}_i} \\ &\le \frac{1}{T}\log\frac{m}{n} && \text{by Lemma 3.1 and the initialization } w^{(1)}_i = n/m. \end{aligned}$$

∎

Now we are ready to prove Theorem 3.2.

###### Proof of Theorem 3.2.

Set $T = \frac{2}{\epsilon}\log\frac{m}{n}$. By Lemma 3.5, we have for $i \in [m]$,

$$\log\sigma_i(w) = \phi_i(w) \le \frac{1}{T}\log\frac{m}{n} = \frac{\epsilon}{2} \le \log(1+\epsilon),$$

where the last step uses the fact that $\log(1+\epsilon) \ge \epsilon/2$ when $\epsilon \in (0,1)$. This gives us $\sigma_i(w) \le 1+\epsilon$.

On the other hand, from Lemma 3.1 we have $\sum_{i=1}^m w^{(k)}_i = n$ for every $k \in [T]$. Hence

$$\sum_{i=1}^m w_i = \sum_{i=1}^m \frac{1}{T}\sum_{k=1}^T w^{(k)}_i = n. \qquad ∎$$

We shall mention that it is possible to further improve Algorithm 1 by applying sketching techniques from randomized linear algebra. Here we present the performance of our accelerated algorithm; the detailed analysis can be found in Appendix C.

###### Theorem 3.6 (Performance of Algorithm 2).

For all $\epsilon, \delta \in (0,1)$, we can find a $(1+\epsilon)$-approximation of the John Ellipsoid inside a symmetric convex polytope within $O(\epsilon^{-1}\log(m/n))$ iterations, with probability at least $1-\delta$. Moreover, each iteration involves solving a small number of linear systems of the form $A^\top D A x = b$ for some diagonal matrix $D$.

## Acknowledgment

We thank Zhao Song for his generous help on this paper.

## References

• [ALS18] Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with Orlicz norm. In International Conference on Machine Learning, pages 224–233. https://arxiv.org/pdf/1806.06430, 2018.
• [Ans02] Kurt M Anstreicher. Improved complexity for maximum volume inscribed ellipsoids. SIAM Journal on Optimization, 13(2):309–320, 2002.
• [Atw69] Corwin L Atwood. Optimal and efficient designs of experiments. The Annals of Mathematical Statistics, pages 1570–1602, 1969.
• [AZLSW17] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In International Conference on Machine Learning, pages 126–135. https://arxiv.org/pdf/1711.05174, 2017.
• [BCBK12] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 41–1. Microtome, 2012.
• [BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.
• [BW14] Christos Boutsidis and David P Woodruff. Optimal CUR matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353–362. ACM, https://arxiv.org/pdf/1405.7910, 2014.
• [CDWY18] Yuansi Chen, Raaz Dwivedi, Martin J Wainwright, and Bin Yu. Fast MCMC sampling algorithms on polytopes. The Journal of Machine Learning Research, 19(1):2146–2231, 2018.
• [CEM15] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172. ACM, 2015.
• [ČH12] Michal Černỳ and Milan Hladík. Two complexity results on C-optimality in experimental design. Computational Optimization and Applications, 51(3):1397–1408, 2012.
• [CKM14] Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu. Solving SDD linear systems in nearly $m\log^{1/2} n$ time. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, STOC '14, 2014.
• [CLM15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 181–190. ACM, 2015.
• [CP15] Michael B Cohen and Richard Peng. $\ell_p$ row sampling by Lewis weights. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 183–192. ACM, https://arxiv.org/pdf/1412.0588, 2015.
• [CW13] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 81–90. ACM, 2013.
• [CW15a] Kenneth L Clarkson and David P Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 310–329. IEEE, 2015.
• [CW15b] Kenneth L Clarkson and David P Woodruff. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 921–939. Society for Industrial and Applied Mathematics, 2015.
• [DAST08] S Damla Ahipasaoglu, Peng Sun, and Michael J Todd. Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1):5–19, 2008.
• [DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.
• [DS08] Samuel I Daitch and Daniel A Spielman. Faster approximate lossy generalized flow via interior point algorithms. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 451–460. ACM, 2008.
• [DSSW18] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics, pages 1299–1308, 2018.
• [DSWY19] Huaian Diao, Zhao Song, David Woodruff, and Xin Yang. Total least squares regression in input sparsity time. In Manuscript, 2019.
• [GN18] Adam Gustafson and Hariharan Narayanan. John’s walk. arXiv preprint arXiv:1803.02032, 2018.
• [GP18] David H Gutman and Javier F Peña. A unified framework for bregman proximal methods: subgradient, gradient, and accelerated gradient schemes. arXiv preprint arXiv:1812.10198, 2018.
• [GR14] Izrail Solomonovich Gradshteyn and Iosif Moiseevich Ryzhik. Table of integrals, series, and products. Academic press, 2014.
• [HCT17] Hulda S. Haraldsdottir, Ben Cousins, Ines Thiele, Ronan M. T. Fleming, and Santosh Vempala. CHRR: coordinate hit-and-run with rounding for uniform sampling of constraint-based models. Bioinformatics, 33(11), 1 2017.
• [HK16] Elad Hazan and Zohar Karnin. Volumetric spanners: an efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062–4095, 2016.
• [Jam13] GJO Jameson. Inequalities for gamma function ratios. The American Mathematical Monthly, 120(10):936–940, 2013.
• [Joh48] Fritz John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, January 8, 1948, pages 187–204. Interscience Publishers, Inc., New York, N. Y., 1948.
• [Kha96] Leonid G Khachiyan. Rounding of polytopes in the real number model of computation. Mathematics of Operations Research, 21(2):307–320, 1996.
• [KLP16] Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A Spielman. Sparsified cholesky and multigrid solvers for connection laplacians. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 842–850. ACM, 2016.
• [KLS97] Ravi Kannan, László Lovász, and Miklós Simonovits. Random walks and an $O^*(n^5)$ volume algorithm for convex bodies. Random Structures Algorithms, 11(1):1–50, 1997.
• [KOSZ13] Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 911–920. ACM, 2013.
• [KS16] Rasmus Kyng and Sushant Sachdeva. Approximate gaussian elimination for laplacians-fast, sparse, and simple. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 573–582. IEEE, 2016.
• [KT93] Leonid G Khachiyan and Michael J Todd. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Mathematical Programming, 61(1):137–159, 1993.
• [KTE88] L. Khachiyan, S. Tarasov, and I. Ehrlich. The method of inscribed ellipsoids. Soviet Math. Doklady, 1988.
• [KW60] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(363-366):234, 1960.
• [KY05] Piyush Kumar and E Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126(1):1–21, 2005.
• [LFN18] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
• [LHW17] Xingguo Li, Jarvis Haupt, and David Woodruff. Near optimal sketching of low-rank tensor regression. In Advances in Neural Information Processing Systems, pages 3466–3476. https://arxiv.org/pdf/1709.07093, 2017.
• [LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
• [LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in $\tilde{O}(\sqrt{\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 424–433. IEEE, 2014.
• [LSZ19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT. https://arxiv.org/pdf/1905.04447, 2019.
• [LV06] László Lovász and Santosh Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.
• [Nem99] Arkadi Nemirovski. On self-concordant convex–concave functions. Optimization Methods and Software, 11(1-4):303–384, 1999.
• [NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13. Siam, 1994.
• [NN13] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126. IEEE, 2013.
• [NTZ13] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 351–360. ACM, 2013.
• [PSW17] Eric Price, Zhao Song, and David P. Woodruff. Fast regression with an $\ell_\infty$ guarantee. In International Colloquium on Automata, Languages, and Programming (ICALP), 2017.
• [RSW16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 250–263. ACM, 2016.
• [SF04] Peng Sun and Robert M Freund. Computation of minimum-volume covering ellipsoids. Operations Research, 52(5):690–706, 2004.
• [SS11] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
• [SWZ16] Zhao Song, David P. Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS) 2016, December 5-10, 2016, Barcelona, Spain, pages 793–801, 2016.
• [SWZ17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise $\ell_1$-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 688–701. ACM, 2017.
• [SWZ19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. https://arxiv.org/pdf/1704.08246, 2019.
• [SX18] Mohit Singh and Weijun Xie. Approximate positive correlated distributions and approximation algorithms for D-optimal design. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2240–2255. Society for Industrial and Applied Mathematics, 2018.
• [SYZ18] Zhao Song, Lin F Yang, and Peilin Zhong. Sensitivity sampling over dynamic geometric data streams with applications to $k$-clustering. arXiv preprint arXiv:1802.00459, 2018.
• [Tod16] Michael J. Todd. Minimum-volume ellipsoids : theory and algorithms. MOS-SIAM series on optimization. SIAM, Philadelphia, 2016.
• [TY07] Michael J Todd and E Alper Yıldırım. On khachiyan’s algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Applied Mathematics, 155(13):1731–1744, 2007.
• [Vem05] Santosh Vempala. Geometric random walks: a survey. In Combinatorial and computational geometry, volume 52 of Math. Sci. Res. Inst. Publ., pages 577–616. Cambridge Univ. Press, Cambridge, 2005.
• [Woo14] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
• [WTSA15] Yining Wang, Hsiao-Yu Tung, Alexander J Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems (NIPS), pages 991–999. https://arxiv.org/pdf/1506.04448, 2015.
• [WYS17] Yining Wang, Adams Wei Yu, and Aarti Singh. On computationally tractable selection of experiments in measurement-constrained regression models. The Journal of Machine Learning Research, 18(1):5238–5278, 2017.

## Appendix A Preliminaries

In this section we introduce notation and preliminaries used in the appendix. We use $N(\mu, \sigma^2)$ to represent the normal distribution with mean $\mu$ and variance $\sigma^2$.

### A.1 Multivariate Calculus

Let $f$ be a differentiable function. The directional derivative of $f$ at $x$ in the direction $h$ is defined as

$$Df(x)[h] := \frac{\mathrm{d} f(x+th)}{\mathrm{d} t}\Big|_{t=0}.$$

We can also define higher order directional derivatives as

$$D^k f(x)[h_1, \cdots, h_k] := \frac{\mathrm{d}^k f\big(x + \sum_{i=1}^k t_i h_i\big)}{\mathrm{d} t_1\, \mathrm{d} t_2 \cdots \mathrm{d} t_k}\Big|_{t_1=0, \ldots, t_k=0}.$$

The following two properties of directional derivatives will be useful.

###### Proposition A.1.
• (Chain rule) $D(g \circ f)(x)[h] = g'(f(x)) \cdot Df(x)[h]$.

• Let $f(M) = M^{-1}$ where $M$ is an invertible matrix. For any $H$, $Df(M)[H] = -M^{-1} H M^{-1}$.
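The second identity can be sanity-checked with a central finite difference on a random well-conditioned matrix. This is a synthetic spot-check of our own, not part of the paper:

```python
import numpy as np

# Finite-difference check of D(M^{-1})[H] = -M^{-1} H M^{-1}.
rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4)); M = M @ M.T + 4.0 * np.eye(4)  # positive definite
H = rng.standard_normal((4, 4)); H = 0.5 * (H + H.T)            # symmetric direction

t = 1e-6
finite_diff = (np.linalg.inv(M + t * H) - np.linalg.inv(M - t * H)) / (2 * t)
closed_form = -np.linalg.inv(M) @ H @ np.linalg.inv(M)
```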

### A.2 Gamma Function

The Gamma function is a well-known mathematical object. It is defined as

$$\Gamma(z) = \int_0^{+\infty} x^{z-1} e^{-x}\, \mathrm{d}x.$$

We need the following result on the Gamma function.

###### Lemma A.2 (Corollary 1 of [Jam13]).

For all $x > 0$ and $y \in (0,1)$,

$$x(x+y)^{y-1} \le \frac{\Gamma(x+y)}{\Gamma(x)} \le x^y.$$
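The bounds of Lemma A.2 are easy to spot-check numerically with the standard library's `math.gamma` (an illustrative check of ours at a few sample points):

```python
import math

def jameson_bounds_hold(x, y):
    """Spot-check Lemma A.2 at a single point (x > 0, 0 < y < 1):
    x * (x+y)^(y-1) <= Gamma(x+y)/Gamma(x) <= x^y."""
    ratio = math.gamma(x + y) / math.gamma(x)
    return x * (x + y) ** (y - 1) <= ratio <= x ** y
```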

### A.3 Tail Bound for the χ2 Distribution

We need the following concentration result for the $\chi^2$ distribution.

###### Lemma A.3 (Lemma 1 in [LM00]).

Let $X$ be a $\chi^2$ random variable with $n$ degrees of freedom. Then for $t \ge 0$,

$$\Pr\big[X - n \ge 2\sqrt{nt} + 2t\big] \le e^{-t}.$$

## Appendix B Proof of Lemma 3.4

In this section we provide the proof of Lemma 3.4.

###### Proof.

We first prove a strengthened result: for fixed $a \in \mathbb{R}^n$, the function $f$ defined as $f(M) = \log(a^\top M^{-1} a)$ is convex over $S^n_{++}$. Here $S^n_{++}$ is the set of all $n \times n$ positive definite matrices. Notice that $S^n_{++}$ is an open set, so we can differentiate $f$.

We argue that it is sufficient to show that for all $M \in S^n_{++}$ and all symmetric $H$, the second order directional derivative $D^2 f(M)[H,H]$ is non-negative. This is because, for $g(t) := f(M + tH)$, we have $g''(t) = D^2 f(M + tH)[H,H]$. So if for all $M, H$ we have $D^2 f(M)[H,H] \ge 0$, then $f$ is convex along every line in $S^n_{++}$, which is precisely the convexity condition.

Let us do some computation with Proposition A.1.

$$Df(M)[H] = -\frac{a^\top M^{-1} H M^{-1} a}{a^\top M^{-1} a},$$

and

$$D^2 f(M)[H,H] = \frac{2\, a^\top M^{-1} H M^{-1} H M^{-1} a \cdot a^\top M^{-1} a - \big(a^\top M^{-1} H M^{-1} a\big)^2}{\big(a^\top M^{-1} a\big)^2}.$$

By the Cauchy–Schwarz inequality, we have

$$a^\top M^{-1} H M^{-1} H M^{-1} a \cdot a^\top M^{-1} a = \big\|M^{-\frac{1}{2}} H M^{-1} a\big\|_2^2 \cdot \big\|M^{-\frac{1}{2}} a\big\|_2^2 \ge \big\langle M^{-\frac{1}{2}} H M^{-1} a,\, M^{-\frac{1}{2}} a\big\rangle^2 = \big(a^\top M^{-1} H M^{-1} a\big)^2.$$

Hence $D^2 f(M)[H,H] \ge 0$ for all $M \in S^n_{++}$ and all symmetric $H$.

Now we are ready to work on $\phi_i$. For all $v, v'$ in the domain of $\phi_i$, let $M = \sum_{j=1}^m v_j a_j a_j^\top$ and $M' = \sum_{j=1}^m v'_j a_j a_j^\top$. Then for all $\lambda \in [0,1]$,

$$\begin{aligned} \phi_i(\lambda v + (1-\lambda) v') &= \log a_i^\top\Big(\sum_{j=1}^m (\lambda v_j + (1-\lambda) v'_j)\, a_j a_j^\top\Big)^{-1} a_i \\ &= \log a_i^\top\Big(\lambda \sum_{j=1}^m v_j a_j a_j^\top + (1-\lambda)\sum_{j=1}^m v'_j a_j a_j^\top\Big)^{-1} a_i \\ &= f(\lambda M + (1-\lambda) M') \\ &\le \lambda f(M) + (1-\lambda) f(M') && \text{because } f \text{ is convex} \\ &= \lambda \phi_i(v) + (1-\lambda)\phi_i(v'). \end{aligned}$$

So $\phi_i$ is also convex. ∎
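As a sanity check of Lemma 3.4, one can verify Jensen's inequality for $\phi_i$ on random data. This is a synthetic illustration of ours, not part of the paper's code:

```python
import numpy as np

# Numerical check that phi_i(v) = log(a_i^T (sum_j v_j a_j a_j^T)^{-1} a_i)
# is convex in v, by testing a convex combination of two random weight vectors.
rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))

def phi(v, i):
    M = A.T @ (v[:, None] * A)                 # sum_j v_j a_j a_j^T
    return float(np.log(A[i] @ np.linalg.solve(M, A[i])))

v  = rng.uniform(0.1, 1.0, size=10)
vp = rng.uniform(0.1, 1.0, size=10)
lam = 0.3
```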

## Appendix C Faster Algorithm for Computing John Ellipsoid for Sparse Matrix

In this section we present our accelerated algorithm, Algorithm 2, and analyze its performance. Recall that Algorithm 1 uses the iterating rule $w^{(k+1)}_i = w^{(k)}_i \sigma_i(w^{(k)})$ where $\sigma_i(w^{(k)}) = a_i^\top(A^\top W^{(k)} A)^{-1} a_i$ with $W^{(k)} = \operatorname{diag}(w^{(k)})$. With our setting of $B^{(k)} = \sqrt{W^{(k)}}\, A$, we have

$$(B^{(k)})^\top B^{(k)} = A^\top \sqrt{W^{(k)}} \cdot \sqrt{W^{(k)}}\, A = A^\top W^{(k)} A.$$

So computing $w^{(k)}_i \sigma_i(w^{(k)})$ amounts to computing the leverage scores of $B^{(k)}$, which is just what we do in Algorithm 1. Hence from Lemma 3.1, we obtain the following properties of the exact updates $\hat{w}^{(k)}$.

###### Proposition C.1 (Bound on $\hat{w}^{(k)}$).

For completeness we define $\hat{w}^{(1)} = w^{(1)}$. For $k \in [T+1]$ and $i \in [m]$, $0 \le \hat{w}^{(k)}_i \le 1$. Moreover, $\sum_{i=1}^m \hat{w}^{(k)}_i = n$.

However, $B^{(k)}$ is an $m$ by $n$ matrix, so it is computationally expensive to compute all of its leverage scores exactly. The trick we use here, first introduced by [SS11], is to multiply by a random Gaussian matrix with a small number of rows to speed up the computation. Of course, this introduces extra error; however, we can prove that the overall result still concentrates well.
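Concretely, since the leverage score satisfies $\tau_i = b_i^\top(B^\top B)^{-1}b_i = \|B(B^\top B)^{-1}b_i\|_2^2$, a Gaussian sketch applied to the left factor estimates every leverage score at once via Johnson–Lindenstrauss. A minimal NumPy illustration of this [SS11]-style trick on synthetic data (dimensions and tolerances are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, s = 2000, 5, 200                         # sketch size s << m
B = rng.standard_normal((m, n))

Minv = np.linalg.inv(B.T @ B)
tau = np.einsum('ij,ij->i', B @ Minv, B)       # exact leverage scores

# Sketched estimate: tau_hat_i = || S B (B^T B)^{-1} b_i ||_2^2 for a
# Gaussian S with s rows, which preserves each norm up to 1 +/- eps.
S = rng.standard_normal((s, m)) / np.sqrt(s)
P = (S @ B) @ Minv @ B.T                       # column i is S B (B^T B)^{-1} b_i
tau_hat = (P ** 2).sum(axis=0)
```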

### C.1 Approximation Guarantee

Since Algorithm 2 is a randomized algorithm, we need to argue that for the output $w$ of Algorithm 2, $\sigma_i(w) \le 1+\epsilon$ holds for all $i \in [m]$ with high probability. Our main result in this section is

###### Theorem C.2 (Main result).

Let $w$ be the output of Algorithm 2. For all $\epsilon, \delta \in (0,1)$, for suitable choices of the iteration count $T$ and the sketch size, we have

$$\Pr\big[\forall i \in [m],\ \sigma_i(w) \le 1+\epsilon\big] \ge 1-\delta.$$

Moreover, before the rescaling at the end,

$$\Pr\Big[\sum_{i=1}^m w_i \le (1+\epsilon)\, n\Big] \ge 1-\delta.$$

By rescaling $w$ so that $\sum_{i=1}^m w_i = n$, we have

###### Theorem C.3 (Approximation guarantee).

For all $\epsilon, \delta \in (0,1)$, with $T$ and the sketch size as in Theorem C.2, Algorithm 2 provides a $(1+\epsilon)^2$-approximation to program (2) with probability at least $1-\delta$.

###### Proof.

At the end of Algorithm 2 we set $v = \frac{n}{\sum_{i=1}^m w_i} \cdot w$, hence $\sum_{i=1}^m v_i = n$.

From Theorem C.2, with probability at least $1-\delta$, we have $\sigma_i(w) \le 1+\epsilon$ for all $i \in [m]$ and $\sum_{i=1}^m w_i \le (1+\epsilon)\, n$. Therefore for all $i \in [m]$,

$$\sigma_i(v) = a_i^\top\big(A^\top \operatorname{diag}(v) A\big)^{-1} a_i = a_i^\top\Big(\frac{n}{\sum_{i=1}^m w_i} \cdot A^\top \operatorname{diag}(w) A\Big)^{-1} a_i = \frac{\sum_{i=1}^m w_i}{n}\, \sigma_i(w) \le (1+\epsilon)^2. \qquad ∎$$

From now on we focus on proving Theorem C.2. Recall that $\phi_i(v) = \log \sigma_i(v)$ for $v \in \mathbb{R}^m_{>0}$. Similar to Lemma 3.5, we can prove the following lemma using the convexity of $\phi_i$.

###### Lemma C.4 (Telescoping).

Fix $T$ as the number of main loop iterations executed in Algorithm 2. Let $w = \frac{1}{T}\sum_{k=1}^T w^{(k)}$ be the output of Algorithm 2. Then for $i \in [m]$,

$$\phi_i(w) \le \frac{1}{T}\log\frac{m}{n} + \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k)}_i}{w^{(k)}_i}.$$
###### Proof.

Recall that $w = \frac{1}{T}\sum_{k=1}^T w^{(k)}$. By Lemma 3.4, $\phi_i$ is convex, so we can apply Jensen's inequality to obtain

$$\begin{aligned} \phi_i(w) &\le \frac{1}{T}\sum_{k=1}^T \phi_i(w^{(k)}) && \text{Jensen's inequality} \\ &= \frac{1}{T}\sum_{k=1}^T \log\sigma_i(w^{(k)}) && \text{by definition of } \phi_i \\ &= \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k+1)}_i}{w^{(k)}_i} && \hat{w}^{(k+1)}_i = w^{(k)}_i \sigma_i(w^{(k)}) \\ &= \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k+1)}_i}{\hat{w}^{(k)}_i} + \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k)}_i}{w^{(k)}_i} \\ &= \frac{1}{T}\log\frac{\hat{w}^{(T+1)}_i}{\hat{w}^{(1)}_i} + \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k)}_i}{w^{(k)}_i} \\ &\le \frac{1}{T}\log\frac{m}{n} + \frac{1}{T}\sum_{k=1}^T \log\frac{\hat{w}^{(k)}_i}{w^{(k)}_i}, \end{aligned}$$

where the last step uses $\hat{w}^{(T+1)}_i \le 1$ (Proposition C.1) and the initialization $\hat{w}^{(1)}_i = n/m$. ∎