# Recent progress on scaling algorithms and applications

Ankit Garg
Microsoft Research India
garga@microsoft.com
Rafael Oliveira
University of Toronto
rafael@cs.toronto.edu
###### Abstract

Scaling problems have a rich and diverse history, and thereby have found numerous applications in several fields of science and engineering. For instance, the matrix scaling problem has had applications ranging from theoretical computer science to telephone forecasting, economics, statistics, optimization, among many other fields. Recently, a generalization of matrix scaling known as operator scaling has found applications in non-commutative algebra, invariant theory, combinatorics and algebraic complexity; and a further generalization (tensor scaling) has found more applications in quantum information theory, geometric complexity theory and invariant theory.

In this survey, we will describe in detail the scaling problems mentioned above, showing how alternating minimization algorithms naturally arise in this setting, and we shall present a general framework to rigorously analyze such algorithms. These simple problems and algorithms are not just applicable to diverse mathematical and CS areas, but also serve to bring out deep connections between them. As this framework makes extensive use of concepts from invariant theory, we also provide a very gentle introduction to basic concepts of invariant theory and how they are used to analyze alternating minimization algorithms for the scaling problems.

This survey is intended for a general computer science audience, and the only background required is basic knowledge of calculus and linear algebra, thereby making it accessible to graduate students and even to advanced undergraduates.

## 1 Introduction

Scaling problems have been in the background of many important developments in theoretical computer science, oftentimes in an implicit or disguised way. For instance, Forster’s lower bound on the sign-rank of a matrix [For02], Linial et al.’s deterministic approximation of the permanent [LSW98] and quantitative versions of the Sylvester-Gallai theorem [BDYW11] are results which can be seen as particular instances or consequences of scaling problems.

Outside of computer science, scaling algorithms have appeared (implicitly and explicitly) in many different areas, such as economics [Sto62], statistics [Sin64], optimization [RS89], telephone forecasting [Kru37], non-commutative algebra [GGOW16], functional analysis [GGOW17], quantum information theory [Gur04a] and many others.

When trying to solve a scaling problem, a natural alternating minimization algorithm comes to mind, and as such these algorithms have been proposed independently by many researchers. The analysis of such alternating minimization algorithms, on the other hand, has been a difficult task, with many different approaches being proposed for each scaling problem, and before recent works, without a unified way of analyzing such scaling algorithms. In this survey, we exposit a unified way of analyzing the natural alternating minimization algorithms, which is based on the series of works [LSW98, Gur04a, GGOW16, BGO18, BFG18].

This framework is a welcome addition to the rich and ever-growing theory of optimization. Its contributions to the theory of optimization are manifold. First, it provides a general framework in which a general optimization heuristic, alternating minimization, converges in a polynomial number of iterations. Secondly, the underlying optimization problems are non-convex and yet they can be solved efficiently. (Footnote 1: The underlying problems are geodesically convex, i.e. convex in a different geometry. We will not discuss geodesic convexity in this survey, but note that it is an active area of research and that these scaling problems provide several interesting challenges and applications for this area.) So the framework provides a seemingly new tool looking for applications. Thirdly, these scaling problems give rise to a rich class of polytopes, called moment polytopes, which have exponentially many facets and yet there exist weak membership (conjecturally strong membership) oracles for them (see [GGOW17, BFG18]). It remains to be seen if these moment polytopes can capture a large class of combinatorial polytopes (some of them they already can; see [GGOW17]).

This survey is divided as follows: in Section 2, we formally describe the three scaling problems that we study, together with the natural alternating minimization algorithms proposed for them. In Section 3 we give an elementary introduction to invariant theory, with many examples, discussing how the scaling problems defined in Section 2 are particular instances of more general invariant theory questions. In Section 4, we provide a unified, 3-step analysis of the alternating minimization algorithms proposed in Section 2, showing how invariant theory is used in the analysis of such algorithms. In Section 5, we give a detailed discussion of some of the numerous applications of scaling algorithms, providing more references for the interested reader. In Section 6 we conclude this survey, presenting further directions and future work to be done in the area, and in Section 7 we discuss related works (old and new) which we could not cover in this survey due to space and time constraints, but otherwise would perfectly fit within the scope of the survey.

## 2 Scaling: problems and algorithms

We first describe the various scaling problems and natural algorithms for them based on alternating minimization. Section 2.1 studies matrix scaling, Section 2.2 studies operator scaling and Section 2.3 studies tensor scaling.

### 2.1 Matrix scaling

The simplest scaling problem is matrix scaling, which dates back to Kruithof [Kru37] in telephone forecasting and Sinkhorn [Sin64] in statistics. There is a huge body of literature on this problem (see for instance [RS89, KK96, LSW98, GY98, Ide16, ALOW17, CMTV17] and references therein). In this subsection we will describe Sinkhorn’s algorithm and its analysis from [LSW98]. We refer the reader to the last two references above for more sophisticated algorithms for matrix scaling.

We start with a few definitions, the first being the definition of a scaling of a matrix.

###### Definition 2.1 (Scaling of a matrix).

Suppose we are given an $n \times n$ non-negative (real) matrix $A$. We say that $A'$ is a scaling of $A$ if it can be obtained by multiplying the rows and columns of $A$ by positive scalars. In other words, $A'$ is a scaling of $A$ if there exist diagonal matrices $D_1, D_2$ (with positive entries) s.t. $A' = D_1 A D_2$.

Next is the definition of a doubly stochastic matrix.

###### Definition 2.2 (Doubly stochastic).

An $n \times n$ non-negative matrix $A$ is said to be doubly stochastic if all of its row and column sums are equal to $1$.

The matrix scaling problem is simple to describe: given an $n \times n$ non-negative matrix $A$, find a scaling of $A$ which is doubly stochastic (if one exists). It turns out that an approximate version of the problem is more natural and has more structure, and as we will see in Section 3, this is not by accident. To define the approximate version, we will need another definition which quantifies how close a matrix is to being doubly stochastic.

###### Definition 2.3.

Given an $n \times n$ non-negative matrix $A$, define its distance to doubly stochastic to be

$$ds(A) = \sum_{i=1}^{n} (r_i - 1)^2 + \sum_{j=1}^{n} (c_j - 1)^2,$$

where $r_i$ and $c_j$ denote the $i$-th row sum and the $j$-th column sum of $A$, respectively.

With this notion of distance we can define the $\varepsilon$-scaling problem, whose goal is to find a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$ (if one exists).

###### Definition 2.4 (Scalability).

We will say that a non-negative matrix $A$ is scalable if for all $\varepsilon > 0$, there exists a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$.

Given this definition, several natural questions arise. When is a matrix scalable? If a matrix is scalable, can one efficiently find an $\varepsilon$-scaling? It turns out that the answers to both questions are extremely pleasing! The answer to the first question is given by the following theorem (e.g. see [RS89]).

###### Theorem 2.5.

An $n \times n$ non-negative matrix $A$ is scalable iff $\mathrm{per}(A) > 0$. (Footnote 2: Here $\mathrm{per}(A)$ is the permanent of the matrix $A$.) In other words, $A$ is scalable iff the bipartite graph defined by the support of $A$ has a perfect matching.
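The equivalence in Theorem 2.5 between a positive permanent and a perfect matching in the support can be checked by brute force on tiny matrices. The following is only an illustrative sketch: the function names `permanent` and `has_perfect_matching` are hypothetical helpers introduced here, and the enumeration over all permutations is exponential in $n$, so it is not meant as a practical test.

```python
from itertools import permutations

def permanent(A):
    """Permanent of a square matrix by brute force (n! terms; tiny n only)."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        term = 1
        for i in range(n):
            term *= A[i][sigma[i]]
        total += term
    return total

def has_perfect_matching(A):
    """True iff some permutation sigma has A[i][sigma(i)] != 0 for every row i."""
    n = len(A)
    return any(all(A[i][s[i]] != 0 for i in range(n))
               for s in permutations(range(n)))

# Hypothetical example inputs (not from the survey).
with_matching = [[1, 1], [0, 1]]      # the diagonal is a perfect matching
without_matching = [[1, 1], [0, 0]]   # the second row is empty
```

For a non-negative matrix every term of the permanent is non-negative, so the permanent is positive exactly when at least one permutation avoids all zero entries, which is what the second helper tests.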

Having learned this nice structural result, one is naturally led to the second (algorithmic) question: if $A$ is scalable, how does one efficiently find an $\varepsilon$-scaling? (Footnote 3: Historically, the quest for an algorithmic solution to this problem preceded the structural results.) Towards this, Sinkhorn [Sin64] suggested an extremely natural algorithm, which was analyzed in [LSW98].

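###### Theorem 2.6 ([LSW98]).

Algorithm 1, run for a number of iterations that is polynomial in the input size and $1/\varepsilon$ (see [LSW98] for the precise bound), works correctly. That is, if the algorithm outputs that $A$ is not scalable, then $A$ is not scalable. If $A$ is scalable, then the algorithm will output an $\varepsilon$-scaling of $A$.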

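It turns out that to test scalability, it suffices to take $\varepsilon$ inverse polynomially small in $n$. More formally,

###### Lemma 2.7 ([LSW98]).

Suppose $A$ is an $n \times n$ non-negative matrix. If $A$ is row or column normalized and $ds(A)$ is below an explicit threshold that is inverse polynomial in $n$ (see [LSW98]), then $A$ is scalable.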

Thus Algorithm 1 along with Theorem 2.5 and Theorem 2.6 gives an alternate (albeit slower) algorithm to test if a bipartite graph has a perfect matching. (Footnote 5: Note that the iterates in Algorithm 1 are row or column normalized and hence Lemma 2.7 applies.)
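To make the alternating minimization concrete, here is a minimal sketch of Sinkhorn-style alternating row and column normalization. It is not a transcription of Algorithm 1: the function names, the stopping rule and the `max_iters` parameter are assumptions made for illustration, and the iteration budget is left to the caller rather than set to the bound from [LSW98].

```python
def row_normalize(A):
    """Divide every row by its sum, so all row sums become 1."""
    return [[x / sum(row) for x in row] for row in A]

def col_normalize(A):
    """Divide every column by its sum, so all column sums become 1."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

def ds(A):
    """Distance to doubly stochastic: squared deviations of row and column sums from 1."""
    n = len(A)
    rows = [sum(A[i]) for i in range(n)]
    cols = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return sum((r - 1) ** 2 for r in rows) + sum((c - 1) ** 2 for c in cols)

def sinkhorn(A, eps, max_iters):
    """Alternate row/column normalization; return (B, True) once ds(B) <= eps,
    or (B, False) if the (caller-chosen) iteration budget runs out."""
    n = len(A)
    # A zero row or column rules out a perfect matching in the support.
    if any(sum(row) == 0 for row in A) or \
       any(sum(A[i][j] for i in range(n)) == 0 for j in range(n)):
        return A, False
    B = [[float(x) for x in row] for row in A]
    for t in range(max_iters):
        B = row_normalize(B) if t % 2 == 0 else col_normalize(B)
        if ds(B) <= eps:
            return B, True
    return B, False
```

On the hypothetical input `[[1, 1], [0, 1]]` the off-diagonal entry decays slowly and the iterates approach the identity matrix. On `[[1, 1, 1], [1, 0, 0], [1, 0, 0]]`, whose support has no perfect matching, every row-normalized or column-normalized iterate has `ds` at least 1, so the loop never succeeds.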

### 2.2 Operator scaling

The operator scaling problem was first introduced and studied by Gurvits [Gur04a]. We refer the reader to [Gur04a, GGOW16, AZGL18] for various motivations, connections and applications. The objects of study here are tuples of $n \times n$ complex matrices $A = (A_1, \ldots, A_m)$. The name operator scaling comes from the fact that these tuples define a map from positive definite matrices to themselves, by $X \mapsto \sum_{i=1}^{m} A_i X A_i^\dagger$. (Footnote 6: These maps are called completely positive maps/operators and are very natural from the point of view of quantum mechanics.) (Footnote 7: $A_i^\dagger$ denotes the conjugate transpose of $A_i$.) But here we will restrict ourselves to the representation as a tuple of matrices, for simplicity of exposition.

We start with a few definitions. First is the definition of scaling in this setting.

###### Definition 2.8 (Scaling of tuples).

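Given a tuple of $n \times n$ complex matrices, $A = (A_1, \ldots, A_m)$, we say that $A' = (A'_1, \ldots, A'_m)$ is a scaling of $A$ if there exist invertible matrices $B, C$ s.t. $A' = B A C$, i.e. $A'_i = B A_i C$ for all $i \in [m]$.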

Next is the definition of doubly stochastic in this setting.

###### Definition 2.9 (Doubly stochastic tuples).

A tuple of $n \times n$ complex matrices, $A = (A_1, \ldots, A_m)$, is said to be doubly stochastic if

$$\sum_{i=1}^{m} A_i A_i^\dagger = \sum_{i=1}^{m} A_i^\dagger A_i = I_n.$$

As before, the operator scaling question is: given a tuple $A$, find a scaling $A'$ which is doubly stochastic (if one exists). Again an approximate version is more natural. Towards that, we have the following definition quantifying how close a tuple is to being doubly stochastic. (Footnote 8: We apologize for the overload of notation. Some of it is deliberate to draw out the syntactic similarity between various scaling problems. As we will see later, there is a common thread that binds all these problems.)

###### Definition 2.10.

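Given a tuple of $n \times n$ complex matrices, $A = (A_1, \ldots, A_m)$, define

$$ds(A) = \left\| \sum_{i=1}^{m} A_i A_i^\dagger - I_n \right\|_F^2 + \left\| \sum_{i=1}^{m} A_i^\dagger A_i - I_n \right\|_F^2.$$

Here $\|\cdot\|_F$ denotes the Frobenius norm.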

The goal in the current version of the $\varepsilon$-scaling problem is to find a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$ (if one exists).

###### Definition 2.11 (Scalability).

A tuple of $n \times n$ complex matrices, $A = (A_1, \ldots, A_m)$, is scalable if for all $\varepsilon > 0$, there exists a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$.

We ask the same questions as before. When is a tuple scalable? If it is scalable, can we find an $\varepsilon$-scaling efficiently? There is a deep theory underlying the answer to the first question, and to unveil it we will need another definition. (Footnote 9: Notice the similarity with the definition of dimension expanders; see [AFG14] and references therein.)

###### Definition 2.12 (Dimension non-decreasing tuples).

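We say that a tuple of $n \times n$ complex matrices, $A = (A_1, \ldots, A_m)$, is dimension non-decreasing if for all subspaces $V \subseteq \mathbb{C}^n$, $\dim\left(\sum_{i=1}^{m} A_i V\right) \ge \dim(V)$. Here $A_i V$ denotes the subspace $\{A_i v : v \in V\}$ and $\sum_{i=1}^{m} A_i V$ denotes the subspace $\left\{\sum_{i=1}^{m} v_i : v_i \in A_i V\right\}$.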

The following theorem gives a very pleasing answer to the first question.

###### Theorem 2.13 ([Gur04a]).

A tuple of $n \times n$ complex matrices, $A$, is scalable iff $A$ is dimension non-decreasing.

What about the second question? Gurvits [Gur04a] suggested a natural algorithm similar to that of Sinkhorn, although he could not analyze it in all cases. The full analysis, stated in the following theorem, was proved in [GGOW16].

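###### Theorem 2.14 ([GGOW16]).

Algorithm 2, run for a number of iterations that is polynomial in the input size and $1/\varepsilon$ (see [GGOW16] for the precise bound), works correctly. That is, if the algorithm outputs that $A$ is not scalable, then $A$ is not scalable. If $A$ is scalable, then the algorithm will output an $\varepsilon$-scaling of $A$.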

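Similar to the matrix scaling setting, to test scalability, it suffices to take $\varepsilon$ inverse polynomially small in $n$. More formally,

###### Lemma 2.15 ([Gur04a]).

Suppose $A = (A_1, \ldots, A_m)$ is a tuple of $n \times n$ complex matrices. If $\sum_{i=1}^{m} A_i A_i^\dagger = I_n$ or $\sum_{i=1}^{m} A_i^\dagger A_i = I_n$, and $ds(A)$ is below an explicit threshold that is inverse polynomial in $n$ (see [Gur04a]), then $A$ is scalable.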

Hence Algorithm 2 along with Theorem 2.13 and Theorem 2.14 gives a polynomial time algorithm to test if a tuple is dimension non-decreasing.

###### Theorem 2.16 ([GGOW16]).

There is a polynomial time algorithm to test if a tuple of complex matrices is dimension non-decreasing.

This was the first polynomial time algorithm for the operator scaling problem and as we will later see has applications in derandomization. Soon after, [IQS17a] (also see [IQS17b, DM15]) designed an algebraic algorithm for this problem which also works over finite fields. Their algorithm is an algebraic analogue of the augmenting paths algorithm for matching!
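The alternating normalization behind operator scaling can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a transcription of Algorithm 2: the names `inv_sqrt`, `ds_operator` and `operator_sinkhorn`, the tolerance `tol`, and the choice to report failure whenever one of the sums $\sum_i A_i A_i^\dagger$ or $\sum_i A_i^\dagger A_i$ is numerically singular are all choices made here. The distance used is the sum of the squared Frobenius distances of those two sums from $I_n$.

```python
import numpy as np

def inv_sqrt(P, tol=1e-12):
    """Inverse square root of a Hermitian positive definite matrix P."""
    w, U = np.linalg.eigh(P)
    if w.min() <= tol:
        raise np.linalg.LinAlgError("matrix is singular (or nearly so)")
    return (U / np.sqrt(w)) @ U.conj().T   # U diag(w^{-1/2}) U^dagger

def ds_operator(As):
    """Squared Frobenius distance of both sums from the identity."""
    n = As[0].shape[0]
    left = sum(A @ A.conj().T for A in As)
    right = sum(A.conj().T @ A for A in As)
    I = np.eye(n)
    return (np.linalg.norm(left - I, 'fro') ** 2
            + np.linalg.norm(right - I, 'fro') ** 2)

def operator_sinkhorn(As, eps, max_iters):
    """Alternately enforce sum A_i A_i^dagger = I and sum A_i^dagger A_i = I."""
    As = [np.asarray(A, dtype=complex) for A in As]
    try:
        for t in range(max_iters):
            if t % 2 == 0:
                L = inv_sqrt(sum(A @ A.conj().T for A in As))
                As = [L @ A for A in As]      # left normalization
            else:
                R = inv_sqrt(sum(A.conj().T @ A for A in As))
                As = [A @ R for A in As]      # right normalization
            if ds_operator(As) <= eps:
                return As, True
    except np.linalg.LinAlgError:
        return As, False
    return As, False
```

For a single invertible matrix one left normalization already produces a unitary matrix (this is the polar decomposition), so the sketch stops after one step. For a single singular matrix the first normalization fails, consistent with such a tuple not being dimension non-decreasing.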

### 2.3 Tensor scaling

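In this section, we will discuss a scaling problem which is a generalization of operator scaling and was studied in [BGO18]. The objects of study here are tuples of tensors. Let us denote the space of $n_1 \times \cdots \times n_d$ complex tensors by $\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$. Then we will use the notation $A = (A_1, \ldots, A_m)$ to denote tuples of tensors where each $A_i \in \mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$.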

We start with the definition of scaling in this setting. (Footnote 11: The scaling here looks very different from matrix scaling. One can also define a generalization of matrix scaling to tensors but we will not focus on that version in this survey; see [FL89].)

###### Definition 2.17 (Tensor scaling of tuples).

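Given a tuple of tensors (in $\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$), $A = (A_1, \ldots, A_m)$, we say that $A' = (A'_1, \ldots, A'_m)$ is a scaling of $A$ if there exist invertible matrices $B_1, \ldots, B_d$, with $B_j$ of size $n_j \times n_j$, s.t. $A'_i = (B_1 \otimes \cdots \otimes B_d) A_i$ for all $i \in [m]$. We will use the notation $A' = (B_1, \ldots, B_d) \cdot A$ for this scaling action.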

Before going to the definition of stochastic tuples in this setting, we need to define a certain notion of marginals.

###### Definition 2.18 (Marginals).

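Given a tuple of tensors, $A = (A_1, \ldots, A_m)$, identify it with an element of $\mathbb{C}^{m} \otimes \mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$. Then we will denote the marginals of $A$ by $\rho^A_1, \ldots, \rho^A_d$, where $\rho^A_i$ is an $n_i \times n_i$ positive semidefinite matrix for all $i \in [d]$. For each $i$, we can flatten $A$ to obtain an $n_i \times \left(m \prod_{j \neq i} n_j\right)$ matrix $M_i$. Then $\rho^A_i = M_i M_i^\dagger$. These are uniquely characterized by the following property:

$$\operatorname{tr}\left[\left(I_m \otimes I_{n_1} \otimes \cdots \otimes C_i \otimes \cdots \otimes I_{n_d}\right) A A^\dagger\right] = \operatorname{tr}\left[C_i\, \rho^A_i\right]$$

for all $i \in [d]$ and for all $n_i \times n_i$ matrices $C_i$.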

###### Remark 2.19.

The above notion of marginals is very natural from the point of view of quantum mechanics. If one views $A$ as representing a quantum state on $d+1$ systems indexed by $0, 1, \ldots, d$, then $\rho^A_1, \ldots, \rho^A_d$ are the marginal states on systems $1, \ldots, d$ respectively.

Now we are ready to define the notion of stochasticity in this setting.

###### Definition 2.20 (d-stochastic tuples).

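A tuple of tensors $A$ (in $\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$) is said to be $d$-stochastic if $\rho^A_i = \frac{1}{n_i} I_{n_i}$ for each $i \in [d]$, i.e. the marginals are all scalar multiples of identity matrices.

The normalization by $n_i$ is needed because all the marginals have the same trace, namely $\operatorname{tr}[\rho^A_i] = \|A\|_2^2$ for all $i \in [d]$. We will also need the following measure which quantifies how close a tuple is to being $d$-stochastic.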

###### Definition 2.21.

Given a tuple of tensors, $A = (A_1, \ldots, A_m)$, define

$$ds(A) = \sum_{i=1}^{d} \left\| \rho^A_i - \frac{1}{n_i} I_{n_i} \right\|_F^2.$$

Note that the definition above differs slightly from Definitions 2.3 and 2.10 in terms of a normalization factor. As before, the $\varepsilon$-scaling problem is to find a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$ (if one exists). Scalability is also defined similarly as before.
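The marginals and the distance to $d$-stochasticity are easy to compute by flattening. The sketch below assumes a particular storage convention that is not from the survey: the tuple is held as one NumPy array `T` of shape `(m, n_1, ..., n_d)`, with axis 0 indexing the members of the tuple. The helper names `marginals` and `ds_tensor` are likewise hypothetical.

```python
import numpy as np

def marginals(T):
    """Return [rho_1, ..., rho_d] for a tuple stored as an array of shape
    (m, n_1, ..., n_d): rho_i = M_i M_i^dagger, where M_i is the flattening
    of T with axis i as rows and all remaining axes merged into columns."""
    d = T.ndim - 1
    rhos = []
    for i in range(1, d + 1):
        M = np.moveaxis(T, i, 0).reshape(T.shape[i], -1)
        rhos.append(M @ M.conj().T)
    return rhos

def ds_tensor(T):
    """Sum over i of || rho_i - I / n_i ||_F^2."""
    total = 0.0
    for rho in marginals(T):
        n = rho.shape[0]
        total += np.linalg.norm(rho - np.eye(n) / n, 'fro') ** 2
    return total

# Hypothetical examples with m = 1, d = 2, n_1 = n_2 = 2.
entangled = (np.eye(2) / np.sqrt(2)).reshape(1, 2, 2)   # both marginals equal I/2
product = np.zeros((1, 2, 2))
product[0, 0, 0] = 1.0                                   # both marginals equal diag(1, 0)
```

In the first example both marginals are $I_2/2$, so the distance is zero. In the second, each marginal is $\mathrm{diag}(1, 0)$ and contributes $1/2$, for a total of $1$. In both cases every marginal has trace $\|T\|_2^2 = 1$.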

###### Definition 2.22 (Scalability).

We will say that a tuple of tensors, $A = (A_1, \ldots, A_m)$, is scalable if for all $\varepsilon > 0$, there exists a scaling $A'$ of $A$ s.t. $ds(A') \le \varepsilon$.

The same questions arise. When is a tuple scalable? If it is scalable, can one find an $\varepsilon$-scaling efficiently? The answer to the first question is given by remarkable and deep theorems of Hilbert and Mumford, and Kempf and Ness. To properly state this answer we need some more definitions.

###### Definition 2.23 (Deficiency).

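We call a subset $S \subseteq [n_1] \times \cdots \times [n_d]$ deficient if there exist real numbers $a^{(i)}_j$, for $i \in [d]$ and $j \in [n_i]$, with $\sum_{j=1}^{n_i} a^{(i)}_j = 0$ for each $i \in [d]$, s.t. $\sum_{i=1}^{d} a^{(i)}_{j_i} > 0$ for all $(j_1, \ldots, j_d) \in S$.

We encourage the reader to work out an alternate characterization of deficiency in the case $d = 2$ and $n_1 = n_2$. Hint: it is related to perfect matchings in bipartite graphs.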

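We will also use the following notation.

$$\mathrm{supp}(A) = \left\{ (j_1, \ldots, j_d) \in [n_1] \times \cdots \times [n_d] : \exists\, i \in [m] \text{ s.t. } A_i(j_1, \ldots, j_d) \neq 0 \right\}$$

###### Theorem 2.24 (Hilbert-Mumford + Kempf-Ness [Hil93, Mum65, KN79], see [BGO+18]).

A tuple of tensors $A$ is scalable iff for every tuple of invertible matrices $(B_1, \ldots, B_d)$, the support $\mathrm{supp}((B_1, \ldots, B_d) \cdot A)$ of the scaled tuple is not deficient.

We leave it as an exercise to verify that the above theorem is the same as Theorem 2.13 in the case $d = 2$ and $n_1 = n_2 = n$.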

How to find an efficient scaling if one exists? It turns out that one can extend the alternating minimization kind of algorithms from the matrix and operator scaling settings to the tensor scaling setting as well.

Algorithm 3 was proposed in [VDDM03] without analysis. The following theorem regarding the analysis of the algorithm was proved in [BGO18].

###### Theorem 2.25 ([BGO+18]).

Algorithm 3, run for a number of iterations that is polynomial in the input size and $1/\varepsilon$ (see [BGO18] for the precise bound), works correctly. That is, if the algorithm outputs that $A$ is not scalable, then $A$ is not scalable. If $A$ is scalable, then the algorithm will output an $\varepsilon$-scaling of $A$.

Unfortunately, unlike the matrix and operator scaling cases, to test scalability it is not sufficient to take $\varepsilon$ which is inverse polynomially small (see [BGO18] for a discussion). Hence we still do not have a polynomial time algorithm for testing scalability of tensors.

## 3 Source of scaling

Given the syntactic similarities between Sections 2.1, 2.2 and 2.3, it is natural to wonder if there is a general setting which captures all these scaling problems. In other words, where does scaling come from? It turns out that scaling arises in an algebraic setting and understanding the algebraic setting is crucial to a unified analysis of Algorithms 1, 2 and 3.

In Section 3.1, we introduce basic concepts in invariant theory, which provides crucial tools for the analysis of scaling algorithms. In Section 3.2 we introduce basic concepts of geometric invariant theory, which elucidates the connection between invariant theory and scaling problems.

### 3.1 Invariant theory: source of scaling

Invariant theory studies the linear actions of groups on vector spaces. We refer the reader to the excellent books [DK15, Stu08] for an extensive introduction to the area. We will only cover a few basics that we need for our purpose here. For our purpose vector spaces will be over the complex numbers ($\mathbb{C}$) and the groups we will deal with will be extremely simple: the special linear group, denoted by $SL_n(\mathbb{C})$ ($n \times n$ matrices over $\mathbb{C}$ with determinant $1$), direct products of special linear groups, as well as the diagonal subgroup of the special linear group, denoted by $ST_n$ (diagonal $n \times n$ matrices over $\mathbb{C}$ with determinant $1$), and its direct products. However, the theory is quite general and extends to a large class of groups.

Suppose we have a group $G$ which acts linearly on a vector space $V$. (Footnote 13: That is, the group action satisfies the following axioms: $g \cdot (v_1 + v_2) = g \cdot v_1 + g \cdot v_2$ and $g \cdot (c v) = c\,(g \cdot v)$ for all $g \in G$, $v, v_1, v_2 \in V$ and $c \in \mathbb{C}$, in addition to the properties of being a group action, i.e. $g_1 \cdot (g_2 \cdot v) = (g_1 g_2) \cdot v$ for all $g_1, g_2 \in G$ and $v \in V$, and $e \cdot v = v$ for $e$ being the identity element of the group. Usually one also requires that the action is algebraic.) Fundamental objects of study in invariant theory are the invariant polynomials, which are just polynomial functions on $V$ left invariant by the action of the group $G$. Invariant polynomials form a ring and this ring is usually denoted by $\mathbb{C}[V]^G$. More formally,

$$\mathbb{C}[V]^G = \left\{ p \in \mathbb{C}[V] : p(g \cdot v) = p(v) \;\; \forall\, g \in G,\ v \in V \right\}$$

Let us consider a simple example. The group $G = SL_n(\mathbb{C}) \times SL_n(\mathbb{C})$ acts on the vector space $V = \mathrm{Mat}_n(\mathbb{C})$ (Footnote 14: $\mathrm{Mat}_n(\mathbb{C})$ denotes the space of $n \times n$ complex matrices.) by left-right multiplication as follows: $(A, B) \cdot M = A M B^T$. The determinant $\det(M)$ is an invariant polynomial for this action and it turns out that it is essentially the only one (prove it!). That is, any invariant polynomial is of the form $q(\det(M))$ for a univariate polynomial $q$, or in other words, $\det$ generates the invariant ring. As an aside (this will not be so important for us), Hilbert [Hil90, Hil93] proved that the invariant ring is always finitely generated! (Footnote 15: He proved it for the actions of general linear groups but his proof readily generalizes to a more general class of groups called reductive groups. These papers proved several theorems which are the building blocks of modern algebra, like the Nullstellensatz and the finite basis theorem, as “lemmas” en route to proving the finite generation of invariant rings!)

Some other fundamental objects of study in invariant theory are orbits and orbit-closures. The orbit of a vector $v \in V$, denoted by $O_G(v)$, is simply the set of all vectors that $v$ can be transformed to by the group action. That is,

$$O_G(v) = \{ g \cdot v : g \in G \}$$

The orbit-closure of a vector $v \in V$, denoted by $\overline{O_G(v)}$, is obtained by simply including all the limit points of sequences of points in the orbit. That is,

$$\overline{O_G(v)} = \left\{ w \in V : \exists\, g_1, \ldots, g_k, \ldots \in G \text{ s.t. } \lim_{k \to \infty} g_k \cdot v = w \right\}$$

Many important problems in theoretical computer science are really questions about orbit-closures. To list a few,

1. The graph isomorphism problem is about checking if the orbit closures of two graphs (under the group action of permuting the vertices) are the same or not. (Footnote 16: Note that for the action of a finite group, the orbit of a point is the same as its orbit closure.)

2. The $\mathsf{VP}$ vs $\mathsf{VNP}$ question (or more precisely a variant of it) can be phrased as testing if the (padded) permanent polynomial lies in the orbit-closure of the determinant (w.r.t. the action on the polynomials induced by the action of the general linear group on the variables). This is the starting point of geometric complexity theory (GCT) [MS02, Bür12, Lan15].

3. The question of tensor rank lower bounds (more precisely border rank) can be phrased as asking if a padded version of the given tensor lies in the orbit-closure of the diagonal unit tensor (w.r.t. the natural action of products of general linear groups on the tensors). This approach also falls under the purview of geometric complexity theory [BI11, BI13].

It turns out that a very simple concept in invariant theory captures the mysteries about the scaling problems in Sections 2.1, 2.2 and 2.3. This is the so-called null cone of a group action (on a vector space). The null cone has dual definitions in terms of the invariant polynomials as well as orbit-closures (in a very general setting, and in particular for the group actions we care about in this survey). This duality is quite important for the analysis of the scaling algorithms.

###### Definition 3.1 (Null cone).

The null cone for a group $G$ acting on a vector space $V$, denoted by $\mathcal{N}_G(V)$, is the zero set of all non-constant homogeneous invariant polynomials. That is,

$$\mathcal{N}_G(V) = \left\{ v \in V : p(v) = 0 \;\; \forall\, \text{non-constant homogeneous } p \in \mathbb{C}[V]^G \right\}$$

It is a cone since $v \in \mathcal{N}_G(V)$ implies that $c v \in \mathcal{N}_G(V)$ for all $c \in \mathbb{C}$. A theorem due to Hilbert [Hil93] and Mumford [Mum65] (Footnote 17: Not to be confused with the Hilbert-Mumford criterion which we will come across later.) says that for a large class of group actions (which includes the group actions we will study), $v \in \mathcal{N}_G(V)$ iff $0 \in \overline{O_G(v)}$ (try to figure out the easy direction). This is a consequence of Hilbert’s Nullstellensatz along with the fact that orbit-closures for certain group actions are algebraic varieties (or in other words, Euclidean and Zariski closures match). If we look at the left-right multiplication example discussed above, the null cone is just the space of singular matrices, since the determinant generates the invariant ring. We leave it as an exercise to verify that the $0$ matrix lies in the orbit-closure of any singular matrix (under the left-right multiplication action of $SL_n(\mathbb{C}) \times SL_n(\mathbb{C})$).

We will now describe the connection between the null cone and scaling problems. For this we will need to move on to the area of geometric invariant theory, which provides geometric and analytic tools to study problems in invariant theory, and also provides an intriguing non-commutative extension of Farkas’ lemma (or linear programming duality). As a teaser of things to come, the objects in Sections 2.1, 2.2 and 2.3 are scalable iff they are not in the null cone of certain group actions!

### 3.2 Geometric invariant theory: non-commutative duality

In this section, we will give a brief overview of the geometric invariant theoretic approach to studying the null cone problem. This will also fit in nicely with the computational aspects of the null cone. Section 3.2.1 describes the Hilbert-Mumford criterion which is really answering the question: how does one prove if some vector is in the null cone. Section 3.2.2 describes Kempf-Ness which answers the question: how does one prove if some vector is not in the null cone. Section 3.2.3 studies the Hilbert-Mumford and the Kempf-Ness criterion for certain commutative group actions and explains why these generalize Farkas’ lemma. Section 3.3 explains the connection between geometric invariant theory and scaling problems.

#### 3.2.1 Hilbert-Mumford criterion

Fix the action of a group $G$ on a vector space $V$. How does one prove to someone that a vector $v$ is in the null cone? We know that $v$ is in the null cone iff $0 \in \overline{O_G(v)}$, i.e. there is a sequence of group elements $g_1, g_2, \ldots$ s.t. $\lim_{k \to \infty} g_k \cdot v = 0$. So this sequence of group elements is a witness to $v$ being in the null cone. Is there a more succinct witness? After all, how do we even describe an infinite sequence of group elements? The Hilbert-Mumford criterion says that there does exist a more succinct witness (again we won’t go into the technical conditions the group needs to satisfy but just say that they will be satisfied for the groups we will consider).

###### Theorem 3.2 (Hilbert-Mumford criterion [Hil93, Mum65]).

$v \in \mathcal{N}_G(V)$ iff there is a one-parameter subgroup $\lambda : \mathbb{C}^* \to G$ s.t. $\lim_{t \to 0} \lambda(t) \cdot v = 0$.

What this means is that instead of looking at all sequences of group elements, one only needs to consider those sequences of group elements which can be succinctly described by one-parameter subgroups. What are one-parameter subgroups? These are just algebraic group homomorphisms (i.e. algebraic maps which are also group homomorphisms) $\lambda : \mathbb{C}^* \to G$. Let us look at several examples (we encourage the reader to prove these statements).

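1. For the group $\mathbb{C}^*$ (the multiplicative group of non-zero complex numbers), all one-parameter subgroups are of the form $\lambda(t) = t^{a}$ for some $a \in \mathbb{Z}$.

2. For the group $(\mathbb{C}^*)^n$ (direct product of $n$ copies of $\mathbb{C}^*$), all one-parameter subgroups are of the form $\lambda(t) = (t^{a_1}, \ldots, t^{a_n})$ for some $a_1, \ldots, a_n \in \mathbb{Z}$.

3. For the group $ST_n$ (diagonal $n \times n$ matrices with determinant $1$), all one-parameter subgroups are of the form $\lambda(t) = \mathrm{diag}(t^{a_1}, \ldots, t^{a_n})$ for some $a_1, \ldots, a_n \in \mathbb{Z}$ satisfying $\sum_{i=1}^{n} a_i = 0$.

4. For the group $ST_n \times ST_n$, all one-parameter subgroups are of the form

$$\lambda(t) = \left( (t^{a_1}, \ldots, t^{a_n}),\ (t^{b_1}, \ldots, t^{b_n}) \right)$$

for some integers $a_1, \ldots, a_n, b_1, \ldots, b_n$ satisfying $\sum_{i=1}^{n} a_i = \sum_{j=1}^{n} b_j = 0$.

5. For the group $GL_n(\mathbb{C})$ ($n \times n$ invertible matrices), all one-parameter subgroups are of the form $\lambda(t) = S\, \mathrm{diag}(t^{a_1}, \ldots, t^{a_n})\, S^{-1}$ for some $S \in GL_n(\mathbb{C})$ and some $a_1, \ldots, a_n \in \mathbb{Z}$. Here $\mathrm{diag}(c_1, \ldots, c_n)$ represents a diagonal matrix with $c_1, \ldots, c_n$ on the diagonal.

6. For the group $SL_n(\mathbb{C})$ ($n \times n$ invertible matrices with determinant $1$), all one-parameter subgroups are of the form $\lambda(t) = S\, \mathrm{diag}(t^{a_1}, \ldots, t^{a_n})\, S^{-1}$ for some $S \in SL_n(\mathbb{C})$ and some $a_1, \ldots, a_n \in \mathbb{Z}$ satisfying $\sum_{i=1}^{n} a_i = 0$.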

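7. For the group $SL_{n_1}(\mathbb{C}) \times \cdots \times SL_{n_d}(\mathbb{C})$, all one-parameter subgroups are of the form

$$\lambda(t) = \left( S_1\, \mathrm{diag}\!\left(t^{a^{(1)}_1}, \ldots, t^{a^{(1)}_{n_1}}\right) S_1^{-1},\ \ldots,\ S_d\, \mathrm{diag}\!\left(t^{a^{(d)}_1}, \ldots, t^{a^{(d)}_{n_d}}\right) S_d^{-1} \right)$$

for some $S_i \in SL_{n_i}(\mathbb{C})$ and some integers $a^{(i)}_j$ satisfying $\sum_{j=1}^{n_i} a^{(i)}_j = 0$ for all $i \in [d]$.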

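Let us return to the example of the left-right multiplication action of $SL_n(\mathbb{C}) \times SL_n(\mathbb{C})$ on $\mathrm{Mat}_n(\mathbb{C})$. Recall that $(A, B)$ sends $M$ to $A M B^T$ and $M$ is in the null cone iff it is singular. If $M$ is singular, what is a one-parameter subgroup driving it to the zero matrix? Since $M$ is singular, there exists an invertible $S$ (which can be taken to have determinant $1$) s.t. $S^{-1} M$ has the last row all zeroes. Then the one-parameter subgroup

$$\lambda(t) = \left( S\, \mathrm{diag}\!\left(t, t, \ldots, t, t^{-(n-1)}\right) S^{-1},\ I_n \right)$$

sends $M$ to the zero matrix as $t \to 0$: since the last row of $S^{-1} M$ is zero, $\lambda(t) \cdot M = t M$. Later we will see more examples corresponding to each of the scaling problems.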
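The one-parameter subgroup in this example can be checked numerically. The snippet below is a small sketch on a hypothetical $2 \times 2$ instance: the matrices `M` and `S` and the helper name `one_psg` are chosen here for illustration. It relies on the reading that $S^{-1} M$ has a zero last row, in which case the left factor acts on $M$ simply as multiplication by $t$.

```python
import numpy as np

def one_psg(S, t):
    """Left factor of lambda(t): S diag(t, ..., t, t^{-(n-1)}) S^{-1}.
    It has determinant 1 for every non-zero t."""
    n = S.shape[0]
    D = np.diag([t] * (n - 1) + [t ** (-(n - 1))])
    return S @ D @ np.linalg.inv(S)

# Hypothetical singular matrix: the second row is twice the first.
M = np.array([[1.0, 2.0],
              [2.0, 4.0]])
# det(S) = 1 and S^{-1} M = [[1, 2], [0, 0]] has a zero last row.
S = np.array([[1.0, 0.0],
              [2.0, 1.0]])

for t in (0.5, 0.1, 0.01):
    image = one_psg(S, t) @ M          # right factor is the identity
    print(t, np.linalg.norm(image))    # the norm shrinks linearly in t
```

Although the group element itself blows up (one of its eigenvalues is $t^{-(n-1)}$), the image of $M$ tends to zero, which is exactly what membership in the null cone requires.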

Having understood how to prove if a given vector is in the null cone, we move on to study how to prove that a given vector is not in the null cone.

#### 3.2.2 Kempf-Ness theorem

Fix the action of a group $G$ on a vector space $V$. How does one prove that a vector $v$ is not in the null cone? We know that $v \notin \mathcal{N}_G(V)$ iff there is a non-constant homogeneous invariant polynomial $p$ s.t. $p(v) \neq 0$. Such a $p$ can serve as a witness that $v \notin \mathcal{N}_G(V)$. However, these polynomials typically have exponentially large degree (see [Der01]) and may not have any efficient description. An alternative witness is given by the Kempf-Ness theorem [KN79].

To state the Kempf-Ness theorem, we need to (informally) define something called a moment map, which relies on the following function,

$$f_v(g) = \| g \cdot v \|_2^2$$

This function defines the following optimization problem,

$$N(v) = \inf_{g \in G} f_v(g) \qquad (1)$$

Note that $N(v) = 0$ iff $v \in \mathcal{N}_G(V)$. Now the moment map at $v$, denoted by $\mu_G(v)$, is simply the gradient of the function $f_v$ “along the group action” at $e$ (the identity element of the group $G$). (Footnote 18: There are minor differences between this definition and how the moment map is usually defined.) We will not go into the specifics of the space in which $\mu_G(v)$ lives but instead do the moment map calculation for several examples. First let us state the Kempf-Ness theorem.

###### Theorem 3.3 (Kempf-Ness [KN79]).

$v \notin \mathcal{N}_G(V)$ iff there is a non-zero $w \in \overline{O_G(v)}$ s.t. $\mu_G(w) = 0$.

If $v \notin \mathcal{N}_G(V)$, then there exists a non-zero $w \in \overline{O_G(v)}$ which is of minimal norm and hence $\mu_G(w) = 0$. So this is the easy direction. The amazing part about the Kempf-Ness theorem is that any local minimum becomes a global minimum, i.e. if $\mu_G(w) = 0$ for some non-zero $w \in \overline{O_G(v)}$, then $v \notin \mathcal{N}_G(V)$, even though $\mu_G(w) = 0$ only guarantees that one cannot decrease the norm of $w$ by actions of group elements close to the identity (that is, “local” group actions). This smells of some kind of convexity and indeed, the function $f_v$ is geodesically convex (i.e. convex w.r.t. some appropriate metric on the group). We will not delve more into geodesic convexity or moment maps in this survey but refer the interested reader to [NM84, Woo11, HH12, GRS13].

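Let us return to the example of the left-right multiplication action of $SL_n(\mathbb{C}) \times SL_n(\mathbb{C})$ on $\mathrm{Mat}_n(\mathbb{C})$. Recall that $(A, B)$ sends $M$ to $A M B^T$ and $M$ is in the null cone iff it is singular. What is the moment map in this case? $\mu_G(M) = (P_1, P_2)$, where $P_1, P_2$ are traceless Hermitian matrices s.t.

$$\operatorname{tr}[P_1 Q_1] + \operatorname{tr}[P_2 Q_2] = \frac{d}{ds} \left\| \exp(s Q_1)\, M \exp(s Q_2^T) \right\|_F^2 \Big|_{s=0} = 2 \operatorname{tr}[M M^\dagger Q_1] + 2 \operatorname{tr}[(M^\dagger M)^T Q_2]$$

for all Hermitian traceless matrices $Q_1, Q_2$. (Footnote 19: It suffices to focus on Hermitian matrices since for skew-Hermitian matrices, their exponential is unitary and hence cannot change the norm. Also note that $Q \mapsto \operatorname{tr}[P Q]$ is an $\mathbb{R}$-linear map over the space of Hermitian matrices $Q$, when $P$ is Hermitian. A crucial point here which we have glossed over, but which the reader should nonetheless verify, is that if $L$ is an $\mathbb{R}$-linear map from Hermitian matrices to $\mathbb{R}$, then there exists a Hermitian $P$ s.t. for all Hermitian $Q$, $L(Q) = \operatorname{tr}[P Q]$. Here $M^\dagger$ denotes the conjugate transpose of the matrix $M$.) Thus

$$P_1 / 2 = M M^\dagger - \frac{\|M\|_F^2}{n} I_n \quad \text{and} \quad P_2 / 2 = (M^\dagger M)^T - \frac{\|M\|_F^2}{n} I_n$$

Hence $\mu_G(M) = 0$ is the same as saying $M$ is a scalar multiple of a unitary matrix. It is not hard to see that any non-singular $M$ can be brought to such a form by the left-right multiplication action.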
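The moment map formula for the left-right action can be sanity-checked numerically. The sketch below computes $P_1 = 2\,(M M^\dagger - \tfrac{\|M\|_F^2}{n} I_n)$ and $P_2 = 2\,((M^\dagger M)^T - \tfrac{\|M\|_F^2}{n} I_n)$ and compares $\operatorname{tr}[P_1 Q_1] + \operatorname{tr}[P_2 Q_2]$ against a finite-difference derivative of the squared norm. To avoid needing a matrix exponential, the check is restricted to diagonal traceless $Q_1, Q_2$, which is a simplifying assumption of this example; the helper names are hypothetical.

```python
import numpy as np

def moment_map(M):
    """Return (P1, P2) for the left-right action of SL_n x SL_n on M."""
    n = M.shape[0]
    c = np.linalg.norm(M, 'fro') ** 2 / n
    P1 = 2 * (M @ M.conj().T - c * np.eye(n))
    P2 = 2 * ((M.conj().T @ M).T - c * np.eye(n))
    return P1, P2

def norm_derivative(M, q1, q2, h=1e-6):
    """Central finite difference at s = 0 of ||exp(s Q1) M exp(s Q2^T)||_F^2,
    for DIAGONAL Q1 = diag(q1) and Q2 = diag(q2), whose exponentials are
    simply diagonal matrices of entrywise exponentials."""
    def f(s):
        left = np.diag(np.exp(s * np.asarray(q1, dtype=float)))
        right = np.diag(np.exp(s * np.asarray(q2, dtype=float)))
        return np.linalg.norm(left @ M @ right, 'fro') ** 2
    return (f(h) - f(-h)) / (2 * h)

# Hypothetical test data.
M = np.array([[1.0, 2.0], [3.0, 4.0]])
q1, q2 = [1.0, -1.0], [0.5, -0.5]          # traceless diagonal directions
P1, P2 = moment_map(M)
predicted = np.trace(P1 @ np.diag(q1)) + np.trace(P2 @ np.diag(q2))
```

For a scalar multiple of a unitary matrix, such as three times a rotation, both components of the moment map vanish, in line with the characterization above.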

In the next section, we will see what the Hilbert-Mumford criterion and the Kempf-Ness theorem look like for actions of $(\mathbb{C}^*)^n$. Readers wanting to get to the setting of scaling problems could skip the next section.

#### 3.2.3 Commutative group actions: Farkas’ lemma

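In this section, we play around with the Hilbert-Mumford criterion and the Kempf-Ness theorem and see what they give for actions of the group $(\mathbb{C}^*)^n$.

Fix vectors $\omega^{(1)}, \ldots, \omega^{(m)} \in \mathbb{Z}^n$. Then $(\mathbb{C}^*)^n$ acts on $\mathbb{C}^m$ as follows: $t = (t_1, \ldots, t_n)$ sends the basis vector $e_j$ to $\left(\prod_{i=1}^{n} t_i^{\omega^{(j)}_i}\right) e_j$. That is, $e_j$ is an eigenvector of the action of $t$ with eigenvalue $\prod_{i=1}^{n} t_i^{\omega^{(j)}_i}$. We urge the reader to prove that all actions of $(\mathbb{C}^*)^n$ look essentially like this.

What is the null cone for this action? Let us apply the Hilbert-Mumford criterion. Recall that all the one-parameter subgroups of $(\mathbb{C}^*)^n$ look like $\lambda(t) = (t^{a_1}, \ldots, t^{a_n})$ for some $a \in \mathbb{Z}^n$. Now fix $v \in \mathbb{C}^m$, where $v = \sum_{j=1}^{m} v_j e_j$ with $v_j \in \mathbb{C}$, and denote by $\mathrm{supp}(v)$ the support of $v$, i.e.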

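$$\mathrm{supp}(v) = \{ j \in [m] : v_j \neq 0 \}$$

Then the Hilbert-Mumford criterion (Theorem 3.2) tells us that $v \in \mathcal{N}_G(V)$ iff there is a one-parameter subgroup that drives $v$ to zero. That is, there exists $a \in \mathbb{Z}^n$ s.t.

$$\lim_{t \to 0} \prod_{i=1}^{n} t^{a_i \omega^{(j)}_i} = \lim_{t \to 0} t^{\langle a, \omega^{(j)} \rangle} = 0$$

for all $j \in \mathrm{supp}(v)$. Equivalently, we have:

###### Proposition 3.4.

$v \in \mathcal{N}_G(V)$ iff there exists $a \in \mathbb{Z}^n$ s.t. $\langle a, \omega^{(j)} \rangle > 0$ for all $j \in \mathrm{supp}(v)$.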

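Now let us see what the Kempf-Ness theorem says in this setting. By computing the moment map we see that it satisfies the following,

$$\langle \mu_G(v), b \rangle = \frac{d}{ds} \left\| (\exp(s b_1), \ldots, \exp(s b_n)) \cdot v \right\|_2^2 \Big|_{s=0} = 2 \sum_{j=1}^{m} |v_j|^2 \langle \omega^{(j)}, b \rangle$$

for all $b \in \mathbb{R}^n$. (Footnote 20: Again as before, it suffices to look at $b \in \mathbb{R}^n$, since the imaginary part (the exponential of it) does not change the norm.) Hence $\mu_G(v) = 2 \sum_{j=1}^{m} |v_j|^2 \omega^{(j)}$. Now the Kempf-Ness theorem (Theorem 3.3) says that $v \notin \mathcal{N}_G(V)$ iff there exists a non-zero $w \in \overline{O_G(v)}$ s.t. $\sum_{j=1}^{m} |w_j|^2 \omega^{(j)} = 0$. Note that if there exists a non-zero $w \in \overline{O_G(v)}$ s.t. $\mu_G(w) = 0$, then $0 \in \mathrm{conv}\{\omega^{(j)} : j \in \mathrm{supp}(v)\}$. So this matches the conclusion of Farkas’ lemma, which says that there exists $a$ s.t. $\langle a, \omega^{(j)} \rangle > 0$ for all $j \in \mathrm{supp}(v)$ iff $0 \notin \mathrm{conv}\{\omega^{(j)} : j \in \mathrm{supp}(v)\}$. The first part of Farkas’ lemma matches the case $v \in \mathcal{N}_G(V)$ via the Hilbert-Mumford criterion and the second part matches the case $v \notin \mathcal{N}_G(V)$ via the Kempf-Ness theorem!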
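The two sides of this duality can be illustrated on the smallest possible example. The sketch below uses a hypothetical action of $\mathbb{C}^*$ (so $n = 1$) on $\mathbb{C}^2$ with weights $+1$ and $-1$, i.e. $t \cdot (x, y) = (t x, t^{-1} y)$; the helper names and the example vectors are assumptions made for illustration. A Hilbert-Mumford certificate is an integer vector $a$ with positive pairing against every weight in the support, and the moment map is $2 \sum_j |v_j|^2 \omega^{(j)}$.

```python
def pairing(a, omega):
    """Standard inner product <a, omega> of two integer vectors."""
    return sum(x * y for x, y in zip(a, omega))

def drives_to_zero(a, weights, support):
    """Hilbert-Mumford certificate: <a, omega^(j)> > 0 for all j in supp(v)."""
    return all(pairing(a, weights[j]) > 0 for j in support)

def moment_map(v, weights):
    """mu(v) = 2 * sum_j |v_j|^2 omega^(j), returned as a list of length n."""
    n = len(weights[0])
    return [2 * sum(abs(v[j]) ** 2 * weights[j][i] for j in range(len(v)))
            for i in range(n)]

# Hypothetical example: t acts on (x, y) as (t * x, y / t).
weights = [(1,), (-1,)]

# v = (1, 0): support {0}; a = (1,) certifies that v is in the null cone.
in_null_cone = drives_to_zero((1,), weights, [0])

# v = (1, 1): support {0, 1}; the moment map vanishes at v itself,
# so by Kempf-Ness v is not in the null cone.
mu = moment_map((1, 1), weights)
```

In the second case no certificate can exist, because $a > 0$ and $-a > 0$ cannot hold together; equivalently, $0$ lies in the convex hull of the weights $+1$ and $-1$.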

### 3.3 Hilbert-Mumford, Kempf-Ness and scaling

In this section, we delve into the connection between geometric invariant theory and various scaling problems. Sections 3.3.1, 3.3.2 and 3.3.3 consider the consequences of the Hilbert-Mumford and Kempf-Ness theorems for the matrix, operator and tensor scaling problems, respectively.

#### 3.3.1 Matrix scaling

We elucidate here the connection between geometric invariant theory and matrix scaling. For the connection to invariant theory, we need a group action on a vector space. Given that the objects of study are non-negative real matrices, it is natural to guess the vector space would be $\mathbb{C}^{n \times n}$, the space of $n \times n$ complex matrices (given that we only discussed invariant theory with the base field being $\mathbb{C}$). But what is the group action? The group action is also almost given away by the definition of scaling. The first guess might be that the group is $(\mathbb{C}^*)^n \times (\mathbb{C}^*)^n$ and it acts on $\mathbb{C}^{n \times n}$ as follows,

$$((t_1, \ldots, t_n), (s_1, \ldots, s_n)) \cdot M = \mathrm{diag}(t_1, \ldots, t_n) \, M \, \mathrm{diag}(s_1, \ldots, s_n)$$

It turns out, however, that the null cone for this action is the whole of $\mathbb{C}^{n \times n}$ (verify this). Still, the above guess comes pretty close, and the right thing is obtained by looking at an appropriate normalization. The right group is the subgroup of pairs $((t_1, \ldots, t_n), (s_1, \ldots, s_n))$ satisfying $\prod_i t_i = \prod_j s_j = 1$, and it acts on $\mathbb{C}^{n \times n}$ by the same action as above (it won’t be immediately clear why imposing a determinant constraint on the group elements is the right thing to do).

Now let us see what the Hilbert-Mumford criterion (Theorem 3.2) and Kempf-Ness theorem (Theorem 3.3) say about this group action.

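Recall from Section 3.2.1 that all the one-parameter subgroups of this group look like

$$\lambda(t) = ((t^{a_1}, \ldots, t^{a_n}), (t^{b_1}, \ldots, t^{b_n}))$$

for some integers $a_1, \ldots, a_n, b_1, \ldots, b_n$ satisfying $\sum_i a_i = \sum_j b_j = 0$. Let us denote by $\mathrm{supp}(M)$ the support of $M$, i.e.

$$\mathrm{supp}(M) = \{ (i,j) \in [n] \times [n] : M_{i,j} \neq 0 \}$$

Then the Hilbert-Mumford criterion says that $M$ is in the null cone iff there exists a one-parameter subgroup $\lambda$ as above s.t.

$$\lim_{t \to 0} \lambda(t) \cdot M = 0$$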

Equivalently,

###### Corollary 3.5.

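$M$ is in the null cone iff there exist integers $a_1, \ldots, a_n, b_1, \ldots, b_n$ satisfying $\sum_i a_i = \sum_j b_j = 0$ s.t. $a_i + b_j > 0$ for all $(i,j) \in \mathrm{supp}(M)$.

We encourage the reader to prove that the above corollary implies that $M$ is in the null cone iff the bipartite graph defined by $\mathrm{supp}(M)$ has no perfect matching.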
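Assuming the equivalence left to the reader above, null cone membership for the matrix scaling action can be tested combinatorially. The sketch below uses a standard augmenting-path bipartite matching routine on $\mathrm{supp}(M)$ and also verifies a hand-picked certificate $(a, b)$ of the kind appearing in Corollary 3.5; the function names and the example matrices are hypothetical.

```python
def has_perfect_matching(M):
    """Augmenting-path matching on the bipartite graph with edges (i, j) where M[i][j] != 0."""
    n = len(M)
    row_matched_to_col = [-1] * n

    def try_row(i, seen):
        for j in range(n):
            if M[i][j] != 0 and not seen[j]:
                seen[j] = True
                if row_matched_to_col[j] == -1 or try_row(row_matched_to_col[j], seen):
                    row_matched_to_col[j] = i
                    return True
        return False

    # Every row must be matched; all() stops at the first row that cannot be.
    return all(try_row(i, [False] * n) for i in range(n))

def is_certificate(M, a, b):
    """Check sum(a) = sum(b) = 0 and a_i + b_j > 0 on the support of M."""
    n = len(M)
    return (sum(a) == 0 and sum(b) == 0 and
            all(a[i] + b[j] > 0
                for i in range(n) for j in range(n) if M[i][j] != 0))

# Rows 1 and 2 only touch column 0, so there is no perfect matching.
M = [[1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
print(has_perfect_matching(M))
print(is_certificate(M, a=(4, -2, -2), b=(4, -2, -2)))
```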

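Now let us apply the Kempf-Ness theorem. First let us calculate the moment map. $\mu_G(M) = (p, q)$, where $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^n$, and it satisfies the following,

$$\langle p, d \rangle + \langle q, e \rangle = 2 \sum_{i,j} |M_{i,j}|^2 (d_i + e_j) = 2 \langle r_M, d \rangle + 2 \langle c_M, e \rangle$$

for all $d, e \in \mathbb{R}^n$ satisfying $\sum_i d_i = \sum_j e_j = 0$. Here $r_M$ and $c_M$ are the vectors of row and column sums of the matrix with entries $|M_{i,j}|^2$, respectively. Thus $p = 2(r_M - \mathrm{avg}_M \mathbf{1})$ and $q = 2(c_M - \mathrm{avg}_M \mathbf{1})$, where

$$\mathrm{avg}_M = \sum_{i=1}^n r_M(i)/n = \sum_{j=1}^n c_M(j)/n$$

and $\mathbf{1}$ is the all $1$’s vector. Now the Kempf-Ness theorem says that $M$ is not in the null cone iff there exists a non-zero $N$ in the orbit closure of $M$ s.t. $\mu_G(N) = 0$. Equivalently,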

###### Corollary 3.6.

$M$ is not in the null cone iff the non-negative real matrix with entries $|M_{i,j}|^2$ is scalable.

Corollaries 3.5 and 3.6 together yield a proof of Theorem 2.5.
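As a small illustration of the moment map computed in this section, the following Python sketch evaluates $(p, q)$ from the row and column sums of the entrywise squared moduli, under the assumption that $p$ and $q$ are these sums shifted by their common average and doubled. The function name and the example matrices are hypothetical.

```python
def matrix_moment_map(M):
    """Return (p, q): row and column sums of |M_ij|^2, shifted by their average and doubled."""
    n = len(M)
    A = [[abs(x) ** 2 for x in row] for row in M]           # entrywise squared moduli
    r = [sum(row) for row in A]                             # row sums
    c = [sum(A[i][j] for i in range(n)) for j in range(n)]  # column sums
    avg = sum(r) / n                                        # equals sum(c) / n
    p = [2 * (ri - avg) for ri in r]
    q = [2 * (cj - avg) for cj in c]
    return p, q

# All squared moduli equal 1, so row and column sums agree and (p, q) vanishes.
print(matrix_moment_map([[1, 1j], [1, -1]]))

# Unequal row and column sums give a non-zero moment map at this particular M.
print(matrix_moment_map([[1, 1], [0, 1]]))
```

Note that a non-zero value at $M$ itself does not place $M$ in the null cone: the Kempf-Ness theorem only asks for a zero of the moment map somewhere in the orbit closure.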

#### 3.3.2 Operator scaling

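For the operator scaling problem, the vector space is clear: $(\mathbb{C}^{n \times n})^m$, i.e. $m$ copies of the space of $n \times n$ complex matrices. The group action is also clear (except for the normalization to determinant $1$). The group is $\mathrm{SL}_n(\mathbb{C}) \times \mathrm{SL}_n(\mathbb{C})$ and it acts on $(\mathbb{C}^{n \times n})^m$ as follows,

$$(B, C) \cdot (A_1, \ldots, A_m) = (B A_1 C^T, \ldots, B A_m C^T)$$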

This action is sometimes called the left-right action. We leave the details of the Hilbert-Mumford criterion and Kempf-Ness theorem to the reader and only say that they yield the following corollaries which together imply Theorem 2.13.

###### Corollary 3.7 (Hilbert-Mumford for left-right action).

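$(A_1, \ldots, A_m)$ is not in the null cone iff $(A_1, \ldots, A_m)$ is dimension non-decreasing (Definition 2.12).

###### Corollary 3.8 (Kempf-Ness for left-right action).

$(A_1, \ldots, A_m)$ is not in the null cone iff $(A_1, \ldots, A_m)$ is scalable.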
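The left-right action is easy to experiment with. The sketch below implements it for small integer matrices in plain Python and checks one simple consequence of the determinant normalization: $\det(A_1)$ is unchanged when $\det(B) = \det(C) = 1$, since $\det(B A_1 C^T) = \det(B)\det(A_1)\det(C)$. The helper names and the example matrices are hypothetical.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def left_right_action(B, C, As):
    """(B, C) . (A_1, ..., A_m) = (B A_1 C^T, ..., B A_m C^T)."""
    Ct = transpose(C)
    return [matmul(matmul(B, A), Ct) for A in As]

def det2(X):
    """Determinant of a 2 x 2 matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

# Both group elements have determinant 1.
B = [[1, 2], [0, 1]]
C = [[2, 1], [1, 1]]
As = [[[1, 2], [3, 4]], [[0, 1], [0, 0]]]

image = left_right_action(B, C, As)
print(image)
print(det2(As[0]), det2(image[0]))   # the two determinants agree
```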

#### 3.3.3 Tensor scaling

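For the tensor scaling problem, the vector space is the space of $m$-tuples $(A_1, \ldots, A_m)$ of complex tensors of order $d$. The group is a product of $d$ special linear groups (of appropriate sizes), one for each tensor leg, which acts on this space as follows,

$$(g_1, \ldots, g_d) \cdot (A_1, \ldots, A_m) = ((g_1 \otimes \cdots \otimes g_d) A_1, \ldots, (g_1 \otimes \cdots \otimes g_d) A_m)$$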

Again we will leave the details of the Hilbert-Mumford criterion and Kempf-Ness theorem to the reader and only say that they yield the following corollaries, which together imply Theorem 2.24.

###### Corollary 3.9 (Hilbert-Mumford for tensor action).

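$(A_1, \ldots, A_m)$ is in the null cone iff there is a tuple of invertible matrices $(g_1, \ldots, g_d)$ (of appropriate sizes) s.t. $(g_1, \ldots, g_d) \cdot (A_1, \ldots, A_m)$ is deficient (Definition 2.23).

###### Corollary 3.10 (Kempf-Ness for tensor action).

$(A_1, \ldots, A_m)$ is not in the null cone iff $(A_1, \ldots, A_m)$ is scalable.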

## 4 Analysis of scaling algorithms

In this section, we provide a unified analysis of the scaling algorithms described in Sections 2.1, 2.2 and 2.3. We will first design a common template and analysis for Algorithms 1, 2 and 3, and then look at each case separately to fill in the details that need to be done differently. Most of the analysis will be common and the only difference will be the choice of a potential function (although the source of all the potential functions will be invariant theory). Algorithm 4 contains a common template for all three scaling algorithms.