# Deterministic Discrepancy Minimization via the Multiplicative Weight Update Method

Avi Levy (avius@uw.edu), Harishchandra Ramadas (ramadas@math.washington.edu), Thomas Rothvoss (rothvoss@uw.edu; supported by NSF grant 1420180 with title “Limitations of convex relaxations in combinatorial optimization”, an Alfred P. Sloan Research Fellowship and a David & Lucile Packard Foundation Fellowship). File compiled on July 23, 2019.
University of Washington, Seattle
###### Abstract

A well-known theorem of Spencer shows that any set system with $n$ sets over $n$ elements admits a coloring of discrepancy $O(\sqrt{n})$. While the original proof was non-constructive, recent progress brought polynomial time algorithms by Bansal, Lovett and Meka, and Rothvoss. All those algorithms are randomized, even though Bansal’s algorithm admitted a complicated derandomization.

We propose an elegant deterministic polynomial time algorithm that is inspired by Lovett-Meka as well as the Multiplicative Weight Update method. The algorithm iteratively updates a fractional coloring while controlling the exponential weights that are assigned to the set constraints.

A conjecture by Meka suggests that Spencer’s bound can be generalized to symmetric matrices. We prove that $n \times n$ matrices that are block diagonal with block size $q$ admit a coloring of discrepancy $O(\sqrt{n} \cdot \sqrt{\log(q)})$.

Bansal, Dadush and Garg recently gave a randomized algorithm to find a vector $x$ with entries in $\{-1,1\}$ with $\|Ax\|_\infty \le O(\sqrt{\log n})$ in polynomial time, where $A$ is any matrix whose columns have length at most 1. We show that our method can be used to deterministically obtain such a vector.

## 1 Introduction

The classical setting in (combinatorial) discrepancy theory is that a set system $S_1,\ldots,S_m \subseteq \{1,\ldots,n\}$ over a ground set of $n$ elements is given and the goal is to find a bi-coloring $\chi : \{1,\ldots,n\} \to \{-1,+1\}$ so that the worst imbalance $\max_{i=1,\ldots,m} |\chi(S_i)|$ of a set is minimized. Here we abbreviate $\chi(S_i) := \sum_{j \in S_i} \chi(j)$. A seminal result of Spencer [Spe85] says that there is always a coloring $\chi$ where the imbalance is at most $O(\sqrt{n \log(2m/n)})$ for $m \ge n$. The proof of Spencer is based on the partial coloring method that was first used by Beck in 1981 [Bec81]. The argument applies the pigeonhole principle to obtain that many of the $2^n$ many colorings $\chi, \chi'$ must satisfy $|\chi(S_i) - \chi'(S_i)| \le O(\sqrt{n \log(2m/n)})$ for all sets $S_i$. Then one can take the difference between such a pair of colorings with $|\{j : \chi(j) \ne \chi'(j)\}| \ge \Omega(n)$ to obtain a partial coloring of low discrepancy. This partial coloring can be used to color half of the elements. Then one iterates the argument and again finds a partial coloring. As the remaining set system has only half the elements, the bound in the second iteration becomes better by a constant factor. This process is repeated until all elements are colored; the total discrepancy is then given by a convergent series with value $O(\sqrt{n \log(2m/n)})$. More general arguments based on convex geometry were given by Gluskin [Glu89] and by Giannopoulos [Gia97], but their arguments still relied on a pigeonhole principle with exponentially many pigeons and pigeonholes and did not lead to polynomial time algorithms.
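To make the objective concrete, the following minimal Python sketch evaluates the discrepancy of a given coloring. The function name `discrepancy`, the 0-based element indices and the toy instance are illustrative assumptions, not notation from the paper.

```python
from typing import Sequence


def discrepancy(sets: Sequence[Sequence[int]], chi: Sequence[int]) -> int:
    """Return the maximum imbalance |sum_{j in S} chi[j]| over all sets S."""
    if any(c not in (-1, 1) for c in chi):
        raise ValueError("a coloring must assign -1 or +1 to every element")
    return max((abs(sum(chi[j] for j in s)) for s in sets), default=0)


# Hypothetical toy instance: 3 sets over 4 elements (0-based indices).
toy_sets = [[0, 1], [1, 2, 3], [0, 1, 2, 3]]
balanced = discrepancy(toy_sets, [1, -1, 1, -1])
monochromatic = discrepancy(toy_sets, [1, 1, 1, 1])
```

On this toy instance the alternating coloring leaves every set nearly balanced, while the all-ones coloring has imbalance equal to the size of the largest set.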

In fact, Alon and Spencer [AS08] even conjectured that finding a coloring satisfying Spencer’s theorem would be intractable. In a breakthrough, Bansal [Ban10] showed that one could set up a semi-definite program (SDP) to find at least a vector coloring, using Spencer’s Theorem to argue that the SDP has to be feasible. He then argued that a random walk guided by updated solutions to that SDP would find a coloring of discrepancy $O(\sqrt{n})$ in the balanced case $m = n$. However, his approach needed a very careful choice of parameters.

A simpler and truly constructive approach that does not rely on Spencer’s argument was provided by Lovett and Meka [LM12], who showed that for , any polytope of the form contains a point that has at least half of the coordinates in . Here it is important that the polytope is large enough; if the normal vectors are scaled to unit length, then the argument requires that holds. Their algorithm is surprisingly simple: start a Brownian motion at and stay inside any face that is hit at any time. They showed that this random walk eventually reaches a point with the desired properties.

More recently, the third author provided another algorithm which simply consists of taking a random Gaussian vector and then computing the nearest point to in . In contrast to both of the previous algorithms, this argument extends to the case that where is any symmetric convex set with a large enough Gaussian measure.

However, all three algorithms described above are randomized, although Bansal and Spencer [BS13] could derandomize the original arguments by Bansal. They showed that the random walk already works if the directions are chosen from a 4-wise independent distribution, which then allows a polynomial time derandomization.

In our algorithm, we think of the process more as a multiplicative weight update procedure, where each constraint has a weight that increases if the current point moves in the direction of its normal vector. The potential function we consider is the sum of those weights. Then in each step we simply need to select an update direction in which the potential function does not increase.

The multiplicative weight update method is a meta-algorithm that originated in game theory but has found numerous recent applications in theoretical computer science and machine learning. In the general setting one imagines having a set of experts (in our case the set constraints) that are assigned an exponential weight that reflects the value of the gain/loss that expert’s decisions had in previous rounds. Then in each iteration one selects an update, which can be a convex combination of experts, where the convex coefficient is proportional to the current weight of the expert. (We should mention for the sake of completeness that our update choice is not a convex combination of the experts weighted by their exponential weights.) We refer to the very readable survey of Arora, Hazan and Kale [AHK12] for a detailed discussion.
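For readers unfamiliar with the general scheme, here is a minimal sketch of the textbook exponential-weights (Hedge) update. It illustrates the generic meta-algorithm only and is not the update rule of this paper; the function name `hedge_weights`, the learning rate `eta` and the loss data are assumptions made for the example.

```python
import math
from typing import List, Sequence


def hedge_weights(losses: Sequence[Sequence[float]], eta: float) -> List[float]:
    """Run the classical Hedge update and return the final normalized weights.

    losses[t][i] is the loss in [-1, 1] of expert i in round t.
    """
    n_experts = len(losses[0])
    weights = [1.0] * n_experts
    for round_losses in losses:
        # Multiplicative update: experts with larger loss are discounted more.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data: expert 0 always incurs loss 1, expert 1 never does.
final = hedge_weights([[1.0, 0.0]] * 10, eta=0.5)
```

After ten rounds almost all of the normalized weight sits on the expert that never incurred a loss, which is the behavior the exponential weights are designed to produce.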

### 1.1 Related work

If we have a set system where each element lies in at most sets, then the partial coloring technique described above can be used to find a coloring of discrepancy  [Sri97]. A linear programming approach of Beck and Fiala [BF81] showed that the discrepancy is bounded by , independent of the size of the set system. On the other hand, there is a non-constructive approach of Banaszczyk [Ban98] that provides a bound of using convex geometry arguments. Only very recently, a corresponding algorithmic bound was found by Bansal, Dadush and Garg [BDG16]. A conjecture of Beck and Fiala says that the correct bound should be . This bound can be achieved for the vector coloring version, see Nikolov [Nik13].

More generally, the theorem of Banaszczyk [Ban98] shows that for any convex set with Gaussian measure at least and any set of vectors of length , there exist signs so that .

A set of permutations on symbols induces a set system with sets given by the prefix intervals. One can use the partial coloring method to find a discrepancy coloring [SST], while a linear programming approach gives a discrepancy [Boh90]. In fact, for any one can always color half of the elements with a discrepancy of — this even holds for each induced sub-system [SST]. Still, [NNN12] constructed 3 permutations requiring a discrepancy of to color all elements.

Also the recent proof of the Kadison-Singer conjecture by Marcus, Spielman and Srivastava [MSS13] can be seen as a discrepancy result. They show that a set of vectors with can be partitioned into two halves so that for where and is the identity matrix. Their method is based on interlacing polynomials; no polynomial time algorithm is known to find the desired partition.

For a symmetric matrix , let denote the largest singular value; in other words, the largest absolute value of any eigenvalue. The discrepancy question can be generalized from sets to symmetric matrices with by defining . Note that picking 0/1 diagonal matrices corresponding to the incidence vector of element would exactly encode the set coloring setting. Again the interesting case is ; in contrast to the diagonal case it is only known that the discrepancy is bounded by , which is already attained by a random coloring. Meka (see the blog post https://windowsontheory.org/2014/02/07/discrepancy-and-beating-the-union-bound/) conjectured that the discrepancy of matrices can be bounded by .

For a very readable introduction to discrepancy theory, we recommend Chapter 4 in the book of Matoušek [Mat99] or the book of Chazelle [Cha01].

### 1.2 Our contribution

Our main result is a deterministic version of the theorem of Lovett and Meka:

###### Theorem 1.

Let unit vectors, be a starting point and let be parameters so that . Then there is a deterministic algorithm that computes a vector with for all and , in time .

By setting this yields a deterministic version of Spencer’s theorem in the balanced case :

###### Corollary 2.

Given sets over elements, there is a deterministic algorithm that finds a -discrepancy coloring in time .

Furthermore, Spencer’s hyperbolic cosine algorithm [Spe77] can also be interpreted as a multiplicative weight update argument. However, the techniques of [Spe77] are only enough for a discrepancy bound for the balanced case. Our hope is that similar arguments can be applied to solve open problems such as whether there is an extension of Spencer’s result to balance matrices [Zou12] and to better discrepancy minimization techniques in the Beck-Fiala setting. To demonstrate the versatility of our arguments, we show an extension to the matrix discrepancy case.

We say that a symmetric matrix is -block diagonal if it can be written as , where each is a symmetric matrix.

###### Theorem 3.

For given -block diagonal matrices with for one can compute a coloring with deterministically in time .

Finally, we can also give the first deterministic algorithm for the result of Bansal, Dadush and Garg [BDG16].

###### Theorem 4.

Let be a matrix with for all columns . Then there is a deterministic algorithm to find a coloring with in time .

While [BDG16] need to solve a semidefinite program in each step of their random walk, our algorithm does not require solving any SDPs. Note that we do not optimize running times such as by using fast matrix multiplication.

In the Beck-Fiala setting, we are given a set system over $n$ elements, where each element is contained in at most $t$ subsets. Theorem 4 then provides the first polynomial-time deterministic algorithm that produces a coloring with discrepancy $O(\sqrt{t \log n})$; we simply choose the matrix $A$ whose rows are the incidence vectors of members of the set system, scaled by $1/\sqrt{t}$.
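The reduction can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names `scaled_incidence_matrix` and `column_norms` are hypothetical, and the scaling factor $1/\sqrt{t}$ (with $t$ the maximum number of sets containing a single element) is our reading of the normalization needed so that every column has length at most 1.

```python
import math
from typing import List


def scaled_incidence_matrix(sets: List[List[int]], n: int) -> List[List[float]]:
    """Rows are incidence vectors of the sets, scaled by 1/sqrt(t),
    where t is the maximum number of sets containing any single element."""
    degree = [0] * n
    for s in sets:
        for j in s:
            degree[j] += 1
    t = max(max(degree, default=0), 1)  # guard against an empty system
    scale = 1.0 / math.sqrt(t)
    rows = []
    for s in sets:
        members = set(s)
        rows.append([scale if j in members else 0.0 for j in range(n)])
    return rows


def column_norms(matrix: List[List[float]], n: int) -> List[float]:
    """Euclidean length of every column of the matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]


# Hypothetical instance: element 1 lies in three sets, so t = 3.
toy_sets = [[0, 1], [1, 2], [1, 3], [0, 3]]
A = scaled_incidence_matrix(toy_sets, n=4)
norms = column_norms(A, n=4)
```

A column belonging to an element of degree $d$ has length $\sqrt{d/t} \le 1$, so the scaled matrix satisfies the column-length hypothesis, and a bound on $\|Ax\|_\infty$ translates back to the set system after multiplying by $\sqrt{t}$.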

For space reasons, we defer the proof of Theorem 3 to Appendix B.

## 2 The algorithm for partial coloring

We will now describe the algorithm proving Theorem 1. First note that for any we can remove the constraint , as it does not cut off any point in . Thus we assume without loss of generality that . Let denote the step size of our algorithm. The algorithm will run for iterations, each of computational cost . Note that so the algorithm terminates in iterations. The total runtime is hence .

For a symmetric matrix we know that an eigendecomposition can be computed in time . Here is the th eigenvalue of and is the corresponding eigenvector with . We make the convention that the eigenvalues are sorted as . The algorithm is as follows:

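1. Set weights $w_i^{(0)} := \exp(-\lambda_i^2)$ for all $i = 1,\ldots,m$.

2. FOR $t = 0$ TO $\infty$ DO

   3. Define the following subspaces:

      - $U_1^{(t)} := \mathrm{span}\{e_j : j \in A^{(t)}\}$, where $A^{(t)}$ is the set of coordinates that are not yet fully colored.

      - $U_2^{(t)} := \{x \in \mathbb{R}^n : \langle x, x^{(t)}\rangle = 0\}$.

      - $U_3^{(t)} := \{x \in \mathbb{R}^n : \langle v_i, x\rangle = 0 \;\; \forall i \in I^{(t)}\}$. Here $I^{(t)} \subseteq [m]$ are the $|I^{(t)}| = \frac{n}{16}$ indices with maximum weight $w_i^{(t)}$.

      - $U_4^{(t)} := \{x \in \mathbb{R}^n : \langle v_i, x\rangle = 0 \;\; \forall i \in [m] \text{ with } \lambda_i \le 1\}$.

      - $U_5^{(t)} := \{x \in \mathbb{R}^n : \langle \sum_{i=1}^m \lambda_i w_i^{(t)} \rho_i v_i, x\rangle = 0\}$, with $\rho_i := \exp(-4\delta^2\lambda_i^2/n)$.

      - $U_6^{(t)}$, the span of the eigenvectors belonging to the $n - \frac{n}{16}$ smallest eigenvalues of $M^{(t)}$, for $M^{(t)} := \sum_{i=1}^m w_i^{(t)} \lambda_i^2 v_i v_i^T$.

      - $U^{(t)} := U_1^{(t)} \cap \ldots \cap U_6^{(t)}$.

   4. Let $z^{(t)}$ be any unit vector in $U^{(t)}$.

   5. Choose a maximal $\alpha \in (0,1]$ so that $x^{(t+1)} := x^{(t)} + \delta \cdot y^{(t)} \in [-1,1]^n$, with $y^{(t)} := \alpha \cdot z^{(t)}$.

   6. Update $w_i^{(t+1)} := w_i^{(t)} \cdot \exp(\lambda_i \delta \langle v_i, y^{(t)}\rangle) \cdot \rho_i$ for all $i = 1,\ldots,m$.

   7. Let $A^{(t+1)} := \{j \in [n] : -1 < x_j^{(t+1)} < 1\}$. If $|A^{(t+1)}| < \frac{n}{2}$, then set $T := t+1$ and stop.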

The intuition is that we maintain weights for each constraint that increase exponentially with the one-sided discrepancy . Those weights are discounted in each iteration by a factor that is slightly less than 1 — with a bigger discount for constraints with a larger parameter . The subspaces and ensure that the length of is monotonically increasing and fully colored elements remain fully colored.
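The step-length rule in step (5) can be illustrated with a small helper. This is a sketch under the assumption that the step has the form $x + \alpha \cdot \delta \cdot z$ with $\alpha \in [0,1]$ and that the current point already lies in $[-1,1]^n$; the function name `max_step_fraction` and the example vectors are hypothetical.

```python
from typing import Sequence


def max_step_fraction(x: Sequence[float], z: Sequence[float], delta: float) -> float:
    """Largest alpha in [0, 1] with x + alpha * delta * z inside [-1, 1]^n.

    Assumes x already lies in [-1, 1]^n and delta > 0.
    """
    alpha = 1.0
    for xj, zj in zip(x, z):
        step = delta * zj
        if step > 0:
            # Moving up: stop at the upper face x_j = +1.
            alpha = min(alpha, (1.0 - xj) / step)
        elif step < 0:
            # Moving down: stop at the lower face x_j = -1.
            alpha = min(alpha, (-1.0 - xj) / step)
    return max(alpha, 0.0)


# Hypothetical example: the second coordinate is close to the boundary +1.
alpha_cut = max_step_fraction([0.0, 0.95], [0.6, 0.8], delta=0.1)
alpha_full = max_step_fraction([0.0, 0.0], [0.6, 0.8], delta=0.1)
```

Whenever the returned fraction is below 1, at least one coordinate lands exactly on the boundary of the cube, which is why shortened steps can occur at most once per coordinate.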

### 2.1 Bounding the number of iterations

First, note that if the algorithm terminates, then at least half of the variables in $x^{(T)}$ will be either $-1$ or $+1$. In particular, once a variable is set to $\pm 1$, it is removed from the set of active variables and the subsequent updates will leave those coordinates invariant.

Next we bound the number of iterations. Here we use that the algorithm always makes a step of length $\delta$ orthogonal to the current position, except for the steps where it hits the boundary.

###### Lemma 5.

The algorithm terminates after iterations.

###### Proof.

First, we can analyze the length increase

$$\|x^{(t+1)}\|_2^2 = \|x^{(t)} + \delta \cdot y^{(t)}\|_2^2 = \|x^{(t)}\|_2^2 + 2\delta \underbrace{\langle x^{(t)}, y^{(t)}\rangle}_{=0} + \delta^2 \|y^{(t)}\|_2^2,$$

using that . Whenever , we have . It happens that at most times, simply because in each such iteration must decrease by at least one. We know that . Suppose for the sake of contradiction that , then , which is impossible. We can hence conclude that the algorithm will terminate in step (7) after at most iterations. ∎

### 2.2 Properties of the subspace U(t)

One obvious condition to make the algorithm work is to guarantee that the subspace satisfies . In fact, its dimension will even be linear in .

###### Lemma 6.

In any iteration $t$, one has $\dim(U^{(t)}) \ge \frac{n}{8}$.

###### Proof.

We simply need to account for all linear constraints that define and we get

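$$\dim(U^{(t)}) \ge |A^{(t)}| - |I^{(t)}| - |\{i : \lambda_i \le 1\}| - \frac{n}{16} - 2 \ge \frac{n}{2} - \frac{n}{16} - \frac{n}{8} - \frac{n}{16} - 2 \ge \frac{n}{8}$$

assuming that $n \ge 16$. ∎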

Another crucial property will be that every vector in has a bounded quadratic error term:

###### Lemma 7.

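For each unit vector $y \in U^{(t)}$ one has $y^T M^{(t)} y \le \frac{16}{n} \sum_{i=1}^m w_i^{(t)} \lambda_i^2$.

###### Proof.

We have $\mathrm{Tr}[v_i v_i^T] = \|v_i\|_2^2 = 1$ since each $v_i$ is a unit vector, hence $\mathrm{Tr}[M^{(t)}] = \sum_{i=1}^m w_i^{(t)} \lambda_i^2$. Because $M^{(t)}$ is positive semidefinite, all of its eigenvalues are nonnegative. Then by Markov’s inequality at most a $\frac{1}{16}$ fraction of the eigenvalues can be larger than $\frac{16}{n} \mathrm{Tr}[M^{(t)}]$. The claim follows as $U^{(t)}$ is contained in the span of the eigenvectors belonging to the smallest eigenvalues, which means $y^T M^{(t)} y \le \frac{16}{n} \mathrm{Tr}[M^{(t)}]$ for every unit vector $y \in U^{(t)}$. ∎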
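The Markov-type counting step in this proof can be written out as a short worked inequality. The symbols $\mu_1, \ldots, \mu_n \ge 0$ for the eigenvalues of the positive semidefinite matrix $M$ and $k$ for the number of "large" eigenvalues are notation introduced only for this sketch.

$$
k \cdot \frac{16}{n}\,\mathrm{Tr}[M] \;\le\; \sum_{j \,:\, \mu_j > \frac{16}{n}\mathrm{Tr}[M]} \mu_j \;\le\; \sum_{j=1}^n \mu_j \;=\; \mathrm{Tr}[M]
\quad\Longrightarrow\quad k \le \frac{n}{16} \ \text{ whenever } \mathrm{Tr}[M] > 0 .
$$

Consequently, discarding the eigenvectors of the $\frac{n}{16}$ largest eigenvalues costs at most $\frac{n}{16}$ dimensions, and any unit vector $y$ in the span of the remaining eigenvectors satisfies $y^T M y \le \frac{16}{n}\,\mathrm{Tr}[M]$.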

### 2.3 The potential function

So far, we have defined the weights by iterative update steps, but it is not hard to verify that in each iteration one has the explicit expression

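$$w_i^{(t)} = \exp\Big(\lambda_i \langle v_i, x^{(t)} - x^{(0)}\rangle - \lambda_i^2 \cdot \Big(1 + t \cdot \frac{4\delta^2}{n}\Big)\Big). \tag{1}$$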
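The closed form (1) can be checked numerically against the iterative update for a single constraint. The sketch below assumes the initial weight $\exp(-\lambda_i^2)$ (that is, (1) at $t = 0$) and the per-step update $w \leftarrow w \cdot \exp(\lambda_i \delta \langle v_i, y\rangle) \cdot \exp(-4\delta^2\lambda_i^2/n)$, both read off from (1); the step directions are arbitrary random vectors, since the identity does not depend on how they are chosen.

```python
import math
import random


def closed_form_weight(lam, v, x, x0, t, delta, n):
    """Right-hand side of (1) for a single constraint."""
    inner = sum(vj * (xj - x0j) for vj, xj, x0j in zip(v, x, x0))
    return math.exp(lam * inner - lam ** 2 * (1 + t * 4 * delta ** 2 / n))


random.seed(0)
n, delta, lam = 5, 0.1, 1.5
v = [1 / math.sqrt(n)] * n                      # a unit vector
x0 = [0.0] * n
x = list(x0)
w = math.exp(-lam ** 2)                         # assumed initial weight: (1) at t = 0
rho = math.exp(-4 * delta ** 2 * lam ** 2 / n)  # assumed per-step discount factor

max_gap = 0.0
for t in range(1, 21):
    y = [random.uniform(-1, 1) for _ in range(n)]  # arbitrary step direction
    x = [xj + delta * yj for xj, yj in zip(x, y)]
    w *= math.exp(lam * delta * sum(vj * yj for vj, yj in zip(v, y))) * rho
    gap = abs(w - closed_form_weight(lam, v, x, x0, t, delta, n))
    max_gap = max(max_gap, gap)
```

The two quantities agree up to floating-point rounding, because the exponents of the per-step factors telescope into the inner product with $x^{(t)} - x^{(0)}$ plus $t$ copies of the discount.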

Inspired by the multiplicative weight update method, we consider the potential function that is simply the sum of the individual weights. At the beginning of the algorithm we have using the assumption in Theorem 1. Next, we want to show that the potential function does not increase. Here the choice of the subspaces and will be crucial to control the error.

###### Lemma 8.

In each iteration one has .

###### Proof.

Let us abbreviate $\rho_i := \exp(-4\delta^2\lambda_i^2/n)$ as the discount factor for the $i$th constraint. Note that in particular and . The change in one step can be analyzed as follows:

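$$
\begin{aligned}
\Phi^{(t+1)} &= \sum_{i=1}^m w_i^{(t+1)} = \sum_{i=1}^m w_i^{(t)} \cdot \exp\big(\lambda_i \delta \langle v_i, y^{(t)}\rangle\big) \cdot \rho_i \\
&\stackrel{(*)}{\le} \sum_{i=1}^m w_i^{(t)} \cdot \big(1 + \lambda_i \delta \langle v_i, y^{(t)}\rangle + \lambda_i^2 \delta^2 \langle v_i, y^{(t)}\rangle^2\big) \cdot \rho_i \\
&= \sum_{i=1}^m w_i^{(t)} \cdot \rho_i + \delta \underbrace{\Big\langle \sum_{i=1}^m \lambda_i w_i^{(t)} \rho_i v_i, \, y^{(t)} \Big\rangle}_{=0 \text{ since } y^{(t)} \in U_5^{(t)}} + \delta^2 \sum_{i=1}^m w_i^{(t)} \lambda_i^2 \underbrace{\rho_i}_{\le 1} \langle v_i, y^{(t)}\rangle^2 \\
&\le \sum_{i=1}^m w_i^{(t)} \cdot \rho_i + \delta^2 \cdot (y^{(t)})^T M^{(t)} y^{(t)} \stackrel{(**)}{\le} \sum_{i=1}^m w_i^{(t)} \cdot \rho_i + \delta^2 \frac{16}{n} \sum_{i=1}^m w_i^{(t)} \lambda_i^2 \\
&\stackrel{(***)}{\le} \sum_{i=1}^m w_i^{(t)} = \Phi^{(t)}.
\end{aligned}
$$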

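In $(*)$, we use the inequality $e^x \le 1 + x + x^2$ for $|x| \le 1$ together with the fact that $|\lambda_i \delta \langle v_i, y^{(t)}\rangle| \le 1$. In $(**)$ we bound $(y^{(t)})^T M^{(t)} y^{(t)}$ using Lemma 7. In $(***)$ we finally use the fact that . ∎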
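The proof replaces the exponential by a quadratic upper bound. As a quick sanity check, the sketch below evaluates the standard estimate $e^x \le 1 + x + x^2$ on a grid over $[-1, 1]$; the grid resolution is an arbitrary choice and the check is numerical evidence, not a proof.

```python
import math

# Evenly spaced grid on [-1, 1] with step 0.001.
grid = [k / 1000.0 for k in range(-1000, 1001)]

# Slack of the quadratic upper bound; it should be nonnegative on [-1, 1].
slack = [1 + x + x * x - math.exp(x) for x in grid]
min_slack = min(slack)
```

The slack vanishes only at $x = 0$ and grows like $x^2/2$ nearby, which is why the quadratic error term in the potential is what the algorithm needs to control.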

Typically in the multiplicative weight update method one can only use the fact that which would lead to the loss of an additional factor. The trick in our approach is that there is always a linear number of weights of order since the updates are always chosen orthogonal to the constraints with highest weight.

###### Lemma 9.

At the end of the algorithm,

###### Proof.

Suppose, for contradiction, that for some . Let be the last iteration when was not among the constraints with highest weight. After iteration , only decreases in each iteration, due to the factor . Then

 2

and hence, This would imply that contradicting Lemma 8. ∎

###### Lemma 10.

If $w_i^{(T)} \le 2$, then $\langle v_i, x^{(T)} - x^{(0)}\rangle \le 11\lambda_i$.

###### Proof.

First note that the algorithm always walks orthogonal to all constraint vectors if and in this case . Now suppose that . We know that Taking logarithms on both sides and dividing by then gives

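$$\langle v_i, x^{(T)} - x^{(0)}\rangle \le \underbrace{\frac{\log(2)}{\lambda_i}}_{\le 2} + \lambda_i \Big(1 + 4\underbrace{\frac{T\delta^2}{n}}_{\le 2}\Big) \le 11\lambda_i.$$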

This lemma concludes the proof of Theorem 1. ∎

### 2.4 Application to set coloring

Now we come to the main application of the partial coloring argument from Theorem 1, which is to color set systems:

###### Lemma 11.

Given a set system , we can find a coloring with for every deterministically in time .

###### Proof.

For a fractional vector , let us abbreviate as the discrepancy with respect to set . Set . For many phases we do the following. Let be the not yet fully colored elements. Define a vector of length with parameters . Then apply Theorem 1 to find with such that for . Since each time at least half of the elements get fully colored we have for all . Then and

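$$\mathrm{disc}(S_i, x) \le \sum_{s \ge 1} O\Big(\sqrt{2^{-(s-1)} n \log\Big(\frac{2m}{2^{-(s-1)} n}\Big)}\Big) \le O\Big(\sqrt{n \log\Big(\frac{2m}{n}\Big)}\Big)$$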

using that this convergent sequence is dominated by the first term.

In each application of Theorem 1 one has . Thus phase runs for iterations, each of which takes time. This gives a total runtime of in phase . Summing the geometric series for results in a total running time of . ∎
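The claim that the per-phase bounds sum to a constant multiple of the first term can be checked numerically. The sketch below ignores the constants hidden in the $O(\cdot)$ notation and, as an assumption of the example, stops halving once fewer than one element remains; the function name `phase_bounds` is hypothetical.

```python
import math
from typing import List


def phase_bounds(n: int, m: int) -> List[float]:
    """Per-phase terms sqrt(n_s * log(2m / n_s)) with n_s = n / 2**(s-1)."""
    terms = []
    n_s = float(n)
    while n_s >= 1.0:
        terms.append(math.sqrt(n_s * math.log(2 * m / n_s)))
        n_s /= 2.0  # each phase colors at least half of the remaining elements
    return terms


terms = phase_bounds(n=1024, m=1024)
ratio = sum(terms) / terms[0]
```

For this instance the total is a small constant times the first term, since the factor $\sqrt{2^{-(s-1)}}$ decays geometrically while the logarithm grows only linearly in $s$.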

By setting in Lemma 11, we obtain Corollary 2.

## 3 Matrix balancing

In this section we prove Theorem 3. We begin with some preliminaries. For matrices , let be the Frobenius inner product. Recall that any symmetric matrix can be written as , where is the eigenvalue corresponding to eigenvector . The trace of is and for symmetric matrices one has . If has only nonnegative eigenvalues, we say that is positive semidefinite and write . Recall that if and only if for all . For a symmetric matrix , we denote as the largest Eigenvalue and as the largest singular value. Note that if , then . If , then . Finally, note that for any symmetric matrix one has .

From the eigendecomposition , one can easily show that the maximum singular value also satisfies and . For any function we define to be the symmetric matrix that is obtained by applying to all Eigenvalues. In particular we will be interested in the matrix exponential . For any symmetric matrices , the Golden-Thompson inequality says that . (It is not hard to see that for diagonal matrices one has equality.) We refer to the textbook of Bhatia [Bha97] for more details.

###### Theorem 12.

Let be -block diagonal matrices with for and let be a starting point. Then there is a deterministic algorithm that finds an with

$$\Big\| \sum_{i=1}^n (x_i - x_i^{(0)}) \cdot A_i \Big\|_{op} \le O\Big(\sqrt{n \log\Big(\frac{2qm}{n}\Big)}\Big)$$

in time . Moreover, at least coordinates of will be in .

Our algorithm computes a sequence of iterates such that is the desired vector with half of the coordinates being integral. In our algorithm the step size is and we use a parameter to control the scaling of the following potential function:

$$\Phi^{(t)} := \mathrm{Tr}\Big[\exp\Big(\varepsilon \sum_{i=1}^n (x_i^{(t)} - x_i^{(0)}) \cdot A_i\Big)\Big].$$

Suppose are symmetric matrices so that . Then we can decompose the weight function as with In other words, the potential function is simply the sum of the potential function applied to each individual block. The algorithm is as follows:

1. FOR TO DO

1. Define weight matrix

   2. Define the following subspaces:

3. . Here are the indices with maximum weight .

4. is the subspace defined in Lemma 14, with .

5. Let be any unit vector in .

6. Choose a maximal so that , where .

7. Let . If , then set and stop.
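The block decomposition of the potential function mentioned before the algorithm can be written out explicitly. The symbols $X$, $X_k$ and $r$ below are notation introduced only for this sketch, and we assume that all matrices $A_i$ share the same block partition.

$$
X := \varepsilon \sum_{i=1}^n (x_i^{(t)} - x_i^{(0)}) \cdot A_i = \mathrm{diag}(X_1, \ldots, X_r)
\;\;\Longrightarrow\;\;
\exp(X) = \mathrm{diag}\big(\exp(X_1), \ldots, \exp(X_r)\big),
\qquad
\Phi^{(t)} = \mathrm{Tr}[\exp(X)] = \sum_{k=1}^r \mathrm{Tr}[\exp(X_k)] .
$$

This holds because every power of a block diagonal matrix is block diagonal with the corresponding powers of the blocks, so the power series of the exponential acts on each block separately.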

The analysis of our algorithm follows a sequence of lemmas, the proofs of most of which we defer to Appendix A. By exactly the same arguments as in Lemma 5 we know that the algorithm terminates after iterations. Each iteration can be done in time (cf. Lemma 14).

###### Lemma 13.

In each iteration one has .

###### Proof.

We simply need to account for all linear constraints that define and we get

 dim(U(t))≥|A(t)|U(t)1−|I(t)|U(t)3−n16U(t)5−2U(t)2,U(t)4≥n2−n16q2⋅q2−n16−2≥n4

assuming that . ∎

To analyze the behavior of the potential function, we first prove the existence of a suitable subspace that will bound the quadratic error term.

###### Lemma 14.

Let be a symmetric positive semidefinite matrix, let be symmetric matrices with and let be a parameter. Then in time one can compute a subspace of dimension so that

$$W \bullet \Big(\sum_{i=1}^n y_i A_i\Big)^2 \le k \cdot \mathrm{Tr}[W] \quad \forall y \in U \text{ with } \|y\|_2 = 1. \tag{2}$$

Proof. See Appendix A.

Again, we bound the increase in the potential function:

###### Lemma 15.

In each iteration , one has .

Proof. See Appendix A.

This gives us a bound on the potential function at the end of the algorithm.

###### Lemma 16.

At the end of the algorithm, .

###### Proof.

Since , we get that , using the fact that . ∎

###### Lemma 17.

We have .

Proof. See Appendix A.

These lemmas put together give us Theorem 12: an algorithm that yields a partial coloring with the claimed properti