Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF.
We show that a solution of the optimal assignment problem can be obtained as the limit of the solution of an entropy maximization problem, as a deformation parameter tends to infinity. This allows us to apply entropy maximization algorithms to the optimal assignment problem. In particular, the Sinkhorn algorithm leads to a parallelizable method, which can be used as a preprocessing to handle large dense optimal assignment problems. This parallel preprocessing allows one to delete entries which do not belong to optimal permutations, leading to a reduced instance which becomes solvable with limited memory requirements. large scale optimal assignment problem; entropy maximization; matrix diagonal scaling; parallel computing; Sinkhorn iteration; Newton method.
One of the most classical problems in combinatorial optimization is optimal assignment. Several applications of this problem arise in different fields of applied sciences such as bioinformatics for protein structure alignment problem Holm (1993); Lin, Y. H. et al. (2004), VLSI design Huang et al. (1990), image processing and computer vision Cheng et al. (1996), and the pivoting problem in the solution of large linear systems of equations Olschowka & Neumaier (1996); Duff & Koster (2000); Li & Demmel (2003). Thus, this problem has received considerable attention and several algorithms have been proposed to solve it.
The first polynomial time algorithm to solve this problem was proposed by Kuhn (1955). It works in time, which was improved to by Edmonds & Karp (1970) (see also Dinic & Kronrod (1969)) where denotes the dimension of the input matrix. In the sparse case, Fredman & Tarjan (1987) proposed an improved algorithm which uses Fibonacci heaps for the shortest paths computations. It runs in time where denotes the number of arcs. Several other algorithms have also been developed. We refer the interested reader to the recent book of Burkard et al. (2009).
In this paper we exploit the connection between the optimal assignment problem and entropy maximization. The latter is well studied in the field of convex optimization Fang et al. (1997). The main idea is to think of the optimal assignment problem as the limit of a deformation of an entropy maximization problem. More precisely, given an non-negative matrix , let us look for an bistochastic matrix maximizing the relative entropy
Here, is the deformation parameter. We will show in Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF. that when goes to infinity, the unique solution of the entropy maximization problem converges to a point which is of maximal entropy among the ones in the convex hull of the matrices representing optimal permutations. In particular, if there is only one optimal permutation, converges to the matrix representing this optimal permutation. In Section 2.1 we prove that, as ,
for some constant . This shows an exponential convergence to the optimal solution when increases.
The maximal entropy matrix can be computed by any matrix scaling algorithm such as Sinkhorn iteration Sinkhorn & Knopp (1967) or Newton method Knight & Ruiz (2012). Subsequently, these iterative methods can be used to develop new algorithms to solve the optimal assignment problem and related combinatorial optimization problems.
In Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF., we introduce an iterative method which is based on a modification of Sinkhorn scaling algorithm, in which the deformation parameter is slowly increased (this procedure is reminiscent from simulated annealing, the parameter playing the role of the inverse of the temperature). We prove that this iteration, which we refer to as deformed-Sinkhorn iteration, converges to a matrix whose entries that belong to the optimal permutations are nonzero, while all the other entries are zero. An estimation of the rate of convergence is also presented, but this appears to be mostly of theoretical interest since in practice, the convergence of this algorithm appears to be slow.
In Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF., we investigate a preprocessing algorithm which can be used in the solution of large scale dense optimal assignment problem. This problem appears in several applications such as vehicle routing problem, object recognition and computer vision Buš & Tvrdík (2009). An application to cosmology (reconstruction of the early universe) can be found in the work of Brenier et al. (2003). Models of large dense random assignment problems are also considered in (Mezard et al., 1987, Ch. VII) from the point of view of statistical physics.
Our preprocessing algorithm, is based on an iterative method that eliminates the entries not belonging to an optimal assignment. This reduces the initial problem to a much smaller problem in terms of memory requirements. This is illustrated in Figures 2 and 2.
The idea of this algorithm is to take large enough, then apply a diagonal scaling algorithm to until convergence to a bistochastic matrix , and finally delete the small entries of . Computing naively the exponential of would lead to numerical overflow for large values of . However, in Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF., we shall see that it is possible to implement this iteration in a numerically stable way (with “log-glasses”). We also provide an approximate optimality certificate which can be used to check a posteriori that the result of the algorithm is optimal up to a given factor. Our algorithm, presented in Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF., assumes the existence of at least one matching, since otherwise, Sinkhorn iteration may not converge.
In Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF., we present the result of running the preprocessing algorithm on several dense matrices from gallery of Matlab. We consider two variants of the algorithm, one by using Sinkhorn iteration as the diagonal scaling algorithm and the other one by using Newton iteration. The advantage of Newton method is the speed of the convergence to bistochastic matrix. On the other hand, the advantage of Sinkhorn iteration is that, it can be efficiently implemented in parallel Amestoy et al. (2008); Duff et al. (2008). So, the latter variant of the preprocessing algorithm can be used in the solution of very large dense optimal assignment problems, where the data cannot be stored in one machine. In this way, the algorithm can run in parallel and reduces the size of the original problem and then the reduced problem can be solved by any classical method. For both variants, we show that the preprocessing algorithm can be efficiently used to decrease the size of the dense matrices, up to .
In this section, after recalling some known facts concerning the connection between entropy maximization and matrix scaling problems, we show (Theorem 2.3) that the unique solution of a deformed entropy maximization problem i.e., an entropy maximization problem depending of a deformation parameter, converges to the solution of an optimal assignment problem. We also determine the convergence speed (Corollary 2.2).
The diagonal scaling problem can be generally defined as finding diagonal matrices and with positive diagonal entries such that the scaled matrix has prescribed row and column sums. Due to the variety of its applications, this problem has been well studied Menon & Schneider (1969); Brualdi (1974); Sinkhorn & Knopp (1967). A comparison of the proposed algorithms to solve this problem, can be found in Schneider & Zenios (1990). A remarkable special case arises when the row and column sums of the matrix are required to be identically one, so that is bistochastic.
A non-negative matrix, , has support if it has a positive diagonal that is, there exists a permutation such that . Also, it has total support if every positive entry belongs to a diagonal. The standard way to check whether a matrix has support is to compute its Dulmage-Mendelsohn decomposition Dulmage & Mendelsohn (1958); Pothen & Fan (1990). However, this approach does not lead to a parallel algorithm. We note that an alternative algorithm based on diagonal scaling has been proposed by Linial et al. (2000). This algorithm allows one to determine whether a matrix has support after Sinkhorn iterations. It can be implemented in parallel.
A non-negative square matrix is fully indecomposable if there does not exist permutation matrices and such that is of the form
where and are square matrices. The following theorem provides a sufficient condition for the existence of a diagonal scaling.
Theorem 2.1 (Sinkhorn & Knopp (1967))
Let be an non-negative matrix with total support. Then there exist diagonal matrices and such that is bistochastic. Moreover, if is fully indecomposable, then and are unique up to a constant factor.
Now, consider the following optimization problem, which consists in finding an bistochastic matrix maximizing the following relative entropy
is the entropy function, is a parameter, and denotes the set of bistochastic matrices. We define in the context of the product .
We shall assume that the matrix has total support, so that the diagonal matrices and are known to exist. We denote by the pattern (set of non-zero entries) of the matrix .
The general relation between the entropy maximization and scaling problems is well known, see e.g. Schneider (1989) for an overview.
Let be a matrix with total support. Then, the solution of the entropy maximization problem indicated in Equation 2.1 is unique and it is characterized by the existence of two diagonal matrices and , such that .
Thus, the characterization of the theorem shows that is obtained from the th Hadamard power by a diagonal scaling. The previous theorem is a special case of Theorem 3.1 of Borwein et al. (1994), which is established in a more general infinite dimensional setting (for ; but the result for an arbitrary follows trivially from it).
We now study the convergence of as tends to infinity. We shall consider the face of the polytope of bistochastic matrices consisting of the optimal solutions of the linear programming formulation of the optimal assignment problem
As tends to infinity, the matrix converges to the unique matrix maximizing the entropy among the ones that belong to the face consisting of the convex hull of optimal permutation matrices. In particular, if the solution of the optimal assignment problem is unique, then converges to the associated bistochastic matrix.
Since is the point of maximum of ,
Consider a sequence converging to infinity, and assume that converges to some matrix , which must belong to . Setting in the previous inequality and taking the limit as tends to infinity, we get , which shows that belongs to the face .
is the sum of two non-negative terms, because is a point of maximum of , and is a convex hull of matrices representing optimal permutations. It follows that , and so, if is any accumulation point of as tends to infinity, , showing that is of maximal entropy among the matrices in . Since the entropy function is strictly convex, is the only point with the latter property, and so every accumulation point of is equal to , showing that converges to as . ∎
If there is only one optimal permutation, then converges to the corresponding permutation matrix.
We have already shown in Theorem 2.3 that the maximal entropy solution converges as tends to infinity, to a matrix which is a convex hull of optimal permutation matrices. In particular, converges to an optimal permutation matrix if the optimal permutation is unique. Now, the question is how fast this convergence is. This is answered by the next results.
Following Hardy & Riesz (1915), we call generalized Dirichlet series in a parameter a sum
where the coefficients are such that is either a finite (possibly empty) set or a denumerable set having as the only accumulation point (in particular, a finite number of monomials with negative powers of may appear in the series).
Assume that the matrix has total support. Then, every entry is given by a generalized Dirichlet series in the parameter that is absolutely convergent in some punctured disk .
This field was used in particular by Akian, Bapat and Gaubert Akian et al. (1998) to address a somehow related asymptotic problem, concerning the Perron eigenvector of the matrix as . The corresponding field of formal generalized Dirichlet series is also a useful tool in tropical geometry, as pointed out by Markwig (2010). Before proving this theorem, we derive the following corollary.
Assume that the matrix has total support. Then, there exists a positive constant such that,
holds for all , as .
We already showed that converges to as . It follows that the leading monomial of the Dirichlet series expansion of is , and that the next monomial is necessarily of the form for some and (we adopt the convention that if is constant near ). Then, it suffices to take for the minimum of all the obtained in this way. ∎
The proof of Theorem 2.4 uses a model theory result of van den Dries & Speisseger (1998), who constructed a o-minimal expansion of the field of real numbers which is such that the definable functions in one variable correspond to generalized Dirichlet series that are absolutely convergent in a punctured disk. We briefly recall their construction, referring the reader to the monography van den Dries (1998) for more background on o-minimal structures, and in particular for the definition of the notions used here.
For any , let be commuting variables, and consider a formal series
where the multi-index ranges over . The support of is now . We denote by the -algebra of formal series the support of which is included in a Cartesian product where every is either a finite subset of or a denumerable subset of having has the only accumulation point. For each with for , we set , and denote by the subalgebra of consisting of those such that . Then, we denote by the expansion of the real ordered field by the collection of all functions , with , such that is outside and that it is given on by a power series for some with . The following theorem follows from van den Dries & Speisseger (1998).
Theorem 2.5 (See Theorem B and § 10, Paragraph 2 in van den Dries & Speisseger (1998))
Let and let be definable in . Then, there exists a generalized Dirichlet series in a single variable that is absolutely convergent in a punctured disk for some , such that holds for all .
Actually, Theorem B in van den Dries & Speisseger (1998) deals with a larger o-minimal structure, , in which it is only required that every set is well ordered in the above construction. The latter theorem shows that every definable function in one variable in this larger structure coincides with a Hahn series (series with well ordered support) that is absolutely convergent in a punctured disk. However, it is remarked in Section 10 of van den Dries & Speisseger (1998) that the same statement remains true if “well ordered” is replaced by “finite or denumerable with as the only accumulation point” in the construction of and if Hahn series are replaced by generalized Dirichlet series.
of Theorem 2.4.
We make the change of variable , and we will write for .
Since the matrix is non-negative and has total support, so does for all , and so, the solution of the entropy maximization problem is the only matrix such that there there exist diagonal matrices and with positive diagonal entries such that . Every non-zero entry of the matrix can be written as . In particular, the function belongs to for every . It follows that for every , the function is definable in the structure . Hence, by Theorem 2.5, every entry of has an expansion as a generalized Dirichlet series in the variable and this series is absolutely convergent in a punctured disk . ∎
The formulation (2.1) is somehow reminiscent of interior point methods, in which the entropy is replaced by a log-barrier function (the latter would be in the present setting). The present thought of as a function of is analogous to the central path, and as does the central path, converges to a face containing optimal solutions. However, the entropy does not satisfies the axioms of the theory of self-concordant barriers on which the analysis of interior point methods is based. Indeed, the speed of convergence in appears to be of a totally different nature by comparison with the speed of observed in interior point methods Nesterov & Nemirovskii (1994).
The constant appearing in Corollary 2.2 can be small if there are several nearly optimal permutations, and then a large value of may be needed to approximate . However, in such cases, a much smaller value of turns out to be enough for the method described in the next sections, the aim of which is to eliminate a priori entries not belonging to (nearly) optimal permutations. This is illustrated by the following matrix, in which the identity permutation is optimal, and the transposition is nearly optimal:
For , we have the following matrix, the significant entries of which indicate precisely the optimal and nearly optimal permutations:
The convergence of to is illustrated in Figure 3. Observe that the graph of as a function of is approximately piecewise affine. In fact, each piece corresponds to a monomial in the generalized Dirichlet series expansion (2.2). The path converges quickly to the face containing the two nearly optimal permutations and slowly to the unique optimal permutation.
Finding an explicit formula for the speed of convergence appears to be an interesting combinatorial problem (which is beyond the scope of this paper).
In this section, we consider the Sinkhorn iteration, which is probably the most classical way of computing the diagonal matrices of Theorem 2.2, leading to the solution of the entropy maximization problem. We develop a “path following method” in which the value of is gradually increased in the course of Sinkhorn iterations. We prove that if the matrix has support ( has support if it has a positive diagonal), and if the growth of is moderate enough, then the sequence of matrices produced by the algorithm converges to a point which belongs to the face generated by optimal permutations. The results of this section leads to an algorithm to compute the optimal assignment which we refer to as deformed Sinkhorn iteration. This algorithm is mostly interesting from the theoretical point of view since the value of increases slowly in the course of the algorithm which yields a slow convergence to the solution.
A simple way to compute the diagonal matrices is Sinkhorn iteration Sinkhorn & Knopp (1967). This algorithm starts from a given matrix , divides every row by its sum, then every column of the new matrix by its sum, and so on, until the matrix obtained in this way converges to a bistochastic matrix. The advantage of this algorithm is that it can be efficiently implemented in parallel Amestoy et al. (2008) and it can be applied to any non-negative matrix which has at least one nonzero permutation. The disadvantage is that, it is generally slower than other methods.
Recall first that the open cone consisting of positive vectors of is equipped with Hilbert’s projective metric, defined by
Note that is zero if and only if the vectors and are proportional. We refer to (Bapat & Raghavan, 1997, § 6) for more background. In particular, if is a positive matrix, a theorem of Birkhoff shows that the map is a contraction in Hilbert’s projective metric, with a contraction rate
The following result is a consequence of this theorem.
Proposition 3.1 (Franklin & Lorenz (1989))
For a positive matrix , the global rate of convergence of Sinkhorn iteration is bounded above by .
This general bound is applicable only for positive matrices and it can be coarse in practice. We shall use this contraction rate to prove the convergence of our deformed Sinkhorn iteration in Section Solution of the optimal assignment problem by diagonal scaling algorithms111 This work has been supported in part by the French National Research Agency (ANR) through the COSINUS program (projects PETAL no ANR-08-COSI-009 and PETALH no ANR-10-COSI-013) and by the Gaspard Monge Optimization Programme of Fondation Mathématique Jacques Hadamard and EDF..
It is proved by Soules (1991) that the rate of convergence of Sinkhorn algorithm is always linear when the input matrix, , has total support. He defines, where, and is an operator which divides each row by its sum and then divides each column by its sum. He also defines a scalar,
for a given norm , where denotes the final bistochastic matrix. Since the value of is norm dependent, he called the convergence rate to be linear if for some norm, , which he proved for Sinkhorn iteration. More recently, Knight (2008) provided a local rate of convergence. Due to his work, for classical Sinkhorn iteration the local rate of convergence of a fully indecomposable matrix, is bounded by where is the second singular value of the bistochastic matrix to which the iteration converges. Hence, the following result allows us to estimate the local convergence rate of Sinkhorn iteration, as .
Assume that there is only one optimal permutation. Then, there is a constant such that
Assume now that the matrix is fully indecomposable (which implies that there are several optimal permutations). Then,
Let and denote the singular values of two matrices, and respectively. Define a diagonal matrix such that . Due to the perturbation theorem of Mirsky (1960) for any unitarily invariant norm we have, . So, for and ,
for which the constant depends on the coefficients of the Puiseux series and possibly on the dimension of . Thus, if the original matrix has only one optimal permutation, which implies that
Moreover according to the Birkhoff-von Neumann theorem Birkhoff (1946), for any norm on which is invariant under permutation of the coordinates and for any bistochastic matrix , and subsequently
When is fully indecomposable, since the multiplication of two fully indecomposable matrices is also fully indecomposable, is fully indecomposable. Note also that for all , , which implies that is primitive. Then, according to the Perron-Frobenius theorem, all the eigenvalues of distinct from have a modulus strictly smaller than which yields . ∎
Let be a real non-negative matrix. The standard Sinkhorn iteration is defined as follows
where and respectively, are column scaled and row scaled matrices and denote the column scaling operator in which all the columns of a matrix are divided by their sums and be the similar operator for rows. It is easy to verify that, and for any diagonal matrix . Now consider the following iteration for a sequence of vectors
where denotes the vector of dimension and denotes the operator which inverts every entry of a vector. In the sequel, for all vectors , denotes the diagonal matrix with diagonal entries .
Note that and that so
a similar statement can be proved for to show that . ∎
Now consider the following iteration which is a standard Sinkhorn iteration with a deformation of using an increasing sequence which goes to infinity.
Analogous to the standard Sinkhorn iteration, let and respectively, be column scaled and row scaled matrices defined as the following:
For a diagonal matrix , real matrices and the matrices in the iteration, the following properties hold.
where indicates the Hadamard product
We only prove the last one since others are straightforward.
So we define deformed Sinkhorn iteration as the following
Here, respectively denote the vectors of row sums and column sums.
In the following two sections, we will prove that the deformed Sinkhorn iteration will converge to a bistochastic matrix where all the nonzero entries belong to an optimal permutation of the original matrix.
For an input matrix, , assume that the deformed Sinkhorn iteration converges to a bistochastic matrix. Define the weight of a permutation, , with respect to , to be . If has a support, it should have at least one optimal permutation as with nonzero weight. It is evident that is the optimal permutation for all the matrices and produced by each deformed Sinkhorn iteration. Observe that for all permutations and , the ratio is invariant if we multiply the matrix by diagonal matrices. So it follows from the Equation 3.4 that
Thus, for all non optimal permutations such as , will converge to zero when . Since in each iteration the weight of optimal permutation, , is bounded above by , the weight of all non optimal permutations will converge to zero which yields the following lemma.
Assume that the deformed Sinkhorn iteration converges to a matrix, , produced by the deformed Sinkhorn iteration when . If the original matrix has a support, then all the permutations of have zero weight, except the optimal permutations of the original matrix .
Due to the theorem of Birkhoff-von Neumann, a square bistochastic matrix in is a convex combination of permutation matrices. Hence, all the nonzero entries of a bistochastic matrix belong to a permutation with nonzero weight. This statement together with the previous lemma yield the following theorem.
For a non-negative matrix which has a support, as , if the deformed Sinkhorn iteration converges to a matrix , then all the nonzero entries of belong to an optimal permutation of the original matrix.
Recall that the rate of convergence of the classical Sinkhorn iteration is bounded above by where . The following theorem presents the main result of this section:
Let be a positive matrix. If where , then the deformed Sinkhorn iteration will converge to a bistochastic matrix and subsequently to a solution of optimal assignment of the original matrix .
The proof relies on the next lemmas. For a matrix , , and for two diagonally equivalent matrices such as and , .
For positive matrices and , diagonal matrix and the Hilbert projective metric , the following properties hold.
The proof is straightforward. ∎
is invariant under or operators.
The proof is straightforward.∎
The next lemma is analogous to Lemma 2 in Franklin & Lorenz (1989), where the classical Sinkhorn iteration is considered.
The second statement is established in a similar way. ∎
Assume that , where . Then we have
Let , and define the sequence by , where
Since every function is nondecreasing, an induction shows that , for all , and so, it suffices to show that .
Let be the fixed point of . Setting and observing that
Since , one readily checks that the sequence decreases with and converges to zero. If for every , then , and the result is established. Assume now that for some . Define for all . Observe that
Using the fact that holds for all , an immediate induction yields
Since , we have
Letting in (3.7), we get . Since this holds for all , it follows that , and so,
Hence, converges to zero. ∎
The proof of Theorem 3.6 is achieved since implies that .
In this section we introduce a new algorithm which can be used as a parallel preprocessing algorithm to solve large dense optimal assignment problems, in order to delete the entries not belonging to optimal assignment. This approach is based on computing , defined in Theorem 2.3, as an approximation to for relatively large values of . Before going further we need to address the numerical instability of computing .
To compute , one has to start from the input matrix and compute the bistochastic matrix by applying a numerical algorithm such as Sinkhorn iteration or Newton method. But the naive computation of is numerically unstable for large values of . In the following we provide two approaches to avoid this numerical instability. The first method is a prescaling step which can be followed by any scaling algorithm such as the Sinkhorn iteration or Newton method. A limitation of this method is that the value of cannot exceed where is the largest number, in the numerical range (for example in double-precision floating-point arithmetic). This is overcome by the second approach, in which Sinkhorn iteration is implemented with “log-glasses” (along the lines of tropical geometry). This allows one to arbitrarily increase the value of . However, this approach does not naturally carry over to other (non-Sinkhorn) scaling algorithms.
To avoid the numerical instability one can use the prescaling step presented below. We set . By applying this prescaling, all the nonzero scaled entries will be placed in interval where is Napier’s constant. In the case when , the prescaling has another interesting property, that is, the scaled matrix is invariant by any entrywise power of the input matrix. In other words, if we apply the prescaling to the matrix , for all , the matrix obtained after the prescaling step turns out to be independent of the choice of . When the entries of have already been located in the interval , then we do not need to perform the previous prescaling since the denominator in the formula defining will be small if is close to .