Iterative Multiplicative Filters for Data Labeling
Abstract
Based on an idea in [4] we propose a new iterative multiplicative filtering algorithm for label assignment matrices which can be used for the supervised partitioning of data. Starting with a rownormalized matrix containing the averaged distances between prior features and the observed ones the method assigns in a very efficient way labels to the data. We interpret the algorithm as a gradient ascent method with respect to a certain function on the product manifold of positive numbers followed by a reprojection onto a subset of the probability simplex consisting of vectors whose components are bounded away from zero by a small constant. While such boundedness away from zero is necessary to avoid an arithmetic underflow, our convergence results imply that they are also necessary for theoretical reasons. Numerical examples show that the proposed simple and fast algorithm leads to very good results. In particular we apply the method for the partitioning of manifoldvalued images.
sectioning \setkomafonttitle \setkomafontdescriptionlabel
1 Introduction
Data labeling is a basic problem which appears in various applications. In particular it can be used for image partitioning and segmentation, which is an important preprocessing step for many stateoftheart algorithms in highlevel computer vision. It addresses the task to assign labels from a finite set to the image points in a meaningful way. Thereby the number of labels is much smaller than the image dimension . Fig. 1 illustrates a typical labeling result which leads to the segmentation of a texture image.
As a first ingredient for a labeling procedure features of the pixels together with a similarity measure for these features is needed. In this paper we only deal with supervised labeling where prototypical features for each label are available. While for natural color images in the RGB color space often the distance, , on is sufficient, this choice is clearly not appropriate when dealing with manifoldvalued images as, e.g., in DTMRI, where the image pixels are positive definite matrices or in Electron Backscattered Diffraction, where the pixels lie in a quotient manifold of the special orthogonal group. Also for the texture image in Fig. 1 we have to deal with more appropriate features, actually means and covariance matrices associated to the image pixels as described in the numerical part in Section 4, paragraph on ,,symmetric positive definite matrices”, second example. In other words, often the features can be considered to lie on a manifold and the distance on the manifold provides a good similarity measure. We remark that there is a large amount of literature on the collection of appropriate image features and a detailed description is beyond the scope of this paper. However, taking just features of single pixels and their distances to prototypical label features into account would lead to a high sensibility of the method to errors. To get robust labeling procedures and to respect moreover the geometry of the image the local or nonlocal neighborhood of the pixels should be involved as a second ingredient into the labeling procedure. A first idea would be a two step method which applies a neighborhood filtering procedure directly on the image features followed by a label assignment via comparison of distances to the prototypical label features. The neighborhood filters may be local or nonlocal [11, 19, 23] and linear or nonlinear [48]. Indeed, such nonlinear methods may also comprise nonlinear diffusion filter [10, 51] or variational restoration models inspired by the MumfordShah functional [13, 38]. However, such methods become rather complicated if the features lie for example in a non flat manifold. Then already the simple local, linear averaging method calls for the computation of Karcher means which requires itself an iterative procedure. Note that recently nonlocal patchbased methods were extensively applied for the denoising of manifoldvalued images in [34]. In this paper, we will apply the neighborhood filtering not directly on the image features, but on the socalled label assignment vectors. The label assignment vectors , , contain the probabilities that image pixel belongs to label . Clearly, since we are dealing with probabilities, we have that belongs to the probability simplex
(1) 
Hence, our neighborhood filtering process is independent of the considered feature space. The idea to handle neighborhood relations via the label assignment vectors can also be found in socalled relaxation labeling approaches [29, 41, 45]. However these methods do not use a multiplicative filtering process.
Besides the above mentioned methods there exists a huge number of variational labeling approaches and no single technique works best for all cases. The models can roughly be divided into continuous and discrete ones. Recent continuous multilabel models as those in [8, 14, 36] consider a convex functional which consists of a data term that contains the distances between the pixel features and prototypical features weighted by the label assignment vectors and a regularization term which enforces certain neighborhood relations on the label assignment vectors. A frequent choice are isotropic total variation regularizers [46] and their relatives as waveletlike terms [25]. The global minimizers can be found by primal dual first order methods [12, 15]. In [16] it was shown that the relaxed two label method is tight, a promising result which is unfortunately no longer true for multilabel tasks [35]. Discrete models, in particular relaxed linear programs, are often the method of choice in computer vision since they appear to be more efficient for huge sets of high dimensional data. Moreover they are in general tighter than continuous models, see exemplary [31, 32]. For a comprehensive study of discrete energy minimizing methods also in the context of image segmentation we refer to [30] which significantly expands the scope of methods involved in the earlier comparative study [47]. The above mentioned continuous and discrete models are nonsmooth and convex. We want to mention that there exist also non convex approaches as, e.g., those in [26, 27, 40, 50, 53] which always have to cope with local minima.
Recently, Åström, Petra, Schmitzer, and Schnörr [4] suggested an interesting supervised geometric approach to the labeling problem, see also [3, 5] for the corresponding conference papers. The objective function to maximize is defined on the open manifold of stochastic matrices with the FisherRao metric and a maximizing algorithm via the corresponding Riemannian gradient ascent flow is considered. In the numerical part the authors apply several simplifications, in particular lifting maps to cope with the non completeness of the manifold. Finally this leads to an iterative numerical procedure. Unlike the continuous Riemannian gradient flow that is shown in [4] to converge to edges of the simplex, the authors merely showed that the simplified numerical scheme closely approximates this flow, but did not prove its convergence. In this paper we discuss the convergence of the resulting numerical scheme.
Since the approach in [4] may be hard to read for people not trained in differential geometry we show in Section 3 how their final iteration scheme can be rewritten such that we arrive at an easily understandable algorithm. A slight modification of this algorithm leads to our (even simpler) algorithm reported in the next section.
We start with this algorithm since we want to introduce it from a userfriendly point of view as a multiplicative filtering process of label assignment vectors. We initialize our iterations with a label assignment matrix containing label assignment vectors of averaged distances between the prior features and the observed ones. Matrices of this kind are common kernel matrices in nonlinear filter [11, 48], patch and graph Laplacian based methods, see, e.g., [9, 20]. In particular the data and their priors may lie in any metric space, which makes the method highly flexible. Then these label assignment vectors are iterated in a multiplicative way, where we have to take care that the components stay within the probability simplex and a (small) fixed distance away from zero. We analyze the convergence of the algorithm if such an is not fixed. In exact arithmetic it appears that in most relevant cases the algorithm would converge to a trivial label assignment matrix. This is not the case if we demand that the components are bounded away from zero by a small distance. Such condition is used in the numerical part anyway to avoid an arithmetic underflow. However, the analysis shows that there is also a theoretical reason for bounding the components away from zero. Indeed, in all numerical examples we observed convergence of the algorithm to the vertices of the probability simplex . We reinterpret our algorithm as gradient ascent step of a certain function on the product manifold of positive numbers followed by a reprojection onto the probability simplex. The projection is done with respect to the KullbackLeibler distance. We provide various numerical examples.
The outline of this paper is as follows: In Section 2 we introduce and discuss our iterative multiplicative filtering algorithm: the simple algorithm is presented in Subsection 2.1, followed by its convergence analysis in the case that the components of the label assignment vectors are not bounded away from zero in Subsection 2.2. We interpret the algorithm as a gradient ascent step of a certain function followed by a KullbackLeibler projection onto the probability simplex in Subsection 2.3. The relation to paper [4] together is detailed in Section 3. Numerical results demonstrate the very good performance of our algorithm in Section 4. In particular we apply the method for the partitioning of manifoldvalued images. The paper finishes with conclusions in Section 5.
2 Multiplicative Filtering of Label Assignment Matrices
2.1 Basic Algorithm
In this subsection we introduce the supervised labeling algorithm. As outlined in the introduction two basic ingredients are needed, namely i) feature vectors (prototypical ones for the labels of each pixel) together with a distance measure between them, and ii) neighborhood relations between pixels usually given by certain weights.
Let denote the identity matrix, the matrix with all components , and the th unit vector in . By we denote the norm and by the inner product of vectors . Likewise, if are matrices, we use . Let be a metric space. In our applications will be or a nonflat Riemannian manifold.
We are interested in partitioning features
into labels, where we suppose that we are given prior features
In other words, we consider supervised labeling. Then
(2) 
is the distance matrix of the features to the priors. Throughout this paper exponentials and logarithms of matrices are meant componentwise. Further, denotes the componentwise product of matrices. To each pixel we assign a neighborhood and set
(3) 
and
where with . We can extend the definition of the weights to all by setting if .
We will work on label assignment matrices with nonnegative entries
whose rows sum up to 1. In other words, consists of label assignment vectors , where denotes the probability simplex (1). In particular, will serve as initialization for our algorithm. Then we apply a multiplicative neighborhood averaging to the columns of , i.e., separately for each label . To have full flexibility, we may consider other neighborhoods and weights with , for this process. We assume that the weights in Algorithm 1 fulfill
i.e., mutual neighbors are weighted in the same way and extend their definition to all indices in by setting if . In summary we propose the Algorithm 1:
When the algorithm stops with , we assign the label
(4) 
to the th pixel.
Let us comment the steps of the algorithm:
Step 1.
Here is the weighted geometric mean of the label assignment vectors in the neighborhood of . In particular we have for that
Taking the logarithm we obtain
with the weight matrix
(5) 
Hence Step 1 can be rewritten as
(6) 
Step 2.
Here we ensure that the rows of lie in the probability simplex .
Step 3.
This step basically avoids that the entries of become too small (actually smaller than ). Very small values do also not appear if we set
(7) 
followed by a normalization. Indeed, in numerical tests this update instead of Step 3 works similar. We have chosen Step 3 since together with Step 2 it has a nice interpretation as KullbackLeibler projection onto the part of the probability simplex with entries entries larger or equal to . However, we will see in the next subsection that Step 3 is not just numerical cosmetics, but in general absolutely necessary to avoid that the iterates converge to the same vertex of the simplex for all which would result in a trivial labeling.
By Step 1, the weight matrix plays a crucial role. By definition this matrix is symmetric and stochastic, i.e., it has nonnegative entries and . We finish this subsection by collecting some wellknown properties of symmetric stochastic matrices, see, e.g., [22, 28, 43, 52].
Lemma 2.1.
Let be a symmetric stochastic matrix. Then the following holds true:

The matrix has an eigenvalue decomposition of the form
(8) where , , and
is an orthogonal matrix with first column .

If has positive diagonal entries, then is not an eigenvalue.

If is irreducible, i.e., the associated matrix graph is connected, then is a single eigenvalue.

If has positive diagonal entries and is a single eigenvalue, then converges as to the constant matrix .
2.2 Convergence Analysis for
In this subsection, we analyze the convergence of Algorithm 1 without Step 3, i.e., we use the same initialization as in the algorithm, but compute just
(9)  
(10) 
In other words we consider the case . We will see that the third step is indeed an essential.
We start with the following observation.
Proof.
First note that we obtain the same iterates if we skip the normalization Step 2 in the intermediate iterations and normalize only in step . Therefore we consider the sequence with the same starting point and iterations
Taking the logarithm and setting the iteration becomes
(12) 
where . With this can be rewritten as
(13)  
(14) 
Then and row normalization yields the assertion. ∎
The following remark gives a first intuition about the convergence behavior of .
Remark 2.3.
By (14) we have
From the definition of we know that is a symmetric stochastic matrix with positive diagonal entries. Assuming that is a single eigenvalue of this matrix, we conclude by Lemma 2.1 that converges to the matrix with constant columns as . However, this does in general not imply that the columns of also converge to a constant vector for as the following example shows: We have
Multiplying with , taking the exponential and normalizing the rows leads in the first column of to
(15)  
(16) 
so that
The general convergence result is given next.
Theorem 2.4.

The sequence generated by (10) converges.

Let have the eigenvalue decomposition (8), where contains the eigenvalues in descending order. Assume that there are pairwise distinct eigenvalues with multiplicity , . Let be the largest index such that . Set
Then, for every , the sequence converges to a unit vector if and only if fulfills the following property

If for an , there exist , such that
then there exists and so that
(17)

Proof.
(18) 
where and , . By Lemma 2.1 and Gershgorin’s circle theorem we know that , and if and only if . Taking the exponential and normalizing the rows we get for the th row
(19)  
(20)  
(21) 
where
(22) 
Taking the logarithm results in
The first sum contains the eigenvalues larger than 1 and the second one those smaller or equal to 1. If it exists, let be the smallest index such that . Then it follows
and consequently
(23) 
Hence, by (21), we see that converges as .
ii) It remains to examine under which conditions it converges to a unit vector. Consider a fixed . We distinguish two cases.
1. If for all there exists an such that , then we see by (23) that . Since, by construction the components of sum up to 1, we conclude that converges to a unit vector.
2. If there exists such that for all , then, by (23), the component is strictly smaller than 1 and becomes 0 if and only if there exists and so that and for all . This is exactly condition (PI) and we are done. ∎
Let us have a closer look at the property (PI).
If for all which means that we do not take the neighborhoods of the points into account, then and property (PI) is fulfilled if for every the vector has a unique smallest element. Then the algorithm gives the same assignment as just which is reasonable.
Things change completely if we take care of the point neighborhoods. First we give an equivalent condition to (PI).
Remark 2.5.
Having a look at part ii) of the above proof we realize by (23) and (21) that converges to a unit vector if and only if one of its components converges to 1. By (23) and (21) this is the case if and only if the following condition (PI) is fulfilled:

For all there exists such that for all there exists such that
Checking condition (PI) would give us the limit vector of since it provides the component , where the vector is 1.
The following corollary shows that in many practical relevant image labeling tasks the iterates in (10) converge the same vertex of the simplex.
Corollary 2.6.
Let be a symmetric stochastic matrix which has eigenvalue 1 of multiplicity 1 and all other eigenvalues have absolute values smaller than 1. Assume that there exists such that
(24) 
Then, for all , the vectors generated by (10) converge for to the th unit vector.
Proof.
By assumption on we have in (PI), resp. (PI), for and all that
where . By Lemma 2.1 i) we know that the first column of is given by . Thus,
Choosing according to assumption (24) we conclude for all . Thus,
and we can take in (PI). Then we see by the proof of Theorem 2.4, resp. Remark 2.5, that converges for to the th unit vector independent of . ∎
By Lemma 2.1 i)iii), the assumptions on are fulfilled if is irreducible and has positive diagonal entries. In our task this is the case if we use local pixel neighborhoods, where the central pixels have positive weights , . Assumption (24) is fulfilled in many practical relevant cases, in particular in all our numerical examples. This underlines the relevance of Step 3 in Algorithm 1. We finish this subsection by an intuitive example to illustrate the importance of .
Example 2.7.
Labels  
1  1  3  3  3  3  3  2  2  2  
1  1  1  3  3  3  3  2  2  2  
1  1  1  3  3  3  3  3  2  2  
3  3  3  3  3  3  3  3  3  3 
We initialize Algorithm 1 with the matrix
(25) 
which corresponds to a labeling of a signal of length with features and three labels. We choose neighborhoods of size 3 with uniform weights and mirror boundary conditions. Table 1 depicts the results of the labeling for the different , where the label assignment changes. In all cases the corresponding label assignment vectors are the vertices of the probability simplex (up to machine precision) and the labels 1,2,3 were assigned via (4). Note that we get the same labeling for as for , but the label assignment matrix converges in the first case to vertices as . Up to the pixels are labeled as expected. If decreases further, we still get reasonable results originating from the large starting values of the first segment. Until double precision is reached, i.e., , we see no more changes in the labeling result. To be able to handle we look at without normalization, i.e., we iterate (12). Then we observe after iterations that the maximal value of is attained in the last row. This means that after normalization in exact arithmetic all pixels will get the label 3. Since (24) reads for our example as
(26) 
and the third value is the largest one this agrees with Corollary 2.6.
2.3 Gradient Ascent Reprojection Algorithm
In this subsection we show that Algorithm 1 can be seen as gradient ascent reprojection algorithm of a certain function defined on the product manifold of positive numbers. The positive numbers form together with the metric on the tangent space given by a Riemannian manifold. The distance between is defined by
Indeed, computation in can often be reduced to the Euclidean setting after taking the logarithm. The exponential map and its inverse read
(27) 
For and , the weighted Riemannian center of mass on is given by
(28) 
The minimizer can be computed as follows. The Riemannian gradient of the squared distance is
(29) 
where denotes the KullbackLeibler distance. It is defined by
(30) 
with the convention and using the componentwise division in . The function is jointly convex in both arguments. It is the Bregman distance of the Shannon entropy and is related to entropy regularized Wasserstein distances, which have recently found applications in image processing, see [44] and the references therein. Setting the Riemannian gradient of the objective in (28) to zero using (29), and dividing by 2 we get
(31) 
Consequently, the weighted Riemannian center of mass is just the weighted geometric mean .
By we denote the usual product manifold. We are interested in the function given by
(32)  
(33) 
where , are the columns of and is the weight matrix in (5).
In Step 1 of our algorithm we compute , perform a gradient ascent step of in the Euclidean arithmetic and take the exponential of the result. The direct execution on is given by the following lemma.
Lemma 2.8.
Proof.
By separability of we can restrict our attention to single columns of . A tangent vector acts on as
where is just the Euclidean gradient of at . The Riemannian gradient fulfills
for all . By the above definition of the Riemannian metric on we get