Iterative Multiplicative Filters for Data Labeling

# Iterative Multiplicative Filters for Data Labeling

## Abstract

Based on an idea in [4] we propose a new iterative multiplicative filtering algorithm for label assignment matrices which can be used for the supervised partitioning of data. Starting with a row-normalized matrix containing the averaged distances between prior features and the observed ones the method assigns in a very efficient way labels to the data. We interpret the algorithm as a gradient ascent method with respect to a certain function on the product manifold of positive numbers followed by a reprojection onto a subset of the probability simplex consisting of vectors whose components are bounded away from zero by a small constant. While such boundedness away from zero is necessary to avoid an arithmetic underflow, our convergence results imply that they are also necessary for theoretical reasons. Numerical examples show that the proposed simple and fast algorithm leads to very good results. In particular we apply the method for the partitioning of manifold-valued images.

\setlist

[itemize]label=– \DeclareCaptionLabelSeparatorperiodspace.  \captionsetupformat=hang,labelsep=periodspace,indention=-3em,labelfont=bf,width=.95skip=.5 \captionsetup[subfigure]aboveskip=.5ex,belowskip=1.5ex, justification=centering,indention=-1em,labelformat=simple,labelsep=space, labelfont=normal, hypcap=true,width=.975skip=.5 \DeclareCaptionLabelSeparatorperiodspace.  \captionsetupfont=small,format=hang,labelsep=periodspace,indention=-3em,labelfont=bf,width=.95skip=.5 \captionsetup[subfigure]font=small,aboveskip=.5ex,belowskip=1.5ex, justification=centering,indention=-1em,labelformat=simple,labelsep=space, labelfont=small, hypcap=true,width=skip=.5 \setkomafontsectioning \setkomafonttitle \setkomafontdescriptionlabel

## 1 Introduction

Data labeling is a basic problem which appears in various applications. In particular it can be used for image partitioning and segmentation, which is an important preprocessing step for many state-of-the-art algorithms in high-level computer vision. It addresses the task to assign labels from a finite set to the image points in a meaningful way. Thereby the number of labels is much smaller than the image dimension . Fig. 3 illustrates a typical labeling result which leads to the segmentation of a texture image.

As a first ingredient for a labeling procedure features of the pixels together with a similarity measure for these features is needed. In this paper we only deal with supervised labeling where prototypical features for each label are available. While for natural color images in the RGB color space often the -distance, , on is sufficient, this choice is clearly not appropriate when dealing with manifold-valued images as, e.g., in DT-MRI, where the image pixels are positive definite matrices or in Electron Backscattered Diffraction, where the pixels lie in a quotient manifold of the special orthogonal group. Also for the texture image in Fig. 3 we have to deal with more appropriate features, actually means and covariance matrices associated to the image pixels as described in the numerical part in Section 4, paragraph on ,,symmetric positive definite matrices”, second example. In other words, often the features can be considered to lie on a manifold and the distance on the manifold provides a good similarity measure. We remark that there is a large amount of literature on the collection of appropriate image features and a detailed description is beyond the scope of this paper. However, taking just features of single pixels and their distances to prototypical label features into account would lead to a high sensibility of the method to errors. To get robust labeling procedures and to respect moreover the geometry of the image the local or nonlocal neighborhood of the pixels should be involved as a second ingredient into the labeling procedure. A first idea would be a two step method which applies a neighborhood filtering procedure directly on the image features followed by a label assignment via comparison of distances to the prototypical label features. The neighborhood filters may be local or nonlocal [11, 19, 23] and linear or nonlinear [48]. Indeed, such nonlinear methods may also comprise nonlinear diffusion filter [10, 51] or variational restoration models inspired by the Mumford-Shah functional [13, 38]. However, such methods become rather complicated if the features lie for example in a non flat manifold. Then already the simple local, linear averaging method calls for the computation of Karcher means which requires itself an iterative procedure. Note that recently nonlocal patch-based methods were extensively applied for the denoising of manifold-valued images in [34]. In this paper, we will apply the neighborhood filtering not directly on the image features, but on the so-called label assignment vectors. The label assignment vectors , , contain the probabilities that image pixel belongs to label . Clearly, since we are dealing with probabilities, we have that belongs to the probability simplex

 ΔK\coloneqq{x∈RK:K∑k=1xk=1,x≥0}. (1)

Hence, our neighborhood filtering process is independent of the considered feature space. The idea to handle neighborhood relations via the label assignment vectors can also be found in so-called relaxation labeling approaches [29, 41, 45]. However these methods do not use a multiplicative filtering process.

Besides the above mentioned methods there exists a huge number of variational labeling approaches and no single technique works best for all cases. The models can roughly be divided into continuous and discrete ones. Recent continuous multilabel models as those in [8, 14, 36] consider a convex functional which consists of a data term that contains the distances between the pixel features and prototypical features weighted by the label assignment vectors and a regularization term which enforces certain neighborhood relations on the label assignment vectors. A frequent choice are isotropic total variation regularizers [46] and their relatives as wavelet-like terms [25]. The global minimizers can be found by primal dual first order methods [12, 15]. In [16] it was shown that the relaxed two label method is tight, a promising result which is unfortunately no longer true for multilabel tasks [35]. Discrete models, in particular relaxed linear programs, are often the method of choice in computer vision since they appear to be more efficient for huge sets of high dimensional data. Moreover they are in general tighter than continuous models, see exemplary [31, 32]. For a comprehensive study of discrete energy minimizing methods also in the context of image segmentation we refer to [30] which significantly expands the scope of methods involved in the earlier comparative study [47]. The above mentioned continuous and discrete models are non-smooth and convex. We want to mention that there exist also non convex approaches as, e.g., those in [26, 27, 40, 50, 53] which always have to cope with local minima.

Recently, Åström, Petra, Schmitzer, and Schnörr [4] suggested an interesting supervised geometric approach to the labeling problem, see also [3, 5] for the corresponding conference papers. The objective function to maximize is defined on the open manifold of stochastic matrices with the Fisher-Rao metric and a maximizing algorithm via the corresponding Riemannian gradient ascent flow is considered. In the numerical part the authors apply several simplifications, in particular lifting maps to cope with the non completeness of the manifold. Finally this leads to an iterative numerical procedure. Unlike the continuous Riemannian gradient flow that is shown in [4] to converge to edges of the simplex, the authors merely showed that the simplified numerical scheme closely approximates this flow, but did not prove its convergence. In this paper we discuss the convergence of the resulting numerical scheme.

Since the approach in [4] may be hard to read for people not trained in differential geometry we show in Section 3 how their final iteration scheme can be rewritten such that we arrive at an easily understandable algorithm. A slight modification of this algorithm leads to our (even simpler) algorithm reported in the next section.

We start with this algorithm since we want to introduce it from a user-friendly point of view as a multiplicative filtering process of label assignment vectors. We initialize our iterations with a label assignment matrix containing label assignment vectors of averaged distances between the prior features and the observed ones. Matrices of this kind are common kernel matrices in nonlinear filter [11, 48], patch- and graph Laplacian based methods, see, e.g., [9, 20]. In particular the data and their priors may lie in any metric space, which makes the method highly flexible. Then these label assignment vectors are iterated in a multiplicative way, where we have to take care that the components stay within the probability simplex and a (small) fixed distance away from zero. We analyze the convergence of the algorithm if such an is not fixed. In exact arithmetic it appears that in most relevant cases the algorithm would converge to a trivial label assignment matrix. This is not the case if we demand that the components are bounded away from zero by a small distance. Such condition is used in the numerical part anyway to avoid an arithmetic underflow. However, the analysis shows that there is also a theoretical reason for bounding the components away from zero. Indeed, in all numerical examples we observed convergence of the algorithm to the vertices of the probability simplex . We reinterpret our algorithm as gradient ascent step of a certain function on the product manifold of positive numbers followed by a reprojection onto the probability simplex. The projection is done with respect to the Kullback-Leibler distance. We provide various numerical examples.

The outline of this paper is as follows: In Section 2 we introduce and discuss our iterative multiplicative filtering algorithm: the simple algorithm is presented in Subsection 2.1, followed by its convergence analysis in the case that the components of the label assignment vectors are not bounded away from zero in Subsection 2.2. We interpret the algorithm as a gradient ascent step of a certain function followed by a Kullback-Leibler projection onto the probability simplex in Subsection 2.3. The relation to paper [4] together is detailed in Section 3. Numerical results demonstrate the very good performance of our algorithm in Section 4. In particular we apply the method for the partitioning of manifold-valued images. The paper finishes with conclusions in Section 5.

## 2 Multiplicative Filtering of Label Assignment Matrices

### 2.1 Basic Algorithm

In this subsection we introduce the supervised labeling algorithm. As outlined in the introduction two basic ingredients are needed, namely i) feature vectors (prototypical ones for the labels of each pixel) together with a distance measure between them, and ii) neighborhood relations between pixels usually given by certain weights.

Let denote the identity matrix, the matrix with all components , and the -th unit vector in . By we denote the norm and by the inner product of vectors . Likewise, if are matrices, we use . Let be a metric space. In our applications will be or a non-flat Riemannian manifold.

We are interested in partitioning features

 fi∈M,i=1,…,n,

into labels, where we suppose that we are given prior features

 f∗k∈M,k=1,…,K.

In other words, we consider supervised labeling. Then

 D\coloneqq(di,k)n,Ki=1,k=1=(D1…Dn)T,di,k\coloneqqdist(fi,f∗k), (2)

is the distance matrix of the features to the priors. Throughout this paper exponentials and logarithms of matrices are meant componentwise. Further, denotes the componentwise product of matrices. To each pixel we assign a neighborhood and set

 Missing or unrecognized delimiter for \Big (3)

and

 A\coloneqq(A1…An)T∈Rn,K≥0,

where with . We can extend the definition of the weights to all by setting if .

We will work on label assignment matrices with non-negative entries

 W\coloneqq(W1…Wn)T∈Rn,K≥0,

whose rows sum up to 1. In other words, consists of label assignment vectors , where denotes the probability simplex (1). In particular, will serve as initialization for our algorithm. Then we apply a multiplicative neighborhood averaging to the columns of , i.e., separately for each label . To have full flexibility, we may consider other neighborhoods and weights with , for this process. We assume that the weights in Algorithm 1 fulfill

 ρi,j=ρj,i,

i.e., mutual neighbors are weighted in the same way and extend their definition to all indices in by setting if . In summary we propose the Algorithm 1:

When the algorithm stops with , we assign the label

 argmaxk∈{1,…,K}W(R)i,k (4)

to the -th pixel. Let us comment the steps of the algorithm:

#### Step 1.

Here is the weighted geometric mean of the label assignment vectors in the neighborhood of . In particular we have for that

 ∏j∈Nρ(i)(Wj,k)ρi,j=(∏j∈Nρ(i)Wj,k)1|Nρ(i)|,k=1,…,K.

Taking the logarithm we obtain

 logU(r)=logW(r)+PlogW(r)

with the weight matrix

 P\coloneqq(ρi,j)ni,j=1. (5)

Hence Step 1 can be rewritten as

 U(r+1)=W(r)∘exp(PlogW(r)). (6)

#### Step 2.

Here we ensure that the rows of lie in the probability simplex .

#### Step 3.

This step basically avoids that the entries of become too small (actually smaller than ). Very small values do also not appear if we set

 W(r+1)i\coloneqqV(r+1)i+1K(ε−mink{V(r+1)i,k}) (7)

followed by a normalization. Indeed, in numerical tests this update instead of Step 3 works similar. We have chosen Step 3 since together with Step 2 it has a nice interpretation as Kullback-Leibler projection onto the part of the probability simplex with entries entries larger or equal to . However, we will see in the next subsection that Step 3 is not just numerical cosmetics, but in general absolutely necessary to avoid that the iterates converge to the same vertex of the simplex for all which would result in a trivial labeling.

By Step 1, the weight matrix plays a crucial role. By definition this matrix is symmetric and stochastic, i.e., it has nonnegative entries and . We finish this subsection by collecting some well-known properties of symmetric stochastic matrices, see, e.g., [22, 28, 43, 52].

###### Lemma 2.1.

Let be a symmetric stochastic matrix. Then the following holds true:

[label=)]
1. The matrix has an eigenvalue decomposition of the form

 P=QΛQT,Λ\coloneqqdiag(λ1,…,λn), (8)

where , , and

 Q\coloneqq(q1…qn)T

is an orthogonal matrix with first column .

2. If has positive diagonal entries, then is not an eigenvalue.

3. If is irreducible, i.e., the associated matrix graph is connected, then is a single eigenvalue.

4. If has positive diagonal entries and is a single eigenvalue, then converges as to the constant matrix .

### 2.2 Convergence Analysis for ε=0

In this subsection, we analyze the convergence of Algorithm 1 without Step 3, i.e., we use the same initialization as in the algorithm, but compute just

 1 U(r+1)i=W(r)i∘∏j∈Nρ(i)(W(r)j)ρi,j, (9) 2 W(r+1)i=U(r+1)i/∥U(r+1)i∥1. (10)

In other words we consider the case . We will see that the third step is indeed an essential.

###### Lemma 2.2.

The -the iterate in (10) is given by

 W(r)=(expw(r)1∥expw(r)1∥1,…,expw(r)n∥expw(r)n∥1)T, (11)

with

 w(r)=(w(r)1…w(r)n)T\coloneqq(In+P)ra,a\coloneqqlogA

and the weight matrix in (5).

###### Proof.

First note that we obtain the same iterates if we skip the normalization Step 2 in the intermediate iterations and normalize only in step . Therefore we consider the sequence with the same starting point and iterations

 ~W(r+1)i\coloneqq~W(r)i∘∏j∈N(i)(~W(r)j)ρi,j.

Taking the logarithm and setting the iteration becomes

 w(r+1)i=w(r)i+∑j∈N(i)ρi,jw(r)j, (12)

where . With this can be rewritten as

 w(r+1) =(In+P)w(r) (13) =(In+P)r+1w(0). (14)

Then and row normalization yields the assertion. ∎

The following remark gives a first intuition about the convergence behavior of .

###### Remark 2.3.

By (14) we have

 w(r)=2r(12(In+P))rw(0)=2rv(r).

From the definition of we know that is a symmetric stochastic matrix with positive diagonal entries. Assuming that is a single eigenvalue of this matrix, we conclude by Lemma 2.1 that converges to the matrix with constant columns as . However, this does in general not imply that the columns of also converge to a constant vector for as the following example shows: We have

 v(r)\coloneqq⎛⎝12+12r12+(23)r12+(23)r12+12r⎞⎠→(12121212)\eqqcolonvasr→∞.

Multiplying with , taking the exponential and normalizing the rows leads in the first column of to

 exp(2r⎛⎝12+12r12+(23)r⎞⎠)exp(2r(12+12r))+exp(2r(12+(23)r)) =exp((2r−1+12r−1+(43)r))exp(2r−1+1)+exp(2r−1+(43)r) (15) =exp((1(43)r))exp(1)+exp((43)r) (16)

so that

 Unknown environment '%

The general convergence result is given next.

###### Theorem 2.4.
[label=)]
1. The sequence generated by (10) converges.

2. Let have the eigenvalue decomposition (8), where contains the eigenvalues in descending order. Assume that there are pairwise distinct eigenvalues with multiplicity , . Let be the largest index such that . Set

 ^a\coloneqqQTlogA,^a=(^a1…^aK)∈Rn,K.

Then, for every , the sequence converges to a unit vector if and only if fulfills the following property

• If for an , there exist , such that

 cl,k(s)\coloneqqns+κns−1∑j=nsqi,j(^aj,l−^aj,k)=0for all s=1,…,^s,

then there exists and so that

 c~l,k(~s)>0andc~l,k(s)=0for all s=1,…,~s−1. (17)
###### Proof.

i) By (14) and (8) we obtain

 w(r) =(In+P)rw(0)=Q(In+Λ)r^a=QDr^a, (18)

where and , . By Lemma 2.1 and Gershgorin’s circle theorem we know that , and if and only if . Taking the exponential and normalizing the rows we get for the -th row

 W(r)i Extra open brace or missing close brace (19) =⎛⎜ ⎜ ⎜ ⎜⎝11+exp(qTiDr^a2)exp(qTiDr^a1)+…+exp(qTiDr^aK)exp(qTiDr^a1)…1exp(qTiDr^a1)exp(qTiDr^aK)+…+exp(qTiDr^aK−1)exp(qTiDr^aK)+1⎞⎟ ⎟ ⎟ ⎟⎠T (20) =(11+G2,1(r)+…+GK,1(r)…1G1,K(r)+…+GK−1,K(r)+1)T, (21)

where

 Gl,k(r)\coloneqqexp(qTiDr^al)exp(qTiDr^ak)=n∏j=1exp(qi,jμrj(^aj,l−^aj,k)). (22)

Taking the logarithm results in

 gl,k(r) \coloneqqlog(Gl,k(r))=n∑j=1qi,j(^aj,l−^aj,k)μrj=M∑s=1cl,k(s)μrns =^s∑s=1cl,k(s)μrns+M∑s=^s+1cl,k(s)μrns.

The first sum contains the eigenvalues larger than 1 and the second one those smaller or equal to 1. If it exists, let be the smallest index such that . Then it follows

 limr→∞gl,k(r)={constifcl,k(s)=0for all s=1,…,^s,sgn(cl,k(~s))∞ifcl,k(s)=0for all s=1,…,~s−1,

and consequently

 limr→∞Gl,k(r)=⎧⎪⎨⎪⎩exp(const)ifcl,k(s)=0for all s=1,…,^s,+∞ifcl,k(~s)>0,0ifcl,k(~s)<0. (23)

Hence, by (21), we see that converges as .

ii) It remains to examine under which conditions it converges to a unit vector. Consider a fixed . We distinguish two cases.

1. If for all there exists an such that , then we see by (23) that . Since, by construction the components of sum up to 1, we conclude that converges to a unit vector.

2. If there exists such that for all , then, by (23), the component is strictly smaller than 1 and becomes 0 if and only if there exists and so that and for all . This is exactly condition (PI) and we are done. ∎

Let us have a closer look at the property (PI).

If for all which means that we do not take the neighborhoods of the points into account, then and property (PI) is fulfilled if for every the vector has a unique smallest element. Then the algorithm gives the same assignment as just which is reasonable.

Things change completely if we take care of the point neighborhoods. First we give an equivalent condition to (PI).

###### Remark 2.5.

Having a look at part ii) of the above proof we realize by (23) and (21) that converges to a unit vector if and only if one of its components converges to 1. By (23) and (21) this is the case if and only if the following condition (PI) is fulfilled:

• For all there exists such that for all there exists such that

 cl,k(~s)<0andcl,k(s)=0for alls=1,…~s−1.

Checking condition (PI) would give us the limit vector of since it provides the component , where the vector is 1.

The following corollary shows that in many practical relevant image labeling tasks the iterates in (10) converge the same vertex of the simplex.

###### Corollary 2.6.

Let be a symmetric stochastic matrix which has eigenvalue 1 of multiplicity 1 and all other eigenvalues have absolute values smaller than 1. Assume that there exists such that

 n∏j=1Aj,k>n∏j=1Aj,lfor alll∈{1,…,K}∖{k}. (24)

Then, for all , the vectors generated by (10) converge for to the -th unit vector.

###### Proof.

By assumption on we have in (PI), resp. (PI), for and all that

 cl,k(1)=qi,1(^a1,l−^a1,k)

where . By Lemma 2.1 i) we know that the first column of is given by . Thus,

 ^a1,m=1√nn∑j=1logAj,m.

Choosing according to assumption (24) we conclude for all . Thus,

 cl,k(1)=qi,1(^a1,l−^a1,k)=(^a1,l−^a1,k)/√n<0foralll∈{1,…,K}∖{k}

and we can take in (PI). Then we see by the proof of Theorem 2.4, resp. Remark 2.5, that converges for to the -th unit vector independent of . ∎

By Lemma 2.1 i)-iii), the assumptions on are fulfilled if is irreducible and has positive diagonal entries. In our task this is the case if we use local pixel neighborhoods, where the central pixels have positive weights , . Assumption (24) is fulfilled in many practical relevant cases, in particular in all our numerical examples. This underlines the relevance of Step 3 in Algorithm 1. We finish this subsection by an intuitive example to illustrate the importance of .

###### Example 2.7.

We initialize Algorithm 1 with the matrix

 A\coloneqq⎛⎜ ⎜ ⎜⎝4545151515151514141411011015151515151212121101103535353535141414⎞⎟ ⎟ ⎟⎠T, (25)

which corresponds to a labeling of a signal of length with features and three labels. We choose neighborhoods of size 3 with uniform weights and mirror boundary conditions. Table 1 depicts the results of the labeling for the different , where the label assignment changes. In all cases the corresponding label assignment vectors are the vertices of the probability simplex (up to machine precision) and the labels 1,2,3 were assigned via (4). Note that we get the same labeling for as for , but the label assignment matrix converges in the first case to vertices as . Up to the pixels are labeled as expected. If decreases further, we still get reasonable results originating from the large starting values of the first segment. Until double precision is reached, i.e., , we see no more changes in the labeling result. To be able to handle we look at without normalization, i.e., we iterate (12). Then we observe after iterations that the maximal value of is attained in the last row. This means that after normalization in exact arithmetic all pixels will get the label 3. Since (24) reads for our example as

 (n∏j=1Aj,k)3k=1=12×107(64,4,243), (26)

and the third value is the largest one this agrees with Corollary 2.6.

### 2.3 Gradient Ascent Reprojection Algorithm

In this subsection we show that Algorithm 1 can be seen as gradient ascent reprojection algorithm of a certain function defined on the product manifold of positive numbers. The positive numbers form together with the metric on the tangent space given by a Riemannian manifold. The distance between is defined by

 distR∗(x1,x2)\coloneqq|logx1−logx2|.

Indeed, computation in can often be reduced to the Euclidean setting after taking the logarithm. The exponential map and its inverse read

 Expx(ξ)=xexpξx,Logxy=xlogyx. (27)

For and , the weighted Riemannian center of mass on is given by

 argminx∈R∗n∑j=1γjdistR∗(x,xj)2. (28)

The minimizer can be computed as follows. The Riemannian gradient of the squared distance is

 ∇R∗dist2R∗(⋅,y)(x)=−2Logxy=−2xlogyx=2KL(x,y), (29)

where denotes the Kullback-Leibler distance. It is defined by

 KL:RN≥0×RN>0→R,KL(x,y)\coloneqq⟨x,logxy⟩, (30)

with the convention and using the component-wise division in . The function is jointly convex in both arguments. It is the Bregman distance of the Shannon entropy and is related to entropy regularized Wasserstein distances, which have recently found applications in image processing, see [44] and the references therein. Setting the Riemannian gradient of the objective in (28) to zero using (29), and dividing by 2 we get

 0=n∑j=1γjxlogxxj=n∑j=1γjKL(x,xj)=KL(x,n∏j=1xγjj). (31)

Consequently, the weighted Riemannian center of mass is just the weighted geometric mean .

By we denote the usual product manifold. We are interested in the function given by

 F(W) \coloneqq⟨logW,PlogW⟩ (32) =K∑k=1⟨log\underaccent¯Wk,Plog\underaccent¯Wk⟩=K∑k=1(log\underaccent¯Wk)TPlog\underaccent¯Wk, (33)

where , are the columns of and is the weight matrix in (5).

In Step 1 of our algorithm we compute , perform a gradient ascent step of in the Euclidean arithmetic and take the exponential of the result. The direct execution on is given by the following lemma.

###### Lemma 2.8.

Step 1 in Algorithm 1 is a gradient ascent step of given by (32) at with step length 1. In other words, we compute

 ExpW(∇MF(W))=W∘exp(PlogW)

at , where denotes the Riemannian gradient of .

###### Proof.

By separability of we can restrict our attention to single columns of . A tangent vector acts on as

 ξF(\underaccent¯W)=⟨∇F(\underaccent¯W),ξ⟩,

where is just the Euclidean gradient of at . The Riemannian gradient fulfills

 ⟨∇MF(\underaccent¯W),ξ⟩\underaccent¯W=⟨∇F(\underaccent¯W),ξ⟩=⟨1\underaccent¯W∘Plog\underaccent¯W,ξ⟩

for all . By the above definition of the Riemannian metric on we get

 ⟨∇MF(\underaccent¯W),ξ⟩\underaccent¯W=⟨1\underaccent¯W2∘∇MF(\underaccent¯W),ξ⟩=⟨1\underaccent¯W∘Plog\underaccent¯W,ξ⟩

so that . Using finally (27) gives