On convexification/optimization of functionals including an $\ell^2$-misfit term


Abstract

We provide theory for computing the (lower semicontinuous) convex envelope of functionals of the type

f(x) + \|x - d\|^2   (1)

and discuss applications to various non-convex optimization problems. The latter term is a data fit term, whereas $f$ provides structural constraints on $x$. By minimizing (1), possibly with additional constraints, we thus find a tradeoff between matching the measured data $d$ and enforcing a particular structure on $x$, such as sparsity or low rank. For these particular cases, the theory provides alternatives to convex relaxation techniques such as $\ell^1$-minimization (for vectors) and nuclear norm minimization (for matrices). For functionals of the form

f(x) + \|Ax - d\|^2,

where the convex envelope usually is not explicitly computable, we provide theory for how minimizers of (explicitly computable) approximations of the convex envelope relate to minimizers of the original functional. In particular, we give explicit conditions for when the two coincide.

keywords:
Fenchel conjugate, convex envelope, non-convex/non-smooth optimization
MSC:
[2010] 49M20, 65K10, 90C26

1 Introduction

The purpose of this article is to convexify, or partially convexify, functionals of the type

\mu\|x\|_0 + \|Ax - d\|^2,   (2)

where $x, d$ are vectors, $A$ is a linear operator and $\|x\|_0$ denotes the number of non-zero entries of $x$, and

\mu\operatorname{rank}(X) + \|X - D\|_F^2,   (3)

where $X, D \in \mathbb{M}_{m,n}$ (the space of $m \times n$-matrices with the Frobenius norm). In other words, we are interested in computing the l.s.c. convex envelope or at least an approximation thereof. We will also consider weighted norms and penalty terms like

\iota_{\mathcal{M}_K}(X) + \|X - D\|_W^2,   (4)

where $\iota_{\mathcal{M}_K}$ is the indicator functional of the matrices of rank at most $K$ (see Example 2.15) and $\|\cdot\|_W$ a weighted Frobenius norm, in order to treat problems where a matrix of a fixed rank is sought. Such functionals appear in a multitude of optimization problems, where the goal is to find a point such that the functional attains its minimum, possibly with additional constraints. We refer to the overview article Tseng (2010), which includes a long list of applications. The problems of minimizing (2) and (3) differ significantly in that (3) has a closed form solution, whereas solving (2) is NP-hard. However, minimization of (3) over a subspace, or in combination with additional priors, is also a hard well-known problem with many applications, and knowing the l.s.c. convex envelope can help to find approximate solutions, as we advocate in this paper. We refer to Larsson and Olsson (2016); Recht et al. (2010) and the references therein for examples of applications.

Since the functional (4), as well as $\|\cdot\|_0$ and $\operatorname{rank}$, is non-convex, it is tempting to replace them by their convex envelopes. However, in all three cases the convex envelope equals 0. To obtain problems that are efficiently solvable, it is therefore popular to replace e.g. $\|\cdot\|_0$ with the $\ell^1$-norm or $\operatorname{rank}$ by the nuclear norm, a strategy which is sometimes called convex relaxation, thus obtaining a convex problem reminiscent of the original one. Such methods have a long history, but have received new attention in recent times due to the realization that, under certain assumptions, the original problem and the convex relaxation have the same solution, as pioneered in the work concerning compressed sensing Candès et al. (2006). The argument behind the choice of convex replacement is often that the functionals in question are the convex envelopes of the original ones when restricted to the unit ball, see e.g. Recht et al. (2010).

Figure 1: Illustration of a non-convex, non-continuous functional together with its convex envelope and a “traditional” convex relaxation.

Despite the success of these methods, there is a huge difference between the functionals $\mu\|x\|_0$ and $\mu\|x\|_1$ for large values of $\|x\|$, which usually leads to a bias in the solution of the convex relaxation; this is a well known issue (see e.g. Larsson and Olsson (2016); Soubies et al. (2015) for a deeper discussion and further references concerning these problems). To remedy this, there have recently been two independent attempts at finding convexifications closer to the original functional, namely Larsson and Olsson (2016) for minimizing (3) (in combination with additional restrictions) and Soubies et al. (2015) for minimizing (2) as is. In this paper we find a unifying framework and significantly extend the existing theory.

Figure 1 highlights these issues; let $\chi_A$ be the characteristic functional of a set $A$, i.e. the function equalling 1 on $A$ and zero elsewhere. In red we see the functional $\mu\chi_{x \neq 0}(x) + (x - d)^2$ (which is a particular case of both (2) and (3) in dimension 1), in blue its convex envelope, and in pink the convex relaxation $\mu|x| + (x - d)^2$. Clearly the global minima of the red and the blue curves coincide, but the global minimum of the convex relaxation is different. We will return to this picture in Section 4.1.
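The picture is easy to reproduce numerically. The following minimal sketch (ours, not from the original; the values $\mu = 1$, $d = 0.6$ are illustrative and need not match the figure) computes the envelope by a brute-force double Fenchel conjugate on a grid:

```python
import numpy as np

mu, d = 1.0, 0.6                                # illustrative values only
x = 3.0 * np.arange(-1000, 1001) / 1000.0       # grid containing x = 0 exactly

f = mu * (x != 0) + (x - d) ** 2                # red: mu*chi_{x!=0}(x) + (x-d)^2
relax = mu * np.abs(x) + (x - d) ** 2           # pink: l^1 relaxation

def conjugate(vals, grid, dual_grid):
    # Fenchel conjugate g*(y) = sup_x ( x*y - g(x) ), restricted to a grid
    return np.max(dual_grid[:, None] * grid[None, :] - vals[None, :], axis=1)

y = 20.0 * np.arange(-1000, 1001) / 1000.0      # dual grid, wide enough here
env = conjugate(conjugate(f, x, y), y, x)       # biconjugate = l.s.c. convex envelope

print(x[np.argmin(f)], x[np.argmin(env)], x[np.argmin(relax)])
# red and blue attain their global minimum at the same point x = 0,
# whereas the l^1-relaxed minimizer is biased (x close to 0.1)
```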

We now outline the main contributions of this paper in greater detail. Consider any functional of the form

f(x) + \gamma\|x - d\|^2,   (5)

where $x, d$ lie in an arbitrary separable Hilbert space $\mathcal{V}$, $\gamma > 0$, and $f$ is any functional on $\mathcal{V}$ that is bounded below. We introduce a transform $\mathcal{S}_\gamma$, where $\gamma$ is a parameter, and show that the (l.s.c.) convex envelope of the functional in (5) is

\mathcal{S}_\gamma^2(f)(x) + \gamma\|x - d\|^2.   (6)

Values $\gamma \neq 1$ will mainly be of interest in Part III, and we simply write $\mathcal{S}$ in place of $\mathcal{S}_1$. Note that the shape of the convex envelope is completely independent of $d$. The functionals $\mathcal{S}_\gamma(f)$ and $\mathcal{S}_\gamma^2(f)$ are closely related to the Moreau envelope, Lasry-Lions approximants or proximal hulls, which we elaborate more on in Section 2.1. In Section 2.2 we provide numerous examples of $\mathcal{S}^2(f)$ for various functionals $f$ acting on matrices as well as vectors. We also provide a number of general results to simplify the computation of $\mathcal{S}^2(f)$. Finer properties of the transform are proven in Section 2.3, which concludes the first part of the paper, titled “general theory”.

Figure 2: Illustration of a non-convex optimization problem with linear constraints. The bottom left panel shows a non-convex functional along with its level sets. The gray line represents the subspace we are interested in, and the blue curve the values of the functional restricted to the subspace. The bottom right panel shows the same setup, but here the convex envelope is shown as well in orange/yellow. The values of the convex envelope over the subspace are shown by the red curve. The top figure shows a one-dimensional plot of the values of the original functional (blue) and the convex envelope (red) evaluated on the subspace. The respective minima are shown by circles and highlighted by the vertical lines. Note that they are located close to each other but are not identical, despite the fact that the global minima of the original functional and of its convex envelope coincide.

The remainder of the paper is divided into two parts corresponding to the prototype functionals (2) and (3), which are very different in nature. To further explain why, we remark that $\mathcal{S}^2(f)$ can be computed explicitly only if the global minimum of the original functional (5) can be found explicitly, as in the case of (3). Therefore, as opposed to the situation in (2), the problem only becomes difficult in combination with additional restrictions. Suppose e.g. that we want to minimize (5) over some subspace, or say that we wish to minimize $f(x) + g(x) + \|x - d\|^2$ where $g$ is a convex functional related to some additional prior information (see Section 4 in Larsson and Olsson (2016) for concrete examples). In both cases we end up with minimization problems with no closed form solution. Replacing $f$ with $\mathcal{S}^2(f)$ then gives us a convex problem, similar to the original one, which can be addressed with standard convex optimization schemes like the projected subgradient method, dual ascent, ADMM or forward-backward splitting. It is often the case that the minimum of the “convexified” problem coincides with the minimum of the non-convex problem, which is easily verified by simply checking whether $\mathcal{S}^2(f) = f$ holds at the point of convergence. It is important however to realize that this is not always the case, as Figure 2 demonstrates. However, Figure 7 in Section 3.1 shows the same functional with a different subspace on which the two minima do coincide. We elaborate further on this in Section 3.1. More information on these issues, as well as the rationale behind replacing $f$ by $\mathcal{S}^2(f)$, is also found in Larsson and Olsson (2016) (specific to certain rank type functionals acting on matrices) and Andersson et al. (2016b) (studying convex envelopes in greater generality and dual ascent). It is not the aim of the present paper to provide recommendations for which algorithm to use for a specific application, and the best candidate will certainly depend on the particular situation. Nevertheless, several of the algorithms mentioned above require the ability to compute the so-called proximal operator, and we provide theory for this in Section 3.3, which concludes Part II of the paper, titled “applications when explicit formulas are available”.
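As a concrete illustration of this strategy (a sketch of ours, not an algorithm prescribed by the paper), the following applies the projected subgradient method to $\mathcal{S}^2(\mu\|\cdot\|_0)(x) + \|x - d\|^2$ over an affine constraint set, using the closed form (21) derived in Section 2.2, and then performs exactly the verification just described. The data and the constraint are arbitrary illustrative choices:

```python
import numpy as np

mu = 1.0
r  = lambda x: np.sum(mu - np.maximum(np.sqrt(mu) - np.abs(x), 0) ** 2)  # (21)
dr = lambda x: 2 * np.sign(x) * np.maximum(np.sqrt(mu) - np.abs(x), 0)   # a subgradient

d = np.array([2.0, 0.3, -0.1])            # illustrative data
P = lambda x: x - (np.mean(x) - 1.0)      # projection onto {x : mean(x) = 1}

x = P(d.copy())
for k in range(5000):
    step = 0.2 / np.sqrt(k + 1)
    x = P(x - step * (dr(x) + 2 * (x - d)))    # r(x) + ||x - d||^2 is convex

# verification step: the convexified and original problems share their minimum
# over the constraint set iff S^2(f) = f holds at the point of convergence
x0 = np.where(np.abs(x) < 1e-4, 0.0, x)        # snap tiny entries to zero
print(x, np.isclose(r(x), mu * np.count_nonzero(x0)))
```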

Figure 3: The same setup as in Figure 1, but with an additional functional in black illustrating (8), in the case of $\gamma$ below (left) and above (right) the critical value $\|A\|^2$. See Section 4.1 for a more detailed description.

Part III of the paper is devoted to the problem of minimizing

f(x) + \|Ax - d\|^2,   (7)

where $A$ is any (bounded) linear transformation. We assume that $f$ is such that $\mathcal{S}_\gamma^2(f)$ is computable, but due to the linear transformation $A$, the functional

\mathcal{S}_\gamma^2(f)(x) + \|Ax - d\|^2   (8)

will not equal the convex envelope of (7), which we assume is intractable, as in the case of (2). The parameter $\gamma$ now becomes a valuable tool, as it tunes the curvature of $\mathcal{S}_\gamma^2(f)$. The expression (8) is illustrated in Figure 3 (in one dimension, for a smaller (left) and a larger (right) value of $\gamma$). The circles represent global minima of the respective functions.

Generalizing the left figure, we assume in Section 4.2 that $\gamma$ is below the square of the lowest singular value of $A$. We prove that the functional (8) is then a convex functional lying below (7), and hence minimization of (8) will produce a minimizer which, although not necessarily equal to the minimizer of the original problem, likely is closer than that obtained by other convex relaxation methods (if such are available at all). Moreover, the minimizers of the original and the modified problem do coincide whenever $\mathcal{S}_\gamma^2(f) = f$ holds at the computed minimizer, which often is easily checked in practice. An example of when this happens, similar to Figure 3, is shown in Figure 8 in Section 4.2.

For the problem (2), $A$ is usually a matrix with a large kernel, so the smallest singular value is 0, which rules out the above approach. In Section 4.3 we therefore consider the case $\gamma \geq \|A\|^2$, generalizing the situation in the right picture of Figure 3. We can then show that (8) is a continuous (but not everywhere convex) functional with the following desirable properties: (8) lies between (7) and its l.s.c. convex envelope, any local minimizer of (8) is a local minimizer of (7), and the global minimizers of (8) and (7) coincide (see Proposition 4.6 and Theorem 4.7). We remark that, despite (8) not being convex, its critical points can be found using e.g. the forward-backward splitting method Attouch et al. (2013); Bolte et al. (2014). The situation in Section 4.3 is thus drastically different from the previous scenarios; whether a global minimizer of the original problem is found depends only on the starting point of the algorithm seeking a local minimizer. This latter part of the paper is inspired by Soubies et al. (2015), which considers problem (2) and also contains a list of recent algorithms for finding local minima of functionals of the type considered above. We briefly revisit problem (2) in the final Section 4.5.
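The following one-dimensional sketch (ours, with illustrative scalars and with $A = a$, $\gamma > a^2$) verifies the sandwich property and the coincidence of global minimizers numerically; for the penalty we use the $\gamma$-scaled analogue of (20), $\mathcal{S}^2_\gamma(\mu\chi_{x\neq 0})(x) = \mu - \max(\sqrt{\mu} - \sqrt{\gamma}|x|, 0)^2$ (our computation, obtained exactly as in Example 2.4):

```python
import numpy as np

a, d, mu = 0.7, 0.9, 1.0                   # illustrative scalars
gamma = 2 * a ** 2                         # gamma >= ||A||^2 (Section 4.3 regime)
x = 3.0 * np.arange(-1000, 1001) / 1000.0  # grid containing x = 0 exactly

F7 = mu * (x != 0) + (a * x - d) ** 2                              # (7)
S2 = mu - np.maximum(np.sqrt(mu) - np.sqrt(gamma) * np.abs(x), 0) ** 2
F8 = S2 + (a * x - d) ** 2                                         # (8)

def conjugate(vals, grid, dual_grid):
    # Fenchel conjugate on a grid, used twice to get the l.s.c. convex envelope
    return np.max(dual_grid[:, None] * grid[None, :] - vals[None, :], axis=1)

y = 20.0 * np.arange(-1000, 1001) / 1000.0
env = conjugate(conjugate(F7, x, y), y, x)     # envelope of (7), up to grid error

assert np.all(F8 <= F7 + 1e-9) and np.all(env <= F8 + 1e-4)  # (8) is sandwiched
assert np.argmin(F8) == np.argmin(F7)                        # same global minimizer
```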

The paper also contains two appendices of some independent interest. Appendix I revisits an extension of Milman's theorem giving structural properties of l.s.c. convex envelopes, due to A. Brondsted in a short note from 1966 Brondsted (1966), which seems to have remained unnoticed by the community. Appendix II extends the famous von Neumann trace inequality to operators on infinite dimensional spaces, a result which, rather surprisingly, has not been published before to the best of our knowledge.

Notation

Set $[0, \infty] = [0, \infty) \cup \{\infty\}$. The set of complex $m \times n$ matrices, equipped with the Frobenius norm, is denoted $\mathbb{M}_{m,n}$. Throughout the paper, $\mathcal{V}$ and sometimes $\mathcal{U}$ denote separable Hilbert spaces (possibly finite dimensional). Let $\mathcal{HS}(\mathcal{U}, \mathcal{V})$ denote all Hilbert-Schmidt operators $\mathcal{U} \to \mathcal{V}$ with the Hilbert-Schmidt norm. We remark that in case $\mathcal{U} = \mathbb{C}^n$ and $\mathcal{V} = \mathbb{C}^m$ with the canonical norms, then $\mathcal{HS}(\mathcal{U}, \mathcal{V})$ is readily identified with $\mathbb{M}_{m,n}$ with the Frobenius norm. The singular value decomposition (SVD) of a given $A \in \mathbb{M}_{m,n}$ is denoted $A = U\Sigma V^*$, where we choose $U \in \mathbb{M}_{m,m}$, $\Sigma \in \mathbb{M}_{m,n}$ and $V \in \mathbb{M}_{n,n}$. The vector of singular values (i.e. the elements on the diagonal of $\Sigma$) is then denoted by $\sigma(A)$. Note that we thus define the singular values such that the amount of singular values equals the dimension of the domain $\mathcal{U}$. More generally, given any operator $A \in \mathcal{HS}(\mathcal{U}, \mathcal{V})$ acting on infinite dimensional spaces, we can pick singular vectors $(u_j)_j$ and $(v_j)_j$ such that

A = \sum_j \sigma_j(A)\, u_j \langle \cdot\,, v_j \rangle,   (9)

where $(\sigma_j(A))_j$ are the singular values (ordered decreasingly) and $(\sigma_j(A))_j \in \ell^2$. Moreover $(u_j)_j$ can be taken to be an orthonormal sequence in $\mathcal{V}$ and $(v_j)_j$ to be an orthonormal basis in $\mathcal{U}$ (see e.g. Theorem 1.4 Simon (1979)). We follow the matrix theory custom of numbering the singular vectors starting at 1, as opposed to 0, which is more common in operator theory.

$\mathcal{H}(\mathcal{V})$ will denote the subspace of $\mathcal{HS}(\mathcal{V}, \mathcal{V})$ of self-adjoint (Hermitian) operators, and $\lambda(A)$ the vector of eigenvalues of a given $A \in \mathcal{H}(\mathcal{V})$. In case $\mathcal{V}$ has finite dimension $n$, so that $\mathcal{H}(\mathcal{V})$ is identified with the set of Hermitian $n \times n$ matrices, we simply write $\mathcal{H}_n$.

$\ell^2(J)$ for $J = \{1, \ldots, n\}$ is identified with $\mathbb{C}^n$. Given $x \in \ell^2$, $\|x\|_0$ denotes the amount of non-zero elements (by abuse of notation, since this is not a norm), and $\|x\|$ the canonical norm. We abbreviate lower semi-continuous by l.s.c., and we denote by $\operatorname{dom}(f)$ the set of points where the functional $f$ is finite. Both $f^{**}$ and $\operatorname{co}(f)$ will denote the l.s.c. convex envelope of a functional $f$.

$\mathcal{S}_{\gamma, \mathcal{V}}(f)$ is the $\mathcal{S}$-transform computed with the scalar product of $\mathcal{V}$ and parameter $\gamma$. Usually $\mathcal{V}$ is omitted from the notation, and furthermore when $\gamma = 1$ we simply write $\mathcal{S}(f)$.

2 Part I; general theory.

2.1 The $\mathcal{S}_\gamma$-transform

Let $\mathcal{V}$ be a separable¹ Hilbert space over $\mathbb{R}$ or $\mathbb{C}$, such as $\mathbb{C}^n$ with the canonical norm or $\mathbb{M}_{m,n}$, equipped with the Frobenius norm which we denote $\|\cdot\|_F$. All Hilbert spaces over $\mathbb{C}$ are also Hilbert spaces over $\mathbb{R}$ with the scalar product $\operatorname{Re}\langle \cdot, \cdot \rangle$, and hence it is no restriction to assume that $\mathcal{V}$ is a real Hilbert space wherever needed. Even if $\mathcal{V}$ is a Hilbert space over $\mathbb{C}$, we will implicitly assume that the scalar product is $\operatorname{Re}\langle \cdot, \cdot \rangle$.

Given any functional $f$ on $\mathcal{V}$, the Legendre transform (or Fenchel conjugate) $f^*$ is defined as

f^*(y) = \sup_{x \in \mathcal{V}} \big(\langle x, y \rangle - f(x)\big).   (10)

We recall the following well known properties of Legendre transforms, see e.g. Propositions 13.11 and 13.39 in Bauschke and Combettes (2011).

Proposition 2.1.

Let $f$ be a $[0, \infty]$-valued functional on a separable Hilbert space $\mathcal{V}$. Then $f^*$ is l.s.c. convex and $f^{**}$ equals the l.s.c. convex envelope of $f$.

Given a functional $f$ on $\mathcal{V}$, we now introduce a transform $\mathcal{S}(f)$ defined as follows:

\mathcal{S}(f)(y) = \sup_x \big({-f(x)} - \|x - y\|^2\big) = -\inf_x \big(f(x) + \|x - y\|^2\big).   (11)

We remark that $\mathcal{S}^2(f)(x) + \|x\|^2$ is the convex envelope of $f(x) + \|x\|^2$, which is immediate by iteration of

\mathcal{S}(f)(y) = \big(f + \|\cdot\|^2\big)^*(2y) - \|y\|^2.   (12)

It is clear from the second expression in (11) that $\mathcal{S}(f)$ is simply the negative of the famous Moreau envelope. However, the double Moreau envelope does not equal $\mathcal{S}^2(f)$, and is not connected with convex envelopes. For $\mathcal{S}^2(f)$ we do have

\mathcal{S}^2(f)(x) = \sup_y \inf_z \big(f(z) + \|z - y\|^2 - \|y - x\|^2\big).   (13)

With general parameters in front of the two quadratic terms, the above functional is called the Lasry-Lions approximation of $f$ Lasry and Lions (1986), which has been studied in the context of regularization of non-convex functionals. For equal parameters it is also called the proximal hull in Rockafellar and Wets (2009) (see Example 1.44), and it is also studied in Section 6 of Strömberg (1996) (with different notation), mainly with focus on differentiability results. It is also closely connected to the more general “proximal average”, see e.g. Bauschke et al. (2008); Hare (2009).

However, the connection with convex envelopes seems to have been ignored,² and it is the aim of this paper to systematically study this topic and its applications. To create more flexibility, we will tune $\mathcal{S}$ with an additional parameter $\gamma > 0$, basically determining the maximum negative curvature of $\mathcal{S}_\gamma^2(f)$ (Corollary 2.24). To this end set

\mathcal{S}_\gamma(f)(y) = \sup_x \big({-f(x)} - \gamma\|x - y\|^2\big),

so that

\mathcal{S}_\gamma^2(f)(x) = \sup_y \inf_z \big(f(z) + \gamma\|z - y\|^2 - \gamma\|y - x\|^2\big)   (14)

(compare with (13)). When $\gamma = 1$ we simply write $\mathcal{S}$ as before. By the above formula we see that the negative of the Moreau envelope of minus the Moreau envelope (of $f$) does equal $\mathcal{S}_\gamma^2(f)$. $\mathcal{S}_\gamma^2(f)$ can also be seen as an inf-convolution followed by a sup-convolution with $\gamma\|\cdot\|^2$.
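Spelled out (our restatement of the remark above), with $e_\gamma(f)(y) = \inf_x \big(f(x) + \gamma\|x - y\|^2\big)$ denoting the Moreau envelope:

```latex
% S_gamma(f) = -e_gamma(f) by definition, and hence
\mathcal{S}_\gamma^2(f)(x)
   = -e_\gamma\bigl(-e_\gamma(f)\bigr)(x)
   = \sup_y \Bigl( e_\gamma(f)(y) - \gamma\|y - x\|^2 \Bigr)
   = \sup_y \inf_z \Bigl( f(z) + \gamma\|z - y\|^2 - \gamma\|y - x\|^2 \Bigr),
% which is exactly (14): an inf-convolution with gamma*||.||^2
% followed by a sup-convolution with the same kernel.
```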

Proposition 2.2.

Let $f$ be a $[0, \infty]$-valued l.s.c. functional on a separable Hilbert space $\mathcal{V}$ and let $\gamma > 0$. Then $\mathcal{S}_\gamma(f)$ takes values in $[-\infty, 0]$ and is continuous, whereas $\mathcal{S}_\gamma^2(f)$ is l.s.c., takes values in $[0, \infty]$ and is continuous in the interior of $\operatorname{dom}(\mathcal{S}_\gamma^2(f))$.

Proof.

It clearly suffices to prove the statement for $\gamma = 1$. The statements about the signs follow easily from the second expression in (11), which also shows that $\mathcal{S}(f)$ avoids the value $+\infty$. By Proposition 2.1 and (12) it follows that $\mathcal{S}(f)$ (and hence $\mathcal{S}^2(f)$) is the difference of an l.s.c. convex functional and a quadratic term. With this in mind, the continuity statements follow by standard properties of l.s.c. convex functionals (see e.g. Corollary 8.30 Bauschke and Combettes (2011)). ∎

We now focus on the convex envelope of $f(x) + \gamma\|x - d\|^2$ for some fixed $d \in \mathcal{V}$ and $\gamma > 0$.

Theorem 2.3.

Let $f$ be a $[0, \infty]$-valued functional on a separable Hilbert space $\mathcal{V}$. Then

\big(f + \gamma\|\cdot - d\|^2\big)^*(y) = \mathcal{S}_\gamma(f)\Big(d + \frac{y}{2\gamma}\Big) + \langle d, y \rangle + \frac{\|y\|^2}{4\gamma}

and

\big(f + \gamma\|\cdot - d\|^2\big)^{**}(x) = \mathcal{S}_\gamma^2(f)(x) + \gamma\|x - d\|^2.

In particular, $\mathcal{S}_\gamma^2(f)(x) + \gamma\|x - d\|^2$ is the l.s.c. convex envelope of $f(x) + \gamma\|x - d\|^2$ and $0 \leq \mathcal{S}_\gamma^2(f) \leq f$.

Proof.

We have

\big(f + \gamma\|\cdot - d\|^2\big)^*(y) = \sup_x \Big(\langle x, y \rangle - f(x) - \gamma\|x - d\|^2\Big) = \sup_x \Big({-f(x)} - \gamma\Big\|x - \Big(d + \frac{y}{2\gamma}\Big)\Big\|^2\Big) + \langle d, y \rangle + \frac{\|y\|^2}{4\gamma},

from which the first identity follows. Similarly, applying the first identity and substituting $y = 2\gamma(w - d)$ gives

\big(f + \gamma\|\cdot - d\|^2\big)^{**}(x) = \sup_w \Big({-\mathcal{S}_\gamma(f)(w)} - \gamma\|w - x\|^2\Big) + \gamma\|x - d\|^2 = \mathcal{S}_\gamma^2(f)(x) + \gamma\|x - d\|^2.

The statement about the convex envelope is a direct consequence of Proposition 2.1, by which we immediately get $\mathcal{S}_\gamma^2(f)(x) + \gamma\|x - d\|^2 \leq f(x) + \gamma\|x - d\|^2$. This implies the latter part of the inequality $0 \leq \mathcal{S}_\gamma^2(f) \leq f$, whereas the former has already been noticed in Proposition 2.2. ∎

The above theorem can also be applied to expressions of the form

f(x) + \|Ax - d\|^2   (15)

upon renormalizing (e.g. substituting $u = Ax$ when $A$ is invertible), but we postpone the theory for this case to Part III, in particular Proposition 4.9. Finer properties of the $\mathcal{S}_\gamma$-transform are discussed in Section 2.3. In the coming section we give a long list of computable $\mathcal{S}$-transforms, as well as general tools to compute such.

2.2 Examples of $\mathcal{S}$-transforms

We begin by studying the functional $f(x) = \mu\chi_{x \neq 0}(x)$, where $\chi_A$ denotes the characteristic functional of a set $A$ and $\mu > 0$. This seemingly innocent functional is relevant for both key problems (2) and (3), which follows by noting that

\mu\|x\|_0 = \sum_j \mu\chi_{x_j \neq 0}(x_j)   (16)

and

\mu\operatorname{rank}(X) = \sum_j \mu\chi_{\sigma_j(X) \neq 0}(\sigma_j(X)).   (17)

Figure 4: Illustration of $\mu\chi_{x \neq 0}$ (red) along with its double $\mathcal{S}$-transform $\mathcal{S}^2(\mu\chi_{x \neq 0})$ (blue), as computed in Example 2.4.
Example 2.4.

Let $\mathcal{V} = \mathbb{R}$ and set $f(x) = \mu\chi_{x \neq 0}(x)$, where $\mu > 0$ is a fixed parameter (see red curve in Figure 4). Then

\mathcal{S}(f)(y) = \sup_x \big({-\mu\chi_{x \neq 0}(x)} - (x - y)^2\big).   (18)

Clearly, the maximum is found either at $x = 0$ or at $x = y$, which gives

\mathcal{S}(f)(y) = \max(-y^2, -\mu).   (19)

To compute $\mathcal{S}^2(f)$, we repeat the process:

\mathcal{S}^2(f)(x) = \sup_y \big(\min(y^2, \mu) - (y - x)^2\big).

Since $\mathcal{S}(f)$ is constantly equal to its supremum value whenever $|y| \geq \sqrt{\mu}$, it follows that the maximum is attained at $y = x$ for all $x$ satisfying $|x| \geq \sqrt{\mu}$, yielding $\mathcal{S}^2(f)(x) = \mu$. For the same reason the maximum is attained in $[-\sqrt{\mu}, \sqrt{\mu}]$ whenever $|x| < \sqrt{\mu}$. Since the $y^2$-terms cancel in this segment, the functional to be maximized is linear there, and so the maximum must be attained at $y = \operatorname{sign}(x)\sqrt{\mu}$. It easily follows that

\mathcal{S}^2(f)(x) = \mu - \max\big(\sqrt{\mu} - |x|, 0\big)^2.   (20)

For a particular value of $\mu$ this is shown in blue in Figure 4, and the functional $\mathcal{S}^2(f)(x) + (x - d)^2$ can be seen in blue in Figure 1.
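The computation is easy to confirm numerically; here is a small brute-force check of (19) and (20) over a grid (our sketch, with $\mu = 1$):

```python
import numpy as np

mu = 1.0
xs = np.arange(-600, 601) / 200.0     # grid on [-3, 3] containing 0 and sqrt(mu)

f   = mu * (xs != 0)
Sf  = np.array([np.max(-f  - (xs - y) ** 2) for y in xs])   # definition (18)
S2f = np.array([np.max(-Sf - (xs - x) ** 2) for x in xs])   # repeat the process

assert np.allclose(Sf,  np.maximum(-xs ** 2, -mu))                          # (19)
assert np.allclose(S2f, mu - np.maximum(np.sqrt(mu) - np.abs(xs), 0) ** 2)  # (20)
```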

The expression (20) has appeared e.g. in Larsson and Olsson (2016); Soubies et al. (2015). The point here is that it allows us to compute the $\mathcal{S}$-transform of the more complicated cost functionals (16) and (17), when combined with the below propositions. We refer to Ch. I.6 in Conway (2013) for the basics of direct products of separable Hilbert spaces. We write $\mathcal{S}_{\mathcal{V}}$ if there is a need to clarify which space is used to compute the transform.

Proposition 2.5.

Let $\mathcal{V} = \bigoplus_j \mathcal{V}_j$, where $\mathcal{V}_j$ are separable Hilbert spaces, and write $x = (x_j)_j$ for $x \in \mathcal{V}$. Suppose that $f_j$ are $[0, \infty]$-valued functionals on $\mathcal{V}_j$ and set $f(x) = \sum_j f_j(x_j)$, where $x \in \mathcal{V}$ and $x_j \in \mathcal{V}_j$. Then

\mathcal{S}_{\mathcal{V}}(f)(y) = \sum_j \mathcal{S}_{\mathcal{V}_j}(f_j)(y_j).

Proof.

We have that

\mathcal{S}_{\mathcal{V}}(f)(y) = \sup_x \Big({-\sum_j f_j(x_j)} - \sum_j \|x_j - y_j\|^2\Big) = \sum_j \sup_{x_j} \Big({-f_j(x_j)} - \|x_j - y_j\|^2\Big) = \sum_j \mathcal{S}_{\mathcal{V}_j}(f_j)(y_j). ∎

Combining this with Example 2.4 we immediately get

\mathcal{S}^2\big(\mu\|\cdot\|_0\big)(x) = \sum_j \Big(\mu - \max\big(\sqrt{\mu} - |x_j|, 0\big)^2\Big).   (21)
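By separability, minimizing (21) plus a misfit term decouples coordinatewise. The following sketch (ours, with illustrative data) shows that the resulting convex problem has the same minimizer as $\mu\|x\|_0 + \|x - d\|^2$ itself, namely hard thresholding of $d$ at level $\sqrt{\mu}$, in contrast to the biased soft thresholding produced by the $\ell^1$ relaxation:

```python
import numpy as np

mu = 1.0
d = np.array([1.7, 0.4, -0.9, 0.05])       # illustrative data

ts = np.arange(-30000, 30001) / 10000.0    # fine grid on [-3, 3]
pen = mu - np.maximum(np.sqrt(mu) - np.abs(ts), 0) ** 2   # one term of (21)

xmin = np.array([ts[np.argmin(pen + (ts - dj) ** 2)] for dj in d])
hard = np.where(np.abs(d) > np.sqrt(mu), d, 0.0)          # minimizer of the original
soft = np.sign(d) * np.maximum(np.abs(d) - mu / 2, 0)     # l^1 proximal step, biased

print(xmin)   # matches hard = [1.7, 0, 0, 0] up to grid resolution, not soft
```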

To derive a similar expression for (17), we need von Neumann's trace inequality for operators on separable Hilbert spaces. We thus shift focus to functionals acting on the singular values of a matrix or, more generally, a Hilbert-Schmidt operator. Set $\ell^2_\downarrow = \{x \in \ell^2 : x_1 \geq x_2 \geq \ldots \geq 0\}$ and note that the singular values of a given operator lie in the set $\ell^2_\downarrow$, which we identify with the corresponding subset of $\mathbb{R}^n$ in the finite dimensional case (see the Notation section for conventions concerning numbering of singular values).

Given two separable Hilbert spaces $\mathcal{U}, \mathcal{V}$, we let $\mathcal{HS}(\mathcal{U}, \mathcal{V})$ be the Hilbert space of Hilbert-Schmidt operators with the standard norm (see e.g. Simon (1979)). The inequality then reads as follows:

Theorem 2.6.

Let $\mathcal{U}, \mathcal{V}$ be any separable Hilbert spaces, let $A, B \in \mathcal{HS}(\mathcal{U}, \mathcal{V})$ be arbitrary and denote their singular values by $\sigma(A)$, $\sigma(B)$, respectively. Then

\operatorname{Re}\,\langle A, B \rangle \leq \sum_j \sigma_j(A)\,\sigma_j(B),

with equality if and only if the singular vectors can be chosen identically.

The statement is well known for matrices but, surprisingly, the infinite dimensional version is nowhere to be found in the standard literature on operator theory, and we have also not been able to locate it in any scientific publication. For that reason, we include a proof in Appendix II.

Proposition 2.7.

Let $\mathcal{U}, \mathcal{V}$ be any separable Hilbert spaces. Suppose that $F$ is a permutation and sign invariant $[0, \infty]$-valued functional on $\ell^2$, that $f(A) = F(\sigma(A))$, and that $A \in \mathcal{HS}(\mathcal{U}, \mathcal{V})$. Then

\mathcal{S}_{\mathcal{HS}}(f)(A) = \mathcal{S}_{\ell^2}(F)(\sigma(A)).

In particular, this identity holds for all matrices.

Proof.

Since $\|A - B\|^2 = \|A\|^2 - 2\operatorname{Re}\langle A, B \rangle + \|B\|^2$, von Neumann's inequality implies that the supremum in the definition of $\mathcal{S}_{\mathcal{HS}}(f)(A)$ is attained for a $B$ that shares singular vectors with $A$. Hence

\mathcal{S}_{\mathcal{HS}}(f)(A) = \sup_{s \in \ell^2_\downarrow} \big({-F(s)} - \|\sigma(A) - s\|^2\big).

Due to the permutation and sign invariance of $F$, we can drop the restrictions on $s$ of being non-negative and decreasing, and so

\mathcal{S}_{\mathcal{HS}}(f)(A) = \sup_{s \in \ell^2} \big({-F(s)} - \|\sigma(A) - s\|^2\big) = \mathcal{S}_{\ell^2}(F)(\sigma(A)). ∎

It is now easy to determine the $\mathcal{S}$-transform of the rank functional on matrices. The computations in the previous proposition, as well as the expressions in the next example, are taken from Larsson and Olsson (2016). The main point of the previous propositions is to make these ideas easily applicable also in other scenarios.

Example 2.8.

Recall (17). By combining Propositions 2.5 and 2.7 with (19) and (20) we immediately get that

\mathcal{S}\big(\mu\operatorname{rank}\big)(Y) = \sum_j \max\big({-\sigma_j(Y)^2}, -\mu\big)   (22)

and

\mathcal{S}^2\big(\mu\operatorname{rank}\big)(X) = \sum_j \Big(\mu - \max\big(\sqrt{\mu} - \sigma_j(X), 0\big)^2\Big).   (23)
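Numerically, (23) is evaluated on the singular values; each term contributes at most $\mu$, and exactly $\mu$ precisely when $\sigma_j(X) \geq \sqrt{\mu}$. A quick sketch (ours):

```python
import numpy as np

mu = 1.0
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) @ np.diag([3.0, 1.5, 0.1, 0.0])   # rank 3

sv = np.linalg.svd(X, compute_uv=False)
S2_rank = np.sum(mu - np.maximum(np.sqrt(mu) - sv, 0) ** 2)        # (23)
print(S2_rank, mu * np.linalg.matrix_rank(X))                      # S^2(mu*rank) <= mu*rank
```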

We now enter unexplored territory. For applications where one minimizes a functional over a certain linear subspace of $\mathbb{M}_{m,n}$, the unweighted Frobenius norm is not always the most natural choice, as the next example highlights.

Example 2.9.

Fix $N = 2n - 1$ and let $H(f) \in \mathbb{M}_{n,n}$ be the Hankel matrix generated by the sequence $f \in \mathbb{C}^N$, i.e. $H(f)_{i,j} = f_{i+j-1}$. If one is interested in minimizing the rank of a Hankel matrix while at the same time not deviating far from some measurement $d \in \mathbb{C}^N$, as is frequent in frequency estimation Andersson et al. (2016c), one option is to minimize the functional $\mu\operatorname{rank}(X) + \|X - H(d)\|_F^2$ over the set of Hankel matrices (we consider minimization over subspaces in more detail in Part II, Example 3.2). Setting $X = H(f)$, the quadratic term corresponds to a weighted misfit term of the form

\|H(f) - H(d)\|_F^2 = \sum_{k=1}^{N} w(k)\,|f_k - d_k|^2, \qquad w(k) = \min(k, N + 1 - k)   (24)

(see Figure 5, left), which is not the most natural quantity to minimize, as has been observed by many authors (e.g. Gillard and Zhigljavsky (2013)).

Figure 5: Left: the weight appearing in (24). Right: the corresponding weight for (28).
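The weight in (24) is easy to make explicit: the entry $f_k$ appears once for every position on the $k$-th antidiagonal of $H(f)$, giving $w(k) = \min(k, N + 1 - k)$. A short check (ours):

```python
import numpy as np

n = 7
N = 2 * n - 1
w = np.array([min(k, N + 1 - k) for k in range(1, N + 1)])   # 1, 2, ..., n, ..., 2, 1

f = np.random.default_rng(1).standard_normal(N)
H = np.array([[f[i + j] for j in range(n)] for i in range(n)])  # H(f), 0-based indices
assert np.isclose(np.sum(H ** 2), np.sum(w * f ** 2))           # (24) with d = 0
```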

To partially remedy the issues highlighted in the previous example, we include a few results on how to compute $\mathcal{S}$-transforms with respect to certain weighted spaces of matrices. Given $W \in \mathbb{M}_{m,n}$ with (strictly) positive entries, we let $\mathbb{M}_W$ be the Hilbert space obtained by introducing the norm

\|X\|_W^2 = \sum_{i,j} W_{i,j}\,|X_{i,j}|^2,

where e.g. $X_{i,j}$ are the entries of $X$. In case $W \equiv 1$, i.e. $W$ is equal to one componentwise, we simply write $\mathbb{M}_{m,n}$ as earlier. Suppose now that we are interested in computing $\mathcal{S}_{\mathbb{M}_W}(f)$, where $f$ is such that $\mathcal{S}_{\mathbb{M}_{m,n}}(f)$ has an explicit expression. In general, this will only be possible if $W$ is a direct tensor, i.e. of the form

W_{i,j} = w^{(1)}_i w^{(2)}_j,   (25)

where $w^{(1)}$ and $w^{(2)}$ are sequences of length $m$ and $n$, respectively. The following examples and proposition show how to do this. A linear operator between two spaces that is bijective and isometric will be referred to as unitary.

Example 2.10.

Under the assumption (25), note that

\Phi(X) = D_{\sqrt{w^{(1)}}}\, X\, D_{\sqrt{w^{(2)}}}

is unitary between $\mathbb{M}_W$ and $\mathbb{M}_{m,n}$, where e.g. $D_{\sqrt{w^{(1)}}}$ is a diagonal matrix with $\sqrt{w^{(1)}}$ on the diagonal. Also note that $\Phi^{-1}(X) = D_{1/\sqrt{w^{(1)}}}\, X\, D_{1/\sqrt{w^{(2)}}}$, where $1/\sqrt{w}$ refers to componentwise division. The space $\mathbb{M}_W$ is of course the same as $\mathbb{M}_{m,n}$ as a vector space, but with a different norm. In fact, if $(e_{i,j})$ denotes the canonical basis in $\mathbb{M}_{m,n}$, we have that $e_{i,j}/\sqrt{W_{i,j}}$ ($1 \leq i \leq m$, $1 \leq j \leq n$) defines an orthonormal basis in $\mathbb{M}_W$. Each matrix defines an operator by the usual matrix multiplication, i.e. $x \mapsto Xx$. It is easy to see that the Hilbert-Schmidt norm of this operator, viewed as acting from $\ell^2_{1/w^{(2)}}$ to $\ell^2_{w^{(1)}}$ (the correspondingly weighted $\ell^2$-spaces), equals $\|X\|_W$.

Proposition 2.11.

Let $\mathcal{U}_1, \mathcal{U}_2$ and $\mathcal{V}_1, \mathcal{V}_2$ be separable Hilbert spaces, let $U : \mathcal{U}_1 \to \mathcal{U}_2$ be unitary and let $V : \mathcal{V}_1 \to \mathcal{V}_2$ be unitary. Then the induced map $\Psi : \mathcal{HS}(\mathcal{U}_1, \mathcal{V}_1) \to \mathcal{HS}(\mathcal{U}_2, \mathcal{V}_2)$ given by $\Psi(A) = V A U^{-1}$ is unitary.

Moreover, let $f$ be a $[0, \infty]$-valued functional on $\mathcal{HS}(\mathcal{U}_2, \mathcal{V}_2)$. Then

\mathcal{S}(f \circ \Psi) = \mathcal{S}(f) \circ \Psi

and

\mathcal{S}^2(f \circ \Psi) = \mathcal{S}^2(f) \circ \Psi.

Proof.

The first statement is immediate by the definition of the Hilbert-Schmidt norm. The first identity follows from the calculation

\mathcal{S}(f \circ \Psi)(A) = \sup_B \big({-f(\Psi(B))} - \|A - B\|^2\big) = \sup_C \big({-f(C)} - \|\Psi(A) - C\|^2\big) = \mathcal{S}(f)(\Psi(A)),

where we substituted $C = \Psi(B)$ and used that $\Psi$ is unitary, and the latter is a consequence of applying the former twice. ∎

Example 2.12.

We continue Example 2.10. Set $\mathcal{U}_1 = \ell^2_{1/w^{(2)}}$, $\mathcal{U}_2 = \mathbb{C}^n$, $\mathcal{V}_1 = \ell^2_{w^{(1)}}$, $\mathcal{V}_2 = \mathbb{C}^m$, $U = D_{1/\sqrt{w^{(2)}}}$ and $V = D_{\sqrt{w^{(1)}}}$, so that the induced map $\Psi$ of Proposition 2.11 coincides with $\Phi$. Note that $\operatorname{rank}(\Phi(X)) = \operatorname{rank}(X)$, since left or right multiplication with invertible diagonal matrices does not change the rank. By Proposition 2.11 and Example 2.8 we conclude that

\mathcal{S}_{\mathbb{M}_W}\big(\mu\operatorname{rank}\big)(Y) = \sum_j \max\big({-\sigma_j(\Phi(Y))^2}, -\mu\big)   (26)

and

\mathcal{S}^2_{\mathbb{M}_W}\big(\mu\operatorname{rank}\big)(X) = \sum_j \Big(\mu - \max\big(\sqrt{\mu} - \sigma_j(\Phi(X)), 0\big)^2\Big),   (27)

generalizing (22) and (23).
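Formula (26) can be sanity checked numerically (our sketch): by Eckart-Young applied in the $\Phi$-coordinates, the supremum defining $\mathcal{S}_{\mathbb{M}_W}(\mu\operatorname{rank})$ reduces to a maximum over ranks:

```python
import numpy as np

rng = np.random.default_rng(2)
m = n = 3
w1, w2 = rng.uniform(0.5, 2.0, m), rng.uniform(0.5, 2.0, n)   # weights as in (25)
mu = 1.0
Y = rng.standard_normal((m, n))

PhiY = np.diag(np.sqrt(w1)) @ Y @ np.diag(np.sqrt(w2))
sv = np.linalg.svd(PhiY, compute_uv=False)

# sup_B ( -mu*rank(B) - ||Y - B||_W^2 ): the best rank-r approximant in the
# weighted norm has squared error sum(sv[r:]**2) by Eckart-Young
lhs = max(-mu * r - np.sum(sv[r:] ** 2) for r in range(len(sv) + 1))
rhs = np.sum(np.maximum(-sv ** 2, -mu))                       # formula (26)
assert np.isclose(lhs, rhs)
```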

Example 2.13.

Continuing Example 2.9, we consider minimization of the functional

\mu\operatorname{rank}(X) + \|X - H(d)\|_W^2

over the set of Hankel matrices, where we assume that $N$ is odd and that $W$ is of the form (25) with $w^{(1)} = w^{(2)} = w$. By the above theory the l.s.c. convex envelope is given by

\sum_j \Big(\mu - \max\big(\sqrt{\mu} - \sigma_j(\Phi(X)), 0\big)^2\Big) + \|X - H(d)\|_W^2.

Inserting $X = H(f)$ in the quadratic term gives

\|H(f) - H(d)\|_W^2 = \sum_{k=1}^{N} \widetilde{w}(k)\,|f_k - d_k|^2, \qquad \widetilde{w}(k) = \sum_{i+j-1=k} w_i w_j,   (28)

where $\widetilde{w}$ is depicted in Figure 5, right. Compared with (24), this weight is clearly much closer to a uniformly flat weight (both weights (24) and (28) start and end with the weight 1, so the scaling in Figure 5 is fair). What the optimal choice of $w$ would be in order to yield as flat a weight as possible is, to our knowledge, an open question.

We now change focus and take a look at functionals forcing a predetermined number of non-zero terms.

Example 2.14.

In $\mathbb{C}^n$, define $f = \iota_{\{x : \|x\|_0 \leq K\}}$ and define $\hat{x}$ to be the vector $x$ resorted so that $(|\hat{x}_j|)_{j=1}^{n}$ is a decreasing sequence. Then

\mathcal{S}(f)(y) = -\sum_{j > K} |\hat{y}_j|^2.

To see this, note that $\mathcal{S}(f)(y) = -\inf_{\|x\|_0 \leq K} \|x - y\|^2$, and it is clear that the optimal value of $x_j$ is $y_j$ if $|y_j|$ is among the $K$ greatest, and zero else.

The computation of $\mathcal{S}^2(f)$ is more involved. The expression is

\mathcal{S}^2(f)(x) = \frac{1}{K - l}\Big(\sum_{j > l} |\hat{x}_j|\Big)^2 - \sum_{j > l} |\hat{x}_j|^2,

where $l$ is a particular number between $0$ and $K - 1$ depending on $x$. This is derived in Andersson et al. (2016a), albeit without using $\mathcal{S}$-transforms and in the setting of matrices with fixed rank (see Example 2.15). Nevertheless, the computations are easily adapted to $f$ as above.
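The first formula is easily confirmed by brute force over all supports (our sketch):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, K = 6, 2
y = rng.standard_normal(n)

# S(f)(y) = -inf over ||x||_0 <= K of ||x - y||^2; for a fixed support S the
# optimal x equals y on S, so the infimum scans the complements of K-sets
best = max(-np.sum(y[list(set(range(n)) - set(S))] ** 2)
           for S in combinations(range(n), K))
closed = -np.sum(np.sort(np.abs(y))[: n - K] ** 2)   # drop the K largest entries
assert np.isclose(best, closed)
```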

Example 2.15.

Let $\mathcal{M}_K$ be the manifold of matrices of rank at most $K$, and let $\iota_{\mathcal{M}_K}$ be the indicator functional of $\mathcal{M}_K$, i.e. the functional which is 0 on $\mathcal{M}_K$ and $\infty$ elsewhere. Letting $f$ be as in Example 2.14, note that $\iota_{\mathcal{M}_K}(X) = f(\sigma(X))$. Hence we can use Proposition 2.7 to see that $\iota_{\mathcal{M}_K}$ has $\mathcal{S}$-transform $-\sum_{j > K} \sigma_j(Y)^2$ and

\mathcal{S}^2(\iota_{\mathcal{M}_K})(X) = \frac{1}{K - l}\Big(\sum_{j > l} \sigma_j(X)\Big)^2 - \sum_{j > l} \sigma_j(X)^2.