Blended Matching Pursuit
Matching pursuit algorithms are an important class of algorithms in signal processing and machine learning. We present a blended matching pursuit algorithm, combining coordinate descent-like steps with stronger gradient descent steps, for minimizing a smooth convex function over a linear space spanned by a set of atoms. We derive sublinear to linear convergence rates according to the smoothness and sharpness orders of the function and demonstrate computational superiority of our approach. In particular, we derive linear rates for a wide class of non-strongly convex functions, and we demonstrate in experiments that our algorithm enjoys very fast rates of convergence and wall-clock speed while maintaining a sparsity of iterates very comparable to that of the (much slower) orthogonal matching pursuit.
Ligatures=TeX \setmainfontTeX Gyre Termes \setsansfont[Scale=MatchUppercase]TeX Gyre Heros \setmonofontTeX Gyre Cursor \setmathfontTeX Gyre Termes Math \extrafloats100
Let be a separable real Hilbert space, be a dictionary, and be a smooth convex function. In this paper, we aim at finding a sparse (approximate) solution to the problem:
Together with fast convergence, achieving good sparsity, i.e., keeping the iterates as linear combinations of a small number of atoms in the dictionary , is a primary objective. Therefore, in signal recovery, Problem 1 is often solved with algorithms such as Coordinate Descent or Matching Pursuit (mallat93mp). Our approach is inspired by the Blended Conditional Gradients algorithm of pok18bcg which solves the constrained setting of Problem 1, i.e., minimizing over the convex hull of the dictionary, which is ultimately based on the Frank-Wolfe algorithm (fw56), also called Conditional Gradient algorithm (polyak66cg). As introduced in pok17lazy, pok18bcg enhanced the vanilla Frank-Wolfe algorithm by replacing the linear programming oracle with a weak-separation oracle, and by blending the traditional conditional gradient steps with lazified conditional gradient steps and projected gradient steps, while still avoiding projections.
In a separate line of work, locatello17mpfw unified the Frank-Wolfe and Matching Pursuit algorithms, and proposed a Generalized Matching Pursuit algorithm (GMP) and an Orthogonal Matching Pursuit algorithm (OMP) for solving Problem 1. Essentially, what locatello17mpfw established is that GMP corresponds to the vanilla Frank-Wolfe algorithm and OMP corresponds to the Fully-Corrective Frank-Wolfe algorithm in some meaningful way. GMP and OMP converge with similar rates in the various regimes, namely with a sublinear rate for smooth convex functions and with a linear rate for smooth strongly convex functions, however they have different advantages: GMP converges faster in wall-clock time while OMP offers (much) sparser iterates. The interest in these algorithms is that they work in the general setting of smooth convex functions in Hilbert spaces and without assuming incoherence or restricted isometry properties (RIP) of the dictionary. For an in-depth discussion of the advantages of GMP and OMP over other methods, e.g., in tropp04, grib06, dav10, shalev10, tem13; tem14; tem15, tib15, yao16, and nguyen17, we refer the interested reader to locatello17mpfw. In a follow-up work, locatello18mpcd presented an Accelerated Matching Pursuit algorithm, which we compare our approach to as well.
In short, it is fair to say that pok18bcg and locatello17mpfw motivated our approach, which aims at unifying the best of GMP (speed) and OMP (sparsity). However, while the overall idea is reasonably natural, we face considerable challenges as many important features of Frank-Wolfe methods do not apply anymore in the Matching Pursuit setting and cannot be as easily overcome as in locatello17mpfw, requiring a different analysis. For example, Frank-Wolfe (duality) gaps are not readily available but they are crucial in controlling the blending of steps, and further key components, such as the weak-separation oracle, require modifications.
We demonstrate that Matching Pursuit is also amenable to blending, which results in our Blended Matching Pursuit algorithm (BMP), a fast and sparse first-order method for solving Problem 1. Our contribution can be summarized as follows:
Blended Matching Pursuit.
We provide a new Blended Matching Pursuit algorithm. Our blending approach unifies the best of speed of convergence and sparsity of the iterates into one algorithm, which is of fundamental interest for practitioners solving Problem 1.
We present a convergence analysis of a wide class of functions employing different smoothness and sharpness orders. We establish a continuous range of convergence rates between and depending on the properties of the function, where is the desired accuracy and . In particular, we derive linear rates of convergence for a large class of smooth convex but non-strongly convex functions solving Problem 1.
We demonstrate the computational superiority of our algorithm, comparing it to state-of-the-art results. While our algorithm makes (much) more progress in wall-clock time compared to GMP and OMP, we achieve a level of sparsity that rivals that of OMP. We also show that our approach computationally outperforms the Accelerated Matching Pursuit algorithm of locatello18mpcd, both in speed of convergence and sparsity of the iterates.
The paper is structured as follows. We introduce notions and notation in Section 2. We present the Blended Matching Pursuit algorithm in Section 3. We start with a simplified convergence analysis as a warm-up in Section 3.1 and present the main convergence analysis in Section 3.2. We provide computational experiments in Section 4, and we conclude with some final remarks.
We work in a separable real Hilbert space with induced norm . Given , the inner product between and is denoted by . A set of normalized vectors is a dictionary if it is at most countable and , and in this case its elements are referred to as atoms. For any set , let denote the symmetrization of and denote the diameter of . Let denote the interior of , where denotes the open ball of radius centered at . If is closed and convex, let denote the orthogonal projection onto . Given , we will simply denote by and when these quantities are well-defined and that no confusion is possible. For Problem 1 to be feasible, we will assume to be coercive, i.e., . Since is convex, this is actually a mild assumption if . Last, for , the brackets denote the set of integers between (and including) and .
Let be a Fréchet differentiable function. In the following, we use extended notions of smoothness and strong convexity by introducing orders, and we weaken and generalize the notion of strong convexity to that of sharpness (see, e.g., alex17sharp and kerdreux2018restarting for related work). We say that is:
smooth of order if there exists such that for all ,
strongly convex of order if there exists such that for all ,
sharp of order on if is a bounded set, , and there exists such that for all ,
If needed, we may specify the constants by introducing as -smooth, -strongly convex, or -sharp.
The following fact provides an upper bound on the sharpness order of a smooth function.
Let be smooth of order , convex, and sharp of order on . Then .
Let and . By sharpness, smoothness, and ,
As the left-hand side is constant and can be arbitrarily close to , we conclude that . ∎
2.1 On sharpness and strong convexity
Notice that if is strongly convex of order , then . Let . It follows directly from that for any bounded set such that , is sharp of order on . Thus, strong convexity implies sharpness. However, not every sharp function is strongly convex; moreover, the next example shows that not every sharp and convex function is strongly convex.
Example 2.2 (Distance to a convex set).
Let be a nonempty, closed, and bounded convex set, and be a bounded set such that . The function is convex, and it is sharp of order on . Indeed, since and , we have for all ,
Now, suppose contains more than one element. Then, has more than one minimizer. However, a function that is strongly convex of order has no more than one minimizer. Therefore, cannot be strongly convex of order , for all . Notice that is also a smooth function, of order .
Hence, sharpness can be seen as a more general and weaker condition than strong convexity, as it is a local condition around the optimal solutions whereas strong convexity is a global condition. In fact, building on the Łojasewicz inequality of lojo63, (bolte07lojo, Equation (15)) showed that sharpness always holds in finite dimensional spaces for reasonably well-behaved convex functions; see Lemma 2.3. Polynomial convex functions, , and the rectifier ReLU are simple examples of such functions.
Let be a lower semicontinuous, convex, and subanalytic function with , and denote . Then for any bounded set , there exists and such that for all ,
Strong convexity is a standard assumption in convex minimization to prove linear convergence rates on smooth convex objectives but, regrettably, this considerably restricts the set of candidate functions. For our Blended Matching Pursuit algorithm, we will only assume sharpness to establish linear convergence rates, thus including a wide class of functions.
2.2 Matching Pursuit algorithms
For and , Problem 1 falls in the area of sparse recovery and is often solved using the Matching Pursuit algorithm of mallat93mp. The algorithm recovers the sparse representation of the signal from the dictionary by sequentially pursuing the best matching atom. It searches for the atom most correlated with the residual , i.e., , and adds it to the linear decomposition of the current iterate to form the new iterate , keeping track of the active set . However, this does not prevent the algorithm from selecting atoms that have already been activated or that are redundant. The Orthogonal Matching Pursuit variant (pati93; mallat94omp) of the Matching Pursuit algorithm overcomes this by computing the new iterate as the projection of the signal onto , however requiring more computation; see chen89 and tropp04 for analyses and zhang09 for an extension to the stochastic case. Thus, becomes orthogonal to the active set.
In order to solve Problem 1 for any smooth convex objective, locatello17mpfw proposed the Generalized Matching Pursuit (GMP) and Generalized Orthogonal Matching Pursuit (GOMP) algorithms (Algorithm 1); slightly abusing notation we will refer to the latter simply as Orthogonal Matching Pursuit (OMP). The atom selection subroutine is implemented with a Frank-Wolfe linear minimization query (Line 3). The solution to this query is guaranteed to be a descent direction as it satisfies by symmetry of , and if and only if . Notice that for and , the GMP and OMP variants of Algorithm 1 recover the original Matching Pursuit and Orthogonal Matching Pursuit algorithms respectively. In particular, up to a sign which does not affect the sequence of iterates, .
As mentioned earlier, there is high interest in these generalized algorithms as they work for all smooth convex functions and do not require incoherence or restricted isometry properties (RIP) assumptions on the dictionary.
2.3 Weak-separation oracle
We now introduce the weak-separation oracle (Oracle 1), a modified version of the one first introduced in pok17lazy and recently used in lan2017conditional and pok18bcg. Note that the modification asks for an unconstrained improvement, whereas the original weak-separation oracle required an improvement relative to a reference point. As such, our variant here is even simpler than the original weak-separation oracle. The oracle is called in Line 11 by the Blended Matching Pursuit algorithm (Algorithm 2).
The weak-separation oracle determines whether there exists an atom such that . If this is not the case, then this implies that can be separated from the ambient space by and with the linear inequality for all . In practice, the oracle can be efficiently implemented using caching, i.e., first testing atoms that were already returned during previous calls by the Blended Matching Pursuit algorithm, as they may already satisfy the condition here. If no such atom from a previous call exists, then the condition can be decided, e.g., by means of a call to a linear optimization oracle; see pok17lazy for an in-depth discussion. By biasing the selection of the atom, in case of a positive call, towards one that has already been selected in earlier iterations, caching (together with blending) keeps the new iterate (Line 17) as sparse as possible. We would also like to briefly note that the parameter can be used to further promote positive calls over negative calls, by weakening the improvement requirement.
3 The Blended Matching Pursuit algorithm
We now present our Blended Matching Pursuit algorithm (BMP) in Algorithm 2. Note that although we make blended steps, we can maintain the explicit decompositions of the iterates as linear combinations of the atoms.
Remark 3.1 (Algorithm design).
BMP (Algorithm 2) actually does not require the atoms to all be normalized and only needs the dictionary to be bounded, whether it be for ensuring the convergence rates or for computations; one could further take advantage of this to add weights to certain atoms. Line 6 is simply taking the component of parallel to , which can be achieved by basic linear algebra and is computationally very cheap. The line searches Line 7 and Line 17 can be replaced with explicit step-sizes using the smoothness of (see Fact 3.3). The purpose of (the optional) Line 22 is to reoptimize the active set , e.g., by reducing it to a subset that forms a basis for its linear span. One could also obtain further sparsity by removing atoms whose coefficient in the decomposition of the iterate is smaller than some threshold .
Similarly to the Blended Conditional Gradients algorithm of pok18bcg, BMP employs (pseudo) dual bound estimates as certificate of convergence (see, e.g., Equation (21) in the proof of Theorem 3.6) and as criterion to guide which type of step to take next (see Line 5 and the weak-separation oracle call in Line 11); note that not being valid dual bounds is a major difference with the Blended Conditional Gradients algorithm of pok18bcg. When the criterion indicates not to take a constrained step (Line 7), the weak-separation oracle (Oracle 1) decides between performing a dual bound step (Line 13) or performing a full progress step (Line 17): if the oracle returns a negative answer, this implies that no full step can reduce the primal gap as required and a dual step is taken instead, which scales down the dual bound estimate by a factor (Line 15). In this last case, we can think of BMP as performing an (implicit!) restart with a new primal gap estimate.
The criterion Line 5, driven by dual bound estimates, is critical in balancing speed of convergence and sparsity of the iterates, and depends primarily on . Since by symmetry of (Line 2), setting to a high value () prompts BMP to take constrained steps until a certain target accuracy on is reached, effectively optimizing over the active set , and in this case BMP behaves like OMP. On the other hand, setting to a low value () prompts BMP to take full steps and to essentially behave like GMP. Empirically (see Section 4), we found that setting leads to close to both maximal speed of convergence and sparsity of the iterates, with the default choices . In this setting, BMP converges (much) faster than GMP and has iterates with sparsity very comparable to that of OMP. Note that the value of also impacts the range of values of to which BMP is sensitive, since the criterion (Line 5) tests while the weak-separation oracle asks for such that . Thus, it is no surprise that is optimal when .
Ultimately, the blending of constrained steps, dual steps, and full steps controlled by the criterion in Line 5 and the weak-separation oracle (Oracle 1) via the dual bound estimates is the key to the speed-up in BMP compared to GMP and OMP, while being able to maintain sparsity similar to that of OMP. This process enables BMP to replace the expensive atom computation Line 3 in GMP and OMP with the much cheaper Line 4 in BMP, where the atom search is performed over the (symmetrized) set of active atoms only while GMP and OMP need to search over the whole (symmetrized) dictionary.
At last, and as dicussed in Section 4 and in the sensitivity analyses, we point out that BMP is easy to tune and a robust configuration is always . One does not need to know specific properties of (e.g., the smoothness constant ) to run it with close-to-optimal performance, and hence BMP can actually be seen as a parameter-free algorithm.
3.1 Warm-up: the (strongly) convex case
Here, we establish our main result for BMP in the simpler cases of smooth convex functions of order (Theorem 3.4, sublinear rate) and smooth strongly convex functions of orders (Theorem 3.6, linear rate), to recover the rates of locatello17mpfw for GMP and OMP; our main result (Theorem 3.7) will be presented in Section 3.2. Notice that for ensuring both sublinear and linear rates, locatello17mpfw assume knowledge of an upper bound on , and this before even proving convergence of the algorithms, i.e., that . In locatello18mpcd, this is resolved by working with the atomic norm for defining smoothness and strong convexity of . In contrast, we need neither the finiteness assumption nor to change the norm, however we assume to be coercive to ensure feasibility of Problem 1. The following results will be needed, with proofs provided in the Appendix.
Let be a coercive function and be a sequence of iterates in such that for all . Then is bounded.
Let be differentiable, , , and such that . Define
Theorem 3.4 (Smooth convex case).
Let be a dictionary such that and let be smooth of order , convex, and coercive. Then the Blended Matching Pursuit algorithm (Algorithm 2) ensures that for all where
Let and where , , and are the number of dual steps (Line 13), full steps (Line 17), and constrained steps (Line 7) taken in total respectively. The objective is continuous and coercive so . Let for . Similarly to pok17lazy, we introduce epoch starts at iteration or any iteration immediately following a dual step. Our goal is to bound the number of epochs and the number of iterations within each epoch. Notice that and for .
Let . The function is coercive and for , so by Fact 3.2 the sequence of iterates is bounded. Define . Note that is independent of . Let be an iteration of the algorithm, and . We can assume that otherwise the iterates have already converged. By convexity, . Since , there exists such that . Thus, so
so with (2),
where is the number of dual steps taken before . Therefore, by (5) and since ,
where we used (by symmetry). Therefore, the primal progress is at least
where the last three lines respectively come from the Cauchy-Schwarz inequality, , (Line 5), and (by symmetry). Therefore, the primal progress is at least
whose lower bound only differs by a constant factor from that of a full step (7).
Now, we have
where and are the number of full steps and constrained steps taken during epoch respectively. Let be an epoch start. Thus, is a dual step. By (4), since and ,
We conclude that the algorithm converges with
We now establish Theorem 3.6 in the simplified case of smooth strongly convex functions with orders . Here only, we assume to be finite dimensional for further simplicity of the proof. These simplifications will enable us to understand the main result Theorem 3.7 in the next Section 3.2 more easily. The following Corollary to Fact 3.3 will be needed, with proof provided in the Appendix.
Let be -strongly convex of order with . Then for all ,
Theorem 3.6 (Smooth strongly convex case).
Let be a dictionary such that and let be -smooth of order and -strongly convex of order , with . Then the Blended Matching Pursuit algorithm (Algorithm 2) ensures that for all where
Moreover, as .
Let and where , , and are the number of dual steps (Line 13), full steps (Line 17), and constrained steps (Line 7) taken in total respectively, and . Similarly to pok17lazy, we introduce epoch starts at iteration or any iteration immediately following a dual step. Our goal is to bound the number of epochs and the number of iterations within each epoch. Notice that and for .
Let be an iteration of the algorithm and . We can assume that otherwise the iterates have already converged. By convexity, . Since , there exists such that . Therefore, so
and it follows that
By Corollary 3.5,
Combining with (13) we obtain
where is the number of dual steps taken before . Therefore, by (16) and since , is finite with
where we used (by symmetry). Therefore, the primal progress is at least
Last, if a constrained step is taken (Line 7), then by smoothness,