Orthogonal Matching Pursuit with Replacement
Abstract
In this paper, we consider the problem of compressed sensing where the goal is to recover almost all the sparse vectors using a small number of fixed linear measurements. For this problem, we propose a novel partial hardthresholding operator that leads to a general family of iterative algorithms. While one extreme of the family yields well known hard thresholding algorithms like ITI and HTPMalekiD10 (); Foucart10 (), the other end of the spectrum leads to a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR, like the classic greedy algorithm OMP, adds exactly one coordinate to the support at each iteration, based on the correlation with the current residual. However, unlike OMP, OMPR also removes one coordinate from the support. This simple change allows us to prove that OMPR has the best known guarantees for sparse recovery in terms of the Restricted Isometry Property (a condition on the measurement matrix). In contrast, OMP is known to have very weak performance guarantees under RIP. Given its simple structure, we are able to extend OMPR using locality sensitive hashing to get OMPRHash, the first provably sublinear (in dimensionality) algorithm for sparse recovery. Our proof techniques are novel and flexible enough to also permit the tightest known analysis of popular iterative algorithms such as CoSaMP and Subspace Pursuit. We provide experimental results on large problems providing recovery for vectors of size up to million dimensions. We demonstrate that for largescale problems our proposed methods are more robust and faster than existing methods.
1 Introduction
We nowadays routinely face highdimensional datasets in diverse application areas such as biology, astronomy, finance and the web. The associated curse of dimensionality is often alleviated by prior knowledge that the object being estimated has some structure. One of the most natural and wellstudied structural assumption for vectors is sparsity. Accordingly, a huge amount of recent work in machine learning, statistics and signal processing has been devoted to finding better ways to leverage sparse structures. Compressed sensing, a new and active branch of modern signal processing, deals with the problem of designing measurement matrices and recovery algorithms, such that almost all sparse signals can be recovered from a small number of measurements. It has important applications in imaging, computer vision and machine learning (see, for example, DuarteDTLSKB08 (); WrightMMSHY10 (); HsuKLZ10 ()).
In this paper, we focus on the compressed sensing setting (CandesT05, ; Donoho06, ) where we want to design a measurement matrix such that a sparse vector with can be efficiently recovered from the measurements . Initial work focused on various random ensembles of matrices such that, if was chosen randomly from that ensemble, one would be able to recover all or almost all sparse vectors from . CandesT05 () isolated a key property called the restricted Isometry property (RIP) and proved that, as long as the measurement matrix satisfies RIP, the true sparse vector can be obtained by solving an optimization problem,
The above problem can be easily formulated as a linear program and is hence efficiently solvable. We recall for the reader that a matrix is said to satisfy RIP of order if there is some such that, for all with , we have
Several random matrix ensembles are known to satisfy with high probability provided one chooses measurements. Candes08 () showed that minimization recovers all sparse vectors provided satisfies although the condition has been recently improved to (Foucart10b, ). Note that, in compressed sensing, the goal is to recover all, or most, sparse signals using the same measurement matrix . Hence, weaker conditions such as restricted convexity NRWY09 () studied in the statistical literature (where the aim is to recover a single sparse vector from noisy linear measurements) typically do not suffice. In fact, if RIP is not satisfied then multiple sparse vectors can lead to the same observation , hence making recovery of the true sparse vector impossible.
Based on its RIP guarantees, minimization can guarantee recovery using just measurements, but it has been observed in practice that minimization is too expensive in large scale applications (DonohoMaMo09, ), for example, when the dimensionality is in the millions. This has sparked a huge interest in iterative methods for sparse recovery. An early classic iterative method is Orthogonal Matching Pursuit (OMP) (PatiRK93, ; DavidMA97, ) that greedily chooses elements to add to the support. It is a natural, easytoimplement and fast method but unfortunately lacks strong theoretical guarantees. Indeed, it is known that, if run for iterations, OMP cannot uniformly recover all sparse vectors under an RIP condition of the form (Rauhut08, ; MoS11, ). However, Zhang10 () showed that OMP, if run for iterations, recovers the optimal solution for ; a significantly more restrictive condition than the ones required by other methods like minimization.
Several other iterative approaches have been proposed that include Iterative Soft Thresholding (IST) (MalekiD10, ), Iterative Hard Thresholding (IHT) (BlumensathD09, ), Compressive Sampling Matching Pursuit (CoSaMP) (NeedellT09, ), Subspace Pursuit (SP) (DaiM09, ), Iterative Thresholding with Inversion (ITI) (Maleki09, ), Hard Thresholding Pursuit (HTP) (Foucart10, ) and many others. Among the family of iterative hard thresholding algorithms, following MalekiD10 (), we can identify two major subfamilies: one and twostage algorithms. As their names suggest, the distinction is based on the number of stages in each iteration of the algorithm. Onestage algorithms such as IHT, ITI and HTP, decide on the choice of the next support set and then usually solve a least squares problem on the updated support. The onestage methods always set the support set to have size , where is the target sparsity level. On the other hand, twostage algorithms, notable examples being CoSaMP and SP, first enlarge the support set, solve a least squares on it, and then reduce the support set back again to the desired size. A second least squares problem is then solved on the reduced support. These algorithms typically enlarge and reduce the support set by or elements. An exception is the twostage algorithm FoBa Zhang08 () that adds and removes single elements from the support. However, it differs from our proposed methods as its analysis requires very restrictive RIP conditions ( as quoted in HsuKLZ10 ()) and the connection to locality sensitive hashing (see below) is not made. Another algorithm with replacement steps appears in ShwartzSZ10 (). However, the algorithm and the setting under which it is analyzed are different from ours.
In this paper, we present, and provide a unified analysis for a family of onestage iterative hard thresholding algorithms. The family is parameterized by a positive integer . At the extreme value , we recover the algorithm ITI/HTP. At the other extreme , we get a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR can be thought of as a simple modification of the classic greedy algorithm OMP: instead of simply adding an element to the existing support, it replaces an existing support element with a new one. Surprisingly, this change allows us to prove sparse recovery under the condition . This is the best based RIP condition under which any method, including minimization, can be shown to provably perform sparse recovery.
OMPR also lends itself to a faster implementation using locality sensitive hashing (LSH). This allows us to provide recovery guarantees using an algorithm whose runtime is provably sublinear in , the number of dimensions. An added advantage of OMPR, unlike many iterative methods, is that no careful tuning of the stepsize parameter is required even under noisy settings or even when RIP does not hold. The default stepsize of is always guaranteed to converge to at least a local optima.
Finally, we show that our proof techniques used in the analysis of the OMPR family are useful in tightening the analysis of twostage algorithms, such as CoSaMP and SP, as well. As a result, we are able to prove better recovery guarantees for these algorithms — for CoSaMP and for SP. We hope that this unified analysis sheds more light on the interrelationships between the various kinds of iterative hard thresholding algorithms.
In summary, the contributions of this paper are as follows.

We present a family of iterative hard thresholding algorithms that on one end of the spectrum includes existing algorithms such as ITI/HTP while on the other end gives OMPR. OMPR is an improvement over the classical OMP method as it enjoys better theoretical guarantees and is also better practically as shown in our experiments.

Unlike other improvements over OMP, such as CoSaMP or SP, OMPR changes only one element of the support at a time. This allows us to use Locality Sensitive Hashing (LSH) to speed it up resulting in the first provably sublinear (in the ambient dimensionality ) time sparse recovery algorithm.

We provide a general proof for all the algorithms in our partial hard thresholding based family. In particular, we can guarantee recovery using OMPR, under both noiseless and noisy settings, provided . This is the least restrictive condition under which any efficient sparse recovery method is known to work. Furthermore, our proof technique can be used to provide a general theorem that provides the least restrictive known guarantees for all the twostage algorithms such as CoSamp and SP (see Appendix D).
All proofs omitted from the main body of the paper can be found in the appendix.
2 Orthogonal Matching Pursuit with Replacement
Orthogonal matching pursuit (OMP), is a classic iterative algorithm for sparse recovery. At every stage, it selects a coordinate to include in the current support set by maximizing the inner product between columns of the measurement matrix and the current residual . Once the new coordinate has been added, it solves a least squares problem to fully minimize the error on the current support set. As a result, the residual becomes orthogonal to the columns of that correspond to the current support set. Thus, the least squares step is also referred to as orthogonalization by some authors (DavenportW10, ).
Let us briefly explain some of our notation. We use the MATLAB notation:
The hard thresholding operator sorts its argument vector in decreasing order (in absolute value) and retains only the top entries. It is defined formally in the next section. Also, we use subscripts to denote subvectors and submatrices, e.g. if is a set of cardinality and , denotes the subvector of indexed by . Similarly, for a matrix denotes a submatrix of size with columns indexed by . The complement of set is denoted by and denotes the subvector not indexed by . The support (indices of nonzero entries) of a vector is denoted by .
Our new algorithm called Orthogonal Matching Pursuit with Replacement (OMPR ), shown as Algorithm 1, differs from OMP in two respects. First, the selection of the coordinate to include is based not just on the magnitude of entries in but instead on a weighted combination with the stepsize controlling the relative importance of the two addends. Second, the selected coordinate replaces one of the existing elements in the support, namely the one corresponding to the minimum magnitude entry in the weighted combination mentioned above.
Once the support of the next iterate has been determined, the actual iterate is obtained by solving the least squares problem:
Note that if the matrix satisfies RIP of order or larger, the above problem will be well conditioned and can be solved quickly and reliably using an iterative least squares solver.
We will show that OMPR, unlike OMP, recovers any sparse vector under the RIP based condition . This appears to be the least restrictive recovery condition (i.e., best known condition) under which any method, be it basis pursuit (minimization) or some iterative algorithm, is guaranteed to recover all sparse vectors.
In the literature on sparse recovery, RIP based conditions of a different order other than are often provided. It is seldom possible to directly compare two conditions, say, one based on and the other based on . Foucart (Foucart10, ) has given a heuristic to compare such RIP conditions based on the number of samples it takes in the Gaussian ensemble to satisfy a given RIP condition. This heuristic says that an RIP condition of the form is less restrictive if the ratio is smaller. For the OMPR condition , this ratio is which makes it heuristically the least restrictive RIP condition for sparse recovery.
Theorem 1 (Noiseless Case).
Suppose the vector is sparse and the matrix satisfies and . Then OMPR recovers approximation to from measurements in iterations.
Theorem 2 (Noisy Case).
Suppose the vector is sparse and the matrix satisfies and . Then, in O() iterations OMPR converges to approximate solution, i.e., from measurements . is a universal constant and is dependent only on .
The above theorems are actually special cases of our convergence results for a family of algorithms that contains OMPR as a special case. We now turn our attention to this family. We note that the condition is very mild and will typically hold for standard random matrix ensembles as soon as the number of rows sampled is larger than a fixed universal constant.
3 A New Family of Iterative Algorithms
In this section we show that OMPR is one particular member of a family of algorithms parameterized by a single integer . The th member of this family, OMPR (), shown in Algorithm 2, replaces at most elements of the current support with new elements. OMPR corresponds to the choice . Hence, OMPR and OMPR () refer to the same algorithm.
Our first result in this section connects the OMPR family to hard thresholding. Given a set of cardinality , define the partial hard thresholding operator
(1) 
As is clear from the definition, the operator tries to find a vector close to a given vector under two constraints: (i) the vector should have bounded support (), and (ii) its support should not include more than new elements outside a given support .
The name partial hard thresholding operator is justified because of the following reasoning. When , the constraint is trivially implied by and hence the operator becomes independent of . In fact, it becomes identical to the standard hard thresholding operator
(2) 
Even though the definition of seems to involve searching through subsets, it can in fact be computed efficiently by simply sorting the vector by decreasing absolute value and retaining the top entries.
The following result shows that even the partial hard thresholding operator is easy to compute. In fact, lines 6–8 in Algorithm 2 precisely compute .
Proposition 3.
Let and be given. Then can be computed using the sequence of operations
The proof of this proposition is straightforward and elementary. However, using it, we can now see that the OMPR () algorithm has a simple conceptual structure. In each iteration (with current iterate having support ), we do the following:

(Gradient Descent) Form . Note that is the gradient of the objective function at .

(Partial Hard Thresholding) Form by partially hard thresholding using the operator .

(Least Squares) Form the next iterate by solving a least squares problem on the support of .
A nice property enjoyed by the entire OMPR family is guaranteed sparse recovery under RIP based conditions. Note that the condition under which OMPR () recovers sparse vectors becomes more restrictive as increases. This could be an artifact of our analysis, as in experiments, we do not see any degradation in recovery ability as is increased.
Theorem 4 (Noiseless Case).
Suppose the vector is sparse. Then OMPR () recovers an approximation to from measurements in iterations provided we choose a step size that satisfies and .
Theorem 5 (Noisy Case).
Suppose the vector is sparse. Then OMPR () converges to a approximate solution, i.e., from measurements in iterations provided we choose a step size that satisfies and , where .
Proof.
Here we provide a rough sketch of the proof of Theorem 4; the complete proof is given in Appendix A.
Our proof uses the following crucial observation regarding the structure of the vector Due to the least squares step of the previous iteration, the current residual is orthogonal to columns of . This means that
(3) 
As the algorithm proceeds, elements come in and move out of the current set . Let us give names to the set of found and lost elements as we move from to :
Hence, using (3) and updates for : , and . Now let , then using upper RIP and the fact that , we can show that (details are in the Appendix A):
(4) 
Furthermore, since is chosen based on the largest entries in , we have: Plugging this into (4), we get:
(5) 
The above expression shows that if then our method monotonically decreases the objective function and converges to a local optimum even if RIP is not satisfied (note that upper RIP bound is independent of lower RIP bound, and can always be satisfied by normalizing the matrix appropriately).
Lemma 6.
Let and . Then assuming , at least one new element is found i.e. . Furthermore, , where is a constant.
Special Cases: We have already observed that the OMPR algorithm of the previous section is simply OMPR (). Also note that Theorem 1 immediately follows from Theorem 4.
The algorithm at the other extreme of has appeared at least three times in the recent literature: as Iterative (hard) Thresholding with Inversion (ITI) in Maleki09 (), as SVPNewton (in its matrix avatar) in JainMD10 (), and as Hard Thresholding Pursuit (HTP) in Foucart10 ()). Let us call it IHTNewton as the least squares step can be viewed as a Newton step for the quadratic objective. The above general result for the OMPR family immediately implies that it recovers sparse vectors as soon as the measurement matrix satisfies .
Corollary 7.
Suppose the vector is sparse and the matrix satisfies . Then IHTNewton recovers from measurements in iterations.
4 Tighter Analysis of Two Stage Hard Thresholding Algorithms
Recently, MalekiD10 () proposed a novel family of algorithms, namely twostage hard thresholding algorithms. During each iteration, these algorithms add a fixed number (say ) of elements to the current iterate’s support set. A least squares problem is solved over the larger support set and then elements with smallest magnitude are dropped to form next iterate’s support set. Next iterate is then obtained by again solving the least squares over next iterate’s support set. See Appendix D for a more detailed description of the algorithm.
Using proof techniques developed for our proof of Theorem 4, we can obtain a simple proof for the entire spectrum of algorithms in the twostage hard thresholding family.
Theorem 8.
Suppose the vector is sparse. Then the Twostage Hard Thresholding algorithm with replacement size recovers from measurements in iterations provided:
Note that CoSaMP NeedellT09 () and Subspace Pursuit(SP) (DaiM09, ) are popular special cases of the twostage family. Using our general analysis, we are able to provide significantly less restrictive RIP conditions for recovery.
Corollary 9.
CoSaMPNeedellT09 () recovers sparse from measurements provided
Corollary 10.
Subspace PursuitDaiM09 () recovers sparse from measurements provided
Note that CoSaMP’s analysis given by NeedellT09 () requires while Subspace Pursuit’s analysis given by DaiM09 () requires . See Appendix D in the supplementary material for proofs of the above theorem and corollaries.
5 Fast Implementation Using Hashing
In this section, we discuss a fast implementation of the OMPR method using localitysensitive hashing. The main intuition behind our approach is that the OMPR method selects at most one element at each step (given by ); hence, selection of the top most element is equivalent to finding the column that is most “similar” (in magnitude) to , i.e., this may be viewed as the similarity search task for queries of the form and .
To this end, we use locality sensitive hashing (LSH) (GionisIM99, ), a well known datastructure for approximate nearestneighbor retrieval. Note that while LSH is designed for nearest neighbor search (in terms of Euclidean distances) and in general might not have any guarantees for the similar neighbor search task, we are still able to apply it to our task because we can lowerbound the similarity of the most similar neighbor.
We first briefly describe the LSH scheme that we use. LSH generates hash bits for a vector using randomized hash functions that have the property that the probability of collision between two vectors is proportional to the similarity between them. For our problem, we use the following hash function: , where is a random hyperplane generated from the standard multivariate Gaussian distribution. It can be shown that GW94 ()
Now, an bit hash key is created by randomly sampling hash functions , i.e., , where each is sampled randomly from the standard multivariate Gaussian distribution. Next, hash tables are constructed during the preprocessing stage using independently constructed hash key functions . During the query stage, a query is indexed into each hash table using hashkey functions and then the nearest neighbors are retrieved by doing an exhaustive search over the indexed elements.
Below we state the following theorem from GionisIM99 () that guarantees sublinear time nearest neighbor retrieval for LSH.
Theorem 11.
Let and , then with probability , LSH recovers nearest neighbors, i.e., where is the nearest neighbor to and is a point retrieved by LSH.
However, we cannot directly use the above theorem to guarantee convergence of our hashing based OMPR algorithm as our algorithm requires finding the most similar point in terms of magnitude of the inner product. Below, we provide appropriate settings of the LSH parameters to guarantee sublinear time convergence of our method under a slightly weaker condition on the RIP constant. A detailed proof of the theorem below can be found in Appendix B.
Theorem 12.
Let and , where is a small constant, then with probability , OMPR with hashing converges to the optimal solution in computational steps.
The above theorem shows that the time complexity is sublinear in . However, currently our guarantees are not particularly strong as for large the exponent of will be close to . We believe that the exponent can be improved by more careful analysis and our empirical results indicate that LSH does speed up the OMPR method significantly.
6 Experimental Results
In this section we present empirical results to demonstrate accurate and fast recovery by our OMPR method. In the first set of experiments, we present phase transition diagram for OMPR and compare it to the phase transition diagram of OMP and IHTNewton with step size . For the second set of experiments, we demonstrate robustness of OMPR compared to many existing methods when measurements are noisy or smaller in number than what is required for exact recovery. For the third set of experiments, we demonstrate efficiency of our LSH based implementation by comparing recovery error and time required for our method with OMP and IHTNewton (with stepsize and ). We do not present results for the /basis pursuit methods, as it has already been shown in several recent papers (Foucart10, ; MalekiD10, ) that the relaxation based methods are relatively inefficient for very large scale recovery problems.
In all the experiments we generate the measurement matrix by sampling each entry independently from the standard normal distribution and then normalize each column to have unit norm. The underlying sparse vectors are generated by randomly selecting a support set of size and then each entry in the support set is sampled uniformly from . We use our own optimized implementation of OMP and IHTNewton. All the methods are implemented in MATLAB and our hashing routine uses mex files.
(a) OMPR  (b) OMP  (c) IHTNewton 
6.1 Phase Transition Diagrams
We first compare different methods using phase transition diagrams which are commonly used in compressed sensing literature to compare different methods (MalekiD10, ). We first fix the number of measurements to be and generate different problem sizes by varying and . For each problem size , we generate random Gaussian measurement matrices and sparse random vectors. We then estimate the probability of success of each of the method by applying the method to 100 randomly generated instances. A method is considered successful for a particular instance if it recovers the underlying sparse vector with at most relative error.
In Figure 1, we show the phase transition diagram of our OMPR method as well as that of OMP and IHTNewton (with step size 1). The plots shows probability of successful recovery as a function of and . Figure 1 (a) shows color coding of different success probabilities; red represents high probability of success while blue represents low probability of success. Note that for Gaussian measurement matrices, the RIP constant is less than a fixed constant if and only if , where is a universal constant. This implies that and hence a method that recovers for high will have a large fraction in the phase transition diagram where successful recovery probability is high. We observe this phenomenon for both OMPR and IHTNewton method which is consistent with their respective theoretical guarantees (see Theorem 4). On the other hand, as expected, the phase transition diagram of OMP has a negligible fraction of the plot that shows high recovery probability.
6.2 Performance for Noisy or Undersampled Observations
Next, we empirically compare performance of OMPR to various existing compressed sensing methods. As shown in the phase transition diagrams in Figure 1, OMPR provides comparable recovery to the IHTNewton method for noiseless cases. Here, we show that OMPR is fairly robust under the noisy setting as well as in the case of undersampled observations, where the number of observations is much smaller than what is required for exact recovery.
For this experiment, we generate random Gaussian measurement matrix of size . We then generate random binary vector of sparsity and add Gaussian noise to it. Figure 2 (a) shows recovery error () incurred by various methods for increasing and noise level of . Clearly, our method outperforms the existing methods, perhaps a consequence of guaranteed convergence to a local minima for fixed step size . Similarly, Figure 2 (b) shows recovery error incurred by various methods for fixed and varying noise level. Here again, our method outperforms existing methods and is more robust to noise. Finally, in Figure 2 (c) we show difference in error incurred along with confidence interval (at signficance level) by IHTNewton and OMPR for varying levels of noises and . Our method is better than IHTNewton (at signficance level) in terms of recovery error in around 30 cells of the table, and is not worse in any of the cells but one.
Noise/ k 10 30 50 0.00 0.00(0.0) 0.21(0.6) 0.25(0.3) 0.05 0.00(0.0) 0.13(0.3) 0.37(0.3) 0.10 0.00(0.0) 0.28(0.3) 0.63(0.4) 0.20 0.03(0.0) 0.62(0.2) 0.58(0.5) 0.30 0.18(0.1) 0.92(0.3) 0.92(0.6) 0.40 0.31(0.1) 1.19(0.3) 0.84(0.5) 0.50 0.37(0.1) 1.48(0.3) 1.24(0.6)  
(a)  (b)  (c) 
(a)  (b)  (c) 
6.3 Performance of LSH based implementation
Next, we empirically study recovery properties of our LSH based implementation of OMPR ( OMPRHash ) in the following realtime setup: Generate a random measurement matrix from the Gaussian ensemble and construct hash tables offline using hash functions specified in Section 5. Next, during the reconstruction stage, measurements arrive one at a time and the goal is to recover the underlying signal accurately in realtime.For our experiments, we generate measurements using random sparse vectors and then report recovery error and computational time required by each of the method averaged over runs.
In our first set of experiments, we empirically study the performance of different methods as increases. Here, we fix , and generate measurements using dimensional random vectors of support set size . We then run different methods to estimate vectors of support size that minimize . For our OMPRHash method, we use bits hashkeys and generate hashtables. Figure 3 (a) shows the error incurred by OMPR , OMPRHash , and IHTNewton for different (recall that is an input to both OMPR and IHTNewton). Note that although OMPRHash performs an approximation at each step, it is still able to achieve error similar to OMPR and IHTNewton. Also, note that since the number of measurements are not enough for exact recovery by the IHTNewton method, it typically diverges after a few steps. As a result, we use IHTNewton with step size which is always guaranteed to monotonically converge to at least a local minima (see Theorem 4). In contrast, in OMPR and OMPRHash can always set step size aggressively to be .
Next, we evaluate OMPRHash as dimensionality of the data increases. For OMPRHash , we use hashkeys and hashtables. Figures 3(b) and (c) compare error incurred and time required by OMPRHash with OMPR and IHTNewton. Here again we use step size for IHTNewton as it does not converge for . Note that OMPRHash is an order of magnitude faster than OMPR while incurring slightly higher error. OMPRHash is also nearly times faster than IHTNewton.
Appendix A Proofs related to OMPR: Exact Recovery Case
Let us denote the objective function by . Let denote the support set of and be the support set of . Define the sets
(false alarms)  
(missed detections)  
As the algorithms proceed, elements get in and move in and out of the current set . Let us give names to the set of found and lost elements as we move from to :
(found)  
We first state two technical lemmas that we will need. These can be found in [19].
Lemma 13.
For any , we have,
Lemma 14.
For any such that , we have,
Proof of Theorem 3
Lemma 15.
Let . Then, in OMPR (),
Proof.
Since is the solution to the least squares problem ,
(6) 
Now, note that
(7) 
Hence,
(8) 
That is,
where the third line follows from the fact that , , and are disjoint sets.
As and , the above inequality implies
∎
Next, we provide a lemma that bounds the function value in terms of missed detections and also .
Lemma 16.
Let , , and . Then, at each step,
(9) 
Proof.
Lemma 17.
Let and . Then assuming , at least one new element is found i.e. . Furthermore, , where is a constant.
Proof.
We consider the following three exhaustive cases:

and : Let , s.t., . Now,
Now, as consists of top elements of :
(14) Furthermore, since , hence every element of is smaller in magnitude than every element of , otherwise that element should have been included in . Furthermore, . Hence,
(15) (16) Using above equation along with Lemma 15, we get:
(17) Now, note that if , then implying that . Hence, at least one new element is added, i.e., .