Deconvolution of High Dimensional Mixtures via Boosting, with Application to Diffusion-Weighted MRI of Human Brain
Diffusion-weighted magnetic resonance imaging (DWI) and fiber tractography are the only methods to measure the structure of the white matter in the living human brain. The diffusion signal has been modelled as the combined contribution from many individual fascicles of nerve fibers passing through each location in the white matter. Typically, this is done via basis pursuit, but estimation of the exact directions is limited due to discretization [1, 2]. The difficulties inherent in modeling DWI data are shared by many other problems involving fitting non-parametric mixture models. Ekanadaham et al.  proposed an approach, continuous basis pursuit, to overcome discretization error in the 1-dimensional case (e.g., spike-sorting). Here, we propose a more general algorithm that fits mixture models of any dimensionality without discretization. Our algorithm uses the principles of L2-boost , together with refitting of the weights and pruning of the parameters. The addition of these steps to L2-boost both accelerates the algorithm and assures its accuracy. We refer to the resulting algorithm as elastic basis pursuit, or EBP, since it expands and contracts the active set of kernels as needed. We show that in contrast to existing approaches to fitting mixtures, our boosting framework (1) enables the selection of the optimal bias-variance tradeoff along the solution path, and (2) scales with high-dimensional problems. In simulations of DWI, we find that EBP yields better parameter estimates than a non-negative least squares (NNLS) approach, or the standard model used in DWI, the tensor model, which serves as the basis for diffusion tensor imaging (DTI) . We demonstrate the utility of the method in DWI data acquired in parts of the brain containing crossings of multiple fascicles of nerve fibers.
Deconvolution of High Dimensional Mixtures via Boosting, with Application to Diffusion-Weighted MRI of Human Brain
Charles Y. Zheng Department of Statistics Stanford University Stanford, CA 94305 firstname.lastname@example.org Franco Pestilli Department of Psychological and Brain Sciences Indiana University, Bloomington, IN 47405 email@example.com Ariel Rokem Department of Psychology Stanford University Stanford, CA 94305 firstname.lastname@example.org
In many applications, one obtains measurements for which the response is related to via some mixture of known kernel functions , and the goal is to recover the mixture parameters and their associated weights:
where is a known kernel function parameterized by , and are model parameters to be estimated, are unknown nonnegative weights to be estimated, and is additive noise. The number of components is also unknown, hence, this is a nonparametric model. One example of a domain in which mixture models are useful is the analysis of data from diffusion-weighted magnetic resonance imaging (DWI). This biomedical imaging technique is sensitive to the direction of water diffusion within millimeter-scale voxels in the human brain in vivo. Water molecules freely diffuse along the length of nerve cell axons, but is restricted by cell membranes and myelin along directions orthogonal to the axon’s trajectory. Thus, DWI provides information about the microstructural properties of brain tissue in different locations, about the trajectories of organized bundles of axons, or fascicles within each voxel, and about the connectivity structure of the brain. Mixture models are employed in DWI to deconvolve the signal within each voxel with a kernel function, , assumed to represent the signal from every individual fascicle [1, 2] (Figure 1B), and provide an estimate of the fiber orientation distribution function (fODF) in each voxel, the direction and volume fraction of different fascicles in each voxel. In other applications of mixture modeling these parameters represent other physical quantities. For example, in chemometrics, represents a chemical compound and its spectra. In this paper, we focus on the application of mixture models to the data from DWI experiments and simulations of these experiments.
1.1 Model fitting - existing approaches
Hereafter, we restrict our attention to the use of squared-error loss; resulting in penalized least-squares problem
Minimization problems of the form (2) can be found in the signal deconvolution literature and elsewhere: some examples include super-resolution in imaging , entropy estimation for discrete distributions , X-ray diffraction , and neural spike sorting . Here, is a convex penalty function of . Examples of such penalty functions given in Section 2.1; a formal definition of convexity in the nonparametric setting can be found in the supplementary material, but will not be required for the results in the paper. Technically speaking, the objective function (2) is convex in , but since its domain is of infinite dimensionality, for all practical purposes (2) is a nonconvex optimization problem. One can consider fixing the number of components in advance, and using a descent method (with random restarts) to find the best model of that size. Alternatively, one could use a stochastic search method, such as simulated annealing or MCMC , to estimate the size of the model and the model parameters simultaneously. However, as one begins to consider fitting models with increasing number of components and of high dimensionality, it becomes increasingly difficult to apply these approaches . Hence a common approach to obtaining an approximate solution to (2) is to limit the search to a discrete grid of candidate parameters . The estimated weights and parameters are then obtained by solving an optimization problem of the form
where has the th column , where is defined by . Examples applications of this non-negative least-squares-based approach (NNLS) include  and [1, 2, 7]. In contrast to descent based methods, which get trapped in local minima, NNLS is guaranteed to converge to a solution which is within of the global optimum, where depends on the scale of discretization. In some cases, NNLS will predict the signal accurately (with small error), but the parameters resulting will still be erroneous. Figure 1 illustrates the worst-case scenario where discretization is misaligned relative to the true parameters/kernels that generated the signal.
In an effort to improve the discretization error of NNLS, Ekanadham et al  introduced continuous basis pursuit (CBP). CBP is an extension of nonnegative least squares in which the points on the discretization grid can be continuously moved within a small distance; in this way, one can reach any point in the parameter space. But instead of computing the actual kernel functions for the perturbed parameters, CBP uses linear approximations, e.g. obtained by Taylor expansions. Depending on the type of approximation employed, CBP may incur large error. The developers of CBP suggest solutions for this problem in the one-dimensional case, but these solutions cannot be used for many applications of mixture models (e.g DWI). The computational cost of both NNLS and CBP scales exponentially in the dimensionality of the parameter space. In contrast, using stochastic search methods or descent methods to find the global minimum will generally incur a computational cost scaling which is exponential in the sample size times the parameter space dimensions. Thus, when fitting high-dimensional mixture models, practitioners are forced to choose between the discretization errors inherent to NNLS, or the computational difficulties in the descent methods. We will show that our boosting approach to mixture models combines the best of both worlds: while it does not suffer from discretization error, it features computational tractability comparable to NNLS and CBP. We note that for the specific problem of super-resolution, Càndes derived a deconvolution algorithm which finds the global minimum of (2) without discretization error and proved that the algorithm can recover the true parameters under a minimal separation condition on the parameters . However, we are unaware of an extension of this approach to more general applications of mixture models.
The model (1) appears in an entirely separate context, as the model for learning a regression function as an ensemble of weak learners , or boosting . However, the problem of fitting a mixture model and the problem of fitting an ensemble of weak learners have several important differences. In the case of learning an ensemble, the family can be freely chosen from a universe of possible weak learners, and the only concern is minimizing the prediction risk on a new observation. In contrast, in the case of fitting a mixture model, the family is specified by the application. As a result, boosting algorithms, which were derived under the assumption that is a suitably flexible class of weak learners, generally perform poorly in the signal deconvolution setting, where the family is inflexible. In the context of regression, boost, proposed by Buhlmann et al  produces a path of ensemble models which progressively minimize the sum of squares of the residual. boost fits a series of models of increasing complexity. The first model consists of the single weak learner which best fits . The second model is formed by finding the weak learner with the greatest correlation to the residual of the first model, and adding the new weak learner to the model, without changing any of the previously fitted weights. In this way the size of the model grows with the number of iterations: each new learner is fully fit to the residual and added to the model. But because the previous weights are never adjusted, Boost fails to converge to the global minimum of (2) in the mixture model setting, producing suboptimal solutions. In the following section, we modify Boost for fitting mixture models. We refer to the resulting algorithm as elastic basis pursuit.
2 Elastic Basis Pursuit
Our proposed procedure for fitting mixture models consists of two stages. In the first stage, we transform a penalized problem to an equivalent non regularized least squares problem. In the second stage, we employ a modified version of Boost, elastic basis pursuit, to solve the transformed problem. We will present the two stages of the procedure, then discuss our fast convergence results.
For most mixture problems it is beneficial to apply a -norm based penalty, by using a modified input and kernel function family , so that
We will use our modified Boost algorithm to produce a path of solutions for objective function on the left side, which results in a solution path for the penalized objective function (2).
For example, it is possible to embed the penalty in the optimization problem (2). One can show that solutions obtained by using the penalty function have a one-to-one correspondence with solutions of obtained using the usual penalty . The penalty is implemented by using the transformed input: and using modified kernel vectors . Other kinds of regularization are also possible, and are presented in the supplemental material.
2.2 From Boost to Elastic Basis Pursuit
Motivated by the connection between boosting and mixture modelling, we consider application of Boost to solve the transformed problem (the left side of(3)). Again, we reiterate the nonparametric nature of the model space; by minimizing (3), we seek to find the model with any number of components which minimizes the residual sum of squares. In fact, given appropriate regularization, this results in a well-posed problem. In each iteration of our algorithm a subset of the parameters, are considered for adjustment. Following Lawson and Hanson , we refer to these as the active set. As stated before, Boost can only grow the active set at each iteration, converging to inaccurate models. Our solution to this problem is to modify Boost so that it grows and contracts the active set as needed; hence we refer to this modification of the Boost algorithm as elastic basis pursuit. The key ingredient for any boosting algorithm is an oracle for fitting a weak learner: that is, a function which takes a residual as input and returns the parameter corresponding to the kernel most correlated with the residual. EBP takes as inputs the oracle , the input vector , the function , and produces a path of solutions which progressively minimize (3). To initialize the algorithm, we use NNLS to find an initial estimate of . In the th iteration of the boosting algorithm, let be residual from the previous iteration (or the NNLS fit, if ). The algorithm proceeds as follows
Call the oracle to find , and add to the active set .
Refit the weights , using NNLS, to solve:
where is the matrix formed from the regressors in the active set, for . This yields the residual .
Prune the active set by removing any parameter whose weight is zero, and update the weight vector in the same way. This ensures that the active set remains sparse in each iteration. Let denote the values of at the end of this step of the iteration.
Stopping may be assessed by computing an estimated prediction error at each iteration, via an independent validation set, and stopping the algorithm early when the prediction error begins to climb (indicating overfitting).
Psuedocode and Matlab code implementing this algorithm can be found in the supplement.
In the boosting context, the property of refitting the ensemble weights in every iteration is known as the totally corrective property; LPBoost  is a well-known example of a totally corrective boosting algorithm. While we derived EBP as a totally corrective variant of Boost, one could also view EBP as a generalization of the classical Lawson-Hanson (LH) algorithm  for solving nonnegative least-squares problems. Given mild regularity conditions and appropriate regularization, Elastic Basis Pursuit can be shown to deterministically converge to the global optimum: we can bound the objective function gap in the th iteration by , where is an explicit constant (see 2.3). To our knowledge, fixed iteration guarantees are unavailable for all other methods of comparable generality for fitting a mixture with an unknown number of components.
2.3 Convergence Results
(Detailed proofs can be found in the supplementary material.)
For our convergence results to hold, we require an oracle function which satisfies
for some fixed . Our proofs can also be modified to apply given a stochastic oracle that satisfies (9) with fixed probability for every input . Recall that denotes the transformed input, the transformed kernel and the dimensionality of . We assume that the parameter space is compact and that , the transformed kernel function, is continuous in . Furthermore, we assume that either regularization is imposed, or the kernels satisfy a positivity condition, i.e. for . Proposition 1 states that these conditions imply the existence of a maximally saturated model of size with residual .
The existence of such a saturated model, in conjunction with existence of the oracle , enables us to state fixed-iteration guarantees on the precision of EBP, which implies asymptotic convergence to the global optimum. To do so, we first define the quantity , see (9) above. Proposition 2 uses the fact that the residuals are orthogonal to , thanks to the NNLS fitting procedure in step 2. This allows us to bound the objective function gap in terms of . Proposition 3 uses properties of the oracle to lower bound the progress per iteration in terms of .
Proposition 2 Assume the conditions of Proposition 1. Take saturated model . Then defining
the th residual of the EBP algorithm can be bounded in size by
In particular, whenever converges to 0, the algorithm converges to the global minimum.
Proposition 3 Assume the conditions of Proposition 1. Then
for defined above in (9). This implies that the sequence is decreasing.
Combining Propositions 2 and 3 yields our main result for the non-asymptotic convergence rate.
Proposition 4 Assume the conditions of Proposition 1. Then for all ,
for defined in (5)
Hence we have characterized the non-asymptotic convergence of EBP at rate with an explicit constant, which in turn implies asymptotic convergence to the global minimum.
3 DWI Results and Discussion
To demonstrate the utility of EBP in a real-world application, we used this algorithm to fit mixture models of DWI. Different approaches are taken to modeling the DWI signal. The classical Diffusion Tensor Imaging (DTI) model , which is widely used in applications of DWI to neuroscience questions, is not a mixture model. Instead, it assumes that diffusion in the voxel is well approximated by a 3-dimensional Gaussian distribution. This distribution can be parameterized as a rank-2 tensor, which is expressed as a 3 by 3 matrix. Because the DWI measurement has antipodal symmetry, the tensor matrix is symmetric, and only 6 independent parameters need to be estimated to specify it. DTI is accurate in many places in the white matter, but its accuracy is lower in locations in which there are multiple crossing fascicles of nerve fibers. In addition, it should not be used to generate estimates of connectivity through these locations. This is because the peak of the fiber orientation distribution function (fODF) estimated in this location using DTI is not oriented towards the direction of any of the crossing fibers. Instead, it is usually oriented towards an intermediate direction (Figure 4B). To address these challenges, mixture models have been developed, that fit the signal as a combination of contributions from fascicles crossing through these locations. These models are more accurate in fitting the signal. Moreover, their estimate of the fODF is useful for tracking the fascicles through the white matter for estimates of connectivity. However, these estimation techniques either use different variants of NNLS, with a discrete set of candidate directions , or with a spherical harmonic basis set , or use stochastic algorithms . To overcome the problems inherent in these techniques, we demonstrate here the benefits of using EBP to the estimation of a mixture models of fascicles in DWI. We start by demonstrating the utility of EBP in a simulation of a known configuration of crossing fascicles. Then, we demonstrate the performance of the algorithm in DWI data.
The DWI measurements for a single voxel in the brain are for directions on the three dimensional unit sphere, given by
The kernel functions each describe the effect of a single fascicle traversing the measurement voxel on the diffusion signal, well described by the Stejskal-Tanner equation . Because of the non-negative nature of the MRI signal, is generated from a Rician distribution . where is a scalar quantity determined by the experimenter, and related to the parameters of the measurement (the magnitude of diffusion sensitization applied in the MRI instrument). is a positive definite quadratic form, which is specified by the direction along which the fascicle represented by traverses the voxel and by additional parameters and , corresponding to the axial and radial diffusivity of the fascicle represented by . The oracle function is implemented by Newton-Raphson with random restarts. In each iteration of the algorithm, the parameters of (direction and diffusivity) are found using the oracle function, , using gradient descent on , the current residuals. In each iteration, the set of is shrunk or expanded to best match the signal.
In a simulation with a complex configuration of fascicles, we demonstrate that accurate recovery of the true fODF can be achieved. In our simulation model, we take , and generate as uniformly distributed vectors on the unit sphere and weights as i.i.d. uniformly distributed on the interval . Each is associated with a between 0.5 and 2, and setting to 0. We consider the signal in 150 measurement vectors distributed on the unit sphere according to an electrostatic repulsion algorithm. We partition the vectors into a training partition and a test partition to minimize the maximum angular separation in each partition. we generate a signal
We use cross-validation on the training set to fit NNLS with varying L1 regularization parameter , using the regularization penalty function: . We choose this form of penalty function because we interpret the weights as comprising partial volumes in the voxel; hence represents the total volume of the voxel weighted by the isotropic component of the diffusion. We fix the regularization penalty parameter . The estimated fODFs and predicted signals are obtained by three algorithms: DTI, NNLS, and EBP. Each algorithm is applied to the training set (75 directions), and error is estimated, relative to a prediction on the test set (75 directions). The latter two methods (NNLS, EBP) use the regularization parameters and the chosen by cross-validated NNLS. Figure 2 illustrates the first two iterations of EBP applied to these simulated data. The estimated fODF are compared to the true fODF by the antipodally symmetrized Earth Mover’s distance (EMD)  in each iteration. Figure 3 demonstrates the progress of the internal state of the EBP algorithm in many repetitions of the simulation. In the simulation results (Figure 4), EBP clearly reaches a more accurate solution than DTI, and a sparser solution than NNLS.
The same procedure is used to fit the three models to DWI data, obtained at 2x2x2 , at a b-value of 4000 . In the these data, the true fODF is not known. Hence, only test prediction error can be obtained. We compare RMSE of prediction error between the models in a region of interest (ROI) in the brain containing parts of the corpus callosum, a large fiber bundle that contains many fibers connecting the two hemispheres, as well as the centrum semiovale, containing multiple crossing fibers (Figure 5). NNLS and EBP both have substantially reduced error, relative to DTI.
We developed an algorithm to model multi-dimensional mixtures. This algorithm, Elastic Basis Pursuit (EBP), is a combination of principles from boosting, and principles from the Lawson-Hanson active set algorithm. It fits the data by iteratively generating and testing the match of a set of candidate kernels to the data. Kernels are added and removed from the set of candidates as needed, using a totally corrective backfitting step, based on the match of the entire set of kernels to the data at each step. We show that the algorithm reaches the global optimum, with fixed iteration guarantees. Thus, it can be practically applied to separate a multi-dimensional signal into a sum of component signals. For example, we demonstrate how this algorithm can be used to fit diffusion-weighted MRI signals into nerve fiber fascicle components.
The authors thank Brian Wandell and Eero Simoncelli for useful discussions. CZ was supported through an NIH grant 1T32GM096982 to Robert Tibshirani and Chiara Sabatti, AR was supported through NIH fellowship F32-EY022294. FP was supported through NSF grant BCS1228397 to Brian Wandell
 Tournier J-D, Calamante F, Connelly A (2007). Robust determination of the fibre orientation distribution in diffusion MRI: non-negativity constrained super-resolved spherical deconvolution. Neuroimage 35:1459â72
 DellâAcqua F, Rizzo G, Scifo P, Clarke RA, Scotti G, Fazio F (2007). A model-based deconvolution approach to solve fiber crossing in diffusion-weighted MR imaging. IEEE Trans Biomed Eng 54:462â72
 Ekanadham C, Tranchina D, and Simoncelli E. (2011). Recovery of sparse translation-invariant signals with continuous basis pursuit. IEEE transactions on signal processing (59):4735-4744.
 Bühlmann P, Yu B (2003). Boosting with the L2 loss: regression and classification. JASA, 98(462), 324-339.
 Basser,P. J., Mattiello, J. and Le-Bihan, D. (1994). MR diffusion tensor spectroscopy and imaging. Biophysical Journal, 66:259-267.
 Candès, E. J., and FernandezâGranda, C. (2013). Towards a Mathematical Theory of Superâresolution. Communications on Pure and Applied Mathematics.
 Valiant, G., and Valiant, P. (2011, June). Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd annual ACM symposium on Theory of computing (pp. 685-694). ACM.
 Sánchez-Bajo, F., and Cumbrera, F. L. (2000). Deconvolution of X-ray diffraction profiles by using series expansion. Journal of applied crystallography, 33(2), 259-266.
 Behrens TEJ, Berg HJ, Jbabdi S, Rushworth MFS, and Woolrich MW (2007). Probabilistic diffusion tractography with multiple fiber orientations: What can we gain? NeuroImage (34):144-45.
 Bro, R., and De Jong, S. (1997). A fast non-negativity-constrained least squares algorithm. Journal of chemometrics, 11(5), 393-401.
 Lawson CL, and Hanson RJ. (1995). Solving Least Squares Problems. SIAM.
 Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1-3), 225-254.
 Stejskal EO, and Tanner JE. (1965). Spin diffusion measurements: Spin echoes in the presence of a time-dependent gradient field. J Chem Phys(42):288-92.
 Gudbjartsson, H., and Patz, S. (1995). The Rician distribution of noisy MR data. Magn Reson Med. 34: 910â914.
 Rubner, Y., Tomasi, C. Guibas, L.J. (2000). The earth mover’s distance as a metric for image retrieval. Intl J. Computer Vision, 40(2), 99-121.
5 Supplemental Material for Continuous Basis Pursuit for High Dimensional Mixtures
5.1 Continuous Basis Pursuit
Continuous basis pursuit, introduced by Ekanadham et al. , can be viewed as an extension of nonnegative least squares where we are given the liberty of perturbing the points on the discretization grid to adjusted versions where the perturbations are constrained to lie within Voronoi cells generated by . The idea of CBP is to linearly approximate the resulting kernel functions . In particular, in first-order CBP (FOCBP), one uses the approximation
where is the dimensionality of the parameter space. Defining and
for , and convex constraint sets
one writes the FOCBP objective function as
yielding estimates ,
where is the th standard basis vector.
Ekanadham et al. suggested using solvers for semidefinite programs (SDP) to solve instances of CBP problems, like the objective function above. However, we found that FOCBP can be transformed into a nonnegative least squares problem, generally resulting in speedups and improvements in stability.
The key observation is that any pertubed parameter can be represented as a positive linear combination of the finite set of vertices of , . Yet, this implies that the corresponding approximated kernel function can also be represented as a positive linear combination of .
Hence defining by
we can obtain the equivalent problem
which yields identical estimates to the original approach (7), via
5.2 The Lawson-Hanson algorithm for positive mixture problems
Before discussing our proposed method, true continuous basis pursuit, we discuss the unique properties of the Lawson-Hanson algorithm  for solving nonnegative least squares problems of the form
where is a matrix, in the special case of with nonnegative entries.
The algorithm begins with an active set initialized to the null set and estimate intialized to 0, and uses a tolerance . Letting represent the columns of corresponding to the indices included in and be the entries of corresponding to the indices of . The LH algorithm is as follows [Charles will summarize]
Initialize set of indices to the empty set.
Intiialize to be a vector of zeroes
Run main loop
Return , the solution to the least-squares problem (8)
Letting be the smallest index such that , set
Let be a vector of zeros.
Begin inner loop.
Let be the set of indices where .
Since the LH algorithm was proposed in 1974, a number of improvements have been proposed for solving large-scale nonnegative least squares problem. Efron’s least-angle procedure is especially suitable for solving the lasso-regularized NNLS problem but can also be applied to the original NNLS problem. Kim, Sra and Dhillon proposed an interior-point based method for solving NNLS problems using conjugate gradients. Potluru propose using coordinate descent to solve NNLS. The FISTA algorithm of Beck can also be modified to solve NNLS.
But in the special case of positive and , one can see both theoretically and empirically that the original LH algorithm far outperforms these more recent competing methods.
Firstly, the vector remains sparse in every iteration of the LH algorithm, even for noisy data. This means that the LH algorithm gains a substantial advantage over coordinate descent methods by computing the true least-sqaures solution for the current active set.
Secondly, the nature of the basis set renders gradient-descent based approaches, like the Kim Sra Dhillon algorithm, much less effective. Due to the high degree of collinearity in the basis set, the function has high curvature in the direction of the gradient, which often reduces the maximum step size at each iteration to below working precision.
Thirdly, the nonnegativity constraints combined with high dimensionality pose a challenge to methods like FISTA, which rely on log barrier functions to enforce the nonnegativity constraint.
Fourthly, the geometry of the basis set, which resembles a high-dimensional connected, curved surface with a spike at , poses special difficulties for Efron’s LARS algorithm, which aggresively adds variables to the active set as it continuously adjusts the coefficients of the solution vector. The LARS algorithm is hampered by the frequency at which the active set must change along the solution path. On the other hand, since the LARS algorithm recovers the entire L1 regularized solution path, it may still be useful for tuning the L1 regularization parameter.
Recall that we define an oracle via the property that
for some fixed .
Proposition. For any positive integer , and for any , there exists such that
for defined in (LABEL:lsp0obj).
Proof. Form the matrix . Then
for any . But if we minimize over nonnegative, we can find a solution with or fewer nonzero entries, as proved in Lawson and Hanson . Taking to be the nonnegative entries of and taking to be the corresponding parameters , we have .
For the Lemma, we take to be a compact set in and we require that be continuous with respect to for any fixed .
Lemma. Under the conditions stated above, there exists a nonnegative integer and and such that
Proof. Since is compact, so is . Also the space is compact. And by the continuity of , if we define by
then is continuous. Since the squared norm of any vector is nonnegative, we know that . By the compactness of , there exists such that . Take and take to be the sequence of nonnegative entries of , and to be the sequence of nonnegative entries of to complete the proof.
Proof. For define
First we show that produces an upper bound on . Define
Note that is jointly convex in , and verify that and . Further note that
due to the fact that the residual is orthogonal to the columns of (see ). Meanwhile, note that , which implies
Now due to the convexity of , we have
The next major step is to see that
Here, (14) follows from the fact that the columns of include by also all of the columns of for which is nonzero. Next, (16) is obtained by an application of the Pythagorean theorem, and (17) by applying the definitions of and the condition (9) on . Finally, (18) follows from observing that is nondecreasing in , hence .
From this result, we obtain
But since , this implies that is convergent. Hence, there exists a constant , and such that for all ,
Since we bounded the objective function gap in terms of in (12), this yields the desired result.