Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression

Code for this work is available at https://github.com/utoronto-theory/pIRLS.

Deeksha Adil
Department of Computer Science
University of Toronto
deeksha@cs.toronto.edu
Richard Peng
School of Computer Science
Georgia Institute of Technology
rpeng@cc.gatech.edu

Sushant Sachdeva
Department of Computer Science
University of Toronto
sachdeva@cs.toronto.edu
Abstract

Linear regression in ℓ_p-norm is a canonical optimization problem that arises in several applications, including sparse recovery, semi-supervised learning, and signal processing. Generic convex optimization algorithms for solving ℓ_p-regression are slow in practice. Iteratively Reweighted Least Squares (IRLS) is an easy to implement family of algorithms for solving these problems that has been studied for over 50 years. However, these algorithms often diverge for p > 3, and since the work of Osborne (1985), it has been an open problem whether there is an IRLS algorithm that is guaranteed to converge rapidly for such p. We propose p-IRLS, the first IRLS algorithm that provably converges geometrically for any p ≥ 2. Our algorithm is simple to implement and is guaranteed to find a (1+ε)-approximate solution in at most O_p(√m · log(m/ε)) iterations, where m is the number of rows of the input matrix. Our experiments demonstrate that it performs even better than our theoretical bounds, beats the standard Matlab/CVX implementation for solving these problems by 10–50x, and is the fastest among available implementations in the high-accuracy regime.

1 Introduction

We consider the problem of ℓ_p-norm linear regression (henceforth referred to as ℓ_p-regression),

min_{x ∈ ℝ^n} ‖Ax − b‖_p,    (1)

where A ∈ ℝ^{m×n} and b ∈ ℝ^m are given and ‖y‖_p = (Σ_i |y_i|^p)^{1/p} denotes the ℓ_p-norm. This problem generalizes linear regression (p = 2) and appears in several applications including sparse recovery CandesT05 (), low rank matrix approximation chierichetti17a (), and graph based semi-supervised learning AlamgirMU11 ().
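As a concrete reference point, the following Python/NumPy sketch (our own illustration; `lp_objective` is not from the paper's code) sets up a small random instance of (1) and evaluates the ℓ_p objective at the ordinary least squares solution, the usual starting point for the methods discussed below.

```python
import numpy as np

def lp_objective(A, b, x, p):
    """Evaluate the l_p regression objective ||Ax - b||_p."""
    return np.linalg.norm(A @ x - b, ord=p)

rng = np.random.default_rng(0)
m, n, p = 200, 20, 8
A = rng.random((m, n))
b = rng.random(m)

# Ordinary least squares (p = 2) solution, often used to initialize l_p solvers.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print("||Ax - b||_p at the least squares solution:", lp_objective(A, b, x_ls, p))
```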

An important application of ℓ_p-regression with large p is graph based semi-supervised learning (SSL). Regularization using the standard graph Laplacian (also called a 2-Laplacian) was introduced in the seminal paper of Zhu, Ghahramani, and Lafferty ZhuGL03 (), and is a popular approach for graph based SSL, see e.g. ZhouBLWS04 (); BelkinMN04 (); ChapelleSZ09 (); Zhu05 (). The 2-Laplacian regularization suffers from degeneracy in the limit of small amounts of labeled data NadlerSZ09 (). Several works have since suggested using the p-Laplacian instead AlamgirMU11 (); BridleZ13 (); ZhouB11 () with large p, and have established its consistency and effectiveness for graph based SSL with small amounts of labeled data pmlr-v49-elalaoui16 (); Calder17 (); CalderFL19 (); DejanT17 (); KyngRSS15 (). Recently, p-Laplacians have also been used for data clustering and learning problems ElmoatazTT15 (); ElmoatazDT17 (); HafieneFE18 (). Minimizing the p-Laplacian can easily be seen as an ℓ_p-regression problem.
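For concreteness, the graph objective in question is the weighted p-Laplacian energy; in our notation (the normalization in the cited works may differ), it can be written as an instance of (1):

\[
\min_{\substack{x \in \mathbb{R}^{V} \\ x_u = y_u \text{ for labeled } u}} \; \sum_{(u,v) \in E} w_{uv}\,\lvert x_u - x_v\rvert^{p} \;=\; \min_{x}\;\bigl\lVert W^{1/p} B\, x - b \bigr\rVert_p^p,
\]

where B is the edge–vertex incidence matrix restricted to the unlabeled vertices, W = diag(w_{uv}), and the contribution of the fixed labels y is absorbed into b.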

Though ℓ_p-regression is a convex programming problem, it is very challenging to solve in practice. General convex programming methods such as conic programming using interior point methods (like those implemented in CVX) are very slow in practice. First order methods do not perform well for these problems with large p, since the gradient vanishes rapidly close to the optimum.

For applications such as graph based SSL with p-Laplacians, it is important that we are able to compute a solution that approximates the optimal solution coordinate-wise rather than just achieving an approximately optimal objective value, since these coordinates determine the labels for the vertices. For such applications, we seek a (1+ε)-approximate solution: an x whose objective value is at most (1+ε) times the optimal value, for some very small ε, in order to achieve a reasonable coordinate-wise approximation. Hence, it is very important for the dependence on ε to be log(1/ε) rather than poly(1/ε).

IRLS Algorithms.

A family of algorithms for solving the ℓ_p-regression problem is the IRLS (Iteratively Reweighted Least Squares) family. IRLS algorithms have been discovered multiple times independently and have been studied extensively for over 50 years, e.g. Lawson61 (); Rice64 (); Osborne85 (); GorodnitskyR97 () (see Burrus12 () for a detailed survey). The main step in an IRLS algorithm is to solve a weighted least squares (ℓ_2-regression) problem to compute the next iterate,

x^{(t+1)} = argmin_x Σ_i w_i (Ax − b)_i^2 = argmin_x (Ax − b)^⊤ W (Ax − b),    (2)

starting from any initial solution (usually the least squares solution corresponding to p = 2). Each iteration can be implemented by solving a linear system in the matrix A^⊤ W A. Picking w = |Ax^{(t)} − b|^{p−2} (entry-wise) gives us an IRLS algorithm whose only fixed point is the minimizer of the ℓ_p-regression problem (1) (which is unique for 1 < p < ∞).
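The following short Python sketch (ours, for illustration; it omits the safeguards introduced later) implements this basic iteration with the weights w = |Ax^{(t)} − b|^{p−2}, solving each weighted least squares step via the normal equations.

```python
import numpy as np

def basic_irls(A, b, p, iters=100, reg=1e-12):
    """Vanilla IRLS for min_x ||Ax - b||_p with p >= 2 (may diverge for larger p)."""
    # Start from the least squares (p = 2) solution.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    for _ in range(iters):
        r = A @ x - b
        w = np.abs(r) ** (p - 2) + reg           # weights |Ax - b|^(p-2), tiny regularizer for stability
        AW = A * w[:, None]                      # rows of A scaled by the weights, i.e. W A
        x = np.linalg.solve(A.T @ AW, AW.T @ b)  # weighted normal equations (A^T W A) x = A^T W b
    return x
```

The fixed point of this iteration is exactly the first-order optimality condition of (1); the difficulty, addressed below, is that the plain iteration need not converge to it once p grows.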

The basic version of the above IRLS algorithm converges reliably in practice for p close to 2, and often diverges even for moderately large p (see (CalderFL19, pg 12)). Osborne Osborne85 () proved that the above IRLS algorithm converges in the limit for p between 1 and 3. Karlovitz Karlovitz70 () proved a similar result for an IRLS algorithm with a line search for even p. However, both these results only prove convergence in the limit without any quantitative bounds, and assume that we start close enough to the solution. The question of whether a suitable IRLS algorithm converges geometrically to the optimal solution of (1) in a few iterations has been open for over three decades.

Our Contributions.

We present p-IRLS, the first IRLS algorithm that provably converges geometrically to the optimal solution of ℓ_p-regression for all p ≥ 2. Our algorithm is very similar to the standard IRLS algorithm for ℓ_p-regression, and given an ε > 0, returns a (1+ε)-approximate feasible solution to (1) in O_p(√m · log(m/ε)) iterations (Theorem 3.1). Here m is the number of rows in A. We emphasize that the dependence on ε is log(1/ε) rather than poly(1/ε).

Our algorithm p-IRLS is very simple to implement, and our experiments demonstrate that it is much faster than the available implementations in the high-accuracy regime. We study its performance on random dense instances of ℓ_p-regression and on low dimensional nearest neighbor graphs for p-Laplacian SSL. Our Matlab implementation on a standard desktop machine runs in at most 2–2.5 seconds (60–80 iterations) on our matrix instances and on graphs with around 5000 edges, even for the largest p and the smallest ε that we consider. Our algorithm is at least 10–50x faster than the standard Matlab/CVX solver based on interior point methods cvx (); gb08 (), while finding a better solution. We also converge much faster than the IRLS/homotopy based algorithm of CalderFL19 (), which is not even guaranteed to converge to a good solution. For larger p, this difference is even more dramatic, with p-IRLS obtaining solutions with error at least 4 orders of magnitude smaller within the same number of iterations. Our experiments also indicate that p-IRLS scales much better than our theoretical bounds suggest, with the iteration count almost unchanged with problem size and growing very slowly (at most linearly) with p.

1.1 Related Works and Comparison

IRLS algorithms have been used widely for various problems due to their exceptional simplicity and ease of implementation, including compressive sensing ChartrandY08 (), sparse signal reconstruction GorodnitskyR97 (), and Chebyshev approximation in FIR filter design BarretoB94 (). There have been various attempts at analyzing variants of the IRLS algorithm for ℓ_p-norm minimization. We point the reader to the survey by Burrus Burrus12 () for numerous pointers and a thorough history.

The works of Osborne Osborne85 () and Karlovitz Karlovitz70 () mentioned above only prove convergence in the limit, without quantitative bounds, under assumptions on p and on starting close enough to the optimum. Several works show that the IRLS iteration is similar to Newton's method (e.g. Kahng72 (); BurrusBS94 ()), or that adaptive step sizes help (e.g. BurrusV99 (); BurrusV12 ()), but do not prove any guarantees.

A few notable works prove convergence guarantees for IRLS algorithms for sparse recovery (in some cases even for p < 1) DaubechiesDFG08 (); DaubechiesDFG10 (); LN18 (), and for low-rank matrix recovery FornasierRW11 (). Quantitative convergence bounds for IRLS algorithms for ℓ_1-minimization, inspired by slime-mold dynamics, are given by Straszak and Vishnoi StraszakV16Flow (); StraszakV16 (); StraszakV16IRLS (). Ene and Vladu give IRLS algorithms for ℓ_1 and ℓ_∞ regression EneV2019 (). However, both these works have a poly(1/ε) dependence in the number of iterations, with the best result, by EneV2019 (), still having a polynomial dependence on 1/ε in its total iteration count.

The most relevant theoretical results for ℓ_p-norm minimization are interior point methods NesterovN94 (), the homotopy method of Bubeck et al. BubeckCLL18 (), and the iterative-refinement method of Adil et al. AdilKPS19 (). The convergence bound we prove on the number of iterations required by p-IRLS has a better dependence on the problem size than interior point methods, but is marginally worse than the dependence in the work of Bubeck et al. BubeckCLL18 () and of Adil et al. AdilKPS19 (). Also related, but not directly comparable, is the work of Bullins Bullins18 () (restricted to p = 4) and the work of Maddison et al. Maddison18 () (a first order method with condition number dependence).

More importantly though, our algorithm is far simpler to implement (no implementations are available for any of the above other than Interior Point Methods), and unlike any of these works, our algorithm has a locally greedy structure that allows for greedily optimizing the objective using a line search, resulting in much better performance in practice than that guaranteed by our theoretical bounds.

Another line of heuristic algorithms combines IRLS with a homotopy based approach (e.g. Kahng72 (); see Burrus12 ()). These methods start from a solution for p = 2 and slowly increase p multiplicatively, using an IRLS algorithm for each phase with the previous solution as the starting point. These algorithms perform better in practice than the usual IRLS algorithms. However, to the best of our knowledge, they are not guaranteed to converge, and no bounds on their performance are known. Rios Flores19 () provides an efficient implementation of such a method based on the work of Rios et al. CalderFL19 (), along with detailed experiments. Our experiments show that our algorithm converges much faster than the implementation from Rios (see Section 4).

2 Preliminaries

We first define some terms that we will use in the formal analysis of our algorithm. For our analysis, we use a more general form of the ℓ_p-regression problem,

min_{x: Cx = d} ‖Ax − b‖_p^p.    (3)

Setting C and d to be empty recovers the standard ℓ_p-regression problem.
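The core linear-algebraic step in this constrained setting is a weighted least squares solve restricted to the affine subspace Cx = d. A standard way to implement it (a sketch in our own notation, not code from the paper) is to solve the corresponding KKT system:

```python
import numpy as np

def constrained_wls(A, b, C, d, w):
    """Solve min_x sum_i w_i (Ax - b)_i^2  subject to  Cx = d, via its KKT system."""
    n, k = A.shape[1], C.shape[0]
    AW = A * w[:, None]
    # KKT system:  [A^T W A   C^T] [x]   [A^T W b]
    #              [   C       0 ] [y] = [   d   ]
    K = np.block([[A.T @ AW, C.T],
                  [C, np.zeros((k, k))]])
    rhs = np.concatenate([AW.T @ b, d])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]
```

Setting C and d to be empty reduces this to the ordinary weighted normal equations used in the sketch above.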

Definition 2.1 (Residual Problem).

The residual problem of (3) at a feasible x is defined as,

max_{Δ: CΔ = 0}  res_p(Δ) := g^⊤Δ − 2p^2 Δ^⊤ A^⊤ R A Δ − p^p ‖AΔ‖_p^p.

Here R = diag(|Ax − b|^{p−2}) and g is the gradient of the objective of (3) at x. We use res_p(Δ) to denote the objective of the residual problem evaluated at Δ.

Definition 2.2 (Approximation to the Residual Problem).

Let κ ≥ 1 and let Δ* be the optimum of the residual problem. A κ-approximate solution to the residual problem is a Δ̃ such that CΔ̃ = 0 and res_p(Δ̃) ≥ res_p(Δ*)/κ.

3 Algorithm and Analysis

1:procedure p-IRLS()
2:     
3:     
4:     while  do
5:         
6:         
7:         
8:         
9:          LineSearch
10:         
11:         if  InsufficientProgressCheck() then                 
12:     return
Algorithm 1 p-IRLS Algorithm
1:procedure InsufficientProgressCheck()
2:     
3:     
4:     
5:     if  or  then return true
6:     else return false      
Algorithm 2 Check Progress

Our algorithm p-IRLS, described in Algorithm 1, is the standard IRLS algorithm (equation (2)) with a few key modifications. The first difference is that at each iteration t, we add a small systematic padding to the weights w^{(t)}. The second difference is that the next iterate is calculated by performing a line search along the line joining the current iterate and the standard IRLS iterate at iteration t (with the modified weights). (Note that p-IRLS has been written in a slightly different but equivalent formulation, where it solves for an update step rather than for the next iterate directly.) Both these modifications have been tried in practice before, but primarily with practical justifications: padding the weights avoids ill-conditioned matrices, and a line search can only help us converge faster and improves stability Karlovitz70 (); BurrusV99 (). Our key contribution is to show that these modifications together allow us to provably make progress towards the optimum, resulting in a final iteration count of O_p(√m · log(m/ε)). Finally, at every iteration we check whether the objective value decreases sufficiently, and this allows us to adjust the padding appropriately. We emphasize that our algorithm always converges. We prove the following theorem:

Theorem 3.1.

Given any ε > 0, A, b, and p ≥ 2, Algorithm 1 returns an x such that Cx = d and ‖Ax − b‖_p^p ≤ (1 + ε) min_{x*: Cx* = d} ‖Ax* − b‖_p^p, in at most O_p(√m · log(m/ε)) iterations.
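Before turning to the analysis, the following minimal, illustrative Python sketch shows the overall structure of an IRLS loop with padded weights, a line search, and a progress check. It is our own simplification for exposition, not the authors' MATLAB implementation: the initial padding, the padding-reduction rule, and the termination test are placeholder heuristics rather than the calibrated quantities of Algorithms 1 and 2, and it handles only the unconstrained case (constraints Cx = d can be incorporated via the KKT solve sketched earlier).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_irls_sketch(A, b, p, eps=1e-8, max_iters=500):
    """Illustrative IRLS with padded weights and a line search (simplified sketch).

    Mirrors the two modifications described above -- a systematic padding of the
    weights and a line search towards the weighted least squares iterate -- but the
    padding schedule, progress check, and termination test are placeholders, not the
    calibrated quantities from Algorithm 1 and Algorithm 2.
    """
    f = lambda x: np.linalg.norm(A @ x - b, ord=p) ** p
    x, *_ = np.linalg.lstsq(A, b, rcond=None)        # start from the l_2 solution
    pad = 1.0                                        # placeholder initial padding
    for _ in range(max_iters):
        r = A @ x - b
        w = np.abs(r) ** (p - 2) + pad               # padded weights
        AW = A * w[:, None]
        x_wls = np.linalg.solve(A.T @ AW, AW.T @ b)  # IRLS iterate with the modified weights
        step = x_wls - x
        # Line search along the line joining the current iterate and the IRLS iterate.
        t = minimize_scalar(lambda s: f(x + s * step), bounds=(0.0, 2.0), method="bounded").x
        x_new = x + t * step
        progress = f(x) - f(x_new)
        if progress <= eps * f(x_new):               # placeholder: either done or padding too large
            if pad <= eps * f(x_new):
                return x_new
            pad /= 2.0                               # insufficient progress: shrink the padding
        else:
            x = x_new
    return x
```

For instance, `x = p_irls_sketch(A, b, p=8)` can be run on the random instance constructed earlier; the calibrated choices in Algorithm 1 and Algorithm 2 are exactly what upgrade this greedy scheme to the guarantee of Theorem 3.1.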

3.1 Convergence Analysis

The analysis, at a high level, is based on iterative refinement techniques for ℓ_p-norms developed in the work of Adil et al. AdilKPS19 () and Kyng et al. KyngPSW19 (). These techniques allow us to use a crude κ-approximate solver for the residual problem (Definition 2.1) O(p κ log(m/ε)) times to obtain a (1+ε)-approximate solution to the ℓ_p-regression problem (Lemma 3.1).

In our algorithm, if we had solved the standard weighted ℓ_2 problem instead, the approximation factor κ would be unbounded. The padding added to the weights allows us to prove that the solution to the padded weighted ℓ_2 problem gives a bounded approximation to the residual problem, provided we have the correct padding (Lemma 3.1). Lemma 3.1 gives the required condition to adjust the value of the padding and is used in Algorithm 2. We will show that the number of iterations in which we are adjusting the padding is small. Finally, we prove our termination condition: Lemma 3.1 shows that when the algorithm terminates, we have an ε-approximate solution to our main problem. The remaining lemma of this section, Lemma 3.1, gives the loop invariant which is used at several places in the proof of Theorem 3.1. Due to space constraints, we only state the main lemmas here and defer the proofs to the supplementary material.
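Schematically (our rephrasing, with an unspecified constant c), iterative refinement converts each κ-approximate residual solution into a fixed multiplicative shrinkage of the optimality gap of f(x) = ‖Ax − b‖_p^p:

\[
f\bigl(x^{(t+1)}\bigr) - f(x^{\star}) \;\le\; \Bigl(1 - \frac{c}{p\,\kappa}\Bigr)\Bigl(f\bigl(x^{(t)}\bigr) - f(x^{\star})\Bigr).
\]

Since the ℓ_2 initialization has an optimality gap at most poly(m) · f(x★), iterating this bound for O(pκ log(m/ε)) steps drives the gap below ε · f(x★), which is the shape of the bound in Lemma 3.1.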

We begin with the lemma that describes our overall iterative refinement scheme. The iterative refinement scheme in AdilKPS19 () and KyngPSW19 () has an exponential dependence on p; we improve this dependence to a small polynomial in p.

Lemma 3.1 (Iterative Refinement). Let p ≥ 2. Starting from x^{(0)} = argmin_{x: Cx = d} ‖Ax − b‖_2 and iterating as x^{(t+1)} = x^{(t)} − Δ/p, where Δ is a κ-approximate solution to the residual problem (Definition 2.1), we get an ε-approximate solution to (3) in at most O(p κ log(m/ε)) calls to a κ-approximate solver for the residual problem.

The next lemma bounds the approximation factor κ achieved by the padded weighted ℓ_2 solve, when we have the right value of the padding.

Lemma 3.1 (Approximation). Let the padding satisfy the condition maintained by Algorithm 2, and let the quantities in lines (5), (6), (7) and (9) of Algorithm 1 be defined accordingly. Let Δ* be the solution of the following program,

(4)

Then, Δ* is an O_p(√m)-approximate solution to the residual problem.

The following lemma gives the condition that determines whether we have the correct value of the padding; the condition used in the algorithm is the contrapositive of the lemma statement.

Lemma 3.1 (Check Progress). Let the quantity in line (4) of Algorithm 2 be as defined there, and let Δ* be the solution of program (4). If the padding has the correct value, then the line search step decreases the objective by at least the amount checked in Algorithm 2.

We next present the loop invariant, followed by the conditions for termination.

Lemma 3.1 (Invariant). At every iteration of the while loop, we have Cx^{(t)} = d, and the padding parameter remains within the range required by the analysis.

Lemma 3.1 (Termination). Let x be an iterate at which the termination condition of the while loop holds. Then x is an ε-approximate solution to (3), i.e.,

‖Ax − b‖_p^p ≤ (1 + ε) min_{x*: Cx* = d} ‖Ax* − b‖_p^p.

We next show how the above lemmas together imply our main result, Theorem 3.1.

3.2 Proof of Theorem 3.1

Proof.

We first show that at termination, the algorithm returns an ε-approximate solution. We begin by noting that the padding can only decrease with every iteration. At iteration t, consider the smallest padding for which the approximation guarantee holds; Lemma 3.1 bounds this quantity. Let us first consider the termination condition of the while loop. When we terminate, Lemma 3.1 implies that the returned solution is ε-approximate. Lemma 3.1 also shows that at each iteration our solution satisfies Cx = d; therefore the solution returned at termination also satisfies the subspace constraints.

We next prove the running time bound. Note that the objective is non-increasing with every iteration. This is because LineSearch returns a step length that minimizes the objective along the given direction, and this step length could also be zero. Consider the following two cases; these are the only cases possible due to our invariant, Lemma 3.1.

  1. :
    In this case, Lemma 3.1 says that we get a -approximate solution to the residual problem.

  2. :
    In this case, we want the algorithm to reduce so that it falls under the above case. Lemma 3.1 implies that if , then we do not have and therefore the algorithm reduces . In the case when , the algorithm ensures that . As a result we can apply Lemma A.5 (from the Appendix, given in the supplementary material) to bound , which following the proof of Lemma 3.1 again gives a -approximate solution to the residual problem.

From the above, it is clear that in every iteration the algorithm either reduces the padding or returns a sufficiently good approximate solution to the residual problem. The number of iterations in which we reduce the padding is small (Lemma 3.1 bounds it), and Lemma 3.1 bounds the number of remaining iterations. The total number of iterations required by our algorithm is therefore O_p(√m · log(m/ε)), thus concluding the proof of the theorem. ∎

4 Experiments

Figure 1: Comparison with the IRLS/homotopy method of CalderFL19 () on graph instances, averaged over 100 random samples. Solver: PCG with incomplete Cholesky preconditioner.

In this section, we detail our results from experiments studying the performance of our algorithm, p-IRLS. We implemented our algorithm in Matlab on a standard desktop machine, and evaluated its performance on two types of instances: random instances for ℓ_p-regression, and graphs for p-Laplacian minimization. We study the scaling behavior of our algorithm as we change p, ε, and the size of the problem. We compare our performance to the Matlab/CVX solver, which is guaranteed to find a good solution, and to the IRLS/homotopy based implementation from CalderFL19 (), which is not guaranteed to converge but runs quite well in practice. We now describe our instances, parameters and experiments in detail.

(a) Size of A fixed.
(b) Varying sizes of A; error as indicated.
(c) Size of A fixed; error as indicated.
Figure 2: Random matrix instances. Comparing the number of iterations and the time taken by our algorithm as the parameters vary. Averaged over 100 random samples. Linear solver used: backslash.
(a) Graph size fixed.
(b) Varying number of nodes; error as indicated.
(c) Graph size fixed; error as indicated.
Figure 3: Graph instances. Comparing the number of iterations and the time taken by our algorithm as the parameters vary. Averaged over 100 graph samples. Linear solver used: backslash.
(a) p fixed; matrix size varied.
(b) Matrix size fixed.
(c) p fixed; number of nodes varied.
(d) Graph size fixed.
Figure 4: Averaged over 100 samples; precision ε fixed. CVX solver used: SDPT3 for matrices and SeDuMi for graphs.
Code.

The code for this work is available at https://github.com/utoronto-theory/pIRLS.

Instances and Parameters.

We consider two types of instances, random matrices and graphs.

  1. Random Matrices: We want to solve the problem min_x ‖Ax − b‖_p. In these instances we use a random matrix A and vector b, where every entry is chosen uniformly at random.

  2. Graphs: We use the graphs described in CalderFL19 (). The set of vertices is generated by choosing random vectors in a low-dimensional space, and the edges are created by connecting each vertex to its nearest neighbours. The weight of each edge is specified by a Gaussian-type function (Eq 3.1, CalderFL19 ()). Very few vertices (around 10) have labels, which are again chosen uniformly at random. The problem studied on these instances is to determine the minimizer of the p-Laplacian. We formulate this problem in the form min_x ‖Ax − b‖_p; details of this formulation can be found in the Appendix in the supplementary material, and a sketch of the construction is given below.
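As referenced in the item above, here is an illustrative Python sketch of how such an instance can be assembled (our construction; the exact weight kernel and normalization used in CalderFL19 () differ in their details): it builds a k-nearest-neighbour graph with Gaussian-type weights and returns a pair (A, b) encoding the p-Laplacian energy over the unlabeled vertices.

```python
import numpy as np

def plaplacian_instance(p, n=500, k=10, dim=10, n_labeled=10, sigma=1.0, seed=0):
    """Build a k-NN graph with Gaussian-type edge weights and return (A, b) such that
    min_x ||A x - b||_p^p is the p-Laplacian energy over the unlabeled vertex values.
    Illustrative construction only; the experimental instances follow CalderFL19."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))
    vals = np.full(n, np.nan)
    vals[:n_labeled] = rng.random(n_labeled)                 # a few labeled vertices

    # k nearest neighbours by brute force (adequate for these sizes).
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    unlabeled = np.flatnonzero(np.isnan(vals))
    col = {u: i for i, u in enumerate(unlabeled)}

    rows, b = [], []
    for u in range(n):
        for v in np.argsort(D[u])[1:k + 1]:                  # skip u itself
            w = np.exp(-(D[u, v] / sigma) ** 2)              # Gaussian-type edge weight
            row = np.zeros(len(unlabeled))
            rhs = 0.0
            for node, sign in ((u, 1.0), (v, -1.0)):
                if np.isnan(vals[node]):
                    row[col[node]] = sign * w ** (1.0 / p)   # scaled incidence entry
                else:
                    rhs -= sign * w ** (1.0 / p) * vals[node]  # fixed label moves to b
            rows.append(row)
            b.append(rhs)
    # (Each directed k-NN pair contributes one row; duplicates are harmless for illustration.)
    return np.array(rows), np.array(b)
```

Feeding the resulting (A, b) to any of the ℓ_p solvers above, e.g. `p_irls_sketch(A, b, p)`, recovers the label values on the unlabeled vertices.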

Note that we have three parameters for each problem: the size of the instance, i.e., the number of rows of the matrix A; the norm we solve for, p; and the accuracy to which we want to solve each problem, ε. We will consider each of these parameters independently and see how our algorithm scales with them for both kinds of instances.

Benchmark Comparisons.

We compare the performance of our program with the following:

  1. Standard MATLAB optimization package, CVX cvx (); gb08 ().

  2. The most efficient algorithm for p-Laplacian based semi-supervised learning given in CalderFL19 () was Newton's method with homotopy. We take their hardest problem and compare the performance of their code with ours by running our algorithm for the same number of iterations and showing that we get closer to the optimum, i.e., a smaller error, thus showing that we converge much faster.

Implementation Details.

We normalize the instances by running our algorithm once and dividing the vector b by the norm of the final objective, so that the objective values we report are normalized. We do this for every instance before we measure the runtime or the iteration count, both for uniformity and to avoid numerical precision issues. All experiments were performed on MATLAB 2018b on a desktop Ubuntu machine with an Intel Core CPU and 4GB RAM. For the graph instances, we fix the dimension of the space from which we choose vertices and the number of labelled vertices. The graph instances are generated using the code Flores19 () by CalderFL19 (). Other details specific to each experiment are given in the figure captions.

4.1 Experimental Results

Dependence on Parameters.

Figure 2 shows the dependence of the number of iterations and the runtime on our parameters for random matrices. Similarly, for graph instances, Figure 3 shows the dependence of the iteration count and the runtime on the parameters. As expected from the theoretical guarantees, the number of iterations and the runtime increase linearly with log(1/ε). The dependence on size and on p is clearly much better in practice (nearly constant and at most linear, respectively) than the theoretical bounds (roughly √m and a small polynomial in p, respectively) for both kinds of instances.

Comparisons with Benchmarks.
  • Figure 4 shows the runtime comparison between our IRLS algorithm p-IRLS and CVX. For all instances, we ensured that our final objective was smaller than the objective of the CVX solver. As is clear for both kinds of instances, our algorithm takes far less time, and its runtime increases more slowly with size and with p compared to CVX. Note that CVX does better for smaller p, but it is still at least an order of magnitude slower for both random matrices and graphs.

  • Figure 1 shows the performance of our algorithm compared to the IRLS/homotopy method of CalderFL19 (). We use the same linear solver for both programs, preconditioned conjugate gradient with an incomplete Cholesky preconditioner, and run both programs for the same number of iterations. The plots indicate the error value as described previously. For our IRLS algorithm we indicate our upper bound on the error, and for their procedure we indicate a lower bound on the error, namely the relative difference in the objectives achieved by the two algorithms. It is clear that our algorithm achieves an error that is orders of magnitude smaller than the error achieved by their algorithm. This shows that our algorithm has a much faster rate of convergence. Note that there is no guarantee on the convergence of the method used by CalderFL19 (), whereas we prove that our algorithm converges in a small number of iterations.

5 Discussion

To conclude, we present p-IRLS, the first IRLS algorithm that provably converges to a high accuracy solution in a small number of iterations. This settles a problem that has been open for over three decades. Our algorithm is very easy to implement, and we demonstrate that it works very well in practice, beating the standard optimization packages by large margins. The theoretical bound on the number of iterations has a sublinear dependence on the size and a small polynomial dependence on p; in practice, however, we see an almost constant dependence on size and an at most linear dependence on p, for both random instances and graphs. In order to achieve the best theoretical bounds we would require some form of acceleration. For ℓ_∞ and ℓ_1 regression, it has been shown that it is possible to achieve acceleration, however without geometric convergence. It remains an open problem to give a practical IRLS algorithm that simultaneously has the best possible theoretical convergence bounds.

Acknowledgements

DA is supported by SS's NSERC Discovery grant, and an Ontario Graduate Scholarship. SS is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), a Connaught New Researcher award, and a Google Faculty Research award. RP is partially supported by the NSF under Grants No. 1637566 and No. 1718533.

References

  • [ACR16] Ahmed El Alaoui, Xiang Cheng, Aaditya Ramdas, Martin J. Wainwright, and Michael I. Jordan. Asymptotic behavior of ℓ_p-based Laplacian regularization in semi-supervised learning. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 879–906, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
  • [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for -norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1405–1424, 2019.
  • [AL11] Morteza Alamgir and Ulrike V. Luxburg. Phase transition in the family of -resistances. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 379–387. Curran Associates, Inc., 2011.
  • [BB94] Jose Antonio Barreto and C Sidney Burrus. L_p complex approximation using iterative reweighted least squares for FIR digital filters. In Proceedings of ICASSP'94. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages III–545. IEEE, 1994.
  • [BBS94] C.S. Burrus, J.A. Barreto, and I.W. Selesnick. Iterative reweighted least-squares design of fir filters. Trans. Sig. Proc., 42(11):2926–2936, November 1994.
  • [BCLL18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for ℓ_p regression provably beyond self-concordance and in input-sparsity time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1130–1137, New York, NY, USA, 2018. ACM.
  • [BL18] Ning Bi and Kaihao Liang. Iteratively reweighted algorithm for signals recovery with coherent tight frame. Mathematical Methods in the Applied Sciences, 41(14):5481–5492, 2018.
  • [BMN04] Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In John Shawe-Taylor and Yoram Singer, editors, Learning Theory, pages 624–638, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
  • [Bul18] Brian Bullins. Fast minimization of structured convex quartics. arXiv preprint arXiv:1812.10349, 2018.
  • [Bur12] C Sidney Burrus. Iterative reweighted least squares. OpenStax CNX. Available online: http://cnx. org/contents/92b90377-2b34-49e4-b26f-7fe572db78a1, 12, 2012.
  • [BZ13] Nick Bridle and Xiaojin Zhu. p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. In Eleventh Workshop on Mining and Learning with Graphs (MLG2013), 2013.
  • [Cal17] Jeff Calder. Consistency of lipschitz learning with infinite unlabeled data and finite labeled data. CoRR, abs/1710.10364, 2017.
  • [CGK17] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P. Woodruff. Algorithms for low-rank approximation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 806–814, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [CSZ09] O. Chapelle, B. Scholkopf, and A. Zien, Eds. Semi-supervised learning (chapelle, o. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, March 2009.
  • [CT05] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, Dec 2005.
  • [CW08] R. Chartrand and Wotao Yin. Iteratively reweighted algorithms for compressive sensing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3869–3872, March 2008.
  • [DDFG08] I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk. Iteratively re-weighted least squares minimization: Proof of faster than linear rate for sparse recovery. In 2008 42nd Annual Conference on Information Sciences and Systems, pages 26–29, March 2008.
  • [DDFG10] Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan Gunturk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2010.
  • [EDT17] A Elmoataz, X Desquesnes, and M Toutain. On the game p-laplacian on weighted graphs with applications in image processing and data clustering. European Journal of Applied Mathematics, 28(6):922–948, 2017.
  • [ETT15] Abderrahim Elmoataz, Matthieu Toutain, and Daniel Tenbrinck. On the p-Laplacian and ∞-Laplacian on graphs with applications in image and data processing. SIAM Journal on Imaging Sciences, 8(4):2412–2451, 2015.
  • [EV19] Alina Ene and Adrian Vladu. Improved convergence for ℓ_∞ and ℓ_1 regression via iteratively reweighted least squares. CoRR, abs/1902.06391, 2019.
  • [FRW11] M. Fornasier, H. Rauhut, and R. Ward. Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640, 2011.
  • [GB08] Michael Grant and Stephen Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited, 2008. http://stanford.edu/~boyd/graph_dcp.html.
  • [GB14] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
  • [GR97] I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using focuss: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600–616, March 1997.
  • [HFE18] Yosra Hafiene, Jalal Fadili, and Abderrahim Elmoataz. Nonlocal -laplacian variational problems on graphs. arXiv preprint arXiv:1810.12817, 2018.
  • [Kah72] S. W. Kahng. Best L_p-approximation. Math. Comput., 26(118):505–508, 1972.
  • [Kar70] L.A Karlovitz. Construction of nearest points in the L^p, p even, and L^∞ norms. I. Journal of Approximation Theory, 3(2):123 – 127, 1970.
  • [KPSW19] Rasmus Kyng, Richard Peng, Sushant Sachdeva, and Di Wang. Flows in almost linear time via adaptive preconditioning. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 902–913, New York, NY, USA, 2019. ACM.
  • [KRSS15] Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A. Spielman. Algorithms for lipschitz learning on graphs. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 1190–1223, 2015.
  • [Law61] C. L. Lawson. Contribution to the theory of linear least maximum approximation. Ph.D. dissertation, Univ. Calif., 1961.
  • [MPT18] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.
  • [NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
  • [NSZ09] Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. 2009.
  • [Osb85] M. R. Osborne. Finite Algorithms in Optimization and Data Analysis. John Wiley & Sons, Inc., New York, NY, USA, 1985.
  • [RCL19] Mauricio Flores Rios, Jeff Calder, and Gilad Lerman. Algorithms for ℓ_p-based semi-supervised learning on graphs. CoRR, abs/1901.05031, 2019.
  • [Ric64] J.R. Rice. The Approximation of Functions, By John R. Rice. Addison-Wesley Series in Computer Science and Information Processing. 1964.
  • [Rio19] Mauricio Flores Rios. Laplacianlpgraphssl. https://github.com/mauriciofloresML/Laplacian_Lp_Graph_SSL, 2019.
  • [ST17] Dejan Slepcev and Matthew Thorpe. Analysis of p-Laplacian regularization in semi-supervised learning. CoRR, abs/1707.06213, 2017.
  • [SV16a] Damian Straszak and Nisheeth K. Vishnoi. IRLS and slime mold: Equivalence and convergence. CoRR, abs/1601.02712, 2016.
  • [SV16b] Damian Straszak and Nisheeth K. Vishnoi. Natural algorithms for flow problems. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1868–1883, 2016.
  • [SV16c] Damian Straszak and Nisheeth K. Vishnoi. On a natural dynamics for linear programming. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, Cambridge, MA, USA, January 14-16, 2016, page 291, 2016.
  • [VB99] R. A. Vargas and C. S. Burrus. Adaptive iterative reweighted least squares design of fir filters. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 3, pages 1129–1132 vol.3, March 1999.
  • [VB12] Ricardo A. Vargas and C. Sidney Burrus. Iterative design of digital filters. CoRR, abs/1207.4526, 2012.
  • [ZB11] Xueyuan Zhou and Mikhail Belkin. Semi-supervised learning by higher order regularization. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 892–900, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
  • [ZBL04] D. Zhou, O. Bousquet, TN. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328, Cambridge, MA, USA, June 2004. Max-Planck-Gesellschaft, MIT Press.
  • [ZGL03] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pages 912–919, 2003.
  • [Zhu05] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.

Appendix A Proofs from Section 3

A.1 Proof of Lemma 3.1 (Iterative Refinement)

We first show that we can upper and lower bound the change in the objective by a linear term plus a quadratically smoothed function.

Lemma A.1.

For any and , we have for and ,

The proof of the above lemma is long and hence deferred to the end of this section. Applying the above lemma on our objective we get,

(5)

where R is the diagonal matrix with entries |Ax − b|^{p−2}. We next show the relation between the residual problem defined in the preliminaries and the change in the objective value when x is updated by a step Δ.

Lemma A.2.

For any and and ,

and

Proof.

The first inequality directly follows from (5). For the second inequality,

A.1.1 Proof of Lemma 3.1 (Iterative Refinement)

Proof.

Let be a -approximate solution to the residual problem. Using this fact and Lemma A.2 for , we get,

Also,

Now, after iterations,

So to get an -approximate solution, we need iterations for our value of . ∎

A.1.2 Proof of Lemma A.1

Proof.

To show this, we show that the above holds for all coordinates. For a single coordinate, the above expression is equivalent to proving,

Let . Since the above clearly holds for , it remains to show for all ,

  1. :
    In this case, . So, and the right inequality directly holds. To show the other side, let

    We have,

    and

    Since , . So is an increasing function in and .

  2. :
    Now, , and . As a result,

    which gives the right inequality. Consider,

    Let . The above expression now becomes,

    We know that . When , and . This gives us,

    giving us for . When , and .

    giving us for . Therefore, giving us, , thus giving the left inequality.

  3. :
    Let Now,

    When , we have,

    and

    So is an increasing function of which gives us, . Therefore is a decreasing function, and the minimum is at which is . This gives us our required inequality for . When , and . We are left with the range . Again, we have,

    Therefore, is an increasing function, . This implies is an increasing function, giving, as required.

    To show the other direction,

    Now, since ,

    We thus have, when is positive and when is negative. The minimum of is at which is . This concludes the proof of this case.

A.2 Proof of Lemma 3.1 (Approximation)

We need to bound the objective of program (4). To do that, we first give a bound on a decision version of the residual problem, and then relate this problem to program (4).

Lemma A.3.

Let be such that the optimum of the residual problem, . Then the following problem has optimum at most .

(6)
Proof.

The above objective is strictly non-negative since is a feasible solution and we are maximizing the objective. Since the last terms are strictly non-negative, we must have,

Since is the optimum and satisfies ,

We thus have,

which gives the following bound.

For notational convenience, let function . Now, we know that, and . This gives,

Let , where . Note that . Now, and,