# A Universal Primal-Dual Convex Optimization Framework

###### Abstract

We propose a new primal-dual algorithmic framework for a prototypical constrained convex optimization template. The algorithmic instances of our framework are universal since they can automatically adapt to the unknown Hölder continuity degree and constant within the dual formulation. They are also guaranteed to have optimal convergence rates in the objective residual and the feasibility gap for each Hölder smoothness degree. In contrast to existing primal-dual algorithms, our framework avoids the proximity operator of the objective function. We instead leverage computationally cheaper, Fenchel-type operators, which are the main workhorses of the generalized conditional gradient (GCG)-type methods. In contrast to the GCG-type methods, our framework does not require the objective function to be differentiable, and can also process additional general linear inclusion constraints, while guarantees the convergence rate on the primal problem.

## 1 Introduction

This paper constructs an algorithmic framework for the following convex optimization template:

(1) |

where is a convex function, , , and and are nonempty, closed and convex sets in and respectively. The constrained optimization formulation (1) is quite flexible, capturing many important learning problems in a unified fashion, including matrix completion, sparse regularization, support vector machines, and submodular optimization [1, 2, 3].

Processing the inclusion in (1) requires a significant computational effort in the large-scale setting [4]. Hence, the majority of the scalable numerical solution methods for (1) are of the primal-dual-type, including decomposition, augmented Lagrangian, and alternating direction methods: cf., [5, 6, 7, 4, 8, 9]. The efficiency guarantees of these methods mainly depend on three properties of : Lipschitz gradient, strong convexity, and the tractability of its proximal operator. For instance, the proximal operator of , i.e., , is key in handling non-smooth while obtaining the convergence rates as if it had Lipschitz gradient.

When the set is absent in (1), other methods can be preferable to primal-dual algorithms. For instance, if has Lipschitz gradient, then we can use the accelerated proximal gradient methods by applying the proximal operator for the indicator function of the set [10, 11]. However, as the problem dimensions become increasingly larger, the proximal tractability assumption can be restrictive. This fact increased the popularity of the generalized conditional gradient (GCG) methods (or Frank-Wolfe-type algorithms), which instead leverage the following Fenchel-type oracles [1, 12, 13]

(2) |

where is a convex function. When , we obtain the so-called linear minimization oracle [12]. When , then the (sub)gradient of the Fenchel conjugate of , , is in the set . The sharp-operator in (2) is often much cheaper to process as compared to the operator [1, 12]. While the GCG-type algorithms require -iterations to guarantee an -primal objective residual/duality gap, they cannot converge when their objective is nonsmooth [14].

To this end, we propose a new primal-dual algorithmic framework that can exploit the sharp-operator of in lieu of its proximal operator. Our aim is to combine the flexibility of proximal primal-dual methods in addressing the general template (1) while leveraging the computational advantages of the GCG-type methods. As a result, we trade off the computational difficulty per iteration with the overall rate of convergence. While we obtain optimal rates based on the sharp-operator oracles, we note that the rates reduce to with the sharp operator vs. with the proximal operator when is completely non-smooth (cf. Definition 1.1). Intriguingly, the convergence rates are the same when is strongly convex. Unlike GCG-type methods, our approach can now handle nonsmooth objectives in addition to complex constraint structures as in (1).

Our primal-dual framework is universal in the sense the convergence of our algorithms can optimally adapt to the Hölder continuity of the dual objective (cf., (6) in Section 3) without having to know its parameters. By Hölder continuity, we mean the (sub)gradient of a convex function satisfies with parameters and for all . The case models the bounded subgradient, whereas captures the Lipschitz gradient. The Hölder continuity has recently resurfaced in unconstrained optimization by [15] with universal gradient methods that obtain optimal rates without having to know and . Unfortunately, these methods cannot directly handle the general constrained template (1). After our initial draft appeared, [14] presented new GCG-type methods for composite minimization, i.e., , relying on Hölder smoothness of (i.e., ) and the sharp-operator of . The methods in [14] do not apply when is non-smooth. In addition, they cannot process the additional inclusion in (1), which is a major drawback for machine learning applications.

Our algorithmic framework features a gradient method and its accelerated variant that operates on the dual formulation of (1). For the accelerated variant, we study an alternative to the universal accelerated method of [15] based on FISTA [10] since it requires less proximal operators in the dual. While the FISTA scheme is classical, our analysis of it with the Hölder continuous assumption is new. Given the dual iterates, we then use a new averaging scheme to construct the primal-iterates for the constrained template (1). In contrast to the non-adaptive weighting schemes of GCG-type algorithms, our weights explicitly depend on the local estimates of the Hölder constants at each iteration. Finally, we derive the worst-case complexity results. Our results are optimal since they match the computational lowerbounds in the sense of first-order black-box methods [16].

##### Paper organization:

Section 2 briefly recalls primal-dual formulation of problem (1) with some standard assumptions. Section 3 defines the universal gradient mapping and its properties. Section 4 presents the primal-dual universal gradient methods (both the standard and accelerated variants), and analyzes their convergence. Section 5 provides numerical illustrations, followed by our conclusions. The supplementary material includes the technical proofs and additional implementation details.

##### Notation and terminology:

For notational simplicity, we work on the spaces with the Euclidean norms. We denote the Euclidean distance of the vector to a closed convex set by . Throughout the paper, represents the Euclidean norm for vectors and the spectral norm for the matrices. For a convex function , we use both for its subgradient and gradient, and for its Fenchel’s conjugate. Our goal is to approximately solve (1) to obtain in the following sense:

###### Definition 1.1.

Given an accuracy level , a point is said to be an -solution of (1) if

Here, we call the primal objective residual and the feasibility gap.

## 2 Primal-dual preliminaries

In this section, we briefly summarise the primal-dual formulation with some standard assumptions. For the ease of presentation, we reformulate (1) by introducing a slack variable as follows:

(3) |

Let and . Then, we have as the feasible set of (3).

##### The dual problem:

##### Fundamental assumptions:

To characterize the primal-dual relation between (1) and (4), we require the following assumptions [17]:

###### Assumption A. 1.

The function is proper, closed, and convex, but not necessarily smooth. The constraint sets and are nonempty, closed, and convex. The solution set of (1) is nonempty. Either is polyhedral or the Slater’s condition holds. By the Slater’s condition, we mean , where stands for the relative interior of .

##### Strong duality:

Under Assumption , the solution set of the dual problem (4) is also nonempty and bounded. Moreover, the strong duality holds, i.e., .

## 3 Universal gradient mappings

This section defines the universal gradient mapping and its properties.

### 3.1 Dual reformulation

We first adopt the composite convex minimization formulation [11] of (4) in convex optimization for better interpretability as

(5) |

where , and the correspondence between and is as follows:

(6) |

Since and are generally non-smooth, FISTA and its proximal-based analysis [10] are not directly applicable. Recall the sharp operator defined in (2), then can be expressed as

and we define the optimal solution to the subproblem above as follows:

(7) |

The second term, , depends on the structure of . We consider three special cases:

Sparsity/low-rankness: If for a given and a given norm , then , the scaled dual norm of . For instance, if , then . While the -norm induces the sparsity of , computing requires the max absolute elements of . If (the nuclear norm), then , the spectral norm. The nuclear norm induces the low-rankness of . Computing in this case leads to finding the top-eigenvalue of , which is efficient.

Cone constraints: If is a cone, then becomes the indicator function of its dual cone . Hence, we can handle the inequality constraints and positive semidefinite constraints in (1). For instance, if , then , the indicator function of . If , then , the indicator function of the negative semidefinite matrix cone.

Separable structures: If and are separable, i.e., and , then the evaluation of and its derivatives can be decomposed into subproblems.

### 3.2 Hölder continuity of the dual universal gradient

Let be a subgradient of , which can be computed as . Next, we define

(8) |

where is the Hölder smoothness order. Note that the parameter explicitly depends on [15]. We are interested in the case , and especially the two extremal cases, where we either have the Lipschitz gradient that corresponds to , or the bounded subgradient that corresponds to .

We require the following condition in the sequel:

###### Assumption A. 2.

.

Assumption A.2 is reasonable. We explain this claim with the following two examples. First, if is subdifferentiable and is bounded, then is also bounded. Indeed, we have

Hence, we can choose and .

### 3.3 The proximal-gradient step for the dual problem

Given and , we define

as an approximate quadratic surrogate of . Then, we consider the following update rule:

(9) |

For a given accuracy , we define

(10) |

We need to choose the parameter such that is an approximate upper surrogate of , i.e., for some and . If and are known, then we can set defined by (10). In this case, is an upper surrogate of . In general, we do not know and . Hence, can be determined via a backtracking line-search procedure.

## 4 Universal primal-dual gradient methods

We apply the universal gradient mappings to the dual problem (5), and propose an averaging scheme to construct for approximating . Then, we develop an accelerated variant based on the FISTA scheme [10], and construct another primal sequence for approximating .

### 4.1 Universal primal-dual gradient algorithm

Our algorithm is shown in Algorithm 1. The dual steps are simply the universal gradient method in [15], while the new primal step allows to approximate the solution of (1).

Complexity-per-iteration: First, computing at Step 1 requires the solution . For many and , we can compute efficiently and often in a closed form. Second, in the line-search procedure, we require the solution at Step 3.a, and the evaluation of . The total computational cost depends on the proximal operator of and the evaluations of . We prove below that our algorithm requires two oracle queries of on average.

###### Theorem 4.1.

The worst-case analytical complexity: We establish the total number of iterations to achieve an -solution of (1). The supplementary material proves that

(13) |

where . This complexity is optimal for , but not for [16].

At each iteration , the linesearch procedure at Step 3 requires the evaluations of . The supplementary material bounds the total number of oracle queries, including the function and its gradient evaluations, up to the th iteration as follows:

(14) |

Hence, we have , i.e., we require approximately two oracle queries at each iteration on the average.

### 4.2 Accelerated universal primal-dual gradient method

We now develop an accelerated scheme for solving (5). Our scheme is different from [15] in two key aspects. First, we adopt the FISTA [10] scheme to obtain the dual sequence since it requires less operators compared to the fast scheme in [15]. Second, we perform the line-search after computing , which can reduce the number of the sharp-operator computations of and . Note that the application of FISTA to the dual function is not novel per se. However, we claim that our theoretical characterization of this classical scheme based on the Hölder continuity assumption in the composite minimization setting is new.

Complexity per-iteration: The per-iteration complexity of Algorithm 2 remains essentially the same as that of Algorithm 1.

###### Theorem 4.2.

The worst-case analytical complexity: The supplementary material proves the following worst-case complexity of Algorithm 2 to achieve an -solution :

(17) |

This worst-case complexity is optimal in the sense of first-order black box models [16].

The line-search procedure at Step 3 of Algorithm 2 also terminates after a finite number of iterations. Similar to Algorithm 1, Algorithm 2 requires gradient query and function evaluations of at each iteration. The supplementary material proves that the number of oracle queries in Algorithm 2 is upperbounded as follows:

(18) |

Roughly speaking, Algorithm 2 requires approximately two oracle query per iteration on average.

## 5 Numerical experiments

This section illustrates the scalability and the flexibility of our primal-dual framework using some applications in the quantum tomography (QT) and the matrix completion (MC).

### 5.1 Quantum tomography with Pauli operators

We consider the QT problem which aims to extract information from a physical quantum system. A -qubit quantum system is mathematically characterized by its density matrix, which is a complex positive semidefinite Hermitian matrix , where . Surprisingly, we can provably deduce the state from performing compressive linear measurements based on Pauli operators [18]. While the size of the density matrix grows exponentially in , a significantly fewer compressive measurements (i.e., ) suffices to recover a pure state -qubit density matrix as a result of the following convex optimization problem:

(19) |

where the constraint ensures that is a density matrix. The recovery is also robust to noise [18].

Since the objective function has Lipschitz gradient and the constraint (i.e., the Spectrahedron) is tuning-free, the QT problem provides an ideal scalability test for both our framework and GCG-type algorithms. To verify the performance of the algorithms with respect to the optimal solution in large-scale, we remain within the noiseless setting. However, the timing and the convergence behavior of the algorithms remain qualitatively the same under polarization and additive Gaussian noise.

To this end, we generate a random pure quantum state (e.g., rank-1 ), and we take random Pauli measurements. For qubits system, this corresponds to a dimensional problem with measurements. We recast (19) into (1) by introducing the slack variable .

We compare our algorithms vs. the Frank-Wolfe method, which has optimal convergence rate guarantees for this problem, and its line-search variant. Computing the sharp-operator requires a top-eigenvector of , while evaluating corresponds to just computing the top-eigenvalue of via a power method. All methods use the same power method subroutine, which is implemented in MATLAB’s eigs function. We set for our methods and have a wall-time s in order to stop the algorithms. However, our algorithms seems insensitive to the choice of for the QT problem.

Figure 1 illustrates the iteration and the timing complexities of the algorithms. UniPDGrad algorithm, with an average of line-search steps per iteration, has similar iteration and timing performance as compared to the standard Frank-Wolfe scheme with step-size . The line-search variant of Frank-Wolfe improves over the standard one; however, our accelerated variant, with an average of line-search steps, is the clear winner in terms of both iterations and time. We can empirically improve the performance of our algorithms even further by adapting a similar line-search strategy in the weighting step as Frank-Wolfe, i.e., by choosing the weights in a greedy fashion to minimize the objective function. The practical improvements due to line-search appear quite significant.

### 5.2 Matrix completion with MovieLens dataset

To demonstrate the flexibility of our framework, we consider the popular matrix completion (MC) application. In MC, we seek to estimate a low-rank matrix from its subsampled entries , where is the sampling operator, i.e., .

Convex formulations involving the nuclear norm have been shown to be quite effective in estimating low-rank matrices from limited number of measurements [19]. For instance, we can solve

(20) |

with Frank-Wolfe-type methods, where is a tuning parameter, which may not be available a priori. We can also solve the following parameter-free version

(21) |

While the nonsmooth objective of (21) prevents the tuning parameter, it clearly burdens the computational efficiency of the convex optimization algorithms.

We apply our algorithms to (20) and (21) using the MovieLens 100K dataset. Frank-Wolfe algorithms cannot handle (21) and only solve (20). For this experiment, we did not pre-process the data and took the default ub test and training data partition. We start out algorithms form , we set the target accuracy , and we choose the tuning parameter as in [20]. We use lansvd function (MATLAB version) from PROPACK [21] to compute the top singular vectors, and a simple implementation of the power method to find the top singular value in the line-search, both with relative error tolerance.

The first two plots in Figure 2 show the performance of the algorithms for (20). Our metrics are the normalized objective residual and the root mean squared error (RMSE) calculated for the test data. Since we do not have access to the optimal solutions, we approximated the optimal values, and RMSE, by iterations of AccUniPDGrad. Other two plots in Figure 2 compare the performance of the formulations (20) and (21) which are represented by the empty and the filled markers, respectively. Note that, the dashed line for AccUniPDGrad corresponds to the line-search variant, where the weights are chosen to minimize the feasibility gap. Additional details about the numerical experiments can be found in the supplementary material.

## 6 Conclusions

This paper proposes a new primal-dual algorithmic framework that combines the flexibility of proximal primal-dual methods in addressing the general template (1) while leveraging the computational advantages of the GCG-type methods. The algorithmic instances of our framework are universal since they can automatically adapt to the unknown Hölder continuity properties implied by the template. Our analysis technique unifies Nesterov’s universal gradient methods and GCG-type methods to address the more broadly applicable primal-dual setting. The hallmarks of our approach includes the optimal worst-case complexity and its flexibility to handle nonsmooth objectives and complex constraints, compared to existing primal-dual algorithm as well as GCG-type algorithms, while essentially preserving their low cost iteration complexity.

#### Acknowledgments

This work was supported in part by ERC Future Proof, SNF 200021-146750 and SNF CRSII2-147633. We would like to thank Dr. Stephen Becker of University of Colorado at Boulder for his support in preparing the numerical experiments.

#### References

## References

- [1] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization. J. Mach. Learn. Res. Workshop & Conf. Proc., vol. 28, pp. 427–435, 2013.
- [2] V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag., vol. 31, pp. 32–43, Sept. 2014.
- [3] M. J. Wainwright, Structured regularizers for high-dimensional problems: Statistical and computational issues. Annu. Review Stat. and Applicat., vol. 1, pp. 233–253, Jan. 2014.
- [4] G. Lan and R. D. C. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Math. Program., pp. 1–37, Jan. 2015, doi:10.1007/s10107-015-0861-x.
- [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Machine Learning, vol. 3, pp. 1–122, Jan. 2011.
- [6] P. L. Combettes and J.-C. Pesquet, A proximal decomposition method for solving convex variational inverse problems. Inverse Problems, vol. 24, Nov. 2008, doi:10.1088/0266-5611/24/6/065014.
- [7] T. Goldstein, E. Esser, and R. Baraniuk, Adaptive primal-dual hybrid gradient methods for saddle point problems. 2013, http://arxiv.org/pdf/1305.0546.
- [8] R. Shefi and M. Teboulle, Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optim., vol. 24, pp. 269–297, Feb. 2014.
- [9] Q. Tran-Dinh and V. Cevher, Constrained convex minimization via model-based excessive gap. In Advances Neural Inform. Process. Syst. 27 (NIPS2014), Montreal, Canada, 2014.
- [10] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., vol. 2, pp. 183–202, Mar. 2009.
- [11] Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program., vol. 103, pp. 127–152, May 2005.
- [12] A. Juditsky and A. Nemirovski, Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles. Math. Program., pp. 1–36, Mar. 2015, doi:10.1007/s10107-015-0876-3.
- [13] Y. Yu, Fast gradient algorithms for structured sparsity. PhD dissertation, Univ. Alberta, Edmonton, Canada, 2014.
- [14] Y. Nesterov, Complexity bounds for primal-dual methods minimizing the model of objective function. CORE, Univ. Catholique Louvain, Belgium, Tech. Rep., 2015.
- [15] Y. Nesterov, Universal gradient methods for convex optimization problems. Math. Program., vol. 152, pp. 381–404, Aug. 2015.
- [16] A. Nemirovskii and D. Yudin, Problem complexity and method efficiency in optimization. Hoboken, NJ: Wiley Interscience, 1983.
- [17] R. T. Rockafellar, Convex analysis (Princeton Math. Series), Princeton, NJ: Princeton Univ. Press, 1970.
- [18] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert, Quantum state tomography via compressed sensing. Phys. Rev. Lett., vol. 105, pp. Oct. 2010, doi:10.1103/PhysRevLett.105.150401.
- [19] E. Candès and B. Recht, Exact matrix completion via convex optimization. Commun. ACM, vol. 55, pp. 111–119, June 2012.
- [20] M. Jaggi and M. Sulovský, A simple algorithm for nuclear norm regularized problems. In Proc. 27th Int. Conf. Machine Learning (ICML2010), Haifa, Israel, 2010, pp. 471–478.
- [21] R. M. Larsen, PROPACK - Software for large and sparse SVD calculations. Available: http://sun.stanford.edu/~rmunk/PROPACK/.

Supplementary document:

A Universal Primal-Dual Convex Optimization Framework

In this supplementary document, we provide the technical proofs and additional implementation details, and it is organized as follows: Section A defines the key estimates, that forms the basis of the universal gradient algorithms. Sections B and C present the proofs of Theorems 4.1 and 4.2 respectively. Finally, Section D provides the implementation details of the quantum tomography and the matrix completion problems considered in Section 5.

## Appendix A The key estimate of the proximal-gradient step

Lemma 2 in [22], which we present below as Lemma A.1, provides key properties for constructing universal gradient algorithms. We refer to [22] for the proof of this lemma.

###### Lemma A.1.

Let function satisfy the Assumption . Then for any and

the following statement holds for any

This lemma provides an approximate quadratic upper bound for . However, it depends on the choice of the inexactness parameter and the smoothness parameter . If , then can be set to the Lipschitz constant , and it becomes independent of .

The algorithms that we develop in this paper are based on the proximal-gradient step (9) on the dual objective function . This update rule guarantees the following estimate:

###### Lemma A.2.

Let be the quadratic model of . If , which is defined by (9), satisfies

(22) |

for some , then the following inequality holds for any

###### Proof of Lemma a.2.

If and are known, we can set , then the condition (22) is automatically satisfied. However, we do not know and a priori in general. In this case, can be determined via a line-search procedure on the condition (22).

The following lemma guarantees that the line-search procedure in Algorithms 1 and 2 terminates after a finite number of line-search iterations.