Stochastic First-order Methods for Convex and
Nonconvex Functional Constrained Optimization
Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semisupervised learning, and robust optimization, among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with stochastic objective and/or stochastic constraints. Many of these rates of convergence are in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing dual methods, it does not require the projection of Lagrangian multipliers onto a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method which transforms the initial nonconvex problem into a sequence of convex functional constrained subproblems. We establish the convergence and rate of convergence of this algorithm to KKT points under different constraint qualifications. For practical use, we present inexact variants of this algorithm, in which approximate solutions of the subproblems are computed using the aforementioned ConEx method, and establish their associated rates of convergence. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems appear to be new in the literature.
In this paper, we study the following composite optimization problem with functional constraints:
Here, is a convex compact set, and are continuous functions which are not necessarily convex, is a proper convex lower semicontinuous function, and are convex and continuous functions. Problem (1.1) covers different convex and nonconvex settings depending on the assumptions on and , .
In the convex setting, we assume that , , are convex or strongly convex functions, which can be either smooth, nonsmooth, or the sum of smooth and nonsmooth components. We also assume that , , are “simple” functions in the sense that, for any given vector and non-negative weight vector , a certain proximal operator associated with the function can be computed efficiently. For such problems, the Lipschitz smoothness properties of ’s are of no consequence due to the simplicity of this proximal operator.
For the nonconvex case, we assume that , , are smooth functions, which are not necessarily convex but satisfy a certain lower curvature condition (c.f. (1.3)). However, we do not impose the simplicity assumption on the proximal operator associated with the convex functions , , in order to cover a broad class of nonconvex problems, including those with non-differentiable objective functions or constraints.
Constrained optimization problems of the above form are prevalent in data science. One such example arises from risk-averse machine learning. Let denote the loss for a random data-point . Our goal is to minimize a certain risk measure [42, 43], e.g., the so-called conditional value at risk, which penalizes only the positive deviation of the loss function, subject to the constraint that the expected loss is less than a threshold value. Therefore, one can formulate this problem as
where denotes the conditional value at risk and is the tolerance on the average loss that one considers acceptable. In many practical situations, the loss function is nonconvex w.r.t. . Other examples of problem (1.1) can also be found in semi-supervised learning, where one would like to minimize the loss function defined over the labeled samples, subject to certain proximity type constraints for the unlabeled samples.
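Since the displayed formulation above did not survive extraction, a small numerical sketch may help make the risk-averse setup concrete. Everything below is illustrative: the names `empirical_cvar`, `alpha`, and `tau` are our own, and the rule used (the mean of the worst `1 - alpha` fraction of losses) is the standard empirical estimate of conditional value at risk, not a formula taken from the paper.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.9):
    """Empirical conditional value-at-risk: the mean of the worst
    (1 - alpha) fraction of the observed losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = int(np.ceil((1 - alpha) * len(losses)))  # number of tail samples
    return losses[-k:].mean()

# A risk-averse feasibility check in the spirit of the constrained
# formulation: minimize CVaR_alpha of the loss subject to mean loss <= tau.
losses = np.array([0.1, 0.2, 0.3, 0.4, 5.0])
tau = 1.5
print(empirical_cvar(losses, alpha=0.8))  # -> 5.0 (the single worst of 5 samples)
print(losses.mean() <= tau)               # expected-loss constraint holds here
```

Note that CVaR is convex in the loss, so with a convex loss both the objective and the expectation constraint above fall into the convex setting of (1.1); a nonconvex loss puts them into the nonconvex setting.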
There exists a variety of literature on solving convex functional constrained optimization problems (1.1). One research line focuses on primal methods that do not involve the Lagrange multipliers, including the cooperative subgradient methods [38, 26] and level-set methods [27, 34, 29, 4, 28]. One possible limitation of these methods is the difficulty of directly achieving an accelerated rate of convergence when the objective or constraint functions are smooth. Constrained convex optimization problems can also be solved by reformulating them as saddle point problems, which are then solved by primal-dual type algorithms (see [33, 18]). The main hurdle for existing primal-dual methods is that they require the projection of dual multipliers onto a ball whose diameter is usually unknown. Other alternative approaches for constrained convex problems include the classical exact penalty, quadratic penalty and augmented Lagrangian methods [6, 22, 23, 46]. These approaches, however, require the solution of penalty subproblems and hence are more complicated than primal and primal-dual methods. Recently, research effort has also been directed to stochastic optimization problems with functional constraints [26, 4]. In spite of many interesting findings, existing methods for solving these problems are still limited: a) many primal methods solve only stochastic problems with deterministic constraints , and the convergence of accelerated primal-dual methods [33, 18] has not been studied for stochastic functional constrained problems; and b) a few algorithms for solving problems with expectation constraints require either a constraint evaluation step , or stochastic lower bounds on the optimal value , thus relying on a light-tail assumption for the stochastic noise and on conservative sampling estimates based on the Bernstein inequality. Some other algorithms require the even more restrictive assumption that the noise associated with the stochastic constraints be bounded .
The past few years have also seen a resurgence of interest in the design of efficient algorithms for nonconvex stochastic optimization, especially for stochastic and finite-sum problems, due to their importance in machine learning. Most of these studies assume that the constraints are convex, and focus on the analysis of iteration complexity, i.e., the number of iterations required to find an approximate stationary point, as well as possible ways to accelerate the computation of such approximate solutions. When nonconvex functional constraints are absent, one type of approach for solving (1.1) is to directly generalize stochastic gradient descent type methods (see [15, 16, 41, 1, 13, 45, 35, 37, 20]) to problems with nonconvex objective functions. An alternative approach is to indirectly utilize convex optimization methods within the framework of proximal-point methods, which transform nonconvex optimization problems into a series of convex ones (see [17, 7, 14, 11, 19, 24, 40, 36]). While direct methods are simpler and hence easier to implement, indirect methods may provide stronger theoretical performance guarantees under certain circumstances, e.g., when the problem has a large condition number, many components and/or multiple blocks . However, when nonconvex functional constraints do appear in (1.1), studies of solution methods are scarce. While there is a large body of work on the asymptotic analysis and the optimality conditions of penalty-based approaches for general constrained nonlinear programming (for example, see [6, 32, 3, 2, 39]), only a few works have discussed the complexity of these methods for solving problems with nonconvex functional constraints [8, 44, 12]. However, these techniques are not applicable to our setting because they guarantee only certain local non-increasing properties for the constraint functions rather than the feasibility of the generated solutions. On the other hand, the feasibility of the nonconvex functional constraints appears to be important in our problems of interest.
In this paper, we attempt to address some of the aforementioned issues associated with both convex and nonconvex functional constrained optimization. Our main contributions are as follows.
Firstly, for solving convex functional constrained problems, we present a novel primal-dual type method, referred to as the Constraint Extrapolation (ConEx) method. One feature distinguishing this method from existing primal-dual methods is that it utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration/momentum) step. As a consequence, contrary to Nemirovski’s well-known mirror-prox method  and a new primal-dual method recently developed by Hamedani and Aybat , ConEx does not require the projection of Lagrangian multipliers onto a (possibly unknown) bounded set. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Due to the built-in acceleration step, this method can exploit problem structure and hence achieve a better rate of convergence than primal methods. In fact, we show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different convex functional constrained problems, including convex or strongly convex, and smooth or nonsmooth problems with stochastic objective and/or stochastic constraints.
[Table 1: Iteration complexity of the ConEx method for strongly convex and convex instances of problem (1.1); the table body did not survive extraction.]
Table 1 provides a brief summary of the iteration complexity of the ConEx method for solving different functional constrained problems. For the strongly convex case, ConEx obtains convergence to an -approximate solution (i.e., optimality gap and infeasibility are ) as well as convergence of the distance of the last iterate to the optimal solution. The complexity bounds provided in Table 1 for the strongly convex case hold for both types of convergence criteria. For the semi- and fully-stochastic cases, we use the notion of expected convergence instead of the exact convergence used in the deterministic case. It should be noted that in Table 1, we ignore the impact of various Lipschitz constants and/or stochastic noises for the sake of simplicity. In fact, the ConEx method achieves quite a few new complexity results by reducing the impact of these Lipschitz constants and stochastic noises (see Theorems 2.1 and 2.2 and the discussions afterwards). Even though ConEx is a primal-dual type method, we can show its convergence irrespective of the knowledge of the optimal Lagrange multipliers, as it does not require the projection of multipliers onto a ball. In particular, the convergence rates of the ConEx method for nonsmooth cases (either convex or strongly convex) in Table 1 hold irrespective of the knowledge of the optimal Lagrange multipliers. For smooth cases, if certain parameters of the ConEx method are not big enough (compared to the norm of the optimal Lagrange multipliers), then it converges at the rates for nonsmooth problems of the respective case. As one can see from Table 1, such a change would cause a suboptimal convergence rate in terms of only for the deterministic case, while the complexity remains the same for both the semi- and fully-stochastic cases. It is worth mentioning that faster convergence rates for the smooth cases can still be attained by incorporating certain line search procedures.
To the best of our knowledge, this is the first time in the literature that a simple single-loop algorithm was developed for solving all different types of convex functional constrained problems in an optimal manner.
Secondly, we extend the ConEx method to the nonconvex setting and present a new proximal point framework for solving nonconvex functional constrained optimization problems, which otherwise seem difficult to solve by direct approaches. The key component of our method is to exploit the structure of the nonconvex objective and constraints , , thereby turning the original problem into a sequence of functional constrained subproblems with a strongly convex objective and strongly convex constraints. We show that if the initial point is strictly feasible, then all subsequent points generated by the algorithm remain strictly feasible. Hence, by the Slater condition, there exist Lagrange multipliers attaining strong duality for each subproblem. Furthermore, we analyze the conditions under which the dual variables are bounded, and show asymptotic convergence of the sequence to the KKT points of the original problem. Moreover, we provide the first iteration complexity analysis of this proximal point method under certain regularity conditions. More specifically, we show that this method requires iterations to obtain an appropriately defined -KKT point.
For practical use, we propose an inexact proximal point type algorithm in which only approximate solutions of the subproblems are computed. To develop the convergence analysis of the proposed method, we present different termination criteria for controlling the accuracy of the subproblem solutions, based either on the distance to the optimal solution, or on the functional optimality gap and constraint violation, depending on the type of constraint qualification. We then establish the convergence or complexity of the inexact proximal point method for solving nonconvex functional constrained problems. We also present the overall complexity of the inexact proximal point method when the ConEx method is used to solve the subproblems under appropriate constraint qualification conditions.
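The outer loop just described can be sketched in a few lines. This is a hedged illustration, not the paper's algorithm: the name `rho` (the added strong convexity parameter, chosen larger than the lower-curvature constant in (1.3)) and the toy objective `-cos(x)` are our own choices, and the subproblem solver is deliberately approximate, mirroring the inexact variants discussed above.

```python
import numpy as np

def inexact_proximal_point(x0, solve_subproblem, rho=2.0, iters=30):
    """Outer loop of a proximal point scheme for nonconvex problems:
    each subproblem adds (rho/2)*||x - x_k||^2, with rho larger than the
    lower-curvature constant, making the subproblem strongly convex; the
    subproblem is then solved only approximately."""
    x = float(x0)
    for _ in range(iters):
        x = solve_subproblem(x, rho)  # approximate minimizer of the convexified subproblem
    return x

# Toy unconstrained instance: f(x) = -cos(x) has lower curvature 1, so rho = 2
# makes each subproblem  min_y  -cos(y) + (rho/2)*(y - x)^2  strongly convex.
# We solve it approximately by a fixed-point iteration on the optimality
# condition  sin(y) + rho*(y - x) = 0.
def solve_subproblem(x, rho):
    y = x
    for _ in range(50):
        y = x - np.sin(y) / rho
    return y

x_star = inexact_proximal_point(1.0, solve_subproblem, rho=2.0, iters=30)
print(round(x_star, 4))  # approaches 0.0, a stationary point of -cos
```

The full method of the paper additionally carries the functional constraints into each convexified subproblem, which is where the ConEx solver and the termination criteria above come in.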
Close to the completion of our paper, we noticed that Ma et al.  also worked independently on the analysis of proximal-point methods for nonconvex functional constrained problems. In fact, the initial version of  was released almost at the same time as ours. In spite of some overlap, there exist a few essential differences between our work and . First, we establish the convergence/complexity of the proximal point method under a variety of constraint qualification conditions, including the Mangasarian-Fromovitz constraint qualification (MFCQ), strong MFCQ, and strong feasibility, and hence our work covers a broader class of nonconvex problems, while  only considers a uniform Slater’s condition. The strong feasibility condition is stronger than the uniform Slater’s condition but is easier to verify. Second,  uses a different definition of the subdifferential than ours, and the definition of the KKT conditions in  comes from convex optimization problems. While it is unclear under what constraint qualification this KKT condition is necessary for local optimality, it is possible to put their problem into our composite framework in (1.1) and compute the subdifferential that provably yields our KKT condition under the aforementioned MFCQ. Third, for solving the convex subproblems we provide a unified algorithm, i.e., ConEx, that can achieve the best-known rate of convergence for different problem classes, including deterministic, semi- and fully-stochastic, smooth and nonsmooth problems. In contrast, different methods were suggested for solving different types of problems in . In particular, a variant of the switching subgradient method, first presented by Polyak in  for the general convex case and later extended by  to the stochastic and strongly convex cases, was suggested for solving deterministic problems. For the stochastic case, they directly apply the algorithm in  and hence require the stochastic gradients to be bounded. These nonsmooth subgradient methods do not necessarily yield the best possible rate of convergence when the objective/constraint functions are smooth or contain certain smooth components.
This paper is organized as follows. Section 1.1 describes notation and terminology. Section 2 exclusively deals with the ConEx method for solving problem (1.1) in the convex setting. Subsection 2.1 states the main convergence results of the ConEx method, and subsection 2.2 presents the details of the convergence analysis. Section 3 presents the proximal point method for solving problem (1.1) in the nonconvex setting and establishes its convergence behavior and iteration complexity. We also introduce an inexact variant of the proximal point method in which the subproblems are only approximately solved, and show an overall iteration complexity result when the subproblems are solved using the ConEx method developed earlier.
1.1 Notation and terminology
Throughout the paper, we use the following notation. Let , , and and the constraints in (1.1) be expressed as . Here, bold denotes the vector with elements . The size of the vector is left unspecified whenever it is clear from the context. denotes a general norm and denotes its dual norm. stands for the Euclidean norm, and the inner product is denoted as . Let be the Euclidean ball of radius centered at the origin. The nonnegative orthant of this ball is denoted as . For a convex set , we denote the normal cone at as and its dual cone as , its interior as and its relative interior as . For a scalar valued function and a scalar , the notation stands for the set . The “” operation on sets denotes the Minkowski sum of the sets. We denote the distance between two sets as .
for any . For any vector , we define as elementwise application of the operator . The -th element of vector is denoted as unless otherwise explicitly specified a different notation for certain special vectors.
A function is -Lipschitz smooth if the gradient is a -Lipschitz function, i.e. for some
An equivalent form is:
A refined version of the above property differentiates between negative and positive curvature. In particular, we have
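The displayed inequalities referred to above were lost in extraction; the following is a reconstruction in standard notation, where the symbols $f$, $X$, $L$ and $\mu$ are our choices and may differ from the paper's:

```latex
% L-Lipschitz smoothness of f on X:
\|\nabla f(x) - \nabla f(y)\|_{*} \le L \|x - y\|
  \quad \forall x, y \in X,
% an equivalent two-sided form:
\bigl| f(y) - f(x) - \langle \nabla f(x),\, y - x \rangle \bigr|
  \le \tfrac{L}{2}\|y - x\|^{2},
% and the refined one-sided (lower curvature) version, cf. (1.3):
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle
  - \tfrac{\mu}{2}\|y - x\|^{2} \quad \forall x, y \in X.
```

The last inequality only bounds the negative curvature of $f$ from below, which is why it can hold with a small $\mu$ even for nonconvex functions.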
Here, we say that satisfies (1.3) with parameter with respect to . In many cases, it is possible that a convex function is a combination of Lipschitz smooth and nonsmooth functions. Let be continuously differentiable with Lipschitz gradient and -strongly convex with respect to . We define the prox-function associated with as
Based on the smoothness and strong convexity of , we have the following relation
Moreover, we say that a function is -strongly convex with respect to if
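For reference, the standard definitions behind the last three displays, reconstructed with assumed symbols ($\nu$ for the distance-generating function, $W$ for the prox-function, $\alpha$ and $L_{\nu}$ for its strong convexity and smoothness constants), read:

```latex
% prox-function (Bregman divergence) associated with \nu:
W(\bar{x}, x) := \nu(x) - \nu(\bar{x})
  - \langle \nabla \nu(\bar{x}),\, x - \bar{x} \rangle,
% sandwiched, by the strong convexity and Lipschitz gradient of \nu, as
\tfrac{\alpha}{2}\|x - \bar{x}\|^{2} \le W(\bar{x}, x)
  \le \tfrac{L_{\nu}}{2}\|x - \bar{x}\|^{2},
% and \mu-strong convexity of a function h with respect to W:
h(y) \ge h(x) + \langle h'(x),\, y - x \rangle + \mu\, W(x, y)
  \quad \forall x, y \in X.
```

Taking $\nu(x) = \tfrac{1}{2}\|x\|_2^2$ recovers the Euclidean case $W(\bar{x}, x) = \tfrac{1}{2}\|x - \bar{x}\|_2^2$.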
For any convex function , we denote the subdifferential by , which is defined as follows: at a point in the relative interior of , consists of all subgradients of at which are in the linear span of . For a point , the set consists of all vectors , if any, such that there exist and with . With this definition, it is well known that if a convex function is Lipschitz continuous, with constant , with respect to a norm , then the set is nonempty for any and
which also implies
where is the dual norm. See  for more details.
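The two lost displays presumably express the standard consequence of Lipschitz continuity for subgradients; with assumed symbols $f$, $M$, and $f'(x) \in \partial f(x)$, they would read:

```latex
% M-Lipschitz continuity of f:
|f(x) - f(y)| \le M \|x - y\| \quad \forall x, y \in X,
% which implies a bound on the subgradients in the dual norm:
\|f'(x)\|_{*} \le M \quad \forall\, f'(x) \in \partial f(x),\ x \in X.
```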
2 Constraint Extrapolation for Convex Functional Constrained Optimization
In this section, we present a novel constraint extrapolation (ConEx) method for solving problem (1.1) in the convex setting. To motivate our proposed method, observe that the KKT point of (1.1) coincides with the solution of the following saddle point problem:
In other words, is a saddle point of the Lagrange function such that
for all , whenever the optimal dual, , exists. Throughout this section, we assume the existence of satisfying (2.2). The following definition describes a widely used optimality measure for the convex problem (1.1).
A point is called a -optimal solution of problem (1.1) if
A stochastic -approximately optimal solution satisfies
As mentioned earlier, for the convex composite case, we assume that , are “simple” functions in the sense that, for any vector and nonnegative , we can efficiently compute the following prox operator
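As a concrete instance of such a "simple" prox operator, consider the Euclidean case with a weighted $\ell_1$ nonsmooth component, whose prox has the closed-form componentwise soft-thresholding solution. The names below (`prox_weighted_l1`, `weights`, `step`) are illustrative, not the paper's notation in (2.3).

```python
import numpy as np

def prox_weighted_l1(v, weights, step):
    """Euclidean prox operator
        argmin_x  sum_i w_i |x_i| + (1/(2*step)) * ||x - v||^2,
    solved in closed form by componentwise soft-thresholding --
    the kind of 'simple' prox assumed for the nonsmooth components."""
    w = step * np.asarray(weights, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - w, 0.0)

v = np.array([3.0, -0.5, 1.0])
out = prox_weighted_l1(v, weights=[1.0, 1.0, 2.0], step=1.0)
print(out)  # components shrink toward zero by their thresholds: [2, 0, 0]
```

A whole iteration of a prox-based method then only pays the cost of a gradient evaluation plus this cheap closed-form map, which is why Lipschitz smoothness of the "simple" components is of no consequence.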
2.1 The ConEx method
ConEx is a single-loop primal-dual type method for functional constrained optimization. It evolves from the primal-dual methods for solving bilinear saddle point problems (e.g., [9, 10, 25, 21, 20]). Recently, Hamedani and Aybat  showed that these methods can also handle a more general functional coupling term. However, as discussed earlier, existing primal-dual methods [33, 18] for general saddle point problems, when applied to functional constrained problems, require the projection of dual multipliers onto a possibly unknown bounded set in order to ensure the boundedness of the operators, as well as the proper selection of stepsizes. One distinctive feature of ConEx is that it uses the values of linearized constraint functions in place of exact function values when defining the operator of the saddle point problem and the extrapolation/momentum step. With this modification, we show that the ConEx method still converges even though the feasible set of in problem (2.1) is unbounded. In addition, we show that ConEx is a unified algorithm for solving functional constrained optimization problems in the following sense. First, we establish explicit rates of convergence of the ConEx method for solving functional constrained stochastic optimization problems where the objective and/or the constraints are given in the form of expectations. Second, we consider the composite constrained optimization problem in which the objective function and/or the constraints can be nonsmooth. Third, we consider the two cases of a convex or strongly convex objective . For a strongly convex objective, we also establish the convergence rate of the distance between the last iterate and the optimal solution .
Before proceeding to the algorithm, we introduce the problem setup in more details. First, we assume that satisfies the following composite Lipschitz smoothness and nonsmoothness condition:
for all and for all . For constraints, we make a similar assumption as in (2.4). Moreover, we make an additional assumption that the constraint functions are Lipschitz continuous. In particular, we have
for all and for all , and
Note that the Lipschitz-continuity assumption in (2.6) is common in the literature when , are nonsmooth functions. If , are Lipschitz smooth then their gradients are bounded due to the compactness of . Hence (2.6) is not a strong assumption for the given setting. Note that due to relations (2.5) and (2.6), we have
where and constants and are defined as
We denote as the vector of moduli of strong convexity for , and as the modulus of strong convexity for . We say that problem (1.1) is a convex composite Lipschitz smooth functional constrained minimization problem if (2.5) is satisfied with for all and (2.4) is satisfied with . Otherwise, (1.1) is a nonsmooth problem. To be succinct, problem (1.1) is Lipschitz smooth if , otherwise it is a nonsmooth problem.
We assume that we can access the first-order information of functions and zeroth-order information of function using a stochastic oracle (SO). In particular, given , SO outputs , and such that
where is a random variable which models the source of uncertainty and is independent of the search point . Note that the last relation of (2.8) is satisfied if we have individual stochastic oracles such that . In particular, we can set . We refer to , as stochastic subgradients of the functions at the point , respectively. We use the stochastic subgradients , in the -th iteration of the ConEx method, where is a realization of the random variable which is independent of the search point .
We denote a linear approximation of at point with
where as defined earlier. For ease of notation, we denote . We can do this since, for all , we approximate by its linear approximation taken at . We use a stochastic version of in our algorithm, which is denoted as . In particular, we have
where . Here, we used as an independent (of ) realization of random variable . In other words, and are conditionally independent estimates of for under the condition that is fixed. As we show later, independent samples of are required to show that is an unbiased estimator of .
We are now ready to formally describe the constraint extrapolation method (see Algorithm 1).
As mentioned earlier, the term in Line 3 of Algorithm 1 can be shown to be an unbiased estimator of . Moreover, the term is an approximation of . Essentially, Line 3 represents a stochastic approximation of the term , which is an extrapolation of the constraints, hence justifying the name of the algorithm. Line 4 is the standard prox operator of the form . Line 5 also uses a prox operator defined in (2.3), which uses a Bregman divergence instead of the standard Euclidean norm. The final output of the algorithm in Line 7 is the weighted average of all primal iterates generated. If we choose for , then we recover the deterministic gradients and function evaluations. Henceforth, we assume general non-negative values for such ’s and provide a combined analysis for these settings. Later, we substitute appropriate values of ’s to complete the analysis for the following three cases.
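To make the roles of Lines 3-5 concrete, here is a deliberately simplified, deterministic sketch of a ConEx-style loop on a toy problem. It is a hedged illustration only: we use a single affine constraint (so its linearization equals the constraint itself), a unit extrapolation weight, constant stepsizes `tau` and `eta` of our own choosing, and the Euclidean prox in place of the Bregman prox of (2.3); Algorithm 1 itself uses the stepsize policies of Section 2.1 and stochastic estimators.

```python
import numpy as np

def conex_toy(iters=300, tau=0.1, eta=0.1):
    """ConEx-style single loop on: minimize f(x) = ||x||^2 over the box
    [-2, 2]^2 subject to g(x) = 1 - x1 - x2 <= 0.  Since g is affine,
    its linearization at any point equals g itself."""
    g = lambda x: 1.0 - x[0] - x[1]
    grad_g = np.array([-1.0, -1.0])
    x = np.zeros(2)   # primal iterate
    y = 0.0           # dual multiplier for the single constraint
    ell_prev = g(x)
    for _ in range(iters):
        # "Line 3": constraint extrapolation with unit weight, s = 2*ell_t - ell_{t-1}
        ell = g(x)
        s = 2.0 * ell - ell_prev
        ell_prev = ell
        # "Line 4": dual prox step, a projection onto the nonnegative orthant
        y = max(0.0, y + tau * s)
        # "Line 5": primal prox step (Euclidean), then clip back into the box X
        x = np.clip(x - eta * (2.0 * x + y * grad_g), -2.0, 2.0)
    return x, y

x, y = conex_toy()
print(np.round(x, 3))  # approaches the constrained optimum [0.5, 0.5]
```

On this instance the iterates approach the optimum with an active constraint and a positive multiplier; notably, the dual variable is never projected onto a bounded set, which is the feature the text emphasizes.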
Deterministic setting where both the objective and constraints are deterministic. Here for all .
Semi-stochastic setting where the constraints are deterministic but the objective is stochastic. Here, for all . However, can take arbitrary values.
Fully-stochastic setting where both function and gradient evaluations are stochastic. Here, all can take arbitrary values.
Below, we specify a stepsize policy and state the convergence properties of Algorithm 1 for solving problem (1.1) in the convex setting. The proof of this result is involved and will be deferred to Section 2.2.
Set and in Algorithm 1 according to the following:
Then, we have
We discuss some important features of the iteration complexity result (2.12) in the following remark.
Suppose problem (1.1) is Lipschitz smooth; then . Moreover, suppose that ; then . We then obtain the following iteration complexity results: deterministic case: ; semi-stochastic case: . For the fully-stochastic case, after noting that is of the same order as and replacing , we see that the iteration complexity in (2.12) reduces to . It is worth noting that, for the Lipschitz smooth case, and are of the same order.
For the nonsmooth case, we have . Then, we obtain the following complexity results: deterministic case: , semi-stochastic case: and fully-stochastic case: .
Note that . For the Lipschitz smooth case, if , then , and we obtain the convergence rate of the nonsmooth case for the Lipschitz smooth problem.
In contrast to the stepsize scheme we will develop later for the strongly convex case (c.f. (2.14)), the stepsize scheme in (2.9) depends on , implying that we need to estimate whether , especially in the Lipschitz smooth case, in order to obtain the smaller iteration complexity of in the deterministic case. If we replace in the definition of by , then the last term in (2.9) changes to
If , then , and hence the complexity results in the first two bullet points of this remark hold.
Consider the pathological case of and a Lipschitz smooth problem (1.1), but . We call it pathological because the stepsize scheme (2.9), set according to the standard definition of , does not yield convergence in this case in the deterministic setting. In particular, the last term in the infeasibility bound (2.11) would change to , which is undefined. One possible remedy is to artificially set in the stepsize scheme (2.9) to some large positive number and forego the faster convergence of . After this change, we would obtain a convergence rate of . This change to is not required in the semi-stochastic or fully-stochastic setting, as implies a convergence rate of . An alternative approach would be to design a line search procedure on for the right value of , since there exists a verifiable condition based on the constraint violation . In this way, we can still obtain an convergence rate for Lipschitz smooth problems.
Similar to the semi-stochastic and fully-stochastic settings discussed above, the ConEx method converges for nonsmooth problems with the standard definition of .
Now we provide another theorem which states the stepsize policy and the resulting convergence properties of the ConEx method for solving problem (1.1) in the strongly convex setting. The proof of this result can be found in Section 2.2.
An immediate corollary of the above theorem is the following:
We obtain an -optimal solution of problem (1.1) in iterations, where