Stochastic First-order Methods for Convex and
Nonconvex Functional Constrained Optimization
Abstract
Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semi-supervised learning and robust optimization, among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing dual methods, it does not require the projection of Lagrangian multipliers onto a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method which transforms the initial nonconvex problem into a sequence of convex functional constrained subproblems. We establish the convergence and rate of convergence of this algorithm to KKT points under different constraint qualifications. For practical use, we present inexact variants of this algorithm, in which approximate solutions of the subproblems are computed using the aforementioned ConEx method, and establish their associated rates of convergence. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.
1 Introduction
In this paper, we study the following composite optimization problem with functional constraints:
(1.1) 
Here, is a convex compact set, and are continuous functions which are not necessarily convex, is a proper convex lower semicontinuous function, and are convex and continuous functions. Problem (1.1) covers different convex and nonconvex settings depending on the assumptions on and , .
In the convex setting, we assume that , , are convex or strongly convex functions, which can be either smooth, nonsmooth, or the sum of smooth and nonsmooth components. We also assume that , , are “simple” functions in the sense that, for any given vector and nonnegative weight vector , a certain proximal operator associated with the function can be computed efficiently. For such problems, Lipschitz smoothness properties of ’s are of no consequence due to the simplicity of this proximal operator.
For the nonconvex case, we assume that , , are smooth functions, which are not necessarily convex but satisfy a certain lower curvature condition (c.f. (1.3)). However, we do not impose the simplicity assumption on the proximal operator associated with the convex functions , , in order to cover a broad class of nonconvex problems, including those with nondifferentiable objective functions or constraints.
Constrained optimization problems of the above form are prevalent in data science. One such example arises from risk-averse machine learning. Let model the loss for a random datapoint . Our goal is to minimize a certain risk measure [42, 43], e.g., the so-called conditional value at risk, which penalizes only the positive deviation of the loss function, subject to the constraint that the expected loss is less than a threshold value. Therefore, one can formulate this problem as
(1.2) 
where denotes the conditional value at risk and is the tolerance on the average loss that one considers acceptable. In many practical situations, the loss function is nonconvex w.r.t. . Other examples of problem (1.1) can be found in semi-supervised learning, where one would like to minimize the loss function defined over the labeled samples, subject to certain proximity-type constraints for the unlabeled samples.
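As a concrete illustration of the risk measure in (1.2), the conditional value at risk admits the well-known Rockafellar-Uryasev representation CVaR_alpha(L) = min_t { t + E[(L - t)_+] / (1 - alpha) }, which can be evaluated empirically as in the sketch below (the sample losses and level alpha are illustrative, not from the paper):

```python
import numpy as np

def cvar(losses, alpha):
    """Empirical conditional value-at-risk at level alpha via the
    Rockafellar-Uryasev formula: min_t { t + E[(loss - t)_+] / (1 - alpha) }.
    The empirical alpha-quantile (VaR) is a minimizer of this objective."""
    t = np.quantile(losses, alpha)
    return t + np.mean(np.maximum(losses - t, 0.0)) / (1.0 - alpha)

losses = np.array([1.0, 2.0, 3.0, 10.0])
# At alpha = 0.5, CVaR averages the worst half of the losses: (3 + 10) / 2.
print(cvar(losses, 0.5))
```

This is the quantity one would constrain or minimize in (1.2); in the stochastic setting it would be estimated from mini-batches rather than the full sample.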
There exists a variety of literature on solving convex functional constrained optimization problems (1.1). One research line focuses on primal methods that do not involve the Lagrange multipliers, including the cooperative subgradient methods [38, 26] and level-set methods [27, 34, 29, 4, 28]. One possible limitation of these methods is the difficulty of directly achieving accelerated rates of convergence when the objective or constraint functions are smooth. Constrained convex optimization problems can also be solved by reformulating them as saddle point problems, which are then solved using primal-dual type algorithms (see [33, 18]). The main hurdle for existing primal-dual methods is that they require the projection of dual multipliers onto a ball whose diameter is usually unknown. Other alternative approaches for constrained convex problems include the classical exact penalty, quadratic penalty and augmented Lagrangian methods [6, 22, 23, 46]. These approaches, however, require the solution of penalty subproblems and hence are more complicated than primal and primal-dual methods. Recently, research effort has also been directed to stochastic optimization problems with functional constraints [26, 4]. In spite of many interesting findings, existing methods for solving these problems are still limited: a) many primal methods solve only stochastic problems with deterministic constraints [26], and the convergence of accelerated primal-dual methods [33, 18] has not been studied for stochastic functional constrained problems; and b) a few algorithms for solving problems with expectation constraints require either a constraint evaluation step [26], or stochastic lower bounds on the optimal value [4], thus relying on a light-tail assumption for the stochastic noise and conservative sampling estimates based on the Bernstein inequality.
Some other algorithms require the even more restrictive assumption that the noise associated with the stochastic constraints is bounded [47].
The past few years have also seen a resurgence of interest in the design of efficient algorithms for nonconvex stochastic optimization, especially for stochastic and finite-sum problems, due to their importance in machine learning. Most of these studies assume that the constraints are convex, and focus on the analysis of iteration complexity, i.e., the number of iterations required to find an approximate stationary point, as well as possible ways to accelerate the computation of such approximate solutions. If nonconvex functional constraints do not appear, one type of approach for solving (1.1) is to directly generalize stochastic gradient descent type methods (see [15, 16, 41, 1, 13, 45, 35, 37, 20]) for solving problems with nonconvex objective functions. An alternative approach is to indirectly utilize convex optimization methods within the framework of proximal-point methods, which transform nonconvex optimization problems into a series of convex ones (see [17, 7, 14, 11, 19, 24, 40, 36]). While direct methods are simpler and hence easier to implement, indirect methods may provide stronger theoretical performance guarantees under certain circumstances, e.g., when the problem has a large condition number, many components and/or multiple blocks [24]. However, if nonconvex functional constraints do appear in (1.1), studies of solution methods are scarce. While there is a large body of work on the asymptotic analysis and the optimality conditions of penalty-based approaches for general constrained nonlinear programming (for example, see [6, 32, 3, 2, 39]), only a few works have discussed the complexity of these methods for solving problems with nonconvex functional constraints [8, 44, 12]. However, these techniques are not applicable to our setting because they guarantee only certain local nonincreasing properties for the constraint functions rather than the feasibility of the generated solutions.
On the other hand, the feasibility of the nonconvex functional constraints appears to be important in our problems of interest.
In this paper, we attempt to address some of the aforementioned significant issues associated with both convex and nonconvex functional constrained optimization. Our main contributions are as follows.
Firstly, for solving convex functional constrained problems, we present a novel primal-dual type method, referred to as the Constraint Extrapolation (ConEx) method. One distinctive feature of this method, compared to existing primal-dual methods, is that it utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration/momentum) step. As a consequence, contrary to the well-known Nemirovski’s mirror-prox method [33] and a new primal-dual method recently developed by Hamedani and Aybat [18], ConEx does not require the projection of Lagrangian multipliers onto a (possibly unknown) bounded set. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Due to the built-in acceleration step, this method can exploit problem structure and hence achieve better rates of convergence than primal methods. In fact, we show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different convex functional constrained problems, including convex or strongly convex, and smooth or nonsmooth problems with stochastic objective and/or stochastic constraints.
Cases            | Strongly convex (1.1)      | Convex (1.1)
                 | Smooth     | Nonsmooth     | Smooth     | Nonsmooth
Deterministic    |            |               |            |
Semi-stochastic  |            |               |            |
Fully-stochastic |            |               |            |
Table 1 provides a brief summary of the iteration complexity of the ConEx method for solving different functional constrained problems. For the strongly convex case, ConEx can obtain convergence to an approximate solution (i.e., optimality gap and infeasibility are ) as well as convergence of the distance of the last iterate to the optimal solution. The complexity bounds provided in Table 1 for the strongly convex case hold for both types of convergence criteria. For the semi-stochastic and fully-stochastic cases, we use the notion of expected convergence instead of the exact convergence used in the deterministic case. It should be noted that in Table 1, we ignore the impact of various Lipschitz constants and/or stochastic noises for the sake of simplicity. In fact, the ConEx method achieves quite a few new complexity results by reducing the impact of these Lipschitz constants and stochastic noises (see Theorems 2.1 and 2.2 and the discussions afterwards). Even though ConEx is a primal-dual type method, we can show its convergence irrespective of any knowledge of the optimal Lagrange multipliers, as it does not require the projection of multipliers onto a ball. In particular, the convergence rates of the ConEx method for nonsmooth cases (either convex or strongly convex) in Table 1 hold irrespective of the knowledge of the optimal Lagrange multipliers. For smooth cases, if certain parameters of the ConEx method are not big enough (compared to the norm of the optimal Lagrange multipliers), then it converges at the rates for nonsmooth problems of the respective case. As one can see from Table 1, such a change would cause a suboptimal convergence rate in terms of only for the deterministic case, while the complexity remains the same for both the semi-stochastic and fully-stochastic cases. It is worth mentioning that faster convergence rates for the smooth cases can still be attained by incorporating certain line search procedures.
To the best of our knowledge, this is the first time in the literature that a simple single-loop algorithm has been developed for solving all these different types of convex functional constrained problems in an optimal manner.
Secondly, we extend the ConEx method to the nonconvex setting and present a new proximal point framework for solving nonconvex functional constrained optimization problems, which otherwise seem difficult to solve by direct approaches. The key component of our method is to exploit the structure of the nonconvex objective and constraints , , thereby turning the original problem into a sequence of functional constrained subproblems with a strongly convex objective and strongly convex constraints. We show that when the initial point is strictly feasible, all subsequent points generated by the algorithm remain strictly feasible. Hence, by the Slater condition, there exist Lagrange multipliers attaining strong duality for each subproblem. Furthermore, we analyze the conditions under which the dual variables are bounded, and show asymptotic convergence of the sequence to the KKT points of the original problem. Moreover, we provide the first iteration complexity analysis of this proximal point method under certain regularity conditions. More specifically, we show that this method requires iterations to obtain an appropriately defined KKT point.
For practical use, we propose an inexact proximal point type algorithm in which only approximate solutions of the subproblems are computed. To develop the convergence analysis of the proposed method, we present different termination criteria for controlling the accuracy of the subproblem solutions, based either on the distance to the optimal solution or on the functional optimality gap and constraint violation, depending on the type of constraint qualification. We then establish the convergence and complexity of the inexact proximal point method for solving nonconvex functional constrained problems. We also present the overall complexity of the inexact proximal point method when the ConEx method is used to solve the subproblems under appropriate constraint qualification conditions.
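To make the proximal point idea concrete, below is a minimal one-dimensional sketch. The nonconvex objective, the linear constraint, and all numerical choices (rho, the grid, the iteration count) are illustrative assumptions, and the grid-search subproblem solver merely stands in for the ConEx calls of the actual method:

```python
import numpy as np

def solve_subproblem(x_prev, rho, grid):
    """Hypothetical subproblem solver (grid search standing in for ConEx):
    minimize f(x) + rho*(x - x_prev)^2 subject to g(x) + rho*(x - x_prev)^2 <= 0.
    The quadratic prox term is added to BOTH the objective and the constraint;
    since 2*rho = 8 exceeds the lower-curvature bound 6 of f (f'' >= -6),
    each subproblem is strongly convex."""
    f = grid**4 - 3.0 * grid**2        # nonconvex objective
    g = -grid - 1.5                    # constraint x >= -1.5, written g(x) <= 0
    prox = rho * (grid - x_prev) ** 2
    obj = np.where(g + prox <= 0.0, f + prox, np.inf)
    return grid[np.argmin(obj)]

grid = np.linspace(-2.0, 2.0, 4001)
x = 0.5                                # strictly feasible starting point
for _ in range(50):
    x = solve_subproblem(x, 4.0, grid)
```

Under these illustrative choices, the iterates stay strictly feasible and converge to x = sqrt(3/2), a KKT point of min x^4 - 3x^2 subject to x >= -1.5 on the starting side.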
Close to the completion of our paper, we noticed that Ma et al. [30] also worked independently on the analysis of proximal-point methods for nonconvex functional constrained problems. In fact, the initial version of [30] was released almost at the same time as ours. In spite of some overlap, there exist a few essential differences between our work and [30]. First, we establish the convergence/complexity of the proximal point method under a variety of constraint qualification conditions, including the Mangasarian-Fromovitz constraint qualification (MFCQ), strong MFCQ, and strong feasibility, and hence our work covers a broader class of nonconvex problems, while [30] only considers a uniform Slater's condition. The strong feasibility condition is stronger than the uniform Slater's condition but is easier to verify. Second, [30] uses a different definition of the subdifferential than ours, and the definition of the KKT conditions in [30] comes from convex optimization problems. While it is unclear under what constraint qualification this KKT condition is necessary for local optimality, it is possible to put their problem into our composite framework in (1.1) and compute the subdifferential that provably yields our KKT condition under the aforementioned MFCQ. Third, for solving the convex subproblems we provide a unified algorithm, i.e., ConEx, that can achieve the best-known rate of convergence for solving different problem classes, including deterministic, semi-stochastic and fully-stochastic, smooth and nonsmooth problems. On the other hand, different methods were suggested for solving different types of problems in [30]. In particular, a variant of the switching subgradient method, which was first presented by Polyak in [38] for the general convex case, and later extended by [26] to the stochastic and strongly convex cases, was suggested for solving deterministic problems.
For the stochastic case, they directly apply the algorithm in [47] and hence require the stochastic gradients to be bounded. These nonsmooth subgradient methods do not necessarily yield the best possible rate of convergence when the objective/constraint functions are smooth or contain certain smooth components.
Outline
This paper is organized as follows. Section 1.1 describes notation and terminology. Section 2 deals exclusively with the ConEx method for solving problem (1.1) in the convex setting. Subsection 2.1 states the main convergence results of the ConEx method, and Subsection 2.2 presents the details of the convergence analysis. Section 3 presents the proximal point method for solving problem (1.1) in the nonconvex setting and establishes its convergence behavior and iteration complexity. We also introduce an inexact variant of the proximal point method in which the subproblems are approximately solved, and show an overall iteration complexity result when the subproblems are solved using the ConEx method developed earlier.
1.1 Notation and terminology
Throughout the paper, we use the following notation. Let , , and , and let the constraints in (1.1) be expressed as . Here bold denotes the vector with elements . The size of a vector is left unspecified whenever it is clear from the context. denotes a general norm and denotes its dual norm. stands for the Euclidean norm, and the inner product is denoted by . Let be the Euclidean ball of radius centered at the origin. The nonnegative orthant of this ball is denoted by . For a convex set , we denote the normal cone at by , its dual cone by , its interior by and its relative interior by . For a scalar-valued function and a scalar , the notation stands for the set . The “” operation on sets denotes the Minkowski sum of the sets. We denote the distance between two sets by .
for any . For any vector , we define as the elementwise application of the operator . The th element of a vector is denoted by , unless a different notation is explicitly specified for certain special vectors.
A function is Lipschitz smooth if its gradient is a Lipschitz function, i.e., for some
An equivalent form is:
A refined version of the above property differentiates between negative and positive curvature. In particular, we have
(1.3) 
Here, we say that satisfies (1.3) with parameter with respect to . In many cases, it is possible that a convex function is a combination of Lipschitz smooth and nonsmooth functions. Let be continuously differentiable with Lipschitz gradient and strongly convex with respect to . We define the prox-function associated with as
(1.4) 
Based on the smoothness and strong convexity of , we have the following relation
(1.5) 
Moreover, we say that a function is strongly convex with respect to if
(1.6) 
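A small numerical illustration of the prox-function (Bregman divergence) in (1.4) and the strong convexity lower bound: the two distance-generating functions below (squared Euclidean norm, and negative entropy on the simplex) are standard textbook examples, not choices prescribed by the paper:

```python
import numpy as np

def bregman(w, grad_w, x, y):
    """Bregman divergence W(y, x) = w(y) - w(x) - <grad w(x), y - x>
    for a distance-generating function w (a sketch of (1.4))."""
    return w(y) - w(x) - np.dot(grad_w(x), y - x)

# Example 1: w(x) = 0.5*||x||^2 gives W(y, x) = 0.5*||y - x||^2.
w_sq = lambda x: 0.5 * np.dot(x, x)
gw_sq = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(bregman(w_sq, gw_sq, x, y))  # equals 0.5*||y - x||^2

# Example 2: negative entropy on the simplex gives the KL divergence,
# which is 1-strongly convex w.r.t. the l1 norm (Pinsker's inequality),
# matching the strong-convexity lower bound in (1.5).
w_ent = lambda x: np.sum(x * np.log(x))
gw_ent = lambda x: np.log(x) + 1.0
p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
W = bregman(w_ent, gw_ent, p, q)
assert W >= 0.5 * np.sum(np.abs(q - p)) ** 2  # strong convexity lower bound
```

The Euclidean choice recovers the standard projected-gradient setting, while non-Euclidean choices adapt the prox steps of Algorithm 1 to the geometry of the feasible set.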
For any convex function , we denote the subdifferential by , which is defined as follows: at a point in the relative interior of , consists of all subgradients of at which are in the linear span of . For a point , the set consists of all vectors , if any, such that there exist and with . With this definition, it is well-known that, if a convex function is Lipschitz continuous, with constant , with respect to a norm , then the set is nonempty for any and
which also implies
where is the dual norm. See [5] for more details.
2 Constraint Extrapolation for Convex Functional Constrained Optimization
In this section, we present a novel constraint extrapolation (ConEx) method for solving problem (1.1) in the convex setting. To motivate our proposed method, observe that the KKT point of (1.1) coincides with the solution of the following saddle point problem:
(2.1) 
In other words, is a saddle point of the Lagrange function such that
(2.2) 
for all , whenever the optimal dual, , exists. Throughout this section, we assume the existence of satisfying (2.2). The following definition describes a widely used optimality measure for the convex problem (1.1).
Definition.
A point is called a optimal solution of problem (1.1) if
A stochastic approximately optimal solution satisfies
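The definition above can be sketched as a simple check combining the optimality gap with the norm of the constraint violation; the particular norm and the use of a single tolerance eps for both criteria are illustrative choices, not the paper's exact definition:

```python
import numpy as np

def is_approx_optimal(x, f, g, f_star, eps):
    """Sketch of an approximate-optimality check: optimality gap
    f(x) - f* at most eps, and constraint violation ||[g(x)]_+||
    at most eps (Euclidean norm chosen for illustration)."""
    gap = f(x) - f_star
    infeasibility = np.linalg.norm(np.maximum(g(x), 0.0))
    return gap <= eps and infeasibility <= eps

# Toy instance: min ||x||^2 s.t. x_1 >= 1 has optimal value f* = 1 at (1, 0).
f = lambda x: float(x @ x)
g = lambda x: np.array([1.0 - x[0]])  # g(x) <= 0 encodes x_1 >= 1
print(is_approx_optimal(np.array([1.0, 0.01]), f, g, 1.0, 1e-2))
```

In the stochastic variant of the definition, the same two quantities are controlled in expectation rather than deterministically.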
As mentioned earlier, for the convex composite case, we assume that , are “simple” functions in the sense that, for any vector and nonnegative , we can efficiently compute the following prox operator
(2.3) 
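For intuition, a minimal Euclidean instance of a prox operator of the form (2.3) can be sketched as follows; the box constraint and the role assigned to g are assumptions made purely for illustration:

```python
import numpy as np

def prox_step(x, g, tau, lb, ub):
    """Euclidean instance of a prox operator like (2.3), under simplifying
    assumptions: the nonsmooth term is the indicator of a box [lb, ub] and
    the Bregman divergence is 0.5*||y - x||^2, so

        argmin_{y in [lb, ub]}  <g, y> + (1/(2*tau)) * ||y - x||^2

    has the closed form clip(x - tau*g, lb, ub). In ConEx, g would collect
    the objective subgradient plus the nonnegatively weighted subgradients
    of the linearized constraints (names here are illustrative)."""
    return np.clip(x - tau * g, lb, ub)

print(prox_step(np.array([0.5, 0.5]), np.array([2.0, -2.0]), 0.5, 0.0, 1.0))
```

The "simplicity" assumption on the composite terms amounts to requiring such a closed form (or an otherwise cheap computation) for this argmin.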
2.1 The ConEx method
ConEx is a single-loop primal-dual type method for functional constrained optimization. It evolves from the primal-dual methods for solving bilinear saddle point problems (e.g., [9, 10, 25, 21, 20]). Recently, Hamedani and Aybat [18] showed that these methods can also handle a more general functional coupling term. However, as discussed earlier, existing primal-dual methods [33, 18] for general saddle point problems, when applied to functional constrained problems, require the projection of dual multipliers onto a possibly unknown bounded set in order to ensure the boundedness of the operators, as well as the proper selection of stepsizes. One distinctive feature of ConEx is to use values of the linearized constraint functions in place of exact function values when defining the operator of the saddle point problem and the extrapolation/momentum step. With this modification, we show that the ConEx method still converges even though the feasible set of in problem (2.1) is unbounded. In addition, we show that ConEx is a unified algorithm for solving functional constrained optimization problems in the following sense. First, we establish explicit rates of convergence of the ConEx method for solving functional constrained stochastic optimization problems where the objective and/or the constraints are given in the form of expectations. Second, we consider the composite constrained optimization problem in which the objective function and/or constraints can be nonsmooth. Third, we consider the two cases of a convex or strongly convex objective . For a strongly convex objective, we also establish the convergence rate of the distance between the last iterate and the optimal solution .
Before proceeding to the algorithm, we introduce the problem setup in more detail. First, we assume that satisfies the following composite Lipschitz smoothness and nonsmoothness condition:
(2.4) 
for all and for all . For constraints, we make a similar assumption as in (2.4). Moreover, we make an additional assumption that the constraint functions are Lipschitz continuous. In particular, we have
(2.5) 
for all and for all , and
(2.6)  
Note that the Lipschitz continuity assumption in (2.6) is common in the literature when , are nonsmooth functions. If , are Lipschitz smooth, then their gradients are bounded due to the compactness of . Hence (2.6) is not a strong assumption in the given setting. Note that due to relations (2.5) and (2.6), we have
where and constants and are defined as

(2.7) 
We denote as the vector of moduli of strong convexity for , and as the modulus of strong convexity for . We say that problem (1.1) is a convex composite Lipschitz smooth functional constrained minimization problem if (2.5) is satisfied with for all and (2.4) is satisfied with . Otherwise, (1.1) is a nonsmooth problem. To be succinct, problem (1.1) is Lipschitz smooth if , otherwise it is a nonsmooth problem.
We assume that we can access first-order information about the functions and zeroth-order information about the function through a stochastic oracle (SO). In particular, given , the SO outputs , and such that
(2.8) 
where is a random variable which models the source of uncertainty and is independent of the search point . Note that the last relation of (2.8) is satisfied if we have individual stochastic oracles such that . In particular, we can set . We call , stochastic subgradients of the functions at the point , respectively. We use the stochastic subgradients , in the th iteration of the ConEx method, where is a realization of the random variable which is independent of the search point .
We denote a linear approximation of at point with
where is as defined earlier. For ease of notation, we denote . We can do this since, for all , we approximate by its linear approximation taken at . We use a stochastic version of in our algorithm, which is denoted by . In particular, we have
where . Here, is an independent (of ) realization of the random variable . In other words, and are conditionally independent estimates of for , conditioned on being fixed. As we show later, independent samples of are required to show that is an unbiased estimator of .
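The need for conditionally independent samples can be seen in a small demo: multiplying a noisy estimate by itself is biased, while multiplying two independent copies is not. The Gaussian noise model and all numbers below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# If Z is an unbiased estimate of z, then E[Z * Z] = z^2 + Var(Z) is a
# biased estimate of z^2, while E[Z * Z'] = z^2 for an independent copy Z'.
# The analogous fact underlies the unbiasedness of the stochastic
# linearized-constraint term in the ConEx analysis.
z, n = 2.0, 200_000
Z1 = z + rng.standard_normal(n)
Z2 = z + rng.standard_normal(n)           # independent copy
same_sample = np.mean(Z1 * Z1)            # concentrates near z^2 + Var = 5
independent = np.mean(Z1 * Z2)            # concentrates near z^2 = 4
```

This is why the algorithm draws a fresh, independent realization of the random variable when forming the stochastic linear approximation.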
We are now ready to formally describe the constraint extrapolation method (see Algorithm 1).
As mentioned earlier, the term in Line 3 of Algorithm 1 can be shown to be an unbiased estimator of . Moreover, the term is an approximation of . Essentially, Line 3 represents a stochastic approximation of the term , which is an extrapolation of the constraints, hence justifying the name of the algorithm. Line 4 is the standard prox operator of the form . Line 5 also uses the prox operator defined in (2.3), which uses a Bregman divergence instead of the standard squared Euclidean norm. The final output of the algorithm in Line 7 is the weighted average of all the primal iterates generated. If we choose for , then we recover the deterministic gradients and function evaluations. Henceforth, we assume general nonnegative values for such ’s and provide a combined analysis for these settings. Later, we substitute appropriate values of the ’s to complete the analysis for the following three different cases.
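The three steps just described can be sketched, in the deterministic Euclidean case, roughly as follows. The parameter names (theta, tau, gamma), the helper functions, and the toy problem are illustrative assumptions and do not reproduce Algorithm 1's exact stepsize policy:

```python
import numpy as np

def conex_iteration(x, x_prev, y, f_grad, g_val, g_jac, theta, tau, gamma, project_X):
    """One deterministic ConEx-style iteration (a sketch). With the
    linearization l(u; x) = g(x) + J_g(x)(u - x) and l(u; u) = g(u), the
    extrapolation of linearized constraint values reduces here to a
    momentum term on the constraint values themselves."""
    # Line 3 (constraint extrapolation):
    s = g_val(x) + theta * (g_val(x) - g_val(x_prev))
    # Line 4 (dual prox step): only a projection onto the nonnegative
    # orthant -- no bounded dual set is required.
    y_new = np.maximum(y + tau * s, 0.0)
    # Line 5 (primal prox step) with Euclidean Bregman divergence:
    grad = f_grad(x) + g_jac(x).T @ y_new
    x_new = project_X(x - gamma * grad)
    return x_new, y_new

# Toy problem: min (x-2)^2 s.t. x - 1 <= 0 over X = [-5, 5]; optimum x = 1.
f_grad = lambda x: 2.0 * (x - 2.0)
g_val = lambda x: x - 1.0
g_jac = lambda x: np.array([[1.0]])
project_X = lambda z: np.clip(z, -5.0, 5.0)

x, x_prev, y = np.array([0.0]), np.array([0.0]), np.array([0.0])
for _ in range(500):
    x_new, y = conex_iteration(x, x_prev, y, f_grad, g_val, g_jac,
                               theta=1.0, tau=0.1, gamma=0.1, project_X=project_X)
    x_prev, x = x, x_new
```

On this toy instance the iterates approach the optimal primal solution x = 1 and the optimal multiplier y = 2, with the dual iterates kept nonnegative by the orthant projection alone.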

Deterministic setting, where both the objective and constraints are deterministic. Here, for all .

Semi-stochastic setting, where the constraints are deterministic but the objective is stochastic. Here, for all . However, can take arbitrary values.

Fully-stochastic setting, where both function and gradient evaluations are stochastic. Here, all the can take arbitrary values.
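The three settings above can be modeled by a single oracle sketch in which each noise level can be switched off independently, in the spirit of (2.8). The additive-Gaussian noise model and the sigma names are assumptions made for illustration:

```python
import numpy as np

class StochasticOracle:
    """Sketch of the stochastic oracle (SO) in (2.8): unbiased estimates of
    the objective (sub)gradient, the constraint values, and the constraint
    (sub)gradients. Setting sigma_g0 = sigma_g = 0 recovers the
    semi-stochastic setting, and all sigmas = 0 the deterministic one."""

    def __init__(self, f_grad, g_val, g_jac, sigma_f, sigma_g0, sigma_g, rng):
        self.f_grad, self.g_val, self.g_jac = f_grad, g_val, g_jac
        self.sigma_f, self.sigma_g0, self.sigma_g = sigma_f, sigma_g0, sigma_g
        self.rng = rng

    def sample(self, x):
        """Return (G_f, g, J_g): noisy but unbiased oracle outputs at x."""
        g = self.g_val(x)
        J = self.g_jac(x)
        return (self.f_grad(x) + self.sigma_f * self.rng.standard_normal(x.shape),
                g + self.sigma_g0 * self.rng.standard_normal(g.shape),
                J + self.sigma_g * self.rng.standard_normal(J.shape))
```

Averaging many calls to `sample` at a fixed point recovers the true gradient and constraint values, which is exactly the unbiasedness required by (2.8).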
Below, we specify a stepsize policy and state the convergence properties of Algorithm 1 for solving problem (1.1) in the convex setting. The proof of this result is involved and will be deferred to Section 2.2.
Theorem.
Suppose (2.4), (2.5), (2.6) and (2.8) are satisfied. Let be a given constant, , and
Set and in Algorithm 1 according to the following:

(2.9) 
where
Then, we have
(2.10) 
and
(2.11) 
where
As a consequence, the number of iterations performed by Algorithm 1 to find an optimal solution of problem (1.1) can be bounded by
(2.12) 
We discuss some important features of the iteration complexity result (2.12) in the following remark.
Remark.
We derive from (2.12) the convergence rate of the ConEx method for solving convex problem (1.1) in both Lipschitz smooth and nonsmooth cases.

If problem (1.1) is Lipschitz smooth, then . Moreover, suppose that ; then . Then we obtain the following iteration complexity results: deterministic case: , semi-stochastic case: . For the fully-stochastic case, after noting that is of the same order as and replacing , we can see that the iteration complexity in (2.12) reduces to . It is worth noting that in the Lipschitz smooth case, and are of the same order.

For the nonsmooth case, we have . Then we obtain the following complexity results: deterministic case: , semi-stochastic case: , and fully-stochastic case: .

Note that . For the Lipschitz smooth case, if , then , and we obtain the convergence rate of the nonsmooth case for the Lipschitz smooth problem.

In contrast to the stepsize scheme we develop later for the strongly convex case (c.f. (2.14)), the stepsize scheme in (2.9) depends on , implying that we need to estimate whether , especially in the Lipschitz smooth case, in order to obtain the smaller iteration complexity of in the deterministic case. We can replace in the definition of by ; then the last term in (2.9) changes to
(2.13) If , then , and hence the complexity results from the first two bullet points of this remark hold.

Consider the pathological case of and a Lipschitz smooth problem (1.1), but . We call it pathological because the stepsize scheme (2.9), set according to the standard definition of , does not yield convergence in this case in the deterministic setting. In particular, the last term in the infeasibility bound (2.11) would change to , which is undefined. One possible remedy is to artificially set in the stepsize scheme (2.9) to some large positive number and forego the faster convergence of . After this change, we would obtain a convergence rate of . This change to is not required in the semi-stochastic or fully-stochastic setting, as implies a convergence rate of . An alternative approach would be to design a line search procedure for the right value of , since there exists a verifiable condition based on the constraint violation . In this way, we can still obtain an convergence rate for Lipschitz smooth problems.

Similar to the semi-stochastic and fully-stochastic settings discussed above, the ConEx method converges for nonsmooth problems with the standard definition of .
Now we provide another theorem which states the stepsize policy and the resulting convergence properties of the ConEx method for solving problem (1.1) in the strongly convex setting. The proof of this result can be found in Section 2.2.
Theorem.
An immediate corollary of the above theorem is the following:
Corollary.
We obtain an optimal solution of problem (1.1) in iterations, where