A Smooth Inexact Penalty Reformulation of Convex Problems with Linear Constraints


Tatiana Tatarenko, Angelia Nedić
The School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ 85281
Abstract

In this work, we consider a constrained convex problem with linear inequalities and provide an inexact penalty reformulation of the problem. The novelty is in the choice of the penalty functions, which are smooth and can induce a non-zero penalty over some points in the feasible region of the original constrained problem. The resulting unconstrained penalized problem is parametrized by two penalty parameters, which control the slope and the curvature of the penalty function. We show that, under some assumptions and with a suitable selection of these penalty parameters, the solutions of the penalized unconstrained problem are feasible for the original constrained problem. We also establish that, with suitable choices of the penalty parameters, the solutions of the penalized unconstrained problem achieve a suboptimal value that is arbitrarily close to the optimal value of the original constrained problem. For problems with a large number of linear inequality constraints, a particular advantage of such a smooth penalty-based reformulation is that it renders a penalized problem suitable for fast incremental gradient methods, which require only one sample from the inequality constraints at each iteration. We consider applying SAGA, proposed in saga (), to solve the resulting penalized unconstrained problem.

keywords:
convex minimization, linear constraints, inexact penalty, incremental methods

1 Introduction

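In this paper, we study the problem of minimizing a convex function $f$ over a convex and closed set $X$ that is the intersection of finitely many convex and closed sets $X_i$, $i=1,\ldots,m$ ($m$ is large), i.e.,

\begin{align}
\text{minimize}\quad & f(x) \tag{1}\\
\text{subject to}\quad & x \in X = \bigcap_{i=1}^{m} X_i. \tag{2}
\end{align}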

Throughout the paper, the function is assumed to be convex over . Optimization problems of the form (1) arise in many areas of research, such as digital filter settings in communication systems filter (), energy consumption in Smart Grids SmartG (), and convex relaxations of various combinatorial optimization problems in machine learning applications clustering (); matching ().

Our interest is in the case when the number of constraint sets is large, which prohibits the use of projected gradient and augmented Lagrangian methods BertsekasConstrOpt (), since these require, at each iteration, either the computation of a (Euclidean) projection or an estimate of the gradient for a sum of many functions. To reduce the complexity, one may consider a method that operates on a single set from the constraint set collection at each iteration. Algorithms using random constraint sampling for general convex optimization problems (1) were first considered in Nedich2011 () and were extended in WangBerts () to a broader class of randomization over the sets of constraints. Moreover, a convergence rate analysis is performed in WangBerts () to demonstrate that the feasibility error diminishes to zero at a rate , whereas the optimality error diminishes to zero at a rate of . For general convex problems of type (1), the latter rate is optimal over the class of optimization methods based on noisy first-order information.
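To illustrate the idea of processing a single constraint per iteration, the following minimal sketch takes a gradient step and then projects onto one randomly sampled half-space. It is only an illustration under stated assumptions: the toy objective (a squared distance to a hypothetical point `c`), the diminishing step size, and the names `project_halfspace` and `random_projection_method` are illustrative choices, not the algorithms analyzed in Nedich2011 () or WangBerts ().

```python
import numpy as np

def project_halfspace(x, a, b):
    """Euclidean projection of x onto {y : <a, y> <= b}, with a nonzero."""
    violation = a @ x - b
    if violation <= 0.0:
        return x
    return x - (violation / (a @ a)) * a

def random_projection_method(grad_f, A, b, x0, num_iters=20000, seed=0):
    """Gradient step followed by projection onto one sampled half-space."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = A.shape[0]
    for k in range(num_iters):
        step = 1.0 / (k + 1)      # diminishing step size (an assumed choice)
        i = rng.integers(m)       # sample one constraint uniformly at random
        x = project_halfspace(x - step * grad_f(x), A[i], b[i])
    return x

# Hypothetical toy instance: minimize 0.5*||x - c||^2 s.t. x_1 <= 1, x_2 <= 1.
c = np.array([2.0, 2.0])
A = np.eye(2)
b = np.ones(2)
x_rp = random_projection_method(lambda x: x - c, A, b, x0=np.zeros(2))
```

Each iteration touches a single row of `A`, so the per-iteration cost does not grow with the number of constraints; the price is that the iterates are only asymptotically feasible.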

A special case of the problem (1) with is a feasibility problem, for which random sampling methods have been considered in Polyak2001 () for the case of the sets given by convex inequalities, and in CalafiorePolyak2001 () for a more specialized case of linear matrix inequalities. In Nedic-cdc-2010 (), a connection between the convergence properties of stochastic gradient methods and the existence of solutions for problem (1) has been studied, and a linear convergence rate has been established for some special cases of the constraint sets (such as those admitting easily computable Euclidean projections). Algorithms with linear convergence to a solution of feasibility problems defined by a system of linear equations and inequalities have been considered in Leventhal (); Strohmer (). An iterated randomized projection scheme for systems of linear equations is proposed in Strohmer (), which is a randomized variant of Kaczmarz’s method. This variant employs a single projection per iteration and is shown to converge at a linear rate that does not depend on the number of equations but, instead, depends on the condition number associated with the linear system of equations.
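A minimal sketch of a randomized Kaczmarz iteration for a consistent linear system may help make the single-projection idea concrete. The sampling rule (rows drawn with probability proportional to their squared norms), the random test data, and the name `randomized_kaczmarz` are assumptions made for this illustration.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=2000, seed=0):
    """Randomized Kaczmarz sketch for a consistent linear system A x = b."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.sum(A * A, axis=1)
    # Assumed sampling rule: probability proportional to the squared row norm.
    probs = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        # Project x onto the hyperplane {y : <a_i, y> = b_i}.
        x = x + ((b[i] - A[i] @ x) / row_norms_sq[i]) * A[i]
    return x

# Hypothetical test data: a random overdetermined but consistent system.
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((50, 5))
x_true = data_rng.standard_normal(5)
b = A @ x_true
x_rk = randomized_kaczmarz(A, b)
```

Only one row of `A` is read per iteration, which is the property that makes such schemes attractive when the number of equations is large.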

A possible reformulation of problem (1) is through the use of the indicator functions of the constraint sets, resulting in the following unconstrained problem

(3)

where is the indicator function of the set (taking the value 0 at the points of the set and the value $+\infty$ otherwise). The advantage of this reformulation is that the objective function is the sum of convex functions and incremental methods can be employed that compute only a (sub)-gradient of one of the component functions at each iteration. The traditional incremental methods do not have memory, and their origin can be traced back to the work of Kibardin Kibardin (). They have been studied for smooth least-square problems Ber96 (); Ber97 (); Luo91 (), for training neural networks Gri94 (); Gri00 (); LuT94 (), for smooth convex problems Sol98 (); Tse98 () and for non-smooth convex problems GGM06 (); HeD09 (); JRJ09 (); Kiw04 (); NeB00 (); NeB01 (); NeBBor01 (); Wright08 () (see BertsekasPenalty () for a more comprehensive survey of these methods). These traditional memoryless incremental methods (randomized and deterministic), while simple to implement to solve problem (3), cannot achieve the optimal convergence rate even when is smooth and strongly convex. This is due to the non-smoothness of the indicator functions and the errors that are accumulated during the incremental processing of the functions in the sum.

Reformulation (3) has been considered in Kundu2017 () as a departure point toward an exact penalty reformulation using the set-distance functions, thus yielding a penalized problem of the following form:

(4)

where

with being some norm in and being the distance function to a set . This exact penalty formulation has been motivated by a simple exact penalty model proposed in Bertsekas2011 () (using only the set-distance functions) and a more general penalty model considered in BertsekasPenalty (). In Kundu2017 (), a lower bound on the penalty level has been identified guaranteeing that the optimal solutions of the penalized problem are also optimal solutions of the original problem (3). However, the approaches proposed in Kundu2017 () do not utilize incremental processing; rather, they use a full (sub)-gradient of the objective function in (4).

Unlike Kundu2017 (), our objective in this paper is to consider a penalty-based reformulation of problem (1) (with linear constraints) that will allow us to take advantage of the penalized problem structure for the use of incremental methods. In order to achieve the optimal convergence rates, we would like to depart from the traditional incremental methods. In particular, we would like to have a penalty reformulation of problem (1) that will enable us to employ one of the recently developed fast incremental algorithms. These algorithms are designed to solve optimization problems involving a large sum of functions saga (); svrg (); SAG (), which arise in machine learning applications. Unlike the traditional incremental methods that are memoryless, these fast incremental algorithms require storage of the past (sub)-gradients. Typically, they require storing the same number of (sub)-gradients as the number of the component functions in the objective. The stored information is effectively used to control the error due to the incremental processing of the functions, which in turn allows these algorithms to achieve optimal convergence rates. A drawback of the fast incremental algorithms, such as SAGA and its various modifications Katusha (); finito (); svrg (); scda (); accSAGA (), is that they are not designed to efficiently handle a possibly large number of constraints. At most, these algorithms allow us to deal with so-called composite optimization problems, where the composite term corresponds to a regularization function promoting some special properties of model parameters and has a simple structure for determining the proximal point saga ().

Our focus is on problem (1) with linear constraints,

where and for all . Our objective is to develop a penalty model for this problem that will allow us to implement fast incremental methods saga (); svrg (); SAG () to solve the resulting unconstrained penalized problem. In order to do so, we will develop a smooth penalty framework motivated by the approach in BertsekasPenalty (), and provide the relations for the solutions of problem (7) and the solutions of the corresponding penalized problem. We consider a penalized reformulation of problem (7) in the following form:

(5)
(6)

where the function is a smooth penalty function associated with a linear inequality constraint , while and are the penalty parameters. The penalty parameters control the slope and the curvature of the penalty function. The novelty is in the use of an inexact smooth penalty function that has Lipschitz continuous gradients and is not related to the squared set-distance function, in contrast to the inexact distance-based smooth penalties considered in Siedlecki (). This is also in contrast with the use of non-smooth exact penalty functions in BertsekasPenalty (). A key property of our penalty framework is its accuracy guarantee, as follows: For a given accuracy , we show that there exists a range of values for parameters and such that any optimal solution of the penalized problem (5) is feasible for the original linearly constrained problem. Moreover, we provide estimates that characterize sub-optimality of the solutions of the penalized problem, i.e., we show that the solutions are located within the -neighborhood of the solutions of the original constrained problem.

These properties of the penalized problem allow us to apply any fast incremental method saga (); svrg (); SAG (). We will employ SAGA to solve the smooth penalized problem and obtain a suboptimal point at the sublinear rate in the case of a smooth convex function , and at the linear rate , with , in the case of a smooth strongly convex .
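As a rough illustration of how a SAGA-type method could be run on a smooth penalized problem, the sketch below uses a toy instance. Everything specific in it is an assumption made for the example: the one-sided Huber derivative `p_prime` (zero below `-delta`, linear on `[-delta, delta]`, one above `delta`), the splitting of the objective into components `f + gamma * h_i` averaged over the constraints, the step size `1/(3L)`, and the data `c`, `A`, `b`. The names `gamma` and `delta` stand for the two penalty parameters.

```python
import numpy as np

def p_prime(s, delta):
    """Derivative of the assumed one-sided Huber function of the scalar s."""
    return np.clip((s + delta) / (2.0 * delta), 0.0, 1.0)

def saga_penalized(grad_f, A, b, gamma, delta, lipschitz_f, x0,
                   num_iters=20000, seed=0):
    """SAGA sketch on F(x) = (1/m) * sum_i [f(x) + gamma * h_i(x)]."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    norms = np.linalg.norm(A, axis=1)

    def grad_component(x, i):
        # Gradient of the i-th component f(x) + gamma * h_i(x).
        slope = p_prime(A[i] @ x - b[i], delta)
        return grad_f(x) + gamma * slope * A[i] / norms[i]

    # Gradient Lipschitz constant of a component under the assumed penalty.
    L = lipschitz_f + gamma * norms.max() / (2.0 * delta)
    step = 1.0 / (3.0 * L)

    x = np.asarray(x0, dtype=float)
    table = np.array([grad_component(x, i) for i in range(m)])  # stored gradients
    table_mean = table.mean(axis=0)
    for _ in range(num_iters):
        j = rng.integers(m)                   # sample one constraint
        g_new = grad_component(x, j)
        x = x - step * (g_new - table[j] + table_mean)
        table_mean = table_mean + (g_new - table[j]) / m
        table[j] = g_new
    return x

# Hypothetical toy instance: f(x) = 0.5*||x - c||^2 with x_1 <= 1 and x_2 <= 1.
c = np.array([2.0, 2.0])
A = np.eye(2)
b = np.ones(2)
x_saga = saga_penalized(lambda x: x - c, A, b, gamma=8.0, delta=0.1,
                        lipschitz_f=1.0, x0=np.zeros(2))
```

For this toy instance and the assumed penalty form, the stationarity condition of the penalized objective gives each coordinate equal to 20/21 (about 0.952), so the computed point is feasible and lies slightly inside the boundary, while the constrained minimizer is (1, 1). The method stores one gradient per constraint, which is the memory cost mentioned for fast incremental algorithms.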

The paper is organized as follows. In Section 2, we formulate the penalized problem, establish some properties of the chosen penalty function and provide some elementary relations between the penalized problem and the original constrained problem. In Section 3, we investigate the relation between the solutions of the original problem and its penalized variant. In Section 4, we consider applying an existing fast incremental method, namely SAGA, for solving the penalized problem. In Section 5, we provide some numerical results to illustrate the performance of SAGA for the penalized problem in comparison with a method that uses random projections, as proposed in Nedich2011 (). We conclude the paper in Section 6.

2 Penalized Problem and its Properties

We consider the following optimization problem:

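\begin{align}
\text{minimize}\quad & f(x) \tag{7}\\
\text{subject to}\quad & \langle a_i, x\rangle - b_i \le 0, \quad i=1,\ldots,m, \tag{8}
\end{align}

where the vectors $a_i$, $i=1,\ldots,m$, are nonzero. We will assume that the problem is feasible. Associated with problem (7), we consider a penalized problem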

(9)
(10)

where

(11)

Here, and are penalty parameters. The vectors and scalars are the same as those characterizing the constraints in problem (7). For a given nonzero vector and , the penalty function is given by (see also Figure 1)

(12)

For any , the function satisfies the following relations:

(13)
(14)
(15)

Figure 1: Penalty functions for the constraint , , with .

Observe that can be viewed as a composition of a scalar function

(16)

with a linear function , which is scaled by . In particular, we have

(17)

The function is convex on for any . Thus, the function is convex on , implying that the objective function (11) of the penalized problem (9) is convex over for any and .
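The composition structure described above can be sketched in a few lines. The code assumes a specific one-sided Huber form for the scalar function: zero for `s < -delta`, the quadratic `(s + delta)**2 / (4 * delta)` on `[-delta, delta]`, and `s` for `s > delta`, with the penalty obtained by applying it to the constraint residual and scaling by the inverse norm of the constraint vector. This form and the names `p_delta` and `h_delta` are assumptions chosen to match the properties stated in the text (smoothness, convexity, and a non-zero penalty at feasible points near the boundary).

```python
import numpy as np

def p_delta(s, delta):
    """Assumed one-sided Huber function: zero, then quadratic, then linear."""
    s = np.asarray(s, dtype=float)
    quad = (s + delta) ** 2 / (4.0 * delta)
    return np.where(s < -delta, 0.0, np.where(s > delta, s, quad))

def h_delta(x, a, b, delta):
    """Assumed penalty for the constraint <a, x> <= b, scaled by 1/||a||."""
    return p_delta(a @ x - b, delta) / np.linalg.norm(a)

# Hypothetical data: the constraint 3*x_1 + 4*x_2 <= 5 with delta = 1.
a, b, delta = np.array([3.0, 4.0]), 5.0, 1.0
far_inside = h_delta(np.array([0.0, 0.0]), a, b, delta)    # residual -5
on_boundary = h_delta(np.array([0.6, 0.8]), a, b, delta)   # residual 0
outside = h_delta(np.array([3.0, 4.0]), a, b, delta)       # residual 20
```

Under these assumptions, the penalty vanishes deep inside the feasible set, equals `delta / (4 * ||a||) = 0.05` on the boundary, and coincides with the distance to the half-space (here 4) once the residual exceeds `delta`.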

Furthermore, observe that the function is twice differentiable for any , with the second derivative given by

Thus, the function has Lipschitz continuous derivatives with constant . Then, the function is differentiable for any and its gradient is given by

(18)

which is Lipschitz continuous with a constant ,

(19)

In view of the definition of the penalty function in (11) and relation (18), we can see that the magnitude of the “slope” of the penalty function is controlled by the parameter , while the ratio of the parameters and is controlling the “curvature” of the penalty function.
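The gradient formula and its Lipschitz bound can be sanity-checked numerically. The sketch below again assumes the one-sided Huber form (quadratic `(s + delta)**2 / (4 * delta)` on `[-delta, delta]`), for which the gradient is a clipped linear function of the residual times the normalized constraint vector, and the candidate Lipschitz constant is `||a|| / (2 * delta)`. The data and the names `h_delta` and `grad_h_delta` are hypothetical.

```python
import numpy as np

def h_delta(x, a, b, delta):
    """Assumed one-sided Huber penalty for the constraint <a, x> <= b."""
    s = a @ x - b
    if s < -delta:
        value = 0.0
    elif s > delta:
        value = s
    else:
        value = (s + delta) ** 2 / (4.0 * delta)
    return value / np.linalg.norm(a)

def grad_h_delta(x, a, b, delta):
    """Gradient of the assumed penalty: clipped slope times a / ||a||."""
    s = a @ x - b
    slope = min(max((s + delta) / (2.0 * delta), 0.0), 1.0)
    return slope * a / np.linalg.norm(a)

rng = np.random.default_rng(0)
a, b, delta = np.array([3.0, -4.0]), 1.0, 0.5
lipschitz_bound = np.linalg.norm(a) / (2.0 * delta)

eps = 1e-6
max_fd_error = 0.0   # largest gap between finite differences and the formula
max_ratio = 0.0      # largest observed ||grad(x) - grad(y)|| / ||x - y||
for _ in range(1000):
    x = rng.uniform(-2.0, 2.0, size=2)
    y = rng.uniform(-2.0, 2.0, size=2)
    fd = np.array([(h_delta(x + eps * e, a, b, delta)
                    - h_delta(x - eps * e, a, b, delta)) / (2.0 * eps)
                   for e in np.eye(2)])
    gx, gy = grad_h_delta(x, a, b, delta), grad_h_delta(y, a, b, delta)
    max_fd_error = max(max_fd_error, np.linalg.norm(fd - gx))
    max_ratio = max(max_ratio, np.linalg.norm(gx - gy) / np.linalg.norm(x - y))
```

Halving `delta` doubles `lipschitz_bound`, which is one way to see how the second parameter governs the curvature of the penalty in this assumed form.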

Our choice of the penalty function is motivated by a desire to have the minimizers of the penalized problem (9) be feasible for the original problem (7). Note that the penalty function proposed above is a version of the one-sided Huber loss. Originally, the Huber loss functions were introduced in applications of robust regression models to make them less sensitive to outliers in data in comparison with the squared error loss Huber (). In contrast, we use this type of penalty function to smooth the exact penalties based on the distance to the sets proposed in BertsekasPenalty (). Furthermore, an appropriate choice of the parameter allows us to overcome the limitation of the smooth penalties based on the squared distances to the sets , which typically provide an infeasible solution (for the original problem), due to a small penalized value around an optimum lying close to the feasibility set boundary Siedlecki ().

In what follows, we let denote the (Euclidean) projection of a point on a convex closed set , so we have

The following lemma provides some additional properties of the penalty function that we will use later on. In fact, the lemma shows stronger results than what we will use, but the results may be of independent interest.

Lemma 1.

Given a nonzero vector and a scalar , consider the penalty function defined in (12) with . Let Then, we have for ,

and for any ,

Proof.

Given a vector , we have

so that

If , then the last two cases in the definition of reduce to when , corresponding to when . When , we have

To prove the monotonicity property, in view of relation (17), where is defined in (16), it suffices to show that the function has the monotonicity property, i.e., that we have for ,

To show this let . Note that, for and the functions and coincide, i.e.,

When we have

Next, consider the case when . Let be fixed and we view the function as a function of . For the partial derivative with respect to , we have

where the inequality follows by . Thus, is non-decreasing in , implying that . Since was arbitrary, it follows that

Finally, let , in which case we have

where the inequality is obtained by using valid for any .  
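The monotonicity argument can be checked numerically on a grid. The sketch assumes the one-sided Huber form of the scalar function used for illustration (quadratic `(s + delta)**2 / (4 * delta)` on `[-delta, delta]`, zero to the left, linear to the right); the grid, the list of `delta` values, and the name `p_delta` are hypothetical choices.

```python
import numpy as np

def p_delta(s, delta):
    """Assumed one-sided Huber function of the scalar residual s."""
    s = np.asarray(s, dtype=float)
    quad = (s + delta) ** 2 / (4.0 * delta)
    return np.where(s < -delta, 0.0, np.where(s > delta, s, quad))

s_grid = np.linspace(-3.0, 3.0, 6001)
deltas = [0.1, 0.5, 1.0, 2.0]                      # increasing order
values = np.array([p_delta(s_grid, d) for d in deltas])

# Rows are ordered by increasing delta, so monotonicity in delta means that
# consecutive row differences are non-negative at every grid point.
min_gap = np.min(np.diff(values, axis=0))

# Gap to the non-smooth function max(0, s); for the assumed form it lies
# between 0 and delta/4, with the maximum attained at s = 0.
exact = np.maximum(0.0, s_grid)
smoothing_gaps = values - exact
```

In this assumed form, the smoothed function never falls below `max(0, s)`, so increasing `delta` can only increase the penalty at a fixed point.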

In view of Lemma 1, for the function in (11) we obtain for any and any ,

This relation implies an inclusion relation for the level sets of the functions and , as given by the following corollary.

Corollary 1.

For any and for any , we have

In particular, if the function has bounded level sets, then the functions also have bounded level sets for any and .

While Corollary 1 shows some inclusion relations for the level sets of and , for the same value , it will be important in our analysis to identify a value of for which these level sets are nonempty. The following corollary shows that choosing , for any feasible , can be used to construct non-empty level sets.

Corollary 2.

Let and be arbitrary, and let be a feasible point for the original problem (7). Then, for the scalar defined by

the level set is nonempty and the solution set of the penalized problem (9) is contained in the level set .

Proof.

Let and be arbitrary, and be any feasible point for the original problem. Since is feasible, by relation (14), we have

Therefore,

implying that belongs to the level set . Noting that

by Corollary 1, we obtain

 

In Corollary 2, the solution set of the penalized problem (9) may be empty. In the next section, we will consider the cases when the solution sets are nonempty for both the original and the penalized problems.

3 Relations for Penalized Problem and Original Problem Solutions

In what follows, we establish some important relations between the solutions of the penalized problem and the original problem. A special property of the linear constraint set, which is valid when the constraint set of problem (7) has a nonempty interior, plays a key role in the analysis. To provide this property, we let be the set defined by the $i$th inequality in the constraint set of problem (7), i.e.,

and we define the set as the intersection of these sets

We make the following assumption on the interior of the set .

Assumption 1.

The interior of the set is not empty, i.e., there is a point such that for some ,

We next provide a lemma that will be important for our analysis of solution feasibility of the penalized problem. In this lemma and later on, we use the following notation

(20)

Moreover, conditions for solution feasibility of the penalized problem involve a constant from Hoffman’s lemma Hoffman () stating that for the sets

(21)

Lemma 2.

Let Assumption 1 hold and let be a positive constant such that , where is defined by Assumption 1. Then, for any there exists a feasible point such that

where is defined in (20) and is Hoffman’s constant defined in (21).

Proof.

Let and consider the perturbed set , which is obtained by perturbing the inequalities by amount of toward the interior of (see Figure 2), i.e.,

Assumption 1 and the condition imply that .

Figure 2: Illustration of the set .

Let us define

By the definition of , we have for all . Hence, taking into account the definition of the penalty functions , (see (12)), we obtain

thus showing the relation in part (a).

To estimate the distance , let us consider an intermediate point obtained by projecting on and by projecting the resulting point on the set . Since is the closest point in the set to ,

(22)

Next, note that the constant in Hoffman’s lemma (see (21)) depends only on the vectors , (not on the values ). Thus, Hoffman’s result in (21) applies to the set with the same constant as for the set , which implies that

Therefore, according to the definition of , it follows that

Since , we have that for all . Hence,

From the preceding relation and relation (22) it follows that

thus establishing the result in part (b).  

We next turn our attention to the solution sets of the problems. We let and denote the solution sets of the original problem and the penalized problem, respectively, i.e.,

In our main result establishing that , under some conditions on the penalty parameters and , we will require that the function has uniformly bounded subgradients over a suitably defined region. If the constraint set is bounded, then the set can be taken as such a region and an upper bound for the subgradient norms can be defined by

where is the subdifferential set of at . If is unbounded, we identify a suitable region in the following lemma. In particular, the region should be large enough to contain the sets for a range of penalty values, and also the points from Lemma 2(b) for each .

Lemma 3.

Let Assumption 1 hold and let be a positive constant such that , where is defined by Assumption 1. Assume that has bounded level sets. Then, for all and satisfying for some , there is a ball centered at the origin that contains all the points with and the points satisfying Lemma 2(b) with . The radius of this ball depends on some feasible point , the given value of , the value from Assumption 1, and the problem characteristics reflected in the constants , and from Hoffman’s result (see (21)).

Proof.

Since has bounded level sets, by Corollary 1, the functions also have bounded level sets for all and . Hence, the solution set is nonempty and, also, the solution sets are nonempty for all and . We next employ Corollary 2 to construct a compact set that contains the optimal sets for a range of values of these penalty parameters.

To start, we choose some feasible point and, by Corollary 2, we obtain

where

Under the assumption that for some , we have where (see (20) for the definition of ). Thus, we consider the level set

which is bounded by the assumption that has bounded level sets. Furthermore,

Hence, these optimal sets are uniformly bounded, i.e., for some ,

Since the projection operator is non-expansive, the projections of the points in the set on the set are also bounded, i.e., for some ,

(23)

Finally, for each , consider a point as given in Lemma 2. Then, by Lemma 2(b) for each , it follows that

where we use the assumption that . Thus, for each , the point from Lemma 2(b) satisfies the following relation

where

In view of (23), the ball centered at the origin with the radius also contains for all and for all and , with and . Since , we see that the constant depends on the choice of the feasible point , the given value of , the value from Assumption 1, and the problem characteristics reflected in the constants , and from Hoffman’s result (see (21)).  

In what follows, we will let denote the radius of the ball identified in Lemma 3, and suppress the dependence on the other parameters. We define

(24)

With Lemma 2 and Lemma 3, we are ready to provide a key relation for the solutions of the penalized problem and the original problem. Specifically, we show that for sufficiently small values of the penalty , the solutions of the penalized problem are feasible for the original problem.

Proposition 1.

Let be a given accuracy parameter. Let Assumption 1 hold and assume that has bounded level sets. Let the parameters and be chosen such that

with

where is arbitrary, is the constant from Assumption 1, is the constant from Hoffman’s bound (see (21)), the scalars and are defined in (20), while is defined by (24). Then, every point in the solution set of the penalized problem is feasible for the problem (7), namely .

Proof.

Since has bounded level sets, the solution set and the solution sets are nonempty for all and . To arrive at a contradiction, let us assume that there exists some and satisfying the conditions in the proposition and such that . Thus, there exists a solution and . Define

We consider two possibilities: and .

Case 1: . By Lemma 1 we have that for all . Thus, by the definition of the functions , for any we can write

Then, by Hoffman’s lemma (see (21)), for some we have

Letting in the preceding relation, we obtain

where in the second inequality we use the assumption that the norms of the subgradients in the subdifferential set are bounded by in a region containing the point (see Lemma 3 and (24)). Taking into account that when (see inequality (14) and the definition of the set ) and using , we see that

Note that the condition and the definition of imply . Using the relations and , which we assumed, we further obtain

(25)

where the last inequality is obtained by using , which is equivalent to

The last inequality holds due to the conditions that we imposed on the parameters and , namely, that and . Thus, relation (25) implies that , which contradicts the fact that is an unconstrained minimizer of .

Case 2: . Since , under Assumption 1 and the condition , we can apply Lemma 2 with . According to Lemma 2, there exists a feasible point such that

(26)

and

(27)

Using the point , we have

(28)