Fundamentals of cone regression

# Fundamentals of cone regression

## 1Introduction

Cone regression analysis is a valuable alternative to more traditional parametric-regression models, in all cases where the functional relationships between the response (dependent) and the explanatory (independent) variables is unknown and nonlinear and the constraints are a set of linear inequalities. Several important statistical problems including isotonic, concave and constrained spline regression, or ANOVA under partial orderings can be seen as particular instances of the more general cone regression problem. Cone regression admits several formulation approaches and implementation strategies, whose choice severely impacts numerical performances. However, due to the little exposure to the topics of optimization theory in modern-day Statistics, many optimization and numerical approaches are commonly ignored by statisticians. This paper is a contribution to fill this gap in the literature, addressing the fundamentals of cone regression from a theoretical and practical point of view. With the goal of going in deep with comparisons and numerical issues, we focus in particular on the concave regression problem. In spite of its theoretical simplicity, since the number of constraints increases linearly with the data size, concave regression offers a good basis to discuss the fundamentals of cone regression and related numerical issues.

The problem of concave regression is to estimate a regression function subject to concavity constraints represented by a set of linear inequalities. Brought to the attention of the scientific community by micro-economists interested in estimating production function [22], the problem of concave regression arises not only in the field of micro-economy (indirect utility, production or cost functions, Laffer curve) but also in medicine (dose response experiments) and biology (growth curves, hazard and failure rate in survival analysis). First addressed by Hildreth in 1954 [22], the search for efficient methods for solving large concave regression problems is still an open issue nowadays. This may appear quite surprising considering the noticeable advances in convex optimization since then, but it can be probably understood when considering that most efforts have been devoted to theoretical issues such as generalizations and convergence while comparatively little attention has been paid to the issues of efficiency and numerical performances in practice [38].

In this paper, we formulate the cone regression problem by different optimization approaches, we highlight similarities and difference between the various algorithms passed in review, we propose several improvements to enhance stability and to bound the computational cost and we estimate the expected performance of available algorithms, establishing in particular which is the most competitive technique for solving large instances of the problem. Finally, in the light of this study, we give recommendations for further research.

In Section 2, we state formally the problem of cone regression also introducing some basic notations and results that will be used thoroughly. In Section 3 we survey the state of the art distinguishing between the class of algorithms with asymptotic convergence and the class of algorithms with time finite convergence. In Section 6 we make a numerical comparison of performances and finally, in Section 7, we draw some concluding remarks.

## 2Statement of the problem, basic notations and basic facts

The aim of a regression analysis is to produce a reasonable analysis to the unknown response function , which can be modeled as

where is the explanatory (dependent) variable, is the response (independent) random variable, and is an error term, which is usually assumed to be a mean zero random variable. Typically, one has observations on and for selected values of . For each level of input, say , there may be several trials and corresponding observations of output . Let be the number of trials at level of input and let be the observed output for the trial at this level. Than we have

Inference about the response function may be drawn by assuming that the function can be approximated by some given algebraic form with several unknown parameters to be estimated from the data. However, the difficulty with this procedure is that the inferences often depend critically upon the algebraic form chosen. Alternatively, one may know some properties of the relation being studied but does not have sufficient information to put the relation into any simple parametric form. In this case, a nonparametric approach is more appropiated. Let be the expected value of output at input level :

Estimates of can be derived by the method of maximum likelihood, or by the method of least squares or other formulations. If there were no a priori restriction on , than the maximum likelihood estimation of would just be the mean of observed output for the level of input , that is

Instead, since in the cone regression problem the known property of the regression function can be expressed by a set of linear inequalities, to obtain the maximum likelihood estimates, the likelihood function should be maximized subject to the linear inequality constraints.

Formally, given a dataset of dependent variables represented by the vectors , corresponding to the independent variable values , the problem of cone regression is to estimate the closest function to the dataset via a least squares regression subject to a set of linear inequality constraints by solving

Denoting by those vectors that satisfy the linear inequality constraints for a fixed , then is a closed convex set in and the feasibility set can be written as the nonempty intersection of a family of closed subsets . Being the intersection of closed convex sets, the set is also a closed convex set. More precisely, since each is an half-space which contains the origin, the feasibility set is a convex polyhedral cone. In matrix form, can be written as . In the case of concave regression with is a matrix such that each row represents a concave two-piece linear function with a negative second difference at only and the linear inequalities are as follows

In the following, we give alternative formulations of the cone regression problem that rest on optimization theory.

### 2.1Convex quadratic programming (CQP) formulation

#### Primal formulation

The problem (Equation 2) is to find the point in the cone that is closest to . The solution is found at the orthogonal projection of onto , written as using the metric , represented by the symmetric positive definite matrix .

For the problem (Equation 4), the matrix is diagonal with element on the diagonal. In practice, if is the mean value measured at , than corresponds to the size of the sample at . Since is a closed, convex and non empy set on the Hilbert space , it is a set of Chebyshev, that is the projection exists and it is unique.

#### Dual formulation

The dual formulation of problem (Equation 4) rests on the Moreau decomposition theorem [33], which is a generalization in convex analysis of the orthogonal projection theorem for vectorial sub-spaces. Central to the Moreau decomposition theorem is the definition of polar cone of a given convex cone , which is given below.

The Moreau decomposition theorem is as follows.

By relying on this theorem we can alternatively solve problem (Equation 4) by first finding the projection on the polar cone and then computing the solution to the primal problem as the difference (see Figure 1). This alternative is attracting since, as it will be clarified below, it implies an analytically simpler form for the constraints.

Before stating an important Lemma about the relationship between the polar cone and the constraint matrix , let us introduce the definition of edges of a polyhedral convex cone.

Intuitively speaking, the edges of a polyhedral convex cone are one-dimensional rays, which always passe through a fixed point (the vertex).

To see that, observe that is polar to since

,: , which is the definition of polar cone of . Conversely, is polar to since: .

By relying on this results Khun-Tucker [27] proved the following theorem:

As it can be observed, in the dual formulation each element of the vector must satisfy a single positivity constraint.

Goldman [18] noticed that dual problem can be also view as a minimum distance problem in the same parameter space as the primal problem.

where is a rotation of the dual cone with its vertex translated to . also solves the re-parametrized dual problem.

### 2.2Linear complementarity problem (LCP) formulation

The CQP (Equation 4) can be recasted as a linear complementarity problem (LCP). To see that, let us consider the Lagrangian associated to problem (Equation 4).

where is the vector of dual variables associated to each of the convexity constraints. By applying the Karush–Kuhn–Tucker (KKT) optimality conditions [27] to (Equation 6), that is

we obtain the equivalent LCP

where , and . Note that by dropping the constant term from the Lagrangian and dividing it by : . Therefore: . By taking the transpose and multiplying for : . This LCP has a unique complementary solution. Denoting by its solution, is the optimal solution of the dual problem ( ?).

The condition is called complementarity condition and the way in which it is dealt with determines if the optimization algorithm belongs to the class of interior point methods that will be introduced in section ? or to the class of active set methods that will be detailed in Section 3.2.

### 2.3Proximal formulation

The CQP (Equation 4) can be solved by using a proximity operator [34]. Proximity operators are used to solve problems of the form

where are convex functions from to , that are not necessarily differentiable. Each is treated through its proximity operator which is defined as follows.

The proximity operator is characterized by the property

where is the subdifferential of .

The proximity operator of a convex function is a generalization of the projection operator onto a closed convex set . To see that let us consider the indicator function of

By using the fact that minimizing over is equivalent to minimizing over

it results that .

The solution to problem (Equation 4) can be therefore understood as the proximity operator over

Alternatively, using the fact that , the dual problem (Equation 4) can be seen as the proximity operator over the sum of indicator functions of the convex sets :

Intuitively, the base operation of a proximal algorithm is evaluating the proximal operator of a function, which itself involves solving a small convex optimization problem. These subproblems often admit closed form solutions or can be solved very quickly with standard or simple specialized methods.

## 3State of the art

In this section, we review existing algorithms for solving the cone regression problem. The algorithmic approaches, and in turn their numerical performances, strongly depend on the choice of the problem formulation. All existing methods are iterative and they can attain the optimal solution since is a Chebyshev set and therefore the optimal solution must exist in this closed set. However, in terms of their numerical perfomances they can be classified into two broad classes: the class of methods never or in very simple cases attain the optimal solution [22] and those of methods that converge to the optimal solution in a finite number of steps [49]. As it will be clarified in the following, methods with asymptotic convergence rest on the properties of the sub-gradient or more in general of proximity operators and act by finding the solution as the limit of a sequence of successive approximations. They are typically derived from the primal, the dual or from the proximal formulation. Methods with finite-time convergence exploit the geometric properties of polyhedral convex cones and find the exact solution as non-negative linear combination of functions, forming a basis in a specified finite dimensional space. They are typically derived from the linear complementarity problem formulation.

### 3.1Algorithms with asymptotic convergence

This section includes algorithms based on the primal formulation such as Least Squares in a Product Space (section ?, algorithms based on the dual formulation such as the Uzawa’s method (section ?) and Hildret’s method (section ?), algorithms that solve the dual problem simultaneously with the primal problem such as the Dykstra’s alternating projection method (section ?) and algorithms based on the proximal formulation such as Alternating Direction Method of Multipliers (section ?).

#### Least squares in a product space (LSPS)

Since the Euclidean space equipped with the dot product is an Hilbert space , the problem (Equation 4) can be recasted in the fold Cartesian product of , say [39]. Let be the Cartesian product of the sets , i.e., the closed convex set and let be the diagonal vector subspace, i.e. . Then, the CQP (Equation 4) is equivalent to

where . Using this strategy, the problem of projecting onto the intersection of convex sets is reduced to the problem of projecting, in an higher dimensional space, onto only two convex sets, one of which is a simple vector subspace. Geometrically, this is equivalent to find a point in which is at minimum distance from . This point can be obtained iteratively by

The advantage of this strategy is in that it allows to speed up the convergence since a bigger (than 2, which is the upper bound for Féjer sequences [15]) relaxation interval can be allowed.

#### Uzawa method

A classical method to solve a convex minimization problem subject to inequality constraints is the Uzawa method [2], which search directly for the saddle point of the Lagrangian (Equation 6). In fact, if the Lagrangian admits a saddle point, say , then the duality gap is null and is a critical point of the Lagrangian. Since the dual function is differentiable, it can be minimized explicitly by using the gradient descent method. Therefore the Uzawa method alternates a minimization step over with respect to with fixed and a maximization step with respect to onto , with fixed. The algorithmic parameter can be fixed to optimize convergence by relying on theoretical considerations. Therefore the CQP ( ?) is equivalent to find

#### Hildreth’s algorithm

Hildreth [22] proposed to apply the Gauss Seidel algorithm [25] to the dual problem ( ?). A single cycle of the Hildreth’s algorithm consists in updating each element of sequentially in an arbitrary fixed order. Therefore each cycle consists of steps, each of which corresponds to a projection onto the cone ,. The algorithm gives rise to a sequence of points, each of one differs from the preceding in exactly one coordinate. At the cycle , the is used in the estimation of the point so that the best available estimations are used for computing each variable. The convergence of the Gauss Seidel algorithm is guaranteed only if the matrix is full row rank, so that there are not redundancies among the inequality restrictions, and it is guaranteed independently of the initial point only if is positive definite and symmetric. The algorithm is sensitive to the normalization as well as to the order of the projections.

#### Primal-dual interior point methods

First introduced by Karmakar in 1984 [26], primal-dual interior point methods act by perturbing the complementarity condition in the LCP formulation Equation 8 and replacing with . The partition of vectors and into zero and nonzero elements is gradually revealed as the algorithm progresses by forcing a reduction of . All iterates satisfy the inequality constraints strictly. The solution is approached from either the interior or exterior of the feasible region but never lie on the boundary of this region. Let the function be such that the roots of this function are solutions to the first and the last optimality conditions in Equation 8.

The perturbed complementarity condition introduces a nonlinearity, therefore for each fixed a system of nonlinear equations should be solved. The nonlinear system is typically solved by using a Newton-like algorithm [5]. Each iteration of the Newton’s method finds a search direction from the current iterate and it is computationally expensive but can make significant progress towards the solution. For instance in barrier methods, which are the most efficent of the family, this is achieved by using a penalizing term, called barrier function, for violations of constraints whose value on a point increases to infinity as the point approaches the boundary of the feasible region. Interior point methods must be initialized at an interior point, or else the barrier function is undefined. The interested reader is referred to [46] for further information about interior point methods.

#### Dykstra’s algorithm

In , Dykstra [13] proposed a generalization of the Hildreth’s procedure applicable to the case of constraints corresponding to more general convex cones than polyhedral convex ones. The Dykstra’s algorithm is based on the idea, before suggested by Von Neumann [48] to the case of subspaces, of computing the projection onto the intersection of convex sets by relying on the solution of the simpler problem of projecting onto the individual sets. In the case of concave regression the projection onto a single convex set involves only three points and, if the constraint is not satisfied, it corresponds to the straight line fitting the points ,,.

Dykstra’s algorithm iterates by passing sequentially over the individual sets and projects onto each one a deflected version of the previous iterate. More precisely, before projecting onto the cone during the th cycle, the residuum obtained when projecting onto at the previous th cycle, say is removed and a new residuum associated to the cone , say is computed after the projection. In practice, each is the projection of onto , where .

If each is a subspace, then at each new cycle the residuum of the projection onto each convex cone is of course the projection of onto the cone . Therefore, the Dykstra’s procedure for subspaces reduces to exactly the cyclic, iterated projections of von Neumann. In this case, for , the sum of the residua over the cones approximates the projection of onto the polar cone and therefore, for the Moreau decomposition theorem, approximates the projection onto .

However, if the are not subspaces, is not a linear operator and then the von Neumann algorithm does not necessarily converge.

The Dykstra’s algorithm can also be interpreted as a variation of the Douglas–Rachford splitting method applied to the dual proximal formulation (Equation 10).

The seminal works of Hildreth and Dykstra have inspired many studies mostly devoted to theoretical investigations about their behavior in a Hilbert space [7], about its convergence [24], about its relation to other methods [17] and about its interpretation in more general frameworks, such as the proximal splitting methods [4]. Han [20], as well as Iusem and De Pierro [24], showed that in the polyhedral case, the method of Dysktra becomes the Hildreth’s algorithm and therefore it has the same geometric interpretation of Gauss Seidel to the dual problem. Gaffke and Mathar (1989) [17] showed the relation of the Dysktra algorithm to the method of component-wise cyclic minimization over a product space, also proposing a fully simultaneous Dykstra algorithm. The only works devoted to give some insight about a more efficient implementation are the ones of Ruud. Goldman and Ruud [18]() generalized the method of Hildreth showing that there is not need to restrict the iterations to one element of at a time: one can optimize over subsets or/and change the order in which the elements are taken. This observation is important for the speed of convergence since the slow speed can be understood as a symptom of near multicollinearity among restrictions. Because the intermediate projections are so close to one another, the algorithm makes small incremental steps towards the solution. They also remarked that Dykstra uses a parametrization in the primal parameter space, that causes numerical round off errors in the variable residuum. These round off errors cumulate so that the fitted value does not satisfy the constraints of the dual problem. It would be better to use a parametrization on the dual so that the contraints in the dual would be satisfied at each iteration. Later, Ruud [44] proved that the contraction property of the proposed generalizations rests solely on the requirement that every constraints appears in at least one subproblem of an iteration. As one approach the solution, constraints that are satisfied at the solution are eliminated. To remove satisfied constraints would accelerate the Hildreth procedure. The authors propose to reduce the set of active constraints, that is the constraints satisfied as an equation at the corresponding points, by removing as many constraints as possible through periodic optimization over all positive elements of .

#### Alternating Direction Method of Multipliers (ADMM)

ADMM is an augmented Lagrangian technique [21] which can be applied to problems of the form

where the matrix is assumed to be irreducible () and the intersection of the relative interiors of the domains of the two functions is assumed to be not empty (). ADMM minimizes the augmented Lagrangian over the two variables of the problems, say and , first with fixed, then over with fixed, and then applying a proximal maximization step with respect to the Lagrange multiplier . The augmented Lagrangian of index is

where . Denoting by the proximal operator which maps a point to the unique minimizer of and denoting by the implementation detailed in the Appendix is obtained.

The ADMM method rests on the proximal formulation Equation 9. Indeed, it can be viewed as an application of the Douglas-Rachford splitting algorithm [14].

### 3.2Algorithms with time-finite convergence

All algorithms that will be reviewed in this section are active set methods resting on the LCP formulation Equation 8. Active set methods work by choosing a subset of indices such that is allowed to be non-zero and forcing the corresponding to be zero, while the remaining indices force to be zero and allow to take nonzero values. In this section we will review active set algorithm suach as the mixed primal-dual basis algorithm (section ?), the critical index algorithm (section ?) and the Meyer’s algorithm (section ?).

Before detailing the algorithms, we introduce some definitions and basic results about the geometry of polyhedral convex cones, on which are based the algorithms presented in this section. For further details the reader is referred to [45].

#### Properties of polyhedral convex cones with m≤n

Lemma ? establishes the relationship between the constraint matrix and the edges of the polar cone , namely . We would like now to determine the edges of the constraint cone .

Let the vectors vectors orthogonal to and orthonormal to each other so that the set forms a basis for . By defining the dual basis of the basis as the set of vectors that verify the relationship

the constraint cone can be equivalently written as

To see that let and . Then are the first coordinates of . Since by construction, by multiplying both members at left for and at right for , we obtain: . Therefore gives the negative coordinates of in the basis . Furthermore, points in have their first coordinates non-negative and can be written as , where for .

Taking into account Def. ?, Equation 15 established that the vectors are the edges of the constrain cone .

A polyhedral convex cone arises as the intersection of a finite number of half-spaces whose defining hyperplanes pass through the origin. The th row of is normal to the hyperplane generating the th closed half-space.

The following Lemma, proved by Rockafellar in 1970 [43], establishes a relationship between the first vectors of the dual basis and the faces of the cone , where denotes the subspace spanned by the edges of .

Denoting by and the projections of onto and respectively, being the two subspaces orthogonal, can be written as . Since can be easily computed as with , the problem (Equation 4) reduces to find the projection of onto .

The next Lemma, proved by Zarantonello [51] focuses on the set of points in projecting on a given point , say . Before stating it we need to define the support cone of a closed convex set at a given point.

Then any point in projects onto an unique point in and belong to a unique non-negative orthant, or sector

Figure 2 illustrates this result for . Points in project onto the subspace spanned by the vectors , that is on the face . Vectors belonging to project onto the origin, vectors belonging to project on themself, while each other vector of project onto an unique face of .

Therefore, if the sector containing the vector is known, then the projection of onto can be easily computed as projection of onto the subspace spanned by the . This reduces the problem of projecting onto to the problem of finding the set of indices such that the sector contains . The set complementary of with respect to corresponds to the indices of the constraints satisfied at equality in the optimal solution.

The algorithms described in this section propose different strategies to find the optimal set .

#### Early algorithms based on the properties of polyhedral convex cones

The first algorithm addressing the problem of projecting a point onto a polyhedral convex cone by a non-asymptotic procedure dates back to work of Wilhelmsen [49] in 1976. Wilhelmsen assume that the generators of the cone are known and propose an algorithm which compute a sequence of nearest points to in subcones of . Each subcone is chosen so that and is closer to than is . This means that is in the near side of the supporting hyperplane of with respect to . The key step is to find given and the proposed procedure to do that is laborious.

Pshenichny and Danilin (1978) [41] proposed an algorithm similar to the one of Wilhelmsen which also converges in a finite number of steps. In both algorithm can be any integer even larger than . A more efficient procedure, but with the more restrictive assumption that has been proposed by Fraser and Massam in 1989.

#### Mixed primal-dual basis algorithm

Fraser and Massam [16] proposed an iterative algorithm to solve the general problem of projecting a data point onto a polyhedral convex cone generated by linear inequality constraints.

Polyhedral convex cones generated by a number of inequalities at least equal to the dimension of the space they belong to have been the subject of section ?. As seen there, the problem of projecting a data point onto this class of cones can be reduced to find the set of edges, or generators of the cone, indexed by such that the sector contains the data point.

To this goal, the set of edges of the polar cone is completed by vectors orthogonal to and orthonormal to each other. In the case of concave regression , the set is completed by a constant function , where is the -dimensional unitary vector and by a linear function , where and . The set of vectors form a basis for .

Let the vectors be the dual basis of the basis as defined in (Equation 14). Fraser and Massam called the vectors and primal and dual vectors respectively. A primal-dual basis for , associated to the set of indices is a basis made up of a subset of the primal basis vectors and a complementary subset of the dual basis vector . For the primal basis vectors, corresponding to the edges of , are simply the columns of . Using the above definitions, the problem of projecting a point onto the cone can be formulated as follows.

Finding the sector containing is achieved moving along a fixed line joining an arbitrary chosen initial point inside the cone or on its boundary to the data point . By moving along a fixed line, many sectors are crossed: each time a sector is crossed the successive approximation is obtained by projecting the point on the line passing through and on the face of (see Lemma ?) so that the distance is decreasing in .

At each iteration each basis differs from another by one vector only and therefore the coordinates of onto the new basis are easy to calculate: what changes is that one coordinate becomes equal to zero and therefore there is not need of estimating the inverse of a matrix at each step. For this reason this algorithm is faster that the one of Wilhelmsen.

Points belonging to the sector have non-negative coordinates in the mixed primal-dual basis relative to the cone . Therefore the procedure terminates when the coordinates of the data point in the primal dual basis are all nonnegative, meaning that the point is on the face of the sector containing the data point .

The number of iterations needed is equal to the number of different sectors that the line joining the initial point to the data point has to cross. This number is bounded above by . It is worth to remark that crossing a sector corresponds to a pivot step in the equivalent LCP. In fact, the net effect of a pivot step is of moving from the point contained in the face of to the point contained in the face of .

Ten years later, Meyer [32] generalized the algorithm of Fraser and Massam to the case of more constraints than dimensions, that is when .

#### Critical index algorithm: Nearest point problem in simplicial cones

Murty and Fathy (1982) [37] considered the general problem of projecting a given vector onto a simplicial cone . The definition of simplicial cone and pos cone, on which the former definition is based are as follows.

For any point the vector , where is called combination vector corresponding to . Therefore the projection of onto can be expressed as a nonnegative linear combination of the edges of the cone: , where the optimal combination vector corresponding to is .

Murty and Fathy named the set of indices such that set of critical indices. Using the definitions of simplicial cone and combination vector, the original problem of projecting the point onto the cone can be formulated as follows.

This formulation has the same structure of the dual formulation ( ?), where the combination vector in ( ?) plays the same role as the dual variable in ( ?). The only difference is that in ( ?) the matrix is not squared as the matrix .

As shown in Section 2.2 for the dual quadratic formulation ( ?), the formulation ( ?) can be recasted as a LCP. This equivalency is important since the following theorem, proved in [37] applies to the LCP formulation of ( ?).

The fact that finding a critical index reduces the dimension of the problem can be argued geometrically. Let be a critical index, then denoting by the subproblem ( ?), where its solution is also the solution to the . Defining and , where , than is orthogonal to and the cone is the direct sum of the full line generated by and the simplicial cone .

Solving is an dimensional problem. If is the solution of the , then the solution to the is obtained as .

By relying on Theorem ?, the authors proposed an algorithm consisting of a subroutine to identify a critical index for the problem, followed by a subroutine which reduces the size of the problem once a critical index is found. Since the solution is the orthogonal projection onto the linear hull of , if is known, the solution of the equivalent and correspondingly the solution to can be easily found.

The routine to identify a critical index operates on the by exploiting the geometric properties of projection faces of a pos cone, whose definition is as follows.

In the following we enunciated some theorems on which the critical index algorithm is based. Their proofs can be found in [36] (chapter 7).

Theorem ? tells that the projection onto the cone belongs to the pos cone generated by the set of vectors corresponding to critical indices and that such pos cone is a projection face. Therefore, is the set of critical index is known, for Theorem ? the solution can be computed as projection onto the linear subspace spanned by vectors in .

The routine maintains a non empty subset of called the current set denoted by , and a point called the current point denoted by . At each stage of the routine . When termination occurs the routine either finds the nearest point in to in which case the problem is completely solved or it finds a critical index of the problem. In the latter case an LCP of order can be constructed and the same routine can be applied to this smaller problem. Hence the unique solution of the original problem can be obtained after at most applications of the routine which finds a critical index.

A characterization useful to find a critical index or the solution to the problem is provided by the following theorem.

A characterization of the optimal solution in terms of separating hyperplanes is given by Robertson et al. [42].

The routine to find a critical index alternates distance reduction operations with line-search and projection steps to find a projection face. In practice, the routine starts by projecting on the closest edge to the data point. If the optimality condition is not satisfied, than the procedure iteratively adds vectors to and updates the point while consistently reduces the distance between and . The distance reduction operation is carried out efficiently by projecting onto the nonnegative hull of two vectors in , the current point and a vector satisfying one of the conditions given by the following theorem.

Once such updates are not longer possibles, it employs a sequence of line-search steps and projections in the subspace spanned by the vectors in to find a projection face of the corresponding pos cone. This line-search is in the same spirit than the one proposed by Fraser and Massam since the goal is to reduce the distance to the data point while keeping at the interior of a pos cone. In the particular case of concave regression, for which , it can be implemented exactly in the same way.

This algorithm results to be much faster than the MPDB algorithm. The primary source of its computational efficiency is in that it relies mostly on distance reduction operations and size reduction steps whose computational requirement is relatively small compared to the computational effort required to find a projection face through a line-search.

Recently, Liu and Fathy (2011) [29] generalized the work of Murty and Fathy (1982) to polyhedral non-simplicial cones, hence allowing the set to contain more than vectors. What allows the generalization is the equivalence between the structure of the two problems through the concept of polar cone.

The authors also proposed several strategies for efficient implementation mostly based on the mathematical properties of the entities involved. We have incorporated, where possible, these strategies, to all algorithm tested for objective evaluation of performances.

#### Meyer’s algorithm

Meyer [31] considered the general problem of projecting a data point onto a polyhedral convex cone generated by a finite number of linear inequalities. The problem is reduced to find the set of indices , where is the number of linearly independent constraints, corresponding to not saturated constraints at the solution. Meyer called these indices hinges.

When , the algorithm can be applied to both the primal and the dual formulation, whereas for it is applied to the dual formulation. In the following, since for the problem of concave regression we consider how to solve the primal problem (Equation 4).

The search for is performed iteratively, starting with an initial guess by removing or adding one index at time, until the optimal solution is obtained. For each candidate two conditions are tested: the interior point condition and optimality condition.

The interior point condition is satisfied when the current iterate belongs to the interior of , that is when , where . By using the following theorem

can be computed as projection of onto the linear hull spanned by the vectors . If the feasibility condition is not satisfied, the index corresponding to the most negative coefficient is added to and the interior point condition is checked again.

Once the feasibility condition is satisfied, the optimality condition is tested by using the characterization given in Theorem ?. If it is not satisfied, the vector which most violates the condition is removed. The procedure continues until both conditions are satisfied.

Convergence is guaranteed by the fact that when the algorithm replaces just one edge, the Sum of the Squared Errors (SSE) after is less than the SSE before so that the algorithm never produces the same set of edges twice, which would result in an infinite loop.

In practice, each time an hinge is added, the best solution with hinges where the first hinges are already given is obtained. But this is not in general the best fit with hinges, so that some hinge may need to be changed. Therefore, the optimal solution can be interpreted as the best approximation with the biggest possible number of hinges.

## 4Issues about effectivness for large-scale problems

In this section, we discuss stenghts and limitations of the algorithms detailed above in solving large-scale instances of a particular kind of cone regression, the concave regression. In particular, we consider computational issues related to numerical stability, computational cost and memory load as well as the suitability to take advantage of available good estimates and to be implemented in an online fashion.

### 4.1Suitability to take advantage of available good estimates

One general strategy for reducing the computational cost of a large-scale optimization problem is to use an initial guess, easier to calculate and close to the optimal solution.

Within the class of algorithms with asymptotic convergence, splitting-based methods work by activing each of the convex constraints repetitively and by combining them to obtain a sequence converging to a feasible point. Since the projection point is characterized by the variational inequality

the projection operator is a closed contraction. Therefore the set of fixed points of is exactly . This prevents the use of an initialization point belonging to the feasible set as well as the use of multiscale strategies since there is not guarantee that the solution from a previous level does not belong to the feasible set.

The same difficult arises when considering interior point algorithms, since them need to be initialized to an interior point. In [18], Goldman proved that the Dykstra’s algorithm can potentially starts from better starting values than the given data point . The author established the convergence to the nearest point to the primal cone from an arbitrary point in the intersection of and the ball of radius , where , is a rotation of radiants of the dual cone with its vertex translated at . A point satisfying these conditions can be obtained efficiently by using distance reduction operations based on Theorem ?. It is worth to remark that this result can be easily interpreted in the active set framework. In fact, the Dykstra’s algorithm can be undestood as a primal active set method and its solution is a primal dual feasible point. Therefore, any dual feasible point, that is every point belonging to the set , can be used as initialization.

All algorithm with time finite convergence detailed in the previous section are primal-dual and they reduce the problem of projecting a given data point onto a convex set to the problem of finding the set of indices corresponding to not saturated constraints at the solution. In general, they involve the selection of a subset from a collection of items, say . With this formulation, they potentially allow to take advantage of a good estimate of the optimal active set. However, the adaptation is not rapid since the active set estimate is updated of one-by-one changes preventing this class of method from being effective general-purpose solvers for large-scale problems.

In the algorithm of Fraser and Massam, the set of successive approximations are obtained by moving along a fixed line connecting the initial guess to the data point. The number of iterations needed to reduce the data point reduces the number of sectors to be crossed to join the sector containing the data point.

By contrast, in the algorithm of Meyer the proximity of the initial guess to the data point is not a good criterion selection for the initial guess. In fact, given an initial guess, the solution is attained by adding and/or removing one index at time until optimal solution is found. Taking into account that the optimal solution can be interpreted as the best approximation with the biggest possible number of hinges, if the optimal solution contains just a few hinges, than using the empty set as an initial guess would result much faster than using the full set of possible hinges. On the contrary, if just a few constraints are satisfied at equality in the optimal solution, than the full set of indices will be a much better initial guess than the empty set. Therefore even if the choice of the initial guess may highly influence the performances, its choice depends on the data and there is not a well established criterion to fix it.

The Murty and Fathy’s algorithm reduces the size of the problem each time a critical index is found. Therefore it is not compatible with strategies that take advantage of a good initial estimate since a good estimate does not lead to find a critical index faster.

To overcome the limitation of active set methods in taking advantage of a good estimate, Curtis et al. [11] have recently proposed an euristic framework that allows for multiple simultaneous changes in the active-set estimate, which often leads to a rapid identification of the optimal set. However, there is not guarantee of the computational advantages for general problems and, furthermore, the authors recommend their approach when solving generic quadratic programming problems with many degrees of freedom, that is not the case of general concave regression problems.

#### PAV’s inspired approximate solution

To evaluate through numerical simulations the suitability to take advantage from good initial estimates, we propose an algorithm inspired to Pool Adjacent Violators (PAV), whose computational complexity is (n). Starting from the original signal, violated constraints are removed one by one by projecting the current iterate onto the convex cone corresponding to the violated constraint until a primal feasible solution is obtained. Since the dual feasibility of each iterate is not guaranteed, the founded solution is not optimal. However, in our experience the solution is a very good approximation of the optimal solution.

### 4.2Suitability to be implemented in an online fashion

Another strategy to deal with large projection problems would be to build and evaluate the solution incrementally according to the order of its input, as done in online methodologies developed for dynamic optimization [6] over a stream of input. Even if the input is given in advance, inputs are processed sequentially and the algorithm must respond in real-time with no knowledge of future input. Each new input may cause to rising or falling constraints and the final solution is a sequence of feasible solutions, one for each time step, such that later solutions build on earlier solutions incrementally.

Of course this strategy requires that the algorithm respond in real-time for each new input, that would not possible when dealing with large matrix inverse computations. Let be the projection of onto and let be the projection of onto . When a new element is added to the data point, a new constraint is added too, so that the constraint cone has a new edge. If this constraint corresponds to a critical index, that is to a constraint satisfied at equality in the optimal solution, then the projection face will be the same so that no further computing will be needed. On the contrary, if the new constraint does not correspond to a critical index, the projection face will change, including the edge corresponding to the new constraint and removing and adding some others. Therefore, the major difficulty faced by online strategy is the same faced in exploiting good estimates.

### 4.3Computational issues

As highlithed in the previous section, despite the different strategies implemented by algorithms with time-finite convergence, generally an index at time is iteratively added or removed from the current set of indices until both the feasibility condition and the optimality condition are satisfied. Checking the optimality condition involves computing the inverse of a matrix that differs slightly from the matrix of the previous iteration. What “slightly” exactly means depends on the specific algorithm and it is detailed in the following.

The algorithm of Fraser and Massam involves the computation of a fixed size matrix inverse at each iteration. The matrix to be inverted differs from the matrix of the previous iteration only for one column, being the change of the form , where is an unitary vector with only one nonzero component corresponding to the index of the column to be changed, and corresponds to the difference between the elements of the dual basis vector and the elements of the primal basis vector or viceversa. In this case the inverse can be updated efficiently by using the Sherman-Morrison formula: , where , , . Therefore only two matrix multiplications and a scalar product are needed to compute the new inverse at each step.

The algorithm of Meyer, as well as the one of Liu and Fathi [29] involve the computation of an inverse matrix of variable size at each iteration. The matrix to be inverted differs from the matrix of the previous iteration only for one column, which has been added or removed. Since the matrix to be inverted , with is generally rectangular, the Moore-Penrose generalized inverse or pseudoinverse is computed:.

In Matlab and Scilab, the computation of the pseudoinverse is based on the Singular Value Decomposition (SVD) and singular values lower than a tolerance are treated as zero. The advantage of this approach is that the pseudoinverse can be computed also for a nonsingular matrix. However, the method proposed by Liu and Fathi [29] to improve the computational efficiency of their algorithm does not take advantage of SVD approach since it consists in updating the matrix . If the matrix is ill-conditioned, then the inverse cannot be computed with good accuracy and for the matrix is even more so because this operation squares the condition number of the matrix .

A better solution would be to update directly the pseudoinverse. This can be achieved when a column is added by using the method proposed in [1] Let , and . The pseudoinverse of can be computed from the pseudoinverse of as follows.

, where and .

Alternatively, the transformation can be efficiently computed by a QR decomposition approach. Let be the QR decomposition of , where is an invertible upper triangular matrix. Then: . The inverse of an upper triangular matrix can be efficiently implemented by a left matrix division or by more sophisticated methods as the one proposed in [30].

Courrieu [9] proposed a method based on the full rank Cholesky decomposition which has a computation time substantially shorter of the method based on SVD decomposition. The two main operations on which his method is based are the full rank Cholesky factorization of and the inversion of , where is a full rank matrix . On a serial processor these computations are of complexity order and respectively. However, in a parallel architecture, with as many processor as necessary, the time complexity for Cholesky factorization of could reduce to , while the time complexity for the inversion of the symmetric positive definite matrix could reduce to . However, the computational advantage of this method can be appreciated only when , since the inverse of a matrix has to be computed, which is not in general the case, specially for concave regression problems.

The method proposed by Zhu and Li [52] for recursive constrained least squares problems, found that the exact solution of Linear Equality-constrained Least Squares can be obtained by the same recursion as for the unconstrained problem, provided that the Rescricted Least Squares procedure is appropriately initialized. However, even this approach offer significant advantages in term of computational cost only when the number of constraints is small, which is not the case for large-scale concave regression problems.

## 5Improving the active set framework for concave regression problems

An active set algorithm for solving the concave regression problem generates a sequence of quasi stationary points. A primal (dual) feasible active set algorithm generates a sequence of primal (dual) feasible quasi stationary points with decreasing objective function values and terminates when the dual (primal) feasibility is satisfied. An active set induces a unique partition of the indices into blocks. A block of such partition is a set of consecutive integers, say, , such that the index of the constraint is in for each such that . Conversly any such partition of indices determines a unique active set. Denoting by the multiplier associated with the ith constraint, the Karush-Kuhn-Tucker can be written as:

It is easy to show that: and . Knowing that each block can be represented by an affine function , the case of blocks the above systems can be written as:

Therefore, for each block the unknown variables to be computed are . The systems to be solved can be written as , where , and