Least Squares and Shrinkage Estimation under Bimonotonicity Constraints
In this paper we describe active set type algorithms for minimization of a smooth function under general order constraints, an important case being functions on the set of bimonotone matrices. These algorithms can be used, for instance, to estimate a bimonotone regression function via least squares or (a smooth approximation of) least absolute deviations. Another application is shrinkage estimation in image denoising or, more generally, regression problems with two ordinal factors after representing the data in a suitable basis which is indexed by pairs . Various numerical examples illustrate our methods.
active set algorithm, dynamic programming, estimated risk, pool-adjacent-violators algorithm, regularization.
AMS subject classifications:
62-04, 62G05, 62G08, 90C20, 90C25, 90C90
Monotonicity and other qualitative constraints play an important role in contemporary nonparametric statistics. One reason for this success is that such constraints are often plausible or even justified theoretically, within an appropriate mathematical formulation of the application. Moreover, by imposing shape constraints one can often avoid more traditional smoothness assumptions which typically lead to procedures requiring the choice of some tuning parameter. A good starting point for statistical inference under qualitative constraints is the monograph by Robertson et al. .
Estimation under order constraints leads often to the following optimization problem: For some dimension let be a given functional. For instance,
with a certain weight vector and a given data vector . In general we assume that is continuously differentiable, strictly convex and coercive, i.e.
where is some norm on . The goal is to minimize over the following subset of : Let be a given collection of pairs of different indices , and define
This defines a closed convex cone in containing all constant vectors.
For instance, if consists of , , …, , then is the cone of all vectors such that . Minimizing (1) over all such vectors is a standard problem and can be solved in steps via the pool-adjacent-violators algorithm (PAVA). The latter was introduced in a special setting by Ayer et al.  and extended later by numerous authors, see  and Best and Chakravarti .
As soon as is not of type (1) or differs from the aforementioned standard example, the minimization of over becomes more involved. Here is another example for and which is of primary interest in the present paper: Let with integers , and identify with the set of all matrices with rows and columns. Further let be the set of all matrices such that
This corresponds to the set of all pairs with and such that either or . Hence there are constraints.
Minimizing the special functional (1), i.e. , over the bimonotone cone is a well recognized problem with various proposed solutions, see, for instance, Spouge et al. , Burdakow et al. , and the references cited therein. However, all these algorithms exploit the special structure of or (1). For general functionals , e.g. quadratic functions with positive definite but non-diagonal hessian matrix, different approaches are needed.
The remainder of this paper is organized as follows. In Section 2 we describe the bimonotone regression problem and argue that the special structure (1) is sometimes too restrictive even in that context. In Section 3 we derive possible algorithms for the general optimization problem described above. These algorithms involve a discrete optimization step which gives rise to a dynamic program in case of . For a general introduction to dynamic programming see Cormen et al. . Other ingredients are active methods as described by, for instance, Fletcher , Best and Chakravarti  or Dümbgen et al. , sometimes combined with the ordinary PAVA in a particular fashion. It will be shown that all these algorithms find the exact solution in finitely many steps, at least when is an arbitrary quadratic and strictly convex function. Finally, in Section 4 we adapt our procedure to image denoising via bimonotone shrinkage of generalized Fourier coefficients. The statistical method in this section was already indicated in Beran and Dümbgen  but has not been implemented yet, for lack of an efficient computational algorithm.
2 Least squares estimation of bimonotone regression functions
Suppose that one observes , , …, with real components , and . The points are regarded as fixed points, which is always possible by conditioning, while
for an unknown regression function and independent random errors , , …, with mean zero. In some applications it is plausible to assume to be bimonotone increasing, i.e. non-decreasing in both arguments. Then it would be desirable to estimate under that constraint only. One possibility would be to minimize
over all bimonotone functions . The resulting minimizer is uniquely defined on the finite set of all design points , .
For a more detailed discussion, suppose that we want to estimate on a finite rectangular grid
where and contain at least the different elements of and , respectively, but maybe additional points as well. For and let be the number of all such that , and let be the average of over these indices . Then equals
where stands for the matrix .
Setting 1: Complete layout.
Suppose that for all . Then the resulting optimization problem is precisely the one described in the introduction.
Setting 2a: Incomplete layout and simple interpolation/extrapolation.
Suppose that the set of all index pairs with differs from . Then
fails to be coercive. Nevertheless it can be minimized over with the algorithms described later. Let be such a minimizer. Since it is uniquely defined on only, we propose to replace it with , where
and and denote the minimum and maximum, respectively, of . Note that and belong to and are extremal in the sense that any matrix with for all satisfies necessarily for all .
Setting 2b: Incomplete layout and light regularization.
Instead of restricting one’s attention to the index set , one can estimate the full matrix by minimizing a suitably penalized sum of squares,
over for some small parameter . Here is a convex quadratic function on such that is strictly convex. One possibility would be Tychonov regularisation with and a certain reference value , for instance, . In our particular setting we prefer the penalty
because it yields smoother interpolations than the recipe for Setting 2a or the Tychonov penalty. One can easily show that the resulting quadratic function is strictly convex but with non-diagonal hessian matrix. Thus it fulfills our general requirements but is not of type (1).
Note that adding a penalty term such as (2) could be worthwhile even in case of a complete layout if the underlying function is assumed to be smooth. But this leads to the nontrivial task of choosing appropriately. Here we use the penalty term mainly for smooth interpolation/extrapolation with just large enough to ensure a well-conditioned Hessian matrix. We refer to this as “light regularization”, and the exact value of is essentially irrelevant.
To illustrate the difference between simple interpolation/extrapolation and light regularization with penalty (2) we consider just two observations, and , and let , with and . Thus except for , while and . Any minimizer of over satisfies and , so the recipe for Setting 2a yields
(Binary regression). We generated a random matrix with rows, columns and independent components , where
with and . Thereafter we removed randomly all but of the components . The resulting data are depicted in the upper left panel of Figure 2, where missing values are depicted grey, while the upper right panel shows the true signal . The lower panels depict the least squares estimator with simple interpolation/extrapolation (left) and light regularization based on (2) with (right). Note that both estimators are very similar. Due to the small value of , the main differences occur in regions without data points.
The quality of an estimator for may be quantified by the average absolute deviation,
For the estimator with simple interpolation/extrapolation, turned out to be , the estimator based on light regularization performed slightly better with .
3 The general algorithmic problem
We return to the general framework introduced in the beginning with a continuously differentiable, strictly convex and coercive functional and a closed convex cone determined by a collection of inequality constraints.
Before starting with explicit algorithms, let us characterize the point
It is well-known from convex analysis that a point coincides with if, and only if,
where denotes the gradient of at . This characterization involves infinitely many inequalities, but it can be replaced with a criterion involving only finitely many constraints.
3.1 Extremal directions of
Note that contains all constant vectors , , where . It can be represented as follows:
Then any vector may be represented as
with coefficients such that .
Here and denote the minimum and maximum, respectively, of the components of .
Modified characterization of .
Application to .
Applying Lemma 3.1 to the cone yields the following representation: With
any matrix may be written as
with coefficients and , .
There is a one-to-one correspondence between the set and the set of all vectors with components via the mapping
Since such a vector corresponds to a subset of with elements, we end up with
Hence the cardinality of grows exponentially in . Nevertheless, minimizing a linear functional over is possible in steps, as explained in the next section.
Proof of Lemma 3.1.
For let be the different elements of , i.e. and . Then
Obviously, these weights are nonnegative and sum to . Furthermore, one can easily deduce from that belongs to for any real threshold .
3.2 A dynamic program for
For some matrix let be given by
The minimum of over the finite set may be obtained by means of the following recursion: For and define
where we use the conventions that and . In the recursion formula for , the term is the minimum of over all matrices with and (if ), while is the minimum of over all with .
Table 1 provides pseudocode for an algorithm that determines a minimizer of over .
|for downto do|
|for to do|
|while and do|
3.3 Active set type algorithms
Throughout this exposition we assume that minimization of over an affine linear subspace of is feasible. This is certainly the case if is a quadratic functional. If is twice continuously differentiable with positive definite Hessian matrix everywhere, this minimization problem can be solved with arbitrarily high accuracy by a Newton type algorithm.
All algorithms described in this paper alternate between two basic procedures which are described next. In both procedures is replaced with a vector such that unless .
Basic procedure 1: Checking optimality of
Suppose that satisfies already the the following two equations:
According to (3), this vector is already the solution if, and only if, for all . Thus we determine
and do the following: If , we know that and stop the algorithm. Otherwise we determine
and replace with
This vector lies in the cone , too, and satisfies the inequality . Then we proceed with basic procedure 2.
Basic procedure 2: Replacing with a “locally optimal” point
The general idea of basic procedure 2 is to find a point such that
for some in a finite family of linear subspaces of . Typically these subspaces are obtained by replacing some inequality constraints from with equality constraints and ignoring the remaining ones. This approach is described below as basic procedure 2a. But we shall see that it is potentially useful to modify this strategy; see basic procedures 2b and 2c.
Basic procedure 2a: The classical active set approach.
This is a linear subspace of containing and which is determined by those constraints from which are “active” in . It has the additional property that for any vector ,
Precisely, if , and otherwise,
The key step in basic procedure 2a is to determine and . If , which is equivalent to , we are done and return . This vector satisfies (6) with and . The latter fact follows simply from . If , we repeat this key step with in place of .
In both cases the key step yields a vector satisfying , unless . Moreover, if , then the vector space is contained in with strictly smaller dimension, because at least one additional constraint from becomes active. Hence after finitely many repetitions of the key step, we end up with a vector satisfying (6) with . Table 2 provides pseudocode for basic procedure 2a.
Basic procedure 2b: Working with complete orders.
The determination and handling of the subspace in basic procedure 2a may be rather involved, in particular, when the set consists of more than constraints. One possibility to avoid this is to replace and in the key step with the following subspace and cone , respectively:
Note that , and one easily verifies that if . Basic procedure 2b works precisely like basic procedure 2a, but with in place of , and is replaced with
Then basic procedure 2b yields a vector satisfying (6) with .
When implementing this procedure, it is useful to determine a permutation of such that . Let denote those indices such that if . Then, with ,
Basic procedure 2c: A shortcut via the PAVA.
In the special case of being the weighted least squares functional in (1), one can determine
directly by means of the PAVA with a suitable modification for the equality constraints defining .
The whole algorithm and its validity
All subspaces and , , correspond to partitions of into index sets. Namely, the linear subspace corresponding to such a partition consists of all vectors with the property that for arbitrary indices belonging to the same set from the partition. Thus the subspaces used in basic procedures 2a-b belong to a finite family of linear subspaces of all containing .
We may start the algorithm with initial point
Now suppose that have been chosen such that
with linear spaces . Then satisfies (5), and we may apply basic procedure 1 to check whether . If not, we may also apply a variant of basic procedure 2 to get minimizing on a linear subspace , where . Since is finite, we will obtain after finitely many steps.
Similar arguments show that our algorithm based on basic procedure 2c reaches an optimum after finitely many steps, too.
Final remark on coercivity.
As mentioned for Setting 2a, the algorithm above may be applicable even in situations when the functional fails to be coercive. In fact, we only need to assume that attains a minimum, possibly non-unique, over any linear space , or any cone , and we have to able to compute it. In Setting 2a, one can verify this easily.
4 Shrinkage estimation
We consider a regression setting as in Section 2, this time with Gaussian errors . As before, the regression function is reduced to a matrix
for given design points and . This matrix is no longer assumed to be bimonotone, but the latter constraint will play a role in our estimation method.
4.1 Transforming the signal
At first we represent the signal with respect to a certain basis of . To this end let and be orthonormal matrices in and , respectively, to be specified later. Then we write
Thus contains the coefficients of with respect to the new basis matrices . The purpose of such a transformation is to obtain a transformed signal with many coefficients being equal or at least close to zero.
One particular construction of such basis matrices and is via discrete smoothing splines: For given degrees , consider annihilators
with unit row vectors such that
An important special case is . Here
satisfy the equations and .
Next we determine singular value decompositions of and , namely,
with column-orthonormal matrices , , and . The vectors and correspond to the space of polynomials of order at most and , respectively. In particular, we always choose and . Then
One may also write
For moderately smooth functions we expect to have a decreasing trend in and in . This motivates a class of shrinkage estimators which we describe next.
4.2 Shrinkage estimation in the simple balanced case
In the case of observations such that each grid point is contained in , our input data may be written as a matrix