Least Squares and Shrinkage Estimation under Bimonotonicity Constraints

# Least Squares and Shrinkage Estimation under Bimonotonicity Constraints

Rudolf Beran and Lutz Dümbgen
University of California at Davis and University of Bern
September 2008, revised January 2009
###### Abstract

In this paper we describe active set type algorithms for minimization of a smooth function under general order constraints, an important case being functions on the set of bimonotone matrices. These algorithms can be used, for instance, to estimate a bimonotone regression function via least squares or (a smooth approximation of) least absolute deviations. Another application is shrinkage estimation in image denoising or, more generally, regression problems with two ordinal factors after representing the data in a suitable basis which is indexed by pairs . Various numerical examples illustrate our methods.

##### Key words:

active set algorithm, dynamic programming, estimated risk, pool-adjacent-violators algorithm, regularization.

##### AMS subject classifications:

62-04, 62G05, 62G08, 90C20, 90C25, 90C90

## 1 Introduction

Monotonicity and other qualitative constraints play an important role in contemporary nonparametric statistics. One reason for this success is that such constraints are often plausible or even justified theoretically, within an appropriate mathematical formulation of the application. Moreover, by imposing shape constraints one can often avoid more traditional smoothness assumptions which typically lead to procedures requiring the choice of some tuning parameter. A good starting point for statistical inference under qualitative constraints is the monograph by Robertson et al. .

Estimation under order constraints leads often to the following optimization problem: For some dimension let be a given functional. For instance,

 (1) Q(θ) = p∑u=1wu(Zu−θu)2

with a certain weight vector and a given data vector . In general we assume that is continuously differentiable, strictly convex and coercive, i.e.

 Q(θ) → ∞as ∥θ∥→∞,

where is some norm on . The goal is to minimize over the following subset of : Let be a given collection of pairs of different indices , and define

 K=K(C) = {θ∈Rp:θu≤θv for all (u,v)∈C}.

This defines a closed convex cone in containing all constant vectors.

For instance, if consists of , , …, , then is the cone of all vectors such that . Minimizing (1) over all such vectors is a standard problem and can be solved in steps via the pool-adjacent-violators algorithm (PAVA). The latter was introduced in a special setting by Ayer et al.  and extended later by numerous authors, see  and Best and Chakravarti .

As soon as is not of type (1) or differs from the aforementioned standard example, the minimization of over becomes more involved. Here is another example for and which is of primary interest in the present paper: Let with integers , and identify with the set of all matrices with rows and columns. Further let be the set of all matrices such that

 θi,j≤θi+1,j whenever i

This corresponds to the set of all pairs with and such that either or . Hence there are constraints.

Minimizing the special functional (1), i.e. , over the bimonotone cone is a well recognized problem with various proposed solutions, see, for instance, Spouge et al. , Burdakow et al. , and the references cited therein. However, all these algorithms exploit the special structure of or (1). For general functionals , e.g. quadratic functions with positive definite but non-diagonal hessian matrix, different approaches are needed.

The remainder of this paper is organized as follows. In Section 2 we describe the bimonotone regression problem and argue that the special structure (1) is sometimes too restrictive even in that context. In Section 3 we derive possible algorithms for the general optimization problem described above. These algorithms involve a discrete optimization step which gives rise to a dynamic program in case of . For a general introduction to dynamic programming see Cormen et al. . Other ingredients are active methods as described by, for instance, Fletcher , Best and Chakravarti  or Dümbgen et al. , sometimes combined with the ordinary PAVA in a particular fashion. It will be shown that all these algorithms find the exact solution in finitely many steps, at least when is an arbitrary quadratic and strictly convex function. Finally, in Section 4 we adapt our procedure to image denoising via bimonotone shrinkage of generalized Fourier coefficients. The statistical method in this section was already indicated in Beran and Dümbgen  but has not been implemented yet, for lack of an efficient computational algorithm.

## 2 Least squares estimation of bimonotone regression functions

Suppose that one observes , , …, with real components , and . The points are regarded as fixed points, which is always possible by conditioning, while

 Zt = μ(xt,yt)+εt

for an unknown regression function and independent random errors , , …, with mean zero. In some applications it is plausible to assume to be bimonotone increasing, i.e. non-decreasing in both arguments. Then it would be desirable to estimate under that constraint only. One possibility would be to minimize

 n∑t=1(Zt−μ(xt,yt))2

over all bimonotone functions . The resulting minimizer is uniquely defined on the finite set of all design points , .

For a more detailed discussion, suppose that we want to estimate on a finite rectangular grid

 {(x(i),y(j)):1≤i≤r,1≤j≤s},

where and contain at least the different elements of and , respectively, but maybe additional points as well. For and let be the number of all such that , and let be the average of over these indices . Then equals

 Q(θ) = ∑i,jwij(Zij−θij)2,

where stands for the matrix .

##### Setting 1: Complete layout.

Suppose that for all . Then the resulting optimization problem is precisely the one described in the introduction.

##### Setting 2a: Incomplete layout and simple interpolation/extrapolation.

Suppose that the set of all index pairs with differs from . Then

 Q(θ) = ∑u∈Uwu(Zu−θu)2

fails to be coercive. Nevertheless it can be minimized over with the algorithms described later. Let be such a minimizer. Since it is uniquely defined on only, we propose to replace it with , where

 θ–ij = max({ˇθi′j′:(i′,j′)∈U,i′≤i,j′≤j}∪{ˇθmin}), ¯¯¯θij = min({ˇθi′j′:(i′,j′)∈U,i≤i′,j≤j′}∪{ˇθmax}),

and and denote the minimum and maximum, respectively, of . Note that and belong to and are extremal in the sense that any matrix with for all satisfies necessarily for all .

##### Setting 2b: Incomplete layout and light regularization.

Instead of restricting one’s attention to the index set , one can estimate the full matrix by minimizing a suitably penalized sum of squares,

 Q(θ) = ∑u∈Uwu(Zu−θu)2+λP(θ),

over for some small parameter . Here is a convex quadratic function on such that is strictly convex. One possibility would be Tychonov regularisation with and a certain reference value , for instance, . In our particular setting we prefer the penalty

 (2) P(θ) = ∑((i,j),(k,ℓ))∈Cr,s(θkℓ−θij)2,

because it yields smoother interpolations than the recipe for Setting 2a or the Tychonov penalty. One can easily show that the resulting quadratic function is strictly convex but with non-diagonal hessian matrix. Thus it fulfills our general requirements but is not of type (1).

Note that adding a penalty term such as (2) could be worthwhile even in case of a complete layout if the underlying function is assumed to be smooth. But this leads to the nontrivial task of choosing appropriately. Here we use the penalty term mainly for smooth interpolation/extrapolation with just large enough to ensure a well-conditioned Hessian matrix. We refer to this as “light regularization”, and the exact value of is essentially irrelevant.

###### Example 2.1

To illustrate the difference between simple interpolation/extrapolation and light regularization with penalty (2) we consider just two observations, and , and let , with and . Thus except for , while and . Any minimizer of over satisfies and , so the recipe for Setting 2a yields

 ^θij = ⎧⎨⎩0if i≤2,j≤3,1if i≥6,j≥7,0.5else.

The left panel of Figure 1 shows the latter fit , while the right panel shows the regularized fit based on (2) with . In these and most subsequent pictures we use a gray scale from to .

###### Example 2.2

(Binary regression).   We generated a random matrix with rows, columns and independent components , where

with and . Thereafter we removed randomly all but of the components . The resulting data are depicted in the upper left panel of Figure 2, where missing values are depicted grey, while the upper right panel shows the true signal . The lower panels depict the least squares estimator with simple interpolation/extrapolation (left) and light regularization based on (2) with (right). Note that both estimators are very similar. Due to the small value of , the main differences occur in regions without data points.

The quality of an estimator for may be quantified by the average absolute deviation,

For the estimator with simple interpolation/extrapolation, turned out to be , the estimator based on light regularization performed slightly better with .

## 3 The general algorithmic problem

We return to the general framework introduced in the beginning with a continuously differentiable, strictly convex and coercive functional and a closed convex cone determined by a collection of inequality constraints.

Before starting with explicit algorithms, let us characterize the point

 ^θ = argminθ∈KQ(θ).

It is well-known from convex analysis that a point coincides with if, and only if,

 (3) ∇Q(θ)⊤θ = 0 ≤ ∇Q(θ)⊤ηfor all η∈K,

where denotes the gradient of at . This characterization involves infinitely many inequalities, but it can be replaced with a criterion involving only finitely many constraints.

### 3.1 Extremal directions of K

Note that contains all constant vectors , , where . It can be represented as follows:

###### Lemma 3.1

Define

 E = K∩{0,1}p.

Then any vector may be represented as

 x = min(x)1+∑e∈Eλee

with coefficients such that .

Here and denote the minimum and maximum, respectively, of the components of .

##### Modified characterization of ^θ.

By means of Lemma 3.1 one can easily verify that (3) is equivalent to the following condition:

 (4) ∇Q(θ)⊤θ = 0 ≤ ∇Q(θ)⊤efor all e∈E∪{−1}.

Thus we have to check only finitely many constraints. Note, however, that the cardinality of may be substantially larger than the dimension , so that checking (4) is far from trivial.

##### Application to Kr,s.

Applying Lemma 3.1 to the cone yields the following representation: With

 Er,s = Kr,s∩{0,1}r×s

any matrix may be written as

 x = ao1r×s+∑e∈Er,sλee

with coefficients and , .

There is a one-to-one correspondence between the set and the set of all vectors with components via the mapping

 e ↦ (i+s∑j=1eij)ri=1.

Since such a vector corresponds to a subset of with elements, we end up with

 #Er,s = (r+sr)=(r+ss).

Hence the cardinality of grows exponentially in . Nevertheless, minimizing a linear functional over is possible in steps, as explained in the next section.

##### Proof of Lemma 3.1.

For let be the different elements of , i.e.  and . Then

 x = a01+m∑i=1(ai−ai−1)(1{xt≥ai})pt=1.

Obviously, these weights are nonnegative and sum to . Furthermore, one can easily deduce from that belongs to for any real threshold .

### 3.2 A dynamic program for Er,s

For some matrix let be given by

 L(x) = r∑i=1s∑j=1aijxij.

The minimum of over the finite set may be obtained by means of the following recursion: For and define

 H(k,ℓ) = min{r∑i=ks∑j=1aijeij:e∈Er,s,ekℓ=1}, H(k,s+1) = min{r∑i=ks∑j=1aijeij:e∈Er,s}.

Then

 mine∈Er,sL(e) = H(1,s+1),

and

 H(k,1) = r∑i=ks∑j=1aij, H(k,ℓ+1) = min(H(k,ℓ),s∑j=ℓ+1aij+H(k+1,ℓ+1))

where we use the conventions that and . In the recursion formula for , the term is the minimum of over all matrices with and (if ), while is the minimum of over all with .

Table 1 provides pseudocode for an algorithm that determines a minimizer of over .

### 3.3 Active set type algorithms

Throughout this exposition we assume that minimization of over an affine linear subspace of is feasible. This is certainly the case if is a quadratic functional. If is twice continuously differentiable with positive definite Hessian matrix everywhere, this minimization problem can be solved with arbitrarily high accuracy by a Newton type algorithm.

All algorithms described in this paper alternate between two basic procedures which are described next. In both procedures is replaced with a vector such that unless .

#### Basic procedure 1: Checking optimality of θ∈K

Suppose that satisfies already the the following two equations:

 (5) ∇Q(θ)⊤θ = 0 = ∇Q(θ)⊤1.

According to (3), this vector is already the solution if, and only if, for all . Thus we determine

 Δ ∈ argmine∈E∇Q(θ)⊤e

and do the following: If , we know that and stop the algorithm. Otherwise we determine

 to = argmint∈RQ(θ+tΔ) > 0

and replace with

 θnew := θ+toΔ.

This vector lies in the cone , too, and satisfies the inequality . Then we proceed with basic procedure 2.

#### Basic procedure 2: Replacing θ∈K with a “locally optimal” point θnew∈K

The general idea of basic procedure 2 is to find a point such that

 (6) θnew = argminx∈VQ(x)

for some in a finite family of linear subspaces of . Typically these subspaces are obtained by replacing some inequality constraints from with equality constraints and ignoring the remaining ones. This approach is described below as basic procedure 2a. But we shall see that it is potentially useful to modify this strategy; see basic procedures 2b and 2c.

##### Basic procedure 2a: The classical active set approach.

For define

 V(θ) = {x∈Rp:xu=xv for all (u,v)∈C with θu=θv}.

This is a linear subspace of containing and which is determined by those constraints from which are “active” in . It has the additional property that for any vector ,

 λ(θ,x)=max{t∈[0,1]:(1−t)θ+tx∈K} > 0.

Precisely, if , and otherwise,

 λ(θ,x) = min(u,v)∈C:xu>xvθv−θuθv−θu−xv+xu.

The key step in basic procedure 2a is to determine and . If , which is equivalent to , we are done and return . This vector satisfies (6) with and . The latter fact follows simply from . If , we repeat this key step with in place of .

In both cases the key step yields a vector satisfying , unless . Moreover, if , then the vector space is contained in with strictly smaller dimension, because at least one additional constraint from becomes active. Hence after finitely many repetitions of the key step, we end up with a vector satisfying (6) with . Table 2 provides pseudocode for basic procedure 2a.

##### Basic procedure 2b: Working with complete orders.

The determination and handling of the subspace in basic procedure 2a may be rather involved, in particular, when the set consists of more than constraints. One possibility to avoid this is to replace and in the key step with the following subspace and cone , respectively:

 V∗(θ) = {x∈Rp:for all u,v∈{1,…,p}, xu=xv if θu=θv}, K∗(θ) = {x∈Rp:for all u,v∈{1,…,p}, xu≤xv if θu≤θv}.

Note that , and one easily verifies that if . Basic procedure 2b works precisely like basic procedure 2a, but with in place of , and is replaced with

 λ∗(θ,x) = max{t∈[0,1]:(1−t)θ+tx∈K∗(θ)}.

Then basic procedure 2b yields a vector satisfying (6) with .

When implementing this procedure, it is useful to determine a permutation of such that . Let denote those indices such that if . Then, with ,

 V∗(θ) = {x∈Rp:for 1≤ℓ≤q,  xσ(i) is constant in i∈{iℓ−1+1,…,iℓ}}, K∗(θ) = {x∈V∗(θ):for 1≤ℓ

and

 λ∗(θ,x) = min2≤ℓ≤p:xσ(iℓ−1)>xσ(iℓ)θσ(iℓ)−θσ(iℓ−1)θσ(iℓ)−θσ(iℓ−1)−xσ(iℓ)+xσ(iℓ−1).
##### Basic procedure 2c: A shortcut via the PAVA.

In the special case of being the weighted least squares functional in (1), one can determine

 θnew = argminx∈K∗(θ)Q(x)

directly by means of the PAVA with a suitable modification for the equality constraints defining .

### The whole algorithm and its validity

All subspaces and , , correspond to partitions of into index sets. Namely, the linear subspace corresponding to such a partition consists of all vectors with the property that for arbitrary indices belonging to the same set from the partition. Thus the subspaces used in basic procedures 2a-b belong to a finite family of linear subspaces of all containing .

We may start the algorithm with initial point

 θ(0) = (argmint∈RQ(t1))⋅1.

Now suppose that have been chosen such that

with linear spaces . Then satisfies (5), and we may apply basic procedure 1 to check whether . If not, we may also apply a variant of basic procedure 2 to get minimizing on a linear subspace , where . Since is finite, we will obtain after finitely many steps.

Similar arguments show that our algorithm based on basic procedure 2c reaches an optimum after finitely many steps, too.

##### Final remark on coercivity.

As mentioned for Setting 2a, the algorithm above may be applicable even in situations when the functional fails to be coercive. In fact, we only need to assume that attains a minimum, possibly non-unique, over any linear space , or any cone , and we have to able to compute it. In Setting 2a, one can verify this easily.

## 4 Shrinkage estimation

We consider a regression setting as in Section 2, this time with Gaussian errors . As before, the regression function is reduced to a matrix

 M=(μ(x(i),y(j)))i,j∈Rr×s

for given design points and . This matrix is no longer assumed to be bimonotone, but the latter constraint will play a role in our estimation method.

### 4.1 Transforming the signal

At first we represent the signal with respect to a certain basis of . To this end let and be orthonormal matrices in and , respectively, to be specified later. Then we write

 M = U~MV⊤ = ∑i,j~Mijuiv⊤jwith~M = U⊤MV = (u⊤iMvj)i,j.

Thus contains the coefficients of with respect to the new basis matrices . The purpose of such a transformation is to obtain a transformed signal with many coefficients being equal or at least close to zero.

One particular construction of such basis matrices and is via discrete smoothing splines: For given degrees , consider annihilators

 A = ⎡⎢ ⎢ ⎢ ⎢ ⎢⎣a11⋯a1,k+10a22⋯a2,k+2⋱⋱0ar−k,r−k⋯ar−k,r⎤⎥ ⎥ ⎥ ⎥ ⎥⎦ ∈ R(r−k)×r, B = ⎡⎢ ⎢ ⎢ ⎢ ⎢⎣b11⋯b1,ℓ+10b22⋯b2,ℓ+2⋱⋱0bs−ℓ,s−ℓ⋯bs−ℓ,s⎤⎥ ⎥ ⎥ ⎥ ⎥⎦ ∈ R(s−ℓ)×s,

with unit row vectors such that

 A(xe(i))ri=1 = 0for e=0,…,k−1, B(ye(j))sj=1 = 0for e=0,…,ℓ−1.

An important special case is . Here

 A = 1√2⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1−101−1⋱⋱01−1⎤⎥ ⎥ ⎥ ⎥ ⎥⎦andB = 1√2⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1−101−1⋱⋱01−1⎤⎥ ⎥ ⎥ ⎥ ⎥⎦

satisfy the equations and .

Next we determine singular value decompositions of and , namely,

 A = ~U⋅[0(r−k)×kdiag(a1,…,ar−k)0 ≤ a1 ≤ ⋯ ≤ ar−k]⋅U⊤ B = ~V⋅[0(s−ℓ)×ℓdiag(b1,…,bs−ℓ)0 ≤ b1 ≤ ⋯ ≤ bs−ℓ]⋅V⊤

with column-orthonormal matrices , , and . The vectors and correspond to the space of polynomials of order at most and , respectively. In particular, we always choose and . Then

One may also write

 M = U % polynomial parthalf-polyn.\ interactionsk×ℓk×(s−ℓ)half-polyn.\ interactionsnon-polyn.\ interactions(r−k)×ℓ(r−k)×(s−ℓ) V⊤.

For moderately smooth functions we expect to have a decreasing trend in and in . This motivates a class of shrinkage estimators which we describe next.

### 4.2 Shrinkage estimation in the simple balanced case

In the case of observations such that each grid point is contained in , our input data may be written as a matrix

 Z = M+ε

with