# Adaptive Approximation for Multivariate Linear Problems with Inputs Lying in a Cone

## Abstract

We study adaptive approximation algorithms for general multivariate linear problems where the sets of input functions are non-convex cones. While it is known that adaptive algorithms perform essentially no better than non-adaptive algorithms for convex input sets, the situation may be different for non-convex sets. A typical example considered here is function approximation based on series expansions. Given an error tolerance, we use series coefficients of the input to construct an approximate solution such that the error does not exceed this tolerance. We study the situation where we can bound the norm of the input based on a pilot sample, and the situation where we keep track of the decay rate of the series coefficients of the input. Moreover, we consider situations where it makes sense to infer coordinate and smoothness importance. Besides performing an error analysis, we also study the information cost of our algorithms and the computational complexity of our problems, and we identify conditions under which we can avoid a curse of dimensionality.

## 1 Introduction

In many situations, adaptive algorithms can be rigorously shown to perform essentially no better than non-adaptive algorithms. Yet, in practice adaptive algorithms are appreciated because they relieve the user from stipulating the computational effort required to achieve the desired accuracy. The key to resolving this seeming contradiction is to construct a theory based on assumptions that favor adaptive algorithms. We do that here.

Adaptive algorithms infer the necessary computational effort based on the function data sampled. Adaptive algorithms may perform better than non-adaptive algorithms if the set of input functions is non-convex. We construct adaptive algorithms for general multivariate linear problems where the input functions lie in non-convex cones. Our algorithms use a finite number of series coefficients of the input function to construct an approximate solution that satisfies an absolute error tolerance. We show our algorithms to be essentially optimal. We derive conditions under which the problem is tractable, i.e., the information cost of constructing the approximate solution does not increase exponentially with the dimension of the input function domain. In the remainder of this section we define the problem and essential notation. But first, we present a helpful example.

### 1.1 An Illustrative Example

Consider the case of approximating functions defined over $[-1,1]^d$, using a Chebyshev polynomial basis. The input function is denoted $f$, and the solution is $\mathrm{SOL}(f)=f$. In this case,

$$
f=\sum_{\boldsymbol{k}\in\mathbb{N}_0^d}\hat{f}(\boldsymbol{k})\,u_{\boldsymbol{k}}=:\mathrm{SOL}(f),\qquad \boldsymbol{k}=(k_1,\dots,k_d)\in\mathbb{N}_0^d,
$$

$$
u_{\boldsymbol{k}}:=\prod_{\ell=1}^{d}\tilde{u}_{k_\ell},\qquad \tilde{u}_k(x):=\cos\bigl(k\cos^{-1}(x)\bigr)\quad\forall k\in\mathbb{N}_0.
$$

Approximating $f$ well by a finite sum requires knowing which terms in the infinite series for $f$ are more important. Let $\mathcal{F}$ denote a Hilbert space of input functions where the norm of $f$ is a $\boldsymbol{\lambda}$-weighted $\ell_2$-norm of the series coefficients:

$$
\lVert f\rVert_{\mathcal{F}}:=\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathbb{N}_0^d}\right\rVert_2,\qquad\text{where }\boldsymbol{\lambda}=(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathbb{N}_0^d},\quad\lambda_{\boldsymbol{k}}:=\prod_{\substack{\ell=1\\ k_\ell>0}}^{d}w_\ell k_\ell^{-r},\quad r>0.
$$

The $w_\ell$ are non-negative coordinate weights, which embody the assumption that $f$ may depend more strongly on coordinates with larger $w_\ell$ than on those with smaller $w_\ell$. The definition of the $\mathcal{F}$-norm implies that an input function must have series coefficients that decay quickly enough as the degree of the polynomial increases. Larger $r$ implies smoother input functions.

The ordering of the weights,

$$
\lambda_{\boldsymbol{k}_1}\ge\lambda_{\boldsymbol{k}_2}\ge\cdots>0, \qquad (1)
$$

implies an ordering of the wavenumbers, $\boldsymbol{k}_1,\boldsymbol{k}_2,\dots$. It is natural to approximate the solution using the first $n$ series coefficients as follows:

$$
\mathrm{APP}(f,n):=\sum_{i=1}^{n}\hat{f}(\boldsymbol{k}_i)\,u_{\boldsymbol{k}_i}\qquad\forall f\in\mathcal{F},\ n\in\mathbb{N}.
$$

Here, we assume that it is possible to sample the series coefficients of the input function. This is a less restrictive assumption than being able to sample any linear functional, but it is more restrictive than only being able to sample function values. An important future problem is to extend the theory in this chapter to the case where the only function data available are function values.
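To make the ordering concrete, the following sketch enumerates wavenumbers on a truncated grid, computes the product weights $\lambda_{\boldsymbol{k}}=\prod_{\ell:\,k_\ell>0}w_\ell k_\ell^{-r}$, and selects the $n$ wavenumbers with the largest weights. All concrete values here ($d=2$, $w=(1,0.5)$, $r=2$, the truncation level) are hypothetical choices for illustration, not prescribed by the chapter.

```python
import itertools

def weight(k, w, r):
    """Product weight lambda_k: product over coordinates with k_ell > 0 of w_ell * k_ell**(-r)."""
    lam = 1.0
    for k_ell, w_ell in zip(k, w):
        if k_ell > 0:
            lam *= w_ell * k_ell ** (-r)
    return lam

def ordered_wavenumbers(d, w, r, k_max, n):
    """Return the n wavenumbers in {0, ..., k_max}^d with the largest weights lambda_k."""
    grid = list(itertools.product(range(k_max + 1), repeat=d))
    grid.sort(key=lambda k: -weight(k, w, r))   # non-increasing ordering (1)
    return grid[:n]

# Hypothetical parameters: two coordinates, the second less important, smoothness r = 2.
d, r = 2, 2.0
w = [1.0, 0.5]
ks = ordered_wavenumbers(d, w, r, k_max=4, n=6)
# ks lists k_1, ..., k_6; APP(f, n) keeps exactly the coefficients f_hat(k_i) for these k_i.
```

Sorting by weight reproduces the ordering (1); ties (here $\lambda_{(0,0)}=\lambda_{(1,0)}=1$) may be broken arbitrarily.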

The error of this approximation in terms of the norm on the output space, $\mathcal{G}$, can be expressed as

$$
\lVert\mathrm{SOL}(f)-\mathrm{APP}(f,n)\rVert_{\mathcal{G}}=\left\lVert\bigl(\hat{f}(\boldsymbol{k}_i)\bigr)_{i=n+1}^{\infty}\right\rVert_2,\qquad\text{where }\left\lVert\sum_{\boldsymbol{k}\in\mathbb{N}_0^d}\hat{g}(\boldsymbol{k})u_{\boldsymbol{k}}\right\rVert_{\mathcal{G}}:=\bigl\lVert(\hat{g}(\boldsymbol{k}))_{\boldsymbol{k}\in\mathbb{N}_0^d}\bigr\rVert_2.
$$

If one has a fixed data budget, $n$, then $\mathrm{APP}(f,n)$ is the best answer.

However, our goal is an algorithm, $\mathrm{ALG}$, that satisfies the error criterion

$$
\lVert\mathrm{SOL}(f)-\mathrm{ALG}(f,\varepsilon)\rVert_{\mathcal{G}}\le\varepsilon\qquad\forall\varepsilon>0,\ f\in\mathcal{C}, \qquad (2)
$$

where $\varepsilon$ is the error tolerance, and $\mathcal{C}$ is the set of input functions for which $\mathrm{ALG}$ is successful. This algorithm contains a rule for choosing $n$, depending on $f$ and $\varepsilon$, so that $\mathrm{ALG}(f,\varepsilon)=\mathrm{APP}(f,n)$. The objectives of this chapter are to

• construct such a rule,

• choose a set $\mathcal{C}$ of input functions for which the rule is valid,

• characterize the information cost of $\mathrm{ALG}$,

• determine whether $\mathrm{ALG}$ has optimal information cost, and

• understand the dependence of this cost on the number of input variables, $d$, as well as the error tolerance, $\varepsilon$.

We return to this example in Section 1.6 to discuss the answers to some of these questions. We perform some numerical experiments for this example in Section 4.3.

### 1.2 General Linear Problem

Now, we define our problem more generally. A solution operator $\mathrm{SOL}:\mathcal{F}\to\mathcal{G}$ maps the input function $f$ to an output, $\mathrm{SOL}(f)$. As in the illustrative example above, the Banach spaces of inputs and outputs are defined by series expansions:

$$
\mathcal{F}:=\left\{f=\sum_{\boldsymbol{k}\in\mathcal{K}}\hat{f}(\boldsymbol{k})u_{\boldsymbol{k}}:\lVert f\rVert_{\mathcal{F}}:=\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{K}}\right\rVert_\rho<\infty\right\},\qquad 1\le\rho\le\infty,
$$

$$
\mathcal{G}:=\left\{g=\sum_{\boldsymbol{k}\in\mathcal{K}}\hat{g}(\boldsymbol{k})v_{\boldsymbol{k}}:\lVert g\rVert_{\mathcal{G}}:=\bigl\lVert(\hat{g}(\boldsymbol{k}))_{\boldsymbol{k}\in\mathcal{K}}\bigr\rVert_\tau<\infty\right\},\qquad 1\le\tau\le\rho.
$$

Here, $\{u_{\boldsymbol{k}}\}_{\boldsymbol{k}\in\mathcal{K}}$ is a basis for the input Banach space $\mathcal{F}$, $\{v_{\boldsymbol{k}}\}_{\boldsymbol{k}\in\mathcal{K}}$ is a basis for the output Banach space $\mathcal{G}$, $\mathcal{K}$ is a countable index set, and $\boldsymbol{\lambda}=(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathcal{K}}$ is the sequence of weights. These bases are defined to match the solution operator:

$$
\mathrm{SOL}(u_{\boldsymbol{k}})=v_{\boldsymbol{k}}\qquad\forall\boldsymbol{k}\in\mathcal{K}. \qquad (3)
$$

The $\lambda_{\boldsymbol{k}}$ represent the importance of the series coefficients of the input function. The larger $\lambda_{\boldsymbol{k}}$ is, the more important $\hat{f}(\boldsymbol{k})$ is.

Although this problem formulation is quite general in some aspects, condition (3) is somewhat restrictive. In principle, the choice of basis can be made via the singular value decomposition, but in practice, if the norms of $\mathcal{F}$ and $\mathcal{G}$ are specified without reference to their respective bases, it may be difficult to identify bases satisfying (3).

To facilitate our derivations below, we establish the following lemma via Hölder’s inequality:

###### Lemma 1

Let $\mathcal{J}$ be some proper or improper subset of the index set $\mathcal{K}$. Moreover, let $\rho'$ be defined by the relation

$$
\frac{1}{\rho}+\frac{1}{\rho'}=\frac{1}{\tau},\qquad\text{i.e., }\rho':=\frac{\rho\tau}{\rho-\tau},
$$

so $\tau\le\rho'\le\infty$. Let $\Lambda:=\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathcal{J}}\rVert_{\rho'}$ be the norm of a subset of the weights. Then the following are true for any $R>0$:

$$
\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}\le\lVert f\rVert_{\mathcal{F}}\,\Lambda\qquad\forall f\in\mathcal{F}\ \text{with }\hat{f}(\boldsymbol{k})=0\ \forall\boldsymbol{k}\notin\mathcal{J}, \qquad (4)
$$

$$
\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}=R\,\Lambda\quad\text{for }\hat{f}(\boldsymbol{k}):=\begin{cases}R\,\lambda_{\boldsymbol{k}}^{\rho'/\rho+1}\big/\Lambda^{\rho'/\rho},&\boldsymbol{k}\in\mathcal{J},\\ 0,&\boldsymbol{k}\notin\mathcal{J},\end{cases}\qquad\text{which has }\lVert f\rVert_{\mathcal{F}}=R. \qquad (5)
$$

Equality (5) illustrates how inequality (4) may be made tight.

• We give the proof for $\tau<\rho$. The proof for $\tau=\rho$ follows similarly. The proof of inequality (4) proceeds by applying Hölder’s inequality:

$$
\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}=\bigl\lVert(\hat{f}(\boldsymbol{k}))_{\boldsymbol{k}\in\mathcal{J}}\bigr\rVert_\tau\le\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{J}}\right\rVert_\rho\bigl\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathcal{J}}\bigr\rVert_{\rho'}=\lVert f\rVert_{\mathcal{F}}\,\Lambda. \qquad (6)
$$

Substituting the formula for $\hat{f}(\boldsymbol{k})$ in (5) into equation (6) and applying the relationship between $\rho$, $\rho'$, and $\tau$ yields $\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}=R\,\Lambda$.

Moreover,

$$
\lVert f\rVert_{\mathcal{F}}=\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{J}}\right\rVert_\rho=\frac{R\left\lVert\bigl(\lambda_{\boldsymbol{k}}^{\rho'/\rho}\bigr)_{\boldsymbol{k}\in\mathcal{J}}\right\rVert_\rho}{\Lambda^{\rho'/\rho}}=\frac{R\bigl\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathcal{J}}\bigr\rVert_{\rho'}^{\rho'/\rho}}{\Lambda^{\rho'/\rho}}=R.
$$

This completes the proof.
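The lemma can be checked numerically in a finite setting. The sketch below uses hypothetical values $\rho=4$, $\tau=2$ (so $\rho'=4$) and a four-term weight sequence; the extremal coefficients from (5) should attain the Hölder bound (4) with equality.

```python
import numpy as np

# Hypothetical finite setting: rho = 4, tau = 2, so rho' = rho*tau/(rho - tau) = 4.
rho, tau = 4.0, 2.0
rho_p = rho * tau / (rho - tau)
lam = np.array([1.0, 0.5, 0.25, 0.125])   # weights lambda_k, all indices in J

Lam = np.linalg.norm(lam, rho_p)          # Lambda = ||(lambda_k)_{k in J}||_{rho'}

# Extremal coefficients from (5): f_hat(k) = R * lam_k^(1 + rho'/rho) / Lam^(rho'/rho).
R = 3.0
f_hat = R * lam ** (1.0 + rho_p / rho) / Lam ** (rho_p / rho)

norm_F = np.linalg.norm(f_hat / lam, rho)  # ||f||_F
norm_G = np.linalg.norm(f_hat, tau)        # ||SOL(f)||_G, since SOL(u_k) = v_k

# Hoelder bound (4) holds, and for these coefficients it is an equality:
# norm_F == R and norm_G == R * Lam up to rounding.
```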

Taking $\mathcal{J}=\mathcal{K}$ in the lemma above, the norm of the solution operator can be expressed in terms of the $\rho'$-norm of $\boldsymbol{\lambda}$ as follows:

$$
\lVert\mathrm{SOL}\rVert_{\mathcal{F}\to\mathcal{G}}=\lVert\boldsymbol{\lambda}\rVert_{\rho'}. \qquad (7)
$$

We assume throughout this chapter that the weights are chosen to keep this norm finite, namely,

$$
\lVert\boldsymbol{\lambda}\rVert_{\rho'}<\infty. \qquad (8)
$$

As in Section 1.1, here in the general case the $\lambda_{\boldsymbol{k}}$ are assumed to have a known order as specified in (1). We also assume that all $\lambda_{\boldsymbol{k}}$ are positive to avoid the trivial case where $\mathrm{SOL}(f)$ can be expressed exactly as a finite sum for all $f\in\mathcal{F}$.

### 1.3 An Approximation and an Algorithm

The optimal approximation based on $n$ series coefficients of the input function is defined in terms of the series coefficients corresponding to the $n$ largest $\lambda_{\boldsymbol{k}}$ as follows:

$$
\mathrm{APP}:\mathcal{F}\times\mathbb{N}_0\to\mathcal{G},\qquad \mathrm{APP}(f,0):=0,\qquad \mathrm{APP}(f,n):=\sum_{i=1}^{n}\hat{f}(\boldsymbol{k}_i)\,v_{\boldsymbol{k}_i}\quad\forall n\in\mathbb{N}. \qquad (9)
$$

By the argument leading to (6) it follows that

$$
\lVert\mathrm{SOL}(f)-\mathrm{APP}(f,n)\rVert_{\mathcal{G}}=\left\lVert\bigl(\hat{f}(\boldsymbol{k}_i)\bigr)_{i=n+1}^{\infty}\right\rVert_\tau. \qquad (10)
$$

An upper bound on the approximation error follows from Lemma 1:

$$
\lVert\mathrm{SOL}(f)-\mathrm{APP}(f,n)\rVert_{\mathcal{G}}\le\left\lVert\left(\frac{\hat{f}(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=n+1}^{\infty}\right\rVert_\rho\left\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\right\rVert_{\rho'}. \qquad (11)
$$

This leads to the following theorem.

###### Theorem 1

Let $\mathcal{B}_R$ denote the ball of radius $R$ in the space of input functions $\mathcal{F}$. The error of the approximation defined in (9) is bounded tightly above as

$$
\sup_{f\in\mathcal{B}_R}\lVert\mathrm{SOL}(f)-\mathrm{APP}(f,n)\rVert_{\mathcal{G}}\le R\left\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\right\rVert_{\rho'}. \qquad (12)
$$

Moreover, the worst case error over $\mathcal{B}_R$ of any approximation $\mathrm{APP}'(\cdot,n)$ based on $n$ series coefficients of the input function can be no smaller.

• The proof of (12) follows immediately from (11) and Lemma 1. The optimality of $\mathrm{APP}(\cdot,n)$ follows by bounding the error of an arbitrary approximation, $\mathrm{APP}'(\cdot,n)$, applied to functions that mimic the zero function.

Let $\mathrm{APP}'(\cdot,n)$ depend on the series coefficients indexed by $\mathcal{J}=\{\boldsymbol{k}'_1,\dots,\boldsymbol{k}'_n\}$. Use Lemma 1 with the subset $\mathcal{K}\setminus\mathcal{J}$ to choose $f$ to mimic the zero function, have norm $R$, and have as large a solution as possible, i.e.,

$$
\hat{f}(\boldsymbol{k}'_1)=\cdots=\hat{f}(\boldsymbol{k}'_n)=0,\qquad\lVert f\rVert_{\mathcal{F}}=R,\qquad\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}=R\left\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\notin\mathcal{J}}\right\rVert_{\rho'}\quad\text{by (5)}. \qquad (13)
$$

Then $\mathrm{APP}'(\pm f,n)=\mathrm{APP}'(0,n)$ because $\pm f$ mimics the zero function, and

$$
\begin{aligned}
\sup_{f\in\mathcal{B}_R}\lVert\mathrm{SOL}(f)-\mathrm{APP}'(f,n)\rVert_{\mathcal{G}}
&\ge\max_{\pm}\lVert\mathrm{SOL}(\pm f)-\mathrm{APP}'(\pm f,n)\rVert_{\mathcal{G}}=\max_{\pm}\lVert\mathrm{SOL}(\pm f)-\mathrm{APP}'(0,n)\rVert_{\mathcal{G}}\\
&\ge\tfrac{1}{2}\Bigl[\lVert\mathrm{SOL}(f)-\mathrm{APP}'(0,n)\rVert_{\mathcal{G}}+\lVert-\mathrm{SOL}(f)-\mathrm{APP}'(0,n)\rVert_{\mathcal{G}}\Bigr]\\
&\ge\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}=R\left\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\notin\mathcal{J}}\right\rVert_{\rho'}\quad\text{by (5)}.
\end{aligned}
$$

The ordering of the $\lambda_{\boldsymbol{k}}$ implies that $\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\notin\mathcal{J}}\rVert_{\rho'}$ for arbitrary $\mathcal{J}$ can be no smaller than in the case $\mathcal{J}=\{\boldsymbol{k}_1,\dots,\boldsymbol{k}_n\}$. This completes the proof.

While approximation is a key piece of the puzzle, our ultimate goal is an algorithm, $\mathrm{ALG}$, satisfying the absolute error criterion (2). The non-adaptive Algorithm 1 satisfies this error criterion for $\mathcal{C}=\mathcal{B}_R$.

After defining the information cost of an algorithm and the problem complexity in the next subsection, we demonstrate that this non-adaptive algorithm is optimal when the set of inputs is chosen to be $\mathcal{B}_R$. However, typically one cannot bound the norm of the input function a priori, so Algorithm 1 is impractical.

The key difficulty is that error bound (12) depends on the norm of the input function. In contrast, we will construct error bounds for $\mathrm{APP}(f,n)$ that depend only on function data. These will lead to adaptive algorithms satisfying error criterion (2). For such algorithms, the set of allowable input functions, $\mathcal{C}$, will be a cone, not a ball.

Note that algorithms satisfying error criterion (2) cannot exist for $\mathcal{C}=\mathcal{F}$. Any algorithm must require a finite sample size, even if it is huge. Then, there must exist some $f\in\mathcal{F}$ that looks exactly like the zero function to the algorithm but for which $\lVert\mathrm{SOL}(f)\rVert_{\mathcal{G}}$ is arbitrarily large. Thus, algorithms satisfying the error criterion exist only for some strict subset of $\mathcal{F}$. Choosing that subset well is both an art and a science.

### 1.4 Information Cost and Problem Complexity

The information cost of $\mathrm{ALG}(f,\varepsilon)$ is denoted $\mathrm{COST}(\mathrm{ALG},f,\varepsilon)$ and defined as the number of function data—in our situation, series coefficients—required by $\mathrm{ALG}$. For adaptive algorithms this cost varies with the input function $f$. We also define the information cost of the algorithm in general, recognizing that it will tend to depend on $R$:

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R):=\max_{f\in\mathcal{C}\cap\mathcal{B}_R}\mathrm{COST}(\mathrm{ALG},f,\varepsilon).
$$

Note that while the cost depends on $R$, $\mathrm{ALG}$ has no knowledge of $R$ beyond the fact that $f$ lies in $\mathcal{C}\cap\mathcal{B}_R$. It is common for this cost to be $\mathcal{O}\bigl((R/\varepsilon)^p\bigr)$ for some $p>0$, or perhaps asymptotically so.

Let $\mathcal{A}(\mathcal{C})$ denote the set of all possible algorithms that may be constructed using series coefficients and that satisfy error criterion (2). We define the computational complexity of a problem as the information cost of the best algorithm:

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\varepsilon,R):=\min_{\mathrm{ALG}\in\mathcal{A}(\mathcal{C})}\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R).
$$

These definitions follow the information-based complexity literature [12, 11]. We define an algorithm $\mathrm{ALG}\in\mathcal{A}(\mathcal{C})$ to be essentially optimal if there exist some fixed positive $\omega$, $\varepsilon_{\max}$, and $R_{\min}$ for which

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R)\le\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\omega\varepsilon,R)\qquad\forall\varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty). \qquad (14)
$$

If the complexity of the problem is $\mathcal{O}(\varepsilon^{-p})$, the cost of an essentially optimal algorithm is also $\mathcal{O}(\varepsilon^{-p})$. If the complexity of the problem is asymptotically $\mathcal{O}(\varepsilon^{-p})$, then the cost of an essentially optimal algorithm is also asymptotically $\mathcal{O}(\varepsilon^{-p})$. We will show that our adaptive algorithms presented in Sections 2 and 3 are essentially optimal.

###### Theorem 2

The non-adaptive Algorithm 1 has an information cost for the set of input functions $\mathcal{B}_R$ that is given by

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R')=\min\bigl\{n\in\mathbb{N}_0:\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}\le\varepsilon/R\bigr\}.
$$

This algorithm is essentially optimal for the set of input functions $\mathcal{B}_R$, namely,

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R')\le\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\omega\varepsilon,R')\qquad\forall\varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty),
$$

where $\varepsilon_{\max}$ and $R_{\min}$ are arbitrary and fixed, and $\omega\le R'/R$.

• Fix positive $\omega$, $\varepsilon_{\max}$, $R_{\min}$, and $R'$ as defined above. For $\varepsilon\in(0,\varepsilon_{\max}]$ and $R\in[R_{\min},\infty)$, the information cost of non-adaptive Algorithm 1 follows from its definition. Let

$$
n^*(\varepsilon,R):=\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R').
$$

Construct an input function as in the proof of Theorem 1. By the argument in that proof, any algorithm in $\mathcal{A}(\mathcal{B}_R)$ that can approximate this function with an error no greater than $\varepsilon$ must use at least $n^*(\varepsilon,R)$ series coefficients. Thus,

$$
\begin{aligned}
\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R')&=n^*(\varepsilon,R)=n^*(\varepsilon R'/R,R')\\
&\le n^*(\omega\varepsilon,R')\qquad\text{since }R'/R\ge\omega\\
&\le\mathrm{COMP}(\mathcal{A}(\mathcal{B}_{R'}),\omega\varepsilon,R')\le\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\omega\varepsilon,R').
\end{aligned}
$$

Thus, Algorithm 1 is essentially optimal.

For Algorithm 1, the information cost, $\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R')$, depends on the decay rate of the tail norm of the $\lambda_{\boldsymbol{k}_i}$. This decay may be algebraic or exponential and also determines the problem complexity, $\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\varepsilon,R')$, as a function of the error tolerance, $\varepsilon$.
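For a concrete, finite weight sequence, the information cost in Theorem 2 is simply the smallest $n$ whose tail norm meets the tolerance. A minimal sketch, assuming hypothetical geometrically decaying weights and $\rho'=\infty$:

```python
import numpy as np

def info_cost(lam_sorted, rho_p, eps, R):
    """Smallest n with ||(lambda_{k_i})_{i > n}||_{rho'} <= eps / R, for a finite weight list.

    lam_sorted: weights in the non-increasing ordering (1); rho_p = rho' (np.inf allowed)."""
    for n in range(len(lam_sorted)):
        tail = np.linalg.norm(lam_sorted[n:], rho_p)
        if tail <= eps / R:
            return n
    return len(lam_sorted)   # all coefficients needed (tail is empty)

# Hypothetical weights lambda_{k_i} = 2^(-i); with rho' = inf the tail norm is just lam[n].
lam = [2.0 ** (-i) for i in range(20)]
n_star = info_cost(lam, np.inf, eps=0.1, R=1.0)
# Halving eps (or doubling R) forces one more coefficient for this geometric decay.
```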

This theorem illustrates how an essentially optimal algorithm for solving a problem for a ball of input functions, $\mathcal{B}_R$, can be non-adaptive. However, as alluded to above, we claim that it is impractical to know a priori in which ball one's input function lies. On the other hand, in the situations described below where $\mathcal{C}$ is a cone, we will show via the lemma below that $\mathcal{A}(\mathcal{C})$ actually contains only adaptive algorithms. The proof of this lemma follows directly from the definition of non-adaptivity.

###### Lemma 2

For a given set of input functions, $\mathcal{C}$, if $\mathcal{A}(\mathcal{C})$ contains any non-adaptive algorithms, then for every $\varepsilon>0$,

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\varepsilon):=\sup_{R>0}\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\varepsilon,R)<\infty.
$$

### 1.5 Tractability

Besides understanding the dependence of $\mathrm{COMP}$ on $\varepsilon$, we also want to understand how $\mathrm{COMP}$ depends on the dimension of the domain of the input function. Suppose that the input functions depend on $d$ variables, for some $d\in\mathbb{N}$, and let $\mathcal{F}_d$ denote the dependence of the input space on the dimension $d$. The set of functions for which our algorithms succeed, $\mathcal{C}_d$, depends on the dimension, too. Also, $\mathcal{K}$, $\boldsymbol{\lambda}$, $\mathcal{G}$, and $\mathrm{SOL}$ depend implicitly on dimension, and this dependence is sometimes indicated explicitly by the subscript $d$.

Different dependencies of $\mathrm{COMP}$ on the dimension $d$ and the error tolerance $\varepsilon$ are formalized as different notions of tractability. Since the complexity is defined in terms of the best available algorithm, tractability is a property that is inherent to the problem, not to a particular algorithm. We define the following notions of tractability (for further information on tractability we refer to the trilogy [8], [9], [10]). Note that in contrast to these references we explicitly include the dependence on $R$ in our definitions. This dependence is natural for cones and might be different if $\mathcal{C}_d$ is not a cone.

• We say that the adaptive approximation problem is strongly polynomially tractable if and only if there are non-negative $C$, $p$, $\varepsilon_{\max}$, and $R_{\min}$ such that

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}_d),\varepsilon,R)\le C\,R^p\varepsilon^{-p}\qquad\forall d\in\mathbb{N},\ \varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty).
$$

The infimum of $p$ satisfying the bound above is denoted by $p^*$ and is called the exponent of strong polynomial tractability.

• We say that the problem is polynomially tractable if and only if there are non-negative $C$, $p$, $q$, $\varepsilon_{\max}$, and $R_{\min}$ such that

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}_d),\varepsilon,R)\le C\,d^q R^p\varepsilon^{-p}\qquad\forall d\in\mathbb{N},\ \varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty).
$$
• We say that the problem is weakly tractable if and only if

$$
\lim_{d+R\varepsilon^{-1}\to\infty}\frac{\log\mathrm{COMP}(\mathcal{A}(\mathcal{C}_d),\varepsilon,R)}{d+R\varepsilon^{-1}}=0.
$$

Necessary and sufficient conditions on these tractability notions will be studied for different types of algorithms in Sections 2.2 and 3.3.

We remark that, for the sake of brevity, we focus here on tractability notions that are summarized as algebraic tractability in the recent literature (see, e.g., [6]). Theoretically, one could also study exponential tractability, where one would essentially replace $\varepsilon^{-1}$ by $1+\log\varepsilon^{-1}$ in the previous tractability notions. A more detailed study of tractability will be done in a future paper.

### 1.6 The Illustrative Example Revisited

The example in Section 1.1 chooses $\rho=\tau=2$, so that $\rho'=\infty$. Thus, we obtain by Theorem 2:

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\varepsilon,R)=\mathrm{COST}(\mathrm{ALG},\mathcal{B}_R,\varepsilon,R)=\min\bigl\{n\in\mathbb{N}_0:\lambda_{\boldsymbol{k}_{n+1}}\le\varepsilon/R\bigr\}.
$$

Using the non-increasing ordering of the $\lambda_{\boldsymbol{k}}$, we employ a standard technique for bounding the largest $\lambda_{\boldsymbol{k}}$ in terms of the sum of the $p$th power of all the $\lambda_{\boldsymbol{k}}$. For $p>0$ with $pr>1$,

$$
(n+1)\lambda_{\boldsymbol{k}_{n+1}}^p\le\sum_{i=1}^{n+1}\lambda_{\boldsymbol{k}_i}^p\le\sum_{\boldsymbol{k}\in\mathbb{N}_0^d}\lambda_{\boldsymbol{k}}^p=\prod_{\ell=1}^{d}\left[1+w_\ell^p\sum_{k=1}^{\infty}\frac{1}{k^{pr}}\right]\le\exp\!\left(\zeta(pr)\sum_{\ell=1}^{\infty}w_\ell^p\right)\qquad\text{since }\log(1+x)\le x\ \ \forall x\ge0.
$$

Hence, substituting the above upper bound on $\lambda_{\boldsymbol{k}_{n+1}}$ into the formula for the complexity of the problem, we obtain an upper bound on the complexity:

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\varepsilon,R)\le\min\left\{n\in\mathbb{N}_0:\frac{1}{n+1}\exp\!\left(\zeta(pr)\sum_{\ell=1}^{\infty}w_\ell^p\right)\le\left(\frac{\varepsilon}{R}\right)^p\right\}=\left\lceil\left(\frac{R}{\varepsilon}\right)^p\exp\!\left(\zeta(pr)\sum_{\ell=1}^{\infty}w_\ell^p\right)\right\rceil-1.
$$

If $p^*$ is the infimum of the $p$ for which $\zeta(pr)\sum_{\ell=1}^{\infty}w_\ell^p$ is finite, and $p^*$ is finite, then we obtain strong polynomial tractability and an exponent of strong tractability that is at most $p^*$. On the other hand, if the coordinate weights are all unity, $w_\ell=1$ for all $\ell$, then there are $2^d$ different $\boldsymbol{k}\in\{0,1\}^d$ with a value of $\lambda_{\boldsymbol{k}}=1$, and so $\mathrm{COMP}(\mathcal{A}(\mathcal{B}_R),\varepsilon,R)\ge 2^d-1$ for $\varepsilon<R$, and the problem is not tractable.
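The closed-form upper bound above is easy to evaluate numerically. The sketch below implements $\lceil(R/\varepsilon)^p\exp(\zeta(pr)\sum_\ell w_\ell^p)\rceil-1$ with a truncated zeta sum; the weight sequence $w_\ell=2^{-\ell}$ and the values of $p$, $r$, $R$, $\varepsilon$ are hypothetical illustration choices.

```python
from math import ceil, exp

def zeta(s, terms=100000):
    """Truncated Riemann zeta sum; adequate for s well away from 1."""
    return sum(k ** (-s) for k in range(1, terms + 1))

def comp_upper_bound(R, eps, p, r, w):
    """Upper bound ceil((R/eps)^p * exp(zeta(p*r) * sum w_ell^p)) - 1 on the complexity."""
    return ceil((R / eps) ** p * exp(zeta(p * r) * sum(w_ell ** p for w_ell in w))) - 1

# Hypothetical geometrically decaying coordinate weights: sum of w_ell^p stays bounded,
# so the bound grows only like (R/eps)^p, independent of the nominal dimension d.
w = [2.0 ** (-ell) for ell in range(1, 30)]
bound = comp_upper_bound(R=1.0, eps=0.01, p=1.0, r=2.0, w=w)
```

Note that the bound is uniform in $d$ because the exponential factor depends only on $\sum_\ell w_\ell^p$, which is exactly the strong-tractability mechanism described in the text.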

### 1.7 What Comes Next

In the following section we define a cone of input functions, $\mathcal{C}$, in (16) whose members' norms can be bounded above in terms of the series coefficients obtained from a pilot sample. Adaptive Algorithm 2 is shown to be essentially optimal for this $\mathcal{C}$. We also identify necessary and sufficient conditions for tractability.

Section 3 considers the situation where function data are relatively inexpensive, and we track the decay rate of the series coefficients. Adaptive Algorithm 3 is shown to be essentially optimal in this situation.

Section 4 considers the case where the most suitable weights are not known a priori and are instead inferred from function data. Adaptive Algorithm 4 combines this inference step with Algorithm 2 to construct an approximation that satisfies the error criterion.

## 2 Bounding the Norm of the Input Function Based on a Pilot Sample

### 2.1 The Cone and the Optimal Algorithm

The premise of an adaptive algorithm is that the finite information we observe about the input function tells us something about what is not observed. Let $n_1$ denote the number of pilot observations, based on the set of wavenumbers

$$
\mathcal{K}_1:=\{\boldsymbol{k}_1,\dots,\boldsymbol{k}_{n_1}\}, \qquad (15)
$$

where the $\boldsymbol{k}_i$ are defined by the ordering of the $\lambda_{\boldsymbol{k}}$ in (1). Let $A>1$ be some constant inflation factor. The cone of functions whose norm can be bounded well in terms of a pilot sample is given by

$$
\mathcal{C}=\left\{f\in\mathcal{F}:\lVert f\rVert_{\mathcal{F}}\le A\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{K}_1}\right\rVert_\rho\right\}. \qquad (16)
$$

Referring to error bound (11), we see that the error of $\mathrm{APP}(f,n)$ depends on the series coefficients not sampled. The definition of $\mathcal{C}$ allows us to bound these as follows:

$$
\left\lVert\left(\frac{\hat{f}(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=n+1}^{\infty}\right\rVert_\rho\le\left[A^\rho\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{K}_1}\right\rVert_\rho^\rho-\left\lVert\left(\frac{\hat{f}(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=1}^{n}\right\rVert_\rho^\rho\right]^{1/\rho}\qquad\forall f\in\mathcal{C}.
$$

This inequality together with error bound (11) implies the data-based error bound

$$
\lVert\mathrm{SOL}(f)-\mathrm{APP}(f,n)\rVert_{\mathcal{G}}\le\mathrm{ERR}\bigl((\hat{f}(\boldsymbol{k}_i))_{i=1}^{n},n\bigr)\qquad\forall f\in\mathcal{C}, \qquad (17a)
$$

where

$$
\mathrm{ERR}\bigl((\hat{f}(\boldsymbol{k}_i))_{i=1}^{n},n\bigr):=\left[A^\rho\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{K}_1}\right\rVert_\rho^\rho-\left\lVert\left(\frac{\hat{f}(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=1}^{n}\right\rVert_\rho^\rho\right]^{1/\rho}\left\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\right\rVert_{\rho'},\qquad n\ge n_1. \qquad (17b)
$$

This error bound decays as $n$ increases and as the tail norm of the $\lambda_{\boldsymbol{k}_i}$ decreases. This data-driven error bound underlies Algorithm 2, which is successful for the cone $\mathcal{C}$ defined in (16):

###### Theorem 3

Algorithm 2 yields an answer satisfying absolute error criterion (2) for the cone $\mathcal{C}$ defined in (16). The information cost is

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R)=\min\left\{n\ge n_1:\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}\le\varepsilon\big/\bigl[(A^\rho-1)^{1/\rho}R\bigr]\right\}. \qquad (18)
$$

There exist positive $\varepsilon_{\max}$ and $R_{\min}$ for which the computational complexity has the lower bound

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\varepsilon,R)\ge\min\left\{n\ge n_1:\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}\le 2\varepsilon/[(1-1/A)R]\right\}\qquad\forall\varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty). \qquad (19)
$$

Algorithm 2 is essentially optimal. Moreover, $\mathcal{A}(\mathcal{C})$ contains only adaptive algorithms.

• The upper bound on the computational cost of this algorithm is obtained by noting that

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R)\le\min\left\{n\ge n_1:(A^\rho-1)^{1/\rho}R\,\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}\le\varepsilon\right\},
$$

since for all $f\in\mathcal{C}\cap\mathcal{B}_R$, $\mathrm{ERR}\bigl((\hat{f}(\boldsymbol{k}_i))_{i=1}^{n},n\bigr)\le(A^\rho-1)^{1/\rho}R\,\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}$. Moreover, this inequality is tight for certain $f\in\mathcal{C}\cap\mathcal{B}_R$, namely, those for which $\hat{f}(\boldsymbol{k}_i)=0$ for $i>n_1$ and $\lVert f\rVert_{\mathcal{F}}=R$. This completes the proof of (18).

To prove the lower complexity bound, choose $\varepsilon_{\max}$ and $R_{\min}$ such that

$$
\lVert(\lambda_{\boldsymbol{k}_i})_{i=n_1+1}^{\infty}\rVert_{\rho'}>2\varepsilon_{\max}/[(1-1/A)R_{\min}].
$$

Let $\mathrm{ALG}'$ be any algorithm that satisfies the error criterion, (2), for this choice of $\mathcal{C}$ in (16). Fix $\varepsilon\in(0,\varepsilon_{\max}]$ and $R\in[R_{\min},\infty)$ arbitrarily. Two fooling functions will be constructed of the form $f_\pm=f_1\pm f_2$.

The input function $f_1$ is defined via its series coefficients as in Lemma 1, having nonzero coefficients only for $\boldsymbol{k}\in\mathcal{K}_1$:

$$
\bigl|\hat{f}_1(\boldsymbol{k})\bigr|=\begin{cases}\dfrac{R(1+1/A)\,\lambda_{\boldsymbol{k}}^{\rho'/\rho+1}}{2\,\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\in\mathcal{K}_1}\rVert_{\rho'}^{\rho'/\rho}},&\boldsymbol{k}\in\mathcal{K}_1,\\[1ex] 0,&\boldsymbol{k}\notin\mathcal{K}_1,\end{cases}\qquad\lVert f_1\rVert_{\mathcal{F}}=\frac{R(1+1/A)}{2}.
$$

Suppose that $\mathrm{ALG}'(f_1,\varepsilon)$ samples the series coefficients for $\boldsymbol{k}\in\mathcal{J}$, and let $n$ denote the cardinality of $\mathcal{J}$.

Now, construct the input function $f_2$, having zero coefficients for $\boldsymbol{k}\in\mathcal{J}$ and otherwise chosen as in Lemma 1:

$$
\lVert\mathrm{SOL}(f_2)\rVert_{\mathcal{G}}=\frac{R(1-1/A)}{2}\left\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\notin\mathcal{J}}\right\rVert_{\rho'}. \qquad (20)
$$

Let $f_\pm:=f_1\pm f_2$. By the definitions above, it follows that

$$
\lVert f_\pm\rVert_{\mathcal{F}}\le\lVert f_1\rVert_{\mathcal{F}}+\lVert f_2\rVert_{\mathcal{F}}\le R,\qquad\left\lVert\left(\frac{\hat{f}_\pm(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=1}^{n_1}\right\rVert_\rho=\left\lVert\left(\frac{\hat{f}_1(\boldsymbol{k}_i)\pm\hat{f}_2(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=1}^{n_1}\right\rVert_\rho=\lVert f_1\rVert_{\mathcal{F}}=\frac{R(1+1/A)}{2},
$$

since $\hat{f}_2(\boldsymbol{k})$ vanishes for $\boldsymbol{k}\in\mathcal{K}_1$. Therefore, $f_\pm\in\mathcal{C}\cap\mathcal{B}_R$. Moreover, since the series coefficients for $\boldsymbol{k}\in\mathcal{J}$ are the same for $f_+$ and $f_-$, it follows that $\mathrm{ALG}'(f_-,\varepsilon)=\mathrm{ALG}'(f_+,\varepsilon)$. Thus, the algorithm cannot distinguish $f_-$ from $f_+$.

Using an argument like that in the proof of Theorem 1, it follows that

$$
\begin{aligned}
\varepsilon&\ge\max_{\pm}\lVert\mathrm{SOL}(f_\pm)-\mathrm{ALG}'(f_\pm,\varepsilon)\rVert_{\mathcal{G}}=\max_{\pm}\lVert\mathrm{SOL}(f_\pm)-\mathrm{ALG}'(f_+,\varepsilon)\rVert_{\mathcal{G}}\\
&\ge\tfrac{1}{2}\Bigl[\lVert\mathrm{SOL}(f_+)-\mathrm{ALG}'(f_+,\varepsilon)\rVert_{\mathcal{G}}+\lVert\mathrm{SOL}(f_-)-\mathrm{ALG}'(f_+,\varepsilon)\rVert_{\mathcal{G}}\Bigr]\\
&\ge\lVert\mathrm{SOL}(f_2)\rVert_{\mathcal{G}}=\frac{R(1-1/A)}{2}\left\lVert(\lambda_{\boldsymbol{k}})_{\boldsymbol{k}\notin\mathcal{J}}\right\rVert_{\rho'}\ge\frac{R(1-1/A)}{2}\left\lVert(\lambda_{\boldsymbol{k}_i})_{i=n+1}^{\infty}\right\rVert_{\rho'},
\end{aligned}
$$

by the ordering of the $\lambda_{\boldsymbol{k}}$ in (1). By the choice of $\varepsilon_{\max}$ and $R_{\min}$ above, it follows that $n\ge n_1$. This inequality then implies lower complexity bound (19). Because the right-hand side of (19) tends to infinity as $R\to\infty$, it follows from Lemma 2 that $\mathcal{A}(\mathcal{C})$ contains only adaptive algorithms.

The essential optimality of Algorithm 2 follows by observing that

$$
\mathrm{COST}(\mathrm{ALG},\mathcal{C},\varepsilon,R)\le\mathrm{COMP}(\mathcal{A}(\mathcal{C}),\omega\varepsilon,R)\qquad\text{for }\omega=\frac{1-1/A}{2(A^\rho-1)^{1/\rho}}.
$$

This satisfies definition (14).

The above derivation assumes that $\lambda_{\boldsymbol{k}_{n_1+1}}>0$. If $\lambda_{\boldsymbol{k}_{n_1+1}}=0$, then our cone consists of functions whose series coefficients vanish for wavenumbers outside $\mathcal{K}_1$. The exact solution can be constructed using only the pilot sample. Our algorithm is then non-adaptive, but succeeds for input functions in the cone $\mathcal{C}$, which is an unbounded set.
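The stopping rule behind Algorithm 2 can be sketched as follows in a finite, truncated setting: starting from the pilot sample, increase $n$ until the data-driven bound $\mathrm{ERR}$ of (17b) drops below $\varepsilon$. All concrete values below ($\rho$, $\rho'$, $A$, $n_1$, the weights, the test input) are hypothetical, and the true algorithm works with infinite tails rather than a truncated weight list.

```python
import numpy as np

def adaptive_app(f_hat, lam, rho, rho_p, A, n1, eps):
    """Increase n until the data-driven bound ERR in (17b) is at most eps.

    f_hat, lam: coefficients and weights, already in the ordering (1);
    A > 1: inflation factor; n1: pilot sample size. Returns (n, error bound)."""
    pilot = np.linalg.norm(f_hat[:n1] / lam[:n1], rho)
    for n in range(n1, len(lam)):
        head = np.linalg.norm(f_hat[:n] / lam[:n], rho)
        tail = np.linalg.norm(lam[n:], rho_p)
        # ERR of (17b); the max() guards against rounding below zero.
        err = max(A ** rho * pilot ** rho - head ** rho, 0.0) ** (1.0 / rho) * tail
        if err <= eps:
            return n, err
    return len(lam), 0.0   # exhausted the truncated list: the sum is exact here

# Hypothetical input lying in the cone (16): its coefficients track the weights,
# so the pilot sample bounds the norm of the unseen tail.
lam = np.array([2.0 ** (-i) for i in range(25)])
f_hat = 0.9 * lam
n, err = adaptive_app(f_hat, lam, rho=2.0, rho_p=2.0, A=2.0, n1=4, eps=1e-3)
```

The sample size $n$ is chosen by the data, not a priori: a larger inflation factor $A$ or a slower coefficient decay forces more sampling before the bound certifies the tolerance.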

We may not be able to guarantee that a particular $f$ of interest lies in our cone, $\mathcal{C}$, but we may derive necessary conditions for $f$ to lie in $\mathcal{C}$. The following proposition follows from the definition of $\mathcal{C}$ in (16) and the fact that the term on the left below underestimates $\lVert f\rVert_{\mathcal{F}}$.

###### Proposition 1

If $f\in\mathcal{C}$, then

$$
\left\lVert\left(\frac{\hat{f}(\boldsymbol{k}_i)}{\lambda_{\boldsymbol{k}_i}}\right)_{i=1}^{n}\right\rVert_\rho\le A\left\lVert\left(\frac{\hat{f}(\boldsymbol{k})}{\lambda_{\boldsymbol{k}}}\right)_{\boldsymbol{k}\in\mathcal{K}_1}\right\rVert_\rho\qquad\forall n\in\mathbb{N}. \qquad (21)
$$

If condition (21) is violated in practice, then $f\notin\mathcal{C}$, and Algorithm 2 may output an incorrect answer. The remedy is to make $\mathcal{C}$ more inclusive by increasing the inflation factor, $A$, and/or the pilot sample size, $n_1$.
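The necessary condition (21) gives a cheap runtime diagnostic: track the partial norms and flag any $n$ at which they exceed $A$ times the pilot norm. A sketch under hypothetical parameters:

```python
import numpy as np

def violates_cone(f_hat, lam, rho, A, n1):
    """Check the necessary condition (21): partial norms must never exceed A times the pilot norm.

    Returns True if some partial norm violates (21), i.e. f is certainly outside the cone C."""
    ratios = np.abs(f_hat) / lam
    pilot = np.linalg.norm(ratios[:n1], rho)
    partial = [np.linalg.norm(ratios[:n], rho) for n in range(1, len(lam) + 1)]
    return any(p > A * pilot for p in partial)

# Hypothetical weights and two test inputs.
lam = np.array([2.0 ** (-i) for i in range(12)])
ok = 0.5 * lam            # coefficients tracking the weights: consistent with C
spiky = ok.copy()
spiky[-1] = 1.0           # a late, large coefficient that the pilot sample cannot foresee
# violates_cone flags spiky but not ok for, e.g., rho = 2, A = 2, n1 = 4.
```

Such a check cannot certify membership in $\mathcal{C}$ (no finite sample can), but a flagged violation proves non-membership and signals that $A$ or $n_1$ should be increased.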

### 2.2 Tractability

In this section, we write $\mathcal{C}_d$ instead of $\mathcal{C}$, to stress the dependence on $d$, and for the same reason we write $\lambda_{d,\boldsymbol{k}}$ instead of $\lambda_{\boldsymbol{k}}$. Recall that we assume that $\lVert\boldsymbol{\lambda}_d\rVert_{\rho'}<\infty$, cf. (8). Let

$$
n(\delta,d):=\min\left\{n\ge0:\lVert(\lambda_{d,\boldsymbol{k}_i})_{i=n+1}^{\infty}\rVert_{\rho'}\le\delta\right\}\qquad\forall\delta>0.
$$

From Equations (18) and (19), we obtain that

$$
\mathrm{COMP}(\mathcal{A}(\mathcal{C}_d),\omega_{\mathrm{lo}}\varepsilon,R)\le n(\varepsilon/R,d)\le\mathrm{COMP}(\mathcal{A}(\mathcal{C}_d),\omega_{\mathrm{hi}}\varepsilon,R)\qquad\forall\varepsilon\in(0,\varepsilon_{\max}],\ R\in[R_{\min},\infty),
$$

where the positive constants $\omega_{\mathrm{lo}}$ and $\omega_{\mathrm{hi}}$ depend on $A$ and $\rho$, but not on $d$, $\varepsilon$, or $R$. From the equation above, it is clear that tractability depends on the behavior of $n(\varepsilon/R,d)$ as $d$ and $R/\varepsilon$ tend to infinity. We would like to study under which conditions we obtain the various tractability notions defined in Section 1.5.

To this end, we distinguish two cases, depending on whether $\rho'$ is infinite or not. This distinction is useful because it allows us to relate the computational complexity of the algorithms considered in this chapter to the computational complexity of linear problems on certain function spaces considered in the classical literature on information-based complexity, as for example [8]. The case $\rho'=\infty$ corresponds to the worst-case setting, where one studies the worst performance of an algorithm over the unit ball of a space. The results in Theorem 4 below are indeed very similar to the results for the worst-case setting over balls of suitable function spaces. The case $\rho'<\infty$ corresponds to the so-called average-case setting, where one considers the average performance over a function space equipped with a suitable measure. For both of these settings there exist tractability results that we will make use of here.

#### Case 1: $\rho'=\infty$

If $\rho'=\infty$, we have, due to the monotonicity of the $\lambda_{d,\boldsymbol{k}_i}$,

$$
n(\varepsilon/R,d)=\min\bigl\{n\ge0:\lambda_{d,\boldsymbol{k}_{n+1}}\le\varepsilon/R\bigr\}.
$$

We then have the following theorem.

###### Theorem 4

Using the same notation as above, the following statements hold for the case $\rho'=\infty$.

• We have strong polynomial tractability if and only if there exist $i_0\in\mathbb{N}$ and $\eta>0$ such that

$$
\sup_{d\in\mathbb{N}}\sum_{i=i_0}^{\infty}\lambda_{d,\boldsymbol{k}_i}^{\eta}<\infty. \qquad (22)
$$

Furthermore, the exponent of strong polynomial tractability is then equal to the infimum of those $\eta$ for which (22) holds.

• We have polynomial tractability if and only if there exist non-negative $\eta_1$, $\eta_2$, $\eta_3$, and positive $K$ such that

$$
\sup_{d\in\mathbb{N}}d^{-\eta_1}\sum_{i=\lceil K d^{\eta_2}\rceil}^{\infty}\lambda_{d,\boldsymbol{k}_i}^{\eta_3}<\infty.
$$
• We have weak tractability if and only if

$$
\sup_{d\in\mathbb{N}}\exp(-cd)\sum_{i=1}^{\infty}\exp\!\left(-\frac{c}{\lambda_{d,\boldsymbol{k}_i}}\right)<\infty\qquad\text{for all }c>0. \qquad (23)
$$
• Letting , we see that