Learning Functions of Few Arbitrary Linear Parameters in High Dimensions


Abstract

Let us assume that $f$ is a continuous function defined on the unit ball of $\mathbb{R}^d$, of the form $f(x) = g(Ax)$, where $A$ is a $k \times d$ matrix and $g$ is a function of $k$ variables for $k \ll d$. We are given a budget of $m \in \mathbb{N}$ possible point evaluations $f(x_i)$, $i = 1, \dots, m$, of $f$, which we are allowed to query in order to construct a uniform approximating function. Under certain smoothness and variation assumptions on the function $g$, and an arbitrary choice of the matrix $A$, we present in this paper

1. a sampling choice of the points $\{x_i\}$ drawn at random for each function approximation;

2. algorithms (Algorithm 1 and Algorithm 2) for computing the approximating function, whose complexity is at most polynomial in the dimension $d$ and in the number $m$ of points.

Due to the arbitrariness of $A$, the choice of the sampling points will be according to suitable random distributions, and our results hold with overwhelming probability. Our approach uses tools taken from the compressed sensing framework, recent Chernoff bounds for sums of positive-semidefinite matrices, and classical stability bounds for invariant subspaces of singular value decompositions.

AMS subject classification (MSC 2010): 65D15, 03D32, 68Q30, 60B20, 60G50

Key Words: high dimensional function approximation, compressed sensing, Chernoff bounds for sums of positive-semidefinite matrices, stability bounds for invariant subspaces of singular value decompositions.

1 Introduction

1.1 Learning high dimensional functions from few samples

In large scale data analysis and learning, several real-life problems can be formulated as capturing or approximating a function $f$ defined on $\mathbb{R}^d$, with the dimension $d$ very large, from relatively few given samples or queries. The usual assumption on the class of functions to be recovered is smoothness. The more regular a function is, the more accurately and the more efficiently it can be numerically approximated. However, in the field of information-based complexity it has been clarified that such a problem is in general intractable, i.e., it does not have polynomial complexity. To clarify this poor approximation phenomenon, assume

$\mathcal{F}_d$ to be the class of smooth functions we would like to approximate. We define the sampling operator $S_n = \varphi_n \circ N_n$, where $N_n : \mathcal{F}_d \to \mathbb{R}^n$ is a suitable measurement operator and $\varphi_n : \mathbb{R}^n \to L_\infty$ a recovery map. For example, $N_n$ can take $n$ samples $f(x_1), \dots, f(x_n)$ of $f$, and $\varphi_n$ can be a suitable interpolation operator. The approximation error provided by such a sampling operator is given by

\[ e(S_n) := \sup_{f \in \mathcal{F}_d} \| f - S_n(f) \|_{L_\infty}. \]

With this notion we further define the approximation numbers

\[ e(n, d) := \inf_{S_n} e(S_n), \]

indicating the performance of the best sampling method, and

\[ n(\epsilon, d) := \inf \{ n \in \mathbb{N} : e(n, d) \leq \epsilon \}, \]  (1)

which is the minimal number of samples we need for the best sampling method to achieve a uniform accuracy $\epsilon$.

1.2 Intractability results

Recent results by Novak and Woźniakowski [24] state that for a uniform approximation over the class $\mathcal{F}_d$ of $C^\infty$-functions on $[0,1]^d$ with all derivatives bounded by one we have $e(n,d) = 1$ for all $n \leq 2^{\lfloor d/2 \rfloor} - 1$, or equivalently $n(\epsilon, d) \geq 2^{\lfloor d/2 \rfloor}$ for all $\epsilon \in (0,1)$. Hence, the number of samples needed to approximate even a $C^\infty$-function grows exponentially with the dimension $d$. This result seems to obliterate any hope for an efficient solution of the learning problem in high dimension, and this phenomenon is sometimes referred to as the curse of dimensionality.
Nevertheless, very often the high dimensional functions which we can expect as solutions to real-life problems exhibit more structure and are eventually much better behaved with respect to the approximation problem. There are several models currently appearing in the literature for which the approximation problem is tractable, i.e., the required number of samples does not grow exponentially with respect to the dimension $d$.

According to the behavior of the information complexity $n(\epsilon, d)$, cf. (1), for small $\epsilon$ and large $d$, one speaks about

  • polynomial tractability: if $n(\epsilon, d)$ depends polynomially on $\epsilon^{-1}$ and $d$,

  • strong polynomial tractability: if $n(\epsilon, d)$ depends polynomially only on $\epsilon^{-1}$,

  • weak tractability: if $\lim_{\epsilon^{-1} + d \to \infty} \frac{\ln n(\epsilon, d)}{\epsilon^{-1} + d} = 0$.

We point to [23, Chapters 1 and 2] for further notions of tractability and many references.
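For orientation, here is a hedged worked example; the complexity bounds below are purely hypothetical and not attached to any particular function class:

\[ n(\epsilon,d) = \lceil \epsilon^{-2} d^{3} \rceil \;\Rightarrow\; \text{polynomial tractability}, \qquad n(\epsilon,d) = \lceil \epsilon^{-2} \rceil \;\Rightarrow\; \text{strong polynomial tractability}, \]
\[ n(\epsilon,d) = 2^{\lfloor d/2 \rfloor} \;\Rightarrow\; \text{no weak tractability, since } \frac{\ln n(\epsilon,d)}{\epsilon^{-1} + d} \to \frac{\ln 2}{2} \neq 0 \text{ as } d \to \infty \text{ for fixed } \epsilon. \]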

In the next two subsections we will recount a few relevant approaches leading in some cases to (some sort of) tractability.

1.3 Functions of few variables

A function $f$ of $d$ variables ($d$ large) may be a sum of functions $g_\ell$, each of which only depends on $k$ variables ($k$ small):

\[ f(x_1, \dots, x_d) = \sum_{\ell=1}^{m} g_\ell\big(x_{i_{\ell,1}}, \dots, x_{i_{\ell,k}}\big). \]  (2)

In optimization such functions are called partially separable. This model arises for instance in physics, when we consider problems involving interaction potentials, such as the Coulomb potential in electronic structure computations, or in social and economical models describing multiagent dynamics. Once $k$ is fixed and $d \to \infty$, the learning problem for such functions is tractable, even if the $g_\ell$ are not very smooth. We specifically refer to the recent work of DeVore, Petrova, and Wojtaszczyk [13], which describes an adaptive method for the recovery of high dimensional functions in this class, for $m = 1$.
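As a purely illustrative sketch (the index sets and component functions below are hypothetical, not taken from [13]), here is a partially separable function of the form (2) in Python with $d = 1000$ and $k = 2$:

import numpy as np

# Hypothetical example of the model (2): f is a sum of functions,
# each depending on only k = 2 of the d = 1000 coordinates.
d = 1000
index_sets = [(0, 1), (17, 256), (3, 999)]          # chosen arbitrarily for illustration
components = [lambda u, v: u * v,                    # g_1
              lambda u, v: np.sin(u + v),            # g_2
              lambda u, v: (u - v) ** 2]             # g_3

def f(x):
    """Evaluate the partially separable function at a point x in R^d."""
    return sum(g(x[i], x[j]) for g, (i, j) in zip(components, index_sets))

x = np.random.randn(d) / np.sqrt(d)   # a generic point, roughly inside the unit ball
print(f(x))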
This model can be extended to functions which depend only approximately on few variables, by considering the unit ball of a weighted Sobolev space of functions $f$ with

\[ \|f\|_{d,\gamma}^2 := \sum_{u \subseteq \{1,\dots,d\}} \gamma_{d,u}^{-1} \int_{[0,1]^d} \Big| \frac{\partial^{|u|}}{\partial x_u} f(x) \Big|^2 \, dx \leq 1, \]  (3)

where the $\gamma_{d,u}$ are non-negative weights; the convention $0/0 = 0$ and the choice $\gamma_{d,u} = 0$ for $|u| > k$ lead us again to the model (2). A study of the tractability of this class, for various weights, can be found in [23].

1.4 Functions of one linear parameter in high dimensions

One of the weaknesses of the model classes introduced above is that they are very coordinate biased. It would be desirable to have results for a class of basis changes which would make the model basis-independent. A general model assumes that

\[ f(x) = g(Ax) \]  (4)

for $A$ an arbitrary $k \times d$ matrix. While solutions to this unconstrained problem have so far been elusive, the special case of

\[ f(x) = g(a \cdot x), \]  (5)

where $a$ is a stochastic vector, i.e., $a_j \geq 0$ for $j = 1, \dots, d$ and $\sum_{j=1}^{d} a_j = 1$, and $g$ is a $C^s$ function for $s > 1$, has been fully addressed with an optimal recovery method in [11].

The aim of this work is to find an appropriate formulation of the general model (4), which generalizes both the model of few active coordinates and the model of one stochastic vector, and to analyze the tractability of the corresponding approximation problem. The rest of the paper is organized as follows. After introducing some basic notations, the next section is dedicated to the motivation and discussion of the generalized model. As an introduction to our formulation and solution approach, we then proceed to analyze the simple case of one active direction in Section 3, under milder assumptions on the vector $a$, before finally addressing the fully generalized problem in Section 4. The last section is dedicated to the discussion of further extensions of our approach, to be addressed in successive papers.

1.5 Notations

In the following we will deal exclusively with real matrices and we denote the space of real $n \times m$ matrices by $\mathbb{R}^{n \times m}$. The entries of a matrix $A$ are denoted by the corresponding lower case letters and indices, i.e., $A = (a_{ij})_{i,j}$. The transposed matrix of a matrix $A$ is the matrix $A^T$ with entries $(A^T)_{ij} = a_{ji}$. For $A \in \mathbb{R}^{n \times m}$ we can write its (reduced) singular value decomposition [19] as

\[ A = U \Sigma V^T, \]

with $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{m \times p}$, $p = \operatorname{rank}(A)$, matrices with orthonormal columns, and $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_p)$ a diagonal matrix where $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_p > 0$ are the singular values. For specific matrices we write the singular value decomposition as $A = U_A \Sigma_A V_A^T$.

For symmetric, positive semidefinite matrices, i.e., $A = A^T$ and $x^T A x \geq 0$ for all vectors $x$, we can take $U = V$, and the singular value decomposition is equivalent to the eigenvalue decomposition. Note also that $\|A\|_{2 \to 2} = \sigma_1 = \sqrt{\lambda_1(A^T A)}$, where $\lambda_1(A^T A)$ is the largest eigenvalue of the matrix $A^T A$ (actually, this holds for $n \geq m$, whereas we may want to consider $A A^T$ instead of $A^T A$ if $n < m$). The rank of $A$, denoted by $\operatorname{rank}(A)$, is the number of nonzero singular values. We define the Frobenius norm of a matrix $A$ as

\[ \|A\|_F := \Big( \sum_{i,j} a_{ij}^2 \Big)^{1/2}. \]

It is also convenient to introduce the vector norms

\[ \|x\|_{\ell_p^d} := \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p}, \quad 0 < p < \infty, \qquad \|x\|_{\ell_\infty^d} := \max_{j=1,\dots,d} |x_j|. \]

We denote by $I$ the identity matrix. The symbol $B_{\mathbb{R}^d}(1)$ stands for the unit ball and $B_{\mathbb{R}^d}(r)$ for the ball of radius $r$ in $\mathbb{R}^d$, both centered at the origin. The unit sphere in $\mathbb{R}^d$ is denoted by $\mathbb{S}^{d-1}$. Finally, $\mu$ indicates the Lebesgue measure in $\mathbb{R}^d$.
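The following short numpy sketch (purely illustrative, not part of the paper's algorithms) exercises the objects just introduced: the reduced SVD, the singular values, the Frobenius and spectral norms, and the $\ell_p$ norms.

import numpy as np

A = np.random.randn(5, 3)                          # a generic 5 x 3 real matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U diag(s) V^T
print("singular values:", s)                       # sigma_1 >= sigma_2 >= ... > 0
print("rank:", np.linalg.matrix_rank(A))           # number of nonzero singular values
print("Frobenius norm:", np.linalg.norm(A, "fro"))
print("spectral norm = sigma_1:", np.linalg.norm(A, 2), s[0])

x = np.random.randn(3)
for p in (1, 2, np.inf):
    print(f"l_{p} norm:", np.linalg.norm(x, p))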

2 The General Model and Its Simplifications

The first approach one may be tempted to consider as a generalization of (5) is to ask that $f$ is of the form $f(x) = g(Ax)$, where $A$ is a $k \times d$ stochastic matrix with orthonormal rows, i.e., $\sum_{j=1}^{d} a_{ij} = 1$ and $a_{ij} \geq 0$ for all $i, j$, $A A^T = I$, and $g$ is a $C^s$ function for $s > 1$. There are, however, two main problems with this formulation. The conditions of stochasticity and orthonormality of the rows of $A$ together are very restrictive - the only matrices satisfying both of them are those having at most one nonzero entry per column - and the domain of $g$ cannot be chosen generically as $[0,1]^k$ but depends on $A$, i.e., it is the k-dimensional polytope $A[0,1]^d$. Thus we will at first return to the unconstrained model in (4) and give up the conditions of stochasticity and orthonormality. This introduces rotational invariance for the rows of $A$, and the quadrant defined by $x_j \geq 0$ is no longer set apart as search space. In consequence, and to avoid the complications arising with the polytope, we will therefore focus on functions defined on the Euclidean ball.
To be precise, we consider functions of the form (4), where $A \in \mathbb{R}^{k \times d}$ is an arbitrary matrix whose rows $a_1, \dots, a_k$ are contained in the $\ell_q^d$-ball of radius $C_1$, for some $0 < q \leq 1$, i.e.,

\[ \max_{i=1,\dots,k} \| a_i \|_{\ell_q^d} \leq C_1. \]

Further, we assume that the function $g$ is defined on the image of $B_{\mathbb{R}^d}(1)$ under the matrix $A$ and is twice continuously differentiable on this domain, i.e., $g \in C^2(A B_{\mathbb{R}^d}(1))$, and

\[ \sup_{|\beta| \leq 2} \| D^\beta g \|_\infty \leq C_2. \]

For $\mu_{\mathbb{S}^{d-1}}$ the uniform surface measure on the sphere $\mathbb{S}^{d-1}$ we define the matrix

\[ H^f := \int_{\mathbb{S}^{d-1}} \nabla f(x)\, \nabla f(x)^T \, d\mu_{\mathbb{S}^{d-1}}(x). \]  (6)

From the identity $\nabla f(x) = A^T \nabla g(Ax)$ we get that

\[ H^f = A^T \left( \int_{\mathbb{S}^{d-1}} \nabla g(Ax)\, \nabla g(Ax)^T \, d\mu_{\mathbb{S}^{d-1}}(x) \right) A, \]  (7)

and therefore that the rank of $H^f$ is $k$ or less. We will require $H^f$ to be well conditioned, i.e., that its singular values satisfy $\sigma_1(H^f) \geq \dots \geq \sigma_k(H^f) \geq \alpha > 0$.
The parameters in our model are the dimension $d$ (large), the linear parameter dimension $k$ (small), the nonnegative constants $C_1$, $C_2$, $\alpha$, and the exponent $0 < q \leq 1$.
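A hedged numerical sketch (not from the paper): the matrix $H^f$ in (6) can be approximated by Monte Carlo integration over the sphere, replacing $\nabla f$ by finite differences. The function names, sample size, and step sizes below are illustrative choices.

import numpy as np

def grad_fd(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def approx_Hf(f, d, n_samples=2000, seed=0):
    """Monte Carlo approximation of H^f = int_{S^{d-1}} grad f grad f^T dmu."""
    rng = np.random.default_rng(seed)
    H = np.zeros((d, d))
    for _ in range(n_samples):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)            # uniform point on the sphere S^{d-1}
        g = grad_fd(f, x)
        H += np.outer(g, g)
    return H / n_samples

# Illustrative test: f(x) = g(Ax) with a random row-orthonormal 2 x 20 matrix A.
d, k = 20, 2
A = np.linalg.qr(np.random.randn(d, k))[0].T       # rows are orthonormal
f = lambda x: np.sin(A @ x).sum()                  # a smooth g of k variables
H = approx_Hf(f, d)
print(np.round(np.linalg.svd(H, compute_uv=False)[:4], 3))  # ~k nonzero singular values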
We now show that such a model can be simplified as follows. First of all we see that giving up the orthonormality condition on the rows of $A$ was actually unnecessary. Let us consider the singular value decomposition $A = U \Sigma V^T$; hence we rewrite

\[ f(x) = g(Ax) = g(U \Sigma V^T x) = \tilde{g}(\tilde{A} x), \]

where $\tilde{A} = V^T$ and $\tilde{g}(y) = g(U \Sigma y)$. In particular, by simple direct computations,

  • the rows of $\tilde{A} = V^T$ again have $\ell_q^d$-norm bounded by a constant depending only on $C_1$, $k$, and the singular values of $A$, and

  • $\tilde{g}$ is again twice continuously differentiable, with derivatives bounded by a constant depending only on $C_2$ and the singular values of $A$.

Hence, by possibly considering different constants $\tilde{C}_1$ and $\tilde{C}_2$, we can always assume that $A A^T = I$, meaning that $A$ is row-orthonormal. Note that for a row-orthonormal matrix $A$, equation (7) tells us that the singular values of $H^f$ are the same as those of the $k \times k$ matrix $H^g$, where

\[ H^g := \int_{\mathbb{S}^{d-1}} \nabla g(Ax)\, \nabla g(Ax)^T \, d\mu_{\mathbb{S}^{d-1}}(x). \]
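A small Python sketch (illustrative only) of the reduction just described: starting from an arbitrary $A$, the reparameterization $\tilde{A} = V^T$, $\tilde{g}(y) = g(U \Sigma y)$ leaves the function values unchanged while making the linear parameter row-orthonormal.

import numpy as np

d, k = 30, 3
rng = np.random.default_rng(1)
A = rng.standard_normal((k, d))                 # arbitrary k x d matrix
g = lambda y: np.exp(-np.sum(y ** 2))           # a smooth function of k variables

# Reduced SVD of A: A = U diag(s) V^T with V^T row-orthonormal.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = Vt                                    # row-orthonormal replacement of A
g_tilde = lambda y: g(U @ (s * y))              # absorbs U and Sigma into g

x = rng.standard_normal(d); x /= (2 * np.linalg.norm(x))   # a point in the unit ball
print(g(A @ x), g_tilde(A_tilde @ x))           # identical values: g(Ax) = g_tilde(A_tilde x)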

The following simple result states that our model is almost well-defined. As we will see later, the conditions on $A$ and $g$ will be sufficient for the unique identification of $f$ by approximation up to any accuracy, but not necessarily for the unique identification of $A$ and $g$.

Lemma 2.1.

Assume that $f(x) = g_1(A_1 x) = g_2(A_2 x)$ with two matrices $A_1, A_2 \in \mathbb{R}^{k \times d}$ such that $A_1 A_1^T = A_2 A_2^T = I$ and that $H^f$ has rank $k$. Then $A_2 = O A_1$ for some orthonormal matrix $O \in \mathbb{R}^{k \times k}$.

Proof.

Because $A_1$ and $A_2$ are row-orthonormal, the singular values of $H^{g_1}$ and $H^{g_2}$ are the same as those of $H^f$, i.e., we have $H^{g_1} = W_1 \Sigma W_1^T$ and $H^{g_2} = W_2 \Sigma W_2^T$, where $\Sigma$ is a diagonal matrix containing the singular values of $H^f$ in nonincreasing order and $W_1, W_2$ are orthonormal matrices. Inserting this into (7) we get

\[ H^f = (A_1^T W_1)\, \Sigma\, (A_1^T W_1)^T = (A_2^T W_2)\, \Sigma\, (A_2^T W_2)^T. \]

$A_1^T W_1$ and $A_2^T W_2$ both have orthonormal columns, so we have two singular value decompositions of $H^f$. Because the singular vectors are unique up to an orthonormal transform, we have $A_2^T W_2 = A_1^T W_1 Q$ for some orthonormal matrix $Q$, or $A_2 = O A_1$ for $O = W_2 Q^T W_1^T$, which is by construction orthonormal. ∎

With the above observations in mind, let us now restate the problem we are addressing and summarize our requirements. We restrict the learning problem to functions of the form $f(x) = g(Ax)$, where $A \in \mathbb{R}^{k \times d}$ is row-orthonormal with compressible rows and $x \in B_{\mathbb{R}^d}(1)$. As we are interested in recovering $f$ from a small number of samples, the accuracy will depend on the smoothness of $g$. In order to get simple convergence estimates, we require $g \in C^2(A B_{\mathbb{R}^d}(1))$. These choices determine two positive constants $C_1$ and $C_2$ for which

\[ \max_{i=1,\dots,k} \| a_i \|_{\ell_q^d} \leq C_1 \quad \text{for some } 0 < q \leq 1, \]  (8)

and

\[ \sup_{|\beta| \leq 2} \| D^\beta g \|_\infty \leq C_2. \]  (9)

For the problem to be well-conditioned we need that the matrix $H^g$ is positive definite, i.e.,

\[ \sigma_k(H^f) = \sigma_k(H^g) \geq \alpha, \]  (10)

for a fixed constant $\alpha > 0$ (actually, later we may simply choose $\alpha = \sigma_k(H^f)$).

Remark 1.

Let us briefly comment on condition (10) in the simplest case $k = 1$, by showing that such a condition is actually necessary in order to formulate a tractable algorithm for the uniform approximation of $f$ from point evaluations.
The optimal choice of $\alpha$ is given by

\[ \alpha = \sigma_1(H^f) = \int_{\mathbb{S}^{d-1}} |g'(ax)|^2 \, d\mu_{\mathbb{S}^{d-1}}(x), \]  (11)

cf. Theorem 3.7. Furthermore, we consider a function $g$ which is strictly positive only in a small neighborhood of $1$ and zero otherwise. Notice that, for every unit vector $a$, the corresponding function $f(x) = g(ax)$ vanishes everywhere on $\mathbb{S}^{d-1}$ outside of a small spherical cap around $a^T$, see Figure 1. The measure of this cap obviously does not depend on $a$ and is known to be exponentially small in $d$ [21], see also Section 3.3. Furthermore, it is known that there is a constant $c > 0$ and exponentially many (in $d$) unit vectors $a_1, a_2, \dots$ such that the corresponding caps are mutually disjoint. Finally, we observe that for such a function $g$ the quantity $\alpha$ in (11) is exponentially small in $d$.

Figure 1: The function $f(x) = g(ax)$ and the spherical cap on which it does not vanish.

We conclude that any algorithm making use only of the structure of $f$ and of the condition (9) needs to use exponentially many sampling points in order to distinguish between $f \equiv 0$ and $f(x) = g(a_i x)$ for some of the $a_i$'s as constructed above. Hence, some additional conditions like (8) and (10) are actually necessary to avoid the curse of dimensionality and to achieve at least some sort of tractability. Let us observe that $\alpha$ decays exponentially with $d$ for the function considered above. We shall further discuss the role of $\alpha$ in Section 3.3.
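A quick, purely illustrative Monte Carlo check (ours, not from the paper) of the claim that a spherical cap $\{x \in \mathbb{S}^{d-1} : a \cdot x \geq t\}$ has measure decaying rapidly with the dimension $d$; the threshold $t = 0.5$ is an arbitrary choice.

import numpy as np

def cap_measure(d, t=0.5, n=200_000, seed=0):
    """Monte Carlo estimate of mu({x on S^{d-1} : a.x >= t}); by symmetry take a = e_1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform samples on S^{d-1}
    return np.mean(x[:, 0] >= t)

for d in (5, 10, 20, 40):
    print(d, cap_measure(d))    # the estimated measure shrinks quickly as d grows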

Contrary to the approach in [11], our strategy to learn functions of the type (4) is to first find an approximation $\hat{A}$ to $A$. Once this is known, we will give a pointwise definition of a function $\hat{g}$ on $\hat{A} B_{\mathbb{R}^d}(1)$ such that $\hat{f}(x) := \hat{g}(\hat{A}x)$ is a good approximation to $f$ on $B_{\mathbb{R}^d}(1)$. This will be done in such a way that the evaluation of $\hat{g}$ at one point will require only one function evaluation of $f$. Consequently, an approximation of $\hat{g}$ on its domain using standard techniques, like sampling on a regular grid and spline-type approximations, will require a number of function evaluations of $f$ depending only on the desired accuracy and $k$, but not on $d$. We will therefore restrict our analysis to the problem of finding $\hat{A}$, defining $\hat{g}$, and the number of queries necessary to do that.
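The following sketch illustrates one plausible realization of this pointwise definition, under the assumption (ours, not a statement of the paper) that $\hat{g}(y) := f(\hat{A}^T y)$ is used once a row-orthonormal approximation $\hat{A}$ is available; then each evaluation of $\hat{g}$ costs exactly one evaluation of $f$.

import numpy as np

def make_g_hat(f, A_hat):
    """Pointwise definition of g_hat on the image of the unit ball under A_hat.

    Assumes A_hat has (approximately) orthonormal rows, so that A_hat^T y lies
    in the unit ball whenever y is in A_hat * (unit ball)."""
    def g_hat(y):
        return f(A_hat.T @ y)        # one query of f per evaluation of g_hat
    return g_hat

# Illustration with the true A (in practice A_hat comes from Algorithm 1 or 2).
d, k = 50, 2
A = np.linalg.qr(np.random.randn(d, k))[0].T      # row-orthonormal k x d matrix
g = lambda y: np.cos(y[0]) + y[1] ** 2
f = lambda x: g(A @ x)

g_hat = make_g_hat(f, A)
f_hat = lambda x: g_hat(A @ x)                    # the resulting approximation of f
x = np.random.randn(d); x /= (2 * np.linalg.norm(x))
print(f(x), f_hat(x))                             # equal when A_hat = A exactly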

3 The One Dimensional Case

For the sake of an easy introduction, we start by addressing our recovery method again in the simplest case of a ridge function

\[ f(x) = g(ax), \]  (12)

where $a \in \mathbb{R}^d$ is a row vector with $\|a\|_{\ell_2^d} = 1$, and $g$ is a function from the image of $B_{\mathbb{R}^d}(1)$ under $a$ to $\mathbb{R}$, i.e., $g : [-1, 1] \to \mathbb{R}$.
The ridge function terminology was introduced in the 1970s by Logan and Shepp [22] in connection with the mathematics of computerized tomography. However, these functions have been considered for some time under the name of plane waves; see, for example, [12, 20]. Ridge functions and ridge function approximation are also studied in statistics, where they often go under the name of projection pursuit. Projection pursuit algorithms approximate a function $f$ of $d$ variables by functions of the form

\[ \sum_{\ell=1}^{m} g_\ell(a_\ell x). \]  (13)

Hence the recovery of $f$ in (12) from few samples can be seen as an instance of the projection pursuit problem. For a survey of some approximation-theoretic questions concerning ridge functions and their connections to neural networks, see [27] and the references therein, as well as the work of Candès and Donoho on ridgelet approximation [5, 6, 7].
For further clarity of notation, in the following we will assume $a$ to be a row vector, i.e., a $1 \times d$ matrix, while other vectors, e.g., $x \in \mathbb{R}^d$, are always assumed to be column vectors. Hence the symbol $ax$ stands for the product of the matrix $a$ with the vector $x$, i.e., their scalar product.

3.1 The Algorithm

As in [11], a basic ingredient of the algorithm is a version of Taylor's theorem giving access to the vector $a$. For $x, \varphi \in \mathbb{R}^d$ and $\epsilon > 0$, with $x$ and $x + \epsilon\varphi$ in the domain of $f$, we have, by Taylor expansion, the identity

\[ \frac{f(x + \epsilon\varphi) - f(x)}{\epsilon} = g'(ax)\, a\varphi + \frac{\epsilon}{2}\, \varphi^T \nabla^2 f(\zeta)\, \varphi, \]  (14)

for a suitable $\zeta$ on the segment between $x$ and $x + \epsilon\varphi$. Thanks to our assumptions (8) and (9), the term $\frac{\epsilon}{2} \varphi^T \nabla^2 f(\zeta) \varphi$ is uniformly bounded as soon as $\|\varphi\|_{\ell_2^d}$ is bounded. We will consider the above equality for several directions $\varphi_i$ and at several sampling points $x_j$.
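A small numerical illustration (ours) of identity (14): for a ridge function, the divided difference along a direction $\varphi$ equals the directional derivative $g'(ax)\, a\varphi$ up to a remainder of order $\epsilon$.

import numpy as np

d = 200
rng = np.random.default_rng(3)
a = rng.standard_normal(d); a /= np.linalg.norm(a)   # unit row vector
g = np.sin                                            # a smooth univariate g, g' = cos
f = lambda x: g(a @ x)

x = rng.standard_normal(d); x /= (2 * np.linalg.norm(x))
phi = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)    # a Bernoulli-type direction
eps = 1e-3

lhs = (f(x + eps * phi) - f(x)) / eps                 # divided difference
rhs = np.cos(a @ x) * (a @ phi)                       # g'(ax) * (a phi)
print(lhs, rhs, abs(lhs - rhs))                       # difference is O(eps)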

To be more precise we define two sets of points. The first,

\[ \mathcal{X} = \{ x_j \in \mathbb{S}^{d-1} : j = 1, \dots, m_{\mathcal{X}} \}, \]  (15)

contains the sampling points and is drawn at random in $B_{\mathbb{R}^d}(1)$ according to the probability measure $\mu_{\mathbb{S}^{d-1}}$. For the second, containing the derivative directions, we have

\[ \Phi = \Big\{ \varphi_i \in \mathbb{R}^d : \varphi_{i\ell} = \pm \tfrac{1}{\sqrt{m_\Phi}} \text{ independently with equal probability},\ \ell = 1, \dots, d,\ i = 1, \dots, m_\Phi \Big\}. \]  (16)

Actually we identify $\Phi$ with the $m_\Phi \times d$ matrix whose rows are the vectors $\varphi_i$. To write the instances of (14) in a concise way we collect the derivatives $\nabla f(x_j) = g'(a x_j)\, a^T$, $j = 1, \dots, m_{\mathcal{X}}$, as columns in the $d \times m_{\mathcal{X}}$ matrix $X$, i.e.,

\[ X = \big( \nabla f(x_1) \,|\, \dots \,|\, \nabla f(x_{m_{\mathcal{X}}}) \big), \]  (17)

and we define the matrices $Y$ and $\mathcal{E}$ entrywise by

\[ y_{ij} = \frac{f(x_j + \epsilon \varphi_i) - f(x_j)}{\epsilon}, \]  (18)

and

\[ \varepsilon_{ij} = \frac{\epsilon}{2}\, \varphi_i^T \nabla^2 f(\zeta_{ij})\, \varphi_i. \]  (19)

We denote by $y_j$ the columns of $Y$ and by $\varepsilon_j$ the columns of $\mathcal{E}$, $j = 1, \dots, m_{\mathcal{X}}$. With these matrices we can write the following factorization

\[ Y = \Phi X + \mathcal{E}. \]  (20)

The algorithm we propose to approximate the vector $a$ is now based on the fact that the matrix $X$ has a very special structure, i.e., $X = a^T G$, where $G = \big( g'(a x_1), \dots, g'(a x_{m_{\mathcal{X}}}) \big)$. In other words, every column $x_j = g'(a x_j)\, a^T$ is a scaled copy of the vector $a^T$ and compressible if $a$ is compressible. We define a vector to be compressible, informally, by saying that it can be well approximated by a sparse vector. Actually, for any vector $z$ with small $\ell_q$-norm, $0 < q \leq 1$, the error of its best $K$-term approximation $z_{[K]}$ can be controlled by the following well-known estimate

\[ \frac{\| z - z_{[K]} \|_{\ell_1^d}}{\sqrt{K}} \leq K^{\frac{1}{2} - \frac{1}{q}} \| z \|_{\ell_q^d}. \]  (21)
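A short numerical check (ours) of the best $K$-term approximation estimate (21) for a vector with small $\ell_q$ quasi-norm; $q = 1/2$ and the decay profile are arbitrary illustrative choices.

import numpy as np

d, K, q = 1000, 10, 0.5
z = np.abs(np.random.randn(d)) * (np.arange(1, d + 1) ** (-3.0))
# z has rapidly decaying entries, hence a small l_q quasi-norm for q = 1/2.

z_sorted = np.sort(np.abs(z))[::-1]
best_K_error_l1 = np.sum(z_sorted[K:])                  # || z - z_[K] ||_1
lq_norm = np.sum(np.abs(z) ** q) ** (1.0 / q)           # || z ||_q (quasi-norm)
print(best_K_error_l1 / np.sqrt(K), K ** (0.5 - 1.0 / q) * lq_norm)  # lhs <= rhs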

Thus, by changing viewpoint to get

\[ y_j = \Phi x_j + \varepsilon_j, \qquad j = 1, \dots, m_{\mathcal{X}}, \]

we see that, due to the random construction of $\Phi$, we actually have a compressed sensing problem, and known theory tells us that we can recover a stable approximation $\hat{x}_j$ to $x_j$ via $\ell_1$-minimization (see Theorem 3.2 for the precise statement). To get an approximation of $a$ we then simply have to set $\hat{a} = \hat{x}_j / \|\hat{x}_j\|_{\ell_2^d}$ for $j$ such that $\|\hat{x}_j\|_{\ell_2^d}$ is maximal. From these informal ideas we derive the following algorithm.

Algorithm 1:

  • Given $m_{\mathcal{X}}$, $m_\Phi$, and $\epsilon > 0$, draw at random the sets $\mathcal{X}$ and $\Phi$ as in (15) and (16), and construct $Y$ according to (18).

  • Find

\[ \hat{x}_j := \Delta(y_j) := \arg\min_{\Phi z = y_j} \| z \|_{\ell_1^d}, \qquad j = 1, \dots, m_{\mathcal{X}}. \]  (22)

  • Set $j_0 := \arg\max_{j} \| \hat{x}_j \|_{\ell_2^d}$.

  • Define $\hat{a} := \hat{x}_{j_0} / \| \hat{x}_{j_0} \|_{\ell_2^d}$ and $\hat{f}(x) := \hat{g}(\hat{a} x) := f(\hat{a}^T \hat{a} x)$.
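Below is a minimal Python sketch of Algorithm 1 as reconstructed above, assuming equality-constrained $\ell_1$-minimization as the decoder $\Delta$ (solved here as a linear program); the step size, problem sizes, and test function are illustrative choices, not the paper's.

import numpy as np
from scipy.optimize import linprog

def l1_decoder(Phi, y):
    """Delta(y): minimize ||z||_1 subject to Phi z = y, via the LP z = u - v, u, v >= 0."""
    m, d = Phi.shape
    c = np.ones(2 * d)                          # objective: sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([Phi, -Phi])               # Phi u - Phi v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

def algorithm1(f, d, m_X=20, m_Phi=60, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Sampling points on the sphere, cf. (15), and Bernoulli directions, cf. (16).
    X = rng.standard_normal((m_X, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    # Divided differences (18): y_ij = (f(x_j + eps*phi_i) - f(x_j)) / eps.
    x_hats = []
    for x in X:
        y = np.array([(f(x + eps * phi) - f(x)) / eps for phi in Phi])
        x_hats.append(l1_decoder(Phi, y))
    # Pick the column with the largest norm and normalize it.
    norms = [np.linalg.norm(xh) for xh in x_hats]
    a_hat = x_hats[int(np.argmax(norms))]
    return a_hat / np.linalg.norm(a_hat)

# Illustration: a compressible (here exactly 5-sparse) unit vector a in dimension 100.
d = 100
a = np.zeros(d); a[[3, 17, 42, 66, 91]] = np.random.randn(5); a /= np.linalg.norm(a)
f = lambda x: np.sin(a @ x)
a_hat = algorithm1(f, d)
print(min(np.linalg.norm(a_hat - a), np.linalg.norm(a_hat + a)))  # small, up to sign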

The quality of the final approximation clearly depends on the error between $\hat{x}_{j_0}$ and $x_{j_0}$, which can be controlled through the number of compressed sensing measurements $m_\Phi$, and on the size of $\|x_{j_0}\|_{\ell_2^d} = |g'(a x_{j_0})|$, which is related to the number of random samples $m_{\mathcal{X}}$. If condition (10) is satisfied with $\alpha$ large, we shall show in Lemma 3.6, with the help of Hoeffding's inequality, that also $\max_j |g'(a x_j)|$ is large with high probability. If the value of $\alpha$ is unknown and small, the values of $\|\hat{x}_j\|_{\ell_2^d}$ produced by Algorithm 1 could be small as well and, as discussed after formula (11), no reliable and tractable approximation procedure is possible.

To be precise, in the next section we will prove the following approximation result.

Theorem 3.1.

Let $\epsilon > 0$ and $m_{\mathcal{X}}, m_\Phi \in \mathbb{N}$. Then there is a constant $c > 0$ such that, using $m_{\mathcal{X}}(m_\Phi + 1)$ function evaluations of $f$, Algorithm 1 defines a function $\hat{f}$ that, with probability

(23)

will satisfy

(24)

where

(25)

and the constant depends only on $C_1$ and $C_2$ from (8) and (9).

Remark 2.

1. We shall fix the quantity defined by (25) for the rest of this section. Furthermore, we suppose that the selected parameters ($m_{\mathcal{X}}$ and $m_\Phi$) are such that the corresponding condition holds. See Remark 4 (ii) for how we can circumvent in practice the case that this condition does not hold, which would clearly invalidate the approximation (24).

2. In order to show a concrete application of the previous result, let us consider, for simplicity, a class of uniformly smooth functions such that, by Proposition 3.8, the constant $\alpha$ is independent of the dimension $d$. If additionally we choose the parameters $\epsilon$, $m_{\mathcal{X}}$, and $m_\Phi$ suitably in terms of $d$ and the desired accuracy, then, according to Theorem 3.1, we obtain a uniform error estimate

with high probability. Notice that, for such choices, the number of evaluation points is actually independent of the dimension $d$.

3.2 The Analysis

We will first show that $\hat{x}_j$ is a good approximation to $x_j$ for all $j = 1, \dots, m_{\mathcal{X}}$. This follows from results in the framework of compressed sensing [3, 8, 10, 14, 16, 18, 17]. In particular, we state the following useful result, which is a specialization of Theorem 1.2 from [36] to the case of Bernoulli matrices.

Theorem 3.2.

Assume that $\Phi$ is an $m_\Phi \times d$ random matrix with all entries being independent Bernoulli variables scaled by $1/\sqrt{m_\Phi}$, see, e.g., (16).

(i) Let $0 < \delta < 1$. Then there are two positive constants $c_1, c_2 > 0$, depending only on $\delta$, such that the matrix $\Phi$ has the Restricted Isometry Property

\[ (1 - \delta) \| z \|_{\ell_2^d}^2 \leq \| \Phi z \|_{\ell_2^{m_\Phi}}^2 \leq (1 + \delta) \| z \|_{\ell_2^d}^2 \]  (26)

for all $z$ such that $\# \operatorname{supp}(z) \leq c_1 m_\Phi / \log(d / m_\Phi)$, with probability at least

\[ 1 - e^{-c_2 m_\Phi}. \]  (27)

(ii) Let us suppose that $m_\Phi \leq d$. Then there are positive constants $c_3, c_4$, and $C > 0$ such that, with probability at least

\[ 1 - e^{-c_3 m_\Phi}, \]  (28)

the matrix $\Phi$ has the following property. For every $x \in \mathbb{R}^d$, every $e \in \mathbb{R}^{m_\Phi}$, and every natural number $K \leq c_4 m_\Phi / \log(d / m_\Phi)$ we have

\[ \| \Delta(\Phi x + e) - x \|_{\ell_2^d} \leq C \Big( \frac{\| x - x_{[K]} \|_{\ell_1^d}}{\sqrt{K}} + \| e \|_{\ell_2^{m_\Phi}} \Big), \]  (29)

where $\Delta(y) := \arg\min_{\Phi z = y} \| z \|_{\ell_1^d}$, as in (22), and

\[ x_{[K]} := \arg\min_{\# \operatorname{supp}(z) \leq K} \| x - z \|_{\ell_1^d} \]

is the best $K$-term approximation of $x$.

Remark 3.

(i) The first part of Theorem 3.2 is well known, see, e.g., [3] or [16, Page 15] and references therein.
(ii) The second part of Theorem 3.2 is relatively new. It follows from Theorem 2.3 of [36] combined with Theorem 3.5 of [13], and the first part of Theorem 3.2. Without the explicit bound of the probability (28), it appears also as Theorem 1.2 in [36].

Applied to the situation at hand we immediately derive the following corollary.

Corollary 3.3.

(i) Let $K$ be a natural number as in Theorem 3.2 (ii). Then, with probability at least (28),

all the vectors $\hat{x}_j$, $j = 1, \dots, m_{\mathcal{X}}$, calculated in Algorithm 1 satisfy

(30)

where the constant depends only on $C_1$ and $C_2$ from (8) and (9).

(ii) If furthermore the condition of Remark 2 (1) holds, then with the same probability also

(31)

where the constant depends again only on $C_1$ and $C_2$ from (8) and (9).

Proof.

We apply Theorem 3.2 (ii) to the equations $y_j = \Phi x_j + \varepsilon_j$, $j = 1, \dots, m_{\mathcal{X}}$. To do so, we have to estimate the best $K$-term approximation error of $x_j$ and the size of the errors $\varepsilon_j$. We start by bounding the former. Recall that, due to the construction of $X$, every column is a scaled copy of the vector $a^T$, i.e., $x_j = g'(a x_j)\, a^T$, so we have by (21)

\[ \frac{\| x_j - (x_j)_{[K]} \|_{\ell_1^d}}{\sqrt{K}} = |g'(a x_j)|\, \frac{\| a^T - (a^T)_{[K]} \|_{\ell_1^d}}{\sqrt{K}} \leq C_2\, K^{\frac{1}{2} - \frac{1}{q}} \| a \|_{\ell_q^d} \leq C_2 C_1\, K^{\frac{1}{2} - \frac{1}{q}}. \]  (32)

This finishes the proof of the first part.

To prove the second part, we estimate the size of the errors $\varepsilon_{ij}$ using (19),

\[ |\varepsilon_{ij}| = \frac{\epsilon}{2} \big| \varphi_i^T \nabla^2 f(\zeta_{ij})\, \varphi_i \big| \leq \frac{\epsilon}{2} \| \nabla^2 f \|_\infty \| \varphi_i \|_{\ell_2^d}^2 \leq \frac{\epsilon C_2 d}{2 m_\Phi}, \]  (34)

leading to

\[ \| \varepsilon_j \|_{\ell_2^{m_\Phi}} \leq \sqrt{m_\Phi}\, \max_i |\varepsilon_{ij}| \leq \frac{\epsilon C_2 d}{2 \sqrt{m_\Phi}}. \]

Together with our assumption this finishes the proof. ∎

Next we need a technical lemma to relate the error between the normalized versions of $\hat{x}_j$ and $x_j$ to the size of $\| \hat{x}_j - x_j \|_{\ell_2^d}$ and $\| x_j \|_{\ell_2^d}$.

Lemma 3.4 (Stability of subspaces - one dimensional case).

Let us fix $x \in \mathbb{R}^d$, $x \neq 0$, and a perturbed vector $\hat{x} = x + n \in \mathbb{R}^d$ with perturbation $n$ of norm $\| n \|_{\ell_2^d}$. If we assume that $\| n \|_{\ell_2^d}$ is sufficiently small compared to $\| x \|_{\ell_2^d}$, then

(35)
Proof.

Applying the triangle inequality and its reverse form several times, and using the assumption of the lemma, we get