Deep nets for local manifold learning

Deep nets for local manifold learning

Charles K. Chui  and H. N. Mhaskar Department of Statistics, Stanford University, Stanford, CA 94305. The research of this author is supported by ARO Grant W911NF-15-1-0385. email: ckchui@stanford.edu. Department of Mathematics, California Institute of Technology, Pasadena, CA 91125; Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711. The research of this author is supported in part by ARO Grant W911NF-15-1-0385. email: hrushikesh.mhaskar@cgu.edu.
Abstract

The problem of extending a function defined on a training data on an unknown manifold to the entire manifold and a tubular neighborhood of this manifold is considered in this paper. For embedded in a high dimensional ambient Euclidean space , a deep learning algorithm is developed for finding a local coordinate system for the manifold without eigen–decomposition, which reduces the problem to the classical problem of function approximation on a low dimensional cube. Deep nets (or multilayered neural networks) are proposed to accomplish this approximation scheme by using the training data. Our methods do not involve such optimization techniques as back–propagation, while assuring optimal (a priori) error bounds on the output in terms of the number of derivatives of the target function. In addition, these methods are universal, in that they do not require a prior knowledge of the smoothness of the target function, but adjust the accuracy of approximation locally and automatically, depending only upon the local smoothness of the target function. Our ideas are easily extended to solve both the pre–image problem and the out–of–sample extension problem, with a priori bounds on the growth of the function thus extended.

1 Introduction

Machine learning is an active sub–field of Computer Science on algorithmic development for learning and making predictions based on some given data, with a long list of applications that range from computational finance and advertisement, to information retrieval, to computer vision, to speech and handwriting recognition, and to structural healthcare and medical diagnosis. In terms of function approximation, the data for learning and prediction can be formulated as , obtained with an unknown probability distribution. Examples include: the Boston housing problem (of predicting the median price of a home based on some vector of 13 other attributes [60]) and the floor market problem [59, 3] (that deals with the indices of the wheat floor pricing in three major markets in the United States). For such problems, the objective is to predict the index in the next month, say, based on a vector of their values over the past few months. Other similar problems include the prediction of blood glucose level of a patient based on a vector of the previous few observed levels [54, 51], and the prediction of box office receipts () on the date of release of a movie in preparation, based on a vector of the survey results about the movie [57]. It is pointed out in [40, 38, 19] that all the pattern classification problems can also be viewed fruitfully as problems of function approximation. While it is an ongoing research to allow non–numeric input (e.g., [10]), we restrict our attention in this paper to the consideration of , for some integer .

In the following discussion, the first component is considered as input, while the second component is considered the output of the underlying process. The central problem is to estimate the conditional expectation of given . Various statistical techniques and theoretical advances in this direction are well–known (see, for example [61]). In the context of neural and radial–basis–function networks, an explicit formulation of the input/output machines was pointed out in [25, 24]. More recently, the nature of deep learning as an input/output process is formulated in the same way, as explained in [35, 55]. To complement the statistical perspective and understand the theoretical capabilities of these processes, it is customary to think of the expected value of , given , as a function of . The question of empirical estimation in this context is to carry out the approximation of given samples , where is a finite training data set. In practice, because of the random nature of the data, it may be possible that there are several pairs of the form in the data for the same values of . In this case, a statistical scheme, such as some kind of averaging of being the simplest one, can be used to obtain a desired value for the sample of at , . From this perspective, the problem of extending from the traning data set to not in in machine learning is called the generalization problem.

We will illustrate this general line of ideas by using neural networks as an example. To motivate this idea, let us first recall a theorem originating with Kolmogorov and Arnold [37, Chapter 17, Theorem 1.1]. According to this theorem, there exist universal Lipschitz continuous functions and universal scalars , for which every continuous function can be written as

(1.1)

where is a continuous function that depends on . In other words, for a given , only one function has to be determind to give the representation formula (1.1) of .

A neural network, used as an input/output machine, consists of an input layer, one or more hidden layers, and an output layer. Each hidden layer consists of a number of neurons arranged according to the network architecture. Each of these neurons has a local memory and performs a simple non–linear computation upon its input. The input layer fans out the input to the neurons at the first hidden layer. The output layer typically takes a linear combination of the outputs of the neurons at the last hidden layer. The right hand side of (1.1) is a neural network with two hidden layers. The first contains neurons, where the –th neuron computes the sum . The next hidden layer contains neurons each evaluating the function on the output of the –th neuron in the first hidden layer. The output layers takes the sum of the results as indicated in (1.1).

From a practical point of view, such a network is clearly hard to construct, since only the existence of the functions and is known, without a numerical procedure for computing these. In the early mathematical development of neural networks during the late 1980s and early 1990s, instead of finding these functions for the representation of a given continuous function in (1.1), the interest was to study the existence and characterization of universal functions , called activation functions of the neural networks, such that each neuron evaluates the activation function upon an affine transform of its input, and the network is capable of approximating any desired real-valued continuous target function arbitrarily closely on , where is any compact set.

For example, a neural network with one hidden layer can be expressed as a function

(1.2)

Here, the hidden layer consists of neurons, each of which has a local memory. The local memory of the –th neuron contains the weights , and the threshold . Upon receiving the input from the input later, the –th neuron evaluates as its output, where is a non–linear activation function. The output layer is just one circuit where the coefficients are stored in a local memory, and the evaluates the linear combination as indicated in (1.2). Training of this network in order to learn a function on a compact subset to an accuracy of involves finding the parameters , , so that

(1.3)

The most popular technique for doing this is the so called back–propagation, which seeks to find these quantities by minimizing an error functional usually with some regularization parameters. We remark that the number of neurons in the approximant (1.2) must increase, if the tolerance in the approximation of the target function is required to be smaller.

From a theoretical perspective, the main attraction of neural networks with one hidden layer is their universal approximation property as formulated in (1.3), which overshadows the properties of their predecessors, namely: the perceptrons [52]. In particular, the question of finding sufficient conditions on the actvation function that ensure the universal approximation property was investigated in great detail by many authors, with emphasis on the most popular sigmoidal function, defined by the property for and for . For example, Funahashi [22] applied some discretization of an integral formula from [28] to prove the universal approximation property for some sigmoidal function . A similar theorem was proved by Hornik, Stinchcombe, White [27] by using the Stone–Weierstrass theorem, and another by Cybenko [17] by applying the Hahn–Banach and Riesz Representation theorems. A constructive proof via approximation by ridge functions was given in our paper [11], with algorithm for implementation presented in our follow-up work [12]. A complete characterization of which activation functions are allowed to achieve the universal approximation property was given later in [49, 36].

However, for neural networks with one hidden layer, one of the severe limitations to applying training algorithms based on optimization, such as back–propagation or those proposed in the book [61] of Vapnik, is that it is neccessary to know the number of neurons in in advance. Therefore, one major problem in the 1990s, known as the complexity problem, was to estimate the number of neurons required to approximate a function to a desired accuracy. In practice, this gives rise to a trade-off: to achieve a good approximation, one needs to have a large number of neurons, which makes the implementation of the training algorithm harder.

In this regard, nearly a century of research in approximation theory suggests that the higher the order of smoothness of the target function, the smaller the number of neurons should be, needed to achieve the desired accuracy. There are many different definitions of smoothness that give rise to different estimates. For example, under the condition that the Fourier transform of the target function satisfies , Barron [1] proved the existence of a neural network with neurons that gives an error of . While it is interesting to note that this number of neurons is essentially independent of the dimension , the constants involved in the term as well as the number of derivatives needed to ensure the condition on the target function may increase with . Several authors have subsequently improved upon such results under various conditions on the activation function as well as the target function so as to ensure that the constants depend polynomially on (e.g., [31, 32, 44] and references therein).

The most commonly understood definition of smoothness is just the number of derivatives of the target function. It is well–known from the theory of -widths that if is an integer, and the only a priori information assumed on the unknown target function is that it is –times continuously differentiable function, then a stable and uniform approximation to within by neural networks must have at least a constant multiple of neurons. In [42], we gave an explicit construction for a neural network that achieves the accuracy of using neurons arranged in a single hidden layer. It follows that this suffers from a curse of dimensionality, in that the number of neurons increases exponentially with the input dimension . Clearly, if the smoothness of the function increases linearly with , as it has to in order to satisfy the condition in [1], then this bound is also “dimension independent”.

While this is definitely unavoidable for neural networks with one hidden layer, the most natural way out is to achieve local approximation; i.e., given an input , construct a network with a uniformly bounded number of neurons that approximates the target function with the optimal rate of approximation near the point , preferably using the values of the function also in a neighborhood of . Unfortunately, this can never be achieved as we proved in [13]. Furthermore, we have proved in [14] that even if we allow each neuron to evaluate its own activation function, this local approximation fails. Therefore the only way out is to use a neural network with more than one hidden layer, called deep net (for deep neural network). Indeed, local approximation can be achieved by a deep net as proved in our papers [40, 41]. In this regard, it is of interest to point out that an adaptive version of [40, 41] was derived in [48] for prediction of time series, yielding as much as 150% improvement upon the state–of–the–art at that time, in the study of the floor market problem.

Of course, the curse of dimensionality is inherent to the problem itself, whether with one or more hidden layers. Thus, while it is possible to construct a deep net to approximate a function at each point arbitrarily closely by using a uniformly bounded number of neurons, the uniform approximation on an entire compact set, such as a cube, would still require an approximation at a number of points in the cube, and this number increasing exponentially with the input dimension. Equivalently, the effective number of neurons for approximation on the entire cube is still exponentially increasing with the input dimension.

In addition to the high dimensionality, another difficulty in solving the function approximation problem is that the data may be not just high dimensional but unstructured and sparse. A relatively recent idea which has been found very useful in applications, in fact, too many to list exhaustively, is to consider the points as being sampled from an unknown, low dimensional sub–manifold of the ambient high dimensional space . The understanding of the geometry of is the subject of the bulk of modern research in the area of diffusion geometry. An introduction to this subject can be found in the special issue [9] of Applied and Computational Harmonic Analysis. The basic idea is to construct the so–called diffusion matrix from the data, and use its eigen–decomposition for finding local coordinate charts and other useful aspects of the manifold. The convergence of the eigen–decomposition of the matrices to that of the Laplace–Beltrami and other differential operators on the manifold is discussed, for example, in [2, 33, 58]. It is shown in [29, 30] that some of the eigenfunctions on the manifolds yield a local coordinate chart on the manifold. In the context of deep learning, this idea is explored as a function approximation problem in [53], where a deep net is developed in order to learn the coordinate system given by the eigenfunctions.

On the other hand, while much of the research in this direction is focused on understanding the data geometry, the theoretical foundations for the problems of function approximation and harmonic analysis on such data–defined manifold are developed extensively in [38, 20, 21, 46, 47, 15]. The theory is developed more recently for kernel construction on directed graphs and analysis of functions on changing data in our paper [39]. However, a drawback of the approach based on data–defined manifolds, known as the out–of–sample extension problem, is that since the diffusion matrix is constructed entirely using the available data, the whole process must be done again if new data become available. A popular idea is then to extend the eigen–functions to the ambient space by using the so called Nyström extension [16].

The objective of this present paper is to describe a deep learning approach to the problem of function approximation, using three groups of networks in the deep net. The lowest layer accomplishes dimensionality reduction by learning the local coordinate charts on the unknown manifold without using any eigen–decomposition. Having found the local coordinate system, the problem is reduced to the classical problem of approximating a function on a cube in a relatively low dimensional Euclidean space. For the next two layers, we may now apply the powerful techniques from approximation theory to approximate the target function , given the samples on the training data set . We describe two approaches to construct the basis functions using multi–layered neural networks, and to construct other networks to use these basis functions in the next layer to accomplish the desired function approximation.

We summarize some of the highlights of our paper.

  • We give a very simple learning method for learning the local coordinate chart near each point. The subsequent approximation process is then entirely local to each coordinate patch.

  • Our method allows us to solve the pre–image problem easily; i.e., to generate a point on the manifold corresponding to a given local coordinate description.

  • The learning method itself does not involve any optimization based technique, except probably for reducing the noise in the values of the function itself.

  • We provide optimal error bounds on approximation based on the smoothness of the function, while the method itself does not require an a priori knowledge of such smoothness.

  • Our methods can solve easily the out–of–sample extension problem. Unlike the Nyström extension process, our method does not require any elaborate construction of kernels defined on the ambient space and commuting with certain differential operators on the unknown manifold.

  • Our method is designed to control the growth of the out–of–sample extension in a tubular neighborhood of , and is local to each coordinate patch.

This paper is organized as follows. In Section 2, we describe the main ideas in our approach. The local coordinate system is described in detail in Section 2.2. Having thus found a local coordinate chart around the input, the problem of function approximation reduces to the classical one. In Section 2.3, we demonstrate how the popular basis functions used in this theory can be implemented using neural networks with one or more hidden layers. The function approximation methods which work with unstructured data without using optimization are described in Section 2.6. In Section 3, we explain how our method can be used to solve both the pre–image problem and the out–of–sample extension problem.

2 Main ideas and results

The purpose of this paper is to develop a deep learning algorithm to learn a function , where is a dimensional compact Riemannian sub–manifold of a Euclidean space , with , given training data of the form , . It is important to note that itself is not known in advance; the points are known only as –dimensional vectors, presumed to lie on . In Sub–section 2.1, we explain our main idea briefly. In Sub–section 2.2, we derive a simple construction of the local coordinate chart for . In Sub–section 2.3, we describe the construction of a neural network with one or more hidden layers to implement two of the basis functions used commonly in function approximation. While the well known classical approximation algorithms require a specific placement of the training data, one has no control on the location of the data in the current problem. In Section 2.6, we give algorithms suitable for the purpose of solving this problem.

2.1 Outline of the main idea

Our approach is the following.

  1. is a finite union of local coordinate neighborhoods, and belongs to one of them, say . We find a local coordinate system for this neighborhood in terms of Euclidean distances on , say , where is the dimension of the manifold. Let , and with a relabeling for notational convenience, be the points in , . This way, we have reduced the problem to approximating at , given the values , where . We note that is a subset of the unit cube of low dimensional Euclidean space, representing a local coordinate patch on . Thus, the problem of approximation of on this patch is reduced that of approximation of , a well studied classical approximation problem.

  2. We will summarize the solution to this problem using neural networks with one or more hidden layers, e.g., an implementation of multivariate tensor product spline approximation using multi–layerd neural network.

Thus, the layers of our deep learning networks will have three main layers.

  1. The bottom layer receives the input , figures out which of the points are in the coordinate neighborhood of , and computes the local coordinates , .

  2. The next several layers compute the local basis functions necessary for the approximation, for example, the –splines and their translates using the multi–layered neural network as in [40].

  3. The last layer receives the data , and computes the approximation described in Step 2 above.

2.2 Local coordinate learning

We assume that are integers, is a dimensional smooth, compact, connected, Riemannian sub–manifold of a Euclidean space , with geodesic distance .

Before we discuss our own construction of a local coordinate chart on , we wish to motivate the work by describing a result from [30]. Let be the sequence of eigenvalues of the (negative of the) Laplace–Beltrami operator on , and for each , be the eigenfunction corresponding to the eigenvalue . We define a formal “heat kernel” by

(2.1)

The following result is a paraphrasing of the heat triangulation theorem proved in [30, Theorem 2.2.7] under weaker assumptions on .

Theorem 2.1

(cf. [30, Theorem 2.2.7]) Let . There exist constants , depending on with the following property. Let be linearly independent vectors in , and be chosen so that is in the direction of , , and

and . Let be the geodesic ball of radius , centered at , and

(2.2)

Then

(2.3)

Since the paper [30] deals with a very general manifold, the mapping is not claimed to be a diffeomorphism, although it is obviously one–one on .

We note that even in the simple case of a Euclidean sphere, an explicit expression for the heat kernel is not known. In practice, the heat kernel has to be approximated using appropriate Gaussian networks [33]. In this section, we aim to obtain a local coordinate chart that is computed directly in terms of Euclidean distances on , and depends upon trainable parameters. The construction of this chart constitutes the first hidden layer of our deep learning process. As explained in the introduction, once this chart is in place, the question of function extension on the manifold reduces locally to the well studied problem of function extension on a dimensional unit cube.

To describe our constructions,we first develop some notation.

In this section, it is convenient to use the notation rather than , which we will use in the rest of the sections. If is an integer, and , denotes the Euclidean norm of . If , we will write , . If , ,

There exists with the following properties. The manifold is covered by finitely many geodesic balls such that for the center of any of these balls, there exists a diffeomorphism, namely, the exponential coordinate map from to the geodesic ball around [18, p. 65]. If is the Jacobian matrix for , given by , , then

(2.4)

Further, there exists (independent of ) such that

(2.5)

Let . Then (2.5) implies that

(2.6)

In turn, this leads to

(2.7)

Let , , be chosen with the following properties:

(2.8)

and, with the matrix function defined by

(2.9)

we have

(2.10)

Any set with these properties will be called coordinate stars around . We note that the matrix has columns given by , , and hence, can be computed without reference to the map . Let

(2.11)
Theorem 2.2

Let . Then
(a) is a diffeomorphism on . If , , , then

(2.12)

(b) The function is a diffeomorphism from onto .

Remark 2.1

Let be a geodesic ball around . For , we define

Then Theorem 2.2(b) shows that is a diffeomorphism from onto . Since ,

maps diffeomorphically onto . Let . Then is a neighborhood of and maps diffeomorphically onto . We oberve that is a union of finitely many neighborhoods of the form , so that any belongs to at least one such neighborhood. Moreover, can be computed entirely in terms of the description of in terms of its –dimensional coordinates.

Remark 2.2

The trainable parameters are thus , and the points . Since , the condition (2.10) is satisfied if are along linearly independent directions as in Theorem 2.1.

Remark 2.3

Since the mapping in Remark 2.1 is a quadratic polynomial in , it can be implemented as a neural network with a single hidden layer using the activation function given in (2.22) as described in Sub–Section 2.4.

Example 2.1

Let , and be the circular helix defined by

Clearly, is a one dimensional manifold, and is the arclength parameter, measured from . The curvature at any point is . For any point , , with the diffeomorphism given by , . An interesting fact is that can be made arbitrarily small by choosing close to , even though the geodesic distance between and is . Let , , and . Let . It is easy to calculate that

so that

If , then . Therefore, using the well known estimates

we obtain that

(2.13)

and

(2.14)

Hence, for any points , we have

We note that the neighborhood where this estimate holds and the constants are independent of the curvature.

The remainder of this section is devoted to the proof of Theorem 2.2.

Lemma 2.1

Let . Each of the following statements hold for the matrix defined in (2.9):

(2.15)
(2.16)
(2.17)

With as in (2.11), for , ,

(2.18)

Proof. In view of (2.6) and the mean value theorem, we have for ,

(2.19)

We observe further that for any integers , . Consequently, for any , ,

This proves (2.15).

In view of (2.19), used with in place of , , we obtain for all , ,

This proves (2.16).

In view of (2.5), (2.16), (2.15), we obtain for that

This proves (2.17). The estimate (2.18) follows easily from this and (2.10).

Proof of Theorem 2.2. In this proof only, let be the Jacobian of : . Then . The estimate (2.10) shows that is invertible, and that . The estimate (2.17) then shows that

(2.20)

Therefore, the inverse function theorem as given in [56, Theorem 9.24 and its proof] implies that is a diffeomorphism on as claimed. For , (2.18) shows that . Also, (2.16) and (2.6) show that . Hence, the mean value theorem implies that

(2.21)

Together with (2.7), this implies (2.12).

The part (b) follows also from [56, Theorem 9.24 and its proof] and (2.20).

2.3 Local basis functions

Having found a local coordinate map on a neighborhood of on , the problem of extending from to is reduced to extending from , a classical approximation problem. There is, of course, 100+ years of research on this subject. We restrict ourselves to two examples, which can be implemented using neural networks with one or more hidden layers. One of the most popular activation function in the deep learning literature (e.g., [35]) is the rectified linear unit function

Since this function is not continuously differentiable, there are some technical difficulties to use common algorithms like back–propagation with these activation functions. Although we do not need back–propagation in our theory, we prefer to deal with a rectified quadratic unit function defined for by

(2.22)

which is continuously differentiable on . Our theory will work in general with any activation function of order ; i.e., with a function that satisfies

(2.23)

but for the sake of clarity of exposition, we will use only the activation function defined in (2.22).

2.4 Polynomials

The most basic class of classical approximants is the set of all polynomials. For , we denote the class of all algebraic polynomials of coordinatewise degree at most in variables by . (It is convenient to use the same notation also when is not an integer; in this case, is just .

The basic implementation of polynomials is given in [11, Proof of Theorem 3.1], where an explicit construction is given for finding the weights , the thresholds and the coefficients used in (2.24) below.

Theorem 2.3

Let , , , then there exist weights and real numbers , such that

(2.24)

Here, the weights and the thresholds are independent of and the coefficients are linear functionals on .

We observe that

(2.25)

while

(2.26)

so that the expression on the right hand side of (2.24) can be expressed as a neural network with hidden layers.

We note that a neural network with one hidden layer is given in [42], but using a activation function ; e.g., . This uses the fact that for and , such that none of the derivatives , , equal to ,

(2.27)

A finite difference scheme to implement this differentiation yields a neural network with one hidden layer, containing exactly neurons, and should be stable for functions. If stability is a greater concern, then one may use other numerical differentiation schemes to implement this formula, e.g., spectral methods [26].

2.5 –splines

For , and integer , let

A tensor product cardinal -spline at is defined by

(2.28)

It is explained in [40, 41] that the quantity can be computed using a neural network with a sigmoidal function of order consisting of finitely many neurons arranged in multiple hidden layers (the number of neurons and layers depending on and alone). Thus, if is a power of , then each of the terms can be implemented as an iterated power of (cf. (2.26).) The product of such expressions can be implemented using either Theorem 2.3 as a network with mulitple hidden layer and utilizing the rectified quadratic unit function as the activation function, or a discretization of the formula (2.27) using a sigmoidal function as explained in Sub–section 2.4.

2.6 Function approximation

In this section, if , is the norm of .

2.6.1 Spline based approximation

In [7, Section 4.5], [8], a quasi–interpolatory spline function is defined by

(2.29)

where are compactly supported linear functionals, designed specifically to ensure that for every polynomial of coordinatewise degree at most in variables. With , , one has the approximation bound for small :

(2.30)

The linear functionals are based on finitely many samples of at the grid points in a compact subset of . In our context, the data for approximating is not in this form. Therefore, we may use the following algorithm given in [50], where we assume that is scaled so as to be supported on .

Given: A set of points in . Let

and be sufficiently small.

Objective: To find real numbers such that the functional

(2.31)

satisfies

Steps:

  1. Divide into congruent subcubes of side not exceeding .

  2. Choose , so that each subcube has exactly one point of .

  3. Solve the following (underdetermined) system of equations for the unknowns , .

    (2.32)
  4. Set if .

  5. Output .

Substituting in place of in the definition of yields the desired spline approximation

(2.33)

and it is proved in [50] that the estimate (2.30) holds with replacing .

2.6.2 Polynomial quasi–interpolation

A standard method for polynomial approximation is to consider a filtered projection defined in (2.38) below.

The Chebyshev polynomials (of first kind) are defined recursively for and integer by

(2.34)

In terms of monomials, the Chebyshev polynomials are given by

(2.35)

For and multi–integer , the tensor product Chebyshev polynomial is defined by

(2.36)

We choose a smooth low pass filter ; i.e., an even function such that if , and if , and abuse the notation as usual to define

With this filter, we define the kernel

(2.37)

Then the filtered projection operator is defined by

(2.38)

It is well known that if is any continuous function on , then converges uniformly to at the near optimal rate of approximation. For example, if has partial derivatives up to order in each variable, then analogously to (2.30), but for large rather than small ,

(2.39)

Theoretically, the question then is to compute using the data as in Sub–section 2.6.1. The procedure we describe below from [43, 45, 51] also describes the choice of the parameter depending upon the data.

Given: A set of points in . Let

and be sufficiently small. We also fix an integer .

Objective: To find real numbers such that the functional

(2.40)

satisfies

for all –variate polynomials of coordinatewise degree .

Steps:

  1. Divide into congruent subcubes of side not exceeding .

  2. Choose , so that each subcube has exactly one point of .

  3. Solve the following (underdetermined) system of equations for the unknowns , .