Nonlinear Approximation via Compositions

Funding: H. Yang was supported by the start-up grant of the Department of Mathematics at the National University of Singapore.

Zuowei Shen, Department of Mathematics, National University of Singapore (matzuows@nus.edu.sg); Haizhao Yang, Department of Mathematics, National University of Singapore (haizhao@nus.edu.sg); Shijun Zhang, Department of Mathematics, National University of Singapore (zhangshijun@u.nus.edu)
Abstract

Given a function dictionary $\mathcal{D}$ and an approximation budget $N\in\mathbb{N}^+$, nonlinear approximation seeks the linear combination of the best $N$ terms $\{T_n\}_{1\le n\le N}\subseteq\mathcal{D}$ to approximate a given function $f$ with the minimum approximation error

    \varepsilon_{L,f} := \min_{\{g_n\}\subset\mathbb{R},\ \{T_n\}\subseteq\mathcal{D}} \Big\| f(x) - \sum_{n=1}^{N} g_n T_n(x) \Big\|.

Motivated by the recent success of deep learning, we propose dictionaries with functions in the form of compositions, i.e.,

    T(x) = T^{(L)} \circ T^{(L-1)} \circ \cdots \circ T^{(1)}(x)

for all $T\in\mathcal{D}$, and implement $T$ using ReLU feed-forward neural networks (FNNs) with $L$ hidden layers. We further quantify the improvement of the best $N$-term approximation rate in terms of $N$ when $L$ is increased from $1$ to $2$ or from $2$ to $3$ to show the power of compositions. In the case when $L\ge 3$, our analysis shows that increasing $L$ cannot improve the approximation rate in terms of $N$.

In particular, for any function $f$ on $[0,1]$, regardless of its smoothness and even of its continuity, if $f$ can be approximated using a dictionary when $L=1$ with the best $N$-term approximation rate $\varepsilon_{L,f}=\mathcal{O}(N^{-\eta})$, we show that dictionaries with $L=2$ can improve the best $N$-term approximation rate to $\varepsilon_{L,f}=\mathcal{O}(N^{-2\eta})$. We also show that for Hölder continuous functions of order $\alpha$ on $[0,1]^d$, the application of a dictionary with $L=3$ in nonlinear approximation can achieve an essentially tight best $N$-term approximation rate $\varepsilon_{L,f}=\mathcal{O}(N^{-2\alpha/d})$. Finally, we show that dictionaries consisting of wide FNNs with a few hidden layers are more attractive in terms of computational efficiency than dictionaries with narrow and very deep FNNs for approximating Hölder continuous functions, if the number of computer cores is larger than $N$ in parallel computing.

Keywords: Deep Neural Networks, ReLU Activation Function, Nonlinear Approximation, Function Composition, Hölder Continuity, Parallel Computing.

1 Introduction

For non-smooth and high-dimensional function approximation, a favorable technique popularized in recent decades is nonlinear approximation devore_1998, which does not restrict the approximants to come from linear spaces; this yields sparser representations, cheaper computation, and more robust estimation, and has enabled many breakthroughs in applied mathematics and computer science (e.g., wavelet analysis doi:10.1137/1.9781611970104, dictionary learning DBLP:journals/corr/TariyalMSV16, data compression and denoising 600614; 374249, adaptive pursuit Davis94adaptivenonlinear; 6810285, and compressed sensing 1614066; 4472240).

Typically, nonlinear approximation is a two-stage algorithm that designs a good redundant nonlinear dictionary $\mathcal{D}=\{T_n\}_{n}$ in its first stage, and identifies the optimal approximant as a linear combination of $N$ elements of $\mathcal{D}$ in the second stage:

    f(x) \approx \sum_{n=1}^{N} g_n T_{i_n}(x),    (1.1)

where $f$ is the target function in a Hilbert space $\mathbb{H}$ associated with a norm denoted as $\|\cdot\|$, the selection $f\mapsto\{T_{i_n}\}_{1\le n\le N}$ is a nonlinear map from $\mathbb{H}$ to $\mathcal{D}^{N}$ with the $n$-th coordinate being $T_{i_n}$, and the reconstruction from the coefficients is a linear map from $\mathbb{R}^{N}$ to $\mathbb{H}$ with the $n$-th coordinate coefficient being $g_n$. The nonlinear approximation seeks $\{g_n\}_{n=1}^{N}$ and $\{T_{i_n}\}_{n=1}^{N}$ such that

    \Big\| f - \sum_{n=1}^{N} g_n T_{i_n} \Big\| = \inf_{\{c_n\}\subset\mathbb{R},\ \{S_n\}\subseteq\mathcal{D}} \Big\| f - \sum_{n=1}^{N} c_n S_n \Big\|,    (1.2)

which is also called the best $N$-term approximation. One remarkable approach of nonlinear approximation is based on one-hidden-layer neural networks, which give simple and elegant dictionary elements of the form $\sigma(w\cdot x+b)$, where $x\mapsto w\cdot x+b$ is a linear transform in $\mathbb{R}^d$ with a weight vector $w$ (the weight matrix in the general case) and a shifting scalar $b$ (called the bias), and $\sigma$ is a nonlinear function (called the activation function). The approximation

    f(x) \approx \sum_{n=1}^{N} g_n\, \sigma(w_n\cdot x + b_n)

includes wavelet pursuits 258082; 471413, adaptive splines devore_1998; PETRUSHEV2003158, radial basis functions radiusbase; HANGELBROEK2010203; Xie2013, sigmoidal neural networks Sig1; Sig2; Sig3; Sig4; Sig5; Cybenko1989ApproximationBS; HORNIK1989359; barron1993, etc. For functions in Besov spaces with sufficient smoothness, radiusbase; HANGELBROEK2010203 constructed approximations whose rate is almost optimal Lin2014, and the smoothness requirement cannot be reduced in general HANGELBROEK2010203. (In this paper, we use the big-$\mathcal{O}$ notation when we only care about the scaling in terms of the variables inside $\mathcal{O}(\cdot)$; the prefactor is independent of these variables.) For Hölder continuous functions of order $\alpha$ on $[0,1]^d$, Xie2013 essentially constructed an approximation whose rate is far from the lower bound that we shall prove in this paper. Achieving the optimal approximation rate for general continuous functions in constructive approximation, especially in high-dimensional spaces, remains an unsolved and challenging problem.
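To make the two-stage procedure concrete, the following is a generic greedy (matching-pursuit-style) sketch of best-$N$-term selection from a finite random dictionary of one-hidden-layer ReLU ridge functions. The dictionary size, the target $f$, and the greedy rule are illustrative assumptions, not a construction from this paper.

    # A generic greedy sketch (not the paper's algorithm): best-N-term style selection
    # from a finite random dictionary of one-hidden-layer ReLU ridge functions
    # sigma(w*x + b) on [0,1]. Dictionary size, N, and the target f are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 512)                    # evaluation grid on [0, 1]
    f = np.sqrt(x)                                # a Hoelder(alpha = 1/2) target, for illustration
    W, B = rng.normal(size=2000), rng.uniform(-1, 1, 2000)
    D = np.maximum(0.0, np.outer(x, W) + B)       # columns are dictionary elements sigma(w*x + b)

    N, chosen = 10, []
    residual = f.copy()
    for _ in range(N):                            # greedily pick the column most correlated with the residual
        scores = np.abs(D.T @ residual) / (np.linalg.norm(D, axis=0) + 1e-12)
        chosen.append(int(np.argmax(scores)))
        A = D[:, chosen]                          # refit the coefficients on all chosen terms
        g, *_ = np.linalg.lstsq(A, f, rcond=None)
        residual = f - A @ g

    print("relative L2 error with N =", N, ":", np.linalg.norm(residual) / np.linalg.norm(f))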

1.1 Problem Statement

ReLU FNNs have proved to be a powerful tool in many fields from various points of view NIPS2014_5422; 6697897; Bartlett98almostlinear; Sakurai; pmlr-v65-harvey17a; Kearns; Anthony:2009; PETERSEN2018296, which motivates us to tackle the open problem above via function compositions in nonlinear approximation using deep ReLU FNNs, i.e.,

    f(x) \approx \phi(x) := \mathcal{L}_{L+1}\circ\sigma\circ\mathcal{L}_{L}\circ\sigma\circ\cdots\circ\sigma\circ\mathcal{L}_{1}(x),    (1.3)

where $\mathcal{L}_{\ell}(\boldsymbol{y}):=\boldsymbol{W}_{\ell}\,\boldsymbol{y}+\boldsymbol{b}_{\ell}$ with a weight matrix $\boldsymbol{W}_{\ell}$ and a bias vector $\boldsymbol{b}_{\ell}$ for $\ell=1,\dots,L+1$, $\sigma$ is the ReLU activation function applied entrywise, and $f$ is a Hölder continuous function. For the convenience of analysis, we consider target functions defined on $[0,1]^d$. Let $\mathcal{D}$ be the dictionary consisting of ReLU FNNs of the form in (1.3) with a given width and depth $L$. To identify the optimal FNN to approximate $f$, it is sufficient to solve the following optimization problem:

    \min_{\{g_n\}_{n=1}^{N}\subset\mathbb{R},\ \{T_n\}_{n=1}^{N}\subseteq\mathcal{D}}\ \Big\| f - \sum_{n=1}^{N} g_n T_n \Big\|.    (1.4)

Note that a linear combination of $N$ ReLU FNNs with depth $L$ is itself a ReLU FNN with depth $L$ and an $N$ times larger width, so the minimizer of (1.4) is again a single, wider FNN of the form in (1.3).
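In practice, a (local) minimizer of a problem like (1.4) is typically sought by parametrizing a single ReLU FNN and running a gradient-based optimizer. The following PyTorch sketch is only illustrative: the target $f$, the width, the depth, the loss, and the optimizer settings are assumptions rather than the setup analyzed in this paper.

    # Minimal sketch (illustrative, not the paper's experimental setup): approximate a
    # Hoelder target f(x) = sqrt(x) on [0,1] with a ReLU FNN of width N and depth L,
    # optimizing an empirical L2 proxy of (1.4) with Adam.
    import torch
    import torch.nn as nn

    N, L = 20, 3                                   # width and number of hidden layers (assumed)
    layers = [nn.Linear(1, N), nn.ReLU()]
    for _ in range(L - 1):
        layers += [nn.Linear(N, N), nn.ReLU()]
    layers += [nn.Linear(N, 1)]
    phi = nn.Sequential(*layers)                   # phi = Linear o ReLU o ... o ReLU o Linear

    x = torch.rand(2048, 1)                        # training samples in [0, 1]
    f = torch.sqrt(x)                              # Hoelder continuous of order 1/2
    opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
    for step in range(2000):
        loss = torch.mean((phi(x) - f) ** 2)       # empirical squared error
        opt.zero_grad(); loss.backward(); opt.step()

    print("final RMSE:", loss.sqrt().item())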

The fundamental limit of nonlinear approximation via the proposed dictionary is essentially determined by the approximation power of the function compositions in (1.3), which gives a performance guarantee for the minimizer of (1.4). Since function compositions are implemented via ReLU FNNs, the remaining problem is to quantify the approximation capacity of deep ReLU FNNs, especially their ability to improve the best $N$-term approximation rate in terms of $N$ for any fixed $L$, defined as

    \varepsilon_{L,f} := \inf_{\{g_n\}_{n=1}^{N}\subset\mathbb{R},\ \{T_n\}_{n=1}^{N}\subseteq\mathcal{D}}\ \Big\| f - \sum_{n=1}^{N} g_n T_n \Big\|.    (1.5)

Function compositions can significantly enrich the dictionary of nonlinear approximation and this idea was not considered in the literature previously due to the expensive computation of function compositions in solving the minimization problem in (1.4). Fortunately, recent development of efficient algorithms for optimization with compositions (e.g., backpropagation techniques werbos1975beyond; Fukushima1980; rumelhart1986psychological and parallel computing techniques 10.1007/978-3-642-15825-4_10; Ciresan:2011:FHP:2283516.2283603) makes it possible to explore the proposed dictionary in this paper. Furthermore, with advanced optimization algorithms Duchi:2011:ASM:1953048.2021068; Johnson:2013:ASG:2999611.2999647; ADAM, good local minima of (1.4) can be identified efficiently NIPS2016_6112; DBLP:journals/corr/NguyenH17; opt.

1.2 Related work and contribution

The main goal of the remainder of this article is to quantify the best $N$-term approximation rate defined in (1.5) for ReLU FNNs in the dictionary with a fixed depth $L$ when $f$ is a Hölder continuous function. This topic is related to several existing approximation theories in the literature, but none of these existing works can be applied to answer the problem addressed in this paper.

First of all, this paper identifies explicit formulas for the best $N$-term approximation rate

(1.6)

for any $N$ and any Hölder continuous function of order $\alpha$ with a constant $\lambda$, while existing theories NIPS2017_7203; bandlimit; yarotsky2017; DBLP:journals/corr/LiangS16; Hadrien; suzuki2018adaptivity; PETERSEN2018296; yarotsky18a; DBLP:journals/corr/abs-1807-00297 can only provide implicit formulas, in the sense that the approximation error contains an unknown prefactor and holds only for network sizes larger than some unknown numbers. For example, the approximation rate in yarotsky18a via a narrow and deep ReLU FNN comes with an unknown prefactor and holds only for a network size larger than a sufficiently large unknown number; the approximation rate in yarotsky18a via a wide and shallow ReLU FNN has the same limitation. For another example, given an approximation error, PETERSEN2018296 proved the existence of a ReLU FNN with a constant but still unknown number of layers approximating a function within the target error. Similarly, given the error, Hadrien; bandlimit; DBLP:journals/corr/abs-1807-00297 estimate the scaling of the network size in the error, and the scaling contains unknown prefactors. Given an arbitrary width and depth, no existing work provides an explicit formula for the approximation error that can guide practical network design, e.g., to guarantee whether a given network is large enough to meet an accuracy requirement. This paper provides such formulas for the first time, and in fact the bound in these formulas is asymptotically tight, as we shall prove later.

Second, our target functions are Hölder continuous, while most existing works aim at a smaller function space with certain smoothness, e.g., functions with higher-order smoothness NIPS2017_7203; DBLP:journals/corr/LiangS16; yarotsky2017; DBLP:journals/corr/abs-1807-00297, band-limited functions bandlimit, Korobov spaces Hadrien, or Besov spaces suzuki2018adaptivity. To the best of our knowledge, there is only one existing article, yarotsky18a, concerning the approximation power of deep ReLU FNNs for general continuous functions. However, the conclusion of yarotsky18a only works for ReLU FNNs with a fixed width and a sufficiently large depth, instead of a fixed depth and an arbitrary width as required in the nonlinear approximation (see Figure 1 for a comparison of the conclusion of yarotsky18a and this paper).

As we can see in Figure 1, the improvement of the best $N$-term approximation rate in terms of $N$ when $L$ is increased from $1$ to $2$ or from $2$ to $3$ is significant, which shows the power of depth in ReLU FNNs. However, in the case when $L\ge 3$, our analysis shows that increasing $L$ cannot improve the approximation rate in terms of $N$. As an interesting corollary of our analysis, for any function $f$ on $[0,1]$, regardless of its smoothness and even of its continuity, if $f$ can be approximated using functions in the dictionary with $L=1$ with the best $N$-term approximation rate $\varepsilon_{L,f}=\mathcal{O}(N^{-\eta})$, we show that functions in the dictionary with $L=2$ can improve the best $N$-term approximation rate to $\varepsilon_{L,f}=\mathcal{O}(N^{-2\eta})$. Extending this conclusion to functions of a general dimension $d$ is challenging, and we leave it as future work.
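For reference, the rate improvements discussed above (and stated in the abstract) can be summarized informally as follows; the precise constants are given by the theorems in Section 3:

    \varepsilon_{1,f}=\mathcal{O}(N^{-\eta}) \ \Longrightarrow\ \varepsilon_{2,f}=\mathcal{O}(N^{-2\eta}) \quad \text{for any } f \text{ on } [0,1],

    \varepsilon_{3,f}=\mathcal{O}(N^{-2\alpha/d}) \quad \text{(essentially tight) for Hölder continuous } f \text{ of order } \alpha \text{ on } [0,1]^d,

and increasing $L$ beyond $3$ does not improve the rate in terms of $N$.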

From the point of view of analysis techniques, this paper introduces new analysis methods based merely on the structure of FNNs, while existing works NIPS2017_7203; bandlimit; yarotsky2017; DBLP:journals/corr/LiangS16; Hadrien; suzuki2018adaptivity; PETERSEN2018296; yarotsky18a; DBLP:journals/corr/abs-1807-00297 rely on constructing FNNs that approximate traditional bases in approximation theory, e.g., polynomials, splines, and sparse grids, which are used to approximate smooth functions.

Figure 1: A comparison of existing works and our contribution on the approximation power of ReLU FNNs for Hölder continuous functions of order $\alpha$. Existing results cover two cases: 1) an approximation rate for ReLU FNNs with a fixed small depth and a large width; 2) an approximation rate for ReLU FNNs with a depth larger than a sufficiently large unknown number and a fixed width. Our contribution: explicit approximation rates for ReLU FNNs with depth $L$ and width $\mathcal{O}(N)$ in the case of $L=2$ or $3$ (and also in the case of $L=1$).

Finally, we analyze the approximation efficiency of neural networks in parallel computing, an important perspective that has received little attention in the literature. In most applications, the efficiency of deep learning computation relies heavily on parallel computation. We show that a narrow and very deep neural network is inefficient if its approximation rate is not exponentially better than that of wide and shallower networks. Hence, neural networks with $\mathcal{O}(1)$ hidden layers are more attractive on modern computational platforms, considering the computational efficiency per training iteration on parallel computing platforms. Our conclusion does not conflict with the current state-of-the-art deep learning research, since most of these successful deep neural networks have a depth that is asymptotically small relative to their width.
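As a rough illustration of this efficiency argument, the following timing sketch compares a wide-and-shallow ReLU FNN with a narrow-and-very-deep one of a comparable parameter count. The widths, depths, and batch size are arbitrary assumptions, and the measured times depend entirely on the hardware; they are not results reported in this paper.

    # Illustrative timing sketch (not the paper's benchmark): with comparable parameter
    # counts, a wide-and-shallow ReLU FNN exposes more parallelism per layer than a
    # narrow-and-very-deep one, whose layers must be evaluated sequentially.
    import time
    import torch
    import torch.nn as nn

    def relu_fnn(width, depth):
        layers = [nn.Linear(1, width), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.ReLU()]
        return nn.Sequential(*layers, nn.Linear(width, 1))

    wide_shallow = relu_fnn(width=512, depth=3)    # few layers, many neurons per layer
    narrow_deep = relu_fnn(width=32, depth=500)    # roughly the same number of parameters

    x = torch.rand(10000, 1)
    for name, net in [("wide & shallow", wide_shallow), ("narrow & deep", narrow_deep)]:
        n_params = sum(p.numel() for p in net.parameters())
        t0 = time.time()
        with torch.no_grad():
            for _ in range(10):
                net(x)
        print(f"{name}: {n_params} parameters, {time.time() - t0:.3f}s for 10 forward passes")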

1.3 Organization

The rest of the paper is organized as follows. Section 2 summarizes the notations throughout this paper. Section 3 presents the main theorems while Section LABEL:sec:numerical shows numerical tests in parallel computing to support the claims in this paper. Finally, Section LABEL:sec:con concludes this paper with a short discussion.

2 Preliminaries

For convenience, we present the notation and elementary lemmas used throughout this paper as follows.

2.1 Notations

  • Matrices are denoted by bold uppercase letters, e.g., $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ is a real matrix of size $m\times n$, and $\boldsymbol{A}^{T}$ denotes the transpose of $\boldsymbol{A}$. Correspondingly, $\boldsymbol{A}_{i,j}$ is the $(i,j)$-th entry of $\boldsymbol{A}$; $\boldsymbol{A}_{:,j}$ is the $j$-th column of $\boldsymbol{A}$; $\boldsymbol{A}_{i,:}$ is the $i$-th row of $\boldsymbol{A}$.

  • Vectors are denoted by bold lowercase letters, e.g., $\boldsymbol{v}\in\mathbb{R}^{n}$ is a column vector of size $n$ and $v_i$ is the $i$-th element of $\boldsymbol{v}$; that is, $\boldsymbol{v}=[v_1,\dots,v_n]^{T}$ is the vector consisting of the numbers $v_1,\dots,v_n$.

  • The Lebesgue measure is denoted as $\mu(\cdot)$.

  • The set difference of two sets $A$ and $B$ is denoted by $A\backslash B$, which denotes $\{x:\ x\in A,\ x\notin B\}$ for any sets $A$ and $B$.

  • For a set of numbers , and a number , .

  • For any $x\in\mathbb{R}$, let $\lfloor x\rfloor:=\max\{n\in\mathbb{Z}:\ n\le x\}$ and $\lceil x\rceil:=\min\{n\in\mathbb{Z}:\ n\ge x\}$.

  • Assume $f, g\ge 0$; then $f(N)=\mathcal{O}(g(N))$ means that there exists a positive constant $C$ independent of $N$ such that $f(N)\le C\,g(N)$ when $N$ goes to $+\infty$.

  • Define the Hölder class of order $\alpha$ with constant $\lambda$ as the class of functions defined on $[0,1]^d$ satisfying the uniform Lipschitz property of order $\alpha$ with a Lipschitz constant $\lambda$, i.e., $|f(\boldsymbol{x})-f(\boldsymbol{y})|\le\lambda\,\|\boldsymbol{x}-\boldsymbol{y}\|^{\alpha}$ for any $\boldsymbol{x},\boldsymbol{y}\in[0,1]^d$.

  • Let CPL($n$) be the set of continuous piecewise linear functions with $n$ pieces mapping $\mathbb{R}$ to $\mathbb{R}$. The endpoints of each linear piece are called “breakpoints” in this paper.

  • Let $\sigma(x):=\max\{0,x\}$ denote the rectified linear unit (ReLU). With a slight abuse of notation, we define $\sigma(\boldsymbol{x}):=(\max\{0,x_1\},\dots,\max\{0,x_d\})^{T}$ for any $\boldsymbol{x}=(x_1,\dots,x_d)^{T}\in\mathbb{R}^d$, i.e., $\sigma$ is applied entrywise.

  • We will use NN as a shorthand for a ReLU neural network and use Python-type notations to specify a class of NN’s, e.g., $\text{NN}(c_1;\ c_2;\ \dots;\ c_m)$ is a set of ReLU FNN’s satisfying conditions given by $c_1,\dots,c_m$, each of which may specify the number of inputs (#input), the total number of nodes in all hidden layers (node), the number of hidden layers (layer), the total number of parameters (parameter), the width in each hidden layer (widthvec), the maximum width of all hidden layers (maxwidth), etc. For example, $\text{NN}(\text{#input}=d;\ \text{widthvec}=[N,N])$ is a set of NN’s $\phi$ satisfying:

    • $\phi$ maps from $\mathbb{R}^d$ to $\mathbb{R}$.

    • $\phi$ has two hidden layers and the number of nodes in each hidden layer is $N$.

  • is short for . For example,

  • For $\phi\in\text{NN}(\text{#input}=d;\ \text{widthvec}=[N_1,\dots,N_L])$, if we define $N_0:=d$ and $N_{L+1}:=1$, then the architecture of $\phi$ can be briefly described as follows (see also the code sketch after this list):

    \boldsymbol{x}=\tilde{\boldsymbol{h}}_0 \ \xrightarrow{\ \boldsymbol{W}_1,\ \boldsymbol{b}_1\ }\ \boldsymbol{h}_1 \ \xrightarrow{\ \sigma\ }\ \tilde{\boldsymbol{h}}_1 \ \ \cdots\ \ \xrightarrow{\ \boldsymbol{W}_L,\ \boldsymbol{b}_L\ }\ \boldsymbol{h}_L \ \xrightarrow{\ \sigma\ }\ \tilde{\boldsymbol{h}}_L \ \xrightarrow{\ \boldsymbol{W}_{L+1},\ \boldsymbol{b}_{L+1}\ }\ \phi(\boldsymbol{x})=\boldsymbol{h}_{L+1},

    where $\boldsymbol{W}_\ell\in\mathbb{R}^{N_\ell\times N_{\ell-1}}$ and $\boldsymbol{b}_\ell\in\mathbb{R}^{N_\ell}$ are the weight matrix and the bias vector in the $\ell$-th linear transform in $\phi$, respectively, i.e., $\boldsymbol{h}_\ell:=\boldsymbol{W}_\ell\tilde{\boldsymbol{h}}_{\ell-1}+\boldsymbol{b}_\ell$ for $\ell=1,\dots,L+1$, and $\tilde{\boldsymbol{h}}_\ell:=\sigma(\boldsymbol{h}_\ell)$ for $\ell=1,\dots,L$.
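The following NumPy sketch instantiates the architecture notation above: it builds a function $\phi\in\text{NN}(\text{#input}=d;\ \text{widthvec}=[N_1,\dots,N_L])$ as the alternating composition of affine transforms and the entrywise ReLU. The random weights are purely illustrative.

    # A minimal NumPy sketch of the architecture notation above: build a ReLU FNN
    # phi in NN(#input = d; widthvec = [N_1, ..., N_L]) with randomly chosen weights.
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def make_relu_fnn(d, widthvec, rng=np.random.default_rng(0)):
        dims = [d] + list(widthvec) + [1]           # N_0 = d, ..., N_{L+1} = 1
        Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
        bs = [rng.normal(size=dims[l + 1]) for l in range(len(dims) - 1)]
        def phi(x):                                 # x is a vector in R^d
            h = np.asarray(x, dtype=float)
            for l, (W, b) in enumerate(zip(Ws, bs)):
                h = W @ h + b                       # the l-th affine transform
                if l < len(Ws) - 1:                 # ReLU after every hidden layer only
                    h = relu(h)
            return h.item()                         # scalar output
        return phi

    phi = make_relu_fnn(d=2, widthvec=[100, 100])   # an element of NN(#input=2; widthvec=[100,100])
    print(phi([0.3, 0.7]))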

2.2 Lemmas

Let us warm up by studying the properties of ReLU FNNs with only one hidden layer in Lemma 1 below. It shows how the breakpoints and the values of a function in $\text{NN}(\text{#input}=1;\ \text{widthvec}=[n])$ can be prescribed for any $n\in\mathbb{N}^+$.

Lemma 1

Suppose $\phi\in\text{NN}(\text{#input}=1;\ \text{widthvec}=[n])$ with an architecture

    \phi(x)=\boldsymbol{W}_2\,\sigma(\boldsymbol{W}_1 x+\boldsymbol{b}_1)+b_2, \qquad \boldsymbol{W}_1,\ \boldsymbol{b}_1\in\mathbb{R}^{n},\ \ \boldsymbol{W}_2\in\mathbb{R}^{1\times n},\ \ b_2\in\mathbb{R}.

Then $\phi$ is a continuous piecewise linear function. Let $n$ be the number of hidden nodes; then we have:

  1. Given a sequence of strictly increasing numbers $x_0$, $x_1$, $\dots$, $x_n$, there exist $\boldsymbol{W}_1$ and $\boldsymbol{b}_1$, independent of $\boldsymbol{W}_2$ and $b_2$, such that the breakpoints of $\phi$ are exactly $x_0$, $x_1$, $\dots$, $x_n$ on the interval $[x_0,x_n]$. (We only consider the interval $[x_0,x_n]$ and hence $x_0$ and $x_n$ are treated as breakpoints; $\phi$ might not have a real breakpoint in a small open neighborhood of $x_0$ or $x_n$.)

  2. Suppose $\boldsymbol{W}_1$ and $\boldsymbol{b}_1$ are given as in (1). Given any sequence $y_0$, $y_1$, $\dots$, $y_n$, there exist $\boldsymbol{W}_2$ and $b_2$ such that $\phi(x_i)=y_i$ for $i=0,1,\dots,n$ and $\phi$ is linear on $[x_{i-1},x_i]$ for $i=1,2,\dots,n$.

Part (1) in Lemma 1 follows by setting $\boldsymbol{W}_1=[1,1,\dots,1]^{T}$ and $\boldsymbol{b}_1=-[x_0,x_1,\dots,x_{n-1}]^{T}$. The existence in Part (2) is equivalent to the existence of a solution of a system of linear equations, which is left for the reader. Next, we study the properties of ReLU FNNs with two hidden layers. In fact, we can show that the closure of $\text{NN}(\text{#input}=1;\ \text{widthvec}=[m,n])$ contains $\text{CPL}(mn)$ for any $m,n\in\mathbb{N}^+$, where the closure is in the sense of the $L^p$-norm for any $p\in[1,\infty)$. The proof of this property relies on the following lemma.
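The following NumPy sketch illustrates Lemma 1 numerically under the explicit construction described above (first-layer weights equal to one and biases $-x_0,\dots,-x_{n-1}$, which is one valid choice rather than the only one): the breakpoints are fixed by the first layer alone, and the output layer then matches arbitrary values $y_i$ by solving a small linear system.

    # Numerical illustration of Lemma 1 (one explicit construction, assuming the
    # reconstruction above): hidden nodes are sigma(x - x_{i-1}) for i = 1, ..., n,
    # so the breakpoints are fixed, and (W_2, b_2) alone matches any values y_i.
    import numpy as np

    x_knots = np.array([0.0, 0.2, 0.5, 0.6, 1.0])         # strictly increasing x_0 < ... < x_n
    y_vals = np.array([1.0, -0.5, 0.3, 0.0, 2.0])         # arbitrary target values y_0, ..., y_n
    n = len(x_knots) - 1

    def phi(x, w2, b2):
        hidden = np.maximum(0.0, np.subtract.outer(x, x_knots[:n]))  # sigma(x - x_{i-1})
        return hidden @ w2 + b2

    # Solve the linear system phi(x_i) = y_i for the unknowns [w2; b2].
    A = np.hstack([np.maximum(0.0, np.subtract.outer(x_knots, x_knots[:n])),
                   np.ones((n + 1, 1))])
    sol = np.linalg.solve(A, y_vals)
    w2, b2 = sol[:n], sol[n]

    print(np.allclose(phi(x_knots, w2, b2), y_vals))      # True: the samples are matched exactly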

Lemma 2

For any , given any samples with and for , there exists satisfying the following conditions:

  1. for ;

  2. is linear on each interval for ;

  3. .

Proof

For simplicity of notation, we define , , and for any . Since , the architecture of is

(2.1)

Note that the first hidden layer maps $\mathbb{R}$ to a vector of hidden nodes, and hence each entry of this vector is itself a sub-network with one hidden layer. Our proof of Lemma 2 is mainly based on repeated applications of Lemma 1 to determine the parameters of $\phi$ such that Conditions (1) to (3) hold.

Step : Determine and .

By Lemma 1, and such that sub-networks in have the same set of break points: , no matter what and are.

Step : Determine and .

This is the key step of the proof. Our ultimate goal is to set up such that, after a nonlinear activation function, there exists a linear combination in the last step of our network (specified by and as shown in (2.1)) that can generate a desired matching the sample points . In the previous step, we have determined the break points of by setting up and ; in this step, we will identify and to fully determine . This will be conducted in two sub-steps.

Step : Set up.

Suppose is a continuous piecewise linear function defined on fitting the given samples for , and is linear between any two adjacent points of . We are able to choose and such that for by Lemma 1, since there are points in . Define , where as shown in Equation (2.1), since is positive by the construction of Lemma 1. Then we have for . See Figure 2 (a) for an illustration of , , and .
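For intuition, the continuous piecewise linear fit of the given samples constructed in this set-up step (call it g here; the name is chosen only for this sketch) can be produced numerically as below. This is merely an illustration of the object itself, not the explicit parameter choice used in the proof.

    # Illustration of the CPL fit in Step "Set up": the piecewise linear interpolant of
    # the samples (x_i, y_i), linear between adjacent sample points. The sample values
    # are arbitrary choices for illustration.
    import numpy as np

    x_samples = np.array([0.0, 0.1, 0.35, 0.6, 0.8, 1.0])
    y_samples = np.array([0.2, 1.0, 0.4, 0.7, 0.0, 0.5])

    def g(x):
        # np.interp returns exactly the continuous piecewise linear interpolant
        return np.interp(x, x_samples, y_samples)

    grid = np.linspace(0.0, 1.0, 1001)
    print(np.allclose(g(x_samples), y_samples), float(np.max(np.abs(g(grid)))))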

Step : Mathematical induction.

For each , given , we determine , , , and , to completely specify and , which in turn can determine . Hence, it suffices to show how to proceed with an arbitrary induction step, since the initialization of the induction has been constructed in the set-up step above. See Figure 2 (b)-(d) for an illustration of the first two induction steps. We recursively rely on the fact that

  • for ,

  • is linear on each interval for ,

to construct satisfying similar conditions as follows:

  • for ,

  • is linear on each interval for .

The induction process for , , , , and can be divided into four parts.

Step : Define index sets.

Let and . The cardinality of is . We will use and to generate samples to determine CPL functions and in the next step.

(a) (b) (c) (d)
Figure 2: Illustrations of the proof of Lemma 2, especially Step of the proof, when , with the “don’t-care” region in red. (a) Given samples marked with “star” signs, suppose is a CPL function fitting the samples, construct such that is closer to than in the sense. (b) Construct and such that is closer to than in the sense in a subset of the “important” region. (c) Construct and such that is closer to than in the sense in a larger subset of the “important” region. (d) The visualization of , which is in the “important” areas that have been processed and may remain large near the “don’t-care” region. will decay quickly outside the “don’t-care” region as increases.

Step : Determine and .

By Lemma 1, we can choose and to fully determine such that each matches a specific value for . The values of are specified as:

  • If , specify the values of and such that and . The existence of these values fulfilling the requirements above comes from the fact that is linear on the interval and only depends on the values of and on . Now it is easy to verify that satisfies and for , and is linear on each interval for .

  • If , let . Then on the interval .

  • Finally, specify the value of at as .

Step : Determine and .

Similarly, we choose and such that matches specific values as follows:

  • If , specify the values of and such that and . Then satisfies and for , and is linear on each interval for .

  • If , let . Then on the interval .

  • Finally, specify the value of at as .

Step : Construct from and .

For the sake of clarity, the properties of and constructed in Step are summarized below:

  1. for ;

  2. If and ;

  3. If and ;

  4. and are linear on each interval for . In other words, and are linear on each interval for .

See Figure 2 (a)-(c) for the illustration of , , , , and , and to verify their properties as listed above.

Note that , so for . Now we define , then

  • for ;

  • is linear on each interval for .

See Figure 2 (b)-(d) for the illustration of , , and , and to verify their properties as listed just above. This finishes the mathematical induction process. As we can imagine based on Figure 2, when increases, the support of shrinks to the “don’t-care” region.

Step : Determine and .

With the special vector function constructed in Step , we are able to specify and to generate a desired with a well-controlled -norm matching the samples .

In fact, we can simply set and , which finishes the construction of . The rest of the proof is to verify the properties of . Note that