PowerNet: Efficient Representations of Polynomials and Smooth Functions by Deep Neural Networks with Rectified Power Units
Deep neural networks with rectified linear units (ReLU) have become increasingly popular recently. However, the derivatives of the function represented by a ReLU network are not continuous, which limits the use of ReLU networks to situations where smoothness is not required. In this paper, we construct deep neural networks with rectified power units (RePU), which can give better approximations of smooth functions. Optimal algorithms are proposed to explicitly build neural networks with sparsely connected RePUs, which we call PowerNets, to represent polynomials with no approximation error. For general smooth functions, we first project the function onto its polynomial approximation, then use the proposed algorithms to construct the corresponding PowerNet. Thus, the error of the best polynomial approximation provides an upper bound on the best RePU network approximation error. For smooth functions in higher-dimensional Sobolev spaces, we use fast spectral transforms on tensor-product grids and sparse grid discretizations to obtain polynomial approximations. Our constructive algorithms show clearly a close connection between spectral methods and deep neural networks: a PowerNet with $n$ hidden layers can exactly represent polynomials of degree up to $s^n$, where $s$ is the power of the RePUs. The proposed PowerNets have potential applications in situations where high accuracy is desired or smoothness is required.
keywords: deep neural network, rectified linear unit, rectified power unit, sparse grid, PowerNet
Artificial neural networks (ANNs) have been a hot research topic for several decades. Deep neural networks (DNNs), a special class of ANNs with multiple hidden layers, have become increasingly popular recently. Since 2006, when efficient training methods were introduced by Hinton et al. hinton_fast_2006 , DNNs have brought significant improvements in several challenging problems, including image classification, speech recognition, computational chemistry, and the numerical solution of high-dimensional partial differential equations; see e.g. hinton_deep_2012 ; lecun_deep_2015 ; krizhevsky_imagenet_2017 ; han_solving_2017 ; zhang_deep_2018 , and references therein.
The success of ANNs relies on the fact that they have good representation power. Indeed, the universal approximation property of neural networks is well known: neural networks with one hidden layer of continuous/monotonic sigmoid activation functions are dense in the space of continuous functions; see e.g. cybenko_approximation_1989 ; funahashi_approximate_1989 ; hornik_multilayer_1989 for proofs in different settings. Moreover, for neural networks with non-polynomial activation functions, the upper bound of the approximation error is of spectral type even using only one hidden layer: an error rate of order $n^{-k/d}$ can be obtained for functions in the Sobolev space $W^{k}$, where $d$ is the number of dimensions and $n$ is the number of hidden nodes in the neural network mhaskar_neural_1996 . It is believed that one of the basic reasons behind the success of DNNs is the fact that deep neural networks have broader scopes of representation than shallow ones. Recently, several works have demonstrated or proved this in different settings. For example, by using a function-composition argument, Poggio et al. poggio_why_2017 showed that deep networks can avoid the curse of dimensionality for an important class of problems corresponding to compositional functions. For general function approximation, it has been proved by Yarotsky yarotsky_error_2017 that DNNs using rectified linear units (abbr. ReLU, a non-smooth activation function defined as $\max(0,x)$) need at most $\mathcal{O}\big(\varepsilon^{-d/k}\log(1/\varepsilon)\big)$ units and nonzero weights to approximate functions in the Sobolev space $W^{k,\infty}$ within error $\varepsilon$. This is similar to the results for shallow networks with one hidden layer of activation units, but optimal only up to a $\log(1/\varepsilon)$ factor. Similar results for approximating piecewise smooth functions using ReLU DNNs are given by Petersen and Voigtlaender petersen_optimal_2018 .
The significance of the works by Yarotsky yarotsky_error_2017 and Petersen and Voigtlaender petersen_optimal_2018 is that, by using a very simple rectified nonlinearity, DNNs can attain high-order approximation properties; shallow networks do not possess such a property. Other works showing that ReLU DNNs have high-order approximation properties include the work by E and Wang e_exponential_2018 and the recent work by Opschoor et al. opschoor_deep_2019 ; the latter relates ReLU DNNs to high-order finite element methods.
A basic fact used in the error estimates of yarotsky_error_2017 and petersen_optimal_2018 is that $x^2$ can be approximated to accuracy $\varepsilon$ by a ReLU network with $\mathcal{O}(\log(1/\varepsilon))$ layers. To remove this approximation error and the extra $\log(1/\varepsilon)$ factor in the size of the neural networks, we proposed to use rectified power units (RePU) to construct exact neural network representations of polynomials li_better_2019 . The RePU function is defined as
$$\sigma_s(x) = \begin{cases} x^s, & x \ge 0, \\ 0, & x < 0, \end{cases}$$
where $s$ is a non-negative integer. When $s=0$, we have the Heaviside step function; when $s=1$, we have the commonly used ReLU function $\max(0,x)$. We call $\sigma_s$ the rectified quadratic unit (ReQU) and the rectified cubic unit (ReCU) for $s=2, 3$, respectively. Note that some pioneering works have been done by Mhaskar and his coworkers (see e.g. mhaskar_approximation_1993 , chui_neural_1994a ) to give theoretical upper bounds for DNN function approximation by converting splines into RePU DNNs. However, for very smooth functions, their constructions of neural networks are neither optimal nor numerically stable. The error bound obtained is quasi-optimal due to an extra factor related to the smoothness of the underlying functions. This extra factor was removed in our earlier work li_better_2019 by introducing explicit, optimal, and stable constructions of ReQU networks that exactly represent polynomials. In this paper, we extend those results to deep networks using general RePUs with $s \ge 2$.
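For concreteness, the RePU activation and its special cases can be checked in a few lines (a NumPy sketch; the helper name `repu` is ours, not the paper's):

```python
import numpy as np

def repu(x, s):
    """Rectified power unit sigma_s(x): x**s for x > 0, and 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, np.power(x, s), 0.0)

# s = 0 gives the Heaviside function, s = 1 the ReLU,
# s = 2 the ReQU, and s = 3 the ReCU.
```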
Compared with two other constructive approaches (the Qin Jiushao algorithm, also known as Horner's method, and the first-composition-then-combination method used in mhaskar_approximation_1993 , chui_neural_1994a , etc.), our constructions of RePU neural networks to represent polynomials are optimal in the number of network layers and hidden nodes. To approximate general smooth functions, we first approximate the function by its best polynomial approximation, then convert the polynomial approximation into a RePU network of optimal size. The conclusions of algebraic convergence for functions of finite smoothness and exponential convergence for analytic functions then follow straightforwardly. For multi-dimensional problems, we use the concept of sparse grids to improve the error estimates of neural networks and lessen the curse of dimensionality.
The main advantage of the ReLU function is that ReLU DNNs are relatively easy to train compared with DNNs using other analytic sigmoidal activation units in traditional applications; the latter suffer from the well-known gradient vanishing phenomenon. However, ReLU networks have some limitations. For example, because the derivatives of a ReLU network function are not continuous, ReLU networks are hard to train when the loss function contains derivatives of the network; in such cases network functions with higher-order smoothness are desired. One such example is the deep Ritz method for solving partial differential equations (PDEs) recently developed by E and Yu e_deep_2018 , where ReQU networks are used.
The remainder of this paper is organized as follows. In Section 2 we first show how to realize univariate polynomials and approximate smooth functions using RePU networks. We then construct RePU network realizations of multivariate polynomials and general multivariate smooth functions in Section 3, with extensions to high-dimensional functions in sparse grid spaces given in Subsection 3.3. A short summary is given in Section 4.
2 Approximation of univariate smooth functions
We first introduce some notation. Denote by $\mathbb{N}$ the set of all positive integers, and let $\mathbb{N}_0 := \{0\} \cup \mathbb{N}$.
We define a neural network $\Phi$ with input dimension $d$ and number of layers $L$ as a matrix-vector sequence
$$\Phi = \big((A_1, b_1), (A_2, b_2), \ldots, (A_L, b_L)\big),$$
where $A_\ell$ are $n_\ell \times n_{\ell-1}$ matrices, $b_\ell \in \mathbb{R}^{n_\ell}$ are vectors called biases, and $n_0 = d$.
If $\Phi$ is a neural network defined by (2.1), and $\sigma$ is an arbitrary activation function, then we define the neural network function
$$R_\sigma(\Phi)(\boldsymbol{x}) := \boldsymbol{x}_L,$$
where $\boldsymbol{x}_L$ is defined iteratively as
$$\boldsymbol{x}_0 = \boldsymbol{x}, \qquad \boldsymbol{x}_\ell = \sigma(A_\ell \boldsymbol{x}_{\ell-1} + b_\ell), \quad 1 \le \ell \le L-1, \qquad \boldsymbol{x}_L = A_L \boldsymbol{x}_{L-1} + b_L.$$
Here we denote vector variables by bold letters and use the componentwise definition $\sigma(\boldsymbol{y}) := (\sigma(y_1), \ldots, \sigma(y_n))^T$ for $\boldsymbol{y} \in \mathbb{R}^n$.
We use three quantities to measure the complexity of a neural network $\Phi$: the number of layers $L(\Phi)$, the number of nodes (i.e. activation units) $N(\Phi)$, and the number of nonzero weights $M(\Phi)$. For the neural network defined in (2.1), $L(\Phi) = L$, $N(\Phi) = \sum_{\ell=1}^{L-1} n_\ell$, and $M(\Phi) = \sum_{\ell=1}^{L} M_\ell$, where $n_\ell$ is the dimension of $b_\ell$ and $M_\ell$ (for $1 \le \ell \le L$) is the number of nonzero weights in the $\ell$-th affine transformation. Note that, in this paper, we define $L(\Phi)$ as the number of layers of affine transformations defined in (2.3). We also call layer $0$ the input layer, layer $L$ the output layer, and layers $1, \ldots, L-1$ hidden layers. So there are $L-1$ hidden layers, which equals the number of layers of activation units.
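These three complexity measures can be computed mechanically from the matrix-vector sequence; a minimal sketch, assuming a network is stored as a Python list of `(A, b)` pairs (the function name `complexity` is ours):

```python
import numpy as np

def complexity(net):
    """Return (L, N, M) for net = [(A_1, b_1), ..., (A_L, b_L)]:
    number of affine layers, number of hidden (activation) nodes,
    and number of nonzero weights."""
    L = len(net)
    # The output layer carries no activation units, so it is excluded from N.
    N = sum(A.shape[0] for A, _ in net[:-1])
    M = sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in net)
    return L, N, M
```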
We define as the collection of all neural networks of input dimension , output dimension with at most neurons arranged in layers, i.e.
For a given activation function, we further define
To construct complex networks from simple ones, we first introduce several network composition operations.
Let $\Phi^1$ and $\Phi^2$ be two neural networks such that the input layer of $\Phi^1$ has the same dimension as the output layer of $\Phi^2$. We define the concatenation of $\Phi^1$ and $\Phi^2$ as
By the definition, we have
Let $\Phi^1$ and $\Phi^2$ be two neural networks, both with $L$ layers, and suppose the input dimensions of the two networks are $d_1$ and $d_2$, respectively. We define the parallelization of $\Phi^1$ and $\Phi^2$ as
Here the first-layer matrices are formed from those of $\Phi^1$ and $\Phi^2$ by padding zero columns at the end of one of them so that they have the same number of columns. Obviously, the parallelization is a neural network with $\max(d_1, d_2)$-dimensional input and $L$ layers. We have the relationship
For $\Phi^1$ and $\Phi^2$ defined as above, but not necessarily having the same input dimensions, we define the tensor product of $\Phi^1$ and $\Phi^2$ as
Obviously, the tensor product is an $L$-layer neural network with $(d_1 + d_2)$-dimensional input, and its output dimension is the sum of the output dimensions of $\Phi^1$ and $\Phi^2$. We have the relationship
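To make the concatenation rule concrete, here is a small sketch (the helper names `realize` and `concatenate` are ours): concatenation merges the output affine map of one network with the input affine map of the next, so no extra activation layer is introduced and the layer counts add up to $L_1 + L_2 - 1$.

```python
import numpy as np

def realize(net, sigma, x):
    """Evaluate [(A_1,b_1),...,(A_L,b_L)]: activation after every affine map except the last."""
    for A, b in net[:-1]:
        x = sigma(A @ x + b)
    A, b = net[-1]
    return A @ x + b

def concatenate(net2, net1):
    """Network realizing net2 applied after net1: the last affine map of net1
    and the first affine map of net2 collapse into a single affine map."""
    A1, b1 = net1[-1]
    A2, b2 = net2[0]
    return net1[:-1] + [(A2 @ A1, A2 @ b1 + b2)] + net2[1:]
```

With a ReLU activation, evaluating the concatenated network gives the same values as composing the two network functions directly.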
2.1 Basic properties of RePU networks
Our analysis relies upon the fact that low-degree monomials can all be realized by a one-hidden-layer $\sigma_s$ network with a small number of coefficients, which is presented in the following lemma.
The monomials can be exactly represented by neural networks with one hidden layer containing a finite number of activation nodes. More precisely:
(i) For $s \ge 2$, the monomial $x^s$ can be realized exactly using a network having one hidden layer with two nodes as follows:
$$x^s = \sigma_s(x) + (-1)^s \sigma_s(-x).$$
Correspondingly, the neural network is defined as
A graph representation of this network is sketched in Fig. 1(a).
(ii) For $0 \le n < s$, the monomial $x^n$ can be realized exactly using a network having only one hidden layer with no more than $2(s+1)$ nodes as follows:
A graph representation of this network is sketched in Fig. 1(b). Note that, when $n=0$, we have a trivial realization: $x^0 = 1$, which needs no hidden nodes. When $n=s$, the implementation in (i) is more efficient. When $n=1$, we obtain a network realization of the identity function.
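The two-node identity used in part (i), $x^s = \sigma_s(x) + (-1)^s \sigma_s(-x)$, is easy to verify numerically (a NumPy sketch; `repu` is our helper name):

```python
import numpy as np

def repu(x, s):
    """sigma_s(x) = x**s for x > 0, 0 otherwise."""
    return np.where(np.asarray(x, dtype=float) > 0, np.power(x, s), 0.0)

# Two hidden sigma_s nodes reproduce x**s on the whole real line:
for s in (2, 3, 4):
    for x in (-1.7, 0.0, 2.5):
        assert abs(repu(x, s) + (-1) ** s * repu(-x, s) - x ** s) < 1e-12
```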
(1) It is easy to check that $x^s$ has an exact realization given by
$$x^s = \sigma_s(x) + (-1)^s \sigma_s(-x).$$
(2) For the case of $0 \le n < s$, we consider the following linear combination
$$F(x) = \sum_{i=0}^{s} c_i \big[\sigma_s(x + t_i) + (-1)^s \sigma_s(-x - t_i)\big] = \sum_{i=0}^{s} c_i (x + t_i)^s,$$
where the second equality follows from the identity in part (1), $c_i$ are parameters to be determined, $t_i$ are distinct nodes, and $\binom{s}{k}$ are the binomial coefficients arising from expanding $(x+t_i)^s$. Identifying the above expression with a polynomial of degree not exceeding $s$, i.e. $x^n$, we obtain the following linear system
where the top-left sub-matrix of the coefficient matrix is a Vandermonde matrix, which is invertible as long as the nodes $t_i$ are distinct. The choices of $t_i$ are discussed later in Remark 1. Denoting the node matrix, the coefficient vector, and the right-hand side vector accordingly, we have
To represent $x^n$, we take the right-hand side in (2.17) to be $(\delta_{kn})_{k=0}^{s}$, where $\delta_{kn}$ is the Kronecker delta function. ∎
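The coefficient-matching system can be assembled and solved numerically. In this sketch the values $s=5$, $n=3$ and the use of Chebyshev nodes are our illustrative choices: we impose $\sum_i c_i (x+t_i)^s = x^n$ by matching the coefficients of each power $x^k$, which yields a Vandermonde-type system in the $c_i$.

```python
import numpy as np
from math import comb

s, n = 5, 3                     # realize x**n with s-th power units, n < s (illustrative values)
# Chebyshev nodes on [-1, 1] as the shifts t_i:
t = np.cos((2 * np.arange(s + 1) + 1) * np.pi / (2 * (s + 1)))
# Matching coefficients of x**k in  sum_i c_i * (x + t_i)**s == x**n
# gives the system  sum_i C(s,k) * t_i**(s-k) * c_i = delta_{k,n}.
A = np.array([[comb(s, k) * ti ** (s - k) for ti in t] for k in range(s + 1)])
rhs = np.zeros(s + 1)
rhs[n] = 1.0
c = np.linalg.solve(A, rhs)

# Verify the resulting combination reproduces x**n at a test point.
x = 0.7
assert abs(sum(ci * (x + ti) ** s for ci, ti in zip(c, t)) - x ** n) < 1e-8
```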
The inverse of the Vandermonde matrix is inevitably involved in the solution of (2.17), which makes formula (2.11) difficult to use for large $s$ due to the geometric growth of the condition number of the Vandermonde matrix gautschi_optimally_1975 ; beckermann_condition_2000 ; gautschi_optimally_2011 . The condition numbers of the Vandermonde matrices for three different choices of symmetric nodes are given in Figure 1. The three choices of symmetric nodes are equidistant nodes, Chebyshev nodes
$$t_j = \cos\Big(\frac{(2j+1)\pi}{2(s+1)}\Big), \quad j = 0, 1, \ldots, s,$$
and numerically calculated optimal nodes. The counterparts of these three choices for non-negative nodes are also depicted in Figure 1. Most of the results are taken from gautschi_optimally_2011 . For large $s$ the numerical procedure to calculate the optimal nodes may not succeed, but the growth rate of the condition number of Vandermonde matrices using Chebyshev nodes on $[-1,1]$ is close to that of the optimal case, so we use Chebyshev nodes (2.18) for large $s$. For smaller values of $s$, we use numerically calculated optimal nodes, which are given in gautschi_optimally_1975 :
Note that, in some special cases, if non-negative nodes are used, the number of activation functions in the network construction can be reduced. However, because the condition number in this case is larger than for symmetric nodes, we do not consider the use of all non-negative nodes in this paper.
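The growth of the Vandermonde condition number is easy to observe numerically; a minimal sketch using Chebyshev nodes (`np.vander` builds the matrix, and the helper names are ours):

```python
import numpy as np

def chebyshev_nodes(n):
    """n Chebyshev nodes cos((2j+1)*pi/(2n)) on [-1, 1]."""
    j = np.arange(n)
    return np.cos((2 * j + 1) * np.pi / (2 * n))

def vandermonde_cond(nodes):
    """2-norm condition number of the Vandermonde matrix on the given nodes."""
    return np.linalg.cond(np.vander(nodes, increasing=True))

# Even for this good node set the condition number grows geometrically with n.
conds = [vandermonde_cond(chebyshev_nodes(n)) for n in (4, 8, 16)]
assert conds[0] < conds[1] < conds[2]
```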
Based on Lemma 1, one can easily obtain the following results.
In the implementation of polynomials, operations of bivariate product form will frequently be involved. The following lemma asserts that such bivariate monomials can also be realized using only one hidden layer.
Bivariate monomials , can be realized as a linear combination of at most activation units of as
where , . A particular formula is given by (4.8) in the appendix section. The corresponding neural network is defined as
A graph representation of this network is sketched in Fig. 1(f). Obviously, the numbers of nonzero weights in the first-layer and second-layer affine transformations are and , respectively.
The proof of Lemma 2 is lengthy; we defer it to the appendix.
A polynomial of the form can be realized as a linear combination of at most activation units of as
Here the additional component contains only a linear combination layer. A graph representation of this network is sketched in Fig. 1(h). The numbers of nonzero weights in the first-layer and second-layer affine transformations are at most and , respectively. Here
2.2 Optimal realizations of polynomials by RePU networks with no error
For $x^n$ with $n \le s$, by Lemma 1, the numbers of layers, hidden units, and nonzero weights required in a network to realize it are all bounded by constants depending only on $s$. For $n > s$, we have the following theorem.
For $n > s$, there exists a network with
to exactly represent the monomial $x^n$ defined on $\mathbb{R}$. Here, $\lfloor x \rfloor$ represents the largest integer not exceeding $x$, and $\lceil x \rceil$ represents the smallest integer no less than $x$, for $x \in \mathbb{R}$.
1) For $n > s$, we first express $n$ in the positional numeral system with radix $s$ as follows:
where the digits $b_k$ satisfy $0 \le b_k \le s-1$, with the leading digit nonzero. Then
Introducing intermediate variables
$x^n$ can then be calculated iteratively as
Therefore, to construct a neural network expressing $x^n$, we need to realize three basic operations: raising to the power $s$, raising to the digit powers $b_k < s$, and multiplication. By Lemma 1, each step of the iteration (2.33) can be realized by a network with one hidden layer. The overall neural network realizing $x^n$ is then a concatenation of these one-hidden-layer sub-networks. We give the construction process of the neural network as follows.
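Viewed as plain arithmetic, independent of its network realization, the digit iteration can be sketched as follows (a Python sketch; the name `power_base_s` is ours, and `b` denotes the radix-$s$ digits above):

```python
def power_base_s(x, n, s):
    """Compute x**n from the base-s digits of n, using only the three
    basic operations: s-th powers, digit powers b < s, and multiplication."""
    result = 1.0
    y = x                        # invariant: y == x ** (s ** k) at digit k
    while n > 0:
        n, b = divmod(n, s)      # peel off the next base-s digit b
        if b:
            result *= y ** b     # contributes the factor x ** (b * s**k)
        y = y ** s               # advance to x ** (s ** (k + 1))
    return result
```

Each loop iteration uses only operations realizable by one hidden layer, so the number of layers in the resulting network is proportional to the number of base-$s$ digits of $n$, i.e. logarithmic in $n$.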
For , the first sub-network is constructed according to Lemma 1 as
It is easy to see that the number of nodes in the hidden layer is , and the number of non-zeros in and is .
For the intermediate steps, the sub-networks are constructed as
The number of nodes in layer is , and the number of non-zeros in and is at most . The number of non-zeros in and is at most .
For , the sub-network is constructed as
By a straightforward calculation, we find that the number of nodes in this layer is at most , and the number of non-zeros in and is .