Rational neural networks

# Rational neural networks

## Abstract

We consider neural networks with rational activation functions. The choice of the nonlinear activation function in deep learning architectures is crucial and heavily impacts the performance of a neural network. We establish optimal bounds in terms of network complexity and prove that rational neural networks approximate smooth functions more efficiently than ReLU networks with exponentially smaller depth. The flexibility and smoothness of rational activation functions make them an attractive alternative to ReLU, as we demonstrate with numerical experiments.

## 1 Introduction

Deep learning has become an important topic across many domains of science due to its recent success in image recognition, speech recognition, and drug discovery Hinton et al. (2012); Krizhevsky et al. (2012); LeCun et al. (2015); Ma et al. (2015). Deep learning techniques are based on neural networks, which contain a certain number of layers to perform several mathematical transformations on the input. A nonlinear transformation of the input determines the output of each layer in the neural network: , where is a matrix called the weight matrix, is a bias vector, and is a nonlinear function called the activation function (also called activation unit). The computational cost of training a neural network depends on the total number of nodes (size) and the number of layers (depth). A key question in designing deep learning architectures is the choice of the activation function to reduce the number of trainable parameters of the network while keeping the same approximation power Goodfellow et al. (2016).

While smooth activation functions such as sigmoid, logistic, or hyperbolic tangent are widely used, they suffer from the “vanishing gradient problem” Bengio et al. (1994) because their derivatives are zero for large inputs. Neural networks based on polynomial activation functions are an alternative Cheng et al. (2018); Daws Jr. and Webster (2019); Goyal et al. (2019); Guarnieri et al. (1999); Ma and Khorasani (2005); Vecci et al. (1998), but can be numerically unstable due to large gradients for large inputs Bengio et al. (1994). Moreover, polynomials do not approximate non-smooth functions efficiently Trefethen (2013), which can lead to optimization issues in classification problems. A popular choice of activation function is the Rectified Linear Unit (ReLU) defined as  Jarrett et al. (2009); Nair and Hinton (2010). It has numerous advantages, such as being fast to evaluate and zero for many inputs Glorot et al. (2011). Many theoretical studies characterize and understand the expressivity of shallow and deep ReLU neural networks from the perspective of approximation theory DeVore et al. (1989); Liang and Srikant (2016); Mhaskar (1996); Telgarsky (2016); Yarotsky (2017).

ReLU networks also suffer from drawbacks, which are most evident during training. The main disadvantage is that the gradient of ReLU is zero for negative real numbers. Therefore, its derivative is zero if the activation function is saturated Maas et al. (2013). To tackle these issues, several adaptations to ReLU have been proposed such as Leaky ReLU Maas et al. (2013), Exponential Linear Unit (ELU) Clevert et al. (2015), Parametric Linear Unit (PReLU) He et al. (2015), and Scaled Exponential Linear Unit (SELU) Klambauer et al. (2017). These modifications outperform ReLU in image classification applications, and some of these activation functions have trainable parameters, which are learned by gradient descent at the same time as the weights and biases of the network. To obtain significant benefits for image classification and partial differential equation (PDE) solvers, one can perform an exhaustive search over trainable activation functions constructed from standard units Jagtap et al. (2020); Ramachandran et al. (2017). However, most of the “exotic” activation functions in the literature are motivated by empirical results and are not supported by theoretical statements on their potentially improved approximation power over ReLU.

In this work, we study rational neural networks, which are neural networks with activation functions that are trainable rational functions. In Section 3, we provide theoretical statements quantifying the advantages of rational neural networks over ReLU networks. In particular, we remark that a composition of low-degree rational functions has a good approximation power but a relatively small number of trainable parameters. Therefore, we show that rational neural networks require fewer nodes and exponentially smaller depth than ReLU networks to approximate smooth functions to within a certain accuracy. This improved approximation power has practical consequences for large neural networks, given that a deep neural network is computationally expensive to train due to expensive gradient evaluations and slower convergence. The experiments conducted in Section 4 demonstrate the potential applications of these rational networks for solving PDEs and Generative Adversarial Networks (GANs).1 The practical implementation of rational networks is straightforward in the TensorFlow framework and consists of replacing the activation functions by trainable rational functions. Finally, we highlight the main benefits of rational networks: the fast approximation of functions, the trainability of the activation parameters, and the smoothness of the activation function.

## 2 Rational neural networks

We consider neural networks whose activation functions consist of rational functions with trainable coefficients and , i.e., functions of the form:

 F(x)=P(x)Q(x)=∑rPi=0aixi∑rQj=0bjxj,aP≠0,bQ≠0, (1)

where and are the polynomial degrees of the numerator and denominator, respectively. We say that is of type and degree .

The use of rational functions in deep learning is motivated by the theoretical work of Telgarsky, who proved error bounds on the approximation of ReLU neural networks by high-degree rational functions and vice versa Telgarsky (2017). On the practical side, neural networks based on rational activation functions are considered by Molina et al. Molina et al. (2019), who defined a safe Padé Activation Unit (PAU) as

 F(x)=∑rPi=0aixi1+|∑rQj=1bjxj|.

The denominator is selected so that does not have poles located on the real axis. PAU networks can learn new activation functions and are competitive with state-of-the-art neural networks for image classification. However, this choice results in a non-smooth activation function and makes the gradient expensive to evaluate during training. In a closely related work, Chen et al. Chen et al. (2018) propose high-degree rational activation functions in a neural network, which have benefits in terms of approximation power. However, this choice can significantly increase the number of parameters in the network, causing the training stage to be computationally expensive.

In this paper, we use low-degree rational functions as activation functions, which are then composed together by the neural network to build high-degree rational functions. In this way, we can leverage the approximation power of high-degree rational functions without making training expensive. We highlight the approximation power of rational networks and provide optimal error bounds to demonstrate that rational neural networks theoretically outperform ReLU networks. Motivated by our theoretical results, we consider rational activation functions of type , i.e., and . This type appears naturally in the theoretical analysis due to the composition property of Zolotarev sign functions (see Section 3.1): the degree of the overall rational function represented by the rational neural network is a whopping , while the number of trainable parameters only grows linearly with respect to the depth of the network. Moreover, a superdiagonal type allows the rational activation function to behave like a nonconstant linear function at , unlike a diagonal type, e.g., , or the ReLU function. A low-degree activation function keeps the number of trainable parameters small, while the implicit composition in a neural network gives us the approximation power of high-degree rationals. This choice is also motivated empirically, and we do not claim that the type is the best choice for all situations as the configurations may depend on the application (see Figure 6 of the Supplementary Material). Our experiments on the approximation of smooth functions and GANs suggest that rational neural networks are an attractive alternative to ReLU networks (see Section 4). We observe that a good initialization, motivated by the theory of rational functions, prevents rational neural networks from having arbitrarily large values.

## 3 Theoretical results on rational neural networks

Here, we demonstrate the theoretical benefit of using neural networks based on rational activation functions due to their superiority over ReLU in approximating functions. We derive optimal bounds in terms of the total number of trainable parameters (also called size) needed by rational networks to approximate ReLU networks as well as functions in the Sobolev space . Throughout this paper, we take to be a small parameter with . We show that an -approximation on the domain of a ReLU network by a rational neural network must have the following size (indicated in brackets):

 Rational [Ω(log(log(1/ϵ)))]≤ReLU≤% Rational [O(log(log(1/ϵ)))], (2)

where the constants only depend on the size and depth of the ReLU network. Here, the upper bound means that all ReLU networks can be approximated to within by a rational network of size . The lower bound means that there is a ReLU network that cannot be -approximated by a rational network of size less than , for some constant . In comparison, the size needed by a ReLU network to approximate a rational neural network within the tolerance of is given by the following inequalities:

 ReLU [Ω(log(1/ϵ))]≤Rational≤ReLU [O(log(1/ϵ))3], (3)

where the constants only depend on the size and depth of the rational neural network. This means that all rational networks can be approximated to within by a ReLU network of size , while there is a rational network that cannot be -approximated by a ReLU network of size less than . A comparison between (2) and (3) suggests that rational networks could be more resourceful than ReLU. A key difference between rational networks and neural networks with polynomial activation functions is that polynomials perform poorly on non-smooth functions such as ReLU, with an algebraic convergence of  Trefethen (2013) rather than the (root-)exponential convergence with rationals (see Figure 1 (left)).

### 3.1 Approximation of ReLU networks by rational neural networks

Telgarsky showed that neural networks and rational functions can approximate each other in the sense that there exists a rational function of degree2 that is -close to a ReLU network Telgarsky (2017), where is a small number. To prove this statement, Telgarsky used a rational function constructed with Newman polynomials Newman (1964) to obtain a rational approximation to the ReLU function that converges with square-root exponential accuracy. That is, Telgarsky needed a rational function of degree to achieve a tolerance of . A degree rational function can be represented with coefficients, i.e., and in Equation 1. Therefore, the rational approximation to a ReLU network constructed by Telgarsky requires at least parameters. In contrast, for any rational function, Telgarsky showed that there exists a ReLU network of size that is an -approximation on .

Our key observation is that by composing low-degree rational functions together, we can approximate a ReLU network much more efficiently in terms of the size (rather than the degree) of the rational network. Our theoretical work is based on a family of rationals called Zolotarev sign functions, which are the best rational approximation on , with , to the sign function Achieser (2013); Petrushev and Popov (2011), defined as

 sign(x)=⎧⎨⎩−1,x<0,0,x=0,1,x>0.

A composition of Zolotarev sign functions of type has type but can be represented with parameters instead of . This property enables the construction of a rational approximation to ReLU using compositions of low-degree Zolotarev sign functions with parameters in Lemma .

###### Lemma

Let . There exists a rational network of size such that

 ∥R−ReLU∥∞:=maxx∈[−1,1]|R(x)−ReLU(x)|≤ϵ.

Moreover, no rational network of size smaller than can achieve this.

The proof of Lemma (see Supplementary Material) shows that the given bound is optimal in the sense that a rational network requires at least parameters to approximate the ReLU function on to within the tolerance . The convergence of the Zolotarev sign functions to the ReLU function is much faster, with respect to the number of parameters, than the rational constructed with Newman polynomials (see Figure 1 (left)).

The converse of Lemma , which is a consequence of a theorem proved by Telgarsky (Telgarsky, 2017, Theorem 1.1), shows that any rational function can be approximated by a ReLU network of size at most .

###### Lemma

Let . If is a rational function, then there exists a ReLU network of size such that .

To demonstrate the improved approximation power of rational neural networks over ReLU networks ( versus ), it is known that a ReLU networks that approximates , which is rational, to within on must be of size at least  (Liang and Srikant, 2016, Theorem 11).

We can now state our main theorem based on Lemmas and 3.1. Theorem provides bounds on the approximation power of ReLU networks by rational neural networks and vice versa. We regard Theorem as an analogue of (Telgarsky, 2017, Theorem 1.1) for our Zolotarev sign functions, where we are counting the number of training parameters instead of the degree of the rational functions. In particular, our rational networks have high degrees but can be represented with few parameters due to compositions, making training more computationally efficient. While Telgarsky required a rational function with parameters to approximate a ReLU network with fewer than nodes in each of layers to within a tolerance of , we construct a rational network that only has size .

###### Theorem

Let and let denote the vector 1-norm. The following two statements hold:

1. Let be a rational network with layers and at most nodes per layer, where each node computes and is a rational function with Lipschitz constant (, , and are possibly distinct across nodes). Suppose further that and . Then, there exists a ReLU network of size

 O(kMlog(MLM/ϵ)3)

such that .

2. Let be a ReLU network with layers and at most nodes per layer, where each node computes and the pair (possibly distinct across nodes) satisfies . Then, there exists a rational network of size

 O(kMlog(log(M/ϵ)))

such that .

Theorem highlights the improved approximation power of rational neural networks over ReLU networks. ReLU networks of size are required to approximate rational networks while rational networks of size only are sufficient to approximate ReLU networks.

### 3.2 Approximation of functions by rational networks

A popular question is the required size and depth of deep neural networks to approximate smooth functions Liang and Srikant (2016); Montanelli et al. (2019); Yarotsky (2017). In this section, we consider the approximation theory of rational networks. In particular, we consider the approximation of functions in the Sobolev space , where is the regularity of the functions and . The norm of a function is defined as

 ∥f∥Wn,∞([0,1]d)=max|n|≤nesssupx∈[0,1]d|Dnf(x)|,

where is the multi-index , and is the corresponding weak derivative of . In this section, we consider the approximation of functions from

 Fd,n:={f∈Wn,∞([0,1]d),∥f∥Wn,∞([0,1]d)≤1}.

By the Sobolev embedding theorem Brezis (2010), this space contains the functions in , which is the class of functions whose first derivatives are Lipschitz continuous. Yarotsky derived upper bounds on the size of neural networks with piecewise linear activation functions needed to approximate functions in  (Yarotsky, 2017, Theorem 1). In particular, he constructed an -approximation to functions in with a ReLU network of size at most and depth smaller than . The term is introduced by a local Taylor approximation, while the term is the size of the ReLU network needed to approximate monomials, i.e., for , in the Taylor series expansion.

We now present an analogue of Yarotsky’s theorem for a rational neural network.

###### Theorem

Let , , , and . There exists a rational neural network of size

 O(ϵ−d/nlog(log(1/ϵ)))

and maximum depth such that .

The proof of Theorem consists of approximating by a local Taylor expansion. One needs to approximate the piecewise linear functions and monomials arising in the Taylor expansion by rational networks using Lemma and Proposition (see Supplementary Material). The main distinction between Yarotsky’s argument and the proof of Theorem is that monomials can be represented by rational neural networks with a size that does not depend on the accuracy of . In contrast, ReLU networks require parameters. Meanwhile, while ReLU neural networks can exactly approximate piecewise linear functions with a constant number of parameters, rational networks can approximate them with a size of a most (see Lemma ). That is, rational neural networks approximate piecewise linear functions much faster than ReLU networks approximate polynomials. This allows the existence of a rational network approximation to with exponentially smaller depth () than the ReLU networks constructed by Yarotsky.

A theorem proved by DeVore et al. DeVore et al. (1989) gives a lower bound of on the number of parameters needed by a neural network to express any function in with an error , under the assumption that the weights are chosen continuously. Comparing and , we find that rational neural networks require exponentially fewer nodes than ReLU networks with respect to the optimal bound of to approximate functions in .

## 4 Experiments using rational neural networks

In this section, we consider neural networks with trainable rational activation functions of type . We select the type based on empirical performance; roughly, a low-degree (but higher than ) rational function is ideal for generating high-degree rational functions by composition, with a small number of parameters. The rational activation units can be easily implemented in the open-source TensorFlow library Abadi et al. (2016) by using the polyval and divide commands for function evaluations. The coefficients of the numerators and denominators of the rational activation functions are trainable parameters, determined at the same time as the weights and biases of the neural network by backpropagation and a gradient descent optimization algorithm.

One crucial question is the initialization of the coefficients of the rational activation functions Chen et al. (2018); Molina et al. (2019). A badly initialized rational function might contain poles on the real axis, leading to exploding values, or converge to a local minimum in the optimization process. Our experiments, supported by the empirical results of Molina et al. Molina et al. (2019), show that initializing each rational function with the best rational approximation to the ReLU function (as described in Lemma ) produces good performance. The underlying idea is to initialize rational networks near a network with ReLU activation functions, widely used for deep learning. Then, the adaptivity of the rational functions allows for further improvements during the training phase. We represent the initial rational function used in our experiments in Figure 1 (right). The coefficients of this function are obtained by using the minimax command, available in the Chebfun software Driscoll et al. (2014); Filip et al. (2018) for numerically computing rational approximations (see Table 1 in the Supplementary Material).

In the following experiments, we use a single rational activation function of type at each layer, instead of different functions at each node to reduce the number of trainable parameters and the computational training expense.

### 4.1 Approximation of functions

Raissi, Perdikaris, and Karniadakis Raissi (2018); Raissi et al. (2019) introduce a framework called deep hidden physics models for discovering nonlinear partial differential equations (PDEs) from observations. This technique requires to solving the following interpolation problem: given the observation data at the spatio-temporal points , find a neural network (called the identification network), that minimizes the loss function

 L=1NN∑i=1|N(xi,ti)−ui|2. (4)

This technique has successfully discovered hidden models in fluid mechanics Raissi et al. (2020), solid mechanics Haghighat et al. (2020), and nonlinear partial differential equations such as the Korteweg–de Vries (KdV) equation Raissi et al. (2019). Raissi et al. use an identification network, consisting of layers and nodes per layer, to interpolate samples from a solution to the KdV equation. Moreover, they observe that networks based on smooth activation functions, such as the hyperbolic tangent () or the sinusoid (), outperform ReLU neural networks Raissi (2018); Raissi et al. (2019). However, the performance of these smooth activation functions highly depends on the application.

Moreover, these functions might not be adapted to approximate non-smooth or highly oscillatory solutions. Recently, Jagtap, Kawaguchi, and Karnidakis Jagtap et al. (2020) proposed and analyzed different adaptive activation functions to approximate smooth and discontinuous functions with physics-informed neural networks. More specifically, they use an adaptive version of classical activation functions such as sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU. The choice of these trainable activation functions introduces another parameter in the design of the neural network architecture, which may not be ideal for use for a black-box data-driven PDE solver.

We illustrate that rational neural networks can address the issues mentioned above due to their adaptivity and approximation power (see Section 3). Similarly to Raissi Raissi (2018), we use a solution to the KdV equation:

 ut=−uux−uxxx,u(x,0)=−sin(πx/20),

as training data for the identification network (see the left panel of Figure 2). We train and compare four neural networks, which contain ReLU, sinusoid, rational, and polynomial activation functions, respectively.3 The mean squared error (MSE) of the neural networks on the validation set throughout the training phase is reported in the right panel of Figure 2. We observe that the rational neural network outperforms the sinusoid network, despite having the same asymptotic convergence rate. The network with polynomial activation functions (chosen to be of degree 3 in this example) is harder to train than the rational network, as shown by the non-smooth validation loss (see the right panel of Figure 2). We highlight that rational neural networks are never much bigger in terms of trainable parameters than ReLU networks since the increase is only linear with respect to the number of layers. Here, the ReLU network has parameters (consisting of weights and biases), while the rational network has . The ReLU, sinusoid, rational, and polynomial networks achieve the following mean square errors after epochs:

 MSE(uReLU)=1.9×10−4, MSE(usin)=3.3×10−6, MSE(urat)=1.2×10−7, MSE(upoly)=3.6×10−5.

The absolute approximation errors between the different neural networks and the exact solution to the KdV equation is illustrated in Figure 5 of the Supplementary Material. The rational neural network is approximatively five times more accurate than the sinusoid network used by Raissi and twenty times more accurate than the ReLU network. Moreover, the approximation errors made by the ReLU network are not uniformly distributed in space and time and located in specific regions, indicating that a network with non-smooth activation functions is not appropriate to resolve smooth solutions to PDEs.

Generative adversarial networks (GANs) are used to generate fake examples from an existing dataset Goodfellow et al. (2014). They usually consist of two networks: a generator to produce fake samples and a discriminator to evaluate the samples of the generator with the training dataset. Radford et al. Radford et al. (2015) describe deep convolutional generative adversarial networks (DCGANs) to build good image representations using convolutional architectures. They evaluate their model on the MNIST and Imagenet image datasets Deng et al. (2009); LeCun et al. (1998). This section highlights the simplicity of using rational activation functions in existing neural network architectures by training an Auxiliary Classifier GAN (ACGAN) Odena et al. (2017) on the MNIST dataset. In particular, the neural network4, denoted by ReLU network in this section, consists of convolutional generator and discriminator networks with ReLU and Leaky ReLU Maas et al. (2013) activation units (respectively) and is used as a reference GAN. As in the experiment described in Section 4.1, we replace the activation units of the generative and discriminator networks by a rational function with trainable coefficients (see Figure 1). We initialize the activation functions in the training phase with the best rational function that approximates the ReLU function on .

We show images of digits from the first five classes generated by a ReLU and rational GANs at different epochs of the training in Figure 3 (the samples are generated randomly and are not manually selected). We observe that a rational network can generate realistic images with a broader range of features than the ReLU network, as illustrated by the presence of bold numbers at the epoch 20 in the bottom panel of Figure 3. However, the digits one generated by the rational network are identical, suggesting that the rational GAN suffers from mode collapse. It should be noted that generative adversarial networks are notoriously tricky to train Goodfellow et al. (2016). The hyper-parameters of the reference model are intensively tuned for a piecewise linear activation function (as shown by the use of Leaky ReLU in the discriminator network). Moreover, many stabilization methods have been proposed to resolve the mode collapse and non-convergence issues in training, such as Wasserstein GAN Arjovsky et al. (2017), Unrolled Generative Adversarial Networks Metz et al. (2016), and batch normalization Ioffe and Szegedy (2015). These techniques could be explored and combined with rational networks to address the mode collapse issue observed in this experiment.

## 5 Conclusions

We have investigated rational neural networks, which are neural networks with smooth trainable activation functions based on rational functions. Theoretical statements demonstrate the improved approximation power of rational networks in comparison with ReLU networks. In practice, it seems beneficial to select the activation function as very low-degree rationals, making training more computationally efficient. We emphasize that it is simple to implement rational networks in existing deep learning architectures, such as TensorFlow, together with the ability to have trainable activation functions.

There are many future research directions exploring the potential applications of rational networks in fields such as image classification, time series forecasting, and generative adversarial networks. These applications already employ nonstandard activation functions to overcome various drawbacks of ReLU. Another exciting and promising field is the numerical solution and data-driven discovery of partial differential equations with deep learning. We believe that popular techniques such as physics-informed neural networks Raissi et al. (2019) could benefit from rational neural networks to improve the robustness and performances of PDE solvers, both from a theoretical and practical viewpoint.

Neural networks have applications in diverse fields such as facial recognition, credit-card fraud, speech recognition, and medical diagnosis. There is a growing understanding of the approximation power of neural networks, which is adding theoretical justification to their use in societal applications. We are particularly interested in the future applicability of rational neural networks in discovering and solving of partial differential equations (PDEs). Neural networks, in particular rational neural networks, have the potential to revolutionize fields where PDE models derived by mechanistic principles are lacking.

{ack}

The authors thank the National Institute of Informatics (Japan) for funding a research visit, during which this project was initiated. We thank Gilbert Strang for making us aware of Telgarsky’s paper Telgarsky (2017). We also thank Matthew Colbrook and Nick Trefethen for their suggestions on the paper. This work is supported by the EPSRC Centre For Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) in collaboration with Simula Research Laboratory. The work of the third author is supported by the National Science Foundation grant no. 1818757.

## Appendix A Supplementary Material

### a.1 Deferred proofs of Section 3.1

We first show that a rational function can approximate the absolute value function on with square-root exponential convergence.

###### Lemma

For any integer , we have

 minr∈Rk,kmaxx∈[−1,1]||x|−xr(x)|≤4e−π√k/2,

where is the space of rational functions of type at most . Thus, is a rational approximant to of type at most .

Moreover, if for some and integers , then can be written as , where .

###### Proof

Let be a real number and consider the sign function on the domain , i.e.,

 sign(x)={−1,x∈[−1,−ℓ],+1,x∈[ℓ,1].

By [4, Equation (33)], we find that for any ,

 minr∈Rk,kmaxx∈[−1,−ℓ]∪[ℓ,1]|sign(x)−r(x)|≤4[exp(π22log(4/ℓ))]−k.

Let be the rational function of type that attains the minimum [4, Equation (12)]. We refer to such as the Zolotarev sign function. It is given by

 r(x)=Mx∏⌊(k−1)/2⌋j=1x2+c2j∏⌊k/2⌋j=1x2+c2j−1,cj=ℓ2sn2(jK(κ)/k;κ)1−sn2(jK(κ)/k;κ).

Here, is a real constant selected so that equioscillates on , , is the first Jacobian elliptic function, and is the complete elliptic integral of the first kind. Since we have the following inequality,

 maxx∈[−1,−ℓ]∪[ℓ,1]||x|−xr(x)| =maxx∈[−1,−ℓ]∪[ℓ,1]∣∣x⋅sign(x)−xr(x)∣∣ ≤maxx∈[−1,−ℓ]∪[ℓ,1]∣∣sign(x)−r(x)∣∣.

The last inequality follows because on . Moreover, since for (see [4, Equation (12)]) we have

 maxx∈[−ℓ,ℓ]||x|−xr(x)|≤maxx∈[−ℓ,ℓ]|x|≤ℓ.

Therefore,

 maxx∈[−1,1]||x|−xr(x)|≤max⎧⎨⎩ℓ,4[exp(π22log(4/ℓ))]−k⎫⎬⎭.

Now, select to minimize this upper bound. One finds that and the result follows immediately.

For the final claim, let be the Zolotarev sign function of type on , with . By definition, is the best rational approximation of degree to the sign function on . We know from [32, 44] that there exist Zolotarev sign functions , where each is of type , such that

 r(x):=Zk(x;ℓ)=Rp(⋯(R2(R1(x))⋯). (5)

The proof of Lemma is a direct consequence of the previous lemma and the properties of Zolotarev sign functions.

###### Proof (Proof of Lemma )

Let , , , and be the Zolotarev sign function of type . Again from [32, 44], we see that there exist Zolotarev sign functions of type such that their composition equals , i.e.,

 r(x):=Z3k(x;ℓ)=Rk(⋯(R2(R1(x))⋯). (6)

Following the proof of Lemma , we have the inequality

 maxx∈[−1,1]||x|−xr(x)|≤4e−π√3k/2, (7)

where we chose . Now, we take

 k=⌈ln(2/π2)+2ln(ln(4/ϵ))ln(3)⌉, (8)

so that the right-hand side of Equation 7 is bounded by . Finally, we use the identity

 ReLU(x)=|x|+x2,x∈R,

to define a rational approximation to the ReLU function on the interval as

 ~r(x)=12(xr(x)1+ϵ+x).

Therefore, we have the following inequalities for ,

 |ReLU(x)−~r(x)| =12∣∣∣|x|−xr(x)1+ϵ∣∣∣≤12(1+ϵ)(||x|−xr(x)|+ϵ|x|) ≤ϵ1+ϵ≤ϵ.

Then, is a composition of rational functions of type and can be represented using at most coefficients (see Equation 5). Moreover, using Equation 8, we see that , which means that is representable by a rational network of size . Finally, for .

The upper bound on the complexity of the neural network obtained in Lemma is optimal, as proved by Vyacheslavov [59].

###### Theorem (Vyacheslavov)

The following inequalities hold:

 C1e−π√k≤maxx∈[−1,1]||x|−rk(x)|≤C2e−π√k,k≥0, (9)

where is the best rational approximation to in from . Here, are constants that are independent of .

We first deduce the following corollary, giving lower and upper bounds on the optimal rational approximation to the ReLU function.

###### Corollary

The following inequalities hold:

 C12e−π√k≤∥ReLU−rk∥∞≤C22e−π√k,k≥0, (10)

where is the best rational approximation to ReLU on in and are constants given by Theorem .

###### Proof

Let be an integer and let be any rational function of degree . Now, define . Since , we have

 ∥ReLU−rk∥∞=maxx∈[−1,1]∣∣∣12(r% abs(x)+x)−12(|x|+x)∣∣∣=maxx∈[−1,1]12∣∣rabs(x)−|x|∣∣≥12C1e−π√k,

where the inequality is from Theorem . Now, let be the best rational approximation to on . Now, define . We find that

 ∥ReLU−rReLU∥∞=maxx∈[−1,1]∣∣∣12(|x|+x)−12(rk(x)+x)∣∣∣=maxx∈[−1,1]12||x|−rk(x)|≤12C2e−π√k,

which proves that the best approximation to ReLU satisfies the upper bound.

We now show that a rational neural network must be at least in size (total number of nodes) to approximate the ReLU function to within .

###### Proposition

Let . A rational neural network that approximates the ReLU function on to within has size of at least .

###### Proof

Let be a rational neural network with nodes at each of its layers, and assume that its activation functions are rational functions of type at most . Let be the maximum of the degrees of the activation functions of . Such a network has size . Note that itself is a rational function of degree , where from additions and compositions of rational functions we have . If is an -approximation to the ReLU function on , we know by Corollary that

 C12e−π√d≥ϵ,d≥(1πln(C12ϵ))2. (11)

The statement follows by minimizing the size of , i.e., subject to

 dMrM∏i=1ki≥(1πln(C12ϵ))2.

That is,

 M∑i=1ln(ki)+Mln(dr)≥2ln(ln(C12ϵ))−2ln(π). (12)

We introduce a Lagrange multiplier and define the Lagrangian of this optimization problem as

 L(k1,…,kM,λ)=M∑i=1ki+λ[2ln(ln(C12ϵ))−2ln(π)−M∑i=1ln(ki)−Mln(dr)].

One finds using the Karush–Kuhn–Tucker conditions [31] that . Then, using Equation 12, we find that satisfies

 ln(λ)≥2M[ln(ln(C12ϵ))−ln(π)]−ln(dr)=:ln(λ∗). (13)

Therefore, the rational network with layers that approximates the ReLU function to within on has a size of at least , where is given by Equation 13 and depends on . We now minimize with respect to the number of layers . We remark that minimizing is equivalent of minimizing , where

 ln(s(M))=ln(M)+ln(λ∗)=ln(M)+2M[ln(ln(C12ϵ))−ln(π)]−ln(dr).

One finds that one should take and . The result follows.

We now show that ReLU neural networks can approximate rational functions.

###### Proof (Proof of Lemma )

Let and be a rational function. Take , which is still a rational function. Without loss of generality, we can assume that is an irreducible rational function (otherwise cancel factors till it is irreducible). Since is a rational, it can be written as with . Moreover, we know that for so we can assume that for (it is either positive or negative by continuity). Since is continuous on , there is an integer such that for . Furthermore, we find that for because and for . By [56, Theorem 1.1], there exists a ReLU network of size such that

 maxx∈[0,1]∣∣∣f(x)−p(x)q(x)∣∣∣≤ϵ2.

We now define a scaled ReLU network such that for . Therefore, for all ,

 ∣∣~f(x)−~R(x)∣∣=∣∣∣f(x)1+ϵ/2−p(x)q(x)∣∣∣≤11+ϵ/2(∣∣∣f(x)−p(x)q(x)∣∣∣+ϵ2∣∣∣p(x)q(x)∣∣∣)≤ϵ.

Therefore, is a ReLU neural network of size that is an -approximation to on .

We can now prove Theorem that shows how rational neural networks can approximate ReLU networks and vice versa. The structure of the proof closely follows [56, Lemma 1.3].

###### Proof (Proof of Theorem )

The statement of Theorem comes in two parts, and we prove them separately. 1.  Consider the subnetwork of the rational network , consisting of the layers of up to the th layer for some . Let denote the ReLU network obtained by replacing each rational function in by a ReLU network approximation at a given tolerance for and , such that for (see Lemma 3.1). Let be the output of the rational network at layer and node for . Now, approximate node in the st layer by a ReLU network with tolerance (see Lemma 3.1). The approximation error between the rational and the approximating ReLU network at layer and node satisfies

 Ei,J+1 =|fri,J+1(a⊤i,J+1HReLU(x)+bi,J+1)−ri,J+1(a⊤i,J+1H(x)+bi,J+1)| ≤|fri,J+1(a⊤i,J+1HReLU(x)+bi,J+1)−ri,J+1(a⊤i,J+1HReLU(x)+bi,J+1)|(1) +|ri,J+1(a⊤i,J+1HReLU(x)+bi,J+1)−ri,J+1(a⊤i,J+1H(x)+bi,J+1)|(2).

The first term is bounded by

 (1)≤maxx∈[−1,1]∣∣ri,J+1(x)−fri,J+1∣∣≤ϵJ+1,

since by assumption. The second term is bounded as the Lipschitz constant of is at most . That is,

 (2)≤L∥ai,J+1∥1maxx∈[−1,1]d∥∥HReLU(x)−H(x)∥∥∞≤Lmaxx∈[−1,1]d∥∥HReLU(x)−H(x)∥∥∞,

where we used the fact that and for . We find that we have the following set of inequalities:

 max1≤i≤kj+1Ei,j+1≤Lmax1≤i≤kjEi,j+ϵj+1,1≤i≤kj,1≤j≤J+1,

with . If we select , then we find that . When , the ReLU network approximates the original rational network, , and the ReLU network has size

 O(kM∑j=1log(MLj−Mϵ)3).

where we used the fact that for . This can be simplified a little since

 M∑j=1log(MLj−Mϵ)3=M∑j=1(log(MLM/ϵ)+jlog(1/L))3=O(Mlog(MLM/ϵ)3).

2.  Telgarsky proved in [56, Lemma 1.3] that if is a neural network obtained by replacing all the ReLU activation functions in by rational functions for , which satisfies and for , then

 maxx∈[−1,1]d|f(