Constructing Multilayer Perceptrons as Piecewise LowOrder Polynomial Approximators: A Signal Processing Approach
Abstract
The construction of a multilayer perceptron (MLP) as a piecewise loworder polynomial approximator using a signal processing approach is presented in this work. The constructed MLP contains one input, one intermediate and one output layers. Its construction includes the specification of neuron numbers and all filter weights. Through the construction, a onetoone correspondence between the approximation of an MLP and that of a piecewise loworder polynomial is established. Comparison between piecewise polynomial and MLP approximations is made. Since the approximation capability of piecewise loworder polynomials is well understood, our findings shed light on the universal approximation capability of an MLP.
I Introduction
Piecewise loworder polynomial approximation (or regression) is a classic topic in numerical analysis. It fits multiple loworder polynomials to a function in partitioned intervals. The approximation capability of piecewise loworder polynomials is well understood. In contrast, the representation power of neural networks is not as obvious. The representation power of neural networks has been studied for three decades, and universal approximation theorems in various forms have been proved [3, 4, 2, 1, 11]. These proofs indicate that neural networks with appropriate activations can represent a wide range of functions of interest. Their goals were to make their theorems as general as possible. The proofs were based on tools from functional analysis (e.g., the HahnBanach theorem [10]) and real analysis (e.g., SprecherKolmogorov theorem [12, 5]), which is less well known in the signal processing community. Furthermore, although they show the existence of a solution, most of them do not offer a specific MLP design. Lack of MLP design examples hinders our understanding of the behavior of neural networks.
A feedforward construction of an MLP with two intermediate layers was presented to solve a classification problem in [7]. As a sequel to [7], this work addresses the regression problem. For ease of presentation, our treatment will focus on the approximation of a univariate function. We adopt a signal processing (or numerical analysis) approach to tackle it. Through the design, we can explain how MLPs offer universal approximations intuitively. The construction includes the choice of activation functions, neuron numbers, and filter weights between layers. A onetoone correspondence between an MLP and a piecewise loworder polynomial will be established. Except for the trivial case of the unit step activation function, results on MLP construction and its equivalence to piecewise loworder polynomial approximators are new. Another interesting observation is the critical role played by activation functions. That is, activation functions actually define kernels of piecewise loworder polynomials. Since the approximation error of piecewise loworder polynomials is well understood, our findings shed light on the universal approximation capability of an MLP as well.
The rest of this paper is organized as follows. The construction of MLPs corresponding to pieceswise constant, linear and cubic polynomial approximations is detailed in Sec. II. The generalization from the 1D case to multivariate vector functions is sketched in Sec. III. Concluding remarks and future research directions are given in Sec. IV.
Ii Proposed MLP Constructions
Without loss of generality, we consider a continuous function defined on interval , which is uniformly divided into subintervals. Let be the length of the subinterval and , where . We show how to construct an MLP so that it serves as a piecewise constant or a piecewise linear approximation to in this section.
Iia Piecewise Constant Approximation
A piecewise constant approximation of can be written as
(1) 
where
(2) 
is the unit box function. We can rewrite the unit box function as the difference of two unit step functions
(3) 
where
(4) 
is the unit step function. Then, we can rewrite Eq. (1) as
(5) 
Our designed MLP consists of one input node denoting , one output node denoting , and intermediate nodes (or neurons), denoted by , where . We use and to denote the weight and bias between the input node, and and and to denote the weight and bias between and the output node , respectively. The response at before activation, denoted by , and the response at , denoted by , can be written as
(6)  
(7) 
where Act is the activation function, and are weights and and are biases.
IiB Piecewise Linear Approximation
A commonly used nonlinear activation function is the rectified linear unit (ReLU), which can be written as
(9) 
We will construct an MLP using as the nonlinear activation function and show its equivalence to a piecewise linear approximation to . The latter can be expressed as
(10) 
where
(11) 
and where is the unit triangle function in form of
(12) 
With one input node denoting and one output node denoting , we construct an MLP that has 4 neurons in two pairs, denoted by , , in the intermediate layer to yield the same effect of . We use and to denote the weight and bias from the input node to , respectively. In our design, they are specified as:
(13) 
The responses of neurons , and after ReLU nonlinear activation are shown in Figs. 1 (a) and (b), respectively, where the dotted lines indicate those of individual neurons and the solid lines show those of combined responses. Finally, the combination of all responses of all 4 neurons at the output node is shown in Fig. 1 (c). It is a unit triangle plus a vertical shift.
Similarly, we use and to denote the weight and bias from to the output node , respectively. They are specified below:
(14) 
It is easy to verify from the Fig. 1 (d) that these four neurons are nothing but a weighted triangle centered at with its base between and and height and can be approximated by a sequence of such weighted triangles.
IiC Piecewise Cubic Approximation
Next, we design an MLP that offers a piecewise cubic approximation to . Besides incoming and outgoing weights/biases, we will design the unit cubic activation (see Fig. 2(a)) in form of:
(15) 
where , , satisfies two constraints: i) and, ii) is antisymmetric with respect to ; i.e.,
(16) 
Note that these two constraints imply and . By substituting , , into Eq. (15), we get , , and . There is one degree of freedom left. To complete the design, one choice is to specify the firstorder derivative at the inflection point: . As shown in Fig. 2(a), the inflection point has the maximum firstorder derivative and its secondorder derivative is zero.
For interval centered at , we can use two neurons to build a thirdorder unit bump function plus a vertical shift as shown in Fig. 2(b). The weight and bias from the input node to are specified as:
(17) 
where and are activated by the unit cubic function, . The weights and biases from , , to output node are:
(18) 
where , , is the solution to the following linear system of equations:
(19) 
with boundary conditions . Given the target regression function at the righthandside of (19), one can solve the system for , , uniquely. If is smooth in a local region, as shown in Fig. 2(c).
There is an alternative design as shown in Fig. 2(d), where we increase the spacing between two adjacent bumps from to . The weight and bias from the input node to still follow Eq. (17) with , where , while Eq. (18) should be changed to
(20) 
This is fine since there is no interference between ,
. Furthermore, using the Taylor series
expansion
(21) 
The design in Fig. 2(c) demands neurons while that in Fig. 2(d) demands neurons only in total. The latter design appears to be simpler.
IiD Discussion
This section builds a bridge between Piecewise loworder polynomial and MLP approximations. The approximation errors of piecewise loworder polynomials can be derived using Taylor’s series expansion. As goes to infinity, the errors of piecewise constant, linear and cubic polynomials converge to zero at a rate of , and . The same conclusion applies to the corresponding designed MLPs.
There is however difference between them. To give an example, consider the kernel density estimation problem [9]:
(22) 
where is a kernel. The unknown function at given point can be expressed as the summation of weighted kernels, where weights can be solved by a regression method (say, linear regression) for given . The box, triangle, and thirdorder unit bump functions are kernels to smooth statistical variations in local intervals. The preselected kernel could be too local and rigid to be efficient in handling long and midrange correlations in . In contrast, parameters in the first stage of an MLP allow a richer choice of kernels. Our MLP design does not exploit this flexibility fully. For example, we may adopt more different values (see Fig. 3).
One focal point of neural network research is to understand the role of activation in a neuron. Historically, earliest activation was the simple unit step function [8]. It offers “on” and “off” two states with an abrupt jump. Such activation has zero derivatives except at and, as a result, backpropagation is not able to update parameters. Several variants such as the sigmoid, ReLU, leaky ReLU activations have been developed later. Paired activations are used to build kernels of loworder polynomials here. As presented earlier, higherorder nonlinear activations are not fundamental in the universal approximation capability of neural networks. They affect the kernel shape only. Instead, the role of nonlinear activation is interpreted as a tool to resolve the sign confusion problem caused by the cascade of multiple layers in [6] and [7].
Iii Generalization to Multivariate Vector Functions
Our presentation has focused on the case of 1D input/output so far. In general, the regression problem can be a dimensional vector function of multivariables:
(23) 
where . We will provide a sketch to the generalization from the univariate scalar function to the multivariate vector function in this section. Generalization of our proposed design to a vector function is straightforward since we can design one MLP for each output dimention.
The generalization from the univariate case to the multivariate case using 1D piecewise loworder polynomials, we can proceed as follows:

Partition into D cells of volume size ;

Conduct the 1D approximation separately;

Perform the tensor product of 1D kernels to form D kernels.
For the MLP design, we cannot perform the tensor product in the third step directly. Instead, we can use the tensor product result to find the weighted contributions of neighboring grid points to the center of each cell and construct an MLP accordingly.
In most practical machine learning applications, the multivariate vector function, , it is assumed to be sufficiently smooth. That is, its lowerorder partial derivatives exist. Consequently, Talyor series expansion for multivariable functions can be adopted to estimate the convergence rate to the target regression function.
Iv Conclusion and Future Work
A bridge between piecewise loworder polynomials and threelayer MLP approximators was built in this work. Instead of following the traditional path of universal approximation theorem of neural networks based on functional or real analysis, we proposed a signal processing approach to construct MLPs. The construction includes the choice of activation functions, neuron numbers, and filter weights between adjacent layers. Our design decomposes an MLP into two stages: 1) kernel construction in the first stage and 2) weight regression in the second stage. We can explain how an MLP offers an universal approximation to a univariate function intuitively by this approach. Kernel design plays an important role in many machine learning and statistical inference problems. It was shown to be related to the shape of the activation function here.
Given an MLP architecture, backprogagation is used to solve kernel construction and weight regression problems jointly via endtoend optimization. It is however an open problem to find the optimal number of neurons in the intermediate layer. On the other hand, the traditional approach decouples the whole problem into two subproblems and solve them one by one in a feedforward manner. It is interesting to study the relationship between the kernel design and the firststage filter weight/bias determination of MLPs. It is also desired to exploit prior knowledge about data and applications for flexible kernel design in feedforward learning systems.
Footnotes
 under the assumption that loworder derivatives of are bounded
References
 (1995) Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks 6 (4), pp. 911–917. Cited by: §I.
 (1992) Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory 70 (2), pp. 131–141. Cited by: §I.
 (1989) Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2 (4), pp. 303–314. Cited by: §I.
 (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366. Cited by: §I.
 (1957) On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, Vol. 114, pp. 953–956. Cited by: §I.
 (2016) Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation 41, pp. 406–413. Cited by: §IID.
 (2020) From twoclass linear discriminant analysis to interpretable multilayer perceptron design. arXiv preprint arXiv:2009.04442. Cited by: §I, §IID.
 (1943) A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5 (4), pp. 115–133. Cited by: §IID.
 (1962) On estimation of a probability density function and mode. The annals of mathematical statistics 33 (3), pp. 1065–1076. Cited by: §IID.
 (1973) Functional analysis. McGrawHill, New York. Cited by: §I.
 (1998) Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural networks 11 (1), pp. 15–37. Cited by: §I, §IIA.
 (1965) On the structure of continuous functions of several variables. Transactions of the American Mathematical Society 115, pp. 340–355. Cited by: §I.