Constructing Multilayer Perceptrons as Piecewise Low-Order Polynomial Approximators: A Signal Processing Approach

Abstract

The construction of a multilayer perceptron (MLP) as a piecewise low-order polynomial approximator using a signal processing approach is presented in this work. The constructed MLP contains one input layer, one intermediate layer, and one output layer. Its construction includes the specification of neuron numbers and all filter weights. Through the construction, a one-to-one correspondence between the approximation of an MLP and that of a piecewise low-order polynomial is established. A comparison between piecewise polynomial and MLP approximations is made. Since the approximation capability of piecewise low-order polynomials is well understood, our findings shed light on the universal approximation capability of an MLP.

Index Terms: multilayer perceptron, feedforward neural network, approximation theory, piecewise polynomial approximation.

© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Piecewise low-order polynomial approximation (or regression) is a classic topic in numerical analysis. It fits multiple low-order polynomials to a function over partitioned intervals. The approximation capability of piecewise low-order polynomials is well understood. In contrast, the representation power of neural networks is not as obvious. It has been studied for three decades, and universal approximation theorems in various forms have been proved [3, 4, 2, 1, 11]. These proofs indicate that neural networks with appropriate activations can represent a wide range of functions of interest, and their goal was to make the theorems as general as possible. The proofs were based on tools from functional analysis (e.g., the Hahn-Banach theorem [10]) and real analysis (e.g., the Sprecher-Kolmogorov theorem [12, 5]), which are less well known in the signal processing community. Furthermore, although they show the existence of a solution, most of them do not offer a specific MLP design. The lack of MLP design examples hinders our understanding of the behavior of neural networks.

A feedforward construction of an MLP with two intermediate layers was presented to solve a classification problem in [7]. As a sequel to [7], this work addresses the regression problem. For ease of presentation, our treatment focuses on the approximation of a univariate function. We adopt a signal processing (or numerical analysis) approach to tackle it. Through the design, we can explain intuitively how MLPs offer universal approximations. The construction includes the choice of activation functions, neuron numbers, and filter weights between layers. A one-to-one correspondence between an MLP and a piecewise low-order polynomial will be established. Except for the trivial case of the unit step activation function, the results on MLP construction and its equivalence to piecewise low-order polynomial approximators are new. Another interesting observation is the critical role played by activation functions: they actually define the kernels of the piecewise low-order polynomials. Since the approximation error of piecewise low-order polynomials is well understood, our findings shed light on the universal approximation capability of an MLP as well.

The rest of this paper is organized as follows. The construction of MLPs corresponding to piecewise constant, linear and cubic polynomial approximations is detailed in Sec. II. The generalization from the 1-D case to multivariate vector functions is sketched in Sec. III. Concluding remarks and future research directions are given in Sec. IV.

II Proposed MLP Constructions

Without loss of generality, we consider a continuous function f(x) defined on an interval [a, b], which is uniformly divided into N sub-intervals. Let Δ = (b − a)/N be the length of each sub-interval and x_n = a + nΔ, where n = 0, 1, …, N. We show how to construct an MLP so that it serves as a piecewise constant or a piecewise linear approximation to f(x) in this section.

II-A Piecewise Constant Approximation

A piecewise constant approximation of f(x) can be written as

f_c(x) = Σ_{n=0}^{N−1} f(x_n) B((x − x_n)/Δ),   (1)

where

B(x) = 1 for 0 ≤ x < 1, and B(x) = 0 otherwise,   (2)

is the unit box function. We can rewrite the unit box function as the difference of two unit step functions,

B(x) = U(x) − U(x − 1),   (3)

where

U(x) = 1 for x ≥ 0, and U(x) = 0 for x < 0,   (4)

is the unit step function. Then, we can rewrite Eq. (1) as

f_c(x) = Σ_{n=0}^{N−1} f(x_n) [U((x − x_n)/Δ) − U((x − x_n)/Δ − 1)].   (5)

Our designed MLP consists of one input node denoting x, one output node denoting y, and 2N intermediate nodes (or neurons) arranged in pairs, denoted by R_{n,1} and R_{n,2}, where 0 ≤ n ≤ N − 1. We use a_{n,i} and b_{n,i} to denote the weight and bias between the input node and R_{n,i}, and c_{n,i} and d to denote the weight between R_{n,i} and the output node and the output bias, respectively. The response at R_{n,i} before activation, denoted by z_{n,i}, and the response at the output node, denoted by y, can be written as

z_{n,i} = a_{n,i} x + b_{n,i},   (6)

y = Σ_{n=0}^{N−1} Σ_{i=1}^{2} c_{n,i} Act(z_{n,i}) + d,   (7)

where Act(·) is the activation function, a_{n,i} and c_{n,i} are weights, and b_{n,i} and d are biases.

With the unit step function as activation and Eq. (5), it is easy to verify that we can choose the following weights and biases for the MLP:

a_{n,1} = a_{n,2} = 1/Δ,  b_{n,1} = −x_n/Δ,  b_{n,2} = −x_n/Δ − 1,  c_{n,1} = f(x_n),  c_{n,2} = −f(x_n),  d = 0,   (8)

where 0 ≤ n ≤ N − 1. The above derivation was given in [11], and it is repeated here for the sake of completeness.
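To make the construction concrete, below is a minimal NumPy sketch that assembles this piecewise-constant MLP with the weights of Eq. (8); the target function, interval, and number of sub-intervals are illustrative choices rather than values taken from the paper.

```python
import numpy as np

def unit_step(z):
    return (z >= 0).astype(float)             # Eq. (4)

f = np.sin                                    # illustrative target on [a, b]
a, b, N = 0.0, 2.0 * np.pi, 32
delta = (b - a) / N
x_n = a + delta * np.arange(N)                # left endpoints x_0, ..., x_{N-1}

# First stage (input -> 2N hidden neurons), following Eq. (8).
w1 = np.full(2 * N, 1.0 / delta)
b1 = np.empty(2 * N)
b1[0::2] = -x_n / delta                       # gives U((x - x_n)/Delta)
b1[1::2] = -x_n / delta - 1.0                 # gives U((x - x_n)/Delta - 1)

# Second stage (hidden -> output): weights +f(x_n) and -f(x_n), bias d = 0.
w2 = np.empty(2 * N)
w2[0::2] = f(x_n)
w2[1::2] = -f(x_n)

x = np.linspace(a, b - 1e-9, 1000)
y_hat = unit_step(np.outer(x, w1) + b1) @ w2  # Eqs. (6)-(7)
print(np.max(np.abs(y_hat - f(x))))           # approximation error is O(Delta)
```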

II-B Piecewise Linear Approximation

A commonly used nonlinear activation function is the rectified linear unit (ReLU), which can be written as

ReLU(x) = max(0, x).   (9)

We will construct an MLP using ReLU(x) as the nonlinear activation function and show its equivalence to a piecewise linear approximation to f(x). The latter can be expressed as

f_l(x) = Σ_{n=0}^{N} f(x_n) T_n(x),   (10)

where

T_n(x) = T((x − x_n)/Δ),   (11)

and where T(x) is the unit triangle function in the form of

T(x) = 1 − |x| for |x| ≤ 1, and T(x) = 0 otherwise.   (12)

With one input node denoting x and one output node denoting y, we construct an MLP that has 4 neurons in two pairs, denoted by (R_{n,1}, R_{n,2}) and (R_{n,3}, R_{n,4}), in the intermediate layer to yield the same effect as T_n(x). We use a_{n,i} and b_{n,i} to denote the weight and bias from the input node to R_{n,i}, respectively. In our design, they are specified as:

(13)

The responses of the neuron pairs (R_{n,1}, R_{n,2}) and (R_{n,3}, R_{n,4}) after the ReLU nonlinear activation are shown in Figs. 1(a) and (b), respectively, where the dotted lines indicate the responses of individual neurons and the solid lines show the combined responses. Finally, the combination of the responses of all 4 neurons at the output node is shown in Fig. 1(c). It is a unit triangle plus a vertical shift.

Fig. 1: Responses of (a) neurons R_{n,1} and R_{n,2} individually (dotted lines) and jointly (solid lines), (b) neurons R_{n,3} and R_{n,4}, (c) all four neurons combined, and (d) the approximation of f(x) with a sequence of weighted triangle functions.

Similarly, we use c_{n,i} and d_n to denote the weight and bias from R_{n,i} to the output node y, respectively. They are specified below:

(14)

It is easy to verify from Fig. 1(d) that these four neurons together yield nothing but a weighted triangle centered at x_n, with its base between x_{n−1} and x_{n+1} and height f(x_n), and that f(x) can be approximated by a sequence of such weighted triangles.
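The piecewise-linear construction can be checked numerically. The sketch below does not reproduce the exact parameters of Eqs. (13)-(14); it uses one common four-neuron ReLU realization whose combined response equals the unit triangle T((x − x_n)/Δ) (the design above may differ by a vertical shift that cancels at the output node). The target function and grid are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                 # Eq. (9)

f = np.sin                                    # illustrative target on [a, b]
a, b, N = 0.0, 2.0 * np.pi, 32
delta = (b - a) / N
knots = a + delta * np.arange(N + 1)          # x_0, ..., x_N

def triangle_from_relu(x, xn):
    # Pair (R_{n,1}, R_{n,2}) ramps up over [x_n - Delta, x_n]; pair
    # (R_{n,3}, R_{n,4}) ramps down over [x_n, x_n + Delta]; their sum is
    # the unit triangle centered at x_n.
    pair12 = (relu(x - (xn - delta)) - relu(x - xn)) / delta
    pair34 = (relu(x - (xn + delta)) - relu(x - xn)) / delta
    return pair12 + pair34

x = np.linspace(a, b, 1000)
y_hat = sum(f(xn) * triangle_from_relu(x, xn) for xn in knots)   # Eq. (10)
print(np.max(np.abs(y_hat - f(x))))           # approximation error is O(Delta^2)
```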

II-C Piecewise Cubic Approximation

Next, we design an MLP that offers a piecewise cubic approximation to f(x). Besides the incoming and outgoing weights/biases, we will design the unit cubic activation φ(x) (see Fig. 2(a)) in the form of:

φ(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0,  0 ≤ x ≤ 1,   (15)

where φ(x) satisfies two constraints: i) φ(0) = 0, and ii) φ(x) is anti-symmetric with respect to the point (1/2, 1/2); i.e.,

φ(1/2 + t) + φ(1/2 − t) = 1,  0 ≤ t ≤ 1/2.   (16)

Note that these two constraints imply φ(1/2) = 1/2 and φ(1) = 1. By substituting x = 0, 1/2, and 1 into Eq. (15), we get a_0 = 0, a_3/8 + a_2/4 + a_1/2 + a_0 = 1/2, and a_3 + a_2 + a_1 + a_0 = 1. There is one degree of freedom left. To complete the design, one choice is to specify the first-order derivative at the inflection point, φ'(1/2). As shown in Fig. 2(a), the inflection point has the maximum first-order derivative and its second-order derivative is zero.

Fig. 2: (a) The unit cubic activation function, (b) responses of R_{n,1} and R_{n,2} individually (dotted lines) and jointly (solid lines), (c) and (d) two designs for the approximation to f(x) with a sequence of weighted third-order bumps.
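The coefficient derivation above amounts to a small linear solve. In the sketch below, the slope φ'(1/2) is an assumed illustrative value (the choice is left to the designer); the value 3/2 recovers the familiar smoothstep polynomial 3x^2 − 2x^3.

```python
import numpy as np

slope_at_half = 1.5   # assumed phi'(1/2); the free design choice

# Rows: phi(0) = 0, phi(1/2) = 1/2, phi(1) = 1, phi'(1/2) = slope.
# Columns: coefficients (a_3, a_2, a_1, a_0) of Eq. (15).
A = np.array([
    [0.0,   0.0,  0.0, 1.0],
    [0.125, 0.25, 0.5, 1.0],
    [1.0,   1.0,  1.0, 1.0],
    [0.75,  1.0,  1.0, 0.0],
])
rhs = np.array([0.0, 0.5, 1.0, slope_at_half])
a3, a2, a1, a0 = np.linalg.solve(A, rhs)
print(a3, a2, a1, a0)                         # slope 1.5 -> (-2, 3, 0, 0)

# Verify the anti-symmetry constraint of Eq. (16).
phi = np.polynomial.Polynomial([a0, a1, a2, a3])
t = np.linspace(0.0, 0.5, 11)
assert np.allclose(phi(0.5 + t) + phi(0.5 - t), 1.0)
```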

For the interval [x_{n−1}, x_{n+1}] centered at x_n, we can use two neurons to build a third-order unit bump function plus a vertical shift, as shown in Fig. 2(b). The weight and bias from the input node to R_{n,i}, i = 1, 2, are specified as:

(17)

where R_{n,1} and R_{n,2} are activated by the unit cubic function φ(x). The weights and biases from R_{n,i}, i = 1, 2, to the output node y are:

(18)

where the coefficients appearing in Eq. (18), one per grid point x_n with 0 ≤ n ≤ N, are the solution to the following linear system of equations:

(19)

with appropriate boundary conditions at the two end points x_0 and x_N. Given the target regression function f(x) on the right-hand side of (19), one can solve the system uniquely. If f(x) is smooth in a local region, the piecewise cubic approximation follows it closely, as shown in Fig. 2(c).

There is an alternative design, as shown in Fig. 2(d), where we increase the spacing between two adjacent bumps from Δ to 2Δ. The weight and bias from the input node to each neuron still follow Eq. (17), with the bump centers now spaced 2Δ apart, while Eq. (18) should be changed to

(20)

This is fine since there is no interference between adjacent bumps. Furthermore, using the Taylor series expansion1 and based on the fact that x_n is the inflection point of the two associated cubic activation functions, we can derive

(21)

The design in Fig. 2(c) demands about twice as many neurons in total as that in Fig. 2(d). The latter design appears to be simpler.
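For illustration only, the sketch below shows how two neurons sharing the unit cubic activation combine into a smooth third-order bump of the kind sketched in Fig. 2(b). It assumes the activation saturates at 0 below the unit interval and at 1 above it, and it adopts the smoothstep coefficients (a_3, a_2, a_1, a_0) = (−2, 3, 0, 0) as one admissible choice; it illustrates the bump kernel only, not the coefficient system of Eq. (19).

```python
import numpy as np

def cubic_act(u):
    u = np.clip(u, 0.0, 1.0)                  # assumed saturation outside [0, 1]
    return 3.0 * u**2 - 2.0 * u**3            # one admissible unit cubic activation

def cubic_bump(x, xn, delta):
    # A rising smooth step over [x_n - Delta, x_n] minus one over
    # [x_n, x_n + Delta] yields a bump of height 1 supported on
    # [x_n - Delta, x_n + Delta] and centered at x_n.
    return cubic_act((x - (xn - delta)) / delta) - cubic_act((x - xn) / delta)

delta = 0.25
x = np.linspace(-1.0, 1.0, 801)
values = cubic_bump(x, 0.0, delta)
print(values.max(), x[values.argmax()])       # peak value 1.0 at x_n = 0
```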

II-D Discussion

This section builds a bridge between piecewise low-order polynomial and MLP approximations. The approximation errors of piecewise low-order polynomials can be derived using Taylor's series expansion. As N goes to infinity, the errors of piecewise constant, linear, and cubic polynomials converge to zero at rates of O(Δ), O(Δ^2), and O(Δ^4), respectively. The same conclusion applies to the corresponding designed MLPs.
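The first two rates can be verified with a short numerical experiment (an illustrative smooth target; the cubic case additionally requires the coefficient system of Sec. II-C and is omitted here):

```python
import numpy as np

f = np.sin
a, b = 0.0, 2.0 * np.pi
x = np.linspace(a, b - 1e-9, 100001)

for N in (10, 20, 40, 80):
    knots = np.linspace(a, b, N + 1)
    cell = np.minimum(((x - a) / (b - a) * N).astype(int), N - 1)
    err_const = np.max(np.abs(f(x) - f(knots[cell])))                  # piecewise constant
    err_linear = np.max(np.abs(f(x) - np.interp(x, knots, f(knots))))  # piecewise linear
    print(N, err_const, err_linear)
# Doubling N roughly halves err_const (O(Delta)) and quarters err_linear
# (O(Delta^2)), consistent with the rates stated above.
```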

There is, however, a difference between them. To give an example, consider the kernel density estimation problem [9]:

f(x) ≈ Σ_{n=0}^{N} w_n K((x − x_n)/Δ),   (22)

where K(x) is a kernel. The unknown function f(x) at a given point x can be expressed as the summation of weighted kernels, where the weights w_n can be solved by a regression method (say, linear regression) for given samples of f(x). The box, triangle, and third-order unit bump functions are kernels used to smooth statistical variations in local intervals. A pre-selected kernel could be too local and rigid to be efficient in handling long- and mid-range correlations in f(x). In contrast, the parameters in the first stage of an MLP allow a richer choice of kernels. Our MLP design does not exploit this flexibility fully. For example, we may adopt kernels with several different supports (see Fig. 3).

Fig. 3: Two kernels of different supports centered at x_n, which can be achieved by adding more neurons to the intermediate layer of an MLP.
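As a concrete instance of Eq. (22), the sketch below fits noisy samples of a target function with a weighted sum of triangle kernels on a uniform grid and solves for the weights by ordinary least squares; the function, sample size, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N = 0.0, 1.0, 16
centers = np.linspace(a, b, N + 1)
delta = (b - a) / N

def triangle(u):
    return np.maximum(0.0, 1.0 - np.abs(u))   # unit triangle kernel, Eq. (12)

# Noisy samples of an illustrative target function.
x_train = rng.uniform(a, b, 200)
y_train = np.sin(2.0 * np.pi * x_train) + 0.05 * rng.standard_normal(200)

# Design matrix: one column per kernel, evaluated at the training points.
Phi = triangle((x_train[:, None] - centers[None, :]) / delta)
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Evaluate the fitted kernel expansion f_hat(x) = sum_n w_n K((x - x_n)/Delta).
x_test = np.linspace(a, b, 400)
y_fit = triangle((x_test[:, None] - centers[None, :]) / delta) @ w
print(np.max(np.abs(y_fit - np.sin(2.0 * np.pi * x_test))))
```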

One focal point of neural network research is to understand the role of activation in a neuron. Historically, the earliest activation was the simple unit step function [8]. It offers two states, "on" and "off", with an abrupt jump. Such an activation has a zero derivative everywhere except at the jump and, as a result, backpropagation is not able to update the parameters. Several variants, such as the sigmoid, ReLU, and leaky ReLU activations, were developed later. Paired activations are used to build kernels of low-order polynomials here. As presented earlier, higher-order nonlinear activations are not fundamental to the universal approximation capability of neural networks; they affect the kernel shape only. Instead, the role of nonlinear activation is interpreted as a tool to resolve the sign confusion problem caused by the cascade of multiple layers in [6] and [7].

III Generalization to Multivariate Vector Functions

Our presentation has focused on the case of 1D input/output so far. In general, the regression problem can be a K-dimensional vector function of multiple variables:

y = f(x) = (f_1(x), …, f_K(x))^T,   (23)

where x = (x_1, …, x_M)^T ∈ R^M. We will provide a sketch of the generalization from the univariate scalar function to the multivariate vector function in this section. Generalization of our proposed design to a vector function is straightforward since we can design one MLP for each output dimension.

To generalize from the univariate case to the multivariate case using 1D piecewise low-order polynomials, we can proceed as follows:

  1. Partition the input domain into M-D cells of volume size Δ^M;

  2. Conduct the 1D approximation along each dimension separately;

  3. Perform the tensor product of the 1D kernels to form M-D kernels.

For the MLP design, we cannot perform the tensor product in the third step directly. Instead, we can use the tensor product result to find the weighted contributions of neighboring grid points to the center of each cell and construct an MLP accordingly.
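As a small illustration of the third step above, a multidimensional kernel can be obtained as the tensor (outer) product of 1D kernels; the 2-D sketch below uses illustrative grid sizes.

```python
import numpy as np

def triangle(u):
    return np.maximum(0.0, 1.0 - np.abs(u))   # 1D unit triangle kernel

u = np.linspace(-1.0, 1.0, 41)
k1d = triangle(u)
k2d = np.outer(k1d, k1d)                      # separable 2-D kernel on the cell
# For an M-dimensional cell, take the outer product M times (e.g. with
# np.multiply.outer) and use the result to weight the neighboring grid
# points of each cell, as described above.
```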

In most practical machine learning applications, the multivariate vector function f(x) is assumed to be sufficiently smooth; that is, its lower-order partial derivatives exist. Consequently, the Taylor series expansion for multivariable functions can be adopted to estimate the convergence rate to the target regression function.

IV Conclusion and Future Work

A bridge between piecewise low-order polynomial and three-layer MLP approximators was built in this work. Instead of following the traditional path to the universal approximation theorem of neural networks based on functional or real analysis, we proposed a signal processing approach to construct MLPs. The construction includes the choice of activation functions, neuron numbers, and filter weights between adjacent layers. Our design decomposes an MLP into two stages: 1) kernel construction in the first stage and 2) weight regression in the second stage. With this approach, we can intuitively explain how an MLP offers a universal approximation to a univariate function. Kernel design plays an important role in many machine learning and statistical inference problems. It was shown here to be related to the shape of the activation function.

Given an MLP architecture, backpropagation is used to solve the kernel construction and weight regression problems jointly via end-to-end optimization. It is, however, an open problem to find the optimal number of neurons in the intermediate layer. On the other hand, the traditional approach decouples the whole problem into two subproblems and solves them one by one in a feedforward manner. It is interesting to study the relationship between the kernel design and the first-stage filter weight/bias determination of MLPs. It is also desirable to exploit prior knowledge about data and applications for flexible kernel design in feedforward learning systems.

Footnotes

  1. Under the assumption that the low-order derivatives of f(x) are bounded.

References

  1. T. Chen and H. Chen (1995) Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks 6 (4), pp. 911–917. Cited by: §I.
  2. C. K. Chui and X. Li (1992) Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory 70 (2), pp. 131–141. Cited by: §I.
  3. G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. Cited by: §I.
  4. K. Hornik, M. Stinchcombe and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366. Cited by: §I.
  5. A. N. Kolmogorov (1957) On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, Vol. 114, pp. 953–956. Cited by: §I.
  6. C. J. Kuo (2016) Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation 41, pp. 406–413. Cited by: §II-D.
  7. R. Lin, Z. Zhou, S. You, R. Rao and C. J. Kuo (2020) From two-class linear discriminant analysis to interpretable multilayer perceptron design. arXiv preprint arXiv:2009.04442. Cited by: §I, §II-D.
  8. W. S. McCulloch and W. Pitts (1943) A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5 (4), pp. 115–133. Cited by: §II-D.
  9. E. Parzen (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33 (3), pp. 1065–1076. Cited by: §II-D.
  10. W. Rudin (1973) Functional analysis. McGraw-Hill, New York. Cited by: §I.
  11. F. Scarselli and A. C. Tsoi (1998) Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Networks 11 (1), pp. 15–37. Cited by: §I, §II-A.
  12. D. A. Sprecher (1965) On the structure of continuous functions of several variables. Transactions of the American Mathematical Society 115, pp. 340–355. Cited by: §I.