Minimum “Norm” Neural Networks are Splines

Rahul Parhi and Robert D. Nowak

Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI 53706 (rahul@ece.wisc.edu, rdnowak@wisc.edu)

This work is partially supported by AFOSR/AFRL grant FA9550-18-1-0166 and the NSF Research Traineeship (NRT) grant 1545481.
Abstract

We develop a general framework based on splines to understand the interpolation properties of overparameterized neural networks. We prove that minimum “norm” two-layer neural networks (with appropriately chosen activation functions) that interpolate scattered data are minimal knot splines. Our results follow from understanding key relationships between notions of neural network “norms”, linear operators, and continuous-domain linear inverse problems.

Keywords: splines, neural networks, inverse problems

AMS subject classifications: 41A25, 46E27, 47A52, 68T05, 82C32, 94A12

1 Introduction

Contradicting classical statistical wisdom, recent trends in data science and machine learning have shown that overparameterized models that interpolate the training data perform surprisingly well on new, unseen data. This phenomenon is very often seen in the generalization performance of overparameterized neural network models. These models are typically trained to zero training error, i.e., they interpolate the training data, yet they predict very well on new, test data. Prior work has tried to understand this phenomenon from a statistical perspective by studying the statistical properties of such interpolating models [interp-sgd, understand-deep-kernel, interp-stat-opt, overfit-perfect-fit]. To control the complexity of overparameterized neural networks, solutions with minimum $\ell^2$-norm of the network weights are often encouraged during optimization via $\ell^2$ regularization (often referred to as “weight decay” in Stochastic Gradient Descent (SGD) steps). Motivated by this practice, the functional mappings generated by two-layer, infinite-width neural networks with Rectified Linear Unit (ReLU) activation functions were studied in [nn-linear-spline]. Such networks that interpolate data subject to minimizing the $\ell^2$-norm of the network weights were shown to correspond to linear spline interpolation of the data.

In this work we build off these results and develop a general framework based on splines to understand what function spaces can be learned by neural networks. We study more general two-layer neural networks that interpolate data while minimizing certain “norms”. We relate this optimization to an equivalent optimization over functions that live in the native space of a particular linear operator. Both the neural network activation function and the “norm” are tailored to the linear operator.

Our key contribution is establishing relationships among neural network “norms”, linear operators, and continuous-domain linear inverse problems. With these connections we turn to the recent work of [L-splines], which introduces a framework for $\mathrm{L}$-splines in which a particular type of spline is the solution to a continuous-domain linear inverse problem involving a linear operator $\mathrm{L}$. We prove that minimizing the “norm” of a specific neural network architecture (where the architecture and “norm” are determined by the choice of the linear operator $\mathrm{L}$) subject to interpolating scattered data exactly solves this inverse problem. By noticing and understanding these connections we bridge the gap between spline theory and the interpolation properties of overparameterized neural networks.

In particular, we prove that overparameterized two-layer neural networks mapping $\mathbb{R} \to \mathbb{R}$ can interpolate scattered data with a large class of $\mathrm{L}$-splines, including polynomial splines, fractional splines, and many exponential splines. This result is not only interesting to the data science and machine learning communities, but also to the spline community as a new, and perhaps unconventional, way to compute splines. Additionally, since our result shows that neural networks indirectly solve continuous-domain linear inverse problems, it is also interesting to the inverse problems community as a new way to solve such problems, as opposed to more standard multiresolution or grid-based approaches [inv-prob1, inv-prob2]. Our work may also be relevant to recent work on developing an infinite-dimensional theory of compressed sensing [infinite-dim-cs1, infinite-dim-cs2] as well as other inverse problems set in the continuous domain [inv-prob-space-measures].

This paper is organized as follows. In Section 2 we provide relevant background from spline theory and introduce the framework of $\mathrm{L}$-splines from [L-splines]. In Section 3 we introduce the neural networks we’ll be interested in as well as the notion of neural network “norms”. In Section 4 we present our main theoretical results establishing connections among neural networks, continuous-domain linear inverse problems, and splines. In Section 5 we detail which $\mathrm{L}$-splines a neural network can learn. In Section 6 we discuss various neural network “norms” that result in minimum “norm” neural networks learning splines. In Section 7 we support the theory developed in Section 4 with empirical validation.

2 Splines and Continuous-Domain Linear Inverse Problems

In this section we state results from spline theory that make powerful associations between splines and operators. Specifically, we review $\mathrm{L}$-splines and their connections to continuous-domain linear inverse problems with generalized total-variation regularization [L-splines]. The main result of interest is a representer theorem that describes the solution sets of the aforementioned inverse problems.

2.1 Preliminaries

To state the main result we require some technical notation (see Section 3 of [L-splines] for a more thorough discussion). Let $\mathcal{S}'(\mathbb{R})$ denote the space of tempered distributions, which is the continuous dual of the Schwartz space $\mathcal{S}(\mathbb{R})$ of smooth and rapidly decaying test functions on $\mathbb{R}$ [folland, rudin]. In this work, we’ll be interested in the space $\mathcal{M}(\mathbb{R})$ of finite Radon measures. For each $\mu \in \mathcal{M}(\mathbb{R})$ we can associate a tempered distribution acting on test functions by integration against $\mu$ (note that since we’re working with tempered distributions this is a slight abuse of notation when $\mu$ is not absolutely continuous with respect to the Lebesgue measure), so it becomes convenient to think that $\mathcal{M}(\mathbb{R}) \subseteq \mathcal{S}'(\mathbb{R})$.

The Riesz–Markov–Kakutani representation theorem says that $\mathcal{M}(\mathbb{R})$ is the continuous dual of $C_0(\mathbb{R})$, the space of continuous functions vanishing at infinity. Since $C_0(\mathbb{R})$ is a Banach space when equipped with the uniform norm, we have the dual norm

(1)
$$ \|\mu\|_{\mathcal{M}} := \sup_{\varphi \in C_0(\mathbb{R}),\ \|\varphi\|_{\infty} \le 1} \langle \mu, \varphi \rangle, $$

and so

$$ \mathcal{M}(\mathbb{R}) = \{ \mu \in \mathcal{S}'(\mathbb{R}) : \|\mu\|_{\mathcal{M}} < \infty \}. $$

A key observation is that $\mathcal{M}(\mathbb{R})$ is a space larger than $L^1(\mathbb{R})$, the space of absolutely integrable functions (viewed as tempered distributions). Indeed, for any $f \in L^1(\mathbb{R})$ we have

$$ \|f\|_{\mathcal{M}} = \|f\|_{L^1}, $$

and we remark that $L^1(\mathbb{R})$ is dense in $\mathcal{M}(\mathbb{R})$. We’re interested in the space $\mathcal{M}(\mathbb{R})$ since the Dirac deltas $\delta(\cdot - x_0) \in \mathcal{M}(\mathbb{R})$ for every $x_0 \in \mathbb{R}$, but $\delta(\cdot - x_0) \notin L^1(\mathbb{R})$, with $\|\delta(\cdot - x_0)\|_{\mathcal{M}} = 1$.

Definition 1 ([L-splines, Definition 1])

A linear operator $\mathrm{L}: \mathcal{S}'(\mathbb{R}) \to \mathcal{S}'(\mathbb{R})$ is called spline-admissible if it satisfies the following properties:

  • it is translation-invariant, i.e., $\mathrm{L}\, T_{x_0} = T_{x_0} \mathrm{L}$ for all $x_0 \in \mathbb{R}$, where $T_{x_0}: f \mapsto f(\cdot - x_0)$ is the translation operator;

  • there exists a function $g: \mathbb{R} \to \mathbb{R}$ such that $\mathrm{L} g = \delta$, i.e., $g$ is a Green’s function of $\mathrm{L}$;

  • the kernel (null space) $\mathcal{N}_{\mathrm{L}}$ of $\mathrm{L}$ has finite dimension $N_0$.

Definition 2 ([L-splines, Definition 2])

Let $\mathrm{L}$ be spline-admissible in the sense of Definition 1. Then, a function $s: \mathbb{R} \to \mathbb{R}$ is called a non-uniform $\mathrm{L}$-spline with spline knots $\tau_1 < \cdots < \tau_K$ and weights $a_1, \ldots, a_K \in \mathbb{R}$ if

$$ \mathrm{L} s = \sum_{k=1}^{K} a_k\, \delta(\cdot - \tau_k). $$

The right-hand side is called the innovation of the spline $s$.

Remark 1

Associating a spline with an operator $\mathrm{L}$ captures many common splines, including

  • the well-studied polynomial splines of order $m$ by choosing $\mathrm{L} = \mathrm{D}^m$, where $\mathrm{D}$ is the derivative operator (here and in the rest of this paper, derivatives are understood in the distributional sense) [poly-splines];

  • the fractional splines of order $\gamma$ by choosing $\mathrm{L} = \mathrm{D}^{\gamma}$, where $\mathrm{D}^{\gamma}$ is the fractional derivative and $\gamma > 0$ [fract-splines];

  • the exponential splines by choosing $\mathrm{L} = \mathrm{D}^N + a_{N-1}\mathrm{D}^{N-1} + \cdots + a_1 \mathrm{D} + a_0 \mathrm{I}$, where $\mathrm{I}$ is the identity operator [exp-splines, exp-splines-unser].
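For concreteness, here is the polynomial-spline case written out (a standard worked instance, with the usual normalization of the Green’s function):
$$ \mathrm{D}^m \left\{ \frac{x_+^{m-1}}{(m-1)!} \right\} = \delta, \qquad \mathrm{D}^m s = \sum_{k=1}^{K} a_k\, \delta(\cdot - \tau_k), $$
so a non-uniform $\mathrm{D}^m$-spline $s$ is a piecewise polynomial of degree $m-1$ whose $(m-1)$-th derivative jumps by $a_k$ at the knot $\tau_k$. For $m = 2$ this recovers the familiar piecewise-linear (“connect the dots”) spline.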

2.2 Continuous-Domain Representer Theorem

With the preliminaries out of the way we can now state the relevant representer theorem result of [L-splines].

Theorem 1 ([L-splines, Based on Theorems 1 and 2])

Let $\mathrm{L}$ be a spline-admissible operator in the sense of Definition 1 and consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. Then, the extremal points of the general constrained minimization problem

(2)
$$ \min_{f \in \mathcal{M}_{\mathrm{L}}(\mathbb{R})} \ \|\mathrm{L} f\|_{\mathcal{M}} \quad \text{s.t.} \quad f(x_n) = y_n, \quad n = 1, \ldots, N, $$

are necessarily non-uniform $\mathrm{L}$-splines of the form

(3)
$$ s(x) = \sum_{k=1}^{K} a_k\, g(x - \tau_k) + \sum_{m=1}^{N_0} c_m\, p_m(x) $$

with the $K \le N - N_0$ knots $\tau_1 < \cdots < \tau_K$, where $g$ is a Green’s function of $\mathrm{L}$ and $p_1, \ldots, p_{N_0}$ is a basis of the null space $\mathcal{N}_{\mathrm{L}}$. Here, $\mathcal{M}_{\mathrm{L}}(\mathbb{R})$ is the native space of $\mathrm{L}$ defined by

$$ \mathcal{M}_{\mathrm{L}}(\mathbb{R}) := \{ f \in \mathcal{S}'(\mathbb{R}) : \mathrm{L} f \in \mathcal{M}(\mathbb{R}) \}. $$

The full solution set of Eq. 2 is the convex hull of the extremal points.

Remark 2

This theorem provides a powerful result: the problem is defined over a continuum and hence has an (uncountably) infinite number of degrees of freedom, yet the solutions are intrinsically sparse since they can be represented with finitely many coefficients.
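As a concrete illustration in the simplest case (our example, not a statement from [L-splines]): take $\mathrm{L} = \mathrm{D}^2$, so $N_0 = 2$ and $\mathcal{N}_{\mathrm{L}} = \operatorname{span}\{1, x\}$. Problem (2) becomes
$$ \min_{f} \ \|\mathrm{D}^2 f\|_{\mathcal{M}} \quad \text{s.t.} \quad f(x_n) = y_n, \quad n = 1, \ldots, N, $$
i.e., minimize the total variation of $f'$ subject to interpolation, and the “connect the dots” piecewise-linear interpolant is a solution with at most $N - 2$ knots (at the interior data points), in agreement with the bound $K \le N - N_0$ and with the linear-spline result of [nn-linear-spline].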

3 Neural Network Architectures and “Norms”

We’ll consider two-layer networks with activation function $\sigma$ of the form

(4)
$$ f_{\theta}(x) = \sum_{k=1}^{K} v_k\, \sigma(w_k x + b_k) + c(x), $$

where $K$ is the width of the network, $c \in \mathcal{C}$ is a generalized bias term (we will see in Section 4 that, depending on how we choose the activation function $\sigma$, the generalized bias term is a constant or “simple” function; the space $\mathcal{C}$ of generalized bias functionals is specified in Eq. 5), and

$$ w := (w_1, \ldots, w_K), \quad b := (b_1, \ldots, b_K), \quad v := (v_1, \ldots, v_K), $$

where $w_k, b_k, v_k \in \mathbb{R}$ for $k = 1, \ldots, K$. Put $\theta := (w, b, v, c)$. We will often write $f_{\theta}$ or $f_{(w, b, v, c)}$ for the network. Let $\Theta$ be the parameter space, i.e.,

$$ \Theta := \{ \theta = (w, b, v, c) : w, b, v \in \mathbb{R}^K,\ c \in \mathcal{C} \}. $$
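To make the architecture in Eq. 4 concrete, here is a minimal PyTorch-style sketch, assuming the activation $\sigma$ is supplied as a callable (e.g., a Green’s function of $\mathrm{L}$) and the generalized bias is restricted to a low-degree polynomial; the class name, parameterization, and initialization are illustrative choices, not the implementation used in Section 7.

```python
import torch
import torch.nn as nn

class TwoLayerGreenNet(nn.Module):
    """Two-layer network f(x) = sum_k v_k * sigma(w_k * x + b_k) + c(x),
    where c(x) is a polynomial 'generalized bias' (a stand-in for a null-space term)."""

    def __init__(self, width, activation, bias_degree=1):
        super().__init__()
        self.w = nn.Parameter(torch.randn(width))            # first-layer weights
        self.b = nn.Parameter(torch.randn(width))            # first-layer biases
        self.v = nn.Parameter(torch.randn(width) / width)    # last-layer weights
        self.c = nn.Parameter(torch.zeros(bias_degree + 1))  # generalized bias coefficients
        self.activation = activation                         # e.g., a Green's function of L

    def forward(self, x):
        # x has shape (n,); broadcast against the width dimension.
        pre = self.w.unsqueeze(0) * x.unsqueeze(1) + self.b.unsqueeze(0)  # (n, width)
        out = (self.v.unsqueeze(0) * self.activation(pre)).sum(dim=1)     # (n,)
        powers = torch.stack([x ** j for j in range(self.c.numel())], dim=1)
        return out + powers @ self.c

# Example: the ReLU is a (causal) Green's function of D^2.
net = TwoLayerGreenNet(width=64, activation=torch.relu, bias_degree=1)
y = net(torch.linspace(-1.0, 1.0, 5))
```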

Remark 3

Infinite-width networks are the same as the above, but we consider the continuum limit of the number of neurons.

3.1 “Norm” of a Neural Network

Let $C(\theta)$ be a non-negative function that measures the “size” of $\theta$. We refer to this as the “norm” of the neural network parameterized by $\theta$. Here and in the rest of the paper we write “norm” in quotes since $C(\theta)$ may not be a true norm. However, the “norms” we consider are non-negative and increasing functions of the weight magnitudes.

Let $f$ be any function that can be represented by a neural network $f_{\theta}$ for some parameter $\theta \in \Theta$. The minimum “norm” neural network is defined by the following constrained minimization:

$$ \min_{\theta} \ C(\theta) \quad \text{s.t.} \quad f_{\theta} = f, $$

where the minimization is taken over the parameter space of bounded “norm” neural networks. Note that by considering overparameterization in the limit, universal approximation essentially says we can represent any continuous function exactly with only mild conditions on the activation function [uat1, uat2, uat3]. Thus, we will first consider overparameterization in the limit, i.e., infinite-width networks.

Remark 4

We will see later (in Theorem 2) that since our goal is to establish an equivalence between minimum “norm” neural networks and splines for the scattered data interpolation problem, we only require a sufficiently wide network.

Remark 5

As will become clear later, neural networks are only capable of learning splines associated with a subset of the spline-admissible operators, namely those that satisfy a few key properties. We refer to this subset of spline-admissible operators as neural-network-admissible operators and define them as follows.

Definition 3

A linear operator is called neural-network-admissible if it satisfies the following properties

  • it is spline-admissible in the sense of Definition 1;

  • it commutes with the reflection operator up to a complex constant of magnitude $1$, i.e., if $\mathrm{R}$ is the reflection operator ($(\mathrm{R} f)(x) := f(-x)$), then $\mathrm{L}\mathrm{R} = \lambda\, \mathrm{R}\mathrm{L}$ for some $\lambda \in \mathbb{C}$ such that $|\lambda| = 1$;

  • for a Green’s function $g$ of $\mathrm{L}$, we have, for any $x_0 \in \mathbb{R}$,

    $$ g(\cdot - x_0) - \bar{\lambda}\, g(x_0 - \cdot) \in \mathcal{N}_{\mathrm{L}}. $$

We will now state some relevant facts about these neural-network-admissible operators, which follow by direct calculations.

Fact 1

It is easy to verify that the third property in Definition 3 is implied by the second. Moreover, if the second bullet in Definition 3 did not enforce that $|\lambda| = 1$, then the third bullet in Definition 3 would never hold.

Fact 2

If a neural-network-admissible operator $\mathrm{L}$ admits a causal Green’s function $g_c$ (recall that a function $g$ is called causal if $g(x) = 0$ for $x < 0$), then it also admits a non-causal Green’s function $g$, which is symmetric under reflection up to the constant $\lambda$ from Definition 3.
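For the common case $\lambda = \pm 1$ (even or odd transfer function), one explicit construction consistent with this fact is the symmetrization below; we state it as an illustration, since the general complex-$\lambda$ case requires more care:
$$ g(x) := \tfrac{1}{2}\big( g_c(x) + \lambda\, g_c(-x) \big), \qquad \mathrm{L} g = \tfrac{1}{2}(1 + \lambda^2)\,\delta = \delta, \qquad g(-x) = \lambda\, g(x). $$
For example, with $\mathrm{L} = \mathrm{D}^2$ (so $\lambda = 1$) and $g_c(x) = \max(0, x)$, this gives $g(x) = |x|/2$.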

Remark 6

The reason we need to consider a subset of spline-admissible operators is because the “shape” of a neural network is not the same as the “shape” of a spline. In particular, in a neural network representation, reflections of the activation functions occur (since the weights can be negative) which forces us to require the second bullet in Definition 3.

With the understanding of what kind of neural networks we are considering and what a “norm” of a neural network should look like, to establish a connection between minimum “norm” neural networks and splines there are three questions that need to be answered:

Question 1

What is the activation function $\sigma$?

Question 2

What is the space of generalized bias functionals $\mathcal{C}$?

Question 3

What is the neural network “norm” $C(\theta)$?

We can answer Question 1 and Question 2 immediately since they follow directly from comparing Eq. 3 and Eq. 4. Notice that $\mathrm{L}$-splines (resp. two-layer neural networks) have the “shape” of linear combinations of Green’s functions (resp. activation functions) plus a term in $\mathcal{N}_{\mathrm{L}}$ (resp. a generalized bias term). Hence, given a neural-network-admissible operator $\mathrm{L}$, the answer to

  • Question 1 is to choose $\sigma$ to be a Green’s function of $\mathrm{L}$.

  • Question 2 is to choose

    (5)

    with $c \in \mathcal{C}$. As in Theorem 1, $p_1, \ldots, p_{N_0}$ denotes a basis of the null space $\mathcal{N}_{\mathrm{L}}$ of $\mathrm{L}$.

    Remark 7

    If $\mathrm{L}$ admits a non-causal Green’s function as in Fact 2 (Section 3.1), then the choice of $\mathcal{C}$ can be simplified, as shown later in the proof of Lemma 1.

    Remark 8

    If the constant function is in $\mathcal{C}$ and $\mathrm{L}$ admits a non-causal Green’s function as in Fact 2 (Section 3.1), we can simply take the generalized bias to be a constant, meaning we can use the “standard” feedforward neural network architecture. Many commonly used splines, e.g., polynomial splines, admit such a Green’s function.

Before answering Question 3, we need to develop some connections between neural networks and splines. We will answer Question 3 in Section 6.

4 Neural Networks and Splines

When considering infinite-width networks it becomes convenient to work with the integral representation

$$ f_{\theta}(x) = \int_{\mathbb{R}^2} \sigma(w x + b)\, d\mu(w, b) + c(x), $$

where $\mu$ is a signed measure over the weights and biases and $\sigma$ and $c$ are as in Eq. 4. With this representation, the measure $\mu$ corresponds exactly to $v$, the last layer weights in a finite-width network, when $\mu$ corresponds to a finite-width network (i.e., $\mu$ is a finite linear combination of Dirac measures), and we have the equality

(6)
$$ \|v\|_1 = \|\mu\|_{\mathcal{M}}, $$

where, given a measurable space $(\Omega, \Sigma)$ and any signed measure $\mu$ defined on the $\sigma$-algebra $\Sigma$, the total variation of $\mu$ is defined by

$$ \|\mu\|_{\mathcal{M}} := |\mu|(\Omega) = \mu_+(\Omega) + \mu_-(\Omega), $$

where $(\mu_+, \mu_-)$ is the Jordan decomposition of $\mu$ [folland]. (When $\Omega = \mathbb{R}$ and $\Sigma$ is the Borel $\sigma$-algebra on $\mathbb{R}$, this is exactly the same as the norm defined in Eq. 1. We could generalize Eq. 1 to general measurable spaces and write $\|\cdot\|_{\mathcal{M}(\Omega)}$ instead of $\|\cdot\|_{\mathcal{M}}$.)

When working with infinite-width networks we have $\theta = (\mu, c)$, where

$$ \{ (\mu, c) : \mu \in \mathcal{M}(\mathbb{R}^2),\ c \in \mathcal{C} \} $$

is our new parameter space.
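To spell out the finite-width correspondence behind Eq. 6 (a routine check included here for clarity): if $\mu$ is the atomic measure associated with a width-$K$ network, then its total variation is the $\ell^1$-norm of the last-layer weights,
$$ \mu = \sum_{k=1}^{K} v_k\, \delta_{(w_k, b_k)} \quad \Longrightarrow \quad \|\mu\|_{\mathcal{M}} = |\mu|(\mathbb{R}^2) = \sum_{k=1}^{K} |v_k| = \|v\|_1, $$
provided the atoms $(w_k, b_k)$ are distinct.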

We will first prove the following lemma about a subset of infinite-width neural networks that are easier to work with analytically. Following [nn-linear-spline], we consider the parameter space $\Theta_1$ of networks where the first layer weights are constrained in absolute value to be $1$ (i.e., the measure $\mu$ is supported on $\{-1, +1\} \times \mathbb{R}$).

Lemma 1

Suppose we want to represent a function $f$ with an infinite-width network with parameter space $\Theta_1$ and activation function $\sigma$ chosen to be a Green’s function of a neural-network-admissible operator $\mathrm{L}$. Then, the following is true

(7)

Proof

Write

(8)

where in , is a distribution such that , where is the Dirac measure and is the Lebesgue measure.

Next, from the constraint ,

where holds by the second bullet in Definition 3. Put

Hence,

(9)

From Eq. 8,

By the third bullet in Definition 3 and Section 3.1, either controls a term in or a term that is . Thus, regardless of what is, we can always adjust to establish the constraint . Then, since , we see that representing while minimizing only depends on .

Since

we have from Eq. 9 that the minimization in the lemma statement can be rewritten as

where holds since completely determines and vice-versa, is completely determined by , by the decomposition in Eq. 9, and by the argument of the previous paragraph, we have that the value of has no effect on being able to represent and so the only free term in controlling is ; thus we can simply minimize over the choice of .

Remark 9

Definition 3 provides both necessary and sufficient conditions for the analysis in the proof of Lemma 1 to hold.

Corollary 1

Consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. We have the following equivalence

(10)

Proof

From Lemma 1, given , , we have

(11)

By Theorem 1, we know that a minimal knot $\mathrm{L}$-spline (an $\mathrm{L}$-spline with the minimal number of knots), say $\tilde{s}$, is a solution to the right-hand side of Eq. 10. It suffices to show that $\tilde{s}$ is also a solution to the left-hand side of Eq. 10. Clearly we can construct $\tilde{s}$ with a neural network parameterized by some $\theta \in \Theta_1$ (simply take $\mu$ to be a finite linear combination of Dirac measures at the spline knots), so $\tilde{s}$ is feasible. We will now proceed by contradiction. Suppose there exists a $\theta \in \Theta_1$ that interpolates the data with a strictly smaller objective value. Hence by Eq. 11,

which contradicts the optimality of $\tilde{s}$ for the right-hand side of Eq. 10.

Theorem 2

Let $f_{\theta}$, $\theta \in \Theta$, be a feedforward neural network architecture with activation function $\sigma$ chosen to be a Green’s function of a neural-network-admissible operator $\mathrm{L}$ and last layer bias chosen according to Eq. 5. Consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. We have the following equivalence

(12)

so long as the number of neurons is at least the number of knots of a minimal knot $\mathrm{L}$-spline solution of the right-hand side. The solutions of the above minimizations are minimal knot, non-uniform $\mathrm{L}$-splines.

Proof

By Theorem 1, we know the extremal points of the right-hand side optimization have a bounded number of knots, and by noting that each neuron can create exactly one knot, this number of neurons is necessary and sufficient. It’s then a matter of using the equality in Eq. 6 and invoking Corollary 1 and Theorem 1 to prove the theorem.

Remark 10

For a neural network to learn an $\mathrm{L}$-spline, we must solve the left-hand side optimization in Eq. 12. The constraints on the first layer weights pose a challenge, but we will see in Section 6 that we can rewrite the left-hand side optimization in Eq. 12 in the form

s.t.

for some neural network “norm” $C(\theta)$, and thus answer Question 3. Before answering this question, we will first see (in Section 5) some examples of neural-network-admissible operators.

Remark 11

In all our optimizations, we considered the setting of ideal sampling, i.e., we are optimizing subject to interpolating the given data points. From [L-splines], all our results hold for generalized sampling where, instead of the constraints

we can replace the sampling functionals $f \mapsto f(x_n)$, $n = 1, \ldots, N$, with any weak*-continuous linear measurement operator

and instead consider the constraints

This allows us to bring our framework into the context of infinite-dimensional compressed sensing and other more general inverse problems [infinite-dim-cs1, infinite-dim-cs2, inv-prob-space-measures].

5 Neural-Network-Admissible Operators

The framework we have developed in Section 4 relies on our definition of neural-network-admissible operators (Definition 3). This definition captures many common splines used in practice. The prominent examples include:

  • the polynomial splines of order $m$ by choosing $\mathrm{L} = \mathrm{D}^m$. Then $\mathcal{N}_{\mathrm{L}} = \operatorname{span}\{1, x, \ldots, x^{m-1}\}$ and $N_0 = m$. The operator $\mathrm{D}^m$ admits two obvious Green’s functions: the causal

    $$ g_c(x) = \frac{x_+^{m-1}}{(m-1)!} $$

    and the non-causal

    $$ g(x) = \frac{x_+^{m-1} + (-1)^m (-x)_+^{m-1}}{2\,(m-1)!}. $$

    Remark 12

    When $m = 2$, $g_c$ is precisely the ReLU. Thus our framework captures ReLU neural networks, which correspond to linear spline interpolations.

  • the fractional splines of order $\gamma$ by choosing $\mathrm{L} = \mathrm{D}^{\gamma}$, where $\mathrm{D}^{\gamma}$ is the fractional derivative and $\gamma > 0$. Then $N_0 = \lceil \gamma \rceil$. The operator $\mathrm{D}^{\gamma}$ admits the obvious causal Green’s function

    $$ g_c(x) = \frac{x_+^{\gamma - 1}}{\Gamma(\gamma)}, $$

    where $\Gamma$ is Euler’s Gamma function. Just as before, we can use Fact 2 in Section 3.1 to also find a non-causal Green’s function.

  • many of the exponential splines. Specifically,

    • the even exponential splines by choosing $\mathrm{L} = P(\mathrm{D})$ with the monic polynomial $P$ being even. Then $N_0 = \deg P$.

    • the odd exponential splines by choosing $\mathrm{L} = P(\mathrm{D})$ with the monic polynomial $P$ being odd. Then $N_0 = \deg P$.

    Exponential splines are of interest from a systems theory perspective since they easily model cascades of first-order linear and translation-invariant systems. For a full treatment of exponential splines we refer to [exp-splines-unser, exp-splines-unser2].

    To compute the Green’s functions for these operators it becomes convenient to work with the transfer function of $\mathrm{L}$. We follow a similar computation as in [exp-splines-unser]. Since $\mathrm{L}$ is neural-network-admissible, it’s linear and translation-invariant, i.e., $\mathrm{L}$ is a convolution operator. Thus there exists an $h$ such that $\mathrm{L} f = h * f$. In particular, $h = \mathrm{L}\delta$, the impulse response of $\mathrm{L}$.

    For both even and odd exponential splines, the bilateral Laplace transform of $h$ (here and in the rest of the paper we will use capital letters to denote the bilateral Laplace transforms of their lower-case counterparts) will be a monic polynomial of degree $N$. Thus in both cases we can write

    $$ H(s) = \prod_{n=1}^{N} (s - \gamma_n), $$

    where $\gamma_1, \ldots, \gamma_N$ are the roots of the polynomial $H$. This can also be written as

    $$ H(s) = \prod_{n=1}^{N_d} (s - \zeta_n)^{m_n}, $$

    where $\zeta_1, \ldots, \zeta_{N_d}$ are the distinct roots of $H$ and $m_n$ is the multiplicity of the root $\zeta_n$. Thus we have $\sum_{n=1}^{N_d} m_n = N$. Since we want to find $g_c$ such that $\mathrm{L} g_c = \delta$, it follows that

    $$ G_c(s) = \frac{1}{H(s)} = \sum_{n=1}^{N_d} \sum_{k=1}^{m_n} \frac{c_{n,k}}{(s - \zeta_n)^k}, $$

    where the last equality follows from a partial fraction decomposition which imposes the coefficients $c_{n,k}$. Finally, by taking the inverse transform we find the causal Green’s function

    $$ g_c(x) = \sum_{n=1}^{N_d} \sum_{k=1}^{m_n} c_{n,k}\, \frac{x^{k-1}}{(k-1)!}\, e^{\zeta_n x}\, u(x), $$

    where $u$ is the unit step function. One can then use Fact 2 in Section 3.1 to find a non-causal Green’s function $g$.

(a) Causal activations for polynomial splines
(b) Non-causal activations for polynomial splines
(c) Causal activations for fractional splines
(d) Causal activations for exponential splines
Figure 1: Examples of neural-network-admissible activation functions.

Some examples of what these activation functions may look like can be seen in Fig. 1.
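Below is a short sketch of how the activation families in Fig. 1 could be implemented, using the standard Green’s-function normalizations derived above; the function names and parameter defaults are our own illustrative choices, not code from the paper.

```python
import math
import torch

def causal_poly(x, m=2):
    """Causal Green's function of D^m: x_+^(m-1) / (m-1)!  (m=2 gives the ReLU)."""
    return torch.clamp(x, min=0.0) ** (m - 1) / math.factorial(m - 1)

def noncausal_poly(x, m=2):
    """Symmetrized Green's function of D^m with lambda = (-1)^m (m=2 gives |x|/2)."""
    return 0.5 * (causal_poly(x, m) + (-1.0) ** m * causal_poly(-x, m))

def causal_fractional(x, gamma=1.5):
    """Causal Green's function of D^gamma: x_+^(gamma-1) / Gamma(gamma)."""
    return torch.clamp(x, min=0.0) ** (gamma - 1) / math.gamma(gamma)

def causal_exponential(x, alpha=1.0):
    """Causal Green's function of (D + alpha I): exp(-alpha x) for x >= 0, else 0."""
    return torch.where(x >= 0, torch.exp(-alpha * x), torch.zeros_like(x))

x = torch.linspace(-2.0, 2.0, 401)
activations = {"causal poly (m=2)": causal_poly(x),
               "non-causal poly (m=2)": noncausal_poly(x),
               "causal fractional": causal_fractional(x),
               "causal exponential": causal_exponential(x)}
```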

We can also characterize whether or not a given $\mathrm{L}$ fits into our framework based on its transfer function (so long as it exists). Since $\mathrm{L}$ is neural-network-admissible and hence a convolution operator, we have $\mathrm{L} f = h * f$ for some impulse response $h$, so the second bullet in Definition 3 is equivalent to saying

(13)
$$ h(x) = \lambda\, h(-x). $$

Taking the bilateral Laplace transform of Eq. 13, we find $H(s) = \lambda H(-s)$. In other words, for $\mathrm{L}$ to be neural-network-admissible we require its transfer function satisfies

(14)
$$ H(s) = \lambda\, H(-s) $$

for some $\lambda \in \mathbb{C}$ with $|\lambda| = 1$. This immediately implies that any $\mathrm{L}$ with even or odd transfer function is neural-network-admissible (choose $\lambda = 1$ or $\lambda = -1$, respectively). With Eq. 14 in hand, we have a simple test of whether or not a given $\mathrm{L}$ fits under our framework.
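As a quick illustration of this test (our example):
$$ \mathrm{L} = \mathrm{D}^m: \quad H(s) = s^m = (-1)^m H(-s), $$
so Eq. 14 holds with $\lambda = (-1)^m$ and every $\mathrm{D}^m$ is neural-network-admissible; similarly, $\mathrm{L} = \mathrm{D}^2 - \alpha^2 \mathrm{I}$ has the even transfer function $s^2 - \alpha^2$ and is admissible with $\lambda = 1$.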

It would also be useful to characterize whether or not a given activation function $\sigma$ fits into our framework. This is equivalent to, for a given $\sigma$, finding $\mathrm{L}$ such that $\mathrm{L}\sigma = \delta$, and then checking if Eq. 14 holds. Given $\sigma$, we have

(15)
$$ H(s) = \frac{1}{\Sigma(s)}. $$

So as long as $\Sigma$, the bilateral Laplace transform of $\sigma$, exists, we can find $H$ with Eq. 15 and check it against Eq. 14.

Remark 13

The sigmoid activation function

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

fits into our framework. Indeed,

$$ \Sigma(s) = \int_{-\infty}^{\infty} \frac{e^{-sx}}{1 + e^{-x}}\, dx \overset{(a)}{=} \int_{0}^{\infty} \frac{u^{s-1}}{1 + u}\, du = \frac{\pi}{\sin(\pi s)}, \qquad 0 < \operatorname{Re}(s) < 1, $$

where $(a)$ holds by the substitution $u = e^{-x}$. Then,

(16)
$$ H(s) = \frac{1}{\Sigma(s)} = \frac{\sin(\pi s)}{\pi} $$

is odd and thus the sigmoid activation fits into our framework. Thus we can consider this as a notion of a “sigmoidal spline”. Moreover,

(17)
$$ \lim_{\varepsilon \to 0^+} \sigma(x / \varepsilon) = u(x), $$

where $u$ is the unit step function, i.e., the limit is a Green’s function of $\mathrm{D}$. From Eq. 16 (applied to the scaled sigmoid $\sigma(\cdot / \varepsilon)$, whose transfer function is $\sin(\pi \varepsilon s)/(\pi \varepsilon)$) we have

$$ \lim_{\varepsilon \to 0^+} \frac{\sin(\pi \varepsilon s)}{\pi \varepsilon} = s, $$

which is exactly the transfer function of $\mathrm{D}$. Thus, in the limit, the “sigmoidal spline” recovers the polynomial spline of order one as we’d expect from Eq. 17.
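For completeness, the substitution step above reduces to the classical Beta/Gamma identity:
$$ \int_{0}^{\infty} \frac{u^{s-1}}{1+u}\, du = B(s, 1-s) = \Gamma(s)\Gamma(1-s) = \frac{\pi}{\sin(\pi s)}, \qquad 0 < \operatorname{Re}(s) < 1, $$
which is Euler’s reflection formula; this gives $\Sigma(s) = \pi/\sin(\pi s)$ and hence the transfer function in Eq. 16.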

Remark 14

The activation functions and do not fit into our framework since their bilateral Laplace transforms do not exist.

Remark 15

The Gaussian activation function

$$ \sigma(x) = e^{-x^2/2} $$

fits into our framework. Indeed,

$$ \Sigma(s) = \int_{-\infty}^{\infty} e^{-sx} e^{-x^2/2}\, dx = \sqrt{2\pi}\, e^{s^2/2}. $$

Then,

$$ H(s) = \frac{1}{\Sigma(s)} = \frac{e^{-s^2/2}}{\sqrt{2\pi}}, $$

which is even and thus $\sigma$ fits into our framework. A two-layer neural network with Gaussian activation functions can be thought of as a generalized kernel machine where the bandwidth of the kernel is now a trainable parameter.

6 Neural Network “Norms”

In this section we will answer Question 3. To begin, we state the following general proposition about training networks, which is our generalization of Theorem 1 from [neyshabur]. Theorem 1 from [neyshabur] relates minimizing the $\ell^2$-norm of the network weights in a two-layer ReLU network to minimizing the $\ell^1$-norm of the last layer weights while constraining the weights of the first layer. Our result introduces a generalized notion of the “$\ell^2$-norm” of the weights that holds for neural networks with activation functions that satisfy a homogeneity condition. Moreover, our proof is completely constructive, unlike that of Theorem 1 from [neyshabur].

Proposition 1

Suppose we want to represent a function $f$ with a finite-width network with activation $\sigma$ that is $k$-non-negative homogeneous, i.e., $\sigma(\alpha x) = \alpha^k \sigma(x)$ for all $\alpha \ge 0$. Then, the following is true

(18)

Proof

See Appendix A.

Remark 16

The polynomial splines of order $m$ have Green’s functions that are $(m-1)$-non-negative homogeneous (as we saw in Section 5), so for the special case of polynomial splines we have the corollary to Theorem 2 that

s.t.

has solutions that are minimal knot, non-uniform polynomial splines of order $m$. Written differently, we have

s.t.

where we have the neural network “norm”

(19)

which recovers the $\ell^2$-norm of the weights when $m = 2$, which corresponds exactly to ReLU activations by choosing the causal Green’s function of $\mathrm{D}^2$, and also corresponds to training a neural network with weight decay. This says perhaps “non-linear” notions of weight decay in SGD should be considered when training neural networks.
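One natural “norm” with the properties described in this remark (it reduces to ordinary weight decay on $(v, w)$ when $m = 2$ and, by the AM–GM inequality, lower-bounds the balanced product $\sum_k |v_k|\,|w_k|^{m-1}$) is the generalized weight-decay penalty sketched below; we present it as an illustrative candidate consistent with Eq. 19 rather than a verbatim restatement, and it reuses the hypothetical TwoLayerGreenNet module from Section 3.

```python
import torch

def generalized_weight_decay(net, m=2):
    """Candidate 'norm' C(theta) = 0.5 * sum_k (v_k^2 + |w_k|^(2(m-1))).

    For m = 2 (ReLU / linear splines) this is ordinary weight decay on (v, w);
    by AM-GM it is at least sum_k |v_k| * |w_k|^(m-1), with equality at balanced weights.
    """
    return 0.5 * (net.v ** 2 + net.w.abs() ** (2 * (m - 1))).sum()

# Usage with the TwoLayerGreenNet sketch from Section 3:
# loss = ((net(x) - y) ** 2).mean() + lam * generalized_weight_decay(net, m=2)
```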

Remark 17

In the event we do not have the homogeneity property as in Proposition 1, we can always use regularization to obtain the corollary to Theorem 2 that

s.t.

where $\varepsilon$ is a small constant, has solutions that are minimal knot, non-uniform $\mathrm{L}$-splines. In other words, we have the regularized neural network “norm”

(20)

In practice, one could take $\varepsilon$ to be very small.

7 Empirical Validation and Discussion

To verify that the theory developed in Section 4 holds, we empirically check that neural networks actually do learn splines. In our empirical results, we consider regularized problems of the form

where $\ell$ is the squared error loss and $\lambda > 0$ is the regularization parameter. We verify our theory holds with linear and cubic splines. We found empirically that minimizing the “norm” in Eq. 20 or the “norm” in Eq. 19 made no difference in the learned interpolations. We used PyTorch (https://pytorch.org/) to implement the neural networks and used AdaGrad [adagrad] as the optimization procedure with a fixed step size. To compute spline interpolations with standard methods we used SciPy [scipy] (specifically, scipy.interpolate.InterpolatedUnivariateSpline).
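For concreteness, here is a minimal sketch of the kind of experiment described above, reusing the hypothetical TwoLayerGreenNet module and generalized_weight_decay penalty sketched earlier; the data values, width, step size, iteration count, and regularization weight are illustrative placeholders, not the settings used to produce the figures.

```python
import torch

# Scattered data to interpolate (illustrative values).
x = torch.tensor([-1.0, -0.5, 0.0, 0.4, 1.0])
y = torch.tensor([ 0.2,  1.0, 0.1, 0.8, -0.3])

m = 2                                    # D^2: linear splines, ReLU activation
net = TwoLayerGreenNet(width=100, activation=torch.relu, bias_degree=m - 1)
opt = torch.optim.Adagrad(net.parameters(), lr=1e-2)
lam = 1e-4                               # regularization parameter

for step in range(20000):
    opt.zero_grad()
    data_loss = ((net(x) - y) ** 2).mean()            # squared error loss
    reg = generalized_weight_decay(net, m=m)          # "norm" penalty
    (data_loss + lam * reg).backward()
    opt.step()

# After training to (near) zero data loss, the network should implement a
# minimal "norm" interpolant, which by Theorem 2 is an L-spline (compare with SciPy).
```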

(a) Linear spline
(b) Neural network with causal activation
(c) Neural network with non-causal activation
Figure 2: Linear spline interpolation. Panel (a) shows the linear spline computed with standard methods; panels (b) and (c) show the neural network interpolations trained with a causal and a non-causal Green’s function of $\mathrm{D}^2$, respectively. The neural network interpolations (b) and (c) are not the “connect the dots” linear spline, but have extra knots. Both are clearly valid solutions to both of the minimizations in Theorem 2.
(a) Cubic spline
(b) Neural network with causal activation
(c) Neural network with non-causal activation
Figure 3: Cubic spline interpolation. Panel (a) shows the cubic spline computed with standard methods; panels (b) and (c) show the neural network interpolations trained with a causal and a non-causal Green’s function of $\mathrm{D}^4$, respectively. The neural network interpolations (b) and (c) are exactly the same (up to floating point precision) as the cubic spline.
(a) Cubic spline
(b) No regularization
(c) With regularization
Figure 4: Cubic spline interpolation with a neural network using a causal Green’s function of $\mathrm{D}^4$, trained without and with regularization. We see that regularization plays a key role in the neural network actually learning the proper spline interpolation.

In Fig. 2 we compute a linear spline using standard methods and also compute it by training a neural network while minimizing the “norm” in Eq. 19, using both a causal and a non-causal Green’s function of $\mathrm{D}^2$. The neural network interpolations have more knots than the “connect the dots” linear spline, though it’s clear that both solutions are still minimizers of the problems in Theorem 2.

In Fig. 3 we compute a cubic spline using standard methods and also compute it by training a neural network while minimizing the “norm” in Eq. 19, using both a causal and a non-causal Green’s function of $\mathrm{D}^4$. In this case, the neural network interpolations learn the exact same function as the standard cubic spline. Thus we see that neural networks are indeed capable of learning splines.

In Fig. 4 we show that explicit regularization can be needed to learn the spline interpolations. We see that if we have no regularization while training a neural network with a causal Green’s function of $\mathrm{D}^4$, the learned interpolation is not the standard cubic spline, but the moment we include the proper regularization, the neural network learns the cubic spline function exactly.

8 Conclusions and Future Work

We have developed a general framework based on the theory of splines for understanding the interpolation properties of sufficiently wide minimum “norm” neural networks that interpolate scattered data. We have proven that neural networks are capable of learning a large class of $\mathrm{L}$-splines and thus overparameterized neural networks do in fact learn “nice” interpolations of data. To the data science and machine learning communities, this gives intuition as to why overparameterized and interpolating models generalize well on new, unseen data. To the spline and inverse problems communities, we have shown that by simply training a neural network (a discrete object) to zero error with appropriate regularization, we can exactly solve various continuous-domain linear inverse problems. Our current results hold for two-layer neural networks. Future work will be directed towards developing a theory based on splines for both deep and multivariate neural networks.

Deep architectures.

It remains an open question what kinds of functions minimum “norm” deep networks learn. Working with the framework of $\mathrm{L}$-splines required that our “building blocks” be Green’s functions of operators. With two-layer networks, this works by simply letting the activation function be the desired Green’s function. With deep networks, due to function compositions, it becomes very unclear what exactly the “building blocks” are, and what would be a reasonable “norm”. In the case of Green’s functions of $\mathrm{D}^m$ operators, it would make sense that the function composition simply increases the order of the polynomial spline. Since the Fourier transform of a piecewise polynomial decays polynomially in frequency, at a rate that increases with the degree of (and smoothness between) its pieces, we conjecture that a fairly deep network with a Green’s function of $\mathrm{D}^m$ as its activation function should learn something approaching a bandlimited interpolation.

Multivariate functions.

Our theory only holds for univariate functions. Extending our results to the multivariate case would require showing that minimum “norm” neural networks that interpolate scattered data solve some multivariate continuous-domain linear inverse problem whose solutions are some type of spline. Recent work has examined the minimum $\ell^2$-norm of all the network weights for multivariate ReLU networks subject to representing a particular function, but has not made any connections to splines [relu-multi]. They show that in the multivariate case, there even exist continuous piecewise linear functions that two-layer networks cannot represent with a finite $\ell^2$-norm of the network weights.

Appendix A Proof of Proposition 1

Proof

Let be an optimal solution for the left-hand side in Eq. 18 and let be an optimal solution for the right-hand side in Eq. 18. We will prove the claim by using the optimal solution for the left-hand side to construct a feasible solution for the right-hand side and vice-versa. We will then use the constructed feasible solutions to show

Starting with Eq. 22, for put