Minimum “Norm” Neural Networks are Splines†
† This work is partially supported by AFOSR/AFRL grant FA95501810166 and the NSF Research Traineeship (NRT) grant 1545481.
Abstract
We develop a general framework based on splines to understand the interpolation properties of overparameterized neural networks. We prove that minimum “norm” two-layer neural networks (with appropriately chosen activation functions) that interpolate scattered data are minimal knot splines. Our results follow from understanding key relationships between notions of neural network “norms”, linear operators, and continuous-domain linear inverse problems.
Rahul Parhi and Robert D. Nowak
Keywords: splines, neural networks, inverse problems
AMS subject classifications: 41A25, 46E27, 47A52, 68T05, 82C32, 94A12
1 Introduction
Contradicting classical statistical wisdom, recent trends in data science and machine learning have shown that overparameterized models that interpolate the training data perform surprisingly well on new, unseen data. This phenomenon is very often seen in the generalization performance of overparameterized neural network models. These models are typically trained to zero training error, i.e., they interpolate the training data, yet they predict very well on new, test data. Prior work has tried to understand this phenomenon from a statistical perspective by studying the statistical properties of such interpolating models [interpsgd, understanddeepkernel, interpstatopt, overfitperfectfit]. To control the complexity of overparameterized neural networks, solutions with minimum norm of the network weights are often encouraged during optimization via regularization (often referred to as “weight decay” in Stochastic Gradient Descent (SGD) steps). Motivated by this practice, the functional mappings generated by two-layer, infinite-width neural networks with Rectified Linear Unit (ReLU) activation functions were studied in [nnlinearspline]. Such networks that interpolate data subject to minimizing the norm of the network weights were shown to correspond to linear spline interpolation of the data.
In this work we build on these results and develop a general framework based on splines to understand which function spaces can be learned by neural networks. We study more general two-layer neural networks that interpolate data while minimizing certain “norms”. We relate this optimization to an equivalent optimization over functions that live in the native space of a particular linear operator. Both the neural network activation function and the “norm” are tailored to the linear operator.
Our key contribution is identifying relationships among neural network “norms”, linear operators, and continuous-domain linear inverse problems. With these connections we turn to the recent work of [Lsplines], which introduces a framework for splines in which a particular type of spline is a solution to a continuous-domain linear inverse problem involving a linear operator $\mathrm{L}$. We prove that minimizing the “norm” of a specific neural network architecture (where the architecture and “norm” are determined by the choice of the linear operator $\mathrm{L}$) subject to interpolating scattered data exactly solves this inverse problem. By noticing and understanding these connections we bridge the gap between spline theory and understanding the interpolation properties of overparameterized neural networks.
In particular, we prove that overparameterized two-layer neural networks mapping $\mathbb{R} \to \mathbb{R}$ can learn functions that interpolate scattered data with a large class of splines, including polynomial splines, fractional splines, and many exponential splines. This result is not only interesting to the data science and machine learning communities, but also to the spline community as a new, and perhaps unconventional, way to compute splines. Additionally, since our result follows from neural networks indirectly solving continuous-domain linear inverse problems, this is also interesting to the inverse problems community as a new way to solve such inverse problems, as opposed to more standard multiresolution or grid-based approaches [invprob1, invprob2]. Our work may also be relevant to recent work on developing an infinite-dimensional theory of compressed sensing [infinitedimcs1, infinitedimcs2] as well as other inverse problems set in the continuous domain [invprobspacemeasures].
This paper is organized as follows. In Section 2 we provide relevant background from spline theory and introduce the framework of splines from [Lsplines]. In Section 3 we introduce the neural networks we’ll be interested in as well as the notion of neural network “norms”. In Section 4 we present our main theoretical results establishing connections among neural networks, continuous-domain linear inverse problems, and splines. In Section 5 we detail which splines a neural network can learn. In Section 6 we discuss various neural network “norms” that result in minimum “norm” neural networks learning splines. In Section 7 we verify that the theory developed in Section 4 holds with some empirical validation.
2 Splines and Continuous-Domain Linear Inverse Problems
In this section we state results from spline theory that make powerful associations between splines and operators. Specifically, we review splines and their connections to continuous-domain linear inverse problems with generalized total-variation regularization [Lsplines]. The main result of interest is a representer theorem that describes the solution sets of the aforementioned inverse problems.
2.1 Preliminaries
To state the main result we require some technical notation (see Section 3 of [Lsplines] for a more thorough discussion). Let $\mathcal{S}'(\mathbb{R})$ denote the space of tempered distributions, which is the continuous dual of the Schwartz space $\mathcal{S}(\mathbb{R})$ of smooth and rapidly decaying test functions on $\mathbb{R}$ [folland, rudin]. In this work, we’ll be interested in the space $\mathcal{M}(\mathbb{R})$ of finite Radon measures. For each $\mu \in \mathcal{M}(\mathbb{R})$ we can associate a tempered distribution (in the sense that the measure acts on test functions by integration; note that since we’re working with tempered distributions this is a slight abuse of notation when the measure associated with $\mu$ is not absolutely continuous with respect to the Lebesgue measure), so it becomes convenient to think that $\mathcal{M}(\mathbb{R}) \subset \mathcal{S}'(\mathbb{R})$.
The Riesz–Markov–Kakutani representation theorem says that $\mathcal{M}(\mathbb{R})$ is the continuous dual of $C_0(\mathbb{R})$, the space of continuous functions vanishing at infinity. Since $C_0(\mathbb{R})$ is a Banach space when equipped with the uniform norm, we have the dual norm
(1) $\|\mu\|_{\mathcal{M}(\mathbb{R})} = \sup \left\{ \langle \mu, \varphi \rangle \;:\; \varphi \in C_0(\mathbb{R}),\ \|\varphi\|_\infty \leq 1 \right\},$
and so $\mathcal{M}(\mathbb{R})$ is itself a Banach space when equipped with this norm.
A key observation is that $\mathcal{M}(\mathbb{R})$ is a space larger than $L_1(\mathbb{R})$ that includes absolutely integrable tempered distributions. Indeed, for any $f \in L_1(\mathbb{R})$ we have
$\|f\|_{\mathcal{M}(\mathbb{R})} = \|f\|_{L_1(\mathbb{R})},$
and we remark that $L_1(\mathbb{R})$ is dense in $\mathcal{M}(\mathbb{R})$. We’re interested in the space $\mathcal{M}(\mathbb{R})$ since the Dirac deltas $\delta_{x_0} \notin L_1(\mathbb{R})$ for $x_0 \in \mathbb{R}$, but $\delta_{x_0} \in \mathcal{M}(\mathbb{R})$ with $\|\delta_{x_0}\|_{\mathcal{M}(\mathbb{R})} = 1$.
Definition 1 ([Lsplines, Definition 1])
A linear operator $\mathrm{L}$ is called spline-admissible if it satisfies the following properties:
- it is translation-invariant, i.e., $\mathrm{L}\, T_{x_0} = T_{x_0} \mathrm{L}$, where $T_{x_0}: f \mapsto f(\cdot - x_0)$ is the translation operator;
- there exists a function $g$ such that $\mathrm{L}\{g\} = \delta$, i.e., $g$ is a Green’s function of $\mathrm{L}$;
- the kernel (null space) of $\mathrm{L}$ has finite dimension $N_0$.
Definition 2 ([Lsplines, Definition 2])
Let $\mathrm{L}$ be spline-admissible in the sense of Definition 1. Then, a function $s$ is called a nonuniform spline with spline knots $\{\tau_k\}_{k=1}^K$ and weights $\{a_k\}_{k=1}^K$ if
$\mathrm{L}\{s\} = \sum_{k=1}^{K} a_k\, \delta(\cdot - \tau_k).$
The right-hand side is called the innovation of the spline $s$.
Remark 1
Associating a spline with an operator $\mathrm{L}$ captures many common splines, including:
- the well-studied polynomial splines of order $m$ by choosing $\mathrm{L} = \mathrm{D}^m$, where $\mathrm{D}$ is the derivative operator (here and in the rest of this paper, derivatives are understood in the distributional sense) [polysplines];
- the fractional splines of order $\gamma > 0$ by choosing $\mathrm{L} = \mathrm{D}^\gamma$, where $\mathrm{D}^\gamma$ is the fractional derivative [fractsplines];
- the exponential splines by choosing $\mathrm{L} = \mathrm{D} - \alpha \mathrm{I}$, where $\mathrm{I}$ is the identity operator [expsplines, expsplinesunser].
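The polynomial-spline case can be made concrete in a few lines. The following is an illustrative sketch (not from the paper's code) of the causal Green's function of the $m$-th derivative operator, assuming the standard truncated-power form $g(x) = x_+^{m-1}/(m-1)!$:

```python
import numpy as np
from math import factorial

def truncated_power(x, m):
    """Causal Green's function of the m-th derivative operator D^m,
    assumed here to take the standard truncated-power form
    g(x) = x_+^{m-1} / (m-1)!  (illustrative helper, names are ours)."""
    return np.where(x > 0, x, 0.0) ** (m - 1) / factorial(m - 1)

x = np.linspace(-2.0, 2.0, 5)
g2 = truncated_power(x, 2)   # m = 2: the ReLU, associated with linear splines
g4 = truncated_power(x, 4)   # m = 4: associated with cubic splines
```

Note that differentiating `truncated_power(x, m)` $m$ times (distributionally) yields a Dirac impulse at the origin, which is exactly the Green's-function property.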
2.2 ContinuousDomain Representer Theorem
With the preliminaries out of the way we can now state the relevant representer theorem result of [Lsplines].
Theorem 1 ([Lsplines, Based on Theorems 1 and 2])
Let $\mathrm{L}$ be a spline-admissible operator in the sense of Definition 1 and consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. Then, the extremal points of the general constrained minimization problem
(2) $\displaystyle \min_{f \in \mathcal{M}_{\mathrm{L}}(\mathbb{R})} \; \|\mathrm{L} f\|_{\mathcal{M}(\mathbb{R})}$
s.t. $f(x_n) = y_n, \quad n = 1, \ldots, N,$
are necessarily nonuniform splines of the form
(3) $\displaystyle s(x) = \sum_{k=1}^{K} a_k\, g(x - \tau_k) + q(x)$
with $K \leq N - N_0$ knots $\{\tau_k\}$, where $g$ is a Green’s function of $\mathrm{L}$ and $q$ lies in the null space of $\mathrm{L}$. Here, $\mathcal{M}_{\mathrm{L}}(\mathbb{R})$ is the native space of $\mathrm{L}$ defined by
$\mathcal{M}_{\mathrm{L}}(\mathbb{R}) = \left\{ f \in \mathcal{S}'(\mathbb{R}) \;:\; \mathrm{L} f \in \mathcal{M}(\mathbb{R}) \right\}.$
The full solution set of Eq. 2 is the convex hull of the extremal points.
Remark 2
This theorem provides a powerful result since this problem is defined over a continuum and hence has an (uncountably) infinite number of degrees of freedom, yet the solutions are intrinsically sparse since they can be represented with finitely many coefficients.
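To make this sparsity concrete, here is a small illustrative sketch (the function names and the choice $\mathrm{L} = \mathrm{D}^2$, i.e., linear splines with an affine null space, are ours) of evaluating a nonuniform spline as a finite sum of shifted Green's functions plus a null-space term:

```python
import numpy as np

def relu(x):
    # For L = D^2 (second derivative), the ReLU is a causal Green's function.
    return np.maximum(x, 0.0)

def nonuniform_spline(x, knots, weights, c0=0.0, c1=0.0):
    """Evaluate f(x) = c0 + c1*x + sum_k a_k * g(x - t_k): a finite sum of
    shifted Green's functions plus an affine null-space term (for L = D^2)."""
    x = np.asarray(x, dtype=float)
    s = c0 + c1 * x
    for t, a in zip(knots, weights):
        s = s + a * relu(x - t)
    return s

# A piecewise-linear interpolant determined by just three knot coefficients.
y = nonuniform_spline(np.linspace(0.0, 4.0, 9), [1.0, 2.0, 3.0], [1.0, -2.0, 1.0])
```

Despite living in an infinite-dimensional function space, the spline is pinned down by the finite lists `knots` and `weights` plus the null-space coefficients.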
3 Neural Network Architectures and “Norms”
We’ll consider two-layer networks with activation function $\sigma$ of the form
(4) $\displaystyle f_\theta(x) = \sum_{k=1}^{K} v_k\, \sigma(w_k x + b_k) + c(x),$
where $c$ is a generalized bias term (we will see in Section 4 that depending on how we choose the activation function $\sigma$, the generalized bias term is a constant or “simple” function) and $w_k, b_k, v_k \in \mathbb{R}$ for $k = 1, \ldots, K$. Put $\theta = (\mathbf{w}, \mathbf{b}, \mathbf{v}, c)$. We will often write $f_\theta$ for the network parameterized by $\theta$. Let $\Theta$ be the parameter space, i.e., the set of all such $\theta$.
Remark 3
Infinite-width networks are the same as the above, but we consider the continuum limit of the number of neurons.
3.1 “Norm” of a Neural Network
Let $\ell: \Theta \to \mathbb{R}_{\geq 0}$ be a nonnegative function that measures the “size” of $\theta$. We refer to this as the “norm” of the neural network parameterized by $\theta$. Here and in the rest of the paper we write “norm” in quotes since $\ell$ may not be a true norm. However, the “norms” are nonnegative and increasing functions of the weight magnitudes.
Let $f$ be any function that can be represented by a neural network $f_\theta$ for some parameter $\theta \in \Theta$. The minimum “norm” neural network is defined by the following constrained minimization
$\displaystyle \min_{\theta \in \Theta} \; \ell(\theta)$
s.t. $f_\theta = f,$
where $\Theta$ is the parameter space of bounded “norm” neural networks. Note that by considering overparameterization in the limit, universal approximation essentially says we can represent any continuous function exactly with only mild conditions on the activation function [uat1, uat2, uat3]. Thus, we will first consider overparameterization in the limit and consider infinite-width networks.
Remark 4
We will see later (in Theorem 2) that since our goal is to establish an equivalence between minimum “norm” neural networks and splines for the scattered data interpolation problem, we only require a sufficiently wide network.
Remark 5
As will become clear later, neural networks are only capable of learning splines associated with a subset of spline-admissible operators that satisfy a few key properties. We refer to this subset of spline-admissible operators as neural-network-admissible operators and define them as follows.
Definition 3
A linear operator $\mathrm{L}$ is called neural-network-admissible if it satisfies the following properties:
- it is spline-admissible in the sense of Definition 1;
- it commutes with the reflection operator up to a complex constant of magnitude $1$, i.e., if $R: f \mapsto f(-\cdot)$ is the reflection operator, then $\mathrm{L} R = \lambda R \mathrm{L}$ for some $\lambda \in \mathbb{C}$ such that $|\lambda| = 1$;
- for a Green’s function $g$ of $\mathrm{L}$, we have $g - \bar{\lambda}\, g(-\cdot) \in \ker \mathrm{L}$ for the constant $\lambda$ above.
We will now state some relevant facts about these neural-network-admissible operators, which follow by direct calculations.
Fact 1. It is easy to verify that the third property in Definition 3 is implied by the second. Moreover, if the second bullet in Definition 3 did not enforce that $|\lambda| = 1$, then the third bullet in Definition 3 would never hold.
Fact 2. If a neural-network-admissible $\mathrm{L}$ admits a causal Green’s function $g_c$ (recall that a function $f$ is called causal if $f(x) = 0$ for $x < 0$), then it also admits a noncausal Green’s function
$g_{nc} = \frac{g_c + \bar{\lambda}\, g_c(-\cdot)}{2}$
with the property $g_{nc}(-x) = \bar{\lambda}\, g_{nc}(x)$.
Remark 6
The reason we need to consider a subset of spline-admissible operators is that the “shape” of a neural network is not the same as the “shape” of a spline. In particular, in a neural network representation, reflections of the activation functions occur (since the weights can be negative), which forces us to require the second bullet in Definition 3.
With the understanding of what kind of neural networks we are considering and what a “norm” of a neural network should look like, to establish a connection between minimum “norm” neural networks and splines there are three questions that need to be answered:
Question 1. What is the activation function $\sigma$?
Question 2. What is the space of generalized bias functionals $c$?
Question 3. What is the neural network “norm” $\ell$?
We can answer the first two questions immediately since they follow directly from comparing Eq. 3 and Eq. 4. Notice that splines (resp. two-layer neural networks) have the “shape” of linear combinations of Green’s functions (resp. activation functions) plus a term in the null space of $\mathrm{L}$ (resp. generalized bias term). Hence, given a neural-network-admissible operator $\mathrm{L}$:
- the answer to the first question is to choose $\sigma$ to be a Green’s function of $\mathrm{L}$;
- the answer to the second question is to choose the generalized bias $c$ to range over the null space of $\mathrm{L}$.
Remark 7
If $\mathrm{L}$ admits a noncausal Green’s function as in Section 3.1, then we can simply take $\sigma$ to be that noncausal Green’s function, as shown later in the proof of Lemma 1.
Remark 8
If the constant function is in the null space of $\mathrm{L}$ and $\mathrm{L}$ admits a noncausal Green’s function as in Section 3.1, we can simply take $c$ to be a constant, meaning we can use the “standard” feedforward neural network architecture. Many commonly used splines, e.g., polynomial splines, admit such a Green’s function.
Before answering the remaining question of choosing the neural network “norm” $\ell$, we need to develop some connections between neural networks and splines. We will answer this question in Section 6.
4 Neural Networks and Splines
When considering infinite-width networks it becomes convenient to work with the integral representation
$\displaystyle f(x) = \int \sigma(wx + b) \, \mathrm{d}\mu(w, b) + c(x),$
where $\mu$ is a signed measure over the weights and biases and $\sigma$, $c$ are as in Eq. 4. With this representation, the measure $\mu$ corresponds exactly to $\mathbf{v}$, the last-layer weights in a finite-width network, when $\mu$ corresponds to a finite-width network (i.e., $\mu$ is a finite linear combination of Dirac measures) and we have the equality
(6) $\|\mu\|_{\mathrm{TV}} = \|\mathbf{v}\|_1,$
where, given a measurable space $(X, \Sigma)$ and any signed measure $\mu$ defined on the $\sigma$-algebra $\Sigma$, the total variation of $\mu$ is defined by $\|\mu\|_{\mathrm{TV}} = \mu^+(X) + \mu^-(X)$, where $(\mu^+, \mu^-)$ is the Jordan decomposition of $\mu$ [folland]. (When $X = \mathbb{R}$ and $\Sigma$ is the Borel $\sigma$-algebra on $\mathbb{R}$, this total variation is exactly the same as the norm defined in Eq. 1.)
When working with infinite-width networks we write $f_\theta$ with $\theta = (\mu, c)$, and the space of such pairs
is our new parameter space.
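The correspondence between the total variation of a finite discrete measure and the $\ell^1$ norm of the last-layer weights can be checked in a few lines (a sketch with made-up weights):

```python
import numpy as np

# A finite-width network corresponds to a discrete measure
# mu = sum_k v_k * delta_{(w_k, b_k)}. Its total variation is
# mu^+(X) + mu^-(X) (Jordan decomposition), which equals ||v||_1.
v = np.array([0.5, -1.2, 2.0])        # made-up last-layer weights
mu_plus = v.clip(min=0).sum()         # total mass of the positive part
mu_minus = (-v).clip(min=0).sum()     # total mass of the negative part
tv = mu_plus + mu_minus               # ||mu||_TV == ||v||_1
```

The Jordan decomposition simply splits the Dirac masses by the sign of their weights, so the two masses sum to the $\ell^1$ norm.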
We will first prove the following lemma about a subset of infinite-width neural networks that are easier to work with analytically. Following [nnlinearspline], we consider the parameter space of networks where the first-layer weights are constrained in absolute value to be $1$.
Lemma 1
Suppose we want to represent a function $f$ with an infinite-width network with this constrained parameter space and activation function $\sigma$ chosen to be the Green’s function of a neural-network-admissible operator $\mathrm{L}$. Then, the following is true
(7) $\displaystyle \min_{\theta:\ f_\theta = f,\ |w| = 1} \|\mu\|_{\mathrm{TV}} = \|\mathrm{L} f\|_{\mathcal{M}(\mathbb{R})}.$
Proof
Write
(8) 
where in , is a distribution such that , where is the Dirac measure and is the Lebesgue measure.
Next, from the constraint ,
where holds by the second bullet in Definition 3. Put
Hence,
(9) 
From Eq. 8,
By the third bullet in Definition 3 and Section 3.1, either controls a term in or a term that is . Thus, regardless of what is, we can always adjust to establish the constraint . Then, since , we see that representing while minimizing only depends on .
Since
we have from Eq. 9 that the minimization in the lemma statement can be rewritten accordingly,
where the equality holds for the following reasons: the measure completely determines the network and vice versa; by the decomposition in Eq. 9 and the argument of the previous paragraph, the value of the null-space term has no effect on being able to represent $f$; and so the only free term controlling the objective is the measure, over which we can simply minimize.
Remark 9
Definition 3 provides both necessary and sufficient conditions for the analysis in the proof of Lemma 1 to hold.
Corollary 1
Consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. We have the following equivalence
(10) $\displaystyle \min_{\theta:\ |w| = 1} \|\mu\|_{\mathrm{TV}} \ \text{s.t.}\ f_\theta(x_n) = y_n \;=\; \min_{f \in \mathcal{M}_{\mathrm{L}}(\mathbb{R})} \|\mathrm{L} f\|_{\mathcal{M}(\mathbb{R})} \ \text{s.t.}\ f(x_n) = y_n.$
Proof
From Lemma 1, given , , we have
(11) 
By Theorem 1, we know that a minimal knot spline (a spline with the minimal number of knots), say $s$, is a solution to the right-hand side of Eq. 10. It suffices to show that $s$ is also a solution to the left-hand side of Eq. 10. Clearly we can construct $s$ with a neural network parameterized by $\theta$ (simply take $\mu$ to be a finite linear combination of Dirac measures at the spline knots), so $s$ is feasible. We will now proceed by contradiction. Suppose there exists a network that interpolates the data with strictly smaller objective value. Hence by Eq. 11,
which contradicts the optimality of for the righthand side of Eq. 10.
Theorem 2
Let $f_\theta$ be a feedforward neural network architecture with activation function $\sigma$ chosen to be a Green’s function of a neural-network-admissible operator $\mathrm{L}$ and last-layer bias chosen according to Eq. 5.
Consider the problem of interpolating the scattered data $\{(x_n, y_n)\}_{n=1}^N$. We have the following equivalence
(12) $\displaystyle \min_{\theta:\ |w_k| = 1} \|\mathbf{v}\|_1 \ \text{s.t.}\ f_\theta(x_n) = y_n \;=\; \min_{f \in \mathcal{M}_{\mathrm{L}}(\mathbb{R})} \|\mathrm{L} f\|_{\mathcal{M}(\mathbb{R})} \ \text{s.t.}\ f(x_n) = y_n,$
so long as the number of neurons $K$ is at least $N - N_0$. The solutions of the above minimizations are minimal knot, nonuniform splines.
Proof
By Theorem 1, we know the extremal points of the right-hand side optimization have at most $N - N_0$ knots, and by noting that each neuron can create exactly one knot, this is a necessary and sufficient number of neurons. It’s then a matter of using the equality in Eq. 6 and invoking Corollary 1 and Theorem 1 to prove the theorem.
Remark 10
For a neural network to learn such a spline, we must solve the left-hand side optimization in Eq. 12. The constraints $|w_k| = 1$ pose a challenge, but we will see in Section 6 that we can rewrite the left-hand side optimization in Eq. 12 in the form
$\displaystyle \min_{\theta} \; \ell(\theta)$
s.t. $f_\theta(x_n) = y_n, \quad n = 1, \ldots, N,$
for some neural network “norm” $\ell$, and thus answer the question of which “norm” to use. Before answering this question, we will first see (in Section 5) some examples of neural-network-admissible operators.
Remark 11
In all our optimizations, we considered the setting of ideal sampling, i.e., we are optimizing subject to interpolating given points. From [Lsplines], all our results hold for generalized sampling where, instead of the interpolation constraints $f(x_n) = y_n$, we replace the sampling functionals $f \mapsto f(x_n)$, $n = 1, \ldots, N$, with any weak*-continuous linear measurement operator and instead consider the corresponding measurement constraints. This allows us to bring our framework into the context of infinite-dimensional compressed sensing and other more general inverse problems [infinitedimcs1, infinitedimcs2, invprobspacemeasures].
5 NeuralNetworkAdmissible Operators
The framework we have developed in Section 4 relies on our definition of neural-network-admissible operators (Definition 3). This definition captures many common splines used in practice. The prominent examples include:

the polynomial splines of order $m$ by choosing $\mathrm{L} = \mathrm{D}^m$. Then $N_0 = m$. The operator admits two obvious Green’s functions: the causal
$g_c(x) = \frac{x_+^{m-1}}{(m-1)!}$
and the noncausal Green’s function obtained from it via Section 3.1.
Remark 12
When $m = 2$, the causal Green’s function $g_c(x) = x_+$ is precisely the ReLU. Thus our framework captures ReLU neural networks, which correspond to linear spline interpolations.
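That the ReLU behaves as a Green's function of the second derivative can be checked numerically: the discrete second derivative of the ReLU concentrates at the origin and integrates to one, mimicking a Dirac impulse (the grid spacing `h` below is arbitrary):

```python
import numpy as np

h = 1e-3
x = np.arange(-1.0, 1.0, h)
relu = np.maximum(x, 0.0)

# Discrete second derivative: (f(x+h) - 2 f(x) + f(x-h)) / h^2.
d2 = (relu[2:] - 2 * relu[1:-1] + relu[:-2]) / h**2

# The result is zero away from the origin and its total integral is 1,
# matching D^2 ReLU = delta in the distributional sense.
mass = d2.sum() * h
```

The sum telescopes to the difference of one-sided slopes, $1 - 0 = 1$, independent of the grid.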

the fractional splines of order $\gamma > 0$ by choosing $\mathrm{L} = \mathrm{D}^\gamma$, where $\mathrm{D}^\gamma$ is the fractional derivative [fractsplines]. Then $N_0 = \lceil \gamma \rceil$. The operator admits the obvious causal Green’s function
$g_c(x) = \frac{x_+^{\gamma - 1}}{\Gamma(\gamma)},$
where $\Gamma$ is Euler’s Gamma function. Just as before, we can use Section 3.1 to also find a noncausal Green’s function.

many of the exponential splines. Specifically:
- the even exponential splines, obtained by choosing $\mathrm{L}$ whose transfer function (defined below) is even;
- the odd exponential splines, obtained by choosing $\mathrm{L}$ whose transfer function is odd.
Exponential splines are of interest from a systems theory perspective since they easily model cascades of first-order linear and translation-invariant systems. For a full treatment of exponential splines we refer to [expsplinesunser, expsplinesunser2].
To compute the Green’s functions for these operators it becomes convenient to work with the transfer function of $\mathrm{L}$. We follow a similar computation as in [expsplinesunser]. Since $\mathrm{L}$ is neural-network-admissible, it’s linear and translation-invariant, i.e., $\mathrm{L}$ is a convolution operator. Thus there exists an $h$ such that $\mathrm{L}\{f\} = h * f$. In particular, $h = \mathrm{L}\{\delta\}$, the impulse response of $\mathrm{L}$.
For both even and odd exponential splines, the transfer function, i.e., the bilateral Laplace transform of the impulse response (here and in the rest of the paper we use capital letters to denote the bilateral Laplace transform of their lowercase counterparts), will be a monic polynomial of degree $N_0$. Thus in both cases we can write
$H(s) = \prod_{n=1}^{N_0} (s - \alpha_n),$
where $\{\alpha_n\}$ are the roots of the polynomial. This can also be written as
$H(s) = \prod_{k} (s - \beta_k)^{m_k},$
where $\{\beta_k\}$ are the distinct roots of $H$ and $m_k$ is the multiplicity of the root $\beta_k$. Thus we have $\sum_k m_k = N_0$. Since we want to find $g$ such that $\mathrm{L}\{g\} = \delta$, i.e., $H(s)\,G(s) = 1$, it follows that
$G(s) = \prod_{k} \frac{1}{(s - \beta_k)^{m_k}} = \sum_{k} \sum_{j=1}^{m_k} \frac{c_{kj}}{(s - \beta_k)^{j}},$
where the last equality follows from a partial fraction decomposition which imposes the coefficients $c_{kj}$. Finally, by taking the inverse transform we find the causal Green’s function
$g_c(x) = \sum_{k} \sum_{j=1}^{m_k} c_{kj}\, \frac{x^{j-1}}{(j-1)!}\, e^{\beta_k x}\, u(x).$
One can then use Section 3.1 to find a noncausal Green’s function.
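For distinct roots the partial-fraction recipe reduces to a simple closed form; the following sketch (the function name and the first-order cascade example are ours) evaluates the causal Green's function of $\mathrm{L} = \prod_k (\mathrm{D} - \alpha_k \mathrm{I})$:

```python
import numpy as np

def causal_green(alphas, x):
    """Causal Green's function of L = prod_k (D - alpha_k I), assuming the
    roots alphas are distinct: by partial fractions,
    g(x) = sum_k e^{alpha_k x} / prod_{j != k} (alpha_k - alpha_j) for x >= 0,
    and g(x) = 0 for x < 0 (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for k, ak in enumerate(alphas):
        denom = np.prod([ak - aj for j, aj in enumerate(alphas) if j != k])
        g = g + np.exp(ak * x) / denom
    return np.where(x >= 0, g, 0.0)

# First-order cascade example: L = (D - I)(D + I) gives g(x) = sinh(x), x >= 0.
g = causal_green([1.0, -1.0], np.array([0.5, 1.0]))
```

For the cascade $(\mathrm{D} - \mathrm{I})(\mathrm{D} + \mathrm{I})$ the residues are $\pm 1/2$, so the sketch reproduces $\sinh$ on the positive axis.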

Some examples of what these activation functions may look like can be seen in Fig. 1.
We can also characterize whether or not a given $\mathrm{L}$ fits into our framework based on its transfer function (so long as it exists). Since $\mathrm{L}$ is neural-network-admissible and hence a convolution operator, we have $\mathrm{L}\{f\} = h * f$, so the second bullet in Definition 3 is equivalent to saying
(13) $h(-x) = \bar{\lambda}\, h(x).$
Taking the bilateral Laplace transform of Eq. 13, we find
$H(-s) = \bar{\lambda}\, H(s).$
In other words, for $\mathrm{L}$ to be neural-network-admissible we require its transfer function satisfies
(14) $H(-s) = \bar{\lambda}\, H(s)$
for some $\lambda \in \mathbb{C}$ with $|\lambda| = 1$. This immediately implies that any $\mathrm{L}$ with even or odd transfer function is neural-network-admissible (choose $\lambda = 1$ or $\lambda = -1$, respectively). With Eq. 14 in hand, we have a simple test of whether or not a given $\mathrm{L}$ fits under our framework.
It would also be useful to characterize whether or not a given activation function fits into our framework. This is equivalent to, for a given $\sigma$, finding $\mathrm{L}$ such that $\mathrm{L}\{\sigma\} = \delta$, then checking if Eq. 14 holds. Given $\sigma$ with bilateral Laplace transform $\Sigma$, we have
(15) $H(s) = \frac{1}{\Sigma(s)}.$
So as long as $\Sigma$ exists, we can find $H$ with Eq. 15 and check it against Eq. 14.
Remark 13
The sigmoid activation function
$\sigma(x) = \frac{1}{1 + e^{-x}}$
fits into our framework. Indeed,
$\Sigma(s) = \int_{\mathbb{R}} \frac{e^{-sx}}{1 + e^{-x}} \, \mathrm{d}x = \int_0^\infty \frac{t^{s-1}}{1 + t} \, \mathrm{d}t = \frac{\pi}{\sin(\pi s)}, \quad 0 < \operatorname{Re}(s) < 1,$
where the second equality holds by the substitution $t = e^{-x}$. Then,
(16) $H(s) = \frac{1}{\Sigma(s)} = \frac{\sin(\pi s)}{\pi}$
is odd and thus the sigmoid activation fits into our framework. Thus we can consider this as a notion of a “sigmoidal spline”. Moreover,
(17) $\lim_{\varepsilon \to 0^+} \sigma(x / \varepsilon) = u(x) \quad \text{for } x \neq 0,$
where $u$ is the unit step function, i.e., the limiting activation is a Green’s function of $\mathrm{D}$. From Eq. 16 we have
$H(s) = \frac{\sin(\pi s)}{\pi} = s + \mathcal{O}(s^3),$
which is exactly the transfer function of $\mathrm{D}$ to leading order. Thus, in the limit, the “sigmoidal spline” recovers the polynomial spline of order one, as we’d expect from Eq. 17.
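The limiting behavior above can be verified numerically: as the input scale shrinks, the sigmoid approaches the unit step, a Green's function of $\mathrm{D}$ (the scale `eps` below is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scaled sigmoid sigmoid(x / eps) tends to the unit step u(x) as eps -> 0,
# so a sigmoid unit degenerates into a step unit in this limit.
eps = 1e-3                                   # arbitrary small scale
x = np.array([-0.3, -0.05, 0.05, 0.3])
step = (x > 0).astype(float)
approx = sigmoid(x / eps)
```

Away from the origin the pointwise difference is astronomically small already at `eps = 1e-3`, which is why trained sigmoid networks can mimic piecewise-linear interpolants.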
Remark 14
The activation functions and do not fit into our framework since their bilateral Laplace transforms do not exist.
Remark 15
The Gaussian activation function
$\sigma(x) = e^{-x^2/2}$
fits into our framework. Indeed, its bilateral Laplace transform is
$\Sigma(s) = \sqrt{2\pi}\, e^{s^2/2}.$
Then,
$H(s) = \frac{1}{\Sigma(s)} = \frac{e^{-s^2/2}}{\sqrt{2\pi}},$
which is even and thus $\sigma$ fits into our framework. A two-layer neural network with Gaussian activation functions can be thought of as a generalized kernel machine where the bandwidth of the kernel is now a trainable parameter.
6 Neural Network “Norms”
In this section we will answer the remaining question of choosing the neural network “norm” $\ell$. To begin, we state the following general proposition about training networks, which is our generalization of Theorem 1 from [neyshabur]. Theorem 1 from [neyshabur] relates minimizing the norm of the network weights in a two-layer ReLU network to minimizing the norm of the last layer of the network while constraining the weights of the first layer. Our result introduces a generalized notion of the “norm” of the weights that holds for neural networks with activation functions satisfying a homogeneity condition. Moreover, our proof is completely constructive, unlike that of Theorem 1 from [neyshabur].
Proposition 1
Suppose we want to represent a function $f$ with a finite-width network with activation $\sigma$ that is nonnegative homogeneous, i.e., $\sigma(\gamma x) = \gamma^{\alpha} \sigma(x)$ for all $\gamma \geq 0$ and some $\alpha > 0$. Then, the following is true
(18) $\displaystyle \min_{\theta:\ f_\theta = f} \ \sum_{k=1}^K \frac{|v_k|^2 + |w_k|^{2\alpha}}{2} \;=\; \min_{\theta:\ f_\theta = f,\ |w_k| = 1} \ \sum_{k=1}^K |v_k|.$
Proof
See Appendix A.
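The constructive idea behind the proof, specialized to the ReLU case, is a rescaling that leaves the network function unchanged while normalizing the first-layer weights; a minimal sketch (one neuron, made-up weights):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# ReLU is nonnegative homogeneous: relu(g * z) = g * relu(z) for g >= 0.
# So the rescaling (v, w, b) -> (v*|w|, w/|w|, b/|w|) preserves the neuron's
# output while forcing the first-layer weight to have absolute value 1.
v, w, b = 3.0, -2.0, 0.5                 # made-up weights
x = np.linspace(-2.0, 2.0, 9)
f1 = v * relu(w * x + b)
f2 = (v * abs(w)) * relu((w / abs(w)) * x + b / abs(w))
```

Since the two parameterizations compute the same function, one can always move between the constrained and unconstrained problems neuron by neuron.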
Remark 16
The polynomial splines of order $m \geq 2$ have Green’s functions that are nonnegative homogeneous with $\alpha = m - 1$ (as we saw in Section 5), so for the special case of polynomial splines we have the corollary to Theorem 2 that
$\displaystyle \min_{\theta} \ \sum_{k=1}^K \frac{|v_k|^2 + |w_k|^{2(m-1)}}{2}$
s.t. $f_\theta(x_n) = y_n, \quad n = 1, \ldots, N,$
has solutions that are minimal knot, nonuniform polynomial splines of order $m$. Written differently, we have
$\displaystyle \min_{\theta} \ \ell(\theta)$
s.t. $f_\theta(x_n) = y_n, \quad n = 1, \ldots, N,$
where we have the neural network “norm”
(19) $\displaystyle \ell(\theta) = \sum_{k=1}^K \frac{|v_k|^2 + |w_k|^{2(m-1)}}{2},$
which recovers the $\ell^2$ norm of the weights when $m = 2$, which corresponds exactly to ReLU activations, and also corresponds to training a neural network with weight decay. This suggests that perhaps “nonlinear” notions of weight decay in SGD should be considered when training neural networks.
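For the ReLU case ($m = 2$), the relationship between weight decay and the product-of-magnitudes objective can be illustrated directly: by the AM-GM inequality, the weight-decay penalty dominates the sum of products $\sum_k |v_k||w_k|$, with equality after a per-neuron rebalancing (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(8)                 # made-up last-layer weights
w = rng.standard_normal(8)                 # made-up first-layer weights

path_norm = np.sum(np.abs(v) * np.abs(w))  # products of per-neuron magnitudes
weight_decay = 0.5 * np.sum(v**2 + w**2)   # standard l2 penalty

# AM-GM: |v||w| <= (v^2 + w^2)/2, with equality iff |v| == |w| per neuron.
# Rebalance each neuron so that |v_bal| == |w_bal| == sqrt(|v||w|):
s = np.sqrt(np.abs(v) / np.maximum(np.abs(w), 1e-12))
v_bal, w_bal = v / s, w * s
balanced_decay = 0.5 * np.sum(v_bal**2 + w_bal**2)
```

For homogeneous activations the rebalancing does not change the network function, so minimizing weight decay implicitly minimizes the product-based "norm".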
Remark 17
In the event we do not have the homogeneity property as in Proposition 1, we can always use regularization and obtain the corollary to Theorem 2 that
$\displaystyle \min_{\theta} \ \ell_\varepsilon(\theta)$
s.t. $f_\theta(x_n) = y_n, \quad n = 1, \ldots, N,$
where $\varepsilon$ is a small constant, has solutions that are minimal knot, nonuniform splines. In other words, we have the regularized neural network “norm”
(20) 
In practice, one can take this constant to be very small.
7 Empirical Validation and Discussion
To verify that the theory developed in Section 4 holds, we check empirically that neural networks actually do learn splines. In our empirical results, we consider regularized problems that combine a squared error loss with the neural network “norm” weighted by a regularization parameter. We verify our theory holds with linear and cubic splines. We found empirically that minimizing the “norm” Eq. 20 or the “norm” Eq. 19 made no difference in the learned interpolations. We used PyTorch (https://pytorch.org/) to implement the neural networks and used AdaGrad [adagrad] as the optimization procedure. To compute spline interpolations with standard methods we used SciPy [scipy]; specifically, scipy.interpolate.InterpolatedUnivariateSpline.
In Fig. 2 we compute a linear spline using standard methods and also compute it by training a neural network while minimizing the “norm” Eq. 19, using both a causal and a noncausal Green’s function as the activation. The neural network interpolations have more knots than the “connect the dots” linear spline, though it’s clear that both solutions are still minimizers of the problem in Theorem 2.
In Fig. 3 we compute a cubic spline using standard methods and also compute it by training a neural network while minimizing the “norm” Eq. 19, using both a causal and a noncausal Green’s function as the activation. In this case, the neural network interpolations learn exactly the same function as the standard cubic spline. Thus we see that neural networks are indeed capable of learning splines.
In Fig. 4 we show that explicit regularization can be needed to learn the spline interpolations. We see that without regularization, a neural network trained with a causal Green’s function activation does not learn the standard cubic spline; but the moment we include the proper regularization, the neural network learns the cubic spline function exactly.
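As a rough stand-in for the experiments above, the following self-contained sketch trains a small two-layer ReLU network by plain gradient descent with a weight-decay penalty (the authors used PyTorch with AdaGrad; the data, width, and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([0.0, 1.0, 2.0])            # made-up scattered data
ys = np.array([0.0, 1.0, 0.0])

K = 20                                     # overparameterized width
w = rng.standard_normal(K)
b = rng.standard_normal(K)
v = 0.1 * rng.standard_normal(K)
lam, lr, iters = 1e-4, 1e-3, 3000          # weight-decay strength, step size

def mse():
    h = np.maximum(np.outer(xs, w) + b, 0.0)
    return 0.5 * np.sum((h @ v - ys) ** 2)

mse0 = mse()
for _ in range(iters):
    h = np.maximum(np.outer(xs, w) + b, 0.0)   # hidden activations, (N, K)
    r = h @ v - ys                             # residuals at the data points
    active = (h > 0).astype(float)             # ReLU subgradient mask
    gv = h.T @ r + lam * v                     # grad wrt last-layer weights
    gw = ((r[:, None] * active) * xs[:, None] * v).sum(0) + lam * w
    gb = ((r[:, None] * active) * v).sum(0)
    v -= lr * gv; w -= lr * gw; b -= lr * gb
mse_final = mse()
```

With the penalty active, the fitted function is driven toward a sparse-knot piecewise-linear interpolant of the three points, consistent with the theory for $m = 2$.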
8 Conclusions and Future Work
We have developed a general framework based on the theory of splines for understanding the interpolation properties of sufficiently wide minimum “norm” neural networks that interpolate scattered data. We have proven that neural networks are capable of learning a large class of splines, and thus overparameterized neural networks do in fact learn “nice” interpolations of data. To the data science and machine learning communities, this gives intuition as to why overparameterized and interpolating models generalize well on new, unseen data. To the spline and inverse problems communities, we have shown that by simply training a neural network (a discrete object) to zero error with appropriate regularization, we can exactly solve various continuous-domain linear inverse problems. Our current results hold for two-layer neural networks. Future work will be directed towards developing a theory based on splines for both deep and multivariate neural networks.
Deep architectures.
It remains an open question what kinds of functions minimum “norm” deep networks learn. Working within the framework of splines required that our “building blocks” be Green’s functions of operators. With two-layer networks, this works by simply letting the activation function be the desired Green’s function. With deep networks, due to function compositions, it becomes very unclear what exactly the “building blocks” are and what would be a reasonable “norm”. In the case of Green’s function activations, it would make sense that function composition simply increases the order of the polynomial spline. Since the Fourier transform of a piecewise polynomial decays faster as the degree of its pieces grows, we conjecture that a fairly deep network with a Green’s function activation should learn something approaching a bandlimited interpolation.
Multivariate functions.
Our theory only holds for univariate functions. Extending our results to the multivariate case would require showing that minimum “norm” neural networks that interpolate scattered data solve some multivariate continuous-domain linear inverse problem whose solutions are some type of spline. Recent work has examined the minimum norm of all the network weights for multivariate ReLU networks subject to representing a particular function, but has not made any connections to splines [relumulti]. That work shows that in the multivariate case there even exist continuous piecewise linear functions that two-layer networks cannot represent with a finite norm of the network weights.
Appendix A Proof of Proposition 1
Proof
Let $\theta^\star$ be an optimal solution for the left-hand side in Eq. 18 and let $\tilde{\theta}$ be an optimal solution for the right-hand side in Eq. 18. We will prove the claim by using the optimal solution for the left-hand side to construct a feasible solution for the right-hand side and vice versa. We will then use the constructed feasible solutions to show that each optimal value bounds the other.
Starting with Eq. 22, for put