# A representer theorem for deep neural networks

Michael Unser Biomedical Imaging Group, École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
###### Abstract

We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with the existing deep-ReLU and MaxOut architectures. It also suggests novel optimization challenges, while making the link with $\ell_1$-minimization and sparsity-promoting techniques explicit.

## 1 Introduction

The basic regression problem in machine learning is to find a parametric representation of a function $f:\mathbb R^N\to\mathbb R$ given a set of data points $(\mathbf x_m,y_m)$, $m=1,\dots,M$, such that $f(\mathbf x_m)\approx y_m$ [4]. Classically, there are two steps involved. The first is the design, which can be abstracted as the choice of a given parametric class of functions $f(\cdot|\boldsymbol\theta)$, where $\boldsymbol\theta$ encodes the parameters. For instance, $f(\cdot|\boldsymbol\theta)$ could be a neural network with weights $\boldsymbol\theta$. The second is the training, which basically amounts to an interpolation/approximation problem where the chosen model is fit to the data. In practice, the optimal parameter $\boldsymbol\theta_0$ is determined via the functional minimization

$$\boldsymbol\theta_0=\arg\min_{\boldsymbol\theta}\sum_{m=1}^{M}E\big(y_m,f(\mathbf x_m|\boldsymbol\theta)\big), \tag{1}$$

where $E:\mathbb R\times\mathbb R\to\mathbb R^+$ is a convex error function that quantifies the discrepancy of the fit to the data. The classical choice is $E(y,z)=|y-z|^2$, which yields the least-squares solution.

The most delicate step is the design, because it has to deal with two conflicting requirements. First is the desire for universality, meaning that the parametric model should be flexible enough to allow for the faithful representation of a large class of functions—ideally, the complete family of continuous functions $C(\mathbb R^N)$, as the dimensionality of $\boldsymbol\theta$ goes to infinity. Second is the quest for sparsity, meaning that the model should have a small number of parameters, which leads to an increase in robustness and trustworthiness.

This work aims at streamlining the design of neural networks based on variational principles inspired by kernel methods. To set the stage, we now briefly review two popular approaches to supervised learning.

### 1.1 Kernel methods

A kernel estimator is a linear model with adjustable parameters $\boldsymbol\theta=(a_1,\dots,a_M)$ and predefined data centers $\mathbf x_1,\dots,\mathbf x_M$ of the form

$$f(\mathbf x|\boldsymbol\theta)=\sum_{m=1}^{M}a_m\,h(\mathbf x,\mathbf x_m), \tag{2}$$

where $\mathbf x\in\mathbb R^N$ is the input variable of the model and where $h:\mathbb R^N\times\mathbb R^N\to\mathbb R$ is a positive-definite kernel, a preferred choice being the Gaussian kernel [14, 1]. This expansion is at the heart of the whole class of kernel methods, including radial-basis functions and support-vector machines [32, 38, 34].

The elegance of kernel estimators lies in that they can be justified based on regularization theory [22, 7, 24]. The incentive there is to remove some of the arbitrariness of model selection by formulating the learning task as a global minimization problem that takes care of the design and training jointly. The property that makes such an integrated approach feasible is that any Hilbert space $\mathcal H$ of continuous functions on $\mathbb R^N$ has a unique reproducing kernel $k:\mathbb R^N\times\mathbb R^N\to\mathbb R$ such that (i) $k(\cdot,\mathbf x_0)\in\mathcal H$; and (ii) $f(\mathbf x_0)=\langle f,k(\cdot,\mathbf x_0)\rangle_{\mathcal H}$ for any $\mathbf x_0\in\mathbb R^N$ and $f\in\mathcal H$ [2]. The idea, then, is to formulate the “regularized” version of Problem (1) as

$$f_{\mathrm{RKHS}}=\arg\min_{f\in\mathcal H}\left(\sum_{m=1}^{M}E\big(y_m,f(\mathbf x_m)\big)+\lambda\|f\|_{\mathcal H}^2\right), \tag{3}$$

where the second term penalizes solutions with a large $\mathcal H$-norm and $\lambda\in\mathbb R^+$ is an adjustable tradeoff factor. Under the assumption that the loss function $E$ is convex, the representer theorem [15, 33, 34] then states that the solution of (3) exists, is unique, and is such that $f_{\mathrm{RKHS}}(\mathbf x)=\sum_{m=1}^{M}a_m\,k(\mathbf x,\mathbf x_m)$. This ultimately results in the same linear expansion as (2) with $h=k$. The argument also applies the other way round since any positive-definite kernel $h$ specifies a unique reproducing-kernel Hilbert space (RKHS) $\mathcal H$, which then provides the regularization functional in (3) that is matched to the kernel estimator specified by (2).

The other remarkable feature of kernel expansions is their universality, under mild conditions on the kernel $h$ [20]. In other words, one has the guarantee that the generic linear model of (2) can reproduce any continuous function to a desired degree of accuracy by including sufficiently many centers, with the error vanishing as $M\to\infty$. Moreover, because of the tight connection between kernels, RKHS, and splines [5, 19, 39], one can invoke standard results in approximation theory to obtain quantitative estimates of the approximation error of smooth functions as a function of $M$ and of the widest gap between data centers [41]. Finally, there is a well-known link between kernel methods derived from regularization theory and neural networks, albeit “shallow” ones that involve a single nonlinear layer [22].
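To make (2)–(3) concrete, here is a minimal pure-Python sketch of a kernel estimator fitted in the regularized least-squares setting. All function names are ours (illustrative, not from any library), the kernel is Gaussian, and the representer theorem is used only to justify solving for the $M$ coefficients $a_m$:

```python
import math

def gaussian_kernel(x, y, width=1.0):
    # h(x, x') = exp(-|x - x'|^2 / (2 width^2)); a positive-definite kernel.
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))

def solve(A, b):
    # Naive Gaussian elimination with partial pivoting (fine for tiny M).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        out[r] = (M[r][n] - sum(M[r][k] * out[k]
                                for k in range(r + 1, n))) / M[r][r]
    return out

def fit_kernel_estimator(xs, ys, lam=1e-6, width=1.0):
    # By the representer theorem, the minimizer of (3) with squared loss is
    # f(x) = sum_m a_m h(x, x_m), so a solves the system (K + lam*I) a = y.
    M = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], width) + (lam if i == j else 0.0)
          for j in range(M)] for i in range(M)]
    a = solve(K, ys)
    return lambda x: sum(a[m] * gaussian_kernel(x, xs[m], width)
                         for m in range(M))

xs, ys = [0.0, 1.0, 2.5], [1.0, -1.0, 0.5]
f = fit_kernel_estimator(xs, ys)
# As lam -> 0 the estimator interpolates the data points.
```

Note how the number of coefficients equals the number of data points, which is precisely the feature that the deep-spline framework of this paper avoids.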

### 1.2 Deep neural networks

While kernel methods have been a major (and winning) player in machine learning since the mid ’90s, they have been recently outperformed by deep neural networks in many real-world applications such as image classification [16], speech recognition [13], and image segmentation [26], to name a few.

The leading idea of deep learning is to build more powerful learning architectures via the stacking/composition of simpler entities (see the review papers by LeCun, Bengio and Hinton [17] and Schmidhuber [30] and the recent textbook [11] for more detailed explanations). In this work, we focus on the popular class of feedforward networks that involve a layered composition of affine transformations (linear weights) and pointwise nonlinearities. The deep structure of such a network is specified by its node descriptor $(N_0,N_1,\dots,N_L)$, where $L$ is the total number of layers (depth of the network) and $N_\ell$ is the number of neurons at the $\ell$th layer. The action of a (scalar) neuron (or node) indexed by $(n,\ell)$ is described by the relation $\mathbf x\mapsto\sigma(\mathbf w_{n,\ell}^T\mathbf x-b_{n,\ell})$, where $\mathbf x\in\mathbb R^{N_{\ell-1}}$ denotes the multivariate input of the neuron, $\sigma:\mathbb R\to\mathbb R$ is a predefined activation function (such as a sigmoid or a ReLU), $\mathbf w_{n,\ell}\in\mathbb R^{N_{\ell-1}}$ a set of linear weights, and $b_{n,\ell}\in\mathbb R$ an additive bias. The outputs of layer $\ell$ are then fed as inputs of layer $(\ell+1)$, and so forth for $\ell=1,\dots,L$.

To obtain a global description, we group the neurons within a given layer and specify the two corresponding vector-valued maps:

1. Linear step (affine transformation)

$$\mathbf f_\ell:\mathbf x\mapsto\mathbf f_\ell(\mathbf x)=\mathbf W_\ell\mathbf x-\mathbf b_\ell \tag{4}$$

with weighting matrix $\mathbf W_\ell\in\mathbb R^{N_\ell\times N_{\ell-1}}$ and bias vector $\mathbf b_\ell\in\mathbb R^{N_\ell}$.

2. Nonlinear step (activation functions)

$$\boldsymbol\sigma_\ell:\mathbf x=(x_1,\dots,x_{N_\ell})\mapsto\boldsymbol\sigma_\ell(\mathbf x)=\big(\sigma_{1,\ell}(x_1),\dots,\sigma_{N_\ell,\ell}(x_{N_\ell})\big) \tag{5}$$

with the possibility of adapting the scalar activation functions on a per-node basis.

This allows us to describe the overall action of the full $L$-layer deep network by

$$\mathbf f_{\mathrm{deep}}(\mathbf x)=(\boldsymbol\sigma_L\circ\mathbf f_L\circ\boldsymbol\sigma_{L-1}\circ\cdots\circ\boldsymbol\sigma_2\circ\mathbf f_2\circ\boldsymbol\sigma_1\circ\mathbf f_1)(\mathbf x), \tag{6}$$

which makes its compositional structure explicit. The design step therefore consists in fixing the architecture of the deep neural net: One must specify the node descriptor $(N_0,N_1,\dots,N_L)$ together with the activation functions $\sigma_{n,\ell}$. The activations are traditionally chosen to be not only the same for all neurons within a layer, but also the same across layers. This results in a computational structure with adjustable parameters $(\mathbf W_\ell,\mathbf b_\ell)$, $\ell=1,\dots,L$ (weights of the linear steps). These are then set during training via the minimization of (1), which is achieved by stochastic gradient descent with efficient error backpropagation [29].
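As a concrete (toy) illustration of the composition (6), the following pure-Python sketch runs a forward pass through such a layered structure with the traditional choice of a single shared activation; the helper names and the example weights are ours:

```python
def relu(x):
    return max(0.0, x)

def affine(W, b, x):
    # Linear step (4): f_l(x) = W x - b  (note the sign convention).
    return [sum(W[i][j] * x[j] for j in range(len(x))) - b[i]
            for i in range(len(W))]

def forward(layers, x, act=relu):
    # layers: list of (W_l, b_l); the same scalar activation is applied
    # pointwise at every layer, as in the traditional design.
    for W, b in layers:
        x = [act(v) for v in affine(W, b, x)]
    return x

# Node descriptor (N0, N1, N2) = (2, 2, 1)
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # W1, b1
          ([[1.0, 1.0]], [0.0])]                    # W2, b2
y = forward(layers, [1.0, 2.0])
```

The deep-spline framework developed below replaces the fixed `act` with a per-node learned spline.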

While researchers have considered a variety of possible activation functions, such as the traditional sigmoid, a preferred choice that has emerged over the years is the rectified linear unit $\mathrm{ReLU}(x)=(x)_+=\max(0,x)$ [10]. The reasons that support this choice are multiple. The initial motivation was to promote sparsity (in the sense of decreasing the number of active units), capitalizing on the property that ReLU acts as a gate and works well in combination with $\ell_1$-regularization [10]. Second is the empirical observation that the training of very deep networks is much faster if the hidden layers are composed of ReLU activation functions [17]. Last but not least is the connection between deep ReLU networks and splines—to be further developed in this paper.

A key observation is that a deep ReLU network implements a multivariate input-output relation that is continuous and piecewise-linear (CPWL) [21]. This remarkable property is due to the ReLU itself being a linear spline, which has prompted Poggio et al. to interpret deep neural networks as hierarchical splines [23]. Finally, it has been shown that any CPWL function admits a deep ReLU implementation [40, 3], which is quite significant since the CPWL family has universal approximation properties.
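This CPWL behavior can be checked numerically. The sketch below (toy network with arbitrary fixed weights of our own choosing) samples a small two-layer ReLU net along a line segment and verifies that the second-order finite differences vanish everywhere except at a handful of activation-pattern boundaries:

```python
def relu(v):
    return max(0.0, v)

def f(x1, x2):
    # A tiny fixed ReLU network R^2 -> R (illustrative weights).
    h1 = relu(1.0 * x1 - 0.5 * x2 - 0.2)
    h2 = relu(-0.3 * x1 + 0.8 * x2 + 0.1)
    return 1.5 * h1 - 2.0 * h2 + 0.4

# Restrict f to the segment x1 = x2 = s: second differences are zero on the
# (finitely many) affine pieces and nonzero only where the segment crosses a
# boundary between linear regions.
n, lo, hi = 200, -1.0, 1.0
step = (hi - lo) / n
g = [f(lo + i * step, lo + i * step) for i in range(n + 1)]
d2 = [g[i + 1] - 2.0 * g[i] + g[i - 1] for i in range(1, n)]
kinks = sum(1 for v in d2 if abs(v) > 1e-6)
# For this toy net the restriction has exactly two kinks (at s = -0.2, 0.4).
```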

Our purpose in this paper is to strengthen the connection between splines and multilayer ReLU networks. To that end, we formulate the design of a deep neural network globally within the context of regularization theory, in direct analogy with the variational formulation of kernel estimators given by (3). The critical aspect, of course, is the selection of an appropriate regularization functional which, for reasons that will be exposed next, will take us outside of the traditional realm of RKHS.

Having set the deep architecture of the neural network, we then formulate the training as a global optimization task whose outcome is a combined set of optimal neuronal activations and linear weights. The foundational role of the representer theorem (Theorem 3) is that it will provide us with the parametric representation of the optimal activations, which can then be leveraged by a numerical implementation.

## 2 From deep neural networks to deep splines

Given the generic structure of a deep neural network, we are interested in investigating the possibility of optimizing the shape of the activation function(s) on a node-by-node basis. We now show how this can be achieved within the context of infinite-dimensional regularization theory.

### 2.1 Choice of regularization functional

For practical relevance, the scheme should favor simple solutions such as an identity or a linear scaling. This will retain the possibility of performing a classical linear regression. It is also crucial that the activation function be differentiable to be compatible with the chain rule when the backpropagation algorithm is used to train the network. Lastly, we want to promote activation functions that are locally linear (such as the ReLU), since these appear to work best in practice. Hence, an idealized solution would be a function with a “sparse” second derivative. In other words, we would expect that its second derivative vanishes almost everywhere. This suggests the imposition of a bound on the second-order total variation of $\sigma$, which is defined as

$$\mathrm{TV}^{(2)}(\sigma)\overset{\triangle}{=}\|\mathrm D^2\sigma\|_{\mathcal M}=\sup_{\varphi\in C_0(\mathbb R):\,\|\varphi\|_\infty\le1}\langle\mathrm D^2\sigma,\varphi\rangle,$$

where $\mathrm D^2$ is the second-derivative operator and $\|\cdot\|_{\mathcal M}$ (the total-variation norm in the sense of measure theory) is a slight extension of the classical $L_1$-norm. Here, $\mathcal M(\mathbb R)$ refers to the space of regular real-valued Borel measures (a.k.a. Radon measures) on $\mathbb R$ [27]. The basic property is that $\|f\|_{\mathcal M}=\|f\|_{L_1}$ for any $f\in L_1(\mathbb R)$, which implies that $L_1(\mathbb R)\subseteq\mathcal M(\mathbb R)$. However, the shifted Dirac distribution $\delta(\cdot-\tau)\in\mathcal M(\mathbb R)$ for any shift $\tau\in\mathbb R$, while $\delta(\cdot-\tau)\notin L_1(\mathbb R)$, with $\|\delta(\cdot-\tau)\|_{\mathcal M}=1$, which shows that the space $\mathcal M(\mathbb R)$ is (slightly) larger than $L_1(\mathbb R)$. The connection with ReLU is that $\mathrm D^2\{\mathrm{ReLU}\}=\delta$, which confirms that the ReLU activation function is intrinsically sparse with $\mathrm{TV}^{(2)}(\mathrm{ReLU})=1$.

Since our formulation involves a joint optimization of all network components, it is important to decouple the effect of the various stages. The only operation that is common to linear transformations and pointwise nonlinearities is a linear scaling, which is therefore transferable from one level to the next. Since most regularization schemes are scale-sensitive, it is essential to prevent such a transfer. We achieve this by restricting the class of admissible weight vectors $\mathbf w_{n,\ell}$ acting on a given node indexed by $(n,\ell)$ to those that have a unit norm. In other words, we shall normalize the scale of all linear modules with the introduction of the new variable $\mathbf u_{n,\ell}=\mathbf w_{n,\ell}/\|\mathbf w_{n,\ell}\|$.

### 2.2 Supporting optimality results

As preparation for our representer theorem, we start with a lemma on the $\mathrm{TV}^{(2)}$-optimality of piecewise-linear interpolation, which can be deduced from the general spline theory presented in [37]. We also present arguments to disqualify the use of the more conventional Sobolev-type regularization.

The formal definition of our native space (i.e., the space over which the optimization is performed) is

$$\mathrm{BV}^{(2)}(\mathbb R)=\big\{f:\mathbb R\to\mathbb R\ :\ \|\mathrm D^2f\|_{\mathcal M}<\infty\big\}, \tag{7}$$

which is the class of functions with bounded second total variation. As explained in SI-1, we can endow $\mathrm{BV}^{(2)}(\mathbb R)$ with the norm

$$\|f\|_{\mathrm{BV}^{(2)}}\overset{\triangle}{=}\|\mathrm D^2f\|_{\mathcal M}+|f(0)|+|f(1)-f(0)|, \tag{8}$$

which turns it into a bona fide Banach space. We can also show that this space is large enough to contain all the activation functions of practical interest.

###### Lemma 1 (TV(2)-optimality of piecewise-linear interpolants).

Consider a series of scalar data points $(x_m,y_m)$, $m=1,\dots,M$, with $M\ge2$ and $x_1\le x_2\le\cdots\le x_M$. Then, under the hypothesis of feasibility (i.e., whenever $x_m\ne x_{m'}$ for $m\ne m'$), the extremal points of the interpolation problem

$$\arg\min_{f\in\mathrm{BV}^{(2)}(\mathbb R)}\|\mathrm D^2f\|_{\mathcal M}\quad\text{s.t.}\quad f(x_m)=y_m,\ m=1,\dots,M$$

are nonuniform splines of degree 1 (a.k.a. piecewise-linear functions) with no more than $(M-2)$ adaptive knots.

The proof is given in SI-2. Lemma 1 implies that there exists an optimal interpolator, not necessarily unique, whose generic parametric form is given by

$$f_{\mathrm{spline}}(x)=b_1+b_2x+\sum_{k=1}^{K}a_k(x-\tau_k)_+, \tag{9}$$

where $(x)_+=\max(0,x)$ and $K\le M-2$, with the caveat that the intrinsic spline descriptors, given by the (minimal) number of knots $K$ and the knot locations $\tau_1,\dots,\tau_K$, are not known beforehand. This means that these descriptors need to be optimized jointly with the expansion coefficients $b_1,b_2$ and $a_1,\dots,a_K$. Ultimately, this translates into a solution that has a polygonal graph with breakpoints $\big(\tau_k,f_{\mathrm{spline}}(\tau_k)\big)$ and that perfectly interpolates the data points, as shown in Figure 1.
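A linear spline in the generic form (9) is straightforward to evaluate; the sketch below (illustrative values only) encodes it by its two linear coefficients plus $K$ knot/weight pairs, so the whole function costs $2+2K$ parameters:

```python
def relu(x):
    return max(0.0, x)

def spline(x, b1, b2, knots, coeffs):
    # f(x) = b1 + b2*x + sum_k a_k (x - tau_k)_+   -- cf. (9)
    return b1 + b2 * x + sum(a * relu(x - t) for a, t in zip(coeffs, knots))

# A "hat"-shaped example: zero up to 0, slope +1 on [0, 1], slope -1 after.
knots, coeffs = [0.0, 1.0], [1.0, -2.0]
# spline(0.5) = 0.5, spline(1.0) = 1.0, spline(2.0) = 0.0
```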

The feasibility hypothesis in Lemma 1 is not restrictive since a function returns a single value for each input point. We are aware of two antecedents to Lemma 1 (e.g., [8, Corollary 2.2] and [18, Proposition 1]), albeit not in a form suitable for our purpose because these earlier results restrict the domain of $f$ to a finite interval.

Since $\mathrm{TV}^{(2)}$-regularization penalizes the variations of the derivative, it will naturally produce (sparse) solutions with a small number of knots. This means that an optimal spline will typically have fewer knots than there are data points, while the list of its knots $(\tau_k)_{k=1}^K$ with $K\le M-2$ may not necessarily be a subset of the data sites $(x_m)$, as illustrated in Figure 1. This push towards model simplification (Occam’s razor) is highly desirable. It distinguishes this formulation of splines from the more conventional one, which, in the case of interpolation, simply tells us “to connect the dots” with breakpoints $\tau_m=x_m$ and $f(x_m)=y_m$ for $m=1,\dots,M$ (see the solid-line illustration in Figure 1).

It is well known that the classical linear interpolator is the solution of the following variational problem, which we like to see as the precursor of RKHS kernel methods [25, 39].

###### Proposition 2 (Sobolev optimality of piecewise-linear interpolation).

Let the native space be the first-order Sobolev space $H^1(\mathbb R)$. Given a series of distinct scalar data points $(x_m,y_m)$, $m=1,\dots,M$, the interpolation problem

$$\arg\min_{f\in H^1(\mathbb R)}\int_{\mathbb R}|\mathrm Df(x)|^2\,\mathrm dx\quad\text{s.t.}\quad f(x_m)=y_m,\ m=1,\dots,M$$

has a unique piecewise-linear solution that can be written as

$$s_2(x)=b_1+\sum_{m=1}^{M}a_m(x-x_m)_+. \tag{10}$$

While the result is elegant and translates into a straightforward implementation, the scheme can be cumbersome for large data sets because the number of parameters in (10) increases with the number of data points. The other limitation is that the use of $H^1$-regularization disqualifies the simple linear solution $f(x)=b_1+b_2x$ with $b_2\ne0$, which has an infinite cost.
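To make the contrast concrete, the connect-the-dots solution (10) can be built in closed form: its coefficients are the successive slope changes at the data sites, and their number grows with $M$. A small pure-Python sketch with illustrative data (the helper names are ours):

```python
def relu(x):
    return max(0.0, x)

def connect_the_dots(xs, ys):
    # Coefficients of s2(x) = b1 + sum_m a_m (x - x_m)_+ : slope changes at
    # the data sites; the interpolant is constant beyond the data range.
    slopes = [(ys[m + 1] - ys[m]) / (xs[m + 1] - xs[m])
              for m in range(len(xs) - 1)]
    a = [slopes[0]]
    a += [slopes[m] - slopes[m - 1] for m in range(1, len(slopes))]
    a.append(-slopes[-1])        # flattens the graph after the last point
    b1 = ys[0]                   # constant value before the first point
    return b1, a

xs, ys = [0.0, 1.0, 3.0], [0.0, 2.0, 1.0]
b1, a = connect_the_dots(xs, ys)
s2 = lambda x: b1 + sum(ai * relu(x - xi) for ai, xi in zip(a, xs))
# s2 interpolates every (x_m, y_m): M coefficients for M data points.
```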

As one may expect, there are also direct extensions of Lemma 1 and Proposition 2 for regularized least-squares approximations. Moreover, the distinction between the two types of solutions—smoothing splines [31] vs. adaptive regression splines [18]—is even more striking for noisy data-fitting applications (in the least-squares setting, one can adjust the strength of the $\mathrm{TV}^{(2)}$-regularization to control the number of knots and easily produce solutions with $K\ll M$), which brings us back to our initial goal: the design and training of neural networks.

### 2.3 Representer theorem for deep neural networks

Our aim is to determine the optimal activation functions for a deep neural network in a task-dependent fashion. This problem is inherently ill-posed because activations are infinite-dimensional entities while we only have access to finite data. As in the case of interpolation, we resolve the ambiguity by imposing an appropriate form of regularization. Having singled out $\mathrm{TV}^{(2)}$ as the most favorable choice, we now proceed with the enunciation of our representer theorem for deep neural networks.

###### Theorem 3 (TV(2)-optimality of deep spline networks).

Let the $L$-layer feedforward neural network with node descriptor $(N_0,N_1,\dots,N_L)$ take the form

$$\mathbf x\mapsto\mathbf f(\mathbf x)=(\boldsymbol\sigma_L\circ\boldsymbol\ell_L\circ\boldsymbol\sigma_{L-1}\circ\cdots\circ\boldsymbol\ell_2\circ\boldsymbol\sigma_1\circ\boldsymbol\ell_1)(\mathbf x), \tag{11}$$

which is an alternating composition of the normalized linear transformations $\boldsymbol\ell_\ell:\mathbb R^{N_{\ell-1}}\to\mathbb R^{N_\ell}$ with linear weights $\mathbf U_\ell=[\mathbf u_{1,\ell}\ \cdots\ \mathbf u_{N_\ell,\ell}]^T$ such that $\|\mathbf u_{n,\ell}\|=1$, and the nonlinear activations $\boldsymbol\sigma_\ell:\mathbb R^{N_\ell}\to\mathbb R^{N_\ell}$ with $\sigma_{n,\ell}\in\mathrm{BV}^{(2)}(\mathbb R)$. Given a series of data points $(\mathbf x_m,\mathbf y_m)$, $m=1,\dots,M$, we then define the training problem

$$\arg\min_{(\mathbf U_\ell),\,(\sigma_{n,\ell}\in\mathrm{BV}^{(2)}(\mathbb R))}\left(\sum_{m=1}^{M}E\big(\mathbf y_m,\mathbf f(\mathbf x_m)\big)+\mu\sum_{\ell=1}^{L}R_\ell(\mathbf U_\ell)+\lambda\sum_{\ell=1}^{L}\sum_{n=1}^{N_\ell}\mathrm{TV}^{(2)}(\sigma_{n,\ell})\right), \tag{12}$$

where $E:\mathbb R^{N_L}\times\mathbb R^{N_L}\to\mathbb R^+$ is an arbitrary error function such that $\mathbf z\mapsto E(\mathbf y,\mathbf z)$ is convex for any fixed $\mathbf y\in\mathbb R^{N_L}$, $R_\ell:\mathbb R^{N_\ell\times N_{\ell-1}}\to\mathbb R^+$ is some arbitrary convex cost that favors certain types of linear transformations, and $\mu,\lambda\in\mathbb R^+$ are two adjustable regularization parameters. If the solution of (12) exists, then it is achieved by a deep spline network with individual activations of the form

$$\sigma_{n,\ell}(x)=b_{1,n,\ell}+b_{2,n,\ell}\,x+\sum_{k=1}^{K_{n,\ell}}a_{k,n,\ell}(x-\tau_{k,n,\ell})_+, \tag{13}$$

with adaptive parameters $K_{n,\ell}\in\mathbb N$, $\tau_{1,n,\ell},\dots,\tau_{K_{n,\ell},n,\ell}\in\mathbb R$, and $b_{1,n,\ell},b_{2,n,\ell},a_{1,n,\ell},\dots,a_{K_{n,\ell},n,\ell}\in\mathbb R$.

The proof is given in SI-3. This result translates into a computational structure where each node of the network (with fixed index $(n,\ell)$) is characterized by

• its number of knots $K_{n,\ell}$ (ideally, much smaller than $M$);

• the location $\tau_{k,n,\ell}$ of these knots (equivalent to ReLU biases);

• the expansion coefficients $b_{1,n,\ell},b_{2,n,\ell},a_{1,n,\ell},\dots,a_{K_{n,\ell},n,\ell}$, also written as $b_1,b_2$ and $a_1,\dots,a_K$ to avoid notational overload.

The fundamental point is that these parameters (including the number of knots) are data-dependent and adjusted automatically through the minimization of (12). All this takes place during training.
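Putting the pieces of Theorem 3 together, one layer of a deep spline network applies a unit-norm linear map followed by a per-node spline of the form (13). A minimal pure-Python sketch (illustrative parameters and helper names of our own; not the paper's implementation):

```python
import math

def relu(x):
    return max(0.0, x)

def spline_act(x, b1, b2, knots, coeffs):
    # The optimal activation form (13): linear term plus K shifted ReLUs.
    return b1 + b2 * x + sum(a * relu(x - t) for a, t in zip(coeffs, knots))

def unit_rows(U):
    # Enforce ||u_{n,l}|| = 1, as required by the normalized linear steps.
    return [[v / math.sqrt(sum(w * w for w in row)) for v in row]
            for row in U]

def deep_spline_forward(layers, x):
    # layers: list of (U_l, params_l); params_l holds one (b1, b2, knots, a)
    # tuple per node, so each neuron carries its own learned spline.
    for U, params in layers:
        z = [sum(U[i][j] * x[j] for j in range(len(x)))
             for i in range(len(U))]
        x = [spline_act(z[i], *params[i]) for i in range(len(z))]
    return x

U1 = unit_rows([[3.0, 4.0]])             # one node, unit-norm row (0.6, 0.8)
params1 = [(0.0, 1.0, [0.5], [2.0])]     # identity term plus one knot
out = deep_spline_forward([(U1, params1)], [1.0, 0.5])
```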

## 3 Interpretation and discussion

Theorem 3 tells us that we can configure a neural network optimally by restricting our attention to piecewise-linear activation functions $\sigma_{n,\ell}$, or spline activations, for short. In effect, this means that the “infinite-dimensional” minimization problem specified by (12) can be converted into a tractable finite-dimensional problem where, for each node $(n,\ell)$, the parameters to be optimized are the number $K_{n,\ell}$ of knots, the locations $\tau_{k,n,\ell}$ of the spline knots, and the linear weights $b_{1,n,\ell}$, $b_{2,n,\ell}$, and $a_{k,n,\ell}$. The enabling property is going to be (16), which converts the continuous-domain regularization into a discrete $\ell_1$-norm. This is consistent with the expectation that bounding the second-order total variation favors solutions with sparse second derivatives—i.e., linear splines with the fewest possible number of knots. The idea is that $\ell_1$-minimization helps reducing the number of active coefficients [36].

The other important feature is that the knots are adaptive and that they can be learned during training using the standard backpropagation algorithm. What is required is the derivative of the activation functions. It is given by

$$\sigma'_{n,\ell}(x)=b_{2,n,\ell}+\sum_{k=1}^{K_{n,\ell}}a_{k,n,\ell}\,\mathbb 1_{[\tau_{k,n,\ell},+\infty)}(x), \tag{14}$$

where $\mathbb 1_{[\tau,+\infty)}$ is an indicator function that is zero for $x<\tau$ and one otherwise. These derivatives are piecewise-constant splines with jumps of height $a_{k,n,\ell}$ at the knot locations $\tau_{k,n,\ell}$. By differentiating (14) once more, we get

$$\sigma''_{n,\ell}(x)=\sum_{k=1}^{K_{n,\ell}}a_{k,n,\ell}\,\delta(x-\tau_{k,n,\ell}), \tag{15}$$

where $\delta$ is the Dirac distribution. Owing to the property that $\|\delta(\cdot-\tau)\|_{\mathcal M}=1$, we then readily deduce that

$$\mathrm{TV}^{(2)}\{\sigma_{n,\ell}\}=\|\sigma''_{n,\ell}\|_{\mathcal M}=\sum_{k=1}^{K_{n,\ell}}|a_{k,n,\ell}|=\|\mathbf a_{n,\ell}\|_1, \tag{16}$$

which converts the continuous-domain regularization into a more familiar minimum $\ell_1$-norm constraint on the underlying expansion coefficients.
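The identity (16) is easy to verify numerically: the derivative (14) is piecewise-constant, so summing the absolute jumps of $\sigma'$ on a fine grid recovers the $\ell_1$-norm of the ReLU coefficients. A small sketch with illustrative coefficients of our own choosing:

```python
def spline_deriv(x, b2, knots, coeffs):
    # sigma'(x) = b2 + sum_k a_k * 1_{[tau_k, +inf)}(x)   -- cf. (14)
    return b2 + sum(a for a, t in zip(coeffs, knots) if x >= t)

def tv_of_derivative(b2, knots, coeffs, lo=-10.0, hi=10.0, n=20001):
    # Total variation of the piecewise-constant derivative, sampled on a
    # grid fine enough to isolate each jump; by (16) this equals ||a||_1.
    step = (hi - lo) / (n - 1)
    vals = [spline_deriv(lo + i * step, b2, knots, coeffs) for i in range(n)]
    return sum(abs(vals[i + 1] - vals[i]) for i in range(n - 1))

coeffs = [1.0, -2.0, 0.5]
tv = tv_of_derivative(0.3, [-1.0, 0.0, 2.0], coeffs)
l1 = sum(abs(a) for a in coeffs)   # = 3.5
```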

What is even more interesting, from a practical point of view, is that the corresponding system translates into a deep ReLU network modulo a slight modification of the standard architecture described by (6). Indeed, the primary basis functions in (13) are shifted ReLUs, so that each spline activation can be realized by way of a simple one-layer ReLU subnetwork with the spline knots being encoded in the biases. In particular, when the only active coefficient is $a_{1,n,\ell}=a_{n,\ell}$ (i.e., $K_{n,\ell}=1$ and $b_{1,n,\ell}=b_{2,n,\ell}=0$), we have a perfect equivalence with the classical deep ReLU structure described by (6) with $\sigma=\mathrm{ReLU}$. The enabling property is that

$$(\mathbf w_{n,\ell}^T\mathbf x-z_{n,\ell})_+=\big(a_{n,\ell}\mathbf u_{n,\ell}^T\mathbf x-z_{n,\ell}\big)_+=a_{n,\ell}\big(\mathbf u_{n,\ell}^T\mathbf x-\tau_{n,\ell}\big)_+,$$

with $\mathbf w_{n,\ell}=a_{n,\ell}\mathbf u_{n,\ell}$, $a_{n,\ell}=\|\mathbf w_{n,\ell}\|$, and $\tau_{n,\ell}=z_{n,\ell}/a_{n,\ell}$. Concretely, this means that, for every layer $\ell$, we can absorb the single ReLU coefficients $a_{n,\ell}$ into the prior linear transformation and consider unnormalized transformations (as in (4)) rather than the normalized ones of Theorem 3 with $\|\mathbf u_{n,\ell}\|=1$.
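The rescaling identity above amounts to a one-line numerical check (arbitrary toy numbers of our own choosing):

```python
import math

def relu(v):
    return max(0.0, v)

w, z, x = [3.0, -4.0], 2.0, [0.7, -0.1]

norm = math.hypot(*w)                 # a = ||w|| = 5
u = [wi / norm for wi in w]           # unit-norm direction u = w / ||w||
tau = z / norm                        # knot absorbed into the bias

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
lhs = relu(dot(w, x) - z)             # unnormalized ReLU neuron
rhs = norm * relu(dot(u, x) - tau)    # normalized neuron scaled by a
# lhs equals rhs up to rounding
```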

Theorem 3 then suggests that the next step in complexity is to add the linear term $b_{1,n,\ell}+b_{2,n,\ell}\,x$ to each node, since its regularization cost vanishes. The other design extreme is to let $\lambda\to\infty$, in which case the whole network collapses, leading to an affine mapping of the form $\mathbf f(\mathbf x)=\mathbf A\mathbf x+\mathbf b$ with $\mathbf A\in\mathbb R^{N_L\times N_0}$ and $\mathbf b\in\mathbb R^{N_L}$. More generally, the framework provides us with the possibility of controlling the number of knots (and hence the complexity of the network) through the simple adjustment of the regularization parameter $\lambda$, with the number of knots increasing as $\lambda\to0$.

A characteristic property of deep spline networks, to be considered here as a superset of the traditional deep ReLU networks, is that they produce an input-output relation that is continuous and piecewise-linear (CPWL) in the following sense: the corresponding function $\mathbf f_{\mathrm{deep}}:\mathbb R^{N_0}\to\mathbb R^{N_L}$ is continuous; its domain can be partitioned into a finite set of non-overlapping convex polytopes over which it is affine [35, 40]. More precisely, $\mathbf f_{\mathrm{deep}}(\mathbf x)=\mathbf f_j(\mathbf x)$ for all $\mathbf x$ in the $j$th polytope, where $\mathbf f_j$ has the same affine parametric form as in (4). This simply follows from the observation that each elementary mapping $x\mapsto\sigma_{n,\ell}(x)$, with $\sigma_{n,\ell}$ as specified by (13), is CPWL and that the CPWL property is conserved through functional composition. In fact, for scalar functions $f:\mathbb R\to\mathbb R$, the CPWL property is equivalent to the function being a nonuniform spline of degree 1.

Another powerful architecture that is known to generate CPWL functions is the MaxOut network [12]. There, the nonlinear steps in (6) are replaced by max-pooling operations. It turns out that these operations are also expressible in terms of deep splines, as illustrated in Figure 2 for the simple case where the maximum is taken over two inputs. Interestingly, this conversion requires the use of the linear term $b_1+b_2x$, which is absent in conventional ReLU networks. This reinforces the argument made by Goodfellow et al. [12] concerning the capability of MaxOut to learn activation functions.
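For the two-input case, the conversion of a max-pooling unit into a spline with a linear term is explicit, since $\max(x_1,x_2)=x_2+(x_1-x_2)_+$: one ReLU plus a linear pass-through, which matches the form (13). A quick check (hypothetical helper names):

```python
def relu(v):
    return max(0.0, v)

def maxout2(x1, x2):
    # max(x1, x2) = x2 + (x1 - x2)_+ : a single ReLU plus the linear
    # (pass-through) term that conventional ReLU networks lack.
    return x2 + relu(x1 - x2)

pairs = [(0.0, 0.0), (1.5, -2.0), (-3.0, 4.0), (2.0, 2.0)]
# maxout2 agrees with the built-in max on every pair
```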

The optimality result in Theorem 3 holds for a remarkably broad family of cost functions, which should cover all cases of practical interest. The first condition is that the data term, as its name suggests, be solely dependent on the data points $(\mathbf x_m,\mathbf y_m)$ and the corresponding network outputs $\mathbf f(\mathbf x_m)$. The second is that the regularization of the weights—the part that constrains the linear steps—and the regularization of the individual activation functions are decoupled from each other. Obvious generalizations of the result include

• cases where the optimization in (12) is performed over a subset of the components while other network elements, such as critical linear weights, activation functions (a prominent example is the use of the softmax function [4, 11] to convert the output of a neural network into a set of pseudo-probabilities), or even pooling operators, are fixed beforehand; in particular, this includes the important subclass of networks that are not fully connected.

• generalized forms of regularization where $\mathrm{TV}^{(2)}(\sigma_{n,\ell})$ is substituted by $\psi\big(\mathrm{TV}^{(2)}(\sigma_{n,\ell})\big)$, where $\psi:\mathbb R^+\to\mathbb R^+$ is any monotonically increasing function.

An attractive feature that is offered by the deep-spline parameterization is the possibility of suppressing a network layer—or rather, merging two adjacent ones—when the optimal solution is such that $\sigma_{n,\ell}(x)=x$ for some fixed $\ell$ and all $n=1,\dots,N_\ell$. This is a property that results from the presence of the linear component and has not been exploited so far.

Lastly, we like to contrast the result in Theorem 3 with the classical representer theorem of machine learning [33]. The commonality is that both theorems provide a parametric representation of the solution in the form of a linear “kernel” expansion. The primary distinction is that the classical representer theorem is restricted to “shallow” networks with $L=1$. Yet, there is another difference even more crucial for our purpose: the fact that the knots in (13) are adaptive and few ($K_{n,\ell}\ll M$), while the centers $\mathbf x_m$ in (2) are fixed and as numerous as there are data points in the training set. A final observation is that the ReLU function is not a valid kernel in the traditional sense of the term because it is not positive-definite. We note, however, that it can be substituted by the equivalent spline generator $|x|$, which is conditionally positive-definite [41]. Again, the property that makes this feasible is the presence of the linear term $b_1+b_2x$.

## 4 Directions for future research

The main contribution of this work is to provide the theoretical foundations for an integrated approach to neural networks where a sub-part of the design—the optimal shaping of activations—can be transferred to the training part of the process and formulated as a global optimization problem. It also deepens the connection between splines and multilayer ReLU networks, as a pleasing side product. While the concept seems promising and includes the theoretical possibility of suppressing unnecessary layers, it raises a number of issues that can only be answered through extensive experimentation with real data. The next task ahead is to demonstrate the capability of spline activations to improve upon the state-of-the-art. (Except for a potential risk of over-parameterization, deep-spline networks should perform at least as well as deep ReLU networks since the latter constitute a subset of the former.) In particular, one needs to assess the usefulness of the linear-activation component that is suggested by the theory and not present in traditional ReLU systems.

Inspired by the many successes of convolutional networks for the classification and processing of signals, we propose to investigate deep-spline variants of such systems where some of the weighting matrices and spline activations are shared in a sliding-window fashion. The hope is that one may obtain a boost in performance by tuning the activation function to the corresponding convolution kernel, with a relatively modest parametric and computational overhead.

We expect the greatest challenge for training a deep spline network to be the proper optimization of the number of knots at each neuron, given that the solution with the fewest parameters is the most desirable. A possible strategy is to allocate a knot budget per node and to then rely on standard iterative $\ell_1$-norm minimization techniques to produce a sparse solution [6, 9, 36]. Such a scheme may still require some explicit knot-deletion step, either as post-processing or during the training iterations, to effectively trim down the number of parameters. A potential difficulty is that minimum-$\mathrm{TV}^{(2)}$ interpolants are typically non-unique, because the underlying regularization is convex but not strictly so. This means that the solution found by an iterative algorithm—assuming that the minimum of the regularization energy is achieved—is not necessarily the sparsest one within the (convex) solution set. Designing an algorithm that can effectively deal with this issue will be a very valuable contribution to the field.
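One plausible instance of such a knot-deletion step, sketched here under our own assumptions (soft-thresholding, the proximal operator of the $\ell_1$-penalty, followed by dropping the zeroed-out knots; all names are hypothetical):

```python
def soft_threshold(a, t):
    # Proximal operator of t*|.| : shrinks a coefficient toward zero and
    # sets it exactly to zero when |a| <= t.
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def trim_knots(knots, coeffs, t):
    # One sparsification pass: shrink all ReLU coefficients, then delete
    # the knots whose coefficient has been zeroed out.
    shrunk = [soft_threshold(a, t) for a in coeffs]
    kept = [(tau, a) for tau, a in zip(knots, shrunk) if a != 0.0]
    return [tau for tau, _ in kept], [a for _, a in kept]

knots, coeffs = trim_knots([0.0, 1.0, 2.0], [0.05, -1.0, 0.3], t=0.1)
# only the two significant knots survive
```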

### Acknowledgments

The research was partially supported by the Swiss National Science Foundation under Grant 200020-162343. The author is thankful to Shayan Aziznejad, Anais Badoual and Kyong Hwan Jin for helpful discussions.

## References

• [1] Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
• [2] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
• [3] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
• [4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• [5] C. de Boor and R. E. Lynch. On splines and their minimum properties. Journal of Mathematics and Mechanics, 15(6):953–969, 1966.
• [6] D. L. Donoho. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
• [7] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1, Apr 2000.
• [8] SD Fisher and JW Jerome. Spline solutions to extremal problems in one and several variables. Journal of Approximation Theory, 13(1):73–83, 1975.
• [9] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Springer, 2013.
• [10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
• [11] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep Learning, volume 1. MIT press Cambridge, 2016.
• [12] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. Proceedings of Machine Learning Research, 28(3):1319–1327, 2013.
• [13] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
• [14] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.
• [15] George Kimeldorf and Grace Wahba. Some results on tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82 – 95, 1971.
• [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
• [17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
• [18] E. Mammen and S. van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387–413, 1997.
• [19] C. Micchelli. Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2(1):11–22, 1986.
• [20] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
• [21] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.
• [22] Tomaso Poggio and Federico Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945):978–982, 1990.
• [23] Tomaso Poggio, Lorenzo Rosasco, Amnon Shashua, Nadav Cohen, and Fabio Anselmi. Notes on hierarchical splines, DCLNs and i-theory. Technical report, Center for Brains, Minds and Machines (CBMM), 2015.
• [24] Tomaso Poggio and Steve Smale. The mathematics of learning: Dealing with data. Notices of the AMS, 50(5):537–544, 2003.
• [25] P.M. Prenter. Splines and Variational Methods. Wiley, New York, 1975.
• [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
• [27] Walter Rudin. Real and Complex Analysis. McGraw-Hill, New York, 3rd edition, 1987.
• [28] Walter Rudin. Functional Analysis. McGraw-Hill Book Co., New York, 2nd edition, 1991. McGraw-Hill Series in Higher Mathematics.
• [29] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
• [30] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
• [31] I. J. Schoenberg. Spline functions and the problem of graduation. Proceedings of the National Academy of Sciences, 52(4):947–950, October 1964.
• [32] B. Schölkopf, Kah-Kay Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, Nov 1997.
• [33] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, Computational Learning Theory, pages 416–426, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
• [34] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2002.
• [35] J.M. Tarela and M.V. Martinez. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 30(11):17–27, 1999.
• [36] M. Unser, J. Fageot, and H. Gupta. Representer theorems for sparsity-promoting regularization. IEEE Transactions on Information Theory, 62(9):5167–5180, September 2016.
• [37] M. Unser, J. Fageot, and J. P. Ward. Splines are universal solutions of linear inverse problems with generalized-TV regularization. SIAM Review, 59(4):769–793, December 2017.
• [38] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
• [39] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1990.
• [40] Shuning Wang and Xusheng Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
• [41] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.

# Supporting Information

## SI-1. Banach structure of BV(2)(R) and of its predual space

While the definition of BV(2)(R) given in (7) is convenient for expository purposes, it is not directly usable for mathematical analysis because the functional f↦∥D2f∥M is only a semi-norm. To lift the ambiguity due to the non-trivial null space, we select a biorthogonal system (ϕ,p) for ND2=span{p1,p2} (the null space of D2). Because we are dealing with an interpolation problem, our proposed choice is ϕ1=δ and ϕ2=δ(⋅−1)−δ, with p1(x)=1 and p2(x)=x, which are such that ⟨ϕ1,p1⟩=⟨ϕ2,p2⟩=1 and ⟨ϕ1,p2⟩=⟨ϕ2,p1⟩=0 (biorthogonality property). (The canonical choice of analysis functionals, which involves derivative sampling, is not appropriate here because of the requirement that the sampling functionals should be weak*-continuous; see the proof of Proposition 6.) We then rely on [37, Theorem 5] to get the following characterization.

###### Proposition 4 (Banach structure of BV(2)(R)).

Let (ϕ,p) be a biorthogonal system for ND2. Then, BV(2)(R) equipped with the norm

 ∥f∥BV(2)=∥D2f∥M+|⟨ϕ1,f⟩|+|⟨ϕ2,f⟩|,

is a (non-reflexive) Banach space. Moreover, every f∈BV(2)(R) has the unique direct-sum decomposition

 f=L−1ϕw+p, (17)

where w=D2f∈M(R), p=⟨ϕ1,f⟩p1+⟨ϕ2,f⟩p2∈ND2, and L−1ϕw=∫R gϕ(⋅,y)dw(y), with

 gϕ(x,y)=(x−y)+−p1(x)⟨ϕ1,(⋅−y)+⟩−p2(x)⟨ϕ2,(⋅−y)+⟩. (18)

Central to our formulation is the unique operator L−1ϕ:M(R)→BV(2)(R) such that

 D2L−1ϕw=w (right-inverse property) (19)
 ⟨ϕ1,L−1ϕw⟩=0, ⟨ϕ2,L−1ϕw⟩=0 (boundary conditions) (20)

for all w∈M(R). Specifically, (20) ensures the orthogonality of the two components of the direct-sum decomposition (17) of BV(2)(R), while (19) and the biorthogonality of (ϕ,p) guarantee its uniqueness.
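The biorthogonality property invoked here can be verified by direct evaluation for the concrete system used below (assumed here: ϕ1=δ, ϕ2=δ(⋅−1)−δ, p1(x)=1, p2(x)=x), since the pairings reduce to point evaluations; a minimal sanity check:

```python
# Check <phi_m, p_n> = delta[m-n] for phi1 = delta, phi2 = delta(.-1) - delta,
# p1(x) = 1, p2(x) = x. The pairings reduce to point evaluations:
# <phi1, p> = p(0) and <phi2, p> = p(1) - p(0).

def phi1(p):  # action of the Dirac impulse at 0
    return p(0.0)

def phi2(p):  # action of the finite difference delta(.-1) - delta
    return p(1.0) - p(0.0)

p1 = lambda x: 1.0
p2 = lambda x: x

gram = [[phi1(p1), phi1(p2)],
        [phi2(p1), phi2(p2)]]
print(gram)  # [[1.0, 0.0], [0.0, 1.0]] (identity, as required)
```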

By fixing ϕ1=δ and ϕ2=δ(⋅−1)−δ (finite difference), we obtain the formula of the norm for BV(2)(R) given by (8). The corresponding expression of the kernel of L−1ϕ given by (18) is

 gϕ(x,y) =(x−y)+−(1−x)(−y)+−x(1−y)+. (21)

A crucial observation for the proof of Lemma 1 is that the function y↦gϕ(x,y) specified by (21) is compactly supported and bounded, in contrast with the leading term (x−y)+ of the expansion in (18), which represents the impulse response of the conventional shift-invariant inverse of D2 (two-fold integrator). In fact, these functions are triangle-shaped B-splines with the following characteristics (see Figure 3).

• for x≤0: gϕ(x,⋅) is supported in [x,1] and takes its maximum value −x at y=0;

• for 0≤x≤1: gϕ(x,⋅) is supported in [0,1] and takes its extremum value −x(1−x) at y=x;

• for x≥1: gϕ(x,⋅) is supported in [0,x] and takes its maximum value (x−1) at y=1.
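These support and extremum properties follow from (21) by direct computation; the snippet below is a numerical sanity check (not part of the paper), which also confirms the boundary conditions gϕ(0,y)=gϕ(1,y)=0 implied by (20):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def g(x, y):
    # Kernel (21): g(x, y) = (x - y)_+ - (1 - x)(-y)_+ - x(1 - y)_+
    return relu(x - y) - (1.0 - x) * relu(-y) - x * relu(1.0 - y)

y = np.linspace(-5.0, 5.0, 100001)

# Boundary conditions (20): g(0, .) = g(1, .) = 0
assert np.allclose(g(0.0, y), 0.0) and np.allclose(g(1.0, y), 0.0)

# x <= 0: support [x, 1], maximum -x at y = 0
x = -2.0
vals = g(x, y)
assert np.allclose(vals[(y < x) | (y > 1.0)], 0.0)
assert np.isclose(vals.max(), -x) and np.isclose(g(x, 0.0), -x)

# 0 <= x <= 1: support [0, 1], extremum (minimum) -x(1-x) at y = x
x = 0.3
vals = g(x, y)
assert np.allclose(vals[(y < 0.0) | (y > 1.0)], 0.0)
assert np.isclose(vals.min(), -x * (1.0 - x)) and np.isclose(g(x, x), -x * (1.0 - x))

# x >= 1: support [0, x], maximum x - 1 at y = 1
x = 2.5
vals = g(x, y)
assert np.allclose(vals[(y < 0.0) | (y > x)], 0.0)
assert np.isclose(vals.max(), x - 1.0) and np.isclose(g(x, 1.0), x - 1.0)
print("kernel checks passed")
```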

The characterization of the predual of our native space BV(2)(R) is required for testing the hypothesis of weak*-continuity, which is fundamental to our formulation because the space is non-reflexive [28]. To that end, we first recall that the predual of M(R) is the space C0(R) of continuous functions that vanish at infinity, equipped with the sup norm (Riesz-Markov theorem) [27]. Then, C0(R) can also be described as the closure of S(R) (Schwartz' space of smooth and rapidly decaying functions) equipped with the sup norm. Based on Theorem 6 in [37], we get a similar characterization for the predual of BV(2)(R).

###### Proposition 5 (Predual space).

The native space BV(2)(R) in Proposition 4 is the continuous dual of the Banach space obtained as the closure of S(R) equipped with the norm

 ∥f∗∥′BV(2)≜∥L−1∗ϕf∗∥∞+|⟨p1,f∗⟩|+|⟨p2,f∗⟩| (22)

where L−1∗ϕ is the adjoint of the operator L−1ϕ whose kernel is specified by (18). Moreover, every element f∗ of this predual space has the unique direct-sum decomposition

 f∗=D2φ+ϕ,

where φ=L−1∗ϕf∗ and ϕ=⟨p1,f∗⟩ϕ1+⟨p2,f∗⟩ϕ2∈span{ϕ1,ϕ2}.

## SI-2. Proof of Lemma 1

###### Proof.

The result is a special case of [37, Theorem 4]. The latter is a general optimality result for splines that holds for a wide class of (spline-admissible) regularization operators and for arbitrary linear functionals, subject to some weak*-continuity requirement. Hence, we only need to ensure that the underlying mathematical hypotheses are met for the operator D2 and the native space BV(2)(R).

• Spline-admissibility. The operator D2 is spline-admissible in the sense of [37, Definition 1]: Its causal Green's function is (x)+=max(0,x) (ReLU), which exhibits an algebraic rate of growth of degree one, while its null space ND2=span{p1,p2}, with p1(x)=1 and p2(x)=x, is finite-dimensional. These are precisely the basis functions that appear in (9).

• Weak*-continuity of the sampling functionals δ(⋅−xm) with respect to the topology specified in SI-1.

###### Proposition 6.

The functional f↦f(xm)=⟨δ(⋅−xm),f⟩ is weak*-continuous on BV(2)(R) for any xm∈R, with continuity bound

 |f(xm)|=|⟨δ(⋅−xm),f⟩|≤(1+2|xm|) ∥f∥BV(2),

for any f∈BV(2)(R).

###### Proof.

The key here is that L−1∗ϕδ(⋅−xm)=gϕ(xm,⋅), where the latter kernel, defined by (21), is continuous, bounded, and compactly supported, and hence vanishing at ±∞. Consequently, gϕ(xm,⋅)∈C0(R), which proves that δ(⋅−xm) belongs to the predual space of Proposition 5. Based on the observation that supy∈R|gϕ(xm,y)|≤|xm|, we then easily estimate the norm of δ(⋅−xm) as

 ∥δ(⋅−xm)∥′BV(2)=supy∈R|gϕ(xm,y)|+|⟨1,δ(⋅−xm)⟩|+|⟨x,δ(⋅−xm)⟩|≤|xm|+1+|xm|<∞.

Finally, we recall that the property that two Banach spaces X and X′ form a dual pair implies that |⟨f∗,f⟩|≤∥f∗∥X′∥f∥X for any f∈X and f∗∈X′. Taking X=BV(2)(R) and f∗=δ(⋅−xm) allows us to translate the above norm estimate into the announced continuity bound. ∎
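As a numerical sanity check of this bound (not part of the proof), one can test it on random linear splines f(x)=b1+b2x+∑k ak(x−tk)+, for which ∥D2f∥M equals the sum of the absolute slope discontinuities |ak|:

```python
import numpy as np

# Check |f(x_m)| <= (1 + 2|x_m|) ||f||_BV(2) on random linear splines.
rng = np.random.default_rng(0)

def make_spline(knots, jumps, b1, b2):
    # f(x) = b1 + b2*x + sum_k jumps[k] * (x - knots[k])_+
    def f(x):
        return b1 + b2 * x + sum(a * max(x - t, 0.0) for a, t in zip(jumps, knots))
    return f

for _ in range(100):
    knots = rng.uniform(-3.0, 3.0, size=5)
    jumps = rng.normal(size=5)
    b1, b2 = rng.normal(size=2)
    f = make_spline(knots, jumps, b1, b2)
    # Norm (8): ||D^2 f||_M + |f(0)| + |f(1) - f(0)|
    norm = np.abs(jumps).sum() + abs(f(0.0)) + abs(f(1.0) - f(0.0))
    for xm in rng.uniform(-5.0, 5.0, size=20):
        assert abs(f(xm)) <= (1.0 + 2.0 * abs(xm)) * norm + 1e-9
print("continuity bound verified on 100 random splines")
```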

• Well-posedness of the reconstruction for f∈ND2. It is well-known that the classical linear regression problem

 b=(b1,b2)=argminb1,b2∑Mm=1|ym−(b1+b2xm)|2

is well posed and has a unique solution if and only if {xm}Mm=1 contains at least two distinct points, say x1≠x2, which takes care of the final hypothesis in [37, Theorem 4]. ∎
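This rank condition can be illustrated with NumPy's least-squares routine (helper name illustrative): with two distinct abscissae the design matrix [1, xm] has rank 2 and the minimizer is unique; with a single repeated abscissa the intercept and slope are not identifiable.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y ~ b1 + b2*x; returns (coeffs, residuals, rank, svals)."""
    A = np.stack([np.ones_like(x), x], axis=1)  # design matrix [1, x_m]
    return np.linalg.lstsq(A, y, rcond=None)

# Distinct abscissae: rank 2, unique minimizer.
x = np.array([0.0, 1.0, 2.0])
b, _, rank, _ = fit_line(x, 1.0 + 2.0 * x)
print(rank)            # 2
print(np.round(b, 6))  # [1. 2.]

# A single repeated abscissa: rank-deficient problem.
x_bad = np.array([1.0, 1.0, 1.0])
print(fit_line(x_bad, np.array([0.0, 1.0, 2.0]))[2])  # 1
```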

## SI-3. Proof of Theorem 3

###### Proof.

Let the function ~f be a (not necessarily unique) solution of the problem summarized by (3). This solution is described by (11) with some optimal choice of transformation matrices ~Uℓ and pointwise nonlinearities ~σn,ℓ for ℓ=1,…,L and n=1,…,Nℓ.

As we apply ~f to the data point xm and progressively move through the layers of the network, we generate a series of vectors ~ym,ℓ∈RNℓ, according to the following recursive definition:

• Initialization (input of the network): ~ym,0=xm.

• Recursive update: For ℓ=1,…,L, calculate

 zm,ℓ=(z1,m,ℓ,…,zNℓ,m,ℓ)=~Uℓ~ym,ℓ−1 (23)

and construct ~ym,ℓ=(~y1,m,ℓ,…,~yNℓ,m,ℓ) with

 ~yn,m,ℓ=~σn,ℓ(zn,m,ℓ), n=1,…,Nℓ. (24)
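In code, the recursion (23)-(24) is the usual forward pass, except that each neuron n of layer ℓ carries its own scalar activation. A minimal NumPy sketch (all names illustrative):

```python
import numpy as np

def forward(x, weights, activations):
    """Forward pass (23)-(24): z_l = U_l y_{l-1}, then y_{n,l} = sigma_{n,l}(z_{n,l}).

    weights: list of matrices U_l; activations: per-layer lists of per-neuron
    scalar functions sigma_{n,l}. Names are illustrative, not from the paper.
    """
    y = np.asarray(x, dtype=float)  # initialization: y_{m,0} = x_m
    for U, sigmas in zip(weights, activations):
        z = U @ y                                        # linear step (23)
        y = np.array([s(t) for s, t in zip(sigmas, z)])  # pointwise step (24)
    return y

relu = lambda t: max(t, 0.0)
W = [np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([[1.0, 1.0]])]
acts = [[relu, relu], [lambda t: t]]  # identity activation at the output
print(forward(np.array([1.0, 2.0]), W, acts))  # [4.5]
```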

At the output level, we get ~f(xm)=~ym,L for m=1,…,M, which are the values that determine the data-fidelity part of the criterion associated with the optimal network, represented by the term ∑Mm=1E(ym,~f(xm)) in (3). Likewise, the specification of the optimal linear transforms ~Uℓ fixes the corresponding regularization cost. Having set these quantities, we concentrate on the final element of the problem: the characterization of the "optimal" activations ~σn,ℓ in-between the locations associated with the "auxiliary" data points (zn,m,ℓ,~yn,m,ℓ), m=1,…,M. The key is to recognize that we can now consider the various activation functions individually, because the variation of ~σn,ℓ in-between data points is entirely controlled by ∥D2~σn,ℓ∥M, without any influence on the other terms of the cost functional. Since the solution achieves the global optimum, we therefore have that

 ~σn,ℓ=argminf∈BV(2)(R)∥D2f∥M s.t. f(zn,m,ℓ)=~yn,m,ℓ,  m=1,…,M.
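Each of these is a one-dimensional TV(2)-regularized interpolation problem. A natural linear-spline candidate consistent with Lemma 1 is the connect-the-dots interpolant, whose cost is the total variation of the sequence of secant slopes; a minimal sketch (helper name illustrative, distinct abscissae assumed):

```python
import numpy as np

def connect_the_dots(z, y):
    """Piecewise-linear interpolant of the points (z_m, y_m), a linear-spline
    candidate for min ||D^2 f||_M s.t. f(z_m) = y_m.

    Returns the interpolant and its cost, i.e. the sum of the absolute
    changes of the secant slopes (the slope jumps at the knots).
    """
    order = np.argsort(z)
    z, y = np.asarray(z, float)[order], np.asarray(y, float)[order]
    slopes = np.diff(y) / np.diff(z)          # secant slopes a_m
    cost = np.abs(np.diff(slopes)).sum()      # ||D^2 f||_M for this interpolant
    f = lambda t: np.interp(t, z, y)          # linear-spline interpolant
    return f, cost

f, cost = connect_the_dots([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 2.0])
print(cost)  # |0 - 1| + |1 - 0| = 2.0
```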