Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks
We utilize a connection between compositional kernels and branching processes via Mehler’s formula to study deep neural networks. This new probabilistic insight provides us a novel perspective on the mathematical role of activation functions in compositional neural networks. We study the unscaled and rescaled limits of the compositional kernels and explore the different phases of the limiting behavior, as the compositional depth increases. We investigate the memorization capacity of the compositional kernels and neural networks by characterizing the interplay among compositional depth, sample size, dimensionality, and non-linearity of the activation. Explicit formulas on the eigenvalues of the compositional kernel are provided, which quantify the complexity of the corresponding reproducing kernel Hilbert space. On the methodological front, we propose a new random features algorithm, which compresses the compositional layers by devising a new activation function.
Kernel methods and deep neural networks are arguably two representative methods that have achieved state-of-the-art results in regression and classification tasks (Shankar et al., 2020). However, unlike kernel methods, where both the statistical and computational aspects of learning are reasonably well understood, there are still many theoretical puzzles around the generalization, computation, and representation aspects of deep neural networks (Zhang et al., 2017). One hopeful direction for resolving some of these puzzles is through the lens of kernels (Rahimi and Recht, 2008, 2009; Cho and Saul, 2009; Belkin et al., 2018b). Such a connection can be readily observed in a two-layer infinite-width network with random weights; see the pioneering work by Neal (1996a) and Rahimi and Recht (2008, 2009). For deep networks with hierarchical structures and randomly initialized weights, compositional kernels (Daniely et al., 2017b, a; Poole et al., 2016) have been proposed to rigorously characterize such a connection, with promising empirical performance (Cho and Saul, 2009). A list of simple algebraic operations on kernels (Stitson et al., 1999; Shankar et al., 2020) has been introduced to incorporate specific data structures that contain bag-of-features, such as images and time series.
In this paper, we continue to study deep neural networks and their dual compositional kernels, furthering the aforementioned mathematical connection, based on the foundational work of Rahimi and Recht (2008, 2009) and Daniely et al. (2017b, a). We focus on a standard multilayer perceptron architecture with Gaussian weights and study the role of the activation function and its effect on composition, data memorization, spectral properties, and algorithms, among others. Our main results are based on a simple yet elegant connection between compositional kernels and branching processes via Mehler’s formula (Lemma 3.1). This new connection, in turn, opens up the possibility of studying the mathematical role of activation functions in compositional deep neural networks, utilizing the probabilistic tools in branching processes (Theorem 3.4). Specifically, the new probabilistic insight allows us to answer the following questions:
Limits and phase transitions. Given an activation function, one can define the corresponding compositional kernel (Daniely et al., 2017b, a). How to classify the activation functions according to the limits of their dual compositional kernels, as the compositional depth increases? What properties of the activation functions govern the different phases of such limits? How do we properly rescale the compositional kernel such that there is a limit unique to the activation function? The above questions will be explored in Section 3.
Memorization capacity of compositions: tradeoffs. Deep neural networks and kernel machines can have a good out-of-sample performance even in the interpolation regime (Zhang et al., 2017; Belkin et al., 2018b), with perfect memorization of the training dataset. What is the memorization capacity of the compositional kernels? What are the tradeoffs among compositional depth, number of samples in the dataset, input dimensionality, and properties of the non-linear activation functions? Section 4 studies such interplay explicitly.
Spectral properties of compositional kernels. Spectral properties of the kernel (and the corresponding integral operator) affect the statistical rate of convergence, for kernel regressions (Caponnetto and Vito, 2006). What is the spectral decomposition of the compositional kernels? How do the eigenvalues of the compositional kernel depend on the activation function? Section 5 is devoted to answering the above questions.
New randomized algorithms. Given a compositional kernel with a finite depth associated with an activation, can we devise a new “compressed” activation and new randomized algorithms, such that the deep neural network (with random weights) with the original activation is equivalent to a shallow neural network with the “compressed” activation? Such algorithmic questions are closely related to the seminal Random Fourier Features (RFF) algorithm in Rahimi and Recht (2008, 2009), yet different. Section 6 investigates such algorithmic questions by considering compositional kernels and, more broadly, the inner-product kernels. Differences to the RFF are also discussed in detail therein.
Borrowing the insight from branching process, we start with studying the role of activation function in the compositional kernel, memorization capacity, and spectral properties, and conclude with the converse question of designing new activations and random nonlinear features algorithm based on kernels, thus contributing to a strengthened mathematical understanding of activation functions, compositional kernel classes, and deep neural networks.
1.1 Related Work
The connections between neural networks (with random weights) and kernel methods have been formalized by researchers using different mathematical languages. Instead of aiming to provide a complete list, here we only highlight a few that directly motivate our work. Neal (1996b, a) advocated using Gaussian processes to characterize the neural networks with random weights from a Bayesian viewpoint. For two-layer neural networks, such correspondence has been strengthened mathematically by the work of Rahimi and Recht (2008, 2009). By Bochner’s Theorem, Rahimi and Recht (2008) showed that any positive definite translation-invariant kernel could be realized by a two-layer neural network with a specific distribution on the weights, via trigonometric activations. Such insights also motivated the well-known random features algorithm, random kitchen sinks (Rahimi and Recht, 2009). One highlight of such an algorithm is that in the first layer of weights, sampling is employed to replace the optimization. Later, several works extended along the line, see, for instance, Kar and Karnick (2012) on the rotation-invariant kernels, Pennington et al. (2015) on the polynomial kernels, and Bach (2016) on kernels associated to ReLU-like activations (using spherical harmonics). Recently, Mei and Montanari (2019) investigated the precise asymptotics of the random features model using random matrix theory. For deep neural networks, compositional kernels are proposed to carry such connections further. Cho and Saul (2009) introduced the compositional kernel as the inner-product of compositional features. Daniely et al. (2017b, a) described the compositional kernel through the language of the computational skeleton, and introduced the duality between the activation function and compositional kernel. We refer the readers to Poole et al. (2016); Yang (2019); Shankar et al. (2020) for more information on the connection between kernels and neural networks.
One might argue that neural networks with static random weights may not fully explain the success of neural networks, noticing that the evolution of the weights during training is yet another critical component. On this front, Chizat and Bach (2018b); Mei et al. (2018); Sirignano and Spiliopoulos (2018); Rotskoff and Vanden-Eijnden (2018) employed the mean-field characterization to describe the distribution dynamics of the weights, for two-layer networks. Rotskoff and Vanden-Eijnden (2018); Dou and Liang (2020) studied the favorable properties of the dynamic kernel due to the evolution of the weight distribution. Nguyen and Pham (2020) carried the mean-field analysis to multi-layer networks rigorously. On a different thread (Jacot et al., 2019; Du et al., 2018; Chizat and Bach, 2018a; Woodworth et al., 2019), researchers showed that under specific scaling, training over-parametrized networks could be viewed as a kernel regression with perfect memorization of the training data, using a tangent kernel (Jacot et al., 2019) built from a linearization around its initialization. For a more recent resemblance between kernel learning and deep learning on the empirical side, we refer the readers to Belkin et al. (2018b).
Mehler’s formula. We will start with reviewing some essential background on the Hermite polynomials that is of direct relevance to our paper.
Definition 2.1 (Hermite polynomials).
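The probabilists’ Hermite polynomials $\mathrm{He}_k(x)$ for non-negative integers $k$ follow the recursive definition with $\mathrm{He}_0(x) = 1$ and
$$\mathrm{He}_{k+1}(x) = x\,\mathrm{He}_k(x) - \mathrm{He}_k'(x).$$
We define the normalized Hermite polynomials as
$$h_k(x) = \frac{1}{\sqrt{k!}}\,\mathrm{He}_k(x).$$
The set $\{h_k : k \ge 0\}$ forms an orthogonal basis of $L^2$ under the Gaussian measure $\mathcal{N}(0,1)$ as
$$\mathbb{E}_{g \sim \mathcal{N}(0,1)}\big[h_j(g)\,h_k(g)\big] = \mathbb{1}\{j = k\}.$$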
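The orthogonality relation in Definition 2.1 can be checked numerically. The sketch below is only an illustration: it writes $\mathrm{He}_k$ for the probabilists' Hermite polynomial and $h_k = \mathrm{He}_k / \sqrt{k!}$ for its normalized version (notation assumed here), and it uses NumPy's `hermite_e` module with a Gauss-Hermite quadrature rule to approximate expectations under $\mathcal{N}(0,1)$. The helper name `h` and the choice of 40 quadrature nodes are assumptions made for this example.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

def h(k, x):
    """Normalized probabilists' Hermite polynomial h_k(x) = He_k(x) / sqrt(k!)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0  # select He_k in NumPy's HermiteE basis
    return H.hermeval(x, coef) / math.sqrt(math.factorial(k))

# Gauss-Hermite rule for the weight exp(-x^2 / 2); dividing the weights by
# sqrt(2 * pi) turns the rule into an expectation under N(0, 1).
nodes, weights = H.hermegauss(40)
weights = weights / math.sqrt(2.0 * math.pi)

K = 6  # check h_0, ..., h_5
gram = np.array([[np.sum(weights * h(j, nodes) * h(k, nodes)) for k in range(K)]
                 for j in range(K)])
print(np.round(gram, 6))  # expected to be close to the identity matrix
```

The quadrature rule is exact for polynomials of degree up to 79, so the Gram matrix of $h_0, \ldots, h_5$ should match the identity up to floating-point error.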
Proposition 2.2 (Mehler’s formula).
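Mehler’s formula establishes the following equality on Hermite polynomials: for any $\rho \in (-1, 1)$ and $x, y \in \mathbb{R}$,
$$\frac{1}{\sqrt{1-\rho^2}} \exp\left(-\frac{\rho^2 (x^2 + y^2) - 2\rho x y}{2(1-\rho^2)}\right) = \sum_{k=0}^{\infty} \rho^k\, h_k(x)\, h_k(y),$$
where $h_k$ denotes the normalized Hermite polynomial of degree $k$.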
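As a sanity check on Proposition 2.2, one can compare a truncated Hermite series against the closed-form Gaussian expression. The sketch below assumes the standard probabilists' form of Mehler's formula, $\sum_k \rho^k h_k(x) h_k(y) = (1-\rho^2)^{-1/2} \exp\big(-(\rho^2(x^2+y^2) - 2\rho x y)/(2(1-\rho^2))\big)$. The function names and the truncation level of 50 terms are choices made for this illustration.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

def h(k, x):
    """Normalized probabilists' Hermite polynomial h_k(x) = He_k(x) / sqrt(k!)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return float(H.hermeval(x, coef)) / math.sqrt(math.factorial(k))

def mehler_closed_form(x, y, rho):
    """Left-hand side: ratio of the correlated bivariate Gaussian density to the product density."""
    exponent = -(rho**2 * (x**2 + y**2) - 2.0 * rho * x * y) / (2.0 * (1.0 - rho**2))
    return math.exp(exponent) / math.sqrt(1.0 - rho**2)

def mehler_series(x, y, rho, terms=50):
    """Right-hand side, truncated after `terms` Hermite terms."""
    return sum(rho**k * h(k, x) * h(k, y) for k in range(terms))

x, y, rho = 0.7, -0.3, 0.4
print(mehler_closed_form(x, y, rho), mehler_series(x, y, rho))
```

The series converges geometrically in $|\rho|$, so a few dozen terms are enough for moderate correlations; values of $|\rho|$ close to one would need a longer truncation.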
Branching process. Now, we will describe the branching process and the compositions of probability generating functions (PGF).
Definition 2.3 (Probability generating function).
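Given a random variable $Y$ on non-negative integers with the following probability distribution
$$\mathbb{P}(Y = k) = p_k, \quad k = 0, 1, 2, \ldots,$$
define the associated generating function as
$$G(s) = \mathbb{E}\big[s^{Y}\big] = \sum_{k \ge 0} p_k s^k.$$
It is clear that $G(1) = 1$, and $G$ is non-decreasing and convex on $[0, 1]$.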
Definition 2.4 (Galton-Watson branching process).
The Galton-Watson (GW) branching process is defined as a Markov chain $\{Z_t : t = 0, 1, 2, \ldots\}$, where $Z_t$ denotes the size of the $t$-th generation of the initial family. Let $Y$ be a random variable on non-negative integers describing the number of direct children, that is, it has $k$ children with probability $p_k$, with $\sum_{k \ge 0} p_k = 1$. Begin with one individual, $Z_0 = 1$, let it reproduce according to the distribution of $Y$, and then let each of these children reproduce independently with the same distribution as $Y$. The generation sizes are then defined by
$$Z_{t+1} = \sum_{i=1}^{Z_t} Y_i^{(t)},$$
where $Y_i^{(t)}$ denotes the number of children of the $i$-th individual in generation $t$.
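The definition above translates directly into a simulation. The following sketch is an illustration only: the offspring distribution `p = [0, 0.5, 0.5]`, the function name `simulate_gw`, and the number of simulated paths are all hypothetical choices. It simulates many independent Galton-Watson paths and compares the empirical mean of the generation size with the classical identity $\mathbb{E}[Z_t] = \mu^t$, where $\mu$ is the mean number of children.

```python
import numpy as np

def simulate_gw(p, generations, n_paths, rng):
    """Simulate the generation size after `generations` steps on n_paths independent paths."""
    p = np.asarray(p, dtype=float)
    z = np.ones(n_paths, dtype=np.int64)  # one initial individual on every path
    for _ in range(generations):
        # Draw one offspring count per living individual, across all paths.
        children = rng.choice(len(p), size=int(z.sum()), p=p)
        # Path index of every individual, so offspring counts can be summed per path.
        owner = np.repeat(np.arange(n_paths), z)
        z = np.bincount(owner, weights=children, minlength=n_paths).astype(np.int64)
    return z

rng = np.random.default_rng(0)
p = [0.0, 0.5, 0.5]                           # hypothetical offspring distribution
mu = sum(k * pk for k, pk in enumerate(p))    # mean number of children
z5 = simulate_gw(p, generations=5, n_paths=20_000, rng=rng)
print("empirical mean:", z5.mean(), "mu^5:", mu**5)
```

Because this offspring distribution puts no mass on zero children, every path survives; a distribution with a positive probability of zero children would also produce extinct paths, which the code handles since `np.repeat` and `np.bincount` accept zero counts.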
Multi-layer Perceptrons. We now define the fully-connected Multi-Layer Perceptrons (MLPs), which are among the standard architectures in deep neural networks.
Definition 2.6 (Activation function).
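Throughout the paper, we will only consider activation functions $\sigma$ that are $L^2$-integrable under the Gaussian measure $\mathcal{N}(0,1)$. The Hermite expansion of $\sigma$ is denoted as
$$\sigma(x) = \sum_{k \ge 0} a_k\, h_k(x),$$
where $h_k$ is the normalized Hermite polynomial of degree $k$ and $a_k$ are the Hermite coefficients.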
We will explicitly mention the following two assumptions when they are assumed. Otherwise, we will work with the activation in Definition 2.6.
Assumption 1 (Normalized activation function).
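Assume that the activation function $\sigma$ is normalized under the Gaussian measure $\mathcal{N}(0,1)$, in the following sense
$$\mathbb{E}_{g \sim \mathcal{N}(0,1)}\big[\sigma(g)^2\big] = 1,$$
with the Hermite coefficients $a_k$ satisfying
$$\sum_{k \ge 0} a_k^2 = 1.$$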
Assumption 2 (Centered activation function).
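Assume that the activation function $\sigma$ is centered under the Gaussian measure $\mathcal{N}(0,1)$, in the following sense
$$\mathbb{E}_{g \sim \mathcal{N}(0,1)}\big[\sigma(g)\big] = 0,$$
or equivalently the Hermite coefficient $a_0 = 0$.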
Definition 2.8 (Fully-connected MLPs with random weights).
Given an activation function , the number of layers , and the input vector , we define a multi-layer feed-forward neural network which inductively computes the output for each intermediate layer
Here denotes the identity matrix of size , and denotes the Kronecker product between two matrices. The activation is applied to each component of the vector input, and the weight matrix in the -th layer is sampled from a multivariate Gaussian distribution . For a vector and a scalar , the notation denotes the component-wise division of by scalar .
3 Compositional Kernel and Branching Process
3.1 Warm up: Duality
We start by describing a simple duality between the activation function in multi-layer perceptrons (that satisfies Assumption 1) and the probability generating function in branching processes. This simple yet essential duality allows us to study deep neural networks and to compare different activation functions by borrowing tools from branching processes. The duality in Lemma 3.1 can be readily established via Mehler’s formula (Proposition 2.2). To the best of our knowledge, this probabilistic interpretation (Lemma 3.3) is new to the literature.
Lemma 3.1 (Duality: activation and generating functions).
(sketch) We can rewrite Equation 3.2 in terms of the bivariate normal distribution with correlation as . Then, by expanding the density of using Mehler’s formula we would retrieve the Hermite coefficients . ∎
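The proof sketch can be mirrored numerically. The code below is a sketch under stated assumptions: it takes the duality in its standard form, namely that for a pair $(u, v)$ of standard Gaussians with correlation $\rho$ one has $\mathbb{E}[\sigma(u)\sigma(v)] = \sum_k a_k^2 \rho^k$, and it uses a hypothetical activation with Hermite coefficients $a_1^2 = a_2^2 = 1/2$. The expectation is computed with a two-dimensional Gauss-Hermite rule, which is exact here because the integrand is a polynomial.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

# Hypothetical Hermite coefficients a_k (squares sum to one, so Assumption 1 holds).
a = np.array([0.0, math.sqrt(0.5), math.sqrt(0.5)])

def sigma(x):
    """Activation sum_k a_k h_k(x), written in the He_k basis as sum_k (a_k / sqrt(k!)) He_k(x)."""
    coef = a / np.sqrt([math.factorial(k) for k in range(len(a))])
    return H.hermeval(x, coef)

def dual_pgf(rho):
    """Candidate dual generating function sum_k a_k^2 rho^k."""
    return sum(a_k**2 * rho**k for k, a_k in enumerate(a))

def expected_product(rho, n_nodes=30):
    """E[sigma(u) sigma(v)] for standard Gaussians (u, v) with correlation rho."""
    nodes, w = H.hermegauss(n_nodes)
    w = w / math.sqrt(2.0 * math.pi)  # normalize weights to an N(0, 1) expectation
    g1, g2 = np.meshgrid(nodes, nodes, indexing="ij")
    u = g1
    v = rho * g1 + math.sqrt(1.0 - rho**2) * g2  # correlated pair built from independent g1, g2
    return float(np.sum(np.outer(w, w) * sigma(u) * sigma(v)))

for rho in (-0.5, 0.0, 0.3, 0.9):
    print(rho, expected_product(rho), dual_pgf(rho))
```

The squared coefficients $a_k^2$ play the role of the offspring probabilities $p_k$, which is what makes the dual function a probability generating function.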
Based on the above Lemma 3.1, it is easy to define the compositional kernel associated with a fully-connected MLP with activation . The compositional kernel approach of studying deep neural networks has been proposed in Daniely et al. (2017b, a). To see why, let us recall the MLP with random weights defined in Definition 2.8. Then, for any fixed data input , the following holds almost surely for random weights
Motivated by the above equation, one can introduce the asymptotic compositional kernel defined by a deep neural network with activation , in the following way.
Definition 3.2 (Compositional kernel).
Let be an activation function that satisfies Assumption 1. Define the -layer compositional kernel to be the (infinite-width) compositional kernel associated with the fully-connected MLPs (from Definition 2.8), such that for any , we have
Since the kernel only depends on the inner-product , when there is no confusion, we denote for any
We will now point out the following connection between the compositional kernel for deep neural networks and the Galton-Watson branching process. Later, we will study the (rescaled) limits, phase transitions, memorization capacity, and spectral decomposition of such compositional kernels.
Lemma 3.3 (Duality: MLP and Branching Process).
Let be an activation function that satisfies Assumption 1, and be the dual generating function as in Lemma 3.1. Let be the Galton-Watson branching process with offspring distribution . Then for any , the compositional kernel has the following interpretation using the Galton-Watson branching process
(sketch) We prove by induction on using , where . ∎
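The induction can be illustrated with a small exact computation. The sketch below assumes that the $L$-layer compositional kernel equals the $L$-fold composition of the dual generating function $G$, and that the branching interpretation reads $\mathbb{E}[\rho^{Z_L}]$; both are the standard Galton-Watson identities, stated here as assumptions. The offspring distribution `p` and all function names are hypothetical. The law of $Z_L$ is computed directly from the definition of the process (a mixture of convolution powers), independently of the composition formula, and the two sides are then compared.

```python
import numpy as np

def G(s, p):
    """Generating function sum_k p_k s^k."""
    return sum(pk * s**k for k, pk in enumerate(p))

def kappa(rho, p, depth):
    """depth-fold composition G(G(...G(rho)...))."""
    value = rho
    for _ in range(depth):
        value = G(value, p)
    return value

def next_generation_pmf(pmf_z, p):
    """pmf of the next generation size: mixture over j of the j-fold convolution of p."""
    max_size = (len(pmf_z) - 1) * (len(p) - 1)
    out = np.zeros(max_size + 1)
    conv = np.array([1.0])  # law of an empty sum (zero parents give zero children)
    for prob_j in pmf_z:
        out[:len(conv)] += prob_j * conv
        conv = np.convolve(conv, p)  # law of the total offspring of one more parent
    return out

def generation_pmf(p, depth):
    pmf = np.array([0.0, 1.0])  # generation zero has exactly one individual
    for _ in range(depth):
        pmf = next_generation_pmf(pmf, p)
    return pmf

p = np.array([0.0, 0.5, 0.5])  # hypothetical offspring distribution
depth, rho = 3, 0.6
pmf = generation_pmf(p, depth)
branching_value = sum(prob * rho**k for k, prob in enumerate(pmf))
print(kappa(rho, p, depth), branching_value)
```

For finite-support offspring distributions the support of $Z_L$ grows geometrically with the depth, so this exact approach is only practical for small depths; Monte Carlo simulation is the natural alternative beyond that.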
The above duality can be extended to study other network architectures. For instance, in the residual network, the duality can be defined as follows: for , , and a centered activation function (Assumption 2), define the dual residual network PGF as
In Sections 3.2 and 4, and later in the experiments, we will elaborate on the costs and benefits of adding a linear component to the PGF in the corresponding compositional behavior, both in theory and numerics. The above simple calculation sheds light on why, in practice, residual networks can tolerate a larger compositional depth.
3.2 Limits and phase transitions
In this section, we will study the properties of the compositional kernel through the lens of branching processes, utilizing the duality established in the previous section. One important result in branching processes is the Kesten-Stigum Theorem (Kesten and Stigum, 1966), which can be employed to assert the rescaled limit and phase transition of the compositional kernel in Theorem 3.4.
Theorem 3.4 (Rescaled non-trivial limits and phase transitions: compositional kernels).
Let $\sigma$ be an activation function that satisfies Assumption 1, and let $\{a_k\}_{k \ge 0}$ be the corresponding Hermite coefficients, which satisfy $\sum_{k \ge 0} a_k^2 = 1$. Define two quantities that depend on $\sigma$,
. Then, if , we have
and, if , we have for all ;
and . Then, there exists with and a unique positive random variable (that depends on ) with a continuous density on . And, the non-trivial rescaled limit is
and . Then, for any positive number , we have
The above shows that when looking at the compositional kernel at the rescaled location for a fixed , the limit can be characterized by the moment generating function associated with a negative random variable individual to the activation function . The intuition behind such a rescaled location is that the limiting kernels witness an abrupt change of value at near for large (see Corollary 3.5 below). In the case , the proper rescaling in Theorem 3.4 stretches out the curve and zooms in on the narrow window of width local to to inspect the detailed behavior of the compositional kernel . Conceptually, the above theorem classifies the rescaled behavior of the compositional kernel into three phases according to and , functionals of the activation . One can also see that the unscaled limit for the compositional kernel has the following simple behavior.
Corollary 3.5 (Unscaled limits and phase transitions).
Under the same setting as in Theorem 3.4, the following results hold:
. Then, for all , if , we have
and for all if ;
. Then, there exists a unique with
Under additional assumptions on , such as no fixed points or non-negativity, we can extend the above results to
Under the additional Assumption 2 on , for non-linear activation , we have and . Therefore, the unscaled limit for non-linear compositional kernel is
We remark that fact (ii) in the above corollary is not new and has been observed by Daniely et al. (2017b). On the one hand, they use fact (ii) to shed light on why more than five consecutive fully connected layers are rare in practical architectures. On the other hand, the phase transition at corresponds to the edge-of-chaos and exponential expressiveness of deep neural networks studied in Poole et al. (2016), using physics language.
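The two kinds of limits discussed in this section can be observed by simply iterating a generating function. The sketch below is an illustration under assumptions: it uses a hypothetical centered, normalized, non-linear activation with $a_1^2 = a_2^2 = 1/2$, so that $G(s) = (s + s^2)/2$ and the mean offspring number is $\mu = \sum_k k\, a_k^2 = 1.5 > 1$. It also assumes that the rescaled location is $\rho = e^{-t/\mu^L}$, the Kesten-Stigum normalization, under which the depth-$L$ kernel equals $\mathbb{E}[\exp(-t Z_L / \mu^L)]$.

```python
import math

A_SQ = [0.0, 0.5, 0.5]  # hypothetical squared Hermite coefficients a_k^2

def G(s):
    """Dual generating function sum_k a_k^2 s^k."""
    return sum(w * s**k for k, w in enumerate(A_SQ))

def kappa(rho, depth):
    """depth-fold composition of G evaluated at rho."""
    value = rho
    for _ in range(depth):
        value = G(value)
    return value

mu = sum(k * w for k, w in enumerate(A_SQ))  # mean offspring number, 1.5 here

# Unscaled behaviour: the kernel collapses to 0 for rho in [0, 1) and stays 1 at rho = 1.
for rho in (0.5, 0.9, 0.99, 1.0):
    print(rho, [kappa(rho, L) for L in (1, 5, 20, 50)])

# Rescaled behaviour: evaluating at exp(-t / mu^L) stabilizes as L grows.
def rescaled(t, depth):
    return kappa(math.exp(-t / mu**depth), depth)

for L in (10, 20, 30, 40):
    print(L, rescaled(1.0, L))
```

The first loop shows the abrupt change near $\rho = 1$ for large depth, while the second shows that zooming in at scale $\mu^{-L}$ yields a non-degenerate value strictly between 0 and 1.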
4 Memorization Capacity of Compositions: Tradeoffs
One advantage of deep neural networks (DNN) is their exceptional data memorization capacity. Empirically, researchers observed that DNNs with large depth and width could memorize large datasets (Zhang et al., 2017), while maintaining good generalization properties. Pioneered by Belkin, a list of recent work contributes to a better understanding of the interpolation regime (Belkin et al., 2018a, 2019, c; Liang and Rakhlin, 2020; Hastie et al., 2019; Bartlett et al., 2019; Liang et al., 2020; Feldman, 2019; Nakkiran et al., 2020). With the insights gained via branching process, we will investigate the memorization capacity of the compositional kernels corresponding to MLPs, and study the interplay among the sample size, dimensionality, properties of the activation, and the compositional depth in a non-asymptotic way.
In this section, we denote as the dataset with each data point lying on the unit sphere. We denote by with as the maximum absolute value of the pairwise correlations. Specifically, we consider the following scaling regimes on the sample size relative to the dimensionality :
Small correlation: We consider the scaling regime with some small constant , where the dataset is generated from a probabilistic model with uniform distribution on the sphere.
Large correlation: We consider the scaling regime with some large constant , where the dataset forms a certain packing set of the sphere. The results also extend to the case of i.i.d. samples with uniform distribution on the sphere.
We name it “small correlation” since can be vanishingly small, in the special case . Similarly, we call it “large correlation” as can be arbitrarily close to , in the special case .
For the results in this section, we make the Assumptions 1 and 2 on the activation function , which are guaranteed by a simple rescaling and centering of any activation function. Let be the empirical kernel matrix for the compositional kernel at depth , with
For kernel ridge regression, the spectral properties of the empirical kernel matrix affect the memorization capacity: when has full rank, the regression function without explicit regularization can interpolate the training dataset. Specifically, the following spectral characterization of the empirical kernel matrix determines the rate of convergence, in terms of optimization, to the min-norm interpolated solution, and thus further determines memorization. The parameter in the following definition of -memorization can be viewed as a surrogate for the condition number of the empirical kernel matrix, as the condition number is bounded by .
Definition 4.1 (-memorization).
We call that a symmetric kernel matrix associated with the dataset has a -memorization property if the eigenvalues of are well behaved in the following sense
We denote by the minimum compositional depth such that the empirical kernel matrix has the -memorization property.
(-closeness) We say that a kernel matrix associated with the dataset satisfies the closeness property if
We denote by the minimum compositional depth such that the empirical kernel matrix satisfies the -closeness property.
We will assume throughout the rest of this section that
(Symmetry of PGF) for all .
4.1 Small correlation regime
To study the small correlation regime, we consider a typical instance of the dataset that is generated i.i.d. from a uniform distribution on the sphere.
Theorem 4.3 (Memorization capacity: small correlation regime).
Let be a dataset with random instances. Consider the regime with some absolute constant small enough that only depends on the activation . For any , with probability at least , the minimum compositional depth to obtain -memorization satisfies
The proof is due to the sharp upper and lower estimates obtained in the following lemma.
Lemma 4.4 (-closeness: small correlation regime).
Consider the same setting as in Theorem 4.3. For any , with probability at least , the minimum depth to obtain -closeness satisfies
In this small correlation regime, Theorem 4.3 states that in order for us to memorize a size -dataset , the depth for the compositional kernel scales with three quantities: the linear component in the activation function , a factor between and , and the logarithm of the regime scaling . Two remarks are in order. First, as the quantity becomes larger, we need a larger depth for the compositional kernel to achieve the same memorization. However, such an effect is mild since the regime scaling enters precisely logarithmically in the form of . In other words, for i.i.d. data on the unit sphere with , it is indeed easy for a shallow compositional kernel (with depth at most ) to memorize the data. In fact, consider the proportional high dimensional regime , then a very shallow network with is sufficient and necessary to memorize. Second, can be interpreted as the amount of non-linearity in the activation function. Therefore, when the non-linear component is larger, we will need fewer compositions for memorization. This explains the necessary large depth of an architecture such as ResNet (Equation 3.1), where a larger linear component is added in each layer to the corresponding kernel. A simple contrast should be mentioned for comparison: memorization is only possible for linear models when , whereas with composition and non-linearity, suffices for good memorization.
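The qualitative message of this discussion, that a few compositions already make the empirical kernel matrix well conditioned for i.i.d. data on the sphere, can be seen in a small experiment. The sketch below is illustrative only: the sample size `n = 100`, dimension `d = 50`, and the dual generating function $G(s) = (s + s^3)/2$ are hypothetical choices, and it reports the extreme eigenvalues and the condition number rather than the paper's exact memorization criterion. The function $G$ is chosen odd so that negative correlations are contracted in the same way as positive ones; whether this matches the symmetry assumption of this section exactly is not checked here.

```python
import numpy as np

def G(s):
    """Hypothetical dual PGF with a_1^2 = a_3^2 = 1/2 (centered, normalized, odd)."""
    return 0.5 * s + 0.5 * s**3

def kernel_matrix(X, depth):
    """Empirical compositional kernel matrix: G applied depth times to the Gram matrix."""
    K = np.clip(X @ X.T, -1.0, 1.0)
    for _ in range(depth):
        K = G(K)
    return K

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # i.i.d. uniform points on the unit sphere

cond_by_depth = {}
for depth in (1, 2, 4, 8):
    eig = np.linalg.eigvalsh(kernel_matrix(X, depth))
    cond_by_depth[depth] = eig.max() / eig.min()
    print(depth, round(eig.min(), 4), round(eig.max(), 4), round(cond_by_depth[depth], 4))
```

Each composition shrinks the off-diagonal entries roughly by the linear coefficient $a_1^2$ once they are small, which is consistent with the role that the linear component of the activation plays in the depth bound discussed above.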
4.2 Large correlation regime
To study the large correlation regime, we consider a natural instance of the dataset that falls under such a setting. The construction is based on the sphere packing.
Definition 4.5 (r-polarized packing).
For a compact subset , we say is a -polarized packing of if for all , we have and . We define the polarized packing number of as , that is
Theorem 4.6 (Memorization capacity: large correlation regime).
Let a size- dataset be a maximal polarized packing set of the sphere . Consider the regime with some absolute constant that only depends on the activation . For any , the minimum depth to obtain -closeness satisfies
Here is a constant that only depends on .
One can carry out an identical analysis in the large correlation regime for the i.i.d. random samples case with , as in the sphere packing case. Here the constant only depends on . Exactly the same bounds on hold with high probability.
The proof is due to the sharp upper and lower estimates in the following lemma.
Lemma 4.8 (-closeness: large correlation regime).
Consider the same setting as in Theorem 4.6. For any , the minimum depth to obtain -closeness satisfies
In this large correlation regime, to memorize a dataset, the behavior of the compositional depth is rather different from the small correlation regime. By Theorem 4.6, we have that the depth scales with the following quantities: a factor between and same as before, the regime scaling , and functionals of the activation , , and . A few remarks are in order. First, in this large correlation regime , memorization is indeed possible. However, the compositional depth needed increases precisely linearly as a function of the regime scaling . The above is in contrast to the small correlation regime, where the dependence on the regime scaling is logarithmic as . For hard dataset instances on the sphere with , one needs at least depth for the compositional kernel to achieve memorization. In fact, consider the fixed dimensional regime with , then a deeper network with depth is sufficient and necessary to memorize, which is much larger than the depth needed in the proportional high dimensional regime with . Second, for larger values of and we will need a smaller compositional depth, as the amount of non-linearity is larger. To sum up, with non-linearity and composition, even in the regime with a hard data instance, memorization is possible but with a deep neural network.
5 Spectral Decomposition of Compositional Kernels
In this section, we investigate the spectral decomposition of the compositional kernel function. We study the case where the base measure is a uniform distribution on the unit sphere, denoted by . Let be the surface area of . To state the results, we will need some background on the spherical harmonics. We consider the dimension , and will use to denote an integer.
Definition 5.1 (Spherical harmonics, Chapter 2.8.4 in Atkinson and Han (2012)).
Let be the space of -th degree spherical harmonics in dimension , and let be an orthonormal basis for , with
Then, the sets form an orthogonal basis for the space of -integrable functions on with the base measure , noted below
Moreover, the dimensionality are the coefficients of the generating function
Definition 5.2 (Legendre polynomial, Chapter 2.7 in Atkinson and Han (2012)).
Define the Legendre polynomial of degree with dimension to be
The following orthogonality holds
Recall the compositional kernel and the random variable , which denotes the size of the -th generation, as in Lemma 3.3. Then, we have the following theorem describing the spectral decomposition of the compositional kernel function and the associated integral operator.
Theorem 5.3 (Spectral decomposition of compositional kernel).
Consider any . Then, the following spectral decomposition holds for the compositional kernel with any fixed depth :
where the eigenfunctions form an orthogonal basis of , and the eigenvalues satisfy the following formula
The associated integral operator with respect to the kernel is defined as
From Theorem 5.3, we know that the eigenfunctions of the operator are the spherical harmonic basis , with identical eigenvalues such that
The above spectral decompositions are important because they help us study the generalization error (in the fixed dimensional setting) of regression methods with the compositional kernel . More specifically, understanding the eigenvalues of the compositional kernels means that we can employ the classical theory on reproducing kernel Hilbert space regression (Caponnetto and Vito, 2006) to quantify the generalization error when the dimension is fixed. In the case when the dimensionality grows with the sample size, several attempts have been made to understand the generalization properties of the inner-product kernels (Liang and Rakhlin, 2020; Liang et al., 2020) in the interpolation regime, which includes these compositional kernels as special cases.
6 Kernels to Activations: New Random Features Algorithms
Given any activation function (as in Definition 2.6), we can define a sequence of positive definite (PD) compositional kernels , with , whose spectral properties have been studied in the previous section, utilizing the duality established in Section 3.1. Such compositional kernels are non-linear functions on the inner-product (rotation-invariant), and we will call them the inner-product kernels (Kar and Karnick, 2012). In this section, we will investigate the converse question: given an arbitrary PD inner-product kernel, can we identify an activation function associated with it? We will provide a positive answer in this section. Direct algorithmic implications are new random features algorithms that are distinct from the well-known random Fourier features and random kitchen sinks algorithms studied in Rahimi and Recht (2008, 2009).
Define an inner-product kernel , with ,
where is a continuous function. Denote the expansion of under the Legendre polynomials (see Definition 5.2) as
To define the activations corresponding to an arbitrary PD inner-product kernel, we require the following theorem due to Schoenberg (1942).
Proposition 6.1 (Theorem 1 in Schoenberg (1942)).
Now, we are ready to state the activation function defined based on the inner-product kernel function .
Theorem 6.2 (Kernels to activations).
Consider any positive definite inner product kernel on associated with the continuous function . Assume without loss of generality that , and recall the definition of in (5.3). Due to Proposition 6.1, the Legendre coefficients , defined in Equation (6.2), of are non-negative.
One can define the following dual activation function
Then, the following statements hold:
is in the following sense
where is sampled from a uniform distribution on the sphere .
The above theorem naturally induces a new random features algorithm for kernel ridge regression, described below. Note that the kernel can be any compositional kernel, which is positive definite and of an inner-product form.
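To make the idea of a "compressed" activation concrete, here is a simplified analogue, not the construction of Theorem 6.2. Theorem 6.2 samples weights uniformly on the sphere and builds the activation from Legendre coefficients; the sketch below instead uses Gaussian weights and the Hermite duality of Section 3.1, which is a swapped-in technique. Assuming the depth-$L$ compositional kernel is the $L$-fold composition of a dual PGF $G$ with Taylor coefficients $b_k \ge 0$, the single activation $\sum_k \sqrt{b_k}\, h_k$ has that composition as its dual, so one random layer can stand in for $L$ composed layers. The offspring distribution `p`, the function names, and the feature count are hypothetical.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

def compose(p, q):
    """Coefficients (low to high degree) of the polynomial p(q(s)), via Horner's rule."""
    out = np.array([p[-1]], dtype=float)
    for c in p[-2::-1]:
        out = np.convolve(out, q)  # multiply the running polynomial by q
        out[0] += c                # add the next coefficient of p
    return out

def composed_pgf_coeffs(p, depth):
    """Taylor coefficients b_k of the depth-fold composition of G(s) = sum_k p_k s^k."""
    b = np.array([0.0, 1.0])       # the identity map s -> s
    for _ in range(depth):
        b = compose(p, b)          # b <- G(b(s))
    return b

def random_features(X, b, n_features, rng):
    """Features phi(x) = act(W x) / sqrt(m) with act = sum_k sqrt(b_k) h_k, rows of X on the unit sphere."""
    coef = np.sqrt(b) / np.sqrt([math.factorial(k) for k in range(len(b))])  # He_k-basis coefficients
    W = rng.standard_normal((n_features, X.shape[1]))  # Gaussian weights (an assumption of this sketch)
    return H.hermeval(X @ W.T, coef) / math.sqrt(n_features)

p = np.array([0.0, 0.8, 0.2])      # hypothetical a_k^2 of the original activation
b = composed_pgf_coeffs(p, depth=2)

x = np.array([1.0, 0.0, 0.0])
z = np.array([0.5, math.sqrt(0.75), 0.0])  # unit vector with inner product 0.5 with x
X = np.stack([x, z])

rng = np.random.default_rng(0)
Phi = random_features(X, b, n_features=200_000, rng=rng)
approx = Phi @ Phi.T                                   # random-feature kernel estimate
exact = np.polynomial.polynomial.polyval(X @ X.T, b)   # depth-2 compositional kernel
print(approx)
print(exact)
```

In a kernel ridge regression pipeline, the feature matrix `Phi` would replace the exact kernel matrix, and a linear ridge regression would be fit on top of it. The Monte Carlo error of the estimate decays at the usual rate of one over the square root of the number of features.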
It is clear from Theorem 5.3 that all compositional kernels are positive definite, though the converse statement is not true. A notable example is the kernel , which is a PD kernel but has negative Taylor coefficients, and thus cannot be a compositional kernel. For the special case of compositional kernels with depth and activation