
# Deep Gated Networks: A framework to understand training and generalisation in deep learning

Chandrashekar Lakshminarayanan and Amit Vikram Singh,
{chandru@iitpkd.ac.in,amitkvikram@gmail.com}
Both authors contributed equally.

## Abstract

Understanding the role of (stochastic) gradient descent (SGD) in the training and generalisation of deep neural networks (DNNs) with ReLU activation has been the object of study in the recent past. In this paper, we make use of deep gated networks (DGNs) as a framework to obtain insights about DNNs with ReLU activation. In a DGN, a single neuronal unit has two components, namely the pre-activation input (the inner product of the weights of the layer and the previous layer's outputs) and a gating value; the output of the unit is the product of the pre-activation input and the gating value. The standard DNN with ReLU activation is a special case of a DGN, wherein the gating value is based on whether or not the pre-activation input is positive. We theoretically analyse and experiment with several variants of DGNs, each variant suited to understanding a particular aspect of either training or generalisation in DNNs with ReLU activation. Our theory throws light on two questions: i) why does increasing depth up to a point help training, and ii) why does increasing depth beyond a point hurt training? We also present experimental evidence that gate adaptation, i.e., the change of gating values over the course of training, is key for generalisation.


## 1 Introduction

Given a dataset $\{(x_s, y_s)\}_{s=1}^{n}$ and a deep neural network (DNN), parameterised by $\Theta \in \mathbb{R}^{d_{net}}$, whose prediction on an input example $x_s$ is $\hat{y}_{\Theta}(x_s)$, in this paper we are interested in the stochastic gradient descent (SGD) procedure to minimise the squared loss given by $L(\Theta) = \frac{1}{2}\sum_{s=1}^{n}\left(\hat{y}_{\Theta}(x_s) - y_s\right)^2$. As with some of the recent works by Du et al. (2018); Du and Hu (2019) to understand SGD in deep networks, we adopt a trajectory based analysis, wherein one looks at the (error) trajectory, i.e., the dynamics of the error defined as $e_t = \hat{y}_t - y \in \mathbb{R}^n$. The error dynamics is given by:

$$e_{t+1} = e_t - \alpha_t K_t e_t, \qquad (1)$$

where $\alpha_t > 0$ is a small enough step-size, $K_t = \Psi_t^\top \Psi_t$ is an $n \times n$ Gram matrix, and $\Psi_t \in \mathbb{R}^{d_{net} \times n}$ is a neural tangent feature (NTF) matrix whose entries are given by $\psi_t(m, s) = \frac{\partial \hat{y}_t(x_s)}{\partial \theta(m)}$. In particular, we obtain several new insights related to the following:
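The linear error recursion in (1) is easy to simulate. The following sketch (illustrative only; all names and sizes are our own, not from the paper's experiments) fixes a positive-definite Gram matrix $K$ and a step-size below $1/\lambda_{\max}$, and checks that the error norm shrinks monotonically:

```python
import numpy as np

# Toy illustration of the error dynamics e_{t+1} = e_t - a_t K_t e_t (eq. 1),
# with a fixed positive-definite Gram matrix K and a constant step-size.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)               # positive definite Gram matrix
alpha = 1.0 / np.linalg.eigvalsh(K)[-1]   # step-size <= 1 / lambda_max

e = rng.standard_normal(n)                # initial error e_0
norms = [np.linalg.norm(e)]
for _ in range(50):
    e = e - alpha * K @ e                 # one step of the linear recursion
    norms.append(np.linalg.norm(e))

# The error norm decreases monotonically, at a rate set by lambda_min.
assert all(norms[t + 1] <= norms[t] for t in range(50))
```

With this step-size, the slowest-decaying error component is the one along the smallest eigendirection of $K$, which is why the spectrum of $K$ controls the convergence rate throughout the paper.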

The Depth Phenomena: It is well known in practice that increasing the depth of DNNs up to a point improves their training performance, while increasing the depth beyond a point degrades training. We look at the spectral properties of the Gram matrix $K_0$ at randomised (symmetric Bernoulli) initialisation, and use them to reason about the depth phenomena.

Gate Adaptation: the dynamics of the gates in a deep network and their role in generalisation performance.

Conceptual Novelties: In this paper, we bring in two important conceptual novelties. The first is the framework of deep gated networks (DGNs), previously studied by Fiat et al. (2019), wherein the gating is decoupled from the pre-activation input. The second is what we call the path-view. We describe these two concepts first, and then explain the gains/insights we obtain from them.

Deep Gated Networks (DGNs): We consider networks of depth $d$ and width $w$ (the same across layers). At time $t$, the output of a DGN for an input $x \in \mathbb{R}^{d_{in}}$ can be specified by its gating values and network weights as shown in Table 1.

The network weights $\Theta_t$, together with the collection of gating values at time $t$ given by $\mathcal{G}_t = \{G_{x_s,t}(l,i)\}$ (where $G_{x_s,t}(l,i)$ is the gating of node $i$ in layer $l$ for input $x_s$), recover the outputs for all the inputs in the dataset using the definition in Table 1.

Note that the standard DNN with ReLU activation is a special DGN, wherein the gating of node $i$ in layer $l$ is given by $G_{x,t}(l,i) = \mathbb{1}\{q_{x,t}(l,i) > 0\}$, where $q_{x,t}(l,i)$ is the pre-activation input of that node.

Path-View: A path starts from an input node of the given network, passes through exactly one weight in each of the $d$ layers, and ends at the output node. Using the paths, we can express the output as the summation of individual path contributions. The path-view has two important gains: i) since it avoids the usual layer-after-layer approach, we are able to obtain an explicit expression for information propagation that separates the 'signal' (the input $x$) from the 'wire' (the connections of the weights in the network); ii) the role of the sub-networks becomes evident. Let $x \in \mathbb{R}^{d_{in} \times n}$ denote the data matrix, and let $\Theta_t(l, j, i)$ denote the weight connecting node $j$ of layer $l-1$ to node $i$ of layer $l$. Formally,

A path can be defined as a tuple $p = (p(0), p(1), \ldots, p(d))$ drawn from the cross product of index sets $[d_{in}] \times [w]^{d-1} \times \{1\}$, where $p(0) \in [d_{in}]$ is an input node, $p(l) \in [w]$ is a node in layer $l$ for $l = 1, \ldots, d-1$, and $p(d)$ is the output node. We assume that the paths can be enumerated as $p = 1, \ldots, P$, where $P = d_{in} w^{d-1}$ is the total number of paths. Thus, throughout the paper, we use the symbol $p$ to denote a path as well as its index in the enumeration.

The strength of a path $p$ at time $t$ is defined as $w_t(p) = \prod_{l=1}^{d} \Theta_t(l, p(l-1), p(l))$.

The activation level of a path $p$ for an input $x_s$ at time $t$ is defined as $A_t(x_s, p) = \prod_{l=1}^{d-1} G_{x_s,t}(l, p(l))$.

Conceptual Gain I (Feature Decomposition): Define $\phi_{x_s, \mathcal{G}_t}(p) = x_s(p(0))\, A_t(x_s, p)$, i.e., each coordinate of the hidden feature is the input coordinate at the start of path $p$ scaled by the activation level of that path. The output is then given by:

$$\hat{y}_t(x_s) = \phi_{x_s, \mathcal{G}_t}^\top w_t, \qquad (2)$$

where $w_t = (w_t(p),\, p = 1, \ldots, P)$ is the vector of path strengths. In this paper, we interpret $\phi_{x_s, \mathcal{G}_t}$ as the hidden feature and $w_t$, the strength of the paths, as the weight vector.
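The decomposition in (2) can be checked directly on a tiny network. The sketch below (illustrative names and sizes, not the paper's code) builds a small DGN with arbitrary $0/1$ gates, computes the output once layer by layer and once as a sum of per-path contributions $x(p(0)) \cdot A(x,p) \cdot w(p)$, and verifies the two agree:

```python
import itertools
import numpy as np

# Minimal sketch: DGN output equals the sum of path contributions
#   x(p(0)) * activation(x, p) * strength(p),
# where strength = product of weights on the path, activation = product of gates.
rng = np.random.default_rng(1)
d_in, w, depth = 3, 4, 3                  # depth counts weight layers
Ws = [rng.standard_normal((d_in, w))] + \
     [rng.standard_normal((w, w)) for _ in range(depth - 2)] + \
     [rng.standard_normal((w, 1))]
Gs = [rng.integers(0, 2, size=w) for _ in range(depth - 1)]  # 0/1 gates per hidden layer
x = rng.standard_normal(d_in)

# Layer-by-layer forward pass: each unit's output = gate * pre-activation.
h = x
for W, G in zip(Ws[:-1], Gs):
    h = (h @ W) * G
y_forward = float(h @ Ws[-1])

# Path view: enumerate every (input node, hidden node per layer) tuple.
y_paths = 0.0
for p in itertools.product(range(d_in), *[range(w)] * (depth - 1)):
    strength = Ws[0][p[0], p[1]]
    for l in range(1, depth - 1):
        strength *= Ws[l][p[l], p[l + 1]]
    strength *= Ws[-1][p[-1], 0]
    activation = np.prod([Gs[l][p[l + 1]] for l in range(depth - 1)])
    y_paths += x[p[0]] * activation * strength

assert np.isclose(y_forward, y_paths)
```

The equality is just distributivity of the layer-wise products over sums, but making it explicit is what lets the 'signal' $x(p(0))$ be separated from the 'wire' (weights and gates) in the analysis.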

A hard dataset for DNNs: The ability of DNNs to fit data has been demonstrated in the past Zhang et al. (2016), i.e., they can fit even random labels, and random pixels of standard datasets such as MNIST. However, a standard DNN with ReLU gates and no bias parameters cannot memorise a dataset with two points of the form $(x, y_1)$ and $(2x, y_2)$ (for some $x \in \mathbb{R}^{d_{in}}$) unless $y_2 = 2y_1$. The reason is that the gating values are the same for both $x$ and $2x$ (for that matter, any positive scaling of $x$), and hence $\hat{y}_{\Theta}(2x) = 2\hat{y}_{\Theta}(x)$; thus it is not possible to fit arbitrary values for the two labels.
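The positive-homogeneity argument above is easy to verify numerically: a bias-free ReLU network satisfies $\mathrm{relu}(cq) = c\,\mathrm{relu}(q)$ for $c > 0$, so the gating pattern is unchanged under positive input scaling and the output scales linearly. A quick sketch (arbitrary sizes, illustrative only):

```python
import numpy as np

# A bias-free ReLU network is positively homogeneous: scaling the input by
# c > 0 leaves every gate unchanged and scales the output by c.
rng = np.random.default_rng(2)
Ws = [rng.standard_normal((4, 8)), rng.standard_normal((8, 8)),
      rng.standard_normal((8, 1))]

def relu_net(x):
    h = x
    gates = []
    for W in Ws[:-1]:
        q = h @ W
        gates.append(q > 0)          # the gating pattern
        h = np.maximum(q, 0.0)       # no bias anywhere
    return float(h @ Ws[-1]), gates

x = rng.standard_normal(4)
y1, g1 = relu_net(x)
y2, g2 = relu_net(2.0 * x)

assert all(np.array_equal(a, b) for a, b in zip(g1, g2))  # same active sub-network
assert np.isclose(y2, 2.0 * y1)      # hence y(2x) = 2 y(x)
```

So no parameter setting can fit labels that violate $y_2 = 2 y_1$ for the pair $(x, 2x)$.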

Conceptual Gain II (Similarity Metric): In DGNs, the similarity of two different inputs depends on the overlap of the sub-networks that are simultaneously active for both inputs. Let $\Phi_t$ be the hidden feature matrix obtained by stacking the features $\phi_{x_s, \mathcal{G}_t}$. The Gram matrix of hidden features is then given by $H_t = \Phi_t^\top \Phi_t$, whose $(s,s')$ entry depends on $\lambda_t(s,s')$, the total number of paths that start at any given input node (due to symmetry this number does not vary with the node) and are active for both input examples $s, s'$. Each input example has an active sub-network, and the similarity (inner product) of two different inputs depends on the similarity of the corresponding active sub-networks, in particular on the total number of paths that are simultaneously active for the two inputs.

Conceptual Gain III (Deep Information Propagation): An explicit expression for the Gram matrix of the form $K_t(s,s') = \sum_{i=1}^{d_{in}} x(i,s)\, x(i,s')\, \kappa_t(s,s',i)$. Here, $\kappa_t(s,s',i)$ is a summation of the inner products of the path features (see Section 2). Thus the input signals stay as they are in the calculations, in an algebraic sense, and are separated out from the wires, i.e., the network, whose effect is captured in $\kappa_t$.

A Decoupling assumption: We assume that the gating values and the weights are statistically independent, and that the weights ($d_{net}$ of them) are sampled iid from $\{-\sigma, +\sigma\}$ with probability $\frac{1}{2}$ each. Under these assumptions we obtain the following key results and insights:

Depth Phenomena I: Why does increasing depth till a point helps training?

Because increasing depth causes whitening of the inputs. In particular, we show that $\mathbb{E}[K_0] = d\,\sigma^{2(d-1)}\left(x^\top x \odot \lambda_0\right)$, where $\odot$ is the Hadamard product. The ratio $\frac{\lambda_0(s,s')}{\lambda_0(s,s)}$ is the fractional overlap of active sub-networks; say at each layer the overlap of active gates is $\mu < 1$, then for a depth $d$ the fractional overlap decays at an exponential rate, i.e., as $\mu^{d-1}$, leading to whitening.

Depth Phenomena II: Why does increasing depth beyond a point hurts training?

Because the variance of $K_0$ grows with depth (see the bound in Theorem 2.3). Thus, for large width $w$, $K_0$ concentrates around its expected value, but for a fixed width, increasing the depth makes the entries of $K_0$ deviate from $\mathbb{E}[K_0]$, thus degrading the spectrum of $K_0$.

Key Takeaway: To the best of our knowledge, we are the first to present a theory that explains the depth phenomena. While ReLU gates do not satisfy the decoupling assumption, we hope to relax this assumption in the future and extend the results for decoupled gating to ReLU activation as well.

Conceptual Gain IV (Twin Gradient Flow): The NTF matrix can be decomposed as the sum of two components, from which it is evident that the gradient has two parts, namely i) strength adaptation: keeping the sub-networks (at time $t$) corresponding to each input example fixed, this component learns the strengths of the paths in those sub-networks, and ii) gate adaptation: this component learns the sub-networks themselves. Ours is the first work to analytically capture the two gradient components.

Conceptual Gain V (Fixing the ReLU artefact): In standard ReLU networks the gates are binary ($0/1$) and hence their derivative is $0$; thus the role of the gates has been unaccounted for in the current literature. By parameterising the gates and introducing a soft-ReLU gating (with values in $(0,1)$), we can show that the Gram matrix can be decomposed into $K_t = K_t^w + K_t^a$, where $K_t^w$ is the Gram matrix of strength adaptation and $K_t^a$ is the Gram matrix corresponding to activation adaptation. Ours is the first work to point out that the Gram matrix has a gate adaptation component.

Conceptual Gain VI (Sensitivity sub-network): Our expression for $K_t^a$ involves what we call the sensitivity sub-network, formed by gates which take intermediate values, i.e., are close to neither $0$ nor $1$. We contend that by controlling such sensitive gates, the DGN is able to learn the features over the course of training.

Evidence I (Generalisation needs gate adaptation): We experiment with two datasets, namely standard CIFAR-10 (classification) and 'Binary'-MNIST, which has two classes with labels in $\{-1, +1\}$ (squared loss). We observe that whenever the gates adapt, test performance gets better.

Evidence II (Lottery is in the gates): We obtain good test accuracy just by tuning the gates of a parameterised DGN with soft gates. We also observe that copying the gates from a learnt network and training the weights from scratch also gives good generalisation performance. This gives a new interpretation of the lottery ticket hypothesis Frankle and Carbin (2018), i.e., the real lottery is in the gates.

Lessons Learnt: Rethinking generalisation needs to involve a study of how gates adapt. Taking a cue from Arora et al. (2019), we look at the quantity $y^\top \hat{H}_t^{-1} y$ (where $\hat{H}_t$ is the normalised Gram matrix of the hidden features), and observe that its value with adaptive/learned gates is always upper bounded by its value with non-adaptive/non-learned gates.

Organisation: The rest of the paper has Section 2, where, we consider DGNs with fixed or frozen gates, and Section 3, where, we look at DGNs with adaptable gates. The idea is to obtain insights by progressing stage by stage from easy to difficult variants of DGNs, ordered naturally according to the way in which paths/sub-networks are formed. The proofs of the results are in the supplementary material.

## 2 Deep Information Propagation in DGN

In this section, we study deep information propagation (DIP) in DGNs when the gates are frozen, i.e., $\mathcal{G}_t = \mathcal{G}_0$ for all $t \geq 0$, and our results are applicable to the following:

(i) Deep Linear Networks (DLN): where all the gating values are $1$. Here, since all the paths are always on, we do not have any control over how the paths are formed.

(ii) Fixed random gating (DGN-FRG): Note that $\mathcal{G}_0$ contains $n \times (d-1) \times w$ gating values, corresponding to the $n$ input examples, $d-1$ layer outputs and $w$ nodes in each layer. In DGN-FRG, each gating value is chosen iid to be $1$ or $0$ with probabilities $\mu$ and $1-\mu$ respectively. Here, we have full control over the gating/activation level of the paths. These networks are directed solely towards understanding questions related to optimisation; generalisation is irrelevant for DGN-FRG networks because there is no natural means to extend the random gate generation to unseen inputs.

(iii) Gated linear unit (GaLU) networks: wherein the gating values are generated by a separate network, which is a DNN with ReLU gates. Unlike DGN-FRG, GaLU networks can generalise.

We first express the Gram matrix $K_t$ in the language of paths. We then state our assumptions (Assumptions 1 and 2), followed by our main result on deep information propagation in DGNs (Theorem 2.3). We then demonstrate how our main result applies to DGN-FRG and GaLU networks.

When the gates are frozen, the weight update affects only the path strengths $w_t(p)$. This is captured as follows:

Sensitivity of the path strength: Let $p$ be a path, and $w_t(p)$ its strength at time $t$. Further, let $\theta(m)$ be a weight and, without loss of generality, let $\theta(m)$ belong to layer $l'(m)$. At time $t$, the derivative of the strength of path $p$ with respect to $\theta(m)$, denoted by $\varphi_{t,p}(m)$, is given by:

$$\varphi_{t,p}(m) = \prod_{\substack{l=1 \\ l \neq l'(m)}}^{d} \Theta_t(l, p(l-1), p(l)), \;\; \forall p \rightsquigarrow \theta(m), \qquad \varphi_{t,p}(m) = 0, \;\; \forall p \not\rightsquigarrow \theta(m) \qquad (3)$$

where $p \rightsquigarrow \theta(m)$ denotes that path $p$ passes through the weight $\theta(m)$. The sensitivity of a path $p$ with respect to all the weights is then given by the vector $\varphi_{t,p} = (\varphi_{t,p}(m),\, m = 1, \ldots, d_{net})$.

###### Lemma 2.1 (Signal vs Wire Decomposition).

Let $\kappa_t(s,s',i) = \sum_{p, p' \,:\, p(0) = p'(0) = i} A_t(x_s, p)\, A_t(x_{s'}, p')\, \langle \varphi_{t,p}, \varphi_{t,p'} \rangle$. The Gram matrix is then given by

$$K_t(s,s') = \sum_{i=1}^{d_{in}} x(i,s)\, x(i,s')\, \kappa_t(s,s',i) \qquad (4)$$

In Lemma 2.1, $\kappa_t(s,s',i)$ is the amount of overall interaction within the DGN in the $i^{th}$ dimension of the inputs $x_s, x_{s'}$. Note that, thanks to the path-view, the 'signal', i.e., $x(i,s)\,x(i,s')$, gets separated from the 'wire', i.e., the connections in the DGN, which are captured in the $\kappa_t$ term. Further, the algebraic expression for $K_t$ in (4) applies to all DGNs (including the standard DNN with ReLU activations).

The next step is to simplify $\kappa_t$, which contains the joint path activity, given by $A_t(x_s,p)\,A_t(x_{s'},p')$, and the inter-path interaction, given by $\langle \varphi_{t,p}, \varphi_{t,p'} \rangle$. Towards this end, we state and discuss Assumptions 1 and 2.

###### Assumption 1.

At initialisation, $\theta_0(m) \stackrel{iid}{\sim} \mathrm{Unif}\{-\sigma, +\sigma\}$ over the set of all weights $m = 1, \ldots, d_{net}$.

This implies a decoupling of the paths from one another: the inner product of the sensitivities of two different paths is $0$ in expectation. This is captured in Lemma 2.2 below.

###### Lemma 2.2.

Under Assumption 1, for paths $p \neq p'$, at initialisation we have (i) $\mathbb{E}\left[\langle \varphi_{0,p}, \varphi_{0,p'} \rangle\right] = 0$, (ii) $\langle \varphi_{0,p}, \varphi_{0,p} \rangle = d\,\sigma^{2(d-1)}$.

###### Assumption 2.

The gating values $\mathcal{G}_0$ are statistically independent of the weights $\Theta_0$.

Decoupling of gates and paths: In a standard DNN with ReLU activations, the gating values and the path strengths are statistically dependent because, conditioned on the fact that a given ReLU activation is on, the incoming weights of that activation cannot be simultaneously all negative. Assumption 2 makes the path strengths statistically independent of the path activity.

###### Theorem 2.3 (DIP in DGN).

Under Assumptions 1 and 2, at initialisation it follows that

$$\mathbb{E}[K_0] = d\,\sigma^{2(d-1)}\left(x^\top x \odot \lambda_0\right), \qquad \mathrm{Var}[K_0] \leq O\!\left(d_{in}^2\, \sigma^{4(d-1)} \max\{d^2 w^{2(d-2)+1},\, d^3 w^{2(d-2)}\}\right)$$

where $\lambda_0(s,s')$ is the total number of paths starting from any fixed input node that are active for both inputs $x_s$ and $x_{s'}$.

Choice of $\sigma$: Note that, in the case of gates taking values in $\{0,1\}$, $\lambda_0(s,s')$ is a measure of the overlap of sub-networks that start at any given input node, end at the output node, and are active for both input examples $s, s'$. Loosely speaking, say in each layer a fraction $\mu$ of the gates are on; then $\lambda_0(s,s)$ is about $(\mu w)^{d-1}$. Thus, to ensure that the signal does not blow up in a DGN, we need $\sigma^{2(d-1)}(\mu w)^{d-1} = O(1)$, which means $\sigma = \sqrt{\frac{1}{\mu w}}$.

In the expression for $\mathbb{E}[K_0]$, note that the input Gram matrix $x^\top x$ is separate from $\lambda_0$, which is a Gram matrix of the active sub-networks. Thanks to the path-view, we do not lose track of the input information, which is otherwise bound to happen if we were to take a layer-by-layer view of DIP in DGNs.

Fixed Random Gating (DGN-FRG) involves sampling the gates iid from $\mathrm{Ber}(\mu)$, and hence there is a random sub-network which is active for each input. Under FRG, we can obtain a closed-form expression for the $\lambda_0$ term as below:

###### Lemma 2.4.

Under Assumptions 1 and 2, and gates sampled iid from $\mathrm{Ber}(\mu)$, we have:

(i) $\mathbb{E}[\lambda_0(s,s)] = (\mu w)^{d-1}$,

(ii) $\mathbb{E}[\lambda_0(s,s')] = (\mu^2 w)^{d-1}$, for $s \neq s'$.
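The two overlap counts can be sanity-checked by simulation. Assuming gates drawn iid $\mathrm{Ber}(\mu)$ independently for each input (the DGN-FRG setting), the number of active paths from a fixed input node is the product, over hidden layers, of the number of active gates in that layer; the sketch below (illustrative, not the paper's code) checks the two expectations by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of the overlap counts: with gates iid Ber(mu) per input,
# E[# paths active for one input]   = (mu * w)^(d-1),
# E[# paths active for two inputs]  = (mu^2 * w)^(d-1),
# so the fractional overlap decays like mu^(d-1) with depth.
rng = np.random.default_rng(3)
mu, w, hidden_layers, trials = 0.5, 20, 3, 20000

single, joint = [], []
for _ in range(trials):
    g_s = rng.random((hidden_layers, w)) < mu       # gates for input s
    g_t = rng.random((hidden_layers, w)) < mu       # gates for input s'
    single.append(np.prod(g_s.sum(axis=1)))         # paths active for s
    joint.append(np.prod((g_s & g_t).sum(axis=1)))  # paths active for both

assert np.isclose(np.mean(single), (mu * w) ** hidden_layers, rtol=0.1)
assert np.isclose(np.mean(joint), (mu ** 2 * w) ** hidden_layers, rtol=0.1)
```

The path count factorises across layers because every pair of nodes in adjacent layers is connected by a weight, so each layer contributes its number of active gates as an independent factor.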

DIP in DGN-FRG: For $\sigma = \sqrt{\frac{1}{\mu w}}$, we have:

$$\frac{\mathbb{E}[K_0]}{d} = \begin{bmatrix} \ddots & \vdots & \vdots & \\ \cdots & \langle x_s, x_s \rangle & \langle x_s, x_{s'} \rangle \mu^{d-1} & \cdots \\ \cdots & \langle x_{s'}, x_s \rangle \mu^{d-1} & \langle x_{s'}, x_{s'} \rangle & \cdots \\ & \vdots & \vdots & \ddots \end{bmatrix}$$

Experiment 1: Consider the dataset $\{(x_s, y_s)\}_{s=1}^{n}$ in which all the inputs are identical and of unit norm. The input Gram matrix $x^\top x$ then has all entries equal to $1$ and its rank is equal to 1. Since all the inputs are identical, this is the worst possible case for optimisation.

Why does increasing depth up to a point help? In the case of Experiment 1, we have:

$$\frac{\mathbb{E}[K_0]}{d} = \begin{bmatrix} 1 & \mu^{d-1} & \cdots & \mu^{d-1} \\ \mu^{d-1} & 1 & \cdots & \mu^{d-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mu^{d-1} & \mu^{d-1} & \cdots & 1 \end{bmatrix} \qquad (5)$$

i.e., all the diagonal entries are $1$ and all the non-diagonal entries are $\mu^{d-1}$. Now, let $\rho_1 \geq \ldots \geq \rho_n$ be the eigenvalues of $\frac{\mathbb{E}[K_0]}{d}$, with $\rho_{\max}$ and $\rho_{\min}$ the largest and smallest eigenvalues. From the structure of (5), one can easily show that $\rho_{\max} = 1 + (n-1)\mu^{d-1}$, corresponding to the eigenvector with all entries equal to $1$, and $\rho_{\min} = 1 - \mu^{d-1}$, which repeats $n-1$ times, corresponding to the eigenvectors $e_1 - e_k$ for $k = 2, \ldots, n$ (where $e_k$ is the $k^{th}$ standard basis vector).
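The claimed spectrum of a matrix with ones on the diagonal and $\mu^{d-1}$ off the diagonal can be verified directly (illustrative sketch; $n$, $\mu$, $d$ are arbitrary):

```python
import numpy as np

# Spectrum of the idealised Gram matrix in (5): ones on the diagonal,
# mu^(d-1) off the diagonal. Largest eigenvalue 1 + (n-1)*mu^(d-1)
# (all-ones eigenvector); 1 - mu^(d-1) repeated n-1 times.
n, mu, d = 8, 0.5, 5
r = mu ** (d - 1)
M = (1 - r) * np.eye(n) + r * np.ones((n, n))

eigs = np.sort(np.linalg.eigvalsh(M))
assert np.isclose(eigs[-1], 1 + (n - 1) * r)
assert np.allclose(eigs[:-1], 1 - r)
# Increasing depth d shrinks r, pushing the spectrum toward the identity's:
# this is the whitening effect that helps optimisation.
```

Writing $M = (1-r)I + r\,\mathbf{1}\mathbf{1}^\top$ makes the eigenstructure obvious: $\mathbf{1}$ picks up $(1-r) + rn$, and everything orthogonal to $\mathbf{1}$ picks up $1-r$.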

Why does increasing depth beyond a point hurt? In Theorem 2.3, note that for a fixed width $w$, as the depth $d$ increases, the variance of $K_0$ increases, and hence the entries of $K_0$ deviate from their expected values. Thus the structure of the Gram matrix degrades from (5), leading to smaller eigenvalues.

Numerical Evidence (Gram Matrix): We fix arbitrary diagonal and non-diagonal entries of $K_0$, and look at their values averaged over multiple runs (see Figure 2). The actual values, shown in bold, indeed follow the ideal values shown in dotted lines, as per (5).

Numerical Evidence (Spectrum): Next, we look at the cumulative eigenvalue distribution (e.c.d.f.), obtained by first sorting the eigenvalues in ascending order and then looking at their cumulative sum. The ideal behaviour (middle plot of Figure 1), as predicted by the theory, is that the e.c.d.f. should increase at a linear rate over the first $n-1$ indices (which correspond to the $n-1$ repetitions of the smallest eigenvalue), followed by a disproportionately large jump at the last index (which corresponds to the largest eigenvalue). In Figure 1, we plot the e.c.d.f. for various depths and two different widths. It can be seen that as $w$ increases, the difference between the ideal and actual e.c.d.f. curves becomes smaller.

Numerical Evidence (Role of Depth): In order to compare how the rate of convergence varies with depth in the DGN-FRG network, we set the step-size $\alpha_t = \frac{\alpha}{\rho_{\max}(K_t)}$ and fit the data described in Experiment 1, using the vanilla SGD optimiser. Note that it follows from (1) that the convergence rate is determined by a linear recursion, and this choice of step-size can be seen to be equivalent to having a constant step-size $\alpha$ while dividing the Gram matrix by its maximum eigenvalue. Thus, after this rescaling, the maximum eigenvalue is $1$ uniformly across all instances, and the convergence should be limited by the smaller eigenvalues. We also look at the convergence of the relative error $\frac{\|e_t\|^2}{\|e_0\|^2}$, and we observe that the convergence rate gets better with depth, as predicted by the theory (Figure 3).

GaLU Networks: Here, the gating values are obtained from a DNN with ReLU activation whose weights are frozen. Due to the inherent symmetry of the weights (Assumption 1), we can expect roughly half of the activations to be on, so the diagonal terms behave as in DGN-FRG with $\mu = \frac{1}{2}$. For the non-diagonal terms, let $\mu_{\max}$ be the maximum overlap between the gates of a layer for two different inputs (maximum taken over input pairs and layers); then the decay of the non-diagonal entries of $\lambda_0$ is upper bounded by $\mu_{\max}^{d-1}$. Thus, while the non-diagonal entries decay at different rates, the rate of decay is nonetheless exponential in depth. Note that in DGN-FRG the decay of the non-diagonal terms is at a uniform rate given by $\mu^{d-1}$.

Experiment 2: To characterise the optimisation performance of GaLU and ReLU networks, we consider a dataset in which the inputs are highly correlated by choice. The results are shown in Figure 4.

GaLU Networks (Depth helps in training): The trend is similar to the DGN-FRG case, in that both the e.c.d.f. and the convergence get better with increasing depth. Here too, we set the step-size as before (and use vanilla SGD). We also observe that in Experiment 2, GaLU networks optimise better than standard ReLU networks, and the e.c.d.f. for GaLU is better than that of ReLU. This can be attributed to the fact that in a ReLU network the dot product of two different active paths is not zero, and hence the Gram matrix entries fall back to the general algebraic expression for $\kappa_t$ in (4).

## 3 Generalisation

ReLU networks generalise better than GaLU: We trained both ReLU and GaLU networks on the standard MNIST dataset to close to $100\%$ training accuracy. We observed that the GaLU network trains a bit faster than the ReLU network (see Figure 18). However, on the test data, the ReLU network obtains a higher accuracy than the GaLU network. A key difference between GaLU and ReLU networks is that in ReLU networks the gates adapt, i.e., $\mathcal{G}_t$ keeps changing with time. This leads us to the following natural question:

Is gate adaptation key for generalisation performance?

Hidden features are in the sub-networks and are learned: We consider the 'Binary'-MNIST dataset with two classes (two chosen digits), with labels taking values in $\{-1, +1\}$ and squared loss. We trained a standard DNN with ReLU activation. Recall from Section 1 that $H_t = \Phi_t^\top \Phi_t$ (the Gram matrix of the features), and let $\hat{H}_t$ be its normalised counterpart. For a subset of the data (an equal number of examples per class), we plot $\nu_t = y^\top \hat{H}_t^{-1} y$ (where $y$ is the labelling function), and observe that $\nu_t$ reduces as training proceeds (see middle plot in Figure 6). Note that $\nu_t = \sum_{k=1}^{n} \frac{(u_{k,t}^\top y)^2}{\tilde{\rho}_{k,t}}$, where $u_{k,t}$ are the orthonormal eigenvectors of $\hat{H}_t$ and $\tilde{\rho}_{k,t}$ are the corresponding eigenvalues. Since the trace of $\hat{H}_t$ is fixed by normalisation, the only way $\nu_t$ can reduce is for more and more of the energy of $y$ to get concentrated on the eigenvectors $u_{k,t}$ for which the eigenvalues $\tilde{\rho}_{k,t}$ are high. However, in $H_t$, only $\lambda_t$ changes with time. Thus $\lambda_t(s,s')$, which is a measure of the overlap of the sub-networks active for input examples $s, s'$, changes in a manner that reduces $\nu_t$. We can thus infer that the right active sub-networks are learned over the course of training.
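The spectral identity underlying the argument in the preceding paragraph, $y^\top H^{-1} y = \sum_k \frac{(u_k^\top y)^2}{\rho_k}$ for a positive-definite Gram matrix $H$ with eigenpairs $(\rho_k, u_k)$, is easy to check numerically (illustrative sketch with a random positive-definite matrix):

```python
import numpy as np

# Check: y^T H^{-1} y = sum_k (u_k^T y)^2 / rho_k for positive-definite H,
# so the quantity is small exactly when the label vector's energy sits on
# eigendirections with large eigenvalues.
rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)                  # positive definite "feature Gram" matrix
rho, U = np.linalg.eigh(H)               # eigenvalues (ascending), eigenvectors

y = rng.standard_normal(n)
lhs = y @ np.linalg.solve(H, y)
rhs = sum((U[:, k] @ y) ** 2 / rho[k] for k in range(n))
assert np.isclose(lhs, rhs)

# Aligning y with the top eigenvector minimises the quantity over unit vectors:
y_top = U[:, -1]
assert np.isclose(y_top @ np.linalg.solve(H, y_top), 1.0 / rho[-1])
```

This is why a decreasing $y^\top \hat{H}_t^{-1} y$ signals that label energy is migrating toward the top eigendirections of the feature Gram matrix.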

DGNs with adaptable gates: To investigate the role of gate adaptation, we study parameterised DGNs of the general form given in Table 2. The specific variants we study in the experiments are in Table 3.

We now explain the idea behind the various gates (and some more) in Table 3 as follows:

The most general gate is called the soft-GaLU gate. Here, the gating values (and hence the path activation levels) are decided by one set of parameters, and the path strengths are decided by a second set; this network therefore has $2\,d_{net}$ parameters.

The standard DNN with ReLU gating (see Table 2). Here, both the gating (and hence the path activation levels) and the path strengths are decided by the same parameter $\Theta$.

A DNN with what we call soft-ReLU gates, where the gating values lie in $(0,1)$ instead of $\{0,1\}$. Here too, like in standard ReLU networks, both the gating values and the path strengths are decided by the same parameter $\Theta$.

What we call a GaLU-frozen DGN, where the gating parameters are initialised but not trained.

A network where only the gating parameters are trainable, and the parameters which dictate the path strengths are initialised but not trained.

Fixing the ReLU artefact: We refer to the function in Table 2 as the soft gating function. Note that the NTF matrix has two components, one from the path strengths and one from the path activations. In the case of ReLU activation, the gating values are either $0$ or $1$, the activation levels are also $0$ or $1$, and hence their derivative is $0$. In contrast, the soft gating function is differentiable, and hence it follows that the activation component of the NTF is non-zero, which is also accounted for in the analysis.
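To illustrate why soft gating restores a gradient signal through the gates, consider a sigmoid-style soft gate (our assumption for illustration; the paper's exact functional form is the one given in Table 2):

```python
import numpy as np

# A sigmoid-style soft gate: G(q) = 1 / (1 + exp(-beta * q)) takes values in
# (0, 1) and has a strictly positive derivative, so gradients flow through the
# gates -- unlike the hard 0/1 ReLU gate 1{q > 0}, whose derivative is zero
# wherever it is defined.
def soft_gate(q, beta=4.0):
    return 1.0 / (1.0 + np.exp(-beta * q))

def soft_gate_grad(q, beta=4.0):
    g = soft_gate(q, beta)
    return beta * g * (1.0 - g)          # dG/dq > 0 everywhere

q = np.linspace(-2, 2, 101)
g = soft_gate(q)
assert np.all((g > 0) & (g < 1))         # values stay strictly inside (0, 1)
assert np.all(soft_gate_grad(q) > 0)     # gate adaptation receives a gradient
```

A large `beta` recovers hard $0/1$ gating in the limit, which makes the soft gate a natural relaxation for analysing gate adaptation.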

For soft-GaLU: Since there are two sets of parameters ($2\,d_{net}$ in total), the NTF matrix has a component for each set, and we have $K_t = K_t^w + K_t^a$, where $K_t^w$ is the Gram matrix with respect to the parameters that dictate the path strengths, and $K_t^a$ is the Gram matrix with respect to the gating parameters.

For soft-ReLU: $K_t = K_t^w + K_t^a$ as well, with both components now arising from the same parameter $\Theta$.

###### Definition 3.1.

For a soft-GaLU DGN, using any threshold on the gating values, define the sensitivity sub-network to be the sub-network formed by those gates whose values are bounded away from $0$ and $1$.

###### Lemma 3.1.

Under Assumptions 1 and 2, in soft-GaLU networks we have: (i) $\mathbb{E}[K_0] = \mathbb{E}[K_0^w] + \mathbb{E}[K_0^a]$, with closed-form expressions for (ii) $\mathbb{E}[K_0^w]$ and (iii) $\mathbb{E}[K_0^a]$ given in the supplementary material.

Adaptable gates generalise better: In all the experiments below, we use a fixed step-size and the RMSprop optimiser.

Experiment 3: On 'Binary'-MNIST, we train four different networks, namely soft-GaLU, soft-ReLU, ReLU, and GaLU with frozen gates. We observe that the networks with adaptable gates generalise better than the GaLU network with frozen gates, as shown in Figure 7.

Experiment 4: We train a frozen ReLU network, where the gates are fixed at their initial values (weights chosen according to Assumption 2), and a standard ReLU network in which the gates adapt. We observe that generalisation is better when the gates adapt (see the rightmost plot in Figure 7).