# Student Specialization in Deep ReLU Networks With Finite Width and Input Dimension

Yuandong Tian
yuandong@fb.com
###### Abstract

In this paper, we adopt a student-teacher setting to analyze the phenomenon of student specialization, when both teacher and student are deep ReLU networks and the student is over-realized (i.e., it has more nodes than the teacher at each layer). In such a setting, the student network learns from the output of a fixed teacher network of the same depth with Stochastic Gradient Descent (SGD). Our contributions are two-fold. First, we prove that when the gradient is small at every training sample, student nodes specialize to teacher nodes in the lowest layer under mild conditions. Second, analysis of noisy recovery and training dynamics in 2-layer networks shows that strong teacher nodes (with large fan-out weights) are learned by the student first, while weak ones are left unlearned until a late stage of training. As a result, it could take a long time to converge to these small-gradient critical points. Our analysis shows that over-realization is an important factor in enabling specialization at the critical points, and helps more student nodes specialize to teacher nodes with fewer iterations. Unlike the Neural Tangent Kernel (Jacot et al., 2018) and statistical mechanics approaches (Goldt et al., 2019), our approach operates at finite width, mild degrees of over-realization and finite input dimension. Experiments support our findings. The code is released at https://github.com/facebookresearch/luckmatters.

## 1 Introduction

Deep learning has achieved great success in recent years (Silver et al., 2016; He et al., 2016; Devlin et al., 2018). Although networks with even a single hidden layer can fit any function (Hornik et al., 1989), it remains an open question how such networks can generalize to new data. Contrary to what traditional machine learning theory predicts, empirical evidence (Zhang et al., 2017) shows that more parameters in a neural network lead to better generalization. How over-parameterization yields strong generalization is an important question for understanding how deep learning works.

In this paper, we analyze deep ReLU networks in a teacher-student setting: a fixed teacher network provides the output for a student to learn via SGD. Both teacher and student are deep ReLU networks. Similar to (Goldt et al., 2019), the student is over-realized compared to the teacher: at each layer , the number of student nodes is larger than the number of teacher nodes (). Note that over-realization is different from over-parameterization, which means that the total number of parameters in the student model is larger than the training set size .

The student-teacher setting has a long history (Saad and Solla, 1996, 1995; Freeman and Saad, 1997; Mace and Coolen, 1998) and has recently gained increasing interest (Goldt et al., 2019; Aubin et al., 2018) in the analysis of 2-layer networks. While worst-case performance on arbitrary data distributions may not be a good model for real structured datasets and can be hard to analyze, using a teacher network implicitly enforces an inductive bias and could potentially lead to better generalization bounds.

Specialization, that is, a student node becoming increasingly correlated with a teacher node during training (Saad and Solla, 1996), is one of the important topics in this setup. If all student nodes are specialized to the teacher, then the student tends to produce the same output as the teacher and good generalization performance can be expected. Empirically, specialization has been observed in 2-layer networks (Saad and Solla, 1996; Goldt et al., 2019) and multi-layer networks (Tian et al., 2019; Li et al., 2016), on both synthetic and real datasets. In contrast, theoretical analysis is limited and relies on strong assumptions (e.g., Gaussian inputs, infinite input dimension, local convergence, 2-layer setting, small number of hidden nodes). In this paper, with arbitrary training distribution and finite input dimension, we show rigorously that when the gradient at each training sample is small (i.e., the interpolation setting as suggested in (Ma et al., 2017; Liu and Belkin, 2018; Bassily et al., 2018)), student nodes at the lowest layer provably specialize to the teacher nodes: each teacher node is aligned with at least one student node in the lowest layer. This explains the one-to-many mapping between teacher and student nodes and the existence of un-specialized student nodes, as observed empirically in (Saad and Solla, 1996). Furthermore, the proof condition indicates that more over-realization encourages specialization.

Our setting is different from previous works. (1) While statistical mechanics approaches (Saad and Solla, 1996; Goldt et al., 2019; Gardner and Derrida, 1989; Aubin et al., 2018) assume that both the training set size and the input dimension go to infinity (i.e., the thermodynamic limit) and assume Gaussian inputs, our analysis allows finite and imposes no parametric constraints on the input data distribution. (2) While Neural Tangent Kernel (Jacot et al., 2018; Du et al., 2019) and mean-field approaches (Mei et al., 2018) require infinite (or very large) width, our setting applies to finite width as long as the student is slightly over-realized (). In this paper we study the infinite training sample case (the training set is a region), and leave finite sample analysis as future work.

In addition, we analyze the training dynamics and show that most student nodes first converge towards strong teacher nodes, i.e., those whose fan-out weights are large in magnitude. While this makes training robust to dataset noise and naturally explains implicit regularization, the same mechanism also leaves weak teacher nodes unexplained until a very late stage of training, yielding high generalization error with finite iterations. In this situation, we show that over-realization plays another important role: once the strong teacher nodes have been covered, there are always spare student nodes ready to switch to weak teacher nodes quickly. Empirically, we show that more teacher nodes are covered with the same number of iterations, and generalization is also improved.

We verify our findings with numerical experiments. Starting with the 2-layer setting, we verify Theorem 2 and Theorem 3 with Gaussian inputs, showing one-to-many specialization and the existence of un-specialized nodes. For deep ReLU networks, we show that specialization happens not only in the lowest layer, as suggested by Theorem 4, but also in other hidden layers, on both Gaussian inputs and CIFAR10. We also perform ablation studies on the effect of student over-realization. For training dynamics, we show the strong/weak teacher effects in 2-layer settings and how over-realization improves specialization and generalization. For reproducibility, we release the code at https://github.com/facebookresearch/luckmatters.

## 2 Related Works

Teacher-student setting. The teacher-student setting has a long history (Engel and Van den Broeck, 2001; Gardner and Derrida, 1989). The seminal works (Saad and Solla, 1996, 1995) study the 1-hidden-layer case from a statistical mechanics point of view in which the input dimension goes to infinity, the so-called thermodynamic limit. They study symmetric solutions and locally analyze the symmetry-breaking behavior and the onset of specialization of the student nodes towards the teacher. A recent follow-up work (Goldt et al., 2019) makes the analysis rigorous and empirically shows that random initialization and training with SGD indeed gives student specialization in the 1-hidden-layer case, which is consistent with our experiments. With the same assumption, (Aubin et al., 2018) studies the phase transition property of specialization in 2-layer networks with a small number of hidden nodes using the replica formula. In these works, inputs are assumed to be Gaussian and a step or Gauss error function is used as the nonlinearity. Few works study the teacher-student setting with more than two layers. (Allen-Zhu et al., 2019) shows recovery results for 2- and 3-layer networks, with modified SGD, batch size 1 and heavy over-parameterization.

In comparison, our work shows that specialization happens around the SGD critical points in the lowest layer for deep ReLU networks, without any parametric assumptions on the input distribution.

Local minima are global. While in deep linear networks all local minima are global (Laurent and Brecht, 2018; Kawaguchi, 2016), the situation is more complicated with nonlinear activations. While local minima are global when the network has an invertible activation function and distinct training samples (Nguyen and Hein, 2017; Yun et al., 2018) or Leaky ReLU with linearly separable input data (Laurent and von Brecht, 2017), multiple works (Du et al., 2018; Ge et al., 2017; Safran and Shamir, 2017; Yun et al., 2019) show that in the GD case with population or empirical loss, spurious local minima can occur even in two-layer networks. Many of these results are specific to two layers and hard to generalize to the multi-layer setting. In contrast, our work brings about a generic formulation for deep ReLU networks and gives recovery properties in the student-teacher setting.

Learning wide networks. Recent works on Neural Tangent Kernel (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019) show the global convergence of GD for multi-layer networks with infinite width. (Li and Liang, 2018) shows the convergence of GD/SGD in a one-hidden-layer ReLU network to a solution with good generalization, when the input data are assumed to be clustered into classes. Both lines of work assume heavily over-parameterized networks, requiring polynomial growth of the number of nodes with respect to the number of samples. (Chizat and Bach, 2018) shows global convergence of over-parameterized networks with optimal transport.

(Tian et al., 2019) assumes mild over-realization and gives convergence results for 2-layer networks when a subset of the student network is close to the teacher. Our work extends it with far fewer and weaker assumptions.

## 3 Mathematical Framework

Notation. Consider a student network and its associated teacher network (Fig. 1(a)). Denote the input as . We focus on multi-layered networks with the activation function as the ReLU nonlinearity. We use the following equality extensively: , where is the indicator function. For node , , and are its activation, gating function and backpropagated gradient after the gating.

Both teacher and student networks have layers. The input layer is layer 0 and the topmost layer (layer that is closest to the output) is layer . For layer , let be the number of teacher nodes while be the number of student nodes. is the weight matrix connecting layer to layer on the student side. where each is the weight vector. Similarly we have teacher weight . Finally, as the collection of all trainable parameters.

Let be the activation vector of layer , be the diagonal matrix of gating function. For ReLU, the diagonal element is either 0 or 1. Let be the backpropagated gradient vector. By definition, the input layer has and , where is the input dimension. Note that , and are all dependent on . For notational brevity, we often use rather than .

All notations with superscript are only dependent on the teacher and remain constant throughout training. At the output (topmost) layer, since there is no ReLU gating. Note that is the dimension of output for both teacher and student. With this notation, the gradient flow update can be written as:

$$\dot{W}_l = \mathbb{E}_{x}\left[f_{l-1}(x)\, g_l^\top(x)\right] \tag{1}$$

In SGD, the expectation is taken over a batch. In GD, it is over the entire dataset.

Bias term. Note that with the same notation we can also include the bias term. In this case, , , , and (last diagonal element is always ). In the text, with a slight abuse of notation, we will not distinguish the cases with and without bias.

Objective. We assume that both the teacher output and the student output are vectors of length . We use the output of the teacher as the supervision for the student, and the objective is:

$$\min_{\mathcal{W}} J(\mathcal{W}) = \frac{1}{2}\,\mathbb{E}_{x}\left[\left\|f^*_L(x) - f_L(x)\right\|^2\right] \tag{2}$$

By the backpropagation rule, we know that for each sample , the (negative) gradient . The gradient gets backpropagated until the first layer is reached.
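As a concrete illustration of the objective in Eqn. 2, the following is a minimal NumPy sketch of a 2-layer teacher-student pair trained with mini-batch SGD. It is not the released implementation: the layer sizes, learning rate, batch size, initialization scale and the helper names `forward` and `sgd_step` are all hypothetical choices made for this example.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer ReLU network: gated hidden layer, linear output (no gating on top)."""
    h = np.maximum(x @ W1, 0.0)          # hidden activations, shape (batch, width)
    return h, h @ W2                     # output, shape (batch, out_dim)

def sgd_step(x, y_teacher, W1, W2, lr):
    """One SGD step on J = 0.5 * mean ||y_teacher - y_student||^2 over the batch."""
    h, y = forward(x, W1, W2)
    e = y_teacher - y                    # residue at the output
    g_hidden = (e @ W2.T) * (h > 0)      # backpropagated signal after ReLU gating
    n = x.shape[0]
    W1_new = W1 + lr * x.T @ g_hidden / n    # step along the negative gradient
    W2_new = W2 + lr * h.T @ e / n
    loss = 0.5 * np.mean(np.sum(e ** 2, axis=1))
    return W1_new, W2_new, loss

rng = np.random.default_rng(0)
d, m, n_student, out_dim = 10, 4, 8, 3   # hypothetical sizes; student is over-realized (8 > 4)

# Fixed teacher with unit-norm hidden filters (bias omitted in this sketch).
W1_t = rng.normal(size=(d, m))
W1_t /= np.linalg.norm(W1_t, axis=0, keepdims=True)
W2_t = rng.normal(size=(m, out_dim))

# Student starts from small random weights.
W1_s = 0.1 * rng.normal(size=(d, n_student))
W2_s = 0.1 * rng.normal(size=(n_student, out_dim))

losses = []
for _ in range(2000):
    x = rng.normal(size=(64, d))         # fresh Gaussian batch each step
    _, y_t = forward(x, W1_t, W2_t)      # the teacher output supervises the student
    W1_s, W2_s, loss = sgd_step(x, y_t, W1_s, W2_s, lr=0.01)
    losses.append(loss)
```

The student never sees the teacher's hidden activations; only the output residue `e` is backpropagated, which is the situation the question (*) is about.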

Note that here, the gradient sent to layer is correlated with the activations and at the same layer. Intuitively, this means that the gradient “pushes” the student node to align with output dimension of the teacher. A natural question arises:

Are student nodes specialized to teacher nodes at the same layers after training? (*)

One might expect this to be hard, since the student’s intermediate layer receives no direct supervision from the corresponding teacher layer and relies only on the backpropagated gradient. Surprisingly, the following lemma shows that it is possible to build connections at the intermediate layers:

###### Lemma 1 (Recursive Gradient Rule).

At layer , the backpropagated satisfies

$$g_l(x) = D_l(x)\left[A_l(x)\, f^*_l(x) - B_l(x)\, f_l(x)\right], \tag{3}$$

where the mixture coefficient and . The matrices and are defined in a top-down manner:

$$V_{l-1}(x) = V_l(x)\, D_l(x)\, W_l^\top, \qquad V^*_{l-1}(x) = V^*_l(x)\, D^*_l(x)\, W_l^{*\top} \tag{4}$$

In particular, .

For convenience, we can write in which . Each element of , and each element of , . Note that Lemma 1 applies to arbitrarily deep ReLU networks and allows different numbers of nodes for the teacher and the student. In particular, the student can be over-realized.

Note that Eqn. 3 can also be written as:

$$g_l(x) = D_l(x)\, V_l^\top(x)\, e_l(x), \tag{5}$$

where is the residue at each layer . In particular, at the top layer , we have since and are all identity matrices.
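The recursive rule can be checked numerically on a small network. The sketch below builds the matrices of Eqn. 4 top-down for a 3-layer student and teacher, evaluates the right-hand side of Eqn. 3 at the lowest layer, and compares the result with a finite-difference estimate of the negative gradient with respect to the first-layer weights. It assumes that the mixture coefficients are $A_l = V_l^\top V^*_l$ and $B_l = V_l^\top V_l$, which is the reading under which Eqn. 3 and Eqn. 5 agree with residue $e_l = V^*_l f^*_l - V_l f_l$; the layer sizes and helper names are hypothetical.

```python
import numpy as np

def forward(x, Ws):
    """Return activations [f_0, ..., f_L] and gate vectors [d_1, ..., d_L] (no ReLU on top)."""
    fs, gates = [x], []
    for i, W in enumerate(Ws):
        pre = W.T @ fs[-1]
        is_top = (i == len(Ws) - 1)
        gate = np.ones_like(pre) if is_top else (pre > 0).astype(float)
        fs.append(pre * gate)
        gates.append(gate)
    return fs, gates

def top_down_V(Ws, gates):
    """V_L = I and V_{l-1} = V_l D_l W_l^T (Eqn. 4); returns [V_1, ..., V_L]."""
    V = np.eye(Ws[-1].shape[1])
    Vs = [V]
    for W, gate in zip(Ws[:0:-1], gates[:0:-1]):   # layers L, L-1, ..., 2
        V = V @ np.diag(gate) @ W.T
        Vs.append(V)
    return Vs[::-1]

def loss(x, Ws_student, Ws_teacher):
    f_s, _ = forward(x, Ws_student)
    f_t, _ = forward(x, Ws_teacher)
    return 0.5 * np.sum((f_t[-1] - f_s[-1]) ** 2)

rng = np.random.default_rng(0)
sizes_t = [6, 3, 4, 2]      # hypothetical teacher widths: input, two hidden layers, output
sizes_s = [6, 5, 7, 2]      # over-realized student of the same depth
Ws_t = [rng.normal(size=(a, b)) for a, b in zip(sizes_t[:-1], sizes_t[1:])]
Ws_s = [rng.normal(size=(a, b)) for a, b in zip(sizes_s[:-1], sizes_s[1:])]
x = rng.normal(size=6)

f_s, gates_s = forward(x, Ws_s)
f_t, gates_t = forward(x, Ws_t)
V_s, V_t = top_down_V(Ws_s, gates_s), top_down_V(Ws_t, gates_t)

A1 = V_s[0].T @ V_t[0]      # ASSUMED form of the mixture coefficient A_1
B1 = V_s[0].T @ V_s[0]      # ASSUMED form of the mixture coefficient B_1
g_rule = gates_s[0] * (A1 @ f_t[1] - B1 @ f_s[1])   # right-hand side of Eqn. 3 at l = 1

# Central finite differences for -dJ/dW_1, which should equal f_0 g_1^T.
eps = 1e-6
neg_grad = np.zeros_like(Ws_s[0])
for i in range(neg_grad.shape[0]):
    for j in range(neg_grad.shape[1]):
        W_plus = [W.copy() for W in Ws_s]
        W_minus = [W.copy() for W in Ws_s]
        W_plus[0][i, j] += eps
        W_minus[0][i, j] -= eps
        neg_grad[i, j] = -(loss(x, W_plus, Ws_t) - loss(x, W_minus, Ws_t)) / (2 * eps)

max_err = np.max(np.abs(neg_grad - np.outer(x, g_rule)))
```

The teacher and student widths differ at every hidden layer, so the check also exercises the over-realized case.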

Let be the infinite training set, where is the input data distribution. Unlike previous works, we don’t impose any distributional assumption on . Let , which is the image of the training set at the output of layer . Then the mixture coefficient and have the following property:

###### Corollary 1 (Piecewise constant).

can be decomposed into a finite (but potentially exponential) set of regions plus a zero-measure set, so that and are constant within each region with respect to .

## 4 Critical Point Analysis

It seems hard to achieve the goal (*) since the student intermediate node doesn’t have direct supervision from the teacher intermediate node, and there exist many different ways to explain the teacher’s supervision. However, thanks to the properties of ReLU nodes and subset sampling in SGD, at an SGD critical point, under mild conditions, each teacher node aligns with at least one student node.

### 4.1 SGD critical points lead to the interpolation setting

###### Definition 1 (SGD critical point).

is a SGD critical point if for any batch, for .

###### Theorem 1 (Interpolation).

Denote as a dataset of samples. If is a critical point for SGD, then either or .

Note that such critical points exist since the student is over-realized. In this case, critical points of SGD are much stronger than those of GD, where the gradient is always averaged over a fixed data distribution. Note that if has bias term, then always and . For the topmost layer, we immediately have , which is a global optimum with zero training loss. In the following, we want to check whether this condition leads to generalization, i.e., whether the teacher’s weights are recovered or aligned by the student: for teacher , does there exist a student at the same layer so that for some (bias term included)?

### 4.2 Assumption of teacher network

Obviously, not all teacher networks can be properly reconstructed. A trivial example is a teacher network that always outputs zero because all the training samples lie in the inactive halfspace of its ReLU nodes. Therefore, we need to impose conditions on the teacher network.

Let be the activation region of node . Note that the halfspace is an open set. Let be the decision boundary of node .

###### Definition 2 (Observer).

Node is an observer of node if . See Fig. 4(a).

###### Definition 3 (Specialization/Alignment/Co-linearity).

Node is specialized to (or aligned with) node , if with some .
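In practice, alignment between two nodes can be measured with the cosine similarity of their weight vectors. The sketch below assumes that Definition 3 means a student weight is a positive scalar multiple of a teacher weight; the function names `alignment` and `specialized_students` and the tolerance `tol` are hypothetical and introduced only for illustration.

```python
import numpy as np

def alignment(teacher_W, student_W):
    """Cosine similarity between every teacher column and every student column."""
    t = teacher_W / np.linalg.norm(teacher_W, axis=0, keepdims=True)
    s = student_W / np.linalg.norm(student_W, axis=0, keepdims=True)
    return t.T @ s                      # shape (num_teacher, num_student)

def specialized_students(teacher_W, student_W, tol=1e-6):
    """For each teacher node, list the student nodes whose cosine similarity is within tol of 1."""
    cos = alignment(teacher_W, student_W)
    return [np.flatnonzero(row > 1.0 - tol).tolist() for row in cos]

# Two teacher nodes (columns) in 3 dimensions.
teacher_W = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 0.0]])

# Five student nodes: two scaled copies of teacher 0, one scaled copy of teacher 1,
# one node orthogonal to both teachers, and one pointing opposite to teacher 1.
student_W = np.array([[2.0, 0.5, 0.0, 0.0,  0.0],
                      [0.0, 0.0, 3.0, 0.0, -1.0],
                      [0.0, 0.0, 0.0, 1.0,  0.0]])

matches = specialized_students(teacher_W, student_W)
```

Here teacher 0 is matched by two students and teacher 1 by one, which mirrors the one-to-many pattern described in the text. The last student has cosine similarity -1 with teacher 1 and is not counted, since the scaling factor must be positive under the assumption above.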

Without loss of generality, we set all teacher nodes to be normalized: (bias included in -norm), except for the topmost layer . For the teacher node at layer , we impose an additional condition:

###### Assumption 1 (Teacher Network Condition at layer l).

We require that (1) the teacher weights are not co-linear, and (2) the boundary of is visible in the training set: .

Fig. 2 visualizes the two requirements. The first requirement is trivial to satisfy. For the second requirement, since two teacher nodes (or two consecutive layers) that behave linearly in the training set are indistinguishable, if we want to reconstruct weight , we would need the second requirement to hold at layer . If not, then one counter-example is that there exist identity mappings from layer to layer , and non-identity mapping (real operations) from layer to layer . In this case, there is ambiguity whether these real operations are performed in or and reconstruction will not follow.

In the following, we remove the subscript when for notational brevity.

### 4.3 Student Specialization, 2-layer case

We start with 2-layer case, in which and are constant over , since there is no ReLU gating at the top layer . In this case, from the SGD critical point at , and alignment between teacher and student can be achieved:

###### Theorem 2 (Student Specialization, 2-layers).

If Assumption 1 holds for , at SGD critical point, if a teacher node is observed by a student node and , then there exists at least one student node aligned with .

Proof sketch. The intuition is that ReLU activations can be proven to be mutually linearly independent if their boundaries are within the training region . For details, please check Lemma 2, Lemma 3 and Lemma 6 in the Appendix. On the other hand, the gradient of each student node when active, is , a linear combination of teacher and student nodes (note that and are -th rows of and ). Therefore, zero gradient means that the summation of coefficients of co-linear ReLU nodes is zero. Since teachers are not co-linear (Assumption 1), a non-zero coefficient for teacher means that it has to be co-linear with at least one student node, so that the summation of coefficients is zero. Alignment with multiple student nodes is also possible. In contrast, for deep linear models, alignment does not happen since a linear subspace spanned by an intermediate layer can be represented by different sets of bases.

Note that a necessary condition of a reconstructed teacher node is that its boundary is in the active region of student, or is observed (Definition 2). This is intuitive since a teacher node which behaves like a linear node is partly indistinguishable from a bias term.

For student nodes that are not aligned with any teacher node, if they are observed by other student nodes, then following a similar logic, we have the following:

###### Theorem 3 (Un-specialized Student Nodes are Prunable).

If Assumption 1 holds for , at SGD critical point, if an unaligned student has independent observers, i.e., the -by- matrix stacking the fan-out weights of these observers is full rank, then . In particular, if node is not co-linear with any other student, then .

###### Corollary 2.

With sufficient observers, the contribution of all unspecialized student nodes is zero.

The specialization patterns are summarized in Fig. 4(b). Theorem 3 and Corollary 2 explain why network pruning is possible (LeCun et al., 1990; Hassibi et al., 1993; Hu et al., 2016). With these alignment patterns, we could also explain why low-cost critical points can be connected via line segments, but not a single straight line, as explained in (Kuditipudi et al., 2019), but without assuming their condition of -dropout stableness of a trained network. In fact, according to the analysis above, it already holds at SGD critical points.

Our theorem is also consistent with Theorem 5 in (Tian et al., 2019), which also shows that the fan-out weights are zero upon convergence in 2-layer networks, if the initialization is close. In contrast, Theorem 3 analyzes the critical point rather than the local dynamics near the critical points.

Note that a related theorem (Theorem 6) in (Laurent and von Brecht, 2017) studies 2-layer networks with scalar output and linearly separable input, and discusses characteristics of individual data points contributing to the loss at a local minimum of GD. In our paper, no linear separability condition is imposed.

### 4.4 Multi-layer case

Thanks to Lemma 1 which holds for deep ReLU networks, we can use similar intuition as in the 2-layer case, to analyze the behavior of the lowest layer () in the multiple layer case. The difference here is that and are no longer constant over . Fortunately, using Corollary 1, we know that and are piece-wise constant that separate the input region into a finite (but potentially exponential) set of constant regions plus a zero-measure set. This suggests that we could check each region separately. If the boundary of a teacher and a student lies in the region (Fig. 5), similar logic applies:

###### Theorem 4 (Student Specialization, Multiple Layers).

If Assumption 1 holds for , at SGD critical points, for any teacher node at , if there is a region and a student observer so that and , then at least one student node specializes to node .

The theorem suggests a few interesting consequences:

The role of over-realization. Theorem 2 and Theorem 4 suggest that over-realization (more student nodes in the hidden layer ) is important. More student nodes mean more observers, and the existence argument in these theorems is more likely to happen and more teacher nodes can be covered by student, yielding better generalization.

Bottom-up training. Note that even with random (e.g., at initialization), Theorem 4 still holds with high probability (when ) and teacher can still align with student . This suggests a picture of bottom-up training in backpropagation: After the alignment of activations at layer , we just treat layer as the low-level features and the procedure repeats until the student matches with the teacher at all layers. This is consistent with many previous works that empirically show the network is learned in a bottom-up manner (Li et al., 2018).

Note that the alignment may happen concurrently across layers: if the activations of layer start to align, then activations of layer , which depends on activations of layer , will also start to align since there now exists a that yields strong alignments, and so on. This creates a critical path from important student nodes at the lowest layer all the way to the output, and this critical path accelerates the convergence of that student node. We leave a formal analysis to future work.

Deeper Student than Teacher. Note that both theorems only require Assumption 1 to hold at , and it is fine to have identity mappings at “information passing” layers . Therefore, when the student network is deeper than the teacher, by adding identity layers beyond the topmost layer on the teacher side, the conclusion of Theorem 4 still holds. This also suggests that the lowest layer of the student would match the lowest layer of the teacher, even if they have different numbers of layers.

So far we have analyzed the ideal case in which we have an infinite number of training samples ( is a region), and have performed SGD for infinitely many iterations so that we reach critical points in which for every . A natural question arises: are these conditions achievable in practice?

In practice, the gradient of SGD never reaches but might fluctuate around (i.e., ). In this case, Theorem 5 shows that a rough recovery still follows. We now can see the ratio of recovery for weights/biases separately, as a function of . Note that here we separate weights and biases: , and is the angle of two (bias-free) weights and .

###### Theorem 5 (Noisy Specialization).

If Assumption 1 holds and any two teachers , satisfy or . Suppose for any with , then for any teacher at , if there exists a region and a student observer so that , and , then is roughly aligned with a student : and for any . The hidden constants depend on , and the size of region .

Although the proof of Theorem 5 still assumes infinite number of data points, in the proof only a discrete set of data points are important to complete the proof by contradiction. We leave a formal analysis to future work.

## 5 Analysis on Training Dynamics

From the previous analysis, we see that at SGD critical points, under mild conditions, each teacher node will be aligned with at least one student node. This would naturally yield low generalization error.

A natural question arises: is “running SGD long enough” a sufficient condition to achieve these critical points? Previous works (Ge et al., 2017; Livni et al., 2014) show that empirically SGD does not recover the parameters of a teacher network up to permutation.

There are several reasons. First, from Theorem 3, there exist student nodes that are not aligned with the teacher, so a simple permutation test on student weights might fail. Second, as suggested by Theorem 5, to recover a teacher node with small (and thus small since ), the SGD gradient bound needs to be very small, which means we would need to run SGD for a very long time. In fact, if then the teacher node has no contribution to the output and reconstruction never happens. This is particularly problematic if the output dimension is (scalar output), since a single small teacher weight would block the recovery of the entire teacher node . Given a finite number of iterations, how much the student is aligned with the teacher implicitly suggests the generalization performance.

In the following, as a starting point, we analyze the training dynamics of 2-layer network where and are all constant. Here , and is the residue, from Eqn. 5 we have:

$$\dot{w}_k = \mathbb{E}_{x}\left[f_0\, g_k\right] = \mathbb{E}_{x}\left[f_0\, z_k\, e_1^\top v_k\right] \tag{6}$$

We formally define the notion of strong or weak teacher node as follows. As in Section 4.2, without loss of generality, for layer , we set teacher (bias included in norm).

###### Definition 4.

A teacher node is strong (or weak), if is large (or small). See Fig. 6(a).

### 5.1 Weight magnitude

From Eqn. 6, we know that for both ReLU and linear network (since ):

$$\frac{1}{2}\frac{\mathrm{d}\|w_k\|^2}{\mathrm{d}t} = w_k^\top \dot{w}_k = \mathbb{E}_{x}\left[f_k\, e_1^\top v_k\right] \tag{7}$$

When there is only a single output, , is a scalar and Eqn. 7 is simply an inner product between the residue and the activation of node , over the batch. So if the node has activation which aligns well with the residual, the inner product is larger and grows faster.

On the other hand, for a two-layer network, since is the -th row of , we have and thus

$$\frac{1}{2}\frac{\mathrm{d}\|v_k\|^2}{\mathrm{d}t} = \mathbb{E}_{x}\left[f_k\, e_2^\top v_k\right] \tag{8}$$
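The right-hand sides of Eqn. 7 and Eqn. 8 can be compared numerically for a 2-layer ReLU network, where the residue reaching the hidden layer is the output residue. The sketch below evaluates both rates on one batch; the sizes are hypothetical, and random targets stand in for the teacher output because the identity being checked does not depend on where the targets come from.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, out_dim, batch = 8, 5, 3, 256            # hypothetical sizes
W1 = rng.normal(size=(d, n))                   # column k is the fan-in weight w_k
W2 = rng.normal(size=(n, out_dim))             # row k is the fan-out weight v_k
x = rng.normal(size=(batch, d))
y_target = rng.normal(size=(batch, out_dim))   # stand-in targets (assumption of this sketch)

h = np.maximum(x @ W1, 0.0)                    # activation f_k of every hidden node, per sample
e = y_target - h @ W2                          # output residue
dW1 = x.T @ ((e @ W2.T) * (h > 0)) / batch     # batch estimate of dw_k/dt, as in Eqn. 6
dW2 = h.T @ e / batch                          # batch estimate of dv_k/dt

rate_w = np.sum(W1 * dW1, axis=0)              # w_k^T dw_k/dt, one value per hidden node
rate_v = np.sum(W2 * dW2, axis=1)              # v_k^T dv_k/dt, one value per hidden node
```

On any batch the two vectors `rate_w` and `rate_v` coincide up to floating-point error, so under gradient flow the fan-in and fan-out norms of a hidden node grow or shrink together.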

Relationship to techniques for comparing network representations. Singular Vector Canonical Correlation Analysis (SVCCA) by Raghu et al. (2017) compares the intermediate representations of networks trained with different initializations: first perform SVD on , where is the number of data points in evaluation, to find the major left singular space, and then linearly project the spaces from two networks to a common subspace using CCA. From our perspective, the reason why it works is quite clear: the SVD step first removes the unaligned student activations, since their magnitudes are often small due to Eqn. 7; the CCA step then merges co-linear student nodes and removes the permutation ambiguity.

### 5.2 Simplification of Dynamics

To further understand the dynamics, in particular to see whether converges to a teacher node rather than just increasing its norm, let’s check one term in . Using for , we have:

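$$\mathbb{E}_{x}\left[f_{l-1}\, z_k\, f^*_j\right] = \mathbb{E}_{x}\left[f_{l-1}\, z_k\, z^*_j\, f_{l-1}^\top\right] w^*_j = G_{kj}\, w^*_j \tag{9}$$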

where . Put another way, we want to check the spectral properties of the PSD matrix . Intuitively, the direction of should lie between and to push towards , and the magnitude is large when and are close to each other. This means that if is dominated by a teacher (i.e., is large), then would push towards . This also shows that SGD will first try fitting strong teacher nodes, then weak teacher nodes.
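The intuition about the direction of the push can be probed with a Monte Carlo estimate. The sketch below reads $G_{kj}$ off Eqn. 9 as $\mathbb{E}_x[f_{l-1} z_k z^*_j f_{l-1}^\top]$, uses standard Gaussian samples as the lower-layer activation (an assumption of this example), and checks that $G_{kj} w^*_j$ is well approximated by a non-negative combination of the student and teacher weights. The dimension, sample count and angle are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 5, 200_000
theta = np.pi / 3                                   # chosen angle between the two weights

w_student = np.zeros(d)                             # w_k, unit norm
w_student[0] = 1.0
w_teacher = np.zeros(d)                             # w*_j, unit norm, at angle theta from w_k
w_teacher[0] = np.cos(theta)
w_teacher[1] = np.sin(theta)

x = rng.normal(size=(num_samples, d))               # Gaussian input in the role of f_{l-1}
both_active = ((x @ w_student > 0) & (x @ w_teacher > 0)).astype(float)   # z_k * z*_j
G = (x * both_active[:, None]).T @ x / num_samples  # Monte Carlo estimate of G_kj

push = G @ w_teacher                                # the term G_kj w*_j from Eqn. 9

# Least-squares fit of the push as a combination of w_k and w*_j.
basis = np.stack([w_student, w_teacher], axis=1)
coef, *_ = np.linalg.lstsq(basis, push, rcond=None)
residual = np.linalg.norm(basis @ coef - push)
min_eig = np.linalg.eigvalsh(G).min()               # G is symmetric, so eigvalsh applies
```

In this setup both fitted coefficients come out positive and the residual is at the level of sampling noise, so the estimated push lies between the student and teacher directions. The smallest eigenvalue of the estimate is non-negative, as expected for a PSD matrix.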

Theorem 6 confirms this intuition if follows a spherically symmetric distribution (e.g., ).

###### Theorem 6.

If follows a spherically symmetric distribution, then , where is the angle between and .

As a result, for all , is always between and since and are always non-negative. Without such symmetry, we assume the following holds:

###### Assumption 2.

, where .

Note that critical point analysis is applicable to any batch size, including . On the other hand, Assumption 2 holds when a moderately large batch size leads to a decent estimation of the terms. Intuitively, the two help determine an optimal batch size.

With Assumption 2, we can write the dynamics as , where the time-varying per-node residue of node is defined as the following. Here is a scalar related to , and is the angle (bias included) between and , no matter whether they are from teacher or student:

$$r_k = \sum_{j} \alpha_{jk}\, \psi(\theta_{jk})\, w^*_j - \sum_{k'} \beta_{k'k}\, \psi(\theta_{k'k})\, w_{k'} - \nu\, w_k \tag{10}$$

### 5.3 Symmetry breaking, Winners-take-all and Focus Shifting

Now we want to understand the simplified dynamics . Due to its non-linear nature and complicated behaviors, we provide an intuitive analysis to understand the nature of the dynamics, and leave a formal analysis as future work.

Let . First, for two nodes , regardless of the form of , we have:

###### Theorem 7.

For dynamics , we have .

For a simple (and symmetric) case that with all , where is a joint contribution of all teacher nodes, we could show that (in the proof of Theorem 7), when , and vice versa. So the system provides negative feedback until and according to Eqn. 7, the ratio between and remains constant, after initial transition. We can also show that will align with and every student node goes to .

However, due to Theorem 6, the net effect can be different for different students and thus are different. This opens the door for complicated dynamic behavior of neural network training.

Symmetry breaking. As one example, suppose we add a very small delta to some node, say to so that . Then to reach a stationary point , we have and thus according to Theorem 7, grows exponentially (see details in the proof of Theorem 7). This symmetry-breaking behavior provides a potential winners-take-all mechanism, since according to Theorem 6, the coefficient of depends critically on the initial angle between and .

Strong teacher nodes are learned first. If is the largest among teacher nodes, then for any , their for other teacher . According to Eqn. 10 thus heavily biases towards . Following the analysis above, all student nodes move towards teacher . As a result, the strong teacher is learned first and is often covered by multiple co-linear students (Fig. 11, teacher-1).

Focus shifting to weak teacher nodes. The process above continues until the residual along the direction of quickly shrinks and the residual corresponding to another teacher node (e.g., for ) becomes dominant. Since each is different, student nodes whose direction is closer to () will shift their focus towards (Fig. 6(b)). This is shown in the green (shift to teacher-3) and magenta (shift to teacher-5) curves in one of the simulations (Fig. 11, first row).

Possible slow convergence to weak teacher nodes. While expected angle between two weights from random initialization is , shifting a student node from chasing after a strong teacher node to a weaker one could yield a large initial angle (e.g., close to ) between and . For example, all student nodes have been attracted to the opposite direction of a weak teacher node. In this case, the convergence can be arbitrarily slow.

In the following we give one concrete worst case example (Fig. 6(c)). Suppose we only have one student . Let be the orthogonal complement projection operator. Note that (Eqn. 69 in Appendix) and . From Eqn. 10 we have:

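$$\frac{\mathrm{d}}{\mathrm{d}t}\left(w_j^{*\top} \bar{w}_k\right) = w_j^{*\top} P^\perp_{w_k} r_k = \alpha_{jk}\, \psi(\theta_{jk})\, w_j^{*\top} P^\perp_{w_k} w^*_j = \alpha_{jk}\, \psi(\theta_{jk}) \sin^2\theta_{jk} \tag{11}$$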

On the other hand, . So we arrive at

$$\dot{\theta}_{jk} = -\alpha_{jk}\psi(\theta_{jk})\sin\theta_{jk} \tag{12}$$

Since around , the time spent from to some is when . Therefore, it takes infinite time for the student to converge to teacher .
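To make the slow-convergence argument concrete, here is a minimal sketch under a loud simplifying assumption: it replaces ψ by the constant 1, which is not the paper's ψ, so that Eqn. 12 reduces to θ' = −α sin θ and has a closed-form travel time. The function name `time_to_reach` and the chosen angles are hypothetical and only serve the illustration; the actual ψ may change the rate at which the time diverges.

```python
import math

def time_to_reach(theta0, theta1, alpha=1.0):
    """Time for theta' = -alpha * sin(theta) to move from theta0 down to theta1.

    ASSUMPTION: psi(theta) is replaced by the constant 1 (not the paper's psi).
    Uses the antiderivative of 1/sin(theta), which is log(tan(theta/2)).
    """
    if not (0.0 < theta1 <= theta0 < math.pi):
        raise ValueError("need 0 < theta1 <= theta0 < pi")
    start = math.log(math.tan(theta0 / 2.0))
    end = math.log(math.tan(theta1 / 2.0))
    return (start - end) / alpha

# The closer the initial angle is to pi (student pointing away from the
# teacher), the longer it takes to reach a fixed target angle.
target = math.pi / 2
times = [time_to_reach(math.pi - eps, target) for eps in (1e-1, 1e-3, 1e-6)]
```

Under this simplification the travel time grows like log(2/ε) as the initial angle π − ε approaches π, so it is unbounded even though the dynamics never stall at any angle strictly below π.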

The role played by over-realization. In this case, over-realization helps by having more student nodes that are possibly ready for shifting towards weaker teachers, and thus accelerates convergence (Fig. 12). Alternatively, we could reinitialize those student nodes (Prakash et al., 2019).

Connection to Competitive Lotka-Volterra equations. The training dynamics is related to Competitive Lotka-Volterra equations (Smale, 1976; Zhu and Yin, 2009; Hosono, 1998), which have been extensively used to model the dynamics of multiple species competing over limited resources in a biological system. In our setting, the teacher nodes are the “resources” and the student nodes are the “species”. The main difference here is that there are multiple teacher nodes that are not mutually replaceable, and a certain kind of resource becomes more important if one species has already obtained a lot of it.
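For reference, the competitive Lotka-Volterra system is commonly written in the following textbook form. The symbols are not taken from this paper and are assumptions of this illustration: $x_i$ is the population of species $i$, $r_i$ its growth rate, $K_i$ its carrying capacity, and $\alpha_{ij}$ the competitive effect of species $j$ on species $i$.

$$\frac{dx_i}{dt} = r_i x_i\left(1 - \frac{\sum_{j=1}^{N}\alpha_{ij}x_j}{K_i}\right), \qquad i = 1,\dots,N$$

In the analogy above, the shared carrying capacity plays the role of the teacher nodes that the student nodes compete to explain.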

## 6 Experiments on Synthetic Dataset

We first verify our theoretical findings on a synthetic dataset. We generate the input distribution using with and we sample as training and another as evaluation.

We construct teacher networks in the following manner. For a two-layer network, the output dimension and input dimension . For a multi-layer network, we use 50-75-100-125 (i.e., , , and ). The teacher network is constructed to satisfy Assumption 1: at each layer, teacher filters are distinct from each other and their biases are set so that of the input data activate the nodes. This makes their boundaries (maximally) visible in the dataset.
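One way to set a bias so that a chosen fraction of the inputs activates a ReLU node is to place the threshold at the matching quantile of the pre-activations. The sketch below assumes this quantile approach; the function name `bias_for_activation_fraction` and the fraction argument `frac` are hypothetical, and the released code may do this differently.

```python
def bias_for_activation_fraction(preacts, frac):
    """Return a bias b such that about `frac` of samples satisfy preact + b > 0.

    `preacts` holds the bias-free pre-activations w^T x of one node over the
    dataset. Ties exactly at the threshold can make the count inexact.
    """
    n = len(preacts)
    k = round(frac * n)              # number of samples that should activate
    s = sorted(preacts)
    if k <= 0:
        return -s[-1] - 1.0          # threshold above the maximum: none fire
    if k >= n:
        return -s[0] + 1.0           # threshold below the minimum: all fire
    # Put the threshold halfway between the last inactive and first active sample.
    threshold = 0.5 * (s[n - k - 1] + s[n - k])
    return -threshold

preacts = [3.0, -1.0, 0.5, 2.0, -2.5, 1.0, -0.5, 4.0, 0.0, -3.0]
b = bias_for_activation_fraction(preacts, 0.5)
num_active = sum(1 for p in preacts if p + b > 0)
```

With this choice the node's decision boundary passes through the data rather than lying outside it, which is the sense in which the boundary is visible in the dataset.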

To train the model, we use vanilla SGD with learning rate and batchsize .

Two-layer networks. First we verify Theorem 2 and Theorem 3 in the 2-layer setting. Fig. 11 shows how student nodes correlate with different teacher nodes over time. Fig. 7 shows that, for different degrees of over-realization (///), nodes with weak specialization (i.e., whose normalized correlation to the most correlated teacher is low) have fan-out weights of small magnitude, whereas nodes with strong specialization have large fan-out weights.

Deep Networks. For deep ReLU networks, we observe specialization not only at the lowest layer, as suggested by Theorem 4, but also at multiple hidden layers. This is shown in Fig. 8. For each student node , the x-axis is its best normalized correlation to teacher nodes, and the y-axis is , which is equivalent to in the 2-layer case. In the plot, we can also see that the lowest layer learns first (the “L-shape” curve is established at epoch 10), then the top layers follow.

Ablation on the effect of over-realization. To further understand the role of over-realization, we plot the average rate at which a teacher node is successfully matched with at least one student node (i.e., correlation ). Fig. 9 shows that stronger teacher nodes are more likely to be matched, while weaker ones may not be explained well, in particular when the strengths of the teacher nodes are polarized ( is large). An over-realized student can explain more teacher nodes. In contrast, although a student with nodes has sufficient capacity to fit the teacher perfectly, it gets stuck despite long training.
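The matching rate described above can be sketched as follows. This is an illustration under assumptions, not the released evaluation code: each run is given as a matrix `corr[j][k]` of normalized correlations between teacher node `j` and student node `k`, the function names are hypothetical, and `threshold` is a free parameter because the value used in the paper is not reproduced here.

```python
def matched_teachers(corr, threshold):
    """For each teacher row, report whether any student exceeds the threshold."""
    return [any(c > threshold for c in row) for row in corr]

def average_match_rate(runs, threshold):
    """Per-teacher success rate, averaged over independent training runs."""
    flags = [matched_teachers(corr, threshold) for corr in runs]
    num_teachers = len(flags[0])
    return [sum(f[j] for f in flags) / len(flags) for j in range(num_teachers)]

# Two hypothetical runs with 3 teacher nodes (rows) and 4 student nodes (columns).
run1 = [[0.99, 0.20, 0.95, 0.10],
        [0.10, 0.97, 0.00, 0.30],
        [0.20, 0.10, 0.30, 0.50]]
run2 = [[0.98, 0.10, 0.10, 0.20],
        [0.30, 0.40, 0.60, 0.20],
        [0.10, 0.20, 0.10, 0.30]]
rates = average_match_rate([run1, run2], threshold=0.9)
```

In this toy input the first teacher is matched in both runs, the second in one run, and the third in neither, mirroring the pattern that stronger teacher nodes are matched more reliably.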

Note that from the loss curve (Fig. 10), the difference is not large since weak teacher nodes do not substantially affect the final loss. However, to achieve state-of-the-art performance on real datasets, grabbing every weak teacher node can be important.

Training Dynamics. We set up teacher nodes of diverse strength by constructing the fan-out weights of teacher node as follows:

$$\|v^*_j\| \sim 1/j^p, \tag{13}$$

where $p$ is the teacher polarity factor that controls how strongly the energy decays across different teacher nodes. $p = 0$ means all teacher nodes are symmetric, and large $p$ means that the strengths of teacher nodes are more polarized.
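As a small illustration of Eqn. 13, the sketch below lists the target fan-out norms for a few polarity factors. Eqn. 13 fixes the norms only up to proportionality, so the unit scale used here is an assumption, as is the function name `teacher_fanout_norms`.

```python
def teacher_fanout_norms(num_teachers, p):
    """Target norms 1 / j**p for teacher nodes j = 1..num_teachers.

    ASSUMPTION: the proportionality constant in Eqn. 13 is taken to be 1.
    """
    return [1.0 / (j ** p) for j in range(1, num_teachers + 1)]

symmetric = teacher_fanout_norms(4, 0)   # every teacher node equally strong
polarized = teacher_fanout_norms(4, 2)   # strength decays quickly with j
```

With polarity 0 all four nodes have norm 1, while with polarity 2 the fourth node has norm 1/16, so it contributes far less to the teacher output than the first.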

Fig. 11 and Fig. 12 show that many student nodes specialize to a strong teacher node first. Once the strong teacher node is covered well, weaker teacher nodes are covered only after many epochs.

## 7 Experiments on CIFAR-10

We also experiment on CIFAR-10. We first pre-train a teacher network, a 64-64-64-64 ConvNet ( are channel sizes of the hidden layers, ), on the CIFAR-10 training set. Then the teacher network is pruned in a structured manner to keep strong teacher nodes. The student is over-realized based on the teacher’s remaining channels.

The convergence and specialization behaviors of the student network are shown in Fig. 13. Specialization happens at all layers for different degrees of over-realization. Over-realization boosts student specialization, measured by the mean of maximal normalized correlation at each layer ( is the normalized activation of node over evaluation samples), and improves generalization, evaluated on the CIFAR-10 evaluation set.
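A minimal sketch of this specialization measure for one layer follows. It rests on two assumptions that the text here does not settle: activations are normalized by centering and scaling to unit length over the evaluation samples, and the maximum is taken over student nodes for each teacher node before averaging over teacher nodes. The function names are hypothetical.

```python
import math

def normalize(acts):
    """Center an activation vector and scale it to unit length (assumed normalization)."""
    mean = sum(acts) / len(acts)
    centered = [a - mean for a in acts]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return [0.0] * len(acts)     # constant (e.g. dead) node: no correlation
    return [c / norm for c in centered]

def mean_max_correlation(teacher_acts, student_acts):
    """Average, over teacher nodes, of the best correlation with any student node."""
    t = [normalize(a) for a in teacher_acts]
    s = [normalize(a) for a in student_acts]
    best = [max(sum(x * y for x, y in zip(tj, sk)) for sk in s) for tj in t]
    return sum(best) / len(best)

# Each inner list holds one node's activations over four evaluation samples.
teacher_acts = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
student_acts = [[2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 0.0, 0.0], [4.0, 1.0, 3.0, 2.0]]
score = mean_max_correlation(teacher_acts, student_acts)
```

Here each teacher node has a student node that is a positively scaled or exact copy of it, so the score is 1 up to floating-point error; the all-zero student node contributes nothing.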

## 8 Conclusion and Future Work

In this paper, we use a student-teacher setting to analyze how an (over-realized) deep ReLU student network trained with SGD learns from the output of a teacher. When the magnitude of the gradient per sample is small (student weights are near the critical points), the teacher can be proven to be specialized by (possibly multiple) students and thus the teacher network is recovered in the lowest layer. By analyzing training dynamics, we also show that a strong teacher node with large is specialized first, while a weak teacher node is specialized slowly. This reveals one important reason why training takes a long time to reconstruct all teacher weights and why generalization, indicated by the number of teacher nodes covered by the student, improves with more training.

Our analysis might help address the following apparent contradiction in neural network training: on one hand, over-parameterization helps generalization; on the other hand, works on model pruning (LeCun et al., 1990; Liu et al., 2019; Hu et al., 2016; Han et al., 2015) show there is a lot of parameter redundancy in trained deep models, and the recent lottery ticket hypothesis (Frankle and Carbin, 2019; Frankle et al., 2019; Zhou et al., 2019; Morcos et al., 2019) shows that a subnetwork in the trained model, if trained alone, could achieve comparable or even better generalization. Based on our analysis, the intuition to explain both sides of the empirical evidence is that a good initialization is important but hard to find, and over-parameterization (or over-realization with respect to a latent teacher network representing the dataset) can help compensate for that by creating more observers (Theorem 4) and accelerating the focus shifting (Sec. 5.3). Due to the symmetry breaking property (Sec. 5.3), small changes in the initialization could lead to huge differences in the learned weights. Therefore, it is unlikely to directly find the lucky weights or winning tickets at initialization, until after the symmetry breaking happens. A formal analysis of initialization is left to future work.

As the next step, we would like to extend our analysis to the finite-sample case and analyze the training dynamics in a more formal way. Verifying the insights from the theoretical analysis on a large dataset (e.g., ImageNet) is another next step.

## 9 Acknowledgement

I thank Banghua Zhu, Xiaoxia (Shirley) Wu, Lexing Ying, Jonathan Frankle and Ari Morcos for insightful discussions.

## References

• Z. Allen-Zhu, Y. Li, and Y. Liang (2019) Learning and generalization in overparameterized neural networks, going beyond two layers. NeurIPS. Cited by: §2.
• Z. Allen-Zhu, Y. Li, and Z. Song (2019) A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 242–252. Cited by: §2.
• B. Aubin, A. Maillard, F. Krzakala, N. Macris, L. Zdeborová, et al. (2018) The committee machine: computational to statistical gaps in learning a two-layers neural network. pp. 3223–3234. Cited by: §1, §1, §2.
• R. Bassily, M. Belkin, and S. Ma (2018) On exponential convergence of sgd in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564. Cited by: §1.
• L. Chizat and F. Bach (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. pp. 3036–3046. Cited by: §2.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
• S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai (2019) Gradient descent finds global minima of deep neural networks. ICML. Cited by: §1, §2.
• S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh (2018) Gradient descent learns one-hidden-layer cnn: don’t be afraid of spurious local minima. ICML. Cited by: §2.
• A. Engel and C. Van den Broeck (2001) Statistical mechanics of learning. Cambridge University Press. Cited by: §2.
• J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: training pruned neural networks. ICLR. Cited by: §8.
• J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin (2019) The lottery ticket hypothesis at scale. arXiv preprint arXiv:1903.01611. Cited by: §8.
• J. A. Freeman and D. Saad (1997) Online learning in radial basis function networks. Neural Computation 9 (7), pp. 1601–1622. Cited by: §1.
• E. Gardner and B. Derrida (1989) Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General 22 (12), pp. 1983–1994. Cited by: §1, §2.
• R. Ge, J. D. Lee, and T. Ma (2017) Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501. Cited by: §2, §5.
• S. Goldt, M. S. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová (2019) Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. NeurIPS. Cited by: Student Specialization in Deep ReLU Networks With Finite Width and Input Dimension, §1, §1, §1, §1, §2.
• S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §8.
• B. Hassibi, D. G. Stork, and G. J. Wolff (1993) Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. Cited by: §4.3.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
• K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §1.
• Y. Hosono (1998) The minimal speed of traveling fronts for a diffusive lotka-volterra competition model. Bulletin of Mathematical Biology 60 (3), pp. 435–448. Cited by: §5.3.
• H. Hu, R. Peng, Y. Tai, and C. Tang (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250. Cited by: §4.3, §8.
• A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. pp. 8571–8580. Cited by: Student Specialization in Deep ReLU Networks With Finite Width and Input Dimension, §1, §2.
• K. Kawaguchi (2016) Deep learning without poor local minima. In Advances in neural information processing systems, pp. 586–594. Cited by: §2.
• R. Kuditipudi, X. Wang, H. Lee, Y. Zhang, Z. Li, W. Hu, S. Arora, and R. Ge (2019) Explaining landscape connectivity of low-cost solutions for multilayer nets. CoRR abs/1906.06247. Cited by: §4.3.
• T. Laurent and J. Brecht (2018) Deep linear networks with arbitrary loss: all local minima are global. In International Conference on Machine Learning, pp. 2908–2913. Cited by: §2.
• T. Laurent and J. von Brecht (2017) The multilinear structure of relu networks. arXiv preprint arXiv:1712.10132. Cited by: §2, §4.3.
• Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §4.3, §8.
• C. Li, H. Farkhoor, R. Liu, and J. Yosinski (2018) Measuring the intrinsic dimension of objective landscapes. ICLR. Cited by: §4.4.
• Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. E. Hopcroft (2016) Convergent learning: do different neural networks learn the same representations?. In ICLR, Cited by: §1.
• Y. Li and Y. Liang (2018) Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166. Cited by: §2.
• C. Liu and M. Belkin (2018) Mass: an accelerated stochastic method for over-parametrized learning. arXiv preprint arXiv:1810.13395. Cited by: §1.
• Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In International Conference on Learning Representations. Cited by: §8.
• R. Livni, S. Shalev-Shwartz, and O. Shamir (2014) On the computational efficiency of training neural networks. In Advances in neural information processing systems, pp. 855–863. Cited by: §5.
• S. Ma, R. Bassily, and M. Belkin (2017) The power of interpolation: understanding the effectiveness of sgd in modern over-parametrized learning. arXiv preprint arXiv:1712.06559. Cited by: §1.
• C. Mace and A. Coolen (1998) Statistical mechanical analysis of the dynamics of learning in perceptrons. Statistics and Computing 8 (1), pp. 55–88. Cited by: §1.
• S. Mei, A. Montanari, and P. Nguyen (2018) A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 (33), pp. E7665–E7671. Cited by: §1.
• A. S. Morcos, H. Yu, M. Paganini, and Y. Tian (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. NeurIPS. Cited by: §8.
• Q. Nguyen and M. Hein (2017) The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2603–2612. Cited by: §2.
• A. Prakash, J. Storer, D. Florencio, and C. Zhang (2019) RePr: improved training of convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10666–10675. Cited by: §5.3.
• M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein (2017) Svcca: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pp. 6076–6085. Cited by: §5.1.
• D. Saad and S. A. Solla (1995) On-line learning in soft committee machines. Physical Review E 52 (4), pp. 4225. Cited by: §1, §2.
• D. Saad and S. A. Solla (1996) Dynamics of on-line gradient descent learning for multilayer neural networks. In Advances in neural information processing systems, pp. 302–308. Cited by: §1, §1, §1, §2.
• I. Safran and O. Shamir (2017) Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968. Cited by: §2.
• D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §1.
• S. Smale (1976) On the differential equations of species in competition. Journal of Mathematical Biology 3 (1), pp. 5–7. Cited by: §5.3.
• Y. Tian, T. Jiang, Q. Gong, and A. Morcos (2019) Luck matters: understanding training dynamics of deep relu networks. arXiv preprint arXiv:1905.13405. Cited by: §1, §2, §4.3.
• C. Yun, S. Sra, and A. Jadbabaie (2018) Global optimality conditions for deep neural networks. Cited by: §2.
• C. Yun, S. Sra, and A. Jadbabaie (2019) Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations. Cited by: §2.
• C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. ICLR. Cited by: §1.
• H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. NeurIPS. Cited by: §8.
• C. Zhu and G. Yin (2009) On competitive lotka–volterra model in random environments. Journal of Mathematical Analysis and Applications 357 (1), pp. 154–170. Cited by: §5.3.

## 10 Appendix

### 10.1 Lemma 1

###### Proof.

We prove by induction. When we know that , by setting and the fact that (no ReLU gating in the last layer), the condition holds.

Now suppose for layer , we have:

$$\begin{aligned}
g_l(x) &= D_l(x)\left[A_l(x)f^*_l(x) - B_l(x)f_l(x)\right] && (14)\\
&= D_l(x)V_l^\top(x)\left[V^*_l(x)f^*_l(x) - V_l(x)f_l(x)\right] && (15)
\end{aligned}$$

Using

$$\begin{aligned}
f_l(x) &= D_l(x)W_l^\top f_{l-1}(x) && (16)\\
f^*_l(x) &= D^*_l(x)W_l^{*\top} f^*_{l-1}(x) && (17)\\
g_{l-1}(x) &= D_{l-1}(x)W_l g_l(x) && (18)
\end{aligned}$$

we have:

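$$\begin{aligned}
g_{l-1}(x) &= D_{l-1}(x)W_l g_l(x) && (19)\\
&= D_{l-1}(x)\underbrace{W_l D_l(x)V_l^\top(x)}_{V_{l-1}^\top(x)}\left[V^*_l(x)f^*_l(x) - V_l(x)f_l(x)\right] && (20)\\
&= D_{l-1}(x)V_{l-1}^\top(x)\left[\underbrace{V^*_l(x)D^*_l(x)W_l^{*\top}}_{V^*_{l-1}(x)}f^*_{l-1}(x) - \underbrace{V_l(x)D_l(x)W_l^\top}_{V_{l-1}(x)}f_{l-1}(x)\right] && (21)\\
&= D_{l-1}(x)V_{l-1}^\top(x)\left[V^*_{l-1}(x)f^*_{l-1}(x) - V_{l-1}(x)f_{l-1}(x)\right] && (22)
\end{aligned}$$

 = Dl−1(x)[Al−1(x)f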