Revisiting “Over-smoothing” in Deep GCNs
Oversmoothing has been assumed to be the major cause of the performance drop in deep graph convolutional networks (GCNs). The evidence is usually derived from Simple Graph Convolution (SGC), a linear variant of GCNs. In this paper, we revisit graph node classification from an optimization perspective and argue that GCNs can actually learn anti-oversmoothing, whereas overfitting is the real obstacle in deep GCNs. This work interprets GCNs and SGCs as two-step optimization problems and explains why deep SGC suffers from oversmoothing but deep GCNs do not. Our conclusion is compatible with the previous understanding of SGC, but we clarify why the same reasoning does not apply to GCNs. Based on our formulation, we provide more insights into the convolution operator and further propose a mean-subtraction trick to accelerate the training of deep GCNs. We verify our theory and propositions on three graph benchmarks. The experiments show that (i) in GCN, overfitting leads to the performance drop, and oversmoothing does not occur even when the model goes very deep (100 layers); (ii) mean-subtraction speeds up model convergence while retaining the same expressive power; (iii) the weight of neighbor averaging ( is the common setting) does not significantly affect the model performance once it is above the threshold ().
Graph neural networks (GNNs) are widely used in modeling real-world connections, such as protein networks, social networks, and co-author networks. One could also construct similarity graphs by linking data points that are close in the feature space even when there is no explicit graph structure. There have been several successful GNN architectures: ChebyshevNet, GCN, SGC, GAT, GraphSAGE, and other subsequent variants tailored for practical applications [8, 9, 10].
Recently, researchers have started to explore the fundamentals of GNNs, such as expressive power [11, 12, 13, 14], and to analyze their capacity and limitations. One of the frequently mentioned limitations is oversmoothing. In deep GCNs, oversmoothing means that after multi-layer graph convolution, the effect of Laplacian smoothing makes node representations more and more similar, until they eventually become indistinguishable. This issue has been widely discussed since it was first raised, for example in JKNet, DenseGCN, DropEdge, and PairNorm. However, these discussions mostly concern the powering effect of the convolution operator, $\hat{A}^K$ (where $\hat{A}$ is the convolution operator and $K$ is the number of layers). This essentially implies a GCN variant without an activation function (i.e., a simplified graph convolution, or SGC).
In this work, we instead argue that deep GCNs can learn anti-oversmoothing, while overfitting is the real cause of the performance drop. This paper focuses on graph node classification and starts from the perspective of graph-based optimization (minimizing a combination of $\mathcal{L}_0$ and $\mathcal{L}_{reg}$) [20, 21], where $\mathcal{L}_0$ is the (supervised) empirical loss and $\mathcal{L}_{reg}$ is an (unsupervised) graph regularizer, which encodes smoothness over the connected node pairs. We interpret a GCN as a two-step optimization problem: (i) forward propagation to minimize $\mathcal{L}_{reg}$ by viewing the parameters $W$ as constants and the features $X$ as parameters, and (ii) back propagation to minimize $\mathcal{L}_0$ by updating $W$. This work therefore derives GCN as:
Similarly, an SGC is interpreted as:
From this formulation, we show that a deep SGC indeed suffers from oversmoothing, but deep GCNs will learn to prevent oversmoothing because (i) the outcome of STEP1 is conditioned on the parameters $W$ in GCN, and (ii) the two objectives $\mathcal{L}_{reg}$ and $\mathcal{L}_0$ are in tension with each other. To prevent gradient vanishing/exploding, in this paper, we add skip connections to all deep architectures by default. An illustration of deep GCNs, deep SGC, and directly learning using DNN is shown in Fig. 1.
As mentioned above, the training of deep GCNs is a learning process of anti-oversmoothing, which is extremely slow in practice and sometimes does not converge. Based on the formulation, we further propose a mean-subtraction trick to accelerate the training of deep GCNs. Extensive experiments verify our theories and provide more insights about deep GCNs.
2 Background of Graph Transductive Learning
Graph representation learning aims at embedding the nodes into low-dimensional vectors while simultaneously preserving both the graph topology and the node feature information. Given a graph $G = (V, E)$, let $V = \{v_1, \dots, v_n\}$ be the set of nodes, and let $\mathcal{Y}$ be a set of possible classes. Assume that each node $v_i$ is associated with a class label $y_i \in \mathcal{Y}$. A graph can be represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$ with $A_{ij} = 1$ when two nodes $v_i$ and $v_j$ are connected. The degree matrix $D$ is diagonal, where $D_{ii} = \sum_j A_{ij}$. Let $X \in \mathbb{R}^{n \times d}$ denote the feature vectors for each node. Given a labelled set $T \subset V$, the goal of transductive learning on a graph is to predict labels for the remaining unknown nodes $V \setminus T$. A well-studied solution category is to include graph regularizers [23, 20, 24, 25] into the classification algorithm. Graph-convolution-based models [3, 4, 6, 7] are a powerful learning approach in this space.
2.1 Graph-based Regularization
There is a rather general class of embedding algorithms that include graph regularizers. They could be described as: finding a mapping by minimizing the following two-fold loss:
where is the low-dimensional representation of nodes. The first term is the empirical risk on the labelled set . The second term is a graph regularizer over connected pairs, so as to make sure that a trivial solution is not reached.
The measurements on graphs are usually invariant to node permutations. A canonical way is to use Dirichlet energy for the graph-based regularization,
where $\Delta = I - D^{-1/2} A D^{-1/2}$ is the normalized Laplacian operator, which induces a semi-norm that penalizes the changes between adjacent vertices. The same normalized formulation can be found in [27, 28, 29, 30, 20], and some related literature also uses the unnormalized version [24, 31].
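As a minimal sketch of the regularizer described above, the snippet below evaluates a Dirichlet energy on a toy graph in plain Python. It assumes the scalar-free convention $\mathrm{Tr}(X^\top \Delta X)$ with $\Delta = I - D^{-1/2} A D^{-1/2}$ and a graph without isolated nodes; the helper names and the toy graph are illustrative and not taken from the paper. The second function computes the equivalent edge-wise form, which makes the "penalizing changes between adjacent vertices" reading explicit.

```python
import math

def normalized_laplacian(adj):
    """Return I - D^{-1/2} A D^{-1/2} for a dense, symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def dirichlet_energy(adj, feats):
    """Trace form Tr(X^T L X), accumulated one feature channel at a time."""
    lap = normalized_laplacian(adj)
    n = len(adj)
    total = 0.0
    for col in zip(*feats):  # each column of X is one feature channel
        lap_col = [sum(lap[i][j] * col[j] for j in range(n)) for i in range(n)]
        total += sum(c * lc for c, lc in zip(col, lap_col))
    return total

def edge_energy(adj, feats):
    """Edge form: sum over undirected edges of ||x_i/sqrt(d_i) - x_j/sqrt(d_j)||^2."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j]:
                total += sum((a / math.sqrt(deg[i]) - b / math.sqrt(deg[j])) ** 2
                             for a, b in zip(feats[i], feats[j]))
    return total

# Toy graph (hypothetical): a triangle 0-1-2 with a tail node 3 attached to node 2.
ADJ = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
FEATS = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]

if __name__ == "__main__":
    # The two forms should agree up to floating-point error.
    print(dirichlet_energy(ADJ, FEATS), edge_energy(ADJ, FEATS))
```

Under this convention, a signal proportional to the square root of the node degrees has zero energy, which is the fully smoothed configuration that the later sections associate with oversmoothing.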
2.2 Graph Convolutional Network
GCNs are derived from graph signal processing [32, 33, 34]. In the spectral domain, the Laplacian operator is a real-valued, symmetric, positive semidefinite matrix, and the graph convolution is parameterized by a learnable filter on its eigenvalue matrix. Kipf et al. assumed the largest eigenvalue to be approximately 2 (i.e., $\lambda_{max} \approx 2$) and simplified the filter with a two-term (first-order) Chebyshev expansion,
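A multi-layer graph convolutional network (GCN) is formulated as the following layer-wise propagation rule ($\sigma$ is an activation function, e.g., ReLU):
$$X^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{(l)} W^{(l)}\right),$$
where $\tilde{A} = A + I$ is the renormalization trick (with $\tilde{D}$ the degree matrix of $\tilde{A}$), and $X^{(l)}$ and $W^{(l)}$ are the layer-wise feature and parameter matrices, respectively.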
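The propagation rule can be sketched in a few lines of plain Python. This is an illustrative, dependency-free version of a single layer with the renormalization trick (self-loops added before symmetric normalization); the function names are hypothetical and the dense list-of-lists representation is chosen only for readability, not efficiency.

```python
import math

def renormalized_operator(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense, symmetric adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(operator @ feats @ weight)."""
    op = renormalized_operator(adj)
    return relu(matmul(matmul(op, feats), weight))

if __name__ == "__main__":
    # Two connected nodes: with self-loops, each node averages itself and its neighbor.
    print(gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]]))
```

Stacking several calls to `gcn_layer` with different weight matrices gives the multi-layer model; dropping `relu` and collapsing the weights gives the linear SGC variant discussed later.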
3 GCN as Two-Step Optimization
In this section, we re-interpret a GCN as a two-step optimization problem, where STEP1 minimizes the graph regularizer $\mathcal{L}_{reg}$ by viewing the parameters $W$ as constants and the features $X$ as parameters, while STEP2 minimizes the empirical loss $\mathcal{L}_0$ by updating $W$. Overall, the GCN architecture is interpreted as a layer-wise combination of the MLP architecture and a gradient descent algorithm that minimizes $\mathcal{L}_{reg}$. Meanwhile, the training process of the parameters $W$ is entirely inherited from the MLP and aims only at minimizing $\mathcal{L}_0$. Let us first discuss a gradient descent algorithm for minimizing the trace form of $\mathcal{L}_{reg}$.
3.1 Gradient Descent for Trace Optimization.
Given the Laplacian operator $\Delta$, we consider minimizing the trace $\mathrm{Tr}(X^\top \Delta X)$ over the feature domain $X \in \mathbb{R}^{n \times d}$, where $d$ is the input dimension. To prevent the trivial solution $X = 0$, we impose an energy constraint on $X$, i.e., we fix its Frobenius norm. The trace optimization problem is:
where $\|X\|_F$ denotes the Frobenius norm of $X$. To solve this, we equivalently transform the optimization problem into the Rayleigh quotient form $R(X)$, which is,
It is clear that $R(X)$ is scale-invariant in $X$, i.e., $R(cX) = R(X)$ for any $c \neq 0$.
Suppose an initial guess is $X$. One step of trace optimization aims at finding a better guess $X'$ that has the same norm as $X$ and satisfies $R(X') \le R(X)$. Our strategy is to first view the problem as an unconstrained optimization on $X$ and update the guess to an intermediate solution $\hat{X}$ by gradient descent. Then we rescale $\hat{X}$ and reach the improved guess $X'$, which meets the norm constraint.
Given the initial guess $X$, we move against the derivative of $R(X)$ with the learning rate $\eta$ and reach an intermediate solution $\hat{X}$ in the unconstrained space:
Immediately, we get $R(\hat{X}) \le R(X)$ (details in Appendix B). Then, we rescale $\hat{X}$ by a constant to meet the norm constraint, which yields the improved guess $X'$; by scale invariance, $R(X') = R(\hat{X})$. Therefore, the improved guess $X'$ satisfies $R(X') \le R(X)$ and has the following form,
Note that, if we run the trace optimization algorithm enough times, the solution will finally be proportional to the eigenvector associated with the largest eigenvalue of the propagation operator (see Theorem 1), which causes oversmoothing.
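The one-step procedure can be illustrated with a small, dependency-free sketch. The conventions here are assumptions made for the example, not the paper's exact scaling: the Rayleigh quotient is taken as $R(X) = \mathrm{Tr}(X^\top \Delta X) / \mathrm{Tr}(X^\top X)$ without any extra scalar, its gradient is $2(\Delta X - R(X) X) / \mathrm{Tr}(X^\top X)$, and the step size is chosen as $\mathrm{Tr}(X^\top X) / (2(2 - R(X)))$ so that the intermediate solution equals $(2I - \Delta)X / (2 - R(X))$, i.e., it is proportional to $(I + D^{-1/2} A D^{-1/2})X$. The function names and the toy graph are hypothetical.

```python
import math

def normalized_laplacian(adj):
    """Return I - D^{-1/2} A D^{-1/2} for a dense, symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frob_sq(m):
    """Squared Frobenius norm, i.e., Tr(M^T M)."""
    return sum(v * v for row in m for v in row)

def rayleigh(lap, x):
    """R(X) = Tr(X^T L X) / Tr(X^T X) (scalar-free convention, assumed here)."""
    lx = matmul(lap, x)
    return sum(a * b for rx, rl in zip(x, lx) for a, b in zip(rx, rl)) / frob_sq(x)

def trace_step(lap, x):
    """One gradient step on R(X), followed by rescaling back to the original norm."""
    r = rayleigh(lap, x)
    norm_sq = frob_sq(x)
    lx = matmul(lap, x)
    # Gradient of R at X: 2 (L X - R X) / Tr(X^T X).
    grad = [[2.0 * (l - r * v) / norm_sq for l, v in zip(rl, rx)]
            for rl, rx in zip(lx, x)]
    # Assumed step size: makes the update equal to (2I - L) X / (2 - R).
    eta = norm_sq / (2.0 * (2.0 - r))
    mid = [[v - eta * g for v, g in zip(rx, rg)] for rx, rg in zip(x, grad)]
    # Rescale so that the improved guess has the same Frobenius norm as X.
    scale = math.sqrt(norm_sq / frob_sq(mid))
    return [[scale * v for v in row] for row in mid]

# Toy graph (hypothetical): a triangle 0-1-2 with a tail node 3 attached to node 2.
ADJ = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
X0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0], [0.5, -1.0]]

if __name__ == "__main__":
    lap = normalized_laplacian(ADJ)
    x = X0
    for step in range(5):
        print(step, rayleigh(lap, x))  # the quotient should not increase
        x = trace_step(lap, x)
```

Repeating `trace_step` many times drives the quotient toward its minimum on this connected, non-bipartite toy graph, which is the repeated-smoothing behavior that the note above associates with oversmoothing.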
3.2 Layer-wise Propagation and Optimization
We introduce the trace optimization solution into the layer-wise propagation of MLP. Given the node set , features and a labelled set , a label mapping, , is usually a deep neural network, which could be tailored according to the practical applications. For example, could be a convolutional neural network (CNN) for image recognition or a recurrent neural network (RNN) for language processing. In this scenario, we first consider a simple multi-layer perceptron (MLP). The forward propagation rule of an MLP is given by,
where and are layer-wise parameters and inputs.
STEP1: minimizing $\mathcal{L}_{reg}$ in Forward Propagation.
Let us fix the parameters $W$ and consider $X^{(l)}$, i.e., the output of the $l$-th layer, as an initial guess of the trace optimization problem. We know from Sec. 3.1 that, through one step of gradient descent, Eqn. (9) will find an improved guess for minimizing $\mathcal{L}_{reg}$:
where the constant scalar in Eqn. (11) is absorbed into the parameter matrix $W$. Therefore, a GCN forward propagation essentially applies STEP1 layer-wise in the forward propagation of an MLP, which is formulated as a composition of mappings on the initial feature $X$.
STEP2: minimizing $\mathcal{L}_0$ in Back Propagation.
After forward propagation, the cross-entropy loss $\mathcal{L}_0$ over the labelled set is calculated. In this procedure, we conversely view the features $X$ as constants and $W$ as the parameters of the MLP and run the standard back-propagation algorithm.
3.3 GCN: combining STEP1 and STEP2
In essence, STEP1 defines a combined architecture, where the layer-wise propagation is adjusted by an additional step of trace optimization. In STEP2, under that architecture, the optimal $W$ is learned, and a low-dimensional representation is reached with respect to $\mathcal{L}_0$ explicitly and $\mathcal{L}_{reg}$ implicitly, after standard loss back-propagation. We express it as a two-step optimization,
In this section, the learning rate is specially chosen, and it satisfies since is semi-definite and is smaller than the largest eigenvalue of , which is smaller than . In the experiment section, we reveal that is related to the weight of neighbor information averaging. We further test different and provide more insights on graph convolution operators. In the following sections, we use to denote the convolutional operator and use for the random walk form .
4 The Over-smoothing Issue
The recent successes in applying GNNs are largely limited to shallow architectures (e.g., 2-4 layers). Model performance decreases when adding more intermediate layers. As summarized in prior work, there are three possible contributing factors: (i) overfitting due to the increasing number of parameters (one matrix per layer); (ii) gradient vanishing/exploding; (iii) oversmoothing due to Laplacian smoothing. The first two points are common to all deep neural networks. The issue of oversmoothing is therefore the focus of this work. We show that deep GCNs can learn anti-oversmoothing by nature, but overfitting is the major cause of the performance drop.
In this section, we first recall the commonly discussed oversmoothing problem (in SGC). Based on the re-formulation in Sec. 3.3, we then show how deep GCNs have the ability to learn anti-oversmoothing. Further, we propose an easy but effective mean-subtraction trick to speed up anti-oversmoothing, which accelerates the convergence in training deep GCNs.
4.1 Over-smoothing in SGC.
Oversmoothing means that node representations become similar and finally indistinguishable after multi-layer graph convolution. Starting from random walk theory, the analysis of oversmoothing is usually done on a connected, undirected, and non-bipartite graph. The issue is discussed in [16, 17, 15, 18, 19], mainly for the $K$-layer linear SGC.
SGC was proposed with the hypothesis that the non-linear activation is not critical, while the majority of the benefit arises from local averaging. The authors directly removed the activation function and proposed a linear “$K$-layer” model,
where has collapsed into a single . This model explicitly disentangles the dependence of STEP1 and STEP2. We similarly formulate the SGC model in the form of two-step optimization,
Given any random signal and a symmetric matrix , the following property holds almost everywhere on , where has non-negative eigenvalues and is the eigenvector associated with the largest eigenvalue of .
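The two widely used convolution operators, the symmetric form $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ and the random walk form $\tilde{D}^{-1}\tilde{A}$ (where $\tilde{A} = A + I$ and $\tilde{D}$ is its degree matrix), have the same dominant eigenvalue $1$, with eigenvectors $\tilde{D}^{1/2}\mathbf{1}$ and $\mathbf{1}$, respectively. From the two-step optimization form, SGC essentially runs the gradient descent algorithm $K$ times in STEP1. According to Theorem 1 (see the proof in Appendix D), if $K$ goes to infinity, then each output feature channel becomes proportional to $\tilde{D}^{1/2}\mathbf{1}$ or $\mathbf{1}$, which means oversmoothing. In STEP2, the SGC model seeks to minimize $\mathcal{L}_0$ on the basis of the oversmoothed features. The independence between STEP1 and STEP2 accounts for the performance drop in deep SGC.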
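The limiting behavior can be observed numerically on a toy graph. The sketch below is illustrative only: it applies the symmetric renormalized operator repeatedly to a small feature matrix and measures, per feature channel, the cosine similarity with the vector of square roots of the self-loop-augmented degrees, which is the dominant eigenvector of that operator. The function names, the graph, and the features are hypothetical.

```python
import math

def renormalized_operator(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense, symmetric adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def sgc_features(adj, feats, k):
    """Apply the convolution operator k times (the linear part of a k-layer SGC)."""
    op = renormalized_operator(adj)
    for _ in range(k):
        feats = matmul(op, feats)
    return feats

def channel_alignment(adj, feats):
    """Cosine between each feature channel and the dominant eigenvector D~^{1/2} 1."""
    dominant = [math.sqrt(sum(row) + 1.0) for row in adj]
    return [cosine(col, dominant) for col in zip(*feats)]

# Toy graph (hypothetical): a triangle 0-1-2 with a tail node 3 attached to node 2.
ADJ = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
FEATS = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 5.0]]

if __name__ == "__main__":
    for k in (1, 2, 8, 64):
        print(k, channel_alignment(ADJ, sgc_features(ADJ, FEATS, k)))
```

As the number of convolutions grows, both channels align with the same vector, so the node features carry no information beyond the node degrees; a classifier trained afterwards (STEP2) has nothing left to separate.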
4.2 Anti-oversmoothing in GCN.
In contrast, in GCN the result of STEP1 depends on the parameters $W$, i.e., also on STEP2. In fact, after STEP1, node representations in GCN will be oversmoothed to some extent as well. However, during STEP2, GCN learns to update $W$ layer-wise and make node features separable, such that $\mathcal{L}_0$ is minimized; in the process, the effect of STEP1 (minimizing $\mathcal{L}_{reg}$, i.e., making node features inseparable) is mitigated, and $\mathcal{L}_{reg}$ actually increases implicitly. In essence, the dependency of STEP1 on $W$ enables GCNs to do anti-oversmoothing during STEP2.
We demonstrate on the Karate club dataset: this graph has 34 vertices of 4 classes (the same labeling as [3, 35]) and 78 edges. A 32-layer GCN (deep enough for this demo dataset), with 16 hidden units in each layer, is considered. The model is trained on an identity feature matrix with basic residual connections. The training set consists of two labeled examples per class. After 1000 epochs, the model achieves accuracy in the testing samples. We present the feature-wise smoothing by layer in Fig. 2.a and node-wise smoothing by layer in Fig. 2.b. The y-axis scores of the first two figures are calculated by cosine similarity. We also calculate and present $\mathcal{L}_0$ as well as $\mathcal{L}_{reg}$ for each training epoch in Fig. 2.c. More details of the demo are given in Appendix F.
From the demonstrations, we observe that without training (blue curves in Fig. 2.a and Fig. 2.b and the beginning of Fig. 2.c), the issue of oversmoothing does exist for deep GCNs, because forward propagation mixes up features layer by layer. However, this issue is automatically addressed during training: with more training epochs, feature-wise smoothing and node-wise smoothing are gradually mitigated. The effect is more obvious in Fig. 2.c, where $\mathcal{L}_{reg}$ actually increases during training (a small $\mathcal{L}_{reg}$ indicates oversmoothing). The gradual increase in $\mathcal{L}_{reg}$ demonstrates that GCNs have the ability to learn anti-oversmoothing by nature. The next question is then: what is the real cause of the performance drop in deep GCNs? Our answer is overfitting. We give practical support in the experimental section.
4.3 Mean-subtraction: an Accelerator
Although deep GCNs can learn anti-oversmoothing naturally, another practical problem appears: the convergence of training deep GCNs is extremely slow (and training sometimes does not converge). This issue has not been explored extensively in the literature. In this work, we present a mean-subtraction trick to accelerate the training of deep GCNs, which theoretically magnifies the effect of the Fiedler vector. PairNorm also includes a mean-subtraction step; however, its purpose there is to simplify the derivation. This section provides more insights and motivation for using mean-subtraction.
There are primarily two reasons to use mean-subtraction: (i) deep neural network classifiers are discriminators that draw a boundary between classes. Therefore, the mean feature of the entire dataset (a DC signal) does not help with the classification, whereas the components away from the center (the AC signal) matter; (ii) layer-wise mean-subtraction eliminates the dominant eigen component (the dominant eigenvector of the convolution operator) and actually magnifies the Fiedler vector (the eigenvector associated with the second smallest eigenvalue of the graph Laplacian), which reveals important community information and graph conductance. This helps to set an initial graph partition and speeds up model training (STEP2).
We start with one of the most popular convolution operators and its largest eigenvector . Given any non-zero , the mean-subtraction gives,
where . Eqn. (16) reveals that mean-subtraction reduces the components aligned with -space. This is exactly a step of numerical approximation of the Fiedler vector, which sets the initial graph partition (demonstration in Appendix F) and makes the feature separable. For , the formulation could be adjusted by a factor of (refer to Appendix E).
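A small sketch may help make the projection view concrete. Assuming the dominant eigenvector in question is the constant vector (as for the random walk operator), subtracting the per-channel mean is the same as removing each channel's component along the unit vector $\mathbf{1}/\sqrt{n}$. The helper names below are hypothetical, and `project_out` accepts an arbitrary direction only to show how the same step would look for a different dominant eigenvector.

```python
import math

def subtract_mean(feats):
    """Subtract the per-channel mean over nodes from a node-by-channel feature matrix."""
    n = len(feats)
    means = [sum(col) / n for col in zip(*feats)]
    return [[v - m for v, m in zip(row, means)] for row in feats]

def project_out(feats, direction):
    """Remove from every feature channel its component along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Coefficient of each channel along the unit direction.
    coeffs = [sum(u * v for u, v in zip(unit, col)) for col in zip(*feats)]
    return [[v - c * u for v, c in zip(row, coeffs)] for row, u in zip(feats, unit)]

# Hypothetical features for 4 nodes and 2 channels.
FEATS = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]

if __name__ == "__main__":
    print(subtract_mean(FEATS))
    print(project_out(FEATS, [1.0] * len(FEATS)))  # same result up to rounding
```

After this step every channel sums to zero over the nodes, so repeated convolution can no longer collapse the features onto the constant direction, and the next eigen components dominate instead.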
In this section, we present experimental evidence to answer the following three questions: (i) Is oversmoothing an issue in deep GCNs, and why? (ii) How can the training of deep GCNs be stabilized and accelerated? (iii) Does the learning rate matter, and what happens when it changes? We also provide more insights and draw useful conclusions for the practical usage of GCN models and their variants.
The experiments show the performance of deep GCNs on semi-supervised node classification tasks. All the deep models (with more than 3 hidden layers) are implemented with basic skip connections [3, 22]. Since skip connections (also called residual connections) are necessary in deep architectures, we do not consider them as new models. Three benchmark citation networks (Cora, Citeseer, Pubmed) are considered. We follow the same experimental settings as prior work and show the basic statistics of the datasets in Table 1. All the experiments are conducted 20 times and mainly run on a Linux server with 64GB memory, 32 CPUs, and a single GTX-2080 GPU.
5.1 Overfitting in Deep GCNs
The performance of GCNs is known to decrease with increasing number of layers, for which, a common explanation is “oversmoothing”. In Sec. 4, we contradict this thesis and conjecture instead that overfitting is the major reason for the drop of performance in deep GCNs; we show that deep GCNs actually learn anti-oversmoothing. In this section, we provide evidence to support our conjecture.
Performance vs Depth.
We first evaluate vanilla GCN models (with residual connection) with hidden layers on Cora, Citeseer and Pubmed. The results of training and test accuracy are reported in Fig. 3.
From Fig. 3, we see immediately that test accuracy drops in the beginning (1-4 layers) and then remains stable even as the model depth increases, which means the increasing number () of hidden layers does not hurt model performance. Thus, oversmoothing is not the reason. From 2 to 3 or 3 to 4 layers, we notice that there is a big rise in training accuracy (up to ) and simultaneously a big drop in training loss (to ) and test accuracy, consistently on the three datasets. This is more consistent with overfitting.
Deep GCNs Learn Anti-oversmoothing.
We recall that in Sec. 4.2, we show from an optimization perspective that the dependency of STEP1 on $W$ allows the network to learn anti-oversmoothing. To verify our theory, we compare SGC and GCN on Cora and Pubmed with various depths. To be clear, SGC is actually a linear model, so its depth means the number of graph convolutions. Model performance in both training and test is shown in Fig. 4.
It is interesting that the accuracy of SGC decreases rapidly with more graph convolutions, in both training and test. This is a strong indicator of oversmoothing, because node features converge to the same stationary point due to the effect of STEP1 (specified in Theorem 1). The performance of the GCN model is not as good as SGC soon after 2 layers because of overfitting, but it stabilizes at a high accuracy even as the model goes very deep, which again verifies that GCNs naturally have the power of anti-oversmoothing.
To facilitate the training of deep GCNs, we proposed mean-subtraction in Sec. 4.3. In this section, we evaluate the efficacy of the mean-subtraction trick and compare it with vanilla GCNs, PairNorm, and BatchNorm, which is widely used in the deep learning area. The four models have the same settings, such as the number of layers (64), learning rate (0.01), and hidden units (16). Mean-subtraction subtracts the mean feature value before each convolution layer (and PairNorm further re-scales the features by variance). Neither includes additional parameters. BatchNorm adds more parameters for each layer and learns to whiten the input of each layer. The experiment is conducted on Cora with training epochs.
Fig. 5 reports the training and test curves for the four model variants. GCN and GCN+BatchNorm perform similarly, which means BatchNorm does not help substantially in training deep GCNs (at least for Cora). GCN+mean-subtraction and GCN+PairNorm give fast and stable training/test convergence. However, the training curve suggests that the PairNorm trick suffers considerably from overfitting, leading to a drop in test accuracy. In sum, mean-subtraction not only speeds up model convergence but also retains the same expressive power. It is well suited for training deep GCNs.
5.3 Performance vs Learning Rate
In Sec. 3, we choose the learning rate . However, a different learning rate leads to a different weight of neighbor information aggregation (we show that is a monotonically increasing function in Appendix C). There are also some efforts on trying different ways to aggregate neighbor information [6, 7, 18, 38]. In this section, we consider the form “” with and explore a group of convolution operators by their normalized version. GCN with normalized is named -GCN. We evaluate this operator group on Cora and list the experimental results in Table 2.
We conclude that when is small (i.e., is small), which means the gradient of does not contribute much to the end effect, -GCN is more of a DNN. As increases, a significant increase in model performance is initially observed. When exceeds some threshold, the accuracy saturates, remaining high (or maybe decreases slightly in shallow models, i.e., -layer) even as we increase substantially. We conclude that for the widely used shallow GCNs, the common choice of weight , which means a learning rate, , is large enough to include the gradient descent effect and small enough to avoid the drop in accuracy. To find the best weight of neighbor averaging, further inspection is needed in future work.
We reformulate GCNs from an optimization perspective by plugging the gradient of graph regularizer into a standard MLP. From this formulation, we revisit the commonly discussed “oversmoothing issue” in deep GCNs and provide a new understanding: deep GCNs have the power to learn anti-oversmoothing by nature, but overfitting is the real cause of performance drop when the model goes deep. We further propose a cheap but effective mean-subtraction trick to accelerate the training of deep GCNs. Extensive experiments are presented to verify our theory and provide more practical insights.
Appendix A and Spectral Clustering
Graph Regularizer .
is commonly formulated by Dirichlet energy, , where is a mapping from the input feature to low-dimensional representation . To minimize , this paper adds constraint on the magnitude of , i.e., , which gives,
Given a graph with binary adjacency matrix , a partition of node set into set could be written as in graph theory. For normalized spectral clustering, the indicator vectors are written as , where represents the affiliation of node in class set and is the volume.
The is a matrix containing these indicator vectors as columns. For each row of , there is only one non-zero entry, implying . Let us revisit the Normalized Cut of a graph for a partition .
Also, satisfies . When the discreteness condition is relaxed and is substituted by , the normalized graph cut problem (normalized spectral clustering) is relaxed into,
This is a standard trace minimization problem, which is solved by the eigen matrix of . Compared to Eqn. (17), Eqn. (20) has stronger constraints, which yield an optimal solution that is independent of the inputs (feature matrix ). However, Eqn. (17) only adds a constraint on the magnitude of , which balances the trade-off and gives a solution induced by both the eigen matrix of and the original feature .
Appendix B Rayleigh Quotient
The Rayleigh quotient of a vector is the scalar,
which is invariant to the scaling of . For example, , we have . When we view as a function on -dim variable , it has stationary points , where is the eigenvector of . Let us assume , then the stationary value at point will be exactly the eigenvalue ,
When is not an eigenvector of , the partial derivatives of with respect to the vector coordinate are calculated as,
Thus, the derivative of with respect to is collected as,
Suppose is the normalized Laplacian matrix. Let us first consider minimizing without any constraints. Since is a symmetric real-valued matrix, it can be factorized by Singular Value Decomposition,
where is the rank of and are the eigenvalues. For any non-zero vector , it is decomposed w.r.t. the eigenspace of ,
where are the coordinates and is a component orthogonal to the eigenspace spanned by . Let us consider the component of within the eigenspace and discuss later. Therefore, the Rayleigh quotient can be calculated by,
Recall the partial derivative of w.r.t. in Eqn. (24). Consider minimizing by gradient descent, always with the learning rate (the same as the one used in the main text; the factor arises because the in the appendix does not have the scalar ),
The initial is regarded as a starting point, and the next point is given by gradient descent,
The new Rayleigh quotient value is,
The eigen properties of could be derived from , where they have the same eigenvector, and any eigenvalue of will adjust to be an eigenvalue of . Therefore, we do further derivation,
So far, to get the ideal effect, a final check is needed: whether the Rayleigh quotient does decrease after the gradient descent.
Also, we show the asymptotic property of in gradient descent,
where is the -th new point given by gradient descent. So far, we finish the proof of well-definedness of gradient descent with the .
In fact, as stated above, is invariant to the scaling of , so we could scale on its magnitude, i.e., making as a constraint during the gradient descent iteration, all the properties and results still hold.
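The derivative used in this appendix can be sanity-checked numerically. The sketch below assumes the scalar-free Rayleigh quotient $R(x) = x^\top M x / x^\top x$ for a symmetric matrix $M$, for which the gradient is $2(Mx - R(x)x) / (x^\top x)$, and compares it against central finite differences. The matrix, the test vectors, and the function names are illustrative choices, not values from the paper.

```python
def matvec(mat, x):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in mat]

def rayleigh(mat, x):
    """R(x) = x^T M x / x^T x (scalar-free convention, assumed here)."""
    mx = matvec(mat, x)
    return sum(a * b for a, b in zip(x, mx)) / sum(a * a for a in x)

def rayleigh_grad(mat, x):
    """Analytic gradient for symmetric M: 2 (M x - R(x) x) / (x^T x)."""
    r = rayleigh(mat, x)
    norm_sq = sum(a * a for a in x)
    return [2.0 * (m - r * v) / norm_sq for m, v in zip(matvec(mat, x), x)]

def numeric_grad(mat, x, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((rayleigh(mat, up) - rayleigh(mat, down)) / (2.0 * h))
    return grad

# A small symmetric test matrix (hypothetical); [1, 0, -1] is one of its eigenvectors.
M = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

if __name__ == "__main__":
    x = [1.0, 2.0, -1.0]
    print(rayleigh_grad(M, x))
    print(numeric_grad(M, x))
    print(rayleigh_grad(M, [1.0, 0.0, -1.0]))  # vanishes at an eigenvector
```

The last line illustrates the stationary-point property stated earlier in this appendix: at an eigenvector the gradient is zero and the quotient equals the corresponding eigenvalue.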
In the main text, instead of using a vector , we use a feature matrix and define our Rayleigh quotient by . In fact, different feature channels of can be viewed as independent vector signals, and for each channel, the same gradient descent analysis applies. Therefore, we finish the detailed proof for our formulation in the main text, which is of the following form,
Appendix C Learning Rate and Neighbor Averaging Weight
We show the relation of learning rate and neighbor averaging weight in this section (the derivation is in terms of the main text, so does not have factor ).
Thus, we have,
According to the formulation, is a monotonically increasing function on variable and is valid when . Therefore, when , the domain and when (we know from Eqn. (33) that ), the domain of the function is bounded, .
The choice of lies in the valid domain for . Also, in the valid domain, with respect to the change of , can vary in the range monotonically.
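The qualitative behavior described above (monotonicity, and a bounded or unbounded domain depending on the Rayleigh quotient) can be reproduced with a short sketch. It rests on an assumed normalization rather than the paper's exact formula: the gradient step is written as $X - s(\Delta - rI)X$ with an effective step size $s$ that already absorbs any constant factors, $r$ is the current Rayleigh quotient, and $S = I - \Delta$ is the normalized adjacency. Under that assumption the step equals $(1 - s(1 - r))(I + wS)X$ with $w = s / (1 - s(1 - r))$. The function name is hypothetical.

```python
def neighbor_weight(step, r):
    """Weight w such that X - step*(L - r*I) X is proportional to (I + w*S) X, S = I - L.

    `step` is an effective step size (assumed to absorb constant factors) and
    `r` is the Rayleigh quotient of the current features.
    """
    keep = 1.0 - step * (1.0 - r)  # coefficient left on X itself
    if keep <= 0.0:
        raise ValueError("step is outside the valid domain for this Rayleigh quotient")
    return step / keep

if __name__ == "__main__":
    r = 0.4
    for step in (0.1, 0.5, 1.0 / (2.0 - r), 1.0, 1.5):
        # w grows monotonically with the step size inside the valid domain.
        print(step, neighbor_weight(step, r))
```

With this normalization, the step size $1/(2 - r)$ gives $w = 1$, i.e., the familiar operator $I + D^{-1/2} A D^{-1/2}$; for $r < 1$ the valid step sizes are bounded by $1/(1 - r)$, while for $r \ge 1$ any non-negative step size is valid.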
Appendix D Proof of Theorem 1
Given any non-zero signal and a symmetric matrix (with non-negative eigenvalues), we factorize them in the eigenspace,
where is of rank , and are eigen matrices. are coordinates of in the eigenspace and is a component orthogonal to the eigenspace. In a -layer SGC, the effect of graph convolution is the same as applying Laplacian smoothing times,