DFNets: Spectral CNNs for Graphs with Feedback-Looped Filters
Abstract
We propose a novel spectral convolutional neural network (CNN) model on graph-structured data, namely Distributed Feedback-Looped Networks (DFNets). This model incorporates a robust class of spectral graph filters, called feedback-looped filters, to provide better localization on vertices, while still attaining fast convergence and linear memory requirements. Theoretically, feedback-looped filters can guarantee convergence w.r.t. a specified error bound, and can be applied universally to any graph without knowing its structure. Furthermore, the propagation rule of this model can diversify features from the preceding layers to produce strong gradient flows. We have evaluated our model using two benchmark tasks: semi-supervised document classification on citation networks and semi-supervised entity classification on a knowledge graph. The experimental results show that our model considerably outperforms the state-of-the-art methods in both benchmark tasks over all datasets.
1 Introduction
Convolutional neural networks (CNNs) Krizhevsky et al. (2012) are a powerful deep learning approach which has been widely applied in various fields, e.g., object recognition Sharif Razavian et al. (2014), image classification Hu et al. (2015), and semantic segmentation Li et al. (2017). Traditionally, CNNs only deal with data that has a regular Euclidean structure, such as images, videos and text. In recent years, due to the rising trends in network analysis and prediction, generalizing CNNs to graphs has attracted considerable interest Bruna et al. (2013); Defferrard et al. (2016); Hamilton et al. (2017); Perozzi et al. (2014). However, since graphs are in irregular non-Euclidean domains, this brings up the challenge of how to enhance CNNs for effectively extracting useful features (e.g. topological structure) from arbitrary graphs.
To address this challenge, a number of studies have been devoted to enhancing CNNs by developing filters over graphs. In general, there are two categories of graph filters: (a) spatial graph filters, and (b) spectral graph filters. Spatial graph filters are defined as convolutions directly on graphs, which consider neighbors that are spatially close to a current vertex Atwood and Towsley (2016); Duvenaud et al. (2015); Hamilton et al. (2017). In contrast, spectral graph filters are convolutions indirectly defined on graphs, through their spectral representations Bruna et al. (2013); Chung and Graham (1997); Defferrard et al. (2016). In this paper, we follow the line of previous studies in developing spectral graph filters and tackle the problem of designing effective yet efficient CNNs with spectral graph filters.
Previously, Bruna et al. (2013) proposed convolution operations on graphs via a spectral decomposition of the graph Laplacian. To reduce learning complexity in the setting where the graph structure is not known a priori, Henaff et al. (2015) developed a spectral filter with smooth coefficients. Then, Defferrard et al. (2016) introduced Chebyshev filters to stabilize convolution operations under coefficient perturbation; these filters can be exactly localized in a k-hop neighborhood. Later, Kipf and Welling (2017) proposed a simple layer-wise propagation model using Chebyshev filters on a 1-hop neighborhood. Very recently, some works attempted to develop rational polynomial filters, such as Cayley filters Levie et al. (2017) and ARMA Bianchi et al. (2019). From a different perspective, Veličković et al. (2017) proposed a self-attention based CNN architecture for graph filters, which extracts features by considering the importance of neighbors.
One key idea behind existing works on designing spectral graph filters is to approximate the frequency responses of graph filters using a polynomial function (e.g. Chebyshev filters Defferrard et al. (2016)) or a rational polynomial function (e.g. Cayley filters Levie et al. (2017) and ARMA Bianchi et al. (2019)). Polynomial filters are sensitive to changes in the underlying graph structure. They are also very smooth and can hardly model sharp changes, as illustrated in Figure 1. Rational polynomial filters are better at modeling localization, but they often sacrifice computational efficiency, resulting in higher learning and computational complexities, as well as instability.
Contributions. In this work, we aim to develop a new class of spectral graph filters that can overcome the above limitations. We also propose a spectral CNN architecture (i.e. DFNet) to incorporate these graph filters. In summary, our contributions are as follows:

Improved localization. A new class of spectral graph filters, called feedback-looped filters, is proposed to enable better localization, due to its rational polynomial form. Basically, feedback-looped filters consist of two parts: feedforward and feedback. The feedforward filtering is k-localized, as in polynomial filters, while the feedback filtering refines the k-localized features captured by the feedforward filtering to improve approximation accuracy. We also propose two techniques, scaled-normalization and cut-off frequency, to avoid the issues of gradient vanishing/exploding and instabilities.

Efficient computation. For feedback-looped filters, we avoid the matrix inversion implied by the denominator by approximating it with a recursion. Thanks to this approximation, feedback-looped filters attain linear convergence time and linear memory requirements w.r.t. the number of edges in a graph.

Theoretical properties. Feedback-looped filters enjoy several nice theoretical properties. Unlike other rational polynomial filters for graphs, they have theoretically guaranteed convergence w.r.t. a specified error bound. At the same time, they retain the universal property of other spectral graph filters Isufi et al. (2017b), i.e., they can be applied without knowing the underlying structure of a graph. The optimal coefficients of feedback-looped filters are learnable via an optimization condition for any given graph.

Dense architecture. We propose a layer-wise propagation rule for our spectral CNN model with feedback-looped filters, which densely connects layers as in DenseNet Huang et al. (2017). This design enables our model to diversify features from all preceding layers, leading to a strong gradient flow. We also introduce a layer-wise regularization term to alleviate the overfitting issue. In doing so, we can prevent the generation of spurious features and thus improve the accuracy of the prediction.
To empirically verify the effectiveness of our work, we have evaluated feedback-looped filters within three different CNN architectures over four benchmark datasets to compare against the state-of-the-art methods. The experimental results show that our models significantly outperform the state-of-the-art methods. We further demonstrate the effectiveness of our model DFNet through the node embeddings in a 2D space of vertices from two datasets.
2 Spectral Convolution on Graphs
Let $G = (V, E, A)$ be an undirected and weighted graph, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, and $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix which encodes the weights of edges. We let $n = |V|$ and $m = |E|$. A graph signal is a function $x: V \to \mathbb{R}$ and can be represented as a vector $x \in \mathbb{R}^{n}$ whose $i$-th component $x_i$ is the value of $x$ at the $i$-th vertex in $V$. The graph Laplacian is defined as $L = I - D^{-1/2} A D^{-1/2}$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_j A_{ij}$ and $I$ is an identity matrix. $L$ has a set of orthogonal eigenvectors $\{u_l\}_{l=0}^{n-1}$, known as the graph Fourier basis, and non-negative eigenvalues $\{\lambda_l\}_{l=0}^{n-1}$, known as the graph frequencies Chung and Graham (1997). $L$ is diagonalizable by the eigendecomposition such that $L = U \Lambda U^{H}$, where $\Lambda = \mathrm{diag}([\lambda_0, \dots, \lambda_{n-1}])$ and $U^{H}$ is the Hermitian transpose of $U$. We use $\lambda_{min}$ and $\lambda_{max}$ to denote the smallest and largest eigenvalues of $L$, respectively.
Given a graph signal $x$, the graph Fourier transform of $x$ is $\hat{x} = U^{H} x$ and its inverse is $x = U \hat{x}$ Sandryhaila and Moura (2013); Shuman et al. (2013). The graph Fourier transform enables us to apply graph filters in the vertex domain. A graph filter $h$ can filter $x$ by altering (amplifying or attenuating) the graph frequencies as
$$h(L)x = h(U \Lambda U^{H})x = U\,h(\Lambda)\,U^{H}x \tag{1}$$
Here, $h(\Lambda) = \mathrm{diag}([h(\lambda_0), \dots, h(\lambda_{n-1})])$, which controls how the frequency of each component in a graph signal $x$ is modified. However, applying graph filtering as in Eq. 1 requires the eigendecomposition of $L$, which is computationally expensive. To address this issue, several works Bianchi et al. (2019); Defferrard et al. (2016); Hammond et al. (2011); Kipf and Welling (2017); Levie et al. (2017); Liao et al. (2019) have studied the approximation of $h(\Lambda)$ by a polynomial or rational polynomial function.
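As a minimal sketch of the eigendecomposition-based filtering described above, the following NumPy snippet builds a normalized Laplacian, takes the graph Fourier transform of a signal, rescales each frequency, and transforms back. The 4-vertex path graph, the signal values and the helper names `normalized_laplacian` and `spectral_filter` are illustrative assumptions, not part of the original work.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2}; assumes no isolated vertices."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def spectral_filter(L, x, h):
    """Apply U h(Lambda) U^T x via a full eigendecomposition (small graphs only)."""
    lam, U = np.linalg.eigh(L)      # L is real symmetric, so U^H equals U^T
    x_hat = U.T @ x                 # graph Fourier transform
    return U @ (h(lam) * x_hat)     # scale each frequency, then invert the transform

# Hypothetical 4-vertex path graph and signal, for illustration only.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, -2.0, 0.5, 3.0])
L = normalized_laplacian(A)
eigenvalues = np.linalg.eigvalsh(L)

y_identity = spectral_filter(L, x, lambda lam: np.ones_like(lam))  # h = 1 keeps x
y_laplacian = spectral_filter(L, x, lambda lam: lam)               # h(lam) = lam gives L x
```

The cubic cost of `np.linalg.eigh` is the expense that the polynomial and rational approximations discussed next are meant to avoid.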
Chebyshev filters. Hammond et al. (2011) first proposed to approximate $h(\lambda)$ by a polynomial function with $k$-th order polynomials and Chebyshev coefficients. Later, Defferrard et al. (2016) developed Chebyshev filters for spectral CNNs on graphs. A Chebyshev filter is defined as
$$h_{\theta}(\tilde{L}) = \sum_{j=0}^{k-1} \theta_j T_j(\tilde{L}) \tag{2}$$
where $\theta \in \mathbb{R}^{k}$ is a vector of learnable Chebyshev coefficients, $\tilde{L} \in [-1, 1]$ is rescaled from $L$, the Chebyshev polynomials $T_j(\lambda) = 2\lambda T_{j-1}(\lambda) - T_{j-2}(\lambda)$ are recursively defined with $T_0(\lambda) = 1$ and $T_1(\lambda) = \lambda$, and $k$ controls the size of filters, i.e., localized in the k-hop neighborhood of a vertex Hammond et al. (2011). Kipf and Welling (2017) simplified Chebyshev filters by restricting to the 1-hop neighborhood.
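The Chebyshev recursion can be sketched as follows, assuming the standard form $T_j = 2\tilde{L}T_{j-1} - T_{j-2}$ with $T_0 = I$, $T_1 = \tilde{L}$, and the common rescaling $\tilde{L} = 2L/\lambda_{max} - I$. The 5-vertex ring graph, the coefficients `theta` and the function name `chebyshev_filter` are made up for illustration; the result is cross-checked against a direct spectral evaluation.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def chebyshev_filter(L_tilde, x, theta):
    """Return sum_j theta[j] * T_j(L_tilde) x using only matrix-vector products."""
    t_prev, t_curr = x, L_tilde @ x              # T_0 x and T_1 x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_curr
    for j in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        y = y + theta[j] * t_curr
    return y

# Hypothetical 5-vertex ring graph (every vertex has degree 2).
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.eye(n) - A / 2.0
lam, U = np.linalg.eigh(L)
L_tilde = 2.0 * L / lam.max() - np.eye(n)        # spectrum rescaled into [-1, 1]

theta = np.array([0.5, -0.3, 0.2, 0.1])          # illustrative coefficients, k = 4
x = np.arange(1.0, n + 1.0)
y_recursive = chebyshev_filter(L_tilde, x, theta)

# Reference: evaluate the same polynomial on the rescaled eigenvalues.
lam_tilde = 2.0 * lam / lam.max() - 1.0
y_spectral = U @ (chebval(lam_tilde, theta) * (U.T @ x))
```

With a sparse Laplacian, each step of the loop costs one sparse matrix-vector product, so no eigendecomposition is needed in practice; the dense `eigh` call here is only a correctness check.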
Lanczos filters. Recently, Liao et al. (2019) used the Lanczos algorithm to generate a low-rank matrix approximation for the graph Laplacian. They used the affinity matrix $S = D^{-1/2} A D^{-1/2}$. Since $L = I - S$ holds, $L$ and $S$ share the same eigenvectors but have different eigenvalues. As a result, $L$ and $S$ correspond to the same $U$. To approximate the eigenvectors and eigenvalues of $S$, they diagonalize the tridiagonal matrix $T$ produced by the Lanczos algorithm to compute Ritz-vectors $V$ and Ritz-values $R$, and thus $S \approx V R V^{T}$. Accordingly, a k-hop Lanczos filter operation is
$$h_{\theta}(R) = \sum_{j=0}^{k-1} \theta_j R^{j} \tag{3}$$
where $\theta \in \mathbb{R}^{k}$ is a vector of learnable Lanczos filter coefficients. Thus, the spectral convolutional operation is defined as $V h_{\theta}(R) V^{T} x$. Such Lanczos filter operations can significantly reduce computation overhead when approximating large powers of $S$, i.e. $S^{k} \approx V R^{k} V^{T}$. Thus, they can efficiently compute the spectral graph convolution with a very large localization range to easily capture the multi-scale information of the graph.
Cayley filters. Observing that Chebyshev filters have difficulty in detecting narrow frequency bands due to $\tilde{\lambda} \in [-1, 1]$, Levie et al. (2017) proposed Cayley filters, based on Cayley polynomials:
$$h_{c,s}(L) = c_0 + 2\,\mathrm{Re}\Big(\sum_{j=1}^{k-1} c_j (sL - iI)^{j} (sL + iI)^{-j}\Big) \tag{4}$$
where $c_0 \in \mathbb{R}$ is a real coefficient and $(c_1, \dots, c_{k-1}) \in \mathbb{C}^{k-1}$ is a vector of complex coefficients. $\mathrm{Re}(z)$ denotes the real part of a complex number $z$, and $s > 0$ is a parameter called spectral zoom, which controls the degree of "zooming" into eigenvalues in $\Lambda$. Both $s$ and $c$ are learnable during training. To improve efficiency, the Jacobi method is used to approximately compute Cayley polynomials.
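ARMA filters. Bianchi et al. (2019) sought to address similar issues as identified in Levie et al. (2017). However, different from Cayley filters, they developed a first-order ARMA filter, which is approximated by a first-order recursion:
$$\bar{x}^{(t+1)} = a\,\tilde{M}\,\bar{x}^{(t)} + b\,x \tag{5}$$
where $a$ and $b$ are the filter coefficients, $\bar{x}^{(0)} = x$, and $\tilde{M} = \frac{1}{2}(\lambda_{max} - \lambda_{min}) I - L$. Accordingly, the frequency response is defined as:
$$h(\tilde{\mu}_l) = \frac{r}{\tilde{\mu}_l - p} \tag{6}$$
where $\tilde{\mu}_l = \frac{1}{2}(\lambda_{max} - \lambda_{min}) - \lambda_l$, $r = -b/a$, and $p = 1/a$ Isufi et al. (2017b). Multiple ARMA$_1$ filters can be applied in parallel to obtain an ARMA$_k$ filter. However, the memory complexity of $k$ parallel ARMA$_1$ filters is $k$ times higher than that of ARMA$_1$ graph filters.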
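To illustrate how a first-order recursion realizes a rational frequency response, the sketch below iterates the recursion on a single graph frequency. It assumes the recursion has the form $\bar{x}^{(t+1)} = a\,\mu\,\bar{x}^{(t)} + b\,x$ per frequency $\mu$ (the form used in the ARMA graph filter literature), for which the fixed point is $b/(1 - a\mu) = r/(\mu - p)$ with $r = -b/a$ and $p = 1/a$ whenever $|a\mu| < 1$. The coefficient values are arbitrary.

```python
def arma1_recursion(mu, a, b, steps=200):
    """Iterate y <- a * mu * y + b * x for a unit input x = 1, starting from y = x."""
    x = 1.0
    y = x
    for _ in range(steps):
        y = a * mu * y + b * x
    return y

def arma1_closed_form(mu, a, b):
    """Rational response r / (mu - p) with r = -b / a and p = 1 / a."""
    r, p = -b / a, 1.0 / a
    return r / (mu - p)

a, b = 0.5, 0.8                       # illustrative coefficients with |a * mu| < 1
frequencies = [-1.0, -0.3, 0.4, 1.0]  # shifted graph frequencies, assumed in [-1, 1]
iterated = [arma1_recursion(mu, a, b) for mu in frequencies]
closed = [arma1_closed_form(mu, a, b) for mu in frequencies]
```

The recursion never inverts anything; the division in the closed form is approached geometrically at rate $|a\mu|$ per step.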
We make some remarks on how these existing spectral filters are related to each other. (i) As discussed in Bianchi et al. (2019); Levie et al. (2017); Liao et al. (2019), polynomial filters (e.g. Chebyshev and Lanczos filters) can be approximately treated as a special kind of rational polynomial filters. (ii) Further, Chebyshev filters can be regarded as a special case of Lanczos filters. (iii) Although both Cayley and ARMA filters are rational polynomial filters, they differ in how they approximate the matrix inverse implied by the denominator of a rational function. Cayley filters use a fixed number of Jacobi iterations, while ARMA filters use a first-order recursion plus a parallel bank of first-order ARMA filters. (iv) The ARMA filter of Bianchi et al. (2019) is similar to GCN by Kipf and Welling (2017) because they both consider localization within a 1-hop neighborhood.
3 Proposed Method
We introduce a new class of spectral graph filters, called feedback-looped filters, and propose a spectral CNN for graphs with feedback-looped filters, namely Distributed Feedback-Looped Networks (DFNets). We also discuss optimization techniques and analyze theoretical properties.
3.1 Feedback-Looped Filters
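Feedback-looped filters belong to a class of Auto Regressive Moving Average (ARMA) filters Isufi et al. (2017a, b). Formally, an ARMA filter is defined as:
$$h_{\psi,\phi}(L)\,x = \Big(I + \sum_{j=1}^{p} \psi_j L^{j}\Big)^{-1} \Big(\sum_{j=0}^{q} \phi_j L^{j}\Big) x \tag{7}$$
The parameters $p$ and $q$ refer to the feedback and feedforward degrees, respectively. $\psi \in \mathbb{C}^{p}$ and $\phi \in \mathbb{C}^{q+1}$ are two vectors of complex coefficients. Computing the denominator of Eq. 7, however, requires a matrix inversion, which is computationally inefficient for large graphs. To circumvent this issue, feedback-looped filters use the following approximation:
$$\bar{x}^{(0)} = x, \qquad \bar{x}^{(t)} = -\sum_{j=1}^{p} \psi_j \tilde{L}^{j} \bar{x}^{(t-1)} + \sum_{j=0}^{q} \phi_j \tilde{L}^{j} x \tag{8}$$
where $\tilde{L} = \hat{L} - (\hat{\lambda}_{max}/2) I$, $\hat{L} = I - \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$, $\hat{A} = A + I$, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, and $\hat{\lambda}_{max}$ is the largest eigenvalue of $\hat{L}$. Accordingly, the frequency response of feedback-looped filters is defined as:
$$h(\lambda_i) = \frac{\sum_{j=0}^{q} \phi_j \lambda_i^{j}}{1 + \sum_{j=1}^{p} \psi_j \lambda_i^{j}} \tag{9}$$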
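The way a recursion can stand in for the matrix inversion may be sketched as follows. The snippet assumes the recursion takes the form $\bar{x}^{(t)} = P\bar{x}^{(t-1)} + Qx$ with $P = -\sum_{j=1}^{p}\psi_j \tilde{L}^{j}$ and $Q = \sum_{j=0}^{q}\phi_j \tilde{L}^{j}$, so that its fixed point solves $(I - P)\bar{x} = Qx$, i.e. the rational filter. The ring graph, the real-valued coefficients (the text allows complex ones) and the function names are illustrative assumptions.

```python
import numpy as np

def feedback_feedforward(L, psi, phi):
    """Build P = -sum_{j>=1} psi_j L^j and Q = sum_{j>=0} phi_j L^j (dense, small graphs)."""
    n = L.shape[0]
    P = np.zeros((n, n))
    Q = phi[0] * np.eye(n)
    power = np.eye(n)
    for j in range(1, max(len(psi), len(phi) - 1) + 1):
        power = power @ L                    # power now equals L^j
        if j <= len(psi):
            P -= psi[j - 1] * power
        if j < len(phi):
            Q += phi[j] * power
    return P, Q

def feedback_looped(L, x, psi, phi, steps):
    """Iterate x_bar <- P x_bar + Q x, starting from x_bar = x."""
    P, Q = feedback_feedforward(L, psi, phi)
    feedforward = Q @ x                      # does not depend on t, computed once
    x_bar = x.copy()
    for _ in range(steps):
        x_bar = P @ x_bar + feedforward
    return x_bar

# Hypothetical 6-vertex ring; its normalized Laplacian is shifted by lambda_max / 2.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L_hat = np.eye(n) - A / 2.0
L_tilde = L_hat - (np.linalg.eigvalsh(L_hat).max() / 2.0) * np.eye(n)

psi = [0.3, 0.2, 0.1]                        # illustrative feedback coefficients, p = 3
phi = [1.0, 0.5, 0.25, 0.1]                  # illustrative feedforward coefficients, q = 3
x = np.arange(1.0, n + 1.0)

y_iterated = feedback_looped(L_tilde, x, psi, phi, steps=100)
P, Q = feedback_feedforward(L_tilde, psi, phi)
y_direct = np.linalg.solve(np.eye(n) - P, Q @ x)   # explicit inverse, for comparison only
spectral_radius = np.abs(np.linalg.eigvals(P)).max()
```

Because the shifted spectrum lies in $[-1, 1]$ and the feedback coefficients sum to 0.6 in absolute value, the spectral radius of `P` is below 1 and the iteration contracts toward the directly solved result.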
To alleviate the issues of gradient vanishing/exploding and numerical instabilities, we further introduce two techniques in the design of feedback-looped filters: scaled-normalization and cut-off frequency.
Scaled-normalization technique. To assure the stability of feedback-looped filters, we apply the scaled-normalization technique to increase the stability region, i.e., using the scaled-normalized Laplacian $\tilde{L} = \hat{L} - (\hat{\lambda}_{max}/2) I$, rather than just $\hat{L}$. This accordingly helps centralize the eigenvalues of the Laplacian $\hat{L}$ and reduce its spectral radius bound. The scaled-normalized Laplacian $\tilde{L}$ consists of graph frequencies within $[-1, 1]$, in which eigenvalues are ordered in an increasing order.
Cut-off frequency technique. To map graph frequencies within $[-1, 1]$ to a uniform discrete distribution, we define a cut-off frequency $\lambda_{cut} = \lambda_{max}/2 - \eta$, where $\eta \in [0, 1]$ and $\lambda_{max}$ refers to the largest eigenvalue of $\tilde{L}$. The cut-off frequency is used as a threshold to control the amount of attenuation on graph frequencies. The eigenvalues $\lambda_i$ are converted to binary values $\tilde{\lambda}_i$ such that $\tilde{\lambda}_i = 1$ if $\lambda_i \geq \lambda_{cut}$ and $\tilde{\lambda}_i = 0$ otherwise. This trick allows the generation of ideal high-pass filters so as to sharpen a signal by amplifying its graph Fourier coefficients. This technique also solves the issue of narrow frequency bands existing in previous spectral filters, including both polynomial and rational polynomial filters Defferrard et al. (2016); Levie et al. (2017), which only accept a small band of frequencies. In contrast, our proposed feedback-looped filters amplify frequencies higher than a certain low cut-off value while attenuating frequencies lower than that cut-off value. Thus, our proposed filters can accept a wider range of frequencies and better capture the characteristic properties of a graph.
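The thresholding step can be sketched in a few lines. The snippet takes the cut-off value directly as a parameter and assumes the comparison is "greater than or equal to" (the text only states that frequencies above the cut-off are kept); the function name `binarize_frequencies` and the sample frequencies are illustrative.

```python
def binarize_frequencies(eigenvalues, lam_cut):
    """Map each graph frequency to 1 if it reaches the cut-off, else 0 (ideal high-pass)."""
    return [1 if lam >= lam_cut else 0 for lam in eigenvalues]

# Hypothetical graph frequencies in increasing order, and an illustrative cut-off.
frequencies = [0.0, 0.25, 0.5, 0.9, 1.4, 2.0]
desired_response = binarize_frequencies(frequencies, lam_cut=0.5)
passed = sum(desired_response)   # number of frequencies kept by the high-pass response
```

The resulting 0/1 vector plays the role of the desired frequency response that the filter coefficients are later fitted to.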
3.2 Coefficient Optimisation
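Given a feedback-looped filter with a desired frequency response $\hat{h}(\tilde{\lambda}_i)$, we aim to find the optimal coefficients $\psi$ and $\phi$ that make the frequency response as close as possible to the desired frequency response, i.e. to minimize the following error:
$$e'(\tilde{\lambda}_i) = \hat{h}(\tilde{\lambda}_i) - \frac{\sum_{j=0}^{q} \phi_j \tilde{\lambda}_i^{j}}{1 + \sum_{j=1}^{p} \psi_j \tilde{\lambda}_i^{j}} \tag{10}$$
However, the above equation is not linear w.r.t. the coefficients $\psi$ and $\phi$. Thus, we multiply through by the denominator and redefine the error as follows:
$$e(\tilde{\lambda}_i) = \hat{h}(\tilde{\lambda}_i) + \hat{h}(\tilde{\lambda}_i) \sum_{j=1}^{p} \psi_j \tilde{\lambda}_i^{j} - \sum_{j=0}^{q} \phi_j \tilde{\lambda}_i^{j} \tag{11}$$
Let $e = [e(\tilde{\lambda}_0), \dots, e(\tilde{\lambda}_{n-1})]^{T}$ and $\hat{h} = [\hat{h}(\tilde{\lambda}_0), \dots, \hat{h}(\tilde{\lambda}_{n-1})]^{T}$, and let $\alpha \in \mathbb{R}^{n \times p}$ with $\alpha_{ij} = \tilde{\lambda}_i^{j}$ and $\beta \in \mathbb{R}^{n \times (q+1)}$ with $\beta_{ij} = \tilde{\lambda}_i^{j-1}$ be two Vandermonde-like matrices. Then, we have $e = \hat{h} + \mathrm{diag}(\hat{h})\,\alpha\psi - \beta\phi$. Thus, the stable coefficients $\psi$ and $\phi$ can be learned by minimizing $e$ as a convex constrained least-squares optimization problem:
$$\begin{aligned} \underset{\psi,\,\phi}{\text{minimize}} \quad & \|\hat{h} + \mathrm{diag}(\hat{h})\,\alpha\psi - \beta\phi\|_2 \\ \text{subject to} \quad & \|\alpha\psi\|_{\infty} \leq \gamma \ \text{ and } \ \gamma < 1 \end{aligned} \tag{12}$$
Here, the parameter $\gamma$ controls the trade-off between convergence efficiency and approximation accuracy. A higher value of $\gamma$ can lead to slower convergence but better accuracy. Very low values of $\gamma$ are not recommended because the resulting accuracy may be unacceptable. $\|\alpha\psi\|_{\infty} \leq \gamma < 1$ is the stability condition, which will be further discussed in detail in Section 3.4.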
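A minimal sketch of the linearized fit, under stated assumptions: the linearized error is taken to be $e = \hat{h} + \mathrm{diag}(\hat{h})\,\alpha\psi - \beta\phi$ with Vandermonde-like matrices $\alpha$ (powers 1 to $p$) and $\beta$ (powers 0 to $q$), and the stability constraint is dropped, so an ordinary least-squares solve is used and the quantity $\|\alpha\psi\|_{\infty}$ is only checked afterwards. A constrained convex solver would be needed to enforce the bound in general. The target response is a hypothetical rational function chosen so that the exact coefficients are known.

```python
import numpy as np

def fit_coefficients(lam, h_hat, p, q):
    """Unconstrained least-squares fit of psi (length p) and phi (length q + 1)."""
    alpha = np.stack([lam ** j for j in range(1, p + 1)], axis=1)   # n x p
    beta = np.stack([lam ** j for j in range(0, q + 1)], axis=1)    # n x (q + 1)
    # e = h_hat + diag(h_hat) alpha psi - beta phi  ->  solve M z ~ -h_hat, z = [psi; phi]
    M = np.hstack([h_hat[:, None] * alpha, -beta])
    z, *_ = np.linalg.lstsq(M, -h_hat, rcond=None)
    psi, phi = z[:p], z[p:]
    residual = h_hat + h_hat * (alpha @ psi) - beta @ phi
    stability = np.max(np.abs(alpha @ psi))     # should stay below 1 for a stable filter
    return psi, phi, residual, stability

lam = np.linspace(-1.0, 1.0, 21)                    # illustrative frequency grid
h_hat = (1.0 + 0.5 * lam) / (1.0 + 0.2 * lam)       # hypothetical target response
psi, phi, residual, stability = fit_coefficients(lam, h_hat, p=1, q=1)
```

On this example the fit recovers the generating coefficients, and the post-hoc stability check passes because $|0.2\lambda| \leq 0.2$ on the grid.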
3.3 Spectral Convolutional Layer
We propose a CNN-based architecture, called DFNets, which can stack multiple spectral convolutional layers with feedback-looped filters to extract features of increasing abstraction. Let $P = -\sum_{j=1}^{p} \psi_j \tilde{L}^{j}$ and $Q = \sum_{j=0}^{q} \phi_j \tilde{L}^{j}$. The propagation rule of a spectral convolutional layer is defined as:
$$\bar{X}^{(t+1)} = \sigma\big(P \bar{X}^{(t)} \theta_1^{(t)} + Q X \theta_2^{(t)} + \mu(\theta_1^{(t)}; \theta_2^{(t)}) + b\big) \tag{13}$$
where $\sigma$ refers to a non-linear activation function such as ReLU. $\bar{X}^{(0)} = X \in \mathbb{R}^{n \times f}$ is a graph signal matrix where $f$ refers to the number of features. $\bar{X}^{(t)}$ is a matrix of activations in the $t$-th layer. $\theta_1^{(t)}$ and $\theta_2^{(t)}$ are two trainable weight matrices in the $t$-th layer. To compute $\bar{X}^{(t+1)}$, a vertex $i$ needs access to its $p$-hop neighbors with the output signal of the previous layer $\bar{X}^{(t)}$, and its $q$-hop neighbors with the input signal from $X$. To attenuate the overfitting issue, we add $\mu(\theta_1^{(t)}; \theta_2^{(t)})$, namely kernel regularization Cortes et al. (2009), and a bias term $b$. We use the Xavier normal initialization method Glorot and Bengio (2010) to initialise the kernel and bias weights, the unit-norm constraint technique Douglas et al. (2000) to normalise the kernel and bias weights by restricting the parameters of all layers to a small range, and the kernel regularization technique to penalize the parameters in each layer during training. In doing so, we can prevent the generation of spurious features and thus improve the accuracy of prediction. (Footnote 1: the DFNets implementation can be found at https://github.com/wokas36/DFNets.)
In this model, each layer is directly connected to all subsequent layers in a feed-forward manner, as in DenseNet Huang et al. (2017). Consequently, the $t$-th layer receives all preceding feature maps as input. We concatenate multiple preceding feature maps column-wise into a single tensor to obtain more diversified features for boosting the accuracy. This densely connected CNN architecture has several compelling benefits: (a) it reduces the vanishing-gradient issue, (b) it increases feature propagation and reuse, and (c) it refines information flow between layers Huang et al. (2017).
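The layer and its dense connectivity can be sketched as a forward pass. This is a rough illustration under several assumptions: the layer is taken to compute $\sigma(P H \theta_1 + Q X \theta_2 + b)$ with ReLU as $\sigma$, the regularization term is omitted because it only affects the training loss, the feedback and feedforward operators `P` and `Q` are random stand-ins rather than fitted filters, and the input to each layer is the column-wise concatenation of the input features and all earlier layer outputs. All names and sizes are hypothetical.

```python
import numpy as np

def df_layer(P, Q, H_in, X, theta1, theta2, bias):
    """One forward pass: ReLU(P H_in theta1 + Q X theta2 + bias)."""
    return np.maximum(0.0, P @ H_in @ theta1 + Q @ X @ theta2 + bias)

rng = np.random.default_rng(0)
n, f, units = 6, 4, [8, 16]                 # vertices, input features, layer widths
P = 0.1 * rng.standard_normal((n, n))       # stand-in for the feedback operator
Q = 0.1 * rng.standard_normal((n, n))       # stand-in for the feedforward operator
X = rng.standard_normal((n, f))

feature_maps = [X]                          # layer 0 "output" is the input signal
input_widths = []
for width in units:
    H_in = np.concatenate(feature_maps, axis=1)   # dense connection: all earlier maps
    input_widths.append(H_in.shape[1])
    theta1 = rng.standard_normal((H_in.shape[1], width))
    theta2 = rng.standard_normal((f, width))
    bias = np.zeros(width)
    feature_maps.append(df_layer(P, Q, H_in, X, theta1, theta2, bias))
```

The input width grows with depth (4, then 4 + 8 = 12 here), which is the feature reuse that dense connectivity provides.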
3.4 Theoretical Analysis
Feedback-looped filters have several nice properties, e.g., guaranteed convergence, linear convergence time, and universal design. We discuss these properties and analyze computational complexities.
Convergence. Theoretically, a feedback-looped filter can achieve a desired frequency response only when $t \to \infty$ Isufi et al. (2017b). However, due to the property of linear convergence preserved by feedback-looped filters, stability can be guaranteed after a number of iterations w.r.t. a specified small error Isufi et al. (2017a). More specifically, since the poles of rational polynomial filters should be in the unit circle of the z-plane to guarantee stability, we can derive the stability condition $\|P\| < 1$ by Eq. 7 in the vertex domain and correspondingly obtain the stability condition $\|\alpha\psi\|_{\infty} \leq \gamma < 1$ in the frequency domain as stipulated in Eq. 12 Isufi et al. (2017a).
Universal design. The universal design is beneficial when the underlying structure of a graph is unknown or the topology of a graph changes over time. The corresponding filter coefficients can be learned independently of the underlying graph and are universally applicable. When designing feedback-looped filters, we define the desired frequency response function $\hat{h}$ over graph frequencies in a binary format in the uniform discrete distribution as discussed in Section 3.1. Then, we solve Eq. 12 in the least-squares sense for this finite set of graph frequencies to find optimal filter coefficients.
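Table 1: Learning, time and memory complexities of spectral graph filters.
| Spectral Graph Filter | Type | Learning Complexity | Time Complexity | Memory Complexity |
|---|---|---|---|---|
| Chebyshev filters Defferrard et al. (2016) | Polynomial | $O(k)$ | $O(km)$ | $O(m)$ |
| Lanczos filters Liao et al. (2019) | Polynomial | $O(k)$ | $O(km^2)$ | $O(m^2)$ |
| Cayley filters Levie et al. (2017) | Rational polynomial | $O((r+1)k)$ | $O((r+1)km)$ | $O(m)$ |
| ARMA$_1$ filters Bianchi et al. (2019) | Rational polynomial | $O(t)$ | $O(tm)$ | $O(m)$ |
| $k$ parallel ARMA$_1$ filters Bianchi et al. (2019) | Rational polynomial | $O(t)$ | $O(tm)$ | $O(km)$ |
| Feedback-looped filters (ours) | Rational polynomial | $O(tp+q)$ | $O((tp+q)m)$ | $O(m)$ |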
Complexity. When computing $\bar{x}^{(t)}$ as in Eq. 8, we need to calculate $\tilde{L}^{j} \bar{x}^{(t-1)}$ for $j = 1, \dots, p$ and $\tilde{L}^{j} x$ for $j = 1, \dots, q$. Nevertheless, $\tilde{L}^{j} x$ is computed only once because $\tilde{L}^{j} x = \tilde{L}(\tilde{L}^{j-1} x)$. Thus, we need $p$ multiplications for each $t$ in the first term in Eq. 8, and $q$ multiplications for the second term in Eq. 8. Table 1 summarizes the complexity results of existing spectral graph filters and ours, where $r$ refers to the number of Jacobi iterations in Levie et al. (2017). Note that, when $t = 1$ (i.e., one spectral convolutional layer), feedback-looped filters have the same learning, time and memory complexities as Chebyshev filters, where $p + q = k$.
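The multiplication count can be checked with a small instrumented sketch. It assumes the recursion applies a feedback polynomial of degree `p = len(psi)` at every step and a feedforward polynomial of degree `q = len(phi) - 1` once, and that the operator is stored sparsely so each matrix-vector product costs time proportional to the number of stored entries (edges). The `CountingOperator` class, the ring-shaped operator and the coefficient values are hypothetical.

```python
class CountingOperator:
    """Sparse matrix stored as one {column: weight} dict per row; counts mat-vec calls."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def matvec(self, v):
        self.calls += 1
        return [sum(w * v[j] for j, w in row.items()) for row in self.rows]

def feedback_looped(op, x, psi, phi, steps):
    # Feedforward term sum_j phi_j L^j x: q mat-vecs, computed once.
    feedforward = [phi[0] * xi for xi in x]
    v = x
    for c in phi[1:]:
        v = op.matvec(v)
        feedforward = [fi + c * vi for fi, vi in zip(feedforward, v)]
    # Feedback term -sum_j psi_j L^j x_bar: p mat-vecs per iteration.
    y = list(x)
    for _ in range(steps):
        v, acc = y, list(feedforward)
        for c in psi:
            v = op.matvec(v)
            acc = [ai - c * vi for ai, vi in zip(acc, v)]
        y = acc
    return y

n = 6
ring = [{(i - 1) % n: -0.5, (i + 1) % n: -0.5} for i in range(n)]  # illustrative operator
op = CountingOperator(ring)
psi, phi, steps = [0.3, 0.2, 0.1], [1.0, 0.5, 0.25], 4             # p = 3, q = 2, t = 4
output = feedback_looped(op, [1.0] * n, psi, phi, steps)
expected_calls = steps * len(psi) + (len(phi) - 1)                  # t * p + q
```

Only the current iterate, the cached feedforward vector and the sparse operator are held in memory, which is the source of the linear memory footprint.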
4 Numerical Experiments
We evaluate our models on two benchmark tasks: (1) semi-supervised document classification in citation networks, and (2) semi-supervised entity classification in a knowledge graph.
4.1 Experimental Set-Up
Datasets. We use three citation network datasets Cora, Citeseer, and Pubmed Sen et al. (2008) for semi-supervised document classification, and one dataset NELL Carlson et al. (2010) for semi-supervised entity classification. NELL is a bipartite graph extracted from a knowledge graph Carlson et al. (2010). Table 2 contains dataset statistics Yang et al. (2016).
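Table 2: Dataset statistics.
| Dataset | Type | #Nodes | #Edges | #Classes | #Features | Labeled nodes (fraction) |
|---|---|---|---|---|---|---|
| Cora | Citation network | 2,708 | 5,429 | 7 | 1,433 | 0.052 |
| Citeseer | Citation network | 3,327 | 4,732 | 6 | 3,703 | 0.036 |
| Pubmed | Citation network | 19,717 | 44,338 | 3 | 500 | 0.003 |
| NELL | Knowledge graph | 65,755 | 266,144 | 210 | 5,414 | 0.001 |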
Baseline methods. We compare against twelve baseline methods, including five methods using spatial graph filters, i.e., Semi-supervised Embedding (SemiEmb) Weston et al. (2012), Label Propagation (LP) Zhu et al. (2003), skip-gram graph embedding model (DeepWalk) Perozzi et al. (2014), Iterative Classification Algorithm (ICA) Lu and Getoor (2003), and semi-supervised learning with graph embedding (Planetoid*) Yang et al. (2016), and seven methods using spectral graph filters: Chebyshev Defferrard et al. (2016), Graph Convolutional Networks (GCN) Kipf and Welling (2017), Lanczos Networks (LNet) and Adaptive Lanczos Networks (AdaLNet) Liao et al. (2019), CayleyNet Levie et al. (2017), Graph Attention Networks (GAT) Veličković et al. (2017), and ARMA Convolutional Networks (ARMA) Bianchi et al. (2019).
We evaluate our feedback-looped filters using three different spectral CNN models: (i) DFNet: a densely connected spectral CNN with feedback-looped filters, (ii) DFNetATT: a self-attention based densely connected spectral CNN with feedback-looped filters, and (iii) DFATT: a self-attention based spectral CNN model with feedback-looped filters.
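Table 3: Hyperparameter settings for citation network datasets.
| Model | L2 reg. | #Layers | #Units | Dropout | [p, q] | $\lambda_{cut}$ |
|---|---|---|---|---|---|---|
| DFNet | 9e-2 | 5 | [8, 16, 32, 64, 128] | 0.9 | [5, 3] | 0.5 |
| DFNetATT | 9e-4 | 4 | [8, 16, 32, 64] | 0.9 | [5, 3] | 0.5 |
| DFATT | 9e-3 | 2 | [32, 64] | [0.1, 0.9] | [5, 3] | 0.5 |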
Hyperparameter settings. We use the same data splitting for each dataset as in Yang et al. (2016). The hyperparameters of our models are initially selected by applying the orthogonalization technique (a randomized search strategy). We also use a layer-wise regularization (L2 regularization) and bias terms to attenuate the overfitting issue. All models are trained for 200 epochs using the Adam optimizer Kingma and Ba (2015) with a learning rate of 0.002. Table 3 summarizes the hyperparameter settings for citation network datasets. The same hyperparameters are applied to the NELL dataset except for L2 regularization (i.e., 9e-2 for DFNet and DFNetATT, and 9e-4 for DFATT). For $\gamma$, we choose the best setting for each model. For self-attention, we use 8 multi-attention heads and 0.5 attention dropout for DFNetATT, and 6 multi-attention heads and 0.3 attention dropout for DFATT. The parameters $p = 5$, $q = 3$ and $\lambda_{cut} = 0.5$ are applied to all three models over all datasets.
4.2 Comparison with Baseline Methods
Table 4 summarizes the results of classification in terms of accuracy. The results of the baseline methods are taken from the previous works Kipf and Welling (2017); Liao et al. (2019); Veličković et al. (2017); Yang et al. (2016). Our models DFNet and DFNetATT outperform all the baseline methods over four datasets. Particularly, we can see that: (1) Compared with polynomial filters, DFNet improves upon GCN (which performs best among the models using polynomial filters) by a margin of 3.7%, 3.9%, 5.3% and 2.3% on the datasets Cora, Citeseer, Pubmed and NELL, respectively. (2) Compared with rational polynomial filters, DFNet improves upon CayleyNet and ARMA by 3.3% and 0.5% on the Cora dataset, respectively. For the other datasets, CayleyNet does not have results available in Levie et al. (2017). (3) DFNetATT further improves the results of DFNet due to the addition of a self-attention layer. (4) Compared with GAT (Chebyshev filters with self-attention), DFATT also improves the results and achieves 0.4%, 0.6% and 3.3% higher accuracy on the datasets Cora, Citeseer and Pubmed, respectively.
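The quoted margins are plain differences in accuracy (percentage points) between rows of Table 4, which the following short check reproduces. The dictionary layout and variable names are illustrative; the numbers are copied from the table.

```python
# Accuracy (%) in the order Cora, Citeseer, Pubmed, NELL, as reported in Table 4.
accuracy = {
    "GCN": [81.5, 70.3, 79.0, 66.0],
    "DFNet": [85.2, 74.2, 84.3, 68.3],
}
dfnet_vs_gcn = [round(ours - base, 1)
                for ours, base in zip(accuracy["DFNet"], accuracy["GCN"])]

# GAT has no NELL result, so only the three citation datasets are compared.
gat = [83.0, 72.5, 79.0]
dfatt = [83.4, 73.1, 82.3]
dfatt_vs_gat = [round(ours - base, 1) for ours, base in zip(dfatt, gat)]

# Cora only: DFNet against the two rational polynomial baselines.
dfnet_vs_cayley = round(85.2 - 81.9, 1)
dfnet_vs_arma = round(85.2 - 84.7, 1)
```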
Additionally, we compare DFNet (our feedbacklooped filters + DenseBlock) with GCN + DenseBlock and GAT + DenseBlock. The results are also presented in Table 4. We can see that our feedbacklooped filters perform best, no matter whether or not the dense architecture is used.
Table 4: Classification accuracy (%).
| Model | Cora | Citeseer | Pubmed | NELL |
|---|---|---|---|---|
| SemiEmb Weston et al. (2012) | 59.0 | 59.6 | 71.1 | 26.7 |
| LP Zhu et al. (2003) | 68.0 | 45.3 | 63.0 | 26.5 |
| DeepWalk Perozzi et al. (2014) | 67.2 | 43.2 | 65.3 | 58.1 |
| ICA Lu and Getoor (2003) | 75.1 | 69.1 | 73.9 | 23.1 |
| Planetoid* Yang et al. (2016) | 64.7 | 75.7 | 77.2 | 61.9 |
| Chebyshev Defferrard et al. (2016) | 81.2 | 69.8 | 74.4 | - |
| GCN Kipf and Welling (2017) | 81.5 | 70.3 | 79.0 | 66.0 |
| LNet Liao et al. (2019) | 79.5 | 66.2 | 78.3 | - |
| AdaLNet Liao et al. (2019) | 80.4 | 68.7 | 78.1 | - |
| CayleyNet Levie et al. (2017) | 81.9 | - | - | - |
| ARMA Bianchi et al. (2019) | 84.7 | 73.8 | 81.4 | - |
| GAT Veličković et al. (2017) | 83.0 | 72.5 | 79.0 | - |
| GCN + DenseBlock | 82.7 ± 0.5 | 71.3 ± 0.3 | 81.5 ± 0.5 | 66.4 ± 0.3 |
| GAT + DenseBlock | 83.8 ± 0.3 | 73.1 ± 0.3 | 81.8 ± 0.3 | - |
| DFNet (ours) | 85.2 ± 0.5 | 74.2 ± 0.3 | 84.3 ± 0.4 | 68.3 ± 0.4 |
| DFNetATT (ours) | 86.0 ± 0.4 | 74.7 ± 0.4 | 85.2 ± 0.3 | 68.8 ± 0.3 |
| DFATT (ours) | 83.4 ± 0.5 | 73.1 ± 0.4 | 82.3 ± 0.3 | 67.6 ± 0.3 |
4.3 Comparison under Different Polynomial Orders
In order to test how the polynomial orders $p$ and $q$ influence the performance of our model DFNet, we conduct experiments to evaluate DFNet on three citation network datasets using different polynomial orders $p$ and $q$. Figure 2 presents the experimental results. In our experiments, $p = 5$ and $q = 3$ turn out to be the best parameters for DFNet over these datasets. In other words, feedback-looped filters are more stable with $p = 5$ and $q = 3$ than with other values of $p$ and $q$. This is because, when $p = 5$ and $q = 3$, Eq. 12 can obtain better convergence for finding optimal coefficients than in the other cases. Furthermore, we observe that: (1) setting $p$ to be too low or too high can both lead to poor performance, as shown in Figure 2(a), and (2) when $q$ is larger than $p$, the accuracy decreases rapidly, as shown in Figure 2(b). Thus, when choosing $p$ and $q$, we require that $p > q$ holds.
4.4 Evaluation of Scaled-Normalization and Cut-off Frequency
To understand how effectively the scaled-normalization and cut-off frequency techniques can help learn graph representations, we compare our methods that implement both techniques with variants that implement only one of them. The results are presented in Figure 3. We can see that the models using both techniques outperform the models that use only one of them over all citation network datasets. In particular, the improvement is significant on the Cora and Citeseer datasets.
4.5 Node Embeddings
We analyze the node embeddings learned by DFNets on two datasets, Cora and Pubmed, in a 2D space. Figures 4 and 5 display the 2D embeddings learned by GCN, GAT, and DFNet (ours) on the Pubmed and Cora citation networks, respectively, obtained by applying t-SNE Maaten and Hinton (2008). Colors denote the different classes in these datasets and reveal the clustering quality of these models. The figures show that DFNet separates the 3 clusters of Pubmed and the 7 clusters of Cora more clearly in the embedding space. This is because features extracted by DFNet yield better node representations than those of the GCN and GAT models.
5 Conclusions
In this paper, we have introduced a spectral CNN architecture (DFNets) with feedback-looped filters on graphs. To improve approximation accuracy, we have developed two techniques: scaled-normalization and cut-off frequency. In addition to these, we have discussed some nice properties of feedback-looped filters, such as guaranteed convergence, linear convergence time, and universal design. Our proposed model outperforms the state-of-the-art approaches significantly in two benchmark tasks. In future, we plan to extend the current work to time-varying graph structures. As discussed in Isufi et al. (2017b), feedback-looped graph filters are practically appealing for time-varying settings, and similar to static graphs, some nice properties would likely hold for graphs that are a function of time.
References
 [1] (2016) Diffusionconvolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1993–2001. Cited by: §1.
 [2] (2019) Graph neural networks with convolutional ARMA filters. arXiv preprint arXiv:1901.01343. Cited by: §1, §1, §2, §2, §2, Table 1, §4.1, Table 4.
 [3] (2013) Spectral networks and locally connected networks on graphs. International Conference on Learning Representations (ICLR). Cited by: §1, §1, §1.
 [4] (2010) Toward an architecture for neverending language learning. In TwentyFourth AAAI Conference on Artificial Intelligence (AAAI), Cited by: §4.1.
 [5] (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §1, §2.
 [6] (2009) L2 regularization for learning kernels. In Proceedings of the TwentyFifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 109–116. Cited by: §3.3.
 [7] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems (NeurIPS), pp. 3844–3852. Cited by: §1, §1, §1, §1, §2, §2, §3.1, Table 1, §4.1, Table 4, Table 5, Appendices.
 [8] (2000) On gradient adaptation with unitnorm constraints. IEEE Transactions on Signal processing 48 (6), pp. 1843–1847. Cited by: §3.3.
 [9] (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems (NeurIPS), pp. 2224–2232. Cited by: §1.
 [10] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (AIStats), pp. 249–256. Cited by: §3.3.
 [11] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034. Cited by: §1, §1.
 [12] (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §2, §2.
 [13] (2015) Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163. Cited by: §1.
 [14] (2015) Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors 2015. Cited by: §1.
 [15] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4700–4708. Cited by: 4th item, §3.3.
 [16] (2017) Autoregressive moving average graph filters: a stable distributed implementation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4119–4123. Cited by: §3.1, §3.4.
 [17] (2017) Autoregressive moving average graph filtering. IEEE Transactions on Signal Processing 65 (2), pp. 274–288. Cited by: 3rd item, §2, §3.1, §3.4, §5.
 [18] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR). Cited by: §4.1.
 [19] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR). Cited by: §1, §2, §2, §2, §4.1, §4.2, Table 4, Appendices.
 [20] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §1.
 [21] (2017) CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109. Cited by: §1, §1, §2, §2, §2, §2, §3.1, §3.4, Table 1, §4.1, §4.2, Table 4, Table 5.
 [22] (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2359–2367. Cited by: §1.
 [23] (2019) LanczosNet: multi-scale deep graph convolutional networks. In Proceedings of the Seventh International Conference on Learning Representations (ICLR). Cited by: §2, §2, §2, Table 1, §4.1, §4.2, Table 4, Appendices, Appendices.
 [24] (2003) Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 496–503. Cited by: §4.1, Table 4.
 [25] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.5.
 [26] (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 701–710. Cited by: §1, §4.1, Table 4.
 [27] (2013) Discrete signal processing on graphs: graph Fourier transform. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6167–6170. Cited by: §2.
 [28] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93. Cited by: §4.1.
 [29] (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §1.
 [30] (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, pp. 83–98. Cited by: §2.
 [31] (2017) Graph attention networks. In International Conference on Learning Representations (ICLR). Cited by: §1, §4.1, §4.2, Table 4, Appendices.
 [32] (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Cited by: §4.1, Table 4.
 [33] (2016) Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 40–48. Cited by: §4.1, §4.1, §4.1, §4.2, Table 4.
 [34] (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 912–919. Cited by: §4.1, Table 4.
Appendices
In the following, we provide further experiments comparing our work with other models.
Comparison with different spectral graph filters. We have conducted an ablation study of our proposed graph filters. Specifically, we compare our feedback-looped filters, i.e., the spectral filters newly proposed in this paper, against other spectral filters, namely Chebyshev filters and Cayley filters. To conduct this ablation study, we remove the dense connections from our model DFNet. The experimental results are presented in Table 5. They show that feedback-looped filters improve upon Chebyshev filters by margins of 1.4%, 1.7% and 7.3% on the datasets Cora, Citeseer and Pubmed, respectively. They also improve upon Cayley filters by a margin of 0.7% on the Cora dataset.
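Table 5: Comparison of spectral graph filters on Cora, Citeseer and Pubmed.

| Model | Cora | Citeseer | Pubmed |
|---|---|---|---|
| Chebyshev filters [7] | 81.2 | 69.8 | 74.4 |
| Cayley filters [21] | 81.9 | - | - |
| Feedback-looped filters (ours) | 82.6 ± 0.3 | 71.5 ± 0.4 | 81.7 ± 0.6 |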
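The margins quoted in the ablation paragraph can be re-derived from the first number in each Table 5 cell. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `margin` are chosen here for illustration and are not part of the paper.

```python
# First value of each Table 5 cell (None marks entries that are not reported).
table5 = {
    "Chebyshev": {"Cora": 81.2, "Citeseer": 69.8, "Pubmed": 74.4},
    "Cayley": {"Cora": 81.9, "Citeseer": None, "Pubmed": None},
    "Feedback-looped": {"Cora": 82.6, "Citeseer": 71.5, "Pubmed": 81.7},
}


def margin(ours, baseline):
    """Gain of `ours` over `baseline` per dataset, rounded to one decimal."""
    gains = {}
    for dataset, value in table5[baseline].items():
        if value is not None:  # skip datasets the baseline does not report
            gains[dataset] = round(table5[ours][dataset] - value, 1)
    return gains


print(margin("Feedback-looped", "Chebyshev"))  # {'Cora': 1.4, 'Citeseer': 1.7, 'Pubmed': 7.3}
print(margin("Feedback-looped", "Cayley"))     # {'Cora': 0.7}
```

The differences are absolute gaps between the tabulated values, so they are best read as percentage points.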
Comparison with LNet and AdaLNet using different data splits. We have benchmarked the performance of our DFNet model against the models LNet and AdaLNet proposed in [23], as well as Chebyshev, GCN and GAT, over three citation network datasets: Cora, Citeseer and Pubmed. We use the same data splits [5.2%, 3%, 1%, and 0.5%] as used in [23]. Note that 5.2% is the standard data split that was also used in previous works [7, 19, 31]. All the experiments are repeated 10 times. For our model DFNet, we use the same hyperparameter settings as discussed in Section 4.2.
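Each entry in the result tables pairs a main value with a second, smaller number. Assuming these are a mean and a standard deviation over the 10 repetitions (the text does not state which spread statistic is used), the aggregation could be sketched as follows. The helper `summarize` and the values in `runs` are hypothetical placeholders, not results from the paper.

```python
import statistics


def summarize(accuracies):
    """Return (mean, standard deviation) of repeated runs, rounded to one decimal."""
    mean = statistics.mean(accuracies)
    # Population standard deviation is an assumption; the sample version
    # (statistics.stdev) is an equally plausible reading of the tables.
    std = statistics.pstdev(accuracies)
    return round(mean, 1), round(std, 1)


# Placeholder accuracies for 10 hypothetical runs (illustration only).
runs = [80.0, 81.0, 79.0, 80.5, 79.5, 80.0, 81.5, 78.5, 80.0, 80.0]
mean, std = summarize(runs)
print(f"{mean} ± {std}")  # 80.0 ± 0.8
```

With very small training splits the spread across runs tends to grow, which is why reporting it alongside the mean matters for the comparisons in this appendix.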
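Table 6: Results on Cora with different training splits.

| Training Split | Chebyshev | GCN | GAT | LNet | AdaLNet | DFNet |
|---|---|---|---|---|---|---|
| 5.2% (standard) | 78.0 ± 1.2 | 80.5 ± 0.8 | 82.6 ± 0.7 | 79.5 ± 1.8 | 80.4 ± 1.1 | 85.2 ± 0.5 |
| 3% | 62.1 ± 6.7 | 74.0 ± 2.8 | 56.8 ± 7.9 | 76.3 ± 2.3 | 77.7 ± 2.4 | 80.5 ± 0.4 |
| 1% | 44.2 ± 5.6 | 61.0 ± 7.2 | 48.6 ± 8.0 | 66.1 ± 8.2 | 67.5 ± 8.7 | 69.5 ± 2.3 |
| 0.5% | 33.9 ± 5.0 | 52.9 ± 7.4 | 41.4 ± 6.9 | 58.1 ± 8.2 | 60.8 ± 9.0 | 61.3 ± 4.3 |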
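Table 7: Results on Citeseer with different training splits.

| Training Split | Chebyshev | GCN | GAT | LNet | AdaLNet | DFNet |
|---|---|---|---|---|---|---|
| 5.2% (standard) | 70.1 ± 0.8 | 68.1 ± 1.3 | 72.2 ± 0.9 | 66.2 ± 1.9 | 68.7 ± 1.0 | 74.2 ± 0.3 |
| 1% | 59.4 ± 5.4 | 58.3 ± 4.0 | 46.5 ± 9.3 | 61.3 ± 3.9 | 63.3 ± 1.8 | 67.4 ± 2.3 |
| 0.5% | 45.3 ± 6.6 | 47.7 ± 4.4 | 38.2 ± 7.1 | 53.2 ± 4.0 | 53.8 ± 4.7 | 55.1 ± 3.2 |
| 0.3% | 39.3 ± 4.9 | 39.2 ± 6.3 | 30.9 ± 6.9 | 44.4 ± 4.5 | 46.7 ± 5.6 | 48.3 ± 3.5 |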
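Table 8: Results on Pubmed with different training splits.

| Training Split | Chebyshev | GCN | GAT | LNet | AdaLNet | DFNet |
|---|---|---|---|---|---|---|
| 5.2% (standard) | 69.8 ± 1.1 | 77.8 ± 0.7 | 76.7 ± 0.5 | 78.3 ± 0.3 | 78.1 ± 0.4 | 84.3 ± 0.4 |
| 0.1% | 55.2 ± 6.8 | 73.0 ± 5.5 | 59.6 ± 9.5 | 73.4 ± 5.1 | 72.8 ± 4.6 | 75.2 ± 3.6 |
| 0.05% | 48.2 ± 7.4 | 64.6 ± 7.5 | 50.4 ± 9.7 | 68.8 ± 5.6 | 66.0 ± 4.5 | 67.2 ± 7.3 |
| 0.03% | 45.3 ± 4.5 | 57.9 ± 8.1 | 50.9 ± 8.8 | 60.4 ± 8.6 | 61.0 ± 8.7 | 59.3 ± 6.6 |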
Tables 6–8 present the experimental results. Table 6 shows that DFNet performs better than all the other models on the Cora dataset at every split, including LNet and AdaLNet proposed in [23]. Similarly, Table 7 shows that DFNet performs better than all the other models on the Citeseer dataset at every split. For the Pubmed dataset, as shown in Table 8, DFNet performs best at the standard and 0.1% splits, but it falls behind LNet at the 0.05% split (67.2 vs. 68.8) and behind both LNet and AdaLNet at the 0.03% split (59.3 vs. 60.4 and 61.0). Overall, these results indicate that our model DFNet remains robust as the amount of training data decreases.