SWNet: Small-World Neural Networks and Rapid Convergence
Abstract
Training large, highly accurate deep learning (DL) models is computationally costly. This cost is in great part due to the excessive number of trained parameters, which are well-known to be redundant and compressible for the execution phase. This paper proposes a novel transformation that changes the topology of the DL architecture so that it reaches optimal cross-layer connectivity. The transformation leverages our key observation that, for a set level of accuracy, convergence is fastest when the network topology reaches the boundary of a small-world network. Small-world graphs are known to possess a specific connectivity structure that enables enhanced signal propagation among nodes. Our small-world models, called SWNets, provide several intriguing benefits: they facilitate data (gradient) flow within the network, enable feature-map reuse by adding long-range connections, and accommodate various network architectures/datasets. Compared to densely connected networks (e.g., DenseNets), SWNets require substantially fewer training parameters while maintaining a similar level of classification accuracy. We evaluate our networks on various DL model architectures and image-classification datasets, namely CIFAR-10, CIFAR-100, and ILSVRC (ImageNet). Our experiments demonstrate an average 2.1-fold improvement in convergence speed to the desired accuracy.
1 Introduction
Deep learning models are increasingly popular for various learning tasks, particularly in visual computing applications. A big advantage of DL is that it can automatically learn the relevant features by computing on a large corpus of data, thus eliminating the need for the hand-selection of features common in traditional methods. In the contemporary big-data realm, visual datasets are increasingly growing in size and variety. For instance, the ILSVRC challenge dataset has 22K classes with over 14M images [25]. To increase inference accuracy on such challenging datasets, DL models are evolving towards higher-complexity architectures. State-of-the-art models tend to reach good accuracy, but they suffer from a dramatically high training cost.
As DL models grow deeper and more complex, the large number of stacked layers gives rise to a variety of problems, e.g., vanishing gradients [7, 3], which render the models hard to train. To facilitate convergence and enhance the gradient flow for deeper models, the creation of bypass connections has recently been suggested. These shortcuts connect layers that would otherwise be disconnected in a traditional Convolutional Neural Network (CNN) [28, 9, 12, 35]. To curtail the cost of hand-crafted DL architecture exploration, the existing literature typically realizes the shortcuts by replicating the same building block throughout the network [9, 12, 35]. However, such a repeated pattern of blocks induces unnecessary redundancies [11] that increase the computational overhead.
This paper proposes a novel methodology that transforms the topology of conventional CNNs such that they reach optimal cross-layer connectivity. This transformation is based on our observation that the pertinent connectivity pattern highly impacts training speed and convergence. To ensure computational efficiency, our architectural modification takes place prior to training. Thus, the incorporated connectivity measure must be independent of network gradients/loss and training data. Towards this goal, we view CNNs as graphs and revisit small-world networks (SWNs) [34] from graph theory to transform CNNs into highly connected small-world topologies. Watts-Strogatz SWNs [34] are widely used in the analysis of complex graphs; due to SWNs' specific connection pattern, these structures provide theoretical guarantees for considerably decreased consensus times [23, 32, 36].
Our network modification algorithm takes as input a conventional CNN architecture and enforces the small-world property on its topology to generate a new network, called SWNet. We leverage a quantitative metric for small-worldness and devise a customized rewiring algorithm. Our algorithm restructures the inter-layer connections in the input CNN to find a topology that balances regularity and randomness, which is the key characteristic of SWNs [34]. The small-world property in CNNs translates to an architecture where all layers are interlinked via sparse connections. An example of such a network is shown in Fig. 1.
SWNets have a similar quality of prediction and number of trainable parameters as their baseline feed-forward architectures, but due to the added sparse links and the optimal SWN connectivity, they warrant better data flow. In summary, our architecture modification has three main properties: (i) it removes non-critical connections and reduces computational overhead; (ii) it increases the degrees of freedom during training, allowing faster convergence; (iii) it provides customized data paths in the model for better cross-layer information propagation.
We conduct comprehensive experiments on various network architectures and showcase SWNets' performance on popular image-classification benchmarks including CIFAR-10, CIFAR-100, and ImageNet. Our small-world CNNs achieve an average 2.1-fold reduction in the training iterations required to reach classification accuracy comparable to the baseline models. We further compare SWNet with the state-of-the-art DenseNet model and show that, with fewer parameters, SWNets demonstrate identical performance during training.
2 Related Work and Background
2.1 Related work
Bypass Connections. A substantial amount of research has focused on adding bypass connections to the hierarchical CNN architecture to enhance inter-layer information flow and enable feature reuse. The authors of [28] implement bypass connections using parametrized (gated) interlinks to enable model fine-tuning. To avert the burst in the number of trainable parameters caused by such gated connections, ResNets [9] use identity links (skip connections) to connect the concatenated layers. Such skip connections follow a modular structure. There exists significant redundancy in (deep) ResNets, as alternative inter-layer connections may exist that render higher accuracy at lower model complexity; as shown by [13], not all identity links are necessary.
A variation of ResNets that uses wider residual blocks is introduced in [33, 35] to further improve image-classification accuracy, although the effects of such architectural modifications on convergence speed and training overhead still require a more comprehensive study. Inception networks [31] are another example of benefiting from wider architectures. The authors of [30] show that the addition of residual connections to the originally proposed Inception architecture drastically increases model convergence speed. This work further motivated us to study CNN convergence gains from the addition of bypass connections.
DenseNets [12] group CNN layers into blocks, with each layer connected to all its preceding layers. This is done by concatenating the previous layers' feature-maps and using them as the input. Another work [11] argues that such a dense connectivity pattern incurs redundancies, since earlier features might not be required in later layers. The authors propose to prune such redundancies to generate a more efficient architecture for the CNN inference phase. However, the paper does not explore the possible effects of pruning on training.
In summary, prior work mainly focuses on the accuracy gains of long-range connections, with little attention to the training overhead induced by the introduction of redundant parameters. In contrast to prior art, we add only those long-range connections that are key contributors to model accuracy as well as convergence speed. To the best of our knowledge, SWNet is the first work to intertwine the small-world property with CNNs and to examine the trained network in terms of convergence speed and accuracy.
To further highlight the distinction between our work and prior art, Fig. 2 illustrates the connection patterns in ResNet, DenseNet, and SWNet architectures. In contrast to the former two models, SWNet is not structured upon fixed building blocks and can therefore adapt to any given network architecture. Different from DenseNets, which only accommodate fully dense connections, SWNet leverages customized sparse convolutions. Such sparsity enables selective connectivity between pairs of layers, which enhances convergence speed while ensuring low redundancy.
Small-World Networks. Perhaps the first investigation of SWNs in the context of deep learning was performed in [6], where the authors transform simple MLPs into SWN graphs and study the accuracy benefits for the diagnosis of diabetes. SWNet substantially differs from this work, as our solution is applicable to convolutional neural networks and uses a different mathematical model and small-worldness metric.
2.2 Background: Small-World Networks
Watts and Strogatz [34] observed that real-world complex networks, e.g., the anatomical connections in the brain and the neural networks of animals, cannot be modeled using the existing regular or random graph classes. As such, they introduced the new category of small-world networks. Members of the small-world class have two main characteristics: 1) they have a small average pairwise distance between graph nodes; 2) nodes within the graph exhibit a relatively high (local) clustered structure. The first property is mainly associated with random graphs, while the second is prominent in regular graph classes. Such networks have shown significant enhancements in signal-propagation speed, consensus, synchronization, and computational capability [29, 19, 2, 18, 36].
To construct an SWN, randomness is introduced into a regular graph structure by iterative removal and addition of edges with probability p. Fig. 3 demonstrates the transition between a regular structure and its corresponding random graph as the rewiring probability increases from 0 to 1. Intermediate values of p interpolate between complete regularity and randomness to generate an SWN.
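The Watts-Strogatz construction can be sketched in a few lines of Python (a minimal illustration, not the paper's implementation): start from a ring lattice and rewire each edge with probability p, while avoiding self-loops and duplicate edges so the edge count is preserved.

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Ring lattice of n nodes, each joined to its k nearest neighbours,
    with each edge rewired to a random new endpoint with probability p.
    The total number of edges is preserved throughout."""
    rng = random.Random(seed)
    edges = set()
    for u in range(n):
        for j in range(1, k // 2 + 1):      # k/2 neighbours on each side
            edges.add((u, (u + j) % n))
    for (u, v) in sorted(edges):            # visit every original edge once
        if rng.random() < p:
            # candidate endpoints: no self-loop, no duplicate edge
            free = [w for w in range(n)
                    if w != u and (u, w) not in edges and (w, u) not in edges]
            if free:
                edges.remove((u, v))
                edges.add((u, rng.choice(free)))
    return edges

regular     = watts_strogatz(20, 4, 0.0)   # p = 0: pure ring lattice
small_world = watts_strogatz(20, 4, 0.1)   # intermediate p: small-world regime
random_like = watts_strogatz(20, 4, 1.0)   # p = 1: essentially random
```

Note that every rewiring step removes exactly one edge and adds exactly one, which is the property that later keeps the parameter count of the rewired CNN unchanged.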
3 SWNet: Small-World CNNs
We propose to restructure the inter-layer connections in a DL model such that its topology falls into the small-world category while the total number of parameters in the network is held constant. Throughout the paper, we use the terms DL model and CNN interchangeably, but emphasize that our approach is readily applicable to models without convolutions, e.g., Multi-Layer Perceptrons (MLPs).
In the following, we first elaborate on the smallworld criteria and introduce methods to distinguish SWNs from other topologies (Sec. 3.1). We then explain our conversion of an arbitrary CNN into its equivalent SWN (Sec. 3.2). Lastly, we delineate our implementation and formalize the operations performed in a SWNet (Sec. 3.3).
3.1 Metric for Small-Worldness
To examine the small-world property of a given graph, we study two characteristics, namely the characteristic path length (L) and the global clustering coefficient (C). L is defined as the average distance between pairs of nodes in the graph, and C is a measure of the density of connections between the neighbors of any node in the network. A completely random graph lacks clustering but enjoys a small L. By definition, a graph is small-world if it has an L similar to, but a C higher than, a random graph [37] constructed using the same number of vertices and edges.
Let us denote the clustering coefficient and the characteristic path length of a given graph G by C_G and L_G, respectively. In a similar fashion, we represent the corresponding characteristics of the equivalent random graph by C_rand and L_rand. We use a quantitative measure of the small-world property from [14], which categorizes a network as an SWN if S > 1, where S is calculated using Eq. (1):

S = (C_G / C_rand) / (L_G / L_rand).  (1)
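A minimal, illustrative computation of these quantities in pure Python (function names are our own; the random-graph baselines use the common analytic estimates C_rand ≈ k/n and L_rand ≈ ln(n)/ln(k) for mean degree k, rather than sampling actual random graphs):

```python
import math
from collections import deque

def _adjacency(edges, n):
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    return adj

def clustering_coefficient(edges, n):
    """Average local clustering: the fraction of each node's neighbour
    pairs that are themselves connected, averaged over all nodes."""
    adj = _adjacency(edges, n)
    total = 0.0
    for u in range(n):
        nb, k = adj[u], len(adj[u])
        if k < 2:
            continue
        links = sum(1 for v in nb for w in nb if v < w and w in adj[v])
        total += links / (k * (k - 1) / 2)
    return total / n

def characteristic_path_length(edges, n):
    """Average BFS distance over all connected node pairs."""
    adj = _adjacency(edges, n)
    dist_sum = pairs = 0
    for s in range(n):
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs

def small_worldness(edges, n):
    """S = (C_G / C_rand) / (L_G / L_rand); S > 1 flags a small world."""
    k = 2 * len(edges) / n                    # mean degree
    c_rand, l_rand = k / n, math.log(n) / math.log(k)
    c_g = clustering_coefficient(edges, n)
    l_g = characteristic_path_length(edges, n)
    return (c_g / c_rand) / (l_g / l_rand)

# ring lattice, 20 nodes, degree 4: C = 0.5, L = 55/19
ring = {(u, (u + j) % 20) for u in range(20) for j in (1, 2)}
s = small_worldness(ring, 20)
```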
3.2 Small-World Architecture Acquisition
3.2.1 Graph Generation
In order to modify a given CNN architecture and generate the equivalent SWN, we first model all connections within the network as a graph. In this context, a connection is defined as a linear operation performed between an input element and a trainable weight (network parameter), as found in Convolution (Conv) and Fully-Connected (FC) layers. For Conv layers, each feature-map channel is represented by a node and each edge represents a kernel. For FC layers, each neuron is assigned a separate node and the edges correspond to weight-matrix elements.
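As an illustration, the graph for a small plain CNN can be assembled as follows (a sketch with hypothetical names; the channel counts are arbitrary toy values):

```python
def cnn_to_graph(channel_counts):
    """Model a plain CNN as a graph: one node per feature-map channel
    (one per neuron for FC layers) and one edge per kernel connecting a
    channel of layer l-1 to a channel of layer l. `channel_counts` lists
    the number of channels per layer, including the network input."""
    offsets, total = [], 0
    for c in channel_counts:          # assign consecutive node ids per layer
        offsets.append(total)
        total += c
    edges = []
    for l in range(1, len(channel_counts)):
        for i in range(channel_counts[l - 1]):    # input channel node
            for o in range(channel_counts[l]):    # output channel node
                edges.append((offsets[l - 1] + i, offsets[l] + o))
    return total, edges

# toy network: 3-channel input followed by layers with 4 and 2 channels
n_nodes, edges = cnn_to_graph([3, 4, 2])
```

Each edge here stands for one convolution kernel (or one FC weight), so the edge count of the graph equals the number of kernels in the network.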
3.2.2 Architecture Search
After generating the graph pertinent to the input CNN architecture, we proceed to find the equivalent SWN. To perform this task, the initial graph is randomly rewired with different probabilities p, similar to Fig. 3. For each rewired graph, we compute the characteristic path length and the clustering coefficient, and use the captured pattern of each criterion to detect the small-world topology using the small-worldness measure defined in Sec. 3.1.
Rewiring Policy. Let us denote an edge by (u, v), where u and v are its start and end nodes. To perform random rewiring with probability p, we visit all edges in the graph once. Each edge is rewired with probability p or kept the same with probability 1 − p. If the edge is to be rewired, a new end node v′ is randomly sampled from the set of nodes that are non-neighbors of the edge's start node u. This node is selected such that no self-loops or repeated links exist in the rewired graph. Once the destination node is chosen, the initial edge (u, v) is removed and replaced by (u, v′). Fig. 4 demonstrates our rewiring mechanism. Note that our rewiring methodology does not alter the number of connections in the CNN. As a result, the total number of trainable parameters in the SWN model equals that of the original network.
Network Profiling. Using the aforementioned rewiring policy, we generate various graphs by sweeping the rewiring probability in the [0, 1] interval. Fig. 5 demonstrates the correlation between the clustering coefficient and the characteristic path length as the rewiring probability is changed for a 14-layer CNN model. For conventional CNNs, the clustering coefficient is zero and the characteristic path length can be quite large, especially for very deep networks (leftmost points in Fig. 5). As such, CNNs are far from networks with the small-world property. Random rewiring replaces short-range connections to immediately subsequent layers with longer-range connections. Consequently, the characteristic path length is reduced while the clustering coefficient increases as the network shifts towards its small-world equivalent. We select the topology with the maximum value of the small-worldness measure as the SWNet. As a direct result of this architectural modification, the new network enjoys enhanced connectivity, which results in better gradient propagation and training speedup.
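The overall search, rewiring the graph at several probabilities and keeping the topology with the largest small-worldness score, can be sketched as follows (a self-contained pure-Python toy on a ring lattice; all function and variable names are our own, not the paper's code):

```python
import math, random
from collections import deque

def rewire(edges, n, p, rng):
    """Visit each edge once; with probability p replace its end node with a
    random non-neighbour of the start node (no self-loops, no duplicates).
    The edge count -- i.e. the parameter count -- is unchanged."""
    edges = set(edges)
    for (u, v) in sorted(edges):
        if rng.random() < p:
            free = [w for w in range(n)
                    if w != u and (u, w) not in edges and (w, u) not in edges]
            if free:
                edges.remove((u, v))
                edges.add((u, rng.choice(free)))
    return edges

def profile(edges, n):
    """Return (clustering coefficient, characteristic path length)."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    c = 0.0
    for u in range(n):
        nb, k = adj[u], len(adj[u])
        if k >= 2:
            links = sum(1 for x in nb for y in nb if x < y and y in adj[x])
            c += links / (k * (k - 1) / 2)
    dist_sum = pairs = 0
    for s in range(n):
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1; q.append(v)
        dist_sum += sum(dist.values()); pairs += len(dist) - 1
    return c / n, dist_sum / pairs

# sweep p and keep the most small-world topology,
# scoring S = (C/C_rand) / (L/L_rand) with the usual random-graph estimates
n = 30
ring = {(u, (u + j) % n) for u in range(n) for j in (1, 2)}
k = 2 * len(ring) / n
c_rand, l_rand = k / n, math.log(n) / math.log(k)
best_p, best_s = None, -1.0
for p in [0.0, 0.1, 0.2, 0.4, 0.8, 1.0]:
    c, l = profile(rewire(ring, n, p, random.Random(0)), n)
    s = (c / c_rand) / (l / l_rand)
    if s > best_s:
        best_p, best_s = p, s
```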
To demonstrate the efficiency of the SWN versus the other configurations generated during the probability sweep, we train several rewired networks on the MNIST dataset [20], each constructed from a 5-layer CNN. Fig. 6 demonstrates the convergence speed of these architectures versus the rewiring probability used to generate them from the baseline model. Due to the addition of long-range connections, almost all models show convergence improvements over the baseline. However, the best balance between node clustering and average path length is achieved by the SWN, which, in turn, renders the fastest convergence.
3.3 SWNet Methodology
CNN Formulation. Conventional CNNs are comprised of consecutive layers, where each layer l performs a combination of linear and non-linear operations on its input x_l to generate the corresponding output y_l. We denote the core linear operations (Conv and FC) in a CNN by F_l, with the subscript representing the layer index. Other operations can take the form of Batch Normalization (BN) [15], the Rectified Linear Unit (ReLU) [8], and Pooling [21]. For each linear layer, we bundle one or more of such operations together and show them as one composite function H_l. For an arbitrary layer in a conventional CNN, the output is formalized as:

y_l = H_l(F_l(x_l)).  (2)

Note that the cascaded nature of CNNs implies that the generated output from one layer serves as the input to the immediately succeeding layer: x_{l+1} = y_l.
Sparse Connections in SWNets. One major difference between SWNets and conventional CNNs is that SWNet layers can be interconnected regardless of their position in the network hierarchy. More specifically, the output of each layer of an SWNet is connected to all its succeeding layers via sparse weight tensors. These connections are implemented via convolution kernels with coarse-grained sparsity patterns. Fig. 7 shows the convolution filters of an example sparse connection from a layer with 5 output channels to a layer with 3 output channels, together with its small-world graph representation. Let us denote the sparse connection from layer i to layer l by S_{i,l}. The output of the l-th layer in an SWNet can then be calculated as:

y_l = H_l( F_l(x_l) + Σ_{i=1}^{l-2} S_{i,l}(y_i) ).  (3)

Comparing the above formulation with Eq. (2), we highlight the extra summation term that accounts for the inter-layer connections. Note that in Eq. (3), both F_l and S_{i,l} are sparse tensors. The inter-layer connectivity in SWNet enables enhanced data flow, in both the inference and training stages, while the sparse connections mitigate unnecessary parameter utilization. In contrast to the previously proposed feature-concatenation methodology [12], we perform summation over the feature-maps. By means of this approach, we avoid the extremely high-dimensional kernels that result from channel-wise feature-map concatenation. Furthermore, the summation of feature-maps enables SWNet to be applicable to all network architectures with various layer configurations.
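The SWNet layer output, the regular feed-forward path plus summed sparse long-range contributions, can be illustrated with a toy forward pass in which every "layer" carries a single scalar channel (a sketch under that simplification; the real model uses sparse convolution tensors):

```python
def relu(x):
    return max(0.0, x)

def swnet_forward(x, within, longrange):
    """Toy version of the SWNet layer equation with scalar channels:
    y_l = H_l( F_l(x_l) + sum over earlier layers i of S_{i,l} * y_i ).
    `within[l]` is the weight of the regular layer-l connection;
    `longrange[(i, l)]` holds sparse long-range weights (absent pairs
    mean no connection). H_l is a plain ReLU here. Summation -- not
    concatenation -- keeps the feature dimensionality fixed."""
    outputs = []                 # y_0, y_1, ...
    inp = x
    for l, w in enumerate(within):
        pre = w * inp            # F_l(x_l)
        for i in range(l - 1):   # contributions of non-adjacent layers
            if (i, l) in longrange:
                pre += longrange[(i, l)] * outputs[i]
        y = relu(pre)            # composite non-linearity H_l
        outputs.append(y)
        inp = y                  # cascaded: x_{l+1} = y_l
    return outputs

ys = swnet_forward(1.0, within=[1.0, 0.5, 2.0], longrange={(0, 2): 0.25})
```

Here layer 2 receives 2.0 · y_1 from its regular input plus 0.25 · y_0 through the long-range link, matching the summation form above.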
Composite Non-linear Operation. In contrast to DenseNets [12] and ResNets [9], where several linear layers are concatenated before pooling is performed, SWNets support pooling immediately after each layer, as in conventional CNN architectures. We experiment with various orderings of the widely used non-linear operations, i.e., BN, ReLU, and Max-pooling, to investigate the effect of ordering on network convergence. Our experiments demonstrate that SWNet convergence is enhanced when the composite non-linear function is implemented as a ReLU, followed by Max-pooling and BN, as shown in Fig. 2.
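The chosen ordering can be illustrated on a 1-D toy signal (a sketch; the batch-norm step here is a plain inference-style normalization without learned scale and shift parameters):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def maxpool(xs, size=2):
    # non-overlapping 1-D max-pooling
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

def batchnorm(xs, eps=1e-5):
    # normalise to zero mean / unit variance (no learned gamma/beta)
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / (var + eps) ** 0.5 for x in xs]

def composite(xs):
    """The SWNet ordering: ReLU, then max-pooling, then BN."""
    return batchnorm(maxpool(relu(xs)))

out = composite([-1.0, 2.0, 3.0, -4.0])
```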
Table 1: Baseline CNN architectures, each listed as alternating Convolution and Max-Pool stages followed by a classifier (e.g., 1000-FC with softmax for ImageNet).
We modify the ConvNet-C fully-connected layers from [26] to comply with the CIFAR datasets.
Convolution | Dense Block (1) | Transition Block | Dense Block (2) | Transition Block | Dense Block (3) | Classifier
DenseNet-40 [12] (C10): Conv | [Conv] | Conv, average pool | [Conv] | Conv, average pool | [Conv] | average pool, 10-FC, softmax
Conv denotes a BN, followed by a ReLU and a convolution layer.
4 Experiments
We conduct proof-of-concept experiments on different network architectures and image-classification benchmarks to empirically demonstrate the enhanced convergence speed of SWNets compared to their baseline (conventional) counterparts. Our implementations are developed in the popular neural-network development APIs Keras [4] and PyTorch [24].
4.1 Datasets
CIFAR. We carry out our experiments on the two available CIFAR [16] datasets. The CIFAR-10 (C10) and CIFAR-100 (C100) benchmarks consist of 32×32 colored images categorized into 10 and 100 classes, respectively. Each dataset contains 50,000 samples for training and 10,000 samples for testing. We use the standard data-augmentation routines popular in prior work [9, 13]. The samples are normalized using per-channel mean and standard deviation. At training time, random horizontal mirroring, shifting, and slight rotation are also applied.
ImageNet. The ILSVRC-2012 dataset, widely known as ImageNet [5], consists of 1000 classes of colored images, with 1.2 million samples for training and 50,000 samples for validation. We use the augmentation scheme proposed in [26, 10] to preprocess input samples. During training, we resize the images by randomly sampling the shorter edge from [256, 480]. A crop is then randomly sampled from the image. We also perform per-channel normalization as well as horizontal mirroring [17].
4.2 Benchmarked Architectures
Tab. 1 lists our baseline CNN architectures. SWNets maintain the same feed-forward architecture as the baseline networks and are constructed by 1) replacing the original layers with sparse convolutions and 2) implementing additional sparse convolutions between non-consecutive layers. To match the dimensionality of feature-maps connected across layers, we tune the stride in the long-range sparse connections and use zero-padding where necessary; we make sure that the stride is smaller than the convolution window size. This approach enables us to control the dimensionality of the produced feature-maps as well as tune the impact of the added long-range connections.
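This bookkeeping follows the standard convolution output-size formula; a small helper for picking a stride smaller than the kernel window might look like this (illustrative, not the paper's code):

```python
def conv_out_dim(in_dim, kernel, stride, pad):
    """Spatial output size: floor((in + 2*pad - kernel) / stride) + 1."""
    return (in_dim + 2 * pad - kernel) // stride + 1

def match_stride(src_dim, dst_dim, kernel, max_pad=3):
    """Find (stride, pad) so a long-range connection from a feature map
    of width src_dim produces width dst_dim, keeping stride < kernel as
    required in the paper. Returns None if no combination fits."""
    for stride in range(1, kernel):          # stride smaller than window
        for pad in range(max_pad + 1):
            if conv_out_dim(src_dim, kernel, stride, pad) == dst_dim:
                return stride, pad
    return None

# e.g. connect a 32x32 feature map to a 16x16 layer with 3x3 kernels
sp = match_stride(32, 16, 3)
```

For the 32 → 16 example with a 3×3 kernel, stride 2 with one pixel of zero-padding does the job.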
4.3 Results on CIFAR
4.3.1 ConvNet-C
Training. We train the ConvNet-C [26] model on the C10 and C100 benchmarks with a batch size of 128. To prevent overfitting, dropout layers are added throughout the network. The small-world model is constructed using the same configuration of layers as the baseline, including the dropout layers. We use the Stochastic Gradient Descent (SGD) optimizer with Nesterov momentum and weight decay. The initial learning rate is set to 0.01 for both datasets and is decayed by 0.5 upon optimization plateau.
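The decay-on-plateau rule can be sketched as follows (a hypothetical helper; `patience` is our own assumed parameter, not specified in the paper):

```python
def plateau_scheduler(initial_lr=0.01, factor=0.5, patience=3):
    """Multiply the learning rate by `factor` whenever the validation
    loss has not improved for `patience` consecutive evaluations."""
    state = {"lr": initial_lr, "best": float("inf"), "bad": 0}
    def step(val_loss):
        if val_loss < state["best"]:
            state["best"], state["bad"] = val_loss, 0
        else:
            state["bad"] += 1
            if state["bad"] >= patience:
                state["lr"] *= factor
                state["bad"] = 0
        return state["lr"]
    return step

step = plateau_scheduler()
# loss improves twice, then stalls for three evaluations -> rate halves
lrs = [step(l) for l in [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]]
```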
Convergence. Fig. 8(a) illustrates the test error and training loss of the baseline and SWNet models as representatives of convergence speed on C10. Similarly, for the C100 benchmark, the corresponding convergence curves are presented in Fig. 8(b). While these figures qualitatively demonstrate the effectiveness of our methodology, we also provide a quantitative measure for a solid comparison between SWNet and the baseline. We investigate several points corresponding to various test accuracies and compare the two models' convergence times to these points. Tab. 2 summarizes the per-accuracy speedup of SWNet over the baseline model. As seen, the speedup varies across accuracies; however, for all test accuracies, SWNet requires substantially fewer iterations for convergence. At the final saturation point (marked in Fig. 8), both models achieve comparable accuracies while SWNet enjoys a 2.64× and 2.82× reduction in convergence time for the C10 and C100 datasets, respectively.
CIFAR-10
Baseline   Test Error (%)   24.21   17.80   9.22    8.51    7.56
           Iterations       1408    2560    8704    11008   18560
SWNet      Test Error (%)   23.73   17.57   8.64    8.25    7.44
           Iterations       896     1536    4992    5888    7040
Speedup                     1.57    1.67    1.74    1.87    2.64

CIFAR-100
Baseline   Test Error (%)   77.08   52.30   41.54   31.14   29.52
           Iterations       2944    6144    9472    16128   28928
SWNet      Test Error (%)   76.67   50.57   40.18   31.15   29.26
           Iterations       384     1408    3072    7808    10240
Speedup                     7.67    4.36    3.08    2.10    2.82
4.3.2 DenseNet
DenseNets [12] achieve state-of-the-art accuracy by connecting all neurons from different layers of a dense block with trainable (dense) parameters. Such a dense connectivity pattern results in high redundancy in the parameter space and causes extra training overhead. We show that an SWNet with only sparse connections and far fewer parameters achieves results similar to DenseNet's.
Training. We train a 40-layer DenseNet model (Tab. 1) on the C10 dataset. The equivalent SWNet is constructed by removing all long-range dense connections from the architecture and rewiring the remaining short-range edges such that each dense block transitions into a small-world structure. The SWNet maintains the same number of layers, while the inter-layer connections are implemented using sparse convolution kernels, thus incurring substantially fewer trainable parameters.
We use the publicly available PyTorch implementation of DenseNets (https://github.com/andreasveit/densenet-pytorch) and replace the model with our small-world network. The same training scheme as in the original DenseNet paper [12] is used: models are trained with a batch size of 64; the initial learning rate is 0.1 and is divided by 10 at 50% and 75% of the total training iterations.
Convergence. Fig. 9 demonstrates the test accuracy of the models versus the number of epochs. Although SWNet has far fewer parameters, both models achieve comparable validation accuracy with identical convergence speed. We report the computational complexity (FLOPs) of the models as the total number of multiplications performed during one forward propagation through the network. Tab. 3 compares the benchmarked DenseNet and SWNet in terms of FLOPs and the number of trainable weights in the Conv and FC layers. We highlight that SWNet achieves comparable test accuracy with a roughly 9× reduction in parameter-space size.
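The parameter and multiplication counts of a dense versus a sparse convolution layer can be estimated with simple accounting (illustrative; `density` is our own stand-in for the fraction of channel-to-channel kernels SWNet keeps):

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w, density=1.0):
    """Weights and multiplications of a (possibly sparse) convolution.
    `density` = 1.0 for a dense layer, < 1.0 when only a fraction of the
    in-channel/out-channel kernels are present."""
    kernels = int(in_ch * out_ch * density)   # surviving 2-D kernels
    params = kernels * kernel * kernel        # trainable weights
    flops = params * out_h * out_w            # one multiply per weight per position
    return params, flops

dense  = conv_layer_cost(64, 64, 3, 32, 32)        # dense 3x3 layer
sparse = conv_layer_cost(64, 64, 3, 32, 32, 0.25)  # 25% of kernels kept
```

With 25% kernel density, both the weight count and the FLOPs drop by the same 4× factor, which is why the FLOPs reduction in Tab. 3 tracks the parameter reduction.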
Model           Depth   Params   FLOPs    Test Error
DenseNet [12]   40      910K     285.3M   0.071
SWNet           40      98K      85.5M    0.074
4.4 Results on ImageNet
4.4.1 AlexNet
Training. We train the AlexNet [17] model on the ImageNet dataset, following the architecture provided in the Caffe model zoo [1] (see Tab. 1). To mitigate overfitting, we add dropout layers with probability 0.5 after each fully-connected layer (except the last). Loss minimization is performed by means of SGD with Nesterov momentum [22] of 0.9. We set the batch size to 64 for both models and incorporate an exponential decay for the learning rate [27].
Convergence. Fig. 10 demonstrates the test error and training loss of the baseline and SWNet models. For all values of test error, the convergence of our small-world architecture is faster. As with the CIFAR benchmarks, to fully examine the performance of our model, we report the speedup of SWNet over the baseline for several values of test error. Tab. 4 presents the point-wise comparison between the benchmarked models. As indicated by our evaluations, SWNet converges to the final test accuracy after 3776 iterations while the baseline model needs 5120 iterations, resulting in a 1.36× overall speedup.
AlexNet
Baseline   Test Error (%)   51.72   46.29   44.21   42.31   42.01
           Iterations       1088    2304    3264    4416    5120
SWNet      Test Error (%)   51.97   46.49   44.25   42.31   41.55
           Iterations       768     1664    2368    3520    3776
Speedup                     1.42    1.38    1.38    1.25    1.36
4.4.2 ResNet
Training. We adopt the training scheme of the original ResNet paper [9]. To build the SWNet, we first remove all shortcut and bottleneck connections from the model. We then rewire the connections in the resulting plain network such that it becomes small-world. No dropout is used for the baseline or the SWNet. The batch size is set to 128 and we use SGD with momentum and weight decay. The initial learning rate is set to 0.1 and decays when the accuracy plateaus. We report single-crop accuracies.
Convergence. The test error and training loss of the baseline ResNet and the SWNet are shown in Fig. 11. As seen, SWNet achieves both higher accuracy and higher convergence speed throughout training. For a more quantitative comparison, we present point-wise speedups for various iterations and test errors in Tab. 5. As evident from the results, the systematic restructuring of long edges in SWNet allows for better convergence speed compared to the replicated blocks in the baseline ResNet.
ResNet-18
Baseline   Test Error (%)   60.37   56.94   37.91   31.72
           Iterations       1792    3456    7424    9344
SWNet      Test Error (%)   59.63   56.76   37.86   31.68
           Iterations       512     768     3584    7168
Speedup                     3.50    4.50    2.07    1.31
5 Discussion on Long-Range Connections
The selected small-world structure for a given CNN has two main characteristics, namely high clustering of nodes and a small average path length between neurons across layers. We postulate that these qualities render the SWN desirable during training due to the enhanced information-flow paths existent in such efficiently connected networks. To examine our hypothesis, we visualize the weights connecting different layers of the trained SWNet for the C10, C100 (ConvNet-C), and ImageNet (AlexNet) benchmarks. Fig. 12 presents a heat map of the average absolute values of the weights connecting each pair of layers.
Each square at position (i, j) of the heat map represents the strength of the connections between layers i and j, where index 0 denotes the network input. Shades of orange, red, and maroon indicate strong inter-layer dependency, while white indicates that no connections are present between the corresponding layers in SWNet. We summarize our observations based on the heat map as follows:

- Each layer has strong connections to non-adjacent layers, indicating that the long-range edges established in SWNet are crucial to performance.
- The input layer has weights spread across all layers of the network, which demonstrates the importance of connections between earlier and deeper layers.
- SWNet preserves the strong connections between each layer and the one immediately succeeding it, thus maintaining the conventional CNN data flow.
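The heat-map statistic described above, the mean absolute weight per layer pair, can be computed as in this sketch (the data layout and names are our own):

```python
def connection_strengths(weights, num_layers):
    """Mean absolute weight for every (source, destination) layer pair,
    as visualised in the heat map. `weights[(i, j)]` is a flat list of
    the kernel weights connecting layer i to layer j (index 0 = network
    input); missing pairs have no connection and stay at 0."""
    heat = [[0.0] * num_layers for _ in range(num_layers)]
    for (i, j), w in weights.items():
        heat[i][j] = sum(abs(x) for x in w) / len(w)
    return heat

# toy example: input -> layer 1, input -> layer 2, layer 1 -> layer 2
heat = connection_strengths(
    {(0, 1): [0.5, -0.5], (0, 2): [0.1, 0.3], (1, 2): [-0.8]}, 3)
```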
6 Conclusion
We propose a novel methodology that adaptively modifies conventional feed-forward DL models into new architectures, called SWNets, that fall into the category of small-world networks, a class of complex graphs used to study real-world systems such as the human brain and the neural networks of animals. By leveraging the intriguing features of small-world networks, e.g., enhanced signal-propagation speed and synchronizability, SWNets enjoy enhanced data flow within the network, resulting in substantially faster convergence during training. Our small-world models are implemented via sparse connections from each layer in the traditional CNN to all succeeding layers. Such sparse convolutions enable SWNets to benefit from long-range connections while mitigating the parameter-space redundancy existent in prior art. As our experiments demonstrate, SWNets achieve state-of-the-art accuracy in, on average, 2.1× fewer training iterations. Furthermore, compared to a densely connected architecture, SWNets achieve comparable accuracy with a roughly 9× reduction in the number of parameters. In summary, due to their optimal graph connectivity and fast response to training, SWNets can be advantageous for smart vision applications.
References
 [1] BAIR/BVLC AlexNet model.
 [2] M. Barahona and L. M. Pecora. Synchronization in smallworld systems. Physical review letters, 89(5):054101, 2002.
 [3] Y. Bengio, P. Simard, and P. Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [4] F. Chollet et al. Keras. https://keras.io, 2015.
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [6] O. Erkaymaz, M. Ozer, and M. Perc. Performance of smallworld feedforward neural networks for the diagnosis of diabetes. Applied Mathematics and Computation, 311:22–28, 2017.
 [7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [8] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [11] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions, 2017.
 [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [13] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
 [14] M. D. Humphries and K. Gurney. Network 'small-world-ness': a quantitative method for determining canonical network equivalence. PLoS ONE, 3(4):e0002051, 2008.
 [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [16] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [18] M. Kuperman and G. Abramson. Small world effect in an epidemiological model. Physical Review Letters, 86(13):2909, 2001.
 [19] V. Latora and M. Marchiori. Efficient behavior of small-world networks. Physical Review Letters, 87(19):198701, 2001.
 [20] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [22] Y. Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
 [23] R. Olfati-Saber. Ultrafast consensus in small-world networks. In American Control Conference, 2005. Proceedings of the 2005, pages 2371–2378. IEEE, 2005.
 [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [27] L. N. Smith. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 464–472. IEEE, 2017.
 [28] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
 [29] S. H. Strogatz. Exploring complex networks. Nature, 410(6825):268, 2001.
 [30] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
 [31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [32] A. Tahbaz-Salehi and A. Jadbabaie. Small world phenomenon, rapidly mixing Markov chains, and average consensus algorithms. In Decision and Control, 2007 46th IEEE Conference on, pages 276–281. IEEE, 2007.
 [33] S. Targ, D. Almeida, and K. Lyman. ResNet in ResNet: generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.
 [34] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440, 1998.
 [35] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [36] D. H. Zanette. Dynamics of rumor propagation on small-world networks. Physical Review E, 65(4):041908, 2002.
 [37] X. Zhang, C. Moore, and M. E. Newman. Random graph models for dynamic networks. The European Physical Journal B, 90(10):200, 2017.