Dense neural networks as sparse graphs and the lightning initialization
Abstract
Although dense networks have lost importance in modern architectures, they are still used as final logic elements. We show that these dense networks can be simplified through a sparse graph interpretation. This in turn reveals that the information flow between input and output is not optimal under the initializations common today. The lightning initialization sets the weights so that complete information paths between input and output exist from the start. It turns out that both pure dense networks and more complex networks with additional layer types benefit from this initialization: the network's accuracy increases faster. The lightning initialization has two parameters, which behaved robustly in the tests carried out. For more complex networks, however, the improvement only appears at lower learning rates, which shows that the initialization retains its positive effect over the epochs as the learning rate is reduced.
1 Introduction
The development of neural networks trends toward ever more complex structures. These deep networks are built by combining different layer types (Krizhevsky et al. [2012], Szegedy et al. [2014], He et al. [2015a]). Usually the final stage is a fully connected layer or a dense network. Several studies show that dense networks can be pruned to significantly sparser nets (Setiono [1996], Denil et al. [2013], Han et al. [2015]). In contrast, such sparse networks cannot be trained successfully from scratch. Frankle and Carbin [2018] recently showed that both the extreme prunability and the incapacity to train sparse networks from the start can be explained by the lottery ticket hypothesis. In short, within a big network the probability that a valid subset of weights exists is so high that a random initialization results in a successful training.
Common initializations are based on normal or uniform random numbers. Different researchers have shown that the variance of the weights at each layer should be scaled based on the number of neurons involved (LeCun et al. [2012], Glorot and Bengio [2010], He et al. [2015b]). The scaled variance supports the training, so that the optimization converges faster. However, an initialization with random numbers does not consider the information transport between input and output. We therefore argue that the initialization can be improved for a better learning performance.
2 Sparse graph interpretation
A fully connected feed-forward network with unequal weights does not transport information on every path. The optimizer distributes the weights over a value range based on the initialization; for common dense nets this is nearly a continuous distribution. Small weights represent a weak connection: because the outputs of all neurons lie in a comparable range, relatively small weights have little impact on the activation of a neuron. We approximate this behavior by treating small weights as non-existent connections.
To interpret the weights as a sparse graph, the weights have to be categorized into inhibiting (-), inactive (0) and activating (+) connections. By choosing a threshold for the weight magnitude, the connections are separated into active and inactive. Active positive weights are activating edges and active negative weights are inhibiting edges. Figure 1 shows this on a very simple example. The threshold is chosen such that the five strongest edges are active, so the weakest edge is inactive in this example.
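This categorization can be sketched in a few lines of NumPy. The function name and the example weights are illustrative, not taken from the paper; the threshold is chosen so that a given number of edges stays active:

```python
import numpy as np

def sparse_graph(weights, n_active):
    """Interpret a weight matrix as a sparse graph: keep the n_active
    strongest edges and classify them by sign.
    Returns a matrix with entries +1 (activating), -1 (inhibiting),
    0 (inactive). Ties at the threshold may keep a few extra edges."""
    flat = np.sort(np.abs(weights).ravel())
    threshold = flat[-n_active]          # magnitude of the n-th strongest edge
    graph = np.sign(weights)
    graph[np.abs(weights) < threshold] = 0
    return graph

# toy 2x3 weight matrix; keep the four strongest edges
W = np.array([[0.8, -0.1, 0.3],
              [-0.6, 0.05, -0.4]])
G = sparse_graph(W, 4)
```

Applied to a trained layer, `G` is exactly the information the sparse graph interpretation retains: existence and sign of each connection.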
A consequence of the sparse graph interpretation is that a neuron can lose all of its inputs, outputs or both. In figure 1\subreffig:simple_net_b one node has no input. This neuron produces a constant output; even if its output edge has a large weight, it carries no information about the input. The neuron does not influence the result of node E, because its output is constant. The missing connection therefore produces a dead path. These dead paths can build dead areas in bigger networks.
[Table 1: active net size, changed connections per layer (hidden 0, hidden 1, output, over all), and accuracy]
Networks have to be oversized to be trainable, which is indicated by the success of pruning, as shown for example by Han et al. [2015] and Frankle and Carbin [2018]. Equivalent to this pruning effect, the sparse graph interpretation of the network is insensitive to the remaining active network size. Table 1 shows an example network with different remaining active network sizes and the resulting connection type changes of the active edges. A LeNet 300-100 (LeCun et al. [1998]) with a uniform initialization after Glorot and Bengio [2010] is trained for 30 epochs; all other parameters are equal to section 4.1. The value-based sparse graph interpretation is used to reinitialize the network: the weight of activating edges is set to a positive constant, of inactive edges to zero and of inhibiting edges to a negative constant. Then, this reinitialized network is trained with the same setup as the parent state. The only changes are the different initialization and that inactive edges of the parent graph are pruned in the child network.
The achieved maximum accuracy is comparable for all chosen thresholds. The rate of changed connection types decreases as more edges remain active: only a small fraction of the remaining active edges changes its type, while the vast majority of active connections keeps the type inherited from the parent step. Even with a large share of inactive connections, the network is still trainable and reached its maximum accuracy within 10 epochs. The remaining net is large enough to solve the task, but its structure has to be changed more compared to the parent structure.
[Table 2: active net size, Pearson correlation coefficient per layer (hidden 0, hidden 1, output, over all), and accuracy]
The Pearson correlation coefficient indicates the linear correlation between two variables. Table 2 shows the Pearson correlation coefficient between the weights of the parent and the child network. A coefficient near 1 indicates a direct linear correlation. This shows that not only the structure remains; the weights also converge to similar values through the learning.
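The layer-wise comparison can be sketched with NumPy. The weight values below are made up for illustration; the function name is an assumption:

```python
import numpy as np

def layer_correlation(w_parent, w_child):
    """Pearson correlation between the flattened weights of one layer in the
    parent network and the same layer in the retrained child network."""
    return np.corrcoef(w_parent.ravel(), w_child.ravel())[0, 1]

# hypothetical parent weights and slightly perturbed child weights
w_parent = np.array([0.5, -0.3, 0.0, 0.8])
w_child = np.array([0.55, -0.25, 0.05, 0.75])
r = layer_correlation(w_parent, w_child)
```

A value of `r` close to 1, as in table 2, means the child training reproduces the parent's weight values up to an affine relation, not just the sparse structure.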
3 Lightning initialization
The common initializations do not take the sparse graph interpretation into account. For increasing depth, a random initialization produces only weakly connected subgraphs, similar to figure 2, so the information transport is interrupted. Based on the sparse graph interpretation, an initialization that builds paths between the input and output layer should improve the learning of the network. If paths between input and output exist, the backpropagated information should be better distributed to the layers near the input.
Figure 3 shows the result of an experiment that demonstrates the relation between the number of remaining edges in a network and the portion of edges on complete paths between input and output. The different initializations lead to very similar curves with only little noise. The transition zone between no existing path and all edges lying on complete paths is very sharp with respect to the total number of possible edges. For a LeNet 300-100 with 784 inputs and 10 outputs, the strongest 10000 edges do not build a sparse graph that connects the input and output. Figure 4 shows the relation between the portion of edges that are located on complete paths and the validation accuracy of the unpruned network for different training epochs. The experiment is repeated five times, represented by the different node shapes. Figure 4\subreffig:path1000 shows this relation for the 1000 and figure 4\subreffig:path10000 for the 10000 strongest edges. Both figures show that the solver arranges the sparse graph edges to form complete paths.
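The portion of edges on complete paths can be measured as follows, assuming the sparse graph is given as per-layer 0/1 masks (a sketch, not the authors' code): an edge lies on a complete path exactly when its source neuron is reachable from the input and its target neuron still reaches the output.

```python
import numpy as np

def edges_on_complete_paths(masks):
    """masks: list of 0/1 matrices (fan_in x fan_out), one per layer,
    marking the active edges. Returns the fraction of active edges that
    lie on a complete input-to-output path."""
    # forward reachability: neurons that receive information from the input
    fwd = [np.ones(masks[0].shape[0], int)]
    for m in masks:
        fwd.append(((fwd[-1] @ m) > 0).astype(int))
    # backward reachability: neurons that can still influence the output
    bwd = [np.ones(masks[-1].shape[1], int)]
    for m in reversed(masks):
        bwd.append(((m @ bwd[-1]) > 0).astype(int))
    bwd = bwd[::-1]
    on_path = total = 0
    for l, m in enumerate(masks):
        total += m.sum()
        on_path += (m * np.outer(fwd[l], bwd[l + 1])).sum()
    return on_path / total

# tiny example: 2 inputs -> 2 hidden -> 1 output;
# hidden neuron 1 has no outgoing edge, so its input edge is a dead path
m0 = np.array([[1, 0], [0, 1]])
m1 = np.array([[1], [0]])
frac = edges_on_complete_paths([m0, m1])
```

In the example, two of the three active edges lie on a complete path; the edge feeding the disconnected hidden neuron does not.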
The goal of the lightning initialization is to build a sparse graph that consists only of complete paths. For this, a specific number of “lightnings” (complete paths) is randomly chosen. Edges on these paths get a nonzero value, symmetrically distributed between positive and negative. The strength of these edges can be equal or noisy. An edge can be part of more than a single path; in this case the strengths are not summed up, so edges on multiple paths get the same strength as edges on a single path. The lightning initializations used in this paper all have constant strengths and no noise.
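A minimal sketch of this procedure under stated assumptions: the function name is invented, the sign is drawn per edge (the paper only requires a symmetric sign distribution), and the constant-strength, no-noise variant is implemented. Overwriting instead of adding realizes the "not summed up" rule for overlapping paths.

```python
import numpy as np

def lightning_init(layer_sizes, n_lightnings, strength, seed=None):
    """All weights start at zero; each 'lightning' draws one random complete
    path from input to output and sets its edges to +/-strength."""
    rng = np.random.default_rng(seed)
    weights = [np.zeros((a, b)) for a, b in zip(layer_sizes, layer_sizes[1:])]
    for _ in range(n_lightnings):
        path = [rng.integers(n) for n in layer_sizes]  # one neuron per layer
        for l, (src, dst) in enumerate(zip(path, path[1:])):
            sign = rng.choice([-1.0, 1.0])             # symmetric sign
            weights[l][src, dst] = sign * strength     # overwrite, never sum
    return weights

# LeNet 300-100 shape, 50 lightnings of strength 0.5
W = lightning_init([784, 300, 100, 10], n_lightnings=50, strength=0.5, seed=0)
```

Every nonzero edge then lies on at least one complete input-output path by construction.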
4 Experimental setup
The concepts are tested on different network topologies and problems. This section describes the models used for the different problems and the optimization parameters. The tests are categorized in two parts: first, networks that consist only of dense layers; second, networks that have dense layers only at the tail.
4.1 MNIST
The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been chosen to demonstrate the behavior for pure dense networks. To solve the MNIST problem, a LeNet 300-100 (LeCun et al. [1998]) and an SGD optimizer with a constant learning rate of 0.05 are used. The batch size is set to 100 and the data is preprocessed so that the values lie in the range between 0 and 1.
The LeNet 300-100 is built with ReLU activation functions in all layers except the output layer, which uses softmax as activation function. Figure 9\subreffig:lenet300100 shows the schematic of the used LeNet 300-100 network.
4.2 CIFAR10
The CIFAR10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) is used in combination with two different network architectures. One is the LeNet5 by LeCun et al. [1998], based on the code of BIGBALLON [2018]. Figure 9\subreffig:lenet5 shows the schematic of this network. This net is trained with the Adam optimizer and a scheduled learning rate, shown in table 3. A batch size of 32, data augmentation and data preprocessing are used. The data is scaled so that it has a range of 1 and a mean of 0.
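The described scaling (range 1, mean 0) might look like the following sketch; this illustrates the stated preprocessing, not the authors' exact code:

```python
import numpy as np

def preprocess(x):
    """Scale data so that max - min == 1, then shift so that the mean is 0.
    Shifting by the mean does not change the range."""
    x = x / (x.max() - x.min())
    return x - x.mean()

pixels = np.array([0.0, 63.75, 255.0])  # hypothetical pixel values
scaled = preprocess(pixels)
```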
The ResNet is used as the second network architecture. For this, two structures were trained. The reference is a ResNet20 based on the code of kerasteam [2018]. The second structure is the custom ResNet14d, which has a three-layer dense network as its final stage. The schematic of the ResNet20 is shown in figure 10\subreffig:resnet20 and the ResNet14d schematic in figure 10\subreffig:resnet14d. The networks are optimized by the Adam optimizer with a scheduled learning rate, shown in table 3. Batch size, data augmentation and data preprocessing are equal to the LeNet5 setup.
[Table 3: learning rate schedule (epochs, learning rate)]
5 Results
5.1 Pure dense networks
The LeNet 300-100 network, as described in section 4.1, was initialized by different methods and trained for 30 epochs. Figure 5 shows the result of this experiment. All variants already reach a usable accuracy after the first epoch, but the lightning initialization performs this first epoch much better: its wrong-answer probability is lower with respect to the mean of the other initializations. After 30 epochs the lightning initialized model still performs better than the other initializations. So, the LeNet 300-100 network with lightning initialization learns faster and better compared to the other initializations.
As assumed in section 3, the optimizer uses the initialized paths and learns better. Throughout the training, in every epoch all of the 10000 largest edges lie on complete paths between input and output. So, the lightning initialization produces an alternative realization of the paths and results in a different solution. This is shown in figure 6 by the cumulative distribution of the absolute weight values, separated by layers. The lightning initialization produces two categories of absolute weight values. Especially in the “hidden 2” and “output” layers these two categories are clearly visible. In the “hidden 1” layer the amount of active edges is so small that the active category nearly vanishes at this resolution.
The other initializations show a different behavior. Most values are nearly uniformly distributed, which is represented by the approximately straight lines over most of the value range. Only the biggest weights strive for even larger values, which can be seen in the pronounced asymptotic behavior at the upper end of the distribution. With these initializations the weights start uniformly distributed (normally distributed for the truncated initialization), and the optimizer keeps this distribution for most values. Only the largest values deviate from it. It can be assumed that these are the values that are mainly responsible for the solution.
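The per-layer cumulative distributions used for figure 6 can be computed directly from the absolute weight values; a sketch (function name assumed):

```python
import numpy as np

def abs_weight_cdf(weights):
    """Empirical cumulative distribution of the absolute weight values of one
    layer: returns the sorted magnitudes and the cumulative fraction of
    weights at or below each magnitude."""
    v = np.sort(np.abs(weights).ravel())
    cdf = np.arange(1, v.size + 1) / v.size
    return v, cdf

# toy layer: one inactive weight and three active ones
v, cdf = abs_weight_cdf(np.array([[-0.5, 0.0], [0.25, 0.5]]))
```

Plotting `cdf` over `v` for a lightning-initialized layer shows the two value categories as a step; for random initializations it is close to a straight line.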
The lightning initialization has two parameters: the number of lightnings and their strength. Figure 7 shows the best accuracy for various combinations of lightning numbers and strengths. The lightning initialization is robust in its parameters for LeNet networks of different sizes on the MNIST problem: values within a range of a factor of 100 for both parameters produce usable and similar results. The larger networks suggest that, at the optimum, lightning strength and number depend reciprocally on each other.
5.2 Networks with additional layer types
To show the behavior of the lightning initialization on more complex networks, it is tested on the CIFAR10 dataset with two different models. The LeNet5 network consists of convolution and max pooling layers in its upper part; the tail is built by three dense layers, similar to the LeNet 300-100 of the pure dense test in section 5.1. Figure 8\subreffig:lenet5 shows the comparison of the LeNet5 network solving the CIFAR10 problem with a classical initialization based on He et al. [2015b] and with the lightning initialization with different parameters for the dense layers. Similar to the results in section 5.1, the lightning variant shows a better and faster learning. The effect is weaker, however, and seems only observable at lower learning rates. On the other hand, this shows that the alternative initialization retains its effect through the epochs.
Table 4\subreftab:lenetcifar10 shows the best accuracies of a small parameter study for the LeNet5 network solving the CIFAR10 problem with lightning initialization for the dense layers. In contrast to the results in section 5.1, the lightning initialization performs better with more and stronger lightnings. It can be expected that the parameters have bounds in both directions; the parameter study in section 5.1 is located only at one end. It seems that the strength parameter of the LeNet5 is less robust in comparison to the LeNet test in section 5.1, but this has to be verified by a larger parameter study.
[Table 4: best accuracies of the lightning parameter studies for LeNet5 (\subreftab:lenetcifar10) and ResNet14d (\subreftab:resnetcifar10) on CIFAR10]
The LeNet5 network is relatively simple. To show that the lightning initialization works as well with deeper networks, it is tested with the ResNet14d network described in section 4.2. The classical ResNet has only one dense layer at the end, which is not usable with lightning because then every edge already forms a complete path. Therefore, the ResNet14d network is modified so that its end contains three dense layers. Figure 8\subreffig:resnet shows the learning curve of a ResNet14d with a classical initialization based on He et al. [2015b], a lightning initialized variant and the result of the ResNet20, to compare the benefit of lightning against a much more complex ResNet. As in the LeNet5 experiment, the lightning initialization improves the learning primarily at lower learning rates.
Table 4\subreftab:resnetcifar10 shows the best accuracies for the same parameters as for the LeNet5 study in table 4\subreftab:lenetcifar10. The dependency on the strength parameter seems to be weaker than for the LeNet5 network. The number of lightnings shows the opposite behavior, because the optimum is below 100 for the ResNet14d tests and above 5000 for the LeNet5. The best result for the ResNet14d was achieved with 90 lightnings and a strength of 0.25 and reached the accuracy shown in figure 8\subreffig:resnet.
6 Conclusion
It has been shown that a dense network can be interpreted as a sparse graph: by using a threshold, the network can be transferred into a sparse graph that consists only of the information whether a connection between two nodes exists and whether this connection is activating or inhibiting. A LeNet 300-100 that solved the MNIST problem can be interpreted as a sparse graph and can be reconstructed based on this information alone. More complex networks with additional layers are more difficult to reconstruct, probably because the range of possible solutions is larger and thus the learning process gains influence.
The sparse graph interpretation leads to the lightning initialization, in which the network is initialized with complete paths. This assists the development of the solution. The experiments show that several network architectures for different problems benefit from the lightning initialization: the learning process is faster and the resulting accuracies are better. Parameter studies demonstrate that the two parameters, number and strength of the lightnings, are robust for the LeNet 300-100 network solving the MNIST problem. Both networks tested on the CIFAR10 problem are more sensitive to these parameters, but this needs to be investigated more closely in order to make a clear statement about the robustness for complex networks.
Supplementary material
The experiment of section 2 is also done with the LeNet5 network, but the sparse graph interpretation is only applied to the dense layers; the convolution layers are not manipulated. Using the trained weights of the convolution layers and setting them untrainable is the closest approach to the experiment in section 2. Tables 5 and 6 show the results for this case. Tables 7 and 8 show the results for the case that the convolution layers are trainable and initialized with the original weights of the unpruned network.
[Table 5: active net size, changed connections per layer (hidden 0, hidden 1, output, over all), and accuracy, with fixed convolution layers]
[Table 6: active net size, Pearson correlation coefficient per layer (hidden 0, hidden 1, output, over all), and accuracy, with fixed convolution layers]
Even though the changing rates are higher than for the test from section 2, the network shows the same behavior as the LeNet 300-100 experiment; only the reproducing effect is weaker. Probably this happens because the development processes of the convolution layers and the dense layers are not the same. In the original network both learn and develop together: the result of one sector influences the result of the other. Even though the convolution layers are fixed to a useful state and the dense sector uses it, its learning process differs.
By using the convolution layers as a trainable part of the network, initialized with the original weights, the difference between the original network and the sparse graph variants increases. This can be expected, because the additional parameters of the convolution layers increase the number of possible ways to solve the problem. On the other hand, the starting point of the dense part is radically changed. Both sectors interact through the learning, so it is plausible that this variant behaves worse than with fixed convolution layers.
[Table 7: active net size, changed connections per layer (hidden 0, hidden 1, output, over all), and accuracy, with trainable convolution layers]
[Table 8: active net size, Pearson correlation coefficient per layer (hidden 0, hidden 1, output, over all), and accuracy, with trainable convolution layers]
The network schematics in figures 9 and 10 list for each layer: layer type, features, kernel shape and strides, output shape, and activation with additional steps.
The layer types are:
- i: input
- d: dense
- c2: 2D convolution
- mP: max pooling
- aP: average pooling
- f: flatten
Possible activations and additional steps are:
- bn: batch normalization
- relu: ReLU activation
- softmax: softmax activation
- 0.3: 30% dropout
References
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9, 2012. ISSN 10495258. doi: http://dx.doi.org/10.1016/j.protcy.2014.09.007. URL https://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf.
 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842, 2014. ISSN 10636919. doi: 10.1109/CVPR.2015.7298594. URL http://openaccess.thecvf.com/content{_}cvpr{_}2015/papers/Szegedy{_}Going{_}Deeper{_}With{_}2015{_}CVPR{_}paper.pdf; https://arxiv.org/abs/1409.4842.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015a. URL https://arxiv.org/pdf/1512.03385.pdf.
 Rudy Setiono. Extracting rules from pruned neural networks for breast cancer diagnosis. Artificial Intelligence in Medicine, 8(1):37–51, 1996. ISSN 09333657. doi: 10.1016/09333657(95)000194. URL https://ac.elscdn.com/0933365795000194/1s2.00933365795000194main.pdf?{_}tid=450388cc117c4758baca3029ad1f6b9b{&}acdnat=1523960568{_}6c03b26fa13f7eb26c48094557767b81.
 Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting Parameters in Deep Learning. 2013. ISSN 10495258. URL http://papers.nips.cc/paper/5025predictingparametersindeeplearning.pdf; http://arxiv.org/abs/1306.0543.
 Song Han, Jeff Pool, John Tran, and William J Dally. Learning both Weights and Connections for Efficient Neural Networks. 2015. URL https://arxiv.org/pdf/1506.02626.pdf; http://arxiv.org/abs/1506.02626.
 Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. 2018. doi: arXiv:1803.03635v1. URL https://arxiv.org/pdf/1803.03635.pdf; http://arxiv.org/abs/1803.03635.
 Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7700:9–48, 2012. ISSN 03029743. doi: 10.1007/97836423528983. URL http://yann.lecun.com/exdb/publis/pdf/lecun98b.pdf.
 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. PMLR, 9:249–256, 2010. ISSN 15324435. URL http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf; http://machinelearning.wustl.edu/mlpapers/paper{_}files/AISTATS2010{_}GlorotB10.pdf.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, volume 2015 Inter, pages 1026–1034, 2015b. ISBN 9781467383912. doi: 10.1109/ICCV.2015.123. URL https://arxiv.org/pdf/1502.01852.pdf.
 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998. ISSN 00189219. doi: 10.1109/5.726791.
 BIGBALLON. cifar10cnn/1_lecun_network/lenet_dp_da_wd_keras.py. GitHub.com, 2018. Commit id 589b74556433fb9716871f80d74f998c11071c0b; https://github.com/BIGBALLON/cifar10cnn/tree/589b74556433fb9716871f80d74f998c11071c0b.
 kerasteam. keras/examples/cifar10_resnet.py. GitHub.com, 2018. Commit id 7113063a93ef8d77c6d7d21a2756c1ac802d83de; https://github.com/kerasteam/keras/tree/7113063a93ef8d77c6d7d21a2756c1ac802d83de.