# Discovering Neural Wirings

###### Abstract

The success of neural networks has driven a shift in focus from feature engineering to architecture engineering. However, successful networks today are constructed using a small and manually defined set of building blocks. Even in methods of neural architecture search (NAS) the network connectivity patterns are largely constrained. In this work we propose a method for discovering neural wirings. We relax the typical notion of layers and instead enable channels to form connections independent of each other. This allows for a much larger space of possible networks. The wiring of our network is not fixed during training – as we learn the network parameters we also learn the structure itself. Our experiments demonstrate that our learned connectivity outperforms hand engineered and randomly wired networks. By learning the connectivity of MobileNetV1 mobilenetv1 () we boost the ImageNet accuracy by 10\% at \sim 41M FLOPs. Moreover, we show that our method generalizes to recurrent and continuous time networks.

## 1 Introduction

Deep neural networks have shifted the prevailing paradigm from feature engineering to feature learning. The architecture of deep neural networks, however, must still be hand designed in a process known as architecture engineering. A myriad of recent efforts attempt to automate the process of the architecture design by searching among a set of smaller well-known building blocks mnasnet (); wu2018fbnet (); zoph2016neural (); liu2018progressive (); cai2018proxylessnas (); darts (). While methods of search range from reinforcement learning to gradient based approaches wu2018fbnet (); darts (), the space of possible connectivity patterns is still largely constrained. NAS methods explore wirings between large blocks. We believe that more efficient solutions may arrive from searching the space of wirings at a more fine grained level, i.e. single channels.

In this work, we consider an unconstrained set of possible wirings by allowing channels to form connections independent of each other. This enables us to discover a wide variety of operations (e.g. depthwise separable convs mobilenetv1 (), channel shuffle and split shufflenet (), and more). Formally, we treat the network as a large neural graph where each node processes a single channel.

One key challenge lies in searching the space of all possible wirings – the number of possible sub-graphs is combinatorial in nature. When considering thousands of nodes, traditional search methods are either prohibitive or offer approximate solutions. In this paper we introduce a simple and efficient algorithm for discovering neural wirings (DNW). Our solution searches the space of all possible wirings with a simple augmentation to the backwards pass.

Most similar to our approach is work in randomly wired neural networks randwire () aiming to explore the space of novel neural network wirings. Intriguingly, they show that constructing neural networks with random graph algorithms often outperforms a manually engineered architecture. However, these wirings are fixed at training.

Our method for discovering neural wirings is as follows: We consider the sole constraint that the total number of edges in the neural graph is fixed to be k. We begin by randomly assigning a weight to each edge. We then choose the k weighted edges with the highest magnitude as the real edges and refer to the remaining edges as hallucinated. As we train, we modify the weights of all edges according to a specified update rule. Accordingly, a hallucinated edge may strengthen to the point that it replaces a real edge. We tailor the update rule so that when swapping does occur, it is beneficial.

We consider the application of DNW for static and dynamic neural graphs. In the static regime each node has a single output and the graphical structure is acyclic. In the case of a dynamic neural graph we allow the state of a node to vary with time. Dynamic neural graphs may contain cycles and express popular sequential models such as LSTMs lstm (). As dynamic neural graphs are strictly more expressive than static neural graphs, they can also express feed-forward networks (as in Figure 1).

Our work may also be regarded as an effective mechanism for training a sparse network. The Lottery Ticket Hypothesis lth (); lth2 () has demonstrated the existence of sparse sub-networks which may be trained in isolation. However, their method for identifying so-called winning-tickets is quite expensive and requires multiple passes of training. Though we do not train our networks in isolation (we consider a set of hallucinated edges), DNW may be used to train a sparse network in a single pass.

We demonstrate the efficacy of DNW on small and large scale datasets, and for feed-forward, recurrent, and continuous networks. Notably, we augment MobileNetV1 mobilenetv1 () with DNW to achieve a 10\% improvement on ImageNet imagenet () over the hand engineered MobileNetV1 at \sim 41M FLOPs. (We follow shufflenet (); shufflenetv2 () and define FLOPs as the number of multiply-adds.)

## 2 Discovering Neural Wirings

In this section we describe our method for jointly discovering the structure and learning the parameters of a neural network. We first consider the algorithm in a familiar setting, a feed-forward neural network, which we abstract as a static neural graph. We then present a more expressive dynamic neural graph which extends to discrete and continuous time and generalizes feed-forward, recurrent, and continuous time neural networks.

### 2.1 Static Neural Graph

A static neural graph is a directed acyclic graph \mathcal{G}=(\mathcal{V},\mathcal{E}) consisting of nodes \mathcal{V} and edges \mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}. The state of a node v\in\mathcal{V} is given by the random variable Z_{v}. At each node v we apply a function f_{\theta_{v}} and with each edge (u,v) we associate a weight w_{uv}. In the case of a multi-layer perceptron, f is simply a parameter-free non-linear activation like ReLU alexnet ().

For any set \mathcal{A}\subseteq\mathcal{V} we let \mathbf{Z}_{\mathcal{A}} denote \left(Z_{v}\right)_{v\in\mathcal{A}} and so \mathbf{Z}_{\mathcal{V}} is a vector containing the state of all nodes in the network.

\mathcal{V} contains a subset of input nodes \mathcal{V}_{0} with no parents and output nodes \mathcal{V}_{E} with no children. The input data \mathcal{X}\sim p_{x} flows into the network through \mathcal{V}_{0} as \mathbf{Z}_{\mathcal{V}_{0}}=g_{\phi}(\mathcal{X}) for a function g which may have parameters \phi. Similarly, the output of the network \hat{\mathcal{Y}} is given by h_{\psi}(\mathbf{Z}_{\mathcal{V}_{E}}).

Z_{v}=\begin{cases}f_{\theta_{v}}\left(\sum_{(u,v)\in\mathcal{E}}w_{uv}Z_{u}\right)&v\in\mathcal{V}\setminus\mathcal{V}_{0}\\ g_{\phi}^{(v)}(\mathcal{X})&v\in\mathcal{V}_{0}.\end{cases} \tag{1}

For brevity, we let \mathcal{I}_{v} denote the “input” to node v, where \mathcal{I}_{v} may be expressed as

\mathcal{I}_{v}=\sum_{(u,v)\in\mathcal{E}}w_{uv}Z_{u}. \tag{2}
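As a rough illustration of equation 1, the sketch below evaluates a small static neural graph. It makes simplifying assumptions that are not in the paper: node states are scalars rather than two-dimensional channels, every non-input node applies ReLU, and node ids are already in topological order. The function name `forward_static` and the toy graph are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def forward_static(num_nodes, input_values, edges, f=relu):
    """Evaluate a static neural graph with scalar node states.

    num_nodes:    number of nodes; ids 0..num_nodes-1 are assumed to be
                  in topological order (every edge goes from u to v with u < v).
    input_values: dict {input node id: value}, standing in for g_phi(X).
    edges:        dict {(u, v): w_uv}.
    """
    parents = {v: [] for v in range(num_nodes)}
    for (u, v), w in edges.items():
        assert u < v, "edges must respect the topological order"
        parents[v].append((u, w))

    z = [0.0] * num_nodes
    for v in range(num_nodes):
        if v in input_values:
            z[v] = input_values[v]
        else:
            # Weighted sum of the parents' states, then the node function.
            i_v = sum(w * z[u] for u, w in parents[v])
            z[v] = f(i_v)
    return z


# Toy graph: nodes 0 and 1 are inputs, node 4 is the output.
# The edge (1, 4) skips the middle nodes, which a layered MLP would not allow.
edges = {(0, 2): 1.0, (1, 2): -2.0, (0, 3): 0.5,
         (2, 4): 1.0, (3, 4): 2.0, (1, 4): 1.0}
states = forward_static(5, {0: 1.0, 1: 0.25}, edges)
print(states)  # [1.0, 0.25, 0.5, 0.5, 1.75]
```

Because connectivity is just a dictionary of weighted edges, adding or removing a single edge changes the wiring without touching any notion of a layer.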

In this work we consider the case where the input and output of each node is a two-dimensional matrix, commonly referred to as a channel. Each node performs a non-linear activation followed by normalization and convolution (which may be strided to reduce the spatial resolution). As in randwire (), we no longer conform to the traditional notion of “layers” in a deep network.

The combination of a separate 3\times 3 convolution for each channel (depthwise convolution) followed by a 1\times 1 convolution (pointwise convolution) is often referred to as a depthwise separable convolution, and is essential in efficient network design mobilenetv1 (); shufflenetv2 (). With a static neural graph this process may be interpreted equivalently as a 3\times 3 convolution at each node followed by information flow on a complete bipartite graph.

### 2.2 Discovering a k-Edge neural graph

We now outline our method for discovering the edges of a static neural graph subject to the constraint that the total number of edges must not exceed k.

We consider a set of real edges \mathcal{E} and a set of hallucinated edges \mathcal{E}_{\text{hal}}=(\mathcal{V}\times\mathcal{V})\setminus\mathcal{E}. The real edge set comprises the k edges with the largest-magnitude weights. As we allow the magnitude of the weights in both sets to change throughout training, the edges in \mathcal{E}_{\text{hal}} may replace those in \mathcal{E}.
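The split into real and hallucinated edges amounts to a top-k selection by weight magnitude. The following minimal sketch shows one way to compute it; the function name `split_edges` and the dictionary representation of the weights are illustrative assumptions, not the authors' implementation.

```python
def split_edges(weights, k):
    """Partition candidate edges into real and hallucinated sets.

    weights: dict {(u, v): w_uv} over every candidate edge.
    k:       number of real edges to keep.
    Returns (real, hallucinated) as two sets of edges.
    """
    # Rank edges by |w_uv|, largest first; ties keep their input order.
    ranked = sorted(weights, key=lambda e: abs(weights[e]), reverse=True)
    return set(ranked[:k]), set(ranked[k:])


weights = {("a", "c"): 0.9, ("b", "c"): -0.7, ("a", "d"): 0.2, ("b", "d"): -0.05}
real, hallucinated = split_edges(weights, k=2)
print(sorted(real))          # [('a', 'c'), ('b', 'c')]
print(sorted(hallucinated))  # [('a', 'd'), ('b', 'd')]
```

Only the sign-free magnitude matters for membership, so a strongly negative weight such as the one on `("b", "c")` is kept as a real edge.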

Consider a hallucinated edge (u,v)\not\in\mathcal{E}. If the gradient is pushing \mathcal{I}_{v} in a direction which aligns with Z_{u}, then our update rule strengthens the magnitude of the weight w_{uv}. If this alignment happens consistently then w_{uv} will eventually be strong enough to enter the real edge set \mathcal{E}. As the total number of edges is conserved, when (u,v) enters the edge set \mathcal{E} another edge is removed and placed in \mathcal{E}_{\text{hal}}. This procedure is illustrated by Algorithm 1, where \mathcal{V} is the node set, \mathcal{V}_{0},\mathcal{V}_{E} are the input and output node sets, g_{\phi}, h_{\psi} and \{f_{\theta_{v}}\}_{v\in\mathcal{V}} are the input, output, and node functions, p_{xy} is the data distribution, k is the number of edges in the graph and \mathcal{L} is the loss.

In practice we may also include a momentum and weight decay term in the weight update rule (line 10 in Algorithm 1); weight decay wd () may in fact be very helpful for eliminating dead ends. The weight update rule looks nearly identical to that in traditional SGD & Backprop but for one key difference: we allow the gradient to flow to edges which did not exist during the forward pass. Importantly, we do not allow the gradient to flow through these edges, and so the rest of the parameters update as in traditional SGD & Backprop. This gradient flow is illustrated in Figure 3.
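To make the difference from ordinary SGD concrete, here is a hedged toy sketch of one update for a single target node. The assumptions are loud ones: node states are scalars, the node function is the identity, the loss is 0.5 (I_v - target)^2, and there is no momentum or weight decay. The name `dnw_step` and the numbers are hypothetical.

```python
def dnw_step(weights, z, target, k, lr):
    """One toy DNW update for a single node v with candidate parents u.

    weights: dict {u: w_uv} over all candidate parents (real and hallucinated).
    z:       dict {u: Z_u} with the (fixed) scalar states of the parents.
    Returns (updated weights, set of real parents used, loss before the update).
    """
    # Forward pass: only the k largest-magnitude edges carry information.
    real = sorted(weights, key=lambda u: abs(weights[u]), reverse=True)[:k]
    i_v = sum(weights[u] * z[u] for u in real)
    loss = 0.5 * (i_v - target) ** 2
    grad_i = i_v - target  # dL/dI_v for the squared loss assumed here

    # Backward pass: every edge is updated, including hallucinated ones.
    # Nothing is propagated *through* hallucinated edges; in this toy
    # there is no upstream parameter, so that part is vacuous.
    updated = {u: w - lr * z[u] * grad_i for u, w in weights.items()}
    return updated, set(real), loss


weights = {0: 0.5, 1: 0.4, 2: 0.1}
z = {0: 0.1, 1: 2.0, 2: 0.0}

weights, real_before, loss_before = dnw_step(weights, z, target=1.0, k=1, lr=0.1)
weights, real_after, loss_after = dnw_step(weights, z, target=1.0, k=1, lr=0.1)
print(real_before, real_after)   # {0} {1}
print(loss_after < loss_before)  # True
```

In this run the hallucinated edge from parent 1 is well aligned with the direction the loss wants to move I_v, so its weight overtakes the real edge after one update and the loss measured on the next forward pass is lower.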

Under certain conditions we formally show that swapping an edge from \mathcal{E}_{\text{hal}} to \mathcal{E} decreases the loss \mathcal{L}. We first consider the simple case where the hallucinated edge (i,k) replaces (j,k)\in\mathcal{E}. In Section B we discuss the proof of a more general case.

We let \tilde{w} denote the weight w after the weight update rule \tilde{w}_{uv}=w_{uv}+\left\langle Z_{u},-\alpha{\partial\mathcal{L}\over\partial\mathcal{I}_{v}}\right\rangle. We assume that \alpha is small enough so that sign(\tilde{w}) = sign(w).

Claim: Assume \mathcal{L} is Lipschitz continuous. There exists a learning rate \alpha^{*}>0 such that for \alpha\in(0,\alpha^{*}) the process of swapping (i,k) for (j,k) will decrease the loss when the states of the nodes are fixed and |w_{ik}|<|w_{jk}| but |\tilde{w}_{ik}|>|\tilde{w}_{jk}|.

Proof. Let \mathcal{A} be the value of \mathcal{I}_{k} after the update rule if (j,k) is replaced with (i,k). Let \mathcal{B} be the value of \mathcal{I}_{k} after the update rule if we do not allow for swapping. \mathcal{A} and \mathcal{B} are then given by

\mathcal{A}=\tilde{w}_{ik}Z_{i}+\sum_{(u,k)\in\mathcal{E},\ u\neq i,j}\tilde{w}_{uk}Z_{u},\qquad\mathcal{B}=\tilde{w}_{jk}Z_{j}+\sum_{(u,k)\in\mathcal{E},\ u\neq i,j}\tilde{w}_{uk}Z_{u}. \tag{3}

Additionally, let g=-\alpha{\partial\mathcal{L}\over\partial\mathcal{I}_{k}} be the direction in which the loss most steeply descends with respect to \mathcal{I}_{k}. By Lemma 1 (Section C of the Appendix) it suffices to show that moving \mathcal{I}_{k} towards \mathcal{A} is more aligned with g than moving \mathcal{I}_{k} towards \mathcal{B}. Formally we wish to show that

\left\langle\mathcal{A}-\mathcal{I}_{k},g\right\rangle\geq\left\langle\mathcal{B}-\mathcal{I}_{k},g\right\rangle \tag{4}

which simplifies to

\tilde{w}_{ik}\left\langle Z_{i},g\right\rangle\geq\tilde{w}_{jk}\left\langle Z_{j},g\right\rangle \tag{5}

\iff\tilde{w}_{ik}(\tilde{w}_{ik}-w_{ik})\geq\tilde{w}_{jk}(\tilde{w}_{jk}-w_{jk}). \tag{6}
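The simplification from equation 4 to equation 6 uses only the definitions of \mathcal{A}, \mathcal{B}, g and the update rule. Spelled out as a short derivation (a reading of the argument, with no new assumptions beyond those already stated):

```latex
\begin{align*}
\langle \mathcal{A}-\mathcal{I}_{k},\, g\rangle - \langle \mathcal{B}-\mathcal{I}_{k},\, g\rangle
  &= \langle \mathcal{A}-\mathcal{B},\, g\rangle \\
  &= \tilde{w}_{ik}\langle Z_{i}, g\rangle - \tilde{w}_{jk}\langle Z_{j}, g\rangle
  && \text{(the shared sums in equation 3 cancel)} \\[4pt]
\tilde{w}_{uk} - w_{uk}
  &= \left\langle Z_{u},\, -\alpha\frac{\partial\mathcal{L}}{\partial\mathcal{I}_{k}}\right\rangle
   = \langle Z_{u}, g\rangle
  && \text{(update rule with } v=k\text{)} \\[4pt]
\tilde{w}_{ik}\langle Z_{i}, g\rangle - \tilde{w}_{jk}\langle Z_{j}, g\rangle
  &= \tilde{w}_{ik}(\tilde{w}_{ik}-w_{ik}) - \tilde{w}_{jk}(\tilde{w}_{jk}-w_{jk})
\end{align*}
```

The first two lines give equation 5, and substituting the update rule for both u=i and u=j gives equation 6.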

In the case where \tilde{w}_{ik} and (\tilde{w}_{ik}-w_{ik}) have the same sign but \tilde{w}_{jk} and (\tilde{w}_{jk}-w_{jk}) have different signs the inequality immediately holds. This corresponds to the case where w_{ik} increases in magnitude but w_{jk} decreases in magnitude. The opposite scenario (w_{ik} decreases in magnitude but w_{jk} increases) is impossible since |w_{ik}|<|w_{jk}| but |\tilde{w}_{ik}|>|\tilde{w}_{jk}|.

We now consider the scenario where both sides of the inequality (equation 6) are positive. Simplifying further we obtain

\left(\tilde{w}_{jk}w_{jk}-\tilde{w}_{ik}w_{ik}\right)\geq\left(\tilde{w}_{jk}^{2}-\tilde{w}_{ik}^{2}\right) \tag{7}

and are now able to identify a range for \alpha such that the inequality above is satisfied. By assumption the right hand side is less than 0 and sign(\tilde{w}) = sign(w) so \tilde{w}w=|\tilde{w}||w|. Accordingly, it suffices to show that

|\tilde{w}_{jk}||w_{jk}|-|\tilde{w}_{ik}||w_{ik}|\geq 0. \tag{8}

If we let \epsilon=|w_{jk}|-|w_{ik}| and \alpha^{*}=\sup\{\alpha:|\tilde{w}_{ik}|\leq|\tilde{w}_{jk}|+\epsilon|\tilde{w}_{jk}||\tilde{w}_{ik}|^{-1}\}, then for \alpha\in(0,\alpha^{*})

|\tilde{w}_{jk}||w_{jk}|-|\tilde{w}_{ik}||w_{ik}|\geq|\tilde{w}_{jk}|\left(\underbrace{|w_{jk}|-|w_{ik}|}_{=\epsilon}-\epsilon\right)=0 \tag{9}

the inequality (equation 7) is satisfied. Here we are implicitly using our assumption that the gradient is bounded and we may “tune” \alpha to control the magnitude |\tilde{w}_{ik}|-|\tilde{w}_{jk}|. In the case where \alpha=\inf\{\alpha:|\tilde{w}_{ik}|>|\tilde{w}_{jk}|\} the right hand side of equation 7 becomes 0 while the left hand side is \epsilon>0.

### 2.3 Dynamic Neural Graph

We now consider a more general setting where the state of each node Z_{v}(t) may vary through time. We refer to this model as a dynamic neural graph.

The initial conditions of a dynamic neural graph are given by

Z_{v}(0)=\begin{cases}g_{\phi}^{(v)}(\mathcal{X})&v\in\mathcal{V}_{0}\\ 0&v\in\mathcal{V}\setminus\mathcal{V}_{0}\end{cases} \tag{10}

where \mathcal{V}_{0} is a designated set of input nodes, which may now have parents.

Discrete Time Dynamics: For a discrete time neural graph we consider times \ell\in\{0,1,\ldots,L\}. The dynamics are then given by

Z_{v}(\ell+1)=f_{\theta_{v}}\left(\sum_{(u,v)\in\mathcal{E}}w_{uv}Z_{u}(\ell),\ \ell\right) \tag{11}

and the network output is \hat{\mathcal{Y}}=h_{\psi}(\mathbf{Z}_{\mathcal{V}_{E}}(L)). We may express equation 11 more succinctly as

\mathbf{Z}_{\mathcal{V}}(\ell+1)=\textbf{f}_{\theta}\left(\mathbf{\mathcal{A}}_{\mathcal{G}}\mathbf{Z}_{\mathcal{V}}(\ell),\ \ell\right) \tag{12}

where \mathbf{Z}_{\mathcal{V}}(\ell)=\left(Z_{v}(\ell)\right)_{v\in\mathcal{V}}, \textbf{f}_{\theta}(\textbf{z},\ell)=(f_{\theta_{v}}(z_{v},\ell))_{v\in\mathcal{V}}, and \mathbf{\mathcal{A}}_{\mathcal{G}} is the weighted adjacency matrix for graph \mathcal{G}. Equation 12 suggests the following interpretation: At each time step we send information through the edges using \mathbf{\mathcal{A}}_{\mathcal{G}} then apply a function at each node.

Continuous Time Dynamics: As in node (), we consider the case where t may take on a continuous range of values. We then arrive at dynamics given by

\nabla\ \mathbf{Z}_{\mathcal{V}}(t)=\textbf{f}_{\theta}\left(\mathbf{\mathcal{A}}_{\mathcal{G}}\mathbf{Z}_{\mathcal{V}}(t),\ t\right). \tag{13}

Interestingly, if \mathcal{V}_{0} is a strict subset of \mathcal{V} we uncover an Augmented Neural ODE anode (). The discrete time case is unifying in the sense that it may also express any static neural graph. In Figure 1 we illustrate that an MLP may also be expressed by a discrete time neural graph. Additionally, the discrete time dynamics are able to capture sequential models such as LSTMs lstm (), as long as we allow input to flow into \mathcal{V}_{0} at any time.
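The claim that a discrete time neural graph can express an MLP can be checked on a toy case. The sketch below assumes scalar node states and ReLU at every node (including the output), and uses a hand-written adjacency matrix with entry `adj[v][u]` holding w_{uv}; all names and numbers are illustrative.

```python
def relu(x):
    return max(0.0, x)


def run_discrete(adj, z0, num_steps):
    """Iterate Z(l+1) = f(A Z(l)) with f = ReLU applied at every node."""
    z = list(z0)
    n = len(z)
    for _ in range(num_steps):
        z = [relu(sum(adj[v][u] * z[u] for u in range(n))) for v in range(n)]
    return z


def mlp(x, w1, w2):
    """Reference two-layer MLP with the same weights."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return relu(sum(w * h for w, h in zip(w2, hidden)))


w1 = [[1.0, -1.0], [0.5, 0.5]]  # input -> hidden
w2 = [2.0, 1.0]                 # hidden -> output

# Nodes 0, 1 are inputs, 2, 3 are hidden, 4 is the output.
adj = [[0.0] * 5 for _ in range(5)]
adj[2][0], adj[2][1] = w1[0]
adj[3][0], adj[3][1] = w1[1]
adj[4][2], adj[4][3] = w2

z_final = run_discrete(adj, [1.0, 2.0, 0.0, 0.0, 0.0], num_steps=2)
print(z_final[4], mlp([1.0, 2.0], w1, w2))  # 1.5 1.5
```

Two time steps are needed because information advances one edge per step: the hidden states are correct at step 1 and the output state at step 2.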

In continuous time it is not immediately obvious how to incorporate strided convolutions. One approach is to keep the same spatial resolution throughout and pad with zeros after applying strided convolutions. This design is illustrated by Figure 2.

We may also apply Algorithm 1 to learn the structure of dynamic neural graphs. One may use backpropagation through time bptt () and the adjoint-sensitivity method node () for optimization in the discrete and continuous time settings respectively. In Section 3.1, we demonstrate empirically that our method performs better than a random graph, though we do not formally justify the application of our algorithm in this setting.
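For intuition about the continuous time dynamics in equation 13, the sketch below rolls the state forward with fixed-step explicit Euler. This is a deliberately simpler stand-in, not the adaptive solver or adjoint method used for the experiments, and it assumes that \nabla\mathbf{Z}_{\mathcal{V}}(t) denotes the time derivative of the state, as in Neural ODEs. Node states are scalars and all names are hypothetical.

```python
import math


def euler_integrate(adj, z0, f, t0, t1, num_steps):
    """Integrate dZ/dt = f(A Z, t) from t0 to t1 with explicit Euler steps.

    adj: weighted adjacency matrix, adj[v][u] = w_uv (cycles are allowed).
    f:   node function applied elementwise, f(input_value, t).
    """
    n = len(z0)
    h = (t1 - t0) / num_steps
    z, t = list(z0), t0
    for _ in range(num_steps):
        inputs = [sum(adj[v][u] * z[u] for u in range(n)) for v in range(n)]
        z = [z[v] + h * f(inputs[v], t) for v in range(n)]
        t += h
    return z


# Sanity check on a one-node graph with a self-loop of weight -1 and an
# identity node function: dz/dt = -z, whose exact solution is exp(-t).
z = euler_integrate([[-1.0]], [1.0], lambda a, t: a, 0.0, 1.0, 1000)
print(abs(z[0] - math.exp(-1.0)) < 1e-3)  # True
```

The self-loop in the sanity check is a cycle, which a static neural graph could not contain but a dynamic one can.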

### 2.4 Implementation details for Large Scale Experiments

For large scale experiments we do not consider the dynamic case as optimization is too expensive. Accordingly, we now present our method for constructing a large and efficient static neural graph. With this model we may jointly learn the structure of the graph along with the parameters on ImageNet imagenet (). As illustrated by Table 4 our model closely follows the structure of MobileNetV1 mobilenetv1 (), and so we refer to it as MobileNetV1-DNW. We consider a separate neural graph for each spatial resolution – the output of graph \mathcal{G}_{i} is the input of graph \mathcal{G}_{i+1}. For width multiplier mobilenetv1 () d and spatial resolution s\times s we constrain MobileNetV1-DNW to have the same number of edges at resolution s\times s as the corresponding MobileNetV1 \times d. We use a slightly smaller width multiplier to obtain a model with the same FLOPs, since the number of depthwise convolutions in MobileNetV1-DNW is fixed. If we interpret a pointwise convolution with c_{1} input channels and c_{2} output channels as a complete bipartite graph then the number of edges is simply c_{1}c_{2}.
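The per-resolution edge budget can be worked out by summing c_{1}c_{2} over the pointwise convolutions at each resolution. The sketch below does this using channel counts read off the MobileNetV1 column of Table 4 in Appendix A. Truncating scaled channel counts with `int()` is an assumption made here for illustration; the text does not say how fractional channel counts are rounded.

```python
# (input channels, output channels) of each pointwise convolution in
# MobileNetV1 at width multiplier 1, grouped by output resolution.
POINTWISE = {
    112: [(32, 64)],
    56: [(64, 128), (128, 128)],
    28: [(128, 256), (256, 256)],
    14: [(256, 512)] + [(512, 512)] * 5,
    7: [(512, 1024), (1024, 1024)],
}


def edge_budget(d):
    """Number of edges k allowed at each resolution for width multiplier d."""
    def scale(c):
        return int(c * d)  # assumed rounding rule

    return {
        res: sum(scale(c1) * scale(c2) for c1, c2 in convs)
        for res, convs in POINTWISE.items()
    }


print(edge_budget(1.0)[56])    # 24576  (64*128 + 128*128)
print(edge_budget(0.25)[112])  # 128    (8*16)
```

Each value is the k passed to the edge-discovery procedure for the graph at that resolution.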

We also constrain the longest path in graph \mathcal{G} to be equivalent to the number of layers of the corresponding MobileNetV1. We do so by partitioning the nodes \mathcal{V} into blocks \mathcal{B}=\{\mathcal{B}_{0},\ldots,\mathcal{B}_{L-1}\} where \mathcal{B}_{0} is the set of input nodes \mathcal{V}_{0}, \mathcal{B}_{L-1} is the set of output nodes \mathcal{V}_{E}, and we only allow edges between nodes in \mathcal{B}_{i} and \mathcal{B}_{j} if i<j. The longest path in a graph with L blocks is then L-1. Splitting the graph into blocks also improves efficiency as we may operate on one block at a time. The structure of MobileNetV1 may be recovered by considering a complete bipartite graph between adjacent blocks.
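A small sketch of the block constraint: it enumerates the candidate edges permitted by a block partition and checks the longest-path bound. The function names and the toy partition are hypothetical.

```python
def allowed_edges(blocks):
    """All (u, v) with u in block i and v in block j for i < j."""
    return [
        (u, v)
        for i, block_i in enumerate(blocks)
        for block_j in blocks[i + 1:]
        for u in block_i
        for v in block_j
    ]


def longest_path(blocks, edges):
    """Length (in edges) of the longest path; assumes edges respect the blocks."""
    parents = {v: [] for block in blocks for v in block}
    for u, v in edges:
        parents[v].append(u)
    depth = {}
    for block in blocks:  # block order is a valid topological order
        for v in block:
            depth[v] = max((depth[u] + 1 for u in parents[v]), default=0)
    return max(depth.values())


blocks = [[0, 1], [2, 3], [4]]  # L = 3 blocks
candidates = allowed_edges(blocks)
print(len(candidates))                   # 8
print(longest_path(blocks, candidates))  # 2, i.e. L - 1

# Keeping only edges between adjacent blocks gives the layered pattern.
adjacent_only = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4)]
print(longest_path(blocks, adjacent_only))  # 2
```

The two candidate edges missing from `adjacent_only`, (0, 4) and (1, 4), are exactly the block-skipping connections that the layered structure forbids but the neural graph may discover.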

The operation f_{\theta_{v}} at each node is a batch-norm batchnorm () (2 parameters), ReLU alexnet (), 3\times 3 convolution (9 parameters) triplet. When the spatial resolution decreases in MobileNetV1 we change the convolutional stride of the input nodes to 2.

In models denoted MobileNetV1-DNW-Small (\times d) we also limit the last fully connected (FC) layer to have the same number of edges as the FC layer in MobileNetV1 (\times d). In the normal setting of MobileNetV1-DNW we do not modify the last FC layer.

## 3 Experiments

In this section we demonstrate the effectiveness of DNW for image classification in small and large scale settings. We begin by comparing our method with a random wiring on a small scale dataset and model. This allows us to experiment in static, discrete time, and continuous settings. Next we explore the use of DNW at scale with experiments on ImageNet imagenet (). Finally, we compare DNW with other methods of discovering network structures.

Throughout this section we let RG denote our primary baseline – a randomly wired graph. To construct a randomly wired graph with k-edges we assign a uniform random weight to each edge then pick the k edges with the largest magnitude weights. As shown in randwire (), random graphs often outperform manually designed networks.

### 3.1 Small Scale Experiments For Static and Dynamic Neural Graphs

We begin by training tiny classifiers for the CIFAR-10 dataset cifar (). Our initial aim is not to achieve state of the art performance but instead to explore DNW in the static, discrete, and continuous time settings. As illustrated by Table 1, our method outperforms a random graph by a large margin.

The image is first downsampled (we use two 3\times 3 strided convolutions; the first is standard while the second is depthwise-separable), then each channel is given as input to a node in a neural graph. The static graph uses 5 blocks and the discrete time graph uses 5 time steps. For the continuous case we backprop through the operation of an adaptive ODE solver (a 5th order Runge-Kutta method ode () as implemented by node ()). The models have 41k parameters. At each node we first perform ReLU followed by a 3\times 3 single channel convolution, followed by Instance Norm instancenorm ().

Model | Accuracy |
---|---|
Static (RG) | 74.0\pm 1.1\% |
Static (DNW) | 80.3\pm 0.8\% |
Discrete Time (RG) | 77.3\pm 0.7\% |
Discrete Time (DNW) | 82.3\pm 0.6\% |
Continuous (RG) | 78.5\pm 1.2\% |
Continuous (DNW) | 83.1\pm 0.3\% |

Model | Accuracy |
---|---|
MobileNetV1 (\times 0.25) | 86.3\pm 0.2\% |
MobileNetV1-RG (\times 0.225) | 87.2\pm 0.1\% |
No Update Rule | 86.7\pm 0.5\% |
L1 + Anneal | 84.3\pm 0.6\% |
Lottery Ticket^{\dagger} | 87.9\pm 0.3\% |
Fine Tune \alpha=0.1^{\dagger} | 89.4\pm 0.2\% |
Fine Tune \alpha=0.01^{\dagger} | 89.7\pm 0.1\% |
Fine Tune \alpha=0.001^{\dagger} | 88.7\pm 0.2\% |
MobileNetV1-DNW (\times 0.225) | 89.7\pm 0.2\% |

### 3.2 ImageNet Classification

For large scale experiments on ImageNet imagenet () we are limited to exploring DNW in the static case (recurrent and continuous time networks are more expensive to optimize due to lack of parallelization). Although our network follows the simple structure of MobileNetV1 mobilenetv1 () we are able to achieve higher accuracy than modern networks which are more advanced and optimized. Notably, MobileNetV2 mobilenetv2 () extends MobileNetV1 by adding residual connections and linear bottlenecks and ShuffleNet shufflenet (); shufflenetv2 () introduces channel splits and channel shuffles. The results of the large scale experiments may be found in Table 3.

As standard, we have divided the results of Table 3 to consider models which have similar FLOPs. In the more sparse case (\sim 41M FLOPs) we are able to use DNW to boost the performance of MobileNetV1 by 10\%. Though random graphs perform extremely well we still observe a 7\% boost in performance. In each experiment we train for 250 epochs using Cosine Annealing as the learning rate scheduler with initial learning rate 0.1, as in randwire ().

Model | Params | FLOPs | Accuracy |
---|---|---|---|
MobileNetV1 (\times 0.25) mobilenetv1 () | 0.5M | 41M | 50.6\% |
MobileNetV2 (\times 0.15)^{*} mobilenetv2 () | — | 39M | 44.9\% |
MobileNetV2 (\times 0.4)^{**} | — | 43M | 56.6\% |
DenseNet (\times 0.5)^{*} densenet () | — | 42M | 41.1\% |
Xception (\times 0.5)^{*} xception () | — | 40M | 55.1\% |
ShuffleNetV1 (\times 0.5,\ g=3) shufflenet () | — | 38M | 56.8\% |
ShuffleNetV2 (\times 0.5) shufflenetv2 () | 1.4M | 41M | 60.3\% |
MobileNetV1-RG (\times 0.225) | 1.4M | 45.3M | 53.3\% |
MobileNetV1-DNW-Small (\times 0.15) | 0.26M | 25M | 50.3\% |
MobileNetV1-DNW-Small (\times 0.225) | 0.4M | 41.6M | 59.9\% |
MobileNetV1-DNW (\times 0.225) | 1.2M | 42.6M | 60.9\% |
MnasNet-search1 mnasnet () | 1.9M | 65M | 64.9\% |
MobileNetV1-DNW (\times 0.3) | 1.3M | 65M | 65.0\% |
MobileNetV1 (\times 0.5) | 1.3M | 149M | 63.7\% |
MobileNetV2 (\times 0.6)^{*} | — | 141M | 66.6\% |
MobileNetV2 (\times 0.75)^{***} | — | 145M | 67.9\% |
DenseNet (\times 1)^{*} | — | 142M | 54.8\% |
Xception (\times 1)^{*} | — | 145M | 65.9\% |
ShuffleNetV1 (\times 1,\ g=3) | — | 140M | 67.4\% |
ShuffleNetV2 (\times 1) | 2.3M | 146M | 69.4\% |
MobileNetV1-RG (\times 0.49) | 1.8M | 148M | 64.1\% |
MobileNetV1-DNW (\times 0.49) | 1.8M | 147M | 70.4\% |

### 3.3 Related Methods

We compare DNW with various methods for discovering neural wirings. In Table 2 we use the structure of MobileNetV1-DNW but try other methods which find k-edge sub-networks. The experiments in Table 2 are conducted using CIFAR-10 cifar (). We train for 160 epochs using Cosine Annealing as the learning rate scheduler with initial learning rate \alpha=0.1 unless otherwise noted.

The Lottery Ticket Hypothesis: The authors of lth (); lth2 () offer an intriguing hypothesis: sparse sub-networks may be trained in isolation. However, their method for finding so-called winning tickets is quite expensive as it requires training the full graph from scratch. We follow one-shot pruning from lth2 (). One-shot pruning is more comparable in training FLOPs than iterative pruning lth (), though both methods are more expensive in training FLOPs than DNW. After training the full network \mathcal{G}_{\text{full}} (i.e. no edges pruned) the optimal sub-network \mathcal{G}_{k} with k-edges is chosen by taking the weights with the highest magnitude. In the row denoted Lottery Ticket we retrain \mathcal{G}_{k} using the initialization of \mathcal{G}_{\text{full}}. We found it better to initialize \mathcal{G}_{k} with the weights of \mathcal{G}_{\text{full}} after training; these rows are denoted Fine Tune (we try different initial learning rates \alpha). Though these experiments perform comparably with DNW, their training is much more expensive as the full graph must initially be trained.

Exploring Randomly Wired Networks for Image Recognition: The authors of randwire () explore “a more diverse set of connectivity patterns through the lens of randomly wired neural networks.” They achieve impressive performance on ImageNet imagenet () using random graph algorithms to generate the structure of a neural network. Their network connectivity, however, is fixed during training. Throughout this section we have a random graph (denoted RG) as our primary baseline – as in randwire () we have seen that random graphs outperform hand-designed networks.

No Update Rule: In this ablation on DNW we do not apply the update rule to the hallucinated edges. An edge may only leave the hallucinated edge set if the magnitude of a real edge is sufficiently decreased. This experiment demonstrates the importance of the update rule.

L1 + Anneal: We experiment with a simple pruning technique – start with a fully connected graph and remove edges by magnitude throughout training until there are only k remaining. We found that accuracy was much better if we added an L1 regularization term.

Neural Architecture Search: As illustrated by Table 3, our network (with a very simple MobileNetV1 like structure) is able to achieve comparable accuracy to an expensive method which performs neural architecture search using reinforcement learning mnasnet ().

## 4 Scope & Limitations

Efficiency: Our training process is still more expensive than training a sparse sub-network in isolation – on the backwards pass we must consider the complete graph. For this reason our large scale experiments are still limited to the small FLOP regimes. In the future we hope to explore more stochastic methods where we only update a “mini-batch” of weight values.

Locality: Our algorithm is quite simple and local. We anticipate that methods of discovering neural wirings which take global structure into account may perform better. We look forward to exploring these methods in future work.

## 5 Conclusion

We present a novel method for discovering neural wirings. With a simple algorithm we demonstrate a significant boost in accuracy over randomly wired networks. Just as in randwire (), our networks are free from the typical constraints of NAS. This work suggests exciting directions for more complex and efficient methods of discovering neural wirings.

#### Acknowledgments

We thank Sarah Pratt, Mark Yatskar and the Beaker team. Computations on beaker.org were supported in part by credits from Google Cloud.

## References

- [1] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
- [2] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.
- [3] François Chollet. Xception: Deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, 2017.
- [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [5] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural odes. CoRR, abs/1904.01681, 2019.
- [6] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
- [7] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019.
- [8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
- [9] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
- [10] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
- [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [12] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60:84–90, 2012.
- [14] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS, 1991.
- [15] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
- [16] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. CoRR, abs/1806.09055, 2019.
- [17] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
- [18] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
- [19] Mark B. Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- [20] Lawrence F. Shampine. Some practical runge-kutta formulas. Math. Comput., 46(173):135–150, January 1986.
- [21] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
- [22] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
- [23] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, Oct 1990.
- [24] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
- [25] Saining Xie, Alexander Kirillov, Ross B. Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. CoRR, abs/1904.01569, 2019.
- [26] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
- [27] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

## Appendix A Architecture

| Stage | Output | MobileNetV1 | MobileNetV1-DNW |
| --- | --- | --- | --- |
| 0 | 112\times 112 | 3\times 3\ \texttt{conv},\ c=32d,\ s=2 | g_{\phi}=\left(3\times 3\ \texttt{conv},\ c=32,\ s=2\right) |
| 1 | 112\times 112 | 3\times 3\ \texttt{dwconv},\ c=32d | \mathcal{G}_{1} with \lvert\mathcal{V}\rvert=32+64 |
| | | 3\times 3\ \texttt{pwconv},\ c=64d | \lvert\mathcal{V}_{0}\rvert=32,\ \lvert\mathcal{V}_{E}\rvert=64,\ \lvert\mathcal{B}\rvert=2 |
| 2 | 56\times 56 | 3\times 3\ \texttt{dwconv},\ c=64d,\ s=2 | \mathcal{G}_{2} with \lvert\mathcal{V}\rvert=64+2\cdot 128 |
| | | 3\times 3\ \texttt{pwconv},\ c=128d | \lvert\mathcal{V}_{0}\rvert=64,\ \lvert\mathcal{V}_{E}\rvert=128,\ \lvert\mathcal{B}\rvert=3 |
| | | 3\times 3\ \texttt{dwconv},\ c=128d | |
| | | 3\times 3\ \texttt{pwconv},\ c=128d | |
| 3 | 28\times 28 | 3\times 3\ \texttt{dwconv},\ c=128d,\ s=2 | \mathcal{G}_{3} with \lvert\mathcal{V}\rvert=128+2\cdot 256 |
| | | 3\times 3\ \texttt{pwconv},\ c=256d | \lvert\mathcal{V}_{0}\rvert=128,\ \lvert\mathcal{V}_{E}\rvert=256,\ \lvert\mathcal{B}\rvert=3 |
| | | 3\times 3\ \texttt{dwconv},\ c=256d | |
| | | 3\times 3\ \texttt{pwconv},\ c=256d | |
| 4 | 14\times 14 | 3\times 3\ \texttt{dwconv},\ c=256d,\ s=2 | \mathcal{G}_{4} with \lvert\mathcal{V}\rvert=256+6\cdot 512 |
| | | 3\times 3\ \texttt{pwconv},\ c=512d | \lvert\mathcal{V}_{0}\rvert=256,\ \lvert\mathcal{V}_{E}\rvert=512,\ \lvert\mathcal{B}\rvert=7 |
| | | 5\times\begin{cases}3\times 3\ \texttt{dwconv},\ c=512d\\ 3\times 3\ \texttt{pwconv},\ c=512d\end{cases} | |
| 5 | 7\times 7 | 3\times 3\ \texttt{dwconv},\ c=512d,\ s=2 | \mathcal{G}_{5} with \lvert\mathcal{V}\rvert=512+2\cdot 1024 |
| | | 3\times 3\ \texttt{pwconv},\ c=1024d | \lvert\mathcal{V}_{0}\rvert=512,\ \lvert\mathcal{V}_{E}\rvert=1024,\ \lvert\mathcal{B}\rvert=3 |
| | | 3\times 3\ \texttt{dwconv},\ c=1024d | |
| | | 3\times 3\ \texttt{pwconv},\ c=1024d | |
| 6 | 1000 | 7\times 7\ \texttt{pool},\ 1024d\times 1000\ \texttt{FC} | h_{\psi}=\left(7\times 7\ \texttt{pool},\ 1024\times 1000\ \texttt{FC}\right) |
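The node counts listed for the DNW graphs appear to follow one pattern: each graph has |\mathcal{V}_{0}| input nodes plus |\mathcal{B}|-1 further groups of |\mathcal{V}_{E}| nodes. The sketch below checks that reading against the table. The formula and the names `num_nodes`, `STAGES`, and `check_table` are inferred for illustration and are not stated in the text.

```python
# Hypothetical helper: the formula |V| = |V_0| + (|B| - 1) * |V_E| is inferred
# from the table entries, not taken from the paper's text.
def num_nodes(v0: int, ve: int, num_blocks: int) -> int:
    """Total number of nodes in one stage graph under the inferred formula."""
    return v0 + (num_blocks - 1) * ve


# (stage, |V_0|, |V_E|, |B|, |V| as listed in the table)
STAGES = [
    (1, 32, 64, 2, 32 + 64),
    (2, 64, 128, 3, 64 + 2 * 128),
    (3, 128, 256, 3, 128 + 2 * 256),
    (4, 256, 512, 7, 256 + 6 * 512),
    (5, 512, 1024, 3, 512 + 2 * 1024),
]


def check_table() -> bool:
    """Return True if every listed |V| matches the inferred formula."""
    return all(num_nodes(v0, ve, b) == total for _, v0, ve, b, total in STAGES)
```

Under this reading, `check_table()` returns `True` for all five stages.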

## Appendix B A More General Case

We now consider the case where the hallucinated edge (i,\ell) replaces (j,k)\in\mathcal{E}.

As before, we use \tilde{w} to denote the weight w after the weight update rule \tilde{w}_{uv}=w_{uv}+\left\langle Z_{u},-\alpha{\partial\mathcal{L}\over\partial\mathcal{I}_{v}}\right\rangle. We assume that \alpha is small enough that \mathrm{sign}(\tilde{w})=\mathrm{sign}(w).

Claim: Assume \mathcal{L} is Lipschitz continuous. There exists a learning rate \alpha^{*}>0 such that for \alpha\in(0,\alpha^{*}) swapping (i,\ell) for (j,k) will decrease the loss, provided that the states of the nodes are fixed, there is no path from i to j, and |w_{i\ell}|<|w_{jk}| but |\tilde{w}_{i\ell}|>|\tilde{w}_{jk}|.
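To make the quantities in the claim concrete, here is a minimal sketch of the weight update rule and of the magnitude condition that triggers a swap. The function names and the toy numbers are illustrative assumptions; node states and gradients are represented as plain lists of floats.

```python
from typing import Sequence


def inner(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product <a, b> of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def updated_weight(w: float, z_u: Sequence[float],
                   grad_v: Sequence[float], alpha: float) -> float:
    """Update rule w~_uv = w_uv + <Z_u, -alpha * dL/dI_v>."""
    return w + inner(z_u, [-alpha * g for g in grad_v])


def swap_condition(w_il: float, w_jk: float,
                   w_il_new: float, w_jk_new: float) -> bool:
    """|w_il| < |w_jk| before the update but |w~_il| > |w~_jk| after it."""
    return abs(w_il) < abs(w_jk) and abs(w_il_new) > abs(w_jk_new)


# Toy values (assumed): the hallucinated edge (i, l) grows, the edge (j, k) shrinks.
w_il_new = updated_weight(0.5, [1.0, 2.0], [-1.0, -1.0], alpha=0.1)  # about 0.8
w_jk_new = updated_weight(0.6, [1.0, 0.0], [1.0, 0.0], alpha=0.1)    # about 0.5
wants_swap = swap_condition(0.5, 0.6, w_il_new, w_jk_new)
```

In this toy case both weights keep their sign, as the claim assumes, and the magnitude ordering flips, so `wants_swap` is `True`.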

Proof. Let \mathcal{A}_{k},\mathcal{A}_{\ell} be the values of \mathcal{I}_{k} and \mathcal{I}_{\ell} after the update rule if (j,k) is replaced with (i,\ell). Let \mathcal{B}_{k} and \mathcal{B}_{\ell} be the values of \mathcal{I}_{k} and \mathcal{I}_{\ell} after the update rule if we do not allow for swapping. Then \mathcal{A}_{k}, \mathcal{A}_{\ell}, \mathcal{B}_{k} and \mathcal{B}_{\ell} are given by

\mathcal{A}_{k}=\sum_{(u,k)\in\mathcal{E},\ u\neq j}\tilde{w}_{uk}Z_{u},\qquad\mathcal{B}_{k}=\tilde{w}_{jk}Z_{j}+\sum_{(u,k)\in\mathcal{E},\ u\neq j}\tilde{w}_{uk}Z_{u} | (14) |

\mathcal{A}_{\ell}=\tilde{w}_{i\ell}Z_{i}+\sum_{(u,\ell)\in\mathcal{E},\ u\neq i}\tilde{w}_{u\ell}Z_{u},\qquad\mathcal{B}_{\ell}=\sum_{(u,\ell)\in\mathcal{E},\ u\neq i}\tilde{w}_{u\ell}Z_{u}. | (15) |

Additionally, let g_{k}=-\alpha{\partial\mathcal{L}\over\partial\mathcal{I}_{k}} and g_{\ell}=-\alpha{\partial\mathcal{L}\over\partial\mathcal{I}_{\ell}} be the directions in which the loss most steeply descends with respect to \mathcal{I}_{k} and \mathcal{I}_{\ell}. By Lemma 1 (Appendix C) it suffices to show that

\left\langle\mathcal{A}_{k}-\mathcal{I}_{k},g_{k}\right\rangle+\left\langle\mathcal{A}_{\ell}-\mathcal{I}_{\ell},g_{\ell}\right\rangle\geq\left\langle\mathcal{B}_{k}-\mathcal{I}_{k},g_{k}\right\rangle+\left\langle\mathcal{B}_{\ell}-\mathcal{I}_{\ell},g_{\ell}\right\rangle | (16) |

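which, since \mathcal{A}_{k}-\mathcal{B}_{k}=-\tilde{w}_{jk}Z_{j} and \mathcal{A}_{\ell}-\mathcal{B}_{\ell}=\tilde{w}_{i\ell}Z_{i}, simplifies to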

\tilde{w}_{i\ell}\left\langle Z_{i},g_{\ell}\right\rangle\geq\tilde{w}_{jk}\left\langle Z_{j},g_{k}\right\rangle | (17) |

\iff\tilde{w}_{i\ell}(\tilde{w}_{i\ell}-w_{i\ell})\geq\tilde{w}_{jk}(\tilde{w}_{jk}-w_{jk}). | (18) |
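The reduction from (16) to (17) and (18) can be sanity-checked numerically. The sketch below builds a small random instance, with one extra incoming edge at each of k and \ell, and compares the gap between the two sides of each inequality. The graph layout, node names, and random values are assumptions made purely for illustration.

```python
import random


def inner(a, b):
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def scaled(c, x):
    """Scalar multiple c * x."""
    return [c * xi for xi in x]


def add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]


def sub(x, y):
    return [xi - yi for xi, yi in zip(x, y)]


def gaps(seed=0, dim=4):
    """Return (lhs - rhs) for inequalities (16), (17) and (18)."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    # Node states: i, j from the text; a and b are assumed extra parents of k and l.
    z_i, z_j, z_a, z_b = vec(), vec(), vec(), vec()
    g_k, g_l = vec(), vec()   # g_v = -alpha * dL/dI_v, drawn at random here
    i_k, i_l = vec(), vec()   # current inputs I_k and I_l (they cancel out)

    # Pre-update weights and the update rule w~_uv = w_uv + <Z_u, g_v>.
    w_il, w_jk = rng.uniform(-1, 1), rng.uniform(-1, 1)
    w_ak, w_bl = rng.uniform(-1, 1), rng.uniform(-1, 1)
    wt_il = w_il + inner(z_i, g_l)
    wt_jk = w_jk + inner(z_j, g_k)
    wt_ak = w_ak + inner(z_a, g_k)
    wt_bl = w_bl + inner(z_b, g_l)

    # Equations (14) and (15).
    a_k = scaled(wt_ak, z_a)
    b_k = add(scaled(wt_jk, z_j), a_k)
    b_l = scaled(wt_bl, z_b)
    a_l = add(scaled(wt_il, z_i), b_l)

    gap16 = (inner(sub(a_k, i_k), g_k) + inner(sub(a_l, i_l), g_l)
             - inner(sub(b_k, i_k), g_k) - inner(sub(b_l, i_l), g_l))
    gap17 = wt_il * inner(z_i, g_l) - wt_jk * inner(z_j, g_k)
    gap18 = wt_il * (wt_il - w_il) - wt_jk * (wt_jk - w_jk)
    return gap16, gap17, gap18
```

The three gaps agree up to floating-point error for any seed, which is consistent with (16), (17) and (18) being the same inequality.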

We are now in the same setting as equation 6 and may complete the proof as before.

In practice there may be a path from i to j, and the state of the nodes will never be fixed due to the stochasticity of mini-batches and updates to the rest of the parameters in the network. However, as the graph grows large the state of one node will have little effect on the state of another, even if there is a path between them. The proofs are done in an idealized case and the empirical results demonstrate that the method works in practice.

## Appendix C Lemma 1

Here we show that for sufficiently small \alpha

\left\langle\gamma_{1},-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle>\left\langle\gamma_{2},-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle | (19) |

implies that

\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma_{1}\right)<\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma_{2}\right). | (20) |

Note that for brevity we have written the loss as a function of \mathcal{I}_{v}. By taking a Taylor expansion we find that

\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma\right) | (21) |

=\mathcal{L}\left(\mathcal{I}_{v}\right)+\left\langle\alpha\gamma,{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle+\mathcal{O}(\alpha^{2}) | (22) |

and so for sufficiently small \alpha

\mathcal{L}\left(\mathcal{I}_{v}\right)-\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma\right)\approx\left\langle\gamma,-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle | (24) |

which completes the proof of the lemma.
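The role of "sufficiently small \alpha" can be illustrated on a toy problem. The sketch below assumes a quadratic loss L(x)=\frac{1}{2}\|x\|^{2}, chosen only because its gradient is simply x, and compares the first-order prediction with the actual loss after a step. All names and values are illustrative.

```python
def loss(x):
    """Assumed toy loss L(x) = 0.5 * ||x||^2 (not the network loss)."""
    return 0.5 * sum(v * v for v in x)


def grad(x):
    """Gradient of the toy loss, dL/dx = x."""
    return list(x)


def first_order_gain(gamma, x, alpha):
    """<gamma, -alpha * dL/dx>, the predicted decrease in loss."""
    return sum(g * (-alpha * d) for g, d in zip(gamma, grad(x)))


def stepped_loss(gamma, x, alpha):
    """Loss after moving from x to x + alpha * gamma."""
    return loss([xi + alpha * gi for xi, gi in zip(x, gamma)])


def lemma_holds(gamma1, gamma2, x, alpha):
    """Check that a larger first-order gain implies a smaller stepped loss."""
    if first_order_gain(gamma1, x, alpha) > first_order_gain(gamma2, x, alpha):
        return stepped_loss(gamma1, x, alpha) < stepped_loss(gamma2, x, alpha)
    return True  # premise not satisfied, nothing to check


x = [1.0, -2.0]
gamma1, gamma2 = [-1.0, 2.0], [1.0, 0.0]
small_step_ok = lemma_holds(gamma1, gamma2, x, alpha=0.01)  # True
large_step_ok = lemma_holds(gamma1, gamma2, x, alpha=4.0)   # False
```

With \alpha=0.01 the first-order term dominates and the implication holds. With \alpha=4 the \mathcal{O}(\alpha^{2}) term takes over and the ordering reverses, which is why the lemma requires \alpha to be small.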

An equivalent argument holds for two dimensions.

\left\langle\gamma_{1},-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle+\left\langle\xi_{1},-\alpha{\partial\mathcal{L}\over\partial Z_{u}}\right\rangle>\left\langle\gamma_{2},-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle+\left\langle\xi_{2},-\alpha{\partial\mathcal{L}\over\partial Z_{u}}\right\rangle | (25) |

implies that

\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma_{1},\mathcal{I}_{u}+\alpha\xi_{1}\right)<\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma_{2},\mathcal{I}_{u}+\alpha\xi_{2}\right). | (26) |

By taking a Taylor expansion we find that

\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma,\mathcal{I}_{u}+\alpha\xi\right) | (27) |

=\mathcal{L}\left(\mathcal{I}_{v},\mathcal{I}_{u}\right)+\left\langle\alpha\gamma,{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle+\left\langle\alpha\xi,{\partial\mathcal{L}\over\partial Z_{u}}\right\rangle+\mathcal{O}(\alpha^{2}) | (28) |

and so for sufficiently small \alpha

\mathcal{L}\left(\mathcal{I}_{v},\mathcal{I}_{u}\right)-\mathcal{L}\left(\mathcal{I}_{v}+\alpha\gamma,\mathcal{I}_{u}+\alpha\xi\right)\approx\left\langle\gamma,-\alpha{\partial\mathcal{L}\over\partial Z_{v}}\right\rangle+\left\langle\xi,-\alpha{\partial\mathcal{L}\over\partial Z_{u}}\right\rangle. | (30) |