ItNet: iterative neural networks with tiny graphs for accurate and efficient anytime prediction

Deep neural networks usually have to be compressed and accelerated for their usage in low-power, e.g., mobile, devices. Recently, massively-parallel hardware accelerators have been developed that offer high throughput and low latency at low power by utilizing in-memory computation. However, to exploit these benefits, the computational graph of a neural network has to fit into the in-computation memory of these hardware systems, which is usually rather limited in size. In this study, we introduce a class of network models that have a tiny memory footprint in terms of their computational graphs. To this end, the graph is designed to contain loops by iteratively executing a single network building block. Furthermore, the trade-off between accuracy and latency of these so-called iterative neural networks is improved by adding multiple intermediate outputs both during training and inference. We show state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets, which are especially demanding in terms of computational resources. In ablation studies, we investigate the improvement of network training by intermediate network outputs as well as the trade-off between weight sharing over iterations and the network size.

I Introduction

For massively-parallel hardware accelerators (Schemmel et al., 2010; Merolla et al., 2014; Yao et al., 2020; Graphcore IPU, 2020), every neuron and synapse in the network model has its physical counterpart on the hardware system. Usually, by design, memory and computation are no longer separated; instead, neuron activations are computed next to the memory, i.e., the parameters, and fully in parallel. This is in contrast to the rather sequential data processing of CPUs and GPUs, for which the computation of a network model is tiled and the same arithmetic unit is re-used multiple times for different neurons. Since the computation is performed fully in parallel and in memory, the throughput of massively-parallel accelerators is usually much higher than that of CPUs and GPUs. This can be attributed to the fact that the latency and power consumption for accessing local memory, as in in-memory computing, are much lower than for computations on CPUs and GPUs, which require frequent access to non-local memory like DRAM (Sze et al., 2017). However, the network graph has to fit into the memory of the massively-parallel hardware accelerator to allow for maximal throughput. If the network graph exceeds the available memory, the hardware in principle has to be re-configured at high frequency to sequentially process the partitioned graph, as is the case for CPUs and GPUs, and the benefit in terms of high throughput would be substantially reduced or even lost. Even higher throughputs can be obtained by using mixed-signal massively-parallel hardware systems, which usually operate on shorter time scales than digital ones, e.g., compare Schemmel et al. (2010) and Yao et al. (2020) to Merolla et al. (2014) and Graphcore IPU (2020).


Fig. 1: This study in a nutshell. a) The iterative neural network (ItNet): first, images are pre-processed and potentially down-scaled by a sub-network called the data block. Then, the output of this data block is processed by another sub-network that is iteratively executed. After every iteration, the output of this iterative block is fed back to its input and, at the same time, is further processed by the classification block to predict the semantic map. This network generates multiple outputs with increasing quality and computational costs and heavily re-uses intermediate network activations. While the parameters are shared between iterations of the iterative block, the parameters of the classification block are independent for each iteration. b-d) Mean intersection-over-union (mIoU) over the size of the computational graph, the latency, and the multiply-accumulate operations (MACs), respectively, on the validation set of the Cityscapes dataset. To our knowledge, ESPNetv2 (Mehta et al., 2019) is the state-of-the-art in terms of mIoU over MACs, for which we show the mIoU when not using pre-trained weights (Ruiz and Verbeek, 2019). RecUNets (Wang et al., 2019) share the weights of U-Nets and recurrently connect their bottlenecks. ENets (Paszke et al., 2016) are one of the first efficient network models. Note that the ItNet requires more MACs, but has a substantially smaller computational graph and a lower latency than the reference networks. The throughput of ItNets is potentially also higher, since the small graph size allows for the execution on massively-parallel hardware systems. e) Image, label, and network predictions (with the mIoU in their corner) over the network outputs of the ItNet in (c). The data sample with the th-percentile mIoU is shown.

In order to achieve neural networks with tiny computational graphs, in which nodes are operations and edges are activations, we heavily re-use a single building block of the network (see the iterative block in Figure 1a). Not only the structure of computations, i.e., the type of network layers including their dimensions and connectivity, is identical for each iteration of this building block, but the parameters are also shared between iterations. In the computational graphs of these so-called iterative neural networks (ItNets), the re-used building blocks with shared weights can be represented by nodes with self-loops. Compared to conventional feed-forward networks, loops simplify the graph by reducing the number of unique nodes and, consequently, its computational footprint. However, the restriction of sharing weights usually decreases the number of free parameters and, hence, the accuracy of networks. To isolate and quantify this effect, we compare networks with weight sharing to networks for which the parameters of the building blocks are independent between iterations. In contrast to the above proposal, conventional deep neural networks for image processing usually do not share weights and have no (e.g., Huang et al., 2017) or few (e.g., one building block for each scale, as in Greff et al., 2017) layers of identical structure. Liao and Poggio (2016) share weights between re-used building blocks, but use multiple unique building blocks.
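The distinction between shared and independent weights can be sketched in a few lines; here a single matrix multiplication with a ReLU stands in for the full iterative block, so all names and dimensions are illustrative rather than the paper's actual implementation.

```python
import numpy as np

def run_iterative(x, weights, n_iter):
    """Apply one linear building block repeatedly.

    With a single shared weight matrix, the computational graph is a
    self-loop: one unique node regardless of n_iter.  A list of
    per-iteration matrices corresponds to the unrolled (independent)
    variant with n_iter unique nodes.
    """
    outputs = []
    for t in range(n_iter):
        w = weights if isinstance(weights, np.ndarray) else weights[t]
        x = np.maximum(w @ x, 0.0)  # linear layer + ReLU stand-in
        outputs.append(x)
    return outputs

rng = np.random.default_rng(0)
w_shared = rng.normal(size=(4, 4))
w_indep = [rng.normal(size=(4, 4)) for _ in range(3)]
x0 = rng.normal(size=4)

shared_out = run_iterative(x0, w_shared, 3)
indep_out = run_iterative(x0, w_indep, 3)

# Parameter count: shared uses one matrix, independent uses n_iter matrices.
n_params_shared = w_shared.size
n_params_indep = sum(w.size for w in w_indep)
```

The factor between the two parameter counts is exactly the number of iterations, which is the trade-off the ablation studies below quantify.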

To improve the training of networks that contain loops in their graphs, and to reduce the latency of networks during inference, we use multiple intermediate outputs. Multi-output networks that heavily re-use intermediate activations are beneficial for a wide range of applications, especially in the mobile domain. In an online manner, they allow trading off latency against accuracy with barely any overhead (e.g., Huang et al., 2018). From an application point of view, the benefit of this trade-off is best described by the following two scenarios (Huang et al., 2018): In the so-called anytime prediction scenario, the prediction of a network is progressively updated, and the first output defines the initial latency of the network. In a second scenario, a limited computational budget can be unevenly distributed over a set of samples with different “difficulties” in order to increase the average accuracy. In this study, we only address the scenario of anytime prediction, and refer to the literature (e.g., Huang et al., 2018) for the distribution of the computational budget.
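The anytime-prediction scenario can be illustrated with a minimal sketch: given cumulative compute costs per exit, a consumer picks the latest prediction that fits its budget. The function name and toy values are assumptions for illustration, not part of the paper.

```python
def anytime_prediction(outputs, costs, budget):
    """Return the most refined prediction whose cumulative cost fits the budget.

    outputs: predictions from consecutive network exits (coarse to fine)
    costs:   cumulative compute cost up to each exit (monotonically increasing)
    """
    best = None
    for pred, cost in zip(outputs, costs):
        if cost > budget:
            break
        best = pred
    return best

# Toy example: three exits with increasing cost and (hypothetical) quality.
preds = ["coarse", "medium", "fine"]
cum_cost = [1.0, 2.5, 6.0]
assert anytime_prediction(preds, cum_cost, budget=3.0) == "medium"
```

If even the first exit exceeds the budget, no prediction is available, which is why the latency of the first output matters for this scenario.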

If all nodes in the network graph are computed in parallel, as on massively-parallel hardware systems, the latency for inference is dominated by the depth of the network, i.e., the longest path from input to output (Fischer et al., 2018). In order to reduce this latency and to allow the repetitive execution of a single building block, we use networks that compute all scales in parallel (similar to Huang et al., 2018; Ke et al., 2017) and increase the effective depth for each scale over the consecutive iterations of this multi-scale building block. Furthermore, multi-scale networks are also beneficial for the integration of global information, as especially required by dense prediction tasks like semantic segmentation (Zhao et al., 2017). To further reduce the latency, we not only compute all scales in parallel, but also keep the depth of each scale as shallow as possible.

In the deep learning literature, the computational costs are usually quantified by counting the parameters and/or the multiply-accumulate operations (MACs) required for the inference of a single sample. For fully convolutional networks, the number of parameters is independent of the spatial resolution of the network’s input and the intermediate feature maps. Especially for large inputs, as commonly used for semantic segmentation, the number of parameters does not cover the main workload and is, hence, not suited as a measure of computational costs. MACs have the advantage that they can be easily calculated and are usually a good approximation of the latency and throughput on CPUs and even GPUs. However, for most novel hardware accelerators, it is not the MACs but the non-local memory transfers that dominate the computational costs in terms of power consumption (Chen et al., 2016; Sze et al., 2017; Chao et al., 2019). These memory transfers are minimized on massively-parallel hardware systems as long as the network graph fits into the in-computation memory of these systems, i.e., the memory of their arithmetic units. Since both the power consumption during inference and the production cost scale with the size of this memory, we additionally compare the size of the computational graphs between networks. Note that the practical benefits of ItNets cannot be demonstrated on conventional CPUs or GPUs, since these hardware systems do not support the processing of neural networks in a fully-parallel and, hence, low-latency fashion.
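As a rough illustration of why parameter counts understate the workload for large inputs, the standard MAC and parameter counts of a dense convolution can be computed as follows (bias terms omitted; the layer sizes are made up for illustration):

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate operations of a dense k x k convolution."""
    return h_out * w_out * c_out * c_in * k * k

def conv2d_params(c_in, c_out, k):
    """Weight count (bias omitted) -- independent of spatial resolution."""
    return c_out * c_in * k * k

# For large inputs, the MACs grow with resolution while the parameter
# count stays fixed, which is why parameters alone understate the workload.
macs_small = conv2d_macs(90, 120, 32, 32, 3)
macs_large = conv2d_macs(360, 480, 32, 32, 3)
assert conv2d_params(32, 32, 3) == 9216
assert macs_large == 16 * macs_small  # 4x resolution in each dimension
```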

Note that, in this study, we focus on network models in the low-power regime of only a few billion MACs, while addressing the challenging scenario of large-scale images (for datasets, see Section II-D). The key contributions of this study are:

  • We introduce efficient networks with tiny computational graphs that heavily re-use a single building block and provide multiple intermediate outputs (Sections II-E, II-B and II-A).

  • We search for the best hyperparameters of this model and investigate the effect of multiple outputs and weight sharing on the network training (Sections III-B, III-A and II-C).

  • To our knowledge, we set the new state-of-the-art in terms of accuracy over the size of the computational graph and discuss the potential benefits of executing these so-called ItNets on massively-parallel hardware systems (Sections IV and III-C, and Figure 1).

We will release the source code upon acceptance for publication.

II Methods

Fig. 2: Detailed description of the data block (a) and the iterative block (b) as shown in Figure 1: a) First, images are down-sampled in two stages, each using a stride and pooling of size (inspired by Paszke et al., 2016). Then, the output of the second stage is processed on different scales. For clarity, an example with is shown. Convolutional layers (with kernel size 3), batch-normalization layers and ReLU activation functions are denoted by Conv, BN and R, respectively. b) The iterative block receives input at different scales, denoted as A, B and C. In order to mix the information between the different scales (e.g., Zhao et al., 2017), the inputs A, B and C are concatenated and processed with a residual block (He et al., 2016). Then, additional bottleneck residual blocks (BRB; with expansion factor ; Sandler et al., 2018) are applied to obtain the output of the iterative block, consisting of feature maps at the different scales. This output is both fed back as input for the next iteration and forwarded to the classification block. The classification block consists of a convolutional layer with output channels and an 8-fold bilinear up-sampling to the original spatial dimensions. In the first iteration of the iterative block, the mixing of the scales is skipped and the input is directly processed by the bottleneck residual blocks (see injection of the input in the figure). The number of output channels is denoted by and , respectively. The batch-normalization layers within the iterative blocks (dashed boxes) are never shared between iterations.

The following networks process images of size and output semantic maps of size with being the number of classes.

II-A Network architecture

We introduce a class of neural networks with tiny computational graphs by heavily re-using intermediate activations and weights (Figure 1a). Conceptually, the network model can be split into three main building blocks: the data block, the iterative block and the classification block (for an overview and details, see Figure 1a and Figure 2, respectively). While the data block is executed only once for each image, the iterative block can be executed multiple times in a row by feeding back its output as the input for the next iteration. The classification block outputs the prediction of the semantic map by processing the intermediate activations of the feedback signal. While the weights of the iterative block are shared between iterations, the weights of the classification block are unique for each iteration. For comparability with previous studies, we limit the number of MACs and select the network with the highest accuracy by optimizing the following architectural hyperparameters: the number of scales, the number of iterations, and the number of bottleneck residual blocks.
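The three-block structure can be summarized in a minimal forward-pass sketch; matrix multiplications stand in for the actual sub-networks, and all names and sizes are illustrative assumptions:

```python
import numpy as np

def data_block(image, w):
    # Stand-in for the pre-processing / down-scaling sub-network,
    # executed once per image.
    return np.maximum(w @ image, 0.0)

def iterative_block(state, w_shared):
    # Same weights every iteration -> a self-loop in the graph.
    return np.maximum(w_shared @ state, 0.0)

def classification_block(state, w_cls):
    # Per-iteration (unshared) read-out of the current state.
    return w_cls @ state

def itnet_forward(image, w_data, w_iter, w_cls_list):
    state = data_block(image, w_data)
    predictions = []
    for w_cls in w_cls_list:          # one read-out per iteration
        state = iterative_block(state, w_iter)
        predictions.append(classification_block(state, w_cls))
    return predictions

rng = np.random.default_rng(1)
d = 8
preds = itnet_forward(rng.normal(size=d),
                      rng.normal(size=(d, d)),
                      rng.normal(size=(d, d)),
                      [rng.normal(size=(3, d)) for _ in range(4)])
```

Each loop turn yields one prediction of increasing compute cost, which is what enables the anytime-prediction behavior described in the introduction.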

II-B Network training

For training, we use a joint cost function over all outputs of the network:

\mathcal{L} = \sum_t w_t \, H(y, \hat{y}_t),

where H(y, \hat{y}_t) is the categorical cross entropy between the true labels y and the network predictions \hat{y}_t at output t. The weight factors are normalized as follows: \sum_t w_t = 1.
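A minimal sketch of this joint cost function, with explicit weight normalization, might look as follows (one-hot labels and already-softmaxed predictions are assumed; the function names are illustrative):

```python
import numpy as np

def cross_entropy(y_true, probs, eps=1e-12):
    """Categorical cross entropy for one-hot labels and predicted probabilities."""
    return -np.sum(y_true * np.log(probs + eps))

def joint_loss(y_true, predictions, weights):
    """Weighted sum of per-output losses with normalized weight factors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # enforce sum_t w_t = 1
    return sum(wt * cross_entropy(y_true, p) for wt, p in zip(w, predictions))

y = np.array([0.0, 1.0, 0.0])
preds = [np.array([0.2, 0.6, 0.2]), np.array([0.1, 0.8, 0.1])]
loss_uniform = joint_loss(y, preds, [1, 1])
loss_late_only = joint_loss(y, preds, [0, 1])
assert loss_late_only < loss_uniform  # the later output is more accurate here
```

Because the weights are normalized, only their relative magnitudes matter, which is relevant for the weight-factor ablations in Section III-B.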

We use the Adam optimizer with , and a learning rate that we multiply with after and of the number of overall training epochs. We use a batch size of and train the network for ( for Figure 5) and for the CamVid and Cityscapes datasets, respectively. For Figure 3, Figure 4 and the appendix, we report the mean values and the errors of the means across trials. For Figures 5 and 1, we report the trial with the highest peak accuracy over trials.

For the results shown in Figures 5 and 1, we use dropout with rate after the depth-wise convolutions in the bottleneck residual blocks and an L2 weight decay of in all convolutional layers. For all other results, we do not use dropout and weight decay.

II-C Network evaluation

Throughout this study, we measure the quality of semantic segmentation by calculating the mean intersection-over-union (mIoU; Jaccard, 1912), which is the ratio of the area of overlap and the area of union,

\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},

averaged over all classes, where TP, FP and FN denote the true positives, false positives and false negatives per class, respectively.
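A straightforward per-class implementation of this metric, here with a hypothetical ignore label of 255 for unlabeled pixels, could look like this:

```python
import numpy as np

def mean_iou(y_true, y_pred, n_classes, ignore_label=255):
    """Mean intersection-over-union over classes, ignoring unlabeled pixels."""
    ious = []
    valid = y_true != ignore_label
    for c in range(n_classes):
        t = (y_true == c) & valid
        p = (y_pred == c) & valid
        union = np.logical_or(t, p).sum()
        if union == 0:
            continue  # class absent in both label and prediction
        ious.append(np.logical_and(t, p).sum() / union)
    return float(np.mean(ious))

labels = np.array([0, 0, 1, 1, 255])
preds = np.array([0, 1, 1, 1, 0])
# class 0: intersection 1, union 2 -> 0.5; class 1: intersection 2, union 3 -> 2/3
assert abs(mean_iou(labels, preds, 2) - (0.5 + 2 / 3) / 2) < 1e-9
```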

We consider a network to perform well if it achieves a high mIoU while requiring few MACs. To this end, we calculate the area under the curve of the mIoU (A_t) over the MACs (M_t) with output index t as follows:

\mathrm{AUC} = \sum_{t=1}^{T} \frac{A_t + A_{t-1}}{2} \, (M_t - M_{t-1}),

with M_0 = 0 and A_0 = A_{\mathrm{chance}}, where A_{\mathrm{chance}} denotes the mIoU at chance level for the CamVid dataset. To compensate for different maximum numbers of MACs M_T for different sets of hyperparameters, we normalize as follows: \mathrm{AUC}' = \mathrm{AUC} / M_T.
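Assuming the area is computed with the trapezoidal rule (an assumption; the excerpt does not spell out the integration scheme), the normalized AUC could be implemented as:

```python
def normalized_auc(mious, macs, miou_chance):
    """Trapezoidal area under the mIoU-over-MACs curve, divided by the
    total MACs so that architectures with different maximum MACs are
    comparable.  The curve is anchored at (0, miou_chance)."""
    xs = [0.0] + list(macs)
    ys = [miou_chance] + list(mious)
    area = sum((ys[i] + ys[i - 1]) / 2.0 * (xs[i] - xs[i - 1])
               for i in range(1, len(xs)))
    return area / xs[-1]

# A curve that stays at the chance level normalizes to exactly that level.
assert abs(normalized_auc([0.1, 0.1], [1.0, 2.0], 0.1) - 0.1) < 1e-12
```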

The size of the computational graph is computed by accumulating the memory requirements of all nodes, i.e. network layers, in the network graph. For each layer, the total required memory is the sum of the memory for parameters, input feature maps and output feature maps.
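This accounting can be sketched for convolutional layers as follows; float32 storage and the listed layer shapes are assumptions for illustration:

```python
def conv_layer_memory(h, w, c_in, c_out, k, bytes_per_value=4):
    """Memory footprint of one convolutional node in the graph:
    parameters plus input and output feature maps (float32 by default)."""
    params = c_out * c_in * k * k
    fmap_in = h * w * c_in
    fmap_out = h * w * c_out
    return (params + fmap_in + fmap_out) * bytes_per_value

def graph_size(layers):
    """Total in-computation memory: sum over all unique nodes.
    A weight-shared loop contributes its layers only once, whereas an
    unrolled network repeats them for every iteration."""
    return sum(conv_layer_memory(*layer) for layer in layers)

block = [(45, 60, 64, 64, 3)]          # one conv of an iterative block (made up)
shared = graph_size(block)             # self-loop: counted once
unrolled = graph_size(block * 8)       # independent weights: 8 copies
assert unrolled == 8 * shared
```

The shared/unrolled comparison mirrors the graph-size gap between ItNets with and without weight sharing reported in Section III-C.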

The theoretical latency of a network, if executed fully in parallel, is determined by the length of the longest path from input to output, i.e., the depth of this network (see also Section I).

For both the size of the computational graph and the latency, we only consider convolutional layers like commonly done in literature (e.g., Paszke et al., 2016; Wu et al., 2018; Mehta et al., 2019). This means, we ignore other network layers like normalization, activations, concatenations, additions and spatial resizing, for which we assume that they can be fused with the convolutional layers.

II-D Datasets

The CamVid dataset (Brostow et al., 2008) consists of ( for training, for validation, for testing) annotated images filmed from a moving car. We use the same static pre-processing as in Badrinarayanan et al. (2017) to obtain images and semantic labels of size and normalize the pixel values to the interval by dividing all pixel values by . For online data augmentation during training, we horizontally flip the pairs of images and labels at random and use random crops of size .

The Cityscapes dataset (Cordts et al., 2016) consists of ( for training, for validation) annotated images and we use the validation set for testing. We resize the original images and semantic labels to and divide all pixel values by and subtract to obtain values in the interval . For online data augmentation during training, we horizontally flip the pairs of images and labels at random.

For both datasets, pixels with class labels not marked for training are ignored in the cost function.

II-E Batch normalization

In case of independent parameters between iterations of the iterative block, i.e., if the parameters in Figure 1a are independent across iterations, batch normalization improves the network training. However, in case of weight sharing, also sharing the parameters of batch normalization between iterations significantly worsens the network training. Since not sharing batch-normalization parameters would violate our idea of re-using the identical building block again and again, we place batch-normalization layers between the iterations of the iterative block (see Figure 2b). For comparability, we use the same setup also for networks without weight sharing, although the average validation mIoU is slightly decreased compared to networks that instead use batch normalization after each convolution (Figure 3a).
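The placement can be sketched as follows: the expensive weights are shared across iterations, while a fresh pair of lightweight batch-norm parameters is applied before each iteration. A matrix multiplication stands in for the convolutions of the iterative block; all names are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the batch dimension, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def shared_block_with_per_iter_bn(x, w_shared, bn_params):
    """The heavy weights (here: one matrix) are shared across iterations;
    the lightweight batch-norm parameters are not, and are applied
    between iterations so the block itself stays identical."""
    for gamma, beta in bn_params:
        x = batch_norm(x, gamma, beta)
        x = np.maximum(x @ w_shared, 0.0)  # shared "convolution" + ReLU
    return x

rng = np.random.default_rng(2)
w = rng.normal(size=(5, 5))
bn = [(np.ones(5), np.zeros(5)) for _ in range(3)]  # 3 iterations, 3 BN sets
out = shared_block_with_per_iter_bn(rng.normal(size=(8, 5)), w, bn)
assert out.shape == (8, 5)
```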

III Results

Fig. 3: a) Comparison of mIoU over MACs between networks with batch normalization applied between iterative blocks (purple curve) and applied after each convolution (green curve) on the test set of the CamVid dataset. Note that, for both cases, batch normalization is always applied after each convolution in the data block. b) Peak mIoU over MACs for networks of different widths on the validation set of the CamVid dataset. Each data point represents the peak mIoU over all network outputs for one specific network width and the corresponding MACs for this output. For the following studies, we selected the network highlighted in red, which achieves, out of trials, a maximum mIoU of and for independent and shared weights, respectively.

In order to obtain accurate networks with low computational costs, we first search for the best set of architectural hyperparameters (Section III-A). Since we are also interested in the trade-off between weight sharing and the network size, we also consider networks without weight sharing in this search. Then, we investigate the impact of intermediate losses on the network performance to find the best set of weight factors for the loss function (Section III-B). Finally, for the found set of hyperparameters and weight factors, we show results of ItNets on the CamVid and Cityscapes datasets (Section III-C).

III-A Search for architectural hyperparameters

To find accurate networks with low computational costs, we perform a grid search over the hyperparameters of our network model as described in Figures 2 and 1. For each set of hyperparameters, we choose the largest possible number of channels that results in a network with less than billion MACs for the last output of the network on single samples of the CamVid dataset.

Since we are interested in network architectures with a high mean intersection-over-union (mIoU) and a low number of MACs, we sort the architectures by their area under the curve (see Equation 2) for both networks with independent and with shared weights. Networks whose peak performance falls below a threshold relative to the highest peak performance over all sets of hyperparameters are considered unsuitable for applications and are discarded. For comparability between networks with independent and shared weights, we choose the architecture with the lowest average rank across the two classes of networks and obtain the following set of hyperparameters: iterations, scales and bottleneck residual blocks (for the results of the grid search, see Figures 7 and 6 in the appendix). Note that many iterations result in a high re-usage of the iterative block, a low latency and a small computational graph.
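The selection by lowest average rank can be sketched as follows; the hyperparameter keys, AUC scores and the peak-mIoU threshold are made-up values for illustration:

```python
def average_rank_selection(scores_shared, scores_indep, peaks, peak_threshold):
    """Pick the hyperparameter set with the lowest average rank of the
    area-under-curve score across the shared- and independent-weight
    runs, after discarding sets whose peak mIoU falls below a threshold."""
    candidates = [k for k in scores_shared if peaks[k] >= peak_threshold]

    def ranks(scores):
        order = sorted(candidates, key=lambda k: scores[k], reverse=True)
        return {k: r for r, k in enumerate(order)}

    r_s, r_i = ranks(scores_shared), ranks(scores_indep)
    return min(candidates, key=lambda k: r_s[k] + r_i[k])

# Toy search over three hyperparameter sets (all numbers are made up).
auc_shared = {"A": 0.62, "B": 0.58, "C": 0.65}
auc_indep = {"A": 0.72, "B": 0.70, "C": 0.55}
peak = {"A": 0.64, "B": 0.66, "C": 0.63}
# "A" ranks 1st + 2nd, which beats "B" (2nd + 3rd) and "C" (1st + 3rd).
assert average_rank_selection(auc_shared, auc_indep, peak, 0.6) == "A"
```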

III-B Improvement of training by using multiple outputs

Fig. 4: Ablation studies for multi-output training of ItNets with (top row) and without (bottom row) weight sharing on the validation set of the CamVid dataset: The mIoU over MACs is shown for the same architectural hyperparameters, but for different sets of . a-c and e-g) The weight factors are set to for the shown outputs and to otherwise. Each different set of weight factors is highlighted with a different color. d and h) The weight factors are modulated over the network outputs. We increase the weight factor from 1 to 16 (incr) or keep the weights uniform over all outputs.

We study the impact of intermediate network outputs on the network performance by training the same network architecture with different sets of weight factors of the loss function (see Equation 1). In summary, later network outputs benefit from the additional loss applied to earlier outputs. This is supported by ablation studies, in which we trained networks with single, only late, or thinned out outputs. The following results and conclusions are similar for both network types with and without weight sharing and, hence, we discuss them jointly.

The training of single outputs is worse than jointly training all outputs, except for only training the first output (see Figure 4a and e). However, a network with only the first output has no practical relevance due to its tiny network size and, consequently, low mIoU.

The intermediate activations of the early layers are optimized for the potentially conflicting tasks of providing the basis for high accuracy at early outputs and, at the same time, top accuracy at late outputs. However, our experiments suggest that especially the optimization of the early outputs improves the performance of late outputs (see Figure 4b and f). Our observation that early losses do not conflict with late losses allows us to keep these early losses, resulting in networks with very low latency (see Figure 5d). In addition, keeping the first output to allow for low latency, but decreasing the density of outputs, removes unnecessary constraints and slightly improves the mIoU (see Figure 4c and g). The removed outputs are likely not required in applications, since the improvement of the mIoU between consecutive outputs is rather small.

So far, we removed sets of outputs from the loss function, but kept the weight factors identical for all remaining outputs. When linearly increasing the weight factors over the outputs, the mIoU jointly decreases for all outputs (compare different colors in Figure 4d and h). This unexpected decrease in the mIoU for late outputs may be attributed to the effect discussed above, that earlier outputs are crucial for late performance.

III-C Comparing the performance of ItNets to the literature

Fig. 5: Network performance in terms of mIoU over the number of parameters (a), the number of MACs (b), the size of the computational graph (c), and the latency from input to output of the network (d) on the test set of the CamVid dataset. Note that the size of the ItNets without weight sharing, i.e., with independent weights between iterations (green), is reduced to match the accuracy of ItNets with weight sharing (purple; for details, see the red data points in Figure 3b). Following Figure 4c and g, we thin out the losses. ENet (Paszke et al., 2016) and CGNet (Wu et al., 2018) denote efficient reference networks. Note that the MACs of these reference data points are normalized to the actual input size that was used to obtain the reported mIoU values.

Iterative networks require a computational graph of less than half the size compared to networks that are state-of-the-art in terms of the number of MACs (compare 276MB for the ItNet to 639MB for ESPNetv2 to achieve a mIoU of for the Cityscapes dataset, as shown in Figure 1b). In addition, the multiple intermediate network outputs allow for anytime prediction and a lower latency, which is further reduced by the shallow network design (Figure 1c). However, the number of MACs for this ItNet is approximately ten times larger than for the ESPNetv2 (16.7 compared to 1.7 billion MACs in Figure 1d).

By design, ItNets and ESPNetv2 have fundamentally different computational graphs, and it is unclear which differences have the biggest effect on the network performance. Since introducing a loop into the computational graph, as done for ItNets, has the biggest effect on the size of the computational graph, we compare networks with and without such loops exemplarily on the CamVid dataset. To exclude effects of other differences between these two types of networks, we compare ItNets for which the weights are shared between iterations to ItNets with identical hyperparameters but independent weights. Compared to ItNets with shared weights, ItNets with independent weights have substantially more free parameters, which, consequently, results in better mIoUs. To compensate for this increase in the peak mIoUs, we decrease the width of the latter by a factor of ( without weight sharing to with weight sharing; see the data points highlighted in red in Figure 3b). Then, compared to ItNets with shared weights, ItNets with independent weights have more parameters (compare 424 to 182 thousand parameters to achieve a mIoU of approximately for the CamVid test set) and approximately three times fewer MACs (compare 2.0 to 6.2 billion MACs, as highlighted in red in Figure 3b). The size of the computational graph is dominated by the number of unique nodes and their size. By untying the weights between the iterations of the iterative block, the loop in the network graph has to be unrolled over its iterations, which substantially increases the number of unique nodes and, hence, the size of the computational graph (compare 83MB to 384MB in Figure 5c). The latency is identical for all networks, since we use the same architectural hyperparameters (Figure 5d).

IV Discussion

In this study, we introduced a new class of network models, called iterative neural networks, that have loops in their computational graphs to reduce their memory footprints and, hence, to enable their execution on massively-parallel hardware systems. We investigated the trade-off between the size of the computational graphs and the prediction accuracies, and showed that ItNets achieve state-of-the-art performance for this trade-off (see Figure 1b). However, the reduction of the computational graph comes with an increase in the number of MACs (see Figure 1d), which is expected to be smaller than the increase in throughput offered by novel, massively-parallel hardware accelerators, especially if the computational graphs fit into the in-computation memory of these hardware systems (see also Section I). For example, the Graphcore Benchmark (2020) reports a -fold increase in throughput compared to GPUs for recurrent networks of small size, and Esser et al. (2016) report more than frames per second and watt for computer vision tasks. During the training of ItNets, we observe that additional and especially early network outputs improve the performance of late outputs, which was also observed by Zhou et al. (2019) on pre-trained networks for age estimation. In principle, the presented methods could also be applied to different tasks, like object classification and detection, for which we expect similar observations. However, these datasets are usually provided at a lower spatial resolution, which reduces the challenge and the need to reduce the size of these already rather small computational graphs.

Since massively-parallel hardware systems are not easily accessible at the moment, we cannot provide real benchmark data, but will in the following exemplarily discuss the potential benefits of our network models if executed on such hardware systems. To this end, we describe the execution of our model on the Graphcore IPU (2020) (for details, see Jia et al., 2019), since this system seems to be the most promising in terms of commercialization and availability. If the full network graph fits into the in-computation memory of the Graphcore IPU (2020) (900MB for the Colossus MK2 GC200 IPU processor), the input data could be pipelined through this graph in a streaming fashion without the need to reconfigure the hardware. Then, by design, independent nodes in this graph are executed fully in parallel (see also Fischer et al., 2018). This results in substantially lower latencies during network inference compared to systems that do not execute network graphs fully in parallel, like CPUs and GPUs (an up to 25 times lower latency is reported by the Graphcore Benchmark, 2020). In contrast to this fully parallel execution, for CPUs and GPUs, the workload of the network graph has to be tiled and the arithmetic units are continuously reconfigured for each tile. ItNets significantly reduce the footprint of their graphs by introducing loops and, hence, improve the network’s accuracy for the same footprint. Although the in-computation memory is huge for the Colossus MK2 GC200 IPU, which is optimized for data centers, we expect embedded versions of similar systems to have a considerably smaller in-computation memory, which will then require networks with tiny computational graphs. Other massively-parallel hardware designs increase the density of their in-computation memory, i.e., synapses per chip area, by aggressive quantization (Merolla et al., 2014) or mixed-signal implementation (Schemmel et al., 2010; Yao et al., 2020), which imposes additional challenges for the development of network models.

For the Cityscapes dataset, training our ItNet already requires two Nvidia V100 GPUs with 32GB memory each if the standard training procedure is not modified. Consequently, exploring large-scale ItNets using backpropagation on GPUs is difficult. However, since ItNets are optimized for their execution on massively-parallel hardware systems, this demonstrates the fundamentally different operating principles of these systems and GPUs, and highlights the need to think beyond workloads that are tailored for GPUs.

In parallel to this study, a novel training technique was introduced by Bai et al. (2020) that significantly improves the scaling of vision networks composed of iteratively executed building blocks with weight sharing. Instead of using backpropagation through the unrolled network, they optimize the equilibrium point of a fixed-point process described by the iterative execution of the building block. They achieve task accuracies comparable to state-of-the-art networks and, to this end, require approximately the same number of parameters as the baselines. In line with our study, this results in comparably large building blocks that are executed many times and, consequently, require significantly more MACs than the baseline networks. However, the training method of Bai et al. (2020) cannot be applied to networks for which the weights are allowed to change between iterations and, hence, the effect of weight sharing cannot be studied in isolation. In addition, their method includes the raw image (8MB in float32 for Cityscapes) in the state of the iterative block, which prevents shrinking their networks to the regime of few billion MACs as shown by our method, for which the full state, i.e., the output of the iterative building block, has a size of only 4.8MB.

The concept of reusing convolutional building blocks has previously also been applied to predict semantic maps (Pinheiro and Collobert, 2014) and to generate super-resolution images (Cheng et al., 2018). However, instead of reusing a building block that processes all spatial scales in parallel, as in the study at hand, these works reuse the same building block for different spatial scales, which does not allow for anytime predictions.
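The difference matters for anytime prediction: when a single block processes all scales in parallel and is reused over iterations, every pass can emit a full prediction. A minimal sketch, with all names and shapes hypothetical:

```python
import numpy as np

def shared_block(z, W):
    # hypothetical iterative building block with shared weights W
    return np.maximum(0.0, W @ z)  # ReLU(W z)

def readout(z, V):
    # hypothetical output head that maps the state to a prediction
    return V @ z

def anytime_forward(x, W, V, n_iter=4):
    """Run the same block n_iter times and collect one prediction per
    iteration -- each entry of `outputs` is a valid anytime result."""
    z = x
    outputs = []
    for _ in range(n_iter):
        z = shared_block(z, W)
        outputs.append(readout(z, V))
    return outputs

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6)) * 0.3
V = rng.normal(size=(3, 6))
x = rng.normal(size=6)
preds = anytime_forward(x, W, V)
# one prediction is available after every iteration
```

A scale-recursive scheme, in contrast, only produces its output once the final scale has been processed.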

As an alternative to implementing the recurrence by convolutions as presented in this study, Valipour et al. (2017) and Wang et al. (2019) connect intermediate feature maps of convolutional encoder-decoder networks by recurrent units, like LSTMs, to predict semantic maps for the frames of videos or to consecutively improve the prediction of semantic maps for single images, respectively. Ballas et al. (2016) apply a similar technique to encoder networks for video tasks. However, this alternative approach has a more heterogeneous computational workload and usually results in bigger network graphs with lower accuracy (see Figure 1b).

As an extension of this study, partly relaxing the constraint of weight sharing may allow for a significant reduction in the required size of the iterative block. To this end, the ratio between shared and free weights may be chosen close to, but smaller than, one. This would require reconfiguring the network graph on hardware, which may be costly or not possible at all, depending on the hardware system. An alternative approach to relax this constraint would be to replace the weights of the building block by a function that modulates the weights over the iterations of the building block. Another interesting open question for further studies is the root cause of the observation that early outputs improve the performance of late outputs. This effect may be attributed to some kind of knowledge distillation and/or to the shortcut for the gradients from network output to input.
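The weight-modulation idea could, for instance, look as follows (a hypothetical sketch, not an implemented or evaluated variant): a single shared weight tensor stays fixed on the hardware, while a small per-iteration factor modulates it, avoiding both full weight copies and graph reconfiguration.

```python
import numpy as np

def modulated_block(z, W_shared, gamma_t):
    # hypothetical modulation: a per-iteration scalar (or per-channel
    # vector) gamma_t rescales the shared weights instead of storing
    # a full free weight copy for every iteration
    return np.tanh((gamma_t * W_shared) @ z)

def run(x, W_shared, gammas):
    z = x
    for g in gammas:
        z = modulated_block(z, W_shared, g)
    return z

rng = np.random.default_rng(2)
W = rng.normal(size=(5, 5)) * 0.4
gammas = [0.8, 1.0, 1.2]   # one (assumed, learnable) factor per iteration
x = rng.normal(size=5)
y = run(x, W, gammas)
# memory cost: one weight matrix plus len(gammas) scalars,
# instead of len(gammas) full weight matrices
```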


We thank Anna Khoreva for fruitful discussions and Robin Hutmacher for technical support.

Appendix A

Fig. 6: Grid search over the following hyperparameters on the validation set of the CamVid dataset: number of layers, number of blocks, and number of residual blocks. Parameters are not shared between iterations of the iterative block. The best set of hyperparameters (for details about the selection, see Section III-A) is depicted in red.
Fig. 7: Like Figure 6, but the parameters are shared between iterations of the iterative block.


  1. todo: add visualization
  2. todo: check position of figure
  3. todo: motivate this number
  4. todo: add details?
  5. todo: add comment that learning curves visually checked


  1. V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
  2. S. Bai, V. Koltun, and J. Z. Kolter. Multiscale deep equilibrium models. In Advances in Neural Information Processing Systems, 2020.
  3. N. Ballas, L. Yao, C. Pal, and A. C. Courville. Delving deeper into convolutional networks for learning video representations. In International Conference on Learning Representations (ICLR), 2016.
  4. G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, pages 44–57, 2008.
  5. P. Chao, C.-Y. Kao, Y.-S. Ruan, C.-H. Huang, and Y.-L. Lin. HarDNet: A low memory traffic network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  6. Y. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 367–379, 2016.
  7. X. Cheng, X. Li, J. Yang, and Y. Tai. SESR: Single image super resolution with recursive squeeze and excitation networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 147–152, 2018.
  8. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  9. S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.
  10. V. Fischer, J. Koehler, and T. Pfeil. The streaming rollout of deep networks - towards fully model-parallel execution. In Advances in Neural Information Processing Systems 31, pages 4039–4050. 2018.
  11. Graphcore Benchmark. Graphcore benchmarks, 2020.
  12. Graphcore IPU. Colossus mk2 ipu processor, 2020.
  13. K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In International Conference on Learning Representations (ICLR), 2017.
  14. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  15. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
  16. G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), 2018.
  17. P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
  18. Z. Jia, B. Tillman, M. Maggioni, and D. P. Scarpazza. Dissecting the Graphcore IPU architecture via microbenchmarking. CoRR, abs/1912.03413, 2019.
  19. T.-W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  20. Q. Liao and T. A. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. CoRR, abs/1604.03640, 2016.
  21. S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  22. P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
  23. A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. CoRR, abs/1606.02147, 2016.
  24. P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In Proceedings of Machine Learning Research, volume 32, pages 82–90, Beijing, China, 2014. PMLR.
  25. A. Ruiz and J. Verbeek. Adaptative inference cost with convolutional neural mixture models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  26. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  27. J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1947–1950, 2010.
  28. V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
  29. S. Valipour, M. Siam, M. Jagersand, and N. Ray. Recurrent fully convolutional networks for video segmentation. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 29–36, 2017.
  30. W. Wang, K. Yu, J. Hugonot, P. Fua, and M. Salzmann. Recurrent U-Net for resource-constrained segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  31. T. Wu, S. Tang, R. Zhang, and Y. Zhang. CGNet: A light-weight context guided network for semantic segmentation. CoRR, abs/1811.08201, 2018.
  32. P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian. Fully hardware-implemented memristor convolutional neural network. Nature, 577(7792):641–646, 2020.
  33. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  34. Y. Zhou, Y. Bai, S. S. Bhattacharyya, and H. Huttunen. Elastic neural networks for classification. In 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pages 251–255, 2019.