# Multi-scale Convolution Aggregation and Stochastic Feature Reuse for DenseNets

###### Abstract

Recently, Convolutional Neural Networks (CNNs) have achieved great success in numerous vision tasks. In particular, DenseNets have demonstrated that feature reuse via dense skip connections can effectively alleviate the difficulty of training very deep networks and that reusing features generated by the initial layers in all subsequent layers has a strong impact on performance. To feed even richer information into the network, a novel adaptive Multi-scale Convolution Aggregation module is presented in this paper. Composed of layers for multi-scale convolutions, trainable cross-scale aggregation, maxout, and concatenation, this module is highly non-linear and can boost the accuracy of DenseNet while using far fewer parameters. In addition, due to its high model complexity, a network with extremely dense feature reuse is prone to overfitting. To address this problem, a regularization method named Stochastic Feature Reuse is also presented. By randomly dropping a set of feature maps to be reused for each mini-batch during the training phase, this regularization method reduces training costs and prevents co-adaptation. Experimental results on the CIFAR-10, CIFAR-100 and SVHN benchmarks demonstrate the effectiveness of the proposed methods.

## 1 Introduction

Recently, deep learning has become a dominant field of machine learning for various vision tasks, such as recognition and classification. In particular, Convolutional Neural Networks (CNNs) achieved unprecedented success with AlexNet [14], which spurred a new line of research on constructing better-performing CNNs [28]. Increasingly deep architectures are being created and trained based on the observation that the deeper the network is, the higher-level the features it is able to extract. AlexNet has 5 convolutional layers [14], VGG Nets [23] have 16 or 19 layers, GoogLeNet [27] has 22, and ResNets [6] reach over 1000 layers by employing residual connections.

As networks became very deep, two common issues emerged: exploding and vanishing gradients. To deal with these problems, several creative architectures, such as Highway networks [24], Deeply-Supervised Nets [16] and ResNets [6], have been designed. The key ideas are passing information from one layer to another via shortcuts, or adding “companion” objective functions at each hidden layer. Stochastic depth [10] trains an ensemble of ResNets with different depths by randomly dropping a set of layers during the training phase. FractalNets [15] repeatedly apply a simple expansion rule to generate an ultra-deep network containing interacting subpaths of different lengths. Building on the above work, DenseNet [9] was introduced, which connects each layer to every subsequent layer. As a result, a given layer in DenseNet takes all feature maps extracted by preceding layers as input. This new connection pattern allows DenseNets to obtain significant improvements over the state of the art on several object recognition benchmark tasks.

On another front, the inception series [12, 23, 26, 28] has been shown to achieve remarkable performance at very low memory cost. The inception module is composed of convolutions with different kernel sizes (1×1, 3×3, 5×5) and a 3×3 max pooling, and it concatenates the results of the convolutions and the pooling. This design strengthens the regularization and scale invariance of extracted features. Recently, feature pyramid networks (FPN) [19] and deep layer aggregation [34] have been proposed, which aim at exploiting the inherent multi-scale, pyramidal hierarchy of CNNs. Features at different scale levels are merged together to achieve higher accuracy with fewer parameters.

Inspired by the benefits of multi-scale convolutions [19, 34] and feature fusion for training deep networks, we design a novel module, referred to as Multi-scale Convolution Aggregation (MCA), to work with DenseNets. As shown in Fig. 1, the MCA module consists of layers for multi-scale convolutions, cross-scale aggregation, maxout, and concatenation. We observe that DenseNets using the MCA module need substantially fewer parameters and achieve lower classification error than those using other multi-scale designs. The reduction in parameters results from fusing pyramidal convolutions instead of simply concatenating them. The increase in accuracy is attributed to the following factors: 1) stronger scale invariance, because the multi-scale convolutions use four kernels with different receptive field sizes; 2) given a specific task, the network automatically chooses the most suitable scales via four trainable gating units to adaptively make use of multi-scale information; 3) the use of two maxout activations stimulates competition among neural units of different receptive fields and enhances the learning ability of the network; 4) higher non-linearity; and 5) compared with traditional concatenation in GoogLeNets, our module dramatically reduces the number of parameters while preserving sufficient multi-scale information through aggregation and maxout functions.

In addition to various methods of architecture design, difficulties in training deep networks motivated research on optimization and initialization techniques. These include dropout [8], maxout activation [4], batch normalization [11, 12], group normalization [31], Xavier initialization [2], He initialization [5], etc., which have been applied in a wide range of networks as essential components.

To reduce the possibility of overfitting in DenseNets and to further boost the generalization of networks, we also develop a regularization method named Stochastic Feature Reuse (SFR). Similar to stochastic depth [10], SFR contains gates for dropping selected feature maps delivered from preceding layers; see Fig. 4. During training, each layer randomly reuses a different subset of preceding feature maps for each mini-batch, so that each mini-batch is trained under a sub-network with a unique connection scheme. This approach addresses the overfitting problem of DenseNets by substantially reducing the number of parameters used in each training step, while improving the performance of DenseNets.

We evaluate the impacts of both MCA module and SFR on three widely used benchmark datasets: CIFAR-10 [13], CIFAR-100 [13] and Street View House Number (SVHN) [21]. The comparisons show that our model can achieve comparable test accuracy with relatively lower computation costs and outperform the state-of-the-art performance of DenseNets.

## 2 Related Work

Deeper feed-forward neural networks tend to yield larger gains in performance on various vision tasks. This has led to the recent resurgence of exploration in sophisticated CNN architectures [9] with hugely increased classification accuracy on ImageNet [1], e.g. from AlexNet [14] to GoogLeNets [27], and ResNets [6] to DenseNets [9].

Comparisons of layerwise performance, analysis [20, 32] and visualization of feature maps [33, 37] show that deeper networks are able to extract more semantic and higher-level representations. On the other hand, very deep networks are more difficult to train, especially when using a first-order optimizer with purely random initialization and traditional activation functions (tanh, sigmoid, etc.), which often cause vanishing gradients and internal covariate shift. To overcome these problems, a lot of research has been carried out [15].

To find high-performance architectures, a series of independent methods have been explored. One of the most prominent is to increase the network width. GoogLeNets [12, 23, 26, 28] use the inception module to build deep networks; this component concatenates feature maps produced by a set of filters with different receptive field sizes. Other well-known structures, such as Resnet in Resnet [29] and Wide residual networks [35], also demonstrate that simply increasing the number of filters in each layer can dramatically improve test accuracy. More recently, FractalNets [15] obtained excellent results using a wider block structure. In addition to increasing the depth and width of networks, a growing number of research works focus on aggregation or fusion. Deep Layer Aggregation [34] provides a novel approach to fuse features vertically across layers, which substantially improves recognition accuracy at a lower computational cost.

Inspired by these findings, we design a novel MCA module, which first broadens the width of the initial convolution layer of DenseNets through multi-scale convolutions, then fuses the filters using cross-scale aggregation parameterised by trainable weights. The idea of multi-scale convolutions also follows a neuroscience model [22] suggesting that the raw image should be processed at different scales and then joined together for next layers, so that the deeper layers can become robust to scale shift [27].

Another breakthrough in deep learning is the introduction of skip connections, which addresses the challenges of training deep networks. Highway Networks [24] efficiently train deep networks by introducing the bypassing path, which is the primary factor that eases the training pain. ResNets [6] further enhance this new connection pattern through substituting bypassing paths with residual connections, and achieve record-breaking performance on ImageNet [1]. Recently, DenseNets [9] densely connect all preceding layers with each layer to reuse all preceding feature maps and outperform the state-of-the-art results on several competitive benchmarks. Moreover, stochastic depth [10] was proposed as a successful approach to train an over 1000-layer ResNet through randomly dropping a few layers during training. Analogous to dropout [8], this method demonstrates that stochastically dropping is an extremely powerful technique to regularize networks. Our SFR regularizer was motivated by the observations on Dropout, Stochastic Depth and DenseNets. However, instead of dropping layers as in Stochastic Depth, our regularizer drops features by randomly blocking a set of bypassing paths.

## 3 Methodology

DenseNets. Both the MCA module and the SFR regularizer proposed in this paper are based on DenseNets [9]. Assume that a single input image is represented by $x_0$ and is passed through a DenseNet that has $L$ layers. Each layer $\ell$ comprises a composite function $H_\ell(\cdot)$ that includes one Batch Normalization layer [12], one ReLU layer [3], and one convolution layer. DenseNets introduce a new connectivity scheme: the output of each layer is directly connected to all subsequent layers. Consequently, the $\ell$-th layer receives the outputs of all preceding layers. That is:

$$x_\ell = H_\ell\left([x_0, x_1, \ldots, x_{\ell-1}]\right) \qquad (1)$$

where $x_\ell$ is the output of layer $\ell$ and $[x_0, x_1, \ldots, x_{\ell-1}]$ is the concatenation of feature maps produced by layers $0, 1, 2, \ldots, \ell-1$. The total number of channels in an $L$-layer DenseNet, $C$, can be approximately computed as:

$$C \approx k_0 + k \cdot \frac{L(L+1)}{2} \qquad (2)$$

where $k_0$ represents the number of input channels into the first dense block and $k$ is the growth rate of the DenseNet.
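As a rough numerical illustration of how dense connectivity makes the network wider, the sketch below computes the number of input channels seen by a given layer (the standard DenseNet bookkeeping from [9]) and a width measure for the whole network. The width formula `k0 + k * L * (L + 1) / 2` and the choice `k0 = 2 * k` are assumptions made here for illustration; they were chosen because they reproduce the widths listed in Table 3 for a depth of 53, and the function names are hypothetical.

```python
def layer_input_channels(k0: int, k: int, layer: int) -> int:
    """Input channels of the `layer`-th layer (1-based) of a dense block.

    Each earlier layer contributes k feature maps on top of the k0
    channels entering the block.
    """
    return k0 + k * (layer - 1)


def approx_width(k0: int, k: int, num_layers: int) -> int:
    """Assumed width measure: k0 + k * L * (L + 1) / 2 (illustrative only)."""
    return k0 + k * num_layers * (num_layers + 1) // 2


# Assumption: k0 = 2 * k and L = 53, which matches the widths in Table 3.
widths = {k: approx_width(2 * k, k, 53) for k in (12, 24, 40)}
print(widths)
```

The width grows linearly in the growth rate and quadratically in the depth, which is why wide and deep DenseNets become expensive quickly.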

### 3.1 Multi-scale Convolution Aggregation Module

By concatenating different groups of convolutions, the Inception module [23] and its variant [17] have shown that multi-scale convolution filters can boost the performance of deep networks. Inspired by these findings, we design a novel MCA module to enhance the representational and learning capacity of DenseNets. The new module consists of layers for multi-scale convolutions, cross-scale aggregation, maxout, and concatenation. It is placed in front of the DenseNet as its initial layers so that abundant features extracted at different scales can be fed into the network.

Multi-scale Convolutions. Given the input image , the multi-scale convolutions layer computes the following:

(3) |

where are the results of convolutions with , , , and kernels respectively. represents parameters of different kernels and denotes the concatenation operator.

Feeding the concatenation of four groups of convolutions, , into DenseNets directly helps to improve the performance of the network since the network bandwidth is increased. When evaluated on CIFAR-10 dataset, a standard DenseNet with depth and growth rate achieves , whereas the DenseNet with as input achieves a test accuracy of 94.31. However, since and the number of initial channels for DenseNet, in Equ. (2), equals to the length of , the number of parameters is increased from 4.2 millions to 5.7 millions.
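The multi-scale convolution and concatenation step can be sketched in plain Python as follows. This is a minimal single-channel illustration, not the authors' implementation: the kernel sizes `(1, 3, 5, 7)`, the averaging kernels, and the helper names are all assumptions introduced here, and a real layer would learn its kernel weights.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    ks = len(kernel)
    pad = ks // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(ks):
                for dj in range(ks):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the image
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def box_kernel(size):
    """Hypothetical stand-in for a learned kernel: a size x size averaging filter."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def multi_scale_conv(image, kernels):
    """Convolve with every kernel and concatenate the maps along the channel axis."""
    return [conv2d_same(image, kernel) for kernel in kernels]


image = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
sizes = (1, 3, 5, 7)  # assumed kernel sizes, for illustration only
features = multi_scale_conv(image, [box_kernel(s) for s in sizes])
print(len(features), len(features[0]), len(features[0][0]))
```

Concatenating the four results quadruples the number of channels fed into the first dense block, which is the source of the parameter growth described above.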

Cross-scale Aggregation. In order to reduce the complexity of the model while maintaining a high test accuracy and maximizing effective information flow of the network, an adaptive aggregation function is applied. Here we aggregate convolution results under four kernels into two branches that represent fine and coarse scales, respectively; see Fig. 2. Since trainable gating weights are introduced, the unit is similar to a small-scale voting system. For each mini-batch, proper weights are automatically assigned to different scales via our voting mechanism. This helps to preserve most contributive multi-scale information and suppress flows with lower importance. Specifically, the cross-scale aggregation layer performs the following operation:

(4) |

where represents pixelwise summation aggregation. are learnable gating weights for convolution results at different scales. Their values indicate the importance of respective scales. In practice, we also found that trainable aggregation works much better than equal-weighted fusion, since the voting system in the former approach makes the module more adaptive on a variety of datasets. As shown in Fig. 6, the finally converged weights on different datasets vary widely, which indicates that different datasets favor the contributions from different scales. Fig. 3 visually compares the results obtained through the two versions of the aggregation.

Compared to the Inception module, which simply concatenates different groups of convolutions, the aggregation layer we use can significantly reduce the number of parameters. On the CIFAR-10 dataset, the number of parameters is reduced from 5.7 million to 4.2 million in a DenseNet with depth = 40 and growth rate .
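To make the difference between equal-weighted fusion and gated aggregation concrete, the sketch below performs a pixelwise weighted summation of same-sized feature maps. It is only an illustration under assumptions: which scales are paired in each branch, the example weights, and the function name `aggregate` are hypothetical, and in the actual module the gating weights are trained rather than fixed.

```python
def aggregate(feature_maps, weights):
    """Pixelwise weighted sum of same-sized 2D feature maps."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(wt * fm[i][j] for wt, fm in zip(weights, feature_maps)) for j in range(w)]
        for i in range(h)
    ]


# Two hypothetical convolution results that belong to the same branch.
c1 = [[1.0, 2.0], [3.0, 4.0]]
c2 = [[10.0, 20.0], [30.0, 40.0]]

equal = aggregate([c1, c2], [1.0, 1.0])    # equal-weighted fusion
gated = aggregate([c1, c2], [0.75, 0.25])  # gating weights (fixed here, trainable in practice)
print(equal)
print(gated)
```

Because the result has the same shape as a single input map, the following layers see fewer channels than they would after concatenation, and each gating weight adds only one scalar parameter.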

Maxout. Previous work has shown that: 1) maxout exploits the model averaging behavior as the approximation is more accurate; 2) the backward flow of maxout can avoid pitfalls such as failing to use a large set of filters [4]; and 3) grouping is important in deep networks [31]. Hence, to better regularize our fusion results, two maxout operations are independently performed after the cross-scale aggregation layer, one for the two fine-scale channels and the other for the two coarse-scale channels. That is, we have the final output of MCA module :

(5) |

With maxout layer introduced, the whole MCA module can be viewed as a highly non-linear transformation between original input and the first dense block of DenseNets. It includes four gating units parametrized by controlling the flow of multi-scale information.
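The two maxout operations and the final concatenation can be sketched as below. Maxout is taken here as the elementwise maximum over a group of two feature maps, following [4]; the input values and the names `fine_a`, `fine_b`, `coarse_a` and `coarse_b` are hypothetical placeholders for the aggregated channels.

```python
def maxout(map_a, map_b):
    """Elementwise maximum of two same-sized feature maps."""
    return [
        [max(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Hypothetical aggregated channels: two fine-scale and two coarse-scale maps.
fine_a = [[1.0, 5.0], [2.0, 0.0]]
fine_b = [[3.0, 4.0], [1.0, 7.0]]
coarse_a = [[0.5, 0.5], [9.0, 2.0]]
coarse_b = [[0.1, 0.8], [3.0, 6.0]]

fine = maxout(fine_a, fine_b)        # competition inside the fine-scale branch
coarse = maxout(coarse_a, coarse_b)  # competition inside the coarse-scale branch
mca_output = [fine, coarse]          # concatenation of the two branches
print(mca_output)
```

Each output pixel comes from whichever channel responds most strongly at that location, which is the competition between units referred to in the text.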

Backward Propagation. The process of gradients back-propagation is the same as the traditional back-propagation. Here, we present the derivation formula in terms of weights of multi-scale convolutions; see Equ. (6). represent the outputs of multi-scale convolutions and are kernel weights. We define the maxout function as and denotes its first-order derivative. The input image is and are bias vectors.

(6) |

where is the loss function of the whole network and are the weight and bias of the first layer in the first dense block. is the sensitivity of layer.

### 3.2 Stochastic Feature Reuse

Dropout [8], Drop-connect [30] and Maxout [4] provide excellent regularization methods through modifying interactions among neural units or connections between different layers in order to break co-adaptation. These techniques have been supported by subsequent research and applied in a wide range of network architectures, such as ResNets [6] and FractalNets [15]. Recent stochastic depth [10] and drop-path [15] successfully extend dropout and make impressive progress in vision tasks.

Motivated by these structures, we propose “Stochastic Feature Reuse” (SFR) as an effective regularizer in DenseNets to promote the generalization of networks and to overcome overfitting especially when the growth rate is high. Fig. 4 illustrates the model of SFR. For each mini-batch, a new mask tensor obeying Bernoulli distribution is randomly generated for each layer and the input of layer is modified as follows:

(7) |

During training, when a set of skip connections is blocked, there is no need to perform forward and backward computations through them. Hence, these dropped features are not reused by the current layer. Since a large amount of computation is saved, SFR can speed up the convergence of the network. At test time, all features are reused in order to make use of the full-width network [10].
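A minimal sketch of the gating logic is given below. It makes several assumptions that are not stated in the text: the mask is drawn per preceding layer (one Bernoulli draw per skip connection), the output of the immediately preceding layer is always kept so that a layer never receives an empty input, and the function names and the example values of `k0` and `k` are hypothetical.

```python
import random


def sfr_select(num_preceding, drop_prob, training, rng):
    """Return the indices of preceding outputs that the current layer reuses."""
    if not training:
        # At test time every preceding feature map is reused.
        return list(range(num_preceding))
    kept = [i for i in range(num_preceding - 1) if rng.random() >= drop_prob]
    kept.append(num_preceding - 1)  # assumption: never drop the most recent output
    return kept


def reused_channels(kept, k0, k):
    """Channels reaching the layer: k0 for the block input (index 0), k per layer."""
    return sum(k0 if i == 0 else k for i in kept)


rng = random.Random(0)
train_kept = sfr_select(6, 0.5, True, rng)   # a new mask would be drawn per mini-batch
test_kept = sfr_select(6, 0.5, False, rng)
print(train_kept, reused_channels(train_kept, 24, 12))
print(test_kept, reused_channels(test_kept, 24, 12))
```

Since dropped connections are simply absent from the input list, no forward or backward computation is spent on them during that mini-batch.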

As a regularizer, SFR can enhance the performance of DenseNets and deal with the overfitting issue [9] through discouraging co-adaptation. In addition, SFR also implicitly trains an ensemble of DenseNets, which helps to improve the performance. For a -layer DenseNet, there are possible combinations and the final network used at the testing stage can be viewed as the average of these sub-networks.

## 4 Experiments

The presented MCA module and SFR regularizer are evaluated using three widely adopted benchmarks: CIFAR-10 [13], CIFAR-100 [13] and SVHN [21]. The results show that the performance of DenseNets with MCA modules is superior to the original DenseNets and that the SFR regularizer can effectively prevent overfitting.

### 4.1 Implementation and Training Details

In our experiments, we report the test error from the epoch with the lowest validation error, and we use the same construction and training scheme as introduced for DenseNet [9]. When evaluating the MCA module, the DenseNet part has three dense blocks, all of which have an equal number of layers and the same growth rate. When evaluating the SFR regularizer, an additional dense block with SFR is added so that the performance of the original DenseNet is not affected. Each composite function of a dense block uses a 3×3 convolution layer with zero-padding to keep the feature map size fixed. Between two dense blocks, there are bottleneck layers with a compression factor. In this paper, we set the compression factor to 1.0 in the standard DenseNet and to 0.5 in the DenseNet with bottleneck and compression (DenseNet-BC). At the end of the last dense block, a global average pooling layer, followed by a softmax layer, is attached. The sizes of feature maps in the three dense blocks are 32×32, 16×16 and 8×8, respectively.

Similar to the standard DenseNet [9], DenseNets in our experiments are optimized through the first-order SGD optimizer. We train 350 epochs for CIFAR and 40 epochs for SVHN. Initial learning rate is 0.1 and divided by 10 at epochs 150, 225 and 300 for CIFAR and epochs 20 and 30 for SVHN. We also add weight decay (0.0001) term into our loss function and use Nesterov momentum [25] of 0.9 for optimization. Hinton’s Dropout [8] layer with drop probability , Batch Normalization [12] layer and He Initialization of weights [5] are applied as well.
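The step-wise learning rate schedule described above can be written as a small function. The sketch assumes 0-based epoch indices and that the division takes effect from the listed epoch onwards; the names `MILESTONES` and `learning_rate` are hypothetical.

```python
# Epochs at which the learning rate is divided by 10, as stated in the text.
MILESTONES = {"cifar": (150, 225, 300), "svhn": (20, 30)}


def learning_rate(epoch, dataset, base_lr=0.1):
    """Learning rate for a given epoch (assumed 0-based) under the step schedule."""
    drops = sum(1 for milestone in MILESTONES[dataset] if epoch >= milestone)
    return base_lr / (10 ** drops)


for epoch in (0, 150, 225, 300):
    print("cifar", epoch, learning_rate(epoch, "cifar"))
for epoch in (0, 20, 30):
    print("svhn", epoch, learning_rate(epoch, "svhn"))
```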

Model | Depth | Params. | C10 (%) | C100 (%) | SVHN (%) |
---|---|---|---|---|---|
Stochastic Pooling [36] | - | - | 15.13 | 42.51 | 2.80 |
Maxout Networks [4] | - | - | 11.68 | 38.57 | 2.47 |
Network in Network [18] | - | - | 10.41 | 35.68 | 2.35 |
Deeply Supervised Net [16] | - | - | 9.69 | | 1.92 |
Competitive Multi-scale [17] | - | 4.48M | 6.87 | 27.56 | 1.76 |
Highway Network [24] | - | - | - | | |
Fractal Network [15] | 21 | 38.6M | 10.18 | 35.34 | 2.01 |
FractalNet with Drop-path [15] | 21 | 38.6M | 7.33 | 28.20 | 1.87 |
ResNet [6] | 110 | 1.7M | - | - | |
Stochastic Depth [10] | 110 | 1.7M | 11.66 | 37.80 | 1.75 |
ResNet (pre-activation) [7] | 164 | 1.7M | 11.26 | 35.58 | - |
ResNet (pre-activation) [7] | 1001 | 10.2M | 10.56 | 33.47 | - |
DenseNet (k=12) [9] | 40 | 1.0M | 7.00 | 27.55 | 1.79 |
DenseNet (k=24) [9] | 100 | 27.2M | 5.83 | 23.42 | 1.59 |
DenseNet (k=24) [9] | 53 | 7.8M | 6.45 | 24.32 | 1.78 |
DenseNet with SFR (k=24) | 53 | 7.8M | 6.08 | 23.82 | 1.66 |
DenseNet-BC (k=12) [9] | 100 | 0.8M | 5.92 | 24.15 | 1.76 |
DenseNet-BC with MCA | 100 | 0.8M | 5.41 | 24.07 | - |
DenseNet with MCA | 40 | 1.0M | 6.44 | 27.44 | 1.77 |
DenseNet with MCA | 40 | 4.2M | 5.38 | 23.78 | 1.66 |
DenseNet with MCA | 40 | 11.6M | 5.76 | 22.65 | 1.61 |

### 4.2 Datasets

The CIFAR-10 dataset [13] consists of 60,000 natural color images (50,000 for training and 10,000 for testing) at 32×32 resolution. Objects from ten classes (e.g. vehicles and animals) are centered in these images, and every class has the same number of training and test images. The CIFAR-100 dataset extends the number of classes to 100, but each class consists of only 600 images. Due to having more classes and fewer samples per class, classification on CIFAR-100 is considered more challenging. The Street View House Number (SVHN) dataset [21] is also a well-known benchmark in computer vision; it consists of color images of digits 0 to 9 at 32×32 resolution. There are 73,257 training images, 26,032 test images, and 531,131 additional training images.

In our experiments, we apply the same normalization methods to input images as the original DenseNet. For the CIFAR datasets, we subtract the mean values and divide by the standard deviations, whereas for SVHN images, the pixel values are divided by 255. We do not use any data augmentation in the experiments, and focus only on comparing our approaches with other network models on the original datasets.
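The two normalization schemes can be sketched as follows. The example operates on a flat list of pixel values for one channel; whether the statistics are computed per channel over the whole training set is an assumption here, and the function names are hypothetical.

```python
def standardize(pixels):
    """CIFAR-style normalization: subtract the mean, divide by the standard deviation."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = (sum((p - mean) ** 2 for p in pixels) / n) ** 0.5
    return [(p - mean) / std for p in pixels]


def scale_to_unit(pixels):
    """SVHN-style normalization: divide raw pixel values by 255."""
    return [p / 255.0 for p in pixels]


print(standardize([0, 255]))
print(scale_to_unit([0, 51, 255]))
```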

### 4.3 Results and Discussion

We train our networks with different depths (40, 53, 100) and growth rates () and compare our approach with other well-known models on CIFAR-10, CIFAR-100, SVHN; see Table 1.

#### Multi-scale Convolution Aggregation.

To better evaluate our novel module, we train different patterns of aggregation on CIFAR-10 and test the best model on CIFAR-10, CIFAR-100 and SVHN. The performance of our structure with different settings on the three benchmarks is shown at the bottom of Table 1. With relatively few parameters (4.2M), it obtains the lowest classification error rate on CIFAR-10 (5.38%) and CIFAR-100 (23.78%), and the second-best result on SVHN (1.66%). In the case of , depth = 40, our model gets impressive results on CIFAR-100 (22.65%) and SVHN (1.61%). This demonstrates that our MCA module has a much higher representational capacity and is able to preserve abundant information from the multi-scale convolutions. This is crucial for preventing overfitting and promoting generalization ability.

Fig. 5(right) compares different aggregation patterns for fusing multi-scale convolution information (, depth = 40). The aggregation parameterized by gating weights achieves the best performance with only four parameters added. Its success may be attributed to the following factors:

Factor 1: Aggregating different scales with trainable weights is more flexible and representative than aggregating with hand-crafted weights. During the process of SGD, the weights of different kernel sizes are treated independently and adaptively. Since pixels at different distances from the central point should have different importance, this strategy can preserve richer multi-scale information (texture, edges, corners, etc.) while using much fewer parameters than simply concatenating them.

More importantly, for vision tasks of different complexity, the weights of the gating units may follow similar trends during training, but often converge to different final values; see Fig. 6. This suggests that the optimal scale for convolutions can differ between datasets. For instance, in CIFAR tasks, the module assigns high weights ( and ) to fine-scale features, whereas less coarse-scale information is delivered to the subsequent DenseNet. On the other hand, for the SVHN dataset, the weight has a much higher relative value than for the other two datasets, whereas the weight is almost 0. This observation suggests that, for simple digit classification tasks, coarse-scale features extracted by convolution are more important than in other, more complicated tasks. To further demonstrate this point, we also run our module on another simple dataset, MNIST, and make a similar observation ( for MNIST vs. for CIFAR-10).

Factor 2: The combination of three dominant joining methods (summation, maxout and concatenation) makes our model highly non-linear and capable of effectively aggregating multi-scale representations. Each joining method has its own advantages. The combination of different approaches is also studied in [26], where it is shown to perform better. By using two maxout operations, the units of the aggregation layer compete strongly, which is beneficial for training and optimizing deep networks. The two branches of the fine-scale and coarse-scale aggregations enhance the scale invariance property.

#### Stochastic Feature Reuse.

We evaluate SFR on the same three datasets and compare it with the original DenseNet with depth $L = 53$ and growth rate $k = 24$. The additional dense block with SFR is placed at the front or at the end of the original DenseNet; see Table 2 for details. The comparison shows that placing the additional dense block with SFR at the end of the DenseNet yields lower error rates on all three datasets. We attribute the accuracy improvement to the fact that SFR randomly generates a new sub-network with a different propagation path for each mini-batch and implicitly trains an ensemble of different networks. This kind of dropout can disrupt the co-adaptation among reused features and prevent overfitting. On the other hand, adding the additional dense block with SFR to the front of the DenseNet actually hurts performance, since the shallow layers then become too narrow to pass sufficient information. In addition, we observe that SFR should be used together with Hinton’s Dropout, without which the accuracy also degrades.

Model | Block index | Error (%) | Dataset |
---|---|---|---|
DenseNet [9] | None | 6.45 | CIFAR-10 |
SFR | end | 6.08 | CIFAR-10 |
SFR | front | 8.99 | CIFAR-10 |
SFR (No Dropout) | | 10.00 | CIFAR-10 |
DenseNet [9] | None | 24.32 | CIFAR-100 |
SFR | end | 23.82 | CIFAR-100 |
SFR | front | 26.54 | CIFAR-100 |
SFR (No Dropout) | | 27.15 | CIFAR-100 |
DenseNet [9] | None | 1.78 | SVHN |
SFR | end | 1.66 | SVHN |
SFR | front | 2.12 | SVHN |
SFR (No Dropout) | | 3.02 | SVHN |

Another observation is that our method is more effective on wider DenseNets, as narrow networks in general do not have serious co-adaptation issues or long training times. DenseNets are mainly widened by using a larger growth rate or by increasing the number of channels of the initial layer. Hence, to illustrate the impact of the growth rate on the performance of SFR, we first evaluate on CIFAR-10 with three growth rates: 12, 24 and 40. Moreover, the case of a wider initial layer (WIL) is also considered. Here we expand the initial convolution layer four times via four-scale convolutions; the training results are shown in Fig. 5(left). Table 3 shows the results under the different cases. The SFR test error increases to 6.32% when the growth rate rises to 40, as a very large bandwidth causes slight overfitting. Using SFR with a high drop probability addresses this issue.

Model | Width | w.o. SFR | w. SFR | Improve |
---|---|---|---|---|
SFR (k=12) | 17196 | 6.93 | 6.80 | 0.13 |
SFR (k=24) | 34392 | 6.45 | 6.08 | 0.37 |
SFR (WIL) | 34536 | 6.09 | 5.76 | 0.33 |
SFR (k=40) | 57320 | 6.53 | 6.32 | 0.21 |

## 5 Conclusions

A novel network module, referred to as Multi-scale Convolution Aggregation, is presented in this paper. It consists of 4 groups of multi-scale convolutions, cross-scale aggregation parametrized by 4 trainable weights, and 2 maxout operations that produce 2 branches of feature maps representing smaller and larger receptive fields, respectively. In our experiments, DenseNets with our new module obtain excellent performance while requiring substantially fewer parameters than those using the traditional inception module. Instead of simple equal-weighted aggregation, our aggregation employs a self-adaptive strategy to control the information flow of convolution filters. It automatically optimizes these weights according to different vision tasks. Trainable aggregation makes effective use of multi-scale convolutions and is the key to reducing parameters, whereas maxout strengthens the competition among units in the fine-scale and coarse-scale branches. The combination of three joining methods (concatenation, summation and maxout) makes the networks highly non-linear.

In addition, a Stochastic Feature Reuse strategy is presented for training deep DenseNets effectively and efficiently. This regularizer samples a new subnet of the basic DenseNet for each mini-batch during training, but reuses all feature maps produced by preceding layers at the test stage. Our method enhances the performance of DenseNets by breaking the co-adaptation among reused features and implicitly training an ensemble of subnets with different widths. Being a simple and easy-to-apply approach, SFR is most useful for wider DenseNets with a larger growth rate and can effectively alleviate the difficulties of training wide networks.

For future work, we would like to explore applications of the MCA module in other prominent deep architectures, as we believe MCA can be beneficial by introducing scale-invariance information without adding feature redundancy. In addition, when evaluating SFR, we empirically use a constant drop probability. It would be interesting and meaningful to explore other configurations of the drop probability in future experiments.

## References

- [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [2] X. Glorot and Y. Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
- [3] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
- [4] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. arXiv preprint arXiv:1302.4389, 2013.
- [5] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification. In Proceedings of IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [7] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
- [8] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580, 2012.
- [9] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 4700–4708, 2017.
- [10] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep Networks with Stochastic Depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
- [11] S. Ioffe. Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models. In Advances in Neural Information Processing Systems, pages 1942–1950, 2017.
- [12] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of International Conference on Machine Learning, pages 448–456, 2015.
- [13] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.
- [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [15] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep Neural Networks without Residuals. arXiv preprint arXiv:1605.07648, 2016.
- [16] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. In Proceedings of Conference on Artificial Intelligence and Statistics, pages 562–570, 2015.
- [17] Z. Liao and G. Carneiro. Competitive Multi-Scale Convolution. arXiv preprint arXiv:1511.05635, 2015.
- [18] M. Lin, Q. Chen, and S. Yan. Network in Network. arXiv preprint arXiv:1312.4400, 2013.
- [19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4, 2017.
- [20] J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- [21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
- [22] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust Object Recognition with Cortex-like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, 2007.
- [23] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, pages 1–14, 2014.
- [24] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway Networks. arXiv preprint arXiv:1505.00387, 2015.
- [25] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
- [26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, volume 4, page 12, 2017.
- [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- [29] S. Targ, D. Almeida, and K. Lyman. ResNet in ResNet: Generalizing Residual Architectures. arXiv preprint arXiv:1603.08029, 2016.
- [30] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of Neural Networks Using DropConnect. In Proceedings of International Conference on Machine Learning, pages 1058–1066, 2013.
- [31] Y. Wu and K. He. Group Normalization. arXiv preprint arXiv:1803.08494, 2018.
- [32] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How Transferable are Features in Deep Neural Networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
- [33] F. Yu, V. Koltun, and T. Funkhouser. Dilated Residual Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, 2017.
- [34] F. Yu, D. Wang, and T. Darrell. Deep Layer Aggregation. arXiv preprint arXiv:1707.06484, 2017.
- [35] S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv preprint arXiv:1605.07146, 2016.
- [36] M. D. Zeiler and R. Fergus. Stochastic Pooling for Regularization of Deep Convolutional Neural Networks. arXiv preprint arXiv:1301.3557, 2013.
- [37] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision, pages 818–833, 2014.