EIGEN: Ecologically-Inspired GENetic Approach for Neural Network Structure Searching
Designing the structure of neural networks is considered one of the most challenging tasks in deep learning. Recently, a few approaches have been proposed to automatically search for the optimal structure of neural networks; however, they suffer from either prohibitive computation cost (e.g., 256 hours on 250 GPUs in real2017large ()) or unsatisfactory performance compared to that of hand-crafted neural networks. In this paper, we propose an Ecologically-Inspired GENetic approach for neural network structure search (EIGEN) that includes succession, mimicry and gene duplication. Specifically, we first use primary succession to rapidly evolve a community of poorly initialized neural network structures into a more diverse community, followed by a secondary succession stage for fine-grained searching based on the networks from the primary succession. Extinction is applied in both stages to reduce computation cost. Mimicry is employed during the entire evolution process to help the inferior networks imitate the behavior of a superior network, and gene duplication is utilized to duplicate the learned blocks of novel structures, both of which help to find better network structures. Extensive experimental results show that our proposed approach can achieve similar or better performance compared to existing genetic approaches with dramatically reduced computation cost. For example, the network discovered by our approach on the CIFAR-100 dataset achieves 78.1% test accuracy in 120 GPU hours, compared to 77.0% test accuracy in more than 65,536 GPU hours in real2017large ().
Jian Ren (Rutgers University), Zhe Li (The University of Iowa), Jianchao Yang (Toutiao AI Lab), Ning Xu (Snap Research), Tianbao Yang (The University of Iowa), David J. Foran (Rutgers University)
Preprint. Work in progress.
1 Introduction
Deep Convolutional Neural Networks (CNNs) have achieved tremendous success on many computer vision tasks krizhevsky2012imagenet (); girshick2015fast (); simonyan2014very (). However, a hand-crafted network structure tailored to one task may perform poorly on another task. Therefore, it usually requires an extensive amount of human effort to design an appropriate network structure for a certain task.
Recently, there have been emerging research works real2017large (); xie2017genetic (); zoph2016neural (); zoph2017learning (); zhong2017practical (); baker2016designing () on automatically searching neural network structures for image recognition tasks. In this paper, we focus on optimizing evolution-based algorithms xie2017genetic (); miikkulainen2017evolving (); liu2017hierarchical (); stanley2002evolving () for network structure search, as the existing works suffer from one of the following issues: prohibitive computation cost or unsatisfactory performance compared with hand-crafted network structures. In real2017large (), searching neural network structures costs more than 256 hours on 250 GPUs, which is not affordable for general users. In xie2017genetic (), the network structure finally learned by the genetic approach achieves unsatisfactory test accuracy on CIFAR-10; better performance (92.9%) could only be obtained after fine-tuning certain parameters and modifying some structures of the discovered network. In li2018evoltuion (), the authors first aim to achieve better performance with reduced computation cost through an aggressive selection strategy in the genetic approach, together with more mutation operations to restore the diversity reduced by the aggressive selection. In their work, the computation cost is reduced dramatically from more than 65,536 GPU hours (GPUH) to a few hundred GPUH. However, their approach still sacrifices performance: for example, 90.5% test accuracy compared to 94.6% test accuracy from real2017large () on the CIFAR-10 dataset.
Along this research line, in this paper we study the genetic approach to achieve better test performance than real2017large (), or performance competitive with hand-crafted network structures he2016deep (), under limited computation cost li2018evoltuion () and without the pre-designed architectures introduced by human expertise liu2017hierarchical (). Inspired by primary and secondary succession in ecological systems sahney2008recovery (), we enforce a poorly initialized community of neural network structures to rapidly evolve into a community containing network structures with dramatically improved performance. After the first stage of primary succession, we perform a fine-grained search for better networks in the community during the secondary succession stage. During the evolution process, we explore mimicry greeney2012feeding () to help the inferior networks learn the behavior of superior networks and obtain better performance. In addition, we also introduce gene duplication to further utilize the novel blocks of layers that appear in the discovered network structures.
The contribution of this paper is three-fold:
We incorporate primary and secondary succession from ecological systems into our genetic framework to search for optimal network structures under limited computation cost.
We explore mimicry to help search for better networks during the evolution, as well as gene duplication to utilize the discovered beneficial structures.
Extensive experimental results show that the obtained neural network structures achieve better performance compared with existing genetic approaches and competitive performance with hand-crafted network structures. Moreover, the computation cost required by our approach is dramatically reduced compared with other genetic approaches.
2 Related Work
In this section, we briefly review some related works on genetic-based approaches for searching optimal neural network structures.
Recently, a few studies real2017large (); xie2017genetic () have emerged targeting this challenging task. Since in this paper we focus on achieving better performance with limited computation cost through a genetic approach, we highlight the differences between our work and several existing studies from two aspects: reducing computation cost and improving performance.
In real2017large (), the authors encode each individual network structure as a graph into DNA and define several different mutation operations, such as IDENTITY and RESET-WEIGHTS, to apply to each parent network to generate children networks. The essential characteristic of this genetic approach is that it spends a large amount of computation to search the giant space of neural network structures. Specifically, the entire searching procedure costs more than 256 hours with 250 GPUs to achieve 94.6% test accuracy with the learned network structure on the CIFAR-10 dataset, which is not affordable for general users.
Due to the prohibitive computation cost, in xie2017genetic () the authors impose restrictions on the neural network search space. In their work, they only learn one block of network structure and stack the learned block a certain number of times in a designed routine to obtain the final network. Through this mechanism, the computation cost is reduced to several hundred GPU hours; however, the test performance of the obtained network structure is not satisfactory: only after fine-tuning parameters and modifying certain structures of the learned network does the test accuracy reach 92.9% on CIFAR-10.
In li2018evoltuion (), the authors aim to achieve better performance from the automatically learned network structure with limited computation cost in the course of evolution, a goal that had not been brought up previously. Different from restricting the search space to reduce computation cost xie2017genetic (); dufourq2017eden (), they propose an aggressive selection strategy to eliminate weak neural network structures at an early stage. However, this aggressive selection strategy may decrease the diversity that genetic approaches rely on to improve performance. In order to remedy this issue, they define more mutation operations, such as add_fully_connected and add_pooling. Finally, they reduce the computation cost dramatically to 72 GPUH on CIFAR-10. However, there is still a performance loss in their approach: for example, on the CIFAR-10 dataset, the test accuracy of the found network is about 4% lower than that of real2017large ().
To conclude this section, we highlight that our work is in the line of li2018evoltuion (). Inspired by ecological concepts, we propose the Ecologically-Inspired GENetic approach (EIGEN) for neural network structure search by evolving the networks through rapid succession, and we explore mimicry and gene duplication along the evolution.
3 Approach
Our genetic approach for searching the optimal neural network structures follows the standard procedure: i) initialize the population in the first generation with simple network structures; ii) evaluate the fitness score of each neural network structure (the fitness score is any measurement defined by users for their purpose, such as validation accuracy, number of parameters in the network, number of FLOPs in the inference stage, and so on); iii) apply a selection strategy to decide the surviving network structures based on the fitness scores; iv) apply mutation operations on the surviving parent network structures to create the children networks for the next generation. The last three steps are repeated until the fitness score converges. Note that in our genetic approach, an individual is denoted by an acyclic graph with each node representing a certain layer, such as a convolution, pooling or concatenation layer. A children network can be generated from a parent network through a mutation procedure. A population includes a fixed number of networks in each generation, which is set to 10 in our experiments. For details of using a genetic approach to search for neural network structures, we refer the readers to li2018evoltuion (). In the following, we discuss our approach, which applies the ecological concepts of succession, extinction, mimicry and gene duplication to the genetic approach for an accelerated search of neural network structures.
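The four-step procedure above can be sketched as a short loop. The names below (`mutate`, `train_and_score`) are illustrative placeholders, not the paper's implementation, and the aggressive one-survivor selection follows li2018evoltuion ():

```python
def evolve(initial_population, mutate, train_and_score, num_generations, population_size=10):
    """Sketch of the standard genetic search loop.

    `train_and_score` evaluates the fitness of an individual (e.g., its
    validation accuracy); `mutate` produces a child from a parent.
    """
    population = initial_population
    best, best_score = None, None
    for _ in range(num_generations):
        # ii) evaluate the fitness score of each individual
        scored = [(train_and_score(ind), ind) for ind in population]
        # iii) aggressive selection: only the best individual survives
        best_score, best = max(scored, key=lambda pair: pair[0])
        # iv) mutate the survivor to refill the population for the next generation
        population = [mutate(best) for _ in range(population_size)]
    return best, best_score
```

With a toy representation (an individual is an integer, fitness is its value, mutation increments it), each generation's best strictly improves.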
3.1 Evolution under Rapid Succession
Our inspiration comes from the fact that in an ecological system, the community is dominated by diversified fast-growing individuals during the primary succession, while in the secondary succession the community is dominated by more competitive individuals sahney2008recovery (). Therefore, instead of focusing on evolving a single network real2017large (), we treat all the networks during each generation of the evolution process as a community and the individual networks in each generation as a population.
With this treatment, we propose a two-stage rapid succession for accelerated evolution, analogous to ecological succession. The proposed rapid succession includes a primary succession, which starts with a community consisting of a group of poorly initialized individuals containing only one global pooling layer, and a secondary succession, which starts after the primary succession. In the primary succession, a large search space is explored to allow the community to grow at a fast speed, and a relatively small search space is used in the secondary succession for fine-grained search.
To depict the exploration space at each generation, we define the mutation step-size as the maximum number of mutation iterations between a parent and its children. The actual mutation step for each child is uniformly chosen between 1 and the mutation step-size. In the primary succession, in order to have diversified fast-growing individuals, a large mutation step-size is used in each generation so that the mutated children can be significantly different from each other and from their parent. Since we only go through the training procedure after finishing all mutation steps, the computation cost for each generation does not increase with a larger step-size. In the secondary succession, we adopt a relatively small mutation step-size to perform a fine-grained search for network structures.
Each mutation step is randomly selected from the following nine operations:
INSERT-CONVOLUTION: A convolutional layer is randomly inserted into the network. The inserted convolutional layer has a default setting with kernel size 3×3, number of channels 32, and stride 1. The convolutional layer is followed by batch normalization ioffe2015batch () and Rectified Linear Units krizhevsky2012imagenet ().
INSERT-CONCATENATION: A concatenation layer is randomly inserted into the network where two bottom layers share the same size of feature maps.
INSERT-POOLING: A pooling layer is randomly inserted into the network with kernel size 2×2 and stride 2.
REMOVE-CONVOLUTION, REMOVE-CONCATENATION, REMOVE-POOLING: The three operations randomly remove a convolutional layer, a concatenation layer and a pooling layer, respectively.
ALTER-NUMBER-OF-CHANNELS, ALTER-STRIDE, ALTER-FILTER-SIZE: The three operations modify the hyper-parameters of a convolutional layer. The number of channels, the stride, and the filter size are each randomly selected from a predefined list of values.
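As a sketch of how a child is produced from a parent, the step count is drawn uniformly and each step samples one of the nine operations. The list-based DNA and the stubbed operation bodies below are simplifications for illustration; the real operations edit the acyclic layer graph:

```python
import random

# The nine mutation operations listed above; applying one is stubbed out here.
OPERATIONS = [
    "INSERT-CONVOLUTION", "INSERT-CONCATENATION", "INSERT-POOLING",
    "REMOVE-CONVOLUTION", "REMOVE-CONCATENATION", "REMOVE-POOLING",
    "ALTER-NUMBER-OF-CHANNELS", "ALTER-STRIDE", "ALTER-FILTER-SIZE",
]

def mutate_child(dna, step_size, rng=random):
    """Mutate a parent DNA into one child.

    The actual number of mutation steps is drawn uniformly from
    [1, step_size]; each step randomly picks one of the nine operations.
    """
    steps = rng.randint(1, step_size)
    applied = []
    for _ in range(steps):
        op = rng.choice(OPERATIONS)
        applied.append(op)
        # ... apply `op` to the DNA graph here (omitted in this sketch) ...
    return dna, applied
```

Because training only happens after all steps are applied, a step-size of 100 costs no more training time than a step-size of 1.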
During the succession, we employ the idea from previous work li2018evoltuion () that only the best individual in the previous generation survives. However, instead of evaluating the population in each generation after all the training iterations, it is more efficient to extinguish the individuals that will likely fail at early iterations, especially during the primary succession, where the diversity of the population leads to varied performance. Based on the assumption that a better network should have a better fitness score at earlier training stages, we design our extinction algorithm as follows.
To facilitate the presentation, we denote N as the population size in each generation, T_1 and T_2 as the landmark iterations, v_{g,i}^{T} as the fitness score (validation accuracy in our work) of the i-th network in the g-th generation after T training iterations, and θ_1 and θ_2 as the thresholds to eliminate the weak networks at the T_1-th and T_2-th iterations in each generation. In the g-th generation, we have the fitness scores {v_{g,i}^{T_1}}_{i=1}^{N} and {v_{g,j}^{T_2}}_{j=1}^{N'} for the networks after training for T_1 and T_2 iterations, respectively. Note that N' can be less than N since the weak networks are eliminated after T_1 iterations. The thresholds θ_1 and θ_2 are updated in the g-th generation as
θ_1 = S({v_{g,i}^{T_1}}_{i=1}^{N})_{⌈λ_1 N⌉},   θ_2 = S({v_{g,j}^{T_2}}_{j=1}^{N'})_{⌈λ_2 N'⌉},
where S is a sorting operator in decreasing order on a list of values, the subscripts ⌈λ_1 N⌉ and ⌈λ_2 N'⌉ pick the ⌈λ_1 N⌉-th and ⌈λ_2 N'⌉-th values after the sorting operation, and λ_1, λ_2 are hyper-parameters.
For each generation, we perform the following steps until the fitness converges: (i) train the population for T_1 iterations and extinguish the individuals with fitness less than θ_1; (ii) train the remaining population for T_2 iterations and extinguish the individuals with fitness less than θ_2; (iii) the surviving individuals are further trained until convergence, and the best one is chosen as the parent for the next generation. The details of the extinction algorithm are described in Algorithm 1.
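The threshold update and extinction step can be sketched as follows. The `frac` parameter stands in for the hyper-parameters λ_1 and λ_2, and keeping the threshold non-decreasing across generations is an assumption of this sketch:

```python
import math

def update_threshold(fitness_scores, frac, prev_threshold=0.0):
    """Set the threshold to the ceil(frac * n)-th highest fitness score,
    never letting it drop below the previous generation's value
    (the non-decreasing behavior is an assumption of this sketch)."""
    ranked = sorted(fitness_scores, reverse=True)  # decreasing order
    idx = min(len(ranked), math.ceil(frac * len(ranked))) - 1
    return max(prev_threshold, ranked[idx])

def extinguish(population, fitness_scores, threshold):
    """Keep only the individuals whose early-iteration fitness reaches the threshold."""
    return [ind for ind, s in zip(population, fitness_scores) if s >= threshold]
```

Running `update_threshold` at T_1 and again on the survivors at T_2 implements the two-stage extinction, so weak networks never consume a full training budget.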
3.2 Mimicry
In biological evolution, mimicry is a phenomenon in which inferior species learn behaviors from superior species. For example, moth caterpillars learn to imitate the body movements of a snake so that they can scare off predators that are usually prey items for snakes greeney2012feeding (). By analogy, mimicry suggests that, while designing neural network structures during the evolution, we can force inferior networks to adopt (learn) the behaviors of superior networks, such as statistics of feature maps romero2014fitnets (); yim2017gift () or logits hinton2015distilling (); bucilu2006model ().
In our approach, we force the inferior networks to learn the behavior of the superior network by generating a similar distribution of logits during the evolution procedure, since learning the distribution of logits from the superior network gives more freedom to the inferior network structure than learning statistics of feature maps. This is in fact the knowledge distillation proposed in hinton2015distilling (). More specifically, for a given training image x with one-hot class label y, we define z_t as the logits predicted by the pre-trained superior network and z_s as the logits predicted by the inferior network. We use the following loss function L to encode the prediction discrepancy between the inferior and superior networks, as well as the difference between the inferior network's prediction and the ground-truth annotation, during the evolution:
L = (1 − α) H(y, σ(z_s)) + α H(σ(z_t / T), σ(z_s / T)),
where σ(·) is the softmax function, H(·,·) is the cross-entropy of two input probability vectors such that H(p, q) = −Σ_i p_i log q_i, α is the ratio controlling the two loss terms, and T is a temperature hyper-parameter. We adopt the terms from knowledge distillation hinton2015distilling (), where the student network and teacher network represent the inferior network and superior network, respectively. We fix T as a constant.
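A minimal NumPy sketch of this loss, assuming the form above (single example, batch handling and gradients omitted):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T (numerically stabilized)."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i) for probability vectors p, q."""
    return float(-np.sum(p * np.log(q)))

def mimicry_loss(student_logits, teacher_logits, one_hot_label, alpha=0.9, T=5.0):
    """Weighted sum of the hard (ground-truth) term and the soft
    (teacher-matching) term; alpha balances the two terms."""
    hard = cross_entropy(one_hot_label, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    return (1.0 - alpha) * hard + alpha * soft
```

With α = 0 the loss reduces to the plain cross-entropy against the label; with α close to 1 the student mostly imitates the teacher's softened logits.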
3.3 Gene Duplication
During the primary succession, the rapid change of network architectures leads to novel beneficial structures encoded in the DNA real2017large () that do not appear in previous hand-designed networks. To further leverage the automatically discovered structures, we propose an additional mutation operation named duplication to simulate the process of gene duplication, since gene duplication has been shown to be an important mechanism for obtaining new genes and can lead to evolutionary innovation zhang2003evolution (). In our implementation, we treat the encoded DNA as a combination of blocks, where each block includes the layers with the same feature-map size. As shown in Figure 1, the optimal structure discovered from the rapid succession can mutate into different networks by combining the blocks in several ways through duplication. We duplicate an entire block instead of a single layer because the block contains the beneficial structures discovered automatically, while simple layer copying is already an operation in the succession.
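A sketch of the duplication operation on a block-structured DNA; the list-of-blocks encoding here is a simplification for illustration:

```python
def duplicate_block(dna, block_index):
    """Insert a copy of the block at `block_index` right after itself.

    Each block is a list of layers sharing the same feature-map size;
    the whole block is copied so its discovered internal structure survives.
    """
    child = [list(block) for block in dna]           # leave the parent intact
    child.insert(block_index + 1, list(dna[block_index]))
    return child
```

Applying this operation at different block indices yields the different block combinations illustrated in Figure 1.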
4 Experimental Results and Analysis
In this section, we report the experimental results of using EIGEN for the structure search of neural networks. We first describe the experiment setup, including dataset preprocessing and training strategy, in Subsection 4.1 and show the results. Following that, we analyze the experimental results of rapid succession and mimicry in Subsection 4.2.
4.1 Experiment Setup and Results
The experiments are conducted on two benchmark datasets, CIFAR-10 krizhevsky2009learning () and CIFAR-100 krizhevsky2009learning (). The CIFAR-10 dataset contains 10 classes with 50,000 training images and 10,000 test images. The images have the size of 32×32. Data augmentation is applied through Global Contrast Normalization (GCN) and ZCA whitening goodfellow2013maxout (). The CIFAR-100 dataset is similar to CIFAR-10 except that it includes 100 classes.
Training Strategy and Details.
During the training process, we use mini-batch Stochastic Gradient Descent (SGD) to train each individual network with batch size 128, momentum 0.9, and weight decay 0.0005. Each network is trained for a maximum of 25,000 iterations. The initial learning rate is 0.1 and is set to 0.01 and 0.001 at 15,000 and 20,000 iterations, respectively. The parameters in Algorithm 1 are set to , , , , and . For the mimicry, we set the temperature T to 5 and the ratio α to 0.9 in Eq. 3. The teacher network is an ensemble of four Wide-DenseNet () huang2017densely () models. The fitness score is the validation accuracy on the validation set. The primary succession ends when the fitness score saturates, and then the secondary succession starts. The entire evolution procedure is terminated when the fitness score converges. Training is conducted with TensorFlow abadi2016tensorflow ().
We directly adopt the hyper-parameters developed on the CIFAR-10 dataset for the CIFAR-100 dataset. The experiments are run on a machine with one Intel Xeon E5-2680 v4 2.40GHz CPU and one Nvidia Tesla P100 GPU.
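The step learning-rate schedule described above amounts to:

```python
def learning_rate(iteration):
    """Step schedule from the training details: 0.1 initially,
    0.01 from iteration 15,000, 0.001 from iteration 20,000
    (training runs for at most 25,000 iterations)."""
    if iteration < 15000:
        return 0.1
    if iteration < 20000:
        return 0.01
    return 0.001
```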
Table 1: Test accuracy and computation cost on CIFAR-10 and CIFAR-100.
| Model | Params | CIFAR-10 | CIFAR-100 | Comp Cost |
| --- | --- | --- | --- | --- |
| MAXOUT goodfellow2013maxout () | - | 90.7% | 61.4% | - |
| NETWORK IN NETWORK lin2013network () | - | 91.2% | 64.3% | - |
| ALL-CNN springenberg2014striving () | 1.3 M | 92.8% | 66.3% | - |
| DEEPLY SUPERVISED lee2015deeply () | - | 92.0% | 65.4% | - |
| HIGHWAY srivastava2015highway () | 2.3 M | 92.3% | 67.6% | - |
| RESNET he2016deep () | 1.7 M | 93.4% | 72.8% | - |
| DENSENET () huang2017densely () | 25.6 M | 96.5% | 82.8% | - |
| Teacher Network | 17.2 M | 96.0% | 82.0% | - |
| EDEN dufourq2017eden () | 0.2 M | 74.5% | - | - |
| Genetic CNN xie2017genetic () | - | 92.9% | 71.0% | 408 GPUH |
| LS-Evolution real2017large () | 5.4 M | 94.6% | - | 65,536 GPUH |
| LS-Evolution real2017large () | 40.4 M | - | 77.0% | > 65,536 GPUH |
| AG-Evolution li2018evoltuion () | - | 90.5% | - | 72 GPUH |
| AG-Evolution li2018evoltuion () | - | - | 66.9% | 136 GPUH |
| EIGEN | 2.6 M | 94.6% | - | 48 GPUH |
| EIGEN | 11.8 M | - | 78.1% | 120 GPUH |
The experimental results shown in Table 1 demonstrate that the proposed approach is competitive with hand-designed networks. Compared with the evolution-based algorithms, we achieve the best results with the minimum computation cost. For example, we obtain results similar to real2017large () on the two benchmark datasets, but our approach is about 1,000 times faster. Also, the networks found by our approach on the two datasets have more than two times fewer parameters than those of LS-Evolution real2017large (). More details of the network architectures are reported in the supplementary materials.
Effect of Primary Succession.
We show the results of different mutation step-sizes for the primary succession in Figure 2. The solid lines show the average test accuracy of the best networks among five experiments, and the shaded area represents the standard deviation in each generation among the five experiments. A larger mutation step-size, such as 100, leads to faster convergence of the fitness score compared with a smaller mutation step-size, as shown in Figure 2(a). However, no further improvement is observed when using a too-large mutation step-size, such as 200, as shown in Figure 2(b).
Effect of Secondary Succession.
We further analyze the effect of the secondary succession during the evolution process. After the primary succession, we utilize the secondary succession to search for networks within a smaller search space. We adopt a small mutation step-size for the purpose of fine-grained searching based on the surviving network from the previous generation. Figure 3 shows example evolutions on CIFAR-10 and CIFAR-100 during the rapid succession. We use mutation step-sizes of 100 and 10 for the primary succession and secondary succession, respectively. The blue line in the plots shows the performance of the best individual in each generation. The gray dots show the number of parameters of the population in each generation, and the red line indicates where the primary succession ends. The accuracy on the two datasets for the secondary succession, shown in Table 3, demonstrates that a small mutation step-size is helpful for searching better architectures in the rapid succession.
Analysis on Mimicry.
In order to analyze the effect of mimicry, we consider the situation where only the primary and secondary succession are applied during the evolution; both duplication and mimicry are disabled. We denote this method as EIGEN w/o mimicry and duplication and compare it with the approach where mimicry is enabled, denoted as EIGEN w/o duplication. The comparison between the two in Table 3 demonstrates the effectiveness of mimicry during the rapid succession.
Effect of Gene Duplication.
After the rapid succession, the duplication operation is applied to leverage the automatically discovered structures. To analyze the effect of gene duplication, we denote the approach without duplication as EIGEN w/o duplication and show the results on CIFAR-10 and CIFAR-100 in Table 4. Although duplication introduces more parameters into the networks, the beneficial structures contained in the blocks actually contribute to the network performance through duplication.
Furthermore, we analyze the effect of mimicry on the network after gene duplication. We denote the best network found by our approach as the EIGEN network. Training the network from scratch with mimicry, denoted EIGEN network w/ mimicry, improves accuracy by 1.3% and 4.2% on CIFAR-10 and CIFAR-100, respectively, compared with the network trained from scratch without mimicry, denoted EIGEN network w/o mimicry.
5 Conclusion
In this paper, we propose EIGEN for searching neural network architectures automatically with rapid succession, mimicry and gene duplication. The rapid succession and mimicry evolve a community of networks toward an optimal status under limited computational resources. With the help of gene duplication, the performance of the found network can be boosted without sacrificing any computation cost. The experimental results show that the proposed approach can achieve competitive results on CIFAR-10 and CIFAR-100 with dramatically reduced computation cost compared with other genetic-based algorithms. Future work includes further exploration to improve the efficiency of the search space.
-  Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Lingxi Xie and Alan Yuille. Genetic cnn. arXiv preprint arXiv:1703.01513, 2017.
-  Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
-  Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552, 2017.
-  Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
-  Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
-  Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
-  Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
-  Zhe Li, Xuehan Xiong, Zhou Ren, Ning Zhang, Xiaoyu Wang, and Tianbao Yang. An aggressive genetic programming approach for searching neural network structure under computational constraints. arXiv preprint arXiv:1806.00851, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Sarda Sahney and Michael J Benton. Recovery from the most profound mass extinction of all time. Proceedings of the Royal Society of London B: Biological Sciences, 275(1636):759–765, 2008.
-  HF Greeney, LA Dyer, and AM Smilanich. Feeding by lepidopteran larvae is dangerous: A review of caterpillars’ chemical, physiological, morphological, and behavioral defenses against natural enemies. Invertebrate Survival Journal, 9(1), 2012.
-  Emmanuel Dufourq and Bruce A Bassett. Eden: Evolutionary deep networks for efficient machine learning. arXiv preprint arXiv:1709.09161, 2017.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
-  Jianzhi Zhang. Evolution by gene duplication: an update. Trends in ecology & evolution, 18(6):292–298, 2003.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
-  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
-  Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
-  Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.