Learning Sparse Deep Feedforward Networks via Tree Skeleton Expansion
Abstract
Despite the popularity of deep learning, structure learning for deep models remains a relatively under-explored area. In contrast, structure learning has been studied extensively for probabilistic graphical models (PGMs). In particular, an efficient algorithm has been developed for learning a class of tree-structured PGMs called hierarchical latent tree models (HLTMs), where there is a layer of observed variables at the bottom and multiple layers of latent variables on top. In this paper, we propose a simple method for learning the structures of feedforward neural networks (FNNs) based on HLTMs. The idea is to expand the connections in the tree skeletons from HLTMs and to use the resulting structures for FNNs. An important characteristic of FNN structures learned this way is that they are sparse. We present extensive empirical results to show that, compared with standard FNNs tuned manually, sparse FNNs learned by our method achieve better or comparable classification performance with far fewer parameters. They are also more interpretable.
1 Introduction
Deep learning has achieved great successes in the past few years [LeCun et al., 2015; Hinton et al., 2012a; Mikolov et al., 2011; Krizhevsky et al., 2012]. More and more researchers are now starting to investigate the possibility of learning structures for deep models instead of constructing them manually [Chen et al., 2017b; Zoph and Le, 2017; Baker et al., 2017; Real et al., 2017]. Structure learning is interesting not only because it can save manual labor, but also because it can yield models that fit data better and hence perform better than manually built ones. In addition, it can lead to models that are sparse and interpretable.
In this paper, we focus on structure learning for standard feedforward neural networks (FNNs). While convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are designed for spatial and sequential data respectively, standard FNNs are used for data that are neither spatial nor sequential. The structures of CNNs and RNNs are relatively more sophisticated than those of FNNs. For example, a neuron at a convolutional layer in a CNN is connected only to neurons in a small receptive field at the level below. The underlying assumption is that neurons in a small spatial region tend to be strongly correlated in their activations. In contrast, a neuron in an FNN is connected to all neurons at the level below. We aim to learn sparse FNN structures where a neuron is connected to only a small number of strongly correlated neurons at the level below.
Our work is built upon hierarchical latent tree analysis (HLTA) [Liu et al., 2014; Chen et al., 2017a], an algorithm for learning tree-structured PGMs where there is a layer of observed variables at the bottom and multiple layers of latent variables on top. HLTA first partitions all the variables into groups such that the variables in each group are strongly correlated and the correlations can be properly modelled using a single latent variable. It then introduces a latent variable for each group. After that, it converts the latent variables into observed variables via data completion and repeats the process to produce a hierarchy.
To learn a sparse FNN structure, we assume data are generated from a PGM with multiple layers of latent variables and we try to approximately recover the structure of the generative model. To do so, we first run HLTA to obtain a tree model and use it as a skeleton. Then we expand it with additional edges to model salient probabilistic dependencies not captured by the skeleton. The result is a PGM structure, which we call a PGM core. To use the PGM core for classification, we further introduce a small number of neurons for each layer of the PGM core, connect them to all the units at that layer, and connect all of them to the output units. This allows features from all layers to contribute to classification directly.
Figure 1 illustrates the result of our method. The PGM core comprises the bottom three layers. The solid connections make up the skeleton and the dashed connections are added during the expansion phase. The neurons at the feature layer and the output units are added at the last step. The neurons at the feature layer can be conceptually divided into two groups: those connected to the top layer of the PGM core and those connected to the other layers. The PGM core, the first group at the feature layer and the output units together form the Backbone of the model, while the second group at the feature layer provides narrow skip-paths from the low layers of the PGM core to the output layer. As the structure is obtained by expanding the connections of a tree skeleton, our model is called the Tree Skeleton Expansion Network (TSENet).
Here is a summary of our contributions:

- We propose a novel method for learning sparse structures for FNNs. The method depends heavily on HLTA; however, HLTA by itself is not an algorithm for FNN structure learning, and it has not previously been tested for that purpose.
- We have conducted extensive experiments to compare TSENets with manually-tuned FNNs.
- We have analyzed the pros and cons of our method with respect to related works, and we have empirically compared our method with a pruning method [Han et al., 2015] for obtaining FNNs with sparse connectivity.
2 Related Works
The primary goal in structure learning is to find a model with optimal or close-to-optimal generalization performance. Brute-force search is not feasible because the search space is large and evaluating each model is costly, as it necessitates model training. Early works in the 1980s and 1990s focused on what we call the micro expansion approach, where one starts with a small network and gradually adds new neurons to it until a stopping criterion is met [Ash, 1989; Bello, 1992; Kwok and Yeung, 1997]. The word “micro” is used here because at each step only one or a few neurons are added. This makes learning large models computationally difficult, as reaching a large model would require many steps and model evaluation is needed at each step. In addition, those early methods typically do not produce the layered structures that are commonly used nowadays. Recently, a macro expansion method [Liu et al., 2017] has been proposed where one starts from scratch and repeatedly adds layers of hidden units until a threshold is met.
Other recent efforts have concentrated on what we call the contraction approach, where one starts with a larger-than-necessary structure and reduces it to the desired size. Contraction can be done either by repeatedly pruning neurons and/or connections [Srinivas and Babu, 2015; Li et al., 2017; Han et al., 2015], or by using regularization to force some of the weights to zero [Wen et al., 2016]. From the perspective of structure learning, the contraction approach is not ideal because it requires a complex model as input, whereas a key motivation for structure learning is to avoid building models manually.
A third approach is to explore the model space stochastically. One way is to place a prior over the space of all possible structures and carry out MCMC sampling to obtain a collection of models with high posterior probabilities [Adams et al., 2010]. Another way is to encode a model structure as a sequence of numbers, use a reinforcement-learning meta-model to explore the space of such sequences, learn a good meta-policy from the sequences explored, and use the policy to generate model structures [Zoph and Le, 2017]. An obvious drawback of such stochastic exploration methods is that they are computationally very expensive.
What we propose in this paper is a skeleton expansion method where we first learn a tree-structured model and then add a certain number of new units and connections to it in one shot. The method has two advantages: First, learning tree models is easier than learning non-tree models. Second, we need to train only one non-tree model, i.e., the final model.
The skeleton expansion idea has been used in [Chen et al., 2017b] to learn structures for restricted Boltzmann machines, which have only one hidden layer. This is the first time that the idea is applied to and tested on multi-layer feedforward networks.
3 Learning Tree Skeleton via HLTA
The first step of our method is to learn a tree-structured probabilistic graphical model $m$ (an example is shown in the left panel of Figure 2). Let $X$ be the set of observed variables at the bottom and $Z$ be the set of unobserved latent variables. Then $m$ defines a joint distribution over all the variables:
$$P(X, Z \mid m) = \prod_{v \in X \cup Z} P(v \mid \mathrm{pa}(v)),$$
where $\mathrm{pa}(v)$ denotes the parent variable of $v$ in $m$. The distribution of $X$ can be computed by marginalizing out the latent variables:
$$P(X \mid m) = \sum_{Z} P(X, Z \mid m).$$
The model parameters, denoted $\theta$, can be trained to maximize the data log-likelihood $\log P(D \mid m, \theta)$ through the EM algorithm [Dempster et al., 1977], where $D$ denotes the data. Stepwise EM, an efficient version of EM similar in spirit to stochastic gradient descent, can be used for parameter estimation in $m$.
Although parameter learning in $m$ is straightforward, the parameterization of $m$ depends heavily on the structure of $m$ (e.g., how the variables are connected and how many latent variables are introduced), which is relatively difficult to learn. We learn the tree structure of $m$ in a layer-wise manner to approximately optimize the BIC score [Schwarz, 1978] of $m$ over data $D$:
$$\mathrm{BIC}(m \mid D) = \max_{\theta} \log P(D \mid m, \theta) - \frac{d(m)}{2} \log N,$$
where $d(m)$ denotes the number of free parameters and $N$ is the number of training samples. More specifically, we first learn a two-layer tree structure with $X$ as the leaf nodes and a layer of latent variables on top. Then we repeat the same process over the layer of latent variables to learn another two-layer tree. In this way, multiple two-layer trees are obtained, and we finally stack all the trees up to form a multi-layer tree structure. The whole procedure is shown in Figure 3.
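The BIC trade-off above is simple enough to state directly in code. The following sketch (ours, not the authors' implementation; the function name is an assumption) takes the maximized log-likelihood as given:

```python
import math

def bic_score(log_likelihood, num_free_params, num_samples):
    """BIC(m | D) = max_theta log P(D | m, theta) - d(m)/2 * log N.

    log_likelihood: maximized data log-likelihood of the structure.
    num_free_params: d(m), the number of free parameters.
    num_samples: N, the number of training samples.
    Higher scores are better; the second term penalizes complexity.
    """
    return log_likelihood - num_free_params / 2.0 * math.log(num_samples)
```

A structure with more latent variables only wins if its likelihood gain outweighs the penalty, which is exactly the comparison used in the group-growing test of Section 3.1.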
3.1 Learning a Two-Layer Tree
To learn a two-layer tree structure, we need to first partition the observed variables into disjoint groups such that the variables in each group are strongly correlated and can be placed together as children of a shared latent variable. HLTA achieves this by greedily optimizing the BIC score of the model. It starts by finding the two most correlated variables to form one group and keeps expanding the group if necessary. Let $W$ denote the set of observed variables that have not yet been included in any variable group. HLTA first computes the mutual information between each pair of observed variables. Then it picks the pair in $W$ with the highest mutual information and uses the two variables as the seeds of a new variable group $G$. New variables from $W$ are then added to $G$ one by one in descending order of their mutual information with the variables already in $G$. Each time a new variable is added to $G$, HLTA builds two models ($m_1$ and $m_2$) with $G$ as the observed variables. The two models are the best models with a single latent variable and with two latent variables, respectively, as shown in Figure 4. HLTA computes the BIC scores of the two models and tests whether the following condition is met:
$$\mathrm{BIC}(m_2 \mid D) - \mathrm{BIC}(m_1 \mid D) < \delta,$$
where $\delta$ is a threshold which is usually set at 3 [Chen et al., 2017a]. When the condition is met, the two-latent-variable model $m_2$ is not significantly better than the one-latent-variable model $m_1$: correlations among the variables in $G$ are still well modelled using a single latent variable, so HLTA keeps adding new variables to $G$. If the test fails, HLTA takes the subtree in $m_2$ which does not contain the newly added variable and identifies the observed variables in it as a finalized variable group. The group is then removed from $W$, and the above process is repeated on $W$ until all the variables are partitioned into disjoint groups. An efficient algorithm, Progressive EM [Chen et al., 2016], is used to estimate the parameters in $m_1$ and $m_2$.
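The group-growing loop can be sketched as follows. This is our illustrative simplification of HLTA's procedure: the `passes_ud_test` callable stands in for the BIC-based test above, and, unlike full HLTA, the sketch finalizes the whole current group when the test fails rather than only the subtree excluding the new variable. All names are ours.

```python
from itertools import combinations

def build_groups(variables, mutual_info, passes_ud_test):
    """Greedily partition variables into strongly correlated groups.

    mutual_info: dict mapping frozenset({a, b}) -> MI estimate for pair (a, b).
    passes_ud_test: callable(group, candidate) -> True while a single latent
        variable still models the group well (stand-in for the BIC test).
    """
    remaining = set(variables)
    groups = []
    while remaining:
        if len(remaining) <= 2:
            groups.append(sorted(remaining))
            break
        # Seed the group with the most correlated unassigned pair.
        seed = max(combinations(sorted(remaining), 2),
                   key=lambda p: mutual_info[frozenset(p)])
        group = list(seed)
        # Try the rest in descending order of MI with the seed group.
        candidates = sorted(
            remaining - set(group),
            key=lambda v: max(mutual_info[frozenset((v, g))] for g in group),
            reverse=True)
        for v in candidates:
            if passes_ud_test(group, v):
                group.append(v)
            else:
                break                 # test failed: finalize this group
        groups.append(sorted(group))
        remaining -= set(group)
    return groups
```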
After partitioning the observed variables into groups, we introduce a latent variable for each group and compute the mutual information among the latent variables. Then we link up the latent variables to form a Chow-Liu tree [Chow and Liu, 1968] based on their mutual information. The result is a latent tree model [Pearl, 1988; Zhang, 2004], as shown in Figure 3(c). Note that all the latent variables here are directly connected to some observed variables, and hence we call them the layer-1 latent variables.
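Linking the latent variables into a Chow-Liu tree amounts to computing a maximum spanning tree over their pairwise mutual information. A minimal sketch using Kruskal's algorithm with union-find (function name and input format are our assumptions):

```python
def chow_liu_edges(nodes, mutual_info):
    """Maximum spanning tree over pairwise MI (Kruskal's algorithm).

    nodes: iterable of variable names.
    mutual_info: dict mapping (a, b) tuples -> MI estimate.
    Returns the list of tree edges.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    # Greedily take the highest-MI edges that do not create a cycle.
    for (a, b), _ in sorted(mutual_info.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree
```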
3.2 Stacking Two-Layer Trees into a Tree Skeleton
After learning the first two-layer tree, we convert the layer-1 latent variables to observed variables through data completion. Then another two-layer tree can be built by applying the above method to the completed data, and the resulting tree gives the layer-2 latent variables. This procedure of building a two-layer tree is repeated on the newly-introduced latent variables recursively until the number of latent variables on top falls below a predefined threshold, resulting in multiple two-layer trees. Finally, we stack all the two-layer trees up one by one to form a tree skeleton with multiple layers of latent variables, as shown in Figure 3(e). Note that the connections between latent variables at the same layer are removed, except those in the top layer. The result is a hierarchical latent tree model [Liu et al., 2014; Chen et al., 2017a].
4 Expanding Tree Skeleton to PGM Core
We have restricted the structure of $m$ to be a tree, as parameter estimation in tree-structured PGMs is relatively efficient. However, this restriction also hurts the model's expressiveness. For example, in text analysis, the word Apple is conceptually highly correlated with both fruit words and technology words. But Apple is directly connected to only one latent variable in $m$, and it is difficult for that latent variable to express both concepts, which may cause severe underfitting. On the other hand, in standard FNNs, units at a layer are always fully connected to those at the previous layer, resulting in highly redundant connections.
In this paper, we aim to learn sparse connections between adjacent layers, such that they are neither as sparse as those in a tree nor as dense as those in an FNN. To this end, the sparse connections should capture only the most important correlations among the observed variables. Thus we propose to use $m$ as a structure skeleton and expand it into a denser structure which we call the PGM core.
Let $z$ be a latent variable at layer $l$ in $m$. We consider adding new connections to link it to more variables at layer $l-1$. We evaluate the importance of a potential connection between $z$ and a variable $x$ at layer $l-1$ by computing the empirical conditional mutual information $I(x; z \mid y)$,
where $y$ is the parent of $x$ in $m$ and $y \neq z$. If $m$ with its tree structure perfectly models the correlation between $x$ and $z$, then $x$ and $z$ should be conditionally independent given $y$, and thus $I(x; z \mid y)$ should be zero. In other words, if $I(x; z \mid y)$ is substantially larger than zero, then the correlation between $x$ and $z$ is still not well modelled by $m$, and we should consider adding a new connection between $x$ and $z$. With this intuition, we sort the candidate variables in descending order of $I(x; z \mid y)$ and add new links to connect the top variables to $z$. This expansion phase is carried out over all the adjacent layers. We then remove the connections between the top-layer latent variables and call the resulting structure the PGM core. The skeleton expansion phase is shown in Figure 2. Note that at this stage we do not need to learn the parameters of the new connections.
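For discrete variables, the empirical conditional mutual information $I(x; z \mid y)$ can be estimated from co-occurrence counts. A sketch (ours; natural logarithm, plug-in frequency estimates):

```python
import math
from collections import Counter

def conditional_mutual_information(samples):
    """Empirical I(X; Z | Y) from a list of (x, z, y) observations.

    Uses I(X; Z | Y) = sum_{x,z,y} p(x,z,y) log [p(y) p(x,z,y) / (p(x,y) p(z,y))],
    with all probabilities replaced by empirical counts over n samples.
    """
    n = len(samples)
    cxzy = Counter(samples)
    cxy = Counter((x, y) for x, _, y in samples)
    czy = Counter((z, y) for _, z, y in samples)
    cy = Counter(y for _, _, y in samples)
    cmi = 0.0
    for (x, z, y), c in cxzy.items():
        # The n factors cancel: p(y)p(x,z,y) / (p(x,y)p(z,y)) = cy*c / (cxy*czy).
        cmi += (c / n) * math.log(cy[y] * c / (cxy[(x, y)] * czy[(z, y)]))
    return cmi
```

If X and Z are identical given a constant Y, the estimate equals H(X); if they are conditionally independent, it is zero, matching the intuition used for the expansion test.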
5 Constructing Sparse FNNs from PGM Core
Our tree expansion method learns a multi-layer sparse structure in an unsupervised manner. One key advantage of unsupervised structure learning is that a structure learned from a set of unlabelled data can be transferred to any supervised learning task on the same type of data. The convolutional layers widely used in computer vision are a good example: having seen many unlabelled scenes, we humans know that neighbouring pixels in vision data are strongly correlated, and hence we design the locally-connected structure of convolutional layers, which is well suited to the nature of vision data and works well in supervised learning tasks. Similarly, our method discovers strong correlations in general data other than images. To utilize the resulting structure in a discriminative model, we convert each latent variable in the PGM core to a hidden unit $h$ by defining the conditional probability:
$$P(h = 1 \mid \mathbf{v}) = \sigma(\mathbf{w}^{\top}\mathbf{v} + b),$$
where $\mathbf{v}$ denotes the vector of units directly connected to $h$ at the layer below, $\mathbf{w}$ and $b$ are the connection weights and bias respectively, and $\sigma(\cdot)$ denotes a function mapping real values to probabilities, e.g., the sigmoid function. As in many deep learning models, we can further replace $\sigma(\cdot)$ with other nonlinear activation functions, such as tanh and ReLU [Nair and Hinton, 2010; Glorot et al., 2011], which usually benefit the training of deep models. In this way, we convert the PGM core into a sparse multi-layer neural network. Next we discuss how we use it as a feature extractor in supervised learning tasks. Our model contains two parts, the Backbone and the skip-paths.
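Structurally, each layer of the resulting network is a dense layer whose weight matrix is masked by the sparse connectivity of the PGM core. A minimal NumPy sketch of one such layer (function and parameter names are ours):

```python
import numpy as np

def sparse_layer(v, weights, mask, bias):
    """One sparse layer: each unit sees only the inputs its mask allows.

    v:       (num_inputs,) activations of the layer below
    weights: (num_units, num_inputs) dense weight matrix
    mask:    (num_units, num_inputs) 0/1 connectivity taken from the PGM core
    bias:    (num_units,) per-unit biases
    """
    z = (weights * mask) @ v + bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid; tanh/ReLU also usable
```

Because the mask is fixed, gradients through masked-out entries are zero, so standard backpropagation trains only the connections the structure learning kept.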
The Backbone
For a specific classification or regression task, we introduce a fully-connected layer on top of the PGM core, which we call the feature layer, followed by an output layer. As shown in Figure 1, the feature layer acts as a feature “aggregator”, aggregating the features extracted by the PGM core and feeding them to the output layer. We call the whole resulting module (the PGM core, the feature layer and the output layer together) the Backbone, as it is supposed to be the major module of our model.
The Skip-paths
As the structure of the PGM core is sparse and is learned to capture the strongest correlations in data, some weak but useful correlations may easily be missed. More importantly, different tasks may rely on different weak correlations, and this cannot be taken into consideration during unsupervised structure learning. To remedy this, we allow the model to contain some narrow fully-connected paths to the feature layer so that they can capture those missed features. More specifically, suppose there are $L$ layers of units in the PGM core. We introduce $L-1$ more groups of units into the feature layer, with each group fully connected to one layer of the PGM core (except the top layer). In this way, each layer except the top one in the PGM core has both a sparse path (the Backbone) and a fully-connected path to the feature layer. The fully-connected paths are supposed to capture those minor features during parameter learning. These new paths are called skip-paths.
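Putting the two parts together, the feature layer concatenates a projection of the core's top layer (the Backbone path) with narrow fully-connected projections of every lower layer (the skip-paths). A NumPy sketch of this forward computation, with our own illustrative names and shapes (nonlinearities omitted for brevity):

```python
import numpy as np

def tsenet_features(layer_activations, backbone_w, skip_ws):
    """Feature-layer computation of a TSENet-style model (sketch).

    layer_activations: per-layer activations of the PGM core, bottom to top;
        the last entry is the core's top layer.
    backbone_w: weight matrix projecting the top layer into the feature layer.
    skip_ws: one narrow weight matrix per non-top core layer (the skip-paths).
    """
    top = layer_activations[-1]
    parts = [backbone_w @ top]                 # Backbone path
    for h, w in zip(layer_activations[:-1], skip_ws):
        parts.append(w @ h)                    # fully-connected skip-path
    return np.concatenate(parts)               # fed to the output layer
```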
As shown in Figure 1, the Backbone and the skip-paths together form our final model, named the Tree Skeleton Expansion Network (TSENet). The model can then be trained like a normal neural network using backpropagation.
6 Experiments
6.1 Datasets
We evaluate our method on 17 classification tasks. Table 1 gives a summary of the datasets. We choose 12 tasks of chemical compound classification and 5 tasks of text classification. All the datasets were published by previous researchers and are available to the public.
Tox21 challenge dataset
There are about 12,000 environmental chemical compounds in the dataset, each represented by its chemical structure. The tasks are to predict 12 different toxic effects for the chemical compounds, and we treat them as 12 binary classification tasks. We filter out sparse features which are present in fewer than 5% of the compounds, and rescale the remaining 1,644 features to zero mean and unit variance. The validation set is randomly sampled and removed from the original training set.
Text classification datasets
We use 5 text classification datasets from [Zhang et al., 2015]. After removing stop words, the 10,000 most frequent words in each dataset are selected as its vocabulary, and each document is represented as a bag-of-words over the vocabulary. The validation set is randomly sampled and removed from the training samples.
Table 1: Summary of the datasets.

Dataset  Classes  Train  Validation  Test
Tox21  2  9,000  500  600
Yelp Review Full  5  640,000  10,000  50,000
DBPedia  14  549,990  10,010  70,000
Sogou News  5  440,000  10,000  60,000
Yahoo!Answer  10  1,390,000  10,000  60,000
AG’s News  4  110,000  10,000  7,600
Table 2: Hyperparameter space considered for FNNs.

Hyperparameter  Values considered
Number of units per layer  {512, 1024, 2048}
Number of hidden layers  {1, 2, 3, 4}
Network shape  {Rectangle, Conic}
6.2 Experiment Setup
We compare our model TSENet with standard FNNs. For a fair comparison, we treat the number of units and the number of layers as hyperparameters of an FNN and optimize them via grid-search over all the defined combinations using validation data. Table 2 shows the space of network configurations considered, following the setup in [Klambauer et al., 2017]. In our TSENet, the number of layers and the number of units at each layer are determined by the algorithm. We set the upper bound for the number of units at the top layer of the PGM core to around 500, resulting in PGM cores with 2 or 3 hidden layers. In the expansion phase, we expand the connections such that each unit in the PGM core is connected to 5% of the units at the layer below. By sampling a subset of data for structure learning, our method runs efficiently on a standard desktop.
We also compare our model with pruned FNNs, whose connections are sparse. We take the best FNN as the initial model and perform pruning as in [Han et al., 2015]. As micro expansion and stochastic exploration methods do not learn layered FNNs and are computationally expensive, they are not included in the comparison.
We use ReLUs [Nair and Hinton, 2010] as the nonlinear activation functions in all the networks. Dropout [Hinton et al., 2012b; Srivastava et al., 2014] with rate 0.5 is applied after each nonlinear projection. We use Adam [Kingma and Ba, 2014] as the network optimizer. Code will be released after the paper is accepted.
6.3 Results
Classification results are reported in Table 3. All the experiments are run three times, and we report the average classification AUC scores/accuracies with standard deviations.
TSENets vs FNNs
From the table we can see that TSENet contains only 6.25%–32.07% of the parameters in FNN. Although the structure of FNN is manually optimized over the validation data, TSENet still achieves better or comparable results with far fewer parameters. In our experiments, TSENet achieves better AUC scores than FNN in 10 out of the 12 tasks on the Tox21 dataset. The results show that, although TSENet is much sparser than FNN, its structure successfully captures the crucial correlations in data and greatly reduces the number of parameters without significant performance loss. The numbers of parameters in the different models are also plotted in Figure 5, which clearly shows that TSENets contain far fewer parameters than FNNs.
It is worth noting that pure FNNs are not the state-of-the-art models for the tasks here. For example, [Mayr et al., 2016] proposes an ensemble of FNNs, random forests and SVMs with expert knowledge for the Tox21 dataset, and [Klambauer et al., 2017] tests different normalization techniques for FNNs on the Tox21 dataset; both achieve an average AUC score around 0.846. Complicated RNNs with attention [Yang et al., 2016] also achieve better results than FNNs on the 5 text datasets. However, the goal of this paper is to improve standard FNNs by learning sparse structures, not to propose state-of-the-art methods for any specific task. Those methods are all much more complex and even task-specific, and hence it would not be fair to include their results in the comparison. Moreover, those methods can also be combined with our TSENets to give better results.
Contribution of the Backbone
To validate our assumption that the Backbone in TSENet captures most of the crucial correlations in data and acts as the main part of the model, we remove the narrow skip-paths from TSENet and train the resulting model to test its performance.
Table 3: Classification results (AUC scores for Tox21, accuracies for the text datasets; average ± standard deviation over three runs) and numbers of parameters (absolute / % of the FNN parameters).

Model  Tox21 Average  Yelp Review  DBPedia  Sogou News  Yahoo!Answer  AG’s News
FNN  0.8010 ± 0.0017  59.13% ± 0.14%  97.99% ± 0.04%  96.12% ± 0.06%  71.84% ± 0.07%  91.61% ± 0.01%
(#params)  1.64M  5.38M  10.36M  13.39M  5.39M  28.88M
TSENet  0.8150 ± 0.0038  59.14% ± 0.06%  98.11% ± 0.03%  96.09% ± 0.06%  71.42% ± 0.06%  91.39% ± 0.03%
(#params)  338K / 20.64%  1.73M / 32.07%  1.78M / 17.13%  1.84M / 13.77%  1.69M / 31.42%  1.81M / 6.25%
Backbone  0.7839 ± 0.0076  58.63% ± 0.13%  97.91% ± 0.04%  95.67% ± 0.04%  69.95% ± 0.08%  91.33% ± 0.03%
(#params)  103K / 6.29%  613K / 11.38%  651K / 6.28%  712K / 5.32%  582K / 10.80%  678K / 2.35%
Pruned FNN  0.7998 ± 0.0034  59.12% ± 0.01%  98.11% ± 0.02%  96.20% ± 0.06%  71.74% ± 0.05%  91.49% ± 0.09%
As we can see from the results, the Backbone path alone already achieves AUC scores or accuracies that are only slightly worse than those of the full TSENet, even though the number of parameters in the Backbone is much smaller still: the Backbone contains only 2%–11% of the parameters in FNN. The results not only show the importance of the Backbone in TSENet, but also show that our structure learning method for the Backbone path is effective.
TSENets vs Pruned FNNs
We also compare our method with a baseline method [Han et al., 2015] for obtaining sparse FNNs. The pruning method provides regularization over the weights of a network; the regularization is even stronger than $\ell_1$-norm regularization, as it forces many weights to be exactly zero. We start from the fully pretrained FNNs reported in Table 3 and prune the weak connections with the smallest absolute weight values. The pruned networks are then retrained to compensate for the removed connections. After pruning, the number of remaining parameters in each FNN is the same as that in the corresponding TSENet for the same task. As shown in Table 3, TSENet and pruned FNN achieve very similar results. Without any supervision or pretraining of connection weights, our unsupervised structure learning method successfully identifies important connections and learns sparse structures. This again validates that our method is effective for learning sparse structures.
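The pruning baseline can be sketched as simple magnitude-based thresholding: given a parameter budget equal to the corresponding TSENet's, keep the largest-magnitude weights and zero out the rest. This is our sketch in the spirit of Han et al. [2015], not their released implementation:

```python
import numpy as np

def prune_to_budget(weights, keep):
    """Zero out all but the `keep` largest-magnitude weights.

    Ties at the threshold may retain slightly more than `keep` weights;
    a production implementation would break ties explicitly.
    """
    flat = np.abs(weights).ravel()
    if keep >= flat.size:
        return weights.copy()
    # k-th largest magnitude becomes the survival threshold.
    threshold = np.partition(flat, -keep)[-keep]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)
```

After pruning, the surviving weights would be retrained (with the zeroed positions kept fixed) to compensate for the removed connections, as described above.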
Interpretability
We also compare the interpretability of the different models on the text datasets, following the experiments in [Chen et al., 2017b]. We feed the data to the networks and conduct forward propagation to get the values of the hidden units for each data sample. Then, for each hidden unit, we sort the words in descending order of their correlation with the hidden unit, and the top 10 words with the highest correlations are chosen to characterize the unit. Following [Chen et al., 2017b], we measure the “interpretability” of a hidden unit by considering how similar each pair of words in its top-10 list is. Word similarity is calculated from a word2vec model [Mikolov et al., 2013a; Mikolov et al., 2013b] trained on the Google News dataset and released by Google, where each word is mapped to a high-dimensional vector; the similarity between two words is defined as the cosine similarity of the two corresponding vectors. The interpretability score of a hidden unit is computed as the average similarity over all pairs of words in its list, and the interpretability score of a model is defined as the average interpretability score of all its hidden units.
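The interpretability score described above is just the mean pairwise cosine similarity of the top words' embeddings. A sketch (ours), assuming the word2vec vectors for a unit's top-10 words have been stacked into a matrix:

```python
import numpy as np

def interpretability_score(word_vectors):
    """Average pairwise cosine similarity of a hidden unit's top words.

    word_vectors: (k, d) array, one embedding per top word.
    """
    normed = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    sims = normed @ normed.T                 # cosine similarity matrix
    iu = np.triu_indices(len(word_vectors), k=1)   # distinct pairs only
    return float(sims[iu].mean())
```

A model's score is then the mean of this quantity over its hidden units.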
Table 4 reports the interpretability scores of TSENets, FNNs and pruned FNNs. The Sogou News dataset is not included in this experiment since its vocabulary consists of Chinese Pinyin characters, most of which do not appear in the Google News word2vec model. We measure the interpretability scores by considering only the top-layer hidden units. From the table we can see that TSENets outperform FNNs and pruned FNNs in most cases and are comparable in the rest, showing superior coherency and compactness in the characterizations of hidden units and thus better model interpretability.
Table 4: Interpretability scores.

Task  TSENets  FNNs  Pruned FNNs
Yelp Review Full  0.1632  0.1117  0.1000
DBPedia  0.0609  0.0497  0.0553
Yahoo!Answer  0.1729  0.1632  0.1553
AG’s News  0.0531  0.0595  0.0561
To further demonstrate that our method can learn good structures, we apply it to the MNIST dataset [LeCun et al., 1998] to learn a tree skeleton. Each layer of latent variables partitions the pixels into disjoint groups. We take the first three layers of latent variables and visualize the partitions of pixels in Figure 6. In each subfigure, pixels with the same color belong to the same group. As we can see, even though pixel location information is not used in the analysis, our method groups neighboring pixels together. The reason is that neighboring pixels tend to be strongly correlated. Also note that the blocks at the bottom and the top are mostly horizontal, while those in the middle are often vertical. These reflect interesting characteristics of handwritten digits.
7 Conclusions
Structure learning for deep neural networks is a challenging and interesting research problem. We have proposed an unsupervised structure learning method which utilizes the correlation information in data to learn sparse deep feedforward networks. In comparison with standard FNNs, our TSENet contains far fewer parameters yet achieves better or comparable classification results across a variety of tasks. Our method is also shown to learn models with better interpretability, which is an important concern in deep learning. In the future, we will generalize our method to other networks such as RNNs and CNNs.
Footnotes
 https://github.com/bioinf-jku/SNNs
 https://github.com/zhangxiangxiao/Crepe
References
 Ryan Prescott Adams, Hanna M Wallach, and Zoubin Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.
 Timur Ash. Dynamic node creation in backpropagation networks. Connection science, 1(4):365–375, 1989.
 Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
 Martin G Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural networks, 1992.
 Peixian Chen, Nevin L Zhang, Leonard KM Poon, and Zhourong Chen. Progressive em for latent tree models and hierarchical topic detection. In AAAI, 2016.
 Peixian Chen, Nevin L Zhang, Tengfei Liu, Leonard KM Poon, Zhourong Chen, and Farhan Khawar. Latent tree models for hierarchical topic detection. Artificial Intelligence, 250:105–124, 2017.
 Zhourong Chen, Nevin L Zhang, DitYan Yeung, and Peixian Chen. Sparse boltzmann machines with structure learning as applied to text analysis. In AAAI, 2017.
 C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968.
 Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society., 1977.
 Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
 Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 Geoffrey E Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Selfnormalizing neural networks. In NIPS, 2017.
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 TinYau Kwok and DitYan Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998.
 Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
 Tengfei Liu, Nevin L. Zhang, and Peixian Chen. Hierarchical latent tree analysis for topic detection. In ECML/PKDD, 2014.
 Jia Liu, Maoguo Gong, Qiguang Miao, Xiaogang Wang, and Hao Li. Structure learning for deep neural networks based on multiobjective optimization. IEEE Transactions on Neural Networks and Learning Systems, 2017.
 Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. Deeptox: toxicity prediction using deep learning. Frontiers in Environmental Science, 2016.
 Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černockỳ. Strategies for training large scale neural network language models. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 196–201, 2011.
 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshops, 2013.
 Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
 Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
 Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
 Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Largescale evolution of image classifiers. In ICML, 2017.
 Gideon Schwarz et al. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.
 Suraj Srinivas and R. Venkatesh Babu. Datafree parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference, 2015.
 Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
 Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. Hierarchical attention networks for document classification. In HLTNAACL, 2016.
 Xiang Zhang, Junbo Zhao, and Yann LeCun. Characterlevel convolutional networks for text classification. In NIPS, 2015.
 Nevin L Zhang. Hierarchical latent class models for cluster analysis. JMLR, 5(6):697–723, 2004.
 Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.