Learning Sparse Deep Feedforward Networks via Tree Skeleton Expansion


Despite the popularity of deep learning, structure learning for deep models remains a relatively under-explored area. In contrast, structure learning has been studied extensively for probabilistic graphical models (PGMs). In particular, an efficient algorithm has been developed for learning a class of tree-structured PGMs called hierarchical latent tree models (HLTMs), where there is a layer of observed variables at the bottom and multiple layers of latent variables on top. In this paper, we propose a simple method for learning the structures of feedforward neural networks (FNNs) based on HLTMs. The idea is to expand the connections in the tree skeletons from HLTMs and to use the resulting structures for FNNs. An important characteristic of FNN structures learned this way is that they are sparse. We present extensive empirical results to show that, compared with manually tuned standard FNNs, sparse FNNs learned by our method achieve better or comparable classification performance with far fewer parameters. They are also more interpretable.

1 Introduction

Deep learning has achieved great success in the past few years [LeCun et al., 2015; Hinton et al., 2012a; Mikolov et al., 2011; Krizhevsky et al., 2012]. More and more researchers are now investigating the possibility of learning structures for deep models instead of constructing them manually [Chen et al., 2017b; Zoph and Le, 2017; Baker et al., 2017; Real et al., 2017]. Structure learning is interesting not only because it can save manual labor, but also because it can yield models that fit data better and hence perform better than manually built ones. In addition, it can lead to models that are sparse and interpretable.

In this paper, we focus on structure learning for standard feedforward neural networks (FNNs). While convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are designed for spatial and sequential data respectively, standard FNNs are used for data that are neither spatial nor sequential. The structures of CNNs and RNNs are relatively more sophisticated than those of FNNs. For example, a neuron at a convolutional layer in a CNN is connected only to neurons in a small receptive field at the level below. The underlying assumption is that neurons in a small spatial region tend to be strongly correlated in their activations. In contrast, a neuron in an FNN is connected to all neurons at the level below. We aim to learn sparse FNN structures where a neuron is connected to only a small number of strongly correlated neurons at the level below.

Our work is built upon hierarchical latent tree analysis (HLTA) [Liu et al., 2014; Chen et al., 2017a], an algorithm for learning tree-structured PGMs where there is a layer of observed variables at the bottom and multiple layers of latent variables on top. HLTA first partitions all the variables into groups such that the variables in each group are strongly correlated and the correlations can be properly modelled using a single latent variable. It then introduces a latent variable for each group. After that, it converts the latent variables into observed variables via data completion and repeats the process to produce a hierarchy.

To learn a sparse FNN structure, we assume data are generated from a PGM with multiple layers of latent variables and we try to approximately recover the structure of the generative model. To do so, we first run HLTA to obtain a tree model and use it as a skeleton. Then we expand it with additional edges to model salient probabilistic dependencies not captured by the skeleton. The result is a PGM structure and we call it a PGM core. To use the PGM core for classification, we further introduce a small number of neurons for each layer, and we connect them to all the units at that layer and to all the output units. This allows features from all layers to contribute to classification directly.

Figure 1 illustrates the result of our method. The PGM core includes the bottom three layers. The solid connections make up the skeleton and the dashed connections are added during the expansion phase. The neurons at the layer above the PGM core (the feature layer) and the output units are added at the last step. The neurons at the feature layer can be conceptually divided into two groups: those connected to the top layer of the PGM core and those connected to other layers. The PGM core, the first group at the feature layer and the output units together form the Backbone of the model, while the second group at the feature layer provides narrow skip-paths from the lower layers of the PGM core to the output layer. As the structure is obtained by expanding the connections of a tree skeleton, our model is called Tree Skeleton Expansion Network (TSE-Net).

Here is a summary of our contributions:

  1. We propose a novel method for learning sparse structures for FNNs. The method depends heavily on HLTA. However, HLTA by itself is not an algorithm for FNN structure learning and it has certainly not been tested for that purpose.

  2. We have conducted extensive experiments to compare TSE-Nets with manually-tuned FNNs.

  3. We have analyzed the pros and cons of our method with respect to related works, and we have empirically compared our method with a pruning method [Han et al., 2015] for obtaining FNNs with sparse connectivities.

Figure 1: Model structure of our Tree Skeleton Expansion Networks (TSE-Nets). The PGM core includes the bottom three layers. The solid connections make up the skeleton and the dashed connections are added during the expansion phase. The black part of the model is called the Backbone, while the red part provides narrow skip-paths from the PGM core to the output layer.

2 Related Works

The primary goal in structure learning is to find a model with optimal or close-to-optimal generalization performance. Brute-force search is not feasible because the search space is large and evaluating each model is costly as it necessitates model training. Early works in the 1980s and 1990s focused on what we call the micro expansion approach, where one starts with a small network and gradually adds new neurons to the network until a stopping criterion is met [Ash, 1989; Bello, 1992; Kwok and Yeung, 1997]. The word “micro” is used here because at each step only one or a few neurons are added. This makes learning large models computationally difficult, as reaching a large model would require many steps and model evaluation is needed at each step. In addition, those early methods typically do not produce the layered structures that are commonly used nowadays. Recently, a macro expansion method [Liu et al., 2017] has been proposed where one starts from scratch and repeatedly adds layers of hidden units until a threshold is met.

Other recent efforts have concentrated on what we call the contraction approach, where one starts with a larger-than-necessary structure and reduces it to the desired size. Contraction can be done either by repeatedly pruning neurons and/or connections [Srinivas and Babu, 2015; Li et al., 2017; Han et al., 2015], or by using regularization to force some of the weights to zero [Wen et al., 2016]. From the perspective of structure learning, the contraction approach is not ideal because it requires a complex model as input. A key motivation for a user to consider structure learning is to avoid building models manually.

A third approach is to explore the model space stochastically. One way is to place a prior over the space of all possible structures and carry out MCMC sampling to obtain a collection of models with high posterior probabilities [Adams et al., 2010]. Another way is to encode a model structure as a sequence of numbers, use a reinforcement learning meta model to explore the space of such sequences, learn a good meta policy from the sequences explored, and use the policy to generate model structures [Zoph and Le, 2017]. An obvious drawback of such stochastic exploration methods is that they are computationally very expensive.

What we propose in this paper is a skeleton expansion method where we first learn a tree-structured model and then add a certain number of new units and connections to it in one shot. The method has two advantages: First, learning tree models is easier than learning non-tree models; Second, we need to train only one non-tree model, i.e., the final model.

The skeleton expansion idea has been used by Chen et al. [2017b] to learn structures for restricted Boltzmann machines, which have only one hidden layer. This is the first time that the idea is applied to and tested on multi-layer feedforward networks.

3 Learning Tree Skeleton via HLTA

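The first step of our method is to learn a tree-structured probabilistic graphical model $\mathcal{T}$ (an example is shown in the left panel in Figure 2). Let $\mathbf{X}$ be the set of observed variables at the bottom and $\mathbf{H}$ be the unobserved latent variables. Then $\mathcal{T}$ defines a joint distribution over all the variables:

$$P(\mathbf{X}, \mathbf{H}) = \prod_{v \in \mathbf{X} \cup \mathbf{H}} P(v \mid pa(v)),$$

where $pa(v)$ denotes the parent variable of $v$ in $\mathcal{T}$ (for the root, which has no parent, the factor is simply $P(v)$). The distribution of $\mathbf{X}$ can be computed as:

$$P(\mathbf{X}) = \sum_{\mathbf{H}} P(\mathbf{X}, \mathbf{H}).$$

The model parameters, $\theta$, can be trained to maximize the data log-likelihood $\log P(\mathcal{D} \mid \theta)$ through the EM algorithm [Dempster et al., 1977], where $\mathcal{D}$ denotes the data. Stepwise EM, which is an efficient version of EM similar to stochastic gradient descent, can be used for parameter estimation in $\mathcal{T}$.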

Figure 2: Tree skeleton expansion: A multi-layer tree skeleton is first learned (left). New connections are then added to all the layers according to the empirical conditional mutual information (middle). The connections between variables at the top layer are removed and the resulting structure is called the PGM core (right). Black nodes represent observed variables, while white nodes represent latent variables.

Although parameter learning in the tree model $\mathcal{T}$ is straightforward, the parameterization of $\mathcal{T}$ depends heavily on its structure (e.g., how the variables are connected and how many latent variables are introduced), which is relatively difficult to learn. We learn the tree structure of $\mathcal{T}$ in a layer-wise manner to approximately optimize the BIC score [Schwarz et al., 1978] of $\mathcal{T}$ over the data $\mathcal{D}$:

$$BIC(\mathcal{T} \mid \mathcal{D}) = \log P(\mathcal{D} \mid \theta^*) - \frac{d}{2} \log N,$$

where $\theta^*$ is the maximum likelihood estimate of the parameters, $d$ denotes the number of free parameters and $N$ is the number of training samples. More specifically, we first learn a two-layer tree structure with the observed variables $\mathbf{X}$ being the leaf nodes and a layer of latent variables on top. Then we repeat the same process over the layer of latent variables to learn another two-layer tree. In this way, multiple two-layer trees are obtained and we finally stack all the trees up to form a multi-layer tree structure. The whole procedure is shown in Figure 3.
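To make the trade-off in the BIC score concrete, the following is a minimal Python sketch that assumes the standard form of BIC (maximized log-likelihood minus half the number of free parameters times the log of the sample size). The function name `bic_score` and all numbers are illustrative assumptions, not values from our experiments.

```python
import math


def bic_score(log_likelihood: float, num_free_params: int, num_samples: int) -> float:
    """Standard BIC: log-likelihood - (d / 2) * log(N). Larger is better."""
    return log_likelihood - 0.5 * num_free_params * math.log(num_samples)


# Hypothetical comparison: the larger model fits slightly better
# but pays a bigger complexity penalty.
small_model = bic_score(log_likelihood=-5200.0, num_free_params=11, num_samples=1000)
large_model = bic_score(log_likelihood=-5195.0, num_free_params=23, num_samples=1000)
print(small_model > large_model)
```

In this toy setting the small gain in log-likelihood does not cover the extra penalty, so the simpler model receives the higher score.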

3.1 Learning A Two-Layer Tree

To learn a two-layer tree structure, we first need to partition the observed variables into disjoint groups such that the variables in each group are strongly correlated and can be placed together as children of a shared latent variable. HLTA achieves this by greedily optimizing the BIC score of the model. It starts by finding the two most correlated variables to form one group and keeps expanding the group if necessary. Let $\mathbf{U}$ denote the set of observed variables which have not yet been included in any variable group. HLTA first computes the mutual information between each pair of observed variables. Then it picks the pair in $\mathbf{U}$ with the highest mutual information and uses them as the seeds of a new variable group $\mathbf{S}$. New variables from $\mathbf{U}$ are then added to $\mathbf{S}$ one by one in descending order of their mutual information with variables already in $\mathbf{S}$. Each time a new variable is added to $\mathbf{S}$, HLTA builds two models ($m_1$ and $m_2$) with $\mathbf{S}$ as the observed variables. The two models are the best models with one single latent variable and two latent variables respectively, as shown in Figure 4. HLTA computes the BIC scores of the two models and tests whether the following condition is met:

$$BIC(m_2 \mid \mathcal{D}) - BIC(m_1 \mid \mathcal{D}) \le \delta,$$

where $\delta$ is a threshold which is usually set at 3 [Chen et al., 2017a]. When the condition is met, the two-latent-variable model $m_2$ is not significantly better than the one-latent-variable model $m_1$, i.e., the correlations among the variables in $\mathbf{S}$ are still well modelled using a single latent variable. HLTA then keeps on adding new variables to $\mathbf{S}$. If the test fails, HLTA takes the subtree in $m_2$ which does not contain the newly added variable and identifies the observed variables in it as a finalized variable group. The group is then removed from $\mathbf{U}$, and the above process is repeated on $\mathbf{U}$ until all the observed variables are partitioned into disjoint groups. An efficient algorithm, Progressive EM [Chen et al., 2016], is used to estimate the parameters in $m_1$ and $m_2$.
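The two ingredients of the grouping step that do not require fitting latent variable models, namely the pairwise mutual information used to pick seed variables and the BIC difference test, can be sketched in Python as follows. This is an illustrative sketch, not the HLTA implementation: the names `mutual_information`, `pick_seed_pair` and `keeps_single_latent` are hypothetical, the data are toy values, and the BIC scores of the two candidate models are assumed to be supplied by a separate fitting routine.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two aligned discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def pick_seed_pair(columns):
    """Return the pair of variable names with the highest mutual information."""
    return max(combinations(sorted(columns), 2),
               key=lambda p: mutual_information(columns[p[0]], columns[p[1]]))


def keeps_single_latent(bic_one_latent, bic_two_latents, delta=3.0):
    """True if the two-latent model is not better by more than delta."""
    return bic_two_latents - bic_one_latent <= delta


# Toy data: "a" and "b" are identical, "c" is independent of both.
columns = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],
    "c": [0, 1, 0, 1, 0, 1, 1, 0],
}
print(pick_seed_pair(columns))
```

On the toy data the seed pair is the two identical variables, whose mutual information equals their entropy.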

After partitioning the observed variables into groups, we introduce a latent variable for each group and compute the mutual information among the latent variables. Then we link up the latent variables to form a Chow-Liu tree [Chow and Liu, 1968] based on their mutual information. The result is a latent tree model [Pearl, 1988; Zhang, 2004], as shown in Figure 3(c). Note that all the latent variables here are directly connected to some observed variables and hence we call them the layer-1 latent variables.
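A Chow-Liu tree is a maximum-weight spanning tree over the pairwise mutual information values. The following Python sketch builds such a tree with Kruskal's algorithm; the function name `chow_liu_edges`, the variable names and the mutual information values are hypothetical and serve only as an illustration.

```python
def chow_liu_edges(names, mi):
    """Maximum-weight spanning tree over pairwise mutual information.

    `mi` maps frozenset({u, v}) to the mutual information between u and v.
    Returns the tree edges in the order they were selected.
    """
    parent = {v: v for v in names}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    # Consider candidate edges from strongest to weakest dependency.
    for pair in sorted(mi, key=mi.get, reverse=True):
        u, v = sorted(pair)
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding the edge does not create a cycle
            parent[root_u] = root_v
            edges.append((u, v))
    return edges


# Hypothetical mutual information among three layer-1 latent variables.
mi = {
    frozenset(("z1", "z2")): 0.30,
    frozenset(("z2", "z3")): 0.20,
    frozenset(("z1", "z3")): 0.05,
}
print(chow_liu_edges(["z1", "z2", "z3"], mi))
```

The weakest pair is left unconnected because linking it would close a cycle.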

3.2 Stacking Two-Layer Trees to a Tree Skeleton

After learning the first two-layer tree, we convert the layer-1 latent variables to observed variables through data completion. Then another two-layer tree can be built by applying the above method to the converted layer-1 variables. The resulting tree gives the layer-2 latent variables. This procedure of building a two-layer tree can be repeated on the newly introduced latent variables recursively until the number of latent variables on top falls below a preset threshold, resulting in multiple two-layer trees. Finally, we stack all the two-layer trees up one by one to form a tree skeleton with multiple layers of latent variables, as shown in Figure 3(e). Note that the connections between latent variables at the same layer are removed, except those in the top layer. The result here is a hierarchical latent tree model [Liu et al., 2014; Chen et al., 2017a].

Figure 3: The structure learning procedure for multi-layer tree skeleton. Black nodes represent observed variables while white nodes represent latent variables. (a) A set of observed variables. (b) Partition the observed variables into groups. (c) Introduce a latent variable for each group and link the latent variables up as a Chow-Liu tree. (d) Convert the layer-1 latent variables into observed variables and repeat the previous process on them to obtain another two-layer tree. (e) Stack the multiple two-layer trees up to form a multi-layer tree skeleton.

4 Expanding Tree Skeleton to PGM Core

We have restricted the structure of the skeleton $\mathcal{T}$ to be a tree, as parameter estimation in tree-structured PGMs is relatively efficient. However, this restriction also hurts the model’s expressiveness. For example, in text analysis, the word Apple is conceptually highly correlated with both fruit words and technology words. But Apple is directly connected to only one latent variable in $\mathcal{T}$, and it is difficult for that latent variable to express both concepts, which may cause severe underfitting. On the other hand, in standard FNNs, units at a layer are always fully connected to those at the previous layer, resulting in high connection redundancy.

Figure 4: Example of testing whether five observed variables should be grouped together: (a) The best model with one latent variable. (b) The best model with two latent variables.

In this paper, we aim to learn sparse connections between adjacent layers, such that they are neither as sparse as those in a tree, nor as dense as those in an FNN. To this end, the sparse connections should capture only the most important correlations among the observed variables. Thus we propose to use the tree $\mathcal{T}$ as a structure skeleton and expand it to a denser structure which we call the PGM core.

Let $h$ be a latent variable at layer $l$ in the tree skeleton $\mathcal{T}$. We consider adding new connections to link it to more variables at layer $l-1$. We evaluate the importance of a connection between $h$ and a variable $v$ at layer $l-1$ by computing the empirical conditional mutual information:

$$I(h; v \mid pa(v)) = \sum_{h, v, pa(v)} \hat{P}(h, v, pa(v)) \log \frac{\hat{P}(h, v \mid pa(v))}{\hat{P}(h \mid pa(v)) \, \hat{P}(v \mid pa(v))},$$

where $pa(v)$ is the parent of $v$ in $\mathcal{T}$ and $pa(v) \neq h$. If $\mathcal{T}$ with its tree structure perfectly models the correlation between $h$ and $v$, then $h$ and $v$ should be conditionally independent given $pa(v)$ and thus $I(h; v \mid pa(v))$ should be zero. In other words, if $I(h; v \mid pa(v))$ is large, then the correlation between $h$ and $v$ is still not well modelled by $\mathcal{T}$ and we should consider adding a new connection between $h$ and $v$. With this intuition, we sort the variables at layer $l-1$ in descending order by $I(h; v \mid pa(v))$, and add new links to connect the top $k$ variables to $h$. This expansion phase is carried out over all the adjacent layers. We then remove the connections between the top-layer latent variables and call the resulting structure the PGM core $\mathcal{G}$. The skeleton expansion phase is shown in Figure 2. Note that at this stage we do not need to learn the parameters of the new connections.
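The ranking used in the expansion phase can be sketched in Python as follows. The sketch assumes that discrete values of the latent variables are already available for each sample (for example, via data completion); the function names `conditional_mutual_information` and `top_k_new_children`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def conditional_mutual_information(hs, vs, zs):
    """Empirical I(h; v | z) in nats from three aligned discrete sequences."""
    n = len(hs)
    c_hvz = Counter(zip(hs, vs, zs))
    c_hz = Counter(zip(hs, zs))
    c_vz = Counter(zip(vs, zs))
    c_z = Counter(zs)
    # p(h, v, z) * log( p(h, v | z) / (p(h | z) * p(v | z)) ), written in counts.
    return sum((c / n) * math.log(c * c_z[z] / (c_hz[(h, z)] * c_vz[(v, z)]))
               for (h, v, z), c in c_hvz.items())


def top_k_new_children(h_values, candidates, k):
    """Rank candidate variables by I(h; v | pa(v)) and return the top k names.

    `candidates` maps a variable name to (its values, the values of its parent).
    """
    ranked = sorted(
        candidates,
        key=lambda name: conditional_mutual_information(h_values, *candidates[name]),
        reverse=True,
    )
    return ranked[:k]


h = [0, 0, 1, 1, 0, 0, 1, 1]
candidates = {
    # v1 copies h, but its current parent carries no information about h.
    "v1": ([0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]),
    # v2 is fully explained by its current parent.
    "v2": ([0, 1, 1, 0, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1, 1, 0]),
}
print(top_k_new_children(h, candidates, k=1))
```

Here `v1` is selected because its dependence on `h` is not explained by its current parent, whereas `v2` has zero conditional mutual information with `h`.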

5 Constructing Sparse FNNs from PGM Core

Our tree expansion method learns a multi-layer sparse structure in an unsupervised manner. One key advantage of unsupervised structure learning is that the structure learned from a set of unlabelled data can be transferred to any supervised learning task on the same type of data. The convolutional layer widely used in computer vision is a good example: humans have seen many unlabelled scenes and concluded that there are strong correlations between neighbouring pixels in vision data. Hence the locally connected structure of the convolutional layer was designed to suit the nature of vision data, and it works well in supervised learning tasks. Similarly, our method discovers strong correlations in general data other than images. To utilize the resulting structure in a discriminative model, we convert each latent variable $h$ in the PGM core $\mathcal{G}$ to a hidden unit by defining the conditional probability:

$$P(h = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b),$$

where $\mathbf{x}$ denotes a vector of the units directly connected to $h$ at the layer below, $\mathbf{w}$ and $b$ are the connection weights and bias respectively, and $\sigma$ denotes a function mapping real values to probabilities, e.g., the sigmoid function. As in many deep learning models, we can further replace $\sigma$ with other non-linear activation functions, such as the tanh and ReLU [Nair and Hinton, 2010; Glorot et al., 2011] functions, which usually benefit the training of deep models. In this way, we convert $\mathcal{G}$ into a sparse multi-layer neural network. Next we discuss how we use it as a feature extractor in supervised learning tasks. Our model contains two parts, the Backbone and the skip-paths.
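The conversion of sparsely connected latent variables into hidden units can be illustrated with a short Python sketch in which each unit only sees the inputs it is wired to. The representation of the structure as per-unit index lists, the function names and the numbers are assumptions made for illustration only.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def relu(a):
    return max(0.0, a)


def sparse_layer(inputs, connections, weights, biases, activation=sigmoid):
    """Forward pass of a sparsely connected layer.

    connections[j] lists the input indices wired to hidden unit j, and
    weights[j] holds one weight per listed index.
    """
    outputs = []
    for idx, w, b in zip(connections, weights, biases):
        pre_activation = b + sum(wi * inputs[i] for wi, i in zip(w, idx))
        outputs.append(activation(pre_activation))
    return outputs


x = [1.0, 0.0, 2.0, -1.0]
connections = [[0, 2], [1, 3]]        # unit 0 sees x[0], x[2]; unit 1 sees x[1], x[3]
weights = [[0.5, -0.25], [1.0, 1.0]]
biases = [0.0, 0.5]

print(sparse_layer(x, connections, weights, biases))                   # sigmoid units
print(sparse_layer(x, connections, weights, biases, activation=relu))  # ReLU units
```

Swapping the activation function changes only the non-linearity; the sparse wiring learned in the structure learning phase stays fixed.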

The Backbone

For a specific classification or regression task, we introduce a fully-connected layer on top of the PGM core $\mathcal{G}$, which we call the feature layer, followed by an output layer. As shown in Figure 1, the feature layer acts as a feature “aggregator”, aggregating the features extracted by $\mathcal{G}$ and feeding them to the output layer. We call the whole resulting module ($\mathcal{G}$, feature layer and output layer together) the Backbone, as it is supposed to be the major module of our model.

The Skip-paths

As the structure of the PGM core $\mathcal{G}$ is sparse and is learned to capture the strongest correlations in data, some weak but useful correlations may easily be missed. More importantly, different tasks may rely on different weak correlations, and this cannot be taken into consideration during the unsupervised structure learning. To remedy this, we allow the model to contain some narrow fully-connected paths to the feature layer such that they can capture those missed features. More specifically, suppose there are $L$ layers of units in $\mathcal{G}$. We introduce $L-1$ more groups of units into the feature layer, with each group fully connected to one layer in $\mathcal{G}$ (except the top layer). In this way, each layer except the top one in $\mathcal{G}$ has both a sparse path (the Backbone) and a fully-connected path to the feature layer. The fully-connected paths are supposed to capture those minor features during parameter learning. These new paths are called skip-paths.

As shown in Figure 1, the Backbone and the skip-paths together form our final model, named Tree Skeleton Expansion Network (TSE-Net). The model can then be trained like a normal neural network using back-propagation.
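How the Backbone and the skip-paths meet in the feature layer can be sketched as a forward pass in Python. This is a simplified illustration under stated assumptions: ReLU is used everywhere, the output layer is omitted, and the function names, data layout and weights are hypothetical rather than taken from our implementation.

```python
def dense(inputs, weights, biases):
    """Fully-connected ReLU layer; weights[j] has one weight per input."""
    return [max(0.0, b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def sparse(inputs, connections, weights, biases):
    """Sparsely connected ReLU layer; connections[j] lists unit j's input indices."""
    return [max(0.0, b + sum(w * inputs[i] for w, i in zip(row, idx)))
            for idx, row, b in zip(connections, weights, biases)]


def tse_net_features(x, core_layers, backbone_fc, skip_fcs):
    """Compute the feature-layer activations.

    core_layers: (connections, weights, biases) for each sparse PGM-core layer.
    backbone_fc: (weights, biases) of the group fed by the top core layer.
    skip_fcs:    one (weights, biases) per non-top layer of the core,
                 starting from the input layer.
    """
    activations = [x]
    for conn, w, b in core_layers:
        activations.append(sparse(activations[-1], conn, w, b))
    features = dense(activations[-1], *backbone_fc)   # Backbone group
    for layer_out, (w, b) in zip(activations[:-1], skip_fcs):
        features += dense(layer_out, w, b)            # one narrow skip-path group
    return features


x = [1.0, 2.0, 3.0]
core_layers = [([[0, 1], [2]], [[1.0, 1.0], [1.0]], [0.0, 0.0])]  # one sparse layer
backbone_fc = ([[1.0, -1.0]], [0.0])
skip_fcs = [([[1.0, 0.0, 0.0]], [0.5])]   # skip-path from the input layer
print(tse_net_features(x, core_layers, backbone_fc, skip_fcs))
```

The returned feature vector would then be passed to the output layer, and all weights would be trained jointly with back-propagation.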

6 Experiments

6.1 Datasets

We evaluate our method on 17 classification tasks. Table 1 gives a summary of the datasets. We choose 12 chemical compound classification tasks and 5 text classification tasks. All the datasets have been published by previous researchers and are publicly available.

Tox21 challenge dataset


There are about 12,000 environmental chemical compounds in the dataset, each represented by its chemical structure. The tasks are to predict 12 different toxic effects for the chemical compounds. We treat them as 12 binary classification tasks. We filter out sparse features which are present in fewer than 5% of the compounds, and rescale the remaining 1,644 features to zero mean and unit variance. The validation set is randomly sampled and removed from the original training set.

Text classification datasets


We use 5 text classification datasets from [Zhang et al., 2015]. After removing stop words, the 10,000 most frequent words in each dataset are selected as its vocabulary, and each document is represented as a bag-of-words vector over the vocabulary. The validation set is randomly sampled and removed from the training samples.

Dataset            Classes   Training     Validation   Test
Tox21              2         9,000        500          600
Yelp Review Full   5         640,000      10,000       50,000
DBPedia            14        549,990      10,010       70,000
Sogou News         5         440,000      10,000       60,000
Yahoo!Answer       10        1,390,000    10,000       60,000
AG’s News          4         110,000      10,000       7,600
Table 1: Statistics of all the datasets (number of classes and number of training, validation and test samples).

Hyper-parameter             Values considered
Number of units per layer   {512, 1024, 2048}
Number of hidden layers     {1, 2, 3, 4}
Network shape               {Rectangle, Conic}
Table 2: Hyper-parameters for the structure of FNNs.

6.2 Experiment Setup

We compare our model TSE-Net with standard FNNs. For a fair comparison, we treat the number of units and the number of layers as hyper-parameters of an FNN and optimize them via grid search over all the defined combinations using validation data. Table 2 shows the space of network configurations considered, following the setup of Klambauer et al. [2017]. In our TSE-Net, the number of layers and the number of units at each layer are determined by the algorithm. We set the upper bound for the number of units at the top layer of the PGM core to around 500, resulting in PGM cores with 2 or 3 hidden layers. In the expansion phase, we expand the connections such that each unit in the PGM core is connected to 5% of the units at the layer below. By sampling a subset of data for structure learning, our method runs efficiently on a standard desktop.

We also compare our model with pruned FNNs whose connections are sparse. We take the best FNN as the initial model and perform pruning as in [Han et al., 2015]. As micro expansion and stochastic exploration methods do not learn layered FNNs and are computationally expensive, they are not included in the comparison.

We use ReLUs [Nair and Hinton, 2010] as the non-linear activation functions in all the networks. Dropout [Hinton et al., 2012b; Srivastava et al., 2014] with rate 0.5 is applied after each non-linear projection. We use Adam [Kingma and Ba, 2014] as the network optimizer. Code will be released after the paper is accepted.

6.3 Results

Classification results are reported in Table 3. All the experiments are run three times and we report the average classification AUC scores/accuracies with standard deviations.

TSE-Nets vs FNNs

From the table we can see that TSE-Net contains only 6.25% to 32.07% of the parameters in FNN. Although the structure of FNN is manually optimized over the validation data, TSE-Net still achieves results better than or comparable to those of FNN with far fewer parameters. In our experiments, TSE-Net achieves better AUC scores than FNN in 10 out of the 12 tasks in the Tox21 dataset. The results show that, although TSE-Net is much sparser than FNN, the structure successfully captures the crucial correlations in data and greatly reduces the number of parameters without significant performance loss. The numbers of parameters in different models are also plotted in Figure 5, which clearly shows that TSE-Nets contain far fewer parameters than FNNs.

It is worth noting that pure FNNs are not the state-of-the-art models for the tasks here. For example, Mayr et al. [2016] propose an ensemble of FNNs, random forests and SVMs with expert knowledge for the Tox21 dataset. Klambauer et al. [2017] test different normalization techniques for FNNs on the Tox21 dataset. They both achieve an average AUC score around 0.846. Complicated RNNs with attention [Yang et al., 2016] also achieve better results than FNNs on the 5 text datasets. However, the goal of our paper is to improve standard FNNs by learning sparse structures, instead of proposing state-of-the-art methods for any specific task. Those methods are all much more complex and even task-specific, and hence it is not fair to include their results in the comparison. Moreover, those methods can also be combined with our TSE-Nets to give better results.

Contribution of the Backbone

To validate our assumption that the Backbone in TSE-Net captures most of the crucial correlations in data and acts as a main part of the model, we remove the narrow skip-paths in TSE-Net and train the model to test its performance.

Model        Tox21 Average     Yelp Review      DBPedia          Sogou News       Yahoo!Answer     AG’s News
FNN          0.8010 ± 0.0017   59.13% ± 0.14%   97.99% ± 0.04%   96.12% ± 0.06%   71.84% ± 0.07%   91.61% ± 0.01%
             1.64M             5.38M            10.36M           13.39M           5.39M            28.88M
TSE-Net      0.8150 ± 0.0038   59.14% ± 0.06%   98.11% ± 0.03%   96.09% ± 0.06%   71.42% ± 0.06%   91.39% ± 0.03%
             338K/20.64%       1.73M/32.07%     1.78M/17.13%     1.84M/13.77%     1.69M/31.42%     1.81M/6.25%
Backbone     0.7839 ± 0.0076   58.63% ± 0.13%   97.91% ± 0.04%   95.67% ± 0.04%   69.95% ± 0.08%   91.33% ± 0.03%
             103K/6.29%        613K/11.38%      651K/6.28%       712K/5.32%       582K/10.80%      678K/2.35%
Pruned FNN   0.7998 ± 0.0034   59.12% ± 0.01%   98.11% ± 0.02%   96.20% ± 0.06%   71.74% ± 0.05%   91.49% ± 0.09%
Table 3: Experiment results. For each model, the first row shows the classification AUC scores/accuracies (mean ± standard deviation), while the second row shows the number of parameters and the ratio w.r.t. that of FNNs. Best result on each dataset is marked in bold.

As we can see from the results, the Backbone path alone already achieves AUC scores or accuracies which are only slightly worse than those of TSE-Net. Note that the number of parameters in the Backbone is much smaller than that of TSE-Net: the Backbone contains only 2% to 11% of the parameters in FNN. The results not only show the importance of the Backbone in TSE-Net, but also show that our structure learning method for the Backbone path is effective.

TSE-Nets vs Pruned FNNs

Figure 5: The number of parameters in TSE-Nets and FNNs for different tasks. The left panel shows the 12 tasks in the Tox21 dataset.

We also compare our method with a baseline method [Han et al., 2015] for obtaining sparse FNNs. The pruning method provides regularization over the weights of a network. This regularization is stronger than a norm penalty in the sense that it forces many weights to be exactly zero. We start from the fully pretrained FNNs reported in Table 3 and prune the weak connections with the smallest absolute weight values. The pruned networks are then retrained to compensate for the removed connections. After pruning, the number of remaining parameters in each FNN is the same as that in the corresponding TSE-Net for the same task. As shown in Table 3, TSE-Net and pruned FNN achieve very similar results. Without any supervision or pre-training over connection weights, our unsupervised structure learning method successfully identifies important connections and learns sparse structures. This again validates that our method is effective for learning sparse structures.
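The magnitude-based pruning step of the baseline can be sketched in Python as follows. This is a simplified single-shot illustration for one weight matrix; the retraining step is omitted, and the function name `magnitude_prune` and the weights are hypothetical.

```python
def magnitude_prune(weights, num_keep):
    """Keep the `num_keep` largest-magnitude weights and zero out the rest.

    Returns the pruned matrix and the 0/1 mask of surviving connections.
    """
    ranked = sorted(
        ((abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)),
        reverse=True,
    )
    keep = {(r, c) for _, r, c in ranked[:num_keep]}
    mask = [[1 if (r, c) in keep else 0 for c in range(len(row))]
            for r, row in enumerate(weights)]
    pruned = [[w if m else 0.0 for w, m in zip(row, mask_row)]
              for row, mask_row in zip(weights, mask)]
    return pruned, mask


# Hypothetical pretrained weights; the budget of 3 plays the role of the
# parameter count of the corresponding TSE-Net.
weights = [[0.9, -0.05, 0.3],
           [-0.7, 0.01, 0.2]]
pruned, mask = magnitude_prune(weights, num_keep=3)
print(pruned)
print(mask)
```

During retraining, the mask would be held fixed so that removed connections stay at zero.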


We also compare the interpretability of different models on the text datasets, following the experiments in [Chen et al., 2017b]. We feed the data to the networks and conduct forward propagation to get the values of the hidden units corresponding to each data sample. Then for each hidden unit, we sort the words in descending order of the correlations between the words and the hidden unit. The top 10 words with the highest correlations are chosen to characterize the hidden unit. Following [Chen et al., 2017b], we measure the “interpretability” of a hidden unit by considering how similar each pair of words in the top-10 list is. The similarity between two words is calculated from a word2vec model [Mikolov et al., 2013a; Mikolov et al., 2013b] trained on the Google News dataset and released by Google, where each word is mapped to a high-dimensional vector. The similarity between two words is defined as the cosine similarity of the two corresponding vectors. The interpretability score of a hidden unit is computed as the average similarity of all pairs of words. The interpretability score of a model is defined as the average interpretability score of all hidden units.
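The interpretability score described above can be sketched in Python as follows. The two-dimensional toy embeddings stand in for the word2vec vectors, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit_interpretability(top_words, embeddings):
    """Average cosine similarity over all pairs of a unit's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


def model_interpretability(units, embeddings):
    """Average of the unit scores over all hidden units."""
    return sum(unit_interpretability(words, embeddings) for words in units) / len(units)


# Toy embeddings (assumed values, not real word2vec vectors).
embeddings = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "car": [0.0, 1.0]}
units = [["cat", "dog"], ["cat", "car"]]
print(model_interpretability(units, embeddings))
```

A unit whose top words point in similar directions in the embedding space receives a high score, while a unit mixing unrelated words receives a low one.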

Table 4 reports the interpretability scores of TSE-Nets, FNNs and Pruned FNNs. The Sogou News dataset is not included in the experiment since its vocabulary consists of Chinese pinyin words, most of which do not appear in the Google News word2vec model. We measure the interpretability scores by considering only the top-layer hidden units. From the table we can see that TSE-Nets outperform FNNs and Pruned FNNs on three of the four datasets (Yelp Review Full, DBPedia and Yahoo!Answer) and score slightly lower on AG’s News. In most cases, TSE-Nets thus show better coherency and compactness in the characterizations of hidden units, and hence better model interpretability.

Task               TSE-Nets   FNNs     Pruned FNNs
Yelp Review Full   0.1632     0.1117   0.1000
DBPedia            0.0609     0.0497   0.0553
Yahoo!Answer       0.1729     0.1632   0.1553
AG’s News          0.0531     0.0595   0.0561
Table 4: Interpretability scores of TSE-Nets, FNNs and Pruned FNNs on different datasets.

To further demonstrate that our method can learn good structures, we apply it to the MNIST dataset [LeCun et al., 1998] to learn a tree skeleton. Each layer of latent variables partitions the pixels into disjoint groups. We take the first three layers of latent variables and visualize the partitions of pixels in Figure 6. In each sub-figure, pixels with the same color belong to the same group. As we can see, even though pixel location information is not used in the analysis, our method groups neighboring pixels together. The reason is that neighboring pixels tend to be strongly correlated. Also note that the blocks at the bottom and the top are mostly horizontal, while those in the middle are often vertical. These patterns reflect interesting characteristics of handwritten digits.

Figure 6: The partitions of the MNIST pixels by layer-1, layer-2 and layer-3 latent variables respectively. Pixels in each sub-figure with the same color belong to the same group.

7 Conclusions

Structure learning for deep neural networks is a challenging and interesting research problem. We have proposed an unsupervised structure learning method which utilizes the correlation information in data for learning sparse deep feedforward networks. In comparison with standard FNNs, although our TSE-Net contains far fewer parameters, it achieves better or comparable classification results on the chemical compound and text classification tasks considered. Our method is also shown to learn models with better interpretability, which is an important concern in deep learning. In the future, we will generalize our method to other networks such as RNNs and CNNs.


  1. https://github.com/bioinf-jku/SNNs
  2. https://github.com/zhangxiangxiao/Crepe


  1. Ryan Prescott Adams, Hanna M Wallach, and Zoubin Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.
  2. Timur Ash. Dynamic node creation in backpropagation networks. Connection Science, 1(4):365–375, 1989.
  3. Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
  4. Martin G Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 1992.
  5. Peixian Chen, Nevin L Zhang, Leonard KM Poon, and Zhourong Chen. Progressive EM for latent tree models and hierarchical topic detection. In AAAI, 2016.
  6. Peixian Chen, Nevin L Zhang, Tengfei Liu, Leonard KM Poon, Zhourong Chen, and Farhan Khawar. Latent tree models for hierarchical topic detection. Artificial Intelligence, 250:105–124, 2017.
  7. Zhourong Chen, Nevin L Zhang, Dit-Yan Yeung, and Peixian Chen. Sparse Boltzmann machines with structure learning as applied to text analysis. In AAAI, 2017.
  8. C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
  9. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.
  10. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
  11. Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  12. Geoffrey E Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  13. Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  14. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  15. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In NIPS, 2017.
  16. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  17. Tin-Yau Kwok and Dit-Yan Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
  18. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  19. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  20. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
  21. Tengfei Liu, Nevin L. Zhang, and Peixian Chen. Hierarchical latent tree analysis for topic detection. In ECML/PKDD, 2014.
  22. Jia Liu, Maoguo Gong, Qiguang Miao, Xiaogang Wang, and Hao Li. Structure learning for deep neural networks based on multiobjective optimization. IEEE Transactions on Neural Networks and Learning Systems, 2017.
  23. Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 2016.
  24. Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černockỳ. Strategies for training large scale neural network language models. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 196–201, 2011.
  25. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshops, 2013.
  26. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  27. Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  28. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
  29. Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
  30. Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
  31. Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference, 2015.
  32. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
  33. Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
  34. Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. Hierarchical attention networks for document classification. In HLT-NAACL, 2016.
  35. Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
  36. Nevin L Zhang. Hierarchical latent class models for cluster analysis. JMLR, 5(6):697–723, 2004.
  37. Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.