Variational Depth Search in ResNets

Abstract

One-shot neural architecture search allows joint learning of weights and network architecture, reducing computational cost. We limit our search space to the depth of residual networks and formulate an analytically tractable variational objective that allows for obtaining an unbiased approximate posterior over depths in one shot. We propose a heuristic to prune our networks based on this distribution. We compare our proposed method against manual search over network depths on the MNIST, Fashion-MNIST, and SVHN datasets. We find that pruned networks do not incur a loss in predictive performance, obtaining accuracies competitive with unpruned networks. Marginalising over depth allows us to obtain better-calibrated test-time uncertainty estimates than regular networks, in a single forward pass.


1 Introduction and Related Work

One-shot Neural Architecture Search (NAS) is a promising approach to NAS that uses weight-sharing to significantly reduce the computational cost of exploring the architecture search space. This makes NAS more accessible to researchers and practitioners without large computational budgets. In this work, we describe a computationally cheap, gradient-based, one-shot NAS method that uses Variational Inference (VI) to learn distributions over the depth of residual networks (ResNets). Our approach inherits advantages from Bayesian neural networks such as capturing model uncertainty and robustness to over-fitting (Hernández-Lobato and Adams, 2015; Gal, 2016).

Perhaps the most well known gradient-based one-shot NAS approach is DARTS (Liu et al., 2019). It uses a continuous relaxation of the search space to learn the structure of cells within a larger, fixed, computational graph. Each edge in the graph of a cell represents a mixture of possible operations. Mixture weights are optimised with respect to the validation set. SNAS (Xie et al., 2019), ProxylessNAS (Cai et al., 2019) and BayesNAS (Zhou et al., 2019) take similar approaches, varying the distributions over cell operations and optimisation procedures. Shin et al. (2018) use gradients to jointly optimise weights and hyper-parameters for network layers. Ahmed and Torresani (2018) jointly optimise graph connectivity and weights using binary variables and a modified back-propagation algorithm. In contrast, we restrict our search to network depth. We jointly learn model weights and a distribution over depths, as opposed to a point estimate, using only the train set.

More closely related to this work is that of Dikov et al. (2019), who learn both the depth and width of a ResNet using VI. They obtain biased estimates of the Evidence Lower Bound's (ELBO's) gradients with respect to model architecture by leveraging continuous relaxations of discrete probability distributions. Nalisnick et al. (2019) interpret dropout as a structured shrinkage prior. They use it for automatic depth determination in ResNets, reducing the influence of, but not removing, whole residual blocks. Bender et al. (2018) use path dropout, in which whole edges of cells are dropped out at training time, to prevent co-adaptation while performing one-shot architecture search. In contrast, we directly model depth as a categorical variable instead of a product of Bernoullis. As a result, we are able to evaluate our ELBO exactly and efficiently; only a single forward pass is required to evaluate the likelihood, rather than high-variance Monte-Carlo sampling.

Our main contributions are as follows:

  1. We propose a probabilistic model, an approximate distribution, and a network architecture that, when combined, allow for exact evaluation of the ELBO with a single forward pass through the network. Network depth and weights are optimised jointly.

  2. We show how our formulation learns distributions over depths that assign more mass to better-performing architectures and are amenable to layer pruning.

  3. We show how to obtain model uncertainty estimates from a single forward pass through our networks.

Figure 1: Left: graphical model under consideration. Right: computational model. Each layer’s activations are passed through the output block, producing per-depth predictions.

2 Variational Inference over Architecture Space

Consider a dataset 𝒟 = {(x_n, y_n)}_{n=1}^N and a neural network, parametrised by weights θ, formed of D residual blocks f_1, …, f_D, an input block f_0, and an output block f_{D+1}. We take all layers to have a fixed width, or number of channels, W. We introduce a set of binary variables b_1, …, b_D such that the activations at depth i, a_i, can be obtained recursively as a_i = a_{i−1} + b_i·f_i(a_{i−1}). We obtain the initial activations as a_0 = f_0(x) and parameterise a distribution over targets with our model’s output: p(y | x, b_1, …, b_D, θ) = p(y | f_{D+1}(a_D)). This computational model is displayed in fig. 1.
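A minimal PyTorch sketch of this computational model (module names and structure are ours, not taken from the released code) makes the single-forward-pass property concrete: the output block is applied to the activations at every depth, so predictions for all candidate depths are produced at once.

```python
import torch
import torch.nn as nn

class DepthSearchResNet(nn.Module):
    """Sketch of the computational model: an input block f_0, residual blocks
    f_1..f_D, and an output block applied to the activations at every depth."""

    def __init__(self, input_block, residual_blocks, output_block):
        super().__init__()
        self.input_block = input_block                          # f_0
        self.residual_blocks = nn.ModuleList(residual_blocks)   # f_1, ..., f_D
        self.output_block = output_block                        # maps activations to logits

    def forward(self, x):
        a = self.input_block(x)                                 # a_0 = f_0(x)
        logits_per_depth = [self.output_block(a)]               # prediction at depth d = 0
        for block in self.residual_blocks:
            a = a + block(a)                                    # a_i = a_{i-1} + f_i(a_{i-1})
            logits_per_depth.append(self.output_block(a))       # prediction at depth d = i
        return torch.stack(logits_per_depth)                    # (D + 1, batch, n_classes)
```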

Given the above formulation, we can obtain a network of depth d by setting b_i = 1 for i ≤ d and b_i = 0 for i > d. Its outputs are then given by f_{D+1}(a_d). Deeper networks can express more complex functions but come at increased risk of overfitting and test-time computational cost. We propose to manage this trade-off by placing a categorical prior distribution over depths, p(d). By placing larger prior probabilities on smaller depths, we encourage simpler, computationally cheaper models. The posterior distribution over depths, p(d | 𝒟, θ), also takes the form of a categorical. Unfortunately, obtaining it requires computing the likelihood of the full dataset.

We approximate the posterior distribution over depths by introducing a surrogate categorical distribution q_α(d) with variational parameters α. We can optimise the variational parameters α and model parameters θ simultaneously through the following objective:

\mathcal{L}(\alpha, \theta) = \mathbb{E}_{q_\alpha(d)}\left[\log p(\mathcal{D} \mid d, \theta)\right] - \mathrm{KL}\left(q_\alpha(d) \,\|\, p(d)\right)    (1)

Intuitively, the first term in eq. 1 encourages quality of fit while the second keeps our model shallow. In appendix A, we link the objective in eq. 1 to variational EM and show it is a lower bound on log p(𝒟 | θ). Because both our approximate and true posteriors are categorical, eq. 1 is convex w.r.t. the variational parameters α. At the optimum, q_α(d) = p(d | 𝒟, θ) and the bound is tight. Thus, we are able to perform unbiased maximum likelihood estimation of network weights while the depth is marginalised. Taking expectations over q_α(d) allows us to avoid calculating the exact posterior at every optimisation step.

The expected log-likelihood in eq. 1 can be computed from the activations of every residual block, which are obtained with a single forward pass. As a result, both terms in eq. 1 can be evaluated exactly. This removes the need for the high-variance estimators often associated with performing VI in neural networks (Kingma et al., 2015). Using mini-batches of size B, we stochastically estimate eq. 1 as:

\mathcal{L}(\alpha, \theta) \approx \frac{N}{B} \sum_{b=1}^{B} \mathbb{E}_{q_\alpha(d)}\left[\log p(y_b \mid x_b, d, \theta)\right] - \mathrm{KL}\left(q_\alpha(d) \,\|\, p(d)\right)    (2)
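Because the expectation over q_α(d) is a finite sum over D + 1 depths, the minibatch objective can be evaluated exactly. Below is a sketch, assuming `logits_per_depth` comes from a forward pass like the one above and that `q_logits` parametrises q_α(d) through a softmax (the names and the softmax parametrisation are our assumptions, not the paper's):

```python
import torch
import torch.nn.functional as F

def elbo_minibatch(logits_per_depth, targets, q_logits, log_prior, n_train):
    """Exact minibatch estimate of eq. 2: expected log-likelihood under q(d),
    rescaled from batch to dataset size, minus KL(q(d) || p(d))."""
    batch_size = targets.shape[0]
    # One log-likelihood term per candidate depth, from a single forward pass.
    log_liks = torch.stack([
        -F.cross_entropy(logits, targets, reduction="sum")
        for logits in logits_per_depth
    ])                                               # shape: (D + 1,)
    q = torch.softmax(q_logits, dim=0)               # variational posterior over depths
    expected_log_lik = (q * log_liks).sum() * (n_train / batch_size)
    kl = (q * (torch.log(q + 1e-12) - log_prior)).sum()
    return expected_log_lik - kl
```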

After training, q_α(d = i) represents our confidence that the number of residual blocks we should use is i. In low data regimes, where both terms in eq. 1 are of comparable scale, we choose the most probable depth under q_α(d). In medium to big data regimes, where the log-likelihood dominates our objective, we find that the values of q_α(d = i) flatten out after reaching an appropriate depth. Heuristically, we select the pruning depth d* as the smallest depth that accumulates at least 95% of the probability mass of q_α(d), ensuring we keep the minimum number of layers needed to explain the data well. We prune all blocks after d* by setting q_α(d = i) = 0 for i > d* and then renormalising. Instead of also discarding the learnt probabilities over shallower networks, we incorporate them when making predictions on new data points through marginalisation at no additional computational cost:

p(y^* \mid x^*, \mathcal{D}) \approx \sum_{i=0}^{d^*} q_\alpha(d = i)\, p(y^* \mid x^*, d = i, \theta)    (3)
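The pruning heuristic and the marginalised prediction in eq. 3 amount to a cumulative sum over q_α(d) followed by a weighted average of per-depth softmax outputs. A sketch under the same assumed names as above; renormalising the truncated posterior is our reading of the pruning step:

```python
import torch

def percentile_depth(q_probs, mass=0.95):
    """Smallest depth d* whose cumulative probability under q reaches `mass`."""
    cdf = torch.cumsum(q_probs, dim=0)
    return int(torch.searchsorted(cdf, torch.tensor(mass)).item())

def marginal_predict(logits_per_depth, q_probs, d_star=None):
    """Predictive distribution marginalised over depths 0..d_star (eq. 3)."""
    if d_star is None:
        d_star = len(q_probs) - 1                    # marginalise over all depths
    w = q_probs[: d_star + 1]
    w = w / w.sum()                                  # renormalise the truncated posterior
    probs = torch.softmax(logits_per_depth[: d_star + 1], dim=-1)
    return (w[:, None, None] * probs).sum(dim=0)     # (batch, n_classes)
```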

3 Experiments

We refer to our approach as learnt depth networks (LDNs). We benchmark against deterministic depth networks (DDNs), evaluating our search space by training separate networks at multiple depths. We use the same architectures and hyperparameters for LDNs and DDNs. Implementation details are given in appendix B. Our code can be found at: github.com/anonimoose12345678/arch_uncert.

3.1 Spiral Classification

We generate a 2D training set by drawing 200 samples from a 720° rotation 2-armed spiral function with additive Gaussian noise. The test set is composed of an additional 1800 samples. We repeat experiments 6 times and report standard deviations as error bars.
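For reference, here is a sketch of a two-armed spiral generator of the kind described above. The exact radius scaling, noise level, and labelling used for the paper's dataset live in the released code, so the specifics below are assumptions:

```python
import numpy as np

def two_armed_spiral(n_samples, rotations=2.0, noise_std=0.1, seed=0):
    """Two-armed spiral toy dataset (720 degrees of rotation <-> rotations=2.0).
    A sketch only; the paper's exact parametrisation may differ."""
    rng = np.random.default_rng(seed)
    n = n_samples // 2
    theta = rng.uniform(0.0, 2.0 * np.pi * rotations, size=n)
    r = theta / (2.0 * np.pi * rotations)                        # radius grows with angle
    arm0 = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    arm1 = -arm0                                                 # second arm, rotated 180 degrees
    x = np.concatenate([arm0, arm1]) + rng.normal(scale=noise_std, size=(2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)]).astype(np.int64)
    return x.astype(np.float32), y
```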

Figure 2: Left: posterior over depths for an LDN trained on our spirals dataset. Test log-likelihood values obtained for DDNs at every depth are overlaid with the log-likelihood value obtained with an LDN when marginalising over layers. Right: the LDN’s chosen depth and test performance remain stable as the maximum depth increases, up to a point.

Choosing a relatively small width to ensure the task cannot be solved with a shallow model, we train LDNs with varying maximum depths and DDNs of all depths up to the largest value considered. Fig. 2 shows how the depths to which LDNs assign larger probabilities match those at which DDNs perform best. Predictions from LDNs pruned to d* layers outperform DDNs at all depths. The chosen d* remains stable as the maximum depth increases, up to a point. The same is true for test performance, showing some robustness to overfitting. Beyond this point, training starts to become unstable.

Figure 3: Top: spiral functions learnt at different depths of an LDN. The “i” indicator refers to the use of eq. 3 for predictions. Bottom: functions learnt at different depths of a DDN. In all cases, the same maximum depth is used.

We plot the functions learnt by different layers of a DDN in fig. 3. In excessively deep DDNs, intermediate layers contribute minimally; only at layer 15 does the learnt function start to resemble a spiral. In LDNs, the first d* layers do most of the work, and layers after d* learn functions close to the identity. This allows us to prune them, reducing computational cost at test time while obtaining the same test performance as when marginalising over all layers. This is shown in appendix C.

3.2 Small Image Datasets

We further evaluate our approach on MNIST, Fashion-MNIST and SVHN. Each experiment is repeated 4 times. The results are shown in fig. 4. The larger size of these datasets diminishes the effect of the prior on the ELBO. Models that explain the data well obtain large probabilities, regardless of their depth. For MNIST, the probabilities assigned to each depth in our LDN grow quickly and flatten out at a shallow depth. For Fashion-MNIST, depth probabilities grow more slowly, flattening out later. For SVHN, they flatten out later still. These distributions and the resulting values of d* correlate with dataset complexity. In most cases, pruned LDNs achieve test log-likelihoods competitive with the best performing DDNs, while achieving equal or better accuracies, as shown in appendix D. Additionally, our pruning strategy allows us to perform test-time inference approximately 62%, 41%, and 37% faster than unpruned networks on MNIST, Fashion-MNIST, and SVHN, respectively. We find that pruning does not impact predictive performance.

We investigate the predictive uncertainty calibration of LDNs and DDNs on the datasets under consideration. Detailed results are found in appendix D. Similarly to Guo et al. (2017), we find DDNs to be pathologically overconfident. LDNs present marginally better calibration on Fashion-MNIST and SVHN, tending to be less overconfident for probabilities greater than 0.5. We observe a negligible degradation in calibration when pruning layers after d*.

Figure 4: Approximate posterior over depths for LDNs trained on image datasets. Test log-likelihoods obtained for DDNs at various depths are overlaid with those from our LDNs when marginalising over the first d* layers. d* is chosen with the heuristic described in section 2.

4 Discussion and Future Work

We formulate a variational objective over ResNet depth which can be evaluated exactly. It allows for one-shot learning of both model weights and a distribution over depth. We leverage this distribution to prune our networks, making test-time inference cheaper, and obtain model uncertainty estimates. Pruned networks obtain predictions competitive with regular networks of any depth on a toy spiral dataset, MNIST, Fashion-MNIST and SVHN. They also tend to provide better calibrated uncertainty estimates. Despite promising results, we have yet to evaluate the scalability of our approach to larger datasets. We leave this, and comparing to existing NAS approaches, to future work. We would also like to further investigate the uncertainty estimates given by depth-wise marginalisation.

Appendix A Derivation of the ELBO in eq. 1 and Link to Variational EM

Referring to the training data {(x_n, y_n)}_{n=1}^N as 𝒟 and, for simplicity, dropping sub-indices referring to functions’ parameters θ, we show that eq. 1 is a lower bound on log p(𝒟):

\log p(\mathcal{D}) = \mathbb{E}_{q_\alpha(d)}\left[\log \frac{p(\mathcal{D} \mid d)\, p(d)}{q_\alpha(d)}\right] + \mathrm{KL}\left(q_\alpha(d) \,\|\, p(d \mid \mathcal{D})\right) = \mathcal{L}(\alpha, \theta) + \mathrm{KL}\left(q_\alpha(d) \,\|\, p(d \mid \mathcal{D})\right)    (4)

Using the non-negativity of the KL divergence, we can see that log p(𝒟) ≥ ℒ(α, θ).

We now show how our formulation corresponds to a variational EM algorithm (Tzikas et al., 2008). Here, network depth acts as the latent variable and network weights are parameters. For a given setting of network weights θ^(t), at optimisation step t, we can obtain the exact posterior over d using the E step:

p(d \mid \mathcal{D}, \theta^{(t)}) = \frac{p(\mathcal{D} \mid d, \theta^{(t)})\, p(d)}{\sum_{d'=0}^{D} p(\mathcal{D} \mid d', \theta^{(t)})\, p(d')}    (5)

The posterior depth probabilities can now be used to marginalise this latent variable and perform maximum likelihood estimation of network parameters. This is the M step:

\theta^{(t+1)} = \arg\max_{\theta}\, \mathbb{E}_{p(d \mid \mathcal{D}, \theta^{(t)})}\left[\log p(\mathcal{D} \mid d, \theta)\right]    (6)

Unfortunately, the E step (eq. 5) requires calculating the likelihood of the complete training dataset, an expensive operation when dealing with neural network models and big data. We sidestep this issue through the introduction of an approximate posterior q_α(d), parametrised by α, and a variational lower bound on the marginal log-likelihood (eq. 4). The corresponding variational E step is given by:

\alpha^{(t+1)} = \arg\max_{\alpha}\, \mathbb{E}_{q_\alpha(d)}\left[\log p(\mathcal{D} \mid d, \theta^{(t)})\right] - \mathrm{KL}\left(q_\alpha(d) \,\|\, p(d)\right)    (7)

Because our variational family contains the exact posterior distribution (both are categorical), the ELBO is tight at the optimum with respect to the variational parameters α. Solving eq. 7 recovers q_α(d) = p(d | 𝒟, θ^(t)). We now combine the variational E step (eq. 7) and M step (eq. 6) updates, recovering eq. 1, where α and θ are updated simultaneously through gradient steps.

This objective is amenable to minibatching. The variational posterior tracks the true posterior during gradient updates, providing an unbiased estimate. Thus, eq. 1 allows us to optimise an unbiased estimate of the data’s log-likelihood with network depth marginalised.

Appendix B Implementation Details

We implement all of our models in PyTorch (Paszke et al., 2019). We train our models using SGD with a fixed momentum value. The log-likelihood of the training data is obtained using the cross-entropy function. We use the default PyTorch parameter initialisation in all experiments. We do not set specific random seeds; however, we run each experiment multiple times and obtain similar results, showing our approach’s robustness to parameter initialisation.

For our experiments on spirals, a fixed learning rate and batch size are used. Note that for all experiments on spirals, except for the ones where the amount of training data is increased as part of the experiment, this results in full-batch gradient descent. Training progress is evaluated using the ELBO (eq. 1). Early stopping is applied after 500 iterations of no improvement on the previous best ELBO. The parameter setting which provides the best ELBO is kept. We choose an exponentially decreasing prior to encourage shallower models:

p(d = i) \propto \exp(-\beta i), \qquad i = 0, \dots, D,

where the rate β > 0 is a fixed constant; larger values place more prior mass on shallower networks.
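A sketch of such a prior, assuming the exponential form above (the decay rate `beta` below is a placeholder, not the value used in the paper). It returns the normalised log-prior consumed by the ELBO sketch in section 2:

```python
import torch

def exponential_depth_prior(max_depth, beta=1.0):
    """Exponentially decreasing prior over depths 0..max_depth: p(d = i) ~ exp(-beta * i)."""
    log_p = -beta * torch.arange(max_depth + 1, dtype=torch.float32)
    return log_p - torch.logsumexp(log_p, dim=0)     # normalised log-prior
```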

For MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), and SVHN (Netzer et al., 2011), the same setup as above is used, with a few exceptions. Early stopping is applied after 30 iterations of no improvement. Additionally, the learning rate is reduced after 30 iterations. Each dataset is normalised per-channel to have a mean of 0 and a standard deviation of 1. No other forms of data modification are used.

For our fully connected networks, used for spiral datasets, our input and output blocks consist of linear layers. These map from input space to the selected width W and from W to the output size, respectively. Thus, selecting d = 0 results in a linear model. The softmax operation is applied after the output block. The functions applied in residual blocks, f_i, consist of a fully connected layer followed by a ReLU activation function and Batch Normalization (Ioffe and Szegedy, 2015).

Our CNN architecture uses a convolutional layer together with an average pooling layer as an input block f_0. Due to the relatively small size of the images used in our experiments, no additional down-sampling layers are used in the convolutional blocks. The output block is composed of a global average pooling layer followed by a fully connected residual block, as described in the previous paragraph, and a linear layer. The softmax operation is applied to the output activations. The function applied in the residual blocks, f_i, matches the pre-activation bottleneck residual function described by He et al. (2016). The outer number of channels is set to 64 and the bottleneck number of channels to 32.
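A sketch of the pre-activation bottleneck residual function f_i described above. The channel counts follow the text, while the 3×3 middle kernel and the exact layer ordering are assumptions based on He et al. (2016):

```python
import torch.nn as nn

class PreactBottleneck(nn.Module):
    """Pre-activation bottleneck residual function (after He et al., 2016):
    BN-ReLU-1x1 conv (reduce), BN-ReLU-3x3 conv, BN-ReLU-1x1 conv (expand)."""

    def __init__(self, channels=64, bottleneck=32):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        # The skip connection a_i = a_{i-1} + f_i(a_{i-1}) is added by the
        # surrounding network, as in the computational model of section 2.
        return self.f(x)
```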

Appendix C Additional Experiments on 2D Spirals

We further explore the properties of LDNs in the small data regime by varying the layer width W. As shown in fig. 5, very small widths result in very deep LDNs and worse test performance. Increasing the layer width gives our models more representational capacity at each layer, causing the learnt depth to decrease rapidly. Test performance remains stable across a broad range of intermediate widths, showing that our approach adapts well to changes in this parameter. The test log-likelihood starts to decrease for the largest widths, possibly due to training instabilities.

Figure 5: Evolution of LDNs’ chosen depth and test performance as their layer width increases. The results obtained when making predictions by marginalising over all layers overlap with those obtained when only using the first d* layers. The x-axis is presented in logarithmic scale.

Setting the width back to its original value, we generate spiral datasets (code given in the repository) with varying degrees of rotation while keeping the number of training points fixed. In fig. 6, we see how LDNs increase their depth to match the increasing complexity of the underlying generative process of the data. For rotations larger than 720°, our networks’ capacity may be excessively restrictive, and test performance starts to suffer significantly. Fig. 7 shows how our LDNs struggle to fit these datasets well.

Figure 6: The left-side plots show the evolution of test performance and learnt depth as the data complexity increases. The right-side plots show changes in the same variables as the number of training points increases. The results obtained when making predictions by marginalising over all layers overlap with those obtained when only using the first d* layers. Best viewed in colour.

Returning to 720° spirals, we vary the number of training points in our dataset. We plot the LDNs’ learnt functions in fig. 8. LDNs overfit the 50-point training set but, as displayed in fig. 7, learn very shallow network configurations. Increasingly large training sets allow the LDNs to become deeper while increasing test performance. Around 500 training points seem to be enough for our models to fully capture the generative process of the data. After this point, d* oscillates around 11 layers and the test performance remains constant. Marginalising over all layers consistently produces the same test performance as only considering the first d*. All figures are best viewed in colour.

Figure 7: Functions learnt at each depth of an LDN on increasingly complex spirals. Note that single depth settings are being evaluated in this plot; we are not marginalising over all layers up to d*.

Figure 8: Functions learnt by LDNs trained on increasingly large spiral datasets. Note that single depth settings are being evaluated in this plot; we are not marginalising over all layers up to d*.

Appendix D Additional Experiments on Image Datasets

Figure 9 shows more detailed experiments comparing LDNs with DDNs on image datasets. We introduce the expected depth under q_α(d) as an alternative to the 95th percentile heuristic introduced in section 2. The first row of the figure adds further evidence that the depth learnt by LDNs corresponds to dataset complexity. For any maximum depth, and for both pruning approaches, the LDN’s learnt depth is smaller for MNIST than Fashion-MNIST and likewise smaller for Fashion-MNIST than SVHN. For MNIST, Fashion-MNIST and, to a lesser extent, SVHN, the depth given by 95th percentile pruning tends to saturate. On the other hand, the expected depth grows with the maximum depth D, making it a less suitable pruning strategy.
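Both pruning criteria can be read directly off the approximate posterior. A sketch of the expected-depth criterion, with `percentile_depth` as in the earlier sketch and rounding of the expectation being our choice:

```python
import torch

def expected_depth(q_probs):
    """Expected depth under q(d), rounded to the nearest whole number of blocks."""
    depths = torch.arange(len(q_probs), dtype=q_probs.dtype)
    return int(torch.round((q_probs * depths).sum()).item())
```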

As shown in rows 2 to 5, for SVHN and Fashion-MNIST, 95th-percentile-pruned LDNs suffer no loss in predictive performance compared to expected-depth-pruned and even non-pruned LDNs. They outperform DDNs. For MNIST, 95th percentile pruning gives results with high variance and reduced predictive performance in some cases. Here, DDNs yield better log-likelihood and accuracy results. Expected-depth pruning is more resilient in this case, matching full-depth LDNs and DDNs in terms of accuracy.

Fig. 10 shows calibration results for the image datasets under consideration. In all cases, DDNs are overconfident for all predicted probabilities. For Fashion-MNIST and SVHN, LDNs present less overconfidence for probabilities greater than 0.5, and they achieve lower expected calibration errors (Guo et al., 2017) overall. On MNIST, LDNs present strong underconfidence for larger predicted probabilities, and their calibration error is worse than that of DDNs. Together with the results from fig. 9, this suggests that our LDNs are underfitting on MNIST. In all cases, the expected calibration errors of pruned LDNs are marginally larger than those of non-pruned LDNs.
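The expected calibration error referred to above can be computed by binning predictions by confidence. A sketch of the standard estimator from Guo et al. (2017), with the number of bins being our choice:

```python
import torch

def expected_calibration_error(probs, targets, n_bins=10):
    """Expected calibration error: bin-size-weighted average of
    |accuracy - confidence| over equally spaced confidence bins."""
    conf, preds = probs.max(dim=-1)
    correct = preds.eq(targets).float()
    ece = torch.zeros(())
    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return ece.item()
```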

Fig. 11 shows the proportional reduction in forward-pass time for pruned LDNs relative to DDNs, both having the same maximum depth. In line with our expectations, the speedup provided by our proposed 95th percentile pruning strategy grows with the maximum depth. For the largest maximum depth considered, we obtain 62%, 41%, and 37% speedups for MNIST, Fashion-MNIST, and SVHN, respectively.

Figure 9: Comparisons of DDNs and LDNs using different pruning strategies and maximum depths. LDN-95 refers to the pruning strategy described in section 2. A second variant prunes to the expected depth under q_α(d). LDN-full refers to an unpruned LDN. 1st row: comparison of learnt depth. 2nd row: comparison of test log-likelihoods for DDNs and LDNs with 95th percentile pruning. 3rd row: comparison of test log-likelihoods for LDN pruning methods. 4th and 5th rows: as above, but for test error. Best viewed in colour.

(a) MNIST. (b) Fashion-MNIST. (c) SVHN.
Figure 10: Calibration plots obtained for image datasets. Results for a 50-layer DDN, a pruned (d*-layer) LDN, and an unpruned LDN are shown on the left, centre, and right, respectively. The expected calibration errors corresponding to each plot are given in the titles. All models have the same maximum depth.

Figure 11: Proportional speedup for a single forward pass obtained with pruned, d*-layer LDNs compared to their full-depth DDN counterparts.

Appendix E The NAS Best Practices Checklist

E.1 Best practices for releasing code

  • Code for the training pipeline used to evaluate the final architectures

  • Code for the search space

  • The hyperparameters used for the final evaluation pipeline, as well as random seeds

    • Our evaluation pipeline has no random seeds or hyperparameters.

  • Code for your NAS method

  • Hyperparameters for your NAS method, as well as random seeds

    • We report all hyperparameters but not random seeds. We run our methods multiple times in all of our experiments and obtain similar results.

E.2 Best practices for comparing NAS methods

  • For all NAS methods you compare, did you use exactly the same NAS benchmark, including the same dataset (with the same training-test split), search space and code for training the architectures and hyperparameters for that code?

  • Did you control for confounding factors (different hardware, versions of DL libraries, different run times for the different methods)?

  • Did you run ablation studies?

    • Not applicable for our method.

  • Did you use the same evaluation protocol for the methods being compared?

  • Did you compare performance over time?

    • Not applicable for our method and benchmarks.

  • Did you compare to random search?

    • Our search space is one-dimensional, allowing us to perform grid search.

  • Did you perform multiple runs of your experiments and report seeds?

    • While we did perform multiple runs of our experiments, we did not report the seeds.

  • Did you use tabular or surrogate benchmarks for in-depth evaluations?

    • Not applicable for our search space.

E.3 Best practices for reporting important details

  • Did you report how you tuned hyperparameters, and what time and resources this required?

    • Our approach has the same non-architecture hyperparameters as training a regular neural network: learning rate, learning rate decay, early stopping, and batch size. We used standard values as reported in appendix B. We did not perform hyperparameter tuning. For our spiral experiments, we ran an experiment comparing performance across network widths, see appendix C.

  • Did you report the time for the entire end-to-end NAS method (rather than, e.g., only for the search phase)?

    • Both times are equivalent in our case, as ours is a one-shot method. Our approach takes the same time as training a regular ResNet.

  • Did you report all the details of your experimental setup?


References

  1. Ahmed and Torresani (2018). MaskConnect: connectivity learning by gradient descent. In The European Conference on Computer Vision (ECCV).
  2. Bender et al. (2018). Understanding and simplifying one-shot architecture search. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholm, Sweden, pp. 550–559.
  3. Cai et al. (2019). ProxylessNAS: direct neural architecture search on target task and hardware.
  4. Dikov et al. (2019). Bayesian learning of neural network architectures.
  5. Gal (2016). Uncertainty in deep learning. Ph.D. thesis, University of Cambridge.
  6. Guo et al. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, Sydney, Australia, pp. 1321–1330.
  7. He et al. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
  8. Hernández-Lobato and Adams (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 1861–1869.
  9. Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 448–456.
  10. Kingma et al. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pp. 2575–2583.
  11. LeCun et al. (2010). MNIST handwritten digit database. AT&T Labs. http://yann.lecun.com/exdb/mnist.
  12. Liu et al. (2019). DARTS: differentiable architecture search. In International Conference on Learning Representations.
  13. Nalisnick et al. (2019). Dropout as a structured shrinkage prior. In ICML, pp. 4712–4722.
  14. Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning.
  15. Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
  16. Shin et al. (2018). Differentiable neural network architecture search.
  17. Tzikas et al. (2008). The variational approximation for Bayesian inference. IEEE Signal Processing Magazine 25(6), pp. 131–146.
  18. Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
  19. Xie et al. (2019). SNAS: stochastic neural architecture search.
  20. Zhou et al. (2019). BayesNAS: a Bayesian approach for neural architecture search. pp. 7603–7613.