Essentially No Barriers in Neural Network Energy Landscape (under review at ICML 2018)
Abstract
Training neural networks involves finding minima of a high-dimensional non-convex loss function. Knowledge of the structure of this energy landscape is sparse. Relaxing from linear interpolations, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, the paths are essentially flat in both the training and test landscapes. This implies that neural networks have enough capacity for structural changes, or that these changes are small between minima. Also, each minimum has at least one vanishing Hessian eigenvalue in addition to those resulting from trivial invariance.
1 Introduction
Neural networks have achieved remarkable success in practical applications such as object recognition [He et al.(2016)He, Zhang, Ren, and Sun, Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten], machine translation [Bahdanau et al.(2015)Bahdanau, Cho, and Bengio, Vinyals & Le(2015)Vinyals and Le] and speech recognition [Hinton et al.(2012)Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, et al., Graves et al.(2013)Graves, Mohamed, and Hinton, Xiong et al.(2017)Xiong, Droppo, Huang, Seide, Seltzer, Stolcke, Yu, and Zweig]. Theoretical insights into why neural networks can be trained successfully despite their high-dimensional and non-convex loss functions are few or based on strong assumptions, such as the eigenvalues of the Hessian at critical points being random [Dauphin et al.(2014)Dauphin, Pascanu, Gülçehre, Cho, Ganguli, and Bengio], linear activations [Choromanska et al.(2014)Choromanska, Henaff, Mathieu, Arous, and LeCun, Kawaguchi(2016)] or wide hidden layers [Soudry & Carmon(2016)Soudry and Carmon, Nguyen & Hein(2017)Nguyen and Hein].
In the current literature, minima of the loss function are typically depicted as points at the bottom of a valley, with a certain width that reflects the generalisation ability of the network whose parameters are given by the location of the minimum [Keskar et al.(2016)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang]. This is also the picture obtained when the loss function of neural networks is visualised in low dimensions [Li et al.(2017)Li, Xu, Taylor, and Goldstein].
In this work, we argue that neural network loss minima are not isolated points in parameter space, but essentially form a connected manifold. More precisely, the part of the parameter space where the loss remains below a certain low threshold forms one single connected component.
We support the above claim by studying the energy landscape of several ResNets and DenseNets on CIFAR10 and CIFAR100: For pairs of minima, we construct continuous paths through parameter space for which the loss remains very close to the value found directly at the minima. An example for such a path is shown in \creffig:min_ene_path.
Our main contribution is the finding of paths
- that connect minima trained from different initialisations which are not related to each other via known operations such as rescaling,
- along which the training loss remains essentially at the same value as at the minima,
- along which the test loss decreases while the test error rate slightly increases.
The abundance of such paths suggests that modern neural networks have enough parameters to maintain good predictions while a large part of the network undergoes structural changes. In closing, we offer a qualitative justification of this behaviour that may offer a handle for future theoretical investigation.
2 Related Work
In discussions about why neural networks generalise despite their extremely large number of parameters, one often finds the argument that wide minima generalise better [Keskar et al.(2016)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang]. Definitions of the width of a minimum are hampered by the fact that, when using ReLU activations, it is possible to scale all parameters in one layer by a constant and the parameters of the following layer by its inverse without changing the outcome [Dinh et al.(2017)Dinh, Pascanu, Bengio, and Bengio]. We extend this viewpoint: not only rescaling leads to directions with flat loss; nontrivial changes in the parameters can as well.
The characterisation of energy surfaces by connecting minima through low-energy paths originates from molecular statistical mechanics. There, the parameters of the energy function are the three-dimensional coordinates of the atoms and molecules in a system. A path with low energy can be used to define reaction coordinates for chemical reactions and to estimate the rate at which a reaction occurs [Wales et al.(1998)Wales, Miller, and Walsh]. In this work, we apply the Automated Nudged Elastic Band (AutoNEB) algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer], which is based on the Nudged Elastic Band (NEB) algorithm [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen], a popular method in this field.
NEB has so far been applied to a multilayer perceptron with a single hidden layer [Ballard et al.(2016)Ballard, Stevenson, Das, and Wales]. High energy barriers between the minima of the network were found when using three hidden neurons, and disappeared upon adding more neurons to the hidden layer. In a follow-up, [Ballard et al.(2017)Ballard, Das, Martiniani, Mehta, Sagun, Stevenson, and Wales] trained a multilayer perceptron with a single hidden layer on the MNIST dataset. They found that with regularisation, the landscape had no significant energy barriers. However, the error rate they report for their network is higher than that achieved even by a linear classifier [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] and far above the 0.35% achieved with a standard CNN [Ciresan et al.(2011)Ciresan, Meier, Masci, Maria Gambardella, and Schmidhuber].
In this work, we apply AutoNEB to nontrivial networks for the first time, and make the surprising observation that different minima of state-of-the-art networks on CIFAR10 and CIFAR100 are connected through essentially flat paths.
3 Method
In the following, we use the terms energy and loss interchangeably.
3.1 Minimum Energy Path
A neural network loss function depends on the architecture, the training set and the network parameters $\theta$. Keeping the former two fixed, we simply write $L(\theta)$ and start with two parameter sets $\theta_1$ and $\theta_2$. In our case, they are minima of the loss function, i.e. they result from training the networks to convergence. The goal is to find the continuous path $p^*$ from $\theta_1$ to $\theta_2$ through parameter space with the lowest maximum loss:

$$p^* = \operatorname*{argmin}_{p \in P(\theta_1, \theta_2)} \; \max_{\theta \in p} L(\theta),$$

where $P(\theta_1, \theta_2)$ denotes the set of continuous paths from $\theta_1$ to $\theta_2$.
For this optimisation to be tractable, the loss function must be sufficiently smooth, i.e. contain no jumps along the path. The output and loss of neural networks are continuous functions of the parameters [Montúfar et al.(2014)Montúfar, Pascanu, Cho, and Bengio]; only the derivative is discontinuous for the case of ReLU activations. However, we cannot give any bounds on how steep the loss function may be. We address this problem by sampling all paths very densely.
Such a lowest path is called the minimum energy path (MEP) [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen]. We refer to the parameter set with the maximum loss on a path as the “saddle point” of the path because it is a true saddle point of the loss function.
In lowdimensional spaces, it is easy to construct the exact minimum energy path between two minima, for example by using dynamic programming on a densely sampled grid.
This is not possible in the high-dimensional spaces of present-day neural networks, whose parameter spaces have millions to billions of dimensions. We thus must resort to methods that construct an approximation of the MEP between two points using local heuristics. In particular, we resort to the Automated Nudged Elastic Band (AutoNEB) algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer]. This method is based on the Nudged Elastic Band (NEB) algorithm [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen].
NEB bends a straight line segment by applying gradient forces until there are no more gradients perpendicular to the path. Then, as for the MEP, the highest point of the resulting path is a critical point. While this critical point is not necessarily the saddle point we were looking for, it gives an upper bound for the energy at the saddle point.
In the following, we present the mechanical model behind and the details of NEB. We then proceed to AutoNEB.
Mechanical Model
A chain of $N$ pivots (parameter sets) $p_1, \dots, p_N$ is connected via springs of stiffness $k$, with the initial and the final pivot fixed to the minima to connect, i.e. $p_1 = \theta_1$ and $p_N = \theta_2$. Using gradient descent, the path that minimises the following energy function is found:

$$E(p_1, \dots, p_N) = \sum_{i=1}^{N} L(p_i) + \frac{k}{2} \sum_{i=2}^{N} \lVert p_i - p_{i-1} \rVert^2. \qquad (1)$$
The problem with this formulation lies in the choice of the spring constant: If $k$ is too small, the distances between the pivots become larger in areas with high energy. However, identifying the highest point on the path and its energy is the very goal of the algorithm, so the sampling rate should be high in the high-energy regions. If, on the other hand, $k$ is chosen too large, it becomes energetically advantageous to shorten and hence straighten the path, as the spring energy grows quadratically with the total length of the path. This cuts corners of the loss surface, and the resulting path can miss the saddle point.
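To make the mechanical model concrete, the band energy above can be sketched in a few lines of numpy on a two-dimensional toy surface. The toy loss, the number of pivots and the value of $k$ are illustrative choices, not settings from our experiments:

```python
import numpy as np

def loss(p):
    # Toy 2D loss surface with two minima near (-1, 0) and (1, 0);
    # stands in for the network loss L(theta).
    x, y = p
    return (x**2 - 1)**2 + y**2

def band_energy(pivots, k):
    # Total energy of the elastic band: sum of per-pivot losses
    # plus quadratic spring terms between neighbouring pivots.
    loss_term = sum(loss(p) for p in pivots)
    spring_term = 0.5 * k * sum(
        np.sum((pivots[i] - pivots[i - 1])**2) for i in range(1, len(pivots))
    )
    return loss_term + spring_term

# Straight-line initialisation between the two minima.
pivots = [np.array([x, 0.0]) for x in np.linspace(-1.0, 1.0, 11)]
total = band_energy(pivots, k=1.0)
```

Gradient descent on this total energy yields the plain elastic band whose shortcomings are discussed above.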
Nudged Elastic Band
Inspired by the above model, [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen] presented the Nudged Elastic Band (NEB). For brevity, we directly present the improved version by [Henkelman & Jónsson(2000)Henkelman and Jónsson]. The force resulting from \crefeq:PEB consists of a force $F_i^L = -\nabla L(p_i)$ derived from the loss and a force $F_i^S = k\,(p_{i+1} - p_i) - k\,(p_i - p_{i-1})$ originating from the springs.
For NEB, the physical forces are modified, or nudged, so that the loss force only acts perpendicularly to the path and the spring force only acts parallel to the path (see also \creffig:all_forces). The direction of the path is defined by the local tangent $\tau_i$ to the path. The two forces now read:

$$F_i = F_i^{L\perp} + F_i^{S\parallel} = \big(F_i^L - (F_i^L \cdot \tau_i)\,\tau_i\big) + k\,\big(\lVert p_{i+1} - p_i \rVert - \lVert p_i - p_{i-1} \rVert\big)\,\tau_i, \qquad (2)$$

where the first term restricts the loss force to its component perpendicular to the path, and the second term is the spring force, which opposes unequal distances between the pivots along the path.
In this formulation, high energy pivots no longer “slide down” from the saddle point. The spring force only redistributes pivots on the path, but does not straighten it.
The local tangent $\tau_i$ is chosen to point in the direction of the adjacent pivot with the higher loss ($n(\cdot)$ normalises a vector to length one):

$$\tau_i = \begin{cases} n(p_{i+1} - p_i) & \text{if } L(p_{i+1}) > L(p_{i-1}),\\ n(p_i - p_{i-1}) & \text{otherwise.} \end{cases}$$

This particular choice of $\tau_i$ prevents kinks in the path and ensures a good approximation near the saddle point [Henkelman & Jónsson(2000)Henkelman and Jónsson].
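The nudged force on a single pivot can be sketched as follows. The toy loss and its gradient are hypothetical stand-ins for a network loss, and the function names are ours:

```python
import numpy as np

def normalise(v):
    return v / np.linalg.norm(v)

def nudged_force(loss, grad, p_prev, p, p_next, k):
    # Tangent points towards the adjacent pivot with the higher loss.
    if loss(p_next) > loss(p_prev):
        tau = normalise(p_next - p)
    else:
        tau = normalise(p - p_prev)
    f_loss = -grad(p)
    # Keep only the loss-force component perpendicular to the path ...
    f_loss_perp = f_loss - np.dot(f_loss, tau) * tau
    # ... and only the spring-force component parallel to it,
    # which opposes unequal spacing of neighbouring pivots.
    f_spring_par = k * (np.linalg.norm(p_next - p)
                        - np.linalg.norm(p - p_prev)) * tau
    return f_loss_perp + f_spring_par

# Toy 2D loss with minima near (-1, 0) and (1, 0), and its gradient.
toy_loss = lambda p: (p[0]**2 - 1)**2 + p[1]**2
toy_grad = lambda p: np.array([4 * p[0] * (p[0]**2 - 1), 2 * p[1]])

f = nudged_force(toy_loss, toy_grad,
                 np.array([-1.0, 0.0]), np.array([0.0, 0.0]),
                 np.array([0.5, 0.0]), k=1.0)
```

At the saddle of the toy loss the gradient vanishes, so the remaining force is the parallel spring force pulling the unevenly spaced pivots apart.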
\crefalg:neb shows how an initial path can be iteratively updated using the above forces. As a companion, \creffig:all_forces visualises the forces of one update step in a two-dimensional example. In this formulation, we use gradient descent to update the path; any other gradient-based optimiser can be used.
Algorithm 1: Nudged Elastic Band (NEB).
  Input: initial path with pivots $p_1, \dots, p_N$, where $p_1 = \theta_1$ and $p_N = \theta_2$.
  for $t = 1, \dots, T$:
    $\tilde{p}_i \leftarrow p_i$ for all $i$  // store current pivots
    for $i = 2, \dots, N-1$:
      compute the force $F_i$ from the stored pivots as in \crefeq:nudged_forces_details
      $p_i \leftarrow \tilde{p}_i + \eta F_i$  // update pivot
  Return: final path $p_1, \dots, p_N$.
The evaluation time of \crefalg:neb rises linearly with the number of iterations and the number of pivots on the path. Computing the forces can trivially be parallelised over the pivots.
As hyperparameters of the algorithm, the spring stiffness $k$ and the number of pivots $N$ need to be chosen in addition to the hyperparameters of the optimiser used. The number of iterations should be chosen large enough for the optimisation to converge.
[Sheppard et al.(2008)Sheppard, Terrell, and Henkelman] claim that a wide range of $k$ leads to the same result on a given loss surface. However, if $k$ is chosen too large, the optimisation can become unstable. If it is too small, an excessive number of iterations is needed before the pivots become equally distributed. We did not find a value for $k$ that worked well across different loss surfaces and numbers of pivots $N$. Instead, we equally space the pivots in each iteration and set the actual spring force to zero. The loss force is still restricted to act perpendicularly to the path. In the literature, this is sometimes referred to as the string method [Sheppard et al.(2008)Sheppard, Terrell, and Henkelman].
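The re-spacing step that replaces the spring force can be sketched as a linear re-parametrisation of the piecewise-linear path; the function name is ours:

```python
import numpy as np

def redistribute(pivots):
    # Re-space the pivots equally along the piecewise-linear path,
    # replacing the explicit spring force (cf. the string method).
    pivots = np.asarray(pivots, dtype=float)
    seg = np.linalg.norm(np.diff(pivots, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each pivot
    targets = np.linspace(0.0, cum[-1], len(pivots))
    # Interpolate every coordinate dimension at the equally spaced arc lengths.
    new = [np.interp(targets, cum, pivots[:, d]) for d in range(pivots.shape[1])]
    return np.stack(new, axis=1)
```

The endpoints stay fixed at the two minima; only the interior pivots move along the path.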
The number of pivots trades off between computational effort on the one hand and subsampling artefacts on the other hand. In neural networks, it is not known what sampling density is needed for traversing the parameter space. We use an adaptive procedure that inserts more pivots where needed:
AutoNEB
The Automated Nudged Elastic Band (AutoNEB) wraps the above NEB algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer]. Initially, it runs NEB with a low number of pivots $N$. After some iterations, it is checked whether the current pivots are sufficient to accurately sample the path. If this is not the case, new pivots are added at locations where it is estimated that the path requires more accuracy.
As a criterion, new pivots are inserted where the true energy curve deviates from the linear interpolation between neighbouring pivot pairs by more than a certain threshold, as visualised in \creffig:auto_neb_insert.
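The insertion criterion can be sketched as follows; the helper name and the handling of the relative threshold are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def insertion_points(pivots, loss, n_eval=9, rel_threshold=0.2):
    # Return new pivot candidates where the true loss deviates from the
    # linear interpolation between neighbouring pivots by more than
    # rel_threshold times the energy range along the path.
    losses = np.array([loss(p) for p in pivots])
    scale = losses.max() - losses.min()
    new_pivots = []
    for a, b, la, lb in zip(pivots[:-1], pivots[1:], losses[:-1], losses[1:]):
        ts = np.linspace(0.0, 1.0, n_eval + 2)[1:-1]  # interior eval points
        deviations = [loss(a + t * (b - a)) - ((1 - t) * la + t * lb)
                      for t in ts]
        worst = int(np.argmax(deviations))
        if deviations[worst] > rel_threshold * scale:
            t = ts[worst]
            new_pivots.append(a + t * (b - a))
    return new_pivots
```

With only the two endpoint minima as pivots, any positive bump between them triggers an insertion at its highest evaluated point.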
Algorithm 2: AutoNEB.
  Input: minima to connect, $\theta_1$ and $\theta_2$.
  Initialise pivots equally spaced on the line segment from $\theta_1$ to $\theta_2$.
  for each cycle:
    optimise the path using NEB (\crefalg:neb)
    evaluate the loss along the path
    insert pivots where the residuum is large
3.2 Local minimum energy paths
AutoNEB can get stuck in local minimum energy paths (local MEPs) which are not the true MEP. This means that the saddle point energies reported by AutoNEB can only be an upper bound for the unknown minimal saddle point energies.
The good news is that the graph of minima and local MEPs has an ultrametric property: Suppose some local MEPs from a minimum $\theta_1$ to $\theta_2$ and from $\theta_2$ to $\theta_3$ are known. We call them $p_{12}$ and $p_{23}$, and write $L(p)$ for the loss at the highest point of a path $p$. The respective saddle point energies give an upper bound for the true saddle point energies (marked with an asterisk):

$$L(p_{12}) \geq L^*(\theta_1, \theta_2) \quad \text{and} \quad L(p_{23}) \geq L^*(\theta_2, \theta_3).$$

Additionally, they yield an upper bound for the saddle point energy between $\theta_1$ and $\theta_3$ (ultrametric triangle inequality):

$$L^*(\theta_1, \theta_3) \leq \max\big(L(p_{12}), L(p_{23})\big).$$

This is easy to see: concatenating the paths $p_{12}$ and $p_{23}$ gives a new path connecting $\theta_1$ to $\theta_3$. The saddle point energy is measured at the maximum of a path, and hence the saddle point energy of the concatenated path is $\max\big(L(p_{12}), L(p_{23})\big)$.
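The concatenation argument can be checked on a discretised path; the loss values below are illustrative, not measured:

```python
def saddle_energy(path_losses):
    # The "saddle point" of a discretised path is its highest-loss pivot.
    return max(path_losses)

def concatenate(p12, p23):
    # p12 ends at theta2 and p23 starts at theta2; drop the duplicated pivot.
    return p12 + p23[1:]

# Illustrative losses along two local MEPs:
p12 = [0.02, 0.40, 0.03]   # theta1 -> theta2
p23 = [0.03, 0.90, 0.02]   # theta2 -> theta3
p13 = concatenate(p12, p23)
# The highest point of the concatenated path is the higher of the
# two saddles, certifying the ultrametric upper bound.
```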
This has three consequences:
1. As soon as the minima and computed paths form a connected graph, upper bounds for all saddle energies are available.
2. When AutoNEB finds a bad local MEP, this can be addressed by computing paths between other pairs of minima. As soon as a lower path is found by concatenating other paths, the bad local MEP can be removed.
3. When we evaluate the saddle point energies of a set of computed local MEPs, we can ignore paths with higher energy than the concatenation of paths with a lower maximal energy.
These lowest local MEPs form a minimum spanning tree in the available graph [Gower & Ross(1969)Gower and Ross]. A Minimum Spanning Tree (MST) can be found efficiently, e.g. using Kruskal’s algorithm.
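A minimal sketch of Kruskal's algorithm over the graph of computed local MEPs, with edges weighted by their saddle point losses (the edge values below are illustrative):

```python
def kruskal_mst(n_minima, edges):
    # edges: list of (saddle_loss, i, j) for computed local MEPs.
    # Returns the minimum spanning tree, i.e. the lowest known
    # saddle losses that keep all minima connected.
    parent = list(range(n_minima))

    def find(x):
        # Find the set representative, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for saddle_loss, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:          # edge connects two components: keep it
            parent[ri] = rj
            mst.append((saddle_loss, i, j))
    return mst

# Three minima, three computed local MEPs (illustrative saddle losses).
tree = kruskal_mst(3, [(0.034, 0, 1), (0.015, 1, 2), (0.090, 0, 2)])
```

The edge with saddle loss 0.090 is discarded: the concatenation of the other two paths already certifies a lower barrier between minima 0 and 2.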
As a consequence, we resort to the heuristic (surely not new, though we have no reference) spelled out in \crefalg:choose_pairs and visualised in \creffig:transition_network to determine which pair of minima to connect next.
The procedure suggests new tuples of minima until local MEPs are known between all pairs of minima. Since running AutoNEB is computationally expensive (effort on the order of training the corresponding network), we stop the iteration when the minimum spanning tree contains only similar saddle point energies.
Algorithm 3: Choosing the next pair of minima to connect.
  Input: set of minima $\{\theta_1, \dots, \theta_M\}$.
  Connect $\theta_1$ to all other minima.  // initial spanning tree
  repeat:
    remove the edge with the highest saddle loss from the spanning tree
    from each of the two resulting trees, select one minimum such that no local MEP is known for the pair
    if no such pair is found:
      reinsert the edge and ignore it when searching for the highest edge in the future
    else:
      compute a new path for the selected pair using AutoNEB
      if its saddle loss is lower than that of the removed edge:
        add it to the tree  // makes the tree "lighter"
      else:
        reinsert the removed edge  // found no better path
  until one local MEP is known for each pair of minima
4 Experiments
We connect minima of different ResNets [He et al.(2016)He, Zhang, Ren, and Sun] and DenseNets [Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten] on the image classification tasks CIFAR10 and CIFAR100. We train several instances of the network from distinct random initialisations following the instructions in the original literature. Then we connect pairs of the minima using AutoNEB.
We report the average crossentropy loss and misclassification rates over the full training and test data for the minima found. For the final evaluation, we compute the minimum spanning tree of the saddle point energies and again report the average crossentropy and misclassification rate on the saddle points in this tree.
4.1 ResNet
We train ResNets on both CIFAR10 (ResNet-20, -32, -44 and -56) and CIFAR100 (ResNet-20 and -44) following the training procedure in [He et al.(2016)He, Zhang, Ren, and Sun]. For each combination of architecture and dataset, we train ten networks. The pairs to connect are suggested by \crefalg:choose_pairs.
AutoNEB is run for a total of 14 cycles of NEB per minimum pair. The loss is evaluated for each pivot on a random batch of 512 training samples for ResNet-20 and ResNet-32 and of 256 training samples for ResNet-44 and ResNet-56 (double the number compared to training).
After each cycle, new pivots are inserted at positions where the loss exceeds the energy estimated by linear interpolation between pivots by at least 20% compared to the total energy difference along the path. This reduces big bumps in the energy first which is beneficial as each additional pivot implies more loss evaluations per iteration. The energy was evaluated on 9 points between each pair of neighbouring pivots. Evaluating the energies after each cycle takes approximately half the time compared to running a corresponding cycle of 1000 steps.
As in the original paper, SGD with momentum and L2 regularisation was used.
We use the following learning rate schedule, see also \creffig:neb_cycles:
- Four cycles of 1000 steps each with learning rate 0.1.
- Two cycles of 2000 steps with learning rate 0.1. In this part of the optimisation, it did not prove necessary to insert new pivots, so insertion was omitted to reduce cost.
- Four cycles of 1000 steps with learning rate 0.01. In this phase, the energy drops significantly.
- Four final cycles of 1000 steps with learning rate 0.001, in which no big improvement was seen.
This learning rate schedule was inspired by [He et al.(2016)He, Zhang, Ren, and Sun].
4.2 DenseNet
We train a DenseNet-40-12 and a DenseNet-100-12-BC on both CIFAR10 and CIFAR100 following the training procedure in [Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten]. The AutoNEB cycles were configured exactly as for the ResNets, except for the batch size, which was set to 64.
Evaluating the gradients for the DenseNets is more expensive than for the ResNets. Thus we only compute four minima for each of these architectures. We then select the lowest minimum and connect it to all others via AutoNEB.
4.3 Results
The saddle point losses for both training and test sets found by AutoNEB are shown in \creffig:summary. For reference, the corresponding numbers can be found in \creftab:summary.
dataset  architecture         | Train energy             | Test energy      | Test error rate [%]
                              | Minima  Saddles  Factor  | Minima  Saddles  | Minima  Saddles  Increase
C10+     ResNet-20            | 0.016   0.034    2       | 0.36    0.37     | 8.5     8.9      0.5
C10+     ResNet-32            | 0.006   0.015    3       | 0.36    0.37     | 7.5     8.1      0.6
C10+     ResNet-44            | 0.003   0.018    6       | 0.36    0.36     | 7.1     7.6      0.5
C10+     ResNet-56            | 0.002   0.017    8       | 0.35    0.36     | 6.9     7.6      0.7
C10+     DenseNet-40-12       | 0.001   0.019    6       | 0.25    0.25     | 5.6     6.6      1.0
C10+     DenseNet-100-12-BC   | 0.001   0.009    16      | 0.21    0.22     | 4.9     5.5      0.6
C100+    ResNet-20            | 0.353   0.674    2       | 1.42    1.49     | 33.3    36.2     2.9
C100+    ResNet-44            | 0.075   0.383    5       | 1.60    1.61     | 30.8    32.7     1.9
C100+    DenseNet-40-12       | 0.010   0.091    9       | 1.30    1.32     | 26.3    27.6     1.3
C100+    DenseNet-100-12-BC   | 0.005   0.031    6       | 1.12    1.15     | 23.7    25.0     1.3
The training energies can be compared to a few other characteristic loss values of a neural network, ordered from high to low:

1. The average loss of an untrained network. For the cross-entropy loss with uniform predictions, this is $\ln 10 \approx 2.3$ on CIFAR10 and $\ln 100 \approx 4.6$ on CIFAR100. The saddle point energies on both training sets are about two orders of magnitude smaller than the loss at the initialisation of the network.

2. The loss of the test set at the minima. All saddle point energies on CIFAR10 are about one order of magnitude smaller than the average minimum energy on the test set. On CIFAR100, the saddle point energies of the ResNets are smaller than a third of the value on the test set. For the DenseNets, they are at least one order of magnitude smaller.

3. The loss of the training set at the minima. The loss at the saddle points is 2 to 20 times as large as the mean loss at the minima. These ratios are, however, noisy, because the denominator can approach zero when the network fits the training data perfectly [Zhang et al.(2017)Zhang, Bengio, Hardt, Recht, and Vinyals].
Instead, we measure the epoch at which the training energy first falls below the saddle point energy during training of the network. This procedure is visualised for the DenseNet-40-12 on CIFAR10 in \creffig:first_passage. It is the network with the highest ratio between minimum energy and saddle point energy.
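The first-passage measurement reduces to a simple scan over the recorded training losses (a sketch; `train_losses` with one value per epoch is an assumed bookkeeping format):

```python
def first_passage_epoch(train_losses, saddle_energy):
    # Epoch at which the training loss first drops below the saddle
    # point energy; a scale-free alternative to the ratio of losses.
    for epoch, loss in enumerate(train_losses):
        if loss < saddle_energy:
            return epoch
    return None  # training never crossed the saddle energy
```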
Furthermore, the test error rate at the saddle point gives an intuition of how much information is actually lost at the saddle point. For the ResNets, the error rate rises by up to 0.7% on CIFAR10 and 3% on CIFAR100. For the DenseNets, it rises by up to 1% on CIFAR10 and 2% on CIFAR100.
This beats, by a large margin, the existing technique of estimating the energy barrier between minima by evaluating the loss on the straight line segment connecting them.
5 Discussion
We have pointed out an intriguing property of the loss surface of current-day deep networks, by upper-bounding the saddle points between the parameter sets that result from stochastic gradient descent, a.k.a. “minima”. These empirical upper bounds are astonishingly close to the loss at the minima themselves. At this point, we cannot give a formal characterisation of the regime in which this finding holds. A formal proof is also complicated by the fact that the loss surface is a function not only of the parameters and the architecture, but also of the training set; and the distribution of real-world structured data such as images or sentences does not lend itself to a compact mathematical representation. That said, we want to make two related arguments that may help explain why we observe no substantial barrier between minima.
5.1 Resilience
State of the art neural networks have dozens or hundreds of neurons / channels per layer, and skip connections between nonadjacent layers. Assume that by training, a parameter set with low loss has been identified. Now if we perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change to still minimize the loss, it may be argued that by adjusting somewhat, the myriad other parameters can “make up” for the change imposed on only one of them. After this relaxation, the procedure and argument can be repeated, though possibly with the perturbation of a different parameter than in the previous rounds.
This type of resilience is exploited and encouraged by procedures such as Dropout [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] or ensembling [Hansen & Salamon(1990)Hansen and Salamon]. It is also the reason why neural networks can be greatly condensed before a substantial increase in loss occurs [Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang].
5.2 Redundancy
Consider the textbook example of a two-layer perceptron that can fit the XOR problem. The two neurons traditionally used in the hidden layer – let’s call them Alice and Bob – are shown in \creffig:xor(A). We can obtain an equivalent network by exchanging Alice and Bob (and permuting the weights of the output neuron, not shown). This network, also corresponding to a minimum of the loss surface, is shown in \creffig:xor(B). Now, any path between these two minima will entail parameter sets such as the one in \creffig:xor(C) that incur high loss.
If, on the other hand, we introduce an auxiliary neuron, Charlie, we can play a small choreography: Enter Charlie. Charlie stands in for Bob. Bob transitions to Alice’s role. Alice takes over from Charlie. Exit Charlie. If the output neuron adjusts its weights so as to disregard the output from the neuron in transition, the entire network incurs no higher loss than at the two original minima. We have constructed a perfect minimum energy path.
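The permutation symmetry between the two XOR minima can be verified numerically. The specific weights below are one hand-constructed XOR solution, not the exact values of the textbook figure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def xor_net(x, w1, b1, w2):
    # Two-layer ReLU perceptron: hidden = relu(x @ w1 + b1), out = hidden @ w2.
    return relu(x @ w1 + b1) @ w2

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# "Alice" computes relu(x1 + x2), "Bob" computes relu(x1 + x2 - 1);
# the output neuron combines them as Alice - 2 * Bob, which equals XOR.
w1 = np.array([[1., 1.], [1., 1.]])
b1 = np.array([0., -1.])
w2 = np.array([1., -2.])

# Swapping Alice and Bob (columns of w1, entries of b1) while permuting
# the output weights accordingly yields a functionally identical network,
# i.e. a second, distinct minimum of the loss surface.
perm = [1, 0]
swapped = xor_net(X, w1[:, perm], b1[perm], w2[perm])
```

A straight line between these two weight settings passes through networks where Alice and Bob are averaged and the XOR function is lost; the auxiliary-neuron choreography above avoids exactly this.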
6 Conclusion
We find that the loss surface of deep neural networks contains paths with constantly low loss. We put forth two closely related arguments in the above. Both hold only if the network has some extra capacity, or degrees of freedom, to spare. Empirically, this seems to be the case for modern-day architectures applied to standard problems. We argue that, due to the width of each layer, the network can heavily rearrange its parameters while still producing an output with low loss.
This has the profound implication that vanishing Hessian eigenvalues exist at the minima apart from those that are analytically zero due to rescaling invariance.
The method opens the door to further empirical research on the energy landscape of neural networks. When the hyperparameters of AutoNEB are further refined, we expect to find even lower paths, up to the level where the true saddle points are recovered. It will then be interesting to see whether certain minima have a higher barrier between them than others. This would make it possible to recursively form clusters of minima, e.g. using single-linkage clustering. This analysis is not yet possible due to the large error bars that we find. In the traditional energy landscape literature, this kind of clustering is visualised in disconnectivity graphs [Wales et al.(1998)Wales, Miller, and Walsh].
For practical applications, we can imagine using the resulting paths as a large ensemble of neural networks, especially given that in practice we observe a slightly lower test loss along the path.
References
 Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 Ballard, A. J., Das, R., Martiniani, S., Mehta, D., Sagun, L., Stevenson, J. D., and Wales, D. J. Energy landscapes for machine learning. Physical Chemistry Chemical Physics (Incorporating Faraday Transactions), 19:12585–12603, 2017. doi: 10.1039/C7CP01108C.
 Ballard, Andrew J., Stevenson, Jacob D., Das, Ritankar, and Wales, David J. Energy landscapes for a machine learning application to series data. J. Chem. Phys., 144(12):124119, Mar 2016. ISSN 1089-7690. doi: 10.1063/1.4944672. URL http://dx.doi.org/10.1063/1.4944672.
 Choromanska, Anna, Henaff, Mikael, Mathieu, Michaël, Arous, Gérard Ben, and LeCun, Yann. The loss surface of multilayer networks. CoRR, abs/1412.0233, 2014. URL http://arxiv.org/abs/1412.0233.
 Ciresan, Dan C, Meier, Ueli, Masci, Jonathan, Maria Gambardella, Luca, and Schmidhuber, Jürgen. Flexible, high performance convolutional neural networks for image classification. In IJCAI Proceedings, International Joint Conference on Artificial Intelligence, volume 22, pp. 1237. Barcelona, Spain, 2011.
 Dauphin, Yann, Pascanu, Razvan, Gülçehre, Çaglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014. URL http://arxiv.org/abs/1406.2572.
 Dinh, Laurent, Pascanu, Razvan, Bengio, Samy, and Bengio, Yoshua. Sharp minima can generalize for deep nets. In Precup, Doina and Teh, Yee Whye (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1019–1028, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/dinh17b.html.
 Gower, J. C. and Ross, G. J. S. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1):54–64, 1969. ISSN 0035-9254, 1467-9876. URL http://www.jstor.org/stable/2346439.
 Graves, Alex, Mohamed, Abdelrahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649. IEEE, 2013.
 Hansen, Lars Kai and Salamon, Peter. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001, 1990.
 He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Henkelman, Graeme and Jónsson, Hannes. Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points. The Journal of chemical physics, 113(22):9978–9985, 2000.
 Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdelrahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, pp. 3, 2017.
 Jónsson, Hannes, Mills, Greg, and Jacobsen, Karsten W. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and quantum dynamics in condensed phase simulations, pp. 385–404. World Scientific, 1998.
 Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
 Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Kolsbjerg, Esben L, Groves, Michael N, and Hammer, Bjørk. An automated nudged elastic band method. The Journal of chemical physics, 145(9):094107, 2016.
 LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li, Hao, Xu, Zheng, Taylor, Gavin, and Goldstein, Tom. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
 Liu, Zhuang, Li, Jianguo, Shen, Zhiqiang, Huang, Gao, Yan, Shoumeng, and Zhang, Changshui. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2736–2744, 2017.
 Montúfar, Guido, Pascanu, Razvan, Cho, Kyunghyun, and Bengio, Yoshua. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
 Nguyen, Quynh and Hein, Matthias. The loss surface of deep and wide neural networks. In ICML, 2017.
 Sheppard, Daniel, Terrell, Rye, and Henkelman, Graeme. Optimization methods for finding minimum energy paths. The Journal of Chemical Physics, 128(13):134106, Apr 2008. ISSN 1089-7690. doi: 10.1063/1.2841941. URL http://dx.doi.org/10.1063/1.2841941.
 Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. ArXiv e-prints, May 2016.
 Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Vinyals, Oriol and Le, Quoc V. A neural conversational model. CoRR, abs/1506.05869, 2015. URL http://arxiv.org/abs/1506.05869.
 Wales, David J, Miller, Mark A, and Walsh, Tiffany R. Archetypal energy landscapes. Nature, 394(6695):758, 1998.
 Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., Yu, D., and Zweig, G. Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2410–2423, Dec 2017. ISSN 2329-9290. doi: 10.1109/TASLP.2017.2756440.
 Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. In ICLR, 2017.