Essentially No Barriers in Neural Network Energy Landscape (under review at ICML 2018)


Abstract

Training neural networks involves finding minima of a high-dimensional non-convex loss function. Knowledge of the structure of this energy landscape is sparse. Relaxing from linear interpolations, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, the paths are essentially flat in both the training and test landscapes. This implies that neural networks have enough capacity for structural changes, or that these changes are small between minima. Also, each minimum has at least one vanishing Hessian eigenvalue in addition to those resulting from trivial invariance.


1 Introduction

Neural networks have achieved remarkable success in practical applications such as object recognition [He et al.(2016)He, Zhang, Ren, and Sun, Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten], machine translation [Bahdanau et al.(2015)Bahdanau, Cho, and Bengio, Vinyals & Le(2015)Vinyals and Le], and speech recognition [Hinton et al.(2012)Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, et al., Graves et al.(2013)Graves, Mohamed, and Hinton, Xiong et al.(2017)Xiong, Droppo, Huang, Seide, Seltzer, Stolcke, Yu, and Zweig]. Theoretical insights into why neural networks can be trained successfully despite their high-dimensional and non-convex loss functions are few or based on strong assumptions, such as the eigenvalues of the Hessian at critical points being random [Dauphin et al.(2014)Dauphin, Pascanu, Gülçehre, Cho, Ganguli, and Bengio], linear activations [Choromanska et al.(2014)Choromanska, Henaff, Mathieu, Arous, and LeCun, Kawaguchi(2016)] or wide hidden layers [Soudry & Carmon(2016)Soudry and Carmon, Nguyen & Hein(2017)Nguyen and Hein].

In the current literature, minima of the loss function are typically depicted as points at the bottom of a valley whose width reflects how well the network with parameters given by the location of the minimum generalises [Keskar et al.(2016)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang]. This is also the picture obtained when the loss function of neural networks is visualised in low dimensions [Li et al.(2017)Li, Xu, Taylor, and Goldstein].

In this work, we argue that neural network loss minima are not isolated points in parameter space, but essentially form a connected manifold. More precisely, the part of the parameter space where the loss remains below a certain low threshold forms one single connected component.

We support the above claim by studying the energy landscape of several ResNets and DenseNets on CIFAR10 and CIFAR100: for pairs of minima, we construct continuous paths through parameter space along which the loss remains very close to the value found directly at the minima. An example of such a path is shown in Fig. 1.

Figure 1: Left: A slice through the one million-dimensional training loss function of DenseNet-40-12 on CIFAR10 and the minimum energy path found by our method. The plane is spanned by the two minima and the mean of the nodes of the path. Right: Loss along the linear line segment between minima, and along our high-dimensional path. Surprisingly, the energy along this path is essentially flat.

Our main contribution is the finding of paths

  1. that connect minima trained from different initialisations which are not related to each other via known operations such as rescaling,

  2. along which the training loss remains essentially at the same value as at the minima,

  3. along which the test loss decreases while the test error rate slightly increases.

The abundance of such paths suggests that modern neural networks have enough parameters such that they can achieve good predictions while a big part of the network undergoes structural changes. In closing, we offer qualitative justification of this behaviour that may offer a handle for future theoretical investigation.

2 Related Work

In discussions about why neural networks generalise despite their extremely large number of parameters, one often finds the argument that wide minima generalise better [Keskar et al.(2016)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang]. Definitions of the width of a minimum are hampered by the fact that, with ReLU activations, all parameters in one layer can be scaled by a constant and the following layer by its inverse without changing the outcome [Dinh et al.(2017)Dinh, Pascanu, Bengio, and Bengio]. We extend this viewpoint: not only rescaling, but also nontrivial parameter changes can lead to directions along which the loss stays flat.

The characterisation of energy surfaces by connecting minima through low-energy paths originates from molecular statistical mechanics, where the parameters of the energy function are the three-dimensional coordinates of the atoms and molecules in a system. A path with low energy can be used to define reaction coordinates for chemical reactions and to estimate the rate at which a reaction occurs [Wales et al.(1998)Wales, Miller, and Walsh]. In this work, we apply the Automated Nudged Elastic Band (AutoNEB) algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer], which is based on the Nudged Elastic Band (NEB) algorithm [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen], a popular method in this field.

NEB has so far been applied to a multi-layer perceptron with a single hidden layer [Ballard et al.(2016)Ballard, Stevenson, Das, and Wales]. High energy barriers were found between the minima of a network with three hidden neurons, and these barriers disappeared upon adding more neurons to the hidden layer. In a follow-up, [Ballard et al.(2017)Ballard, Das, Martiniani, Mehta, Sagun, Stevenson, and Wales] trained a multi-layer perceptron with a single hidden layer on the MNIST dataset. They found that with ℓ2-regularisation the landscape had no significant energy barriers. However, they report an error rate for their network that is higher than that achieved even by a linear classifier [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] and far above the 0.35% achieved with a standard CNN [Ciresan et al.(2011)Ciresan, Meier, Masci, Maria Gambardella, and Schmidhuber].

In this work, we apply AutoNEB to nontrivial networks for the first time, and make the surprising observation that different minima of state-of-the-art networks on CIFAR10 and CIFAR100 are connected through essentially flat paths.

3 Method

In the following, we use the terms energy and loss interchangeably.

3.1 Minimum Energy Path

A neural network loss function $L$ depends on the architecture, the training set and the network parameters $\theta$. Keeping the former two fixed, we simply write $L(\theta)$ and start with two parameter sets $\theta_1$ and $\theta_2$. In our case, they are minima of the loss function, i.e. they result from training the networks to convergence. The goal is to find the continuous path $p^*$ from $\theta_1$ to $\theta_2$ through parameter space with the lowest maximum loss:

$p^* = \operatorname*{arg\,min}_{p \in P(\theta_1, \theta_2)} \; \max_{\theta \in p} L(\theta),$

where $P(\theta_1, \theta_2)$ denotes the set of all continuous paths from $\theta_1$ to $\theta_2$.

For this optimisation to be tractable, the loss function must be sufficiently smooth, i.e. contain no jumps along the path. The output and loss of neural networks are continuous functions of the parameters [Montúfar et al.(2014)Montúfar, Pascanu, Cho, and Bengio]; only the derivative is discontinuous for the case of ReLU activations. However, we cannot give any bounds on how steep the loss function may be. We address this problem by sampling all paths very densely.
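To make the object of study concrete, the following Python sketch (our own illustration, not the authors' code; it uses PyTorch and assumes a classification model and labelled data batches are given) evaluates the loss along a densely sampled path in parameter space, e.g. the straight line segment between two minima that serves as the baseline in Fig. 1:

import torch
import torch.nn.functional as F
from torch.nn.utils import parameters_to_vector, vector_to_parameters

@torch.no_grad()
def loss_at(model, theta, batches):
    # Load the flat parameter vector `theta` into `model` and average the
    # cross-entropy loss over the given (inputs, targets) batches.
    vector_to_parameters(theta, model.parameters())
    model.eval()
    total, n = 0.0, 0
    for x, y in batches:
        total += F.cross_entropy(model(x), y, reduction="sum").item()
        n += len(y)
    return total / n

def losses_along_segment(model, theta_1, theta_2, batches, n_points=50):
    # Densely sample the straight line between two minima; the maximum of the
    # returned losses is the energy barrier along the linear path (cf. Fig. 1).
    alphas = torch.linspace(0.0, 1.0, n_points)
    return [loss_at(model, (1 - a) * theta_1 + a * theta_2, batches) for a in alphas]

# Usage: theta_i = parameters_to_vector(model_i.parameters()) for two independently
# trained copies of the same architecture; the same helper works for any list of path pivots.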

Such a lowest path is called the minimum energy path (MEP) [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen]. We refer to the parameter set with the maximum loss on a path as the “saddle point” of the path because it is a true saddle point of the loss function.

In low-dimensional spaces, it is easy to construct the exact minimum energy path between two minima, for example by using dynamic programming on a densely sampled grid.

This is not possible in the high-dimensional parameter spaces of present-day neural networks, which have millions to billions of dimensions. We thus must resort to methods that construct an approximation of the MEP between two points using local heuristics. In particular, we use the Automated Nudged Elastic Band (AutoNEB) algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer]. This method is based on the Nudged Elastic Band (NEB) algorithm [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen].

NEB bends a straight line segment by applying gradient forces until there are no more gradients perpendicular to the path. Then, as for the MEP, the highest point of the resulting path is a critical point. While this critical point is not necessarily the saddle point we were looking for, it gives an upper bound for the energy at the saddle point.

In the following, we present the mechanical model behind and the details of NEB. We then proceed to AutoNEB.

Mechanical Model

A chain of pivots (parameter sets) $p_1, \dots, p_N$ is connected via springs of stiffness $k$, with the initial and the final pivot fixed to the minima to connect, i.e. $p_1 = \theta_1$ and $p_N = \theta_2$. Using gradient descent, the path that minimises the following energy function is found:

$E(p_1, \dots, p_N) = \sum_{i=1}^{N} L(p_i) + \frac{k}{2} \sum_{i=1}^{N-1} \| p_{i+1} - p_i \|^2.$   (1)

The problem with this formulation lies in the choice of the spring constant: If $k$ is too small, the distances between the pivots become larger in areas with high energy. However, identifying the highest point on the path and its energy is the very goal of the algorithm, so the sampling rate should be high in the high-energy regions. If, on the other hand, $k$ is chosen too large, it becomes energetically advantageous to shorten and hence straighten the path, as the spring energy grows quadratically with the total length of the path. This cuts corners of the loss surface and the resulting path can miss the saddle point.
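For reference, the plain elastic-band energy of Eq. (1) can be written in a few lines. This is a sketch in our own notation, assuming a loss function L that operates on flat parameter vectors:

import numpy as np

def elastic_band_energy(pivots, L, k):
    # pivots: array of shape (N, D) holding the parameter vectors p_1, ..., p_N.
    loss_term = sum(L(p) for p in pivots)
    spring_term = 0.5 * k * sum(
        float(np.sum((pivots[i + 1] - pivots[i]) ** 2)) for i in range(len(pivots) - 1)
    )
    return loss_term + spring_term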

Nudged Elastic Band

Inspired by the above model, [Jónsson et al.(1998)Jónsson, Mills, and Jacobsen] presented the Nudged Elastic Band (NEB). For brevity, we directly present the improved version by [Henkelman & Jónsson(2000)Henkelman and Jónsson]. The force resulting from Eq. (1) consists of a force derived from the loss and a force originating from the springs:

$F_i = F_i^{L} + F_i^{S}, \qquad F_i^{L} = -\nabla L(p_i), \qquad F_i^{S} = k\,(p_{i+1} - p_i) - k\,(p_i - p_{i-1}).$

For NEB, the physical forces are modified, or nudged, so that the loss force only acts perpendicular to the path and the spring force only parallel to the path (see also Fig. 2):

$F_i^{\mathrm{NEB}} = F_i^{L}\big|_{\perp} + F_i^{S}\big|_{\parallel}.$

The direction of the path is defined by the local tangent $\hat\tau_i$ to the path. The two forces now read:

$F_i^{L}\big|_{\perp} = -\nabla L(p_i) + \big(\nabla L(p_i) \cdot \hat\tau_i\big)\,\hat\tau_i, \qquad F_i^{S}\big|_{\parallel} = \big(F_i^{S} \cdot \hat\tau_i\big)\,\hat\tau_i,$   (2)

where the spring force opposes unequal distances along the path:

$F_i^{S} \cdot \hat\tau_i = k\,\big(\| p_{i+1} - p_i \| - \| p_i - p_{i-1} \|\big).$

Figure 2: Two dimensional loss surface, with two minima connected by a minimum energy path (MEP) and a nudged elastic band (NEB) at iteration 0, 10 and converged. Construction of NEB force for one pivot. The tangent points to the neighbouring pivot with higher energy. The spring force acts parallel and the loss force perpendicular to the tangent.

In this formulation, high energy pivots no longer “slide down” from the saddle point. The spring force only re-distributes pivots on the path, but does not straighten it.

The local tangent $\hat\tau_i$ is chosen to point in the direction of one of the adjacent pivots ($\hat{\cdot}$ normalises to length one):

$\hat\tau_i = \widehat{p_{i+1} - p_i}$ if $L(p_{i+1}) > L(p_{i-1})$, and $\hat\tau_i = \widehat{p_i - p_{i-1}}$ otherwise.

This particular choice of $\hat\tau_i$ prevents kinks in the path and ensures a good approximation near the saddle point [Henkelman & Jónsson(2000)Henkelman and Jónsson].

Algorithm 1 shows how an initial path can be iteratively updated using the above forces. As a companion, Fig. 2 visualises the forces in one update step in a two-dimensional example. In this formulation, we use gradient descent to update the path; any other gradient-based optimiser can be used.

Algorithm 1: NEB
Input: initial path given by pivots $p_1, \dots, p_N$ with $p_1 = \theta_1$ and $p_N = \theta_2$.
for $t = 1, \dots, T$:
    store a copy of the current pivots   // all forces in this iteration are computed on the same path
    for $i = 2, \dots, N-1$:
        compute the force $F_i$ on the stored pivot $p_i$   // forces as in Eq. (2)
        update $p_i$ by a gradient step along $F_i$
Return: final path $p_1, \dots, p_N$.
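The following NumPy sketch illustrates one NEB iteration with the nudged forces of Eq. (2). The function names and the helper grad_L are our own, and a fixed learning rate stands in for whatever optimiser is actually used:

import numpy as np

def tangent(pivots, L_values, i):
    # Unit tangent at pivot i, pointing towards the higher-energy neighbour.
    if L_values[i + 1] > L_values[i - 1]:
        t = pivots[i + 1] - pivots[i]
    else:
        t = pivots[i] - pivots[i - 1]
    return t / (np.linalg.norm(t) + 1e-12)

def neb_step(pivots, L, grad_L, k=1.0, lr=0.01):
    # One synchronous update of all inner pivots; the end points stay fixed.
    old = pivots.copy()                          # forces are computed on the same path
    L_values = [L(p) for p in old]
    for i in range(1, len(pivots) - 1):
        tau = tangent(old, L_values, i)
        g = grad_L(old[i])
        f_loss_perp = -g + np.dot(g, tau) * tau                   # loss force, perpendicular part
        d_next = np.linalg.norm(old[i + 1] - old[i])
        d_prev = np.linalg.norm(old[i] - old[i - 1])
        f_spring_par = k * (d_next - d_prev) * tau                # spring force, parallel part
        pivots[i] = old[i] + lr * (f_loss_perp + f_spring_par)    # gradient step
    return pivots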

The evaluation time of Algorithm 1 rises linearly with the number of iterations $T$ and the number of pivots $N$ on the path. Computing the forces can trivially be parallelised over the pivots.

As hyperparameters of the algorithm, the spring stiffness $k$ and the number of pivots $N$ need to be chosen in addition to the hyperparameters of the optimiser used. The number of iterations $T$ should be chosen large enough for the optimisation to converge.

[Sheppard et al.(2008)Sheppard, Terrell, and Henkelman] claim that a wide range of values for $k$ leads to the same result on a given loss surface. However, if $k$ is chosen too large, the optimisation can become unstable. If it is too small, an excessive number of iterations is needed before the pivots become equally distributed. We did not find a value for $k$ that worked well across different loss surfaces and numbers of pivots $N$. Instead, we equally space the pivots in each iteration and set the actual spring force to zero. The loss force is still restricted to act perpendicular to the path. In the literature, this is sometimes referred to as the string method [Sheppard et al.(2008)Sheppard, Terrell, and Henkelman].
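A minimal sketch of the re-spacing step we use instead of an explicit spring force (our own helper, assuming the path is given as an array of flat parameter vectors) could look as follows:

import numpy as np

def respace_pivots(pivots):
    # pivots: array (N, D). Returns N pivots equally spaced in arc length along
    # the current piecewise-linear path; the end points are left untouched.
    deltas = np.linalg.norm(np.diff(pivots, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(deltas)])         # arc length at each pivot
    targets = np.linspace(0.0, cum[-1], len(pivots))          # equally spaced arc lengths
    new = [pivots[0]]
    for s in targets[1:-1]:
        j = np.searchsorted(cum, s) - 1                       # segment containing arc length s
        j = min(max(j, 0), len(deltas) - 1)
        w = (s - cum[j]) / (deltas[j] + 1e-12)
        new.append((1 - w) * pivots[j] + w * pivots[j + 1])   # linear interpolation inside segment
    new.append(pivots[-1])
    return np.stack(new)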

The number of pivots trades off between computational effort on the one hand and subsampling artefacts on the other hand. In neural networks, it is not known what sampling density is needed for traversing the parameter space. We use an adaptive procedure that inserts more pivots where needed:

AutoNEB

The Automated Nudged Elastic Band (AutoNEB) algorithm wraps the above NEB algorithm [Kolsbjerg et al.(2016)Kolsbjerg, Groves, and Hammer]. Initially, it runs NEB with a small number of pivots. After some iterations, it checks whether the current pivots are sufficient to accurately sample the path. If this is not the case, new pivots are added at locations where it is estimated that the path requires more accuracy.

As a criterion, new pivots are inserted where the true energy curve deviates from the linear interpolation between a neighbouring pivot pair by more than a certain threshold, as visualised in Fig. 3.
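A sketch of this insertion criterion (our own names; the threshold and the number of evaluation points are hyperparameters, cf. Section 4.1) is given below:

import numpy as np

def pairs_needing_pivots(pivots, L, threshold, n_eval=9):
    # Return indices i such that a new pivot should be inserted between pivot i and i+1.
    flagged = []
    for i in range(len(pivots) - 1):
        worst = 0.0
        for a in np.linspace(0.0, 1.0, n_eval + 2)[1:-1]:        # interior points only
            theta = (1 - a) * pivots[i] + a * pivots[i + 1]
            interpolated = (1 - a) * L(pivots[i]) + a * L(pivots[i + 1])
            worst = max(worst, L(theta) - interpolated)           # true loss above the chord
        if worst > threshold:
            flagged.append(i)
    return flagged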

Algorithm 2: AutoNEB
Input: minima $\theta_1$ and $\theta_2$ to connect.
Initialise pivots equally spaced on the line segment between $\theta_1$ and $\theta_2$.
for each AutoNEB cycle:
    optimise the path using NEB (Algorithm 1)
    evaluate the loss along the path
    insert pivots where the residuum between the true and the interpolated loss is large

Figure 3: New pivots are inserted in each cycle of AutoNEB when the true energy at an interpolated position between two pivots rises too high compared to the interpolated energy. Where the deviation exceeds the threshold, a new pivot is inserted; where the difference is small enough, no additional pivot is needed.

3.2 Local minimum energy paths

AutoNEB can get stuck in local minimum energy paths (local MEPs) which are not the true MEP. This means that the saddle point energies reported by AutoNEB can only be an upper bound for the unknown minimal saddle point energies.

The good news is that the graph of minima and local MEPs has an ultrametric property: Suppose local MEPs from a minimum $\theta_1$ to $\theta_2$ and from $\theta_2$ to $\theta_3$ are known. We call them $p_{12}$ and $p_{23}$. The respective saddle point energies give an upper bound for the true saddle point energies (marked with an asterisk):

$L(s_{12}) \ge L(s^{*}_{12}), \qquad L(s_{23}) \ge L(s^{*}_{23}).$

Additionally, they yield an upper bound for the saddle point energy between $\theta_1$ and $\theta_3$ (ultrametric triangle inequality):

$L(s^{*}_{13}) \le \max\big( L(s_{12}),\, L(s_{23}) \big).$

This is easy to see: concatenating the paths $p_{12}$ and $p_{23}$ gives a new path connecting $\theta_1$ to $\theta_3$. The saddle point energy is measured at the maximum of a path, and hence the saddle point energy of the concatenated path is $\max\big( L(s_{12}), L(s_{23}) \big)$, which bounds $L(s^{*}_{13})$ from above.

This has three consequences:

  1. As soon as the minima and computed paths form a connected graph, upper bounds for all saddle energies are available.

  2. When AutoNEB finds a bad local MEP, this can be addressed by computing paths between other pairs of minima. As soon as a lower path is found by concatenating other paths, the bad local MEP can be removed.

  3. When we evaluate the saddle point energies of a set of computed local MEPs, we can ignore paths with higher energy than a concatenation of paths with a lower maximal energy.
    These lowest local MEPs form a minimum spanning tree in the available graph [Gower & Ross(1969)Gower and Ross]. A minimum spanning tree (MST) can be found efficiently, e.g. using Kruskal's algorithm; a minimal sketch follows below.
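To make the third point concrete, here is a minimal Python sketch (not the authors' code; the data layout is our assumption) of Kruskal's algorithm over the graph of minima, with each known local MEP entered as an edge weighted by its saddle point loss. Edges left out of the resulting tree are dominated by a concatenation of lower paths:

def minimum_spanning_tree(n_minima, edges):
    # edges: list of (saddle_loss, i, j) for known local MEPs between minima i and j.
    parent = list(range(n_minima))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for loss, i, j in sorted(edges):   # consider the lowest saddle points first
        ri, rj = find(i), find(j)
        if ri != rj:                   # keep the edge only if it connects two components
            parent[ri] = rj
            tree.append((loss, i, j))
    return tree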

As a consequence, we resort to the heuristic (surely not new, though we have no reference) spelled out in Algorithm 3 and visualised in Fig. 4 to determine which pair of minima to connect next.

The procedure suggests new pairs of minima until local MEPs are known between all pairs. Since running AutoNEB is computationally expensive (the effort is on the order of training the corresponding network), we stop the iteration when the minimum spanning tree contains only similar saddle point energies.

Algorithm 3: Energy Landscape Exploration
Input: set of minima $\theta_1, \dots, \theta_n$.
Connect $\theta_1$ to all other minima.   // initial spanning tree
repeat:
    remove the edge with the highest saddle loss from the spanning tree
    from each of the two resulting trees, select one minimum such that no local MEP is known for this pair
    if no such pair is found:
        re-insert the edge and ignore it when searching for the highest edge in the future
    else:
        compute a new path between the selected pair using AutoNEB
        if the new saddle loss is lower than that of the removed edge:
            add the new edge to the tree   // makes the tree "lighter"
        else:
            re-insert the removed edge   // found no better path
until a local MEP is known for each pair of minima

Figure 4: Algorithm for determining the next minimum pair to connect. First, all minima are connected to one particular minimum. Then, bad edges in the minimum spanning tree are replaced with new edges in the full graph.
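A sketch of the pair-selection step of Algorithm 3 (our own illustration, using networkx and assuming the graph of known local MEPs is connected, as it is after the initial step of the algorithm) is given below:

import itertools
import networkx as nx

def suggest_next_pair(known, n_minima, ignored=frozenset()):
    # known: dict {(i, j): saddle_loss} with i < j for every computed local MEP.
    # Returns a pair of minima to connect next with AutoNEB, or None if nothing is left to improve.
    g = nx.Graph()
    g.add_nodes_from(range(n_minima))
    for (i, j), loss in known.items():
        g.add_edge(i, j, weight=loss)
    mst = nx.minimum_spanning_tree(g, weight="weight")
    # Try to replace spanning-tree edges, starting with the highest saddle loss.
    for i, j, _ in sorted(mst.edges(data="weight"), key=lambda e: -e[2]):
        if (min(i, j), max(i, j)) in ignored:
            continue                                          # edge already deemed irreplaceable
        mst.remove_edge(i, j)
        comp_a, comp_b = nx.connected_components(mst)         # the two resulting trees
        mst.add_edge(i, j)                                    # restore before returning / continuing
        for u, v in itertools.product(comp_a, comp_b):
            pair = (min(u, v), max(u, v))
            if pair not in known:                             # no local MEP known for this pair yet
                return pair
    return None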

4 Experiments

We connect minima of different ResNets [He et al.(2016)He, Zhang, Ren, and Sun] and DenseNets [Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten] on the image classification tasks CIFAR10 and CIFAR100. We train several instances of the network from distinct random initialisations following the instructions in the original literature. Then we connect pairs of the minima using AutoNEB.

We report the average cross-entropy loss and misclassification rates over the full training and test data for the minima found. For the final evaluation, we compute the minimum spanning tree of the saddle point energies and again report the average cross-entropy and misclassification rate on the saddle points in this tree.

4.1 ResNet

We train ResNets on both CIFAR10 (ResNet-20, -32, -44 and -56) and CIFAR100 (ResNet-20 and -44) following the training procedure in [He et al.(2016)He, Zhang, Ren, and Sun]. For each combination of architecture and dataset, we train ten networks. The pairs to connect are suggested by Algorithm 3.

AutoNEB is run for a total of 14 cycles of NEB per minimum pair. The loss is evaluated for each pivot on a random batch of 512 training samples for ResNet-20 and ResNet-32 and 256 training samples for ResNet-44 and ResNet-56 (double the number compared to training).

After each cycle, new pivots are inserted at positions where the loss exceeds the energy estimated by linear interpolation between pivots by at least 20% of the total energy difference along the path. This reduces big bumps in the energy first, which is beneficial, as each additional pivot implies more loss evaluations per iteration. The energy was evaluated on 9 points between each pair of neighbouring pivots. Evaluating the energies after each cycle takes approximately half the time of running a corresponding cycle of 1000 steps.

As in the original paper, SGD with momentum and ℓ2-regularisation (weight decay) was used.

We use the following learning rate schedule, see also Fig. 6:

  1. Four cycles of 1000 steps each with learning rate 0.1.

  2. Two cycles with 2000 steps and learning rate 0.1. In this part of the optimisation, it did not prove necessary to insert new pivots, so the insertion step was omitted to reduce cost.

  3. This was followed by four cycles of 1000 steps with learning rate 0.01. In this phase, the energy drops significantly.

  4. No big improvement was seen in the last four cycles of 1000 steps with a learning rate of 0.001.

This learning rate schedule was inspired by [He et al.(2016)He, Zhang, Ren, and Sun].

Figure 5: Minimum and saddle point energies for all considered architectures. Each pair of points represents a minimum and the corresponding saddle point, connected by a straight line. On the test set, the points of the minima and the saddle points are practically identical. The left chart shows CIFAR10, the right CIFAR100. The lower row (blue) is evaluated on the training set, the higher energies (orange) on the test set.
Figure 6: Typical energies along the path after different NEB cycles for ResNet-20 on CIFAR10: (1) After the first cycle, typically one or two pivots have to be inserted. (2) After four cycles with learning rate 0.1, the energy is reduced by a factor of five. Between pivots we find low-energy regions that we attribute to the high learning rate. (3) The first cycle with low learning rate reduces the energy by another factor of 2. (4) After 14 cycles, NEB has converged and no major energy bumps are visible between the pivots. On the test set, the loss behaves similarly to the training set, but drops below its value at the minima.

4.2 DenseNet

We train a DenseNet-40-12 and a DenseNet-100-12-BC on both CIFAR10 and CIFAR100 following the training procedure in [Huang et al.(2017)Huang, Liu, Weinberger, and van der Maaten]. The AutoNEB cycles were configured exactly as for the ResNets, except for the batch size, which was set to 64.

Evaluating the gradients for DenseNets is more expensive than for the ResNets. Thus we only compute four minima for these architectures. Then we select the lowest minimum and connect it to all others via AutoNEB.

4.3 Results

The saddle point losses for both training and test sets found by AutoNEB are shown in Fig. 5. For reference, the corresponding numbers can be found in Table 1.


Dataset  Architecture         Train energy               Test energy         Test error rate [%]
                              Minima   Saddles  Factor   Minima   Saddles    Minima   Saddles  Increase
C10+     ResNet-20            0.016    0.034    2        0.36     0.37       8.5      8.9      0.5
C10+     ResNet-32            0.006    0.015    3        0.36     0.37       7.5      8.1      0.6
C10+     ResNet-44            0.003    0.018    6        0.36     0.36       7.1      7.6      0.5
C10+     ResNet-56            0.002    0.017    8        0.35     0.36       6.9      7.6      0.7
C10+     DenseNet-40-12       0.001    0.019    6        0.25     0.25       5.6      6.6      1.0
C10+     DenseNet-100-12-BC   0.001    0.009    16       0.21     0.22       4.9      5.5      0.6
C100+    ResNet-20            0.353    0.674    2        1.42     1.49       33.3     36.2     2.9
C100+    ResNet-44            0.075    0.383    5        1.60     1.61       30.8     32.7     1.9
C100+    DenseNet-40-12       0.010    0.091    9        1.30     1.32       26.3     27.6     1.3
C100+    DenseNet-100-12-BC   0.005    0.031    6        1.12     1.15       23.7     25.0     1.3

Table 1: Quantitative results. "Minima" and "Saddles" denote averages over all minima and over the saddle points of the minimum spanning tree; "Factor" is the ratio of saddle to minimum training energy; "Increase" gives the rise of the test error rate from minima to saddles.

The training energies can be compared to a few other characteristic loss values of a neural network, ordered from high to low:

  1. The average loss of an untrained network. For the cross-entropy loss over C equiprobable classes this is ln C, i.e. ln 10 ≈ 2.3 on CIFAR10 and ln 100 ≈ 4.6 on CIFAR100.
    The saddle point energies on both training sets are about two orders of magnitude smaller than the loss at the initialisation of the network.

  2. The loss of the test set at the minima.
    All saddle point energies on CIFAR10 are about one order of magnitude smaller than the average minimum energy on the test set. On CIFAR100, the saddle point energies of the ResNets are smaller than a third of the value on the test set. For the DenseNets, they are at least one order of magnitude smaller.

  3. The loss of the training set at the minima.
    The loss at the saddle points is 2–20 times as large as the mean loss of the minima. These ratios are, however, not very meaningful, because the denominator can approach zero when the network fits the training data perfectly [Zhang et al.(2017)Zhang, Bengio, Hardt, Recht, and Vinyals].
    Instead, we measure the epoch at which the training energy falls below the saddle point energy during training of the network. This procedure is visualised for the DenseNet-40-12 on CIFAR10 in Fig. 7. It is the network with the highest ratio between minimum energy and saddle point energy.

Furthermore, the test error rate at the saddle point gives an intuition of how much information was actually lost at the saddle point. On the ResNets, the error rises by up to 0.7% on CIFAR10 and 3% on CIFAR100. For the DenseNets, the error rises by up to 1% on CIFAR10 and 2% on CIFAR100.

This beats, by a large margin, the existing technique of computing the energy barrier between minima by evaluating the loss on the connecting line segment.

Figure 7: Logarithmic learning curve of DenseNet-40-12 on CIFAR10. The training loss passes the mean saddle point energy at about 135000 iterations, slightly after reducing the learning rate for the first time. On a linear scale, training loss variations along MEPs pale in comparison to the difference between training and test loss, see Fig. 5.

5 Discussion

We have pointed out an intriguing property of the loss surface of current-day deep networks, by upper-bounding the saddle points between the parameter sets that result from stochastic gradient descent, a.k.a. “minima”. These empirical upper bounds are astonishingly close to the loss at the minima themselves. At this point, we cannot give a formal characterization of the regime in which this finding holds. A formal proof is also complicated by the fact that the loss surface is a function not only of the parameters and the architecture, but also of the training set; and the distribution of real-world structured data such as images or sentences does not lend itself to a compact mathematical representation. That said, we want to make two related arguments that may help explain why we observe no substantial barrier between minima.

5.1 Resilience

State-of-the-art neural networks have dozens or hundreds of neurons / channels per layer, and skip connections between non-adjacent layers. Assume that by training, a parameter set with low loss has been identified. If we now perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change so as to still minimise the loss, it may be argued that the myriad other parameters can adjust somewhat and “make up” for the change imposed on only one of them. After this relaxation, the procedure and argument can be repeated, though possibly with the perturbation of a different parameter than in the previous rounds.

This type of resilience is exploited and encouraged by procedures such as Dropout [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] or ensembling [Hansen & Salamon(1990)Hansen and Salamon]. It is also the reason why neural networks can be greatly condensed before a substantial increase in loss occurs [Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang].

5.2 Redundancy

Consider the textbook example of a two-layer perceptron that can fit the XOR problem. The two neurons traditionally used in the hidden layer – let’s call them Alice and Bob – are shown in Fig. 8(A). We can obtain an equivalent network by exchanging Alice and Bob (and permuting the weights of the neuron in the following layer, not shown). This network, also corresponding to a minimum of the loss surface, is shown in Fig. 8(B). Now, any path between these two minima will entail parameter sets such as the one in Fig. 8(C) that incur high loss.

Figure 8: Network capacity for XOR dataset: The continuous transition from one minimum (A) to another minimum (B) is not possible without misclassifying at least one instance (C). Adding one helper neuron (D) makes the transition possible while always predicting the right class for all data points, i.e. by turning off the outgoing weight of Bob.

If, on the other hand, we introduce an auxiliary neuron, Charlie, we can play a small choreography: Enter Charlie. Charlie stands in for Bob. Bob transitions to Alice’s role. Alice takes over from Charlie. Exit Charlie. If the neuron in the second hidden layer adjusts its weights so as to disregard the output from the neuron-in-transition, the entire network incurs no higher loss than at the two original minima. We have constructed a perfect minimum energy path.
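The choreography can be written down explicitly. The following NumPy sketch (our own construction for illustration; it uses a squared-error loss and one concrete XOR solution rather than anything from the paper) verifies that the loss stays exactly at its minimum value along the entire piecewise-linear path:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(params, x):
    W, b, v = params                        # hidden weights (3, 2), biases (3,), output weights (3,)
    return np.maximum(W @ x + b, 0.0) @ v   # ReLU hidden layer, linear output

def loss(params):
    return np.mean([(forward(params, x) - t) ** 2 for x, t in zip(X, y)])

# Two hidden-unit "roles" that together solve XOR: output = 1 * sum_role - 2 * and_role.
role_and = (np.array([1.0, 1.0]), -1.0, -2.0)   # (incoming weights, bias, outgoing weight)
role_sum = (np.array([1.0, 1.0]),  0.0,  1.0)
role_off = (np.array([0.0, 0.0]),  0.0,  0.0)

def silent(role):
    return (role[0], role[1], 0.0)              # same incoming weights, outgoing weight switched off

def pack(alice, bob, charlie):
    roles = [alice, bob, charlie]
    return (np.stack([r[0] for r in roles]),    # W
            np.array([r[1] for r in roles]),    # b
            np.array([r[2] for r in roles]))    # v

waypoints = [
    pack(role_and,         role_sum,         role_off),          # minimum A
    pack(role_and,         role_sum,         silent(role_sum)),  # enter Charlie (silent copy of Bob)
    pack(role_and,         silent(role_sum), role_sum),          # Charlie stands in for Bob
    pack(role_and,         silent(role_and), role_sum),          # Bob transitions to Alice's role
    pack(silent(role_and), role_and,         role_sum),          # Bob takes over Alice's output weight
    pack(silent(role_sum), role_and,         role_sum),          # Alice moves to Charlie's role
    pack(role_sum,         role_and,         silent(role_sum)),  # Alice takes over from Charlie
    pack(role_sum,         role_and,         role_off),          # exit Charlie: permuted minimum B
]

# The loss stays (numerically) at zero on every straight segment between waypoints.
for p0, p1 in zip(waypoints[:-1], waypoints[1:]):
    for a in np.linspace(0.0, 1.0, 11):
        interpolated = tuple((1 - a) * q0 + a * q1 for q0, q1 in zip(p0, p1))
        assert loss(interpolated) < 1e-12
print("zero loss along the entire path")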

6 Conclusion

We find that the loss surface of deep neural networks contains paths with constantly low loss. We put forth two closely related arguments above. Both hold only if the network has some extra capacity, or degrees of freedom, to spare. Empirically, this seems to be the case for modern-day architectures applied to standard problems. We argue that due to the width of each layer, the network can substantially rearrange its parameters while still producing outputs with low loss.

This has the profound implication that low Hessian eigenvalues exist in addition to the eigenvectors whose eigenvalues are analytically zero due to rescaling invariance.

The method opens the door to further empirical research on the energy landscape of neural networks. When the hyperparameters of AutoNEB are further refined, we expect to find even lower paths, up to the level where the true saddle points are recovered. It will then be interesting to see whether certain minima have a higher barrier between them than others. This would make it possible to recursively form clusters of minima, e.g. using single-linkage clustering. Such an analysis is not yet possible due to the large error bars that we find. In the traditional energy landscape literature, this kind of clustering is visualised in disconnectivity graphs [Wales et al.(1998)Wales, Miller, and Walsh].

For practical applications, we can imagine using the resulting paths as a large ensemble of neural networks, especially given that we observe lower test loss along the paths than at the minima.

References

  1. Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  2. Ballard, A. J., Das, R., Martiniani, S., Mehta, D., Sagun, L., Stevenson, J. D., and Wales, D. J. Energy landscapes for machine learning. Physical Chemistry Chemical Physics (Incorporating Faraday Transactions), 19:12585–12603, 2017. doi: 10.1039/C7CP01108C.
  3. Ballard, Andrew J., Stevenson, Jacob D., Das, Ritankar, and Wales, David J. Energy landscapes for a machine learning application to series data. J. Chem. Phys., 144(12):124119, Mar 2016. ISSN 1089-7690. doi: 10.1063/1.4944672. URL http://dx.doi.org/10.1063/1.4944672.
  4. Choromanska, Anna, Henaff, Mikael, Mathieu, Michaël, Arous, Gérard Ben, and LeCun, Yann. The loss surface of multilayer networks. CoRR, abs/1412.0233, 2014. URL http://arxiv.org/abs/1412.0233.
  5. Ciresan, Dan C, Meier, Ueli, Masci, Jonathan, Maria Gambardella, Luca, and Schmidhuber, Jürgen. Flexible, high performance convolutional neural networks for image classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, pp. 1237. Barcelona, Spain, 2011.
  6. Dauphin, Yann, Pascanu, Razvan, Gülçehre, Çaglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014. URL http://arxiv.org/abs/1406.2572.
  7. Dinh, Laurent, Pascanu, Razvan, Bengio, Samy, and Bengio, Yoshua. Sharp minima can generalize for deep nets. In Precup, Doina and Teh, Yee Whye (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1019–1028, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/dinh17b.html.
  8. Gower, J. C. and Ross, G. J. S. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1):54–64, 1969. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346439.
  9. Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649. IEEE, 2013.
  10. Hansen, Lars Kai and Salamon, Peter. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001, 1990.
  11. He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  12. Henkelman, Graeme and Jónsson, Hannes. Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points. The Journal of chemical physics, 113(22):9978–9985, 2000.
  13. Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  14. Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, pp.  3, 2017.
  15. Jónsson, Hannes, Mills, Greg, and Jacobsen, Karsten W. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and quantum dynamics in condensed phase simulations, pp. 385–404. World Scientific, 1998.
  16. Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
  17. Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  18. Kolsbjerg, Esben L, Groves, Michael N, and Hammer, Bjørk. An automated nudged elastic band method. The Journal of chemical physics, 145(9):094107, 2016.
  19. LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  20. Li, Hao, Xu, Zheng, Taylor, Gavin, and Goldstein, Tom. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
  21. Liu, Zhuang, Li, Jianguo, Shen, Zhiqiang, Huang, Gao, Yan, Shoumeng, and Zhang, Changshui. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2736–2744, 2017.
  22. Montúfar, Guido, Pascanu, Razvan, Cho, Kyunghyun, and Bengio, Yoshua. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
  23. Nguyen, Quynh and Hein, Matthias. The loss surface of deep and wide neural networks. In ICML, 2017.
  24. Sheppard, Daniel, Terrell, Rye, and Henkelman, Graeme. Optimization methods for finding minimum energy paths. The Journal of Chemical Physics, 128(13):134106, Apr 2008. ISSN 1089-7690. doi: 10.1063/1.2841941. URL http://dx.doi.org/10.1063/1.2841941.
  25. Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  26. Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. ArXiv e-prints, May 2016.
  27. Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  28. Vinyals, Oriol and Le, Quoc V. A neural conversational model. CoRR, abs/1506.05869, 2015. URL http://arxiv.org/abs/1506.05869.
  29. Wales, David J, Miller, Mark A, and Walsh, Tiffany R. Archetypal energy landscapes. Nature, 394(6695):758, 1998.
  30. Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., Yu, D., and Zweig, G. Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2410–2423, Dec 2017. ISSN 2329-9290. doi: 10.1109/TASLP.2017.2756440.
  31. Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. In ICLR, 2017.