Gradient-only Line Searches: An Alternative to Probabilistic Line Searches
Abstract
Step sizes in neural network training are largely determined using predetermined rules such as fixed learning rates and learning rate schedules, which require user input to determine their functional form and associated hyperparameters. Global optimization strategies to resolve these hyperparameters are computationally expensive. Line searches are capable of adaptively resolving learning rate schedules. However, due to discontinuities induced by mini-batch sampling, they have largely fallen out of favor. Nevertheless, probabilistic line searches have recently demonstrated viability in resolving learning rates for stochastic loss functions. This method constructs surrogates with confidence intervals, where restrictions are placed on the rate at which the search domain can grow along a search direction.
This paper introduces an alternative paradigm, Gradient-Only Line Searches that are Inexact (GOLSI), as a strategy to automatically resolve learning rates in stochastic cost functions over a range of 15 orders of magnitude without the use of surrogates. We show that GOLSI is a competitive strategy to reliably resolve step sizes, adding high value in terms of performance, while being easy to implement. Considering mini-batch sampling, we open the discussion on how to split the effort between resolving quality search directions and quality step size estimates along a search direction.
Keywords: Artificial Neural Networks, Gradient-only, Line Searches, Learning Rates
1 Introduction
Selecting learning rate related parameters is still an active field of research in deep learning (Smith, 2015; Orabona and Tommasi, 2017; Wu et al., 2018), since they have been shown to be the most sensitive hyperparameters in training (Bergstra and Bengio, 2012). In practice, these parameters are often selected a priori by the user. However, in mathematical programming, a common strategy to resolve step sizes (learning rates) is the use of line searches (Arora, 2011). Stochastic subsampling spoils the utility of conventional line searches in neural network training, since it introduces discontinuities into the cost functions and gradients, causing line searches that minimize along a descent direction to become stuck in false minima resulting from discontinuities (Wilson and Martinez, 2003; Schraudolph and Graepel, 2003; Schraudolph et al., 2007). This has resulted in line searches being replaced by a priori rule-based step size schedules typical of subgradient methods, which include stochastic gradient descent (Schraudolph, 1999; Boyd et al., 2003; Smith, 2015). Consider for example Figure 1, which depicts both the function values and directional derivatives, sampled on a regular grid along a normalized search direction, of a simple neural network applied to the well-known Iris dataset (Fisher, 1936). The top row of Figure 1 depicts the function values and directional derivatives when all the samples are used in the training data, while the bottom row depicts the function values and directional derivatives when a single training data point is randomly removed for every function and gradient evaluation.
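The effect shown in Figure 1 can be reproduced in miniature. Below is a hedged plain-Python sketch (our own construction, not the paper's Iris setup): a one-parameter least-squares loss is evaluated along a step size, once on the full dataset and once with a single randomly chosen training point dropped at every evaluation, which makes the sampled loss discontinuous between evaluations.

```python
# Illustrative sketch: subsampling makes the sampled loss discontinuous.
# The data, model, and names are our own assumptions for the demonstration.
import random

data = [(x / 10.0, 2.0 * x / 10.0) for x in range(20)]  # samples of y = 2x

def loss(w, batch):
    """Mean squared error of the scalar model y = w * x over `batch`."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def subsampled_loss(w, rng):
    """Drop one randomly chosen training point at every evaluation."""
    drop = rng.randrange(len(data))
    batch = [p for i, p in enumerate(data) if i != drop]
    return loss(w, batch)

rng = random.Random(0)
full = [loss(w / 10.0, data) for w in range(30)]
noisy = [subsampled_loss(w / 10.0, rng) for w in range(30)]
# The full-batch curve is smooth in w, with its minimum at w = 2.0; the
# subsampled curve jumps between evaluations because every call sees a
# slightly different dataset.
```

The full-batch values trace a smooth quadratic, while the subsampled values differ from it almost everywhere, mirroring the top and bottom rows of Figure 1.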
Recently, Gaussian processes incorporating both function value and gradient information along search directions have successfully been used to construct line searches in stochastic environments via Bayesian optimization methods (Mahsereci and Hennig, 2017). However, we postulate that a simpler and more accessible approach, using only gradient information, may be sufficient to construct line searches. The premise for this postulate is that the discontinuities in the function values are more abrupt than those in the directional derivatives, which are considerably more robust, as is evident in Figure 1. In this paper we demonstrate how this characteristic, in conjunction with estimating Non-Negative Associated Gradient Projection Points (NNGPPs) (Wilke et al., 2013; Snyman and Wilke, 2018), allows for the construction of gradient-only line searches that automatically resolve step sizes.
The NNGPP merely presents an alternative solution to a function minimizer for discontinuous functions, i.e. instead of minimizing the discontinuous stochastic function directly, we use the associated derivative (Wilke et al., 2013; Snyman and Wilke, 2018) to filter out all discontinuities from the stochastic function. Essentially, when we only consider associated derivatives to make decisions during line searches, we may interpret the discontinuous stochastic function presented in Figure 2(a) to be the stochastic continuous function presented in Figure 2(c), since both are consistent with the associated derivatives presented in Figure 2(b). It is clear that the function minimizer of the discontinuous stochastic function, depicted as a gray dot in Figure 2(a), is associated with a negative directional derivative in its neighborhood along the search direction. This implies that the global function minimizer present in the discontinuous stochastic function is not representative of a local minimum according to the associated derivatives (Wilke et al., 2013; Snyman and Wilke, 2018). This is because a global or local minimum would be characterized by a directional derivative going from negative to positive along a descent direction. Traditionally, for smooth functions the derivative would be zero at the local minimum, indicative of a critical point (Snyman and Wilke, 2018). Fortunately, since NNGPPs were developed for discontinuous functions (Wilke et al., 2013; Snyman and Wilke, 2018), they do not rely on the concept of a critical point, as there is usually no point where the derivative is zero when discontinuous stochastic functions are considered. A NNGPP only requires the directional derivative to change sign from negative to positive as one travels along a descent direction.
As outlined by Wilke (2012), this way of characterizing solutions of discontinuous stochastic functions is also consistent with the solutions that subgradient algorithms or stochastic gradient descent would find, i.e. using stochastic gradient descent to optimize Figure 2(a) would only result in convergence around the NNGPP (red dot), while the global function minimizer (gray dot) would be completely ignored.
In this study, based on empirical evidence, we argue that developing line searches that locate NNGPPs offers two advantages: 1) it gives a more representative (and consistent) way to define candidate solutions of a discontinuous stochastic cost function, and 2) it allows solutions to be isolated more robustly and with lower variance by the line search, as some spurious minima are filtered out.
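The NNGPP idea reduces, along a single search direction, to bracketing the first sign change of the directional derivative from negative to non-negative. A hedged sketch of that bracketing (our illustrative helper, not the paper's code):

```python
# Along a descent direction the directional derivative F'(alpha) starts
# negative; the first interval on which it changes sign from negative to
# non-negative brackets a NNGPP. Function and variable names are ours.
def first_nngpp_bracket(alphas, dirderivs):
    """Return (a_left, a_right) bracketing the first - to + sign change."""
    for (a0, d0), (a1, d1) in zip(zip(alphas, dirderivs),
                                  zip(alphas[1:], dirderivs[1:])):
        if d0 < 0.0 and d1 >= 0.0:      # negative -> non-negative
            return (a0, a1)
    return None                          # no sign change found

# Noisy directional derivatives sampled along a direction:
alphas = [0.0, 0.5, 1.0, 1.5, 2.0]
derivs = [-3.1, -1.2, -0.2, 0.4, 1.1]
print(first_nngpp_bracket(alphas, derivs))  # -> (1.0, 1.5)
```

Note that no zero crossing needs to be resolved exactly: only the sign change matters, which is what makes the definition robust to the discontinuities of Figure 2(a).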
2 Cost functions in Machine learning
Commonly, the objective functions used in machine learning training are of the form
\mathcal{L}(\mathbf{x}) = \frac{1}{M} \sum_{b=1}^{M} \ell(\mathbf{x}; \mathbf{t}_b),    (1)
where \{\mathbf{t}_1, \dots, \mathbf{t}_M\} is a training dataset of size M, \mathbf{x} is an n-dimensional vector of model parameters, and \ell(\mathbf{x}; \mathbf{t}_b) defines the loss quantifying the fitness of parameters \mathbf{x} with regards to training sample \mathbf{t}_b. Backpropagation (Werbos, 1994) allows for the computation of the exact gradient w.r.t. \mathbf{x} as follows:
\boldsymbol{g}(\mathbf{x}) = \nabla \mathcal{L}(\mathbf{x}) = \frac{1}{M} \sum_{b=1}^{M} \nabla \ell(\mathbf{x}; \mathbf{t}_b).    (2)
In the limit case, where all the training data is used for both function and gradient evaluations, \mathcal{L}(\mathbf{x}) and \boldsymbol{g}(\mathbf{x}) are smooth. We demonstrate this in our test example in Figures 3(a) and (b). In smooth environments such as these, minimization line search methods are capable of locating local minima. However, the cost of computation is high, due to processing all M data points at every function evaluation. The minimization line search is also more likely to become "stuck" in a smooth local minimum within the multimodal and non-convex cost function.
Using mini-batches of the data during training decreases the computational cost and increases the chance of an optimization algorithm overcoming local minima. This changes the form of the cost function as follows: mini-batches \mathcal{B}_i \subset \{1, \dots, M\} of size |\mathcal{B}_i| \ll M are sampled from the training set of size M, resulting in an approximate loss function
\bar{\mathcal{L}}(\mathbf{x}) = \frac{1}{|\mathcal{B}_i|} \sum_{b \in \mathcal{B}_i} \ell(\mathbf{x}; \mathbf{t}_b),    (3)
and corresponding approximate stochastic gradient
\bar{\boldsymbol{g}}(\mathbf{x}) = \frac{1}{|\mathcal{B}_i|} \sum_{b \in \mathcal{B}_i} \nabla \ell(\mathbf{x}; \mathbf{t}_b).    (4)
The approximate loss has expectation E[\bar{\mathcal{L}}(\mathbf{x})] = \mathcal{L}(\mathbf{x}) and corresponding expected gradient E[\bar{\boldsymbol{g}}(\mathbf{x})] = \boldsymbol{g}(\mathbf{x}) (Tong and Liu, 2005), but individual instances may vary significantly from the mean. This implies that a point satisfying the first order optimality criterion (Arora, 2011) may not exist for the instance of mini-batch \mathcal{B}_i, even if it may exist for the full batch case, \mathcal{B} = \{1, \dots, M\}.
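The relation between Eqs. (1)-(4) can be sketched for a scalar least-squares model (our own illustrative setup, not the paper's networks): individual mini-batch gradients vary widely, but their mean approaches the full-batch gradient.

```python
# Sketch of Eqs. (1)-(4): mini-batch gradients as unbiased but noisy
# estimates of the full-batch gradient. Data and names are illustrative.
import random

data = [(float(x), 3.0 * x + 1.0) for x in range(-5, 6)]  # samples of y = 3x + 1

def grad(w, b, batch):
    """Gradient of mean squared error w.r.t. (w, b) over `batch`."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb

rng = random.Random(1)
full_gw, full_gb = grad(0.0, 0.0, data)                     # Eq. (2), full batch
est = [grad(0.0, 0.0, rng.sample(data, 4)) for _ in range(2000)]  # Eq. (4)
mean_gw = sum(g[0] for g in est) / len(est)
# Individual mini-batch gradients scatter, but their mean approaches the
# full-batch gradient, illustrating the expectation statement above.
```

Each individual estimate may differ from the full-batch gradient by a large margin, which is exactly what breaks the first order optimality criterion for a single mini-batch instance.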
For discontinuous functions, Wilke et al. (Wilke et al., 2013; Snyman and Wilke, 2018) proposed the gradient-only optimality criterion given by:
\mathbf{d}^{\mathsf{T}} \boldsymbol{g}(\mathbf{x}^* + \lambda \mathbf{d}) \geq 0 \quad \forall\, \|\mathbf{d}\| = 1,\ \forall\, \lambda \in (0, \lambda_{\max}],    (5)
as an alternative to the first order optimality criterion (Arora, 2011). Candidate solutions \mathbf{x}^* of the gradient-only optimality criterion, developed for discontinuous functions, are defined as Non-Negative Associated Gradient Projection Points (NNGPPs) (Wilke et al., 2013; Snyman and Wilke, 2018).
For smooth functions, finding an NNGPP is equivalent to finding a critical point with a positive semi-definite Hessian (Wilke et al., 2013; Snyman and Wilke, 2018). Hence, the NNGPP definition incorporates second order information in the form of requiring that there are no descent directions from an NNGPP.
For notational convenience, we define a univariate function along a descent direction \mathbf{d}_n from \mathbf{x}_n:
F_n(\alpha) = \mathcal{L}(\mathbf{x}_n + \alpha \mathbf{d}_n),    (6)
with associated derivative
F_n'(\alpha) = \mathbf{d}_n^{\mathsf{T}} \boldsymbol{g}(\mathbf{x}_n + \alpha \mathbf{d}_n).    (7)
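The line restriction of Eqs. (6)-(7) can be sketched for a simple quadratic loss (the loss, point and direction are our illustrative choices): restricting a multivariate loss to a line gives a univariate function whose derivative is the dot product of the gradient with the direction.

```python
# Sketch of Eqs. (6)-(7): restrict a loss to a line x + alpha * d.
def loss(x):                       # L(x) = sum(x_i^2), minimum at the origin
    return sum(v * v for v in x)

def grad(x):                       # grad L(x) = 2x
    return [2.0 * v for v in x]

def line_restriction(x, d):
    """Return F(alpha) and F'(alpha) along the line x + alpha * d."""
    def F(alpha):
        return loss([xi + alpha * di for xi, di in zip(x, d)])
    def dF(alpha):                 # directional derivative: d . grad L
        g = grad([xi + alpha * di for xi, di in zip(x, d)])
        return sum(gi * di for gi, di in zip(g, d))
    return F, dF

F, dF = line_restriction([3.0, 4.0], [-0.6, -0.8])  # unit descent direction
print(dF(0.0))   # -10.0: negative at alpha = 0, confirming descent
print(dF(5.0))   # 0.0: sign change at the exact minimizer along the line
```

For this smooth loss the sign change of F'(alpha) coincides with the line minimizer, which is the limit behavior the NNGPP definition recovers.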
3 Our Contribution
In this paper, we automatically resolve learning rates over a range of 15 orders of magnitude for stochastic loss functions using gradient-only line searches. We propose an Inexact Gradient-Only Line Search (GOLSI) method that isolates Non-Negative Associated Gradient Projection Points (NNGPPs). As argued before, when considering univariate functions, a NNGPP is merely a sign change from negative to positive in the univariate directional derivative along the descent direction.
Importantly, we select a new mini-batch subsample from the training data at every evaluation of the loss function within the line search. We stress again that we do not rely on the concept of a critical point, as we do not require the derivative at a NNGPP to be zero. For multidimensional functions this naturally requires that we search for a sign change from negative to positive in the directional derivative along a descent direction. Since we require a sign change from negative to positive along a descent direction, and not from positive to negative, we incorporate some second order information, i.e. the requirement of a local minimum.
Commonly used learning rate schedules use step sizes ranging over 5 orders of magnitude (Senior et al., 2013), while the magnitudes of cyclical learning rate schedules typically range over 3 to 4 orders of magnitude (Smith, 2015; Loshchilov and Hutter, 2016). Manually selected schedules can require a number of hyperparameters to be determined. Our proposed method, GOLSI, can resolve step sizes over a range of 15 orders of magnitude. The wide range of available step sizes within the line search allows GOLSI to effectively traverse flat planes or steep declines in discontinuous stochastic cost functions, while requiring no user intervention.
3.1 Empirical evidence that NNGPP is more robust than minimizers
We present empirical evidence indicating that NNGPPs offer a more representative and consistent way to define candidate solutions for discontinuous stochastic cost functions, as well as allowing solutions to be isolated more robustly and with lower variance by a line search, as some sporadic minima are filtered out.
Consider the Iris test problem, where we sample along a search direction with only a few non-zero elements of equal magnitude. Along this direction, in 100 increments, we note the locations of all the minimizers and NNGPPs. We repeat this procedure 100 times for different sample sizes and construct the distributions of the locations of minima and NNGPPs observed in Figure 4. The spatial distribution of local minima across the sampled domain approximates a uniform distribution. The location of the true minimum is identified by the full batch. Conversely, the spatial locations of NNGPPs are constrained in what resembles a Gaussian distribution around the true minimum, with variance inversely proportional to the sample size. The central message of these plots is that the spatial location of NNGPPs is restricted, making them a reliable criterion for resolving step sizes in stochastic cost functions. Additionally, the NNGPP definition generalizes to the minimization definition in the limit case of using the full batch.
4 Algorithmic details
We propose GOLSI, the following inexact gradient-only line search method. Given an initial (n = 0) descent direction \mathbf{d}_0, with initial step size \alpha_0 and real scaling parameter \eta > 1, it is first determined whether the update can be accepted without further refinement. To this end we consider a modified strong Wolfe condition
0 \leq F_n'(\alpha) \leq c_2 \, |F_n'(0)|,    (8)
with c_2 \in (0, 1). Hence, the initial update step will be taken as is when the directional derivative is positive, but with a restricted magnitude w.r.t. the initial descent magnitude |F_n'(0)| for the n-th direction. This implies that we have stepped over the sign change in a controlled fashion. The reason why we consider this update is that it has been found to work better than the strong Wolfe curvature condition (Arora, 2011)
|F_n'(\alpha)| \leq c_2 \, |F_n'(0)|,    (9)
which also allows some restricted negative directional derivative to be acceptable. In our studies we found that larger step sizes are preferred over smaller step sizes, for computational as well as generalization benefits, for the architectures under consideration in this study. We note that it is of some importance to conduct a more comprehensive study for a wider group of architectures to properly understand this empirically observed asymmetry around a sign change.
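The asymmetry between the two acceptance conditions can be made concrete in a small sketch. The helper names and the value of c_2 are our assumptions; the text only fixes the form of Eqs. (8) and (9).

```python
# Sketch of the immediate-accept tests: Eq. (8) accepts only a positive,
# bounded directional derivative; Eq. (9) also tolerates small negatives.
def accept_modified_wolfe(dF_alpha, dF0, c2=0.9):
    """Eq. (8): 0 <= F'(alpha) <= c2 * |F'(0)|."""
    return 0.0 <= dF_alpha <= c2 * abs(dF0)

def accept_strong_wolfe(dF_alpha, dF0, c2=0.9):
    """Eq. (9): |F'(alpha)| <= c2 * |F'(0)|."""
    return abs(dF_alpha) <= c2 * abs(dF0)

dF0 = -10.0  # initial descent magnitude |F'(0)| = 10
print(accept_modified_wolfe(0.5, dF0))   # True: just past the sign change
print(accept_modified_wolfe(-0.5, dF0))  # False: still on the descent side
print(accept_strong_wolfe(-0.5, dF0))    # True: strong Wolfe tolerates this
```

The modified condition therefore only ever accepts points that have stepped over the sign change, which is the controlled overshoot described above.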
Should the initial step not be acceptable, the following decisions are made based on the sign of the directional derivative at the initial guess, F_n'(\alpha_0): If F_n'(\alpha_0) < 0, the step size is increased, \alpha \leftarrow \eta \alpha, until F_n'(\alpha) \geq 0. Alternatively, if F_n'(\alpha_0) > 0, the step size is decreased, \alpha \leftarrow \alpha / \eta, until F_n'(\alpha) \leq 0. The \alpha at which either condition terminates is used as the acceptable update, and the next search direction is computed. For the n-th search direction the update domains are illustrated in Figure 5.
Depending on the nature of the problem (loss function, architecture, activation function etc.), for small minibatch sizes it is possible to obtain divergent behavior where no sign change is located along a search direction for many consecutive updates. We therefore introduce a maximum step size to protect the line search from divergent steps. Inspired by the Lipschitz condition for convergent fixed step sizes, we choose the maximum step size conservatively as
\alpha_{\max, n} = \min\left( \frac{1}{\|\mathbf{d}_n\|},\ 10^{7} \right).    (10)
The Euclidean norm of the descent direction limits the line search towards more conservative updates for steep search directions, but allows larger update steps for faster progress over flat planes. The upper bound restricts divergent behaviour from unreliable directions in flat planes.
Step sizes are also restricted to a minimum, to avoid expensive line searches that reduce step sizes towards 0 in cases where a computed descent direction is statistically unlikely. The minimum step size is given by
\alpha_{\min} = 10^{-8}.    (11)
As a result of these bounds, the line search can resolve an iteration-specific step size over 15 orders of magnitude. We do not set a cap on the number of gradient evaluations allowed per iteration, although capping is common practice in other line search approaches used in machine learning training (Mahsereci and Hennig, 2017).
For the first search direction of GOLSI, i.e. along \mathbf{d}_0, a conservatively small initial guess \alpha_0 is chosen. This is an overly conservative assumption, based on gradients being steep in the beginning of optimization and the length scale of the problem not being initially known. GOLSI then grows the step size until the length scale of the first sign change is determined. In practice the initial guess can be increased, but having a small initial guess in our investigations also demonstrates that the method is capable of automatically adjusting the step size magnitude in a single iteration. In subsequent iterations, n > 0, the initial guess along the next search direction is the accepted step size of the previous iteration, \alpha_{n-1}. A conceptual summary of GOLSI is given in Algorithm 1, while detailed pseudo code can be reviewed in the Appendix under listing Algorithm 2.
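The grow/shrink logic above can be condensed into a short sketch. The growth factor, the cap scaled by the direction norm, and the floor (together spanning the 15 orders of magnitude stated in the text) are our assumptions for illustration; the paper's Algorithm 1 and the pseudo code in Algorithm 2 define the exact procedure.

```python
# Illustrative condensation of the GOLSI step size search (our reading of
# Section 4, not the paper's code): grow alpha while F'(alpha) < 0, shrink
# while F'(alpha) > 0, clamped to [alpha_min, alpha_max].
def gols_i_step(dF, alpha0, d_norm, eta=2.0, max_evals=60):
    """Resolve a step size along one descent direction via a sign change.

    dF(alpha): directional derivative F'(alpha) along the direction.
    alpha0:    initial guess (previous iteration's accepted step).
    d_norm:    Euclidean norm of the descent direction.
    """
    alpha_max = min(1e7, 1.0 / d_norm)   # conservative, Lipschitz-inspired cap
    alpha_min = 1e-8                      # floor against vanishing steps
    alpha = min(max(alpha0, alpha_min), alpha_max)
    if dF(alpha) < 0.0:                   # still descending: grow
        while dF(alpha) < 0.0 and alpha < alpha_max and max_evals > 0:
            alpha = min(alpha * eta, alpha_max)
            max_evals -= 1
    else:                                 # overshot: shrink
        while dF(alpha) > 0.0 and alpha > alpha_min and max_evals > 0:
            alpha = max(alpha / eta, alpha_min)
            max_evals -= 1
    return alpha

# Smooth 1-D check: F(alpha) = (alpha - 1)^2 along a unit direction, so
# F'(alpha) = 2 * (alpha - 1) changes sign at alpha = 1.
alpha = gols_i_step(lambda a: 2.0 * (a - 1.0), alpha0=1e-8, d_norm=1.0)
print(0.5 <= alpha <= 2.0)   # True: the resolved step reaches the sign change
```

Starting from a tiny guess, the sketch doubles its way up to the sign change in a handful of derivative evaluations, mirroring the behavior described for the first GOLSI iteration.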
4.1 Proof of Global Convergence for Full Batch Sampling
Suppose that the loss function \mathcal{L}(\mathbf{x}) obtained from full batch sampling is smooth and coercive with a unique minimizer \mathbf{x}^*. Any Lipschitz function can be regularized to be coercive using Tikhonov regularization with a sufficiently large regularization coefficient.
The step updates of an optimization algorithm can be considered as a dynamical system in discrete time:
\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n \mathbf{d}_n.    (12)
It follows from Lyapunov's global stability theorem (Aleksandr. M., 1992) in discrete time that any Lyapunov function V(\mathbf{x}) satisfying positivity, coercivity and strict descent:

Positivity: V(\mathbf{x}) > 0 for all \mathbf{x} \neq \mathbf{x}^*, and V(\mathbf{x}^*) = 0,

Coercivity: V(\mathbf{x}) \to \infty as \|\mathbf{x}\| \to \infty,

Strict descent: V(\mathbf{x}_{n+1}) < V(\mathbf{x}_n) for all \mathbf{x}_n \neq \mathbf{x}^*,

results in \mathbf{x}_n \to \mathbf{x}^* as n \to \infty.
Theorem 4.1
Let \mathcal{L}(\mathbf{x}) be any smooth coercive function with a unique global minimum \mathbf{x}^*, with the step size \alpha_n restricted such that 0 < \alpha_n \leq \alpha_n^*, where \alpha_n^* is the step length to the first minimum along \mathbf{d}_n. Then \mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n \mathbf{d}_n will result in updates that are globally convergent.
Let the error at step n be given by \mathbf{e}_n = \mathbf{x}_n - \mathbf{x}^*, for which we can construct the Lyapunov function V(\mathbf{x}_n) = \mathcal{L}(\mathbf{x}_n) - \mathcal{L}(\mathbf{x}^*). It follows that V(\mathbf{x}_n) > 0 for all \mathbf{x}_n \neq \mathbf{x}^* and that V(\mathbf{x}^*) = 0, since \mathbf{x}^* is the unique global minimum of \mathcal{L}.
At every iteration our line search update locates a NNGPP along the descent direction \mathbf{d}_n, by locating a sign change from negative to positive in F_n'(\alpha). Wilke et al. (2013) proved this to be equivalent to minimizing along \mathbf{d}_n when \mathcal{L} is smooth and the sign of the directional derivative F_n'(0) is negative along \mathbf{d}_n. Here, \alpha_n^* defines the step length to the first minimum along the search direction \mathbf{d}_n. It is therefore guaranteed that at every iteration \mathcal{L}(\mathbf{x}_{n+1}) < \mathcal{L}(\mathbf{x}_n) for \mathbf{x}_n \neq \mathbf{x}^*. In addition, restricting 0 < \alpha_n \leq \alpha_n^* ensures that for our choice of discrete dynamical update \mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n \mathbf{d}_n, we can always make progress unless \mathbf{x}_n = \mathbf{x}^*. Hence, for any \mathbf{x}_n \neq \mathbf{x}^* it follows that V(\mathbf{x}_{n+1}) < V(\mathbf{x}_n).
It then follows from Lyapunov's global stability theorem that V(\mathbf{x}_n) \to 0 as n \to \infty. Hence we have that \mathbf{x}_n \to \mathbf{x}^*, which proves that finding a NNGPP at every iteration results in a globally convergent strategy.
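The descent argument can be condensed into a single chain of inequalities (our summary of the proof, with V the Lyapunov function and the step restricted to the first line minimum):

```latex
V(\mathbf{x}_{n+1})
  = \mathcal{L}(\mathbf{x}_n + \alpha_n \mathbf{d}_n) - \mathcal{L}(\mathbf{x}^*)
  < \mathcal{L}(\mathbf{x}_n) - \mathcal{L}(\mathbf{x}^*)
  = V(\mathbf{x}_n),
  \qquad \mathbf{x}_n \neq \mathbf{x}^*,
```

so V strictly decreases along the iterates; together with positivity and coercivity, Lyapunov's theorem then gives V(\mathbf{x}_n) \to 0 and hence \mathbf{x}_n \to \mathbf{x}^*.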
4.2 Proof of Global Convergence for Mini-Batch Sampling
Consider the discontinuous loss function \bar{\mathcal{L}}(\mathbf{x}) obtained from mini-batch sampling, with smooth expected response E[\bar{\mathcal{L}}(\mathbf{x})] and unique expected minimizer \mathbf{x}^*. Assume that the function is directional derivative coercive (see Wilke et al. (2013)) around a ball B_r of given radius r that is centered around the expected minimizer \mathbf{x}^*. This implies that for given radius r, for any point \mathbf{x}_o outside the ball and any point \mathbf{x}_i inside the ball with \|\mathbf{x}_i - \mathbf{x}^*\| < r, the following must hold:
(\mathbf{x}_o - \mathbf{x}_i)^{\mathsf{T}} \, \bar{\boldsymbol{g}}(\mathbf{x}_o) > 0.    (13)
As before, the step updates of an optimization algorithm can be considered as a dynamical system in discrete time:
\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n \mathbf{d}_n.    (14)
We relax Lyapunov's global stability theorem in discrete time for mini-batch subsampled discontinuous functions: any smooth expected Lyapunov function E[V(\mathbf{x})] satisfying expected positivity, coercivity, directional derivative coercivity and expected strict descent around a ball B_r of given radius r:

Expected positivity: E[V(\mathbf{x})] > 0 for all \mathbf{x} \neq \mathbf{x}^*, and E[V(\mathbf{x}^*)] = 0,

Coercivity: E[V(\mathbf{x})] \to \infty as \|\mathbf{x}\| \to \infty,

Directional derivative coercivity for any point \mathbf{x}_o outside the ball B_r of radius r,

Expected strict descent: E[V(\mathbf{x}_{n+1})] < E[V(\mathbf{x}_n)] for all \mathbf{x}_n \notin B_r,

results in \mathbf{x}_n \to B_r as n \to \infty.
Theorem 4.2
Let \bar{\mathcal{L}}(\mathbf{x}) be any smooth expected coercive function with a unique expected global minimum \mathbf{x}^* that is directional derivative coercive around a ball B_r of radius r, with \alpha_n restricted such that 0 < \alpha_n \leq \alpha_n^* along descent direction \mathbf{d}_n. Then \mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n \mathbf{d}_n will result in updates that globally converge to the ball B_r centered around \mathbf{x}^*.
Let the error at step n be given by \mathbf{e}_n = \mathbf{x}_n - \mathbf{x}^*, for which we can construct the Lyapunov function V(\mathbf{x}_n) = \bar{\mathcal{L}}(\mathbf{x}_n) - \bar{\mathcal{L}}(\mathbf{x}^*) and expected Lyapunov function E[V(\mathbf{x}_n)]. It follows that E[V(\mathbf{x}_n)] > 0 for all \mathbf{x}_n \neq \mathbf{x}^* and that E[V(\mathbf{x}^*)] = 0, since \mathbf{x}^* is the unique expected global minimum of E[\bar{\mathcal{L}}].
At every iteration our line search update locates a NNGPP along the descent direction \mathbf{d}_n, by locating a sign change from negative to positive in \bar{F}_n'(\alpha). Since the function is smooth expected coercive and directional derivative coercive around the ball B_r, expected descent follows for any point outside the ball of radius r. It is therefore guaranteed that at every iteration E[\bar{\mathcal{L}}(\mathbf{x}_{n+1})] < E[\bar{\mathcal{L}}(\mathbf{x}_n)] for \mathbf{x}_n \notin B_r. In addition, restricting 0 < \alpha_n \leq \alpha_n^* ensures that for our choice of discrete dynamical update we can always make progress unless \mathbf{x}_n \in B_r. Furthermore, since the function is directional derivative coercive around the ball B_r, any point inside B_r remains in B_r due to the update requirement of a sign change from negative to positive along the descent direction. Hence, for any \mathbf{x}_n \notin B_r it follows that E[V(\mathbf{x}_{n+1})] < E[V(\mathbf{x}_n)].
It then follows from the relaxed Lyapunov global stability theorem that \mathbf{x}_n \to B_r as n \to \infty. Hence the iterates either remain within, or converge to, the ball B_r, which proves that finding a NNGPP at every iteration results in a strategy that globally converges to the ball B_r.
5 Numerical Studies
The architectures and problems for the numerical studies conducted in this study are taken from the probabilistic line search research of Mahsereci and Hennig (2017), in which they compared against stochastic gradient descent using constant step sizes. This allows for a direct comparison of our obtained results to at least the constant step size stochastic gradient descent results that they reported. The problems we consider are:

Breast Cancer Wisconsin Diagnostic (BCWD) Dataset (Street et al., 1993), a binary classification problem, distinguishing between "benign" and "malignant" tumors, using 30 different features;

MNIST Dataset (Lecun et al., 1998), a multiclass classification problem with images of handwritten digits from 0 to 9 in greyscale with a resolution of 28x28 pixels; and

CIFAR10 (Krizhevsky and Hinton, 2009), a multiclass classification problem with images of 10 natural objects such as deer, cats, dogs, ships, etc.; the colour images have a resolution of 32x32.
Further details about the datasets, and the various parameters governing their implementation, are given in Table 1. These details are used as given by Mahsereci and Hennig (2017) ("the authors"), where the dataset problems are trained with different network architectures, fixed step size and line search methods, and different corresponding batch sizes. Our implementation was done using PyTorch 1.0. All datasets were preprocessed using the standard transformation (Z-transform).
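The Z-transform preprocessing mentioned above amounts to standardizing each feature to zero mean and unit variance using dataset statistics; a minimal sketch in plain Python (the helper and names are our own, illustrative only):

```python
# Z-transform (standardization) of a single feature column.
def z_transform(column):
    m = sum(column) / len(column)                    # feature mean
    var = sum((v - m) ** 2 for v in column) / len(column)
    s = var ** 0.5 or 1.0                            # guard constant features
    return [(v - m) / s for v in column]

feature = [2.0, 4.0, 6.0, 8.0]
z = z_transform(feature)
print(round(sum(z) / len(z), 12))                    # mean   -> 0.0
print(round(sum(v * v for v in z) / len(z), 12))     # variance -> 1.0
```

In a real pipeline the mean and standard deviation would be computed on the training set only and reused for the test set, so that no test information leaks into training.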
Dataset | Training obs. | Test obs. | Input dim. | Output dim. | Net structure | Max. F.E. | Batch sizes for training
BCWD | 400 | 169 | 30 | 2 | Log. regression | 100000 | 10, 50, 100, 400
MNIST | 50000 | 10000 | 784 | 10 | NetI, NetII | 40000 | 10, 100, 200, 1000
CIFAR10 | 10000 (Batch1) | 10000 | 3072 | 10 | NetI, NetII | 10000 | 10, 100, 200, 1000
Following Mahsereci and Hennig (2017), both MNIST and CIFAR10 are implemented using two different network architectures, NetI and NetII. Including the logistic regression for the BCWD Dataset, this constitutes a total of 5 architectures to be used in the numerical study. The parameters concerning the implementations of the different architectures are summarized in Table 2. All networks are fully connected, and the detail given concerning the hidden layers of the network excludes the biases, although they are included. Mahsereci and Hennig (2017) have stated that a normal distribution was used to initialize all networks. However, we found that anything resembling comparable results could not be obtained for NetII (whether with constant step sizes or otherwise), unless Xavier initialization (Glorot and Bengio, 2010) was used.
Network | Hidden layer architecture | Activation func. | Initialization | Loss func. | Fixed step sizes
Log. regression | N/A | Sigmoid | Normal | Binary cross entropy | 1, 10, 100
NetI | 800 | Sigmoid | Normal | Cross entropy | 1e-1, 1, 10
NetII (MNIST) | 1000, 500, 250 | Tanh | Xavier | Mean squared error | 1e-2, 1e-1, 1
NetII (CIFAR10) | 1000, 500, 250 | Tanh | Xavier | Mean squared error | 1e-1, 1, 4
We conducted an extensive study using numerous fixed step sizes for the different architectures and problems. In our analyses we chose three constant step sizes, each one order of magnitude apart, ensuring that the full training performance modality is captured. This means that the step sizes selected within the 3 orders of magnitude encapsulate a potential optimal constant step size. Thus we select a small, a medium and a large constant step size, along the following guidelines:

Small: Resembles a slow and overly conservative learning rate that leads to wasted gradient computations during training.

Medium: Resembles an effective and efficient learning rate with desired convergence performance.

Large: Resembles a learning rate that is aggressive and usually leads to detrimental performance.
The training algorithm used in this study is Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951). We apply both our line search method GOLSI, as well as the 3 fixed step sizes assigned, to every network architecture shown in Table 2. For each of the 4 different step size schemes (GOLSI and 3 constant learning rates), 10 runs were conducted, using the same initial seeds.
The training and test classification errors (as evaluated on the full respective training and test datasets) are evaluated and noted during training. The resolution of these plots is therefore limited by the size of the respective datasets.
For the benefit of reproducible science, we highlight results that could not be exactly recovered according to the information supplied by Mahsereci and Hennig (2017): It was not possible to obtain the same refinement for the error plots given the number of data points in the BCWD dataset. Additionally, the test errors obtained for CIFAR10 in NetI did not approach the same values as those shown by Mahsereci and Hennig (2017). This is true even for the "best" fixed step analyses as indicated in their work. Though we believe that a reasonable investigation into the source of these discrepancies has been conducted, we are open to the possibility that there are unidentified inconsistencies between the implementations of Mahsereci and Hennig (2017) and ours, or, perish the thought, between PyTorch and Matlab, which they used in their study. For these reasons we cannot directly compare with their results, but we can compare against our implementations of the constant step sizes they considered in their studies. We therefore use their results as guidelines, but not absolutes. For this reason we include fixed steps as relative comparisons, and wish to demonstrate that GOLSI is a more effective strategy than searching for an effective fixed step. Nevertheless, in order to aid comparison where possible, we use similar ranges on the axes and match the layout of our plots to those of the authors.
6 Results
6.1 Breast Cancer Wisconsin Diagnostic (BCWD) Dataset
We plot the log of the training error, log of the training loss, log of the test error, and log of the step size for the BCWD dataset, using mini-batch sizes of 10, 50, 100 and 400, in Figure 6. Note that a batch size of 400 is indicative of a full batch and is representative of a smooth loss function. Since we do not cap the cost of the line search, the number of gradient evaluations per iteration varied from 1 (immediate accept) to 17. On average, the number of gradient evaluations per iteration is in the low 2's. Hence, all results are listed in terms of the number of gradient evaluations, as this quantifies the value added by the line search when compared to the equivalent computational cost of a fixed step size method. To avoid unfortunate scaling of figures in the log domain, the minimum training error was clipped, as indicated by the lowest training error in Figure 6. However, in the interest of unrestricted convergence comparison, no clipping was applied in the log of the training loss plot.
Let us first consider the performance of the constant step sizes. As expected, the small constant step size exhibits slow convergence, the medium step size performs well, and the large step size often leads to divergence. As the batch size increases, the large constant step size performs better for isolated instances, as is evident for the larger batch sizes.
The unclipped log of the training loss for this problem gives a better perspective of the convergence behaviour of GOLSI. For the smallest batch size, the variance in the computed gradient between batches is high, which hinders the performance of GOLSI. However, GOLSI remains competitive, performing better than the small fixed step size, but worse than the medium constant step size. As the batch size increases, the quality of the computed gradient improves sufficiently that GOLSI trains faster than any of the constant step size methods. The constant step sizes continue to converge linearly towards the optimum, while GOLSI converges exponentially.
It seems the aggressive training performance of GOLSI and the medium constant step length suffer from overfitting in this problem. In fact, even the small step length seems to overfit within the first 1000 function evaluations for this problem. We speculate that during training, areas of the cost function associated with generalization were either missed or possibly overstepped, indicative of an architecture that is much more flexible than required by the dataset. Interestingly, overfitting is not present in the work done by Mahsereci and Hennig (2017) for this problem, neither for their line search, nor for their constant step implementations. Their implementations also seem to train more slowly, requiring a larger number of function evaluations.
Resolved step sizes are plotted to allow comparison between the magnitudes of step sizes obtained by GOLSI and the chosen constant step sizes. The large range of step sizes available to GOLSI is immediately evident. Recall that we do not limit the number of gradient computations per iteration, which allows the line search to vary its magnitude significantly between iterations. Another consequence of this is that the variance in the resolved step size can be used as an indication of the variance in the computed gradient information. As the batch size increases, the range in magnitude of the step sizes begins to narrow, and a slowly increasing step size trend emerges as training progresses. For the intermediate batch sizes this increase is slow and still considerably noisy, whereas for the full batch the step size magnitude increases rapidly in a narrow band, as the gradient magnitude drops and the method approaches an optimum. Presumably this occurs to compensate for the decreasing magnitude of the gradient vector, thus requiring a larger step size for an equivalent magnitude of update to the weights. In a ball around this optimum, the line search "bounces" around in high dimensional space. Since the gradient norm is small, the step size magnitudes are large, which corresponds to the flat error region in the corresponding log training loss plot for the full batch. Here the variance in step size is due only to the inexact line search, since there is no variance in the data for the full batch. This example therefore confirms that GOLSI generalizes naturally to smooth loss functions.
6.2 MNIST Dataset
The results for training MNIST with the NetI network architecture using mini-batch sizes of 10, 100, 200 and 1000 are shown in Figure 7. Again, GOLSI is hindered by the inconsistent information offered by the smallest mini-batch size. However, as the mini-batch size increases, GOLSI remains competitive. The convergence performance of GOLSI is better than that of the medium fixed step size for the larger batch sizes. For the largest batch size, training is particularly aggressive in comparison to the constant step sizes. The automatically resolved step sizes of GOLSI increase with an increase in batch size as well as over the course of training. The superior training performance of GOLSI in this problem also translates to better test classification errors.
In comparison to the BCWD problem, the resolved step sizes of this problem are relatively consistent, having low variance while also showing a slightly growing trend during the course of training. Recall that the initial guess for GOLSI is conservatively small. The plots show magnitudes that are quickly within the range of the fixed step sizes. This shows that GOLSI is capable of recovering an effective step size for the given problem within a few gradient computations.
In this analysis GOLSI has different convergence characteristics to those of the work of Mahsereci and Hennig (2017). We cannot comment on the absolute error obtained, due to possible differences in implementation. However, concerning the shape of the convergence rates: the authors' method tends to progress quickly, then stagnate. Instead, GOLSI follows a consistent linear convergence rate, which does not stagnate (not counting the smallest batch size, where this is not evident) up to the number of function evaluations used for the analyses.
The MNIST results for the NetII architecture are depicted in Figure 8, which exhibit a less competitive view of GOLSI when compared to the NetI results. Firstly, the overall performance of GOLSI is less competitive; and secondly, the variance in the error curves is much lower. As expected, the architecture significantly affects the training, which is evident in that the three equivalent constant step sizes had to be chosen one order of magnitude lower than for NetI training. In contrast, GOLSI remained unchanged. This demonstrates that GOLSI is able to automatically recover step sizes within the range of the carefully selected constant step sizes. It is interesting that in this case GOLSI tends to decrease the step size slightly as training progresses. We suspect that this behaviour is due to narrow ravines in the cost function of the NetII architecture, as observed by Goodfellow et al. (2015) (for an additional visual example, refer back to Figure 1). The consequence is that smaller step sizes are resolved, whereas the medium constant step size could potentially step over these ravines instead of traversing along them. We would like to remind the reader that this would not be a shortcoming of the line search, but of the directions obtained using SGD. This may offer an explanation as to why the convergence of GOLSI slows down. Although GOLSI is not as efficient as the medium step size, it automatically identified and resolved step size updates in the range of the medium step size without any intervention or tuning required.
This analysis is an example where reproducing the work of Mahsereci and Hennig (2017) was difficult, as even their chosen step sizes did not perform in our implementation as in theirs. However, a notable positive in our case is that GOLSI is more stable at the smallest batch size than their probabilistic line search, which diverges at this batch size in their implementation.
6.3 Cifar10
The results for CIFAR10 with NetI using the same range of minibatch sizes are shown in Figure 9. It is evident that the large constant step size improves dramatically as the batch size increases. Similarly, the automatically resolved step sizes of GOLSI increase with an increase in batch size as well as during training. As before, GOLSI struggles the most to reduce the training error for the smallest batch size, and improves in performance as the batch size increases. As expected, the medium step size consistently performs well, setting a competitive baseline. For the larger batch sizes, GOLSI outperforms the medium constant step size in training. However, since training only occurs on Batch 1 of the dataset, similar to Mahsereci and Hennig (2017), it is difficult to make statements about generality from the test error, irrespective of the step size method used.
Apart from the test error, our results are very similar to those obtained by Mahsereci and Hennig (2017). This is true for both the constant step sizes and GOLSI, and is therefore not due to the line search. The data combination given by the authors is plausible, since their results are well replicated for NetII. However, we were unable to replicate their test results for CIFAR10 on NetI. Irrespective thereof, the training plots represent the effectiveness of the training methods, and in this regard GOLSI again proves itself a capable method, performing well on this example.
For NetI, the training performance of GOLSI was best for the largest minibatch sizes. As noted before, the resolved step size increases not only progressively during training but also as the batch size grows, a trend repeated from the MNIST analysis with the same architecture. This might suggest that the trends of optimal step sizes over training are linked to network architecture. For this example the step sizes have low variance.
Lastly, the training plots for CIFAR10 with NetII are given in Figure 10. For this example, the training and test errors we obtained were the closest match to those reported by Mahsereci and Hennig (2017). Instead of choosing the medium and large step sizes an order of magnitude apart, we selected them closer together. This highlights that the difference between a "good" and an ineffective constant training step size can be small. For the larger batch sizes, GOLSI is able to recover competitive step sizes effectively without user intervention, even though the step size sensitivity is high for this architecture and problem.
Comparing the step sizes of this analysis to those of MNIST with the same NetII architecture in Figure 8, it is evident that in both cases GOLSI overestimates the resolved step sizes for the smallest minibatch size. For larger batch sizes the resolved step size trend decreases as training progresses, similar to NetII on MNIST, suggesting that the architecture may dominate the evolution of the step size during training.
Interestingly, GOLSI does not perform as well for the largest batch sizes as for the intermediate ones. To confirm that this was not an anomaly, we conducted further analyses with two additional batch sizes, which confirmed these trends. This indicates that the quality of the search directions may not be effective for the given problem, since the precision with which a NNGPP is located along a direction can only improve with increasing minibatch size. To substantiate this intuitive speculation, we dedicate an additional numerical investigation to the influence of minibatch size on the quality of the descent direction versus its influence on identifying NNGPP along a descent direction.
7 Uncoupling search direction from directional information quality
In this section we highlight the difference in contribution between the quality of the search direction, and the quality of the information contained along that search direction, in the context of stochastic line searches.
A line search utilizes information in two distinct ways: firstly, to construct the search direction, and secondly, to estimate a step size along that direction. In full batch training, both utilize the maximum available information. Under minibatch sampling, however, the two may be affected differently. We therefore investigate the sensitivity of the descent direction, and the sensitivity of locating a sign change along a descent direction, with respect to batch size by investigating the performance of GOLSI.
We conduct an experiment in which we separate the sampling used to generate the direction from the sampling that occurs along the search direction. Since we are using SGD, this amounts to evaluating the gradient of the cost function used to decide the descent direction with a different batch size to the gradient computations used to evaluate the directional derivative along that direction. To this end we use the BCWD dataset, as its small size allows us to easily use the full dataset during evaluation, together with the same four batch sizes used in the previous section for this dataset. In the investigation, each batch size used for the descent direction is paired with each batch size used to resolve the step size along the descent direction, covering the full combinatorial range. One can consider the constant step size method with different batch sizes to be SGD with a set constant radius but varying quality in search direction. We therefore include the constant step size results for a given batch size alongside the corresponding GOLSI training run with the same search direction batch size. The end result is a 16-plot grid of loss curves against function evaluations, shown in Figure 11.
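A minimal sketch of this decoupled sampling scheme is given below, on a synthetic least-squares problem standing in for the BCWD loss. The problem, the batch sizes, the initial step size, and the grow/shrink sweep are all illustrative assumptions rather than our implementation; only the separation of the direction batch size `b_dir` from the line search batch size `b_ls` reflects the experiment described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 400, 10                              # 400 mirrors the BCWD full batch size
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

def minibatch_grad(w, batch):
    """Gradient of 0.5*mean((X_b w - y_b)^2) over a freshly sampled minibatch."""
    idx = rng.choice(N, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def decoupled_step(w, b_dir, b_ls, a0=0.1, eta=2.0, max_evals=15):
    """One SGD update: direction from a size-b_dir batch, step size resolved
    from directional derivatives computed with independent size-b_ls batches."""
    g = minibatch_grad(w, b_dir)
    d = -g / (np.linalg.norm(g) + 1e-12)    # normalized descent direction
    a = a0
    deriv = minibatch_grad(w + a * d, b_ls) @ d
    for _ in range(max_evals):
        if deriv < 0:                        # still descending: grow the step
            a *= eta
        else:                                # sign change passed: shrink back
            a /= eta
        deriv = minibatch_grad(w + a * d, b_ls) @ d
    return w + a * d

w = np.zeros(D)
for _ in range(100):
    w = decoupled_step(w, b_dir=64, b_ls=16)
```

Sweeping `b_dir` and `b_ls` independently over the available batch sizes produces the kind of combinatorial grid analysed in Figure 11.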
Since the magnitude remains fixed between iterations, constant step sizes are only sensitive to the search direction. Hence, small step sizes are affected less by the variance in direction, as the algorithm never moves particularly far in a given direction and generally moves along the expected direction due to the relatively large number of gradient evaluations within the same local neighbourhood of weight estimates. Conversely, large step sizes perform significantly better when larger batch sizes are used for search directions as opposed to directional derivatives along a search direction; compare Figures 11(c,d) to Figures 11(i,m). It is also evident that the medium step size improves more uniformly when search directions are resolved with higher accuracy; compare again Figures 11(c,d) to Figures 11(i,m), this time in view of the medium step size.
Considering GOLSI in terms of direction quality, a poorly resolved search direction results in poor training, regardless of the quality to which the NNGPP along that direction is resolved. This makes intuitive sense, as the line search can make significantly large step updates along the search direction under the immediate accept condition. This is evident when comparing Figures 11(a)-(d) to Figures 11(a), (e), (i) and (m) in view of GOLSI.
Interestingly, good search directions combined with inferior resolution along them also do not result in competitive training (see Figures 11(m)-(o)). If one compares this to the use of competitive stochastic directions (see Figures 11(e)-(g) and 11(i)-(k)), full batch directions show severely slower convergence, indicating that the additional computational cost of computing better search directions is not capitalized on when the step size is poorly resolved along the descent direction. However, improvements are expected eventually, albeit after significantly more gradient computations.
If we consider sample accuracy along a search direction, low quality in the spatial resolution of the NNGPP is ineffective regardless of the quality of the search direction (see Figures 11(a), (e), (i) and (m)). In this case the variance of the 1D location of the NNGPP is too high to result in meaningful progress. The other extreme is using very high quality and good spatial resolution to find NNGPP along suboptimal descent directions (see Figures 11(d), (h), (l) and (p)). This results in a high computational cost to resolve solutions along poor descent directions. It is important to note that, apart from the added computational cost of the larger batch size per function evaluation, the line search itself also uses more gradient evaluations per update step, as the higher resolution allows for more accuracy, prompting the algorithm to expend more iterations to find a sign change. It is not uncommon for full batch analyses to use on average 17 gradient evaluations per update step, in contrast to the other stochastic examples, where the average is typically between 2 and 3 evaluations per iteration. In general, computationally sensible strategies should match the quality of the information used to determine the search direction to that of the information used to determine the solution along the search direction, as indicated by the diagonal of Figure 11. The asymmetry in Figure 11 gives a relative indication that a slightly better resolved search direction is preferable to better resolved directional derivatives along a search direction.
The best training error was obtained using a minibatch size one order of magnitude lower than the full batch size of 400. This indicates that minibatch sampling acts as a regularizer during training. The empirical evidence suggests keeping the batch sizes the same for both direction estimation and sampling along a search direction. However, slight improvements in performance may be obtained by choosing descent directions with slightly larger sample sizes than those used for sampling along a search direction. We demonstrate this empirical assertion in Section 7.1.
7.1 Direction Sensitivity to Batch Size
In this section we conduct a short numerical investigation of how the variance of minibatch-computed descent directions evolves over training. As before, we use the BCWD problem for this analysis. At the solution of every iteration we sample the true (full batch) descent direction, as well as 20 additional descent directions using various minibatch sample sizes, where 400 indicates the full batch. We then calculate the angle between the full batch "true" descent direction and each minibatch-sampled descent direction, and record the mean angle. These mean angles are plotted in Figure 12 as the optimizer updates using the full batch "true" descent direction with full batch directional derivatives along it.
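The angle computation itself can be sketched as follows, again on a synthetic least-squares problem rather than the BCWD network. The data, the fixed iterate `w`, and the sample count of 20 are illustrative assumptions, with 400 playing the role of the full batch as above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 400, 10
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.5 * rng.normal(size=N)  # noisy targets
w = rng.normal(size=D)                                  # current iterate

def grad(idx):
    """Minibatch gradient of 0.5*mean((X_b w - y_b)^2) over the given indices."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

g_full = grad(np.arange(N))                             # "true" full batch gradient

def mean_angle(batch, samples=20):
    """Mean angle (degrees) between the full batch gradient and
    minibatch-sampled gradients of the given batch size."""
    angles = []
    for _ in range(samples):
        g = grad(rng.choice(N, size=batch, replace=False))
        c = g @ g_full / (np.linalg.norm(g) * np.linalg.norm(g_full))
        angles.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return float(np.mean(angles))
```

Sweeping `mean_angle` over batch sizes at every iterate produces the kind of curves shown in Figure 12: the mean angle shrinks as the batch size approaches the full batch.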
It is evident that there are significant changes in mean angle between batches. While later iterations exhibit larger variance for larger batch sizes, the mean angle decreases as the batch size increases. The analysis shows a significant "ramp-up" period over roughly the first 1000 function evaluations. During this period the mean angle increases from a minimum to a maximum value, after which it settles around a constant mean until convergence, where the mean changes again.
Important features include the starting point and the behaviour towards the end of training. As the sample size increases, the mean angle at the beginning of the analysis decreases, indicating consistency in the information contained in the directions. Considering Figure 11, it is evident that only the directions sampled using the smallest minibatch generally failed to converge using GOLSI. This may indicate that, for this problem, an initial directional deviation of around 50 degrees in the descent direction is too severe, leading to a different solution (recall Figures 11(a)-(d)). Deviations of around 20 degrees seem to generate solutions similar to those of the full batch true descent directions for this problem, as these analyses converged (recall Figure 6(b)). At later stages of convergence, larger variance in the mean angles still leads to convergence, with many mean angles being around 80 degrees. This means that at later stages in training, individual samples contribute more towards the direction, though they contribute less towards the error. Since the BCWD dataset is a classification problem, an analogy might be that each sample makes a different contribution towards which way the decision boundary needs to move to improve that sample's error. In the beginning, most data samples contain similar information in terms of where the decision boundary needs to move to reduce the classification error. However, as more of the common information in the data samples is incorporated into the model, the individual differences between the data points become more pronounced. We observe this particularly clearly for the smaller batch sizes. This means that overall the error decreases, but the differences between the directions increase for a constant batch size.
8 Conclusion
For discontinuous stochastic optimization objective functions, we proposed the inexact Gradient-Only Line Search (GOLSI) as a computationally efficient strategy to automatically resolve learning rates. Instead of minimizing along a descent direction, or finding critical points along descent directions, we locate Non-Negative Associated Gradient Projection Points (NNGPP). Along a 1D descent direction, NNGPP are indicated by sign changes in the directional derivative from negative (indicative of descent) to positive (indicative of ascent). Hence, NNGPP incorporate second order information indicative of a minimum.
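To make the definition concrete, the following sketch locates a sign change in the directional derivative along a descent direction by growing or shrinking a trial step size by a constant factor. This is a simplified, hypothetical rendering for intuition only, not our exact GOLSI update rule; the callable `dd`, the growth factor `eta`, and all bounds are illustrative assumptions.

```python
def sign_change_search(dd, a0=1e-2, eta=2.0, a_min=1e-8, a_max=1e7, max_evals=50):
    """Locate a step size just past a sign change (candidate NNGPP) of the
    directional derivative dd(a) along a descent direction.

    dd(a): directional derivative at step size a; negative indicates descent,
    positive indicates ascent.
    """
    a = a0
    deriv = dd(a)
    evals = 1
    if deriv < 0:
        # Descent at the trial step: grow until the derivative turns non-negative.
        while deriv < 0 and a < a_max and evals < max_evals:
            a *= eta
            deriv = dd(a)
            evals += 1
    else:
        # Ascent at the trial step: shrink until descent is recovered.
        while deriv >= 0 and a > a_min and evals < max_evals:
            a /= eta
            deriv = dd(a)
            evals += 1
    return a

# Smooth 1D sanity check: along the line f(a) = (a - 3)^2, so dd(a) = 2(a - 3)
# and the sign change sits at a = 3.
a = sign_change_search(lambda s: 2.0 * (s - 3.0))
```

On this smooth example the accepted step brackets the sign change at a = 3 to within a factor of `eta`; on a stochastic loss the same sweep would act on noisy directional derivatives.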
We demonstrate on three classical machine learning problems (Breast Cancer Wisconsin Diagnostic, MNIST with two neural network architectures, and CIFAR10 with two neural network architectures) that learning rates can be efficiently resolved for SGD using GOLSI.
Our method has been demonstrated to be competitive in training without requiring any manual tuning, which reduces the active human hours required to successfully train a neural network. GOLSI allows for dynamic step sizes that can vary over 15 orders of magnitude. Lastly, GOLSI allows for an intuitive line search implementation, which shows a great deal of potential for further development and integration into other traditional mathematical programming methods. Towards this aim, we conducted a small empirical investigation regarding the information required to resolve descent directions versus directional derivatives along a descent direction, for the Breast Cancer Wisconsin Diagnostic dataset only. For this problem, keeping the batch sizes for the search direction and for the evaluations along it the same was found to be a reliable initial selection strategy, as long as the search direction is sufficiently resolved. For SGD, there seems to be some potential computational benefit in using slightly fewer gradient computations to resolve the directional derivatives along a descent direction. However, conclusive results may require more representative datasets, as well as additional optimizers for which GOLSI resolves the step sizes dynamically.
This initial study will hopefully stimulate the possibility of successfully using line searches in stochastic neural network optimization, which may also present alternative opportunities to incorporate second order information with strategies like quasi-Newton and conjugate gradient methods (Arora, 2011; Le et al., 2011).
Acknowledgements
This work was supported by the Centre for Asset and Integrity Management (CAIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa. We thank NVIDIA for sponsoring the Titan X Pascal GPU used in this study.
References
 Lyapunov (1992) Aleksandr M. Lyapunov. The general problem of the stability of motion. International Journal of Control, 55(3):531–534, 1992. doi: 10.1080/00207179208934253.
 Arora (2011) Jasbir Arora. Introduction to Optimum Design, Third Edition. Academic Press Inc, 2011. ISBN 0123813751.
 Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012. ISSN 15324435. URL http://www.jmlr.org/papers/v13/bergstra12a.html.
 Boyd et al. (2003) Stephen Boyd, Lin Xiao, and Almir Mutapcic. Subgradient methods. lecture notes of EE392o, Stanford …, 1(May):1–21, 2003.
 Fisher (1936) R. A. Fisher. The use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, sep 1936. ISSN 20501420. doi: 10.1111/j.14691809.1936.tb02137.x.
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of Machine Learning Research, pages 1–8, 2010. ISBN 0780314212. doi: 10.1109/IJCNN.1993.716981.
 Goodfellow et al. (2015) Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively Characterizing Neural Network Optimization Problems. ICLR, pages 1–11, 2015. URL http://arxiv.org/abs/1412.6544.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey E. Hinton. Learning Multiple Layers of Features from Tiny Images. 2009. URL https://www.cs.toronto.edu/~kriz/cifar.html.
 Le et al. (2011) Quoc V Le, Adam Coates, Bobby Prochnow, and Andrew Y Ng. On Optimization Methods for Deep Learning. Proceedings of The 28th International Conference on Machine Learning (ICML), pages 265–272, 2011. ISSN 9781450306195. doi: 10.1.1.220.8705.
 Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, nov 1998. ISSN 00189219. doi: 10.1109/5.726791.
 Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. pages 1–16, 2016. ISSN 15826163. doi: 10.1002/fut. URL http://arxiv.org/abs/1608.03983.
 Mahsereci and Hennig (2017) Maren Mahsereci and Philipp Hennig. Probabilistic Line Searches for Stochastic Optimization. pages 1–12, 2017. ISSN 10495258. doi: 10.1016/j.physa.2015.02.029. URL http://arxiv.org/abs/1703.10034.
 Orabona and Tommasi (2017) Francesco Orabona and Tatiana Tommasi. Training Deep Networks without Learning Rates Through Coin Betting. pages 1–14, 2017. ISSN 10495258. URL http://arxiv.org/abs/1705.07795.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, sep 1951. ISSN 00034851. doi: 10.1214/aoms/1177729586. URL http://projecteuclid.org/euclid.aoms/1177729586.
 Schraudolph et al. (2007) Nicol N. Schraudolph, Jin Yu, and Simon Günter. A Stochastic Quasi-Newton Method for Online Convex Optimization. International Conference on Artificial Intelligence and Statistics, pages 436–443, 2007. ISSN 15324435. URL http://eprints.pascalnetwork.org/archive/00003992/.
 Schraudolph (1999) N.N. Schraudolph. Local gain adaptation in stochastic gradient descent. 9th International Conference on Artificial Neural Networks: ICANN ’99, 1999:569–574, 1999. ISSN 05379989. doi: 10.1049/cp:19991170. URL http://digitallibrary.theiet.org/content/conferences/10.1049/cp{_}19991170.
 Schraudolph and Graepel (2003) Nn Schraudolph and T Graepel. Combining conjugate direction methods with stochastic approximation of gradients. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003, pages 2–7, 2003. URL http://www.schraudolph.org/pubs/SchGra03.pdf.
 Senior et al. (2013) Andrew Senior, Georg Heigold, Marc’Aurelio Ranzato, and Ke Yang. An empirical study of learning rates in deep neural networks for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, pages 6724–6728, 2013. ISSN 15206149. doi: 10.1109/ICASSP.2013.6638963. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp={&}arnumber=6638963.
 Smith (2015) Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. (April), 2015. doi: 10.1109/WACV.2017.58. URL http://arxiv.org/abs/1506.01186.
 Snyman and Wilke (2018) Jan A Snyman and Daniel N Wilke. Practical Mathematical Optimization, volume 133 of Springer Optimization and Its Applications. Springer International Publishing, Cham, 2018. ISBN 9783319775852. doi: 10.1007/9783319775869. URL http://link.springer.com/10.1007/9783319775869.
 Street et al. (1993) W.N Street, W.H. Wolberg, and O.L. Mangasarian. Nuclear Feature Extraction For Breast Tumor Diagnosis. 1993.
 Tong and Liu (2005) Fei Tong and Xila Liu. Samples Selection for Artificial Neural Network Training in Preliminary Structural Design. Tsinghua Science & Technology, 10(2):233–239, apr 2005. ISSN 10070214. doi: 10.1016/S10070214(05)700602. URL https://www.sciencedirect.com/science/article/pii/S1007021405700602.
 Werbos (1994) Paul John Werbos. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, New York, NY, USA, 1994. ISBN 0471598976.
 Wilke et al. (2013) Daniel Nicolas Wilke, Schalk Kok, Johannes Arnoldus Snyman, and Albert A. Groenwold. Gradient-only approaches to avoid spurious local minima in unconstrained optimization. Optimization and Engineering, 14(2):275–304, June 2013. ISSN 13894420. doi: 10.1007/s1108101191787. URL http://link.springer.com/10.1007/s1108101191787.
 Wilke (2012) D.N. Wilke. Structural shape optimization using Shor’s ralgorithm. In Third International Conference on Engineering Optimization, 2012. ISBN 9788576503439.
 Wilson and Martinez (2003) D Randall Wilson and Tony R Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003. ISSN 08936080. doi: 10.1016/S08936080(03)001382.
 Wu et al. (2018) Xiaoxia Wu, Rachel Ward, and Léon Bottou. WNGrad: Learn the Learning Rate in Gradient Descent. pages 1–16, 2018. URL http://arxiv.org/abs/1803.02865.