Data-driven Algorithm Selection and Parameter Tuning: Two Case Studies in Optimization and Signal Processing
Abstract
Machine learning algorithms typically rely on optimization subroutines and are well known to provide very effective outcomes for many types of problems. Here, we flip the reliance and ask the reverse question: can machine learning algorithms lead to more effective outcomes for optimization problems? Our goal is to train machine learning methods to automatically improve the performance of optimization and signal processing algorithms. As a proof of concept, we use our approach to improve two popular data processing subroutines in data science: stochastic gradient descent and greedy methods in compressed sensing. We provide experimental results that demonstrate that the answer is “yes”: machine learning algorithms do lead to more effective outcomes for optimization problems. These results also show the future potential of this research direction.
1 Introduction
Machine learning is a popular and powerful tool that has emerged at the forefront of a vast array of applications (most famously in image processing). At their core, neural nets rely on solving nonlinear optimization problems. From this point of view, improving key optimization subroutines and other auxiliary data processing methods directly helps to improve learning methods. Here, we aim to use machine learning algorithms to improve two optimization and signal processing subroutines, which involve choice of parameters and algorithm selection. We believe our approach has great potential because these subroutines are central to the performance of machine learning algorithms. Thus our setup leads to the intriguing “meta” notion of using machine learning to improve machine learning. Our framework will be useful in other settings as well, where one must choose methods and/or parameters with limited knowledge about the input data.
In optimization and signal processing, one must often choose among several algorithms, or fine-tune parameters, before applying a method to an input instance. These choices can lead to drastically different outcomes, and thus such selections are crucial in many applications. The questions we consider here are: what is the best way to select such algorithms? What is the best choice of parameters? In this paper, rather than using a one-size-fits-all rule to choose, we focus instead on the features of individual problem instances and allow these features to guide the parameter or algorithm selection. It is well known that the performance of algorithms, even of those considered to be very efficient, depends on the particular input instances and data. Thus it makes sense to adapt the choice of algorithm or parameters to the concrete instance in question. It is therefore natural to use machine learning tools to perform the selection, just as a human expert would make such choices.
Our work fits within a topic of Artificial Intelligence that has received several names: algorithm selection, algorithm configuration, self-adapting algorithms, or simply automated machine learning (autoML). This topic seeks to efficiently automate the selection of algorithms or their parameter configurations. It has attracted increasing attention and relies on multiple techniques. Recent work on algorithm selection using machine learning has seen a strong surge in both practical and theoretical results. Let us mention that [28] approaches the problem as a Markov decision problem. In [45] the problem is approached with techniques similar to the matrix completion method. A learning framework for algorithm selection was presented in [20] with a follow-up in [3]. From the computational and experimental point of view, algorithm selection via machine learning has been used in several areas of optimization (see e.g., [24, 2, 1] and the many references therein). In fact, recently for the purpose of training algorithm selection, some libraries have been established to organize data for a wide range of NP-hard tasks (where the aim is to predict how long an algorithm will take to solve concrete instances of NP-complete problems, or to choose the best approximation schemes tailored to instances) [36, 6, 26]. While these works are combinatorial in nature, here we propose that learning can drive optimal selection in continuous and analytical problems.
Our contributions
In this work we study the problems of algorithm selection in compressed sensing and parameter tuning for SGD algorithms. In both cases we wish to train a recommendation or classification algorithm that can output an optimal algorithm (method, parameters, etc.) for a given input data set. Note that in contrast to previous work in step-size tuning for SGD, our work does not require that iteration-specific step sizes be computed at every step. We study these problems from the experimental point of view. Our main contributions are as follows.

In Section 2 we apply our methodology, through concrete experiments, to the selection of compressed sensing algorithms. Here we concentrate on selecting the best among three well-known greedy algorithms for solving the compressed sensing problem: Hard Thresholding Pursuit (HTP) [16], Normalized Iterative Hard Thresholding (NIHT) [8], and Compressive Sampling Matching Pursuit with Subspace Pursuit (CSMPSP) [34, 31]. We have been inspired by the work of [7], where the authors catalog optimal algorithm selection through brute-force experimental testing. Although our machine learning approach is useful precisely when such a rigorous catalog is not available, the work of [7] will be used as validation of our framework.

In Section 3 we apply our methodology, through concrete experiments, to the selection of the best step size in the popular stochastic gradient descent algorithm [38], which itself is used as a subroutine in many learning frameworks. Unfortunately, tuning the step size (also called the learning rate) is often more an art than a science, and the selection can lead to drastically different overall behavior. We aim to alleviate this issue by allowing for such selections to be done by the trained machine.
For our experiments we use neural networks for classification. Neural networks are computing systems inspired by biological neural networks. They have shown remarkable success in various machine learning tasks including classification [27]. While there are plenty of sophisticated, state-of-the-art neural net architectures such as GoogLeNet [42], ResNet [22], DenseNet [23] and CliqueNet [46], we will demonstrate that even simple networks that do not have to be run on expensive remote processors can aid in algorithm selection. This neural net learning approach is well suited to learning and modeling nonlinear and complex relationships, allowing, as the name suggests, the machine to learn data relationships by itself.
While there has been much work in the area of autoML and automated algorithm selection, these automated approaches differ from ours in simplicity of application to new data. These methods require a preprocessing step before application of the learning technique for algorithm selection [4, 5, 15, 37, 45]. Meanwhile, our approach simply uses the data encoding the problem (or even simpler attributes of the data) as input features to our learning approach. This straightforward approach allows practitioners to apply this algorithm selection framework without the expertise needed to determine meta-features of the data that will enable an effective learning approach.
Notation
Here and throughout the paper, we write
$$y = Ax, \tag{1}$$
where $A \in \mathbb{R}^{m \times n}$ is the measurement (or data) matrix, $y \in \mathbb{R}^m$ is the measurement vector, and $x \in \mathbb{R}^n$ is the signal being recovered. We use $(\cdot)^T$ to denote the transpose operator, and $(\cdot)^\dagger$ denotes the pseudoinverse.
In the compressed sensing problem considered in Section 2, the measurement matrix is underdetermined ($m < n$) and the signal $x$ is assumed to be sparse; in particular, we say $x$ is $k$-sparse when it has at most $k$ nonzero entries. Furthermore, for any vector, one operator returns the indices corresponding to its $k$ largest-in-magnitude entries, and a thresholding operator returns a vector whose entries are 0 outside of a given support set $T$ and equal to those of the original vector on $T$. For any index set $T$, the matrix $A$ constrained to the columns indexed by $T$ is denoted by $A_T$.
In the least-squares problems considered in Section 3, the measurement matrix is overdetermined ($m > n$) and no sparsity assumption is made on the signal $x$. We use the recovery error and the residual error at a given iteration of SGD to measure the performance of the algorithm with given learning rates.
2 Application I: compressed sensing algorithm selection
We begin the investigation of our framework with a proof of concept inspired by the work done in [7], where the authors rigorously test various compressed sensing methods under various settings in a brute-force way. We will show that we can use neural networks to recover the phase transitions that were acquired via rigorous testing in the aforementioned paper. For this reason, we adopt a similar algorithmic and experimental setup. First, we explain the compressed sensing problem and the notation used throughout, then we present three greedy algorithms. Following that, the experimental setup, including the different sensing matrices, signal initialization, and stopping criteria, is discussed. Finally, we present our experimental results and remark on our findings.
There is now an abundance of both theory and algorithms that guarantee robust and accurate recovery of sparse signals, under various assumptions on the measurement matrix [17, 14]. For example, the so-called Restricted Isometry Property [9] guarantees such recovery, and random matrix constructions are shown to satisfy this property when the number of measurements scales nearly linearly in the sparsity level, up to logarithmic factors in the signal dimension [39]. Under this or related assumptions, both greedy (iterative) algorithms and optimization-based methods (e.g., $\ell_1$-minimization) are shown to produce accurate recovery results. In general, the performance of such algorithms depends on the undersampling and oversampling rates, which we denote as
$$\delta = \frac{m}{n} \quad \text{and} \quad \rho = \frac{k}{m}, \tag{2}$$
respectively, where $m$ is the number of measurements, $n$ is the signal dimension, and $k$ is the number of nonzero entries of the signal. Furthermore, we refer to combinations of $\delta$ and $\rho$ as the $(\delta, \rho)$ plane. By observing the behavior of algorithms on the $(\delta, \rho)$ plane, we can see how different approaches act under various sampling rates.
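To make the role of the sampling rates concrete, the following minimal sketch maps a point on the plane to integer problem dimensions. It assumes the common phase-transition convention that the undersampling rate is the number of measurements divided by the signal dimension and the oversampling rate is the sparsity divided by the number of measurements. The function name `problem_size` and the choice to round up are illustrative assumptions, not details taken from the experiments.

```python
import math


def problem_size(delta, rho, n):
    """Map a point (delta, rho) to integer problem dimensions (m, k).

    Assumes delta = m / n (undersampling) and rho = k / m (oversampling).
    Rounding up is an illustrative choice, not the paper's.
    """
    if not (0 < delta <= 1 and 0 < rho <= 1):
        raise ValueError("delta and rho must lie in (0, 1]")
    m = math.ceil(delta * n)  # number of measurements
    k = math.ceil(rho * m)    # number of nonzero entries in the signal
    return m, k


# Example: half as many measurements as unknowns, sparsity a quarter of m.
m, k = problem_size(0.5, 0.25, 1000)
```

Sweeping `delta` and `rho` over a grid with this kind of mapping is one way to generate the problem sizes behind each point of a phase-transition plot.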
We consider three greedy algorithms for solving the compressed sensing problem: Hard Thresholding Pursuit (HTP) [16], Normalized Iterative Hard Thresholding (NIHT) [8], and Compressive Sampling Matching Pursuit with Subspace Pursuit (CSMPSP) [34, 31]. The pseudocode for HTP, NIHT, and CSMPSP appears in Algorithm 1, Algorithm 2, and Algorithm 3 respectively. These methods are all similar in spirit; they seek to recover the signal $x$ from the measurement vector $y$ while also identifying the support of $x$, which is discovered iteratively. Each essentially uses a proxy for the signal (e.g., $A^T y$, with $A$ the measurement matrix) to identify a support estimate $T$, then estimates the signal on that support (e.g., $A_T^\dagger y$), then computes the residual and repeats the process to locate the remainder of $x$. HTP and NIHT use specially chosen step sizes when updating the estimate from one iteration to the next and recompute the support in each iteration, whereas CSMPSP uses a union of prior estimates followed by pruning. See Algorithms 1–3 and [16, 8, 31] for details about these approaches. What is important for our purpose is that each algorithm may perform differently for a given set of inputs, leading to varying accuracy on the output. Therefore, there is value in using machine learning tools to decide which algorithm is the best choice in a given problem instance. Also note that each algorithm takes the same inputs, namely the measurement matrix $A$, the measurement vector $y$, and an approximation for the number of nonzero entries in the sparse signal.
Although the theory for these approaches holds uniformly, meaning it holds for any sparse signal and matrix satisfying the assumptions, it has long been observed that the algorithms actually behave quite differently on various kinds of signal and measurement ensembles [18, 10, 7]. In fact, [7] documents an extensive comparison of these approaches for various ensembles while varying the sampling rates $\delta$ and $\rho$ of (2). This latter work can be used as a “lookup table” when one knows the input information and wants to select the optimal algorithm for a given purpose. Their work, in some sense, motivates us to apply the machine learning methodology to compressed sensing, as we have a comprehensive benchmark with which to compare these methods. Note that these comparisons were made in a brute-force manner, where each method was run on each ensemble type over a fine grid of input parameters. Such an exhaustive approach is not practical when the input domain is extremely large. Moreover, in this setting, we have a greater understanding of how these greedy algorithms will behave for specific problem instances, making it an appropriate problem with which to verify and validate our framework.
2.1 Experimental setup
We consider three randomly generated measurement matrices for this setting: Gaussian, Sparse, and Discrete Cosine Transform (DCT). Entries of the Gaussian matrices are drawn from so that in expectation, they have normalized columns. Sparse measurement matrices have nonzero entries in each column where the value of the nonzero entries is drawn from with equal probability. Finally the DCT measurement matrices consist of randomly subsampled rows of the full DCT matrix. The number of measurements is determined by and the vector being recovered has nonzero entries (determined by ) and takes on values with equal probability where and are as defined in (2). The measurement vector where is one of the three types of measurement matrices and is the signal to be recovered.
We terminate any algorithm when it satisfies one of the following stopping criteria.

Convergence  An algorithm is convergent if the residual error is small enough. In particular, if

Divergence  An algorithm is divergent if the residual error is larger than a factor of the norm of the initial residual:

Slow Progress I  After iterations of NIHT or iterations of CSMPSP or HTP, we begin to check for slow progress. For the first version of “slow progress” we check whether the residual has made any significant progress over the last 15 iterations:

Slow Progress II  After iterations of NIHT or iterations of CSMPSP or HTP, we check whether the convergence rate is close to 1:

Maximum Iteration  An algorithm that runs for longer than 60 minutes (discounting time for computing metrics) or iterations (where for NIHT and for CSMPSP and HTP) has reached the allowable computation time and is terminated.
It should be noted that stopping criteria (1)–(4) are as in [7], while the last exit was added to keep a single experiment from running for too long. Practically, the last stopping criterion reflects a computational time constraint.
2.2 Experiments
In the following set of experiments, we train neural networks to classify whether or not an algorithm can recover a signal in the standard compressed sensing problem (1). The experiment requires three phases: creating training data, training the neural network, and testing the neural network.
In the first phase, training data with labels are created to input into the neural network. The training data set comprises 2241 samples. For each matrix type (Gaussian, Sparse, DCT), there are 747 training points on the $(\delta, \rho)$ plane of sampling rates (see (2)). For each $(\delta, \rho)$ pair, we run Algorithm 1, Algorithm 2, and Algorithm 3 until the algorithm satisfies one of the stopping criteria discussed in Section 2.1. In order for a given algorithm to be labeled as “successful” at recovering signals for a specified $(\delta, \rho)$ pair and measurement matrix, at least 50 of the 100 randomly generated samples must have satisfied the “convergence” stopping criterion. This phase is completed in MATLAB using version R2014b on a desktop running Linux.
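The labeling rule just described can be sketched in a few lines. This is a minimal illustration that reads the rule as "at least 50 of the 100 trials converged"; the function name `label_success` and the representation of trials as a list of booleans are assumptions made for the example, and the original labeling was carried out in MATLAB.

```python
def label_success(converged_flags, min_successes=50):
    """Return 1 ("successful") if enough trials met the convergence criterion.

    converged_flags: one boolean per random trial at a fixed sampling-rate
    pair and matrix type (hypothetical representation).
    """
    n_converged = sum(1 for flag in converged_flags if flag)
    return int(n_converged >= min_successes)


# Example: 100 trials, exactly 50 of which satisfied the convergence criterion.
trials = [True] * 50 + [False] * 50
label = label_success(trials)
```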
The training data from the first phase and labels are used to train neural networks in the second phase. The input variables used by the neural network are the signal dimension $n$, the number of measurements $m$, the number of nonzero entries $k$, and an indicator variable that indicates the measurement matrix. The second phase is accomplished using Python 3 and Keras 2.1 with TensorFlow as a backend. We set up the neural network to contain two hidden layers, the first with three nodes and the second with nine nodes, and offer the following intuition for the neural network structure. The purpose of the first layer is to determine the measurement matrix type, while the second layer classifies whether or not an algorithm will be successful. The hidden layers utilize ReLU as their activation function, while the final output layer uses the sigmoid function. Approximately 90% of the available data is used to train our neural network for each algorithm, and the remaining 10% is used to measure validation accuracy on the trained network.
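To show how small this architecture is, the sketch below writes out its forward pass (4 inputs, 3 ReLU nodes, 9 ReLU nodes, 1 sigmoid output) in plain Python. It is only an illustration under stated assumptions: the weights are random and untrained, the feature scaling and the encoding of the matrix type as a single number are hypothetical, and the actual experiments used Keras rather than this hand-written code.

```python
import math
import random


def relu(v):
    return [max(0.0, t) for t in v]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dense(weights, bias, v):
    """Affine map: one output per row of `weights`."""
    return [sum(w * t for w, t in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_success(features, layers):
    """Forward pass: 4 inputs -> 3 ReLU -> 9 ReLU -> 1 sigmoid."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(dense(w1, b1, features))
    h2 = relu(dense(w2, b2, h1))
    return sigmoid(dense(w3, b3, h2)[0])


rng = random.Random(0)
layers = [init_layer(3, 4, rng), init_layer(9, 3, rng), init_layer(1, 9, rng)]
# Hypothetical rescaled features: signal dimension, measurements, sparsity,
# and a numeric code for the matrix type.
features = [1.0, 0.5, 0.1, 0.0]
p = predict_success(features, layers)  # value in (0, 1), read as P(success)
```

With only 61 trainable parameters (12 + 3, 27 + 9, and 9 + 1 weights and biases), a network of this shape can be trained on a laptop.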
Figures 1, 2, and 3 present the computational results for HTP, NIHT, and CSMPSP respectively. In each subplot, the horizontal axis represents the value of $\delta$ and the vertical axis represents the value of $\rho$. Furthermore, each figure can be broken down as follows. Each column isolates a specific measurement matrix: Gaussian, sparse, and DCT (left to right). The first row of each figure shows the training pairs (i.e., data created in the first phase of the experiment) along with their labels, indicated by the color of the data point. Here, yellow points indicate that an algorithm is “successful” and blue points indicate that the algorithm is not successful. In the second row of each figure, we show results produced by the trained neural network from the second phase on test data created by uniformly sampling the $(\delta, \rho)$ plane. The accuracy of the trained networks on validation data is reported in the caption of each figure. For all experiments, the signal dimension $n$ is fixed, while $m$ and $k$ are computed according to the specified $\delta$ and $\rho$.
These numerical experiments show that even a simple neural network is able to approximately determine whether or not a given greedy algorithm and $(\delta, \rho)$ pairing will result in successful signal recovery. In particular, the yellow regions in the second row of each figure not only roughly approximate the yellow regions in the first row, but they also noticeably vary across both algorithm and measurement matrix to match the input training data, as desired.
3 Application II: stochastic gradient descent learning rate selection
We now further test our machine learning framework with an exploration of learning rate schedule selection for the stochastic gradient descent (SGD) algorithm. In this set of experiments, we demonstrate that one can use neural networks to select a learning rate schedule which improves the behavior of SGD on a given instance, provided proper training data. After a brief introduction to the vast body of literature regarding the convergence behavior of SGD and corresponding learning rates, we discuss our experimental results and comment on our findings. In Subsection 3.1, we describe in detail the design of our neural network framework. We additionally describe the construction of the training and testing data provided to the network in each experiment.
SGD is a ubiquitous first-order iterative method for convex optimization. The classical SGD algorithm for minimizing an objective that is a sum of component functions works as follows: after selecting a learning rate (or step size) schedule and an initialization, we repeat the following until the stopping criterion is satisfied: randomly select a component index and update the current iterate by taking a step, scaled by the current learning rate, along the negative gradient of the selected component. The applications in which SGD is the method of choice are diverse and cut across many scientific fields, with perhaps the most active application currently being the training of neural networks. The performance of SGD depends heavily on the selected learning rate (or step size) schedule and on parameters of the objective function such as the Lipschitz constant or strong convexity parameter [38, 41, 33, 35]. Parameter tuning for SGD can also be interpreted as an algorithm selection problem. There are numerous proposed line search methods for selecting learning rates and methods for performing one-dimensional optimization on the learning rate to speed convergence [32, 11, 30, 43]. In practice, learning rate selection can be quite ad hoc, and there are popular heuristics for updating the learning rate [19].
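As a concrete instance of the update described above, the following sketch runs SGD on a small least-squares problem with a pluggable learning rate schedule. Everything specific here is an assumption made for illustration: the objective is taken to be the sum over rows of half the squared row residual, the helper names (`sgd_least_squares`, `constant`, `epoch_based`) are hypothetical, and the epoch-based schedule simply divides the step by the epoch number, which need not match the schedule used in the experiments.

```python
import math
import random


def sgd_least_squares(A, y, schedule, iters, seed=0):
    """SGD on F(x) = sum_i 0.5 * (a_i . x - y_i)^2, one random row per step."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for t in range(iters):
        i = rng.randrange(m)                              # sample a row uniformly
        r = sum(a * v for a, v in zip(A[i], x)) - y[i]    # residual of row i
        step = schedule(t)
        x = [v - step * r * a for v, a in zip(x, A[i])]   # gradient step
    return x


def constant(c):
    return lambda t: c


def epoch_based(c, epoch_len):
    # Hypothetical form: divide the step by the (1-based) epoch number.
    return lambda t: c / (1 + t // epoch_len)


def error(x, x_true):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_true)))


A = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]                  # unit-norm rows
x_true = [2.0, -1.0]
y = [sum(a * v for a, v in zip(row, x_true)) for row in A]  # consistent system
x_const = sgd_least_squares(A, y, constant(1.0), iters=300)
x_epoch = sgd_least_squares(A, y, epoch_based(1.0, epoch_len=50), iters=300)
```

With unit-norm rows and a constant step of 1, each update is a projection onto one equation's solution set, so on this consistent system the constant schedule drives the recovery error essentially to zero, while the decaying schedule makes slower progress.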
Recently, practitioners and theorists alike have turned their attention to adaptive learning rate schedules, in which the learning rate assigned to a component updates according to information gleaned from the sample [13, 47, 40, 25, 12]. Recent adaptive learning rate approaches approximate Lipschitz parameters and use this to approximately compute a learning rate [35, 44].
Our work presents a machine learning framework which allows practitioners to choose a learning rate schedule without knowledge of objective function parameters. As a proof of concept, we focus on solving least-squares problems, but we stress that our framework could be applied to more complex objective functions. This framework offers practitioners an alternative to heuristics and unknown objective function parameters.
3.1 Experiments
In each of the experiments presented below, we apply SGD to solve a least-squares problem defined by a measurement matrix $A$ and a measurement vector $y$. The goal of our machine learning framework is to train a neural network to select the optimal learning rate schedule (out of a fixed set of schedules) for a given input linear system represented by its measurement vector $y$; we specify the measure with which we compare learning rates in each section below. These experiments also require the same three phases as in Section 2: creating training data, training the neural network, and testing the behavior of SGD with the neural network predicted learning rates.
In the first phase, we generate data points consisting of measurement vectors and labels that indicate the optimal learning rate schedule. We compare only two types of learning rate schedules: a constant learning rate and an epoch-based learning rate schedule; the constants defining them are specified in each experiment below. To select the optimal learning rate schedule and assign a label to each data point, we run a fixed number of iterations of SGD with each schedule and assign the label of the learning rate schedule that resulted in the smallest recovery error with respect to the signal. This phase is completed in MATLAB using version R2017a on a laptop running macOS. In each experiment, our input data points form two classes which correspond to the two learning rate schedules. The consistent systems are optimally solved with the constant learning rate schedule, while the inconsistent systems are optimally solved with the epoch-based learning rate schedule; this decreasing learning rate schedule helps SGD avoid the larger convergence horizon of the inconsistent systems. These data points are labeled accordingly, and the neural network task of predicting the optimal learning rate schedule is equivalent to predicting to which set of systems each data point belongs. In each experiment, the data set consists of 3000 measurement vectors, a portion of which is used for training while the remainder is reserved for testing.
In the second phase, we train a neural network with the training portion of the data set, consisting of the measurement vectors and the optimal learning rate schedule labels for each system. The second phase is performed in Python 3 and Keras 2.1 with TensorFlow as a backend. The neural network architecture we adopt has a single hidden layer. The intuition for this choice of network architecture is that in our experimental setup the network only needs to determine which systems are consistent; as this is a linear problem, we expect that a thin, simple architecture should be successful. The hidden layer nodes use ReLU as the activation function, and the final output layer uses the sigmoid function. In the experiments below, we sample three different fractions of the data to train the neural network and reserve the remaining data for testing validation accuracy.
In the third phase, we measure the validation accuracy of the trained neural network predictions on the test set. Additionally, we use the neural network predicted learning rate schedules to solve each least-squares problem in the test set with a fixed number of iterations of SGD and measure the resulting average recovery error and average residual error over the test set. We compare these average error measures for the neural network predicted learning rates with the average errors obtained by solving the test set using only the constant learning rate schedule and only the epoch-based learning rate schedule.
3.1.1 Synthetic Linear Systems
In this experiment, we train a neural network to recommend either the constant learning rate or the epoch-based learning rate schedule. Here we set $A$ to be a fixed matrix with i.i.d. Gaussian entries, and we design two types of linear systems with this matrix, consistent and inconsistent. For the set of consistent linear systems, we set $y = Ax$ where $x$ is a Gaussian vector. For the set of inconsistent systems, we construct the error $e$ from a Gaussian random vector so that $e$ is orthogonal to the column space of $A$. We then set $y = Ax + e$ and normalize. The set of consistent systems is optimally solved with the constant learning rate schedule, and the set of inconsistent systems is optimally solved with the epoch-based learning rate schedule.
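One way to build an error vector orthogonal to the column space, as described above, is to project a Gaussian vector onto the orthogonal complement of the columns. The sketch below does this with modified Gram-Schmidt in plain Python. The helper name `orthogonal_error`, the small problem size, and the unit-variance Gaussian entries are assumptions for illustration; the normalization step is omitted because its target value is not specified here.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_error(A, g):
    """Remove from g its component in the column space of A.

    Uses modified Gram-Schmidt to orthonormalize the columns of A.
    """
    m, n = len(A), len(A[0])
    basis = []
    for j in range(n):
        col = [A[i][j] for i in range(m)]
        for q in basis:
            c = dot(q, col)
            col = [a - c * b for a, b in zip(col, q)]
        norm = dot(col, col) ** 0.5
        if norm > 1e-12:                       # skip numerically dependent columns
            basis.append([a / norm for a in col])
    e = list(g)
    for q in basis:
        c = dot(q, e)
        e = [a - c * b for a, b in zip(e, q)]
    return e


rng = random.Random(1)
m, n = 8, 3                                    # small overdetermined example
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x = [rng.gauss(0, 1) for _ in range(n)]
g = [rng.gauss(0, 1) for _ in range(m)]
e = orthogonal_error(A, g)
Ax = [dot(row, x) for row in A]
y_consistent = Ax                              # an exact solution exists
y_inconsistent = [a + b for a, b in zip(Ax, e)]  # no exact solution
```

Because the error is orthogonal to every column, the least-squares solution of the inconsistent system is still the original signal, which is what makes recovery error a meaningful measure for both classes.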
We train the neural network with random subsets of a collection of 3000 linear system measurement vectors, 1500 of which are consistent and 1500 of which are inconsistent. In our experiment, we measure the average validation accuracy of the neural network predictions on the remaining test measurement vectors over ten trials, in each of which we randomly sample a training subset of the 3000 measurement vectors at one of three sizes; the average validation accuracies are listed in Table 1. Furthermore, we list the average recovery error and average residual error for the approximation computed by SGD iterations using first the constant learning rate schedule, then the epoch-based learning rate schedule, and finally the neural network predicted learning rate for each system. These measures are listed in Table 1; the smallest average error is bolded in each row. Note that the average recovery error and average residual error for the neural network predicted learning rates are lower than those of the constant learning rate or epoch-based learning rate for the neural networks trained with the two larger fractions of the data. We suspect that the errors associated with the learning rates predicted by the neural network trained with the smallest fraction of the data are not the lowest because of the low neural network validation accuracy, which is in turn due to the small amount of training data.
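Table 1: Synthetic linear systems. Average validation accuracy, average recovery error, and average residual error; each row corresponds to one training fraction, and the smallest average error in each row is bolded.

Average recovery error:

| Train | Validation accuracy | Const. | Epoch | NN Pred. |
| --- | --- | --- | --- | --- |
|  | 86.00% | 0.01142 | 0.01525 | **0.00909** |
|  | 77.01% | 0.01138 | 0.01530 | **0.01064** |
|  | 66.32% | **0.01116** | 0.01524 | 0.01177 |

Average residual error:

| Train | Const. | Epoch | NN Pred. |
| --- | --- | --- | --- |
|  | 0.50980 | 0.64512 | **0.45912** |
|  | 0.50806 | 0.64590 | **0.49707** |
|  | **0.49739** | 0.64027 | 0.53053 |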
3.1.2 Computerized Tomography Systems
In this experiment, we again train a neural network to recommend either the constant learning rate or the epoch-based learning rate schedule. We input two types of linear systems, consistent and inconsistent. Each data point input is the measurement vector from a computerized tomography (CT) system of equations (generated by code adapted from the regularization toolbox by P. C. Hansen [21]). We fix the matrix $A$ to be a CT matrix generated by the command tomo(20,10); here 20 is the discretization parameter (number of pixels along one edge of the square image) and 10 is the oversampling factor. This matrix represents the ray directions which are sampled through the signal (image). We then produce consistent CT systems by applying the CT matrix to the signal $x$, which is an image from the MNIST database [29], producing the measurement vector $y = Ax$, and then normalizing. These measurement vectors contain a linear combination of the pixels through which the tomography rays pass. This set of systems is optimally solved with the constant learning rate. We produce inconsistent CT systems by adding an error, constructed from a Gaussian random vector, that is orthogonal to the column space of $A$. The measurement vector for these inconsistent CT systems, formed from an image $x$ from the MNIST database together with this error, is likewise normalized. This set of systems is optimally solved with the epoch-based learning rate schedule.
To evaluate these methods, we measure the average validation accuracy of the neural network predictions on the remaining test measurement vectors over ten trials, in each of which we randomly sample a training subset of the 3000 measurement vectors at one of three sizes; the average validation accuracies are listed in Table 2. Furthermore, we list the average recovery error and average residual error for the approximation computed by SGD iterations using first the constant learning rate schedule, then the epoch-based learning rate schedule, and finally the neural network predicted learning rate for each system. These measures are listed in Table 2; the smallest average error is bolded in each row. Note that the average recovery error and average residual error for the neural network predicted learning rates are lower than those of the constant learning rate or epoch-based learning rate, except for the average residual error in the first row of Table 2.
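Table 2: Computerized tomography systems. Average validation accuracy, average recovery error, and average residual error; each row corresponds to one training fraction, and the smallest average error in each row is bolded.

Average recovery error:

| Train | Validation accuracy | Const. | Epoch | NN Pred. |
| --- | --- | --- | --- | --- |
|  | 88.19% | 0.00669 | 0.00687 | **0.00584** |
|  | 79.68% | 0.00669 | 0.00685 | **0.00550** |
|  | 85.42% | 0.00664 | 0.00683 | **0.00538** |

Average residual error:

| Train | Const. | Epoch | NN Pred. |
| --- | --- | --- | --- |
|  | **0.25087** | 0.26671 | 0.25121 |
|  | 0.25247 | 0.26792 | **0.24717** |
|  | 0.24885 | 0.26469 | **0.24323** |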
In order to visualize the potential improvement offered by using the trained neural net to select optimal step sizes for each tomography system, we plot in Figure 4 a recovered image using 5000 SGD iterations with each learning rate schedule, and the original image. The neural network predicts the correct optimal learning rate schedule on these systems.
These numerical experiments show that a simple neural network trained with proper training data can predict learning rates which improve the recovery error of SGD on a set of given systems. We emphasize that this approach is promising for data sets in which knowledge of the data (e.g., consistency of linear systems, approximate Lipschitz parameters, etc.) is limited. Depending upon the makeup of the given data set and the choice of learning rate schedules, choosing to use a single schedule on all data sets may be the optimal choice (in average recovery error), but this approach is useful if you do not have much knowledge about the data set. We illustrate this with a toy situation plotted in Figure 5. We plot the average recovery error versus the proportion of the test set systems that are inconsistent. For this visualization, we use the recovery errors from the experiment in Figure 4 to approximate the average recovery errors for each learning rate schedule on each set of systems. For the neural network predicted average recovery error, we assume that the neural network predictions are 80% accurate on both the inconsistent systems and the consistent systems. In this toy example, we see that the neural network predictions outperform the other learning rate schedules when the proportion of inconsistent systems in the test set is between approximately 30% and 80%. However, we additionally note that the neural network predicted learning rates never result in a significantly worse average recovery error than the optimal. Thus, if you know very little about your data set (e.g., how many systems are inconsistent) then our framework offers an efficient method to decrease the resulting average recovery error over all data.
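The toy calculation behind this comparison can be written down directly. The sketch below computes the average recovery error of each strategy on a test set with a given proportion of inconsistent systems, assuming the network picks the better schedule with a fixed accuracy on both classes. The function name `expected_errors` and the per-class error values are hypothetical, chosen only to illustrate the shape of the trade-off; they are not the values used for Figure 5.

```python
def expected_errors(p_inconsistent, err, accuracy=0.8):
    """Average recovery error of each strategy on a mixed test set.

    err[(schedule, system)] is the recovery error of `schedule` on a
    `system` type; p_inconsistent is the fraction of inconsistent systems.
    """
    p, a = p_inconsistent, accuracy
    const = ((1 - p) * err["const", "consistent"]
             + p * err["const", "inconsistent"])
    epoch = ((1 - p) * err["epoch", "consistent"]
             + p * err["epoch", "inconsistent"])
    # The network picks the better schedule for each class (constant for
    # consistent, epoch-based for inconsistent) with probability `a`.
    nn = ((1 - p) * (a * err["const", "consistent"]
                     + (1 - a) * err["epoch", "consistent"])
          + p * (a * err["epoch", "inconsistent"]
                 + (1 - a) * err["const", "inconsistent"]))
    return const, epoch, nn


# Hypothetical per-class errors: each schedule is perfect on "its" class.
err = {("const", "consistent"): 0.0, ("epoch", "consistent"): 1.0,
       ("const", "inconsistent"): 1.0, ("epoch", "inconsistent"): 0.0}
const, epoch, nn = expected_errors(0.5, err)
```

With these symmetric hypothetical values, the network's average error stays at 0.2 for every mixture, so it beats both fixed schedules whenever the proportion of inconsistent systems lies strictly between 0.2 and 0.8, and loses to one of them outside that range.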
4 Conclusion
We have presented a machine learning framework for algorithm selection or parameter tuning that is applicable in all areas of computational mathematics. We showcased its broad potential by applying it to compressed sensing and stochastic gradient descent. As long as we have a choice of algorithms, or parameter values that determine the behavior of an algorithm, the same process of training a neural network can be used to obtain automatic recommendations. This presents the possibility that in the future, software will integrate some way to collect data in order to improve itself. Futuristic code will adjust its own parameters based on historic experience of executions of prior instances. We predict this will be useful not only in self-improvement in machine learning (e.g., as stochastic gradient descent improves itself from its data, it will improve learning models), but also in other fields of computational mathematics where a non-expert human is at a disadvantage with respect to code that collects lots of data points and self-improves. What are the challenges and limits for this approach? They include the right selection of features for training, the amount of data required, and the type of neural networks used. All of these present interesting mathematical directions.
References
 [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 [2] M. Balcan, T. Dick, T. Sandholm, and E. Vitercik. Learning to branch. In Int. Conf. Mach. Learn., pages 353–362, 2018.
 [3] M. Balcan, V. Nagarajan, E. Vitercik, and C. White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proc. Conf. Learn. Th., pages 213–274, 2017.
 [4] A. Balte, N. Pise, and P. Kulkarni. Meta-learning with landmarking: A survey. International Journal of Computer Applications, 105(8), 2014.
 [5] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In International conference on machine learning, pages 199–207, 2013.
 [6] B. Bischl, P. Kerschke, L. Kotthoff, M. T. Lindauer, Y. Malitsky, A. Fréchette, H. H. Hoos, F. Hutter, K. Leyton-Brown, K. Tierney, and J. Vanschoren. ASlib: A benchmark library for algorithm selection. Artif. Intell., 237:41–58, 2016.
 [7] J. D. Blanchard and J. Tanner. Performance comparisons of greedy algorithms in compressed sensing. Numer. Linear Algebr., 22(2):254–282, 2015.
 [8] T. Blumensath and M. E. Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE J. Sel. Top. Signa., 4(2):298–309, 2010.
 [9] E. J. Candès and T. Tao. Decoding by linear programming. IEEE T. Inform. Theory, 51:4203–4215, 2005.
 [10] M. Davenport, D. Needell, and M. B. Wakin. Signal space CoSaMP for sparse recovery with redundant dictionaries. IEEE T. Inform. Theory, 59(10):6820, 2012.
 [11] S. De, A. Yadav, D. Jacobs, and T. Goldstein. Big batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.
 [12] A. Défossez and F. Bach. Adabatch: Efficient gradient aggregation rules for sequential and parallel stochastic gradient methods. arXiv preprint arXiv:1711.01761, 2017.
 [13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12(Jul):2121–2159, 2011.
 [14] Y. C. Eldar and G. Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.
 [15] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in neural information processing systems, pages 2962–2970, 2015.
 [16] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal., 49(6):2543–2563, 2011.
 [17] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing, volume 1. Birkhäuser Basel, 2013.
 [18] C. Garnatz, X. Gu, A. Kingman, J. LaManna, D. Needell, and S. Tu. Practical approximate projection schemes in greedy signal space methods. In Proc. Allerton Conf. Comm. Cont. Comp., 2014.
 [19] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [20] R. Gupta and T. Roughgarden. A PAC approach to application-specific algorithm selection. SIAM J. Comput., 46(3):992–1017, 2017.
 [21] P. C. Hansen. Regularization tools: A MATLAB package for analysis and solution of discrete ill-posed problems. Numer. Algorithms, 6(1):1–35, 1994.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR IEEE, pages 770–778, 2016.
 [23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. CVPR IEEE, volume 1, page 3, 2017.
 [24] E. B. Khalil, B. Dilkina, G. L. Nemhauser, S. Ahmed, and Y. Shao. Learning to run heuristics in tree search. In Proc. Int. Joint Conf. Artif., pages 659–666, 2017.
 [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [26] L. Kotthoff, B. Hurley, and B. O’Sullivan. The ICON challenge on algorithm selection. AI Magazine, 38(2):91–93, 2017.
 [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Adv. Neur. In., pages 1097–1105, 2012.
 [28] M. G. Lagoudakis and M. L. Littman. Algorithm selection using reinforcement learning. In Int. Conf. Mach. Learn., pages 511–518, 2000.
 [29] Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits, 2010.
 [30] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. In Adv. Neur. In., pages 181–189, 2015.
 [31] A. Maleki and D. L. Donoho. Optimally tuned iterative reconstruction algorithms for compressed sensing. IEEE J. Sel. Top. Signa., 4(2):330–341, 2010.
 [32] P.-Y. Massé and Y. Ollivier. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
 [33] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. Neur. In., pages 451–459, 2011.
 [34] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. A., 26(3):301–321, 2009.
 [35] D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Adv. Neur. In., pages 1017–1025, 2014.
 [36] E. Nudelman, A. Devkar, Y. Shoham, and K. Leyton-Brown. Understanding random SAT: Beyond the clauses-to-variables ratio. In Lect. Notes Comput. SC, pages 438–452, 2004.
 [37] B. Pfahringer, H. Bensusan, and C. G. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In ICML, pages 743–750, 2000.
 [38] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400–407, 1951.
 [39] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math., 61:1025–1045, 2008.
 [40] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In Int. Conf. Mach. Learn., pages 343–351, 2013.
 [41] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Int. Conf. Mach. Learn., pages 71–79, 2013.
 [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR IEEE, pages 1–9, 2015.
 [43] C. Tan, S. Ma, Y.-H. Dai, and Y. Qian. Barzilai-Borwein step size for stochastic gradient descent. In Adv. Neur. In., pages 685–693, 2016.
 [44] X. Wu, R. Ward, and L. Bottou. WNGrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
 [45] C. Yang, Y. Akimoto, D. W. Kim, and M. Udell. OBOE: Collaborative filtering for AutoML initialization. arXiv preprint arXiv:1808.03233, 2018.
 [46] Y. Yang, Z. Zhong, T. Shen, and Z. Lin. Convolutional neural networks with alternately updated clique. In Proc. CVPR IEEE, pages 2413–2422, 2018.
 [47] M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.