Predicting Runtime Distributions using Deep Neural Networks
Many state-of-the-art algorithms for solving hard combinatorial problems include elements of stochasticity that lead to high variations in runtime, even for a fixed problem instance, across runs with different pseudo-random number seeds. Knowledge about the runtime distributions (RTDs) of algorithms on given problem instances can be exploited in various meta-algorithmic procedures, such as algorithm selection, portfolios, and randomized restarts. Previous work has shown that machine learning can be used to individually predict mean, median and variance of RTDs. To establish a new state-of-the-art in predicting RTDs, we demonstrate that the parameters of an RTD should be learned jointly and that neural networks can do this well by directly optimizing the likelihood of an RTD given runtime observations. In an empirical study involving four algorithms for SAT solving and AI planning, we show that our neural networks predict the true RTDs of unseen instances better than previous methods. As an exemplary application of RTD predictions, we show that our RTD models also yield good predictions of running these algorithms in parallel.
Algorithms for solving hard combinatorial problems often rely on random choices and decisions to improve their performance. For example, randomization helps to escape local optima, enforces stronger exploration and diversifies the search strategy by not only relying on heuristic information. In particular, most local search algorithms are randomized [?], but structured search (such as tree-based algorithms) can also benefit from randomization [?].
The runtimes of randomized algorithms for hard combinatorial problems are well-known to vary substantially, often by orders of magnitude, even when running the same algorithm multiple times on the same instance [?]. Hence, the central object of interest in the analysis of a randomized algorithm on an instance is its runtime distribution (RTD), in contrast to a single scalar for deterministic algorithms. Knowing these RTDs is important in many practical applications, such as computing optimal restart strategies [?], optimal algorithm portfolios [?] and the speedups obtained by executing multiple independent runs of randomized algorithms [?]. It is trivial to measure an algorithm’s empirical RTD on an instance by running it many times to completion, but for new instances that one would like to solve, this is of course not practical. Instead, one would like to estimate the RTD for a new instance without running the algorithm on it.
There is a rich history in artificial intelligence that shows that the runtime of algorithms for solving hard combinatorial problems can indeed be predicted to a certain degree [?]. These runtime predictions have enabled a wide range of meta-algorithmic procedures, such as algorithm selection [?], model-based algorithm configuration [?], generating hard benchmarks [?], gaining insights into instance hardness [?] and algorithm performance [?], and creating cheap-to-evaluate surrogate benchmarks [?].
Given a method for predicting RTDs of randomized algorithms, all of these applications could be extended by an additional dimension. Indeed, predictions of RTDs have already enabled dynamic algorithm portfolios [?], adaptive restart strategies [?], and predictions of the runtime of parallelized algorithms [?], and many more applications are possible given effective methods for RTD prediction. To advance the underlying foundation of all these applications, in this paper we focus on better methods for predicting RTDs. Specifically, our contributions are as follows:
We propose a principled pipeline to build RTD predictors for new problem instances based on observed runtimes on a set of training instances.
We compare different ways of predicting RTDs and demonstrate that neural networks (NN) can jointly predict all parameters of various parametric RTDs, yielding RTD predictions that are superior to those of previous approaches that predict the RTD’s parameters independently.
We propose DistNet, a practical NN for predicting RTDs, and discuss the bells and whistles that make it work.
To illustrate the use of our NN-based RTD predictions in an application, we demonstrate that they can effectively predict the performance that can be obtained by running multiple independent copies of a randomized algorithm in parallel [?].
The rich history in predicting algorithm runtimes mentioned in the introduction focuses on predicting mean runtimes, with only a few exceptions. [?] ([?]) predicted the single distribution parameter of an exponential RTD and [?] ([?]) predicted the two parameters of log-normal and shifted exponential RTDs with independent models. In contrast, we jointly predict multiple RTD parameters (and also show that the resulting predictions are better than those by independent models).
The work most closely related to ours is by [?] ([?]), who proposed to use NNs to learn a distribution of the time left until an algorithm solves a problem based on features describing the algorithm’s current state and the problem to be solved; they used these predictions to dynamically assign time slots to algorithms. In contrast, we use NNs to predict RTDs for unseen problem instances.
A final related field of study on predicting distributions uses non-parametric estimators. Very recently, [?] ([?]) used quantile regression forests [?] as a non-parametric model that enables sampling from predicted runtime distributions. Also very recently, [?] ([?]) showed that NNs can be used as a non-parametric estimator for arbitrary conditional distributions. In contrast to that, we focus on a parametric approach, using NNs to predict distribution parameters for non-Gaussian distributed target values.
All existing methods for predicting runtime on unseen instances base their predictions on instance features that numerically characterize problem instances. In particular in the context of algorithm selection, these instance features have been proposed for many domains of hard combinatorial problems, such as propositional satisfiability [?], AI planning [?], mixed integer programming [?] and answer set programming [?]. To avoid this manual step of feature construction, [?] ([?]) proposed to directly use the text format of an instance as the input to a neural network to obtain a numerical representation of the instance. Since this approach performed slightly worse than manually constructed features, in this work we use traditional features, but our framework works with any type of features.
3 A Pipeline for Predicting RTDs
The problem we address in this work can be formally described as follows:
Following the typical approach in the literature [?], we address this problem in two steps:
Determine a parametric family of RTDs with parameters that fits well across training instances;
Fit a machine learning model that, given a new instance and its features, predicts the parameters of this RTD family on that instance.
Figure 1 illustrates the pipeline we use for training these RTD predictors and using them on new instances. In the following, we first discuss the various RTD families we considered; explain how we measure the goodness of our RTD fits (i.e., our loss function); and finally discuss different ways of predicting RTDs, including our new approach of jointly learning RTD parameters using NNs.
4 Parametric Families of RTDs
Table 1: Parametric RTD families considered: normal (N), lognormal (LOG), exponential (EXP), and inverse Gaussian (INV).
We considered a set of parametric continuous probability distributions (shown in Table 1 with exemplary instantiations shown in Figure 2), most of which have been widely studied to describe the RTDs of combinatorial problem solvers [?]. As a baseline we consider the normal (aka Gaussian) distribution, due to its widespread use throughout the sciences; assumptions of Gaussian observation noise also underlie many machine learning methods, such as Gaussian processes.
Since the runtimes of hard combinatorial solvers often vary on an exponential scale (likely due to the NP-hardness of the problems studied), a much better fit of empirical RTDs is typically achieved by a lognormal distribution; this distribution is attained if the logarithm of the runtimes is Gaussian-distributed and has been shown to fit empirical RTDs well in previous work [?].
Another popular parametric family from the literature on RTDs is the exponential distribution, which tends to describe the RTDs of many well-behaved stochastic local search algorithms well [?]. It is the unique family with the property that the probability of finding a solution in the next time interval (conditional on not having found one yet) remains constant over time. This distribution, like the lognormal distribution, can therefore model the relatively long tails of typical RTDs of randomized combinatorial problem solvers quite well.
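The constant-hazard property stated above can be verified in one line. As a sketch, write $T$ for the runtime and $\lambda$ for the rate of the exponential distribution (both symbols are notation introduced here for illustration), so that the survival function is $P(T > t) = e^{-\lambda t}$. Then, for any elapsed time $s$:

```latex
\[
P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t).
\]
```

That is, having already run for $s$ time units without finding a solution does not change the distribution of the remaining runtime.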
By empirically studying a variety of alternative parametric families, we also found that an inverse Gaussian distribution (INV) tends to fit RTDs very well.
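For all four families, maximum-likelihood estimates are available in closed form when no location shift is fitted. The following is a minimal, dependency-free sketch of such fits; it is not the scipy-based fitting code used in our experiments, and the function names, the parameter names (`mu`, `sigma`, `rate`, `lam`) and the synthetic runtimes are assumptions made for illustration.

```python
import math
import random


def fit_normal(ts):
    """MLE for a normal distribution: sample mean and (biased) std."""
    n = len(ts)
    mu = sum(ts) / n
    sigma = math.sqrt(sum((t - mu) ** 2 for t in ts) / n)
    return {"mu": mu, "sigma": sigma}


def fit_lognormal(ts):
    """MLE for a lognormal: the normal MLE applied to log-runtimes."""
    return fit_normal([math.log(t) for t in ts])


def fit_exponential(ts):
    """MLE for an exponential: rate = 1 / mean runtime."""
    return {"rate": len(ts) / sum(ts)}


def fit_inverse_gaussian(ts):
    """MLE for an inverse Gaussian with mean mu and shape lam."""
    n = len(ts)
    mu = sum(ts) / n
    lam = n / sum(1.0 / t - 1.0 / mu for t in ts)
    return {"mu": mu, "lam": lam}


# Synthetic "runtimes" drawn from a lognormal with known parameters.
rng = random.Random(0)
runtimes = [rng.lognormvariate(1.0, 0.5) for _ in range(5000)]

fits = {
    "N": fit_normal(runtimes),
    "LOG": fit_lognormal(runtimes),
    "EXP": fit_exponential(runtimes),
    "INV": fit_inverse_gaussian(runtimes),
}
for family, params in fits.items():
    print(family, params)
```

On this synthetic data the LOG fit should recover parameters close to the generating values, while the other families return their own best-fitting parameters for the same observations.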
5 Quantifying the Quality of RTDs
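To measure how well a parametric distribution $D$ with parameters $\theta$ fits our empirical runtime observations $t(\pi)_1, \dots, t(\pi)_k$ on an instance $\pi$ (the empirical RTD), we use the likelihood $\mathcal{L}_D$ of the parameters $\theta$ given all observations, which is equal to the probability of the observations under distribution $D$ with parameters $\theta$:
\[
\mathcal{L}_D(\theta \mid t(\pi)_1, \dots, t(\pi)_k) = \prod_{i=1}^{k} p_D(t(\pi)_i \mid \theta)
\]
Consequently, when estimating the parameters of a given empirical RTD, we use a maximum-likelihood fit. For numerical reasons, as is common in machine learning, we use the negative log-likelihood as a loss function to be minimized:
\[
-\log \mathcal{L}_D(\theta \mid t(\pi)_1, \dots, t(\pi)_k) = -\sum_{i=1}^{k} \log p_D(t(\pi)_i \mid \theta) \tag{1}
\]
Since each instance results in an RTD, we measure the quality of a parametric family of RTDs for a given instance set by averaging over the negative log-likelihoods of all instances. To obtain comparable likelihoods across the instances, we normalize them by multiplying the likelihoods with the maximal observed runtime of each instance:
\[
-\sum_{i=1}^{k} \log \Big( p_D(t(\pi)_i \mid \theta) \cdot \max_{j \in \{1, \dots, k\}} t(\pi)_j \Big)
\]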
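The effect of this normalization can be checked numerically. The sketch below, which assumes a lognormal family and uses hypothetical function names and toy runtimes, computes the negative log-likelihood with every density value multiplied by the maximal observed runtime of the instance, and averages it over instances.

```python
import math


def lognormal_logpdf(t, mu, sigma):
    """Log-density of a lognormal distribution at runtime t."""
    z = (math.log(t) - mu) / sigma
    return -math.log(t * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def fit_lognormal(ts):
    """Closed-form maximum-likelihood fit of (mu, sigma)."""
    logs = [math.log(t) for t in ts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def normalized_nll(ts, mu, sigma):
    """NLL with each density multiplied by the maximal observed runtime."""
    log_t_max = math.log(max(ts))
    return -sum(lognormal_logpdf(t, mu, sigma) + log_t_max for t in ts)


def average_normalized_nll(instances):
    """Average the per-instance normalized NLL over a set of instances."""
    total = 0.0
    for ts in instances:
        mu, sigma = fit_lognormal(ts)
        total += normalized_nll(ts, mu, sigma)
    return total / len(instances)


easy = [0.8, 1.1, 1.5, 2.3, 0.9, 1.7]
hard = [10.0 * t for t in easy]  # same shape, ten times slower

print(normalized_nll(easy, *fit_lognormal(easy)))
print(normalized_nll(hard, *fit_lognormal(hard)))
```

Both printed values agree up to floating-point error: two RTDs that differ only in scale contribute equally to the average, whereas the unnormalized negative log-likelihoods would differ by $k \log 10$.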
An alternative is to use a statistical goodness-of-fit test, such as the Kolmogorov-Smirnov (KS) test. The KS statistic is based on the maximal distance between an empirical distribution and the cumulative distribution function of a reference distribution. To aggregate the test results across instances, we count the number of times the KS test rejected the null hypothesis that our measured runtimes are drawn from a reference RTD.
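A minimal sketch of this aggregation is given below. It computes the KS statistic directly and rejects at the 5% level using the asymptotic critical value $1.358 / \sqrt{n}$; the significance level, the exponential reference family, and all function names are assumptions for illustration, not the exact test configuration of our experiments. Note also that the standard critical values are conservative when the reference parameters were fitted on the same data.

```python
import math


def ks_statistic(samples, cdf):
    """Maximal distance between the empirical CDF and a reference CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_rejects(samples, cdf, critical_coefficient=1.358):
    """Asymptotic test at the 5% level: reject if D > 1.358 / sqrt(n)."""
    d = ks_statistic(samples, cdf)
    return d > critical_coefficient / math.sqrt(len(samples))


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda t: 1.0 - math.exp(-rate * t)


def rejection_rate(instances, fit_cdf):
    """Fraction of instances whose fitted reference RTD is rejected."""
    rejected = sum(ks_rejects(ts, fit_cdf(ts)) for ts in instances)
    return rejected / len(instances)


# Deterministic toy data: the quantiles of an exponential with rate 1.
n = 100
samples = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
d_good = ks_statistic(samples, exponential_cdf(1.0))
d_bad = ks_statistic(samples, exponential_cdf(5.0))
print(d_good, d_bad)
```

Here `rejection_rate` takes a list of per-instance runtime lists and a function that fits a reference CDF to one instance, e.g. `lambda ts: exponential_cdf(len(ts) / sum(ts))`.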
Having selected a parametric family of distributions, the last part of our pipeline is to fit an RTD predictor to efficiently obtain RTDs for new instances. Formally, the problem is to find a predictive model that maps from instance features to the parameters of the selected RTD family. In the following, we briefly discuss how traditional regression models have been used for this problem, and why this optimizes the wrong loss function. We then show how to obtain better predictions with neural networks and introduce our practical DistNet for this task.
6.1 Generalizing from Training RTDs
A straightforward approach for predicting parametric RTDs based on standard regression models is to fit the RTD’s parameters for each training instance, and to then train a regression model on these data points that directly maps from instance features to RTD parameters. This approach has been used before based on Gaussian processes [?] and random forests [?]. There are two variants to extend these approaches to the problem of predicting RTDs governed by multiple parameters: (1) fitting one independent regression model per parameter, or (2) fitting a single multi-output model with one output per parameter. Since random forests have been shown to perform very well for standard runtime prediction tasks [?], we experimented with them in variant (1), but also in variant (2), using the multi-output random forest implementation in scikit-learn [?] (which learns the mapping jointly by minimizing the errors across the output variables). However, we note that this multi-output variant, although in principle more powerful than independent models, does not optimize the loss function we care about (Equation 1). We also note that both variants require fitting RTDs on each training instance, making the approach inapplicable if we, e.g., only have access to one run for each of a million instances. Now, we show how neural networks can be used to solve both of these problems.
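As a rough illustration of variant (1), the following sketch fits lognormal parameters on each training instance in closed form and then trains one regressor per parameter. To stay dependency-free, it uses one-dimensional ordinary least squares as a stand-in for the random forests discussed above; the synthetic data, the single instance feature, and all function names are hypothetical.

```python
import math
import random


def fit_lognormal(ts):
    """Closed-form maximum-likelihood fit of (mu, sigma)."""
    logs = [math.log(t) for t in ts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (stand-in regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx


rng = random.Random(1)
# Hypothetical training set: one feature per instance; instances with a
# larger feature value have a larger lognormal location parameter.
features = [rng.uniform(0.0, 2.0) for _ in range(50)]
runtimes = [
    [rng.lognormvariate(0.5 + 1.5 * f, 0.4) for _ in range(100)]
    for f in features
]

# Step 1: fit RTD parameters on every training instance.
targets = [fit_lognormal(ts) for ts in runtimes]

# Step 2: one independent regressor per RTD parameter (variant 1).
mu_model = fit_line(features, [mu for mu, _ in targets])
sigma_model = fit_line(features, [sigma for _, sigma in targets])


def predict_rtd(feature):
    """Predict (mu, sigma) for an unseen instance from its feature."""
    mu = mu_model[0] * feature + mu_model[1]
    sigma = sigma_model[0] * feature + sigma_model[1]
    return mu, sigma


print(predict_rtd(1.0))
```

Each regressor minimizes the squared distance to its own fitted parameter, not the likelihood of the resulting RTD, and step 1 needs many runs per training instance; these are the two limitations noted above.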
6.2 Joint Predictions with Neural Networks
Neural networks have recently been shown to achieve state-of-the-art performance for many supervised machine learning problems as large data sets became available, e.g., in image classification and segmentation, speech processing and natural language processing. For a thorough introduction, we refer the interested reader to [?] ([?]). Here, we apply NNs to RTD prediction.
Background on Neural Networks. NNs can approximate arbitrary functions by defining a mapping $\hat{y} = f(x; W)$, where $W$ are the weights to be learnt during training to approximate the function. In this work we use a fully-connected feedforward network, which can be described as an acyclic graph that connects nonlinear transformations in a chain, from layer to layer. For example, a NN with two hidden layers that predicts $\hat{y}$ for some input $x$ can be written as:
\[
\hat{y} = g^{(3)}\Big( W^{(3)} \, g^{(2)}\big( W^{(2)} \, g^{(1)}( W^{(1)} x ) \big) \Big)
\]
with $W^{(j)}$ denoting trainable network weights and $g^{(j)}$ (the so-called activation function) being a nonlinear transformation applied to the weighted outputs of the $j$-th layer. The last activation function, $g^{(3)}$, is special, and in our case is chosen to constrain all outputs to be positive.
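The forward pass of such a network can be sketched in a few lines of plain Python. The hidden activation (tanh) and the layer width (16) follow our hyperparameter table; the exponential output activation, the weight-initialization scale, the feature count, and all function names are assumptions made for this illustration, since the text above only requires the last activation to keep all outputs positive.

```python
import math
import random


def dense(inputs, weights):
    """Multiply an input vector by a weight matrix (no bias terms)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def forward(x, w1, w2, w3):
    """Two tanh hidden layers followed by a positive-valued output layer."""
    h1 = [math.tanh(a) for a in dense(x, w1)]
    h2 = [math.tanh(a) for a in dense(h1, w2)]
    # Assumed output activation: exp maps any real value to a positive one.
    return [math.exp(a) for a in dense(h2, w3)]


def random_matrix(rng, n_out, n_in):
    """Small Gaussian random weights (illustrative initialization)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
n_features, n_hidden, n_params = 5, 16, 2
w1 = random_matrix(rng, n_hidden, n_features)
w2 = random_matrix(rng, n_hidden, n_hidden)
w3 = random_matrix(rng, n_params, n_hidden)

theta = forward([0.3, -1.2, 0.0, 2.0, 0.7], w1, w2, w3)
print(theta)
```

With one output per distribution parameter, `theta` can be fed directly into the density of a two-parameter RTD family.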
NNs are usually trained with stochastic gradient descent (SGD) methods using backpropagation to effectively obtain gradients of a task-specific loss function for each weight.
Neural Networks for predicting RTDs. Figure ? shows the general architecture of our NNs for the joint prediction of multiple RTD parameters. We have one input neuron for each instance feature, and one output neuron for each distribution parameter. To this end, we assume that we know the best-fitting distribution family from the previous step of our pipeline.
In contrast to RFs, we train our networks to directly minimize the negative log-likelihood of the predicted distribution parameters given our observed runtimes. Formally, for a given set of observed runtimes $t(\pi)_1, \dots, t(\pi)_k$ and instance features $f(\pi)$ of an instance $\pi$, we minimize the following loss function in an end-to-end fashion:
\[
-\sum_{i=1}^{k} \log p_D\big( t(\pi)_i \mid \theta_W(f(\pi)) \big)
\]
Here, $\theta_W(f(\pi))$ denotes the values of the distribution parameters obtained in the output layer given an instantiation $W$ of the NN’s weights. This optimization process, which targets exactly our loss function of interest (Equation 1), allows us to effectively predict all distribution parameters jointly. Since predicted combinations are judged directly by their resulting negative log-likelihood, the optimization process is driven to find combinations that work well together (rather than penalizing distance from a prescribed ground truth as was the case with the random forest variants (1) and (2) above). This end-to-end optimization process is also more general as it removes the need to fit an RTD on each training instance and thereby enables using an arbitrary set of algorithm performance data for fitting the model.
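To make the end-to-end loss concrete, the sketch below evaluates it for a lognormal family on individual (features, runtime) pairs, with only a single run per instance. The predictor is passed in as a plain function standing in for the network; the lognormal choice, the synthetic data, and all names are assumptions for illustration.

```python
import math
import random


def lognormal_nll(t, mu, sigma):
    """Negative log-density of a lognormal distribution at runtime t."""
    z = (math.log(t) - mu) / sigma
    return math.log(t * sigma * math.sqrt(2.0 * math.pi)) + 0.5 * z * z


def end_to_end_loss(predict, samples):
    """Sum of per-observation NLLs; `predict` maps features to (mu, sigma)."""
    return sum(lognormal_nll(t, *predict(f)) for f, t in samples)


rng = random.Random(2)
# One (features, runtime) pair per instance: no per-instance RTD fit needed.
samples = []
for _ in range(200):
    f = rng.uniform(0.0, 2.0)
    samples.append(((f,), rng.lognormvariate(1.0 + f, 0.5)))


def true_model(features):
    """Predictor that returns the data-generating parameters."""
    return 1.0 + features[0], 0.5


def shifted_model(features):
    """Predictor whose location parameter is systematically too large."""
    return 1.5 + features[0], 0.5


loss_true = end_to_end_loss(true_model, samples)
loss_shifted = end_to_end_loss(shifted_model, samples)
print(loss_true, loss_shifted)
```

The loss scores a predicted parameter combination only through the likelihood of the observed runtimes, so it is defined even though no instance in this toy data set has enough runs to fit an RTD on its own.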
DistNet: RTD predictions with NNs in practice. Unfortunately, training an accurate NN in practice can be tricky and requires manual attention to many details, including the network architecture, training procedure, and other hyperparameter settings.
To preprocess our runtime data, we performed the following steps:
We removed all constant and nearly constant features.
For each instance feature type, we imputed missing values (e.g., because of resource limitations during feature computation) by the median of the known instance features.
We normalized each feature to mean 0 and standard deviation 1 because the weights of a neural network are typically initialized by random samples from a normal distribution with mean 0 and standard deviation 1.
We scaled the observed runtimes to the range $[0, 1]$ by dividing them by the maximal observed runtime across all instances. This also helps the neural network training to converge faster.
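The preprocessing steps above can be sketched as follows. This is a minimal illustration, not our actual preprocessing code: missing values are represented by `None`, the constant-feature tolerance `tol` is an assumed threshold, and imputation is performed before the constant-feature check, which yields the same set of retained features.

```python
import statistics


def preprocess_features(rows, tol=1e-12):
    """Impute, drop (near-)constant columns, and standardize features.

    rows: list of feature vectors; None marks a missing value.
    """
    n_cols = len(rows[0])
    columns = []
    for j in range(n_cols):
        known = [r[j] for r in rows if r[j] is not None]
        median = statistics.median(known)
        # Impute missing values by the median of the known values.
        col = [median if r[j] is None else r[j] for r in rows]
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        if std <= tol:
            continue  # drop (near-)constant features
        # Standardize to mean 0 and standard deviation 1.
        columns.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*columns)]


def scale_runtimes(runtimes_per_instance):
    """Divide all runtimes by the maximal runtime across all instances."""
    t_max = max(t for ts in runtimes_per_instance for t in ts)
    return [[t / t_max for t in ts] for ts in runtimes_per_instance]


rows = [[1.0, 5.0, 2.0], [2.0, 5.0, None], [3.0, 5.0, 4.0]]
features = preprocess_features(rows)
scaled = scale_runtimes([[1.0, 2.0], [4.0, 8.0]])
print(features)
print(scaled)
```

In the toy input, the second feature is constant and is dropped, and the missing value in the third feature is imputed before standardization.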
For training DistNet, we considered the following aspects:
Our first networks tended to overfit the training data if the training data set was too small and the network too large. Therefore, we recommend using a sufficiently large number of instances and runtime observations, and we chose a fairly small neural network with two hidden layers, each with 16 neurons. We believe that far better performance could be achieved with larger and deeper networks when more training instances are available.
We considered each runtime observation as an individual data sample. Hence, the number of training samples increased to the number of instances times the number of observations per instance, whereas the training set size for the random forest equals the number of instances.
We shuffled the runtime observations (as opposed to, e.g., using only data points from a single instance in each batch) and used a fairly small batch size to reduce the correlation of the training data points in each batch.
Our loss function can have very large gradients because slightly suboptimal RTD parameters can lead to likelihoods close to zero (or a very large negative log-likelihood). Therefore, we used a fairly small initial learning rate with exponential decay, and applied gradient clipping [?] on top of it.
|# hidden layers||2||optimization algo.||SGD|
|# neurons per layer||16||init. learn. rate||1e-3|
|batch size||16||final learn. rate||1e-5|
|activation function||tanh||batch normalization||True|
|output act. function|| ||regularization||1e-4|
Table ? shows all architectural choices and hyperparameters of DistNet.
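The interplay of a decaying learning rate and gradient clipping can be sketched as below. The initial and final learning rates are taken from the table; the geometric interpolation between them, clipping by Euclidean norm, the threshold `max_norm=1.0`, and the function names are assumptions for illustration and may differ from the exact training setup.

```python
import math


def exponential_decay(step, total_steps, lr_init=1e-3, lr_final=1e-5):
    """Learning rate that decays geometrically from lr_init to lr_final."""
    return lr_init * (lr_final / lr_init) ** (step / total_steps)


def clip_by_norm(grad, max_norm):
    """Rescale the gradient if its Euclidean norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def sgd_step(weights, grad, step, total_steps, max_norm=1.0):
    """One SGD update with a decayed learning rate and a clipped gradient."""
    lr = exponential_decay(step, total_steps)
    clipped = clip_by_norm(grad, max_norm)
    return [w - lr * g for w, g in zip(weights, clipped)]


# A huge gradient, as can occur when the likelihood is close to zero.
weights = sgd_step([0.0, 0.0], [3e6, 4e6], step=0, total_steps=1000)
print(weights)
```

Clipping bounds the size of each update regardless of how large the raw gradient is, so a single badly fitting observation cannot derail training.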
In our experiments, we study the following research questions:
Which of the parametric RTD families we considered best describe the empirical RTDs of our algorithms and instances?
How do DistNet’s joint predictions of RTD parameters compare to those of popular random forest models?
Can DistNet be used to predict the runtime achieved by multiple-independent-run parallelization?
We focus on four well-studied algorithms, each evaluated on a different set of problem instances from two different domains:
Saps-CV-VAR is based on the dynamic local search SAT solver Saps [?]. The SAT instances are randomly generated with a varying clause-variable ratio (CV-VAR).
LPG-Zenotravel is based on the local search AI-planning solver LPG [?]. The instances are from the zenotravel planning domain [?], which arises in a version of route planning.
ProbSAT-7SAT is based on the local search SAT solver ProbSAT [?]. We ran this on random 7SAT instances.
Clasp-K5 is based on the tree-based CDCL solver Clasp [?]. In our experiments, Clasp is randomized by selecting the split variable at random with a fixed probability. The instances are randomly generated unsatisfiable SAT instances.
The sizes of our instance sets are shown in Table ?. To gather training data, we ran each algorithm with different seeds on each instance. All runs were performed on a compute cluster with nodes equipped with two Intel Xeon E5-2630v4 and GB memory running CentOS 7. We used the open-source neural network library keras [?] for our neural networks, scikit-learn [?] for the RF implementation and scipy [?] for fitting the distributions.
7.2 Q1: Best RTD Families
Figure ? shows some exemplary CDFs of our empirical RTDs; each line is the RTD on one of the instances. The different algorithms’ RTDs show different characteristics. On Saps-CV-VAR, most RTDs are very similar and have short right tails. On LPG-Zenotravel, the RTDs have a long right tail and a short left tail. In contrast, the RTDs of ProbSAT-7SAT have a very long right tail and no left tail. On Clasp-K5, the RTDs are nearly symmetric.
Table ? shows a quantitative evaluation of the different RTD families we considered (see Section 5, Quantifying the Quality of RTDs). Overall, our fitted distributions closely resembled the true empirical RTDs, with a rejection rate of the KS test for the best fitting distribution of at most 12.7%. Hence, on most instances our best distributions were not statistically significantly different from the observed ones.
Not surprisingly, different parametric RTD families performed best on each scenario, with LOG and INV showing overall good performance, followed by N. EXP performed worst in out of cases, but best for ProbSAT-7SAT. On LPG-Zenotravel the KS-test showed the most statistically significant differences for the best fitting distribution since the CDFs start with a small shift (see Figure ?) and therefore cannot be approximated perfectly by our distributions. Still, these distributions achieved good negative log-likelihood values. On Clasp-K5 all distributions except EXP achieved good results.
7.3 Q2: Predicting RTDs
Next, we turn to the empirical evaluation of our models, comparing the predictive negative log-likelihood obtained by DistNet, a multi-output random forest (RF) and fitting multiple independent RFs, one for each distribution parameter (iRFs). For DistNet, we used the settings described in Table ? and limited the training to at most 1h or 1000 epochs, whichever came first. As a gold standard, we report the negative log-likelihood obtained by a maximum likelihood fit to the empirical RTD.
Table ? shows the negative log-likelihood achieved using a 10-fold cross-validation, i.e., we split the instances into ten disjoint sets, train our models on all but one subset and measure the test performance on the left out subset. We report the average performance on train and test data across all splits for the two best fitting distributions according to Table ?.
Overall our results show that it is possible to predict RTD parameters for unseen instances, and that DistNet performed best. For three out of four scenarios, our models achieved a negative log-likelihood close to the gold standard of fitting the RTDs to the observed data. Also, for most scenarios both distribution families were similarly easy to predict for all models. For the RF-based model, we observed slight overfitting for Saps-CV-VAR, LPG-Zenotravel, and Clasp-K5. For DistNet, we only observed this on the smallest data sets, ProbSAT-7SAT and Clasp-K5.
On Saps-CV-VAR, the multi-output RF yielded very poor predictions for the LOG distribution; we believe this is due to the fact that the ranges of its two distribution parameters obtained by fitting the distributions vary greatly. This affects the loss function the RF uses to fit the data (i.e., the sum of the losses for all parameters). This does not happen when fitting two RFs individually (as can be seen by the better negative log-likelihood values obtained by iRFs), but it underlines the importance of optimizing the right loss function as we do in DistNet.
On ProbSAT-7SAT, the models basically only needed to fit a single parameter: EXP has only a single parameter, and although LOG has two parameters, only one of them was important for ProbSAT-7SAT since the other was almost constant. Thus, in this case, the difference between RF and iRFs was due to noise and DistNet could not profit from its joint model.
Overall, DistNet clearly yielded the most robust results. It achieved the best test set predictions in 6/8 cases, sometimes with substantial improvements over the RF baselines, and only performed slightly worse for the smallest ProbSAT-7SAT benchmark (for which also only one RTD parameter was relevant).
7.4 Q3: DistNet for Parallel RTD prediction
Finally, we evaluate DistNet on a typical application: predicting the estimated runtime of a simple parallelization scheme. In this setting, an algorithm is trivially parallelized by running multiple independent copies with different random seeds in parallel. Because of the randomization, one of the runs will solve the instance faster than the others, at which point all other runs can be terminated. Knowing in advance how many parallel resources are required to reach a certain speed-up allows for efficient trade-offs between computational resources and the time needed to solve an instance.
Here, we approximate the runtime of parallel runs by repeatedly drawing one sample per parallel copy from the sequential algorithm’s RTD, and obtain the expected parallel runtime by averaging the best (i.e., smallest) of these samples across all simulated parallel runs.
We simulate running copies of the same algorithm in parallel and compare the expected parallel runtime based on our empirical RTDs and based on the predicted RTDs of DistNet. To aggregate the results, we used the average estimated runtime across all problem instances (by concatenating the results on all 10 test folds).
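This simulation is straightforward to sketch. In the example below, the exponential RTD, its rate, the numbers of copies and simulations, and the function names are all hypothetical; any sampler for an empirical or predicted RTD could be plugged in instead.

```python
import random


def expected_parallel_runtime(sample_runtime, n_copies, n_simulations, rng):
    """Average, over simulations, of the fastest of n_copies independent runs."""
    total = 0.0
    for _ in range(n_simulations):
        total += min(sample_runtime(rng) for _ in range(n_copies))
    return total / n_simulations


def exponential_rtd(rate):
    """Sampler for a hypothetical exponential RTD with the given rate."""
    return lambda rng: rng.expovariate(rate)


rng = random.Random(3)
rtd = exponential_rtd(rate=0.1)  # mean sequential runtime of 10
estimates = {
    c: expected_parallel_runtime(rtd, c, 20000, rng) for c in (1, 2, 4, 8)
}
for copies, runtime in estimates.items():
    print(copies, round(runtime, 2))
```

For an exponential RTD the minimum of $c$ independent runs is again exponential with $c$ times the rate, so the estimates should approach $10 / c$; for heavier-tailed RTDs the speed-up can differ substantially, which is why the full RTD rather than its mean is needed.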
Figure ? shows the expected runtime depending on the number of parallel runs. For ProbSAT-7SAT and Clasp-K5, we obtained estimated runtimes close to the true distribution. For LPG-Zenotravel, the KS-test already indicated that LOG is not a perfectly-fitting distribution, and hence the estimated runtimes of parallel algorithm runs obtained by fitting distributions do not line up with the true parallel runtimes. Nevertheless, our DistNet perfectly matched the gold standard of the fitted distribution. Finally, on Saps-CV-VAR, DistNet overestimated the left tail of the RTD slightly, causing it to also underestimate the runtime of a parallel portfolio. Nevertheless, the overall trend was predicted correctly. Overall, from these experiments, we conclude that the predictions obtained with DistNet are suitable to estimate the runtime of parallel algorithm runs.
8 Conclusion and Future Work
In this paper we showed that neural networks can be used to jointly learn distribution parameters to predict runtime distributions (RTDs). In contrast to previous random forest models, we train our model on individual runtime observations, removing the need to fit RTDs on all training instances. More importantly, our neural network, which we dub DistNet, directly optimizes the loss function of interest in an end-to-end fashion, and by doing so obtains better predictive performance than previously used random forest models that do not directly optimize this loss function.
One way to extend our work and increase the amount of available training data would be to also consider censored observations in the loss function of our neural network as proposed by [?] ([?]). Furthermore, so far, we only studied problems with either only satisfiable or only unsatisfiable instances. In practice, we may face instances of both types, and each may follow a different RTD family. We expect that using a mixture of models [?] to learn different distribution families could alleviate this problem.
While in this paper we only briefly illustrated the suitability of RTD predictions on one simple application, we note that our methodology allows for better RTD predictions in general and therefore may pave the way for improving many exciting applications that currently rely on mean predictions only, such as algorithm selection [?] and model-based algorithm configuration [?].
The authors acknowledge funding by the DFG (German Research Foundation) under Emmy Noether grant HU 1900/2-1. K. Eggensperger additionally acknowledges funding by the State Graduate Funding Program of Baden-Württemberg. Furthermore, the authors acknowledge support by the state of Baden-Württemberg through bwHPC and the DFG through grant no INST 39/963-1 FUGG.
- The name of the inverse Gaussian distribution can be misleading in the sense that it is not the inverse of a Gaussian distribution.
- When this is not done, easy instances are weighted more heavily: if two RTDs “look” the same but differ in scale by a factor of 10 due to one instance being 10 times harder, the pdf for the easier instance is 10 times larger (in order to still integrate to 1); our normalization removes this bias towards fitting easy instances better.
- We ignore bias terms for simplicity of exposition.
- Automated machine learning tools, such as AutoNet [?], can help with this process, but AutoNet so far does not support distributions; we aim to contribute this feature to it in the future.