Practical Design Space Exploration
Abstract
Multi-objective optimization is a crucial matter in computer systems design space exploration because real-world applications often rely on a trade-off between several objectives. Derivatives are usually unavailable or impractical to compute, and the feasibility of an experiment cannot always be determined in advance. These problems are particularly difficult when the feasible region is relatively small, and it may be prohibitive to even find a feasible experiment, let alone an optimal one.
We introduce a new methodology and corresponding software framework, HyperMapper 2.0, which handles multi-objective optimization, unknown feasibility constraints, and categorical/ordinal variables. The methodology also supports the injection of user prior knowledge into the search when available. All of these features are common requirements in computer systems but are rarely exposed in existing design space exploration systems. The proposed methodology follows a white-box model which is simple to understand and interpret (unlike, for example, neural networks) and can be used to better understand the results of the automatic search.
We apply and evaluate the new methodology on automatic static tuning of hardware accelerators within the recently introduced Spatial programming language, minimizing design runtime and compute logic under the constraint that the design fits in a target field-programmable gate array (FPGA) chip. Our results show that HyperMapper 2.0 provides better Pareto fronts compared to state-of-the-art baselines, with better or competitive hypervolume indicator and with an 8x improvement in sampling budget for most of the benchmarks explored.
1 Introduction
Design problems are ubiquitous in scientific and industrial achievements. Scientists design experiments to gain insights into physical and social phenomena, and engineers design machines to execute tasks more efficiently. These design problems are fraught with choices which are often complex and high-dimensional and which include interactions that make them difficult for individuals to reason about. In software/hardware co-design, for example, companies develop libraries with tens or hundreds of free choices and parameters that interact in complex ways. In fact, the level of complexity is often so high that it becomes impossible to find domain experts capable of tuning these libraries [10].
Typically, a human developer who wishes to tune a computer system will try some of the options and gain insight into the response surface of the software. They will start to fit a mental model of how the software responds to the different choices. However, humans are good primarily at fitting continuous linear models. So, if we picture the space of choices with one choice per axis, a designer will be able to relate the different axes by lines, planes or, in general, hyperplanes, or perhaps convex or concave shapes. When the response surface is complex, e.g., non-linear, non-convex, discontinuous, or multi-modal, a human designer will hardly be able to model this complex process, ultimately missing the opportunity of delivering high-performance products.
Mathematically, in the mono-objective formulation, we consider the problem of finding a global minimizer of an unknown (black-box) objective function $f \colon \mathcal{X} \to \mathbb{R}$:

$$x^* = \operatorname*{arg\,min}_{x \in \mathcal{X}} f(x) \qquad (1)$$

where $\mathcal{X}$ is some input design space of interest. The problem addressed in this paper is the optimization of a deterministic function $f$ over a domain of interest that includes lower and upper bound constraints on the problem variables.
When optimizing a smooth function, it is well known that useful information is contained in the function gradient/derivatives, which can be leveraged, for instance, by first-order methods. The derivatives are often computed by hand-coding, by automatic differentiation, or by finite differences. However, there are situations where such first-order information is not available or not even well defined. Typically, this is the case for computer systems workloads that include many discrete variables, i.e., either categorical (e.g., boolean) or ordinal (e.g., a choice of cache sizes), over which derivatives cannot even be defined. Hence, we assume in our applications of interest that the derivative of $f$ is neither symbolically nor numerically available. This problem is referred to in the literature as derivative-free optimization (DFO) [10, 24], also known as black-box optimization [12] and, in the computer systems community, as design-space exploration (DSE) [16, 17].
Table 1. Taxonomy of existing design space exploration frameworks.

| Name | Multi | RIOC var. | Constr. | Prior |
|------------|---|---|---|---|
| GpyOpt | ✗ | ✗ | ✗ | ✗ |
| OpenTuner | ✗ | ✓ | ✗ | ✗ |
| SURF | ✗ | ✓ | ✗ | ✗ |
| SMAC | ✗ | ✓ | ✗ | ✗ |
| Spearmint | ✗ | ✗ | ✓ | ✗ |
| Hyperopt | ✗ | ✓ | ✗ | ✓ |
| Hyperband | ✗ | ✓ | ✗ | ✗ |
| GPflowOpt | ✓ | ✗ | ✓ | ✗ |
| cBO | ✗ | ✗ | ✓ | ✗ |
| BOHB | ✗ | ✓ | ✗ | ✗ |
| HyperMapper | ✓ | ✓ | ✓ | ✓ |
In addition to expensive objective function evaluations, many optimization problems have similarly expensive evaluations of constraint functions. The set of points where all such constraints are satisfied is referred to as the feasibility set. For example, in computer microarchitecture, the particular specifications of a CPU (e.g., L1 cache size, branch predictor range, and cycle time) need to be carefully balanced to optimize CPU speed while keeping the power usage strictly within a pre-specified budget. A similar example arises in creating hardware designs for field-programmable gate arrays (FPGAs). FPGAs are a type of reconfigurable logic chip with a fixed number of units available to implement circuits; any generated design must keep the number of units strictly below this resource budget to be implementable on the FPGA. In these examples, the feasibility of an experiment cannot be checked prior to the termination of the experiment; this is often referred to as unknown feasibility in the literature [14]. Also note that the smaller the feasible region, the harder it is to check whether an experiment is feasible (and even more costly to check optimality [14]).
While the growing demand for sophisticated DFO methods has triggered the development of a wide range of approaches and frameworks, none to date is featured enough to fully address the complexities of design space exploration and optimization in the computer systems domain. To address this problem, we introduce a new methodology and a framework dubbed HyperMapper 2.0. HyperMapper 2.0 is designed for the computer systems community and can handle design spaces consisting of multiple objectives, categorical/ordinal variables, unknown constraints, and exploitation of user prior knowledge. To aid comparison, we provide a list of existing tools and the corresponding taxonomy in Table 1. Our framework uses a model-based algorithm, i.e., it constructs and utilizes a surrogate model of $f$ to guide the search process. A key advantage of having a model, and more specifically a white-box model, is that the final surrogate of $f$ can be analyzed by the user to understand the space and learn fundamental properties of the application of interest.
As shown in Table 1, HyperMapper 2.0 is the only framework to provide all the features needed for practical design-space exploration software in computer systems applications. The contributions of this paper are:

A methodology for multi-objective optimization that deals with categorical and ordinal variables, unknown constraints, and exploitation of user prior knowledge when available.

A validation of our methodology by providing experimental results as applied to accelerator design.

An integration of our solution in a full, production-level compiler toolchain.

An open-source framework dubbed HyperMapper 2.0 implementing the newly introduced methodology, designed to be simple and user-friendly.
The remainder of this paper is organized as follows: Section 2 provides a problem statement and background for this work. In Section 3, we describe our methodology and framework. In Section 4, we present our experimental evaluation. Section 5 discusses related work. We conclude in Section 6 with a brief discussion of future work.
2 Background
In this section, we provide the notation and basic concepts used in the paper. We describe the mathematical formulation of the mono-objective optimization problem with feasibility constraints. We then expand this to a definition of the multi-objective optimization problem and provide background on randomized decision forests [8].
2.1 Unknown Feasibility Constraints
Mathematically, in the mono-objective formulation, we consider the problem of finding a global minimizer (or maximizer) of an unknown (black-box) objective function $f$ under a set of constraint functions $c_1, \dots, c_q$:

$$x^* = \operatorname*{arg\,min}_{x \in \mathcal{X}} f(x) \quad \text{subject to} \quad c_i(x) \le 0, \; i = 1, \dots, q,$$

where $\mathcal{X}$ is some input design space of interest and $c_1, \dots, c_q$ are unknown constraint functions. The problem addressed in this paper is the optimization of a deterministic function $f$ over a domain of interest that includes lower and upper bounds on the problem variables.
The variables defining the space $\mathcal{X}$ can be real (continuous), integer, ordinal, or categorical. Ordinal parameters have as domain a finite set of values which are either integers and/or reals (for example, a finite set of candidate cache sizes), and ordinal values must be ordered by the less-than operator. Ordinal and integer variables are also referred to as discrete variables. Categorical parameters also have a finite domain but no ordering requirement; sets of strings describing some property are typical categorical domains. The primary benefit of encoding a variable as ordinal is that it allows better inferences about unseen parameter values: with a categorical parameter, knowledge of one value tells us little about the other values, whereas with an ordinal parameter we expect closer values (with respect to the ordering) to be more related.
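To make the distinction concrete, the sketch below shows one plausible way of encoding the two kinds of variables for a learned model; the parameter names and value sets are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: encoding ordinal vs. categorical parameters.
# The concrete domains below (cache sizes, scheduler names) are
# hypothetical examples, not parameters from the paper.

def encode_ordinal(value, ordered_domain):
    """Map an ordinal value to its rank, so that closeness in rank
    reflects closeness in the less-than ordering."""
    return ordered_domain.index(value)

def encode_categorical(value, domain):
    """One-hot encode a categorical value: no ordering is implied."""
    return [1 if v == value else 0 for v in domain]

cache_sizes = [1024, 2048, 4096, 8192]          # ordinal: ordered by '<'
schedulers = ["fifo", "round_robin", "random"]  # categorical: no order

print(encode_ordinal(4096, cache_sizes))       # rank preserves ordering
print(encode_categorical("fifo", schedulers))  # one-hot, no ordering
```

With the ordinal encoding, a model can infer that 2048 behaves more like 4096 than like 8192; the one-hot encoding deliberately withholds any such relation.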
We assume that the derivative of $f$ is not available and that bounds on the derivative of $f$, such as Lipschitz constants, are also unavailable. Evaluating feasibility is often of the same order of expense as evaluating the objective function $f$. As for the objective function, no particular assumptions are made on the constraint functions.
2.2 MultiObjective Optimization: Problem Statement
A pictorial representation of a multi-objective problem is shown in Figure 1. On the left, a three-dimensional design space is composed of one ordinal, one real, and one categorical variable. The red dots represent samples from this search space. The multi-objective function $f$ maps this input space to the output space on the right, also called the optimization space, which is composed of two optimization objectives. The blue dots are the images of the red dots under $f$. The arrows Min and Max indicate that each objective can be minimized or maximized depending on the application; optimization drives the search for optimal points towards either the Min or the Max of the right plot.
Formally, let us consider a multi-objective optimization (minimization) over a design space $\mathcal{X}$. We define $f = (f_1, \dots, f_p)$ as our vector of $p$ objective functions, taking $x \in \mathcal{X}$ as input and evaluating $f(x) \in \mathbb{R}^p$. Our goal is to identify the Pareto frontier of $f$; that is, the set of points which are not dominated by any other point, i.e., the maximally desirable $x$ which cannot be optimized further for any single objective without making a trade-off on another. Formally, we consider the partial order on $\mathbb{R}^p$: $y \prec y'$ iff $y_i \le y'_i$ for all $i \in \{1, \dots, p\}$ and $y_j < y'_j$ for some $j$, and define the induced order on $\mathcal{X}$: $x \prec x'$ iff $f(x) \prec f(x')$. The set of minimal points in this order is the Pareto-optimal set $\Gamma = \{x \in \mathcal{X} : \nexists\, x' \text{ such that } x' \prec x\}$.
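The dominance relation above can be sketched in a few lines (a minimal pure-Python sketch assuming minimization of all objectives; the sample points are made up for illustration):

```python
# Sketch of the Pareto dominance partial order and the resulting
# non-dominated set, assuming all objectives are minimized.

def dominates(y, y_prime):
    """y dominates y' iff y is no worse in every objective and
    strictly better in at least one."""
    return all(a <= b for a, b in zip(y, y_prime)) and any(
        a < b for a, b in zip(y, y_prime))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

samples = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
print(pareto_front(samples))  # (3.0, 3.0) is dominated by (2.0, 2.0)
```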
We can then introduce a set of $q$ inequality constraints $c(x) = (c_1(x), \dots, c_q(x))$ to the optimization, such that we only consider points where all constraints are satisfied ($c_i(x) \le 0$ for all $i$). These constraints directly correspond to real-world limitations of the design space under consideration. Applying these constraints gives the constrained Pareto-optimal set, i.e., the set of non-dominated feasible points.
Similarly to the mono-objective case in [13], we can define the feasibility indicator function $\delta(x)$, which is $1$ if all constraints are satisfied at $x$ and $0$ otherwise. A design point $x$ with $\delta(x) = 1$ is termed feasible; otherwise, it is called infeasible.
We aim to identify the constrained Pareto-optimal set with the fewest possible function evaluations, solving a sequential decision problem and constructing a strategy that iteratively generates the next $x$ to evaluate. If the evaluation is not very expensive, it is possible to construct a strategy that runs multiple evaluations per sequential step, i.e., a batch of evaluations. In this case it is standard practice to warm up the strategy with some previously sampled points, using sampling techniques from the design-of-experiments literature [26].
It is worth noting that, while infeasible points are never considered among our best experiments, they are still useful to add to our set of performed experiments to improve the probabilistic model posteriors. Practically speaking, infeasible samples help to determine the shape of the feasible region, allowing the probabilistic model to discern which regions are more likely to be feasible without actually sampling there. The fact that we do not need to sample in feasible regions to find them is highly useful in cases where the feasible region is relatively small and uniform sampling would have difficulty finding it.
As an example, in this paper we evaluate the compiler optimization case targeting FPGAs. In this case, $p = 2$ and $q = 1$: the objectives are $f_1$, the number of total cycles (i.e., runtime), and $f_2$, the logic utilization (i.e., the percentage of logic gates used), while the single constraint $c_1$ represents whether the design point fits in the target FPGA board.
2.3 Randomized Decision Forests
A decision tree is a nonparametric supervised machine learning method widely used to formalize decision making processes across a variety of fields. A randomized decision tree is an analogous machine learning model, which “learns” how to regress (or classify) data points based on randomly selected attributes of a set of training examples. The combination of many weak regressors (binary decisions) allows approximating highly nonlinear and multimodal functions with great accuracy. Randomized decision forests [8, 11] combine many such decorrelated trees based on the randomization at the level of training data points and attributes to yield an even more effective supervised regression and classification model.
A decision tree represents a recursive binary partitioning of the input space, and uses a simple decision (a onedimensional decision threshold) at each nonleaf node that aims at maximizing an “information gain” function. Prediction is performed by “dropping” down the test data point from the root, and letting it traverse a path decided by the node decisions, until it reaches a leaf node. Each leaf node has a corresponding function value (or probability distribution on function values), adjusted according to training data, which is predicted as the function value for the test input. During training, randomization is injected into the procedure to reduce variance and avoid overfitting. This is achieved by training each individual tree on randomly selected subsets of the training samples (also called bagging), as well as by randomly selecting the deciding input variable for each tree node to decorrelate the trees.
A regression random forest is built from a set of such decision trees where the leaf nodes output the average of the training data labels and where the output of the whole forest is the average of the predicted results over all trees. In our experiments, we train separate regressors to learn the mapping from our input parameter space to each output variable.
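As a sketch of this setup, the following trains one random forest regressor per objective on synthetic data (assumptions: scikit-learn is available, and the toy design parameters and cost functions below merely stand in for Spatial's real estimates):

```python
# Sketch: one RandomForestRegressor per objective, mirroring the
# paper's use of separate regressors per output variable. The design
# parameters and the "cycles"/"logic" cost formulas are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 16, size=(200, 3)).astype(float)  # 3 toy design params
cycles = X[:, 0] * 100.0 / X[:, 1]    # toy runtime objective
logic = X[:, 1] * X[:, 2] * 2.0       # toy logic-utilization objective

models = {}
for name, y in [("cycles", cycles), ("logic", logic)]:
    models[name] = RandomForestRegressor(
        n_estimators=50, random_state=0).fit(X, y)

# Predict both objectives for an unseen design point.
point = np.array([[8.0, 2.0, 4.0]])
pred = {name: m.predict(point)[0] for name, m in models.items()}
print(pred)
```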
It is believed that random forests are a good model for computer systems workloads [15, 7]. In fact, these workloads are often highly discontinuous, multi-modal, and non-linear [21], all characteristics that can be captured well by the space partitioning behind a decision tree. In addition, random forests naturally deal with categorical and ordinal variables, which are important in computer systems optimization; other popular models like Gaussian processes [23] are less appealing for these types of variables. Additionally, a trained random forest is a "white-box" model which is relatively simple for users to understand and to interpret (as compared to, for example, neural network models, which are more difficult to interpret).
3 Methodology
3.1 Injecting Prior Knowledge to Guide the Search
Here we consider the probability densities and distributions that are useful to model computer systems workloads. In these types of workloads, the following should be taken into account:

the range of values for a variable is finite.

the density mass can be uniform, bell-shaped (Gaussian-like) or J-shaped (decay or exponential-like).
For these reasons, in HyperMapper 2.0 we propose the Beta distribution as a model for the search space variables. The following three properties of the Beta distribution make it especially suitable for modeling ordinal, integer, and real variables; the Beta distribution:

has a finite domain;

can flexibly model a wide variety of shapes including bell shapes (symmetric or skewed), U-shapes, and J-shapes, thanks to its two shape parameters $\alpha$ and $\beta$;

has a probability density function (PDF) given by:

$$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} \qquad (2)$$

for $x \in [0, 1]$ and $\alpha, \beta > 0$, where $\Gamma$ is the Gamma function. The mean and variance can be computed in closed form.
Note that the Beta distribution has samples that are confined to the interval $[0, 1]$. For ordinal and integer variables, HyperMapper 2.0 automatically rescales the samples to the range of values of the input variable and then snaps each sample to the closest value among those that define the variable.
For categorical variables (with $k$ modalities) we use a probability distribution, i.e., instead of a density, which can easily be specified as pairs $(x_i, p_i)$, where $x_1, \dots, x_k$ are the values of the variable and $p_i$ is the probability associated with each of them, with $\sum_{i=1}^{k} p_i = 1$.
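A minimal sketch of prior-based sampling under these definitions (assuming only NumPy; the snap-to-closest rule is one reasonable reading of the rescaling described above, and the concrete domains and Beta parameters are illustrative assumptions):

```python
# Sketch: drawing ordinal samples from a Beta prior rescaled to the
# variable's domain, and categorical samples from explicit (value,
# probability) pairs. Domains and prior parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sample_ordinal(domain, a, b, n):
    """Draw Beta(a, b) samples in [0, 1], rescale them to the span of
    the ordinal domain, then snap each to the closest allowed value."""
    domain = sorted(domain)
    lo, hi = domain[0], domain[-1]
    raw = lo + (hi - lo) * rng.beta(a, b, size=n)
    return [min(domain, key=lambda v: abs(v - r)) for r in raw]

def sample_categorical(values, probs, n):
    """Categorical prior specified as (value, probability) pairs."""
    return list(rng.choice(values, size=n, p=probs))

tile_sizes = sample_ordinal([32, 64, 128, 256], a=1.0, b=3.0, n=5)
strategies = sample_categorical(["dense", "sparse"], [0.8, 0.2], n=5)
print(tile_sizes, strategies)
```

With `b > a`, the Beta prior concentrates mass towards the lower end of the domain, so small tile sizes are sampled more often while the whole range remains reachable.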
In Figure 2 we show Beta distributions with parameters $\alpha$ and $\beta$ selected to suit computer systems workloads. We have selected the following shapes:

Uniform ($\alpha = 1$, $\beta = 1$): used as a default if the user has no prior knowledge on the variable.

Gaussian ($\alpha = \beta > 1$): used when the user thinks it is likely that the optimum value for that variable is located in the center, but still wants to sample from the whole range of values with lower probability at the borders. This density is reminiscent of an actual Gaussian distribution, though it has finite support.

Exponential ($\alpha = 1$, $\beta > 1$): used when the optimum is likely located at one end of the range of values. This shape is similar to the exponential distribution drawn in [5].
3.2 Sampling with Categorical and Discrete Parameters
We first warm up our model with simple random sampling. In the design of experiments (DoE) literature [26], this is the most commonly used sampling technique to warm up the search. When prior knowledge is used, samples are drawn from each variable's prior distribution, or from the uniform distribution by default if no prior knowledge is provided.
3.3 Active Learning
Active learning is a paradigm in supervised machine learning which uses fewer training examples to achieve better prediction accuracy by iteratively training a predictor, and using the predictor in each iteration to choose the training examples which will increase its accuracy the most. Thus the optimization results are incrementally improved by interleaving exploration and exploitation steps. We use randomized decision forests as our base predictors created from a number of sampled points in the parameter space.
The application is evaluated on the sampled points, yielding the labels of the supervised setting given by the multiple objectives. Since our goal is to accurately estimate the points near the Pareto optimal front, we use the current predictor to provide performance values over the parameter space and thus estimate the Pareto fronts. For the next iteration, only parameter points near the predicted Pareto front are sampled and evaluated, and subsequently used to train new predictors using the entire collection of training points from current and all previous iterations. This process is repeated over a number of iterations forming the active learning loop. Our experiments in Section 4 indicate that this guided method of searching for highly informative parameter points in fact yields superior predictors as compared to a baseline that uses randomly sampled points alone. By iterating this process several times in the active learning loop, we are able to discover highquality design configurations that lead to good performance outcomes.
Algorithm 1 shows the pseudocode of the model-based search algorithm used in HyperMapper 2.0, and Figure 3 shows a corresponding graphical representation. The outer while loop of Algorithm 1 is the active learning loop, represented by the big loop in the preprocessing box of Figure 3. The user specifies the maximum number of active learning iterations. The model-fitting function in Algorithm 1 trains the random forest regressors, which are the surrogate models used to predict the objectives given a parameter vector. We train separate models, one for each objective ($p = 2$ in Algorithm 1). The random forest regressor is represented by the box "Regressor" in Figure 3.
A second function in Algorithm 1 trains a random forest classifier to predict whether a parameter vector is feasible or infeasible. The classifier becomes increasingly accurate during active learning; using it to predict infeasible parameter vectors has proven very effective, as shown later in Section 4.3. The random forest classifier is represented by the box "Classifier (Filter)" in Figure 3. A filtering function then removes the parameter vectors that are predicted infeasible before the Pareto front is computed, thus dramatically reducing the number of function evaluations. This function is represented by the box "Compute Valid Predicted Pareto" in Figure 3.
For the sake of space, some details are not shown in Algorithm 1. For example, the inner while loop is limited to a fixed budget of evaluations per active learning iteration. When the cardinality of the predicted Pareto set exceeds this budget, samples are selected uniformly at random from it for evaluation; when it is smaller, additional parameter vectors are drawn uniformly at random without repetition. This ensures exploration, analogous to the ε-greedy algorithm in the reinforcement learning literature [32], which is known to balance the exploration-exploitation trade-off.
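The overall loop can be sketched as follows (a compact, single-objective approximation with a synthetic objective and constraint; the real framework trains one regressor per objective and computes a Pareto front rather than a scalar best):

```python
# Sketch of the active learning loop: warm-up, then iterate
# (fit regressor + feasibility classifier) -> filter predicted-
# infeasible candidates -> evaluate promising + random points.
# Objective, constraint, and all sizes below are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

def objective(x):             # toy black-box objective (minimize)
    return (x[0] - 0.7) ** 2 + x[1]

def feasible(x):              # toy unknown feasibility constraint
    return x[0] + x[1] < 1.2

# Warm-up with random samples.
X = rng.random((30, 2))
y = np.array([objective(x) for x in X])
ok = np.array([feasible(x) for x in X])

for it in range(3):           # active learning iterations
    reg = RandomForestRegressor(random_state=0).fit(X, y)
    clf = RandomForestClassifier(random_state=0).fit(X, ok)
    cand = rng.random((500, 2))
    cand = cand[clf.predict(cand)]               # drop predicted-infeasible
    best = cand[np.argsort(reg.predict(cand))[:5]]   # exploit
    explore = rng.random((2, 2))                     # epsilon-greedy random
    batch = np.vstack([best, explore])
    X = np.vstack([X, batch])
    y = np.concatenate([y, [objective(x) for x in batch]])
    ok = np.concatenate([ok, [feasible(x) for x in batch]])

print(X[np.argmin(np.where(ok, y, np.inf))])  # best feasible point found
```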
3.4 Pareto Wall
In Algorithm 1, the function that computes the next predicted Pareto front first eliminates the already-evaluated samples. This means that the newly predicted Pareto front never contains previously evaluated samples and, as a consequence, a new layer of the Pareto front is considered at each iteration. We dub this multi-layered approach the Pareto Wall because we consider one Pareto front per active learning iteration, so that we explore several adjacent Pareto frontiers. Adjacent Pareto frontiers can be seen as a thick Pareto front, i.e., a Pareto Wall. The advantage of exploring the Pareto Wall in the active learning loop is that it minimizes the risk of relying on a surrogate model that is currently inaccurate. At each active learning step, we search previously unexplored samples which, by definition, are predicted to be worse than the current approximated Pareto front. However, when the predictor is not yet accurate, some of these unexplored samples will often dominate the approximated Pareto front, leading to a better front and an improved model.
3.5 The HyperMapper 2.0 Framework
HyperMapper 2.0 is written in Python and makes use of widely available libraries, e.g., scikit-learn and pyDOE, and will be released open source. HyperMapper 2.0 is set up via a simple json file. A light interface with third-party software used for optimization is also necessary; templates for Python and Scala are provided in the repository. To accelerate the active learning iterations, HyperMapper 2.0 can run the classifiers and regressors, as well as the computation of the Pareto front, in parallel on multicore machines.
4 Evaluation
We first evaluate HyperMapper 2.0 applied to the recently proposed Spatial compiler [19] for designing application hardware accelerators on FPGAs.
4.1 The Spatial Programming Language
Spatial [19] is a domainspecific language (DSL) and corresponding compiler for the design of application accelerators on reconfigurable architectures. The Spatial frontend is tailored to present programmers with a high level of abstraction for hardware design. Control in Spatial is expressed as nested, parallelizable loops, while data structures are allocated based on their placement in the target hardware’s memory hierarchy. The language also includes support for design parameters to express values which do not change the behavior of the application and which can be changed by the compiler. These parameters can be used to express loop tile sizes, memory sizes, loop unrolling factors, and the like.
As shown in Figure 4, the Spatial compiler lowers user programs into synthesizable Chisel [2] designs in three phases. In the first phase, it performs basic hardware optimizations and estimates a possible domain for each design parameter in the program. In the second phase, the compiler computes loop pipeline schedules and on-chip memory layouts for a given value of each parameter. It then estimates the amount of hardware resources and the runtime of the application. When targeting an FPGA, the compiler uses a device-specific model to estimate the amount of compute logic (LUTs), dedicated multipliers (DSPs), and on-chip scratchpads (BRAMs) required to instantiate the design. Runtime estimates are performed with similar device-specific models, with average- and worst-case estimates computed for runtime-dependent values. Runtime is typically reported in clock cycles.
In the final phase of compilation, the Spatial compiler unrolls parallelized loops, retimes pipelines via register insertion, and performs on-chip memory layout and compute optimizations based on the analyses performed in the previous phase. Finally, the last pass generates a Chisel design which can be synthesized and run on the target FPGA.
Table 2. Benchmarks evaluated with HyperMapper 2.0.

| Benchmark | Variables | Space Size |
|--------------|---|---|
| BlackScholes | 4 | |
| K-Means | 6 | |
| OuterProduct | 5 | |
| DotProduct | 5 | |
| GEMM | 7 | |
| TPC-H Q6 | 5 | |
| GDA | 9 | |
4.2 HyperMapper in the Spatial Compiler
The collection of design parameters in a Spatial program, together with their respective domains, yields a hardware design space. The second phase of the compiler gives a way to estimate two cost metrics, performance and FPGA resource utilization, for a given design in this space. Existing work on Spatial has evaluated two methods for design space exploration. The first method heuristically prunes the design space and then performs randomized search with a fixed number of samples. The heuristics, first established by Spatial's predecessor [20], help to eliminate obviously bad points within the design space prior to random search; the pruning rules are provided by expert FPGA developers. This is, in essence, a one-time hint to guide the search. The second method evaluated the feasibility of using HyperMapper 1.0 [7] to drive exploration, concluding that the tool was promising but still required future development. In some cases, HyperMapper 1.0 performed poorly without a feasibility classifier, as the search often focused on infeasible regions of the design space [19].
Spatial’s compiler includes hooks at the beginning and end of its second phase to interface with external tools for design space exploration. As shown in Figure 4, the compiler can query at the beginning of this phase for parameter values to evaluate. Similarly, the end of the second phase has hooks to output performance and resource estimates. HyperMapper 2.0 interfaces with these hooks to receive cost estimates, build a surrogate model, and drive search of the space.
In this work, we evaluate design space exploration when Spatial is targeting an Altera Stratix V FPGA with 48 GB of dedicated DRAM and a peak memory bandwidth of 76.8 GB/sec (an identical approach could be used for any FPGA target). We list the benchmarks we evaluate with HyperMapper 2.0 in Table 2. These benchmarks are a representative subset of those previously used to evaluate the Spatial compiler [19]. We use random search with heuristic pruning as our comparison baseline, as this is the most widely used and evaluated search technique with the compiler.
4.3 Feasibility Classifier Effectiveness
We address the question of the effectiveness of the feasibility classifier in the Spatial use case. Of all the hyperparameters defined for binary random forests [22], those that usually have the most impact on the performance of the classifier are the number of trees (n_estimators), the maximum tree depth (max_depth), the number of features considered per split (max_features), and the class weights (class_weight). We run an exhaustive search to fine-tune these hyperparameters and test the classifier's performance. The ranges of values we considered are shown in Table 3. This defines a comprehensive space of 81 possible configurations, small enough to be explored exhaustively. These configurations are indexed along the x-axis of Figure 7.
Table 3. Hyperparameter ranges for the binary random forest classifier.

| Name | Range of Values |
|---|---|
| n_estimators | [10, 100, 1000] |
| max_depth | [None, 4, 8] |
| max_features | ['auto', 0.5, 0.75] |
| class_weight | [{T: 0.50, F: 0.50}, {T: 0.75, F: 0.25}, {T: 0.9, F: 0.1}] |
We perform a 5-fold cross-validation using the data collected by HyperMapper 2.0 as training data and report validation recall averaged over the 5 folds. The goal of this optimization procedure is to maximize the recall of the binary classifier. We want to maximize recall, i.e., true positives / (true positives + false negatives), because it is important not to throw away feasible points that are misclassified as infeasible and that could potentially be good fits. Precision, i.e., true positives / (true positives + false positives), is less important: the only cost of classifying an infeasible parameter vector as feasible is that some samples are wasted on evaluating infeasible points, which is not a major drawback.
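This tuning procedure can be sketched with scikit-learn's grid search using recall as the scoring metric (the dataset here is synthetic, and the grid mirrors only a subset of Table 3's ranges, with class weights keyed by 0/1 labels instead of T/F):

```python
# Sketch: 5-fold cross-validated hyperparameter search for a
# feasibility classifier, scored by recall on the feasible class.
# The data and the reduced grid below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = (X.sum(axis=1) < 1.2).astype(int)   # 1 = feasible (minority class)

grid = {
    "n_estimators": [10, 100],
    "max_depth": [None, 8],
    "class_weight": [{0: 0.25, 1: 0.75}, {0: 0.5, 1: 0.5}],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring by `"recall"` rather than accuracy biases the search towards configurations that rarely misclassify feasible points as infeasible, which is exactly the asymmetry argued for above.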
Figure 7 reports the recall of the random forest classifier across the 7 benchmarks and hyperparameter configurations. For the sake of space, we only report the first 25 configurations, but the trend persists across all configurations. Figure 7a shows the recall just after the warm-up sampling and before the first active learning iteration; Figure 7b shows the recall after 50 active learning iterations, where each iteration evaluates 100 samples. The recall goes up during the active learning loop, implying that the feasibility constraint is predicted more accurately over time. The tables in Figure 7 show this general performance trend, with the maximum mean recall improving from 0.784 to 0.967.
In Figure 7a, the recall is low prior to the start of active learning. The configuration that scores best (the maximum of the minimum scores across the different configurations) has a minimum score of 0.6 on the 7 benchmarks. The configuration is: {'class_weight': {T: 0.75, F: 0.25}, 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 10}. The recall of this configuration ranges from a minimum of 0.6 for TPC-H Q6 to a maximum of 1.0 on BlackScholes, with mean 0.735 and standard deviation 0.15.
In Figure 7b, the recall is high after 50 iterations of active learning. Two configurations score best, with a minimum score of 0.886 on the 7 benchmarks: {'class_weight': {T: 0.75, F: 0.25}, 'max_depth': None, 'max_features': 0.75, 'n_estimators': 10} and {'class_weight': {T: 0.9, F: 0.1}, 'max_depth': None, 'max_features': 0.75, 'n_estimators': 10}. In general, most configurations are very close in terms of recall, and the default random forest configuration scores high, suggesting that the random forest does not need a major tuning effort for these kinds of workloads. The recall of these configurations ranges from a minimum of 0.886 for DotProduct to a maximum of 1.0 on BlackScholes, with mean 0.964 and standard deviation 0.04.
Figure 8 shows the predicted Pareto fronts for GDA, the benchmark with the largest design space, with and without the binary classifier for feasibility constraints. In both cases, we use plain random sampling to warm up the optimization with 1,000 samples, followed by 5 iterations of active learning.
The red dots, representing invalid points found without feasibility constraints, are spread farther from their Pareto frontier, while the green dots from the constrained search lie close to theirs. This happens because the unconstrained search chases seemingly promising but infeasible points, whereas the constrained search focuses on a region that is more conservative but feasible. The effect of the feasibility constraint is apparent in the improved Pareto front, which almost entirely dominates the approximated Pareto front from the unconstrained search.
4.4 Optimum vs. Approximated Pareto
We next take the smallest benchmarks, BlackScholes, DotProduct and OuterProduct, and run exhaustive search to compare the approximated Pareto front computed by HyperMapper 2.0 with the true optimal one. This is achievable only for such small benchmarks, as exhaustive search is otherwise infeasible; even on these small spaces, it requires 6 to 12 hours when parallelized across 16 CPU cores. In our framework, we warm up the search with random samples, followed by 5 active learning iterations of about 500 samples in total.
Comparisons are summarized in Figure 9. The optimal Pareto front is very close to the approximated one provided by HyperMapper 2.0, showing our software's ability to recover the optimal Pareto front on these benchmarks. Compared to the prior Spatial design space exploration approach, which uses pruning and random sampling, recovering the Pareto optimum requires about the same number of samples on BlackScholes and 66 times fewer samples on OuterProduct and DotProduct.
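The approximated Pareto fronts compared here can be obtained from any set of evaluated designs with a simple dominance filter. A minimal sketch for two minimized objectives (illustrative, not HyperMapper's internal code):

```python
def pareto_front(points):
    """Return the non-dominated subset of (obj1, obj2) points (minimization)."""
    front = []
    for p in points:
        # p is dominated if some other point is at least as good in both objectives.
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# Toy (cycles, logic) samples; dominated points are filtered out.
samples = [(3, 9), (5, 4), (4, 6), (7, 2), (6, 5), (8, 8)]
print(pareto_front(samples))  # -> [(3, 9), (4, 6), (5, 4), (7, 2)]
```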
4.5 Hypervolume Indicator
We next show the hypervolume indicator (HVI) [12] for the whole set of Spatial benchmarks as a function of the initial number of warmup samples (for the sake of space we omit the smallest benchmark, BlackScholes). For every benchmark, we run 5 repetitions of the experiments and report variability via a line plot with an 80% confidence interval. The HVI metric gives the area between the estimated Pareto frontier and the space's true Pareto front; it is the most common metric for comparing multiobjective algorithm performance. Since the true Pareto front is not always known, we use the accumulation of all experiments run on a given benchmark, across all repetitions and all approaches (e.g., baseline and HyperMapper 2.0), to compute our best approximation of the true Pareto front, and treat this as the true one. In addition, since logic utilization and cycles have value ranges that differ by several orders of magnitude, we normalize the data by dividing each objective by its standard deviation before computing the HVI. This gives the two objectives equal importance and avoids skewing the results toward the objective with higher raw values. We set the same number of samples for all experiments to the default value used in the prior-work baseline. Based on advice from expert hardware developers, we modify the Spatial compiler to automatically generate the prior knowledge discussed in Section 3.1 from the design parameter types. For example, onchip tile sizes have a “decay” prior because increasing memory size initially helps to improve DRAM bandwidth utilization but has diminishing returns after a certain point. This prior information is passed to HyperMapper 2.0 and is used to bias the random sampling. The baseline has no support for prior knowledge.
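For two minimized objectives, this HVI computation can be sketched with the standard rectangle-sweep hypervolume algorithm; the data, reference point, and fronts below are illustrative, not HyperMapper's exact implementation:

```python
import numpy as np

def hv_2d(front, ref):
    """Hypervolume dominated by a 2-D minimization front w.r.t. point ref."""
    area, prev_y = 0.0, ref[1]
    for x, y in sorted(front):      # ascending in the first objective
        if y < prev_y:              # each non-dominated point adds a rectangle
            area += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return area

# Illustrative (cycles, logic-utilization) values; the objectives differ by
# orders of magnitude, so each is divided by its standard deviation.
pts = np.array([[9e5, 30.0], [4e5, 55.0], [2e5, 80.0], [6e5, 45.0]])
scaled = pts / pts.std(axis=0)
ref = scaled.max(axis=0) * 1.1           # reference point beyond all points

approx = [tuple(p) for p in scaled[:2]]  # an estimated Pareto front
true = [tuple(p) for p in scaled[:3]]    # best-known ("true") front
hvi = hv_2d(true, ref) - hv_2d(approx, ref)  # area between the two fronts
print(hvi)
```

Since the best-known front dominates any estimated front built from a subset of the same runs, the HVI is non-negative and shrinks toward zero as the search improves.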
Figure 10 shows the two approaches: HyperMapper 2.0, using a warmup sampling phase informed by the prior followed by an active learning phase; and Spatial's previous design space exploration approach (the baseline).
The y-axis reports the HVI metric and the x-axis the number of samples in thousands.
Table 4 quantitatively summarizes the results. We observe the general trend that HyperMapper 2.0 needs far fewer samples to achieve competitive performance compared to the baseline. Additionally, our framework's variance is generally small, as shown in Figure 10. The total number of samples used by HyperMapper 2.0 is 12,500 in all experiments, while the number of samples performed by the baseline varies as a function of the pruning strategy. The number of samples for GEMM, TPCH Q6, GDA, and DotProduct is 100,000, which corresponds to an 8x efficiency improvement, while OuterProduct and KMeans use 31,068 and 18,720 samples, improvements of roughly 2.5x and 1.5x, respectively.
Benchmark       HyperMapper 2.0   Spatial Baseline
BlackScholes
KMeans
OuterProduct
DotProduct
GEMM
TPCH Q6
GDA
As a result, the autotuner is robust to randomness, and only a reasonable number of random samples is needed for the warmup and active learning phases.
4.6 Understandability of the Results
                                    Objective
Benchmark       Parameter       Logic Util.   Cycles
BlackScholes    Tile Size       0.003         0.569
                OP              0.261         0.072
                IP              0.735         0.303
                Pipelining      0.001         0.056
OuterProduct    Tile Size A     0.095         0.290
                Tile Size B     0.075         0.323
                OP              0.170         0.084
                IP              0.321         0.248
                Pipelining      0.340         0.055
HyperMapper 2.0 can be used by domain non-experts to learn more about the domain they are trying to optimize. In particular, users can inspect feature importances to better understand the impact of each parameter on the design objectives. The feature importances for the BlackScholes and OuterProduct benchmarks are given in Table 5.
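Importances like those in Table 5 can be read from randomized decision forests fit per objective. The sketch below uses scikit-learn with synthetic data; the parameter names and toy objective functions are illustrative, not Spatial's real cost model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
params = ["Tile Size", "OP", "IP", "Pipelining"]
X = rng.integers(1, 64, size=(300, len(params))).astype(float)

# Toy stand-ins for the two objectives.
cycles = 1e6 / (X[:, 0] * X[:, 2]) + rng.normal(0, 5, 300)
logic = 50 * X[:, 2] + 10 * X[:, 1] + rng.normal(0, 20, 300)

for name, y in [("Logic Util.", logic), ("Cycles", cycles)]:
    forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    importances = dict(zip(params, forest.feature_importances_.round(3)))
    print(name, importances)  # per-parameter importance, summing to 1
```

One forest per objective yields one importance column per objective, matching the layout of Table 5.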
In BlackScholes, innermost loop parallelization (IP) directly determines how fast a single tile of data can be processed. Consequently, as shown in Table 5, IP is highly related to both the design logic utilization and the design runtime (cycles). Since BlackScholes is bandwidth bound, changing DRAM utilization via tile sizes directly changes the runtime, but it has no impact on the compute logic since larger memories do not require more LUTs. Outer loop parallelization (OP) also duplicates compute logic by making multiple copies of each inner loop, but as shown in Table 5, OP has less importance for runtime than IP.
Similarly, in OuterProduct, both tile sizes have roughly even importance on the number of execution cycles, while IP has roughly even importance for both logic utilization and cycles. Unlike BlackScholes, which includes a large amount of floating point compute, OuterProduct has relatively little computation, making the cost of outer loop pipelining relatively impactful on logic utilization but with little importance on cycles. In both cases, the Spatial compiler can take this information into account when determining whether to prioritize further optimizing the application for inner loop parallelization or outer loop pipelining.
5 Related Work
During the last two decades, several design space exploration techniques and frameworks have been used in a variety of contexts, ranging from embedded devices to compiler research to system integration. Table 1 provides a taxonomy of methodologies and software from both the computer systems and machine learning communities. HyperMapper has been inspired by a wide body of work across multiple subfields of these communities. The nature of computer systems workloads imposes some important requirements on the design of HyperMapper 2.0 that are often missing from the machine learning community's research on design space exploration tools.
In the systems community, a popular, state-of-the-art design space exploration tool is OpenTuner [1]. This tool is based on direct approaches (e.g., differential evolution, Nelder-Mead) and on a methodology that uses the Area Under the Curve (AUC) and multi-armed bandit techniques to decide which search algorithm deserves a higher resource budget. OpenTuner differs from our work in a number of ways. First, our work supports multiobjective optimization. Second, our white-box model-based approach enables the user to understand the results while learning from them. Third, our approach is able to handle unknown feasibility constraints. Lastly, our framework can inject prior knowledge into the search. The first difference in particular rules out a direct performance comparison of the two tools.
Our work is inspired by HyperMapper 1.0 [7, 21, 25, 19]. Bodin et al. [7] introduce HyperMapper 1.0 for autotuning of computer vision applications by considering the full software/hardware stack in the optimization process. Other prior work applies it to computer vision and robotics applications [21, 25]. There has also been a preliminary study applying HyperMapper to the Spatial programming language and compiler, as in our work [19]. However, HyperMapper 1.0 lacks some fundamental features that make it ineffective in the presence of applications with infeasible designs and prior knowledge.
In [16], the authors use an active learning technique to build an accurate surrogate model by reducing the variance of an ensemble of fully connected neural network models. Our work is fundamentally different: we are not interested in building a perfect surrogate model, but in optimizing over the surrogate model (across multiple objectives). In our case, building a very accurate surrogate model over the entire space would waste samples.
Recent work [9] uses decision trees to automatically tune discrete NVIDIA and SoC ARM GPUs. Siegmund et al. tackle the software configurability problem for binary [29] and for both binary and numeric options [28] using a performance-influence model based on linear regression. They optimize for execution time on several examples, exploring algorithmic and compiler spaces in isolation.
In particular, machine learning (ML) techniques have recently been employed in both architectural and compiler research. Khan et al. [18] employ predictive modeling for cross-program design space exploration in multicore systems, exploring a large design space of chip multiprocessors running parallel applications with low prediction error. In [4], Balaprakash et al. introduce AutoMOMML, an end-to-end ML-based framework to build predictive models for objectives such as performance and power. [3] presents the abdynaTree active learning parallel algorithm, which builds surrogate performance models for scientific kernels and workloads on single-core, multi-core, and multi-node architectures. In [34], the authors propose the Pareto Active Learning (PAL) algorithm, which intelligently samples the design space to predict the Pareto-optimal set.
Our work is similar in nature to approaches from the Bayesian optimization literature [27]. Examples of widely used mono-objective Bayesian DFO software are SMAC [15], Spearmint [30, 31], and the work on the tree-structured Parzen estimator (TPE) [6]. These mono-objective methodologies are based on random forests, Gaussian processes, and TPEs, respectively, so the choice of learned model varies across tools.
6 Conclusions and Future Work
HyperMapper 2.0 is inspired by the algorithm introduced in [7], later dubbed HyperMapper 1.0 [21], and by the philosophy behind OpenTuner [1] and SMAC [15]. We have introduced a new derivative-free optimization methodology, and a corresponding framework, that guides the search using active learning. This framework, dubbed HyperMapper 2.0, is built for practical, user-friendly design space exploration in computer systems, with support for categorical and ordinal variables, design feasibility constraints, multiobjective optimization, and user input on variable priors. Additionally, HyperMapper 2.0 uses randomized decision forests to model the searched space. This model not only maps well to the discontinuous, non-linear spaces found in computer systems, but also yields a “white box” result that the end user can inspect to gain deeper understanding of the space.
We have presented the application of HyperMapper 2.0 as a compiler pass of the Spatial language and compiler for generating application accelerators on FPGAs. Our experiments show that, compared to the previously used heuristic random search, our framework finds similar or better approximations of the true Pareto frontier while requiring significantly fewer samples: 8x fewer in most of the benchmarks explored.
Future work on HyperMapper 2.0 will include the analysis and incorporation of other DFO strategies. In particular, a fully Bayesian approach in the active learning loop would help leverage the prior knowledge by computing a posterior distribution; our current approach exploits the prior distribution only during the initial warmup sampling. Exploring additional warmup methods from the design of experiments literature is also a promising research avenue. In particular, Latin hypercube sampling was recently adapted to work with categorical variables [33], making it suitable for computer systems workloads.
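One simple way to adapt Latin hypercube sampling to categorical and ordinal parameters is to draw a continuous design in the unit cube and snap each coordinate to one of the parameter's levels; the sketch below uses SciPy's `qmc` module with a hypothetical parameter space, and this naive quantization is not necessarily the specific method of [33]:

```python
from scipy.stats import qmc

# Hypothetical parameter space: one categorical and two ordinal parameters.
levels = [["on", "off"], [1, 2, 4, 8, 16], [32, 64, 128]]

sampler = qmc.LatinHypercube(d=len(levels), seed=0)
unit = sampler.random(n=8)              # stratified points in [0, 1)^d

# Quantize each unit-interval coordinate into one of the discrete levels.
designs = [[lv[int(u * len(lv))] for u, lv in zip(row, levels)]
           for row in unit]
for d in designs:
    print(d)
```

Because each dimension of the hypercube is stratified, the quantized samples spread evenly across each parameter's levels, unlike plain uniform sampling.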
References
 [1] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. OpenTuner: An extensible framework for program autotuning. In Parallel Architecture and Compilation Techniques (PACT), 2014 23rd International Conference on, pages 303–315. IEEE, 2014.
 [2] J. Bachrach, Huy Vo, B. Richards, Yunsup Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic. Chisel: Constructing hardware in a Scala embedded language. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 1212–1221, June 2012.
 [3] Prasanna Balaprakash, Robert B Gramacy, and Stefan M Wild. Activelearningbased surrogate models for empirical performance tuning. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
 [4] Prasanna Balaprakash, Ananta Tiwari, Stefan M Wild, Laura Carrington, and Paul D Hovland. Automomml: Automatic multiobjective modeling with machine learning. In International Conference on High Performance Computing, pages 219–239. Springer, 2016.
 [5] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 [6] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
 [7] Bruno Bodin, Luigi Nardi, M Zeeshan Zia, Harry Wagstaff, Govind Sreekar Shenoy, Murali Emani, John Mawer, Christos Kotselidis, Andy Nisbet, Mikel Lujan, et al. Integrating algorithmic parameters into benchmarking and design space exploration in 3D scene understanding. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 57–69. ACM, 2016.
 [8] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [9] Marco Cianfriglia, Flavio Vella, Cedric Nugteren, Anton Lokhmotov, and Grigori Fursin. A modeldriven approach for a new generation of adaptive libraries. arXiv preprint arXiv:1806.07060, 2018.
 [10] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.
 [11] Antonio Criminisi, Jamie Shotton, Ender Konukoglu, et al. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
 [12] Paul Feliot, Julien Bect, and Emmanuel Vazquez. A Bayesian approach to constrained single- and multi-objective optimization. Journal of Global Optimization, 67(1–2):97–133, 2017.
 [13] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.
 [14] Michael A Gelbart, Jasper Snoek, and Ryan P Adams. Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607, 2014.
 [15] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
 [16] Engin İpek, Sally A McKee, Rich Caruana, Bronis R de Supinski, and Martin Schulz. Efficiently exploring architectural design spaces via predictive modeling, volume 41. ACM, 2006.
 [17] Eunsuk Kang, Ethan Jackson, and Wolfram Schulte. An approach for effective design space exploration. In Monterey Workshop, pages 33–54. Springer, 2010.
 [18] Salman Khan, Polychronis Xekalakis, John Cavazos, and Marcelo Cintra. Using predictivemodeling for crossprogram design space exploration in multicore systems. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 327–338. IEEE Computer Society, 2007.
 [19] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A Language and Compiler for Application Accelerators. In ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2018.
 [20] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In International Symposium on Computer Architecture (ISCA), 2016.
 [21] Luigi Nardi, Bruno Bodin, Sajad Saeedi, Emanuele Vespa, Andrew J Davison, and Paul HJ Kelly. Algorithmic performance-accuracy trade-off in 3D vision applications using HyperMapper. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 1434–1443. IEEE, 2017.
 [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [23] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
 [24] Luis Miguel Rios and Nikolaos V Sahinidis. Derivativefree optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.
 [25] Sajad Saeedi, Luigi Nardi, Edward Johns, Bruno Bodin, Paul HJ Kelly, and Andrew J Davison. Application-oriented design space exploration for SLAM algorithms. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 5716–5723. IEEE, 2017.
 [26] Thomas J Santner, Brian J Williams, and William I Notz. The design and analysis of computer experiments. Springer Science & Business Media, 2013.
 [27] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 [28] Norbert Siegmund, Alexander Grebhahn, Sven Apel, and Christian Kästner. Performance-influence models for highly configurable systems. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 284–294. ACM, 2015.
 [29] Norbert Siegmund, Sergiy S Kolesnikov, Christian Kästner, Sven Apel, Don Batory, Marko Rosenmüller, and Gunter Saake. Predicting performance via automated feature-interaction detection. In Proceedings of the 34th International Conference on Software Engineering, pages 167–177. IEEE Press, 2012.
 [30] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 [31] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
 [32] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.
 [33] Laura P Swiler, Patricia D Hough, Peter Qian, Xu Xu, Curtis Storlie, and Herbert Lee. Surrogate models for mixed discretecontinuous variables. In Constraint Programming and Decision Making, pages 181–202. Springer, 2014.
 [34] Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Active learning for multi-objective optimization. In International Conference on Machine Learning, pages 462–470, 2013.