Practical Design Space Exploration

Luigi Nardi (Stanford University), David Koeplinger (Stanford University), Kunle Olukotun (Stanford University)
Abstract

Multi-objective optimization is a crucial matter in computer systems design space exploration because real-world applications often rely on a trade-off between several objectives. Derivatives are usually not available or impractical to compute, and the feasibility of an experiment cannot always be determined in advance. These problems are particularly difficult when the feasible region is relatively small, and it may be prohibitive to even find a feasible experiment, let alone an optimal one.

We introduce a new methodology and corresponding software framework, HyperMapper 2.0, which handles multi-objective optimization, unknown feasibility constraints, and categorical/ordinal variables. This new methodology also supports injection of user prior knowledge in the search when available. All of these features are common requirements in computer systems but rarely exposed in existing design space exploration systems. The proposed methodology follows a white-box model which is simple to understand and interpret (unlike, for example, neural networks) and can be used by the user to better understand the results of the automatic search.

We apply and evaluate the new methodology on automatic static tuning of hardware accelerators within the recently introduced Spatial programming language, minimizing design runtime and compute logic under the constraint that the design fits in a target field-programmable gate array chip. Our results show that HyperMapper 2.0 provides better Pareto fronts compared to state-of-the-art baselines, with better or competitive hypervolume indicator and an 8x improvement in sampling budget for most of the benchmarks explored.

1 Introduction

Design problems are ubiquitous in scientific and industrial achievements. Scientists design experiments to gain insights into physical and social phenomena, and engineers design machines to execute tasks more efficiently. These design problems are fraught with choices which are often complex and high-dimensional and which include interactions that make them difficult for individuals to reason about. In software/hardware co-design, for example, companies develop libraries with tens or hundreds of free choices and parameters that interact in complex ways. In fact, the level of complexity is often so high that it becomes impossible to find domain experts capable of tuning these libraries [10].

Typically, a human developer who wishes to tune a computer system will try some of the options and gain insight into the response surface of the software. They will start to fit a mental model of how the software responds to the different choices. However, humans are good primarily at fitting continuous linear models. So, if we picture the space of choices with one choice per axis, a designer will be able to relate the different axes by lines, planes or, in general, hyperplanes, or perhaps convex or concave shapes. When the response surface is complex, e.g., non-linear, non-convex, discontinuous, or multi-modal, a human designer will hardly be able to model this complex process, ultimately missing the opportunity to deliver high-performance products.

Mathematically, in the mono-objective formulation, we consider the problem of finding a global minimizer of an unknown (black-box) objective function $f$:

$$x^* = \underset{x \in \mathcal{X}}{\arg\min}\; f(x) \qquad (1)$$

where $\mathcal{X}$ is some input design space of interest. The problem addressed in this paper is the optimization of a deterministic function $f$ over a domain of interest that includes lower and upper bound constraints on the problem variables.

When optimizing a smooth function, it is well known that useful information is contained in the function gradient/derivatives, which can be leveraged, for instance, by first-order methods. The derivatives are often computed by hand-coding, by automatic differentiation, or by finite differences. However, there are situations where such first-order information is not available or not even well defined. Typically, this is the case for computer systems workloads that include many discrete variables, i.e., either categorical (e.g., boolean) or ordinal (e.g., choice of cache sizes), over which derivatives cannot even be defined. Hence, we assume in our applications of interest that the derivative of $f$ is neither symbolically nor numerically available. This problem is referred to in the literature as derivative-free optimization (DFO) [10, 24], also known as black-box optimization [12] and, in the computer systems community, as design-space exploration (DSE) [16, 17].

Name Multi RIOC var. Constr. Prior
GpyOpt
OpenTuner
SURF
SMAC
Spearmint
Hyperopt
Hyperband
GPflowOpt
cBO
BOHB
HyperMapper
Table 1: Derivative-free optimization software taxonomy. Multi notes whether the software is multi-objective. RIOC var. notes whether the software supports all of Real, Integer, Ordinal and Categorical variables. Constr. refers to support for inequality constraints that define a feasible region used during the optimization process. Prior represents the ability of the software to inject prior knowledge into the search.

In addition to objective function evaluations, many optimization programs have similarly expensive evaluations of constraint functions. The set of points where such constraints are satisfied is referred to as the feasibility set. For example, in computer micro-architecture, the particular specifications of a CPU (e.g., L1-cache size, branch predictor range, and cycle time) need to be carefully balanced to optimize CPU speed while keeping power usage strictly within a pre-specified budget. A similar example arises in creating hardware designs for field-programmable gate arrays (FPGAs). FPGAs are a type of reconfigurable logic chip with a fixed number of units available to implement circuits. Any generated design must keep the number of units strictly below this resource budget to be implementable on the FPGA. In these examples, the feasibility of an experiment cannot be checked prior to the termination of the experiment; this is often referred to as unknown feasibility in the literature [14]. Also note that the smaller the feasible region, the harder it is to check whether an experiment is feasible (and even more costly to check optimality [14]).

While the growing demand for sophisticated DFO methods has triggered the development of a wide range of approaches and frameworks, none to date is featured enough to fully address the complexities of design space exploration and optimization in the computer systems domain. To address this problem, we introduce a new methodology and a framework dubbed HyperMapper 2.0. HyperMapper 2.0 is designed for the computer systems community and can handle design spaces consisting of multiple objectives, categorical/ordinal variables, unknown constraints, and exploitation of user prior knowledge. To aid comparison, we provide a list of existing tools and the corresponding taxonomy in Table 1. Our framework uses a model-based algorithm, i.e., it constructs and utilizes a surrogate model of $f$ to guide the search process. A key advantage of having a model, and more specifically a white-box model, is that the final surrogate of $f$ can be analyzed by the user to understand the space and learn fundamental properties of the application of interest.

As shown in Table 1, HyperMapper 2.0 is the only framework to provide all the features needed for a practical design-space exploration software in computer systems applications. The contributions of this paper are:

  • A methodology for multi-objective optimization that deals with categorical and ordinal variables, unknown constraints, and exploitation of user prior knowledge when available.

  • A validation of our methodology by providing experimental results as applied to accelerator design.

  • An integration of our solution in a full, production-level compiler toolchain.

  • An open-source framework dubbed HyperMapper 2.0 implementing the newly introduced methodology, designed to be simple and user-friendly.

The remainder of this paper is organized as follows: Section 2 provides a problem statement and background for the methodology used in this work. In Section 3, we describe our methodology and framework. In Section 4 we present our experimental evaluation. Section 5 discusses related work. We conclude in Section 6 with a brief discussion of future work.

2 Background

In this section, we provide the notation and basic concepts used in the paper. We describe the mathematical formulation of the mono-objective optimization problem with feasibility constraints. We then expand this to a definition of the multi-objective optimization problem and provide background on randomized decision forests [8].

2.1 Unknown Feasibility Constraints

Mathematically, in the mono-objective formulation, we consider the problem of finding a global minimizer (or maximizer) of an unknown (black-box) objective function $f$ under a set of constraint functions $c_i$:

$$x^* = \underset{x \in \mathcal{X}}{\arg\min}\; f(x)$$

subject to

$$c_i(x) \le b_i, \quad i = 1, \dots, q,$$

where $\mathcal{X}$ is some input design space of interest and the $c_i$ are unknown constraint functions. The problem addressed in this paper is the optimization of a deterministic function over a domain of interest that includes lower and upper bounds on the problem variables.

The variables defining the space can be real (continuous), integer, ordinal, and categorical. Ordinal parameters have as domain a finite set of values which are either integers and/or real values; for example, a set of allowed cache sizes or a set of allowed tile sizes are possible domains of ordinal parameters. Ordinal values must have an ordering by the less-than operator. The ordinal and integer cases are also referred to as discrete variables. Categorical parameters also have domains of a finite set of values but have no ordering requirement; for example, sets of strings describing some property, such as a boolean flag or the type of memory used, are categorical domains. The primary benefit of encoding a variable as ordinal is that it allows better inferences about unseen parameter values. With a categorical parameter, knowledge of one value does not tell one much about other values, whereas with an ordinal parameter we would expect closer values (with respect to the ordering) to be more related.

We assume that the derivative of $f$ is not available, and that bounds, such as Lipschitz constants, for the derivative of $f$ are also unavailable. Evaluating feasibility is often of the same order of expense as evaluating the objective function $f$. As for the objective function, no particular assumptions are made on the constraint functions.

2.2 Multi-Objective Optimization: Problem Statement

A pictorial representation of a multi-objective problem is shown in Figure 1. On the left, a three-dimensional design space is composed of one ordinal ($x_1$), one real ($x_2$), and one categorical ($x_3$) variable. The red dots represent samples from this search space. The multi-objective function $f$ maps this input space to the output space on the right, also called the optimization space. The optimization space is composed of two optimization objectives ($f_1$ and $f_2$). The blue dots correspond to the red dots on the left via the application of $f$. The arrows Min and Max represent the fact that we can minimize or maximize the objectives depending on the application. Optimization will drive the search for optimal points towards either the Min or Max of the right plot.

Figure 1: Example of a multi-objective space. The multi-objective function $f$ maps each point in the three-dimensional design space on the left to the optimization space on the right.

Formally, let us consider a multi-objective optimization (minimization) over a design space $\mathcal{X}$. We define $f : \mathcal{X} \to \mathbb{R}^p$ as our vector of objective functions $f = (f_1, \dots, f_p)$, taking $x \in \mathcal{X}$ as input and evaluating $y = f(x)$. Our goal is to identify the Pareto frontier of $f$; that is, the set $\Gamma \subseteq \mathcal{X}$ of points which are not dominated by any other point, i.e., the maximally desirable $x$ which cannot be optimized further for any single objective without making a trade-off. Formally, we consider the partial order on $\mathbb{R}^p$: $y \prec y'$ iff $y_i \le y'_i$ for all $i \in \{1, \dots, p\}$ and $y_j < y'_j$ for some $j$, and define the induced order on $\mathcal{X}$: $x \prec x'$ iff $f(x) \prec f(x')$. The set of minimal points in this order is the Pareto-optimal set $\Gamma = \{ x \in \mathcal{X} : \nexists\, x' \in \mathcal{X} \text{ such that } x' \prec x \}$.

We can then introduce a set of inequality constraints $c = (c_1, \dots, c_q)$, $c_i : \mathcal{X} \to \mathbb{R}$, to the optimization, such that we only consider points where all constraints are satisfied ($c_i(x) \le b_i$ for all $i$). These constraints directly correspond to real-world limitations of the design space under consideration. Applying these constraints gives the constrained Pareto-optimal set $\Gamma_c = \{ x \in \mathcal{X} : (c_i(x) \le b_i \;\forall i) \wedge (\nexists\, x' \text{ feasible such that } x' \prec x) \}$.

Similarly to the mono-objective case in [13], we can define the feasibility indicator function $\Delta(x) \in \{0, 1\}$, which is $1$ if $c_i(x) \le b_i$ for all $i$, and $0$ otherwise. A design point $x$ where $\Delta(x) = 1$ is termed feasible. Otherwise, it is called infeasible.
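To make these definitions concrete, the following minimal Python sketch (with illustrative array names and toy data, not part of HyperMapper 2.0's interface) computes the feasible, non-dominated subset of a set of evaluated points for two minimization objectives:

```python
import numpy as np

def constrained_pareto_front(objectives, feasible):
    """Return the indices of the feasible, non-dominated points.

    objectives: (n, p) array of objective values (minimization).
    feasible:   (n,) boolean array, the feasibility indicator Delta(x).
    """
    idx = np.where(feasible)[0]          # discard infeasible points
    pareto = []
    for i in idx:
        yi = objectives[i]
        # i is dominated if some other feasible point is <= in every
        # objective and strictly < in at least one.
        dominated = any(
            np.all(objectives[j] <= yi) and np.any(objectives[j] < yi)
            for j in idx if j != i
        )
        if not dominated:
            pareto.append(i)
    return pareto

# Toy usage: 2 objectives (e.g., cycles and logic utilization).
objs = np.array([[10.0, 0.4], [8.0, 0.6], [12.0, 0.3], [7.0, 0.7]])
feas = np.array([True, True, True, False])   # last point violates c(x) <= b
print(constrained_pareto_front(objs, feas))  # -> [0, 1, 2]
```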

We aim to identify the constrained Pareto-optimal set $\Gamma_c$ with the fewest possible function evaluations, solving a sequential decision problem and constructing a strategy that iteratively generates the next $x \in \mathcal{X}$ to evaluate. If each evaluation is not very expensive, then it is possible to construct a strategy that, at each sequential step, runs multiple evaluations, i.e., a batch of evaluations. In this case it is standard practice to warm up the strategy with some previously sampled points, using sampling techniques from the design of experiments literature [26].

It is worth noting that, while infeasible points are never considered our best experiments, they are still useful to add to our set of performed experiments to improve the probabilistic model posteriors. Practically speaking, infeasible samples help to determine the shape and descent directions of the constraint functions, allowing the probabilistic model to discern which regions are more likely to be feasible without actually sampling there. The fact that we do not need to sample in feasible regions to find them is highly useful in cases where the feasible region is relatively small and uniform sampling would have difficulty finding it.

As an example, in this paper we evaluate the compiler optimization case of targeting FPGAs. In this case $p = 2$ and $q = 1$: $f_1$ is the number of total cycles (i.e., runtime), $f_2$ is the logic utilization (i.e., the quantity of logic gates used) as a percentage, and the single constraint $c_1$ represents whether the design point fits in the target FPGA board.

2.3 Randomized Decision Forests

A decision tree is a non-parametric supervised machine learning method widely used to formalize decision making processes across a variety of fields. A randomized decision tree is an analogous machine learning model, which “learns” how to regress (or classify) data points based on randomly selected attributes of a set of training examples. The combination of many weak regressors (binary decisions) allows approximating highly non-linear and multi-modal functions with great accuracy. Randomized decision forests [8, 11] combine many such decorrelated trees based on the randomization at the level of training data points and attributes to yield an even more effective supervised regression and classification model.

A decision tree represents a recursive binary partitioning of the input space, and uses a simple decision (a one-dimensional decision threshold) at each non-leaf node that aims at maximizing an “information gain” function. Prediction is performed by “dropping” down the test data point from the root, and letting it traverse a path decided by the node decisions, until it reaches a leaf node. Each leaf node has a corresponding function value (or probability distribution on function values), adjusted according to training data, which is predicted as the function value for the test input. During training, randomization is injected into the procedure to reduce variance and avoid overfitting. This is achieved by training each individual tree on randomly selected subsets of the training samples (also called bagging), as well as by randomly selecting the deciding input variable for each tree node to decorrelate the trees.

A regression random forest is built from a set of such decision trees where the leaf nodes output the average of the training data labels and where the output of the whole forest is the average of the predicted results over all trees. In our experiments, we train separate regressors to learn the mapping from our input parameter space to each output variable.
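As a concrete sketch of this setup, assuming scikit-learn and toy data (the variable names and objective functions below are illustrative), one random forest regressor can be trained per objective, alongside a random forest classifier for the feasibility label used later in Section 3:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))                        # encoded design parameters
y_cycles = 1e4 * X[:, 0] + 5e3 * X[:, 1] ** 2   # toy objective 1 (runtime)
y_logic  = 0.3 * X[:, 1] + 0.5 * X[:, 2]        # toy objective 2 (logic util.)
feasible = y_logic < 0.6                        # toy feasibility label

# One regressor per objective, as described above.
surrogates = {}
for name, y in {"cycles": y_cycles, "logic": y_logic}.items():
    surrogates[name] = RandomForestRegressor(n_estimators=100,
                                             random_state=0).fit(X, y)

# Feasibility classifier used to filter predicted-infeasible candidates.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, feasible)

candidates = rng.random((5, 3))
print(surrogates["cycles"].predict(candidates), clf.predict(candidates))
```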

It is believed that random forests are a good model for computer systems workloads [15, 7]. In fact, these workloads are often highly discontinuous, multi-modal, and non-linear [21], all characteristics that are captured well by the space partitioning behind a decision tree. In addition, random forests naturally deal with the categorical and ordinal variables that are important in computer systems optimization; other popular models, such as Gaussian processes [23], are less appealing for these types of variables. Additionally, a trained random forest is a “white box” model which is relatively simple for users to understand and interpret (as compared to, for example, neural network models, which are more difficult to interpret).

3 Methodology

3.1 Injecting Prior Knowledge to Guide the Search

Here we consider the probability densities and distributions that are useful to model computer systems workloads. In these types of workloads, the following should be taken into account:

  • the range of values for a variable is finite.

  • the density mass can be uniform, bell-shaped (Gaussian-like) or J-shaped (decay or exponential-like).

For these reasons, in HyperMapper 2.0 we propose the Beta distribution as a model for the search space variables. The following three properties of the Beta distribution make it especially suitable for modeling ordinal, integer and real variables; the Beta distribution:

  1. has a finite domain;

  2. can flexibly model a wide variety of shapes, including bell shapes (symmetric or skewed), U-shapes and J-shapes, thanks to the two shape parameters $\alpha$ and $\beta$ of the distribution;

  3. has probability density function (PDF) given by:

    $$f(x;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} \qquad (2)$$

    for $x \in [0, 1]$ and $\alpha, \beta > 0$, where $\Gamma$ is the Gamma function. The mean and variance can be computed in closed form.

Note that the Beta distribution has samples that are confined to the interval $[0, 1]$. For ordinal and integer variables, HyperMapper 2.0 automatically rescales the samples to the range of values of the input variable and then selects the closest allowed value among the ones that define the variable.

For categorical variables (with $k$ modalities) we use a probability distribution, i.e., instead of a density, which can easily be specified as pairs $(x_i, p_i)$, $1 \le i \le k$, where the $x_i$ are the values of the variable and $p_i$ is the probability associated with each of them, with $\sum_{i=1}^{k} p_i = 1$.
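As a minimal sketch of how such priors can be sampled (assuming NumPy; the parameter domains and Beta parameters below are illustrative), a Beta draw is rescaled to the ordinal variable's range and snapped to the closest allowed value, while a categorical variable is drawn directly from its (value, probability) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ordinal(values, alpha, beta, rng):
    """Draw from a Beta(alpha, beta) prior and snap to the closest allowed value."""
    values = np.sort(np.asarray(values, dtype=float))
    u = rng.beta(alpha, beta)                      # sample in [0, 1]
    x = values[0] + u * (values[-1] - values[0])   # rescale to the variable's range
    return values[np.argmin(np.abs(values - x))]   # closest allowed value

def sample_categorical(values, probs, rng):
    """Draw a categorical value from explicit (value, probability) pairs."""
    return values[rng.choice(len(values), p=probs)]

tile_sizes = [16, 32, 64, 128, 256]                              # illustrative ordinal domain
print(sample_ordinal(tile_sizes, alpha=1.0, beta=3.0, rng=rng))  # "decay"-shaped prior
print(sample_categorical(["pipelined", "sequential"], [0.7, 0.3], rng))
```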

Figure 2: Beta distribution shapes in HyperMapper 2.0.

In Figure 2 we show Beta distributions with parameters $\alpha$ and $\beta$ selected to suit computer systems workloads. We have selected four shapes as follows:

  1. Uniform ($\alpha = \beta = 1$): used as a default if the user has no prior knowledge on the variable.

  2. Gaussian ($\alpha = \beta > 1$, a symmetric bell shape): used when the user thinks it is likely that the optimum value for that variable is located in the center, but still wants to sample from the whole range of values with lower probability at the borders. This density is reminiscent of an actual Gaussian distribution, though its support is finite.

  3. Decay ($\alpha = 1$, $\beta > 1$): used when the optimum is likely located at the beginning of the range of values. This is similar in shape to the log-uniform distribution as in [6, 5].

  4. Exponential ($\alpha > 1$, $\beta = 1$): used when the optimum is likely located at the end of the range of values. This is similar in shape to the exponentially-distributed draws used in [5].

3.2 Sampling with Categorical and Discrete Parameters

We first warm up our model with simple random sampling. In the design of experiments (DoE) literature [26], this is the most commonly used sampling technique to warm up a search. When prior knowledge is available, warm-up samples are drawn from each variable's prior distribution; the uniform distribution is used by default if no prior knowledge is provided.

3.3 Active Learning

Active learning is a paradigm in supervised machine learning which aims to achieve better prediction accuracy with fewer training examples by iteratively training a predictor and using the predictor at each iteration to choose the training examples that will increase its accuracy the most. Thus the optimization results are incrementally improved by interleaving exploration and exploitation steps. We use randomized decision forests as our base predictors, built from a number of points sampled in the parameter space.

The application is evaluated on the sampled points, yielding the labels of the supervised setting given by the multiple objectives. Since our goal is to accurately estimate the points near the Pareto-optimal front, we use the current predictor to provide estimated objective values over the parameter space and thus estimate the Pareto front. For the next iteration, only parameter points near the predicted Pareto front are sampled and evaluated, and subsequently used to train new predictors using the entire collection of training points from the current and all previous iterations. This process is repeated over a number of iterations, forming the active learning loop. Our experiments in Section 4 indicate that this guided method of searching for highly informative parameter points yields superior predictors compared to a baseline that uses randomly sampled points alone. By iterating this process several times in the active learning loop, we are able to discover high-quality design configurations that lead to good performance outcomes.

Figure 3: Active learning with unknown feasibility constraints.

Data: Design space $\mathcal{X}$, warm-up sampling size $r$, maximum number of evaluations per active learning iteration, maximum number of active learning iterations.
Result: Pareto front $P$.
1. Warm-up: sample $r$ distinct configurations from $\mathcal{X}$ (random sampling, guided by the priors when available), evaluate $f_1$, $f_2$ and $c_1$ on them, and store them in the evaluated set $E$.
2. Train the random forest regressors $M_{f_1}$ and $M_{f_2}$ and the random forest classifier $M_{c_1}$ on $E$.
3. While the iteration and evaluation budgets are not exhausted:
   (a) Predict the objectives and the feasibility of the not-yet-evaluated configurations $\mathcal{X} \setminus E$ using $M_{f_1}$, $M_{f_2}$ and $M_{c_1}$.
   (b) Filter out the configurations predicted infeasible and compute the valid predicted Pareto front of the remaining candidates.
   (c) Select up to the per-iteration budget of configurations from this predicted front (adding uniformly random candidates if fewer are available), evaluate them, and update $E \leftarrow E \cup \{$newly evaluated configurations$\}$.
   (d) Retrain $M_{f_1}$, $M_{f_2}$ and $M_{c_1}$ on the enlarged set $E$.
4. Return the Pareto front $P$ computed on the feasible configurations in $E$.
Algorithm 1: Pseudo-code for HyperMapper 2.0 optimizing a two-objective ($f_1$ and $f_2$) application with one feasibility constraint ($c_1$). $\setminus$ denotes set difference, $\cup$ denotes set union.

Algorithm 1 shows the pseudo-code of the model-based search algorithm used in HyperMapper 2.0, and Figure 3 shows a corresponding graphical representation. The while loop in Algorithm 1 is the active learning loop, represented by the big loop in Figure 3. The user specifies a maximum number of active learning iterations. The regressor-training steps of Algorithm 1 fit the random forest regressors $M_{f_1}$ and $M_{f_2}$, which are the surrogate models used to predict the objectives given a parameter vector. We train separate models, one for each objective (two in Algorithm 1). The random forest regressor is represented by the box "Regressor" in Figure 3.

The classifier-training steps of Algorithm 1 fit a random forest classifier that predicts whether a parameter vector is feasible or infeasible. The classifier becomes increasingly accurate during active learning. Using a classifier to predict the infeasible parameter vectors has proven to be very effective, as later shown in Section 4.3. The random forest classifier is represented by the box "Classifier (Filter)" in Figure 3. The filtering step of Algorithm 1 removes the parameter vectors that are predicted infeasible before computing the Pareto front, thus dramatically reducing the number of function evaluations. This step is represented by the box "Compute Valid Predicted Pareto" in Figure 3.

For the sake of space, some details are not shown in Algorithm 1. For example, each active learning iteration is limited to a maximum number of evaluations. When the cardinality of the predicted Pareto front exceeds this per-iteration budget, samples are selected uniformly at random from that set for evaluation. When it is smaller than the budget, additional parameter vectors are drawn uniformly at random without repetition. This ensures exploration analogous to the $\epsilon$-greedy algorithm in the reinforcement learning literature [32], which is known to provide a balance in the exploration-exploitation trade-off.
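The per-iteration selection just described can be sketched as follows; the helper names and the budget value are hypothetical, and the snippet is not the exact HyperMapper 2.0 implementation:

```python
import numpy as np

def select_batch(candidates, predicted_pareto_idx, budget, rng):
    """Pick up to `budget` points for evaluation in one active learning step.

    candidates:           list of not-yet-evaluated parameter vectors.
    predicted_pareto_idx: indices of candidates on the valid predicted Pareto front.
    """
    pareto = list(predicted_pareto_idx)
    if len(pareto) >= budget:
        # Too many predicted-Pareto points: subsample uniformly at random.
        chosen = list(rng.choice(pareto, size=budget, replace=False))
    else:
        # Fewer than the budget: fill the remainder with uniformly random
        # candidates (epsilon-greedy style exploration), without repetition.
        rest = [i for i in range(len(candidates)) if i not in set(pareto)]
        extra = (list(rng.choice(rest,
                                 size=min(budget - len(pareto), len(rest)),
                                 replace=False))
                 if rest else [])
        chosen = pareto + extra
    return [candidates[i] for i in chosen]

rng = np.random.default_rng(0)
cands = [f"x{i}" for i in range(10)]
print(select_batch(cands, predicted_pareto_idx=[2, 5], budget=4, rng=rng))
```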

3.4 Pareto Wall

In Algorithm 1, previously evaluated samples are eliminated from the candidate set before computing the predicted Pareto front. This means that the newly predicted Pareto front never contains previously evaluated samples and, as a consequence, a new layer of the Pareto front is considered at each new iteration. We dub this multi-layered approach the Pareto Wall because we consider one Pareto front per active learning iteration, with the result that we are exploring several adjacent Pareto frontiers. Adjacent Pareto frontiers can be seen as a thick Pareto front, i.e., a Pareto Wall. The advantage of exploring the Pareto Wall in the active learning loop is that it minimizes the risk of relying on a surrogate model which is currently inaccurate. At each active learning step, we search previously unexplored samples which, by definition, must be predicted to be worse than the current approximated Pareto front. However, in cases where the predictor is not yet very accurate, some of these unexplored samples will often dominate the approximated Pareto front, leading to a better Pareto front and an improved model.

3.5 The HyperMapper 2.0 Framework

HyperMapper 2.0 is written in Python and makes use of widely available libraries, e.g., scikit-learn and pyDOE, and it will be released as open source soon. HyperMapper 2.0 is set up via a simple json file. A light interface with the third-party software being optimized is also necessary; templates for Python and Scala are provided in the repository. On multi-core machines, HyperMapper 2.0 can run the classifiers and regressors, as well as the computation of the Pareto front, in parallel to accelerate the active learning iterations.
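As an illustration only, an optimization scenario might be described along the following lines; the field names below are hypothetical and do not necessarily match the actual HyperMapper 2.0 json schema:

```python
import json

# Hypothetical configuration sketch; field names are illustrative,
# not the framework's documented schema.
config = {
    "application_name": "OuterProduct",
    "optimization_objectives": ["cycles", "logic_utilization"],
    "feasibility_flag": "fits_on_fpga",
    "design_of_experiment": {"number_of_samples": 1000},
    "optimization_iterations": 5,
    "input_parameters": {
        "tile_size_A": {"type": "ordinal",
                        "values": [16, 32, 64, 128],
                        "prior": "decay"},
        "outer_par":   {"type": "ordinal", "values": [1, 2, 4, 8]},
        "pipelining":  {"type": "categorical",
                        "values": ["pipelined", "sequential"]},
    },
}

# Written to disk and passed to the optimizer via its json interface.
with open("outerproduct_scenario.json", "w") as fh:
    json.dump(config, fh, indent=2)
```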

4 Evaluation

We first evaluate HyperMapper 2.0 applied to the recently proposed Spatial compiler [19] for designing application hardware accelerators on FPGAs.

4.1 The Spatial Programming Language

Figure 4: An overview of the phases of the compiler for the Spatial application accelerator design language. HyperMapper 2.0 interfaces at the beginning and ending of phase 2 to drive accelerator design space exploration and the selection of design parameter values.

Spatial [19] is a domain-specific language (DSL) and corresponding compiler for the design of application accelerators on reconfigurable architectures. The Spatial frontend is tailored to present programmers with a high level of abstraction for hardware design. Control in Spatial is expressed as nested, parallelizable loops, while data structures are allocated based on their placement in the target hardware’s memory hierarchy. The language also includes support for design parameters to express values which do not change the behavior of the application and which can be changed by the compiler. These parameters can be used to express loop tile sizes, memory sizes, loop unrolling factors, and the like.

As shown in Figure 4, the Spatial compiler lowers user programs into synthesizable Chisel [2] designs in three phases. In the first phase, it performs basic hardware optimizations and estimates a possible domain for each design parameter in the program. In the second phase, the compiler computes loop pipeline schedules and on-chip memory layouts for some given value for each parameter. It then estimates the amount of hardware resources and the runtime of the application. When targeting an FPGA, the compiler uses a device-specific model to estimate the amount of compute logic (LUTs), dedicated multipliers (DSPs), and on-chip scratchpads (BRAMs) required to instantiate the design. Runtime estimates are performed using similar device-specific models with average and worst case estimates computed for runtime-dependent values. Runtime is typically reported in clock cycles.

In the final phase of compilation, the Spatial compiler unrolls parallelized loops, retimes pipelines via register insertion, and performs on-chip memory layout and compute optimizations based on the analyses performed in the previous phase. Finally, the last pass generates a Chisel design which can be synthesized and run on the target FPGA.

Benchmark Variables Space Size
BlackScholes 4
K-Means 6
OuterProduct 5
DotProduct 5
GEMM 7
TPC-H Q6 5
GDA 9
Table 2: Spatial benchmarks and design space size.

4.2 HyperMapper in the Spatial Compiler

The collection of design parameters in a Spatial program, together with their respective domains, yields a hardware design space. The second phase of the compiler gives a way to estimate two cost metrics - performance and FPGA resource utilization - for a given design in this space. Existing work on Spatial has evaluated two methods for design space exploration. The first method heuristically prunes the design space and then performs randomized search with a fixed number of samples. The heuristics, first established by Spatial’s predecessor [20], help to eliminate obviously bad points within the design space prior to random search; the pruning is provided by expert FPGA developers. This is, in essence, a one-time hint to guide search. The second method evaluated the feasibility of using HyperMapper 1.0 [7] to drive exploration, concluding that the tool was promising but still required future development. In some cases, HyperMapper 1.0 performed poorly without a feasibility classifier as the search often focused on infeasible regions of the design space [19].

Spatial’s compiler includes hooks at the beginning and end of its second phase to interface with external tools for design space exploration. As shown in Figure 4, the compiler can query at the beginning of this phase for parameter values to evaluate. Similarly, the end of the second phase has hooks to output performance and resource estimates. HyperMapper 2.0 interfaces with these hooks to receive cost estimates, build a surrogate model, and drive search of the space.

In this work, we evaluate design space exploration when Spatial is targeting an Altera Stratix V FPGA with 48 GB of dedicated DRAM and a peak memory bandwidth of 76.8 GB/sec (an identical approach could be used for any FPGA target). We list the benchmarks we evaluate with HyperMapper 2.0 in Table 2. These benchmarks are a representative subset of those previously used to evaluate the Spatial compiler [19]. We use random search with heuristic pruning as our comparison baseline, as this is the most used and evaluated search technique with the compiler.

4.3 Feasibility Classifier Effectiveness

We address the question of the effectiveness of the feasibility classifier in the Spatial use case. Of all the hyperparameters defined for binary random forest classifiers [22], those that usually have the most impact on performance are: n_estimators, max_depth, max_features, and class_weight. We run an exhaustive search to fine-tune these binary random forest classifier hyperparameters and test the classifier's performance. The range of values we considered for these parameters is shown in Table 3. This defines a comprehensive space of 81 possible choices, small enough that it can be explored using exhaustive search. We label these hyperparameter configurations from 1 to 81 on the x-axis of Figure 7.

Name Range of Values
n_estimators [10, 100, 1000]
max_depth [None, 4, 8]
max_features ['auto', 0.5, 0.75]
class_weight [{T: 0.50, F: 0.50}, {T: 0.75, F: 0.25}, {T: 0.9, F: 0.1}]
Table 3: Random forest classifier hyperparameter tuning search space.

We perform 5-fold cross-validation using the data collected by HyperMapper 2.0 as training data and report validation recall averaged over the 5 folds. The goal of this optimization procedure is for the binary classifier to maximize recall. We want to maximize recall, i.e., $\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$, because it is important not to throw away feasible points that are misclassified as infeasible and that could potentially be good design points. Precision, i.e., $\frac{\text{true positives}}{\text{true positives} + \text{false positives}}$, is less important, as there is a smaller cost associated with classifying an infeasible parameter vector as feasible; in this case the only downside is that some samples are wasted evaluating infeasible designs, which is not a major drawback.
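This tuning step can be reproduced in spirit with scikit-learn as sketched below; the data is a synthetic stand-in for the (parameter vector, feasibility label) pairs collected during search, and the grid mirrors Table 3:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the data collected by HyperMapper 2.0 during search;
# label 1 means "feasible".
X, y = make_classification(n_samples=500, n_features=6, weights=[0.4, 0.6],
                           random_state=0)

param_grid = {                          # mirrors Table 3
    "n_estimators": [10, 100, 1000],
    "max_depth": [None, 4, 8],
    # 'auto' in Table 3 corresponds to sqrt(n_features) in recent scikit-learn.
    "max_features": ["sqrt", 0.5, 0.75],
    "class_weight": [{1: 0.50, 0: 0.50}, {1: 0.75, 0: 0.25}, {1: 0.90, 0: 0.10}],
}

# 5-fold cross-validation over the 81 configurations, maximizing recall of the
# feasible class.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```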

(a) Recall at 0 active learning iterations: max mean 0.784, max median 0.826, max min 0.600.
(b) Recall at 50 active learning iterations: max mean 0.967, max median 0.984, max min 0.886.
Figure 7: Random forest feasibility binary classifier 5-fold cross-validation recall over all benchmarks. The first 25 hyperparameter configurations of the classifier are shown. “Max mean”, “Max median”, and “Max min” are the maximum across the mean, median, and minimum recall scores for all 7 benchmarks, respectively.

Figure 7 reports the recall of the random forest classifier across the 7 benchmarks and hyperparameter configurations. For the sake of space, we only report the first 25 configurations, but the trend persists across all configurations. Figure 7a shows the recall just after the warm-up sampling and before the first active learning iteration. Figure 7b shows the recall after 50 iterations of the active learning while loop in Algorithm 1, where each iteration evaluates 100 samples. The recall goes up during the active learning loop, implying that the feasibility constraint is being predicted more accurately over time. The tables in Figure 7 show this general trend, with the max mean improving from 0.784 to 0.967.

In Figure 7a the recall is low prior to the start of active learning. The configuration that scores best (the maximum of the minimum scores across the different configurations) has a minimum score of 0.6 over the 7 benchmarks. The configuration is: {'class_weight': {T: 0.75, F: 0.25}, 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 10}. The recall of this configuration ranges from a minimum of 0.6 for TPC-H Q6 to a maximum of 1.0 on BlackScholes, with mean and standard deviation of 0.735 and 0.15, respectively.

In Figure 7b the recall is high after 50 iterations of active learning. There are two configurations that score best, with a minimum score of 0.886 over the 7 benchmarks. The configurations are: {'class_weight': {T: 0.75, F: 0.25}, 'max_depth': None, 'max_features': 0.75, 'n_estimators': 10} and {'class_weight': {T: 0.9, F: 0.1}, 'max_depth': None, 'max_features': 0.75, 'n_estimators': 10}. In general, most of the configurations are very close in terms of recall, and the default random forest configuration scores high, perhaps suggesting that the random forest does not need a major tuning effort for these kinds of workloads. The recall of these configurations ranges from a minimum of 0.886 for DotProduct to a maximum of 1.0 on BlackScholes, with mean and standard deviation of 0.964 and 0.04, respectively.

Figure 8 shows the predicted Pareto fronts of GDA, the benchmark with the largest design space, with and without the binary classifier for feasibility constraints. In both cases, we use plain random sampling to warm-up the optimization with 1,000 samples followed by 5 iterations of active learning.

Figure 8: Effect of the binary constraint classifier on the GDA benchmark. HyperMapper 2.0, with feasibility classifier, is shown as black (feasible) and green (infeasible). The heuristic baseline is shown as blue (feasible) and red (infeasible). The final approximated Pareto fronts are shown as black (our approach) and blue (baseline) curves.

The red dots representing the invalid points for the case without feasibility constraints are spread farther from the corresponding Pareto frontier while the green dots for the case with constraints are close to the respective frontier. This happens because the non-constrained search focuses on seemingly promising but unrealistic points. The constrained search is focused in a region that is more conservative but feasible. The effect of the feasibility constraint is apparent in its improved Pareto front, which almost entirely dominates the approximated Pareto front resulting from unconstrained search.

4.4 Optimum vs. Approximated Pareto

We next take the smallest benchmarks, BlackScholes, DotProduct and OuterProduct, and run exhaustive search to compare the approximated Pareto front computed by HyperMapper 2.0 against the true optimal one. This can be achieved only for such small benchmarks, as exhaustive search is otherwise infeasible. Even on these small spaces, exhaustive search requires 6 to 12 hours when parallelized across 16 CPU cores. In our framework, we use random sampling to warm up the search, followed by 5 active learning iterations of about 500 samples in total.

The comparisons are summarized in Figure 9. The optimal Pareto front is very close to the approximated one provided by HyperMapper 2.0, showing our software's ability to recover the optimal Pareto front on these benchmarks. Recovering the Pareto optimum requires about the same number of samples as the prior Spatial design space exploration approach (pruning plus random sampling) for BlackScholes, and 66 times fewer samples for OuterProduct and DotProduct.

Figure 9: Optimum versus approximated Pareto front for the BlackScholes (left, y-axis in log scale), OuterProduct (center, y-axis in log scale) and DotProduct (right) benchmarks. The x-axis is compute logic, reported as a percentage of the total LUT capacity of the Stratix V FPGA. The y-axis is the total cycles taken to run the benchmark. The approximated Pareto front is computed by HyperMapper 2.0 and the real Pareto is computed by exhaustive search. The invalid (or infeasible) samples are samples that would not be possible to synthesize on the FPGA given the hardware constraints.

4.5 Hypervolume Indicator

We next show the hypervolume indicator (HVI) [12] for the whole set of Spatial benchmarks as a function of the initial number of warm-up samples (for the sake of space we omit the smallest benchmark, BlackScholes). For every benchmark, we run 5 repetitions of the experiments and report variability via a line plot with an 80% confidence interval. The HVI metric gives the area between the estimated Pareto frontier and the space's true Pareto front, and is the most common metric for comparing multi-objective algorithm performance. Since the true Pareto front is not always known, we use the accumulation of all experiments run on a given benchmark, across all repetitions and all approaches (e.g., baseline and HyperMapper 2.0), to compute our best approximation of the true Pareto front and use this in its place. In addition, since logic utilization and cycles have value ranges that differ by several orders of magnitude, we normalize the data by dividing each objective by its standard deviation before computing the HVI. This gives the same importance to the two objectives and avoids skewing the results towards the objective with higher raw values. We set the same number of samples for all the experiments to the default value used in the prior-work baseline. Based on advice from expert hardware developers, we modify the Spatial compiler to automatically generate the prior knowledge discussed in Section 3.1 based on design parameter types. For example, on-chip tile sizes have a “decay” prior because increasing memory size initially helps to improve DRAM bandwidth utilization but has diminishing returns after a certain point. This prior information is passed to HyperMapper 2.0 and is used to guide the warm-up random sampling. The baseline has no support for prior knowledge.
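For two minimization objectives, the HVI computation can be sketched as follows (assuming NumPy; the reference point and the fronts are illustrative and assumed already normalized by the per-objective standard deviation):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-objective (minimization) front w.r.t. ref."""
    pts = np.asarray(sorted(map(tuple, front)))      # sort by first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                               # only non-dominated steps add area
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

def hvi(approx_front, true_front, ref):
    """Hypervolume indicator: area between the true and approximated fronts."""
    return hypervolume_2d(true_front, ref) - hypervolume_2d(approx_front, ref)

# Illustrative normalized fronts (e.g., logic utilization and cycles).
true_f   = [(0.1, 0.9), (0.4, 0.5), (0.8, 0.2)]
approx_f = [(0.2, 0.9), (0.5, 0.6), (0.8, 0.3)]
print(hvi(approx_f, true_f, ref=(1.0, 1.0)))   # -> 0.10
```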

Figure 10 shows the two different approaches: HyperMapper 2.0 using a warm-up sampling phase with the use of the prior and then an active learning phase; Spatial’s previous design space exploration approach (the baseline).

Figure 10: Performance of HyperMapper 2.0 versus the Spatial baseline on GEMM, OuterProduct, K-Means, GDA, TPC-H Q6, and DotProduct. The y-axis reports the HVI metric and the x-axis the number of samples in thousands.

Table 4 quantitatively summarizes the results. We observe the general trend that HyperMapper 2.0 needs far fewer samples to achieve competitive performance compared to the baseline. Additionally, our framework's variance is generally small, as shown in Figure 10. The total number of samples used by HyperMapper 2.0 is 12,500 in all experiments, while the number of samples performed by the baseline varies as a function of the pruning strategy. The number of baseline samples for GEMM, TPC-H Q6, GDA, and DotProduct is 100,000, which corresponds to an 8x improvement in sampling efficiency, while OuterProduct and K-Means use 31,068 and 18,720 samples, corresponding to improvements of roughly 2.5x and 1.5x, respectively.

Benchmark HyperMapper 2.0 Spatial Baseline
BlackScholes
K-Means
OuterProduct
DotProduct
GEMM
TPC-H Q6
GDA
Table 4: Performance of HyperMapper 2.0 in terms of mean ± 80% confidence interval at the end of the optimization process. Note that our approach terminates using far fewer samples.

As a result, the autotuner is robust to randomness, and only a reasonable number of random samples is needed for the warm-up and active learning phases.

4.6 Understandability of the Results

Benchmark Parameter Logic Util. Cycles
BlackScholes Tile Size 0.003 0.569
BlackScholes OP 0.261 0.072
BlackScholes IP 0.735 0.303
BlackScholes Pipelining 0.001 0.056
OuterProduct Tile Size A 0.095 0.290
OuterProduct Tile Size B 0.075 0.323
OuterProduct OP 0.170 0.084
OuterProduct IP 0.321 0.248
OuterProduct Pipelining 0.340 0.055
Table 5: Parameter feature importance per benchmark and objective (Logic Util. and Cycles). Tile sizes are given for each data structure. OP and IP are the outer and inner loop parallelization factors, respectively. Pipelining determines whether the key compute loop in the benchmark is pipelined or executed sequentially. Scores closer to 1 mean that the parameter is more important for that objective; scores for a single objective sum to 1.

HyperMapper 2.0 can also be used by domain non-experts to better understand the domain they are trying to optimize. In particular, users can inspect feature importances to gain a better understanding of the impact of the various parameters on the design objectives. The feature importances for the BlackScholes and OuterProduct benchmarks are given in Table 5.
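For instance, with scikit-learn random forests (a sketch on toy data, not the exact HyperMapper 2.0 code), per-objective importances like those in Table 5 can be read from each trained regressor's feature_importances_ attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["tile_size", "OP", "IP", "pipelining"]          # illustrative parameters
X = rng.random((300, len(names)))
cycles = 1.0 / (0.1 + X[:, 2]) + 2.0 * (1.0 - X[:, 0])   # toy: IP and tile size matter
logic  = 3.0 * X[:, 2] + 1.0 * X[:, 1]                   # toy: IP and OP matter

for objective, y in [("cycles", cycles), ("logic_util", logic)]:
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    scores = dict(zip(names, model.feature_importances_.round(3)))
    print(objective, scores)   # importances for one objective sum to ~1
```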

In BlackScholes, innermost loop parallelization (IP) directly determines how fast a single tile of data can be processed. Consequently, as shown in Table 5, IP is highly related to both the design's logic utilization and its run-time (cycles). Since BlackScholes is bandwidth bound, changing DRAM utilization with tile sizes directly changes the run-time, but has no impact on the compute logic since larger memories do not require more LUTs. Outer loop parallelization (OP) also duplicates compute logic by making multiple copies of each inner loop but, as shown in Table 5, OP has less importance for run-time than IP.

Similarly, in OuterProduct, both tile sizes have roughly even importance on the number of execution cycles, while IP has roughly even importance for both logic utilization and cycles. Unlike BlackScholes, which includes a large amount of floating point compute, OuterProduct has relatively little computation, making the cost of outer loop pipelining relatively impactful on logic utilization but with little importance on cycles. In both cases, the Spatial compiler can take this information into account when determining whether to prioritize further optimizing the application for inner loop parallelization or outer loop pipelining.

5 Related Work

During the last two decades, several design space exploration techniques and frameworks have been used in a variety of different contexts ranging from embedded devices to compiler research to system integration. Table 1 provides a taxonomy of methodologies and software from both the computer systems and machine learning communities. HyperMapper has been inspired by a wide body of work in multiple sub-fields of these communities. The nature of computer systems workloads brings some important features to the design of HyperMapper 2.0 which are often missing in the machine learning community research on design space exploration tools.

In the systems community, a popular, state-of-the-art design-space exploration tool is OpenTuner [1]. This tool is based on direct search approaches (e.g., differential evolution and Nelder-Mead) and on a methodology based on the Area Under the Curve (AUC) and multi-armed bandit techniques to decide which search algorithm deserves a higher resource budget. OpenTuner differs from our work in a number of ways. First, our work supports multi-objective optimization. Second, our white-box model-based approach enables the user to understand the results while learning from them. Third, our approach is able to consider unknown feasibility constraints. Lastly, our framework has the ability to inject prior knowledge into the search. The first point in particular precludes a direct performance comparison of the two tools.

Our work is inspired by HyperMapper 1.0 [7, 21, 25, 19]. Bodin et al. [7] introduce HyperMapper 1.0 for autotuning of computer vision applications by considering the full software/hardware stack in the optimization process. Other prior work applies it to computer vision and robotics applications [21, 25]. There has also been a preliminary study applying HyperMapper to the Spatial programming language and compiler, as in our work [19]. However, HyperMapper 1.0 lacks some fundamental features, which makes it ineffective in the presence of applications with non-feasible designs and prior knowledge.

In [16] the authors use an active learning technique to build an accurate surrogate model by reducing the variance of an ensemble of fully connected neural network models. Our work is fundamentally different because we are not interested in building a perfect surrogate model; instead, we are interested in optimizing over the surrogate model (for multiple objectives). In our case, building a very accurate surrogate model over the entire space would be a waste of samples.

Recent work [9] uses decision trees to automatically tune discrete NVIDIA and SoC ARM GPUs. Siegmund et al. tackle the software configurability problem for binary [29] and for both binary and numeric options [28] using a performance-influence model based on linear regression. They optimize for execution time on several examples, exploring algorithmic and compiler spaces in isolation.

Machine learning (ML) techniques have also recently been employed in both architecture and compiler research. Khan et al. [18] employed predictive modeling for cross-program design space exploration in multi-core systems; the techniques developed managed to explore a large design space of chip multiprocessors running parallel applications with low prediction error. In [4], Balaprakash et al. introduce AutoMOMML, an end-to-end, ML-based framework to build predictive models for objectives such as performance and power. [3] presents the ab-dynaTree active learning parallel algorithm, which builds surrogate performance models for scientific kernels and workloads on single-core, multi-core and multi-node architectures. In [34] the authors propose the Pareto Active Learning (PAL) algorithm, which intelligently samples the design space to predict the Pareto-optimal set.

Our work is similar in nature to approaches adopted in the Bayesian optimization literature [27]. Examples of widely used mono-objective Bayesian DFO software are SMAC [15], Spearmint [30, 31] and the work on the tree-structured Parzen estimator (TPE) [6]. These mono-objective methodologies are based on random forests, Gaussian processes and TPEs, respectively, making the choice of learned models varied.

6 Conclusions and Future Work

HyperMapper 2.0 is inspired by the algorithm introduced in [7], later dubbed HyperMapper 1.0 [21], and by the philosophy behind OpenTuner [1] and SMAC [15]. We have introduced a new derivative-free optimization methodology and corresponding framework which guides the search using active learning. This framework, dubbed HyperMapper 2.0, is built for practical, user-friendly design space exploration in computer systems, including support for categorical and ordinal variables, design feasibility constraints, multi-objective optimization, and user input on variable priors. Additionally, HyperMapper 2.0 uses randomized decision forests to model the searched space. This model not only maps well to the discontinuous, non-linear spaces found in computer systems, but also gives a “white box” result which the end user can inspect to gain deeper understanding of the space.

We have presented the application of HyperMapper 2.0 as a compiler pass of the Spatial language and compiler for generating application accelerators on FPGAs. Our experiments show that, compared to the previously used heuristic random search, our framework finds similar or better approximations of the true Pareto frontier with significantly fewer samples, an 8x reduction for most of the benchmarks explored.

Future work on HyperMapper 2.0 will include the analysis and incorporation of other DFO strategies. In particular, the use of a fully Bayesian approach in the active learning loop would help to leverage the prior knowledge by computing a posterior distribution; in our current approach we only exploit the prior distribution during the initial warm-up sampling. Exploring additional warm-up methods from the design of experiments literature is another promising research avenue. In particular, the Latin hypercube sampling technique was recently adapted to work on categorical variables [33], making it suitable for computer systems workloads.

References

  • [1] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. Opentuner: An extensible framework for program autotuning. In Parallel Architecture and Compilation Techniques (PACT), 2014 23rd International Conference on, pages 303–315. IEEE, 2014.
  • [2] J. Bachrach, Huy Vo, B. Richards, Yunsup Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic. Chisel: Constructing hardware in a scala embedded language. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 1212–1221, June 2012.
  • [3] Prasanna Balaprakash, Robert B Gramacy, and Stefan M Wild. Active-learning-based surrogate models for empirical performance tuning. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
  • [4] Prasanna Balaprakash, Ananta Tiwari, Stefan M Wild, Laura Carrington, and Paul D Hovland. Automomml: Automatic multi-objective modeling with machine learning. In International Conference on High Performance Computing, pages 219–239. Springer, 2016.
  • [5] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • [6] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
  • [7] Bruno Bodin, Luigi Nardi, M Zeeshan Zia, Harry Wagstaff, Govind Sreekar Shenoy, Murali Emani, John Mawer, Christos Kotselidis, Andy Nisbet, Mikel Lujan, et al. Integrating algorithmic parameters into benchmarking and design space exploration in 3d scene understanding. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 57–69. ACM, 2016.
  • [8] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [9] Marco Cianfriglia, Flavio Vella, Cedric Nugteren, Anton Lokhmotov, and Grigori Fursin. A model-driven approach for a new generation of adaptive libraries. arXiv preprint arXiv:1806.07060, 2018.
  • [10] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. Siam, 2009.
  • [11] Antonio Criminisi, Jamie Shotton, Ender Konukoglu, et al. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
  • [12] Paul Feliot, Julien Bect, and Emmanuel Vazquez. A bayesian approach to constrained single-and multi-objective optimization. Journal of Global Optimization, 67(1-2):97–133, 2017.
  • [13] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.
  • [14] Michael A Gelbart, Jasper Snoek, and Ryan P Adams. Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607, 2014.
  • [15] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
  • [16] Engin Ïpek, Sally A McKee, Rich Caruana, Bronis R de Supinski, and Martin Schulz. Efficiently exploring architectural design spaces via predictive modeling, volume 41. ACM, 2006.
  • [17] Eunsuk Kang, Ethan Jackson, and Wolfram Schulte. An approach for effective design space exploration. In Monterey Workshop, pages 33–54. Springer, 2010.
  • [18] Salman Khan, Polychronis Xekalakis, John Cavazos, and Marcelo Cintra. Using predictivemodeling for cross-program design space exploration in multicore systems. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 327–338. IEEE Computer Society, 2007.
  • [19] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A Language and Compiler for Application Accelerators. In ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2018.
  • [20] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In International Symposium in Computer Architecture (ISCA), 2016.
  • [21] Luigi Nardi, Bruno Bodin, Sajad Saeedi, Emanuele Vespa, Andrew J Davison, and Paul HJ Kelly. Algorithmic performance-accuracy trade-off in 3d vision applications using hypermapper. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 1434–1443. IEEE, 2017.
  • [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [23] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
  • [24] Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.
  • [25] Sajad Saeedi, Luigi Nardi, Edward Johns, Bruno Bodin, Paul HJ Kelly, and Andrew J Davison. Application-oriented design space exploration for slam algorithms. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 5716–5723. IEEE, 2017.
  • [26] Thomas J Santner, Brian J Williams, and William I Notz. The design and analysis of computer experiments. Springer Science & Business Media, 2013.
  • [27] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • [28] Norbert Siegmund, Alexander Grebhahn, Sven Apel, and Christian Kästner. Performance-influence models for highly configurable systems. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 284–294. ACM, 2015.
  • [29] Norbert Siegmund, Sergiy S Kolesnikov, Christian Kästner, Sven Apel, Don Batory, Marko Rosenmüller, and Gunter Saake. Predicting performance via automated feature-interaction detection. In Proceedings of the 34th International Conference on Software Engineering, pages 167–177. IEEE Press, 2012.
  • [30] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • [31] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
  • [32] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.
  • [33] Laura P Swiler, Patricia D Hough, Peter Qian, Xu Xu, Curtis Storlie, and Herbert Lee. Surrogate models for mixed discrete-continuous variables. In Constraint Programming and Decision Making, pages 181–202. Springer, 2014.
  • [34] Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Active learning for multi-objective optimization. In International Conference on Machine Learning, pages 462–470, 2013.