# High-dimensional Black-box Optimization Under Uncertainty

###### Abstract

Limited informative data remains the primary challenge for optimization the expensive complex systems. Learning from limited data and finding the set of variables that optimizes an expected output arise practically in the design problems. In such situations, the underlying function is complex yet unknown, a large number of variables are involved though not all of them are important, and the interactions between the variables are significant. On the other hand, it is usually expensive to collect more data and the outcome is under uncertainty. Unfortunately, despite being real-world challenges, exiting works have not addressed these jointly.

We propose a new surrogate optimization approach in this article to tackle these challenges. We design a flexible, non-interpolating, and parsimonious surrogate model using a partitioning technique. The proposed model bends at near-optimal locations and identifies the peaks and valleys for optimization purposes. To discover new candidate points an exploration-exploitation Pareto method is implemented as a sampling strategy. Furthermore, we develop a smart replication approach based on hypothesis testing to overcome the uncertainties associated with the black-box outcome. The Smart-Replication approach identifies promising points to replicate rather than wasting evaluation on less informative data points. We conduct a comprehensive set of experiments on challenging global optimization test functions to evaluate the performance of our proposal.

###### Keywords:

Surrogate optimization Black-box functions Derivative-free Limited data Non-interpolating model Uncertainty∎

## 1 Introduction

Optimization is an art of getting the best solution among the feasible solutions for challenging problems. Challenging optimization problems appear in many areas of science, technology, and industry. This includes farms placement problem, self-tuned driving vehicles, green building design, vehicle design crash simulators, finite element design, and optimum molecule structure in drug and material discovery. All these are design problems where we are looking for the best set of design variables to optimize certain specifications.

These problems are complex systems of inputs and outputs with entirely unavailable information about the underlying behavior. The existence of all of these applications and their challenges resulted in a body of work called Black-Box Optimization (BBO). Simulators can be used to study such black-box systems and potentially optimize them. Indeed, a large number of variables are involved and there is substantial interaction between them. This results in a large combinatorial search space where each setting is a valid solution. The evaluation process though includes costly experiments due to the complexity of the systems, which can be either computer simulators such as energy simulation tools or actual experiments such as crash simulators. Achieving an optimal solution of a high-dimensional expensive black-box function within a limited number of function evaluations is a matter of concern.

The black-box optimization problem formulation is similar to conventional optimization problem with an objective function Equation 1 and constraints defining the feasible region Equation 2. In this research we consider the box-constrained, Equation 3, black-box systems.

(1) | ||||

(2) | ||||

(3) |

where is the black-box function, and the goal is to obtain a global optimal solution of in the feasible input space of , where .

Since is often computationally expensive to be embedded directly within a global optimization method, a more practical approach employs surrogate models that are computationally less expensive to evaluate. A surrogate model is a mathematical approximation of the relationship between input and output variables. Several types of statistical models including Kriging Krige (1951), Radial Basis Functions (RBF) Powell (1985), Regression Trees Breiman et al. (1984), Multivariate Adaptive Regression Splines (MARS) Friedman (1991), Artificial Neural Networks Papadrakakis et al. (1998), Support Vector Regression (SVR) Clarke et al. (2005), etc. are distinguished as surrogate models. Optimizing the cheap to evaluate surrogate models for black-box optimization is one of the existing well-known derivative-free techniques called surrogate optimization. Surrogate optimization requires careful selection of simulator runs to simultaneously improve the surrogate model and gain more information on the potential optimum in each iteration.

Historically, surrogate optimization methods assume that there is no uncertainty in the complicated black-box systems and the set of significant decision variables is known *a priori*. Both of these assumptions are often violated in real world, Picheny et al. (2013a); Pilla (2007).
This research introduces a new surrogate optimization paradigm to address two primary concerns:
1) How can we handle unimportant inputs (decision variables of the black-box function) in surrogate optimization?
2) How can we handle the uncertainties associated with the black-box functions in surrogate optimization?
To handle this paradigm shift, we develop a new surrogate optimization approach, Tree-Knot Multivariate Adaptive Regression Splines (TK-MARS) as well as a smart replication strategy.

We introduce the surrogate optimization algorithm and the related works in the literature in § 2 to highlight the gaps that are not properly investigated. § 3 describes TK-MARS and the proposed approach for deterministic black-box functions. In this section we also present the proposed replication approach for handling the uncertainties associated with the black-box systems. Finally, in § 4 we discuss the performance of the proposed deterministic and stochastic approaches.

## 2 Background

In this section, we provide some technical background on surrogate optimization, interpolating and non-interpolating surrogate models, as well as an exploration-exploitation sampling to be used in this research. We also describe the Classification and Regression Tree (CART) as we see it in our proposed surrogate model as a partitioning technique.

### 2.1 Surrogate Optimization

Algorithm 1 is the formal visualization of a generic surrogate optimization algorithm for black-box functions. Ideally, surrogate optimization attempts to find a global optimum of the surrogate model and evaluate it with the black-box function to get the actual objective value . It adds the newly evaluated point to the initial data set and iteratively minimizes the gap.

The stopping criteria can be either a maximum number of expensive evaluations of the black-box function or the expected improvement of the best sampled mean solution (BSMS). BSMS is the best known solution after iterations.

### 2.2 Interpolating vs. Non-interpolating Surrogate Models

Choosing an appropriate surrogate model, also known as metamodel, is extremely dependent on the performance of distinct methods under different circumstances. Surrogate models are classified as interpolating (kriging Krige (1951), RBF Powell (1985), etc.) or non-interpolating (polynomial regression models Neter et al. (1996), multivariate adaptive regression spline Friedman (1991), etc.). Interpolating models are the most common techniques applied in surrogate optimization literature that cannot handle the uncertainty inherently. On the other side, non-interpolating models approximate the data smoothly under uncertainty, Hwang and Martins (2018).Non-interpolating surrogate models do not necessarily traverse all the data points to capture the exact behavior of the function, unlike interpolating surrogate models. In terms of bias-variance trade-off, non-interpolating surrogate models may have higher bias and lower variance than interpolating surrogates. Therefore, in highly noisy systems, non-interpolating models are preferred to avoid oscillations of the fitted surface. Stochastic black-box systems are rarely addressed in surrogate optimization literature as we discussed before. In Figure 1, we show how an interpolating model, Figure 1(b), and a non-interpolating model, Figure 1(c), perform for a simple function with noise and recognize the true underlying behavior under uncertainty. As we can see, to capture the single simulation with two or more different outputs, the interpolation method needs to traverse the points with a large slope. Consequently, interpolation-based surrogate models result in highly fluctuated approximations for data points include uncertainty.

### 2.3 Radial Basis Function

Radial Basis Function (RBF) is one of the most common interpolating surrogate models Powell (1985). Assuming distinct already evaluated points, , the RBF interpolant is of the form, where refers to the norm, and is a basis function. The most common types of basis functions are Multiquadric (MQ) , Gaussian (G) , Cubic (C) , and Thin Plate Splines (TPS) . The shape parameter and the number of points affect the accuracy and stability of RBF (a larger shape parameter makes the function flatter).

### 2.4 Mars

Multivariate Adaptive Regression Splines, MARS, was introduced by Friedman Friedman (1991). MARS is a non-parametric non-interpolating surrogate model. The structure of MARS model is based on basis functions. The MARS algorithm utilizes splines to construct a piecewise continuous function to model the dependent variable. The MARS approximation has the form:

(4) |

where is an -dimensional vector of input variables, is the intercept coefficient, is the maximum number of linearly independent basis functions, is the coefficient for the th basis function, and is a basis function that is either univariate or multivariate interaction terms. The interaction terms are presented in the form of:

(5) |

where is the number of interaction terms in the th basis function, is the th dependent variable corresponding to the th truncated linear function in the th basis function, and is the knot value corresponding to . The value is the direction that the truncated linear basis function can take.

Each data point can be an eligible knot location for MARS. Adding more data points increases the size of eligible knot locations. MARS interpolates the data points and loses its flexibility as the number of eligible knots increases. Interpolating a limited data set early when insufficient data is collected leads to a false assessment of function behavior and multiple local optima. For highly noisy functions, interpolating leads to high local variance in function estimation Koc and Iyigun (2014). This may cause difficulties for the optimization procedure. Yet, when a large number of eligible knots are selected, multicollinearity can occur between basis functions with knots that are close to each other.

Eligible knot selection techniques are developed to solve local variance and multicollinearity issues. One of the most common techniques selects evenly spaced knot locations within the range of the input data Martinez (2013); Cao et al. (2010); Huang and Shen (2004); Song and Yang (2010). Evenly-spaced knots may not capture true patterns in the data. Friedman Friedman (1991) proposes the minimum span (MinSpan) approach to minimize the local variability. In the MinSpan approach, for each independent variable, a local search around its current knot location is designed to reduce the number of eligible knot locations. Miyata Miyata and Shen (2005) presents a simulated annealing approach to choose eligible knot locations. Koc Koc and Iyigun (2014) develops a mapping strategy by transforming the original data into a network of nodes through a nonlinear mapping. The nodes in the mapped network are basically a reference for choosing the new candidate knots. The mapping hyperparameters have an impact on the accuracy and time efficiency of the model.

MARS is intended to be parsimonious, including a forward selection and backward elimination procedures. In forward selection, MARS adds the basis functions in pairs for different dimensions, which gives the maximum reduction in the sum of squares error. The process of adding continues until it reaches the maximum number of basis functions. By removing the least effective term at each step, the backward elimination process avoids overfitting. MARS basically has an embedded dimension reduction technique, which is very useful for black-box optimization where there is no previous understanding of input variables and their impact on the response. As a consequence, in the final model it maintains the most important variables. The advantages of MARS over other statistical models, such as linear regression models lie in its ability to handle curvature in high-dimensional space and produce easier-to-interpret models.

### 2.5 Cart

CART is a non-parametric decision tree and nonlinear predictive modeling method Breiman et al. (1984). It partitions the space into smaller sub-regions, recursively, so that the interactions are manageable. For regression predictive modeling problems, CART chooses binary splits by minimizing the sum of the squared error, Equation 6, across all data points that fall within each partition. The partitions discover the structure of the unknown function as we progress over time.

(6) |

where is the response value for each observations and is the average response value of the observations in terminal .

### 2.6 Exploration-Exploitation Pareto Approach (EEPA)

EEPA Dickson (2014) is an exploration-exploitation candidate selection approach. It creates a Pareto frontier on the estimated function value from the surrogate model, as one dimension, and the distance of the candidate points from the already evaluated points, as the second dimension. The first dimension exploits the interesting regions around the optimum function value area, and the second dimension explores the undiscovered regions.
The exploration metric is where , and the exploitation metric is the fitted function value . The first metric needs to be maximized, while the second should be minimized. The non-dominated Pareto set is given by , where is a fixed random pool of sample points.
There might be candidate points on the Pareto set that are close to each other. To find the winning candidate points from the Pareto set *maximin* exploration technique is applied.

### 2.7 Literature

Various surrogate models has been used in the black-optimization literature. Moore et al. (2000); Costas et al. (2014); Crino and Brown (2007); Dickson (2014) employ non-interpolating models such as MARS and polynomial regression. Krigigng Sacks et al. (1989); Jones et al. (1998); Simpson et al. (1998); Huang et al. (2006); Picheny et al. (2013a); Dong et al. (2015) is one the most traditional surrogate models used in the literature for black-box optimization. RBF, an interpolating technique, is the most common surrogate model that has been given attention lately Regis,Rommel (2011); Regis (2014a, b); Krityakierne et al. (2016); Datta and Regis (2016); Jakobsson et al. (2010); Müller and Shoemaker (2014); Müller (2016). Considering every variable as a significant factor to the output is the weakest part of the surrogate optimization literature. Crino Crino and Brown (2007) demonstrates that MARS is capable of screening and reducing variables using the parsimonious nature of MARS.

Crino and Brown (2007); Costas et al. (2014); Picheny et al. (2013a); Jakobsson et al. (2010); Huang et al. (2006) consider uncertainty components in the metamodeling. Kriging and RBF do not inherently handle uncertainty, and some modifications are required. Huang Huang et al. (2006) develops a Kriging-based surrogate optimization approach and applies it to low-dimensional test problems including a low-level noise. The cost of fitting Kriging increases by the number of samples as it leads to impractical higher-dimensional problems. The proposed method for highly fluctuated functions under higher noise levels requires further investigation. Picheny Picheny et al. (2013b) adds Gaussian noise with a fixed independent variance to the response and performs it on low-dimensional optimization test problems. The results show the relative poor modeling performance of Kriging. A large part of the variability that cannot be explained by the model is due to the observation noise during optimization. Jakabsson Jakobsson et al. (2010) and Picheny Picheny et al. (2013a) apply RBF and Kriging based surrogates that are performed only on low-dimensional test problems under low levels of uncertainty. MARS and regression require no revision to handle uncertainty. Costas Costas et al. (2014) shows MARS is preferable in real-world applications due to slope discontinuities and uncertainties, even though they do not study the effect of the noise.

A few other studies in surrogate optimization concentrate primarily on the development of an efficient optimizer. The main approach to recommend a new candidate is to solve an optimization sub-problem that is subject to exploration and exploitation constraints on the estimated objective function Moore et al. (2000); Regis and Shoemaker (2005); Knowles (2006); Müller and Woodbury (2017). Although this method synthesizes sample points, it is computationally costly to solve an optimization problem and the complexity increases as the dimension increases. A stochastic sampling approach has been applied in Müller (2016); Müller et al. (2015). In this strategy, sample points are generated by perturbing the variables of the best point found so far with constant perturbation probability. The Pareto-based candidate selection approach has been recently taken into consideration to select multiple sample points at a time Dickson (2014); Bischl et al. (2014); Krityakierne et al. (2016); Dong et al. (2015).

### 2.8 Contributions

In this research, we propose a new non-interpolating surrogate model that is specifically designed for surrogate optimization of black-box functions. We demonstrate that the modified surrogate model is capable of identifying significant variables and handling uncertainties. Furthermore, a centroid-based dynamic pool generation approach is developed to improve the candidate selection procedure. We propose an alternative approach to identify subregions of interest, which nicely surround promising points for subsequent function evaluations. We introduce uncertainty associated with the black-box functions and propose effective strategies to handle it in the surrogate optimization procedure.

## 3 Technical Description

In this section, we describe our technical contributions to surrogate optimization.

### 3.1 Tk-Mars

To address the two major issues in black-box optimization, a flexible, non-interpola- ting and parsimonious surrogate model is required in the context of surrogate optimization to identify significant variables and address the uncertainties associated with black-box function. In this section, we develop a modified version of MARS, which is capable of identifying significant variables and specifically designed for black-box optimization.

The idea behind the new eligible knot selection approach is to fit accurately around potential optimum points to speed up the convergence within fewer number of evaluations.In addition, MARS is sensitive to the order of the data points. The regression tree therefore helps bring in the near-optimal data points earlier, which accelerates the efficiency of MARS for optimization. The proposed approach, which is called Tree-Knot MARS or briefly TK-MARS, uses a partitioning technique to capture the function structure in the solution space and identify near-optimal knot locations in each partition.

Algorithm 2, presents TK-MARS framework. Consider the following high-level example of TK-MARS. Suppose we have a data set for , where is a normal noise term with a mean of , as in Figure 2(a). Consequently, . CART splits the existing data points into four partitions, as in Figure 2(b). Note that CART is capable of identifying the structure of the underlying function. Each partitions are the solid lines, and the horizontal lines show the estimated average function value in each partition. Next, we identify the centroid of each partition, and pick the closest points to the centroids as eligible knot locations. The centroids are the dashed lines as in Figure 2(b). We want to point out that the centroids are close to peaks and valleys. Focusing on the peaks and valleys where optima lie facilitates the optimization process. Given the new set of eligible knot locations, the breaking points of TK-MARS are the nearest points to the peaks (valleys) of the function as in Figure 2(c).

The representatives of each partition are considered as a reference point for eligible knot locations. CART partitions the data set on the response function structure. It splits more where the function structure changes to minimize the least square error. CART does not split where there are no significant changes in the response. Hence, there are more partitions in highly structured regions. As a consequence, TK-MARS defines more knots in highly structured regions. The peaks and valleys of the black-box function are often in the middle of the space defined by the tree logic at the terminal nodes rather than the edges since CART splits where the response values significantly change. Hence, the centroids as calculated by Equation 7, are appropriate locations for the eligible knots

(7) |

where is the centroid of terminal for dimension , and is the th dimension of observations in terminal .

Knots in MARS are in one-dimensional space. A transformation is therefore needed from multi-dimensional to one-dimensional space. Equation 8 represents a transformation function. The eligible knot location for each dimension is the closest point to the terminal centroid in the same dimension.

(8) |

where is the nearest point to the centroid in terminal node for dimension . returns the smallest index if there are ties.

### 3.2 Dynamic Pool Generation

We apply EEPA, exploration-exploitation Pareto approach Dickson (2014) in this research and modify it to incorporate TK-MARS and a standardized Cosine metric. To discover the new candidate points, we employ EEPA to add multiple candidate points at a time and ignore the set of dominated points by pruning regions that are worse in both exploration and exploitation. Dickson Dickson (2014) shows that EEPA outperforms pure exploration and exploitation methods. However, EEPA and the quality of BSMS found using EEPA depends on the fixed random data set. The space can not be perfectly covered by generating a fixed set of random points. As a result, it needs a large pool, which makes the surrogate optimization procedure inefficient, since more exploration is needed. Further, in the presence of noise, even a large pool may not be sufficient. As a consequence, the candidate selection process can be facilitated by a dynamic pool generation that synthesizes some new sample points as required. Using the information about the function structure in each iteration helps with the gradual evolution of the pool. Regarding the first step of TK-MARS, fitting CART on the already evaluated points, we have the set of centroids (representing the partitions). The quality of these centroids from the TK-MARS knot selection procedure is discussed in § 3.1, as a result of the regression tree logic. Since the centroids of the terminal nodes are close to peaks and valleys of the underlying black-box function, they can be good candidates to be added to the fixed random pool generated. Note that the centroids are not necessarily in the pool, even though they are close to the promising points. In the new approach at the end of each iteration, the solution space is defined as where is the fixed random pool, and is the set of centroids of the terminal nodes, . Over time, the centroids effectively move in the solution space. We add more points near BSMS using the tree logic, instead of using a large fixed random pool representing the whole feasible region. As a result, the candidate points are effectively near potential optima.

### 3.3 Standardized Cosine

As we discussed, to identify the non-dominated set of solutions, EEPA defines a distance metric for exploration and uses estimated objective value for exploitation. Different distance metrics can be applied in EEPA, such as Euclidean and Cosine similarity. Cosine similarity calculates the Cosine of the angle between two vectors by using inner product space . Cosine provides information about alternative directions. To find the perfect right angle of a direction, the Cosine measure needs to be zero.

A perpendicular direction of the points to the already evaluated points provides more information on the unexplored areas. The main challenge is that it is not possible to calculate the Cosine similarity metric for points but vectors. We have a set of already evaluated points and a random pool from which the next candidate points are selected. To calculate the Cosine distance of the potential candidates from the already evaluated points, we need to consider the points as origin-starting vectors, and then calculate the angle between the vectors. Hypothetically, the origin can be anywhere in the space. However, to find the right angle, we consider the center of the feasible domain as the origin. As a result, a standardization procedure is required to calculate the Cosine metric as given by Algorithm 3.

Algorithm 3, calculates the Cosine angle between the vectors originated from the center of the space. The standardized Cosine discovers the candidates that are uniformly scattered in the domain around the middle of the space, therefore.

### 3.4 Smart-Replication

In some expensive computer simulations, there is inherent noise associated with the system. Most of the existing methods cannot handle uncertainty Davis and Ierapetritou (2007); Grill et al. (2015); Moore et al. (2000). We relax the noise-free assumption in this section, hence, the response obtained from simulation contains uncertainty. where is the output of the black-box system that contains uncertainty. This implies the simulation output is a stochastic function that differs from the true value of the function. Each time a point is evaluated, a different response comes out. Therefore, for a single simulation, we have different outputs. The goal is still to minimize the true objective function Equation 1. We assume that the uncertainty or noise is independent identically distributed with mean 0 and variance of , , following an unknown distribution.

Since the data points include uncertainties associated with the black-box function, a single evaluation may not be reliable and does not show the true output value of the black-box function. Consequently, the deterministic approach may not be adequate to handle the uncertainty effect and, therefore, mislead the optimization process. Performing more than one evaluation for the same point provides more accurate data about the real objective value. An alternative approach is to replicate data points to reduce the uncertainty. Suppose the number of replication is . Let us define as . In other words, is the number of equal pair of sample points from the set of already evaluated points, . Let be the sample mean of after replications. That is, .

(9) |

in Equation 9 follows based on the central limit theorem. is the uncertainty component of . The second component in Equation 9 corresponds to the mean of replications’ uncertainties for a sample point. is an indicator function that counts the number of replications for a single point.

We propose a distinct replication strategy, which replicates not all of the candidate points but only the promising points for optimization. Following this strategy, in this section, we propose an approach for replication, using a hypothesis testing based on confidence intervals. This approach replicates only the promising points closer to the best sampled mean solution, BSMS. Smart-Replication automatically chooses the number of replications for each selected candidate point following the hypothesis rule.

Assume that the current BSMS is . For each selected candidate point , Smart-Replication considers the following null hypothesis and stops replicating , if it can reject the null hypothesis.

H: if , where , then .
H: if , where , then .
Decision rule: if , Reject .
Conclusion: Even though the objective value of BSMS, , is smaller than the objective value of , , the true objective value of , , is smaller than the true objective value of , . Hence, we are confident that more replications are required for .

For each point that is evaluated times, , we calculate the standard deviation and the confidence interval with a significance level of : , where . is the noisy evaluation.

Now, we calculate the confidence interval, . To select the promising points, we prune the candidate points based on the lower bound of the confidence interval . If of a point is less than the threshold, i.e. , it is selected as a promising point to be replicated. The promising points are the candidate points that are close to the current BSMS, i.e., below the threshold, Figure 3. Since the number of function evaluations is limited, and the experiments are costly, we consider a maximum number of replications for the promising points .

The smart approach behaves similar to the deterministic, No-Replication, approach when the uncertainty level is low and similar to the Fixed-Replication approach when the uncertainty level is high. For a higher noise level, the variance is larger. It makes the condition of ineffective, and, as a result, more replications are required. However, it can replicate points up to the maximum number of replications , which needs to be large and determined in advance. Consequently, Smart-Replication is efficient in both ways, i.e. low/high noise. That is because it recognizes if the system is deterministic or stochastic, and decides about replications automatically.

### 3.5 Performance Metric

Since the function evaluations are expensive, the number of function evaluations before finding an optimal solution is a good metric for measuring the performance of an algorithm. However, obtaining a global optimum cannot be guaranteed. A metric that can quantify the convergence pattern of an algorithm is more plausible in the context of black-box optimization.

First, we define the area under the curve (AUC) metric in Definition 1.

###### Definition 1 (Area Under the Curve – AUC)

Given the best sampled mean solution (BSMS) found by an algorithm after each function evaluation. Let be the true objective value of BSMS, where is the BSMS after function evaluations. Let and . Using the normalized objective value of BSMS:

the AUC is

(10) |

AUC comprises both the quality of BSMS, as well as the time that the algorithm finds it. It works well for the deterministic cases, as the objective value of the BSMS monotonically decreases over time. However, this may not hold for stochastic black-box systems, as the curve consists of jumps. In these cases, the stability of BSMS represents the robustness. Even though the jumps in early iterations are tolerable, we expect a stable behavior towards the end of the evaluation. Therefore, next, we propose a metric for stochastic black-box systems that is able to monitor the stability and jump locations across the number of function evaluations. A good algorithm is the one with fewer and shorter jumps, in which after a reasonable number of function evaluations, the results are reliable. To consider the instability in the metric, we consider the maximum objective value of BSMS obtained among all BSMS found forward. Subsequently, we define the maximal true function area under the curve (MTFAUC) as following:

###### Definition 2 (Maximal True Function Area Under the Curve – MTFAUC)

Given the best sampled mean solution (BSMS) found by an algorithm after each function evaluation. Let be the true objective value of BSMS, where is the BSMS after function evaluations. Let and . Using the normalized objective value of BSMS

(Equation 9) is the sample mean of the observed objective value after replications. Let be the BSMS after number of function evaluations and be its estimated objective value. Consider as the maximum objective value of BSMS obtained among all BSMS found forward:

(11) |

The MTFAUC is

(12) |

MTFAUC comprises the instability, by using to penalize the jumps. Hence, MTFAUC deteriorates if the convergence pattern highly fluctuates. A more stable algorithm towards uncertainty has a lower MTFAUC value.

## 4 Experimental Results

To evaluate the performance of the proposed approaches we consider the global optimization test functions with known optimal points that have proven challenging for black-box optimization algorithms Surjanovic and Bingham ; Hansen et al. (2009). In this research, we first assume that black-box functions is deterministic and later relax the assumption. Table 1 describes the characteristics of the selected test functions.

Function | Formulation | Range | Global Min |
---|---|---|---|

Rosenbrock | [-5,10] | ||

Rastrigin | [-5.12,5.12] | ||

Levy | [-10,10] | ||

### 4.1 Tk-Mars

First, we evaluate the new knot preprocessing approach, TK, versus the evenly-spaced eligible knot selection, Friedman (1991), in the context of surrogate optimization. The number of eligible knot locations, , is usually preset to a static value in MARS. We study the results for , and . It is more appropriate to have MARS and TK-MARS dynamically set based on the number of terminal nodes of CART,, to ensure a fair comparison between the two approaches. Yet the eligible knot location is distinct in each approach. As a consequence, we demonstrate that TK-MARS eligible knot locations promise near optimal sites.

The results are provided in Figures 4(a)-(d). The initial set of 35 points with independent variables designed with LHD for this experiment. As one can see, the tree-based knots approach improves the optimization process, significantly.

We are now presenting an extensive analysis of the performance of TK-MARS in a new class of test functions that includes unimportant variables. We believe that the existing test functions that are designed for global optimization differ from the less-symmetrical and less-structured real-world applications. In our work the function evaluation would be based on a fraction of the input variables, , not all of them, to show that the proposed method can identify the significant variables. Figures 5 correspond to average MTFAUC of 30 different runs for each test functions in four different .

Note that as the fraction of important variables decreases, MTFAUC increases. It is more difficult for the algorithm to recognize significant variables when the output is evaluated using fewer variables.

Besides, a component of uncertainty is added to the real objective value for each function evaluation to represent stochastic black-box systems. In this research, the experiments are provided with a Gaussian noise, , where , and is the noise percentage level.

An Orthogonal Array, Bose and Bush (1952); Plackett and Burman (1946), is considered to design an efficient set of experiments on a subset of combinations of multiple parameters at multiple levels. Table 2 represents the problem setting and algorithm parameters for an extensive set of experiments.

Problem Parameters | levels |
---|---|

Test function | Rosenbrock, |

Rastrigin, Levy | |

Dimension | 10, 20, 30 |

Fraction of import. vars. | 0.25, 0.50, 0.75, 1 |

Noise level (%) | 0, 5, 10, 25 |

Algorithm Parameters | levels |

Initial pool size | , |

DOE method | LHD, sobol |

EEPA distance | Euclidean, Cosine |

EEPA # candidates | 3, 6 |

Replication type | fixed_rep, smart_rep |

Replication # | 5, 10 |

Model | RBF, TK-MARS |

We perform an ANOVA on an OA-designed experiments to test the hypothesis of variations caused by different factors. In ANOVA, the reference population is the first level of each factor. From the results, Table 3, it can be observed that statistically significant parameters are the fraction of important factors () and the noise level (). This indicates that the deterministic approach can not handle the noise effect and has an important impact on the algorithm efficiency (MTFAUC).

Estimate | Std. Error | t value | ||||
---|---|---|---|---|---|---|

(Intercept) | 0.0213 | 0.06535 | 0.326 | 0.745659 | ||

fiv=0.75 | 0.12197 | 0.04773 | 2.556 | 0.013294 | * | |

fiv=0.50 | 0.22659 | 0.04773 | 4.748 | 1.43E-05 | *** | |

fiv=0.25 | 0.30299 | 0.04773 | 6.348 | 3.84E-08 | *** | |

noise level=10% | 0.14558 | 0.04133 | 3.522 | 0.000851 | *** | |

noise level=25% | 0.23302 | 0.04133 | 5.638 | 5.59E-07 | *** | |

Rastrigin | 0.33031 | 0.04133 | 7.991 | 7.10E-11 | *** | |

Levy | -0.02759 | 0.04133 | -0.667 | 0.507163 | ||

dimension=20 | -0.06038 | 0.04133 | -1.461 | 0.149573 | ||

dimension=30 | 0.01005 | 0.04133 | 0.243 | 0.808709 | ||

poolSize=2(d+1) | 0.01179 | 0.03375 | 0.35 | 0.728 | ||

DOE=Sobol | -0.02552 | 0.03375 | -0.756 | 0.452732 | ||

EEPA distance=Cosine | 0.01675 | 0.03375 | 0.496 | 0.62165 | ||

EEPA number of candidates=6 | 0.0452 | 0.03375 | 1.339 | 0.185774 | ||

model=TK-MARS | -0.01327 | 0.03375 | -0.393 | 0.695633 | ||

Signif. Codes: | 0 ‘***’ | 0.001 ‘**’ | 0.01 ‘*’ | 0.05 ‘.’ | 0.1 ‘ ’ | 1 |

### 4.2 Smart-Replication

Next, we consider the replication approache to mitigate the effect of noise. An analysis of variance is performed with MTFAUC as the response variable on the set of parameters shown in Table 2. We consider two and replication types with two distinct replication numbers and . The replication number for is the fixed number of replications for each candidate points, however, for it is the maximum number of replications.

From Table 4, note that the noise parameter is not significant anymore, which indicates that replication helps to cancel out the noise effect. TK-MARS is margina- lly significant. The Smart-Replication strategy is statistically better than Fixed-Replication, which is consistent with our expectation. Using smart replication saves function evaluations and and in a noisy scenario decreases the MTFAUC.

Estimate | Std. Error | t value | ||||

(Intercept) | 0.250806 | 0.086891 | 2.886 | 0.005626 | ** | |
---|---|---|---|---|---|---|

fiv=0.75 | 0.07965 | 0.056382 | 1.413 | 0.163596 | ||

fiv=0.50 | 0.207822 | 0.056382 | 3.686 | 0.000537 | *** | |

fiv=0.25 | 0.132775 | 0.056382 | 2.355 | 0.022263 | * | |

model=TK-MARS | 0.113446 | 0.048828 | 2.323 | 0.024028 | * | |

noise level=10% | 0.004939 | 0.048828 | 0.101 | 0.91982 | ||

noise level=25% | 0.009702 | 0.048828 | 0.199 | 0.843259 | ||

smart_rep | -0.09262 | 0.039868 | -2.323 | 0.02404 | * | |

Rastrigin | 0.344011 | 0.048828 | 7.045 | 3.81E-09 | *** | |

Levy | 0.045359 | 0.048828 | 0.929 | 0.357128 | ||

dimension=20 | 0.073677 | 0.048828 | 1.509 | 0.137262 | ||

dimension=30 | 0.051091 | 0.048828 | 1.046 | 0.300151 | ||

Replication=10 | 0.146338 | 0.048828 | 2.997 | 0.004142 | ** | |

poolSize=2(d+1) | 0.016954 | 0.039868 | 0.425 | 0.67238 | ||

DOE=Sobol | -0.00669 | 0.039868 | -0.168 | 0.867332 | ||

EEPA distance=Cosine | -0.01711 | 0.039868 | -0.429 | 0.669465 | ||

EEPA number of candidates=6 | -0.02138 | 0.039868 | -0.536 | 0.594059 | ||

Signif. Codes: | 0 ‘***’ | 0.001 ‘**’ | 0.01 ‘*’ | 0.05 ‘.’ | 0.1 ‘ ’ | 1 |

Looking at the ANOVA Tables 3 and 4, we drop the unimportant parameters to explore MTFAUC variation caused by important parameters and fixed them to the significant level identified by ANOVA. Dimension=30, initial pool size=d+1, DOE method=LHD, EEPA distance=Euclidean, and EEPA number of candidate points is fixed to 3. We perform two separate full factorial designs considering the significant parameters; one for TK-MARS and one for RBF. The fraction of important variables is statistically significant in almost all of the analysis. For simplicity and focusing on noise effect, we continue to study the replication strategies with the fixed , the most significant level.

Figures 6(a)-(c) show the average performance of deterministic, fixed replication, and smart replication for 30 random pools. The plots indicate that No-Replication outperforms the Replication approaches when we use TK-MARS. Note that, Smart-Replication outperforms Fixed-Replication. The MTFAUC is improving as expected as the level of noise increases. But let us verify the robustness and the quality of the solution found through each approach.

Figure 7, shows the box-plot of the MTFAUC values for 30 different runs at different noise levels with different methods. Although from Figure 6(a) we observed that No-Replication outperforms Replication approaches, looking at Figure 7(c) () and Figure 7(e) (), we confirm that the variance of MTFAUC for replication strategies is more reasonable than , Figure 7(a). As in Figures 7(d) and (e), in lower levels of noise, Smart-Replication is competitive with No-Replication approach. This indicates the robustness of the Smart-Replication approach to randomness. Although is robust to different noise levels and there is no significant difference in the means across the noise levels, the overall average MTFAUC is larger in lower noise levels compared to and . This is because the Fixed-Replication strategy makes unnecessary functional evaluations.

From Figure 8, we see the variance of the objective values of BSMS after 1000 function evaluations for 30 different runs at each noise level. As shown in the Figure, in the no-noise case, is competitive with and they have the shortest boxes, comparatively. and are robust to different noise levels since the means are close compared to . In the highest level of noise, has the shortest box, indicating that it is more robust to randomness and is more reliable for the Rosenbrock function.

Looking into the results on the Rastrigin function, Figure 6(b), we can see that there is a small difference between the different methods. The reason can be justified by the highly fluctuating behavior of the Rastrigin function with several local optima, and as the noise level added to the function increases, the optimum is harder to obtain. The robustness and quality of the final solution found has been analyzed through the box-plots for Rastrigin function. , which is the full replication approach, has the shortest box and is the most robust approach to randomness, however, it has a higher average across different noise levels compare to . The same pattern was observed for the quality of the final BSMS. finds better BSMS at the highest noise level with relative robustness.

For the Levy function, No-Replication outperforms the replication approaches in almost every case (Figure 6(c)). Although, we can see that Smart-Replication is doing well too. had a comparatively short box at a higher level of noise for the average of different runs. No-Replication outperforms in terms of the quality of final BSMS after 1000 evaluation for the Levy function, however, the performance of the is very competitive. The boxplots for Rastrigin and Levy can be found in Appendix, § 6. Overall, we observe that No-Replication is not robust to the noise effect although it has better average value.

Next, we analyze the average performance of No-replication, Fixed-replication, and Smart-replication using RBF for 30 random pools across different noise levels.

The plots in Figure 9 confirm that in the deterministic situation, No-Replication, is the best option. However, as the noise level increases, outperforms . We can see that is more competitive when we use the interpolating model, RBF for the Rosenbrock function. Looking into the Rastrigin function, Figure 9(b), we can see that there is a small difference between different methods, especially at the highest level of uncertainty. No-replication slightly outperforms all the other methods, since exploration overcomes the replication in the case of having a complicated function like Rastrigin. It can be seen that Replication approaches with 5 replications are slightly competitive with No-Replication at a very high level of noise.

Next, we analyze the robustness of algorithms and the quality of the final BSMS, using the boxplots. One can verify that is the most robust approach to the uncertainty level and randomness since it has the shortest box as the noise level increases, Figure 9. is competitive under a higher level of uncertainty. is the best option in the no-noise case since replication wastes function evaluations when there is no uncertainty.

From Figures 11, observe that the variance of BSMS after 1000 function evaluations for 30 different runs at each considered noise level. As is observed, comparatively outperforms in terms of the BSMS at the end. and are robust to different noise levels in terms of finding the BSMS. For Rastrigin function we observed that has the highest robustness toward randomness, however, it had larger BSMS, comparatively. finds better BSMS, on average when the uncertainty level is low, as we expected. In this case, finds better BSMS at the highest noise level. This confirms that for a highly fluctuated function like Rastrigin we need more exploration as well as replication. For Levy function outperforms other methods under the highest level of uncertainty. is competitive, as well. We verified that has the shortest box overall, except for the no-noise case, which No-replication has the lower MTFAUC on average. In terms of robustness to randomness and the different level of noise, and are competitive. However, the MTFAUC is lower for .

## 5 Conclusion and Future Work

TK-MARS outperforms RBF, the leading surrogate optimization technique in the literature, with regard to the evaluation described in § 4. TK-MARS shows excellent improvement in the handling of insignificant inputs over RBF. Smart-Replication is outperforms No-Replication in terms of robustness to the uncertainty level and the quality of the final optimum solution discovered.

TK-MARS is a regression-based model that does not traverse all the points and forms a surface that minimizes the error based on the distance between the real values and the fitted line. As a result, regression-based models help to have a better approximation for data sets with noisy outcomes. On the other hand Smart-Replication can adapt itself based on the uncertainty level. Consequently, its behavior is similar to No-Replication when the uncertainty level is low and is similar to Fixed-Replication when the uncertainty level is high. Further, Smart-Replication does not waste evaluations on the unpromising points and replicate the near-optima points. Overall, Smart-Replication with TK-MARS is the most robust and reliable approach for high-dimensional black-box optimization undet uncertainty.

## References

- MOI-mbo: multiobjective infill for parallel model-based optimization. In International Conference on Learning and Intelligent Optimization, pp. 173–186. Cited by: §2.7.
- Orthogonal arrays of strength two and three. The Annals of Mathematical Statistics, pp. 508–524. Cited by: §4.1.
- Classification and regression trees. CRC press. Cited by: §1, §2.5.
- Penalized spline estimation for functional coefficient regression models. Computational statistics & data analysis 54 (4), pp. 891–905. Cited by: §2.4.
- Analysis of support vector regression for approximation of complex engineering analyses. Journal of mechanical design 127 (6), pp. 1077–1087. Cited by: §1.
- A multi-objective surrogate-based optimization of the crashworthiness of a hybrid impact absorber. International Journal of Mechanical Sciences 88, pp. 46–54. Cited by: §2.7, §2.7.
- Global optimization with multivariate adaptive regression splines. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37 (2), pp. 333–340. Cited by: §2.7, §2.7.
- A surrogate-assisted evolution strategy for constrained multi-objective optimization. Expert Systems with Applications 57, pp. 270–284. Cited by: §2.7.
- Adaptive optimisation of noisy black-box functions inherent in microscopic models. Computers & chemical engineering 31 (5), pp. 466–476. Cited by: §3.4.
- An exploration and exploitation pareto approach to surrogate optimization. The University of Texas at Arlington. Cited by: §2.6, §2.7, §2.7, §3.2.
- A kind of balance between exploitation and exploration on kriging for global optimization of expensive functions. Journal of Mechanical Science and Technology 29 (5), pp. 2121–2133. Cited by: §2.7, §2.7.
- Multivariate adaptive regression splines. The annals of statistics, pp. 1–67. Cited by: §1, §2.2, §2.4, §2.4, §4.1.
- Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, pp. 667–675. Cited by: §3.4.
- Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions. Ph.D. Thesis, INRIA. Cited by: §4.
- Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of global optimization 34 (3), pp. 441–466. Cited by: §2.7, §2.7.
- Functional coefficient regression models for non-linear time series: a polynomial spline approach. Scandinavian journal of statistics 31 (4), pp. 515–534. Cited by: §2.4.
- A fast-prediction surrogate model for large datasets. Aerospace Science and Technology 75, pp. 74–87. Cited by: §2.2.
- A method for simulation based optimization using radial basis functions. Optimization and Engineering 11 (4), pp. 501–532. Cited by: §2.7, §2.7.
- Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §2.7.
- ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation 10 (1), pp. 50–66. Cited by: §2.7.
- Restructuring forward step of mars algorithm using a new knot selection procedure based on a mapping approach. Journal of Global Optimization 60 (1), pp. 79–102. Cited by: §2.4, §2.4.
- A statistical approach to some mine valuation and allied problems on the witwatersrand: by dg krige. Ph.D. Thesis, University of the Witwatersrand. Cited by: §1, §2.2.
- SOP: parallel surrogate global optimization with pareto center selection for computationally expensive single objective problems. Journal of Global Optimization 66 (3), pp. 417–437. Cited by: §2.7, §2.7.
- Variants of multivariate adaptive regression splines (mars): convex vs. non-convex, piecewise-linear vs. smooth and sequential algorithms. Ph.D. Thesis, Ph. D. thesis, The University of Texas at Arlington. Cited by: §2.4.
- Free-knot splines and adaptive knot selection. Journal of the Japan Statistical Society 35 (2), pp. 303–324. Cited by: §2.4.
- Q2: memory-based active learning for optimizing noisy continuous functions. In Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, Vol. 4, pp. 4095–4102. Cited by: §2.7, §2.7, §3.4.
- CH 4 parameter estimation in clm4. 5bgc using surrogate global optimization. Geoscientific Model Development 8 (10), pp. 3285–3310. Cited by: §2.7.
- Influence of ensemble surrogate models and sampling strategy on the solution quality of algorithms for computationally expensive black-box global optimization problems. Journal of Global Optimization 60 (2), pp. 123–144. Cited by: §2.7.
- GOSAC: global optimization with surrogate approximation of constraints. Journal of Global Optimization 69 (1), pp. 117–136. Cited by: §2.7.
- MISO: mixed-integer surrogate optimization framework. Optimization and Engineering 17 (1), pp. 177–203. Cited by: §2.7, §2.7.
- Applied linear statistical models. Vol. 4, Irwin Chicago. Cited by: §2.2.
- Structural optimization using evolution strategies and neural networks. Computer methods in applied mechanics and engineering 156 (1-4), pp. 309–333. Cited by: §1.
- Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 (1), pp. 2–13. Cited by: §1, §2.7, §2.7.
- A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization 48 (3), pp. 607–626. Cited by: §2.7.
- Robust airline fleet assignment. Cited by: §1.
- The design of optimum multifactorial experiments. Biometrika 33 (4), pp. 305–325. Cited by: §4.1.
- Radial basis funcitionn for multivariable interpolation: a review. In IMA Conference on Algorithms for the Approximation of Functions ans Data, pp. 143–167. Cited by: §1, §2.2, §2.3.
- Constrained global optimization of expensive black box functions using radial basis functions. Journal of Global optimization 31 (1), pp. 153–171. Cited by: §2.7.
- Constrained optimization by radial basis function interpolation for high-dimensional expensive black-box problems with infeasible initial points. Engineering Optimization 46 (2), pp. 218–243. Cited by: §2.7.
- Evolutionary programming for high-dimensional constrained expensive black-box optimization using radial basis functions. IEEE Transactions on Evolutionary Computation 18 (3), pp. 326–347. Cited by: §2.7.
- Stochastic radial basis function algorithms for large-scale optimization involving expensive black-box objective and constraint functions. Computers & Operations Research 38 (5), pp. 837–853. Cited by: §2.7.
- Design and analysis of computer experiments. Statistical science, pp. 409–423. Cited by: §2.7.
- Comparison of response surface and kriging models for multidisciplinary design optimization. In 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, pp. 4755. Cited by: §2.7.
- Oracally efficient spline smoothing of nonlinear additive autoregression models with simultaneous confidence band. Journal of Multivariate Analysis 101 (9), pp. 2008–2025. Cited by: §2.4.
- [45] Virtual library of simulation experiments: test functions and datasets. Note: Retrieved April 25, 2017, from \urlhttp://www.sfu.ca/ ssurjano Cited by: §4.