Efficient Simulation Budget Allocation for Subset Selection Using Regression Metamodels
Abstract
This research considers the ranking and selection (R&S) problem of selecting the optimal subset from a finite set of alternative designs. Given a total simulation budget constraint, we aim to maximize the probability of correctly selecting the top designs. To improve selection efficiency, we incorporate information from across the domain into regression metamodels. In this research, we assume that the mean performance of each design is approximately quadratic. To achieve a better fit of this model, we divide the solution space into adjacent partitions such that the quadratic assumption is satisfied within each partition. Using large deviations theory, we derive an approximately optimal simulation budget allocation rule in the presence of partitioned domains. Numerical experiments demonstrate that our approach can significantly enhance simulation efficiency.
Fei Gao, Zhongshun Shi, Siyang Gao, Hui Xiao
Network Planning Department, SF Technology, Shenzhen 518052, China
Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong
School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
Key words: Simulation optimization; ranking and selection; OCBA; subset selection; regression.
1 Introduction
Discrete-event systems (DES) simulation plays an important role in analyzing modern complex systems and evaluating decision problems, since these systems are usually too difficult to describe with analytical models. DES simulation is a common analysis method of choice and is widely used in many practical applications, such as queueing systems, electric power grids, air and land traffic control systems, manufacturing plants and supply chains (Xu et al., 2015, 2016; Gao and Chen, 2016). However, running a simulation model is usually time consuming, and a large number of simulation replications are typically required to achieve an accurate estimate of a design decision (Lee et al., 2010). In addition, it can be computationally expensive to select the best design(s) when the number of alternatives is large.
In this paper, we consider the problem of selecting the optimal subset of the top-$m$ designs out of $k$ alternatives, where the performance of each design is estimated by simulation. To improve selection efficiency, we aim to intelligently allocate the simulation replications to each design so as to maximize the probability of correctly selecting all the top-$m$ designs. This problem setting falls within the well-established statistics branch known as ranking and selection (R&S) (Xu et al., 2015).
In the literature, several types of efficient R&S procedures have been developed. The indifference-zone (IZ) approach allocates the simulation budget to provide a guaranteed lower bound for the probability of correct selection (PCS) (Kim and Nelson, 2001). Chen et al. (2000) proposed an optimal computing budget allocation (OCBA) approach for R&S problems. The OCBA approach allocates the simulation replications sequentially in order to maximize the PCS under a simulation budget constraint. He et al. (2007), Gao and Shi (2015) and Gao et al. (2017a) further developed the OCBA method with the expected opportunity cost (EOC) measure, which focuses more on the consequence of a wrong selection than the PCS does. Brantley et al. (2013) proposed another approach, called optimal simulation design (OSD), to select the best design with regression metamodels. It assumes that all designs fit a single quadratic curve and that the simulation noise variance is identical across designs. The OSD approach was further extended in Brantley et al. (2014), Xiao et al. (2015) and Gao et al. (2018) to more general problems by dividing the solution space into adjacent partitions. Although these studies are also based on partitioning or metamodeling, they do not aim to select the top-$m$ designs, and are therefore different in objective from this research. Other variants of OCBA include selecting the best design considering resource sharing and allocation (Peng et al., 2013) and input uncertainty (Gao et al., 2017b).
Most of the existing R&S procedures focus on identifying the best design and return a single choice as the estimated optimum. However, decision makers may prefer to have several good alternatives instead of one and make the final selection by considering qualitative criteria, such as political feasibility and environmental impact, which might be neglected by computer models (Gao and Chen, 2016; Zhang et al., 2016). A selection procedure providing the top-$m$ designs helps decision makers make their final decision in a more flexible way.
The literature on optimal subset selection is sparse. Chen et al. (2008) and Zhang et al. (2016) considered the optimal subset selection problem under the OCBA framework, which maximizes the PCS under a simulation budget constraint. In Gao and Chen (2015), an optimal subset selection procedure was proposed to minimize the EOC measure. The optimal subset selection problem was further extended in Gao and Chen (2016) to general underlying distributions using large deviations theory.
The aforementioned R&S procedures can smartly allocate the computing budget given the simulation results. However, they estimate the performance of each design by considering only the sample information of that design itself. The designs nearby could also provide useful information, since neighboring designs usually have similar performance. Based on this idea, we aim to improve selection efficiency by incorporating information from across the domain into response surfaces. Unlike traditional R&S methods, regression-based approaches require simulation experiments on only a subset of all the designs under consideration. The performance of the remaining designs can be inferred from the sample information of the simulated designs. This provides an effective way to further improve the efficiency of solving the subset selection problem, which is the motivation of this paper.
In this research, we assume the underlying function is quadratic or approximately quadratic. This assumption helps utilize the structural information of the design space and leads to a significant improvement in computational efficiency. It is commonly used in the literature, e.g., Brantley et al. (2013, 2014); Xiao et al. (2015); McConnell and Servaes (1990). Based on this assumption, we build a quadratic regression metamodel to incorporate information from across the domain. The first contribution of this work is an asymptotically optimal allocation rule that determines which designs need to be simulated and how much simulation budget is allocated to them, such that the PCS of the optimal subset is maximized. We call this procedure the optimal computing budget allocation for selecting the top-$m$ designs with regression (OCBA-mr). To further extend the OCBA-mr procedure to more general cases where the underlying function is only partially quadratic or non-quadratic, we divide the solution space into adjacent partitions and build a quadratic regression metamodel within each partition. The underlying function in each partition can be well approximated by a quadratic function if the solution space is properly partitioned or each partition is small enough. According to the results in Brantley et al. (2014); Xiao et al. (2015); Gao et al. (2018), the use of partitioned domains along with regression metamodels can significantly improve simulation efficiency, which suggests that partitioning the solution space is an effective way to obtain further improvement. For different problems, the solution space can be divided into discrete partitions using different criteria, such as the size of corporations, the type of industries, or the temperature of a chemical process (Xiao et al., 2015).
Based on the above idea, we develop an asymptotically optimal computing budget allocation procedure for selecting the top-$m$ designs with regression in partitioned domains (OCBA-mrp), which extends the OCBA-mr procedure to more general cases. In order to maximize the PCS of the optimal subset, the OCBA-mrp procedure determines not only the optimal simulation budget allocation within each partition but also the optimal budget allocation between partitions.
The rest of the paper is organized as follows. In Section 2, we formulate the optimal subset selection problem with a regression metamodel and derive an asymptotically optimal simulation budget allocation rule, called OCBA-mr. Section 3 extends the OCBA-mr method to more general cases with partitioned domains and derives another asymptotically optimal simulation budget allocation rule, called OCBA-mrp. The performance of the proposed methods is illustrated with numerical examples in Section 4. Section 5 concludes the paper.
2 Optimal Subset Selection Strategy
In this section, we provide an optimal computing budget allocation rule for the subset selection problem based on the regression metamodel.
2.1 Problem formulation
Without loss of generality, the best design is defined as the design with the smallest mean performance. We introduce the following notation:

$k$: total number of designs;

$x_i$: location of design $i$;

$[r]$: design with the $r$th smallest mean value;

$y_i$: mean performance value of design $i$;

$Y_i$: simulation output of design $i$;

$\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^\top$: vector of the coefficients of the regression model;

$\hat{y}_i$: estimate of $y_i$ based on the regression model;

$S_m$: set of the true top-$m$ designs;

$\bar{S}_m$: set of designs not in $S_m$ (complement of $S_m$);

$T$: total number of simulation replications;

$N_i$: number of simulation replications allocated to design $i$;

$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$: vector of allocation proportions, where $\alpha_i = N_i / T$ is the proportion of the total simulation budget allocated to design $i$, with $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \geq 0$.
In the last item above, the allocation proportion $\alpha_i$ takes nonzero values only at designs $1$, $s$ and $k$, where designs $1$ and $k$ are the first and last designs in the solution space, and design $s$ is an intermediate design determined by (5) (the rationale will be explained in more detail in Theorem 2). The problem considered in this paper is to select the top-$m$ designs out of the $k$ alternatives by allocating simulation replications to these three support designs. The optimal set is
$$S_m = \{[1], [2], \ldots, [m]\},$$
where $m$ is the size of subset $S_m$ and $[r]$ denotes the design with the $r$th smallest mean value.
We study the problem where the expected performance value across the solution space is quadratic or approximately quadratic in nature. The mean performance of design $i$ can then be written as
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2.$$
The coefficients $\beta_0$, $\beta_1$ and $\beta_2$ are unknown beforehand, but can be estimated via simulation samples. Assume that the noise of the simulation experiments of design $i$ follows a normal distribution with mean zero and variance $\sigma_i^2$. For each design, the noise is independent from replication to replication. The $j$th simulation output of design $i$ is then
$$Y_{ij} = y_i + \varepsilon_{ij} \quad \text{with} \quad \varepsilon_{ij} \sim N(0, \sigma_i^2).$$
Given a total of $T$ samples, we define $\tilde{Y}$ as the $T \times 1$ vector of the simulation samples and $X$ as the $T \times 3$ matrix whose row corresponding to a sample of design $i$ is $(1, x_i, x_i^2)$. We estimate the coefficients using the ordinary least squares (OLS) method (Hayashi, 2000) and denote the estimates by $\hat{\boldsymbol{\beta}}$. Then, we have $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \tilde{Y}$ and $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$, where $^\top$ denotes transposition and $X^\top X$ is known as the information matrix (Kiefer, 1959). The estimate of $y_i$ can be written as
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\beta}_2 x_i^2. \qquad (1)$$
We can use equation (1), which incorporates the sample information of the simulated designs, to estimate the expected performance of each design across the solution space. Since $\hat{y}_i$ is a linear combination of $\hat{\boldsymbol{\beta}}$, we have $\hat{y}_i = \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}$, where $\mathbf{x}_i = (1, x_i, x_i^2)^\top$.
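To make the estimation step concrete, the following sketch fits the quadratic metamodel by OLS using samples from three support designs and then infers the means of all designs from the fitted coefficients. The underlying function, noise level, replication counts and design locations below are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mean(x):
    # Hypothetical quadratic underlying function (for illustration only).
    return 1.0 + 0.3 * x + 0.05 * x**2

# Three support designs: first, intermediate and last locations.
support_x = np.array([1.0, 6.0, 10.0])
sigma = np.array([0.8, 0.8, 0.8])      # simulation noise std devs
n_reps = np.array([50, 50, 50])        # replications per support design

# Stack all simulation outputs and build the design matrix X,
# one row (1, x, x^2) per replication.
xs = np.repeat(support_x, n_reps)
Y = true_mean(xs) + rng.normal(0.0, np.repeat(sigma, n_reps))
X = np.column_stack([np.ones_like(xs), xs, xs**2])

# OLS estimate of the coefficients: beta_hat = (X'X)^{-1} X'Y.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Infer the mean of every design in the domain from the metamodel.
all_x = np.arange(1.0, 11.0)
y_hat = beta_hat[0] + beta_hat[1] * all_x + beta_hat[2] * all_x**2
print(np.round(y_hat, 2))
```

Only the three support designs are simulated; the means of the remaining designs are read off the fitted curve.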
Due to the uncertainty of the estimate of the underlying function, a correct selection of the optimal subset may not always occur. Therefore, we introduce the probability of correct selection (PCS) as the measure to formulate the R&S problem considered in this paper. The PCS is given by
$$\mathrm{PCS} = P\left\{\hat{y}_i < \hat{y}_j, \ \forall\, i \in S_m, \ j \in \bar{S}_m\right\},$$
where $\hat{y}_i$ is the regression estimate of the mean of design $i$, and $S_m$ and $\bar{S}_m$ are the true top-$m$ subset and its complement.
Given a fixed simulation budget, the optimization problem can be written as follows:
$$\max_{\boldsymbol{\alpha}} \ \mathrm{PCS} \quad \text{s.t.} \ \sum_{i=1}^{k} \alpha_i = 1, \ \alpha_i \geq 0, \qquad (2)$$
where $\alpha_i$ is the proportion of the total simulation budget allocated to design $i$.
In this section, we aim to solve problem (2), where the mean performance of each design is estimated using the regression metamodel. Due to the uncertainty in the simulation experiments, multiple simulation replications are needed to estimate the underlying function accurately. The variance of the regression estimate of each design's mean is a function of the information matrix and can be reduced if additional simulation replications are conducted (Xiao et al., 2015). We are interested in how to intelligently allocate the simulation budget to the proper designs such that the mean performances can be better estimated by the regression metamodel and the PCS in problem (2) can be maximized.
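Since the probability of correct selection has no closed form, it is commonly estimated by Monte Carlo. The sketch below repeatedly fits the metamodel from simulated support-design samples and checks whether the fitted means identify the true top-m subset. The problem instance (means, noise, support locations and budget split) is invented for illustration; this is not the paper's allocation rule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance: 10 designs on a line with quadratic means;
# smaller is better, and we want the top-m subset.
x = np.arange(1.0, 11.0)
y = 5.0 - 1.2 * x + 0.09 * x**2          # true means
sigma = 1.0                               # common noise std dev
m = 3
true_top = set(np.argsort(y)[:m])

support = np.array([0, 4, 9])             # indices of the support designs
n_reps = np.array([60, 60, 60])           # replications per support design

def one_selection():
    # Simulate the support designs, fit the quadratic metamodel,
    # and select the m designs with the smallest fitted means.
    xs = np.repeat(x[support], n_reps)
    Y = np.repeat(y[support], n_reps) + rng.normal(0.0, sigma, xs.size)
    X = np.column_stack([np.ones_like(xs), xs, xs**2])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    y_hat = b[0] + b[1] * x + b[2] * x**2
    return set(np.argsort(y_hat)[:m])

trials = 500
pcs = np.mean([one_selection() == true_top for _ in range(trials)])
print(f"estimated PCS = {pcs:.3f}")
```

Changing the budget split `n_reps` changes the estimated PCS, which is exactly the degree of freedom the allocation rules in this paper optimize.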
2.2 Optimization model under large deviations framework
A major difficulty in solving problem (2) is that the objective function does not have a closed-form expression. We therefore solve this optimization problem under an asymptotic framework in which the PCS is maximized, or the probability of false selection (PFS) is minimized, as the total simulation budget goes to infinity.
The PCS used in this paper is defined based on the quadratic regression model (1), which is constructed using the simulation information of only a fraction of the designs. We call the designs that receive simulation replications the support designs. In order to construct the quadratic regression model, we need at least three support designs to obtain all of the information in the coefficient vector (Kiefer, 1959). For simplicity, in this research we let the number of support designs be three, two of which are at the extreme locations, i.e., the first and the last design of the solution space (for this setting, see, e.g., Brantley et al. (2013, 2014); Xiao et al. (2015); Kiefer (1959)). The case with more than three support designs can be analyzed similarly.
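The requirement of at least three support designs can be checked directly from the information matrix: with only two distinct support locations, the matrix X'X of the quadratic model is rank-deficient, so its three coefficients are not identifiable. A minimal check (the locations are arbitrary):

```python
import numpy as np

def info_matrix(locs, reps_each=10):
    # Build X with rows (1, x, x^2) for each replication and return X'X.
    xs = np.repeat(np.asarray(locs, dtype=float), reps_each)
    X = np.column_stack([np.ones_like(xs), xs, xs**2])
    return X.T @ X

# Two distinct support locations: rank-deficient, model not identifiable.
two = info_matrix([1.0, 10.0])
# Three distinct support locations: full rank, quadratic model identifiable.
three = info_matrix([1.0, 6.0, 10.0])

print(np.linalg.matrix_rank(two), np.linalg.matrix_rank(three))  # -> 2 3
```

Adding more replications at the same two locations never repairs the rank; only a third distinct location does.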
Lemma 1: For any design, the regression estimate of its mean performance follows a normal distribution; its mean is the true mean performance, and its variance is a function of the noise variances of the three support designs and the budget allocation proportions.
The proof is similar to that given in Eq. (22) of Brantley et al. (2013), and is hence omitted for brevity.
For any design, the regression estimate is a normally distributed random variable, and so is any difference of two such estimates. We can use large deviations theory to derive the convergence rate function of the false selection probability.
Lemma 2: The convergence rate function of the incorrect comparison probability for each design is:
Based on the results in Lemma 2, we can obtain an explicit expression for the convergence rate function of the PFS.
Lemma 3: The convergence rate function of the PFS is:
The main assertion of Lemma 3 is that the overall convergence rate of the PFS is determined by the minimum convergence rate of the incorrect comparison across all designs. Minimizing the PFS is asymptotically equivalent to maximizing the rate at which the PFS goes to zero as a function of the allocation proportions, i.e., maximizing the minimum of the rate functions in Lemma 2. Based on Lemma 3, the asymptotic version of (2) becomes
(3) 
2.3 Asymptotically optimal solution
In this subsection, we seek to derive the optimality conditions for (3). Since the overall convergence rate of the PFS is determined by the design with the minimum convergence rate, a false selection is most likely to happen at this key design. Therefore, it is enough for us to investigate the convergence rate of the key design across the solution space. We define the key design as the design that minimizes the convergence rate function of the incorrect comparison probability.
Theorem 1
The optimization problem (2) can be asymptotically optimized with the following allocation rule:
(4) 
The weight in (4) is also known as the Lagrange interpolating polynomial coefficient (De la Garza et al., 1954; Burden and Faires, 2001). It represents the relative importance of each support design for estimating the performance at the key design. Theorem 1 indicates that a support design receives more simulation budget if its coefficient is larger.
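The role of the Lagrange coefficients can be illustrated numerically. Writing the fitted value at a target location as a weighted sum of the support-design sample means, allocating replications in proportion to the absolute Lagrange coefficient times the noise standard deviation minimizes its variance (a standard Cauchy-Schwarz argument). The locations, noise levels and target below are hypothetical, and this sketch is not the paper's exact rule (4):

```python
import numpy as np

# Hypothetical support-design locations, noise std devs, and a target
# location x_star. The fitted quadratic value at x_star equals
# sum_i l_i(x_star) * mean_i, with Lagrange coefficients l_i.
xs = np.array([1.0, 6.0, 10.0])
sig = np.array([1.0, 2.0, 0.5])
x_star = 7.0

def lagrange_coeffs(xs, x):
    # l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
    l = np.empty(len(xs))
    for i in range(len(xs)):
        others = np.delete(xs, i)
        l[i] = np.prod((x - others) / (xs[i] - others))
    return l

l = lagrange_coeffs(xs, x_star)

def var_at_target(n):
    # Var(y_hat(x_star)) = sum_i l_i^2 * sigma_i^2 / n_i
    return np.sum(l**2 * sig**2 / n)

T = 300.0
equal = np.full(3, T / 3)
prop = T * np.abs(l) * sig / np.sum(np.abs(l) * sig)  # n_i ∝ |l_i| σ_i

print("equal split variance :", var_at_target(equal))
print("proportional variance:", var_at_target(prop))
```

The proportional rule never does worse than an equal split, and the gap grows as the coefficients and noise levels become more unbalanced.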
Given the optimal allocation rule (Theorem 1), we next determine the optimal location of the intermediate support design.
Theorem 2
The rate function of the PFS under an allocation rule satisfying (4) is maximized if the intermediate support design satisfies the following equation.
The location of the intermediate support design is given by
(5) 
When the location derived from (5) does not correspond to any available design, we round it to the nearest one. The expression in Theorem 2 is similar to the results in Brantley et al. (2013); the difference lies in the selection of the key design. We define the key design as the design with the minimum convergence rate of the PFS, i.e., the design at which a false selection of the optimal subset is most likely to happen.
3 Optimal Subset Selection Strategy for Partitioned Domains
The regression-based method above can greatly improve subset selection efficiency compared with traditional methods. However, it is constrained by its assumptions, in particular the assumption that the underlying function for the means is quadratic. If the underlying function is neither quadratic nor approximately quadratic, the method may fail to find the top-$m$ designs. In order to extend our method to more general cases, we divide the solution space into adjacent partitions. A quadratic pattern can be expected when the solution space is properly partitioned or each partition is small enough.
3.1 Problem formulation
We first add the following notation for partitioned domains:

$p$: number of partitions of the entire domain;

$k_v$: number of designs in partition $v$, $v = 1, \ldots, p$;

design $(v, i)$: the $i$th design in partition $v$ (when the partition is clear from the context, we denote design $(v, i)$ simply as design $i$ for notational simplicity);

$x_{v,i}$: location of design $(v, i)$;

$v_r$: the partition containing the design with the $r$th smallest mean value;

$[r]$: design with the $r$th smallest mean value;

$\boldsymbol{\beta}_v$: vector of the coefficients of the regression model in partition $v$;

$T_v$: number of simulation replications allocated to partition $v$;

$N_{v,i}$: number of simulation replications allocated to design $(v, i)$;

$\gamma_v$: proportion of the simulation budget allocated to partition $v$, i.e., $\gamma_v = T_v / T$;

$\boldsymbol{\alpha}$: vector of $\alpha_{v,i}$, where $\alpha_{v,i} = N_{v,i} / T$ is the proportion of the simulation budget allocated to design $(v, i)$, with $\sum_{v=1}^{p} \sum_{i=1}^{k_v} \alpha_{v,i} = 1$ and $\alpha_{v,i} \geq 0$.
The notations for the mean performance, simulation output and regression estimate in partitioned domains are similar to those in Section 2, except that design $i$ is replaced by design $(v, i)$. The entire domain is divided into $p$ adjacent partitions. Partition $v$ contains $k_v$ designs, i.e., there are $k = \sum_{v=1}^{p} k_v$ designs in total.
We assume there exists a constant $c$ such that exactly $m$ of the $k$ designs have mean performances less than $c$, while the remaining $k - m$ designs have mean performances greater than $c$. This ensures that the optimal subset is well defined and can be distinguished. We define the optimal subset as the set of the $m$ designs whose mean performances are less than $c$.
In this section, we assume that the expected performance value in each partition is quadratic or approximately quadratic, which holds when the solution space is properly partitioned and the mean performance within each partition is continuous and smooth. For this problem setting, the PCS is defined analogously to Section 2, as the probability that every design in the optimal subset has a smaller estimated mean than every design outside it.
Given a fixed simulation budget, the optimization problem can be written as follows:
$$\max_{\boldsymbol{\alpha}} \ \mathrm{PCS} \quad \text{s.t.} \ \sum_{v}\sum_{i} \alpha_{v,i} = 1, \ \alpha_{v,i} \geq 0, \qquad (6)$$
where $\alpha_{v,i}$ is the proportion of the total simulation budget allocated to design $i$ of partition $v$.
In this section, we aim to solve problem (6) in the presence of regression metamodels. We are interested in how to intelligently allocate the simulation budget to the proper designs such that the mean performances can be better estimated using the regression metamodels and the PCS in problem (6) can be maximized. Note that this model does not require all the designs to lie on the same axis, and therefore it can also be applied to multidimensional problems. For multidimensional problems, we can treat the range of the underlying function along each dimension as one or more partitions and then apply this formulation.
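As an illustration of the partitioned formulation, the sketch below fits a separate quadratic metamodel in each of two partitions of a non-quadratic underlying function, using three support designs per partition. The function, partitioning, noise level and replication counts are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical non-quadratic underlying function over 20 designs;
# a single quadratic fits it poorly, but one quadratic per partition
# fits well (the partitioning below is illustrative).
x = np.arange(1.0, 21.0)
y = np.sin(x / 3.0) + 0.02 * x          # true means
partitions = [np.arange(0, 10), np.arange(10, 20)]
sigma, n = 0.3, 40                      # noise std dev, reps per support

def fit_partition(idx):
    # Three support designs per partition: first, middle and last.
    sup = idx[[0, len(idx) // 2, -1]]
    xs = np.repeat(x[sup], n)
    Y = np.repeat(y[sup], n) + rng.normal(0.0, sigma, xs.size)
    X = np.column_stack([np.ones_like(xs), xs, xs**2])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b[0] + b[1] * x[idx] + b[2] * x[idx]**2

y_hat = np.concatenate([fit_partition(idx) for idx in partitions])
err = np.max(np.abs(y_hat - y))
print(f"max abs error of piecewise quadratic fit: {err:.3f}")
```

Each partition is fitted independently, which mirrors the separability exploited later in Lemma 5.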
3.2 Optimization model under large deviations framework
In order to solve the optimization problem (6), one challenge is to derive an explicit expression for the PCS. We solve (6) under an asymptotic framework in which the probability of false selection (PFS) is minimized as the total simulation budget goes to infinity.
Similar to the setting in Section 2, we let the number of support designs in each partition be three, two of which are at the extreme locations, i.e., the first and the last design of the partition. The regression estimates of the design means within each partition then follow normal distributions, with means equal to the true means and variances determined by the allocation proportions.
Similar to the proof of Lemma 1, the corresponding variance expressions can be derived for each partition.
For any design in any partition, the regression estimate is a normally distributed random variable. We can use large deviations theory to derive the convergence rate function of the false selection probability.
Lemma 4: The convergence rate functions of the incorrect comparison probability for each design are given as follows:
(7) 
and
(8) 
According to the Bonferroni inequality, the PFS is bounded below by the maximum and above by the sum of the individual incorrect comparison probabilities, and the two bounds share the same exponential decay rate. Therefore, by the results in Lemma 4, the convergence rate function of the PFS is given by
Minimizing the PFS is asymptotically equivalent to maximizing the rate at which the PFS goes to zero as a function of the partition-level and design-level allocation proportions. Similar to Section 2, the asymptotic version of (6) becomes
(9) 
3.3 Asymptotically optimal solution
In this section, we seek to derive the optimality conditions for (9). We want to determine (i) the number of simulation replications allocated to each partition, (ii) the locations of the designs to be simulated in each partition, and (iii) the number of simulation replications allocated to those selected designs.
According to Xiao et al. (2015), the proportion of the simulation budget allocated to the partition containing the design with the $m$th smallest mean value converges to one as the number of partitions goes to infinity; that is, the fraction of the budget given to this partition far exceeds the fraction given to any other partition. Given that, the rate function converges to a simplified expression.
The problem (9) can be asymptotically rewritten as
(10) 
In order to better analyze problem (10) above, we decompose it as follows:
(11) 
for , and
(12) 
for .
Lemma 5: Let the partition-level proportions and the within-partition proportions denote the optimal solution to (10). The two sets of proportions are independent and can be solved for separately. In addition, the within-partition proportions corresponding to different partitions are also mutually independent and can be solved separately using (11) and (12).
Similar to Section 2, we define a key design in each partition, namely the design at which the convergence rate of the incorrect comparison probability is minimized. A false selection is most likely to happen at these key designs.
3.3.1 Determining the budget allocation and the locations of the support designs
Given Lemma 5, we can determine the allocation separately for each partition by solving the optimization problems (11) and (12).
Theorem 3
The optimization problem (6) can be asymptotically optimized with the following allocation rule: