A Mathematical Programming Approach for Integrated Multiple Linear Regression Subset Selection and Validation
Abstract
Subset selection for multiple linear regression aims to construct a regression model that minimizes errors by selecting a small number of explanatory variables. Once a model is built, various statistical tests and diagnostics are conducted to validate the model and to determine whether regression assumptions are met. Most traditional approaches require human decisions at this step, for example, the user adding or removing a variable until a satisfactory model is obtained. However, this trial-and-error strategy cannot guarantee that a subset that minimizes the errors while satisfying all regression assumptions will be found. In this paper, we propose a fully automated model building procedure for multiple linear regression subset selection that integrates model building and validation based on mathematical programming. The proposed model minimizes mean squared errors while ensuring that the majority of the important regression assumptions are met. When no subset satisfies all of the considered regression assumptions, our model provides an alternative subset that satisfies most of these assumptions. Computational results show that our model yields better solutions (i.e., satisfying more regression assumptions) compared to benchmark models while maintaining similar explanatory power.
Keywords: Regression diagnostics, Subset selection, Mathematical programming
1 Introduction
Regression analysis is one of the most popular forms of statistical modeling for analyzing the relationship between variables. Due to its intuitive characteristics and simplicity, multiple (multivariate) linear regression has been the most commonly used model for several decades, with a variety of fields employing it to handle prediction tasks. Multiple linear regression seeks to identify the most suitable linear relationship between several explanatory variables and a response variable. A multiple linear regression model for $p$ explanatory variables can be represented by the linear equation
$$y = X\beta + \varepsilon,$$
where $X \in \mathbb{R}^{n \times p}$ represents the data matrix corresponding to the $p$ explanatory variables with $n$ observations, $y \in \mathbb{R}^{n \times 1}$ denotes the data matrix corresponding to the values of the response variable, $\beta \in \mathbb{R}^{p \times 1}$ is the matrix of estimated parameters corresponding to the coefficients of the variables, and $\varepsilon$ refers to the errors between the predictions of the linear model and the response variable. We further assume that data matrices $X$ and $y$ are already standardized with their respective means and standard deviations, and thus, without loss of generality, we can remove the intercept from the model above. It then makes sense that $\beta$ should be chosen to minimize the errors.
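As a concrete illustration of this setup, the following sketch (our own numpy example on synthetic data, not code from the paper) standardizes the data and estimates $\beta$ by least squares for the intercept-free model:

```python
import numpy as np

def standardize(A):
    """Center each column to mean 0 and scale to standard deviation 1."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

def ols_coefficients(X, y):
    """Least-squares estimate of beta for the intercept-free model y = X beta + e."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Small synthetic example: y depends linearly on two of three columns.
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(50, 3)))
y = standardize(X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=50))
beta = ols_coefficients(X, y)
```

Because both $X$ and $y$ are standardized, no intercept column is needed and the fitted coefficients are on a comparable scale.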
Subset selection is a procedure in which a subset of explanatory variables is selected to reduce model complexity while maintaining explanatory power. [12] and [13] demonstrate that a reduced subset can provide faster, more cost-effective predictors, along with a better understanding of the underlying process that generated the data. A reduced subset can also prevent overfitting in the presence of too many irrelevant or redundant features [36] and can help users to more readily understand the results [15].
Many studies on subset selection have been conducted in the literature. Stepwise selection methods, including forward selection, backward elimination, and their combination, are the most popular algorithms thanks to their simple implementation and fast computing time. However, the solutions found by these algorithms are often poor in quality due to their greedy characteristics. Therefore, in order to improve subset selection, more sophisticated algorithms have been proposed. Metaheuristic algorithms, such as genetic algorithm variations and simulated annealing, are presented in [35], [19], [30], and [21]. Subset selection procedures based on statistical or machine learning methods have also been developed. For example, a Bayesian approach to subset selection is taken by [22], while [10] suggest a selection procedure based on the superiority of each subset estimated from posterior probabilities given by Gibbs sampling. [9] point out that random forests can be employed in a strategy that ranks the explanatory variables, providing insight for variable selection. In addition, other research on machine learning algorithms for subset selection has been conducted by [27], [7], and [37]. By assigning an $\ell_1$ penalty to the coefficients, the least absolute shrinkage and selection operator (LASSO) was introduced by [32]. Although it has been several years since LASSO was first devised, it is still one of the most commonly used regression subset selection methods. Note that most of the methods above are heuristics, meaning that they do not necessarily give parameters that minimize the mean squared error (MSE).
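To make the greedy behavior of stepwise methods concrete, here is a minimal forward-selection sketch (our own illustration, not an implementation from any cited work): at each step it adds whichever variable most reduces the SSE, which is fast but can miss the globally best subset.

```python
import numpy as np

def sse(X, y, cols):
    """Sum of squared errors of the OLS fit using only the given columns."""
    if not cols:
        return float(y @ y)
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

def forward_selection(X, y, k):
    """Greedily add the variable that most reduces SSE until k are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: sse(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Planted example: only columns 0 and 2 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=60)
chosen = forward_selection(X, y, 2)
```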
To find the best subset, several models based on mathematical programming have been proposed. Although the best subset selection problem can be formulated as a mixed integer quadratic program (MIQP) when ordinary least squares (OLS) is used, the statistical community paid it little attention for a long time because it requires significant computational time for practical implementation. Recently, however, [1] have pointed out that the computing power of mixed integer programming (MIP) solvers has increased at a sensational rate in the last 25 years and also emphasized the eye-opening progress in exact algorithms for solving integer programs. These recent developments have made integer programming a key method for finding the best subset. [18] introduce an MIQP formulation to find the best subset with a fixed number of variables for multiple linear regression with the minimum sum of squared errors (SSE). [17] propose a multi-step method to generate a nearly optimal solution in a shorter time. Based on the formulation of [18], [2] suggest obtaining tight big-$M$ values in the indicator constraint to improve computing speed. They also introduce an algorithm to obtain an initial solution for a warm start, effectively reducing the computing time. The choice of a best subset for logistic regression via MIP using piecewise linear approximation is also presented in [28]. Models and algorithms that do not require the number of selected variables to be fixed have also been proposed. [23] introduce mixed integer second-order cone programming formulations that do not restrict the cardinality of the subsets. [24] set Mallows' $C_p$ as the goodness of fit for the model and formulate subset selection as an MIP. In their work, Mallows' $C_p$ accounts for the number of selected variables, and hence the cardinality of the subset need not be fixed when Mallows' $C_p$ is used as the measure of goodness of fit.
[26] consider an MIQP formulation for subset selection not only when the shape of the data is ordinary (i.e., $n > p$) but also for high-dimensional variable selection when the number of selected variables is not fixed. They also provide a proof of big-$M$ as an upper bound on the coefficients and an efficient heuristic algorithm for cases when $p > n$. Note that these studies only consider several goodness-of-fit measures to obtain an optimal subset.
While numerous studies have focused on regression subset selection, few have considered regression assumptions and diagnostics in their solution approach. [1] suggest an algorithmic approach that iteratively solves MIP problems to obtain a desirable model. They use penalized objective functions and constraints for sparsity, multicollinearity reduction, robustness to outliers, and statistical significance. [31] propose an MIQP for best subset selection while eliminating multicollinearity from linear regression models. [6] study a mathematical model for subset selection that constructs a linear regression model with significant coefficients and small multicollinearity via constraints that shrink the coefficients.
In this paper, we propose a fully automated model building and validation procedure for multiple linear regression via a mathematical programming-based algorithm that minimizes the MSE while guaranteeing both (i) statistical significance of the regression coefficients and (ii) satisfaction of the regression assumptions and diagnostics. Note that it is not trivial to incorporate this model validation step into other popular subset selection approaches such as LASSO and stepwise selection. Our model is also capable of providing an alternative solution when no subset satisfies all of the considered regression assumptions and tests. Our model differs from [1] and [6] in that our algorithm finds a solution that satisfies the tests and diagnostics with only one call to an MIP solver, whereas they use an iterative method that calls an MIP solver multiple times or rely on heuristic observations for modeling. The contributions of our paper can be summarized as follows:
We provide a mathematical programming-based algorithm that builds a regression model satisfying the majority of statistical tests and diagnostics. In particular, residual-based diagnostics are incorporated into the model for the first time.
Our algorithm makes exactly one call to a commercial MIP solver. To the best of our knowledge, it is unique in that it does not require multiple iterative calls to the solver. The experimental results demonstrate the efficiency of our algorithm in comparison to existing iterative methods.
The experimental results show that our model identifies a subset that satisfies all of the considered tests and assumptions within a reasonable time, while keeping the adjusted $R^2$ values close to those of the benchmarks (which use no statistical tests or diagnostics).
We present a logical procedure to find a near-feasible solution when no subset satisfies the tests and diagnostics or our model fails to find a feasible solution within the given time limit. The proposed procedure mimics the typical steps used to build a linear regression model. The experimental results show that our procedure produces higher quality subsets than the benchmarks.
The paper is structured as follows. In Section 2, we discuss several characteristics of a desirable multivariate linear regression model which incorporates transformed variables and meets essential regression assumptions. In Section 3, a mathematical programming approach for the best subset which reflects the features discussed in Section 2 is presented. A logical procedure for obtaining an alternative solution when the model is infeasible is proposed in Section 4. The results of the computational experiments are presented in Section 5, followed by conclusions in Section 6.
2 Model Validation for Multiple Linear Regression
As mentioned previously, building a multivariate linear regression model involves finding a linear equation that explains the relationship between the explanatory variables and the response variable well. “Explains well” in this case means that the errors between the predictions and the observations are suitably small. At the same time, the model needs to satisfy the assumptions of linear regression from a statistical point of view. However, when a model is obtained from a single run of one of the popular least squares methods, some of the statistical assumptions are usually violated. [25] suggests building a regression model by repeatedly checking the assumptions and diagnostics. Figure 1 summarizes this strategy. After preprocessing the collected data, several explanatory variables (or nonlinear transformed variables) are selected to build a statistically significant model. Diagnostics are then conducted on the model. If the model satisfies all assumptions, it is forwarded to the postprocessing step. Otherwise, another subset of explanatory variables is selected and tested. As a result, constructing a useful regression model requires many diagnostic tests. In this section, we discuss three popular techniques and tests that will be included in the mathematical formulation proposed in Section 3.
2.1 Transformation of explanatory variables
When the explanatory and response variables have a nonlinear relationship, the residuals may exhibit a nonlinear trend versus the fitted values. In these cases, transforming the explanatory variables with a nonlinear function, such as the logarithm, is one of the most commonly used techniques to resolve this problem. A model can be established with a subset of the explanatory variables and their transformations, with the restriction that a variable and its transformation cannot be selected simultaneously because this leads to multicollinearity. If the explanatory variables are transformed using an appropriate function, then a model constructed with the transformed variables will no longer demonstrate a nonlinear relationship in the residual plot. Therefore, we generate transformed data as a fixed candidate set of selectable variables. The MIQP model presented in Section 3.2 has a constraint to select at most one of each original and transformed variable pair.
2.2 Statistical tests for linear regression parameters
Because linear regression models are constructed based on statistical assumptions, any proposed model should be verified to determine whether it satisfies these assumptions. One linear regression assumption is that the data always contains noise generated from a normal distribution. For this reason, the estimated coefficients have a probability distribution. Suppose that $\varepsilon \sim N(0, \sigma^2 I)$. The standard deviations of the coefficients are then obtained from the diagonal of the matrix
$$\mathrm{MSE}(X)\,(X^\top X)^{-1},$$
where $\mathrm{MSE}(X)$ denotes the mean squared error of the model developed from data matrix $X$ (refer to [25] for its proof). The $j$th element of the diagonal of this matrix is equal to the variance of the $j$th estimated coefficient (i.e., the $j$th element of $\beta$). We denote the standard deviation and the estimated coefficient as $\sigma_j$ and $\hat{\beta}_j$, respectively. If we employ a data matrix with $k$ variables, then the statistics $\hat{\beta}_j / \sigma_j$ follow Student's $t$ distribution.
Note that the critical value of the $t$ statistic is constant when the number of selected variables is fixed. Our interest is to test whether coefficient $\beta_j$ is equal to zero. If $\hat{\beta}_j$ is close to zero, then it can be regarded that there is no linear relationship between the response variable and the $j$th explanatory variable. Formally, the following hypothesis can be tested:
$$H_0 : \beta_j = 0 \qquad \text{versus} \qquad H_1 : \beta_j \neq 0.$$
Thus, to reject the null hypothesis at significance level $\alpha$, the following inequality must hold:
$$\left| \hat{\beta}_j / \sigma_j \right| > t_{\alpha/2},$$
where $t_{\alpha/2}$ denotes the corresponding critical value of Student's $t$ distribution.
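The test above can be sketched in code as follows. This is a generic numpy illustration of ours; the degrees of freedom ($n - p$ for the intercept-free model) and the critical value 2.0 (an approximation to $t_{\alpha/2}$ for moderate sample sizes at $\alpha = 0.05$) are our assumptions, not values from the paper.

```python
import numpy as np

def coefficient_t_stats(X, y):
    """t statistics for H0: beta_j = 0 in the intercept-free OLS model."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = float(resid @ resid) / (n - p)       # no intercept term
    sigma = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / sigma

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = 1.5 * X[:, 0] + rng.normal(size=80)        # column 1 is irrelevant
beta, t = coefficient_t_stats(X, y)
significant = np.abs(t) > 2.0                  # approximate t critical value
```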
We will discuss how this requirement can be handled with a mathematical programming approach in Section 3.3 and Section 3.4.
2.3 Model validation with residual plots
One of the key assumptions of linear regression is that the errors have constant variance over the fitted values or over the independent variables. This can be verified by drawing residual plots. The residuals have constant variance if they form a band-like region of constant width over the entire range of fitted values. [5] noted that the violation of the residual assumptions can lead to inefficiency when using OLS and subsequently to invalid inferences. However, linear models often do not satisfy these assumptions. Thus, once a linear regression model is constructed, it is critical to check the diagnostic plots of the residuals.
Figures 2–5 show representative plots of residuals versus fitted values. In Figure 2, the variance of the residuals appears to be constant over the fitted values, and thus the generated model satisfies the assumption that the variance of the residuals is constant. However, this is not the case for the other examples. Figure 3 displays a positive correlation between the residuals and fitted values, Figure 4 shows heteroscedasticity, in which the variance of the residuals increases as the fitted values increase, and Figure 5 presents a nonlinear relationship between the residuals and fitted values.
In fact, the latter three cases (Figures 3–5) are frequently observed when implementing linear regression models for many real-world problems. Thus, we need to run several statistical tests to diagnose these cases so that an appropriate combination of explanatory variables can be selected.
To detect the situation depicted in Figure 3, a univariate linear regression model, called an auxiliary model, is constructed in which the explanatory and response variables are the fitted values and the residuals, respectively. If there is no issue, the estimated slope is close to zero. If linearity exists, the estimated slope is nonzero. A simple hypothesis test can be used to check whether the estimated slope is zero. If the test rejects the null hypothesis (i.e., zero slope), then we conclude that the model violates the residual assumption. Otherwise, we conclude that the model does not suffer from a linear residual trend.
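A sketch of this auxiliary-model slope test follows (our own illustration; synthetic residual patterns stand in for real model output, and the test-statistic construction is the standard simple-regression formula rather than anything specified by the paper):

```python
import numpy as np

def aux_slope_test(fitted, resid):
    """Fit the auxiliary model resid = a + b * fitted + u and return (b, t_b).
    A large |t_b| rejects H0: b = 0 and flags a linear residual trend."""
    A = np.column_stack([np.ones_like(fitted), fitted])
    coef, *_ = np.linalg.lstsq(A, resid, rcond=None)
    u = resid - A @ coef
    s2 = float(u @ u) / (len(fitted) - 2)
    se_b = np.sqrt(s2 * np.linalg.inv(A.T @ A)[1, 1])
    return coef[1], coef[1] / se_b

rng = np.random.default_rng(3)
fitted = np.linspace(0.0, 1.0, 100)
noise = rng.normal(scale=0.05, size=100)
_, t_trend = aux_slope_test(fitted, 0.8 * fitted + noise)  # pattern like Figure 3
_, t_flat = aux_slope_test(fitted, noise)                  # pattern like Figure 2
```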
To detect heteroscedasticity (Figure 4), two statistical tests will be introduced. The first test, referred to as the absolute residual test in this paper, is identical to the previous test except that the values of the response variable in the auxiliary model are the absolute values of the residuals. Figure 7 shows that, if the negative side of Figure 4 is flipped to overlap the positive side, an increasing trend in the residuals is observed. In contrast, the constant variance of the absolute values of the residuals with respect to the fitted values is presented in Figure 6, which is associated with the desirable residual plot of Figure 2. Thus, as with the linear residual trend, if a statistical test on the coefficient of the auxiliary model rejects the null hypothesis, we can regard the residuals of the linear model as exhibiting heteroscedasticity. Another widely used diagnostic tool for heteroscedasticity is the Breusch–Pagan test [5]. To avoid being overly rigorous, we consider the residual assumption to be violated only if both tests indicate heteroscedasticity. In Section 3.4, our mathematical programming-based diagnostic approach will be presented to capture the features of heteroscedasticity.
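The Breusch–Pagan idea can be sketched as follows. This uses the studentized (Koenker) form of the statistic, regressing the squared residuals on the fitted values, which is an assumption on our part since the paper does not specify the variant; the absolute residual test works the same way with $|e_i|$ as the auxiliary response.

```python
import numpy as np

def breusch_pagan_lm(fitted, resid):
    """Koenker's studentized Breusch-Pagan statistic: regress the squared
    residuals on the fitted values; LM = n * R^2 is asymptotically
    chi-squared with 1 degree of freedom (critical value 3.84 at alpha=0.05)."""
    n = len(resid)
    e2 = resid ** 2
    A = np.column_stack([np.ones(n), fitted])
    coef, *_ = np.linalg.lstsq(A, e2, rcond=None)
    u = e2 - A @ coef
    r2 = 1.0 - float(u @ u) / float(((e2 - e2.mean()) ** 2).sum())
    return n * r2

rng = np.random.default_rng(4)
fitted = np.linspace(0.1, 2.0, 300)
lm_het = breusch_pagan_lm(fitted, fitted * rng.normal(size=300))  # variance grows
lm_hom = breusch_pagan_lm(fitted, rng.normal(size=300))           # constant variance
```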
Finally, for the case illustrated in Figure 5, we formulate a model that selects variables from among the explanatory variables and their nonlinear transformations.
3 Integrated Model Building and Validation Procedure via Mathematical Programming
In this section, we develop an automated procedure based on mathematical programming models that minimizes the SSE when selecting a subset of a fixed number of explanatory variables and incorporates statistical tests and diagnostics for multivariate linear regression. In Section 3.1, we introduce the base model from the literature, which does not consider statistical tests or diagnostics. In Section 3.2, we extend the base model to include log-transformed explanatory variables. In Section 3.3, we propose constraints to remove statistically insignificant coefficients. In Section 3.4, we present the final algorithm, which considers all of the remaining tests and diagnostics.
3.1 Base model
We start with a basic model similar to that proposed by [33]. All the sets, parameters, and decision variables used are as follows.
Sets and Parameters
$n$:  number of observations
$p$:  number of independent variables
$I$:  index set of observations, $I = \{1, \dots, n\}$
$J$:  index set of independent variables, $J = \{1, \dots, p\}$
$i$:  index of observations, $i \in I$
$j$:  index of independent variables, $j \in J$
$k$:  number of selected independent variables
$X$:  standardized data matrix corresponding to the independent variables, $X \in \mathbb{R}^{n \times p}$
$y$:  standardized data matrix corresponding to the dependent variable, $y \in \mathbb{R}^{n \times 1}$
Decision Variables
$\beta_j$:  coefficient of independent variable $j$, $j \in J$
$e_i$:  error between the $i$th observation and its predicted value, $i \in I$
$z_j$:  $1$ if the $j$th independent variable is selected; $0$ otherwise
Using the parameters and decision variables above, the basic mathematical programming model is presented as follows.
The basic model minimizes the SSE of a multivariate linear regression model with a fixed number of selected variables $k$. Remark that the MSE is obtained by dividing the SSE by its degrees of freedom, but minimizing the MSE is equivalent to simply minimizing the SSE because $k$ is fixed as a given parameter. The first constraint defines the residuals, and the second indicates that if an explanatory variable is not selected, then its coefficient must be zero. Lastly, the third constraint ensures that the number of selected variables is $k$.
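For intuition about what the basic model computes, a brute-force enumeration recovers the same optimum for small $p$ (a sketch of ours, feasible only for small instances, not the MIQP formulation itself):

```python
import itertools

import numpy as np

def best_subset(X, y, k):
    """Exhaustively search all size-k subsets for the minimum SSE -- the same
    optimum the MIQP targets, tractable only when p is small."""
    best_cols, best_sse = None, float("inf")
    for cols in itertools.combinations(range(X.shape[1]), k):
        Xs = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        s = float(r @ r)
        if s < best_sse:
            best_cols, best_sse = cols, s
    return best_cols, best_sse

# Planted example: only columns 1 and 3 drive the response.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 1] + X[:, 3] + 0.1 * rng.normal(size=60)
cols, sse_val = best_subset(X, y, 2)
```

The enumeration visits $\binom{p}{k}$ subsets, which is exactly the combinatorial explosion that motivates the MIQP approach.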
The mathematical model can be converted into an MIQP with a popular linearization technique. Instead of the unrestricted continuous decision variables $\beta_j$ and $e_i$, nonnegative variables $\beta_j^+$, $\beta_j^-$, $e_i^+$, and $e_i^-$ are used, where $\beta_j = \beta_j^+ - \beta_j^-$ and $e_i = e_i^+ - e_i^-$. By plugging these in, the following MIQP can be obtained.
Recall that the main purpose of our study is to build a multivariate linear regression model that considers diagnostics. To meet this goal, we now extend the base mathematical model above to include several of the important diagnostic tests presented in Section 2.
3.2 Inclusion of log-transformed explanatory variables
We first discuss how to include log-transformed explanatory variables in the model. Because an explanatory variable and its log-transformation are highly correlated, we need to prevent both variables from being selected simultaneously, as discussed in Section 2.1. Let us then define new parameters and sets.
Set and Parameters
$J'$:  index set of the logarithms of the independent variables
$X'$:  standardized data matrix corresponding to the logarithms of the independent variables
$\bar{X}$:  data matrix concatenating $X$ and $X'$, i.e., $\bar{X} = [\,X \;\; X'\,]$
If a vector corresponding to the $j$th column of $X$ possesses nonpositive elements, we compute the logarithm of the column using a conventional method to deal with the nonpositive values. Since every original explanatory variable has a transformed counterpart, each index $j \in J$ has a corresponding transformed index in $J'$. The model incorporating the log-transformed explanatory variables can be formulated as follows:
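The two ingredients of this step can be sketched as follows. The paper's exact shift for nonpositive columns is elided, so the `log(x - min(x) + 1)` convention below is purely our illustrative assumption; the pairing check mirrors the constraint that forbids selecting a variable together with its transform (with the convention, assumed here, that variable $j$'s transform has index $j + p$).

```python
import numpy as np

def safe_log(col):
    """Log-transform a column; shift first if nonpositive values are present.
    The shift log(x - min(x) + 1) is one common convention, used here only
    as an illustration since the paper's exact formula is elided."""
    m = col.min()
    return np.log(col - m + 1.0) if m <= 0 else np.log(col)

def valid_selection(selected, p):
    """Check that no original variable j and its transform j + p are both chosen."""
    return all(not (j < p and j + p in selected) for j in selected)
```

For example, with $p = 5$, selecting variables {0, 3, 6} is allowed (6 is the transform of 1, which is not selected), while {1, 6} is not.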
The extended model is similar to the base formulation. The difference is that it has additional derived explanatory variables in $\bar{X}$, while preventing the simultaneous selection of an original explanatory variable and its log-transformation. We note that, in addition to the log-transformation, other types of transformations of the explanatory variables can be handled in a similar manner. Finally, we remark that an appropriate big-$M$ value is required to solve the problem. When $M$ is too small, we cannot guarantee optimality. When $M$ is too large, the optimization solver can struggle due to numerical issues. [26] proposed a sampling-based approach, in which $M$ is estimated by iteratively sampling subsets of explanatory variables. We use this method with a slight modification to ensure that the sampled explanatory variables do not simultaneously include an original explanatory variable and its transformation.
3.3 Inclusion of constraints corresponding to tests for the significance of the regression coefficients
In this section, we present constraints to check the statistical significance of the regression coefficients. Formulating such constraints is not a trivial task because the statistical significance of an estimated coefficient depends on the selected subset. In detail, the standard deviation $\sigma_j$ can only be calculated given a subset, so an inequality involving $\sigma_j$ cannot be trivially formulated as a convex constraint. In order to address this issue, we convert the inequality into a conditional constraint that checks the statistical significance of the $j$th variable only if it is selected. We derive
where $S$ is a subset with $k$ explanatory variables, $X_S$ is the submatrix of $\bar{X}$ induced by $S$, and $\sigma_j(S)$ is the standard deviation of the estimated coefficient for explanatory variable $j$, which is equivalent to the earlier definition and only defined for explanatory variables in $S$. The resulting constraint above is only activated when the $j$th variable is selected (i.e., $z_j = 1$). Note that $X_S$ varies according to the selected variables, which implies that the values of $\sigma_j(S)$ in the constraint vary accordingly. Hence, it is still difficult to add the constraint in its current form to the MIQP model.
To handle the changing values of $\sigma_j(S)$ depending on the selected subset, we instead use a lower bound of $\sigma_j(S)$, which gives a relaxed version of the exact constraint. A tight lower bound for $\sigma_j(S)$ can still cut some of the subsets with insignificant regression coefficients.
To obtain a lower bound for $\sigma_j(S)$, we consider its definition. Since $\sigma_j(S)^2$ is the product of the MSE and the $j$th diagonal element of $(X_S^\top X_S)^{-1}$, it is straightforward to multiply a lower bound of the MSE by a lower bound of the diagonal element to obtain a lower bound of $\sigma_j(S)^2$.
We first consider the lower bound of the MSE. The exact minimum can be calculated by solving the MIQP itself, but this can take a long time. Thus, we instead use its LP relaxation, in which the binary variables $z_j$ are relaxed to $0 \le z_j \le 1$; solving the relaxation gives us a lower bound on the minimum SSE and hence on the MSE.
We next consider the lower bound of . In a multivariate linear regression model with data matrix , the elements of the diagonal of are called a variance inflation factor (VIF). The VIF for the th coefficient is calculated by following equation:
where $R_j^2$ is the coefficient of determination of a linear regression model in which variable $j$ is the response variable and all variables except $j$ are explanatory variables. Note that $\mathrm{VIF}_j \ge 1$ since $0 \le R_j^2 < 1$ for a linear model with an arbitrary data matrix. Therefore, we can set the lower bound of the $j$th diagonal element of $(X_S^\top X_S)^{-1}$ to 1.
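The VIF computation via the auxiliary $R_j^2$ regression can be sketched directly (our own numpy illustration; it includes an intercept in the auxiliary regression, which is the common convention):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R_j^2), where R_j^2 is
    the R^2 from regressing column j on the remaining columns (with intercept)."""
    n, p = X.shape
    others = [c for c in range(p) if c != j]
    A = np.column_stack([np.ones(n), X[:, others]])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    ss_tot = float(((X[:, j] - X[:, j].mean()) ** 2).sum())
    r2 = 1.0 - float(resid @ resid) / ss_tot
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(6)
Z = rng.normal(size=(100, 3))           # independent columns: VIF near 1
Z_col = Z.copy()
Z_col[:, 0] = Z_col[:, 1] + 0.05 * rng.normal(size=100)  # near-collinear pair
```

Independent columns give VIFs near the lower bound of 1, while the near-collinear column produces a very large VIF, which is why 1 is the valid (and attained-in-the-limit) lower bound used above.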
Finally, the lower bound of $\sigma_j(S)$ can be derived by multiplying the two lower bounds above: the relaxation-based lower bound of the MSE and the unit lower bound of the diagonal element of $(X_S^\top X_S)^{-1}$.
Thus, we can formulate the relaxed version of the significance-test constraint as
Since the lower bound of $\sigma_j(S)$ is a precomputed constant and the significance level is determined by the user, the resulting constraint is linear and can be added to the MIQP model.
We remark that this is a relaxed version of the exact $t$-test constraint and is useful for reducing the search space of the solver's branch-and-bound algorithm.
3.4 Final algorithm
Recall that the constraint above is a relaxation derived from the lower bound of $\sigma_j(S)$, so a feasible solution can still have statistically insignificant coefficients, while we want all coefficients to be statistically significant. Furthermore, it is challenging to formulate the tests that check the residual assumptions in Section 2.3 as linear or convex constraints. To overcome these limitations, we propose a lazy constraint-based approach to solve the following problem.
Lazy callbacks are mainly employed in practical implementations using an optimization solver when the number of constraints in an MIP model is extremely large [11]. For example, we may use a lazy callback when solving a traveling salesman problem (TSP) because it has a large number of subtour elimination constraints. Instead of adding all subtour elimination constraints at the root node of the branch-and-bound algorithm, we can iteratively and selectively add some of them. In detail, when the solver arrives at a branch-and-bound node and the solution contains a subtour, the lazy callback allows us to add the corresponding subtour elimination constraint at the current node. Although our MIQP model does not have an extremely large number of constraints, we can use the lazy callback to stop at a branch-and-bound node and check whether the solution satisfies all of the tests and diagnostics. The lazy constraint-based procedure is presented in the algorithm below.
In the algorithm, $S$ denotes the index set of selected variables at the current node in the branch-and-bound tree. A lazy constraint is added to the model when (a) the linear model associated with $S$ fails to pass at least one of the statistical tests for the coefficients or (b) the model fails to pass the residual diagnostics tests. The lazy constraint eliminates the solution of selecting all explanatory variables in $S$ from the feasible region. For example, consider a problem with 10 explanatory variables and their associated binary variables $z_1, \dots, z_{10}$. Suppose the solver is currently at a node whose solution selects a certain subset of the variables, so the corresponding $z_j = 1$ and all other coefficients are zero by the indicator constraint. The procedure generates a linear model with the selected explanatory variable set and performs the statistical tests and diagnostics for the model. When the regression model fails to pass at least one of the tests, a lazy constraint is added to the MIQP model. As a result, that selection becomes infeasible in all subsequent branches.
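The lazy "no-good" cut itself is simple to state. The helper below (our sketch in plain Python, standing in for a solver callback) builds the cut $\sum_{j \in S} z_j \le |S| - 1$ for a failed subset and checks a candidate 0-1 selection vector against it:

```python
def no_good_cut(selected):
    """For a failed subset S, the lazy constraint is sum_{j in S} z_j <= |S| - 1,
    which forbids selecting exactly this subset again."""
    cols = sorted(selected)
    return cols, len(cols) - 1

def satisfies_cut(z, cut):
    """Check a 0-1 selection vector z against one no-good cut."""
    cols, rhs = cut
    return sum(z[j] for j in cols) <= rhs

cut = no_good_cut({1, 4, 7})
z_failed = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # re-selects the failed subset
z_other = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # swaps variable 7 for variable 8
```

Any selection that keeps all of the failed subset violates the cut, while swapping even one variable satisfies it, which is exactly the pruning behavior described above.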
4 Alternative Solution Procedure for Infeasible or Difficult Problems
When all possible subsets violate at least one of the tests and assumptions, the model in Section 3.4 becomes infeasible and the algorithm will not return a solution. However, even in this case, a near-feasible solution is desired (i.e., one satisfying most of the tests and assumptions while mildly violating a few of them). In this section, we propose a procedure, referred to as the alternative solution procedure, to find the most attractive near-feasible solution when the original problem is infeasible or no feasible solution is found within a suitable period of time.
The alternative solution procedure is invoked at each branch-and-bound node while no feasible solution satisfying all tests and diagnostics has been found by the algorithm. Once a feasible solution is found, the alternative solution procedure is no longer invoked, since at least one solution satisfying all of the tests and diagnostics exists. When invoked, the procedure compares a new subset with the incumbent best subset (not feasible, but the best near-feasible alternative) and decides which is better. Our model updates and keeps only one alternative solution until it finds a feasible solution. The new subset, denoted as $S_1$, is the subset at the current branch-and-bound node, and the incumbent best subset, denoted as $S_2$, is the best alternative subset found so far. Remark that $S_1$ and $S_2$ must be infeasible solutions with some insignificant coefficients or residual assumption violations, as the alternative solution procedure is only invoked until a feasible solution is found. To simplify the discussion, we use $S_1$ and $S_2$ to refer to both the selected subsets and the corresponding linear regression models.
Note that one of the simplest approaches to compare $S_1$, $S_2$, and all other subsets is to add penalty terms to the objective function based on diagnostic violations, such as average p-values or the number of insignificant variables. Before presenting the details of the alternative solution procedure, we first explain why this simple penalty approach may fail and why the proposed procedure is needed. First, the penalty approach cannot effectively reflect the overall development procedure of a linear regression model. In typical model development, explained in Section 2, we first establish a linear regression model with statistically significant coefficients. We then check the residual assumptions via diagnostics. That is, we should consider the residual assumptions only after establishing a statistically significant model. This cannot be achieved with the simple penalty approach. Next, there exist nontrivial cases where the penalty approach may fail to select the better subset. The problem is discussed with the illustrative examples in Table 1.



[Table 1: p-values of the coefficient significance tests for $S_1$ and $S_2$ in Cases 1–3.]
Table 1 includes three cases for determining which solution is better between $S_1$ and $S_2$, where $k$ and the significance level $\alpha$ are set to 5 and 0.05, respectively. The set of p-values from the statistical tests for the coefficients is shown for each case. In Case 1, it is not surprising that we choose $S_1$, because the number of insignificant coefficients of $S_1$ is less than that of $S_2$, and the average of the violating p-values for $S_1$ is also smaller. On the other hand, $S_1$ must be better in Case 2 because the average of the violating p-values for $S_2$ is much larger than that of $S_1$, although the number of insignificant coefficients of $S_1$ is greater. In Case 3, although the average of the violating p-values of $S_2$ is less than that of $S_1$, selecting $S_1$ as the better solution seems reasonable because the difference is negligible while the number of insignificant coefficients of $S_2$ is significantly greater. These illustrative examples indicate that we should consider both the average of the violating p-values and the number of insignificant coefficients. However, due to the different magnitudes of the measures, it is not trivial to set appropriate weights. Consequently, it is necessary to develop an intuitive procedure that reflects the framework of linear regression model construction and further alleviates the scale difference between measures of diagnostic violations.
Now, we discuss the alternative solution procedure. To explain the procedure efficiently, we define the following functions related to the significance tests for the coefficients.
:  number of insignificant coefficients in the model with subset  
:  average p-value of the insignificant coefficients in the model with subset ,  
if all coefficients in subset are statistically significant  
:  decision by quality; , are subsets  
:  decision with tolerance; , are subsets  
The last two functions, and , decide whether or is better based on different principles, which are discussed in detail later in this section.
The overall alternative solution procedure is presented in Algorithm ?.
Step 1 handles the case where is a statistically significant model while is not. Hence, it is natural to return in Step 2. Steps 3 and 4 consider the opposite case. In Steps 5 to 9, the nontrivial cases are considered: both models are statistically significant in Step 5, and both are statistically insignificant in Step 7. For these two cases, , referred to as decision by quality, and , referred to as decision by tolerance, are employed to make a decision.
Function returns the better solution between and , determined by quantifying the quality of the solutions. The quantification uses the penalty function introduced below.
:  the MSE of solution  
:  p-value from the residual linearity test for solution  
:  p-value for the residual heteroscedasticity tests, defined as the maximum of the p-values from the absolute residual test and the Breusch-Pagan test  
:  penalty parameter for  
:  penalty parameter for  
:  penalty parameter for  
:  penalty parameter for  
:  penalty parameter for  
:  percentage transform function for ;  
:  percentage transform function for and ;  
where , , and are the significance levels of the tests for the coefficients, the diagnostics for residual linearity, and residual heteroscedasticity, respectively. Note that function includes the percentage transform functions and to express the percentage gap from the significance levels. We introduce these functions for two reasons. First, we want to measure the insignificance of the p-values accurately even when the significance levels are on different scales. Second, we do not want to apply a penalty when a p-value is within its significance level.
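Since the exact functional forms of the percentage transforms are not reproduced in this excerpt, the following sketch assumes they normalize the gap between a p-value and its significance level by the width of the possible violation range; the parameter names and default values are illustrative assumptions:

```python
def pct_gap_above(p, alpha):
    """q1 (assumed form): percentage gap for the coefficient tests, where a
    p-value above alpha is a violation; the gap is scaled by the possible
    violation range [alpha, 1]."""
    return max(0.0, (p - alpha) / (1.0 - alpha))


def pct_gap_below(p, alpha):
    """q2 (assumed form): percentage gap for the residual diagnostics, where a
    p-value below alpha is a violation; the gap is scaled by [0, alpha]."""
    return max(0.0, (alpha - p) / alpha)


def penalty(mse, p_coef, p_lin, p_het,
            alpha_coef=0.05, alpha_lin=0.01, alpha_het=0.01,
            lam=(0.5, 0.5, 0.5)):
    """F(S) sketch: MSE plus scaled penalties for test violations.
    Significance levels and penalty weights here are placeholders."""
    return (mse
            + lam[0] * pct_gap_above(p_coef, alpha_coef)
            + lam[1] * pct_gap_below(p_lin, alpha_lin)
            + lam[2] * pct_gap_below(p_het, alpha_het))
```

Note the penalty nullification discussed next: a residual p-value above its significance level contributes exactly zero, so the evaluation falls back to the MSE and coefficient terms.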
Two examples are provided for and to support our claims and demonstrate their necessity. The first example illustrates penalty nullification. Suppose and , which implies that the residuals of are homoscedastic while those of exhibit heteroscedasticity. Given , we can calculate and . The penalties make sense because does not violate the test. Hence, the evaluation of is based only on the MSE and the significance of the coefficients. The second example demonstrates the role of scaling between p-values from different statistical tests. Suppose that we get and . Further, and are both set to . Then, violates the tests of the coefficients by , which is equal to the value of the heteroscedasticity test violation . However, the violation of is relatively small for the possible test violation range compared to the possible residual test violation range. Thus, despite the same violation value, the significance of the violation can differ depending on the tests and diagnostics. We can scale the magnitudes of the violations through and : and .
We now discuss the decision with tolerance presented in Algorithm ?. Steps 1 and 2 are key to our procedure. If exceeds the tolerance , a positive parameter predetermined by the user, then the alternative solution procedure concludes that is better. When the average of the violating p-values of is smaller than, or not significantly worse (by less than ) than, that of , we conclude that is competitive and investigate further in Steps 3 to 11 by comparing the two solutions based on two factors, and . Steps 4 and 5 conclude that is better if it is smaller than in terms of both factors. The opposite case is handled in Steps 6 and 7. If there is a conflict between the two factors, settles the decision. Note that Algorithms ? and ? imitate the linear regression model building procedure described in Section 2 by first using the statistical significance tests to make a decision, followed by the other diagnostics.
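A minimal sketch of the decision-with-tolerance logic, under the assumption that the tolerance check is applied symmetrically and that the penalty function settles conflicts; the arguments `p`, `t`, and `F` stand for the average violating p-value, the insignificant-coefficient count, and the penalty function defined earlier:

```python
def decide_with_tolerance(s1, s2, delta, p, t, F):
    """Pick the better of subsets s1 and s2 (sketch of the decision with
    tolerance; delta is the user-chosen positive tolerance)."""
    # If one subset's average violating p-value exceeds the other's by more
    # than delta, the other subset wins outright.
    if p(s2) - p(s1) > delta:
        return s1
    if p(s1) - p(s2) > delta:
        return s2
    # Otherwise the subsets are competitive: compare on both factors.
    if t(s1) <= t(s2) and p(s1) <= p(s2):
        return s1
    if t(s2) <= t(s1) and p(s2) <= p(s1):
        return s2
    # Conflict between the two factors: the penalty function decides.
    return s1 if F(s1) <= F(s2) else s2
```

For example, if the average violating p-values differ by more than the tolerance, the comparison ends immediately; otherwise the counts of insignificant coefficients are consulted before falling back to the penalty.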
We illustrate the entire alternative solution procedure using the three toy examples presented in Table 1. Suppose that our model is searching for a subset whose cardinality is set to 5 and that a feasible solution has not yet been found. A user sets the significance levels and to 95 percent and , respectively.
Case 1.
Our model cannot determine which solution is better in the first step because and also . Hence, the decision with tolerance is invoked (Algorithm ?). In Step 1, the algorithm computes , , and . Thus, the next steps compare with and with . Since and , is determined to be the better solution.
Case 2.
Since and are both greater than , our model cannot determine which solution is better in Step 7 of Algorithm ?, and is called. In Algorithm ?, since in Step 2, is returned.
Case 3.
Similar to the previous cases, is called in Step 8 of Algorithm ?. In Algorithm ?, although and is greater than , Steps 4 to 10 are considered because . However, since is greater than and is less than , the winning solution is determined by .
Overall, we find that the alternative solution procedure picks the desired solution in each case discussed in Table 1. Throughout the procedure, we first compare with using the significance of the estimated coefficients, in particular with and . If this step cannot make a decision, the residual diagnostics are considered. This multistage procedure reflects the linear regression model building framework discussed in Section 2. The procedure also deals with the possible issues illustrated in Cases 2 and 3 in Table 1 by introducing . A user can intuitively set based on the average level of p-value difference that he or she can permit.
In the following sections, we refer to our final algorithm, which solves with Algorithm ? and the alternative solution procedure in Algorithms ? and ?, as .
5 Computational Results
In this section, the results of numerical experiments using the proposed model and benchmarks are presented. The experiments are designed to demonstrate the performance of the proposed model and the necessity of the alternative solution procedure presented in Section 4.
5.1 Benchmark dataset description and preprocessing
Twelve publicly available datasets are used in the experiments. We collect five datasets for regression from the UCI machine learning data repository [20] and seven datasets from several other sources. Table ? summarizes the features of the experimental datasets.
We preprocess the datasets as follows. First, we remove records that include at least one missing value. Second, dummy variables are introduced for the categorical variables. Third, nominal variables are converted into numerical variables. Next, log-transformed variables are created for the numerical variables in the original data. Finally, all of the datasets are standardized.
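The preprocessing steps above can be sketched with pandas as follows; inferring column roles from dtypes and shifting values to keep the log argument positive are assumptions, since the text does not fix those details:

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Sketch of the Section 5.1 preprocessing pipeline."""
    # Step 1: remove records with at least one missing value.
    df = df.dropna().copy()
    # Identify the numeric explanatory variables of the original data.
    num_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                if c != target]
    # Step 4: create log-transformed copies of the original numeric
    # variables (shifted so the log argument stays positive; assumed shift).
    for c in num_cols:
        col = df[c].astype(float)
        df[f"log_{c}"] = np.log(col - col.min() + 1.0)
    # Steps 2-3: dummy-code the categorical/nominal variables.
    df = pd.get_dummies(df, drop_first=True).astype(float)
    # Step 5: standardize every column to zero mean and unit variance.
    return (df - df.mean()) / df.std(ddof=0)
```

A design note: the log copies are created before dummy coding so that dummy columns do not receive spurious log transforms.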
5.2 Experimental design
In this section, we present the design of the experiments. To demonstrate the practical viability of our model, we set a time limit of 600 seconds for every experiment. The reported algorithm time includes the computation time for big in and in . For all experiments, we use the values in Table 2 for the significance levels of the tests and the parameters of the alternative solution procedure.
Table 2: Parameter values used in all experiments: significance levels (0.95, 0.99, 0.99), penalty parameters (0.5, 0.5, 0.5), and the remaining algorithm parameters (0.1, 4, 6).
Experiment 1: proposed model vs. simple benchmark models
We compare with two simple benchmark models: and a forward selection algorithm. With this experiment, the performance of the proposed model in terms of the aforementioned statistical significance and regression assumptions can be verified. The forward selection algorithm is presented in Algorithm ?.
The algorithm is constructed by modifying a standard forward selection algorithm to select from among the variables and their log transforms. In Step 5 of , once a variable is selected, both the original and the transformed variable are excluded from the candidate set. If , then the log-transformed variable is excluded as well; if , then the original variable is excluded as well. Thus, always provides a solution with at most one variable from each original and transformed variable pair.
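The pairing logic above can be sketched generically; `score` is a user-supplied objective to minimize (e.g., the MSE of the fitted model), and the pairing map is an assumed representation of the original/log-transform relationship:

```python
def forward_select(candidates, pairs, score, k):
    """Forward selection over original variables and their log transforms
    (a sketch). `pairs` maps each variable to its original/log counterpart;
    `score(subset)` returns the value to minimize; k is the subset size."""
    selected, available = [], set(candidates)
    while len(selected) < k and available:
        # Greedily pick the variable whose addition minimizes the score.
        best = min(available, key=lambda v: score(selected + [v]))
        selected.append(best)
        available.discard(best)
        # Exclude the paired (original or log-transformed) variable too.
        available.discard(pairs.get(best))
    return selected
```

With this exclusion rule the returned subset never contains both members of an original/log pair.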
For each dataset in Table ?, all models and algorithms are tested over . For each case, we check the goodness-of-fit with adjusted , denoted as . Although the objective of all models and algorithms is to minimize the SSE, this is equivalent to maximizing because is fixed. Also, since ranges from to , we can easily compare the goodness-of-fit across different values and datasets.
Experiment 2: proposed model vs. iterative model
We compare our model with an iterative algorithm used in [1] for cases where the solution of violates some of the tests and diagnostics. The algorithm iteratively adds constraints to avoid subsets with insignificant coefficients. We refer to this iterative model as in this paper. Recall that a key feature of our algorithm is its ability to avoid iterative calls of the MIQP solver, whereas iterative approaches solve multiple MIQPs by adding constraints. In this comparison, we demonstrate the effectiveness of our lazy constraint-based algorithm. As the two models have different constraints and parameters, we compare them using common constraints: the number of variables selected and the tests. In detail, we exclude the residual test constraints and set from the default parameters in Table 2.
The algorithm is initialized with the empty set of constraints CutSet in Step 1 and continues while the obtained subset has at least one statistically insignificant coefficient. In Step 3, the current model ( with the constraints in CutSet) is solved to obtain . In Steps 4 to 6, if includes a statistically insignificant coefficient, then a constraint that cuts from the feasible region is added to CutSet.
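The iterative loop can be sketched as follows, abstracting the MIQP solve into a callback; in the actual MIQP, a cut excluding a previously found subset S* can be written over the binary selection variables as sum over j in S* of z_j <= |S*| - 1, so that S* itself becomes infeasible:

```python
def iterative_subset_selection(solve_miqp, has_insignificant, max_iter=100):
    """Sketch of the iterative benchmark: repeatedly solve the MIQP and add
    a cut excluding any subset that has insignificant coefficients.
    `solve_miqp(cuts)` returns the best subset respecting the cuts (or None
    if infeasible); `has_insignificant(subset)` runs the coefficient tests."""
    cuts = []  # each cut forbids one previously found subset
    for _ in range(max_iter):
        subset = solve_miqp(cuts)
        if subset is None or not has_insignificant(subset):
            return subset  # done: optimal w.r.t. the significance tests
        cuts.append(frozenset(subset))  # exclude this exact subset
    return None
```

Each iteration re-solves the full MIQP, which is precisely the overhead the lazy constraint approach avoids.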
Experiment 3: alternative solution procedure vs. simple penalty function
We demonstrate the effectiveness of the alternative solution procedure by comparing it with a simple penalty function. The benchmark is obtained by replacing the alternative solution procedure with the simple penalty function , and will be referred to as .
5.3 Experimental Results
For all experiments, we use an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (8 CPUs) and 8GB RAM. All models and algorithms are implemented in Python, and the mathematical models are solved with Gurobi 7.0.0. For the construction of the linear model in , we employ the Python package statsmodels [29].
In every experiment, none of the solutions from the comparative models exhibits linearity between the residuals and fitted values, so the corresponding results are not reported.
Result for Experiment 1: proposed model vs. simple benchmark models
To measure the explanatory power of each model, we introduce a measure of relative explanatory power , where is a subset obtained by solving and is the solution of the corresponding model. We use the of as the denominator of because provides the greatest among all of the compared models, as has no diagnostic constraints.
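A small sketch of the two measures, assuming the standard adjusted R-squared formula and that the relative explanatory power is the plain ratio described above:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def relative_power(adj_r2_model, adj_r2_base):
    """Relative explanatory power: the model's adjusted R^2 divided by
    that of the unconstrained subset-selection solution."""
    return adj_r2_model / adj_r2_base
```

Because the base solution ignores all diagnostic constraints, its adjusted R-squared is an upper bound and the ratio lies at or below 1 for the constrained models.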
To provide better insights, we present the summarized results over the datasets and over the number of variables selected () in Tables ? and ?, respectively. In both tables, is the average over the datasets or values. To check the performance on feasible cases, we also present , which considers only the cases with feasible solutions available when calculating the average. Additionally, we count the number of cases satisfying (i) both the test and the residual test, (ii) the test only, and (iii) the residual test only.
In Table ?, we observe that obtains solutions with or above, except for the Servo, Barro, and Crime datasets. This implies that our model maintains explanatory power while satisfying all diagnostic constraints. The column also indicates that our model maintains an closer to that of the base model. The results in the 'test & residual test' column indicate that our model can find substantially more linear models satisfying both the statistical significance of all coefficients and the residual assumptions for most of the datasets. In particular, our model finds optimal solutions satisfying all diagnostic constraints in all cases for four datasets (bolded in the 'test & residual test' column). On the other hand, our model finds the same number of feasible solutions as the benchmarks for the Barro and Framing datasets, with equal to 1 for these datasets. In fact, the corresponding solutions of are exactly the same as those of our model. This implies that although finds an optimal solution without considering any diagnostics or statistical significance, the solution happens to have no insignificant coefficients and to satisfy the residual assumptions. Thus, the results in Table ? make clear that our model reliably provides a quality linear model regardless of the dataset. Finally, the execution times indicate that our model can find linear models within a practical timeframe.
Table ? presents the results averaged or summed over the number of variables selected (). We can verify that the results of our model for are close to 1 and that those in the 'test & residual test' column are substantially greater than those of the benchmarks, as in Table ?. This suggests that our model can also provide quality subsets regardless of the size of .
In Figure 8, we check the results of the alternative solution procedure for the cases where no feasible solution is found within the 600-second time limit: Servo, Winequality, Bodyfat, Barro, and Framing with . In the plot matrix of Figure 8, the columns and rows correspond to datasets and performance measures, respectively. In each plot of Figure 8, the horizontal and vertical axes present the corresponding performance measure and , respectively. Note that if a solution has no statistically insignificant coefficient because, as illustrated in Section 4, a penalty term is activated only when a violation of the corresponding statistical test occurs. For instance, heteroscedasticity would not be a concern for the Barro dataset because every case in the dataset satisfies the regression assumption ().
The plots in Figure 8 explicitly show the promising performance of the alternative solution procedure. According to the charts in the first row, the of the proposed model is kept above 0.89 except for the Winequality dataset. Moreover, except for the three cases marked by '' (Servo (), Barro (), and Framing ()), each of the measures , , and for the alternative solutions is better than or equal to that of the benchmarks. These results indicate that the alternative solutions dominate those of the benchmarks in most cases. For example, in the results for Bodyfat (), the and of are considerably less than those of the benchmarks while its is greater than 0.99. On the other hand, for the alternative solutions for Winequalityred () is relatively low compared with the other results. However, and are significantly better than those of the benchmarks. This implies that our model sacrifices explanatory power (i.e., MSE) to improve the statistical tests and diagnostics ( and ).
We investigate the three cases marked by '' in detail, where the alternative solutions do not dominate the benchmark solutions. In fact, these results are due to our algorithm's balancing of and given the default parameters and . Consider Framing () as an example. Let be the solution of with and , and let be the solution of with and from the results in Figure 8. These two solutions can be compared by our alternative solution procedure. Based on Algorithms ? and ?, the algorithm calls in Step 9 of Algorithm ?. Note that and that the differences in the corresponding values of and can be ignored. Thus, if is to be selected, it must hold that (). However, . In turn, the alternative solution procedure concludes that is the better solution.
Figure 9 presents representative residual plots (residuals versus fitted values) for AutoMPG (), Automobile (), and Hprice3 (). The corresponding p-values are presented in Table ?. The plots in Figure 9 show that the variance of the residuals from the benchmarks gradually increases, while the linear regression model derived from our model has a relatively constant residual trend across the fitted values. The p-values in Table ? show that our model provides better solutions, as a higher p-value is better for the heteroscedasticity tests.
Results for Experiment 2: proposed model vs. iterative model
In Table ?, we compare our model with the iterative model () for the cases where the solution of violates some of the tests and diagnostics. We only consider these 28 cases because neither algorithm is needed if can provide a solution that satisfies all tests and diagnostics. In the Status column, 'alt sol' means that, within the time limit, a model could not find a solution with statistically significant coefficients for all of the selected variables, and an alternative solution is obtained at termination. The terms 'opt' and 'best feas' represent cases where an optimal solution (satisfying all constraints) and a feasible solution are found, respectively. The term 'infeas' for means that the problem is infeasible and our model returns an alternative solution. The dominance relationship among these statuses is as follows: 'opt' 'best feas' 'alt sol' 'infeas'. In columns Status, , and , results indicating that our model is better are boldfaced. The results in Table ? are summarized in Table ?.
According to Tables ? and ?, outperforms in most cases while maintaining explanatory power above 90% of . In the cases where finds an optimal solution, it is appreciably faster than . Moreover, by comparing the performance measures and for the alternative solutions, we can see that the alternative solutions from our model dominate those from the iterative model in terms of the statistical significance of the coefficients, while the differences in explanatory power can be ignored in most cases. Furthermore, for Framing (), our model finds a better solution in terms of both explanatory power and statistical significance. Even in the three cases where performs better in (Bodyfat (), Winequalityred (), and Framing ()), outperforms in . Hence, our solutions are not outperformed by the iterative solutions. Several results for the Bodyfat and Barro datasets indicate that the explanatory power of our model is less than 80% of the . However, clearly outperforms in the sense of the statistical significance of the coefficients in these cases.
Result for Experiment 3: alternative solution procedure vs. simple penalty function
In Table ?, the p-values for the coefficients of the linear models derived by and for Bodyfat () are presented. Recall that the only difference between and is that the former uses the alternative solution procedure whereas the latter uses the simple penalty function . The p-values greater than are bolded. Note that the average of the violating p-values () for is significantly less than that of (). For , although one insignificant p-value in the model () is close to the significance level, the other two p-values of the insignificant coefficients are relatively large (0.3002, 0.3842) compared to (). Furthermore, the difference in is negligible. Thus, our model, which contains the alternative solution procedure, provides a better solution than the model employing the simple penalty function.
6 Conclusion
Although minimizing fitting errors is the most important task in building a regression model, it is also important to satisfy regression assumptions and diagnostics to build a good model. However, most of the current approaches cannot incorporate this model validation step into the algorithm.
In this work, we propose a fully automated model building procedure for multiple linear regression subset selection by integrating the model building and validation steps based on mathematical programming. The proposed procedure is also capable of providing an alternative solution when there is no feasible solution or a feasible solution is not found within the time limit. Since many of the constraints for the statistical tests are nonconvex and nonlinear, it is not trivial to solve the problem in its original form. To overcome this difficulty, we use the lazy constraint approach, which is available in many commercial optimization solvers. At every branch-and-bound node, the diagnostics are run to decide whether the current solution satisfies the assumptions and tests. If the solution at the current node violates some of them, our model adds a lazy constraint that restricts the selection of that subset.
Given a feasible problem (i.e., there exist subsets satisfying all of the assumptions and tests), our model provides an optimal subset minimizing the SSE while satisfying all assumptions for a fixed number of explanatory variables. However, the proposed model can fail to provide an optimal or even a feasible solution if the problem is infeasible or hard to solve within the time limit. Hence, we establish the alternative solution procedure to derive an alternative solution that is near-feasible. The procedure follows the typical steps for constructing and validating a linear regression model.
Through computational experiments using various real-world datasets, we show that our model can provide quality solutions satisfying all of the considered diagnostics while maintaining a value close to that of the base solution. Furthermore, the results demonstrate the viability of our model in real regression analysis under a practical time limit. We also find that the alternative solutions are superior to those from the benchmarks: their values are comparable, while being significantly better in terms of the diagnostics. We also show that our model based on lazy constraints outperforms the existing iterative approach; the proposed lazy constraint-based procedure is faster while providing higher-quality alternative solutions within the time limit. Lastly, we demonstrate the necessity of the alternative solution procedure by comparing it with the simple penalty function.
In conclusion, we strongly believe that our model is useful in practice because it integrates the model building (subset selection) and validation steps, whereas traditional approaches involve human decisions and require many trial-and-error steps. Furthermore, the proposed lazy constraint approach for addressing several statistical tests and diagnostics for linear regression within mathematical programming is substantially more efficient than the existing iterative model. One of the most challenging directions for improving the current procedure is formulating a linear or convex constraint to replace the current approximation constraint for the tests.
References
[1] D. Bertsimas and A. King. OR forum: An algorithmic approach to linear regression. Operations Research.
[2] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics.
[3] M. Blackburn and D. Neumark. Unobserved ability, efficiency wages, and interindustry wage differentials. The Quarterly Journal of Economics.
[4] T. Brader, N. A. Valentino, and E. Suhay. What triggers public opposition to immigration? Anxiety, group cues, and immigration threat. American Journal of Political Science.
[5] T. S. Breusch and A. R. Pagan. A simple test for heteroscedasticity and random coefficient variation. Econometrica: Journal of the Econometric Society.
[6] E. Carrizosa, A. V. Olivares-Nadal, and P. Ramirez-Cobo. Enhancing interpretability by tightening linear regression models. Technical report.
[7] G. Castellano and A. M. Fanelli. Variable selection using neural-network models. Neurocomputing.
[8] C. Cornwell and W. N. Trumbull. Estimating the economic model of crime with panel data. The Review of Economics and Statistics.
[9] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters.
[10] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association.
[11] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016. URL http://www.gurobi.com.
[12] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research.
[13] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning, volume 6.
[14] R. W. Johnson. Fitting percentage of body fat to simple body measurements. Journal of Statistics Education.
[15] A. G. Karegowda, M. A. Jayaram, and A. S. Manjunath. Feature subset selection problem using wrapper approach in supervised learning. International Journal of Computer Applications.
[16] R. Koenker and J. A. F. Machado. Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association.
[17] H. Konno and Y. Takaya. Multi-step methods for choosing the best set of variables in regression analysis. Computational Optimization and Applications.
[18] H. Konno and R. Yamamoto. Choosing the best set of variables in regression analysis using integer programming. Journal of Global Optimization.
[19] R. Leardi, R. Boggia, and M. Terrile. Genetic algorithms as a strategy for feature selection. Journal of Chemometrics.
[20] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[21] R. Meiri and J. Zahavi. Using simulated annealing to optimize the feature selection problem in marketing applications. European Journal of Operational Research.
[22] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association.
[23] R. Miyashiro and Y. Takano. Mixed integer second-order cone programming formulations for variable selection in linear regression. European Journal of Operational Research.
[24] R. Miyashiro and Y. Takano. Subset selection by Mallows' Cp: A mixed integer programming approach. Expert Systems with Applications.
[25] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models, volume 4.
[26] Y. W. Park and D. Klabjan. Subset selection for multiple linear regression via optimization. arXiv:1701.07920 [stat.ML].
[27] A. Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research.
[28] T. Sato, Y. Takano, R. Miyashiro, and A. Yoshise. Feature subset selection for logistic regression via mixed integer optimization. Computational Optimization and Applications.
[29] S. Seabold and J. Perktold. Statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010.
[30] W. Siedlecki and J. Sklansky. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters.
[31] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui. Best subset selection for eliminating multicollinearity. Optimization Online.
[32] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological).
[33] H. G. Wilson. Least squares versus minimum absolute deviations estimation in linear models. Decision Sciences.
[34] J. M. Wooldridge. Introductory Econometrics: A Modern Approach, volume 6.
[35] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. IEEE Intelligent Systems and their Applications.
[36] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research.
[37] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology).