Variable importance is central to scientific studies, including the social sciences, causal inference, healthcare, and other domains. However, current notions of variable importance are often tied to a specific predictive model. This is problematic: what if there were multiple well-performing predictive models, and a specific variable is important to some of them and not to others? In that case, we may not be able to tell from a single well-performing model whether a variable is always important in predicting the outcome. Rather than depending on variable importance for a single predictive model, we would like to explore variable importance for all approximately-equally-accurate predictive models. This work introduces the concept of a variable importance cloud, which maps every variable to its importance for every good predictive model. We show properties of the variable importance cloud and draw connections to other areas of statistics. We introduce variable importance diagrams as a projection of the variable importance cloud into two dimensions for visualization purposes. Experiments with criminal justice and marketing data illustrate how variables can change dramatically in importance for approximately-equally-accurate predictive models.
Keywords: variable importance, Rashomon set, interpretable machine learning
In predictive modeling, how do we know whether a feature is actually important? If we find an accurate predictive model that depends heavily on a feature, it does not necessarily mean that the feature is always important for good models. On the contrary, what if there is another equally accurate model that does not depend on the feature at all? Perhaps, to answer this question, we need a holistic view of variable importance that includes not just the importance of a variable to a single model, but to any accurate model. Variable importance clouds, which we introduce in this work, aim to provide a lens into the secret life of the class of almost-equally-accurate predictive models.
Ideally we would like to obtain a more complete understanding of variable importance for the set of models that predict almost equally well. This set of almost-equally-accurate predictive models is called the Rashomon set; it is the set of models with training loss below a threshold. The term Rashomon set comes from Breiman’s Rashomon effect (Breiman et al. (2001)), which is the notion that there could be many good explanations for any given phenomenon. Breiman (2001) also defined a useful notion of variable importance, namely the increase in loss that occurs when a variable is purposely scrambled (randomly permuted). Unfortunately, there is something fundamentally incomplete about considering these two quantities separately: if we look at variable importance only for a single model, we miss the potentially more important question of what the variable importance could be for another different but equally-accurate model. A variable importance cloud (VIC) is precisely the joint set of variable importance values for all models in the Rashomon set.
Specifically, we define a vector for a single predictive model, each element representing the dependence of the model on a feature. The VIC is the set of such vectors for all models in the Rashomon set. The VIC thus reveals the importance of a feature in the context of the importance of other features for all good models. For example, it may reveal that a feature is important only when another feature is not important, which may happen when these features are highly correlated. Understanding the VIC helps interpret predictive models and provides a context for model selection. This type of analysis provides a deeper understanding of variable importance, going beyond single models and now encompassing the set of every good model. In this paper, we analyze the VIC for linear models, and extend the analysis to some of the nonlinear problems including logistic regression, decision trees, and deep learning.
Some existing methods can be used to analyze the impact of features on predictive models. Friedman (2001) uses the partial dependence plot (PDP) to visualize the impact of a feature on the average prediction as the feature varies within its support. If the prediction changes drastically as the feature changes, or visually if the PDP of that feature is steep, it suggests this feature has a large impact on prediction based on the model, and is hence an important variable for this model. There are three differences between the PDP and the VIC. First, the PDP reveals the importance of a feature for a single predictive model, not for multiple models (it does not address the Rashomon effect). Second, the PDP reveals local importance while the VIC reveals overall importance. Finally, the PDP measures importance by the difference in prediction outcomes while the VIC measures it by the difference in prediction losses.
Another method related to visualizing the impact of a feature on prediction is the partial leverage plot (PLP), proposed by Velleman and Welsch (1981). It regresses the outcome variable and a feature against the rest of the features, and plots the residuals against each other. The PLP reveals how the information embodied uniquely by a feature marginally affects the prediction. Unlike the VIC, which works for any predictive model, the PLP works only for linear models. Moreover, the PLP is for single models, and does not take the Rashomon effect into account either.
In recent work, Casalicchio et al. (2018) proposed a measure called partial importance (PI), which identifies the average effect of changing the value of a feature on the model performance. By plotting the changed feature values against model performances, the graph identifies the regions in the feature space where the feature is more (or less) important to the model. This visualization tool provides richer information than aggregated measures of variable importance (for instance, the one proposed by Breiman (2001)). Unlike our work, the PI plot visualizes the variable importance for single models.
Model class reliance (Fisher et al. (2018)) is a method to study variable importance that does address the Rashomon effect. Using this method, one can estimate the bounds for the importance of a feature, in the sense that any good predictive model cannot rely on this feature to the degree that exceeds the bounds. While this method allows for comparisons of variable importance for different features, it overlooks the possible correlation in variable importance among those features. Some feature may never (or only) be important when some other feature is important. This piece of information, which is exactly the target of the current work, leads to a better understanding of the dataset and the interpretability of predictive models.
When there are many features of interest, the VIC becomes a subset of a high dimensional space. To facilitate understanding of the VIC, we propose a visualization tool called the variable importance diagram (VID). It is a collection of 2d projections of the VIC onto the space spanned by the importance of a pair of features. The VID offers graphical information about the magnitude of variable importance measures, the bounds, and the relation of variable importance for each pair of features. An upward-sloping projection suggests that a feature is important only when the other feature is also important, and vice versa for a downward-sloping projection. We provide examples of VIDs in the context of concrete applications, and illustrate how the VID facilitates model interpretation and selection.
The remainder of the paper is organized as follows. In Section 2, we introduce notation and definitions. In Section 3.1, we derive an analytical expression for the VIC for linear models. In Section 3.2, we show that the VIC for linear models is scale-invariant in the features, yet variant in the outcome variables. In Section 3.3, we show that the VIC for linear models is an ellipsoid in the special case where the features are uncorrelated. In Section 3.4, we offer a linear approximation of the VIC. Section 3.5 visualizes the VIC for linear models with two features in order to gain intuition. We then proceed to nonlinear problems in Section 4. We propose algorithms for deriving the VIC for logistic regression and decision trees in this section, as well as the general approach for any nonlinear problem. In Section 5.1, we introduce the visualization tool VID. In Section 5.2, we analyze the connection of our measure of variable importance to hypothesis testing for linear models. In Section 5.3, we discuss the idea of trading off accuracy for variable importance. We demonstrate the VIC and VID with concrete examples in Section 6, which includes three experiments. In Section 6.1, we study the ProPublica dataset for criminal recidivism prediction and demonstrate the VIC/VID analysis for both logistic regression and decision trees. We move on to an in-vehicle coupon recommendation dataset in Section 6.2 and illustrate the trade-off between accuracy and variable importance. We finally study an image classification problem based on VGG16 in Section 6.3. We discuss the related work in Section 7 and conclude in Section 8.
For a vector , we denote its element by and all elements except for the one by . For a matrix , we denote its transpose by , row by , and column by .
Let be a random vector of length , with being a positive integer, where is the vector of covariate variables (referred to as features) and is the outcome variable. Our dataset is an matrix, , where each row is an i.i.d. realization of the random vector .
Let be a predictive model, and be the class of predictive models we consider. For a given model and an observation , let be the loss function. The expected loss and empirical loss of model are defined by and . Without causing confusion, we may drop the superscript or the reliance on the data for simplicity. We consider different classes of predictive models and loss functions in the paper, including least squares loss, logistic loss, and 0-1 loss.
2.2 Rashomon Set
Fix a predictive model as a benchmark. Following Fisher et al. (2018), a model is considered as a “good” one if its loss does not exceed the loss of by a factor . A Rashomon set is the set of all good models in the class . In most cases, we select to be the best model within the set that minimizes the loss, and we define this way in what follows.
Definition 2.1 (Rashomon Set).
Given a model class , a benchmark model , and a factor , the Rashomon set is defined as
Note that the Rashomon set also implicitly depends on the loss function and the dataset.
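As a minimal sketch of Definition 2.1 for a finite pool of candidate models, the snippet below keeps every candidate whose loss is within a multiplicative factor of the benchmark loss. The threshold form `(1 + eps) * best_loss`, the helper name `rashomon_set`, and the toy loss values are assumptions made for illustration; they are not taken from the definition's displayed formula.

```python
def rashomon_set(losses, eps):
    """Return the names of models whose loss is at most (1 + eps) * best loss.

    `losses` maps a model name to its (empirical) loss. The benchmark is the
    minimizer within the pool, mirroring the convention in the text. The
    multiplicative (1 + eps) threshold is an assumption of this sketch.
    """
    if eps < 0:
        raise ValueError("eps must be non-negative")
    best_loss = min(losses.values())
    threshold = (1.0 + eps) * best_loss
    return {name for name, loss in losses.items() if loss <= threshold}


# Hypothetical losses for three candidate models.
candidate_losses = {"f1": 0.20, "f2": 0.21, "f3": 0.26}
good_models = rashomon_set(candidate_losses, eps=0.10)
```

With `eps = 0` the set collapses to the loss minimizers; larger values admit more models, which is what lets the VIC grow.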
2.3 Model Reliance
For a given model , we want to measure the degree to which its predictive power relies on a particular variable , where . Like Fisher et al. (2018), our notion of model reliance is similar to the notion of variable importance used by random forest (Breiman (2001)). Let be another random vector that is independent of and identically distributed as . We replace the with , which gives us a new vector denoted by .
Intuitively, is larger than , since it breaks the connection between feature and outcome . We refer to this increment as the reliance of model on feature , and interpret it as the importance of feature to model . Formally,
Definition 2.2 (Model Reliance).
The (population) reliance of model on variable is given by either the ratio
or the difference
depending on the specific application.
This notion can be alternatively defined with the empirical dataset and loss function. Larger indicates greater reliance on feature . We may drop the superscript for simplicity. With this measure, we can further define the model reliance function, which specifies the importance of each feature to a predictive model.
Definition 2.3 (Model Reliance Function).
The function maps a model to a vector of its reliances on all the features by
We refer to as the model reliance vector of model .
2.4 Variable Importance Cloud and Diagram
For a single model , we compute its model reliance vector , which shows how important the features are to the single model. But usually, there is no clear reason to choose one model over another equally-accurate model. Thus, the model reliance of a single model hides how important a variable could be. Accordingly, it hides the joint importance of multiple variables. Variable importance clouds explicitly characterize this joint importance of multiple variables. The variable importance cloud (VIC) consists of the set of model reliance vectors for all predictive models in the Rashomon set .
Definition 2.4 (VIC).
The Variable Importance Cloud of the Rashomon set is given by
While the VIC is a set in the -dimensional space, we may project it onto lower dimensional spaces for visualization. We construct a collection of such projections, referred to as the Variable Importance Diagram (VID). Both the VIC and the VID embody rich information. This argument will be illustrated with concrete applications later.
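To make the relation between the VIC and the VID concrete, the following sketch treats a VIC as an array with one model reliance vector per row and builds the pairwise 2d projections. The function name `variable_importance_diagram` and the toy reliance values are hypothetical; how the reliance vectors are obtained is left out here.

```python
from itertools import combinations

import numpy as np


def variable_importance_diagram(vic):
    """Project a VIC of shape (n_models, n_features) onto every feature pair.

    Returns a dict mapping (j, k) to an (n_models, 2) array: the model
    reliances on feature j and feature k for every model in the cloud.
    """
    vic = np.asarray(vic, dtype=float)
    n_features = vic.shape[1]
    return {(j, k): vic[:, [j, k]] for j, k in combinations(range(n_features), 2)}


# Toy VIC: three good models, three features (values are made up).
vic = np.array([[1.30, 1.05, 1.00],
                [1.10, 1.25, 1.00],
                [1.20, 1.15, 1.02]])
vid = variable_importance_diagram(vic)

# Per-feature bounds on reliance across the good models.
bounds = {j: (vic[:, j].min(), vic[:, j].max()) for j in range(vic.shape[1])}

# A negative correlation hints at a downward-sloping projection:
# feature 0 matters most when feature 1 matters least.
corr_01 = np.corrcoef(vic[:, 0], vic[:, 1])[0, 1]
```

Each entry of `vid` can be passed to any scatter-plot routine to draw one panel of the diagram.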
3 VIC for Linear Models
In this section, we restrict our attention to the class of linear models . We use the least squares loss with -regularization parameterized by . That is, We focus on the expected loss and define model reliance using in this section.
We define and characterize the VIC for linear models, and show that it is invariant to the relative scaling of the covariates. It turns out that it is hard to study the VIC analytically due to the non-linear nature of the model reliance function MR. To gain intuition, we analyze a special case where the features are uncorrelated. We then turn to the general case and study the VIC under a linear approximation. We will provide a 2d visualization of the VIC to make the concepts more concrete.
3.1 Rashomon Set and VIC for Linear Models
Fix a random vector . For a linear model , the expected loss is given by
Given a benchmark model , a factor , following Definition 2.1, the Rashomon set for linear models is defined as
That is, a linear model is in the Rashomon set if it satisfies
Observe that if the random vector is normalized so that the expectation is zero, then captures the covariance structure among the features, and captures the covariance between the outcome and the features. Therefore, the Rashomon set for linear models is completely determined by the covariances.
The model reliance function MR in Definition 2.3 turns out to have a specific formula for linear models, given by the lemma below.
Given a random vector and the least squares loss function ,
As a result, the model reliance function for linear models becomes
The lemma is proved as Theorem 2 in Fisher et al. (2018). (That theorem has only been proven for the least squares loss function without regularization, but one can easily see that adding a regularization term does not change the expression.) Note that the function MR is non-linear in . With a slight abuse of notation, we define as the inverse function that maps variable importance vectors to coefficients of a linear model (rather than the model itself). That is, instead of . We assume the existence of the inverse function .
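As an illustration of why MR is quadratic in the coefficients, the sketch below compares a closed form against a permutation estimate. The closed form is our own derivation from the definition of difference-based reliance under the unregularized squared loss, MR_j = 2 β_j Cov(Y, X_j) − 2 β_j Σ_{k≠j} β_k Cov(X_k, X_j); it may be written differently from the lemma's expression. All function names and the simulated data are assumptions.

```python
import numpy as np


def linear_model_reliance(beta, cov_xx, cov_xy):
    """Difference-based reliance of f(x) = x @ beta on each feature.

    Squared loss, no regularization (assumed form, derived in the lead-in):
    MR_j = 2*beta_j*Cov(Y, X_j) - 2*beta_j*sum_{k != j} beta_k*Cov(X_k, X_j).
    """
    beta = np.asarray(beta, dtype=float)
    off_diag = cov_xx - np.diag(np.diag(cov_xx))  # Cov(X_j, X_k) for k != j
    return 2.0 * beta * (cov_xy - off_diag @ beta)


def permutation_reliance(beta, X, y, j, rng):
    """Increase in mean squared error after permuting column j once."""
    shuffled = X.copy()
    shuffled[:, j] = rng.permutation(X[:, j])
    return np.mean((y - shuffled @ beta) ** 2) - np.mean((y - X @ beta) ** 2)


# Simulated data with two correlated features (correlation about 0.6).
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

beta = np.array([0.8, 0.7])  # an arbitrary linear model, not the minimizer
cov = np.cov(np.column_stack([X, y]), rowvar=False)
closed_form = linear_model_reliance(beta, cov[:2, :2], cov[:2, 2])
empirical = np.array([permutation_reliance(beta, X, y, j, rng) for j in range(2)])
```

When the off-diagonal covariances vanish, the second term drops out and the reliance on feature j becomes linear in β_j, which is the situation studied in the uncorrelated special case.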
Theorem 2 (VIC for Linear Models).
Fix a benchmark model , a factor . Let . Then a vector if it satisfies
The theorem suggests that the VIC for linear models depends solely on the covariance structure of the random vector , which includes , , and . (The function also depends solely on the covariance structure of .)
3.2 Scale of Data
In this subsection, we set the regularization parameter to be 0. We are interested in how the VIC is affected by the scale of our data . We prove that the VIC is scale-invariant in features . Rescaling the outcome variable does affect the VIC, as it should.
Corollary 2.1 (Scale of VIC).
Let with for all , and . It follows that
where denotes the VIC with respect to the Rashomon set with and being the model that minimizes the expected loss with respect to , and is defined in the same way for the scaled variable .
The proof of the corollary is given in Appendix A. This corollary suggests that the importance of a feature does not rely on its scale, in the sense that rescaling a feature does not change the reliance of any good predictive model on the feature. (In contrast, the magnitude of the coefficients in linear models is sensitive to the scale of the data.) Note that the corollary holds only for linear models with no regularization.
3.3 Special Case: Uncorrelated Features
As Equation 3.3 suggests, to analyze the VIC for linear models, the key is to study the inverse model reliance function . Unfortunately, due to the non-linear nature of MR, it is difficult to get a closed-form expression of the inverse function in general. In this section, we focus on the special case that all the features are uncorrelated in order to understand some properties of the VIC, before proceeding to the correlated case in later subsections.
Corollary 2.2 (Uncorrelated features).
Suppose that for all . Let be the minimum expected loss within the class , and choose the minimizer as the benchmark for the Rashomon set . Then the VIC for linear models, , is an ellipsoid centered at
with radius along dimension as follows:
Moreover, when the regularization parameter is 0,
where is the correlation coefficient between and .
The result is useful in three ways. First, it pins down the variable importance vector for the best linear model. Second, it tells us that for any accurate model, the reliance on feature is bounded by . Third, within the set of models that have the same expected loss, the surface of the ellipsoid tells us how a reduction in the reliance on one feature can be compensated by an increase in the reliances on other features.
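For intuition, the following worked derivation sketches why the VIC is an ellipsoid in the uncorrelated case. It assumes centered variables, no regularization, difference-based reliance, a Rashomon threshold of the form $(1+\epsilon)L^*$, and $c_j \neq 0$ for every feature. The shorthand $\sigma_j^2 = \mathrm{Var}(X_j)$, $c_j = \mathrm{Cov}(Y, X_j)$, $\rho_j = c_j / (\sigma_j \sqrt{\mathrm{Var}(Y)})$ and $r_j$ is ours and may not match the notation of Corollary 2.2.

```latex
\begin{align*}
L(\beta) &= \mathbb{E}[Y^2] - 2\sum_j c_j \beta_j + \sum_j \sigma_j^2 \beta_j^2,
  \qquad \beta_j^* = \frac{c_j}{\sigma_j^2}, \\
L(\beta) - L^* &= \sum_j \sigma_j^2 \,(\beta_j - \beta_j^*)^2 \;\le\; \epsilon L^*, \\
mr_j &= 2 c_j \beta_j
  \quad\Longrightarrow\quad
  mr_j^* = \frac{2 c_j^2}{\sigma_j^2} = 2 \rho_j^2 \,\mathrm{Var}(Y), \\
\sum_j \frac{(mr_j - mr_j^*)^2}{r_j^2} &\le 1,
  \qquad r_j = \frac{2\,|c_j|}{\sigma_j}\sqrt{\epsilon L^*}
             = 2\,|\rho_j|\sqrt{\mathrm{Var}(Y)\,\epsilon L^*}.
\end{align*}
```

The second line is the Rashomon set written as an ellipsoid in coefficient space; substituting $\beta_j = mr_j / (2 c_j)$ gives the last line, an axis-aligned ellipsoid in reliance space whose radius along feature $j$ grows with $|\rho_j|$.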
3.4 Approximation of VIC with Correlated Features
We now proceed to the general case of correlated features. A key difference from the special case is that the MR function defined by Equation 3.2 is no longer linear. As a result, the VIC is no longer an ellipsoid.
While it always works to numerically compute the true VIC, from which we can directly get (1) the model reliance vector for the best linear model and (2) the bounds for the reliance on each feature for any model in the Rashomon set, it is hard to see how the reliances on different features change when we switch between models with the same loss (which is revealed by the surface of the VIC). To that end, we propose a way to approximate the VIC as an ellipsoid. Under the approximation, we can at least numerically compute the parameters of the ellipsoid, including the center, radii, and how it is rotated. We also comment on the accuracy of the approximation.
Observe that Equation 3.2 is a quadratic function of . By invoking Taylor’s theorem, we have
where is an arbitrary vector,
Equation 3.4 is exact, since Equation 3.2 is quadratic and therefore has no terms beyond second order. The quadratic term in Equation 3.4 is small if either is small or the Hessian matrix is small. The former happens when we focus on small Rashomon sets and the latter happens when the features are less correlated. In both cases, approximating with only the linear term in Equation 3.4 would be close to the original function .
If we ignore the higher order term, the relationship between the model reliance vector and the coefficients can be more compactly written with the Jacobi matrix ,
where the th row of is . That is,
where and .
We assume the Jacobi matrix is invertible. (Cases where this would not be true are, for instance, cases where are all 0, which means there is no signal for predicting from the ’s.) Then we can linearly approximate the inverse MR function as follows.
For an arbitrary vector , the approximated is given by
We can choose any to approximate , and we should choose it depending on our purpose. If we are interested in the approximation performance at the boundary of the Rashomon set, it makes sense to pick that lies on the boundary. If instead we care about overall approximation performance, we should choose , which is the vector that minimizes expected loss. We can apply Definition 3.1 to Theorem 2 as follows.
Fix a benchmark model , a factor . Pick a . A vector mr is in the approximated VIC if it satisfies
The theorem suggests that the approximated VIC is an ellipsoid. Therefore, we can study its center and radii and perform the same tasks as mentioned in the previous subsection. More details are provided in Appendix C, namely the formula for the ellipsoid approximation of the VIC for correlated features. In what follows, we discuss the accuracy of the approximation.
Recall that from Equation 3.4 we have , where and . By dropping the second order term, we introduce the following error for ,
Note that is bounded by the radius of the Rashomon ellipsoid along dimension , denoted by . Then it follows that
It demonstrates our intuition that the approximation is more accurate when there is less correlation among the features or when the Rashomon set is smaller.
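The linearized inverse can be sketched numerically as follows. The code assumes the difference-based reliance for unregularized squared loss, MR_j(β) = 2 β_j (Cov(Y, X_j) − Σ_{k≠j} β_k Cov(X_k, X_j)), which is our own derivation rather than a formula copied from the text. The names `mr`, `mr_jacobian`, `approx_inverse_mr` and the covariance values are hypothetical.

```python
import numpy as np


def mr(beta, cov_xx, cov_xy):
    """Model reliance vector of a linear model (assumed difference form)."""
    off_diag = cov_xx - np.diag(np.diag(cov_xx))
    return 2.0 * beta * (cov_xy - off_diag @ beta)


def mr_jacobian(beta, cov_xx, cov_xy):
    """Jacobi matrix J with J[j, k] = d MR_j / d beta_k."""
    off_diag = cov_xx - np.diag(np.diag(cov_xx))
    jac = -2.0 * beta[:, None] * off_diag            # entries with k != j
    jac[np.diag_indices_from(jac)] = 2.0 * (cov_xy - off_diag @ beta)
    return jac


def approx_inverse_mr(mr_vec, beta_bar, cov_xx, cov_xy):
    """First-order inverse: beta ~ beta_bar + J^{-1} (mr - MR(beta_bar))."""
    jac = mr_jacobian(beta_bar, cov_xx, cov_xy)
    return beta_bar + np.linalg.solve(jac, mr_vec - mr(beta_bar, cov_xx, cov_xy))


# Hypothetical covariance structure with mild feature correlation.
cov_xx = np.array([[1.0, 0.3], [0.3, 1.0]])
cov_xy = np.array([0.8, 0.5])
beta_bar = np.linalg.solve(cov_xx, cov_xy)           # expand around the loss minimizer

beta_true = beta_bar + np.array([0.01, -0.02])       # a nearby model
beta_recovered = approx_inverse_mr(mr(beta_true, cov_xx, cov_xy), beta_bar, cov_xx, cov_xy)
error = np.abs(beta_recovered - beta_true).max()
```

The recovery error is second order in the distance from the expansion point and scales with the off-diagonal covariances, in line with the error discussion above.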
3.5 Visualization of the VIC for Linear Models in 2D
We visualize the VIC for linear models in the simplest 2-feature case for a better understanding of the VIC. Let . We normalize the variables so that . It follows that
Recall that Theorem 2 and Lemma 1 suggest that the VIC is completely determined by the matrix . Moreover, by Corollary 2.1, we can assume without loss of generality that . Effectively, the only parameters are the correlation coefficients , , and , which are the covariances normalized by the standard deviations .
We visualize the VIC with regularization parameter . As is discussed above, for larger the Rashomon set has a smaller size, so that the VIC is closer to an ellipse.
Figure 1 visualizes the special case where the features are uncorrelated. The left panel of Figure 1 is the Rashomon set. The axes are the values of the coefficients. The Rashomon set is centered at the coefficient of the best linear model. Each ellipse is an iso-loss curve and the outer curves have larger losses. The right panel of Figure 1 is the VIC, with the axes being the model reliances on the features. The center point is the model reliance vector of the best linear model, and each curve corresponds to a Rashomon set in the left panel. As is pointed out in Corollary 2.2, when the features are uncorrelated, the VIC is an ellipsoid. We can also observe that the VIC ellipses are narrower along than , since , which also demonstrates the result in Corollary 2.2.
When the features are correlated, the VIC is no longer an ellipsoid. We can see from Figure 2 below that indeed this is the case. The upper left panel contains the Rashomon sets and the upper right panel contains the corresponding VICs. Since there is not much correlation between and , the VICs are close to ellipses, especially if we are interested in the inner ones which correspond to smaller Rashomon sets.
As before, we may be interested in approximating the VIC with an ellipse. The lower left panel of Figure 2 is the approximated VIC where we invoke Taylor’s theorem at the center of the Rashomon set. We can see that the approximated VIC, which is represented by the dashed curve, is indeed close to the true VIC. If we are interested in the performance at the boundary, we may want to invoke Taylor’s theorem at the boundary of the Rashomon set, which is visualized by the lower right panel, for four different points on the boundary.
The VIC can no longer be treated as an ellipse when there is large correlation in the features, and the approximation is far from accurate. This is illustrated by Figure 3 below.
4 VIC for Nonlinear Problems
Now that we understand the VIC for linear models, we will apply our analysis to broader applications.
4.1 General Problems
Our analysis for linear models has made clear that to study the VIC, there are two key ingredients: (1) finding the Rashomon set and (2) finding the MR function. While the algorithm for finding the Rashomon set can only be designed case by case, depending on the class of predictive models we are interested in, we discuss the algorithm for finding the MR function in detail here, in the context of general problems.
We adopt the following procedure to compute empirical model reliance. (See Fisher et al. (2018) for various versions of empirical model reliance.) Recall that the (population) reliance of model on feature is defined as the ratio of and . The latter is the original expected loss, which can be computed with its empirical analog. The former is the shuffled loss after replacing the random variable with its i.i.d. copy . We permute the column of our dataset and compute the empirical loss based on the shuffled dataset. Averaging this empirical shuffled loss over several random permutations yields a good approximation of the expected shuffled loss.
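The permutation procedure described above can be sketched as follows for the ratio form of model reliance. The function name `empirical_model_reliance`, its parameters, and the toy model and data are hypothetical; any callable predictor and per-observation loss can be plugged in.

```python
import numpy as np


def empirical_model_reliance(predict, loss, X, y, j, n_perm=20, seed=0):
    """Ratio of the average shuffled loss to the original empirical loss.

    predict: maps an (n, p) array to n predictions.
    loss:    maps (y, predictions) to n per-observation losses.
    j:       index of the feature whose column is permuted.
    """
    rng = np.random.default_rng(seed)
    original_loss = loss(y, predict(X)).mean()
    shuffled_losses = []
    for _ in range(n_perm):
        X_shuffled = X.copy()
        X_shuffled[:, j] = rng.permutation(X[:, j])  # break the link to y
        shuffled_losses.append(loss(y, predict(X_shuffled)).mean())
    return np.mean(shuffled_losses) / original_loss


# Toy example: the model uses feature 0 only, so feature 1 should have
# reliance 1 (no change in loss) and feature 0 a reliance far above 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X[:, 0] + 0.1 * rng.normal(size=500)


def predict(features):
    return features[:, 0]


def squared_loss(target, prediction):
    return (target - prediction) ** 2


mr_0 = empirical_model_reliance(predict, squared_loss, X, y, j=0)
mr_1 = empirical_model_reliance(predict, squared_loss, X, y, j=1)
```

Replacing the final division with a subtraction gives the difference form of reliance.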
While this method works for general datasets, in some applications with binary datasets (both features and outcome are binary variables), the empirical model reliance can be computed with a simpler method. Suppose we are interested in the reliance of a predictive model on variable . Compute the loss when and the loss when . Find the frequency for to be in the dataset. Then the shuffled loss is .
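For binary features, the shortcut can be sketched as below. Because the displayed formula is not reproduced in this text, the code assumes the natural reading: the shuffled loss is the frequency-weighted average of the loss with the feature forced to 1 and the loss with it forced to 0, which equals the average over all permutations of that column. The name `binary_shuffled_loss` and the toy data are hypothetical.

```python
import numpy as np


def binary_shuffled_loss(predict, X, y, j):
    """Expected 0-1 loss after replacing binary feature j by an independent draw.

    Assumed formula: p * loss(X_j := 1) + (1 - p) * loss(X_j := 0),
    where p is the empirical frequency of X_j = 1.
    """
    p = X[:, j].mean()
    X_one = X.copy()
    X_one[:, j] = 1
    X_zero = X.copy()
    X_zero[:, j] = 0
    loss_one = np.mean(predict(X_one) != y)
    loss_zero = np.mean(predict(X_zero) != y)
    return p * loss_one + (1 - p) * loss_zero


# Toy data: the outcome equals feature 0, and the model predicts feature 0.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])


def predict(features):
    return features[:, 0]


shuffled_0 = binary_shuffled_loss(predict, X, y, j=0)
shuffled_1 = binary_shuffled_loss(predict, X, y, j=1)
```

No random permutations are needed, so the result is deterministic and requires only two passes over the data per feature.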
In what follows, we introduce the VIC for nonlinear problems, including logistic regression and decision trees.
4.2 Logistic Regression
For a dataset with and , we define the following empirical logistic loss function
where and is the row of .
We consider the logistic model class . Notice that we can identify this set with , since every logistic model is completely characterized by . Therefore, we define instead to represent parameter space. Let be the coefficient that minimizes the logistic loss. Then the Rashomon set and the variable importance cloud are given by Definitions 2.1 and 2.4. Let us go through these steps for logistic regression.
The empirical MR function is introduced in Section 4.1. Let us now find the Rashomon set. Note that there is no closed-form expression for the Rashomon set for logistic regression models, but it is convex. We approximate it with a -dimensional ellipsoid in . Under this approximation, we can sample the coefficients from the Rashomon set, and proceed to the next step of VIC analysis. Below is the algorithm we use to approximate the Rashomon set with an ellipsoid.
1. Find the best logistic model that minimizes the logistic loss. Let be the minimum loss.
2. Initial sampling: Randomly draw a set of coefficients in a “box” centered at . Eliminate the coefficients that give logistic losses that exceed .
3. PCA: Find the principal components. Compute the center, radii of axes, and the eigenvectors. To get the boundary of the Rashomon set, resample coefficients from a slightly larger ellipsoid with radii multiplied by some scaling factor . The sampling distribution is a distribution along the radial axis in order to get more samples closer to the boundary. Eliminate the coefficients that give logistic losses that exceed .
4. Repeat the third step times.
There are four parameters in the whole process: the number of coefficients to sample in each step, the size of the box for initial sampling, the scaling factor that scales the ellipsoid, and the number of iterations . The parameter needs to be large enough for robustness, but not so large that computation becomes expensive. The size of the box should be large enough to include some points outside the Rashomon set, so as to get a rough boundary for the Rashomon set. The same applies to the factor . We need to repeat times to get a stable boundary of the Rashomon set.
In our experiments, we set and set the size of the box as a certain factor times the standard deviation of the logistic estimator so that about of the sampled coefficients survive the elimination in the initial round.
We tune the other two parameters to get a robust approximation of the Rashomon set. Suppose we want to choose the optimal scaling factor and number of iterations from a set of candidates. Let be an upper bound of the scaling factors. For each candidate , we implement the above algorithm and get the resulting ellipsoid. We sample coefficients from this ellipsoid scaled by , and count the number of coefficients that remain in the Rashomon set and compute the survival rate. This number is related to the performance of the algorithm with parameter .
We first discuss the consequence of changing the scaling factor . If the factor is too small, every coefficient is in the Rashomon set. Effectively, we are sampling from a strict subset of the Rashomon set, even though we scaled it by the factor , so that the survival rate is 1. If we use the algorithm with this pair of parameters , we only get a subset of the Rashomon set. Hence the approximation is not accurate. As increases, the ellipsoid that approximates the Rashomon set grows bigger. As we test the performance by sampling from the ellipsoid scaled by the upper bound on the scaling factor, namely , we are sampling from a superset of the Rashomon set. Only a fixed portion of the sampled points are in the Rashomon set, because both the Rashomon set and are fixed.
As grows, the survival rate would first decrease and then become stable. The factor at which the survival rate becomes stable should be used in the algorithm. This is because with this factor, we get the boundary of the Rashomon set. The resulting ellipsoid well-approximates the Rashomon set. Figure 4 below demonstrates the argument.
Now we discuss the consequence of changing . Due to the initial sampling in a box, we expect that it takes several iterations to approximate the Rashomon set. Therefore, the survival rate may change when is small. On the other hand, when becomes large, the survival rate should not change, since effectively we are repeatedly sampling from the same ellipsoid. We should pick the value of at which the survival rate becomes stable.
To conclude, we have an algorithm to approximate the Rashomon set and know how to empirically compute the variable importance. This allows us to derive the VIC for logistic regression models.
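A rough sketch of the ellipsoid approximation for logistic regression is given below. Several details are assumptions rather than the procedure used in the paper: labels are coded as −1/+1, the loss is averaged over observations, the Rashomon threshold is `(1 + eps)` times the minimum loss, each radius is the largest projection of the surviving points onto a principal axis, and the radial distribution is Beta(d, 1), a stand-in for the unspecified distribution that favors the boundary. The sketch needs at least two features, and all function names and default values are hypothetical.

```python
import numpy as np


def losses(B, X, y):
    """Mean logistic loss of each coefficient row in B (labels in {-1, +1})."""
    margins = y[:, None] * (X @ np.atleast_2d(B).T)   # shape (n_obs, n_coef)
    return np.logaddexp(0.0, -margins).mean(axis=0)


def fit_logistic(X, y, steps=25):
    """Newton's method for the unregularized logistic loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 0.5 * (1.0 - np.tanh(0.5 * y * (X @ beta)))  # sigmoid(-margin)
        grad = -(X * (y * s)[:, None]).mean(axis=0)
        hess = (X * (s * (1.0 - s))[:, None]).T @ X / len(y)
        beta = beta - np.linalg.solve(hess, grad)
    return beta


def fit_ellipsoid(points):
    """PCA of the surviving coefficients: center, principal axes, radii."""
    center = points.mean(axis=0)
    _, axes = np.linalg.eigh(np.cov(points, rowvar=False))
    radii = np.abs((points - center) @ axes).max(axis=0)
    return center, axes, radii


def sample_ellipsoid(center, axes, radii, n, rng):
    """Draw n points inside the ellipsoid, with most mass near its boundary."""
    d = len(center)
    direction = rng.normal(size=(n, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = rng.beta(d, 1.0, size=(n, 1))            # assumed radial law
    return center + (direction * radius * radii) @ axes.T


def approximate_rashomon(X, y, eps, n=2000, box=1.0, scale=1.2, iters=5, seed=0):
    rng = np.random.default_rng(seed)
    beta_star = fit_logistic(X, y)                                  # step 1
    threshold = (1.0 + eps) * losses(beta_star, X, y)[0]

    def survivors(B):
        return B[losses(B, X, y) <= threshold]

    draws = beta_star + rng.uniform(-box, box, size=(n, X.shape[1]))
    points = survivors(draws)                                       # step 2
    for _ in range(iters):                                          # steps 3-4
        center, axes, radii = fit_ellipsoid(points)
        points = survivors(sample_ellipsoid(center, axes, scale * radii, n, rng))
    return beta_star, threshold, points


# Small synthetic example (data-generating coefficients are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X @ np.array([1.0, -0.5]) + rng.logistic(size=300) > 0, 1.0, -1.0)
beta_star, threshold, points = approximate_rashomon(X, y, eps=0.05, n=500, box=0.5)
```

The returned `points` are sampled coefficients inside the Rashomon set; computing the model reliance vector of each one yields a sampled VIC.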
4.3 Decision Trees
In this subsection, we implement the VIC analysis for binary datasets. The method extends to categorical data as well.
For a binary dataset with and , a decision tree is represented by a function . (Strictly speaking, this function represents an equivalence class of decision trees.) A decision tree splits according to feature if there exists with and , and . We restrict our attention to the set of trees that split according to no more than features, and denote this class by . The purpose is to exclude overfitted trees.
We define loss as misclassification error (0-1 loss). In particular,
where is the row of . Let be the tree in our model class that minimizes the loss. Then we have the Rashomon set and the variable importance cloud defined as usual. We use the method in Section 4.1 to find the MR function and below we describe how to find the Rashomon set.
Again, there is no closed-form expression for the Rashomon set for decision trees. This time, we are going to search for the true Rashomon set, without approximation, using the fact that features are binary. (The same method extends to categorical features.) Suppose we want to find all “good trees” that split according to features in the set , with . There can be at most unique realizations of the features that affect the prediction of the decision tree. Moreover, there are at most equivalence classes of decision trees, since the outcome is also binary. The naïve method is to compute the loss for each equivalence class of trees, and the collection of “good trees” forms the Rashomon set.
While this method illustrates the idea, it is computationally infeasible even for as low as 4. Alternatively, for each of the unique observations, we count the frequency of and and record the gap of these counts for each observation. (We will define the gap formally in the next section.) The best tree predicts according to the majority rule. The second and third best trees flip the prediction for the observation with the smallest and second-smallest gaps, respectively. The fourth best tree either flips the prediction for the observation with the third-smallest gap, or for both observations with the smallest and second-smallest gaps, whichever incurs the smaller increase in loss. Searching for trees with this method, we can stop the process once we reach a tree that has more than loss, where is the loss of the best tree. This method is computationally feasible.
The above algorithm finds the Rashomon set. With the Rashomon set, computing the VIC is straightforward. We just need to compute the variable importance vector for each model in the Rashomon set. The way to compute the variable importance vector is again in Section 4.1.
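The gap-based search can be sketched in a few lines. This is a minimal illustration rather than the implementation used in our experiments: the function names are hypothetical, the rows of `X` are assumed to be already restricted to the chosen subset of features, and the stopping rule is written as an absolute budget `max_extra_errors` on the number of additional misclassifications relative to the best tree, which is an assumption about how the threshold is expressed.

```python
def pattern_gaps(X, y):
    """Map each unique feature pattern to (majority label, gap = |n1 - n0|)."""
    counts = {}
    for row, label in zip(X, y):
        counts.setdefault(tuple(row), [0, 0])[label] += 1
    return {p: (int(n1 >= n0), abs(n1 - n0)) for p, (n0, n1) in counts.items()}


def rashomon_trees(X, y, max_extra_errors):
    """Enumerate every tree (one prediction per pattern) whose number of
    misclassifications exceeds the best tree's by at most max_extra_errors."""
    info = pattern_gaps(X, y)
    patterns = sorted(info, key=lambda p: info[p][1])  # smallest gap first
    best = {p: info[p][0] for p in patterns}           # majority-rule tree
    found = []

    def extend(start, flipped, extra):
        # Record the tree obtained by flipping the patterns chosen so far.
        tree = dict(best)
        for p in flipped:
            tree[p] = 1 - tree[p]
        found.append((extra, tree))
        # Try flipping one more pattern, in order of increasing gap.
        for i in range(start, len(patterns)):
            gap = info[patterns[i]][1]
            if extra + gap > max_extra_errors:
                break  # gaps are sorted, so later patterns cost at least as much
            extend(i + 1, flipped + [patterns[i]], extra + gap)

    extend(0, [], 0)
    return sorted(found, key=lambda item: item[0])
```

Each returned pair holds the number of extra errors and the corresponding tree, so the first entry is the majority-rule tree and the list ends at the budget. Because flipping a leaf raises the 0-1 loss by exactly its gap, the search only has to consider subsets of leaves whose gaps sum to at most the budget.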
4.3.1 Clustering Structure
The VIC for decision trees is different from that for logistic models in two ways. First, the VIC for decision trees is discrete. Second, there might be a clustering structure in the VID for decision trees.
To explain the first difference, note that we define model reliance differently. For decision trees, it is defined as the ratio of 0-1 losses before and after shuffling the observations. For logistic models, it is defined as the ratio of logistic losses. While logistic loss is continuous in coefficients for a given dataset, 0-1 loss may jump discretely even for a small modification of the tree. That explains why the VIC for decision trees is discrete. The remainder of this subsection attempts to gain intuition about the clustering structure.
For any possible realization of the features , let be the number of observations with features and outcome 1. is defined similarly. For simplicity, we leave out the sparsity restriction. In this case, the best decision tree can be defined as
For illustration, we consider the clustering structure along the dimension, which pertains to feature . Let be the loss associated with and be its reliance on . We now characterize the conditions so that there are clusters of points in the VIC. Fix a vector . Let and for . For example, is the number of observations with feature and outcome . Consider the tree that satisfies
That is, flips the prediction for the observation only. This tree has a total loss that is larger than by the gap :
We assume that so that the tree is in the Rashomon set.
Now consider the shuffled loss. Observe that and differ only when . When computing the shuffled loss, the difference comes from the observations with whose values for feature remain the same after shuffling, and the observations with whose values for feature 1 change after shuffling. There are observations with features and outcome , and observations with features and outcome . Therefore, the former situation can contribute to the difference in shuffled loss by no more than (and the loss for is larger). Similarly, the latter situation can contribute to the difference in shuffled loss by no more than (yet it is ambiguous whether the loss for is larger or smaller than by ).
Since the best tree has loss and reliance on , its shuffled loss is . The original loss of is and its shuffled loss is , where . Therefore, we know that the reliance of the two trees on feature differs by
For the set of decision trees that flip the prediction of at one leaf and whose increments in loss do not exceed , if none of them has a large , then there is no cluster in dimension. Otherwise there could be clustering; we show this empirically for a real dataset in Section 6.1.2.
5 Ways to Use VIC
We discuss the ways to use the VIC in this section and focus on understanding variable importance in the context of the importance of other variables and providing a context for model selection.
5.1 Understanding Variable Importance with VIC/VID
The goal of this paper is to study variable importance in the context of the importance of other variables. We illustrate in this section how VIC/VID achieves this goal with a simple thought experiment regarding criminal recidivism prediction. As background, in 2016 questions arose from a faulty study by the ProPublica news organization about whether a model (COMPAS, Correctional Offender Management Profiling for Alternative Sanctions) used throughout the US court system was racially biased. In their study, ProPublica found a linear model for COMPAS scores that depended on race; they then concluded that COMPAS must depend on race, or on proxies for race that were not accounted for by age and criminal history. This conclusion is based on methodology that is not sound: what if there existed another model that did not depend on race (given age and criminal history), but also modeled COMPAS well?
While we will study the same dataset ProPublica used to analyze variable importance for criminal recidivism prediction in the experiment section, we perform a thought experiment here to see how VIC/VID addresses this problem. Consider the following data-generating process. Assume that a person who has committed a crime before (regardless of whether he or she was caught or convicted) is more likely to recidivate, and that this holds independently of race or age. However, for some reason (e.g., discrimination) a person is more likely to be found guilty (and consequently to have a prior criminal history) if he or she is either young or black. Under these assumptions, there might be three categories of models that predict recidivism well, relying on race, age, or prior criminal history, respectively, as the most important variable. It is not sound to conclude without further justification that recidivism depends on race.
In fact, we may find all three categories of models in the Rashomon set. The corresponding VIC may look like a three-dimensional ellipsoid in the space spanned by the importance of race, age and prior criminal history. Note that the surface of the ellipsoid represents the models with the same loss. We may find that, staying on the surface, if the importance of race is lower, then the importance of either age or prior criminal history is higher. We may conclude that race is important for recidivism prediction only when age and prior criminal history are not important, which is a more comprehensive understanding of the dataset and of the whole class of well-performing predictive models than ProPublica’s claim.
For more complicated datasets and models, the VICs lie in higher-dimensional spaces, making it hard to draw any conclusion by looking at them directly. In these situations, we need to resort to the VID. In the context of the current example, we project the VIC into the spaces spanned by pairs of the features, namely (age, race), (age, prior criminal history) and (race, prior criminal history). Each projection might look like an ellipse. Under our assumptions regarding the data-generating process, we might expect, for example, a downward-sloping ellipse in the (race, prior criminal history) space, indicating that the importance of race and the importance of prior criminal history substitute for each other.
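The projection step is simple enough to sketch directly. The snippet below is only an illustration under stated assumptions: `vid_projections` and `ls_slope` are hypothetical helper names, each VIC point is assumed to be a tuple of model reliances in the same order as `names`, and the least-squares slope is used merely as a crude numerical summary of whether a panel slopes downward.

```python
from itertools import combinations


def vid_projections(vic_points, names):
    """Project each model-reliance vector onto every pair of variables.

    Returns a dict mapping (name_i, name_j) to the list of 2-D points
    that would be drawn in the corresponding panel of the diagram.
    """
    panels = {}
    for i, j in combinations(range(len(names)), 2):
        panels[(names[i], names[j])] = [(p[i], p[j]) for p in vic_points]
    return panels


def ls_slope(points):
    """Least-squares slope of the second coordinate on the first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var  # undefined if all first coordinates are equal
```

A negative slope in the (race, prior criminal history) panel would be consistent with the substitution pattern described above, although the shape of the boundary carries more information than a single slope.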
The axes of this thought experiment are the same as those observed in the experiments in Section 6.1; there however, we make no assumption about the data-generating process.
5.2 Trading off Error for Reliance: Context for Model Selection
VIC provides a context for model selection. As we argued before, we think of the Rashomon set as a set of almost-equally accurate predictive models. Finding the single best model in terms of accuracy may not make a lot of sense. Instead, we might have other concerns (beyond Bayesian priors or regularization) that should be taken into account when we select a model from the Rashomon set. Effectively, we trade off our pursuit for accuracy for those concerns.
For example, in some applications, some of the variables may not be admissible. When making recidivism predictions, for instance, we want to find a predictive model that does not rely explicitly on racial or gender information. If there are models in the Rashomon set that rely on neither race nor gender, we should use them at the cost of reducing predictive accuracy. This cost is arguably negligible, since the model we switch to is still in the Rashomon set. It could be the case that every model in the Rashomon set relies on race to some non-negligible extent, suggesting that we cannot make good predictions without resorting to explicit racial information. While this limitation would be imposed by the dataset itself, and while the trade-off between accuracy and reliance on race is based on modeler discretion, VIC/VID would discover that limitation.
In addition to inadmissible variables, there could also be situations in which we know a priori that some of the variables are less credible than the others. For instance, self-reported income variables from surveys might be less reliable than education variables from census data. We may want to find a good model that relies less on variables that are not credible. VIC is a tool to achieve this goal.
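As a rough sketch of this selection step, suppose the sampled models are stored as records holding a loss and a dictionary of model reliances. The record layout, the function name `least_reliant_model`, and the explicit loss cutoff `max_loss` are all assumptions made for illustration; they are not part of the method described above.

```python
def least_reliant_model(models, variable, max_loss):
    """Among models with loss <= max_loss, return the one with the smallest
    reliance on `variable`; ties are broken in favor of lower loss.

    models: list of dicts with keys "loss" (float) and "reliance"
            (dict mapping variable name to model reliance).
    Returns None if no model satisfies the loss cutoff.
    """
    admissible = [m for m in models if m["loss"] <= max_loss]
    if not admissible:
        return None
    return min(admissible, key=lambda m: (m["reliance"][variable], m["loss"]))
```

Comparing the loss of the returned model with the loss of the most accurate model gives the price paid, in accuracy, for the reduced reliance.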
5.3 Variable Importance and Its Connection to Hypothesis Testing for Linear Models
Recall that model reliance is computed by comparing the loss of a model before and after we randomly shuffle the observations for the variable. Intuitively, this should tell the degree to which the predictive power of the model relies on the specific variable. Another proxy for variable importance for linear models could be the magnitude of the coefficients (assuming features have been normalized). When the coefficient is large, the outcome is more sensitive to changes in that variable, suggesting that the variable is more important. This measure is also connected to hypothesis testing; the goal of this subsection is to illustrate this.
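To make the shuffling idea concrete, the sketch below computes model reliance as the ratio of the loss after switching to the loss before switching. Instead of a single random shuffle it averages over all pairs of distinct observations, which removes the randomness; this particular estimator, and the names `model_reliance` and `zero_one_loss`, are assumptions for illustration and may differ in detail from the procedure of Section 4.1.

```python
def zero_one_loss(prediction, target):
    """Misclassification indicator."""
    return float(prediction != target)


def model_reliance(predict, loss, X, y, j):
    """Ratio of switched loss to original loss for feature index j.

    The switched loss replaces feature j of observation i by feature j of
    every other observation k, and averages the loss over all such pairs.
    A value of 1 means the model's loss does not depend on feature j.
    """
    n = len(X)
    e_orig = sum(loss(predict(x), t) for x, t in zip(X, y)) / n
    total = 0.0
    for i in range(n):
        for k in range(n):
            if k == i:
                continue
            x = list(X[i])
            x[j] = X[k][j]  # break the link between feature j and the rest
            total += loss(predict(x), y[i])
    e_switch = total / (n * (n - 1))
    return e_switch / e_orig  # undefined when the original loss is zero
```

For a classifier that ignores feature `j`, switching that feature leaves every prediction unchanged, so the ratio is exactly 1.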
We first argue that the magnitude of the coefficients is a different measure of variable importance than model reliance. Coefficients do not capture the correlations among features, whereas model reliance does. We illustrate this argument with Figure 3. The dotted line in the upper left panel is the set of models within the Rashomon set that have the same and different . The coefficients might suggest that feature is equally important to each of these models, because ’s coefficient is the same for all of them. (The coefficient is 0.5.) We compute the model reliance for these models and plot them with the dotted line in the upper right panel of Figure 3. (One can check that these indeed form a line.) This suggests that these models rely on feature to different degrees. This is because the variable importance metric based on coefficients ignores the correlations among features. On the other hand, model reliance on is computed by breaking the connection between and the rest of the data . One can check that , which intuitively represents the correlation between and the variation of not explained by . Therefore, this measure is affected by the correlation between and .
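A small numerical illustration of this point, under assumptions of our own choosing: the toy data, the two coefficient vectors, and the all-pairs switching estimator below are hypothetical and are not taken from Figure 3. The two linear models share the same coefficient on the first feature but differ in the coefficient on the second, correlated, feature. They are not claimed to lie in a common Rashomon set; the sketch only shows that equal coefficients need not imply equal reliance.

```python
def squared_loss_reliance(beta, X, y, j):
    """Model reliance of the linear model x -> sum(beta * x) on feature j,
    using squared loss and averaging over all pairs of distinct observations."""
    n = len(X)

    def predict(x):
        return sum(b * v for b, v in zip(beta, x))

    e_orig = sum((t - predict(x)) ** 2 for x, t in zip(X, y)) / n
    total = 0.0
    for i in range(n):
        for k in range(n):
            if k != i:
                x = list(X[i])
                x[j] = X[k][j]  # switch in feature j from observation k
                total += (y[i] - predict(x)) ** 2
    return (total / (n * (n - 1))) / e_orig


# Toy data in which the two features are positively correlated.
X_toy = [(0, 0), (1, 1), (2, 1), (3, 3)]
y_toy = [0, 1, 2, 3]

# Same coefficient (0.5) on feature 0, different coefficient on feature 1.
mr_a = squared_loss_reliance((0.5, 0.50), X_toy, y_toy, 0)
mr_b = squared_loss_reliance((0.5, 0.25), X_toy, y_toy, 0)
```

Here `mr_a` and `mr_b` differ even though the coefficient on the first feature is 0.5 in both models, because the residual that the first feature must explain depends on how much of the outcome the second feature already accounts for.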
While one can check whether a variable is important or not by hypothesis testing, this technique relies heavily on parametric assumptions. On the other hand, model reliance does not make any additional assumption beyond that the observations are i.i.d. However, given the same set of assumptions for testing the coefficients (Footnote 4: See Appendix D for the set of assumptions.), we can also test whether the model reliance of the best linear model on each feature is zero or not when the regularization parameter is 0.
Fix a dataset . Let be the best linear model. Let , the empirical model reliance function for variable , be given by
where and are the empirical covariance and variance. Let
where is the gradient of with the population covariance and variance replaced by their empirical analogs, and is the variance of the estimator, which is standard for hypothesis testing. Then,
where is the true coefficient.
That is, suppose we want to test whether variable is not important at all. Theorem 4 implies that under ,
This allows us to test if the population model reliance for the best linear model on variable is zero. If variable is not important, our testing statistic is distributed.
6 Experiments
In this section, we apply VIC/VID analysis to real datasets to demonstrate its use. We work with criminal recidivism prediction data, in-vehicle coupon recommendation data, and image classification data.
6.1 Experiment 1: Recidivism Prediction
As we introduced before, the ProPublica news organization found a linear model for COMPAS scores that depends on race, and concluded that COMPAS is racially biased. This conclusion is unwarranted, since there could be other models that explain COMPAS well without relying on race. (See also Flores et al. (2016).)
To investigate this possibility, we study the same dataset of 7214 defendants in Broward County, Florida. The dataset contains demographic information as well as the prior criminal history and 2-year recidivism information for each defendant. Our outcome variable is recidivism, and the covariates are age, race, gender, prior criminal history, juvenile criminal history, and current charge. (Footnote 5: recidivism = 1 if a defendant recidivates in two years. age = 1 if a defendant is younger than 20 years old. race = 1 if a defendant is black. gender = 1 if a defendant is a male. prior = 1 if a defendant has at least one prior crime. juvenile = 1 if a defendant has at least one juvenile crime. charge = 1 if a defendant is charged with crime.) We explore two model classes: logistic models and decision trees. In our analysis below, we find that in both classes there are indeed models that do not rely on race. Moreover, race tends to be an important variable only when prior criminal history is not important.
6.1.1 VID for Logistic Regression
Since we have 6 variables, the VIC is a subset of $\mathbb{R}^6$. We display only the VID (see Figure 5) based on four variables: age, race, prior criminal history and gender.
The first row of the VID is the projection of the VIC onto the space spanned by age and each of the other variables of interest, with the variable importance of age on the vertical axis. We can see that the variable importance of age is roughly bounded by $[1, 1.05]$, which suggests that no good model relies on age to a degree of more than 1.05, and that there exists a good model that does not rely on age. Note that the bounds are the same for any of the three diagrams in the first row.
By comparing multiple rows, we observe that the variable importance of prior criminal history has the greatest upper bound, and the variable importance of gender has the lowest upper bound. Moreover, prior criminal history has the greatest average variable importance and gender has the lowest average importance. We also find that, for each of the four variables, there exist models that do not rely on it. However, the diagrams in the third row reveal that there are only a few models with variable importance of prior criminal history being 1, while the diagrams in the fourth row reveal that there are many models with variable importance of gender being 1. All of this evidence indicates that prior criminal history is the most important variable of those we considered, while gender is the least important one among the four.
We now focus on the diagram at Row 3 Column 2, which reveals the variable importance of prior criminal history in the context of the variable importance of race. We see that when the importance of race is close to 1.05, which is its upper bound, the variable importance of prior criminal history is in the range of . On the other hand, when the importance of prior criminal history is close to 1.13, which is its upper bound, the variable importance of race is lower. The scatter plot has a slightly downward-sloping right edge. Since the boundary of the scatter plot represents models with equal loss (because they are on the boundary of the Rashomon set), the downward-sloping edge suggests that as we rely less on prior criminal history, we must rely more on race to maintain the same accuracy level. In contrast, the diagram at Row 3 Column 1 has a vertical right edge, suggesting that we can reduce the reliance on prior criminal history without increasing the reliance on age.
6.1.2 VID for Decision Trees
In this subsection we work on the same dataset but focus on a different class of models, the class of decision trees that split according to no more than 4 features. The restriction put on splitting aims to avoid overfitting. The VID (see Figure 6) for the same four variables of interest is given below. There is a striking difference between the VID for decision trees and logistic models: the former is discrete and has clustering structure. This demonstrates our analysis in Section 4.3.
The VID for decision trees also reveals that prior criminal history is the most important variable for decision trees. However, gender becomes more important than age for decision trees. Figure 6 at Row 3 Column 2, regarding the variable importance of prior criminal history and race, also suggests a substitution pattern: the importance of prior criminal history is lower when race is important, and vice versa.
6.2 Experiment 2: In-vehicle Coupon Recommendation
In designing practical classification models, we might desire to include other considerations besides accuracy. For instance, if we know that one of the variables may not always be available when the model is deployed, we might prefer to choose a model that does not depend as heavily on that variable. Suppose, for example, that we deploy a model that provides social services to children. In the training set we possess all the variables for all of the observations, but in deployment, the school record may not always be available. In that case, it would be helpful, all else being equal, to have a model that did not rely heavily on school record. It also happens fairly often in practice that the sources of some of the variables are not trustworthy or reliable. In this case, we face the same trade-off between accuracy and the desired variable importance characteristics of the model. This section provides an example where we create a trade-off between accuracy and variable importance; among the set of accurate models, we choose one that places less importance on a chosen variable.
We study a dataset about mobile advertisements documented in Wang et al. (2017), which consists of surveys of 752 individuals. In each survey, an individual is asked whether he or she would accept a coupon for a particular venue in different contexts (time of the day, weather, etc.). There are 12,684 data cases within the surveys.
We use a subset of this dataset, and focus on coupons for coffee shops. Acceptance of the coupon is the binary outcome variable, and the binary covariates include zeroCoffee (takes value 1 if the individual never drinks coffee), noUrgentPlace (takes value 1 if the individual has no urgent place to visit when receiving the coupon), sameDirection (takes value 1 if the destination and the coffee shop are in the same direction), expOneDay (takes value 1 if the coupon expires in one day), withFriends (takes value 1 if the individual is driving with friends when receiving the coupon), male (takes value 1 if the individual is a male), and sunny (takes value 1 if it is sunny when the individual receives the coupon).
We compute the VIC for the class of logistic regression models. Rather than providing the corresponding VID, we display only coarser information about the bounds of the variable importance in Table 1, sorted by importance.
| Variable | Upper bound | Lower bound | |
|---|---|---|---|
| zeroCoffee | 1.31 | 1.19 | most important variable |
| sunny | 1.00 | 1.00 | least important variable |
Obviously, whether a person ever drinks coffee is a crucial variable for predicting if she will use a coupon for a coffee shop. Whether the person has an urgent place to go, whether the coffee shop is in the same direction as the destination, and whether the coupon is going to expire immediately are important variables for prediction too. The other variables seem to be of minimal importance.
| Least reliance on noUrgentPlace | Least-error logistic model |
|---|---|
| logistic loss = 2366 | logistic loss = 2296 |
Suppose we think a priori that the variable noUrgentPlace is unreliable since the survey does not actually place people in an “urgent” situation. In that case, we may want to find an accurate predictive model with the least possible reliance on this variable. This is possible with VIC. Table 2 illustrates the trade-off.
The first and third columns in the table are the coefficient vectors for the two different models. The first column represents the model with the least reliance on noUrgentPlace within the VIC. The coefficients in the third column are for the plain logistic regression model. The second and fourth columns in the table are the model reliance vectors for the two models. The second column is the vector in the VIC that minimizes the reliance on noUrgentPlace. The fourth column is the model reliance vector for the plain logistic regression model. By comparing the second and fourth columns, we see that we can find an accurate model that relies less on noUrgentPlace. However, its logistic loss is 2366, which is about 3% higher than the loss of 2296 for the plain logistic regression model. This illustrates the trade-off between reliance and accuracy. By comparing these two columns, we also find that as we switch to the model with the least reliance on noUrgentPlace, the reliance on zeroCoffee, sameDirection and withFriends increases.
6.3 Experiment 3: Image Classification
The VIC analysis can be useful for any domain, including image classification and other problems that involve latent representations of data. We want to study how image classification relies on each of the latent features and how models with reasonable prediction accuracy can differ in terms of their reliance on these features.
We collected 1572 images of cats and dogs from ImageNet, and we use VGG16 features (Simonyan and Zisserman (2014)) to analyze them. We use the convolutional base to extract features and train our own model. In particular, we get a vector of latent features of length 512 for each of our images. That is, our input dataset is of size 1572 × (512+1), where is the latent representation of the raw data .
To get a sense of the performance of the pre-trained VGG16 model on our dataset as a benchmark, we build a fully connected neural network with two layers and train it with the data . The accuracy of this model is about 75% on the training set. We then apply logistic regression and perform the VIC analysis on the dataset. Given the large dimension of the features and the relatively small sample size, we impose an $\ell_1$ penalty on the loss function. With cross-validation to select the best penalty parameter, we get a logistic model with non-zero coefficients on 61 of the features. The accuracy of this classifier on the training sample is about , which is approximately the same as that of the neural network we trained. We will restrict our analysis to the 61 features for simplicity.
We use the same method as Section 6.1 and randomly sample 417 logistic models in the Rashomon set. We restrict our attention to the four latent features with the highest variable importance. In addition to constructing the VID, which we will do later, we explore another possible visualization tool, t-SNE (see Figure 7). While the resulting representation of the VIC is not informative about variable importance, we separate it into four clusters and color the VID accordingly (see Figure 8).
From the VID, we gain a comprehensive view of the importance of these four latent features. However, it remains unclear how variable importance is linked to the function of the predictive models. For example, we do not know how a model that relies heavily on feature 160 and feature 28 is different from a model that does not rely on them at all.
To answer this question, we select a representative model from each of the clusters and attempt to visualize these four models. We consider the following visualization method. Given an input image, we ask a model: How would you modify the image so that you believe it is more likely to be a dog/cat? Given the functional form of the model, gradient ascent would answer this question. We choose a not-too-large step size and number of iterations so that the modified images of the four representative models are not too far from the original one (so that we can interpret them) yet display significant differences (see Figure 9).
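The gradient-ascent idea can be sketched for a logistic model as follows. This is a simplified illustration, not the procedure used to produce Figure 9: it moves a feature vector directly rather than back-propagating through the convolutional base to the pixels, and the function name `ascend_input`, the default step size, and the default number of iterations are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ascend_input(x, weights, bias, step=0.1, iters=20, toward=1):
    """Move x to raise (toward=1) or lower (toward=0) the predicted P(y = 1)
    of the logistic model with the given weights and bias."""
    x = list(x)
    sign = 1.0 if toward == 1 else -1.0
    for _ in range(iters):
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        # For a logistic model, dP/dx = p * (1 - p) * weights.
        grad = [p * (1.0 - p) * w for w in weights]
        x = [v + sign * step * g for v, g in zip(x, grad)]
    return x
```

Keeping `step` and `iters` small keeps the modified input close to the original, which mirrors the choice described above; two models with different weight vectors will push the same input in different directions.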
The upper panel represents the modification that increases the probability of being a dog, and the lower panel does the opposite. In each panel, the left part is the original image. The middle part is the output images after modification. The right part is the gray-scale images of the absolute value of the difference between the input and output images. In a gray-scale image, a darker pixel indicates more significant modification. Note that the gray-scale images do not differentiate how the pixels are modified. For example, the models “amplify” the head to make it more like a dog, while they “erase” the head to make it more like a cat. The gray-scale images do not tell what operation (amplify or erase) is implemented on the image; for that, we need to look at the images on the left.
Overall, the four representative models modify the input image similarly. This is because they have similar accuracy in the classification problem. However, they are very different if we look at the details. In the upper panel, for example, we can see that Model 258 “creates” a dog head in the air above the body of the dog and Model 50 modifies this part of the input image similarly. The other two models “create” an eye above the body of the dog. The part of the image around the ear of the dog is another example. Model 36 does not modify much of this part, while the other models “create” another eye. The four models are also different when they modify the input image and make it more like a cat in the lower panel.
This experiment attempts to bridge the gap between the importance of latent features and the subtle differences among almost-equally accurate models for image classification. We believe that more work could be done along this direction to understand the black-box algorithms of image classification.
7 Related Work
The vast majority of work about variable importance is posthoc, meaning that it addresses a single model that has been chosen prior to the variable importance analysis. These works do not explore the class of models that could have been chosen and that are approximately as good as the one that was chosen.
Probably the most direct posthoc method to investigate variable importance is to simply look at the coefficients or weights of the model, after normalizing the features. For example, a zero coefficient of a variable indicates no importance, while a large coefficient indicates greater importance. This interpretation is common for linear models (Breiman et al. (2001), Gevrey et al. (2003)), and is also applicable to some non-linear models. This is a posthoc analysis of variable importance: it tells us that a variable is important because the prediction is sensitive to the value of this variable, if we select this predictive model. Yet it does not posit that this variable is important to every good predictive model, and we could have selected another equally good predictive model in which this variable is not important at all.
In addition to looking at the coefficients or weights, there are many more sophisticated posthoc analyses of variable importance in various domains. Visual saliency (Harel et al. (2007)), for instance, is not a measure of variable importance that has been extended to an entire class of good models. Visual saliency tells us only what part of an image a single model is using. It does not show what part of that image every good model is choosing. However, it is possible to extend the VIC idea to visual saliency, where one would attempt to illustrate the range of saliency maps arising from the set of good predictive models.
There are several posthoc methods of visualizing variable importance. For linear models, the partial leverage plot (Velleman and Welsch (1981)) is a tool that visualizes the importance of a variable. To understand the importance of a variable, it extracts the information in this variable and the outcome that is not explained by the rest of the variables. The shape of the scatter plot of this extracted information informs us of the importance of the variable. The partial dependence plot (Friedman (2001)) is another method that visualizes the impact of a variable on the average prediction. By looking at the steepness of the plot, one can tell the magnitude of the change of predicted outcome caused by a local change of a variable. One recent attempt to visualize variable importance is made by Casalicchio et al. (2018). They introduce a local variable importance measure and propose visualization tools to understand how changes in a feature affect model performance both on average and for individual data points. These methods, while useful, take a given predictive model as a primitive and visualize variable importance with respect to this single model. They neglect the existence of other almost-equally accurate models and the fact that variable importance can be different with respect to these models.
In our paper, we analyze variable importance in the context of a set of good models, namely the Rashomon set. This idea is closely related to the works that investigate variable importance without using posthoc analysis. Fisher et al. (2018) estimate bounds on variable importance, in the sense that no good predictive model has variable importance that exceeds the bounds. Unlike the posthoc methods mentioned above, they do not study variable importance for a pre-selected model. Instead, the analysis is performed on the set of all good models within a model class. Our work differs from theirs because we look at the joint set of variable importance rather than only the extremes. Our approach can reveal the importance of a variable in the context of other variables, which is not what their framework is designed to do.
Post-selection inference, though not directly related to variable importance, is another common posthoc method in statistical analysis. Inference is the process of learning what we do not know from the dataset, and is usually done with a pre-selected data analysis specification. Inference results might not always be robust since there could be other specifications that are consistent with the dataset, yet lead to results that are not necessarily the same (and researchers might inevitably choose the specifications that favor their presumptions). One way to address this issue is to use a non-posthoc method and look at the joint inference results from all specifications that are justified by the dataset, which is referred to as the hacking interval by Coker et al. (2018). The idea of looking for all consistent data analysis specifications in their work is similar to ours and to that of Fisher et al. (2018), which look at all the almost-equally accurate predictive models; all of these works are non-posthoc.
Finally, the idea that multiple models can explain the observed dataset well, namely the Rashomon effect (Breiman et al. (2001)), is discussed in other works that are not related to variable importance. For example, Tulabandhula and Rudin (2013) implement robust optimization to find a decision rule that works well under an uncertainty set of possible situations that are consistent with the historical data. The fields of robust optimization and chance-constrained programming use uncertainty sets to make decisions; these uncertainty sets could arise from the Rashomon set. Nevo and Ritov (2017) analyze a model selection problem in high-dimensional regression and identify a minimal class of models that rely on different subsets of variables but perform almost equally well in terms of prediction accuracy. In some sense, they are exploring the Rashomon set with random search algorithms rather than fully characterizing it.
8 Conclusion
In this paper, we propose a new framework to analyze and visualize variable importance. We analyze it for linear models, and extend it to non-linear problems including logistic regression and decision trees. This framework is useful if we want to study the importance of a variable in the context of the importance of other variables. It informs us, for example, how the importance of a variable changes when another variable becomes more important as we switch among a set of almost-equally-accurate models. We also show connections between variable importance and hypothesis testing for linear models, and illustrate the trade-off between accuracy and model reliance.
- Breiman (2001) Breiman L (2001) Random forests. Machine learning 45(1):5–32.
- Breiman et al. (2001) Breiman L, et al. (2001) Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science 16(3):199–231.
- Casalicchio et al. (2018) Casalicchio G, Molnar C, Bischl B (2018) Visualizing the feature importance for black box models. arXiv preprint arXiv:1804.06620.
- Coker et al. (2018) Coker B, Rudin C, King G (2018) A theory of statistical inference for ensuring the robustness of scientific results. arXiv preprint arXiv:1804.08646.
- Fisher et al. (2018) Fisher A, Rudin C, Dominici F (2018) Model class reliance: Variable importance measures for any machine learning model class, from the “Rashomon” perspective. arXiv preprint arXiv:1801.01489.
- Flores et al. (2016) Flores AW, Bechtel K, Lowenkamp CT (2016) False positives, false negatives, and false analyses: A rejoinder to “Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” Fed. Probation 80:38.
- Friedman (2001) Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics 1189–1232.
- Gevrey et al. (2003) Gevrey M, Dimopoulos I, Lek S (2003) Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological modelling 160(3):249–264.
- Harel et al. (2007) Harel J, Koch C, Perona P (2007) Graph-based visual saliency. Advances in neural information processing systems, 545–552.
- Hayashi (2000) Hayashi F (2000) Econometrics (New Jersey: Princeton University Press).
- Nevo and Ritov (2017) Nevo D, Ritov Y (2017) Identifying a minimal class of models for high-dimensional data. Journal of Machine Learning Research 18(24):1–29.
- Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Tulabandhula and Rudin (2013) Tulabandhula T, Rudin C (2013) Machine learning with operational costs. The Journal of Machine Learning Research 14(1):1989–2028.
- Velleman and Welsch (1981) Velleman PF, Welsch RE (1981) Efficient computing of regression diagnostics. The American Statistician 35(4):234–242.
- Wang et al. (2017) Wang T, Rudin C, Doshi-Velez F, Liu Y, Klampfl E, MacNeille P (2017) A Bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research 18(1):2357–2393.
Appendix A Proof of Corollary 2.1
We first look at how the center of the Rashomon set is affected by the scale of data.
Let be the linear model that minimizes the expected loss for and for . It follows that and