# Regularization of Case-Specific Parameters for Robustness and Efficiency

Yoonkyung Lee is Associate Professor, Department of Statistics, Ohio State University, Columbus, Ohio 43210, USA. Steven N. MacEachern is Professor, Department of Statistics, Ohio State University, Columbus, Ohio 43210, USA. Yoonsuh Jung is Postdoctoral Fellow, University of Texas, MD Anderson Cancer Center, Houston, Texas 77030, USA.
###### Abstract

Regularization methods allow one to handle a variety of inferential problems where there are more covariates than cases. This allows one to consider a potentially enormous number of covariates for a problem. We exploit the power of these techniques, supersaturating models by augmenting the “natural” covariates in the problem with an additional indicator for each case in the data set. We attach a penalty term for these case-specific indicators which is designed to produce a desired effect. For regression methods with squared error loss, an ℓ1 penalty produces a regression which is robust to outliers and high leverage cases; for quantile regression methods, an ℓ2 penalty decreases the variance of the fit enough to overcome an increase in bias. The paradigm thus allows us to robustify procedures which lack robustness and to increase the efficiency of procedures which are robust.

We provide a general framework for the inclusion of case-specific parameters in regularization problems, describing the impact on the effective loss for a variety of regression and classification problems. We outline a computational strategy by which existing software can be modified to solve the augmented regularization problem, providing conditions under which such modification will converge to the optimum solution. We illustrate the benefits of including case-specific parameters in the context of mean regression and quantile regression through analysis of NHANES and linguistic data sets.

Volume 27, Issue 3 (2012), pages 350–372. doi:10.1214/11-STS377

Running title: Regularization of Case-Specific Parameters

Yoonkyung Lee (yklee@stat.osu.edu), Steven N. MacEachern (snm@stat.osu.edu) and Yoonsuh Jung (yjung1@mdanderson.org)

Keywords: case indicator, large margin classifier, LASSO, leverage point, outlier, penalized method, quantile regression.

## 1 Introduction

A core part of regression analysis involves the examination and handling of individual cases (Weisberg, 2005). Traditionally, cases have been removed or downweighted as outliers or because they exert an overly large influence on the fitted regression surface. The mechanism by which they are downweighted or removed is through inclusion of case-specific indicator variables. For a least-squares fit, inclusion of a case-specific indicator in the model is equivalent to removing the case from the data set; for a normal-theory, Bayesian regression analysis, inclusion of a case-specific indicator with an appropriate prior distribution is equivalent to inflating the variance of the case and hence downweighting it. The tradition in robust regression is to handle the case-specific decisions automatically, most often by downweighting outliers according to an iterative procedure (Huber, 1981).

This idea of introducing case-specific indicators also applies naturally to criterion based regression procedures. Model selection criteria such as AIC or BIC take aim at choosing a model by attaching a penalty for each additional parameter in the model. These criteria can be applied directly to a larger space of models—namely those in which the covariates are augmented by a set of case indicators, one for each case in the data set. When considering inclusion of a case indicator for a large outlier, the criterion will judge the trade-off between the empirical risk (here, negative log-likelihood) and model complexity (here, number of parameters) as favoring the more complex model. It will include the case indicator in the model, and, with a least-squares fit, effectively remove the case from the data set. A more considered approach would allow differential penalties for case-specific indicators and “real” covariates. With adjustment, one can essentially recover the familiar t-tests for outliers (e.g., Weisberg, 2005), either controlling the error rate at the level of the individual test or controlling the Bonferroni bound on the familywise error rate.

Case-specific indicators can also be used in conjunction with regularization methods such as the LASSO (Tibshirani, 1996). Again, care must be taken with details of their inclusion. If these new covariates are treated in the same fashion as the other covariates in the problem, one is making an implicit judgment that they should be penalized in the same fashion. Alternatively, one can allow a second parameter that governs the severity of the penalty for the indicators. This penalty can be set with a view of achieving robustness in the analysis, and it allows one to tap into a large, extant body of knowledge about robustness (Huber, 1981).

With regression often serving as a motivating theme, a host of regularization methods for model selection and estimation problems have been developed. These methods range broadly across the field of statistics. In addition to traditional normal-theory linear regression, we find many methods motivated by a loss which is composed of a negative log-likelihood and a penalty for model complexity. Among these regularization methods are penalized linear regression methods [e.g., ridge regression (Hoerl and Kennard, 1970) and the LASSO], regression with a nonparametric mean function [e.g., smoothing splines (Wahba, 1990) and generalized additive models (Hastie and Tibshirani, 1990)], and extension to regression with nonnormal error distributions, namely, generalized linear models (McCullagh and Nelder, 1989). In all of these cases, one can add case-specific indicators along with an appropriate penalty in order to yield an automated, robust analysis. It should be noted that, in addition to a different severity for the penalty term, the case-specific indicators sometimes require a different form for their penalty term.

A second class of procedures open to modification with case-specific indicators are those motivated by minimization of an empirical risk function. The risk function may not be a negative log-likelihood. Quantile regression (whether linear or nonlinear) falls into this category, as do modern classification techniques such as the support vector machine (Vapnik, 1998) and the ψ-learner (Shen et al., 2003). Many of these procedures are designed with the robustness of the analysis in mind, often operating on an estimand defined to be the population-level minimizer of the risk. The procedures are consistent across a wide variety of data-generating mechanisms and hence are asymptotically robust. They have little need of further robustification. Instead, scope for bettering these procedures lies in improving their finite sample properties. The finite sample performance of many procedures in this class can be improved by including case-specific indicators in the problem, along with an appropriate penalty term for them.

This paper investigates the use of case-specific indicators for improving modeling and prediction procedures in a regularization framework. Section 2 provides a formal description of the optimization problem which arises with the introduction of case-specific indicators. It also describes a computational algorithm and conditions that ensure the algorithm will obtain the global solution to the regularized problem. Section 3 explains the methodology for a selection of regression methods, motivating particular forms for the penalty terms. Section 4 describes how the methodology applies to several classification schemes. Sections 5 and 6 contain simulation studies and worked examples. We discuss implications of the work and potential extensions in Section 7.

## 2 Robust and Efficient Modeling Procedures

Suppose that we have n pairs of observations denoted by (x_i, y_i), i = 1, …, n, for statistical modeling and prediction. Here x_i = (x_{i1}, …, x_{ip})^⊤ with p covariates and the y_i's are responses. As in the standard setting of regression and classification, the y_i's are assumed to be conditionally independent given the x_i's. In this paper, we take modeling of the data as a procedure of finding a functional relationship between x and y, f(x; β) with unknown parameters β, that is consistent with the data. The discrepancy or lack of fit of f is measured by a loss function L(y, f(x; β)). Consider a modeling procedure, say, M, of finding f which minimizes (n times) the empirical risk

 R_n(f) = ∑_{i=1}^n L(y_i, f(x_i; β))

or its penalized version, R_n(f) + λ J(f), where λ is a positive penalty parameter for balancing the data fit and the model complexity of f measured by J(f). A variety of common modeling procedures are subsumed under this formulation, including ordinary linear regression, generalized linear models, nonparametric regression, and supervised learning techniques. For brevity of exposition, we identify f with β through a parametric form and view J(f) as a functional depending on β. Extension of the formulation presented in this paper to a nonparametric function f is straightforward via a basis expansion.

### 2.1 Modification of Modeling Procedures

First, we introduce case-specific parameters, γ = (γ_1, …, γ_n)^⊤, for the n observations by augmenting the covariates with n case-specific indicators. For convenience, we also use γ to refer to a generic element of the vector, dropping the subscript. Motivated by the beneficial effects of regularization, we propose a general scheme to modify the modeling procedure M using the case-specific parameters γ, to enhance M for robustness or efficiency. Define modification of M to be the procedure of finding the original model parameters, β, together with the case-specific parameters, γ, that minimize

 L(β, γ) = ∑_{i=1}^n L(y_i, f(x_i; β) + γ_i) + λ_β J(f) + λ_γ J_2(γ).

If λ_β is zero, the modification involves empirical risk minimization; otherwise, penalized risk minimization. The adjustment that the added case-specific parameters bring to the loss function is the same regardless of whether λ_β is zero or not.

In general, J_2(γ) measures the size of γ. When concerned with robustness, we often take J_2(γ) = ∥γ∥_1. A rationale for this choice is that with added flexibility, the case-specific parameters can curb the undesirable influence of individual cases on the fitted model. To see this effect, consider minimizing L(β, γ) for fixed β, which decouples to a minimization of L(y_i, f(x_i; β) + γ_i) + λ_γ|γ_i| for each γ_i. In most cases, an explicit form of the minimizer γ̂_i can be obtained. Generally the γ̂_i's are large for observations with large “residuals” from the current fit, and the influence of those observations can be reduced in the next round of fitting with the γ-adjusted data. Such a case-specific adjustment would be necessary only for a small number of potential outliers, and the ℓ1 norm which yields sparsity works to that effect. The adjustment in the process of sequential updating of β is equivalent to changing the loss from L(y, f(x; β)) to L(y, f(x; β) + γ̂), which we call the γ-adjusted loss of L. The γ-adjusted loss is a re-expression of L in terms of the adjusted residual, used as a conceptual aid to illustrate the effect of adjustment through the case-specific parameter γ on L. Concrete examples of the adjustments will be given in the following sections. Alternatively, one may view min_γ L(β, γ) as a whole to be the “effective loss” in terms of β after profiling out γ. The effective loss replaces L for the modified procedure. When concerned with efficiency, we often take J_2(γ) = ∥γ∥_2². This choice has the effect of increasing the impact of selected, nonoutlying cases on the analysis.

In subsequent sections, we will take a few standard statistical methods for regression and classification and illustrate how this general scheme applies. This framework allows us to see established procedures in a new light and also generates new procedures. For each method, particular attention will be paid to the form of adjustment to the loss function by the penalized case-specific parameters.

### 2.2 General Algorithm for Finding Solutions

Although the computational details for obtaining the solution to (2.1) are specific to each modeling procedure M, it is feasible to describe a common computational strategy which is effective for a wide range of procedures that optimize a convex function. For fixed λ_β and λ_γ, the solution pair of β̂ and γ̂ to the modified procedure can be found with little extra computational cost. A generic algorithm below alternates estimation of β and γ. Given γ, minimization of L(β, γ) is done via the original modeling procedure M. In most cases we consider, minimization of L(β, γ) given β entails simple adjustment of “residuals.” These considerations lead to the following iterative algorithm for finding β̂ and γ̂:

1. Initialize γ̂^(0) = 0 and β̂^(0) = argmin_β L(β, 0) (the ordinary M solution).

2. Iteratively alternate the following two steps, for m = 0, 1, 2, …:

• γ̂^(m+1) = argmin_γ L(β̂^(m), γ), which modifies “residuals.”

• β̂^(m+1) = argmin_β L(β, γ̂^(m+1)). This step amounts to reapplying the procedure M to γ-adjusted data, although the nature of the data adjustment would largely depend on L.

3. Terminate the iteration when L(β̂^(m), γ̂^(m)) − L(β̂^(m+1), γ̂^(m+1)) < ε, where ε is a prespecified convergence tolerance.
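For concreteness, the steps above can be sketched in code for squared error loss with an ℓ1 penalty on γ (the setting of Section 3.1), where the γ-step reduces to soft-thresholding the residuals and the β-step is ordinary least squares on the γ-adjusted responses. This is a minimal illustration, not the authors' implementation; the function names are our own.

```python
import numpy as np

def soft_threshold(r, lam):
    # gamma-step: componentwise minimizer of 0.5*(r - g)**2 + lam*|g|
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def modified_least_squares(X, y, lam_gamma, tol=1e-10, max_iter=500):
    """Alternate the gamma-step (soft-thresholding of residuals) and the
    beta-step (ordinary least squares on gamma-adjusted responses)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary solution
    gamma = np.zeros_like(y)
    for _ in range(max_iter):
        gamma = soft_threshold(y - X @ beta, lam_gamma)          # gamma-step
        beta_new = np.linalg.lstsq(X, y - gamma, rcond=None)[0]  # beta-step
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, gamma
```

Profiling out γ shows that the fixed point minimizes Huber's loss with bending point λ_γ, so a gross outlier is absorbed by its γ_i rather than dragging the fit.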

In a nutshell, the algorithm attempts to find the joint minimizer (β̂, γ̂) by combining the minimizers resulting from the projected subspaces. Convergence of the iterative updates can be established under appropriate conditions. Before we state the conditions and results for convergence, we briefly describe implicit assumptions on the loss function L and the complexity or penalty terms, J(f) and J_2(γ). L is assumed to be nonnegative. For simplicity, we assume that J(f) depends on β only, and that J(β) and J_2(γ) are of the forms ∥β∥_j^j and ∥γ∥_k^k for j, k = 1 or 2. The LASSO penalty has j = 1 while a ridge regression type penalty sets j = 2. Many other penalties of this form can be adopted as well to achieve better model selection properties or certain desirable performance of M. Examples include those for the elastic net (Zou and Hastie, 2005), the grouped LASSO (Yuan and Lin, 2006) and the hierarchical LASSO (Zhou and Zhu, 2007).

For certain combinations of the loss and the penalty functionals, L and J, more efficient computational algorithms can be devised, as in Hastie et al. (2004), Efron et al. (2004a) and Rosset and Zhu (2007). However, in an attempt to provide a general computational recipe applicable to a variety of modeling procedures which can be implemented with simple modification of existing routines, we do not pursue the optimal implementation tailored to a specific procedure in this paper.

Convexity of the loss and penalty terms plays a primary role in characterizing the solutions of the iterative algorithm. For a general reference to properties of convex functions and convex optimization, see Rockafellar (1997). Nonconvex problems require different optimization strategies.

If L(β, γ) in (2.1) is continuous and strictly convex in β and γ for fixed λ_β and λ_γ, the minimizer pair in each step is properly defined. That is, given γ, there exists a unique minimizer β̂(γ), and vice versa. The assumption that L(β, γ) is strictly convex holds if the loss L itself is strictly convex. Also, it is satisfied when a convex loss L is combined with J(β) and J_2(γ) strictly convex in β and γ, respectively.

Suppose that L(β, γ) is strictly convex in β and γ with a unique minimizer (β*, γ*) for fixed λ_β and λ_γ. Then, the iterative algorithm gives a sequence of (β̂^(m), γ̂^(m)) with strictly decreasing L(β̂^(m), γ̂^(m)). Moreover, (β̂^(m), γ̂^(m)) converges to (β*, γ*). This result of convergence of the iterative algorithm is well known in convex optimization, and it is stated here without proof. Interested readers can find a formal proof in Lee, MacEachern and Jung (2007).

## 3 Regression

Consider a linear model of the form y_i = x_i^⊤β + ε_i, i = 1, …, n. Without loss of generality, we assume that each covariate is standardized. Let X be the n × p design matrix with x_i^⊤ in the ith row and let Y = (y_1, …, y_n)^⊤.

### 3.1 Least Squares Method

Taking the least squares method as a baseline modeling procedure M, we make a link between its modification via case-specific parameters and a classical robust regression procedure.

The least squares estimator of β is the minimizer of (1/2)∥Y − Xβ∥_2². To reduce the sensitivity of the estimator to influential observations, the covariates are augmented by n case indicators. Let z_i be the indicator variable taking the value 1 for the ith observation and 0 otherwise, and let γ = (γ_1, …, γ_n)^⊤ be the coefficients of the case indicators. The additional design matrix for the case indicators is the n × n identity matrix, so their contribution to the fit is γ itself. The proposed modification of the least squares method with J_2(γ) = ∥γ∥_1 leads to a well-known robust regression procedure. For the robust modification, we find β and γ that minimize

 L(β, γ) = (1/2){Y − (Xβ + γ)}^⊤{Y − (Xβ + γ)} + λ_γ∥γ∥_1,

where λ_γ is a fixed regularization parameter constraining γ. Just as the ordinary LASSO with the ℓ1 norm penalty stabilizes regression coefficients by shrinkage and selection, the additional penalty in (3.1) has the same effect on γ, whose components gauge the extent of case influences.

The minimizer γ̂ of (3.1) for a fixed β can be found by soft-thresholding the residual vector r = Y − Xβ. That is, γ̂_i = sgn(r_i)(|r_i| − λ_γ)_+. For observations with small residuals, |r_i| ≤ λ_γ, γ̂_i is set equal to zero with no effect on the current fit, and for those with large residuals, |r_i| > λ_γ, γ̂_i is set equal to the residual offset by λ_γ toward zero. Combining β with γ̂, we define the adjusted residuals to be r_i − γ̂_i; that is, r_i if |r_i| ≤ λ_γ, and sgn(r_i)λ_γ otherwise. Thus, introduction of the case-specific parameters along with the ℓ1 penalty on γ amounts to winsorizing the ordinary residuals. The γ-adjusted loss is equivalent to truncated squared error loss, which is r²/2 if |r| ≤ λ_γ, and is λ_γ²/2 otherwise. Figure 1 shows (a) the relationship between the ordinary residual r and the corresponding γ̂, (b) the residual and the adjusted residual r − γ̂, (c) the γ-adjusted loss as a function of r, and (d) the effective loss.

The effective loss is r²/2 if |r| ≤ λ_γ, and λ_γ|r| − λ_γ²/2 otherwise. This effective loss matches Huber's loss function for robust regression (Huber, 1981). As in robust regression, we choose a sufficiently large λ_γ so that only a modest fraction of the residuals are adjusted. Similarly, modification of the LASSO as a penalized regression procedure yields the Huberized LASSO described by Rosset and Zhu (2004).
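This identity between the profiled-out objective and Huber's loss is easy to verify numerically. The sketch below (our own, not from the paper) plugs the soft-thresholded γ̂ back into the penalized squared error and compares against Huber's loss with bending point λ_γ.

```python
import numpy as np

def gamma_hat(r, lam):
    # soft-thresholding of the residual at lam
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def profiled_loss(r, lam):
    # min over gamma of 0.5*(r - gamma)**2 + lam*|gamma|
    g = gamma_hat(r, lam)
    return 0.5 * (r - g) ** 2 + lam * np.abs(g)

def huber(r, lam):
    # r**2/2 inside [-lam, lam]; lam*|r| - lam**2/2 outside
    return np.where(np.abs(r) <= lam, 0.5 * r ** 2,
                    lam * np.abs(r) - 0.5 * lam ** 2)
```

On a grid of residuals the two functions agree to machine precision, confirming that the ℓ1-penalized case-specific parameters turn least squares into Huber's M-estimation.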

### 3.2 Location Families

More generally, a wide class of problems can be cast in the form of a minimization of ∑_{i=1}^n g(y_i − x_i^⊤β), where g is the negative log-likelihood derived from a location family. The assumption that we have a location family implies that the negative log-likelihood is a function only of the residual r = y − x^⊤β. Dropping the subscript, common choices for the negative log-likelihood, g(r), include r²/2 (least squares, normal distributions) and |r| (least absolute deviations, Laplace distributions).

Introducing the case-specific parameters γ_i, we wish to minimize

 L(β, γ) = ∑_{i=1}^n g(y_i − x_i^⊤β − γ_i) + λ_γ∥γ∥_1.

For minimization with a fixed β, the next result applies to a broad class of g (but not to g(r) = |r|).

###### Proposition 1

Suppose that g is strictly convex with its minimum at 0, and that g′(r) → ∞ and −∞ as r → ∞ and −∞, respectively. Then, for a residual r,

 γ̂ = argmin_{γ∈ℝ} g(r − γ) + λ_γ|γ|
    = r − g′⁻¹(λ_γ),  for r > g′⁻¹(λ_γ),
      0,              for g′⁻¹(−λ_γ) ≤ r ≤ g′⁻¹(λ_γ),
      r − g′⁻¹(−λ_γ), for r < g′⁻¹(−λ_γ).

The proposition follows from straightforward algebra. Set the first derivative of the decoupled minimization equation equal to ±λ_γ and solve for γ. Inserting these values for γ̂_i into the equation for L(β, γ) yields

 L(β̂, γ̂) = ∑_{i=1}^n g(r_i − γ̂_i) + λ_γ∥γ̂∥_1.

The first term in the summation can be decomposed into three parts. Large, positive r_i contribute g(g′⁻¹(λ_γ)). Large, negative r_i contribute g(g′⁻¹(−λ_γ)). Those with intermediate values have γ̂_i = 0 and so contribute g(r_i). Thus a graphical depiction of the γ-adjusted loss is much like that in Figure 1, panel (c), where the loss is truncated above. For asymmetric distributions (and hence asymmetric log-likelihoods), the truncation point may differ for positive and negative residuals. It should be remembered that when |r_i| is large, the corresponding |γ̂_i| is large, implying a large contribution of λ_γ|γ̂_i| to the overall minimization problem. The residuals will tend to be large for β vectors that are at odds with the data. Thus, in a sense, some of the loss which seems to disappear due to the effective truncation of g is shifted into the penalty term for γ. Hence the effective loss is the same as the original loss when the residual is in [g′⁻¹(−λ_γ), g′⁻¹(λ_γ)] and is linear beyond the interval. The linearized part is joined with g such that the effective loss is differentiable.

Computationally, the minimization of L(β, γ̂) given γ̂ entails application of the same modeling procedure with g to winsorized pseudo responses y_i*, where y_i* = x_i^⊤β + g′⁻¹(λ_γ) for r_i > g′⁻¹(λ_γ), y_i* = y_i for g′⁻¹(−λ_γ) ≤ r_i ≤ g′⁻¹(λ_γ), and y_i* = x_i^⊤β + g′⁻¹(−λ_γ) for r_i < g′⁻¹(−λ_γ). So, the γ-adjusted data in Step 2 of the main algorithm consist of the pairs (x_i, y_i*) in each iteration. A related idea of subsetting data and model-fitting to the subset iteratively for robustness can be found in the computer vision literature, the random sample consensus algorithm (Fischler and Bolles, 1981) for instance.
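Proposition 1 and the asymmetric winsorization points can be checked numerically for a particular smooth g. The sketch below uses g(r) = exp(r) − r − 1 (our choice for illustration, not from the paper), whose derivative g′(r) = exp(r) − 1 inverts in closed form, so that g′⁻¹(±λ_γ) = log(1 ± λ_γ).

```python
import numpy as np

def g(r):
    # a strictly convex, asymmetric loss with its minimum at 0
    return np.exp(r) - r - 1.0

def gamma_hat(r, lam):
    # Proposition 1 with g'(r) = exp(r) - 1, so g'^{-1}(s) = log(1 + s);
    # requires 0 < lam < 1 for the lower point to exist
    hi = np.log(1.0 + lam)   # upper winsorization point
    lo = np.log(1.0 - lam)   # lower winsorization point (negative)
    if r > hi:
        return r - hi
    if r < lo:
        return r - lo
    return 0.0
```

Note that r − γ̂ = clip(r, log(1 − λ_γ), log(1 + λ_γ)): exactly the winsorized residual, with truncation points that differ on the two sides because g is asymmetric.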

### 3.3 Quantile Regression

Consider median regression with absolute deviation loss g(r) = |r|, which is not covered in the foregoing discussion. It can be verified easily that the ℓ1-penalized γ-adjustment of |r| is void due to the piecewise linearity of the loss, reaffirming the robustness of median regression. For an effectual adjustment, the ℓ2 norm regularization of the case-specific parameters is considered instead. With the case-specific parameters γ_i, we have the following objective function for modified median regression:

 L(β, γ) = ∑_{i=1}^n |y_i − x_i^⊤β − γ_i| + (λ_γ/2)∥γ∥_2². (3)

For a fixed β and residual r = y − x^⊤β, the minimizing γ̂ is given by

 γ̂ = sgn(r)(1/λ_γ) I(|r| > 1/λ_γ) + r I(|r| ≤ 1/λ_γ).

The γ-adjusted loss for median regression is

 L(y, x^⊤β + γ̂) = (|y − x^⊤β| − 1/λ_γ) I(|y − x^⊤β| > 1/λ_γ),

as shown in Figure 3(a) below. Interestingly, this γ-adjusted absolute deviation loss is the same as the so-called “ε-insensitive linear loss” for support vector regression (Vapnik, 1998) with ε = 1/λ_γ.

With this adjustment, the effective loss is Huberized squared error loss. The adjustment makes median regression more efficient by rounding the sharp corner of the loss, and leads to a hybrid procedure which lies between mean and median regression. Note that, to achieve the desired effect for median regression, one chooses quite a different value of λ_γ than one would when adjusting squared error loss for a robust mean regression.

The modified median regression procedure can also be combined with a penalty on β for shrinkage and/or selection. Bi et al. (2003) considered support vector regression with an ℓ1 norm penalty on β for simultaneous robust regression and variable selection. These authors relied on the ε-insensitive linear loss which comes out as the γ-adjusted loss of the absolute deviation. In contrast, we rely on the effective loss, which produces a different solution.

In general, quantile regression (Koenker and Bassett, 1978; Koenker and Hallock, 2001) can be used to estimate conditional quantiles of y given x. It is a useful regression technique when the assumption of normality on the distribution of the errors ε is not appropriate, for instance, when the error distribution is skewed or heavy-tailed. For the qth quantile, the check function is employed:

 ρ_q(r) = qr, for r ≥ 0, and ρ_q(r) = −(1 − q)r, for r < 0. (4)

The standard procedure for the qth quantile regression finds β that minimizes the sum of asymmetrically weighted absolute errors, with weight q on positive errors and weight 1 − q on negative errors:

 L(β) = ∑_{i=1}^n ρ_q(y_i − x_i^⊤β).

Consider modification of ρ_q with a case-specific parameter γ and ℓ2 norm regularization. Due to the asymmetry in the loss, except for q = 1/2, the size of the reduction in the loss achieved by the case-specific parameter would depend on its sign. Given λ_γ and a positive residual r, a positive γ would lower the loss by qγ, while for a negative residual, a negative γ with the same absolute value would lower the loss by (1 − q)|γ|. This asymmetric impact on the loss is undesirable. Instead, we create a penalty that leads to the same reduction in loss for positive and negative γ's of the same magnitude. In other words, the desired ℓ2 norm penalty needs to put the positive and negative parts of γ, γ_+ = max(γ, 0) and γ_− = max(−γ, 0), on an equal footing. This leads to the following penalty with terms proportional to γ_+² and γ_−²:

 J_2(γ) := {q/(1 − q)}γ_+² + {(1 − q)/q}γ_−².

When q = 1/2, J_2(γ) becomes the symmetric squared ℓ2 norm of γ.

With this asymmetric penalty, given β̂ and residual r = y − x^⊤β̂, γ̂ is now defined as

 γ̂ = argmin_{γ∈ℝ} L_{λ_γ}(β̂, γ) := ρ_q(r − γ) + (λ_γ/2)J_2(γ), (5)

and it is explicitly given by

 γ̂ = −(q/λ_γ) I(r < −q/λ_γ) + r I(−q/λ_γ ≤ r < (1 − q)/λ_γ) + ((1 − q)/λ_γ) I(r ≥ (1 − q)/λ_γ).

The effective loss is then given by

 ρ_q^γ(r) = (q − 1)r − q(1 − q)/(2λ_γ),   for r < −q/λ_γ,
            (λ_γ/2){(1 − q)/q} r²,        for −q/λ_γ ≤ r < 0,
            (λ_γ/2){q/(1 − q)} r²,        for 0 ≤ r < (1 − q)/λ_γ,
            qr − q(1 − q)/(2λ_γ),         for r ≥ (1 − q)/λ_γ, (6)

and its derivative is

 ψ_q^γ(r) = q − 1,                 for r < −q/λ_γ,
            λ_γ{(1 − q)/q} r,      for −q/λ_γ ≤ r < 0,
            λ_γ{q/(1 − q)} r,      for 0 ≤ r < (1 − q)/λ_γ,
            q,                     for r ≥ (1 − q)/λ_γ. (7)

We note that, under the assumption that the density is locally constant in a neighborhood of the quantile, the qth quantile remains the minimizer of the expected effective loss.

Figure 2 compares the derivative of the check loss with that of the effective loss in (6). Through penalization of a case-specific parameter, ρ_q is modified to have a continuous derivative at the origin, joined by two quadratic pieces whose slopes depend on q. The effective loss is reminiscent of the asymmetric squared error loss considered by Newey and Powell (1987) and Efron (1991) for the so-called expectiles. The proposed modification of the check loss produces a hybrid of the check loss and asymmetric squared error loss, however, with different weights than those for expectiles, to estimate quantiles. The effective loss is formally similar to the rounded-corner check loss of Nychka et al. (1995), who used a vanishingly small adjustment to speed computation. Portnoy and Koenker (1997) thoroughly discussed efficient computation for quantile regression.
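The piecewise formula (6) can be coded directly and checked against a brute-force minimization of ρ_q(r − γ) + (λ_γ/2)J_2(γ); the sketch below is our own illustration, not the paper's code.

```python
import numpy as np

def check_loss(r, q):
    # the check function rho_q of display (4)
    return np.where(r >= 0.0, q * r, (q - 1.0) * r)

def effective_check_loss(r, q, lam):
    # rho_q^gamma of display (6): the corner of rho_q at 0 is replaced
    # by two quadratic pieces on [-q/lam, (1-q)/lam]
    r = np.asarray(r, dtype=float)
    lo, hi = -q / lam, (1.0 - q) / lam
    conds = [r < lo, (r >= lo) & (r < 0.0), (r >= 0.0) & (r < hi), r >= hi]
    vals = [(q - 1.0) * r - q * (1.0 - q) / (2.0 * lam),
            0.5 * lam * (1.0 - q) / q * r ** 2,
            0.5 * lam * q / (1.0 - q) * r ** 2,
            q * r - q * (1.0 - q) / (2.0 * lam)]
    return np.select(conds, vals)
```

At q = 1/2 this reduces to a scaled Huber loss; for other q both the quadratic pieces and the linear tails are asymmetric.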

Redefining J_2(γ) as the sum of the asymmetric penalties for the case-specific parameters γ_i, J_2(γ) = ∑_{i=1}^n [{q/(1 − q)}γ_{i+}² + {(1 − q)/q}γ_{i−}²], modified quantile regression is formulated as a procedure that finds β and γ by minimizing

 L(β, γ) = ∑_{i=1}^n ρ_q(y_i − x_i^⊤β − γ_i) + (λ_γ/2)J_2(γ). (8)

In extensive simulation studies (Jung, MacEachern and Lee, 2010), such adjustment of the standard quantile regression procedure generally led to more accurate estimates. See Section 5.1.1 for a summary of the studies. This is confirmed in the NHANES data analysis in Section 6.1.

For large enough samples, with a fixed λ_γ, the bias of the enhanced estimator will typically outweigh its benefits. The natural approach is to adjust the penalty attached to the case-specific covariates as the sample size increases. This can be accomplished by increasing the parameter λ_γ as the sample size grows.

Let λ_γ = cn^a for some constant c > 0 and a > 0. The following theorem shows that if a is sufficiently large, the modified quantile regression estimator β̂_n^γ, which minimizes the effective loss or equivalently (8), is asymptotically equivalent to the standard estimator β̂_n. Knight (1998) proved the asymptotic normality of the regression quantile estimator under some mild regularity conditions. Using the arguments in Koenker (2005), we show that β̂_n^γ has the same limiting distribution as β̂_n, and thus it is √n-consistent if a is sufficiently large.

Allowing a potentially different error distribution for each observation, let y_1, …, y_n be independent random variables with c.d.f.'s F_1, …, F_n, and suppose that each F_i has continuous p.d.f. f_i. Assume that the qth conditional quantile function of y_i given x_i is linear in x_i and given by x_i^⊤β(q), and let ξ_i := x_i^⊤β(q). Now consider the following regularity conditions:

(C-1) The densities f_i(ξ_i), i = 1, 2, …, are uniformly bounded away from 0 and ∞ at ξ_i.

(C-2) The densities f_i, i = 1, 2, …, admit a first-order Taylor expansion at ξ_i, and their derivatives are uniformly bounded at ξ_i.

(C-3) There exists a positive definite matrix D_0 such that lim_{n→∞} n⁻¹ ∑_{i=1}^n x_i x_i^⊤ = D_0.

(C-4) There exists a positive definite matrix D_1 such that lim_{n→∞} n⁻¹ ∑_{i=1}^n f_i(ξ_i) x_i x_i^⊤ = D_1.

(C-5) max_{i=1,…,n} ∥x_i∥/√n → 0 in probability.

(C-1) and (C-3) through (C-5) are the conditions considered for the limiting distribution of the standard regression quantile estimator in Koenker (2005) while (C-2) is an additional assumption that we make.

###### Theorem 2

Under the conditions (C-1)–(C-5), if λ_γ = cn^a with a sufficiently large, then β̂_n^γ has the same limiting distribution as the standard regression quantile estimator β̂_n.

The proof of the theorem is in the Appendix.

## 4 Classification

Now suppose that the y_i's indicate binary outcomes. For modeling and prediction of the binary responses, we mainly consider margin-based procedures such as logistic regression, support vector machines (Vapnik, 1998) and boosting (Freund and Schapire, 1997). These procedures can be modified by the addition of case indicators.

### 4.1 Logistic Regression

Although it is customary to label a binary outcome as 0 or 1 in logistic regression, we instead adopt the symmetric labels y = ±1. The symmetry facilitates comparison of different classification procedures. Logistic regression takes the negative log-likelihood, log(1 + exp(−yf(x))), as a loss for estimation of the logit f(x) = log[p(x)/{1 − p(x)}]. The loss can be viewed as a function of the so-called margin, yf(x). This functional margin is a pivotal quantity for defining a family of loss functions in classification, similar to the residual in regression.

As in regression with continuous responses, case indicators can be used to modify the logit function in logistic regression to minimize

 L(β_0, β, γ) = ∑_{i=1}^n log(1 + exp(−y_i{f(x_i; β_0, β) + γ_i})) + λ_γ∥γ∥_1, (9)

where f(x; β_0, β) = β_0 + x^⊤β. When it is clear in context, f(x) will be used as abbreviated notation for the discriminant function f(x; β_0, β), and the subscripts will be dropped. For fixed β̂_0 and β̂, the minimization decouples, and each γ̂_i is determined by minimizing

 log(1 + exp(−y_i{f(x_i; β̂_0, β̂) + γ_i})) + λ_γ|γ_i|.

First note that the minimizer γ̂_i must have the same sign as y_i. Letting τ_i = y_i f(x_i; β̂_0, β̂) and assuming that 0 < λ_γ < 1, we have y_iγ̂_i = log{(1 − λ_γ)/λ_γ} − τ_i if τ_i < log{(1 − λ_γ)/λ_γ}, and 0 otherwise. This yields a truncated negative log-likelihood given by

 L(y, f(x)) = log(1 + λ_γ/(1 − λ_γ)), if yf(x) ≤ log{(1 − λ_γ)/λ_γ},
              log(1 + exp(−yf(x))),   otherwise,

as the γ-adjusted loss. This adjustment is reminiscent of Pregibon's (1982) proposal tapering the deviance function so as to downweight extreme observations, thereby producing a robust logistic regression. See Figure 3(b) for the γ-adjusted loss (the dashed line), where the truncation point log{(1 − λ_γ)/λ_γ} is a decreasing function of λ_γ. Thus λ_γ determines the level of truncation of the loss. As λ_γ tends to 1, there is no truncation. Figure 3(b) also shows the effective loss (the solid line) for the adjustment, which linearizes the negative log-likelihood for margins below log{(1 − λ_γ)/λ_γ}.
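The closed-form γ̂ and the truncation level are easy to verify numerically for the logistic loss; the following sketch is our own illustration, with γ expressed on the margin scale (i.e., yγ).

```python
import numpy as np

def logistic_loss(tau):
    # negative log-likelihood as a function of the margin tau = y*f(x)
    return np.log1p(np.exp(-tau))

def margin_gamma_hat(tau, lam):
    # minimizer of logistic_loss(tau + g) + lam*|g| over g >= 0,
    # for 0 < lam < 1; nonzero only when the margin falls below
    # the threshold log((1-lam)/lam)
    thresh = np.log((1.0 - lam) / lam)
    return max(thresh - tau, 0.0)
```

Plugging γ̂ back in shows the adjusted loss is capped at log(1 + λ_γ/(1 − λ_γ)) for margins below the threshold, matching the truncated negative log-likelihood above.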

### 4.2 Large Margin Classifiers

With the symmetric class labels, the foregoing characterization of the case-specific parameter in logistic regression can be easily generalized to various margin-based classification procedures. In classification, potential outliers are those cases with large negative margins. Let g(τ) be a loss function of the margin τ = yf(x). The following proposition, analogous to Proposition 1, holds for a general family of loss functions.

###### Proposition 3

Suppose that g(τ) is convex and monotonically decreasing in τ, and that g′ is continuous. Then, for λ_γ such that g′⁻¹(−λ_γ) is defined,

 γ̂ = argmin_{γ∈ℝ} g(τ + γ) + λ_γ|γ|
    = g′⁻¹(−λ_γ) − τ, for τ ≤ g′⁻¹(−λ_γ),
      0,              for τ > g′⁻¹(−λ_γ).

The proof is straightforward. Examples of the margin-based loss satisfying the assumption include the exponential loss exp(−τ) in boosting, the squared hinge loss [(1 − τ)_+]² in the support vector machine (Lee and Mangasarian, 2001), and the negative log-likelihood log(1 + exp(−τ)) in logistic regression. Although their theoretical targets are different, all of these loss functions are truncated above for large negative margins when adjusted by γ̂. Thus, the effective loss is obtained by linearizing g for τ < g′⁻¹(−λ_γ).

The effect of γ-adjustment depends on the form of g, and hence on the classification method. For boosting with g(τ) = exp(−τ), γ̂_i = −log(λ_γ) − τ_i if τ_i ≤ −log(λ_γ), and is 0 otherwise. This gives exp(−(τ_i + γ̂_i)) = exp(−τ_i)exp(−γ̂_i). So, finding β̂_0 and β̂ given γ̂ amounts to weighted boosting, where the positive case-specific parameters downweight the corresponding cases by exp(−γ̂_i). For the squared hinge loss in the support vector machine, γ̂_i = 1 − λ_γ/2 − τ_i if τ_i ≤ 1 − λ_γ/2, and is 0 otherwise. A positive case-specific parameter has the effect of relaxing the margin requirement, that is, lowering the joint of the hinge for that specific case. This allows the associated slack variable to be smaller in the primal formulation. Accordingly, the adjustment affects the coefficient of the linear term in the dual formulation of the quadratic programming problem.

As a related approach to robust classification, Wu and Liu (2007) proposed truncation of margin-based loss functions and studied theoretical properties that ensure classification consistency. Similarity exists between our proposed adjustment of a loss function with γ and truncation of the loss at some point. However, it is the linearization of a margin-based loss function on the negative side that produces its effective loss, and minimization of the effective loss is quite different from minimization of the truncated (i.e., adjusted) loss. Linearization is more conducive to computation than is truncation. Application of the result in Bartlett, Jordan and McAuliffe (2006) shows that the linearized loss functions satisfy sufficient conditions for classification consistency, namely Fisher consistency, which is the main property investigated by Wu and Liu (2007) for truncated loss functions.

Xu, Caramanis and Mannor (2009) showed that regularization in the standard support vector machine is equivalent to a robust formulation under disturbances of $x$ without penalty. In contrast, under our approach, robustness of classification methods is considered through the margin, which is analogous to the residual in regression. This formulation can cover outliers due to perturbation in $x$ as well as mislabeling of $y$.

### 4.3 Support Vector Machines

As a special case of a large margin classifier, the linear support vector machine (SVM) looks for the optimal hyperplane minimizing

$$L_\lambda(\beta_0,\beta) = \sum_{i=1}^{n} \bigl[1 - y_i f(x_i;\beta_0,\beta)\bigr]_+ + \frac{\lambda}{2}\|\beta\|_2^2, \qquad (10)$$

where $f(x_i;\beta_0,\beta) = \beta_0 + x_i^\top\beta$ and $\lambda$ is a regularization parameter. Since the hinge loss for the SVM, $[1-yf]_+$, is piecewise linear, its linearization with $\ell_1$-penalized $\gamma$ is void, indicating that it has little need of further robustification. Instead, we consider modification of the hinge loss with an $\ell_2$ penalty on $\gamma$. This modification is expected to improve efficiency, as in quantile regression.

Using the case indicators $z_i$ and their coefficients $\gamma_i$, we modify (10), arriving at the problem of minimizing

$$L(\beta_0,\beta,\underline{\gamma}) = \sum_{i=1}^{n} \bigl[1 - y_i\{f(x_i;\beta_0,\beta) + \gamma_i\}\bigr]_+ + \frac{\lambda_\beta}{2}\|\beta\|_2^2 + \frac{\lambda_\gamma}{2}\|\underline{\gamma}\|_2^2.$$

For fixed $\hat{\beta}_0$ and $\hat{\beta}$, the minimizer $\hat{\underline{\gamma}}$ of $L(\hat{\beta}_0,\hat{\beta},\underline{\gamma})$ is obtained by solving the decoupled optimization problem

$$\min_{\gamma_i\in\mathbb{R}} \bigl[1 - y_i f(x_i;\hat{\beta}_0,\hat{\beta}) - y_i\gamma_i\bigr]_+ + \frac{\lambda_\gamma}{2}\gamma_i^2 \quad \text{for each } \gamma_i.$$

By an argument similar to that for logistic regression, the minimizer $\hat{\gamma}_i$ should have the same sign as $y_i$. Let $\xi_i = 1 - y_i f(x_i;\hat{\beta}_0,\hat{\beta})$. A simple calculation shows that

$$\mathop{\arg\min}_{\gamma\ge 0}\; [\xi - \gamma]_+ + \frac{\lambda_\gamma}{2}\gamma^2 = \begin{cases} 0, & \text{if } \xi \le 0,\\ \xi, & \text{if } 0 < \xi < 1/\lambda_\gamma,\\ 1/\lambda_\gamma, & \text{if } \xi \ge 1/\lambda_\gamma. \end{cases}$$
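As a sanity check on this three-piece solution, the following sketch (with our own helper names) compares the closed form against a brute-force grid minimization of $[\xi-\gamma]_+ + (\lambda_\gamma/2)\gamma^2$:

```python
def gamma_hat(xi, lam):
    """Closed-form argmin over g >= 0 of [xi - g]_+ + (lam/2) g^2."""
    if xi <= 0.0:
        return 0.0
    return min(xi, 1.0 / lam)

def objective(g, xi, lam):
    return max(xi - g, 0.0) + 0.5 * lam * g * g

# A brute-force grid search over g in [0, 5] agrees with the closed form.
lam = 2.0  # so 1/lam = 0.5
for xi in (-1.0, 0.3, 0.8, 3.0):
    g_star = gamma_hat(xi, lam)
    best = min(objective(k / 1000.0, xi, lam) for k in range(0, 5001))
    assert objective(g_star, xi, lam) <= best + 1e-9
```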

Hence, the increase in margin $y_i\hat{\gamma}_i$ due to inclusion of $\gamma_i$ is given by

$$\{1 - y_i f(x_i)\}\, I\Bigl(0 < 1 - y_i f(x_i) < \frac{1}{\lambda_\gamma}\Bigr) + \frac{1}{\lambda_\gamma}\, I\Bigl(1 - y_i f(x_i) \ge \frac{1}{\lambda_\gamma}\Bigr).$$

The $\gamma$-adjusted hinge loss is $[1 - y_i f(x_i) - y_i\hat{\gamma}_i]_+$, with the hinge lowered by $1/\lambda_\gamma$ as shown in Figure 3(c) (the dashed line). The effective loss (the solid line in the figure) is then a smooth function with the joint replaced by a quadratic piece between $1 - 1/\lambda_\gamma$ and 1, and linear beyond that interval.
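In code, this effective loss in the margin $m = y f(x)$ can be written piecewise and checked against profiling out $\gamma$ directly (a sketch; the function names are ours, not from the paper):

```python
def effective_hinge(m, lam):
    """Effective SVM loss after minimizing over gamma: zero for m >= 1,
    quadratic (lam/2)(1-m)^2 on (1 - 1/lam, 1), linear with slope -1 below."""
    xi = 1.0 - m
    if xi <= 0.0:
        return 0.0
    if xi < 1.0 / lam:
        return 0.5 * lam * xi * xi
    return xi - 0.5 / lam

def profiled(m, lam):
    """min over g >= 0 of [1 - m - g]_+ + (lam/2) g^2, via the closed form."""
    xi = 1.0 - m
    g = 0.0 if xi <= 0.0 else min(xi, 1.0 / lam)
    return max(xi - g, 0.0) + 0.5 * lam * g * g

# The piecewise formula matches the profiled-out objective exactly.
lam = 2.0
for m in (-2.0, 0.0, 0.6, 0.9, 1.5):
    assert abs(effective_hinge(m, lam) - profiled(m, lam)) < 1e-12
```

This is a reversed Huber-type smoothing of the hinge: quadratic near the joint, linear for badly misclassified cases.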

## 5 Simulation Studies

We present results from various numerical experiments to illustrate the effect of the proposed modification of modeling procedures by regularization of case-specific parameters.

### 5.1 Regression

#### 5.1.1 ℓ2-adjusted quantile regression

The effectiveness of the $\ell_2$-adjusted quantile regression depends on the penalty parameter $\lambda_\gamma$ in (6), which yields $[-(1-q)/\lambda_\gamma,\; q/\lambda_\gamma]$ as the interval of quadratic adjustment.
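This interval can be verified numerically: with the check loss $\rho_q(r) = r\{q - I(r<0)\}$ and an $\ell_2$ penalty on $\gamma$, profiling out $\gamma$ yields a loss that is exactly quadratic on the interval (a sketch with our own helper names):

```python
def rho(r, q):
    """Quantile check loss rho_q(r) = r * (q - I(r < 0))."""
    return r * (q - (1.0 if r < 0.0 else 0.0))

def gamma_hat(r, q, lam):
    """Closed-form minimizer of rho(r - g, q) + (lam/2) g^2: clip r to the interval."""
    return min(max(r, -(1.0 - q) / lam), q / lam)

def effective_loss(r, q, lam):
    g = gamma_hat(r, q, lam)
    return rho(r - g, q) + 0.5 * lam * g * g

q, lam = 0.3, 2.0  # interval [-(1-q)/lam, q/lam] = [-0.35, 0.15]
for r in (-0.3, 0.0, 0.1):   # inside the interval: exactly quadratic
    assert abs(effective_loss(r, q, lam) - 0.5 * lam * r * r) < 1e-12
for r in (-1.0, 1.0):        # outside: linear, a vertically shifted check loss
    g = gamma_hat(r, q, lam)
    assert abs(effective_loss(r, q, lam) - (rho(r, q) - 0.5 * lam * g * g)) < 1e-12
```

For $q = 1/2$ this reduces to a Huber-type loss, quadratic near zero and linear in the tails.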

We undertook extensive simulation studies (available in Jung, MacEachern and Lee, 2010) to establish guidelines for selection of the penalty parameter $\lambda_\gamma$ in the linear regression model setting. The studies encompassed a range of sample sizes, a variety of quantiles $q$, and distributions exhibiting symmetry, varying degrees of asymmetry, and a variety of tail behaviors. The modified quantile regression method was directly implemented by specifying the effective $\psi$-function (the derivative of the effective loss) in the rlm function in the R package MASS.

An empirical rule was established via a (robust) regression analysis. The analysis considered $\lambda_\gamma$ of the form $\lambda_\gamma = c/\hat{\sigma}$, where $c$ is a constant depending on $q$ and $\hat{\sigma}$ is a robust estimate of the scale of the error distribution. The goal of the analysis was to find a rule which, across a broad range of conditions, resulted in an MSE near the condition-specific minimum MSE. Here MSE is defined as the mean squared error of the estimated regression quantile at a new $x$, integrated over the distribution of the covariates.

After initial examination of the MSE over a range of values, we fixed the constant at a single value for good finite sample performance across a wide range of conditions. For each condition under consideration, we then varied $\lambda_\gamma$ to obtain the smallest MSE by grid search. For a quick illustration, Figure 4 shows the intervals of adjustment with such optimal $\lambda_\gamma$ for various error distributions, values of $q$, and sample sizes. Wider optimal intervals indicate that more quadratic adjustment is preferred to the standard quantile regression for reduction of MSE. Clearly, Figure 4 demonstrates the benefit of the proposed quadratic adjustment of quantile regression in terms of MSE across a broad range of situations, especially when the sample size is small.

In general, MSE values begin to decrease as the size of the adjustment increases from zero, and increase after hitting the minimum, due to an increase in bias. There is an exception to this typical pattern when estimating the median with normally distributed errors. MSE monotonically decreases in this case as the interval of adjustment widens, confirming the optimality properties of least squares regression for normal theory regression. The comparison between the sample mean and sample median can be made explicitly under $t$ error distributions with different degrees of freedom. The benefit of the median relative to the mean is greater for thicker-tailed distributions. We observe that this qualitative behavior carries over to the optimal intervals. Thicker tails lead to shorter optimal intervals, as shown in Figure 4.
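The two limits of the adjustment make the median-with-normal-errors case concrete in a location model: $\lambda_\gamma \to \infty$ recovers the sample median (no adjustment), while $\lambda_\gamma \to 0$ recovers the sample mean (fully quadratic loss). A small Monte Carlo sketch of these two limits under normal errors (our own illustration, not the paper's simulation design):

```python
import random
import statistics

random.seed(0)
n, reps = 20, 2000
mse_median = 0.0
mse_mean = 0.0
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    mse_median += statistics.median(y) ** 2  # lambda_gamma -> infinity limit
    mse_mean += statistics.fmean(y) ** 2     # lambda_gamma -> 0 limit
mse_median /= reps
mse_mean /= reps

# Under normal errors the fully quadratic limit (the mean) has smaller MSE,
# consistent with the monotone decrease in MSE as the interval widens.
assert mse_mean < mse_median
```

Under thicker-tailed errors the inequality reverses for sufficiently heavy tails, which is why the optimal interval of adjustment shortens.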