Robust convex clustering: How does fusion penalty enhance robustness?

Chenyu Liu, Qiang Sun, and Kean Ming Tan
School of Statistics, University of Minnesota Twin Cities, 224 Church St SE, Minneapolis, MN 55455, USA; E-mail: ktan@umn.edu.
Department of Statistical Sciences, University of Toronto, 100 St. George Street, Toronto, ON M5S 3G3, Canada; E-mail: qsun@utstat.toronto.edu.
Abstract

Convex clustering has gained popularity recently due to its desirable performance in empirical studies. It involves solving a convex optimization problem with the cost function being a squared error loss plus a fusion penalty that encourages the estimated centroids for observations in the same cluster to be identical. However, when data are contaminated, convex clustering with a squared error loss will fail to identify correct cluster memberships even when there is only one outlier. To address this challenge, we propose a robust convex clustering method. Theoretically, we show that the new estimator is resistant to arbitrary outliers: it does not break down until more than half of the observations are arbitrary outliers. The key observation is that the fusion penalty can help enhance robustness. Numerical studies are performed to demonstrate the competitive performance of the proposed method.

Keywords: Breakdown point, fusion penalty, Huber loss, outliers, robustness.

1 Introduction

Clustering is ubiquitous in many scientific disciplines such as pattern recognition, machine learning, and bioinformatics. Given $n$ observations, the goal of clustering is to group the observations into distinct non-overlapping clusters. Traditional clustering methods such as $K$-means and hierarchical clustering take a greedy approach and are sensitive, respectively, to the initialization of the clusters and to the choice of distance metric and linkage (Hastie et al., 2009; Johnson and Wichern, 2002).

To address these challenges, several authors have proposed a convex formulation of the clustering problem, referred to as convex clustering (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten et al., 2011). Specifically, convex clustering solves a convex optimization problem with the cost function being a squared error loss plus a fusion penalty that encourages centroids for observations in the same cluster to be identical. Efficient algorithms for convex clustering have been developed (Chi and Lange, 2015; Chen et al., 2015; Sun et al., 2018; Weylandt et al., 2019). Theoretical properties of convex clustering were studied (Zhu et al., 2014; Tan and Witten, 2015; Wang et al., 2018; Radchenko and Mukherjee, 2017; Chi and Steinerberger, 2018). Chi et al. (2017) and Chi et al. (2018) considered extensions of convex clustering to perform co-clustering on matrices and tensors, respectively.

Convex clustering is developed based on an inherent assumption that there are no outliers in the data. However, in practice, large-scale data sets are often contaminated. Due to the use of the squared error loss, a naive application of convex clustering will place each outlier in a singleton cluster. To address this issue, we propose a robust convex clustering method by substituting the squared error loss in the convex clustering formulation with a Huber loss (Huber, 1964, 1973). The resulting optimization problem is convex, which we solve using an alternating direction method of multipliers algorithm. We refer readers to Rousseeuw (1984); Rousseeuw and Yohai (1984); Yohai (1987); Mizera and Müller (1999) and Salibian-Barrera and Zamar (2002) for classical analysis of robust $M$-estimators in the presence of arbitrary outliers, and to Catoni (2012); Sun et al. (2019); Avella-Medina et al. (2018); Ke et al. (2019); Tan et al. (2018) for nonasymptotic analysis of Huber regression with a diverging robustification parameter under heavy-tailed distributions.

We analyze the breakdown point of the proposed robust convex clustering method. Informally, the breakdown point of an estimator is the proportion of arbitrary outliers an estimator can tolerate before it produces arbitrarily large estimates (Hampel, 1971). We show that the proposed estimator does not break down until more than half of the observations are arbitrary outliers. This is counterintuitive, at least to us: we expected that a single arbitrarily large outlier would destroy the clustering procedure, because there are as many centroid parameters as there are samples. The key observation is that the fusion penalty plays a central role in robustifying the clustering procedure by fusing the outliers to the uncontaminated observations.

2 Robust Convex Clustering

Let $X \in \mathbb{R}^{n \times p}$ be a data matrix with $n$ observations and $p$ features. Convex clustering estimates an $n \times p$ centroid matrix $U$ by solving the following convex optimization problem

$$\underset{U \in \mathbb{R}^{n \times p}}{\mathrm{minimize}} \;\; \frac{1}{2} \sum_{i=1}^{n} \|x_i - u_i\|_2^2 + \gamma \sum_{i < j} w_{ij}\, \|u_i - u_j\|_2, \qquad (2.1)$$

where $x_i$ and $u_i$ are the $i$th rows of $X$ and $U$, respectively (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten et al., 2011). Let $\hat{U}$ be a solution obtained from solving (2.1). The fused group lasso penalty, $\sum_{i<j} w_{ij} \|u_i - u_j\|_2$, encourages the rows of $U$ to be similar to each other. The number of unique rows in $\hat{U}$ is controlled by the nonnegative weights, $w_{ij}$, and the nonnegative tuning parameter $\gamma$. The cluster assignments can be inferred from $\hat{U}$: the $i$th and $j$th observations are estimated to belong to the same cluster if and only if $\hat{u}_i = \hat{u}_j$.

Because the squared error loss is sensitive to outliers, convex clustering often fails to identify the correct cluster memberships when the data are contaminated. In particular, if the $i$th observation is contaminated so that it takes a value significantly different from the other observations in its cluster, then the procedure will fail to cluster the $i$th observation correctly. To mitigate this issue, we propose to substitute the squared error loss in (2.1) with a loss function that is robust against outliers. We focus on the Huber loss, formally defined as follows (Huber, 1964):

$$\ell_{\tau}(a) = \begin{cases} \frac{1}{2} a^2, & \text{if } |a| \leq \tau, \\ \tau |a| - \frac{1}{2} \tau^2, & \text{if } |a| > \tau. \end{cases} \qquad (2.2)$$

The Huber loss is a mixture of the squared error loss when $|a| \leq \tau$ and the absolute error loss when $|a| > \tau$, where $\tau$ is referred to as the robustification parameter and controls the tradeoff between bias and robustness. For small values of $\tau$, the Huber loss gains robustness at the price of allowing the estimator to be biased.
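To make the definition concrete, the short Python sketch below evaluates the Huber loss elementwise and contrasts it with the squared error loss on a large residual. The function name and the choice $\tau = 1.345$ are illustrative only and are not taken from the paper.

import numpy as np

def huber_loss(a, tau):
    """Elementwise Huber loss: quadratic for |a| <= tau, linear beyond tau."""
    a = np.asarray(a, dtype=float)
    quad = 0.5 * a**2
    lin = tau * np.abs(a) - 0.5 * tau**2
    return np.where(np.abs(a) <= tau, quad, lin)

# A single large residual dominates the squared error loss but not the Huber loss.
residuals = np.array([0.3, -0.5, 12.0])
print(0.5 * residuals**2)            # [0.045, 0.125, 72.0]
print(huber_loss(residuals, 1.345))  # [0.045, 0.125, ~15.24]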

We propose to estimate the centroid matrix $U$ under the Huber loss:

$$\underset{U \in \mathbb{R}^{n \times p}}{\mathrm{minimize}} \;\; \sum_{i=1}^{n} \ell_{\tau}(x_i - u_i) + \gamma \sum_{i < j} w_{ij}\, \|u_i - u_j\|_2, \qquad (2.3)$$

where we use the notation $\ell_{\tau}(x_i - u_i)$ to indicate $\sum_{j=1}^{p} \ell_{\tau}(x_{ij} - u_{ij})$. Note that (2.3) reduces to the convex clustering formulation (2.1) as $\tau \to \infty$. The Huber loss is a convex function, and thus an efficient algorithm can be developed to obtain a global optimum of (2.3). While we pick the Huber loss due to its computational advantage, we will show later that the Huber loss together with the fusion penalty guarantees the robustness of the proposed estimator.
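For readers who prefer code to notation, here is a minimal evaluator of the objective in (2.3), reusing the huber_loss function sketched above. The symmetric weight matrix W and all argument names are illustrative; the function only evaluates the objective and does not solve (2.3).

import numpy as np
from itertools import combinations

def robust_objective(X, U, W, gamma, tau):
    """Objective value of (2.3): Huber data-fit term plus weighted fusion penalty."""
    fit = huber_loss(X - U, tau).sum()            # sum of elementwise Huber losses
    fuse = sum(W[i, j] * np.linalg.norm(U[i] - U[j])
               for i, j in combinations(range(X.shape[0]), 2))
    return fit + gamma * fuse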

Since (2.3) is a convex optimization problem, we solve it using an alternating direction method of multipliers (ADMM) algorithm (Boyd and Vandenberghe, 2004). Our algorithm is a modified version of that of Chi and Lange (2015) to accommodate the Huber loss. The main idea is to decouple the terms in (2.3) that are difficult to optimize jointly. Let $V$ be a matrix with $\binom{n}{2}$ rows and $p$ columns. With some abuse of notation, let $v_{ij}$ be the row of $V$ corresponding to the pair of indices $(i, j)$ with $i < j$. We recast (2.3) as the following equivalent constrained problem

(2.4)

Construct a matrix $A$ such that $AU$ stacks the pairwise differences $u_i - u_j$, $i < j$, as its rows. Then, it can be shown that the scaled augmented Lagrangian function for (2.4) takes the form

where , , are the primal variables, and are the dual variables, $\rho$ is a nonnegative tuning parameter for the ADMM algorithm, and $\|\cdot\|_F$ is the Frobenius norm. The updates on both the primal and dual variables can be derived by minimizing the scaled augmented Lagrangian function. Algorithm 1 summarizes the ADMM algorithm for solving (2.3). A detailed derivation of the ADMM algorithm is deferred to Appendix S.1.

  1. Input the tuning parameter , robustification parameter , tolerance level , and .

  2. Initialize the primal variables , , , and dual variables and .

  3. Iterate until the stopping criterion is met:

    1. .

    2. For each element in :

      where is the soft-thresholding operator.

    3. For all , let and set

      where .

    4. For all , .

    5. .

Algorithm 1 An alternating direction method of multipliers algorithm for solving (2.3).
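The soft-thresholding step in Algorithm 1 comes from the proximal operator of the Huber loss. The Python sketch below is a generic derivation of that operator for a step size $\rho$; it is written as an illustration, is not copied from the paper's update formulas, and the variable names are ours.

import numpy as np

def huber_prox(a, tau, rho):
    """Proximal operator of the Huber loss with step size rho:
    argmin_z  huber_loss(z, tau) + (rho / 2) * (z - a)**2, applied elementwise."""
    a = np.asarray(a, dtype=float)
    threshold = tau * (1.0 + rho) / rho
    shrink = rho * a / (1.0 + rho)                  # quadratic branch
    soft = np.sign(a) * (np.abs(a) - tau / rho)     # linear branch (soft-thresholding)
    return np.where(np.abs(a) <= threshold, shrink, soft)

As $\tau \to \infty$ only the quadratic branch is active and the operator reduces to the ridge-type shrinkage $\rho a/(1+\rho)$, which matches the squared error loss case.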

Many authors have observed that the weights $w_{ij}$ in (2.1) can affect the solution of convex clustering substantially (Chi and Lange, 2015; Chi et al., 2017). A commonly used weight function is $w_{ij} = \exp(-\phi \|x_i - x_j\|_2^2)$ for some $\phi > 0$ (Chi and Lange, 2015). However, when a few elements of $x_i$ are contaminated, the weight $w_{ij}$ will be close to zero even when $x_i$ and $x_j$ belong to the same cluster. Consequently, in the presence of outliers, the $i$th and $j$th observations will not be estimated to belong to the same cluster, since the centroids for the two observations are not penalized to be the same.
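The sensitivity of the Gaussian kernel weights to contamination is easy to see numerically. In the sketch below, $\phi$ is set to an arbitrary value of 0.5, and a single corrupted coordinate drives the weight between two observations from the same cluster essentially to zero.

import numpy as np

phi = 0.5
x_i = np.zeros(10)
x_j = np.zeros(10)              # same cluster as x_i: distance 0, weight 1
x_j_corrupt = x_j.copy()
x_j_corrupt[0] = 20.0           # one contaminated coordinate

w_clean = np.exp(-phi * np.sum((x_i - x_j) ** 2))            # 1.0
w_corrupt = np.exp(-phi * np.sum((x_i - x_j_corrupt) ** 2))  # exp(-200), numerically 0
print(w_clean, w_corrupt)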

To mitigate this issue, we propose a new weight function that is less sensitive to outliers. Let and for some . We propose

(2.5)

The weight function in (2.5) has the desirable property of assigning higher weights to pairs of observations that belong to the same cluster, even when some of their elements are contaminated, than to pairs of observations that belong to different clusters. We will assess the performance of the two different weight functions via numerical studies in Section 4.

3 Breakdown Point Analysis

In this section, we examine the breakdown point property of our proposed estimator. Without loss of generality, we focus on robust convex clustering (2.3) with all of the weights set equal to one:

$$\hat{U} = \underset{U \in \mathbb{R}^{n \times p}}{\arg\min} \;\; \sum_{i=1}^{n} \ell_{\tau}(x_i - u_i) + \gamma \sum_{i < j} \|u_i - u_j\|_2. \qquad (3.1)$$

This assumption can be relaxed as long as the weights are strictly positive, i.e., $w_{ij} > 0$ for all $i < j$. Recall that $X$ is the original, uncontaminated data matrix. For an integer $m$, we define $\mathcal{X}_m$ to be the set of all possible contaminated data matrices that are obtained by replacing at most $m$ rows of the original data $X$. Throughout this section, let $\tilde{X} \in \mathcal{X}_m$ be the contaminated data. Let $\hat{U}(X)$ and $\hat{U}(\tilde{X})$ be the solutions to (3.1) with the original data $X$ and the contaminated data $\tilde{X}$, respectively. We now provide a formal definition of the breakdown point of $\hat{U}$ (Donoho and Huber, 1983).

Definition 3.1.

The breakdown point of $\hat{U}$ is defined as

$$\varepsilon^{*}(\hat{U}; X) \;=\; \min\left\{ \frac{m}{n} \,:\, \sup_{\tilde{X} \in \mathcal{X}_m} \big\| \hat{U}(\tilde{X}) - \hat{U}(X) \big\|_F = \infty \right\}.$$

The supremum is taken over all possible contaminated data sets in $\mathcal{X}_m$. Thus, the quantity $\varepsilon^{*}(\hat{U}; X)$ can be interpreted as the smallest proportion of contaminated samples for which the robust convex clustering procedure (3.1) produces an arbitrarily large estimate with respect to the Frobenius norm. We are ready to present the main theorem.

Theorem 3.2.

Take . Then the robust convex clustering estimator obtained from solving (3.1) has a breakdown point of at least , and at most .

Theorem 3.2 indicates that, as long as the ratio does not grow too quickly, robust convex clustering with the Huber loss has a breakdown point of at least $1/2$. This result seems counterintuitive. In particular, Theorem 4 of Rousseeuw and Leroy (1987) showed that the breakdown point of any regression equivariant estimator is at most

$$\frac{\lfloor (n - q)/2 \rfloor + 1}{n},$$

where $q$ is the total number of unknown parameters and $n$ is the total number of observations. In the convex clustering formulation, there are a total of $n$ observations and $n$ unknown parameter vectors, one centroid per observation. Because the Huber regression estimator is regression equivariant, this indicates that the estimator obtained from solving

$$\underset{U \in \mathbb{R}^{n \times p}}{\mathrm{minimize}} \;\; \sum_{i=1}^{n} \ell_{\tau}(x_i - u_i)$$

has a breakdown point of at most $1/n$. Thus, the fusion penalty helps enhance the robustness property by improving the breakdown point from $1/n$ to at least $1/2$.
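A quick way to see the $1/n$ bound directly, stated here only as an illustration: without the fusion penalty, the problem separates across observations,

$$\min_{U} \sum_{i=1}^{n} \ell_{\tau}(x_i - u_i) \;=\; \sum_{i=1}^{n} \min_{u_i} \ell_{\tau}(x_i - u_i), \qquad \hat{u}_i = x_i,$$

so replacing a single row $x_i$ by an arbitrarily remote point drags $\hat{u}_i$ along with it, and $\|\hat{U}(\tilde{X}) - \hat{U}(X)\|_F$ becomes unbounded. One contaminated row out of $n$ therefore suffices to break the estimator.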

4 Numerical Studies

We consider two cases in numerical studies: heavy-tailed random noise and arbitrary outliers. To evaluate the quality of a clustering procedure, we use the adjusted Rand index (Rand, 1971). A value that is close to one indicates good agreement between the true and estimated clusters. We compare the proposed method to convex clustering (2.1), implemented using the R package cvxclustr (Chi and Lange, 2014).

Our proposed method involves two tuning parameters: we simply fix the robustification parameter $\tau$ and pick $\gamma$ such that the solution yields the true number of clusters. We select the tuning parameter for convex clustering with the squared error loss in a similar fashion. We implement our proposed method using the proposed weights, and convex clustering using both weight functions. The remaining constants are chosen such that all clustering methods can identify the correct cluster memberships when there are no outliers in the data.
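The tuning strategy described above amounts to sweeping $\gamma$ over a grid and stopping when the fitted centroids collapse to the desired number of unique rows. The sketch below assumes a hypothetical solver robust_convex_cluster(X, W, gamma, tau) that returns the estimated centroid matrix; it illustrates the selection loop only, not our implementation.

import numpy as np

def choose_gamma(X, W, tau, n_clusters, gamma_grid, tol=1e-6):
    """Return the first gamma on the grid whose solution has n_clusters unique centroids."""
    for gamma in gamma_grid:
        U_hat = robust_convex_cluster(X, W, gamma, tau)   # hypothetical solver
        # Round centroids so rows that have fused (up to tolerance) compare equal.
        unique_rows = np.unique(np.round(U_hat / tol) * tol, axis=0)
        if unique_rows.shape[0] == n_clusters:
            return gamma, U_hat
    return None, None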

Throughout our simulation studies, we assume that the $n$ observations belong to two distinct non-overlapping clusters, $C_1$ and $C_2$. We generate an $n \times p$ data matrix $X$ according to the model $x_i = \mu_1 + \varepsilon_i$ if $i \in C_1$, and $x_i = \mu_2 + \varepsilon_i$ otherwise. The centroids $\mu_1$ and $\mu_2$ for the two clusters are constructed using a $p$-dimensional vector of threes. We simulate each element of the random noise $\varepsilon_i$ from three different distributions: (i) the normal distribution, (ii) the $t$-distribution with two degrees of freedom, and (iii) the log-normal distribution. In addition, we consider scenarios in which certain elements of the data are contaminated: we generate the random noise from the normal distribution and then randomly contaminate 6% and 10% of the observations by replacing a subset of the variables with random values generated from a uniform distribution over a fixed interval. The results are summarized in Table 1.
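For concreteness, a sketch of the data-generating mechanism is given below. The cluster means, contamination fraction, number of corrupted variables, and uniform range are placeholder values; the exact constants used in the simulations are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50
mu = 3.0 * np.ones(p)                        # placeholder cluster means: +mu and -mu
labels = np.repeat([0, 1], n // 2)
X = np.where(labels[:, None] == 0, mu, -mu) + rng.standard_normal((n, p))

# Contaminate 10% of the observations by overwriting a subset of their coordinates
# with uniform noise on a wide interval (placeholder choices).
outliers = rng.choice(n, size=int(0.1 * n), replace=False)
corrupt_coords = rng.choice(p, size=p // 2, replace=False)
X[np.ix_(outliers, corrupt_coords)] = rng.uniform(
    -100, 100, size=(outliers.size, corrupt_coords.size))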

                            Arbitrary outliers                    Heavy-tailed noise
                            0%      6%           10%              Gaussian   $t_2$        Log-normal
our proposal                1 (0)   1 (0)        1 (0)            1 (0)      0.99 (0.01)  1 (0)
cvxclustr (first weight)    1 (0)   0 (0)        0 (0)            1 (0)      0 (0)        0 (0)
cvxclustr (second weight)   1 (0)   0.13 (0.03)  0.06 (0.02)      1 (0)      0.06 (0.02)  0.53 (0.05)
our proposal                1 (0)   1 (0)        1 (0)            1 (0)      1 (0)        1 (0)
cvxclustr (first weight)    1 (0)   0 (0)        0 (0)            1 (0)      0 (0)        0 (0)
cvxclustr (second weight)   1 (0)   0.37 (0.03)  0.35 (0.05)      1 (0)      0.21 (0.04)  0.53 (0.05)
Table 1: Simulation results for the data contamination model and the heavy-tailed random noise models. Each block of three rows corresponds to one simulation setting; the two convex clustering (cvxclustr) rows within a block use the two weight functions described in Section 2. The mean (standard error) of the adjusted Rand index, averaged over 100 replications, is reported.

From Table 1, we see that when there is no data contamination and the random noise is Gaussian, all three methods give an adjusted Rand index of one, indicating that all methods identify the correct cluster memberships. When 6% and 10% of the observations are contaminated with outliers, our proposed method still identifies the correct cluster memberships perfectly, while convex clustering with the squared error loss performs poorly under both weight functions. In particular, convex clustering with the first weight function has an adjusted Rand index of zero. This is because convex clustering assigns each outlier to its own singleton cluster and groups the rest of the observations into a single cluster. Our proposed method is also able to identify the correct cluster memberships when the random noise is generated from a $t$-distribution or a log-normal distribution. Convex clustering, however, fails to identify the cluster memberships when the random noise is heavy-tailed. We observe similar results for the second simulation setting.

5 Discussion

We propose a robust convex clustering method for clustering problems with outliers. An alternating direction method of multipliers algorithm is proposed to solve the resulting convex optimization problem. We study the breakdown point property of our estimator and establish a rather counterintuitive result: by coupling the Huber loss with a fused group lasso penalty, the robust convex clustering estimator has a breakdown point of at least $1/2$, whereas without the fusion penalty the estimator has a breakdown point of only $1/n$. In other words, the fused group lasso penalty helps improve the robustness property. To the best of our knowledge, such a phenomenon has not been observed in the literature. It is interesting to study whether this observation can be generalized.

Our proposed method can be readily extended to the biclustering problem (Tan and Witten, 2014; Chi et al., 2017) and the co-clustering problem for tensors (Chi et al., 2018). In addition, different robust loss functions can be employed for robust convex clustering. One practical limitation of convex clustering approaches is that it is computationally intensive to obtain the entire solution path. To address this issue, Weylandt et al. (2019) proposed an iterative one-step approximation scheme and showed that the estimated solution path converges to the exact solution path under the assumption that the loss function is strongly convex. It would be interesting to study whether the results in Weylandt et al. (2019) can be generalized to robust loss functions.

Acknowledgement

We would like to thank Professor Peter Rousseeuw for carefully reading the first draft and for pointing us to a proof of an upper bound for the breakdown point.

References

  • Avella-Medina et al. (2018) Avella-Medina, M., Battey, H. S., Fan, J. and Li, Q. (2018). Robust estimation of high-dimensional covariance and precision matrices. Biometrika 105 271–284.
  • Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York.
  • Catoni (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de l'Institut Henri Poincaré - Probabilités et Statistiques 48 1148–1185.
  • Chen et al. (2015) Chen, G. K., Chi, E. C., Ranola, J. and Lange, K. (2015). Convex clustering: An attractive alternative to hierarchical clustering. PLOS Computational Biology 11 e1004228.
  • Chi et al. (2017) Chi, E. C., Allen, G. I. and Baraniuk, R. G. (2017). Convex biclustering. Biometrics 73 10–19.
  • Chi et al. (2018) Chi, E. C., Gaines, B. R., Sun, W. W., Zhou, H. and Yang, J. (2018). Provable convex co-clustering of tensors. arXiv preprint arXiv:1803.06518.
  • Chi and Lange (2014) Chi, E. C. and Lange, K. (2014). cvxclustr: Splitting methods for convex clustering. URL http://cran.r-project.org/web/packages/cvxclustr. R package version 1.1.1.
  • Chi and Lange (2015) Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24 994–1013.
  • Chi and Steinerberger (2018) Chi, E. C. and Steinerberger, S. (2018). Recovering trees with convex clustering. arXiv preprint arXiv:1806.11096.
  • Donoho and Huber (1983) Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift For Erich L. Lehmann. Belmont, Wadsworth, 157–184.
  • Hampel (1971) Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics 42 1887–1896.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
  • Hocking et al. (2011) Hocking, T. D., Joulin, A., Bach, F. and Vert, J.-P. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning.
  • Huber (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35 73–101.
  • Huber (1973) Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics 1 799–821.
  • Johnson and Wichern (2002) Johnson, R. A. and Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey.
  • Ke et al. (2019) Ke, Y., Minsker, S., Ren, Z., Sun, Q. and Zhou, W.-X. (2019). User-friendly covariance estimation for heavy-tailed distributions. Statistical Science, in press.
  • Lindsten et al. (2011) Lindsten, F., Ohlsson, H. and Ljung, L. (2011). Clustering using sum-of-norms regularization: With application to particle filter output computation. In 2011 IEEE Statistical Signal Processing Workshop (SSP).
  • Mizera and Müller (1999) Mizera, I. and Müller, C. H. (1999). Breakdown points and variation exponents of robust M-estimators in linear models. The Annals of Statistics 27 1164–1177.
  • Pelckmans et al. (2005) Pelckmans, K., De Brabanter, J., Suykens, J. and De Moor, B. (2005). Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop.
  • Radchenko and Mukherjee (2017) Radchenko, P. and Mukherjee, G. (2017). Consistent clustering via fusion penalty. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1527–1546.
  • Rand (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 846–850.
  • Rousseeuw (1984) Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association 79 871–880.
  • Rousseeuw and Leroy (1987) Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley, New York.
  • Rousseeuw and Yohai (1984) Rousseeuw, P. J. and Yohai, V. (1984). Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis. Springer, 256–272.
  • Salibian-Barrera and Zamar (2002) Salibian-Barrera, M. and Zamar, R. H. (2002). Bootstrapping robust estimates of regression. The Annals of Statistics 30 556–582.
  • Sun et al. (2018) Sun, D., Toh, K.-C. and Yuan, Y. (2018). Convex clustering: Model, theoretical guarantee and efficient algorithm. arXiv preprint arXiv:1810.02677.
  • Sun et al. (2019) Sun, Q., Zhou, W.-X. and Fan, J. (2019). Adaptive Huber regression. Journal of the American Statistical Association, in press.
  • Tan et al. (2018) Tan, K. M., Sun, Q. and Witten, D. M. (2018). Robust sparse reduced rank regression in high dimensions. arXiv preprint arXiv:1810.07913v2.
  • Tan and Witten (2014) Tan, K. M. and Witten, D. M. (2014). Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics 23 985–1008.
  • Tan and Witten (2015) Tan, K. M. and Witten, D. M. (2015). Statistical properties of convex clustering. Electronic Journal of Statistics 9 2324–2347.
  • Wang et al. (2018) Wang, B., Zhang, Y., Sun, W. W. and Fang, Y. (2018). Sparse convex clustering. Journal of Computational and Graphical Statistics 27 393–403.
  • Weylandt et al. (2019) Weylandt, M., Nagorski, J. and Allen, G. (2019). Dynamic visualization and fast computation for convex clustering via algorithm regularization. arXiv preprint arXiv:1901.01477v1.
  • Yohai (1987) Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics 15 642–656.
  • Zhu et al. (2014) Zhu, C., Xu, H., Leng, C. and Yan, S. (2014). Convex optimization procedure for clustering: Theoretical revisit. In Advances in Neural Information Processing Systems 27.

Appendix

Appendix S.1 Derivation of Algorithm 1

We derive an alternating direction method of multipliers algorithm for solving (2.3). Recall that the scaled augmented Lagrangian function for the equivalent constrained problem (2.4) takes the form

The alternating direction method of multipliers algorithm requires the following updates:

We now derive the updates for , , and .

Update for : An update for can be obtained by solving the following minimization problem:

The above problem can be solved element-wise:

(S.1)

Due to the use of Huber loss, there are two different cases: (i) ; and (ii) .

For the case when , (S.1) reduces to

Thus, we have . Substituting this into the constraint , we obtain . Thus,

For the case , we solve the problem

To this end, let . By a change of variable, we have

It can be shown that , where is the soft-thresholding operator. Thus we have

Update for : For each pair of , we update by solving the problem:

This is a standard group lasso problem with the following update:

where .
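The group lasso update above is blockwise soft-thresholding toward the origin. The following sketch shows the generic operator; the threshold is supplied by the caller, and in our notation it would be of the form $\gamma w_{ij}/\rho$, which is our reading of the update rather than a formula taken from the text.

import numpy as np

def group_soft_threshold(v, threshold):
    """Shrink the vector v toward zero: scale its norm down by `threshold`, or map to zero."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v

# In the V-update, v would be the current difference of centroids plus the scaled
# dual variable, and the threshold gamma * w_ij / rho (assumed form of the update).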

Update for : To update , we solve

(S.2)

To simplify the expression above, we construct an matrix such that . Then, (S.2) is equivalent to

Solving the above yields

Appendix S.2 Proof of Theorem 3.2

We first present several lemmas on the properties of Huber loss that will be helpful to the proof of Theorem 3.2.

Lemma S.2.1.

For any two scalars and , we have .

Proof of Lemma S.2.1.

The proof of Lemma S.2.1 is collected in Appendix S.3. ∎

Lemma S.2.2.

For any and , we have

Proof of Lemma S.2.2.

The proof of this lemma is a direct application of L’Hopital’s Rule and thus is omitted. ∎

Lemma S.2.3.

For any and , we have

Proof of Lemma S.2.3.

The proof of this lemma is a direct application of L’Hopital’s Rule and thus is omitted. ∎

Proof of Theorem 3.2.

We break the proof into two parts.

Part I: A lower bound for the breakdown point.

We start with defining some notation. Let

(S.1)

where is the Huber loss defined in (2.2). Note that (S.1) is a special case of robust convex clustering (2.3) with for all . Recall from Definition 3.1 the breakdown point of an estimator, . Let . For every , there exists an such that , where and are estimators obtained from minimizing and , respectively. Without loss of generality, we assume that the first samples in are uncontaminated, i.e., , where are the contaminated data. For notational simplicity, we write . Moreover, we define two sets that contain indices for the original data and contaminated data, and , respectively.

Since is the minimizer of , we have , implying

(S.2)

By Lemma S.2.1 and the symmetry of Huber loss, we obtain

(S.3)

and

(S.4)

Substituting (S.3) and (S.4) into (S.2) yields

(S.5)

We now study the effect of the fused group lasso penalty on the contaminated data. The penalty term can be rewritten as

By definition, as , . By the compactness of the unit sphere, we may assume that converges to some point , passing to a subsequence if necessary.

Dividing (S.5) by and taking the limit when , we obtain

(S.6)

Dropping the third term on the left-hand side and applying Lemmas S.2.2 and S.2.3, (S.6) reduces to

Using the fact that for any vector , we obtain

(S.7)

We now analyze (S.7) by considering two cases. For the first case, by the triangle inequality , (S.7) reduces to

Simplifying the above expression yields

(S.8)

For the second case, we use the triangle inequality . Following a similar calculation, we obtain

The above inequality can be simplified to

(S.9)

provided that the tuning parameters and are chosen such that .

Combining (S.8) and (S.9), we obtain

Now if , we immediately have , which further implies

If , by (S.9), we must have . However, this contradicts the fact that by construction.

Therefore, the assumption reduces to

and we obtain

(S.10)

Part II: An upper bound for the breakdown point. We now sharpen the upper bound in (S.10). First, the cost function in (3.1) is translation equivariant, which indicates that the obtained estimator is also translation equivariant, i.e.,

for any data matrix and any . Let be a data matrix such that the rows are

where is a vector of all ’s. Because there are contaminated rows, for any . Similarly, let be a data matrix such that the rows are

Because has contaminated rows where , for any . Moreover, we have , where is a matrix of all ’s, with some abuse of notation.

By triangle inequality, we have

which by translation equivariance reduces to

Taking the limit yields

This implies

Therefore, combining the results in both parts gives