\OneAndAHalfSpacedXI
\TheoremsNumberedThrough
\ECRepeatTheorems
\EquationsNumberedThrough
\MANUSCRIPTNO{MS-14-00887}

\RUNTITLE{Size Matters: Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization}
\RUNAUTHOR{Rujeerapaiboon, Schindler, Kuhn, and Wiesemann}
\TITLE{Size Matters: Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization}

\ARTICLEAUTHORS{
\AUTHOR{Napat Rujeerapaiboon}
\AFF{Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore, \EMAIL{isenapa@nus.edu.sg}}
\AUTHOR{Kilian Schindler, Daniel Kuhn}
\AFF{Risk Analytics and Optimization Chair, École Polytechnique Fédérale de Lausanne, Switzerland, \EMAIL{kilian.schindler@epfl.ch}, \EMAIL{daniel.kuhn@epfl.ch}}
\AUTHOR{Wolfram Wiesemann}
\AFF{Imperial College Business School, Imperial College London, United Kingdom, \EMAIL{ww@imperial.ac.uk}}
}

\ABSTRACT

Plain vanilla $K$-means clustering has proven to be successful in practice, yet it suffers from outlier sensitivity and may produce highly unbalanced clusters. To mitigate both shortcomings, we formulate a joint outlier detection and clustering problem, which assigns a prescribed number of datapoints to an auxiliary outlier cluster and performs cardinality-constrained $K$-means clustering on the residual dataset, treating the cluster cardinalities as a given input. We cast this problem as a mixed-integer linear program (MILP) that admits tractable semidefinite and linear programming relaxations. We propose deterministic rounding schemes that transform the relaxed solutions into feasible solutions for the MILP. We also prove that these solutions are optimal in the MILP if a cluster separation condition holds.

\KEYWORDS

Semidefinite programming, $K$-means clustering, outlier detection, optimality guarantee

1 Introduction

Clustering aims to partition a set of datapoints into a set of clusters so that datapoints in the same cluster are more similar to one another than to those in other clusters. Among the myriad clustering approaches in the literature, $K$-means clustering stands out for its long history, dating back to 1957, as well as its impressive performance in various application domains, ranging from market segmentation and recommender systems to image segmentation and feature learning (Jain 2010).

This paper studies the cardinality-constrained $K$-means clustering problem, which we define as the task of partitioning $N$ datapoints $\xi_1,\ldots,\xi_N\in\mathbb{R}^d$ into $K$ clusters of prescribed sizes $n_1,\ldots,n_K$, with $n_1+\cdots+n_K=N$, so as to minimize the sum of squared intra-cluster distances. We can formalize the cardinality-constrained $K$-means clustering problem as follows,
\begin{equation}
\underset{(I_1,\ldots,I_K)\,\in\,\mathcal{P}(n_1,\ldots,n_K)}{\text{minimize}}\quad\sum_{k=1}^{K}\sum_{i\in I_k}\Big\|\xi_i-\frac{1}{n_k}\sum_{j\in I_k}\xi_j\Big\|^2 \tag{1}
\end{equation}
where
\[
\mathcal{P}(n_1,\ldots,n_K)=\Big\{(I_1,\ldots,I_K)\;:\;\textstyle\bigcup_{k\le K}I_k=\{1,\ldots,N\},\ I_k\cap I_l=\emptyset\ \forall k\neq l,\ |I_k|=n_k\ \forall k\le K\Big\}
\]
denotes the set of all ordered partitions of the set $\{1,\ldots,N\}$ into $K$ sets of sizes $n_1,\ldots,n_K$, respectively.

Our motivation for studying problem (1) is threefold. Firstly, it has been shown by Bennett et al. (2000) and Chen et al. (2006) that the algorithms commonly employed for the unconstrained $K$-means clustering problem frequently produce suboptimal solutions in which some of the clusters contain very few or even no datapoints. In this context, cardinality constraints can act as a regularizer that avoids local minima of poor quality. Secondly, many application domains require the clusters to be of comparable size. This is the case, among others, in distributed clustering (where different computer clusters should contain similar numbers of network nodes), market segmentation (where each customer segment will subsequently be addressed by a marketing campaign) and document clustering (where topic hierarchies should display a balanced view of the available documents); see Banerjee and Ghosh (2006) and Balcan et al. (2013). Finally, and perhaps most importantly, $K$-means clustering is highly sensitive to outliers. To illustrate this, consider the dataset in Figure 1, which accommodates three clusters as well as three individual outliers. The unconstrained $K$-means clustering problem erroneously merges two of the three clusters in order to assign the three outliers to the third cluster (top left graph), whereas a clustering that disregards the three outliers would recover the true clusters and result in a significantly lower objective value (bottom left graph). The cardinality-constrained $K$-means clustering problem, where the cardinality of each cluster is set to one third of all datapoints, shows a similar behavior on this dataset (graphs on the right). We will argue below, however, that the cardinality-constrained $K$-means clustering problem (1) offers an intuitive and mathematically rigorous framework to robustify $K$-means clustering against outliers. A comprehensive and principled treatment of outlier detection methods can be found in the book of Aggarwal (2013).

Figure 1: Sensitivity of the (un)constrained $K$-means clustering problem to outliers. Indicated in parentheses next to the panel titles are the respective sums of squared intra-cluster distances.

To the best of our knowledge, only two solution approaches have been proposed for problem (1) to date. Bennett et al. (2000) combine a classical local search heuristic for the unconstrained $K$-means clustering problem due to Lloyd (1982) with the repeated solution of linear assignment problems to solve a variant of problem (1) that imposes lower bounds on the cluster sizes $n_1,\ldots,n_K$. Banerjee and Ghosh (2006) solve the balanced version of problem (1), where $n_1=\cdots=n_K$, by sampling a subset of the datapoints, performing a clustering on this subset, and subsequently populating the resulting clusters with the remaining datapoints while adhering to the cardinality constraints. Balanced clustering is also considered by Malinen and Fränti (2014) and Costa et al. (2017). Malinen and Fränti (2014) proceed similarly to Bennett et al. (2000) but take explicit advantage of the Hungarian algorithm to speed up the cluster assignment step within the local search heuristic. Costa et al. (2017) propose a variable neighbourhood search heuristic that starts from a random partition of the datapoints into balanced clusters and subsequently searches for better solutions in the neighbourhood obtained by an increasing number of datapoint swaps between two clusters. Although all of these heuristics tend to produce solutions of high quality quickly, they are not known to be polynomial-time algorithms, they do not provide bounds on the suboptimality of the identified solutions, and their performance may be sensitive to the choice of the initial solution. Moreover, none of these local search schemes accounts for outliers.

In recent years, several conic optimization schemes have been proposed to alleviate the shortcomings of these local search methods for the unconstrained $K$-means clustering problem (Peng and Wei 2007, Awasthi et al. 2015). Peng and Wei (2007) develop two semidefinite programming relaxations of the unconstrained $K$-means clustering problem. Their weaker relaxation admits optimal solutions that can be characterized by means of an eigenvalue decomposition. They further use this eigenvalue decomposition to set up a modified $K$-means clustering problem where the dimensionality of the datapoints is reduced to $K-1$ (provided that their original dimensionality was larger than that). To obtain an upper bound, they solve this $K$-means clustering problem of reduced dimensionality, which can be done either exactly by enumerating Voronoi partitions, as described in Inaba et al. (1994), or by approximation methods such as those in Hasegawa et al. (1993). Using either approach, the runtime grows polynomially in the number $N$ of datapoints but not in the number $K$ of desired clusters. Hence, this method is primarily suitable for small $K$. Similar conic approximation schemes have been developed by Elhamifar et al. (2012) and Nellore and Ward (2015) in the context of unconstrained exemplar-based clustering.

Awasthi et al. (2015) and Iguchi et al. (2017) develop probabilistic recovery guarantees for the stronger semidefinite relaxation of Peng and Wei (2007) when the data is generated by a stochastic ball model (i.e., datapoints are drawn randomly from rotation symmetric distributions supported on unit balls). More specifically, they use primal-dual arguments to establish conditions on the cluster separation under which the semidefinite relaxation of Peng and Wei (2007) recovers the underlying clusters with high probability as the number of datapoints increases. The condition of Awasthi et al. (2015) requires less separation in low dimensions, while the condition of Iguchi et al. (2017) is less restrictive in high dimensions. In addition, Awasthi et al. (2015) consider a linear programming relaxation of the unconstrained $K$-means clustering problem and derive similar recovery guarantees for this relaxation as well.

Two more papers study the recovery guarantees of conic relaxations under a stochastic block model (i.e., the dataset is characterized by a similarity matrix where the expected pairwise similarities of points in the same cluster are higher than those of points in different clusters). Ames (2014) considers the densest $K$-disjoint-clique problem, whose aim is to split a given complete graph into $K$ subgraphs so as to maximize the sum of the average similarities of the resulting subgraphs. $K$-means clustering can be considered a specific instance of this broader class of problems. By means of primal-dual arguments, the author derives conditions on the means in the stochastic block model such that his semidefinite relaxation recovers the underlying clusters with high probability as the cardinality of the smallest cluster increases. Vinayak and Hassibi (2016) develop a semidefinite relaxation and regularize it with the trace of the cluster assignment matrix. Using primal-dual arguments, they show that, for specific ranges of the regularization parameter, their regularized semidefinite relaxation recovers the true clusters with high probability as the cardinality of the smallest cluster increases. The probabilistic recovery guarantees of Ames (2014) and Vinayak and Hassibi (2016) can also be extended to datasets containing outliers.

\begin{table}
\centering
\begin{tabular}{lccccc}
\hline
 & Awasthi et al. & Iguchi et al. & Ames & Vinayak and Hassibi & This Paper \\
\hline
data generating model & stochastic ball & stochastic ball & stochastic block & stochastic block & none/arbitrary \\
type of relaxation & SDP + LP & SDP & SDP & SDP & SDP + LP \\
type of guarantee & stochastic & stochastic & stochastic & stochastic & deterministic \\
guarantee depends on $N$ & yes & yes & yes & yes & no \\
guarantee depends on $d$ & yes & yes & no & no & no \\
requires balancedness & yes & yes & no & no & yes \\
proof technique & primal-dual & primal-dual & primal-dual & primal-dual & valid cuts \\
access to cardinalities & no & no & no & no & yes \\
outlier detection & no & no & yes & yes & yes \\
\hline
\end{tabular}
\caption{Comparison of recovery guarantees for $K$-means clustering relaxations.}
\end{table}

In this paper, we propose the first conic optimization scheme for the cardinality-constrained $K$-means clustering problem (1). Our solution approach relies on an exact reformulation of problem (1) as an intractable mixed-integer linear program (MILP) to which we add a set of valid cuts before relaxing the resulting model to a tractable semidefinite program (SDP) or linear program (LP). The set of valid cuts is essential in strengthening these relaxations. Both relaxations provide lower bounds on the optimal value of problem (1), and they both recover the optimal value of (1) whenever a cluster separation condition is met. The latter requires all cluster diameters to be smaller than the distance between any two distinct clusters and, in the presence of outliers, also smaller than the distance between any outlier and any other point. The same condition (in the absence of outliers) was used in Elhamifar et al. (2012) and Awasthi et al. (2015). Our relaxations also give rise to deterministic rounding schemes which produce feasible solutions that are provably optimal in (1) whenever the cluster separation condition holds. Table 1 compares our recovery guarantees to the results available in the literature. We emphasize that our guarantees are deterministic, that they apply to arbitrary data generating models, that they are dimension-independent, and that they hold for both our SDP and LP relaxations. Finally, our algorithms extend to instances of (1) that are contaminated by outliers and whose cluster cardinalities are not known precisely. We summarize the paper's contributions as follows.

  1. We derive a novel MILP reformulation of problem (1) that involves only $KN$ binary variables, as opposed to the standard MILP reformulation that contains $N^2$ binary variables, and whose LP relaxation is at least as tight as the LP relaxation of the standard reformulation.

  2. We develop lower bounds which exploit the cardinality information in problem (1). Our bounds are tight whenever a cluster separation condition is met. Unlike similar results for other classes of clustering problems, our separation condition is deterministic, model-free and dimension-independent. Furthermore, our proof technique does not rely on primal-dual arguments from SDP and LP duality theory.

  3. We propose deterministic rounding schemes that transform the relaxed solutions into feasible solutions for problem (1). These solutions are optimal in (1) if the separation condition holds. To the best of our knowledge, we propose the first tractable solution scheme for problem (1) with optimality guarantees.

  4. We illustrate that our lower bounds and rounding schemes extend to instances of problem (1) that are contaminated by outliers and whose cluster cardinalities are not known precisely.

The remainder of the paper is structured as follows. Section 2 analyzes the cardinality-constrained $K$-means clustering problem (1) and derives the MILP reformulation underlying our solution scheme. Sections 3 and 4 propose and analyze our conic rounding approaches for problem (1) in the absence and presence of outliers, respectively. Section 5 presents numerical experiments, and Section 6 gives concluding remarks. Finally, a detailed description of the heuristic proposed by Bennett et al. (2000) for cardinality-constrained $K$-means clustering is provided in the appendix.

Notation:

We denote by $\mathbf{e}$ the vector of all ones and by $\|\cdot\|$ the Euclidean norm. For symmetric square matrices $A,B\in\mathbb{S}^N$, the relation $A\succeq B$ means that $A-B$ is positive semidefinite, while $A\ge B$ means that $A-B$ is elementwise non-negative. The notation $\langle A,B\rangle=\operatorname{tr}(A^\top B)$ represents the trace inner product of $A$ and $B$. Furthermore, we use $\operatorname{diag}(A)$ to denote the vector in $\mathbb{R}^N$ whose entries coincide with those of $A$'s main diagonal. Finally, for a set of datapoints $\xi_1,\ldots,\xi_N\in\mathbb{R}^d$, we use $D\in\mathbb{S}^N$ to denote the matrix of squared pairwise distances, $d_{ij}=\|\xi_i-\xi_j\|^2$.
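All formulations in this paper access the data only through the distance matrix $D$. As a point of reference, the following short Python sketch (our own illustration, not part of the paper) computes $D$ for a given data matrix:

```python
import numpy as np

def squared_distance_matrix(X):
    """Return D with d_ij = ||x_i - x_j||^2 for the rows x_i of X."""
    sq = np.sum(X ** 2, axis=1)                        # ||x_i||^2 for every i
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D, 0.0)                          # clip round-off negatives

X = np.random.default_rng(0).normal(size=(6, 2))       # six points in R^2
D = squared_distance_matrix(X)
assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)
```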

2 Problem Formulation and Analysis

We first prove that the clustering problem (1) is an instance of a quadratic assignment problem and transform (1) to an MILP with $KN$ binary variables. Then, we discuss the complexity of (1) and show that an optimal clustering always corresponds to some Voronoi partition of $\mathbb{R}^d$.

Our first result relies on the following auxiliary lemma, which we state without proof.

{lemma}

For any vectors $\xi_1,\ldots,\xi_n\in\mathbb{R}^d$, we have
\[
\sum_{i=1}^{n}\Big\|\xi_i-\frac{1}{n}\sum_{j=1}^{n}\xi_j\Big\|^2=\frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{n}\|\xi_i-\xi_j\|^2.
\]

Proof. See Zha et al. (2002, p. 1060).
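The identity in Lemma 2 is easy to check numerically. The following sketch (our own illustration) verifies it on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=(50, 3))                  # n = 50 points in R^3
n = xi.shape[0]

# left-hand side: sum of squared distances to the centroid
lhs = np.sum(np.linalg.norm(xi - xi.mean(axis=0), axis=1) ** 2)

# right-hand side: (1/2n) * sum of all squared pairwise distances
diff = xi[:, None, :] - xi[None, :, :]
rhs = np.sum(diff ** 2) / (2 * n)

assert np.isclose(lhs, rhs)
```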

Using Lemma 2, Costa et al. (2017) observe that the $K$-means objective can be stated as a sum of quadratic terms. In the following proposition, we elaborate on this insight and prove that problem (1) is a specific instance of a quadratic assignment problem.

{proposition}

[Quadratic Assignment Reformulation] The clustering problem (1) can be cast as the quadratic assignment problem
\begin{equation}
\underset{\sigma\in\Sigma}{\text{minimize}}\quad\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}\,d_{\sigma(i)\sigma(j)} \tag{2}
\end{equation}
where $W\in\mathbb{S}^N$ is a block diagonal matrix with blocks $W_k\in\mathbb{S}^{n_k}$, $k=1,\ldots,K$, $\Sigma$ is the set of permutations of $\{1,\ldots,N\}$, and $W$ is defined through $w_{ij}=\frac{1}{n_k}$ if $i$ and $j$ both belong to the $k$-th diagonal block; $w_{ij}=0$ otherwise.

Proof. We show that for any feasible solution of (1) there exists a feasible solution of (2) which attains the same objective value and vice versa. To this end, for any partition $(I_1,\ldots,I_K)$ feasible in (1), consider any permutation $\sigma\in\Sigma$ that maps the $k$-th diagonal block of $W$ to $I_k$ for all $k\le K$, and denote its inverse by $\sigma^{-1}$. This permutation is feasible in (2), and it achieves the same objective value as $(I_1,\ldots,I_K)$ in (1) because
\[
\sum_{k=1}^{K}\sum_{i\in I_k}\Big\|\xi_i-\frac{1}{n_k}\sum_{j\in I_k}\xi_j\Big\|^2
=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i\in I_k}\sum_{j\in I_k}\|\xi_i-\xi_j\|^2
=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i:\,\sigma(i)\in I_k}\;\sum_{j:\,\sigma(j)\in I_k}d_{\sigma(i)\sigma(j)}
=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}\,d_{\sigma(i)\sigma(j)},
\]
where the first equality is implied by Lemma 2, the second equality is a consequence of the definition of $\sigma$, and the third equality follows from the definition of $W$.

Conversely, for any $\sigma$ feasible in (2), consider the partition $(I_1,\ldots,I_K)$ defined through $I_k=\{\sigma(i)\,:\,i\text{ belongs to the }k\text{-th diagonal block of }W\}$ for all $k\le K$. This partition is feasible in (1), and a similar reasoning as before shows that the partition achieves the same objective value as $\sigma$ in (2). \Halmos

Generic quadratic assignment problems with $N$ facilities and $N$ locations can be reformulated as MILPs with $N^2$ binary variables via the Kaufmann and Broeckx linearization; see, e.g., Burkard (2013, p. 2741). The LP relaxations of these MILPs are, however, known to be weak and give a trivial lower bound of zero; see, e.g., Zhang et al. (2013, Theorem 4.1). In Proposition 2 below we show that the intra-cluster permutation symmetry of the datapoints enables us to give an alternative MILP reformulation containing only $KN$ binary variables. We also mention that the related, yet different, cardinality-constrained exemplar-based clustering problem can be formulated as an MILP containing $N^2$ binary variables; see Mulvey and Beck (1984).

{proposition}

[MILP Reformulation] The clustering problem (1) is equivalent to the MILP
\begin{equation}
\begin{array}{lll}
\text{minimize} & \displaystyle\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}\,\eta_{ij}^{k} & \qquad(\mathcal{P})\\[3mm]
\text{subject to} & \pi_{ik}\in\{0,1\},\ \eta_{ij}^{k}\ge 0 & \forall i,j\le N,\ \forall k\le K\\[1mm]
 & \displaystyle\sum_{k=1}^{K}\pi_{ik}=1\quad\forall i\le N,\qquad\displaystyle\sum_{i=1}^{N}\pi_{ik}=n_k\quad\forall k\le K\\[3mm]
 & \eta_{ij}^{k}\ge\pi_{ik}+\pi_{jk}-1 & \forall i,j\le N,\ \forall k\le K.
\end{array}
\end{equation}

The binary variable $\pi_{ik}$ in the MILP $\mathcal{P}$ satisfies $\pi_{ik}=1$ if $i\in I_k$; $\pi_{ik}=0$ otherwise. At optimality, $\eta_{ij}^{k}$ is equal to 1 if $\pi_{ik}=\pi_{jk}=1$ (i.e., $i,j\in I_k$) and 0 otherwise.

Proof of Proposition 2. At optimality, the decision variables $\eta_{ij}^{k}$ in problem $\mathcal{P}$ take the values $\eta_{ij}^{k}=\max\{\pi_{ik}+\pi_{jk}-1,\,0\}$. Accordingly, problem $\mathcal{P}$ can equivalently be stated as
\begin{equation}
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}\max\{\pi_{ik}+\pi_{jk}-1,\,0\} \qquad(\mathcal{P}')\\[3mm]
\text{subject to} & \pi_{ik}\in\{0,1\}\quad\forall i\le N,\ \forall k\le K\\[1mm]
 & \displaystyle\sum_{k=1}^{K}\pi_{ik}=1\quad\forall i\le N,\qquad\displaystyle\sum_{i=1}^{N}\pi_{ik}=n_k\quad\forall k\le K.
\end{array}
\end{equation}

In the following, we show that any feasible solution of (1) gives rise to a feasible solution of $\mathcal{P}'$ with the same objective value and vice versa. To this end, consider first a partition $(I_1,\ldots,I_K)$ that is feasible in (1). Choosing $\pi_{ik}=1$ if $i\in I_k$ and $\pi_{ik}=0$ otherwise, $i\le N$ and $k\le K$, the resulting solution is feasible in $\mathcal{P}'$ and attains the same objective value as $(I_1,\ldots,I_K)$ in (1) since
\[
\sum_{k=1}^{K}\sum_{i\in I_k}\Big\|\xi_i-\frac{1}{n_k}\sum_{j\in I_k}\xi_j\Big\|^2
=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i\in I_k}\sum_{j\in I_k}\|\xi_i-\xi_j\|^2
=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}\max\{\pi_{ik}+\pi_{jk}-1,\,0\}.
\]
Here, the first equality is implied by Lemma 2, and the second equality follows from the construction of $\pi$. By the same argument, every $\pi$ feasible in $\mathcal{P}'$ gives rise to a partition $(I_1,\ldots,I_K)$ with $I_k=\{i\le N:\pi_{ik}=1\}$, for $k\le K$, that is feasible in (1) and that attains the same objective value. \Halmos
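For concreteness, the following Python sketch assembles the MILP $\mathcal{P}$ with cvxpy. The function name and solver choice are our own assumptions; any MIP-capable solver interfaced by cvxpy (e.g., GLPK_MI) can be used, and a dedicated solver is advisable for large $N$ since the model carries $KN^2$ linearization variables.

```python
import cvxpy as cp
import numpy as np

def cardinality_kmeans_milp(D, sizes):
    """Sketch of the MILP P: KN binary assignment variables pi and
    continuous linearization variables eta; needs a MIP-capable solver."""
    N, K = D.shape[0], len(sizes)
    pi = cp.Variable((N, K), boolean=True)
    ones_row = np.ones((1, N))
    cons = [cp.sum(pi, axis=1) == 1,                # each point in exactly one cluster
            cp.sum(pi, axis=0) == np.array(sizes)]  # prescribed cardinalities
    obj = 0
    for k in range(K):
        eta = cp.Variable((N, N), nonneg=True)
        col = cp.reshape(pi[:, k], (N, 1))
        # eta_ij >= pi_ik + pi_jk - 1 linearizes the product pi_ik * pi_jk
        cons.append(eta >= col @ ones_row + (col @ ones_row).T - 1)
        obj += cp.sum(cp.multiply(D, eta)) / (2 * sizes[k])
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve(solver=cp.GLPK_MI)                   # or any other MIP solver
    return pi.value, prob.value
```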

{remark}

Since $d_{ij}\ge 0$ and $\eta_{ij}^{k}\ge 0$, zero is a (trivial) lower bound on the objective value of the LP relaxation of the MILP $\mathcal{P}$. As a consequence, this LP relaxation is at least as tight as the LP relaxation of the Kaufmann and Broeckx exact MILP formulation of problem (2), which always yields the lower bound of zero. It is also possible to construct instances where the LP relaxation of the MILP $\mathcal{P}$ is strictly tighter.

$K$-means clustering with cardinality constraints is known to be NP-hard, as it is a special case of cardinality-constrained $\ell_p$-norm clustering, which was shown to be NP-hard (for any $p\ge 1$) by Bertoni et al. (2012). The restriction to the Euclidean norm (i.e., $p=2$), however, allows for a more concise proof, which is given in the following proposition.

{proposition}

$K$-means clustering with cardinality constraints is NP-hard even for $K=2$ clusters. Hence, unless P $=$ NP, there is no polynomial-time algorithm for solving problem (1).

Proof. In analogy to Proposition 2, one can show that the unconstrained $K$-means clustering problem can be formulated as a variant of problem $\mathcal{P}'$ that omits the cardinality constraints, which require that $\sum_{i=1}^{N}\pi_{ik}=n_k$ for all $k$, and replaces the (now unconstrained) cardinality $n_k$ in the objective function by the size of the $k$-th cluster, which can be expressed as $\sum_{i=1}^{N}\pi_{ik}$. If $K=2$, we can thus solve the unconstrained $K$-means clustering problem by solving problem $\mathcal{P}'$ for all cluster cardinality combinations $(n_1,n_2)$ with $n_1+n_2=N$ and selecting the clustering with the lowest objective value. Thus, in this case, if problem $\mathcal{P}'$ were polynomial-time solvable, then so would be the unconstrained $K$-means clustering problem. This, however, would contradict Theorem 1 in Aloise et al. (2009), which shows that the unconstrained $K$-means clustering problem is NP-hard even for $K=2$ clusters. \Halmos

In the context of balanced clustering, similar hardness results have been established by Pyatkin et al. (2017). Specifically, they prove that the balanced $K$-means clustering problem is NP-complete for $K\ge 2$ and $n_1=\cdots=n_K\ge 3$ (i.e., if the shared cardinality of all clusters is greater than or equal to three). In contrast, if $n_1=\cdots=n_K=2$ (i.e., each cluster should contain two points), balanced $K$-means clustering reduces to a minimum-weight perfect matching problem that can be solved in polynomial time by different algorithms; see Cook and Rohe (1999, Table I) for a review.
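As an aside, the size-two case can be solved with any off-the-shelf matching routine. The sketch below is our own illustration (networkx and the helper name are assumptions, not part of the paper); it negates the edge weights so that a maximum-weight perfect matching minimizes the clustering cost:

```python
import networkx as nx
import numpy as np

def pair_clustering(D):
    """Balanced clustering with clusters of size 2 (N even) as a
    minimum-weight perfect matching on the complete graph; for n_k = 2
    the K-means cost of cluster {i, j} is d_ij / 2, so matching on d_ij
    yields the optimal pairing."""
    N = D.shape[0]
    G = nx.Graph()
    # negate weights so that a maximum-weight matching minimizes cost
    G.add_weighted_edges_from(
        (i, j, -D[i, j]) for i in range(N) for j in range(i + 1, N))
    matching = nx.max_weight_matching(G, maxcardinality=True)
    return [tuple(sorted(pair)) for pair in matching]
```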

In $K$-means clustering without cardinality constraints, the convex hulls of the optimal clusters do not overlap, and thus each cluster fits within a separate cell of a Voronoi partition of $\mathbb{R}^d$; see, e.g., Hasegawa et al. (1993, Theorem 2.1). We demonstrate below that this property is preserved in the presence of cardinality constraints.

{theorem}

[Voronoi Partition] For every optimal solution $(I_1,\ldots,I_K)$ to problem (1), there exists a Voronoi partition of $\mathbb{R}^d$ such that each cluster is contained in exactly one Voronoi cell.

Proof. We show that for every optimal clustering $(I_1,\ldots,I_K)$ of (1) and every $k\neq l$, there exists a hyperplane separating the points in $I_k$ from those in $I_l$. This in turn implies the existence of the desired Voronoi partition. Given a cluster $I_k$ for any $k\le K$, define its cluster center as $\zeta_k=\frac{1}{n_k}\sum_{i\in I_k}\xi_i$, and let $v=\zeta_l-\zeta_k$ be the vector that connects the cluster centers of $I_k$ and $I_l$. The statement holds if $v^\top\xi_i\le v^\top\xi_j$ for all $i\in I_k$ and $j\in I_l$, as $v$ itself determines a separating hyperplane for $I_k$ and $I_l$ in that case. We thus assume that $v^\top\xi_i>v^\top\xi_j$ for some $i\in I_k$ and $j\in I_l$. However, this contradicts the optimality of the clustering $(I_1,\ldots,I_K)$ because
\[
v^\top\xi_i>v^\top\xi_j
\;\Longleftrightarrow\;
(\zeta_l-\zeta_k)^\top(\xi_i-\xi_j)>0
\;\Longleftrightarrow\;
\|\xi_j-\zeta_k\|^2+\|\xi_i-\zeta_l\|^2<\|\xi_i-\zeta_k\|^2+\|\xi_j-\zeta_l\|^2,
\]
where the last equivalence follows from multiplying both sides of the second inequality with $-2$ and then completing the squares by adding $\|\xi_i\|^2+\|\xi_j\|^2+\|\zeta_k\|^2+\|\zeta_l\|^2$ on both sides. Defining $I_k'=(I_k\setminus\{i\})\cup\{j\}$ and $I_l'=(I_l\setminus\{j\})\cup\{i\}$, the above would imply that
\[
\sum_{i'\in I_k'}\|\xi_{i'}-\zeta_k\|^2+\sum_{j'\in I_l'}\|\xi_{j'}-\zeta_l\|^2
\;<\;
\sum_{i'\in I_k}\|\xi_{i'}-\zeta_k\|^2+\sum_{j'\in I_l}\|\xi_{j'}-\zeta_l\|^2.
\]
The left-hand side of the above inequality represents an upper bound on the sum of squared intra-cluster distances attained by the clustering that replaces $I_k$ and $I_l$ with $I_k'$ and $I_l'$, since $\zeta_k$ and $\zeta_l$ may not coincide with the minimizers $\frac{1}{n_k}\sum_{i'\in I_k'}\xi_{i'}$ and $\frac{1}{n_l}\sum_{j'\in I_l'}\xi_{j'}$, respectively. Recall that the cluster centers are chosen so as to minimize the sum of squared distances from the cluster center to each point in the cluster. We thus conclude that the new clustering attains a strictly lower objective value than $(I_1,\ldots,I_K)$ in problem (1), which is a contradiction. \Halmos

3 Cardinality-Constrained Clustering without Outliers

We now relax the intractable MILP $\mathcal{P}$ to tractable conic programs that yield efficiently computable lower and upper bounds on its optimal value.

3.1 Convex Relaxations and Rounding Algorithm

We first eliminate the variables $\eta_{ij}^{k}$ from $\mathcal{P}$ by re-expressing the problem's objective function as
\[
\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}\max\{\pi_{ik}+\pi_{jk}-1,\,0\}
=\sum_{k=1}^{K}\frac{1}{2n_k}\sum_{i=1}^{N}\sum_{j=1}^{N}d_{ij}\,\pi_{ik}\,\pi_{jk},
\]
where the last equality holds because the variables $\pi_{ik}$ are binary. Next, we collect the assignment variables of each cluster $k$ in the vector $\pi_k=(\pi_{1k},\ldots,\pi_{Nk})^\top$, whereby $\mathcal{P}$ simplifies to
\begin{equation}
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{k=1}^{K}\frac{1}{2n_k}\,\pi_k^\top D\,\pi_k \qquad(3)\\[3mm]
\text{subject to} & \pi_k\in\{0,1\}^N,\quad\mathbf{e}^\top\pi_k=n_k\quad\forall k\le K\\[1mm]
 & \displaystyle\sum_{k=1}^{K}\pi_k=\mathbf{e}.
\end{array}
\end{equation}

Here, $\pi_{ik}$ takes the value 1 if the $i$-th datapoint is assigned to cluster $k$ and 0 otherwise. Note that the constraints in (3) are indeed equivalent to the first two constraint sets in $\mathcal{P}$, respectively. In Theorem 3.1 below we will show that the reformulation (3) of the MILP $\mathcal{P}$ admits the SDP relaxation

\begin{equation}
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{k=1}^{K}\frac{1}{2n_k}\,\langle D,\Pi_k\rangle \qquad(\mathcal{P}_{\mathrm{SDP}})\\[3mm]
\text{subject to} & (\pi_k,\Pi_k)\in\Theta^{\mathrm{SDP}}(n_k)\quad\forall k\le K\\[1mm]
 & \displaystyle\sum_{k=1}^{K}\pi_k=\mathbf{e},
\end{array}
\end{equation}

where, for any $n\in\mathbb{N}$, the convex set $\Theta^{\mathrm{SDP}}(n)$ is defined as
\[
\Theta^{\mathrm{SDP}}(n)=\left\{(\pi,\Pi)\in\mathbb{R}^N\times\mathbb{S}^N\;:\;
\begin{array}{l}
\Pi\succeq\pi\pi^\top,\ \mathbf{e}^\top\pi=n,\\[1mm]
\operatorname{diag}(\Pi)=\pi,\ \Pi\mathbf{e}=n\pi,\\[1mm]
\Pi\ge 0,\ \Pi+\mathbf{e}\mathbf{e}^\top\ge\mathbf{e}\pi^\top+\pi\mathbf{e}^\top
\end{array}\right\}.
\]

Note that $\Theta^{\mathrm{SDP}}(n)$ is semidefinite representable because Schur's complement allows us to express the constraint $\Pi\succeq\pi\pi^\top$ as a linear matrix inequality; see, e.g., Boyd and Vandenberghe (2004). Furthermore, we point out that the last four constraints in $\Theta^{\mathrm{SDP}}(n)$ are also used in the reformulation-linearization technique for nonconvex programs, as described by Anstreicher (2009).

We can further relax the above SDP to an LP, henceforth denoted by $\mathcal{P}_{\mathrm{LP}}$, where the constraints $(\pi_k,\Pi_k)\in\Theta^{\mathrm{SDP}}(n_k)$ are replaced with $(\pi_k,\Pi_k)\in\Theta^{\mathrm{LP}}(n_k)$, and where, for any $n\in\mathbb{N}$, the polytope $\Theta^{\mathrm{LP}}(n)$ is obtained by removing the non-linear constraint $\Pi\succeq\pi\pi^\top$ from $\Theta^{\mathrm{SDP}}(n)$.
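To make the relaxation concrete, the following Python sketch builds $\mathcal{P}_{\mathrm{SDP}}$ with cvxpy for a small dataset. It mirrors the constraint list of $\Theta^{\mathrm{SDP}}(n)$ as reconstructed above and encodes $\Pi\succeq\pi\pi^\top$ via a Schur complement block; the function name and the choice of the SCS solver are our own assumptions, a sketch rather than the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def sdp_relaxation(D, sizes):
    """Sketch of P_SDP: (pi_k, Pi_k) relaxes the binary assignment
    vector of cluster k and its outer product pi_k pi_k^T."""
    N, K = D.shape[0], len(sizes)
    e = np.ones(N)
    pis = [cp.Variable(N, nonneg=True) for _ in range(K)]
    Pis = [cp.Variable((N, N), symmetric=True) for _ in range(K)]
    cons = [sum(pis) == e]                         # every point fully assigned
    obj = 0
    for pi, Pi, n in zip(pis, Pis, sizes):
        Z = cp.Variable((N + 1, N + 1), PSD=True)  # [[1, pi^T], [pi, Pi]] >= 0
        P1 = cp.reshape(pi, (N, 1)) @ np.ones((1, N))   # matrix with rows pi_i
        cons += [Z[0, 0] == 1, Z[0, 1:] == pi, Z[1:, 1:] == Pi,  # Schur: Pi >= pi pi^T
                 cp.sum(pi) == n,
                 cp.diag(Pi) == pi,
                 Pi @ e == n * pi,
                 Pi >= 0,
                 Pi + np.ones((N, N)) >= P1 + P1.T]     # RLT-type cut
        obj += cp.sum(cp.multiply(D, Pi)) / (2 * n)
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve(solver=cp.SCS)
    return prob.value, [pi.value for pi in pis]
```

Dropping the PSD block $Z$ yields the LP relaxation $\mathcal{P}_{\mathrm{LP}}$.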

{theorem}

[SDP and LP Relaxations] We have $\min\mathcal{P}_{\mathrm{LP}}\;\le\;\min\mathcal{P}_{\mathrm{SDP}}\;\le\;\min\mathcal{P}$, where $\min(\cdot)$ denotes the optimal value of an optimization problem.

Proof. The inequality $\min\mathcal{P}_{\mathrm{LP}}\le\min\mathcal{P}_{\mathrm{SDP}}$ is trivially satisfied because $\Theta^{\mathrm{SDP}}(n_k)$ is constructed as a subset of $\Theta^{\mathrm{LP}}(n_k)$ for every $k\le K$. To prove the inequality $\min\mathcal{P}_{\mathrm{SDP}}\le\min\mathcal{P}$, consider any set of binary vectors $\pi_1,\ldots,\pi_K$ feasible in (3) and define $\Pi_k=\pi_k\pi_k^\top$ for $k\le K$. By construction, the objective value of $(\pi_k)_{k\le K}$ in (3) coincides with that of $(\pi_k,\Pi_k)_{k\le K}$ in $\mathcal{P}_{\mathrm{SDP}}$. Moreover, the constraints in (3) imply that
\[
\mathbf{e}^\top\pi_k=n_k,\qquad
\operatorname{diag}(\Pi_k)=\operatorname{diag}(\pi_k\pi_k^\top)=\pi_k,\qquad
\Pi_k\mathbf{e}=\pi_k(\pi_k^\top\mathbf{e})=n_k\pi_k
\]
and
\[
\Pi_k=\pi_k\pi_k^\top\ge 0,\qquad
\Pi_k+\mathbf{e}\mathbf{e}^\top-\mathbf{e}\pi_k^\top-\pi_k\mathbf{e}^\top=(\mathbf{e}-\pi_k)(\mathbf{e}-\pi_k)^\top\ge 0,\qquad
\Pi_k-\pi_k\pi_k^\top=0\succeq 0,
\]
which ensures that $(\pi_k,\Pi_k)\in\Theta^{\mathrm{SDP}}(n_k)$ for every $k\le K$. Finally, the constraint $\sum_{k=1}^{K}\pi_k=\mathbf{e}$ in $\mathcal{P}_{\mathrm{SDP}}$ coincides with the last constraint in (3). Thus, $(\pi_k,\Pi_k)_{k\le K}$ is feasible in $\mathcal{P}_{\mathrm{SDP}}$. The desired inequality now follows because any feasible point in (3) corresponds to a feasible point in $\mathcal{P}_{\mathrm{SDP}}$ with the same objective value. Note that the converse implication is generally false. \Halmos

{remark}

In the special case where $K=2$, we can halve the number of variables in $\mathcal{P}_{\mathrm{SDP}}$ and $\mathcal{P}_{\mathrm{LP}}$ by setting $\pi_2=\mathbf{e}-\pi_1$ and $\Pi_2=\mathbf{e}\mathbf{e}^\top-\mathbf{e}\pi_1^\top-\pi_1\mathbf{e}^\top+\Pi_1$ without loss of generality.

It is possible to show that $\mathcal{P}_{\mathrm{LP}}$ is at least as tight as the naïve LP relaxation of the MILP $\mathcal{P}$, denoted by $\mathcal{P}_{\mathrm{naive}}$, where the integrality constraints are simply ignored. One can also construct instances where $\mathcal{P}_{\mathrm{LP}}$ is strictly tighter than $\mathcal{P}_{\mathrm{naive}}$. We also emphasize that both LP relaxations entail $\mathcal{O}(KN^2)$ variables and constraints.

{proposition}

We have $\min\mathcal{P}_{\mathrm{naive}}\le\min\mathcal{P}_{\mathrm{LP}}$.

Proof. Consider a feasible solution $(\pi_k,\Pi_k)_{k\le K}$ of $\mathcal{P}_{\mathrm{LP}}$. Its feasibility implies that
\[
(\Pi_k)_{ij}\ge 0
\qquad\text{and}\qquad
(\Pi_k)_{ij}\ge\pi_{ik}+\pi_{jk}-1
\qquad\forall i,j\le N,\ \forall k\le K.
\]
Next, set $\eta_{ij}^{k}=(\Pi_k)_{ij}$ for all $i,j\le N$ and $k\le K$. Then,
\[
\eta_{ij}^{k}\ge\max\{\pi_{ik}+\pi_{jk}-1,\,0\}\qquad\forall i,j\le N,\ \forall k\le K.
\]
Hence, this solution is feasible in $\mathcal{P}_{\mathrm{naive}}$. A direct calculation also reveals that both solutions attain the same objective value in their respective optimization problems. This confirms that $\mathcal{P}_{\mathrm{LP}}$ is a relaxation that is at least as tight as $\mathcal{P}_{\mathrm{naive}}$. \Halmos

Next, we develop a rounding algorithm that recovers a feasible clustering (and thus an upper bound on the optimal value of problem (1)) from an optimal solution of the relaxed problem $\mathcal{P}_{\mathrm{SDP}}$ or $\mathcal{P}_{\mathrm{LP}}$; see Algorithm 1.

1: Input: $\xi_1,\ldots,\xi_N$ (datapoints), $n_1,\ldots,n_K$ (cluster sizes).
2: Solve $\mathcal{P}_{\mathrm{SDP}}$ or $\mathcal{P}_{\mathrm{LP}}$ for the datapoints $\xi_1,\ldots,\xi_N$, and record the optimal $\pi_1^\star,\ldots,\pi_K^\star$.
3: Solve the linear assignment problem
\[
\bar\pi\in\arg\max_{\pi}\Big\{\sum_{i=1}^{N}\sum_{k=1}^{K}\pi_{ik}^\star\,\pi_{ik}\;:\;\pi_{ik}\in\{0,1\},\ \textstyle\sum_{k=1}^{K}\pi_{ik}=1\ \forall i\le N,\ \sum_{i=1}^{N}\pi_{ik}=n_k\ \forall k\le K\Big\}.
\]
4: Set $I_k=\{i\le N:\bar\pi_{ik}=1\}$ for all $k\le K$.
5: Set $\zeta_k=\frac{1}{n_k}\sum_{i\in I_k}\xi_i$ for all $k\le K$.
6: Solve the linear assignment problem
\[
\hat\pi\in\arg\min_{\pi}\Big\{\sum_{i=1}^{N}\sum_{k=1}^{K}\|\xi_i-\zeta_k\|^2\,\pi_{ik}\;:\;\pi_{ik}\in\{0,1\},\ \textstyle\sum_{k=1}^{K}\pi_{ik}=1\ \forall i\le N,\ \sum_{i=1}^{N}\pi_{ik}=n_k\ \forall k\le K\Big\}.
\]
7: Set $I_k=\{i\le N:\hat\pi_{ik}=1\}$ for all $k\le K$.
8: Output: $I_1,\ldots,I_K$.
Algorithm 1 Rounding algorithm for cardinality-constrained clustering

Recall that the continuous variables $\pi_{ik}$ in $\mathcal{P}_{\mathrm{SDP}}$ and $\mathcal{P}_{\mathrm{LP}}$ correspond to the binary variables in (3) with identical names. This correspondence motivates us to solve a linear assignment problem in Step 3 of Algorithm 1, which seeks a binary matrix $\bar\pi$ with $\bar\pi_{ik}\in\{0,1\}$ for all $i\le N$ and $k\le K$ that is maximally aligned with the relaxed solution $\pi^\star$, subject to the prescribed cardinality constraints. Note that even though this assignment problem constitutes an MILP, it can be solved in polynomial time because its constraint matrix is totally unimodular, implying that its LP relaxation is exact. Alternatively, one may solve the assignment problem using the Hungarian algorithm; see, e.g., Burkard et al. (2009).

Note that Steps 5–7 of Algorithm 1 are reminiscent of a single iteration of Lloyd's algorithm for cardinality-constrained $K$-means clustering as described by Bennett et al. (2000). Specifically, Step 5 calculates the cluster centers $\zeta_1,\ldots,\zeta_K$, while Steps 6 and 7 reassign each point to the nearest center while adhering to the cardinality constraints. Algorithm 1 thus applies just one step of Lloyd's algorithm, initialized with an optimizer of $\mathcal{P}_{\mathrm{SDP}}$ or $\mathcal{P}_{\mathrm{LP}}$. This refinement step ensures that the output clustering is compatible with a Voronoi partition of $\mathbb{R}^d$, which is desirable in view of Theorem 2.
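Both assignment steps of Algorithm 1 can be solved by expanding each cluster into $n_k$ identical "slots", which turns the cardinality-constrained assignment into a square $N\times N$ linear assignment problem. The following Python sketch (our own illustration; the function names are assumptions) implements Steps 3–7 with scipy's Hungarian-type solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_assignment(score, sizes):
    """Maximize sum_ik score[i, k] * pi[i, k] subject to the cardinality
    constraints, by duplicating cluster k into sizes[k] slots (square LAP)."""
    slots = np.repeat(np.arange(len(sizes)), sizes)   # cluster index of each slot
    rows, cols = linear_sum_assignment(-score[:, slots])
    labels = np.empty(score.shape[0], dtype=int)
    labels[rows] = slots[cols]
    return labels

def round_relaxation(xi, pi_star, sizes):
    """Steps 3-7 of Algorithm 1: round pi_star, recompute the centers,
    and reassign the points to the nearest centers under the cardinalities."""
    labels = constrained_assignment(pi_star, sizes)                  # Step 3
    centers = np.stack([xi[labels == k].mean(axis=0)
                        for k in range(len(sizes))])                 # Step 5
    dist2 = ((xi[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # Step 6
    return constrained_assignment(-dist2, sizes)                     # Steps 6-7
```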

3.2 Tighter Relaxations for Balanced Clustering

The computational burden of solving $\mathcal{P}_{\mathrm{SDP}}$ and $\mathcal{P}_{\mathrm{LP}}$ grows with $K$. We show in this section that if all clusters share the same size $n=N/K$ (i.e., $n_k=n$ for all $k\le K$), then $\mathcal{P}_{\mathrm{SDP}}$ can be replaced by
\begin{equation}
\begin{array}{ll}
\text{minimize} & \displaystyle\frac{1}{2n}\,\langle D,\Pi_1\rangle+\frac{K-1}{2n}\,\langle D,\Pi_2\rangle \qquad(\mathcal{B}_{\mathrm{SDP}})\\[3mm]
\text{subject to} & (\pi_1,\Pi_1)\in\Theta^{\mathrm{SDP}}(n),\quad(\pi_2,\Pi_2)\in\Theta^{\mathrm{SDP}}(n)\\[1mm]
 & \pi_1+(K-1)\,\pi_2=\mathbf{e},\qquad\pi_{11}=1,
\end{array}
\end{equation}
whose size no longer scales with $K$. Similarly, $\mathcal{P}_{\mathrm{LP}}$ simplifies to the LP $\mathcal{B}_{\mathrm{LP}}$ obtained from $\mathcal{B}_{\mathrm{SDP}}$ by replacing $\Theta^{\mathrm{SDP}}(n)$ with $\Theta^{\mathrm{LP}}(n)$. This is a manifestation of how symmetry can be exploited to simplify convex programs, a phenomenon that is studied in a more general setting by Gatermann and Parrilo (2004).

{corollary}

[Relaxations for Balanced Clustering] We have $\min\mathcal{B}_{\mathrm{LP}}\;\le\;\min\mathcal{B}_{\mathrm{SDP}}\;\le\;\min\mathcal{P}$.

Proof. The inequality $\min\mathcal{B}_{\mathrm{LP}}\le\min\mathcal{B}_{\mathrm{SDP}}$ is trivially satisfied. To prove the inequality $\min\mathcal{B}_{\mathrm{SDP}}\le\min\mathcal{P}$, we first add the symmetry breaking constraint $\pi_{11}=1$ to the MILP $\mathcal{P}$. Note that this constraint does not increase the optimal value of $\mathcal{P}$. It just requires that the cluster containing the first datapoint should be assigned the number $k=1$. This choice is unrestrictive because all clusters have the same size. By repeating the reasoning that led to Theorem 3.1, the MILP $\mathcal{P}$ can then be relaxed to a variant of the SDP $\mathcal{P}_{\mathrm{SDP}}$ that includes the (linear) symmetry breaking constraint $\pi_{11}=1$. Note that the constraints and the objective function of the resulting SDP are invariant under permutations of the cluster indices $k=2,\ldots,K$ because $n_k=n$ for all $k\le K$. Note also that the constraints are not invariant under permutations involving $k=1$ due to the symmetry breaking constraint. Next, consider any feasible solution $(\pi_k,\Pi_k)_{k\le K}$ of this SDP, and define
\[
\bar\pi=\frac{1}{K-1}\sum_{k=2}^{K}\pi_k
\qquad\text{and}\qquad
\bar\Pi=\frac{1}{K-1}\sum_{k=2}^{K}\Pi_k.
\]
Moreover, construct a permutation-symmetric solution by setting
\[
(\pi_1',\Pi_1')=(\pi_1,\Pi_1)
\qquad\text{and}\qquad
(\pi_k',\Pi_k')=(\bar\pi,\bar\Pi)\quad\forall k=2,\ldots,K.
\]
By the convexity and permutation symmetry of the SDP, the symmetrized solution is also feasible in the SDP and attains the same objective value as $(\pi_k,\Pi_k)_{k\le K}$. Moreover, as the choice of $(\pi_k,\Pi_k)_{k\le K}$ was arbitrary, we may indeed restrict attention to symmetrized solutions with $\pi_k=\bar\pi$ and $\Pi_k=\bar\Pi$ for all $k\ge 2$ without increasing the objective value of the SDP. Any such solution corresponds to a feasible solution of $\mathcal{B}_{\mathrm{SDP}}$ with the same objective value. Therefore, the simplified SDP relaxation $\mathcal{B}_{\mathrm{SDP}}$ provides a lower bound on $\min\mathcal{P}$. \Halmos

If $n_k=n$ for all $k\le K$, then the SDP and LP relaxations from Section 3.1 admit an optimal solution where both $\pi_k$ and $\Pi_k$ are independent of $k$, in which case Algorithm 1 performs poorly. This motivates the improved relaxations $\mathcal{B}_{\mathrm{SDP}}$ and $\mathcal{B}_{\mathrm{LP}}$ involving the symmetry breaking constraint $\pi_{11}=1$, which ensures that—without loss of generality—the cluster harboring the first datapoint is indexed by $k=1$. As the symmetry between the clusters $2,\ldots,K$ persists and because any additional symmetry breaking constraint would be restrictive, the optimal solutions of $\mathcal{B}_{\mathrm{SDP}}$ and $\mathcal{B}_{\mathrm{LP}}$ only facilitate a reliable recovery of cluster 1. To recover all clusters, however, we can solve $\mathcal{B}_{\mathrm{SDP}}$ or $\mathcal{B}_{\mathrm{LP}}$ $K$ times over the yet unassigned datapoints, as sketched below; see Algorithm 2. The resulting clustering could be improved by appending one iteration of Lloyd's algorithm (akin to Steps 5–7 in Algorithm 1).
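The following Python sketch outlines this peel-off strategy (our own schematic of the procedure described above, not the paper's Algorithm 2 verbatim). The callables `solve_balanced_relaxation` and `round_cluster` are assumed to return the relaxed membership vector of cluster 1 and the positions of the $n$ rounded points, respectively:

```python
import numpy as np

def recover_balanced_clusters(xi, n, K, solve_balanced_relaxation, round_cluster):
    """Recover all K balanced clusters by repeatedly extracting the
    cluster that contains the first yet-unassigned datapoint."""
    remaining = np.arange(len(xi))
    clusters = []
    for _ in range(K - 1):
        pi1 = solve_balanced_relaxation(xi[remaining], n)  # relaxed membership of cluster 1
        chosen = remaining[round_cluster(pi1, n)]          # n recovered datapoint indices
        clusters.append(chosen)
        remaining = np.setdiff1d(remaining, chosen)
    clusters.append(remaining)        # the last n points form the final cluster
    return clusters
```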

In contrast, the naïve relaxation $\mathcal{P}_{\mathrm{naive}}$ of $\mathcal{P}$ becomes significantly weaker when all cardinalities are equal. To see this, we note that the solution with $\pi_{ik}=1/K$ and $\eta_{ij}^{k}=0$ for all $i,j\le N$ and for all $k\le K$ is feasible in $\mathcal{P}_{\mathrm{naive}}$ (i.e., it satisfies all constraints of problem $\mathcal{P}$ except the integrality constraints imposed on $\pi_{ik}$) whenever $K\ge 2$. Hence, the optimal objective value of $\mathcal{P}_{\mathrm{naive}}$ is zero. This could be avoided by adding a symmetry breaking constraint to problem $\mathcal{P}_{\mathrm{naive}}$ to ensure that the cluster containing the first datapoint