Nuclear norm minimization for the planted clique and biclique problems

Supported in part by a Discovery Grant from NSERC (Natural Science and Engineering Research Council of Canada).

Abstract

We consider the problems of finding a maximum clique in a graph and finding a maximum-edge biclique in a bipartite graph. Both problems are NP-hard. We write both problems as matrix-rank minimization and then relax them using the nuclear norm. This technique, which may be regarded as a generalization of compressive sensing, has recently been shown to be an effective way to solve rank optimization problems. In the special cases that the input graph has a planted clique or biclique (i.e., a single large clique or biclique plus diversionary edges), our algorithm successfully provides an exact solution to the original instance. For each problem, we provide two analyses of when our algorithm succeeds. In the first analysis, the diversionary edges are placed by an adversary. In the second, they are placed at random. In the case of random edges for the planted clique problem, we obtain the same bound as Alon, Krivelevich and Sudakov as well as Feige and Krauthgamer, but we use different techniques.

1 Introduction

Several recent papers including Recht et al. [17] and Candès and Recht [4] consider nuclear norm minimization as a convex relaxation of matrix rank minimization. Matrix rank minimization refers to the problem of finding a matrix X that minimizes rank(X) subject to linear constraints on X. As we shall show in Sections 3 and 4, the clique and biclique problems, both NP-hard, are easily expressed as matrix rank minimization, thus showing that matrix rank minimization is also NP-hard.

Each of the two papers mentioned in the previous paragraph has results of the following general form. Suppose an instance of matrix rank minimization is posed in which it is known a priori that a solution of very low rank exists. Suppose further that the constraints are random in some sense. Then the nuclear norm relaxation turns out to be exact, i.e., it recovers the (unique) solution of low rank. The nuclear norm of a matrix X, also called the trace norm, is defined to be the sum of the singular values of X.
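
For reference, the generic problem and its relaxation can be written as follows (this display is our own illustration; the linear map 𝒜 and right-hand side b are generic placeholders rather than notation taken from [17] or [4]):

\[
\min_{X}\ \operatorname{rank}(X)\ \ \text{s.t.}\ \ \mathcal{A}(X)=b
\qquad\leadsto\qquad
\min_{X}\ \|X\|_{*}:=\sum_{i}\sigma_{i}(X)\ \ \text{s.t.}\ \ \mathcal{A}(X)=b,
\]

where σᵢ(X) denotes the i-th singular value of X and 𝒜 is a linear map on matrices.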

These authors build upon recent breakthroughs in compressive sensing [10, 5, 3]. In compressive sensing, the problem is to recover a sparse vector that solves a set of linear equations. In the case that the equations are randomized and a very sparse solution exists, compressive sensing can be solved by relaxation to the ℓ₁ norm. The correspondence between matrix rank minimization and compressive sensing is as follows: matrix rank (the number of nonzero singular values) corresponds to vector sparsity (the number of nonzero entries), and the nuclear norm corresponds to the ℓ₁ norm.
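
To make the correspondence concrete, the following small numerical illustration (ours, not from the paper) computes both quantities from the singular values: the nuclear norm is the ℓ₁ norm of the singular value vector, while the rank counts its nonzeros.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 8))   # a 6x8 matrix of rank at most 4

sigma = np.linalg.svd(X, compute_uv=False)   # singular values of X
nuclear_norm = sigma.sum()                   # ||X||_* : the l1 norm of the singular values
rank = int(np.sum(sigma > 1e-10))            # rank(X): the number of nonzero singular values

print(rank, nuclear_norm)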

Our results follow the spirit of Recht et al. but use different technical approaches. We establish results about two well-known graph-theoretic problems, namely maximum clique and maximum-edge biclique. The maximum clique problem takes as input an undirected graph G = (V, E) and asks for the largest clique (i.e., an induced subgraph whose nodes are completely interconnected). This problem is one of Karp's original NP-complete problems [8]. The maximum-edge biclique problem takes as input a bipartite graph and asks for a subgraph that is a complete bipartite graph with vertex sets U* and V* maximizing the product |U*| · |V*|. This problem was shown to be NP-hard by Peeters [16].

In Sections 3 and 4, we relax these problems to convex optimization using the nuclear norm. For each problem, we show that convex optimization can recover the exact solution in two cases. The first case, described in Section 3.2, is the adversarial case: the N-node graph under consideration consists of a single n-node clique plus a number of diversionary edges chosen by an adversary. We show that the algorithm can tolerate up to O(n²) diversionary edges provided that no non-clique vertex is adjacent to more than O(n) clique vertices. We argue also that these two bounds, O(n²) and O(n), are the best possible. We show analogous results for the biclique problem in Section 4.1.

Our second analysis, described in Sections 3.3 and 4.2, supposes that the graph contains a single planted clique or biclique, while the remaining non-clique edges are inserted independently at random with fixed probability p. This problem has been studied by Alon et al. [2] and by Feige and Krauthgamer [6]. In the case of the clique, we obtain the same result as they do, namely, that as long as the clique has at least c√N nodes for some constant c, where N is the number of nodes in the graph, then our algorithm will find it. Like that of Feige and Krauthgamer, our algorithm also certifies that the maximum clique has been found, thanks to a uniqueness result for convex optimization, which we present in Section 3.1. We believe that our technique is more general than that of Feige and Krauthgamer; for example, ours extends essentially without alteration to the biclique problem, whereas Feige and Krauthgamer rely on some special properties of the clique problem. Furthermore, Feige and Krauthgamer use more sophisticated probabilistic tools (martingales), whereas our results use only Chernoff bounds and classical theorems about the norms of random matrices. The random matrix results needed for our main theorems are presented in Section 2.

Our interest in the planted clique and biclique problems arises from applications in data mining. In data mining, one seeks a pattern hidden in an apparently unstructured set of data. A natural question to ask is whether a data mining algorithm is able to find the hidden pattern in the case that it is actually present but obscured by noise. For example, in the realm of clustering, Ben-David [1] has shown that if the data is actually clustered, then a clustering algorithm can find the clusters. The clique and biclique problems are both simple model problems for data mining. For example, Pardalos [13] reduces a data mining problem in epilepsy prediction to a maximum clique problem. Gillis and Glineur [11] use the biclique problem as a model problem for nonnegative matrix factorization and finding features in images.

2 Results on norms of random matrices

In this section we provide a few results concerning random matrices with independent, identically distributed (i.i.d.) entries of mean 0. In particular, the probability distribution for an entry ω will be as follows: ω takes the value 1 − p with probability p and the value −p with probability 1 − p, where p ∈ (0, 1) is fixed.

It is easy to check that the variance of ω is σ² = p(1 − p).
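
As a quick numerical sanity check (ours, and premised on the centered two-point form of the distribution as reconstructed above), one can sample the distribution and confirm the mean and variance:

import numpy as np

p = 0.3
rng = np.random.default_rng(1)
# omega = 1 - p with probability p, and -p with probability 1 - p  (reconstructed form of the distribution)
samples = np.where(rng.random(10**6) < p, 1.0 - p, -p)

print(samples.mean())   # close to 0
print(samples.var())    # close to p * (1 - p)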

We start by recalling a theorem of Füredi and Komlós [7]:

Theorem 2.1

For all integers i, j with 1 ≤ i ≤ j ≤ n, let a_{ij} be distributed according to the distribution above. Define a_{ji} = a_{ij} symmetrically for all i < j.

Then the random symmetric matrix A = [a_{ij}] satisfies

‖A‖ ≤ 3σ√n

with probability at least 1 − exp(−c n^(1/6)) for some c > 0 that depends on p.

Remark 1. In this theorem and for the rest of the paper, ‖A‖ denotes the operator 2-norm ‖A‖₂, often called the spectral norm. It is equal to the maximum singular value of A or, equivalently, to the square root of the maximum eigenvalue of AᵀA.
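
These equivalent characterizations are easy to confirm numerically (our own check, not part of the paper):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 30))

spectral_norm = np.linalg.norm(A, 2)                          # operator 2-norm of A
max_singular = np.linalg.svd(A, compute_uv=False)[0]          # largest singular value of A
sqrt_max_eig = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))   # sqrt of the largest eigenvalue of A^T A

print(spectral_norm, max_singular, sqrt_max_eig)              # all three agree up to rounding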

Remark 2. The theorem is not stated exactly in this way in [7]; the stated form of the theorem can be deduced from the inequality on p. 237 of [7] by taking the deviation parameter there equal to σ√n (so that 2σ√n + σ√n = 3σ√n).

Remark 3. As mentioned above, the mean value of the entries of A is 0. This is crucial for the theorem; a distribution with any other mean value would lead to ‖A‖ growing linearly in n rather than proportionally to √n.

A similar theorem due to Geman [9] is available for unsymmetric matrices.

Theorem 2.2

Let A be a ⌈αn⌉ × n matrix whose entries are chosen independently according to the distribution above, for a fixed α > 0. Then, with probability at least 1 − c₁ exp(−c₂ n^(c₃)), where c₁, c₂, and c₃ depend on p and α,

‖A‖ ≤ c√n

for some c also depending on p and α.

As in the case of [7], this theorem is not stated exactly this way in Geman's paper, but it can be deduced from the equations on pp. 255–256 by an appropriate choice of the parameters appearing there.

The last theorem about random matrices requires a version of the well known Chernoff bounds, which is as follows (see [15, Theorem 4.4]).

Theorem 2.3 (Chernoff Bounds)

Let T₁, T₂, …, Tₙ be a sequence of independent Bernoulli trials, each succeeding with probability p, so that E(Tᵢ) = p. Let S = T₁ + ⋯ + Tₙ be the binomially distributed variable describing the total number of successes. Then for δ > 0,

P(S > (1 + δ)pn) ≤ ( e^δ / (1 + δ)^(1+δ) )^(pn).   (1)

It follows that for all β ≥ 6pn,

P(S ≥ β) ≤ 2^(−β).   (2)
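
Assuming the forms of (1) and (2) as stated above, the second bound follows from the first by a short calculation (supplied by us): write β = (1 + δ)pn, so that β ≥ 6pn means 1 + δ ≥ 6, and

\[
\Pr(S\ge\beta)\;\le\;\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{pn}
\;\le\;\left(\frac{e}{1+\delta}\right)^{(1+\delta)pn}
\;=\;\left(\frac{e}{1+\delta}\right)^{\beta}
\;\le\;\left(\frac{e}{6}\right)^{\beta}
\;\le\;2^{-\beta},
\]

since e^δ ≤ e^(1+δ) and e/6 < 1/2.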

The final theorem of this section is as follows.

Theorem 2.4

Let A be an m × n matrix whose entries are chosen independently according to the distribution above. Let B be defined as follows. For (i, j) such that a_{ij} = 1 − p, we define b_{ij} = a_{ij}. For entries (i, j) such that a_{ij} = −p, we take b_{ij} = −(1 − p) k_j/(m − k_j), where k_j is the number of entries equal to 1 − p in column j of A. Then there exist c₁ and c₂ depending on p such that

(3)

Remark 1. The notation ‖A‖_F denotes the Frobenius norm of A, that is, ‖A‖_F = (Σ_{i,j} a_{ij}²)^(1/2). It is well known that ‖A‖ ≤ ‖A‖_F for any A.
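
For completeness, the inequality between the two norms has a one-line justification in terms of singular values (added by us):

\[
\|A\|^{2}=\sigma_{1}(A)^{2}\;\le\;\sum_{i}\sigma_{i}(A)^{2}=\|A\|_{F}^{2}.
\]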

Remark 2. Note that B is undefined if there is a j such that k_j = m. In this case we assume that ‖A − B‖_F = +∞, i.e., the event considered in (3) fails.

Remark 3. Observe that the column sums of A are random variables with mean zero, since the mean of the entries is 0. On the other hand, the column sums of B are identically zero deterministically; this is the rationale for the choice of B.
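
The construction of B and the zero-column-sum property can be illustrated numerically (a sketch of ours, premised on the reconstructed definition of B above: entries of A equal to 1 − p are kept, and entries equal to −p are replaced by −(1 − p)k_j/(m − k_j)):

import numpy as np

p, m, n = 0.3, 40, 25
rng = np.random.default_rng(3)
A = np.where(rng.random((m, n)) < p, 1.0 - p, -p)    # entries drawn from the reconstructed distribution

k = (A == 1.0 - p).sum(axis=0)                       # k_j: number of (1-p)-entries in column j
assert np.all(k < m), "B is undefined if some column consists entirely of (1-p)'s"

B = np.where(A == 1.0 - p, 1.0 - p, -(1.0 - p) * k / (m - k))

print(np.abs(B.sum(axis=0)).max())                   # column sums of B are zero (up to rounding)
print(np.linalg.norm(A - B))                         # ||A - B||_F, typically on the order of sqrt(n)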

Proof:  From the definition of B, for each column j there are exactly m − k_j entries of B that differ from the corresponding entries of A. Furthermore, the difference of each of these entries is exactly (k_j − pm)/(m − k_j). Therefore, for each j, the contribution of column j to the squared norm difference ‖A − B‖_F² is given by

(k_j − pm)² / (m − k_j).

Recall that the numbers k₁, …, kₙ are independent, and each is the result of m Bernoulli trials, each succeeding with probability p.

We now define U to be the event that at least one k_j is very far from its mean. In particular, U is the event that there exists a j such that k_j ≥ h, where h is a threshold (depending on p and m) chosen large enough for the estimates below. Let Ū be its complement, and let 1_Ū be the indicator of this complement (i.e., 1_Ū = 1 if Ū occurs, else 1_Ū = 0). Let c₁ be a positive scalar depending on p to be determined later. Observe that

(4)

We now analyze the two terms separately. For the first term we use a technique attributed to S. Bernstein (see Hoeffding [12]). Let 1₊ be the indicator function of the nonnegative reals, i.e., 1₊(x) = 1 for x ≥ 0 while 1₊(x) = 0 for x < 0. Then, in general, P(X ≥ 0) = E(1₊(X)) for a random variable X. Thus,

Let t be a positive scalar depending on p to be determined later. Observe that for any such t and for all x, 1₊(x) ≤ exp(tx). Thus,

(5)
(6)

where

To obtain (6), we used the independence of the k_j's. Let us now analyze a single factor of this product in isolation.

To derive the last line, we used the fact that the denominator m − k_j is bounded below because the event Ū is in force. Now let us reorganize this summation by considering first those k_j whose deviation from the mean pm lies in the smallest range, next those in the following range, and so on. Notice that, since k_j is bounded above on Ū, we need consider such ranges only until the deviation reaches its maximum allowed value.

where, for the last line, we have applied the Chernoff bound (1); that bound is applicable here since the deviations under consideration are positive.

Continuing this derivation and overestimating the finite sum with an infinite sum,

Choose t, depending on p, small enough that the following holds: the second term in the square-bracket expression is at least twice the first term for all indices in the sum. Hence

(7)

Observe that this quantity is dominated by a geometric series and hence is a finite number depending on p. Thus, once t is selected, it is possible to choose c₁ sufficiently large so that each of the two terms in question is as small as required. Thus, with appropriate choices of t and c₁, we conclude that the expectation analyzed above is suitably bounded. Substituting this into the earlier inequality shows that

(8)

We now turn to the second term in (4), namely the probability of the event U. For a particular j, the probability that k_j ≥ h is bounded, using (2), by 2^(−h) (this is where the requirement that h be sufficiently large relative to pm is used). Then the union bound asserts that the probability that some j satisfies k_j ≥ h is at most n · 2^(−h). Thus,

This concludes the proof.

3 Maximum Clique

Let G = (V, E) be a simple graph. The maximum clique problem focuses on finding the largest clique of the graph G, i.e., the largest complete subgraph of G. For any clique V* of G, the adjacency matrix of the graph obtained by taking the union of the edges of V* and the set of loops (i, i) for each i ∈ V* is a rank-one matrix with 1's in the entries indexed by V* × V* and 0's everywhere else. Therefore, a clique of G containing n vertices can be found by solving the rank minimization problem

minimize  rank(X)  subject to  Σ_{i∈V} Σ_{j∈V} X_{ij} ≥ n²,   (9)
X_{ij} = 0 for all (i, j) ∉ E with i ≠ j,   (10)
X ∈ [0, 1]^{|V|×|V|}.   (11)
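
The rank-one observation that motivates this formulation is easy to verify numerically (a sketch of ours; the graph size and clique below are hypothetical):

import numpy as np

N = 7
clique = [1, 2, 4]               # a hypothetical clique V* in a graph on N nodes
v = np.zeros(N)
v[clique] = 1.0                  # characteristic vector of V*

X = np.outer(v, v)               # 1's exactly on V* x V* (including the loops (i, i) for i in V*)
print(np.linalg.matrix_rank(X))  # prints 1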

Unfortunately, this rank minimization problem is also NP-hard. We consider the relaxation obtained by replacing the objective function with the nuclear norm ‖X‖_*, the sum of the singular values of the matrix:

‖X‖_* = σ₁(X) + σ₂(X) + ⋯ + σ_{|V|}(X).

Underestimating rank(X) with ‖X‖_*, we obtain the following convex optimization problem:

minimize  ‖X‖_*  subject to  Σ_{i∈V} Σ_{j∈V} X_{ij} ≥ n²,   X_{ij} = 0 for all (i, j) ∉ E with i ≠ j.   (12)

Notice that the relaxation has dropped the constraint X ∈ [0, 1]^{|V|×|V|} that was present in the original formulation (11). This constraint turns out to be superfluous (and, in fact, unhelpful; see the remark below) for our approach. Using the Karush-Kuhn-Tucker conditions, we derive conditions under which the adjacency matrix of a graph comprising a clique of G of size n together with a loop at each vertex in the clique is optimal for this convex relaxation.
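
For readers who want to experiment, the relaxation can be prototyped with an off-the-shelf convex solver. The sketch below is ours; it assumes the reading of the formulation reconstructed in (12) (sum of entries at least n², zeros off the edge set, diagonal left free), which may differ in details from the authors' exact constraint set, and the small graph is hypothetical.

import itertools
import numpy as np
import cvxpy as cp

N, n = 6, 3
clique = [0, 1, 2]                                          # hypothetical planted clique of size n
edges = set(itertools.combinations(clique, 2)) | {(3, 4)}   # its edges plus one diversionary edge

X = cp.Variable((N, N))
constraints = [cp.sum(X) >= n * n]
for i, j in itertools.combinations(range(N), 2):
    if (i, j) not in edges:                                 # force zeros off the edge set
        constraints += [X[i, j] == 0, X[j, i] == 0]

prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()
print(np.round(X.value, 2))   # ideally the 0/1 indicator matrix of the clique plus its loops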

3.1 Optimality Conditions

In this section, we prove a theorem that gives sufficient conditions for optimality and uniqueness of a solution to (12). These conditions involve multipliers λ and μ and a matrix W. In subsequent subsections we explain how to select λ, μ and W based on the underlying graph so as to satisfy the conditions.

Recall that if f is a convex function, then a subgradient of f at a point x is defined to be a vector g such that f(y) ≥ f(x) + gᵀ(y − x) for all y. It is a well-known theorem that for a convex f and for every x, the set of subgradients forms a nonempty closed convex set. This set of subgradients, called the subdifferential, is denoted ∂f(x).
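
For instance (a standard one-dimensional illustration supplied by us, not taken from the paper), for f(x) = |x| on ℝ the subdifferential is

\[
\partial f(x)=\begin{cases}\{\operatorname{sign}(x)\}, & x\neq 0,\\[2pt] [-1,\,1], & x=0,\end{cases}
\]

since |y| ≥ g·y for every g ∈ [−1, 1]. Lemma 3.1 below is the matrix analogue of this computation for the nuclear norm.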

In this section we consider the following generalization of (12), because it will also arise in our discussion of the biclique problem below:

minimize  ‖X‖_*  subject to  Σ_{i=1}^{M} Σ_{j=1}^{N} X_{ij} ≥ mn   and   X_{ij} = 0 for all (i, j) ∈ Ē,   (13)

where m and n are given positive integers.

Here, X ∈ ℝ^{M×N}, E is a subset of {1, …, M} × {1, …, N}, and the complement of E is denoted Ē.

The following lemma characterizes the subdifferential of the nuclear norm (see [4, Equation 3.4] and also [18]).

Lemma 3.1

Suppose X ∈ ℝ^{M×N} has rank r with singular value decomposition X = Σ_{k=1}^{r} σ_k u_k v_kᵀ. Then Y is a subgradient of the nuclear norm at X if and only if Y is of the form

Y = Σ_{k=1}^{r} u_k v_kᵀ + W,

where W satisfies ‖W‖ ≤ 1 and is such that the column space of W is orthogonal to u_k and the row space of W is orthogonal to v_k for all k = 1, …, r.

Let K be a subset of {1, …, M}. We say that u ∈ ℝ^M is the characteristic vector of K if u_i = 1 for i ∈ K while u_i = 0 for i ∉ K.

Let U* be a subset of {1, …, M} and V* a subset of {1, …, N}, and let u, v be their characteristic vectors respectively. Suppose |U*| = m and |V*| = n with m ≥ 1, n ≥ 1. Let X* = u vᵀ, an M × N matrix. Clearly X* has rank 1. Note that Lemma 3.1 implies that

∂‖X*‖_* = { u vᵀ/√(mn) + W : ‖W‖ ≤ 1, Wᵀu = 0, W v = 0 }.   (14)

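With the characteristic vectors as reconstructed above (u with m ones and v with n ones), the singular value decomposition behind (14) can be written out explicitly (our worked step):

\[
u v^{\mathsf{T}}=\Big(\tfrac{u}{\sqrt{m}}\Big)\,\sqrt{mn}\,\Big(\tfrac{v}{\sqrt{n}}\Big)^{\mathsf{T}},
\]

so the only nonzero singular value of X* = u vᵀ is √(mn); in particular ‖X*‖_* = √(mn) = ‖u‖ · ‖v‖, the fact used in the proof of Theorem 3.1 below.
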
This leads to the main theorem for this section.

Theorem 3.1

Let U* be a subset of {1, …, M} of cardinality m, and let V* be a subset of {1, …, N} of cardinality n. Let u and v be the characteristic vectors of U*, V* respectively. Let X* = u vᵀ. Suppose X* is feasible for (13). Suppose also that there exist λ ≥ 0, μ, and W such that Wᵀu = 0, W v = 0, ‖W‖ ≤ 1, and

u vᵀ/√(mn) + W = λ e eᵀ + Σ_{(i,j)∈Ē} μ_{ij} e_i e_jᵀ.   (15)

Here, e denotes the vector of all 1's while e_i denotes the ith column of the identity matrix (either in ℝ^M or ℝ^N). Then X* is an optimal solution to (13). Moreover, for any Ũ ⊆ {1, …, M} and Ṽ ⊆ {1, …, N} such that the corresponding rank-one matrix ũ ṽᵀ satisfies the equality constraints of (13), |Ũ| · |Ṽ| ≤ mn.

Furthermore, if ‖W‖ < 1 and λ > 0, then X* is the unique optimizer of (13) (and hence will be found if a solver is applied to (13)).

Proof:  The fact that X* is optimal is a straightforward application of the well-known KKT conditions. Nonetheless, we now explicitly prove optimality because the inequalities in the proof are useful for the uniqueness proof below.

Suppose X is another matrix feasible for (13). We wish to show that ‖X‖_* ≥ ‖X*‖_*. To prove this, we use the definition of subgradient followed by (15). The notation ⟨A, B⟩ is used to denote the elementwise inner product of two matrices A, B of the same dimensions, i.e., ⟨A, B⟩ = Σ_{i,j} A_{ij} B_{ij}.

(16)
(17)
(18)
(19)

Equation (16) follows by the definition of subgradient and (14); (17) follows from (15); and (18) follows from the fact that X*_{ij} = 0 for (i, j) ∈ Ē by definition of X* and X_{ij} = 0 for (i, j) ∈ Ē by feasibility. Finally, (19) follows since λ ≥ 0 and Σ_{i,j} X_{ij} ≥ mn by feasibility. This proves that X* is an optimal solution to (13).

Now consider Ũ, Ṽ such that the corresponding rank-one matrix satisfies the equality constraints of (13). Then X̃ = (mn/(m̃ñ)) ũ ṽᵀ, where ũ is the characteristic vector of Ũ and ṽ is the characteristic vector of Ṽ (of cardinalities m̃ and ñ), is also a feasible solution to (13). Recall that for a matrix of the form uvᵀ, the unique nonzero singular value (and hence the nuclear norm) equals ‖u‖ · ‖v‖. Thus, ‖X*‖_* = √(mn) and ‖X̃‖_* = mn/√(m̃ñ). Since X* is optimal, ‖X*‖_* ≤ ‖X̃‖_*, i.e., √(mn) ≤ mn/√(m̃ñ). Simplifying yields m̃ñ ≤ mn.

Now finally we turn to the uniqueness of , which is the most complicated part of the proof. This argument requires a preliminary claim. Let denote the subspace of matrices such that and . Let denote the subspace of matrices that can be written in the form , where has all zeros in positions indexed by . Let denote the subspace of matrices that can be written in the form , where has all zeros in positions indexed by . Let denote the subspace of all matrices that can be written in the form , where has nonzeros only in positions indexed by , has nonzeros only in positions indexed by , and the sum of entries of is zero. Finally, let be the subspace of matrices of the form , where is a scalar.

The preliminary claim is that are mutually orthogonal and that . To check orthogonality, we proceed case by case. For example, if and , then so since . The identity similarly shows that is orthogonal to all of . Next, observe that has nonzero entries only in positions indexed by , where denotes . Similarly, has nonzero entries only in positions indexed by , and and have nonzero entries only in positions indexed by . Thus, the nonzero entries of , and are disjoint, and hence these spaces are mutually orthogonal. The only remaining case is to show that and are orthogonal; this follows because a matrix in is a multiple of the all ’s matrix in positions indexed by , while the entries of a matrix in , also only in positions indexed by , sum to .

Now we must show that . Select a . We first split off an component: let and define . Then . Let . One checks from the definition of that . It remains to write as a matrix in .

Next we split off an component. Let and . Observe that . Similarly, . Let and . Then

where the third line follows because and the fourth by definition of . Similarly, . Thus, .

It remains to split among , and . Write , where is nonzero only in entries indexed by while is nonzero only in entries indexed by . Similarly, split using and . Then . Then and , so define and . Finally, we must consider the remaining term . This has the form required for membership in , but it remains to verify that the sum of entries of add to zero. This is shown as follows:

The second line follows because is all zeros outside entries indexed by . The fourth line follows because is zero outside and similarly for . The last line follows from equalities derived in the previous paragraph.

This concludes the proof of the claim that the space splits into mutually orthogonal subspaces.

Now we prove the uniqueness of X* under the assumption that ‖W‖ < 1 and λ > 0. Let X be a feasible solution different from X*. Write X − X* as a sum of components lying in the mutually orthogonal subspaces defined above. Now we consider several cases.

The first case is that . Then since and , , it follows from Lemma 3.1 that lies in for sufficiently small. This means that ’’ appearing in above may be replaced by without harming the validity of the inequality. This adds the term to the right-hand sides of the inequalities following . Observe that . Thus, a positive quantity is added to all these right-hand sides, so we conclude .

For the remaining cases, we assume . We claim that as well. For example, suppose . Recall that is nonzero only for entries indexed by (and in particular, must be zero on ). Since all of , and are zero in , <