Max-Norm Optimization for Robust Matrix Recovery


Ethan X. Fang, Han Liu, Kim-Chuan Toh, and Wen-Xin Zhou

Ethan X. Fang: Department of Statistics and Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA 16802, USA. E-mail: xxf13@psu.edu
Han Liu: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA. E-mail: hanliu@princeton.edu
Kim-Chuan Toh: National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076. Research supported in part by Ministry of Education Academic Research Fund R-146-000-194-112. E-mail: mattohkc@nus.edu.sg
Wen-Xin Zhou: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA. E-mail: wenxinz@princeton.edu
Abstract

This paper studies the matrix completion problem under arbitrary sampling schemes. We propose a new estimator incorporating both max-norm and nuclear-norm regularization, based on which we can conduct efficient low-rank matrix recovery using a random subset of entries observed with additive noise under general non-uniform and unknown sampling distributions. This method significantly relaxes the uniform sampling assumption imposed for the widely used nuclear-norm penalized approach, and makes low-rank matrix recovery feasible in more practical settings. Theoretically, we prove that the proposed estimator achieves fast rates of convergence under different settings. Computationally, we propose an alternating direction method of multipliers algorithm to efficiently compute the estimator, which bridges a gap between theory and practice of machine learning methods with max-norm regularization. Further, we provide thorough numerical studies to evaluate the proposed method using both simulated and real datasets.

1 Introduction

We consider the matrix completion problem, which aims to reconstruct an unknown matrix based on a small number of entries contaminated by additive noise. This problem has drawn significant attention over the past decade due to its wide applications, including collaborative filtering (the well-known Netflix problem) (Netflix, 2006; Bennett and Lanning, 2007), multi-task learning (Abernethy et al., 2009; Amit et al., 2007; Argyriou et al., 2008), sensor-network localization (Biswas et al., 2006) and system identification (Liu and Vandenberghe, 2009). Specifically, our goal is to recover an unknown matrix $M^* \in \mathbb{R}^{d_1 \times d_2}$ based on a subset of its entries observed with noise. In general, the problem of recovering a partially observed matrix is ill-posed, as the unobserved entries can take arbitrary values without further assumptions. However, in many of the applications mentioned above, it is natural to impose the condition that the target matrix is either exactly or approximately low-rank, which avoids the ill-posedness and makes recovery possible.

To obtain a low-rank estimate of the matrix, a straightforward approach is to consider the rank minimization problem

\min_{M \in \mathbb{R}^{d_1 \times d_2}} \ \mathrm{rank}(M) \quad \text{subject to} \quad \sum_{(k,l) \in \Omega} \big( M_{kl} - Y_{kl} \big)^2 \le \delta, \qquad (1.1)

where $\Omega$ denotes the index set of observed entries, $Y_{kl}$ denotes the noisy observation of the $(k,l)$-th entry, and $\delta > 0$ is a tuning parameter. This method directly searches for a matrix of the lowest rank with reconstruction error controlled by $\delta$. However, the optimization problem (1.1) is computationally intractable due to its nonconvexity. A commonly used alternative is the following convex relaxation of (1.1):

\min_{M \in \mathbb{R}^{d_1 \times d_2}} \ \|M\|_* \quad \text{subject to} \quad \sum_{(k,l) \in \Omega} \big( M_{kl} - Y_{kl} \big)^2 \le \delta, \qquad (1.2)

where $\|M\|_*$ denotes the nuclear-norm of $M$ (also known as the trace-norm, Ky Fan norm or Schatten 1-norm), defined as the sum of the singular values of $M$. Low-rank matrix recovery based on nuclear-norm regularization has been extensively studied in both the noiseless and noisy cases (Candès and Recht, 2009; Candès and Tao, 2010; Recht et al., 2010; Koltchinskii et al., 2011; Rohde and Tsybakov, 2011; Recht, 2011; Keshavan et al., 2010; Negahban and Wainwright, 2012). Furthermore, various computational algorithms have been proposed to solve this problem. For example, Cai et al. (2010) propose a singular value thresholding algorithm which is equivalent to the gradient method for solving the dual of a regularized version of (1.2); Toh and Yun (2010) propose an accelerated proximal gradient method to solve a least squares version of (1.2); Liu and Vandenberghe (2009) exploit an interior-point method; and Chen et al. (2012) adopt an alternating direction method of multipliers approach to solve (1.2).
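To give a concrete flavor of these solvers, a common building block is the singular value soft-thresholding operator, i.e., the proximal mapping of the nuclear norm. The following NumPy sketch is our own illustration (the function name is ours) and is not taken from any of the cited implementations:

    import numpy as np

    def singular_value_soft_threshold(M, tau):
        """Proximal mapping of tau * nuclear norm: shrink all singular values by tau."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s_shrunk = np.maximum(s - tau, 0.0)      # soft-threshold the singular values
        return (U * s_shrunk) @ Vt               # reassemble; small singular values are zeroed out

    # Toy usage: thresholding a noisy rank-1 matrix typically returns a low-rank matrix.
    rng = np.random.default_rng(0)
    M = np.outer(rng.standard_normal(8), rng.standard_normal(6)) + 0.05 * rng.standard_normal((8, 6))
    print(np.linalg.matrix_rank(singular_value_soft_threshold(M, tau=0.5)))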

Though significant progress has been made, it remains unclear whether the nuclear-norm is the best convex relaxation for the rank minimization problem (1.1). Recently, some disadvantages of nuclear-norm regularization have been noted. For instance, the theoretical guarantee for nuclear-norm regularization relies on the assumption that the indices of the observed entries are uniformly sampled, that is, each entry is equally likely to be observed, as illustrated in Figure 1(a). This assumption is restrictive in applications. Taking the well-known Netflix problem as an example, our goal is to reconstruct a movie-user rating matrix, in which each row represents a user and each column represents a movie. The $(k, l)$-th entry of the rating matrix represents the $k$-th user's rating for the $l$-th movie. In practice, we only observe a small proportion of the entries. In this example, the uniform sampling assumption is arguably violated for the following reasons: (1) some users are more active than others and rate more movies; (2) some movies are more popular than others and are rated by more users. As a consequence, the entries in certain columns or rows are more likely to be observed. See Figure 1(b) for a simple illustration. To sum up, the sampling distribution can be highly non-uniform in real-world applications.

Figure 1: (a) The theoretical guarantee of the nuclear-norm estimator assumes each entry is equally likely to be observed. (b) In practice, some entries related to some popular movies or some active users, such as Movie 5 or User 5, are more likely to be sampled than others. Thus, the uniform sampling assumption is violated.

To relax or even avoid the unrealistic uniform sampling assumption, several recent papers propose to use the matrix max-norm as a convex surrogate for the rank. Srebro and Salakhutdinov (2010) observe from empirical comparisons that the max-norm regularized approach outperforms the nuclear-norm based one for matrix completion and collaborative filtering under non-uniform sampling schemes. Lee et al. (2010) and Jalali and Srebro (2012) demonstrate the advantage of using max-norm regularizer over nuclear-norm in some other applications. More recently, Cai and Zhou (2016) prove that the max-norm regularized estimator is minimax rate-optimal (over a class of approximately low-rank matrices) under non-uniform sampling schemes.

Though the max-norm approach possesses attractive theoretical properties, efficiently solving large-scale max-norm optimization problems remains challenging and prevents the wide adoption of the max-norm regularizer. As we shall see later, despite the fact that the max-norm is a convex regularizer and the corresponding problem can be formulated as a semidefinite program, classical methods such as interior-point methods are only scalable to moderate dimensions, while problems of practical interest are of large dimensions. In recent work, Lee et al. (2010) and Shen et al. (2014) propose first-order algorithms for a nonconvex relaxation of the problem. However, these methods are sensitive to the choice of initial points and stepsizes, and are only capable of producing stationary solutions, whose statistical properties remain open due to the nonconvexity. Meanwhile, although the max-norm estimator is adaptive to general sampling schemes, it was shown in Cai and Zhou (2016) that if the target matrix is of exact low-rank and the sampling scheme is uniform, the max-norm estimator only achieves a sub-optimal rate compared to the nuclear-norm estimator. Specifically, letting $\widehat{M}_{\max}$ and $\widehat{M}_{*}$ be the estimators using the max-norm and nuclear-norm regularizers, respectively, we have

\frac{1}{d_1 d_2} \|\widehat{M}_{\max} - M^*\|_F^2 \lesssim \sqrt{\frac{rd}{n}}, \qquad \text{while} \qquad \frac{1}{d_1 d_2} \|\widehat{M}_{*} - M^*\|_F^2 \lesssim \frac{rd \log d}{n},

where $r$ is the rank of $M^*$ and $d = d_1 + d_2$. Thus, under the uniform sampling scheme, the nuclear-norm regularized method achieves the optimal rate of convergence (up to a logarithmic factor) and is computationally more scalable.

To achieve the advantages of both regularizers, we propose a new estimator using a hybrid regularizer, together with an efficient alternating direction method of multipliers (ADMM) algorithm to solve the resulting optimization problem. Our method includes the max-norm regularizer as a special case, and the proposed algorithm is scalable to modestly large dimensions. The contribution of this paper is two-fold. First, we propose an estimator for matrix completion under a general sampling scheme, which achieves the optimal rate of convergence in the exact low-rank case and is adaptive to different sampling schemes. Second, we provide an efficient algorithm to solve the corresponding max-norm plus nuclear-norm penalized optimization problem. We illustrate the efficiency of the proposed methods and algorithms through numerical experiments on both simulated and real datasets.

Notation. Throughout this paper, we adopt the following notation. For any positive integer $d$, $[d]$ denotes the set of integers $\{1, 2, \ldots, d\}$. For a vector $v = (v_1, \ldots, v_d)^{\top} \in \mathbb{R}^d$ and a positive number $p \in [1, \infty)$, we denote $\|v\|_p = (\sum_{i=1}^{d} |v_i|^p)^{1/p}$ as the $\ell_p$-norm. Also, we let $\|v\|_\infty = \max_{1 \le i \le d} |v_i|$. For a matrix $M = (M_{kl}) \in \mathbb{R}^{d_1 \times d_2}$, let $\|M\|_F = (\sum_{k,l} M_{kl}^2)^{1/2}$ be the Frobenius-norm, and we denote the matrix elementwise $\ell_\infty$-norm by $\|M\|_\infty = \max_{k,l} |M_{kl}|$. Given the $\ell_p$ norm on $\mathbb{R}^{d_2}$ and the $\ell_q$ norm on $\mathbb{R}^{d_1}$, we define the corresponding operator-norm $\|M\|_{p,q} = \sup_{\|v\|_p \le 1} \|Mv\|_q$. For example, $\|M\| = \|M\|_{2,2}$ is the spectral-norm, and $\|M\|_{2,\infty} = \max_{k \in [d_1]} \|M_{k\cdot}\|_2$ is the maximum row $\ell_2$-norm of $M$. We write $a \asymp b$ if $c_1 b \le a \le c_2 b$ for two positive constants $c_1$ and $c_2$.
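As a quick numerical illustration of this notation (a self-contained NumPy snippet of our own, not part of the original paper):

    import numpy as np

    M = np.array([[1.0, -2.0],
                  [3.0,  0.5],
                  [0.0,  4.0]])                      # a 3 x 2 matrix

    frobenius   = np.linalg.norm(M, 'fro')           # ||M||_F
    elementwise = np.max(np.abs(M))                  # elementwise sup-norm ||M||_inf
    spectral    = np.linalg.norm(M, 2)               # ||M||_{2,2}, the largest singular value
    two_to_inf  = np.max(np.linalg.norm(M, axis=1))  # ||M||_{2,inf}, the maximum row l2-norm

    print(frobenius, elementwise, spectral, two_to_inf)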

Paper Organization. The rest of this paper is organized as follows. In Section 2, we review the max-norm approach and formulate the problem. In Section 3, we propose the algorithm. In Section 4, we provide theoretical analysis of the estimator. We provide extensive numerical studies in Section 5, and we conclude the paper in Section 6.

2 Preliminaries and Problem Formulation

In this section, we first introduce the concept of the matrix max-norm (Linial et al., 2007). Next, we propose a new estimator which involves both max-norm and nuclear-norm regularizers.

Definition 2.1.

The max-norm of a matrix $M \in \mathbb{R}^{d_1 \times d_2}$ is defined as

\|M\|_{\max} = \min_{M = UV^{\top}} \|U\|_{2,\infty} \|V\|_{2,\infty},

where the minimum is over all factorizations $M = UV^{\top}$ with $U \in \mathbb{R}^{d_1 \times k}$ and $V \in \mathbb{R}^{d_2 \times k}$ for some positive integer $k$, and $\|U\|_{2,\infty}$, $\|V\|_{2,\infty}$ denote the $\ell_2 \to \ell_\infty$ operator-norms of $U$ and $V$, i.e., their maximum row $\ell_2$-norms.
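Since Definition 2.1 takes a minimum over all factorizations, any particular factorization $M = UV^{\top}$ gives the upper bound $\|U\|_{2,\infty} \|V\|_{2,\infty} \ge \|M\|_{\max}$. The short NumPy sketch below (our own illustration) evaluates this bound:

    import numpy as np

    def max_norm_upper_bound(U, V):
        """Upper bound on ||U V^T||_max from one factorization: product of the maximum row l2-norms."""
        return np.max(np.linalg.norm(U, axis=1)) * np.max(np.linalg.norm(V, axis=1))

    rng = np.random.default_rng(0)
    U = rng.standard_normal((6, 2))
    V = rng.standard_normal((5, 2))
    M = U @ V.T
    print(max_norm_upper_bound(U, V))   # >= ||M||_max, with equality only at a minimizing factorization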

We briefly compare the max-norm and nuclear-norm regularizers. We refer to Srebro and Shraibman (2005) and Cai and Zhou (2016) for more detailed discussions. Recall that the nuclear-norm of the matrix $M$ can be equivalently written as

\|M\|_* = \min\Big\{ \sum_{j} |\sigma_j| : M = \sum_{j} \sigma_j u_j v_j^{\top}, \ \|u_j\|_2 = \|v_j\|_2 = 1 \Big\}.

From this characterization, the nuclear-norm encourages low-rank approximations with factors in the $\ell_2$-space. On the other hand, it is known (Jameson, 1987) that the max-norm has a similar interpretation obtained by replacing the constraints in the $\ell_2$-space by those in the $\ell_\infty$-space:

\|M\|_{\max} \approx \min\Big\{ \sum_{j} |\sigma_j| : M = \sum_{j} \sigma_j u_j v_j^{\top}, \ \|u_j\|_\infty = \|v_j\|_\infty = 1 \Big\},

where the factor of equivalence is the Grothendieck constant $K_G \in (1.67, 1.79)$. More specifically, a consequence of Grothendieck's inequality is that the unit ball of the max-norm is equivalent, up to the factor $K_G$, to the convex hull of the rank-one sign matrices $\{uv^{\top} : u \in \{\pm 1\}^{d_1}, v \in \{\pm 1\}^{d_2}\}$ (Srebro and Shraibman, 2005). This gives some intuition on why the max-norm regularizer could outperform the nuclear-norm regularizer when the matrix entries are uniformly bounded. This scenario indeed arises in many applications. For example, in the Netflix problem or the low-rank correlation matrix estimation problem, the entries of the unknown matrix are either ratings or correlation coefficients, and are therefore uniformly bounded.

As mentioned in Section 1, the advantages of using the max-norm over the nuclear-norm are well illustrated in the literature from both theoretical and practical perspectives. Specifically, we consider the matrix completion problem under a general sampling scheme. Let $M^* \in \mathbb{R}^{d_1 \times d_2}$ denote the unknown matrix to be recovered. Assume that we are given a random index set $\Omega$ of size $n$:

\Omega = \big\{ (i_t, j_t) \big\}_{t=1}^{n} \subset \big( [d_1] \times [d_2] \big)^{n},

where $(i_t, j_t) \in [d_1] \times [d_2]$ for $t = 1, \ldots, n$. We further assume that the samples of the indices are drawn independently from a general sampling distribution $\Pi = \{\pi_{kl}\}_{k \in [d_1], l \in [d_2]}$ on $[d_1] \times [d_2]$. Note that we consider the sampling scheme with replacement, i.e., we assume $\mathbb{P}\{(i_t, j_t) = (k, l)\} = \pi_{kl}$ for all $t \in [n]$ and all $(k, l) \in [d_1] \times [d_2]$. For example, the sampling scheme is uniform if $\pi_{kl} = 1/(d_1 d_2)$ for all $(k, l)$. Given the sampled index set $\Omega$, we further observe noisy entries $\{Y_{i_t, j_t}\}_{t=1}^{n}$:

Y_{i_t, j_t} = M^*_{i_t, j_t} + \sigma \xi_t, \qquad t = 1, \ldots, n,

where $\sigma > 0$ denotes the noise level, and the $\xi_t$'s are independent and identically distributed random variables with $\mathbb{E}(\xi_t) = 0$ and $\mathbb{E}(\xi_t^2) = 1$.
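The sampling-with-replacement model above is straightforward to simulate. The snippet below is an illustrative sketch (the particular non-uniform distribution is our own choice, not one used in the paper): it draws $n$ indices from a non-uniform distribution over the entries and adds Gaussian noise.

    import numpy as np

    rng = np.random.default_rng(0)
    d1, d2, r, n, sigma = 50, 40, 3, 800, 0.1

    M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # rank-r target matrix

    # Non-uniform sampling: "active" rows and "popular" columns are more likely to be observed.
    row_w = rng.uniform(0.5, 2.0, d1)
    col_w = rng.uniform(0.5, 2.0, d2)
    pi = np.outer(row_w, col_w)
    pi /= pi.sum()                                                  # sampling distribution on [d1] x [d2]

    idx = rng.choice(d1 * d2, size=n, replace=True, p=pi.ravel())   # sampling with replacement
    rows, cols = np.unravel_index(idx, (d1, d2))
    Y = M_star[rows, cols] + sigma * rng.standard_normal(n)         # noisy observations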

Using the max-norm regularization, Cai and Zhou (2016) propose to construct an estimator

\widehat{M}_{\max} = \mathop{\arg\min}_{M \in \mathcal{K}(\alpha, R)} \ \frac{1}{n} \sum_{t=1}^{n} \big( Y_{i_t, j_t} - M_{i_t, j_t} \big)^2, \qquad (2.1)

where $\mathcal{K}(\alpha, R) = \{ M \in \mathbb{R}^{d_1 \times d_2} : \|M\|_\infty \le \alpha, \ \|M\|_{\max} \le R \}$, with $\alpha$ being a prespecified upper bound for the elementwise $\ell_\infty$-norm of $M^*$ and $R > 0$ a tuning parameter. Note that, in many real world applications, we have a tight upper bound on the magnitudes of all the entries of $M^*$ in advance. This condition enforces that $M^*$ should not be too “spiky”, and a loose upper bound may jeopardize the estimation accuracy (Negahban and Wainwright, 2012). Also, the recent work by Lee et al. (2010) argues that the max-norm regularizer produces better empirical results on low-rank matrix recovery for uniformly bounded data.

Cai and Zhou (2016) provide theoretical guarantees for the max-norm regularized estimator (2.1). Specifically, under the approximately low-rank assumption that $\|M^*\|_{\max} \le R$, we have, with high probability,

\frac{1}{d_1 d_2} \|\widehat{M}_{\max} - M^*\|_F^2 \lesssim (\alpha + \sigma) R \sqrt{\frac{d}{n}},

where $d = d_1 + d_2$. This rate matches the minimax lower bound over all approximately low-rank matrices even under non-uniform sampling schemes. See Cai and Zhou (2016) for more details.

The optimization problem (2.1) is computationally challenging. Cai and Zhou (2016) employ a first-order method proposed in Lee et al. (2010). In particular, Lee et al. (2010) and Shen et al. (2014) consider first-order methods based on rewriting problem (2.1) into the following factorized form:

\min_{U, V} \ \frac{1}{n} \sum_{t=1}^{n} \big( Y_{i_t, j_t} - (UV^{\top})_{i_t, j_t} \big)^2 \quad \text{subject to} \quad \|U\|_{2,\infty}^2 \le R, \ \|V\|_{2,\infty}^2 \le R, \ \|UV^{\top}\|_\infty \le \alpha,

where $\|U\|_{2,\infty}$ and $\|V\|_{2,\infty}$ denote the maximum row $\ell_2$-norms of $U$ and $V$, respectively. Then, Lee et al. (2010) and Shen et al. (2014) apply different efficient first-order methods to solve this problem; a schematic sketch of this type of scheme is given below. However, the factorized problem is nonconvex, and the convergence behaviors of such methods on a nonconvex problem are generally sensitive to the choice of the initial point and the stepsize. More seriously, the algorithms mentioned above can only guarantee local stationary solutions, which may not necessarily enjoy the nice theoretical properties established for the solution to problem (2.1). More recently, Orabona et al. (2012) solve the optimization problem (2.1) without the uniform-boundedness constraint. However, it is unclear how to extend their algorithms to solve the problem (2.1) with the $\ell_\infty$-norm constraint.
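For concreteness, the sketch below illustrates the kind of first-order scheme one can run on the factorized form: gradient steps on $(U, V)$ followed by rescaling rows that exceed the norm budget. It is our own simplified illustration (fixed stepsize, and the elementwise bound $\|UV^{\top}\|_\infty \le \alpha$ is omitted), not the algorithm of Lee et al. (2010) or Shen et al. (2014); it mainly shows why the outcome depends on the initialization and stepsize.

    import numpy as np

    def factored_fit(rows, cols, Y, d1, d2, k, R, step=1e-2, iters=500, seed=0):
        """Schematic projected gradient on M = U V^T with max row-norm budget sqrt(R)."""
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((d1, k))           # the result depends on this initialization
        V = 0.1 * rng.standard_normal((d2, k))
        for _ in range(iters):
            resid = (U[rows] * V[cols]).sum(axis=1) - Y  # (U V^T)_{i_t, j_t} - Y_t
            gU = np.zeros_like(U)
            gV = np.zeros_like(V)
            np.add.at(gU, rows, resid[:, None] * V[cols])
            np.add.at(gV, cols, resid[:, None] * U[rows])
            U -= step * gU
            V -= step * gV
            for A in (U, V):                             # rescale rows exceeding the l2-norm budget
                norms = np.linalg.norm(A, axis=1, keepdims=True)
                A *= np.minimum(1.0, np.sqrt(R) / np.maximum(norms, 1e-12))
        return U, V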

In the next section, we aim to solve the max-norm penalized optimization problem

\widehat{M} = \mathop{\arg\min}_{\|M\|_\infty \le \alpha} \ \frac{1}{2} \sum_{t=1}^{n} \big( Y_{i_t, j_t} - M_{i_t, j_t} \big)^2 + \mu \|M\|_{\max}, \qquad (2.2)

where $\mu > 0$ is a tuning parameter. By convexity and strong duality, the problem (2.2) is equivalent to (2.1) for a properly chosen $\mu$. Specifically, for any $R$ specified in (2.1), there exists a $\mu$ such that the solutions to the two problems coincide.

As discussed in Section 1, a major drawback of the max-norm penalized estimator (2.1) is that if the underlying true matrix is of exact low-rank, and when the sampling scheme is indeed uniform, the max-norm regularizer does not perform as well as the nuclear-norm regularizer. Since the underlying structure of and the sampling scheme are unknown, it is difficult to choose the better approach in practice. To overcome this issue, we propose the following hybrid estimator which is expected to be more flexible and adaptive:

\widehat{M} = \mathop{\arg\min}_{\|M\|_\infty \le \alpha} \ \frac{1}{2} \sum_{t=1}^{n} \big( Y_{i_t, j_t} - M_{i_t, j_t} \big)^2 + \mu \|M\|_{\max} + \lambda \|M\|_{*}, \qquad (2.3)

where $\lambda \ge 0$ is a nonnegative tuning parameter. The addition of the nuclear-norm penalization is motivated by the fact that the nuclear-norm also serves as a convex surrogate for the rank of the estimator. Thus, the additional nuclear-norm penalty encourages the estimator to be low-rank or approximately low-rank as compared to the max-norm estimator in (2.2). However, note that our primary goal here is not to find a low-rank estimator but one which approximates the underlying matrix at near optimal recovery and is robust against the unknown sampling scheme. It is worth mentioning that the use of the sum of two norms in matrix recovery has been considered in other contexts. For example, in robust principal component analysis (Candès et al., 2009), the sum of the nuclear and elementwise $\ell_1$ norms is used to recover the low-rank and sparse components of a given superposition. In Doan and Vavasis (2013), a similar combination of the nuclear norm and the elementwise $\ell_1$-norm, with a weight parameter balancing the two, is used to find hidden sparse rank-one matrices in a given matrix. The geometry of the unit ball of such a combined norm is further analyzed in Drusvyatskiy et al. (2015). To the best of our knowledge, (2.3) is the first estimator in matrix recovery to combine the max-norm and the nuclear-norm.

In Section 3, we propose an efficient algorithm to solve (2.3), which includes (2.2) as a special case by taking $\lambda = 0$. Section 4 provides theoretical justification for the hybrid estimator in (2.3). In particular, it achieves a fast rate of convergence under the “ideal” situation and is robust against non-uniform sampling schemes. To sum up, this estimator possesses the advantages of both the max-norm and nuclear-norm regularizers. Section 5 provides empirical results for the algorithm.

3 Algorithm

In this section, we propose a new algorithm to solve the problem (2.3). The key step is to reformulate the problem to expose the structure.

3.1 Algorithmic Framework

We first review how the max-norm regularized problem (2.2) can be equivalently formulated as a semidefinite programming (SDP) problem. From Definition 2.1 alone, it is unclear how to efficiently compute the max-norm of a given matrix. By Srebro et al. (2004), the max-norm of a matrix $M \in \mathbb{R}^{d_1 \times d_2}$ can be computed by solving the following SDP problem:

\|M\|_{\max} = \min \Big\{ t : \begin{pmatrix} W_1 & M \\ M^{\top} & W_2 \end{pmatrix} \succeq 0, \ \mathrm{diag}(W_1) \le t \mathbf{1}, \ \mathrm{diag}(W_2) \le t \mathbf{1} \Big\}.
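Purely to illustrate this characterization on a small example (cvxpy and the SCS solver are our own choices here, not tools used in the paper), the max-norm can be computed by introducing a single symmetric block matrix variable:

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    d1, d2 = 5, 4
    M = rng.standard_normal((d1, d2))

    Z = cp.Variable((d1 + d2, d1 + d2), symmetric=True)   # Z = [[W1, M], [M^T, W2]]
    t = cp.Variable()
    constraints = [Z >> 0,                 # positive semidefinite constraint
                   Z[:d1, d1:] == M,       # upper-right block equals M
                   cp.diag(Z) <= t]        # diag(W1) <= t and diag(W2) <= t
    cp.Problem(cp.Minimize(t), constraints).solve(solver=cp.SCS)
    print("max-norm of M (via SDP):", t.value)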
Thus, the max-norm penalized problem (2.2) can be formulated as an SDP problem:

\min_{Z \in \mathbb{S}^{d}} \ \frac{1}{2} \sum_{(k,l) \in \Omega} \big( Z^{12}_{kl} - Y_{kl} \big)^2 + \mu \big\| \mathrm{diag}(Z) \big\|_\infty \qquad (3.1)
subject to \quad \|Z^{12}\|_\infty \le \alpha, \quad Z \succeq 0,

where $d = d_1 + d_2$, and $Z^{12} \in \mathbb{R}^{d_1 \times d_2}$ denotes the upper-right block of

Z = \begin{pmatrix} Z^{11} & Z^{12} \\ (Z^{12})^{\top} & Z^{22} \end{pmatrix}.
One may observe that the problem (3.1) does not explicitly encourage the optimal solutions to be low-rank matrices, although such a property is desirable in many practical applications such as collaborative filtering. Thus, we propose to add a regularization term involving the trace of $Z$, which is a convex surrogate for the rank of the positive semidefinite matrix $Z$, to the objective function in (3.1), and obtain the following hybrid optimization problem:

\min_{Z \in \mathbb{S}^{d}} \ \frac{1}{2} \sum_{(k,l) \in \Omega} \big( Z^{12}_{kl} - Y_{kl} \big)^2 + \mu \big\| \mathrm{diag}(Z) \big\|_\infty + \lambda \, \langle I, Z \rangle \qquad (3.2)
subject to \quad \|Z^{12}\|_\infty \le \alpha, \quad Z \succeq 0,

where $\lambda \ge 0$ is a tuning parameter. Note that the estimator in Cai and Zhou (2016) is constructed by solving a special case of this problem with $\lambda = 0$.

Remark 3.1.

The problem (3.2) is equivalent to the problem (2.3). To see this, by Lemma 1 of Fazel et al. (2001), there is an SDP characterization of the trace-norm: $\|M\|_* \le t$ if and only if there exist symmetric matrices $W_1 \in \mathbb{S}^{d_1}$ and $W_2 \in \mathbb{S}^{d_2}$ satisfying

\begin{pmatrix} W_1 & M \\ M^{\top} & W_2 \end{pmatrix} \succeq 0 \quad \text{and} \quad \mathrm{tr}(W_1) + \mathrm{tr}(W_2) \le 2t.

The optimization problem (3.2) is computationally challenging. Directly solving it by generic interior-point-method-based SDP solvers is not computationally scalable. This is because the problem of interest is often of high dimension, and the $\ell_\infty$-norm constraint in (3.2) induces a large number of constraints in the SDP. In addition, the feasible set has a very complex form, as it involves both the positive semidefinite and $\ell_\infty$-norm constraints. Although gradient projection methods are the most straightforward methods to try, the complicated feasible set also makes them difficult to apply: each iteration requires projecting the intermediate solution onto the feasible set, and it is unclear how to compute this projection efficiently.

To solve the problem efficiently, we consider an equivalent form of (3.2) below. As we shall see immediately, this formulation is crucial for solving the problem efficiently:

\min_{X, Z \in \mathbb{S}^{d}} \ L(X) + \mu \big\| \mathrm{diag}(X) \big\|_\infty + \lambda \, \langle I, X \rangle \quad \text{subject to} \quad X \in \mathcal{P}, \ Z \succeq 0, \ X = Z, \qquad (3.3)

where the function $L(\cdot)$ and the set $\mathcal{P}$ are defined as follows:

L(X) = \frac{1}{2} \sum_{(k,l) \in \Omega} \big( X^{12}_{kl} - Y_{kl} \big)^2, \qquad \mathcal{P} = \big\{ X \in \mathbb{S}^{d} : \|X^{12}\|_\infty \le \alpha \big\}, \qquad (3.4)

and $\mathbb{S}^{d}$ denotes the set of symmetric matrices in $\mathbb{R}^{d \times d}$ with $d = d_1 + d_2$.

Intuitively, the advantage of reformulating the problem (3.2) into the form of (3.3) is that we “split” the complicated feasible set of (3.2) into two parts. In particular, $Z$ and $X$ in (3.3) enforce the positive semidefinite constraint and the $\ell_\infty$-norm constraints, respectively. The motivation for this splitting is that, although projection onto the feasible set of (3.2), which contains both the semidefinite and $\ell_\infty$-norm constraints, is difficult, we can efficiently compute the projection onto the positive semidefinite cone and onto the $\ell_\infty$-constraint set individually. As a result, adopting an alternating direction approach, in each step we only need to project onto the positive semidefinite cone and control the $\ell_\infty$-norm of $X$. Meanwhile, we impose the additional constraint $X = Z$ to ensure that the two variables correspond to a common point that is feasible for the problem (3.2).

To solve (3.3), we consider the augmented Lagrangian function of (3.3) defined by

\mathcal{L}_{\rho}(X, Z; W) = L(X) + \mu \big\| \mathrm{diag}(X) \big\|_\infty + \lambda \, \langle I, X \rangle + \langle W, X - Z \rangle + \frac{\rho}{2} \|X - Z\|_F^2, \qquad X \in \mathcal{P}, \ Z \in \mathbb{S}_{+}^{d},

where $W \in \mathbb{S}^{d}$ is the dual variable, $\rho > 0$ is a penalty parameter, and $\mathbb{S}_{+}^{d}$ is the positive semidefinite cone.

Then, we apply the ADMM algorithm to solve the problem (3.3). The algorithm runs iteratively; at the $t$-th iteration, we update $(Z, X, W)$ by

Z^{t+1} = \Pi_{\mathbb{S}_{+}^{d}} \big( X^{t} + \rho^{-1} W^{t} \big), \qquad
X^{t+1} = \mathop{\arg\min}_{X \in \mathcal{P}} \ \mathcal{L}_{\rho}(X, Z^{t+1}; W^{t}), \qquad
W^{t+1} = W^{t} + \tau \rho \, \big( X^{t+1} - Z^{t+1} \big), \qquad (3.5)

where $\tau$ is a step-length parameter which is typically chosen to be $1.618$. Here, $\Pi_{\mathbb{S}_{+}^{d}}(G)$ denotes the projection of the matrix $G$ onto the semidefinite cone $\mathbb{S}_{+}^{d}$. The worst-case rate of convergence of the ADMM method is shown, for example, in Fang et al. (2015).
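A schematic Python version of the iteration (3.5) is sketched below. It is only a template in which the two subproblem solvers are passed in as callbacks; it is our own illustration and not the implementation used in the paper.

    import numpy as np

    def admm_template(solve_X, project_psd, d, rho=1.0, tau=1.618, iters=300):
        """Schematic ADMM for the splitting (3.3): Z-update, X-update, then dual update."""
        X = np.zeros((d, d))
        Z = np.zeros((d, d))
        W = np.zeros((d, d))                     # dual variable for the constraint X = Z
        for _ in range(iters):
            Z = project_psd(X + W / rho)         # Z-subproblem: projection onto the PSD cone
            X = solve_X(Z - W / rho, rho)        # X-subproblem: closed form as in Proposition 3.2
            W = W + tau * rho * (X - Z)          # dual update with step-length tau
        return X, Z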

3.2 Solving Subproblems

For a fast implementation of the algorithm (3.5), it is important to solve the $X$- and $Z$-subproblems of (3.5) efficiently. For the $Z$-subproblem, the minimizer is obtained by truncating all the negative eigenvalues of the matrix being projected to 0's, by the Eckart-Young Theorem (Trefethen and Bau III, 1997). Moreover, the following proposition provides a solution to the $X$-subproblem in (3.5), which can be computed efficiently.
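A minimal implementation of this projection simply keeps the nonnegative part of the spectrum (a standalone NumPy sketch of our own):

    import numpy as np

    def project_psd(G):
        """Euclidean projection of a symmetric matrix G onto the positive semidefinite cone."""
        G = (G + G.T) / 2                  # symmetrize to guard against round-off
        w, V = np.linalg.eigh(G)
        w = np.maximum(w, 0.0)             # truncate negative eigenvalues to zero
        return (V * w) @ V.T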

Proposition 3.2.

Let be the index set of observed entries in . For a given matrix , we have

where

(3.6)

and $\Pi_{[-\alpha, \alpha]}(\cdot)$ projects its argument elementwise onto the interval $[-\alpha, \alpha]$.

Proof.

By the definition of in (3.2), we have

This optimization problem is equivalent to

(3.7)

For the first term of the above optimization problem, utilizing its separable structure, it is equivalent to

from which we see that its minimizer is given by .

In addition, the optimal values of the remaining components are obtained by considering the remaining terms of (3.7), which concludes the proof. ∎

Note that in (3.7), we need to solve the following optimization problem

(3.8)

where and . A direct approach to solve this problem is to reformulate it into a linearly constrained quadratic programming problem. In the next lemma, we show that it actually admits a closed-form solution. For ease of presentation, we assume without loss of generality that .

Lemma 3.3.

Suppose that . The solution to the optimization problem (3.8) is of the form

where and is the index such that . If no such exists, then , where .

Proof.

Let . By the assumption that , one can prove by contradiction that the optimal solution to (3.8) must satisfy the property that . It is clear that (3.8) is equivalent to the following convex minimization problem:

(3.9)

whose KKT optimality conditions are given by

Define

Let be the index such that . If no such exists, i.e., for all , then set . Now one can verify that the point defined below satisfies the KKT conditions:

Hence is the optimal solution to (3.8). This completes the proof. ∎

Remark 3.4.

We avoid presenting the general case for simplicity. The solution in the general case can be derived similarly, and we implement the algorithm for the general case in the numerical studies that follow.

The algorithm for solving problem (3.2) is summarized in Algorithm 1.

0:  Input: observed entries $\{Y_{i_t, j_t}\}_{t=1}^{n}$, index set $\Omega$, parameters $\alpha$, $\mu$, $\lambda$, penalty $\rho$, step-length $\tau$, stopping tolerance, maximum number of iterations, and initial values $X^0$, $Z^0$, $W^0$; set $t = 0$.
  while Stopping criterion is not satisfied. do
     Update $Z^{t+1} = \Pi_{\mathbb{S}_{+}^{d}}(X^{t} + \rho^{-1} W^{t})$.
     Update $X^{t+1}$ by (3.6).
     Update $W^{t+1} = W^{t} + \tau \rho (X^{t+1} - Z^{t+1})$.
     $t \leftarrow t + 1$.
  end while
  Output: $\widehat{Z} = Z^{t}$ and the recovered matrix $\widehat{M} = \widehat{Z}^{12}$, the upper-right $d_1 \times d_2$ block of $\widehat{Z}$.
Algorithm 1 Solving max-norm optimization problem (3.2) by the ADMM
Remark 3.5.

Taking a closer look at Algorithm 1, we see that the equivalent reformulation (3.3) of the original problem (3.2) brings us computational efficiency. In particular, all subproblems can be solved efficiently. Among them, the most computationally expensive step is the $Z$-update step, as we need to compute an eigenvalue decomposition of a $d \times d$ matrix, which has complexity of order $O(d^3)$. Meanwhile, we point out that if a rank-$r$ solution to the $Z$-subproblem is desired, the computational complexity can be reduced to the order of $O(r d^2)$.

Remark 3.6.

Note that if the user requires an exact low-rank solution, solving the $Z$-subproblem can be further accelerated. In particular, we can apply the Eckart-Young Theorem and project the solution onto the nearest face of the cone corresponding to the target rank. See, for example, Oliveira et al. (2015), where this idea is applied to the SDP relaxation of the quadratic assignment problem with nonnegativity constraints added.

In addition to the algorithm for the regularized max-norm minimization problem (2.2), we also provide the algorithm for solving the constrained version (2.1) in Appendix A. We focus our discussions on the regularized version since it is computationally more challenging.

3.3 Stopping Conditions

In this section, we discuss the stopping conditions for Algorithm 1. Denote by $\mathbf{1}_{\mathcal{C}}(\cdot)$ the indicator function of a given set $\mathcal{C}$, i.e., $\mathbf{1}_{\mathcal{C}}(x) = 0$ if $x \in \mathcal{C}$ and $\mathbf{1}_{\mathcal{C}}(x) = \infty$ if $x \notin \mathcal{C}$. The optimality conditions for (3.3) are given as follows:

(3.10)

where $W$ is the Lagrange multiplier associated with the equality constraint $X = Z$. Here $\partial$ denotes the subdifferential of the corresponding convex function at the given point.

By the optimality conditions of and in (3.5), we have that

if and only if

and

if and only if

where . Observe that the iterate generated from Algorithm 1 is an accurate approximate optimal solution to (3.5) if the residual

is small, where

denote the primal and dual residuals. In the practical implementation, we let the algorithm stop when the residual falls below a prescribed tolerance or when the number of iterations exceeds a prescribed maximum.

3.4 Practical Implementations

We should mention that tuning the penalty parameter $\rho$ properly in the ADMM method is critical for the method to converge at a reasonable rate. In our implementation, starting from an initial value of $\rho$, we adaptively tune the parameter every ten iterations based on a criterion that compares the relative magnitudes of the primal and dual residuals. The basic idea is to balance the progress of the primal and dual residuals so that the stopping criterion can be attained within a small number of iterations.
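One common instantiation of such a residual-balancing rule is the one popularized by Boyd et al. for generic ADMM; the thresholds in the sketch below are illustrative defaults of our own and not necessarily the exact criterion used in our implementation.

    def update_rho(rho, primal_res, dual_res, factor=2.0, ratio=10.0):
        """Residual-balancing heuristic for the ADMM penalty parameter (illustrative defaults)."""
        if primal_res > ratio * dual_res:
            return rho * factor            # primal residual lags behind: increase the penalty
        if dual_res > ratio * primal_res:
            return rho / factor            # dual residual lags behind: decrease the penalty
        return rho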

Another important computational issue which we need to address is to cut down the cost of computing the full eigenvalue decomposition in the $Z$-update step in Algorithm 1. Given a matrix $G$, we observe that to compute the projection $\Pi_{\mathbb{S}_{+}^{d}}(G)$, we need only the eigen-pairs corresponding to the positive eigenvalues of $G$. Thus, in our implementation, we use the LAPACK subroutine dsyevx.f to compute only a partial eigenvalue decomposition of $G$ if we know that the number of positive eigenvalues of $G$ is substantially smaller than $d$, say less than a small fraction of $d$. Such a partial eigenvalue decomposition is typically cheaper than a full eigenvalue decomposition when the number of eigenvalues of interest is much smaller than the dimension $d$. For Algorithm 1, at the $t$-th iteration, we estimate the potential number of positive eigenvalues of the matrix being projected (and hence the rank of $Z^{t+1}$) based on the rank of the previously computed iterate $Z^{t}$. Such an estimate is usually accurate when the sequence of iterates starts to converge. During the initial phase of Algorithm 1, we do not have a good estimate of this rank, and we compute the projection based on the full eigenvalue decomposition.
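In a Python setting, a comparable partial eigenvalue decomposition (analogous in spirit to dsyevx.f) is available through SciPy; the sketch below keeps only the eigen-pairs with positive eigenvalues and is our own illustration, not the paper's Fortran-based implementation.

    import numpy as np
    from scipy.linalg import eigh

    def project_psd_partial(G):
        """PSD projection using only the positive part of the spectrum."""
        G = (G + G.T) / 2
        w, V = eigh(G, subset_by_value=(0.0, np.inf))   # eigen-pairs with eigenvalues in (0, inf]
        return (V * w) @ V.T                            # zero matrix if there are no positive eigenvalues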

To further reduce the cost of computing the projection in Algorithm 1, we employ a heuristic strategy to truncate the small positive eigenvalues of the matrix being projected to 0's. That is, if there is a group of positive eigenvalues whose magnitudes are significantly larger than those of the remaining positive eigenvalues, we compute the projection using only the eigen-pairs corresponding to the large positive eigenvalues. Such a strategy can significantly reduce the cost of computing the projection, since the number of large eigenvalues is typically small in a low-rank matrix completion problem. A surprising bonus of adopting such a cost-cutting heuristic is that the recovery error can actually become 30–50% smaller, despite the fact that the computed iterate is now only an approximate solution of the $Z$-update subproblem in Algorithm 1. One possible explanation for this phenomenon is that the truncation of small positive eigenvalues to 0's has a debiasing effect which eliminates the attenuation of the singular values of the recovered matrix caused by the convex regularization term. In the case of compressed sensing, such a debiasing effect has been explained in Figueiredo et al. (2007).

4 Theoretical Properties

In this section, we provide theoretical guarantees for the hybrid estimator (2.3). To facilitate our discussion, we introduce the following notation. Let $X_1, \ldots, X_n$ be i.i.d. copies of a random matrix $X$ with distribution $\Pi$ on the set $\mathcal{X} = \big\{ e_k(d_1) e_l(d_2)^{\top} : k \in [d_1], \ l \in [d_2] \big\}$, i.e., $\mathbb{P}\big\{ X = e_k(d_1) e_l(d_2)^{\top} \big\} = \pi_{kl}$, where $e_k(d)$ denotes the $k$-th canonical basis vector in $\mathbb{R}^{d}$. By definition,

\langle X_t, M \rangle = \mathrm{tr}\big( X_t^{\top} M \big) = M_{i_t, j_t} \qquad (4.1)

for all matrices $M \in \mathbb{R}^{d_1 \times d_2}$. Moreover, let

\pi_{k \cdot} = \sum_{l=1}^{d_2} \pi_{kl} \qquad \text{and} \qquad \pi_{\cdot l} = \sum_{k=1}^{d_1} \pi_{kl}

be, respectively, the probabilities of observing an element from the $k$-th row and the $l$-th column.

Considering exact low-rank matrix recovery, i.e., the case where $M^*$ is of exact low-rank, the first part of the next theorem shows that the estimator (2.3) achieves a fast rate of convergence under some “ideal” situations, and the second part indicates that it is also robust against non-uniform sampling schemes. For ease of presentation, we conduct the analysis by considering a constrained form of (2.3), namely,

(4.2)

Our proof partly follows the arguments in Cai and Zhou (2016). The major technical challenge here is to carefully balance the tuning parameters in (4.2) to achieve the desired recovery results under both the uniform and non-uniform sampling schemes.

Theorem 4.1.

Assume that , and that are i.i.d. random variables. The sampling distribution is such that for some . Choose in (4.2) and write .

  • Let for some constant . Then, for a sample size , the estimator given at (4.2) satisfies