1-Bit Matrix Completion

Mark A. Davenport, Yaniv Plan, Ewout van den Berg, Mary Wootters

M.A. Davenport is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA. Email: mdav@gatech.edu.
Y. Plan is with the Department of Mathematics, University of Michigan, Ann Arbor, MI. Email: yplan@umich.edu.
E. van den Berg is with the IBM T.J. Watson Research Center, Yorktown Heights, NY. Email: evandenberg@us.ibm.com.
M. Wootters is with the Department of Mathematics, University of Michigan, Ann Arbor, MI. Email: wootters@umich.edu.
This work was partially supported by NSF grants DMS-0906812, DMS-1004718, DMS-1103909, CCF-0743372, and CCF-1350616 and NRL grant N00173-14-2-C001.
September 2012 (Revised May 2014)
Abstract

In this paper we develop a theory of matrix completion for the extreme case of noisy 1-bit observations. Instead of observing a subset of the real-valued entries of a matrix , we obtain a small number of binary (1-bit) measurements generated according to a probability distribution determined by the real-valued entries of . The central question we ask is whether or not it is possible to obtain an accurate estimate of from this data. In general this would seem impossible, but we show that the maximum likelihood estimate under a suitable constraint returns an accurate estimate of when and . If the log-likelihood is a concave function (e.g., the logistic or probit observation models), then we can obtain this maximum likelihood estimate by optimizing a convex program. In addition, we also show that if instead of recovering we simply wish to obtain an estimate of the distribution generating the 1-bit measurements, then we can eliminate the requirement that . For both cases, we provide lower bounds showing that these estimates are near-optimal. We conclude with a suite of experiments that both verify the implications of our theorems and illustrate some of the practical applications of 1-bit matrix completion. In particular, we compare our program to standard matrix completion methods on movie rating data in which users submit ratings from 1 to 5. In order to use our program, we quantize this data to a single bit, but we allow the standard matrix completion program to have access to the original ratings (from 1 to 5). Surprisingly, the approach based on binary data performs significantly better.

1 Introduction

The problem of recovering a matrix from an incomplete sampling of its entries—also known as matrix completion—arises in a wide variety of practical situations. In many of these settings, however, the observations are not only incomplete, but also highly quantized, often even to a single bit. In this paper we consider a statistical model for such data where instead of observing a real-valued entry as in the original matrix completion problem, we are now only able to see a positive or negative rating. This binary output is generated according to a probability distribution which is parameterized by the corresponding entry of the unknown low-rank matrix . The central question we ask in this paper is: “Given observations of this form, can we recover the underlying matrix?”

Questions of this form are often asked in the context of binary PCA or logistic PCA. There are a number of compelling algorithmic papers on these subjects, including [22, 57, 38, 19, 64], which suggest positive answers on simulated and real-world data. In this paper, we give the first theoretical accuracy guarantees under a generalized linear model. We show that binary observations are sufficient to accurately recover a , rank- matrix by convex programming. Our theory is inspired by the unquantized matrix completion problem and the closely related problem of 1-bit compressed sensing, described below.

1.1 Matrix completion

Matrix completion arises in a wide variety of practical contexts, including collaborative filtering [28], system identification [45], sensor localization [7, 59, 60], rank aggregation [26], and many more. While many of these applications have a relatively long history, recent advances in the closely related field of compressed sensing [23, 13, 21] have enabled a burst of progress in the last few years, and we now have a strong base of theoretical results concerning matrix completion [30, 16, 17, 36, 37, 49, 42, 15, 55, 56, 40, 25, 39, 41]. A typical result from this literature is that a generic matrix of rank can be exactly recovered from randomly chosen entries. Similar results can be established in the case of noisy observations and approximately low-rank matrices [37, 49, 42, 15, 56, 40, 25, 39, 41].

Although these results are quite impressive, there is an important gap between the statement of the problem as considered in the matrix completion literature and many of the most common applications discussed therein. As an example, consider collaborative filtering and the now-famous “Netflix problem.” In this setting, we assume that there is some unknown matrix whose entries each represent a rating for a particular user on a particular movie. Since any user will rate only a small subset of possible movies, we are only able to observe a small fraction of the total entries in the matrix, and our goal is to infer the unseen ratings from the observed ones. If the rating matrix has low rank, then this would seem to be the exact problem studied in the matrix completion literature. However, there is a subtle difference: the theory developed in this literature generally assumes that observations consist of (possibly noisy) continuous-valued entries of the matrix, whereas in the Netflix problem the observations are “quantized” to the set of integers between 1 and 5. If we believe that it is possible for a user’s true rating for a particular movie to be, for example, 4.5, then we must account for the impact of this “quantization noise” on our recovery. Of course, one could potentially treat quantization simply as a form of bounded noise, but this is somewhat unsatisfying because the ratings aren’t just quantized — there are also hard limits placed on the minimum and maximum allowable ratings. (Why should we suppose that a movie given a rating of 5 could not have a true underlying rating of 6 or 7 or 10?) The inadequacy of standard matrix completion techniques in dealing with this effect is particularly pronounced when we consider recommender systems where each rating consists of a single bit representing a positive or negative rating (consider for example rating music on Pandora, the relevance of advertisements on Hulu, or posts on sites such as MathOverflow). In such a case, the assumptions made in the existing theory of matrix completion do not apply, standard algorithms are ill-posed, and alternative theory is required.

1.2 Statistical learning theory and matrix completion

While the theory of matrix completion gained quite a bit of momentum following advances in compressed sensing, earlier results from Srebro et al. [62, 63] also addressed this problem from a slightly different perspective rooted in the framework of statistical learning theory. These results also deal with the binary setting that we consider here. They take a model-free approach and prove generalization error bounds; that is, they give conditions under which good agreement on the observed data implies good agreement on the entire matrix. For example, in [63], agreement is roughly measured via the fraction of predicted signs that are correct, but this can also be extended to other notions of agreement [62]. From the perspective of statistical learning theory, this corresponds to bounding the generalization error under various classes of loss functions.

One important difference between the statistical learning approach and the path taken in this paper is that here we focus on parameter estimation — that is, we seek to recover the matrix itself (or the distribution parameterized by that governs our observations) — although we also prove generalization error bounds en route to our main results on parameter and distribution recovery. We discuss the relationship between our approach and the statistical learning approach in more detail in Section A.1 (see Remark 3). Briefly, our generalization error bounds correspond to the case where the loss function is the log likelihood, and do not seem to fit directly within the framework of the existing literature.

1.3 1-bit compressed sensing and sparse logistic regression

As noted above, matrix completion is closely related to the field of compressed sensing, where a theory to deal with single-bit quantization has recently been developed [10, 33, 51, 52, 32, 43]. In compressed sensing, one can recover an -sparse vector in from random linear measurements—several different random measurement structures are compatible with this theory. In 1-bit compressed sensing, only the signs of these measurements are observed, but an -sparse signal can still be approximately recovered from the same number of measurements [33, 51, 52, 1]. However, the only measurement ensembles which are currently known to give such guarantees are Gaussian or sub-Gaussian [1], and are thus of a quite different flavor than the kinds of samples obtained in the matrix completion setting. A similar theory is available for the closely related problem of sparse binomial regression, which considers more classical statistical models [2, 11, 52, 34, 46, 48, 54, 65] and allows non-Gaussian measurements. Our aim here is to develop results for matrix completion of the same flavor as 1-bit compressed sensing and sparse logistic regression.

1.4 Challenges

In this paper, we extend the theory of matrix completion to the case of 1-bit observations. We consider a general observation model but focus mainly on two particular possibilities: the models of logistic and probit regression. We discuss these models in greater detail in Section 2.1, but first we note that several new challenges arise when trying to leverage results in 1-bit compressed sensing and sparse logistic regression to develop a theory for 1-bit matrix completion. First, matrix completion is in some sense a more challenging problem than compressed sensing. Specifically, some additional difficulty arises because the set of low-rank matrices is “coherent” with single entry measurements (see [30]). In particular, the sampling operator does not act as a near-isometry on all matrices of interest, and thus the natural analogue to the restricted isometry property from compressed sensing cannot hold in general—there will always be certain low-rank matrices that we cannot hope to recover without essentially sampling every entry of the matrix. For example, consider a matrix that consists of a single nonzero entry (which we might never observe). The typical way to deal with this possibility is to consider a reduced set of low-rank matrices by placing restrictions on the entry-wise maximum of the matrix or its singular vectors—informally, we require that the matrix is not too “spiky”.

We introduce an entirely new dimension of ill-posedness by restricting ourselves to 1-bit observations. To illustrate this, we describe one version of 1-bit matrix completion in more detail (the general problem definition is given in Section 2.1 below). Consider a matrix with rank . Suppose we observe a subset of entries of a matrix . The entries of depend on in the following way:

(1)    Y_{i,j} = +1  if  M_{i,j} + Z_{i,j} ≥ 0,
       Y_{i,j} = −1  if  M_{i,j} + Z_{i,j} < 0,        for (i,j) ∈ Ω,

where is a matrix containing noise. This latent variable model is the direct analogue to the usual 1-bit compressed sensing observation model. In this setting, we view the matrix as more than just a parameter of the distribution of ; represents the real underlying quantity of interest that we would like to estimate. Unfortunately, in what would seem to be the most benign setting—when is the set of all entries, , and has rank 1 and a bounded entry-wise maximum—the problem of recovering is ill-posed. To see this, let for any vectors , and for simplicity assume that there are no zero entries in or . Now let and be any vectors with the same sign pattern as and respectively. It is apparent that both and will yield the same observations , and thus and are indistinguishable. Note that while it is obvious that this 1-bit measurement process will destroy any information we have regarding the scaling of , this ill-posedness remains even if we knew something about the scaling a priori (such as the Frobenius norm of ). For any given set of observations, there will always be radically different possible matrices that are all consistent with observed measurements.

After considering this example, the problem might seem hopeless. However, an interesting surprise is that when we add noise to the problem (that is, when is an appropriate stochastic matrix) the picture completely changes—this noise has a “dithering” effect and the problem becomes well-posed. In fact, we will show that in this setting we can sometimes recover to the same degree of accuracy that is possible when given access to completely unquantized measurements! In particular, under appropriate conditions, measurements are sufficient to accurately recover .

1.5 Applications

The problem of 1-bit matrix completion arises in nearly every application that has been proposed for “unquantized” matrix completion. To name a few:

  • Recommender systems: As mentioned above, collaborative filtering systems often involve discretized recommendations [28]. In many cases, each observation will consist simply of a “thumbs up” or “thumbs down” thus delivering only 1 bit of information (consider for example rating music on Pandora, the relevance of advertisements on Hulu, or posts on sites such as MathOverflow). Such cases are a natural application for 1-bit matrix completion.

  • Analysis of survey data: Another potential application for matrix completion is to analyze incomplete survey data. Such data is almost always heavily quantized since people are generally not able to distinguish between more than categories [47]. 1-bit matrix completion provides a method for analyzing incomplete (or potentially even complete) survey designs containing simple yes/no or agree/disagree questions.

  • Distance matrix recovery and multidimensional scaling: Yet another common motivation for matrix completion is to localize nodes in a sensor network from the observation of just a few inter-node distances [7, 59, 60]. This is essentially a special case of multidimensional scaling (MDS) from incomplete data [8]. In general, work in the area assumes real-valued measurements. However, in the sensor network example (as well as many other MDS scenarios), the measurements may be very coarse and might only indicate whether the nodes are within or outside of some communication range. While there is some existing work on MDS using binary data [29] and MDS using incomplete observations with other kinds of non-metric data [61], 1-bit matrix completion promises to provide a principled and unifying approach to such problems.

  • Quantum state tomography: Low-rank matrix recovery from incomplete observations also has applications to quantum state tomography [31]. In this scenario, mixed quantum states are represented as Hermitian matrices with nuclear norm equal to 1. When the state is nearly pure, the matrix can be well approximated by a low-rank matrix and, in particular, fits the model given in Section 2.2 up to a rescaling. Furthermore, Pauli-operator-based measurements give probabilistic binary outputs. However, these are based on the inner products with the Pauli matrices, and thus of a slightly different flavor than the measurements considered in this paper. Nevertheless, while we do not address this scenario directly, our theory of 1-bit matrix completion could easily be adapted to quantum state tomography.

1.6 Notation

We now provide a brief summary of some of the key notation used in this paper. We use [d] to denote the set of integers {1, …, d}. We use capital boldface to denote a matrix (e.g., M) and standard text to denote its entries (e.g., M_{i,j}). Similarly, we let 0 denote the matrix of all-zeros and 1 the matrix of all-ones. We let ‖M‖ denote the operator norm of M, ‖M‖_F denote the Frobenius norm of M, ‖M‖_* denote the nuclear or Schatten-1 norm of M (the sum of the singular values), and ‖M‖_∞ = max_{i,j} |M_{i,j}| denote the entry-wise infinity-norm of M. We will use the Hellinger distance, which, for two scalars p, q ∈ [0, 1], is given by

d_H^2(p, q) := (√p − √q)^2 + (√(1 − p) − √(1 − q))^2.

This gives a standard notion of distance between two binary probability distributions. We also allow the Hellinger distance to act on matrices via the average Hellinger distance over their entries: for matrices P, Q ∈ [0, 1]^{d1 × d2}, we define

d_H^2(P, Q) := (1/(d1 d2)) Σ_{i,j} d_H^2(P_{i,j}, Q_{i,j}).

Finally, for an event E, 1_{[E]} is the indicator function for that event, i.e., 1_{[E]} is 1 if E occurs and 0 otherwise.
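To make the two Hellinger conventions above concrete, here is a minimal numerical sketch (in Python with NumPy, which the paper does not prescribe); the function names are ours, and the matrix version simply averages the scalar distance over the entries, as defined above.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q):
    (sqrt(p) - sqrt(q))^2 + (sqrt(1-p) - sqrt(1-q))^2."""
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

def hellinger_sq_matrix(P, Q):
    """Average squared Hellinger distance over the entries of two matrices
    of Bernoulli parameters (entries assumed to lie in [0, 1])."""
    return np.mean(hellinger_sq(P, Q))
```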

1.7 Organization of the paper

We proceed in Section 2 by describing the 1-bit matrix completion problem in greater detail. In Section 3 we state our main results. Specifically, we propose a pair of convex programs for the 1-bit matrix completion problem and establish upper bounds on the accuracy with which these can recover the matrix and the distribution of the observations . We also establish lower bounds, showing that our upper bounds are nearly optimal. In Section 4 we describe numerical implementations of our proposed convex programs and demonstrate their performance on a number of synthetic and real-world examples. Section 5 concludes with a brief discussion of future directions. The proofs of our main results are provided in the appendix.

2 The 1-bit matrix completion problem

2.1 Observation model

We now introduce the more general observation model that we study in this paper. Given a matrix M ∈ R^{d1 × d2}, a subset of indices Ω ⊆ [d1] × [d2], and a differentiable function f : R → [0, 1], we observe

(2)    Y_{i,j} = +1  with probability  f(M_{i,j}),
       Y_{i,j} = −1  with probability  1 − f(M_{i,j}),        for (i,j) ∈ Ω.

We will leave f general for now and discuss a few common choices just below. As has been important in previous work on matrix completion, we assume that Ω is chosen at random with E|Ω| = m. Specifically, we assume that Ω follows a binomial model in which each entry (i, j) is included in Ω with probability m/(d1 d2), independently.

Before discussing some particular choices for f, we first note that while the observation model described in (2) may appear on the surface to be somewhat different from the setup in (1), they are actually equivalent if f behaves like a cumulative distribution function. Specifically, for the model in (1), if Z has i.i.d. entries, then by setting f(x) := P(Z_{1,1} ≥ −x), the model in (1) reduces to that in (2). Similarly, for any choice of f in (2), if we define Z as having i.i.d. entries drawn from a distribution whose cumulative distribution function is given by F_Z(z) = 1 − f(−z), then (2) reduces to (1). Of course, in any given situation one of these observation models may seem more or less natural than the other—for example, (1) may seem more appropriate when M is viewed as a latent variable which we might be interested in estimating, while (2) may make more sense when M is viewed as just a parameter of a distribution. Ultimately, however, the two models are equivalent.

We now consider two natural choices for (or equivalently, for ):

Example 1 (Logistic regression/Logistic noise).

The logistic regression model, which is common in statistics, is captured by (2) with f(x) = e^x/(1 + e^x) and by (1) with Z_{i,j} i.i.d. according to the standard logistic distribution.

Example 2 (Probit regression/Gaussian noise).

The probit regression model is captured by (2) by setting f(x) = Φ(x/σ), where Φ is the cumulative distribution function of a standard Gaussian, and by (1) with Z_{i,j} i.i.d. according to a mean-zero Gaussian distribution with variance σ².
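For concreteness, the following sketch simulates the observation model (2) under the logistic and probit links of Examples 1 and 2. It is illustrative only: the dimensions, rank, infinity-norm bound, expected sample size, and all function names are our own placeholders rather than values used in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def logistic_f(x):
    # Logistic link of Example 1: f(x) = e^x / (1 + e^x).
    return 1.0 / (1.0 + np.exp(-x))

def probit_f(x, sigma=1.0):
    # Probit link of Example 2: f(x) = Phi(x / sigma).
    return norm.cdf(x / sigma)

# Illustrative low-rank parameter matrix (d1, d2, r, alpha are placeholders).
d1, d2, r, alpha = 200, 150, 3, 1.0
M = rng.uniform(-0.5, 0.5, (d1, r)) @ rng.uniform(-0.5, 0.5, (r, d2))
M *= alpha / np.abs(M).max()                 # enforce the infinity-norm bound

# Binomial sampling: each entry lands in Omega independently with prob. m/(d1*d2).
m = 30000
mask = rng.random((d1, d2)) < m / (d1 * d2)

# 1-bit observations: Y_ij = +1 with probability f(M_ij), -1 otherwise.
Y = np.where(rng.random((d1, d2)) < logistic_f(M), 1, -1)
Y_obs = Y * mask                             # zeros mark unobserved entries
```

Conditioned on M, each observed sign is an independent Bernoulli draw with success probability f(M_{i,j}); this is exactly the dithering effect discussed in Section 1.4.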

2.2 Approximately low-rank matrices

The majority of the literature on matrix completion assumes that the first singular values of are nonzero and the remainder are exactly zero. However, in many applications the singular values instead exhibit only a gradual decay towards zero. Thus, in this paper we allow a relaxation of the assumption that has rank exactly . Instead, we assume that , where is a parameter left to be determined, but which will often be of constant order. In other words, the singular values of belong to a scaled ball. In compressed sensing, belonging to an ball with is a common relaxation of exact sparsity; in matrix completion, the nuclear-norm ball (or Schatten-1 ball) plays an analogous role. See [18] for further compelling reasons to use the nuclear-norm ball relaxation.

The particular choice of scaling, α √(r d1 d2), arises from the following considerations. Suppose that each entry of M is bounded in magnitude by α and that rank(M) ≤ r. Then

‖M‖_* ≤ √(rank(M)) ‖M‖_F ≤ √r · α √(d1 d2) = α √(r d1 d2).

Thus, the assumption that ‖M‖_* ≤ α √(r d1 d2) is a relaxation of the conditions that ‖M‖_∞ ≤ α and rank(M) ≤ r. The condition that ‖M‖_∞ ≤ α essentially means that the probability of seeing a +1 or a −1 does not depend on the dimension. It is also a way of enforcing that M should not be too “spiky”; as discussed above this is an important assumption in order to make the recovery of M well-posed (e.g., see [49]).

3 Main results

We now state our main results. We will have two goals—the first is to accurately recover M itself, and the second is to accurately recover the distribution of Y given by f(M).[1] All proofs are contained in the appendix.

[1] Strictly speaking, f(M) is simply a matrix of scalars, but these scalars implicitly define the distribution of Y, so we will sometimes abuse notation slightly and refer to f(M) as the distribution of Y.

3.1 Convex programming

In order to approximate either M or f(M), we will maximize the log-likelihood function of the optimization variable X given our observations, subject to a set of convex constraints. In our case, the log-likelihood function is given by

L_{Ω,Y}(X) := Σ_{(i,j) ∈ Ω} ( 1_{[Y_{i,j} = 1]} log f(X_{i,j}) + 1_{[Y_{i,j} = −1]} log(1 − f(X_{i,j})) ).

To recover M, we will use M̂, the solution to the following program:

(3)    M̂ = arg max_X  L_{Ω,Y}(X)    subject to    ‖X‖_* ≤ α √(r d1 d2)   and   ‖X‖_∞ ≤ α.

To recover the distribution f(M), we need not enforce the infinity-norm constraint, and will use the solution to the following simpler program:

(4)    M̂ = arg max_X  L_{Ω,Y}(X)    subject to    ‖X‖_* ≤ α √(r d1 d2).

In many cases, L_{Ω,Y} is a concave function of X, and thus the above programs are convex. This can be easily checked in the case of the logistic model and can also be verified in the case of the probit model (e.g., see [68]).
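Since the programs above maximize the log-likelihood, in practice one typically minimizes its negative. The following sketch writes out that objective and its gradient for the logistic link; it is a hedged illustration (the function names and masking convention are ours), not the authors' implementation.

```python
import numpy as np

def neg_log_likelihood(X, Y, mask):
    """Negative log-likelihood of the observed signs under the logistic link.
    For this link, log f(x) = -log(1 + e^{-x}) and log(1 - f(x)) = -log(1 + e^{x}),
    so each observed entry contributes log(1 + exp(-Y_ij * X_ij))."""
    z = (Y * X)[mask]
    return np.sum(np.logaddexp(0.0, -z))

def neg_log_likelihood_grad(X, Y, mask):
    """Gradient of the negative log-likelihood; zero on unobserved entries."""
    g = -Y / (1.0 + np.exp(Y * X))    # derivative of log(1 + exp(-Y_ij * X_ij))
    g[~mask] = 0.0
    return g
```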

3.2 Recovery of the matrix

We now state our main result concerning the recovery of the matrix M. As discussed in Section 1.4 we place a “non-spikiness” condition on M to make recovery possible; we enforce this with an infinity-norm constraint. Further, some assumptions must be made on f for recovery of M to be feasible. We define two quantities L_α and β_α which control the “steepness” and “flatness” of f, respectively:

(5)    L_α := sup_{|x| ≤ α}  |f′(x)| / ( f(x)(1 − f(x)) ),        β_α := sup_{|x| ≤ α}  f(x)(1 − f(x)) / ( f′(x) )².

In this paper we will restrict our attention to f such that L_α and β_α are well-defined. In particular, we assume that f and f′ are non-zero on [−α, α]. This assumption is fairly mild—for example, it includes the logistic and probit models (as we will see below in Remark 1). The quantity L_α appears only in our upper bounds, but it is generally well behaved. The quantity β_α appears both in our upper and lower bounds. Intuitively, it controls the “flatness” of f in the interval [−α, α]—the flatter f is, the larger β_α is. It is clear that some dependence on β_α is necessary. Indeed, if f is perfectly flat, then the magnitudes of the entries of M cannot be recovered, as seen in the noiseless case discussed in Section 1.4. Of course, when α is a fixed constant and f is a fixed function, both L_α and β_α are bounded by fixed constants independent of the dimension.

Theorem 1.

Assume that and . Suppose that is chosen at random following the binomial model of Section 2.1 with . Suppose that is generated as in (2). Let and be as in (5). Consider the solution to (3). Then with probability at least ,

with . If then this simplifies to

(6)

Above, and are absolute constants.

Note that the theorem also holds if , i.e., if we sample each entry exactly once or observe a complete realization of . Even in this context, the ability to accurately recover is somewhat surprising.

Remark 1 (Recovery in the logistic and probit models).

The logistic model satisfies the hypotheses of Theorem 1 with and . The probit model has

where we can take and . In particular, in the probit model the bound in (6) reduces to

(7)

Hence, when , increasing the size of the noise leads to significantly improved error bounds—this is not an artifact of the proof. We will see in Section 3.4 that the exponential dependence on in the logistic model (and on in the probit model) is intrinsic to the problem. Intuitively, we should expect this since for such models, as grows large, we can essentially revert to the noiseless setting where estimation of is impossible. Furthermore, in Section 3.4 we will also see that when (or ) is bounded by a constant, the error bound (6) is optimal up to a constant factor. Fortunately, in many applications, one would expect to be small, and in particular to have little, if any, dependence on the dimension. This ensures that each measurement will always have a non-vanishing probability of returning as well as a non-vanishing probability of returning .
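As a sanity check on the constants discussed in this remark, one can approximate L_α and β_α from (5) numerically on a grid. The sketch below does this for the logistic link and for the probit link with σ = 1; the grid-based supremum and the helper names are our own simplifications.

```python
import numpy as np
from scipy.stats import norm

def steepness_flatness(f, fprime, alpha, n=10001):
    """Grid approximation of
        L_alpha    = sup_{|x| <= alpha} |f'(x)| / (f(x) (1 - f(x))),
        beta_alpha = sup_{|x| <= alpha} f(x) (1 - f(x)) / f'(x)^2
    (a rough sanity check, not a proof)."""
    x = np.linspace(-alpha, alpha, n)
    fx, dfx = f(x), fprime(x)
    L = np.max(np.abs(dfx) / (fx * (1.0 - fx)))
    beta = np.max(fx * (1.0 - fx) / dfx ** 2)
    return L, beta

# Logistic link: f' = f (1 - f), so L_alpha = 1 and beta_alpha grows like e^alpha.
f_log = lambda x: 1.0 / (1.0 + np.exp(-x))
df_log = lambda x: f_log(x) * (1.0 - f_log(x))

# Probit link with sigma = 1: f = Phi, f' = phi.
f_pro, df_pro = norm.cdf, norm.pdf

for a in (1.0, 3.0):
    print("logistic alpha=%.0f:" % a, steepness_flatness(f_log, df_log, a))
    print("probit   alpha=%.0f:" % a, steepness_flatness(f_pro, df_pro, a))
```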

Remark 2 (Nuclear norm constraint).

The assumption made in Theorem 1 (as well as Theorem 2 below) that does not mean that we are requiring the matrix to be low rank. We express this constraint in terms of to aid the intuition of researchers well-versed in the existing literature on low-rank matrix recovery. (If is exactly rank and satisfies , then as discussed in Section 2.2, will automatically satisfy this constraint.) If one desires, one may simplify the presentation by replacing with a parameter and simply requiring , in which case (6) reduces to

for a numerical constant .

3.3 Recovery of the distribution

In many situations, we might not be interested in the underlying matrix , but rather in determining the distribution of the entries of . For example, in recommender systems, a natural question would be to determine the likelihood that a user would enjoy a particular unrated item.

Surprisingly, this distribution may be accurately recovered without any restriction on the infinity-norm of . This may be unexpected to those familiar with the matrix completion literature in which “non-spikiness” constraints seem to be unavoidable. In fact, we will show in Section 3.4 that the bound in Theorem 2 is near-optimal; further, we will show that even under the added constraint that , it would be impossible to estimate significantly more accurately.

Theorem 2.

Assume that . Suppose that is chosen at random following the binomial model of Section 2.1 with . Suppose that is generated as in (2), and let . Let be the solution to (4). Then with probability at least ,

(8)

Furthermore, as long as , we have

(9)

Above, and are absolute constants.

While for the logistic model, the astute reader will have noticed that for the probit model is unbounded—that is, tends to as . would also be unbounded for the case where takes values of or outside of some range (as would be the case in (1) if the distribution of the noise had compact support). Fortunately, however, we can recover a result for these cases by enforcing an infinity-norm constraint, as described in Theorem 6 below. Moreover, for a large class of functions, , is indeed bounded. For example, in the latent variable version of (1) if the entries are at least as fat-tailed as an exponential random variable, then is bounded. To be more precise, suppose that is continuously differentiable and for simplicity assume that the distribution of is symmetric and is monotonic for sufficiently large. If for all , then one can show that is finite. This property is also essentially equivalent to the requirement that a distribution have bounded hazard rate. As noted above, this property holds for the logistic distribution, but also for many other common distributions, including the Laplacian, Student’s t, Cauchy, and others.

3.4 Room for improvement?

We now discuss the extent to which Theorems 1 and 2 are optimal. We give three theorems, all proved using information theoretic methods, which show that these results are nearly tight, even when some of our assumptions are relaxed. Theorem 3 gives a lower bound to nearly match the upper bound on the error in recovering derived in Theorem 1. Theorem 4 compares our upper bounds to those available without discretization and shows that very little is lost when discretizing to a single bit. Finally, Theorem 5 gives a lower bound matching, up to a constant factor, the upper bound on the error in recovering the distribution given in Theorem 2. Theorem 5 also shows that Theorem 2 does not suffer by dropping the canonical “spikiness” constraint.

Our lower bounds require a few assumptions, so before we delve into the bounds themselves, we briefly argue that these assumptions are rather innocuous. First, without loss of generality (since we can always adjust to account for rescaling ), we assume that . Next, we require that the parameters be sufficiently large so that

(10)

for an absolute constant . Note that we could replace this with a simpler, but still mild, condition that . Finally, we also require that where is either 1 or 4 and that , where hides parameters (which may differ in each Theorem) that we make explicit below. This last assumption simply means that we are in the situation where is significantly smaller than and , i.e., the matrix is of approximately low rank.

In the following, let

(11)

denote the set of matrices whose recovery is guaranteed by Theorem 1.

3.4.1 Recovery from 1-bit measurements

Theorem 3.

Fix and to be such that and (10) holds. Let be defined as in (5), and suppose that is decreasing for . Let be any subset of with cardinality , and let be as in (2). Consider any algorithm which, for any , takes as input for and returns . Then there exists such that with probability at least ,

(12)

as long as the right-hand side of (12) exceeds . Above, and are absolute constants.[2]

[2] Here and in the theorems below, the choice of in the probability bound is arbitrary, and can be adjusted at the cost of changing in (10) and and . Similarly, can be replaced by for any .

The requirement that the right-hand side of (12) be larger than is satisfied as long as . In particular, it is satisfied whenever

for a fixed constant . Note also that in the latent variable model in (1), is simply the probability density of . Thus, the requirement that be decreasing is simply asking the probability density to have decreasing tails. One can easily check that this is satisfied for the logistic and probit models.

Note that if is bounded by a constant and is fixed (in which case and are bounded by a constant), then the lower bound of Theorem 3 matches the upper bound given in (6) up to a constant. When is not treated as a constant, the bounds differ by a factor of . In the logistic model and so this amounts to the difference between and . The probit model has a similar change in the constant of the exponent.

3.4.2 Recovery from unquantized measurements

Next we show that, surprisingly, very little is lost by discretizing to a single bit. In Theorem 4, we consider an “unquantized” version of the latent variable model in (1) with Gaussian noise. That is, let be a matrix of i.i.d. Gaussian random variables, and suppose the noisy entries are observed directly, without discretization. In this setting, we give a lower bound that still nearly matches the upper bound given in Theorem 1, up to the term.

Theorem 4.

Fix and to be such that and (10) holds. Let be any subset of with cardinality , and let be a matrix with i.i.d. Gaussian entries with variance . Consider any algorithm which, for any , takes as input for and returns . Then there exists such that with probability at least ,

(13)

as long as the right-hand side of (13) exceeds . Above, and are absolute constants.

The requirement that the right-hand side of (13) be larger than is satisfied whenever

for a fixed constant .

Following Remark 1, the lower bound given in (13) matches the upper bound proven in Theorem 1 for the solution to (4) up to a constant, as long as is bounded by a constant. In other words:

When the signal-to-noise ratio is constant, almost nothing is lost by quantizing to a single bit.

Perhaps it is not particularly surprising that 1-bit quantization induces little loss of information in the regime where the noise is comparable to the underlying quantity we wish to estimate—however, what is somewhat of a surprise is that the simple convex program in (4) can successfully recover all of the information contained in these 1-bit measurements.

Before proceeding, we also briefly note that our Theorem 4 is somewhat similar to Theorem 3 in [49]. The authors in [49] consider slightly different sets : these sets are more restrictive in the sense that it is required that and less restrictive because the nuclear-norm constraint may be replaced by a general Schatten-p norm constraint. It was important for us to allow in order to compare with our upper bounds due to the exponential dependence of on in Theorem 1 for the probit model. This led to some new challenges in the proof. Finally, it is also noteworthy that our statements hold for arbitrary sets , while the argument in [49] is only valid for a random choice of .

3.4.3 Recovery of the distribution from 1-bit measurements

To conclude we address the optimality of Theorem 2. We show that under mild conditions on , any algorithm that recovers the distribution must yield an estimate whose Hellinger distance deviates from the true distribution by an amount proportional to , matching the upper bound of (9) up to a constant. Notice that the lower bound holds even if the algorithm is promised that , which the upper bound did not require.

Theorem 5.

Fix and to be such that and (10) holds. Let be defined as in (5), and suppose that and for , for some constants . Let be any subset of with cardinality , and let be as in (2). Consider any algorithm which, for any , takes as input for and returns . Then there exists such that with probability at least ,

(14)

as long as the right-hand side of (14) exceeds . Above, and are constants that depend on .

The requirement that the right-hand side of (14) be larger than is satisfied whenever

for a constant that depends only on . Note also that the condition that and be well-behaved in the interval is satisfied for the logistic model with and . Similarly, we may take and in the probit model.

4 Simulations

4.1 Implementation

Before presenting the proofs of our main results, we provide algorithms and a suite of numerical experiments to demonstrate their usefulness in practice.[3] We present algorithms to solve the convex programs (3) and (4), and using these we can recover M (or f(M)) via 1-bit matrix completion.

[3] The code for these algorithms, as well as for the subsequent experiments, is available online at http://users.ece.gatech.edu/mdavenport/.

4.1.1 Optimization algorithm

We begin with the observation that both (3) and (4) are instances of the more general formulation

(15)    minimize_x  F(x)    subject to    x ∈ C,

where F is a smooth convex function from R^{d1 d2} to R, and C is a closed convex set in R^{d1 d2}. In particular, defining the bijective linear mapping that vectorizes a d1 × d2 matrix X to a vector x ∈ R^{d1 d2}, we take F(x) = −L_{Ω,Y}(X) and, depending on whether we want to solve (3) or (4), take C equal to either the intersection of the (vectorized) nuclear-norm and infinity-norm balls or the nuclear-norm ball alone, where

(16)

One possible solver for problems of the form (15) is the nonmonotone spectral projected-gradient (SPG) algorithm proposed by Birgin et al. [6]. Another possibility is to use an accelerated proximal-gradient method for the minimization of composite functions [4, 50]; such methods are useful for solving optimization problems of the form

minimize_x  h(x) + g(x),

where h and g are convex functions, with g possibly non-smooth. This formulation reduces to (15) when choosing g to be the extended-real indicator function corresponding to C:

g(x) = i_C(x) := 0 if x ∈ C, and +∞ otherwise.

Both algorithms are iterative and require at each iteration the evaluation of F(x), its gradient ∇F(x), and an orthogonal projection onto C (i.e., the prox-function of g)

(17)    P_C(z) := arg min_{x ∈ C}  ‖x − z‖_2.

For our experiments we use the SPG algorithm, which we describe in more detail below. The implementation of the algorithm is based on the SPGL1 code [66, 67].

4.1.2 Spectral projected-gradient method

In basic gradient-descent algorithms for unconstrained minimization of a convex function F, iterates are of the form x_{k+1} = x_k − λ_k ∇F(x_k), where the step length λ_k is chosen such that sufficient descent in the objective function is achieved. When the constraint x ∈ C is added, the basic scheme can no longer guarantee feasibility of the iterates. Projected gradient methods resolve this problem by including an orthogonal projection back onto the feasible set (17) at each iteration.

The nonmonotone SPG algorithm described in [6] modifies the basic projected gradient method in two major ways. First, it scales the initial search direction using the spectral step-length as proposed by Barzilai and Borwein [3]. Second, it relaxes monotonicity of the objective values by requiring sufficient descent relative to the maximum objective over the last iterates (or when ). Two types of line search are considered. The first type is curvilinear and traces the following path:

(18)

The second type first determines a projected gradient step, and uses this to obtain the search direction :

Next, a line search is done along the linear trajectory

(19)

In either case, once the step length is chosen, we set , and proceed with the next iteration.

In the 1-bit matrix completion formulation proposed in this paper, the projection onto forms the main computational bottleneck. As a result, it is crucial to keep the number of projections to a minimum, and our implementation therefore relies primarily on the line search along the linear trajectory given by (19). The more expensive curvilinear line search is used only when the linear one fails. We have observed that this situation tends to arise only when is near optimal.

The optimality condition for (15) is that

(20)    x = P_C( x − ∇F(x) ).

Our implementation checks if (20) is approximately satisfied. In addition it imposes bounds on the total number of iterations and the run time.
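A compact sketch of the nonmonotone SPG iteration described above is given below, assuming the objective, gradient, and projection are supplied as callables. It keeps only the linear line search of (19) with a simple backtracking rule; the step-size safeguards, history length, and tolerances are our own choices rather than those of the SPGL1-based implementation.

```python
import numpy as np

def spg(F, gradF, proj, x0, max_iter=300, hist=10, tol=1e-6):
    """Nonmonotone spectral projected-gradient sketch (after Birgin et al. [6]).
    F, gradF: objective and its gradient; proj: orthogonal projection onto C."""
    x = proj(x0)
    g = gradF(x)
    f_hist = [F(x)]
    step = 1.0
    for _ in range(max_iter):
        # Projected-gradient search direction, used in the linear line search (19).
        d = proj(x - step * g) - x
        if np.linalg.norm(d) <= tol * max(1.0, np.linalg.norm(x)):
            break
        f_ref = max(f_hist[-hist:])              # nonmonotone reference value
        gd = np.vdot(g, d).real
        lam = 1.0
        while True:                              # backtracking sufficient-descent test
            x_new = x + lam * d
            f_new = F(x_new)
            if f_new <= f_ref + 1e-4 * lam * gd or lam < 1e-10:
                break
            lam *= 0.5
        g_new = gradF(x_new)
        # Barzilai-Borwein (spectral) scaling for the next trial step.
        s, y = x_new - x, g_new - g
        sy = np.vdot(s, y).real
        step = (np.vdot(s, s).real / sy) if sy > 1e-12 else 1.0
        x, g = x_new, g_new
        f_hist.append(f_new)
    return x
```

Here proj would be one of the projections described in Section 4.1.3, applied to the matrix form of the iterate.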

4.1.3 Orthogonal projections onto the feasible sets

It is well known that the orthogonal projection onto the nuclear-norm ball of radius τ amounts to singular-value soft thresholding [12]. In particular, let X = U Σ Vᵀ with Σ = diag(σ_1, …, σ_n); then

P_C(X) = U · max(Σ − γI, 0) · Vᵀ,

where the maximum is taken entrywise, and γ ≥ 0 is the smallest value for which Σ_i max(σ_i − γ, 0) ≤ τ.
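This projection can be implemented directly: soft thresholding the singular values is equivalent to projecting the vector of singular values onto an ℓ1 ball of radius τ. The sketch below (function names ours) uses the standard sorting-based routine to find the threshold γ.

```python
import numpy as np

def project_l1_ball(v, tau):
    """Project a nonnegative vector v onto {u >= 0 : sum(u) <= tau} (tau > 0)."""
    if v.sum() <= tau:
        return v
    # Smallest soft threshold gamma with sum(max(v - gamma, 0)) = tau.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u - (css - tau) / np.arange(1, len(u) + 1) > 0)[0][-1]
    gamma = (css[k] - tau) / (k + 1)
    return np.maximum(v - gamma, 0.0)

def project_nuclear_ball(X, tau):
    """Orthogonal projection onto {X : ||X||_* <= tau} via singular-value
    soft thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(project_l1_ball(s, tau)) @ Vt
```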

Unfortunately, no closed-form solution is known for the orthogonal projection onto the set defined by both the nuclear-norm and infinity-norm constraints. However, the underlying projection problem

(21)

can be solved using iterative methods. In particular, we can rewrite (21) as

(22)

and apply the alternating-direction method of multipliers (ADMM) [24, 27]. The augmented Lagrangian for (22) with respect to the constraint is given by

The ADMM iterates the following steps to solve (22):

Step 0.

Initialize , and select , , , such that and .

Step 1.

Minimize with respect to , which can be rewritten as

This is exactly the orthogonal projection of onto , and gives .

Step 2.

Minimize with respect to . This gives

and simplifies to .

Step 3.

Update , set , and increment .

Step 4.

Return when and for some sufficiently small . Otherwise, repeat steps 1–4.
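The following sketch implements Steps 0–4 as a scaled-form ADMM, reusing project_nuclear_ball from the previous sketch. The particular splitting, penalty parameter, and stopping tolerances are our own simplifications of the scheme above: one block handles the nuclear-norm ball by orthogonal projection (Step 1), and the other handles the quadratic term plus the infinity-norm box, which reduces to clipping a weighted average (Step 2).

```python
import numpy as np
# Reuses project_nuclear_ball from the sketch in Section 4.1.3.

def project_intersection(Z, tau, alpha, rho=1.0, max_iter=200, eps=1e-6):
    """ADMM sketch for  min_W 0.5 * ||W - Z||_F^2
                        s.t. ||X||_* <= tau, ||W||_inf <= alpha, X = W."""
    W = np.clip(Z, -alpha, alpha)
    U = np.zeros_like(Z)                       # scaled dual variable
    for _ in range(max_iter):
        # Step 1: orthogonal projection onto the nuclear-norm ball.
        X = project_nuclear_ball(W - U, tau)
        # Step 2: quadratic term plus box constraint; reduces to clipping
        # a weighted average of Z and the current iterate.
        W_new = np.clip((Z + rho * (X + U)) / (1.0 + rho), -alpha, alpha)
        # Step 3: dual update.
        U += X - W_new
        # Step 4: stop when primal and dual residuals are small.
        if (np.linalg.norm(X - W_new) <= eps * max(1.0, np.linalg.norm(W_new))
                and np.linalg.norm(W_new - W) <= eps):
            W = W_new
            break
        W = W_new
    return W                                   # X and W coincide at convergence
```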

4.2 Synthetic experiments

To evaluate the performance of this algorithm in practice and to confirm the theoretical results described above, we first performed a number of synthetic experiments. In particular, we constructed a random matrix with rank by forming where and are matrices with entries drawn i.i.d. from a uniform distribution on . The matrix is then scaled so that . We then obtained 1-bit observations by adding Gaussian noise of variance and recording the sign of the resulting value.

We begin by comparing the performance of the algorithms in (3) and (4) over a range of different values of . In this experiment we set , , and , and we measured performance of each approach using the squared Frobenius norm of the error (normalized by the norm of the original matrix ) and averaged the results over 15 draws of .[4] The results are shown in Figure 1. We observe that for both approaches, the performance is poor when there is too little noise (when is small) and when there is too much noise (when is large). These two regimes correspond to the cases where the noise is either so small that the observations are essentially noise-free or so large that each observation is essentially a coin toss. In the regime where the noise is of moderate power, we observe better performance for both approaches. Perhaps somewhat surprisingly, we find that for much of this range, the approach in (4) appears to perform almost as well as (3), even though we do not have any theoretical guarantees for (4). This suggests that adding the infinity-norm constraint as in (3) may have only limited practical benefit, despite the key role this constraint played in our analysis. By using the simpler program in (4) one can greatly simplify the projection step in the algorithm, so in practice this approach may be preferable.

[4] In evaluating these algorithms we found that it was beneficial in practice to follow the recovery with a “debiasing” step in which the recovered matrix is forced to be rank r by computing its SVD and hard thresholding the singular values. In cases where we report the Frobenius norm of the error, we performed this debiasing step, although it does not dramatically impact the performance.

Figure 1: Average relative error in the results obtained using algorithms (3) and (4) over a range of different values of . We observe that for both approaches, the performance is poor when there is too little noise (when is small) and when there is too much noise (when is large). In the case of moderate noise, we observe that (4) appears to perform almost as well as (3), even though we do not have any theoretical guarantees for (4).
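The “debiasing” step mentioned in footnote [4] is simply a hard thresholding of the singular values, i.e., the best rank-r approximation of the recovered matrix. A minimal sketch, assuming the target rank r is known:

```python
import numpy as np

def debias_rank(M_hat, r):
    """Force the recovered matrix to be rank r by hard thresholding its
    singular values (best rank-r approximation in Frobenius norm)."""
    U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
    s[r:] = 0.0
    return U @ np.diag(s) @ Vt
```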

We also conducted experiments evaluating the performance of both (3) and (4) as a function of for a particular choice of . The results showing the impact of on (4) are shown in Figure 2 (the results for (3) at this noise level are almost indistinguishable). In this experiment we set , and chose such that , which lies in the regime where the noise is neither negligible nor overwhelming. We considered matrices with rank and evaluated the performance over a range of . Figure 2(a) shows the performance in terms of the relative Frobenius norm of the error, and Figure 2(b) shows the performance in terms of the Hellinger distance between the recovered distributions. Consistent with our theoretical results, we observe a decay in the error (under both performance metrics) that appears to behave roughly on the order of .

Figure 2: The impact of on (4). (a) shows the performance in terms of the relative Frobenius norm of the error. (b) shows the performance in terms of the Hellinger distance between the recovered distributions. In both cases, we observe a decay in the error (under both performance metrics) that appears to behave roughly on the order of .

4.3 Collaborative filtering

To evaluate the performance of our algorithm in a practical setting, we consider the MovieLens (100k) data set, which is available for download at http://www.grouplens.org/node/73. This data set consists of 100,000 movie ratings from 1000 users on 1700 movies, with each rating occurring on a scale from 1 to 5. For testing purposes, we converted these ratings to binary observations by comparing each rating to the average rating for the entire dataset (which is approximately 3.5). We then apply the algorithm in (4) (using the logistic model of ) on a subset of 95,000 ratings to recover an estimate of . However, since our only source of data is the quantized ratings, there is no “ground truth” against which to measure the accuracy of our recovery. Thus, we instead evaluate our performance by checking to see if the estimate of accurately predicts the sign of the remaining 5000 unobserved ratings in our dataset. The result of this simulation is shown in the first line of Table 1, which gives the accuracy in predicting whether the unobserved ratings are above or below the average rating of 3.5.
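The binarization used here is straightforward: compare each observed rating to the global mean and record a sign. The sketch below illustrates this preprocessing; the array names, the toy index data, and the matrix dimensions are placeholders (real code would parse the MovieLens rating file).

```python
import numpy as np

def binarize_ratings(users, movies, ratings, shape):
    """Map 1-5 star ratings to +/-1 by comparing to the overall mean rating
    (about 3.5 on MovieLens 100k) and build the observation mask."""
    Y = np.zeros(shape)
    mask = np.zeros(shape, dtype=bool)
    Y[users, movies] = np.where(ratings > ratings.mean(), 1, -1)
    mask[users, movies] = True
    return Y, mask

# Toy usage with synthetic index arrays; real code would load the actual triples.
rng = np.random.default_rng(0)
users = rng.integers(0, 1000, 5000)
movies = rng.integers(0, 1700, 5000)
ratings = rng.integers(1, 6, 5000).astype(float)
Y, mask = binarize_ratings(users, movies, ratings, (1000, 1700))
```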

By comparison, the second line in the table shows the results obtained using a “standard” method that uses the raw ratings (on a scale from 1 to 5) and tries to minimize the nuclear norm of the recovered matrix subject to a constraint that requires the Frobenius norm of the difference between the recovered and observed entries to be sufficiently small. We implement this traditional approach using the TFOCS software package [5], and evaluate the results using the same error criterion as the 1-bit matrix completion approach—namely, we compare the recovered ratings to the average (recovered) rating. This approach depends on a number of input parameters: in (3), the constraint on the Frobenius norm in the traditional case, as well as the internal parameter in TFOCS. We determine the parameter values by simply performing a grid search and selecting those values that lead to the best performance.

Perhaps somewhat surprisingly, the 1-bit approach performs significantly better than the traditional one, even though the traditional approach is given more information in the form of the raw ratings instead of the binary observations.[5] The intuition as to how this might be possible is that the standard approach likely pays a significant penalty for requiring that the recovered matrix yield numerical ratings close to “1” or “5” when a user’s true preference could extend beyond this scale.

[5] While not reported in the table, we also observed that the 1-bit approach is relatively insensitive to the choice of , so this improvement in performance does not rely on a careful parameter setting.

Original rating                  1      2      3      4      5    Overall
1-bit matrix completion         79%    73%    58%    75%    89%     73%
“Standard” matrix completion    64%    56%    44%    65%    75%     60%
Table 1: Results of a comparison between a 1-bit matrix completion approach and “standard” matrix completion on the MovieLens 100k dataset, as evaluated by predicting whether the unobserved ratings were above or below average. Overall, the 1-bit approach achieves an accuracy of 73%, which is significantly better than the 60% accuracy achieved by standard methods. The 1-bit approach also outperforms the standard approach for each possible individual rating, although we see that both methods perform poorly when the true rating is very close to the average rating.

5 Discussion

Many of the applications of matrix completion consider discrete observations, often in the form of binary measurements. However, matrix completion from noiseless binary measurements is extremely ill-posed, even if one collects a binary measurement for each of the matrix entries. Fortunately, when there are some stochastic variations (noise) in the observations, recovery becomes well-posed. In this paper we have shown that the unknown matrix can be accurately and efficiently recovered from binary measurements in this setting. When the infinity norm of the unknown matrix is bounded by a constant, our error bounds are tight to within a constant and even match what is possible for undiscretized data. We have also shown that the binary probability distribution can be reconstructed over the entire matrix without any assumption on the infinity-norm, and we have provided a matching lower bound (up to a constant).

Our theory considers approximately low-rank matrices—in particular, we assume that the singular values belong to a scaled Schatten-1 ball. It would be interesting to see whether more accurate reconstruction could be achieved under the assumption that the unknown matrix has precisely r nonzero singular values. We conjecture that the Lagrangian formulation of the problem could be fruitful for this endeavor. It would also be interesting to study whether our ideas can be extended to deal with measurements that are quantized to more than two (but still a small number of) different values, but we leave such investigations for future work.

Acknowledgements

We would like to thank Roman Vershynin for helpful discussions and Wenxin Zhou for pointing out an error in an earlier version of the proof of Theorem 6.

Appendix A Proofs of the main results

We now provide the proofs of the main theorems presented in Section 3. To begin, we first define some additional notation that we will need for the proofs. For two probability distributions P and Q on a finite set X, D(P‖Q) will denote the Kullback-Leibler (KL) divergence,

D(P‖Q) := Σ_{x ∈ X} P(x) log( P(x) / Q(x) ),

where P(x) denotes the probability of the outcome x under the distribution P. We will abuse this notation slightly by overloading it in two ways. First, for scalar inputs p, q ∈ [0, 1], we will set

D(p‖q) := p log(p/q) + (1 − p) log( (1 − p)/(1 − q) ),

the KL divergence between two Bernoulli distributions with parameters p and q. Second, for two matrices P, Q ∈ [0, 1]^{d1 × d2}, we define

D(P‖Q) := (1/(d1 d2)) Σ_{i,j} D(P_{i,j} ‖ Q_{i,j}).
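For reference, the two KL overloads above can be computed directly; the sketch below mirrors the averaged-Hellinger convention for the matrix case (helper names are ours).

```python
import numpy as np

def kl_bernoulli(p, q):
    """Scalar overload: D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_matrix(P, Q):
    """Matrix overload: entrywise KL divergence averaged over all d1*d2 entries,
    mirroring the averaged Hellinger distance of Section 1.6."""
    return np.mean(kl_bernoulli(P, Q))
```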

We first prove Theorem 2. Theorem 1 will then follow from an approximation argument. Finally, our lower bounds will be proved in Section A.3 using information theoretic arguments.

a.1 Proof of Theorem 2

We will actually prove a slightly more general statement, which will be helpful in the proof of Theorem 1. We will assume that , and we will modify the program (4) to enforce . That is, we will consider the program

(23)

We will then send to recover the statement of Theorem 2. Formally, we prove the following theorem.

Theorem 6.

Assume that and . Suppose that is chosen at random following the binomial model of Section 2.1 and satisfying . Suppose that is generated as in (2), and let be as in (5). Let be the solution to (23). Then with probability at least ,

(24)

Above, and are absolute constants.

For the proof of Theorem 6, it will be convenient to work with the function

rather than with itself. The key will be to establish the following concentration inequality.

Lemma 1.

Let be

for some and . Then

(25)

where and are absolute constants and the probability and the expectation are both over the choice of and the draw of .

We will prove this lemma below, but first we show how it implies Theorem 6. To begin, notice that for any choice of ,

where the expectation is over both and . Next, note that by assumption . Then, we have for any

Moreover, from the definition of we also have that and . Thus

Applying Lemma 1, we obtain that with probability at least