Interpretable Matrix Completion: A Discrete Optimization Approach
Dimitris Bertsimas \AFFSloan School of Management and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, \EMAILdbertsim@mit.edu \AUTHORMichael Lingzhi Li \AFFOperations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, \EMAILmlli@mit.edu
We consider the problem of matrix completion on a partially observed matrix. We introduce the problem of Interpretable Matrix Completion, which aims to provide meaningful insights about the low-rank matrix using side information. We show that the problem can be reformulated as a binary convex optimization problem. We design OptComplete, based on a novel concept of stochastic cutting planes, to enable efficient scaling of the algorithm to large matrices. We report experiments on both synthetic and real-world datasets showing that OptComplete has favorable scaling behavior and accuracy when compared with state-of-the-art methods for other types of matrix completion, while providing insight on the factors that affect the matrix.
Matrix Completion, Mixed-Integer Optimization, Stochastic Approximation
1 Introduction
Low-rank matrix completion has attracted much attention after its successful application in the Netflix Competition. It is now widely utilized in far-reaching areas such as computer vision (Candes and Plan (2010)), signal processing (Ji et al. (2010)), and control theory (Boyd et al. (1994)) to generate a completed matrix from partially observed entries.
The classical low-rank matrix completion problem considers the following problem: given a matrix $A \in \mathbb{R}^{n \times m}$ with entries only partially known (denote by $\Omega$ the set of known entries), we aim to recover a matrix $X$ of rank $k$ that minimizes a certain distance metric between $X$ and $A$ on the known entries of $A$:
\[
\min_{X :\, \operatorname{rank}(X) \le k} \; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} (X_{ij} - A_{ij})^2,
\]
where we normalize the objective by the number of known entries $|\Omega|$. The rank constraint on $X$ can be equivalently formulated as the existence of two matrices $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times m}$ such that $X = UV$. Therefore, the problem can be restated as:
\[
\min_{U \in \mathbb{R}^{n \times k},\, V \in \mathbb{R}^{k \times m}} \; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \bigl((UV)_{ij} - A_{ij}\bigr)^2. \tag{1}
\]
In many applications of matrix completion, it is customary for each row of the data to represent an individual and each column a product or an item of interest, with $A_{ij}$ being the response of individual $i$ on item $j$. Therefore, the matrices $U$ and $V$ are commonly interpreted as the “user matrix” and “product matrix”, respectively.
Let us denote each row of $U$ as $u_i$ and each column of $V$ as $v_j$. Then $u_i$ ($v_j$) represents a vector of “latent features” for user $i$ (product $j$), and in total there are $k$ latent features for users (products). The goal of matrix completion is thus to discover such latent features of the users and the products, so that the dot product of such features for user $i$ and product $j$, $u_i^T v_j$, is the intended response of user $i$ on product $j$.
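The latent-factor view above can be sketched in a few lines of numpy; the dimensions and random data here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 2                  # users, products, latent features

U = rng.standard_normal((n, k))    # user matrix: row u_i = latent features of user i
V = rng.standard_normal((k, m))    # product matrix: column v_j = latent features of product j

X = U @ V                          # completed matrix of rank (at most) k

# The predicted response of user i on product j is the dot product u_i . v_j.
i, j = 2, 3
assert np.isclose(X[i, j], U[i, :] @ V[:, j])
```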
While this interpretation is intuitive, it does not offer insight into what the latent features of users and products mean. Inductive Matrix Completion, first considered in Dhillon et al. (2013), aims to rectify this problem by asserting that each of the latent features is a linear combination of known features. We focus on the one-sided information case, where only one of the user/product matrices is subject to such a constraint. This case is more relevant as features related to users are often fragmented and increasingly constrained by data privacy regulations, while features about products are easy to obtain. We also discuss later why the two-sided information case is not interesting in the context of this paper.
Without loss of generality, we assume the information is on the product matrix and denote the known feature matrix with $p$ features as $M \in \mathbb{R}^{m \times p}$. As an example, if the items are movies, then each column of $M$ represents a feature of the movies (actors, budget, running time, etc.), and each product latent feature needs to be a linear combination of such features. Mathematically, this translates to the constraint:
\[
V = S M^T,
\]
where $S \in \mathbb{R}^{k \times p}$. Therefore, the inductive version of the problem in (1) can be written as:
\[
\min_{U \in \mathbb{R}^{n \times k},\, S \in \mathbb{R}^{k \times p}} \; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \bigl((USM^T)_{ij} - A_{ij}\bigr)^2. \tag{2}
\]
Although the inductive version of the problem adds more interpretability to the product latent features, they are still far from fully interpretable. For example, if we take the items to be movies, and the features to be running time, budget, box office, and number of top 100 actors, then the generated features could look like (coefficients illustrative):
\[
\begin{aligned}
\text{feature 1} &= 0.5 \times \text{(running time)} - 1.2 \times \text{(budget)} + 0.8 \times \text{(box office)},\\
\text{feature 2} &= -0.3 \times \text{(budget)} + 1.1 \times \text{(number of top 100 actors)}.
\end{aligned}
\]
These features, although linear combinations of interpretable features, are not very interpretable themselves due to the involvement of multiple factors with different units. Therefore, they cannot significantly help decision makers understand “what is important” about the product. Furthermore, the appearance of the same factor in multiple features with different signs (as shown above) further complicates any attempt at understanding the result.
Therefore, we argue that instead of supposing the features are linear combinations of the known features, we should assume that the features are selected from the known features. This formulation alleviates the two problems mentioned previously: it guarantees the latent features to be interpretable (as long as the original features are), and it prevents any duplicated features among the selected latent features. We call this problem Interpretable Matrix Completion. We use the term interpretable, as opposed to inductive, to highlight that our approach, like sparse linear regression, gives actionable insights on what the important features of the matrix $A$ are. We note that Interpretable Matrix Completion is considerably harder than the inductive or the classical matrix completion problem, as it is a discrete problem and the number of ways of selecting $k$ out of $p$ factors is exponential.
In this paper, we show that the Interpretable Matrix Completion problem can be written as a mixed-integer convex optimization problem. Inspired by Bertsimas and van Parys (2020) for sparse linear regression, we reformulate the interpretable matrix completion problem as a binary convex optimization problem. Then we introduce a new algorithm, OptComplete, based on stochastic cutting planes, to enable scalability to large matrices. In addition, we provide empirical evidence on both synthetic and real-world data that OptComplete is able to match or exceed current state-of-the-art methods for inductive and general matrix completion on both speed and accuracy, despite OptComplete solving a more difficult problem.
Specifically, our contributions in this paper are as follows:

We introduce the interpretable matrix completion problem, and reformulate it as a binary convex optimization problem that can be solved using cutting planes methods.

We propose a novel approach to cutting planes by introducing stochastic cutting planes. We prove that the new algorithm converges to an optimal solution of the interpretable matrix completion problem with exponentially vanishing failure probability.

We present computational results on both synthetic and real datasets that show that the algorithm matches or outperforms current stateoftheart methods in terms of both scalability and accuracy.
The structure of the paper is as follows. In Section 2, we introduce the binary convex reformulation of the low-rank interpretable matrix completion problem, and show how it can be solved through a cutting plane algorithm, which we denote CutPlanes. In Section 3, we introduce OptComplete, a stochastic cutting planes method designed to scale the CutPlanes algorithm of Section 2, and show that it recovers the optimal solution of CutPlanes with exponentially vanishing failure probability. In Section 4, we report on computational experiments with synthetic data that compare OptComplete to Inductive Matrix Completion (IMC) introduced in Natarajan and Dhillon (2014) and SoftImpute-ALS (SIALS) by Hastie et al. (2015), two state-of-the-art matrix completion algorithms for inductive and general completion. We also compare OptComplete to CutPlanes to demonstrate the substantial speedup of the stochastic algorithm. In Section 5, we report computational experiments on the Netflix Prize dataset. In Section 6 we provide our conclusions.
Literature
Matrix completion has been applied successfully to many tasks, including recommender systems Koren et al. (2009), social network analysis Chiang et al. (2014) and clustering Chen et al. (2014b). After Candès and Tao (2010) proved a theoretical guarantee for the retrieval of the exact matrix under the nuclear norm convex relaxation, many methods have focused on the nuclear norm problem (see Mazumder et al. (2010), Beck and Teboulle (2009), Jain et al. (2010), and Tanner and Wei (2013) for examples). Alternative methods include alternating projections by Recht and Ré (2013) and Grassmann manifold optimization by Keshavan et al. (2009). There has also been work for the case where the uniform distributional assumptions required by the theoretical guarantees are violated, such as Negahban and Wainwright (2012) and Chen et al. (2014a).
Interest in inductive matrix completion intensified after Xu et al. (2013) showed that, given predictive side information, far fewer samples are needed to retrieve the full matrix. Thus, most of this work (see Xu et al. (2013), Jain and Dhillon (2013), Farhat et al. (2013), Natarajan and Dhillon (2014)) has focused on the case in which the side information is assumed to be perfectly predictive, so that the theoretical bound on sample complexity of Xu et al. (2013) can be achieved. Chiang et al. (2015) explored the case in which the side information is corrupted with noise, while Shah et al. (2017) and Si et al. (2016) incorporated nonlinear combinations of factors into the side information. Surprisingly, as pointed out by a recent article Nazarov et al. (2018), there is a considerable lack of effort to introduce sparsity/interpretability into inductive matrix completion, with Lu et al. (2016), Soni et al. (2016) and Nazarov et al. (2018) being among the only works that attempt to do so. Our work differs from these previous attempts in that they mainly focus on choosing latent features which are sparse linear combinations of the given features. In contrast, interpretable matrix completion aims to select exactly $k$ features from the $p$ known features.
2 Interpretable Matrix Completion
In this section, we present the mathematical formulation of Interpretable Matrix Completion and show how it can be reformulated as a binary convex problem, building on Bertsimas and van Parys (2020). We show how this naturally leads to a cutting plane algorithm, and discuss its computational complexity. We also discuss the two-sided information case, and how it reduces to the sparse regression problem.
2.1 Binary Convex Reformulation of Interpretable Matrix Completion
The (one-sided) interpretable matrix completion problem can be written as a mixed binary optimization problem:
\[
\min_{U \in \mathbb{R}^{n \times k},\, S \in \mathcal{S}} \; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \bigl((USM^T)_{ij} - A_{ij}\bigr)^2,
\]
where $\mathcal{S}$ is the set of selection matrices:
\[
\mathcal{S} = \Bigl\{ S \in \{0,1\}^{k \times p} \;:\; \textstyle\sum_{j} S_{lj} = 1 \ \forall l, \ \sum_{l} S_{lj} \le 1 \ \forall j \Bigr\},
\]
so that each of the $k$ latent features is one of the $p$ known features, and no feature is selected twice.
We note that, given $S \in \mathcal{S}$, the rank of the matrix $USM^T$ is indeed (at most) $k$. We further note that the coefficients of $S$ can be taken to be binary without loss of generality, since if they are not and row $l$ of $S$ has a nonzero coefficient $c \neq 1$ in position $j$, then applying the transformation:
\[
U_l \to c\,U_l, \qquad S_{lj} \to 1, \tag{3}
\]
where $U_l$ is the $l$th column of $U$, for every such row $l$, results in an equivalent problem with the coefficients of $S$ being binary.
For this paper, we consider the squared $\ell_2$ norm, and for robustness purposes (see Bertsimas and van Parys (2020) and Bertsimas and Copenhaver (2018)), we add a Tikhonov regularization term to the original problem. Specifically, the (one-sided) interpretable matrix completion problem with regularization we address is
\[
\min_{U \in \mathbb{R}^{n \times k},\, S \in \mathcal{S}} \; \frac{1}{|\Omega|} \biggl( \sum_{(i,j) \in \Omega} \bigl((USM^T)_{ij} - A_{ij}\bigr)^2 + \frac{1}{\gamma}\,\|U\|_F^2 \biggr). \tag{4}
\]
In this section, we show that problem (4) can be reformulated as a binary convex optimization problem, and can be solved to optimality using a cutting plane algorithm. The main theorem and its proof are presented below:
Theorem 1
Problem (4) can be reformulated as a binary convex optimization problem:
\[
\min_{s \in \{0,1\}^p,\ \sum_{j=1}^p s_j = k} \; c(s), \qquad c(s) = \frac{1}{|\Omega|} \sum_{i=1}^n y_i^T \Bigl( I_m + \gamma\, W_i \Bigl( \sum_{j=1}^p s_j Z_j \Bigr) W_i \Bigr)^{-1} y_i,
\]
where $W_1, \dots, W_n$ are diagonal matrices:
$(W_i)_{jj} = 1$ if $(i,j) \in \Omega$ and $0$ otherwise, $y_i$ is the $i$th row of $A$ with unknown entries taken to be $0$, and $Z_j = M_j M_j^T$, with $M_j$ the $j$th column of $M$.
Proof.
With the diagonal matrices defined above, we can rewrite the sum in (4) over the known entries of $A$, $\sum_{(i,j)\in\Omega}((USM^T)_{ij}-A_{ij})^2$, as a sum over the rows of $A$:
\[
\sum_{(i,j)\in\Omega}\bigl((USM^T)_{ij}-A_{ij}\bigr)^2 = \sum_{i=1}^n \bigl\| W_i\bigl(M S^T u_i - y_i\bigr) \bigr\|^2,
\]
where $u_i$ is the $i$th row of $U$ (as a column vector). Using $X = USM^T$, we have $x_i = MS^Tu_i$, where $x_i$ is the $i$th row of $X$. Moreover,
\[
\|U\|_F^2 = \sum_{i=1}^n \|u_i\|^2.
\]
Then, Problem (4) becomes:
\[
\min_{S \in \mathcal{S},\, U} \; \frac{1}{|\Omega|} \sum_{i=1}^n \Bigl( \bigl\|W_i\bigl(MS^Tu_i - y_i\bigr)\bigr\|^2 + \frac{1}{\gamma}\|u_i\|^2 \Bigr).
\]
We then notice that within the sum each row $u_i$ of $U$ can be optimized separately, leading to:
\[
\min_{S \in \mathcal{S}} \; \frac{1}{|\Omega|} \sum_{i=1}^n \min_{u_i} \Bigl( \bigl\|W_i\bigl(MS^Tu_i - y_i\bigr)\bigr\|^2 + \frac{1}{\gamma}\|u_i\|^2 \Bigr). \tag{5}
\]
The inner optimization problem can be solved in closed form given $S$, as it is a weighted linear regression problem with Tikhonov regularization; see Bertsimas and van Parys (2020). The partially minimized objective is:
\[
\min_{u_i} \Bigl( \bigl\|W_i\bigl(MS^Tu_i - y_i\bigr)\bigr\|^2 + \frac{1}{\gamma}\|u_i\|^2 \Bigr) = y_i^T \bigl( I_m + \gamma\, W_i M S^T S M^T W_i \bigr)^{-1} y_i. \tag{6}
\]
So Problem (5) can be simplified to:
\[
\min_{S \in \mathcal{S}} \; \frac{1}{|\Omega|} \sum_{i=1}^n y_i^T \bigl( I_m + \gamma\, W_i M S^T S M^T W_i \bigr)^{-1} y_i.
\]
Finally, we notice that
\[
M S^T S M^T = \sum_{j=1}^p s_j M_j M_j^T = \sum_{j=1}^p s_j Z_j, \qquad s_j = \sum_{l=1}^k S_{lj},
\]
and we obtain the required expression. Since the $Z_j$ are positive semidefinite, and the matrix-fractional function composed with a linear map of $s$ is convex, the entire function is convex in $s$. ∎
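The partial minimization step in the proof can be checked numerically. The sketch below, with random illustrative data, verifies that the closed-form ridge minimizer attains the partially minimized objective, where $X$ plays the role of the known-entry design matrix $W_iMS^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
m_i, k, gamma = 8, 3, 2.0          # known entries in row i, selected features, regularization

X = rng.standard_normal((m_i, k))  # design matrix: selected feature columns on known entries
y = rng.standard_normal(m_i)       # observed entries of row i

# Closed-form ridge minimizer of ||Xw - y||^2 + (1/gamma)||w||^2
w = np.linalg.solve(X.T @ X + np.eye(k) / gamma, X.T @ y)
direct = np.sum((X @ w - y) ** 2) + np.dot(w, w) / gamma

# Partially minimized objective, as in the theorem
via_inverse = y @ np.linalg.solve(np.eye(m_i) + gamma * X @ X.T, y)

assert np.isclose(direct, via_inverse)
```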
With Theorem 1, our original problem can now be restated as:
\[
\min_{s \in \{0,1\}^p} \; c(s) \quad \text{s.t.} \quad \sum_{j=1}^p s_j = k. \tag{6}
\]
This can be solved utilizing the cutting plane algorithm first introduced by Duran and Grossmann (1986), summarized as Algorithm 1.
The cutting plane algorithm, at iteration $t$, adds a linear approximation of $c(s)$ at the current feasible solution $s_t$ to the set of constraints:
\[
\eta \ge c(s_t) + \nabla c(s_t)^T (s - s_t), \tag{7}
\]
and we solve the mixed-integer linear optimization problem:
\[
\eta_{t+1} = \min_{s,\, \eta} \; \eta \quad \text{s.t.} \quad \eta \ge c(s_l) + \nabla c(s_l)^T (s - s_l), \ l = 1, \dots, t, \quad s \in \{0,1\}^p, \ \sum_{j=1}^p s_j = k,
\]
to obtain $s_{t+1}$. We see that $\eta_{t+1}$ is exactly the minimum value of the current approximation of $c(s)$, $c_t(s)$, defined below:
\[
c_t(s) = \max_{1 \le l \le t} \; c(s_l) + \nabla c(s_l)^T (s - s_l).
\]
Since $c(s)$ is convex, the piecewise linear approximation is an outer approximation ($c_t(s) \le c(s)$ for all $s$), so $\eta_{t+1} \le \min_s c(s)$. As the algorithm progresses, the set of linear approximations forms an increasingly better approximation of $c(s)$, and $\eta_t$ increases with $t$. The algorithm terminates once $\eta_t$ does not further increase, as this implies the linear approximation shares the same minimum value as the true function $c(s)$, which is the desired value.
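A minimal sketch of this outer-approximation loop follows. For illustration it assumes fully observed rows (all $W_i = I$) and replaces the mixed-integer master problem, solved by Gurobi in our implementation, with brute-force enumeration over the $k$-subsets; only the cut generation mirrors the actual algorithm:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, m, p, k, gamma = 6, 5, 4, 2, 1.0

M = rng.standard_normal((m, p))   # columns of M are the candidate features
Y = rng.standard_normal((n, m))   # rows y_i of A (fully observed for simplicity)

def c_and_grad(s):
    """Value and (continuous-relaxation) gradient of c(s), all entries observed."""
    Ms = M[:, s.astype(bool)]
    K = np.eye(m) + gamma * Ms @ Ms.T
    val, grad = 0.0, np.zeros(p)
    for y in Y:
        alpha = np.linalg.solve(K, y)
        val += y @ alpha
        grad -= gamma * (M.T @ alpha) ** 2
    return val, grad

# All feasible supports (k of p features); the master problem enumerates them
# here instead of calling a MILP solver.
supports = [np.isin(np.arange(p), c).astype(float)
            for c in itertools.combinations(range(p), k)]

cuts, s, ub, best_s = [], supports[0], np.inf, None
while True:
    val, grad = c_and_grad(s)
    if val < ub:
        ub, best_s = val, s
    cuts.append((val, grad, s))

    def c_t(sp):  # piecewise-linear outer approximation from the cuts so far
        return max(v + g @ (sp - s0) for v, g, s0 in cuts)

    s = min(supports, key=c_t)    # master problem: minimize the approximation
    if c_t(s) >= ub - 1e-9:       # lower bound meets incumbent: optimal
        break

brute = min(c_and_grad(sp)[0] for sp in supports)
assert np.isclose(c_and_grad(best_s)[0], brute)
```

Because each new cut makes the approximation exact at every visited point, a revisited support immediately closes the gap, so the loop terminates after finitely many iterations.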
2.2 Implementation and Computational Complexity of CutPlanes
The computational complexity of the cutting plane algorithm comes from calculating $c(s)$ and its derivative $\nabla c(s)$. We first introduce the notations $K_i(s)$ and $\alpha_i$:
\[
K_i(s) = I_m + \gamma\, W_i \Bigl(\sum_{j=1}^p s_j Z_j\Bigr) W_i, \qquad \alpha_i = K_i(s)^{-1} y_i. \tag{8}
\]
Then, the function $c(s)$ in (6) can be expressed as
\[
c(s) = \frac{1}{|\Omega|} \sum_{i=1}^n y_i^T \alpha_i. \tag{9}
\]
To calculate the derivative $\nabla c(s)$, it is easier to utilize the expression in Theorem 1 and then apply the chain rule. After some manipulations, we obtain
\[
\frac{\partial c(s)}{\partial s_j} = -\frac{\gamma}{|\Omega|} \sum_{i=1}^n \bigl( M_j^T W_i \alpha_i \bigr)^2. \tag{10}
\]
Therefore, we focus on calculating $\alpha_i$. First, by the Matrix Inversion Lemma (Woodbury (1949)) we have
\[
K_i(s)^{-1} = I_m - W_i M_s \Bigl( \frac{1}{\gamma} I_k + M_s^T W_i M_s \Bigr)^{-1} M_s^T W_i, \tag{11}
\]
where $M_s$ is the feature matrix formed by the columns $M_j$ of $M$ such that $s_j = 1$, and we have suppressed the dependency of $K_i$ on $s$ for notational ease. Note that in order to compute $\alpha_i$ using Eq. (8) we need to invert an $m \times m$ matrix, while from Eq. (11) we need to invert a $k \times k$ matrix, which only requires $O(k^3)$ calculations. Furthermore, note that calculating $M_s^T W_i y_i$ only requires $O(m_i k)$ multiplications, where $m_i$ is the number of known entries in row $i$ of $A$, as we do not need to multiply on the unknown entries. Similarly, we can compute $M_s^T W_i M_s$ in $O(m_i k^2)$ multiplications, and the terms $M_j^T W_i \alpha_i$ for all $j$ in $O(m_i p)$ multiplications. In total, one full cutting plane can therefore be computed in
\[
O\Bigl( \sum_{i=1}^n \bigl( m_i k^2 + m_i p + k^3 \bigr) \Bigr) = O\bigl( |\Omega| (k^2 + p) + n k^3 \bigr) \tag{12}
\]
calculations.
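The saving from the Matrix Inversion Lemma can be verified directly: the sketch below (random illustrative data) computes $\alpha_i$ once via the $m \times m$ system and once via the $k \times k$ system:

```python
import numpy as np

rng = np.random.default_rng(3)
m, k, gamma = 200, 4, 1.5
X = rng.standard_normal((m, k))    # plays the role of W_i M_s
y = rng.standard_normal(m)         # plays the role of y_i

# Direct route: solve an m x m system.
direct = np.linalg.solve(np.eye(m) + gamma * X @ X.T, y)

# Woodbury route: only a k x k system has to be solved.
small = np.linalg.solve(np.eye(k) / gamma + X.T @ X, X.T @ y)
woodbury = y - X @ small

assert np.allclose(direct, woodbury)
```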
2.3 Twosided Information Case
In this section, we briefly discuss the matrix completion problem in the two-sided information case, and how it reduces to the problem of sparse linear regression. The two-sided interpretable matrix completion problem with Tikhonov regularization can be stated as follows:
\[
\min_{S} \; \frac{1}{|\Omega|} \biggl( \sum_{(i,j)\in\Omega} \bigl( (M^r S (M^c)^T)_{ij} - A_{ij} \bigr)^2 + \frac{1}{\gamma}\,\|S\|_F^2 \biggr) \quad \text{s.t.} \quad \|S\|_0 \le k, \tag{13}
\]
where $M^r \in \mathbb{R}^{n \times p_1}$ is a known matrix of features of each row, $M^c \in \mathbb{R}^{m \times p_2}$ is a known matrix of features of each column, and $S \in \mathbb{R}^{p_1 \times p_2}$ is a sparse matrix that has at most $k$ nonzero entries, ensuring that $\operatorname{rank}(M^r S (M^c)^T) \le k$. We note that in Eq. (13) we restrict the support of the matrix $S$ to be at most $k$, rather than forcing the entries of $S$ to be binary. This is because, unlike in the one-sided case, both $M^r$ and $M^c$ are known, so we cannot apply the scaling transformation in (3).
We denote by $M^r_a$ the $a$th column of $M^r$ and by $M^c_b$ the $b$th column of $M^c$. We introduce the matrices $W_i$ as in Theorem 1. Using $X = M^r S (M^c)^T$, we can write
\[
X_{ij} = \sum_{a=1}^{p_1} \sum_{b=1}^{p_2} S_{ab}\, B^{ab}_{ij}, \qquad B^{ab} = M^r_a (M^c_b)^T,
\]
where $B^{ab}_{ij}$ is the $(i,j)$th entry of the matrix formed by multiplying the $a$th column of $M^r$ with the $b$th column of $M^c$. Then, Problem (13) becomes:
\[
\min_{\|S\|_0 \le k} \; \frac{1}{|\Omega|} \biggl( \sum_{(i,j)\in\Omega} \Bigl( \sum_{a,b} S_{ab} B^{ab}_{ij} - A_{ij} \Bigr)^2 + \frac{1}{\gamma}\,\|S\|_F^2 \biggr). \tag{14}
\]
As every matrix $B^{ab}$ is known, this is a sparse regression problem in which there are $p_1 p_2$ features to choose from (the matrices $B^{ab}$), there are $|\Omega|$ samples (the known entries of the matrix $A$), the sparsity requirement is $k$, the regression coefficients are the entries $S_{ab}$, and we have Tikhonov regularization. Vectorizing $A$, the $B^{ab}$, and $S$ reduces the problem to the familiar form of sparse linear regression, which can be solved at scale by the algorithm developed in Bertsimas and van Parys (2020).
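The vectorization in the last step is the standard Kronecker identity $\operatorname{vec}(M^r S (M^c)^T) = (M^c \otimes M^r)\operatorname{vec}(S)$ (column-major vec), which the following numpy sketch checks on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p1, p2 = 5, 4, 3, 2
Mr = rng.standard_normal((n, p1))   # row (user) features
Mc = rng.standard_normal((m, p2))   # column (product) features
S = rng.standard_normal((p1, p2))   # coefficient matrix (sparse in the paper)

lhs = (Mr @ S @ Mc.T).ravel(order="F")        # vec of the completed matrix
rhs = np.kron(Mc, Mr) @ S.ravel(order="F")    # (Mc kron Mr) vec(S)

assert np.allclose(lhs, rhs)
```

Each column of $M^c \otimes M^r$ is the vectorization of one matrix $B^{ab}$, so this is exactly the design matrix of the sparse regression.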
3 OptComplete: The Stochastic Cutting Plane Speedup
In this section, we introduce OptComplete, a stochastic version of the cutting plane algorithm introduced in Section 2. We present theoretical results showing that the stochastic algorithm recovers the true optimal solution of the original algorithm with high probability, without distributional assumptions. We also discuss how this probability depends on various factors, and the algorithm's favorable theoretical computational complexity.
3.1 Introduction of OptComplete
In the previous section, we showed that, through careful evaluation, we can calculate a full cutting plane efficiently. However, in very high dimensions, where $n$ and $m$ are extremely large, the cost of generating the full cutting plane is still prohibitive. Thus, we consider generating approximations of the cutting plane that enable the algorithm to scale to high values of $n$ and $m$. Specifically, consider the cutting plane function in (9), reproduced below:
\[
c(s) = \frac{1}{|\Omega|} \sum_{i=1}^n y_i^T \alpha_i,
\]
where:
\[
\alpha_i = K_i(s)^{-1} y_i, \qquad K_i(s) = I_m + \gamma\, W_i \Bigl(\sum_{j=1}^p s_j Z_j\Bigr) W_i.
\]
We approximate the inner term $\alpha_i$ by choosing $m_s$ samples from $\{1, \dots, m\}$ without replacement, with the chosen set denoted $\mathcal{J}$. We then form the submatrix $\tilde{W}_i$ of $W_i$ with the selected rows and columns, and similarly $\tilde{M}$ and $\tilde{y}_i$. Then we calculate the approximation:
\[
\tilde{\alpha}_i = \Bigl( I_{m_s} + \gamma\, \tilde{W}_i \tilde{M}_s \tilde{M}_s^T \tilde{W}_i \Bigr)^{-1} \tilde{y}_i.
\]
Then we choose $n_s$ samples from $\{1, \dots, n\}$ without replacement, with the chosen set denoted $\mathcal{I}$. We can then calculate an approximation of $c(s)$ using the approximated $\tilde{\alpha}_i$:
\[
\tilde{c}(s) = \frac{n}{n_s} \cdot \frac{1}{|\Omega|} \sum_{i \in \mathcal{I}} \frac{m}{m_s}\, \tilde{y}_i^T \tilde{\alpha}_i,
\]
where the set $\mathcal{J}$ is chosen independently for every row $i \in \mathcal{I}$. Then the derivative of $\tilde{c}(s)$ is:
\[
\frac{\partial \tilde{c}(s)}{\partial s_j} = -\frac{n}{n_s} \cdot \frac{\gamma}{|\Omega|} \sum_{i \in \mathcal{I}} \frac{m}{m_s} \bigl( \tilde{M}_j^T \tilde{W}_i \tilde{\alpha}_i \bigr)^2.
\]
Using such approximations, we derive a stochastic cutting-plane algorithm, which we call OptComplete, presented as Algorithm 2.
For this algorithm to work, we need the approximation and its derivative to be close to the nominal values. Furthermore, the approximated cutting planes should not cut off the true solution. In the next section, we show that OptComplete enjoys these properties with high probability. In Section 3.3, we discuss how to select the sizes $n_s$ and $m_s$.
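The row-sampling idea can be illustrated as follows: on random, fully observed data, the rescaled sum over a sample of rows is close to the full sum. The $n/n_s$ rescaling (written below as matching averages) and the loose tolerance are illustrative; the paper's estimator also subsamples columns:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, p, k, gamma = 400, 30, 10, 3, 1.0
n_s = 100                                  # rows kept for the stochastic cut

M = rng.standard_normal((m, p))
Y = rng.standard_normal((n, m))            # rows y_i, all entries observed
s = np.zeros(p)
s[:k] = 1                                  # some feasible support
Ms = M[:, s.astype(bool)]
K = np.eye(m) + gamma * Ms @ Ms.T

full = sum(y @ np.linalg.solve(K, y) for y in Y) / n

rows = rng.choice(n, size=n_s, replace=False)   # sampling without replacement
approx = sum(Y[i] @ np.linalg.solve(K, Y[i]) for i in rows) / n_s

assert abs(approx - full) / full < 0.5     # very loose; the actual deviation is far smaller
```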
3.2 Main Theoretical Results
We first show that the inner approximation $\tilde{\alpha}_i$ is close to the true term $\alpha_i$ with high probability:
Theorem 2
Let $A$ be a partially known matrix, $M$ a known feature matrix, and $W_i$, $y_i$, $Z_j$ as defined in Theorem 1. Let $\mathcal{J}$ be a random sample of size $m_s$ from the set $\{1, \dots, m\}$, chosen without replacement. Then the deviation of the approximated inner term $\tilde{\alpha}_i$ (and of the corresponding derivative term) from its exact counterpart is bounded, with probability at least $1 - c_1 e^{-c_2 m_s}$,
where $c_1, c_2$ are absolute constants.
We see that, without assumptions on the data, both the value and the derivative of the inner approximation obey such bounds with very high probability. Furthermore, inverting the statements gives that the failure probability drops off exponentially in the deviation bound, reflecting a Gaussian tail structure for the approximated value and derivative. The proof is contained in Appendix A.
Using this result, we are able to prove a tight deviation bound for the approximated cost function $\tilde{c}(s)$ and its derivative $\nabla \tilde{c}(s)$:
Theorem 3
Let $A$ be a partially known matrix, $M$ a known feature matrix, and $W_i$, $y_i$, $Z_j$ as defined in Theorem 1. Let $\mathcal{I}$ be a random sample of size $n_s$ from $\{1, \dots, n\}$ chosen without replacement. Then, for each $i \in \mathcal{I}$, let $\mathcal{J}_i$ be a random sample of size $m_s$ from the set $\{1, \dots, m\}$, all chosen without replacement. Then the deviations of $\tilde{c}(s)$ and $\nabla \tilde{c}(s)$ from $c(s)$ and $\nabla c(s)$ are bounded, with probability at least $1 - c_1 e^{-c_2 n_s}$,
where $c_1, c_2$ are absolute constants.
Similar to the inner approximations, $\tilde{c}(s)$ and $\nabla \tilde{c}(s)$ have Gaussian tails. Furthermore, the scaling here depends only on $n_s$ and not on $m_s$: this shows that the error of the inner approximation is dominated by the outer sampling of the rows. The proof is contained in Appendix B.
Then, using this result, we are able to prove our main result for OptComplete. We first introduce a new definition:
Definition 1
The convexity parameter $\mu$ of the cost function $c(s)$ is defined as the largest positive number for which the following statement is true:
(15) 
We have the following proposition, which shows that unless the cost function is degenerate (i.e., different sets of features do not change the objective value), we always have a positive convexity parameter:
Proposition 1
Assume that there do not exist two distinct feasible solutions $s_1 \neq s_2$ such that $c(s_1) = c(s_2)$. Then $\mu > 0$.
The proof is contained in Appendix C. Now, we state our main theorem for OptComplete.
Theorem 4
For the matrix completion problem (4), let $M$ be a known feature matrix, $A$ a matrix with entries partially known, and OptComplete as defined in Algorithm 2. Assume that Problem (4) is feasible. Then OptComplete terminates in a finite number of steps $T$, and finds an optimal solution of (4) with a failure probability that vanishes exponentially in $n_s$, where the constant in the exponent is absolute and independent of the data, apart from the convexity parameter $\mu$ of the cost function $c(s)$.
The proof is contained in Appendix D. This theorem shows that as long as the original problem is feasible, OptComplete is able to find the optimal solution of the original binary convex problem with exponentially vanishing failure probability. The theorem requires no assumptions on the data, and thus applies generally. We again note that the bound does not depend on $m_s$ and only on $n_s$; we discuss how this informs our selection of the sizes of $n_s$ and $m_s$ in the next section.
3.3 Sampling Size and Computational Complexity
To select appropriate $n_s$ and $m_s$, we first note that Candès and Tao (2010) derived a lower bound on the number of observed elements required to complete a square matrix of rank $k$. Assuming an average known rate of $\rho$ in the original matrix $A$, the expected number of known elements under a sampling of $n_s$ rows and $m_s$ columns is $\rho n_s m_s$. We therefore need:
(16) 
for some constant $C$. Theorem 4 showed that the bound on the failure probability scales with $n_s$, and thus we cannot take $n_s$ too small. Using (12), the complexity of the cutting plane with $n_s$ and $m_s$ samples is:
(17) 
Therefore, if we hold the expected number of known elements ($\rho n_s m_s$) constant, it is more advantageous to select a smaller $m_s$, as the complexity scales with $m_s$. Thus, we set:
(18) 
Experimentally, we found these choices to generate good results (the results were similar for nearby values). Therefore, by (17), the approximated cutting plane has a computational complexity of:
(19) 
This scales as the square root of $n$ and $m$, rather than linearly in $n$ and $m$ as for the full cutting plane. This allows OptComplete to enjoy a considerable speedup compared to CutPlanes, as demonstrated in Section 4.
4 Synthetic Data Experiments
We assume that the matrix $A = UV + E$, where $U$ and $V$ are the ground-truth factor matrices and $E$ is an error matrix with individual elements sampled from a normal distribution. We sample the elements of $U$ and $V$ from a uniform distribution on $[0,1]$, and then randomly select a fraction of the entries of $A$ to be missing. We formulate the feature matrix $M$ by combining the true factors with a confounding matrix that contains unnecessary factors sampled similarly from the uniform distribution. We run OptComplete on a multi-core server, using Gurobi 8.1.0. For each combination of parameters, we ran 10 tests and report the median value of every statistic.
We report the following statistics, with $s$ being the ground-truth factor vector and $\hat{s}$ the estimated factor vector:

$n, m$ — the dimensions of $A$.

$p$ — the number of features in the feature matrix $M$.

$k$ — the true number of features.

Missing ratio — the fraction of missing entries in $A$.

$T$ — the total time taken for the algorithm.

MAPE — the Mean Absolute Percentage Error (MAPE) for the retrieved matrix $X$:
\[
\mathrm{MAPE} = \frac{1}{|\Omega^c|} \sum_{(i,j) \in \Omega^c} \frac{|X_{ij} - A_{ij}|}{|A_{ij}|},
\]
where $\Omega^c$ is the set of missing entries in $A$.
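A reference implementation of this metric (the function name and argument layout are our own):

```python
import numpy as np

def mape(X, A, mask_missing):
    """Mean absolute percentage error of the completion X over the missing entries of A.

    mask_missing is a boolean array marking the entries hidden from the
    algorithm (the ground truth A is known for them in synthetic tests).
    """
    diff = np.abs(X[mask_missing] - A[mask_missing])
    return np.mean(diff / np.abs(A[mask_missing]))

A = np.array([[1.0, 2.0], [4.0, 5.0]])
X = np.array([[1.0, 2.2], [3.0, 5.0]])
mask = np.array([[False, True], [True, False]])
# Missing entries (0,1) and (1,0): (0.2/2 + 1.0/4)/2 = 0.175
print(round(mape(X, A, mask), 3))
```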
Since the concept of Interpretable Matrix Completion is new, there is a lack of directly comparable algorithms in the literature. In lieu of such, we compare OptComplete to state-of-the-art solvers for Inductive Matrix Completion and general matrix completion, which are:

IMC by Natarajan and Dhillon (2014) — this algorithm is a well-accepted benchmark for testing Inductive Matrix Completion algorithms.

SoftImpute-ALS (SIALS) by Hastie et al. (2015) — this is widely recognized as a state-of-the-art matrix completion method without feature information. It has among the best scaling behavior across all classes of matrix completion algorithms, as it utilizes fast alternating least squares to achieve scalability.
We use the best existing implementations of IMC (Matlab 2018b) and SIALS (R 3.4.4, package softImpute) with parallelization on the same server.
We further compare our algorithm to CutPlanes, the original cutting plane algorithm developed in Section 2. It is known that for general mixed-integer convex problems, the cutting plane algorithm has the best overall performance (see e.g. Lubin et al. (2016) for details), and thus CutPlanes represents a good baseline of comparison for OptComplete.
We randomly selected a portion of the masked elements to serve as a validation set. The regularization parameter $\gamma$ of OptComplete, the rank parameter of IMC, and the penalization parameters of IMC and SIALS are selected using the validation set. The results are separated into sections below. The first five sections each modify a single variable (among $n$, $m$, $p$, $k$, and the missing ratio) to investigate OptComplete's scalability, where the leftmost column indicates the variable modified. The last section compares the four algorithms' scalability for a variety of parameters that reflect more realistic scenarios.
OptComplete  CutPlanes  IMC  SIALS  
MAPE  MAPE  MAPE  MAPE  
100  100  15  5  1.7s  6.0s  0.03s  0.02s  
100  100  15  5  0.9s  4.5s  0.07s  0.03s  
100  100  15  5  0.6s  2.5s  0.09s  0.06s  
100  100  15  5  0.2s  1.2s  0.12s  0.12s  
100  100  15  5  0.9s  4.5s  0.07s  0.03s  
100  15  5  3.1s  72.5s  0.6s  0.1s  
100  15  5  9.5s  957s  4.5s  6.5s  
100  15  5  18.0s  10856s  32.7s  38s  
100  100  15  5  0.9s  4.5s  0.07s  0.03s  
100  15  5  0.7s  18.6s  0.8s  0.1s  
100  15  5  1.2s  68.5s  6.2s  0.8s  
100  15  5  3.0s  259s  56.2s  12.7s  
100  100  15  5  0.9s  4.5s  0.07s  0.03s  
100  100  50  5  2.0s  18.0s  0.3s  0.03s  
100  100  200  5  12.1s  95.9s  1.9s  0.03s  
100  100  5  90.3s  680s  10.4s  0.03s  
100  100  50  5  2.0s  18.0s  0.3s  0.03s  
100  100  50  10  20.7s  130s  0.06%  0.20s 