Categorical Matrix Completion

# Categorical Matrix Completion

Yang Cao and Yao Xie H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
{caoyang, yao.xie}@gatech.edu
###### Abstract

We consider the problem of completing a matrix with categorical-valued entries from partial observations. This is achieved by extending the formulation and theory of one-bit matrix completion [1]. We recover a low-rank matrix by maximizing the likelihood ratio with a constraint on the nuclear norm of , and the observations are mapped from entries of through multiple link functions. We establish theoretical upper and lower bounds on the recovery error, which meet up to a constant factor where is the fixed number of categories. The upper bound in our case depends on the number of categories implicitly through a maximization of terms that involve the smoothness of the link functions. In contrast to one-bit matrix completion, our bounds for categorical matrix completion are optimal up to a factor on the order of the square root of the number of categories, which is consistent with an intuition that the problem becomes harder when the number of categories increases. By comparing the performance of our method with the conventional matrix completion method on the MovieLens dataset, we demonstrate the advantage of our method.

\defaultfontfeatures

Ligatures=TeX \setmainfontTeX Gyre Termes \setmonofontTeX Gyre Cursor \setmathfontTeX Gyre Termes Math \NewDocumentCommand\entropyomH[#2 \IfValueT#1 — #1] \NewDocumentCommand\bentropylm ~H#1[#2] \NewDocumentCommand\mutualInfoommI[#2;#3 \IfValueT#1 — #1]

## I Introduction

Recovering a low-rank matrix from a subset of its entries is a fundamental problem that arises from many real-world applications. The so-called matrix completion problem was originally formulated as estimating a matrix with real-valued entries, subjecting to the data fit constraint [2, 3]. However, in many problems entries are categorical (e.g., in recommender systems the ratings take integer values 1 to 5, or in health care applications, where the results are positive, negative or uncertain.) A better formulation for these scenarios would be categorical matrix completion.

In this paper, we consider categorical matrix completion by extending the formulation of one-bit matrix completion [1] to deal with categorical entries and adopt the proof techniques to obtain upper and lower bounds. Assume the input variables form a low-rank matrix , and we observe partial entries of a matrix which are categorical responses of the underlying low-rank matrix. A new problem arises in the categorical setting is to choose appropriate link functions ’s that map entries of to entries of the observed matrix . We consider multinomial logistic regression link functions, which are smooth and they are easy to construct for an arbitrary number of categories. We consider a nuclear norm regularized maximum likelihood estimator with a likelihood function for categorical distribution (different from the Bernoulli distribution used in the one-bit case). To obtain theoretical upper and lower bounds, we introduce new conditions taking in account of the characteristics of the categorical distribution. Our upper and lower bounds match up to a factor that is on the order of the square root of the number of categories. Finally, we compare the performance of our method with the convention matrix completion method on the MovieLens dataset.

As mentioned, a closely related work is one-bit matrix completion [1], where the matrix entries are binary valued and therein the authors establish theoretical upper and lower bounds for the mean squared error of the recovered matrix which demonstrates the optimality of the estimator. Recently, [4] considers matrix completion over finite alphabet with a nuclear norm regularization, and considers a more general sampling model that only requires knowledge about an upper bound for the entries of the matrix; a theoretical upper bound is given therein, which has a faster convergence rate than that in [1]; there is no theoretical lower bound though. A more recent work [5] provides a lower bound for the special case when . Other related work on matrix completion with quantized entries or Poisson observations include [6, 7, 8, 9, 10].

The rest of this paper is organized as follows. Section II sets up the formalism for categorical matrix completion and the nuclear norm regularized maximum likelihood estimator. Section III establishes the upper and lower bounds for the recovery error. Section IV presents an numerical example using the MovieLens dataset to demonstrate the performance of our method. All proofs are delegated to the appendix111Full version of the paper can be downloaded from
www2.isye.gatech.edu/yxie77/Categorical-MC-CAMSAP.pdf
.

The notation in this paper is standard. In particular, ; is the indicator function for an event ; denotes the number of elements in a set . Let entries of a matrix be denoted by or , be the spectral norm which is the largest absolute singular value, be the Frobenius norm, be the nuclear norm which is the sum of the singular values and finally = be the infinity norm. Let denote the rank of a matrix . The inner product for two matrices and is denoted by . Given the number of categories and any set , we say that a random variable satisfies the categorical distribution with the parameters if and for all . Also define the Kullback-Leibler (KL) divergence between two categorical distributions with parameters and as

 D((p1,p2,…,pK)∥(q1,q2,…,qK))≜K∑k=1pklogpkqk,

and define their Hellinger distance as

 d2H((p1,p2,…,pK),(q1,q2,…,qK))≜K∑k=1(√pk−√qk)2.

## Ii Formulation

Suppose we make noisy observations of a matrix on an index set . The indices are randomly selected with , or, equivalently, the indicator functions are i.i.d. Bernoulli random variables with parameter . Assume that the observed entries take one of the possible values: . Given a set of differentiable link functions , that satisfy , we have that the noisy observations follow the categorical distribution:

 Yij=ak with probability fk(Mij) for (i,j)∈Ω. (1)

Our goal is to recover from the categorical observations and further filling the missing entries using the link functions and the entries of recovered matrix . This is done by letting , for all , where .

The following are two simple illustrative examples for link functions. In a -categorical recommender system, there are possible ratings, and the matrix with entries is the true rating matrix of users for items. Suppose users can be in three possible moods: good, normal, and bad. The link function characterizes the bias of a user and we can observe a subset of biased ratings. Suppose a user tends to rate an item one category lower than the truth in a bad mood, and one category higher than the truth in a good mood, with the probabilities of being in bad, normal, and good mood being 0.2, 0.6, and 0.2, then the link functions are given by

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩f1(1)=0.8;f1(2)=0.2;f1(x)=0,otherwise.% fk(k−1)=0.2;fk(k)=0.6;fk(k+1)=0.2;fk(x)=0,otherwise, k=2,…,K−1;fK(K−1)=0.2;fK(K)=0.8;f1(x)=0,otherwise.

The second example is the widely used proportional-odds cumulative logit model, or multinomial logistic model [11], where

 fk(x)∝eαk+βkx,k=[K], (2)

and . Here and are parameters of the model that are given (or obtained from a training stage).

In addition, we make assumptions for the matrix to be recovered. First, we assume an upper bound for to entail the recovery problem is well posed. Second, similarly to the conventional matrix completion, we assume that the nuclear norm of the matrix is bounded . This assumption can be viewed as a relaxation of and rank [1], since and lead to .

To estimate , we consider the following nuclear norm regularized maximum log-likelihood formulation. In our case, the log-likelihood function is given by

 FΩ,Y(X)≜∑(i,j)∈ΩK∑k=1I[Yij=ak]log(fk(Xij)).

Based on the assumptions above, we consider a set of candidate estimators:

 S≜{X∈Rd1×d2+:∥X∥∗≤α√rd1d2,−α≤Xij≤α,∀(i,j)∈[d1]×[d2]}, (3)

and recover by solving the following optimization problem

 ˆM=argmaxX∈SFΩ,Y(X). (4)

This problem is convex and it can be solved exactly by the interior-point method [12] or approximately by the efficient singular value thresholding method [13].

## Iii Performance bounds

To establish our performance bounds, we make the following assumptions on the link functions . Define for any and a region

 L(k)α≜sup|x|≤α|f′k(x)|fk(x),βα(x)≜max1≤k≤K(f′k(x))2fk(x).

Assume that (1) there exists a positive constant such that

 max1≤k≤KL(k)α≤Lα. (5)

The interpretation of this assumption is that the function does not change sharply when it is near the boundaries of the region; and (2) there exist two positive constants and such that

 inf|x|≤αβα(x)≥β−αandsup|x|≤αβα(x)≤β+α. (6)

This lower bound for means that for every fixed , there exists at least one such that does not change too slowly. Another interpretation for the assumption on the lower bound is that ’s overlap moderately so that we may determine the category uniquely for a given . The interpretation of upper bound for is similar to that for the upper bound . When , these assumptions coincide with those in [1].

Many link functions satisfy the previous two assumptions, including the widely used multinomial logistic model (2). Fig. 1 illustrates one such example of link functions where and .

Furthermore, define the average Hellinger distance and KL divergence for entries of two matrices as:

 d2H(f(P),f(Q))=1d1d2∑i,jd2H(f(Pij),f(Qij)),D(f(P)∥f(Q))=1d1d2∑i,jD(f(Pij)∥f(Qij)). (7)

The following two lemmas are needed to prove the upper bound. To use the contraction principle in Lemma 1, we introduce a function

 ¯FΩ,Y(X)=FΩ,Y(X)−FΩ,Y(0). (8)
###### Lemma 1.

Let be the likelihood function defined in (8) and be the set defined in (3), then

 P{supX∈S∣∣¯FΩ,Y(X)−E¯FΩ,Y(X)∣∣≥C′KLαα√r⋅(√m(d1+d2)+d1d2log(d1d2))}≤Cd1d2, (9)

where and are absolute positive constants and the probability and the expectation are both over and .

###### Lemma 2.

For and both in , , we have

 d2H(f(M),f(ˆM))≥β−α4∥M−ˆM∥2Fd1d2.

Our main results are the upper bound for the average mean square error per-entry in Theorem 1, and an information theoretic lower bound in Theorem 2:

###### Theorem 1 (Upper bound).

Assume , and is chosen at random following the binomial sampling model with . Suppose that is generated as in (1). Let and be as in (5) and (6). Let be the solution to (4). Then with a probability exceeding , we have

 1d1d2∥M−ˆM∥2F≤C′αKLαβ−α⋅√r(d1+d2)m√1+(d1+d2)log(d1d2)m. (10)

If then (10) simplifies to

 1d1d2∥M−ˆM∥2F≤√2C′αKLαβ−α√r(d1+d2)m. (11)

Above, are absolute constants.

###### Remark 1.

The ratio depends on the number of categories implicitly though the maximization of the smoothness of the functions .

###### Remark 2.

For a fixed , we can construct function ’s such that the ratio is less than some absolute constant for any given . In other words, we may be able to choose the link functions such that the upper bound is independent of the number of categories. Therefore, how to choose that satisfies the classification requirement as well as minimizing this ratio becomes important. Examining the first inequality in (14), a good choice for should be that for any , , there exists at most one such that . Fortunately, such is not hard to construct and one such example is the multinomial logistic model, as demonstrated in Fig. 1.

###### Remark 3.

Given , and , the mean squared error per entry in (11) tends to with probability as the dimensions of the matrix goes to infinity and . In other words, one can recover accurately with a sufficiently large number of observations.

The following lemmas are used in proving the lower bound.

###### Lemma 3 (Lemma A.3 in [1]).

Let be as in (3). Let be such that is an integer. Suppose , then we may construct a set of size

 |χ|≥exp(rd216γ2)

with the following properties: (1) for all , each entry has ; and (2) for all ,, ,

###### Lemma 4.

Given categories, the KL divergence for two categorical probability distributions with parameter and , is upper-bounded by

 D((x1,…,xK)∥(y1,…,yK))≤K−1∑k=1[(xk−yk)2+(xkyk−x2k)(1−yK)      +(xkyk−y2k)(1−xK)]/[yk(1−K−1∑i=1yi)].

The following theorem shows the existence of a worst case scenario in which no algorithm can reconstruct the matrix with an arbitrarily small error:

###### Theorem 2 (Lower bound).

Fix , , , and to be such that , and . Assume that each is decreasing, and its derivative is increasing in for all , and . Let be any subset of with cardinality . Let be as in (1) and be as in (6). Consider any algorithm which, for any , returns an estimator . Then there exists such that with probability at least ,

 1d1d2R(M,ˆM)≥min⎧⎨⎩C1,C2α√Kβ+α√rmax{d1,d2}m⎫⎬⎭, (12)

as long as the right-hand side of (12) exceeds , where are absolute constants.

###### Remark 4.

The ratio between the upper bound and lower bound is proportion to . However, if we carefully construct as in Remark 2 so that is less than some absolute constant, the gap between the upper and the lower bound can be reduced to a factor that is on the order of .

## Iv Numerical examples

To test the performance of the regularized maximum estimator on real data, we consider the MovieLens dataset (can be downloaded at http://www.grouplens.org). The dataset contains movie ratings from users can be viewed as a -by- matrix whose entries take value . We first randomly select ratings to fit a multi-nomial logit regression model, and then solve the optimization problem (4) using another randomly selected ratings as observed entries. Finally, we use the remaining ratings as test data and compute the average difference between the true rating and the predicted ratings based on the recovered matrix, as shown in the first row of Table I. We compare the result with that obtained from the conventional matrix completion by rounding the recovered entries to . The results are shown in the second row of Table I.

The results show that our method performs better when the original rating is larger than , and has better overall performance. This can possibly be explained by the following reasoning. Our multinomial logistic model is fitted using the training data, and in the MovieLens dataset there are relatively few ratings with values 1, 2 and 3. Therefore, the fitted link functions for , have much less accuracy than and . Hence, we can see that the categorical matrix completion performs better when the true rating is 4 and 5.

## V Discussions

We have studied a nuclear norm regularized maximum likelihood estimator for categorical matrix completion, as well as presented an upper bound and an information theoretic lower bound for our proposed estimator. Our upper and lower bounds meet up to a constant factor where is the fixed number of categories, and this factor can become in some special cases. Our current formulation assumes that the input variables form a low-rank matrix and each response is linked only to one corresponding input variable. Future extension may include a formulation that allows more general link functions with multiple input variables and exploits a low-rank tensor structure.

## Acknowledgement

The authors would like to thank Prof. Mark Davenport for stimulating discussions. This work is partially supported by NSF grant CCF-1442635.

## References

• [1] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.
• [2] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics (FOCS), vol. 9, no. 6, pp. 717–772, 2009.
• [3] E. J. Candes and Y. Plan, “Matrix completion with noise,” Proc. IEEE, vol. 98, no. 6, pp. 925–936, 2010.
• [4] O. Klopp, J. Lafond, E. Moulines, and J. Salmon, “Adaptive multinomial matrix completion,” arXiv preprint arXiv:1408.6218, 2014.
• [5] J. Lafond, O. Klopp, E. Moulines, and J. Salmon, “Probabilistic low-rank matrix completion on finite alphabets,” in Advances in Neural Information Processing Systems, pp. 1727–1735, 2014.
• [6] T. Cai and W.-X. Zhou, “A max-norm constrained minimization approach to 1-bit matrix completion,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3619–3647, 2013.
• [7] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Noisy matrix completion under sparse factor models,” arXiv:1411.0282, 2014.
• [8] A. Soni and J. Haupt, “Estimation error guarantees for poisson denoising with sparse and structured dictionary models,” in IEEE Int. Symp. Info. Theory (ISIT), pp. 2002–2006, IEEE, 2014.
• [9] Y. Cao and Y. Xie, “Poisson matrix recovery and completion,” arXiv preprint arXiv:1504.05229, 2015.
• [10] J. Lafond, “Low rank matrix completion with exponential family noise,” arXiv:1502.06919, 2015.
• [11] A. Agresti and M. Kateri, Categorical data analysis. Springer, 2011.
• [12] Z. Liu and L. Vandenberghe, “Interior-point method for nuclear norm approximation with application to system identification,” SIAM J. Matrix Analysis and Applications, vol. 31, no. 3, pp. 1235–1256, 2009.
• [13] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
• [14] M. Ledoux and M. Talagrand, Probability in Banach Spaces: isoperimetry and processes, vol. 23. Springer, 1991.
• [15] D. Pollard, A User’s guide to measure theoretic probability, vol. 8. Cambridge University Press, 2002.
• [16] B. Yu, “Assouad, Fano, and le cam,” in Festschrift for Lucien Le Cam, pp. 423–435, Springer, 1997.

## Proofs

###### Proof of Lemma 1.

In order to prove the lemma, we let are i.i.d. Rademacher random variables. In the following derivation, the first inequality uses the Radamacher symmetrization argument (Lemma 6.3 in [14]) and the second inequality is due to the power mean inequality: if and . Then we have

 (13)

where the expectation are over both and .

In the following, we will use contraction principle to further bound the first term of (13). By the definition of , we know that

 1L(k)αlogfk(x)fk(0)

are contractions that vanish at for all . By Theorem 4.12 in [14] and using the fact that , we have

 E[supX∈S∣∣¯FΩ,Y(X)−E¯FΩ,Y(X)∣∣h]≤2hKh−1K∑k=1(2L(k)α)hE⎡⎢⎣supX∈S∣∣ ∣∣∑i,jϵijI[(i,j)∈Ω]Xij∣∣ ∣∣h⎤⎥⎦≤(4K)h(max1≤k≤KL(k)α)hE[supX∈S∥E∘ΔΩ∥h∥X∥h∗]
 ≤(4K)h(Lα)h(α√rd1d2)hE[∥E∘ΔΩ∥h],

where denotes the matrix with entries given by , denotes the indicator matrix for and denotes the Hadamard product.

To bound , we can use the result from [1] if we take :

for some constant .

Moreover when ,

 C0(8(1+√6)C′)log(d1d2)≤C0d1d2

Therefore we can use Markov inequality to see that

 P{supX∈S∣∣FΩ,Y(X)−EFΩ,Y(X)∣∣ ≥C′K(α√r)Lα⋅ (√m(d1+d2)+d1d2log(d1d2))} = P{supX∈S∣∣FΩ,Y(X)−EFΩ,Y(X)∣∣h ≥(C′K(α√r)Lα⋅ (√m(d1+d2)+d1d2log(d1d2)))h} ≤ E[supX∈S∣∣FΩ,Y(X)−EFΩ,Y(X)∣∣h]/ {(C′K(α√r)Lα⋅ (√m(d1+d2)+d1d2log(d1d2)))h}≤Cd1d2,

where and are absolute constants.

###### Proof of Lemma 2.

Assuming is any entry in and is any entry in , then and by the mean value theorem there exists for each such that

 √fk(x)−√fk(y)=f′k(ξk)2√fk(ξk)(x−y).

By the assumption of , there exist at least one such that . Then

 d2H((f1(x),…,fK(x))∥(f1(y),…,fK(y)))=K∑k=1(√fk(x)−√fk(y))2=(x−y)24K∑k=1(f′k(ξk))2fk(ξk)
 ≥(x−y)24(max1≤k≤K(f′k(ξk))2fk(ξk))≥(x−y)24(inf|ξ|≤α(max1≤k≤K(f′k(ξ))2fk(ξ))) (14)

Then the lemma is proven by summing across all entries and dividing by .

###### Proof of Lemma 4.

Let for each , then

 D((x1,…,xK)∥(y1,…,yK))=D((x1,…,xK)∥(x1+z1,…,xK+zK))=K∑k=1xklogxkxk+zk.

And then we have for each

 ∂D((x1,…,xK)∥(x1+z1,…,xK+zK))∂zk=−xkxk+zk.

By mean value theorem, we have

 (15)

for some . Since for each

 (−xkzkxk+czk)′=xkz2k(xk+czk)2≥0,

the right-hand side of (15) is an increasing function in and hence

 D((x1,…,xK)∥(y1,…,yK))≤K∑k=1xkyk(xk−yk).

Noting that and , we have

 K∑k=1xkyk(xk−yk)=K−1∑k=1(xkyk−1−∑K−1i=1xi1−∑K−1i=1yi)(xk−yk)=K−1∑k=1[(xk−yk)2+xkyk(2−xK−yK)−x2k(1−yK)−y2k(1−xK)]/[yk(1−K−1∑i=1yi)]
 =K−1∑k=1[(xk−yk)2+(xkyk−x2k)(1−yK)+(xkyk−y2k)(1−xK)]/[yk(1−K−1∑i=1yi)], (16)

where the last inequality uses the fact that .

###### Proof of Theorem 1.

First, note that

 ¯FΩ,Y(X)−¯FΩ,Y(M)=∑(i,j)∈ΩK∑k=1I(Yij=ak)logfk(Xij)fk(Mij).

Then for any ,

 (17) =−mD(f(M)∥f(X)).

For , we know and . Thus we write

 0 ≤FΩ,Y(ˆM)−FΩ,Y(M)=¯FΩ,Y(ˆM)−¯FΩ,Y(M) =¯FΩ,Y(ˆM)+E¯FΩ,Y(ˆM)−E¯FΩ,Y(ˆM) +E¯FΩ,Y(M)−E¯FΩ,Y(M)−¯FΩ,Y(M) ≤E[¯FΩ,Y(ˆM)−¯FΩ,Y(M)]+ ∣∣¯FΩ,Y(ˆM)−E¯FΩ,Y(ˆM)∣∣+∣∣¯FΩ,Y(M)−E¯FΩ,Y(M)∣∣ ≤−mD(f(M)∥f(ˆM))+2supX∈S∣∣¯FΩ,Y(X)−E¯FΩ,Y(X)∣∣.

Applying Lemma 1, we obtain that with probability at least ,

 0≤−mD(f(M)∥f(ˆM))+2C′K(α√r)Lα⋅(√m(d1+d2)+d1d2log(d1d2)).

After rearranging terms and applying the fact that , we obtain

 D(f(M)∥f(ˆM))≤2C′K(α√r)Lα⋅⎛⎝√d1+d2m√1+(d1+d2)log(d1d2)m⎞⎠. (18)

Note that the KL divergence can be bounded below by the Hellinger distance (Chapter 3 in [15]): Thus from (18), we obtain

 d2H(f(M),f(ˆM))≤2C′K(α√r)Lα⋅   ⎛⎝√d1+d2m√1+(d1+d2)log(d1d2)m⎞⎠. (19)

Finally, Theorem 1 is proved by applying Lemma 2. ∎

###### Proof of Theorem 2.

We will prove by contradiction. Lemma 3 and Lemma 4 are used in the proof. Without loss of generality, assume . Choose such that

 ϵ2=min{11024,C2α√Kβ+α√rd2m},

where is an absolute constant that will be be specified later. First, choose such that is an integer and

 4√2ϵα≤γ≤8ϵα≤14α

We may make such a choice because

 α2r64ϵ2≤rγ2≤α2r32ϵ2

and

 α2r32ϵ2−α2r64ϵ2=α2r64ϵ2>4α2r>1.

Furthermore, since we have assumed that is larger than , for an appropriate choice of . Let be the set defined in Lemma 3, by replacing with and with this choice of . Then we can construct a packing set of the same size as by defining

 χ≜{X′+α(1−γ2)1% d1×d2:X′∈χ′α/2,γ}.

The distance between pairs of elements in is bounded since

 ∥X(i)−X(j)∥2F≥α24γ2d1d22≥4d1d2ϵ2. (20)

Define , then every entry of has . Since we have assumed , for every , we have

 ∥X∥∗ =∥X′+α(1−γ2)1d1×d2∥∗ ≤∥X′∥∗+α(1−γ2)√d1d2 ≤α2√rd1d2+α√d1d2≤α√rd1d2,

for some . Since the we choose is less than , is greater than . Therefore, from the assumption that , we conclude that .

Now consider an algorithm that for any returns such that

 1d1d2∥X−ˆX∥2F<ϵ2 (21)

with probability at least